
Big Data, Big Prediction? – Looking through the Predictive Window into the Future

Lithium Alumni (Retired)

Michael Wu, Ph.D. is Lithium's Principal Scientist of Analytics, digging into the complex dynamics of social interaction and group behavior in online communities and social networks.


Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics and its application to Social CRM. He's a regular blogger on the Lithosphere's Building Community blog and previously wrote in the Analytic Science blog. You can follow him on Twitter or Google+.



Last time I said I was going to write about the big data processing pipeline. However, I decided to put that off until later, mainly because a couple of weeks ago I was interviewed by USA Today on whether social media sentiment can predict an election outcome. As it turns out, the raw Twitter sentiment data from Attensity only predicted the election outcome half right. So I thought it was timely to comment on the predictive power of big data.


Analysis of Sentiment Data for Each Presidential Candidate

During this interview, I pulled some data from our own social media monitoring (SMM) platform and did a simple analysis of the public sentiment data on the social web (which includes blogs, Twitter, forums, and news) for the top four Republican candidates (Mitt Romney, Newt Gingrich, Rick Santorum, and Ron Paul) and the incumbent, President Barack Obama.


  1. Our platform estimated the entity level sentiment for each mention of the candidate.
  2. Our SMM platform automatically aggregated this raw sentiment data by day, for positive, negative, and neutral mentions. So we have total positive, neutral and negative mentions by day.
  3. I looked at the daily sentiment variation for each candidate over the last 6 months and determined the window over which the sentiment is stable and therefore predictive. I found this window is about 1.5 to 2 weeks. That means I can only use about two weeks of sentiment data for prediction. Using more data not only doesn’t help, it may be counterproductive. That is, using too much data could actually reduce your prediction accuracy. More data is not always a good thing!
  4. I computed the net sentiment by taking the positive mentions minus the negative mentions for each candidate over the two week period. It can be very misleading to examine only the positive sentiment since that is a biased and incomplete reflection of the public sentiment, so we must take into account the negative sentiment in our analysis.
  5. Likewise, we should probably take advantage of the large amount of neutral sentiment data on the social web too. There are many neutral mentions we haven't used yet. To make use of this data, I simply weighted each neutral mention by 1/10 and added the result to the net sentiment computed above.
  6. The result is normalized to 100% and displayed via a pie chart (Figure 1). 
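To make steps 4 through 6 concrete, here is a minimal Python sketch. The daily counts and candidate names below are made-up illustrative numbers, not the actual SMM data; only the 1/10 neutral weight comes from step 5.

```python
# Sketch of steps 4-6: net sentiment with a 1/10 neutral weight,
# normalized to shares that sum to 100%.
# The counts below are hypothetical, for illustration only.

NEUTRAL_WEIGHT = 0.1  # each neutral mention counts as 1/10 of a positive one

def net_sentiment(pos, neg, neu):
    """Net sentiment over the ~2-week predictive window."""
    return (pos - neg) + NEUTRAL_WEIGHT * neu

# totals over the two-week window: (positive, negative, neutral)
candidates = {
    "Candidate A": (12000, 9000, 40000),
    "Candidate B": (8000, 7500, 25000),
    "Candidate C": (5000, 4800, 15000),
}

scores = {name: net_sentiment(*counts) for name, counts in candidates.items()}
total = sum(scores.values())
shares = {name: 100 * s / total for name, s in scores.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```

The normalized shares are what would feed the pie chart in Figure 1.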


I was told that these data (now 2 weeks old, which makes them irrelevant today) actually lined up with the Gallup Poll nicely. I was so excited that I went and looked up the Gallup data (see Figure 2), even though I hadn't been following the election closely.


To objectively quantify how well my analysis is able to predict the Gallup poll, I’ve computed the correlation coefficient between my prediction and Gallup data. To my great surprise, my simple analysis yields a predictive correlation coefficient of 0.965. That means this analysis is able to explain 93.11% of the variance in Gallup data. This is a superb result even though the model is overly simplistic.
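For readers who want to reproduce this kind of check, the correlation and variance-explained computation is straightforward. The five numbers below are purely illustrative stand-ins, not the actual prediction or Gallup figures:

```python
import numpy as np

# Hypothetical shares for five candidates: predicted vs. polled.
# These values are illustrative, not the actual SMM/Gallup data.
predicted = np.array([30.0, 27.0, 18.0, 15.0, 10.0])
gallup    = np.array([31.0, 25.0, 19.0, 14.0, 11.0])

# Pearson correlation between prediction and poll
r = np.corrcoef(predicted, gallup)[0, 1]

# r squared = fraction of poll variance explained by the prediction
r_squared = r ** 2

print(f"r = {r:.3f}, variance explained = {100 * r_squared:.2f}%")
```

Squaring the correlation coefficient is what turns the 0.965 figure into the "variance explained" quoted above.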


But what does this really mean? Can social media sentiment really predict election outcome?


The Right Question to Ask: What is the Predictive Window?

Social media sentiment can definitely be used to predict election outcomes! Studies have shown that Twitter mood data can even predict the stock market. However, whether social media data can predict an election outcome is actually the wrong question to ask.


The important question to ask when doing any predictive analysis is “how far into the future is the prediction valid?” We call the period over which the prediction is still fairly accurate the “predictive window.”


Let’s look at a more familiar example of weather prediction. You can certainly use data collected today from all the meteorological instruments out there to predict the weather, but the prediction is only accurate for a short period of time, typically a few days. So the predictive window of your meteorological data is a few days. You could try to use this data to predict the weather one month from now, but it just wouldn’t be accurate. In fact, it’s so inaccurate that you might as well take a random guess. So trying to predict anything beyond the predictive window of your data is, pretty much, useless.


In our example, even though social media sentiment can be used to predict election outcomes, it can't predict with any accuracy beyond a window of 1.5 weeks. If the election takes place a day after the sentiment data were collected, then it is possible to predict the election outcome after some serious human analysis. However, if the election takes place 2 weeks later, then these data would no longer be able to predict the outcome with any accuracy, no matter how much we analyze them. This kind of behavior is very common in non-stationary systems.


You may ask "why?" The reason is that sentiment is a point-in-time measure. It can change rapidly from day to day. I may have loved candidate #1 (say Obama) yesterday and tweeted about it, but after watching his debate today, my sentiment may change or reverse completely. So my tweet from yesterday is completely irrelevant with respect to my candidate preference today.


So the important question is not “whether social media data can predict election outcome?” It definitely can. The right question to ask is “how long is the predictive window?” For something that changes very quickly like the financial market, the predictive window will be very short. For things that do not change as fast, the predictive window will be longer. For social media sentiment data, the predictive window for election forecasting is about 1.5 to 2 weeks. If you want to be conservative, you can use 1 week.



When you are doing any predictive analytics, you are really trying to peek into the future through the predictive window of your data. If you try to look outside of this window, your future will look very blurry; so blurry that you can’t make anything out of it with any certainty.


Even within this predictive window, your view is still limited by the power of your statistical model and the noise inherent in the data. Because of this, you often can't see very far into the future. Although you can sometimes stretch this window a little by using more powerful statistical and machine learning methods, doing so is often impractical and offers diminishing returns.

The scientific community has been interested in prediction ever since the scientific method was developed. However, predictive analytics has proven to be a very challenging subject in mathematical statistics and probability theory. It is not a problem that can be addressed simply with more advanced technology. That means it doesn't matter how much computing power you have (even with quantum computers and holographic storage), there are theoretical limits to what you can, and cannot, predict.


Next time we'll talk about why this predictive window is so important. And I will use our Super Tuesday sentiment data as an example to illustrate how we can improve the visibility within this prediction window.


BTW, I’ll be teaching at the Rotman Executive CRM program again this year with Paul Greenberg and Ray Wang. So I’ll be on the University of Toronto campus April 17-19. Alright, see you next time. Stay tuned for more on election campaign analytics...



About the Author
Dr. Michael Wu was the Chief Scientist at Lithium Technologies from 2008 until 2018, where he applied data-driven methodologies to investigate and understand the social web. Michael developed many predictive social analytics with actionable insights. His R&D work won him the recognition as a 2010 Influential Leader by CRM Magazine. His insights are made accessible through “The Science of Social,” and “The Science of Social 2”—two easy-reading e-books for business audience. Prior to industry, Michael received his Ph.D. from UC Berkeley’s Biophysics program, where he also received his triple major undergraduate degree in Applied Math, Physics, and Molecular & Cell Biology.
Not applicable

I was wondering when you started capturing the data, and how many days of results you have compared with the Gallup poll, since it doesn't mean much if you only compared two weeks of data with Gallup.

dave campbell
Not applicable

Interesting material and an important concept - the validity of predictions.


I think it might also be useful to consider sentiment trends and relative stability, in order to gauge the window with a little more confidence.


The basic idea is that wildly swinging sentiments would not hold as long as more stable sentiments.


Great idea to include the good, the bad, and the neutral in the metric. It seems the absolute number would also be some sort of indicator. Combining that with the amount of movement feels like a velocity or momentum indicator.



GaryNorth
Occasional Commentator

"Ron Paul (R-TX), as usual, was the most discussed member on Twitter with over 100,000 mentions of him by fellow Tweeters. Due to his national presence as a Presidential hopeful, Paul generates far more discussion than his counterparts in the House and Senate" -- picked up from here.


In data analytics, data is king; whoever has the bigger amount of data is king. As we may see, this data source may miss some important information 😉

GaryNorth
Occasional Commentator

After all I really like this article, especially the second part showing up the limits.


Just a note that one of the candidates in this analysis, Ron Paul, follows Austrian economics, whose foundations rest on exactly the fact that all efforts to predict the future are limited, as in economics generally. The Austrian tradition therefore says it is even dangerous to use such predictive models for managing the economy, as the central banking system does.


In fact, Austrians use praxeology to study human action, which in my understanding can also be studied by modern analytics techniques. Praxeology is an interesting a priori reasoning technique, and most people use it that way to predict the near future, as Ron Paul did when he predicted the financial problems based on boom-bust business cycle theory.

Lithium Alumni (Retired)

Hello Robert,


Thank you for the comment and sorry for the late reply. I’ve been traveling quite a bit for conferences.


The prediction is estimated from two weeks of data, because that is how long the data is predictive of the current Gallup poll (current as of when I analyzed the data), since that was determined to be the predictive window. If I wanted to predict earlier poll data, I would need the two weeks of data leading up to the date we are trying to predict. That is because public sentiment changes from week to week, and data older than two weeks is not predictive, so including it will actually hurt your prediction.


This is actually quite an important concept, so I plan to write more about this in the next blog. If you are interested, please come back in a week or two. Stay tuned!


Hope this addresses your question. If not, I hope the next blog post will.

Thanks for commenting.

And hope to see you next time.

Lithium Alumni (Retired)

Hello Dave (and Robert, I guess this is relevant for you too),


Thank you for the comment and glad you find the “validity of prediction” concept interesting. The predictive window is actually a very important concept, and we will revisit it later.


What you said is true. Wildly swinging sentiments would not hold as long as more stable ones, but the problem is determining quantitatively and precisely how long they hold for a given level of swing. It also depends on the type of swing. If there is any periodicity, then it doesn't matter how big the swing is, since it is predictable from history. So it is not as simple as choosing a window that you can fit a linear model to.


The method I used to determine the predictive window is time series analysis. Basically, I fit a couple of time series models, (1) a simple ARMA model and (2) an ARIMA model, and looked at how long the auto-regressive window is. These time series analyses take care of most of the variation due to trends, variance, and periodicity (i.e., seasonality), and quite a bit more of the deterministic characteristics of the time series, even when they look quite random. I could have done a more sophisticated analysis that accounts for heteroskedasticity as well, but the ARMA and ARIMA results were pretty consistent, so I didn't bother using more complex models to determine the predictive window. I just didn't expect a more complex model would yield a different answer.
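To give a flavor of what "how long is the auto-regressive window" means, here is a minimal sketch that uses a plain autocorrelation cutoff on a synthetic series. This is a simplification of the ARMA/ARIMA fits described above, and every number in it (the AR(1) coefficient, the series length, the cutoff) is illustrative, not from the real analysis:

```python
import numpy as np

# Estimate a rough "predictive window" from a daily series: find the last
# lag at which the autocorrelation is still meaningfully above chance
# (using the common 2/sqrt(N) significance bound).

rng = np.random.default_rng(0)

def autocorr(x, lag):
    """Sample autocorrelation of x at the given positive lag."""
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

def predictive_window_days(series, max_lag=30):
    n = len(series)
    cutoff = 2 / np.sqrt(n)          # approximate 95% significance bound
    for lag in range(1, max_lag + 1):
        if autocorr(series, lag) < cutoff:
            return lag - 1           # last lag still above the cutoff
    return max_lag

# Synthetic daily sentiment: a slowly decaying AR(1) process over ~6 months,
# standing in for the real daily sentiment counts.
n_days = 180
phi = 0.85
noise = rng.normal(size=n_days)
series = np.empty(n_days)
series[0] = noise[0]
for t in range(1, n_days):
    series[t] = phi * series[t - 1] + noise[t]

print("estimated predictive window:", predictive_window_days(series), "days")
```

A proper ARMA/ARIMA fit (e.g., via a time series library) also models trend and seasonality, which this cutoff heuristic ignores.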


Anyway, that is the more technical part of the analysis that I didn't write up for the general audience. If you want more detail, please let me know. I'd be happy to explain in greater detail, but you will probably need to understand some basic time series analysis first.


Alright, thanks for the comment, and see you later.


Lithium Alumni (Retired)

Hello Gary,


Thank you for your comments, both of them  😉

So I’m just going to write 1 reply to address both of your comments.


Your 1st comment first. Although data is king, it doesn't mean more data (i.e., more mentions) is indicative of winning. We do see that Ron Paul has the most mentions, but that is completely irrelevant in predicting who is going to win. That is what I meant in my earlier post, where I talked about how the first step in most data analysis is identifying the relevant data. Sorry to say this, but in predicting election results, sheer volume simply doesn't matter at all. Moreover, we have more than just Twitter data; we have the blogosphere, forum discussions, news, etc.


To understand why volume doesn't matter, imagine the following hypothetical extreme case: even if Ron Paul has the most mentions, if everyone says they hate him, then clearly he's not going to be king.  😉


In fact, we even see that Ron Paul has the most positive mentions cumulatively (see figure below). That is more relevant data, but still not sufficient to predict the election outcome. We have to do all the analysis described in this post to get a moderately decent prediction. I was just surprised that it matched the Gallup poll so well.

[Figure: cumulative positive mentions for each candidate]


Your 2nd comment is correct. No prediction will remain accurate indefinitely. That is why it is so important to determine the predictive window of your data before you actually do the predictive modeling. I actually don't know about praxeology, but I do believe in rigorous statistical modeling and validation, so I am pretty agnostic about the precise model people use. As long as you understand the limitations of your model and validate it, it's fine. There is usually more than one model that will give you an accurate prediction. In fact, there are an infinite number of models we could use to model any data set, but of course trying all of them is impractical. In practice, there are usually many models that do pretty much the same thing, each just making slightly different assumptions about the data.


Anyway, thanks for taking the time to comment. I will take a look into praxeology next time to understand it more.

Next post, I will continue the analysis on using sentiment data to predict election outcomes.

Hope to see you next time. 

Cohan Sujay Carlos of Aiaioo Labs
Not applicable

Dr. Wu, there has been some prior work on the same lines.  


There's a paper "From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series" by Brendan O’Connor et al from Tepper School of Business, CMU that analyzed political and consumer data.  


However, they only reported correlations as high as 80%: "While our results vary across datasets, in several cases the correlations are as high as 80%, and capture important large-scale trends."


Lithium Alumni (Retired)

Hello Cohan,


Thank you for commenting and citing the reference here. I actually didn't know about this paper, so I will most certainly take a look and see what they did exactly.


Yes, CMU does a lot of interesting things. And 80% predictive correlation is pretty good. Imagine what you can do with 80% predictive correlation if you were predicting the stock market.  😉  However, I will write more about how we can improve the analysis for predicting election outcome next time.


Alright, thanks again for the comment. 

And I hope you will return next time.


Not applicable

First of all congrat for your post.

I have two questions.

1. Using an ARIMA model, or for that matter any such model, you rely on historical data to estimate the predictive window. However, especially in politics, events take place that drastically alter the validity of historical data. So the famous black swan effect might apply here.

2. You use sentiment analysis at the entity-level. We know that this is a difficult task with a large error rate. Could you elaborate a bit on the technology that you use and its success? Deep NLP, surface parsing?

Thank you in advance.


Lithium Alumni (Retired)

Hello Jason,


Thank you for the comment and for asking such excellent questions.


I'm on the road now, so I will try to address your questions quickly.


1. Yes, most of these time series models are based on historical data. We use the historical volumes of positive, neutral, and negative sentiment to predict the current sentiment volumes, respectively, and we did this for every day of the 6 months of data we have, so the result is averaged across roughly 180+ days. Whatever political events happen and invalidate the historical data basically make the predictive window very short; when the political landscape is calm, the window becomes longer. Averaging over 180+ days gives us the average predictive window length of 1.5-2 weeks, so it should take into account the rate at which political events occur. Of course, if we could determine when an event will occur, that would be even better, but then I'd be working on Wall Street. This is just a very simplistic model. Next time we can talk about how to improve it.


2. The sentiment scoring engine uses a combination of POS tagging and lexical chaining, which is often used in text summarization, to do most of the NLP preprocessing. This gives us a pretty good reference to the entity of interest. I also set up the searches to disambiguate people whose names are similar to the candidates', and to map different names that refer to the same candidate. Then I set up a series of required keywords and relevant keywords to narrow the search down to mentions of entities that are relevant to the presidential campaign. These two steps are done manually. The sentiment is then scored statistically by a trained nonlinear classifier. The success of this approach comes from combining fairly deep NLP with statistical methods.
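To make the manually configured filtering steps concrete, here is a toy sketch. The alias list, keywords, and helper names are all hypothetical; the engine's actual POS tagging, lexical chaining, and trained classifier are not reproduced here.

```python
# Toy sketch of the two manual steps described above: alias disambiguation
# to map mentions to a canonical candidate, plus required keywords to keep
# only mentions relevant to the presidential campaign.
# Alias list and keywords are made up for illustration.

ALIASES = {
    "ron paul": "Ron Paul",
    "mitt romney": "Mitt Romney",
    "romney": "Mitt Romney",     # hypothetical alias entries
}
REQUIRED_KEYWORDS = {"election", "campaign", "primary", "president"}

def resolve_candidate(text):
    """Map a mention to a canonical candidate name, if any."""
    lowered = text.lower()
    for alias, canonical in ALIASES.items():
        if alias in lowered:
            return canonical
    return None

def is_relevant(text):
    """Keep only mentions that also reference the campaign."""
    lowered = text.lower()
    return any(kw in lowered for kw in REQUIRED_KEYWORDS)

posts = [
    "Romney looks strong going into the primary tonight.",
    "Ron Paul gave a talk about economics.",   # no campaign keyword: dropped
]
relevant = [(resolve_candidate(p), p) for p in posts if is_relevant(p)]
print(relevant)
```

Only the surviving, disambiguated mentions would then be passed to the sentiment classifier.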


OK, I hope this helps. And thanks for asking these great questions.

I'm sure a lot of people must have wondered about this too.


I hope you will join our discussion next time as I will continue to talk about how we can improve the model on predicting election outcome. See you next time.


Not applicable

Thanks for your reply. Looking forward to seeing your next post on this topic.

Iason asked you about the sentiment analysis, which is an excellent question. Your sentiment analysis method seems pretty standard to me, and I was wondering about the ROC curve of the result. I have done similar sentiment analysis on tweets: when it comes to comments about products, I can get 80-90% accuracy, but when it comes to political tweets, it is only 50%, which has also been verified in several papers on mining political tweets. I was wondering if your sentiment analysis method is much better than the existing ones. Looking forward to seeing the result. Again, what you have done is excellent work, and thanks for all your comments.

Lithium Alumni (Retired)

Hello Robert,


Thank you for commenting and for asking further detailed questions about our sentiment engine.


In terms of academic research, I would agree that our sentiment engine is pretty standard; we are not using any of the latest and greatest in machine learning research or semantic ontologies. It is just a combination of proven methods in Natural Language Processing (NLP) and Machine Learning (ML), combined with big data technologies (i.e., Hadoop + HDFS) and Lucene search. However, it is competitive in the social media monitoring market. The engine was trained initially a while back, and it has since been retrained only through our clients' usage, so I do not have an ROC curve for you.


I'm not sure what you are trying to predict with political tweets, but I do suspect that's a much harder task than sentiment on products. First, I guess you need to identify tweets that have political sentiment, then do sentiment scoring on those political tweets. Both of these stages are error prone, so the errors tend to compound. In the case of products, it is easier to narrow down to tweets that mention the product, and since this can be done with high accuracy, the error is mainly contributed by the sentiment scoring stage.


I believe one of the reasons for my good prediction is that we do not use only tweets. Tweets are too short to express any serious political sentiment. Most of the more accurate sentiment probably comes from blogs, forums, and news. Also, I find that good predictions often come from narrowing your relevant data down to truly comparable sets: in this case, not just mentions of each candidate, but mentions that are also related to the presidential campaign specifically.


Anyway, thanks for the comment and I hope to see you next time.


Not applicable

Frankly I think trying to use social media data to predict the result of an election is a very poor use of it ;).


There is a much better use of social media: trying to CHANGE the result of an election by influencing the right people to come to the polls.


A few things on data: 

- do you capture whether the same people were positive last week and became negative this week? If you work at the "conversation level," then you really mix up one guy who may say 200 times that he's a fan of Ron Paul with another who says once that he will vote for Romney.

- do you check that the people you get conversations from are actually registered voters? If not, then you may have a huge bias in the data.



Lithium Alumni (Retired)

Hello Dominique,


First of all, I apologize for the late reply. I've been offline for quite some time, because the blogs on the Lithosphere have been reorganized and my blog has just been reinstated.


You are absolutely right. We should use social media to influence voter behavior rather than just predicting election results. That is precisely what I told the reporters (see USA Today article). 


Nevertheless, you made some very good points. There is actually a sequel to this article, "Big Data, BAD Predictions, and How to Improve It?", and a lot of the issues you point out are discussed there. I hope you have a chance to take a look, since the Lithosphere has been undergoing some reconstruction as we speak. If you'd like to discuss prediction science further, please do leave me a comment; I will definitely respond.


See you later.