Khoros Community

Big Data, BAD Predictions, and How to Improve It?

Lithium Alumni (Retired)

Michael Wu, Ph.D. is Lithium's Principal Scientist of Analytics, digging into the complex dynamics of social interaction and group behavior in online communities and social networks.


Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics and its application to Social CRM. He's a regular blogger on the Lithosphere's Building Community blog and previously wrote in the Analytic Science blog. You can follow him on Twitter or Google+.



Alright, just a little announcement about my never-ending speaking engagements before we begin today. I will be speaking tomorrow (March 30th) at SocialTech. I will be talking about how B2B enterprises can leverage the power of gamification, and John Pasquarette from National Instruments (a Lithium client) will be co-presenting with me. SocialTech is happening right now in Seattle, but unfortunately I only have time to fly there tomorrow, speak, and then fly back immediately. Too busy!  😞


Now we can begin. Last time, I showed you the results of a simple analysis I performed on the sentiment data from our social media monitoring (SMM) platform with respect to the presidential candidates for the 2012 election. I was able to demonstrate the predictive power of our data, which explains 93.11% of the data variance in the Gallup data. So why did Attensity’s data only get the election half right? This is an interesting question, so I decided to do a bit more analysis and share my findings here.


Good Data Science Practice: Know the Limit of Your Data

I collected the Super Tuesday results from Huffington Post, and tabulated Attensity’s data from USA Today. The first thing I did was compute the correlation coefficient (cc) between these two sets of data to quantify how well Attensity’s data explains the data variance in the Super Tuesday results. I did this for all states that participated in Super Tuesday, March 6, 2012. The result is shown in red italics below.


[Figure: Super Tuesday results vs. Attensity’s data by state, with the correlation coefficient for each state in red italics]


Clearly, Attensity’s data is able to predict the election outcome in some states (e.g. Idaho, Massachusetts, Ohio, and Georgia), as indicated by the relatively high correlation coefficient (cc = 0.91, 0.99, 0.68, and 0.89 respectively). And where Attensity’s data fails to predict the election outcome, the correlation coefficient is relatively low: Oklahoma (cc = -0.48), Tennessee (cc = 0.41), Vermont (cc = -0.03), and North Dakota (cc = -0.10).


However, there are also cases where Attensity was able to predict the election outcome even though the correlation coefficient is relatively low: Alaska (cc = 0.32) and Virginia (cc = 0.44). What this means is that Attensity’s data really can’t predict the distribution of votes in these states, but it happened to pick the winning candidates coincidentally. In layman’s terms, honestly, it’s just luck.
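To make the "right winner, wrong distribution" point concrete, here is a minimal sketch. The four candidates and their shares are hypothetical numbers chosen for illustration, not Attensity’s or the actual Super Tuesday data, and `pearson` is a plain textbook implementation of the correlation coefficient.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical shares (%) for four candidates in one state; not real data.
candidates = ["A", "B", "C", "D"]
predicted = [40, 10, 30, 20]   # share of positive mentions
actual = [38, 37, 14, 11]      # share of votes

cc = pearson(predicted, actual)
predicted_winner = candidates[predicted.index(max(predicted))]
actual_winner = candidates[actual.index(max(actual))]
print(f"cc = {cc:.2f}, predicted = {predicted_winner}, actual = {actual_winner}")
```

In this toy case the predicted and actual winners agree even though the correlation is close to zero, which is the kind of coincidence described above.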


So how predictive is Attensity’s data set in this prediction exercise?


To address this question, I computed the average correlation coefficient across all 10 states, and the result is cc = 0.403. That means on average Attensity’s data is only able to predict 16.24% of the data variance in the Super Tuesday result.
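The arithmetic can be reproduced from the per-state values quoted above. This is only a sketch of the calculation as described: it squares the mean cc, which is what the 16.24% figure corresponds to; averaging the squared per-state values instead would give a different number.

```python
# Per-state correlation coefficients as quoted in the post.
state_cc = {
    "Idaho": 0.91, "Massachusetts": 0.99, "Ohio": 0.68, "Georgia": 0.89,
    "Oklahoma": -0.48, "Tennessee": 0.41, "Vermont": -0.03,
    "North Dakota": -0.10, "Alaska": 0.32, "Virginia": 0.44,
}

# Average cc across the 10 states, then square it to get variance explained.
mean_cc = sum(state_cc.values()) / len(state_cc)
variance_explained = mean_cc ** 2

print(f"mean cc = {mean_cc:.3f}")
print(f"variance explained = {variance_explained:.2%}")
```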


Now, this is a retrospective analysis, so the computation is relatively simple. But there are ways to estimate the reliability and predictive power of your data ahead of time, using the relative sample size and the intra- to inter-state variance ratio. Although there are many reasons (ranging from pure ignorance to willful marketing and PR tactics) for people to release marginally predictive data, I’m an advocate of responsible data practice. Moreover, it is always good practice to know the limits of your data before making any inferences or claims. Otherwise, your result could be very misleading.


After all, what good is analytics if all it does is give you the “illusion” of confidence?


How to Improve Prediction on Election Outcome?

As I alluded to in my previous post, prediction science is a very challenging subject. Not only does it require statistical prowess and technical skills in computing, it also needs expert knowledge in the specific subject matter and a lot of good intuition. I also mentioned that a better model can sometimes improve your visibility in the predictive window (i.e. boost your prediction accuracy). With that said, what can campaign analysts do to improve their predictions?


First, we must recognize that most SMM systems are designed for marketers and PR agencies; they are not built for election campaigns. Therefore, even though the information about voters’ behavior may be implicit in SMM data, the analyses required to extract voters’ preferences are not built into most SMM platforms. Currently, these analyses must be performed by humans, and they can start where SMM leaves off. However, since most SMM systems have some form of sentiment analysis, the sentiment data from SMM is a good common ground to start the analyses.

1. Although voter sentiment is a good indicator of election outcome, raw sentiment data from SMM are not a very accurate reflection of voter sentiment because each individual can tweet multiple times. So the first and most important analysis to infer voter sentiment from SMM’s sentiment data is normalization. We must normalize the sentiments down to a single voter.


For example, Romney may get more positive sentiment on SMM because the voters he engaged with are more vocal. He may have 1M supporters and each of them tweets 10 times a day, giving him 10M positive mentions per day. However, Obama’s supporters may be less vocal even though he may have more of them. He may have 2M supporters, but each of them only tweets once a day, giving him 2M positive mentions per day.


Since Romney has 10M positive mentions and Obama only has 2M, the SMM sentiment data will predict Romney as the winning candidate. However, Obama actually has more supporters. When it comes to voting, it is the number of voters that each candidate gets that matters, not how vocal the voters are. After the ballots are counted, Obama will get 2M votes whereas Romney will only get 1M votes. To accurately predict the election outcome, we must normalize the positive mentions down to the number of unique users.


2. So what’s next? The obvious next step is to model how online interactions translate to offline actions. Although many users may express their positive sentiments for a candidate online via tweeting, sharing, blogging, vlogging etc., there is no guarantee that any of them will actually vote. It is very possible that many young tweeters can’t even vote.


3. Once we have a good understanding of how online activities translate to offline voting behavior, we still need to model the electoral process. Social media is completely democratic (if we are able to accurately normalize the sentiment data down to individual users). That means social media is a good model of direct democracy. But the US government is not a direct democracy; instead it’s a representative democracy. In this system, 10K voters in California may contribute very differently from 10K voters in Alaska to the final election outcome.


4. To accurately model the indirect election of our electoral process, we must infer the geo-location of each user, because voters can only vote within their electoral districts. However, other than location-based services, which specifically record the user’s geo-location, geo-data is very sparse and not easily inferred.
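The normalization in step 1 can be sketched as follows. This is a minimal illustration, assuming each positive mention arrives as a `(user_id, candidate)` pair; the user IDs and the tiny sample are hypothetical and scaled down from the 1M/2M example above.

```python
from collections import defaultdict

# Hypothetical positive mentions as (user_id, candidate) pairs; not real data.
mentions = (
    [("r1", "Romney")] * 10                 # one vocal supporter, 10 tweets
    + [("o1", "Obama"), ("o2", "Obama")]    # two quieter supporters, 1 tweet each
)

raw_counts = defaultdict(int)      # mention volume per candidate
unique_users = defaultdict(set)    # distinct users per candidate
for user_id, candidate in mentions:
    raw_counts[candidate] += 1
    unique_users[candidate].add(user_id)

# Normalize: count each user once per candidate, however often they tweet.
normalized = {cand: len(users) for cand, users in unique_users.items()}

print(dict(raw_counts))
print(normalized)
```

Raw volume favors Romney here, while the per-user count favors Obama. A real pipeline would also have to decide how to treat users who mention more than one candidate positively, which this sketch ignores.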

All of these required analyses make predicting elections a science of its own. But keep in mind that the predictive power of our models is still constrained by the predictive window. If we are outside of the predictive window of the data, then any analysis will be futile.


Due to time limitations, I certainly did not do any of these analyses when I was analyzing our SMM data for USA Today. That’s why I was very surprised that it was able to predict the Gallup data so well (i.e. cc = 0.965, which is equivalent to 93.11% of the data variance). I consider this coincidental, or just dumb luck, rather than anything special that I did. If I were to use the same formula for the next election, I probably wouldn’t be quite so lucky.



Prediction is a fool’s game, especially when you don’t have the necessary data. Since SMM platforms are not designed to predict elections, SMM data must be analyzed by humans in order to accurately predict election outcomes. These analyses are typically specific to the domain of quantitative politics.


  1. Normalize net sentiment on mentions down to unique users
  2. Model how each user’s online activities translate to actual voting offline
  3. Infer or capture geo-location data of all online users
  4. Model the indirect election process of our representative democracy
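As a rough illustration of why the indirect election step matters, the sketch below compares a simple popular count with a weighted, winner-take-all tally. The states, weights, and supporter counts are all hypothetical, and winner-take-all is a simplifying assumption; actual allocation rules vary by contest.

```python
# Hypothetical per-state supporter counts and electoral weights; not real data.
states = {
    "State X": {"weight": 10, "support": {"A": 900, "B": 100}},
    "State Y": {"weight": 6, "support": {"A": 40, "B": 60}},
    "State Z": {"weight": 6, "support": {"A": 45, "B": 55}},
}

popular = {}     # total supporters per candidate (direct democracy)
electoral = {}   # summed state weights per candidate (winner-take-all)
for info in states.values():
    for candidate, count in info["support"].items():
        popular[candidate] = popular.get(candidate, 0) + count
    state_winner = max(info["support"], key=info["support"].get)
    electoral[state_winner] = electoral.get(state_winner, 0) + info["weight"]

popular_winner = max(popular, key=popular.get)
electoral_winner = max(electoral, key=electoral.get)
print(popular, "->", popular_winner)
print(electoral, "->", electoral_winner)
```

In this toy setup candidate A has far more supporters overall, yet candidate B wins the weighted tally, so even perfectly normalized sentiment needs geo-location and an electoral model on top of it.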


There are many more analyses that can be done to improve the election-outcome prediction. Your effort is basically constrained by time and resources. Without doing any of these analyses, it is unreasonable to expect raw SMM sentiments to predict election outcome with any accuracy. However, one can get lucky sometimes.


Next time we will return to the topics of big data analytics. But please keep in mind the concept of the predictive window. We will revisit this important concept when we talk about actionable analytics.


About the Author
Dr. Michael Wu was the Chief Scientist at Lithium Technologies from 2008 until 2018, where he applied data-driven methodologies to investigate and understand the social web. Michael developed many predictive social analytics with actionable insights. His R&D work won him recognition as a 2010 Influential Leader by CRM Magazine. His insights are made accessible through “The Science of Social” and “The Science of Social 2”, two easy-reading e-books for a business audience. Prior to industry, Michael received his Ph.D. from UC Berkeley’s Biophysics program, where he also received his triple major undergraduate degree in Applied Math, Physics, and Molecular & Cell Biology.
New Commentator

Very thoughtful. The bias problem is the biggest concern, I guess.

Lithium Alumni (Retired)

Hello Robert,


Thank you for the quick comment. 


Yes, bias will always be a problem in any statistical modeling. First there is sample bias, and even if you can have a fairly unbiased sample, there is still model bias, which I only touched on here.


Maybe you can share some of your analysis here too.


Thanks for the comment and see you next time.


New Commentator

Hey Mike,

Thanks for very interesting observations. 

Last year we did some research on TV ratings in Russia by analyzing data collected through SMM and search queries in the Yandex search engine. We then compared the resulting sample with the TNS ratings (people meters). As you can see in the graphic below*, there is quite a good correlation between the Yandex search queries and the TNS ratings (left part), and a less evident correlation with social media voice (right part) but with a greater delay, which gives a theoretical possibility for predictive analytics.




Unfortunately, we couldn't develop this approach, as the data we are able to collect via SMM is not big enough to make reliable analytics.

It could be just a visual correlation for sure, but it would be great to get your opinion on this case.

* Link to the picture if the embedded one does not appear properly: 

Lithium Alumni (Retired)

Hello Kiryl,


Thank you for sharing your analysis here. You got some interesting data.


Rather than just doing this visual correlation, I would suggest computing the cross-correlation function between the respective time series you have. This would give you a series of cross-correlograms, from which you can determine whether there is potentially any predictive information between the TNS ratings, search inquiries, and social media, and what the precise lead or lag time is. Without knowing more details about how any of these series were collected, that is the simplest thing I would suggest.
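A minimal sketch of such a cross-correlogram is shown below. It assumes two equally sampled series of the same length; the series themselves are hypothetical (the second is just the first delayed by two steps), and `cross_correlogram` is an illustrative helper, not a library function.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def cross_correlogram(x, y, max_lag):
    """Correlation of x[t] with y[t + lag] for each lag in [-max_lag, max_lag]."""
    n = len(x)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xs, ys = x[: n - lag], y[lag:]     # y shifted later than x
        else:
            xs, ys = x[-lag:], y[: n + lag]    # y shifted earlier than x
        result[lag] = pearson(xs, ys)
    return result

# Hypothetical series: `ratings` repeats `searches` two steps later.
searches = [3, 5, 2, 8, 6, 1, 7, 4, 9, 2, 5, 8]
ratings = [0, 0] + searches[:-2]

ccf = cross_correlogram(searches, ratings, max_lag=3)
best_lag = max(ccf, key=ccf.get)
for lag, value in ccf.items():
    print(f"lag {lag:+d}: cc = {value:+.2f}")
print("best lag:", best_lag)
```

A peak at a positive lag suggests that the first series leads the second by that many steps. With real, short, or noisy series the peak would need to be checked against chance before reading anything into it.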


If you have more detailed knowledge of how the data were collected, then I would first check for any systematic bias as I did in this post. And if you want to establish a causal relationship, the time series structure in your data allows for a Granger causality test to determine whether there is any real predictive information between the three time series you have.


Ok, thank you for your interest and asking these deeper questions.

I hope I’ve addressed your questions and I look forward to further discussions in the future. See you next time.