Last time we described the simplest class of analytics (i.e. descriptive analytics), which you can use to reduce your big data into much smaller, consumable bits of information. Remember, most raw data, especially big data, are not suitable for human consumption, but the information we derive from the data is.
Today we will talk about the second class of analytics for data reduction—predictive analytics. First let me clarify two subtle points about predictive analytics that are often confusing.
Predictive Analytics: Extrapolate (e.g. Forecast)
The easiest way to understand predictive analytics is to apply it in the time domain. In this case, the simplest and most familiar predictive analytic is a trend line, which is typically a time series model (or any temporal predictive model) that summarizes the past trajectory of the data. Although temporal predictive models can be used to summarize existing data, the power of having a model is that we can use it to extrapolate to a future time where data doesn’t exist yet. This extrapolation in the time domain is what scientists refer to as forecasting.
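To make this concrete, here is a minimal sketch of trend-line extrapolation using NumPy. All the numbers are made up for illustration; the point is simply that the model is fit on the past and then evaluated at future times where no data exists.

```python
import numpy as np

# Illustrative data: 12 months of some metric we measured (the data we have)
rng = np.random.default_rng(0)
t = np.arange(12)
y = 100.0 + 5.0 * t + rng.normal(0, 2, size=12)  # noisy upward trend

# Fit a linear trend line (a degree-1 polynomial) to summarize the past
slope, intercept = np.polyfit(t, y, 1)

# Extrapolate to months 12-14, where no data exists yet -- i.e. forecast
future_t = np.arange(12, 15)
forecast = intercept + slope * future_t
print(forecast)
```

The same two steps — fit on existing data, evaluate where data is absent — apply no matter how sophisticated the time series model is.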
Although predicting the future is a common and easy-to-understand use case of predictive analytics, statistical models are not limited to predictions in the temporal dimension. They can predict practically anything, as long as their predictive power is properly validated (see Learning the Science of Prediction). The essence of predictive analytics, in general, is that we use existing data to build a model. Then we use the model to predict data that doesn’t yet exist. So predictive analytics is all about using data you have to predict data that you don’t have.
In the case of temporal prediction, the data we don’t have are inaccessible because it is impossible to obtain future data. No one has data about the future, because no one can measure the future. We would need a time machine to do this. But there are many other reasons why we may not have the data we need and therefore need to use a model to predict or infer it. And the reason is almost always that the data is too expensive to measure at the accuracy or scale we need.
Examples of Non-Temporal Predictive Analytics
If you’ve been following my writing on influence, you’ve already seen an example of non-temporal predictive analytics, where a model uses someone’s existing social media activity data (data we have) to predict his/her potential to influence (data we don’t have). In this case, influence data is just really hard to measure and track at the scale of the social web (see Why Brands STILL don't Understand Digital Influence?). So we use something that we can measure at scale (i.e. people’s social media activity data) to predict it.
Another well-known example of non-temporal predictive analytics in social analytics is sentiment analysis. Like influence, no platform out there actually measures people’s sentiment. To truly measure someone’s sentiment, the platform would have to survey his sentiment toward every entity (i.e. person, place, or object) mentioned in his posted message. Even if such a platform existed, no user would bother to provide the answers when explicitly prompted for sentiment data on a list of mentioned entities.
This means nobody has actual sentiment data, not at the scale of the social web. However, we can track and store the textual content of any user posting (e.g. tweets, updates, blog articles, forum messages, etc.). So we build a model that predicts the user’s sentiment (data we don’t have) from his postings (data we have).
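As a minimal sketch of this idea — predicting sentiment from postings — here is a toy lexicon-based scorer. The word lists and labels below are my own illustrative assumptions; real sentiment models learn their word weights from labeled training data rather than a hand-made lexicon.

```python
# Predict a user's sentiment (data we don't have) from the text of a
# posting (data we have). The lexicon here is a toy assumption.
POSITIVE = {"love", "great", "awesome", "happy", "excellent"}
NEGATIVE = {"hate", "terrible", "awful", "angry", "broken"}

def predict_sentiment(post: str) -> str:
    words = post.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(predict_sentiment("I love this awesome community"))  # positive
```

However crude, this is structurally the same as any sentiment model: observable text goes in, an unobservable quantity (sentiment) comes out.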
How do you do this? Well, there are many ways, but it really doesn’t matter. As long as the model is able to predict the user’s sentiment accurately, who cares how you come up with the model? But how can you be sure of the model’s prediction accuracy? You must validate the model with an independent measure of sentiment (see Learning the Science of Prediction).
With predictive analytics, coming up with a predictive model is the easy part, because anyone can hypothesize a theory about how things work and build a model based on his theory. The hard part is validating it (i.e. making sure that it can predict accurately).
While the purpose of descriptive analytics is to summarize and tell you what has happened in the past, the purpose of predictive analytics is to tell you what might happen in the future. Although predictive models can certainly summarize the existing data through their model parameters, the real advantage of having a model is that we can extrapolate the model to regimes where data doesn’t exist and make predictions. So general predictive analytics is all about using data that we have to predict or infer some data that we don’t have, either because the data is impossible to measure (e.g. future data) or because it is too expensive to measure at the accuracy or scale we need.
So you have seen 2 classes of analytics for data reduction: descriptive and predictive. Next time let’s take a look at what prescriptive analytics is all about. More importantly, what does it take to have prescriptive analytics? You don't want to miss the next post...
In the meantime, do you work with predictive analytics? What is the data that you have (i.e. the accessible data), and what is the data that you don’t have and are trying to infer (i.e. the inaccessible data)? Do you validate your predictive models? Let’s open the floor to discussion. That’s how we all learn...
Michael Wu, Ph.D. is Lithium's Chief Scientist. His research includes: deriving insights from big data, understanding the behavioral economics of gamification, engaging + finding true social media influencers, developing predictive + actionable social analytics algorithms, social CRM, and using cyber anthropology + social network analysis to unravel the collective dynamics of communities + social networks.
Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics + its application to Social CRM. He's a blogger on Lithosphere, and you can follow him @mich8elwu or Google+.
Excellent article and series.
I believe you clarify a common misperception - fueled in part by companies with roots in complex event processing - that predictive analytics can predict the future. Perhaps "anticipatory analytics" would have been a better, albeit less marketable, name.
That said, the statistics professor in my MBA program many years ago introduced himself to the class by saying that statistics can be used to prove any point of view. With all the hype around big data analytics, not enough attention is being given to the quality of data or to the validation of models built on the data.
Coefficients of determination can easily be manipulated to fit the hypothesis behind the model. As such, doesn't this also distort the analysis of the residuals? Models for spatial and temporal data would only appear to complicate validation even further.
Data management tools have improved enough to significantly increase the reliability of the data inputs. Until machines devise the models, focusing on the validity of the data would improve model validation and reduce, though not eliminate, inherent bias.
Thank you for posting your comment here. And I'm glad to hear that my work clarifies some misperceptions in the industry.
Yes, statistics can definitely lie. I'm sure you've heard the phrase: "there are lies, damned lies, and statistics." That is why it is very important for data scientists and statisticians to hold the highest integrity in their work. All the hype around big data is really a double-edged sword. It makes businesses more data conscious, but very often, that is not enough. They still need proper training in basic stats to be data savvy enough to spot a fraudulent analysis.
That is why I like to write about this subject, to make sure that people have the right information when they are making a decision.
Data quality can definitely help. But at a minimum we need to properly validate any model we build. Simple cross validation is often enough, although it is possible to overfit to the validation data set if we do this too much. If you've been following my writing on influence scoring, you probably remember the following posts.
These two posts describe a perfect example of how vendors in the influence industry don't properly validate the models they use to infer people's influence from social media activity data. As a result, people never know if their influence score actually means anything. Moreover, people game the influence scoring system, leading to IEO.
I can't stress the importance of model validation and data integrity enough. And cross validation is really not that hard; I do it all the time. As scientists and analysts, we must hold ourselves to a higher standard. And even with such a high bar, the possibilities of predictive analytics are limitless.
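For readers who want to see how simple cross validation can be, here is a minimal k-fold sketch in NumPy on synthetic data (all numbers are illustrative): fit the model on k−1 folds, measure error on the held-out fold, and average.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with a known linear relationship plus noise (sd = 1)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=100)

def kfold_mse(x, y, k=5):
    """Estimate out-of-sample error of a linear fit via k-fold cross validation."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], 1)  # fit on k-1 folds
        pred = np.polyval(coeffs, x[test])          # predict the held-out fold
        errors.append(np.mean((pred - y[test]) ** 2))
    return float(np.mean(errors))

print(kfold_mse(x, y))
```

For a well-specified model like this one, the cross-validated error comes out close to the noise variance; a badly overfit or misspecified model would show a much larger held-out error than its in-sample fit suggests.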
For the rest of the readers who are more interested in the application of predictive analytics... Care to come discuss some interesting possibilities of predictive analytics? Many brave start-ups are already doing some of this. Come and share your predictive analytics story.
Thanks again for your comment, and thanks for connecting on LinkedIn.
BTW, I like the name "anticipatory analytics."
Hope to see you on Lithosphere next time.
Thank you for sharing your valuable thoughts and knowledge...
Now I can relate to and differentiate between Descriptive, Predictive & Prescriptive Analytics!
"In the meantime, do you work with predictive analytics? What is the data that you have (i.e. the accessible data), and what is the data that you don’t have and are trying to infer (i.e. the inaccessible data)? Do you validate your predictive models? Let’s open the floor to discussion. That’s how we all learn... "
No, I don't currently work on predictive analytics, but I'm eager to learn about it. I work in a restaurant chain (food & beverage). I would like to do predictive analytics: I have all the sales data (with bill details and item details), and I want to predict monthly sales and also predict sales on special days (e.g. Valentine's Day, etc.). What model can be used to do this accurately? I am currently using only descriptive analytics to summarize the data.
Please help with your valuable opinion & knowledge!
Thanks in advance!
First, my apologies for such a late reply. Somehow many of my blog comment notification emails got automatically routed to the Clutter folder in Outlook. I am just discovering how many conversations I missed.
To address your problem of predicting sales, I think you can start by setting up a simple linear regression model between the various predictors/operational metrics over time (e.g. holidays, weekdays vs. weekends, supply chain data, etc.) and your sales data over time (e.g. daily or weekly sales). Once you solve that regression equation, you can plug the predictor/operational metrics for any future period into the resulting coefficients to get a prediction of sales.
This does require some statistical skill in setting up regression models, and the use of packages and tools to solve them.
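To sketch what this regression might look like, here is a toy version in NumPy. The predictors (a weekend indicator and a holiday indicator), the simulated sales numbers, and the chosen holiday dates are all illustrative assumptions, not real restaurant data.

```python
import numpy as np

rng = np.random.default_rng(2)
n_days = 120

# Hypothetical predictors: weekend indicator and a couple of special days
is_weekend = (np.arange(n_days) % 7 >= 5).astype(float)
is_holiday = np.zeros(n_days)
is_holiday[[44, 100]] = 1.0  # e.g. Valentine's Day and one other special day

# Simulated daily sales: baseline + weekend bump + holiday spike + noise
sales = 1000 + 300 * is_weekend + 800 * is_holiday + rng.normal(0, 50, n_days)

# Design matrix [intercept, weekend, holiday]; solve by least squares
X = np.column_stack([np.ones(n_days), is_weekend, is_holiday])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)

# Plug a future day's predictors into the fitted coefficients
next_saturday = np.array([1.0, 1.0, 0.0])  # a weekend day, not a holiday
print(coef, next_saturday @ coef)
```

With real data you would add more predictors (promotions, weather, seasonality), and you should cross validate the model on held-out days before trusting its forecasts.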
I hope this helps. Let me know if you need to dig deeper.