Exploratory Data Analysis: Playing with Big Data
In my previous big data post, we discussed the three necessary criteria for information to provide insights that are valuable. Through this discussion, we learned the key to insight discovery.
By definition, an insight must tell us something we don’t already know. However, we typically don’t know what we don’t know, so we can’t search for insights directly: we cannot know a priori what to look for. What we need to do is temporarily forget about the value proposition of the data analysis and look beyond what’s relevant to the immediate problem we are trying to solve. There is no guarantee that we will find anything in the land of irrelevance, but ironically that is usually where insights are discovered.
How exactly do we do this? That is the topic we will discuss today.
Exploratory Data Analysis
When a good data scientist analyzes any complex data set, especially one with high dimensionality, the first step is usually playing with the data. John Tukey (the famous 20th-century statistician who coined the term “bit” for binary digit) called this step Exploratory Data Analysis (EDA). For me, it really is just playing with the data, because there are no standard procedures or prescriptive methods for approaching it.
EDA involves looking at the data from many different angles: slicing and dicing the data along non-trivial, non-orthogonal dimensions and combinations of dimensions; transforming the data through nonlinear operators; projecting the data onto a different subspace; and then examining the resulting distribution. Regardless of what it involves, the number of things to try in each of these steps is infinite. Worse, it’s actually uncountably infinite, so we cannot even enumerate them and then perform an exhaustive search by trying each one systematically.
If this sounds like gobbledygook to you, it should, and it doesn't matter: these terms are just fancy statistical jargon for play, experimentation, and exploration. Since there are inexhaustibly many ways to explore an infinite space, use your imagination and be creative!
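To make the slicing, transforming, and projecting described above a little more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the synthetic records, the column meanings (segment, visits, spend), and the helper names slice_by, log_transform, project, and histogram are all hypothetical. It is one possible way to start playing, not a prescribed procedure.

```python
import math
import random
import statistics
from collections import defaultdict

random.seed(0)  # fixed seed so repeated runs explore the same synthetic data

# Hypothetical records (illustration only): (segment, visits, spend)
records = [
    (random.choice(["new", "returning"]),
     random.randint(1, 50),
     random.lognormvariate(3, 1))
    for _ in range(500)
]

def slice_by(rows, key_index, value_index):
    """Slice: group one column by another and summarize each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[value_index])
    return {key: (len(vals), statistics.mean(vals), statistics.median(vals))
            for key, vals in groups.items()}

def log_transform(values):
    """Transform: a simple nonlinear operator that compresses long right tails."""
    return [math.log(v) for v in values]

def project(rows, direction):
    """Project: map (visits, spend) points onto the unit vector along `direction`."""
    norm = math.hypot(*direction)
    ux, uy = direction[0] / norm, direction[1] / norm
    return [row[1] * ux + row[2] * uy for row in rows]

def histogram(values, bins=10):
    """Examine: equal-width bin counts as a crude view of the distribution."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values are equal
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts

if __name__ == "__main__":
    spend = [row[2] for row in records]
    print("spend by segment (n, mean, median):", slice_by(records, 0, 2))
    print("raw spend histogram:", histogram(spend))
    print("log spend histogram:", histogram(log_transform(spend)))
    print("projection onto (1, 1):", histogram(project(records, (1, 1))))
```

Swapping in a different grouping column, a different transform, or a different projection direction takes one line, which is the point: each variation is another angle on the same data, and there is no fixed list of angles to work through.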
Our brain is still, by far, the world's most powerful nonlinear processor. It can perform many challenging tasks (e.g. recognize complex patterns, detect obscure outliers, discover hidden relationships, etc.) far better than the most sophisticated machine learning algorithm or statistical method available to us today. However, it is only through "play" that we can realize the full potential of this evolutionarily optimized processor (i.e. the brain).
When we play with data, we get a feel for the data. We get a sense of what might be interesting to look for, and where and how we might find it. This will guide the rest of the analyses downstream and determine the direction and course of the data exploration. Consequently, EDA is extremely important, as it will often determine the success or failure of an analytics project. Heading off in the wrong direction not only wastes time and resources, it often results in termination of funding for the project. EDA is often the most critical and most difficult step of any analysis project, yet it is also what makes analytics fun.
With complex data sets, such as social data, you will never find anything new if you don't play with the data. You will find the information you look for, but without EDA you will not discover insights that you don’t already have.
Structured Play: More Than Just Imagination
One very important component of EDA is the creative application of analytical techniques on the data set. Many people have asked me, “What do I need to do to be creative?” I would say, “If it’s something that I can tell you, it wouldn’t be creative anymore.”
People have been looking for the magic formula for creativity for a long time. But the novelty and originality requirement of the creative process means it can’t possibly be a formula that we can reuse over and over again. In this respect, EDA is like art, music, writing, photography, or other creative disciplines. You must be novel, original, and imaginative to be successful and find something interesting.
However, pure creativity isn’t enough. Imagination without knowledge may lead to dead ends. You can be very creative and develop a truly novel set of analyses. But if there are any logical flaws in the analyses, the result may be invalid and misleading. Having a misleading result in EDA is worse than having no result at all, because it will guide you down a path that will eventually prove to be futile.
With most creative disciplines, the evaluation of the final result is somewhat subjective. There really isn’t a correct answer for what makes a piece of art great, or which piece of art is better. It’s mostly in the eye of the beholder. But there are objective methods for evaluating EDA, and we can objectively quantify which answer is better. So EDA is not just an art. It is also a very rigorous science, and must meet all the stringent logical requirements of mathematics and statistics.
In a way, it's kind of like rock climbing. There are many possible ways to get to the top. However, you can't just do anything you like, or it will take you too long to get there.
EDA is a kind of play, but it’s a very structured play, where one must conform to 2,000 years of rules and logic accumulated through the history of statistical science. What I think is novel and original isn’t enough, because it may simply reflect my ignorance of what others have tried and failed at in the past. EDA is one of those disciplines that requires both a high level of imagination (to be novel) and a substantial amount of domain knowledge in statistics (to prevent flawed analyses and logic). That’s why EDA is so challenging.
Conclusion
Albert Einstein (one of the greatest physicists of the 20th century and Nobel laureate) once said, “To raise new questions, new possibilities, and to regard old problems from a new angle, requires creative imagination and marks real advances in science.”
Likewise, this type of creative imagination is also required for us to truly advance our understanding of human behavior and the collective dynamics of social systems. It all begins with exploratory data analysis (EDA), which is really just another term for playing with data. However, pure imaginative play isn’t enough, because unconstrained EDA often leads to too many inconclusive results. Imagination and domain knowledge in statistics are both necessary to maximize the likelihood of insight discovery. That is why EDA is challenging, but it’s also what makes it fun.
So give your data scientist a little bit of freedom to play with the data. You may not find anything, but you may also find a diamond in the rough.
In the meantime, I’m happy to discuss the details of any actual analyses you may want to perform during your EDA. Next time let’s venture into the more mechanical parts of big data analytics.
Michael Wu, Ph.D. is Lithium's Chief Scientist. His research includes: deriving insights from big data, understanding the behavioral economics of gamification, engaging + finding true social media influencers, developing predictive + actionable social analytics algorithms, social CRM, and using cyber anthropology + social network analysis to unravel the collective dynamics of communities + social networks.
Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics + its application to Social CRM. He's a blogger on Lithosphere, and you can follow him @mich8elwu or Google+.