Khoros Community

Searching and Filtering Big Data: The 2 Sides of the “Relevance” Coin

Lithium Alumni (Retired)

Michael Wu, Ph.D. is Lithium's Principal Scientist of Analytics, digging into the complex dynamics of social interaction and group behavior in online communities and social networks.


Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics and its application to Social CRM. He's a regular blogger on the Lithosphere's Building Community blog and previously wrote in the Analytic Science blog. You can follow him on Twitter or Google+.



Alright, a little announcement before we begin today. First, I apologize that I haven't been keeping up with my blogging as much as I'd like to. I've been totally swamped with our analytics architecture project, not to mention the external engagements I've already scheduled and committed to through June. Speaking of which, I will be giving the closing keynote today at the Social Media Strategy Summit (SMSS) in Las Vegas. So if you happen to be at The Mirage, please drop by and say hello!


A couple of weeks ago, Lithium launched “The Science of Social,” a compilation of most of my research work at Lithium, rewritten for a business audience. As such, it is not designed to go as deep as my blog here (maybe that is the next book, if I ever get the chance to write it!).


Quite a few people have been asking on Twitter how they can get the book, so I thought I'd briefly answer here. Since this book is intended to be an easy-to-read business book, we decided that the main distribution format would be an e-book.

  1. The Kindle version is available for $4.99 on Amazon
  2. The iBook version WILL BE available from the Apple store soon (it is pending approval from Apple as we speak)
  3. However, hard copies of the book (both soft and hard cover) are available from Blurb


NOTE: Any proceeds collected from the sale of these books will be donated to charity, so I’ll keep you up to date on that in the future.


Now, let’s get back to big data. Last time we talked about one of the most important functions of analytics – helping us make better decisions.


To achieve that, analytics faces the challenge of reducing hundreds of terabytes of data down to the few bits we can decide and act on. Today, we will describe two of the most commonly used data reduction techniques. They are not new, and you are probably familiar with them already, but I'd like to mention them briefly for completeness before we move on to the more advanced techniques later in this mini-series.


If You Know What Data You Need – Search

If you know the data you need to help you make a decision, then the simplest data reduction technique is a search. This turns the data reduction problem into an information retrieval (IR) problem, which we know how to solve very effectively. At the very least, we can leverage an open-source IR library (e.g., Lucene) or ask Google for help.
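To make the IR framing concrete, here is a toy sketch of the boolean retrieval that an IR library like Lucene performs under the hood: build an inverted index once, then answer queries by intersecting term postings. The mini corpus and function names are invented for illustration, not from any real library API.

```python
from collections import defaultdict

# A tiny corpus: document ID -> text.
docs = {
    1: "big data analytics for better decisions",
    2: "cooking recipes for busy weeknights",
    3: "search and filter techniques for big data",
}

# Inverted index: each term maps to the set of document IDs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Return IDs of documents containing every query term (boolean AND)."""
    terms = query.split()
    if not terms:
        return set()
    result = index[terms[0]].copy()
    for term in terms[1:]:
        result &= index[term]
    return result

print(search("big data"))  # {1, 3}
```

The key property: once the index is built, a query touches only the postings for its terms, never the whole corpus, which is what lets search scale.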


Search is arguably the most efficient means of data reduction, but the caveat is that we must know what data we are looking for a priori. Due to its efficiency, search engines can be applied at web scale to find and retrieve the data we need. This is why Google, Microsoft, Yahoo!, etc. are able to make a business out of their search technology.


What Happens When You Don’t Know the Data You Need?

However, as with many things in life, we often don’t know what data will best help us with the decision in front of us. In these situations, we often resort to filtering: the process of selectively eliminating the data that are not relevant to our decision. Although the implementations of search and filter technology are quite different, they essentially solve the same problem. At the abstract level, both narrow the data down to a much smaller set that is relevant to our decision. With search, we do it by finding and retrieving the relevant data directly; with filtering, we do it by successively removing the irrelevant data, leaving behind the relevant pieces.
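The symmetry between the two operations can be sketched in a few lines of Python. The sample records and the relevance predicate below are invented for illustration; the point is only that retrieving the relevant items and removing the irrelevant ones converge on the same set.

```python
records = [
    {"topic": "big data", "text": "reducing terabytes to bits"},
    {"topic": "cooking",  "text": "a simple pasta recipe"},
    {"topic": "big data", "text": "search vs. filter"},
    {"topic": "travel",   "text": "weekend in vegas"},
]

def is_relevant(record):
    return record["topic"] == "big data"

# Search: find and retrieve the relevant records directly.
searched = [r for r in records if is_relevant(r)]

# Filter: successively remove the irrelevant records, keeping the rest.
filtered = records[:]
for r in records:
    if not is_relevant(r):
        filtered.remove(r)

# Opposite directions, same relevant subset.
assert searched == filtered
```

The difference shows up in cost, not outcome: the filter loop had to visit every record, including the irrelevant ones.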


Because search is very efficient, we can start with a blank page like Google’s home page and then populate it with more and more relevant data through query refinement. Filtering is less efficient, because it often requires showing samples from the entire data set for the user to filter on in order to remove the irrelevant data. That is, the user has to look through the sample data to determine what’s irrelevant. Therefore, true filtering functions are rarely applied to very large data sets at web scale.


Blurring the Boundary Between Searching and Filtering

Now, if you are Google, Microsoft, or you simply have lots of computing power, you can fake a filter by having your machines look through all the data and pre-compute attributes of the data set (e.g. date, location, media type, etc.).


Although these pre-computed filters function like a filter and give users the ability to eliminate irrelevant data, they are really a search, because you must know what data you need before you can apply them. For example, you must know a priori that the relevant data is within the last 24 hours in order to apply that filter. If you don’t know that, you are back to square one. The pre-computed filters won’t help you; you must look at the data in order to determine its relevancy.


In short, pre-computed filters (like those on the left panel of Google) are not real filters; they are really just searches in disguise, implemented as searches underneath the filter-like user interface. Don’t believe me? You can get the same result simply by specifying the filter conditions as part of your search query or by using Google’s advanced search.


With modern technologies, the difference between search and filter is really more of an academic distinction. However, it does have some design implications. Since search is much more efficient, when in doubt always apply search first before filtering. Because search often returns a much smaller result set with relatively little effort from the user, we can start with a rather general search and subsequently filter this smaller data set to find the relevant data. Most successful search engines (e.g. Google) do this. Remember, real filters require the user to examine sample data, determine its relevance, and then remove the irrelevant pieces. From this perspective, query refinement is a form of data filtering, because users must examine some of the top search results before they know how to refine the query to extract the relevant data they need.
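The “search first, then filter” design might look like this as a toy two-stage pipeline (the corpus and the refinement criterion are invented for illustration):

```python
corpus = [
    "big data analytics for marketing teams",
    "big data storage benchmarks",
    "big data analytics case study in retail",
    "gardening tips for spring",
]

# Stage 1: a cheap, general search narrows the full corpus to a small result set.
results = [doc for doc in corpus if "big data" in doc]

# Stage 2: the user inspects the results and filters out what is irrelevant --
# here, dropping everything that is not about analytics.
relevant = [doc for doc in results if "analytics" in doc]

print(len(corpus), "->", len(results), "->", len(relevant))  # 4 -> 3 -> 2
```

Stage 2 only ever touches the small result set from stage 1, which is why the expensive, human-in-the-loop filtering step stays tractable.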



The first step to making big data useful is to identify the relevant data. Clearly the data can’t be useful if it is not even relevant. We typically search and then filter to winnow the big data down to the relevant data set.


Ironically, the relevant data is usually a much smaller data set; in fact, many orders of magnitude smaller. This poses an interesting conundrum: although we have the technology to track, store, and process data at web scale, most of the data are irrelevant! That is why search technologies were developed hand in hand with most big data technologies. Without search and filter technologies, big data is essentially useless.


Alright, in order not to give the spoiler away, I’d better stop now. Next time we will look at the implications of search and filtering on the value of big data. Stay tuned for more on big data analytics.




About the Author
Dr. Michael Wu was the Chief Scientist at Lithium Technologies from 2008 until 2018, where he applied data-driven methodologies to investigate and understand the social web. Michael developed many predictive social analytics that deliver actionable insights. His R&D work won him recognition as a 2010 Influential Leader by CRM Magazine. His insights are made accessible through “The Science of Social” and “The Science of Social 2”—two easy-reading e-books for a business audience. Prior to industry, Michael received his Ph.D. from UC Berkeley’s Biophysics program, where he also received a triple-major undergraduate degree in Applied Math, Physics, and Molecular & Cell Biology.
New Commentator

Nice distinction Michael. I think that we all prefer search to work because it means we can be lazy when we store data. For filtering to work we need to add meta data each time we store something and that means investing up front for a nebulous future benefit.

Lithium Alumni (Retired)

Hello TobyB,


Thank you for stopping by and commenting.


You are right on! Those pre-computed filters are really just meta-data that enable faster searches. Given that simplicity (or ability) is one of the factors in Fogg's Behavior Model for gamification, it will definitely drive behavior more effectively. That is one plausible explanation for your observation that we prefer searching over filtering.


However, investing in computing those pre-filters (or meta-data, attributes, etc.) may have unknown benefits in the future. For companies like Google, VRM, or other Data as a Service (DaaS) vendors, that can be very important, but it may not be right for everyone. We will discuss this a bit in the coming articles.


Stay tuned!

Thanks again for commenting, and see you next time.

New Commentator

Thanks Michael for this interesting article. Can't wait for the next ones


You're perfectly right when you say that the relevant data is only a fraction of the analyzed data. We analyze social media conversations to extract actionable intelligence, and we usually get between 0.1% and 2% of relevant data out of the chatter by combining search and filtering.

Lithium Alumni (Retired)

Hello Bastien,


Thanks for the comment. I assume both of these are from you due to the similarity in the user name and content. So I'll just write one response here.


Thank you for sharing your big data story here. Let me see if I get this right. You are saying that the relevant data is typically a tiny fraction of the data being analyzed, not a fraction of the data you collect, right? It really depends on what you mean by analyze, though. Although 0.1%--2% is a small fraction, I suspect that the fraction of relevant data to total collected data (the big data) is even smaller. There is typically a lot of data that we collect but filter out before we even do any analysis on it. At least in our case, the data that I find relevant and worth analyzing is a much smaller fraction than the total data volume we collect.


Nevertheless, it's a good data point and validation!

Thanks for taking the time to comment and see you next time.


Frequent Commentator

Hi Mike! 

I always like the first articles in your series, where you explain basic terms before moving into advanced areas!
I met the same problem solving my own search-filtering dilemma in social media monitoring. I call it "finding (searching) relevance" vs. "filtering out noise," when dozens of mentions contain relevant and noisy mentions in unknown proportions. I think search queries can also be viewed as meta-tags when we are looking not simply for exact phrases, but for meaningful content pieces beyond these queries. 


That is exactly the problem you mentioned - "we must know what data we are looking for a priori."

But often that is not the case. 🙂

Lithium Alumni (Retired)

Hello Andrei,


Thank you for coming back again and glad to hear that you are having the same experience in your own data analysis.


You are absolutely right. The first step in taming big data is basically finding the signals in an ocean of noise. Although search queries can be viewed as meta-data, they probably shouldn't be viewed as meta-data for each of the data points retrieved, but as meta-data for the entire result set returned by the query. Otherwise, you will have very incomplete meta-data for each data point, which doesn't help you with subsequent filtering. They are basically sparse annotations.  😉


Hope this makes sense to you.


Thank you for the comment and see you again next time.


Frequent Commentator

Hi Michael, 

I am back after a long hiatus 🙂  -- only in terms of comments as I have never missed any of your reads.


Analytics is always an interesting area. Even with many years of experience, there is still no dearth of challenges to solve to make data more meaningful and extract even better insights. 


I like the way you broke out search and filter. One caveat I would like to add is that one of the assumptions here is that the data quality is reasonably good, with not much noise. Otherwise, the basic search and filter results are going to be suboptimal, unless padded with noise reduction factors. Of course, this is again a topic in itself: whether you know the type of noise in your data, or you just know the data is noisy but don't know in what way.



Looking forward to your future articles - I sense some statistical methodologies coming along 🙂




Lithium Alumni (Retired)

Hello Ned,


Welcome back. I hope everything is alright on your side. Glad to hear that you didn't miss any of my posts, and even more so now that you can participate in the conversation.


Yes, analytics is interesting. It is both a very rigorous science, involving math and stats, and at the same time, it is also a black art. That is why I want to start this series on big data and analytics science.


However, I don't think the assumption of good data quality is a requirement for search and filter to work. The internet is full of spam and junk, yet Google can give us decent results. The function of search and filter is to remove the noise and keep the relevant data, so the noise reduction is done via search and filter. Therefore, there is no need to assume that the data quality is reasonably good in the first place.


That said, cleaner data does make searching and filtering easier. If you have noisier data, you simply have to do more work to refine your query and filter to get to the data you want. In that view, all big data is, by definition, noisy, and no data is ever clean before you search/filter it. The point of big data is that you don't know what is noise and what is signal, so you record everything (I will actually talk about this in the next post).


So, provided with powerful enough searching and filtering capabilities, you can always filter out the noise from the big data. You may need to do a LOT of work to filter out the noise, but the key is that you can. So data quality is NOT a necessary assumption.


Finally, the statistics will come, but much later. We'll get the basics through first so everyone is on the same page. Well... we'll see how far we can go with this.


Frequent Commentator


You are absolutely right that "provided with powerful enough searching and filtering capability, you can always filter out the noise from the big data".


I guess I have been on the business side of analytics too long :-). My point was based on what I see every day with organizations. Many firms do not have the money, resources, or sophisticated tools to apply search/filter directly to the data they collect. On top of that, folks use statistical techniques without doing due diligence on whether the underlying data meets the basic assumptions required to apply a given method. Some algorithms and heuristics lend themselves better to noisy data, whereas others can give you erroneous results if the underlying data does not meet certain criteria.


But you and I are on the same page. Firms like Google do have algorithms in place that address this issue, and while it is not a necessary condition to clean your data before you apply search/filter, it improves your chances of getting results that are more meaningful from a decision-making perspective.




Lithium Alumni (Retired)

Hello Ned,


Thank you for continuing the conversation.


Glad to hear that we are on the same page. I think we are just thinking about different stages of data analysis. What I am talking about here (searching and filtering) is really the data collection stage for the statistical analysis. We use search and filter to find the relevant data to analyze. But that is not the same as the data collection stage for the big data. When you collect big data, you also collect a lot of noise. That is precisely why we need further search/filter to collect the data again from the big data store for our statistical analysis.


Anyway, I hope this makes everything crystal clear.

See you next time.