In my last post, we discussed some of the infrastructural and algorithmic changes behind the new Community Health Index (CHI) shipped at the end of October. But that was just the beginning. Today, let's talk about some of the forthcoming features we've planned around the new CHI score. Note that these features are not yet available, but they are coming soon.
Since we will be referring to some of the changes we discussed last time, here’s a summary of the changes in the new CHI. If you’d like additional details, please review Part 1 of this post, “What’s New about the New Community Health Index?”
One of the reasons we invested so heavily in a highly scalable big data infrastructure is that we can now capture and store all of the intermediate data generated along the CHI processing pipeline. For example, 5 of the 6 raw health factors for a community are computed at a much finer granularity and then aggregated up to the community level before being normalized to their quantile scores. The raw health factors at these more granular levels are now captured, stored, indexed, and retrievable through the appropriate queries. The finest available granularity (i.e. the lowest level) for each raw health factor is as follows:
The availability and retrievability of granular data for each raw health factor is critically important because it enables diagnostic drill down.
The CHI score is currently computed weekly for all communities. With the increased sensitivity we discussed in the previous post, the weekly variation in CHI lets you easily spot when something happened in your community (whether it's a spike or a dip). If you experience a dip one week, you can now examine the health factor quantile scores (which are normalized and therefore comparable) to determine what the problem is. Is it low traffic, low responsiveness, or something else that's driving down your CHI score? Suppose the problem is low responsiveness. With this drill down capability, you can go one step further and identify where in the community the problem occurred: which category, which board, all the way down to the exact thread that's causing the problem.
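To make the drill down concrete, here is a minimal sketch in Python of how a raw health factor might be rolled up from thread level and traced to its source. The field names, levels, and toy numbers are all hypothetical, not the actual CHI schema, and the real pipeline runs on Hive rather than Python.

```python
from collections import defaultdict

# Hypothetical granular records: each row carries a raw responsiveness
# measurement at the thread level, tagged with its category and board.
threads = [
    {"category": "Support", "board": "Mobile",  "thread": "t-101", "responsiveness": 0.21},
    {"category": "Support", "board": "Mobile",  "thread": "t-102", "responsiveness": 0.18},
    {"category": "Support", "board": "Desktop", "thread": "t-201", "responsiveness": 0.74},
    {"category": "Ideas",   "board": "General", "thread": "t-301", "responsiveness": 0.69},
]

def rollup(rows, level):
    """Average a raw health factor at a chosen level of granularity."""
    totals, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        totals[row[level]] += row["responsiveness"]
        counts[row[level]] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Drill down: community -> category -> board -> thread.
by_category = rollup(threads, "category")
by_board = rollup(threads, "board")
worst_board = min(by_board, key=by_board.get)  # the board dragging the score down
```

The same `rollup` call works at every level where granular data is captured, which is exactly what makes the diagnostic drill down (and the category- and board-level comparisons below) possible.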
Aside from being a powerful diagnostic tool, this drill down capability also enables performance comparison at the category level and board level wherever granular data is available.
Daily CHI Score
Once we are able to compute CHI reliably from our event-log data, we can increase the frequency at which we compute it. Our data platform team is diligently monitoring the performance load on our big data infrastructure as we speak. When the processing pipeline is stable, from event generation all the way to the final CHI score, we can provision the necessary hardware resources to scale up the computing frequency. This is another benefit of the Hadoop-based infrastructure, which scales fairly linearly. The good news is that we are not hitting our hardware limit at the moment, so we can definitely scale up the computing frequency once the entire data pipeline is stable. This means CHI will be more responsive, giving you an even earlier warning of potential problems within your community.
Adaptive and Evolving
The second reason we invested in building CHI on modern big data technology is the flexibility it offers.
Recall that we normalize the raw health factors (see Part 1 of this post for details). This is achieved by fitting the population distribution (i.e. the cross-community histogram) for each health factor, which gives us a set of complex formulae for converting the raw health factors into the corresponding quantile scores. Although each community's health factors may fluctuate dramatically, these weekly (or daily) fluctuations are generally not correlated across communities, so at the population level (across all communities) the data vary rather slowly. Nevertheless, they do change, albeit over a much longer time frame. That means the formulae that transform the raw health factors into quantile scores must also change.
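As an illustration of what such a conversion does, the sketch below uses a plain empirical CDF over a toy cross-community population. The real CHI conversion formulae come from fitted distributions, so treat this as a simplified stand-in with made-up numbers.

```python
from bisect import bisect_right

# Toy population: one raw health factor (say, traffic) observed across
# ten hypothetical communities. The real population is far larger.
population = sorted([12.0, 30.0, 45.0, 51.0, 60.0, 72.0, 88.0, 95.0, 110.0, 140.0])

def quantile_score(raw_value, population):
    """Fraction of communities whose raw factor is <= raw_value, scaled to 0-100."""
    rank = bisect_right(population, raw_value)
    return 100.0 * rank / len(population)

# A community with a raw value of 60.0 sits at the 50th percentile of this
# toy population, regardless of the factor's original units or scale.
score = quantile_score(60.0, population)
```

Because every factor is mapped onto the same 0-100 quantile scale, scores for traffic, responsiveness, and the rest become directly comparable, which is what makes the diagnostic comparisons above possible.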
This is where the flexibility of our big data technology comes in handy. The advantage of implementing the conversion formulae (from raw health factors to quantile scores) as user-defined functions (UDFs) on Hive is that they can be modified and swapped in and out easily with little impact on the other parts of the data processing pipeline. This flexibility means the health factors’ distribution can be re-fitted periodically as we collect more data to construct the population histogram. If there are any population level changes in the behavior of raw health factors, we can easily modify the conversion formulae to reflect those changes.
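The actual conversion formulae are UDFs running on Hive; the Python sketch below only illustrates the pluggability idea. Each formula is a function object built from a snapshot of population data, so re-fitting means rebuilding the function and swapping it into a registry. All names and numbers here are hypothetical.

```python
from bisect import bisect_right

def fit_conversion(population_values):
    """'Fit' a conversion formula by freezing a snapshot of the population."""
    snapshot = sorted(population_values)
    def to_quantile(raw_value):
        return 100.0 * bisect_right(snapshot, raw_value) / len(snapshot)
    return to_quantile

# One conversion function per health factor, keyed by factor name.
conversions = {"traffic": fit_conversion([10, 20, 30, 40])}
old_score = conversions["traffic"](30)  # scored against the old population

# Later, the population has shifted upward; re-fit and swap the formula in
# without touching any other part of the pipeline.
conversions["traffic"] = fit_conversion([20, 30, 40, 50, 60])
new_score = conversions["traffic"](30)  # the same raw value now scores lower
```

The same raw value of 30 earns a lower quantile score against the shifted population, which is precisely the population-level adaptation the re-fitting is meant to capture.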
We are monitoring the population distributions of the health factors as we speak, trying to understand how they change as a population and at what rate. Once we have that information, we can implement the periodic re-fitting and updating of the conversion formulae; this can even be automated. The result is a CHI score that evolves and adapts to the changing behavior of the consumers: it stays accurate and never goes out of date. That's the power of adaptive algorithms.
Ease of Benchmarking
Since the raw health factors are normalized to quantile scores based on the entire population of communities, benchmarking community performance becomes easier in several ways. First, each quantile score can already be viewed as a benchmarked score against all other communities. Moreover, the quantile scores are all on the same scale, so when a community is compared against its benchmark average, the comparison is more meaningful and easier to understand.
In addition, the availability of the quantile scores can also help us select a better benchmark set of communities for any particular community in mind. Previously, we offered 2 types of benchmarking comparisons:

1. Age-based benchmarking: comparison against communities of a similar age (i.e. time since launch)
2. Date-based benchmarking: comparison against other communities over the same calendar period
Age-based benchmarks are more useful for younger communities (from just launched to ~2.5 years of age), as they offer a role model and a growth trajectory for these younger communities to follow. A new (i.e. just launched) community can see how other similar communities' quantile scores (and CHI scores) change as they grow and mature. Date-based benchmarks, on the other hand, offer a more competitive comparison for older communities that have reached maturity. These mature communities can see how their quantile scores (and CHI scores) stack up against their competition in the market right now.
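One plausible reading of the two benchmark modes can be sketched in Python. The communities, the 180-day age tolerance, and the selection logic are all invented for illustration; only the ~2.5-year (~913-day) maturity cutoff comes from the post.

```python
from datetime import date

# Hypothetical communities with their launch dates.
communities = [
    {"name": "A", "launched": date(2013, 1, 15)},
    {"name": "B", "launched": date(2010, 6, 1)},
    {"name": "C", "launched": date(2012, 11, 3)},
    {"name": "D", "launched": date(2009, 2, 20)},
]

def age_based_peers(target, pool, today, tolerance_days=180):
    """Communities whose age is within a tolerance of the target's age."""
    target_age = (today - target["launched"]).days
    return [c for c in pool
            if c is not target
            and abs((today - c["launched"]).days - target_age) <= tolerance_days]

def date_based_peers(target, pool, today, min_age_days=913):
    """Mature communities (~2.5+ years old) compared as of the same date."""
    return [c for c in pool
            if c is not target
            and (today - c["launched"]).days >= min_age_days]

# Community A is young, so its age-based peers are other young communities,
# while its date-based comparison set is the mature communities of today.
peers_for_A = age_based_peers(communities[0], communities, date(2013, 6, 1))
mature_for_A = date_based_peers(communities[0], communities, date(2013, 6, 1))
```

A young community like A would typically look at `peers_for_A` for a growth trajectory, while a mature community would compare itself against a date-based set like `mature_for_A`.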
As you can see, the investment in modernizing our data infrastructure has enabled many new features that were previously not feasible. Although we don't have all of the aforementioned features right now, all of the ingredients are there, ready to be prioritized and built. So we should start to see some of these features soon.
In summary, some of the forthcoming features we can expect to see in the near future are:

1. Diagnostic drill down: tracing a dip in CHI down to the category, board, or thread that caused it
2. Daily CHI score: more frequent computation for an earlier warning of potential problems
3. Adaptive conversion formulae: periodic re-fitting so CHI evolves with the population's changing behavior
4. Easier benchmarking: comparable quantile scores and better benchmark sets, both age-based and date-based
Again, this wouldn’t be possible without the help of the various data teams involved. I’m super excited about what’s ahead, and I hope you are too. In the meantime, I’d like to hear your thoughts on these features. As usual, comments, kudos, discussions, and critiques are equally welcome. See you again soon.
Michael Wu, Ph.D. is Lithium's Chief Scientist. His research includes: deriving insights from big data, understanding the behavioral economics of gamification, engaging + finding true social media influencers, developing predictive + actionable social analytics algorithms, social CRM, and using cyber anthropology + social network analysis to unravel the collective dynamics of communities + social networks.
Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics + its application to Social CRM. He's a blogger on Lithosphere, and you can follow him @mich8elwu or Google+.