

What’s New about the New Community Health Index? Part 2—Upcoming Features


In my last post, we discussed some of the infrastructural and algorithmic changes behind the new Community Health Index (CHI) shipped at the end of October. But that was just the beginning. Today, let’s talk about some of the forthcoming features we’ve planned with the new CHI score. Please note that these features are not yet available, but they will be soon.


Since we will be referring to some of the changes we discussed last time, here’s a summary of the changes in the new CHI. If you’d like additional details, please review Part 1 of this post, “What’s New about the New Community Health Index?”


Infrastructural Changes:

  1. Built on highly scalable, modern, big data technologies
  2. Entirely based on our event-log framework—enabled by big data technologies
  3. Bot traffic is filtered out—enabled by the rich metadata within the event logs


Algorithmic Changes:

  1. Sensitive and responsive to near real-time changes, but also more volatile
  2. Includes user activities across the entire community—inclusive of segregated areas
  3. Raw health factors are normalized to quantile scores that are meaningful, comparable, and therefore more actionable
  4. CHI score is now a linear function of a weighted average of the 6 quantile scores, computed via the generalized mean
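
To make the last point concrete, here is a minimal sketch of a weighted generalized mean in Python. The weights, exponent, and scaling below are purely illustrative assumptions, not the actual CHI parameters:

```python
def generalized_mean(scores, weights, p):
    """Weighted generalized (power) mean: (sum of w_i * x_i**p) ** (1/p).

    Weights are assumed to sum to 1. p=1 gives the ordinary weighted
    arithmetic mean; p<1 pulls the result toward the lower scores.
    """
    return sum(w * x ** p for w, x in zip(weights, scores)) ** (1.0 / p)

# Illustrative only: 6 equal weights and p=0.5 are assumptions,
# not the actual CHI parameters.
quantiles = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]  # traffic, content, members, ...
weights = [1.0 / 6] * 6

# A linear rescaling of the mean to a CHI-like range (scale factor assumed)
chi = 1000 * generalized_mean(quantiles, weights, p=0.5)
```

With an exponent below 1, the generalized mean is pulled toward the weaker quantile scores, so a single unhealthy factor drags the overall score down more than a plain arithmetic average would.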


“Drill down”-ability

One of the reasons why we invested so heavily in a highly scalable big data infrastructure is that we can now capture and store all of the intermediate data generated alongside the CHI processing pipeline. For example, 5 of the 6 raw health factors for the community are computed at a much finer granularity. They are subsequently aggregated up to the community level before normalization to their quantile scores. The raw health factors at these more granular levels are now captured, stored, indexed, and retrievable through the appropriate queries. The finest available granularity (i.e. the lowest level) for each raw health factor is as follows:

  1. traffic: single message level
  2. content: conversation thread level
  3. members: community level
  4. liveliness: board level
  5. interaction: conversation thread level
  6. responsiveness: conversation thread level
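
As a rough illustration of how these granular measurements aggregate up to the community level, here is a sketch in Python. The record fields and values are hypothetical, not the actual event-log schema:

```python
from collections import defaultdict

# Hypothetical thread-level records for one raw health factor
# (field names and values are made up for illustration).
thread_records = [
    {"board": "support", "thread": "t1", "responsiveness": 0.8},
    {"board": "support", "thread": "t2", "responsiveness": 0.2},
    {"board": "ideas",   "thread": "t3", "responsiveness": 0.9},
]

def roll_up(records, factor):
    """Aggregate a thread-level raw factor up to board and community level."""
    per_board = defaultdict(list)
    for r in records:
        per_board[r["board"]].append(r[factor])
    board_avg = {b: sum(v) / len(v) for b, v in per_board.items()}
    community_avg = sum(r[factor] for r in records) / len(records)
    return board_avg, community_avg

board_avg, community_avg = roll_up(thread_records, "responsiveness")
```

Because the intermediate levels (thread, board) are stored rather than discarded, a query can later retrieve them directly instead of only seeing the final community-level aggregate.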


The availability and retrievability of granular data for each raw health factor is critically important because it enables diagnostic drill down.


The CHI score is currently computed weekly for all communities. With the increased sensitivity we discussed in the previous post, the weekly variation in CHI allows you to easily spot when something happened in your community (whether it’s a spike or a dip). If you experience a dip one week, you can now examine the health factor quantile scores (since they are now normalized and comparable) to determine what the problem is. Is it low traffic, low responsiveness, or something else that’s driving down your CHI score? Let’s suppose the problem is low responsiveness. Now, with this drill down capability, you can go one step further to identify where in the community the problem occurred. You can drill down to identify which category, which board… all the way down to the exact thread that’s causing the problem.
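
The diagnostic workflow described above can be sketched in a few lines; all scores and board names below are hypothetical:

```python
# Hypothetical weekly quantile scores for the six health factors.
quantile_scores = {
    "traffic": 0.72, "content": 0.65, "members": 0.70,
    "liveliness": 0.68, "interaction": 0.61, "responsiveness": 0.31,
}

# Step 1: the scores are comparable, so the weakest factor is simply the minimum.
weakest_factor = min(quantile_scores, key=quantile_scores.get)

# Step 2: drill down into that factor's granular data to find *where*
# the problem is (board-level raw values are made up for illustration).
board_responsiveness = {"support": 0.25, "ideas": 0.85, "blogs": 0.70}
weakest_board = min(board_responsiveness, key=board_responsiveness.get)
# ...and the same lookup could continue down to the thread level.
```

The same two-step pattern (find the weakest factor, then descend one level of granularity at a time) repeats from category to board to thread.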


Aside from being a powerful diagnostic tool, this drill down capability also enables performance comparison at the category level and board level wherever granular data is available.


Daily CHI Score

Once we are able to compute CHI reliably from our event-log data, we can increase the frequency at which we compute it. Our data platform team is diligently monitoring the performance load on our big data infrastructure as we speak. When the processing pipeline is stable—from event generation all the way to the final CHI score—we can provision the necessary hardware resources to scale up the computing frequency. This is another benefit of the Hadoop-based infrastructure, which scales fairly linearly. The good news is that we are not hitting our hardware limit at the moment, so we can definitely scale up the computing frequency once the entire data pipeline is stable. This means CHI will be more responsive, and you can get an even earlier warning of potential problems within your community.


Adaptive and Evolving

The second reason why we invested in building CHI on modern, big data technology is the flexibility it offers.


Recall that we normalize the raw health factors (see Part 1 of this post for details). This is achieved by fitting the population distribution (i.e. the cross-community histogram) for each health factor, which gives us a set of complex formulae for converting the raw health factors into the corresponding quantile scores. Although each community’s health factors may fluctuate dramatically, these weekly (or daily) fluctuations are generally not correlated across communities. So at the population level (across all communities), the data varies rather slowly. Nevertheless, it does change, but over a much longer time frame. That means the formulae that transform the raw health factors into quantile scores must also change.
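
The actual pipeline fits the population distribution to obtain conversion formulae; as a simplified stand-in, this sketch converts a raw value to a quantile using the empirical CDF of a hypothetical cross-community population:

```python
from bisect import bisect_right

def fit_quantile_converter(population_values):
    """Build a raw-value-to-quantile converter from one health factor's
    cross-community population (empirical CDF; the real pipeline fits
    a parametric distribution instead)."""
    sorted_vals = sorted(population_values)
    n = len(sorted_vals)

    def to_quantile(raw):
        # Fraction of communities whose value is <= raw
        return bisect_right(sorted_vals, raw) / n

    return to_quantile

# Hypothetical raw traffic figures for 8 communities
to_q = fit_quantile_converter([120, 340, 90, 560, 800, 45, 210, 430])
q = to_q(430)  # -> 0.75: this community out-traffics 75% of the population
```

When the population drifts, re-fitting is just a matter of rebuilding the converter from a fresh snapshot of the cross-community data.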


This is where the flexibility of our big data technology comes in handy. The advantage of implementing the conversion formulae (from raw health factors to quantile scores) as user-defined functions (UDFs) on Hive is that they can be modified and swapped in and out easily with little impact on the other parts of the data processing pipeline. This flexibility means the health factors’ distribution can be re-fitted periodically as we collect more data to construct the population histogram. If there are any population level changes in the behavior of raw health factors, we can easily modify the conversion formulae to reflect those changes.
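
The real conversion formulae are implemented as Hive UDFs in Java; this Python sketch only illustrates the design idea of swappable converters, with a made-up placeholder formula:

```python
# Registry of conversion functions, keyed by health factor. Swapping in a
# re-fitted formula means re-registering under the same name; the rest of
# the pipeline keeps calling to_quantile() unchanged.
CONVERTERS = {}

def register(factor):
    def wrap(fn):
        CONVERTERS[factor] = fn
        return fn
    return wrap

@register("traffic")
def traffic_v1(raw):
    # Placeholder formula for illustration only; a periodic re-fit would
    # register a new function under the same "traffic" key.
    return min(raw / 1000.0, 1.0)

def to_quantile(factor, raw):
    return CONVERTERS[factor](raw)
```

The point is the indirection: the pipeline depends on the registry key, not on any particular formula, so a formula can be modified or replaced with little impact on the other stages.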


We are monitoring the population distributions of the health factors as we speak. We are trying to understand how they change as a population, and we will determine the rate at which they change. Once we have that information, we can instrument the periodic re-fitting and updating of the conversion formulae. This can even be automated. What we’ll get is a CHI score that evolves and adapts to the changing behavior of the consumers—it will always be accurate and never go out of date. That’s the power of adaptive algorithms.


Ease of Benchmarking

Since the raw health factors are normalized to quantile scores based on the entire population of communities, benchmarking community performance becomes much easier. First, each quantile score can already be viewed as a score benchmarked against all other communities. Moreover, the quantile scores are all on the same scale, so when a community is compared against its benchmark average, the comparison is more meaningful and easier to understand.


In addition, the availability of the quantile scores can also help us select a better benchmark set of communities for any particular community. Previously, we offered 2 types of benchmarking comparisons:

  1. date-based benchmark—how do I compare with other similar communities right now?
  2. age-based benchmark—how do I compare with other similar communities that were at the same stage of maturity as I am now?

Age-based benchmarking is more useful for younger communities (from just launched to ~2.5 years of age), as it offers a role model and a growth trajectory for these younger communities to follow. A new community (i.e. just launched) can see how other similar communities’ quantile scores (and CHI scores) changed as they grew and matured. On the other hand, date-based benchmarks offer a more competitive comparison for older communities that have reached maturity. That way, these mature communities can see how their quantile scores (and CHI scores) compare to their competition in the market right now.
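
Because the quantile scores share a common scale, comparing a community against its benchmark average reduces to simple differences. A sketch with hypothetical scores:

```python
# Hypothetical quantile scores: one community vs. the average of its
# benchmark set (date-based or age-based, selected elsewhere).
community = {"traffic": 0.70, "responsiveness": 0.40}
benchmark_avg = {"traffic": 0.55, "responsiveness": 0.60}

# Same 0-1 scale for every factor, so the gaps are directly comparable:
# positive means above benchmark, negative means below.
gaps = {f: community[f] - benchmark_avg[f] for f in community}
```

Here the community would read as ahead of its benchmark on traffic but behind on responsiveness, with both gaps expressed in the same units.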



As you can see, the investment towards modernizing our data infrastructure has enabled many new features that were previously not feasible. Although we don’t have all of the aforementioned features right now, all of the ingredients are there, ready to be prioritized and built. As a result, we should start to see some of these features coming soon.


In summary, some of the forthcoming features we can expect to see in the near future are:

  1. drill down capability
  2. daily CHI score
  3. quantile scores (and therefore CHI) that evolve and adapt based on consumer behavior
  4. benchmarking capabilities for quantile scores and the CHI score


Again, this wouldn’t be possible without the help of the various data teams involved. I’m super excited about what’s ahead, and I hope you are too. In the meantime, I’d like to hear your thoughts on these features. As usual, comments, kudos, discussions, and critiques are equally welcome. See you again soon.



Michael Wu, Ph.D. is Lithium’s Chief Scientist. His research includes: deriving insights from big data, understanding the behavioral economics of gamification, engaging + finding true social media influencers, developing predictive + actionable social analytics algorithms, social CRM, and using cyber anthropology + social network analysis to unravel the collective dynamics of communities + social networks.


Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics + its application to Social CRM. He's a blogger on Lithosphere, and you can follow him @mich8elwu or Google+.

About the Author
Dr. Michael Wu was the Chief Scientist at Lithium Technologies from 2008 until 2018, where he applied data-driven methodologies to investigate and understand the social web. Michael developed many predictive social analytics with actionable insights. His R&D work won him the recognition as a 2010 Influential Leader by CRM Magazine. His insights are made accessible through “The Science of Social,” and “The Science of Social 2”—two easy-reading e-books for business audience. Prior to industry, Michael received his Ph.D. from UC Berkeley’s Biophysics program, where he also received his triple major undergraduate degree in Applied Math, Physics, and Molecular & Cell Biology.