First, an apology for going radio silent for a month. Sorry. But I’m back.
It’s been a crazy October for me. Most of the month seems like a blur, thanks to a mixture of sleep deprivation and 80+ hour work weeks. Besides all the traveling for speaking events during the day, and the lost luggage along the way (which I’m happy to share with you later if you are interested), I’ve been working at night with our data platform team to implement the new Community Health Index (CHI, denoted by the Greek letter chi). Since we just made this feature available to all customers on October 31st, I thought it would be appropriate to discuss what’s new with CHI today.
For those who are not familiar with my earlier work, CHI was actually the first project I embarked on after joining Lithium ~6 years ago. CHI is an index (ranging from 0 to 1000) that scores the performance of a community on how well it serves its end users. I must reiterate that community health is not the same as community success.
These 2 concepts are related because meeting the needs of the consumers is probably a necessary condition for business success. However, a healthy community doesn’t automatically guarantee business success (i.e. it’s not a sufficient condition). This subtle distinction is what enabled the initial development of CHI that’s purely based on user behavior data.
If you take it at face value, the new CHI doesn’t seem very different from the old. It is still a score between 0 and 1000. It is still derived from the same 6 health factors as the old CHI: traffic, content, members, liveliness, interaction, responsiveness. However, if you dig deeper and look under the hood, you will start to see the significance of these changes.
First, we completely revamped the infrastructure for computing CHI. CHI used to be computed on our custom-built data warehouse. It relied on a complex sequence of ETL processing to extract the counter-based metrics out of our application database (E), transform them (T), and then load them into our data warehouse (L). As our customer base grew and their customers’ engagement levels increased, the data warehouse solution no longer offered the scalability and flexibility we need.
The new CHI is built on top of our new event-log framework that has little dependency on the counter-based metrics within our application database. That means CHI can be computed with little performance impact on the community. Consequently, computing CHI for some of our largest communities becomes feasible. The new CHI is also built on modern big data technology. Hadoop’s HDFS serves as our highly scalable distributed storage engine for all the raw event-logs emitted from all our communities. User-defined functions (UDFs) on Hive perform most of the aggregation and number crunching. The results are indexed in ElasticSearch and served through the Lithium social intelligence (LSI) app.
One of the benefits of using our event-log framework as the fundamental input data to compute CHI is that it’s easier to filter out the bot traffic that pollutes the traffic health factor of CHI. In this framework, user actions on our community platform are emitted and logged as events that contain rich metadata and contextual information about who took the action, and when, where, what, and how. For example, bot traffic is identified via the user agent string that is tracked with every page view action.
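To make the idea concrete, here is a minimal sketch of filtering bot traffic out of page view events by user agent string. The event fields (`action`, `user_agent`), the list of bot keywords, and the choice to treat a missing user agent as a bot are all my assumptions for illustration; this is not our actual event schema or production filter.

```python
import re

# Hypothetical keywords that often appear in crawler user agent strings.
BOT_PATTERN = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)


def is_bot(event):
    """Return True if the event's user agent looks like a bot.

    Assumption: an empty or missing user agent is also treated as a bot.
    """
    user_agent = event.get("user_agent", "")
    return not user_agent or bool(BOT_PATTERN.search(user_agent))


def human_page_views(events):
    """Keep only page view events that do not look like bot traffic."""
    return [
        e for e in events
        if e.get("action") == "page_view" and not is_bot(e)
    ]


# Made-up events, for illustration only.
events = [
    {"action": "page_view", "user_agent": "Mozilla/5.0 (Windows NT 6.1) Chrome/30.0"},
    {"action": "page_view", "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"},
    {"action": "post", "user_agent": "Mozilla/5.0 (Macintosh) Safari/537"},
    {"action": "page_view", "user_agent": ""},
]
print(len(human_page_views(events)))
```

Because the user agent travels with every event, a filter like this can run at aggregation time without touching the application database.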
Besides the modernization of our computing infrastructure, there are also significant algorithmic changes in the way we compute, model, and normalize the health factors, and how we combine them into the final CHI score.
First, we removed the smoothing step and the history dependency that were initially designed to make CHI robust against transient changes. CHI was originally designed to capture the long-term sustained health of the parts of the community that are intended for public participation. It ignored all changes that were not sustained for more than several weeks. Although this is an accurate reflection of the long-term health of the community, we also received feedback from customers who wanted a more sensitive CHI that reflects near real-time changes within the community. Removing the smoothing step and eliminating the history dependency greatly simplified the algorithm, but the tradeoff is that CHI will become more volatile. Although people didn’t like the volatility 5 years ago, they have grown more comfortable with it now. So it’s sensible to make CHI more sensitive and therefore also more responsive to any changes implemented by community managers.
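To illustrate the tradeoff, here is a small sketch that uses simple exponential smoothing as a stand-in for a smoothing step. This is an assumption made purely for illustration: it is not the smoothing algorithm the old CHI used, and the weekly scores and the `alpha` value are made up.

```python
def exponential_smoothing(values, alpha):
    """Simple exponential smoothing; a smaller alpha smooths more heavily."""
    smoothed = []
    for value in values:
        if not smoothed:
            # The first smoothed value is just the first observation.
            smoothed.append(float(value))
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical weekly scores with a one-week dip in week 4.
raw = [700, 700, 700, 400, 700, 700]
smoothed = exponential_smoothing(raw, alpha=0.2)
print([round(v) for v in smoothed])
```

The smoothed series barely registers the one-week dip, which is the robustness the old design aimed for. The raw series shows the dip immediately, which is the sensitivity (and the volatility) you get once smoothing is removed.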
The original CHI ignored all activities within “segregated areas” of the community that are not intended for public participation (e.g. private boards, announcement boards, archive boards, hidden boards, boards that require special permissions to post). The reason was that these segregated areas often lower the overall CHI score quite significantly. Although most of our communities are externally focused, we are also seeing more people making use of these segregated areas for special purposes (e.g. employee participation). Some of these areas are actually very vibrant. We would like to study the effect of including these segregated areas in the CHI calculation, because they are an integral part of the community. If the impact on the final CHI score isn’t significant, perhaps it will make sense to include them. In the future, we may revert to the previous approach and exclude the segregated areas from the computation of CHI altogether.
An important change to the CHI algorithm is that we now normalize the raw health factors by converting them to quantile scores. While the computation of the raw health factors didn’t change, we now model the distribution of each health factor at the population level (i.e. across all communities, all time). We then use the fitted cumulative distribution to convert the raw health factors into scores between 0% and 100%. This has 2 important implications:
Together, this means that the health factor quantile scores offer a simple way to determine what the problem is when your CHI score drops (or why your CHI score rises). This makes CHI much more actionable.
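The quantile normalization described above can be sketched as follows. I am using an empirical cumulative distribution here as a simple stand-in; the actual fitted distribution model is not shown in this post, and the population values below are made up.

```python
from bisect import bisect_right


def fit_empirical_cdf(population_values):
    """Return a function mapping a raw value to a quantile score in [0, 100].

    Assumption: an empirical CDF stands in for the fitted distribution.
    """
    ordered = sorted(population_values)
    n = len(ordered)
    if n == 0:
        raise ValueError("population_values must not be empty")

    def quantile_score(raw_value):
        # Percentage of population observations <= raw_value.
        return 100.0 * bisect_right(ordered, raw_value) / n

    return quantile_score


# Made-up raw values of one health factor across communities and time.
population = [3, 8, 15, 15, 22, 40, 75, 120, 300, 900]
score = fit_empirical_cdf(population)
print(score(22))    # 50.0
print(score(1000))  # 100.0
```

A raw value of 22 means little on its own, but a quantile score of 50% says this health factor sits at the median of the population.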
Finally, we also changed the way we combine the health factors into the final CHI score. Because we have eliminated the smoothing step and history dependency, this final combining step is actually much simpler than in the old CHI. Most of the heavy math is done when we model the population distribution of the raw health factors. While it’s not quite as simple as averaging the 6 quantile scores or computing a weighted average, it’s almost as simple, at least at the conceptual level.
We compute the generalized mean of the 6 normalized quantile scores. Then we apply a linear function to shift and scale the result to between 0 and 1000 to obtain the final CHI score.
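Here is a minimal sketch of that combining step, using the textbook power mean with equal weights. The exponent `p`, the equal weighting, and the `scale` and `shift` values of the linear function are assumptions for illustration; the actual values used in CHI are not given in this post.

```python
import math


def generalized_mean(scores, p):
    """Power mean of positive scores with exponent p.

    p = 1 is the arithmetic mean, p = -1 the harmonic mean,
    and p = 0 is handled as the limit case, the geometric mean.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if any(s <= 0 for s in scores):
        raise ValueError("scores must be positive")
    n = len(scores)
    if p == 0:
        return math.exp(sum(math.log(s) for s in scores) / n)
    return (sum(s ** p for s in scores) / n) ** (1.0 / p)


def chi_score(quantile_scores, p=-1.0, scale=1000.0, shift=0.0):
    """Combine quantile scores (as fractions in (0, 1]) into one score.

    Assumption: p, scale, and shift are illustrative, not the real values.
    """
    return scale * generalized_mean(quantile_scores, p) + shift


# Made-up quantile scores for the 6 health factors; one factor is weak.
scores = [0.9, 0.8, 0.85, 0.7, 0.75, 0.2]
print(round(chi_score(scores, p=1.0)))   # plain average
print(round(chi_score(scores, p=-1.0)))  # penalizes the weak factor more
```

With a negative exponent, one weak health factor pulls the combined score down more than a plain average would, so a community cannot hide a serious weakness behind five strong factors.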
You can think of the generalized mean as a non-deterministic, symmetric, weighted average. Now, you all know what a weighted average is, but what do non-deterministic and symmetric mean? Let me explain:
So what’s new about the new CHI? Despite the fact that the CHI score is still just a number, there are quite a lot of changes, both in infrastructure and in algorithm.
These are significant positive developments for us, and many of them are non-trivial infrastructural and algorithmic changes. This definitely makes me feel proud to be part of the talented data teams at Lithium. But there is more to come in the near future.
Stay tuned for more details on some forthcoming changes in the new CHI. In the meantime, I welcome any questions, comments, constructive criticism, kudos, or just candid conversations.
Michael Wu, Ph.D. is Lithium's Chief Scientist. His research includes: deriving insights from big data, understanding the behavioral economics of gamification, engaging + finding true social media influencers, developing predictive + actionable social analytics algorithms, social CRM, and using cyber anthropology + social network analysis to unravel the collective dynamics of communities + social networks.
Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics + its application to Social CRM. He's a blogger on Lithosphere, and you can follow him @mich8elwu or Google+.