Khoros Communities moves to The Cloud
Photo by Alex Machado on Unsplash
Nearly two years ago, the Khoros Communities Engineering organization began to move from colocated data centers to AWS cloud services. It's been a long process, but we're nearly done. Technical Manager Aditya Pandurangi (AdityaP) was kind enough to discuss the project with me.
Q: Can you summarize what cloud services are and how Khoros Communities uses them?
A: A cloud service, at a basic level, is a collection of many data centers that are spread across multiple locations and owned by a cloud service provider such as Amazon Web Services (AWS) or Microsoft Azure. The provider offers companies like Khoros a cloud-based platform, infrastructure, and storage services. In our case, that means the infrastructure to host and manage Khoros Communities. The cloud service provides a user interface or a set of APIs to request and manage specific hardware with the software of our choosing. The cloud service provider handles the hardware and maintenance, and we're in control of how we allocate resources.
Before cloud services, companies had to have their own hardware located in a data center (either on-premises or in a colocated data center shared with other companies). This meant that companies managed the acquisition and maintenance of their own hardware in addition to managing network traffic and other operational tasks. Handling hardware failures, replacing hardware, and physically moving the hardware were all in the company's purview. Khoros Communities, for example, had our own servers and equipment running in two data centers: one in Europe (Amsterdam) and one on the US West Coast (San Jose, California).
In the cloud, we no longer worry about sourcing hardware, moving it physically, dealing with failures and outages, sourcing data center facilities, or paying data center storage rates. We don’t need to have Khoros employees add, remove, troubleshoot, and replace physical machines when we need more resources or if something breaks. In The Cloud™, everything is done for you.
Q: Any downsides?
A: There will always be sporadic hardware failures and restarts in the cloud that might temporarily make our services unavailable. That said, Khoros has built-in redundancy to handle failures as gracefully as possible. Also, the costs of using a cloud service over the long term are probably a bit higher because we don’t own the hardware and can’t amortize the cost over the duration of ownership.
Despite those points, moving to the cloud is very much a net benefit. It’s allowed us to do cool, new things that we couldn't before, as well as offer our customers a better experience.
Q: What does this migration mean for Khoros Communities customers?
A: This migration enables us to provide flexibility to our customers in ways we couldn't before. We can now scale our resources to the needs of any customer. For example, some customers host large events where they see a 2-3x increase in traffic on already busy communities. We're talking lots of views and heavy demands on performance. If a game company has a huge game release taking place or if a software company is holding a major event, we can make sure we have enough hardware ready and on standby as needed.
Behind the scenes, our Engineering organization has much more flexibility, which benefits the customer. Unlike in the data center, we’re able to adjust our resources on the fly. More app servers needed? Give TechOps a couple of minutes to an hour -- done! Our database is getting overloaded and needs to be doubled in size temporarily? Once again, shine the TechOps signal -- done! We’ve been able to handle outages like never before, and we’ve been able to support many more customers and resources.
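As a rough illustration of the sizing arithmetic behind a 2-3x event spike, here is a minimal Python sketch. The function name, the per-server throughput, the baseline traffic, and the 20% headroom are all hypothetical values chosen for the example; they are not Khoros figures or tooling.

```python
import math


def servers_needed(baseline_rps: float, traffic_multiplier: float,
                   rps_per_server: float, headroom: float = 0.2) -> int:
    """Estimate how many app servers cover a traffic spike with spare headroom.

    All inputs are illustrative assumptions, not measured Khoros values.
    """
    if rps_per_server <= 0:
        raise ValueError("rps_per_server must be positive")
    peak_rps = baseline_rps * traffic_multiplier
    # Round up: a fractional server still means provisioning a whole one.
    return math.ceil(peak_rps * (1 + headroom) / rps_per_server)


# Hypothetical community: 400 requests/second on a normal day,
# with each app server comfortably handling about 100 requests/second.
normal_day = servers_needed(400, 1, 100)
launch_day = servers_needed(400, 3, 100)
extra_servers = launch_day - normal_day
print(f"normal: {normal_day}, 3x event: {launch_day}, extra to request: {extra_servers}")
```

In a data center, those extra servers would have to be bought and racked ahead of time. In the cloud, they can be requested for the event and released afterward.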
Q: What does our AWS infrastructure look like?
A: This is a high-level diagram.
Q: What were some of the challenges you faced?
A: The new environment presented a few challenges. AWS uses a paradigm of separate, siloed regions for different environments (such as QA vs. Production). This was a change from our datacenter infrastructure that used region-agnostic services. In the cloud, each region requires its own service deployment. This created extra work for teams that migrated their services to the cloud, but it was worth the trouble. Our infrastructure is more resilient -- there isn’t a single shared point of failure. An issue in one region does not affect the others.
The migration required several teams within the Community Engineering organization to improve services and update workflows. We broke large, multi-purpose services into smaller parts with a narrower focus. While that led to improved security, technology updates, and better performance, we had to shake off old habits and adjust to multi-step processes.
Q: How big was this project, and how long has it been in progress?
A: After a couple of false starts, we created the first JIRA ticket for this project on May 15th, 2018, so I think we can consider that the beginning of the project. This migration has been a massive undertaking that’s taken the effort of many teams.
A multi-year project takes endurance. For fun, we hung a child's growth chart (like something you'd use to track height over time) in the San Francisco office. Each week, we'd update the chart. This worked great until COVID, when we shut down the office to shelter in place.
We got a chance to go back to the office in April to pick up personal items. The chart was still there and we updated our progress. (We got a little lazy with that chunk in the middle where we just drew a simple line.)
Later on, Jake Rozin (JakeRo), an engineer in our Customer Operations group, built a UI for us to track the status of community migration.
Here's a simplified, sanitized version of the page as we were just finishing moving communities out of the Amsterdam data center.
Q: What kind of coordination did the migration require? How many different teams had to work together to make this happen?
A: This effort has taken the coordination of many different groups. The primary teams involved have been Technical Operations (TechOps), Application Operations (AppOps), Information Security (InfoSec), and Rocket. Each of these teams has played a vital daily role in the AWS migration process.
In addition, all the teams responsible for individual microservices were involved. These teams have had to convert their services to use our new service deployment pattern in AWS, including Dockerizing their service and coordinating with the groups above to migrate and set up infrastructure.
Q: What's the Rocket team?
A: Rocket is an Engineering team in the Community organization. Essentially the team acts as the glue between TechOps and Engineering.
While TechOps deals with networking and setting up/monitoring our infrastructure, the Rocket team works one layer above. We design and implement deployment patterns and pipelines used to integrate Khoros Communities services with our AWS infrastructure. Engineering teams across the Community organization consult us about using our deployment patterns and the best ways to support different service types.
Rocket also builds internal tools. For example, we built one service to determine whether we can scale a community to another node and another to find the correct service to perform cloud-based operations (such as adding/removing targets from a load balancer or creating a CDN distribution).
Other projects include:
- improving our security practices and leveraging new cloud-based security options, such as encrypted parameter stores and policy-based access controls (in conjunction with TechOps and InfoSec)
- updating our current provisioning and de-provisioning processes to work with the new cloud infrastructure and paradigm (in conjunction with TechOps)
Q: Did the Rocket team have to build services or tools for the migration process?
A: We did! Unfortunately, we are in the process of patenting them. Once we get those filed, we'll write another post with the details 😁
Q: What have been the major milestones for the project?
A: We’ve had several major milestones throughout the AWS migration.
- Proof of concept: Our first milestone in 2018 was building and testing the AWS environment and developing a migration plan. We started with a basic cutover process. This initial proof of concept proved that Khoros could effectively host communities in AWS. The plan slowly evolved to more detailed cutover steps and enabled us to plan for the complete migration. Atlas (Khoros's community) was among the first communities hosted in the cloud.
- Iteration and automation: Process iteration led to more detailed cutover steps as well as automation services. That brought us to our next major milestone, where we could pass customer migration tasks to our AppOps team and free up Rocket team resources.
- EMEA community migration and first data center shutdown: Between the end of October and mid-November 2019, we moved every EMEA customer out of the Amsterdam colocation and into AWS. From there, TechOps completely shut down the Amsterdam data center.
- AMER community migration: As of Apr 14th, we reached our next milestone. All our AMER communities are in AWS.
All that remains are the final milestones -- completing service migration and shutting down the US datacenter in San Jose. Once this happens at the end of June, our AWS project will finally be complete!
Q: Can you share any charts or metrics showing improved performance for customers due to the migration?
A: Sure. The following charts show the drop in beacon times before and after the AWS migration. Beacon time is our way of measuring the time it takes a human (we filter out bots) to load a page on a community. Lower beacon times signal that the page being viewed loaded faster. (The numbers on the Y-axis are time in milliseconds.)
Two of the customers featured in these charts have heavily trafficked communities. I picked the third at random. You can see that they all showed nice improvements post-migration.
Customer 1
Customer 2
Customer 3
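To make the beacon-time metric concrete, here is a minimal Python sketch of the idea described above: drop bot traffic, then summarize human page-load times before and after a migration. The record shape, the user-agent bot heuristic, the choice of the median as the summary, and every number in the sample data are assumptions made up for illustration. They are not the actual Khoros beacon pipeline or customer results.

```python
from statistics import median

# Assumed heuristic: treat a user agent containing any of these markers as a bot.
BOT_MARKERS = ("bot", "crawler", "spider")


def is_bot(user_agent: str) -> bool:
    """Return True if the user agent looks like an automated client."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def median_beacon_ms(beacons):
    """Median page-load time in milliseconds across human page views only.

    `beacons` is a list of (user_agent, page_load_ms) tuples (hypothetical shape).
    """
    human_times = [ms for ua, ms in beacons if not is_bot(ua)]
    if not human_times:
        raise ValueError("no human page loads in sample")
    return median(human_times)


# Made-up sample data, not real measurements.
pre_migration = [
    ("Mozilla/5.0 (Windows NT 10.0)", 2400),
    ("Googlebot/2.1", 300),
    ("Mozilla/5.0 (Macintosh)", 2000),
    ("Mozilla/5.0 (X11; Linux)", 2200),
]
post_migration = [
    ("Mozilla/5.0 (Windows NT 10.0)", 1500),
    ("Mozilla/5.0 (Macintosh)", 1300),
    ("bingbot/2.0", 250),
    ("Mozilla/5.0 (X11; Linux)", 1400),
]

before = median_beacon_ms(pre_migration)
after = median_beacon_ms(post_migration)
improvement_pct = round(100 * (before - after) / before, 1)
print(f"median beacon time: {before} ms -> {after} ms ({improvement_pct}% faster)")
```

Filtering bots first matters because crawlers often fetch pages very quickly and without rendering them, which would pull the summary down and hide what human visitors experience.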
Q: Nice! What is it about AWS that lowers beacon times?
A: There are a few factors. Our servers in the datacenter were starting to age, and the AWS servers are newer, have better CPUs, and thus perform better. In addition, we started routing all traffic through CloudFront as a CDN, which serves as a caching layer. I imagine that AWS also optimizes the route by which CloudFront reaches the load balancers for speed.
Thanks so much, Aditya!
We appreciate your insight and all the details. We'll end here, but before we do, let's give a shout out to all the folks who made the AWS migration possible:
Rocket: JonL, eddielo, AdityaP, LauraPe, KaranS, hernan_vinuesa
TechOps: BillKr, CanC, ChrisSa, DanielA, DavidSu, EricV, GeorgeB, GokulN, KunjalS, MarcS, MarkJ, MattW, MichaelCa, TauqeerA, WillY
Information Security: BryanM, JuanCo, ManjunathM, MinhN, MohanaC, PeterN, SoumyaR, SuyashM
Application Operations: mso, ArunkumarG, DimitarI, Georgi Todorov, MichaelM, NicholasD, RonT, WeiS