MySQL is the most popular open source database server that is used by some of the biggest web properties, including Lithium. It is also the most critical part of the infrastructure because it stores data - and data is what drives us.
So it is of paramount importance to keep tabs on the performance and health of the MySQL infrastructure. And that is where Datadog comes into picture, as a monitoring service that monitors metrics of key systems. These metrics can then be used to analyze, alert and report on the health and performance of the MySQL architecture, giving you immediate and actionable insights into MySQL’s performance. To summarize, monitoring the metrics allows us to achieve the following purposes:
Alert us immediately when a problem arises
Alert us on a problem that we may have in the future
Adding Comprehensive MySQL Metrics to Datadog
Using Datadog, we soon realized that it does not monitor all of the desired key metrics; Most of the InnoDB related metrics were missing. Cacti had the same problem until Baron Schwartz started the Better Cacti Templates project which later on became a part of Percona Monitoring Plugins. Having previously worked at Percona for several years, solving performance problems, I do truly appreciate how the Better Cacti Templates project makes life easier to troubleshoot and foresee problems.
So I embarked on porting the same set of metrics to Datadog and decided to give it back to the community so that Lithium and other Datadog users could monitor MySQL more comprehensively.
The details of my changes are available in the Pull Request on GitHub here:
Let me share a few screenshots that show the metrics as seen when utilized in Datadog dashboard.
As you can see the graphs in the dashboard show pretty nice details that can be used to analyze a trend or to analyze a problem at hand.
Where do I see the Query Response Times?
There is one key metric that I think is the single most important metric when it comes to MySQL performance : the “95th percentile query response time”.
Wikipedia has a detailed post on “percentile” that is a good read to understand the logic behind wanting to monitor “95th percentile query response time”.
“95th percentile query response time” shows the query response time for 95% of the queries executed against the MySQL server.
It seemed there would be no better metric to highlight immediate problems but also problems that might be building up slowly. After all, we want applications to be able to execute queries against MySQL server as fast as possible.
So I also wrote another Datadog custom metric that fetches information from the performance_schema database to calculate the 95th percentile query response times and send it over to Datadog.
I have filed a Pull Request on GitHub to have this integrated into the official Datadog checks so that the community can use it too. The details of my changes are available here on GitHub:
Note that this check requires MySQL versions greater than or equal to 5.6.5. The check exposes average query response time per schema as well, apart from reporting the 95th percentile query response time across all the queries. Exposing average query response time per schema gives you a good idea as to which schema is responsible for most of the load on the MySQL server. This is particularly important for multi-tenant database servers that host more than one application, each with its own schema. An example would be a Wordpress Multisite Blog application.
Here are a few screenshots that show the metrics as seen when utilized in Datadog dashboard.
The query response time metrics are collected using microseconds as the unit. This allows for fine-grained resolution. Once enough data has been collected for the 95th percentile query response time metric, a threshold can be setup to alert you when query performance becomes unacceptable.
These contributions are now out there in community space so they can be used by the community at large. I hope you enjoy them, and look forward to your comments!