Introducing Project X
It’s 2 a.m. and you’re the on-call guy this weekend. You’ve just drifted off to sleep after doing an emergency reboot of a server that locked up for no good reason (is there really ANY good reason for a server to lock up?). Your phone buzzes and plays whatever obnoxious ringtone you’ve chosen to ensure you wake up. A different server has a full /tmp partition. You drag yourself out of bed, grab another Mountain Dew, and plop back down in front of your laptop. You sign into the VPN, log into the server, and remove some old temp files. Five minutes to get logged in, 30 seconds to resolve the issue. Back to bed, until the next page goes off. Sound familiar? If you’ve ever done on-call duty, I’m sure it does.
And that was the basis for the grand, new vision we’ve had at Hurricane Labs, which we dubbed (for absolutely no reason other than it sounds cool) “Project X”.
What happens is: when something routine alerts in our monitoring system, such as a disk space alert or a stopped process, the monitoring system will go through certain steps to try to remediate the problem itself. Sound exciting? Sound thrilling? Sound crazy kinds of hard? Yeah, it’s all of those things.
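To make the idea concrete, here is a minimal sketch of what one remediation step could look like for the full /tmp example above. It is not our actual implementation: the function names, the seven-day age threshold, and the rule of acting only on a confirmed (HARD) CRITICAL alert are all assumptions made for illustration. The state and state-type strings follow the values Nagios-style cores hand to event handlers.

```python
import os
import time


def should_remediate(state, state_type):
    """Act only on confirmed (HARD) CRITICAL alerts, not on first soft failures."""
    return state == "CRITICAL" and state_type == "HARD"


def clean_old_files(directory, max_age_seconds, now=None):
    """Delete regular files directly inside `directory` older than the threshold.

    Returns the sorted list of file names that were removed.
    """
    now = time.time() if now is None else now
    # Snapshot the listing first so we never delete while still iterating.
    with os.scandir(directory) as it:
        entries = list(it)
    removed = []
    for entry in entries:
        # Skip directories and symlinks; this sketch only touches plain files.
        if not entry.is_file(follow_symlinks=False):
            continue
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age > max_age_seconds:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)


def handle_alert(state, state_type, directory, max_age_seconds=7 * 24 * 3600):
    """Hypothetical entry point that a monitoring event handler could call."""
    if not should_remediate(state, state_type):
        return []
    return clean_old_files(directory, max_age_seconds)
```

A real handler would also need logging, a limit on how often it retries, and a fallback that pages a human when the cleanup does not clear the alert.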
This is the first blog post that dives into the kinds of problems we’re tackling in the process of getting this implemented. In the end, we should have a monitoring system that replaces our on-call engineers (okay, not really, but it sure sounds good).
To start, there’s always room for improvement
To let a monitoring system fix problems on its own, you have to have a phenomenally accurate monitoring system to start with. While we’ve been pretty happy with our monitoring system for a while (based on Icinga 1.x and mod_gearman, to allow us to monitor 800+ hosts and 13,000+ services), we decided there was room for improvement. Some of the performance wasn’t as good as we thought it could be.
Phase 1: Building a monitoring system with sub-second latency
The first phase of Project X was to build a monitoring system with sub-second latency. That is, every single monitoring check it executes is executed within a second of when it was scheduled. Our existing system fluctuated, but was nowhere close to this on average. So, we started seeking a solution.
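For clarity, “latency” here means the gap between when a check was scheduled and when it actually started. The sketch below shows that arithmetic on a list of (scheduled, started) timestamps in seconds. The function names and the sample numbers are made up for illustration and are not pulled from our monitoring data.

```python
def check_latency(scheduled_at, started_at):
    """Seconds between the scheduled time and the actual start (never negative)."""
    return max(0.0, started_at - scheduled_at)


def summarize_latency(checks):
    """Summarize latency for an iterable of (scheduled_at, started_at) pairs."""
    latencies = [check_latency(s, a) for s, a in checks]
    if not latencies:
        raise ValueError("need at least one check to summarize")
    return {
        "average": sum(latencies) / len(latencies),
        "worst": max(latencies),
        # The Phase 1 goal: every check starts within a second of its schedule.
        "all_sub_second": all(lat < 1.0 for lat in latencies),
    }


if __name__ == "__main__":
    sample = [(100.0, 100.25), (160.0, 160.5), (220.0, 220.0)]
    print(summarize_latency(sample))
```

Note that the goal is stricter than a good average: a single check that starts a minute late fails it, even if the average still looks fine.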
It turns out that mod_gearman requires very specific versions of the libraries it uses, and strange things can happen when those versions don’t match – such as unjustifiably high check latency (our workers were not overloaded, and adding more did not help). To get a known-good combination, we tried out a bundle of monitoring software called OMD (short for Open Monitoring Distribution, though it’s not really a Linux distro of its own), which bundles several different monitoring cores (Icinga, Nagios, and Shinken) as well as several add-ons, including mod_gearman.
We set it up in testing, with just one worker node, and were AMAZED to see the latency drop from over a minute to less than 10 seconds. Once we had everything configured to match production, latency dropped to an average of less than half a second (0.233 seconds, at this exact moment, with 1,027 hosts and 14,413 services).
More on Open Monitoring Distribution (OMD)
OMD is a phenomenal tool. We are normally fans of using packages for the tools we use. OMD is a package, but in a different way. Instead of relying on other packages for its dependencies, the OMD maintainers build many of those dependencies into the package itself – specifically the ones with specific version requirements. Things like mod_gearman, gearmand, Icinga, Nagios, and Shinken are all bundled into the package, so you know you’re getting the right versions. OMD then lets you create “sites”, which are self-contained instances of the monitoring software – you can have several of these running on a system at a given time. The sites can also run different versions of the OMD package. So, you can have a production site and then build a staging site when a new OMD version comes out to verify that everything works.
The process for migrating to OMD is very different from the one for starting from scratch. There are plenty of tutorials and guides out there for starting from scratch. Migrating existing configs is actually quite easy.
First, you’ll want to install OMD. You can follow the OMD project’s installation instructions for setting up the repo on your preferred Linux distro. Once that’s done, you’ll need to create a site with omd create <site_name>.
This will create a self-contained directory in /opt/omd/sites that contains the configurations and symlinks for the site. Inside is an etc directory, with directories for each of the components you’ll want to configure. You can su - <site_name> to access the components – they all run as this user to keep things nice and isolated. Once you’ve done that, you can run omd config.
This will provide you with a nice, interactive configuration screen. You can choose which core to use (Icinga, Nagios, or Shinken) and which web GUI you’d like, enable or disable various add-ons, and so on. This is all covered by other tutorials, but for migration purposes you should configure it as close to your existing setup as possible. In our case, we enabled the components of mod_gearman and chose Icinga as our core. We put a copy of our production configs into /etc/icinga, as some of the paths we used referenced this location. We then added a file into /opt/omd/sites/<site_name>/etc/icinga/icinga.d that included all of our relevant configurations from /etc/icinga – this allowed us to keep the configs up-to-date during the testing phase without needing to do anything manually: we just did a “git pull” in /etc/icinga and reloaded Icinga.
Because OMD just runs the same Icinga core we were already using, our existing configs carried over, and it easily satisfied the requirements of Phase 1 of our project.
Next Phase: Evaluating Icinga alternatives
The next phase of the project is to evaluate alternatives to Icinga, the core of our monitoring system, to ensure that we are using the best tool for the next 5 years (it has been about 5 years since we switched to Icinga, and we used Nagios for the 6 before that). Initial observations say Icinga is still the right choice, although we are evaluating another core provided by OMD called Shinken.
The next installment in this series will be an evaluation of Shinken, and will announce our decision between the two. Stay tuned.