Writing

Software, technology, sysadmin war stories, and more.
Sunday, January 6, 2013

Behind the scenes of several monitoring systems

There are good monitoring systems, and then there are horrible monitoring systems. I'll describe a few different systems I've encountered to give some idea of the variance you might encounter out there.

...

My first example system operates by having a series of redundant poller nodes for each collection of servers. There's a big provisioning system which keeps track of all servers and their states: still being built, online, under repair, cancelled, and so on. It also keeps a mapping of servers to their IP addresses as a matter of tracking company resources.

This system sends out "provisioning messages" to the pollers whenever something changes, like a new server coming online, or when a change is applied to the monitoring. As for the monitoring, it can be a simple ping or TCP connection to a couple of well-known ports. It can also be a few other things, like looking for a properly-formatted banner/protocol string on several other well-known ports.
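To make those check types a bit more concrete, here is a minimal sketch of a TCP connection check and a banner check. The function names, the three-second timeout, and the 256-byte read are my own assumptions for illustration; the real pollers surely did more than this.

```python
import socket


def tcp_check(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def banner_check(host, port, expected_prefix, timeout=3.0):
    """Return True if the service greets us with the expected bytes.

    expected_prefix is a bytes value such as b"220 " for SMTP or
    b"SSH-" for SSH (hypothetical choices for this sketch).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            banner = sock.recv(256)
    except OSError:
        return False
    return banner.startswith(expected_prefix)
```

A port that accepts connections but returns garbage passes the first check and fails the second, which is presumably why both kinds existed.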

When something goes down, the pollers throw a message back to the provisioning system, and it opens an "alert". This alert then appears on a list which can be sorted or filtered by type, location, operating system, customer type, or more. Alerts can then be assigned to individuals, attached to open tickets on that account, or used to create a brand new ticket (the alert attachment occurs automatically).

Technicians working on an alert can look in the ticket history for that account or server to see if it's a "repeat offender". They can also see if someone else is working on the machine because there might be another ticket open and assigned to a coworker. At that point, the alert and/or ticket is usually handed to the other person so as to not disrupt their work.

Once the problem has been resolved, the pollers tell the provisioning service that the service is now accessible, and the alert automatically closes. This makes it disappear from the alert list page.

This system will ignore alerts which are generated while a server is not in the usual "online" status. This way, when you take a machine off the rack to upgrade the memory or add another hard drive, you don't immediately get paged that it's down. That would just be silly.
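The suppression rule boils down to one condition. Here is a rough sketch, assuming the provisioning states listed earlier map onto a simple enum; the names `ServerState` and `should_open_alert` are hypothetical.

```python
from enum import Enum


class ServerState(Enum):
    BUILDING = "still being built"
    ONLINE = "online"
    UNDER_REPAIR = "under repair"
    CANCELLED = "cancelled"


def should_open_alert(state, check_passed):
    """Open an alert only for a failed check on an online server."""
    return state is ServerState.ONLINE and not check_passed
```

With this, a failed ping on a server marked `UNDER_REPAIR` opens nothing, while the same failure on an `ONLINE` server does.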

...

My second example system has instances spread to several distant locations, and all of them constantly poll a debug endpoint on a web server to get a list of variables. These pollers then parse out values from this list and use them to generate things like rate-of-change. The values are then run through a gigantic list of rules in a custom language which was created specifically for this monitoring system. If any of the values are now out of spec, alerts are generated.
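As a rough illustration of that pipeline, the sketch below derives a rate of change from two samples of a counter and then runs the result through a set of rules. The real system used its own custom rule language; plain Python predicates stand in for it here, and every name and threshold is an assumption.

```python
def rate_of_change(prev_value, prev_time, cur_value, cur_time):
    """Per-second rate between two samples of the same variable."""
    elapsed = cur_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (cur_value - prev_value) / elapsed


def evaluate_rules(values, rules):
    """Return the names of all rules whose predicate fires.

    values: dict of variable name -> current value
    rules:  dict of alert name -> function(values) -> bool
    """
    return [name for name, predicate in rules.items() if predicate(values)]


# Hypothetical rules; variable names and limits are made up.
RULES = {
    "error-rate-high": lambda v: v["errors_per_sec"] > 0.5,
    "queue-too-deep": lambda v: v["queue_depth"] > 1000,
}
```

For example, an error counter that goes from 100 to 160 over 60 seconds yields a rate of 1.0 per second, which would trip the first rule above.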

Since multiple aspects of these systems are being monitored, several alerts may fire at once for the same root cause. Each alert generates a separate paging event. The pager requires you to wait for the page to fully arrive and then tap a few keys to send an acknowledgement to the paging system. If you fail to acknowledge the page within a small number of minutes (typically 5 to 10), it will page someone else, perhaps a coworker, or maybe your boss. It will keep doing this until it runs out of escalation points.
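The escalation behavior can be sketched as a loop over an ordered list of people. This is only an approximation: `page` is a hypothetical callable that sends the page and reports whether it was acknowledged within the timeout, and the 300-second default is just the low end of the range mentioned above.

```python
def escalate(chain, page, ack_timeout_s=300):
    """Page each person in order until someone acknowledges.

    chain: ordered list of escalation points, primary first
    page:  callable(person, ack_timeout_s) -> True if acknowledged
    Returns whoever acknowledged, or None if the chain ran out.
    """
    for person in chain:
        if page(person, ack_timeout_s):
            return person
    return None
```

Note that this runs once per alert, so five alerts from one root cause means five separate trips through the chain.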

If your pager goes out of range for whatever reason, you will not get the page, and it will "roll over" to your backup. Their duty is to get you on the phone, not to solve the problem. If they can't get you on the phone to deal with the problem, they'll call your boss who will then try to sort it out directly, possibly by calling still more team members to have a look at things.

There is no way to see which of these services are currently being twiddled by a coworker. There is no "active" view for a given system. There is no "under repair" equivalent for squelching alerts in a small area.

You can "silence" alerts, but you have to choose between potentially silencing too many things by overusing wildcards and not silencing enough things because you don't know what the alert names will be. If you have sufficient permissions to the alerting system, it is possible to silence every alert for every other user of the system for days, weeks, or even months.
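The wildcard tradeoff is easy to demonstrate with shell-style patterns. I'm assuming the silences behave roughly like glob matching on alert names, and the alert names here are invented for the example.

```python
from fnmatch import fnmatchcase


def is_silenced(alert_name, silence_patterns):
    """True if any shell-style pattern matches the alert name."""
    return any(fnmatchcase(alert_name, p) for p in silence_patterns)


# Hypothetical alert names.
ALERTS = ["web-frontend-latency", "web-frontend-errors", "db-replication-lag"]

# Too narrow: only the one alert name you happened to know about.
narrow = [a for a in ALERTS if is_silenced(a, ["web-frontend-latency"])]

# Too broad: a bare "*" silences everything, for everyone.
broad = [a for a in ALERTS if is_silenced(a, ["*"])]
```

Here `narrow` holds one alert and `broad` holds all three. A pattern like `web-*` sits in between, but only helps if you can predict the naming ahead of time.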

Your coworkers can and will make changes without silencing anything, and those changes will cause you to be paged. You will then have to poke them to see what's going on before you can safely make changes.

...

My third example system has a bunch of interesting requirements. First, all of the host names must be unique. That is, the short name of a host (like the "www" in "www.example.com") must be globally unique across all monitored systems. Right away, this means you can only have *one* machine called www.

If you happen to control the naming of your systems, this might work out. However, if you have thousands of servers leased by hundreds of customers, each with their own naming schemes, this may not work out. At this point, you may start forcing a ridiculous naming scheme involving the internal "computer number" onto customers. Hostnames like "www1" and "www2" will now be "47298-www1" and "47299-www2". Any customer caught changing their computer's host name may be denied access to monitoring services.
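To see why the uniqueness rule bites, here is a small sketch that finds colliding short names and applies the computer-number prefix. The function names are hypothetical, and the second domain in the usage note is made up.

```python
from collections import Counter


def short_name(fqdn):
    """The first label of a host name, e.g. 'www' for 'www.example.com'."""
    return fqdn.split(".", 1)[0]


def colliding_short_names(fqdns):
    """Short names that appear more than once across all hosts."""
    counts = Counter(short_name(name) for name in fqdns)
    return sorted(name for name, count in counts.items() if count > 1)


def prefixed_name(computer_number, hostname):
    """The forced scheme: internal computer number, dash, host name."""
    return f"{computer_number}-{hostname}"
```

Two customers who each run a `www` (say `www.example.com` and `www.example.org`) collide immediately, and `prefixed_name(47298, "www1")` gives the `47298-www1` form.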

This monitoring system will not scale to handle a typical location with many racks full of computers. Instead of handling 10,000 different IP addresses with varied services on each from a single poller, it might handle 10% of that. This means you now need at least 10 pollers in order to monitor the same data center which previously only used two.

Notice, however, that the two-poller system #1 was actually a redundant setup. The ten-poller system #3 isn't redundant at all. It would require 10 more pollers to reach the same "n=2" redundancy. Suddenly, maintaining the pollers themselves becomes a nontrivial amount of work.
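The poller arithmetic works out as follows. This sketch takes the figures above at face value: 10,000 addresses per data center, a system #1 poller that can cover all of them, and a system #3 poller that covers about 10% of that.

```python
def pollers_needed(hosts, hosts_per_poller, redundancy):
    """Pollers required to cover every host `redundancy` times over."""
    # Integer ceiling division: a partly-used poller still counts as one.
    per_copy = (hosts + hosts_per_poller - 1) // hosts_per_poller
    return per_copy * redundancy


HOSTS = 10_000

system1 = pollers_needed(HOSTS, 10_000, redundancy=2)            # 2
system3_bare = pollers_needed(HOSTS, 1_000, redundancy=1)        # 10
system3_redundant = pollers_needed(HOSTS, 1_000, redundancy=2)   # 20
```

So matching the old redundancy means going from 2 machines to 20 for the same data center.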

This system isn't connected to the existing ticketing or provisioning systems. If a server is taken out of "online" status, alerts will still be generated. Likewise, when a server is first built and put online, no monitoring will exist until someone manually creates it through the interface.

This lack of a connection also means that alerts will not show up in the existing alert console. Instead, they will only appear on the monitoring service's web page. Each shift will now have to staff at least one "monitoring tech" who will do nothing but watch the monitoring results page and will then open new tickets and paste in the data.

As a final insult, the web pages for viewing and controlling the monitoring system will use ActiveX controls. This means they will only work in Internet Explorer, and that by extension means the monitoring tech has to run Windows. Every other tech on the floor runs Linux.

This also means that when the monitoring tech leaves for dinner or takes a night off, the regular techs may be unable to control or even view the outboard monitoring service. Those alerts may go completely unnoticed for an entire shift.

...

Finally, there's the null monitoring system. In this scenario, the admins of a service are alerted to a failure by receiving mails and/or phone calls from their users. The admins never stand up any other monitoring, and don't care to find out about problems before their users do.

Their monitoring service is their users. Chances are, the users have a corporate requirement to use this infrastructure service, and have no choice in the matter. Relying on them to yell means you don't have to worry about setting up or maintaining those complicated probers and alert generators.

These users can get angry, but they can't leave you because there is no alternative provider for your service.

...

Do any of these systems resemble the ones in your life?