An alert storm which only affected Linux boxes
Working a tech support job in the middle of the night is a great way to learn how to think on your feet. When it's late and nobody from the day shift is around, the company seems incredibly tiny. Aside from a skeleton crew in the data centers and a couple of support and networking folks, the place is empty. There are no VPs, middle managers, HR people or accounts receivable folks around, and most of sales is gone too.
One fine evening, it was business as usual in my support monkey job. The call volume had dropped off, we were staying ahead of our ticket queues, and life was good. It was about an hour until quitting time, at which point we would turn it over to third shift.
I'm not sure if someone said "it's too quiet", thus cursing us, but something happened. It was bad.
Dozens of alerts started popping open in our monitoring console. This was something we'd watch for those customers who paid for our highest level of monitoring. Most of them had it, and it looked like ALL of their servers had broken at once.
A closer look revealed that no, it wasn't every server. It wasn't even every location. The only alerts we had were all in just one data center. It was the newest and shiniest and also the biggest. What happened?
We called over there. It was on the other side of the country, far from HQ where support worked. They reported business as usual. Nothing seemed odd to them.
Okay, so we had eliminated a full power failure, tornado, earthquake, bomb or similar event. Something else was going on... but what?
Around this time, someone noticed that all of the alerts were coming from Linux boxes. Now this was weird. Just one location, *and* just one operating system? That made no sense. Naturally, the Windows techs started talking smack about how it was our turn to experience a "SQL Slammer" type event. Those of us on the Linux side were not pleased.
I wanted a better explanation, since this just didn't feel like a worm to me. It had happened far too quickly for that to make sense. So, I sshed into a random customer's machine which had that alert showing in our console. It took a long time to log in between entering my password and reaching the actual prompt. Okay, that's bad.
'w' told me more about what was going on. It was showing me logged in from an IP address instead of the usual nat-vlan-foo.bar name sourced from DNS. That plus the delay told me that DNS was probably hosed on this box. I tried to do the PTR query myself with dig right there, and it failed.
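For anyone who hasn't poked at reverse DNS before, here is a rough Python sketch of the two things I was checking: what name a PTR query actually asks for, and how long the lookup takes. The function names are made up for illustration, and socket.gethostbyaddr goes through the system resolver (so /etc/hosts and friends can answer), which is not quite the same thing as pointing dig at a name server.

```python
import ipaddress
import socket
import time


def ptr_name(ip):
    """Return the name a PTR query asks about for this address."""
    return ipaddress.ip_address(ip).reverse_pointer


def timed_reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return (hostname or None, seconds spent waiting on the lookup)."""
    start = time.monotonic()
    try:
        hostname = resolver(ip)[0]
    except OSError:  # socket.herror and socket.gaierror are both OSErrors
        hostname = None
    return hostname, time.monotonic() - start
```

A None result after a multi-second wait is roughly what a box with dead resolvers would hand back, and that wait is the same delay I sat through at the ssh login.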
Running that same query from my workstation was just fine. We were still publishing the PTR and our authoritative servers were fine. Whatever had broken was localized to that one data center, and then seemingly just the Linux boxes.
I looked back at the alert list and found that nearly every one of them was for SMTP. I think a few turned up for FTP. Nothing else (like ssh) looked dead to our monitoring host. This turned out to be significant.
After fighting with sendmail over the years, I had learned a few things about its behavior. One of its quirks is that it will try to reverse-resolve the address of anyone who connects to it. This is done in a blocking fashion, so you won't get anything on port 25 until it gets a response, an NXDOMAIN, or just times out. It's the timing out thing which is a problem. You don't get a banner until it gives up.
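To make that concrete, here's a toy model of the behavior in Python. This is not sendmail's code, just a sketch of "resolve first, greet later" with the resolver passed in as a plain function so a dead DNS server can be faked with a sleep. The host name toy.example and all of the function names are invented for the example.

```python
import socket
import threading
import time


def serve_once(listener, resolve_ptr):
    """Accept one connection, resolve the peer's PTR, then send a banner."""
    conn, (peer_ip, _port) = listener.accept()
    with conn:
        # Blocking call: nothing is written to the client until this returns.
        name = resolve_ptr(peer_ip)
        conn.sendall(f"220 toy.example ESMTP (hello {name})\r\n".encode())


def banner_delay(resolve_ptr):
    """Return (seconds until the banner arrived, banner text)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free local port
    listener.listen(1)
    thread = threading.Thread(target=serve_once, args=(listener, resolve_ptr))
    thread.start()
    start = time.monotonic()
    with socket.create_connection(listener.getsockname(), timeout=10) as client:
        banner = client.recv(1024).decode()
    elapsed = time.monotonic() - start
    thread.join()
    listener.close()
    return elapsed, banner


def healthy_dns(ip):
    """Stand-in for a resolver that answers right away."""
    return "client.example"


def dead_dns(ip):
    """Stand-in for a resolver that times out, then gives up."""
    time.sleep(1.0)
    return ip  # fall back to the bare address
```

Calling banner_delay(healthy_dns) comes back almost instantly, while banner_delay(dead_dns) makes the client wait out the full fake timeout before it sees a single byte.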
I changed this machine to use a different set of recursive DNS servers from some other place in the company. ssh went back to being zippy, sendmail started giving its SMTP banner quickly again, and the alert for this machine went away! Awesome.
It was now clear what had happened. Our monitoring service had a relatively short timeout on its SMTP option, so if it couldn't get a banner within a few seconds, it marked that service as dead. Something had kicked DNS out from under these machines and this was the result.
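I never saw the source for our monitoring checks, so take this as a guess at the general shape rather than what actually ran: connect, wait a few seconds for a 220 greeting, and call the service dead if it doesn't show up. The function name and the three-second default are assumptions for illustration.

```python
import socket


def check_smtp_banner(host, port=25, timeout=3.0):
    """Return "ok" if a 220 banner arrives within the timeout, else "dead"."""
    try:
        # The timeout covers the connect and then, separately, the read.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            banner = sock.recv(512)
    except OSError:  # timeouts and refused connections both land here
        return "dead"
    return "ok" if banner.startswith(b"220") else "dead"
```

A check like this can't tell "the mail server is down" apart from "the mail server is waiting on DNS", which is how one dead dependency turned into a screen full of SMTP alerts.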
A few test queries against that data center's caching name servers confirmed my suspicions. Both of them were toast. They weren't answering queries or even pings. Somehow, they had both melted down. That was our root cause -- everything else was just secondary noise.
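I used dig for those test queries, but the same probe is small enough to sketch by hand in Python: build a bare-bones DNS question, fire it at the server over UDP, and see whether anything comes back before a timeout. The function names, the example.com query name and the two-second timeout are all assumptions, and a real check would want retries and a look at the response code too.

```python
import socket
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (qtype 1 = A, 12 = PTR)."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)  # class IN
    return header + question


def resolver_answers(server, name="example.com", port=53, timeout=2.0):
    """Return True if the server sends back any response to our query."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _addr = sock.recvfrom(512)
        except OSError:  # timeout, unreachable, and so on
            return False
    # Same transaction id, with the QR (response) bit set.
    return len(reply) >= 12 and reply[:2] == query[:2] and bool(reply[2] & 0x80)
```

Run against each caching server every minute or so, even something this crude would have pointed at the real problem well before the SMTP alerts piled up.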
Some manager type finally started waking people up until they got some internal IT person to log in from home and smack things around. Once those services came back up, the alerts all vanished. The crisis was over.
There's a whole bunch of stuff which went wrong here. First, all of the machines out there were affected, but only the Linux boxes reacted in a way which tripped the monitoring. Whatever the Windows machines were doing, they didn't seem to go crazy when they lost DNS action. The Windows guys seemed happy that everything was okay for them when in reality they were just as hosed as our Linux customers. They just didn't know about it.
Second, the monitoring for our internal caching nameservers was apparently nonexistent. We only found out anything was wrong with them after I did my troubleshooting described above and then confirmed it by waking people up. They didn't have anything to keep tabs on their own machines.
Third, we treated our internal services like crap. They didn't get even the same level of service that we would provide to our customers. Not monitoring something that big is unforgivable, but that kind of stuff happened all the time.
How do you build a reliable customer support organization on top of a wobbly internal support architecture? That's easy. You build it on the backs of the people you have working support. They go to ridiculous lengths to protect the customer from the crap infrastructure and evil hacks buried in a company's DNA. They do it because they care about those customers.
This works... for a while. Then they either burn out and become useless or they just bail out and start writing snarky war stories about it.
Yeah, that's the ticket.