Tuesday, November 15, 2011

Averting DNS disaster on hundreds of boxes

One night during my stint as a phone and ticket support person, we had something really bad come down the pipe via Red Hat's update mechanism. Even though a bunch of us were "over it" and were looking for ways out, we still went that extra mile to keep our customers from having outages.

It started about 10:30 on a Thursday night. A friend happened to be checking his mail from home and noticed something. As he put it, "we may have a problem". I looked at the same mail and agreed. The mail in question was from one of the techs who was working that night.

This tech had encountered a strange problem and had decided to mail an internal list to announce his solution. Basically, he found out that something had changed on a customer's machine and the BIND DNS server (named) would not restart. He proudly announced the fix was to "change ROOTDIR= in the config" and went on with life.
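
For context: on the Red Hat systems of that era, the bind-chroot package pointed named at a chroot directory through a ROOTDIR= line in /etc/sysconfig/named, and the init script honored whatever that line said. Here's a rough sketch of the two settings in play; the control panel's path is made up purely for illustration:

    # /etc/sysconfig/named with the stock bind-chroot layout
    ROOTDIR=/var/named/chroot

    # /etc/sysconfig/named the way the control panel wanted it
    # (path invented here; the real one belonged to the panel)
    ROOTDIR=/path/the/control/panel/used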

My friend flagged this as fishy and I agreed. We did a little digging on a "playground" box to be sure. It turned out that this guy's fix let the daemon start up, all right, but it also pointed it at a directory which was empty! For the customers running this one particular control panel, the DNS configs lived in another directory, and ROOTDIR had to point there or named would come up with no zones.

What this meant, of course, was that all of those customers who were running their own authoritative DNS were going to have all of those zones die miserably as the night rolled on. All of the customer machines were set to run up2date every night between roughly midnight and 6 AM, and it was already pushing 11 PM. There would be many more broken machines if we didn't do something, and fast.

Worse still, the usual monitoring systems wouldn't notice this. The checks only queried some placeholder zone, which was still there even when BIND came up with none of the customer zones loaded. Every one of those customers could be broken and our monitoring would stay green. It would have been a horrible night followed by a miserable morning of credit memos, no doubt.
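
To make the gap concrete, here's a sketch of the check the monitoring was effectively doing versus the one it needed. The host and zone names are invented; the real checks lived inside the monitoring system, not a shell script.

    #!/bin/sh
    # Hypothetical nameserver and zone names, just to show the difference.
    NS=customer-box.example.com

    # What the monitoring effectively checked: the placeholder zone, which
    # survived the breakage, so this kept coming back clean.
    dig @"$NS" monitoring-placeholder.example.com SOA +short

    # What it needed to check: a zone the customer actually hosts. With the
    # chroot pointed at an empty directory, +short prints nothing at all.
    if [ -z "$(dig @"$NS" some-customer-domain.example.net SOA +short)" ]; then
        echo "customer zone missing -- named came up without its zones"
    fi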

We had to act. I did some database magic, first pulling a list of who had that version of Red Hat running and then pulling another list of those servers which had that particular control panel running. Then I pulled the intersection and found out that we had almost 1000 machines which could be affected by this. Ouch.
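
I don't remember the exact queries, but the shape of it was two lists of hostnames, one per criterion, intersected. A minimal sketch with shell tools, assuming each list had already been dumped one hostname per line:

    #!/bin/sh
    # redhat_version.txt: boxes on the affected Red Hat release
    # control_panel.txt:  boxes running that control panel
    # comm -12 keeps only the lines present in both (inputs must be sorted).
    sort redhat_version.txt > rh.sorted
    sort control_panel.txt  > cp.sorted
    comm -12 rh.sorted cp.sorted > affected_machines.txt
    wc -l < affected_machines.txt    # just shy of 1000 for us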

We both drove in so we could work on it from the office. Things were about to get interesting and we needed to be on the support floor where it was all happening. A third friend joined us, and we got going. The work started evolving in parallel.

Using my list of machines pulled from the customer db, the others went to work rigging a massive hack to stop the bleeding. It was going to log into every machine on the list and turn off up2date until we could figure out the root cause and fix it for good. This would keep the machines from getting the "killer package" while we worked.
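
I won't swear to the exact mechanics of that hack, but the shape was a loop of ssh commands over the affected list. Here's a sketch; the disable step (stopping rhnsd) is a guess at how the nightly up2date run was driven, and the real version may have neutered a cron job instead.

    #!/bin/sh
    # Stop the nightly update run on every affected box until the root
    # cause is understood. affected_machines.txt is the list from earlier.
    while read -r host; do
        # -n keeps ssh from eating the rest of the host list on stdin
        ssh -n -o ConnectTimeout=10 root@"$host" \
            'service rhnsd stop; chkconfig rhnsd off' \
            && echo "$host: updates disabled" \
            || echo "$host: FAILED" >> failed.txt
    done < affected_machines.txt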

While they were doing that, I kept digging around on the test machines. I had established that installing one particular package would break it, but now I needed to figure out why. I wound up pulling the SRPMs and ripping them open to get some sense of what was going on.
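
If you've never done it, the digging looks roughly like this: dump the package's scriptlets and file lists, then crack open the source package and read the spec to see how it treats /etc/sysconfig/named on an upgrade. Filenames below are placeholders.

    #!/bin/sh
    # What do the install/upgrade scriptlets actually do?
    rpm -qp --scripts bind-chroot-*.rpm

    # Which files does the package claim, and which are marked as config?
    rpm -qpl bind-chroot-*.rpm
    rpm -qpc bind-chroot-*.rpm

    # Crack the source package open and read the spec to see how
    # /etc/sysconfig/named gets handled.
    rpm2cpio bind-*.src.rpm | cpio -idmv
    less bind.spec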

Ultimately, it turned out that a combination of things had conspired against us. Our custom kickstart installed a package which should never have been there for the control panel customers. Normally it was no big deal because the control panel was installed after that point, and it laid down its own configuration.

The problem came on the rare night when the bind-chroot package actually got an update and the systems pulled it in. That put back the original configuration, which was no good for the control panel, and that's what broke named.

So now that I knew how it got there, I also had to figure out what we were going to do about it. Here's where we got lucky. It turned out that merely removing the package would revert the damage done when it had been upgraded. It would leave the machine in a working state. We just had to run "rpm -e bind-chroot" on those nearly 1000 machines.
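
The cleanup run was the same parallel-ssh pattern as the earlier hack. Here's a sketch of the serial version, with made-up hostnames; the named restart is my addition, since the package's own removal scriptlets may well have handled that part.

    #!/bin/sh
    # Remove bind-chroot everywhere and bounce named so it reads the
    # reverted config.
    while read -r host; do
        ssh -n -o ConnectTimeout=10 root@"$host" \
            'rpm -e bind-chroot && service named restart' \
            && echo "$host: fixed" \
            || echo "$host: needs a human" >> still_broken.txt
    done < affected_machines.txt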

We also had to back out our temporary hack which had disabled up2date while all of this investigation had been going on. Again, the evil parallel ssh was pressed into service, and eventually it was all sorted out.

Finally, we sent a mail to our kickstart people telling them to remove that package from our install for control panel customers. That kept it from happening to anyone else.
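
In kickstart terms that's a one-line change: prefix the package with a minus sign in the %packages section so it never lands on those builds in the first place. The surrounding package list here is invented.

    %packages
    @ Base
    bind
    # the control panel lays down its own named setup, so keep the
    # chroot wrapper off these builds entirely
    -bind-chroot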

The three of us who were driving this effort were over it. Two of us would quit within the month for other jobs. Even with all of that swirling around, we still drove in to work at midnight and stayed there until 4 or 5 in the morning to keep our customers from hurting.

It wasn't about the company, because we had no love left for it. It was about all of those other companies who had decided to host there. They did not deserve to suffer, so we did what we needed to do.

Besides, if anyone else had tried to handle it, they would have just made it worse.