
Saturday, September 3, 2011

My post-it note gets a support person in trouble

If you have to distribute DNS server IP addresses on a post-it note, your organization has failed. I had to do exactly that once, and it was a pretty sure sign that no clues were present.

One night, I was working late. I had moved on from support and was off in the conference room where my new team was based. Note that unlike project darkness, this conference room was legitimately ours.

We got a notification that the internal IT folks were going to be working on internal DNS that evening, but "service would not be affected". Well, sure, you know what that means. Famous last words and all of that.

Sure enough, when the time arrived, everything stopped resolving. Both of the nameservers which had been handed out by their DHCP server were dead. Obviously, this was not supposed to happen. One of them was out for maintenance just like they said, and the other one was ... well, who knows. It wasn't working.

I knew that at that very moment, everyone on the support floor would have been freaking out. They would have been unable to get to the ticketing system, customer machines, their favorite search engine, the file server, and yes, even the internal Jabber server. Yep, it choked when DNS went down, and nobody could reconnect. That meant I couldn't talk to them via traditional means.

Of course, I was able to grab a pair of nameservers on the company network which weren't at that particular site, and I got my own stuff working again by twiddling resolv.conf and telling dhclient to leave it alone. Then I just walked those IPs out to the support floor so those people could get going again.

That should be the end of the story: IT screwed up, we worked around them, and life went on. But no, they had an attitude about what happened.

So, a few minutes after I delivered those IPs, one of the guys on the floor fixed his system and sent out an e-mail to the support mailing list. It just said "DNS is down, try using X and Y". One of the IT monkeys responded saying "no, it was never down, don't do this". They then escalated it to someone higher up the food chain, and it filtered back down to the support person's boss. Fortunately, his boss stood up for him, but it should never have happened in the first place.

Let's review. First, they took down machine #2 while machine #1 was sick. Second, they didn't have any sort of fallback coverage for client DNS resolvers, like changing the DHCP advertisements ahead of the work. Third, when they screwed up, they left it that way for quite a long time -- long enough for me to notice, work around it, and then walk it out to the floor to help others do the same.

Fourth, when someone tried to be helpful, they lied and said it was "just latency". No, latency is adding 100 ms to a request. This was dropping all requests on the floor. Only an incompetent fool would characterize that as latency.

Fifth, they actually had the gall to try to get that support guy in trouble for trying to help. They're support! That's what they do!

I consider myself lucky that I wasn't working support at that point in the company's life. The amount of network breakage, ticketing system woes, and other general lameness was amazing. The worst thing in the world is when you have a customer on the phone and your own company's infrastructure is falling down around you. You have to somehow stall and try to make something work while not badmouthing your own company to the customer on the phone. It's the kind of stress which is entirely avoidable, but the company has to care about avoiding it first.

Now let's try this from the other angle: how to do it properly.

First, actually monitor your services, so you don't end up running on just one server while thinking both are okay.
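
The probe doesn't have to be fancy. Here's a minimal sketch of the idea in Python, assuming the third-party dnspython package is available. The server IPs, the probe hostname, and the alerting hook are all placeholders for whatever your site actually uses.

    #!/usr/bin/env python3
    # Minimal sketch of a DNS health probe (assumes dnspython is installed).
    # The server IPs and probe hostname below are made-up placeholders.
    import sys
    import dns.exception
    import dns.resolver

    NAMESERVERS = ["10.0.0.53", "10.0.1.53"]    # hypothetical advertised resolvers
    PROBE_NAME = "tickets.example.internal"     # hypothetical always-present record

    def resolver_answers(server_ip):
        """Ask one specific server to resolve the probe name, with a short timeout."""
        res = dns.resolver.Resolver(configure=False)   # don't read /etc/resolv.conf
        res.nameservers = [server_ip]
        res.lifetime = 3                               # seconds before giving up
        try:
            res.resolve(PROBE_NAME, "A")               # .query() on old dnspython 1.x
            return True
        except dns.exception.DNSException:
            return False

    if __name__ == "__main__":
        dead = [ip for ip in NAMESERVERS if not resolver_answers(ip)]
        for ip in dead:
            # Hook this up to whatever alerting you already have.
            print("ALERT: nameserver %s is not answering" % ip, file=sys.stderr)
        sys.exit(1 if dead else 0)

Run something like that every minute or two from a box which isn't depending on those same servers for its own resolution, and you hear about a dead resolver before the support floor does.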

Second, after you turn off the one which "shouldn't affect anything", make sure it did not in fact affect anything! Do a simple query or whatever. How long does it take to test? 5 seconds? How much pain would it cause if you created a problem and didn't spot it for half an hour? I bet it's worth spending those 5 seconds.
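
The "simple query" really can be that simple. A throwaway check along these lines would do it, assuming dig is installed; the server IP and the hostname are made up for the example:

    import subprocess

    # Hypothetical values: the resolver which is supposed to still be up,
    # and a record which should always exist internally.
    result = subprocess.run(
        ["dig", "@10.0.1.53", "tickets.example.internal", "A",
         "+time=2", "+tries=1", "+short"],
        capture_output=True, text=True,
    )
    if result.returncode != 0 or not result.stdout.strip():
        print("The 'unaffected' nameserver is not answering. Stop and fix it.")

Run it right after you shut the other one down. If it complains, put things back before anyone else notices.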

Third, have some kind of fallback plan. A caching name server in another building might be a good idea. Either advertise it all the time, or switch to it in your DHCP configuration ahead of the work. Obviously, if you change what DHCP hands out, you have to wait until your clients have renewed their leases and picked up the update -- clients typically don't even try to renew until about halfway through the lease, so plan for at least that long between the change and the maintenance window.

Fourth, if you do manage to screw up, notice it and fix it quickly.

Fifth, if you somehow screw up and don't notice it or fix it quickly, announce what happened and apologize for it. Don't wait for someone else to report it.

Sixth, if somehow they report it first, thank them for the information and respond truthfully! Take your lumps and learn from it. Improve yourself and your processes.

Seventh, if one of your peons tries to lie about what happened and then tries to get someone else in trouble, take that peon out to the wood chipper and remove him from the gene pool. You don't need that kind of garbage happening in your organization.

Putting up with this kind of garbage just enables the mediocre. Demand better.