
Wednesday, February 27, 2013

The box which kept breaking itself

There's an amazing story floating around on the front page of Hacker News this morning called How I Fired Myself. Guess what happens when you do something really bad and have no backups. That and the other stories in the comment thread got me thinking about times I've pulled customers out of fires of their own making.

One evening in a January many years ago, I got one of those "emergency! everything is down" tickets for one of our Linux customers. He had apparently called in, gotten one of our phone firewalls to open a ticket with the details, and then hung up to deal with the inevitable flood of complaints coming from his customers.

When I read the ticket, it was simple enough: something bad had happened, and now he couldn't log in. He thought maybe he had messed up a command as root and had moved most of the system's top-level directories (/etc, /usr, /var, that kind of thing) into a directory deep under /home. In other words, the box now had no /etc, and instead it was hanging out in /home/user/who/knows/what/etc. Not surprisingly, this broke the box completely.

I tried to call him back as requested, but got only voicemail, so I updated the ticket with his options instead. I could try to break in via Webmin, since it runs as root and tends to survive certain disasters, but the more likely scenario would involve throwing the box on the KVM (remote keyboard/video access), with the final resort being a rekick, that is, a complete reinstall of the machine.

He updated the ticket and asked me to rip down the box and try fixing it at a given time. That time had not yet arrived, so I went off and did something else for 10-15 minutes. I looked back and found he had updated the ticket yet again to ask me to make it happen right away. Alrighty then. I updated the ticket to tell him I was on the case and went to work.

The first order of business was to get the data center peeps to cart it back to their office and put it on the KVM. This also meant it would show up on the "PXE boot" network where a variety of interesting netboot environments could be selected. I could just pick the right "rescue" flavor for that version of Red Hat to get started.

I got on the KVM and mounted his filesystems and went in search of his missing data. Sure enough, I found it deep under /home as expected. I moved all of it back to where it should have been and rebooted it. Everything seemed fine, so I shut it down and had the data center rerack it.
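The fix itself was nothing exotic. Roughly what it looked like from the rescue environment, with the device name, mount point, and buried path here as stand-ins rather than his real layout:

    # From the netbooted rescue environment.  /dev/sda1 and /mnt/sysimage
    # are illustrative; the real device and paths were different.
    mkdir -p /mnt/sysimage
    mount /dev/sda1 /mnt/sysimage        # the customer's root filesystem
    cd /mnt/sysimage/home/user/who/knows/what
    mv etc usr var /mnt/sysimage/        # ...plus whatever else had been dragged along
    cd /
    umount /mnt/sysimage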

Once that was done, I tried to log in over the network with ssh as usual. sshd was running and listening on port 22, but logins were impossible. This wasn't some firewall issue. The box was broken yet again. I called up the customer and verified he was having the same problem, and then got back off the phone and went back to work.

I had the data center rip down the box again and throw it on the KVM again, and I jumped in one more time. That's when I found something I did not expect. The box was completely hosed just as it had been before. Everything was right back in that same path under /home as if I had done nothing to it.

I seriously doubted the customer had logged back in and broken it the same way, especially considering I didn't call him to check in until after it had been reracked and shown not to work via ssh. I figured it must have been some automated process and went looking for a bad startup script or a cron job.

Well, I found it. This guy had written a cron job which ran as root and moved files around. It used some kind of shell variable expansion, and while that variable was populated with a reasonable path when run by a normal user in a real login shell, that variable was empty when run in cron. So, instead of being "/some/path/to" + "/*", it wound up being "" + "/*", which is just "/*".

Oops.

(If this shell variable expansion trap seems familiar, it's because I ranted about it roughly three weeks ago in another post about IP config scripts.)
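To make that concrete, here's a minimal sketch of the failure mode. The variable name and destination are invented, not pulled from his actual script, but the mechanics are the same:

    #!/bin/sh
    # Hypothetical reconstruction -- SRCDIR and the destination are made up.
    # In an interactive login shell, the user's profile sets something like
    # SRCDIR=/some/path/to, so this moves /some/path/to/* as intended.
    # Under cron, that profile is never sourced, SRCDIR expands to nothing,
    # and the glob quietly becomes /* ... which matches /etc, /usr, /var, etc.
    mv $SRCDIR/* /home/user/who/knows/what/

A set -u at the top of the script (which catches unset variables), or a "${SRCDIR:?}" guard (which also catches empty ones), would have made it die with an error under cron instead of quietly sweeping the top of the filesystem into /home.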

I commented out his bogus cron job, fixed the box yet again, then tested another boot. It was fine. I had the data center guys rerack the machine one more time, and finally was able to ssh in from my workstation as expected.

Later on, the customer asked what had happened and I explained the whole thing, including how I figured out it must be an automated process and then debugged his shell programming to find out how it had happened. He basically complimented me on my "kung-fu", and that was that.

The whole thing took about 90 minutes from ticket creation to resolution.