Writing

Software, technology, sysadmin war stories, and more.

Wednesday, May 23, 2012

Running in lock step

Traveling means you get to see a whole new set of crazy things. I got on a plane which was having trouble with its in-flight entertainment stuff, so they rebooted the system. I was fully expecting to see a Linux boot and a frame buffer penguin pop up, but instead I got this:

Obtaining DHCP address

A minute or two later, this happened:

No DHCP address obtained

This was also happening on every other screen I could see. When things started working, it all happened at different times for each seat-back. Maybe their servers can't handle a full simultaneous restart load?

Speaking of such things, one time there was a building with a bunch of computers in it. Then some unfortunate blue heron decided to cross-connect a couple of phases and barbequed itself. The building lost power. It wasn't supposed to, but it did.

There was the usual wailing and gnashing of teeth by physical plant people. Not too long after this, they fixed the power feed, and the machines came back up and went to work. Everyone thought that was the end of it.

Of course, my story doesn't end there. A day or two later, there was a super spooky failure mode in some low-level support system (sort of like LDAP, only not). It was something nobody had ever seen before, and they had no idea what could have happened.

I found out about this second-hand through a friend, and I just said "hey, wait, isn't that where all of the machines were force-rebooted at the same time the other day?" When she answered in the affirmative, I said "well then, all of the machines have been running in lock step, right?"

Apparently nobody had thought about this possibility. Normally, machines get rebooted and processes get restarted at scattered times, which spreads their periodic work out across the day. Reboot all of those machines at the same time (or restart key processes at the same time), and every timer that counts from startup now fires at the same moment everywhere. If you have enough machines, you could have a nice self-DDoS going on.
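To put rough numbers on that, here is a toy simulation. Everything in it is made up for illustration: the machine count, the hourly check-in period, and the `peak_load` helper are assumptions, not details of the actual incident. It just counts how many requests land in the same second when every machine starts its timer at once versus at scattered times.

```python
from collections import Counter
import random


def peak_load(start_offsets, period, horizon):
    """Return the largest number of requests that land in the same second.

    Each machine sends one request at its start offset and then again
    every `period` seconds until `horizon` is reached.
    """
    hits = Counter()
    for offset in start_offsets:
        t = offset
        while t < horizon:
            hits[t] += 1
            t += period
    return max(hits.values())


MACHINES = 1000   # hypothetical fleet size
PERIOD = 3600     # hypothetical: each machine checks in once an hour
HORIZON = 86400   # simulate one day, in seconds

# Everybody rebooted at the same instant after the power came back.
lock_step = [0] * MACHINES

# Everybody restarted at some random point within the hour.
rng = random.Random(42)
spread_out = [rng.randrange(PERIOD) for _ in range(MACHINES)]

print("lock step peak: ", peak_load(lock_step, PERIOD, HORIZON))
print("spread out peak:", peak_load(spread_out, PERIOD, HORIZON))
```

In the lock-step case the whole fleet shows up in the same second, every hour, for as long as the machines stay up. With scattered starts the same total work is there, but the worst second only sees a handful of requests.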

Soldiers break step when crossing bridges because of resonance effects. London's Millennium Bridge, before it was fitted with dampers, wobbled in a way that got pedestrians to fall into step with the sway, which only amplified the wobble.

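Software sometimes has code to spread requests out across some interval, usually by adding a random delay. Networks frequently have a way to deal with this designed into their lowest levels: classic Ethernet, for example, has stations wait a random amount of time after a collision before trying again.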
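The software side of that can be as small as a couple of helper functions. This is only a sketch of the general idea, not anybody's production code: the names `initial_splay` and `next_delay` and the 10% jitter figure are assumptions picked for the example.

```python
import random


def initial_splay(period, rng=random):
    """Pick a random wait before the first request, somewhere in one period.

    This is what keeps a simultaneous restart from turning into a
    simultaneous first request.
    """
    return rng.uniform(0, period)


def next_delay(period, jitter_fraction=0.1, rng=random):
    """Return the period, nudged up or down by at most jitter_fraction of it.

    A little randomness on every cycle lets machines that happen to line
    up drift apart again instead of staying in step.
    """
    spread = period * jitter_fraction
    return period + rng.uniform(-spread, spread)


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical 60-second polling loop: show the first few wait times.
    waits = [initial_splay(60, rng)] + [next_delay(60, rng=rng) for _ in range(4)]
    print([round(w, 1) for w in waits])
```

A real loop would sleep for `initial_splay(period)` once at startup and then for `next_delay(period)` between requests. The splay handles the mass reboot, and the per-cycle jitter handles everything that manages to synchronize later on.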

Unfortunately, some lessons have to keep being re-learned, and this is one of them.