Writing

Software, technology, sysadmin war stories, and more.

Friday, May 31, 2013

Stupid network tricks

About 10 years ago, there was a project to rewire a particularly large room full of computers. I forget exactly what they were doing, but I think it involved adding a UPS, an emergency power-off switch, or perhaps both. It was the sort of thing which should have been there all along but had been missing for about five years.

They scheduled a maintenance window for a Saturday afternoon when most people would not be at work to use the servers and networks. I was running a bunch of machines which were living in that space and so was affected by this. The power-down would touch every one of my systems in that room, including all of the routers. We'd be dead to the world for several hours.

I went through the whole process of making sure everything was okay on my end, and then halted the systems remotely. Some people who were actually on site powered them off in advance of the work. The electricians eventually showed up two hours late, and did something or other. Then they declared they were done for the day, without having finished the project. We'd have to go through this again.

They turned my machines back on and my stuff went back online. A bunch of queued up mail from the outside world started arriving, and life more or less went back to normal.

Sunday afternoon, the entire network basically exploded. Every single voice circuit to the two dozen remote sites was down. This also took down a good chunk of the data connectivity to those same sites. Most of them had a secondary circuit for extra bandwidth, but that turned out to be moot since this "explosion" took out the link to the outside world, too.

One of the local people went out there and got me on the phone. That is, he called me from his cell phone, since there was no phone service in the offices, either. The whole thing was kaput. He reported that our CSU/DSU rack was lit up like a Christmas tree with alarms.

[Image: Paradyne CSU/DSU rack]

That's a picture of it in normal conditions. Now imagine most of those green or dark lamps replaced with reds and yellows and blinking stuff, and you'll get the idea. There was also another rack just below it which was basically doing the same thing.

From all appearances, this thing had blown up. It was completely unprecedented. The "network engineer" who was there on site was beside himself.

I noticed that my backup Internet connection was up. I had purposely designed that entire thing to be "as separate as possible", so that didn't surprise me. It used a standalone DSU instead of the rack, and it had its own T1 jack on our old demarc instead of being part of our DS3 mux.

Maybe it was just my mentioning that mux, but that got him to walk over and check it out. That's when it became obvious: it was off. Okay, now we were getting somewhere. All of those T1 line cards in alarm were fed by this thing, and if it was powered down, then obviously they were going to flip out.
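That chain of reasoning (a pile of alarms that all share one upstream box) is easy to sketch in code. This is only an illustration under loud assumptions: the circuit names, device names, and the topology table are all invented for the example, and `common_upstream` is a hypothetical helper, not something from any real monitoring tool.

```python
from collections import Counter

# Hypothetical topology: each circuit maps to the devices that feed it.
# Every name here is made up for illustration.
UPSTREAM = {
    "t1-site-01": ["csu-rack-a", "ds3-mux"],
    "t1-site-02": ["csu-rack-a", "ds3-mux"],
    "t1-site-03": ["csu-rack-b", "ds3-mux"],
    "t1-internet": ["csu-rack-b", "ds3-mux"],
    "t1-backup": ["standalone-dsu", "old-demarc"],
}


def common_upstream(alarming, upstream=UPSTREAM):
    """Return the devices that sit upstream of every alarming circuit."""
    alarming = list(alarming)
    if not alarming:
        return []
    counts = Counter()
    for circuit in alarming:
        # set() so a device is counted at most once per circuit
        counts.update(set(upstream[circuit]))
    # A device is a suspect only if all of the alarming circuits depend on it.
    return sorted(dev for dev, n in counts.items() if n == len(alarming))


# Everything except the deliberately separate backup link is in alarm.
suspects = common_upstream(
    ["t1-site-01", "t1-site-02", "t1-site-03", "t1-internet"]
)
print(suspects)  # → ['ds3-mux']
```

With a table like that, a rack full of red lamps stops looking like two dozen separate failures and starts looking like one powered-down box. The backup link staying up is the other half of the clue, since it shares nothing with the rest.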

Now the question became one of figuring out why the mux was down. Did it catch fire? Nope. Had the ceiling opened up and poured water on it? Nah. Did hooligans break in and take a fire axe to the rack? No such luck.

He noticed it was unplugged. Oh. But... nobody was around. It must have been like this for a while! So, why did it die now? We figured it out over the next few minutes.

What happened is that they unplugged it on Saturday afternoon before the work started, but never plugged it back in. This particular item had its own battery, and so kept on chugging, running from that battery for close to 24 hours. When the battery gave out, it fell over. Easy as that.

He plugged it back in, and it went back to work. The alarms eventually cleared on the individual circuits to the distant sites.

Here's a picture of that mux under normal operations.

[Image: DS3 mux]

Notice that it's at the very top of a rack and is right below a ceiling tile with obvious signs of water damage. Perhaps there was a pipe somewhere above it which had a bad seal, or perhaps the roof itself (not too many feet above this false ceiling tile) had a hole. In any case, it's not the sort of place where you'd want to have water, and yet, there it is.

You can't really see it in this picture, but there seems to be a minor fault shown on the front panel for one of the controllers, and one of the spare DS1 handlers seems to be active. You might think this would be noticed, but hey, this was the same box which was left unplugged for a day without anyone knowing about it.

Sometimes, I wonder how any of this stuff managed to work.