Writing

Software, technology, sysadmin war stories, and more.

Friday, April 20, 2012

What happens when you lose the main downtown telco switch?

You know you have a problem with your infrastructure when a single air conditioner can take out communications for an entire region. About two weeks before Y2K, I was doing my school district sysadmin thing. It was evening, and I was at home, doing nothing in particular. Then everything broke.

The first thing I noticed was that my ISDN connection to the office went down. This was pretty strange, since that thing had been absolutely solid since it had been installed at this particular house. No amount of poking or prodding would get it to latch onto the other router at work.

Otherwise, my line was fine. I could get a dial tone, and I could make voice calls, but I couldn't get connected to my router at the office just a few minutes down the road. It didn't make much sense, but I decided to snoop around before hopping in the car. Actually driving out would mean dealing with multiple padlocked gates, opening the shop, disarming the alarm system, and calling Simplex to make sure they didn't send the cops for an after-hours opening.

All I had was my ISDN BRI circuit, but it terminated into a Pipeline 75, so I had analog ports on the back. I just dug around in my closet and found one of my old modems from my pre-ISDN days and lashed it up to one of those ports. Then I plugged it into one of my Linux boxes and fired up minicom to do a quick terminal mode session.

I was able to get into my dialup pool at work, and then I started seeing the carnage. My regular site-polling cronjob was reporting 12 different remote sites were down as of 6:44 PM. Two minutes later, I got another mail reporting that my secondary connection, star of my evil hack job, was also down.
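For a sense of what a site-polling cron job like that might look like, here is a minimal sketch. It is not the original script: the function names, the TCP probe, the port, and the report wording are all assumptions made up for illustration.

```python
import socket
from datetime import datetime


def tcp_probe(host, port=23, timeout=3.0):
    """Hypothetical probe: True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out, or unresolvable: count it as down.
        return False


def poll_sites(sites, probe=tcp_probe):
    """Return the sorted names of sites whose probe failed.

    `sites` maps a site name to the address of its router (assumed layout).
    """
    return sorted(name for name, host in sites.items() if not probe(host))


def format_report(down, now=None):
    """Build the one-line body of the alert mail, or "" if nothing is down."""
    if not down:
        return ""
    stamp = (now or datetime.now()).strftime("%H:%M")
    return f"{len(down)} remote site(s) down as of {stamp}: {', '.join(down)}"
```

Run from cron every couple of minutes, something like this only needs to print the report when it is non-empty, since cron mails any output to the job's owner. Note that the alert still has to reach you somehow, which is why having a dialup pool on a different path mattered that night.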

This was unusually bad and rather confusing, too. I couldn't figure out the common element which might have brought all of them down at the same time. Clearly, we had power to the data room since all of my devices were up. It was the circuits which were being stupid. I struggled to think of something which they all shared which could possibly create this pattern of outages.

One thing to keep in mind here is that I had purposely set certain things up to avoid some "single point of failure" situations. For example, we used to have plain old copper T1 service from the telco, and it was delivered in a nice vertical row of jacks. At some point, we switched to having them feed us a DS3, and then we installed our own DS3 mux and broke out the channels ourselves. Apparently we got some kind of price break that way.

Since those original T1 jacks were still there and could be lit up at any time, I made sure they kept my secondary Internet connection on them. This kept it off the new DS3 circuit and mux. The only drawback was that the "octopus cable" for the CSU/DSU rack was now routed to the new mux instead of the old T1 jacks. I had to wire up my own extension to span the gap. That was no big deal.

The outages still didn't make sense. I kept poking at my net devices and finally figured out why my ISDN connection had dropped. My ISDN router at the office was saying it didn't have a link to the telco. This was the same "flashing WAN" situation I had for months on end back when they took forever to initially provision my home circuit.

At this point, I was pretty sure it wasn't anything of ours which had broken. Those ISDN circuits didn't go anywhere near the T1 jacks or DS3 mux, and yet they had been affected too. I knew they were actually delivered to the ancient demarc in the admin building and got to my data room by way of a massive bundle of pairs under the parking lot. Something else was going on.

Finally, I turned on my scanner and picked up a conversation from some local ham radio operators. They were reporting phone troubles as well. Uh oh. I flipped over to the local police frequencies. They had lost 911 service entirely. The police department had their helicopter up, and they reported that the local airport had lost touch with air traffic control at the next major airport in the region. Yow.

Two local TV stations broke in with special reports. Slowly, bits of knowledge started getting around. By 7:50 (according to my notes from back then), it had been reported that US West had lost a major piece of switching equipment in their downtown central office, and anything running through there had been taken out as a result.

It started making sense now. Our schools which were still up were served by other central offices, and were a straight shot from the district's computer room to them. The ones which had failed were closer to downtown, and must have had their circuits routed through there.

This also strangely explained my ISDN situation. Back when the office side of things had been installed, the telco apparently didn't have capacity in the normal CO which served the office, so they hauled those circuits in from, yep, you guessed it, downtown. Later on, when I got my own personal circuit on that side of town, it came from the proper CO and had a different prefix to boot. I hadn't thought much about it until that night.
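The "what do all the dead circuits have in common?" question can be written down as a small set calculation. This is only a sketch under a big assumption: that you actually know which facilities each circuit passes through, which is exactly the information a telco customer usually doesn't have. The circuit and facility names below are hypothetical.

```python
def shared_failure_points(paths, down):
    """Return facilities used by every down circuit and by no working one.

    `paths` maps a circuit name to the set of facilities it traverses.
    `down` is the collection of circuit names currently failing.
    """
    down = set(down)
    if not down:
        return set()
    # Start with whatever every failed circuit has in common...
    suspects = set.intersection(*(set(paths[c]) for c in down))
    # ...then clear anything a still-working circuit also depends on.
    for circuit, hops in paths.items():
        if circuit not in down:
            suspects -= set(hops)
    return suspects


# Hypothetical routing table, loosely modeled on the story.
paths = {
    "school-a": {"data-room", "downtown-co"},
    "school-b": {"data-room", "north-co"},
    "office-isdn": {"admin-demarc", "downtown-co"},
}
print(shared_failure_points(paths, {"school-a", "office-isdn"}))
```

With that made-up table, the data room is ruled out because a working circuit still depends on it, and the only suspect left is the downtown CO. It assumes a single point of failure; with two unrelated faults at once, the intersection can come back empty.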

So what do you do when you don't have any 911 service? The PD had a plan for that, too. They entered "high-visibility mode" at 8:33 PM. I'm still not entirely sure what this means, but it probably involves a whole lot of driving around. I guess in that situation, if something bad happens, you have to wait for a cruiser to drive by and then flag it down.

Of course, four minutes later, at 8:37, we had all but two of our links back up. Four minutes after that, at 8:41, even those were back up. I'm not sure how long it took for them to restore service to everyone else, but some reports suggest certain things were out for as long as three hours.

About ten minutes later, one of the local TV stations broke in again with a second special report to explain what had happened. As they put it, there was "... water damage to our central office in downtown", and "huge air fans [are] being used to dry out the equipment".

Later newspaper articles would confirm that it was caused by a leaky air conditioner. People were told to go to their local fire stations for emergency help, but those locations weren't published in the phone book. Oops.

As a sysadmin, I had done my part in the design and continued operations of my stuff, but there was only so much I could do. If your circuit provider has a really bad day, all you can do is sit back and watch.

In terms of emergency preparedness, it never hurts to know a few nearby locations which are likely to be staffed by people with access to radios. This usually means a fire station, but it could also be the police department or a substation. Here in Silly Valley, be aware that the nearest such office might actually be in the next city over, so think about that if you are near a border. It might save you some time.

In a pinch, you might even try to flag down someone in a city vehicle. At least where I live, the people who run the street sweepers, work on power outages, and deal with the water and sewer lines all have radios. There's an "Emergency 1" talkgroup on that radio system which will reach the same dispatchers who answer 911 calls.

I only mention this because you normally wouldn't chase down a street sweeper to get medical assistance. The thing is, when normal lines of communication fail, you take what you can get. If it's crazy enough to help you remember it when you need it, then I've accomplished something by including it here.