Software, technology, sysadmin war stories, and more.
Tuesday, February 19, 2013

More on last week's outage

Last week, the server which hosts this site was down for about 3.5 hours due to a double-whammy outage on the part of my hosting company. At the time, they said that both of their links from San Antonio to Dallas had been lost and that knocked things offline. I expected the final post-mortem would be an interesting report.

It's been a week, and I got that report in the mail today. Here it is:

Root Cause Analysis - Feb 12th - San Antonio Network Outage

PEER 1 Hosting/Serverbeach currently uses two unique transport providers for our circuits running between San Antonio and Dallas. The network outage that we experienced on February 12th 2013, uncovered the fact that both transport providers use a common third party, long-haul provider for the underlying fiber infrastructure. The underlying provider was performing a planned fiber maintenance in the Waco, Texas area that ran from 00:00CT to 06:00CT the morning of February 12th. During that maintenance, both backbone circuits that run between San Antonio and Dallas were taken offline; one for 3 hours and 30 minutes and the other for 5 hours and 13 minutes. While both circuits were down simultaneously for 3 hours and 30 minutes, clients in the San Antonio data center would have seen an outage.

These transport circuits were initially ordered as diverse paths, which means that no single maintenance should be able to impact both at once. However, it was discovered during this incident that one of our providers was recently sold to another company and physical circuit path routing may have been changed without our knowledge. Initial diagnosis of this event took longer than normal as PEER 1 Hosting/Serverbeach was not notified of this planned maintenance. PEER 1 Hosting/Serverbeach was also performing non-impacting port reconfiguration work in the same area of the network as the fiber maintenance, so this complicated the initial diagnostics. In light of this issue we are taking the following steps:

1) We have already been in the process of replacing our two fiber providers in the San Antonio to Dallas corridor for several months now. A new deal has been signed already with a separate third provider, and we have been in the deployment phase. The end result will be a redundant circuit ring through one provider and separate redundant circuit through another. We are taking the appropriate precautions to ensure that these new circuits will run on completely separate paths.

2) We are working with both our current providers to obtain more information about how these previously diverse circuits were diverted to be in shared paths and a full history of circuit re-locations.

3) How the transition of our account was handled when our one provider changed ownership and why we were not properly informed of this planned maintenance. We are also following up with the second provider regarding their lack of notification.

4) We are obtaining confirmation from both our current providers that there will not be any work on our circuits for the next 8 weeks.

5) We are also performing a PEER 1 network wide audit of the current fiber paths being used throughout our transport links to ensure full network redundancy is available in all locations. We will also be auditing our escalation lists and contact detail for all transport/transit vendors.

PEER 1 Hosting/Serverbeach sincerely apologizes for the outage caused by this incident. We greatly appreciate your patience and continued support as work is done to ensure this type of issue does not reoccur.

Things I notice from this: apparently, as far as they are concerned, the only way to get connectivity down to SA is from Dallas. There's not even a hint of going to another location and peering there. Instead, they're just going to get a third route to Dallas on yet another path.

I expect this means that when something big happens in Dallas proper, or somewhere in between, it'll be another Bad Day for all concerned.
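To make that concern concrete, here is a minimal sketch of a shared-risk check. Everything in it is an assumption for illustration: the circuit names, the element labels, and the `shared_risks` helper are hypothetical, not anything from the report or from a real inventory. It models each circuit as the set of physical things it depends on and reports whatever all of them have in common.

```python
# Hypothetical inventory: each circuit maps to the set of physical
# elements (cities, conduits, underlying carriers) it depends on.
# None of these labels come from the provider's report.
circuits = {
    "provider_a": {"san_antonio", "longhaul_x_waco", "dallas"},
    "provider_b": {"san_antonio", "longhaul_x_waco", "dallas"},
    "provider_c": {"san_antonio", "other_conduit", "dallas"},
}


def shared_risks(circuits, ignore=frozenset()):
    """Return the elements every circuit depends on, minus those in `ignore`.

    Anything returned here is a single point of failure for the whole set.
    """
    paths = [set(p) for p in circuits.values()]
    if not paths:
        return set()
    common = set.intersection(*paths)
    return common - set(ignore)


# The data center itself is an unavoidable endpoint, so leave it out.
current_pair = {k: circuits[k] for k in ("provider_a", "provider_b")}
two_circuit_risks = shared_risks(current_pair, ignore={"san_antonio"})
three_circuit_risks = shared_risks(circuits, ignore={"san_antonio"})
```

With this made-up data, the two existing circuits share both the long-haul segment and Dallas. Adding the third circuit gets rid of the shared segment, but Dallas is still common to all three, which is the problem described above.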

Fiber routes are a funny thing. I used to hear about upcoming network outages in my former capacity as a pager monkey, and a fair number of them were due to vendors performing maintenance. One time, there was a path which would be out of commission for a while because a bridge over a certain river was being replaced. I did some digging and found it was a state highway project. Looking at maps of the area showed only one state highway crossing that particular river, and the bridge itself was nothing special. You'd never think it carried anything but vehicular traffic, but it in fact represented a significant chunk of "the Internet" in that area at the time.

The routes themselves may be the sort of thing providers like to keep under wraps, but it only takes a handful of details to figure out exactly where a given route has to go. If someone else already built a bridge, tunnel, or some other way to cross an obstacle, some bit-slinging company has probably found a way to cling to it for their own purposes.