Writing

Software, technology, sysadmin war stories, and more.

Friday, August 24, 2018

Private peering issues and focusing on the end user

Let's talk a little about how your browser actually manages to connect to some of these massive web sites that billions of people use. Specifically, I'm thinking about the route the traffic takes after it leaves your home or business, and how it gets to the web site's actual networks. Chances are, if your ISP and the web site are big enough, it's not happening over "commodity" Internet connectivity.

Instead, it's far more likely that HugeISPCo and CatPicturesCo have at least one private connection between their networks. You might have heard this called "peering", in reference to what happens when two companies lash up a link (or more) to avoid hopping through their "upstream" connections (for whatever that even means if you're big enough).

So, okay, given that this happens, it can help explain some of the anomalies which go on when a certain big web site appears to be down, but only for the customers of a given service provider. If everyone in the world saw the problem, it might be the web site. If every web site appeared down for people on that provider, it might be the provider.

But, if it's just that one web site from just that provider, it might just be some kind of private peering anomaly.
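That bit of reasoning is small enough to write down. Here's a minimal sketch, assuming you can get two yes/no observations: whether the site is also failing for people on other providers, and whether other sites are failing for people on the affected provider. The function name and the labels it returns are made up for illustration, and the answer is only a first guess, not a diagnosis.

```python
def classify_outage(site_down_elsewhere: bool, other_sites_down_on_isp: bool) -> str:
    """Return a first guess at where the fault lies.

    Both arguments are hypothetical observations gathered by hand
    (user reports, a test from another network, and so on).
    """
    if site_down_elsewhere:
        # Everyone sees the problem: suspect the web site itself.
        return "site"
    if other_sites_down_on_isp:
        # Everything looks down from that one provider: suspect the provider.
        return "isp"
    # Just this one site, from just this one provider: suspect the
    # private peering between the two.
    return "peering"


# The situation in the story below: the site works from the cell carrier,
# and the home ISP can still reach unrelated sites.
print(classify_outage(site_down_elsewhere=False, other_sites_down_on_isp=False))
```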

Keep all of this in mind when considering this next story.

It started with someone reporting that they couldn't get to the big web site they helped run when they were using their home Internet connection in Europe. They'd try to ping the site and it would fail. Around the same time, random members of the public from the same part of Europe started posting to Twitter that they also could not get to the big web site. Other smaller web sites that were cousins of the main CatPics site were also not working for them.

Oddly enough, instead of blaming the big web site or the parent company, most of the people posting on Twitter were blaming their own ISP. Apparently it's so bad at routing packets that its own customers have gotten used to blaming it... and usually being right about it. Some of them were rather savvy, and did things like switching off wifi on their cell phones to force traffic through the cell carrier instead, at which point things would work fine. They isolated the problem and proved that the site itself was not the issue.

While all of this was going on, naturally the people behind the site were worried that they could have done something wrong. The "bat signal" was turned on to alert folks to a possible issue with the production environment. Since it wasn't quite clear exactly what the scope of it was, they over-estimated the scope initially, following best practices. You can always back it off later, after all.

Anyway, while this was going on, someone supposedly involved with fixing the problem started complaining about the fact that someone had triggered the bat signal, and that they were "working the problem". They had to be reminded that best practice involves overshooting if you don't know the true extent of the problem, and curtailing it later once you figure it out.

They eventually determined that yes, it was something wrong with the private link between the site and that one ISP, and particularly, it was at the far end. There was nothing the company could do in terms of twiddling things on the link to fix it. Moreover, it seemed like most people were happy to leave it with that: "problem isn't us, it's them, so screw 'em, they'll fix their shit and people will be able to get to us again".

Fortunately, at least one other individual buried in the team which handled such things was not willing to just leave it like that. Sure, the ISP may have made a massive mistake, but why make the users suffer? They don't deserve it.

That individual decided to do something simple and yet very effective as a test, and turned off the peering to that ISP. Doing that caused an interesting and intentional chain reaction of events. When the peering dropped, the ISP would no longer receive routing information directing traffic for the web site down that shared link. Instead, it would fall back on whatever else it could find. In this case, that meant going out over its regular Internet connection and finding its way back into the web site's network.
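To make the chain reaction concrete, here's a toy model of the ISP's view of the world. It is a sketch under assumptions, not how any real router works: actual BGP best-path selection has many more steps, and I'm assuming the ISP simply prefers the route learned over the private link by giving it a higher local preference. The prefix is a documentation address, and the neighbor names are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str        # destination network being advertised
    learned_from: str  # which neighbor told us about it
    local_pref: int    # higher wins (assumed policy: peering beats transit)


def best_route(routes, prefix):
    """Pick the preferred route for a prefix, or None if there is none."""
    candidates = [r for r in routes if r.prefix == prefix]
    return max(candidates, key=lambda r: r.local_pref, default=None)


def drop_session(routes, neighbor):
    """Model a session going down: every route learned from it is withdrawn."""
    return [r for r in routes if r.learned_from != neighbor]


SITE_PREFIX = "203.0.113.0/24"  # documentation range, standing in for the site

# What the ISP knows while the private link is up.
isp_routes = [
    Route(SITE_PREFIX, "catpics-peering", local_pref=200),
    Route(SITE_PREFIX, "transit", local_pref=100),
]

before = best_route(isp_routes, SITE_PREFIX)
after = best_route(drop_session(isp_routes, "catpics-peering"), SITE_PREFIX)
print(before.learned_from, "->", after.learned_from)
```

The point is that nobody at the ISP has to do anything. Once the routes learned over the broken link go away, the next-best path takes over on its own.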

The only catch was that the ISP had to have enough bandwidth out to the Internet to absorb this traffic that had been let loose, and the web site had to likewise have enough bandwidth to accept it.
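That catch is just arithmetic, and it has to hold on both ends. A rough sketch, with every number invented for illustration and an assumed rule of thumb that you don't want to run a link past 80% of its capacity:

```python
def can_absorb(shifted_gbps, capacity_gbps, current_gbps, headroom=0.8):
    """True if a link can take the shifted traffic and stay under the headroom limit.

    headroom is an assumed safety margin, not a number from the story.
    """
    return current_gbps + shifted_gbps <= capacity_gbps * headroom


peering_traffic = 30  # hypothetical Gbps that used to ride the private link

# Hypothetical figures for each side's regular Internet connectivity.
isp_ok = can_absorb(peering_traffic, capacity_gbps=100, current_gbps=40)
site_ok = can_absorb(peering_traffic, capacity_gbps=400, current_gbps=150)

# Both ends need room, or the workaround just moves the outage somewhere else.
safe_to_drop_peering = isp_ok and site_ok
print(safe_to_drop_peering)
```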

Long story short, they did, and it worked. The site immediately became responsive for anyone on the troubled ISP. The tweets shifted to say things like "yay! ISP fixed it!", when really, one person 1/3 of the way around the planet had done a little *clickity click* on her keyboard and put things into a workable temporary configuration.

Several hours later, the ISP figured out their problem and contacted the site, and the temporary hack was undone. Traffic went back to the private peering link, and life went on.

What do I want people to take away from this?

When something's down and you don't know how wide the blast area is, you might want to err on the side of caution and assume it's pretty bad. Wake up a few extra people. Get them to take a look. If it turns out to be a smaller problem, then dial it back and send them back to whatever they were doing. Would you rather have the opposite problem where people are afraid to ask for help, and so something burns all night without the right eyes on the problem? I say that because that is exactly what you will get if you cultivate a culture of shaming people for raising the alarm.

As a corollary to the last item, when someone does raise the alarm, stop arguing whether it's a one-alarm, two-alarm, or three-alarm fire. In that moment, it doesn't matter. You can come back and try to figure out a better way to characterize severities after the fire is out. Arguing it right there just shows that you are an annoyance who is not helping to solve the problem.

Incidentally, if you happen to be watching such a situation, see whether the complaining people actually work the problem, or if they find the easy way out and disappear once it's "the ISP's fault", and leave thousands or millions of users in the lurch. Sure, the ISP screwed up, but that doesn't mean that your hands are tied. If you'd just own the problem and focus on the users, you'd see that there was a simple workaround that could only be done from the web site end. The users benefit. (And, well, let's face it, the advertisers do too.)

As a hidden bonus, when the workaround is in effect, the ISP's engineers won't have people breathing down their neck, and they can work the actual problem with some of the stress lifted. That might mean it gets solved sooner.

Ultimately, while it might be nice to say "X screwed up" and leave it at that, I call that the "hot potato" situation, and that's just lazy. Of course, the folks who actually care and hold on to the problem are frequently unrecognized, unappreciated, or even mistreated, so it's no surprise that things carry on in "hot potato" mode.

Finally, a note: I am none of the above people in the story, lest you think this was about me. I just watched it happen.