Writing

Software, technology, sysadmin war stories, and more.

Monday, February 25, 2019

Service X, Destroyer of Worlds

Quite a few years back, I was working on reliability issues for a big concern. It was a place with a bunch of different "properties" -- more than just one web site, in other words. Some of them had been built up from scratch, and others had been acquired.

One in particular was a relatively new addition to the "family", and it was experiencing serious growing pains as part of the process of moving to the parent company's existing infrastructure. In particular, the folks behind it had only done part of the work so far, such that most of it was still hosted on Amazon's servers. Only certain parts had made it over to the parent company's setup, and the two parts had to talk to each other to handle requests.

I should mention that random hosts within the parent company's network could not directly talk to the outside world. It had to do with security, and a bunch of other random best practices. In general, when they needed something from the outside world, they had to go through a proxy. This proxy could do HTTP/HTTPS type requests, but it could also just open a plain old TCP connection and let the client do whatever. It was in this latter "TCP passing" mode that this new service called out to Amazon EC2 to handle its requests.
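The post doesn't say how the proxy exposed its "TCP passing" mode. One common mechanism for this kind of thing is the HTTP CONNECT method, where the client asks the proxy to open a raw tunnel to a host and port and then speaks whatever protocol it likes over it. Here's a minimal sketch of that handshake, assuming CONNECT is what was in play; the function names and the example hostname are made up for illustration.

```python
def build_connect_request(host: str, port: int) -> bytes:
    """Build the request a client sends to ask an HTTP proxy for a raw TCP tunnel."""
    target = f"{host}:{port}"
    lines = [
        f"CONNECT {target} HTTP/1.1",
        f"Host: {target}",
        "",  # blank line ends the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def tunnel_established(response_head: bytes) -> bool:
    """Return True if the proxy's status line carries a 2xx code."""
    status_line = response_head.split(b"\r\n", 1)[0]
    parts = status_line.split(b" ", 2)
    return len(parts) >= 2 and len(parts[1]) == 3 and parts[1][:1] == b"2"


if __name__ == "__main__":
    # Hypothetical backend name; the real one isn't given in the post.
    print(build_connect_request("backend.example.com", 8443))
```

Once the proxy answers with a 2xx status, it just shovels bytes in both directions, which means every tunnel costs the proxy host an outgoing socket of its own.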

Trouble is, they obviously hadn't tried running this thing at scale, with a full load of people trying to push requests at it. On top of that, they tended to launch things for customers exactly at midnight UTC and would go from 0 to 100% instantly. The servers would go crazy trying to make good on these promises to deliver their content as fast as possible. Finally, the people who actually worked on it were many time zones away, and when midnight UTC rolled around, they tended to be asleep, and tended to not answer pages about their service.

This is how I came to find out about their proxying. It was still afternoon where I was, and midnight UTC happened, and then suddenly, nobody in a certain region could get out to the Internet through the proxies. One of the engineers who owned that proxy service looked into it and realized it was being absolutely nailed to the wall by this newcomer service trying to call out to EC2. It was so bad that the host was running out of ephemeral ports. (Yes, this is actually possible, if you set up your sockets a certain way.)
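The post doesn't spell out which socket setup was involved, so take this as a rough back-of-the-envelope model and not a description of what happened. If every outgoing connection goes from the same source address to the same destination address and port, each one needs its own local port, and a port stays unavailable for a while after the connection closes when the local side closes first (TIME_WAIT). The numbers below are assumptions: the usual Linux default ephemeral range of 32768 to 60999 and a 60 second TIME_WAIT.

```python
def max_sustained_connects_per_second(
    port_low: int = 32768,         # assumed: Linux default ip_local_port_range low end
    port_high: int = 60999,        # assumed: Linux default ip_local_port_range high end
    time_wait_seconds: float = 60  # assumed: Linux TIME_WAIT duration
) -> float:
    """Rough ceiling on new connections per second to ONE destination address:port.

    Each connection ties up one local port for at least time_wait_seconds
    after it closes, so the pool refills at usable_ports / time_wait_seconds.
    """
    usable_ports = port_high - port_low + 1
    return usable_ports / time_wait_seconds


if __name__ == "__main__":
    print(f"{max_sustained_connects_per_second():.1f} connections/second")
```

Under those assumptions the ceiling is only a few hundred short-lived connections per second to a single destination, which a service going from 0 to 100% at midnight could plausibly blow through.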

Complicating matters, we had no way to control it, and no working way to bring the people online who supposedly knew what was up. They were thousands of miles away and probably out cold somewhere.

Since the connections to the proxies were seemingly coming from everywhere internally (the joys of distributed job runner systems), we couldn't block it that way. I found the only effective thing was to get on the proxies and block the outgoing connections by forcing them to fail fast. Yep, we got lucky and noticed that all of their connections were to a single TCP port that nobody else was likely to need right then. This worked well enough to let the machines catch up and start serving requests.
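To see why "fail fast" matters here, compare it with the alternative of silently dropping packets: a dropped connection attempt sits there holding a socket until the client times out, while a refused one comes back with an error right away and frees its resources. The sketch below only demonstrates the refused case against the local machine; it is not the block that was put on the proxies, and the helper name is made up.

```python
import socket
import time


def time_refused_connect(timeout_seconds: float = 5.0):
    """Connect to a local port with no listener and report (outcome, elapsed seconds)."""
    # Bind to port 0 so the kernel picks a free port, then close it again.
    # Nothing is listening there afterwards, so a connect should be refused.
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    probe.bind(("127.0.0.1", 0))
    port = probe.getsockname()[1]
    probe.close()

    start = time.monotonic()
    try:
        socket.create_connection(("127.0.0.1", port), timeout=timeout_seconds).close()
        outcome = "connected"
    except ConnectionRefusedError:
        outcome = "refused"
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    outcome, elapsed = time_refused_connect()
    print(outcome, f"{elapsed:.4f}s")
```

The refusal arrives long before the timeout would have fired, so a client that gets refused gives its socket back almost immediately instead of piling up.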

We dropped a SEV on them and basically said "please don't do this again". Time passed. It happened again, just like before, complete with me filtering the ports to bring back the rest of the service, and breaking them in the process. So, we dropped another SEV on them.

My patience for this had been running thin. They were just running forward as fast as they could, consuming resources, and then not cleaning up their messes, only to have the same outage happen again. Some of their people were downright unfriendly when I tried to talk to them in a reasonable way on IRC.

Finally, one afternoon, a teammate was telling me about their latest adventure. It turns out this new team had discovered a different service which also used proxies, and managed to boil all of them, too, thus screwing up way more than just themselves. They broke a bunch of services in this other region. In response, I did something that's not the greatest of things to do, but I did it anyway. I set the topic on my team's IRC channel to something approximately like this:

$new_service, Destroyer of Worlds

... only, you know, the actual name of the service was in there.

I should point out that people joked around about stuff in there a lot. The "topic" on the channel wasn't used for anything official. It would frequently have a quote from someone, or some random pithy saying about stuff that had been going on. This was just another one of those cases.

We were the folks who tended to take the brunt of whatever random things happened to break on a day-to-day basis, and that frustration sometimes came out in ways that weren't exactly the most diplomatic. It was always related to some badness that had happened, but some people don't know or care about context.

One of them was my manager at the time. During our next 1:1, he basically took me to task for putting that in there. Yep, that's right, I was in trouble for "making the team look bad" or words to that effect. There's not much point in attempting to describe the meeting beyond that. His position was that it was unacceptable, and my position was "sometimes these things are cries for help". The only real outcome of that meeting was that I retreated further into my little world of making sure he didn't know what was really on my mind about matters of reliability there.

Like I said, this whole thing happened years ago, and I always remembered it as something that bothered me. I'm pretty sure other people had said things of equal or greater "impact" about services and/or products which had terrible reliability issues, but they seemed to do fine. Meanwhile, my venture into that space was met with immediate reprisal. Why can they say it and I can't?

Anyway, over time, they managed to get most of their badness ironed out, and the service managed to run without boiling the proxies. They still had some ridiculous arrangements that let them threaten the reliability of the entire site because someone let them operate with no restrictions on resource (RAM/CPU/wall time) consumption. Instead of doing the right thing, they just imposed on other teams and ran their costly reports on the shared infra. Running machines out of memory? Screwing up all of the other web requests on that server? Not their problem!
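For a sense of what restrictions on RAM, CPU, and wall time can look like, here is a minimal sketch for a POSIX system using Python's standard `resource` and `subprocess` modules. It is not how that company's shared infra worked (the post doesn't say); the function name and the limit values are made up, and note that `RLIMIT_AS` caps address space, which is only a rough stand-in for RAM.

```python
import resource
import subprocess
import sys


def run_limited(argv, max_address_space_bytes, max_cpu_seconds, max_wall_seconds):
    """Run argv with memory, CPU, and wall-clock caps.

    Returns the child's exit code, or None if it hit the wall-clock limit.
    """
    def apply_limits():
        # Runs in the child just before exec, so the limits apply only to it.
        resource.setrlimit(resource.RLIMIT_AS,
                           (max_address_space_bytes, max_address_space_bytes))
        resource.setrlimit(resource.RLIMIT_CPU,
                           (max_cpu_seconds, max_cpu_seconds))

    try:
        done = subprocess.run(argv, preexec_fn=apply_limits,
                              timeout=max_wall_seconds, capture_output=True)
        return done.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return None


if __name__ == "__main__":
    # Hypothetical "costly report": sleeps past its 1 second wall-time budget.
    report = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_limited(report, 1 << 30, 5, 1))
```

With something like this in front of a job, a runaway report takes out itself instead of the other requests on the same machine.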

A year or two went by, and then the company killed the project. All of that money in the acquisition, the work on the tech, the screwing up of other systems by hammering the shared infra, and so on went down the toilet. I figured good riddance to bad rubbish, and guessed I had gotten lucky about calling it a dud.

That's about where the story ended for a very long time. Then I had a chance encounter with someone who overlapped with me at that same company. Yep, they had been there when I was there, and they had worked in the same office as $new_service. In fact, they were there when the acquisition happened, and ... they wound up working on that service!

Now I could finally find out if that team had ended up hearing about what I had said, and if they hated me and/or my whole team for it, like my manager implied. I asked this person straight up: does "$service, Destroyer of Worlds" mean anything or ring any bells? They said ... no, not really, but it's entirely accurate! They continued and told me horrible things about how bad it was, and how they never should have been acquired. They told me about the cronyism in the ranks, and about managers gaslighting their people by saying "you're the only one complaining about X", even though everyone was complaining about X. X, you see, was the significant other of some key person who they were protecting. They mentioned how useless the service was as a product, and how it was mostly delivering spam anyway. It obviously wasn't making money.

Basically, in the span of five minutes, this person managed to tell me that I hadn't pissed off that whole team, and in fact, the thing WAS terrible, and probably deserved the moniker "destroyer of worlds". Just like that, an old memory went from a relative annoyance to "I KNEW IT".

What a feeling.

I wouldn't recommend doing what I did. If you're in this line of work, find two or three trusted friends, form a secret support group or channel on the chat service of your choice, and keep your snarky feelings inside the group. The alternative is just asking for trouble.