Software, technology, sysadmin war stories, and more.
Sunday, February 9, 2020

Trying to sneak in a sketchy .so over the weekend

I've seen more than a couple of bad rollouts, including some so broken they almost burned down the entire world. Some made it to production, and others were only barely stopped because someone happened to be there to put a foot down and refuse.

This story starts on a Sunday afternoon much like today. Around 3:30 in the afternoon local time, some random person who had never done anything to the frontend servers which supported the entire business showed up on the company chat system. They needed help gaining access to those frontend machines in order to get a new ODBC (database client library) package installed. As they said themselves, "it was the first time [they] were dealing with something like this".

A perfect thing to do in production for the first time with nobody else around on a Sunday afternoon, right? I should point out that this wasn't in a place where they work Sun-Thu (hi TLV!), but a fairly boring US-based company where Mon-Fri is the norm. Everybody else was out enjoying their second day off, and maybe dreading the return to work that next morning.

But no, not this person. They decided it was time to ship this thing everywhere. They first popped up in a spot where people who had root on machines for the purposes of managing the OS hung out, but since it was Sunday, nobody was there. Bullet #1 dodged.

Maybe 20 minutes later, someone else popped up on the "general" chat spot used by this company to look for help. This one had a much higher chance of finding someone around even on a Sunday afternoon, since that company had a bunch of people who used to lurk just for fun. In particular, the folks who got bored around the house and ran towards production disasters in order to help out would usually keep a console open just in case that exact thing happened. I was one of those people, and so there I idled, along with several others.

(What can I say, it wasn't exactly the healthiest work-life balance on my part or others, but it kept us amused.)

So person number two pops up and asks: what's the best way to install a package on [every single frontend machine at the company]? They clarified and said: what is the identifier used by the system management stuff (think Puppet, Chef, Ansible, that kind of thing) that will select every single machine of that type at the same time?

Several hours later, I guess I wandered by the monitor, and since nobody else had said anything, I decided to jump in.

no, just no. post to (group) for a proper conversation, but I'll warn you right now: adding stuff to [the entire frontend of the whole business] is rarely the answer.

(group), naturally, was a route to talk to the entire team instead of just the single member who happened to be lurking on the chat system right then. It was also a great way to make sure that certain spooky manager types saw it in case we needed them to drop bricks on people to make it not happen in the name of reliability.

Even though it was several hours later, the second person was still there, and they came back, asking what the best way to "resolve such dependencies" was. They still saw it as a problem of somehow getting every single frontend to link to their shared library so it could call out directly to their new database thing.

I told them that we normally build services for things that don't speak the company's approved transports (think gRPC, Thrift, SOAP, whatever). That way, the frontends don't have to learn every demented new thing people decide to bring into the company. They only have to speak to another instance of the same type of client, and the people who brought in the new thing get to write the interface.
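The shape of that pattern can be sketched in a few lines. This is a hypothetical illustration, not the company's actual code: the class and method names are invented, and the "backend" is a stand-in for whatever the new ODBC client actually did. The point is the ownership boundary: exactly one small service links the new library, and the frontends only ever talk to that service over the approved transport.

```python
class OdbcBackend:
    """Stand-in (invented for illustration) for the team's new
    database client library. Only one process ever links this."""

    def run(self, query):
        # Pretend this calls out through the sketchy .so.
        return [("row", 1)]


class ReportService:
    """The single service that owns the weird dependency. Frontends
    call lookup() over the company's approved transport (gRPC,
    Thrift, etc.) and never load the ODBC library themselves."""

    def __init__(self, backend):
        self._backend = backend

    def lookup(self, query):
        # Input validation, timeouts, caching, and rate limiting
        # would all live here, owned by the team that brought in
        # the dependency -- not by the frontend owners.
        return self._backend.run(query)


service = ReportService(OdbcBackend())
print(service.lookup("daily_report"))
```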

The trick, of course, is that the approved transports were battle-hardened, and had all kinds of neat protections in them to keep them from breaking the frontend when a backend service went down. In particular, imagine the situation where a backend is toast, and their timeouts are set to 5 seconds. Every single request will now take at least 5 seconds instead of completing relatively quickly. At peak loads, that could chew up every available thread/worker in the frontends, and then the whole site would go down. Bye bye business.
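The arithmetic behind that failure mode is worth making concrete. The worker and traffic numbers below are invented for illustration; only the 5-second timeout comes from the story. By Little's law, the number of in-flight requests is roughly arrival rate times time per request, so a dead backend multiplies the workers you need by a factor of 100.

```python
# Back-of-the-envelope sketch of thread-pool exhaustion.
# WORKERS_PER_HOST and PEAK_QPS_PER_HOST are assumed numbers.

WORKERS_PER_HOST = 200      # assumption: worker threads per frontend
PEAK_QPS_PER_HOST = 300     # assumption: requests/sec at peak
BACKEND_TIMEOUT_S = 5.0     # the timeout from the story

# Healthy backend: calls finish in, say, 50 ms.
busy_healthy = PEAK_QPS_PER_HOST * 0.050             # 15 workers in flight
# Dead backend: every call stalls for the full 5 s timeout.
busy_broken = PEAK_QPS_PER_HOST * BACKEND_TIMEOUT_S  # 1500 workers needed

print(f"healthy: {busy_healthy:.0f} of {WORKERS_PER_HOST} workers busy")
print(f"broken:  {busy_broken:.0f} workers needed -> pool exhausted")
```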

That's why some clever person had added a client-side "gate" to the transport. If it saw that too many requests to some service were timing out, it would just start failing all of them without waiting. Sure, this made some feature of the service disappear from people's experiences, but it was better than taking down the whole thing!
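That "gate" is what's generally called a circuit breaker. Here's a minimal sketch of the idea, assuming a trip threshold and cool-off period (both invented numbers): after enough consecutive timeouts it stops waiting and fails fast, then lets a probe request through once the cool-off elapses.

```python
import time


class CircuitBreaker:
    """Minimal circuit-breaker sketch: fail fast once a backend
    looks dead, instead of tying up a worker for the full timeout."""

    def __init__(self, max_failures=5, reset_after_s=30.0):
        self.max_failures = max_failures    # assumption: trip threshold
        self.reset_after_s = reset_after_s  # assumption: cool-off period
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                # Gate is open: fail immediately, no 5-second wait.
                raise RuntimeError("circuit open: failing fast")
            # Cool-off elapsed: let one request through to probe.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except TimeoutError:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success resets the count
        return result
```

A raw ODBC client, by contrast, would just sit there blocking a worker thread for the full timeout on every single request.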

By way of comparison, this rando's brand new ODBC client almost certainly would not have any notion of doing that correctly.

The second person saw this and responded with a kind of honesty that was both refreshing and shocking. They said, more or less:

Thanks. I think adding a [transport layer] is a good longer term option. We are facing a very tight product launch timeline (Monday). Curious if there is any fast short term solution other than adding packages to [the entire company's front end servers].

Launch deadline... of Monday. Remember where I said this had started Sunday afternoon? It was now after 7 PM, and they wanted to ship something brand new that couldn't have been tested, didn't have any protective features, was a complete unknown binary blob, and they wanted it to go everywhere to production in order to use it ... TOMORROW?

I repeated that they definitely should post that to the group mentioned before, and also hinted that their deadline would probably meet substantial challenges from the people who are responsible for reliability.

Fortunately, it never shipped... to the production frontend.

A couple of months passed. Then, one day, the internal support pages used by employees to get things done all fell over. This was the internal equivalent of that production frontend, and it was used for everything from tracking bugs to seeing what was for lunch in the area that day. When it went down, most forward development progress ground to a halt.

Upon investigation, someone found a dashboard which had been written to speak directly to this new database. They had managed to slip in their new ODBC dependency and started using it. This was complete insanity, since this thing was not really a database in any normal sense of the word. A request to it might take ten minutes to run, and that's when it's being fast. It's NOT like a MySQL or Postgres instance. It's for ridiculous corporate reporting stuff where you might grovel through 100 TB of data every single time. (Yes, really.)

When a query started running, the code would sit there waiting on the server, holding a worker thread, and with it the entire rest of the internal frontend, hostage. If the user got tired and reloaded the page, it would tie up another thread, and then another, and so on until they had nuked the entire set of machines and the (internal) site went down.

As someone who knew better put it, "if you ask [the db] more than 2-3 questions in a second, it hangs". Does this make sense to have it directly accepting connections from a fleet of between XXXXX and XXXXXX frontend machines? Definitely not.

What's a better way to handle this? At a healthy company that likes building solutions, it could have gone like this. At some earlier date, NOT the day before launch, they could have asked the owners of the frontend servers how best to solve their problem.

They would have been asked a few questions, and when it turned out that they were able to accept perhaps 5 queries per minute, they would have been asked to put something up to protect themselves from the frontends. It would probably need some way to cache expensive queries, and it definitely would not be allowed to stay there waiting. Clients would have to "take a number" and check back for results, or use the growing "push" mechanism that the company was working on.

Those engineers would have seen it as an opportunity to do something new and interesting, and would have found a way to get them to "yes". Instead, since it all happened the wrong way at the worst possible time, it was everything they could do to throw up the flags and stop them from doing harm. I imagine they did not in fact ship "on time".

There's a time and a place to dig in your heels and say NO to something. It usually comes when someone else has been reckless. If you do it for everything, then either you're the problem, or you're surrounded by recklessness. I think you can figure out what to do then.