Writing

Software, technology, sysadmin war stories, and more.

Saturday, October 30, 2021

Now you can (try to) serve five terabytes, too

Almost ten years ago, I wrote an obscure post about something that would have only made sense if you also worked at Google at the time it happened (around 2010). It was a reference to an Xtranormal video that someone created about the perils of trying to get your stuff running in production without having someone hate on you for "doing it wrong".

For those who never encountered it at the time, Xtranormal was a site where you could type in a script and it would do text-to-speech and actually animate some goofy characters to lip-sync the words for you. People would do this and make them say all kinds of crazy things. Someone decided to do one about how hard it was to do something which shouldn't have been a big deal - serving five terabytes of data internally.

When it came out, I took a screenshot of the long-suffering engineer (some kind of red panda/fox girl thing) and had a T-shirt made from it. I wore it to work a few times to great effect. It was our own little internal joke within the company.

Well, recently, it surfaced again, but this time, someone released the video! It's now on the outside and it's open season. I recommend firing up your favorite youtube-dl type tool and scarfing this down before something happens to it.

Ready? Here it is.

(Note: I had nothing to do with the making of this video... or the release of it for that matter!)

As for everything in the video, it is really close to what the reality of production was back in the day. Things were actually quite a bit worse than what it suggests, if you can believe it. The whole thing about "get quota in two cells" misses the fact that it was damn near impossible to get the quota you needed across multiple services (compute, storage, ...) in the same place at the same time.

Get compute in location A and storage in location B? You now have twice as many failure points, you might be exposed to twice as many scheduled maintenances (each location was "in zone" for a week every quarter), and you probably just added latency and made it harder to reason about what's where and how it all works.
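To put rough numbers on that, here's a back-of-the-envelope sketch. The only figure taken from above is the one-week-per-quarter maintenance window per location. Everything else is an assumption made up for illustration: the 13-week quarter, the idea that the two windows don't overlap, the 99% per-location availability, and the function names themselves.

```python
# Back-of-the-envelope math for splitting a service across locations.
# Assumptions (not from any real system): 13 weeks per quarter,
# non-overlapping maintenance windows, independent failures.

WEEKS_PER_QUARTER = 13          # assumption: a quarter is about 13 weeks
MAINTENANCE_WEEKS_PER_SITE = 1  # from the post: one week "in zone" per quarter


def maintenance_exposure(num_locations: int) -> float:
    """Fraction of the quarter in which at least one location you depend on
    is inside its maintenance window, assuming the windows never overlap."""
    exposed_weeks = min(num_locations * MAINTENANCE_WEEKS_PER_SITE,
                        WEEKS_PER_QUARTER)
    return exposed_weeks / WEEKS_PER_QUARTER


def combined_availability(per_location: float, num_locations: int) -> float:
    """Chance that every location is up at once, assuming failures are
    independent and the service needs all of them."""
    return per_location ** num_locations


if __name__ == "__main__":
    for n in (1, 2):
        print(f"{n} location(s): "
              f"maintenance exposure {maintenance_exposure(n):.1%}, "
              f"availability {combined_availability(0.99, n):.2%}")
```

The point isn't the exact numbers (the 99% is invented), just that both of them move in the wrong direction the moment compute and storage end up in different places.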

So, if you heard stories about software engineers doing shady deals to trade quota between teams so they could get all of the stuff in the same spot? Yep, this is where those came from. That's the kind of thing that was happening. People had to blow time and energy on worrying about that kind of thing instead of working on whatever they were supposed to be doing.

My own "solution" to it after far too much thrashing was just to say "we can't get all N types of quota in the same place so we are at the mercy of whatever happens to be available, and if that dries up, we stop running". Granted, this was for some internal stuff that was seven or eight levels removed from anything that anyone on the outside might ever see, but still, it was stupid and made me feel so dirty.

I'm sure my non-solution bit someone later. Sorry, whoever.