Writing

Software, technology, sysadmin war stories, and more.
Wednesday, May 6, 2020

A bunch of rants about cloud-induced damage

I have a bunch of random complaints about where things have gotten of late with this "cloud" business. It seems like far too many people think that running virtual machines on other people's hardware is the only way to do anything. It doesn't absolve you from actually being good at this stuff, and frequently complicates matters to where you can never truly shine.

I'll say it up front: I prefer physical hardware. I know exactly what I'm working with, and there's no spooky stuff going on when I try to track down a problem. If the disk is slow, I can find out why. If the network is being odd, I can find out why. There's no mysterious "host system" that lives in someone else's world which is beyond my reach.

If you want satisfying answers to life's mysteries about production environments, you need to be able to get to the bottom of things. Every time you introduce some bit of virtualization, you have just reduced the likelihood of finding an answer. This gets even worse when it crosses domains into another company or vendor or whatever.

At that point, you can only drop your standards, and accept more uncertainty, badness, outages, and other strange behaviors. You have to just live with the fact that some things may never truly get better. Just this element right here will chase off some folks. They will either join up and then peace out, or never join in the first place.

I should mention that introducing these "troubleshooting singularities" where no reasoning can exist to solve a problem also creates a great opportunity for charlatans, fraud, and general incompetence. Why? That's easy. Every time something bad happens, you just blame it on something at one of those known trouble spots. People love their scapegoats.

...

There's another "fun" aspect to all of this virtual machine, cloud, and "elastic scaling" business: the incredible money sink that can be created. It seems like everyone sets up some kind of rule that will just keep an eye on CPU utilization and then will "helpfully" stand up more instances any time it goes past a certain point.

The number of terrible behaviors this enables is just incredible. It has to be seen to be believed. What do I mean? Ah yes, come with me on a little trip down memory lane.

I worked at a place that had a bunch of physical machines. It also had a whole group that did nothing but plan and allocate capacity. In other words, that group made sure that people got machines when appropriate, not just because they asked for them. Also, they weren't ignorant bean-counters, either.

The real driving force behind that team's allocation efforts was a badass engineer who was a friend of mine. That person alone probably saved the company multiple buildings' worth of machines over a tenure of almost a decade there.

How did this work? Easy. Every time some team wanted a whole whack of new hardware, my friend went out and took a look to see how they were using the stuff they already had. It wasn't just about CPU utilization, either. My friend wanted to know HOW their code was using the CPU. If it was doing stupid and inefficient things, it would come out in the wash when they ran the analysis. They could see where this stuff was spending its time. Profiling is a thing, after all.

This had a nice effect of forcing teams to be honest and forthcoming about their needs, and to not try to be wasteful and then think that more hardware would save them. They didn't get to spend the company's money at will. My friend and the team she represented kept things from becoming a free-for-all.

Some of these "cloud" places, though, don't have that kind of oversight. They build their systems to just auto-scale up, and up, and up. Soon, there are dozens, then hundreds, then even thousands of these crazy things. If they write bad code, and their footprint goes up every day, they don't even realize it. Odds are, the teams themselves have no idea how many "machines" (VMs in reality) they are on, or how quickly it's changing.

There's no equivalent of my friend to stand there and tell them to first optimize their terrible code before they can get more hardware. The provisioning and deploy systems will give it away, and so they do.

What usually happens is that someone will periodically run a report and find out something mind-blowing, like "oh wow, service X now runs on 1200 machines". How did this happen? That's easy. Nobody stopped them from needing 1200 machines in their current setup. People wring their hands and promise to do better, but still those stupid instances sit there, burning cash all day and all night.

...

There's also fun in the other direction. Someone might set up a rule to "scale down" their job once it falls below some CPU utilization. Maybe they tell the scaler to lower it a notch every time the utilization is below 10% or something like that.

What do you suppose happens the first time it's very quiet, it's down at ONE instance, and then that one instance manages to come in under 10%? If you said "the autoscaler lowers it to zero", you'd be right! Yep, without a minimum size limiter in place, that's what you get.
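Here's a rough sketch of what that limiter amounts to. The function name, the 10% threshold, and the default floor of one instance are all made up for illustration; a real autoscaler will have its own knobs for this.

```python
def next_instance_count(current, cpu_utilization,
                        scale_down_below=0.10, min_instances=1):
    """Return the instance count after one scale-down evaluation.

    All names and defaults here are hypothetical, not any vendor's API.
    """
    if cpu_utilization < scale_down_below:
        # Lower it a notch, but never go below the configured floor.
        return max(min_instances, current - 1)
    return current


if __name__ == "__main__":
    # One quiet instance stays at one instead of dropping to zero.
    print(next_instance_count(1, 0.05))
    # Without a floor (min_instances=0), the same input scales to nothing.
    print(next_instance_count(1, 0.05, min_instances=0))
```

The whole fix is the `max()` call. Leave it out (or set the floor to zero) and the quiet-night scenario plays out exactly as described.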

What's more amazing is that apparently this can happen and then not immediately "fix" itself. You'd think it would notice that it needs SOME capacity to run things, and then get in a tug-of-war, going from 0 to 1 to 0 to 1 and so on all night long.

Personally, I think whatever was measuring the CPU utilization probably divided by zero once there were no more instances running, and couldn't run past that point. Not dealing with that kind of scenario seems par for the course.
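That's only a guess, but if it is what happened, the shape of the bug and the fix would look something like this sketch. Both function names are hypothetical, and "no instances" is deliberately reported as "no data" rather than "0% busy".

```python
def average_cpu(samples):
    """Average per-instance CPU utilization.

    Returns None when there are no instances at all, instead of
    blowing up with ZeroDivisionError on sum(samples) / 0.
    """
    if not samples:
        return None
    return sum(samples) / len(samples)


def needs_bootstrap(samples, pending_work):
    """True when nothing is running but there is still work to do."""
    return average_cpu(samples) is None and pending_work > 0


if __name__ == "__main__":
    print(average_cpu([]))          # no instances, so no measurement
    print(needs_bootstrap([], 5))   # work is waiting with nothing to run it
```

The point is that zero instances is its own state and needs its own code path. A scaler that only knows "utilization is high" and "utilization is low" has nothing sensible to do with it.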

...

Then there's a whole other neat situation that can happen. I talked about this elsewhere, but for the sake of completeness will give the summary here. You have a service that has to be up. It dies. Traffic to everything else drops, so the CPU utilization out there drops, too. Those other services scale down.

At some point, the essential service starts working again. Traffic floods back, and ... uh oh! Those other services are far too small now, and can't handle the load. Worse still, they've been written in the most fragile ways you can imagine, and so they can't even ignore extra traffic. They stupidly try to take on everything that's coming in, and have no notion of rejecting the stuff they have no hope of handling.

Now they're completely saturated, getting nothing done, and the CPU is running like crazy. The auto-scaling stuff notices, but it might take a good 10-15 minutes to get more instances running. This is not a joke. I've seen this happen and it's not pretty.

Basically, you should be able to point the equivalent of a flamethrower at your service and have it keep going. It doesn't have to accept the extra requests. It should just do the ones it would be capable of doing anyway, and ignore or reject the rest as appropriate. That way, at least it can start absorbing SOME of the load instead of just becoming part of the problem.
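A minimal sketch of that kind of load shedding, assuming a threaded server and a fixed cap on in-flight requests. The class name, the cap, and the "rejected" result are illustrative choices, not anything from a particular framework.

```python
import threading


class LoadShedder:
    """Run at most max_in_flight requests at once; reject the rest."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, request, handler):
        # Non-blocking acquire: if every slot is busy, say no right away
        # instead of queueing work that has no hope of finishing in time.
        if not self._slots.acquire(blocking=False):
            return ("rejected", None)
        try:
            return ("ok", handler(request))
        finally:
            self._slots.release()


if __name__ == "__main__":
    shedder = LoadShedder(max_in_flight=2)
    print(shedder.handle("hello", str.upper))
```

A rejected request costs almost nothing, so the instance keeps finishing the work it did accept no matter how hard it's being hit.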

Think about it: if any one instance of your service will fall over and crash when presented with enough load, that will then rebalance the load onto the others. One of those will then boil, and it'll go down, rebalancing the load onto even fewer instances. This will continue until the whole thing is a molten lump of slag.
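The arithmetic is easy to play with. This toy model assumes the load is spread evenly and that an instance crashes the moment its share goes over some limit; the numbers and function names are invented for illustration.

```python
def survivors(total_load, instances, crash_above):
    """Count instances left once the crash-and-rebalance cascade settles."""
    while instances > 0 and total_load / instances > crash_above:
        # One instance boils over; its share lands on the rest.
        instances -= 1
    return instances


def served_with_shedding(total_load, instances, capacity_each):
    """Load actually served if each instance just rejects its excess."""
    return min(total_load, instances * capacity_each)


if __name__ == "__main__":
    # 10 instances that each crash above 90 units, offered 1000 units.
    print(survivors(1000, 10, crash_above=90))
    print(served_with_shedding(1000, 10, capacity_each=90))
```

With those numbers the crashing version ends at zero survivors, because every crash only raises the per-instance share for whoever is left. The shedding version keeps serving 900 of the 1000 units.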

You can tell you're in a situation like this when you rig up horrible things like "turn off the load balancer until we can restart the service". Or, maybe you've "set the network stuff to drop 50% of the traffic until we can catch up". If you've ever had to do this, you're running incredibly fragile stuff! Likewise, if you've ever had to worry about "starting everything at once because otherwise the first one up will get boiled and will crash and then the second and third...", let me be the first to break it to you: you're already in the bad place.

[Side note: if this sounds familiar, it's because I had a customer about 15 years ago who absolutely refused to fix their terrible code and just wanted better tooling to start and stop their stuff in parallel. The people on my side weren't looking for an argument and so gave them exactly that. They were happy because they didn't have to think about writing better code. Is that what you want from your stuff?]

There are so many ways out of a situation like this. Here's just one. You can temporarily stop checking your listening socket(s) for new connections. This means you stop calling accept(). The listen queue will back up, and then something interesting will happen. Assuming Linux, it'll probably stop emitting SYN/ACKs for connection attempts after that point. You might do some magic to make it actively RST those connections so clients fail fast and go somewhere else right away. Whatever.
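Here's a small sketch of the "stop calling accept()" part, using plain sockets. The class name and the `paused` flag are hypothetical, and deciding when to flip that flag is left to whatever overload signal you trust.

```python
import select
import socket


class PausableListener:
    """A listening socket whose accept() calls can be switched off."""

    def __init__(self, host="127.0.0.1", port=0, backlog=8):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        self.sock.bind((host, port))
        self.sock.listen(backlog)
        self.sock.setblocking(False)
        self.paused = False

    @property
    def address(self):
        return self.sock.getsockname()

    def poll(self, timeout=0.1):
        """Accept at most one pending connection, or return None."""
        if self.paused:
            # Overloaded: leave new connections sitting in the kernel's
            # listen queue rather than taking on more work.
            return None
        readable, _, _ = select.select([self.sock], [], [], timeout)
        if not readable:
            return None
        try:
            conn, _addr = self.sock.accept()
        except BlockingIOError:
            return None
        return conn

    def close(self):
        self.sock.close()
```

While `paused` is set, the process does no per-connection work at all. What clients see once the backlog fills up depends on the operating system and its settings, so test that part on the actual platform.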

When things calm down, you maybe start looking for those new connections again. Maybe you do something neat and process the newest ones first, since the oldest ones have probably burned most of their timeout already and won't stick around to finish the job anyway. This is what I meant by "selective LIFO" in a post from a few months back. It's the kind of thing people tend to think about when they're in an "RPC" mindset and not a "web" mindset.
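One possible reading of the newest-first idea, as a sketch: keep arrival times, throw away anything that has already outlived the client's timeout, and hand out the most recent request. The class name, the single fixed timeout, and the injectable clock are all assumptions made for this example, not a description of the earlier post.

```python
import collections
import time


class NewestFirstQueue:
    """Serve the newest request first and drop ones that have expired."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._items = collections.deque()  # oldest on the left
        self._timeout = timeout_seconds
        self._clock = clock

    def put(self, request):
        self._items.append((self._clock(), request))

    def get(self):
        """Return the newest unexpired request, or None if there is none."""
        now = self._clock()
        # The oldest entries expire first; their clients have given up.
        while self._items and now - self._items[0][0] >= self._timeout:
            self._items.popleft()
        if not self._items:
            return None
        return self._items.pop()[1]
```

Under light load this behaves about the same as any other queue, since there's rarely more than one thing waiting. It only matters once a backlog builds up, which is exactly when you want to spend your effort on clients who are still listening.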

Want a really rough example of throttling? sendmail, the mail daemon which made life interesting for some of us for far too long, had this thing where it kept tabs on the machine's load average. If it got up over the first notch, it would still accept connections, but it would wave them off with an error. If it got past the second notch, it would actually stop listening to the network! That's right, it would actually close the listening socket to forcibly reject any connection attempts.
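The same two-notch idea is only a few lines in any language. This sketch is not sendmail's code or its configuration; the threshold values and names are placeholders, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os

ACCEPT = "accept"
REFUSE_WITH_ERROR = "refuse_with_error"
STOP_LISTENING = "stop_listening"


def throttle_action(load1, first_notch, second_notch):
    """Pick a behavior from the 1-minute load average and two thresholds."""
    if load1 > second_notch:
        return STOP_LISTENING       # close the listening socket entirely
    if load1 > first_notch:
        return REFUSE_WITH_ERROR    # accept, then wave the client off
    return ACCEPT


def current_action(first_notch=8.0, second_notch=12.0):
    """Apply the rule to this machine's load; thresholds are hypothetical."""
    load1, _load5, _load15 = os.getloadavg()
    return throttle_action(load1, first_notch, second_notch)
```

Load average is a blunt instrument, but the structure is what matters: one cheap check before taking on new work, with a polite "no" first and a hard "no" after that.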

...

That was more of a stream-of-consciousness than I intended, but all of it needed to get out there. Odds are, I'll come back to some of these points and spin them out into their own tales of what to do, what not to do, and how to tell when your team or company has already dug itself into a hole.

Spend the time to care about this stuff. Resources aren't free.