Control systems, yes, but we have to learn to crawl first
Yesterday's post about elastic service scaling and some of the problems I've seen with it has generated some responses already. I want to speak to some of them here.
A few folks mentioned PID controllers. Now, I'm not a Real Engineer, merely one of these computer-wrangling pretenders, but I am actually familiar with those things. I've worked at multiple places that get lots of mileage out of them. For example, one such service makes the very most of the CPU time on their machines, recognizing that a cycle idled is a cycle wasted.
If they have enough CPU time to go around, they will turn on all of the fancier features in their ranking software. This ensures you get served the very best ads, memes, cat pictures, updates from family members, portraits of just-served plates of food, and all of that other good stuff. Why do they do it? Because it makes them money -- lots and lots of money.
When load gets too high, perhaps because a lot of user traffic is arriving, they automatically scale back on those features. This happens on a per-process basis, and doesn't require a whole lot of coordination. The individual aggregation points and/or leaf nodes can figure this out for themselves a whole lot faster, and do. The quality of choices drops a little, and the CPU utilization drops back a bit, but they handle the load.
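If you want to picture how that kind of per-process shedding works, here's a rough sketch in Python. Everything here is made up for illustration -- the feature names, the thresholds, the /proc/stat sampling -- but the shape is the thing: sample your own CPU, shed the most expensive feature when you're running hot, and bring one back when you've cooled off. No coordination with anyone else required.

```python
import time

# Hypothetical feature tiers, cheapest first. These names are
# illustrative, not anyone's real ranking pipeline.
FEATURES = ["basic_ranking", "personalization", "fancy_reranker", "ml_scoring"]

CPU_HIGH = 0.85   # above this, shed a feature
CPU_LOW = 0.60    # below this, turn one back on

def cpu_utilization(interval=0.5):
    """Rough per-host CPU utilization via /proc/stat (Linux only)."""
    def snapshot():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        # fields[3] is idle, fields[4] is iowait
        return fields[3] + fields[4], sum(fields)
    idle1, total1 = snapshot()
    time.sleep(interval)
    idle2, total2 = snapshot()
    return 1.0 - (idle2 - idle1) / (total2 - total1)

class FeatureGovernor:
    """Turn expensive features on and off based on local CPU load.
    Each process decides for itself -- no cluster-wide coordination."""
    def __init__(self):
        self.level = len(FEATURES)  # start with everything enabled

    def tick(self, util):
        if util > CPU_HIGH and self.level > 1:
            self.level -= 1          # shed the most expensive feature
        elif util < CPU_LOW and self.level < len(FEATURES):
            self.level += 1          # room to spare: re-enable one
        return FEATURES[:self.level]
```

Note that it only moves one step per tick: that keeps the response smooth instead of slamming from "everything on" to "everything off" in one go.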
I also know about hysteresis. I usually use an example of a thermostat for that one, particularly the neat old ones with the bimetallic strips for triggering behaviors. If you've never played with one of those, you owe it to yourself to track one down. Just twist it around and enjoy the solid THWOCK every time it decides to change states. Then crack it open and see how simple it is.
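In code, the same idea looks something like this -- a little sketch, not anyone's real thermostat. The point is the two separate thresholds: with a single set point, the heater would chatter on and off every time the temperature grazed it, and the gap between "turn on" and "turn off" is exactly what the bimetallic strip gives you for free.

```python
class Thermostat:
    """Bang-bang controller with hysteresis, like the old
    bimetallic-strip thermostats: the on and off thresholds are
    deliberately separated so the system doesn't flap around
    a single set point."""
    def __init__(self, setpoint, band=1.0):
        self.on_below = setpoint - band / 2   # heater turns on below this
        self.off_above = setpoint + band / 2  # ...and off above this
        self.heating = False

    def update(self, temp):
        if temp < self.on_below:
            self.heating = True       # THWOCK: strip snaps closed
        elif temp > self.off_above:
            self.heating = False      # THWOCK: strip snaps open
        # between the thresholds, keep doing whatever we were doing
        return self.heating
```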
The problem is that you can't just introduce stuff like this in an environment where people think that Python web services are the best way to run everything. I'm talking about thundering herds the likes of which we haven't seen since Microsoft trounced Linux in the web server benchmarks 20-plus years ago.
Why? Because people have gotten away with operating at a relatively high level of abstraction. They can throw machines (and SO much money) at the problems, and they will mostly get away with it. Who cares if you have a box that can only handle four concurrent requests at a time? Who cares if there's no notion of timing out requests that have been sitting on a queue, so they get executed anyway even though the client is going to bail on you, or maybe already has?
What's strace? What's a syscall? Why do we need 1000 of these things? If you can afford it, you never really ask these questions. You just carry on and worry about something else instead.
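For what it's worth, that queue-timeout thing isn't hard. It just has to occur to you. Here's one way it could look -- a sketch with made-up numbers, not a prescription: stamp a deadline on every request as it arrives, and skip anything that's already dead by the time a worker gets around to it.

```python
import time
from collections import deque

REQUEST_TIMEOUT = 5.0  # seconds; an assumed client-side patience limit

class DeadlineQueue:
    """Request queue that drops work the client has already given up
    on, instead of burning CPU answering the dead."""
    def __init__(self, timeout=REQUEST_TIMEOUT):
        self.timeout = timeout
        self.items = deque()

    def put(self, request, now=None):
        now = time.monotonic() if now is None else now
        self.items.append((now + self.timeout, request))

    def get(self, now=None):
        now = time.monotonic() if now is None else now
        while self.items:
            deadline, request = self.items.popleft()
            if deadline > now:
                return request   # still worth doing
            # else: the client has almost certainly bailed; skip it
        return None
```

The `now` parameter is just there so you can test it without sleeping; in real use you'd let it default to the monotonic clock.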
This is what happens when you have a tremendously complex environment and only scattered educational resources that you have to piece together over multiple decades in order to even start approaching something resembling stable systems.
This is where stuff like my posts come in. My goal is to make sure everything I've picked up over the years from books, Usenet, trial and error, friends, coworkers, random script kiddies, and so forth makes it online somehow. Lots of people have fed info to me, and this is my way of keeping it moving.
We have a tremendously complex industry. Reject the notion of being THE ONE and just accept that nobody is going to know everything. That's why we build teams. That's why we spread the knowledge around.
This control system knowledge is certainly good, but the key is realizing that not everyone is ready to make use of it right now. They'll get there. This is how we all get there.