Software, technology, sysadmin war stories, and more.
Thursday, October 27, 2011

Avoid monocultures on large fleets of machines

There's a discussion about moving off the cloud on Hacker News this afternoon. They were talking about going from virtualized shared environments to dedicated machines. I'm obviously a big fan of having your own machines where you can get to the bottom of problems, but not everybody sees it that way.

They worry about managing all of these systems. It's a reasonable thing to worry about, since who wants to pay for sysadmins? The bean counters want people who can contribute to the bottom line and create something rather than just being a cost. The less they pay to maintain their infrastructure, the better.

This leads to automation. Some of it is good. Some of it is very, very bad. What's really confusing is that which is which depends on many things, and one of them is how many machines you have! This may not seem obvious, so follow along and I'll try to explain why I think this.

Let's say you're a small business with two Linux boxes. You have one in your office where you hack on new development and another co-located somewhere which serves the rest of the world. If you're a halfway competent sysadmin, odds are you can stay on top of just two boxes, even if they run completely different distributions. It's just not that much stuff to track.

With two systems, you'd probably put more work into setting up and feeding a management system than you would into just maintaining the boxes by hand, the old-fashioned way. This mode of operation will hold out for a while.

Eventually, as you add machines, you get to a point where it just doesn't make sense to have so many different installs running in parallel. You start having to do a lot of extra work just to track everything. Unless you're a total masochist, odds are this is the point where you either build some helpers or start messing with third-party software.

This will probably lead to a situation where you try to create a "golden" image. That is whatever operating system you want all of your systems to run, with all of your local customizations applied. Then you'll have some kind of syncer which pushes changes to that image out to your other machines. If you do it right, then you can install a package or twiddle a config directive once and it will just find its way to the other machines.
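To make the syncer idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `FileState` record, the `plan_sync` function, and the idea of describing the golden image and each machine as a mapping from path to state. A real syncer would have to do far more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileState:
    """Hypothetical record of one file: permission bits plus a content fingerprint."""
    mode: int
    digest: str


def plan_sync(golden: dict[str, FileState],
              machine: dict[str, FileState]) -> list[tuple[str, str]]:
    """Return the (action, path) steps that would make `machine` match `golden`."""
    steps = []
    for path, want in sorted(golden.items()):
        have = machine.get(path)
        if have is None:
            steps.append(("create", path))
        elif have.digest != want.digest:
            steps.append(("update", path))
        elif have.mode != want.mode:
            steps.append(("chmod", path))
    # Anything on the machine that the golden image does not know about goes away.
    for path in sorted(machine.keys() - golden.keys()):
        steps.append(("delete", path))
    return steps
```

Notice that nothing here asks whether a change is a good idea. The plan simply makes every machine look like the master, whatever the master looks like.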

If used with care, this can scale for a while. Then you will inevitably shoot yourself in the foot. Someone with sufficient permissions to edit things on your master will do something boneheaded like "chmod -x /usr" or write an install script which destroys /var. As this propagates outward, your systems will start dying.
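One cheap defense against that particular bonehead move is to sanity-check the master before anything is pushed. The sketch below is only an illustration: the `CRITICAL_DIRS` list and the `find_problems` function are made up for this example, and it assumes you can describe the proposed image as a mapping from path to permission bits.

```python
# Hypothetical policy: directories every machine needs to be able to traverse.
CRITICAL_DIRS = ("/usr", "/var", "/etc", "/bin")


def find_problems(proposed: dict[str, int]) -> list[str]:
    """Return human-readable reasons to refuse pushing this image."""
    problems = []
    for path in CRITICAL_DIRS:
        mode = proposed.get(path)
        if mode is None:
            problems.append(f"{path} is missing from the image")
        elif (mode & 0o111) != 0o111:
            # Without the execute (search) bits, nobody can look inside the directory.
            problems.append(f"{path} has mode {mode:o} and is not traversable by everyone")
    return problems
```

A syncer could refuse to push whenever this list comes back non-empty. It only catches the mistakes somebody thought of in advance, so it narrows the blast radius rather than removing it.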

Even if you have some kind of Hadoop-ish fabric which allows tasks to migrate, things will start getting interesting. Lower-priority tasks will stop being able to run as the cell shrinks. Your whole config will start flapping as everything re-shuffles itself. Your site will suffer, and your users will notice.

Then you get to suffer the indignity of going around to all of those machines to fix them. If you're lucky, your syncer will be able to repair the damage. If not, maybe you can get in with ssh or similar. If you really screwed up, you're talking about logging into dozens, hundreds or even thousands of machines at the console to fix them. You'd reinstall them, but then you remembered that their hard drives contain tasty data you'd rather not lose. Oops.

At this point, the monoculture is no longer for you. You need to think of a better way to manage your fleet, and here's a tip: adding administrative layers to the existing technical situation is a joke. You'll still have that one "cowboy admin" who thinks he is too good to need reviews for his changes, and he will drop another chmod bomb on you.

Humans do dumb things. You can either accept that and create systems which are resilient in the face of our limitations, or you can just ignore it and suffer. It's up to you.
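As one example of what "resilient" could look like, and not the only answer, here is a sketch of a staged rollout: push a change to one machine, check that it survived, then widen the batch. The `staged_rollout` function and its `apply_change` and `is_healthy` callbacks are hypothetical names; how you actually apply a change or decide that a machine is healthy is up to you.

```python
from typing import Callable, Sequence


def staged_rollout(machines: Sequence[str],
                   apply_change: Callable[[str], None],
                   is_healthy: Callable[[str], bool],
                   first_batch: int = 1,
                   growth: int = 2) -> list[str]:
    """Push a change in growing batches and stop at the first unhealthy batch.

    Returns the machines that received the change.
    """
    if first_batch < 1 or growth < 1:
        raise ValueError("first_batch and growth must be at least 1")
    changed = []
    batch_size = first_batch
    start = 0
    while start < len(machines):
        batch = machines[start:start + batch_size]
        for machine in batch:
            apply_change(machine)
            changed.append(machine)
        # Halt here so a bad change never reaches the rest of the fleet.
        if not all(is_healthy(machine) for machine in batch):
            break
        start += len(batch)
        batch_size *= growth
    return changed
```

With ten machines and a change that breaks whatever it touches, this stops after the first box instead of after the tenth. For a while, part of the fleet runs the old state and part runs the new, and that is exactly the point.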

Side note: this is a restatement of an earlier post which was couched in metaphor. If the technical nerdery here doesn't make sense, try the other post instead. It has fuzzy animals and moon colonies and all of that good stuff. Really.