Configuration management vs. real people with root
There are all sorts of configuration systems for Linux boxes which have popped up in recent years. These are the Chefs and cfengines of the world, plus all of those other ones which I haven't heard of. Every niche has its own angle on the problem, complete with varying degrees of configurability.
Some have their own little domain-specific configuration languages. Others just hand you an existing programming language and let you do whatever you want. If you can figure out how to write it, you can have it. Just don't mess up.
One common behavior seems to be the notion of having a list of things which must happen to turn a machine from state A to state B. Let's say state A is "doesn't have foo installed" and state B, unsurprisingly, is "does have foo installed". You write a script, rule, recipe or whatever for "A to B", and then any time someone wants foo on their machine, they run your little creation.
This creation of yours might be smart or it might be stupid. Imagine a completely braindead script, for instance:
#!/bin/sh
wget http://internal.server/packages/foo
install_package foo
I won't pick that one apart, since I did that sort of analysis in another post earlier this year. Instead, I'll talk about another problem which seems to be ignored: that of humans and entropy in general.
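For contrast, the least a recipe can do is be idempotent: check the current state before acting, so running it twice doesn't break anything. Here's a minimal sketch of that idea -- the marker file, paths, and the commented-out fetch are all stand-ins, not a real package manager:

```shell
#!/bin/sh
# Idempotent "A to B" recipe sketch. Everything here is a stand-in:
# a real script would query the package manager instead of marker files.
set -e

ROOT=$(mktemp -d)               # stand-in for the machine's filesystem
STATE="$ROOT/var/lib/recipes"
mkdir -p "$STATE"

install_foo() {
  if [ -e "$STATE/foo.installed" ]; then
    echo "foo already present, skipping"
    return 0
  fi
  # wget http://internal.server/packages/foo would go here
  touch "$STATE/foo.installed"
  echo "foo installed"
}

install_foo     # first run: state A -> state B
install_foo     # second run: already in state B, no-op
```

Run it twice, on purpose: the second call notices state B and does nothing, which is the whole point.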
Maybe you have a few hundred Linux boxes and a system like this. When you want to set up a new server running foo, you have it run your sequence and now you have foo installed. Time passes. The company grows. More people come along who have access to the server. One day, one of them logs in and changes a config file directly.
What happens now? Let's say the change fixes a real problem, and that ad-hoc change persists out there for a year. Then, one day, that machine gets reinstalled and the change is gone. It's been so long that the original person doesn't even remember what they did to the machine. Maybe they don't even work there any more.
Now let's say you switch to a config system which actively tracks the files it installs. It makes sure they keep the same values and will flip them back if necessary. If someone makes an ad-hoc change in that environment, it'll be reverted fairly soon, and they'll realize something is wrong. It breaks during that short window when everyone still has context regarding the problem, in other words.
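The enforcement loop behind that kind of system doesn't have to be fancy. A minimal sketch, using temp directories as stand-ins for the master copies and the live /etc on a managed box:

```shell
#!/bin/sh
# Sketch of "actively tracked" config files: keep a golden copy and
# flip the live file back whenever it drifts. Paths are stand-ins;
# a real system would drive this from a manifest.
set -e

GOLDEN=$(mktemp -d)   # stand-in for the config system's master copies
LIVE=$(mktemp -d)     # stand-in for /etc on the managed box

printf 'port = 80\n' > "$GOLDEN/foo.conf"
cp "$GOLDEN/foo.conf" "$LIVE/foo.conf"

# Someone logs in and makes an ad-hoc change...
printf 'port = 8080\n' > "$LIVE/foo.conf"

# ...and the next enforcement pass reverts it.
for f in "$GOLDEN"/*; do
  name=$(basename "$f")
  if ! cmp -s "$f" "$LIVE/$name"; then
    echo "reverting $name"
    cp "$f" "$LIVE/$name"
  fi
done
```

The ad-hoc edit survives only until the next pass, which is exactly the short-window breakage described above.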
Okay, that's an improvement. But there's still more which could happen. Maybe you're running the type of software where any file in a magic directory becomes part of the active configuration. For examples, look at places like /etc/cron.d, /etc/profile.d, and anything else of the sort. Programs like qmail also behaved this way: individual config directives were handled with individual files.
So you're in this world and your config program is tracking files A, B and C. Then some human drops in and adds file D. That changes the way the system behaves, but your config program never catches it. You're back to the earlier situation.
Now you're facing a bigger configuration job: having the system maintain entire directories. That way, any new files which appear in those managed spaces will be removed. This also applies to subdirectories and anything else which might be dumped out there.
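Owning a whole directory just means flipping the logic around: instead of checking known files, delete anything that isn't in the manifest. A sketch, with hypothetical names standing in for something like /etc/cron.d:

```shell
#!/bin/sh
# Sketch of owning an entire directory: anything not in the manifest
# gets removed on the next pass. Names and paths are stand-ins.
set -e

MANAGED=$(mktemp -d)        # stand-in for something like /etc/cron.d
MANIFEST="A B C"            # files the config system knows about

for f in $MANIFEST; do
  printf '# managed\n' > "$MANAGED/$f"
done

# A human drops in file D by hand...
printf '* * * * * root /bin/surprise\n' > "$MANAGED/D"

# ...and the enforcement pass deletes anything it doesn't own.
for f in "$MANAGED"/*; do
  name=$(basename "$f")
  case " $MANIFEST " in
    *" $name "*) ;;                       # known file, leave it alone
    *) echo "removing unmanaged $name"; rm -rf "$f" ;;
  esac
done
```

File D never gets a year to fester: it's gone on the next pass, while whoever dropped it there still remembers doing it.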
Maybe you do that too, and now your various recipes own entire slices of the filesystem. How long can this last? How long will they stay separate? Eventually, your config system will need to create unions of the constituent parts on a given system. If two recipes touch the same directory, they need to somehow fit together in a compatible way.
The alternative is having two warring recipes, constantly loading and unloading each other. That's not exactly productive.
It seems like ultimately you're bound to hit a wall where two conflicting configurations are required at the same time. Maybe one tool expects /lib/libfoo.so to point at version 1.2, and another wants it to point to version 1.6. Now what? Do you resort to LD_PRELOAD hacks for one... or both? Do you chroot one of them?
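One way to dodge the symlink war, at least for libraries, is to stop sharing the path at all: give each tool its own lib directory and point LD_LIBRARY_PATH at it from a wrapper. A sketch with empty stand-in files instead of real libraries:

```shell
#!/bin/sh
# Sketch: per-tool library directories instead of fighting over one
# /lib/libfoo.so symlink. The lib files and tool names are stand-ins.
set -e

ROOT=$(mktemp -d)
mkdir -p "$ROOT/libs/foo-1.2" "$ROOT/libs/foo-1.6"
touch "$ROOT/libs/foo-1.2/libfoo.so" "$ROOT/libs/foo-1.6/libfoo.so"

# Each wrapper launches its tool with a private search path; the system
# symlink never has to satisfy both tools at once.
A_LIBS=$(env LD_LIBRARY_PATH="$ROOT/libs/foo-1.2" sh -c 'echo "$LD_LIBRARY_PATH"')
B_LIBS=$(env LD_LIBRARY_PATH="$ROOT/libs/foo-1.6" sh -c 'echo "$LD_LIBRARY_PATH"')
echo "tool A resolves libfoo from: $A_LIBS"
echo "tool B resolves libfoo from: $B_LIBS"
```

That buys you isolation for dynamic linking, but it's still a point fix -- it does nothing for the two recipes fighting over the same config files.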
My guess is the only sane solution is to have a base system image which is small and which has some kind of generic "overlord" to add things. Anything which gets added is walled off in its own chroot or possibly even an entire lxc style container. It doesn't run with root privileges, and as far as it knows, everything on the system exists solely to make it happy.
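At its core, the overlord's job is just to give each task a private root and launch it inside. The sketch below only builds the layout and prints the command it would run -- actually entering a chroot needs root, and the task names and daemon paths are hypothetical:

```shell
#!/bin/sh
# Dry-run sketch of an "overlord" walling each task off in its own root.
# Task names and daemon paths are stand-ins; entering the chroot for
# real would need root privileges, so this only prints the commands.
set -e

FLEET=$(mktemp -d)
for task in web db cache; do
  mkdir -p "$FLEET/$task/bin" "$FLEET/$task/etc"
  echo "would run: chroot $FLEET/$task /bin/${task}d"
done
```

Each task sees only its own little world under $FLEET/$task, which is what lets it believe everything on the system exists solely to make it happy.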
Of course, it's also possible to go too far down that path and wind up with a monoculture problem in your fleet where one bug wipes out all of them.
That's obviously no good, either. More base system flavors are needed.
Remember the overlord process and how it walls off its tasks from the rest of the system? This is a good thing. It means they shouldn't have any idea of what's going on underneath. Now you can have base system A running some kind of Red Hat variant, base system B with some kind of Debian environment, and base system C with Slackware just for fun.
Screwing that up sounds a whole lot harder. One bad RPM won't take down the entire fleet if only because it won't install on systems B or C -- assuming that you don't do something clowny like making the other ones auto-convert and install them, of course!
This can have other benefits. Within your organization, you can have the different base systems be owned by totally separate groups. They can even live in different parts of the world if you want. Maybe one's on the west coast of the US and the other's in Dublin. That's a pretty common arrangement in tech companies.
Someone who likes working on RH-flavor systems joins team A, and someone who likes working on Debian goes for B. Then the sole engineer who enjoys playing with Slackware maintains C in her spare time. You get the idea.
I would even suggest taking this kind of behavior further up the stack if at all possible. If the "overlord" software can be reduced to a common API, why not have multiple implementations? Let them split into two flavors which are owned by the teams described earlier.
So many things become possible in this kind of world. You can run tests to see which side delivers better performance for a given task and have some friendly competition to keep improving.
People are going to log into machines and make changes by hand. Sometimes they intend for them to stick. Other times they don't, but they forget and those changes stay around far too long. You can try to legislate this out of existence, or you can create a resilient, self-healing system which doesn't allow things to get out of hand.
What'll it be?