Software, technology, sysadmin war stories, and more.
Friday, November 1, 2019

Sorcerer's apprentice mode and busting ghosts

Some years ago, I picked up the term "Sorcerer's Apprentice Mode" from a coworker. It's a clever reference to a scene in Disney's Fantasia where things come to life and start making a real mess all by themselves. I usually use it now to refer to automatons which go out of control and do something that no reasonable person would do in real life. This story is one such example.

It's just some random tech company with a bunch of employees and Linux boxes. They use LDAP and/or Active Directory to track everyone. As far as the Linux boxes are concerned, it's all LDAP. You don't really exist in /etc/passwd, but instead show up by way of some NSS magic that fills in whenever someone tries to look you up. Run "id"? NSS. ssh in? NSS. See usernames in ls? You get the idea.
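To make the "NSS magic" a little more concrete, here is a minimal sketch in Python. On a glibc-based Linux box, the standard pwd module goes through the same lookup path that "id" and ls use, so it sees whatever sources nsswitch.conf lists, whether that's /etc/passwd, LDAP, or something else. The helper names here are mine, purely for illustration.

```python
import pwd


def lookup_user(name):
    """Return the passwd entry for `name`, or None if no NSS source knows it."""
    try:
        # On glibc systems this consults every source configured for
        # "passwd" in /etc/nsswitch.conf (files, ldap, sss, ...).
        return pwd.getpwnam(name)
    except KeyError:
        # Nobody answered for this name: as far as the box is concerned,
        # the user does not exist.
        return None


def describe(name):
    """Human-readable summary of what the lookup returned."""
    entry = lookup_user(name)
    if entry is None:
        return f"{name}: no such user"
    return f"{name}: uid={entry.pw_uid} home={entry.pw_dir}"
```

If LDAP stops answering for you, lookup_user() starts returning None even though nothing in /etc/passwd changed. That's the whole setup for what follows.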

Nearly every engineer has at least one workstation type box somewhere, and since it's just a normal Linux machine, they can set up cron jobs, run things in screen or tmux, or otherwise leave stuff running without them being there. This includes after they quit, or worse, are fired. This could be very bad.

This company had someone thinking ahead, and so some now-unknown person came up with a tool to deal with this problem. They created their own little privileged cron job that would wake up every couple of minutes to look for processes owned by people who didn't work for the company any more. This way, if you set up something to mine bitcoins, or otherwise do nasty things after you left, it would be killed over and over again. Eventually, your machine would be reinstalled and it'd all die, but this thing filled in the gap.
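I don't have the original tool, so what follows is only a rough sketch of the idea, and it makes assumptions: that "doesn't work here any more" means "the UID no longer resolves to a user", that /proc is available, and that anything below a cutoff UID (min_uid, a value I made up) is a system account to leave alone. The real job presumably went on to kill what it found; this sketch stops at finding.

```python
import os
import pwd


def uid_exists(uid):
    """True if some NSS source (passwd file, LDAP, ...) still knows this UID."""
    try:
        pwd.getpwuid(uid)
        return True
    except KeyError:
        return False


def list_processes(proc_root="/proc"):
    """Yield (pid, uid) pairs for processes visible under /proc (Linux only)."""
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # not a process directory
        try:
            # The owner of /proc/<pid> is normally the process's effective UID.
            uid = os.stat(os.path.join(proc_root, name)).st_uid
        except OSError:
            continue  # the process exited while we were looking
        yield int(name), uid


def find_orphaned(processes, exists=uid_exists, min_uid=1000):
    """Return PIDs owned by non-system UIDs that no longer resolve to a user.

    `min_uid` is an assumed cutoff so that root and daemon accounts are
    never considered; adjust it for the local UID scheme.
    """
    return [pid for pid, uid in processes if uid >= min_uid and not exists(uid)]
```

A privileged cron job would call find_orphaned(list_processes()) every few minutes and signal whatever comes back. Notice that nothing in here asks whether the directory data itself can be trusted.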

On the surface, this sounded great. Former workers would be prevented from running things after they were no longer attached to the company. For the most part, it was great. It did its job.

Then, one day, someone or something screwed up LDAP and "fired" everyone. Every single employee disappeared from LDAP in one way or another. This made life very interesting because your user records needed to exist for you to ssh in, or sudo, or basically do anything involving creating a new shell. The people trying to fix the problem had to make do with whatever they already had open.

So there they were, working the problem, trying to find out what the heck had happened to LDAP... and then they were ALL LOGGED OUT. Every single connection was closed, and nobody could get back in again.

What happened? The cleanup job fired.

None of them were employees any more according to the available data, so it went to town and wiped out every process it found. All of those shells and sshds were gone moments later.

Nice, right?

The solution, incidentally, was to use a "canary user" to see if LDAP itself and the data conveyed by it were both plausible and worth honoring. The idea was that if the canary disappears, things are too screwed up to trust the data, and so no processes should be killed. Initially, I had trouble coming up with one of these and so used a certain founder's account, figuring if they vanished for real, the company had bigger problems to solve.

Fortunately, someone else knew about an LDAP-only test account which would never exist in /etc/passwd or similar, so a lookup for it would tell me exactly what I needed to know. It was used as the canary user, and the fix was shipped.
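The canary check amounts to a guard in front of the kill step. Here is a minimal sketch of that shape, not the actual fix: the account name "ldap-canary" is hypothetical, and find_victims and kill are stand-ins for whatever the real job used to pick and signal processes.

```python
import pwd

# Hypothetical name for an account that lives ONLY in LDAP. If it also
# existed in /etc/passwd, the lookup would keep succeeding during an LDAP
# outage and the guard would be useless.
CANARY_USER = "ldap-canary"


def directory_looks_sane(canary=CANARY_USER, lookup=pwd.getpwnam):
    """True if the canary account still resolves through NSS."""
    try:
        lookup(canary)
        return True
    except KeyError:
        return False


def run_cleanup(find_victims, kill, canary_ok=directory_looks_sane):
    """Kill orphaned processes, but only while the canary is visible.

    Returns the list of PIDs that were passed to `kill`. If the canary
    is missing, the directory data is not trusted and nothing is killed.
    """
    if not canary_ok():
        return []
    victims = list(find_victims())
    for pid in victims:
        kill(pid)
    return victims
```

The important design choice is which way it fails: when the canary can't be found, the job does nothing. A few stray processes surviving one more cron cycle is a much smaller problem than logging out everyone who is trying to fix LDAP.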

Some months later, somehow, the LDAP db was changed to "fire" everyone again. Remembering what happened before, a bunch of people cried out: "stop the cleanup thing before it fires!" ... but then they found it had been fixed and breathed a sigh of relief. Then they got busy fixing the actual problem.

The canary user did its job and nothing bad happened that time.