Corrupted chroots: the "tartar" bug
I used to like hanging out on a "water cooler" type IRC channel where people would gather to talk about things that had broken, or seemed odd, or when they were working an outage. Besides being a source of interesting new things to ponder, it was also guaranteed to have at least one person associated with it who needed help. Being able to help someone with a problem which happened to be technical is what the fun was all about for me.
One day, the problem was in this pseudo-container environment. People were trying to start up their jobs, but the chroots would sometimes come up corrupted. The failure mode was that within the container, they expected a certain set of executables at certain paths, and symlinks to those executables at other paths. The symlinks were all wrong.
By "all wrong", I mean that instead of having a pointer at /usr/bin/foo directing you to "../../bin/foo", there'd be a zero byte regular file at /usr/bin/foo. Just a bit of a difference, right?
I stuck my nose in and started trying to help out. My first guess was "I wonder if someone wrote a dumb shell script without the usual set -e -u type stuff up top and it ran amok with an empty variable in front of a path". That didn't pan out. The broken files were some, but not all, of the links in this one directory. It didn't go by alpha order, either, so it probably wasn't some glob (like a * expansion) gone wrong. The pattern seemed to have something to do with symlinks with a slash in their targets, but that made no sense either. The broken files also seemed to have modification times within a few hundred milliseconds of each other.
After poking at that for a bit, eventually the troubleshooting work came around to "how many hosts did this", and from there, "when did they break", and "what did they have in common at those respective times". Basically, what was running on the box when things went sideways?
After staring at too many logs, I came up with an idea: the chroots were being prepared at this point in the timeline. Not only that, but a single directory was the target for multiple tasks which were all being set up at the same time! Instead of having a single setup task which the actual jobs would block on, they all went and ran it in parallel.
In short, we had multiple writers, and since the source for this setup task was a tarball, it sure looked like we had warring tar instances.
Obviously, for something this ridiculous, I needed to rig a reproduction case to prove it, and after a few minutes of smacking the shell around, I came up with exactly that. It required a tarball which contained (among other things) symlinks pointing at ../../../foo type paths, and some subshell/background magic to make a couple of tar extractions run at the same time.
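The original script is long gone, but a minimal sketch of that kind of repro (the /tmp/race paths and the tarball contents here are invented for the example) looks something like this:

#!/bin/sh
# Build a tarball whose contents include a symlink with a ../.. style target.
mkdir -p /tmp/race/src/bin /tmp/race/src/usr/bin
touch /tmp/race/src/bin/foo
ln -sf ../../bin/foo /tmp/race/src/usr/bin/foo
tar -C /tmp/race/src -cf /tmp/race/payload.tar .

# Unpack it into the same directory with two tars at once, over and over,
# until one of them loses the race.
for i in $(seq 1 20); do
    rm -rf /tmp/race/dst
    mkdir -p /tmp/race/dst
    tar -C /tmp/race/dst -xf /tmp/race/payload.tar &
    tar -C /tmp/race/dst -xf /tmp/race/payload.tar &
    wait
    ls -l /tmp/race/dst/usr/bin/foo
done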
It didn't always work, but given a couple of iterations, you'd eventually get the result: a zero-byte file with 000 permissions, like this:
---------- 1 somebody users 0 Mar 29 19:43 /tmp/race/proof/of/concept
Based on interrogating tar with strace, it looked like it would first create a plain file with open(), then it would lstat() it, unlink() it, and finally symlink() it. However, if two tars ran at the same time, those calls could get interleaved, and one of them would complain and leave the empty, permissionless file behind.
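If you want to watch that sequence yourself, something along these lines works against the repro above (on newer systems the same operations may show up under names like openat(), newfstatat(), unlinkat(), and symlinkat() instead):

strace -f -e trace=open,openat,lstat,newfstatat,unlink,unlinkat,symlink,symlinkat \
    tar -C /tmp/race/dst -xf /tmp/race/payload.tar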
It turned out that someone had recently enabled parallel setup, and as soon as fresh chroots started being "born" in this world, the problem would start happening.
The workaround, more or less, was to turn it back off. I'm not sure what ended up happening in terms of making it impossible to corrupt, but I hope it involved a "one dir, one setup mutex" type schtick.
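I have no idea what they actually shipped, but the flavor of thing I mean, sketched with flock(1) (the paths and the "done" marker file are made up for the example), is roughly this:

# Serialize setup per target directory: whoever takes the lock first does
# the unpack, and everyone else waits, sees the marker, and skips it.
CHROOT=/path/to/chroot         # placeholder
PAYLOAD=/path/to/setup.tar     # placeholder
(
    flock -x 9
    if [ ! -e "$CHROOT/.setup-done" ]; then
        tar -C "$CHROOT" -xf "$PAYLOAD"
        touch "$CHROOT/.setup-done"
    fi
) 9> "$CHROOT.lock"

With something like that in front of the unpack, it doesn't matter how many tasks show up at once: only one tar ever touches a given directory at a time.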
This was a small potatoes outage, so it never got a real name like the really big ones did, but in my heart, this'll always be the tartar bug.