Software, technology, sysadmin war stories, and more.
Monday, March 19, 2018

Repo coherence with git on NFS and multiple clients

All of this talk about NFS got me thinking about an interesting failure mode I saw some time back. Someone else did the work and figured out how it all fit together. I remembered the details to share with a wider audience as a warning about what not to do in your own systems.

Imagine you have a system which uses git as its source of truth. Specifically, the git repo lives on an NFS server called a filer. There's a host which can add stuff to the repo as updates come down the pipe, so there are commits being added every couple of minutes.

There's also something watching this git repo to consume all of the updates. Every time it sees a new commit, it compares the last state of the repo to the current state of the repo, and derives a delta. That is, it'll say "file X was created with contents XC, file Y was updated with contents YC, and file Z was deleted".
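To make the idea of a delta concrete, here's a minimal sketch of the comparison. It assumes each repo state can be flattened into a plain dict of path to contents. The name `derive_deltas` and the tuple format are made up for illustration and are not what the real program used.

```python
def derive_deltas(before, after):
    """Compare two snapshots (dicts of path -> contents) and list the changes."""
    deltas = []
    # Paths that only exist in the new state were created.
    for path in sorted(after.keys() - before.keys()):
        deltas.append(("CREATE", path, after[path]))
    # Paths in both states count as updates only if the contents differ.
    for path in sorted(before.keys() & after.keys()):
        if before[path] != after[path]:
            deltas.append(("UPDATE", path, after[path]))
    # Paths that only exist in the old state were deleted.
    for path in sorted(before.keys() - after.keys()):
        deltas.append(("DELETE", path, None))
    return deltas


before = {"Y": "old contents", "Z": "ZC"}
after = {"X": "XC", "Y": "YC"}
deltas = derive_deltas(before, after)
```

Here `deltas` ends up with one CREATE for X, one UPDATE for Y, and one DELETE for Z, matching the example above.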

These deltas are then translated to the language of another storage system and are then written there. The program knows how to create files, update existing files, and delete files in this downstream storage system, and it relies on the inflow of deltas from the git analyzer to stay busy.

This worked well enough, but then one day, every single file in the downstream storage system was deleted.

As you might expect, this caused a lot of anguish.

The response team started digging in. Who did it? It wasn't a person. What did it? It was the thing that applies the deltas. What were the deltas? A few hundred thousand DELETE operations, one for every file in the repo.

Wait, what? Delete every file in the repo? Why would it think that? Did the repo actually have every file deleted from it in git-world? No, it did not. It had all of the data, as you would expect.

I'll skip over the cleanup part and get straight to what caused it.

git can do this thing where it will repack your repo. You end up with a long-lived file inside the repo that says "go look at aaa for the contents". A git client opens that first file, finds the pointer to "aaa", opens it, and is happy.

Later, the repo gets repacked, and that same long-lived file now says "go look at bbb for the contents". A git client would then open that same first file, and the same sequence would follow.

Only, for some reason, it didn't work. The delta generator program got the pointer to "bbb", but when it tried to open the file, it got ENOENT -- the file was not there. This then triggered a whole cascading sequence of badness which created the problem.

Finding this little anomaly then touched off another sequence of investigations, and I'll spare you the dead ends and cut to the chase. It has to do with the Linux NFS client behavior.

By default, if you mount a filesystem over NFS onto Linux, it will do some client-side caching on your behalf. This is presumably meant to make your life easier by reducing load on the network and/or filer. What you might not have realized is that it can also create a significant gap between what the exported filesystem looks like to one client and what it looks like to another.

One of the things affected by this is what happens when a new file is created. You may not see the new file in a directory for up to 30 seconds in some cases. It's there on the server, but the client is giving you the last version of that directory as part of its attempt to "make things better for you".

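Meanwhile, this does not affect existing files the same way. If some other client updates the contents of a file on the filer, and then you open and read it, you will get those contents effectively immediately. The client checks with the server when you open a file it already knows about, so you aren't served stale contents the way you're served a stale directory.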

So think back to how the git pack works. You have an existing file, so any updates are seen immediately. That existing file says "hey, you need to go look at bbb now". You see that update as soon as it lands on the server. You then go to open "bbb", but it's a brand new file, and the cache works against you, and it's not there.

In some cases, you might just retry the request, and eventually it would work. However, that's not what happened with the delta generator. For whatever reason, instead of returning a populated object with the latest contents of the git repo, it returned a "None" (think nil, NULL, or whatever else your favorite language happens to use).

That might have been fine, but something else above it in the stack decided to think that None was a valid git repo. Specifically, it thought it was a valid empty git repo.

Now look at the situation. You have a "before repo" with hundreds of thousands of files in it. You have an "after repo" with nothing in it. You're the delta generator sitting on top of this. What possible deltas could get you from "before" to "after"?

That's right, you emit a DELETE for every single file.

Downstream, those DELETE operations are consumed, and everything disappears from the secondary storage system.

Incidentally, this showed up because the repack ran from one client host, and the delta generator ran on another. If they had both run on the same host, this problem would have been hidden by the fact they were both seeing the same client-side cached directory. The problem would have only appeared one day much later when someone decided to split them onto two different hosts for whatever reason. Even then, it's a matter of timing, and it would take a while for it to finally line up just right.

There are all kinds of things you can take away from a story like this.

First, there's a huge difference between "a valid set that happens to be empty" and "the absence of a set". The first one is an unusual situation. The second one is an error. Allowing equivalence between these lets an error inject a valid-looking empty set into your calculations. It's a variant on "zero is not NULL/None/nil/...".
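Here's a rough sketch of keeping those two cases apart. It assumes a repo snapshot is a dict of path to contents and that a failed read shows up as `None`. `RepoUnavailable`, `require_snapshot`, and `files_to_delete` are hypothetical names for illustration, not anything from the real system.

```python
class RepoUnavailable(Exception):
    """Raised when the repo state could not be read at all."""


def require_snapshot(snapshot):
    """Pass through a real snapshot (even an empty one), but reject None."""
    if snapshot is None:
        raise RepoUnavailable("no snapshot; refusing to treat it as an empty repo")
    return snapshot


def files_to_delete(before, after):
    """List paths present in 'before' but missing from a *valid* 'after'."""
    after = require_snapshot(after)
    return sorted(before.keys() - after.keys())


before = {"a": "1", "b": "2"}

# The dangerous pattern: "None or {}" quietly turns a failed read into an
# empty repo, so every file looks like it was deleted.
buggy = sorted(before.keys() - (None or {}).keys())
```

With this in place, `files_to_delete(before, {})` still returns every path, because an empty repo is a legitimate (if unusual) state, while `files_to_delete(before, None)` raises instead of producing a pile of deletes.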

Second, cache coherence across NFS with multiple writer systems is fraught with peril. The nfs(5) man page on Linux goes into a bunch of details about options you can set and knobs you can adjust to try to make this less dangerous, but it's ultimately on your program to do the right thing.
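One version of "doing the right thing" is to treat ENOENT on a file you were just told about as possibly temporary, and retry for a while before giving up loudly. This is only a sketch under that assumption. `open_with_retry` and its parameters are made up, and the `opener` and `sleep` arguments exist purely so it can be exercised without a real filer.

```python
import time


def open_with_retry(path, attempts=5, delay=1.0, opener=open, sleep=time.sleep):
    """Open 'path' for reading, retrying only when the file is not found.

    Python raises FileNotFoundError for ENOENT. If the file never shows up,
    the last error is re-raised; this never returns None.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return opener(path, "rb")
        except FileNotFoundError:
            if attempt == attempts - 1:
                raise  # Out of retries: fail loudly instead of guessing.
            sleep(delay)
```

The important part is the failure path. When the retries run out, the caller gets an exception it has to deal with, not an empty-looking value it can mistake for data.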

Third, this is a really hard one to get just right, but it can be helpful to have sanity checks in a program to prevent "Sorcerer's Apprentice Mode" where it goes off and just makes bigger and bigger messes by itself. Somehow, you'd have to know what a normal set of operations looks like, perhaps by the quantity of operations allowed in any one change. Then, if it looks too weird, you pause yourself and throw an alert, and let a human come and figure it out.
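As a rough illustration of that kind of check, here's a sketch that refuses to apply a batch if it would delete too large a share of the files. The 10% threshold, the `(operation, path, contents)` tuple format, and the names `check_batch` and `TooManyDeletes` are all assumptions for the example. A real limit would have to come from knowing what normal looks like for your system.

```python
class TooManyDeletes(Exception):
    """Raised when a batch would delete a suspicious share of the files."""


def check_batch(deltas, total_files, max_delete_fraction=0.10):
    """Return the batch unchanged, or raise if it deletes too much at once."""
    deletes = sum(1 for delta in deltas if delta[0] == "DELETE")
    if deletes > max_delete_fraction * total_files:
        raise TooManyDeletes(
            f"{deletes} deletes out of {total_files} files; pausing for a human"
        )
    return deltas


normal = [("UPDATE", "a", "new"), ("DELETE", "b", None)]
runaway = [("DELETE", f"file{i}", None) for i in range(100)]

check_batch(normal, total_files=100)  # One delete out of 100 is fine.

try:
    check_batch(runaway, total_files=100)
    paused = False
except TooManyDeletes:
    paused = True  # This is where you would alert and stop applying deltas.
```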

Of course, if the human is just going to go "eh, whatever" and push the button to restart and ignore the error condition, you haven't really gained anything. If the working environment does not recognize and reward people for doing the right thing and only focuses on throughput, investigations will be skipped and avoidable outages will occur.

Hey, I said it was a hard problem.