Writing

Software, technology, sysadmin war stories, and more.

Sunday, March 18, 2018

What to do about NFS and SIGBUS

A couple of days ago, I wrote about how running programs from NFS mounts can lead to interesting crashes if the files change on the file server. That post elicited a question from an anonymous reader:

Hey, I liked the article! Would you be willing to write a little bit about the strategies you can use to prevent the SIGBUS? The only thing I could think of is creating a wrapper script and the CI server building versions that include the datestamp and never deleting them.

Do you have better ideas?

So hey, I definitely have ideas. I don't know how practical they would be for every situation, so you'll have to judge and discard the ones which don't fit in.

My first recommendation would be to kick NFS to the curb. Getting rid of it is one of those things that you wind up having to do eventually in any organization once it grows beyond a certain size. The sooner you kick it, the less time and energy it'll take to get rid of it down the road.

I'm not completely opposed to using it for home directories, but then, those tend to be mounted in just a handful of locations at most. What bugs me is when you have huge mounts that are accessible from everywhere. At that point, people do the natural thing and start relying on it.

Before long, you have a ridiculous single point of failure that has no version control, really nutty caching behavior on some clients, and worse. There are also brilliant failure modes when you run big filers with primary/secondary relationships, and manually flip the sense of which one is feeding which at the wrong time.

If you have the option to ditch NFS and go to some explicit method of distributing your binaries, I recommend it highly.

Of course, if you decide to solve this by building RPMs because you're on a Red Hat-derived system, you will eventually run into scaling issues with yum, reliability issues with db4 (yes, even five years later) and worse.

But hey, it's probably easier to go from RPM to not-RPM than it is to go from NFS to not-NFS. So there's that.

If you really want to stay on NFS for some reason, then my recommendation would be to try having unique names for your builds, and use symlinks or some other method of pointing at them. If the files are treated as immutable, there shouldn't be any rug-pulling going on when the kernel starts paging things back in.
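Here's a rough sketch of what that might look like, not a drop-in tool. The directory layout (a `releases` directory with one subdirectory per build and a `current` symlink), the function name `publish_build`, and the build IDs are all made up for illustration. The idea is just that a build lands under a name nobody has ever used before, and the only thing that ever changes is the symlink.

```python
import os
import shutil
from pathlib import Path


def publish_build(src: Path, releases: Path, build_id: str) -> Path:
    """Copy src to releases/<build_id>, then repoint releases/current at it.

    Hypothetical layout: every build gets its own directory, and nothing
    in a published directory is ever modified afterwards.
    """
    dest = releases / build_id
    if dest.exists():
        # Refusing to overwrite is the whole point: immutable builds
        # mean nobody's mapped pages change underneath them.
        raise FileExistsError(f"{dest} already exists; pick a new build id")

    # Copy under a temporary name first so a half-copied build never
    # shows up under its final name.
    staging = releases / f".staging-{build_id}"
    shutil.copytree(src, staging)
    os.rename(staging, dest)

    # Build the new symlink off to the side, then rename it over the
    # old one. The link target is relative so it still resolves on
    # clients that mount the share at a different path.
    tmp_link = releases / f".current-{build_id}"
    if tmp_link.is_symlink():
        tmp_link.unlink()
    os.symlink(build_id, tmp_link)
    os.replace(tmp_link, releases / "current")
    return dest
```

You'd call it with something like `publish_build(Path("out"), Path("/mnt/tools/releases"), "20180318-1432")`, where both the paths and the ID are invented. The rename swaps the symlink in one step on the server, but NFS clients can keep serving a cached lookup for a while, so don't expect every host to see the new target at the same instant. That's fine here, since the old build is still sitting there untouched.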

Of course, if you do that, now you have a roach motel: the binaries check in, but they never check out. Given enough time, you will fill up your filer, no matter how big it is. Now you'll have to come up with a garbage collection strategy, and that means being able to somehow track who's on what version, and what's in use. Then you have the problem of managing restarts so that hosts stay up to date and you don't have to retain as many builds as there are broken hosts.
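To make the garbage collection part a little more concrete, here's a minimal sketch under some big assumptions. It assumes one directory per build plus a `current` symlink, and it assumes you already have some way to produce `in_use`, the set of build names that hosts report they're still running. That reporting is the hard part and it isn't shown here. The function name, the parameters, and the use of directory mtime as a stand-in for "when was this published" are all invented for the example.

```python
import os
import shutil
import time
from pathlib import Path


def collect_garbage(releases: Path, in_use: set, keep_newest: int = 3,
                    min_age_seconds: float = 7 * 24 * 3600,
                    now: float = None, dry_run: bool = True) -> list:
    """Return the build names that were (or would be) removed.

    A build survives if a host reports using it, if `current` points
    at it, if it's among the newest few, or if it's too young.
    """
    now = time.time() if now is None else now
    protected = set(in_use)

    current = releases / "current"
    if current.is_symlink():
        protected.add(Path(os.readlink(current)).name)

    # Real build directories only: skip symlinks and dot-prefixed
    # staging leftovers. Newest first, going by directory mtime.
    builds = sorted(
        (p for p in releases.iterdir()
         if p.is_dir() and not p.is_symlink() and not p.name.startswith(".")),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    protected.update(p.name for p in builds[:keep_newest])

    removed = []
    for build in builds:
        if build.name in protected:
            continue
        if now - build.stat().st_mtime < min_age_seconds:
            continue
        if not dry_run:
            shutil.rmtree(build)
        removed.append(build.name)
    return removed
```

It defaults to a dry run on purpose. If the `in_use` data is stale or a host never reported in, deleting a build that's still mapped somewhere puts you right back in SIGBUS territory, so it's worth looking at the list before letting it delete anything.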

The above is just another day in this line of work.