Software, technology, sysadmin war stories, and more.
Tuesday, August 11, 2020

File handling in Unix: tips, traps and outright badness

I wrote a post over the weekend which said a lot about libraries letting people down, and other people becoming overly dependent on them. There was an aside of sorts in there which mentioned teaching people about all of the things to look out for when you're writing to a file on a Unix-ish/POSIX-ish filesystem. A friend reached out asking if I had a post talking about that stuff, and as near as I can tell, I do not.

That brings us to right now. I will attempt to lay down a few things that I keep in mind any time I'm creating files.

First up, write(2) is not guaranteed to write the full set of data you hand to it. The first thing that probably comes to mind is "well sure, the disk might be full", but that's not quite what I'm talking about.

No, this is about the whole "worse is better" thing, where you can get interrupted while you're off in some syscall, doing something (like writing to a file). Yep, write() can get poked in the head by the kernel and return early, with only some of its work done. It's up to you to notice this and restart it.

You might notice that write() returns a ssize_t. Hopefully most people reading this know about the values it might return. Most commonly, you'll get back the number of bytes you asked it to write. That's good. A bit less common is a -1, which means something broke and now you have to grovel around in errno to see what actually did happen.

What fewer people realize is that there is a middle ground between "everything" and "nothing". Let's say you call write and tell it to push 16384 bytes out to a file descriptor. It gets interrupted for some reason - maybe a signal fires off, or someone happens to attach strace or gdb to your process right there. Whatever.

Let's say it returns a value of 8192 instead. You have to notice this was less than the value you gave it initially (16384) and double back. Note that you can't just restart the exact command you did before, either, since the first half of your data has already been written! This time, you have to tell write() to start at the first byte which didn't already get written, and adjust the count downward to match.

[Note: this originally said it would set EINTR on a short write. It doesn't! It just returns less than what you passed in. Meanwhile, if it DOES return EINTR, guess what, you get to go again. Or look up SA_RESTART with respect to sigaction(2). The rabbit hole is never-ending.]

If you're thinking "this means I need a pointer, and a counter, and I have to do this in a loop while bumping the pointer forward and clawing the counter downward", you're on the right track. If you're also thinking "hey, this might get stuck in this loop forever if some pathological case happens", now you're really cooking with gas. That kind of paranoia will be rewarded when your system holds up under crazy situations instead of blowing up and taking everyone else down with it.
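That loop is short but easy to get subtly wrong. Here's a minimal sketch in C of what it might look like (the name write_all is mine, not a standard call), handling both short writes and EINTR:

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Keep calling write() until every byte has gone out or the
 * descriptor reports a real error. */
ssize_t write_all(int fd, const void *buf, size_t count)
{
    const char *p = buf;      /* bumped forward past written bytes */
    size_t left = count;      /* clawed downward as bytes land */

    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;     /* interrupted before writing anything: retry */
            return -1;        /* real error: let the caller inspect errno */
        }
        p += n;               /* skip past what already got out */
        left -= (size_t)n;
    }
    return (ssize_t)count;
}
```

A truly paranoid version would also cap the number of zero-progress iterations so a pathological descriptor can't trap you in here forever.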

There are other fun things which can happen here. Maybe you're trying to do some kind of round-robin system where you shove data into multiple file descriptors as fast as you can, but some of them can't keep up. Maybe some clients are slower than others. This causes the whole thing to slow down to the speed of the slowest client. Maybe this gets you looking at non-blocking I/O, so that write() will never sit there waiting for the (network) buffer to accept the entire contents of what you passed it.

Of course, as soon as you do this, you will realize that non-blocking I/O sometimes means that it can't or won't do what you asked, and will return right away without doing a thing, and with errno set to EAGAIN or EWOULDBLOCK.

If you're now thinking "huh, guess I need a buffer of my own, and then I need to come back and try that write again later", you're right! Also, if you keep on going down that road, you should hit the point of saying "hey, eventually I need to give up on them, because otherwise my buffers will eat me alive if they become unresponsive", which is also a good thing to remember.
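A minimal sketch of that "try it, and stash whatever didn't fit" idea might look like this (struct pending and send_or_queue are made-up names; a real version would also try to flush the queued bytes before accepting new ones):

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical per-client state: bytes the kernel wouldn't take yet. */
struct pending {
    char buf[65536];
    size_t len;
};

/* Try to send on a (presumably non-blocking) fd; anything that didn't
 * go out gets queued in our own buffer for a later retry.  Returns 0
 * on success-or-queued, -1 when the leftover won't fit -- which is
 * your cue to consider cutting this client off. */
int send_or_queue(int fd, struct pending *pend, const char *data, size_t len)
{
    ssize_t n = write(fd, data, len);
    if (n < 0) {
        if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
            n = 0;            /* wrote nothing this time; queue it all */
        else
            return -1;        /* real error */
    }
    size_t left = len - (size_t)n;
    if (left > sizeof(pend->buf) - pend->len)
        return -1;            /* buffer would overflow: give up on them */
    memcpy(pend->buf + pend->len, data + n, left);
    pend->len += left;
    return 0;
}
```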

If you combine this situation with the one described above (EINTR), then you may come to the conclusion that you basically need a ring buffer of sorts, where you append new outgoing data to it, and feed the network (or disk, or printer, or whatever) with the oldest stuff that hasn't yet been pushed out. This means memory management, indices or pointers, and hard decisions about when enough is enough and you have to cut them off.
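Here's one bare-bones way such a ring buffer could be sketched (all names are mine; a production version would need real policy for sizing and for what happens when it fills):

```c
#include <stddef.h>

#define RING_SIZE 4096

/* Append new outgoing data at the tail, drain the oldest unsent
 * bytes from the head. */
struct ring {
    char buf[RING_SIZE];
    size_t head;   /* index of the next byte to drain */
    size_t used;   /* bytes currently queued */
};

/* Append; returns 0, or -1 when full (decide: drop, block, or cut off). */
int ring_put(struct ring *r, const char *data, size_t len)
{
    if (len > RING_SIZE - r->used)
        return -1;
    for (size_t i = 0; i < len; i++)
        r->buf[(r->head + r->used + i) % RING_SIZE] = data[i];
    r->used += len;
    return 0;
}

/* Copy out up to max of the oldest bytes; returns how many. */
size_t ring_get(struct ring *r, char *out, size_t max)
{
    size_t n = r->used < max ? r->used : max;
    for (size_t i = 0; i < n; i++)
        out[i] = r->buf[(r->head + i) % RING_SIZE];
    r->head = (r->head + n) % RING_SIZE;
    r->used -= n;
    return n;
}
```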

But wait, there's more! A write doesn't always just fail for simple reasons where it returns -1 and sets errno to something like ENOSPC (the disk is full). There's also something fun called a broken pipe. This is when you go to write something to a pipe that isn't open for reading, or a socket that's had its reading end closed. If you're doing TCP stuff, recall that you really have two different flows going on, and the other end can totally stop reading from you any time it wants.

Big deal, you think. It'll -1 with errno=EPIPE, right? Yes, but. It will also generate a SIGPIPE unless you've explicitly handled it ahead of time. That's right, you're getting a bouncy baby signal handed to you by the kernel. You need a signal handler to eat it and do something about it, or you have to explicitly say that you don't care and set it to "ignored", but you can't just pretend it won't happen... because it sure will.

So yeah, if you have programs that occasionally blow up with "Broken pipe" and dump out to the shell despite you swearing up and down that you are dealing with a bad return from write(), well, maybe that's why.
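One way to see this for yourself: ignore SIGPIPE up front, and the failed write turns into an ordinary -1 with errno=EPIPE that you can check like any other error. A little demo (the function name is mine) that survives writing to a pipe with no reader:

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Returns 1 if the write failed with EPIPE and we lived to tell
 * about it, 0 otherwise. */
int demo_broken_pipe(void)
{
    /* Without this, the kernel delivers SIGPIPE and the default
     * disposition kills the process before write() ever returns. */
    signal(SIGPIPE, SIG_IGN);   /* or sigaction() with SIG_IGN */

    int fds[2];
    if (pipe(fds) != 0)
        return 0;
    close(fds[0]);              /* nobody is reading anymore */

    int r = (write(fds[1], "hi", 2) < 0 && errno == EPIPE);
    close(fds[1]);
    return r;
}
```

On sockets specifically, Linux also lets you suppress the signal per-call by using send() with the MSG_NOSIGNAL flag instead of write().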

What else? How about races? Let's say two different programs both try to create the same file at the same path. They both open /tmp/coolcoolcool for writing. What happens? That depends on the flags used when it was being opened, and whether the file was already there.

If you pass in O_CREAT, it'll create the file if it doesn't already exist. If you also pass in O_EXCL (so *both* O_CREAT and O_EXCL), then it'll create the file if it doesn't exist, and it'll error out if it's already there.

If two programs both try to open with O_CREAT | O_EXCL on the same path at the same time, one will win and one will lose. This is actually a good thing! This is also how you find out that you laid down a new file at that place and didn't actually follow a symlink left behind by some evildoer.
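A sketch of that exclusive-create dance (create_exclusively is my name for it, and the mode is just an example):

```c
#include <errno.h>
#include <fcntl.h>

/* Succeeds only if WE created a brand-new file at this path.
 * O_EXCL also refuses to follow a symlink planted there. */
int create_exclusively(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0 && errno == EEXIST) {
        /* Something was already there -- we lost the race. */
        return -1;
    }
    return fd;   /* valid fd, or negative on some other error */
}
```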

Did you get that? If I can get you to write to a path I control, I might be able to leave a symlink in that spot, pointing at a file I'd like you to overwrite with something. Maybe I can get you to write something including a blob of data I pass you at a location where I can drop a symlink.

What happens if I set up /tmp/target to point to /home/you/.ssh/authorized_keys? What if I can get you to write something that might have a bunch of crap in it, but then will include a newline, a public key that will let me in, and another newline? If your system allows following that symlink (some security policy things prevent this), I might just find a way into your account.

People used to do this same thing by getting root to overwrite ~/.rhosts, or /etc/shadow, or whatever else you can think of. Some systems added paranoia about not following symlinks that are not owned by the same user in world-writable paths (like /tmp and friends). Don't rely on this being in place! Do the right thing and make sure you're opening the actual file.

Speaking of racing to write things and exclusive creation of files, let's talk about atomic updates. Recall that write() might not do all of its work at once. You may need to call it several times before the entirety of your content makes it out. How, then, do you keep other processes from seeing this half-written file in the middle?

One approach would be to use locking. There are things like flock() and lockf() and fcntl() you could use, with the understanding that every other thing also has to play by the same rules. That is, another process which reads the same path but doesn't bother to see if it's locked will just barrel on ahead and see whatever's managed to land there. flock and friends are advisory locks, not old-school MS-DOS style "you can't read this because SOMEONE has this thing open somewhere" locks.
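For completeness, here's roughly what the advisory-lock approach looks like with flock() (the function name is mine; remember it only restrains other processes that also bother to call flock() on the same file):

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Take an exclusive advisory lock, do our thing, let it go. */
int lock_and_use(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX) != 0) {   /* blocks until we hold the lock */
        close(fd);
        return -1;
    }
    /* ... read/modify/write the file here ... */
    flock(fd, LOCK_UN);              /* closing the fd also drops it */
    close(fd);
    return 0;
}
```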

Another approach is to do the whole "atomic update" thing. This is where you create a temporary file, drop all of your data there, and then rename it into place.

That is, you do all of the following:

  1. Create a file adjacent to the target path using mk*temp or similar.
  2. Write your data to it.
  3. rename() it from the temporary name to the final name.

This might mean you create /home/you/.cool.conf.CLXHKJELHFJE, fill it up with tasty, tasty data, and then rename() it to /home/you/cool.conf. Readers of that file will either see the old one or the new one, but will never see an "in between" state.
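Those three steps might be sketched like so (atomic_write is my name for it; a careful version would also loop on short writes and fsync() before the rename):

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Atomic update sketch: write to a temp file in the SAME directory
 * (rename() can't cross filesystems), then rename() into place.
 * Error handling trimmed for brevity. */
int atomic_write(const char *final_path, const char *data, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.XXXXXX", final_path);

    int fd = mkstemp(tmp);          /* does O_CREAT|O_EXCL under the hood */
    if (fd < 0)
        return -1;

    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        unlink(tmp);                /* don't leave the half-file behind */
        return -1;
    }
    close(fd);

    if (rename(tmp, final_path) != 0) {  /* the atomic switcheroo */
        unlink(tmp);
        return -1;
    }
    return 0;
}
```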

I should note that creating a temporary file safely is non-trivial, and there are actual *system* libraries which do this properly. In particular, doing things like "I'll open a path that includes my uid, or pid, or time of day, or a random number" are all crap if that's your entire strategy. You also HAVE to do the O_EXCL|O_CREAT thing to make sure nobody races you to it.

Put it another way: if your idea of avoiding races in your temp file creation is to just have a wider namespace, what keeps an evildoer (me) from just grabbing every possibility first? Disk space is cheap, right? You can only have so many pids, and the time of day is obviously predictable, and your random number space is also similarly limited. I could just set up every one of them first and when you hit it, I win!

Finally, as long as we're talking about open and flags, don't forget about modes. If you want to create a file that nobody else can read, you need to make it that way from the start. You don't get to open the file and then chmod() it, because if someone manages to open it in the middle, they can read it (or worse).

This is why open() takes a mode_t, letting you tell it what sort of permissions it should have.

Of course, the fun doesn't end there. There's also the umask, which (as the name kinda-sorta suggests) masks off some of the permissions when a file is created. That is, the more bits you have set in the umask, the *fewer* bits will be set in the resulting file. Why have this? Well, think about a program that might want to make files 0644 or 0664 depending on whether you are working on stuff yourself, or are allowing other people in your (Unix) group to write to it. You, the user, could use the umask to pick which mode things get when they are first created, without changing the program.
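As a sketch of how the mask interacts with the mode (the function name is mine, and the demo pins the umask to 022 just so the result is predictable; normally you'd inherit it from the shell):

```c
#include <fcntl.h>
#include <sys/stat.h>

/* Ask for 0666 with a umask of 022 and the file lands as 0644:
 * the bits set in the umask get knocked OUT of the requested mode. */
int create_with_umask_demo(const char *path)
{
    mode_t old = umask(022);
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0666);
    umask(old);                /* put the old mask back */
    return fd;
}
```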

Granted, I don't know many people who use these machines this way any more, but back in the days when they were truly multi-user, it made sense for certain jobs. In any case, it's still there and you still have to worry about it.

This is the kind of stuff I was talking about when I said that Unix has just enough potholes and bear traps to keep a whole valley going. Knowing this kind of crap has made me able to hawk my sanity-checking services for years. It also causes no end of breakage when folks (who are fortunate to not have this stuff taking up mental space) run afoul of it.

Talk about a double-edged sword.