Software, technology, sysadmin war stories, and more.
Sunday, December 6, 2020

retvals, terrible teaching, and admitting we have a problem

Sometimes, my older posts find a new set of readers and generate a whole new round of interest. The whole "fork() can fail" thing from August 2014 did this earlier this year. It's over six years old but is still just as valid as ever. It still brings out THE ONE in certain venues, too.

Let's talk about what's going on here. The fundamental situation is that we have a library call that eventually does some sort of system call, and that system call can fail. It's actually kind of interesting, given that fork-the-library-call might call fork-the-syscall. It's just as likely that it'll call clone() instead, especially on Linux with glibc in the past, what, 15 or so years.

Regardless of whether it's fork, clone, or whatever else in your kernel, the important thing here is that it's not guaranteed to succeed. Instead, if you run it long enough in enough situations, you will eventually make it fail. If you have enough machines, it won't even take that long. You will find a way.

The situation from that Friday morning in August of 2014 was one of running out of memory on the entire box, but that's not the only way to do it. You could also hit a container limit, since that's totally a thing now. Or, this one is really fun - you can actually run out of pids. Yeah! Think of it as the process-level equivalent of running out of ephemeral ports (a story for another time).
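To make those failure modes a bit more concrete, here is a rough sketch of turning the errno from a failed fork() into a hint about what probably ran out. The function names fork_failure_hint and report_fork_failure are made up for this example, and the hint strings are my own wording. The two errno values are the ones the Linux fork(2) man page documents for these cases; other systems may differ.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: map the errno left by a failed fork() to a hint
 * about which resource was probably exhausted. The hint text is
 * illustrative, not anything the kernel or libc provides. */
const char *fork_failure_hint(int err)
{
    switch (err) {
    case EAGAIN:
        /* A limit on the number of processes/threads was reached:
         * RLIMIT_NPROC, the system-wide pid space, or a cgroup
         * (container) pids limit. */
        return "process limit hit (rlimit, pid space, or container pids limit)";
    case ENOMEM:
        /* The kernel could not allocate memory for the new process. */
        return "kernel could not allocate memory for the new process";
    default:
        return "unexpected fork failure";
    }
}

/* Hypothetical helper: log a failed fork() with both the standard
 * strerror() text and the hint above. */
void report_fork_failure(int err)
{
    fprintf(stderr, "fork: %s (%s)\n", strerror(err), fork_failure_hint(err));
}
```

The intended use is to save errno right after fork() returns -1 and hand it to report_fork_failure() before doing anything else that might overwrite it.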

Given this eventuality, you'd think that we'd be sending students of various computer-related disciplines into the world with the knowledge that it can happen, will happen, and has to be handled. Unfortunately, reader feedback and my own research say this is not the case.

I wrote a bit about this in September, when I told a story about a real conversation I had in a postmortem meeting for an outage of something or other you've probably used if you live in the US or Canada. In it, a couple of the people affirmed that they didn't think they needed to check return values because "it's not supposed to break", and "it makes the code messy".

Yes. They really told me that. Go read the old post if you like.

A few days later, I heard from someone who encountered a similar vibe in their textbook. Check this out.

Hey! I was reading your post on "What is your take on checking return values?" and a couple of hours later I came across this in my textbook: "Programmers should always check for errors, but unfortunately, many skip error checking because it bloats the code and makes it harder to read." So I guess this is more common than I expected? Should it be treated as a red flag when working somewhere?

I was stunned when I read this. This is a line in a textbook which comes from an author who is influencing impressionable minds, and is responsible for setting people right in order to turn out people who hopefully won't willingly create utter shit code in their careers. They could have said something like "because many programmers THINK it bloats the code", but instead, note how it's written: "because it bloats the code". To me, that sounds like agreement on the author's part.

That's terrible! I corresponded with the person who sent me that feedback and found out that yep, it is in fact "Computer Systems: A Programmer's Perspective 3/E (CS:APP3e)", as you will discover if you go searching the web for that snippet they quoted. This is a book which has made it around to multiple schools and is just out there.

But wait, there's more. Instead of just bagging on this one book and author, let's go wider. Try this: go to your favorite search engine and search for code which calls fork(). Many schools with some kind of operating systems or systems architecture class will have a syllabus, slides, or even notes online, and you might find a hit.

The thing you want to search for is easy enough:

pid = fork();

Then load up a few of those pages and see what you find. I bet you're going to find a bunch of places where they test to see if pid is zero, and if so, it's the child, otherwise, it's the parent.

The fact that the retval (and therefore "pid") can be negative, meaning it failed, is missed entirely. The program just trundles on, doing whatever, and if it uses that not-a-pid "pid" variable to send a signal later on, hey, things get really interesting!
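For contrast, here is a minimal sketch of handling all three outcomes. The helper names spawn_and_wait and signal_child are invented for this example, and the child just exits with a given code so there is something to observe. The second helper is there because kill() with a pid of -1 signals every process the caller is allowed to signal, which is exactly what you get if you pass along the -1 from a failed fork().

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with `code`, wait for it,
 * and return its exit status. Returns -1 if fork() or waitpid() fails,
 * or if the child did not exit normally. */
int spawn_and_wait(int code)
{
    pid_t pid = fork();

    if (pid < 0) {
        /* fork() failed: there is no child, and pid is not a process id. */
        fprintf(stderr, "fork: %s\n", strerror(errno));
        return -1;
    }

    if (pid == 0) {
        /* Child. _exit() avoids flushing stdio buffers copied from the parent. */
        _exit(code);
    }

    /* Parent: pid is a real child pid here. Retry if a signal interrupts us. */
    int status;
    pid_t r;
    do {
        r = waitpid(pid, &status, 0);
    } while (r < 0 && errno == EINTR);

    if (r < 0) {
        fprintf(stderr, "waitpid: %s\n", strerror(errno));
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Hypothetical guard: refuse to signal anything that is not a plain
 * positive pid. kill(0, ...) targets the caller's process group and
 * kill(-1, ...) targets every process the caller may signal. */
int signal_child(pid_t pid, int sig)
{
    if (pid <= 0) {
        errno = EINVAL;
        return -1;
    }
    return kill(pid, sig);
}
```

It's a handful of extra lines. Whether that counts as "bloat" is left as an exercise for the reader.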

The first time I did this, I actually went and mailed the prof behind the topmost hit. Knowing that people with .edu addresses get all kinds of rando wingnut mail, I kept it simple in hopes they'd connect the dots and clean it up. I sent that mail at the end of October, and while I did get a nice "thanks for the tip" response, the page itself is still there. In other words, someone is wrong on the Internet! What? Oh, sorry.

Really though, this is everywhere. It's not just that one class. It's not just that one school. It shows up all over the place. The vast majority of pages about this kind of stuff manage to convey it incorrectly. It's clear that not only is the horse out of the barn, but the cat is out of the bag, and the whole damn menagerie has cut loose and is running down Broadway singing show tunes. You just can't expect people to do the right thing when the right thing is implemented this way. Too many people have voted with their feet and have decreed that they are just going to not check, and whatever happens, happens.

I'm pretty sure that the only way out of this mess is to just forget about ever expecting the "right behavior" from most people who cross paths with this kind of programming environment. It's like, yes, you CAN do the right thing, and you CAN make something solid that crosses every t, dots every ... lowercase j, and generally doesn't fall over when it meets some rough road.

In practice, though, it's clear: expecting this to happen now is pure and utter insanity. It's doing the same thing over and over and expecting different outcomes. If you want some really deep cuts, I'll reach back to July 2011 and just say again that if too many users are wrong, it's probably your fault.

Quite simply, I don't think we can trust people by default with these footguns. Just look at the evidence everywhere.

I don't have a good answer to this right off the top of my head. I figured I needed to get the problem out there first so the wheels could start turning. I don't think people realize just *why* this has been happening, and why there's this "yolo" thing going on when it comes to code correctness and handling the unhappy paths.

What can I say? Most people should not be programming? Oh, then I'm an elitist, and a gatekeeper, and worse, because I'm obviously not going to include myself in that group, right?

We should take some languages out behind the shed and shoot them? Well then we'd have to do that to all of them, since they're all terrible in their own way. Nobody and nothing has the moral high ground here. Your precious little language is included in that. Yes, you too.

There are programs to be written, and every choice we have for creating them is broken in one way or another. This means that like it or not, people are going to keep doing it, and are going to keep getting in trouble.

You know the old saw about how if we built buildings like we write programs, the first woodpecker would destroy society?

Nobody ever mentioned that WE are the woodpeckers in the programming side of that analogy. It's not some random bird that comes and finds you. It's the very people who put it together in the first place.

Let's see if we can at least admit we all have a problem. Then maybe we can try to do something about it.