Software, technology, sysadmin war stories, and more. Feed
Wednesday, September 2, 2020

What is your take on checking return values?

One day not so long ago, I was in a meeting listening to a team explain why their service had gone down and taken out a big chunk of a business. They were one of those things that has to exist and work in order for the actual "thing that makes money" to go. Think of delivering pizzas, connecting dog walkers with dogs who need to be walked, that kind of thing.

It turned out they had been crashing every time a request came through for a certain part of the country. That is, not all pizzas, dog walkers, or whatever it was were handled identically, so they had their own city or region configurations. Think of differences in pricing, taxes, features, or whatever. Trying to process a request for this one particular region had caused the entire process to die when it hit a new config that was "bad" somehow.

Specifically, loading this new config put a NULL/nullptr/nil/None type thing into the value, and then something else tried to use it. When that happened, the program said "whoops, can't do that", and died. (Note, I am deliberately being obtuse about which language was involved. The badness in this story is not about the language, but how it was used.)

Now, since this program was massively parallel and also handled requests for all over the country, crashing on a single request actually interrupted service for a whole bunch of other things that otherwise had nothing wrong with them. It's what I'd now call "multi-tenant". Since this affected all of the replicas, the whole thing was compromised.

I had to ask a bunch of questions to get to this point of understanding. It wasn't really clear what decisions they had made to get us to that point.

"Why did it go down?"

"It broke."

"Why did it break?"

"It crashed."

"Why did it crash?"

"It (paniced/threw/segfaulted/...)." [again, being obtuse about the language here]

"Why did it do that?"

"Some field wasn't there [in the new config]."

"Why didn't it check for it, or notice the bad returned value somewhere?"

The answer to this one finally started shedding light on how this had all happened.

"All of that return checking makes the code messy."

You're just going to have to trust me that I am not making that up, and I am not even paraphrasing that last part. They actually said that to me: it makes the code messy so they don't want to do it.

This is not someone who knew me and was injecting humor into the situation. This is not a place where they let it hang in the air for a beat and went "no, just kidding, we actually XYZ". They actually meant that exactly as stated and were not joking even a little.

Around this time, I think I said something about how that's basically the job. That is, my job, their job, anyone's job... is to check for things that could go wrong and deal with it when it does. It's never been optional, and certainly not because of some kind of "avoid messiness" mandate (where the heck did that come from, anyway).

They added more.

"It's not supposed to be like that, so we don't need the checks."

Yep, this was the justification: the config is not supposed to return that (bad non-value), so this won't happen, so we need not check for it. Yep. Sure thing. Makes sense. Seems legit. Uh huh!

My reply, well...

"And yet, here we are, because it DID happen. So you do."

There was an awkward silence, and then the room moved on.

I've learned from going to far too many of these meetings that you can't just hammer them into the ground right there in front of everyone else. Once it's clear that they Do Not Get It, all you can do is take a note to follow up later.

What I found more interesting was that by and large, the room was going along with the presenters and their "it's messy so we don't do it" ethos. There was head nodding and general acceptance of the situation. There wasn't the kind of pushback I would have hoped to see.

This is the kind of culture stuff it would be nice to know before you accept a job somewhere. Imagine being able to run that line past a half-dozen people at a potential future employer and see how they respond. If they generally agree with it, worry... a LOT.