Software, technology, sysadmin war stories, and more.
Saturday, January 7, 2017

Where have I been these past few years?

Hey there, and welcome to 2017. I hope things are good for you, whoever you are and whatever you're doing -- unless you're doing evil, then I wish you no end of heisenbugs and other things which thwart your plans.

Anyway, if you've been following my writing for a while, then it's pretty clear that something happened in July of 2013. What had been daily posts (and sometimes two or three on the same day) turned into a few a week, then weeks of silence, then months. Instead of a constant flood of traffic, there are just isolated posts.

What happened? Easy: I diverted nearly all of my cycles into a Real Job, and by that, I mean the sort of thing where you get up around the same time every day, go to an office with other people, work with them, form teams, make friendships, go to lunch with different groups occasionally, and so on.

But, eh, there's a catch. It's not a traditional job in the sense of having a boss saying "do this, go there, do that". No, it's more like "okay, it's on you now, so go off and find stuff that's broken, and make things better"... and it's been like that for almost four years now. It's sufficiently different that it actually works for me, even after I thought I was done with this whole industry in 2011.

What's this life like? I'll give you a taste. Sometimes this means jumping feet-first into situations, like The Site Is Down. Which site? You can probably figure that one out. I don't have to put it in writing, do I? You probably use it. Not the one with all of the primary colors. That was so 10 years ago. It's something else. (Yes, I'm being difficult here just for my own amusement. It's not exactly a secret.)

Right, so, it's not down that much. That means the rest of the time, which is nearly all of the time, I go do other things. That looks like this: you wander around looking for stuff that seems dubious. Then you see if it could actually break in that corner case that "never happens", but you know it will... someday.

Then, well, you get it fixed. Sometimes that means "write a patch". Other times that means "convince someone that their thing needs a patch". But, more often than not, it means "change the system so that anyone trying to do something this way ever again will encounter significant resistance from the system itself, such that it's easier for them to just do it the safe/right way".

You can't solve these problems by just telling people to do X instead. People are essentially bags of mostly water, and those bags are bad at remembering arbitrary rules parceled out by ridiculous chunks of silicon. Also, the particular people you may reach, even if they are perfect, will not be the same people who are doing the work just a year later. Turnover and growth in any team or company are high enough that your valiant attempts at teaching people will be reduced to whispers in the wind in a matter of months.

What does this look like in concrete terms? Okay, like this. So you notice that a whole mess of failures have come from stuff segfaulting. Then you find out the segfaults were due to bad std::vector (C++) accesses. (How? You grovel through stack traces and core dumps.) People are reading off the end with the [] operator, or are going to indices which don't exist. Other people are using .front() or .back() when the thing is empty and thus has neither.

At that point it's a matter of talking to the static analysis wizards in the company and showing them just how messed up vector use can be in the hands of people. Demonstrate the impact that fixing it would have. Show the outages which didn't have to happen.

Then they do some magic and come up with something neat which detects accesses which aren't sane before you can even check in the bad code. An entire class of problems disappears inside the company, and hopefully a white paper or tool or something gets released to the world to quell it on the outside as well.

That's a small part of what I've been up to. Some of these things manage to make it outside and upstream and get merged, and then everyone else gets to benefit from them.


Another? Okay, how about lsof? Recent versions picked up this behavior where it would traverse /proc/pid/task/*/fdinfo/*. So let's say you had a process with pid 100, and it had 50 fds open, and 10 threads (call 'em 101 to 110). lsof would proceed to open every file in every fdinfo directory for every task under that process. That's 10 calls to opendir, each one followed by 50 rounds of stat and read on the fdinfo files. 500 calls just like that, and we're just getting started. It would also do this for every process on the machine if you were using it in the default "look at the whole box" mode.

So that's approximately pids * tids * fds calls to open, read, parse, close, and so on. On a big enough machine with a lot of work going on, it would basically never finish. It might take hours, which as far as I'm concerned is "forever".

Never mind that fd n is the same fd for all threads on a process, so reading fdinfo multiple times made no sense. The big-O blowup was insane.
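
To put rough numbers on that blowup, here's a minimal sketch of the arithmetic. The function names and the 2,000-process box are made up for illustration. This isn't lsof's code, just a count of how many fdinfo files get touched each way.

```python
def fdinfo_reads_per_task(procs):
    """Reads when every thread's fdinfo directory is walked.

    procs is a list of (thread_count, fd_count) pairs, one per process.
    """
    return sum(threads * fds for threads, fds in procs)


def fdinfo_reads_per_process(procs):
    """Reads when each fd is looked at once per process instead."""
    return sum(fds for _threads, fds in procs)


# The example from above: one process, 10 threads, 50 fds.
one_proc = [(10, 50)]
print(fdinfo_reads_per_task(one_proc))      # 500
print(fdinfo_reads_per_process(one_proc))   # 50

# A hypothetical busy box: 2,000 processes, each with 200 threads and 1,000 fds.
busy_box = [(200, 1000)] * 2000
print(fdinfo_reads_per_task(busy_box))      # 400000000
print(fdinfo_reads_per_process(busy_box))   # 2000000
```

Same information either way, at 200 times the cost on that imaginary box.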

We tracked that down, squashed it and reported it. No big thing.

One fun thing was how I proved this was the problem. I didn't want to recompile lsof from a srpm, so I just looked for a way to turn off the option which said "traverse the tasks directory", and then discovered it's impossible. The default value for the options basically bitwise-ORs some values together and one of them says "do it". You can't XOR it back out without disabling stuff that we actually wanted.

So what did I do? I found the place in the binary where that default value was stored, then dropped the bit in question from that value, and then ran the binary again. (How? xxd, vi, then xxd again in reverse. Tactical actions for tough times.)

When it finished quickly without having a nap in fdinfo hell, I had my proof.
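
If the flag business sounds abstract, here's a small sketch of the idea. The flag names and values are invented for illustration (they are not lsof's real option bits), and the patching happens on an in-memory buffer rather than a real binary. I'm also assuming the default is stored as a little-endian 32-bit int, which is a guess made for the sake of the example.

```python
import struct

# Hypothetical option bits; not lsof's actual values.
OPT_SOCKETS = 0x01
OPT_TASKS = 0x02      # stands in for "traverse the tasks directory"
OPT_FDINFO = 0x04

DEFAULT_OPTS = OPT_SOCKETS | OPT_TASKS | OPT_FDINFO   # 0x07


def clear_bit(value, bit):
    """Drop one flag and leave every other bit alone."""
    return value & ~bit


def patch_default(blob, offset, bit):
    """Clear `bit` in the little-endian 32-bit int stored at `offset`."""
    (old,) = struct.unpack_from("<I", blob, offset)
    patched = bytearray(blob)
    struct.pack_into("<I", patched, offset, clear_bit(old, bit))
    return bytes(patched)


# A fake "binary": some padding, then the default value, then more padding.
fake_binary = b"\x90" * 4 + struct.pack("<I", DEFAULT_OPTS) + b"\x90" * 4
patched = patch_default(fake_binary, 4, OPT_TASKS)

print(hex(struct.unpack_from("<I", patched, 4)[0]))   # 0x5
```

Only the one stored value changes, so everything else the defaults turned on stays on.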


Okay, how about another one. If you run bash in long-lived scripts, you have probably been throwing away tons of CPU in managing the "bgpids" structure. Have you ever thought about how bash does job control, and how it can know that %1 is pid X and %2 is pid Y, and how "wait" works, and all of those things? Well, it has to track everything it starts, so that it can keep tabs on them and collect exit statuses later.

bgpids is where it does that, and on ordinary machines, it doesn't get that big because the max pid value is probably something like 32K. But what if you run huge machines with tons of pids (likely from dozens of processes with hundreds of threads)? Well, it's highly likely you've blown past the 32K number and had to raise that limit.

Well, once you do that, bash scales bgpids accordingly, and now it's chasing down numbers in a massive linked list every time it needs to do something in there. Yeah, that's right, I said linked list.

How would you notice this? Easy. Write a bash script that basically does "while true" and runs a bunch of stuff over and over and over, with relatively short-lived loop bodies, so it churns through pids. You will eventually notice it getting slower and slower, and if you profile it (I used 'perf', but you can use whatever you like), it'll point squarely back at the management of bgpids.
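
For a feel of why a linked list hurts here, this is a toy model. It is not bash's code or its actual data layout; the class and function names are made up, and all it does is count how many nodes get visited when you go looking for the oldest remembered pid.

```python
class Node:
    def __init__(self, pid, status, next_node=None):
        self.pid = pid
        self.status = status
        self.next = next_node


class LinkedPidTable:
    """Remembered (pid, status) pairs kept in a singly linked list."""

    def __init__(self):
        self.head = None
        self.steps = 0          # nodes visited across all lookups

    def add(self, pid, status):
        self.head = Node(pid, status, self.head)   # push on the front

    def find(self, pid):
        node = self.head
        while node is not None:
            self.steps += 1
            if node.pid == pid:
                return node.status
            node = node.next
        return None


def steps_to_find_oldest(count):
    """Fill a table with `count` pids, then look up the first one added."""
    table = LinkedPidTable()
    for pid in range(1, count + 1):
        table.add(pid, 0)
    table.find(1)
    return table.steps


print(steps_to_find_oldest(32_768))     # 32768
print(steps_to_find_oldest(262_144))    # 262144

# A dict keyed by pid does the same lookup without walking anything.
statuses = {pid: 0 for pid in range(1, 262_145)}
print(statuses[1])                      # 0
```

Raise the limit and the worst-case walk grows right along with it.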

But... if you've got a newer bash on your machine, or if someone backported the patch to your OS, then you suddenly don't have the problem any more.


Let's talk about the whole "more than 32K pids" thing some more.

The first thing you find is that a lot of tools don't know what to do with a pid that's seven characters wide. They've only ever planned for "%5d" and now you've got a pid like "1048576" sitting there. Some tools are smart about this. Others are not. They're mostly display bugs so they're mostly harmless. It looks stupid but you can probably cope until someone can fix it.
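
Here's roughly what that display bug looks like, plus one way around it. The process list and function names are invented for the example; real tools each have their own formatting code.

```python
def fixed_width_row(pid, name):
    """The '%5d' assumption: fine until a pid needs more than 5 digits."""
    return "%5d %s" % (pid, name)


def sized_rows(procs):
    """Size the pid column from the widest pid actually present."""
    width = max(len(str(pid)) for pid, _name in procs)
    return ["%*d %s" % (width, pid, name) for pid, name in procs]


procs = [(1, "init"), (31337, "sshd"), (1048576, "worker")]

for pid, name in procs:
    print(fixed_width_row(pid, name))
# The last row is two characters wider, so the name column no longer lines up.

for row in sized_rows(procs):
    print(row)
# Every pid is right-aligned in a 7-character column, so the names line up.
```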

There are far more interesting problems you find. Some programs have a CHECK or an assert that the pid will always be <= some number, like 32K or 64K. I think 'perf' was one of these offenders, but it may have been something else in that space. The first time someone ran it on a box that had been up for a little while, it blew up because the target pid was too big. Oops.

Then there are the static array people. This may have been 'perf' again, but I know I saw it somewhere else, too. They basically figure that "you'll only ever have a small number of pids on the machine, so why not just define an array of size maxpid, and then you get neato O(1) access to your state on them?"... yeah well.

Here in 2017, when that array length is now into the millions, that's no longer a small data structure, and now you're chewing some serious memory. You might not even be able to fit it onto the machine if they're using sufficiently heavyweight structs in that array. Yep, they get to switch to using dynamically-allocated space, like a vector, only not that exactly since it's C and not C++. Fun!
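
Some back-of-the-envelope math on that, as a sketch. The 1 KiB struct size and the 50,000 live pids are numbers I picked for illustration, and 4194304 (2^22) is just an example of a raised limit.

```python
PID_MAX_DEFAULT = 32_768        # 32K
PID_MAX_RAISED = 4_194_304      # 2**22, a raised limit used here as an example
STATE_BYTES = 1_024             # hypothetical size of one per-pid struct


def static_table_bytes(max_pid, entry_bytes):
    """Memory for an array with one slot per possible pid."""
    return max_pid * entry_bytes


def sparse_table_bytes(live_pids, entry_bytes):
    """Memory if only pids that actually exist get an entry (ignoring overhead)."""
    return live_pids * entry_bytes


MIB = 1024 * 1024
print(static_table_bytes(PID_MAX_DEFAULT, STATE_BYTES) // MIB)   # 32
print(static_table_bytes(PID_MAX_RAISED, STATE_BYTES) // MIB)    # 4096
print(sparse_table_bytes(50_000, STATE_BYTES) // MIB)            # 48
```

32 MiB is a shrug. 4 GiB for a lookup table is not.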


How about one that's purely forward-looking? Remember when that one site broke because they were using a strftime-type function with the format string that made it use the year from the ISO week instead of the typical one most people would expect?

I figured "if it can bite them, it can bite us", and went looking, just in case. Then I fixed anything I found and put in some lint rules to make it downright difficult to do the wrong thing accidentally.
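
In case you haven't run into this one, here's a small sketch of the trap. The date is one I picked because it shows the problem, not one taken from that incident, and it assumes your platform's strftime understands %G (the ISO 8601 week-based year).

```python
from datetime import date


def stamp_calendar_year(d):
    return d.strftime("%Y-%m-%d")    # calendar year


def stamp_iso_week_year(d):
    return d.strftime("%G-%m-%d")    # year of the ISO week: the trap


d = date(2014, 12, 29)               # a Monday in ISO week 1 of 2015
print(stamp_calendar_year(d))        # 2014-12-29
print(stamp_iso_week_year(d))        # 2015-12-29
print(d.isocalendar()[0])            # 2015
```

The two formats agree for most of the year, which is exactly why this sort of thing survives testing and then goes off in late December.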

Incidentally, it turns out that strftime and friends are evil, and probably should not exist. Most people would be just fine calling a small number of well-named functions for the most common cases. Fixing this kind of problem in the general case has taught me this.

I bet most people don't want to think "%Y-%m-%d". I think they want "give me the full year without any abbreviation or truncation since Y2K sucked, then give me the month and day with zero padding because I want it to always be the same width, at least until the year 10000 I guess". Don't make them figure out a format string. Give them a full_year_with_blahblah() function and call it done.
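
Something like this, as a sketch. The function name is hypothetical (pick a better one), and it deliberately never sees a format string.

```python
from datetime import date


def full_year_month_day(d):
    """Four-digit calendar year, zero-padded month and day: 2017-01-07."""
    return f"{d.year:04d}-{d.month:02d}-{d.day:02d}"


print(full_year_month_day(date(2017, 1, 7)))     # 2017-01-07
print(full_year_month_day(date(2014, 12, 29)))   # 2014-12-29
```

There's no way to reach for the ISO week year by accident here, because the caller never gets the chance.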


What about other stuff? Forward-looking stuff? Well, there was that whole leap second thing, or lack thereof. Not this most recent one, but the one before it. I got some great pictures of my two clocks, one with "23:59:59" on it, another with "23:59:60" on it, and then both clicking over to "0:00:00" immediately afterward.


That's just some of the down and dirty technical stuff I wind up stepping in. It turns out that being cursed with the ability to break basically anything is kind of useful when you can turn around and find out just how it went wrong, and then make it not happen again.

Besides that, I teach classes, and I give talks, and I mentor people, and I even do some public speaking now and then. Perhaps you've seen some of them and recognized a couple of stories from these posts.

Oh, and I write. A lot. But it's nearly all internal. So, hey, if you want to know where most of my output has been going, it's in there. If you're an employee, then there you go.