Tuesday, October 18, 2011

How to abuse a shared on-call rotation situation

I don't understand how some people are okay with things that keep paging them relentlessly. They joke about the problem and give it funny names but never fix the root cause. That kind of stuff annoys me to no end and forces me to come up with solutions. This is what happened one time.

I was on a team which had been splitting in two along service lines. The people on one side weren't really expected to know a whole lot about the newer goings-on of the other. Basically, anything which came along after a certain point wasn't expected to be fully understood across the split. We still had a single on-call rotation at that point, though, so you could get paged by one of these new things anyway.

There was supposedly an unspoken arrangement where you took care of your service and they took care of theirs during the day so it wouldn't lead to ridiculous pages at night. That way, the whole issue of not being totally familiar with the other service's newest features (and warts) would not be nearly as bad.

It didn't work.

At one point, the other team had rolled out a new version of part of their infrastructure. It had a bunch of new flags and knobs and foibles. One of these oddities was that it now had a condition where the "log" could get too big. They set up paging for it: the LogTooBig error.

Apparently this thing had been going off for some time. It had gotten to the point where they started giving it a silly name. Someone suggested it might as well say LogTooPurple, since saying the log was "TooBig" didn't convey anything useful anyway. The page told them the "what" but not the "why". Figuring that out was their job. Too bad they didn't.

Eventually, the on-call rotation came around to me, and one of these things started paging me. I wasn't going to just sit there and ignore it or laugh at it. It had to have some reason for happening, and I figured it deserved an honest look to see exactly what was going on. If nothing else, it would give me some ammo for pushing back on the other team to say "stop bugging us".

I found a machine where it was happening a lot and logged in. This was a shared server, so it had a lot of stuff going on. It looked busy. It felt busy. Then I went looking at our process. What I found was not the model of health.

Eventually, by using strace on various threads, I determined that this program was committing a serious crime: it was calling sync(). As I mentioned earlier, this was a very busy box. It had bunches of hard drives, and all of them were always up to something. There was a lot of reading going on, but there were also many writes being dispatched at any given time.
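
For the record, there was nothing magical about the poking around: attach strace with -p and the process ID, add -f so it follows all of the threads, and -T so it prints how long each system call takes, something like "strace -f -T -p 12345" (with the real PID, of course). The sync() calls stand out pretty quickly once you do that.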

Linux buffers disk I/O, naturally, so there's always a flow of stuff into the buffer from programs at one end, and out of the buffer onto the disk at the other. It does some optimizations to try to minimize the overhead, depending on what kind of I/O scheduler you have. It'll get to your request sooner or later. You just have to be well-behaved.
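
For contrast, here's a minimal C sketch of the well-behaved case (the filename is made up): write() returns about as soon as the data lands in the page cache, and the kernel gets it onto the disk whenever it's convenient.

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/example.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return 1;

        /* Returns almost immediately: the data just goes into the page cache. */
        write(fd, "hello\n", 6);

        /* The kernel flushes it out on its own schedule.  Nothing here forces
         * anyone else's I/O to happen right now. */
        close(fd);
        return 0;
    }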

This program which had been paging me was anything but. When it called sync(), it basically demanded that the kernel flush all of its dirty buffers to disk right then and there: not just this process's writes, but everything pending on the whole box. It also meant the calling process (ours) would block in that sync() call until the flush finished. In other words, this thread would go into an uninterruptible sleep every time it happened.

This box was so busy that a sync() took 30 or 40 seconds. Then I noticed something else: as soon as it came back from a sync(), this thread immediately called it again and suffered another 30-40 second wait. This was ridiculous! You're not supposed to do that on a shared system, you punks!
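
Reconstructed from what strace was showing (this is my sketch, not their actual code), the thread's behavior boiled down to something like this. On an idle machine the numbers will be tiny; on this box, every pass took 30-40 seconds.

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        /* Please don't actually run this on a box other people care about. */
        for (;;) {
            double start = now();
            sync();   /* blocks until the kernel flushes everyone's dirty data */
            printf("sync() took %.1f seconds\n", now() - start);
            /* ...and then it turned right around and did it again. */
        }
    }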

I did some digging and found that some library they had started using with this new build called sync() after every write. I reported this to the other team's developers and they put in a workaround: they dialed back the frequency of calls to that particular function. A need for a "better fix" was also filed as a bug.
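
For what it's worth, and this is just my illustration rather than whatever got written up in that bug, the neighborly way to make your own data durable is fsync() on your own file descriptor (or fdatasync() if you don't need the metadata flushed too). That pushes out the one file you care about instead of every dirty buffer on the machine:

    #include <unistd.h>

    /* Hypothetical helper, not from their library: append a record and make
     * sure *this file* reaches the disk, without dragging the rest of the
     * box's I/O along for the ride. */
    int append_durably(int fd, const void *buf, size_t len)
    {
        if (write(fd, buf, len) != (ssize_t)len)
            return -1;              /* error or short write */
        return fsync(fd);           /* flushes only this file's data and metadata */
    }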

The good news is that the pages for that particular problem stopped after the workaround hack rolled out.

The bad news is that the original situation which allowed it to happen -- the people who chose to mock it rather than investigate -- persisted. That meant it was only a matter of time until it happened again.

Fortunately, at some point, the on-call rotations were split and we stopped getting their pages. As I've said before, enjoy your pages, guys.