Software, technology, sysadmin war stories, and more.
Sunday, October 9, 2011

Blocking in state D? You're doing it wrong!

Let's talk about overloading your spindles. It's not a good thing.

Once, I was introduced to a system that kept generating alerts for being too slow to do operation X or operation Y. It would fire one off, then it would sit for a while, then it would clear. You'd wait an hour or so, and it would happen again. I don't know why this was allowed to persist with no action taken to fix it, but that was the state of affairs when I encountered it.

Eventually, I was handed a pager which was connected to this flaky system. This turned what was just a mere curiosity into a serious annoyance for me, and as expected, I started investigating. It took me pretty far down the rabbit hole.

The thing which was complaining was some high-level application. It was whining that more than x% of its writes were taking more than y milliseconds to complete. Upon sniffing around that app, I found that it was writing to a particular database instance.

Looking at that database instance showed that some of its write operations really were that slow, and it wasn't the high-level app's fault, or even the fault of the database. Instead, it was doing a write to a custom SAN-type fabric which was taking forever to finish.

These boxes ran Linux, so I logged in to one of the systems at the bottom of this call chain to see what was going on. That's when I saw the mess. Just by looking at "ps", I could see we had a serious problem here. Some of the tasks were in state D. That's an uninterruptible sleep, and in my experience that usually means you're blocking on some kind of I/O.
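You can spot this from a shell in one line. A minimal sketch (the exact column list is my choice; STAT and WCHAN are standard procps output fields):

```shell
# List tasks in uninterruptible sleep: STAT codes starting with "D".
# WCHAN shows the kernel function the task is sleeping in, which
# often points straight at the I/O path it's stuck on.
ps axo pid,stat,wchan:30,comm | awk 'NR == 1 || $2 ~ /^D/'
```

On a healthy box this prints just the header; a column of D entries all parked in the same wchan is the smoking gun.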

strace on the offending processes bore this out, too: they'd go out into a write() or similar and would sit there for a l-o-n-g time while the system took its time to dispatch things to the actual drive. That got me looking at the disks on these machines and the sort of loads they were encountering.

As it turned out, the hot spots were almost always the machines with just one hard drive. Someone had decided that a single disk machine would make a capable storage node even though it was obviously very busy doing other stuff which also touched that one drive. I suppose it might have worked if the load had been adjusted appropriately, but it wasn't.

It seems that somewhere up the chain, there was a process which was responsible for directing load throughout the SAN-ish fabric. It apparently took a look at the machines it had to work with and the overall character of that group, and divided up the load accordingly.

If you had 20 machines with 10 hard drives each and 2 with just 1, guess what? The per-machine share of the load would be way higher than those two single-diskers could possibly handle. We didn't know this at the time, but it was clear that machines with fewer disks were screwing things up. We didn't need the disk space, so booting them out was the way to go.
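The arithmetic behind that imbalance is simple. A sketch using the machine counts above (the even, per-machine split is my assumption about how that balancer worked, not something I ever saw in its code):

```shell
# Hypothetical even-split placement: total capacity is divided per
# machine, ignoring how many spindles each machine actually has.
total_disks=$((20 * 10 + 2 * 1))   # 202 spindles across 22 machines
share=$((total_disks / 22))        # each machine's share: ~9 spindles' worth
echo "a single-disk node is asked to serve ${share}x what one spindle can deliver"
```

Nine spindles' worth of seeks landing on one drive, and suddenly every write on that box is crawling.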

What was really frustrating about all of this is that nobody would believe me at first. I said, look, it's obvious: these machines are being hit too hard. State D is something you never want, since it means you're basically redlining the thing. At that point, all attempts at getting reasonable latency on disk accesses go right out the window.

Nobody believed me until I started taking matters into my own hands. There was a way to command a "SAN" node to stop taking on new files, and I'd do that to some of the weaker ones in the fabric. The load would find its way over to the other nodes, and latency would drop across the board. This would percolate up to our application which would then become happy and the pager would stop blowing up.

I wish I could say this was the only case where a bunch of people flat out did not believe me and my experience-based reasoning when I said something was obviously wrong. I left that environment behind and set sail for places where I was not automatically doubted.

Their loss, I guess. Enjoy your pages, guys.