Software, technology, sysadmin war stories, and more.
Wednesday, June 1, 2011

The utility of unchecked MegaRAID ioctls

Imagine a web hosting company with a whole bunch of hard drives out there, whirring away in machines which are being leased by many customers. Usually you can see them in /proc/scsi/scsi, assuming Linux, but what happens when you have a hardware RAID adapter which hides them behind a virtual target? Life gets complicated.

One summer, we had a problem. Disks from a certain manufacturer with a certain firmware had a tendency to fail at higher-than-normal rates when used with MegaRAID controllers. We needed to find everyone who had the bad firmware and then schedule maintenance to have things upgraded.

This was no trivial affair, since the only way to see "through" the virtual sd* devices was to either drop into the card's BIOS or start up their proprietary binary-only "megamgr" tool. While the tool would tell you exactly what you had in there, it was a slow and annoying manual process, since it used a text-based graphical menu system with windows popping up all over the place. Scripting it would have been terribly difficult.

Worse, using the tool was potentially dangerous. There were many ways to drop the array or otherwise break the entire system, but only one correct way to use it. We might have hurt these systems worse than just leaving them to chance. Something had to be done.

Somehow, they found me, a new employee that year, and asked if I could do some "reverse engineering" to help out. I figured it couldn't be that hard, since anything running on the system can probably be sniffed with some combination of strace, ltrace, and groveling through the MegaRAID driver in the kernel itself. They gave me a test box, and off I went.

It turned out that the driver supported ioctls, and one of them would do a pass through to the SCSI bus behind the card. There, you could address different targets by their id numbers and fire off SCSI INQUIRY commands. Those commands would either return a small buffer or time out. Once you flipped through the entire range (0-15), you were done.
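For the curious, the scan loop described above can be sketched roughly like this. It is not the original program: the `send_passthru` callable is a hypothetical stand-in for whatever wraps the driver's pass-through ioctl, and none of the driver's actual request structures are shown here. The only parts taken from the SCSI standard are the 6-byte INQUIRY CDB (opcode 0x12) and the layout of the 36-byte standard response, where the vendor, product, and firmware revision sit at fixed offsets.

```python
INQUIRY_OPCODE = 0x12      # SCSI INQUIRY
STD_INQUIRY_LEN = 36       # minimum length of standard INQUIRY data


def build_inquiry_cdb(alloc_len=STD_INQUIRY_LEN):
    """Return a 6-byte INQUIRY CDB asking for standard inquiry data."""
    if not 0 < alloc_len <= 0xFF:
        raise ValueError("allocation length must fit in one byte")
    # opcode, EVPD=0, page code=0, reserved, allocation length, control
    return bytes([INQUIRY_OPCODE, 0, 0, 0, alloc_len, 0])


def parse_inquiry(data):
    """Pull the identifying strings out of standard INQUIRY data."""
    if len(data) < STD_INQUIRY_LEN:
        raise ValueError("short INQUIRY response")

    def text(raw):
        # Fields are space-padded ASCII.
        return raw.decode("ascii", errors="replace").strip()

    return {
        "device_type": data[0] & 0x1F,   # 0 means direct-access (disk)
        "vendor": text(data[8:16]),
        "product": text(data[16:32]),
        "revision": text(data[32:36]),   # the firmware level we cared about
    }


def scan_targets(send_passthru, targets=range(16)):
    """Probe each target id and collect whatever answers.

    send_passthru(target, cdb, alloc_len) is a HYPOTHETICAL callable that
    returns the response bytes, or None if the command timed out.
    """
    cdb = build_inquiry_cdb()
    found = {}
    for target in targets:
        data = send_passthru(target, cdb, STD_INQUIRY_LEN)
        if data is None:
            continue                     # nothing at this id
        found[target] = parse_inquiry(data)
    return found
```

Feeding `scan_targets` a fake `send_passthru` that returns canned bytes is enough to exercise the parsing without touching any hardware.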

This made people very happy, and they went off and added it to the system check scripts or whatever else needed it. We upgraded a bunch of drive firmwares, and had a largely uneventful summer and fall as a result.

At some later date, I happened to run the program again, and it worked. This was a little surprising, since I hadn't been running it as root at the time. Upon closer inspection, it seemed the driver would let you send ioctls to it, including the MEGA_MBOXCMD_PASSTHRU required to talk directly to the disks, as long as you could open the /dev/megadev0 device. That device entry was typically world-readable on the RHEL systems of the day, so this meant anyone could do it.
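If you want to see whether a box is exposed in the same way, a quick check of the device node's mode bits is a reasonable first step. This is only a sketch: the function names are mine, and the default path is simply the /dev/megadev0 entry mentioned above. Keep in mind that tightening the mode only papers over the problem, since the real issue was the driver accepting the ioctl from anyone who could open the device.

```python
import os
import stat


def world_accessible(mode):
    """Return (readable, writable) for 'other' users, given an st_mode."""
    return bool(mode & stat.S_IROTH), bool(mode & stat.S_IWOTH)


def audit_device(path="/dev/megadev0"):
    """Report who can open the device node, or None if it does not exist."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None
    other_read, other_write = world_accessible(st.st_mode)
    return {
        "path": path,
        "is_char_device": stat.S_ISCHR(st.st_mode),
        "mode": oct(stat.S_IMODE(st.st_mode)),
        "other_read": other_read,
        "other_write": other_write,
    }


if __name__ == "__main__":
    print(audit_device())
```

A result with `other_read` set to True is the situation described above: any local user can open the node and start sending ioctls.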

It occurred to me that if anyone could send an INQUIRY command straight to a device, they could probably craft far more interesting SCSI commands and send them, too. Maybe you'd like to read the raw shadow file so you can try running crack against it? Or maybe just twiddle some bits in a crontab file so it runs your "cuckoo's egg" program as root instead of the usual thing? I never bothered to attempt an exploit of this, but it sounded plausible. Maybe someone should try it, assuming the driver is still out there in a vulnerable state.

One related bonus story about the MegaRAID from that summer: there was a system running FreeBSD, and we had no equivalent tool on that platform. We were not allowed to take the machine down to the BIOS to look at it that way, and they would not allow us root access, either. I proposed that they could run megamgr themselves and provided a link to a FreeBSD version, but warned them to be extremely careful since one false move could eat the whole machine.

A couple of hours later, I overheard another tech on the phone: so-and-so had called in, and their whole machine had just frozen. Well, it was obvious what had happened: they started megamgr and dropped the array, so what else could the poor system do but die?

The real icing on that cake is that they tried to blame us for it. That was the kind of customer churn we did not regret seeing.