Writing

Software, technology, sysadmin war stories, and more.
Sunday, October 14, 2012

A patch which wasn't good enough (until it was)

Some years back, I found myself working on a system which had its own custom replication system built on top of Sleepycat's BerkeleyDB (BDB). It allowed new replicas to bootstrap themselves with a rather simple scheme: scp the database file from another machine.

This linear read from 0 to EOF would cut off at some point, possibly in the middle of a record, but that was okay. The replication system would use the last complete record and then fall back on its own proprietary replication network protocol to ask for the remaining records. Eventually it would "catch up" and be in lock step with all of the other replicas. That was the normal, expected state of things.

Of course, it wasn't always that simple, since I'm writing about it now. We started having problems when the BDB files approached and then surpassed 2.1 GB. If that number isn't particularly meaningful to you, go look at 2^31 and think "signed 32 bit value". Basically, if you wanted to access files beyond that size, you needed "large file support" in your kernel, C library, and programs. Our ssh build didn't have it.
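To put a number on it, here's the arithmetic, assuming a bash-style shell:

```shell
# A signed 32-bit file offset tops out just under 2^31 bytes, so
# without large file support, reads past that point fail.
echo $(( 2**31 ))   # 2147483648 bytes, i.e. roughly 2.1 GB
```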

As a result, any time you tried to run scp on that file, it would bomb. We had to come up with other ways to make it copy. I guess I was on call one day and managed to do a "ssh user@host cat /path/to/file > local/file", and that actually worked. Even though ssh (and scp by extension) didn't have large file support out there, /bin/cat did, and so it was able to open and read the file. That got it updated locally, and I was able to bring that replica back up.

Obviously this is not the sort of solution you want to keep doing by hand. It's pointlessly manual and needs to be automated. I went to work and opened up the code for this storage server. Even though I was a pager monkey for that service and wasn't one of the developers, I considered it my duty to provide a patch for a key operational issue like this one.

My solution was to attempt scp and then fall back to ssh+cat if necessary. This added the least amount of code to a process which was rather important. I could have attempted to do the whole "check the remote size and then decide to call one or the other" but that seemed overly complicated for such an important job.

I'm not going to say exactly what it was I was working on, but let's just say that if my code brought down enough replicas, you would have noticed. Statistically, if you are reading this, you probably have one or more things which used to rely on those database servers being up and available. Okay? Okay.

Anyway, I sent it off for review. One of the devs got a hold of the review and quashed it. He wanted no part of it. I just said "fine, then", and deleted the code review and the pending change in my local depot. I did actually make a quick diff and socked it away in my home directory just for my own reference, but I didn't tell anyone about it. I figured if they were going to be obstinate, then they could just hand-hack replicas back to life every time they got a > 2 GB file on a machine with the old ssh binaries.

Months passed. Then I got a mail from another developer on the same project. He had been around at the time and remembered seeing my patch go by and get rejected. Apparently the issue had come back up again, and it wasn't going away. The database shards were all growing, and more and more of them were tripping this problem.

He asked nicely if I might somehow still have the patch. I did, and I sent him a copy. He integrated it, and that was that. From then on until they retired the system a couple of years later, my few lines of code were in there keeping things running without human intervention.

When I noticed it and presented it, it wasn't good enough. When time passed and he put the same code into the tree, it was fine.

I can only wonder why.