Writing

Software, technology, sysadmin war stories, and more.

Friday, October 21, 2011

Forgiveness over permission indicates a sick team

There's a well-worn saying about how it's easier to get forgiveness than it is to seek permission. I try to stay out of situations where such behavior is necessary, but sometimes there's no other way. One time I did it to set up a reasonable, transparent file archival system. This is what happened.

A good-sized team had a bunch of LAMP stack boxes running lots of programs regularly. It was a testing environment, and they'd run jobs continuously to verify behavior or stress-test hardware. This generated an awful lot of output.

While the tests were supposed to generate well-formed data output for humans to consume, reality was not nearly as kind. A frequent annoyance was that a test would be badly written, and it would choke somewhere. Then it would just die without saying anything useful in the official error log or reporting channel.

Once that happened, users would have to go sifting through the actual output files from all of these processes, including whatever they wrote to stdout and stderr. As you might imagine, this created a lot of log data. It started straining the capacity of the server.

I was enlisted to find a solution. What I wound up proposing was something which would copy an entire output directory to another set of servers which were backed by a big "SAN" type fabric. This fabric had more than enough storage space to do what we needed. Then we'd delete the original copy on the test machine and go on with life.

I had already written a small shell script which would call a utility program to copy data off, but it had to be run by hand. I wanted it to be invoked automatically from the end-stage code for a given test job.

The problem was, there was no code like that in the whole suite. There was code which woke up for each part of a job, but there was no one place where you could be assured of a call when the whole thing was done. Just to make sure it wasn't my ignorance of this system, I put it to the mailing list: "tell me where to hook this in". I had already wasted a bunch of time following their earlier advice and had hit a dead end when that turned out to be wrong.

Despite prodding everyone on the mailing list and raising it at various meetings, nobody could answer me. The consensus was that "the framework doesn't work that way" -- it has no concept of a "job". They said this even though the files on disk were clearly organized in terms of jobs. Somehow, the system was producing a per-job layout on disk that did not reflect its own internal design.

This told me it was time to start ignoring them and start thinking of my own ideas. One morning before work, it came to me: I can figure out if a job is done by looking at the same database the actual test scheduler uses. I'll have to figure out their schema, but that's never stopped me before! To hell with their system, I said, I'll just snake the job right out from underneath it.
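To make the idea concrete, here is a minimal sketch of that kind of check in Python. It is not the original code. The `tasks` table, its `job_id` and `status` columns, and the list of terminal states are all made up for illustration, and SQLite stands in for whatever database the scheduler really used. The point is only that a "job" can be inferred from its parts even when the framework has no such concept: a job is done when it has at least one part and none of them are still in flight.

```python
import sqlite3

# Hypothetical schema: the scheduler records one row per task, and each
# row carries the job it belongs to plus a status string.
TERMINAL = ("passed", "failed", "aborted")  # assumed "finished" states


def job_is_done(conn, job_id):
    """True if the job has at least one task and every task is finished."""
    placeholders = ",".join("?" for _ in TERMINAL)
    total, pending = conn.execute(
        "SELECT COUNT(*), "
        f"COALESCE(SUM(status NOT IN ({placeholders})), 0) "
        "FROM tasks WHERE job_id = ?",
        (*TERMINAL, job_id),
    ).fetchone()
    # A job with no rows at all is treated as "not done" rather than "done".
    return total > 0 and pending == 0
```

Treating an unknown job as unfinished is the conservative choice here: the worst case is that a directory sits around a little longer, which beats deleting output from a job the database has not heard of yet.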

So I started with a ruse of sorts. I went down to the weekly meeting and let them talk about the situation without mentioning my killer idea. It went around and around with people trying to look smart and I just went along with it. Finally I just said something like "well, it's clear the original requirements have changed significantly, such that my original design will no longer fulfill them, so I need to go back to the drawing board. Let's not waste any more meeting time on it". Then I looked at the clock to make the point clear. They took my lead and moved on.

As far as they knew, I had to go back and dig deeper into their stupid system. In reality, I already had it all plumbed out in my head and just needed to get out of that soul-sucking meeting to start cranking on the code.

Later that evening, I closed my door, turned on the music, and went to work. My "Offloader" server was born not long after that. It just sat there on the machine scanning for output directories. When it spotted a new one, it would grovel around in the backing test database to see if the job was done. If that was true, it would start working.
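The scanning half of that loop might look something like the following sketch. Again, this is an illustration and not the Offloader itself. It assumes each job's output lives in its own directory directly under one root, and that the directory name is enough to look the job up. `is_done` and `offload` are hypothetical callables passed in by the caller, so the database check and the actual copy stay out of the scanning logic.

```python
import os


def find_candidates(root, seen):
    """Return output directories under root that have not been handled yet."""
    fresh = []
    with os.scandir(root) as entries:
        for entry in entries:
            # Skip plain files and symlinks; only real directories count.
            if not entry.is_dir(follow_symlinks=False):
                continue
            if entry.name in seen:
                continue
            fresh.append(entry.name)
    return sorted(fresh)


def scan_once(root, seen, is_done, offload):
    """One pass: offload every finished job and remember that we did."""
    handled = []
    for name in find_candidates(root, seen):
        if not is_done(name):
            continue  # still running; look again on the next pass
        offload(os.path.join(root, name))
        seen.add(name)
        handled.append(name)
    return handled
```

A long-running server would simply call `scan_once` in a loop with a sleep between passes. Polling like this is cruder than a hook inside the framework, but it needs no cooperation from the framework at all.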

I got it going over the space of a couple of days and then just pushed it to the machine and turned it on. I didn't tell anyone about it. It Just Worked, so they noticed nothing. The jobs were moved to the storage fabric and everything else followed the new pointers, and everything was happy.
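One way to get that kind of transparency is to leave a symbolic link behind at the old location. That is an assumption for the sake of the example; the "pointers" could just as well have been rows in a database. The sketch below copies a directory to the archive, does a cheap sanity check, deletes the original, and then links the old path to the new copy. The function names and the `archive_root` parameter are hypothetical, and it assumes a POSIX-style filesystem where creating symlinks is allowed.

```python
import os
import shutil


def _listing(top):
    """Set of all relative paths (files and directories) below top."""
    out = set()
    for dirpath, dirnames, filenames in os.walk(top):
        rel = os.path.relpath(dirpath, top)
        for name in dirnames + filenames:
            out.add(os.path.normpath(os.path.join(rel, name)))
    return out


def archive_and_link(src, archive_root):
    """Copy src into archive_root, then replace src with a symlink to the copy."""
    src = os.path.abspath(src)
    dest = os.path.join(archive_root, os.path.basename(src))
    shutil.copytree(src, dest)  # raises if dest already exists
    # Sanity check before deleting anything: both trees hold the same paths.
    if _listing(src) != _listing(dest):
        raise RuntimeError(f"copy of {src} is incomplete")
    shutil.rmtree(src)
    os.symlink(dest, src)  # the old path now points at the archived copy
    return dest
```

Comparing path listings only catches missing files, not corrupted ones. Anything that matters should get a checksum comparison before the original is removed.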

At our next meeting, they asked for status on what was happening with my project. I said, "Oh, it's done. I wrote what I designed initially. It actually works, in contrast to the wild goose chase you sent me on. Also, it's been running for the better part of a week now, so you've been using it without realizing it." Then I just smirked.

They actually accepted this and went on with life. Amazing.

I'd like to say this was an isolated event, but it wasn't. If your team has to use this technique repeatedly to get anything done, it's terminally broken. It should be split up and the survivors should be dispatched to the four corners of the company.