
Saturday, March 31, 2012

Second-order problems

Have you ever been in a situation where the actual problem with what was going on... was that nobody else thought there was a problem? It sounds confusing at first, but follow along and it should become clear.

I had been given the job of making a certain test run in an automated fashion. It was some new directive from on high: about a dozen tests had been chosen as "the ones we would care about" for benchmarking, and they needed to be rigged to run by themselves. Each test was assigned three engineers, and I was one of the three on mine.

This software had an interesting pedigree. First, somewhere else in the company, there was a team which was responsible for writing the base product and pushing it into production. Then, somewhere downstream from them, some kind of lackeys got a hold of it and built "tests" around it. This usually amounted to a bunch of documentation and very little code.

We came in after that point. We had to take the output of that second stage and somehow make it run in an automated fashion. The idea was to have a test which was a stable baseline so that other things could be adjusted on a given system to see what sort of results it would yield. Is memory management scheme A or B better? Is kernel X faster than Y, or did it regress? Those were the questions they needed to answer.

I decided to take a look. These testing people had basically gotten very good at running things by hand. They had taken some of their notes and written up a wiki page describing how to do the whole thing. It was just a series of instructions you were supposed to follow like a trained monkey.

Here, have a banana.

The badness doesn't end there, though! Early in the documentation, it basically said this:

Sync your client to snapshot #1234, then check out everything in /some/path. Then run (this build command) and wait for it to finish.

To them, that was acceptable for a handoff to an automation team! They actually figured that having us sync to some arbitrary point in time and then compile a fresh copy of the service was reasonable. Obviously, this was crap. We needed to operate from a "blessed" snapshot to reduce the number of moving parts and start approaching some kind of consistency in our runs.
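Just to show what I was asking for, here's a rough sketch in Python of the kind of wrapper I had in mind. Every name in it is made up for illustration (the sync and build commands, the path, even treating snapshot #1234 as the blessed one); the point is simply that the runner only ever builds from one fixed snapshot, and it stops dead if anything in that process fails.

    #!/usr/bin/env python3
    """Hypothetical sketch: build the test binary from one pinned ("blessed")
    snapshot instead of whatever happens to be current in the tree."""

    import subprocess
    import sys

    BLESSED_SNAPSHOT = "1234"      # the one snapshot every run builds from
    CHECKOUT_PATH = "/some/path"   # stand-in for the real source location


    def run(cmd):
        """Run a command, echoing it first, and raise loudly on failure."""
        print("+ " + " ".join(cmd))
        subprocess.run(cmd, check=True)


    def main():
        try:
            # Sync to the blessed snapshot, never to "whatever is latest".
            run(["sync-client", "--snapshot", BLESSED_SNAPSHOT, CHECKOUT_PATH])
            # Build exactly what was checked out; no local edits, no surprises.
            run(["build-tool", "--release", CHECKOUT_PATH])
        except subprocess.CalledProcessError as err:
            print(f"build from snapshot {BLESSED_SNAPSHOT} failed: {err}",
                  file=sys.stderr)
            return 1
        return 0


    if __name__ == "__main__":
        sys.exit(main())

That gives you the same bits every single time, which is the whole point of a baseline.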

When I suggested this to them, they just gave me this look like a dog who's just been shown a card trick. We might have been speaking the same language, but my words had no meaning to them.

The documentation didn't get much better after that point. You were expected to run a whole bunch of commands and vary the sequence based on what you saw or what you wanted to test. All of these people had been doing it by hand most of the time, and a couple of them had written a handful of small shell scripts for individual parts of it.

Nobody had bothered to sit down and actually make a "hermetically sealed" version of the test. It should have been a proper snapshot of the code which was compiled to a binary and then packaged up. The test runner should have done all of the prep work necessary, and should have had exquisite error detection, handling, and recovery, plus enough logging to allow for troubleshooting after the fact.
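Since "exquisite error detection, handling, and recovery" can sound hand-wavy, here's a bare-bones sketch of that shape in Python. Every step name, script, and path is invented for illustration; what matters is the structure: the runner does its own prep, checks each step, logs everything to a file you can read afterward, and cleans up even when a step blows up.

    #!/usr/bin/env python3
    """Hypothetical sketch of a hermetic test runner: the steps, scripts, and
    paths are all placeholders; only the structure is the point."""

    import logging
    import subprocess
    import sys

    log = logging.getLogger("test-runner")


    def step(name, cmd):
        """Run one step of the test, logging the command and its outcome."""
        log.info("starting step %r: %s", name, " ".join(cmd))
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            log.error("step %r failed (exit %d): %s",
                      name, result.returncode, result.stderr.strip())
            raise RuntimeError(f"step {name!r} failed")
        log.info("step %r finished", name)
        return result.stdout


    def main():
        logging.basicConfig(filename="run.log", level=logging.INFO,
                            format="%(asctime)s %(levelname)s %(message)s")
        try:
            step("prep", ["./prepare-environment.sh"])   # placeholder prep work
            step("run", ["./run-benchmark.sh"])          # placeholder benchmark
            step("collect", ["./gather-results.sh"])     # placeholder results
        except RuntimeError:
            log.exception("run aborted; see the errors above")
            return 1
        finally:
            # Cleanup happens no matter what, so the machine is reusable.
            subprocess.run(["./cleanup.sh"], check=False)  # placeholder cleanup
        return 0


    if __name__ == "__main__":
        sys.exit(main())

None of this is rocket science, and that's kind of the point.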

So, to call back to my original premise, the first-order problem here was that this test was a big pile of garbage. The second-order problem is that nobody recognized it as such.

Think about it. If you wind up in a situation where you say "this is crap" (the first-order problem) and nobody else agrees because they have no skills, no taste, or just don't care, you now become the broken one in their eyes. They don't see anything wrong, so the fact that you're complaining about it must mean you are the problem.

This is how mediocrity propagates.

Last time I checked, they were still running that test by hand. If you ever wonder why your local microkitchen is out of bananas, well, now you know. The "testing by rote" monkeys are hard at work.