Software, technology, sysadmin war stories, and more.
Wednesday, February 22, 2012

Analysis paralysis kills a project

Maybe you've heard of analysis paralysis, but do you know what it looks like? I had heard that term a few times, but could never really put my finger on it. It had come up at places like a certain "leadership training" retreat I suffered through one summer, but I couldn't connect it with real life.

I couldn't, that is, until I tried to do something big on a broken team. Looking back at it later, I realized it was a classic case.

Following the "spinner" project, I set out to do something about the bad test infrastructure. It was responsible for most of the failures we were seeing. Put another way, we had tests that never even got a chance to start because the crufty and broken system would fall over before it ever launched the test itself. That's how bad it was.

I laid out my vision of the system and everyone else seemed to be on board. They previously had resisted such a design, but now there were security-related edicts coming down from high up the corporate food chain. One of these demands was that certain kinds of access would no longer be allowed on the network, and it so happened that our existing (bad) testing system relied on that access. To continue testing, we'd need to move it off our hacked-up oddball hardware and software and onto the standard corporate infrastructure.

I said that I would start on the part which actually installed the kernel and rebooted, and then the other people could work on figuring out where to store test results. At the time, we had been using MySQL, and the "dashboard" people were very tightly bound to it. They were afraid of going to anything else.

I didn't care what they did, as long as they did something. They had their task (create a result storage and retrieval system) and I had mine (create the kernel installation and rebooting system), so I went off and started working.

Time passed, and I started delivering pieces of my system. First I could load kernels, then I could reboot, and then I could do it with a separate controller. Then that controller could do it across a bunch of machines in parallel. Finally, that controller became more of a library so it would be available to other things like the "big brain" which would eventually run the whole test infrastructure.
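
To make that progression a little more concrete, here is a minimal sketch of what a controller like that might look like once it has turned into a library. This is not the original code. Every name in it (install_kernel, reboot, provision, provision_all) is made up for illustration, and the actual machine operations are stubbed out with strings, since the real mechanism isn't described here.

```python
from concurrent.futures import ThreadPoolExecutor


def install_kernel(machine, kernel):
    # Stub (assumption): a real version would copy the kernel to the
    # machine and set it up to boot next.
    return f"{machine}: installed {kernel}"


def reboot(machine):
    # Stub (assumption): a real version would reboot the machine and
    # wait for it to come back.
    return f"{machine}: rebooted"


def provision(machine, kernel):
    # The per-machine sequence: install first, then reboot.
    return [install_kernel(machine, kernel), reboot(machine)]


def provision_all(machines, kernel, max_workers=8):
    # Run the per-machine sequence on every machine in parallel and
    # return the steps performed, keyed by machine name.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {m: pool.submit(provision, m, kernel) for m in machines}
        return {m: f.result() for m, f in futures.items()}
```

The point of the library shape is that something higher up, like the "big brain", could just call provision_all(["host1", "host2"], "vmlinuz-test") and not care how any of it happens underneath.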

I checked in on the "storage and dashboard" guys. They were still going around and around in circles with the same old problems. When I actually paid attention, I realized they were all tied up with stuff like that stupid SQL binary search thing, and weren't even working on the new system!

Here they were just blowing days and days while trying to do something really idiotic on a system which was doomed by administrative decree. I had a whiteboard in our office which counted down the days until the hard cut-off by corporate for the old systems. Every day, I would update it to show the new number of days left. We had precious few weeks left, and they were still working on the old system? They could not possibly claim ignorance of the deadline. It was right there in black and white, visible to all.

[Image: countdown calendar]

You know the old saw about "rearranging deck chairs on the Titanic"? This was the closest you could get to it in modern times, and boy, were they good at it.

Pressing for details revealed a few things. Apparently they were convinced that Secret Sauce flavor B would "be too slow". Never mind the fact that it was used by the whole rest of the company to do really huge and scary things and was ridiculously fast in those contexts. They kept repeating this even though they had never used it themselves.

They'd flip back and forth between that, Secret Sauce flavor S, flavor M, and so on, all the while never picking one, so they never actually made any progress. This is what I would now consider to be analysis paralysis.

The way I had always heard it described, the "paralysis" came about as a result of being too afraid to choose something. That is, the fear was the first-order problem, and fixing that would somehow get things moving again.

With these guys, I now realize it wasn't quite that simple. There was something else going on. Maybe they were just afraid of the unknown, and leaving MySQL behind was super-scary for them. I mean, actually learning the corporate infrastructure that basically every other engineer knew about? What a concept! It's entirely possible they had reached the highest level they would ever perform at, and would never be able to deal with any more.

I also considered that they may have done this on purpose in an attempt to kill my project. This one is a pretty long shot, since there was no "third alternative" which would carry us past the corporate security deadline while still allowing testing to run in an automated fashion. If they were trying to set me up to fail, they were shooting themselves in the foot at the same time.

Finally, I wonder if they were just lazy. Sitting there in your rut and doing the same old LAMP crap over and over is simple. You don't have to stretch your brain or learn new things. You can just come in, day after day, and keep turning the crank. Once in a while, a food pellet appears, and you gobble it down. Then you go home and start again the next day.

I can't imagine living that kind of life. I'd go crazy.

I wound up working way too hard on that project and finally decided I would have no more of it. About a week before Thanksgiving, I declared that I was through with their laziness and brokenness and everything else, and would not do anything more to the project I had created. My position was simple: fire me if you want, but I'm not working on this with them any more.

I wound up transferring to another project which had its own problems, but that too is a story for another time.

As for the broken dashboard people? Well, they wound up being canned over the course of the next year or so. That half of the project no longer exists as far as I can tell. The other half (which I didn't describe here but had its own major issues) is still poking at it, as I found out a bit later.

I had a chance to talk to some insiders during a visit to the corporate campus lunch room some time ago. These are people who didn't even know I had quit, but were still willing to talk about things with me. Anyway, it seems that my former project is still nominally alive. It's also now the new thing which everyone hates. Whereas they used to hate the old system for being broken, they now hate the new system for being broken!

I should point out that I left the system in a state where it would grab machines, install kernels, reboot, start the one test I had ported to the new system, wait for it to finish, then clean up and start over. They managed to take this working system and turn it into a big pile of garbage just like they had broken the original system.

It was at this point that I realized something crucial: it doesn't matter how good the code is, or how good the foundation is. Broken people can ruin anything, and it's shocking just how fast it will happen.