
Thursday, February 9, 2012

Driving automated tests with the "spinner"

A couple of years ago, I found myself on a team which was responsible for testing a bunch of software in an automated fashion. It had to take the latest releases of various flavors of the Secret Sauce and see how they performed against the latest builds of the Linux kernel. It also had to do this across a few types of systems, since not all hardware is created equal. Finally, there were a bunch of different tests which could be run, each checking different parts of the Secret Sauce Stack.

At the time, automated testing was largely a function of a cron job. Once a night, this thing would wake up and launch a bunch of tests from a static list. I think it was hard-coded to look for one released kernel and one "latest build from HEAD" kernel for each of the tests. These tests would all get queued up, and they would eventually be started by the test infrastructure.

On the surface, this looked fine. In theory, every combination of things that we wanted to look at was there. In practice, however, it was a complete disaster.

The testing infrastructure and most of the actual tests were written terribly. There was this apparent desire to slap things together in the quickest and dirtiest fashion: something that might work if everything was just right, but would fail miserably if anything was even slightly out of spec. This would lead to endless test failures until someone manually cleaned things up or adjusted the test in strange, magical ways.

Some of these tests might have run and worked one time in 20. With each one only being attempted once per day, you're pretty soon looking at a LONG time between usable results - on the order of weeks. Also, with so few runs, it's difficult to characterize the failures well enough to even figure out where to begin fixing the mess.

In addition to all of this, the test machines would basically sit there doing nothing the rest of the day. Once the tests had all been attempted (note: not necessarily succeeded) for the day, the machines would just hang out until cron ran again the next day.

Someone initially tried to improve on this situation by writing a "spinner" which would check on a given group of machines and would then schedule another test if they seemed idle. This wasn't the greatest implementation, since it would be starved out if something else got to the machines first.

Worse, let's say you "locked" the machines such that no new jobs could begin on them. This might happen if you were working on them manually. They'd obviously be idle, so this first version of the spinner would wake up every 15 minutes and schedule another job!

Then, when you unlocked those machines, you had this *huge* backlog of jobs which had to either be worked through or manually cancelled before the machines would get back to doing the things other people had requested.

As you can see, it was like a three-ring circus, only instead of clowns, we had programmers running around being goofy.

I decided to turn this around. Once some administrative matters had been dealt with to remove certain roadblocks to actually building useful software, I started writing a "real" spinner. It had the same nominal purpose, which was to run tests and keep machines busy, but it would not create an endless backlog, and it would try to be smart about what it asked for.

It worked based on a bunch of grids. Each test had groups of machines where it was able to run, and kernels which needed to be tested. For each group of machines, it would try to get all of the kernels up to 1 run - whether pass or fail. Then it would try to get them all up to 2 runs, then 3, then 4, and so on. It had a configurable window of time, and initially all of this was "same day", so it would reset the counters to zero at midnight.
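
In Python, the core of that loop looked something like this. The names here (pick_next_kernel, run_counts) are made up for this post; the real spinner kept these counters per test and per machine group.

    # Sketch only: pick whichever kernel has the fewest runs so far in the
    # current window, so everything gets to 1 run, then 2, then 3, and so on.
    def pick_next_kernel(kernels, run_counts):
        return min(kernels, key=lambda k: run_counts.get(k, 0))

    # At the start of each window (originally: at midnight), the counters go
    # back to zero and every kernel competes from scratch again.
    def reset_window(run_counts):
        run_counts.clear()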

It also kept track of the work it had requested. Every job had a unique identifier, so it kept that in local storage. Then it would check back to see if it had finished. Once a given job finished for that group of machines in that test, it would then request another one. This meant the maximum backlog for any group of machines which my spinner would create was one job.
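
The bookkeeping was about as simple as it sounds. Roughly, and with job_finished() and submit_job() standing in for whatever the real infrastructure exposed:

    # Sketch of the "at most one outstanding job per test and machine group"
    # rule. These two are placeholders for the real infrastructure's calls.
    def job_finished(job_id):
        raise NotImplementedError  # ask the test system whether this job is done

    def submit_job(test, group, kernel):
        raise NotImplementedError  # queue one run and return its unique id

    outstanding = {}  # (test, machine group) -> id of the last job we requested

    def maybe_schedule(test, group, kernel):
        job_id = outstanding.get((test, group))
        if job_id is not None and not job_finished(job_id):
            return  # our last request is still queued or running: do nothing
        # The previous job is done (or there never was one), so ask for
        # exactly one more. The backlog can never grow past a single job.
        outstanding[(test, group)] = submit_job(test, group, kernel)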

Later, I added more logic to it. Some kernels were more interesting than others, and should run more often overall. I wound up expanding the window of time to a week (168 hours), and then gave different kernel versions their own minimum and/or maximum counts. Here's how that worked.

If a kernel had a minimum number of runs configured, then it would not be "satisfied" until it had gotten that far. Likewise, if it had a maximum run count set, then it would be considered "saturated" once it reached (or exceeded) it. These two settings weighted the choice of which kernel to test next.

First, any kernel which had a minimum number and hadn't yet reached it would be considered. Within that group of kernels, it would try to get all of them to 1 run, then 2, then 3, and so on. Next, it would consider any other kernel as long as it hadn't reached its saturation point.
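
Put together, the choice of what to run next looked roughly like this (again with made-up names, and with "runs" counting the runs inside the current window):

    # Sketch of the two-tier weighting. A kernel with no maximum configured
    # can always run; a kernel with a minimum is preferred until it gets there.
    def choose_kernel(kernels, runs, minimum, maximum):
        unsatisfied = [k for k in kernels
                       if runs.get(k, 0) < minimum.get(k, 0)]
        if unsatisfied:
            # First tier: anything still below its minimum, fewest runs first.
            return min(unsatisfied, key=lambda k: runs.get(k, 0))
        unsaturated = [k for k in kernels
                       if k not in maximum or runs.get(k, 0) < maximum[k]]
        if unsaturated:
            # Second tier: anything that hasn't hit its cap yet.
            return min(unsaturated, key=lambda k: runs.get(k, 0))
        return None  # everything is saturated; nothing to request right now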

We were able to use this so that fresh kernel releases would run immediately: a brand-new version number is at 0 runs by definition, so it lands in the first tier of the first group of kernels.

This also let us run the "old workhorse" kernels periodically, like once a week or so, without having them starve out the other kernels. You generally try to run the old releases once in a while to keep tabs on your baseline. In theory, it shouldn't change, given it's the same test, same kernel and same hardware, but in practice, there are other moving parts of the ecosystem which you have to account for.

Finally, there were kernels which would always be able to run. This let those systems stay busy even when everything else was already handled. Instead of having them just sit there, burning power and otherwise doing nothing, they would at least be able to contribute extra runs of the test, which was also a way of stress-testing the test itself and the test infrastructure.
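
For a week-long window, the per-kernel settings might have looked something like this. The versions and numbers here are invented for illustration; they're not the real config.

    # Hypothetical settings for a 168-hour window.
    minimum = {"3.3-rc1": 1}        # fresh release: get at least one run right away
    maximum = {"2.6.32-stable": 1}  # old workhorse: keep it to about once a week
    # Kernels with no maximum at all are the "always able to run" filler that
    # keeps otherwise-idle machines busy and stress-tests the tests themselves.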

Bringing it back full circle, all of this activity meant there were far more test runs to analyze. I was able to get meaningful statistics about what makes test runs fail, and a bunch of them pointed at horrible coding and design practices. This basically dictated what I should try to fix next, but there were more administrative demons in the way.

I'll have to talk about trying to improve on the infrastructure another time. It's not a nice story, and this one is long enough already.