Writing

Software, technology, sysadmin war stories, and more.

Wednesday, October 13, 2021

Looking at the outlying data points

One of the fun things I used to do at a prior job was to dig through the raw data points which were collected from the *many* servers we had. I could pick an interesting-sounding metric and ask the system to show me the top 10 or bottom 10 out of a population of thousands or more.
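For anyone who wants to try the same "top 10 or bottom 10" trick on their own data, here is a minimal sketch. It assumes the readings have already been collected into a plain mapping of host name to value. The function name, the host names, and the numbers are all made up for illustration and have nothing to do with the actual system from that job.

```python
import heapq


def top_and_bottom(readings, n=10):
    """Return (top, bottom): the n largest and n smallest (host, value) pairs."""
    items = readings.items()
    top = heapq.nlargest(n, items, key=lambda kv: kv[1])
    bottom = heapq.nsmallest(n, items, key=lambda kv: kv[1])
    return top, bottom


# Hypothetical population: 1000 hosts clustered between 40.0 and 43.0,
# plus one host that has been bumped well outside that band.
readings = {f"web{i:04d}": 40.0 + (i % 7) * 0.5 for i in range(1000)}
readings["web0123"] = 55.0

top, bottom = top_and_bottom(readings, n=3)
print(top[0])  # the single hottest (host, value) pair
```

Using `heapq.nlargest` and `heapq.nsmallest` avoids sorting the whole population when only a handful of entries at either end matter.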

Once in a while, something neat would jump out from the data. Most of the servers would have their values all clustered together, but then a handful of them would be "out there" to the point of being obvious at a glance. This is a story about one of those times.

To get my feet wet with the whole "investigations" thing that we'd do, I sat down one day and chose one of the many temperature sensors which were being logged by the servers. I forget exactly which one it was, for there were many. The CPUs logged it, the chassis logged it, the power supply logged it, and so on. There were so many choices.

Anyway, out of any given group of systems, some of them would be clearly hotter than the others. I don't remember exactly how much it was anymore, but I want to say it might have been as much as 10 degrees C hotter in some cases, which is quite a bit, particularly for something that's supposed to be tightly controlled. The thermal profiles of those machines were all supposed to be the same, so having some of them be way out there didn't make much sense.
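One simple way to pick out the "clearly hotter" machines in a group is to compare each reading against the group's median. This is only a sketch under assumptions: the 5 degree threshold is an arbitrary number chosen for the example, and the host names and temperatures are invented.

```python
from statistics import median


def hot_outliers(temps_c, threshold_c=5.0):
    """Return {host: degrees above median} for hosts more than threshold_c over it.

    threshold_c is an assumed cutoff for illustration, not a known spec.
    """
    mid = median(temps_c.values())
    return {
        host: round(t - mid, 1)
        for host, t in temps_c.items()
        if t - mid > threshold_c
    }


# Hypothetical group of supposedly identical machines (degrees C).
group = {"db001": 31.0, "db002": 30.5, "db003": 31.5, "db004": 41.0, "db005": 30.0}
outliers = hot_outliers(group)
print(outliers)
```

The median is used instead of the mean so that the hot machines themselves don't drag the baseline upward.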

I went looking for a reason. What could possibly explain this? First off, it didn't line up with any obvious pattern in the host names. The hosts were named for whichever team owned them, followed by a sequence number that was bumped any time a new one was provisioned. There was no signal to be found in that.

A teammate showed me how to see other interesting stuff about the physical hardware, like the actual location it had inside the data center suite (the actual room). One dimension was "row" and the other was "rack". Think of them as an (x, y) coordinate pair within a given space. Here, too, nothing obvious jumped out. The weird heat came from all over. It wasn't just a hot spot in the suite, in other words.

There was more, though. You could also see a "rack position" in this data, and now it started getting interesting. It seemed that all of the really hot machines had a rack position of something low, like 04 or 05, and never anything higher than that. In fact, when looking at a whole rack, the hot machine was always in the lowest-numbered position of all of the positions that were being used. Now this was worth chasing!
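That observation can be checked mechanically: for every hot machine, is it sitting in the lowest occupied position of its own rack? Here is a rough sketch, assuming the inventory is a mapping of host name to a (row, rack, position) tuple. That layout, the function names, and the sample hosts are hypothetical, not the real inventory format.

```python
def lowest_occupied_positions(inventory):
    """Map each (row, rack) to the smallest rack position in use there."""
    lowest = {}
    for row, rack, pos in inventory.values():
        key = (row, rack)
        if key not in lowest or pos < lowest[key]:
            lowest[key] = pos
    return lowest


def hot_hosts_at_lowest(inventory, hot_hosts):
    """For each hot host, report whether it is the lowest occupied slot in its rack."""
    lowest = lowest_occupied_positions(inventory)
    result = {}
    for host in hot_hosts:
        row, rack, pos = inventory[host]
        result[host] = pos == lowest[(row, rack)]
    return result


# Hypothetical inventory: host -> (row, rack, rack position).
inventory = {
    "web0001": (1, 3, 4),
    "web0002": (1, 3, 6),
    "web0003": (1, 3, 8),
    "db0001": (7, 2, 5),
    "db0002": (7, 2, 7),
}
check = hot_hosts_at_lowest(inventory, ["web0001", "db0001"])
print(check)
```

If every hot host comes back `True` while the rows and racks are scattered all over, the position within the rack is the signal and the location in the suite is not.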

I asked the people who were actually on site how the numbering worked: was it top-down or bottom-up? Top-down would imply that 04 or 05 was the topmost server and then maybe it was cold air sinking and warm air rising? But that made no sense, since these things were supposed to be FORCING air from front to back and not leaving it to chance. There should have also been a gradient across the entire rack, and there was not one to be seen. It was just this one position that was much hotter, and the rest were all pretty "chill" (sorry).

It's good that I asked, since they told me something surprising: position 04 or 05 or whatever was the *bottom* of the rack. Yes, somehow, the lowest machine in the rack was the hottest! They found this interesting, too, and wound up digging into it. Given that I was far away in a distant office, this was the best way for us to proceed.

The Actual Engineers came back a while later with the answer: it seemed that the server design included this assumption that there would be another server underneath it. That way, the bottom of any given server's tray would be "sealed" by something else underneath it. Do you see the problem yet?

The problem was that the bottommost server would have nothing below it but just open space until it got to the floor. That space would not move air efficiently, and it obviously mattered.

The solution was great: they came up with something that looked like a server for airflow purposes, but which didn't actually do anything. It was basically two very big paperboard pizza boxes side by side. They would then slip that into the rack right below the last server, or into the gap left behind any time they pulled a server from somewhere higher in the stack. This would keep the airflow doing its thing happily for whatever was right above it.

This rolled out and my hot spots disappeared. Fantastic!