Software, technology, sysadmin war stories, and more.
Wednesday, April 18, 2018

Given enough machines, you too may find a processor bug

In my years of tending these electronic flocks of sheep, I've seen a couple of weird scenarios which come down to issues in the actual CPU. I'm talking about the "big chip" which comes from Intel or AMD, or in years long past, Cyrix and friends. People don't usually talk about this sort of thing too much for whatever reason.

But, the truth is clear: if you have enough machines in your control, eventually you will find one or two that have something just slightly off in their processors. I mentioned a case where a certain bit pattern, when handed to exp(), would start growing in the wrong direction, becoming more and more negative. There have been machines where 10^n resulted in something other than a number that started with 1 and ended with a bunch of 0s.

There have been "sticky bits" which got stuck on when they should have been off, and "weak bits" which should have been on but were found to be off. The means of detection is varied. A lot of it comes down to looking for low-level oddities in the software stack, and then tracking it back to the source system.

I did some of the digging myself on some of these, and just contributed some advice on others. It's the advice I gave some other folks that I would like to share with the world here.

Basically, multi-CPU machines are the norm now. You might have multiple packages on the board, which is to say actual distinct chips in sockets. Each one of those might have multiple cores on board, and each core might have multiple threads (as in hyperthreading). Odds are, if you really have found a "CPU bug", it will be limited to a single core.

How do you verify this? Easy: use something like 'taskset'.

Let's say you have a program that will sometimes reproduce the failure, but not always. Maybe you run it 100 times on the machine and it only blows up on half of the attempts. To me, that would suggest that perhaps you have two cores, and one of them is bad. The way you find out is to deliberately limit your program to just one "CPU" *as Linux sees it*, and run it again to see if it reproduces.

The 'taskset' tool can run as a wrapper: tell it the CPU affinity you want and the command to run, and it'll start it for you. For people who'd rather not deal with a silly wrapper program, note that sched_setaffinity() is what actually does the magic, and you can call that directly from inside your program if you like.
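As a rough illustration of the pin-and-retry idea, here is a minimal Python sketch. It assumes Linux, where Python exposes the same system call as os.sched_setaffinity(). The function name failures_per_cpu and its parameters are made up for this example, and it assumes your repro program signals failure with a nonzero exit status. The wrapper equivalent for a single CPU would be something like 'taskset -c 3 ./repro', where ./repro stands in for your own program.

```python
import os
import subprocess
import sys


def failures_per_cpu(argv, runs=20, cpus=None):
    """Run argv `runs` times pinned to each CPU and count nonzero exits.

    Hypothetical helper: assumes the repro command exits nonzero on failure.
    """
    original = os.sched_getaffinity(0)  # CPUs this process may use right now
    if cpus is None:
        cpus = sorted(original)
    counts = {}
    try:
        for cpu in cpus:
            # Pin ourselves to one "CPU" as Linux sees it. Child processes
            # started after this point inherit the same affinity mask.
            os.sched_setaffinity(0, {cpu})
            failed = 0
            for _ in range(runs):
                result = subprocess.run(
                    argv,
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL,
                )
                if result.returncode != 0:
                    failed += 1
            counts[cpu] = failed
    finally:
        # Put the original mask back no matter what happened above.
        os.sched_setaffinity(0, original)
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    for cpu, failed in failures_per_cpu(sys.argv[1:]).items():
        print(f"cpu {cpu}: {failed} failures")
```

A CPU number that fails most of the time while its neighbors stay at zero is the kind of pattern worth chasing.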

Note there's a bit of a terminology issue here. Linux thinks of every single (hyper-)thread as a "CPU" or "processor", and that's why you have that number of entries in /proc/cpuinfo. The numbers you pass to taskset operate in that same range.

For instance, my machine right now has a single chip in it, which then has four cores, and each of those has two threads. 1 x 4 x 2 is 8, and sure enough, in /proc/cpuinfo it goes from 0 to 7. Machines vary, so be sure to know your hardware and test the permutations which apply there!
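If you would rather not eyeball /proc/cpuinfo, something like the following Python sketch can pull out the mapping. It assumes the x86 Linux layout, where each record is separated by a blank line and carries "processor", "physical id" and "core id" fields. The function name parse_cpuinfo is just for this example, and the fallbacks for missing fields are my own guess at a sensible default, not anything the kernel promises.

```python
import os


def parse_cpuinfo(text):
    """Map each Linux "processor" number to a (physical id, core id) pair."""
    topology = {}
    current = {}
    # The extra empty string makes sure the last record gets flushed.
    for line in text.splitlines() + [""]:
        if not line.strip():
            number = current.get("processor", "")
            if number.isdigit():
                cpu = int(number)
                # Assumption: if a field is missing, treat the machine as one
                # package and give each CPU its own core.
                package = int(current.get("physical id", 0))
                core = int(current.get("core id", cpu))
                topology[cpu] = (package, core)
            current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    return topology


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        for cpu, (package, core) in sorted(parse_cpuinfo(f.read()).items()):
            print(f"cpu {cpu}: package {package}, core {core}")
```

On a 1 x 4 x 2 machine this should list eight CPUs, with each (package, core) pair showing up twice.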

Anyway, let's say you run your repro program and it always blows up on some "CPU" (thread) numbers, and never on others. The reason might seem confusing until you look at /proc/cpuinfo and notice the "core id" field. I bet you'll find that all of the bad threads are actually on the same core (and by extension, the same package). Naturally, you'll want to run this across the rest of your fleet to see if any of them have single cores which are similarly affected.
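To make that grouping step concrete, here is a small Python sketch. Everything in it is hypothetical: the function name failing_cores, the topology table (CPU number to (physical id, core id), laid out so that CPU n and CPU n+4 share a core), and the failure counts are invented for illustration, not taken from a real machine.

```python
from collections import defaultdict


def failing_cores(topology, failures):
    """Group the CPUs that saw any failures by their (package, core) pair."""
    by_core = defaultdict(list)
    for cpu, failed in sorted(failures.items()):
        if failed > 0:
            by_core[topology[cpu]].append(cpu)
    return dict(by_core)


# Made-up example: one package, four cores, two threads per core.
example_topology = {cpu: (0, cpu % 4) for cpu in range(8)}

# Made-up failure counts per CPU, as a pinned repro loop might report them.
example_failures = {0: 0, 1: 0, 2: 0, 3: 17, 4: 0, 5: 0, 6: 0, 7: 19}

if __name__ == "__main__":
    for (package, core), cpus in failing_cores(example_topology, example_failures).items():
        print(f"package {package}, core {core}: failing CPUs {cpus}")
```

If every failing CPU lands under a single (package, core) key, that lines up with the one-bad-core theory. Failures scattered across several cores would point somewhere else.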

If you have a rock-solid repro case, you might get the manufacturer to take notice. If not, at least you'll have an interesting story and a new ornament for your keychain, right?