The day when starting a receiver fixed the transmitter
Have you ever tried to do something, but had it fail and weren't really sure why? Did you then try to fall back to doing something you could actually measure in order to then get a handle on the problem? I had something like this happen quite a while back with some software defined radio stuff. Here's how it went.
There are a bunch of interesting chips on the market which do this stuff now, and a bunch of companies have sprung up which do the integrator work to sell it as a product. They put it in a box, (maybe) get the FCC type acceptance, do emissions testing, advertise it, sell it, and hopefully make money.
Not all of them, however, are created equal. Even two devices built around the same chip can have very different performance. What people sometimes forget is that the software is just as important as the hardware. Just like you should be afraid if you ever see me with a screwdriver, you should also worry about hardware people hacking on drivers. You can usually tell when it's happened, and it's not pretty.
So I had this unnamed device and was asked by some friends to exercise it. They wanted me to try to show off its capabilities, like, oh, say, can you run an 802.11 network with a software defined radio? The code exists to do it, but the question is: can you hook it up to this product and make it go? My goal was to create a wifi network WITHOUT using a purpose-made wifi chipset. Easy, right?
After some initial problems involving the driver and the SDK not quite being on the same page, I got something working, and sure enough, it was able to transmit a nice narrow test frequency. I could set it up on an unlicensed band and pipe audio through it, and it would come out of the speaker on my regular old scanner radio. I did dial tones, white noise, dogs barking, and all of that stuff. I did a mix of them. It was all fine.
Now that I knew the transmission path worked, it was time to hook it up to the 802.11 software. Fortunately, this stuff already exists as a number of easily-obtained libraries, so once the transmitter worked, it should have been a matter of plugging it together. It didn't work. At no point did I manage to get it to generate a network that any other wifi device could see.
There was much disappointment to be had. Was it the hardware? The driver? The SDK? This random library I found? I had to isolate some things, and so I went out and bought a different kind of SDR hardware which was also able to transmit this kind of wideband signal and switched things around to run through that instead. It worked fine, and I could see the phony network right next to the real ones.
Over the next several days I tried all kinds of things to see if I had screwed something up in the transmitter on the test rig, but never managed to figure it out. While the dial tone, white noise and dog barks worked before, they were all narrow-band, and this wifi stuff was 20 MHz wide. Maybe that was it? Maybe this thing couldn't really do wideband transmissions? How would I find out?
The answer turned out to be the broadcast FM band as the source, and the unlicensed part of the 900 MHz ISM band as the destination. I recorded about 10 seconds of the ENTIRE FM band to a ramdisk, then turned around and pushed it back out in the unlicensed band on a continuous loop. This actually worked! I started hearing the 10 second loops of the stations at the right offsets in the 900 MHz band.
But... something was very very wrong. I mean, sure, the songs were playing, but they sounded all messed up. I've heard Champagne Supernova by Oasis a bunch of times, and I know for sure that they don't suddenly start warbling and woobling and wobbling while changing pitch AND tempo! Still, that's exactly what happened with that and everything else which was in that 10 second snippet of the band. No matter where I tuned, I heard things shifting this way and that.
Now I had a second problem on my hands: was it the recording itself? To answer that question, I took the file and washed it through a bunch more SDR gunk to isolate a single station, demodulate it to audio, and pushed it to my machine's speakers. The song sounded fine. Other stations were also okay.
I took my second wideband SDR and had it do the transmissions. It was fine. So now I knew it wasn't the recording (the massive file in the ramdisk), it wasn't the machine being unable to keep up with it somehow, and it wasn't the scanner.
So what was it? The problem eluded me further. I think I went to lunch to ponder things, and I guess the cheeseburger helped, since things changed that afternoon.
At some point, I did something ridiculous: on the same box that was driving the transmitter, I turned on the second SDR as a receiver while the first one was still in transmit mode. The idea was to use the receiver as a crappy spectrum analyzer to maybe see what was happening with the transmissions from the first one. It seemed like a long shot that this would even work, since it might overload the box and then nothing would happen. Or maybe it would overheat the machine or something?
The problem is that pushing 20 MHz of 16 bit I/Q data takes a whole bunch of USB bandwidth, and then I wanted to run TWO of them at the same time on the same box? I figured, eh, maybe it'll work. Maybe it has enough bandwidth and CPU time to go around. It'll save me from having to track down a second box at the lab to split up the work.
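For a sense of scale, here is the back-of-the-envelope math as a small Python sketch. It assumes a 20 MS/s complex sample rate (one complex sample per Hz of bandwidth) and ignores USB protocol overhead. The function names are made up for illustration.

```python
# Rough throughput for a 20 MHz wide stream of 16-bit I/Q samples.
SAMPLE_RATE_HZ = 20_000_000   # assumption: 20 MS/s complex for 20 MHz of bandwidth
BYTES_PER_COMPONENT = 2       # 16 bits each for I and Q
COMPONENTS_PER_SAMPLE = 2     # one I value plus one Q value


def stream_bytes_per_second(sample_rate_hz=SAMPLE_RATE_HZ):
    """Raw payload rate for one radio, before any USB overhead."""
    return sample_rate_hz * COMPONENTS_PER_SAMPLE * BYTES_PER_COMPONENT


def capture_size_bytes(seconds, sample_rate_hz=SAMPLE_RATE_HZ):
    """Size of a raw recording of the given length at that rate."""
    return seconds * stream_bytes_per_second(sample_rate_hz)


one_radio = stream_bytes_per_second()
print(f"one radio:    {one_radio / 1e6:.0f} MB/s")
print(f"two radios:   {2 * one_radio / 1e6:.0f} MB/s")
print(f"10 s capture: {capture_size_bytes(10) / 1e6:.0f} MB")
```

Under those assumptions, one radio needs 80 MB/s and two need 160 MB/s. A 10 second capture in the same format would come to about 800 MB.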
So, I plugged in the second SDR, hacked up a quick flowgraph to turn on its receiver and graph what it was seeing, and pushed the button to kick it off. And that's when it got even weirder.
As soon as the receiver flowgraph came up, the transmissions became PERFECT. The terrible wobbly noises that had been infecting my scanner this whole time disappeared, and the song stayed on-tempo, on-key, on-everything. I changed the scanner to other stations in the recording. They were also fine. The whole thing sounded great.
WTF? Somehow, starting a RECEIVER fixed things?! I stopped the receiver and the music went stupid again. Then I restarted the receiver and it went back to being good. I could repeat this over and over and it would always track with the receiver being on.
I'm not quite sure exactly what sort of free-association happened in my head at this point, but for some reason I got to thinking "hmm, I wonder how busy the CPUs are" during all of this. Maybe it was from times I had tangled with "turbo" turning on and off at the place with all of the cat pictures. In any case, I rigged up something to watch the frequency values from /proc/cpuinfo with a fast update rate, and noticed that under extreme load, they all stayed about the same and the music sounded fine.
However, as soon as I stopped the receiver and the music went strange, the CPU frequencies were all over the place. Also, it's really hard to say for sure, but it sure seemed like the swings I was hearing from the speaker were matching the shifting values on the screen.
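A frequency watcher like that doesn't need to be fancy. The following is a minimal sketch of the idea, not the original tool: it polls /proc/cpuinfo and prints the "cpu MHz" value for every logical CPU on one line. The function names and the 100 ms interval are assumptions.

```python
import time


def parse_cpu_mhz(text):
    """Return the 'cpu MHz' value for each logical CPU found in cpuinfo text."""
    freqs = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "cpu MHz":
            freqs.append(float(value))
    return freqs


def watch(path="/proc/cpuinfo", interval=0.1, iterations=None):
    """Print one line of per-CPU frequencies every `interval` seconds.

    Runs forever when iterations is None; stop it with Ctrl-C.
    """
    count = 0
    while iterations is None or count < iterations:
        with open(path) as f:
            freqs = parse_cpu_mhz(f.read())
        print(" ".join(f"{mhz:7.0f}" for mhz in freqs), flush=True)
        time.sleep(interval)
        count += 1
```

Calling watch() with no arguments loops until interrupted. The "cpu MHz" field is what x86 Linux kernels report; other architectures may not have it, in which case the lines come out empty.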
To be sure it was the CPU load and not anything to do with USB or some weird behavior of the radio software, I cranked up enough non-SDR dummy load jobs to keep it busy. It stopped going into turbo land and all of the CPUs stayed at the same frequency, and sure enough, the music evened out.
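A dummy load for this kind of test can be as dumb as one busy loop per CPU. This is a sketch of that idea with hypothetical names (spin, run_dummy_load), and not necessarily how the original load jobs were built.

```python
import multiprocessing
import os
import time


def spin(stop_at):
    """Burn CPU until the wall clock reaches stop_at (a time.time() value)."""
    counter = 0
    while time.time() < stop_at:
        counter += 1
    return counter


def run_dummy_load(seconds, workers=None):
    """Keep `workers` processes (default: one per logical CPU) busy for `seconds`."""
    workers = workers or os.cpu_count() or 1
    stop_at = time.time() + seconds
    with multiprocessing.Pool(workers) as pool:
        return pool.map(spin, [stop_at] * workers)
```

Something like run_dummy_load(60) would keep every core pegged for a minute. Separate processes are used because threads in CPython would mostly take turns on one core. On platforms that spawn rather than fork, the call should sit under an `if __name__ == "__main__":` guard.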
This was a tremendous lead, but I needed more. The next step was to twiddle my kernel settings to turn off intel_pstate. On the next boot, I was able to start messing with the CPU idle states manually, and before long it was clear: idle state C3 was causing it. If I didn't let the CPUs drop into that state, things wouldn't go crazy. The box would probably run a lot hotter, so it wasn't a good place to leave things, but now I had a smoking gun I could send back to the radio people.
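On Linux, the per-CPU idle states show up under sysfs, and each one has a "disable" file that root can write to. Here is a hedged sketch of poking at them from Python. The helper names are invented, and the state name to match ("C3" here) is an assumption, since the exact strings vary by idle driver and CPU.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")
STATE_GLOB = "cpu[0-9]*/cpuidle/state[0-9]*"


def idle_states(cpu_root=CPU_ROOT):
    """Yield (cpu, state_dir, state_name, disabled) for every cpuidle state."""
    for state_dir in sorted(Path(cpu_root).glob(STATE_GLOB)):
        name = (state_dir / "name").read_text().strip()
        disabled = (state_dir / "disable").read_text().strip() == "1"
        yield state_dir.parts[-3], state_dir.name, name, disabled


def set_state_disabled(state_name, disabled=True, cpu_root=CPU_ROOT):
    """Set the 'disable' flag on every state with this name.

    Needs root on a real system. Returns how many states were changed.
    """
    changed = 0
    for state_dir in Path(cpu_root).glob(STATE_GLOB):
        if (state_dir / "name").read_text().strip() == state_name:
            (state_dir / "disable").write_text("1" if disabled else "0")
            changed += 1
    return changed
```

Listing idle_states() first shows which names the machine actually uses. After that, set_state_disabled("C3") as root would keep the CPUs out of that state until it is re-enabled or the box reboots.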
I moved on to other things and never really found out what happened after that. I have to assume that they were doing something very strange in terms of timing that was super sensitive to the CPU frequency at any given moment. Perhaps they read the frequency once and calculated all of their delays relative to that? Of course, the very act of doing that much work on the machine would then cause the frequencies to change, and so their delays would change too, and the whole thing would collapse.
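To make that guess concrete, here is a toy model of it. This is purely an illustration of the hypothesis and not anything known about the actual driver. It assumes the sample pacing is a delay sized in CPU cycles for one clock speed, so the real output rate scales with whatever the clock is doing right now. The CPU frequencies below are made-up numbers.

```python
import math


def effective_sample_rate(nominal_rate_hz, calibrated_cpu_mhz, actual_cpu_mhz):
    """Output rate if delays were sized in cycles for calibrated_cpu_mhz
    but the CPU is really running at actual_cpu_mhz."""
    return nominal_rate_hz * actual_cpu_mhz / calibrated_cpu_mhz


def pitch_shift_semitones(rate_ratio):
    """Musical pitch change when audio plays back at rate_ratio times normal speed."""
    return 12 * math.log2(rate_ratio)


nominal = 20_000_000          # 20 MS/s, as in the wideband test
calibrated_mhz = 3000.0       # hypothetical: clock speed when the delays were computed
for actual_mhz in (3000.0, 2400.0, 1200.0):   # hypothetical clock speeds later on
    rate = effective_sample_rate(nominal, calibrated_mhz, actual_mhz)
    ratio = rate / nominal
    print(f"{actual_mhz:6.0f} MHz -> {rate / 1e6:5.1f} MS/s, "
          f"{pitch_shift_semitones(ratio):+6.2f} semitones, "
          f"10 s loop takes {10 / ratio:5.2f} s")
```

In this model, any change in clock speed moves pitch and tempo together, which would at least be consistent with what came out of the scanner.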
If it couldn't handle Oasis, I guess it's no surprise it couldn't handle wifi.