Writing

Software, technology, sysadmin war stories, and more.
Tuesday, March 12, 2013

Mysterious workstation death

I don't have the best luck with computer hardware. I already told the tale of trying to upgrade to a smaller, quieter system and all of the craziness which went with that. What I haven't yet told is what happened back in the days of that older, bigger, noisier box.

It started out like any other Sunday morning. I had just parked myself in front of my workstation to check for mail and say hi to any friends who were online. I started typing out a command and then my monitor went to sleep all by itself. It came back on a second later just in time to see the usual reboot sequence starting. I had no idea what was going on, and just watched it.

The usual BIOS stuff played out and then it went to wake up the SCSI adapter and scan the chain, at which point it rebooted again. Now I knew something was definitely going on. The first time could have been explained by the OS going crazy or someone breaking into the box and commanding it to reboot, but rebooting ... during a reboot? That had to be something else.

As I watched, it did it again and again, and once, it even ejected the CD from my CD-R. This was an older drive which used a caddy, so this involved a deliberate "whirrr-THUNK-CLICK!", followed by the caddy sticking out of the front of the case. This made no sense.

I turned the box off and turned it back on, just to see what happened. It did the same thing. It still found all of the SCSI devices, but fell over every time the scan finished. I cracked the lid of the machine to have a look inside. I found something which resembled dust, but it was thicker than usual. It wasn't going away when I blew into the machine. I had to actually poke it with something substantial to make it move.

I figured maybe the machine had inhaled too much dust while sitting on the carpeted floor, and that had caused something to overheat. It was plausible, at least. I decided to clean it up and start swapping parts around to see if I could make the problem follow any one part. First, since the SCSI chain seemed to be so closely correlated with the problem, I pulled out the adapter and swapped it for an older one I had on a shelf.

This seemed okay, and I managed to boot off a floppy which had support for that other adapter. That let me back into the box, and I recompiled my usual kernel to make it support this temporary adapter. That went just fine, but when I went to run lilo, it fell over again. Now I didn't know what to do.

Maybe it was memory. I started up memtest86 and let it go. A bit later, I looked back, and it had detected an error. I only had a single stick of RAM in the thing, so it's not like I could do much about it. I dropped into the BIOS settings for the memory. There were all of these crazy settings like CAS and CL, and I thought I might have had a mismatch with whatever memory I had given this thing. With it set back to "BIOS defaults", it managed to run a pass without failing.

I went away for a while to have dinner, and by the time I returned, it had failed yet another pass even with these new settings. Then it failed again. I kept fiddling with settings, slowing things down more and more. I didn't have a separate RAM tester, so what now?

Well, I still had my multimeter. I put it across one of the drive connectors and found I had a nice 5.00 volts on the +5 line and ~11.9 volts on the +12 line, at least until the SCSI adapter came up. Then it would drop to ~11.8 volts and stay there until it crashed, at which point it would return to ~11.9 volts. Of course, I had never taken readings when the machine was healthy, so I couldn't say whether this was normal or not. It was data, but it didn't necessarily mean anything.
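
As a rough sanity check, readings like these can be compared against the +/-5% tolerance commonly cited for ATX supply rails. That tolerance figure is an assumption brought in for illustration, not something measured on this machine, and the helper name `within_tolerance` is made up for this sketch. A multimeter also only shows an averaged DC value, so a pass here says nothing about ripple or brief dips.

```python
# Assumed tolerance: +/-5% of nominal, the figure commonly cited for ATX rails.
TOLERANCE = 0.05


def within_tolerance(nominal, measured, tolerance=TOLERANCE):
    """Return True if measured is within +/- tolerance (a fraction) of nominal."""
    return abs(measured - nominal) <= nominal * tolerance


# (label, nominal volts, measured volts) taken from the readings in the text.
readings = [
    ("+5 V", 5.0, 5.00),
    ("+12 V before SCSI adapter comes up", 12.0, 11.9),
    ("+12 V with SCSI adapter up", 12.0, 11.8),
]

for label, nominal, measured in readings:
    deviation_pct = (measured - nominal) / nominal * 100
    status = "within" if within_tolerance(nominal, measured) else "outside"
    print(f"{label}: {measured:.2f} V ({deviation_pct:+.1f}%), {status} assumed tolerance")
```

Under that assumption, all three readings sit comfortably inside the band, which fits the conclusion that the numbers alone didn't point anywhere.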

After seven hours of fiddling with this ridiculous thing, I finally punted. I had "desktop publishing" type work to do and obviously couldn't rely on this machine. I had a laptop which could be connected to an external monitor and keyboard. It was annoying to use for this sort of work, but it would get the job done. My broken box would have to wait.

Running without a case

Over the next day or two, I tried all kinds of nutty things, including even running all of the parts without a case. I wanted to see if there was some kind of grounding anomaly or something shorting against the case or whatever. It didn't help.

I had one other machine which had compatible but slower memory (PC100 vs. PC133). It was my only source of alternative memory at the time, so I took down the other box and tried it. That didn't help. I think I tried another power supply, too, and that didn't help, either. Now I had a pretty good list of things which had been tried and didn't help.

Swapping the memory didn't help. Removing all of the add-on cards which could be removed and replacing those which couldn't (like the video card) didn't help. The problem never followed any part to another machine. The only things I hadn't been able to change were the CPU and motherboard. It had to be one of them, or maybe both. Nothing else could reasonably be blamed for this.

A new CPU was several hundred dollars while a new motherboard was only about $100, so the solution was obvious. I ordered a similar board which would accept my existing CPU, paid for overnight shipping, and had it at my doorstep about 24 hours later. I installed the new board and all was well. It never crashed again, so the CPU was obviously fine. It must have been the motherboard.

...

A few months passed. Then, one day, there was a story on Slashdot about a capacitor epidemic. A bunch of Taiwanese-sourced parts were going bad -- some actually exploded -- and they were taking out motherboards. I went back into my files and took a closer look at the pictures I had taken that morning when the machine first died. I noticed an interesting "dot" at the top of one of the capacitors in that picture:

[Image: odd-looking capacitor, cropped version]

That seemed odd but it wasn't a sure thing. I didn't know if capacitors were supposed to look like that or not. Then, one day months after that, I happened to find a picture I had taken inside the machine long before that point, and it happily included that same area of the board.

[Image: normal-looking capacitor]

Now it all made sense. In this picture from a year earlier, there was no "dot" on top of that capacitor. Finally, I could be reasonably sure of what had happened, and why my machine had suddenly checked out that one morning. It didn't change anything, but at least I had some closure.

Computers: what an incredible time sink.