Software, technology, sysadmin war stories, and more.
Monday, February 10, 2020

Oh, don't worry, I've screwed up plenty of things too

I've gotten back into the swing of writing again for the usual reasons, and Hacker News has noticed. Yesterday's post got on there and THE ONE came out, all over the place.

But, hey, even in turds you can sometimes find a peanut, and one of them basically said "she never talks about things she's screwed up". I thought I had, but with over 1200 posts, I can see how it might be hard to find some of it.

So, just to make it easy, I'm going to try to rattle off some stories about things I broke at some point in time. Prepare for maximum navel-gazing!


Thirty years ago this month, I started a public BBS. That is, it had its own dedicated phone line, and members of the public could call it. It was no longer just a testing thing I ran on my family's single (voice) phone line for friends to play with.

While up in the attic hooking things up, I needed to test something and so clipped onto the first line to ring the second. I was still holding the second line at the time... and got to find out what 90 VAC at 20 Hz feels like. BZZT. Not my brightest moment.

Later that afternoon, once it was all hooked up, my first new user called in... almost. They got connected, and when it asked them "user id or NEW", they typed in NEW, and it hung up on them. Why? Because I had put the thing in "no new users mode" when it was on the voice line, "just in case someone else found it". My first chance at having a real user, and I told them to take a walk.

It was three more days before anyone else called. Oops.


I think I plugged in my modem (a Commodore 1670) upside-down once. I don't remember how. It probably involved being careless and not looking at what I was doing. It blew the fuse on the machine and brought the BBS down for a while until I could find another one.


Back when I was helping wrangle cat pictures, I had a starter project to redo the file cache in HHVM. (I can use some specifics here since it's open source. Really. Go look.)

I got everything working and flipped the switch, and promptly broke the next push. Why? I had conservatively set the file mode used with open() to 0600 - user read/write, and nothing for anyone else.

Trouble is, they built the cache with one user id, and ran the actual server with another. The user id was preserved when the cache was distributed to all of the web servers (think tar's -p switch). So the first time they tried to read the new-form file, they hit -EACCES and failed.

The RelEng folks chided me and worked around it by adding a chmod to their release script. Then I fixed the actual mode upstream to make it 0664 from the get-go, and then hopefully they removed the now-superfluous chmod.
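
To make the failure mode concrete, here is a minimal sketch in Python rather than the actual HHVM code. The helper names and file names are hypothetical. It creates one file with mode 0600 and one with 0664, then checks whether a user who is neither the owner nor in the group would be allowed to read each. It assumes a POSIX system and sets the umask to 002 for the demo, since the umask is applied on top of whatever mode is passed in.

```python
import os
import stat
import tempfile


def write_cache_file(path, data, mode):
    """Create path with the requested permission bits and write data to it."""
    # os.open() takes the same mode argument as open(2); the kernel still
    # applies the process umask to it.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, mode)
    with os.fdopen(fd, "wb") as f:
        f.write(data)


def permission_bits(path):
    """Return just the permission bits of path (for example 0o600)."""
    return stat.S_IMODE(os.stat(path).st_mode)


def other_users_can_read(path):
    """True if a user who is neither owner nor group member may read path."""
    return bool(permission_bits(path) & stat.S_IROTH)


if __name__ == "__main__":
    old_umask = os.umask(0o002)  # assumed umask, so 0664 survives intact
    try:
        with tempfile.TemporaryDirectory() as d:
            strict = os.path.join(d, "strict.cache")  # hypothetical names
            shared = os.path.join(d, "shared.cache")
            write_cache_file(strict, b"x", 0o600)
            write_cache_file(shared, b"x", 0o664)
            print(oct(permission_bits(strict)), other_users_can_read(strict))
            print(oct(permission_bits(shared)), other_users_can_read(shared))
    finally:
        os.umask(old_umask)
```

With 0600, a server running under a different user id gets EACCES on open. With 0664 the "other" read bit is set and the read goes through.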


My first time on call for said cat pictures, I got a request from someone to try to find a pattern in their core dumps. The frontend machines in question ran some helper job and it was falling over. This place used to actually symbolize core dumps and left a .txt file in /var/crash next to the actual core.

I figured, okay, cool, let's just run grep on all of those .txt files on the cat picture frontend machines out there to see if we can find what they were looking for.

parallel_ssh_thingy -S catpix.frontend.region1 "grep foo /var/crash/*"

I let that run for a bit and turned my attention to other things. I figured I'd come back to it in 20 minutes or so when it had hit all of the boxes.

A few minutes later, I noticed that cat picture uploads had gotten VERY VERY slow in one region. Like, really bad. I opened a SEV - my first one - to track whatever it would turn out to be.

I got on a slow box. It had this massive grep running. It was looking for a string in every single file in /var/crash, including the multiple core dump files, each tens of gigabytes in size. This was beating the living crap out of the disks, and was slowing the machines down enough to make the monitoring unhappy.

I killed the greps and things went back to normal.

Thus, the first SEV I ever opened in the tool was also the first one I caused. Derp.

Every company seems to write one of these "run things in ssh everywhere at once" tools sooner or later. They should all be renamed "instant SEV creator".
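
The simplest guard here would have been narrowing the glob to /var/crash/*.txt. As a slightly more defensive sketch of the same idea, the Python below only looks at files with a .txt suffix and skips anything over a size cap, so a stray multi-gigabyte file never gets read. The function names, the 10 MB cap, and the example file names in the test are assumptions for illustration, not anything from the original tooling.

```python
import os


def crash_summaries(crash_dir, suffix=".txt", max_bytes=10 * 1024 * 1024):
    """Yield paths of small text summaries in crash_dir, skipping core dumps."""
    with os.scandir(crash_dir) as entries:
        for entry in entries:
            # Only regular files; do not follow symlinks into other trees.
            if not entry.is_file(follow_symlinks=False):
                continue
            # The symbolized summaries end in .txt; the cores do not.
            if not entry.name.endswith(suffix):
                continue
            # Belt and braces: refuse anything suspiciously large.
            if entry.stat(follow_symlinks=False).st_size > max_bytes:
                continue
            yield entry.path


def grep_summaries(crash_dir, needle):
    """Return (path, line) pairs for summary lines containing needle."""
    hits = []
    for path in sorted(crash_summaries(crash_dir)):
        with open(path, "r", errors="replace") as f:
            for line in f:
                if needle in line:
                    hits.append((path, line.rstrip("\n")))
    return hits
```

The size check costs one stat per file, which is nothing next to reading tens of gigabytes off a busy disk.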


One time I was working in a basement of a house on some telco wiring. This house was getting a "proper" install with actual punchdown blocks and all of that good stuff. Something wasn't quite right with the block, so I was crammed into a nook in the utility room next to the furnace fiddling with it.

That's when I noticed that every time I touched the punchdown block with the (all metal) tool in my right hand, my left ear tingled. It was repeatable and distinct: touch equals buzz, remove equals not.

It didn't take too long to put it together: the line was hot, so the block was hot, too. It was just the usual on-hook line current, but it was still notable. It obviously went across the tool and right up my hand no problem.

As for my ear? I was leaning against a metal post which held the house up... and which had been driven down through the slab into the (literal and electrical) ground. So, that's where the current went.

Zap. Dumb.


I designed my original "mail delay" system particularly badly. It used a database schema which amounted to a row id, and then strings for each of the source IP address, HELO string, FROM string, and TO string. It took up a stupid amount of space on the disk and was super slow, too.

It got better. But that initial version was crap.
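
The post does not say how the later version changed things, so the following is only one possible direction and not a reconstruction of the actual fix. The sketch collapses the four strings into a single fixed-size digest and uses that as the key, so each row costs 32 bytes plus a timestamp regardless of how long the addresses are. The table name, the first_seen column, and the use of SQLite and SHA-256 are all assumptions made for illustration.

```python
import hashlib
import sqlite3


def tuple_key(ip, helo, mail_from, rcpt_to):
    """Collapse the four strings into one fixed-size 32-byte digest."""
    h = hashlib.sha256()
    for field in (ip, helo, mail_from, rcpt_to):
        raw = field.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
        h.update(len(raw).to_bytes(4, "big"))
        h.update(raw)
    return h.digest()


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) table if needed."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS seen ("
        " key BLOB PRIMARY KEY,"
        " first_seen INTEGER NOT NULL"
        ") WITHOUT ROWID"
    )
    return db


def first_seen(db, ip, helo, mail_from, rcpt_to, now):
    """Record the tuple if it is new; return the time it was first seen."""
    key = tuple_key(ip, helo, mail_from, rcpt_to)
    db.execute(
        "INSERT OR IGNORE INTO seen (key, first_seen) VALUES (?, ?)",
        (key, now),
    )
    row = db.execute(
        "SELECT first_seen FROM seen WHERE key = ?", (key,)
    ).fetchone()
    return row[0]
```

The trade-off is that the original strings can no longer be read back out of the table, which matters if anyone needs to debug by eyeballing rows.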


Many years ago, I was coming off the freeway at an unusual place because my usual exit was closed. I couldn't really see too well to my left, but took the right on red anyway because it seemed "clear enough". It wasn't. I had to dive for the shoulder as a car shot by and honked (quite rightly) at me.

That one was particularly stupid, and I'll never forget it.


Back in the days when the servers and switch at my sysadmin job sat on my actual desk, I was fixing some wiring and managed to hit the big ON/OFF toggle on a power strip which killed the switch, and thus interrupted any telnet sessions in use, including the one used by the boss. I turned it right back on, but it took a few seconds to start forwarding frames again. That was enough to hose things.

CLOMP clomp clomp clomp clomp. I could hear him coming down the hall. Before I could even crawl back out from under the desk, the boss was at my desk asking "WTF?!".

I now put a mollyguard over those things any time there's any chance of them being exposed and having unscheduled activations.


I was home alone as a kid, watching some movie on TV. I saw some guy grab a beer can and do that thing where you jam a pen in the side to make a hole, and then crack open the top. The idea is that you can suck the whole thing down at once because the other opening lets air pressure equalize, and so it doesn't stop to "breathe".

I went into the kitchen and grabbed a can of soda and a pen. Then I jammed it into the side and ... FWOOOOOOOOOOSH... it shot up and out and all over the kitchen, INCLUDING the ceiling (which I had no chance of reaching to clean up to cover my idiocy).

Stupid, right?


I wrote a binary tree implementation in plain old C that turned out to be a linked list in practice because everything eventually ended up sorted, and it would just chase down the right side pointer.

It just kept getting slower and slower until I finally went and found out why. Oh. Yeah. That would do it.

(I'll note that when I wrote about this six years ago, THE ONE showed up and asserted they would do no such thing. I guess THE ONE wouldn't. But the rest of us would.)
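
The degenerate case is easy to reproduce. This is a small Python sketch, not the original C, of a plain unbalanced binary search tree. The names are made up for the example. Feeding it keys that are already sorted sends every insert down the right-hand pointer, so the tree's height equals the number of keys and each insert walks the whole chain.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into an unbalanced binary search tree; return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            # Equal or larger keys go right; sorted input always lands here.
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right


def height(root):
    """Number of nodes on the longest root-to-leaf path (iterative walk)."""
    best = 0
    stack = [(root, 1)]
    while stack:
        node, depth = stack.pop()
        if node is None:
            continue
        best = max(best, depth)
        stack.append((node.left, depth + 1))
        stack.append((node.right, depth + 1))
    return best


def build(keys):
    """Insert keys one at a time, in the order given."""
    root = None
    for k in keys:
        root = insert(root, k)
    return root


if __name__ == "__main__":
    print(height(build(range(1000))))  # 1000: a linked list in disguise
```

Inserting the same keys in shuffled order gives a much shallower tree, and a self-balancing tree avoids the problem whatever the input order.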


Back in the days when we built small Linux boxes on leftover/junk x86 hardware to act as dialup routers for distant sites, I built one for the "alternative" high school - for kids who got pregnant or had trouble with the law, or whatever. That far back in time, I didn't always trust LILO, and tended to make a boot floppy that would bring the box up just in case.

This one was no exception, although I ended up getting LILO to work, and so a subsequent kernel update during the build of that box didn't go out to the floppy. One important thing I had switched on in that update was PPP support. The kernel on the hard drive had it, but the one on the floppy didn't. I didn't really appreciate that at the time.

This didn't matter for a long time because the floppy merely sat IN the floppy drive loosely but was not actually pushed all the way in. That is, it wasn't fully engaged and so the drive couldn't see it. It was just chilling as an "in case of emergency" type thing.

One day after a power outage, I found out their machine didn't get dialed back in. I went out there expecting the worst, and found that someone had pushed in the floppy. They must have noticed it "half in, half out", and pushed it in the rest of the way. Then when the power flipped and flopped, it rebooted off that disk, and found itself unable to stand up a PPP link connecting to the outside.

I had to drive all the way out there because I left an attractive nuisance there just begging to be "fixed", and besides, the on-floppy kernel lacked the ONE thing the box needed to be remotely accessible.

That whole school was offline for most of that morning.


Yeah, that's enough fodder for today.