Monday, February 18, 2013

More short tales from tech support

Some of my stories about working tech support are too short to warrant their own posts, so I'll roll several of them up into a single post rather than let them go unused. Here are a few more.

...

We had a surprising number of people who needed to know just how many hits and/or bytes were showing up in their web server logs. Counting hits is easy enough since that's a job for "wc -l". Counting bytes is a little more painful because you need to add up all of those numbers, and trying to do it with a shell hack was usually stupidly slow.

I wound up writing a dumb little C program which just consumed stdin, fed the input to strtoul(), and added that onto an accumulator. Once it hit EOF, it would kick out the sum. You just had to whittle your input down to a stream of numbers through whatever means (cut, awk, ...) and let it do the rest. It wasn't pretty but it did work.
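The idea was something like this (just a sketch of it, not the original source):

    /* sum a stream of numbers from stdin, print the total at EOF */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        char line[256];
        unsigned long long total = 0;

        /* one number per line; anything non-numeric parses as 0 */
        while (fgets(line, sizeof(line), stdin) != NULL)
            total += strtoul(line, NULL, 10);

        printf("%llu\n", total);
        return 0;
    }

Feeding it was just a matter of something like:

    awk '{print $10}' access_log | ./addemup

... where the field number depends on your log format, and "addemup" is only a stand-in name here.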

This usually came up whenever a customer didn't believe their bandwidth reports. Some poor technician would have to go on a wild goose chase to find evidence of a lot of data being served up somehow. This usually meant finding one virtual host out of hundreds which had been particularly busy. This was never fun, and the most interesting case was the one time I found an open writable anonymous FTP server chock full of pirated movies. The customer had to pay for that one since they left it open.

...

We had a customer add a third machine to their configuration. They had "privatenet", which is the scheme where you drop a second NIC into your servers and put them on a common subnet which is isolated from the world. This even worked if the machines weren't on the same switch or in the same physical rack, since the networking peeps would do some VLAN magic with the switches and routers to link it all up.

One time fairly early on, I got a ticket for a brand new machine which couldn't see the other two. I did some poking around and found that the old machines could get packets from it, but the responses (like replies to pings) didn't register. What was really strange was that the replies were showing up on the interface, but the box wasn't honoring them.

After poking at this for a bit I realized something was massively broken in terms of how the privatenet stuff was mapping IP addresses onto hardware addresses. It looked like there was something out there with a bad ARP cache, since the packets were arriving for the right IP but with the wrong Ethernet address.
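If you've never chased one of these down, the telltale sign is watching the wire with the link-layer headers turned on and comparing what arrives against the local NIC. Something along these lines (the interface name is just a placeholder):

    # watch incoming ping replies along with their Ethernet headers
    tcpdump -n -e -i eth1 icmp

    # compare the destination MAC in those frames against the local NIC
    ifconfig eth1

    # and, run on the other boxes, see what they have cached for this IP
    arp -an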

I tried to pass this over to the networking group, but they just kicked it right back. It's like they didn't believe what I had reported. I finally had to get my manager involved in order to make them take notice and actually do something with it.

If you've ever heard of "customer service rep roulette", this is where you get a bogus answer and call back to try again with someone else. Well, we basically had that internally. The difference between someone taking action and yet another excuse to avoid work was all in whoever happened to grab your ticket. Sometimes, if I could, I'd purposely wait until one of the broken people went off shift before sending over something messy. Failing that, I'd find an owner for the ticket in advance by talking to the clueful folks directly before sending it their way. That way, they'd be able to grab it before the bad ones found it and screwed it up.

...

Another dumb and yet really useful tool I had back in those days was something I called "tping". All it did was ping a host once a minute until it came back up. Then it would call "xdialog" to pop up a window which I couldn't miss. It was just a shell script with a call to ping in a loop and a test for the exit code, but it saved me a lot of time.
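Reconstructed from that description, the whole thing would have been something like this (the exact ping flags and the dialog invocation here are guesses):

    #!/bin/sh
    # tping: wait for a host to come back, then pop up a window

    host="$1"

    # one probe per minute until it answers
    until ping -c 1 "$host" > /dev/null 2>&1; do
        sleep 60
    done

    Xdialog --msgbox "$host is back up" 0 0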

Something which happened all the time is that a machine would be taken down for whatever reason, and I needed to know as soon as it came back up. That could take 5 minutes or 30 depending on what was happening: rebooting into a new kernel was usually quick, but asking for a motherboard swap generally wasn't.

My solution avoided spewing too many useless pings by only emitting one per minute, and it meant I didn't have to keep an eye on that xterm. I could actually kick it off in the background and then go on to something else on a totally different workspace in my window manager. When the machine came back up, it would let me know.

I'd use this to pipeline certain lengthy tasks. I could start something running on system A, and while waiting for that to come back, I could grab another ticket and start working on system B. Some people I knew would just sit there twiddling their thumbs until that first box came back up, but that didn't fly in my world. I'd rather throw that ticket into an idle status to get it off the radar and then work on something else for a bit.

...

On the topic of support issue parallelization, sometimes there were phone calls where you'd get stuck with some customer who always took forever to get anything done. There were various reasons for this. I suspect at least a couple of them were really lonely and figured the multi-hundred-dollar-per-month fee and "unlimited tech support" translated into a live voice they could tie up whenever they wanted.

I could tell it was a particularly bad use of my time when the overall information density dropped low enough to let me start working on another ticket. I could actually drop into the queue, grab something interesting, and do my *clickity click* on that other box while the phone call droned on. Sometimes I'd close a ticket or two while on the phone with some totally different customer.

I never told them about this, obviously, and while it wasn't exactly common, it did happen now and then. Sometimes, it was a real challenge to get these people off the line so they could go back to talking to their cats or whatever instead of us.

...

Shortly after I left the support team to do a "meta-support" role, my friends on second shift found themselves having to deal with one of these long-winded customers. Every time someone got stuck with this guy, they were in for a marathon call. I'm talking about hour-long phone calls here, and those would be the short ones.

One night, this guy called in again, and wound up talking with one of our level 1 "phone firewall" people. Nobody wanted to take the call. I was still right there on the floor despite having moved to a new team, so I heard the commotion as this poor person tried to track down a victim for this guy.

I forget whether I said "how bad could it be" or something else equally stupid, but I wound up strapping on my officially-retired-but-still-there headset and picking up the call. Someone on the floor actually bet me dinner that I could not get this guy off the phone in less than an hour. I figured I could handle that without being mean, snippy, or obviously tipping my hand that I was trying to get this guy off the phone, and so I agreed.

It was around the 45 minute mark when I started sweating. This call was in full swing with no signs of letting up. Some people were wandering by and looking at the call timer on my phone. Others were shooting instant messages at me to check on things.

The clock kept ticking. This guy kept asking questions, and questions about questions, and so on. It was like an interview gone horribly wrong. 50 minutes. 53 minutes. 54. 55.

I think I finally got rid of him at the 56 minute mark. I managed to earn my dinner with only 4 minutes to spare, but that was ridiculous.

We went through a lot of batteries in those headsets.

...

Finally, speaking of batteries, there was a particularly nasty case of cargo-cult tech support one night. I forget exactly how I discovered this ticket, but it was clearly a case of a tech who "knew enough to be dangerous".

Basically, the customer had a machine which wouldn't keep its clock synced. They tried something like running ntpdate in cron, but that didn't help. They didn't notice that ntpdate would fail because ntpd was already running and holding the NTP port (more about that in a few). Oh no.

So, they scheduled a maintenance window to replace the CMOS battery. Yes, as in the battery which maintains the real-time clock when the machine is otherwise without power. You know, the one which basically never gets used for a server, because it's always on?

That didn't help, either.

The actual problem was that we had a broken kickstart for a while, and it was putting machines online with a bad "restrict" setting in ntpd.conf. ntpd would start up but wouldn't accept the responses from its upstream servers. It would poll and poll but never get any usable data, and so the clock would just drift, and drift... and drift...
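If you've never seen that particular failure, it's the sort of thing where the default restrict policy drops everything and nothing loosens it for the servers you've configured. A made-up example (not the actual kickstart config) looks like this:

    # default policy: ignore every packet, including the responses
    # from the servers listed below
    restrict default ignore

    server ntp1.example.com
    server ntp2.example.com

    # a working config would loosen things up for those servers, e.g.
    # restrict ntp1.example.com nomodify notrap nopeer noquery

ntpd happily sends its polls out, but the responses hit that default rule and get dropped on the floor.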

I made sure to add this ticket to my little ticket tagging system as "yes, replace the CMOS battery and schedule a cron job that will not do anything when ntpd has a broken 'restrict' setting". That way, it showed up in the digest a week later and everyone who had subscribed to it would get to see the mess.

I would definitely categorize the tech who "handled" that ticket as a ramrod.