Writing

Software, technology, sysadmin war stories, and more.
Tuesday, June 4, 2013

Late night turns into early morning on a support call

I once worked at a place which used customer service as their major selling point. They were far more expensive than the other companies which nominally provided the same sort of web hosting, but they tended to play that like a feature, not a bug. They were trying to cultivate a perception of being a "high end brand". It was one of those things where if it didn't cost enough, people would wonder what was wrong with it.

My schedule in those days was 4 PM to 1 AM Thursday to Monday. That basically meant I had Tuesday and Wednesday off. I mention this only because it gives the following story a little more oomph.

One week, a customer had a server go bad. It looked like a serious dose of disk corruption from out of nowhere, and fsck wouldn't touch it. The tech who worked it wound up having it reinstalled on a fresh disk and then restored from backups. I was superficially aware of this and gave some suggestions at the time.

Three days later, it happened again. The original tech was off and it wound up falling to me. Just like before, there was nothing which could be done to save the filesystem. fsck didn't want any part of it. Once again, it had to be reinstalled and restored from the latest backup.

This all started around 6:30 PM. At some point, the president/owner of the customer's company got word of this and called in. He was off-scale flipping out that something like this could happen twice in three days. I was working the ticket, and so the call was routed to me.

I think this wound up being the longest phone call I was ever on. This server hosted their web site which had a web store on it, and the store was now down for a second night. They couldn't sell things, and "are losing thousands of dollars a minute", or whatever the usual line is. He wanted an explanation and he wasn't going to accept the basic version. I had to go into everything in exquisite detail.

All the while, the restore was running. He wasn't about to let me go, and I honestly didn't want to even try to turn this one over to third shift, so I just stayed with it. It rolled over to midnight, Tuesday morning. Another hour passed, and the rest of my teammates went home at the end of the shift: 1 AM. Still, the restore ran.

At one point, he ran out of steam and decided to just come at it fresh while we waited for the restore. He asked me why he should bother keeping his business there if something like this could happen. Basically, he wanted me to pretend I was him.

I said, well, actually, not too long ago, I was on the outside. I had only been at the company 5 weeks as of the day which had just ended. I then added that I had been working in the industry for many years prior to that to reassure him that he didn't have some kind of newbie on the job. I was merely new to that company.

I said that from what I could tell, they really meant what they said about customer support. I described how I was sitting there on a darkened support floor in an office building, and how there was a little shrine of sorts to people who had won the customer service award. I continued with how my team had actually gone home for the night, and while third shift was there, I wanted to see this one through myself.

This is about the point where he said something like "huh, I've had you on the phone this whole time", and basically realized he probably would be unable to get that anywhere else. It was unusual even where I worked, but considering the severity and frequency of the problem and his reaction to it, what else could I do?

The first restore bombed about an hour into it, and it had to be restarted. Apparently one of the tape drives in the storage cabinet went bad. This was just not a good night for this box. Eventually, the machine came back up from its restore, and I confirmed that his site was back up as it was several days prior. He thanked me for staying with him and let me go.

It was 3 AM - two hours past the end of my shift, and on the last day of my week, too - "my Friday", in effect. It also happened to be Memorial Day. Well, it had been, when it started. It rolled over to Tuesday morning. He got the highest level of support possible even though it was a holiday.

I went home. The ticket wrapped itself up during my "weekend", and by the time I came back on Thursday to start another week, it was closed.

A couple of days passed. Then I checked on their account again just to see what was happening. I wanted to see if he had followed through on his threat to pull his business and go elsewhere. He hadn't. In fact, they had expanded their configuration by adding another server.

Awesome. Expansions by existing customers were what drove our bonuses, so every bit of growth helped.

A year passed. On Memorial Day the next year, I went and checked on the customer again. They were still hosted with us.

Given all of that, I think staying late was worth the trouble.

Thinking about this now many years later, one thing still bugs me a little: we never found out exactly what happened to their drive. It could have been someone with root doing something braindead like 'cp foo /dev/hda', for all I know. It also could have been some kind of serious hardware problem which flipped some bits in memory or on the way to the drive.

We're fortunate it apparently never happened to them again, I guess. Failure to identify and eradicate a root cause is tricky business.