
Thursday, October 4, 2012

Two weeks of working as a meta support monkey

After I escaped from being a ticket- and phone-wrangling web hosting support monkey, I turned into a very odd creature. One of the hats I wore was "reporting monkey". In that role, I had to generate reports on the business for other folks inside the company. I got that role since I actually understood the schema for all of our databases and also knew what the data meant in terms of real support issues due to my experience on the floor.

Another one of my "hats" was to write utilities to help out techs on the floor. Still another was taking escalations for those really hairy customer issues nobody else could figure out. Finally, now and then, I'd jump into the queues with a fellow escapee and just smack down a few dozen tickets to "show them how it's done".

I also recently wrote about what a typical night of working tech support looked like. That one was inspired by a friend who wondered what I had been doing at the time. I realize now that I never told him what happens once you "graduate" from that level of things and start doing the meta-support work which follows.

This is everything which happened in a two-week span of time.

One of our customers with a bunch of systems had been having crashes apparently linked to the kernel quota code. We had set up a netdump receiver and configured their machines to point at it. They had a panic, but we didn't get a netdump. It looked like there was a bug in how their Ethernet driver talked to the netdump code. Lovely.

A customer opened a ticket which essentially said "I know you guys don't support code, but we've tried everything and need some help, could you just look?", so naturally I went looking. Normally, customers would insist it was our server and it would turn out to be their code. In this case, they admitted it was their code, and that was sufficiently novel as to get my attention.

Their code was Java, but it was nothing a decompiler couldn't fix. I'd done this sort of thing before, so I broke it open and took a look. It was hard-coded to always look for a specific component in every pathname, so I set up a symlink to make it happy. That is, it wanted to see "/foo-bar/" somewhere in the path every time, for some inexplicable reason. The otherwise pointless symlink let it see that. Problem solved.
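
For what it's worth, the fix was nothing more exotic than a single symlink. Here's a minimal sketch of the idea; the component name "/foo-bar" is the stand-in from above, and the real document root is made up since the original is long gone:

    import os

    # Hypothetical names: the decompiled code insisted on seeing "/foo-bar/"
    # somewhere in every path, while the real content lived elsewhere.
    wanted_prefix = "/foo-bar"
    real_docroot = "/var/www/customer-site"

    # Point the otherwise-meaningless name at the real location so the
    # hard-coded check finds something that actually exists.
    if not os.path.islink(wanted_prefix):
        os.symlink(real_docroot, wanted_prefix)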

The company's reporting database was stuck for a few days, so little happened there in terms of new data. It only got a dump when things were healthy, and when those got stuck, nothing would happen. Don't ask why they didn't just set up a read-only follower: replication and that particular database system were like oil and water. It took a lot of agitation to even simulate it, and even then, it would never be quite right.

We started doing this "employee of the quarter" thing, and so I had to stand up a voting intranet site. That went into a runoff, and then a second round, and then it ended. By virtue of being the system creator and administrator, I could see who had won. I had to sit on that knowledge for a few days because the official ceremony had been postponed. I guess it's cool that people trusted me enough to let me be a candidate and run the voting system at the same time.

I started looking at my original "reflection" tool in a new way: making it target just a single tech. This way, you could see who was working and who was just putting ".5" (half-point) comments on tickets all night. Yet again, the ticket cruncher numbers were being gamed.
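
The per-tech view was really just an aggregation change. Something along these lines, sketched with a made-up record format since the real tool read straight from the ticket database:

    from collections import Counter

    def points_breakdown(comments, tech):
        """Tally one tech's ticket comments by point value for a shift.

        `comments` is assumed to be an iterable of (tech, ticket_id, points)
        tuples; the real reflection tool pulled this from the ticket system.
        """
        return Counter(points for who, _ticket, points in comments if who == tech)

A shift that's nothing but a tall stack of 0.5s tells you exactly what kind of "work" was going on.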

A web hosting customer had a really hairy problem where one of his sites included a PHP script which then tried to include an external (!) URL. That by itself wasn't too strange, since a bunch of customers did that, but there was an issue with PHP which was making it painful. There were three web heads serving the target site (hosted elsewhere), and one of the three was throwing 400 errors instead of giving usable results.

When that happened, the PHP build on his machine would go nuts. It would try to eat all of the memory on the machine, and that would inevitably bring down the whole system. I temporarily blocked outgoing connections to the one bad host and gave him some tips: those folks needed to fix their site, he could stop using it (not too likely, but worth a shot), or he could try upgrading PHP to see if it went away.
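
Figuring out which of the three heads was the bad one was simple enough to script. A rough sketch, with made-up addresses, hostname, and path standing in for the real ones:

    import urllib.request
    import urllib.error

    # Hypothetical stand-ins for the remote site's three web heads and the
    # page his PHP script was trying to include.
    web_heads = ["198.51.100.10", "198.51.100.11", "198.51.100.12"]
    site_name = "www.example.com"
    path = "/included-thing.php"

    for host in web_heads:
        req = urllib.request.Request("http://%s%s" % (host, path),
                                     headers={"Host": site_name})
        try:
            with urllib.request.urlopen(req, timeout=5) as resp:
                print(host, "->", resp.status)
        except urllib.error.HTTPError as err:
            # The one bad head shows up here, answering 400 to every request.
            print(host, "->", err.code, "(broken)")
        except urllib.error.URLError as err:
            print(host, "->", "unreachable:", err.reason)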

Management types in the UK office decided they wanted their techs to start showing up in the ticket cruncher numbers, so I added them to the system. They weren't great numbers, but they wanted to be part of it too for some reason. They probably shouldn't have, since it just showed what most of us knew anyway: their tickets were horrible messes.

A support tech needed to track down the internal tracking entry used for an SSL certificate after a customer lost his drive and was restored from a two-year-old (!!!) backup. Basically, there was no way to search the cert data through the normal web site, but my access to the reporting database let me pepper it with appropriate SELECTs to bring it back. Then, I wrote a small page for my self-service reporting tool so other people could run those same queries any time they needed them in the future.
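
The "search" was nothing fancy, just a parameterized query against the reporting copy. A sketch with invented table and column names, since the real schema isn't something I can reproduce here:

    # Hypothetical table and column names standing in for the real schema.
    FIND_CERTS = """
        SELECT cert_id, common_name, order_ref, created
          FROM ssl_certs
         WHERE customer_id = %s
           AND common_name LIKE %s
         ORDER BY created DESC
    """

    def find_certs(conn, customer_id, domain_fragment):
        # conn is assumed to be a DB-API style connection to the read-only
        # reporting copy, not the production ticket database.
        cur = conn.cursor()
        cur.execute(FIND_CERTS, (customer_id, "%" + domain_fragment + "%"))
        return cur.fetchall()

The self-service page was just a thin form on top of a query like that.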

A couple of days later, some people thought this was "brilliant" and asked for it to become its own standalone page on my bigger list of tools instead of just being a report on my workstation's magic reporting page. I put up a quick beta version and prepared to promote it to our "somewhat-production" server, which was better than using my long-suffering workstation.

One of the support teams asked me to sit in on an interview for a tech. I threw a bunch of questions at this candidate and he got everything I would expect a "level two" to get. He also wasn't shy about suggesting alternatives instead of just having one answer for any given question. I figured he'd be a good addition to the support force. They passed on him. Naturally. I wonder why they invited me.

Corporate insanity descended upon the support teams and renamed them from "Team X" and "Team Y" type names to "Division-1" and "Division-2" type names, and then finally to "N-1", "N-2", "N-3" names, all within a short span of time. I resisted doing this at first, since there was a 1:1 mapping from the original names I knew (team X) to the N-* names, but they finally convinced me to change my tools to reflect it. I also fixed up my tools for some related changes which had split off some key data into a completely separate database running on a totally different backend.

Another tool of mine, which let you travel in time to see what the support queues looked like in the past, had been broken by this corporate mayhem which forced the database splits. I got it working again, but it was showing far too many active tickets: hundreds at a time when it should have been merely dozens. I also figured that we were about six months away from actually having hundreds of active tickets at any given time because of how things were going to hell.

Oh, and I actually took a day off during this span of time.

That's all of the stuff I did in two weeks that actually made it to my log of what happened. We had started keeping those logs after someone read about how Google did weekly "snippets". There were probably a few more things which happened but were not logged for whatever reason.

If you stop and look back two weeks, just how much have you done? How much of it is forward progress, and how much of it is just "moving the food on your plate around" to make it look like something has happened? Even if you're getting things done, what about the rest of your team?