Writing

Software, technology, sysadmin war stories, and more.
Sunday, September 23, 2012

A typical night working tech support

A friend asked what my life was like back when I used to work as a phone/ticket support monkey for a web hosting company. Fortunately for him, I took notes of what some of those nights were like. Here's a Sunday night from many years ago. I looked at the tickets from that night which were "closed" and still assigned to me a few nights later.

  1. ftp monitoring alert - high load on the box
  2. call about tarring up mysql and a quick restore for them
  3. reset password and pass to networking for a firewall update
  4. install x11 dev rpm
  5. tweak apache log format for better urchin reporting
  6. fix ntpd on 3 servers
  7. disable mysql monitoring (customer was running with skip-networking)
  8. https monitoring alert with no root access available, update to customer only (can't fix)
  9. outgoing mail to aol bouncing, dns resolution issues, modifications
  10. ping monitoring alert that had been happening for ages, nobody ever really looked into it, i did, found out it was ip_conntrack overflowing, removed, fixed!
  11. explain legend in backup report charts
  12. poke customer about dormant ticket
  13. force monitoring config for 3 new servers, one customer
  14. same, another customer
  15. raise upload_max_filesize for their php
  16. enable aic7xxx_old so they can use a scsi-linked crypto device (!)
  17. update customer about pgp and backup offerings
  18. whack idiot who claims that one of our PTRs was "non-rfc compliant"
  19. update customer about work actually performed yesterday (he missed the comment, apparently)
  20. firewall update
  21. new kernels on two boxes
  22. close dup ticket of above
  23. extend cancellation date one more week
  24. recommend mod_rewrite stuff for apache
  25. yank four $big_customer boxes out of monitoring since the service had been alerting for TWO WEEKS
  26. stop monitoring coldfusion, not even on the box, alert was never valid
  27. schedule maint for dell amber alert
  28. explain how i started apache earlier on another ticket without the root password (sudo!)
  29. nag at someone blocking monitoring poller, resolved
  30. update customer on how dns caching works
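Item #28 above is a one-liner once sudo is in play. Assuming support accounts were granted the relevant commands in sudoers (plausible for this shop, not confirmed in the post), starting Apache without the root password would look something like:

```shell
# No root password needed -- sudoers grants the specific command.
# RHEL-style init script path is an assumption:
sudo /etc/init.d/httpd start

# See which commands this account is actually allowed to run:
sudo -l
```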

As mentioned above, those are the tickets which hadn't "popped out" -- say, because the customer posted an update after I had gone home for the night. When that happens, the ticket auto-unassigns itself and some other tech can pick it up to finish things.

There were about 9 or 10 more tickets not on the above list which I had assigned to me at some point on that shift. Some of those were probably things which were not for support to solve but had to be routed anyway. If a ticket crossed my path, I'd either send it in the right direction or knock it out. It didn't matter if it was simple or hard.

How much help did I have? Well, there were two guys from the day shift there at first, until maybe 5 or 6. One of them actually worked, and the other one tended to play some kind of game most of the time, so that's really only one person. Two of my fellow second shifters were fairly late that afternoon, and a fourth didn't show up at all.

A bunch of those tickets were just me undoing dumb things done by someone else. #7 is where I turned off monitoring for the MySQL service since the customer had it set up with skip-networking. In that configuration, it doesn't listen on port 3306, and you can only talk to it via the Unix domain socket. This is just fine if your clients are all on the same box. However, it meant monitoring would always fail.
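The skip-networking situation is easy to confirm from a shell. A hedged sketch, assuming stock paths (config and socket locations vary by distro):

```shell
# With skip-networking set in my.cnf, mysqld opens no TCP listener,
# so any network-based check of port 3306 will always fail.
grep -i 'skip-networking' /etc/my.cnf   # config path is an assumption

# Nothing listening on 3306:
netstat -ltn | grep ':3306' || echo "no TCP listener"

# Local clients still work fine over the Unix domain socket:
mysql --socket=/var/lib/mysql/mysql.sock -e 'SELECT 1'
```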

Someone had turned it on and hadn't bothered to make sure it actually worked. The check failed immediately and stayed that way until I finally caught it and shut the monitoring off. Yes, the monitoring system should not have let anyone enable a check for a service which wasn't actually reachable at the time, but it didn't work that way. Instead, we had to exercise caution, and many people did not.

#13 and #14 were linked to new servers going online. People would normally just keep bugging the customer every 48 hours to get them to say what they wanted monitored. I took it upon myself to just scan the box, pick out the stuff actually in use, and then "force config" their monitoring. I'd advise them of what I had done and tell them to get in touch if they needed anything else. Ticket closed.

#25 was similar. Four machines had been throwing an alert for 2 weeks and had obviously not been configured yet. Someone had configured monitoring prematurely. I turned it off and told them to get in touch when they were ready to proceed. Ticket closed.

In terms of actual "kung fu", I'd say very little happened this evening. #10 was about as deep as it went. Some customer had a machine that was a habitual offender: it constantly showed up as "failing ping" -- the monitoring machine couldn't reach it. This would set off the pagers, the data center guys would check it, and it would be fine. Nobody ever really figured it out.

Well, nobody else, that is. I decided to poke around and found that the machine was running ip_conntrack for no particular reason, and its connection tracking table was overflowing (the ceiling lives in net.ipv4.ip_conntrack_max). This caused it to drop packets from various sources at times, and sometimes that source was our pinger. Since they weren't even using connection tracking, I just unloaded the module. I want to say the module had originally been pulled in by an "iptables -L" tickling an autoload, but I can't be sure after all this time.
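A plausible reconstruction of that diagnosis, assuming the 2.4/2.6-era ip_conntrack module (modern kernels renamed it nf_conntrack):

```shell
# The kernel logs this when the conntrack table overflows and starts
# dropping packets -- including, sometimes, the monitoring pings:
dmesg | grep -i 'ip_conntrack: table full'

# Compare entries in the table against the configured ceiling:
sysctl net.ipv4.ip_conntrack_max
wc -l /proc/net/ip_conntrack

# Nothing on the box needed stateful filtering, so out it goes.
# (Beware: a bare "iptables -L" can autoload the module right back.)
rmmod ip_conntrack
```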

Anyway, I did that, and it cleared up and never came back. Win.

#16 was the most unusual, since few customers shipped us their own SCSI cards and SCSI-linked hardware crypto/accelerator devices, but it was simple in practice: make sure the right module is loaded and verify the device shows up in /proc/scsi/scsi. Easy enough.
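A sketch of what that probably looked like (the module name comes from the ticket; everything else about the device is assumed):

```shell
# Load the legacy Adaptec driver the customer's card needed:
modprobe aic7xxx_old

# If the card attached, its bus and any devices on it -- including the
# SCSI-linked crypto box -- show up in the kernel's SCSI inventory:
cat /proc/scsi/scsi
```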

That was a normal amount of load. What was somewhat unusual is that I wound up doing almost all of it.