Monday, January 21, 2013

Asleep at the wheel and out of disk space

I want to talk about quotas. I'm not talking about the kind that specifies this many of this kind of person and that many of that kind. No, this is the technical kind, where users only get so much of a resource. It's where you're told that you can't just consume as much disk space as you'd like, or memory, or CPU time, or anything else that is shared.

I was in a situation where I was part of a development team for an existing service. This thing was widely distributed and ran in many different locations. This service was composed of about a dozen servers (think "daemons", only not quite) per instance, but the instances did not need to talk to each other. They could operate in isolation.

A service like this tends to create a lot of log data. There's all sorts of debugging output and a running commentary from different parts of the software as it does its work. Sometimes this isn't important, and it's okay to just discard it. Other times, you have to keep it around for SAS 70 or SOX or whatever else, and so you retain it according to some policy.

My service fell into the "whatever else" category: it had a lot of access to a lot of other moving parts, and so it should have had rather lengthy log retention. However, as I found out while working on a different problem, this was not the case.

Storage systems sometimes assign a "default quota" to users who haven't explicitly been given their own allocations. This lets system administrators set up a new user for testing purposes before they have to jump through the hoops of getting the budget to pay for their disk space. This can be a good thing in terms of getting things going quickly, but it also means that you can easily forget to properly acquire quota later.
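If you inherit a setup like this, it's worth checking the headroom yourself rather than trusting that someone filed the paperwork. Here's a minimal Python sketch of that kind of check. The path and threshold are made up for illustration, and os.statvfs only sees raw filesystem capacity, not per-user quota enforcement, so on a system that actually enforces quotas you'd ask the quota tools instead.

    import os

    # Hypothetical values for illustration only.
    LOG_DIR = "/var/log/myservice"
    MIN_FREE_FRACTION = 0.10  # warn below 10% free

    def check_headroom(path):
        st = os.statvfs(path)
        # f_bavail is what an unprivileged daemon can actually use;
        # f_blocks is the filesystem total. Both are in f_frsize units.
        free = st.f_bavail * st.f_frsize
        total = st.f_blocks * st.f_frsize
        frac = free / total if total else 0.0
        if frac < MIN_FREE_FRACTION:
            print(f"WARNING: only {frac:.1%} free under {path}")
        return frac

    if __name__ == "__main__":
        check_headroom(LOG_DIR)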

In the case of a sufficiently long-lived service, it will eventually chew up all of its disk space, and then it will stop saving its log files. Perversely, since there's no log rotation going on, the ones it keeps around the longest are actually the oldest ones. It's the recent data you never get to see since it is discarded as soon as it is written.
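The nasty part is that nothing fails loudly: the service keeps running, it just quietly stops keeping what it writes. A periodic check like the sketch below would catch it. If the newest file under the log directory hasn't changed in a while, the instance has probably stopped writing. Again, the path and threshold here are invented for the example.

    import os
    import time

    LOG_DIR = "/var/log/myservice"  # hypothetical path
    MAX_STALENESS = 15 * 60         # seconds; alert if nothing new in 15 minutes

    def newest_mtime(path):
        """Most recent modification time of any file under path, or 0.0 if none."""
        newest = 0.0
        for root, _dirs, files in os.walk(path):
            for name in files:
                try:
                    newest = max(newest, os.path.getmtime(os.path.join(root, name)))
                except OSError:
                    pass  # file vanished between listing and stat; ignore it
        return newest

    age = time.time() - newest_mtime(LOG_DIR)
    if age > MAX_STALENESS:
        print(f"WARNING: nothing written under {LOG_DIR} for {age / 60:.0f} minutes")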

This pattern will repeat every time a new instance is installed and someone "forgets" to allocate quota to it properly. Probably nobody will notice any of this until some kind of security event happens and there's a need to review the logs to see what sort of badness might have come through a given hole. That's when you find out you're missing logs from any instance that has been up for more than a short amount of time, and you may never know whether it was actually exploited.

I ran into this exact issue of not having logs for several instances, and gave up on ever knowing whether something had happened. I did decide to get it fixed so that we'd have logs going forward, and so I opened a request with the pager monkey types who actually ran things. We had a separation of duties and permissions for various reasons, and people on the dev team (like me) couldn't touch production. That's why I filed the request with them instead of doing it myself.

What happened next surprised me. One of the pager monkeys assigned the issue back to me. I told him that no, I could not request more quota since I was not a member of the production group, and the production group owned the logs. It was something they would need to do. Then I assigned it back to him.

This happened once or twice more before I finally got tired of wrangling this in a ticketing system and brought it up in a meeting. I basically looked right at the boss of the ops team and said "so, you're going to get the quota, right?", and he just looked at me without saying anything for a long time. It was like he expected me to talk first and take on the task. Normally, I might have done that just to get it done, but in this case I could not make that quota request myself.

I forget exactly what I said next, but it was something along the lines of "you're an ops person, this is an ops problem, so do your job".

I don't suppose they were happy with this, but it was the truth. More than that, they should have been embarrassed that this service was running without logging. They should have gone to some lengths to make the problem disappear rather than letting it keep coming up and getting kicked around in public.

I attribute many of the problems with that service to the distinct lack of ownership and overall caring from (nearly) all involved. If nobody cares about seeing something run successfully, is it any surprise when it becomes the software equivalent of an abandoned building with broken windows?