Writing

Software, technology, sysadmin war stories, and more.

Monday, January 14, 2013

My idea for recovering from runaway chmod/chown calls

Web hosting support is an interesting business. You exist because the customers need help doing things on their machines. In my case, my job existed to help people get things done on their Linux boxes (and occasionally Solaris or some flavor of BSD). They are given root on their own machines, so this means you have people who have decided they need help and yet also have the power to do things themselves.

Sometimes, this leads to some interesting messes. Have you ever seen a chmod or chown go far beyond its intended target? It happens a fair bit when someone doesn't realize exactly what they're getting with a wildcard expansion or the -R switch. One misstep and the ownership or permissions on a whole tree of files are wrong.

Other times, a customer would be having trouble making something run. Maybe they were installing a bit of PHP code which accepted file uploads. They might not realize exactly which uid would be effective when the script ran, and would decide to "fix" things by granting wide-open permissions to a whole bunch of places. This would later lead to other interesting anomalies when certain paranoid tools refused to run. OpenSSH in particular is rather picky about the path to an authorized_keys file, for instance.

We'd get a "help!" ticket and would have to put things back the way they should have been and then also come up with a solution to their original problem. The original problem, giving the script somewhere it could write, wasn't usually a big deal since we saw those add-on scripts all the time, and they tended to need similar configs.

The real problem was fixing all of the other stuff they may have broken. Sure, on a machine based around a packaging system like RPM, you can use it to get some idea of what the package-based files should look like, but what about everything else, like user data and customizations which didn't come from packages?

Some customers had purchased our backup service. It was a glorified version of dump/restore which ran over the network instead of writing to a local tape drive. It would have a copy of the affected data, but it was a rather heavyweight solution to the problem. For one thing, it might take a couple of hours to restore all of the data. Another problem was that we only really wanted the permissions and/or owner data, not the files themselves.

I came up with an idea that would give us some way to fix things in a pinch without too much trouble. We should have a cron job which does nothing but dump the metadata for every file on the machine into a file. The job should run nightly, the file should be readable only by root, and logrotate should do its thing to keep about a week of backlog.

This could have been as simple as 'find / -ls', but I wanted it to be better than that. The output from this tool needed to be easy to parse and hard to get wrong. The last thing I wanted was for someone to have a hard time using the data in that file. The output from find is nice and all, but turning "drwxr-x---" back into "0750" for use with chmod is kind of a pain. Also, the inode numbers, file sizes and mtimes aren't really that useful in this recovery context.
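To give a feel for the "kind of a pain" part, here is roughly what it takes to turn an ls-style mode string back into something chmod understands. This is only a sketch: the function name symbolic_to_octal is invented here, and it assumes exactly ten characters with no trailing ACL or SELinux marker.

```c
#include <string.h>

/* Convert an ls-style mode string such as "drwxr-x---" to its octal
   permission bits (0750 for that example). Returns -1 if malformed.
   (Hypothetical helper for illustration.) */
int symbolic_to_octal(const char *s) {
    static const char expect[] = "rwxrwxrwx";
    int mode = 0;

    if (strlen(s) != 10)
        return -1;

    for (int i = 0; i < 9; i++) {
        char c = s[i + 1];                 /* s[0] is the file type; skip it */
        int bit = 1 << (8 - i);            /* 0400 for i == 0, down to 01 */
        char lower = (i == 8) ? 't' : 's'; /* sticky lives in the last column */
        char upper = (i == 8) ? 'T' : 'S';

        if (c == expect[i]) {
            mode |= bit;
        } else if (i % 3 == 2 && (c == lower || c == upper)) {
            /* setuid, setgid and sticky share the three execute columns */
            mode |= 04000 >> (i / 3);
            if (c == lower)
                mode |= bit;               /* lowercase: execute is set too */
        } else if (c != '-') {
            return -1;
        }
    }
    return mode;
}
```

That is a fair amount of fiddly logic just to undo formatting, which is the argument for storing the octal mode directly in the first place.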

I decided to write a small program to do exactly that. It would recursively traverse a filesystem, printing an entry for everything it found. It needed a few minor adjustments to not get hung up on stuff like /dev/fd or /proc/self, but it did what it needed to do.

Unfortunately, nobody really cared about it, so it just sat in my pile of random tools unused and unloved. Customers would still do "chown -R foo /" from time to time, and some poor tech would have to either try to bail them out or (more often) just say they were out of luck and would have to start over from a fresh install.

As a side note, the event which got me to write this tool was a customer who really did do a recursive chown against /. I know this because my friend who was working that ticket found it in their bash history. I figured there was a slight chance that 'locate' might be storing more than just filenames in its database dump, so I downloaded the source and started looking at the code to see what I could learn about it.

This is the kind of stuff I found (adjusted to fit the page here):

  if (foundit) {
    if (slevel == '1') {
      if (UID == 0 || check_path_access(strdup(codedpath))) {
        printit = 1;
      }
    } else
      printit=1;
  }
 
  if (printit) {
    res = 0;
    cur_queries += 1;
    printf("%s\n",codedpath);
    if (max_queries > 0 && cur_queries == max_queries)
      exit(0);
  }

Something about that just made me worry. foundit? printit? This wasn't the kind of code I wanted to find on my search for a fix.

By following this around, I discovered that it was using access() to check the live version of a file to see whether it should show up in the list or not. That implied that it wasn't picking up owner/permission info from its own storage, and so I gave up on that.
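To spell out why that was a dead end: access() asks the kernel about the file as it exists at the moment of the call, and all it hands back is success or failure. A tiny sketch, with the made-up name visible_now standing in for whatever locate really does inside check_path_access:

```c
#include <unistd.h>

/* Returns 1 if the calling process can see `path` right now, 0 otherwise.
   access() checks against the real uid/gid and reports only success or
   failure; it never says who owns the file or what its mode is.
   (Hypothetical helper, not locate's actual code.) */
int visible_now(const char *path) {
    return access(path, F_OK) == 0;
}
```

After a runaway chown, a check like this simply reflects the damaged state. There is no older copy of the owner or mode anywhere for it to consult.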

As far as I know, that customer had to reinstall and manually migrate their data over to the new filesystem. What a waste.