Logs, globbing, regexes and bucketed retention rules

I once had to write a utility to clean up a messy situation. I had been handed a series of directories where automated backups were stored, and they were taking too much space. Someone had decided that we only wanted to keep backups up to a certain point and then only in certain quantities, and it fell to me to do something about it.

Normally, I might approach this as a matter of using stat() to get the file modification time on the assumption that it won't change. However, that was out of the question here. Apparently these files had been shoved around from place to place, and every time that happened, it reset the mtimes. The files were actually quite a bit older than what stat() would say about them.

This meant I had to fall back to another source of data: embedded numbers in each filename. There might be a backup called foo-2012-09-16-01.gz, for instance. Elsewhere, another backup might be called bar20120916. A third type of backup might have been sharded across multiple directories, so now it would be blah/2012/0916. I realized that merely hard-coding a strategy would not work since too many possibilities existed.

Still, just to get the base logic down, I started with something simple and had it deal with YYYYMMDD alone. There was still a lot of other stuff to do besides format handling, and I needed to get my date math right. No matter how it was stored, I had to turn a file's date into a number of days in the past, and then figure out which "bucket" it fell into. That would determine just what sort of rules were applied.

Imagine a ruleset like this: we keep all 0-90 day old files. From 90 to 180 days, we keep one per week. From 180 to 365 days, we keep one per month. Then, everything else (older than a year) falls off the end and is deleted. Nothing is kept past that point, in other words.
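A minimal sketch of that bucket lookup, in Python, might look like the following. The RULES table, the rule labels and the find_bucket name are made up for illustration. Treating each range as "minimum inclusive, maximum exclusive" is also an assumption, made so that a file exactly 90 days old lands in exactly one bucket.

```python
import math

# (minimum_days, maximum_days, rule) for the example ruleset above.
# Ranges are treated as half-open: minimum <= age < maximum (an assumption).
RULES = [
    (0, 90, "ALL"),
    (90, 180, "PER_WEEK"),
    (180, 365, "PER_MONTH"),
    (365, math.inf, "NONE"),  # everything older than a year is dropped
]

def find_bucket(age_days, rules=RULES):
    """Return the rule label for a file of the given age, or None if no bucket fits."""
    for minimum, maximum, rule in rules:
        if minimum <= age_days < maximum:
            return rule
    return None  # e.g. a negative age from a date in the future
```

A file with a date in the future matches nothing here, which is probably a safer default than deleting it.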

Oh, and, this ruleset might vary from one backup to another. So, not only did they have completely different path schemes and filename/date encoding schemes, they also had different retention requirements. I had to solve all of this at the same time.

One thing I tried was strptime(), but that didn't provide the kind of flexibility I wanted. It wanted to see an exact match from left to right. While it would forgive a bit of whitespace, there were too many variants to handle with this function. For instance, some of the backups included the machine name, and that would change. They also had variable widths. You might see a machine called "abcd33" today and "def20" tomorrow. strptime would not like that.
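To make the problem concrete, here is a small sketch using Python's datetime.strptime, which is similarly strict about literal text in the format. The filenames and the parse_with_fixed_prefix helper are hypothetical. The original tool may well have used the C function instead.

```python
from datetime import datetime

def parse_with_fixed_prefix(name, fmt="abcd33-%Y%m%d"):
    """Try to parse a date out of a name using one hard-coded format.

    Returns a date on success, or None when the name does not match
    the format exactly.
    """
    try:
        return datetime.strptime(name, fmt).date()
    except ValueError:
        return None

# The machine name baked into the format matches this one...
print(parse_with_fixed_prefix("abcd33-20120916"))
# ...but a different machine name makes the whole parse fail.
print(parse_with_fixed_prefix("def20-20120916"))
```

Every new machine name would need its own format string, which is the flexibility problem described above.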

I wound up using a two-phase system to solve this. First, you had to give me a pattern which matched using glob rules. In this scheme, * means "anything" like what you get in a shell (or the old MS-DOS rules, blech), so you could say the prefix would be "/path/to/backups/log.*/". Then I'd scan through anything matching that (including subdirectories and the files in them) and apply it to phase two.
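Phase one could be sketched in Python roughly like this, using the standard glob and os modules. The candidate_files name is invented for the example, and the real tool may have walked the tree differently.

```python
import glob
import os

def candidate_files(prefix_pattern):
    """Yield every file at or below anything matching the glob pattern."""
    for top in sorted(glob.glob(prefix_pattern)):
        if os.path.isfile(top):
            # The pattern matched a plain file directly.
            yield top
            continue
        # The pattern matched a directory: descend into it.
        for dirpath, dirnames, filenames in os.walk(top):
            dirnames.sort()  # visit subdirectories in a stable order
            for name in sorted(filenames):
                yield os.path.join(dirpath, name)
```

Calling candidate_files("/path/to/backups/log.*/") would then produce every file under each matching directory, ready to be handed to phase two.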

Phase two got pretty messy. Not only did it use regular expressions, but it forced you to actually use the "named capturing group" scheme. This is where you say "(?P<year>[[:digit:]]{4})" or something like that. It still matches things in the usual way, but the program can then go look it up based on a name instead of the usual 0, 1, 2, ... thing. This was important because I couldn't guarantee that it would always be year, month, and then day. Using numbers would have forced that upon me.
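Here is a rough Python sketch of the named-group idea, applied to the three filename styles mentioned earlier. The patterns and the extract_ymd helper are illustrative guesses, not the originals. Python's re module does not understand POSIX classes like [[:digit:]], so the sketch uses [0-9] instead.

```python
import re

# One illustrative pattern per naming scheme; the real ones lived in config.
PATTERNS = {
    "dashed": r"foo-(?P<year>[0-9]{4})-(?P<month>[0-9]{2})-(?P<day>[0-9]{2})-[0-9]{2}\.gz$",
    "packed": r"bar(?P<year>[0-9]{4})(?P<month>[0-9]{2})(?P<day>[0-9]{2})$",
    "sharded": r"blah/(?P<year>[0-9]{4})/(?P<month>[0-9]{2})(?P<day>[0-9]{2})$",
    # A hypothetical day-first scheme: group order differs, names do not.
    "day_first": r"(?P<day>[0-9]{2})-(?P<month>[0-9]{2})-(?P<year>[0-9]{4})",
}

def extract_ymd(pattern, path):
    """Return (year, month, day) as integers, or None if the path doesn't match."""
    match = re.search(pattern, path)
    if match is None:
        return None
    # Lookup is by name, so the order of the groups in the pattern is irrelevant.
    return (
        int(match.group("year")),
        int(match.group("month")),
        int(match.group("day")),
    )
```

The same extract_ymd call handles both the year-first and day-first patterns, which is the point of using names rather than positions.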

Once I had a year, month, and day, it was a relatively simple matter to hand it to a date library and get an internal representation which let me do calculations relative to "now". This gave me the age as an integer number of days, and that was used to match buckets as described above.
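With Python's datetime module standing in for "a date library", the age calculation could be sketched as below. The age_in_days name and the optional today parameter (handy for testing) are assumptions of this example.

```python
from datetime import date

def age_in_days(year, month, day, today=None):
    """Return how many whole days ago the given calendar date was.

    Raises ValueError if the numbers do not form a real date
    (for example month 13), which is worth catching for junk filenames.
    """
    if today is None:
        today = date.today()
    return (today - date(year, month, day)).days
```

A backup dated 2012-09-16 comes out as exactly 90 days old when "now" is 2012-12-15, for instance.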

Imagine what this config file had to look like to make all of this bucket stuff work. It was something like this:

bucket:
  minimum_days: 0
  maximum_days: 90
  retain_type: ALL
 
bucket:
  minimum_days: 90
  maximum_days: 180
  retain_type: PER_WEEK
  retain_quantity: 1
 
bucket:
  minimum_days: 180
  maximum_days: 365
  retain_type: PER_MONTH
  retain_quantity: 1
 
bucket:
  minimum_days: 365
  # no maximum_days, implying "infinity"
  retain_type: NONE

So now, imagine all of this, and you have a chunk just like it for every group of files. It was a globbing pattern, a really funky regex, and a completely nutty bucket definition and retention scheme.
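As a sketch of what applying one bucket's retain_type might involve, consider the Python below. It assumes that "one per week" means one per ISO calendar week, that "one per month" means one per calendar month, and that the newest files in each group are the ones kept. None of those details are spelled out above, and select_keepers is a made-up name.

```python
from collections import defaultdict

def select_keepers(dated_paths, retain_type, retain_quantity=1):
    """Given (date, path) pairs from one bucket, return the set of paths to keep."""
    items = sorted(dated_paths)  # oldest first
    if retain_type == "ALL":
        return {path for _day, path in items}
    if retain_type == "NONE":
        return set()
    if retain_quantity < 1:
        raise ValueError("retain_quantity must be at least 1")

    if retain_type == "PER_WEEK":
        # Group by (ISO year, ISO week number).
        def group_key(day):
            return tuple(day.isocalendar()[:2])
    elif retain_type == "PER_MONTH":
        def group_key(day):
            return (day.year, day.month)
    else:
        raise ValueError("unknown retain_type: %r" % (retain_type,))

    groups = defaultdict(list)
    for day, path in items:
        groups[group_key(day)].append(path)

    keep = set()
    for paths in groups.values():
        # Assumption: keep the newest retain_quantity files in each group.
        keep.update(paths[-retain_quantity:])
    return keep
```

Anything in the bucket that is not in the returned set is a candidate for deletion.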

All of this actually worked. There were other things I haven't talked about to make it plug into the existing parallel processing and monitoring frameworks, and I got all of that working, too.

Want to know what a corporate coding job is like? It's like that. You didn't create the original mess and you can't keep it from happening again, but you have to write something to clean it up anyway.

Have a nice day. Go grab a free soda.
