
Monday, October 1, 2012

Using database dumps when queries are forbidden

I once built a system which could find out if servers weren't being backed up properly. It worked by comparing the list of machines which were supposed to have backup service to the ones which were actually receiving it. It was just a bunch of set logic, but I'm sure it looked like magic to some people.

It was simple enough: if a given server had a "backup" SKU, then it was in the set of machines which should have been backed up. I would then make sure it had a successful backup completion within the past two weeks. If both of those were true, then everything was fine, and that was it.
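
Boiled down to a sketch in Python, with made-up server names standing in for what really came out of database queries, the set logic was nothing fancier than this:

    # Hypothetical data; the real sets came from queries, not literals.
    should_be_backed_up = {"web01", "db02", "mail03"}   # servers carrying the "backup" SKU
    recently_backed_up = {"web01", "mail03"}            # successful run within the past two weeks

    # Anything that should be backed up but hasn't been recently needs a closer look.
    missing = should_be_backed_up - recently_backed_up
    print(missing)   # {'db02'}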

Where it got interesting was all of those other possibilities which existed. You could have the SKU and yet have no backups on file. In that case, I added some more logic so the alert would only fire once the machine had been around for at least two weeks. That way, brand new customers wouldn't create these alerts, since backup setup was (and probably still is) a manual process. Yes, actual people had to log in to those machines, install RPMs and stuff, and then twiddle things in the backup management software. This happened continuously and there was a team which did nothing but work through the backlog. But I digress.

Anyway, a new machine with the SKU and no backups was left alone for those two weeks. After that, it would throw a "backup never started" alert. This was different from the case where backups had once succeeded but then stopped showing up, which fired a "backups are stale" alert.

Then there was the entire notion of hosts being suspended. Yet another database lookup would tell me who was in that set. If they were suspended, no alerts were to be fired, unless, of course, they had been suspended for too long. Then it was supposed to fire a "suspended too long" alert.

Does this sound easy? It wasn't. First, the data lived in a bunch of separate databases. Some were Postgres, some were MySQL. Later on, some of it was evicted from Postgres and moved into MS SQL Server. Ever have to talk to all three from a single program? I have. It isn't pretty.
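
If you've never had the pleasure, here's a rough sketch (not the actual code) of what juggling three drivers looks like in Python. Hostnames, credentials, table names, and the particular ODBC driver are all made up; the real details depend on what's installed and who's guarding the passwords:

    import psycopg2   # Postgres
    import pymysql    # MySQL
    import pyodbc     # MS SQL Server, via whatever ODBC driver you can get working

    # Hypothetical hosts, credentials, and schemas.
    pg = psycopg2.connect(host="pg.example.com", dbname="billing", user="report", password="...")
    my = pymysql.connect(host="mysql.example.com", user="report", password="...", database="backups")
    ms = pyodbc.connect("DRIVER={FreeTDS};SERVER=mssql.example.com;DATABASE=tickets;UID=report;PWD=...")

    # Three connections, three dialects, three sets of quirks.
    for conn, query in ((pg, "SELECT server_id FROM backup_skus"),
                        (my, "SELECT server_id, finished_at FROM backup_runs"),
                        (ms, "SELECT server_id FROM suspensions")):
        cur = conn.cursor()
        cur.execute(query)
        rows = cur.fetchall()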

This basically meant using JOINs was out of the question. It wasn't the end of the world, though, since all of the tests were binary situations. Either a server had a certain attribute or it did not. They were things like "has SKU", "backed up in past 2 weeks", "backed up in past 4 weeks", "is suspended", and so on.

I just turned all of those into bits and made a truth table. This way, every single possibility had a specific outcome, and it was easy to show to other people. The inputs were all well-defined by way of various database queries, so there was no ambiguity about how those came about, either. Assuming we actually had answers for all of the "bits", it was a simple matter to look up the answer and act accordingly.
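
Here's a trimmed-down sketch of the idea. The bit names are made up and only a handful of rows are shown; the real table enumerated every combination:

    # Each bit is the answer to one well-defined database query.
    HAS_SKU       = 1 << 0   # server carries the "backup" SKU
    BACKED_UP_2W  = 1 << 1   # successful backup within the past two weeks
    BACKED_UP_4W  = 1 << 2   # successful backup within the past four weeks
    OLDER_THAN_2W = 1 << 3   # machine has existed for at least two weeks
    SUSPENDED     = 1 << 4   # account is currently suspended

    # Truth table: every combination of bits maps to exactly one outcome.
    OUTCOMES = {
        HAS_SKU | BACKED_UP_2W | BACKED_UP_4W | OLDER_THAN_2W: "ok",
        HAS_SKU | BACKED_UP_4W | OLDER_THAN_2W:                "backups are stale",
        HAS_SKU | OLDER_THAN_2W:                               "backup never started",
        HAS_SKU:                                               "ok: new machine, still in the grace period",
        HAS_SKU | OLDER_THAN_2W | SUSPENDED:                   "ok: suspended",
        # ... and so on, for every remaining combination
    }

    def classify(bits):
        # Anything not in the table means a row is missing and a human should look.
        return OUTCOMES.get(bits, "unhandled combination")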

Further complicating matters was that we weren't supposed to run queries against some of the database servers. They were worried that running arbitrary SELECTs would clog up the works and keep them from logging things. Apparently they were relatively underpowered and could just barely keep up with normal traffic, and analysis like what I was doing was right out.

However, apparently running database dumps was okay. It seemed they had allowed that on the basis of already permitting backups to happen, so having yet another dump every now and then was no big deal. As a result, my data access worked like this: every couple of days, I would run a dump on all of the reporting systems. Then I would copy them to our analysis machine, and I'd twiddle them so they would create tables with slightly different names. Then I'd load them into our local database, causing it to sag under the load.
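
For one of the MySQL sources, the dance looked roughly like this. Host, database, and table names are made up, credentials are omitted, and the Postgres and MS SQL sources needed their own dump tools:

    import subprocess

    DUMP = "/tmp/backups.sql"

    # Dump the source database: dumps were allowed, ad-hoc SELECTs were not.
    with open(DUMP, "w") as out:
        subprocess.run(["mysqldump", "-h", "reporting-db.example.com", "-u", "report",
                        "backups"], stdout=out, check=True)

    # "Twiddle" the dump so it creates a staging table instead of clobbering the live one.
    subprocess.run(["sed", "-i", "s/`backup_runs`/`backup_runs_staging`/g", DUMP], check=True)

    # Load the rewritten dump into the local analysis database (and watch it sag).
    with open(DUMP) as f:
        subprocess.run(["mysql", "analysis"], stdin=f, check=True)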

Once this finished, I'd do a quick double rename on each table to get the old ones out of the way and get the new ones in place. Then I'd poke my tools to look for new alerts with this fresh data and leave it alone for a couple of days. After that point, I'd do it again. It was slow, stupid, and manual. It was also embarrassing.
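
The swap was the one clean part. Assuming a MySQL-style local database (and made-up table names again), the double rename can happen in a single statement, so nothing ever sees a missing table:

    import pymysql

    local = pymysql.connect(host="localhost", user="analysis", database="analysis")
    with local.cursor() as cur:
        # Push the live table aside and slide the freshly loaded one into place at once.
        cur.execute("RENAME TABLE backup_runs TO backup_runs_old, "
                    "backup_runs_staging TO backup_runs")
        # Then the old copy can go away.
        cur.execute("DROP TABLE backup_runs_old")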

If you're thinking "this meant you recopied the entire db every time, including stuff that didn't change", you're right! There's no "diff" logic when you have to resort to such blunt and stupid methods. We burned human time, CPU time, bandwidth, and disk space over and over again just because people were afraid of soiling their little database machine.

Even the notion of just setting up a follower was right out. It should have been possible to add a secondary database machine which kept itself in sync using whatever method, and then I could have run my queries against it. That would mean any lagginess would affect that (read-only) follower and nothing else. They didn't want to maintain it.

In the end, nobody really used it to track down issues, and it went away. All of the work I had done after noticing servers getting hosed because their backups had quietly stopped was for nothing.