Monday, August 13, 2012

Learn from my MySQL woes (and discover a possible attack)

If you've spent any time using MySQL's C API, you've hopefully encountered the mysql_stmt family of functions. They collectively provide a way to generate queries without worrying about escaping potentially evil content. When you use one of these things, you first hand it a query like "SELECT audio_fn FROM calls WHERE call_id = ?". Then you bind an argument for every ? in your query.
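
To make that concrete, here's a minimal sketch of the flow using the query above. The connection setup is assumed, error handling is trimmed, and fetch_audio_fn is just a made-up wrapper. Note that every exit path has to remember to call mysql_stmt_close() -- forgetting that is what this whole post is about.

#include <mysql/mysql.h>
#include <cstring>

bool fetch_audio_fn(MYSQL* db, long long call_id) {
  MYSQL_STMT* stmt = mysql_stmt_init(db);
  if (stmt == NULL) return false;

  const char* query = "SELECT audio_fn FROM calls WHERE call_id = ?";
  if (mysql_stmt_prepare(stmt, query, strlen(query)) != 0) {
    mysql_stmt_close(stmt);
    return false;
  }

  // One MYSQL_BIND per "?" in the query.
  MYSQL_BIND param;
  memset(&param, 0, sizeof(param));
  param.buffer_type = MYSQL_TYPE_LONGLONG;
  param.buffer = &call_id;

  if (mysql_stmt_bind_param(stmt, &param) != 0 ||
      mysql_stmt_execute(stmt) != 0) {
    mysql_stmt_close(stmt);
    return false;
  }

  // ... bind result buffers and fetch the row here ...

  mysql_stmt_close(stmt);   // frees the statement on the server, too
  return true;
}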

This is great, and with some work, it's possible to hide this behind an interface which behaves the way you want. I did exactly this last summer when I started building things for myself again. Now that my brain and all of its output belonged to me and not my employer, I started writing things, and as part of that, created some libraries to do common tasks. Database access was one of them, and when I wrote the MySQL part, I used the prepared statement interface.

Life was good. I wrote a bunch of things which came and went and did whatever they needed to do. Then, yesterday, I started something new. In the interest of making my scanner project survive certain Comcast-induced outages, I wrote something which replicates calls to another system. It's a server and client pair, and they are persistent processes which just throw RPCs back and forth.

They also keep their database connections around the whole time, and that's where things got messy. Apparently, I had never used anything which generated this much database traffic on a single connection before, and this uncovered a new failure mode.

I woke up this morning to a bunch of error messages from my RSS/Atom feed fetcher's cron job. It was complaining that it couldn't access the database.

mysql_stmt_prepare failed: Can't create more than max_prepared_stmt_count statements (current value: 16382)

I thought that was a little strange. I haven't touched that code in a while. The only thing which should break just sitting there by itself would be filling up the disk, and that hasn't happened. Then it hit me: I had added that new call migration server process. Looking at its logs confirmed it: it was hitting the same problem.

So, not only will this thing screw up attempts at SQL queries from the same connection, it also consumes server-side resources and will cause other queries to break as well! You read that correctly. I use totally different MySQL user accounts for each of my "roles", and in this case, one of them managed to totally deny service to anything else using prepared statements on the server.

There's more, too. fred (my feed fetcher) and the call migration tool use two completely separate databases. Apparently there is just a single pool of prepared statements in the server, and if someone chews them up first, forget about anything else happening at the same time.
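
If you want to see how close a server is to that wall, you can compare the server-wide counter (Prepared_stmt_count) against the limit from the error message (max_prepared_stmt_count). Here's a rough sketch of checking both through the C API -- the connection details are placeholders, and you can just as easily run the two SHOW statements from the mysql client.

#include <mysql/mysql.h>
#include <cstdio>

static void show(MYSQL* db, const char* sql) {
  if (mysql_query(db, sql) != 0) {
    fprintf(stderr, "%s: %s\n", sql, mysql_error(db));
    return;
  }
  MYSQL_RES* res = mysql_store_result(db);
  if (res == NULL) return;
  MYSQL_ROW row;
  while ((row = mysql_fetch_row(res)) != NULL) {
    printf("%s = %s\n", row[0], row[1]);
  }
  mysql_free_result(res);
}

int main() {
  MYSQL* db = mysql_init(NULL);
  if (mysql_real_connect(db, "localhost", "someuser", "somepass",
                         NULL, 0, NULL, 0) == NULL) {
    fprintf(stderr, "connect failed: %s\n", mysql_error(db));
    return 1;
  }
  show(db, "SHOW GLOBAL STATUS LIKE 'Prepared_stmt_count'");   // current count, server-wide
  show(db, "SHOW VARIABLES LIKE 'max_prepared_stmt_count'");   // the ceiling
  mysql_close(db);
  return 0;
}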

I pondered this for a moment and then realized... uh oh, what about the other side of this? There's a client program which has been running just as long and has generated at least as many SQL queries. If it has also screwed up MySQL on that host, then the actual radio monitoring process will have lost its connection to the database too!

I jumped into my trunked radio logger's logs and saw utter disaster:

I0813 054831 28052 scan_core.cc:276] New call on 853687500 - talkgroup 16
I0813 054835 28052 scan_core.cc:183] Ending call: 853687500
mysql_stmt_prepare failed: Can't create more than max_prepared_stmt_count statements (current value: 16382)
failing query: [...]
E0813 054835 28052 scan_core.cc:228] Saving call to MySQL failed

You know the "F7U12" meme? That was me around 11:45 this morning.

I checked: 667 calls had been dropped over a six-hour span. I immediately killed the client-side pusher thing and it recovered...

I0813 114627 508 scan_core.cc:276] New call on 853687500 - talkgroup 16
I0813 114635 508 scan_core.cc:183] Ending call: 853687500
I0813 114635 508 scan_core.cc:276] New call on 852300000 - talkgroup 16
I0813 114644 508 scan_core.cc:183] Ending call: 852300000

... but now I had another problem. I actually had all of the MP3s which were recorded for those calls. I was just lacking database rows for them. I could go back and reconstruct the metadata and INSERT those calls, but now they'd be out of order. Here's why.

The calls table had been growing normally up until 5:48 this morning. Then it stopped getting new rows, even though the MP3s were still being saved. Then when I cleaned up the database mess, it started getting new rows again. I use an "auto_increment" column, so as far as the table is concerned, 11:46 came right after 5:48 with nothing in between.

Even if I managed to recover the 667 calls which had not made it into the database, they'd be out of order in the timeline, both in terms of chronological order and call_id order. It would be a horrible mess. Now what?

I came up with another plan on the spot. I needed to tear down my production radio logger and stand up another on a test system. It would save the calls and metadata over there until I could clean up this mess, and then I could add them back to the production system. Fortunately, I have one of those $20 ezcap "RTLSDR" sticks, and it does a decent job tracking the Santa Clara city system. I got it going and then killed production at 12:04. Now I could catch my breath and actually think about the problem without having more calls pile up that would just need to be fixed later.

While the backup logger was running, I now had to identify and attack my problem. There were calls up to 5:48 AM which were fine and must not be touched. Then there were the 667 calls immediately following that which needed to be reconstructed and added to the table at the next call id. This meant getting the calls which had sneaked in from 11:46 to 12:04 out of the way and then re-adding them later.

I got lucky here. Call files currently look like this: 16-1344862111.mp3. That's the talkgroup ID number and the start time as a Unix timestamp. I can get the end time just by doing stat() on the file, since the mtime will be within a couple of seconds of when they stopped talking and the call was officially shut down. For these purposes, that's close enough. All I needed beyond that was the frequency.

Well, as you may have noticed from those log extracts above, I announce the frequency and talkgroup when a new call starts, and every log entry includes the date and time. Given all of that, it was possible to take each of the 667 missing calls and dig up its matching log entry. That gave me the frequency, and a quick mapping across to the actual numeric ID used in the database gave me everything I needed for a row. I wrote a quick program to do this and emit INSERT commands.
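
It boiled down to something like this sketch. The filename format and the stat() trick are exactly as described; the column names other than audio_fn, and the way the frequency ID arrives, are stand-ins for my actual schema and log scraping.

#include <sys/stat.h>
#include <cstdio>
#include <string>

// freq_id is the numeric frequency ID dug out of the matching
// "New call on ... - talkgroup ..." log line for this start time.
void emit_insert(const std::string& fn, long freq_id) {
  int talkgroup = 0;
  long start_time = 0;
  if (sscanf(fn.c_str(), "%d-%ld.mp3", &talkgroup, &start_time) != 2) {
    return;   // not a call file
  }

  // The mtime is within a couple of seconds of when the call ended.
  struct stat st;
  if (stat(fn.c_str(), &st) != 0) {
    return;
  }

  // Filenames are just digits, a dash and ".mp3", so no SQL escaping worries.
  printf("INSERT INTO calls (talkgroup, freq, start_time, end_time, audio_fn) "
         "VALUES (%d, %ld, %ld, %ld, '%s');\n",
         talkgroup, freq_id, start_time, (long)st.st_mtime, fn.c_str());
}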

I copied the 11:46 - 12:04 call metadata out of the way, then deleted those rows from the table and frobbed MySQL to reset the auto_increment number back to where it needed to be. Then I ran the 667 INSERT commands. At this point, production was caught up to 11:46 AM. Now I needed to put back the 18 minutes of calls which I had just evicted.
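
For the curious, the eviction and frobbing amount to a DELETE past the last good call_id plus an ALTER TABLE ... AUTO_INCREMENT, which is the usual way to reset the counter. A rough sketch through the same C API, with the cutoff as a hypothetical parameter:

#include <mysql/mysql.h>
#include <cstdio>

static bool run(MYSQL* db, const char* sql) {
  if (mysql_query(db, sql) != 0) {
    fprintf(stderr, "%s: %s\n", sql, mysql_error(db));
    return false;
  }
  return true;
}

// last_good_id: the call_id of the final row written before the 5:48 failure.
// Since nothing between 5:48 and 11:46 got rows, everything past that id is
// exactly the 11:46 - 12:04 batch, which has already been dumped for later.
bool evict_and_reset(MYSQL* db, unsigned long last_good_id) {
  char buf[128];

  snprintf(buf, sizeof(buf),
           "DELETE FROM calls WHERE call_id > %lu", last_good_id);
  if (!run(db, buf)) return false;

  // Make the next INSERT pick up right where the good data left off.
  snprintf(buf, sizeof(buf),
           "ALTER TABLE calls AUTO_INCREMENT = %lu", last_good_id + 1);
  return run(db, buf);
}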

Since I had kept a SELECT dump of all of those right before I DELETEd them, I just wrote a really quick shell script to flip through those lines and emit more INSERT commands. I "sourced" that in the mysql client and now THEY were back.

Now I needed to bring over the calls from my temp logger. That was just a matter of copying over all of the audio data, and then SELECTing all of the calls from its table, just like I had done moments earlier for 11:46 - 12:04. Then I pushed them into the production table, and now production was caught up to 13:32.

Finally, I restarted the production logger, and things started running normally again. New calls were again going straight into the production database and out to clients through the usual web page.

Only now could I start to play with my original MySQL problem to see what had happened. It didn't take long to discover that mysql_stmt_init needs to be paired with mysql_stmt_close. For the moment, I've just dropped one of those into my function at every place where I could possibly return from it, but that's a nasty, evil, disgusting stopgap hack. It's bad because it's too easy to miss one of them now or in the future and re-introduce the leak.

My eventual fix will be to roll up that MYSQL_STMT inside a real C++ class which will always call mysql_stmt_close() in its destructor. That way, upon exiting the function which uses one of these things, the object will automatically be destroyed and will free those resources.
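
Something like this minimal sketch is the shape of it. The class name is made up; the point is that the destructor always runs, no matter which return path gets taken.

#include <mysql/mysql.h>

class ScopedStmt {
 public:
  explicit ScopedStmt(MYSQL* db) : stmt_(mysql_stmt_init(db)) {}

  ~ScopedStmt() {
    if (stmt_ != NULL) {
      mysql_stmt_close(stmt_);   // also releases the server-side statement
    }
  }

  MYSQL_STMT* get() const { return stmt_; }

 private:
  // No copies: two owners would mean a double close.
  ScopedStmt(const ScopedStmt&);
  ScopedStmt& operator=(const ScopedStmt&);

  MYSQL_STMT* stmt_;
};

The calling code just declares a ScopedStmt at the top of the function, uses get() wherever it needs the raw pointer, and every early return cleans up after itself.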

I also need to write something which will log call metadata to a much more accessible format in the event that MySQL goes away. It's a given that databases will die from time to time, but as long as I still have my filesystem, I can still log radio traffic.

That's the thing about radio traffic: once it's gone, it's gone. You only get one chance to log it. Even if you can't serve it up to clients due to a failure in your plumbing, you still should go to great lengths to capture it somehow. There are no "do overs" in this realm.

Obviously, there's much to be done here. I hope sharing this tale and the mess I inadvertently inflicted upon myself will help others avoid it themselves. If you use mysql_stmt_init, you owe it to yourself to audit your source for matching mysql_stmt_close calls.

... ... ...

So, that's my story of failure from this morning. Now, I'll put on my "evil Rachel hat" and look at the denial of service implications here. Server A had fred's fetcher and the call server. Server B had the call logger and the call pusher client. They are running different builds of MySQL from two different Linux distributions, and yet both of them hit the same problem.

In every case, the database accesses happened from separate processes, and thus separate connections. They were connecting as distinct MySQL accounts, and (in the case of fred vs. the call server) were using different databases and tables. That this happened anyway suggests there is just one big global pool per server and no compartmentalization between connections, users, or even databases.

Whatever the root cause in MySQL's design might be, it was possible for one of them to totally choke out any others just by getting there first. I'm sure this could be used to bring a lot of grief upon other users of a shared system. I'm thinking of Plesk systems and other places where a bunch of separate customers get to talk to the same MySQL daemon. Just get a $5 hosting account on the box you want to hose over and write something which opens a bunch of statements and goes to sleep() forever. Then chuckle in glee as every site which uses prepared statements falls over simultaneously.

I haven't actually tried to duplicate this on purpose... yet. Even when I do, it'll happen on one of my testing machines. I'll leave the malicious angle to someone else.