Software, technology, sysadmin war stories, and more.
Thursday, December 27, 2012

Deconstructed SQL

I've recently spent some time building things which translate incoming network requests into database queries and then back to outgoing network responses. The actual implementations vary but the same general principle applies: get a bunch of stuff from the request, map it into some SQL, then map the results back into the response.

Here's a common situation: perhaps you want to make it so that someone can do an HTTP GET for /some_list to get data from you. Internally, your program does a SELECT ... FROM ... WHERE ... and gets back a bunch of rows from whatever database you might be using. Then you take that and build up a JSON response: { "thing" : "value", "thing2" : "value2" } ... and so on.

At the very least, you have to write the SQL SELECT statement and some simple stuff to map columns from the result set into named components within the JSON stream. Exactly how this works is a function of your language and any frameworks you might be using.
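To make the repetition concrete, here's a minimal sketch of that kind of hand-written plumbing in Python. The endpoint, schema, and field names are made up for illustration; the point is that every column gets mapped into JSON by hand, once per handler:

```python
import json
import sqlite3

# Hypothetical handler for GET /some_list: run a SELECT, then
# map each result column into a named JSON field by hand.
def handle_some_list(db):
    rows = db.execute(
        "SELECT asset_id, title FROM assets WHERE title IS NOT NULL"
    ).fetchall()
    return json.dumps(
        {"assets": [{"asset_id": r[0], "title": r[1]} for r in rows]}
    )
```

Every other endpoint ends up looking almost exactly like this, differing only in the query and the field names.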

I suspect a lot of people spend a lot of time doing little more than this. They find themselves writing plumbing code which is just slightly different depending on which URL endpoint or RPC handler has been invoked. I know I've done it many times in many different situations, and it seems like there should be a way to reduce it. It looked like the kind of thing which could be generated from a very strict definition in order to reduce implementation errors.

Tonight, this stuff finally gelled and I started messing around with the concept in a raw form: what if I had something which could represent the eventual SQL query while also conveying enough metadata to make the other stuff happen safely? I imagined an ASCII protocol buffer and just started typing up what I thought might work without caring about the larger implications of such a beast.

I wound up with something like this:

db_query <
  name: "asset_owners"
  field <
    column: "assets.asset_id"
    integer_output: "asset_id"
  >
  field <
    column: "assets.title"
    string_output: "asset_title"
  >
  field <
    column: "users.username"
    string_output: "owner"
  >
  table: "assets"
  join <
    table: "users"
    using_column: "user_id"
  >
>

Is that evil or what? I took a simple SQL statement and turned it into a bunch of explicit messages. I looked at that and figured it wouldn't be too difficult to make something actually generate a statement from it. After coming up with a message spec so the above "db_query" could be parsed, I wrote something to dig through it and build SQL. That gave me this:

SELECT assets.asset_id, assets.title, users.username FROM assets JOIN users USING (user_id)
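The statement builder itself doesn't have to be anything fancy. Here's a sketch of the idea in Python, with the parsed db_query message modeled as a plain dict (the real thing would come out of a protocol buffer parser):

```python
# Build a SELECT statement from a parsed db_query spec.
# The dict layout here is an assumption standing in for the
# actual parsed message.
def build_select(spec):
    cols = ", ".join(f["column"] for f in spec["fields"])
    parts = ["SELECT %s FROM %s" % (cols, spec["table"])]
    for j in spec.get("joins", []):
        parts.append("JOIN %s USING (%s)" % (j["table"], j["using_column"]))
    return " ".join(parts)
```

Feeding it the asset_owners spec from above yields exactly the statement shown.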

Again, this isn't anything really special. With that working, the next thing to do was to make it take raw results from the database and generate some reasonable JSON output. I wound up with this:

{ "asset_owners" : [ { "asset_id" : 1, "asset_title" : "Test DVD", "owner" : "Rachel" }, { "asset_id" : 2, "asset_title" : "Test Book", "owner" : "Rachel" } ] }

Notice that my strings and integers are treated differently. There would also be escaping happening here if any of the strings needed it.
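The output side can reuse the same spec: each field's integer_output or string_output name says both what to call the value and how to treat it. A sketch, again with the spec as a plain dict, and with json.dumps handling the string escaping:

```python
import json

# Map raw result rows back into JSON using the output names and
# types from the db_query spec. Fields arrive in the same order
# as the columns in the generated SELECT.
def rows_to_json(spec, rows):
    out = []
    for row in rows:
        obj = {}
        for field, value in zip(spec["fields"], row):
            if "integer_output" in field:
                obj[field["integer_output"]] = int(value)
            else:
                obj[field["string_output"]] = str(value)
        out.append(obj)
    return json.dumps({spec["name"]: out})
```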

Once this works, it's not much of a stretch to get every other common query running this way. The specification can grow to encompass those bits of the language which are needed.

Finally, for those truly wild queries which can't be translated into such basic terms, they can still be handled the "old fashioned way" just as they are now. The point of a scheme like this is to make easy things easy without removing the ability to handle weird things which are complicated.

I don't know whether to be proud of this or terribly embarrassed. I've probably reinvented at least three or four wheels along the way.

I've written about it in the hopes it might be inspiring or otherwise useful to someone out there. If I've done something really silly or obvious, feel free to write in. Just be civil about it, okay?