Software, technology, sysadmin war stories, and more.
Monday, December 5, 2011

Embrace the blob and "NoSQL" will make sense

A recent chat I had with a friend has convinced me that the so-called NoSQL world needs to do a lot more education before more people will "get it". It seems to scare certain individuals away because it's just so different. This should not be the case.

We started by talking about the case of having a customer account tracking system which has a tightly-integrated ticketing system. In the past, we both worked at places which had built such a beast. They existed in traditional SQL-based databases, and had breathtakingly long queries which would JOIN everything to make it happen.

These places had effectively written their own ORM. By creating an instance of "account 1234", it would run one or more queries to fully populate that object with the results. It would also do the same thing if you requested "ticket 1205-1234". This would happen on demand as you loaded pages representing different customers or tickets.

I made the case that you could probably make this work with just a key-value store. You'd want it to be a bit more than the simplest case like memcached, but you could make it go. If you could say "hey you, set X equal to this blob of data", you'd be in business. This is where the relative obscurity of "NoSQL" came in.
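That "set X equal to this blob" interface is small enough to sketch in a few lines. Here's a toy in-memory version in C++ (the class name and methods are mine, purely illustrative); real stores like memcached expose essentially this same get/set surface over the network:

```cpp
#include <map>
#include <optional>
#include <string>

// A toy key-value store: keys and values are both opaque strings.
// The store neither knows nor cares what is inside a value.
class BlobStore {
public:
    void set(const std::string& key, const std::string& blob) {
        data_[key] = blob;
    }
    std::optional<std::string> get(const std::string& key) const {
        auto it = data_.find(key);
        if (it == data_.end()) return std::nullopt;
        return it->second;
    }
private:
    std::map<std::string, std::string> data_;
};
```

You'd key things however your app thinks about them: "account/1234", "ticket/1205-1234", and so on.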

Ticket object built from SQL joins

He worried that it would never work without JOINs. After all, in that existing system, it was lousy with them, so if you can't do them, how can it possibly work? You have to get the ticket subject from one table, all of the comments from another, its status from a third, the queue from a fourth, and it just keeps going like this. In that world, a ticket was nothing but the fusion of a bunch of rows from a bunch of tables.

I said, okay, stop. Look at it as what it is in memory after it runs that SELECT and before it drops it at the end of the request. Odds are, it's been marshaled into some kind of nice data structure. The subject is a string, the queue name is another, all of those comments are in some kind of array, and its status might be an enumerated value.
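Concretely, that in-memory shape might look something like this (field names are my guesses at what such a system would hold, not anything from a real schema):

```cpp
#include <string>
#include <vector>

// One plausible in-memory shape for a ticket after the SELECTs have
// run: strings, an array of comments, an enumerated status.
enum class TicketStatus { Open, Pending, Resolved };

struct Ticket {
    std::string refnum;                  // e.g. "1205-1234"
    std::string subject;
    std::string queue;
    TicketStatus status;
    std::vector<std::string> comments;   // one entry per comment row
};
```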

Ticket object built from opaque blob

Now imagine what would happen if you could just take that object and drop a snapshot of it into storage somewhere. The next time you needed it, you could just load it back up and keep on going, right? You wouldn't need to do all of that set logic with JOINs again. It would just pop back to life just as it had been before.

Suddenly, the lack of joining is no longer a problem. Your ticket exists as a blob of data which is opaque to the database but is meaningful to your application. Your database becomes a dumb workhorse, in which you say "give me X" or "set X equal to this blob of stuff".
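The snapshot-and-restore idea can be sketched as a simple round trip. The one-field-per-line encoding below is purely illustrative (a real system would use JSON, protocol buffers, or the like), but it shows the shape of it: flatten the object into an opaque string, store it, parse it back later:

```cpp
#include <sstream>
#include <string>
#include <vector>

// A toy ticket and a toy encoding: refnum, then subject, then one
// line per comment.  The database only ever sees the final string.
struct Ticket {
    std::string refnum;
    std::string subject;
    std::vector<std::string> comments;
};

std::string to_blob(const Ticket& t) {
    std::ostringstream out;
    out << t.refnum << '\n' << t.subject << '\n';
    for (const auto& c : t.comments) out << c << '\n';
    return out.str();
}

Ticket parse_from_blob(const std::string& blob) {
    std::istringstream in(blob);
    Ticket t;
    std::getline(in, t.refnum);
    std::getline(in, t.subject);
    std::string line;
    while (std::getline(in, line)) t.comments.push_back(line);
    return t;
}
```

Store `to_blob(t)` under the ticket's key, and `parse_from_blob` brings it back to life without a single JOIN.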

Granted, there are some catches here. What if you want to see every ticket with the exact word "xyzzy" in its subject? In the SQL world, that's not a huge deal. You can run a somewhat-inefficient query like this:

SELECT TCKT_Refnum FROM TCKT_Ticket WHERE TCKT_Subject LIKE '%xyzzy%'

It'll run for a little while as it scans the entire table, and it will kick out the data you asked for. That's it -- you're done.

So now let's think of the same situation where the database treats things as an opaque blob. For one thing, you can't ask it about a field called TCKT_Subject, since it has no idea what that is. That means you also can't ask it to tell you when it's equal to "xyzzy". Finally, it can't give you the reference number (TCKT_Refnum), since again, it has no idea what that is or where to find it.

At this point, your app is on the hook for decoding things. You have to say "okay database, give me every one of these blobs". You then have to load each one and turn it into a meaningful object one at a time. Then you get to do something like this in your language of choice:

Ticket tckt = parse_from_blob(raw_db_result);
if (tckt.subject.find("xyzzy") != string::npos) {
  // do something with tckt.refnum
}

It's the same logic a SQL-based system already does, but now it's your problem. There are some non-trivial consequences here.
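Spelled out as a whole, the scan the app now owns looks something like this. The blob layout here is a toy "refnum, newline, subject" encoding, just enough to make the loop concrete; the function and field names are mine:

```cpp
#include <string>
#include <vector>

// Minimal decode for a toy blob layout: refnum on the first line,
// subject on the rest.
struct Ticket {
    std::string refnum;
    std::string subject;
};

Ticket parse_from_blob(const std::string& blob) {
    auto nl = blob.find('\n');
    return {blob.substr(0, nl), blob.substr(nl + 1)};
}

// The app-side equivalent of that LIKE '%xyzzy%' query: fetch every
// blob, decode each one, and filter client-side.
std::vector<std::string> find_in_subjects(const std::vector<std::string>& blobs,
                                          const std::string& needle) {
    std::vector<std::string> hits;
    for (const auto& raw : blobs) {
        Ticket t = parse_from_blob(raw);
        if (t.subject.find(needle) != std::string::npos)
            hits.push_back(t.refnum);
    }
    return hits;
}
```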

First, all of this data now has to leave the database and travel to some instance of the app which understands it so it can run the string processing on it. This might add some latency, and it definitely adds network traffic.

Second, only things built around your app's logic can understand what is in the database. You can't just connect to the database with some random query tool and hope to make any sense of it.

There are ways around this. You could write a small helper program that runs alongside your database instances and knows how to parse your data. You'd talk to it, and it would query the local instance over loopback. Then only the results would transit the network. If you have a bunch of instances, they can all run in parallel, too!
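The parallel part of that can be sketched with plain threads. In a real deployment each shard of blobs would live next to its own database instance on a different machine; here they are just slices of one vector, which is enough to show that only the matches cross back to the caller:

```cpp
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Each worker scans its own shard and only its matches are merged
// into the shared result.  A hypothetical sketch, not a real protocol.
std::vector<std::string> parallel_grep(
        const std::vector<std::vector<std::string>>& shards,
        const std::string& needle) {
    std::vector<std::string> hits;
    std::mutex mu;
    std::vector<std::thread> workers;
    for (const auto& shard : shards) {
        workers.emplace_back([&shard, &needle, &hits, &mu] {
            std::vector<std::string> local;
            for (const auto& blob : shard)
                if (blob.find(needle) != std::string::npos)
                    local.push_back(blob);
            std::lock_guard<std::mutex> lock(mu);
            hits.insert(hits.end(), local.begin(), local.end());
        });
    }
    for (auto& w : workers) w.join();
    return hits;
}
```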

You could also have a well-defined set of definitions for your "blob" logic and then keep a catalog of them. Your query tool could look up the right definition based on your database schema and use it to interpret things. You might even have a library which does this automatically.
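One way to picture that catalog is a registry mapping a schema name (which your tool would look up from the database's metadata) to a decoder for that blob format. Everything here, the names and the decoders alike, is hypothetical:

```cpp
#include <functional>
#include <map>
#include <string>

// A decoder turns an opaque blob into something human-readable.
using Decoder = std::function<std::string(const std::string&)>;

// The catalog: schema name -> decoder.  A generic query tool would
// consult this instead of hard-coding any one app's blob format.
std::map<std::string, Decoder>& decoder_catalog() {
    static std::map<std::string, Decoder> catalog;
    return catalog;
}

std::string describe_blob(const std::string& schema, const std::string& blob) {
    auto it = decoder_catalog().find(schema);
    if (it == decoder_catalog().end())
        return "(unknown schema: " + schema + ")";
    return it->second(blob);
}
```

Your app (or a shared library) registers a decoder per blob version, and any tool linked against the catalog can make sense of what's stored.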

I've found that it's this relocation of responsibilities which tends to scare people who are coming from a SQL world. They are used to just handing a query to the database and getting back a nicely-formatted table of results. They really don't want to do the processing on their side of the fence.

I'm here to say that it's not that bad. Some surprisingly large systems are built around this technique and manage to work rather well.