Software, technology, sysadmin war stories, and more.
Wednesday, February 8, 2012

Parse this, I dare you.

It seems I may have stumbled across another one of those problems that create a lot of yelling and screaming. You will probably also get to hear rather bold assertions which, when challenged, yield no results.

Here is the problem. It seems easy enough.

Take a line of characters and split it into four pieces. The first two pieces are always wrapped in double quotes ("), and the last two are not. Those first two pieces may contain any printable content (see the isprint(3) man page). You can't be sure what it will be.

The only real assurance you have is that any instance of " inside the actual first two pieces will be escaped, so it will appear as \". Likewise, the escape character itself, \, will show up as \\.

Oh, for what it's worth, the data won't be too long. If the whole thing goes past 1024 characters, I'd be surprised. If a line goes past 4096, I'm willing to assume it's garbage and can be ignored.

One example line might be this:

"abc def" "123 \"foo\" 456" blah blah

That should turn into four separate items:

abc def
123 "foo" 456
blah
blah

That's it.
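
For what it's worth, here is one way the rules above could be sketched in Python. This is only an illustration under some assumptions the problem statement does not spell out: the pieces are separated by exactly one space, the two unquoted pieces contain no spaces themselves, and any escape other than \" or \\ is treated as an error. The names split_line, read_quoted, and MAX_LINE are made up for this example.

```python
MAX_LINE = 4096  # lines longer than this are assumed to be garbage


def read_quoted(line, pos):
    """Read one double-quoted piece starting at line[pos].

    Returns (unescaped_text, index_just_past_the_closing_quote).
    """
    if pos >= len(line) or line[pos] != '"':
        raise ValueError("expected opening quote at offset %d" % pos)
    pos += 1
    out = []
    while pos < len(line):
        ch = line[pos]
        if ch == "\\":
            # Only \" and \\ are defined by the problem; rejecting any
            # other escape is an assumption of this sketch.
            if pos + 1 >= len(line) or line[pos + 1] not in '"\\':
                raise ValueError("bad escape at offset %d" % pos)
            out.append(line[pos + 1])
            pos += 2
        elif ch == '"':
            return "".join(out), pos + 1
        else:
            out.append(ch)
            pos += 1
    raise ValueError("unterminated quoted piece")


def split_line(line):
    """Split a line into [quoted, quoted, bare, bare]."""
    line = line.rstrip("\n")
    if len(line) > MAX_LINE:
        raise ValueError("line too long")
    first, pos = read_quoted(line, 0)
    # Assumption: exactly one space separates the pieces.
    if line[pos:pos + 1] != " ":
        raise ValueError("expected a space after the first piece")
    second, pos = read_quoted(line, pos + 1)
    if line[pos:pos + 1] != " ":
        raise ValueError("expected a space after the second piece")
    # Assumption: the two unquoted pieces contain no spaces.
    rest = line[pos + 1:].split(" ")
    if len(rest) != 2 or "" in rest:
        raise ValueError("expected exactly two unquoted pieces")
    return [first, second] + rest


if __name__ == "__main__":
    print(split_line(r'"abc def" "123 \"foo\" 456" blah blah'))
    # prints ['abc def', '123 "foo" 456', 'blah', 'blah']
```

Walking the string one character at a time avoids the usual trap of splitting on spaces or quotes first, which falls apart as soon as a quoted piece contains either one.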

This particular bit of insanity was brought on by a discussion of the other things I've been doing this evening.

February 10, 2012: This post has an update.