Writing

Software, technology, sysadmin war stories, and more.

Tuesday, March 5, 2013

A bit of boredom leads to "protofeed"

Oh dear. I may have done something really odd.

Last night, I got kind of bored, and yet had a really weird idea at the same time. I had been reading a post on HN about some service providing an XML feed of user data because they were shutting down. Apparently they had done a few things which made it difficult to actually use. Then, in the comments, I spotted something which made me feel a little self-conscious:

CDATA sections ... usually hint that the coder doesn't know what they're doing.

I immediately flashed on my own atom.xml feed for this site. I basically made it by cargo-culting some other feeds, and yes, it has CDATA in it. I think I looked at Daring Fireball and a few others. I didn't want to deal with actual XML generators back in 2011 when I wrote all of this, so *gasp* I just emit it "raw". There. My secret is out.

Anyway, this got me thinking about better ways to do things, and then that's what ultimately gave me my insane little idea. What do I use when I don't want to deal with XML in my own life? Easy: Protocol Buffers! They'll handle all kinds of stuff, have a nice compact binary representation, require no escaping of funky characters, and can still be expressed as ASCII for debugging purposes.

While in this loopy state, I came up with the following "minimum viable protobuf definition":

package feedspec;
 
message Feed {
  optional string title = 1;
  optional string link = 2;
  optional string unique_id = 3;
  optional uint32 last_updated = 4;
  optional string author = 5;
 
  message Post {
    optional string title = 1;
    optional string link = 2;
    optional string unique_id = 3;
    optional uint32 last_updated = 4;
    optional string content = 5;
  }
 
  repeated Post post = 16;
}

Then I was really bored, so I sat down and wrote something to actually populate one of these things with the data from all of my posts. It wasn't too difficult, considering it's the same basic idea as my Atom generator. All of the data is already there, and all of my little helper libs make the rest rather easy.
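
For a feel of what "populating one of these things" boils down to on the wire, here is a minimal sketch that hand-encodes a Feed using only the Python standard library. It is not the generator described in this post, and a real program would normally use the code generated by protoc instead. The function names and the dict-shaped input are made up for illustration; only the field numbers come from the definition above.

```python
# Hand-rolled sketch of the protobuf wire format for the Feed message.
# Function names and the dict layout are hypothetical; the field numbers
# (1-5 and 16) are taken from the definition in the post.

def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)

def encode_tag(field_number: int, wire_type: int) -> bytes:
    """A field key is (field_number << 3) | wire_type, stored as a varint."""
    return encode_varint((field_number << 3) | wire_type)

def encode_uint32(field_number: int, value: int) -> bytes:
    return encode_tag(field_number, 0) + encode_varint(value)  # 0 = varint

def encode_embedded(field_number: int, payload: bytes) -> bytes:
    # Wire type 2 = length-delimited: a length prefix, then the raw bytes.
    return encode_tag(field_number, 2) + encode_varint(len(payload)) + payload

def encode_string(field_number: int, text: str) -> bytes:
    # Strings travel as raw UTF-8 bytes with a length prefix, so quotes,
    # angle brackets and ampersands need no escaping at all.
    return encode_embedded(field_number, text.encode("utf-8"))

def encode_post(post: dict) -> bytes:
    return b"".join([
        encode_string(1, post["title"]),
        encode_string(2, post["link"]),
        encode_string(3, post["unique_id"]),
        encode_uint32(4, post["last_updated"]),
        encode_string(5, post["content"]),
    ])

def encode_feed(feed: dict) -> bytes:
    header = b"".join([
        encode_string(1, feed["title"]),
        encode_string(2, feed["link"]),
        encode_string(3, feed["unique_id"]),
        encode_uint32(4, feed["last_updated"]),
        encode_string(5, feed["author"]),
    ])
    # Each repeated Post is its own length-delimited field number 16.
    posts = b"".join(encode_embedded(16, encode_post(p)) for p in feed["post"])
    return header + posts
```

Calling encode_feed() with a dict holding title, link, unique_id, last_updated, author and a list of post dicts returns the serialized bytes. HTML in a post's content goes in untouched, which is the "no escaping of funky characters" part.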

About an hour later, I had something which brought my crazy idea to life. Expressed in ASCII, the beginning of the feed looks like this:

title: "Writing"
link: "http://rachelbythebay.com/w/"
unique_id: "rachelbythebay.com,writing-2011"
last_updated: 1362515359
author: "rachelbythebay"
post {
  title: "\"Everqenote\" Corporation and the Filet Mingon"
  link: "http://rachelbythebay.com/w/2013/03/05/everqenote/"
  unique_id: "rachelbythebay.com,2013-03-05:everqenote"
  last_updated: 1362479183
  content: "<p> Call me picky, but <a href=\"/w [...]

Is it perfect? No. Will it express everything that Atom or RSS could? Nope. Not even close. (But it could be expanded. That is the nature of protobuf, after all.) Is it really, really simple to create and read back in? YEP.

If you're also feeling bored and would like to try fetching and using the output from this scheme, you're in luck. I have the binary serialized output from this process online for anyone who wants to give it a spin. It's still subject to my fine-tuning, naturally. It's only 12 hours old.
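
If you would rather not install anything, reading the binary back is also short enough to do by hand. The following is a rough sketch under a few assumptions: it only understands the two wire types the Feed definition above actually uses (varint and length-delimited), the function names are invented for illustration, and it takes the serialized bytes as an argument rather than fetching them from anywhere. Code generated by protoc would be the normal way to do this.

```python
# Minimal hand-written reader for a serialized Feed. Names are hypothetical;
# field numbers come from the definition in the post.

def read_varint(buf: bytes, pos: int):
    """Decode one base-128 varint starting at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return result, pos
        shift += 7

def iter_fields(buf: bytes):
    """Yield (field_number, value) pairs; value is an int or a bytes slice."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(buf, pos)
            end = pos + length
            if end > len(buf):
                raise ValueError("truncated length-delimited field")
            value = buf[pos:end]
            pos = end
        else:
            raise ValueError(f"wire type {wire_type} is not handled by this sketch")
        yield field_number, value

FEED_STRINGS = {1: "title", 2: "link", 3: "unique_id", 5: "author"}
POST_STRINGS = {1: "title", 2: "link", 3: "unique_id", 5: "content"}

def decode_post(buf: bytes) -> dict:
    post = {}
    for number, value in iter_fields(buf):
        if number in POST_STRINGS:
            post[POST_STRINGS[number]] = value.decode("utf-8")
        elif number == 4:
            post["last_updated"] = value
        # Unknown field numbers are skipped, so later additions don't break us.
    return post

def decode_feed(buf: bytes) -> dict:
    feed = {"post": []}
    for number, value in iter_fields(buf):
        if number in FEED_STRINGS:
            feed[FEED_STRINGS[number]] = value.decode("utf-8")
        elif number == 4:
            feed["last_updated"] = value
        elif number == 16:  # repeated Post, one embedded message per entry
            feed["post"].append(decode_post(value))
    return feed
```

decode_feed() returns a plain dict with the feed-level fields plus a "post" list. Skipping unknown field numbers is what lets the format grow later without breaking older readers.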

I will point out one important part of this whole thing. I deliberately used "string" instead of "bytes" to create an invisible hand which forces you to use UTF-8 everywhere. If you try to feed it a string which is not valid UTF-8, it will barf... and that's the whole point.
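
To make the UTF-8 point concrete, here is a small sketch of the kind of check a "string" field implies. It uses only Python's built-in strict decoder, not the protobuf library itself, and check_string_field is a made-up name. Exactly how and when a given protobuf implementation complains varies, so treat this as an illustration of the idea rather than of any library's actual error.

```python
# Sketch of the validity check implied by a protobuf "string" field.
# check_string_field is a hypothetical helper, not a protobuf API.

def check_string_field(raw: bytes) -> str:
    """Return the text if raw is valid UTF-8, otherwise raise ValueError."""
    try:
        return raw.decode("utf-8")  # Python decodes strictly by default
    except UnicodeDecodeError as err:
        raise ValueError(f"not valid UTF-8: {err}") from err

good = "caf\u00e9".encode("utf-8")    # b"caf\xc3\xa9"
bad = "caf\u00e9".encode("latin-1")   # b"caf\xe9": a lone 0xE9 is not valid UTF-8

print(check_string_field(good))
try:
    check_string_field(bad)
except ValueError as err:
    print("rejected:", err)
```

A "bytes" field would have accepted both inputs without comment, which is how mystery encodings sneak into a feed.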

Have fun!


March 14, 2013: This post has an update.