Friday, January 13, 2012

The fred feed reader, or C++ project part four

For my fourth entry in this series of things I've written in C++ recently, I'm going to talk about fred. Yesterday, I covered publog, which is the software I use to generate these posts and the associated Atom feed.

First, a note about the name: I needed to call it something. I figured that "feed reader something something" would be a start. Those first two letters gave me FR, and my desire to have something short and mildly quirky finished it. I was probably also influenced by the quirky names of some now-departed XM radio channels: Fred, Ethel, and Lucy.

If anyone demands a proper expansion for FRED now, I'll just say it stands for "feed reader extraordinaire, duh!" and leave it at that.

This project started pretty far back. I knew that Google Reader would eventually become part of the sucking chest wound that is Google 2.0. I looked at the alternatives, and none of them made me happy. Most of them had way too much stuff going on. I wanted a nice simple reader interface which let me flip through posts quickly and didn't get in my way. I posted something vague as a hint of what I was up to, and to hopefully encourage others to follow suit.

The first thing I had to do was to figure out how to sanely fetch the actual RSS and Atom feed URLs. Obviously, there's always the option of calling the "curl" or "wget" binaries and parsing their stdout/stderr and exit codes, but that's gross. I decided to do it the right way and started learning about libcurl.

libcurl makes life mildly interesting. It wants either a callback function or a FILE pointer as a place to store the data it grabs. I wanted this stuff in memory, so the FILE route was right out. Using a callback was okay in principle, but I was doing this in C++ with classes. Having a callback which goes to a particular instance of a class can be painful, especially if all you get to pass in is a C-style function pointer.

So, treading gingerly into these waters, I started playing with the notion of having a static function in my class. libcurl was nice enough to take an opaque user pointer which would then be handed to my function. I told my function to interpret that as an instance of the data structure I wanted to populate, and I was in business.
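Here is a rough sketch of that static-function trick. It is not the actual fred code, and it does not link against libcurl: FakeTransfer is a made-up stand-in that calls back in chunks the way a real transfer would, and the HttpClient internals are guesses. The callback signature matches what libcurl expects for CURLOPT_WRITEFUNCTION, and the opaque pointer plays the role of CURLOPT_WRITEDATA.

```cpp
#include <cstddef>
#include <string>

// Same shape as the function pointer libcurl takes for CURLOPT_WRITEFUNCTION.
using WriteCallback = std::size_t (*)(char* ptr, std::size_t size,
                                      std::size_t nmemb, void* userdata);

// Hypothetical stand-in for a libcurl transfer: hands the body to the
// callback a few bytes at a time, along with the caller's opaque pointer.
inline bool FakeTransfer(const std::string& body, WriteCallback cb,
                         void* userdata) {
  const std::size_t kChunk = 4;
  for (std::size_t off = 0; off < body.size(); off += kChunk) {
    std::string chunk = body.substr(off, kChunk);
    // A short return count means "abort", as it does in libcurl.
    if (cb(&chunk[0], 1, chunk.size(), userdata) != chunk.size()) return false;
  }
  return true;
}

// The class name comes from the post; everything inside is an assumption.
class HttpClient {
 public:
  bool Fetch(const std::string& simulated_body) {
    body_.clear();
    // With real libcurl these would be curl_easy_setopt() calls:
    //   CURLOPT_WRITEFUNCTION -> &HttpClient::OnData
    //   CURLOPT_WRITEDATA     -> &body_
    return FakeTransfer(simulated_body, &HttpClient::OnData, &body_);
  }

  const std::string& body() const { return body_; }

 private:
  // Static, so it converts to a plain function pointer with no hidden "this".
  static std::size_t OnData(char* ptr, std::size_t size, std::size_t nmemb,
                            void* userdata) {
    // Interpret the opaque pointer as the structure to populate.
    std::string* out = static_cast<std::string*>(userdata);
    out->append(ptr, size * nmemb);
    return size * nmemb;
  }

  std::string body_;
};
```

The point of the static member is that it has no implicit object argument, so it fits the C-style slot, while the opaque pointer carries whatever per-request state the function needs.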

What can I say? I was spoiled by having all of these wonderful libraries in the corporate code depot. It should make sense that my first reaction upon jumping out is to re-create the ones which would be useful to me. Having a HttpClient class just made sense.

Next, I started thinking about actually parsing the XML gunk which came back in these feeds. I held my nose and grabbed libxml2 and started poking at things. Some hours later, I had a class which would take the raw XML which I had fetched with libcurl and would turn it into a nice STL container with data stored in meaningful fields.

There was a lot of annoying stuff here. I had to learn way too much about how the Atom sausages are made and parsed, including the whole "empty namespace" thing. You'd think libxml2 would be able to look at the document and plop out a tree, but actually using it with real data is a very different experience.

I wound up reworking bits of that class to make something which would deal with normal RSS. Naturally, there are a bunch of horrible hacks in there to deal with different date formats and field names. A lot of these came about via trial and error. I'd feed it my cache of feeds and see how much it could parse. Then I'd add more special cases and run it again. Eventually, it ran without complaining and I called it done.
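To give a flavor of the date-format special cases, here is a minimal sketch that tries an Atom-style timestamp first and then falls back to the usual RSS style. The FeedTime struct and both function names are invented for illustration. It ignores time zones entirely and expects a leading day name on RSS dates, so it covers far fewer cases than a real feed parser would need.

```cpp
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical holder for the parsed fields; time zone is not tracked.
struct FeedTime {
  int year = 0, month = 0, day = 0, hour = 0, minute = 0, second = 0;
};

// Returns 1-12 for an English three-letter month abbreviation, 0 otherwise.
inline int MonthFromName(const char* name) {
  static const char* const kNames[] = {"Jan", "Feb", "Mar", "Apr",
                                       "May", "Jun", "Jul", "Aug",
                                       "Sep", "Oct", "Nov", "Dec"};
  for (int i = 0; i < 12; ++i) {
    if (std::strcmp(name, kNames[i]) == 0) return i + 1;
  }
  return 0;
}

// Tries each known format in turn; returns false if none of them match.
inline bool ParseFeedDate(const std::string& text, FeedTime* out) {
  FeedTime t;

  // Atom style, e.g. 2012-01-13T18:30:02Z
  if (std::sscanf(text.c_str(), "%d-%d-%dT%d:%d:%d", &t.year, &t.month,
                  &t.day, &t.hour, &t.minute, &t.second) == 6) {
    *out = t;
    return true;
  }

  // RSS style, e.g. Fri, 13 Jan 2012 18:30:02 GMT
  t = FeedTime();  // discard anything the first attempt partly filled in
  char month_name[4] = {0};
  if (std::sscanf(text.c_str(), "%*3s, %d %3s %d %d:%d:%d", &t.day,
                  month_name, &t.year, &t.hour, &t.minute, &t.second) == 6) {
    t.month = MonthFromName(month_name);
    if (t.month != 0) {
      *out = t;
      return true;
    }
  }

  return false;
}
```

Each new oddity found in a feed becomes another attempt in that chain, which matches the "add a special case and run it again" loop described above.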

Along the way I had also been tying all of this to a MySQL backend so it could keep track of individual posts. Each feed has its own namespace, and posts are tracked by their "id". If a post's mtime changes, I store a new copy. This is fun to use to find errors and omissions in the original versions of posts. (I fix old posts sometimes, too.)
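The bookkeeping can be sketched without a database. In this in-memory stand-in (PostStore and its methods are made-up names, and the real thing lives in MySQL tables whose layout isn't described here), a post is keyed by its feed plus its "id", and a changed mtime appends a new version instead of overwriting the old one.

```cpp
#include <cstddef>
#include <ctime>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct PostVersion {
  std::time_t mtime;
  std::string content;
};

// Hypothetical in-memory stand-in for the database tables.
class PostStore {
 public:
  // Stores a new copy if the post is new or its mtime differs from the most
  // recently stored one. Returns true if a copy was stored.
  bool Record(const std::string& feed_url, const std::string& post_id,
              std::time_t mtime, const std::string& content) {
    std::vector<PostVersion>& versions =
        posts_[std::make_pair(feed_url, post_id)];
    if (!versions.empty() && versions.back().mtime == mtime) return false;
    versions.push_back(PostVersion{mtime, content});
    return true;
  }

  // Number of copies kept for one post; 0 if it has never been seen.
  std::size_t VersionCount(const std::string& feed_url,
                           const std::string& post_id) const {
    auto it = posts_.find(std::make_pair(feed_url, post_id));
    return it == posts_.end() ? 0 : it->second.size();
  }

 private:
  // Keying on (feed, id) gives every feed its own namespace for ids.
  std::map<std::pair<std::string, std::string>, std::vector<PostVersion>>
      posts_;
};
```

Because the key includes the feed, two feeds that happen to reuse the same "id" never collide, and keeping every version around is what makes it possible to compare an edited post against its original.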

By this point, I had enough to let it just emit a static page. It would just poll all of its feeds, and if it found something new, it would write it to that file. Then I could just load it in my browser and read for a while. Later, when I got bored, I'd run it again and repeat the process.

This page did not include the actual content. I was afraid of XSS craziness from the feeds. Anyone could have written evil JavaScript which would have run in the rachelbythebay.com context which had served the page to me. There's nothing valuable in there, but it's the principle of the thing which scared me, so all I had were sanitized post titles, date info, and a link back to the "real" post URL.

This is about the point where I got fed up and wrote my rant about how in-band signalling is bad for the web. Not having a way to just say "don't trust this blob AT ALL" was bothering me greatly. As luck would have it, one of my friends who stays on top of security stuff told me about the new Content-Security-Policy magic in Firefox.

The next day, I had figured out how to use that to safely display untrusted content. It's simple enough: you just tell the browser not to run scripts or allow certain other annoyances except from one or two places that you control. Then you go ahead and dump in the HTML and go on with life. That inspired my CSP how-to post that evening.
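For a sense of what such a policy looks like, here is a small sketch of a function that builds the response header. It uses the standardized header name and directive syntax, which may differ from what Firefox accepted at the time, and it is not the policy fred actually sends. BuildCspHeader and the origin passed to it are placeholders.

```cpp
#include <string>

// Builds a Content-Security-Policy header line that denies everything by
// default, then allows scripts, styles and background requests only from
// one trusted origin. Images are allowed from anywhere so that pictures
// embedded in posts still load.
inline std::string BuildCspHeader(const std::string& trusted_origin) {
  return "Content-Security-Policy: default-src 'none'; "
         "script-src " + trusted_origin + "; "
         "style-src " + trusted_origin + "; "
         "connect-src " + trusted_origin + "; "
         "img-src *";
}
```

Called with something like "https://static.example.com", this yields a single header line. Since nothing in it permits inline script, a script tag or event handler smuggled into a feed's HTML should simply not run in a browser which enforces the policy.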

During this period of fighting with how to safely display things, I had also reworked the page so it was now dynamic. Just like several other projects here, there's a bit of static HTML, CSS and JavaScript, and code which pokes my server to ask for data. It'll hand you up to 10 posts at a time, and you can just flip through them.

It's smart enough to know when you are getting close to running out of posts and so it will fetch another batch of 10 from the server. Unless you hold down the right-arrow key, odds are you will never see a loading delay. It all happens in the background well before you should get there, assuming you're actually reading the posts.
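The real prefetching happens in JavaScript in the browser and runs in the background. Purely to illustrate the idea, here is the same logic as a synchronous C++ sketch. PostBuffer, the fetch callback, and the "low water" threshold are all assumptions; the post only says batches are 10 and that the next one is requested when you get close to the end.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

class PostBuffer {
 public:
  // Hypothetical fetcher: returns up to 'count' more posts, or none when
  // the server has nothing left.
  using FetchFn = std::function<std::vector<std::string>(std::size_t count)>;

  PostBuffer(FetchFn fetch, std::size_t batch, std::size_t low_water)
      : fetch_(std::move(fetch)), batch_(batch), low_water_(low_water) {
    Refill();  // initial batch
  }

  bool Empty() const { return posts_.empty(); }

  // Only valid when Empty() is false.
  const std::string& Current() const { return posts_[pos_]; }

  // Moves to the next post. Returns false when there is nothing after the
  // current one.
  bool Next() {
    // Last resort: the reader outran the prefetch.
    if (pos_ + 1 >= posts_.size()) Refill();
    if (pos_ + 1 >= posts_.size()) return false;
    ++pos_;
    // Few unread posts left after this one? Grab the next batch early.
    if (posts_.size() - pos_ - 1 <= low_water_) Refill();
    return true;
  }

 private:
  void Refill() {
    if (exhausted_) return;
    std::vector<std::string> more = fetch_(batch_);
    if (more.empty()) exhausted_ = true;
    posts_.insert(posts_.end(), more.begin(), more.end());
  }

  FetchFn fetch_;
  std::size_t batch_;
  std::size_t low_water_;
  std::size_t pos_ = 0;
  bool exhausted_ = false;
  std::vector<std::string> posts_;
};
```

With a batch of 10 and a threshold of 3, the second fetch fires when the reader lands on the seventh post, which leaves three posts of reading time for the request to finish.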

This is about the point where my "would like to share" post went online. It had evolved to the point where I wouldn't be embarrassed to have other people playing around with it.

For users who are just visiting, it shows the latest posts in the whole system. That means any post from any of the feeds which are being followed by the subscribers. People with real accounts start with zero feeds in their list and can add them one by one. It also remembers which posts have been read if you have an account, so you never see the same post twice.

I've done some subtle UI tweaks to it. Try loading up a post which is longer than your screen is tall, and then scroll it down a bit. Then arrow left or right to another post, and then come right back. Did you notice that it remembers your scroll position? It's just like you'd get when navigating web pages normally, but this is all happening at the one URL with no reloading going on.

Also, if you are looking at a post in fred and just want to jump to the URL for that post, you can just hit ENTER. For most feeds (including mine), it will take you to the web page for that same post. Others, like Daring Fireball, take you to the page which is described in that post. All of this happens in a new tab so you don't lose your place back in fred.

The code for all of this comes in two major flavors. The first type is all of the backend handlers for different URLs on my web server. There are handlers for adding, dropping, and listing feeds. Another hands out a JSON-formatted blob for the next posts in your reading list. There's also a little bit of magic to migrate the cookies for anyone who loaded the page at its old URL before it moved to fred.rachelbythebay.com. Finally, there is something to mark a post as "read" when a user flips past it. That's it.

Everything else is related to actually fetching, parsing, and storing feeds. This is where the libcurl and libxml2 stuff comes in. This runs via a cron job once in a while, and it has some limiters to make sure it doesn't beat any web servers senseless even if it gets run far too often. There's actually a minimum interval between polls, in other words.
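A minimum-interval limiter of that sort can be very small. This sketch keeps the last poll time per feed in memory; PollLimiter is an invented name, and since a cron job starts a fresh process each run, a real version would have to keep those timestamps somewhere persistent such as the database.

```cpp
#include <ctime>
#include <map>
#include <string>

class PollLimiter {
 public:
  explicit PollLimiter(std::time_t min_interval_seconds)
      : min_interval_(min_interval_seconds) {}

  // Returns true, and records the poll, if at least the minimum interval
  // has passed since this feed was last fetched. Returns false otherwise.
  bool ShouldPoll(const std::string& feed_url, std::time_t now) {
    auto it = last_poll_.find(feed_url);
    if (it != last_poll_.end() && now - it->second < min_interval_) {
      return false;  // too soon; leave that web server alone
    }
    last_poll_[feed_url] = now;
    return true;
  }

 private:
  std::time_t min_interval_;
  std::map<std::string, std::time_t> last_poll_;
};
```

Passing the current time in as an argument keeps the check easy to test, and it means that running the job ten times in a row costs each remote server at most one request.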

It's nothing special. It's just the results of poking at some code intermittently over the span of about two weeks last fall. It does mean I can stay logged out of Google and still follow my feeds, so I'm very happy.

Go ahead and give it a spin in guest mode.

January 14, 2012: This post has an update.