Software, technology, sysadmin war stories, and more.
Thursday, January 12, 2012

C++ projects part three: publog (for these posts)

This is the third entry in my ad-hoc series of posts about "stuff I've written in C++". Yesterday's post was about "go" and the buzzword game. This one is a little more meta: it's about the software which drives these posts. I first talked about this in a post back in May, but it was short and lacking in detail. This one will go far deeper.

First, some background on all of this is needed. Before I quit, I had gotten into a groove of writing posts on Buzz. That's right, Buzz actually had a bunch of users internally at Google. People in the outside world never got to experience it in a "domain" sense because that was never shipped. However, those of us on the inside could see posts by anyone else in the company, reshare, comment, "like" something, and all of that.

Many of my posts were the same sort of biting commentary on how things had been going to hell, and what I was trying to do about it. There was a lot more detail in those posts, naturally, because anyone reading them was also behind the usual wall of secrecy.

Towards the end, people told me that they wished I would keep writing even after I was gone. I knew I had to bootstrap something really quickly so as not to lose them during the transition. So, I threw together a couple of static pages, including this post about a suggestion box. Then I just posted a link to my top "/w/" page in a Buzz and people started following along.

I had established continuity, and now it was time to start writing regularly to keep everyone engaged. At first, I was doing it by hand, and it was annoying. This included the Atom feed! Yes, I was actually throwing together Atom post stanzas shortly after writing the post for a given day because people really wanted to follow me with their feed readers. Rather than lose readers, I stood up the feed the hard way.

Obviously, this didn't scale. Once I was actually out the door and free to work on my own projects both legally and mentally, I started working on what would become "publog". Here's how it works.

I have two trees on my laptop. One of them is the "in" side, and it holds the raw posts and other content (images, audio files, etc.) for each day. It lives in a simple hierarchy: YYYY/MM/DD/shortname. This post, then, is in /2012/01/12/publog. The actual text I'm typing in goes in a file called "notes" within that directory.

I can put anything I want into these directories. Nothing happens until I add that post to the top-level index. It's just a simple text file which contains the path to a post and its title. It looks like this:

2012/01/12/publog C++ projects part three: publog (for these posts)

I just edit both "notes" and that list with my favorite text editor. Once I'm ready to publish something, I run the generator. publog itself is just that one program, technically.

My generator runs in three main phases. First, it scoops up the list of posts and keeps the data in memory. It's a 20K file on a machine with several gigs of memory, so this is no big deal. This will be used to actually flip through all of the posts later.
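As a rough sketch of that first phase (this is not the actual publog source), loading the index could look something like the following. It assumes each line is a post path, a single space, and then the title, as in the example line above. The names PostEntry and LoadIndex are made up for illustration.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One line of the top-level index: a post path and its title.
// (Hypothetical type; the real generator's data structures are not shown.)
struct PostEntry {
  std::string path;   // e.g. "2012/01/12/publog"
  std::string title;  // everything after the first space
};

// Read the whole index into memory, one entry per line.
// Lines with no space, or with nothing before the space, are skipped.
std::vector<PostEntry> LoadIndex(std::istream& in) {
  std::vector<PostEntry> posts;
  std::string line;
  while (std::getline(in, line)) {
    const std::string::size_type space = line.find(' ');
    if (space == std::string::npos || space == 0) continue;
    posts.push_back({line.substr(0, space), line.substr(space + 1)});
  }
  return posts;
}
```

Splitting on the first space works because post paths never contain spaces, while titles usually do.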

Second, it starts the "buildall" phase, where the Atom header is written and then a for loop starts enumerating through the list of posts it loaded earlier. Each post then has its mtime checked, and it is sent through a formatter which turns my plain text into proper HTML. This turns my blank lines into paragraph breaks, and it also interprets a handful of shortcut markup commands I can use to link to images easily. Every time you see an image in a post here, one of those commands is sitting in the raw "notes" file.
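The paragraph-break part of that formatter can be sketched in a few lines. This is only an illustration under simple assumptions: it does no HTML escaping, and it leaves out the image shortcut commands entirely, since their syntax isn't described here. FormatParagraphs is a hypothetical name.

```cpp
#include <sstream>
#include <string>

// Turn plain text into HTML paragraphs: each run of non-blank lines becomes
// one <p>...</p>, and blank lines separate paragraphs. No escaping is done.
std::string FormatParagraphs(const std::string& raw) {
  std::istringstream in(raw);
  std::string line, paragraph, html;

  // Emit the paragraph collected so far, if there is one.
  auto flush = [&]() {
    if (!paragraph.empty()) {
      html += "<p>" + paragraph + "</p>\n";
      paragraph.clear();
    }
  };

  while (std::getline(in, line)) {
    if (line.find_first_not_of(" \t\r") == std::string::npos) {
      flush();  // a blank line ends the current paragraph
    } else {
      if (!paragraph.empty()) paragraph += '\n';
      paragraph += line;
    }
  }
  flush();  // the last paragraph may not be followed by a blank line
  return html;
}
```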

Now I have a formatted copy of the post in memory, and it's time to see if I need to do anything with it. If the index.html for the post (what you're reading right now if you're on my actual web site) doesn't exist or is older than the "notes" file, then I (re)build it. I send that in-memory post to something which will turn it into that file. It takes care of all of the header stuff (DOCTYPE, link rel...) and footer stuff (more writing, contact) you see on every post.

Next, I check to see if this post needs to be added to the Atom feed. Originally, every post went in there, but I've since added a limiter. My atom.xml file was getting pointlessly huge, so now only the latest 100 posts wind up in there. If we haven't hit the limiter yet, I pass the same in-memory copy of the post to my Atom formatter. It writes the appropriate containers (entry, title, link, id, updated, content) and drops in the post, then closes things up and returns.
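Here is one way the limiter and the Atom formatter could fit together. It is a sketch under a few assumptions: the list of posts is ordered newest first, the post body is embedded as escaped HTML, and all of the names (FormattedPost, XmlEscape, AtomEntry, BuildEntries) are invented for illustration rather than taken from publog.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory form of a formatted post.
struct FormattedPost {
  std::string url;      // absolute link to the post
  std::string title;
  std::string updated;  // RFC 3339 timestamp, e.g. "2012-01-12T00:00:00Z"
  std::string html;     // formatted post body
};

// Escape the characters that are special in XML text and attribute values.
std::string XmlEscape(const std::string& s) {
  std::string out;
  for (char c : s) {
    switch (c) {
      case '&': out += "&amp;"; break;
      case '<': out += "&lt;"; break;
      case '>': out += "&gt;"; break;
      case '"': out += "&quot;"; break;
      default: out += c;
    }
  }
  return out;
}

// Wrap one post in the usual Atom containers.
std::string AtomEntry(const FormattedPost& p) {
  std::string e = "<entry>\n";
  e += "  <title>" + XmlEscape(p.title) + "</title>\n";
  e += "  <link href=\"" + XmlEscape(p.url) + "\"/>\n";
  e += "  <id>" + XmlEscape(p.url) + "</id>\n";
  e += "  <updated>" + p.updated + "</updated>\n";
  e += "  <content type=\"html\">" + XmlEscape(p.html) + "</content>\n";
  e += "</entry>\n";
  return e;
}

// Walk every post, but only add the first `limit` of them to the feed.
std::string BuildEntries(const std::vector<FormattedPost>& posts,
                         std::size_t limit) {
  std::string feed;
  std::size_t emitted = 0;
  for (const FormattedPost& p : posts) {
    // The per-post HTML page would be handled here regardless of the limit.
    if (emitted < limit) {
      feed += AtomEntry(p);
      ++emitted;
    }
  }
  return feed;
}
```

Calling BuildEntries(posts, 100) would match the limit described above. The loop keeps going after the limit is reached because older posts may still need their pages rebuilt.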

Now that it's done with this post, the loop continues and scans through the others. Eventually, it runs out of posts and phase two is done.

Phase three is where it builds the actual top-level index with links to all of my posts. This part starts from that same in-memory list of posts and just builds DIVs and ULs. It has a little bit of intelligence to make sure multiple posts on the same day don't get another date banner, but otherwise it's about as simple and dumb as you can imagine.
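The date banner trick amounts to remembering the last date seen. A small sketch, assuming the date is just the first ten characters of the post path ("YYYY/MM/DD") and that posts from the same day sit next to each other in the list. IndexEntry and BuildIndexBody are hypothetical names, the class attribute is made up, and the real markup surely differs.

```cpp
#include <string>
#include <vector>

// Hypothetical entry from the in-memory list of posts.
struct IndexEntry {
  std::string path;   // e.g. "2012/01/12/publog"
  std::string title;  // assumed to be safe to emit as-is (no escaping here)
};

// Build the body of the top-level index: one date banner per day, followed
// by a list of that day's posts.
std::string BuildIndexBody(const std::vector<IndexEntry>& posts) {
  std::string html;
  std::string current_date;
  for (const IndexEntry& p : posts) {
    const std::string date = p.path.substr(0, 10);  // "YYYY/MM/DD"
    if (date != current_date) {
      // New day: close the previous list, then start a new banner and list.
      if (!current_date.empty()) html += "</ul>\n";
      html += "<div class=\"date\">" + date + "</div>\n<ul>\n";
      current_date = date;
    }
    html += "<li><a href=\"" + p.path + "/\">" + p.title + "</a></li>\n";
  }
  if (!current_date.empty()) html += "</ul>\n";
  return html;
}
```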

That's it for the generator. Once it finishes, I have an output tree which represents the "cooked" version of this site. Obviously, I don't serve this stuff straight off my laptop, so I have to push it to my web server. For that, I just have a shell script which runs rsync. This makes /w/ under my document root identical to whatever is on the laptop.

There's no database here. It's just a bunch of flat files. This means that my site is utter simplicity to serve and has about the lowest possible overhead that I can get. Anything remotely popular winds up staying in the Linux buffer cache, and Apache just throws it to you as fast as it can.

All of this just means that I can have a post go to the front page of Hacker News and stay there for 12 hours and the server won't break a sweat. It's not because I'm some kind of genius at tuning my server. It's just because I'm not asking it to do much to handle these requests.

I guess this approach has become popular of late and even has a name now: "baked" sites. If it works for you, great! If not, that's fine too. Use the things which let you accomplish your goals in life.

Be pragmatic!

January 13, 2012: This post has an update.