Writing

Software, technology, sysadmin war stories, and more.
Saturday, March 30, 2013

How I used to rip CDs, or wonkiness part two

Last month, I wrote about how the frontend of my music collection evolved while it was still being handled by my Linux box. It was a large collection of MP3 files and then Ogg Vorbis files, and while it started with downloads, it eventually transitioned to being sourced from my own CD collection.

I wrote a lot of custom code to make those lists render nicely as described in my earlier post. What I haven't talked about yet is how all of those songs were extracted from my CDs in the first place.

Early on, none of the CD-ROM drives in my life would do the "CDDA" digital extraction stuff for audio. The only way to get music off the disc and into the computer was to start it playing somewhere, loop it back, and then record it as a WAV file. Or, if I'm talking way back here, as a VOC file (Sound Blaster Pro days!). I never really used that since it tended to sound poor and was a ton of work. Also, disk space was at a premium in those days.

By the time I got a drive which could do reasonable audio extraction, the "sweet spot" hard drives had grown to being about 2 GB, and so I had a little more breathing room. Now I could drop in a disc and run something like cdparanoia and it would give me a directory full of WAV files. Then I'd manually run l3enc on all of them, and rename the resulting files to something meaningful. Then I'd type in all of the album info, track names, and all of that, and shove the whole thing into place.

Finally, I'd run my generator and the new music would appear. It was manual, annoying, and slow. It needed to be improved. My first change to all of this was writing something which would wrap the ripper and encoder and keep them both busy. Then I added something to make it ask me for the artist name, title, and date. Then it would scan for the number of tracks and drop me into an editor to supply the track names. Once I was satisfied with my data entry, it would go off and do the work for me, including making reasonable filenames.

I'd usually parallelize my work at this point by plopping the CD and booklet onto my scanner to get images for the library. As long as I had the thing right there on my desk, it was the perfect time to get it scanned into the computer. Those tasks would normally wrap up around the same time, and I'd just put the CD away and call it done.

That was much better than the days of running all of this by hand, but it still needed improvement. I knew about things like the CDDB from the past, and it had evolved to become the FreeDB project. Rather than trying to write Yet Another Client for that spec, I just downloaded one that would scan a disc and spit out the raw data. Then I'd parse it myself.

I wound up with more than I bargained for. The format used by these databases is an utter disaster. If you've already read rants about this, or tried it yourself, you know what I'm talking about. For everyone else, here's the story.

Some of the files start with "# xmcd". I call them "normal". Others start with a line of nothing but "#" (pound, hash, octothorpe, gate, whatever you want to call it) characters, and I call those "multi". They are treated completely differently.
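Telling the two shapes apart is at least simple, since it only takes a peek at the first line. A sketch of that check (in Python; the original tooling isn't shown here, so the function name is mine):

```python
def classify(text):
    """Guess which freedb file shape we got from its first line."""
    first = text.splitlines()[0] if text else ""
    if first.startswith("# xmcd"):
        return "normal"
    if first and set(first) == {"#"}:
        return "multi"
    return "unknown"
```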

A "normal" file might start out like this:

# xmcd CD database file generated by Grip 3.3.1
# 
# Track frame offsets:
#       150
#       24430
#       45347

Yes, that's right, it's using # for comments, but it also seems to be storing potentially useful information in those comments. With those frame offsets, you could get some idea of how long a song is, given that "Red Book" audio frames are 1/75th of a second, or about 13 ms. If you want that stuff, you have to write a parser which ignores the fact that it looks like a comment and goes spelunking for it.
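Once you've fished those numbers out, the arithmetic itself is easy. A sketch (the names are mine; note that the last track also needs the total disc length, which the file supplies a bit further down):

```python
FRAMES_PER_SECOND = 75  # Red Book: 75 audio frames per second

def track_lengths(frame_offsets, disc_length_seconds):
    """Per-track lengths in seconds, from frame offsets plus disc length."""
    lengths = []
    for i, start in enumerate(frame_offsets):
        if i + 1 < len(frame_offsets):
            # A track runs from its offset to the next track's offset.
            lengths.append((frame_offsets[i + 1] - start) / FRAMES_PER_SECOND)
        else:
            # The last track runs to the end of the disc; the disc length
            # already counts the lead-in, so just subtract.
            lengths.append(disc_length_seconds - start / FRAMES_PER_SECOND)
    return lengths
```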

Later on, you get this:

# Disc length: 3152 seconds

Again, that's potentially useful information. From there, it starts listing actual data about the CD, like the title, year, and track names:

DISCID=930c4e0b
DTITLE=Sophie B. Hawkins / Tongues And Tails
DYEAR=1992
DGENRE=Rock
TTITLE0=Damn I Wish I Was Your Lover
TTITLE1=California Here I Come

No big deal, right? So, now have a look at the "multi" format. This is what you get when the database has more than one match for the disc fingerprint you submitted.

Here's the beginning of one such file (truncated a bit to fit here):

########################################################
0 data a80c840b Billy Joel / Greatest Hits Volume I und 
########################################################
# xmcd
 #
 # Track frame offsets:
 #      150
 #      25415

Did you notice the "data in a comment" part is actually indented one space? If you wrote your parser to handle the earlier situation, you now get to extend it to deal with this as well. It actually indents the entire block of data for the disc, including the DISCID, DTITLE, and everything else like what is shown above for my Sophie B. Hawkins CD.

Then, it jumps back to column zero for the next match:

#######################################################
1 misc a80c840b Billy Joel / Greatest Hits Volume I &
#######################################################
# xmcd
 # 
 # Track frame offsets:
 #     150
 #     25415

It's not some incredible coincidence of two discs that happened to hash to the same fingerprint. Nope, it's the same disc, entered two different ways. I had to add code to my program to show all of the matches and let me pick the one I wanted to use as a base. Still, I couldn't take that data directly, since it was frequently screwed up in ways too numerous to mention.
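The shape of that match-splitting code can be sketched: treat the rows of "#" as separators, spot the header lines (the "0 data a80c840b …" pattern is inferred from the samples above), and undo the stray indent on each match's body:

```python
import re

# Header shape inferred from the samples: index, category, discid, title.
HEADER = re.compile(r"^\d+ (\w+) ([0-9a-f]{8}) (.+)$")

def split_matches(text):
    """Return a (category, discid, title, body_lines) tuple per match."""
    matches = []
    for line in text.splitlines():
        if re.fullmatch(r"#{5,}", line):
            continue  # the long rows of "#" are just separators
        m = HEADER.match(line)
        if m:
            matches.append((m.group(1), m.group(2), m.group(3), []))
        elif matches:
            # Undo the one-space indent the multi format adds to bodies.
            matches[-1][3].append(line[1:] if line.startswith(" ") else line)
    return matches
```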

I took my existing program and just rigged it to use the CDDB type data as defaults. If it looked reasonable, I could just hit ENTER and take it as-is. Otherwise, it would drop me into my editor and I got to clean it up and set things right.

I think this freedb parser is the first code for which I ever wrote unit tests outside of work. In that directory, I have a couple of files which represent real returns from a database lookup. My test code just sends them through the parser to make sure it gets reasonable values out.

It's all supremely evil code, and yet it was still better than the alternative: manual data entry for everything.