Software, technology, sysadmin war stories, and more.
Saturday, March 9, 2013

Character encodings are easy on this side of the pond

When it comes to things like Unicode, I have life pretty easy. Nothing I regularly encounter requires anything beyond the basic characters you can get from ASCII. I don't find myself needing to express a Euro symbol very often, for instance, and the currency symbol I need is right here at 0x24: $.

As a result I don't usually think about things like UTF-8 support. I know I have my web pages set up to serve UTF-8, but I also know that nothing I publish really exploits it. I never worried about making sure I had a compatible terminal or anything like that.

Things changed yesterday when I got some feedback on one of my programming recording sessions asking about UTF-8 support in C++. I had always treated such things as a black box, and never really thought about what would happen if I actually tried to use a multibyte sequence. This took me down an interesting road.

I had long known that 'xterm -u8' would give me a terminal which could render UTF-8 sequences in a meaningful way. I only used it when something looked obviously wrong on my ordinary (non-UTF-8 compatible) rxvt, just to see what the character was supposed to be. xterm itself has a couple of UI warts which don't work for me, so I couldn't use it for everyday life.

I managed to find "rxvt-unicode" yesterday, aka urxvt, and it maintains the rxvt behavior I'm used to while adding proper support for UTF-8. It also otherwise looks exactly the same, so I didn't have to adjust to yet another weird user interface regime. Using it, I was able to put together a couple of demo recordings to show a totally basic use of multibyte characters in a C++ program.

Recording #1: I print three bytes in a formatted string: 0xe2, 0x82, 0xac.

Recording #2: I actually directly input a Euro symbol (by pasting from another session, since my Linux box's keyboard doesn't have an obvious way to make one of them).

Recording #3: I use the \u20ac escape sequence.
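
For anyone who would rather read than watch, here is a minimal sketch of what those three approaches might look like. It is not necessarily the code from the recordings: the function names are made up for illustration, the "%c%c%c" format is only a guess at what "a formatted string" means, and the second and third variants assume the source file is saved as UTF-8 and the compiler's execution character set is UTF-8 (the default for GCC and Clang).

```cpp
#include <cstdio>
#include <string>

// Recording #1 style: format the three UTF-8 bytes one at a time.
// (Hypothetical helper; the exact format string is an assumption.)
std::string euro_from_bytes() {
  char buf[4];  // three bytes plus the terminating NUL
  std::snprintf(buf, sizeof(buf), "%c%c%c", 0xe2, 0x82, 0xac);
  return std::string(buf);
}

// Recording #2 style: the character pasted directly into the source.
// Assumes this file is saved as UTF-8 and the compiler passes the bytes through.
std::string euro_from_literal() {
  return std::string("€");
}

// Recording #3 style: the \u20ac universal character name.
// Assumes the execution character set is UTF-8.
std::string euro_from_escape() {
  return std::string("\u20ac");
}

// Prints the Euro sign three times, once per approach.
void print_euro_three_ways() {
  std::printf("%s\n%s\n%s\n",
              euro_from_bytes().c_str(),
              euro_from_literal().c_str(),
              euro_from_escape().c_str());
}
```

Calling print_euro_three_ways() from main() on a UTF-8 terminal should show the same symbol three times, since under those assumptions all three strings hold the same three bytes.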

This only shows that you can in fact emit the sequence and have it look correct if your terminal is set up for it. These little CGI programs should also look fine in a browser since they send the right sort of Content-Type header. The playback stuff also groks UTF-8, so when you view these recordings, it should render properly there, too.
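
As a rough illustration of the Content-Type part, a CGI response that declares its encoding might be built along these lines. The helper name is hypothetical, and "text/plain; charset=utf-8" is an assumed value, since the exact header those programs send isn't shown here.

```cpp
#include <string>

// Hypothetical helper: prepends a CGI header that tells the browser
// the body is UTF-8. The content type used here is an assumption.
std::string build_cgi_response(const std::string& body) {
  // The blank line after the header marks the start of the body.
  std::string out = "Content-Type: text/plain; charset=utf-8\r\n\r\n";
  out += body;
  return out;
}
```

A CGI program would write the returned string to standard output. Without the charset parameter, a browser has to guess the encoding, and a wrong guess typically turns the Euro sign into three unrelated characters.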

What the program definitely does not do is interpret the data in any meaningful way. As far as it's concerned, the sequence is just an opaque blob of bytes. If I tried to use traditional string manipulation techniques on it, particularly the "character at a time" stuff, I would be in for an interesting surprise, since the one character I see on screen is really three bytes.
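
To make that surprise a bit more concrete, here is a small sketch of the mismatch between bytes and characters. The helper names are made up for this example, and the counting function assumes well-formed UTF-8 and does no validation, so treat it as an illustration rather than a proper decoder.

```cpp
#include <cstddef>
#include <string>

// UTF-8 continuation bytes always have the bit pattern 10xxxxxx.
bool is_continuation_byte(unsigned char c) {
  return (c & 0xC0) == 0x80;
}

// Counts code points by counting every byte that is NOT a continuation
// byte. Assumes the input is well-formed UTF-8; nothing is validated.
std::size_t count_code_points(const std::string& s) {
  std::size_t n = 0;
  for (unsigned char c : s) {
    if (!is_continuation_byte(c)) {
      ++n;
    }
  }
  return n;
}
```

For a string holding the Euro sign followed by "5", size() reports 4 while count_code_points() reports 2, and indexing the first char yields the lone byte 0xe2 rather than anything printable on its own.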

For now, I'm just going to leave them as opaque blobs. We can come back to actually using these multibyte sequences properly another time.