WARNING: What follows is a somewhat confused ramble about a topic of a boring, technical nature that I don’t really know that much about and in which I’m quite possibly completely wrong, and where I capitalize “TrackBack” at least three different ways:
I recently converted my weblog to UTF-8. I’ve discovered an interesting problem, though: the other day, I got a trackback in German, from a Movable Type weblog encoded in ISO-8859-1. Movable Type inserted the trackback’s excerpt as-is, and the non-ASCII characters showed up wrong. I’ve fixed this particular problem, via a hack that assumes trackback excerpts are ISO-8859-1 if they aren’t valid UTF-8, but this doesn’t address the issue at hand here, since it works only for text written in Western European languages (luckily for me, this is an English weblog, so that’s mostly what I’ve got).
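A minimal sketch of that kind of fallback, written in Python rather than Movable Type’s Perl. The function name decode_excerpt is hypothetical, and this only illustrates the idea rather than reproducing the actual hack:

```python
def decode_excerpt(raw: bytes) -> str:
    """Decode a trackback excerpt, guessing at its encoding."""
    try:
        # Non-ASCII text in a legacy 8-bit encoding is rarely valid UTF-8,
        # so a successful decode is a strong hint that it really is UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything else is ISO-8859-1. Every byte sequence
        # decodes under ISO-8859-1, so this never fails, but it produces
        # garbage for text that was actually in some other encoding.
        return raw.decode("iso-8859-1")
```

German text survives either way, but something like a Russian excerpt in KOI8-R would fail the UTF-8 check and then be misread as Western European characters, which is exactly the limitation described above.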
The TrackBack Technical Specification makes no mention of character encoding. A trackback ping is an HTTP POST request of type application/x-www-form-urlencoded (the response is XML, and XML handles encoding problems very well). The form of an application/x-www-form-urlencoded document is defined in the HTML specification (of all places) to be the same as URL encoding, but URL encoding is (or at least was, at the time HTML was designed) for ASCII text only, and the encoding of non-ASCII characters is undefined. Most browsers handle this issue by using the encoding of the HTML page that contains the form, but this isn’t possible with trackbacks, which have no associated form.
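To make the ambiguity concrete, here is a small Python illustration using the standard urllib.parse module. The sample word is my own choice. The same text produces different bytes on the wire depending on the sender’s encoding, and nothing in the request says which one was used:

```python
from urllib.parse import quote, unquote

word = "Grüße"

# The same text, percent-encoded under two different character encodings.
as_latin1 = quote(word, encoding="iso-8859-1")  # Gr%FC%DFe
as_utf8 = quote(word, encoding="utf-8")         # Gr%C3%BC%C3%9Fe

# A receiver has to guess. Guessing right round-trips the text;
# guessing wrong silently yields something else.
right_guess = unquote(as_utf8, encoding="utf-8")
wrong_guess = unquote(as_latin1, encoding="utf-8")
```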
It appears that the choice of application/x-www-form-urlencoded as an encoding for submitting Trackback pings is an unfortunate one. Using multipart/form-data or another format (such as XML) that allows for a defined character encoding would have been a better choice. But given that the choice has already been made, what can be done to improve the situation?
One possibility would be to amend the TrackBack specification to allow (and recommend) the use of multipart/form-data instead of application/x-www-form-urlencoded. The Movable Type implementation of TrackBack uses the Perl CGI module, which already knows how to parse this format, so existing weblogs would be able to receive pings sent this way (although further code would be needed to extract the charset and do the translation). Another possibility would be to define TrackBacks to always use UTF-8 encoding for sending pings. This would cause problems with non-ASCII TrackBacks between new and old implementations, but once everyone was upgraded, things would work smoothly. The third possibility, and the one with the fewest interoperability concerns with existing implementations, would be to have senders of TrackBack pings encode any non-ASCII characters using HTML entities before sending. This has the advantage that it does not require sending any non-ASCII character in a ping, so no changes need to be made to the protocol, and existing Web sites that display excerpts in HTML or XML contexts will work without a hitch.
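The third option could look something like this on the sending side. This is a rough sketch in Python: the helper names entity_encode and build_ping_body are made up, the example URL is a placeholder, and the url and excerpt field names are what I understand the TrackBack ping to use:

```python
from urllib.parse import urlencode


def entity_encode(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    # "xmlcharrefreplace" turns e.g. U+00FC into "&#252;" and leaves
    # ASCII characters untouched.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def build_ping_body(url: str, excerpt: str) -> str:
    """Build a form-urlencoded ping body that contains only ASCII."""
    return urlencode({"url": url, "excerpt": entity_encode(excerpt)})


body = build_ping_body("http://example.com/post", "Grüße")
```

Because the body is pure ASCII by the time it is percent-encoded, the receiver never has to guess at a character encoding.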
The downside is that the trackback contents would no longer be treatable as unformatted text for the purpose of putting in email, etc… Any parser of trackback data for non-SGML-like purposes would need to decode the HTML entities. This is not necessarily bad, especially since I imagine there are trackback pingers out there who are probably sending full-fledged HTML in their excerpts anyway. It would require the addition of character set conversion to Movable Type, though (since HTML entities are Unicode-based), which it somehow has managed to avoid so far.
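For the plain-text case, the decoding step is at least small. A sketch in Python, where the function name excerpt_to_plain_text is hypothetical and the input is assumed to contain only character references, not real markup:

```python
import html


def excerpt_to_plain_text(excerpt: str) -> str:
    """Turn character references such as "&#252;" back into characters."""
    # html.unescape handles numeric and named references; it does not
    # strip tags, so an excerpt containing full HTML needs more work.
    return html.unescape(excerpt)


plain = excerpt_to_plain_text("Gr&#252;&#223;e")
```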
Okay, I’m done now.