17.5.12

Encodings - The Bane of Multilingual Text Processing


This post was prompted by a new assignment I got for my HiWi job at Uni Konstanz.
This time I needed to extract questions from an Urdu corpus and write them to a file. Since the Urdu script differs significantly from the Latin alphabet used for English, I expected there to be problems with the encoding. I didn't know exactly what they would be, though, so I thought it would be interesting to document them in this post.

First of all, the Urdu question mark character is ؟, which Python sees as the byte sequence \xd8\x9f (its UTF-8 encoding). Because these bytes are not ASCII (which is, unfortunately, the default source encoding for Python programs), I had to place a special declaration at the start of my file specifying that it uses the 'utf-8' encoding.
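
For anyone who hasn't run into it, the declaration is a magic comment at the very top of the source file. A minimal sketch, assuming Python 2 (where the ASCII default applies):

```python
# -*- coding: utf-8 -*-
# Without the declaration above, Python 2 refuses to run a source file
# that contains non-ASCII bytes such as the ؟ below.

print repr('؟')    # the byte string '\xd8\x9f' (two UTF-8 bytes)
print repr(u'؟')   # the unicode string u'\u061f' (one code point)
```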

In addition, I had to alter the patterns I was using to find question marks, because to the regular expression engine the byte sequence \xd8\x9f is in fact two separate characters.
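
Here is a small illustration of the difference (the word سوال, "question", is just an example I picked; Python 2 again):

```python
# -*- coding: utf-8 -*-
import re

# The word "سوال" followed by the Urdu question mark, once as UTF-8
# bytes and once as a proper Unicode string.
as_bytes = 'سوال؟'
as_unicode = u'سوال؟'

print len(as_bytes)     # 10 -- every letter takes two UTF-8 bytes
print len(as_unicode)   # 5  -- one code point per character

# "Any single character at the end of the string" therefore behaves
# differently on the two representations:
print repr(re.search('.$', as_bytes).group())     # '\x9f', half a question mark
print repr(re.search(u'.$', as_unicode).group())  # u'\u061f', the whole ؟
```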

In my naïveté I thought that my troubles were over. What I had not accounted for were the encodings of the input and output files. The latter proved to be more of an issue. It seemed like the input file could be read in and processed fine as long as it was not opened with ASCII encoding. Not so with the output file.

In both cases I used the open() function from the codecs module, which let me specify not only the file name and mode, but also the encoding of the opened file.
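
For reference, the call looks like this (the file names are made up):

```python
import codecs

# Like the built-in open(), but with an extra encoding argument.
# Reads come back as unicode objects; write() expects unicode objects.
infile = codecs.open('urdu_corpus.txt', 'r', encoding='utf-8')
outfile = codecs.open('questions.txt', 'w', encoding='utf-8')
```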

Yet despite my specifying the 'utf-8' encoding for the output file, every call to its write() method got me the brilliant UnicodeError that claimed first that the ASCII codec could not decode the string, and later that the 'utf-8' codec could not do the same.
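
A stripped-down reproduction of the first of those errors (Python 2; the exact message depends on the data):

```python
import codecs

out = codecs.open('questions.txt', 'w', encoding='utf-8')

try:
    # Handing a *byte* string to the codecs writer makes Python 2 try to
    # decode it with the default ASCII codec before re-encoding it:
    out.write('\xd8\x9f')
except UnicodeDecodeError as err:
    print err   # 'ascii' codec can't decode byte 0xd8 in position 0: ...

# Handing it a unicode object instead does what one would expect.
out.write(u'\u061f')
out.close()
```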

I was confused... I had proof that the program was in fact finding questions, but was it storing them in the same encoding they were found in, or was it sneakily converting them to ASCII (or even UTF-8!) behind my back somewhere during processing? After a couple more frustratingly unsuccessful attempts at locating the issue with the cunning use of print statements, I turned to the Web (perhaps I should have started there) and plugged my last error into Google.

N.B. Do you guys know what Google is? It's very magical, you should definitely try it out: just type something in the box here, hit enter, see what happens!

Trolling aside, I did manage to find some useful material online. This article from Joel Spolsky, for example, explains that ideally all strings are kept as Unicode code points, and only at read/write time are encodings used to convert said code points (hex numbers, really) into bytes, or vice versa. This might explain why Python reports every encoding/decoding problem as some flavour of UnicodeError, even though the encodings involved may be different.
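
Put differently (my own toy round trip, not anything from the article):

```python
# Decode bytes into Unicode as soon as they enter the program, do all
# of the processing on unicode objects, and encode back into bytes
# only when writing out again.

raw = '\xd8\x9f'               # bytes, as they arrive from disk
text = raw.decode('utf-8')     # u'\u061f', a single code point
# ... all searching and slicing happens on `text` ...
again = text.encode('utf-8')   # back to bytes, only at write time
assert again == raw
```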

Reading Joel's rant also prompted me to reconsider the question of what encoding my Urdu data was actually in. I thought perhaps there was a special encoding for Urdu... and buried myself in the Python documentation.

Fortunately, I didn't have to go very far. The list of supported encodings can be found on the documentation page for the codecs module. To my surprise, I discovered that Urdu has its own encoding, very "intuitively" named CP1006.

As soon as I opened all of my files with that encoding and decoded my output buffer with it as well, the code worked like a charm and stored some reasonably unreadable (my OS doesn't support CP1006, apparently) Urdu sentences.
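
The working version boiled down to something like this (file names are hypothetical, question detection is simplified to "contains the question mark", and it assumes the corpus really is stored as CP1006, as mine turned out to be):

```python
import codecs

URDU_QUESTION_MARK = u'\u061f'

# Read and write with the Urdu codepage; in between everything stays a
# unicode object, so the membership test below is encoding-agnostic.
infile = codecs.open('urdu_corpus.txt', 'r', encoding='cp1006')
outfile = codecs.open('questions.txt', 'w', encoding='cp1006')

for line in infile:
    if URDU_QUESTION_MARK in line:
        outfile.write(line)

infile.close()
outfile.close()
```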

Let's see what my supervisor has to say about this mess...

Fazit, or What I Learned:
- One should always check what language one's input and output files are in and whether they correctly use the relevant encoding.
- One should keep in mind that encoding has to be specified for every file one deals with. This includes the files containing the actual code itself.
- The codecs module is a useful one.
- Joel Spolsky is a sassy blogger.