1.1.13

A Linguist trying to Program??

I've been writing miscellaneous scripts in Python for a little over two years now.
I am getting to the point where I can sit down and code a simple, bug-free program in a couple of hours. However, I always have doubts lurking in the back of my mind about whether my skills are in fact adequate, due to my lack of formal training in computer science. At the same time, I know I do not currently have the time and energy to go "hardcore" into software design. In fact, it took me a long time to formulate what exactly I thought I was missing in terms of programming. What I think right now is that I need practice designing algorithms and structuring a set of tasks efficiently and within a reasonable time frame.
Some friends more experienced in the field recommended I work my way through the apparently notorious book Structure and Interpretation of Computer Programs (referred to as SICP for brevity, according to the general convention). I decided to give it a spin this break.
To the dismay of some of my friends and family members, I spent Christmas reading that book and writing small snippets of Scheme.

First of all, I am thoroughly surprised that introductory computer science courses are not taught in LISP. Its syntax is so simple that one understands the principles of assembling LISP statements within the first hour of reading. I am definitely treading dangerous waters here, trying to weigh in on the discussion about which language should be used to introduce people to computer science. What I am saying is purely an opinion, and not a particularly informed one, to be honest. It is, however, a starting point, and I would like to record it, even if only to laugh at it a few years down the road. It seems to me that the usefulness of LISP as an introductory language lies precisely in its simple syntax and its apparent uselessness when it comes to real-world applications. I think this provides a nice break from the constant interfacing of libraries with user input (be it prompts or files) with yet more libraries that seems to characterize quite a bit of programming today. Instead, one focuses on the actual procedures and processes being discussed and not on reading endless APIs.

I was actually so impressed with the apparent simplicity of LISP syntax, that I decided to write an interpreter for it in Python. This turned out to be a tougher undertaking than I thought and deserves its own post in the future. A quick glance at Peter Norvig's take on the problem seems to indicate that I'm on the right track, however.



But enough of my ruminations; I originally wanted to share what I've learned so far. I would like to start by mentioning that I have been diligently doing the exercises in the book, so my progress has been slow: in a little under a week of more or less continuous reading I have gotten through only the first two chapters.

First of all, understanding recursion in its fullness was very exciting. I especially liked that almost any recursive process can be given an iterative implementation (see this thread for some examples of recursion that cannot be made iterative), which should be more resource-friendly in real-life applications. Tree and list traversal were the tasks I found most difficult to conceptualize and implement, so I will need to revisit those.
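To illustrate what I mean, here is a small sketch in Python (my own working language, rather than the book's Scheme) of the same factorial computation written once as a recursive process and once as an iterative one:

```python
def factorial_recursive(n):
    # Each call waits on the result of the next one; the chain of
    # deferred multiplications grows linearly with n.
    if n == 0:
        return 1
    return n * factorial_recursive(n - 1)

def factorial_iterative(n):
    # The same result computed with a running product: no pending
    # operations pile up on the call stack, so space stays constant.
    product = 1
    for i in range(1, n + 1):
        product *= i
    return product
```

Both return the same values, but the iterative version is the one that stays friendly to resources as n grows.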

Another concept I found mind-stretching was data abstraction. I have heard the term thrown around here and there, and probably even looked it up on Wikipedia at some point. SICP, however, covers it fairly rigorously.
What I frankly found enlightening was the fact that procedures can be abstracted over either operations or data types. For example, we can define a generic operation that is then performed slightly (or not so slightly) differently depending on the data type of the input. If we wanted to abstract over data types instead, we'd essentially define a data type as a procedure that takes another procedure as an argument and modifies its execution based on the constraints we set for that data type.
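Here is a rough Python sketch of the two directions as I understand them (the names and shapes here are my own inventions, not SICP's):

```python
# Abstracting over operations: one generic procedure that dispatches
# on a type tag attached to the data.
def area(shape):
    tag, data = shape
    if tag == "circle":
        return 3.14159 * data ** 2   # data is the radius
    if tag == "square":
        return data ** 2             # data is the side length
    raise ValueError("unknown shape: " + tag)

# Abstracting over data: a pair represented purely as a procedure.
# The "data type" is a closure whose behavior depends on the
# procedure's argument -- there is no record or tuple underneath.
def cons(a, b):
    def pair(pick):
        return a if pick == 0 else b
    return pair

def car(p):
    return p(0)

def cdr(p):
    return p(1)
```

The second half is the part that stretched my mind: the pair has no existence apart from the procedure that answers the selectors.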

Well, that's about as far as I've made it so far. After 4 or 5 days of work I have a little under a hundred lines of mathematical operations in LISP and a semi-working Python Scheme interpreter.



P.S.
While rummaging around the Internet, I found this post. Turns out even MIT has moved away from teaching in LISP. Am I really 30 years behind?

17.5.12

Encodings - The Bane of Multilingual Text Processing


This post was prompted by a new assignment I got for my HiWi job at Uni Konstanz.
This time I needed to extract questions from an Urdu corpus and write them to a file. Since the Urdu script differs significantly from the Latin alphabet, I expected there to be problems with the encoding. I didn't quite know what exactly they would be, so it seemed worthwhile to document them in this post.

First of all, the Urdu question mark character is ؟, which Python represents internally as the byte sequence \xd8\x9f. Because this representation is not ASCII (which is, unfortunately, the default encoding for Python programs), I had to place a special declaration at the start of my file specifying that it uses the 'utf-8' encoding.
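The byte sequence is easy to verify interactively (shown here in Python 3 syntax, where strings are Unicode by default; in the Python 2 I was using, a u'' prefix would be needed):

```python
q = "؟"                    # U+061F, the Arabic question mark
print(q.encode("utf-8"))   # the two bytes Python showed me: b'\xd8\x9f'
```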

In addition, I had to alter the patterns I was using to find question marks, because in regular expression syntax \xd8\x9f is in fact two characters.

In my naïveté I thought my troubles were over. What I had not accounted for were the encodings of the input and output files. The latter proved to be more of an issue: the input file could apparently be read in and processed fine as long as it was not opened with the ASCII encoding. Not so with the output file.

In both cases I used the custom open() function from the codecs module, which allowed me to specify not only the file name and mode, but also the encoding of the opened file.
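The call looks roughly like this (the file name is made up for illustration):

```python
import codecs

# codecs.open accepts an encoding argument, unlike the plain built-in
# open() in Python 2; everything written to the file object is then
# converted through that codec at write time.
with codecs.open("questions.txt", "w", encoding="utf-8") as out:
    out.write(u"؟\n")

# Reading back goes through the same codec, decoding bytes to Unicode.
with codecs.open("questions.txt", "r", encoding="utf-8") as src:
    print(src.read())
```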

Yet despite my specifying the 'utf-8' encoding on the output file, every time I called the output file's write() method I kept getting the brilliant UnicodeError, which claimed first that the ASCII codec could not decode the string, and then that the 'utf-8' codec could not do the same.

I was confused... I had proof that the program was in fact finding questions, but was it storing them in the same encoding they were found in, or was it sneakily converting to ASCII (or even UTF-8!) behind my back somewhere during processing? After a couple more frustratingly unsuccessful attempts at locating the issue with the cunning use of print statements, I turned to the Web (perhaps I should have started there) and plugged my last error into Google.

N.B. Do you guys know what Google is? It's very magical, you should definitely try it out: just type something in the box here, hit enter, see what happens!

Trolling aside, I did manage to find some useful material online. This article by Joel Spolsky, for example, explains that ideally all strings are kept as Unicode code points, and only at read/write time are encodings used to convert those code points (just numbers, really) into bytes or vice versa. This might explain why Python always raises a UnicodeError for exceptions related to encoding/decoding, even though the encodings involved may differ.

Reading Joel's rant also prompted me to reconsider the question of what encoding my Urdu data was in. I thought perhaps there was a special encoding for Urdu... and buried myself in the Python documentation.

Fortunately, I didn't have to dig very far. The entries for encodings can be found on the page for the codecs module. To my surprise, I discovered that Urdu has a special encoding, very "intuitively" named CP1006.
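CP1006 ships with Python's standard codec registry, so it can be requested by name like any other encoding (this just shows the lookup; whether a particular character is representable in it is a separate question):

```python
import codecs

# Look up the Urdu codec by name; raises LookupError if unknown.
urdu_codec = codecs.lookup("cp1006")
print(urdu_codec.name)
```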

As soon as I set all of my files to be encoded in it, and my output buffer to be decoded from it, the code worked like a charm and stored some reasonably unreadable (my OS doesn't support CP1006, apparently) Urdu sentences.

Let's see what my supervisor has to say about this mess...

Fazit, or What I Learned:
- One should always check what language one's input and output files are in and whether they correctly use the relevant encoding.
- One should keep in mind that encoding has to be specified for every file one deals with. This includes the files containing the actual code itself.
- The codecs library is a useful one.
- Joel Spolsky is a sassy blogger.

6.4.12

Switch Statement in Python?

Recently I have been presented with more opportunities to program in Python. As a result, I am learning more about the language, some of its core modules and packages, how it interacts with the operating system, and so on.

Yesterday I discovered something so useful that I would like to share it: Python's equivalent of a switch statement.

It is not something particularly obscure, but it showcases a feature of the language that I found interesting.

For my job here at the uni I needed to run some scripts on corpora from four different languages. I decided to automate the process, so I created one main script that would call the respective subscript depending on which language was being processed. This meant that I was dealing with at least four cases (five, if I wanted to catch the situation where no language was specified for some reason).

Unlike the C family or Java, Python does not have an explicit switch { case1 -> A, case2 -> B, caseN -> alpha } construction, and the official solution, namely a chain of elifs, isn't very elegant and can be somewhat tedious to write. According to this PEP, there was not enough support for an explicit switch statement among coders, and now that I have discovered the alternative solution, I don't blame them.

At the time, however, I was still unenlightened and thus turned to Google, which through its magic spewed out this page at the top of the results. When I searched for "python switch statement" again before writing this post, I also found a very humble and short description here, as well as a Python mailing list topic where it is mentioned.

It was the concept introduced there that I found interesting. It turns out that any defined name can be stored as a dictionary value. This includes, importantly, functions, modules, and even lambdas. This means that to emulate a switch statement, all one needs to do is create a dictionary whose keys represent the different cases and whose values are the respective procedures, and then simply call those procedures by looking them up by key.
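Here is a sketch of what this looks like for my language case (the function names, language codes, and corpus file name are invented stand-ins for my actual subscripts):

```python
def process_english(corpus):
    return "English pipeline ran on " + corpus

def process_german(corpus):
    return "German pipeline ran on " + corpus

def process_unknown(corpus):
    return "No language specified for " + corpus

# The dictionary plays the role of the switch statement: keys are the
# cases, values are the procedures to run for each case.
dispatch = {
    "en": process_english,
    "de": process_german,
}

def run(language, corpus):
    # dict.get with a default covers the catch-all fifth case.
    return dispatch.get(language, process_unknown)(corpus)

print(run("de", "my_corpus.txt"))  # German pipeline ran on my_corpus.txt
```

Compared to a chain of elifs, adding a new language is a one-line change to the dictionary rather than another branch.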

I hope that as I continue to write more code I will keep discovering neat things like this. They make one's day sometimes.

4.12.10

Free Alternative to Acrobat

This is a nifty utility that allows one to merge documents into one and do other cool stuff to PDFs without having to pay for Acrobat.
It should work on any operating system. I've tried it out on Ubuntu 10.04 Lucid Lynx, and so far it has worked without a hitch through the Terminal.

PDF Toolkit

2.11.10

Hiatus Over: Spellcheckers

So I've decided to get back to posting linguistics related things in this blog.
I probably won't have time to continue with AWK this semester, but I'll try resurrecting that theme again too.

For now, just a short rant about Spellcheckers:

Why are some singular forms of nouns permissible while the plural forms of the same nouns are marked as errors?
I unfortunately don't have on hand the example that drew my attention (this happened a couple of months ago and I didn't write down the culprit word at the time), so this post must remain no more than a rant. Also, I am dealing with OpenOffice 3.2. Maybe MS Office is exempt from this problem, but I doubt it.
Apparently there is no mechanism ensuring that the plural forms of acceptable nouns are acceptable too. This is confirmed by the fact that if you add a word to the dictionary manually, the plural form is not created along with it, but needs to be added separately, the same way the singular form was added: by hand.

I wonder if this is done on purpose. Who should I ask?

1.9.10

Word of the Day

I thought I'd take a break from torturing AWK and write about something fun and light.
I have been meaning to post this for a couple of days, but because of moving into college and all the related prep I've been procrastinating.
Recently, on Language Log, I discovered an awesome word: Locution
This word appeared in the post Pure Chinese, in the second paragraph of the text. It's a synonym for "expression", "manner of speaking", "phraseology" (this last one is even listed as a synonym in Merriam-Webster). I feel that this word can be used often, considering that many of my friends have fairly strange ways of expressing themselves in speech.
I got the impression that the article itself is more speculation than concrete analysis. However, it does introduce some interestingly different approaches to dealing with ambiguous translation situations.

16.8.10

Patterns and Quoting

I am back with more discoveries of what one can do with files in AWK. My previous post was somewhat of an introduction and thus didn't cover much ground in terms of actual coding practices. This post and the ones that follow will be more concerned with particular ways to parse files. I also hope they will be a bit more structured.
I base the organization of my material on the University of Utrecht AWK manual, as I found it more comprehensible than the official GNU guide.

1 Syntax
Just as a reminder, below is the general syntax of an AWK command in the terminal.
awk 'command' input_file
What does this command part consist of? What do we need from text files in general? Most of the time, we need to find some data and perform some operations on it. Hence the command part actually breaks up into two subparts (though their presence is not always apparent): one that searches for certain patterns in the input file (conveniently called the pattern) and one that determines the operation to perform once the pattern has been found (called the action). In my previous post I mentioned some parts of the AWK command being enclosed in "{". It turns out these are the actions. Patterns are not enclosed in braces at all. We will deal with patterns in this entry.

1.2 Patterns
In order to search for something we need some way to express ourselves to the program: we need to specify how and what to look for in files. The simplest way to do this is to use a regular expression. With the help of operators, regular expressions let one define a pattern that AWK will search for in a file. The simplest form of a regular expression has already been introduced in the previous post. It searches for a certain sequence by placing it between forward slashes, like so:
awk '/ab/' test1.txt

Note that we are still working with the same input file, which represents some of the words of our invented 3-segment language. The above command will print all lines that contain the sequence "ab" anywhere in them. Now what if we want to search only the first column of each line for the sequence "ab" and print all such lines? In this case we can use the ~ operator. The ~ tells AWK to match the 1st column of every line against a specified sequence, in our case "ab". The actual command looks like this:
awk '$1~/ab/' test1.txt

Again, because we didn't specify what to do once our sequence is found, AWK just prints the entire line where it found "ab" in the first column.

A complete list of regular expression operators can be found here. I think they are explained pretty well there, so I will only clarify some points that the authors of the tutorial apparently took as given and I didn't.
First of all, there are operators that are placed within the slashes, and there are those that are left outside. One way to differentiate which symbols stay inside the slashes is to consider what they affect. If they modify the actual expression we are looking for (for example, the "^", "$", "[]" etc. operators), they are left inside the slashes. If they describe where in the input file to look for an expression, they are placed outside the slashes. Another way to know where an operator belongs is to remember that "~" and "!~" go outside the slashes and everything else goes inside.
Secondly, the "\" symbol is used to suppress the special meaning of other symbols in AWK in general, not only in regular expressions. For example, if one needs to search for the value $234 in the first column of a file and, upon finding it, print "Don't panic!", the command will look like this:
awk '$1~/\$234/ { print "Don'\''t panic!" }' input_file_name

This will print "Don't panic!" for every line that contains the sequence "$234" in its first column. The dollar sign is special inside a regular expression, so we cancel its special meaning with a backslash. The single quote in "Don't" is trickier: since the whole AWK program is enclosed in single quotes for the shell, we cannot simply type another one inside; instead we have to close the single-quoted region, insert an escaped quote, and open a new region, which is what the '\'' sequence does. This has to do with the system of quoting that AWK commands inherit from the shell. The official explanation of it can be read here. I found it somewhat confusing, so I will try to explain it to myself and the readers below.
There are two types of quotes used in AWK commands: single quotes (') and double quotes ("). In order to avoid errors, one must always make sure the quotes are closed. This applies to both types. The exception, of course, is when a quote character appears verbatim inside a string (as in the somewhat nonsensical phrase "Don't" worry!).
As mentioned in the previous post, and as can be seen in all the commands we have run so far, single quotes mark the beginning and end of the pattern/action part when running an AWK command in the terminal. Double quotes are used for the print command: they tell it what string to print verbatim. Here's the catch: single quotes are "stronger" than double quotes, in the sense that they cannot be cancelled out by the latter. Hence a line like:
awk '{ print "don't worry" }'

is going to confuse the shell, because it will see one closed pair of single quotes (everything up to the apostrophe in "don't") and then an open single quote that needs to be closed. We will receive an error if we run this script. Placing a backslash in front of the second single quote will not help, because the shell will still interpret that quote as the end of the command. Thus we must resort to the trick of placing another single-quote pair within the original one.
awk '{ print "don'\''t worry" }'

Notice how the two single-quoted regions are symmetrically closed off, isolating the relevant parts of the command. The GNU guide offers some other solutions, but it seems to me that this is the most elegant and logical one in the context of a short command run in the terminal.