Krol is blating the wugs.: python

Showing posts with label python. Show all posts

1.1.13

A Linguist trying to Program??

I've been writing miscellaneous scripts in Python for a little over two years now.
I am getting to the point where I can sit down and code a simple and bug-free program in a couple of hours. I, however, always have doubts lurking in the back of my mind whether my skills are in fact adequate, due to a lack of formal training in computer science. At the same time I know I do not right now have the time and energy to go "hardcore" into software design. In fact, it took me a long time to formulate what exactly I thought I was missing in terms of programming. What I think right now is that I need some practice designing algorithms and being able to structure a set of tasks efficiently and within a reasonable time frame.
Some friends more experienced in the field recommended I work my way through the apparently notorious book Structure and Interpretation of Computer Programs (referred to SICP for brevity according to the general convention). I decided to give it a spin this break.
To the dismay of some of my friends and family members, I spent Christmas reading that book and writing small snippets of Scheme.

First of all, I am thoroughly surprised why introductory computer science courses are not taught in LISP. Its syntax is so simple that one understands the principles of assembling LISP statements within the first hour of reading. I am definitely treading dangerous waters here, trying to weigh in on a discussion about which language should be used to introduce people to computer science. What I am saying here is purely an opinion, not a particularly informed one, to be honest. It is, however, a starting point and I would like to record it, even if only to laugh at it a few years down the road. It seems to me that the usefulness of LISP as an introductory language lies precisely in its simple syntax and the apparent uselessness, when it comes to real-world applications. I think this provides a nice break from the constant interfacing of libraries with user input (be it prompts of files) with yet more libraries that seems to characterize quite a bit of programming today. Instead one focuses more on the actual procedures and processes being discussed and not on reading endless APIs.

I was actually so impressed with the apparent simplicity of LISP syntax, that I decided to write an interpreter for it in Python. This turned out to be a tougher undertaking than I thought and deserves its own post in the future. A quick glance at Peter Norvig's take on the problem seems to indicate that I'm on the right track, however.

But enough of my ruminations, I originally wanted to share what I've learned so far. I would like to start with mentioning that I have been trying to diligently do the exercises in the book, thus my progress has been slow and in a little under a week of more or less continuous reading I have gone through only the first two chapters.

First of all, understanding recursion in its fullness was very exciting. I especially liked that almost any recursive process can have an iterative implementation (see this thread for some examples of recursion that cannot be made iterative), something that should be more resource-friendly in real-life applications of recursion. Tree and list traversal were the tasks I found most difficult to conceptualize and implement, so I will need to revisit those.

Another concept I found mind-stretching was data abstraction. I have heard the term thrown around here and there, probably even looked it up on Wikipedia sometime in the past. SICP covers it fairly rigorously, however.
What I frankly found enlightening was the fact that procedures can be abstracted over either operations or data types. For example, we can try to define a generic operation procedure that will then be performed slightly (or not so slightly) differently depending on the data type of the input. If we wanted to abstract over data types, however, we'd essentially have to define a data type as a procedure that takes another procedure as an argument and modifies its execution based on the constraints we set for that data type.

Well, that's about as far as I've made it so far. After 4 or 5 days of work I have a little under a hundred lines of mathematical operations in LISP and a semi-working python Scheme interpreter.

P.S.
While rummaging in the Internet, I found this post. Turns out even MIT has moved away from using LISP. Am I really 30 years behind?

17.5.12

Encodings - The Bane of Multilingual Text Processing

This post was prompted by a new assignment I got for my HiWi job at Uni Konstanz.
This time I needed to extract questions from an Urdu corpus and write them to a file. Since the Urdu alphabet significantly differs from American English, I expected there to be problems with the encoding. I didn't quite know what exactly they would be, so it was interesting for me to document them in this post.

First of all, the Urdu question mark character is ؟, internally Python represents it as \xd8\x9f. Because this representation is not ASCII (which is, unfortunately, the default encoding for Python programs), I had to place a special declaration at the start of my file that specified the use of 'utf-8' encoding in it.

In addition I had to alter the patterns I was using to find question marks, because in the regular expressions syntax \xd8\x9f is in fact two characters.

In my naïveté I thought that my troubles were over. What I had not accounted for were the encodings of the input and output files. The latter proved to be more of an issue. It seemed like the input file could be read in and processed fine as long as it was not opened with ASCII encoding. Not so with the output file.

In both cases I used the (relevantly) custom open() method from the codecs module that allowed me to specify not only file name and edit mode, but also the encoding of the opened file.

Yet despite my specifying of the 'utf-8' encoding on the output file, every time I called on the output file's write() method, I kept getting the brilliant UnicodeError* that claimed first that the ASCII codec could not decode the string, then that the 'utf-8' codec could not do the same.

I was confused... I had proof that the program was in fact finding questions, but was it storing them in the same encoding as they were found in, or perhaps was it sneakily converting to ASCII (or even UTF-8 !) behind my back somewhere during processing? After a couple more of frustratingly unsuccessful attempts at locating the issue with the cunning use of print statements, I turned to the Web (perhaps I should have started there) and plugged my last error into Google.

N.B. Do you guys know what Google is? It's very magical, you should definitely try it out: just type something in the box here, hit enter, see what happens!

Trolling aside, I did manage to find some useful material online. This article from Joel Spolsky, for example, explains that ideally all strings are kept as Unicode code points and only at read/write time do encodings get utilized to convert said code points (hex numbers, really) into characters or vice versa. This might explain why Python always raises a UnicodeError for exceptions related to encoding/decoding, even though the encodings involved may be different.

Reading Joel's rant also prompted me to reconsider the question what encoding my Urdu data was in. I thought perhaps there is a special encoding for Urdu... and buried myself in the python API.

Fortunately, I didn't have to go very far. The entries for encodings can be found on the page for the codecs module. To my surprise, I discovered that Urdu used a special encoding, very "intuitively" named CP1006.

As soon as I set all of my files to be encoded in it as well as my output buffer to be decoded as it, the code seemed to work like a charm and stored some reasonably unreadable (my OS doesn't support CP1006, apparently) Urdu sentences.

Let's see what my supervisor has to say about this mess...

Fazit, or What I Learned:
- One should always check what language one's input and output files are in and whether they correctly use the relevant encoding.
- One should keep in mind that encoding has to be specified for every file one deals with. This includes the files containing the actual code itself.
- The codecs library is a useful one.
- Joel Spolsky is a sassy blogger.

6.4.12

Switch Statement in Python?

Recently I am being presented with more opportunities to program in Python. As a result, I am learning more about the language, some of its core modules and packages, how it interacts with the operating system, etc.

Yesterday I discovered something I so useful that I would like to share it. This something is Python's equivalent of a switch statement.

It is not something particularly obscure, but it showcases a feature of the language that I found interesting.

For my job here at the uni I needed to run some scripts on corpora from 4 different languages. I decided to automate the process. Hence I created one main script that would call on the a respective subscript depending on which language was being processed. This meant that I was dealing with at least 4 cases (5, if I wanted to catch potential bugs when no language was specified for some reason).

Unlike the C family or Java, Python does not have the explicit construction switch{ case1 -> A, case2 -> B, caseN -> alpha} and the official solution, the use of elif-s namely, isn't very elegant and can be somewhat tedious to write. According to this PEP, there was not enough support for an explicit switch statement among coders, and now that I have discovered the alternative solution, I don't blame them.

At the time, however, I was still unenlightened and thus turned to Google that through its magic spewed out this page at the top of the results. When I searched for "python switch statement" again before writing this post, I also found a very humble and short description here, and also a python mailing list topic where it is mentioned.

It was by the concept introduced there that I found interesting. Turns out, it is possible to store any defined aliases as dictionary values. This includes, importantly, method and module names and even lambdas. This means that to emulate a switch statement, all one needs to do is create a dictionary with the keys representing the different cases and the values the respective procedures and then simply reference these procedures using the keys.

I hope that as I continue to write more code I will keep discovering neat things like this as I go. They make one's day sometimes.