Krol is blating the wugs.: 2010

4.12.10

Free Alternative to Acrobat

This is a nifty utility that allows one to merge documents into one and do other cool stuff to PDFs without having to pay for Acrobat.
It should work on any operating system. I've tried it out on Ubuntu 10.04 Lucid Lynx, and so far it has worked without a hitch from the Terminal.

PDF Toolkit

2.11.10

Hiatus Over: Spellcheckers

So I've decided to get back to posting linguistics-related things on this blog.

I probably won't have time to continue with AWK this semester, but I'll try resurrecting that theme again too.

For now, just a short rant about Spellcheckers:

Why are some singular forms of nouns permissible while the plural forms of the same nouns are marked as errors?

I unfortunately don't have the example that drew my attention on hand (this happened a couple of months ago and I didn't write down the culprit word at the time), so this post must stay no more than a rant. Also, I am dealing with OpenOffice 3.2. Maybe MS Office is exempt from this problem, but I doubt it.

Apparently there is no mechanism that ensures that the plural forms of acceptable nouns are acceptable too. This is confirmed by the fact that if you add a word to the dictionary manually, the plural form is not created as well, but needs to be added separately the same way the singular form was added: by hand.

I wonder if this is done on purpose. Who should I ask?

1.9.10

Word of the Day

I thought I'd take a break from torturing AWK and write about something fun and light.

I have been meaning to post this for a couple of days, but because of college move-in and all the related prep I've been procrastinating.

Recently, in the Language Log, I discovered an awesome word: Locution

This word appeared in the second paragraph of a post called Pure Chinese. It's a synonym for "expression", "manner of speaking", "phraseology" (this is even listed as a synonym in Merriam-Webster). I feel that this word can be used often, considering that many of my friends have fairly strange ways of expressing themselves when it comes to speech.

I got the impression that the article itself is more speculation than concrete analysis. However, it does introduce some cool different approaches to dealing with ambiguous translation situations.

16.8.10

Patterns and Quoting

I am back with more discoveries of what one can do with files in AWK. My previous post was somewhat of an introduction and thus didn't cover much ground in terms of actual coding practices. This post and the ones that follow will be more concerned with the particular ways to parse files. I hope that they will be a bit more structured too.

I base the organization of my material on the University of Utrecht AWK manual, as I found it to be more comprehensible than the official GNU guide.

1 Syntax

Just as a reminder, below is the general syntax of an AWK command in the terminal.

awk 'command' input_file

What does this command part consist of? What do we need from text files in general? Most of the time, we need to find some data and perform some operations with it. Hence the command part actually breaks up into two subparts (though their presence is not always apparent): one that searches for certain patterns in the input file (conveniently called pattern) and one that determines the operation to perform after the pattern has been found (this is called action). In my previous post I mentioned some parts of the AWK command being enclosed in curly brackets ("{ }"). It turns out these are the actions. Patterns are written outside the brackets. We will deal with patterns in this entry.
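To make the pattern/action split concrete, here is a small sketch of the three shapes a command can take. It assumes the four-line test1.txt from my previous post; the scratch directory and the variable names (dir, pattern_only and so on) are just my own choices for this illustration.

```shell
# Recreate the sample file from the previous post in a scratch directory.
dir=$(mktemp -d)
printf 'abc abac\ncba caba\ncab cab\nbac bac\n' > "$dir/test1.txt"

# Pattern only: with no action given, AWK prints every matching line.
pattern_only=$(awk '/ba/' "$dir/test1.txt")

# Action only: with no pattern given, the action runs on every line.
action_only=$(awk '{ print $1 }' "$dir/test1.txt")

# Pattern and action: print column 2 of the lines that contain "ba".
both=$(awk '/ba/ { print $2 }' "$dir/test1.txt")

printf '%s\n' "$pattern_only" "---" "$action_only" "---" "$both"
```

The pattern always sits outside the curly brackets and the action inside them, which is why the first command has no brackets at all.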

1.2 Patterns

In order for us to search for something we need some way to express ourselves to the program. We need to specify how and what to look for in files. The simplest way to do this is to use a regular expression. With the help of operators, regular expressions help one define a pattern that AWK will search for in a file. The simplest form of a regular expression has already been introduced in the previous post. It's searching for a certain sequence by placing it between forward slashes, like so:

awk '/ab/' test1.txt

Note that we are still working with the same input file that represents some of the words of our invented 3-segment language. The above command will print all the lines that contain the sequence "ab" anywhere in them. Now what if we want to search only the first column of each line for the sequence "ab" and print all such lines? In this case we can utilize the ~ operator. The ~ tells AWK to check whether the field on its left, here $1 (the 1st column of every line), matches the sequence on its right, in our case "ab". The actual command will look like this:

awk '$1~/ab/' test1.txt

Again, because we didn't specify what to do once our sequence is found, AWK just prints the entire line where it found "ab" in the first column.
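The difference between searching the whole line and searching one column is easiest to see side by side. This is a sketch that assumes the same four-line test1.txt; the scratch directory and the variable names are mine.

```shell
# Recreate the sample file in a scratch directory.
dir=$(mktemp -d)
printf 'abc abac\ncba caba\ncab cab\nbac bac\n' > "$dir/test1.txt"

# "ab" anywhere in the line: "cba caba" qualifies because of its 2nd column.
anywhere=$(awk '/ab/' "$dir/test1.txt")

# "ab" in the 1st column only: "cba caba" drops out.
first_col=$(awk '$1~/ab/' "$dir/test1.txt")

# The same test applied to the 2nd column instead.
second_col=$(awk '$2~/ab/' "$dir/test1.txt")

printf '%s\n' "$anywhere" "---" "$first_col" "---" "$second_col"
```

Swapping $1 for $2 is all it takes to aim the same search at a different column.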

A complete list of regular expression operators can be found here. I think they are explained pretty well there, so I will only proceed to clarify some points that I feel the authors of the tutorial thought were a given and I didn't.

First of all, there are operators that are placed within the slashes, and there are those that are left outside. One way to tell which symbols stay inside the slashes is to consider what they affect. If they modify the actual expression we are looking for (for example, the "^", "$", "[]" etc. operators), then they are left inside the slashes. If they describe where in the input file to look for an expression, the operators are placed outside the slashes. Another way to know where an operator should be placed is to remember that "~" and "!~" are outside the slashes and everything else is inside.
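Here is how I would sketch the inside/outside rule on our sample file. The operators themselves are standard AWK; the particular searches, the scratch directory and the variable names are my own picks for illustration.

```shell
# Recreate the sample file in a scratch directory.
dir=$(mktemp -d)
printf 'abc abac\ncba caba\ncab cab\nbac bac\n' > "$dir/test1.txt"

# Inside the slashes: ^ anchors the match to the start of the field.
starts_ab=$(awk '$1 ~ /^ab/' "$dir/test1.txt")

# Inside the slashes: $ anchors the match to the end of the field.
ends_ab=$(awk '$1 ~ /ab$/' "$dir/test1.txt")

# Inside the slashes: [bc] matches either "b" or "c".
cons_then_a=$(awk '$1 ~ /^[bc]a/' "$dir/test1.txt")

# Outside the slashes: !~ keeps the lines whose 1st column does NOT match.
no_ab=$(awk '$1 !~ /ab/' "$dir/test1.txt")

printf '%s\n' "$starts_ab" "---" "$ends_ab" "---" "$cons_then_a" "---" "$no_ab"
```

In every case the field and the ~ or !~ stay outside, while everything that shapes the expression itself goes between the slashes.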

Secondly, the "\" symbol is used to suppress the special meaning of other symbols in AWK in general, not only in regular expressions. For example, if one needs to search for the value $234 in the first column of a file and upon finding it print "Don't panic!", the command will look like this:

awk '$1~/\$234/ { print "Don'\''t panic!" }' input_file_name

This will print "Don't panic!" for every line that contains the sequence "$234" in its first column. Because the dollar sign has a special meaning inside a regular expression and the phrase "Don't panic!" contains a single quote, we need a way to ensure that they are taken verbatim instead of being interpreted as special characters. The exclamation point needs no such treatment: inside the double quotes of a print statement it is an ordinary character. Note that while we can just insert a backslash in front of the dollar sign to cancel its special meaning, the single quote takes more work: we have to close the single-quoted part, add a backslash followed by a single quote, and then open the single quotes again ('\''). This has to do with the system of quoting that applies when AWK is run from the terminal. The official explanation of it can be read here. I found it somewhat confusing, so I will try to explain it to myself and the readers below.
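A runnable sketch of this search, for anyone who wants to try it. The file prices.txt and its contents are invented for this example, as are the variable names; the second command shows an alternative I am fairly sure of, putting the dollar sign in square brackets instead of escaping it.

```shell
# A made-up input file: a price in column 1, an item in column 2.
dir=$(mktemp -d)
printf '$234 apples\n$99 pears\n$234 plums\n' > "$dir/prices.txt"

# The backslash makes the dollar sign a literal character in the expression.
# '\'' closes the single quotes, adds one literal quote, and reopens them.
escaped=$(awk '$1 ~ /\$234/ { print "Don'\''t panic!" }' "$dir/prices.txt")

# Alternative: inside square brackets the dollar sign is also taken literally.
bracketed=$(awk '$1 ~ /[$]234/ { print $2 }' "$dir/prices.txt")

printf '%s\n' "$escaped" "---" "$bracketed"
```

The first command prints the phrase once for each of the two matching lines, and the second prints the items found on those lines.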

There are two types of quotes used in awk: the single quotes (') and the double quotes ("). In order to avoid getting errors, one must always make sure the quotes are closed. This applies to both types. The exception, of course, is when quotes are used verbatim (as in the somewhat nonsensical phrase "Don't" worry!).

As was mentioned in the previous post and as can be noticed in all the commands we have run so far, the single quotes are used to mark the beginning and end of the pattern/action part when running an AWK command in the terminal. The double quotes are used for the print command; they tell it what string to print verbatim. Here's the catch: single quotes are "stronger" than double quotes, in the sense that they cannot be cancelled out by the latter. Hence a line like:

awk '{ print "don't worry" }'

is going to cause confusion, because it will be read as one pair of closed single quotes (running from the opening quote to the apostrophe in "don't") followed by an open single quote that still needs to be closed. The command will not run as written: the terminal either waits for the missing quote or reports an error. Placing a backslash in front of the second single quote will not help because this quote will still be interpreted as the end of the quoted part. Thus we must resort to the trick of placing another single quote pair within the original one.

awk '{ print "don'\''t worry" }'

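Notice how the single quotes now pair off symmetrically: the first pair runs from the opening quote to the one right after "don", the backslash-quote in the middle stands for the apostrophe itself, and the second pair runs from the quote before "t worry" to the end, isolating the relevant parts of the command. The GNU guide offers some other solutions, but it seems to me that this is the most elegant and logical one in the context of a short command run in the terminal.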
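For comparison, here is a sketch of the quote-pair trick next to two other ways of getting an apostrophe into the output: the octal escape \047 and a variable passed in with -v. I use a BEGIN block so that the commands run without an input file, and the variable names (including sq) are my own.

```shell
# 1. Close the single quotes, add an escaped quote, reopen: '\''
paired=$(awk 'BEGIN { print "don'\''t worry" }')

# 2. Write the apostrophe as its octal character code inside the string.
octal=$(awk 'BEGIN { print "don\047t worry" }')

# 3. Hand the apostrophe to AWK in a variable (sq is a name I made up).
variable=$(awk -v sq="'" 'BEGIN { print "don" sq "t worry" }')

printf '%s\n' "$paired" "$octal" "$variable"
```

All three print the same phrase. The second and third keep the program inside one unbroken pair of single quotes, at the cost of being a little harder to read.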

7.8.10

Discovering the uses of AWK for linguists in Unix.

As I started taking classes beyond intro level in linguistics I noticed that the problems we, students, were given became more complex and the datasets more extensive. I could no longer get away with solving problems in my head; I had to record the data meticulously. I expect this tendency to continue as I continue my studies.

How does one deal with this? One can of course take out a notebook and a pen and draw table after table and try to wrap one's brain around the information contained therein. However, this approach becomes tedious when one has to deal with real-life research, where one needs to account for processes affecting an entire lexicon or several lexicons (if you are doing typological cross-linguistic research).
How does one minimize the necessary but repetitive tasks in analyzing a given dataset, thus speeding up the process and leaving the analyst more time to actually think?
The use of a computer provides the answer to this computational problem (what else are computers there for other than to compute?).
In particular, there is a nifty program that can be used for this purpose. It is called AWK or GAWK. It is a utility that is used to parse, edit and manipulate text files.
This particular program is useful for linguistic analysis for the following reasons:
a) it comes preinstalled on all Unix-based systems (Linux and Mac OS). Installing it in Windows requires some work, but is certainly possible.
b) the programmer has the ability both to run simple parsing instructions directly from the command prompt and to use scripts written in separate "program-files" for more complex analysis tasks. This can greatly automate the process and help create templates for dealing with analysis problems encountered in the past.
c) because AWK deals with text files, even a large dataset will not take up too many resources in terms of space and computing. This allows for quick parsing of large datasets, something quite important for a linguist.

This post will deal with opening AWK on a Unix system and provide an introduction to a couple of simple functions of this language. This is not intended as a tutorial (there are links to extensive documentation at the end of this post and in the right pane of the blog), nor as a Tips and Tricks article (I'm not that cool... yet), but rather an overview of the stuff I've discovered, a work in progress. I will be posting about various other things one can do with AWK as I go along. And NO, I have not yet gotten it to work well on Windows; that will be the topic of a later post sometime in the future.

As was mentioned before, all Unix-based systems (Linux and Macs) come with this program preinstalled. In that sense there's not much "setting up" necessary. One just needs to open the Terminal and type in: awk or gawk to see the list of options and program information.

On a Mac the terminal can be found in Applications->Utilities. In Linux, the exact procedure for opening the terminal depends on your distribution and desktop environment; a quick Google search will tell you where to find it.

Now that you have the Terminal open, let's get down to business. As was mentioned earlier, there are two ways to parse a file with AWK: running a set of instructions directly in the terminal or calling on a file that contains instructions in it. Usually the former method is better for relatively simple analysis, whereas the latter is better suited for more complex parsing. I will look at running commands straight in the terminal first, then proceed to the steps for calling a file.

The general syntax of an AWK command is as follows:

awk(or gawk) 'instructions' inputfile1 inputfile2 etc...

There are several points worth noting in the above line:

#1 Just like in a human written language the spaces between the different parts of the command are important because they separate said parts and tell the computer where one ends and the next begins.

#2 The order of the parts in the command matters. Unlike some human languages with seemingly "free" word order, AWK has a strict structure to its sentences: one calls the program (awk/gawk), one tells it what to do ('instructions') and one specifies which files to do it to (inputfile1 etc.).

#3 The reader might have noticed the single quotes around the instructions part of the command. This brings us to the next point.

#4 The only really configurable part of the command is 'instructions' (note the single quotes), since the (g)awk part doesn't change no matter what, and the rules for specifying an input file are simple. Therefore, most of the interesting stuff is located in this segment of the command.
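As a rough sketch of the two ways of running AWK mentioned above: the same instructions can be typed between single quotes in the terminal or stored in a program file and called with the -f option. The file names here (words1.txt, words2.txt, first_column.awk) and the variable names are invented for this example.

```shell
# Two small input files in a scratch directory.
dir=$(mktemp -d)
printf 'abc abac\ncba caba\n' > "$dir/words1.txt"
printf 'cab cab\nbac bac\n' > "$dir/words2.txt"

# Way 1: instructions typed directly in the terminal, two input files.
inline=$(awk '{ print $1 }' "$dir/words1.txt" "$dir/words2.txt")

# Way 2: the same instructions saved in a program file and called with -f.
# Note that no single quotes are needed inside the program file itself.
printf '%s\n' '{ print $1 }' > "$dir/first_column.awk"
from_file=$(awk -f "$dir/first_column.awk" "$dir/words1.txt" "$dir/words2.txt")

printf '%s\n' "$inline" "---" "$from_file"
```

Both runs produce the same output, with the input files read in the order they are listed.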

1.0 Things you can do with AWK: a first glance.

Let's perform some simple tasks with the program to see what it's capable of. Let's say we are to analyze vowel epenthesis in a language with 3 segments: "a", "b" and "c". The features of these segments are the same as in English. "A" is a vowel, the rest are consonants. It so happens that this language has a certain irregularity in the way certain combinations surface. A sample input file for determining this constraint might be something like the one below. Please note the separation into lines and columns. The column on the left represents UR-s and the one on the right corresponding SR-s.

abc abac

cba caba

cab cab

bac bac

Let's name this file test1.txt and save it on our desktop. Now we can start by printing certain parts of this file. In the terminal, go to the folder (or directory) where this file is located using the cd command. Its usage is the same in Linux and Macs.

Once in the directory with the file, run this command:

awk '{ print $0 }' test1.txt

This will print the whole file in the terminal. Let's have a look at the syntax.

The single quotes mark the beginning and end of the part of the command that tells AWK what to do with the input file.

The curly brackets show the borders of separate commands inside the big statement. In this case we are only running one command, so there is only one set of brackets.

The print command takes a specified input from the input file and prints it (hence the name) onto the terminal screen. This result can then be directed to another file, if we want (more about this in the future) or just looked at and analyzed directly in the terminal.

The $0 tells AWK to print the whole line, that is, all the columns in the file. If we were to type { print $1 } only the first column of the file would be printed. By the same logic, in order to print only the second column, we would insert "2" instead of "1" after the dollar sign. If we had a file with more than 2 columns and we wanted to print some of them, we would type, for example, { print $1, $4, $5 } . As you probably have guessed, this will print the 1st, 4th and 5th columns of the input file.

Now in order to filter out the info we are interested in instead of whole columns we need some way to search for certain values. The simplest way to do that is to give AWK a sequence to "find". In the terminal, type the following:
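The column selections described here can be tried out in one go. This sketch recreates test1.txt in a scratch directory instead of on the desktop, and the variable names are mine.

```shell
# Recreate the sample file in a scratch directory.
dir=$(mktemp -d)
printf 'abc abac\ncba caba\ncab cab\nbac bac\n' > "$dir/test1.txt"

# $0 is the whole line, so this reproduces the file.
whole=$(awk '{ print $0 }' "$dir/test1.txt")

# $1 and $2 pick out the 1st and 2nd columns.
first=$(awk '{ print $1 }' "$dir/test1.txt")
second=$(awk '{ print $2 }' "$dir/test1.txt")

# Several columns, in any order, separated by commas.
swapped=$(awk '{ print $2, $1 }' "$dir/test1.txt")

printf '%s\n' "$whole" "---" "$first" "---" "$second" "---" "$swapped"
```

The comma between $2 and $1 is what puts a space between the two columns in the output.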

awk '/ab/ { print $2 }' test1.txt

The output should be this:

abac

caba

cab

Notice the "/ab/" part. What it does is tell AWK to search all the lines in the input file for the combination of characters "a" and "b" in the order specified. In other words, AWK looks for lines that contain the sequence "ab" in them. Once it finds such a line, it prints the second column in it because we have a $2 after print. Notice also that there are no brackets around the "search" statement. I don't know for sure yet, but this seems to be only true for this particular statement. This statement allows us to fish out only data relevant to us by just specifying a certain sequence of symbols for AWK to look for and, having found them, print out certain pieces of data.

There are ways to refine one's search, but I haven't had a chance to play with them yet, and this post has gotten a bit big already, so I think I will stop at this point for now. My future posts will assume that both I and the reader are in a directory (folder) that we have created/are using for AWK. Below are the topics I plan to cover in the post to come.
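Here is the search from above as a self-contained sketch, followed by one step that goes slightly beyond what this post covers: comparing the two columns to pick out the words where the SR differs from the UR. That comparison, the arrow in the output and the variable names are my own additions, so treat them as an illustration rather than a recipe.

```shell
# Recreate the sample file in a scratch directory.
dir=$(mktemp -d)
printf 'abc abac\ncba caba\ncab cab\nbac bac\n' > "$dir/test1.txt"

# Lines containing "ab" anywhere: print their 2nd column (the SR).
found=$(awk '/ab/ { print $2 }' "$dir/test1.txt")

# Lines where the two columns differ, i.e. where something was inserted.
changed=$(awk '$1 != $2 { print $1, "->", $2 }' "$dir/test1.txt")

printf '%s\n' "$found" "---" "$changed"
```

The second command narrows the file down to the two words that show epenthesis, which is the kind of filtering the sample dataset was set up for.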

Topics for the next post:

- More ways to search files for input in awk

- A little about quoting rules

- the IF statement.

Thanks for reading, questions and suggestions are more than welcome!

Useful Links (can also be found in the useful links section in the right pane of this blog):

The official Gawk tutorial:

http://www.gnu.org/manual/gawk/html_node/index.html#Top

A tutorial for Nawk (the latest version of AWK, has some advanced features added):

http://people.cs.uu.nl/piet/docs/nawk/nawk_toc.html

1.8.10

Hello!

This is the first proper post to this blog. As such it will provide a short intro to my goals and aspirations for writing and start a topic that I am currently investigating.

My goals in creating this blog are:

- to exchange and share ideas about linguistics in general and computational methods in linguistic analysis in particular

- to improve my ability to write coherently and concisely about my topics of interest

My aspirations for this blog:

- I hope to record and improve my knowledge of linguistics by making blog entries and, I hope, occasionally starting sharing/discussions among my peers.

Acknowledgements:
I would like to thank Lina Halberg, Glynis Jones and David Rome for their suggestions on the name and content of this blog.