How does one deal with this? One can of course take out a notebook and a pen and draw table after table and try to wrap one's brain around the information contained therein. However, this approach becomes tedious when one has to deal with real-life research, where one needs to account for processes affecting an entire lexicon or several lexicons (if you are doing typological cross-linguistic research).
How does one minimize the necessary but repetitive tasks in analyzing a given dataset, thus speeding up the process and leaving the analyst more time to actually think?
The use of a computer provides the answer to this computational problem (what else are computers there for other than to compute?).
In particular, there is a nifty program that can be used for this purpose. It is called AWK (or GAWK, the GNU version of it), and it is a utility for parsing, editing and manipulating text files.
This particular program is useful for linguistic analysis for the following reasons:
a) it comes preinstalled on all Unix-based systems (Linux and macOS). Installing it on Windows requires some work, but is certainly possible.
b) the programmer can both run simple parsing instructions directly from the command prompt and use scripts written in separate "program files" for more complex analysis tasks. This can greatly automate the process and help create templates for dealing with analysis problems encountered in the past.
c) because AWK deals with plain text files, even a large dataset will not take up too many resources in terms of space and computing. This allows for quick parsing of large datasets, something quite important for a linguist.
This post will deal with opening AWK on a Unix system and provide an introduction to a couple of simple functions of this language. This is not intended as a tutorial (there are links to extensive documentation at the end of this post and in the right pane of the blog), nor as a Tips and Tricks article (I'm not that cool... yet), but rather an overview of the stuff I've discovered, a work in progress. I will be posting about various other things one can do with AWK as I go along. And NO, I have not yet gotten it to work well on Windows; that will be the topic of a later post.
As was mentioned before, all Unix-based systems (Linux and Macs) come with this program preinstalled. In that sense there's not much "setting up" necessary. One just needs to open the Terminal and type in: awk or gawk to see the list of options and program information.
On a Mac, the terminal can be found in Applications->Utilities. In Linux, the exact procedure for opening the terminal depends on your distribution and desktop environment; a quick Google search will tell you where to find it.
Now that you have the Terminal open, let's get down to business. As was mentioned earlier, there are two ways to parse a file with AWK: running a set of instructions directly in the terminal or calling on a file that contains the instructions. Usually the former method is better for relatively simple analysis, whereas the latter is better suited for more complex parsing. I will look at running commands straight in the terminal first, then proceed to the steps for calling a file.
The general syntax of an AWK command is as follows:
awk (or gawk) 'instructions' inputfile1 inputfile2 etc...
There are several points worth noting in the above line:
#1 Just like in a written human language, the spaces between the different parts of the command are important: they separate said parts and tell the computer where one part ends and the next begins.
#2 The order of the parts in the command matters. Unlike some human languages with seemingly "free" word order, AWK has a strict structure to its sentences: one calls the program (awk/gawk), one tells it what to do ('instructions') and one specifies which files to do it to (inputfile1 etc.).
#3 The reader might have noticed the single quotes around the instructions part of the command. This brings us to the next point.
#4 The only really configurable part of the command is 'instructions' (note the single quotes), since the (g)awk part doesn't change no matter what, and the rules for specifying an input file are simple. Therefore, most of the interesting stuff is located in this segment of the command.
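To make the shape of the command concrete, here is a minimal sketch (the file names are made up for illustration). The first line is a one-liner run directly in the terminal; the second uses the script-file method mentioned earlier, where the -f flag points AWK to a file containing the instructions:
awk '{ print $1 }' lexicon1.txt lexicon2.txt
awk -f myscript.awk lexicon1.txt
The first command prints the first column of both input files, one after the other; the second runs whatever instructions are saved in myscript.awk on lexicon1.txt.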
1.0 Things you can do with AWK: a first glance.
Let's perform some simple tasks with the program to see what it's capable of. Let's say we are to analyze vowel epenthesis in a language with 3 segments: "a", "b" and "c". The features of these segments are the same as in English: "a" is a vowel, the rest are consonants. It so happens that this language has a certain irregularity in the way certain combinations surface. A sample input file for determining this constraint might be something like the one below. Please note the separation into lines and columns. The column on the left represents URs (underlying representations) and the one on the right the corresponding SRs (surface representations).
abc abac
cba caba
cab cab
bac bac
Let's name this file test1.txt and save it on our desktop. Now we can start by printing certain parts of this file. In the terminal, go to the folder (or directory) where this file is located using the cd command. Its usage is the same in Linux and Macs.
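For example, if the file is sitting on the desktop as described, the command will look roughly like this (the exact path may differ on your system):
cd ~/Desktop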
Once in the directory with the file, run this command:
awk '{ print $0 }' test1.txt
equiv="content-type" content="text/html; charset=utf-8">This will print the whole file in the terminal. Let's have a look at the syntax.
The single quotes mark the beginning and end of the part of the command that tells AWK what to do with the input file.
The curly brackets show the borders of separate commands inside the big statement. In this case we are only running one command, so there is only one set of brackets.
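If we ever need to run more than one command, each one simply gets its own set of brackets. A quick sketch, just to show the shape (the $1 and $2 parts are explained right below):
awk '{ print $1 } { print $2 }' test1.txt
This would print the first column and then the second column of every line, each on its own line.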
The print command takes a specified input from the input file and prints it (hence the name) onto the terminal screen. This result can then be directed to another file, if we want (more about this in the future) or just looked at and analyzed directly in the terminal.
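As for directing the result to another file, the redirection itself is just the usual shell ">" sign; a minimal sketch (copy.txt is a made-up name):
awk '{ print $0 }' test1.txt > copy.txt
This saves the output to copy.txt instead of showing it on the screen.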
The $0 tells AWK to print the entire line, that is, all the columns in the file. If we were to type { print $1 }, only the first column of the file would be printed. By the same logic, in order to print only the second column, we would insert "2" instead of "1" after the dollar sign. If we had a file with more than 2 columns and we wanted to print some of them, we would type, for example, { print $1, $4, $5 }. As you have probably guessed, this will print the 1st, 4th and 5th columns of the input file.
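For instance, running the following on our test1.txt prints only the left-hand (UR) column:
awk '{ print $1 }' test1.txt
abc
cba
cab
bac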
Now, in order to filter out the information we are interested in rather than whole columns, we need some way of searching for certain values. The simplest way to do that is with a search pattern, written between slashes. In the terminal, type the following:
awk '/ab/ { print $2 }' test1.txt
The output should be this:
abac
caba
cab
Notice the "/ab/" part. What it does is tells AWK to look search all the lines in the input file for the combination of characters "a" and "b" in the order specified. In other words, AWK looks for lines that contain the sequence "ab" in them. Once it finds such a line, it prints the second column in it because we have a $2 after print. Notice also, that there are no brackets around the "search" statement. I don't know for sure yet, but this seems to be only true for this particular statement. This statement allows us to fish out only data relevant to us by just specifying a certain sequence of symbols for AWK to look for and having found them, print out certain pieces of data.
There are ways to refine one's search, but I haven't had a chance to play with them yet, and this post has gotten a bit big already, so I think I will stop at this point for now. My future posts will assume that both I and the reader are in a directory (folder) that we have created/are using for AWK. Below are the topics I plan to cover in the post to come.
Topics for the next post:
- More ways to search files for input in awk
- A little about quoting rules
- the IF statement.
Thanks for reading, questions and suggestions are more than welcome!
Useful Links (can also be found in the useful links section in the right pane of this blog):
The official Gawk tutorial:
A tutorial for Nawk (a newer version of AWK with some advanced features added):