16.8.10

Patterns and Quoting

I am back with more discoveries of what one can do with files in AWK. My previous post was somewhat of an introduction and thus didn't cover much ground in terms of actual coding practices. This post and the ones that follow will be more concerned with the particular ways to parse files. I also hope that they will be a bit more structured.
I base the organization of my material on the University of Utrecht AWK manual, as I found it to be more comprehensible than the official GNU guide.

1 Syntax
Just as a reminder, below is the general syntax of an AWK command in the terminal.
awk 'command' input_file
What does this command part consist of? What do we need from text files in general? Most of the time, we need to find some data and perform some operations with it. Hence the command part actually breaks up into two subparts (though their presence is not always apparent): one that searches for certain patterns in the input file (conveniently called the pattern) and one that determines what to do once the pattern has been found (this is called the action). In my previous post I mentioned some parts of the AWK command being enclosed in curly braces ("{ }"). It turns out these are the actions. Patterns are written outside the braces. We will deal with patterns in this entry.
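As a rough sketch of the pattern/action split, the three commands below use a pattern alone, an action alone, and both together. The file sample.txt and its contents are made up for illustration; they are not the test1.txt from the previous post.

```shell
# Hypothetical sample data: four two-word lines over the segments a, b, c.
printf 'aba bcb\nbcb cab\ncca abc\nbcc cbc\n' > sample.txt

# Pattern only: lines containing "ab" are printed whole (no action given).
awk '/ab/' sample.txt

# Action only: the action in braces runs on every line; here it prints column 1.
awk '{ print $1 }' sample.txt

# Pattern and action: print column 1, but only for lines containing "ab".
awk '/ab/ { print $1 }' sample.txt
```

With this sample file, the last line ("bcc cbc") is skipped by the first and third commands because it contains no "ab".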

1.2 Patterns
In order for us to search for something we need some way to express ourselves to the program. We need to specify how and what to look for in files. The simplest way to do this is to use a regular expression. With the help of operators, regular expressions let one define a pattern that AWK will search for in a file. The simplest form of a regular expression has already been introduced in the previous post. It searches for a certain sequence placed between forward slashes, like so:
awk '/ab/' test1.txt

Note that we are still working with the same input file that represents some of the words of our invented 3-segment language. The above command will print every line that contains the sequence "ab" anywhere in it. Now what if we want to search only the first column of each line for "ab" and print all such lines? In this case we can use the ~ operator. In the expression $1~/ab/, $1 stands for the first column and ~ tells AWK to check whether that column matches the pattern on its right, in our case "ab". The actual command will look like this:
awk '$1~/ab/' test1.txt

Again, because we didn't specify what to do once our sequence is found, AWK just prints the entire line where it found "ab" in the first column.
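To see the difference between searching the whole line and searching one column, here is a small sketch. Again, sample.txt and its contents are invented for this example and are not the original test1.txt.

```shell
# Hypothetical sample data: four two-word lines over the segments a, b, c.
printf 'aba bcb\nbcb cab\ncca abc\nbcc cbc\n' > sample.txt

# Whole-line search: "ab" may appear in any column.
awk '/ab/' sample.txt
# aba bcb
# bcb cab
# cca abc

# Column 1 only: just the line whose first word contains "ab".
awk '$1 ~ /ab/' sample.txt
# aba bcb

# Column 2 only: the same idea with $2.
awk '$2 ~ /ab/' sample.txt
# bcb cab
# cca abc
```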

A complete list of regular expression operators can be found here. I think they are explained pretty well there, so I will only proceed to clarify some points that I feel the authors of the tutorial thought were a given and I didn't.
First of all, there are operators that are placed within the slashes, and there are those that are left outside. One way to decide which symbols stay inside the slashes is to consider what they affect. If they modify the actual expression we are looking for (for example, the "^", "$", and "[]" operators), then they go inside the slashes. If they describe where in the input file to look for an expression, they are placed outside the slashes. Another way to know where an operator should be placed is to remember that "~" and "!~" are outside the slashes and everything else is inside.
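A short sketch of the inside/outside distinction, using the same invented sample.txt as a stand-in for a real input file: "^", "$" and "[]" sit inside the slashes and shape the expression, while "~" and "!~" sit outside and say which column to test.

```shell
# Hypothetical sample data: four two-word lines over the segments a, b, c.
printf 'aba bcb\nbcb cab\ncca abc\nbcc cbc\n' > sample.txt

# "^" inside the slashes: column 1 must START with "a".
awk '$1 ~ /^a/' sample.txt
# aba bcb

# "$" inside the slashes: column 2 must END with "b".
awk '$2 ~ /b$/' sample.txt
# aba bcb
# bcb cab

# "[]" inside the slashes: column 1 must start with "b" or "c".
awk '$1 ~ /^[bc]/' sample.txt
# bcb cab
# cca abc
# bcc cbc

# "!~" outside the slashes: column 1 must NOT contain "ab".
awk '$1 !~ /ab/' sample.txt
# bcb cab
# cca abc
# bcc cbc
```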
Secondly, the "\" symbol is used to suppress the special meaning of other symbols in AWK in general, not only in regular expressions. For example, if one needs to search for the value $234 in the first column of a file and upon finding it print "Don't panic!", the command will look like this:
awk '$1~/\$234/ { print "Don'\''t panic!" }' input_file_name

This will print "Don't panic!" for every line that contains the sequence "$234" in its first column. Inside a regular expression the dollar sign normally marks the end of a line, so we put a backslash in front of it to make AWK match it literally. The exclamation point needs no backslash: it has no special meaning inside an AWK string, and in a Bourne-style shell such as bash the surrounding single quotes already protect it. The single quote in "Don't" is a different story. We cannot just put a backslash in front of it; we have to close the single-quoted command, write the backslash and the quote it protects, and then open the single quotes again, which gives '\''. This has to do with the way the shell treats quotes around an AWK command. The official explanation of it can be read here. I found it somewhat confusing, so I will try to explain it to myself and the readers below.
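The following sketch tries this kind of command on a small file. The file name prices.txt and its contents are hypothetical, and the exclamation point is left without a backslash here, on the assumption that the command is run in a Bourne-style shell such as bash.

```shell
# Hypothetical input: column 1 holds an amount, column 2 a label.
printf '$234 rent\n234 misc\n$2345 tax\n' > prices.txt

# \$ matches a literal dollar sign; '\'' puts a single quote into the string.
awk '$1 ~ /\$234/ { print "Don'\''t panic!" }' prices.txt
# Don't panic!
# Don't panic!
```

The message appears twice with this input: once for "$234" and once for "$2345", which also contains the sequence "$234". The line with a plain "234" does not match because it has no dollar sign.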
Two types of quotes are involved when running awk in the terminal: single quotes (') and double quotes ("). To avoid errors, one must always make sure that every quote is closed. This applies to both types. The exception, of course, is when quotes are meant verbatim (as in the somewhat nonsensical phrase "Don't" worry!).
As was mentioned in the previous post and as can be noticed in all the commands we have run so far, single quotes mark the beginning and end of the pattern/action part when running an AWK command in the terminal. Double quotes are used with the print command to tell it what string to print verbatim. Here's the catch: single quotes are "stronger" than double quotes, in the sense that a double quote inside single quotes is just an ordinary character to the shell, so it cannot protect a single quote that follows it. Hence a line like:
awk '{ print "don't worry" }'

is going to cause trouble. The shell, which reads the command before AWK ever sees it, finds one closed pair of single quotes around { print "don and then the rest of the line, t worry" }', whose quotes are never properly closed. Depending on how the command is run, we will either receive an error or the shell will keep waiting for the missing quote. Placing a backslash in front of the second single quote will not help, because inside single quotes the backslash has no special meaning and the quote still ends the quoted text. Thus we must resort to the trick of closing the original single quotes, adding an escaped single quote, and opening a new pair.
awk '{ print "don'\''t worry" }'

Notice how the quotes now pair up: '{ print "don' is one closed single-quoted piece, \' is a literal single quote, and 't worry" }' is a second closed piece, so every part of the command is properly isolated. The GNU guide offers some other solutions, but it seems to me that this is the most elegant and logical one in the context of a short command run in the terminal.
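For comparison, here is a sketch of the trick next to two alternatives of the kind the GNU guide describes: an octal escape for the quote character, and a variable passed in with -v. The variable name sq is my own choice, and "echo |" is only there to give AWK one line of input so the action runs once.

```shell
# 1. Close the single quotes, add an escaped quote, open them again.
echo | awk '{ print "don'\''t worry" }'

# 2. Octal escape: \047 is the character code of the single quote,
#    interpreted by AWK inside the double-quoted string.
echo | awk '{ print "don\047t worry" }'

# 3. Pass the quote in as a variable (sq is a hypothetical name).
echo | awk -v sq="'" '{ print "don" sq "t worry" }'
```

All three commands print the same line, don't worry. The second and third avoid the quote juggling at the cost of being a little less obvious to read.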

