Thursday 15 September 2011

AWK Tutorial


Why awk?
Awk is a text-processing programming language and a useful tool for manipulating text.
·  Awk recognizes the concepts of "file", "record", and "field".
·  A file consists of records, which by default are the lines of the file. One line becomes one record.
·  Awk operates on one record at a time.
·  A record consists of fields, which by default are separated by any number of spaces or tabs.
·  Field number 1 is accessed with $1, field 2 with $2, and so forth. $0 refers to the whole record.
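As a small sketch of the record and field model described above, the following pipes two made-up lines into awk and prints the first two fields of each. The sample text is purely illustrative; the second line mixes several spaces and a tab to show that any run of whitespace separates fields by default.

```shell
# Two sample records (lines); the text is invented for illustration.
# $1 and $2 are the first two whitespace-separated fields of each record.
printf 'alpha beta gamma\none   two\tthree\n' | awk '{ print $1, $2 }'
# prints:
#   alpha beta
#   one two
```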
The built-in help summarizes the command-line options (the output below is from gawk):
[awkuser@p3nlh096 ~]$ awk -help
Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options:          GNU long options:
        -f progfile             --file=progfile
        -F fs                   --field-separator=fs
        -v var=val              --assign=var=val
        -m[fr] val
        -W compat               --compat
        -W copyleft             --copyleft
        -W copyright            --copyright
        -W dump-variables[=file]        --dump-variables[=file]
        -W exec=file            --exec=file
        -W gen-po               --gen-po
        -W help                 --help
        -W lint[=fatal]         --lint[=fatal]
        -W lint-old             --lint-old
        -W non-decimal-data     --non-decimal-data
        -W profile[=file]       --profile[=file]
        -W posix                --posix
        -W re-interval          --re-interval
        -W source=program-text  --source=program-text
        -W traditional          --traditional
        -W usage                --usage
        -W version              --version

To report bugs, see node `Bugs' in `gawk.info', which is
section `Reporting Problems and Bugs' in the printed version.

gawk is a pattern scanning and processing language.
By default it reads standard input and writes standard output.

Examples:
        gawk '{ sum += $1 }; END { print sum }' file
        gawk -F: '{ print $1 }' /etc/passwd


The simplest awk program is a lone { print } code block, run as awk '{ print }' /etc/passwd. In awk, curly braces are used to group blocks of code together, similar to C. Inside this block of code there is a single print command. In awk, when a print command appears by itself, the full contents of the current line are printed.
$ awk '{ print $0 }' /etc/passwd

output
-------
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
sys:x:3:3:sys:/dev:/bin/sh
sync:x:4:65534:sync:/bin:/bin/sync
...

In awk, the $0 variable represents the entire current line, so print and print $0 do exactly the same thing.
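An awk program can also print text that has nothing to do with the input. This script prints one blank line for every line in /etc/passwd:
$ awk '{ print "" }' /etc/passwd

This one prints hello once for every input line, so running it will fill your screen with hello's:
$ awk '{ print "hello" }' /etc/passwd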
AWK Variables
awk variables are initialized to either zero or the empty string the first time they are used.
·  Variable declaration is not required
·  A variable may hold a number or a string, and its type may change over the life of the program
·  Names must begin with a letter or underscore and continue with letters, digits, and underscores
·  Names are case sensitive
·  Some of the commonly used built-in variables are:
  • NR -- The number of the current record (line), counted from 1 across all input
  • NF -- The number of fields in the current record
  • FS -- The input field separator; defaults to a single space, which makes awk split on runs of spaces and tabs, and can be set with the -F command-line option
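To make the built-in variables concrete, here is a minimal sketch that sets FS with -F and prints NR and NF for each record. The colon-separated input is hypothetical sample data in the style of /etc/passwd, not taken from a real system.

```shell
# Hypothetical colon-separated records, supplied through a pipe.
# -F: sets FS to ":"; NR is the record number, NF the number of fields.
printf 'root:x:0:0\ndaemon:x:1\n' |
    awk -F: '{ print NR ": " NF " fields, first is " $1 }'
# prints:
#   1: 4 fields, first is root
#   2: 3 fields, first is daemon
```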
/test$ cat calc
3 56
567 89
/test$ awk '{d=($2-($1-4));s=($2+$1);print d/sqrt(s),d*d/s }' calc
7.42077 55.0678
-18.5066 342.494
/test$
In the example above, the file calc has two rows and two columns. Statements inside a block are separated by semicolons. Note that the final statement, a "print" in this case, does not need a semicolon. It doesn't hurt to put it in, though.

A variable that holds an integer can be used to refer to a field: $imp is the field whose number is stored in imp. If one field contains information about which other field is important, this script will print only the important field:
$ awk '{imp=$1; print $imp }' calc
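A short sketch of this indirection, using made-up input in which the first field of each line names the field that matters:

```shell
# Hypothetical input: field 1 holds the number of the important field.
# Line 1 selects field 2, line 2 selects field 3.
printf '2 apple banana\n3 red green\n' | awk '{ imp = $1; print $imp }'
# prints:
#   apple
#   green
```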

The special variable NF tells you how many fields are in this record, so $NF refers to the last field. Suppose the calc file now contains:
3 56 abd
567 89 xyz
This script prints the first and last field from each record, regardless of how many fields there are:
$ awk '{print $1,$NF }' calc
3 abd
567 xyz

Begin and End
Any action associated with the BEGIN pattern will happen before any line-by-line processing is done. Actions with the END pattern will happen after all lines are processed.
BEGIN, pattern, and END rules can be combined in a single program by just mashing them together, like so:

awk 'BEGIN{print"fee"} $1=="foo"{print"fi"}
     END{print"fo fum"}' filename
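As a slightly more practical sketch of the same structure, the program below prints a header in BEGIN, accumulates a sum line by line, and reports the total and mean in END. The three input numbers are arbitrary sample values.

```shell
# BEGIN runs before any input is read, END after the last record.
# In END, NR still holds the total number of records read.
printf '10\n20\n30\n' |
    awk 'BEGIN { print "values:" }
         { sum += $1; print "  " $1 }
         END { print "total", sum, "mean", sum / NR }'
# prints:
#   values:
#     10
#     20
#     30
#   total 60 mean 20
```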


AWK Arrays
awk has arrays, but they are only indexed by strings. This can be very useful, but it can also be annoying. For example, we can count the frequency of words in a document (ignoring the icky part about printing them out):
$ awk '{for(i=1;i <=NF;i++) freq[$i]++ }' filename

The array will hold an integer count for each word that occurred in the file. Unfortunately, this treats "foo", "Foo", and "foo," as different words. Oh well. How do we print out these frequencies? awk has a special "for" construct that loops over the indices (keys) of an array, in no particular order. This script is longer than most command lines, so it will be expressed as an executable script:
#!/usr/bin/awk -f
{for(i=1;i <=NF;i++) freq[$i]++ }
END{for(word in freq) print word, freq[word]
}
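One possible way around the "foo", "Foo", "foo," problem is to normalize each word before counting it. The sketch below is an assumption about what "the same word" should mean, not part of the original script: it folds case with tolower() and strips every character other than ASCII letters and digits with gsub(). The output is piped through sort because the for-in loop visits keys in no guaranteed order.

```shell
# Made-up sample text containing the three spellings of "foo".
printf 'foo Foo foo,\nbar foo.\n' |
    awk '{
        for (i = 1; i <= NF; i++) {
            word = tolower($i)             # fold case: Foo becomes foo
            gsub(/[^a-z0-9]/, "", word)    # drop punctuation (ASCII only)
            if (word != "") freq[word]++
        }
    }
    END { for (word in freq) print word, freq[word] }' | sort
# prints:
#   bar 1
#   foo 4
```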

AWK Regular expressions and blocks

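A regular expression written between slashes selects the lines that a block applies to. The general form is:

awk '/pattern_to_match/ {actions}' input_file

This prints every line that contains foo:

awk '/foo/ { print }' abc.txt

This prints every line that contains a floating-point number, meaning one or more digits followed by a decimal point. The dot is escaped because an unescaped dot matches any character. awk reads standard input when no file is named:

cat abc.txt|awk '/[0-9]+\.[0-9]*/ { print }'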
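To see which lines such a pattern selects, here is a small sketch run on invented input. It uses a pattern for one or more digits, a literal decimal point (written \.), and zero or more further digits.

```shell
# Four made-up lines; only those containing digits followed by a
# literal "." match. "version 2" has digits but no decimal point.
printf 'pi is 3.14\nno numbers here\nversion 2\ntotal: 10.\n' |
    awk '/[0-9]+\.[0-9]*/ { print }'
# prints:
#   pi is 3.14
#   total: 10.
```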


Expressions and blocks
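Any boolean expression can be placed before a block; awk executes the block only for lines where the expression is true. This script prints the third field of every line whose first field equals fred:

$1 == "fred" { print $3 }

The ~ operator means "matches" and !~ means "does not match". This script prints the third field of every line whose fifth field contains the character sequence root:

$5 ~ /root/ { print $3 }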
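The sketch below runs both kinds of test, plus a numeric comparison, against two made-up five-field records. The labels "name:", "match:", and "big:" are only there to show which rule fired.

```shell
# Every rule is tested against every record, in order.
printf 'fred 10 blue x rooted\nbob 20 red x admin\n' |
    awk '$1 == "fred" { print "name:", $3 }
         $5 ~ /root/  { print "match:", $3 }
         $2 > 15      { print "big:", $1 }'
# prints:
#   name: blue
#   match: blue
#   big: bob
```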

AWK Conditional statements
awk '{
   if ( $1 ~ /root/ )
  {
   print $1
  }
}' /etc/passwd

This script functions identically to the pattern form $1 ~ /root/ { print $1 }. In the pattern form, the boolean expression is placed outside the block, while in the if form the block is executed for every input line and the print command is performed selectively by the if statement. Both methods are available, and you can choose the one that best meshes with the other parts of your script.
if statements can be nested and chained with else, much as in C:
{
        if ( $1 == "foo" ) {
               if ( $2 == "foo" ) {
                       print "uno"
               } else {
                       print "one"
               }
        } else if ($1 == "bar" ) {
               print "two"
        } else {
               print "three"
        }
}
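Running that nested if on a few made-up records shows which branch each one takes. This is the same logic, written compactly and fed four sample lines through a pipe:

```shell
# One sample record per branch of the nested if/else.
printf 'foo foo\nfoo bar\nbar x\nbaz y\n' |
    awk '{
        if ($1 == "foo") {
            if ($2 == "foo") { print "uno" } else { print "one" }
        } else if ($1 == "bar") {
            print "two"
        } else {
            print "three"
        }
    }'
# prints:
#   uno
#   one
#   two
#   three
```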


A negated pattern can be rewritten with if in the same way. This script:
! /matchme/ { print $1 $3 $4 }
is equivalent to this one:
{
        if ( $0 !~ /matchme/ ) {
               print $1 $3 $4
        }
}

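Both scripts produce output only for lines that don't contain the character sequence matchme. For each such line they print fields 1, 3, and 4 joined together with no separator, because print inserts a space only between arguments separated by commas. Again, you can choose the method that works best for your code.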
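A quick way to check the equivalence is to feed the same input to both forms. The two records below are invented; the second contains matchme and is skipped by both commands.

```shell
# Pattern form: the negated regular expression sits outside the block.
printf 'a b c d\nw matchme y z\n' | awk '! /matchme/ { print $1 $3 $4 }'
# if form: the test sits inside a block that runs for every record.
printf 'a b c d\nw matchme y z\n' | awk '{ if ($0 !~ /matchme/) { print $1 $3 $4 } }'
# each command prints: acd
```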

( $1 == "foo" ) && ( $2 == "bar" ) { print }

This example will print only those lines where field one equals foo and field two equals bar. The && operator is logical "and"; || is logical "or".
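For comparison, this sketch applies an && rule and an || rule to the same made-up records; the "and:" and "or:" labels only mark which rule produced each line.

```shell
# Field 3 is a hypothetical row number used to identify each record.
printf 'foo bar 1\nfoo baz 2\nqux bar 3\n' |
    awk '($1 == "foo") && ($2 == "bar") { print "and:", $3 }
         ($1 == "foo") || ($2 == "bar") { print "or:", $3 }'
# prints:
#   and: 1
#   or: 1
#   or: 2
#   or: 3
```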
