Sorting text files with an awk command line or script

  • 2021-07-18 09:20:58
  • OfStack

Awk is a powerful tool that can perform certain tasks that might be accomplished by other common utilities, including sort.

Awk is a ubiquitous Unix command for scanning and processing text containing predictable patterns. However, because it features functions, it can also reasonably be called a programming language.

Confusingly, there is more than one awk. (Or, if you believe there can be only one, then the others are clones.) There is awk, the original program written by Aho, Weinberger, and Kernighan, and then there are nawk, mawk, and the GNU version, gawk. The GNU version of awk is a highly portable, free software version of the utility with several unique features, so this article is about GNU awk.

Although its official name is gawk, on many GNU+Linux systems it is aliased to awk and serves as the default version of that command. On systems that do not ship with GNU awk, you must install it first and call it as gawk rather than awk. This article uses the terms awk and gawk interchangeably.

awk is both a command language and a programming language, which makes it a powerful tool for tasks that would otherwise be left to sort, cut, uniq, and other common utilities. Fortunately, there is plenty of room for redundancy in open source, so if you are wondering whether to use awk, the answer may well be a casual "yes."

The beauty of awk's flexibility is that once you have decided to use awk for a task, you can probably keep using awk no matter what comes up along the way. That includes the eternal need to sort data in some order other than the one it was delivered in.

Sample data set

Before exploring awk's sorting methods, create a sample dataset to work with. Keep it simple so that you are not distracted by edge cases and unintended complexity. This is the sample set used in this article, saved as penguins.list:


Aptenodytes;forsteri;Miller,JF;1778;Emperor
Pygoscelis;papua;Wagler;1832;Gentoo
Eudyptula;minor;Bonaparte;1867;Little Blue
Spheniscus;demersus;Brisson;1760;African
Megadyptes;antipodes;Milne-Edwards;1880;Yellow-eyed
Eudyptes;chrysocome;Viellot;1816;Sothern Rockhopper
Torvaldis;linux;Ewing,L;1996;Tux

This is a small data set, but it provides multiple data types:

  • A genus and a species name, which are related but kept in separate fields
  • A surname, sometimes with initials after a comma
  • An integer representing a date
  • An arbitrary term
  • All fields separated by semicolons

Depending on your educational background, you may consider this a 2D array, a table, or just a line-delimited collection of data. How you think of it is up to you, because awk expects nothing more than text. It is up to you to tell awk how you want to parse it.

When you only want to sort

If you only want to sort a text dataset by a specific, definable field (think of a "cell" in a spreadsheet), you can use the sort command.
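
For comparison, here is a minimal sketch of that approach. It pipes three records from the sample set into sort so that it runs on its own; -t sets the field delimiter and -k 2,2 limits the sort key to the second field. The LC_ALL=C setting and the variable name sorted are additions for this sketch, used only to keep the ordering predictable across locales.

```shell
# Three records from the sample set, piped in so the snippet is self-contained.
sorted=$(printf '%s\n' \
  'Pygoscelis;papua;Wagler;1832;Gentoo' \
  'Aptenodytes;forsteri;Miller,JF;1778;Emperor' \
  'Eudyptula;minor;Bonaparte;1867;Little Blue' |
  LC_ALL=C sort -t ';' -k 2,2)   # -t: delimiter, -k 2,2: sort on field 2 only
printf '%s\n' "$sorted"
```

The records come back ordered by species name (forsteri, minor, papua), with each line left intact.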

Fields and Records

Regardless of the format of your input, you must find patterns in it so that you can focus on the parts of the data that matter to you. In this example, the data is delimited by two factors: lines and fields. Each new line represents a new record, as you would see in a spreadsheet or database dump. Within each line, distinct fields (think of them as cells in a spreadsheet) are separated by semicolons (;).

awk processes one record at a time, so while you are structuring the instructions you give to awk, you can focus on just one line. Work out what you want to do with a single line of data, test it on the next line (either mentally or with awk), and then run a few more tests. You end up with a good hypothesis about what your awk script must do to deliver the data in the structure you want.

In this example, it is easy to see that each field is separated by a semicolon. For simplicity, suppose you want to sort the list by the first field of each line.

Before you can sort, you must be able to make awk focus on just the first field of each line, so that is the first step. The syntax of an awk command in a terminal is awk, followed by relevant options, then your awk program, and finally the data file you want to process.


$ awk --field-separator=";" '{print $1;}' penguins.list
Aptenodytes
Pygoscelis
Eudyptula
Spheniscus
Megadyptes
Eudyptes
Torvaldis

Because the field separator is a character with a special meaning to the Bash shell, you must enclose the semicolon in quotes or precede it with a backslash. This command serves only to prove that you can focus on a specific field. You can try the same command with the number of another field to view the contents of another "column" of your data:


$ awk --field-separator=";" '{print $3;}' penguins.list
Miller,JF
Wagler
Bonaparte
Brisson
Milne-Edwards
Viellot
Ewing,L

We haven't done any sorting yet, but this is a good foundation.
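
The long option --field-separator is specific to GNU awk. As a rough sketch, the portable short form -F does the same job, and a single print can select several fields at once. The two records below are taken from the sample set and piped in so the snippet runs without penguins.list; the variable name names is just for this sketch.

```shell
# -F ';' is the portable spelling of --field-separator=";"
names=$(printf '%s\n' \
  'Aptenodytes;forsteri;Miller,JF;1778;Emperor' \
  'Pygoscelis;papua;Wagler;1832;Gentoo' |
  awk -F ';' '{ print $5, $4 }')   # common name, then year, separated by a space
printf '%s\n' "$names"
```

This prints "Emperor 1778" and "Gentoo 1832", one per line.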

Script programming

awk is more than just a command; it is a programming language with indices, arrays, and functions. That matters because it means you can grab a list of the fields you want to sort by, store the list in memory, process it, and then print the resulting data. For a complex series of actions like this, it is easier to work in a text file, so create a new file called sorter.awk and enter this text:


#!/usr/bin/gawk -f
BEGIN {
    FS=";";
}

This establishes the file as an awk script that executes the lines contained in the file.

The BEGIN statement is a special setup block provided by awk for tasks that need to occur only once. It defines the built-in variable FS, which stands for field separator and holds the same value you set in your awk command with --field-separator. It needs to be set only once, so it belongs in the BEGIN statement.
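
To see that BEGIN really runs only once while the main block runs for every record, here is a small sketch. The message printed in BEGIN and the variable name out are only for illustration; the two input records come from the sample set.

```shell
out=$(printf '%s\n' \
  'Spheniscus;demersus;Brisson;1760;African' \
  'Torvaldis;linux;Ewing,L;1996;Tux' |
  awk '
    BEGIN { FS = ";"; print "BEGIN ran" }   # executed once, before any input is read
    { print NR ": " $5 }                    # executed for every record
  ')
printf '%s\n' "$out"
```

The output is "BEGIN ran" followed by "1: African" and "2: Tux": one setup line, then one line per record.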

Arrays in awk

You already know how to gather the values of a specific field by using the $ notation along with the field number, but in this case you need to store them in an array rather than print them to the terminal. This is done with an awk array. The important thing about an awk array is that it contains keys and values. Imagine an array describing this article; it would look something like this: author: "seth", title: "How to sort with awk", length: 1200. Elements such as author, title, and length are keys, each followed by its value.
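
Written as actual awk, that imaginary array might look like the sketch below. The array name article and the shell variable info are made up for this example; everything happens in a BEGIN block, so no input file is needed.

```shell
info=$(awk 'BEGIN {
  # keys go in square brackets, values follow the equals sign
  article["author"] = "seth"
  article["title"]  = "How to sort with awk"
  article["length"] = 1200
  # look values up again by key
  print article["title"] " by " article["author"] ", length " article["length"]
  # the "in" operator tests whether a key exists
  if ("title" in article) print "title is a key"
}')
printf '%s\n' "$info"
```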

The advantage of this in the context of sorting is that you can assign any field as the key and any record as the value, and then use the built-in awk function asorti() (sort by index) to sort by the key. For now, assume arbitrarily that you only want to sort by the second field.

awk statements that are not preceded by the special keywords BEGIN or END are loops that run once for each record. This is the part of the script that scans the data for patterns and processes it accordingly. Each time awk turns its attention to a record, the statements in {} are executed (unless the block is preceded by BEGIN or END).

To add a key and value to an array, create a variable containing the array (in this example script, I call it ARRAY, which is not terribly original, but it is clear), put the key in square brackets, and assign the value with an equals sign (=).


{  # store each record in an array, keyed by its second field
  ARRAY[$2] = $0;
}

In this statement, the contents of the second field ($2) are used as the key, and the entire current record ($0) is used as the value.

The asorti() function

In addition to arrays, awk has several basic functions that you can use as quick and easy solutions for common tasks. One of the functions introduced in GNU awk, asorti(), sorts an array by its keys (indexes); its sibling, asort(), sorts by value.

You can only sort the array once it has been populated, meaning that this action must not happen with every new record but only at the very end of the script. For this purpose, awk provides the special END keyword. The inverse of BEGIN, an END statement runs only once, after all records have been scanned.
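
The "collect first, act in END" pattern is not specific to sorting. As a small sketch, the snippet below stores every record as it arrives and prints the records in reverse order once the input is exhausted. The array name line, the variable name reversed, and the three input words are placeholders for this example.

```shell
reversed=$(printf '%s\n' first second third |
  awk '
    { line[NR] = $0 }                                  # runs once per record
    END { for (i = NR; i >= 1; i--) print line[i] }    # runs once, after the last record
  ')
printf '%s\n' "$reversed"
```

Nothing can be printed until the END block, because the last record has to be known before the first line of output.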

Add these to your script:


END {
  asorti(ARRAY,SARRAY);
  # get length
  j = length(SARRAY);
  
  for (i = 1; i <= j; i++) {
    printf("%s %s\n", SARRAY[i],ARRAY[SARRAY[i]])
  }
}

The asorti() function takes the contents of ARRAY, sorts its indexes, and places the result in a new array called SARRAY (an arbitrary name I invented for this article, meaning "sorted ARRAY").

Next, the variable j (another arbitrary name) is assigned the result of the length() function, which counts the number of items in SARRAY.

Finally, a for loop steps through each item in SARRAY, using the printf() function to print each key followed by the corresponding value of that key in ARRAY.

Run the script

To run your awk script, first make it executable:

$ chmod +x sorter.awk

Then run it against the penguins.list sample data:


$ ./sorter.awk penguins.list
antipodes Megadyptes;antipodes;Milne-Edwards;1880;Yellow-eyed
chrysocome Eudyptes;chrysocome;Viellot;1816;Sothern Rockhopper
demersus Spheniscus;demersus;Brisson;1760;African
forsteri Aptenodytes;forsteri;Miller,JF;1778;Emperor
linux Torvaldis;linux;Ewing,L;1996;Tux
minor Eudyptula;minor;Bonaparte;1867;Little Blue
papua Pygoscelis;papua;Wagler;1832;Gentoo

As you can see, the data is sorted by the second field.

This is a little restrictive. It would be better to have the flexibility to choose, at runtime, which field to use as the sort key, so that you can use this script on any dataset and get meaningful results.

Add command options

You can add a command-line variable to an awk script by referring to a variable, here named var, in the script. Change the script so that the main (per-record) clause uses var when building the array:


{ # store each record in an array, keyed by field number var
  ARRAY[$var] = $0;
}

Try running the script so that it sorts by the third field, using the -v var option when you execute it:


$ ./sorter.awk -v var=3 penguins.list
Bonaparte Eudyptula;minor;Bonaparte;1867;Little Blue
Brisson Spheniscus;demersus;Brisson;1760;African
Ewing,L Torvaldis;linux;Ewing,L;1996;Tux
Miller,JF Aptenodytes;forsteri;Miller,JF;1778;Emperor
Milne-Edwards Megadyptes;antipodes;Milne-Edwards;1880;Yellow-eyed
Viellot Eudyptes;chrysocome;Viellot;1816;Sothern Rockhopper
Wagler Pygoscelis;papua;Wagler;1832;Gentoo
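
The expression $var means "the field whose number is stored in var". A short sketch of that behavior, using one record from the sample set, also shows what happens when -v is forgotten: an unset variable evaluates to 0, so $var becomes $0, the whole record. The shell variable names here are only for illustration.

```shell
record='Aptenodytes;forsteri;Miller,JF;1778;Emperor'
first=$(printf '%s\n' "$record" | awk -F ';' -v var=1 '{ print $var }')   # field 1
fifth=$(printf '%s\n' "$record" | awk -F ';' -v var=5 '{ print $var }')   # field 5
whole=$(printf '%s\n' "$record" | awk -F ';' '{ print $var }')            # var unset: $0
printf '%s\n' "$first" "$fifth" "$whole"
```

In practice, this means that running sorter.awk without -v var=NUM does not fail; it silently sorts by the entire line.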

Improvements

This article demonstrates how to sort data in pure GNU awk. The script can be improved, so if it is useful to you, spend some time studying awk functions in the gawk man page and customizing the script for better output.

This is the complete script so far:


#!/usr/bin/gawk -f
# GPLv3 appears here
# usage: ./sorter.awk -v var=NUM FILE
BEGIN { FS=";"; }
{ # store each record in an array, keyed by field number var
  ARRAY[$var] = $0;
}
END {
  asorti(ARRAY,SARRAY);
  # get length
  j = length(SARRAY);
  
  for (i = 1; i <= j; i++) {
    printf("%s %s\n", SARRAY[i],ARRAY[SARRAY[i]])
  }
}
