AWK Text Processing Speed
My default brand of AWK is the One-True-AWK also known as NAWK, coming with FreeBSD. For portability I have started supporting GNU AWK (henceforth GAWK). Because I work with people using Ubuntu, MAWK also made it into my list recently.
The Wikipedia article has a list of AWK versions.
Tests
Test Environment
The tests are run on a Core2Duo throttled to 800MHz on FreeBSD 9 r254957 (adm64) using a set of AWK scripts dealing with processing or generating C-code.
cstrip.awk
This script passes the given arguments to cpp
and collects the
output in a single string, reformatting the resulting code to one
command per line with all obsolete spaces discarded. Its purpose
is to simplify going through the code using regular expressions in
other scripts. For the following measurements 31 files, a total of
279777 (273.2k) bytes of code, were passed to the script.
Memory usage (maximum resident set size):
overlays.awk
, sanity.awk
The following two tests were supposed to refer to overlays.awk
and
sanity.awk
, which both call cstrip.awk
and both exhibited no
significant runtime difference to just calling cstrip.awk
.
dbc2c.awk
This script parses Vector DBC (Database CAN) files. Every object in the file is parsed and stored in arrays. All data and relations are kept in memory, no information is discarded. Afterwards a set of templates is used to output this information into a C-style header file with doxygen documentation. The script is run on 3 CAN databases totalling 674576 (658.8k) bytes of input and producing 2678866 (2.55m) bytes of output.
Memory usage (maximum resident set size):
xml.awk
This script parses an XML file (to be exact the subset of XML that is used by Keil µVision for its configuration files), offers a set of arguments to manipulate and reprint the XML tree. The XML code is not stored in memory, but recreated from the tree generated when parsing the file.
In this test case 2 XML files, 48757 (47.6k) bytes are parsed and simply printed.
Memory usage (maximum resident set size):
Conclusions
From the similar runtime of NAWK and GAWK in the cstrip.awk
test
case, I gather that the majority of the runtime is caused by cpp
.
So the higher runtime of MAWK seems to result from bad performance
performing gsub()
calls on entire file sized strings.
The MAWK performance in the dbc2c.awk
test case on the other hand
is extremely remarkable as well, outperforming GAWK by a factor of
~9.6, where NAWK only manages ~3.7. Clearly MAWK shines at array
handling.
MAWK recommends itself further in the XML parsing and printing case, where it outperforms NAWK by a factor of ~7.8 and GAWK by a factor of 25.7!
All in all I have to conclude that using GAWK, the default for many
GNU/Linux distributions, is a bad choice. It offers a couple of additional
features, but GAWK is simply too slow to benefit from them when processing
large amounts of data. NAWK outperforms GAWK in all my test cases.
And despite the lapse with the cstrip.awk
test case, the performance
of MAWK is just plain astonishing.
Too bad MAWK is released under GPLv2, which doesn’t recommend it for distribution with FreeBSD (NAWK is released under an MIT-style license). It might be a valuable effort to look at the hashing functions of NAWK to improve its array performance.
A Note on MAWK
MAWK is a little more picky than other flavours of AWK. First, no variable, no matter the context, may have the same name as a function.
The second difference to NAWK or GAWK is that MAWK does not support
length(array)
, length()
can only be applied to strings. In most
cases I only need to know whether an array has elements left at all.
In such cases a simple function like this can be used:
function empty(array, i) {
for (i in array) {
return 0
}
return 1
}
Workaround for MAWK.
Finally, MAWK uses exponential output for large numbers, even those
cast to int
(AWK interpreters store all numbers in double precision
floating point format). In order to make sure that output is always
identical use printf
/sprintf("%f", number)
and "%.f"
for integers.
Amendments
Versions
nawk
20121220 (FreeBSD)gawk
4.1.0mawk
1.3.3
nextfile
MAWK does not support nextfile
. This can easily be worked around
and here is an update list of measurements for the cstrip.awk
test
case:
Memory usage (maximum resident set size):