2014-09-27 Another day in my love affair with AWK

2016-07-18

I consider myself a C/C++ developer. Right now I am embracing C++11 (I wanted to wait until it was actually well supported by compilers) and I am loving it.

Note

This article was updated 2016-07-18 to fix some typos and broken references.

Despite my happy relationship with C/C++ I have maintained a torrid affair with AWK for many years, which has spilled into this blog before:

Almost a year ago I concluded that MAWK is freakin’ fast and GNU AWK freakin’ fast as a snail
This past summer I stumbled over a bottleneck in the one-true-AWK, the default for *BSD and Mac OS X

A Matter of Accountability

So far circumstances dictated that either the script or the input data or both had to be kept confidential. In this post both will be publicly available. The purpose of this post is to give people the opportunity to perform their own tests.

The following is required to perform the test: the dbc2c.awk script and the j1939_utf8.dbc input file used in the test command.

The dbc2c.awk script was already part of my first post. It parses Vector DBC (Database CAN) files, an industry standard for describing a set of devices, messages and signals for the real-time CAN bus (one can argue it is only soft real time; it depends). The script does the following things:

Parse data from 1 or more input files
Store the data in arrays, use indexes as references to describe relationships
Output the data
- Traverse the data structure and store attributes of objects in an array
- Read a template
- Insert data into the template and print on stdout
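
The steps above can be sketched in miniature. This is not code from dbc2c.awk; it is a hypothetical reduction that assumes heavily simplified input lines (BO_ for a message, SG_ for a signal), and the array names msgName, sigMsg and sigOrder as well as the one-line template are made up for illustration.

```shell
# Hypothetical miniature of the parse/store/output steps; not taken from dbc2c.awk.
dbc_sketch() {
    awk '
    # Parse: a message line starts with BO_, a signal line with SG_.
    $1 == "BO_" {
        msg = $2                  # the message ID serves as the array index
        sub(/:$/, "", $3)         # strip the trailing colon from the name
        msgName[msg] = $3
    }
    $1 == "SG_" {
        sig = $2
        sigMsg[sig] = msg         # reference: signal -> owning message ID
        sigOrder[++sigCount] = sig
    }
    # Output: fill a one-line template for every signal.
    END {
        template = "signal <sig> belongs to message <msg> (<id>)"
        for (i = 1; i <= sigCount; i++) {
            sig = sigOrder[i]
            line = template
            gsub(/<sig>/, sig, line)
            gsub(/<msg>/, msgName[sigMsg[sig]], line)
            gsub(/<id>/, sigMsg[sig], line)
            print line
        }
    }'
}

printf '%s\n' 'BO_ 100 EEC1: 8 ECU' ' SG_ EngSpeed : 24|16@1+' ' SG_ EngTorque : 16|8@1+' | dbc_sketch
```

The relationship between a signal and its message is stored as an index into another array rather than as a nested structure, which is the only option in portable AWK.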

Test Environment

The operating system:

FreeBSD AprilRyan.norad 10.1-BETA2 FreeBSD 10.1-BETA2 #0 r271856: Fri Sep 19 12:55:39 CEST 2014 root@AprilRyan.norad:/usr/obj/S403/amd64/usr/src/sys/S403 amd64
The compiler:
- FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
- Target: x86_64-unknown-freebsd10.1
- Thread model: posix
CPU: Core i7@2.4GHz (Haswell)
NAWK version: awk version 20121220 (FreeBSD)
MAWK version: mawk 1.3.4.20140914
GNU AWK version: GNU Awk 4.1.1, API: 1.1

Tests

With the recent changeset 4d1a902, the script switched from array iteration (for (index in array) { … }) to creating a numbered index for each object type and iterating through it in order of creation. This ensures that data is output in the same order with every AWK implementation, which makes it much easier to compare and validate outputs from different flavours of AWK.
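
A minimal sketch of the idea, not the code from the changeset: the order in which for (key in array) visits keys is unspecified and differs between implementations, so a second array numbered in creation order is walked instead. The names value, order and count are assumptions made for this example.

```shell
# Illustrative only: creation-ordered iteration instead of "for (key in array)".
ordered_demo() {
    awk 'BEGIN {
        # Register each object in creation order alongside the keyed data.
        n = split("zeta alpha mu", names, " ")
        for (i = 1; i <= n; i++) {
            value[names[i]] = i * 10
            order[++count] = names[i]
        }
        # Walk the numbered index; the output order is the same in every AWK.
        for (i = 1; i <= count; i++) {
            print order[i], value[order[i]]
        }
    }'
}

ordered_demo
```

The cost is one extra array per object type, in exchange for output that can be compared byte for byte across interpreters.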

To reproduce the tests, run:

time -l awk -f scripts/dbc2c.awk -vDATE=whenever j1939_utf8.dbc | sha256

Validate the output of your test run.

The checksum for the output should read:

9f0a105ed06ecac710c20d863d6adefa9e1154e9d3a01c681547ce1bd30890df

Checksum of the non-diagnostic output.

Here are my runtime results (three runs each):

NAWK: 6.23 s, 6.32 s, 6.27 s
GNU AWK: 11.79 s, 11.88 s, 11.80 s
MAWK: 1.98 s, 2.02 s, 1.97 s

Memory usage (maximum resident set size):

NAWK: 22000 k
GNU AWK: 50688 k
MAWK: 26644 k

Conclusion

Once again the usual order of things establishes itself. GNU AWK wastes our time and memory while MAWK takes the winner’s crown and NAWK sticks to the middle ground.

The dbc2c.awk script has been tested before and GNU AWK actually performs much better this time: 6.0 instead of 9.6 times slower than MAWK. Maybe parsing just one file instead of three helps, or the input data produces fewer collisions for the hashing algorithm (AWK array indexes are always cast to string and stored in hash tables).
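
The slowdown factor can be rechecked from the runtimes listed above. The sketch below rests on two assumptions: that the runs around 11.8 s belong to GNU AWK, those around 2.0 s to MAWK and those around 6.3 s to NAWK, and that the fastest run of each interpreter is the one compared. The function name slowdown_table is made up for this example.

```shell
# Illustrative only: slowdown of each AWK relative to MAWK, based on the fastest run.
slowdown_table() {
    awk '
    {
        # Field 1 is the interpreter name, the remaining fields are runtimes in seconds.
        best = $2 + 0
        for (i = 3; i <= NF; i++) if ($i + 0 < best) best = $i + 0
        fastest[$1] = best
        order[++n] = $1
    }
    END {
        ref = fastest["mawk"]
        for (i = 1; i <= n; i++)
            printf "%s %.2f s %.1fx\n", order[i], fastest[order[i]], fastest[order[i]] / ref
    }'
}

printf '%s\n' 'nawk 6.23 6.32 6.27' 'gawk 11.79 11.88 11.80' 'mawk 1.98 2.02 1.97' | slowdown_table
```

Under these assumptions the GNU AWK line reports a factor of 6.0, which matches the figure quoted above.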

Either way, I’d love to see some more benchmarks out there. And maybe someone will bring their favourite flavour of AWK to the table.