Another day in my love affair with AWK

I consider myself a C/C++ developer. Right now I am embracing C++11 (I wanted to wait until it was actually well supported by compilers) and I am loving it.

Note

This article was updated 2016-07-18 to fix some typos and broken references.

Despite my happy relationship with C/C++ I have maintained a torrid affair with AWK for many years, which has spilled into this blog before:

A Matter of Accountability

So far circumstances dictated that either the script or the input data or both had to be kept confidential. In this post both will be publicly available. The purpose of this post is to give people the opportunity to perform their own tests.

The following is required to perform the test:

The dbc2c.awk script was already part of my first post. It parses Vector DBC (Database CAN) files, an industry standard for describing a set of devices, messages and signals for the real-time bus CAN (one can argue it’s soft real-time; it depends). The script does the following things:

Test Environment

Tests

With the recent changeset 4d1a902, the script switched from array iteration (for (index in array) { … }) to creating a numbered index for each object type and iterating through the objects in order of creation. This ensures that data is output in the same order by every AWK implementation, which makes it much easier to compare and validate outputs from different flavours of AWK.
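The idea can be sketched roughly as follows. This is not code from dbc2c.awk; the input records and the names count and name are made up for illustration. Each object is stored under a running number, and the output loop counts through those numbers instead of using for (index in array), whose order is left to the implementation.

```shell
# Hypothetical input records, not taken from a real DBC file.
out=$(printf 'msg EngineSpeed\nmsg OilTemp\nmsg Brake\n' | awk '
# Store each object under a running number in order of creation.
$1 == "msg" {
	name[count++] = $2
}
END {
	# A for-in loop over name would visit the entries in an
	# implementation-defined order. Counting from 0 to count - 1
	# reproduces the order of creation with every AWK.
	for (i = 0; i < count; i++) {
		print i, name[i]
	}
}')
printf '%s\n' "$out"
```

The three names should come out numbered 0 to 2 in input order, whichever AWK is installed as awk.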

To reproduce the tests, run:

time -l awk -f scripts/dbc2c.awk -vDATE=whenever j1939_utf8.dbc | sha256

Validate the output of your test run.

The checksum for the output should read:

9f0a105ed06ecac710c20d863d6adefa9e1154e9d3a01c681547ce1bd30890df

Checksum of the non-diagnostic output.

Here are my runtime results (three runs per interpreter):

NAWK: 6.23 s, 6.32 s, 6.27 s
GNU AWK: 11.79 s, 11.88 s, 11.80 s
MAWK: 1.98 s, 2.02 s, 1.97 s

Memory usage (maximum resident set size):

22000 k
50688 k
26644 k

Conclusion

Once again the usual order of things establishes itself. GNU AWK wastes our time and memory while MAWK takes the winner’s crown and NAWK sticks to the middle ground.

The dbc2c.awk script has been tested before, and GNU AWK actually performs much better this time: it is 6.0 times slower than MAWK instead of 9.6 times. Maybe parsing just one file instead of three helps, or the input data produces fewer collisions for the hashing algorithm (AWK array indexes are always cast to string and stored in hash tables).
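The remark about array indexes can be checked with a few lines of AWK. This is a minimal sketch with made-up keys: the numeric subscript 1 and the string "1" address the same element, while "01" is a different string and therefore a different element.

```shell
out=$(awk 'BEGIN {
	a[1] = "x"
	a["1"] = "y"    # same key as a[1], so this overwrites "x"
	a["01"] = "z"   # a different string, hence a separate element
	n = 0
	for (k in a) n++
	# Print the value stored under 1 and the number of elements.
	print a[1], n
}')
printf '%s\n' "$out"
```

This prints y 2: the second assignment replaced the first, and only two elements exist.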

In any case, I’d love to see some more benchmarks out there, and maybe someone bringing their favourite flavour of AWK to the table.