bsda2: pkg_validate Performance Tweaks

I am currently updating the bsda2 code for pkg_validate with LST.sh. This adds some overhead (however small), and to counter that I decided to try to tweak the performance a little. Two approaches have shown benefits.

Tweaking Checksum Verification

Checksum verification is performed in two steps. The checksum binary (currently only sha256 is supported) is passed a set of files that it checks in one go. The resulting list is checked against the reference checksums and mismatches are inspected individually in order to allow providing a reason (e.g. file missing, insufficient privileges etc.).
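The second step, inspecting a mismatching file individually to find a reason, might look roughly like the following sketch. The function name and the reason strings are made up for illustration; this is not the actual bsda2 code, and the real tool distinguishes more cases.

```sh
# Hypothetical helper: print a reason why the file given as $1 failed
# checksum verification. Only called for files that already mismatched.
mismatch_reason() {
	if [ ! -e "$1" ] && [ ! -L "$1" ]; then
		# Neither a file nor a (possibly dangling) symlink exists.
		echo "file missing"
	elif [ ! -r "$1" ]; then
		# The file exists but the current user may not read it.
		echo "insufficient privileges"
	else
		# The file was readable, so its content really differs.
		echo "checksum mismatch"
	fi
}
```

Because this only runs for the (usually few) mismatches, the per-file cost does not matter much; the bulk of the work stays in the batched checksum call.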

One important case is symlinks: the checksum tool scans the file a symlink refers to, whereas the reference checksum is a checksum of the path the symlink refers to. This has to be reproduced (including reproducing a bug in pkg, which cannot be fixed without altering checksums).

The performance tweak performed is substituting symlinks with /dev/null in the file list in order to trigger the checksum mismatch without actually scanning a file. An alternative approach would be to substitute an invalid file name, but it turns out that checksumming /dev/null is faster than failing on a missing file.
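As a minimal sketch of the substitution (the function name is hypothetical and the real implementation in bsda2 differs), a filter over the file list could look like this:

```sh
# Hypothetical helper: read one path per line on stdin and print the
# list to hand to the checksum tool, with symlinks replaced by /dev/null.
substitute_symlinks() {
	while IFS= read -r path; do
		if [ -L "$path" ]; then
			# Hashing /dev/null is cheap and is expected not to match the
			# reference checksum, so the entry gets inspected individually.
			printf '%s\n' /dev/null
		else
			printf '%s\n' "$path"
		fi
	done
}
```

The output keeps one line per input path, so the position of each result still corresponds to the original file list.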

The other tweak is changing the batch size. Finding the right batch size is simply a logarithmic search (stepping through powers of two) with a performance metric. The metric I used is the validate-all-packages benchmark.

A smaller batch size improves CPU utilisation at the beginning and end of the process.

A larger batch size reduces overhead (fewer calls of the sha256 executable, fewer locking operations on the task queue).

The original batch size was 64. Runtime improved up to a batch size of 1024, beyond which it stalled and then degraded. So the new batch size is 1024.
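To illustrate the overhead side of this trade-off, here is a small sketch of batching in plain sh. The function run_batched is hypothetical and leaves out the job queue and locking that pkg_validate uses; it only shows how the batch size determines the number of command invocations.

```sh
# run_batched SIZE CMD: read one path per line on stdin and call CMD
# with at most SIZE paths per invocation. Illustration only.
# Note: POSIX sh has no local variables, so batch_size/batch_cmd are global.
run_batched() {
	batch_size=$1
	batch_cmd=$2
	set --
	while IFS= read -r path; do
		set -- "$@" "$path"
		if [ "$#" -ge "$batch_size" ]; then
			# Keep CMD from consuming the path list on stdin.
			"$batch_cmd" "$@" </dev/null
			set --
		fi
	done
	# Flush the final, possibly shorter, batch.
	if [ "$#" -gt 0 ]; then
		"$batch_cmd" "$@" </dev/null
	fi
}

# Example on FreeBSD: checksum files in batches of 1024
#   find /usr/local/bin -type f | run_batched 1024 sha256
```

With 10 paths and a batch size of 4 the command runs three times (4, 4 and 2 arguments); with a batch size of 1 it runs ten times, which is where the process spawning overhead comes from.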

Benchmark

The test system is an Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz with 6 cores / 12 threads, running FreeBSD stable/13-n248234-3218666bd082. The maximum turbo clock is 4.5 GHz for a single core and 4.0 GHz for all cores. The CPU clock is controlled by the hwpstate driver, but as far as I can tell single-core turbo does not work for this model; hwpstate always sets the same clock speed for all cores.

pkg_validate (1207 packages)

This benchmark was performed seven times for each pkg_validate version:

for i in $(jot 7); do time pkg_validate; done

Benchmark validating all packages seven times in a row.

The first two runs show additional turbo boost benefits, whereas the third run has reached thermal equilibrium and performance is fairly stable from that point onwards. The median runtime of pkg_validate 0.4.2 is 51.42 s and that of the tweaked version is 45.79 s, a 10.9 % runtime reduction.
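The median and the relative reduction can be recomputed from the raw timings with a few lines of sh and awk. The helper names median and reduction are made up for this sketch.

```sh
# median: read one number per line on stdin and print the median.
median() {
	LC_ALL=C sort -n | LC_ALL=C awk '
		{ v[NR] = $1 }
		END {
			if (NR == 0) exit 1
			if (NR % 2) print v[(NR + 1) / 2]
			else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
		}'
}

# reduction BEFORE AFTER: print the relative reduction in percent.
reduction() {
	LC_ALL=C awk -v before="$1" -v after="$2" \
		'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Usage with the real times of the 0.4.2 runs:
#   printf '%s\n' 50.72 50.49 51.56 51.42 51.41 51.51 51.88 | median   # 51.42
#   reduction 51.42 45.79                                              # 10.9
```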

Usually a reduction in real time is achieved by improving the utilisation of cores, but in this case we managed to reduce the actual work done (the user + sys measurements).

real time per run [s]

Run   0.4.2   tweaked
1     50.72   44.54
2     50.49   44.53
3     51.56   45.16
4     51.42   45.95
5     51.41   45.85
6     51.51   45.79
7     51.88   45.86

user + sys time per run [s]

Run   0.4.2 user   0.4.2 sys   tweaked user   tweaked sys
1     213.02       123.78      169.65         111.38
2     218.13       121.85      175.92         111.01
3     217.33       124.52      174.80         114.71
4     223.38       124.96      184.08         115.90
5     226.02       124.19      180.01         115.21
6     222.61       124.68      181.42         116.32
7     224.00       125.48      184.42         116.43

pkg_validate texlive-*

To verify that there are no regressions I also ran a smaller test case validating the texlive packages:

for i in $(jot 9); do time pkg_validate texlive-\*; done

Benchmark validating all texlive packages nine times in a row.

This benchmark is dominated by the texlive-texmf package, which contributes 85605 out of 117570 files (72.8 %). This is the reason why a simple one-job-per-package approach does not scale well.

Luckily even this use case gets away with a net win, where I expected at least a small performance regression from the tweaks.

It is noteworthy that this benchmark does not seem to be thermally limited; increasing the number of runs to 25 did not make a difference either. Monitoring the system during the runs suggests that CPU utilisation is too low to reach a state where thermal throttling limits the turbo boost.

It might mean there is some untapped performance potential - or we are constrained by the limits of file system IO.

real time per run [s]

Run   0.4.2   tweaked
1     10.69   9.45
2     10.61   9.42
3     10.45   9.50
4     10.35   9.38
5     10.54   9.54
6     10.60   9.37
7     10.47   9.45
8     10.42   9.36
9     10.82   9.48

user + sys time per run [s]

Run   0.4.2 user   0.4.2 sys   tweaked user   tweaked sys
1     22.92        15.94       20.11          12.32
2     23.56        14.65       19.97          12.25
3     22.83        15.08       19.28          12.37
4     23.18        14.58       18.90          11.83
5     22.82        15.33       20.01          11.83
6     22.77        16.04       19.64          11.80
7     22.92        14.90       19.44          12.27
8     22.97        14.89       19.50          11.62
9     22.81        15.70       19.62          11.98

Conclusion

It’s always pleasant to find some low-hanging fruit. If you want to play with the batch size yourself, you can do so using the latest commit:

$ for i in $(jot 14 0); do time src/pkg_validate -b$((1 << i)) texlive-\* || break; done
       54.06 real       183.56 user       444.20 sys
       30.73 real       114.92 user       239.29 sys
       18.74 real        72.86 user       123.50 sys
       14.27 real        49.49 user        60.66 sys
       12.19 real        36.52 user        38.02 sys
       11.35 real        31.05 user        25.05 sys
       10.69 real        26.42 user        18.47 sys
       10.20 real        23.14 user        15.78 sys
        9.75 real        22.41 user        13.26 sys
        9.64 real        20.63 user        12.17 sys
        9.40 real        18.70 user        12.06 sys
        9.22 real        18.02 user        11.95 sys
        9.17 real        17.70 user        11.26 sys
        9.22 real        17.00 user        11.52 sys

Verify texlive packages with batch sizes from 1 to 8192.

References