More benchmarks: the bitvopt branch
Those who have been keeping an eye on my Git repository have surely
noticed the recent update of the
bitvopt branch, which contains
alternate implementations of the core routines of the matching engine
replaces the direct tests and propagation entirely with bit vector
operations (masking, shifting, etc.).
The main advantage of this approach is vectorization. Such
operations are easily done on a single unit (the type of which is
bitv.h and somewhat configurable with the
macro) and can be generalized to loops that carry no dependencies
between iterations. These loops can then be auto-vectorized by
a sufficiently smart compiler. The current code in
understood and vectorized by both GCC 4 with
ICC 10.1. Beware, ICC 10.0 fails because of the
which are necessary; also note that, on x86, you have to pass
USE_STDINT in order to select a type bigger than
unsigned char, or
else, SIMD shift operations won’t be available. To see the
vectorization reports, you can use
-vec-report for ICC.
The results are not quite satisfactory and the branch was not merged
On the chart, GCC has vectorization disabled by default while ICC
always has vectorization (even though I have not specified
anywhere). The measurements have been done on my old Pentium 4, with
-O2 and appropriate architecture flags, with GCC 4.1.2 and ICC
10.1. The values are averages computed over 100 runs of the following
pattern on the same dictionary as before:
$ xmlgrep 'p[hw/?="Ab\"a*cus"]#'
I must say I can’t explain why operations on native integers (the
-int builds) are slower than those on bytes. As for the relative
gain of non-vectorized code over vectorized one, this is most probably
because the pattern is not long enough (less than one base unit) to
take advantage of SIMD so we lose in the overhead introduced with
vectorization (all the more so as data may be unaligned on entry of
these functions). Lastly, it seems auto-vectorization in GCC really is
not advantageous right now, since the non-vectorized versions tend to
outperform their vectorized counterparts when compiled with GCC (and
this is not the case with ICC).
Anyway, this was just a side experiment of mine; work on the main tools continue: plans for more features in the matcher (on request, case insensitivity, arithmetic predicates, maybe some other things) as well as the introduction of xmlsed are underway.