xmlgrep toy benchmarks
- the selection tool from the xmlstarlet project (available in
- another xmlgrep from the Perl Twig package (
- and GNU grep, though it does not really accomplish the same, from the base NetBSD-5 distribution, as a reference.
Since each tool accepts its own patterns and treats them differently, I’ve tried my best to get meaningful results with each, but obviously, the behaviors obtained differ somewhat.
tests/memplot script was used to make the sampling. The data
were plotted using Gnuplot. The following commands were run:
$ xmlgrep 'hw/?="Ab\"a*cus"' d.xml $ xmlgrep 'hw[?="Ab\"a*cus"]' d.xml $ xml sel -t -c "//hw[text()='Ab"a*cus']" d.xml $ xml_grep "hw[string()='Ab\"a*cus']" d.xml $ grep 'Ab"a\*cus' d.xml
The goal was to retrieve the
<hw> element which contains the
Ab"a*cus string from a 53M dictionary, retrieved from
http://www.ibiblio.org/webster/ and merged into a single file.
The first two lines correspond to two different invocations of xmlgrep, which do not do exactly the same thing, and with the first being the good one, the other inefficiently abusing the look-ahead mechanism. Yet, it was included for comparison purposes and maybe to be fair to the XPath-based alternatives, which have to rely on a predicate.
Results follow; the second and third pictures are just zooms which omit one or another of the candidates.
I know these results don’t mean much, so don’t take them too seriously. I’ve only tested a single simple pattern. Actually, I didn’t really intend to make a comparison, at first, I just wanted to make sure xmlgrep doesn’t leak or otherwise misuse memory. But since the sampling code was there, I thought it’d be fun to do some additional measurements.
As expected, xmlstarlet, which uses a full DOM model requires a lot of memory to do its work (more than 400M!). But it is the fastest.
My xmlgrep, when used right, is still nearly 42% slower than xmlstarlet (from above 4s to below 6s), but I think the times remain reasonable; memory-wise, it is also the lightest, with a constant 2.9M in use throughout the whole run. Even GNU grep requires about 2.8M.
Surprisingly, the Twig xmlgrep is not too big a memory killer, though its memory usage increases over time, by steps (though it looks linear on the long run). This appears somewhat strange to me; why some nodes should be pruned from its internal DOM-like tree while others seem to remain for the duration of the program. However, on the speed side, Twig is predictably very slow; it takes nearly two minutes to produce the results comparable to those of its C counterparts.
That’s it for tonight; as one friend of mine told me, more interesting benchmarks will probably come from actual users.