GSoC 2009 for NetBSD: xmltools
  1. In search of a good pull-style XML parser
  2. XML as a tree representation
  3. xmlgrep toy benchmarks
  4. Improving performances of xmlgrep
  5. xmlgrep by examples: playing with Atom
  6. More benchmarks: the bitvopt branch
  7. xmlsed preview: writing XML
  8. xmltools update: new I/O layer and further plans
  9. xmltools implementation: automata and backtracking
  10. xmltools: what's done, what's left
  11. xmltools: after the GSoC

xmltools: what's done, what's left

(Nhat Minh Lê (rz0) @ 2009-08-13 23:44:37)

So the Summer of Code ends soon and it’s time to report what has been done, and what is still lacking. Obviously, everybody can see there is no xmlsed yet in the repository, so what have I been doing since I completed xmlgrep one month ago? What went wrong?

Well, if you ask me, nothing really went wrong; I have just deviated somewhat from my original goals and taken the time to do some things that weren’t part of the initial plan. And even though I am disappointed not to have finished, this detour has brought its share of good additions. Most of these modifications stem from discussions with David (Young) about features we’d like to see in my xmltools. Also, out of these three months, I’ve spent about three weeks optimizing the matcher code.

What we have

Let’s talk about the good stuff first. Of course, the visible part of my work is the existence of xmlgrep, but most of the code actually lives in a common static library, which provides the XML parser, the pattern-matching engine, and the high-level framework that ties the tools together.

The library has managed to remain reasonably small, with less than 5k lines of code, counting comments and everything, and the core matching algorithm fits in about 1.5k lines (compare this to the 15k-line xpath.c from libxml2). It is stream-oriented, and care has been taken to keep its memory usage as low as possible while remaining efficient.

The pattern matching code

The matcher is undeniably the most important and most sensitive piece of code in there. It is a blend of a kind-of-NFA and a backtracking machine, which processes one node at a time (NFA) but is able to do unlimited look-ahead (backtracking). Recent work has involved adding new advanced constructs to the matcher (besides the forward and look-ahead mechanisms, which were already implemented) and improving its performance when look-ahead is in use.
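
As a purely illustrative toy (none of this is the real xmltools code; the input, the pattern, and the names are all made up), the following sketch shows the two mechanisms side by side: a forward step that consumes one node per iteration, and a look-ahead check that peeks ahead and then lets the main loop resume where it was:

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model: each character of the input stands for one node.  The
 * main loop advances one "node" at a time (the NFA side); the
 * look-ahead helper scans ahead from the current position without
 * moving it (the backtracking side).
 */

/* Look-ahead: is there a 'c' before the next 'b'? */
static bool
lookahead_c_before_b(const char *s)
{
	for (; *s != '\0' && *s != 'b'; s++)
		if (*s == 'c')
			return true;
	return false;
}

int
main(void)
{
	const char *input = "xaycbz";	/* pretend stream of nodes */
	bool seen_a = false;		/* the only "NFA state" here */
	const char *p;

	for (p = input; *p != '\0'; p++) {
		/* One forward step per node. */
		if (*p == 'a' && lookahead_c_before_b(p + 1))
			seen_a = true;	/* tentative match confirmed */
		if (*p == 'b' && seen_a) {
			printf("match at offset %ld\n", (long)(p - input));
			return 0;
		}
	}
	printf("no match\n");
	return 1;
}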

Performance

On the performance side, the main additions are a state cache and pre-computed states created by the pattern compiler (instead of by the matching code itself).

The state cache (implemented as a hash table) operates at two levels:

One effect of using a cache is that bit vectors need not be pooled anymore, and a lot of copying has disappeared in favor of pointer swapping. Also, relying on states pre-computed by the pattern compiler has shifted the focus from looping bit by bit through bit vectors to whole-vector operations, most of which can be auto-vectorized by a sufficiently smart compiler (ICC on Linux can vectorize a dozen of these in the whole xmlgrep code, but as I’ve pointed out before, GCC’s auto-vectorizer seems to lag behind and is of no use here, at least on my code and setup).
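
To make the bit-by-bit versus whole-vector distinction concrete, here is a minimal sketch (the type and function are made up for illustration, not taken from the xmltools sources) of the kind of word-at-a-time loop an auto-vectorizer handles well:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit vector: just an array of machine words. */
struct bitvec {
	size_t nwords;
	uint64_t *words;
};

/*
 * Whole-vector union: dst |= src.  A counted loop over independent
 * words like this is the shape a vectorizing compiler can turn into
 * SIMD code, unlike a loop that tests and sets one bit at a time.
 */
static void
bitvec_or(struct bitvec *dst, const struct bitvec *src)
{
	size_t i;

	for (i = 0; i < dst->nwords; i++)
		dst->words[i] |= src->words[i];
}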

With these optimizations merged in, xmlgrep now outperforms xmlstarlet in terms of both speed and memory consumption on my silly little benchmark (for those who have not been following, it consists of grepping for a word and its definition in a 50M XML-formatted English dictionary, and yes, it’s simple-minded), even when using suboptimal look-ahead subpatterns.

Indeed, knowledge of the XML format often enables us to write optimized patterns by anchoring a more costly part with a basic predicate. In our case, instead of searching for p[hw/.="somestring"] (the suboptimal case), we could search for body/p[hw/.="somestring"], since p can only occur inside body. This is an improvement because the p[] part, being a look-ahead subpattern, is noticeably slower than a forward pattern, so reducing the number of tentative matches is essential.
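
For instance (the file name is hypothetical, but the patterns are the ones above), both of these invocations look for the same entries; the second simply gives the matcher far fewer places where it has to attempt the look-ahead:

$ xmlgrep 'p[hw/.="somestring"]' dictionary.xml
$ xmlgrep 'body/p[hw/.="somestring"]' dictionary.xml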

Compared to benchmarks done at the time of the release of xmlgrep, handling of this suboptimal case is about three times faster now.

Features

On the feature side, which is probably more interesting for casual would-be users, I have added grouping, Kleene-like operators, and subpattern negation, as well as reworked some of the old syntax.

Subpattern negation is the simplest extension, although it’s fairly powerful. It relies on backtracking, so it’s only available on (look-ahead) subpatterns, not groups (see below). Negation lets you write something like {a}!{a/b}, which selects all elements a that do not have a b child. This can’t be done with regular forward patterns because a node can have many children: unlike with regexes, a hypothetical a/!b (would-be syntax) would match any a node that has at least one child other than a b, even if a b is also among the other children.

For example:

$ xmlgrep -x '{dict}!{dict/@id}/key/.' test.plist

will extract only keys from unnamed dictionaries.
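
To make this concrete, imagine a hypothetical test.plist reduced to the shape that matters here (this is not a full, valid plist):

$ cat test.plist
<dict id="prefs"><key>Color</key><string>blue</string></dict>
<dict><key>Size</key><string>12</string></dict>

Only the second dict has no @id attribute, so only its key, Size, would be selected.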

The second addition is Kleene-like operators. Well, sorry about the name, but it’s just what comes to mind when I use them. They are, as the name implies, somewhat like their regex counterpart, the Kleene star, in that they match zero or more repetitions of the preceding predicate. E.g. a*%b matches <a/><a/><a/>...<b/> but not, say, <a/><c/><d/>...<b/>. The %% and // operators have now become special cases of the Kleene operators: a%%b can be written a%**%b, and a//b is really a/**/b, where the first star stands for "any node" and the second is part of the operator.

But having these repetition operators limited to repeating a single node predicate would have been quite boring, and so groups were born. Groups, written inside parentheses () (notice that parentheses have replaced the old subpattern syntax, which has gone back to the {} of its early days), well, group a pattern fragment so that it can be repeated. They also accept alternation, so you can write a%(b|c)*%d, which just seems natural.

Let’s take an example:

$ cat test.xml
<B/><a/><b/><a/><b/><a/><b/><a/><b/><a/><b/><E/>
$ xmlgrep 'B%(a|b)*%E' test.xml
<E/>

Groups are implemented without backtracking so using a group is not much more costly than using plain forward patterns, and is faster than using look-ahead subpatterns.

Yet, that is only true of normal groups. When you use a selection operator (+, #, - or =) on a group, it becomes a capturing group, a.k.a. a look-ahead group. A look-ahead group is really a look-ahead subpattern combined with a group, and it has a somewhat subtle behavior. It behaves like a group, with the following difference: any selection operator inside a look-ahead group only applies if the whole group matches. This can be used to bind selectors together. Thus we can revisit the classical key/value pair example: how do we match a key/value pair with some properties on the key and some others on the value? You can always do that with simple look-ahead subpatterns combined with forward patterns, but groups just make it more convenient.

Imagine you want to match all key/value pairs where the key contains the letter a and the value is a string. Easy, with a capturing group:

$ xmlgrep -x '(key[.~"a"]#%string#)+' test.plist
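
For instance, given a fragment like <key>Name</key><string>foo</string><key>Size</key><integer>12</integer>, only the first pair would qualify: Name contains an a and its value is a string element, while Size fails both tests.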

Other components, reusability

Some people have discussed with me the possibility of making the library reusable. As it is now, it is not yet ready for wider exposure, but a lot of work has gone into rationalizing its API, in order to prepare for that. The API needs to be stabilized and the various components need to be better segregated.

Currently, the data structures in use are tightly integrated. For instance, the parser actually builds a tree with nodes able to hold matching states. I would need to break that dependency somehow, so that the matcher depends on the parser but not the other way around; you could then use the XML parser as a stand-alone component (it has a nice hybrid design). Other parts are less affected, though.
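
One way to picture the change (a sketch only; these are not the real xmltools structures): instead of the tree node embedding matcher-specific state, the parser could expose an opaque per-node slot that the matcher, or any other client, fills in on its own terms:

/*
 * Illustrative only.  The parser side knows nothing about matching;
 * it just hands out a user slot per node.
 */
struct node {
	const char *name;			/* element name */
	struct node *parent, *children, *next;	/* tree links */
	void *udata;				/* opaque, owned by the client */
};

/* The matcher keeps its own per-node state definition privately... */
struct match_state {
	unsigned long active;			/* e.g. a small state set */
};

/* ...and attaches it without the parser ever seeing the type. */
static void
node_set_match_state(struct node *n, struct match_state *st)
{
	n->udata = st;
}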

What’s left to do

Now for the bitter part. Obviously, there are still a lot of things left to do, and I intend to continue working on this project, but whether it will be integrated into NetBSD at some point in the future is not for me to decide, though I really hope it will.

In short, here is the plan:

  1. write support for xmlsed templates in the library;
  2. write xmlsed proper;
  3. change or add a high-level API in the library to make it possible to write xmljoin (because the high-level framework only accepts one input file at this point in time);
  4. see what more needs to be adjusted for xmljoin to fit into the framework;
  5. write xmljoin proper;
  6. see what needs to be done for xmlsort to fit into the framework;
  7. write xmlsort.

Of course, I have given some thought to the "see what needs to be done" phases, but now is not the right time to talk about that.

Also, at some point, I’ll need to address several shortcomings of the current code base:

And I will also want to:

Lastly, about xmlsed: even though there is no xmlsed directory in the repository yet, I can say it is pretty close to completion, since the pieces needed are almost all there in the library (the only one missing is template support, and that’s not the hard part).

Conclusion

The Google Summer of Code is nearing its end, but my summer vacations are not over yet. I won’t have classes until September, so I plan to carry on as if nothing much happened at least till the end of August. After that, I won’t have nearly as much free time, but I’ll do what I can; let’s not be hasty and wait till then to see how far the project has gone.

I will continue to update this blog and the code will be available at my Git repository, as usual. (Anyway, I have just recently added comments to my blog so feel free to drop by and give me your opinion.)