xmltools implementation: automata and backtracking
At the moment, I’m implementing submatches in patterns, in preparation for xmlsed. Since most of what I do draws from formal languages and automata (except I’m too lazy to write anything formal), I’ve been looking into TNFAs for inspiration. Seeing how TRE seems to implement look-ahead without backtracking, I began to ask myself if such a thing could be done for my pattern matcher. Unfortunately, I can’t seem to figure out a solution by myself, so I’m blogging instead. :)
Basically, the problem is with constructs such as a[b]/c, which match c, a child of a, only if a has a child b. However, when we reach a, we don’t know yet whether it has a child b or not, so we’re in need of some look-ahead. This is akin to a(?=b)c in Perl regexes but, as you can already see, fairly different. Perhaps the most obvious difference is that a tree has "two dimensions" while a string has only one: it’s possible for a to have two children c in a tree, while it’s not possible for a to be followed by two different characters in a string.
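To make this concrete, here is a quick illustration using Python’s ElementTree, whose limited XPath dialect happens to support the same kind of predicate (this is ElementTree syntax, not xmltools syntax):

```python
# Illustration (ElementTree's limited XPath dialect, not xmltools syntax):
# the analogue of a[b]/c selects every c whose parent a also has a b child.
import xml.etree.ElementTree as ET

doc = ET.fromstring("<root><a><b/><c/><c/></a><a><c/></a></root>")
matches = doc.findall("a[b]/c")

# Both c children of the first a match; the second a has no b, so its c doesn't.
print(len(matches))  # 2
```

Note that a single a indeed yields two c matches here, which is exactly the "two dimensions" point above.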
The easiest way to solve this problem is to use backtracking, and this is how the matcher is currently implemented: even though just about everything else is based solely on states and transitions, the part implementing look-ahead predicates uses backtracking.
The algorithm looks like this: every time we have to match a look-ahead predicate, we stop, read as many nodes as we need to decide, and then either drop the state or keep it, depending on the outcome.
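In heavily simplified Python, the shape of it is something like this (made-up data structures, nothing like the real matcher’s internals):

```python
# A minimal sketch of the idea (made-up data structures, not the actual
# xmltools matcher): when a state reaches a look-ahead predicate, we pause,
# read the children until we can decide, then keep or drop the state.
def has_child(node, name):
    """Look-ahead predicate [name]: scan the node's children until decided."""
    return any(child["name"] == name for child in node.get("children", []))

def step(states, node):
    """Advance the state set over one node, resolving predicates eagerly."""
    survivors = []
    for state in states:
        pred = state.get("predicate")      # e.g. "b" for a[b]
        if pred is not None and not has_child(node, pred):
            continue                       # predicate failed: drop the state
        survivors.append(state)
    return survivors

tree = {"name": "a", "children": [{"name": "b"}, {"name": "c"}]}
print(step([{"predicate": "b"}, {"predicate": "x"}], tree))  # only a[b] survives
```

The cost shows up when has_child has to read deep into the subtree only for the state to be dropped: those nodes are examined again afterwards.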
This is a simple approach and it works well (i.e. it gives the expected results). It’s just slow. As a comparison, using the patterns from my earlier benchmarks, running p[hw/.="Ab\"a*cus"]# with the look-ahead predicate comes out about 2.66 times slower. This can be hand-optimized by rewriting the pattern as body/p[hw/.="Ab\"a*cus"]#, using knowledge of the input format (p can only appear as a child of body), which cuts the slowness to a mere 1.62 times.
At the same time, it demonstrates a major source of inefficiency in my backtracking implementation: loose (non-anchored) subpatterns. These may potentially apply to a great number of nodes yet fail immediately for most of them, which results in a lot of unwanted short backtracking segments: the matcher tries the subpattern, fails, and immediately backtracks, thus still examining the node twice.
In theory, a pattern such as a(?=b)c matches if a is followed by both b and c, so another natural idea would be to use a product automaton which transitions between pairs of states: if we end up in two final states, then we win.
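As a string-matching sketch, the product construction looks like this (toy DFAs with made-up transition tables, purely illustrative):

```python
# Toy product construction over strings (made-up transition tables): run two
# DFAs in lockstep; a product state is the pair of component states, and the
# input is accepted when both components end in a final state.
def run_product(delta_a, delta_b, finals_a, finals_b, word):
    sa, sb = 0, 0
    for ch in word:
        sa = delta_a.get((sa, ch))
        sb = delta_b.get((sb, ch))
        if sa is None or sb is None:
            return False
    return sa in finals_a and sb in finals_b

# A: "a" then any of a..h; B: "a" then "b" or "c" (the look-ahead part).
delta_a = {(0, "a"): 1}
delta_a.update({(1, c): 2 for c in "abcdefgh"})
delta_b = {(0, "a"): 1, (1, "b"): 2, (1, "c"): 2}

print(run_product(delta_a, delta_b, {2}, {2}, "ac"))  # True: both accept
print(run_product(delta_a, delta_b, {2}, {2}, "ad"))  # False: B rejects "d"
```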
Only, this works well because we want to know whether the string matches as a whole. However, with an XML pattern such as a[b]/c, we need to identify which node matched the c part. This is where things start to get complicated.
Now, we do know how to extract a submatch from a string with tagged transitions. So, great, does this apply to our XML matcher? The answer is "no", as far as I know (I’d be glad if someone proved me wrong, though). Tagged transitions are good at extracting one submatch, the last one; in our case, the problem is that we want all occurrences of c, not just the last one.
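A toy model of the limitation, using a plain string instead of a tree (this is my reading of how a single tag behaves, not xmltools code):

```python
# Toy model of the limitation (a plain string instead of a tree): a tag
# stores a single input position and is overwritten each time its transition
# fires, so after a full run only the last occurrence can be recovered.
def run_tagged(word, tagged_symbol):
    tag = None
    for pos, sym in enumerate(word):
        if sym == tagged_symbol:
            tag = pos          # overwritten: earlier positions are lost
    return tag

print(run_tagged("acacac", "c"))  # 5, although c also occurred at 1 and 3
```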
Keeping this information globally associated with the state is not an option, since there may well be multiple instances of the look-ahead predicate trying to match at the same time. But trying to differentiate between two states based on the matching position is not an option either, because that could easily lead to linear memory usage. Consider, for example, an XML fragment consisting of deeply nested elements. Trying to match (a%%b) against it (using the old syntax; sorry, I reverted to it because in the current master branch that notation will be used for groups) would result in a match at every element. Hence, keeping the position as a state property is not a solution, since we’d have to distinguish between as many states as there are matching elements in the input data.
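To make the blow-up concrete, here is an illustrative reconstruction (the exact fragment isn’t shown here, and I’m reading a%%b loosely as "an a with a b somewhere below it"):

```python
# Illustrative reconstruction: with deeply nested a elements and a single b
# at the bottom, every a has the b below it, so a look-ahead such as (a%%b)
# succeeds at each of them. Storing "where this match started" in the state
# then needs one distinct state per a element: memory linear in the input.
import xml.etree.ElementTree as ET

depth = 1000
root = ET.fromstring("<a>" * depth + "<b/>" + "</a>" * depth)
matching = sum(1 for _ in root.iter("a"))  # every a would carry a position
print(matching)  # 1000
```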
Optimizing the backtracking method
Although the implementation of the matcher has recently undergone many changes to improve performance, mostly so as to accommodate new features while keeping a reasonable running time, I believe there’s still room for much improvement, perhaps in the form of algorithmic enhancements rather than implementation optimizations.
Some of my more realistic ideas are:
Read one element ahead to reject some subpatterns without entering a backtracking context.
Try to predict the outcome of a look-ahead predicate and compute the more likely transitions along with the predicate itself, saving on backtracking (except to prune the tree) when we’re right (but incurring additional recomputation when we’re wrong).
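The first idea is essentially a "first set" check; a rough sketch, with hypothetical helper names:

```python
# Rough sketch of the one-element-ahead rejection (hypothetical helper names,
# essentially a "first set" check): peek at the node's name and skip the
# expensive backtracking attempt when the subpattern cannot start there.
def match_with_peek(subpattern, node):
    if node["name"] not in subpattern["first"]:   # cheap rejection, no backtracking
        return False
    return subpattern["match"](node)              # expensive backtracking match

# Stub subpattern: can only start at a p element.
sub = {"first": {"p"}, "match": lambda n: n.get("ok", False)}

print(match_with_peek(sub, {"name": "div"}))            # False, rejected cheaply
print(match_with_peek(sub, {"name": "p", "ok": True}))  # True
```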
Also, this is not directly related to the backtracking approach, but I’m thinking of putting more information in the state cache. At the moment, we only cache "raw" transitions, keyed on the old state set and the direction (i.e. successor or child); we could cache local information as well, which would move us one step closer to the way NFA transition functions are cached to obtain DFAs.
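The caching idea, sketched with a toy transition function (illustrative only, not the real cache):

```python
# Sketch of the caching idea (toy transition function, not the real cache):
# key the cache on the frozen state set plus the direction, much like the
# lazy subset construction that builds a DFA from an NFA on demand.
from functools import lru_cache

NFA = {  # state -> direction -> set of next states
    0: {"child": {1}, "successor": {0}},
    1: {"child": {2}, "successor": {1}},
    2: {"child": set(), "successor": {2}},
}

@lru_cache(maxsize=None)
def cached_step(states, direction):
    out = set()
    for s in states:
        out |= NFA[s][direction]
    return frozenset(out)

print(sorted(cached_step(frozenset({0, 1}), "child")))  # [1, 2]
cached_step(frozenset({0, 1}), "child")                 # same key: served from cache
print(cached_step.cache_info().hits)  # 1
```

Adding local information to the key is then just a matter of enriching what goes into the tuple the cache is keyed on.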
Conclusion: do we need performance that badly?
Well, that’s a good question. And the answer is "we probably don’t". This is kind of a hobby of mine, so don’t get the wrong idea: the whole thing is not that slow. On my simple example, xmlgrep was only 0.5s slower than the libxml-powered xmlstarlet, taking less than six seconds to look up a word definition in a 58M dictionary on my 2x2GHz Core 2, while using 100 times less memory.
I doubt anybody is going to use xmlgrep to look up words in dictionaries anytime soon; such a specialized task would be best served by an indexed collection of some sort. I should mention, though, that some XML formats are better suited to being parsed with my tools than others, so converting to a more appropriate (e.g. annotated) XML format with xmlsed and then searching with xmlgrep could well be a viable solution too (provided the conversion is one-time, or infrequent compared to searches). More on that once xmlsed is out!