GSoC 2009 for NetBSD: xmltools

xmltools: after the GSoC

(Nhat Minh Lê (rz0) @ 2009-09-07 17:14:48)

Since the end of the Summer of Code, I have been busy with various tasks around my xmltools, so that they will be ready for inclusion into the NetBSD tree some time in the future. This is a short summary of what has happened (just so people know that my project’s not dead and I’ve not vanished); please take a look at the code repo for details.

ATF test suites

I’ve finally decided to convert my bunch of shell scripts to ATF tests. These are basic tests which check the various constructs of the pattern language. There are 46 tests, three of which are optional (they run against rather big data sets that I do not distribute with the sources). These include a test suite centered around property list matching. If you have relevant test cases and would like to see them included, please send them over.

Library API

Some people have been asking me whether there would be a library. Well, yes: I’ve libified all the existing code in anticipation of that, and also because I wanted to eventually commit something clean to the NetBSD tree. All routines now support proper error reporting and expose clean interfaces. The library is divided into four main sub-namespaces:α

  • hhxml_chunk_* : XML in-memory tree manipulations.
  • hhxml_stream_* : XML streams and (push-style) parsers. The streams have a hybrid design: they can be used to read nodes one at a time in a pure stream fashion or to generate partial in-memory trees. See misc/atomheadlines.c in the source distribution for an example.
  • hhxml_path_* : patterns and matching. This one is incomplete and the old match.c will need to be converted to fit into the new API.
  • hhxml_pprint_* : XML pretty printer. The implementation is also mostly incomplete at this time: it only supports printing to stdout, well, because the command-line tools don’t require any more than that.

The various parts of the API will be completed as I go about writing the tools themselves, but I wanted to have clean, well-defined interfaces to build on.
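
To give a rough idea of what the hybrid stream design means in practice, here is a toy sketch in C. It does not use hhxml at all: every name in it (toy_next, toy_subtree, toy_headlines, and the toy_* types) is made up for this illustration, and the "parser" only understands bare open tags, close tags and text. The point is only to show the two modes side by side: nodes are read one at a time and thrown away, except when an entry element shows up, in which case a small in-memory tree is built for that element alone.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy types; none of these names belong to hhxml. */
enum toy_kind { TOY_OPEN, TOY_CLOSE, TOY_TEXT, TOY_EOF };
struct toy_node { enum toy_kind kind; char text[64]; /* tag name or data */ };
struct toy_stream { const char *p; /* cursor into NUL-terminated input */ };
struct toy_tree {
	char name[64];
	char text[64];			/* concatenated character data */
	struct toy_tree *kids[8];
	int nkids;
};

/* Stream mode: read one node. Only <a>, </a> and text are understood. */
static enum toy_kind
toy_next(struct toy_stream *s, struct toy_node *n)
{
	size_t len;

	n->text[0] = '\0';
	if (*s->p == '\0')
		return n->kind = TOY_EOF;
	if (*s->p == '<') {
		n->kind = (s->p[1] == '/') ? TOY_CLOSE : TOY_OPEN;
		s->p += (n->kind == TOY_CLOSE) ? 2 : 1;
		len = strcspn(s->p, ">");
	} else {
		n->kind = TOY_TEXT;
		len = strcspn(s->p, "<");
	}
	snprintf(n->text, sizeof(n->text), "%.*s", (int)len, s->p);
	s->p += len;
	if (n->kind != TOY_TEXT && *s->p == '>')
		s->p++;			/* skip the closing '>' */
	return n->kind;
}

static void
toy_free(struct toy_tree *t)
{
	int i;

	if (t == NULL)
		return;
	for (i = 0; i < t->nkids; i++)
		toy_free(t->kids[i]);
	free(t);
}

/* Tree mode: called right after an open tag; consumes up to its close tag. */
static struct toy_tree *
toy_subtree(struct toy_stream *s, const char *name)
{
	struct toy_tree *t = calloc(1, sizeof(*t));
	struct toy_node n;

	if (t == NULL)
		return NULL;
	snprintf(t->name, sizeof(t->name), "%s", name);
	while (toy_next(s, &n) != TOY_EOF && n.kind != TOY_CLOSE) {
		if (n.kind == TOY_TEXT)
			strncat(t->text, n.text,
			    sizeof(t->text) - strlen(t->text) - 1);
		else if (t->nkids < 8)
			t->kids[t->nkids++] = toy_subtree(s, n.text);
		else
			toy_free(toy_subtree(s, n.text)); /* skip extras */
	}
	return t;
}

/* Hybrid use: stream through the feed, build a tree only per <entry>. */
static int
toy_headlines(const char *xml, char out[][64], int max)
{
	struct toy_stream s = { xml };
	struct toy_node n;
	struct toy_tree *e;
	int i, count = 0;

	while (count < max && toy_next(&s, &n) != TOY_EOF) {
		if (n.kind != TOY_OPEN || strcmp(n.text, "entry") != 0)
			continue;	/* stream mode: nothing is kept */
		e = toy_subtree(&s, n.text);
		for (i = 0; e != NULL && i < e->nkids; i++)
			if (e->kids[i] != NULL &&
			    strcmp(e->kids[i]->name, "title") == 0) {
				snprintf(out[count++], 64, "%s",
				    e->kids[i]->text);
				break;
			}
		toy_free(e);
	}
	return count;
}
```

Calling toy_headlines on a feed with a feed-level title and two entries copies only the two entry titles into out: the feed-level title goes by in stream mode and is never stored. Memory use is bounded by the size of one entry rather than the whole document, which is the appeal of mixing the two modes.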

α: By the way, "hhxml" stands for "(the) very short XML (library)".

Non-visible changes to the pattern matching engine and plans

Well, I’ll write more on this one when I get the time, but I’ve been looking into other implementations and theories, especially SPEX, which boasts algorithms with better complexity than what is currently implemented in my tools. I have plans for improvements to the engine which, if I’m not wrong, would bring us similar performance, but that will have to wait, since in my opinion it is not critical at the moment.

That’s all for today!