GSoC 2009 for NetBSD: xmltools

XML as a tree representation

(Nhat Minh Lê (rz0) @ 2009-06-10 15:33:14)

xmltools is a project whose ultimate goal is to bring Unix command-line efficiency to the XML world (and not, as many might think, to bring XML into the Unix world, which is already happening, through no fault of mine). The idea is to treat XML as a generic tree representation format.

This seems very straightforward; in fact, one could argue that XML was designed from the start to be a generic tree representation. However, XML has a number of problems that make it hard to use in this role.

The first such problem is doctype declarations. While the bare XML structure, which we see most of the time, is very simple, doctype declarations and their semantics are both unneeded and unwanted in tree (syntax) manipulation programs. It would be fine if doctype support were optional for non-validating XML processors, but as I understand it, the XML standard mandates support for a portion of it: entity declarations and references, and default attribute values.
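To make the doctype problem concrete, here is a minimal sketch using Python's xml.parsers.expat bindings, chosen only because they wrap expat. The document, the element name note, the attribute lang and the entity sig are all made up for illustration. The element in the document body carries no attribute and never spells out the signature, yet the application sees both.

```python
import xml.parsers.expat

# Hypothetical document: the internal subset declares a default
# attribute and an internal entity.
DOC = """<!DOCTYPE note [
  <!ATTLIST note lang CDATA "en">
  <!ENTITY sig "rz0">
]>
<note>signed &sig;</note>"""


def parse(document):
    """Return (list of (name, attributes), concatenated character data)."""
    elements = []
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = (
        lambda name, attrs: elements.append((name, dict(attrs))))
    # Character data can arrive in several callbacks; collect, then join.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return elements, "".join(chunks)


elements, text = parse(DOC)
print(elements)  # [('note', {'lang': 'en'})]
print(text)      # signed rz0
```

A tool that treats the file as a plain tree would have to replicate this behavior, or document that it does not.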

The second issue is white space handling: XML defines no default behavior for white space. XML parsers must pass all white space to the application, which is responsible for applying some sensible default processing. The xml:space attribute can give a hint to the application, but as far as I know, it’s not in wide use.
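As a small illustration, again assuming Python's xml.parsers.expat bindings and a made-up document, the sketch below records every parser event for an indented file. The indentation shows up as text events of its own, and it is up to the application to decide what to do with them.

```python
import xml.parsers.expat

# Hypothetical, pretty-printed input.
PRETTY = "<list>\n  <item>a</item>\n  <item>b</item>\n</list>"


def events(document):
    """Return the parser events in order, merging adjacent text chunks."""
    out = []

    def text(data):
        # expat may split one text run across several callbacks.
        if out and out[-1][0] == "text":
            out[-1] = ("text", out[-1][1] + data)
        else:
            out.append(("text", data))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: out.append(("start", name))
    parser.EndElementHandler = lambda name: out.append(("end", name))
    parser.CharacterDataHandler = text
    parser.Parse(document, True)
    return out


for kind, value in events(PRETTY):
    print(kind, repr(value))
```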

All of this led me to develop a set of compatibility rules that describe precisely what to expect from xmltools, whatever backend is used (in case we move away from expat, which I plan to do eventually). The XML subset these rules define is sufficient to describe any tree-like data set.

The rules will be maintained in the doc/compatxml.text file, in the xmltools source tree. For convenience, I’ve reproduced the current version below:

  1. Do not use doctype declarations for any purpose other than validation. In particular, do not rely on doctype declarations to provide default attributes and do not use internal entities for arbitrary substitutions; if you wanted to save typing, you wouldn’t be using XML. Use of numeric and predefined entities is OK. Use of directly-encoded characters is preferred.

  2. Do not presuppose the XML processor has access to any resource besides your file: do not use external entities.

  3. Do not assume the XML processor is able to handle multiple character sets: all the documents and other data supplied to the program should use the same character set. Also, set your locale appropriately.

  4. Unless xml:space is set, assume all white-space-only text nodes are ignored by the XML processor.

  5. Set xml:space to preserve whenever white space not kept by the previous rule must be preserved. As stated in the first rule, do not rely on doctype declarations to implicitly set this attribute. Please note however that the fourth rule works well for most purposes, including processing of HTML pre elements which only contain verbatim text.
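The fourth and fifth rules can be sketched as a small filter on top of the parser. This is not the xmltools implementation, only one possible reading of the rules, written against Python's xml.parsers.expat bindings; the function name significant_text and the sample document are hypothetical. It assumes that xml:space is inherited by child elements and that xml:space="default" switches preservation off again, as the XML specification describes.

```python
import xml.parsers.expat

XML_WHITE_SPACE = " \t\r\n"


def significant_text(document):
    """Return the text runs an application would see under rules 4 and 5."""
    kept = []
    preserve = [False]  # stack: is xml:space="preserve" in effect here?
    pending = []        # chunks of the text run being accumulated

    def flush():
        if not pending:
            return
        run = "".join(pending)
        pending.clear()
        # Rule 4: drop white-space-only runs; rule 5: unless preserved.
        if preserve[-1] or run.strip(XML_WHITE_SPACE):
            kept.append(run)

    def start(name, attrs):
        flush()  # text before a child belongs to the parent
        space = attrs.get("xml:space")
        if space == "preserve":
            preserve.append(True)
        elif space == "default":
            preserve.append(False)
        else:
            preserve.append(preserve[-1])  # inherit from the parent

    def end(name):
        flush()
        preserve.pop()

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = pending.append
    parser.Parse(document, True)
    return kept


SAMPLE = ('<doc>\n  <p>one</p>\n'
          '  <pre xml:space="preserve">  <b>x</b> </pre>\n</doc>')
print(significant_text(SAMPLE))  # ['one', '  ', 'x', ' ']
```

Note that runs containing any non-white-space character are kept untouched, including their leading and trailing white space. This is why the fourth rule alone is enough for a pre element that holds only verbatim text.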