XML as a tree representation
xmltools is a project whose ultimate goal is to bring Unix command-line efficiency to the XML world (and not, as many might think, to bring XML into the Unix world, which is happening already, though through no fault of mine). The idea is to treat XML as a generic tree representation format.
This seems very straightforward; in fact, one could argue that XML was designed from the start to be a generic tree representation. However, XML has a number of problems that make it hard to use in this role.
The first such problem is doctype declarations. While the bare XML structure, which we see most of the time, is very simple, doctype declarations and their semantics are both unneeded and unwanted in tree (syntax) manipulation programs. It would be fine if doctype support were optional for non-validating XML processors, but as I understand it, the XML standard mandates support for a portion of it: entity declarations and references, and default attribute values.
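To make the point concrete, here is a minimal sketch using Python's standard `xml.parsers.expat` binding (chosen only because expat is the current backend; this is an illustration, not xmltools code). The document, the `sig` entity and the `lang` attribute are made up for the example. It shows a non-validating parser picking up both a default attribute value and an internal entity from the doctype declaration.

```python
import xml.parsers.expat

# Hypothetical document: the internal subset declares an entity
# and a default value for the "lang" attribute of <note>.
DOC = """<!DOCTYPE note [
  <!ENTITY sig "xmltools">
  <!ATTLIST note lang CDATA "en">
]>
<note>made with &sig;</note>"""


def parse(doc):
    """Return (attributes of the root element, concatenated text)."""
    seen = {"attrs": None}
    text = []

    def start(name, attrs):
        # Only the root element exists in this example.
        seen["attrs"] = dict(attrs)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = text.append
    parser.Parse(doc, True)
    return seen["attrs"], "".join(text)


attrs, text = parse(DOC)
print(attrs)  # the "lang" attribute appears although <note> never sets it
print(text)   # the entity reference has been replaced by its value
```

Neither the attribute nor the expanded text is visible in the element itself; a tool that only looks at the bare tree structure would never guess they were there.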
The second issue is with white space handling: XML does not include any default behavior regarding white space. XML parsers must pass all white space to the application, which is responsible for some sensible default processing. The xml:space attribute can give a hint to the application, but as far as I know, it’s not in wide use.
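A small sketch of what this means in practice, again with Python's `xml.parsers.expat` and a made-up document: the indentation that a human reader ignores reaches the application as ordinary character data. The helper name `character_data` is hypothetical.

```python
import xml.parsers.expat


def character_data(doc):
    """Collect every character data chunk the parser reports."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(doc, True)
    return chunks


# The line breaks and indentation are only there for readability.
chunks = character_data("<list>\n  <item>one</item>\n</list>")

# Chunks made up entirely of XML white space (space, tab, CR, LF).
blank = [c for c in chunks if not c.strip(" \t\r\n")]
print(chunks)
print(len(blank), "white-space-only chunk(s) were passed to the application")
```

How the text is split into chunks is up to the parser, so the sketch only looks at whether a chunk is blank, not at how many chunks there are.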
This all led me to develop a set of compatibility rules which precisely describe what one should expect from xmltools, no matter which backend is used (in case we move away from expat, which I plan to do, eventually). The XML subset these rules define is still sufficient to describe any tree-like data set.
The rules will be maintained in the doc/compatxml.text file, in the xmltools source tree. For convenience, I’ve reproduced the current version below.
Do not use doctype declarations for any purpose other than validation. In particular, do not rely on doctype declarations to provide default attributes and do not use internal entities for arbitrary substitutions; if you wanted to save typing, you wouldn’t be using XML. Use of numeric and predefined entities is OK. Use of directly-encoded characters is preferred.
Do not presuppose the XML processor has access to any resource besides your file: do not use external entities.
Do not assume the XML processor is able to handle multiple character sets: all the documents and other data supplied to the program should use the same character set. Also, set your locale appropriately.
Unless xml:space is set, assume all white-space-only text nodes are ignored by the XML processor.
Set xml:space to preserve whenever white space not kept by the previous rule must be preserved. As stated in the first rule, do not rely on doctype declarations to implicitly set this attribute. Please note, however, that the fourth rule works well for most purposes, including processing of HTML pre elements which only contain verbatim text.
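The last two rules can be sketched as a filter over parser events. The following is a minimal illustration under my own assumptions, not the xmltools implementation: the function name `kept_text` is hypothetical, and I assume that xml:space is inherited by child elements and that the value default switches preservation off again, as the XML specification describes.

```python
import xml.parsers.expat

XML_WS = " \t\r\n"  # the four white space characters XML recognizes


def kept_text(doc):
    """Return the text nodes that survive rules four and five."""
    kept = []
    preserve = [False]  # stack: is xml:space="preserve" in effect?
    pending = []        # chunks of the text node being assembled

    def flush():
        # A text node ends at every tag; decide whether to keep it.
        text = "".join(pending)
        pending.clear()
        if text and (preserve[-1] or text.strip(XML_WS)):
            kept.append(text)

    def start(name, attrs):
        flush()  # text before the tag belongs to the parent element
        value = attrs.get("xml:space")
        if value == "preserve":
            preserve.append(True)
        elif value == "default":
            preserve.append(False)
        else:
            preserve.append(preserve[-1])  # inherit from the parent

    def end(name):
        flush()  # text before the end tag belongs to this element
        preserve.pop()

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = pending.append
    parser.Parse(doc, True)
    return kept


doc = '<doc>\n  <p>hello</p>\n  <pre xml:space="preserve">  </pre>\n</doc>'
print(kept_text(doc))
```

Here the indentation between elements is dropped, while the two spaces inside the element carrying xml:space set to preserve are kept. Text that contains anything besides white space is always kept whole, which is why a pre element holding only verbatim text needs no attribute at all.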