GSoC 2009 for NetBSD: xmltools
  1. In search of a good pull-style XML parser
  2. XML as a tree representation
  3. xmlgrep toy benchmarks
  4. Improving performances of xmlgrep
  5. xmlgrep by examples: playing with Atom
  6. More benchmarks: the bitvopt branch
  7. xmlsed preview: writing XML
  8. xmltools update: new I/O layer and further plans
  9. xmltools implementation: automata and backtracking
  10. xmltools: what's done, what's left
  11. xmltools: after the GSoC

xmlsed preview: writing XML

(Nhat Minh Lê (rz0) @ 2009-06-25 12:48:34)

One might be tempted to think that, while parsing can be a complex task involving various competing models, printing is straightforward and should be the easy part of the job. This is not quite true however, as David and I have been in disagreement for some time already over the exact semantics and syntax of a potential XML printing subsystem of xmltools.

This post is all about the design of the printing language in xmltools and the underlying model. I also present at the end my latest hybrid proposal, that I’ve submitted to David for approval earlier today, and which I hope will bring satisfaction to all parties. If you have some opinion on this topic, do not hesitate to e-mail me! (My e-mail is my first name followed by my last name with all spaces replaced by dots; it’s written in French in the footer so you might not be able to decode it.)

The basic problem can be described as follows: we have a source tree, the input document and we want to transform it by adding or removing nodes in some places. Simple? Not quite.

Select & transform

The first design I proposed is based on sed(1): we select some node in the tree using a tree pattern (same as in my xmlgrep) and then transform it using some actions: insert, append, replace, etc. Only these needed to be adapted to XML.

My idea was then to use PYX/ESIS-inspired actions, which also happen to model closely my internal API. These consisted of a one-character prefix indicating the thing to print, followed by contents (depending on the prefix). Prefixes were:

( opening tag
) closing tag
@ attribute
. text
! comment
? junk

This is a very simple approach but David objected that it was unnatural to write XML using some language which was not XML.

Advantages
  • Very simple and straightforward.
  • Completely symmetrical to the way patterns are matched.
Disadvantages
  • Code used to write XML does not look like the output.
  • Not intuitive to read as data.

Decorate & pull

Another approach, which is used most notably by XSLT, is to have a template, which itself is written as XML, in which some elements from the source document are pulled.

Let me say it upfront in case anybody is not clear on this: pulling is not a stream technique, so this model as is cannot be implemented easily (some subsets of XSLT as stream transformations have been researched, but since it is from the start not a stream-oriented technique, it’s complex and IMHO not worth it).

David first suggested using a template as a way of having printing-heavy script actually look like their output. He observed plain XML could be written directly while some special elements, e.g. xmlsed:addattr could be used to print things that cannot be represented directly in XML. Indeed, there is no XML syntax to write just an attribute without the surrounding tag, yet there are cases where we might want to add an attribute to an existing element.

I criticized this idea based on the fact that reserving a namespace was intrusive and that using XML to represent scripting elements was too verbose.

But the main point was that pull-style transformation was never an option.

Advantages
  • XML in script remains in output.
  • Very intuitive and easy to read.
Disadvantages
  • Model is not adapted to streaming.
  • Too verbose.

Towards a hybrid model: XML syntax for select & transform

Given the stream-unfriendly nature of the template-based model, David came up with the idea to represent actions of the select & transform model as XML: XML opening and closing tags could be used to represent the corresponding printing actions (respectively ( and )) but for the rest, we were still stuck with the way-too-verbose XSLT-like <xmlsed:whatever> elements.

I was still very dissatisfied with the verbosity of having to write <xmlsed:whatever/> just to add an attribute or specify where to insert the current node (before, after, inside the plain XML part), but the concept itself, of using a template, was quite nice.

Advantages
  • XML in script remains in output.
  • Very intuitive and easy to read.
  • Still reasonably simple to implement.
Disadvantages
  • Still too verbose.

A hybrid model

Today, after more brainstorming, I finally decided on the following model, and unless David is really against it (which I seriously doubt), I think it will make it into xmlsed one way or the other: the idea is the same as that of the previous attempt, use a template which actually describes actions inferred from the structure of XML itself.

Only, this time, we do not use special elements to indicate actions anymore, everything is inferred from the structure. The trick to do this is to have a virtual element represent the current node; it can have any name (it will be user-settable), let’s call it : by default. We can place it wherever we want in the template. And to add an attribute (resp. a child), we just have to add an attribute (resp. a child) to that virtual element. The syntax remains reasonably compact (even optimally compact given we’re using XML as the basis for our template) and the script now very closely models the output, even more than XSLT scripts!

The idea was inspired from xargs(1) and its -I option. Again, Unix saved the day… well, not quite, but still, I think it is a nice and practical compromise. :)