xmltools update: new I/O layer and further plans
It has been nearly two weeks since my last update on my plans for xmlsed. Since that time, David and I have come to an agreement on the (hypothetical) syntax and my hybrid model has been retained.
Rewriting the I/O module
Although I now have a fair idea of what xmlsed should look like and how to implement it, some elements introduced after the recent discussions with my mentor required changes to the existing code.
In particular, the new template syntax required an XML parser able to handle partial XML documents. Besides, I wanted some syntactic sugar on top of XML to make writing templates easier. Expat being a well-behaved XML parser is quite strict about the syntax and there is no easy way to work around that. Though I did spend some days trying, at first, I eventually gave up, as it proved hard to write, let alone maintain in the future.
At the same time, I began to realize the original I/O abstraction I designed was showing its limits. I initially wanted to be able to read and write multiple tree representation formats, but in the end, the need to fully support XML, with all its specifics (doctypes, PIs, etc.), has convinced me to drop the idea and focus on XML alone.
All these reasons led me to rewrite the I/O layer (almost)
completely. This work is being committed to yet another temporary
Maybe the most visible change is that multi-backend support was dropped and the I/O module now consists of only two drivers: the Expat-based parser (the strict parser) and a home-made loose parser.
All tools now support two modes: the strict mode (
-s flag) and the
default loose mode.
In the strict mode, compliance with the XML standard is a priority: all extensions are disabled, including multi-root and partial documents (which was implemented with Expat as an ugly kludge before, so it was removed), the XML prolog is honored (the specified encoding is used and entity declarations are parsed), and stricter rules are enforced (e.g. in names).
In the loose mode, extensions are supported and the XML prolog is ignored (instead, we use the system locale).
The encoding issue
Although I’ve just stated how encodings will be handled, at the moment, I have not written any code to suppor that yet. Actually, it’s not a simple matter.
First of all, the XML standard mandates support of Unicode. For one
thing, XML character entities (
&xxxx;) are references to character
codes from the Unicode tables. This makes support for Unicode pretty
In order to handle this, the Expat people have chosen to serve all data as UTF-8 (or UTF-16, depending on the build-time configuration), no matter what input encoding is used. This should have given you a hint: Expat has its own encoding conversion engine.
Now, even though XML documents can specify an encoding, and so it would be alright to always use UTF-8, that is not an acceptable solution in the real world: Unix users expect their programs to comply to the current locale. But Expat does not care one way or the other about, or integrate with, the locale system.
Fortunately, NetBSD has iconv(3) (though neither FreeBSD nor OpenBSD does, apparently, which will be a problem in making my programs portable; there seems to be an ongoing GSoC effort to port NetBSD libiconv to FreeBSD though), so I plan to run everything through that on input and again on output. Sounds like a waste of resources? Well, don’t blame me.
However, that’s not all: there is also the loose parser. Since this one was written by me, and I had no reason to force UTF-8 everywhere, it does no conversion and assumes all sources are in the current locale. The only problem will be for character entities (not currently implemented), but I intend to localize the Unicode-to-locale transformation to these, since it is costly.
Remember the push-style vs event-driven parser debate?
Well, it wasn’t really a debate, but maybe some of you remember that I posted some weeks ago a ticket on this blog about how I grew dissatisfied with the rigid event-driven behavior of Expat and wanted a push-style parser.
I took the occasion this time to make my wish come true (almost) and the new I/O design is mostly push-style. I say mostly because I’ve made some compromises in order not to impair performances.
At first, I thought about having the parser run on a fragment of input text and build a queue of events to be fed to the application one by one, on a on-demand basis. But then I realized I could just use the internal tree directly as my event queue, and have the application read that. Since we have support for look-ahead in the matching engine (and hence need to keep a partial internal tree in any case) and most tools will use that, this effectively moved the tree building code from the matcher to the I/O module, at the same time eliminating direct event responses (a bit less than 500 lines of code). This last point needs some clarification: sure we do use a little more memory (but honestly that just doesn’t show) since we build the tree first and only then process it, but this is limited to how many nodes can be represented in one read buffer, which means it’s mostly insignificant. But we have gained what I think is far more important: an unified implementation of the matcher (which is the most sensitive part).
At the moment, the new parser does not support things such as entity declarations and I am still pondering whether to write code for that or not. In any case, it would be a good idea to have a xmlcat(1) utility which fully supports XML, including external entities, with the ability to fetch and include external documents (fetch(3)?) and substitute references. But let’s leave this discussion for another time.
As for the locale support, I think it will come after xmlsed is somewhat ready (i.e. as xmlgrep is right now). So for now, some more testing needs to be done on the new I/O components, and when this is done, I will resume work on xmlsed proper.