GSoC 2009 for NetBSD: xmltools
  1. In search of a good pull-style XML parser
  2. XML as a tree representation
  3. xmlgrep toy benchmarks
  4. Improving performances of xmlgrep
  5. xmlgrep by examples: playing with Atom
  6. More benchmarks: the bitvopt branch
  7. xmlsed preview: writing XML
  8. xmltools update: new I/O layer and further plans
  9. xmltools implementation: automata and backtracking
  10. xmltools: what's done, what's left
  11. xmltools: after the GSoC

xmlgrep by examples: playing with Atom

(Nhat Minh Lê (rz0) @ 2009-06-17 14:23:11)

Edited 22.06. Incompatible syntax changes and more examples.

Since Matthew Sporleder on tech-userlevel has (implicitly) suggested that xmlgrep could be used to dissect Atom feeds, but was a bit lost as to how exactly to do it, I thought I’d post a little demo of a console session playing with an Atom file and xmlgrep (as well as some other command-line tools).

First, let’s fetch the feed:

$ ftp http://mspo.com/blog/atom.xml
[...]

xmlgrep indents its output as a side effect, which we can use to make the XML file human-readable:

$ xmlgrep '*' atom.xml | more

List all posts in the NetBSD category with their IDs:

$ xmlgrep -x 'entry[category/@term=NetBSD]/(title|id)/.' atom.xml
tag:blogger.com,1999:blog-6347225410141611306.post-1131641169617411392
NetBSD quotas - quickstart
tag:blogger.com,1999:blog-6347225410141611306.post-1939815769827620970
NetBSD device drivers - easier than you might think
[...]

Those of you not yet familiar with the syntax might have some trouble understanding this. The previous pattern can be read as "select a text child of an id or title element, itself a child of an entry element that contains a category element whose term attribute is equal to NetBSD." Step by step, you should notice that a[b] is read "a such that b", | stands for "or", / for "child of", @ for "attribute", . for "text", and the parentheses are used for grouping.

Now, let’s select a post by ID:

$ xmlgrep -x 'entry[id/.~"post-1939"]#' atom.xml

Or, select a post by title and view its contents using w3m:

$ xmlgrep -x 'entry[title/.~"device drivers"]/content/.' atom.xml |
> sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g' |
> w3m -T text/html
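
The order of the sed expressions in this pipeline matters: &amp; has to be decoded last, otherwise a doubly escaped sequence such as &amp;lt; would end up as a bare < instead of &lt;. Here is a minimal sketch to check this without the feed. The helper name unescape_xml and the sample string are made up for illustration:

```shell
# Hypothetical helper wrapping the sed call from the pipeline above.
# It undoes exactly one level of XML escaping; &amp; is handled last.
unescape_xml() {
	sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g'
}

# Made-up sample input, not taken from the feed.
printf '%s\n' '&lt;p&gt;a &amp;lt; b&lt;/p&gt;' | unescape_xml
# prints: <p>a &lt; b</p>
```

The inner &lt; survives, so the HTML handed to w3m still renders "a < b" correctly.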

As a side note, I should mention that up to now we have used subpatterns quite a lot. This is because the Atom feed specification does not force an order (or does it?) on the children of entry elements. With more precise knowledge of the order of elements relative to each other, we could have optimized the pattern to use % and %% where possible. Subpatterns are costly, but for data sets this size, we probably don’t care much.

Let’s print all entry titles which date from March 2009 using the fact that we know the updated element comes before the title one:

$ xmlgrep -x 'entry/updated[.~"^2009-03"]%%title/.' atom.xml

A friend of mine told me it would be useful to have arithmetic predicates. I think they will feature in xmltools sooner or later, but even without them, it is still possible to do some simple statistics, by combining the results with awk(1), for example. The following one-liner counts the number of posts that are no older than March:

$ xmlgrep -x 'entry/updated/.' atom.xml | awk -F - '$2>=3' | wc -l
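
Note that the awk filter in this one-liner looks only at the month field, so it gives the intended count only if all entries are from the same year. Here is a minimal sketch of a year-aware variant, run on made-up timestamps rather than the real feed. The function name count_since and the sample dates are assumptions for illustration:

```shell
# Hypothetical helper: read timestamps (one per line, YYYY-MM-DD...) and
# count those that fall on or after the given year and month.
count_since() {
	awk -F - -v y="$1" -v m="$2" \
	    '$1 > y || ($1 == y && $2 >= m) { n++ } END { print n + 0 }'
}

# Made-up sample data in the format of Atom's updated element.
printf '%s\n' \
    '2008-11-02T10:00:00Z' \
    '2009-02-14T08:30:00Z' \
    '2009-03-01T12:00:00Z' \
    '2009-06-17T14:23:11Z' |
count_since 2009 3
# prints: 2
```

With the real feed, the output of xmlgrep -x 'entry/updated/.' atom.xml would be piped into count_since in place of the printf.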

That’s it; I hope this will help people who want to get started with xmlgrep. If you have other good examples you’d like me to elaborate on, do not hesitate to send me a mail!