xmlgrep by examples: playing with Atom
Edited 22.06. Incompatible syntax changes and more examples.
Since Matthew Sporleder on tech-userlevel has (implicitly) suggested that xmlgrep could be used to dissect Atom feeds, but was a bit lost as to how exactly to do it, I thought I’d post a little demo of a console session playing with an Atom file and xmlgrep (as well as some other command-line tools).
First, let’s fetch the feed:
$ ftp http://mspo.com/blog/atom.xml
[...]
xmlgrep re-indents its output as a side effect, so we can use it to make the XML file human-readable:
$ xmlgrep '*' atom.xml | more
List all posts in the NetBSD category with their IDs:
$ xmlgrep -x 'entry[category/@term=NetBSD]/(title|id)/.' atom.xml
tag:blogger.com,1999:blog-6347225410141611306.post-1131641169617411392
NetBSD quotas - quickstart
tag:blogger.com,1999:blog-6347225410141611306.post-1939815769827620970
NetBSD device drivers - easier than you might think
[...]
Those of you not yet familiar with the syntax might have some trouble
understanding. The previous pattern could be read "select a text child
of an id or title element, itself a child of an entry element, which
contains a category element which has an attribute child term equal to
NetBSD." Step by step, you should notice that
a[b] is read "a such that b",
| stands for "or",
/ for "child of",
@ for "attribute",
. for "text", and the parentheses are used for grouping purposes.
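To make that reading concrete, here is a small sketch that writes a hand-made Atom fragment you can try the pattern on without fetching anything. Everything in it is an assumption for illustration: the function name make_sample_atom, the file name sample-atom.xml, and all IDs, titles and categories are invented and do not come from the real feed.

```shell
# Write a tiny hand-made Atom fragment to the path given as $1.
# All IDs, titles and categories below are invented for illustration.
make_sample_atom() {
    cat > "$1" <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>tag:example.org,2009:post-1</id>
    <updated>2009-03-02T12:30:00Z</updated>
    <category term="NetBSD"/>
    <title>A NetBSD post</title>
  </entry>
  <entry>
    <id>tag:example.org,2009:post-2</id>
    <updated>2009-01-10T08:00:00Z</updated>
    <category term="misc"/>
    <title>Something else</title>
  </entry>
</feed>
EOF
}

make_sample_atom sample-atom.xml
```

Going by the reading above, running the NetBSD pattern against sample-atom.xml should select the id and title text of the first entry only, since the second entry has no category whose term is NetBSD.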
Now, let’s select a post by ID:
$ xmlgrep -x 'entry[id/.~"post-1939"]#' atom.xml
Or, select a post by title and view its contents using w3m:
$ xmlgrep -x 'entry[title/.~"device drivers"]/content/.' atom.xml |
> sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g' |
> w3m -T text/html
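The sed step in that pipeline undoes the entity escaping of the post body before handing it to w3m. As a minimal sketch of just that step, the snippet below wraps it in a function (unescape_entities is a name I made up) and feeds it two invented sample strings. It only handles the three entities shown; the ampersand is unescaped last so that a doubly escaped entity is not decoded twice.

```shell
# Unescape the three entities handled in the pipeline above.
# &amp; goes last: otherwise "&amp;lt;" would first become "&lt;"
# and then wrongly turn into "<".
unescape_entities() {
    sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g'
}

# Hypothetical sample input, not taken from the real feed.
printf '%s\n' '&lt;p&gt;NetBSD &amp; friends&lt;/p&gt;' | unescape_entities
# prints: <p>NetBSD & friends</p>

printf '%s\n' '&amp;lt;tag&amp;gt;' | unescape_entities
# prints: &lt;tag&gt;
```

In the replacement part of a sed substitution a bare & means "the matched text", which is why the last expression writes \& to get a literal ampersand.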
As a side note, I should mention that up to now we have used
subpatterns quite a lot. This is because the Atom feed specification
does not force an order (or does it?) on the children of entry
elements. With more precise knowledge of the order of elements
relative to each other, we could have optimized the pattern to use
%% where possible. Subpatterns are costly, but for data sets
of this size, we probably don’t care much.
Let’s print all entry titles which date from March 2009 using the fact
that we know the updated element comes before the title element:
$ xmlgrep -x 'entry/updated[.~"^2009-03"]%%title/.' atom.xml
A friend of mine told me it would be useful to have arithmetic predicates. I think they will feature in xmltools sooner or later, but even without them, it is still possible to do some simple statistics by combining the results with awk(1), for example. The following one-liner counts the number of posts that are no older than March:
$ xmlgrep -x 'entry/updated/.' atom.xml | awk -F - '$2>=3' | wc -l
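The awk filter only looks at the month field, so it gives the intended count only when all entries are from the same year. The sketch below shows this on a few invented timestamps standing in for xmlgrep output, next to a year-aware variant that relies on a plain string comparison. The function names and the sample dates are my own assumptions, and the variant assumes every timestamp starts with the same YYYY-MM-DD layout.

```shell
# Month-only filter, as in the one-liner above: keeps lines whose second
# dash-separated field (the month) is 3 or more, whatever the year.
count_from_march() {
    awk -F - '$2 >= 3' | wc -l | tr -d ' '
}

# Year-aware variant: timestamps that start with YYYY-MM sort
# lexicographically, so a string comparison against "2009-03" is enough.
count_since_2009_03() {
    awk '$0 >= "2009-03"' | wc -l | tr -d ' '
}

# Hypothetical timestamps, one per line, standing in for xmlgrep output.
sample_dates() {
    printf '%s\n' \
        '2008-12-25T18:00:00Z' \
        '2009-01-10T08:00:00Z' \
        '2009-03-02T12:30:00Z' \
        '2009-04-21T09:15:00Z'
}

sample_dates | count_from_march      # prints 3 (the 2008-12 entry is counted too)
sample_dates | count_since_2009_03   # prints 2
```

The tr call is only there because some wc implementations pad the count with spaces.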
That’s it; I hope this will help people who want to get started with xmlgrep. If you have other good examples you’d like me to elaborate on, do not hesitate to send me a mail!