my recent reads..

Extracting XPath refs from an XML document


I was inspired by a recent post in the XMLDB Forum to look at how to extract a complete list of XPaths and their associated text node values from an arbitrary XML file. I looked into an XSLT approach, which I'll describe here.

Say we have an XML file like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<Library>
  <Books>
    <Book>
      <Author>
        <Last>Perry</Last>
        <First>Anne</First>
      </Author>
      <Title>Long Spoon Lane</Title>
    </Book>
  </Books>
  <Members>
    <Member>
      <Name>Paul</Name>
      <Joined>2005-11-01</Joined>
    </Member>
  </Members>
</Library>

And our objective is to produce a listing like this:
/Library/Books/Book/Author/Last():Perry
/Library/Books/Book/Author/First():Anne
/Library/Books/Book/Title():Long Spoon Lane
/Library/Members/Member/Name():Paul
/Library/Members/Member/Joined():2005-11-01

After some investigation, and with reference to sites like Path Tracing and the XSLT 1.0 spec, I arrived at what I think is the simplest XSLT possible:
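<?xml version="1.0" encoding="windows-1252" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Drop whitespace-only text nodes from the source tree -->
  <xsl:strip-space elements="*"/>

  <xsl:template match="text()">
    <!-- Build the path: one "/name" step per enclosing element, root first -->
    <xsl:for-each select="ancestor-or-self::*">
      <xsl:text>/</xsl:text>
      <xsl:value-of select="name()"/>
    </xsl:for-each>

    <!-- Append the separator, the text value and a newline -->
    <xsl:text>():</xsl:text>
    <xsl:value-of select="."/>
    <xsl:text>&#xA;</xsl:text>
    <!-- No xsl:apply-templates needed here: text nodes have no children -->
  </xsl:template>

</xsl:stylesheet>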

What is going on here?

Well, firstly, note that we use xsl:strip-space and then match on all text() nodes. Because strip-space removes whitespace-only text nodes from the source tree, only text nodes with real content ever reach the template.
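
As an aside, here is a minimal sketch of another way to skip the whitespace, assuming you would rather leave the source tree unstripped. It is a variation for illustration only, not the stylesheet above: the predicate on the first template and the empty second template are my additions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Matches only text nodes with some non-whitespace content.
       The predicate gives this pattern a higher default priority (0.5)
       than the bare text() pattern below (-0.5). -->
  <xsl:template match="text()[normalize-space()]">
    <xsl:for-each select="ancestor-or-self::*">
      <xsl:text>/</xsl:text>
      <xsl:value-of select="name()"/>
    </xsl:for-each>
    <xsl:text>():</xsl:text>
    <xsl:value-of select="."/>
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>

  <!-- Swallows the remaining (whitespace-only) text nodes, which the
       built-in text template would otherwise copy to the output. -->
  <xsl:template match="text()"/>
</xsl:stylesheet>
```

For the sample document this should produce the same listing. The two approaches can differ on mixed content, where this version keeps any leading or trailing whitespace inside a matched text node.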

The magic that generates the XPath is the "for-each" over all "ancestor-or-self" elements. It visits each enclosing element in document order, from the root down, and writes a "/" followed by the element name at each step. Then we simply add the text value on the end.
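
One limitation of the generated paths is that repeated siblings (two Book elements, say) produce identical paths. The following is a hedged sketch of how the same for-each could emit a positional predicate at each step, assuming a 1-based index among same-named siblings is what you want. The count() expression is my addition, not part of the original stylesheet.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="text()">
    <xsl:for-each select="ancestor-or-self::*">
      <xsl:text>/</xsl:text>
      <xsl:value-of select="name()"/>
      <!-- Position among preceding siblings with the same name, plus one.
           current() is the element being visited by this for-each. -->
      <xsl:text>[</xsl:text>
      <xsl:value-of select="count(preceding-sibling::*[name() = name(current())]) + 1"/>
      <xsl:text>]</xsl:text>
    </xsl:for-each>
    <xsl:text>():</xsl:text>
    <xsl:value-of select="."/>
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

With the sample document, the first line would become /Library[1]/Books[1]/Book[1]/Author[1]/Last[1]():Perry. Comparing on name() is a simplification that ignores namespaces.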

A variation on the XSL template that produces an XML structure instead of text is as follows. It differs only in how the output is formatted:
<?xml version="1.0" encoding="windows-1252" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml"/>

  <!-- Drop whitespace-only text nodes from the source tree -->
  <xsl:strip-space elements="*"/>

  <!-- Wrap all generated items in a single root element -->
  <xsl:template match="/">
    <items>
      <xsl:apply-templates/>
    </items>
  </xsl:template>

  <!-- One item per text node: its path and its value -->
  <xsl:template match="text()">
    <item>
      <path>
        <xsl:for-each select="ancestor-or-self::*">
          <xsl:text>/</xsl:text>
          <xsl:value-of select="name()"/>
        </xsl:for-each>
      </path>
      <value>
        <!-- No xsl:apply-templates needed here: text nodes have no children -->
        <xsl:value-of select="."/>
      </value>
    </item>
  </xsl:template>

</xsl:stylesheet>