XPATH

 

  1. The XSLT/XPath Data Model
  2. Templates and Contexts
  3. Basic XPath Addressing
  4. Basic XPath Expressions
  5. Combining Index Addresses
  6. Wildcards
  7. Extended-Path Addressing
  8. XPath Data Types and Operators
  9. String-Value of an Element
  10. XPath Functions


 

Overview

XPath expressions specify a pattern that selects a set of XML nodes, which are then used by XSLT.

XPointer adds mechanisms for defining a point or a range, so that XPath expressions can be used for addressing.

The nodes in an XPath expression refer to more than just elements. They also refer to text and attributes, among other things. In fact, the XPath specification defines an abstract document model with seven different kinds of nodes: root, element, text, attribute, comment, processing instruction, and namespace.

The root element of the XML data is modeled by an element node. The XPath root node contains the document's root element, as well as other information relating to the document.

 

The XSLT/XPath Data Model

Like the DOM, the XSLT/XPath data model consists of a tree containing a variety of nodes. Under any given element node, there are text nodes, attribute nodes, element nodes, comment nodes, and processing instruction nodes.

In this abstract model, syntactic distinctions disappear, and you are left with a normalized view of the data. In a text node, for example, it makes no difference whether the text was defined in a CDATA section, or if it included entity references. The text node will consist of normalized data, as it exists after all parsing is complete. So the text will contain a < character, regardless of whether an entity reference like &lt; or a CDATA section was used to include it. (Similarly, the text will contain an & character, regardless of whether it was delivered using &amp; or it was in a CDATA section.)
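As a rough illustration of this normalization, the following Python sketch parses a made-up document with the standard-library xml.etree.ElementTree module. ElementTree is used here only as a convenient stand-in for the abstract data model; the element names a and b are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the same text written once with entity
# references and once inside a CDATA section.
doc = ET.fromstring(
    "<doc>"
    "<a>x &lt; y &amp; z</a>"
    "<b><![CDATA[x < y & z]]></b>"
    "</doc>"
)

text_a = doc.find("a").text
text_b = doc.find("b").text

# After parsing, the syntactic difference is gone.
print(text_a == text_b)
```

Both variables should hold the same normalized string, with a literal < and & in it.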

In this section of the tutorial, we'll deal mostly with element nodes and text nodes. For the other addressing mechanisms, see the XPath Specification.

 

Templates and Contexts

An XSLT template is a set of formatting instructions that apply to the nodes selected by an XPath expression. In a stylesheet, an XSLT template would look something like this:

<xsl:template match="//list">
    ...
</xsl:template>

The expression...

//list

...selects the set of list nodes from the input stream. Additional instructions within the template tell the system what to do with them.

The set of nodes selected by such an expression defines the context in which other expressions in the template are evaluated. That context can be considered as the whole set -- for example, when determining the number of the nodes it contains.

The context can also be considered as a single member of the set, as each member is processed one by one. For example, inside of the list-processing template, the expression @type refers to the type attribute of the current list node. (Similarly, the expression @* refers to all of the attributes for the current list element.)
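The two views of the context can be sketched in Python with xml.etree.ElementTree. This is only an analogy for what an XSLT processor does, and the sample document (including the id attribute) is invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document.
doc = ET.fromstring(
    "<doc>"
    "<list type='ordered' id='a'/>"
    "<section><list type='unordered'/></section>"
    "</doc>"
)

# Roughly what //list selects: every list element in the document.
lists = doc.findall(".//list")

# The context viewed as a whole set.
set_size = len(lists)

# The context viewed one member at a time, as a template body sees it.
types = [node.get("type") for node in lists]         # like @type
all_attrs = [sorted(node.attrib) for node in lists]  # like @*
```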

 

Basic XPath Addressing

An XML document is a tree-structured (hierarchical) collection of nodes. As with a hierarchical directory structure, it is useful to specify a path that points to a particular node in the hierarchy. (Hence the name of the specification: XPath.) In fact, much of the notation of directory paths is carried over intact:

The forward slash (/) is used as a path separator. Paths that start with a forward slash, indicating the root directory, are absolute paths. A relative path from a given location starts with anything else.

A double period (..) indicates the parent of the current node.

A single period (.) indicates the current node.

For example, in an XHTML document (an XML document that looks like HTML, but which is well-formed according to XML rules), the path /h1/h2 would indicate an h2 element under an h1. (Recall that in XML, element names are case-sensitive, so this kind of specification works much better in XHTML than it would in plain HTML, because HTML is case-insensitive.)

In a pattern-matching specification like XSLT, the specification /h1/h2 selects all h2 elements that lie under an h1 element. To select a specific h2 element, square brackets [] are used for indexing (like those used for arrays). The path...

/h1[4]/h2[5]

...would therefore select the fifth h2 element under the fourth h1 element.

In XHTML, all element names are in lowercase.

A name specified in an XPath expression refers to an element. For example, "h1" in /h1/h2 refers to an h1 element. To refer to an attribute, you prefix the attribute name with an @ sign. For example, @type refers to the type attribute of an element. Assuming you have an XML document with list elements, for example, the expression...

list/@type

...selects the type attribute of the list element.

Since the expression does not begin with /, the reference specifies a list node relative to the current context--whatever position in the document that happens to be.
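Here is a minimal sketch of these addressing forms using Python's xml.etree.ElementTree, which implements a limited subset of XPath. Two assumptions to keep in mind: ElementTree evaluates paths relative to a context element (body in this made-up document), and it has no @type path step, so the attribute is read from the selected element instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; body is the context element for every path.
body = ET.fromstring(
    "<body>"
    "<h1><h2>one-a</h2><h2>one-b</h2></h1>"
    "<h1><h2>two-a</h2><h2>two-b</h2></h1>"
    "<list type='ordered'/>"
    "</body>"
)

# Relative path: every h2 under every h1 child of the context element.
all_h2 = [e.text for e in body.findall("h1/h2")]

# Indexing starts at 1: the second h2 under the second h1.
picked = [e.text for e in body.findall("h1[2]/h2[2]")]

# ".." steps back up to the parent of the selected node.
parent_of_picked = body.findall("h1[2]/h2[2]/..")

# Stand-in for list/@type: select the element, then read its attribute.
list_type = body.find("list").get("type")
```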

 

Basic XPath Expressions

The full range of XPath expressions takes advantage of the wildcards, operators, and functions that XPath defines.

The expression...

@type="unordered"

...specifies an attribute named type whose value is "unordered". An expression like...

list/@type

...specifies the type attribute of a list element.

You can combine those two notations. In XPath, the square-bracket notation ([]) normally associated with indexing is extended to specify selection criteria. So the expression...

list[@type="unordered"]

...selects all list elements whose type value is "unordered".

Similar expressions exist for elements, where each element has an associated string-value. Suppose you model what's going on in your organization with an XML structure that consists of PROJECT elements and ACTIVITY elements, each of which has a text string with the project name, multiple PERSON elements to list the people involved, and, optionally, a STATUS element that records the project status. Here are some more examples that use the extended square-bracket notation:

/PROJECT[.="MyProject"] selects a PROJECT named "MyProject".
/PROJECT[STATUS] selects all projects that have a STATUS child element.
/PROJECT[STATUS="Critical"] selects all projects that have a STATUS child element with the string-value "Critical".
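The selection criteria above can be tried out with xml.etree.ElementTree, which supports these predicate forms. The sketch below is only illustrative: the ORG wrapper element and the id attributes are hypothetical, added so that several PROJECT elements can sit side by side and be told apart, and the paths are relative to that wrapper.

```python
import xml.etree.ElementTree as ET

# Hypothetical data; ORG is an invented wrapper used as the context element.
org = ET.fromstring(
    "<ORG>"
    "<list type='unordered'/>"
    "<list type='ordered'/>"
    "<PROJECT id='p1'><PERSON>Fred</PERSON><STATUS>Critical</STATUS></PROJECT>"
    "<PROJECT id='p2'><PERSON>Ann</PERSON><STATUS>OK</STATUS></PROJECT>"
    "<PROJECT id='p3'><PERSON>Fred</PERSON></PROJECT>"
    "</ORG>"
)

# list elements whose type attribute is "unordered".
unordered = org.findall("list[@type='unordered']")

# Projects that have a STATUS child at all.
with_status = [p.get("id") for p in org.findall("PROJECT[STATUS]")]

# Projects whose STATUS child has the string-value "Critical".
critical = [p.get("id") for p in org.findall("PROJECT[STATUS='Critical']")]
```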

 

Combining Index Addresses

The XPath specification defines quite a few addressing mechanisms, and they can be combined in many different ways. As a result, XPath delivers a lot of expressive power for a relatively simple specification. This section illustrates two more interesting combinations:

list[@type="ordered"][3] selects all list elements of type "ordered", and returns the third.
list[3][@type="ordered"] selects the third list element, but only if it is of type "ordered".
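The order of the two predicates matters, which is easy to miss. The sketch below does not run an XPath engine; it emulates each predicate with plain Python list operations (the helpers nth and ordered_only are hypothetical names) so that the difference between the two orders is visible on a small made-up document.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with four list elements.
doc = ET.fromstring(
    "<doc>"
    "<list type='ordered' id='a'/>"
    "<list type='unordered' id='b'/>"
    "<list type='ordered' id='c'/>"
    "<list type='ordered' id='d'/>"
    "</doc>"
)
lists = doc.findall("list")

def nth(nodes, n):
    """Keep only the n-th node (1-based), like the predicate [n]."""
    return nodes[n - 1:n]

def ordered_only(nodes):
    """Keep nodes whose type is 'ordered', like [@type="ordered"]."""
    return [node for node in nodes if node.get("type") == "ordered"]

# list[@type="ordered"][3]: filter first, then take the third survivor.
filter_then_index = [n.get("id") for n in nth(ordered_only(lists), 3)]

# list[3][@type="ordered"]: take the third list, then test its type.
index_then_filter = [n.get("id") for n in ordered_only(nth(lists, 3))]
```

With this data the first form picks the element with id d, while the second picks c.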

Many more combinations of address operators are listed in section 2.5 of the XPath Specification. This is arguably the most useful section of the spec for defining an XSLT transform.

 

Wildcards

By definition, an unqualified XPath expression selects a set of XML nodes that match the specified pattern. For example, /HEAD matches all top-level HEAD entries, while /HEAD[1] matches only the first. Table 1 lists the wildcards that can be used in XPath expressions to broaden the scope of the pattern matching.

Wildcard Meaning
* Matches any element node (not attributes or text).
node() Matches any node of any kind: element node, text node, attribute node, processing instruction node, namespace node, or comment node.
@* Matches any attribute node.

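In the project database example, for instance, /*/PERSON[.="Fred"] matches any PERSON element with the string-value "Fred" that is a child of a top-level PROJECT or ACTIVITY element. To match the PROJECT or ACTIVITY elements themselves, move the test into a predicate on the wildcard: /*[PERSON="Fred"].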
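A wildcard step can either be passed through on the way to a PERSON element or be the thing that is selected. The following sketch shows both with xml.etree.ElementTree, assuming Python 3.7 or later for the [.='text'] predicate. The ORG wrapper and the name attributes are hypothetical, and the paths are relative to the wrapper rather than absolute.

```python
import xml.etree.ElementTree as ET

# Hypothetical data; ORG is an invented wrapper used as the context element.
org = ET.fromstring(
    "<ORG>"
    "<PROJECT name='p1'><PERSON>Fred</PERSON></PROJECT>"
    "<ACTIVITY name='a1'><PERSON>Fred</PERSON><PERSON>Ann</PERSON></ACTIVITY>"
    "<PROJECT name='p2'><PERSON>Ann</PERSON></PROJECT>"
    "</ORG>"
)

# */PERSON[.='Fred']: the PERSON elements themselves, under any child element.
freds = org.findall("*/PERSON[.='Fred']")

# *[PERSON='Fred']: the PROJECT or ACTIVITY elements that contain such a PERSON.
owners = [(e.tag, e.get("name")) for e in org.findall("*[PERSON='Fred']")]
```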

 

Extended-Path Addressing

So far, all of the patterns we've seen have specified an exact number of levels in the hierarchy. For example, /HEAD specifies any HEAD element at the first level in the hierarchy, while /*/* specifies any element at the second level in the hierarchy. To specify an indeterminate level in the hierarchy, use a double forward slash (//). For example, the XPath expression...

//PARA

...selects all paragraph elements in a document, wherever they may be found.

The // pattern can also be used within a path. So the expression...

/HEAD/list//PARA

...indicates all paragraph elements in a subtree that begins from /HEAD/list.
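To see the difference between a document-wide search and a search within a subtree, here is a small sketch with xml.etree.ElementTree. The document is made up, and because ElementTree paths are relative to a context element, //PARA is written as .//PARA from the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with PARA elements at several depths.
root = ET.fromstring(
    "<doc>"
    "<HEAD>"
    "<list><item><PARA>a</PARA></item><PARA>b</PARA></list>"
    "<PARA>c</PARA>"
    "</HEAD>"
    "<PARA>d</PARA>"
    "</doc>"
)

# Every PARA anywhere below the context element, in document order.
everywhere = [p.text for p in root.findall(".//PARA")]

# Only the PARA elements in the subtree that begins at HEAD/list.
in_list = [p.text for p in root.findall("HEAD/list//PARA")]
```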

 

XPath Data Types and Operators

XPath expressions yield either a set of nodes, a string, a boolean (true/false value), or a number. Table 2 lists the operators that can be used in an XPath expression.

Operator Meaning
| Alternative. For example, PARA|list selects all PARA and list elements.
or, and Returns the or/and of two boolean values.
=, != Equal or not equal, for booleans, strings, and numbers.
<, >, <=, >= Less than, greater than, less than or equal to, greater than or equal to--for numbers.
+, -, *, div, mod Add, subtract, multiply, floating-point divide, and modulus (remainder) operations (e.g. 6 mod 4 = 2)

Expressions can be grouped in parentheses, so you don't have to worry about operator precedence. ("Operator precedence" is a term that answers the question, "If you specify a + b * c, does that mean (a+b) * c or a + (b*c)?") Apart from |, which binds most tightly, the rows of the table run roughly from lowest precedence to highest, and within the last row *, div, and mod bind more tightly than + and -.
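The div and mod operators behave a little differently from their nearest Python counterparts, which the following sketch tries to make concrete. The helper names xpath_div and xpath_mod are hypothetical, and the sketch does not cover division by zero, where XPath yields an infinity or NaN and Python raises an exception.

```python
import math

def xpath_div(a, b):
    """Floating-point division, as XPath's div operator performs."""
    return a / b

def xpath_mod(a, b):
    """Remainder of a truncating division; the result takes the sign of a."""
    return math.fmod(a, b)

print(xpath_div(7, 2))    # 3.5
print(xpath_mod(6, 4))    # 2.0
print(xpath_mod(-5, 2))   # -1.0, whereas Python's -5 % 2 gives 1
```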

 

String-Value of an Element

The string-value of an element is the concatenation of all descendant text nodes, no matter how deep. So, for a "mixed-model" XML data element like this:

<PARA>This paragraph contains a <B>bold</B> word</PARA>

The string-value of <PARA> is "This paragraph contains a bold word". In particular, note that <B> is a child of <PARA> and that the text contained in all children is concatenated to form the string-value.
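The same concatenation can be reproduced in Python, as a quick check on the definition. This sketch uses xml.etree.ElementTree, whose itertext() method walks the descendant text in document order; it approximates the string-value and is not an XPath evaluation.

```python
import xml.etree.ElementTree as ET

para = ET.fromstring("<PARA>This paragraph contains a <B>bold</B> word</PARA>")

# Concatenate every piece of descendant text in document order.
string_value = "".join(para.itertext())

# For comparison: .text holds only the text before the first child element.
leading_text = para.text
```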

Also, it is worth understanding that the text in the abstract data model defined by XPath is fully normalized. So whether the XML structure contains the entity reference &lt; or "<" in a CDATA section, the element's string-value will contain the "<" character. Therefore, when generating HTML or XML with an XSLT stylesheet, occurrences of "<" will have to be converted to &lt; or enclosed in a CDATA section. Similarly, occurrences of "&" will need to be converted to &amp;.

 

XPath Functions

You can use XPath functions to select a collection of nodes in the same way that you would use an element specification like those you have already seen. Other functions return a string, a number, or a boolean value. For example, the expression...

/PROJECT/text()

...selects the text nodes that are direct children of PROJECT nodes. (Strictly speaking, the XPath specification classifies text() as a node test rather than a function, although it is written like a function call.)

Many functions depend on the current context. In the example above, the context for each evaluation of text() is the PROJECT node that is currently selected.

There are many XPath functions--too many to describe in detail here. This section provides a quick listing that shows the available XPath functions, along with a summary of what they do.

 

Note: Skim the list of functions to get an idea of what's there. For more information, see Section 4 of the XPath Specification.

 

Node-set functions

Many XPath expressions select a set of nodes. In essence, they return a node-set. One function does that, too.

id(...) returns the node with the specified id.

(Elements only have an ID when the document has a DTD, which specifies which attribute has the ID type.)

 

Positional functions

These functions return positionally-based numeric values.

last() returns the index of the last element. For example: /HEAD[last()] selects the last HEAD element.
position() returns the index position. For example: /HEAD[position() <= 5] selects the first five HEAD elements.
count(...) returns the number of nodes in the specified node-set. For example: /HEAD[count(HEAD)=0] selects all HEAD elements that have no subheads.
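The three positional examples can be approximated in Python as follows. ElementTree supports last() directly, but not position() or count(), so those two are emulated with ordinary Python (a slice and a length test). The document and its id attributes are invented for the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: three HEAD elements, the first with a subhead.
doc = ET.fromstring(
    "<doc>"
    "<HEAD id='h1'><HEAD id='h1.1'/></HEAD>"
    "<HEAD id='h2'/>"
    "<HEAD id='h3'/>"
    "</doc>"
)
heads = doc.findall("HEAD")

# HEAD[last()]: supported directly by ElementTree.
last_head = [h.get("id") for h in doc.findall("HEAD[last()]")]

# HEAD[position() <= 2]: emulated with a slice (positions are 1-based).
first_two = [h.get("id") for h in heads[:2]]

# HEAD[count(HEAD)=0]: emulated by counting child HEAD elements.
no_subheads = [h.get("id") for h in heads if len(h.findall("HEAD")) == 0]
```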

 

String functions

These functions operate on or return strings.

concat(string, string, ...) concatenates the string values
starts-with(string1, string2) returns true if string1 starts with string2
contains(string1, string2) returns true if string1 contains string2
substring-before(string1, string2) returns the start of string1 before string2 occurs in it
substring-after(string1, string2) returns the remainder of string1 after string2 occurs in it
substring(string, idx) returns the substring from the index position to the end, where the index of the first char = 1
substring(string, idx, len) returns the substring from the index position, of the specified length
string-length() returns the size of the context-node's string-value. The context node is the currently selected node
string-length(string) returns the size of the specified string
normalize-space() returns the normalized string-value of the current node (no leading or trailing whitespace, and sequences of whitespace characters converted to a single space)
normalize-space(string) returns the normalized string-value of the specified string
translate(string1, string2, string3) converts string1, replacing occurrences of characters in string2 with the corresponding character from string3
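For readers more familiar with a general-purpose language, here is a sketch of how several of these functions could be emulated in Python. The function names are hypothetical, the substring helper assumes integer arguments (XPath itself rounds non-integer ones), and starts-with and contains are omitted because they correspond closely to str.startswith and the in operator.

```python
import re

def substring_before(s, sep):
    """Part of s before the first occurrence of sep, or "" if sep is absent."""
    idx = s.find(sep)
    return s[:idx] if idx >= 0 else ""

def substring_after(s, sep):
    """Part of s after the first occurrence of sep, or "" if sep is absent."""
    idx = s.find(sep)
    return s[idx + len(sep):] if idx >= 0 else ""

def substring(s, idx, length=None):
    """Substring starting at 1-based position idx (integer arguments only)."""
    start = idx - 1
    if length is None:
        return s[max(start, 0):]
    return s[max(start, 0):max(start + length, 0)]

def normalize_space(s):
    """Trim the ends and collapse runs of XML whitespace to one space."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")

def translate(s, src, dst):
    """Map each character of src to the matching one in dst; drop extras."""
    table = {}
    for i, ch in enumerate(src):
        if ord(ch) not in table:  # the first occurrence in src wins
            table[ord(ch)] = dst[i] if i < len(dst) else None
    return s.translate(table)
```

For example, translate("bar", "abc", "ABC") gives "BAr", and translate("--aaa--", "abc-", "ABC") gives "AAA" because the hyphen has no counterpart in the third argument and is removed.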

 

XPath defines 3 ways to get the text of an element: text(), string(object), and the string-value implied by an element name in an expression like this: /PROJECT[PERSON="Fred"].

 

Boolean functions

These functions operate on or return boolean values:

not(...) negates the specified boolean value
true() returns true
false() returns false
lang(string) returns true if the language of the context node (specified by xml:lang attributes) is the same as (or a sublanguage of) the specified language. For example: lang("en") is true for <PARA xml:lang="en">...</PARA>

 

Numeric functions

These functions operate on or return numeric values.

sum(...) returns the sum of the numeric value of each node in the specified node-set
floor(N) returns the largest integer that is not greater than N
ceiling(N) returns the smallest integer that is not less than N
round(N) returns the integer that is closest to N
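One detail worth knowing about round(N): when N lies exactly halfway between two integers, XPath picks the one closer to positive infinity. Python's built-in round() uses a different tie rule, so the sketch below defines a hypothetical xpath_round helper to show the contrast. It ignores special values such as NaN and the infinities.

```python
import math

def xpath_round(n):
    """Round to the nearest integer; ties go toward positive infinity."""
    return math.floor(n + 0.5)

print(math.floor(-1.5), math.ceil(-1.5))   # -2 -1
print(xpath_round(2.5), round(2.5))        # 3 2
print(xpath_round(-2.5))                   # -2
```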

 

Conversion functions

These functions convert one data type to another.

string(...) returns the string value of a number, boolean, or node-set
boolean(...) returns a boolean value for a number, string, or node-set (a number that is neither zero nor NaN, a non-empty node-set, and a non-empty string are all true)
number(...) returns the numeric value of a boolean, string, or node-set (true is 1, false is 0, a string containing a number becomes that number, and a node-set is first converted to the string-value of its first node, which is then converted to a number)

 

Namespace functions

These functions let you determine the namespace characteristics of a node.

local-name() returns the name of the current node, minus the namespace prefix
local-name(...) returns the name of the first node in the specified node set, minus the namespace prefix
namespace-uri() returns the namespace URI from the current node
namespace-uri(...) returns the namespace URI from the first node in the specified node set
name() returns the qualified name (namespace prefix plus local name) of the current node
name(...) returns the qualified name (namespace prefix plus local name) of the first node in the specified node set
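As a loose illustration of local-name() and namespace-uri(), the following sketch pulls the same two pieces of information out of an xml.etree.ElementTree element. ElementTree stores each tag as {uri}local, so the hypothetical helpers below just split that string. The namespace URI is a made-up example, and ElementTree does not keep the original prefix, so name() has no direct counterpart here.

```python
import xml.etree.ElementTree as ET

# Hypothetical element in a made-up namespace.
elem = ET.fromstring('<x:item xmlns:x="http://example.com/ns"/>')
plain = ET.fromstring("<plain/>")

def namespace_uri(node):
    """Namespace URI of the element, or "" when it has none."""
    return node.tag[1:].split("}", 1)[0] if node.tag.startswith("{") else ""

def local_name(node):
    """Element name without any namespace part."""
    return node.tag.split("}", 1)[1] if node.tag.startswith("{") else node.tag

print(namespace_uri(elem), local_name(elem))
```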

