Simple API for XML (SAX)

 

 


See Also

  1. org.xml.sax
  2. org.xml.sax.ext
  3. org.xml.sax.helpers
  4. SAX standard
  5. W3C standard validation mechanism, XML Schema
  6. RELAX NG's regular-expression based validation mechanism

 


Overview

SAX is a protocol for parsing XML files using serial I/O streams. To interactively modify data using a tree structure, use DOM, JDOM, or dom4j.

Here is a simple app, echo.java, that mimics the UNIX echo command. To construct:

  1. Install the JAXP JAR files.

  2. Compile echo.java

    javac echo.java

  3. Run echo.class

    java echo filename.xml

Here are some other apps that you can build and run:

In the Java WSDP, the JAXP libraries are distributed in the directory $WSDP_HOME/common/lib. To compile the program you created, you'll first need to install the JAXP JAR files in the appropriate location.

The XML specification requires all input line separators to be normalized to a single newline. The newline character is specified as \n in Java, C, and UNIX systems, but goes by the alias "linefeed" in Windows systems.

 

locator

A locator is an object that contains the information necessary to find the document. The Locator interface encapsulates a system ID (URL) or a public identifier (URN), or both.

You would need that information if you wanted to find something relative to the current document. For example, when an HTML browser processes an href="anotherFile" attribute in an anchor tag, it uses the location of the current document to find anotherFile.

You could also use the locator to print out good diagnostic messages. In addition to the document's location and public identifier, the locator contains methods that give the column and line number of the most recently-processed event. The setDocumentLocator method is called only once at the beginning of the parse, though. To get the current line or column number, you would save the locator when setDocumentLocator is invoked and then use it in the other event-handling methods.
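The save-then-query pattern described above might look like the following minimal sketch. The class name LocatorDemo and its positions list are illustrative names, not part of the original echo program; the SAX and JAXP calls are the standard ones.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records the line on which each start tag was reported.
class LocatorDemo extends DefaultHandler {
    private Locator locator;                         // saved once, queried later
    final List<String> positions = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;                      // invoked once, before startDocument
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // A parser is not required to supply a locator, so guard against null.
        int line = (locator == null) ? -1 : locator.getLineNumber();
        positions.add(qName + "@line " + line);
    }

    static List<String> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        LocatorDemo handler = new LocatorDemo();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.positions;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse("<slideshow>\n  <slide/>\n</slideshow>"));
    }
}
```

The same saved locator can be consulted from any other event-handling method, for instance to prefix a diagnostic message with a line and column number.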

 

Handling Processing Instructions

The format for a processing instruction is <?target data?>, where "target" is the target application that is expected to do the processing, and "data" is the instruction or information for it to process. Add text to add a processing instruction for a mythical slide presentation program that will query the user to find out which slides to display (technical, executive-level, or all):

The "data" portion of the processing instruction can contain spaces, or may even be null. But there cannot be any space between the initial <? and the target identifier.

The data begins after the first space.

Fully qualifying the target with the complete Web-unique package prefix makes sense, so as to preclude any conflict with other programs that might process the same data.
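On the SAX side, a processing instruction arrives as a single event that carries the target and the data as separate strings. The sketch below assumes a made-up instruction, <?my.presentation.Program QUERY="exec, tech, all"?>, purely for illustration; the class name PiDemo is likewise hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects every processing instruction it is handed.
class PiDemo extends DefaultHandler {
    final List<String> seen = new ArrayList<>();

    @Override
    public void processingInstruction(String target, String data) {
        // target is the text up to the first space; data is everything after it.
        seen.add(target + " -> " + data);
    }

    static List<String> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        PiDemo handler = new PiDemo();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.seen;
    }

    public static void main(String[] args) throws Exception {
        // The instruction text is an assumed example, not taken from slideSample.xml.
        System.out.println(parse(
            "<slideshow><?my.presentation.Program QUERY=\"exec, tech, all\"?></slideshow>"));
    }
}
```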

 

Errors

The parser can generate one of three kinds of errors: fatal error, error, and warning.

When a fatal error occurs, the parser is unable to continue. So, if the application does not generate an exception, then the default error-event handler generates one. The stack trace is generated by the Throwable exception handler in the main method:

 

Handling NonFatal Errors

A nonfatal error occurs when an XML document fails a validity constraint. If the parser finds that the document is not valid, then an error event is generated. Such errors are generated by a validating parser, given a DTD or schema, when a document has an invalid tag, or a tag is found where it is not allowed, or (in the case of a schema) if the element contains invalid data.

To take over error handling, you override the DefaultHandler methods that handle fatal errors, nonfatal errors, and warnings as part of the ErrorHandler interface. The SAX parser delivers a SAXParseException to each of these methods, so generating an exception when an error occurs is as simple as throwing it back.

It can be instructive to examine the error-handling methods defined in org.xml.sax.helpers.DefaultHandler. You'll see that the error() and warning() methods do nothing, while fatalError() throws an exception. Of course, you could always override the fatalError() method to throw a different exception. But if the code doesn't throw an exception when a fatal error occurs, then the SAX parser will, because the XML specification requires it.
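As a rough sketch of taking over error handling, the handler below throws nonfatal errors back at the parser, records warnings, and leaves fatalError() with its default behavior. The class name StrictHandler and the helper isValid are illustrative; the sample documents in main use an internal DTD so that the example needs no external files.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: validity errors abort the parse, warnings are only recorded.
class StrictHandler extends DefaultHandler {
    final List<String> warnings = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        warnings.add("line " + e.getLineNumber() + ": " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) throws SAXException {
        throw e;   // "throwing it back" turns a validity error into an exception
    }

    // fatalError() is inherited from DefaultHandler, which already throws.

    static boolean isValid(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);               // DTD validation
        SAXParser parser = factory.newSAXParser();
        try {
            parser.parse(new InputSource(new StringReader(xml)), new StrictHandler());
            return true;
        } catch (SAXParseException e) {
            return false;                          // validity error or fatal error
        }
    }

    public static void main(String[] args) throws Exception {
        String dtd = "<!DOCTYPE slideshow [<!ELEMENT slideshow (slide+)>"
                   + "<!ELEMENT slide (#PCDATA)>]>";
        System.out.println(isValid(dtd + "<slideshow><slide>x</slide></slideshow>"));
        System.out.println(isValid(dtd + "<slideshow><bogus/></slideshow>"));
    }
}
```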

 

Handling Warnings

Warnings, too, are ignored by default. Warnings are informative, and require a DTD. For example, if an element is defined twice in a DTD, a warning is generated. It's not illegal, and it doesn't cause problems, but it's something you might like to know about since it might not have been intentional.

Generate a message when a warning occurs:

Since there is no good way to generate a warning without a DTD or schema, you won't be seeing any just yet. But when one does occur, you're ready!

 

Handling Special Characters

In XML, an entity is an XML structure (or plain text) that has a name. Referencing the entity by name causes it to be inserted into the document in place of the entity reference. To create an entity reference, the entity name is surrounded by an ampersand and a semicolon, like this: &entityName;

 

Predefined Entities

An entity reference like &amp; contains a name (in this case, "amp") between the start and end delimiters. The text it refers to (&) is substituted for the name, like a macro in a C or C++ program. Table 6-1 shows the predefined entities for special characters.

Character   Reference
&           &amp;
<           &lt;
>           &gt;
"           &quot;
'           &apos;

 

Character References

A character reference like &#8220; contains a hash mark (#) followed by a number. The number is the Unicode value for a single character, such as 65 for the letter "A", 8220 for the left-curly quote, or 8221 for the right-curly quote. In this case, the "name" of the entity is the hash mark followed by the digits that identify the character.

In the &#nnn; form, XML expects the value to be specified in decimal. However, the Unicode charts at http://www.unicode.org/charts/ specify values in hexadecimal! So you'll need either to convert the value to decimal or to use the hexadecimal form of the reference, &#xhhhh;, when you insert it into your XML data set.
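The hexadecimal-to-decimal conversion is a one-liner in Java. This is only a small sketch; CharRef and decimalReference are illustrative names, and the input is assumed to be the bare hexadecimal code point as printed in the Unicode charts (for example 201C).

```java
// Hypothetical helper: turns a hexadecimal code point from the Unicode charts
// into a decimal XML character reference.
class CharRef {
    static String decimalReference(String hexCodePoint) {
        int codePoint = Integer.parseInt(hexCodePoint, 16);  // base-16 parse
        return "&#" + codePoint + ";";
    }

    public static void main(String[] args) {
        System.out.println(decimalReference("201C"));  // left double quotation mark
        System.out.println(decimalReference("41"));    // the letter A
    }
}
```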

 

Entity Reference

 

Handling Text with XML-Style Syntax

When you are handling large blocks of XML or HTML that include many of the special characters, it would be inconvenient to replace each of them with the appropriate entity reference. For those situations, use a CDATA section. All whitespace in a CDATA section is significant, and characters in it are not interpreted as XML.

A CDATA section starts with <![CDATA[ and ends with ]]>. Add text to your slideSample.xml file to define a CDATA section for a fictitious technical slide:

The existence of CDATA makes the proper echoing of XML a bit tricky. If the text to be output is not in a CDATA section, then any angle brackets, ampersands, and other special characters in the text should be replaced with the appropriate entity reference. (Replacing left angle brackets and ampersands is most important. Other characters will be interpreted properly without misleading the parser.)

But if the output text is in a CDATA section, then the substitutions should not occur, to produce text like that in the example above. In a simple program like our XMLEcho application, it's not a big deal. But many XML-filtering applications will want to keep track of whether the text appears in a CDATA section, in order to treat special characters properly.

One other area to watch for is attributes. The text of an attribute value could also contain angle brackets and ampersands that need to be replaced by entity references. (Attribute text can never be in a CDATA section, though, so there is never any question about doing that substitution.)
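The substitution rules above can be sketched as two small helpers. XmlEscaper, escapeText, and escapeAttribute are illustrative names, and the inCdata flag is assumed to be maintained by the caller, for example from the startCDATA() and endCDATA() callbacks of org.xml.sax.ext.LexicalHandler.

```java
// Hypothetical helpers for echoing text and attribute values as XML.
class XmlEscaper {
    // Element text: substitute only when the text is not inside a CDATA section.
    static String escapeText(String text, boolean inCdata) {
        if (inCdata) {
            return text;                       // CDATA content is echoed verbatim
        }
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;");  break;
                case '>': out.append("&gt;");  break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Attribute values are never in CDATA, so the substitution always applies;
    // double quotes are escaped too because the value is written inside quotes.
    static String escapeAttribute(String value) {
        return escapeText(value, false).replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        System.out.println(escapeText("a < b & c", false));
        System.out.println(escapeText("a < b & c", true));
        System.out.println(escapeAttribute("\"x\" & <y>"));
    }
}
```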

 


Create a DTD

After the XML declaration, the document prolog can include a DTD, which lets you specify the kinds of tags that can be included in your XML document. In addition to telling a validating parser which tags are valid, and in what arrangements, a DTD tells both validating and nonvalidating parsers where text is expected, which lets the parser determine whether the whitespace it sees is significant or ignorable.

 

Basic DTD Definitions

Here are the qualifiers you can add to an element definition:

Qualifier   Name            Meaning
?           Question Mark   Optional (zero or one)
*           Asterisk        Zero or more
+           Plus Sign       One or more

You can include multiple elements inside the parentheses in a comma-separated list, and use a qualifier on each element to indicate how many instances of that element may occur. The comma-separated list tells which elements are valid and the order they can occur in.

You can also nest parentheses to group multiple items. For example, after defining an image element (coming up shortly), you could declare that every image element must be paired with a title element in a slide by specifying ((image, title)+). Here, the plus sign applies to the image/title pair to indicate that one or more pairs of the specified items can occur.

The DTD offers no sense of hierarchy. The definition for the title element applies equally to a slide title and to an item title. When we expand the DTD to allow HTML-style markup in addition to plain text, it would make sense to restrict the size of an item title compared to a slide title, for example. But the only way to do that would be to give one of them a different name, such as "item-title". The bottom line is that the lack of hierarchy in the DTD forces you to introduce a "hyphenation hierarchy" (or its equivalent) in your namespace. All of these limitations are fundamental motivations behind the development of schema-specification standards.

 

Special Element Values in the DTD

Rather than specifying a parenthesized list of elements, the element definition could use one of two special values: ANY or EMPTY. The ANY specification says that the element may contain any other defined element, or PCDATA. Such a specification is usually used for the root element of a general-purpose XML document such as you might create with a word processor. Textual elements could occur in any order in such a document, so specifying ANY makes sense.

The EMPTY specification says that the element contains no contents. So the DTD for e-mail messages that let you "flag" the message with <flag/> might have a line like this in the DTD:

<!ELEMENT flag EMPTY>

 

Referencing a DTD

DTD definitions are generally in a separate file from the XML document, which means they have to be referenced using a DOCTYPE declaration and the SYSTEM identifier. Paths to the DTD file can be either relative or absolute.

The parser uses the document's location, the same path information that is passed to setDocumentLocator, to find DTD files that are referenced by a relative path.

See slideSample05.xml for an example of how to reference a DTD.

The DOCTYPE specification can also contain DTD definitions within the XML document using the following syntax:

<!DOCTYPE slideshow SYSTEM "slideshow1.dtd" [
  ...local subset definitions here...
]>

 

Defining Attributes in the DTD

Here is an example of defining attributes:

<!ELEMENT slideshow (slide+)>
<!ATTLIST slideshow 
    title    CDATA    #REQUIRED
    date     CDATA    #IMPLIED
    author   CDATA    "unknown"
>
<!ELEMENT slide (title, item*)>

The ATTLIST tag begins the series of attribute definitions. The name that follows ATTLIST specifies the element for which the attributes are being defined.

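The first element in each line is the name of the attribute. The second element indicates the type of the data, which is CDATA (character data) for all three attributes in this example. Commas and other separators are not allowed between the parts of a definition.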

Here are the options for the attribute type.

Attribute Type            Specifies...
(value1 | value2 | ...)   A list of values separated by vertical bars. (Example below)
CDATA                     "Unparsed character data". (For normal people, a text string.)
ID                        A name that no other ID attribute shares.
IDREF                     A reference to an ID defined elsewhere in the document.
IDREFS                    A space-separated list containing one or more ID references.
ENTITY                    The name of an entity defined in the DTD.
ENTITIES                  A space-separated list of entities.
NMTOKEN                   A valid XML name composed of letters, numbers, hyphens, underscores, and colons.
NMTOKENS                  A space-separated list of names.
NOTATION                  The name of a DTD-specified notation, which describes a non-XML data format, such as those used for image files.

When the attribute type consists of a parenthesized list of choices separated by vertical bars, the attribute must use one of the specified values. For example, add text to the DTD:

<!ELEMENT slide (title, item*)>
<!ATTLIST slide 
    type   (tech | exec | all) #IMPLIED
>
<!ELEMENT title (#PCDATA)>
<!ELEMENT item (#PCDATA | item)* >

This specification says that the slide element's type attribute must be given as type="tech", type="exec", or type="all". No other values are acceptable. (DTD-aware XML editors can use such specifications to present a pop-up list of choices.)

The last entry in the attribute specification determines the attribute's default value, if any, and tells whether or not the attribute is required. Table 6-4 shows the possible choices.

Specification         Specifies...
#REQUIRED             The attribute value must be specified in the document.
#IMPLIED              The value need not be specified in the document. If it isn't, the application will have a default value it uses.
"defaultValue"        The default value to use, if a value is not specified in the document.
#FIXED "fixedValue"   The value to use. If the document specifies any value at all, it must be the same.

See slideshow1b.dtd for a complete example.

 

Defining Entities in the DTD

Add text to the DOCTYPE tag in your XML file:

<!DOCTYPE slideshow SYSTEM "slideshow.dtd" [
  <!ENTITY product  "WonderWidget">
  <!ENTITY products "WonderWidgets">
]>

The ENTITY tag name says that you are defining an entity. Next comes the name of the entity and its definition. In this case, you are defining an entity named "product" that will take the place of the product name. Later when the product name changes (as it most certainly will), you will only have to change the name in one place, and all your slides will reflect the new value.

The last part is the substitution string that replaces the entity name whenever it is referenced in the XML document. The substitution string is defined in quotes, which are not included when the text is inserted into the document.

We defined two versions, one singular and one plural, so that when the marketing mavens come up with "Wally" for a product name, you will be prepared to enter the plural as "Wallies" and have it substituted correctly.

This is the kind of thing that really belongs in an external DTD. That way, all your documents can reference the new name when it changes.

Note that entities are referenced with the same syntax (&entityName;) that one uses for predefined entities, and that an entity can be referenced in an attribute value as well as in an element's contents.
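From the application's point of view the substitution is invisible: the parser delivers the replacement text through the ordinary characters() callback. A minimal sketch, with EntityDemo and textOf as illustrative names and a one-element document standing in for the slide show:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: concatenates all character data in the document.
class EntityDemo extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);   // may be called several times per element
    }

    static String textOf(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        EntityDemo handler = new EntityDemo();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE slide [<!ENTITY product \"WonderWidget\">]>"
                   + "<slide>Buy &product; now</slide>";
        System.out.println(textOf(xml));
    }
}
```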

 

Additional Useful Entities

Here are several other examples for entity definitions that you might find useful when you write an XML document:

<!ENTITY ldquo  "&#8220;"> <!-- Left Double Quote -->
<!ENTITY rdquo  "&#8221;"> <!-- Right Double Quote -->
<!ENTITY trade  "&#8482;"> <!-- Trademark Symbol (TM) -->
<!ENTITY rtrade "&#174;"> <!-- Registered Trademark (R) -->
<!ENTITY copyr  "&#169;"> <!-- Copyright Symbol -->

 

Referencing External Entities

You can also use the SYSTEM or PUBLIC identifier to name an entity that is defined in an external file. You'll do that now.

To reference an external entity, add text to the DOCTYPE statement in your XML file:

<!DOCTYPE slideshow SYSTEM "slideshow.dtd" [
  <!ENTITY product  "WonderWidget">
  <!ENTITY products "WonderWidgets">
  <!ENTITY copyright SYSTEM "copyright.xml">
]>

This definition references a copyright message contained in a file named copyright.xml. Create that file and put some interesting text in it, perhaps something like this:

  <!--  A SAMPLE copyright  -->

This is the standard copyright message that our lawyers
make us put everywhere so we don't have to shell out a
million bucks every time someone spills hot coffee in their
lap...

Finally, add text to your slideSample.xml file to reference the external entity:

<!-- TITLE SLIDE -->
  ...
</slide>

<!-- COPYRIGHT SLIDE -->
<slide type="all">
  <item>&copyright;</item>
</slide>

You could also use an external entity declaration to access a servlet that produces the current date using a definition something like this:

<!ENTITY currentDate SYSTEM
  "http://www.example.com/servlet/CurrentDate?fmt=dd-MMM-yyyy">

You would then reference that entity the same as any other entity:

  Today's date is &currentDate;.

 

Use a MIME Data Type

There are two ways to go about referencing an unparsed entity like a binary image file. One is to use the DTD's NOTATION-specification mechanism. However, that mechanism is a complex, non-intuitive holdover that mostly exists for compatibility with SGML documents. The other, used here, is to identify the file's format with a MIME data type attribute.

To set up the slideshow to use image files, add text to the slideshow.dtd file:

<!ELEMENT slide (image?, title, item*)>
<!ATTLIST slide 
    type   (tech | exec | all) #IMPLIED
>
<!ELEMENT title (#PCDATA)>
<!ELEMENT item (#PCDATA | item)* >
<!ELEMENT image EMPTY>
<!ATTLIST image 
    alt    CDATA    #IMPLIED
    src    CDATA    #REQUIRED
    type   CDATA    "image/gif"
>

These modifications declare image as an optional element in a slide, define it as an empty element, and define the attributes it requires. The image tag is patterned after the HTML 4.0 tag, img, with the addition of an image-type specifier, type. (The img tag is defined in the HTML 4.0 Specification.)

The image tag's attributes are defined by the ATTLIST entry. The alt attribute, which defines alternate text to display in case the image can't be found, accepts character data (CDATA). It has an "implied" value, which means that it is optional, and that the program processing the data knows enough to substitute something like "Image not found". On the other hand, the src attribute, which names the image to display, is required.

The type attribute is intended for the specification of a MIME data type, as defined at ftp://ftp.isi.edu/in-notes/iana/assignments/media-types/. It has a default value: image/gif.

The character data (CDATA) used for the type attribute will be one of the MIME data types. The two most common formats are image/gif and image/jpeg. Given that fact, it might be nice to specify an enumerated list of values here, using something like:

    type ("image/gif", "image/jpeg")

That won't work, however, because enumerated attribute values are restricted to name tokens. The forward slash isn't part of the valid set of name-token characters, so this declaration fails. Besides that, enumerating the values in the DTD would limit the valid MIME types to those defined today. Leaving it as CDATA leaves things more open ended, so that the declaration will continue to be valid as additional types are defined.

In the document, a reference to an image named "intro-pic" might look something like this:

<image src="image/intro-pic.gif" alt="Intro Pic"
       type="image/gif" />

The DTDHandler is invoked when the parser encounters an unparsed entity declaration or a notation declaration in the DTD. The EntityResolver comes into play when a URN (public ID) must be resolved to a URL (system ID).
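For comparison, here is roughly what the DTDHandler callbacks report when a document does use the NOTATION mechanism. DefaultHandler already implements DTDHandler, so overriding two methods is enough. The class name DtdDemo, the entity name introPic, and the sample declarations are assumptions made up for this sketch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records the notation and unparsed-entity declarations in a DTD.
class DtdDemo extends DefaultHandler {
    final List<String> declarations = new ArrayList<>();

    @Override
    public void notationDecl(String name, String publicId, String systemId) {
        declarations.add("notation " + name);
    }

    @Override
    public void unparsedEntityDecl(String name, String publicId,
                                   String systemId, String notationName) {
        declarations.add("unparsed entity " + name + " (" + notationName + ")");
    }

    static List<String> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        DtdDemo handler = new DtdDemo();
        // SAXParser.parse registers the DefaultHandler as the DTDHandler as well.
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.declarations;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE slide [\n"
                   + "  <!ELEMENT slide EMPTY>\n"
                   + "  <!NOTATION gif SYSTEM \"image/gif\">\n"
                   + "  <!ENTITY introPic SYSTEM \"intro-pic.gif\" NDATA gif>\n"
                   + "]>\n<slide/>";
        System.out.println(parse(xml));
    }
}
```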

 

Using Entity References

Using a MIME data type as an attribute of an element is a mechanism that is flexible and expandable. To create an external ENTITY reference using the notation mechanism, use DTD NOTATION elements for jpeg and gif data. Those can of course be obtained from some central repository. Define a different ENTITY element for each image you intend to reference! In other words, adding a new image to the document always requires both a new entity definition in the DTD and a reference to it in the document. Given the anticipated ubiquity of the HTML 4.0 specification, the newer standard is to use the MIME data types and a declaration like image, which assumes the application knows how to process such elements.

 


Parser Implementations

If no other factory class is specified, the default SAXParserFactory class is used. To use a different manufacturer's parser, change the value of the javax.xml.parsers.SAXParserFactory system property, which names the factory class. You can do that from the command line, like this:

java -Djavax.xml.parsers.SAXParserFactory=yourFactoryHere ...

The factory name you specify must be a fully qualified class name (all package prefixes included).
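To check which factory is actually in effect, a program can print the class that SAXParserFactory.newInstance() returns. This is a small diagnostic sketch; FactoryDemo and factoryClassName are illustrative names.

```java
import javax.xml.parsers.SAXParserFactory;

// Hypothetical diagnostic: reports which SAXParserFactory implementation is selected.
class FactoryDemo {
    static String factoryClassName() {
        // newInstance() consults the javax.xml.parsers.SAXParserFactory system
        // property first and falls back to the platform default implementation.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        return factory.getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("property: "
            + System.getProperty("javax.xml.parsers.SAXParserFactory"));
        System.out.println("factory:  " + factoryClassName());
    }
}
```

Running it with and without the -D option shown above makes the effect of the property visible.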

 

Associating a Document with a Schema

There are two ways to ensure that the XML document is associated with a schema: declare the schema in the document itself, or specify the schema to use in the application.

When the application specifies the schema to use, it overrides any schema declaration in the document.

To specify the schema definition in the document, you would create XML like this:

<documentRoot
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation='YourSchemaDefinition.xsd'
>
  ...

The first attribute defines the XML NameSpace (xmlns) prefix, "xsi", where "xsi" stands for "XML Schema Instance". The second line specifies the schema to use for elements in the document that do not have a namespace prefix, that is, for the elements you typically define in any simple, uncomplicated XML document.

You can also specify the schema file in the application, using code like this:

static final String JAXP_SCHEMA_SOURCE =
    "http://java.sun.com/xml/jaxp/properties/schemaSource";

...
SAXParser saxParser = spf.newSAXParser();
...
saxParser.setProperty(JAXP_SCHEMA_SOURCE,
    new File(schemaSource));
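Filled out into something runnable, the fragment above might look like the sketch below. It assumes the JAXP 1.2 property mechanism: besides schemaSource, the parser is told through the schemaLanguage property that the schema language is W3C XML Schema, and the factory is made namespace-aware and validating. The class name SchemaDemo and the helper isValid are illustrative names.

```java
import java.io.File;
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: validates a document against a schema chosen by the application.
class SchemaDemo {
    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String W3C_XML_SCHEMA =
        "http://www.w3.org/2001/XMLSchema";
    static final String JAXP_SCHEMA_SOURCE =
        "http://java.sun.com/xml/jaxp/properties/schemaSource";

    static boolean isValid(File schemaSource, String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);     // schema validation needs namespace support
        spf.setValidating(true);
        SAXParser saxParser = spf.newSAXParser();
        saxParser.setProperty(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
        saxParser.setProperty(JAXP_SCHEMA_SOURCE, schemaSource);

        // Without an error handler that throws, validity errors would be ignored.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void error(SAXParseException e) throws SAXException {
                throw e;
            }
        };
        try {
            saxParser.parse(new InputSource(new StringReader(xml)), handler);
            return true;
        } catch (SAXParseException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: java SchemaDemo YourSchemaDefinition.xsd "<documentRoot/>"
        System.out.println(isValid(new File(args[0]), args[1]));
    }
}
```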

 

Error Handling in the Validating Parser

In general, a SAX parsing error is a validation error, although we have seen that it can also be generated if the file specifies a version of XML that the parser is not prepared to handle. The thing to remember is that the application will not generate a validation exception unless you supply an error handler like the one above.

 

Creating and Referencing a Parameter Entity

Recall that the existing version of the slide presentation could not be validated because the document used <em> tags, and those are not part of the DTD. In general, we'd like to use a whole variety of HTML-style tags in the text of a slide, not just one or two, so it makes more sense to use an existing DTD for XHTML than it does to define all the tags we might ever need. A parameter entity is intended for exactly that kind of purpose.

The DTD specifications shown here are contained in slideshow2.dtd. The XML file that references it is slideSample08.xml. (The browsable versions are slideshow2-dtd.html and slideSample08-xml.html.)

Open the DTD file for the slide presentation and add text to define a parameter entity that references an external DTD file:

<!ELEMENT slide (image?, title?, item*)>
<!ATTLIST slide 
      ...
>

<!ENTITY % xhtml SYSTEM "xhtml.dtd">
%xhtml;

<!ELEMENT title ...

Here, you used an <!ENTITY> tag to define a parameter entity, just as for a general entity, but using a somewhat different syntax. You included a percent sign (%) before the entity name when you defined the entity, and you used the percent sign instead of an ampersand when you referenced it.

Also, note that there are always two steps for using a parameter entity. The first is to define the entity name. The second is to reference the entity name, which actually does the work of including the external definitions in the current DTD. Since the URI for an external entity could contain slashes (/) or other characters that are not valid in an XML name, the definition step allows a valid XML name to be associated with an actual document. (This same technique is used in the definition of namespaces, and anywhere else that XML constructs need to reference external documents.)

The DTD file referenced by this definition is xhtml.dtd. You can either copy that file to the system or modify the SYSTEM identifier in the <!ENTITY> tag to point to the correct URL.

This file is a small subset of the XHTML specification, loosely modeled after the Modularized XHTML draft, which aims at breaking up the DTD for XHTML into bite-sized chunks, which can then be combined to create different XHTML subsets for different purposes. When work on the modularized XHTML draft has been completed, this version of the DTD should be replaced with something better.

Use an XHTML-based DTD to gain access to an entity it defines that covers HTML-style tags like <em> and <b>. Looking through xhtml.dtd reveals the following entity, which does exactly what we want:

  <!ENTITY % inline "#PCDATA|em|b|a|img|br"> 

This entity is a simpler version of those defined in the Modularized XHTML draft. It defines the HTML-style tags we are most likely to want to use (emphasis, bold, and break), plus a couple of others for images and anchors that we may or may not use in a slide presentation. To use the inline entity:

<!ELEMENT title (%inline;)*>
<!ELEMENT item (%inline; | item)* >

These changes replaced the simple #PCDATA item with the inline entity. Note that #PCDATA is first in the inline entity, and that inline is first wherever we use it. That is required by XML's definition of a mixed-content model. To be in accord with that model, you also had to add an asterisk at the end of the title definition.

The Modularized XHTML DTD defines both inline and Inline entities, and does so somewhat differently. Rather than specifying #PCDATA|em|b|a|img|br, their definitions are more like (#PCDATA|em|b|a|img|br)*. Using one of those definitions, therefore, looks more like this:

<!ELEMENT title %Inline; >

 

Conditional Sections

You cannot conditionalize the content of an XML document, but you can define conditional sections in a DTD that become part of the DTD only if you specify INCLUDE. If you specify IGNORE, on the other hand, then the conditional section is not included.

Use references to parameter entities in place of the INCLUDE and IGNORE keywords:

someExternal.dtd: 
  <![ %XML; [
    ... XML-only definitions
  ]]>
  <![ %SGML; [
    ... SGML-only definitions
  ]]>
  ... common definitions 

Then each document that uses the DTD can set up the appropriate entity definitions:

<!DOCTYPE foo SYSTEM "someExternal.dtd" [
  <!ENTITY % XML  "INCLUDE" >
  <!ENTITY % SGML "IGNORE" >
]>
<foo>
  ...
</foo> 

This procedure puts each document in control of the DTD. It also replaces the INCLUDE and IGNORE keywords with variable names that more accurately reflect the purpose of the conditional section, producing a more readable, self-documenting version of the DTD.

 


Parsing DTDs

To keep two title elements separate, use a "hyphenation hierarchy". Change the name of the title element in slideshow.dtd to slide-title:

<!ELEMENT slide (image?, slide-title?, item*)>
<!ATTLIST slide 
      type   (tech | exec | all) #IMPLIED
>

<!-- Defines the %inline; declaration -->
<!ENTITY % xhtml SYSTEM "xhtml.dtd">
%xhtml;

<!ELEMENT slide-title (%inline;)*>

Use the new element name:

...
<slide type="all">
<slide-title>Wake up to ... </slide-title>
</slide>

...

<!-- OVERVIEW -->
<slide type="all">
<slide-title>Overview</slide-title>
<item>...

