XML

 

 


Usage

  1. Overview
  2. Special symbols
  3. Declarations
  4. Processing Instructions
  5. Document Type Definitions
  6. Namespaces

Standards

  1. Core Standards
  2. Style and Transformation
  3. Schema and Validation
  4. Other Key Standards


Overview

XML files are used to exchange information between different computer systems. Text documents are marked up with tags that classify each data item.

<bicycle>
    <type> Cruiser </type>
    <color> Red </color>
    <speeds> One </speeds>
</bicycle>
<bicycle>
    <type> Mountain </type>
    <color> Black </color>
    <speeds> 21 </speeds>
</bicycle>

Tags themselves can also contain data elements, called attributes, which reside within the tag's angle brackets:

<bicycle type="Cruiser" speeds="one" > 
  <text>
    This bicycle is suitable for city biking.
  </text>
</bicycle>

If the data is a simple text string, and is confined to a small number of fixed choices, one can make it an attribute. If the data element is complex, with multiple lines, make it a sub-element of another element. Attributes tend to be immutable and invisible, describing the characteristics of an element. Sub-elements are variable and are meant to be seen.
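As a minimal sketch of how a program sees the difference, the following reads the attribute example above with Python's standard xml.etree.ElementTree module. The <inventory> wrapper element is an assumption added here so that the fragment has a single root.

```python
import xml.etree.ElementTree as ET

# <inventory> is a hypothetical wrapper that gives the fragment one root element.
DOC = """
<inventory>
  <bicycle type="Cruiser" speeds="one">
    <text>
      This bicycle is suitable for city biking.
    </text>
  </bicycle>
</inventory>
"""

root = ET.fromstring(DOC)
bike = root.find("bicycle")

# Attributes are looked up by name on the element itself.
bike_type = bike.get("type")
bike_speeds = bike.get("speeds")

# Sub-element content is read as text; strip() drops the indentation around it.
description = bike.find("text").text.strip()

print(bike_type, bike_speeds)
print(description)
```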

Tags that stand by themselves, with no content, and no closing tags, are called "empty tags". For example, <flag/>.

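Whitespace between tags is passed along by the parser, but most data-oriented applications ignore it, so use plenty of it to make documents more readable. Lots of indents and double-spacing are recommended. Whitespace inside text content, as in <type> Cruiser </type>, is part of the data, and the application may need to trim it.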

Unlike HTML, which allows a certain editorial sloppiness, XML documents need to be completely well-formed. Even a single misplaced angle bracket can cause an entire document to fail to be processed. Every opening tag needs a matching closing tag, unless it is written as an empty tag. All tags need to be properly nested. One can have...

<header>
    <detail>
    This is well-formed.
    </detail>
</header>

...but never...

<header>
     <detail>
    This is not well-formed.
</header>
    </detail>
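A parser makes the difference concrete. The sketch below feeds both fragments to Python's standard xml.etree.ElementTree parser; is_well_formed is a hypothetical helper name used only for this illustration.

```python
import xml.etree.ElementTree as ET

GOOD = "<header><detail>This is well-formed.</detail></header>"
BAD = "<header><detail>This is not well-formed.</header></detail>"

def is_well_formed(text):
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Raised for mismatched tags, stray angle brackets, and similar errors.
        return False

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```

The parser rejects the whole second fragment, not just the misplaced tag.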

There are close to 500 XML standards currently in existence, providing tools and utilities for editing and manipulating XML files.

Note that one can edit XML files with any text editor. I myself use vi for all but the most specialized tasks. One can also manipulate XML files using any programming language. Standard UNIX shell commands, such as grep and sed, work well for parsing data. The Perl language also has many excellent XML tools.

XML comments are formatted like HTML comments:

  <!-- This is a comment -->

Special symbols include:

Symbol   Entity   Name
&        &amp;    Ampersand
<        &lt;     Less than sign
>        &gt;     Greater than sign
"        &quot;   Quote
'        &apos;   Apostrophe
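When generating XML from a program, these substitutions are usually left to a library. A small sketch using escape and unescape from Python's standard xml.sax.saxutils module follows; by default they handle only &, < and >, so the quote mappings are passed in explicitly. The sample string is made up for illustration.

```python
from xml.sax.saxutils import escape, unescape

# Extra mappings for the two quote characters, which escape() skips by default.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}
QUOTE_CHARS = {"&quot;": '"', "&apos;": "'"}

raw = 'Tom & Jerry say "1 < 2"'

escaped = escape(raw, QUOTE_ENTITIES)
restored = unescape(escaped, QUOTE_CHARS)

print(escaped)
print(restored == raw)
```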

 

Declarations

While not required, an XML file should start with an XML declaration, the first part of the prolog, which is written between <? and ?>. For example:

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>

Where...

version      Version of the XML markup language used in the data. Required.
encoding     Character set used to encode the data. "ISO-8859-1" is "Latin-1", the Western European and English language character set. If omitted, the parser assumes Unicode in the UTF-8 encoding (or UTF-16 when a byte order mark is present).
standalone   Whether the document stands alone ("yes") or references an external entity or an external data type specification ("no").
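As a hedged illustration of the encoding setting, the sketch below asks Python's xml.etree.ElementTree to write a declaration and then parses the bytes back. The <note> element and its text are assumptions for the example, and the xml_declaration keyword needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

root = ET.Element("note")
root.text = "caf\u00e9"  # "cafe" with an acute accent on the final letter

# xml_declaration=True asks the serializer to emit the <?xml ...?> line.
data = ET.tostring(root, encoding="ISO-8859-1", xml_declaration=True)

# The declaration is the first line of the serialized bytes.
first_line = data.split(b"\n", 1)[0]
print(first_line)

# The parser reads the declared encoding and decodes the bytes accordingly.
parsed = ET.fromstring(data)
print(parsed.text == root.text)
```

In Latin-1 the accented letter is stored as a single byte, and the declaration is what tells the parser to read it that way.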

 

Processing Instructions

XML processing instructions provide commands to an application, and have the following format:

  <?appname instructions...?>

XML files can have multiple processing instructions for multiple apps.

The appname "xml" is reserved for XML standards.
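A sketch of how an application might pick out its own instructions, using Python's standard xml.dom.minidom module. The application names "spellcheck" and "printer" are hypothetical.

```python
from xml.dom import minidom

# Two processing instructions, each addressed to a different (made-up) application.
DOC = """<?xml version="1.0"?>
<?spellcheck lang="en"?>
<?printer duplex="yes"?>
<report/>
"""

doc = minidom.parseString(DOC)

# Each instruction is a node with a target (the appname) and data (the rest).
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]

print(instructions)
```

The XML declaration on the first line does not appear in the list, since it is not a processing instruction.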

 

DTD

Document type definitions (DTD) are used to specify which elements and attributes are valid in an XML document. The DTD can exist at the front of the document, as part of the prolog, or it can exist as a separate entity, generally in a filename.dtd file.

A DTD cannot restrict the text content of elements (for example, to numbers or dates) or specify complex relationships, so it is suited only to validating relatively simple XML documents. To validate more complex documents, one must turn to other schema and validation standards.

If possible, when designing XML files, use an existing DTD. Repositories of industry-standard DTDs are available at http://www.XML.org and http://www.xmlx.com

The URL to filename.dtd can be relative or absolute. DTD files distributed on the Internet require a PUBLIC identifier and an absolute URL.
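The sketch below shows what a reference to an external DTD looks like and how Python's standard xml.dom.minidom module reports it. The public identifier and the example.com URL are invented for illustration, and the parser records them without fetching the DTD.

```python
from xml.dom import minidom

# A DOCTYPE with a PUBLIC identifier followed by an absolute URL (both hypothetical).
DOC = """<?xml version="1.0"?>
<!DOCTYPE inventory PUBLIC "-//Example//DTD Inventory 1.0//EN"
  "http://example.com/dtd/inventory.dtd">
<inventory/>
"""

doc = minidom.parseString(DOC)
doctype = doc.doctype

print(doctype.name)      # name of the root element the DTD describes
print(doctype.publicId)  # the PUBLIC identifier
print(doctype.systemId)  # the URL of the DTD file
```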

 

DTD Symbols and Keywords:

Symbol      Meaning
,           Elements must appear in the listed sequence.
|           Either one element or the other may appear.
+           Unit must appear at least once in the element being defined.
*           Unit can appear as many times as necessary, or not at all.
?           Unit can appear at most once, if at all.
CDATA       Attribute's value may contain any characters.
#REQUIRED   Attribute must contain some value in the XML document.
#IMPLIED    Attribute has no default value, and may be omitted.
#FIXED      Attribute is set to the given default value, if it is set at all.
ID          Attribute whose value must be unique throughout the XML document.
IDREF       Attribute whose value refers to an ID attribute. IDREFS can contain several white-space separated ID values.
NMTOKEN     Enforce XML name typing for attributes: the value may contain only letters, digits, underscores, hyphens, periods, and colons, with no spaces.
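To show the symbols in context, here is a sketch of a small DTD placed at the front of a document and parsed with Python's standard xml.etree.ElementTree module. The element and attribute names are assumptions built on the bicycle example. Note that this parser does not validate: it reads the DTD but does not enforce its rules, so a validating parser would be needed to reject a document that breaks them.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<!DOCTYPE inventory [
  <!-- one or more bicycles -->
  <!ELEMENT inventory (bicycle+)>
  <!-- type, then optional color, then speeds or gears, then any number of notes -->
  <!ELEMENT bicycle (type, color?, (speeds | gears), note*)>
  <!ELEMENT type (#PCDATA)>
  <!ELEMENT color (#PCDATA)>
  <!ELEMENT speeds (#PCDATA)>
  <!ELEMENT gears (#PCDATA)>
  <!ELEMENT note (#PCDATA)>
  <!ATTLIST bicycle
      id     ID     #REQUIRED
      owner  CDATA  #IMPLIED
      units  CDATA  #FIXED "metric">
]>
<inventory>
  <bicycle id="b1">
    <type>Cruiser</type>
    <speeds>One</speeds>
  </bicycle>
</inventory>
"""

root = ET.fromstring(DOC)
bike = root.find("bicycle")

bike_id = bike.get("id")
child_tags = [child.tag for child in bike]

print(bike_id, child_tags)
```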

 

Namespaces

The namespace standard lets one write an XML document that uses two or more sets of XML tags in modular fashion.

For example, suppose one has an XML file containing invoices from different companies. With each company's tags placed in its own namespace, the <InvoiceAmount> data can be summed by company, and as a whole.
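The invoice scenario can be sketched as follows with Python's standard xml.etree.ElementTree module. The company prefixes, the example.com namespace URIs, and the amounts are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Each company's tags live in that company's (hypothetical) namespace.
DOC = """
<invoices xmlns:acme="http://example.com/acme"
          xmlns:globex="http://example.com/globex">
  <acme:InvoiceAmount>100.50</acme:InvoiceAmount>
  <globex:InvoiceAmount>200.00</globex:InvoiceAmount>
  <acme:InvoiceAmount>49.50</acme:InvoiceAmount>
</invoices>
"""

NAMESPACES = {
    "acme": "http://example.com/acme",
    "globex": "http://example.com/globex",
}

root = ET.fromstring(DOC)

totals = {}
for company, uri in NAMESPACES.items():
    # ElementTree spells a qualified name as {namespace-uri}local-name.
    amounts = root.iter("{%s}InvoiceAmount" % uri)
    totals[company] = sum((Decimal(a.text) for a in amounts), Decimal("0"))

grand_total = sum(totals.values(), Decimal("0"))

print(totals)
print(grand_total)
```

Both companies use the same tag name, and the namespace URI is what keeps their amounts apart.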

 


Core Standards

 

Style and Transformation

  1. XSL

    The Extensible Stylesheet Language (XSL) standard is used for formatting and transforming XML documents. Its formatting vocabulary is also known as XSL-FO.

  2. XSLT

    The XSL Transformations standard is a part of the XSL standard, and is used for setting up transformations of XML documents (for example to HTML or other XML) and for dictating how the document is rendered.

 

Document Linking

  1. XLink

    The XML Linking Language standard provides the ability to embed links within XML documents. In addition to regular links, it also enables the following:

    • Two-way links
    • Links to multiple documents
    • Expanding links that insert the linked information.

  2. XML Base

    Specifies a "base" address used when evaluating a relative address specified in the document.

  3. XPointer

    Targets a document or document-segment using IDs. Allows one to provide references to elements, character strings, and other parts of XML documents, whether or not they bear an explicit ID attribute.

  4. XPath
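
XPath expressions address parts of a document by path. As a rough sketch only, Python's standard xml.etree.ElementTree module accepts a limited subset of XPath in its findall method. The <inventory> document below is an assumption based on the bicycle example.

```python
import xml.etree.ElementTree as ET

DOC = """
<inventory>
  <bicycle type="Cruiser"><color>Red</color></bicycle>
  <bicycle type="Mountain"><color>Black</color></bicycle>
</inventory>
"""

root = ET.fromstring(DOC)

# The <color> child of every <bicycle> whose type attribute is "Mountain".
mountain_colors = [c.text for c in root.findall("./bicycle[@type='Mountain']/color")]

# Every <color> element at any depth below the root.
all_colors = [c.text for c in root.findall(".//color")]

print(mountain_colors)
print(all_colors)
```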

 

Schema and Validation

  1. XML Schema

    Specifies structure relationships and mechanisms for validating the content of XML elements by specifying a datatype for each element. Large and complex.

  2. TREX

    The Tree Regular Expressions for XML standard is a means of expressing validation criteria by describing a pattern for the structure and content of an XML document. Now part of the RELAX NG spec.

  3. RELAX NG

    The Regular Language description for XML standard uses regular expression patterns to express constraints on structure relationships. Designed to work with the XML Schema datatyping mechanism. Includes a DTD to RELAX converter. "NG" stands for "Next Generation". It is a newer version of the RELAX schema mechanism that integrates TREX. A product of OASIS.

  4. DOM

    Builds object trees from XML documents.

  5. SOX

    The Schema for Object-oriented XML standard includes extensible data types, namespaces, and embedded documentation.

  6. Schematron

    Schematron is an assertion-based schema mechanism that allows for sophisticated validation.

 

Other Key Standards

    SAX

    Event-based parser
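
As a minimal sketch of the event-based style, the following uses Python's standard xml.sax module. The parser calls a handler method as it meets each opening tag, and no tree is kept in memory. TagCounter and the sample document are hypothetical names for this example.

```python
import xml.sax

class TagCounter(xml.sax.ContentHandler):
    """Count how many times each element name is opened."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called by the parser once for every opening (or empty) tag.
        self.counts[name] = self.counts.get(name, 0) + 1

DOC = b"<inventory><bicycle/><bicycle/><scooter/></inventory>"

handler = TagCounter()
xml.sax.parseString(DOC, handler)

print(handler.counts)
```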

  1. XHTML

    Specification for making XML documents that look and act like HTML documents.

    RDF, and the RDF Schema, are used in conjunction with the XHTML specification, and with HTML pages, to describe the content of pages.

    A simpler, though immature, alternative to RDF is XTM, which can be used to build Topic Maps.

  2. JAXB

    Java Architecture for XML Binding is a standard for writing out Java objects as XML, a process called marshalling, and for creating Java objects from such structures, a process called unmarshalling.

    When a data structure is fully specified, code can be generated automatically in a process called binding, creating classes that recognize and process different data elements by processing the specification that defines those elements.

  3. JDOM

    Creates a tree of objects from an XML structure.

    DOM4J is an open-source, object-oriented alternative to DOM.

  4. JAXM

    Defines a mechanism for exchanging asynchronous, XML-based messages between applications.

  5. JAXP

    Parses XML documents.

  6. JAX-RPC

    Defines a mechanism for exchanging synchronous, XML-based messages between applications.

  7. JAXR

    Provides a mechanism for publishing available services in an external registry, and for consulting the registry to find those services.

  8. SMIL

    Synchronized Multimedia Integration Language provides playback of audio, video, and animations.

  9. MathML

    Mathematical Markup Language represents mathematical formulas.

  10. SVG

    Vector graphic images.

  11. DrawML

    2D images for technical illustrations.

  12. ICE

    Protocol used by content syndicators and their subscribers to automate content exchange and reuse.

  13. ebXML

    Standard for creating a modular electronic business framework using XML. Product of a joint initiative by the United Nations (UN/CEFACT) and OASIS.

  14. CXML

    RosettaNet standard for setting up interactive online catalogs for different buyers, where the pricing and product offerings are company specific. Includes mechanisms to handle purchase orders, change orders, status updates, and shipping notifications.

  15. CBL

    Library of element and attribute definitions maintained by CommerceNet.

  16. UBL

    OASIS initiative aimed at compiling a standard library of XML business documents (purchase orders, invoices, etc.) that are defined with XML Schema definitions.

