XML

Inside the SAX Sample Program

Let's look at how the program you just saw uses a SAX parser to parse an XML document. The program just prints out messages that explain what it's doing at each step while parsing, along with the associated data from the XML document. You could easily replace this code with code that performs more useful tasks, such as performing a calculation or otherwise transforming the data, but because the purpose of this program is just to illustrate how the SAX parser works, the diagnostic messages are fine.

Because you already know the scoop on SAX, Java, and the Xerces SAX parser for Java, let's go ahead and jump right into the program code. Here are the opening lines of the Java code:

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.ErrorHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

public class DocumentPrinter implements ContentHandler, ErrorHandler {
  // A constant containing the name of the SAX parser to use.
  private static final String PARSER_NAME
    = "org.apache.xerces.parsers.SAXParser";

This code imports classes that will be used later on and declares the class (program) that you're currently writing. The import statements indicate which classes will be used by this program. In this case, all of the classes that will be used are from the org.xml.sax package and are included in the xercesImpl.jar and xml-apis.jar archives.

This class, called DocumentPrinter, implements two interfaces: ContentHandler and ErrorHandler. These two interfaces are part of the standard SAX 2.0 package and are included in the import list. A program that implements ContentHandler is set up to handle events passed back in the normal course of parsing an XML document, and a program that implements ErrorHandler can handle any error events generated during SAX parsing.

In the Java world, an interface is a framework that specifies a list of methods that must be defined in a class. An interface is useful because it guarantees that any class that implements it meets the requirements of that interface. If you fail to include all of the methods required by the interface, your program will not compile. Because this program implements ContentHandler and ErrorHandler, the parser can be certain that it is capable of handling all of the events it triggers as it parses a document.

After the class has been declared, a single member variable is created for the class, PARSER_NAME. This variable is a constant that contains the name of the class that you're going to use as the SAX parser. As you learned earlier, there are any number of SAX parsers available. The Xerces parser just so happens to be one of the better Java SAX parsers out there, which explains the parser name of org.apache.xerces.parsers.SAXParser.

Although SAX is certainly a popular Java-based XML parsing API given its relatively long history, it isn't the only XML API available to Java programmers. Sun, the maker of Java, offers one of its own: the latest version of Java (J2SE 5.0) includes an XML API called JAXP, which provides a standard, built-in way to obtain and use XML parsers, including SAX parsers. To learn more about JAXP, visit http://java.sun.com/xml/jaxp/.
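To make the JAXP route a little more concrete, here is a minimal sketch of obtaining an XMLReader through JAXP instead of naming the Xerces class directly. It is not part of the Document Printer program: the class name JaxpReaderSketch and the method elementNames() are made up for illustration, and the sketch extends the SAX helper class DefaultHandler so that it only has to define the one event method it cares about.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name; not part of the Document Printer program.
class JaxpReaderSketch {

  // Parses XML text and returns the element names in the order they start.
  static List<String> elementNames(String xml) {
    final List<String> names = new ArrayList<>();
    try {
      // JAXP picks the parser implementation, so no class name is hard-coded.
      XMLReader reader =
          SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      // DefaultHandler supplies empty versions of every ContentHandler method.
      reader.setContentHandler(new DefaultHandler() {
        @Override
        public void startElement(String uri, String localName,
            String qName, Attributes atts) {
          names.add(qName);
        }
      });
      reader.parse(new InputSource(new StringReader(xml)));
    } catch (ParserConfigurationException | SAXException | IOException ex) {
      throw new IllegalStateException("Could not parse document", ex);
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(
        elementNames("<proj><name>Woodmont Close</name></proj>"));
  }
}
```

The object that comes back is still an XMLReader, so the setContentHandler(), setErrorHandler(), and parse() calls work the same way regardless of how the parser was obtained.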

The main() Method

Every command-line Java application begins its life with the main() method. In the Java world, the main method indicates that a class is a standalone program, as opposed to one that just provides functionality used by other classes. Perhaps more importantly, it's the method that gets run when you start the program. The purpose of this method is to set up the parser and get the name of the document to be parsed from the arguments passed in to the program. Here's the code:

  public static void main(String[] args) {
    if (args.length == 0) {
      System.out.println("No XML document path specified.");
      System.exit(1);
    }
    DocumentPrinter dp = new DocumentPrinter();
    XMLReader parser;
    try {
      parser = (XMLReader)Class.forName(PARSER_NAME).newInstance();
      parser.setContentHandler(dp);
      parser.setErrorHandler(dp);
      parser.parse(args[0]);
    }
    // Normally it's a bad idea to catch generic exceptions like this.
    catch (Exception ex) {
      System.out.println(ex.getMessage());
      ex.printStackTrace();
    }
  }

This program expects that the user will specify the path to an XML document as its only command-line argument. If no such argument is submitted, the program prints a message saying that no document path was specified and then exits.

Next, the program creates an instance of the DocumentPrinter object and assigns it to the variable dp. You'll need this object later when you tell the parser which ContentHandler and ErrorHandler to use. After instantiating dp, a try...catch block is opened to house the parsing code. This is necessary because some of the methods called to carry out the parsing can throw exceptions that must be caught within the program. All of the real work in the program takes place inside the try block.

The try...catch block is the standard way in which Java handles errors that crop up during the execution of a program. It enables the program to compensate and work around those errors if the programmer chooses to do so. In this case, you simply print out information about the error and allow the program to exit gracefully.

Within the try block, the first order of business is creating a parser object. The class named in the PARSER_NAME constant is loaded, an instance of it is created, and that instance is assigned to the variable parser. Because SAX 2.0 parsers must implement XMLReader, you can refer to the object through that interface rather than through the class's own name, SAXParser. Using the parser through the XMLReader interface means that you can call only those methods included in that interface. For this application, that's fine.

After the parser has been created, you can start setting its properties. Before actually parsing the document, however, you have to specify the content and error handlers that the parser will use. Because the DocumentPrinter class can play both of those roles, you simply set both of those properties to dp (the DocumentPrinter object you just created). At this point, all you have to do is call the parse() method on the URI passed in on the command line, which is exactly what the code does.

Implementing the ContentHandler Interface

The skeleton for the program is now in place. The rest of the program consists of methods that fulfill the requirements of the ContentHandler and ErrorHandler interfaces. More specifically, these methods respond to events that are triggered during the parsing of an XML document. In this program, the methods just print out the content that they receive.

The first of these methods is the characters() method, which is called whenever content is parsed in a document. Following is the code for this method:

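  public void characters(char[] ch, int start, int length) {
    // Copy only the slice of the array that holds this event's content.
    String chars = "";
    for (int i = start; i < start + length; i++) {
      chars = chars + ch[i];
    }
    // Skip content that consists solely of whitespace.
    if ((chars.trim()).length() > 0) {
      System.out.println("Received characters: " + chars);
    }
  }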

The characters() method receives content found within elements. It accepts three arguments: an array of characters, the position in the array where the content starts, and the amount of content received. In this method, a for loop is used to extract the content from the array, starting at the position in the array where the content starts, and iterating over each element until the position of the last element is reached. When all of the characters are gathered, the code checks to make sure they aren't just empty spaces, and then prints the results if not.

It's important not to just process all of the characters in the array of characters passed in unless that truly is your intent. The array can contain lots of padding on both sides of the relevant content, and including it all will result in a lot of extra characters along with the content that you actually want. On the other hand, if you know that the code contains parsed character data (PCDATA) that you want to read verbatim, then by all means process all of the characters.
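As a rough illustration of respecting the start and length arguments, the following sketch gathers the text of each element into a StringBuilder rather than printing it right away. It is an alternative to the sample program's approach, not a copy of it: the class name TextCollectorSketch and the collect() method are hypothetical, the parser is obtained through JAXP, and DefaultHandler is used to avoid defining the unused event methods. Accumulating the text is also handy because a SAX parser is allowed to deliver one run of character data in several characters() calls.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name; not part of the Document Printer program.
class TextCollectorSketch extends DefaultHandler {
  private final StringBuilder text = new StringBuilder();
  private final List<String> values = new ArrayList<>();

  @Override
  public void startElement(String uri, String localName,
      String qName, Attributes atts) {
    text.setLength(0); // Discard whitespace seen before this start tag.
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    // Append only the slice that belongs to this event, never the whole array.
    text.append(ch, start, length);
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    String value = text.toString().trim();
    if (!value.isEmpty()) {
      values.add(value);
    }
    text.setLength(0);
  }

  // Parses XML text and returns the non-blank text of each element.
  static List<String> collect(String xml) {
    TextCollectorSketch handler = new TextCollectorSketch();
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
    } catch (ParserConfigurationException | SAXException | IOException ex) {
      throw new IllegalStateException("Could not parse document", ex);
    }
    return handler.values;
  }

  public static void main(String[] args) {
    System.out.println(collect(
        "<description><name>Woodmont Close</name>"
        + "<img>condowc.jpg</img></description>"));
  }
}
```

This simple version assumes that text appears only in elements without child elements, which is true of the condos.xml excerpt; documents with mixed content would need a more careful design.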

The next two methods, startDocument() and endDocument(), are called when the beginning and end of the document are encountered, respectively. They accept no arguments and are called only once each during document parsing, for obvious reasons. Here's the code for these methods:

  public void startDocument() {
    System.out.println("Start document.");
  }
  public void endDocument() {
    System.out.println("End of document reached.");
  }

Next let's look at the startElement() and endElement() methods, which accept the most complex set of arguments of any of the methods that make up a ContentHandler:

  public void startElement(String namespaceURI, String localName,
    String qName, Attributes atts) {
    System.out.println("Start element: " + localName);
  }
  public void endElement(String namespaceURI, String localName,
    String qName) {
    System.out.println("End of element: " + localName);
  }

The startElement() method accepts four arguments from the parser. The first is the namespace URI, which you'll see elsewhere as well. The namespace URI is the URI for the namespace associated with the element. If a namespace is used in the document, the URI for the namespace is provided in a namespace declaration. The local name is the name of the element without the namespace prefix. The qualified name is the name of the element including the namespace prefix if there is one. Finally, the attributes are provided as an instance of the Attributes object. The endElement() method accepts the same first three arguments but not the final attributes argument.

SAX parsers must have namespace processing turned on in order to populate all of these arguments. If that option is deactivated, the namespace URI and local name may be passed as empty strings. SAX 2.0 defines a standard feature for this purpose, http://xml.org/sax/features/namespaces, which is set through the XMLReader's setFeature() method, although other parser APIs may offer their own switch for it.
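The following sketch shows one way to see the three name arguments side by side. It assumes a parser obtained through JAXP, where the SAXParserFactory method setNamespaceAware(true) turns on namespace processing; the class name NamespaceArgsSketch, the c prefix, and the urn:example:condos namespace URI are all invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name; not part of the Document Printer program.
class NamespaceArgsSketch {

  // Returns { namespaceURI, localName, qName } as reported for the root element.
  static String[] rootNames(String xml) {
    final String[] result = new String[3];
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true); // Turn on namespace processing.
      XMLReader reader = factory.newSAXParser().getXMLReader();
      reader.setContentHandler(new DefaultHandler() {
        @Override
        public void startElement(String namespaceURI, String localName,
            String qName, Attributes atts) {
          if (result[1] == null) { // Record only the first (root) element.
            result[0] = namespaceURI;
            result[1] = localName;
            result[2] = qName;
          }
        }
      });
      reader.parse(new InputSource(new StringReader(xml)));
    } catch (ParserConfigurationException | SAXException | IOException ex) {
      throw new IllegalStateException("Could not parse document", ex);
    }
    return result;
  }

  public static void main(String[] args) {
    // The prefix and namespace URI below are made up for illustration.
    String[] names = rootNames("<c:proj xmlns:c='urn:example:condos'/>");
    System.out.println("Namespace URI:  " + names[0]);
    System.out.println("Local name:     " + names[1]);
    System.out.println("Qualified name: " + names[2]);
  }
}
```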

Let's look at attribute processing specifically. Attributes are supplied to the startElement() method as an object that implements the Attributes interface. The startElement() method shown above doesn't examine this object, but three of its methods are all you need to walk through an element's attributes: getLength(), getLocalName(), and getValue(). The getLength() method provides the upper bound for a loop over the attributes, while getLocalName() and getValue() accept the index of the attribute being retrieved as arguments. Code written this way can retrieve each attribute and print out its name and value. In case you're curious, the full list of methods for the Attributes object appears in Table 17.1.

Table 17.1. Methods of the Attributes Object

Method                                  Purpose
--------------------------------------  ------------------------------------------------------------
getIndex(String qName)                  Retrieves an attribute's index using its qualified name
getIndex(String uri, String localName)  Retrieves an attribute's index using its namespace URI and the local portion of its name
getLength()                             Returns the number of attributes in the element
getLocalName(int index)                 Returns the local name of the attribute associated with the index
getQName(int index)                     Returns the qualified name of the attribute associated with the index
getType(int index)                      Returns the type of the attribute with the supplied index
getType(String qName)                   Looks up the type of the attribute using the qualified name
getType(String uri, String localName)   Looks up the type of the attribute with the namespace URI and name specified
getURI(int index)                       Looks up the namespace URI of the attribute with the index specified
getValue(int index)                     Looks up the value of the attribute using the index
getValue(String qName)                  Looks up the value of the attribute using the qualified name
getValue(String uri, String localName)  Looks up the value of the attribute using the namespace URI and local name

Getting back to the endElement() method, its operation is basically the same as that of startElement() except that it doesn't accept the attributes of the element as an argument.

The next two methods, startPrefixMapping() and endPrefixMapping(), have to do with prefix mappings for namespaces:

  public void startPrefixMapping(String prefix, String uri) {
    System.out.println("Prefix mapping: " + prefix);
    System.out.println("URI: " + uri);
  }
  public void endPrefixMapping(String prefix) {
    System.out.println("End of prefix mapping: " + prefix);
  }

These methods are used to report the beginning and end of namespace prefix mappings when they are encountered in a document.
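To show when these events arrive relative to the element events, the sketch below records each callback in a list. It assumes a namespace-aware parser obtained through JAXP; the class name PrefixMappingSketch, the c prefix, and the urn:example:condos URI are invented for illustration. In SAX 2.0, a prefix mapping starts just before the startElement event of the element that declares it and ends just after the matching endElement event.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name; not part of the Document Printer program.
class PrefixMappingSketch extends DefaultHandler {
  private final List<String> events = new ArrayList<>();

  @Override
  public void startPrefixMapping(String prefix, String uri) {
    events.add("startPrefixMapping " + prefix + " -> " + uri);
  }

  @Override
  public void endPrefixMapping(String prefix) {
    events.add("endPrefixMapping " + prefix);
  }

  @Override
  public void startElement(String namespaceURI, String localName,
      String qName, Attributes atts) {
    events.add("startElement " + localName);
  }

  @Override
  public void endElement(String namespaceURI, String localName,
      String qName) {
    events.add("endElement " + localName);
  }

  // Parses XML text and returns the events in the order they were reported.
  static List<String> trace(String xml) {
    PrefixMappingSketch handler = new PrefixMappingSketch();
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true); // Prefix mappings need namespaces on.
      XMLReader reader = factory.newSAXParser().getXMLReader();
      reader.setContentHandler(handler);
      reader.parse(new InputSource(new StringReader(xml)));
    } catch (ParserConfigurationException | SAXException | IOException ex) {
      throw new IllegalStateException("Could not parse document", ex);
    }
    return handler.events;
  }

  public static void main(String[] args) {
    // The prefix and namespace URI below are made up for illustration.
    for (String event : trace("<c:proj xmlns:c='urn:example:condos'/>")) {
      System.out.println(event);
    }
  }
}
```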

The next method, ignorableWhitespace(), is similar to characters(), except that it reports whitespace in element content that the application can safely ignore:

  public void ignorableWhitespace(char[] ch, int start, int length) {
    System.out.println("Received whitespace.");
  }

Next on the method agenda is processingInstruction(), which reports processing instructions to the content handler. For example, a stylesheet can be associated with an XML document using the following processing instruction:

<?xml-stylesheet href="mystyle.css" type="text/css"?>

The method that handles such instructions is

  public void processingInstruction(String target, String data) {
    System.out.println("Received processing instruction:");
    System.out.println("Target: " + target);
    System.out.println("Data: " + data);
  }
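For a quick check of what the target and data arguments actually contain, the following sketch parses a tiny document that begins with the stylesheet processing instruction shown earlier. The class name ProcessingInstructionSketch and the instructions() method are hypothetical, and the parser is obtained through JAXP rather than by naming Xerces directly.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name; not part of the Document Printer program.
class ProcessingInstructionSketch {

  // Parses XML text and returns each processing instruction as "target -> data".
  static List<String> instructions(String xml) {
    final List<String> found = new ArrayList<>();
    try {
      XMLReader reader =
          SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      reader.setContentHandler(new DefaultHandler() {
        @Override
        public void processingInstruction(String target, String data) {
          // The target is the first name; the data is everything after it.
          found.add(target + " -> " + data);
        }
      });
      reader.parse(new InputSource(new StringReader(xml)));
    } catch (ParserConfigurationException | SAXException | IOException ex) {
      throw new IllegalStateException("Could not parse document", ex);
    }
    return found;
  }

  public static void main(String[] args) {
    String xml =
        "<?xml-stylesheet href=\"mystyle.css\" type=\"text/css\"?>"
        + "<projects/>";
    for (String instruction : instructions(xml)) {
      System.out.println(instruction);
    }
  }
}
```

Notice that the data arrives as one unparsed string. The pseudo-attributes href and type are not split out by the parser, so a program that needs them has to pick the string apart itself.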

The last method you need to be concerned with is setDocumentLocator(). Nothing is output by this method in this program, but I'll explain what its purpose is anyway. A parser that supports locators calls setDocumentLocator() once, before it reports any other event, and passes in a Locator object. During each later event, that same Locator object can tell you where in the document the content currently being processed is located. Here's the "do nothing" source code for the method:

  public void setDocumentLocator(Locator locator) { }

The methods of a Locator object are described in Table 17.2.

Table 17.2. The Methods of a Locator Object

Method             Purpose
-----------------  ------------------------------------------------------------
getColumnNumber()  Returns the column number of the current position in the document being parsed
getLineNumber()    Returns the line number of the current position in the document being parsed
getPublicId()      Returns the public identifier of the current document event
getSystemId()      Returns the system identifier of the current document event


Because the sample program doesn't concern itself with the specifics of locators, none of these methods are actually used. However, it's good for you to know about them in case you need to develop a program that somehow is interested in locators.
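If you ever do need locators, a handler typically keeps the Locator it is given and consults it from its other event methods. The sketch below does that to note the line on which each element starts. The class name LocatorSketch and the positions() method are hypothetical, the parser is obtained through JAXP, and the code checks for a missing locator because a parser isn't required to supply one.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name; not part of the Document Printer program.
class LocatorSketch extends DefaultHandler {
  private Locator locator; // Stays null if the parser supplies no locator.
  private final List<String> positions = new ArrayList<>();

  @Override
  public void setDocumentLocator(Locator locator) {
    this.locator = locator; // Keep it for use during later events.
  }

  @Override
  public void startElement(String namespaceURI, String localName,
      String qName, Attributes atts) {
    // The locator reports the parser's position for the current event.
    int line = (locator != null) ? locator.getLineNumber() : -1;
    positions.add(qName + " at line " + line);
  }

  // Parses XML text and returns one "name at line N" entry per element.
  static List<String> positions(String xml) {
    LocatorSketch handler = new LocatorSketch();
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
    } catch (ParserConfigurationException | SAXException | IOException ex) {
      throw new IllegalStateException("Could not parse document", ex);
    }
    return handler.positions;
  }

  public static void main(String[] args) {
    for (String position : positions("<projects>\n  <proj/>\n</projects>")) {
      System.out.println(position);
    }
  }
}
```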

Implementing the ErrorHandler Interface

I mentioned earlier that the DocumentPrinter class implements two interfaces, ContentHandler and ErrorHandler. Let's look at the methods that are used to implement the ErrorHandler interface. There are three types of errors that a SAX parser can generate: errors, fatal errors, and warnings. Classes that implement the ErrorHandler interface must provide methods to handle all three types of errors. Here's the source code for the three methods:

  public void error(SAXParseException exception) { }
  public void fatalError(SAXParseException exception) { }
  public void warning(SAXParseException exception) { }

As you can see, each of the three methods accepts the same argument, a SAXParseException object. The only difference between them is that they are called under different circumstances. To keep things simple, the sample program doesn't output any error notifications. For the sake of completeness, the methods that SAXParseException provides for pinpointing where a problem occurred appear in Table 17.3.

Table 17.3. Methods of the SAXParseException Class

Method             Purpose
-----------------  ------------------------------------------------------------
getColumnNumber()  Returns the column number of the end of the text where the problem occurred
getLineNumber()    Returns the line number of the end of the text where the problem occurred
getPublicId()      Returns the public identifier of the entity where the problem occurred
getSystemId()      Returns the system identifier of the entity where the problem occurred


Similar to the Locator methods, these methods aren't used in the Document Printer sample program, so you don't have to worry about the ins and outs of how they work.
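Should you want error notifications after all, the three ErrorHandler methods could be fleshed out along the following lines. This is only a sketch under a few assumptions: the class name ErrorReportSketch and the check() method are made up, the parser is obtained through JAXP, and the deliberately malformed test document is invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical class name; not part of the Document Printer program.
class ErrorReportSketch implements ErrorHandler {
  private final List<String> reports = new ArrayList<>();

  private void record(String level, SAXParseException ex) {
    reports.add(level + " at line " + ex.getLineNumber()
        + ", column " + ex.getColumnNumber() + ": " + ex.getMessage());
  }

  public void warning(SAXParseException exception) {
    record("warning", exception);
  }

  public void error(SAXParseException exception) {
    record("error", exception);
  }

  public void fatalError(SAXParseException exception) throws SAXException {
    record("fatal", exception);
    throw exception; // Parsing can't continue reliably after a fatal error.
  }

  // Parses XML text and returns a description of each problem reported.
  static List<String> check(String xml) {
    ErrorReportSketch handler = new ErrorReportSketch();
    try {
      XMLReader reader =
          SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      reader.setErrorHandler(handler);
      reader.parse(new InputSource(new StringReader(xml)));
    } catch (SAXException ex) {
      // A parse failure has already been recorded by the handler above.
    } catch (ParserConfigurationException | IOException ex) {
      throw new IllegalStateException("Could not run the parser", ex);
    }
    return handler.reports;
  }

  public static void main(String[] args) {
    // The <b> element is never closed, so the document is not well formed.
    for (String report : check("<a><b></a>")) {
      System.out.println(report);
    }
  }
}
```

The exact wording of each message comes from the parser, so it will differ from one parser to the next.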

Testing the Document Printer Program

Now that you understand how the code works in the Document Printer sample program, let's take it for a test drive one more time. This time around, you're running the program to parse the condos.xml sample document from the previous tutorial. Here's an excerpt from that document in case it's already gotten a bit fuzzy in your memory:

  <proj status="active">
    <location lat="36.122238" long="-86.845028" />
    <description>
      <name>Woodmont Close</name>
      <address>131 Woodmont Blvd.</address>
      <address2>Nashville, TN 37205</address2>
      <img>condowc.jpg</img>
    </description>
  </proj>

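And here's the command required to run this document through the Document Printer program (the semicolons that separate the classpath entries are the Windows convention; on Linux or Mac OS X, use colons instead):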

java -classpath xercesImpl.jar;xml-apis.jar;. DocumentPrinter condos.xml

Finally, Listing 17.2 contains the output of the Document Printer program after feeding it the condominium map data stored in the condos.xml document.

Listing 17.2. The Output of the Document Printer Example Program After Processing the condos.xml Document
 1:  Start document.
 2:  Start element: projects
 3:  Start element: proj
 4:  Start element: location
 5:  End of element: location
 6:  Start element: description
 7:  Start element: name
 8:  Received characters: Woodmont Close
 9:  End of element: name
10:  Start element: address
11:  Received characters: 131 Woodmont Blvd.
12:  End of element: address
13:  Start element: address2
14:  Received characters: Nashville, TN 37205
15:  End of element: address2
16:  Start element: img
17:  Received characters: condowc.jpg
18:  End of element: img
19:  End of element: description
20:  End of element: proj
21:  ...
22:  Start element: proj
23:  Start element: location
24:  End of element: location
25:  Start element: description
26:  Start element: name
27:  Received characters: Harding Hall
28:  End of element: name
29:  Start element: address
30:  Received characters: 2120 Harding Pl.
31:  End of element: address
32:  Start element: address2
33:  Received characters: Nashville, TN 37215
34:  End of element: address2
35:  Start element: img
36:  Received characters: condohh.jpg
37:  End of element: img
38:  End of element: description
39:  End of element: proj
40:  End of element: projects
41:  End of document reached.

The excerpt from the condos.xml document that you saw a moment ago corresponds to the first proj element in the XML document. Lines 3 through 20 show how the Document Printer program parses and displays detailed information for this element and all of its content.