

An XML Document as a Series of Events


Everyone knows how to work with a graphical user interface such as Windows, Mac OS, or Motif. In these environments, the OS captures all kinds of events: mouse movements and clicks, keyboard input, and so on. Subsequently, the OS sends messages to the program indicating which event has occurred. It is the responsibility of the programmer of the application to code the responses to those events.

An analogy can be drawn for XML files. When you read an XML file in a sequential way, all kinds of events happen. You encounter the start of a document, you read start and end tags, you encounter comments and processing instructions, and so on. All of these can be looked at as data events.

It is the responsibility of the programmer to code the responses to these events. These responses are called event handlers.

An event handler is the code that is executed when an event has occurred.

For the file musicians.xml (Listing 14.1), an event-based processor will generate the events listed in Listing 14.7.

Listing 14.7 Events Generated by the musicians.xml File
Start document
Start element: musicians
Start element: musician
Start element: name
Characters: Joey Baron
End element: name
Start element: instrument
Characters: drums
End element: instrument
Start element: NrOfRecordings
Characters: 1
End element: NrOfRecordings
End element: musician
...
End element: musicians
End document
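
As a minimal sketch of how such an event stream can be produced, the following Python code uses the standard library's xml.sax module. Listing 14.1 is not reproduced in this section, so the SAMPLE document is an assumption reconstructed from the events shown above, and the names EventLogger and log_events are invented for illustration only.

```python
import xml.sax

# Assumed stand-in for musicians.xml (Listing 14.1 is not shown here).
# It is written without whitespace between tags so that no
# whitespace-only character events are generated.
SAMPLE = (
    b"<musicians><musician>"
    b"<name>Joey Baron</name>"
    b"<instrument>drums</instrument>"
    b"<NrOfRecordings>1</NrOfRecordings>"
    b"</musician></musicians>"
)


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records every event as a line of text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("Start document")

    def endDocument(self):
        self.events.append("End document")

    def startElement(self, name, attrs):
        self.events.append(f"Start element: {name}")

    def endElement(self, name):
        self.events.append(f"End element: {name}")

    def characters(self, content):
        # Skip whitespace-only text; a parser may also deliver longer
        # text in several chunks, which this sketch does not merge.
        if content.strip():
            self.events.append(f"Characters: {content}")


def log_events(xml_bytes):
    """Parse the document and return the list of recorded events."""
    handler = EventLogger()
    xml.sax.parseString(xml_bytes, handler)
    return handler.events


if __name__ == "__main__":
    for event in log_events(SAMPLE):
        print(event)
```

Each method of the handler is an event handler in the sense defined earlier: the parser calls it when the corresponding event occurs, and the programmer decides what the response is.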

It is important to see that the XML document is traversed in document order: depth-first, with the children of each element visited from left to right.

Consider the XML structure shown in Figure 14.1.
Figure 14.1 The XML tree.

Figure 14.2 indicates the order in which the different nodes of the tree are traversed.
Figure 14.2 The order of traversal.

During traversal, most event-based processors store information about nodes that have already been traversed.

In event-driven mode, Balise (a product from the company AIS) keeps track of the element types and attributes of the ancestor nodes up to the root, and of the first-left siblings of these ancestor nodes, as indicated in Figure 14.3.
Figure 14.3 The context information that is kept.

Figure 14.3 shows which other nodes are being tracked for node number 12.

The context information that is kept can differ from product to product. You can expect that at least the information on previous siblings within the same parent element and on all ancestors is stored.

Because some context is recorded (however limited), you can write event handlers that depend on it:

Start element: para, first after title.
In this way, you can define a different response if your para element comes first after title.
Start element: li, with an ancestor of ol and an attribute of type with a value of i.
Here also, you can write a specific response if this condition occurs.
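
The two conditions above can be sketched with a handler that keeps its own context. This is only an illustration in Python's xml.sax, not the mechanism used by Balise or any other product. The class name ContextHandler, the function find_matches, and the match labels are hypothetical, and the sketch assumes that the type attribute sits on the ol ancestor rather than on the li element itself.

```python
import xml.sax


class ContextHandler(xml.sax.ContentHandler):
    """Hypothetical handler that tracks ancestors and previous siblings."""

    def __init__(self):
        super().__init__()
        # Open elements from the root down: (name, attributes) pairs.
        self.ancestors = []
        # One entry per open level: name of the most recently closed child.
        self.last_closed = [None]
        self.matches = []

    def startElement(self, name, attrs):
        previous_sibling = self.last_closed[-1]
        if name == "para" and previous_sibling == "title":
            self.matches.append("para first after title")
        # Assumption: the type="i" attribute belongs to the ol ancestor.
        if name == "li" and any(
            anc_name == "ol" and anc_attrs.get("type") == "i"
            for anc_name, anc_attrs in self.ancestors
        ):
            self.matches.append("li inside ol with type i")
        self.ancestors.append((name, dict(attrs.items())))
        self.last_closed.append(None)

    def endElement(self, name):
        self.ancestors.pop()
        self.last_closed.pop()
        # Remember this element as the previous sibling of what follows.
        self.last_closed[-1] = name


def find_matches(xml_bytes):
    """Parse the document and return the conditions that were met."""
    handler = ContextHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.matches
```

Note that the handler only ever looks backward: everything it consults was recorded from events that had already arrived.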

On the other hand, you cannot answer questions that require looking further in the data stream. For example:

Is an element the last child element of its parent element?
You can know this only after having processed the element, not when you receive the start element event.
Does this element have an element in it that has an attribute with the name experience and the value firsttime?
You can only know this after processing the element.
There are solutions, of course, but they require a second pass.
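
One possible two-pass approach to the "last child" question is sketched below in Python's xml.sax. The first pass numbers the elements in the order of their start events and records which numbers belong to last children; the second pass can then answer the question at the moment the start element event arrives. The class and function names are hypothetical, and the sketch assumes that the document can be read twice.

```python
import xml.sax


class LastChildIndexer(xml.sax.ContentHandler):
    """Pass 1: collect the sequence numbers of all last children."""

    def __init__(self):
        super().__init__()
        self.counter = 0
        # One entry per open level: number of the latest child seen so far.
        self.latest_child = [None]
        self.last_children = set()

    def startElement(self, name, attrs):
        self.counter += 1
        self.latest_child[-1] = self.counter
        self.latest_child.append(None)

    def endElement(self, name):
        # When a parent closes, its latest child is known to be its last.
        latest = self.latest_child.pop()
        if latest is not None:
            self.last_children.add(latest)


class LastChildReporter(xml.sax.ContentHandler):
    """Pass 2: report, at start-element time, whether an element is last."""

    def __init__(self, last_children):
        super().__init__()
        self.last_children = last_children
        self.counter = 0
        self.report = []

    def startElement(self, name, attrs):
        self.counter += 1
        self.report.append((name, self.counter in self.last_children))


def report_last_children(xml_bytes):
    """Run both passes; the root has no parent and is reported as False."""
    indexer = LastChildIndexer()
    xml.sax.parseString(xml_bytes, indexer)
    reporter = LastChildReporter(indexer.last_children)
    xml.sax.parseString(xml_bytes, reporter)
    return reporter.report
```

The price of the second pass is that the document is read twice and that the set of recorded numbers must be kept in memory between the passes.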

The advantages of event-based processing are:

It's simple.
It works fast.
It doesn't consume a lot of memory.

The main disadvantage is:

It's impossible to look ahead.

Two implementations are mentioned here:

Omnimark
SAX

Omnimark is the market leader for doing heavy conversions in the SGML community. Recently it has been XML-enabled, and a free (although restricted) version called Omnimark LE is available on the Web at http://www.omnimark.com/develop/omle40/index.html.

Omnimark will be covered in more depth in Day 15, "Event-Driven Programming."

SAX stands for Simple API for XML. It came about after Peter Murray-Rust, one of the early adopters of XML, complained on the XML developers' mailing list that whenever he wanted to change the parser coupled to his XML browser, JUMBO, he had to rewrite code because the APIs of the different parsers differed. The question raised was, "While waiting for the API defined by the W3C (DOM), can we agree on a simple event-based API?"



This was publicly discussed on the XML-DEV list, and lots of people contributed. It was David Megginson who wrote the SAX proposal, together with its implementation in Java.

SAX can be found at http://www.megginson.com/SAX/index.html. It is also a subject of study on Day 15.
