Reusable SAX Parsers


Overview

This page suggests a simple framework for reusable SAX parsers. The framework (one abstract class) coordinates a set of small SAX handlers, each of which understands a particular section of an XML document and generates one or more Java objects to hold the data. When standardized substructures, such as party name and address, are reused in multiple documents, the parsers and objects can be reused as well.

Jim Standley

SAX Background

The SAX API invites us to write a handler for SAX events. The SAX parser calls the handler when it finds the beginning of a tag block, any attributes and characters within the tag, and the end tag. (There are numerous other events not covered here.) This is a great model if we’re going to process everything in the document and don’t want the parser to build a large Document Object Model in memory. It is far simpler than "walking the DOM" to enumerate the complete contents of a document.

For the simplest text extraction the handler can initialize a StringBuffer in startElement(), append characters to the buffer in characters() and save the buffer to a variable in endElement(). endElement() usually involves "case" logic built of if-else tests.  (Note: the "append" in characters is important.  SAX does not guarantee to send all adjacent characters in one call, and could call once for each character!)

Here are methods from a perfectly serviceable handler that takes data out of XML and stores it in an object.  (See Jim'sCodingStyle for an explanation of the variable name prefixes.)

public void startElement( String aURI,
                          String aName,
                          String aQName,
                          Attributes aAtts)
{
    mTag   = aName;
    mAttrs = new AttributesImpl( aAtts );
    mText  = "";
}

public void characters( char aChars[], int aStart, int aLength )
{
    // you might add a strategy for handling special characters here
    mText += new String( aChars, aStart, aLength );
}

public void endElement( String aURI, String aName, String aQName )
{
    if ( aName.equals( "x" ) )
        mObject.fieldX = mText;
    else
    if ( aName.equals( "y" ) )
        mObject.fieldY = mText;
}
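
For readers who want to run the fragment above end to end, here is a minimal sketch. The Target class, the sample XML and the use of the JAXP SAXParserFactory are my assumptions for illustration, not part of the original handler. It matches on the qualified name because the default JAXP factory is not namespace-aware, and it uses a StringBuilder where the fragment uses string concatenation.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SimpleExtractDemo {
    // Hypothetical target object; the field names mirror the fragment above.
    static class Target {
        String fieldX;
        String fieldY;
    }

    static class TargetHandler extends DefaultHandler {
        final Target mObject = new Target();
        private final StringBuilder mText = new StringBuilder();

        @Override
        public void startElement(String aURI, String aName, String aQName, Attributes aAtts) {
            mText.setLength(0); // start collecting text for the new element
        }

        @Override
        public void characters(char[] aChars, int aStart, int aLength) {
            // Append: the parser may deliver one run of text in several calls.
            mText.append(aChars, aStart, aLength);
        }

        @Override
        public void endElement(String aURI, String aName, String aQName) {
            // qName is used because the default JAXP factory is not namespace-aware.
            if (aQName.equals("x")) {
                mObject.fieldX = mText.toString();
            } else if (aQName.equals("y")) {
                mObject.fieldY = mText.toString();
            }
        }
    }

    // Parses an XML string and returns the populated target object.
    static Target parse(String aXml) {
        try {
            TargetHandler lHandler = new TargetHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(aXml)), lHandler);
            return lHandler.mObject;
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
    }

    public static void main(String[] aArgs) {
        Target lTarget = parse("<point><x>1 &amp; 2</x><y>3</y></point>");
        System.out.println(lTarget.fieldX + " / " + lTarget.fieldY);
    }
}
```

The entity reference in the sample text is there on purpose: parsers commonly split text around entities into several characters() calls, which is exactly the case the append protects against.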

But I found that when my XML grew beyond trivial structures, the logic in endElement() got out of hand. Consider the structure below, used in the configuration file for a program that monitors the health of web sites by fetching pages at regular intervals.

<CONFIGURATION>
   <MAIL></MAIL>
   <SCHEDULE></SCHEDULE>
   <SITE>
      <SCHEDULE></SCHEDULE>
   </SITE>
</CONFIGURATION>

Configuration is the root tag for the document. Mail specifies an SMTP server, return address, subject and a list of e-mail addresses to receive error notifications. One mail section is permitted as a direct child of Configuration. Schedule sets the repeat interval and start and stop times. Start and stop may apply to the whole week, or a specified day. As a child of Configuration it sets a default schedule for any following sites that do not specify their own schedules. As a child of Site it overrides the default for that site only. Site gives the URL, maximum response time and other success criteria. 

Now imagine the endElement() method in a SAX handler for this document. It must test the tag name against every possible tag in the document. When handling a schedule, it must know if it’s working on the default schedule or a schedule for a particular site. Start time for the whole schedule is processed differently than start time for a particular day of the week. I wrote the configuration reader this way the first time and something about a 50-way case statement with added branches for various states and attributes raised a red flag.

Small & Simple Handlers

Thinking about how to simplify things, I happened to doodle an object model of my configuration as shown below. The arrows indicate dependencies. The Configuration holds Mail information, the default Schedule, and a list of Sites. Each Site has a Schedule, either a copy of the default or a customized Schedule.

I thought why not make a handler for each class? A small handler could parse a subset of the XML and build one object in the configuration graph. This design gave me a set of very simple handlers. The Configuration handler does almost nothing but delegate control to other handlers for Mail, Schedule or Site tags. The Mail and Schedule handlers extract text from a few tags and set variables as shown above. The Site handler is a bit more interesting because it might find a Schedule tag and delegate to the Schedule handler.

The multiple-handler idea needed two things to make it work: A mechanism to delegate and return control between handlers, and a mechanism to build the Configuration object graph as it progresses through the XML.

Handler Framework

After a false start with a central driver that maintained a stack of handlers, I settled on a doubly-linked-list chain of handlers. An abstract base handler extends the default content handler, and implements the framework. All concrete handlers extend the base. Each handler has a reference to its "parent" in the chain, and can return control when done. Here are the base class methods that manage moving up and down the chain. (See Source Code for the complete source.)

// Constructor for the first handler in the chain only.
// It has no parent.
public ChainedHandler( XMLReader aReader )
{
    mReader = aReader;
}

// Constructor for all other handlers
public ChainedHandler( ChainedHandler aParent )
{
    mParent = aParent;
    mReader = mParent.getReader();
}

// Make this handler the receiver of subsequent SAX events
public void startHandlingEvents()
{
    mReader.setContentHandler( this );
}

// Return control to the parent, if there is one.
// Protected so that derived handlers can call it at their own end tag.
protected void stopHandlingEvents()
{
    if ( null != mParent )
    {
        mReader.setContentHandler( mParent );
    }
}

The second constructor takes a parent ChainedHandler as an argument. It saves the parent and calls getReader() to get the XMLReader. The next crucial method on the base class is startHandlingEvents(). In this method we call setContentHandler() on the reader, passing "this" handler, so that all following events arrive here. Finally, in the stopHandlingEvents() method we call setContentHandler() again to hand control back to the parent handler.

The constructor was separated from startHandlingEvents() with the idea that one might want to create a handler, configure it and keep it around before or after delegating control to it, or use it multiple times.
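
To watch control move down and back up the chain, here is a small runnable sketch. It is a cut-down stand-in for the real base class, not the published source: the label field, the static trace list, and the tag names "child", "a" and "b" are all hypothetical, added only so the hand-off is visible. It relies on the SAX rule that a content handler registered in the middle of a parse takes effect immediately.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ChainDemo {
    // Trace of "handler:tag" pairs, kept static only to keep the sketch short.
    static final List<String> TRACE = new ArrayList<>();

    static class ChainedHandler extends DefaultHandler {
        private final ChainedHandler mParent;
        private final XMLReader mReader;
        private final String mLabel; // hypothetical, used only for tracing

        ChainedHandler(XMLReader aReader, String aLabel) {
            mParent = null; mReader = aReader; mLabel = aLabel;
        }
        ChainedHandler(ChainedHandler aParent, String aLabel) {
            mParent = aParent; mReader = aParent.mReader; mLabel = aLabel;
        }
        void startHandlingEvents() { mReader.setContentHandler(this); }
        void stopHandlingEvents() {
            if (mParent != null) mReader.setContentHandler(mParent);
        }
        @Override
        public void startElement(String aURI, String aName, String aQName, Attributes aAtts) {
            TRACE.add(mLabel + ":" + aQName); // record who saw the start tag
        }
    }

    // First handler in the chain: hands <child> sections to a ChildHandler.
    static class RootHandler extends ChainedHandler {
        RootHandler(XMLReader aReader) { super(aReader, "root"); }
        @Override
        public void startElement(String aURI, String aName, String aQName, Attributes aAtts) {
            super.startElement(aURI, aName, aQName, aAtts);
            if (aQName.equals("child")) new ChildHandler(this).startHandlingEvents();
        }
    }

    // Sub-handler: returns control when it sees its own end tag.
    static class ChildHandler extends ChainedHandler {
        ChildHandler(ChainedHandler aParent) { super(aParent, "child"); }
        @Override
        public void endElement(String aURI, String aName, String aQName) {
            if (aQName.equals("child")) stopHandlingEvents();
        }
    }

    static List<String> run(String aXml) {
        try {
            TRACE.clear();
            XMLReader lReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            new RootHandler(lReader).startHandlingEvents();
            lReader.parse(new InputSource(new StringReader(aXml)));
            return new ArrayList<>(TRACE);
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
    }

    public static void main(String[] aArgs) {
        System.out.println(run("<root><child><a/></child><b/></root>"));
    }
}
```

In the trace, the root handler sees the child start tag (that is where it delegates), the child handler sees everything inside, and the root handler is back in charge for whatever follows the child end tag.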

The derived classes override the SAX event methods as needed. Here is the startElement() method for the Configuration handler. It delegates all of the child tags in the XML structure and expresses confusion when it finds anything unexpected. If you’re into patterns, this is a good candidate for a handler factory.

public void startElement( String aURI,
                          String aName,
                          String aQName,
                          Attributes aAtts )
{
    // Tag names are compared in lower case
    String lName = aName.toLowerCase();
    if ( lName.equals( MailHandler.MAIL ) )
        new MailHandler( this ).startHandlingEvents();
    else
    if ( lName.equals( ScheduleHandler.SCHEDULE ) )
        new ScheduleHandler( this ).startHandlingEvents();
    else
    if ( lName.equals( SiteHandler.SITE ) )
        new SiteHandler( this ).startHandlingEvents();
    else
        System.out.println( "Unexpected tag: " + aName );
}
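
The handler factory mentioned above could look something like the following sketch. This is one possible shape, not the article's code: it assumes a much newer Java than the JDK 1.4 the samples were built with (lambdas and java.util.function.Function), and SectionHandler is a hypothetical stand-in for the real Mail, Schedule and Site handlers. The if-else chain becomes a map from tag name to a function that builds the right child handler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class FactoryDemo {
    static class ChainedHandler extends DefaultHandler {
        private final ChainedHandler mParent;
        private final XMLReader mReader;
        ChainedHandler(XMLReader aReader) { mParent = null; mReader = aReader; }
        ChainedHandler(ChainedHandler aParent) { mParent = aParent; mReader = aParent.mReader; }
        void startHandlingEvents() { mReader.setContentHandler(this); }
        void stopHandlingEvents() {
            if (mParent != null) mReader.setContentHandler(mParent);
        }
    }

    // Hypothetical stand-in for MailHandler, ScheduleHandler and SiteHandler:
    // it only waits for its own end tag and then returns control.
    static class SectionHandler extends ChainedHandler {
        private final String mTag;
        SectionHandler(ChainedHandler aParent, String aTag) { super(aParent); mTag = aTag; }
        @Override
        public void endElement(String aURI, String aName, String aQName) {
            if (aQName.equalsIgnoreCase(mTag)) stopHandlingEvents();
        }
    }

    static class ConfigurationHandler extends ChainedHandler {
        // The factory: lower-case tag name -> function that builds a child handler.
        private final Map<String, Function<ChainedHandler, ChainedHandler>> mFactory = new HashMap<>();
        final List<String> mTrace = new ArrayList<>();

        ConfigurationHandler(XMLReader aReader) {
            super(aReader);
            for (String lTag : new String[] {"mail", "schedule", "site"}) {
                mFactory.put(lTag, aParent -> new SectionHandler(aParent, lTag));
            }
        }

        @Override
        public void startElement(String aURI, String aName, String aQName, Attributes aAtts) {
            String lName = aQName.toLowerCase();
            Function<ChainedHandler, ChainedHandler> lMaker = mFactory.get(lName);
            if (lMaker != null) {
                mTrace.add("delegated:" + lName);
                lMaker.apply(this).startHandlingEvents();
            } else if (!lName.equals("configuration")) {
                mTrace.add("unexpected:" + lName); // the root tag itself is expected
            }
        }
    }

    static List<String> run(String aXml) {
        try {
            XMLReader lReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            ConfigurationHandler lRoot = new ConfigurationHandler(lReader);
            lRoot.startHandlingEvents();
            lReader.parse(new InputSource(new StringReader(aXml)));
            return lRoot.mTrace;
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
    }

    public static void main(String[] aArgs) {
        System.out.println(run("<CONFIGURATION><MAIL/><SCHEDULE/>"
                + "<SITE><SCHEDULE/></SITE><BOGUS/></CONFIGURATION>"));
    }
}
```

Adding support for a new section then means registering one more entry in the map, and startElement() itself never changes.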

The delegated handlers consume their own end tags, so the Configuration handler receives only the end tag for the whole Configuration.  The various handlers keep the configuration object graph up to date as they go, so there is nothing to do at the end of the document, and the Configuration handler doesn’t even require endElement() or endDocument() methods.

The Schedule and Mail handlers don’t have startElement() or characters() methods at all. The base handler provides a sufficient default startElement() method that saves the tag name and a clone of the attribute list (through a copy constructor) and resets the string buffer to an empty string. The default characters() method simply appends characters to the string buffer.  These default methods are shown in the first code sample.

The Site handler has a startElement() method because it has to take action when it finds a nested Schedule tag. It delegates Schedule tags to the Schedule handler exactly as the Configuration handler does. For all other tags it simply calls super.startElement().

The endElement() methods on these handlers fill variables with the text collected within each tag, just as described above. For example, the Mail handler checks for MailTo, MailFrom and SMTPServer tags and sets similarly named variables. The Mail handler has one extra twist because it allows multiple MailTo tags and formats a comma-delimited list of addresses.

The sub-handlers check for their own end tags, i.e. Mail, Site and Schedule. When they find their end tags they construct new configuration objects from all the variables they gathered and call stopHandlingEvents() to return control to the parent.
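
A sketch of what such a sub-handler might look like, using the Mail handler as the example. The Mail value class, the trimming of collected text, and the small root handler that keeps a reference to its child are assumptions made so the example runs on its own; the real handlers are not reproduced here. The base class supplies the default startElement() and characters() methods, so the Mail handler needs only endElement().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class MailDemo {
    // Hypothetical value object built by the Mail handler.
    static class Mail {
        final String mServer, mFrom, mTo;
        Mail(String aServer, String aFrom, String aTo) {
            mServer = aServer; mFrom = aFrom; mTo = aTo;
        }
    }

    static class ChainedHandler extends DefaultHandler {
        private final ChainedHandler mParent;
        private final XMLReader mReader;
        protected final StringBuilder mText = new StringBuilder();
        ChainedHandler(XMLReader aReader) { mParent = null; mReader = aReader; }
        ChainedHandler(ChainedHandler aParent) { mParent = aParent; mReader = aParent.mReader; }
        void startHandlingEvents() { mReader.setContentHandler(this); }
        void stopHandlingEvents() {
            if (mParent != null) mReader.setContentHandler(mParent);
        }
        @Override
        public void startElement(String aURI, String aName, String aQName, Attributes aAtts) {
            mText.setLength(0); // default: reset the text buffer
        }
        @Override
        public void characters(char[] aChars, int aStart, int aLength) {
            mText.append(aChars, aStart, aLength); // default: collect text
        }
    }

    static class MailHandler extends ChainedHandler {
        static final String MAIL = "mail";
        private String mServer, mFrom;
        private final List<String> mTo = new ArrayList<>();
        Mail mMail; // filled in when the Mail end tag arrives

        MailHandler(ChainedHandler aParent) { super(aParent); }

        @Override
        public void endElement(String aURI, String aName, String aQName) {
            String lName = aQName.toLowerCase();
            String lText = mText.toString().trim();
            if (lName.equals("smtpserver")) mServer = lText;
            else if (lName.equals("mailfrom")) mFrom = lText;
            else if (lName.equals("mailto")) mTo.add(lText); // may repeat
            else if (lName.equals(MAIL)) {
                // Own end tag: build the object, then return control.
                mMail = new Mail(mServer, mFrom, String.join(",", mTo));
                stopHandlingEvents();
            }
        }
    }

    // Minimal root that delegates Mail tags and remembers the child.
    static class RootHandler extends ChainedHandler {
        MailHandler mMailHandler;
        RootHandler(XMLReader aReader) { super(aReader); }
        @Override
        public void startElement(String aURI, String aName, String aQName, Attributes aAtts) {
            if (aQName.equalsIgnoreCase(MailHandler.MAIL)) {
                mMailHandler = new MailHandler(this);
                mMailHandler.startHandlingEvents();
            }
        }
    }

    static Mail parse(String aXml) {
        try {
            XMLReader lReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            RootHandler lRoot = new RootHandler(lReader);
            lRoot.startHandlingEvents();
            lReader.parse(new InputSource(new StringReader(aXml)));
            return lRoot.mMailHandler.mMail;
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
    }
}
```

Calling MailDemo.parse() on a configuration document with one Mail section returns the Mail object, with all MailTo addresses joined by commas. The sample input must contain a Mail section, since the sketch does no checking for a missing one.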

The Configuration Graph

Now the handlers are working their way through the XML and generating objects. How can they glue these objects together to construct the Configuration object graph? I considered building a static (poorly disguised global) anchor for the graph that any handler could modify, or passing parts of the graph to handlers as parameters in custom constructors. Both of these options gave the handlers too much dependency on the graphs they were building. Handlers would have to know the graph from the anchor up, or at least know how to fit their new objects into an existing partial graph.

Instead I added a pair of setAttribute() callback methods to the base handler. A child can call its parent, pushing a single object or an array of objects it has created from the XML. The parent naturally knows how to incorporate the child objects into its own.  This fragment shows the setAttribute() methods from the Configuration handler.

public void setAttribute( String aName, Object aValue )
{
    if ( aName.equals( ScheduleHandler.SCHEDULE ) )
        setDefaultSchedule( (Schedule)aValue );
}
public void setAttribute( String aName, Object[] aValues )
{
    if ( aName.equals( SiteHandler.SITE ) )
        addSiteAndSchedule( (Site)aValues[0], (Schedule)aValues[1] );
}

The parameters for the setAttribute() methods include an attribute name and either an object or an array of objects. The names are coded as static constants on the handlers that create the objects. The Schedule handler calls back to the Configuration handler with the name "schedule" and a newly created Schedule object. The Configuration handler saves the Schedule as the current default. The Site handler calls back with the name "site" and two objects: a Site and a Schedule. The Configuration handler establishes the relationship between the two and adds them to a list of sites.

Each handler knows about any child handlers that it may have to invoke, and the objects that those children may return. But no handler knows anything about the parents that call it or how those parents might make use of the objects it generates. The dependency graph for the handlers matches that of the configuration classes (with the addition that all handlers extend the same base class). I take this as a good sign.
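
The child's side of the callback is not shown above, so here is a sketch of how the two ends might fit together. The Schedule class with a single interval field, the "interval" tag, and the getParent() accessor are hypothetical; only the setAttribute() idea and the "schedule" name come from the description above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class CallbackDemo {
    // Hypothetical, much-reduced Schedule object.
    static class Schedule {
        final int mIntervalMinutes;
        Schedule(int aIntervalMinutes) { mIntervalMinutes = aIntervalMinutes; }
    }

    static class ChainedHandler extends DefaultHandler {
        private final ChainedHandler mParent;
        private final XMLReader mReader;
        protected final StringBuilder mText = new StringBuilder();
        ChainedHandler(XMLReader aReader) { mParent = null; mReader = aReader; }
        ChainedHandler(ChainedHandler aParent) { mParent = aParent; mReader = aParent.mReader; }
        ChainedHandler getParent() { return mParent; }
        void startHandlingEvents() { mReader.setContentHandler(this); }
        void stopHandlingEvents() {
            if (mParent != null) mReader.setContentHandler(mParent);
        }
        // Callback for children; the default ignores whatever is pushed.
        public void setAttribute(String aName, Object aValue) { }
        @Override
        public void startElement(String aURI, String aName, String aQName, Attributes aAtts) {
            mText.setLength(0);
        }
        @Override
        public void characters(char[] aChars, int aStart, int aLength) {
            mText.append(aChars, aStart, aLength);
        }
    }

    static class ScheduleHandler extends ChainedHandler {
        static final String SCHEDULE = "schedule";
        private int mInterval;
        ScheduleHandler(ChainedHandler aParent) { super(aParent); }
        @Override
        public void endElement(String aURI, String aName, String aQName) {
            String lName = aQName.toLowerCase();
            if (lName.equals("interval")) {
                mInterval = Integer.parseInt(mText.toString().trim());
            } else if (lName.equals(SCHEDULE)) {
                // Push the finished object up; the child knows nothing
                // about what the parent does with it.
                getParent().setAttribute(SCHEDULE, new Schedule(mInterval));
                stopHandlingEvents();
            }
        }
    }

    static class ConfigurationHandler extends ChainedHandler {
        Schedule mDefaultSchedule;
        ConfigurationHandler(XMLReader aReader) { super(aReader); }
        @Override
        public void startElement(String aURI, String aName, String aQName, Attributes aAtts) {
            if (aQName.equalsIgnoreCase(ScheduleHandler.SCHEDULE)) {
                new ScheduleHandler(this).startHandlingEvents();
            } else {
                super.startElement(aURI, aName, aQName, aAtts);
            }
        }
        @Override
        public void setAttribute(String aName, Object aValue) {
            if (aName.equals(ScheduleHandler.SCHEDULE)) mDefaultSchedule = (Schedule) aValue;
        }
    }

    static Schedule parse(String aXml) {
        try {
            XMLReader lReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            ConfigurationHandler lRoot = new ConfigurationHandler(lReader);
            lRoot.startHandlingEvents();
            lReader.parse(new InputSource(new StringReader(aXml)));
            return lRoot.mDefaultSchedule;
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
    }
}
```

Note the direction of knowledge: ScheduleHandler refers only to the base class type of its parent, so the same handler could be dropped under a Site handler or under a completely different document's root.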

Error Handling

To keep things simple for this article, I left out any mention of error handling. In real life you might want to setErrorHandler() on the parser early in the game and leave that handler in place for the duration. Any derived content handler can getErrorHandler() from the XMLReader and call common methods for logging or other error handling strategies.
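
As a rough sketch of that idea: one error handler is installed on the XMLReader up front, and a content handler fetches it with getErrorHandler() when it meets something it does not understand. The CollectingErrorHandler class and the choice to report unexpected tags as warnings are my assumptions for illustration. The locator is captured in setDocumentLocator(), which the parser calls on the handler registered at the start of the parse; handlers swapped in later would need it passed along.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ErrorDemo {
    // Hypothetical shared error handler: collects messages instead of logging.
    static class CollectingErrorHandler implements ErrorHandler {
        final List<String> mMessages = new ArrayList<>();
        @Override
        public void warning(SAXParseException aError) {
            mMessages.add("warning: " + aError.getMessage());
        }
        @Override
        public void error(SAXParseException aError) {
            mMessages.add("error: " + aError.getMessage());
        }
        @Override
        public void fatalError(SAXParseException aError) throws SAXException {
            throw aError; // nothing sensible can continue after a fatal error
        }
    }

    static class ConfigurationHandler extends DefaultHandler {
        private final XMLReader mReader;
        private Locator mLocator; // may stay null if the parser offers none

        ConfigurationHandler(XMLReader aReader) { mReader = aReader; }

        @Override
        public void setDocumentLocator(Locator aLocator) { mLocator = aLocator; }

        @Override
        public void startElement(String aURI, String aName, String aQName, Attributes aAtts)
                throws SAXException {
            String lName = aQName.toLowerCase();
            if (!lName.equals("configuration") && !lName.equals("mail")) {
                // Reuse whatever error handler was installed on the reader.
                ErrorHandler lErrors = mReader.getErrorHandler();
                if (lErrors != null) {
                    lErrors.warning(new SAXParseException("Unexpected tag: " + aQName, mLocator));
                }
            }
        }
    }

    static List<String> run(String aXml) {
        try {
            XMLReader lReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            CollectingErrorHandler lErrors = new CollectingErrorHandler();
            lReader.setErrorHandler(lErrors); // set once, left in place
            lReader.setContentHandler(new ConfigurationHandler(lReader));
            lReader.parse(new InputSource(new StringReader(aXml)));
            return lErrors.mMessages;
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
    }

    public static void main(String[] aArgs) {
        System.out.println(run("<configuration><mail/><bogus/></configuration>"));
    }
}
```

Because the error handler lives on the reader rather than on any one content handler, it stays put while content handlers are swapped in and out during the parse.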

Conclusion

At first glance this architecture may seem like overkill with a new base class and four new derived classes to replace one handler. But each of the new classes is pleasingly small and simple, conforming to the notion of doing one thing well. (Ok, you might argue two things: handle SAX events and build an object. As an exercise, separate those!)

Handlers that match the XML structure are even reusable. If you work with XML written to some standard vocabulary you will probably need to parse the same substructures over and over. For example, do you have more than one document type with a standard name and address section? With a toolbox full of reusable little handlers you’ll be able to combine standard parts into complex document handlers in no time.


Source Code

All sample code was built with Sun's JDK 1.4.  The JDK includes APIs for the SAX2 standard, and the Xerces parser implements them.

ChainedHandler.java   The framework
DemoDriver.java A program using the framework to display any XML file
DemoHandler.java A generic chained handler for any XML node
DemoObject.java A generic data object for any XML node