15.2.  SAX Parsing


With a SAX-style XML parser, the flow of execution depends entirely on the data being read sequentially from a file or stream. Because of this inversion of control, tracing the thread of execution means keeping track of a stack of passive calls to callback functions. Furthermore, our code (overrides of virtual functions) is called by parsing code inside the Qt library.

Invoking the parser involves creating a reader and a handler, hooking them up, and calling parse(), as shown in Example 15.4.

Example 15.4. src/xml/sax1/tagreader.cpp

#include "myhandler.h"
#include <QFile>
#include <QXmlInputSource>
#include <QXmlSimpleReader>
#include <QDebug>

int main( int argc, char **argv ) {
    if ( argc < 2 ) {
        qDebug() << QString("Usage: %1 <xmlfiles>").arg(argv[0]);
        return 1;
    }
    MyHandler handler;
    QXmlSimpleReader reader;                // (1)
    reader.setContentHandler( &handler );   // (2)
    for ( int i=1; i < argc; ++i ) {
        QFile xmlFile( argv[i] );
        QXmlInputSource source( &xmlFile );
        reader.parse( source );             // (3)
    }
    return 0;
}

(1) The generic parser.

(2) Hook the objects together.

(3) Start parsing.



The interface for parsing XML is described in the abstract base class QXmlContentHandler. We call this a passive interface because it is not our own code that calls MyHandler methods. A QXmlSimpleReader object reads an XML file and generates parse events, to which it then responds by calling MyHandler methods. Figure 15.1 shows the main classes involved.

Figure 15.1.  Abstract and Concrete SAX Classes


For the XML reader to provide any useful information, it needs an object to receive parse events. This object, a parse event handler, must implement the interface specified by its abstract base class, so it can "plug into" the parser, as shown in Figure 15.2.

Figure 15.2.  Plugin Component Architecture


The handler derives (directly or indirectly) from QXmlContentHandler. Its virtual methods are called by the parser when it encounters various elements of the XML file during parsing. This is event-driven parsing: You do not call these functions directly. Example 15.5 shows a class that extends the default handler so that it can respond to parse events in the particular way required by our application.

Example 15.5. src/xml/sax1/myhandler.h

[ . . . . ]
#include <QXmlDefaultHandler>
class QString;
class MyHandler : public QXmlDefaultHandler {
  public:
    bool startDocument();
    bool startElement( const QString & namespaceURI,
                       const QString & localName,
                       const QString & qName,
                       const QXmlAttributes & atts);
    bool characters(const QString& text);
    bool endElement( const QString & namespaceURI,
                     const QString & localName,
                     const QString & qName );
  private:
    QString indent;
};
[ . . . . ]



Functions that are called passively are often referred to as callbacks. They respond to events generated by the parser. The client code for MyHandler is the QXmlSimpleReader class, inside the Qt XML Module.
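To make the idea of passive calls concrete, here is a minimal sketch in standard C++ that does not use Qt. ToyHandler, ToyReader, and EventLog are hypothetical names invented for this illustration; they only mimic the roles of QXmlContentHandler, QXmlSimpleReader, and a concrete handler. The toy reader understands nothing but `<name>`, `</name>`, and plain text, so it is not a real XML parser.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for QXmlContentHandler: the only thing the reader knows about.
class ToyHandler {
public:
    virtual ~ToyHandler() = default;
    virtual bool startElement(const std::string& name) = 0;
    virtual bool endElement(const std::string& name) = 0;
    virtual bool characters(const std::string& text) = 0;
};

// Hypothetical stand-in for QXmlSimpleReader.
// Assumption: input contains only <name>, </name>, and text (no attributes or entities).
class ToyReader {
public:
    void setContentHandler(ToyHandler* h) { handler = h; }

    bool parse(const std::string& input) {
        if (!handler) return false;
        std::size_t pos = 0;
        while (pos < input.size()) {
            if (input[pos] == '<') {
                std::size_t close = input.find('>', pos);
                if (close == std::string::npos) return false;   // unterminated tag
                std::string tag = input.substr(pos + 1, close - pos - 1);
                // The reader, not the application, decides which callback runs.
                bool ok = (!tag.empty() && tag[0] == '/')
                              ? handler->endElement(tag.substr(1))
                              : handler->startElement(tag);
                if (!ok) return false;   // in this sketch, a false return stops the parse
                pos = close + 1;
            } else {
                std::size_t next = input.find('<', pos);
                if (next == std::string::npos) next = input.size();
                if (!handler->characters(input.substr(pos, next - pos))) return false;
                pos = next;
            }
        }
        return true;
    }

private:
    ToyHandler* handler = nullptr;
};

// A handler that records, in order, which callbacks it received.
class EventLog : public ToyHandler {
public:
    std::vector<std::string> events;
    bool startElement(const std::string& name) override {
        events.push_back("start:" + name);
        return true;
    }
    bool endElement(const std::string& name) override {
        events.push_back("end:" + name);
        return true;
    }
    bool characters(const std::string& text) override {
        events.push_back("text:" + text);
        return true;
    }
};
```

Passing an EventLog to setContentHandler() and then calling parse("<p>hi<b>x</b></p>") leaves six entries in events, one per callback, in document order. The application code never calls startElement() or characters() itself.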

[Important] ContentHandler or DefaultHandler?

QXmlContentHandler is an abstract class with many pure virtual functions, all of which must be overridden by any concrete derived class. Qt has provided a concrete class named QXmlDefaultHandler that implements the base class pure virtual functions as empty, do-nothing bodies. You can use this as a base class. Handlers derived from this class are not required to override all the methods but must override some to accomplish anything.
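The relationship between the two classes can be sketched in standard C++ without Qt. The names AbstractHandler, DoNothingHandler, and ElementCounter are hypothetical and stand in for QXmlContentHandler, QXmlDefaultHandler, and an application handler; the real Qt classes declare more methods, with different signatures, than are shown here.

```cpp
#include <string>

// Plays the role of QXmlContentHandler: every method is pure virtual.
class AbstractHandler {
public:
    virtual ~AbstractHandler() = default;
    virtual bool startElement(const std::string& name) = 0;
    virtual bool endElement(const std::string& name) = 0;
    virtual bool characters(const std::string& text) = 0;
};

// Plays the role of QXmlDefaultHandler: concrete, but every body does nothing.
class DoNothingHandler : public AbstractHandler {
public:
    bool startElement(const std::string&) override { return true; }
    bool endElement(const std::string&) override { return true; }
    bool characters(const std::string&) override { return true; }
};

// An application handler overrides only the method it cares about.
class ElementCounter : public DoNothingHandler {
public:
    int count = 0;
    bool startElement(const std::string&) override {
        ++count;
        return true;
    }
};
```

Marking each method with the C++11 `override` keyword lets the compiler reject a declaration whose signature does not match the base class. Without it, a mismatched declaration would compile as a new, unrelated function and the do-nothing version would still be the one called.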

If you do not properly override each handler method that your application uses, the corresponding QXmlDefaultHandler method, which does nothing, is called instead. In the body of a handler function, you can respond to the parse event in whatever way your application requires.

Example 15.6 contains the definition of a concrete event handler.

Example 15.6. src/xml/sax1/myhandler.cpp

[ . . . . ]
QTextStream cout(stdout);

bool MyHandler::startDocument() {
    indent = "";
    return true;
}

bool MyHandler::characters(const QString& text) {
    QString t = text;
    cout << t.remove('\n');
    return true;
}

bool MyHandler::startElement( const QString&,   // (1)
                             const QString&, const QString& qName,
                             const QXmlAttributes& atts) {
    QString str = QString("\n%1\\%2").arg(indent).arg(qName);
    cout << str;
    if (atts.length()>0) {
        // Only the first attribute is printed.
        QString fieldName = atts.qName(0);
        QString fieldValue = atts.value(0);
        cout << QString("(%1=%2)").arg(fieldName).arg(fieldValue);
    }
    cout << "{";
    indent += "    ";
    return true;
}

bool MyHandler::endElement( const QString&,
    const QString& , const QString& ) {
    indent.remove( 0, 4 );
    cout << "}";
    return true;
}
[ . . . . ]

(1) We have omitted the names of the parameters that we don't use. This prevents the compiler from issuing "unused parameter" warnings.



The QXmlAttributes object passed into the startElement() function is a map-like collection that holds the name = value attribute pairs that appeared in the element's start tag. Attributes can be accessed by index, as in Example 15.6, or looked up by name.
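Example 15.6 prints only the attribute at index 0. A handler that needs every attribute can loop over the whole collection. The following is a sketch in standard C++ under stated assumptions: ToyAttributes is a hypothetical stand-in for QXmlAttributes, and formatAttributes() and its comma separator are invented for illustration. In a real handler, the same loop would presumably use atts.length(), atts.qName(i), and atts.value(i), the calls already seen in Example 15.6.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for QXmlAttributes: an ordered list of (name, value) pairs.
using ToyAttributes = std::vector<std::pair<std::string, std::string>>;

// Formats all attributes as "(n1=v1,n2=v2)"; returns an empty string if there are none.
std::string formatAttributes(const ToyAttributes& atts) {
    if (atts.empty()) return "";
    std::string out = "(";
    for (std::size_t i = 0; i < atts.size(); ++i) {
        if (i > 0) out += ",";                        // separator between pairs
        out += atts[i].first + "=" + atts[i].second;  // name=value
    }
    out += ")";
    return out;
}
```

For a single attribute the result has the same shape as the output of Example 15.6, such as (id=xmlintro).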

As it processes the file, the parse() function calls startElement(), characters(), and endElement() as the corresponding events are encountered. Whenever a string of ordinary characters is found between an element's start tag and end tag, it is passed to the characters() function.

We ran this program on Example 15.3, and it transformed that document into Example 15.7, which looks a little like LaTeX, another document format.

Example 15.7. src/xml/sax1/tagreader-output.txt

\section(id=xmlintro){
    \title{ Intro to XML }
    \para{ This is a paragraph }
    \ul{
        \li{ This is an unordered list item. }
        \li(c=textbook){ This only shows up in the textbook }    }
    \p{ Look at this example code below: }
    \include(src=xmlsamplecode.cpp){}}
