Chapter 15.  Parsing XML

Table of Contents

15.1. The Qt XML Parsers
15.2. SAX Parsing
15.3. XML, Tree Structures, and DOM
15.3.1. DOM Tree Walking
15.3.2. Generation of XML with DOM
15.4. XML Streams
15.5. Review Questions

Abstract

This chapter demonstrates three ways to parse XML data: two available from Qt's XML module, and a newer, improved one from Qt's core module. The examples show event-driven parsing with SAX (the Simple API for XML), tree-style parsing with DOM (the Document Object Model), and stream-style parsing with QXmlStreamReader.

XML is an acronym for eXtensible Markup Language. It is a markup language similar to HTML (HyperText Markup Language), but with stricter syntax and no semantics (i.e., no meanings associated with the tags).

XML's stricter syntax stands in strong contrast to HTML. Example 15.1 shows an HTML document that does not conform to XML rules.

Example 15.1. src/xml/html/testhtml.html

<html>
<head> <title> This is a title </title>

 <!--Unterminated <link> and <input> tags are commonplace
         in HTML code:    -->
 <link rel="Saved&nbsp;Searches" title="oopdocbook"
   href="buglist.cgi?cmdtype=runnamed&amp;namedcmd=oopdocbook">
 <link rel="Saved&nbsp;Searches" title="queryj"
   href="buglist.cgi?cmdtype=runnamed&amp;namedcmd=queryj">
</head>
<body>
<p> This is a paragraph. What do you think of that?

Html makes use of unterminated line-breaks: <br>
And those do not make XML parsers happy. <br>

<ul>
<li> HTML is not very strict.
<li> An unclosed tag doesn't bother HTML parsers one bit.
</ul>

</body>
</html>

If we combined XML syntax with HTML element semantics, we would get a language called XHTML. Example 15.2 shows Example 15.1 rewritten as XHTML.

Example 15.2. src/xml/html/testxhtml.html

<!DOCTYPE xhtml >
<html>
<head>
<title> This is a title </title>
<!-- These links are now self-terminated: -->
<link rel="Saved&#160;Searches" title="oopdocbook"
  href="buglist.cgi?cmdtype=runnamed" />
<link rel="Saved&#160;Searches" title="queryj"
  href="buglist.cgi?namedcmd=queryj" />
</head>
<body>

<p> This is a paragraph. What do you think of that? </p>

<p>
Html self-terminating linebreaks are ok: <br/>
They don't confuse the XML parser. <br/>
</p>

<ul>
<li> This is proper list item </li>
<li> This is another list item </li>
</ul>

</body>
</html>



XML is a whole class of file formats that can be read and edited by both humans and programs. XML has become a popular format for storing and exchanging data from web applications. It is also a natural choice for representing hierarchical (tree-like) information, which includes most documentation.

Many applications (e.g., Qt Designer, Umbrello, Dia) use an XML file format for storing data. Qt Designer's .ui files use XML to describe the layout of Qt widgets in a GUI. This book is written in a flavor of XML called Slacker's DocBook. It's like DocBook, an XML language for writing books, but it adds some shorthand tags from XHTML and custom tags for describing courseware.

An XML document consists of nodes. Elements are nodes, and look like this: <tag> text or elements </tag>. An opening tag can contain attributes. An attribute has the form: name="value". Elements nested inside one another form a parent-child tree structure.

Example 15.3. src/xml/sax1/samplefile.xml

<section id="xmlintro">
    <title> Intro to XML </title>
    <para> This is a paragraph </para>
    <ul>
        <li> This is an unordered list item. </li>
        <li c="textbook"> This only shows up in the textbook </li>
    </ul>
    <p> Look at this example code below: </p>
    <include src="xmlsamplecode.cpp" mode="cpp"/>
</section>

In Example 15.3, the <ul> has two <li> children, and its parent is a <section>. Elements with no children can be self-terminated with />, e.g., <include/>. Some elements, such as <section> and <include>, have attributes. Most parsers ignore extra whitespace, but indenting nested elements makes the code more readable by humans.

[Important] Question

How many direct children does the <section> have?

[Tip] xmllint

The free tool xmllint is a handy tool for checking an XML file for errors. It reports descriptive error messages (mismatched start/end tags, missing characters, and so on) and points out where the errors are. It can be used to indent, or "pretty print," a well-formed XML document. It can also be used to join together multi-part documents that use XInclude or external entity references, two features that are not supported by the Qt XML parsers.