Chapter 15.  Parsing XML

Table of Contents

15.1. The Qt XML Parsers
15.2. SAX Parsing
15.3. XML, Tree Structures, and DOM
15.3.1. DOM Tree Walking
15.3.2. Generation of XML with DOM
15.4. XML Streams
15.5. Review Questions

Abstract

This chapter demonstrates three ways to parse XML data: two from Qt's XML module, and a newer, improved one from Qt's core module. The examples show event-driven parsing with SAX (the Simple API for XML), tree-style parsing with DOM (the Document Object Model), and stream-style parsing with QXmlStreamReader.

Example 15.1 is an HTML document that does not conform to XML rules.

Example 15.1. src/xml/html/testhtml.html

<html>
<head> <title> This is a title </title>

 <!--Unterminated <link> and <input> tags are commonplace
         in HTML code:    -->
 <link rel="Saved&nbsp;Searches" title="oopdocbook"
   href="buglist.cgi?cmdtype=runnamed&amp;namedcmd=oopdocbook">
 <link rel="Saved&nbsp;Searches" title="queryj"
   href="buglist.cgi?cmdtype=runnamed&amp;namedcmd=queryj">
</head>
<body>
<p> This is a paragraph. What do you think of that?

Html makes use of unterminated line-breaks: <br>
And those do not make XML parsers happy. <br>

<ul>
<li> HTML is not very strict.
<li> An unclosed tag doesn't bother HTML parsers one bit.
</ul>

</body>
</html>

Combining XML syntax with HTML element semantics yields a language called XHTML. Example 15.2 shows Example 15.1 rewritten as XHTML.

Example 15.2. src/xml/html/testxhtml.html

<!DOCTYPE xhtml >
<html>
<head>
<title> This is a title </title>
<!-- These links are now self-terminated: -->
<link rel="Saved&nbsp;Searches" title="oopdocbook"
  href="buglist.cgi?cmdtype=runnamed" />
<link rel="Saved&nbsp;Searches" title="queryj"
  href="buglist.cgi?namedcmd=queryj" />
</head>
<body>

<p> This is a paragraph. What do you think of that? </p>

<p>
Html self-terminating linebreaks are ok: <br/>
They don't confuse the XML parser. <br/>
</p>

<ul>
<li> This is proper list item </li>
<li> This is another list item </li>
</ul>

</body>
</html>

Example 15.3. src/xml/sax1/samplefile.xml

<section id="xmlintro">
    <title> Intro to XML </title>
    <para> This is a paragraph </para>
    <ul>
        <li> This is an unordered list item. </li>
        <li c="textbook"> This only shows up in the textbook </li>
    </ul>
    <p> Look at this example code below: </p>
    <include src="xmlsamplecode.cpp" mode="cpp"/>
</section>

[Important] Question

How many direct children does the <section> element in Example 15.3 have?

[Tip] xmllint
  • The free tool xmllint is handy for checking an XML file for errors.

  • It reports descriptive error messages (mismatched start/end tags, missing characters, and so on) and points out where the errors are.

  • It can be used to indent, or "pretty print" a well-formed XML document.

  • It can also be used to join together multi-part documents that use XInclude or external entity references, two features that are not supported by the Qt XML parsers.