Abstract
This chapter demonstrates three different ways to parse XML data, two available from Qt's XML module, and a new, improved one from Qt's core module. These examples show parse-event driven parsing with SAX (the Simple API for XML), tree-style parsing with DOM (the Document Object Model), and stream-style parsing with QXmlStreamReader.
XML is an acronym for eXtensible Markup Language. It is a markup language similar to HTML (HyperText Markup Language), but with stricter syntax and no semantics (i.e., no meanings associated with the tags).
XML's stricter syntax is in strong contrast to HTML. For example:
Each XML <tag> must have a closing </tag>, or be self-closing, like this: <br/>.
XML tags are case-sensitive: <tag> is not the same as <Tag>.
In an XML document, the characters > and < that are not actually part of a tag must be replaced by their passive equivalents, &gt; and &lt;, to avoid confusing the parser.
In addition, the ampersand character (&) must be replaced by its passive equivalent, &amp;, for the same reason.
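The replacement rules above can be sketched in a few lines of standard C++. This is only an illustration: `escapeXmlText` is a hypothetical helper name, not a function from Qt or from this chapter's examples, and it handles just the three characters discussed here (quotes inside attribute values are not covered).

```cpp
#include <string>

// Hypothetical helper (not a Qt API): returns a copy of raw in which
// the characters &, < and > are replaced by &amp;, &lt; and &gt;.
std::string escapeXmlText(const std::string& raw) {
    std::string out;
    out.reserve(raw.size());
    for (char c : raw) {
        switch (c) {
        case '&': out += "&amp;"; break;
        case '<': out += "&lt;";  break;
        case '>': out += "&gt;";  break;
        default:  out += c;       break;
        }
    }
    return out;
}
```

For instance, passing the text `a < b && c > d` through this function yields `a &lt; b &amp;&amp; c &gt; d`, which an XML parser can read back as ordinary character data. Qt 5 and later provide QString::toHtmlEscaped() for a similar purpose.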
Example 15.1 is an HTML document that does not conform to XML rules.
Example 15.1. src/xml/html/testhtml.html
<html>
  <head>
    <title> This is a title </title>
    <!--Unterminated <link> and <input> tags are commonplace in HTML code: -->
    <link rel="Saved Searches" title="oopdocbook"
          href="buglist.cgi?cmdtype=runnamed&namedcmd=oopdocbook">
    <link rel="Saved Searches" title="queryj"
          href="buglist.cgi?cmdtype=runnamed&namedcmd=queryj">
  </head>
  <body>
    <p> This is a paragraph. What do you think of that?
    Html makes use of unterminated line-breaks: <br>
    And those do not make XML parsers happy. <br>
    <ul>
      <li> HTML is not very strict.
      <li> An unclosed tag doesn't bother HTML parsers one bit.
    </ul>
  </body>
</html>
If we combined XML syntax with HTML element semantics, we would get a language called XHTML. Example 15.2 shows Example 15.1 rewritten as XHTML.
Example 15.2. src/xml/html/testxhtml.html
<!DOCTYPE xhtml >
<html>
  <head>
    <title> This is a title </title>
    <!-- These links are now self-terminated: -->
    <link rel="Saved Searches" title="oopdocbook"
          href="buglist.cgi?cmdtype=runnamed" />
    <link rel="Saved Searches" title="queryj"
          href="buglist.cgi?namedcmd=queryj" />
  </head>
  <body>
    <p> This is a paragraph. What do you think of that? </p>
    <p> Html self-terminating linebreaks are ok: <br/>
    They don't confuse the XML parser. <br/> </p>
    <ul>
      <li> This is proper list item </li>
      <li> This is another list item </li>
    </ul>
  </body>
</html>
XML is a whole class of file formats that can be understood and edited by humans as well as by programs. XML has become a popular format for storing and exchanging data from web applications. It is also a natural choice for representing hierarchical (tree-like) information, which includes most documentation.
Many applications (e.g., Qt Designer, Umbrello, Dia) use an XML file format for storing data. Qt Designer's .ui files use XML to describe the layout of Qt widgets in a GUI. This book is written in a flavor of XML called Slacker's DocBook. It's like DocBook, an XML language for writing books, but it adds some shorthand tags from XHTML and custom tags for describing courseware.
An XML document consists of nodes.
Elements are nodes, and look like this: <tag> text or elements </tag>.
An opening tag can contain attributes.
An attribute has the form: name="value".
Elements nested inside one another form a parent-child tree structure.
Example 15.3. src/xml/sax1/samplefile.xml
<section id="xmlintro">
  <title> Intro to XML </title>
  <para> This is a paragraph </para>
  <ul>
    <li> This is an unordered list item. </li>
    <li c="textbook"> This only shows up in the textbook </li>
  </ul>
  <p> Look at this example code below: </p>
  <include src="xmlsamplecode.cpp" mode="cpp"/>
</section>
In Example 15.3, the <ul> has two <li> children, and its parent is a <section>.
Elements with no children can be self-terminated with a />, e.g., <include/>.
Some elements, such as <section> and <include>, have attributes.
Most parsers ignore extra whitespace, but indenting nested elements makes the code more readable by humans.
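To make the parent-child structure concrete, here is a minimal sketch in standard C++ of how the <ul> element from Example 15.3 might be held in memory and written back out with indentation. The names `Element`, `toXml`, and `makeSampleList` are hypothetical and chosen for illustration; this is not how Qt's DOM classes are implemented. The sketch assumes an element holds either text or child elements (not both), does not escape attribute values, and relies on the compiler accepting a vector of the enclosing struct type (guaranteed from C++17 on).

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory model of one XML element (illustration only).
struct Element {
    std::string tag;                                // element name, e.g. "ul"
    std::map<std::string, std::string> attributes;  // name="value" pairs
    std::string text;                               // used only when there are no children
    std::vector<Element> children;                  // nested elements
};

// Writes an element as indented XML, two spaces per nesting level.
// Elements with neither text nor children are self-terminated with "/>".
std::string toXml(const Element& e, int depth = 0) {
    const std::string indent(static_cast<std::size_t>(depth) * 2, ' ');
    std::string out = indent + "<" + e.tag;
    for (const auto& attr : e.attributes)
        out += " " + attr.first + "=\"" + attr.second + "\"";
    if (e.text.empty() && e.children.empty())
        return out + "/>\n";
    out += ">";
    if (e.children.empty())
        return out + e.text + "</" + e.tag + ">\n";
    out += "\n";
    for (const Element& child : e.children)
        out += toXml(child, depth + 1);
    return out + indent + "</" + e.tag + ">\n";
}

// Builds the <ul> from Example 15.3: a parent with two <li> children.
Element makeSampleList() {
    Element first;
    first.tag = "li";
    first.text = "This is an unordered list item.";

    Element second;
    second.tag = "li";
    second.attributes["c"] = "textbook";
    second.text = "This only shows up in the textbook";

    Element list;
    list.tag = "ul";
    list.children.push_back(first);
    list.children.push_back(second);
    return list;
}
```

Calling `toXml(makeSampleList())` produces the <ul> on its own line, each <li> indented by two spaces beneath it, and the closing </ul> aligned with the opening tag. Because the attributes are kept in a std::map, they are written in alphabetical order by name, which may differ from their order in the original file.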
Question | |
---|---|
How many direct children are there of the |
xmllint | |
---|---|
The free tool |
Generated: 2012-03-02 | © 2012 Alan Ezust and Paul Ezust. |