14.3.1.  Regular Expression Syntax

[ fromfile: regexp.xml id: regexsyntax ]

Much like a string, a regular expression is a series of characters; however, not all characters are taken literally. For example, while an "a" in a regular expression will match an "a" in the target string, the character "." will match any character. Here "." is called a meta-character. Another common meta-character is "*", which is used to indicate that zero or more of the preceding character may exist in the target string. For example, "a*" would match any number of "a"s (including zero) in a row. There are many different kinds of meta-characters, as illustrated below.

Following are some of the most commonly used meta-characters.

  1. Special characters

    • . (the dot matches any character)

    • \n (matches the newline character)

    • \f (matches the form feed character)

    • \t (matches the tab character)

    • \xhhhh (matches the Unicode character whose code is the hexadecimal number hhhh in the range 0x0000 to 0xFFFF)

  2. Quantifiers – Modifiers that specify the number of occurrences of the preceding character (or group) that may appear in the matching expression.

    • + (1 or more occurrences)

    • ? (0 or 1 occurrences)

    • * (0 or more occurrences)

    • {i,j} (at least i but not more than j occurrences)

  3. Character Sets – Sets of allowable values for the character in the specified position of the matching expression. Several character sets are predefined:

    • \s (matches any whitespace character)

    • \S (matches any non-whitespace character)

    • \d (matches any digit character: '0' to '9')

    • \D (matches any non-digit character)

    • \w (matches any "word" character; i.e., any letter or digit or the underscore '_')

    • \W (matches any non-word character)

    Character sets can also be specified in square brackets:

    • [AEIOU] (matches any of the characters 'A', 'E', 'I', 'O', or 'U')

    • [a-g] (the dash makes this a range from 'a' to 'g')

    • [^xyz] (matches any character except for 'x', 'y', and 'z')

  4. Grouping and Capturing Characters – (round parentheses) can be used to form a group. Groups can be back-referenced, meaning that if there is a match, the grouped values can be captured and accessed in various ways.

    For convenience, up to 9 groups can be referenced within the regular expression by using the identifiers \1 thru \9.

    There is also a QRegExp member function cap(int nth), which returns the nth group (as a QString).

  5. Anchoring Characters – Assertions that specify the boundaries of a matching effort.

    • The caret (^), if it is the first character in the regex, indicates that the match starts at the beginning of the string.

    • The dollar sign ($), when it is the last character in the regex, means that the effort to match must continue to the end of the string.

    • In addition, there are word boundary (\b) or non-word boundary (\B) assertions that help to focus the attention of the regex.

Table 14.3.  Examples of Regular Expressions

Pattern Meaning
hello Matches the literal string, hello
c*at Quantifier: zero or more occurrences of c, followed by at: at, cat, ccat, etc.
c?at Matches zero or 1 occurrences of c, followed by at: at or cat only.
c.t Matches c followed by any character, followed by t: cat, cot, c3t, c%t, etc.
c.*t Matches c followed by 0 or more characters, followed by t: ct, caaatttt, carsdf$#S8ft, etc.
ca+t + means 1 or more of the preceding "thing", so this matches cat, caat, caaaat, etc., but not ct.
c\.\*t Backslashes precede special characters to "escape them" so this matches only the string c.*t
c\\\.t Matches only the string, c\.t
c[0-9a-c]+z Between the 'c' and the 'z' one or more of the chars in the set [0-9a-c] – matches strings like c312abbaz and "caa211bac2z"
the (cat|dog) ate the (fish|mouse) (Alternation) the cat ate the fish or the dog ate the mouse or the dog ate the fish, or the cat ate the mouse
\w+ A sequence of one or more alphanumerics (word chars), same as [a-zA-Z0-9]+
\W A character which is not part of a word (punctuation, whitespace, etc)
\s{5} Exactly 5 whitespace chars (tabs, spaces, or newlines)
^\s+ Matches one or more white space at the beginning of the string.
\s+$ Matches one or more white space at the end of the string.
^Help Matches Help if it occurs at the beginning of the string.
[^Help] Matches any single char except one of the letters in the word Help, anywhere in the string. (a different meaning for the metacharacter ^)
\S{1,5} At least 1, at most 5 non-whitespace (printable characters)
\d A digit [0-9] (and \D is a non-digit, i.e., [^0-9] )
\d{3}-\d{4} 7-digit phone numbers: 555-1234
\bm[A-Z]\w+ \b means word boundary: matches mBuffer but not StreamBuffer

[Note] Backslashes and C++ Strings

Backslashes are used for escaping special characters in C++ strings as well, so this means that regular expression strings inside C++ strings must be "double-backslashed" – i.e. every \ becomes \\, and to match the backslash character itself you need four: \\\\.

[Note] C++ 0x Users

If your compiler supports C++0x, you may want to use raw quoted strings for regular expressions, to avoid the need to double-escape backslashes.

R"(The String Data \ Stuff " )"
R"delimiter(The String Data \ Stuff " )delimiter"

There is much more to regular expressions, and it is well worth investing the time to learn about them. The documentation for QRegExp is a good place to start; we also recommend [Friedl98] for a more in-depth examination of the subject.

In the meantime, you can explore the capabilities of QRegExp and test your own regular expressions with a Qt example from Nokia. You can find the code in src/regex-tester. Figure 14.3 shows a screenshot of the running program.

Figure 14.3.  Regular Expression Tester

Regular Expression Tester