[ fromfile: regexp.xml id: regexsyntax ]
Much like a string, a regular expression is a series of characters; however, not all characters are taken literally. For example, while an "a" in a regular expression will match an "a" in the target string, the character "." will match any character. Here "." is called a meta-character. Another common meta-character is "*", which is used to indicate that zero or more of the preceding character may exist in the target string. For example, "a*" would match any number of "a"s (including zero) in a row. There are many different kinds of meta-characters, as illustrated below.
Following are some of the most commonly used meta-characters.
Special characters
.
(the dot matches any character)
\n
(matches the newline character)
\f
(matches the form feed character)
\t
(matches the tab character)
\xhhhh
(matches the Unicode character whose code is the hexadecimal number hhhh in the range 0x0000
to 0xFFFF
)
Quantifiers – Modifiers that specify the number of occurrences of the preceding character (or group) that may appear in the matching expression.
+
(1 or more occurrences)
?
(0 or 1 occurrences)
*
(0 or more occurrences)
{i,j}
(at least i but not more than j occurrences)
Character Sets – Sets of allowable values for the character in the specified position of the matching expression. Several character sets are predefined:
\s
(matches any whitespace character)
\S
(matches any non-whitespace character)
\d
(matches any digit character: '0'
to '9'
)
\D
(matches any non-digit character)
\w
(matches any "word" character; i.e., any letter or digit or the underscore '_'
)
\W
(matches any non-word character)
Character sets can also be specified in square brackets:
[AEIOU]
(matches any of the characters 'A'
, 'E'
, 'I'
, 'O'
, or 'U'
)
[a-g]
(the dash makes this a range from 'a'
to 'g'
)
[^xyz]
(matches any character except for 'x'
, 'y'
, and 'z'
)
Grouping and Capturing Characters – (round parentheses) can be used to form a group. Groups can be back-referenced, meaning that if there is a match, the grouped values can be captured and accessed in various ways.
For convenience, up to 9 groups can be referenced within the regular expression by using the identifiers \1
thru \9
.
There is also a QRegExp member function cap(int nth)
, which returns the nth group (as a QString).
Anchoring Characters – Assertions that specify the boundaries of a matching effort.
The caret (^
), if it is the first character in the regex, indicates that the match starts at the beginning of the string.
The dollar sign ($
), when it is the last character in the regex, means that the effort to match must continue to the end of the string.
In addition, there are word boundary (\b
) or non-word boundary (\B
) assertions that help to focus the attention of the regex.
Table 14.3. Examples of Regular Expressions
Pattern | Meaning |
---|---|
hello | Matches the literal string, hello
|
c*at | Quantifier: zero or more occurrences of c , followed by at : at , cat , ccat , etc.
|
c?at | Matches zero or 1 occurrences of c , followed by at : at or cat only.
|
c.t | Matches c followed by any character, followed by t : cat , cot , c3t , c%t , etc.
|
c.*t | Matches c followed by 0 or more characters, followed by t : ct , caaatttt , carsdf$#S8ft , etc.
|
ca+t | + means 1 or more of the preceding "thing", so this matches cat , caat , caaaat , etc., but not ct .
|
c\.\*t | Backslashes precede special characters to "escape them" so this matches only the string c.*t
|
c\\\.t | Matches only the string, c\.t
|
c[0-9a-c]+z | Between the 'c' and the 'z' one or more of the chars in the set [0-9a-c] – matches strings like c312abbaz and "caa211bac2z"
|
the (cat|dog) ate the (fish|mouse) | (Alternation) the cat ate the fish or the dog ate the mouse or the dog ate the fish , or the cat ate the mouse |
\w+ | A sequence of one or more alphanumerics (word chars), same as [a-zA-Z0-9]+
|
\W | A character which is not part of a word (punctuation, whitespace, etc) |
\s{5} | Exactly 5 whitespace chars (tabs, spaces, or newlines) |
^\s+ | Matches one or more white space at the beginning of the string. |
\s+$ | Matches one or more white space at the end of the string. |
^Help | Matches Help if it occurs at the beginning of the string. |
[^Help] | Matches any single char except one of the letters in the word Help , anywhere in the string. (a different meaning for the metacharacter ^) |
\S{1,5} | At least 1, at most 5 non-whitespace (printable characters) |
\d | A digit [0-9] (and \D is a non-digit, i.e., [^0-9] ) |
\d{3}-\d{4} | 7-digit phone numbers: 555-1234
|
\bm[A-Z]\w+ |
\b means word boundary: matches mBuffer but not StreamBuffer
|
Backslashes and C++ Strings | |
---|---|
Backslashes are used for escaping special characters in C++ strings as well, so this means that regular expression strings inside C++ strings must be "double-backslashed" – i.e. every |
C++ 0x Users | |
---|---|
If your compiler supports C++0x, you may want to use raw quoted strings for regular expressions, to avoid the need to double-escape backslashes. R"(The String Data \ Stuff " )" R"delimiter(The String Data \ Stuff " )delimiter" |
There is much more to regular expressions, and it is well worth investing the time to learn about them. The documentation for QRegExp is a good place to start; we also recommend [Friedl98] for a more in-depth examination of the subject.
In the meantime, you can explore the capabilities of QRegExp and test your own regular expressions with a Qt example from Nokia.
You can find the code in src/regex-tester.
Figure 14.3 shows a screenshot of the running program.
Generated: 2012-03-02 | © 2012 Alan Ezust and Paul Ezust. |