14.3.2.  Regular Expressions: Phone Number Recognition

The Problem

In almost any application, there is a need for an easy but general-purpose way to specify conditions that input data must satisfy at runtime. For example:

How can you impose conditions such as these on incoming data in an object-oriented way?

Suppose that you want to write a program that recognizes phone number formats and could accept a variety of phone numbers from various countries. You would need to take the following things into consideration.

Imagine how you would write this program using the standard tools available to you in C++. It would be necessary to write lengthy parsing routines for each possible format. Example 14.5 shows the desired output of such a program.

Example 14.5. src/regexp/testphone.txt

src/regexp> ./testphone
Enter a phone number (or q to quit): 16175738000
 validated: (US/Canada) +1 617-573-8000
Enter a phone number (or q to quit): 680111111111
 validated: (Palau) + 680 (0)11-11-11-111
Enter a phone number (or q to quit): 777888888888
 validated: (Unknown - but possibly valid) + 777 (0)88-88-88-888
Enter a phone number (or q to quit): 86333333333
 validated: (China) + 86 (0)33-33-33-333
Enter a phone number (or q to quit): 962444444444
 validated: (Jordan) + 962 (0)44-44-44-444
Enter a phone number (or q to quit): 56777777777
 validated: (Chile) + 56 (0)77-77-77-777
Enter a phone number (or q to quit): 351666666666
 validated: (Portugal) + 351 (0)66-66-66-666
Enter a phone number (or q to quit): 31888888888
 validated: (Netherlands) + 31 (0)88-88-88-888
Enter a phone number (or q to quit): 20398478
Unknown format
Enter a phone number (or q to quit): 2828282828282
Unknown format
Enter a phone number (or q to quit): q
src/regexp>


Example 14.6 is a procedural C-style solution that shows how to use QRegExp to handle this problem.

Example 14.6. src/regexp/testphoneread.cpp

[ . . . . ]
QRegExp filtercharacters ("[\\s+()\\-]");        1

QRegExp usformat                                 2
("(\\+?1[- ]?)?\\(?(\\d{3})\\)?[\\s-]?(\\d{3})[\\s-]?(\\d{4})");

QRegExp genformat
("(00)?([3-9]\\d{1,2})(\\d{2})(\\d{7})$");       3

QRegExp genformat2
("(\\d\\d)(\\d\\d)(\\d{3})");                    4


QString countryName(QString ccode) {
   if(ccode == "31") return "Netherlands";
   else if(ccode == "351") return "Portugal";
[ . . . . ]
   // Add more codes as needed ...
   else return "Unknown - but possibly valid";
}

QString stdinReadPhone() {                       5
   QString str;
   bool knownFormat=false;
   do {                                          6
      cout << "Enter a phone number (or q to quit): ";
      cout.flush();
      str = cin.readLine();
      if (str=="q")
         return str;
      str.remove(filtercharacters);              7
      if (genformat.exactMatch(str)) {
         QString country = genformat.cap(2);
         QString citycode = genformat.cap(3);
         QString rest = genformat.cap(4);
         if (genformat2.exactMatch(rest)) {
            knownFormat = true;
            QString number = QString("%1-%2-%3")
                               .arg(genformat2.cap(1))
                               .arg(genformat2.cap(2))
                               .arg(genformat2.cap(3));
            str = QString("(%1) + %2 (0)%3-%4").arg(countryName(country))
                    .arg(country).arg(citycode).arg(number);
         }
      }
[ . . . . ]
      if (not knownFormat) {
         cout << "Unknown format" << endl;
      }
   } while (not knownFormat) ;
   return str;
}

int main() {
    QString str;
    do {
        str =  stdinReadPhone();
        if (str != "q")
            cout << " validated: " << str << endl;
    } while (str != "q");
    return 0;
}
[ . . . . ]

1

Remove these characters from the string that the user supplies.

2

All U.S. format numbers have country code 1 and 3 + 3 + 4 = 10 digits. Whitespace, dashes, and parentheses between these digit groups are ignored, but they help to make the digit groups recognizable.

3

Landline country codes in Europe begin with 3 or 4, Latin America with 5, Southeast Asia and Oceania with 6, East Asia with 8, and Central, South, and Western Asia with 9. Country codes may be 2 or 3 digits long. Complete numbers, including the country code, typically have 2 (or 3) + 2 + 7 = 11 (or 12) digits. This program does not attempt to interpret city codes.

4

The last 7 digits will be arranged as 2 + 2 + 3.

5

Ensures the user-entered phone string complies with a regular expression, and extracts the proper components from it. Returns a properly formatted phone string.

6

Keep asking until you get a valid number.

7

Remove all dashes, spaces, parens, and so on.
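
The patterns described in these callouts can also be tried without Qt. The following is a minimal sketch, not the code from Example 14.6: it swaps QRegExp for std::regex from the C++ standard library, the helper names stripSeparators and formatPhone are hypothetical, and the U.S. branch (elided in the listing) is a guess based on the sample output in Example 14.5. Because separators are stripped first, the U.S. pattern is reduced to digits only.

```cpp
#include <regex>
#include <string>

// Remove whitespace, plus signs, parentheses, and dashes, mirroring the
// role of filtercharacters in Example 14.6.
std::string stripSeparators(const std::string& input) {
    static const std::regex separators(R"([\s+()\-])");
    return std::regex_replace(input, separators, "");
}

// Hypothetical helper: returns the formatted number, or an empty string
// if the input matches neither pattern.
std::string formatPhone(const std::string& raw) {
    // General format: optional 00, country code (2 or 3 digits, first
    // digit 3-9), 2-digit city code, 7-digit remainder.
    static const std::regex genformat(R"((00)?([3-9]\d{1,2})(\d{2})(\d{7}))");
    // The 7-digit remainder is split as 2 + 2 + 3.
    static const std::regex genformat2(R"((\d\d)(\d\d)(\d{3}))");
    // Assumed U.S. pattern on already-stripped input: optional 1, then 3 + 3 + 4.
    static const std::regex usformat(R"(1?(\d{3})(\d{3})(\d{4}))");

    const std::string digits = stripSeparators(raw);
    std::smatch m;

    // std::regex_match requires the whole string to match.
    if (std::regex_match(digits, m, genformat)) {
        const std::string country = m[2];
        const std::string city = m[3];
        const std::string rest = m[4];
        std::smatch parts;
        if (std::regex_match(rest, parts, genformat2)) {
            return "+ " + country + " (0)" + city + "-" + parts[1].str() +
                   "-" + parts[2].str() + "-" + parts[3].str();
        }
    }
    if (std::regex_match(digits, m, usformat)) {
        return "+1 " + m[1].str() + "-" + m[2].str() + "-" + m[3].str();
    }
    return "";  // unknown format
}
```

Under these assumptions, formatPhone("+1 (617) 573-8000") yields "+1 617-573-8000" and formatPhone("31888888888") yields "+ 31 (0)88-88-88-888". Here std::regex_match plays the role of QRegExp::exactMatch, since both require the entire string to match the pattern.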


In a stream-based program like this, the QRegExp examines the user's complete response only after it has been typed and the [Enter] key pressed. There is no way to prevent the user from entering inappropriate characters into the input stream.
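
Since a stream-based program sees the line only after it is complete, the most it can do is report where the input went wrong. The following is a minimal sketch of such an after-the-fact check. It uses std::regex rather than QRegExp so that it compiles without Qt, and the function name firstBadCharacter is hypothetical. The set of acceptable characters (digits plus the separators that the example strips) is an assumption.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns the position of the first character that is
// neither a digit nor one of the separators (whitespace, +, parentheses,
// dash), or std::string::npos if every character is acceptable.
std::size_t firstBadCharacter(const std::string& line) {
    static const std::regex notAllowed(R"([^\d\s+()\-])");
    std::smatch m;
    if (std::regex_search(line, m, notAllowed)) {
        return static_cast<std::size_t>(m.position(0));
    }
    return std::string::npos;
}
```

For the input "617x573" this reports position 3, but only after the whole line has been read. The offending character has already been typed by then.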



[93] The phone number situation in Europe is quite complex, and specialists have been working for years to develop a system that would work for, and be acceptable to, all EU members. You can get an idea of what is involved by visiting this Wikipedia page.