PHP

Grouping and Choice

Regular expressions also let us group characters or character classes together with parentheses: ( and ). Combining groups with other operators lets us form even more powerful regular expressions, such as "(very){1,}", which matches one or more consecutive occurrences of "very" and therefore finds a match in each of "very good", "very very good", and "very very very very very smokin' good". (To make a single match span the repeated words, the space would have to be part of the group, as in "(very ){1,}".)

Combined with the | metacharacter, which matches any of the sequences it separates, we can create groups to match individual words: (good|awesome|amazing|sweeet|cool). Note that groups are a more complicated construction, however, and as such tend to be a bit slower in execution. Therefore, although we could write the character class [aeiou] as (a|e|i|o|u), we probably do not want to.
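As a rough illustration of grouping and choice, the sketch below runs the patterns from this section through preg_match(). This is an assumption on our part: preg_match() belongs to PHP's PCRE extension, not the POSIX functions (the ereg family was removed in PHP 7), but these particular patterns behave the same way in both. The helper containsMatch() is a hypothetical name introduced only for this example.

```php
<?php
// Hypothetical helper: true when $pattern matches anywhere in $subject.
// preg_match() is PCRE, used here as a stand-in for the POSIX functions.
function containsMatch(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

// Group plus repetition: one or more consecutive copies of "very".
$repeated = '/(very){1,}/';

// Group plus choice: any one of the listed words.
$praise = '/(good|awesome|amazing|sweeet|cool)/';

foreach (['very good', 'very very good', 'quite good'] as $text) {
    echo $text, ': ', containsMatch($repeated, $text) ? 'match' : 'no match', "\n";
}

echo containsMatch($praise, 'that was awesome') ? 'match' : 'no match', "\n";

// The character class and the equivalent group accept the same inputs;
// the class is simply the cheaper way to write it.
echo containsMatch('/[aeiou]/', 'sky') === containsMatch('/(a|e|i|o|u)/', 'sky')
    ? 'same result'
    : 'different result', "\n";
```

The first two inputs contain "very" and match; "quite good" does not.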

Tricks and Traps

POSIX regular expressions have a couple of interesting properties that can cause some unexpected results (when matching against strings) and some potential performance problems. We cover a couple of the more common sources of confusion here.

First and foremost, POSIX regular expressions are greedy. When given free rein to match characters, as with the sequence ".*" (which matches any number of characters), a POSIX regular expression immediately starts gobbling up characters until it reaches the end of the string.

This behavior can cause problems if the regular expression is in fact something like ".*fish". Given the string "I like to eat raw fish", the processor matches all characters until it gets to the end of the string. It then realizes that it still has four more characters left to match, namely those in "fish". It then starts working its way backward through the string, seeing whether it can make a match happen that way. It finally makes that match, but in a somewhat inefficient manner.
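To make the greedy behavior visible, the sketch below uses a longer sentence containing "fish" twice (our own test string, not one from the text above) and prints what each pattern actually consumed. It again assumes preg_match() from the PCRE extension as a stand-in for the POSIX functions; matchedText() is a hypothetical helper.

```php
<?php
// Hypothetical helper: returns the text matched by $pattern, or null.
// preg_match() is PCRE, used here as a stand-in for the POSIX functions.
function matchedText(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

// Assumed test string with two occurrences of "fish".
$subject = 'I like to eat raw fish and fried fish';

// Greedy: .* runs to the end of the string, then backs up only as far as
// the LAST "fish", so the whole sentence is consumed.
echo matchedText('/.*fish/', $subject), "\n";

// More specific: [^f]* cannot run past an "f", so the match ends at the
// FIRST "fish" and far less backing up is needed.
echo matchedText('/[^f]*fish/', $subject), "\n";
```

The first pattern matches the entire sentence; the second stops at "I like to eat raw fish".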

This greedy processing can cause some unexpected results if our patterns are not as specific as they need to be. Consider the following expression, intended to match an IP address in the format xxx.yyy.zzz.www:

[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}

We have written this to try to match between one and three digits four times, each time separated by a period character. We have, however, forgotten that the dot character, when specified by itself in a regular expression, means "match any character." What we really wanted was to escape each of the periods with a backslash.

The preceding pattern correctly matches (as expected) against the following IP addresses:

1.2.3.4
192.168.0.1
255.255.255.255

What is unexpected, however, is that it successfully matches against the following:

192.168.255

Why? Because the regular expression processor works very hard to make patterns match. The preceding string matches the regular expression along the following lines:

  • The first two [0-9]{1,3}. sequences match 192. and 168. respectively. The processor then uses all of 255 to match the third [0-9]{1,3} before realizing that there is still more of the regular expression to match.

  • After backing up, however, it discovers that it can satisfy the rest of the regular expression with the remaining 255 by matching the 2 against the third [0-9]{1,3}, the first 5 against the dot (which accepts any character), and the second 5 against the fourth digit sequence [0-9]{1,3}.
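The split described in the two points above can be checked by wrapping each piece of the pattern in its own group and printing what each group captured. This is only a sketch: it assumes preg_match() from the PCRE extension in place of the POSIX functions, and showPieces() is a hypothetical helper. The capture groups are added for inspection and do not change which strings match.

```php
<?php
// Hypothetical helper: returns what each capture group consumed,
// or an empty array when the pattern does not match.
function showPieces(string $pattern, string $subject): array
{
    if (preg_match($pattern, $subject, $m) !== 1) {
        return [];
    }
    return array_slice($m, 1); // drop $m[0], the full match
}

// The unescaped pattern with groups around the four digit runs and
// around the third dot, the one that ends up matching a digit.
$pattern = '/([0-9]{1,3}).([0-9]{1,3}).([0-9]{1,3})(.)([0-9]{1,3})/';

print_r(showPieces($pattern, '192.168.255'));
```

The captured pieces are 192, 168, 2, 5, and 5: the fourth entry is the digit that the unescaped dot swallowed.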

We are thus given a match, even though that is not what we intended! To fix this problem, we should correctly escape the dot characters to indicate that we will only accept periods:

[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}

This new regular expression still correctly matches valid IP addresses, but it no longer matches the invalid one.
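A quick way to confirm the difference is to run both patterns over the same inputs. The sketch below does so with preg_match(), which we assume as a stand-in because the POSIX ereg functions are no longer available in PHP 7 and later; matchesPattern() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: true when $pattern matches anywhere in $subject.
// preg_match() is PCRE, used here as a stand-in for the POSIX functions.
function matchesPattern(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

// Single-quoted strings keep the backslashes exactly as written.
$unescaped = '/[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}/';
$escaped   = '/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/';

$inputs = ['1.2.3.4', '192.168.0.1', '255.255.255.255', '192.168.255'];

foreach ($inputs as $ip) {
    printf(
        "%-16s unescaped: %-4s escaped: %s\n",
        $ip,
        matchesPattern($unescaped, $ip) ? 'yes' : 'no',
        matchesPattern($escaped, $ip) ? 'yes' : 'no'
    );
}
```

Only the last input differs: the unescaped pattern accepts 192.168.255, and the escaped one rejects it. Note that neither pattern is anchored or range-checked, so this is a check of the dot escaping only, not a complete IP address validator.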

If you are getting strange or unexpected results with your regular expression, do not fixate on one particular part of the expression, but instead look at the whole sequence of patterns and try to see how it could be producing the results. Trying different input values to isolate how it is behaving will also help.

by BrainBell