PHP

Character Classes

When we want to search for more than just individual characters or strings, we can use square brackets ( [ and ] ) to define what are called character classes. These are used in positions where you want to allow one of a number of characters to appear. For example, to find any clothing that has the letter o followed by either a u or an e, you can use the following:

the_regex($clothes, "o[ue]");

This produces the following output:

the_regex called to match 'o[ue]':
Array Index 0 matches: array ( 0 => 'oe', ) "shoes"
Array Index 7 matches: array ( 0 => 'ou', ) "blouse"
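
The the_regex helper and the full contents of $clothes are defined elsewhere, so the following is only a rough sketch of the same search. It assumes PHP's PCRE function preg_match (the POSIX ereg functions were removed in PHP 7) and a hypothetical, shortened $clothes array in which only 'shoes' and 'blouse' are taken from the output above.

```php
<?php
// Hypothetical stand-in for $clothes; only 'shoes' and 'blouse'
// are known from the output above.
$clothes = ['shoes', 'pants', 'socks', 'blouse'];

$found = [];
foreach ($clothes as $index => $item) {
    // o[ue] matches an 'o' followed by either 'u' or 'e'.
    if (preg_match('/o[ue]/', $item, $matches) === 1) {
        $found[$index] = $matches[0];
    }
}

print_r($found); // indexes 0 ('oe') and 3 ('ou') in this stand-in array
```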

To find any string containing a vowel (any of a, e, i, o, or u), we could use the character class [aeiou]. Similarly, to match any digit, we could use [0123456789], and to match any lowercase letter, we could write [abcdefghijklmnopqrstuvwxyz]. These last two classes, however, are tedious to type and prone to input errors.

To solve this problem, you can specify ranges of characters using the hyphen (-) character: [a-z], [A-Z], or [0-9]. You can include multiple ranges within one character class, such as [A-Za-z0-9], which instructs the processor to match any single uppercase letter, lowercase letter, or digit.
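
As a small illustration of explicit lists versus ranges, the sketch below tests a few strings against different classes. It assumes preg_match rather than the POSIX functions, and has_char_from is a hypothetical helper name used only here.

```php
<?php
// Hypothetical helper: true when $text contains at least one
// character from the character class $class.
function has_char_from(string $class, string $text): bool
{
    return preg_match('/' . $class . '/', $text) === 1;
}

var_dump(has_char_from('[aeiou]', 'rhythm'));      // bool(false): no vowel
var_dump(has_char_from('[0-9]', 'size 42'));       // bool(true)
var_dump(has_char_from('[A-Za-z0-9]', '--- ???')); // bool(false)
```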

However, be careful with expressions such as [A-z], because regular expression ranges operate on character codes. In most character sets the uppercase letters are consecutive, as are the lowercase letters, but several other characters lie between the two groups. The range [A-z] therefore also includes [, \, ], ^, _, and `. The character class [a-Z], on the other hand, generates an error from the regex or mbregex compiler in PHP: the character code for a comes after that of Z, which makes the range invalid.
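
The pitfall can be seen in a short sketch. It assumes preg_match (PCRE) rather than the regex or mbregex functions named above, so the exact error reporting may differ; with PCRE, an invalid range makes preg_match return false and raise a warning.

```php
<?php
// '_' (code 95) sits between 'Z' (90) and 'a' (97), so [A-z] accepts it.
$loose  = preg_match('/^[A-z]+$/', 'snake_case');    // 1
$strict = preg_match('/^[A-Za-z]+$/', 'snake_case'); // 0

// A reversed range such as [a-Z] does not compile. The warning is
// suppressed with @ so that only the return value is inspected.
$reversed = @preg_match('/[a-Z]/', 'anything');      // false

var_dump($loose, $strict, $reversed);
```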

To specify nonprintable characters in character classes, you can use many of the same escape sequences that you would use in PHP, including those for tabs (\t), newlines (\n), carriage returns (\r), and hexadecimal representations of unprintable characters (\x0b). Of course, this means that if you want to search for the backslash character (\), you must escape it: [\\].
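
A brief sketch of these escapes follows, again assuming preg_match. Note that with PCRE and a single-quoted PHP string, a literal backslash ends up escaped twice (once for PHP, once for the regex engine), which is an assumption specific to this example rather than the POSIX form shown above.

```php
<?php
// In single-quoted strings PHP passes \t, \n, and \x0b through
// unchanged, and the regex engine interprets them.
$hasTabOrNewline = preg_match('/[\t\n]/', "name\tvalue"); // 1
$hasVerticalTab  = preg_match('/[\x0b]/', 'plain text');  // 0

// The PHP literal '/[\\\\]/' becomes the pattern /[\\]/,
// a class containing one backslash.
$hasBackslash = preg_match('/[\\\\]/', 'C:\temp');        // 1

var_dump($hasTabOrNewline, $hasVerticalTab, $hasBackslash);
```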

Ranges in character classes work on any character set with contiguous character values. Therefore, in UTF-8 character sets, [ぁ-ん] represents the Japanese hiragana characters, and [０-９] represents the full-width digits found in most Asian fonts. (These digits differ from the regular single-width digits found in ASCII.)
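
For illustration, here is a sketch of multibyte ranges using preg_match with the u (UTF-8) modifier instead of the mbregex functions. The code points are assumptions chosen for this example: U+3041 to U+3093 for hiragana (ぁ through ん) and U+FF10 to U+FF19 for the full-width digits.

```php
<?php
// The u modifier makes PCRE treat the pattern and subject as UTF-8.
// \x{3041}-\x{3093}: hiragana small a through n (assumed range).
// \x{FF10}-\x{FF19}: full-width digits zero through nine.
$hasHiragana       = preg_match('/[\x{3041}-\x{3093}]/u', 'ひらがな');
$hasFullWidthDigit = preg_match('/[\x{FF10}-\x{FF19}]/u', '１２３');

// ASCII digits are different characters and fall outside the range.
$asciiMatches      = preg_match('/[\x{FF10}-\x{FF19}]/u', '123');

var_dump($hasHiragana, $hasFullWidthDigit, $asciiMatches);
```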

In addition to putting individual digits, letters, or ranges within character classes, you can specify a number of special named classes available in POSIX regular expressions, as shown in Table 1.

Table 1. Named Character Classes in POSIX Regular Expressions

Named Class   Description
[:alnum:]     Matches all ASCII letters and digits. Equivalent to [a-zA-Z0-9].
[:alpha:]     Matches all ASCII letters. Equivalent to [a-zA-Z].
[:blank:]     Matches space and tab characters. Equivalent to [ \t].
[:space:]     Matches any whitespace character, including space, tab, newline, carriage return, vertical tab, and form feed. Equivalent to [ \t\n\r\x0b\x0c].
[:cntrl:]     Matches unprintable control characters. Equivalent to [\x00-\x1f\x7f].
[:digit:]     Matches ASCII digits. Equivalent to [0-9].
[:lower:]     Matches lowercase letters. Equivalent to [a-z].
[:upper:]     Matches uppercase letters. Equivalent to [A-Z].


You cannot use these named character classes outside of character classes or as part of ranges. Thus, we could choose to write [0-9], [[:digit:]], or [[:alpha:][:digit:]], but not [A-[:lower:]].
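
The valid combinations can be tried out in a short sketch. It assumes preg_match, which also accepts POSIX named classes inside square brackets; the ^ and $ anchors and the + quantifier are added here only to test whole strings.

```php
<?php
// A named class must sit inside an enclosing pair of square brackets.
$allDigits = preg_match('/^[[:digit:]]+$/', '2024');          // 1
$alnumOnly = preg_match('/^[[:alpha:][:digit:]]+$/', 'PHP8'); // 1

// Named classes can be mixed with ordinary characters in one class.
$constant  = preg_match('/^[[:upper:]_]+$/', 'MAX_SIZE');     // 1
$hasBlank  = preg_match('/[[:blank:]]/', 'no_spaces_here');   // 0

var_dump($allDigits, $alnumOnly, $constant, $hasBlank);
```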

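One other important aspect of using character classes is the ^ character. When it appears as the first character inside the brackets, it negates the class, so that the class matches any single character except those listed. Therefore, the character class [^aeiou] matches any character that is not a lowercase English vowel, and a string matches if it contains at least one such character.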
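
Because a negated class still matches one character at a time, it is easy to misread. The sketch below, assuming preg_match, shows that a word with vowels can still match [^aeiou], and that rejecting every vowel requires anchoring and repeating the class.

```php
<?php
// [^aeiou] matches one character that is not a lowercase vowel.
$shoes  = preg_match('/[^aeiou]/', 'shoes', $m); // 1, $m[0] is 's'
$vowels = preg_match('/[^aeiou]/', 'aeiou');     // 0

// To require that no vowel appears anywhere, apply the negated
// class to the whole string.
$rhythm  = preg_match('/^[^aeiou]*$/', 'rhythm'); // 1
$noMatch = preg_match('/^[^aeiou]*$/', 'shoes');  // 0

var_dump($shoes, $m[0], $vowels, $rhythm, $noMatch);
```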

Finally, to include carets (^) or square brackets within the list of characters against which to match, you just escape them with backslashes: [\^\[\]].
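
A quick sketch of the escaped class, assuming preg_match and a single-quoted PHP string (which leaves \^, \[, and \] untouched for the regex engine):

```php
<?php
// The class [\^\[\]] matches a literal caret or either square bracket.
$pattern = '/[\^\[\]]/';

$bracket = preg_match($pattern, 'items[0]', $m1); // 1, $m1[0] is '['
$caret   = preg_match($pattern, '2^10', $m2);     // 1, $m2[0] is '^'
$neither = preg_match($pattern, 'items(0)');      // 0

var_dump($bracket, $m1[0], $caret, $m2[0], $neither);
```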

by BrainBell