Regular expressions

How to process text for search and validation

A regular expression (or regex) is a text pattern used for searching, editing, extracting and manipulating data. With a regex you can check whether a string matches a given pattern, or find a particular set of characters within a sentence or a large body of text.

Regular expressions are also used for replacing, splitting and re-arranging text. They generally follow a similar syntax in most programming languages.
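As a rough illustration of splitting, replacing and re-arranging, here is a minimal JavaScript sketch. The sample strings and variable names are made up for this example.

```javascript
// Hypothetical sample data, chosen only for illustration.
const csv = 'apple, banana;cherry';

// Split on a comma or semicolon followed by optional whitespace.
const fruits = csv.split(/[,;]\s*/);
console.log(fruits); // ['apple', 'banana', 'cherry']

// Replace every run of whitespace with a single space.
const tidy = 'too    many   spaces'.replace(/\s+/g, ' ');
console.log(tidy); // 'too many spaces'

// Re-arrange "last, first" into "first last" using two captured groups.
const fullName = 'Doe, John'.replace(/(\w+),\s*(\w+)/, '$2 $1');
console.log(fullName); // 'John Doe'
```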

You can quickly create and test your regex with following websites:

Search for a text

For example, to search for the HTML heading tags <h1>, <h2>, <h3>, <h4>, <h5>, <h6>, and </h1>, </h2>, </h3>, </h4>, </h5>, </h6>, we can simply write the pattern like this:

/<\/?h[1-6]>/ig

PHP Example

PHP uses these functions for search and replace: preg_match and preg_match_all for matching, and preg_replace for replacement.

$html = '<H1>Heading 1</H1>';
// PHP has no "g" modifier; use preg_match_all to find every match
$pattern = '/<\/?h[1-6]>/i';
echo preg_match($pattern, $html);
//prints "1"

preg_match searches the text for a match to the regular expression given in the pattern (/ is the pattern delimiter) and returns 1 if the pattern matches the given text, 0 if it does not, or false if an error occurred.

JavaScript example

In JavaScript, regular expressions are implemented as their own type of object, RegExp. These objects store a pattern and its flags and can then be used to test and manipulate strings.

var regex = new RegExp('<\/?h[1-6]>','ig');
//or
var regex = /<\/?h[1-6]>/ig;
regex.test('<H1>Heading 1</H1>');
//returns true
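test only reports whether a match exists. To collect the matching tags themselves, one option is String.prototype.match with the g flag. The sketch below uses a made-up HTML string.

```javascript
// Hypothetical input string for illustration.
const html = '<H1>Title</H1><p>Text</p><h2>Sub</h2>';

// With the g flag, match() returns every match as an array.
const tags = html.match(/<\/?h[1-6]>/ig);
console.log(tags); // ['<H1>', '</H1>', '<h2>', '</h2>']

// match() returns null when nothing matches, so fall back to an empty array.
const none = '<p>No headings</p>'.match(/<\/?h[1-6]>/ig) || [];
console.log(none.length); // 0
```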

Regex meta (or special) characters

The punctuation characters $^*()+.?[\{| are called metacharacters; they are what make regular expressions work. Here is an overview of these special characters:

Pattern delimiter

In PHP, a delimiter can be any non-alphanumeric, non-backslash, non-whitespace character. Commonly used delimiters are forward slashes /, hash signs # and tildes ~. The following are equivalent:

<?php
 $pattern = "/\w+/";
 
 //same as previous, but different delimiter
 $pattern = "~\w+~";

 //same as previous, but different delimiter
 $pattern = "!\w+!";
 
 //same as previous, but different delimiter
 $pattern = "#\w+#";

In JavaScript, a regex literal is delimited with the / (forward slash) character. You can also use the global RegExp object, passing the pattern as a string that follows the normal string escape rules (so each backslash must be doubled). The following are equivalent:

// Using delimiter
var pattern = /\s+/;

//Without delimiter, using RegExp object
var pattern = new RegExp("\\s+");
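The doubled backslash matters in practice. The following sketch compares the two forms and shows what happens when the backslash is not doubled; the variable names and the sample word are assumptions made for illustration.

```javascript
const literal = /\s+/;
const built = new RegExp('\\s+');

// Both objects hold the same pattern source.
const sameSource = literal.source === built.source;
console.log(sameSource); // true

// With a single backslash the string is just "s+", which matches the letter "s".
const wrong = new RegExp('\s+');
console.log(wrong.source); // 's+'

// The constructor is handy when part of the pattern is only known at runtime.
const word = 'regex'; // hypothetical search term
const dynamic = new RegExp('\\b' + word + '\\b', 'i');
const found = dynamic.test('Learn Regex today');
console.log(found); // true
```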

Matching start or end of text

^ anchors the match at the beginning of the text. If you write ^a, the expression matches the letter “a” only if it is the very first character.

$ anchors the match at the end of the text.

Meta Description
^ Beginning of text
$ End of text

If you look for the word “hello”, the “h” must be at the beginning, while the “o” is at the end. To match exactly this string (with nothing before or after it), your pattern is /^hello$/.

If the word you’re looking for is at the beginning of the text, your pattern is /^hello/.

If the word you’re looking for is at the end of the text, your pattern is /hello$/.

If the word you’re looking for is anywhere in the text, the pattern is simply /hello/.
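The four patterns above can be compared side by side. This is a small sketch with made-up sample strings:

```javascript
// Hypothetical sample strings for illustration.
const samples = ['hello', 'hello world', 'say hello', 'oh hello there'];

const exact = samples.filter((s) => /^hello$/.test(s));
console.log(exact); // ['hello']

const atStart = samples.filter((s) => /^hello/.test(s));
console.log(atStart); // ['hello', 'hello world']

const atEnd = samples.filter((s) => /hello$/.test(s));
console.log(atEnd); // ['hello', 'say hello']

const anywhere = samples.filter((s) => /hello/.test(s));
console.log(anywhere.length); // 4
```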

Quantifiers

*, +, ?, { and } are interpreted as quantifiers unless they appear inside a character class. Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found.

Meta Description
? Once or not at all (0 – 1)
* Zero or more times (0 – ∞)
+ One or more times (1 – ∞)
{n} Exactly n times (where n is a number)
{n,} At least n times (n – ∞)
{0,m} At most m times (0 – m), where m is a number
{n,m} At least n but not more than m times

Matching “no or one character”

? matches the preceding character zero or one time (and is equivalent to {0,1}).

/hurrah!?/ matches “hurrah” and “hurrah!”

Matching “no or any number of characters”

* matches the preceding character zero or more times (and is equivalent to {0,}).

/hurrah!*/ matches “hurrah”, “hurrah!” and “hurrah!!!”

Matching “one or any number of characters”

+ matches the preceding character one or more times (and is equivalent to {1,}).

/hurrah!+/ matches “hurrah!” and “hurrah!!!”

Matching a minimum, maximum, or exact number of occurrences

{ } is the repetition quantifier: it defines how many times the element before the braces is repeated. The range is given as numbers; if the minimum and maximum differ, they are written separated by a comma, as in {1,4}.

{100} matches an exact number of repetitions. For example, .{100} matches a string of 100 characters.

{1,4} For variable repetition, use the quantifier {n,m}, where n is a non-negative integer and m is greater than or equal to n. 0{1,4} matches the character 0 repeated one to four times (0, 00, 000 or 0000).

{0,1} matches zero times or once; ? does the same.

{1,} sets a minimum number of repetitions. The quantifier {n,} has no upper limit. {1,} matches one or more repetitions, and + does the same. {0,} matches zero or more repetitions, and * does the same.

{0,4} sets a maximum number of repetitions (zero to four). Write the leading 0 explicitly: JavaScript and older PCRE versions (as used by PHP) treat {,4} as literal text rather than as a quantifier, and the same applies when a space is placed inside the braces.

  • /hurrah!{0,1}/ works like ?
  • /hurrah!{0,}/ works like *
  • /hurrah!{1,}/ works like +
  • /hurrah!{3}/ matches hurrah!!!
  • /hurrah!{0,2}/ matches hurrah, hurrah! and hurrah!!
  • /hurrah!{1,2}/ matches hurrah! and hurrah!!
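To check these quantifiers yourself, a sketch along the following lines may help. The patterns are anchored with ^ and $ so that each input has to match as a whole; the helper name matching and the input list are assumptions made for this example.

```javascript
// Hypothetical inputs for illustration.
const inputs = ['hurrah', 'hurrah!', 'hurrah!!', 'hurrah!!!'];

// Hypothetical helper: returns the inputs that a pattern matches.
const matching = (re) => inputs.filter((s) => re.test(s));

const optional = matching(/^hurrah!?$/);
console.log(optional); // ['hurrah', 'hurrah!']

const zeroOrMore = matching(/^hurrah!*$/);
console.log(zeroOrMore.length); // 4

const oneOrMore = matching(/^hurrah!+$/);
console.log(oneOrMore); // ['hurrah!', 'hurrah!!', 'hurrah!!!']

const exactlyThree = matching(/^hurrah!{3}$/);
console.log(exactlyThree); // ['hurrah!!!']

const oneToTwo = matching(/^hurrah!{1,2}$/);
console.log(oneToTwo); // ['hurrah!', 'hurrah!!']

// In JavaScript, {,2} is not a quantifier; it matches the literal text "{,2}".
const literalBraces = /^a{,2}$/.test('a{,2}');
console.log(literalBraces); // true
```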

Groups

Parentheses ( ) are used to define groups in regular expressions. You can apply the quantifiers *, +, and ? to such a group, too. Groups also let us extract data from the input provided.

Meta Description
() Capturing group
(?<name>) Named capturing group
(?:) Non-capturing group
(?=) Positive look-ahead
(?!) Negative look-ahead
(?<=) Positive look-behind
(?<!) Negative look-behind
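The group types in this table can be tried out in JavaScript as sketched below. The sample strings are made up, and named groups and look-behind assume a reasonably recent JavaScript engine.

```javascript
const date = '2017-12-01'; // hypothetical input

// Capturing groups: each pair of parentheses is available by number.
const m = date.match(/(\d{4})-(\d{2})-(\d{2})/);
console.log(m[1], m[2], m[3]); // '2017' '12' '01'

// Named capturing groups: the same data, available by name.
const named = date.match(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/);
console.log(named.groups.year); // '2017'

// Non-capturing group: groups "ha" for the + quantifier without capturing it.
const laugh = /^(?:ha)+$/.test('hahaha');
console.log(laugh); // true

// Positive look-ahead: digits that are followed by "px" (the "px" is not part of the match).
const pixels = '12px 5em'.match(/\d+(?=px)/)[0];
console.log(pixels); // '12'

// Negative look-ahead: a word that does not start with "admin".
const guestOk = /^(?!admin)\w+$/.test('guest');
const adminOk = /^(?!admin)\w+$/.test('adminuser');
console.log(guestOk, adminOk); // true false

// Positive look-behind: digits that are preceded by "$".
const dollars = '$100 and 200'.match(/(?<=\$)\d+/)[0];
console.log(dollars); // '100'

// Negative look-behind: digits not preceded by "$" or by another digit.
const plain = '$100 and 200'.match(/(?<![$\d])\d+/)[0];
console.log(plain); // '200'
```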

Character classes

[ ] defines a character class, which matches exactly one character out of those listed in the class. For example, [aeiou] matches either a, e, i, o or u. A hyphen - creates a range when it is placed between two characters. The range includes the character before the hyphen, the character after the hyphen, and all characters that lie between them in numerical order. See the following examples:

Meta Description
[0-9] Matches any digit.
[a-z] Matches any lowercase letter.
[A-Z] Matches any uppercase letter.
[a-zA-Z0-9] Matches any alphanumeric character.
gr[ae]y Matches grey or gray but not graey.
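A few of these classes in action, as a minimal sketch with made-up input strings:

```javascript
// gr[ae]y accepts exactly one of "a" or "e" in the third position.
const grey = /^gr[ae]y$/.test('grey');
const gray = /^gr[ae]y$/.test('gray');
const graey = /^gr[ae]y$/.test('graey');
console.log(grey, gray, graey); // true true false

// [0-9]+ finds runs of digits.
const numbers = 'Order 66, Room 101'.match(/[0-9]+/g);
console.log(numbers); // ['66', '101']

// [A-Z] finds single uppercase letters.
const capitals = 'Hello World'.match(/[A-Z]/g);
console.log(capitals); // ['H', 'W']
```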

Negation (matching if specific character(s) not exist)

The caret ^ at the beginning of the class means “not”. If a character class starts with the ^ metacharacter, it matches only those characters that are not in that class.

Meta Description
[^A-Z] Matches everything except the uppercase letters A through Z.
[^a-z] Matches everything except the lowercase letters a through z.
[^0-9] Matches everything except the digits 0 through 9.
[^a-zA-Z0-9] Matches everything except letters and digits (a combination of the examples above).
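Negated classes are often used to strip unwanted characters. A small sketch, using made-up input strings:

```javascript
// Remove everything that is not a digit.
const digitsOnly = 'Phone: 555-1234'.replace(/[^0-9]/g, '');
console.log(digitsOnly); // '5551234'

// Remove everything that is not a letter or a digit.
const alnumOnly = 'Hello, World!'.replace(/[^a-zA-Z0-9]/g, '');
console.log(alnumOnly); // 'HelloWorld'
```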

Matching any character

The . (dot) is shorthand for a character class that matches any character (by default, any character except a line break). If you want to search for a date, for example 08/08/2008 or 08-08-2008, simply use \d{2}.\d{2}.\d{4}.

\d{2}.\d{2}.\d{4} matches 01-12-2017, 10/10/2017 or 12.04.2000

  • \d is equivalent to [0-9].
  • \d{2} matches exactly two digits.
  • \d{4} matches exactly four digits.
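Because the dot accepts any character, this pattern also accepts separators you may not want. The sketch below compares it with a stricter variant that lists the allowed separators in a character class; the stricter pattern is an illustrative alternative, not part of the original example.

```javascript
// The dot accepts any separator, including letters.
const loose = /^\d{2}.\d{2}.\d{4}$/;
const looseResults = ['01-12-2017', '10/10/2017', '12.04.2000', '01a12b2017']
  .map((s) => loose.test(s));
console.log(looseResults); // [true, true, true, true]

// A character class restricts the separator to -, / or a literal dot.
const strict = /^\d{2}[-\/.]\d{2}[-\/.]\d{4}$/;
const strictResults = ['01-12-2017', '10/10/2017', '12.04.2000', '01a12b2017']
  .map((s) => strict.test(s));
console.log(strictResults); // [true, true, true, false]
```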

Alternation, combine multiple regex

The | (vertical bar, or pipe symbol) is a logical OR: it splits the regular expression into multiple alternatives. School|College|University matches School, or College, or University with each match attempt. Only one name matches each time, but a different name can match each time.

/a|b|c/ matches a, or b, or c with each match attempt.
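The following sketch shows alternation finding a different name on each match attempt, and how grouping limits the reach of the | operator. The sample sentence is made up for illustration.

```javascript
// Hypothetical sample text.
const text = 'She left School, skipped College and joined a University.';

// With the g flag, each match attempt can pick a different alternative.
const places = text.match(/School|College|University/g);
console.log(places); // ['School', 'College', 'University']

// Without a group, | splits the whole pattern: "^cat" OR "dog$".
const ungrouped = /^cat|dog$/.test('catalog');
console.log(ungrouped); // true, because "catalog" starts with "cat"

// A non-capturing group keeps the anchors outside the alternatives.
const grouped = /^(?:cat|dog)$/.test('catalog');
console.log(grouped); // false
```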

Escape characters \

\ (the backslash) masks metacharacters and special characters so that they no longer have a special meaning. ^ and $, for example, are metacharacters. If you want to search for a metacharacter as a regular character, you have to put a backslash in front of it. For example, if you want to match one of the characters $^*()+.?[\{| literally, escape that character with \.

/\$[0-9]+/ will match $100.

\$ and \^ will match a literal $ and ^.

The backslash itself is a metacharacter, too. If you look for the backslash in particular, you write \\.
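Escaping can be tried out as sketched below. The escapeRegExp function is a hypothetical helper written for this example (JavaScript strings and inputs are made up); it prefixes every metacharacter in a string with a backslash so the string can be used as a literal pattern.

```javascript
// \$ matches a literal dollar sign.
const price = 'Price: $100'.match(/\$[0-9]+/)[0];
console.log(price); // '$100'

// \^ and \$ between the anchors ^ and $ match the two-character string "^$".
const caretDollar = /^\^\$$/.test('^$');
console.log(caretDollar); // true

// \\ matches a single backslash (the JavaScript string 'C:\\temp' contains one).
const hasBackslash = /\\/.test('C:\\temp');
console.log(hasBackslash); // true

// Hypothetical helper: escape all metacharacters in arbitrary text.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Unescaped, "+" is a quantifier, so the pattern does not match the literal text.
const unescaped = new RegExp('1+1=2').test('1+1=2');
console.log(unescaped); // false

const escaped = new RegExp(escapeRegExp('1+1=2')).test('1+1=2');
console.log(escaped); // true
```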

by BrainBell