Regular Expressions

Fundamental Concepts of Text Pattern Matching

Regular expressions are everywhere — if you know where to look. They are a powerful and versatile way to extract information and manipulate text. This functionality is included in most good programmer's editors, dedicated command line utilities like grep, and even word processors. Yet we still find that many programmers have little or no skills in this area. This is an introduction to the basic concepts and syntax of regular expressions.

PREREQUISITES — You should already…

have some programming or scripting experience, in a language like C++, C#, Java, JavaScript, PHP, Perl, Python, AWK, Bash.

Introduction

Here is a high-level overview of the background and concepts of regular expressions.

Overview

Historically, regular expression is a term for a branch of mathematics (regular language). Today, however, the term generally refers to a “pattern matching” language. Regular expressions are included in many tools and libraries (often with names abbreviated as RegEx, Regex, regex, or just re). They are commonly used in programming, and are very useful for command line text manipulation.

A regular expression language is not a general purpose language, unlike Java or C#. The ability to use regular expressions, however, is usually available as a library for such a programming language, or even built into the syntax of the language (JavaScript, Perl, AWK).

Some of the oldest and most-used implementations are found in grep, sed and other venerable and mature command line tools.

Basic Concepts

A regular expression is a pattern, or sequence, of characters. A regular expression language is a set of syntax rules for describing or creating these patterns.

We will use the term regex in this document to refer to a regular expression.

The simplest pattern is a literal string. The regex processing software will then look for a match i.e. a sequence of characters that corresponds to that pattern. A very basic implementation of this is found in word processors: for example, you can search in a document for all occurrences of the pattern cat. The word processor will find matches in words such as catch, catastrophe, bobcats etc.

Matching

The most common use case is to test whether an input string matches some regex, i.e., true, or false. The match is by default conducted as a case-sensitive, sliding algorithm — it starts from the left and repeatedly attempts a match, moving on to the next character, if it doesn't, until the end of the string. This sliding “window” can match anywhere inside the search string, unless it is anchored (discussed later in this document).

A regular expression is traditionally line-oriented, but it does not have to be. By line-oriented, we mean that it will search for a match that is fully contained on a single line, but not for a match that starts on one line and ends on another. In a programming language, a string to match may contain newlines, but options are provided to perform matches across the newlines.

It is important to understand that for a regex to “match” successfully, some part of the string must satisfy the regex condition. For string manipulation, we often need to know exactly what part of the string matched. Consequently, all implementations will provide some mechanism to retrieve the matched part.

Compilation

Most implementations compile regular expressions, before they are applied to test if strings match. The compiled form is not machine code; it is just a structure that allows for faster matching. This can often be saved in an object or variable, and applied directly to test for matches. Compiled regular expressions are more efficient, especially when they are to be used several times.

Syntax

Regular expression syntax is relatively simple, even though it does not resemble a regular programming language. It does have the equivalent of keywords, operators with precedence, and grouping.

Special Characters

The “keywords” of the regular expression language are not keywords in the usual sense — simply, certain characters are treated as special, in that they have a special meaning. They serve the same purpose as keywords in more typical programming languages, and are more commonly referred to as metacharacters.

The special, or meta, characters are much the same across all implementations. The typical special characters are:

   . [ ] ( ) { } ^ * + ? # \ $

Note that your language or tool may provide more, or fewer, metacharacters.

Escape Sequences

To mark a special character as non-special, it must be escaped with a backslash (\).

Your implementation may require that the regex is enclosed in a native string literal, specific to the language or tool.

The backslash character itself, must be escaped with another backslash (\\). The language may also use backslash as an escape sequence (C, C++, Java, PHP, C#, etc.). In this case, you may need to type four backslashes, so that 2 will be stored in memory for use by the regex library. Or you can use your language's version of a ‘verbatim string‘ (‘raw string’ in Python), where backslashes have no special meaning. In a POSIX shell, use single quote delimited strings (also PHP, while Perl has special regex delimiters). In C#, you can use @"…". In Python, use R"…" or r"…".

Some special characters like tab (\t), newline (\n) and carriage return (\r) can be represented with the escape sequences shown.

The circumflex, or caret (^), is special only when it is used as an anchor or to negate a character class, both of which are described later in this document.

Structure

In general, a regex will have three components:

Character classes are used to match one or more characters.
Quantifiers specify how many times a character class may be repeated.
Anchors specify the position of the pattern in relation to a line of text.

We will look at all of these in more detail.

Character Classes

A fundamental concept is that of an expression that matches a single character. A character class means: “match one character”.

Effectively, any non-special character means “match this character here”. So the regex ABC means “match one A, followed by one B, followed by one C, anywhere inside a string”.

General Character Class

Instead of only one possibility like the ABC example above, however, we can specify a set of potential matches. The most general character class is the period (.), which means “match any character, except a newline”.

User-Defined Character Classes

For more flexibility, we can define the set of matches. A user-defined character class is created by specifying the potential matches between square brackets. For example:

[ABC] means: “match either one A, or one B, or one C”.
[ABC][123] means: “match one of either A, or B, or C; followed by one of either 1, 2, or 3”.

The first regex only matches one character. The second regex matches two characters and would have 9 possible matches, including A3, B1 and C2.

Ranges

Character classes may form a range, by separating the start and end characters with a minus sign (-):

[A-Z] means “match any uppercase alphabetic character”.
[A-Za-z] means “match any alphabetic character”.
[0-9] means “match any decimal digit”.

If the minus sign does not appear between two increasingly larger characters (based on ordinals), it will be treated as just another character that's eligible for matching. Thus:

[B-E] means “match any uppercase alphabetic character in the range from B to E”.
[E-B] means “match one of either E or - or B”.

Negation

When the circumflex or caret (^) appears as the first character in a character class, it inverses, or negates, the meaning of the character class.

[^A-Z] means “match any character except uppercase alphabetic characters”.
[^A-Za-z] means “match any non-alphabetic character”.
[^0-9] means “match any non-digit character”.

Pre-defined Character Classes

Most implementations provide escape sequences for some common character classes:

\d means “match a decimal digit”, and is shorthand for: [0-9].
\D means “match any character, but a decimal digit”. Shorthand for: [^0-9].
\s means “match any white space character” (tabs, newlines, spaces).
\S means “match any non-white space character”.
\w means “match any word character”. Shorthand for: [_0-9a-zA-Z].
\W means “match any non-word character”. Shorthand for: [^_0-9a-zA-Z].

Your implementation, language or library, may provide more convenient escape sequences.

POSIX Character Classes

In POSIX compliant systems, many regular expression engines have named character classes in the form: :class:, to better accomodate Unicode and internationalisation. Possible values for class are:

alnum   alpha   ascii   blank   cntrl   digit   graph   lower
print   punct   space   upper   word    xdigit

Alternation and Grouping

A set of parentheses can be used to group a sequence of characters or other pattern.

The pipe (|) can be used to specify alternates, i.e. “either-or” matches.

We can combine groups and alternates in a number of useful ways:

AB|C means “match one A; followed by either one B or one C”.
(AB)|C means “match either AB, or C.
(AB)|(CD) means “match either AB, or CD.

Quantifiers

More powerful patterns will express the concept of how many (quantity). A quantifier is effectively a postfix operator — it operates on the character or expression that appears before it.

0 to 1 (Optional)

The question mark (?) means “match 0 to 1 times”. It applies to the character or group before it, so:

AB?C means “match an A; optionally followed by a B; followed by one C”.
A.?C means “match an A; optionally followed by any character; followed by one C”.
(AB)?C means “match optional AB; followed by one C”.
\d?A means “match one optional decimal digit; followed by one A”.

0 to Many, 1 to Many

By Many, we effectively mean “as many as the implementation allows”. The upper limit is unbounded.

To match 0-Many times, use the asterisk (*):

AB*C means “match one A; followed by zero or more Bs; followed by one C.”
(AB)*C means “match zero or more AB sequences; followed by one C.”
A.*C means “match one A; followed by zero or more of any character; followed by one C”.

To match 1-Many times, use the plus sign (+):

AB+C means “match one A; followed by one or more Bs; followed by one C.”
(AB)+C means “match one or more AB sequences; followed by one C.”
A.+C means “match one A; followed by one or more of any character; followed by one C”.

Any to Any

When you need more flexibility, you can explicitly specify the minimum and maximum quantities. You do this by specifying the quantity in curly braces, and separating the minimum and maximum values with a comma. When the minimum and maximum values are equal i.e. you want an exact quantity, only specify the first value, without a comma. If the value before the comma is left out, it defaults to 0. If the value after the comma is left out, it defaults to Many.

Like the other quantifiers, this applies to the element before it.

A{,}B means “match one A, repeated 0 to Many times; followed by one B.
A{3}B means “match exactly 3 As; followed by one B.
A{,3}B means “match 0 to 3 As; followed by one B.
A{3,}B means “match 3 or more As; followed by one B.
A{3,5}B means “match 3, or 4, or 5 As; followed by one B.

Greedy vs Non-greedy

The quantifiers + and * are greedy — they match the longest sequence that still satisfies the match. For example, when the regex: <e>.*</e>, is applied to the string:

    This <e>is</e> a <e>very</e> stupid sentence.

it will match:

    <e>is</e> a <e>very</e>

You can create a non-greedy version of the greedy quantifiers by appending a question mark (?) to either * or +, e.g., A*?B or A+?B. As example, the regex: <e>.*?</e> when applied to the above string, it will match:

    <e>is</e>

The question mark (?), although a quantifier, when suffixed to an expression, does not involve “greediness”, since it has a fixed maximum (1) match value. When appending it to the unlimited upper bound quantifiers however, it changes their behaviour.

Anchors

You can anchor a regex to prevent it from matching anywhere in a search string. An anchor never matches a character. It is logically a position or location. If, by analogy, one treats a string as a sequence of physical blocks, the beginning of the sequence is the left edge of the first block. This analogy also illustrates the concept of “between characters”.

Beginning or End of String

The caret, when used at the beginning of the regex, means “anchor at the start”.

The dollar sign, $, when used at the end of a regex, means “anchor at the end”.

^ABC means “match ABC at the start of the string only” (begins with ABC).
ABC$ means “match ABC at the end of the string only” (ends with ABC).
^ABC$ means “match ABC at the start and the end of the string (equality)”.

The last example is a rather expensive way to test for equality. Anchors do not match any characters — they match positions.

The circumflex (caret) or dollar signs are not treated as anchors, or special characters in any way, when they appear anywhere else in the regex.

Word Boundary

A word boundary is a special kind of anchor, which matches the start or end of a sequence of word characters. Boundaries occur at the beginning of a string, at the end of a string, before and after punctuation, and before and after white space. A word boundary anchor is specified with: \b. The converse also exist: a non-word boundary, specified with \B. Thus:

\bABC\b means “match the word ABC” (but will not match part of a word like ABCDEF).

Back References

The grouping parentheses has an additional effect — every set “remembers” what was matched. Each parentheses set stores its matches in sequentially numbered memory locations, starting at 1. To “retrieve” a remembered value, you can refer back to it with \‹number›.

(.)\1 means “match any character, and remember in memory location 1; then match a copy of the character previously matched and stored in memory location 1”.

The above example will therefor match twins of any character. Another example:

(.)(.)\2\1 means “match any single character & store in memory location 1; followed by matching any single character & store in memory location 2; followed by matching what is stored in memory location 2; followed by matching what is stored in memory location 1.

The above example will match 4-letter palindromes, like abba, noon, deed, etc.

Conclusion

Apart from simple matches, various libraries provide functionality to perform tasks such as: search and replace; find all matches as a collection; return the location of a match; split strings based on regular expressions; name parentheses captures so that back references can use names instead of numbers, and more. The newer libraries also provide negative and positive look-ahead.

Many offline or online, commercial or free, regular expression testers and debuggers are available. Some are specific to one or more programming languages, others use a PCRE library.

Just about every programmer, or administrator, can use regular expressions to be more productive. If you are new to regular expressions, we hope this will help you to understand the concepts.

PCRE

You may see references to PCRE (Perl Compatible Regular Expressions) when reading regex documentation. The Perl regex library is very powerful and this term refers to libraries that follow the Perl implementation and syntax.

2022-03-03: Fix history message & add reference to Python's raw strings. [brx]
2020-04-22: Added missing closing quotation marks. [brx]
2019-03-07: Added note about POSIX named character classes. [brx]
2017-11-19: Update to new admonitions. [brx]
2017-10-04: Edited. [jjc]
2016-06-11: Created. Edited. [brx;jjc]