CMPS 260: Regular Expressions

We have seen that finite automata (i.e., finite state machines) can describe every finite language, as well as some infinite languages. A quite different, and sometimes more convenient, way of describing a language is via a regular expression.

In preparation for presenting the syntax and semantics of regular expressions, here we introduce the three relevant operators on languages (i.e., sets of strings). Let R and S be languages over some alphabet Σ.

Union: R ∪ S = {x | x ∈ R ∨ x ∈ S}.
That is, the union of two languages includes any string that is a member of either one of them. (You already should be familiar with this operation, as it applies to all kinds of sets, not just sets of strings.)

Concatenation: R · S = {x·y | x ∈ R ∧ y ∈ S}.
That is, the concatenation of R and S includes precisely those strings that can be formed by choosing a member x of R and a member y of S, and concatenating them to form xy. (Typically, we omit the · operator, similar to how we omit the multiplication operator in writing arithmetic expressions.)
Example: {a, ab, ba} · {a, bb} = {aa, abb, aba, abbb, baa, babb}
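For finite languages, both operators can be computed directly. The following is a minimal Python sketch, assuming languages are modeled as sets of strings; the function names union and concat are illustrative choices, not notation from these notes.

```python
def union(R, S):
    """R ∪ S: every string that belongs to R or to S."""
    return R | S


def concat(R, S):
    """R · S: every string xy with x drawn from R and y drawn from S."""
    return {x + y for x in R for y in S}


R = {"a", "ab", "ba"}
S = {"a", "bb"}
print(sorted(concat(R, S)))  # ['aa', 'abb', 'aba', 'abbb', 'baa', 'babb']
print(sorted(union(R, S)))   # ['a', 'ab', 'ba', 'bb']
```

Because the result is a set, a string that can be formed in more than one way is still counted only once.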

Kleene (or Star) Closure: R* = { x₁x₂...x_n | n≥0 ∧ each x_i ∈ R}.
That is, R* includes precisely those strings that can be formed by choosing zero or more members of R (not necessarily distinct from each other) and concatenating them.
A nice recursive way to describe R* is like this: Define R⁰ = {λ} and, for n≥0, Rⁿ⁺¹ = R·Rⁿ. Then
R* = ∪_{n≥0} Rⁿ = R⁰ ∪ R¹ ∪ R² ∪ ...
Example: {a, ab}* = {λ, a, ab, a·a, a·ab, ab·a, ab·ab, a·a·a, a·a·ab, a·ab·ab, ...}, the set of strings over the alphabet {a,b} in which every occurrence of b is immediately preceded by an occurrence of a.
Often we wish to refer to the set R¹ ∪ R² ∪ ..., which is the same thing as R·R*. As a shorthand, we use R⁺.
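Since R* is usually infinite, a program can only enumerate a finite portion of it. The sketch below, in Python, follows the recursive definition of Rⁿ and collects the members of R* up to a chosen length bound. The names power and star_up_to and the bound max_len are assumptions made for illustration. It then checks the characterization of {a, ab}* given above on all strings of length at most six.

```python
from itertools import product


def concat(R, S):
    """R · S for finite languages given as sets of strings."""
    return {x + y for x in R for y in S}


def power(R, n):
    """R^n via the recursive definition: R^0 = {λ}, R^(n+1) = R · R^n."""
    return {""} if n == 0 else concat(R, power(R, n - 1))


def star_up_to(R, max_len):
    """The members of R* whose length is at most max_len."""
    result = {""}          # R^0 contributes the empty string
    frontier = {""}
    while frontier:
        # Prepend one more member of R to each newly found string,
        # keeping only results that are short enough and not yet seen.
        frontier = {w for w in concat(R, frontier) if len(w) <= max_len} - result
        result |= frontier
    return result


def every_b_follows_a(w):
    """True if each b in w is immediately preceded by an a."""
    return all(i > 0 and w[i - 1] == "a" for i, c in enumerate(w) if c == "b")


all_strings = {"".join(p) for n in range(7) for p in product("ab", repeat=n)}
expected = {w for w in all_strings if every_b_follows_a(w)}

print(sorted(power({"a", "ab"}, 2)))           # ['aa', 'aab', 'aba', 'abab']
print(star_up_to({"a", "ab"}, 6) == expected)  # True
```

A bounded check like this is evidence, not a proof: it confirms the claim only for the strings that were examined.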

Definition: Syntax and Semantics

Let Σ be an alphabet (i.e., a finite set of symbols), and suppose that a ∈ Σ. Then

Atomic Regular Expressions:
- ∅ is a regular expression representing the empty set (i.e., the language containing no members).
- a is a regular expression representing the set {a} (i.e., the language whose single member is a string of length one).

Composite Regular Expressions:
Assume that r and s are regular expressions representing the languages R and S, respectively. Then
- (r+s) is a regular expression representing the set R∪S.
- (r·s) is a regular expression representing the set R·S.
- (r*) is a regular expression representing the set R*.

It is sometimes convenient to use the notation L(r) to denote the language represented by the regular expression r. Doing so, we could restate the rules above in this way:

Atomic regular expressions: L(∅) = ∅, L(a) = {a}
Composite regular expressions: L((r+s)) = L(r) ∪ L(s), L((r·s)) = L(r) · L(s), L((r*)) = (L(r))*.
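The rules for L(r) translate almost line for line into a recursive function. The following Python sketch assumes a regular expression is encoded as a nested tuple such as ("union", r, s); that encoding, the function name language, and the length bound max_len (needed because L(r) may be infinite) are all illustrative assumptions.

```python
def language(r, max_len):
    """Strings of length <= max_len in L(r), where r is a nested tuple."""
    kind = r[0]
    if kind == "empty":                 # L(∅) = ∅
        return set()
    if kind == "sym":                   # L(a) = {a}
        return {r[1]} if max_len >= 1 else set()
    if kind == "union":                 # L((r+s)) = L(r) ∪ L(s)
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "concat":                # L((r·s)) = L(r) · L(s)
        left = language(r[1], max_len)
        right = language(r[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":                  # L((r*)) = (L(r))*
        base = language(r[1], max_len)
        result, frontier = {""}, {""}
        while frontier:
            frontier = {x + y for x in base for y in frontier
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node kind: {kind!r}")


a, b = ("sym", "a"), ("sym", "b")
print(sorted(language(("star", ("union", a, b)), 2)))
# ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
print(language(("star", ("empty",)), 3))  # {''}
```

Each branch corresponds to exactly one rule, which is why the definition is said to give the semantics compositionally: the language of an expression depends only on the languages of its immediate subexpressions.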

The manner in which parentheses are dealt with here is superior to how Linz describes it. The idea is that, formally speaking, every composite regular expression begins with ( and ends with ), but to avoid clutter we rely upon rules of operator precedence that allow some pairs of parentheses to be omitted without introducing ambiguity. (This concept is familiar to you from your work with arithmetic expressions involving +, −, ·, /, and exponents.)

The standard order of precedence for regular expression operators is, from highest to lowest: *, ·, +. As in arithmetic expressions, we usually omit the · operator. Thus, for example, the (pseudo-) regular expression ab*+aa is shorthand for ((a·(b*)) + (a·a)).
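To see the precedence rules at work, here is a small recursive-descent parser in Python that restores the omitted parentheses. It is a sketch under stated assumptions: every character other than (, ), +, *, · and whitespace is treated as a single alphabet symbol, binary operators are grouped to the left, and the function name parenthesize is hypothetical.

```python
def parenthesize(text):
    """Return the fully parenthesized form of a regular expression."""
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def union():                        # lowest precedence: +
        left = concat()
        while peek() == "+":
            take()
            left = f"({left}+{concat()})"
        return left

    def concat():                       # middle precedence: ·, usually implicit
        left = starred()
        while peek() is not None and peek() not in "+)":
            if peek() == "·":
                take()
            left = f"({left}·{starred()})"
        return left

    def starred():                      # highest precedence: *
        node = atom()
        while peek() == "*":
            take()
            node = f"({node}*)"
        return node

    def atom():
        tok = peek()
        if tok is None or tok in "+*·)":
            raise ValueError(f"unexpected token at position {pos}: {tok!r}")
        take()
        if tok == "(":
            inner = union()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return inner
        return tok                      # a single alphabet symbol

    result = union()
    if pos != len(tokens):
        raise ValueError(f"unexpected token at position {pos}: {tokens[pos]!r}")
    return result


print(parenthesize("ab*+aa"))   # ((a·(b*))+(a·a))
print(parenthesize("(0+1)*1"))  # (((0+1)*)·1)
```

The nesting of the three helper functions mirrors the precedence order: union calls concat, which calls starred, so * binds most tightly and + least tightly.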

Linz includes λ (representing {λ}) among the primitive regular expressions, even though it is redundant: the composite regular expression ∅* describes the same language, because ∅⁰ = {λ} and ∅ⁿ = ∅ for every n ≥ 1. Another shorthand notation often used is r⁺ in place of r·r*.

Note that a minority of authors use the symbol ∪ rather than + as the union operator in regular expressions, which is probably a superior notation, given that '∪' is the standard union operator in set theory and that '+' is not only the standard addition operator but also, in some programming languages, the string concatenation operator. One disadvantage to using ∪ is that it is not a "plain ASCII" symbol.

Examples:

The set of all bit strings: (0+1)*

The set of bit strings ending with 1: (0+1)*1

The set of bit strings whose first two bits are the same: (00 + 11)(0+1)*

The set of bit strings whose second bit is 0: (00 + 10)(0+1)* or (0+1)0(0+1)*

The set of bit strings beginning and ending with the same bit value: 0 + 1 + 0(0+1)*0 + 1(0+1)*1

The set of bit strings of length four or greater whose prefix of length two equals its suffix of length two: 00(0+1)*00 + 01(0+1)*01 + 10(0+1)*10 + 11(0+1)*11

The set of bit strings having at least one occurrence of 010 as a substring: (0+1)* 010 (0+1)*

The set of bit strings having no occurrences of 00 as a substring: (1+01)* (0+λ)

The next four examples are related, as the third one uses the second and the fourth one uses the first and third.

The set of bit strings having exactly one occurrence of 1: 0*10*

The set of bit strings having exactly two occurrences of 1: 0*10*10*

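The set of bit strings having an even number of occurrences of 1: 0* + (0*10*10*)* (the 0* term supplies the nonempty strings containing no 1s, which the starred expression alone cannot generate)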

The set of bit strings having an odd number of occurrences of 1: 0*10* (0*10*10*)*
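Examples like these can be sanity-checked by brute force. The Python sketch below translates an expression in the notation of these notes into the syntax of Python's re module (where | plays the role of + and an empty alternative plays the role of λ) and compares it with a plain-English condition on every bit string up to a length bound. The helper names to_python_regex and first_mismatch are hypothetical, and the translation assumes the alphabet {0, 1} and does not handle ∅.

```python
import re
from itertools import product


def to_python_regex(expr):
    """Translate the notation used here into Python `re` syntax."""
    return expr.replace(" ", "").replace("+", "|").replace("λ", "")


def first_mismatch(expr, predicate, max_len=10):
    """Shortest bit string on which expr and predicate disagree, or None."""
    pattern = re.compile(to_python_regex(expr))
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            w = "".join(bits)
            # fullmatch requires the whole string to belong to the language
            if bool(pattern.fullmatch(w)) != predicate(w):
                return w
    return None


checks = [
    ("(0+1)* 010 (0+1)*", lambda w: "010" in w),
    ("(1+01)* (0+λ)",     lambda w: "00" not in w),
    ("0*10*",             lambda w: w.count("1") == 1),
    ("0*10* (0*10*10*)*", lambda w: w.count("1") % 2 == 1),
]
for expr, predicate in checks:
    print(expr, "->", first_mismatch(expr, predicate))  # None for each
```

A result of None means only that no counterexample exists up to the bound; a returned string, on the other hand, is a definite witness that the expression and the description differ.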