CMPS 260: Regular Expressions

We have seen that finite automata (i.e., finite state machines) can be used to describe all finite languages, as well as some (but not all) infinite languages. A quite different, and sometimes more convenient, way of describing a language is via a regular expression.

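In preparation for presenting the syntax and semantics of regular expressions, we first introduce the three relevant operators on languages (i.e., sets of strings). Let R and S be languages over some alphabet Σ.

  1. Union: R ∪ S = { x | x ∈ R or x ∈ S }
  2. Concatenation: R · S = { xy | x ∈ R and y ∈ S }
  3. Kleene closure (star): R* = R^0 ∪ R^1 ∪ R^2 ∪ ..., where R^0 = {λ} and R^(i+1) = R^i · R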
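As a rough illustration, the three operators (taken here to be the usual union, concatenation, and Kleene star, which the semantic rules later in these notes rely on) can be sketched in Python on finite sets of strings. The function names are hypothetical, the empty Python string stands in for λ, and because R* is usually infinite the sketch only collects its members up to an assumed length bound max_len.

```python
def union(R, S):
    """R ∪ S: the strings that belong to R or to S."""
    return set(R) | set(S)


def concat(R, S):
    """R · S: every string x of R followed by every string y of S."""
    return {x + y for x in R for y in S}


def star(R, max_len):
    """The members of R* whose length is at most max_len.

    R* itself is usually infinite, so the bound is an assumption of this sketch.
    """
    result = {""}        # "" plays the role of λ, which is always in R*
    frontier = {""}      # strings discovered in the previous round
    while frontier:
        frontier = {x + y for x in frontier for y in R
                    if len(x + y) <= max_len} - result
        result |= frontier
    return result


print(sorted(union({"a"}, {"b"})))            # ['a', 'b']
print(sorted(concat({"a", "ab"}, {"b", ""}))) # ['a', 'ab', 'abb']
print(sorted(star({"ab"}, 4)))                # ['', 'ab', 'abab']
```

Note that concat({"a", "ab"}, {"b", ""}) has only three members, not four, because "a"+"b" and "ab"+"" are the same string.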


Definition: Syntax and Semantics

Let Σ be an alphabet (i.e., a finite set of symbols), and suppose that a ∈ Σ. Then

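  1. Atomic Regular Expressions:
    ∅ is a regular expression, representing the empty language ∅.
    a is a regular expression, representing the language {a}.
  2. Composite Regular Expressions:
    Assume that r and s are regular expressions representing the languages R and S, respectively. Then
    (r+s) is a regular expression, representing R ∪ S.
    (r·s) is a regular expression, representing R · S.
    (r*) is a regular expression, representing R*.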

It is sometimes convenient to use the notation L(r) to denote the language represented by the regular expression r. Doing so, we could restate the rules above in this way:

Atomic regular expressions: L(∅) = ∅,   L(a) = {a}
Compound regular expressions: L((r+s)) = L(r) ∪ L(s),   L((r·s)) = L(r) · L(s),   L((r*)) = (L(r))*.
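The rules for L(r) translate almost line for line into a recursive function. The following is only a sketch: the nested-tuple encoding of regular expressions and the name lang are assumptions made for illustration, and since L(r) may be infinite the function returns only those members whose length is at most n.

```python
def lang(r, n):
    """Return the strings of L(r) whose length is at most n.

    r is a nested tuple (an encoding assumed for this sketch):
      ("empty",)  ("sym", a)  ("union", r, s)  ("concat", r, s)  ("star", r)
    """
    kind = r[0]
    if kind == "empty":                 # L(∅) = ∅
        return set()
    if kind == "sym":                   # L(a) = {a}
        return {r[1]} if len(r[1]) <= n else set()
    if kind == "union":                 # L((r+s)) = L(r) ∪ L(s)
        return lang(r[1], n) | lang(r[2], n)
    if kind == "concat":                # L((r·s)) = L(r) · L(s)
        return {x + y for x in lang(r[1], n) for y in lang(r[2], n)
                if len(x + y) <= n}
    if kind == "star":                  # L((r*)) = (L(r))*
        base = lang(r[1], n)
        result, frontier = {""}, {""}   # "" stands for λ
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= n} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node: {kind!r}")


A, B = ("sym", "a"), ("sym", "b")
# ((a·(b*)) + (a·a))
EXAMPLE = ("union", ("concat", A, ("star", B)), ("concat", A, A))
print(sorted(lang(EXAMPLE, 3)))   # ['a', 'aa', 'ab', 'abb']
```

Discarding strings longer than n at every step is safe, because any string of length at most n in a concatenation or star is built from pieces that are themselves no longer than n.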

The manner in which parentheses are dealt with here is superior to how Linz describes it. The idea is that, formally speaking, every compound regular expression begins with ( and ends with ), but to avoid clutter we rely upon rules of operator precedence that allow some pairs of parentheses to be omitted, without resulting in ambiguity. (This concept is familiar to you from your work with expressions involving the arithmetic operators +, −, ·, and / (and exponents).)

The standard order of precedence for regular expression operators is, from highest to lowest: *, ·, +. As in arithmetic expressions, we usually omit the · operator. Thus, for example, the (pseudo-) regular expression ab*+aa is shorthand for ((a·(b*)) + (a·a)).
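To make the precedence rules concrete, here is a small recursive-descent sketch that restores the omitted parentheses and · operators. The function name parenthesize is hypothetical, every non-operator character is treated as a single alphabet symbol, and left-to-right grouping of repeated + or · is an assumption of the sketch (the notes state only the precedence order).

```python
def parenthesize(text):
    """Return the fully parenthesized form of a shorthand regular expression."""
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_union():                 # lowest precedence: +
        left = parse_concat()
        while peek() == "+":
            take()
            left = f"({left}+{parse_concat()})"
        return left

    def parse_concat():                # middle precedence: · (usually omitted)
        left = parse_star()
        while peek() is not None and peek() not in "+)":
            if peek() == "·":
                take()
            left = f"({left}·{parse_star()})"
        return left

    def parse_star():                  # highest precedence: *
        node = parse_atom()
        while peek() == "*":
            take()
            node = f"({node}*)"
        return node

    def parse_atom():
        tok = peek()
        if tok is None or tok in "+*·)":
            raise ValueError(f"unexpected {tok!r} at position {pos}")
        take()
        if tok == "(":
            inner = parse_union()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return inner
        return tok                     # a single alphabet symbol

    result = parse_union()
    if pos != len(tokens):
        raise ValueError(f"unexpected {peek()!r} at position {pos}")
    return result


print(parenthesize("ab*+aa"))   # ((a·(b*))+(a·a))
```

The nesting of the three parse functions mirrors the precedence order: the function for the lowest-precedence operator calls the one for the next higher, so * binds tightest and + loosest.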

Linz includes λ (representing {λ}) among the primitive regular expressions, even though it is redundant, because it describes the same language as does the composite regular expression ∅*. Another shorthand notation often used is r+ in place of r·r*.
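Both remarks can be checked on small cases. The sketch below, with hypothetical helper names and an arbitrary length bound N, computes the star of the empty language and compares r·r* against r* for the sample choice L(r) = {ab}.

```python
def concat(R, S, n):
    """The strings of R · S whose length is at most n."""
    return {x + y for x in R for y in S if len(x + y) <= n}


def star(R, n):
    """The strings of R* whose length is at most n ("" stands for λ)."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = concat(frontier, R, n) - result
        result |= frontier
    return result


N = 6                                # assumed length bound for the sketch

EMPTY_STAR = star(set(), N)          # the language of the star of the empty set
print(EMPTY_STAR == {""})            # True: it contains λ and nothing else

R = {"ab"}                           # sample language L(r)
PLUS = concat(R, star(R, N), N)      # L(r·r*), written r+ in shorthand
print(sorted(PLUS))                  # ['ab', 'abab', 'ababab']
```

For this R, which does not contain λ, the language of r+ is exactly that of r* with λ removed. When λ is already in L(r), the two languages coincide.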

Note that a minority of authors use the symbol ∪ rather than + as the union operator in regular expressions, which is probably a superior notation, given that '∪' is the standard union operator in set theory and that '+' is not only the standard addition operator but also, in some programming languages, the string concatenation operator. One disadvantage to using ∪ is that it is not a "plain ASCII" symbol.


Examples:

The set of all bit strings: (0+1)*

The set of bit strings ending with 1: (0+1)*1

The set of bit strings whose first two bits are the same: (00 + 11)(0+1)*

The set of bit strings whose second bit is 0: (00 + 10)(0+1)*   or   (0+1)0(0+1)*

The set of bit strings beginning and ending with the same bit value: 0 + 1 + 0(0+1)*0 + 1(0+1)*1

The set of bit strings of length four or greater whose prefix of length two equals its suffix of length two: 00(0+1)*00 + 01(0+1)*01 + 10(0+1)*10 + 11(0+1)*11

The set of bit strings having at least one occurrence of 010 as a substring: (0+1)* 010 (0+1)*

The set of bit strings having no occurrences of 00 as a substring: (1+01)* (0+λ)
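One way to gain confidence in expressions like these is to compare each against a direct test of its English description on every short bit string. The sketch below uses Python's re module, whose syntax writes union as | rather than +. The helper names are hypothetical, the predicates are one reading of the descriptions above, and agreement up to a length bound is evidence rather than proof.

```python
import re
from itertools import product


def to_python(regex):
    """Translate the notation of these notes into Python re syntax."""
    return regex.replace(" ", "").replace("+", "|").replace("λ", "")


def bit_strings(max_len):
    """Yield every bit string of length 0 through max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)


def second_bit_is_zero(w):
    return len(w) >= 2 and w[1] == "0"


# (expression as written in the notes, direct test of the description)
EXAMPLES = [
    ("(0+1)*", lambda w: True),
    ("(0+1)*1", lambda w: w.endswith("1")),
    ("(00 + 11)(0+1)*", lambda w: len(w) >= 2 and w[0] == w[1]),
    ("(00 + 10)(0+1)*", second_bit_is_zero),
    ("(0+1)0(0+1)*", second_bit_is_zero),
    ("0 + 1 + 0(0+1)*0 + 1(0+1)*1", lambda w: len(w) >= 1 and w[0] == w[-1]),
    ("00(0+1)*00 + 01(0+1)*01 + 10(0+1)*10 + 11(0+1)*11",
     lambda w: len(w) >= 4 and w[:2] == w[-2:]),
    ("(0+1)* 010 (0+1)*", lambda w: "010" in w),
    ("(1+01)* (0+λ)", lambda w: "00" not in w),
]


def mismatches(regex, predicate, max_len=8):
    """List the short bit strings on which regex and predicate disagree."""
    pattern = re.compile(to_python(regex))
    return [w for w in bit_strings(max_len)
            if bool(pattern.fullmatch(w)) != predicate(w)]


for regex, predicate in EXAMPLES:
    print(regex, "->", mismatches(regex, predicate) or "no mismatches")
```

re.fullmatch is used rather than re.search or re.match because a regular expression in the sense of these notes must describe the entire string, not merely a substring or prefix of it.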

The next four examples are related, as the third one uses the second and the fourth one uses the first and third.

The set of bit strings having exactly one occurrence of 1: 0*10*

The set of bit strings having exactly two occurrences of 1: 0*10*10*

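The set of bit strings having an even number of occurrences of 1: 0* + (0*10*10*)*   (the 0* term is needed because a nonempty string of 0's contains zero 1's, an even number, yet is not described by (0*10*10*)* alone)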

The set of bit strings having an odd number of occurrences of 1: 0*10* (0*10*10*)*
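The way the larger expressions are assembled from the smaller ones can be mirrored with ordinary string composition. The sketch below builds the odd-count expression from the exactly-one and exactly-two expressions and compares each against a direct count of 1's on all short bit strings. The names ONE, TWO, ODD, and agrees are hypothetical, and the check covers only strings up to an assumed length bound.

```python
import re
from itertools import product

ONE = "0*10*"                 # exactly one occurrence of 1
TWO = "0*10*10*"              # exactly two occurrences of 1
ODD = f"{ONE}({TWO})*"        # 0*10* (0*10*10*)*


def bit_strings(max_len):
    """Yield every bit string of length 0 through max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)


def agrees(pattern, predicate, max_len=10):
    """True if pattern and predicate accept the same short bit strings."""
    compiled = re.compile(pattern)
    return all(bool(compiled.fullmatch(w)) == predicate(w)
               for w in bit_strings(max_len))


print(agrees(ONE, lambda w: w.count("1") == 1))
print(agrees(TWO, lambda w: w.count("1") == 2))
print(agrees(ODD, lambda w: w.count("1") % 2 == 1))
```

These three expressions use only concatenation and star, so they are already valid Python re patterns and need no translation of the + operator.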