Computing FIRST() and FOLLOW() for a CFG
Let G = (V, Σ, S, P) be a context-free grammar (CFG).
V is the set of nonterminal symbols, Σ is the set of
terminal symbols, S ∈ V is the "start" symbol, and
P ⊆ V × (V ∪ Σ)* is the
set of productions. (A production (A, α) ∈ P
is usually expressed in the form A ⟶ α.)
For the sake of brevity, we use Γ = V ∪ Σ
to refer to the set containing all terminal and nonterminal
symbols of G.
Also, we use Σ' to refer to Σ ∪ {$}.
The $ symbol, which is assumed not to be a member
of Γ, serves as an end-of-input-string marker.
This document describes algorithms by which to compute
FIRST(Z) for all Z ∈ Γ and
FOLLOW(A) for all A ∈ V.
(We are not concerned with the follow sets of terminal symbols.)
Recall the following definition:
FIRST'(Z) = { t ∈ Σ | Z ⟹* tα
for some α ∈ Γ* }
That is, if t ∈ Σ appears as the first
(i.e., leftmost) symbol in a string derivable from Z,
then t ∈ FIRST'(Z).
(Of course, if Z is a terminal symbol, then FIRST(Z) = {Z}.)
In the case that Z ⟹* λ (i.e., Z is "nullable",
which obviously implies Z ∈ V),
FIRST(Z) = FIRST'(Z) ∪ {λ}.
If Z is not nullable, then FIRST(Z) = FIRST'(Z).
As for FOLLOW():
FOLLOW(A) = { t ∈ Σ' |
S$ ⟹* αAtβ
for some α ∈ Γ* and β ∈ (Γ ∪ {$})* }
That is, if t ∈ Σ' appears immediately
after A in a string derivable from S$, then
t ∈ FOLLOW(A).
Having the values of FIRST(A) and FOLLOW(A) for every nonterminal
symbol A in grammar G is vital in devising a one-symbol-lookahead
stack machine that accepts L(G) or, what is really the same thing,
devising a parse table for G that guides the standard top-down
parsing algorithm. Each cell in a parse table identifies which
(if any) productions of the grammar are viable candidates to be
applied next, as a function of the next input symbol and the
nonterminal currently on top of the stack. (A "viable"
production is one that, if applied next, could possibly
lead to a successful parse of the input string. Whether or not it
can lead to a successful parse depends upon the suffix of the
input string that follows the next input symbol.)
For a CFG to qualify as being LL(1), there can be at most
one viable production for each pair
(A, b) ∈ V×Σ'.
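As an illustration of the LL(1) condition, here is a minimal Python sketch that fills a parse table from precomputed sets and reports the cells holding more than one production. The cell-filling rule used (place A ⟶ α in cell (A, b) when b ∈ FIRST'(α), or when α is nullable and b ∈ FOLLOW(A)) is the standard textbook construction and is an assumption here, since this document does not spell it out. The function name, the data layout, and the tiny grammar S ⟶ Ab, A ⟶ aA | λ are likewise hypothetical.

```python
END = "$"  # end-of-input marker

def ll1_table(productions, first_prime, nullable, follow):
    """Map each cell (A, b) to the list of productions viable there.

    productions: list of (lhs, rhs) pairs; rhs is a tuple of symbols,
                 with the empty tuple standing for λ.
    first_prime: dict giving FIRST'(Z) (no λ) for every symbol Z.
    nullable:    set of nullable nonterminals.
    follow:      dict giving FOLLOW(A) for every nonterminal A.
    """
    table = {}
    for lhs, rhs in productions:
        lookaheads = set()
        rhs_nullable = True
        for sym in rhs:
            lookaheads |= first_prime[sym]
            if sym not in nullable:
                rhs_nullable = False
                break
        if rhs_nullable:
            # The whole right-hand side can vanish, so whatever may
            # follow lhs is also a legitimate lookahead.
            lookaheads |= follow[lhs]
        for b in lookaheads:
            table.setdefault((lhs, b), []).append((lhs, rhs))
    return table

def ll1_conflicts(table):
    """Return the cells that hold more than one viable production."""
    return {cell: prods for cell, prods in table.items() if len(prods) > 1}

# Hypothetical example grammar: S -> A b,  A -> a A | λ
PRODUCTIONS = [("S", ("A", "b")), ("A", ("a", "A")), ("A", ())]
FIRST_PRIME = {"a": {"a"}, "b": {"b"}, "S": {"a", "b"}, "A": {"a"}}
NULLABLE = {"A"}
FOLLOW = {"S": {END}, "A": {"b"}}

TABLE = ll1_table(PRODUCTIONS, FIRST_PRIME, NULLABLE, FOLLOW)
IS_LL1 = not ll1_conflicts(TABLE)
```

For this grammar every cell holds at most one production, so `IS_LL1` is true; the cell (A, b) holds the λ-production because b ∈ FOLLOW(A).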
Step 1: Identify the nullable nonterminals.
q := empty queue
N := ∅ // N is the set of variables known to be nullable
do for each λ-production A ⟶ λ
| if A ∉ N
| | N := N ∪ {A}
| | q.enqueue(A)
| fi
od
// At this point, all nonterminals that produce λ in one step
// (i.e., via the application of a single λ-production) are in N
// and on the queue. What follows is a loop to identify the
// nonterminals from which λ is derivable by a sequence of
// two or more applications of productions.
do while !q.isEmpty()
| B := q.dequeue();
| do for each production A ⟶ αBβ // in which B appears on RHS
| | if A ∉ N ∧ every symbol X in αβ satisfies X ∈ N
| | | N := N ∪ {A}
| | | q.enqueue(A)
| | fi
| od
od
// At this point, N contains precisely the nullable nonterminals
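The queue-based procedure above can be sketched in Python as follows. The representation is an assumption made for illustration: a grammar is a list of (lhs, rhs) pairs, each rhs is a tuple of symbols, and the empty tuple stands for λ. The function name and the sample grammar are hypothetical.

```python
from collections import deque

def nullable_nonterminals(productions):
    """Return the set N of nullable nonterminals."""
    nullable = set()
    queue = deque()
    # Nonterminals that produce λ in one step.
    for lhs, rhs in productions:
        if not rhs and lhs not in nullable:
            nullable.add(lhs)
            queue.append(lhs)
    # Nonterminals that produce λ in two or more steps.
    while queue:
        b = queue.popleft()
        for lhs, rhs in productions:
            # b is already in N, so testing every symbol of rhs is
            # equivalent to testing every symbol other than b.
            if b in rhs and lhs not in nullable and all(x in nullable for x in rhs):
                nullable.add(lhs)
                queue.append(lhs)
    return nullable

# Hypothetical example grammar:
#   S -> A B c,   A -> a A | λ,   B -> A A | b
GRAMMAR = [
    ("S", ("A", "B", "c")),
    ("A", ("a", "A")),
    ("A", ()),
    ("B", ("A", "A")),
    ("B", ("b",)),
]

N = nullable_nonterminals(GRAMMAR)
```

Here A is nullable directly and B becomes nullable via B ⟶ AA, while S is not nullable because of the terminal c. Each nonterminal is enqueued at most once, so the outer loop runs at most |V| times.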
Step 2: Calculation of FIRST()
As noted above, a nonterminal symbol A is said to be nullable iff
A ⟹* λ.
Generalizing that, a string
α = X1X2···Xk
∈ Γ* is nullable iff
X1X2···Xk
⟹* λ.
(This condition is equivalent to each Xi, 1≤i≤k,
itself being nullable.
A special case of this occurs when α is λ, which
corresponds to the case in which k=0.)
The algorithm below exploits the fact that, if
A ⟶ X1X2···Xm
is a production, 1≤k≤m, and
X1X2···Xk-1
is nullable, then FIRST'(Xk) ⊆ FIRST'(A).
To confirm that, suppose that Xk ⟹* tβ
for some t ∈ Σ and β ∈ Γ*,
so that t ∈ FIRST'(Xk). Then
A ⟹ X1X2···Xm
⟹* XkXk+1···Xm
⟹* tβXk+1···Xm
demonstrating that t ∈ FIRST'(A), too.
In the algorithm, variable first() is used as a proxy for
the mathematical function FIRST(), the intent being that,
upon completion of execution, first(A) = FIRST(A) for
every A ∈ V.
// Let N be the set of nullable nonterminal symbols,
// as computed by the algorithm described above.
N := { A ∈ V | A ⟹* λ }
do for each t ∈ Σ
| first(t) := {t}
od
do for each A ∈ V
| first(A) := ∅
od
updateOccurred := true
do while updateOccurred
| updateOccurred := false
| do for each non-λ production A ⟶ X1X2···Xm
| | do for each k in [1..m] such that X1···Xk-1 is nullable
| | | if first(Xk) - first(A) ≠ ∅ then
| | | | first(A) := first(A) ∪ first(Xk)
| | | | updateOccurred := true
| | | fi
| | od
| od
od
// At this point, first(Z) = FIRST'(Z) for every Z ∈ Γ
do for each A ∈ N
| first(A) := first(A) ∪ {λ}
od
// Now, first(Z) = FIRST(Z) for every Z ∈ Γ
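A minimal Python sketch of this fixed-point iteration follows. It assumes the nullable set has already been computed and is passed in; the function name, the string "λ" used as the λ marker, and the sample grammar are all hypothetical choices made for illustration.

```python
LAMBDA = "λ"  # marker placed in FIRST(A) when A is nullable

def first_sets(productions, nonterminals, terminals, nullable):
    """Brute-force computation of FIRST for every symbol."""
    first = {t: {t} for t in terminals}
    first.update({a: set() for a in nonterminals})
    update_occurred = True
    while update_occurred:
        update_occurred = False
        for lhs, rhs in productions:
            # Visit X1, X2, ... for as long as the prefix before the
            # current symbol is nullable.
            for x in rhs:
                if not first[x] <= first[lhs]:
                    first[lhs] |= first[x]
                    update_occurred = True
                if x not in nullable:
                    break
    # At this point first holds FIRST'; add λ for nullable symbols.
    for a in nullable:
        first[a].add(LAMBDA)
    return first

# Hypothetical example grammar:
#   S -> A B c,   A -> a A | λ,   B -> A A | b
GRAMMAR = [
    ("S", ("A", "B", "c")),
    ("A", ("a", "A")),
    ("A", ()),
    ("B", ("A", "A")),
    ("B", ("b",)),
]
NONTERMINALS = {"S", "A", "B"}
TERMINALS = {"a", "b", "c"}
NULLABLE = {"A", "B"}  # assumed to come from Step 1

FIRST = first_sets(GRAMMAR, NONTERMINALS, TERMINALS, NULLABLE)
```

Because both A and B are nullable, the production S ⟶ ABc contributes FIRST'(A), FIRST'(B), and c to first(S), giving {a, b, c}.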
The algorithm above describes a rather brute-force approach to computing
FIRST(). An algorithm that would be more efficient (with
respect to running time) appears in the appendix.
Step 3: Calculation of FOLLOW()
Above (and below, in the appendix) are shown algorithms to compute
FIRST, which maps each symbol in Γ to a subset of Σ ∪ {λ}. For the purposes of
computing FOLLOW(), it is convenient to have extended versions of
the functions FIRST' and FIRST whose domains include not
just Γ (individual symbols) but also strings of length two or more.
The definitions of the extended functions are
FIRST'*(Z1Z2··Zr) =
{ t ∈ Σ |
Z1Z2··Zr ⟹* tα
for some α }
If every Zi (1≤i≤r) is nullable, then
FIRST*(Z1Z2··Zr) =
FIRST'*(Z1Z2··Zr)
∪ {λ};
otherwise, FIRST*(Z1Z2··Zr) = FIRST'*(Z1Z2··Zr).
Here is a method for computing FIRST*:
function first*(Z1··Zr) :
// Let N be the set of nullable nonterminal symbols,
// as computed by the algorithm described above.
result := FIRST'(Z1)
j := 1
do while (j < r ∧ Zj ∈ N)
| result := result ∪ FIRST'(Zj+1)
| j := j+1
od
if j = r ∧ Zr ∈ N then
| result := result ∪ {λ}
fi
return result
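The same computation can be sketched in Python. The sketch assumes FIRST' is available as a dictionary from symbols to sets and that the nullable set is known; both are passed in as arguments. Unlike the pseudocode, which presumes r ≥ 1, this version also accepts the empty string and returns {λ} for it. The names and sample values are hypothetical.

```python
LAMBDA = "λ"  # marker for the empty string

def first_of_string(symbols, first_prime, nullable):
    """Return FIRST* of the string Z1..Zr given as a tuple of symbols."""
    result = set()
    for z in symbols:
        result |= first_prime[z]
        if z not in nullable:
            # Nothing to the right of a non-nullable symbol can
            # contribute a leading terminal.
            return result
    # Every symbol was nullable (or the string was empty).
    result.add(LAMBDA)
    return result

# FIRST' values for the hypothetical grammar
#   S -> A B c,   A -> a A | λ,   B -> A A | b
FIRST_PRIME = {
    "a": {"a"}, "b": {"b"}, "c": {"c"},
    "A": {"a"}, "B": {"a", "b"}, "S": {"a", "b", "c"},
}
NULLABLE = {"A", "B"}

AB = first_of_string(("A", "B"), FIRST_PRIME, NULLABLE)
ABC = first_of_string(("A", "B", "c"), FIRST_PRIME, NULLABLE)
```

In this example `AB` contains λ because both A and B are nullable, whereas `ABC` does not because the terminal c blocks the string from vanishing.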
Making use of the first*() method,
we can compute the FOLLOW() function:
// Let N be the set of nullable nonterminal symbols,
// as computed by the algorithm described above.
N := { A ∈ V | A ⟹* λ }
do for each A ∈ V
| follow(A) := ∅
| follow-1(A) := ∅
od
follow(S) := { $ }
do for each non-λ production A ⟶ X1··Xm
| do for each k ∈ [1..m)
| | if Xk ∈ V
| | | F := first*(Xk+1··Xm)
| | | follow(Xk) := follow(Xk) ∪ (F - {λ})
| | | if λ ∈ F then
| | | | follow-1(A) := follow-1(A) ∪ {Xk}
| | | fi
| | fi
| od
| if Xm ∈ V
| | follow-1(A) := follow-1(A) ∪ {Xm}
| fi
od
// At this point, for every B ∈ V, follow(B) includes
// every t ∈ Σ such that there exists a production
// A ⟶ αBβ where β ⟹* tφ for some φ.
// Assuming (as we are) that every nonterminal symbol is
// useful, this condition implies the existence of the
// derivation S ⟹* γAη ⟹ γαBβη ⟹* γαBtφη
// demonstrating that t ∈ FOLLOW(B).
// Meanwhile, for every A ∈ V, follow-1(A) includes every
// nonterminal Xj such that for some production A ⟶ X1··Xm,
// Xj+1··Xm ⟹* λ, implying that FOLLOW(A) ⊆ FOLLOW(Xj).
// To demonstrate this, suppose that t ∈ FOLLOW(A).
// Then there is a derivation
// S$ ⟹* αAtβ ⟹ αX1··Xmtβ ⟹* αX1··Xjtβ
// Hence, t ∈ FOLLOW(Xj), too.
// Now resolve all the FOLLOW(A) ⊆ FOLLOW(B) relationships
// indicated by follow-1:
q := empty queue
do for each A ∈ V
| if follow-1(A) ≠ ∅
| | q.enqueue(A)
| fi
od
do while !q.isEmpty()
| A := q.dequeue()
| do for each B ∈ follow-1(A)
| | if follow(A) - follow(B) ≠ ∅ then
| | | follow(B) := follow(B) ∪ follow(A)
| | | if !q.inQueue(B) then
| | | | q.enqueue(B)
| | | fi
| | fi
| od
od
// At this point, follow(A) = FOLLOW(A) for all A ∈ V.
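The two phases above (seeding from productions, then propagating along the follow-1 relationships) can be sketched in Python as shown below. The sketch assumes FIRST' and the nullable set are already known and passed in, and it inlines the first* scan rather than calling a separate function. The dictionary `follow_inv` plays the role of follow-1. All names, and the use of the familiar expression grammar as sample input, are choices made for illustration.

```python
from collections import deque

END = "$"  # end-of-input marker

def follow_sets(productions, nonterminals, start, first_prime, nullable):
    """Compute FOLLOW(A) for every nonterminal A."""
    follow = {a: set() for a in nonterminals}
    follow_inv = {a: set() for a in nonterminals}  # role of follow-1
    follow[start].add(END)
    for lhs, rhs in productions:
        for k, x in enumerate(rhs):
            if x not in nonterminals:
                continue
            # Scan the suffix after x, collecting FIRST' values until a
            # non-nullable symbol is met.
            suffix_nullable = True
            for z in rhs[k + 1:]:
                follow[x] |= first_prime[z]
                if z not in nullable:
                    suffix_nullable = False
                    break
            if suffix_nullable:
                # Covers the last symbol too (empty suffix):
                # FOLLOW(lhs) must be contained in FOLLOW(x).
                follow_inv[lhs].add(x)
    # Resolve the FOLLOW(A) ⊆ FOLLOW(B) relationships with a worklist.
    queue = deque(sorted(a for a in nonterminals if follow_inv[a]))
    while queue:
        a = queue.popleft()
        for b in follow_inv[a]:
            if not follow[a] <= follow[b]:
                follow[b] |= follow[a]
                if b not in queue:
                    queue.append(b)
    return follow

# Expression grammar: E -> T E',  E' -> + T E' | λ,
#                     T -> F T',  T' -> * F T' | λ,  F -> ( E ) | id
GRAMMAR = [
    ("E", ("T", "E'")), ("E'", ("+", "T", "E'")), ("E'", ()),
    ("T", ("F", "T'")), ("T'", ("*", "F", "T'")), ("T'", ()),
    ("F", ("(", "E", ")")), ("F", ("id",)),
]
NONTERMINALS = {"E", "E'", "T", "T'", "F"}
FIRST_PRIME = {
    "+": {"+"}, "*": {"*"}, "(": {"("}, ")": {")"}, "id": {"id"},
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+"}, "T'": {"*"},
}
NULLABLE = {"E'", "T'"}

FOLLOW = follow_sets(GRAMMAR, NONTERMINALS, "E", FIRST_PRIME, NULLABLE)
```

In this example the seeding phase alone gives follow(F) = {*}; the remaining members +, ), and $ arrive only through propagation from T, which in turn receives ) and $ from E.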
Appendix: A Better Way to Compute FIRST()
The algorithm described earlier for computing the FIRST() function
aimlessly iterates through every production in G "hoping" to find
one whose left-hand side's first() value should be updated
to include one or more new terminal symbols.
Only after an unproductive iteration through all the productions
does it recognize that there is nothing more to be done.
A better algorithm, upon identifying a B ∈ V whose
first() value needs to be updated, would make
that update and then direct its attention to those nonterminals A ∈ V
such that FIRST'(B) ⊆ FIRST'(A) by virtue of the
fact that there is a production
A ⟶ αBβ where α is nullable.
(If first(A) does not include all the members of
first(B), then first(A) needs to absorb
all those members!)
The algorithm below does this.
// Let N be the set of nullable nonterminal symbols,
// as computed by the algorithm described above.
N := { A ∈ V | A ⟹* λ }
do for each t ∈ Σ
| first(t) := {t}
| first-1(t) := ∅
od
do for each A ∈ V
| first(A) := ∅
| first-1(A) := ∅
od
do for each non-λ production A ⟶ X1X2···Xm
| do for each k in [1..m] such that X1X2···Xk-1 is nullable
| | first-1(Xk) := first-1(Xk) ∪ {A}
| od
od
// At this point, first-1(X) = { A ∈ V | A ⟶ αXβ for some nullable α }.
// Significance: for each A ∈ first-1(X), FIRST'(X) ⊆ FIRST'(A)
do for each t ∈ Σ
| do for each A ∈ first-1(t)
| | first(A) := first(A) ∪ {t}
| od
od
// At this point, for every A ∈ V, first(A) includes every t ∈ Σ
// such that A ⟶ αtβ is a production and α is nullable.
q := empty queue
do for each B ∈ V
| if first-1(B) ≠ ∅
| | q.enqueue(B)
| fi
od
// At this point, the queue includes every B ∈ V for which there
// exists some A ∈ V such that FIRST'(B) ⊆ FIRST'(A) by virtue
// of there being a production A ⟶ αBβ, where α is nullable.
do while !q.isEmpty()
| B := q.dequeue()
| do for each A ∈ first-1(B)
| | if first(B) - first(A) ≠ ∅
| | | first(A) := first(A) ∪ first(B)
| | | if first-1(A) ≠ ∅ ∧ !q.inQueue(A)
| | | | q.enqueue(A)
| | | fi
| | fi
| od
od
// At this point, first(A) = FIRST'(A) for every A ∈ V
do for each A ∈ N
| first(A) := first(A) ∪ {λ}
od
// Now, first(A) = FIRST(A) for every A ∈ V
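Here is a minimal Python sketch of the worklist version. As before, the nullable set is assumed to be precomputed and passed in. The dictionary `first_inv` plays the role of first-1, and the function name, the "λ" marker, and the sample grammar are hypothetical.

```python
from collections import deque

LAMBDA = "λ"  # marker placed in FIRST(A) when A is nullable

def first_sets_worklist(productions, nonterminals, terminals, nullable):
    """Worklist computation of FIRST for every symbol."""
    first = {t: {t} for t in terminals}
    first.update({a: set() for a in nonterminals})
    first_inv = {x: set() for x in first}  # role of first-1
    for lhs, rhs in productions:
        # Record lhs against each Xk whose preceding prefix is nullable.
        for x in rhs:
            first_inv[x].add(lhs)
            if x not in nullable:
                break
    # Seed each nonterminal with the terminals it can start with directly.
    for t in terminals:
        for a in first_inv[t]:
            first[a].add(t)
    # Propagate from B to every A with FIRST'(B) ⊆ FIRST'(A).
    queue = deque(sorted(b for b in nonterminals if first_inv[b]))
    while queue:
        b = queue.popleft()
        for a in first_inv[b]:
            if not first[b] <= first[a]:
                first[a] |= first[b]
                if first_inv[a] and a not in queue:
                    queue.append(a)
    # first now holds FIRST'; add λ for the nullable nonterminals.
    for a in nullable:
        first[a].add(LAMBDA)
    return first

# Expression grammar: E -> T E',  E' -> + T E' | λ,
#                     T -> F T',  T' -> * F T' | λ,  F -> ( E ) | id
GRAMMAR = [
    ("E", ("T", "E'")), ("E'", ("+", "T", "E'")), ("E'", ()),
    ("T", ("F", "T'")), ("T'", ("*", "F", "T'")), ("T'", ()),
    ("F", ("(", "E", ")")), ("F", ("id",)),
]
NONTERMINALS = {"E", "E'", "T", "T'", "F"}
TERMINALS = {"+", "*", "(", ")", "id"}
NULLABLE = {"E'", "T'"}

FIRST = first_sets_worklist(GRAMMAR, NONTERMINALS, TERMINALS, NULLABLE)
```

For this grammar only F and T ever enter the queue: F passes {(, id} to T, and T passes it on to E, after which the queue is empty. Membership tests on a deque take linear time, so a production implementation might pair the queue with a set of queued symbols.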