CMPS 260
Exhaustive Parsing Algorithm for CFG's

A fundamental decision problem with respect to any language class is the membership problem, each instance of which asks whether a given string is a member of a given language (of that class).

We already know that, for regular languages (as described by either a finite automaton or a regular expression), this problem is decidable (meaning that there exists an algorithmic solution).

What is the status of this problem with respect to context-free languages (i.e., those generated by context-free grammars)? That is, does there exist an algorithm that solves this decision problem:

Given a context-free grammar (CFG) G and a string x, is x ∈ L(G)?

The answer is yes. One algorithm is described below. In effect, it performs a breadth-first search of a tree whose nodes are labeled by sentential forms resulting from leftmost derivations of the grammar.

If x ∈ L(G), the algorithm will always give the correct answer. To ensure that the algorithm will terminate when x ∉ L(G), we impose upon G the restriction that it be free of λ-productions. This does not cause a loss of generality, because a grammar with λ-productions can be algorithmically transformed into one without such productions and generating the same language (except for the loss of λ as a member).
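The transformation mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not a prescribed implementation: the grammar is assumed to be a Python dict mapping each nonterminal (a single character) to a list of right-hand sides, with the empty string "" standing for λ. The function names are hypothetical. The sketch first finds the nullable nonterminals (those that derive λ) and then, for each production, emits every variant obtained by omitting some of the nullable symbols, discarding the empty variant.

```python
from itertools import product


def nullable_nonterminals(grammar):
    """Return the set of nonterminals that derive the empty string."""
    nullable = set()
    changed = True
    while changed:  # iterate until no new nullable nonterminal is found
        changed = False
        for head, bodies in grammar.items():
            if head not in nullable and any(
                all(sym in nullable for sym in body) for body in bodies
            ):
                nullable.add(head)
                changed = True
    return nullable


def remove_lambda_productions(grammar):
    """Return a grammar without λ-productions generating L(G) minus {λ}."""
    nullable = nullable_nonterminals(grammar)
    result = {}
    for head, bodies in grammar.items():
        new_bodies = []
        for body in bodies:
            # A nullable symbol may be kept or dropped; any other symbol is kept.
            choices = [(sym, "") if sym in nullable else (sym,) for sym in body]
            for combo in product(*choices):
                candidate = "".join(combo)
                if candidate and candidate not in new_bodies:
                    new_bodies.append(candidate)
        result[head] = new_bodies
    return result


# Illustrative grammar (an assumption): S -> AB, A -> aA | λ, B -> b | λ
example = {"S": ["AB"], "A": ["aA", ""], "B": ["b", ""]}
print(remove_lambda_productions(example))
```

Because λ ∈ L(G) exactly when the start symbol is nullable, that one case can be answered separately from the nullable set before the transformed grammar is used.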

This algorithm serves the purpose of demonstrating that the membership problem for context-free languages is decidable. It is of no practical use, however, because it is horribly slow and requires a tremendous amount of storage. Indeed, both its running time and its storage needs grow exponentially in the length of x, at least for some grammars.

Jay Earley is credited with discovering an O(n³)-time algorithm for this problem (where n is the length of the input string), which he first described in 1968. His algorithm runs in O(n²) time if the grammar is unambiguous.

q := empty queue
q.enqueue(S);    // place G's start symbol on the queue
F := {S}         // set of all sentential forms ever generated
success := false;
do while !success ∧ !q.isEmpty()
|  α := q.dequeue()
|  A := leftmost nonterminal in α
|  do for each production A ⟶ φ
|  |  β := α with leftmost occurrence of A replaced by φ
|  |  if x = β then
|  |  |  success := true
|  |  else if β ∈ F  ∨  β is inconsistent with x
|  |  |  do nothing
|  |  else // β is new and consistent with x
|  |  |  q.enqueue(β)
|  |  |  F := F ∪ {β}
|  |  fi
|  od
od

The condition "β is inconsistent with x" means that at least one of the following holds:

  1. |β| > |x|,
  2. β is a terminal string different from x,
  3. β = uAη for some terminal string u, nonterminal A, and string η, and u is not a prefix of x, or
  4. β = ηAv for some string η, nonterminal A, and terminal string v, and v is not a suffix of x.

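Because we are assuming that G has no λ-productions, a sentential form β cannot "contract", meaning that any terminal string derivable therefrom must have length |β| or greater. This justifies condition (1). Conditions (3) and (4) are justified by the fact that the terminals to the left of the leftmost nonterminal of β (respectively, to the right of its rightmost nonterminal) are never rewritten, so they form a prefix (respectively, suffix) of every terminal string derivable from β.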
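The algorithm and the consistency test can be put together in a short Python sketch. It rests on several assumptions that are not part of the notes: the grammar is a dict mapping each single-character nonterminal to a list of right-hand sides, every other character is a terminal, the grammar has no λ-productions, and the names is_inconsistent and is_member are chosen only for illustration.

```python
from collections import deque

# Illustrative grammar (an assumption): S -> aSA | a, A -> aAb | b
GRAMMAR = {"S": ["aSA", "a"], "A": ["aAb", "b"]}


def is_inconsistent(beta, x, grammar):
    """Return True if beta meets one of the four inconsistency conditions."""
    if len(beta) > len(x):                          # condition 1
        return True
    positions = [i for i, c in enumerate(beta) if c in grammar]
    if not positions:                               # condition 2
        return beta != x
    prefix = beta[:positions[0]]                    # terminals before leftmost nonterminal
    suffix = beta[positions[-1] + 1:]               # terminals after rightmost nonterminal
    return not x.startswith(prefix) or not x.endswith(suffix)  # conditions 3 and 4


def is_member(grammar, start, x):
    """Breadth-first search over leftmost derivations; True iff x is derivable."""
    queue = deque([start])
    seen = {start}                                  # every sentential form generated so far
    while queue:
        alpha = queue.popleft()
        # Every queued form contains a nonterminal, so this search succeeds.
        i = next(k for k, c in enumerate(alpha) if c in grammar)
        for rhs in grammar[alpha[i]]:
            beta = alpha[:i] + rhs + alpha[i + 1:]
            if beta == x:
                return True
            if beta in seen or is_inconsistent(beta, x, grammar):
                continue
            queue.append(beta)
            seen.add(beta)
    return False


print(is_member(GRAMMAR, "S", "aaaabbabb"))
print(is_member(GRAMMAR, "S", "ab"))
```

Termination follows from condition (1): only sentential forms of length at most |x| are ever enqueued, there are finitely many of those, and the set seen prevents any form from being enqueued twice.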

Consider the CFG G with the following productions:

S ⟶ aSA | a   (1) (2)
A ⟶ aAb | b   (3) (4)

Given G and the string x = aaaabbabb (which is a member of L(G)), the algorithm will, in effect, traverse the tree shown below, from top to bottom, going from left to right across each level. Each node is labeled by the sentential form that is obtained by doing a leftmost derivation (starting from start symbol S) in which the productions applied are as indicated on the edges along the path to the node from the root.

The leaves whose labels are underlined indicate dead ends at which the obtained sentential form is not consistent with the target string, for one of the reasons given above. The leaves whose labels are parenthesized indicate sentential forms that were obtained earlier. (The existence of such nodes implies that the grammar is ambiguous, of course.)

Target string: aaaabbabb

Grammar:
S ⟶ aSA   (1)
S ⟶ a     (2)
A ⟶ aAb   (3)
A ⟶ b     (4)

[Figure: breadth-first search tree of leftmost derivations for the target string]


The lone leaf whose label is neither underlined nor parenthesized is labeled by the target string. Using the labels on the edges leading to that leaf from the root, we can reconstruct the leftmost derivation (and corresponding derivation tree) for that string. That leftmost derivation is

S ⟹¹ aSA ⟹¹ aaSAA ⟹² aaaAA ⟹³ aaaaAbA ⟹⁴ aaaabbA ⟹³ aaaabbaAb ⟹⁴ aaaabbabb

The superscripts identify which production was applied at each step. The corresponding derivation tree is this:

           S
          /|\
         / | \
        /  |  \
       /   |   \
      /    |    \
     /     |     \
    /      |      \
   /       |       \
  /        |        \
 /         |         \
a          S          A 
          /|\        /|\ 
         / | \      / | \ 
        /  |  \    /  |  \
       a   S   A  a   A   b
           |  /|\     |
           a a A b    b
               |
               b
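
The leftmost derivation given above can be checked mechanically by replaying its production numbers. The sketch below is only an illustration: the table PRODUCTIONS, the set NONTERMINALS, and the function replay_leftmost are hypothetical names, and the production numbering is the one used in this example.

```python
# Productions numbered as in the example grammar.
PRODUCTIONS = {1: ("S", "aSA"), 2: ("S", "a"), 3: ("A", "aAb"), 4: ("A", "b")}
NONTERMINALS = {"S", "A"}


def replay_leftmost(start, steps):
    """Apply the numbered productions, each to the leftmost nonterminal.

    Returns the list of sentential forms, beginning with the start symbol.
    Raises ValueError if a production does not match the leftmost nonterminal.
    """
    form = start
    forms = [form]
    for n in steps:
        head, body = PRODUCTIONS[n]
        i = next((k for k, c in enumerate(form) if c in NONTERMINALS), None)
        if i is None or form[i] != head:
            raise ValueError(f"production ({n}) does not apply to {form!r}")
        form = form[:i] + body + form[i + 1:]
        forms.append(form)
    return forms


# The edge labels on the path from the root to the target leaf.
for form in replay_leftmost("S", [1, 1, 2, 3, 4, 3, 4]):
    print(form)
```

The last form printed is the target string aaaabbabb, and the intermediate forms are exactly the sentential forms of the derivation.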