CMPS 260: CYK Algorithm

The CYK algorithm (named for Cocke, Younger, and Kasami, each of whom developed it independently of the others in the mid-1960s) solves the membership problem for context-free grammars in Chomsky Normal Form. That is, given as input a CFG G in Chomsky Normal Form (CNF) and a string w, the algorithm determines whether or not w ∈ L(G). Because any context-free grammar can be transformed algorithmically into CNF (with the possible loss of the empty string from the generated language), this gives us a way of solving the membership problem for all CFGs.

Recall that a context-free grammar is said to be in Chomsky normal form if all its productions are of one of the two forms A → b or A → BC, where b is a terminal symbol and B and C are nonterminals.
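The definition can be checked mechanically. The following is a minimal Python sketch, not part of the original notes. It assumes a hypothetical representation in which each production is a (head, body) pair and the body is a tuple of symbols; the names is_cnf, cnf_rules, and not_cnf_rules are introduced here purely for illustration.

```python
def is_cnf(productions, nonterminals, terminals):
    """Return True if every production has the form A -> b or A -> B C."""
    for head, body in productions:
        if head not in nonterminals:
            return False
        if len(body) == 1 and body[0] in terminals:
            continue  # form A -> b
        if len(body) == 2 and all(sym in nonterminals for sym in body):
            continue  # form A -> B C
        return False  # anything else (empty body, mixed body, ...) violates CNF
    return True


# Hypothetical sample data (an assumption, chosen only to exercise the check).
nonterminals = {"S", "A", "B"}
terminals = {"a", "b", "c"}
cnf_rules = [("S", ("A", "B")), ("S", ("b",)), ("A", ("a",)), ("B", ("b",))]
not_cnf_rules = [("S", ("A", "b")), ("A", ())]  # mixed body; empty body

print(is_cnf(cnf_rules, nonterminals, terminals))
print(is_cnf(not_cnf_rules, nonterminals, terminals))
```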

The CYK algorithm is based on the following observations. Let G be a context-free grammar in Chomsky normal form, and let w = a1 a2 ... an (each ai ∈ Σ) be a string over G's terminal alphabet Σ.

For i and j satisfying 1 ≤ i ≤ j ≤ n, let wi,j denote the substring ai ai+1 ... aj of w beginning with its i-th symbol and ending with its j-th symbol, and let Vi,j denote the set of nonterminals in G from which wi,j can be derived. That is,

Vi,j = { A  |  A ⇒+ wi,j }

The CYK algorithm uses dynamic programming to compute each set Vi,j, culminating with the calculation of V1,n. The question of whether w ∈ L(G) then reduces to the question of whether S ∈ V1,n, where S is the start symbol of G.

Lemma 1: For all i (1≤i≤n), Vi,i = { A  |  A → ai is a production in G }.
Proof: Inclusion from right to left is immediate: if A → ai is a production, then A ⇒ ai in one step, and ai = wi,i. Inclusion in the other direction follows from the fact that a derivation that begins with a production of the form A → BC cannot produce a terminal string of length one, because a CNF grammar has no λ-productions and so each of B and C must derive at least one terminal symbol. ∎

         A
        / \
       /   \
      /     \
     /       \
    B         C
   / \       / \  
  /   \     /   \  
 /     \   /     \  
+-------+ +-------+
   wi,k      wk+1,j
Lemma 2: For all i and j satisfying 1≤i<j≤n, A ∈ Vi,j if and only if there exist nonterminals B and C and a number k satisfying i≤k<j such that

  1. A → BC is a production in G,
  2. B ∈ Vi,k, and
  3. C ∈ Vk+1,j

Proof: Let P be the set of productions in G so that we can abbreviate "A → BC is a production in G" as "A → BC ∈ P".

Sufficiency (if):
    A → BC ∈ P ∧ B ∈ Vi,k ∧ C ∈ Vk+1,j

=     < by definition of Vi,k, Vk+1,j >

    A → BC ∈ P ∧ B ⇒+ wi,k ∧ C ⇒+ wk+1,j

⟹    < by properties of derivations >

    A ⇒ BC ⇒+ wi,kC ⇒+ wi,kwk+1,j

⟹    < by properties of derivations >

    A ⇒+ wi,kwk+1,j

=     < wi,kwk+1,j = wi,j >

    A ⇒+ wi,j

=      < by definition of Vi,j >

    A ∈ Vi,j

Necessity (only if):
    A ∈ Vi,j

=      < by definition of Vi,j >

    A ⇒+ wi,j

⟹    < In a CNF CFG, the derivation of a string of length
         two or more must begin with the application of a
         production whose right-hand side has length two. >

    A ⇒ BC ⇒+ wi,j for some nonterminals B,C

⟹    < properties of derivations and fact that
         in a CNF CFG no nonterminal can derive λ >

    A → BC ∈ P ∧ B ⇒+ wi,k ∧ C ⇒+ wk+1,j for some B,C,k

=     < by definition of Vi,k, Vk+1,j >

    A → BC ∈ P ∧ B ∈ Vi,k ∧ C ∈ Vk+1,j for some B,C,k ∎

As pointed out above, the question of whether w ∈ L(G) reduces to whether S ∈ V1,n. Thus, we wish to compute V1,n. How can we do that?

Lemma 1 makes it clear that each of the sets Vi,i, 1≤i≤n, can be computed by examining the symbols of w and the productions of G of the form A → a (i.e., ones whose right-hand sides are single terminal symbols).
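As a small illustration of Lemma 1, here is a Python sketch that computes the diagonal sets for a given string. It assumes (hypothetically) that the productions of the form A → a are stored as a set of (A, a) pairs; the function name diagonal_sets is ours. The sample rules are the terminal productions of the grammar used in the Example section of these notes.

```python
def diagonal_sets(w, terminal_rules):
    """Return a list whose entry at index i-1 is V_{i,i} for the string w.

    terminal_rules is a set of (A, a) pairs, one per production A -> a
    (an assumed representation, not prescribed by the notes).
    """
    return [{A for (A, a) in terminal_rules if a == symbol} for symbol in w]


# Terminal productions S -> b | c, A -> a, B -> b.
terminal_rules = {("S", "b"), ("S", "c"), ("A", "a"), ("B", "b")}

diag = diagonal_sets("bbacb", terminal_rules)
for i, cell in enumerate(diag, start=1):
    print(f"V{i},{i} =", sorted(cell))
```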

Lemma 2 makes it clear that, to compute each set Vi,j, where i<j, it suffices to examine the productions of G of the form A → BC and the sets Vi,k and Vk+1,j, for k in the range [i..j).

Algorithm

The CYK Algorithm, presented below, is an example of a problem-solving approach known as Dynamic Programming. In this approach, the computation of some desired value takes the form of filling the cells of a (typically two-dimensional) table in such a way that the desired value ends up in the "last" cell. Except for a relatively small number of cells that can be filled initially, the value that belongs in each cell is a function of the values in cells that were filled previously. The reason for building the table is thus to avoid redundant computation: each intermediate result is computed once and then looked up. The effect is to trade space (i.e., the memory needed to store a potentially large table) for time. Part of the ingenuity in developing such an algorithm lies in ensuring that, each time a cell's value is to be computed, all the cells on which that value depends have already been filled.

In the case of the CYK algorithm, the cells of the table are filled with the Vi,j values. (The value of Vi,j is placed into the cell at the intersection of row i and column j.) The value that we need (to answer the question "Is w ∈ L(G)?") is V1,n.

The cells on the main diagonal of the table (i.e., the cells that hold the sets Vi,i) can be filled based upon Lemma 1.

Lemma 2 tells us that, if i<j, then Vi,j depends upon Vi,k and Vk+1,j for k in the range [i..j). Notice that, for each k in that range, the difference j−i exceeds both k−i and j−(k+1).

Generally speaking, then, if the algorithm is designed so that the cells in the table are filled in ascending order according to the difference j−i between their column and row numbers, we can be sure that, each time the algorithm is to compute the value that goes into a cell, the cells upon which it depends have already been filled.

That is exactly what the CYK algorithm does. After filling the main diagonal of the table (in which the row and column numbers are the same), it fills the diagonals in which the column number exceeds the row number by 1, by 2, etc., in that order. The program variable m represents that difference.
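The fill order just described can be made concrete with a short Python sketch. The generator name fill_order is hypothetical; here the main diagonal is treated as the case m = 0 so that a single loop covers every cell.

```python
def fill_order(n):
    """Yield the cells (i, j), 1 <= i <= j <= n, one diagonal at a time."""
    for m in range(0, n):              # m = j - i; m = 0 is the main diagonal
        for i in range(1, n - m + 1):
            yield (i, i + m)


order = list(fill_order(3))
print(order)  # [(1, 1), (2, 2), (3, 3), (1, 2), (2, 3), (1, 3)]
```

Because every cell (i,k) or (k+1,j) with i ≤ k < j lies on a diagonal with a smaller value of m than the cell (i,j), it appears earlier in this sequence.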

CYK Algorithm
// This loop computes Vi,i for each i,
// in accord with Lemma 1.
do for i in [1..n]
|  Vi,i := ∅
|  do for each nonterminal symbol A
|  |  if A → ai is a production  
|  |  |  Vi,i := Vi,i ∪ {A}
|  |  fi
|  od
od

// Each iteration of the (outermost) m-loop fills one 
// diagonal of the table, in which the column number
// exceeds the row number by m.
do for m in [1..n-1]
|  // Each iteration of the i-loop fills one cell of
|  // the table, namely Vi,i+m.
|  do for i in [1..n-m]
|  |  j := i+m
|  |  Vi,j := ∅
|  |  // Each iteration of the k-loop attempts to find productions 
|  |  // of the form A → BC, where B ⇒+ wi,k and C ⇒+ wk+1,j,
|  |  // in accord with Lemma 2.
|  |  do for k in [i..j)
|  |  |  do for each production A → BC 
|  |  |  |  if B ∈ Vi,k  ∧  C ∈ Vk+1,j
|  |  |  |  |  Vi,j := Vi,j ∪ {A}
|  |  |  |  fi
|  |  |  od
|  |  od
|  od
od
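
For readers who would like to run the algorithm, the following Python sketch mirrors the pseudocode above loop for loop. It rests on some assumptions that are ours, not the notes': productions A → a are given as (A, a) pairs, productions A → BC as (A, B, C) triples, the table is a dictionary keyed by (i, j), and the empty string is rejected outright (a CNF grammar cannot generate it). The name cyk and its parameters are hypothetical. The sample grammar is the one from the Example section.

```python
def cyk(w, start, terminal_rules, binary_rules):
    """Return (w in L(G), table V) for a CNF grammar G.

    terminal_rules: set of (A, a) pairs, one per production A -> a.
    binary_rules:   set of (A, B, C) triples, one per production A -> B C.
    V[i, j] holds the set V_{i,j}, using the 1-based indices of the notes.
    """
    n = len(w)
    if n == 0:
        return False, {}  # a CNF grammar has no lambda-productions
    V = {}
    # Main diagonal, in accord with Lemma 1.
    for i in range(1, n + 1):
        V[i, i] = {A for (A, a) in terminal_rules if a == w[i - 1]}
    # Remaining diagonals in ascending order of m = j - i (Lemma 2).
    for m in range(1, n):
        for i in range(1, n - m + 1):
            j = i + m
            V[i, j] = set()
            for k in range(i, j):
                for (A, B, C) in binary_rules:
                    if B in V[i, k] and C in V[k + 1, j]:
                        V[i, j].add(A)
    return start in V[1, n], V


terminal_rules = {("S", "b"), ("S", "c"), ("A", "a"), ("B", "b")}
binary_rules = {("S", "A", "B"), ("A", "B", "B"), ("A", "A", "S"), ("B", "B", "A")}

member, V = cyk("bbacb", "S", terminal_rules, binary_rules)
print(member, sorted(V[1, 5]))
```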

Runtime Analysis: Each iteration of the i-loop computes one Vi,j value, of which there are approximately n²/2 (corresponding to the cells in the upper half of the table). Each time the k-loop is executed, it iterates j−i times, which, on average, is about n/3 times. Hence, the total number of times that it iterates is about n³/6. If we interpret the size of the grammar as being a constant, which is the convention, each execution of the most deeply nested loop (the one that ranges over the productions A → BC) is considered to take constant time. Hence, we conclude that the asymptotic running time of the CYK algorithm is O(n³).
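
The estimate of about n³/6 can be checked by simply counting. The Python sketch below (the function name k_loop_iterations is hypothetical) walks the same three loops as the algorithm and counts the iterations of the k-loop. The exact count is the sum of m·(n−m) for m from 1 to n−1, which simplifies to (n−1)n(n+1)/6, and this is indeed approximately n³/6.

```python
def k_loop_iterations(n):
    """Count the total number of k-loop iterations for an input of length n."""
    count = 0
    for m in range(1, n):
        for i in range(1, n - m + 1):
            j = i + m
            for k in range(i, j):
                count += 1
    return count


for n in (5, 10, 100):
    exact = (n - 1) * n * (n + 1) // 6
    print(n, k_loop_iterations(n), exact, round(n**3 / 6, 1))
```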


Example

We use the CYK algorithm to determine whether or not w = bbacb is a member of L(G), where G is the CNF grammar shown below (production numbers in parentheses).

S → AB | b | c     (1) (2) (3)
A → BB | AS | a    (4) (5) (6)
B → BA | b         (7) (8)

See the table below. For each i and j satisfying 1≤i≤j≤n, the cell at the intersection of row i and column j identifies the substring wi,j of w as well as Vi,j, the set of nonterminals from which that substring is derivable. Because S ∈ V1,5, the conclusion is that w ∈ L(G).

i\j 1          2          3          4           5
1   b:{S,B}    bb:{A}     bba:{A}    bbac:{A}    bbacb:{S,A,B}
2              b:{S,B}    ba:{B}     bac:{B}     bacb:{A,B}
3                         a:{A}      ac:{A}      acb:{S,A}
4                                    c:{S}       cb:∅
5                                                b:{S,B}

As an illustration of how to compute the entry in a table cell that is not on the main diagonal, consider location (2,5), which is to contain the set V2,5, consisting of exactly those nonterminals from which w2,5 can be derived.

There are three ways to split w2,5 into non-empty strings:

w2,5  =  w2,2·w3,5  =  w2,3·w4,5  =  w2,4·w5,5.

It follows (from Lemma 2) that V2,5 contains any nonterminal symbol on the left-hand side of a production whose right-hand side is in any of

V2,2·V3,5 = {S,B}·{S,A} = {SS, SA, BS, BA},
V2,3·V4,5 = {B}·∅ = ∅, or
V2,4·V5,5 = {B}·{S,B} = {BS, BB}

Note that the table cells containing these relevant Vi,j's have already been filled, due to the order in which the algorithm does that filling.

There are two productions that qualify: A → BB (via the split V2,4·V5,5) and B → BA (via the split V2,2·V3,5). Thus, V2,5 is computed as being the set {A,B}.
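
The same calculation can be reproduced in a few lines of Python. This is only a sketch: the helper name combine and the (A, B, C) triple representation of productions A → BC are our assumptions, and the six sets are copied from the table above.

```python
# Productions of the form A -> B C from the example grammar, as (A, B, C) triples.
binary_rules = {("S", "A", "B"), ("A", "B", "B"), ("A", "A", "S"), ("B", "B", "A")}


def combine(left, right, rules):
    """Nonterminals A having a production A -> B C with B in left, C in right."""
    return {A for (A, B, C) in rules if B in left and C in right}


# Previously filled cells, copied from the table.
V22, V35 = {"S", "B"}, {"S", "A"}
V23, V45 = {"B"}, set()
V24, V55 = {"B"}, {"S", "B"}

# One term per way of splitting w2,5 (k = 2, 3, 4).
V25 = (combine(V22, V35, binary_rules)
       | combine(V23, V45, binary_rules)
       | combine(V24, V55, binary_rules))
print(sorted(V25))  # ['A', 'B']
```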