CYK Algorithm

CMPS 260: CYK Algorithm

The CYK algorithm (named for Cocke, Younger, and Kasami, each of whom developed it independently of the others in the mid-1960s) solves the membership problem for context-free grammars in Chomsky Normal Form. That is, given as input a CFG G in Chomsky Normal Form (CNF) and a string w, the algorithm determines whether or not w ∈ L(G). Because any context-free grammar can be transformed (algorithmically) into CNF (with the possible loss of the empty string from the generated language), this gives us a way of solving the membership problem for all CFGs.

Recall that a context-free grammar is said to be in Chomsky normal form if all its productions are of one of the two forms A → b or A → BC, where b is a terminal symbol and B and C are nonterminals.
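The definition above can be checked mechanically. The following is a minimal sketch, assuming a grammar is represented as a list of (left-hand side, right-hand side tuple) pairs; the function name is_cnf and the small grammar are hypothetical, chosen only for illustration.

```python
def is_cnf(productions, nonterminals, terminals):
    """Return True if every production has the form A -> b or A -> B C."""
    for lhs, rhs in productions:
        if lhs not in nonterminals:
            return False
        # A -> b: exactly one symbol, and it is a terminal.
        unit = len(rhs) == 1 and rhs[0] in terminals
        # A -> B C: exactly two symbols, both nonterminals.
        binary = len(rhs) == 2 and all(s in nonterminals for s in rhs)
        if not (unit or binary):
            return False
    return True


# Hypothetical grammar: S -> AB | b, A -> a, B -> b
good = [("S", ("A", "B")), ("S", ("b",)), ("A", ("a",)), ("B", ("b",))]
# Adding S -> aB mixes a terminal into a two-symbol right-hand side.
bad = good + [("S", ("a", "B"))]

print(is_cnf(good, {"S", "A", "B"}, {"a", "b"}))  # True
print(is_cnf(bad, {"S", "A", "B"}, {"a", "b"}))   # False
```

Note that a production such as A → B (a single nonterminal) or A → λ is also rejected by this check, in keeping with the two permitted forms.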

The CYK algorithm is based on the following. Let G be a context-free grammar in Chomsky normal form, and let w = a₁a₂...a_n (each a_i ∈ Σ) be a string over G's terminal alphabet Σ.

For i and j satisfying 1 ≤ i ≤ j ≤ n, let w_i,j denote the substring a_i a_i+1 ... a_j of w beginning with its i-th symbol and ending with its j-th symbol, and let V_i,j denote the set of nonterminals in G from which w_i,j can be derived. That is,

V_i,j = { A | A ⇒⁺ w_i,j }

The CYK algorithm uses the Dynamic Programming technique to compute each set V_i,j, culminating with the calculation of V_1,n. The question of whether w ∈ L(G) then reduces to the question of whether S ∈ V_1,n, where S is the start symbol of G.

Lemma 1: For all i (1≤i≤n), V_i,i = { A | A → a_i is a production in G }.
Proof: Inclusion from right to left is obvious. Inclusion in the other direction follows from the fact that a derivation that begins with an application of a production of the form A → BC cannot possibly produce a terminal string of length one (because a CNF CFG has no λ-productions). ∎

              A
             / \
            /   \
           /     \
          /       \
         B         C
        / \       / \
       /   \     /   \
      /     \   /     \
     +-------+ +-------+
       w_i,k    w_k+1,j

Lemma 2: For all i and j satisfying 1≤i<j≤n, A ∈ V_i,j if and only if there exist nonterminals B and C and a number k satisfying i≤k<j such that

A → BC is a production in G,
B ∈ V_i,k, and
C ∈ V_k+1,j

Proof: Let P be the set of productions in G so that we can abbreviate "A → BC is a production in G" as "A → BC ∈ P".

Sufficiency (if):

      A → BC ∈ P ∧ B ∈ V_i,k ∧ C ∈ V_k+1,j
  =      < by definition of V_i,k, V_k+1,j >
      A → BC ∈ P ∧ B ⇒⁺ w_i,k ∧ C ⇒⁺ w_k+1,j
  ⟹     < by properties of derivations >
      A ⇒ BC ⇒⁺ w_i,k C ⇒⁺ w_i,k w_k+1,j
  ⟹     < by properties of derivations >
      A ⇒⁺ w_i,k w_k+1,j
  =      < w_i,k w_k+1,j = w_i,j >
      A ⇒⁺ w_i,j
  =      < by definition of V_i,j >
      A ∈ V_i,j

Necessity (only if):

      A ∈ V_i,j
  =      < by definition of V_i,j >
      A ⇒⁺ w_i,j
  ⟹     < In a CNF CFG, the derivation of a string of length two or more must begin with the application of a production whose right-hand side has length two. >
      A ⇒ BC ⇒⁺ w_i,j for some nonterminals B,C
  ⟹     < properties of derivations and fact that in a CNF CFG no nonterminal can derive λ >
      A → BC ∈ P ∧ B ⇒⁺ w_i,k ∧ C ⇒⁺ w_k+1,j for some B,C,k
  =      < by definition of V_i,k, V_k+1,j >
      A → BC ∈ P ∧ B ∈ V_i,k ∧ C ∈ V_k+1,j for some B,C,k

∎

As pointed out above, the question of whether w ∈ L(G) reduces to whether S ∈ V_1,n. Thus, we wish to compute V_1,n. How can we do that?

Lemma 1 makes it clear that each of the sets V_i,i, 1≤i≤n, can be computed by examining the symbols of w and the productions of G of the form A → a (i.e., ones whose right-hand sides are single terminal symbols).
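As a small illustration of Lemma 1, the sketch below computes the sets V_i,i for a given string. It assumes that the productions of the form A → a are supplied as (A, a) pairs; the function name diagonal is hypothetical. The productions used are the single-terminal productions of the grammar from the Example section, applied to w = bbacb.

```python
def diagonal(w, unit_productions):
    """Return {i: V_i,i} for 1 <= i <= len(w), in accord with Lemma 1."""
    return {
        # w[i - 1] is a_i, because Python strings are indexed from 0.
        i: {A for (A, a) in unit_productions if a == w[i - 1]}
        for i in range(1, len(w) + 1)
    }


# S -> b | c,  A -> a,  B -> b
unit = [("S", "b"), ("S", "c"), ("A", "a"), ("B", "b")]
V = diagonal("bbacb", unit)
for i in sorted(V):
    print(i, sorted(V[i]))
```

Each printed set should agree with the corresponding main-diagonal cell of the table in the Example section.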

Lemma 2 makes it clear that, to compute each set V_i,j, where i<j, it suffices to examine the productions of G of the form A → BC and the sets V_i,k and V_k+1,j, for k in the range [i..j).
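Lemma 2 translates almost directly into code. The following sketch assumes that the sets already computed are stored in a dictionary V keyed by (i, j), and that the productions A → BC are supplied as (A, B, C) triples; the function name cell is hypothetical.

```python
def cell(V, i, j, binary_productions):
    """Compute V_i,j (for i < j) from previously computed sets, per Lemma 2."""
    result = set()
    for k in range(i, j):  # k ranges over [i..j)
        for (A, B, C) in binary_productions:
            if B in V[i, k] and C in V[k + 1, j]:
                result.add(A)
    return result


# Productions S -> AB, A -> BB | AS, B -> BA (from the Example section).
binary = [("S", "A", "B"), ("A", "B", "B"), ("A", "A", "S"), ("B", "B", "A")]
# V_1,1 and V_2,2 for a string that begins with "bb".
V = {(1, 1): {"S", "B"}, (2, 2): {"S", "B"}}
print(cell(V, 1, 2, binary))  # {'A'}, because A -> BB
```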

Algorithm

The CYK Algorithm, presented below, is an example of a problem-solving approach known as Dynamic Programming. In this approach, the computation of some desired value takes the form of filling the cells of a (typically, two-dimensional) table in such a way that the desired value ends up in the "last" cell. Except for some relatively small number of cells that can be filled initially, the value that belongs in each cell is a function of the values that appear in cells that were previously filled. Thus, the reason for building the table is to avoid having to carry out redundant computations. The effect is to sacrifice space (i.e., the use of memory to store a potentially large table) for time. Part of the ingenuity in developing such an algorithm is to ensure that, each time a cell's value is to be computed, all the cells on which that value depends have already been filled.

In the case of the CYK algorithm, the cells of the table are filled with the V_i,j values. (The value of V_i,j is placed into the cell at the intersection of row i and column j.) The value that we need (to answer the question "Is w ∈ L(G)?") is V_1,n.

The cells on the main diagonal of the table (i.e., in which go the V_i,i's) can be filled based upon Lemma 1.

Lemma 2 tells us that, if i<j, then V_i,j depends upon V_i,k and V_k+1,j for k in the range [i..j). Notice that, for each k in that range, the difference j−i exceeds both k−i and j−(k+1).

Generally speaking, then, if the algorithm is designed so that the cells in the table are filled in ascending order according to the difference j−i between their column and row numbers, we can be sure that, each time the algorithm is to compute the value that goes into a cell, the cells upon which it depends have already been filled.

That is exactly what the CYK algorithm does. After filling the main diagonal of the table (in which the row and column numbers are the same), it fills the diagonals in which the column number exceeds the row number by 1, by 2, etc., in that order. The program variable m represents that difference.
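The filling order just described can be made concrete. The sketch below (the name fill_order is hypothetical) lists the cells in the order in which they are filled and checks that, when a cell (i, j) is reached, every cell (i, k) and (k+1, j) with k in [i..j) has already been filled.

```python
def fill_order(n):
    """Yield the cells (i, j) of an n-by-n table in the order they are filled."""
    for i in range(1, n + 1):          # main diagonal: j - i = 0
        yield (i, i)
    for m in range(1, n):              # diagonals with j - i = m
        for i in range(1, n - m + 1):
            yield (i, i + m)


print(list(fill_order(3)))
# [(1, 1), (2, 2), (3, 3), (1, 2), (2, 3), (1, 3)]

filled = set()
for (i, j) in fill_order(5):
    # Every cell on which V_i,j depends must already be filled.
    assert all((i, k) in filled and (k + 1, j) in filled for k in range(i, j))
    filled.add((i, j))
print(len(filled))  # 15 cells for n = 5
```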

CYK Algorithm

// This loop computes V_i,i for each i,
// in accord with Lemma 1.
do for i in [1..n]
|  V_i,i := ∅
|  do for each nonterminal symbol A
|  |  if A → a_i is a production
|  |  |  V_i,i := V_i,i ∪ {A}
|  |  fi
|  od
od

// Each iteration of the (outermost) m-loop fills one
// diagonal of the table, in which the column number
// exceeds the row number by m.
do for m in [1..n-1]
|  // Each iteration of the i-loop fills one cell of
|  // the table, namely V_i,i+m.
|  do for i in [1..n-m]
|  |  j := i+m
|  |  V_i,j := ∅
|  |  // Each iteration of the k-loop attempts to find productions
|  |  // of the form A → BC, where B ⇒⁺ w_i,k and C ⇒⁺ w_k+1,j,
|  |  // in accord with Lemma 2.
|  |  do for k in [i..j)
|  |  |  do for each production A → BC
|  |  |  |  if B ∈ V_i,k ∧ C ∈ V_k+1,j
|  |  |  |  |  V_i,j := V_i,j ∪ {A}
|  |  |  |  fi
|  |  |  od
|  |  od
|  od
od
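For readers who want to run the algorithm, here is one possible Python rendering of the pseudocode. It is a sketch under stated assumptions, not a reference implementation: productions A → a are given as (A, a) pairs and productions A → BC as (A, B, C) triples, the table is a dictionary keyed by (i, j), and the empty string is simply rejected (a CNF grammar cannot generate it). The function name cyk and the small grammar for { aⁿbⁿ | n ≥ 1 } are hypothetical.

```python
def cyk(w, unit, binary, start="S"):
    """Return (w in L(G), table V) for a CNF grammar G given by unit and binary."""
    n = len(w)
    if n == 0:
        return False, {}
    V = {}
    # Main diagonal, in accord with Lemma 1.
    for i in range(1, n + 1):
        V[i, i] = {A for (A, a) in unit if a == w[i - 1]}
    # Diagonals on which the column number exceeds the row number by m.
    for m in range(1, n):
        for i in range(1, n - m + 1):
            j = i + m
            V[i, j] = set()
            for k in range(i, j):  # k in [i..j), in accord with Lemma 2
                for (A, B, C) in binary:
                    if B in V[i, k] and C in V[k + 1, j]:
                        V[i, j].add(A)
    return start in V[1, n], V


# Hypothetical CNF grammar: S -> AB | AT, T -> SB, A -> a, B -> b
unit = [("A", "a"), ("B", "b")]
binary = [("S", "A", "B"), ("S", "A", "T"), ("T", "S", "B")]

print(cyk("aabb", unit, binary)[0])  # True
print(cyk("abab", unit, binary)[0])  # False
```

Returning the whole table along with the yes/no answer makes it possible to inspect individual sets V_i,j after the run.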

Runtime Analysis: Each iteration of the i-loop computes one V_i,j value, of which there are approximately n²/2 (corresponding to the cells in the upper half of the table). Each time the k-loop is executed, it iterates j−i times, which, on average, is about n/3 times. Hence, the total number of times that it iterates is about n³/6. If we interpret the size of the grammar as being a constant, which is the convention, each execution of the most deeply nested loop (the one that ranges over the productions) is considered to take constant time. Hence, we conclude that the asymptotic running time of the CYK algorithm is O(n³).
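The estimate of n³/6 can be checked by counting. The sketch below mirrors the loop structure of the algorithm and counts the iterations of the k-loop; the name k_loop_iterations is hypothetical. The closed form (n−1)n(n+1)/6 is not stated in the analysis above; it is obtained by summing (n−m)·m over m in [1..n-1].

```python
def k_loop_iterations(n):
    """Count the total number of k-loop iterations for an input of length n."""
    count = 0
    for m in range(1, n):
        for i in range(1, n - m + 1):
            j = i + m
            count += j - i  # k ranges over [i..j), which has j - i values
    return count


for n in (5, 10, 100):
    exact = (n - 1) * n * (n + 1) // 6
    print(n, k_loop_iterations(n), exact, round(n**3 / 6))
```

For n = 5 the count is 20, and for large n it approaches n³/6, in agreement with the analysis.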

Example

(1) S → AB    (2) S → b     (3) S → c

(4) A → BB    (5) A → AS    (6) A → a

(7) B → BA    (8) B → b

We use the CYK algorithm to determine whether or not w = bbacb is a member of L(G), where G is the CNF grammar shown above.

See the table below. For each i and j satisfying 1≤i≤j≤n, the cell at the intersection of row i and column j identifies the substring w_i,j of w as well as V_i,j, the set of nonterminals from which that substring is derivable. Because S ∈ V_1,5, the conclusion is that w ∈ L(G).

	1	2	3	4	5
1	b:{S,B}	bb:{A}	bba:{A}	bbac:{A}	bbacb:{S,A,B}
2		b:{S,B}	ba:{B}	bac:{B}	bacb:{A,B}
3			a:{A}	ac:{A}	acb:{S,A}
4				c:{S}	cb:∅
5					b:{S,B}
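The table can be checked against the two lemmas mechanically. The sketch below transcribes the table and the grammar (each set is written as a string of single-letter nonterminals, an encoding chosen here only for brevity) and reports every cell whose contents differ from what Lemma 1 or Lemma 2 prescribes. The names table, expected, and mismatches are hypothetical.

```python
unit = [("S", "b"), ("S", "c"), ("A", "a"), ("B", "b")]
binary = [("S", "A", "B"), ("A", "B", "B"), ("A", "A", "S"), ("B", "B", "A")]
w = "bbacb"

# The table above; "" stands for the empty set.
table = {
    (1, 1): "SB", (1, 2): "A", (1, 3): "A", (1, 4): "A", (1, 5): "SAB",
    (2, 2): "SB", (2, 3): "B", (2, 4): "B", (2, 5): "AB",
    (3, 3): "A", (3, 4): "A", (3, 5): "SA",
    (4, 4): "S", (4, 5): "",
    (5, 5): "SB",
}
V = {ij: set(names) for ij, names in table.items()}


def expected(i, j):
    """What Lemma 1 (i == j) or Lemma 2 (i < j) says V_i,j should be."""
    if i == j:
        return {A for (A, a) in unit if a == w[i - 1]}
    return {A
            for k in range(i, j)
            for (A, B, C) in binary
            if B in V[i, k] and C in V[k + 1, j]}


mismatches = [ij for ij in sorted(V) if V[ij] != expected(*ij)]
print(mismatches)            # [] means every cell agrees with the lemmas
print("S" in V[1, len(w)])   # True means w is in L(G)
```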

As an illustration of how to compute the entry in a table cell (that is not on the main diagonal), consider location (2,5), which is to contain the set V_2,5, which includes all those nonterminals from which w_2,5 can be derived.

There are three ways to split w_2,5 into non-empty strings:

w_2,5 = w_2,2·w_3,5 = w_2,3·w_4,5 = w_2,4·w_5,5.

It follows (from Lemma 2) that V_2,5 contains any nonterminal symbol on the left-hand side of a production whose right-hand side is in any of

V_2,2·V_3,5 = {S,B}·{S,A} = {SS, SA, BS, BA},
V_2,3·V_4,5 = {B}·∅ = ∅, or
V_2,4·V_5,5 = {B}·{S,B} = {BS, BB}

Note that the table cells containing these relevant V_i,j's have already been filled, due to the order in which the algorithm does that filling.

There are two productions that qualify: A → BB and B → BA. Thus, V_2,5 is computed as being the set {A,B}.
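The computation of V_2,5 just described can be replayed in a few lines. This is a sketch that forms the same concatenations of sets as above and looks each candidate right-hand side up in the grammar. The dictionary rhs_to_lhs is a hypothetical encoding that works here only because, in this grammar, no two productions of the form A → BC share a right-hand side.

```python
from itertools import product

# Right-hand sides of length two, mapped to their left-hand sides.
rhs_to_lhs = {"AB": "S", "BB": "A", "AS": "A", "BA": "B"}

# (V_2,2, V_3,5), (V_2,3, V_4,5), (V_2,4, V_5,5), taken from the table.
splits = [
    ({"S", "B"}, {"S", "A"}),
    ({"B"}, set()),
    ({"B"}, {"S", "B"}),
]

V25 = set()
for left, right in splits:
    # All two-symbol strings BC with B from the left set and C from the right.
    candidates = {B + C for B, C in product(left, right)}
    V25 |= {rhs_to_lhs[c] for c in candidates if c in rhs_to_lhs}

print(sorted(V25))  # ['A', 'B']
```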