CMPS 260
Chomsky Normal Form

Algorithm

Described here is an algorithm that, given a CFG having no unit productions (i.e., of the form A → B) and no λ-productions, constructs an equivalent CFG that is in Chomsky Normal Form (CNF), which is to say that every production's right-hand side is either a single terminal symbol (e.g., A → a) or a pair of nonterminals (e.g., A → BC). (Section 6.1 of Linz shows how to remove unit and λ-productions from a CFG without changing the language that it generates, except for the possible loss of λ from the language.)

Step 1(a): For each symbol t in the grammar's terminal alphabet, introduce a new nonterminal t' and give it the production t' → t.

Step 1(b): Except in productions whose right-hand sides are single terminal symbols (such as those introduced in Step 1(a)), replace every occurrence of a terminal symbol t by the corresponding nonterminal t' introduced in Step 1(a).

The result of this step will be that every production's right-hand side is either a single terminal symbol or a string of nonterminal symbols of length two or more.
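Step 1 can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the dictionary representation (each nonterminal mapped to a list of right-hand sides, each a tuple of symbols), the function name lift_terminals, and the small grammar used to try it out are assumptions made for illustration. It also assumes that no existing nonterminal is already named t'.

```python
def lift_terminals(productions, terminals):
    """Step 1: introduce t' -> t for each terminal t, and replace t by t'
    in every right-hand side that is not a single terminal."""
    result = {}
    for head, bodies in productions.items():
        result[head] = []
        for body in bodies:
            if len(body) == 1 and body[0] in terminals:
                # Step 1(b) exception: a single terminal stays as it is.
                result[head].append(body)
            else:
                # Step 1(b): replace each terminal t by the nonterminal t'.
                result[head].append(
                    tuple(s + "'" if s in terminals else s for s in body)
                )
    # Step 1(a): one new production t' -> t per terminal.
    for t in sorted(terminals):
        result[t + "'"] = [(t,)]
    return result


# Hypothetical grammar: A -> aBb | a,  B -> b
grammar = {"A": [("a", "B", "b"), ("a",)], "B": [("b",)]}
step1 = lift_terminals(grammar, {"a", "b"})
```

Here step1 holds A → a'Bb' | a and B → b, together with the new productions a' → a and b' → b.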

Step 2: Repeat the following until it no longer applies:
For each production A → Bβ, where β is a string of nonterminals of length two or more, replace it by the pair of productions A → B V_β and V_β → β, where V_β is a new nonterminal. (If β has length greater than two, the production V_β → β will itself be replaced by subsequent applications of this rule.)

For example, the production A → BCDE would be replaced, after two iterations of Step 2, by this set of productions:

A → B V_CDE
V_CDE → C V_DE
V_DE → DE

Of course, the names of newly introduced nonterminals are arbitrary, but, for purposes of uniformity, we use the convention that every newly introduced nonterminal symbol is named V_α, where the intent is for the newly introduced productions to allow for the derivation V_α ⇒+ α.
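Step 2 can be sketched in the same way. Again this is only an illustration under stated assumptions: the function name split_long and the dictionary representation are hypothetical, helper nonterminals are named by writing "V_" in front of the concatenated symbols of the string they derive, and those names are assumed not to clash with existing nonterminals. A worklist takes the place of "repeat until it no longer applies"; a helper that has already been created is reused rather than created again.

```python
def split_long(productions):
    """Step 2: replace A -> B beta (beta of length >= 2) by
    A -> B V_beta and V_beta -> beta, until no right-hand side is longer than two."""
    result = {head: [] for head in productions}
    work = [(h, b) for h, bodies in productions.items() for b in bodies]
    created = set()
    while work:
        head, body = work.pop(0)
        if len(body) > 2:
            rest = body[1:]                      # this is beta
            helper = "V_" + "".join(rest)        # this is V_beta
            if helper not in created:
                created.add(helper)
                result[helper] = []
                # V_beta -> beta may itself be too long, so queue it as well.
                work.append((helper, rest))
            body = (body[0], helper)
        result[head].append(body)
    return result


# The production A -> BCDE from the example above.
step2 = split_long({"A": [("B", "C", "D", "E")]})
```

For this input, step2 contains exactly the three productions listed above: A → B V_CDE, V_CDE → C V_DE, and V_DE → DE.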


Example Derivation of a CNF Grammar

To illustrate the algorithm, we transform the CFG shown below into an equivalent CFG in Chomsky Normal Form. (As required, the given grammar has no λ- or unit-productions.)

S → aS  |  Sb  |  aAbA   (1) (2) (3)
A → ASbA  |  ab   (4) (5)

Step 1(a): The given grammar's terminal alphabet is {a,b}, so we introduce nonterminals a' and b' and productions a' → a and b' → b.

Step 1(b): Replacing each occurrence of a (respectively, b) in the productions of the given grammar by a' (resp., b'), we end up with

S → a'S  |  Sb'  |  a'Ab'A   (1') (2') (3')
A → ASb'A  |  a'b'   (4') (5')
a' → a
b' → b

Step 2: Productions (3') and (4') have right-hand sides of length greater than two, and thus we must replace them. Applying Step 2 until it no longer applies, we end up with the following grammar:

S → a'S  |  Sb'  |  a' V_Ab'A
A → A V_Sb'A  |  a'b'
a' → a
b' → b
V_Ab'A → A V_b'A
V_Sb'A → S V_b'A
V_b'A → b'A
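As a sanity check, the final grammar can be tested mechanically against the definition of CNF given at the start. The sketch below is illustrative only: the function name is_cnf and the dictionary encoding of the grammar are assumptions, and the helper nonterminals are spelled with a "V_" prefix (for example, "V_b'A" for the nonterminal that derives b'A).

```python
def is_cnf(productions, terminals):
    """Return True if every right-hand side is a single terminal
    or a pair of nonterminals."""
    nonterminals = set(productions)
    for bodies in productions.values():
        for body in bodies:
            single_terminal = len(body) == 1 and body[0] in terminals
            two_nonterminals = len(body) == 2 and all(
                s in nonterminals for s in body
            )
            if not (single_terminal or two_nonterminals):
                return False
    return True


# The grammar obtained at the end of the example derivation.
cnf_grammar = {
    "S": [("a'", "S"), ("S", "b'"), ("a'", "V_Ab'A")],
    "A": [("A", "V_Sb'A"), ("a'", "b'")],
    "a'": [("a",)],
    "b'": [("b",)],
    "V_Ab'A": [("A", "V_b'A")],
    "V_Sb'A": [("S", "V_b'A")],
    "V_b'A": [("b'", "A")],
}

# The original grammar, for comparison; it is not in CNF.
original = {
    "S": [("a", "S"), ("S", "b"), ("a", "A", "b", "A")],
    "A": [("A", "S", "b", "A"), ("a", "b")],
}

final_ok = is_cnf(cnf_grammar, {"a", "b"})
original_ok = is_cnf(original, {"a", "b"})
```

With these definitions, final_ok is True and original_ok is False.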