Proof that a particular CFG generates a particular CFL

Consider the context-free grammar G consisting of the following four rules/productions:

S  ⟶  λ  |  aSb  |  bSa  |  SS     (1) (2) (3) (4)

S is the lone variable/nonterminal while a and b are the two symbols in the terminal alphabet. The empty string is denoted by λ.
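
To make the four productions concrete, here is a minimal Python sketch that replays a leftmost derivation from a list of production numbers. The names `PRODUCTIONS` and `leftmost_derivation` are illustrative assumptions, not part of the original notes, and λ is represented by the empty string.

```python
# Right-hand sides of productions (1)-(4); λ is the empty string.
PRODUCTIONS = {1: "", 2: "aSb", 3: "bSa", 4: "SS"}


def leftmost_derivation(rule_numbers):
    """Apply the numbered productions, in order, to the leftmost S.

    Returns the list of sentential forms, starting with "S".
    Raises ValueError if a production is requested when no S remains.
    """
    form = "S"
    steps = [form]
    for r in rule_numbers:
        i = form.index("S")  # position of the leftmost S
        form = form[:i] + PRODUCTIONS[r] + form[i + 1:]
        steps.append(form)
    return steps


# S => SS => aSbS => abS => abbSa => abba
print(" => ".join(s or "λ" for s in leftmost_derivation([4, 2, 1, 3, 1])))
```

Here the string abba is derived by using production (4) first and then deriving ab and ba from the two resulting occurrences of S.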

Claim 1: The "yield" (i.e., the string spelled out by the labels of the leaves, going from left to right) of every G-derivation tree has an equal number of occurrences of a and b, respectively.

Proof: The proof is by induction on the height h of the derivation tree.

Base case: h=0,1. The only G-derivation tree of height zero has S as its yield, and the four G-derivation trees of height one have λ, aSb, bSa, and SS as their yields. All of these strings have equal numbers of occurrences of a's and b's.

Induction case: Let h≥1 and assume that every derivation tree of height h or less satisfies the claim. Let T be a G-derivation tree of height h+1. Then T must have one of these three forms, corresponding to which among productions (2), (3), or (4) was applied at the root.

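Form 1 (production (2) applied at the root):

    S
   /|\
  / | \
 /  |  \
a   S   b
   / \
  /   \
 /     \
+-------+
    α

Form 2 (production (3) applied at the root):

    S
   /|\
  / | \
 /  |  \
b   S   a
   / \
  /   \
 /     \
+-------+
    α

Form 3 (production (4) applied at the root):

         S
        / \
       /   \
      /     \
     /       \
    S         S
   / \       / \
  /   \     /   \
 /     \   /     \
+-------+ +-------+
    α         β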

Suppose that T is of Form 1. The subtree whose yield is α is a derivation tree of height h (one less than the height of T itself) and therefore, by the induction hypothesis, the numbers of occurrences of a and b, respectively, in α are equal. But the yield of T itself is aαb, which has exactly one more occurrence of a, and likewise of b, than does α. Hence, T satisfies the claim.

Now suppose that T is of Form 2. By the same reasoning as in the previous paragraph, it satisfies the claim.

Finally, suppose that T is of Form 3. The subtree whose yield is α has height at most h, and the same is true of the subtree whose yield is β. (At least one of them has height exactly h, but that is not significant here.)

Applying the induction hypothesis to each of those two subtrees, we get that both α and β have equal numbers of occurrences of a's and b's, respectively. But then so does αβ, which is the yield of T. It follows that T satisfies the claim. ♦

Corollary to Claim 1: In every terminal string generated by G, the numbers of occurrences of a and b, respectively, are equal.

Proof: Every such string is the yield of some derivation tree. ♦
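
As a sanity check on Claim 1 (not a substitute for the induction), the following Python sketch enumerates the yields of all G-derivation trees up to a small height and confirms that each has as many a's as b's. The function name `yields_up_to_height` is an assumption made for illustration. The height is kept small because the number of yields grows very quickly.

```python
def yields_up_to_height(h):
    """Return the set of yields of all G-derivation trees of height at most h.

    Yields may still contain the nonterminal S; λ is the empty string.
    """
    current = {"S"}  # the lone tree of height zero
    for _ in range(h):
        nxt = {"S", ""}  # height zero, and production (1) at the root
        for alpha in current:
            nxt.add("a" + alpha + "b")  # production (2) at the root
            nxt.add("b" + alpha + "a")  # production (3) at the root
            for beta in current:
                nxt.add(alpha + beta)   # production (4) at the root
        current = nxt
    return current


# Every yield of a tree of height at most 3 satisfies the claim.
print(all(y.count("a") == y.count("b") for y in yields_up_to_height(3)))
```

The loop mirrors the induction: a tree of height at most h+1 is either a tree of height zero or is obtained by applying one production at the root above subtrees of height at most h.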

To be pedantic (or maybe just more rigorous?), we could have characterized the function #a, where #a(x) denotes the number of occurrences of a in string x, like this (and #b similarly):

#a(λ) = 0
#a(a) = 1
#a(b) = 0
#a(xy) = #a(x) + #a(y)

after which the reasoning by which trees of Form 1 are shown to satisfy the claim might look like this:

   #a(aαb)

=    < property of #a >

   #a(a) + #a(α) + #a(b)

=    < property of #a >

   1 + #a(α) + 0

=    < Induction hypothesis applied to subtree yielding α >

   1 + #b(α) + 0

=    < symmetry of + >

   0 + #b(α) + 1
   
=    < property of #b >

   #b(a) + #b(α) + #b(b)

=    < property of #b >

   #b(aαb)
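
The four defining equations translate directly into a recursive function. The sketch below is one possible rendering in Python; the name `count_occurrences` is hypothetical, and the built-in `str.count` is deliberately avoided so that the code follows the equations. It then checks the conclusion of the calculation, #a(aαb) = #b(aαb), for one sample α with #a(α) = #b(α).

```python
def count_occurrences(sym, x):
    """Number of occurrences of the one-character string sym in x,
    computed from the defining equations of #a (and #b)."""
    if x == "":                       # #a(λ) = 0
        return 0
    if len(x) == 1:                   # #a(a) = 1 and #a(b) = 0
        return 1 if x == sym else 0
    # #a(xy) = #a(x) + #a(y), splitting off the first symbol
    return count_occurrences(sym, x[0]) + count_occurrences(sym, x[1:])


alpha = "abba"  # a sample string with equal numbers of a's and b's
lhs = count_occurrences("a", "a" + alpha + "b")
rhs = count_occurrences("b", "a" + alpha + "b")
print(lhs == rhs)
```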


Now we prove the converse of the corollary above:

Claim 2: Every terminal string (over the alphabet {a,b}) having equal numbers of occurrences of a and b is the yield of some G-derivation tree (and hence is a member of L(G) (the language generated by G)).

Proof: The proof is by induction on the length n of the string.

Base case: n=0. The lone string of length zero is the yield of the G-derivation tree resulting from applying production (1) (S ⟶ λ) at the root.

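Induction step: Let n≥0 and assume as an induction hypothesis that every terminal string z satisfying #a(z) = #b(z) and having length n or less is the yield of some G-derivation tree. Let x be a string of length n+1 having an equal number of occurrences of a and b, respectively. Because #a(x) = #b(x), the length of x is even and hence at least two, so x has a first and a last symbol and falls under exactly one of the following four cases.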

Case 1: x = ayb for some string y. Of course, y has one fewer occurrence of each of a and b than does x (and thus an equal number of a's and b's) and its length is n-1; it follows from the induction hypothesis that y is the yield of some G-derivation tree. But then x is the yield of the G-derivation tree shown in the figure below.

Case 2: x = bya for some string y. The reasoning here is as in the previous case; see figure below.

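Case 3: x = aya for some string y. Then y must have two more occurrences of b than of a. Reading y from left to right, the excess of b's over a's is zero for the empty prefix, is two for y itself, and changes by exactly one with each symbol read. It follows that some nonempty proper prefix u of y has exactly one more occurrence of b than of a. Let v be such that y = uv. Of course, v also has one more occurrence of b than of a.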

So we have x = aya = auva, with each of au and va being of length at most n and having equal numbers of occurrences of a and b. It follows from the induction hypothesis that each of au and va is the yield of some G-derivation tree. Thus, x is the yield of the derivation tree shown in the figure below.

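Case 4: x = byb for some string y. Then y has two more occurrences of a than of b, and the reasoning is as in the previous case with the roles of a and b interchanged: x = buvb, where each of bu and vb has equal numbers of occurrences of a and b; see figure below. ♦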

Figure for Case 1: x = ayb

    S
   /|\
  / | \
 /  |  \
a   S   b
   / \
  /   \
 /     \
+-------+
    y

Figure for Case 2: x = bya

    S
   /|\
  / | \
 /  |  \
b   S   a
   / \
  /   \
 /     \
+-------+
    y

Figure for Case 3: x = aya = auva

         S
        / \
       /   \
      /     \
     /       \
    S         S
   / \       / \
  /   \     /   \
 /     \   /     \
+-------+ +-------+
   au         va

Figure for Case 4: x = byb = buvb

         S
        / \
       /   \
      /     \
     /       \
    S         S
   / \       / \
  /   \     /   \
 /     \   /     \
+-------+ +-------+
   bu         vb
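
The proof of Claim 2 is constructive, so it can be read as a recursive procedure for building a derivation tree. The following Python sketch does so under some assumptions of its own: a tree is represented as a pair `("S", children)`, an empty child list stands for production (1), and the names `build_tree` and `tree_yield` are hypothetical. In Cases 3 and 4 it splits x at the first nonempty proper prefix having equal numbers of a's and b's, which is one valid choice of au (or bu).

```python
def build_tree(x):
    """Build a G-derivation tree whose yield is x, following Cases 1-4.

    A tree is a pair ("S", children); each child is a terminal ("a" or "b")
    or another tree. An empty child list stands for production (1), S -> λ.
    """
    if x.count("a") != x.count("b"):
        raise ValueError("string does not have equal numbers of a's and b's")
    if x == "":
        return ("S", [])                                   # production (1)
    if x[0] != x[-1]:
        # Cases 1 and 2: x = ayb or x = bya; recurse on y.
        return ("S", [x[0], build_tree(x[1:-1]), x[-1]])   # production (2) or (3)
    # Cases 3 and 4: x = aya or x = byb. Scan the nonempty proper prefixes
    # for one with equal numbers of a's and b's; the proof shows one exists.
    balance = 0
    for i in range(1, len(x)):
        balance += 1 if x[i - 1] == "a" else -1
        if balance == 0:
            return ("S", [build_tree(x[:i]), build_tree(x[i:])])  # production (4)
    raise AssertionError("unreachable for a string with equal counts")


def tree_yield(tree):
    """Concatenate the leaf labels of a tree from left to right."""
    if isinstance(tree, str):
        return tree
    return "".join(tree_yield(child) for child in tree[1])


print(tree_yield(build_tree("abba")) == "abba")
```

Each recursive call is on a strictly shorter string, which corresponds to the use of the induction hypothesis on strings of length n or less.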

Let R = {x ∈ {a,b}* | #a(x) = #b(x) : x}, to use Gries's set notation. That is, R is the language (i.e., set of strings) consisting of precisely those strings over the alphabet {a,b} having equal numbers of occurrences of a and b.

Then what Claim 1 (or its corollary, anyway) says is L(G) ⊆ R. (L(G) denotes the set of terminal strings derivable from the start symbol, S, of grammar G.)

What Claim 2 says is R ⊆ L(G). In summary, then, we have proved that each of L(G) and R is a subset of the other, which is to say that they are the same set: L(G) = R.
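
The equality L(G) = R can also be checked exhaustively for short strings. The sketch below, with hypothetical function names `generated_up_to` and `balanced_up_to`, computes the terminal strings of length at most n that G generates by closing {λ} under the three non-λ productions, and compares the result with the members of R of length at most n. Restricting to length n loses nothing, since every subtree of a derivation tree for a terminal string w yields a substring of w.

```python
from itertools import product


def generated_up_to(n):
    """Terminal strings of length at most n generated by G (fixed-point closure)."""
    lang = {""}  # production (1)
    while True:
        new = set()
        for u in lang:
            if len(u) + 2 <= n:
                new.add("a" + u + "b")      # production (2)
                new.add("b" + u + "a")      # production (3)
            for v in lang:
                if len(u) + len(v) <= n:
                    new.add(u + v)          # production (4)
        if new <= lang:                     # nothing new: fixed point reached
            return lang
        lang |= new


def balanced_up_to(n):
    """Strings over {a,b} of length at most n with equal numbers of a's and b's."""
    return {
        "".join(p)
        for k in range(n + 1)
        for p in product("ab", repeat=k)
        if p.count("a") == p.count("b")
    }


print(generated_up_to(8) == balanced_up_to(8))
```

Agreement on short strings is only evidence, of course; the two claims above are what establish the equality for all lengths.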