Notes on Left-to-Right Top-Down Parsing

Acknowledgement: What is presented here is largely based upon material from Compiler Design Theory by Lewis, Rosenkrantz, and Stearns, hereafter referred to as LRS.

Background

The general notion of parsing is this: Given as input are a grammar G and a string x. The output to be produced is a (description of a) parse tree (in accord with G) whose yield is x, if one exists. (In place of a parse tree, the output might be something equivalent, such as (a description of) a derivation.)

Because programming languages are described using context-free grammars (as opposed to more general kinds of grammars, such as context-sensitive or unrestricted phrase-structure grammars, for which parsing is much more difficult, or even impossible in the case of the latter), that is our focus here.

Many algorithms have been developed for this purpose. Among the best known is Earley's, which runs in O(n³) time in the worst case, where n is the length of the input string x. (For simplicity's sake, the size of grammar G is taken to be a constant.) If G is unambiguous, Earley's algorithm is guaranteed to finish in no worse than O(n²) time.

But for a parser of a programming language to require quadratic time would be impractical, or at least annoyingly slow. Hence, researchers have identified restricted versions of CFG's for which linear-time parsing algorithms exist. These are the types of CFG's that are used for describing the syntax of programming languages.

Among the broad categories of approaches to parsing are top-down and bottom-up.

In the former, we begin with the root node of the tree; at each step, we attach new children to one of the already-existing leaves in the tree, consistent with some production in the grammar. (That is, if, for example, the node in question is labeled A and we attach new children labeled a, B, and b, then the grammar must have A ⟶ aBb as one of its productions.) Success occurs when the labels on the leaves of the tree (read from left to right, of course) spell out the input string.

In bottom-up parsing, we begin with a forest of trees, namely a bunch of leaf nodes arranged so that their labels spell out (from left to right, of course) the input string. At each step, we connect the roots of one or more adjacent trees in the forest to a new parent node, consistent with some production in the grammar. (That is, if, for example, we attach root nodes having labels a, B, and b to a new parent node labeled A, then the grammar must have A ⟶ aBb as one of its productions.) Success occurs when the forest has been reduced to a single tree whose root node is labeled by the grammar's start symbol.

In general, bottom-up parsing is the more powerful of the two approaches, in the sense that (assuming we insist on parsing in linear time) it is applicable to a wider class of grammars. Donald E. Knuth invented/discovered the class of LR(1) grammars in the mid-1960s, which serves as the basis for most linear-time bottom-up parsing algorithms.

Top-down approaches tend to be a little easier to understand, so that is what we focus on here.


s-grammars

Grammar G1
(1) S → abR
(2) S → bRbS
(3) R → a
(4) R → bR
To illustrate the kinds of issues involved in top-down parsing, we start with an example of what LRS refer to as an s-grammar. Let's call it G1. (The start symbol is S. By convention, if no start symbol is explicitly identified but a grammar has a nonterminal symbol named S, that is its start symbol. If there is no such nonterminal, the start symbol is the nonterminal on the left-hand side (LHS) of the first production listed.)

Consider the string bababa, which happens to be in the language of this grammar, and let's try to construct a derivation of it. (From a derivation one can easily construct the corresponding parse tree, if desired.)

Starting with S, we ask: Which of the grammar's productions, applied to S, can possibly be the first step in a derivation of bababa? Clearly, the answer is (2) alone, as any derivation that begins with an application of (1) necessarily results in a string that begins with a rather than b. So we have

S ⇒(2) bRbS

as our derivation's first step. (The superscript identifies which production was applied.) Having accounted for the leading b in the input string, we are left with the problem of filling in the details of a derivation RbS ⇒* ababa. At this point, we have two choices: either apply a production to R, or apply a production to S. (In general, if we have a sentential form with one or more occurrences of nonterminal symbols, we can choose any one of them and apply a production.)

We shall stick to a purely left-to-right strategy, however, as that is the standard way of doing things in top-down parsing.

So the question to ask now is: Which of the grammar's productions, applied to the occurrence of R in the sentential form RbS, can possibly be the first step in a derivation of ababa? The answer, of course, is (3) alone.

By now, you should get the idea. Here is a (primitive ASCII-based) graphical depiction of the steps we would go through in completing our task. It is based on the idea of maintaining two stacks, the "upper" of which contains that portion y of the input string (necessarily a suffix, of course) that has yet to be generated and the "lower" of which contains the suffix α of the "current" sentential form that must be shown to generate y. LRS refer to this as a "stack movie".

Initially, the upper stack contains the entire input string and the lower stack contains the start symbol of the grammar. (A dollar sign ('$') is placed at the bottom of each stack to mark its bottom explicitly.)

Stack Movie (parsing bababa with G1)
+-------+     +-------+     +------+     +------+     +-----+
|bababa$| (2) |bababa$| pop |ababa$| (3) |ababa$| pop |baba$| pop  
|S$     | ==> |bRbS$  | ==> |RbS$  | ==> |abS$  | ==> |bS$  | ==> 
+-------+     +-------+     +------+     +------+     +-----+

+----+     +----+     +---+      +--+     +--+     +-+
|aba$| (1) |aba$| pop |ba$| pop  |a$| (3) |a$| pop |$|
|S$  | ==> |abR$| ==> |bR$| ==>  |R$| ==> |a$| ==> |$|
+----+     +----+     +---+      +--+     +--+     +-+

Notice that our stack movie mirrors the leftmost derivation

S ⇒(2) bRbS ⇒(3) babS ⇒(1) bababR ⇒(3) bababa

(A leftmost derivation is simply one in which, at each step, a production is applied to the leftmost occurrence of a nonterminal symbol.) The parenthesized numerals identify the production applied in each step. Notice that the sequence of numerals in the derivation matches those listed in the non-pop steps of the stack movie.

The underlying algorithm is that, at each step, we do this:

  A. if both stacks have $ at the top, then parsing has completed successfully.
  B. elsif the symbol X at the top of the lower stack is a terminal symbol, then
    1. if X matches the symbol at the top of the upper stack, pop both stacks;
    2. if X fails to match the symbol at the top of the upper stack, then either we made a wrong choice in an earlier step or else it is impossible to derive the input string.
  C. else   /* the symbol X at the top of the lower stack is a nonterminal symbol */
     "apply" one of X's productions X → α by popping X from the lower stack and pushing α onto it.

Of course, what is crucial in executing this plan is to make a correct choice each time case C applies. For G1, the following table provides a means of making that choice, based upon only the topmost symbol on each stack:

top of          next input symbol
lower stack       a       b
     S           (1)     (2)
     R           (3)     (4)
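Encoding G1 and its table in the style of the sketch above (again, the encoding is my own), a run on bababa reports exactly the productions applied in the non-pop steps of the stack movie:

g1_prods = {1: ('S', ('a', 'b', 'R')),
            2: ('S', ('b', 'R', 'b', 'S')),
            3: ('R', ('a',)),
            4: ('R', ('b', 'R'))}
g1_table = {('S', 'a'): 1, ('S', 'b'): 2,
            ('R', 'a'): 3, ('R', 'b'): 4}

print(predictive_parse(g1_table, g1_prods, 'S', 'bababa'))   # [2, 3, 1, 3]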

Why does this work for G1? Because, stated in slightly more general terms than our earlier observations, G1 has these properties:

  1. Every production's RHS begins with a terminal symbol.
  2. For every two productions having the same LHS, their RHS's begin with distinct symbols.

A grammar satisfying these properties is referred to (by LRS) as an s-grammar. It should be clear that, for any such grammar, a parsing table (such as that shown above for G1) can be constructed for the purpose of guiding our choice when case C arises during parsing. (Specifically, if A ⟶ tα (where t is a terminal symbol) is a production in the grammar, we would place its ID into cell (A,t) (i.e., the cell at the intersection of the row labeled A and the column labeled t).)
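Here is a sketch of that construction in the same encoding (the function name is mine); note that a clash while filling a cell is precisely a violation of property 2, so the function doubles as an s-grammar check:

def s_grammar_table(prods):
    """Parse table for an s-grammar: a production A -> t... goes into
    cell (A, t), where t is the leading terminal of its RHS. A clash
    means property 2 fails, i.e. the grammar is not an s-grammar."""
    table = {}
    for pid, (A, rhs) in prods.items():
        cell = (A, rhs[0])         # property 1: rhs[0] is a terminal
        if cell in table:
            raise ValueError(f"not an s-grammar: two entries for {cell}")
        table[cell] = pid
    return table

For instance, s_grammar_table(g1_prods) reproduces g1_table from above.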

Hence, the parsing problem for s-grammars has a very efficient (indeed, linear time) algorithmic solution. On the other hand, s-grammars are a very restricted class of CFGs, and so, for a particular language of interest, it is not very likely that we would be able to devise an s-grammar that generates it. Hence, it would be nice to identify a larger class of CFGs for which the parsing problem can be solved in linear time.


q-grammars

Grammar G2
(1) S → aHS
(2) S → b
(3) H → cHS
(4) H → λ
Enter the concept of q-grammars, which generalize s-grammars by allowing λ-productions. (Recall that λ is a symbol representing the empty string.) As an example, consider G2, shown above.

Notice that the presence of production (4) precludes G2 from being an s-grammar. (Recall that in an s-grammar, each production's RHS must begin with a terminal symbol.)

Nevertheless, we can still parse strings deterministically in linear time. To illustrate this (but not prove it, of course), consider this stack movie showing a parse of the string aacbb:

+------+     +------+     +-----+     +-----+     +-----+     +----+
|aacbb$| (1) |aacbb$| pop |acbb$| (4) |acbb$| (1) |acbb$| pop |cbb$| (3)
|S$    | ==> |aHS$  | ==> |HS$  | ==> |S$   | ==> |aHS$ | ==> |HS$ | ==>
+------+     +------+     +-----+     +-----+     +-----+     +----+


+-----+     +----+     +---+     +---+     +--+     +--+     +-+
|cbb$ | pop |bb$ | (4) |bb$| (2) |bb$| pop |b$| (2) |b$| pop |$|
|cHSS$| ==> |HSS$| ==> |SS$| ==> |bS$| ==> |S$| ==> |b$| ==> |$|
+-----+     +----+     +---+     +---+     +--+     +--+     +-+

It turns out that the correct parse table for G2 is

top of          top of upper stack
lower stack       a       b       c
     S           (1)     (2)      -
     H           (4)     (4)     (3)

In particular, with H at the top of the lower stack and either a or b at the top of the upper stack, the proper production to apply is (4). (This makes sense, as application of (3) cannot generate a string that begins with either a or b.) But what really justifies placing (4) into cells (H,a) and (H,b) of the table, respectively, is that a can immediately follow H in a sentential form derived from S, and so can b. If the former weren't true, putting (4) into the (H,a) cell would be pointless; similarly, if the latter weren't true, putting (4) into the (H,b) cell would be pointless.
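Note that the driver sketched in the previous section needs no change to handle λ-productions: applying (4) replaces H by an empty RHS, which simply pops H. Encoding G2 and its table as before:

g2_prods = {1: ('S', ('a', 'H', 'S')), 2: ('S', ('b',)),
            3: ('H', ('c', 'H', 'S')), 4: ('H', ())}
g2_table = {('S', 'a'): 1, ('S', 'b'): 2,
            ('H', 'a'): 4, ('H', 'b'): 4, ('H', 'c'): 3}

print(predictive_parse(g2_table, g2_prods, 'S', 'aacbb'))
# [1, 4, 1, 3, 4, 2, 2] -- the non-pop steps of the stack movie above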

Now suppose that we added the production (5) H ⟶ b to obtain G3. (For that matter, the RHS could be bβ for any β.) Does this now mean that, with H at the top of the lower stack and b at the top of the upper stack, we should apply production (5)? Not necessarily! Consider these parses of ab and abb:

Parsing ab and abb with G3
+---+     +----+     +---+     +--+     +--+     +-+
|ab$| (1) |ab$ | pop |b$ | (4) |b$| (2) |b$| pop |$|
|S$ | ==> |aHS$| ==> |HS$| ==> |S$| ==> |b$| ==> |$|
+---+     +----+     +---+     +--+     +--+     +-+

+----+     +----+     +---+     +---+     +--+     +--+     +-+
|abb$| (1) |abb$| pop |bb$| (5) |bb$| pop |b$| (2) |b$| pop |$|
|S$  | ==> |aHS$| ==> |HS$| ==> |bS$| ==> |S$| ==> |b$| ==> |$|
+----+     +----+     +---+     +---+     +--+     +--+     +-+
Grammar G3
(1) S → aHS
(2) S → b
(3) H → cHS
(4) H → λ
(5) H → b

In the third step of each parse, we were confronted with H at the top of the lower stack and b at the top of the upper stack. In one instance, the correct production to apply was (4), but in the other it was (5). Which is to say that parsing with respect to G3 cannot be carried out (deterministically) based upon a table whose entries depend solely upon the top symbol on each of the two stacks. (That is, if we tried to fill in the cells of such a table, cell (H,b) would have to include both (4) and (5), because one choice is correct in some situations and the other choice is correct in others.)

Grammar G4
(1) S → aHS
(2) S → b
(3) H → cHS
(4) H → λ
(5) S → c
Or consider G4, which is obtained from G2 by adding the production (5) S ⟶ c (or, for that matter, S ⟶ cα for any string α). Clearly, we would place (5) into cell (S,c). But what about the table entries in the row labeled H?

Consider these parses of ac and acbb:

+---+     +----+      +---+     +--+     +--+      +-+
|ac$| (1) |ac$ | pop  |c$ | (4) |c$| (5) |c$| pop  |$|
|S$ | ==> |aHS$| ==>  |HS$| ==> |S$| ==> |c$| ==>  |$|
+---+     +----+      +---+     +--+     +--+      +-+

+-----+     +-----+     +----+     +-----+     +----+     +---+ 
|acbb$| (1) |acbb$| pop |cbb$| (3) |cbb$ | pop |bb$ | (4) |bb$|
|S$   | ==> |aHS$ | ==> |HS$ | ==> |cHSS$| ==> |HSS$| ==> |SS$|  ....
+-----+     +-----+     +----+     +-----+     +----+     +---+

In the third step of each parse, we were confronted with H at the top of the lower stack and c at the top of the upper stack. In one instance, the correct production to apply was (4), but in the other it was (3). Hence, the (H,c) cell in the parse table must contain both (3) and (4), which is to say that parsing with respect to G4 cannot be carried out (deterministically) based upon a table whose entries depend solely upon the top symbol on each of the two stacks.

What is the (general) explanation for why G2 can be parsed in a manner similar to the s-grammar G1, while neither G3 nor G4 can be?

To provide such an explanation, first we define a nonterminal symbol's follow set:

FOLLOW(H) = { t ∈ Σ∪{$}  |  S$ ⇒* yHtz for some strings y,z }

Here, H is understood to be a nonterminal symbol, S is the start symbol, $ is the "end of string" marker, t is a terminal symbol (or possibly $), and y and z are terminal strings.

Informally, the follow set of H includes any terminal symbol (or $) that could possibly occur immediately after H in a sentential form generated from S$.
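For a q-grammar, FOLLOW sets are easy to compute, because FIRST of a nonterminal is just the set of leading terminals of its non-λ productions. Here is a sketch of the standard fixed-point computation, specialized to q-grammars (the code and names are mine):

def q_follow_sets(productions, start):
    """FOLLOW sets for a q-grammar (every RHS is empty or begins with
    a terminal). productions maps each nonterminal to a list of RHS
    tuples; any symbol that is not a key is a terminal."""
    nts = set(productions)
    nullable = {A for A, rhss in productions.items() if () in rhss}
    # In a q-grammar, FIRST(A) is just the set of leading terminals.
    first = {A: {rhs[0] for rhs in rhss if rhs}
             for A, rhss in productions.items()}
    follow = {A: set() for A in nts}
    follow[start].add('$')              # the end marker follows S
    changed = True
    while changed:
        changed = False
        for A, rhss in productions.items():
            for rhs in rhss:
                for i, X in enumerate(rhs):
                    if X not in nts:
                        continue
                    new = set()
                    rest_nullable = True
                    for Y in rhs[i + 1:]:     # what can follow X here?
                        new |= first[Y] if Y in nts else {Y}
                        if Y not in nullable:
                            rest_nullable = False
                            break
                    if rest_nullable:   # X can end a string derived from A
                        new |= follow[A]
                    if not new <= follow[X]:
                        follow[X] |= new
                        changed = True
    return follow

For G2 this yields FOLLOW(H) = {a, b} and FOLLOW(S) = {a, b, $}, confirming the observation above.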

To qualify as a q-grammar, these conditions must be satisfied:

  1. Every production's RHS is either λ or else begins with a terminal symbol.
  2. If H ⟶ aα and H ⟶ bβ are two (non-λ) productions, then a ≠ b. (In other words, for every two productions having the same LHS, and with RHS's beginning with terminal symbols, those terminal symbols must be distinct.)
  3. If H ⟶ λ and H ⟶ aα are productions, then a is not a member of FOLLOW(H).

Comparing this with the definition of an s-grammar, we see that (1) generalizes the corresponding property of s-grammars by allowing λ-productions, (2) is essentially the same as the corresponding property of s-grammars, and (3) ensures that the type of conflict illustrated by grammars G3 and G4 cannot happen. Indeed, if we have a q-grammar, the cells in the parse table can be filled as follows, with the assurance that no cell contains more than one entry:

  1. If H ⟶ tα is a production (where t is a terminal symbol), place its ID into cell (H,t).
  2. If H ⟶ λ is a production, place its ID into every cell (H,t) such that t is in FOLLOW(H).
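A sketch of this table-filling procedure, reusing q_follow_sets from the earlier sketch (again, the code is mine); a clash while filling a cell is precisely a violation of the q-grammar conditions:

def q_grammar_table(prods, start):
    """Fill the parse table for a q-grammar by the two rules above.
    prods maps production IDs to (LHS, RHS-tuple) pairs. Raises if
    some cell would receive two entries, i.e. if the q-grammar
    conditions are violated."""
    by_lhs = {}
    for pid, (A, rhs) in prods.items():
        by_lhs.setdefault(A, []).append(rhs)
    follow = q_follow_sets(by_lhs, start)

    table = {}
    def place(cell, pid):
        if cell in table:
            raise ValueError(f"conflict at {cell}: ({table[cell]}) vs ({pid})")
        table[cell] = pid

    for pid, (A, rhs) in prods.items():
        if rhs:                   # rule 1: RHS begins with a terminal
            place((A, rhs[0]), pid)
        else:                     # rule 2: a lambda-production
            for t in follow[A]:
                place((A, t), pid)
    return table

For instance, q_grammar_table(g2_prods, 'S') reproduces the G2 table above, while adding (5) H → b (grammar G3) makes it raise with a conflict at cell (H,b).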


LL(1) grammars

We started with s-grammars and then generalized to q-grammars without losing the former's nice parsing properties. Can we generalize even further while still preserving those nice parsing properties (namely, that we can parse deterministically at each step by correctly predicting which production to apply based upon only the top symbol on each stack)? The answer is yes.

The obvious generalization that we'd like to make is to lift the restriction that the RHS of any non-λ production must begin with a terminal symbol.

To illustrate, the figure below shows a grammar (from top of page 255 in LRS) and its parse table.

top of          top of upper stack
lower stack      a      b      c      d      e      $
     S          (1)    (1)    (1)    (2)    (1)    ---
     A          (3)    (4)    (4)    ---    (3)    ---
     B          ---    (6)    (5)    (6)    ---    (6)
     C          (7)    ---    ---    ---    (8)    ---

(1) S → AbB
(2) S → d
(3) A → CAb
(4) A → B
(5) B → cSd
(6) B → λ
(7) C → a
(8) C → ed

The red entries (those in cells (S,d), (B,c), (C,a), and (C,e)) were placed there due to productions of the form A ⟶ tα (i.e., whose RHS's begin with a terminal symbol), and the blue entries (those in cells (B,b), (B,d), and (B,$)) were placed there due to productions of the form A ⟶ λ (i.e., λ-productions). Indeed, the reasons for filling these particular cells in this way are exactly as we saw with q-grammars.

What may be less clear is the reason for (3) being in cells (A,a) and (A,e). In effect, what this says is that, using an application of (3) as a first step, it is possible to derive a string (from A) that begins with either a or e! This is easily verified:

A ⇒(3) CAb ⇒(7) aAb ⇒ ...
A ⇒(3) CAb ⇒(8) edAb ⇒ ...

Similar reasoning led to all the green entries in the table: besides (3) in cells (A,a) and (A,e), production (1) belongs in cells (S,a), (S,b), (S,c), and (S,e), and production (4) belongs in cell (A,c) (because from B we can derive strings beginning with c).

What we are using here is the idea (a special case of which we saw with s- and q-grammars) that the application of a production can lead (eventually) to a string that begins with a particular terminal symbol.

To formalize this, we define

FIRST(α) = { t ∈ Σ  |  α ⇒* tβ for some string β }
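FIRST sets for an arbitrary CFG, together with the set of nullable nonterminals, can be computed by one more fixed-point iteration. A sketch, with the same encoding and caveats as the earlier ones:

def nullable_and_first(productions):
    """Compute the set of nullable nonterminals and the FIRST set of
    every nonterminal for an arbitrary CFG. productions maps each
    nonterminal to a list of RHS tuples; non-key symbols are terminals."""
    nts = set(productions)
    nullable = set()
    first = {A: set() for A in nts}
    changed = True
    while changed:
        changed = False
        for A, rhss in productions.items():
            for rhs in rhss:
                all_nullable = True
                for X in rhs:   # scan up to the first non-nullable symbol
                    add = first[X] if X in nts else {X}
                    if not add <= first[A]:
                        first[A] |= add
                        changed = True
                    if X not in nullable:
                        all_nullable = False
                        break
                if all_nullable and A not in nullable:
                    nullable.add(A)
                    changed = True
    return nullable, first

For the page-255 grammar, this gives nullable = {A, B} and, e.g., FIRST(A) = {a, c, e}: exactly why (1) lands in cells (S,a), (S,c), and (S,e), and also in (S,b), since A is nullable and b comes next in AbB.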

For reasons that should be clear at this point:

LL Rule 1: If A ⟶ α is a production and b is a member of FIRST(α), then the ID of this production belongs in the cell (A,b).

This is just the straightforward generalization of the first rule (see the very end of the previous section) that we used for filling cells in the parse table for a q-grammar. All red and green entries in the table are a result of applying this rule.

The generalization of the second rule is this:

LL Rule 2: If A ⟶ α is a production and α is nullable (meaning that from α we can generate the empty string (i.e., α ⇒* λ)), then the ID of this production belongs in any cell (A,b) such that b is in FOLLOW(A).

In addition to the blue entries, this rule gives rise to the purple entry in cell (A,b).

Definition: If a grammar is such that

  1. its parse table, filled according to LL Rule 1 and LL Rule 2 given above, has no cell containing more than one entry, and
  2. for no nonterminal symbol A is it the case that there exists a derivation of the form A ⇒+ Aα for some string α (such a nonterminal is said to be left-recursive),
then that grammar is in the class of LL(1) grammars.
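
Putting the pieces together, here is a sketch that fills a table by LL Rule 1 and LL Rule 2 and reports every doubly-filled cell, thereby checking condition 1 of the definition (it does not check condition 2, left recursion). It reuses nullable_and_first from the earlier sketch; everything else is my own encoding:

def ll1_table(prods, start):
    """Fill a parse table by LL Rules 1 and 2. prods maps production
    IDs to (LHS, RHS-tuple) pairs. Returns (table, conflicts); the
    grammar satisfies condition 1 of the LL(1) definition iff the
    conflict list comes back empty."""
    by_lhs = {}
    for pid, (A, rhs) in prods.items():
        by_lhs.setdefault(A, []).append(rhs)
    nts = set(by_lhs)
    nullable, first = nullable_and_first(by_lhs)

    def first_of(alpha):
        # FIRST of a symbol string, and whether the string is nullable.
        out = set()
        for X in alpha:
            out |= first[X] if X in nts else {X}
            if X not in nullable:
                return out, False
        return out, True

    # FOLLOW sets for the general case, again by a fixed point.
    follow = {A: set() for A in nts}
    follow[start].add('$')
    changed = True
    while changed:
        changed = False
        for A, rhss in by_lhs.items():
            for rhs in rhss:
                for i, X in enumerate(rhs):
                    if X not in nts:
                        continue
                    f, eps = first_of(rhs[i + 1:])
                    new = f | (follow[A] if eps else set())
                    if not new <= follow[X]:
                        follow[X] |= new
                        changed = True

    table, conflicts = {}, []
    for pid, (A, rhs) in prods.items():
        f, eps = first_of(rhs)
        for t in f | (follow[A] if eps else set()):   # Rules 1 and 2
            if (A, t) in table:
                conflicts.append(((A, t), table[(A, t)], pid))
            else:
                table[(A, t)] = pid
    return table, conflicts

Run on the page-255 grammar, this reproduces the table shown earlier with an empty conflict list.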