The general notion of parsing is this: Given as input is a grammar G and a string x. The output to be produced is a (description of a) parse tree (in accord with G) whose yield is x, if one exists. (In place of a parse tree, the output might be something equivalent, such as (a description of) a derivation.)
Because programming languages are described using context-free grammars (as opposed to more general kinds of grammars, such as context-sensitive or unrestricted phrase-structure grammars, for which parsing is much more difficult, or even undecidable in the case of the latter), that is our focus here.
Many algorithms have been developed for this purpose. Among the best known is Earley's, which runs in O(n³) time in the worst case, where n is the length of the input string x. (For simplicity's sake, the size of grammar G is taken to be a constant.) If G is unambiguous, Earley's algorithm is guaranteed to finish in no worse than O(n²) time.
But for a parser of a programming language to require quadratic time would be impractical, or at least annoyingly slow. Hence, researchers have identified restricted versions of CFG's for which linear-time parsing algorithms exist. These are the types of CFG's that are used for describing the syntax of programming languages.
Among the broad categories of approaches to parsing are top-down and bottom-up.
In the former, we begin with the root node of the tree; at each step, we attach new children to one of the already-existing leaves in the tree, consistent with some production in the grammar. (That is, if, for example, the node in question is labeled A and we attach new children labeled a, B, and b, then the grammar must have A ⟶ aBb as one of its productions.) Success occurs when the labels on the leaves of the tree (read from left to right, of course) spell out the input string.
In bottom-up parsing, we begin with a forest of trees, namely a bunch of leaf nodes arranged so that their labels spell out (from left to right, of course) the input string. At each step, we connect the roots of one or more adjacent trees in the forest to a new parent node, consistent with some production in the grammar. (That is, if, for example, we attach root nodes having labels a, B, and b to a new parent node labeled A, then the grammar must have A ⟶ aBb as one of its productions.) Success occurs when the forest has been reduced to a single tree whose root node is labeled by the grammar's start symbol.
In general, bottom-up parsing is the more powerful of the two approaches, in the sense that —assuming that we insist on parsing in linear time— it is applicable to a wider class of grammars. Donald E. Knuth invented/discovered the class of LR(1) grammars in the mid-1960's, which serves as the basis for most linear-time bottom-up parsing algorithms.
Top-down approaches tend to be a little easier to understand, so that is what we focus on here.
Consider the grammar G1:

(1) S → abR   (2) S → bRbS   (3) R → a   (4) R → bR
Consider the string bababa, which happens to be in the language of this grammar, and let's try to construct a derivation of it. (From a derivation one can easily construct the corresponding parse tree, if desired.)
Starting with S, we ask: Which of the grammar's productions, applied to S, can possibly be the first step in a derivation of bababa? Clearly, the answer is (2) alone, as any derivation that begins with an application of (1) necessarily results in a string that begins with a rather than b. So we have

S ⇒⁽²⁾ bRbS
as our derivation's first step. (The superscript identifies which production was applied.) Having accounted for the leading b in the input string, we are left with the problem of filling in the details of a derivation RbS ⇒* ababa. At this point, we have two choices: either apply a production to R, or apply a production to S. (In general, if we have a sentential form with one or more occurrences of nonterminal symbols, we can choose any one of them and apply a production.)
We shall stick to a purely left-to-right strategy, however, as that is the standard way of doing things in top-down parsing.
So the question to ask now is: Which of the grammar's productions, applied to the occurrence of R in the sentential form RbS, can possibly be the first step in a derivation of ababa? The answer, of course, is (3) alone.
By now, you should get the idea. Here is a (primitive ASCII-based) graphical depiction of the steps we would go through in completing our task. It is based on the idea of maintaining two stacks, the "upper" of which contains that portion y of the input string (necessarily a suffix, of course) that has yet to be generated and the "lower" of which contains the suffix α of the "current" sentential form that must be shown to generate y. LRS refer to this as a "stack movie".
Initially, the upper stack contains the entire input string and the lower stack contains the start symbol of the grammar. (A dollar sign ('$') is placed at the bottom of each stack to mark its bottom explicitly.)
Stack Movie (parsing bababa with G1)

+-------+     +-------+     +------+     +------+     +-----+
|bababa$| (2) |bababa$| pop |ababa$| (3) |ababa$| pop |baba$| pop
|S$     | ==> |bRbS$  | ==> |RbS$  | ==> |abS$  | ==> |bS$  | ==>
+-------+     +-------+     +------+     +------+     +-----+

+----+     +----+     +---+     +--+     +--+     +-+
|aba$| (1) |aba$| pop |ba$| pop |a$| (3) |a$| pop |$|
|S$  | ==> |abR$| ==> |bR$| ==> |R$| ==> |a$| ==> |$|
+----+     +----+     +---+     +--+     +--+     +-+

Grammar G1:  (1) S → abR   (2) S → bRbS   (3) R → a   (4) R → bR
Notice that our stack movie mirrors the leftmost derivation

S ⇒⁽²⁾ bRbS ⇒⁽³⁾ babS ⇒⁽¹⁾ bababR ⇒⁽³⁾ bababa
(A leftmost derivation is simply one in which, at each step, a production is applied to the leftmost occurrence of a nonterminal symbol.) The parenthesized numerals identify the production applied in each step. Notice that the sequence of numerals in the derivation matches those listed in the non-pop steps of the stack movie.
The underlying algorithm is that, at each step, we do this:
(A) If the top symbol on each stack is the bottom marker $, terminate with success.

(B) If the top symbol on the lower stack is a terminal symbol, compare it with the top symbol on the upper stack: if they match, pop both; otherwise, terminate with failure.

(C) If the top symbol on the lower stack is a nonterminal symbol, replace it by the RHS of one of that nonterminal's productions (pushed so that the RHS's first symbol ends up on top).
Of course, what is crucial in executing this plan is to make a correct choice each time case C applies. For G1, the following table provides a means of making that choice, based upon only the topmost symbol on each stack:

                         next input symbol
                           a         b
     top of      S        (1)       (2)
     lower
     stack       R        (3)       (4)

Grammar G1:  (1) S → abR   (2) S → bRbS   (3) R → a   (4) R → bR
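The table-driven procedure sketched above is straightforward to program. Below is a minimal Python sketch; the representation choices (lowercase letters as terminals, uppercase as nonterminals, production IDs as table entries) are mine, for illustration, not taken from LRS.

```python
# Grammar G1 and its parse table, keyed by production ID.
G1 = {
    1: ("S", "abR"),
    2: ("S", "bRbS"),
    3: ("R", "a"),
    4: ("R", "bR"),
}
TABLE_G1 = {("S", "a"): 1, ("S", "b"): 2, ("R", "a"): 3, ("R", "b"): 4}

def parse(w, productions, table, start="S"):
    """Two-stack top-down parse; returns the production sequence, or None."""
    upper = list(w + "$")[::-1]        # upper stack: remaining input (top = end of list)
    lower = ["$", start]               # lower stack: suffix of current sentential form
    steps = []                         # productions applied, in order
    while True:
        if lower[-1] == "$" and upper[-1] == "$":
            return steps               # case A: both stacks exhausted -> success
        if lower[-1].islower() or lower[-1] == "$":
            if lower[-1] != upper[-1]:
                return None            # case B: terminal mismatch -> reject
            lower.pop(); upper.pop()   # case B: terminals match -> pop both
        else:
            choice = table.get((lower[-1], upper[-1]))   # case C: consult table
            if choice is None:
                return None            # empty cell -> reject
            _, rhs = productions[choice]
            lower.pop()
            lower.extend(reversed(rhs))  # push RHS, first symbol on top
            steps.append(choice)

print(parse("bababa", G1, TABLE_G1))   # → [2, 3, 1, 3]
```

Note that `steps` records exactly the productions applied in the non-pop steps of the stack movie, i.e., the production sequence of the leftmost derivation.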
Why does this work for G1? Because (as we observed before) each of the following is true:

- Any derivation whose first step applies (1) yields a string beginning with a, and any whose first step applies (2) yields a string beginning with b.
- Likewise, applying (3) to R yields a string beginning with a, and applying (4) yields a string beginning with b.

To state it in slightly more general terms, G1 has these properties:

(1) The RHS of every production begins with a terminal symbol.
(2) No two productions having the same LHS have RHS's beginning with the same terminal symbol.
A grammar satisfying these properties is referred to (by LRS) as an s-grammar. It should be clear that, for any such grammar, a parsing table (such as that shown above for G1) can be constructed for the purpose of guiding our choice when case C arises during parsing. (Specifically, if A ⟶ tα (where t is a terminal symbol) is a production in the grammar, we would place its ID into cell (A,t) (i.e., the cell at the intersection of the row labeled A and the column labeled t).)
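This table-construction rule is mechanical enough to state as code. Here is a sketch under the same illustrative conventions as before (terminals lowercase, productions keyed by ID); these conventions are my own, not from LRS.

```python
def build_s_table(productions):
    """Build the parse table for an s-grammar: production A -> t...
    (t a terminal) goes into cell (A, t)."""
    table = {}
    for pid, (lhs, rhs) in productions.items():
        t = rhs[0]                      # s-grammar: every RHS begins with a terminal
        assert (lhs, t) not in table    # s-grammar: no two A-productions share a first terminal
        table[(lhs, t)] = pid
    return table

G1 = {1: ("S", "abR"), 2: ("S", "bRbS"), 3: ("R", "a"), 4: ("R", "bR")}
print(build_s_table(G1))
# {('S', 'a'): 1, ('S', 'b'): 2, ('R', 'a'): 3, ('R', 'b'): 4}
```

The two `assert`-able facts are precisely properties (1) and (2) above; together they guarantee that every cell receives at most one entry.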
Hence, the parsing problem for s-grammars has a very efficient (indeed, linear time) algorithmic solution. On the other hand, s-grammars are a very restricted class of CFGs, and so, for a particular language of interest, it is not very likely that we would be able to devise an s-grammar that generates it. Hence, it would be nice to identify a larger class of CFGs for which the parsing problem can be solved in linear time.
Consider now the grammar G2:

(1) S → aHS   (2) S → b   (3) H → cHS   (4) H → λ
Notice that the presence of production (4) precludes G2 from being an s-grammar. (Recall that in an s-grammar, each production's RHS must begin with a terminal symbol.)
Nevertheless, we can still parse strings deterministically in linear time. To illustrate this (but not prove it, of course), consider this stack movie showing a parse of the string aacbb:
+------+     +------+     +-----+     +-----+     +-----+     +----+
|aacbb$| (1) |aacbb$| pop |acbb$| (4) |acbb$| (1) |acbb$| pop |cbb$| (3)
|S$    | ==> |aHS$  | ==> |HS$  | ==> |S$   | ==> |aHS$ | ==> |HS$ | ==>
+------+     +------+     +-----+     +-----+     +-----+     +----+

+-----+     +----+     +---+     +---+     +--+     +--+     +-+
|cbb$ | pop |bb$ | (4) |bb$| (2) |bb$| pop |b$| (2) |b$| pop |$|
|cHSS$| ==> |HSS$| ==> |SS$| ==> |bS$| ==> |S$| ==> |b$| ==> |$|
+-----+     +----+     +---+     +---+     +--+     +--+     +-+
It turns out that the correct parse table for G2 is
                        top of upper stack
                           a         b         c
     top of      S        (1)       (2)
     lower
     stack       H        (4)       (4)       (3)

Grammar G2:  (1) S → aHS   (2) S → b   (3) H → cHS   (4) H → λ
In particular, with H at the top of the lower stack and either a or b at the top of the upper stack, the proper production to apply is (4). (This makes sense, as application of (3) cannot generate a string that begins with either a or b.) But what really justifies placing (4) into cells (H,a) and (H,b) of the table, respectively, is that a can immediately follow H in a sentential form derived from S, and so can b. If the former weren't true, putting (4) into the (H,a) cell would be pointless; similarly, if the latter weren't true, putting (4) into the (H,b) cell would be pointless.
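It is worth noting that the two-stack parsing loop needs no modification to accommodate λ-productions: if λ is represented by the empty string, then applying (4) simply pops H and pushes nothing. A self-contained Python sketch (illustrative conventions of my own: lowercase terminals, `""` for λ), using the table for G2 shown above:

```python
G2 = {1: ("S", "aHS"), 2: ("S", "b"), 3: ("H", "cHS"), 4: ("H", "")}   # "" represents λ
TABLE_G2 = {("S", "a"): 1, ("S", "b"): 2,
            ("H", "a"): 4, ("H", "b"): 4, ("H", "c"): 3}

def parse(w, productions, table, start="S"):
    """Two-stack top-down parse; returns the production sequence, or None."""
    upper, lower = list(w + "$")[::-1], ["$", start]
    steps = []
    while True:
        if lower[-1] == "$" and upper[-1] == "$":
            return steps               # success
        if lower[-1].islower() or lower[-1] == "$":
            if lower[-1] != upper[-1]:
                return None            # terminal mismatch
            lower.pop(); upper.pop()
        else:
            choice = table.get((lower[-1], upper[-1]))
            if choice is None:
                return None
            _, rhs = productions[choice]
            lower.pop()
            lower.extend(reversed(rhs))   # pushing "" is a no-op: H simply vanishes
            steps.append(choice)

print(parse("aacbb", G2, TABLE_G2))   # → [1, 4, 1, 3, 4, 2, 2]
```

The returned sequence matches the non-pop steps of the stack movie for aacbb.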
Now suppose that we added the production

(5) H ⟶ b

to obtain G3. (For that matter, the RHS could be any string beginning with b; the difficulty illustrated below would be the same.)
Parsing ab and abb with G3

+---+     +----+     +---+     +--+     +--+     +-+
|ab$| (1) |ab$ | pop |b$ | (4) |b$| (2) |b$| pop |$|
|S$ | ==> |aHS$| ==> |HS$| ==> |S$| ==> |b$| ==> |$|
+---+     +----+     +---+     +--+     +--+     +-+

+----+     +----+     +---+     +---+     +--+     +--+     +-+
|abb$| (1) |abb$| pop |bb$| (5) |bb$| pop |b$| (2) |b$| pop |$|
|S$  | ==> |aHS$| ==> |HS$| ==> |bS$| ==> |S$| ==> |b$| ==> |$|
+----+     +----+     +---+     +---+     +--+     +--+     +-+

Grammar G3:  (1) S → aHS   (2) S → b   (3) H → cHS   (4) H → λ   (5) H → b
In the third step of each parse, we were confronted with H at the top of the lower stack and b at the top of the upper stack. In one instance, the correct production to apply was (4), but in the other it was (5). Which is to say that parsing with respect to G3 cannot be carried out (deterministically) based upon a table whose entries depend solely upon the top symbol on each of the two stacks. (That is, if we tried to fill in the cells of such a table, cell (H,b) would have to include both (4) and (5), because one choice is correct in some situations and the other choice is correct in others.)
Or suppose that, instead, we added to G2 the production (5) S ⟶ c, obtaining G4:

(1) S → aHS   (2) S → b   (3) H → cHS   (4) H → λ   (5) S → c
Consider these parses of ac and acbb:
+---+     +----+     +---+     +--+     +--+     +-+
|ac$| (1) |ac$ | pop |c$ | (4) |c$| (5) |c$| pop |$|
|S$ | ==> |aHS$| ==> |HS$| ==> |S$| ==> |c$| ==> |$|
+---+     +----+     +---+     +--+     +--+     +-+

+-----+     +-----+     +----+     +-----+     +----+     +---+
|acbb$| (1) |acbb$| pop |cbb$| (3) |cbb$ | pop |bb$ | (4) |bb$|
|S$   | ==> |aHS$ | ==> |HS$ | ==> |cHSS$| ==> |HSS$| ==> |SS$| ....
+-----+     +-----+     +----+     +-----+     +----+     +---+
In the third step of each parse, we were confronted with H at the top of the lower stack and c at the top of the upper stack. In one instance, the correct production to apply was (4), but in the other it was (3). Hence, the (H,c) cell in the parse table must contain both (3) and (4), which is to say that parsing with respect to G4 cannot be carried out (deterministically) based upon a table whose entries depend solely upon the top symbol on each of the two stacks.
What is the (general) explanation for why G2 can be parsed in a manner similar to the s-grammar G1, but neither G3 nor G4 can be?
To provide such an explanation, first we define a nonterminal symbol's follow set:

FOLLOW(H) = { t | S$ ⇒* yHtz }
Here, H is understood to be a nonterminal symbol, S is the start symbol, $ is the "end of string" marker, t is a terminal symbol (or possibly $), and y and z are terminal strings.
Informally, the follow set of H includes any terminal symbol (or $) that could possibly occur immediately after H in a sentential form generated from S$.
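Follow sets (together with nullability and the sets of possible first terminals) are computable by a standard fixed-point iteration over the productions. The following Python sketch illustrates that standard construction; the function and its representation choices are mine, not from LRS.

```python
def compute_sets(productions, start="S"):
    """Fixed-point computation of nullability, FIRST, and FOLLOW.
    productions: {id: (lhs, rhs)} with rhs a string; "" represents λ."""
    nts = {lhs for lhs, _ in productions.values()}
    nullable = set()
    first = {A: set() for A in nts}
    follow = {A: ({"$"} if A == start else set()) for A in nts}

    def first_of_string(alpha):
        """First terminals of a symbol string, plus whether it is nullable."""
        out = set()
        for s in alpha:
            out |= first[s] if s in nts else {s}
            if s not in nullable:
                return out, False
        return out, True

    changed = True
    while changed:                      # iterate until nothing grows
        changed = False
        for lhs, rhs in productions.values():
            fs, n = first_of_string(rhs)
            if n and lhs not in nullable:
                nullable.add(lhs); changed = True
            if not fs <= first[lhs]:
                first[lhs] |= fs; changed = True
            for i, s in enumerate(rhs):
                if s in nts:            # what can follow this occurrence of s?
                    fs, n = first_of_string(rhs[i + 1:])
                    if n:               # tail can vanish: inherit FOLLOW(lhs)
                        fs = fs | follow[lhs]
                    if not fs <= follow[s]:
                        follow[s] |= fs; changed = True
    return nullable, first, follow

G2 = {1: ("S", "aHS"), 2: ("S", "b"), 3: ("H", "cHS"), 4: ("H", "")}
nullable, first, follow = compute_sets(G2)
print(sorted(follow["H"]), sorted(follow["S"]))   # ['a', 'b'] ['$', 'a', 'b']
```

For G2 this confirms the informal claim above: both a and b can immediately follow H, which is exactly why (4) was placed into cells (H,a) and (H,b).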
To qualify as a q-grammar, these conditions must be satisfied:

(1) The RHS of each production either is λ or begins with a terminal symbol.
(2) No two productions having the same LHS have RHS's beginning with the same terminal symbol.
(3) For each nonterminal A having the λ-production A ⟶ λ, no member of FOLLOW(A) begins the RHS of any other A-production.
Comparing this with the definition of an s-grammar, we see that (1) generalizes the corresponding property of s-grammars by allowing λ-productions, (2) is essentially the same as the corresponding property of s-grammars, and (3) ensures that the type of conflict illustrated by grammars G3 and G4 cannot happen. Indeed, if we have a q-grammar, the cells in the parse table can be filled as follows with the assurance that no cell contains more than one entry:

- If A ⟶ tα (t a terminal symbol) is a production, place its ID into cell (A,t).
- If A ⟶ λ is a production, place its ID into cell (A,t) for every t in FOLLOW(A).
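These filling rules can be applied mechanically, and a grammar that fails to be a q-grammar betrays itself by producing a doubly-filled cell. A Python sketch (the follow sets are hard-coded here to the values that hold for both G2 and G3, an assumption made for illustration rather than computed):

```python
def build_q_table(productions, follow):
    """Fill cells per the two q-grammar rules; report any cell assigned twice."""
    table, conflicts = {}, []
    for pid, (lhs, rhs) in productions.items():
        # Non-λ RHS begins with a terminal (rule 1); λ goes under FOLLOW(lhs) (rule 2).
        targets = [rhs[0]] if rhs else sorted(follow[lhs])
        for t in targets:
            if (lhs, t) in table:
                conflicts.append(((lhs, t), table[(lhs, t)], pid))
            else:
                table[(lhs, t)] = pid
    return table, conflicts

G2 = {1: ("S", "aHS"), 2: ("S", "b"), 3: ("H", "cHS"), 4: ("H", "")}
G3 = dict(G2); G3[5] = ("H", "b")
FOLLOW = {"S": {"a", "b", "$"}, "H": {"a", "b"}}   # holds for both G2 and G3

print(build_q_table(G2, FOLLOW)[1])   # []  -- no conflicts: G2 is a q-grammar
print(build_q_table(G3, FOLLOW)[1])   # [(('H', 'b'), 4, 5)]  -- cell (H,b) doubly filled
```

The reported conflict for G3 is exactly the (H,b) clash between productions (4) and (5) observed in the stack movies above.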
We started with s-grammars and then generalized to q-grammars without losing the former's nice parsing properties. Can we generalize even further while still preserving those nice parsing properties (namely, that we can parse deterministically at each step by correctly predicting which production to apply based upon only the top symbol on each stack)? The answer is yes.
The obvious generalization that we'd like to make is to lift the restriction that the RHS of any non-λ production must begin with a terminal symbol.
To illustrate, the figure below shows a grammar (from top of page 255 in LRS) and its parse table.
[Figure omitted: a grammar (from the top of page 255 in LRS) and its parse table, with entries color-coded red, blue, green, and purple as discussed below.]
The red entries were placed there due to productions of the form A ⟶ tα (i.e., whose RHS's begin with a terminal symbol), and the blue entries were placed there due to productions of the form A ⟶ λ (i.e., λ productions). Indeed, the reasons for filling these particular cells in this way are exactly as we saw with q-grammars.
What may be less clear is the reason for (3) being in cells (A,a) and (A,e). In effect, what this says is that, using an application of (3) as a first step, it is possible to derive a string (from A) that begins with either a or e! This is easily verified:
Similar reasoning led to all the green entries in the table.
What we are using here is the idea (a special case of which we saw with s- and q-grammars) that the application of a production can lead (eventually) to a string that begins with a particular terminal symbol.
To formalize this, we define

FIRST(α) = { t | α ⇒* tβ for some (possibly empty) β }

That is, FIRST(α) is the set of terminal symbols that can begin a string derivable from α.
For reasons that should be clear at this point:
LL Rule 1: If A ⟶ α is a production and b is a member of FIRST(α), then the ID of this production belongs in the cell (A,b).
This is just the straightforward generalization of the first rule (see the very end of the previous section) that we used for filling cells in the parse table for a q-grammar. All red and green entries in the table are a result of applying this rule.
The generalization of the second rule is this:
LL Rule 2: If A ⟶ α is a production and α is nullable (meaning that from α we can generate the empty string (i.e., α ⟹* λ)), then the ID of this production belongs in any cell (A,b) such that b is in FOLLOW(A).
In addition to the blue entries, this rule gives rise to the purple entry in cell (A,b).
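Since the LRS grammar is not reproduced here, the following sketch applies LL Rules 1 and 2 to a small made-up grammar of my own, G5, with productions (1) S ⟶ Ab, (2) A ⟶ aA, (3) A ⟶ λ; its FIRST, FOLLOW, and nullability data are hard-coded (an assumption for illustration rather than computed). Because A is nullable, FIRST(Ab) = {a, b}, so Rule 1 places (1) into both cell (S,a) and cell (S,b), the same phenomenon as the green entries discussed above.

```python
# Hard-coded for the hypothetical grammar G5: (1) S -> Ab  (2) A -> aA  (3) A -> λ
NULLABLE = {"A"}
FIRST = {"S": {"a", "b"}, "A": {"a"}}
FOLLOW = {"S": {"$"}, "A": {"b"}}

def first_of_string(alpha):
    """FIRST of a string of grammar symbols, plus whether the string is nullable."""
    out = set()
    for s in alpha:
        out |= FIRST[s] if s.isupper() else {s}
        if s not in NULLABLE:
            return out, False
    return out, True                    # every symbol (or none) was nullable

def build_ll_table(productions):
    """Fill cells per LL Rules 1 and 2; cells hold *sets* so conflicts are visible."""
    table = {}
    for pid, (lhs, rhs) in productions.items():
        fs, nullable = first_of_string(rhs)
        for b in fs:                    # LL Rule 1: b in FIRST(rhs)
            table.setdefault((lhs, b), set()).add(pid)
        if nullable:                    # LL Rule 2: rhs nullable -> b in FOLLOW(lhs)
            for b in FOLLOW[lhs]:
                table.setdefault((lhs, b), set()).add(pid)
    return table

G5 = {1: ("S", "Ab"), 2: ("A", "aA"), 3: ("A", "")}
for cell in sorted(build_ll_table(G5).items()):
    print(cell)
# (('A', 'a'), {2})
# (('A', 'b'), {3})
# (('S', 'a'), {1})
# (('S', 'b'), {1})
```

Every cell ends up with exactly one entry, so table-driven parsing of G5 is deterministic; a grammar for which some cell received two entries would fail the definition being developed here.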
Definition: If a grammar is such that