Next: Parsing of Context Sensitive Up: Grammars Previous: Computing First Sets of

Linear Parsing of LL(k) and LR(k) Grammars

(This section involves a more advanced topic in XSB programming, metainterpretation. It may be helpful to read the later section on metainterpreters if the going gets tough here.)

As discussed above, parsing context-free grammars with tabling results in an Earley-type parser. This is the parser we get when we write DCGs. We can also write an XSB program that takes a grammar as input (defined in a database predicate as in the example of first/3) and a string (defined by the database word/3 predicate) and succeeds if the grammar accepts the string. With such processing we can compute and use the FIRST sets to make the grammar processing more deterministic. This is similar to what is done in LL(k) and LR(k) parsing, but there the emphasis is on compile-time analysis and complete determinacy. Here the approach is more interpretive and supports nondeterminacy. However, if the grammars are indeed of the appropriate form (LL(k) or LR(k)), the corresponding interpreters presented here will have linear complexity.

It is very easy to write a simple context-free grammar parser in XSB. Again, we assume that the grammar is stored in facts of the form NT ==> Body where NT is a nonterminal symbol and Body is a list of terminals and nonterminals.

:- table parse/3.

% parse(Sym,S0,S) if symbol Sym generates the string from S0 to S.
parse(Sym,S0,S) :-
    word(S0,Sym,S).
parse(Sym,S0,S) :-
    (Sym ==> Body),
    parseSF(Body,S0,S).

% parseSF(SF,S0,S) if sentential form SF generates the string from S0 to S.
parseSF([],S,S).
parseSF([Sym|Syms],S0,S) :-
    parse(Sym,S0,S1),
    parseSF(Syms,S1,S).

The predicate parse/3 recognizes strings generated by a single grammar symbol; the first clause recognizes terminal symbols directly and the second clause uses parseSF/3 to recognize strings generated by the sentential form that makes up the body of a rule for a nonterminal. parseSF/3 simply maps parse/3 across the sequence of grammar symbols in its sentential form argument.

Were we not to add the table declaration, we would get a recursive descent recognizer. But with the table declaration, we get the Earley-type recognizer of XSB as described above. The tabling reflects right through the programmed recognizer.

Next we add look-ahead to this recognizer by computing FIRST sets and making calls only when the next symbols are in the first set of the sentential form to be processed. This will give us an LL(k)-like recognizer. Hovever, the form of FIRST we need here is slightly different from the one above. Here we want to include the context in which a First set is computed. For example, we may want to compute the FIRST_2 set of a symbol N, but N only generates one symbol. The definition of first above would return a list of length one for that symbol. Here we want to take into account the context in which N is used. For example, we may know that in the current context N is immediately followed by another nonterminal M, and we know the FIRST of M. Then we can compute the FIRST_2 of N in the following context of M by extending out the one symbol in first of N with symbols in FIRST of M.

The following definition of firstK/3 computes such first sets.

:- table firstK/3.

% firstK(+SF,+Follow,-First), where K = length(Follow)
firstK([],Follow,Follow).
firstK([Sym|SF],Follow,First) :-
    firstK(SF,Follow,SymFollow),
    ((Sym ==> Body)
     ->   firstK(Body,SymFollow,First)
     ;    First = [Sym|FirstTail]
          append(FirstTail,[_],SymFollow),
    ).

The predicate firstK/3 takes a sentential form SF, a follow string Follow, and returns in First a first string of SF in the context of the follow string. The value of k is taken to be the length of the list Follow; this will be the length of the look-ahead.

We can now extend our parser to have a top-down look-ahead component, similar to an LL(k) recognizer:

TEST THIS SUCKER!
:- table parseLL/4.    % without this, it is an LL(k)-like parser,
                        % but with it, what is it???

% parseLL(Sym,Follow,S0,S) if symbol Sym generates the string from S0 to S.
parseLL(Sym,Follow,S0,S) :-
    (Sym ==> Body)
     -> firstK(Body,Follow,First),
        next_str(First,S0),        % do the look-ahead, continuing only if OK
        parseLLSF(Body,Follow,S0,S)
     ;  word(S0,Sym,S).

% parseLLSF(SF,Follow,S0,S) if sentential form SF generates the string from S0 to S.
parseLLSF([],_Follow,S,S).
parseLLSF([Sym|Syms],Follow,S0,S) :-
    firstK(Syms,Follow,SymFollow),
    parseLL(Sym,SymFollow,S0,S1),
    parseLLSF(Syms,Follow,S1,S).

next_str([],_).
next_str(['$'|_],S) :- \+ word(S,_,_).  % $end of string
next_str([Sym|Syms],S) :- word(S,Sym,S1),next_str(Syms,S1).

The predicate parseLL/4 recognizes a string generated by a grammar symbol, using lookahead. The condition tests whether Sym is a nonterminal, and if it is and can be rewritten as Body, it checks to see whether the next symbols in the input string belong to the first set of the body of that rule. If not, it needn't process that rule because it cannot succeed. For an LL(k) grammar, only one rule will ever apply; the others all will be filtered out by this lookahead. So in this case a rule is never tried unless it is the only rule that might lead to a parse. If the symbol is a terminal, it simply checks to see whether the symbol matches the input.

parseLLSF/4 maps parseLL/4 across a sequence of grammar symbols in a sentential form. It uses firstK/3 to compute the follow strings of a symbol, which are needed in parseLL to compute the first strings in the correct context.

In this LL(k)-like parser, we tested that the next symbols were in the first set just before we called parseLLSF. We could also check the lookahead just before returning. This gives us an LR(k)-like parser, as follows:

% parseLR with tabling to get an lr(k)-like algorithm.  

:- table parse/4.
parseLR(Sym,Follow,Str0,Str) :-
    word(Str0,Sym,Str),
    next_str(Follow,Str).
parseLR(Sym,Follow,Str0,Str) :-
    (Sym ==> RB),
    parseLRSF(RB,Follow,Str0,Str).

parseLRSF([],Follow,Str,Str) :- 
    next_str(Follow,Str).
parseLRSF([Sym|SF],Follow,Str0,Str) :- 
    firstK(SF,Follow,SymFollow),
    parseLR(Sym,SymFollow,Str0,Str1),
    parseLRSF(SF,Follow,Str1,Str).

For this parser, we compute a follow string for a sentential form, and only return from parsing that sentential form if the next input symbols match that follow string. For an LR(k) grammar, the parser will not fail back over a successful return (unless the entire string is rejected.) This allows the parser to work in linear time.

The above program was written as a Prolog program, but we can write the identical program as a DCG and have the DCG transformation put in the string variables, as follows:

:- table parseLR/4.
parseLR(Sym,Follow) --> [Sym], look(Follow).
parseLR(Sym,Follow) -->
    {(Sym ==> RHS)},
    parseLRSF(RHS,Follow).

parseLRSF([],Follow) --> look(Follow).
parseLRSF([Sym|SF],Follow) -->
    {firstK(SF,Follow,SymFollow)},
    parseLR(Sym,SymFollow),
    parseLRSF(SF,Follow).

look([]) --> [].
look([Word|Words]), [Word] --> [Word], look(Words).

This recognizer differs from an LR(k) recognizer in that it computes the look-ahead strings as needed. Also it processes each look-ahead string separately; i.e., Follow is a single string, not a set of strings. In an LR(k) recognizer, the lookahead tables are computed once and stored.

Next: Parsing of Context Sensitive Up: Grammars Previous: Computing First Sets of

David S. Warren
1999-07-31