(This section involves a more advanced topic in XSB programming, metainterpretation. It may be helpful to read the later section on metainterpreters if the going gets tough here.)
As discussed above, parsing context-free grammars with tabling results
in an Earley-type parser. This is the parser we get when we write
DCGs. We can also write an XSB program that takes a grammar as input
(defined in a database predicate as in the example of first/3
)
and a string (defined by the database word/3
predicate) and
succeeds if the grammar accepts the string. With such processing we
can compute and use the FIRST sets to make the grammar processing
more deterministic. This is similar to what is done in LL(k) and
LR(k) parsing, but there the emphasis is on compile-time analysis and
complete determinacy. Here the approach is more interpretive and
supports nondeterminacy. However, if the grammars are indeed of the
appropriate form (LL(k) or LR(k)), the corresponding interpreters
presented here will have linear complexity.
It is very easy to write a simple context-free grammar parser in XSB.
Again, we assume that the grammar is stored in facts of the form
NT ==> Body
where NT
is a nonterminal symbol and
Body
is a list of terminals and nonterminals.
:- table parse/3. % parse(Sym,S0,S) if symbol Sym generates the string from S0 to S. parse(Sym,S0,S) :- word(S0,Sym,S). parse(Sym,S0,S) :- (Sym ==> Body), parseSF(Body,S0,S). % parseSF(SF,S0,S) if sentential form SF generates the string from S0 to S. parseSF([],S,S). parseSF([Sym|Syms],S0,S) :- parse(Sym,S0,S1), parseSF(Syms,S1,S).
The predicate parse/3
recognizes strings generated by a single
grammar symbol; the first clause recognizes terminal symbols directly
and the second clause uses parseSF/3
to recognize strings
generated by the sentential form that makes up the body of a rule for
a nonterminal. parseSF/3
simply maps parse/3
across the
sequence of grammar symbols in its sentential form argument.
Were we not to add the table declaration, we would get a recursive descent recognizer. But with the table declaration, we get the Earley-type recognizer of XSB as described above. The tabling reflects right through the programmed recognizer.
Next we add look-ahead to this recognizer by computing FIRST
sets and making calls only when the next symbols are in the first set
of the sentential form to be processed. This will give us an
LL(k)-like recognizer. Hovever, the form of FIRST
we need here
is slightly different from the one above. Here we want to include the
context in which a First
set is computed. For example, we may
want to compute the FIRST_2
set of a symbol N
, but
N
only generates one symbol. The definition of first
above would return a list of length one for that symbol. Here we want
to take into account the context in which N
is used. For
example, we may know that in the current context N
is
immediately followed by another nonterminal M
, and we know the
FIRST
of M
. Then we can compute the FIRST_2
of
N
in the following context of M
by extending out the one
symbol in first
of N
with symbols in FIRST
of
M
.
The following definition of firstK/3
computes such first sets.
:- table firstK/3. % firstK(+SF,+Follow,-First), where K = length(Follow) firstK([],Follow,Follow). firstK([Sym|SF],Follow,First) :- firstK(SF,Follow,SymFollow), ((Sym ==> Body) -> firstK(Body,SymFollow,First) ; First = [Sym|FirstTail] append(FirstTail,[_],SymFollow), ).
The predicate firstK/3
takes a sentential form SF
, a
follow string Follow
, and returns in First
a first
string of SF
in the context of the follow string. The value of
k is taken to be the length of the list Follow
; this will be
the length of the look-ahead.
We can now extend our parser to have a top-down look-ahead component, similar to an LL(k) recognizer:
TEST THIS SUCKER! :- table parseLL/4. % without this, it is an LL(k)-like parser, % but with it, what is it??? % parseLL(Sym,Follow,S0,S) if symbol Sym generates the string from S0 to S. parseLL(Sym,Follow,S0,S) :- (Sym ==> Body) -> firstK(Body,Follow,First), next_str(First,S0), % do the look-ahead, continuing only if OK parseLLSF(Body,Follow,S0,S) ; word(S0,Sym,S). % parseLLSF(SF,Follow,S0,S) if sentential form SF generates the string from S0 to S. parseLLSF([],_Follow,S,S). parseLLSF([Sym|Syms],Follow,S0,S) :- firstK(Syms,Follow,SymFollow), parseLL(Sym,SymFollow,S0,S1), parseLLSF(Syms,Follow,S1,S). next_str([],_). next_str(['$'|_],S) :- \+ word(S,_,_). % $end of string next_str([Sym|Syms],S) :- word(S,Sym,S1),next_str(Syms,S1).
The predicate parseLL/4
recognizes a string generated by a
grammar symbol, using lookahead. The condition tests whether
Sym
is a nonterminal, and if it is and can be rewritten as
Body
, it checks to see whether the next symbols in the input
string belong to the first set of the body of that rule. If not, it
needn't process that rule because it cannot succeed. For an LL(k)
grammar, only one rule will ever apply; the others all will be
filtered out by this lookahead. So in this case a rule is never tried
unless it is the only rule that might lead to a parse. If the symbol
is a terminal, it simply checks to see whether the symbol matches the
input.
parseLLSF/4
maps parseLL/4
across a sequence of grammar
symbols in a sentential form. It uses firstK/3
to compute the
follow strings of a symbol, which are needed in parseLL
to
compute the first strings in the correct context.
In this LL(k)-like parser, we tested that the next symbols were in the
first set just before we called parseLLSF
. We could also check
the lookahead just before returning. This gives us an LR(k)-like
parser, as follows:
% parseLR with tabling to get an lr(k)-like algorithm. :- table parse/4. parseLR(Sym,Follow,Str0,Str) :- word(Str0,Sym,Str), next_str(Follow,Str). parseLR(Sym,Follow,Str0,Str) :- (Sym ==> RB), parseLRSF(RB,Follow,Str0,Str). parseLRSF([],Follow,Str,Str) :- next_str(Follow,Str). parseLRSF([Sym|SF],Follow,Str0,Str) :- firstK(SF,Follow,SymFollow), parseLR(Sym,SymFollow,Str0,Str1), parseLRSF(SF,Follow,Str1,Str).
For this parser, we compute a follow string for a sentential form, and only return from parsing that sentential form if the next input symbols match that follow string. For an LR(k) grammar, the parser will not fail back over a successful return (unless the entire string is rejected.) This allows the parser to work in linear time.
The above program was written as a Prolog program, but we can write the identical program as a DCG and have the DCG transformation put in the string variables, as follows:
:- table parseLR/4. parseLR(Sym,Follow) --> [Sym], look(Follow). parseLR(Sym,Follow) --> {(Sym ==> RHS)}, parseLRSF(RHS,Follow). parseLRSF([],Follow) --> look(Follow). parseLRSF([Sym|SF],Follow) --> {firstK(SF,Follow,SymFollow)}, parseLR(Sym,SymFollow), parseLRSF(SF,Follow). look([]) --> []. look([Word|Words]), [Word] --> [Word], look(Words).
This recognizer differs from an LR(k) recognizer in that it computes
the look-ahead strings as needed. Also it processes each look-ahead
string separately; i.e., Follow
is a single string, not a set
of strings. In an LR(k) recognizer, the lookahead tables are computed
once and stored.