CSE 304/504: Compiler Design

Spring 2016

Homework #2

Due: Feb. 29 Mar. 2

Homework Description

This assignment expects you to build a lexical analyzer and parser for Decaf. The description of Decaf's syntax is given in its manual. Your lexical analyzer and parser will be built using PLY. In this assignment, you are expected to write a "syntax checker" that uses your lexical analyzer and parser. The syntax checker takes a single command-line argument, which is a file name, and determines whether or not that file contains a syntactically valid Decaf program. The syntax checker should simply output "Yes" if the program is syntactically valid, and reasonable error message(s) if the program has syntax errors.

CSE 504 vs CSE 304:

CSE 504 students are expected to build a syntax checker for the entire Decaf language. Their syntax checkers are also expected to identify multiple errors in a source program, and not just the first error.
CSE 304 students are expected to build a syntax checker for the following subset of Decaf:
- Simple string constants. String constants may themselves not contain the delimiting " character. Thus, a string constant begins a " and ends with a " with no " in between.
- Simple floating point constants. Floating point constants will be only of the first kind (page 2 of the manual): a decimal point `.' with a sequence of one or more digits on either side of it. That is, floating point constants will not use the exponent notation.
- No arrays! This subset will not support the declaration of array-valued fields and variables (page 3 of the manual), array expressions (array_access in page 6), and array creation (new_array in page 7).
Moreover, their syntax checkers are expected to identify only the first syntax error in a program.

Project Path

Read through Decaf manual, and make sure you understand the overall structure of the language.
Identify the set of tokens that will be recognized by your lexical analyzer. You may need to revise this set when you work on the later steps as well. It may be useful to begin with a slice of Decaf, e.g. a (rather useless) subset of the language with no constructor or method definitions, and build a lexical analyzer for the tokens in this subset. Moreover, you will find that definition of certain tokens will be significantly simpler than others; you can specify stubs for the difficult ones at first, and then come back to finish the job.
In its manual, Decaf's syntax is given in Extended Backus-Naur Form (EBNF). Specifically, grammar rules in EBNF are of the form

L ::= r₁ | r₂ | ... | r_n

where L is a non-terminal symbol (left hand side of a production) and each r_i is a (possibly empty) right hand side of a production. In the standard notation of grammars described in class, the right hand side of a production is a sequence of terminal and non-terminal symbols. In EBNF, the right hand side may be written using a regular expression-like notation as well. For instance, the following EBNF rule represents a string of one or more a's separated by b's (e.g. ababa):

S ::= a (b a)*

The single rule above can be represented by the set of productions in standard notation below:

S → a T
T → ε
T → b a T

Your task in this assignment is to come up with a PLY specification for Decaf syntax specified in EBNF. PLY/yacc generates an LALR(1) parser, which means that it can handle only a subset (but a large subset) of context free grammar specifications. Hence you will need to make sure that the expanded grammar can be handled by PLY. If your specification is ambiguous, make sure that you also specify disambiguating rules unless you are sure that PLY's default behaviour is what you want.
As you construct the PLY/yacc specification, you may revise the set of tokens you defined in step 2.
Once your token set is nearly completely determined, complete the construction of the lexical analyzer using PLY/lex.
Interface your lexical analyzer properly with your parser. Write glue code to take input from a specified file name (argument to your syntax checker), and make your lexical analyzer generate tokens from the contents of that file.
Wrap up by adding code to write suitable messages in case of scanning or parsing errors. Make sure the error messages refer to appropriate line/character position in the input.

Hints:

NOTE: To avoid running into difficulties with conflicts in the grammar, you should add the grammar rules gradually -- as few at a time as possible. Then run through PLY/yacc to ensure that you do not run into any conflicts. If you do run into conflicts this way, then the problem is most likely to be in the rules that you just added.
You may start with only a part of the grammar, as long as the portion you take is self-contained, to gain experience with the construction process. For example, you may start with the top-level structure of the language, going through class definitions, field definitions and method definitions in that order.
You may also build a complete syntax checker for a different small grammar (e.g. the set of valid propositional boolean formulae) to figure out how the plumbing works, and then build the checker for Decaf.
Make sure you can read and understand the parser.out file produced by PLY/yacc. This file contains the productions, states and transitions used by the parser. This file will also tell you where the conflicts are, i.e., which states and which productions. From this information, you should be able to make out what the offending grammar rules are and how they should be modified. If, after several attempts, you are unable to resolve the problem, contact the Instructor or the TA.

Deliverables

Your syntax checker for Decaf should be organized into the following files:

Lexer: decaflexer.py, PLY/lex scanner specification file.
Parser: decafparser.py, PLY/yacc parser specification file.
Main Top-Level: decafch.py, containing the main python function to put together the parser and lexer, take input from given file, etc., and perform syntax checking.
Documentation: README.txt which documents the contents of all the other files.

Your PLY/yacc-generated parser may have shift-reduce conflicts. Although in most cases such conflicts indicate coding errors, your may have verified that the conflicts and the way PLY/yacc resolves them are ok for your syntax checker. If this is the case, make sure you explain this in the README file.

We will be looking for a submission with the four files specified above, with the specified names. Deviating from these specifications may entail loss of points.

Grading

This maximum score on this assignment will be 20, divided as follows. 7 points for token specifications for lexical analysis; 7 points for grammar specifications; 2 points for the interfaces between lexical analyzer and parser, and between these two modules and your main syntax checker program; 3 points for quality of error messages; and 1 point for documentation/program readability.

As usual, complicated programs may not get the full grade unless accompanied by comments documenting it.

Submission

Submit your homework using Git, as done for HW1, before midnight of Mon., Feb 29.

We will assume that HW2 will be submitted by the same teams that submitted HW1. If there is any change in the team structure, you must inform the instructor the Monday before the submission deadline (Mon. Feb 22) and get the submission repositories for the new teams set up.

You may submit the homework multiple times; the last submission will be graded.

Change Log

Original posting on Mon Feb 15 22:13:15 EST 2016.

C. R. Ramakrishnan

Last modified: Mon Feb 15 22:46:22 EST 2016