CSE 304/504: Compiler Design

Spring 2016

Homework #3, AST Construction

Due: Mar. 23


Contents


Read the following description carefully before starting on the assignment. Also, go through the brief overview of the project path that describes the steps involved. A list of what materials are available and what you are expected to submit are also given.

Homework Description

This assignment expects you to build a module to construct ASTs for Decaf programs. Recall the description of Decaf's syntax is given in its manual. Your AST construction will begin with the PLY-based lexical analyzer and parser you developed for HW2. Overall, your AST builder will take a single command-line argument, which is a file name; if that file contains a syntactically valid Decaf program, then it will construct an AST for that program. As an AST is being constructed, you will perform additional syntax checks (defined later in this document). Upon successful construction of the AST, it will then print the AST to standard output in a reasonable form (outlined later in this document). The AST itself should follow the structure described below.

  1. AST Structure
  2. Additional Error Checks
  3. Printing the AST
  4. Initialization
  5. CSE 504 vs. CSE 304

AST Structure:

The core of the AST is a set of symbol tables, and a set of tree-structured data that links these tables. For Decaf, the top-level table is a Class Table that holds information about each class in the input program.

Class Table: Contents

The class table is a collection of records (you may implement this in any appropriate way: dictionaries, tuples etc.). Each record in the table holds information about a class. The necessary information for a class are: Note that Decaf does not have nested class definitions, so the above information is complete for each class.

The information about individual constructors, methods and fields themselves are stored as record structures with the following information:

Constructor record:

Method record: Field record: You may find it convenient to store the set of all constructor records in a separate constructors table; method records in a separate methods table; and field records in a separate fields table.

Variable Table: Contents

Each constructor/method is associated with a table with information on all variables (formal parameters or local variables) declared in that constructor/method. Each variable holds the following information:

Type Record: Contents

Each type is either an array type are an elementary type. Elementary types, in turn, may be one of built-in types: int, float, boolean, or string; or a user-defined type, which is a class name.

An array type is parameterized by its "base type", which is another type.

For instance:

Note: we may modify the contents of type records in later assignments. The content specified about is sufficient for this homework.

Statement Record: Contents

There are eight kinds of statements, each with its own contents:
  1. If-stmt: has three pieces of information:
  2. While-stmt: has two pieces of information:
  3. For-stmt: has four pieces of information:
  4. Return-stmt: has one optional piece of information:
  5. Expr-stmt: has one piece of information:
  6. Block-stmt: has one piece of information:
  7. Break-stmt: representing break.
  8. Continue-stmt: representing continue.
  9. Skip-stmt: representing an empty statement (as in an empty else part of an "if" statement.
Each statement record additionally contains the line number range in the input corresponding to it. (This information may be used in later assignments for signalling additional errors/warnings).

Expression Record: Contents

There are eleven kinds of expressions, each with its own contents:
  1. Constant-expression, which, in turn, is one of the following six forms:
  2. Var-expression: has one piece of information: the id of the referenced variable. Note that if there are multiple variables of the same name defined in nested blocks, the scope rules (defined later) specify which if those variables will be the referenced one.
  3. Unary-expression: has two pieces of information: the operand (an expression) and the unary operator. Unary operators are either uminus or neg. Note that the unary operator "+" in the concrete syntax has no effect. So an expression of the form "+25" will be represented as though it was simply "25".
  4. Binary-expression: has three pieces of information: the two operands (both expressions themselves) and the binary operator. Binary operators are one of add, sub, mul, div, and, or, eq, neq, lt, leq, gt, and geq.
  5. Assign-expression: has two pieces of information: the left- and right- hand sides of the assignment (both expressions).
  6. Auto-expression: has three pieces of information to represent auto-increment and auto-decrement expressions (e.g. "x++"). The three pieces of information are the operand (e.g. "x") which is an expression, whether the operation is an auto-increment (e.g. x++) or auto-decrement (e.g x--), and whether the operation is post (e.g. x++) or pre (e.g. ++x).
  7. Field-access-expression: has two pieces of information to represent p.x: the base (e.g. p), which is an expression, and the field name (e.g. x), which is a string.
  8. Method-call-expression: has three pieces of information to represent p.f(x,y): the base (e.g. p), which is an expression, the method name (e.g. x), which is a string, and a sequence of expressions representing the arguments to the method call (e.g. x,y). Note that the argument sequence may be empty.
  9. New-object-expression: has two pieces of information for creating a new object as in new a(i, x): the base class name (e.g. a), and the sequence of expressions representing the arguments to the constructor (e.g. i, x). Note that the argument sequence may be empty.
  10. This-expression: to denote this.
  11. Super-expression: to denote super.
  12. Class-reference-expression: with the referred class name as its information. This expression is used to denote the value of literal class names.
  13. Array-access-expression:504 only has two pieces of information to represent a[i]: the base (e.g. a), which is an expression, and the index value (e.g. i), which is also an expression.
  14. New-array-expression:504 only has two pieces of information for creating a new array as in new int[25][n]: the base type (e.g. int) and a sequence of array dimensions (e.g. 25, n). Note that extra empty dimension specifies array types as base type. For instance, new int[25][][] should be represented by an new array expression of base type array(array(int)), with array dimension 25 (a singleton sequence).
Each expression record additionally contains the line number range in the input corresponding to it. (This information may be used in later assignments for signalling additional errors/warnings).

Scopes of names

Decaf is statically, lexically scoped. Programs have different kinds of named entities: classes, methods, fields and variables. The access rules for these names differ slightly, as explained below. To resolve the identity of variables given these scope rules, you can use the following procedure. Note that a variable is first encountered as a part of field_access (the option without a preceding "."). If this name is declared as a variable in the current block or any enclosing block (up to the formal parameters of the current method), identify it with its most recent declaration. Otherwise, this could be a reference to a (previously defined) class. If there is a class with the same name, then the expression is a class reference expression. If not, the subsequent processing differs between CSE 304 and 504, as follows:

Additional error checks:

While constructing an AST, you may be able to detect the following errors related to multiple or confounding declarations:
  1. Classes: Each class in a Decaf program must have a distinct name. Hence, it is an error to have two classes with the same name.
  2. Methods: Since Decaf is an object-oriented language that permits method overloading, it is legal to define multiple methods with the same name in a given class. (In the type checking assignment that will follow, we will refine this to permit multiple methods only of their type signatures are different, but we will be liberal now).
  3. Fields: Each field declared within a class must have a distinct name. Note that, given the class structure, it is possible for a class as well as its super-class to have fields with same names declared. Hence it is only an error if two fields declared in the same class have same names.
  4. Variables: Each variable declared within a block must have a distinct name. Following the definition of scopes, the names of formal parameters of a method must be distinct from the names of variables declared in the top-most block of the method's body.
Your AST builder is expected to detect and report such errors. As in HW2, your score will also depend how many such errors you can detect, and how informative your error messages are.

Printing the AST

Class table

Each entry of the class table must be printed out, with the following information in the specified order:
  1. Class name: < string with name of class >
  2. Super class name: < Name of the super class (empty if none >
  3. Fields: < The set of fields defined in the class, in format defined below >
  4. Constructors: <: The set of constructors defined in the class, in format defined below >
  5. Methods: < The set of methods defined in the class, in format defined below >
Entries of the class table will be separated by a line of "---"s.

Field record

Each field record must be printed in a single line, beginning with "FIELD: ", with a comma separated list of the following information, in the specified order:
  1. Field Id
  2. Field Name
  3. Containing Class
  4. Field visibility
  5. Field applicability
  6. Field Type (in format defined below)

Constructor record

Each constructor record must be printed in four parts:
  1. A single line beginning with "CONSTRUCTOR: " and with a comma-separated list of the following information in the specified order:
    1. Constructor Id
    2. Constructor visibility
  2. A single line, starting with "Constructor parameters:" with a comma-separated list of ids (corresponding to the constructor's formal parameters) in the variable table (below).
  3. A set of lines starting with "Variable Table:", with each of the following lines containing information about a variable in the constructor (format defined later).
  4. A set of lines, starting with "Constructor Body:" with each of the following lines containing information about the statements in the constructor's body (format defined later).

Method record

Each method record must be printed in four parts:
  1. A single line beginning with "METHOD: " and with a comma-separated list of the following information in the specified order:
    1. Method Id
    2. Method Name
    3. Containing class
    4. Method visibility
    5. Method applicability
    6. Return type in format defined below)
  2. A single line, starting with "Method parameters:" with a comma-separated list of ids (corresponding to the methods's formal parameters) in the variable table (below).
  3. A set of lines starting with "Variable Table:", with each of the following lines containing information about a variable in the method (format defined later).
  4. A set of lines, starting with "Method Body:" with each of the following lines containing information about the statements in the method's body.

Type record

Each type must be printed in one of the following ways:

Statements

Each statement must be printed starting on a new line, containing the statement's kind (e.g. "For"), followed by an open parenthesis ("("), then by a comma-separated print of its component information, and finally a close parenthesis (")"). A sequence of statements (as in a block statement) must be printed out as a comma-separated list of statements.

Expressions

Each expression must be printed starting with the expessions's kind (e.g. "Binary"), followed by an open parenthesis ("("), then by a comma-separated print of its component information, and finally a close parenthesis (")"). A sequence of expressions (as in the list of arguments to a method call) must be printed out as a comma-separated list of expressions.

EXAMPLE

Input Decaf Program:
class A {
   int x;
   A () {
     this.x = 0;
   }
   int f() {
      return this.x + 1;
   }
   public int g() {
      int i;
      i = this.f();
      i++;
      return i;
   }
}
class B extends A {
   int y;
   public A s;
   B () {
     this.y = 2;
     this.s = new A();
   }
   public int f(int k) {
      return super.f() + k;
   }
}
AST (printed out):
-----------------------------------------------------------------------------
Class Name: A
Superclass Name: 
Fields:
FIELD 1, x, A, private, instance, int
Constructors:
CONSTRUCTOR: 1, private
Constructor Parameters: 
Variable Table:
Constructor Body:
Block([
Expr( Assign(Field-access(This, x), Constant(Integer-constant(0))) )
])
Methods:
METHOD: 1, f, A, private, instance, int
Method Parameters: 
Variable Table:
Method Body:
Block([
Return( Binary(add, Field-access(This, x), Constant(Integer-constant(1))) )
])
METHOD: 2, g, A, private, instance, int
Method Parameters: 
Variable Table:
VARIABLE 1, i, local, int
Method Body:
Block([
Expr( Assign(Variable(1), Method-call(This, f, [])) )
,  Expr( Auto(Variable(1), inc, post) )
,  Return( Variable(1) )
])
-----------------------------------------------------------------------------
Class Name: B
Superclass Name: A
Fields:
FIELD 2, y, B, private, instance, int
FIELD 3, s, B, private, instance, user(A)
Constructors:
CONSTRUCTOR: 2, private
Constructor Parameters: 
Variable Table:
Constructor Body:
Block([
Expr( Assign(Field-access(This, y), Constant(Integer-constant(2))) )
,  Expr( Assign(Field-access(This, s), New-object(A, [])) )
])
Methods:
METHOD: 3, f, B, private, instance, int
Method Parameters: 1
Variable Table:
VARIABLE 1, k, formal, int
Method Body:
Block([
Return( Binary(add, Method-call(Super, f, []), Variable(1)) )
])
-----------------------------------------------------------------------------

Initialization

Decaf manual says that there are two standard objects in the language: Out and In. We can treat these two as predefined classes. For AST construction, you should start out with a table that is already populated with the information about these two classes:

CSE 504 vs CSE 304

Project Path

  1. Start with a working PLY parser for Decaf.
  2. Create a Python class to hold the information for each Decaf class. Write a stub of a print method, and refine this later. Ensure you have fields in the this Python class to denote class and super class names, collection of fields, constructors and methods. In parser, add actions to fill out the class and super class names (leave the rest empty initially). Create a table to store these class records, and initialize appropriately. In parser, add actions to fill this table. In the main driver (that calls the parser), iterate over this table and print its contents.
  3. Add actions to synthesize information on fields, and to place this information in the respective class records.
  4. Add actions for simple expressions (constants, field access with "this", assignments, binary operations), and simple statements (expression statements, return statements, blocks).
  5. Proceed by either expanding on the expressions (to more complex ones such as method calls, new object creation etc.) and statements (to more complex ones such as while loops). Hint: keep array expressions to the end.

Deliverables

Your AST checker for Decaf should be organized into the following files:
  1. Lexer:    decaflexer.py, PLY/lex scanner specification file. (unchanged from HW2)
  2. Parser:    decafparser.py, PLY/yacc parser specification file. (modified from HW2 with actions to build AST)
  3. AST:    ast.py, table and class definitions for Decaf's AST.
  4. Main Top-Level:    decafch.py, containing the main python function to put together the parser and lexer, take input from given file, etc., and perform syntax checking.
  5. Documentation:    README.txt which documents the contents of all the other files.
We will be looking for a submission with the five files specified above, with the specified names. Deviating from these specifications may entail loss of points.

Grading

This maximum score on this assignment will be 40, divided as follows: 4 points for correct ASTs for simple expressions (those that do not have other expressions as parts), 6 points for other expressions; 4 points for types; 6 points for statements; 6 points for methods and constructors; 6 points for correct resolution of scopes of variables; 3 points for classes; 3 points for error detection; and 2 points for documentation/program readability.

As usual, complicated programs may not get the full grade unless accompanied by comments documenting it.

Submission

Submit your homework using Git, as done for HW1, before midnight of Wed., March 23.

We will assume that HW3 will be submitted by the same teams that submitted HW3. If there is any change in the team structure, you must inform the instructor the Monday before the submission deadline (Mon. March 21) and get the submission repositories for the new teams set up.

You may submit the homework multiple times; the last submission will be graded.

Change Log


C. R. Ramakrishnan
Last modified: Wed Mar 16 23:32:22 EDT 2016