CSE 304/504: Compiler Design

Spring 2016

Homework #5, Intermediate Code Generation

Due: April 22


Contents


Read the following description carefully before starting on the assignment. Also, go through the brief overview of the project path that describes the steps involved. A list of what materials are available and what you are expected to submit are also given.

Homework Description

In this assignment, you will complete intermediate code generation for Decaf. Recall the description of Decaf's syntax is given in its manual, the AST you built in HW #3, and the type checker you built in HW #4. This assignment will be based on the AST obtained after type checking and name resolution. From the set of classes in the AST, you will generate intermediate code for a register machine.

As in the past, you will produce a compiler front end that will take a single command-line argument, which is a file name; if that file contains a syntactically valid Decaf program, then it will construct an AST, perform type checking and name resolution. If there are no errors until this stage, the compiler than generates code for the target register machine, and prints the code out (see section on printing).

  1. Abstract Register Machine
  2. Static Single Assignment Form(CSE 504 only)
  3. Storage Allocation
  4. Simplifications
  5. Printing Translated Code
  6. CSE 504 vs. CSE 304

Abstract Register Machine:

The abstract register machine has

Abstract Machine Instructions:

In the following, r represents an argument or temporary register; i represents an literal integer value, f represents a literal floating point value, l represents a symbolic label.
Move:

Integer Arithmetic:
The following instructions assume that the source registers contain integer values, and perform integer arithmetic operations:

Floating Point Arithmetic:
Analogous to the above instructions, there is a set of floating point arithmetic instructions (prefixed with "f" instead of "i") that assume that the source registers contain floating point values, and perform floating point arithmetic operations. The exception is imod which has no floating point analog.

Conversion:
Branches:
Heap Manipulation:
Procedure Call/Return:

Static Single Assignment Form

(CSE 504 Only)

The translated abstract program should follow the static single assignment form, with the following structure. That is, no register can be a target in more than one instruction in a function.

Blocks

The abstract program should be broken into a set of basic blocks. Each basic block should have a unique label.

φ-Nodes

When control flows from two or more basic blocks merge, resulting φ-node will be represented in the program by a sequence of φ-pseudo instructions of the following form:
phi tn, < sequence of predecessor blocks>, < sequence of registers in predecessor blocks>
where tn is the target register.

For instance, consider registers t3 being assigned in two different blocks, labelled L1 and L2. Let t3 be used in a block labeled L3 that is a successor of L1 and L2. One way to make turn this into SSA form is to rename t3 in its definition (and subsequent uses) in blocks L1 and L2. Let's say we rename t3 to t1 in L1 and t2 in L2. We will then introduce a φ-node in L3 to merge these two values:

    phi t3, [L1, L2], [t1, t2]
Based on SSA form, this instruction says that t3 takes its value from t1 if we came from block L1; and t2 if we came from L2.

The SSA form is in preparation for the next homework where data flow analysis, register allocation and optimization will exploit it.

Storage Allocation

Parameters and Locals:

Each formal parameter value is in a distinct argument register. Assume that parameters are associated with argument registers in order. For static methods, the value of the first parameter is in a0, second is in a1 and so on. For instance methods, the implicit object (accessed using "this") is in a0, the first parameter value in a1, the second in a2 and so on.

For the purposes of this homework, you may assume that static methods do not refer to "this".

Each local variable or an intermediate value computed when evaluating an expression is kept in a temporary register.

Objects and Arrays:

Objects and arrays are allocated on heap. The fields of an object are laid out in consecutive locations on the heap; elements of an array are on consecutive locations as well.

Consider an object a of class C. Object a contains a unique location for all non-static fields in class C as well as all its superclasses. The object layout is as follows. Let B be a base class (with no super classes), with two non-static fields w and x. Then w and x are given unique offsets, starting at 0, so that the fields can be accessed from the base address of B objects. Let the offsets of w and x be 0 and 1 respectively. Whether the fields x and y are public or not does not matter. Let C be derived from B and have an additional non-static field y. Fields defined in C are given offsets that follow all fields in B (regardless of whether they are public or not). Hence y should be at offset 2. Moving further, let D be derived from C with two new fields, one called z, and another coincidentally called x. D's fields will have offsets following all fields of C. Hence field z will be at offset 3 and field x of D will be at offset 4. For computing the offsets and determining the layout, it does not matter whether a field is private or public.

Static fields:

Static fields of all classes will be allocated on the heap, in a special area called the "static area". This area will be allocated before the abstract machine program is run, and its allocation is specified by an abstract machine directive ".static_data n", where n is the total number of static fields to allocate.

The abstract machine has a special-purpose register called sap, the static area pointer, that holds the reference to the base of static area. All static fields should be allocated in the static area, with each field having a unique offset. It does not matter if the field is public or private for the purposes of allocation.

For instance, let there be two classes B, a base class, and C, a class derived from B. Let B have two static fields x and y defined in it; and C have two static fields y and z defined in it. Then the four fields will be allocated in the static area with directive ".static_data 4"; x and y of B will be at offsets 0 and 1, respectively; and y and z of C will be at offsets 2 and 3, respectively.

Continuing with the above example, let all the static fields in the example be public, and consider an assignment in a source program of the form "C.x = C.y + C.z;". The following abstract machine code will be a valid translation of the assignment:

   move_immed_i t0, 2   # Offset of y in C
   hload t1, sap, t0     # t1 = C.y
   move_immed_i t2, 3   # Offset of z in C
   hload t3, sap, t0     # t3 = C.y
   iadd  t4, t1, t3
   move_immed_i t5, 0   # Offset of x in C (which is x in B)
   hstore sap, t5, t4
In the above, "#..." is used to represent comments in order to explain the code.

Constructors:

Each constructor will be compiled into an initializer function.

For instance, consider a constructor of the form A(int i). This constructor will be compiled to an initializer function that takes a reference to a newly created object as a0, the value of formal parameter i as a1, executes the code corresponding to the body of the constructor. The first instruction in this sequence will be labeled as C_%d, where %d is the unique id of the constructor. Recall that constructor ids are unique across the whole program. Initializer functions return nothing.

Code for new object creation will have two parts. The first part should create an object of the correct size on the heap (using the halloc instruction). The second should call the initializer method with appropriate arguments. For instance, let the instances of the above class A have 5 instance fields (note: this includes non-static fields of superclasses, if any). Let the id of the matching constructor be 7. Then the expression new A(2) may be translated to the following sequence of machine instructions:

   move_immed t0, 5
   halloc t0, 5         # allocate 5 cells for object
   move_immed_i t1, 2   # Actual argument
   save  a0             # save current argument registers
   save  a1
   save  t0             # save reference to new object
   move   a0, t0
   move   a1, t1
   call   C_7           # call initializer function
   restore t0           # restore original register values
   restore a1
   restore a0
Note: the above code is to illustrate the use of the initializer function; the saves are restores may be necessary only in certain contexts.

Simplifying Assumptions

Printing the Abstract Machine Instructions:

The output of your compiler will be a text file with the abstract machine program. The program shall be printed to a file as follows.

Print one instruction per line. Each instruction will be its opcode, followed by the sequence of arguments. Include at least one space between opcode and the arguments. Arguments shall be comma separated. The entire instruction--- opcode and all arguments--- should be placed in a single line; an instruction shall not spill over two lines.

Labels of instructions, if any shall be printed on a line of their own. Label of an instruction shall be followed by a ":". For instance, an "iadd" instruction with label "L12" will be printed as in:

L12:
   iadd t3, t1, t2
If an instruction has more than one label, each label shall be printed on a line of their own.

Abstract machine programs may have single-line comments, starting with "#" and ending at the end of line. Comments may appear at end of instructions or as a line of their own.

Program:

The translated program will be a (i) a set of functions corresponding to methods and initializer functions, and (ii) a single line static data directive. Each function will begin with the label of the function, followed by the instructions in the function body. Functions may be separated by one or more empty lines for readability.

CSE 504 vs CSE 304

Project Path

  1. Start with a working AST builder and type checker.
  2. Define an interface to create abstract machine instructions and construct the output program.
  3. Assume a single class with no fields (!!), no method invocations, constructors or field accesses; Write a code generator that works on this restricted subset of programs. You may want to work with even smaller subsets than these.
  4. Proceed by adding field accesses first, and finally method invocations and constructors.
  5. For CSE 504 students only: work on conversion to SSA form after all else is completed.

Deliverables

Your Decaf compiler should be organized into the following files:
  1. Lexer:    decaflexer.py, PLY/lex scanner specification file. (unchanged from HW2)
  2. Parser:    decafparser.py, PLY/yacc parser specification file. (unchanged from HW3)
  3. AST:    ast.py, table and class definitions for Decaf's AST (modified from HW3 to add type-related fields and definitions).
  4. Type Checker:    typecheck.py, definitions for evaluating the type constraints and for name resolution.

    Note: This may be folded into ast.py as long as readability is maintained.

  5. Code Generator:    codegen.py, definitions for generating code.

    Note: This may be folded into ast.py as long as readability is maintained.

  6. Abstract Machine:    absmc.py, definitions for the abstract machine, and for manipulating abstract programs (e.g. SSA construction, printing).
  7. Main Top-Level:    decafc.py, containing the main python function that ties the modules together, takes command line argument of input file name, and processes the Decaf program in that file.

    decafc takes a single command line argument, with the name of the Decaf program to be translated. For example, if the Decaf program is in fum.decaf, then the invocation will be as

           python decafc.py fum  
    or
           python decafc.py fum.decaf  
    The translated machine instructions will be written to a file with the same base name with ".ami" extension (fum.ami in the above example).

    Note the change of main top-level. It is no longer decafch; we are compiling not doing checking alone anymore.

  8. Documentation:    README.txt which documents the contents of all the other files.
We will be looking for a submission with the files specified above, with the specified names. Deviating from these specifications may entail loss of points.

Grading

This maximum score on this assignment will be 40 (45 for CSE 504 students), divided as follows: 2 points for code generation for simple expressions (those that do not have other expressions as parts), 4 points for other simple expressions (binary/unary/auto) 6 points for simple boolean expressions and short-circuit evaluation 8 points for method calls and returns; 12 points for field expressions, new object creation, and array expressions; 6 points for loops; 5 points for SSA form (for cse 504 only); and 2 points for documentation/program readability.

As usual, complicated programs may not get the full grade unless accompanied by comments documenting it.

Submission

Submit your homework using Git, as done for HW1, before midnight of the posted deadline.

We will assume that HW5 will be submitted by the same teams that submitted HW4. If there is any change in the team structure, you must inform the instructor the Monday before the submission deadline and get the submission repositories for the new teams set up.

You may submit the homework multiple times; the last submission will be graded.

Change Log


C. R. Ramakrishnan
Last modified: Sun Apr 17 21:34:31 EDT 2016