CSE-304 Programming Assignment #1

Scanner (Lexical Analysis) Module

Read the following description carefully before starting on the assignment. Also, go through the brief overview of the project path that describes the steps involved. A list of what materials are available and what you are expected to submit is also given. Finally, make sure you understand and follow the submission requirements.


Description:

In this assignment, you will build the lexical analysis module for the Decaf compiler. You will use JavaCC, a scanner-generator tool, for this purpose. The lexical structure of Decaf is defined first, followed by a description of the interface to the parser (tokens and attributes).

Lexical Structure

We first describe the lexical structure of the Decaf programming language. Decaf is a subset of Java, and inherits most of the lexical features of Java. Decaf is case-sensitive; for example, for and For are treated as distinct lexical entities.

Comments and White Space

Decaf supports two styles of comments:
  1. Multi-line comments that begin with "/*" and end with "*/". These comments may not be nested.
  2. Single-line comments that start with "//" and terminate at the end of the line.
Comments may appear wherever a whitespace may appear; they are ignored by the lexical analyzer.
NOTE: Your lexical analyzer is expected to correctly process comments of type 2. You may also handle comments of type 1 for extra credit.
Whitespace characters (blanks, newlines and tabs) serve to separate tokens; otherwise they are ignored. Whitespace may not appear within any token except a string constant (see below).
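In a JavaCC specification, these rules are typically written as SKIP regular expression productions. The sketch below is one possible formulation, not a required one; the "//" pattern covers the required single-line case, and the last pattern is the extra-credit non-nested multi-line comment:

```
SKIP :
{
    " " | "\t" | "\n" | "\r"
  | < "//" (~["\n","\r"])* >                         /* single-line comment (required) */
  | < "/*" (~["*"])* "*" ("*" | ~["*","/"] (~["*"])* "*")* "/" >
                                              /* non-nested multi-line (extra credit) */
}
```

Because these productions are SKIP rather than TOKEN productions, the matched text is discarded and no token is emitted, which is exactly the behavior required for comments and whitespace.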

Reserved words

The following are reserved words.
boolean    break      class     continue  else       extends
false      float      for       if        int        new
null       private    public    return    static     super
this       true       void      while

Constants

There are three types of constants supported by Decaf: integer, floating point and string constants.

Integer constants consist of one or more digits (0 through 9).

Floating point constants contain a decimal point with a sequence of one or more digits on each side of it (e.g., 3.14159). (Note that we don't support floating point constants in scientific (mantissa-exponent) notation.)

String constants begin and end with a double quote (") and may contain any sequence of characters other than newlines. As in C, if the string itself contains a double quote, it is "escaped" with a backslash (\) as in the example string: "\"What?\" she exclaimed.". Similarly, if the string contains a backslash, that is escaped too (e.g., as in "\\").
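As a quick sanity check on these definitions, the three constant forms can be written as ordinary regular expressions. The sketch below uses Java's java.util.regex notation (not JavaCC's notation, which you will use in the actual assignment); the class and pattern names here are our own, not part of the assignment:

```java
import java.util.regex.Pattern;

public class ConstantPatterns {
    // Integer constants: one or more digits.
    static final Pattern INT_CONST = Pattern.compile("[0-9]+");
    // Floating point: digits required on BOTH sides of the decimal point, no exponent.
    static final Pattern FLOAT_CONST = Pattern.compile("[0-9]+\\.[0-9]+");
    // String constants: double quotes around escaped pairs (\" or \\) and
    // ordinary characters other than ", \ and newline.
    static final Pattern STRING_CONST =
        Pattern.compile("\"(\\\\[\"\\\\]|[^\"\\\\\n])*\"");

    public static void main(String[] args) {
        if (!INT_CONST.matcher("42").matches()) throw new AssertionError();
        if (!FLOAT_CONST.matcher("3.14159").matches()) throw new AssertionError();
        if (FLOAT_CONST.matcher(".5").matches()) throw new AssertionError();   // digits needed on both sides
        if (!STRING_CONST.matcher("\"\\\"What?\\\" she exclaimed.\"").matches())
            throw new AssertionError();
        if (STRING_CONST.matcher("\"unterminated").matches()) throw new AssertionError();
        System.out.println("all constant patterns behave as expected");
    }
}
```

Running this before translating the patterns into JavaCC notation is a cheap way to convince yourself that the escape handling is right.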

Identifiers

Letters denote the upper and lower case elements of the English alphabet (a through z and A through Z).

An identifier is a sequence of letters, digits and underscore (_), starting with a letter, and is not one of the reserved words.
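This definition can be sketched in Java as a regular expression plus a reserved-word check (again in java.util.regex notation rather than JavaCC's; the names below are our own):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

public class IdentifierCheck {
    // A letter first, then any mix of letters, digits, and underscores.
    static final Pattern ID = Pattern.compile("[A-Za-z][A-Za-z0-9_]*");
    static final Set<String> RESERVED = new HashSet<>(Arrays.asList(
        "boolean", "break", "class", "continue", "else", "extends",
        "false", "float", "for", "if", "int", "new",
        "null", "private", "public", "return", "static", "super",
        "this", "true", "void", "while"));

    static boolean isIdentifier(String s) {
        return ID.matcher(s).matches() && !RESERVED.contains(s);
    }

    public static void main(String[] args) {
        // Decaf is case-sensitive: "For" is an identifier, "for" is reserved.
        if (!isIdentifier("For") || isIdentifier("for")) throw new AssertionError();
        if (isIdentifier("_x") || !isIdentifier("x_1")) throw new AssertionError();
        System.out.println("identifier checks pass");
    }
}
```

In a JavaCC specification you would not need the explicit reserved-word set: when two token rules match the same longest string, the rule listed earlier in the file wins, so listing the keyword tokens before the identifier token achieves the same exclusion.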

Operators

The operators of Decaf are:
[    ]    (    )    {    } 
+    -    *    /    =    &&   ||   ! 
>    <    ==   !=   >=   <=                
,    .    ;
Clearly, the above operators form a small subset of the operators allowed in Java. However, most of the issues in compilation can be illustrated using the above subset.
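In JavaCC, each operator becomes a simple string-literal token. A sketch of a few of them (the full list of token names appears in the interface section below); note that JavaCC always prefers the longest match, so "==" is recognized as one token rather than two "=" tokens:

```
TOKEN :
{
    < TOK_PLUS : "+" >          | < TOK_MINUS : "-" >
  | < TOK_EQUAL : "=" >         | < TOK_EQUAL_EQUAL : "==" >
  | < TOK_GREATER : ">" >       | < TOK_GREATER_OR_EQUAL : ">=" >
  | < TOK_AND : "&&" >          | < TOK_OR : "||" >
  /* ... the remaining operators follow the same pattern ... */
}
```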

Parser/Driver Interface

Tokens

The tokens you should define for each of the above lexical entities are:
TOK_BOOLEAN      TOK_BREAK          TOK_CLASS        TOK_CONTINUE
TOK_ELSE         TOK_EXTENDS        TOK_FALSE        TOK_FLOAT
TOK_FOR          TOK_IF             TOK_INT          TOK_NEW
TOK_NULL         TOK_PRIVATE        TOK_PUBLIC       TOK_RETURN
TOK_STATIC       TOK_SUPER          TOK_THIS         TOK_TRUE
TOK_VOID         TOK_WHILE

TOK_INT_CONST   TOK_FLOAT_CONST   TOK_STRING_CONST 

TOK_ID 

TOK_OPEN_SQ_BRACKET                 TOK_CLOSE_SQ_BRACKET
TOK_OPEN_PAREN   TOK_CLOSE_PAREN    TOK_OPEN_BRACE     TOK_CLOSE_BRACE
TOK_PLUS         TOK_MINUS          TOK_MULTIPLY       TOK_DIVIDE 
TOK_EQUAL        TOK_AND            TOK_OR             TOK_NOT
TOK_GREATER      TOK_LESSER         TOK_EQUAL_EQUAL    TOK_NOT_EQUAL
TOK_GREATER_OR_EQUAL                TOK_LESSER_OR_EQUAL
TOK_COMMA        TOK_DOT            TOK_SEMICOLON

Comments and whitespace are stripped by the lexical analyzer, i.e., no tokens are emitted.

The token names are self-explanatory. If you have any questions about this, please come to my office hours.

Kinds and Attributes

Each token has a "kind" and an attribute ("image"), accessible as described in the JavaCC documentation. The "kind" is a unique integer value for the token; the "image" is the corresponding lexeme.

Error Handling (Extra Credit)

Whenever the current input character cannot be used to complete a token, the lexical analyzer must signal an error with sufficient information about the offending character.

Note that strings cannot contain newline characters. A sequence of characters that starts with a double quote (") and contains a newline before the next double quote is also an error that must be signaled.
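One possible way to express this in JavaCC (a sketch only; both token names and the reporting strategy are our own choices) is to add two lowest-priority rules: one matching a string that is opened but not closed before a newline, and one catching any single character no other rule matches. The driver can then report the error using the matched token's beginLine and beginColumn fields:

```
TOKEN :
{
    /* a '"' followed by a newline before the closing '"': unterminated string */
    < TOK_UNTERM_STRING : "\"" (~["\"","\n","\r"])* ("\n" | "\r") >
    /* placed last in the file so it fires only when nothing else matches */
  | < TOK_ERROR : ~[] >
}
```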


Project Path

You will be writing the specifications for the lexical analyzer in a file. Conventionally, lexical specifications are stored in files with a .jj suffix.

Let the specifications be in a file decaf.jj. Your lexical analyzer should contain only regular expression productions (no BNF productions at this stage). Use the lexical analyzer generator ("javacc decaf.jj") to create the analyzer from the specifications. The generated analyzer is a Java program. Name this program DecafParser by writing PARSER_BEGIN(DecafParser) and PARSER_END(DecafParser) in your decaf.jj. Compile it ("javac DecafParser.java") to produce the class file DecafParser.class. Run it with "java DecafParser".
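For concreteness, a minimal decaf.jj along these lines might look as follows. This is a hedged sketch, not a template you must copy: getNextToken, kind, image, and EOF all come from the JavaCC-generated code, and some JavaCC versions insist on at least one BNF production (a dummy one can be added if yours does):

```
PARSER_BEGIN(DecafParser)
public class DecafParser {
    public static void main(String[] args) throws Exception {
        // Reads Decaf source from standard input; pass a FileInputStream
        // to the constructor instead to read from a file.
        DecafParser parser = new DecafParser(System.in);
        for (Token t = parser.getNextToken(); t.kind != EOF;
             t = parser.getNextToken())
            System.out.println(t.kind + "\t" + t.image);
    }
}
PARSER_END(DecafParser)

SKIP  : { " " | "\t" | "\n" | "\r" }

TOKEN :
{
    < TOK_IF : "if" >            /* keyword rules must precede TOK_ID */
  | < TOK_ID : ["a"-"z","A"-"Z"] (["a"-"z","A"-"Z","0"-"9","_"])* >
  /* ... remaining SKIP/TOKEN rules go here ... */
}
```

The main loop simply prints each token's kind and image, which is enough to eyeball the analyzer's behavior on small test inputs.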

In order to compile and run java and javacc properly, you should (1) set the PATH variable to "/usr/local/jdk1.2.2/bin:/usr/local/javacc2.1/bin:$PATH", and (2) set the CLASSPATH variable to your working directory.

Tips: It is a good idea to start small. The toughest part of the assignment is to come up with the regular expressions for comments, strings and floating point numbers. Recognizing keywords is quite easy, and can be left for the end.

The first step would be to check that you can specify integers without problems. Run javacc to figure out what the syntax of regular expressions should be. Then write specifications for recognizing identifiers, floating point numbers, strings, single-line comments, and then multi-line comments, adding one rule at a time. Keep running javacc each time you add a new rule to ensure that there are no problems with the specifications. Once all the non-trivial rules go through javacc without a problem, add the rules for keywords, operators, whitespace, etc.

Once you can get all the rules through javacc, the next step would be to compile the generated analyzer. Once compilation goes through without problems, you can run your analyzer. The analyzer should accept input either from the standard input or from a file; this makes it easy to debug.

Now you are ready to add the more complex actions that set the attribute values for tokens, and test the whole thing on the input files.


Available Material


Deliverables

The lexical analyzer handin must contain:

Submission Requirements

Please note the following submission requirements. These are not simply conventions, but requirements. They make grading of the projects easier, and let us spend more time wading through the code to award partial credit! They are not difficult to follow, so failure to follow them may result in severe penalties (the project may not be graded!). If you have trouble with the set-up, please contact me.



Radu Grosu