NOTE: Your lexical analyzer is expected to correctly process comments of type 2. You may also handle comments of type 1 for extra credit.

Whitespace (blanks, newlines and tabs) serves to separate tokens; otherwise it is ignored. Whitespace may not appear within any token except a string constant (see below).
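In JavaCC, lexemes that are discarded without producing tokens are typically declared with SKIP rules. A minimal sketch follows — it assumes the single-line (//) and multi-line (/* ... */) comment forms; check the assignment's definitions of comment types 1 and 2 before copying anything:

```java
// Whitespace separates tokens but emits none.
SKIP : { " " | "\t" | "\n" | "\r" }

// Single-line comment: runs to the end of the line.
SKIP : { < "//" (~["\n","\r"])* > }

// Multi-line comment: switch to a lexical state until "*/" is seen.
SKIP : { < "/*" > : IN_COMMENT }
<IN_COMMENT> SKIP : { < "*/" > : DEFAULT | < ~[] > }
```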
boolean break class continue else extends
false float for if int new
null private public return static super
this true void while
Integer constants contain only the digits 0 through 9.
Floating point constants contain a decimal point with a sequence of one or more digits on either side of it (e.g., 3.14159). (Note that we do not support floating point constants in scientific (mantissa-exponent) notation.)
String constants begin and end with a double quote (") and may contain any sequence of characters other than newlines. As in C, if the string itself contains a double quote, it is "escaped" with a backslash (\) as in the example string: "\"What?\" she exclaimed.". Similarly, if the string contains a backslash, that is escaped too (e.g., as in "\\").
An identifier is a sequence of letters, digits and underscores (_), starting with a letter, that is not one of the reserved words.
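As a quick sanity check on these token definitions, the constant and identifier rules can be expressed as ordinary Java regular expressions. These are illustrative only; your actual rules must be written in decaf.jj syntax:

```java
import java.util.regex.Pattern;

public class TokenPatterns {
    // Integer constant: one or more digits.
    static final Pattern INT = Pattern.compile("[0-9]+");
    // Float constant: digits required on BOTH sides of the decimal point.
    static final Pattern FLOAT = Pattern.compile("[0-9]+\\.[0-9]+");
    // String constant: double-quoted, no raw newlines; \" and \\ are escapes.
    static final Pattern STRING = Pattern.compile("\"([^\"\\\\\\n]|\\\\.)*\"");
    // Identifier: a letter first, then letters, digits, and underscores.
    static final Pattern ID = Pattern.compile("[A-Za-z][A-Za-z0-9_]*");

    public static void main(String[] args) {
        System.out.println(FLOAT.matcher("3.14159").matches()); // true
        System.out.println(FLOAT.matcher("3.").matches());      // false: digits needed after '.'
        System.out.println(STRING.matcher("\"\\\"What?\\\" she exclaimed.\"").matches()); // true
        System.out.println(ID.matcher("_x").matches());         // false: must start with a letter
    }
}
```

Note that the reserved-word check is not part of the identifier pattern itself; in JavaCC it falls out of rule ordering, since keyword tokens listed before TOK_ID win ties.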
[ ] ( ) { }
+ - * / = && || !
> < == != >= <=
, . ;

Clearly, the above operators form a small subset of the operators allowed in Java. However, most of the issues in compilation can be illustrated using this subset.
Comments and whitespace are stripped by the lexical analyzer, i.e., no tokens are emitted for them.

TOK_BOOLEAN TOK_BREAK TOK_CLASS TOK_CONTINUE
TOK_ELSE TOK_EXTENDS TOK_FALSE TOK_FLOAT
TOK_FOR TOK_IF TOK_INT TOK_NEW
TOK_NULL TOK_PRIVATE TOK_PUBLIC TOK_RETURN
TOK_STATIC TOK_SUPER TOK_THIS TOK_TRUE
TOK_VOID TOK_WHILE
TOK_INT_CONST TOK_FLOAT_CONST TOK_STRING_CONST
TOK_ID
TOK_OPEN_SQ_BRACKET TOK_CLOSE_SQ_BRACKET
TOK_OPEN_PAREN TOK_CLOSE_PAREN TOK_OPEN_BRACE TOK_CLOSE_BRACE
TOK_PLUS TOK_MINUS TOK_MULTIPLY TOK_DIVIDE
TOK_EQUAL TOK_AND TOK_OR TOK_NOT
TOK_GREATER TOK_LESSER TOK_EQUAL_EQUAL TOK_NOT_EQUAL
TOK_GREATER_OR_EQUAL TOK_LESSER_OR_EQUAL
TOK_COMMA TOK_DOT TOK_SEMICOLON
The token names are self-explanatory. If you have any questions about this, please come to my office hours.
Note that strings cannot contain newline characters. A sequence of characters that starts with a double quote (") and contains a newline before the next double quote is also an error that must be signalled.
Put the specifications in a file decaf.jj. Your lexical analyzer should contain only regular expression productions (no BNF productions at this stage). Run the lexical analyzer generator with "javacc decaf.jj" to create the analyzer from the specifications. The generated analyzer is a Java program. Name it DecafParser by writing PARSER_BEGIN(DecafParser) and PARSER_END(DecafParser) in your decaf.jj. Compile it with "javac DecafParser.java" to produce DecafParser.class. Run it with "java DecafParser".
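A minimal decaf.jj skeleton along these lines might look as follows; the token rules shown are illustrative placeholders, not the full set you need:

```java
PARSER_BEGIN(DecafParser)
public class DecafParser {
    public static void main(String[] args) throws Exception {
        // Reads from standard input; the token-printing loop goes here.
        DecafParser parser = new DecafParser(System.in);
    }
}
PARSER_END(DecafParser)

SKIP : { " " | "\t" | "\n" | "\r" }

// Keyword rules must appear BEFORE TOK_ID: JavaCC breaks
// equal-length matches in favor of the earlier rule.
TOKEN : {
    < TOK_INT_CONST : (["0"-"9"])+ >
  | < TOK_FLOAT_CONST : (["0"-"9"])+ "." (["0"-"9"])+ >
  | < TOK_ID : ["a"-"z","A"-"Z"] (["a"-"z","A"-"Z","0"-"9","_"])* >
}
```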
In order to compile and run java and javacc properly, you should (1) set the PATH variable to "/usr/local/jdk1.2.2/bin:/usr/local/javacc2.1/bin:$PATH", and (2) set the CLASSPATH variable to your working directory.
Tips: It is a good idea to start small. The toughest part of the assignment is to come up with the regular expressions for comments, strings and floating point numbers. Recognizing keywords is quite easy, and can be left for the end.
The first step is to check that you can specify integers without problems. Run javacc to figure out what the syntax of the regular expressions should be. Write specifications for recognizing identifiers, floating point numbers, strings, single-line comments, and then multi-line comments, adding one rule at a time. Run javacc each time you add a new rule to ensure that there are no problems with the specifications. Once all the non-trivial rules go through javacc without a problem, add the rules for keywords, operators, whitespace, etc.
Once you can get all the rules through javacc, the next step is to compile the generated analyzer. Once compilation goes through without problems, you can run your analyzer. The analyzer should accept input either from the standard input or from a file; this makes it easy to debug.
Now you are ready to add the more complex actions that set the attribute values for tokens, and test the whole thing on the input files.
- The generated analyzer should read all the tokens in a file and, for each token, print its kind and its image.
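The expected output shape — one line per token giving its kind and its image — can be previewed with a tiny self-contained driver. This sketch uses plain Java regexes as a stand-in; the real version walks the tokens produced by the generated analyzer instead, and the keyword rules are omitted here for brevity (so reserved words come out as TOK_ID):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrintTokensDemo {
    // Tokenizes the input and returns one "KIND image" line per token.
    static List<String> lex(String input) {
        // One alternative per token kind; named groups identify the kind.
        Pattern tok = Pattern.compile(
            "(?<WS>\\s+)|(?<FLOAT>[0-9]+\\.[0-9]+)|(?<INT>[0-9]+)"
            + "|(?<ID>[A-Za-z][A-Za-z0-9_]*)|(?<OP>[=;])");
        List<String> out = new ArrayList<>();
        Matcher m = tok.matcher(input);
        while (m.find()) {
            if (m.group("WS") != null) continue; // whitespace emits no token
            String kind = m.group("FLOAT") != null ? "TOK_FLOAT_CONST"
                        : m.group("INT") != null ? "TOK_INT_CONST"
                        : m.group("ID") != null ? "TOK_ID" : "TOK_OP";
            out.add(kind + " " + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : lex("x = 3.14 ;")) System.out.println(line);
    }
}
```

Running this prints TOK_ID x, TOK_OP =, TOK_FLOAT_CONST 3.14, and TOK_OP ; on separate lines.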
- Running make and java on the provided file should produce a valid demo.