Your assignment is to write a scanner for the μGo language with lex. This document gives the lexical definition of the language, while the syntactic definition and code generation will follow in subsequent assignments.
Your programming assignments are based around this division and later assignments will use the parts of the system you have built in the earlier assignments. That is, in the first assignment you will implement the scanner using lex, in the second assignment you will implement the syntactic definition in yacc, and in the last assignment you will generate assembly code for the Java Virtual Machine by augmenting your yacc parser.
This definition is subject to modification as the semester progresses. You should take care that the code you write is well-structured and easy to revise.
1. μGo Language Features
We highlight the features of μGo by comparing it with C language. It is very important to note that
tokens that will be passed to the parser, and
tokens that will be discarded by the scanner (e.g., recognized but not passed to the parser).
2.1 Tokens that will be passed to the parser
The following tokens will be recognized by the scanner and will be eventually passed to the parser.
2.1.1 Delimiters
Each of these delimiters should be passed back to the parser as a token.
Delimiters    Symbols
Parentheses   ( ) { } [ ]
Semicolon     ;
Comma         ,
Quotation     " "
Newline       \n
2.1.2 Arithmetic, Relational, and Logical Operators
Each of these operators should be passed back to the parser as a token.
Operators     Symbols
Arithmetic    + - * / % ++ --
Relational    < > <= >= == !=
Assignment    = += -= *= /= %=
Logical       && || !
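Several operators share a prefix (for example + , ++ , and += ), so the scanner has to prefer the longest spelling. lex does this automatically: it picks the rule that matches the most input. The sketch below only illustrates that behavior in plain C. The helper name match_operator and the table layout are assumptions made for this example, and the token names are taken from the table in section 3.1.

```c
#include <stddef.h>
#include <string.h>

/* Operator spellings paired with the token names from section 3.1.
 * Two-character operators are listed first so the longest match wins. */
struct op_entry { const char *text; const char *token; };

static const struct op_entry OPS[] = {
    {"++", "INC"}, {"--", "DEC"},
    {"+=", "ADD_ASSIGN"}, {"-=", "SUB_ASSIGN"}, {"*=", "MUL_ASSIGN"},
    {"/=", "QUO_ASSIGN"}, {"%=", "REM_ASSIGN"},
    {"<=", "LEQ"}, {">=", "GEQ"}, {"==", "EQL"}, {"!=", "NEQ"},
    {"&&", "LAND"}, {"||", "LOR"},
    {"+", "ADD"}, {"-", "SUB"}, {"*", "MUL"}, {"/", "QUO"}, {"%", "REM"},
    {"<", "LSS"}, {">", "GTR"}, {"=", "ASSIGN"}, {"!", "NOT"},
};

/* Hypothetical helper: returns the token name of the longest operator
 * that prefixes s and stores its length in *len; returns NULL (and
 * leaves *len untouched) if s does not start with an operator. */
const char *match_operator(const char *s, size_t *len)
{
    for (size_t i = 0; i < sizeof OPS / sizeof OPS[0]; i++) {
        size_t n = strlen(OPS[i].text);
        if (strncmp(s, OPS[i].text, n) == 0) {
            *len = n;
            return OPS[i].token;
        }
    }
    return NULL;
}
```

Note that this helper knows nothing about comments: given // it would report QUO, so a complete scanner has to recognize comment openers before falling back to the / operator.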
2.1.3 Keywords
Each of these keywords should be passed back to the parser as a token.
The following keywords are reserved words of μGo:
Types                  Keywords
Data type              int32 float32 bool string
Conditional            if else for
Variable declaration   var
Built-in functions     print println
Functional             func return package
Switch                 switch case default
2.1.4 Identifiers
An identifier is a string of letters ( a ~ z , A ~ Z , _ ) and digits ( 0 ~ 9 ) that begins with a letter or underscore. Identifiers are case-sensitive; for example, ident , Ident , and IDENT are three different identifiers. Note that keywords are not identifiers.
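The identifier rule can be checked with a few lines of C, shown here as a minimal sketch rather than as the required lex rule. The function names is_keyword and is_identifier are assumptions for this example. The keyword list is copied from section 2.1.3; true and false appear only in the token table of section 3.1, so they are not treated as keywords here.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Reserved words listed in section 2.1.3. */
static const char *const KEYWORDS[] = {
    "int32", "float32", "bool", "string", "if", "else", "for", "var",
    "print", "println", "func", "return", "package",
    "switch", "case", "default",
};

/* Returns 1 if s is one of the reserved words above, otherwise 0. */
int is_keyword(const char *s)
{
    for (size_t i = 0; i < sizeof KEYWORDS / sizeof KEYWORDS[0]; i++)
        if (strcmp(s, KEYWORDS[i]) == 0)
            return 1;
    return 0;
}

/* Returns 1 if s starts with a letter or underscore, continues with
 * letters, digits, or underscores, and is not a keyword; otherwise 0. */
int is_identifier(const char *s)
{
    if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
        return 0;
    for (size_t i = 1; s[i] != '\0'; i++)
        if (!(isalnum((unsigned char)s[i]) || s[i] == '_'))
            return 0;
    return !is_keyword(s);
}
```

Because the comparison uses strcmp, the check is case-sensitive: for is rejected as a keyword, while For is accepted as an identifier.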
2.1.5 Integer Literals and Floating-Point Literals
Integer literals: a sequence of one or more digits, such as 1 , 23 , and 666 .
Floating-point literals: numbers that contain floating decimal points, such as 0.2 and 3.141 .
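The two literal forms differ only in whether a decimal point followed by more digits comes after the leading digits. The sketch below shows one way to tell them apart in C. The name scan_number and the enum are assumptions for this example, and it further assumes that a floating-point literal needs at least one digit on each side of the point, which matches the examples above but is not stated explicitly.

```c
#include <ctype.h>
#include <stddef.h>

enum lit_kind { LIT_NONE, LIT_INT, LIT_FLOAT };

/* Hypothetical helper: classifies the numeric literal at the start of s
 * and stores the number of characters it occupies in *len. */
enum lit_kind scan_number(const char *s, size_t *len)
{
    size_t i = 0;

    /* Leading digits are required for both literal kinds. */
    while (isdigit((unsigned char)s[i]))
        i++;
    if (i == 0) {
        *len = 0;
        return LIT_NONE;
    }

    /* A '.' counts only when at least one digit follows it (assumption). */
    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
        i++;
        while (isdigit((unsigned char)s[i]))
            i++;
        *len = i;
        return LIT_FLOAT;
    }

    *len = i;
    return LIT_INT;
}
```

For the input 3.141+x this reports a floating-point literal of length 5, and for 666; an integer literal of length 3.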
2.1.6 String Literals
A string literal is a sequence of zero or more ASCII characters appearing between double-quote ( " ) delimiters. A double-quote appearing within a string must be written after a " , e.g., "abc" and "Hello world" .
2.2 Tokens that will be discarded
The following tokens will be recognized by the scanner, but should be discarded, rather than returning to the parser.
2.2.1 Whitespace
A sequence of blanks (spaces), tabs, and newlines.
2.2.2 Comments
Comments can be added in several ways:
C-style comments are text surrounded by /* and */ delimiters, which may span more than one line.
C++-style comments are text following a // delimiter and running up to the end of the line.
Whichever comment style is encountered first remains in effect until the appropriate comment close is encountered. For example,
// this is a comment // line */ /* with /* delimiters */ before the end
and
/* this is a comment // line with some /* and C delimiters */
are both valid comments.
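The rule that the first comment style stays in effect can be sketched in C as follows. The helper name comment_length is an assumption for this example, and so is the choice to let an unterminated /* comment run to the end of the input, since the text above does not say how that case should be handled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the number of characters in the comment
 * that starts at s, or 0 if s does not start a comment. */
size_t comment_length(const char *s)
{
    if (s[0] != '/')
        return 0;

    if (s[1] == '/') {
        /* C++-style: everything up to, but not including, the newline.
         * Any comment delimiters in between are ordinary comment text. */
        return strcspn(s, "\n");
    }

    if (s[1] == '*') {
        /* C-style: ends at the first closing delimiter after the opener;
         * nested openers and // are ignored. */
        const char *end = strstr(s + 2, "*/");
        if (end != NULL)
            return (size_t)(end - s) + 2;
        return strlen(s);   /* unterminated: assumed to run to the end */
    }

    return 0;
}
```

Applied to the two example comments above, the first is consumed up to the end of its line and the second up to its first closing delimiter.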
2.2.3 Other characters
Any undefined characters or strings should be discarded by your scanner during scanning.
3. What should Your Scanner Do?
3.1 Assignment Requirements
We have prepared 11 μGo programs, which are used to test the functionalities of your scanner.
Each test program is worth 10pt, for a total score of 110pt. You will get 110pt if your scanner generates the correct answers for all eleven programs. Note that the TA will prepare hidden test cases to verify that your scanner is not hardcoded to the attached inputs and outputs. A hardcoded scanner will get 0pt. You can run the judge program to get your testing score by typing judge in your terminal.
The output messages generated by your scanner must use the given names of token classes listed below.
Symbol   Token        Symbol           Token        Symbol    Token
+        ADD          &&               LAND         print     PRINT
-        SUB          ||               LOR          println   PRINTLN
*        MUL          !                NOT          if        IF
/        QUO          (                LPAREN       else      ELSE
%        REM          )                RPAREN       for       FOR
++       INC          [                LBRACK       int32     INT
--       DEC          ]                RBRACK       float32   FLOAT
>        GTR          {                LBRACE       string    STRING
<        LSS          }                RBRACE       bool      BOOL
>=       GEQ          ;                SEMICOLON    true      TRUE
<=       LEQ          ,                COMMA        false     FALSE
==       EQL          "                QUOTA        var       VAR
!=       NEQ          \n               NEWLINE
=        ASSIGN       :                COLON        func      FUNC
+=       ADD_ASSIGN   Int Number       INT_LIT      package   PACKAGE
-=       SUB_ASSIGN   Float Number     FLOAT_LIT    return    RETURN
*=       MUL_ASSIGN   String Literal   STRING_LIT   switch    SWITCH
/=       QUO_ASSIGN   Identifier       IDENT        case      CASE
%=       REM_ASSIGN   Comment          COMMENT      default   DEFAULT
3.2 Example of Your Scanner Output
The example input code and the corresponding output that we expect your scanner to generate are as follows.
3.3 How to debug
Compile your source code, feed the input to your program, and compare its output with the ground truth.