SysY Compiler

Yunge Hu

1. Project Overview

1.1 What is this project?

This project is designed and developed as part of the "Compiler Technology" course experiment at Beihang University, which requires the design and implementation of a compiler.

1.2 What language needs to be translated?

The compiler is expected to translate code written in SysY language (a subset of C language) into intermediate code or target code.

1.3 What language will source code be translated into?

The source code needs to be translated into intermediate code (LLVM IR Code or P-code) or target code (MIPS).

2. Introduction to Reference Compilers

The course group provided two compilers (Pascal-Compiler and Pl0-Compiler) for reference. We need to read and analyze their source code, and then complete the overall architecture design of our own compiler based on this.

In this section, I will summarize the design philosophy of one of the compilers, including but not limited to its overall structure, interface design, and file organization. The specific content is as follows.

3. Overall Design of SysY Compiler

Project Repository: SysY Compiler

3.1 Directory Structure

The compiler is developed in Java language, and the current source code directory structure is as follows (the directory structure will be updated continuously during the development process).

src
├── Compiler.java
├── config.json
├── exception
│   ├── CompilationException.java
│   └── ExceptionCategory.java
├── lexer
│   ├── ILexer.java
│   ├── impl
│   │   ├── DefaultLexerImpl.java
│   │   └── StateTransitionLexerImpl.java
│   ├── token
│   │   ├── Assistant.java
│   │   ├── CharConst.java
│   │   ├── Identifier.java
│   │   ├── IntConst.java
│   │   ├── Keyword.java
│   │   ├── Operator.java
│   │   ├── StringConst.java
│   │   ├── Token.java
│   │   └── TokenCategory.java
│   └── utils
│       ├── State.java
│       └── StateConverter.java
├── parser
│   ├── IParser.java
│   ├── impl
│   │   └── DefaultParserImpl.java
│   └── node
│       ├── BlockNode.java
│       ├── ParseTreeLeafNode.java
│       └── ParseTreeNode.java
├── semantic
│   ├── ISemanticAnalyzer.java
│   ├── env
│   │   ├── ArrayItem.java
│   │   ├── Env.java
│   │   ├── EnvItem.java
│   │   ├── FunctionItem.java
│   │   └── ItemCategory.java
│   └── impl
│       └── DefaultSemanticAnalyzerImpl.java
├── symbol
│   ├── BlankCategory.java
│   ├── Category.java
│   └── NonTerminalCategory.java
└── tools
    ├── FirstFollowCalculator.java
    ├── Grammar.java
    ├── PredictiveParsingTableBuilder.java
    └── ToolMain.java

3.2 Root Directory

The Compiler class under the root directory is the entry point of the entire compiler. The file config.json is a configuration file required by the evaluation system and is unrelated to the compiler program itself.

3.3 Directory exception

The two classes under this directory are the exception class and the enumeration class.

The exception class CompilationException represents a compilation error; it is thrown when the compiler detects an error in the source program.

The enumeration class ExceptionCategory serves as a field of the above exception class, representing different types of compilation errors, including lexical errors, syntax errors, and semantic errors.
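
A minimal sketch of what these two files might contain is shown below; the field names and constructor signature are assumptions for illustration, not the repository's actual code.

// exception/ExceptionCategory.java
public enum ExceptionCategory {
    LEXICAL_ERROR,
    SYNTAX_ERROR,
    SEMANTIC_ERROR
}

// exception/CompilationException.java
public class CompilationException extends Exception {
    private final ExceptionCategory category;  // which kind of error this is
    private final int lineNumber;              // source line where it was detected

    public CompilationException(ExceptionCategory category, int lineNumber, String message) {
        super(message);
        this.category = category;
        this.lineNumber = lineNumber;
    }

    public ExceptionCategory getCategory() { return category; }
    public int getLineNumber() { return lineNumber; }
}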

3.4 Directory lexer

This directory contains files related to the lexical analysis subroutine.

The interface ILexer is the interface for the lexical analyzer, which exposes the functionality of the lexical analyzer and can have different implementations.

The subdirectory impl contains different implementations of the lexical analyzer; there are currently two implementation classes built on different ideas. Note that DefaultLexerImpl is not guaranteed to work correctly beyond the lexical analysis stage, so StateTransitionLexerImpl currently serves as the only implementation of the lexical analyzer in actual use.

The subdirectory token defines lexical units. The abstract class Token serves as the base class for all lexical units, and the enumeration class TokenCategory serves as a field of this class to mark the type of the lexical unit. The remaining classes are subclasses of Token and are used to better classify lexical units.

The files in the subdirectory utils are used for state transitions in the lexical analyzer. The enumeration class State defines different states, and the interface StateConverter is a functional interface used to implement the transition between different states in the lexical analyzer.
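
Under these assumptions, a sketch of the two utilities might look as follows (the State values shown are an illustrative subset, and the exact signature of convert is a guess):

// lexer/utils/State.java (an illustrative subset of states)
public enum State {
    Start, InIdentifier, InIntConst, InStringConst, InCharConst,
    InComment, FileEnd, Error
}

// lexer/utils/StateConverter.java
@FunctionalInterface
public interface StateConverter {
    /** Consume the next input character (-1 at end of file) and return the next state. */
    State convert(int next) throws CompilationException;
}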

3.5 Directory parser

This directory contains files related to the syntax analysis subroutine.

The interface IParser is the interface for the syntax analyzer, which exposes the functionality of the syntax analyzer and can have different implementations.

The subdirectory impl contains different types of syntax analyzer implementations, and currently there is only one implementation class.

The subdirectory node defines syntax analysis tree nodes. The class ParseTreeNode is the class for syntax analysis tree nodes, which includes attributes and methods common to intermediate nodes and leaf nodes; the class ParseTreeLeafNode inherits from the ParseTreeNode class and includes attributes and methods unique to leaf nodes.
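
A compact sketch of this hierarchy (field and method names are assumptions; Category and Token are the project's classes from the symbol and lexer/token directories):

import java.util.ArrayList;
import java.util.List;

// parser/node/ParseTreeNode.java
public class ParseTreeNode {
    private final Category category;            // grammar symbol of this node
    private final List<ParseTreeNode> children = new ArrayList<>();

    public ParseTreeNode(Category category) { this.category = category; }

    public Category getCategory() { return category; }
    public List<ParseTreeNode> getChildren() { return children; }
    public void addChild(ParseTreeNode child) { children.add(child); }
}

// parser/node/ParseTreeLeafNode.java
public class ParseTreeLeafNode extends ParseTreeNode {
    private final Token token;                  // the lexical unit this leaf wraps

    public ParseTreeLeafNode(Token token) {
        super(token.getCategory());             // assumes Token exposes its category
        this.token = token;
    }

    public Token getToken() { return token; }
}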

3.6 Directory semantic

This directory contains files related to the semantic analysis subroutines.

The interface ISemanticAnalyzer serves as the interface for the semantic analyzer, exposing the functionality of the semantic analyzer, which can have different implementations.

The subdirectory impl contains implementations of different types of semantic analyzers, with only one implementation class currently available.

The subdirectory env contains classes related to the symbol table. The class Env is a symbol table tree node class, and the class EnvItem is a symbol table item, which has two subclasses, namely the ArrayItem class and the FunctionItem class, representing arrays and functions, respectively. These classes include attributes and methods specific to arrays and functions. The enumeration class ItemCategory can serve as a field of the EnvItem class, marking the type of symbol table items.
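
A sketch of the item hierarchy under the same caveat, with all field names guessed for illustration:

// semantic/env/EnvItem.java
public class EnvItem {
    private final String name;
    private final ItemCategory category;       // constant, variable, array, function, ...

    public EnvItem(String name, ItemCategory category) {
        this.name = name;
        this.category = category;
    }

    public String getName() { return name; }
    public ItemCategory getCategory() { return category; }
}

// semantic/env/ArrayItem.java
public class ArrayItem extends EnvItem {
    private final int length;                  // evaluated from the <ConstExp> dimension

    public ArrayItem(String name, ItemCategory category, int length) {
        super(name, category);
        this.length = length;
    }

    public int getLength() { return length; }
}

// semantic/env/FunctionItem.java
public class FunctionItem extends EnvItem {
    private final java.util.List<EnvItem> params;  // formal parameters, in order

    public FunctionItem(String name, ItemCategory category, java.util.List<EnvItem> params) {
        super(name, category);
        this.params = params;
    }

    public java.util.List<EnvItem> getParams() { return params; }
}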

3.7 Directory symbol

The files in this directory define the type information needed during the syntax analysis process.

The interface Category has three implementation classes, all of which are enumeration classes: the TokenCategory mentioned earlier, plus the BlankCategory and NonTerminalCategory in this directory.

The enumeration class BlankCategory defines the empty production in the grammar.

The enumeration class NonTerminalCategory defines the types of non-terminals in the grammar.

The TokenCategory mentioned above defines the terminal symbols in the grammar, and in particular, it also includes the end-of-input symbol, which is used by the syntax analyzer to recognize the end of the lexical unit stream.

3.8 Directory tools

The files in this directory are used to process the grammar and generate the predictive parsing table required for syntax analysis.

The enumeration class Grammar defines all the productions of the grammar.

The class FirstFollowCalculator calculates the FIRST set and FOLLOW set for each non-terminal based on the grammar and outputs the set content to the output stream.

The class PredictiveParsingTableBuilder generates the predictive parsing table based on the grammar and the FIRST set and FOLLOW set just calculated, and can output the predictive parsing table information to the output file.

The class ToolMain starts the entire predictive parsing table generation process, outputs the calculation results of the above two classes to the file, and returns the data structure of the predictive parsing table to the caller.

4. Lexical Analysis Design

The lexical analyzer includes two implementation classes. The implementation class DefaultLexerImpl is now deprecated because of its disorganized structure.

The implementation class StateTransitionLexerImpl implements lexical analysis using a state transition diagram, exposing the method nextToken to read the next lexical unit from the given input stream and return it to the caller.

Based on the given lexicon, we can draw the following state transition diagram:

(Figure: StateInLexer, the state transition diagram of the lexical analyzer)

When constructing the lexical analyzer class, the state is initialized to Start.

Each call to the nextToken method will find the corresponding state transition handler based on the current state, and then perform the corresponding state transition logic according to the character read; this transition process will continue until the next lexical unit can be returned or an exception is thrown.

In addition to the normal state transition handlers, the lexical analyzer also includes a special error state handler, which enters the error state when a lexical error is found or when the end of the file is reached. If it enters the error state due to a lexical error, the handler will throw a lexical error exception; if it enters the error state due to reaching the end of the file, the handler will permanently set the lexical analyzer state to FileEnd to mark the end of file reading.

Once the lexical analyzer's state is FileEnd, that is, after the input has been fully consumed, any further call to the nextToken method simply returns null.
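
The driver loop can be condensed into the following sketch; the handler map and the pendingToken bookkeeping are illustrative stand-ins for the real class's fields:

import java.io.IOException;
import java.io.Reader;
import java.util.EnumMap;
import java.util.Map;

public class LexerLoopSketch {
    private State state = State.Start;
    private Token pendingToken;                 // a handler stores a finished token here
    private final Reader reader;
    private final Map<State, StateConverter> handlers = new EnumMap<>(State.class);

    public LexerLoopSketch(Reader reader) {
        this.reader = reader;
        // One handler per state; each lambda may append to the current lexeme,
        // assign pendingToken when a unit is complete, and return the next state.
        // handlers.put(State.Start, c -> { ... });  (entries omitted here)
    }

    public Token nextToken() throws IOException, CompilationException {
        if (state == State.FileEnd) {
            return null;                        // the input has already been consumed
        }
        pendingToken = null;
        while (pendingToken == null) {
            int c = reader.read();              // -1 signals end of file
            state = handlers.get(state).convert(c);  // may finish a token or throw
        }
        return pendingToken;
    }
}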

5. Syntax Analysis Design

The basic idea of the syntax analyzer design: driven by the predictive parsing table, non-recursive, top-down analysis.

5.1 Grammar Transformation

To meet the design requirements, we must first transform the original grammar to conform to the LL(1) grammar rules as much as possible. The transformation process is as follows, where \epsilon represents an empty string:

Compilation Unit

// <CompUnit> -> {<Decl>} {<FuncDef>} <MainFuncDef>
<CompUnit> -> <DeclList> <FuncDefList> <MainFuncDef>
<DeclList> -> <Decl> <DeclList> | \epsilon
<FuncDefList> -> <FuncDef> <FuncDefList> | \epsilon

Declarations

// <Decl> -> <ConstDecl> | <VarDecl>
<Decl> -> <ConstDecl> | <VarDecl>

Constant Declaration

// <ConstDecl> -> CONSTTK <BType> <ConstDef> { COMMA <ConstDef> } SEMICN
<ConstDecl> -> CONSTTK <BType> <ConstDef> <ConstDefList> SEMICN
<ConstDefList> -> COMMA <ConstDef> <ConstDefList> | \epsilon

Basic Type

// <BType> -> INTTK | CHARTK
<BType> -> INTTK | CHARTK

Constant Definition

// <ConstDef> -> IDENFR [ LBRACK <ConstExp> RBRACK ] ASSIGN <ConstInitVal>
<ConstDef> -> IDENFR <ConstArrSym> ASSIGN <ConstInitVal>
<ConstArrSym> -> LBRACK <ConstExp> RBRACK | \epsilon

Constant Initial Value

// <ConstInitVal> -> <ConstExp> | LBRACE [ <ConstExp> { COMMA <ConstExp> } ] RBRACE | STRCON
<ConstInitVal> -> <ConstExp> | LBRACE <ConstExpListOpt> RBRACE | STRCON
<ConstExpListOpt> -> <ConstExp> <ConstExpTail> | \epsilon
<ConstExpTail> -> COMMA <ConstExp> <ConstExpTail> | \epsilon

Variable Declaration

// <VarDecl> -> <BType> <VarDef> { COMMA <VarDef> } SEMICN
<VarDecl> -> <BType> <VarDef> <VarDefList> SEMICN
<VarDefList> -> COMMA <VarDef> <VarDefList> | \epsilon

Variable Definition

// <VarDef> -> IDENFR [ LBRACK <ConstExp> RBRACK ] | IDENFR [ LBRACK <ConstExp> RBRACK ] ASSIGN <InitVal>
<VarDef> -> IDENFR <ConstArrSym> | IDENFR <ConstArrSym> ASSIGN <InitVal>

Extracting Left Common Factors:

<VarDef> -> IDENFR <ConstArrSym> <_VarDef>
<_VarDef> -> ASSIGN <InitVal> | \epsilon

Variable Initial Value

// <InitVal> -> <Exp> | LBRACE [ <Exp> { COMMA <Exp> } ] RBRACE | STRCON
<InitVal> -> <Exp> | LBRACE <ExpListOpt> RBRACE | STRCON
<ExpListOpt> -> <Exp> <ExpTail> | \epsilon
<ExpTail> -> COMMA <Exp> <ExpTail> | \epsilon

Function Definition

// <FuncDef> -> <FuncType> IDENFR LPARENT [ <FuncFParams> ] RPARENT <Block>
<FuncDef> -> <FuncType> IDENFR LPARENT <FuncFParamsList> RPARENT <Block>
<FuncFParamsList> -> <FuncFParams> | \epsilon

Main Function Definition

// <MainFuncDef> -> INTTK MAINTK LPARENT RPARENT <Block>
<MainFuncDef> -> INTTK MAINTK LPARENT RPARENT <Block>

Function Type

// <FuncType> -> VOIDTK | INTTK | CHARTK
<FuncType> -> VOIDTK | INTTK | CHARTK

Function Parameter List

// <FuncFParams> -> <FuncFParam> { COMMA <FuncFParam> }
<FuncFParams> -> <FuncFParam> <FuncFParamList>
<FuncFParamList> -> COMMA <FuncFParam> <FuncFParamList> | \epsilon

Function Parameter

// <FuncFParam> -> <BType> IDENFR [ LBRACK RBRACK ]
<FuncFParam> -> <BType> IDENFR <BracketPair>
<BracketPair> -> LBRACK RBRACK | \epsilon

Statement Block

// <Block> -> LBRACE { <BlockItem> } RBRACE
<Block> -> LBRACE <BlockItemList> RBRACE
<BlockItemList> -> <BlockItem> <BlockItemList> | \epsilon

Statement Block Item

// <BlockItem> -> <Decl> | <Stmt>
<BlockItem> -> <Decl> | <Stmt>

Statements

Original Grammar:

<Stmt> -> <LVal> ASSIGN <Exp> SEMICN
| [ <Exp> ] SEMICN
| <Block>
| IFTK LPARENT <Cond> RPARENT <Stmt> [ ELSETK <Stmt> ]
| FORTK LPARENT [ <ForStmt> ] SEMICN [ <Cond> ] SEMICN [ <ForStmt> ] RPARENT <Stmt>
| BREAKTK SEMICN
| CONTINUETK SEMICN
| RETURNTK [ <Exp> ] SEMICN
| <LVal> ASSIGN GETINTTK LPARENT RPARENT SEMICN
| <LVal> ASSIGN GETCHARTK LPARENT RPARENT SEMICN
| PRINTFTK LPARENT STRCON { COMMA <Exp> } RPARENT SEMICN

Rewritten Grammar:

<Stmt> -> <LVal> ASSIGN <Exp> SEMICN
       | <ExpOpt> SEMICN
       | <Block>
       | IFTK LPARENT <Cond> RPARENT <Stmt> <ElseOpt>
       | FORTK LPARENT <ForStmtOpt> SEMICN <CondOpt> SEMICN <ForStmtOpt> RPARENT <Stmt>
       | BREAKTK SEMICN
       | CONTINUETK SEMICN
       | RETURNTK <ExpOpt> SEMICN
       | <LVal> ASSIGN GETINTTK LPARENT RPARENT SEMICN
       | <LVal> ASSIGN GETCHARTK LPARENT RPARENT SEMICN
       | PRINTFTK LPARENT STRCON <ExpListTailOpt> RPARENT SEMICN

<ExpOpt> -> <Exp> | \epsilon
<ElseOpt> -> ELSETK <Stmt> | \epsilon
<ForStmtOpt> -> <ForStmt> | \epsilon
<CondOpt> -> <Cond> | \epsilon
<ExpListTailOpt> -> COMMA <Exp> <ExpListTailOpt> | \epsilon

Extracting Left Common Factors:

<Stmt> -> <LVal> ASSIGN <ReadOrOther>
       | <ExpOpt> SEMICN
       | <Block>
       | IFTK LPARENT <Cond> RPARENT <Stmt> <ElseOpt>
       | FORTK LPARENT <ForStmtOpt> SEMICN <CondOpt> SEMICN <ForStmtOpt> RPARENT <Stmt>
       | BREAKTK SEMICN
       | CONTINUETK SEMICN
       | RETURNTK <ExpOpt> SEMICN
       | PRINTFTK LPARENT STRCON <ExpListTailOpt> RPARENT SEMICN
<ReadOrOther> -> <Exp> SEMICN | <GetFunc> LPARENT RPARENT SEMICN
<GetFunc> -> GETINTTK | GETCHARTK

For Loop Statement

// <ForStmt> -> <LVal> ASSIGN <Exp>
<ForStmt> -> <LVal> ASSIGN <Exp>

Expression

// <Exp> -> <AddExp>
<Exp> -> <AddExp>

Condition Expression

// <Cond> -> <LOrExp>
<Cond> -> <LOrExp>

Left Value Expression

// <LVal> -> IDENFR [ LBRACK <Exp> RBRACK ]
<LVal> -> IDENFR <ArrSym>
<ArrSym> -> LBRACK <Exp> RBRACK | \epsilon

Primary Expression

// <PrimaryExp> -> LPARENT <Exp> RPARENT | <LVal> | <Number> | <Character>
<PrimaryExp> -> LPARENT <Exp> RPARENT | <LVal> | <Number> | <Character>

Number

// <Number> -> INTCON
<Number> -> INTCON

Character

// <Character> -> CHRCON
<Character> -> CHRCON

Unary Expression

// <UnaryExp> -> <PrimaryExp> | IDENFR LPARENT [ <FuncRParams> ] RPARENT | <UnaryOp> <UnaryExp>
<UnaryExp> -> <PrimaryExp> | IDENFR LPARENT <FuncRParamsList> RPARENT | <UnaryOp> <UnaryExp>
<FuncRParamsList> -> <FuncRParams> | \epsilon

Unary Operator

// <UnaryOp> -> PLUS | MINU | NOT
<UnaryOp> -> PLUS | MINU | NOT

Function Actual Parameters

// <FuncRParams> -> <Exp> { COMMA <Exp> }
<FuncRParams> -> <Exp> <ExpListTail>
<ExpListTail> -> COMMA <Exp> <ExpListTail> | \epsilon

Multiplication, Division, and Modulo Expression

// <MulExp> -> <UnaryExp> | <MulExp> ( MULT | DIV | MOD ) <UnaryExp>
<MulExp> -> <UnaryExp> | <MulExp> <MultDivModOp> <UnaryExp>
<MultDivModOp> -> MULT | DIV | MOD

Eliminating Left Recursion:

<MulExp> -> <UnaryExp> <_MulExp>
<_MulExp> -> <MultDivModOp> <MulExp> | \epsilon
<MultDivModOp> -> MULT | DIV | MOD

Addition and Subtraction Expression

// <AddExp> -> <MulExp> | <AddExp> ( PLUS | MINU ) <MulExp>
<AddExp> -> <MulExp> | <AddExp> <PlusMinuOp> <MulExp>
<PlusMinuOp> -> PLUS | MINU

Eliminating Left Recursion:

<AddExp> -> <MulExp> <_AddExp>
<_AddExp> -> <PlusMinuOp> <AddExp> | \epsilon
<PlusMinuOp> -> PLUS | MINU

Relational Expression

// <RelExp> -> <AddExp> | <RelExp> ( GRE | LSS | GEQ | LEQ ) <AddExp>
<RelExp> -> <AddExp> | <RelExp> <GreLssGeqLeqOp> <AddExp>
<GreLssGeqLeqOp> -> GRE | LSS | GEQ | LEQ

Eliminating Left Recursion:

<RelExp> -> <AddExp> <_RelExp>
<_RelExp> -> <GreLssGeqLeqOp> <RelExp> | \epsilon
<GreLssGeqLeqOp> -> GRE | LSS | GEQ | LEQ

Equality Expression

// <EqExp> -> <RelExp> | <EqExp> ( EQL | NEQ ) <RelExp>
<EqExp> -> <RelExp> | <EqExp> <EqlNeqOp> <RelExp>
<EqlNeqOp> -> EQL | NEQ

Eliminating Left Recursion:

<EqExp> -> <RelExp> <_EqExp>
<_EqExp> -> <EqlNeqOp> <EqExp> | \epsilon
<EqlNeqOp> -> EQL | NEQ

Logical AND Expression

// <LAndExp> -> <EqExp> | <LAndExp> AND <EqExp>
<LAndExp> -> <EqExp> | <LAndExp> AND <EqExp>

Eliminating Left Recursion:

<LAndExp> -> <EqExp> <_LAndExp>
<_LAndExp> -> AND <LAndExp> | \epsilon

Logical OR Expression

// <LOrExp> -> <LAndExp> | <LOrExp> OR <LAndExp>
<LOrExp> -> <LAndExp> | <LOrExp> OR <LAndExp>

Eliminating Left Recursion:

<LOrExp> -> <LAndExp> <_LOrExp>
<_LOrExp> -> OR <LOrExp> | \epsilon

Constant Expression

// <ConstExp> -> <AddExp>
<ConstExp> -> <AddExp>

Although this grammar is not entirely compliant with LL(1) grammar rules, it is sufficient for use.

5.2 Calculation of FIRST and FOLLOW sets

We represent the transformed grammar in the form of program source code within the compiler so that when the grammar changes, these two sets can be updated accordingly.

Next, we use pseudocode to describe the calculation process of these two sets, with the following symbol definitions:

  • G : Grammar
  • N : Set of non-terminals
  • T : Set of terminals
  • P : Set of productions

According to the following algorithm, calculate the FIRST set for each non-terminal:

function ComputeFIRST(G):
    initialize FIRST(A) = {} for each non-terminal A in N

    repeat
        for each production A -> α in P:
            for each symbol X in α:
                if X in T:
                    add X to FIRST(A)
                    break
                else:    (X is a non-terminal)
                    add all elements of (FIRST(X) - {ε}) to FIRST(A)

                    if ε in FIRST(X):
                        continue
                    else break
            if every symbol X in α has ε in FIRST(X):
                add ε to FIRST(A)

    until no changes to any FIRST set

    return FIRST

Alternatively, we can also use the transitivity of the FIRST set to recursively calculate the FIRST set:

function ComputeFIRST(G):
    initialize FIRST(A) = {} for each non-terminal A in N
    initialize visited(A) = false for each non-terminal A in N

    for each non-terminal A in N:
        if not visited(A):
            FIRST(A) = ComputeFIRSTRecursive(A)

    return FIRST

function ComputeFIRSTRecursive(A):
    if visited(A):
        return FIRST(A)

    visited(A) = true

    for each production A -> α in P:
        for each symbol X in α:
            if X in T:
                add X to FIRST(A)
                break
            else:    (X is a non-terminal)
                add all elements of (ComputeFIRSTRecursive(X) - {ε}) to FIRST(A)

                if ε not in FIRST(X):
                    break

        if all symbols in α can derive ε:
            add ε to FIRST(A)

    return FIRST(A)

According to the following algorithm, calculate the FOLLOW set for each non-terminal:

function ComputeFOLLOW(G):
    initialize FOLLOW(A) = {} for each non-terminal A in N
    initialize FOLLOW(S) = { $ } for the start symbol S

    repeat
        for each production A -> α in P:
            for each symbol B in α:
                if B in N:
                    let β be the part of α that follows B
                    add all elements of (FIRST(β) - {ε}) to FOLLOW(B)

                    if ε in FIRST(β) or B is the last symbol in α:
                        add all elements of FOLLOW(A) to FOLLOW(B)

    until no changes to any FOLLOW set

    return FOLLOW

We can output the calculation results to a local file to monitor the program's operation.

5.3 Generation of Predictive Parsing Table

After calculating the FIRST and FOLLOW sets, we can use these two sets to generate a predictive parsing table.

For each production A->α in grammar G , do the following:

  • For each terminal symbol a in FIRST(α), add A->α to M[A,a] .
  • If ε is in FIRST(α), then for each terminal symbol b in FOLLOW(A), add A->α to M[A,b]; in particular, if $ is in FOLLOW(A), add A->α to M[A,$].

After completing the above operations, if there are no productions in M[A,a] , set M[A,a] to an error, which we usually represent with an empty entry in the table.
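
The construction rules above translate almost directly into code. In the following sketch, grammar symbols are plain strings and the Production record is a hypothetical stand-in for the project's Grammar types:

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TableBuilderSketch {
    record Production(String lhs, List<String> rhs) {}

    /** Builds M[A][a]; a conflicting entry would be overwritten here (see 5.4). */
    static Map<String, Map<String, Production>> build(List<Production> productions,
                                                      Map<String, Set<String>> first,
                                                      Map<String, Set<String>> follow,
                                                      Set<String> terminals) {
        Map<String, Map<String, Production>> table = new HashMap<>();
        for (Production p : productions) {
            Set<String> firstAlpha = firstOfSeq(p.rhs(), first, terminals);
            for (String a : firstAlpha) {
                if (!a.equals("ε")) {
                    table.computeIfAbsent(p.lhs(), k -> new HashMap<>()).put(a, p);
                }
            }
            if (firstAlpha.contains("ε")) {
                // FOLLOW sets are assumed to already contain "$" where appropriate.
                for (String b : follow.get(p.lhs())) {
                    table.computeIfAbsent(p.lhs(), k -> new HashMap<>()).put(b, p);
                }
            }
        }
        return table;
    }

    /** FIRST of a sentential form α, given the FIRST sets of all non-terminals. */
    static Set<String> firstOfSeq(List<String> alpha, Map<String, Set<String>> first,
                                  Set<String> terminals) {
        Set<String> out = new HashSet<>();
        for (String x : alpha) {
            if (x.equals("ε")) { out.add("ε"); return out; }
            if (terminals.contains(x)) { out.add(x); return out; }
            Set<String> firstX = first.get(x);
            out.addAll(firstX);
            out.remove("ε");
            if (!firstX.contains("ε")) return out;  // x cannot vanish, stop here
        }
        out.add("ε");                               // every symbol in α can derive ε
        return out;
    }
}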

Similarly, we can print the predictive parsing table to a local .csv file to monitor the results.

5.4 Syntax Analysis Main Program

The syntax analysis main program uses a predictive analysis table-driven predictive analysis algorithm, with pseudocode as follows:

function PredictiveParser(input, parseTable, startSymbol):
    initialize stack = [ $, startSymbol ]
    initialize pointer = 0
    input = input + $

    while stack is not empty:
        top = stack.pop()

        currentInput = input[pointer]

        if top is a terminal or top == $:
            if top == currentInput:
                pointer = pointer + 1
            else:
                return "Error: Unexpected input symbol"

        else if top is a non-terminal:
            production = parseTable[top, currentInput]
            if production is empty:
                return "Error: No matching production"

            else:
                for symbol in reverse(production):
                    if symbol != ε:
                        stack.push(symbol)

        else:
            return "Error: Invalid symbol on stack"

    if pointer == len(input):
        return "Success: Input is parsed successfully"
    else:
        return "Error: Input is not fully parsed"

However, since our grammar is not entirely an LL(1) grammar, some entries in the predictive analysis table will have two productions, and we need to handle these cases individually.

The problematic entries are as follows:

Non-Terminal \ Terminal    IDENFR    ELSETK    INTTK    CHARTK
<UnaryExp>                   ✓
<Stmt>                       ✓
<ElseOpt>                              ✓
<DeclList>                                       ✓        ✓
<FuncDefList>                                    ✓

We will analyze each of these entries one by one.

<UnaryExp>,IDENFR

<UnaryExp> -> <PrimaryExp>
    | IDENFR LPARENT <FuncRParamsList> RPARENT

The relevant productions are as follows:

<PrimaryExp> -> LPARENT <Exp> RPARENT | <LVal> | <Number> | <Character>
<LVal> -> IDENFR <ArrSym>
<ArrSym> -> LBRACK <Exp> RBRACK | \epsilon

This problem is easy to solve; read one terminal symbol forward, if the next terminal symbol is LPARENT , it indicates the second production, otherwise, it is the first production.
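
In code, this check is a single peek at the token after IDENFR. The parser skeleton below is hypothetical; only the lookahead rule itself reflects the analysis above:

import java.util.List;

public class UnaryExpLookahead {
    private final List<Token> tokens;   // output of the lexical analyzer
    private int pos;                    // index of the current token

    public UnaryExpLookahead(List<Token> tokens) { this.tokens = tokens; }

    /** true: use IDENFR LPARENT <FuncRParamsList> RPARENT; false: fall back to <PrimaryExp>. */
    boolean unaryExpIsCall() {
        return tokens.get(pos).getCategory() == TokenCategory.IDENFR
                && pos + 1 < tokens.size()
                && tokens.get(pos + 1).getCategory() == TokenCategory.LPARENT;
    }
}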

<Stmt>,IDENFR

<Stmt> -> <ExpOpt> SEMICN
    | <LVal> ASSIGN <ReadOrOther>

The relevant productions are as follows:

<ExpOpt> -> <Exp> | \epsilon
<Exp> -> <AddExp>
<AddExp> -> <MulExp> <_AddExp>
<MulExp> -> <UnaryExp> <_MulExp>
<UnaryExp> -> <PrimaryExp> | IDENFR LPARENT <FuncRParamsList> RPARENT | <UnaryOp> <UnaryExp>
<PrimaryExp> -> LPARENT <Exp> RPARENT | <LVal> | <Number> | <Character>

This problem is more complex; we can do the following:

  1. Read one terminal symbol forward, if it is LPARENT , it indicates the first production.
  2. If it does not meet the conditions of the first step, first analyze the syntax component of <LVal> , then judge the next terminal symbol after <LVal> , if it is ASSIGN , it indicates the second production, otherwise, it is the first production.

<ElseOpt>,ELSETK

<ElseOpt> -> ELSETK <Stmt>
    | \epsilon

This is a classic dangling else problem; we can follow the convention to match each else with the nearest unmatched if .

<DeclList>,INTTK and <FuncDefList>,INTTK

<DeclList> -> <Decl> <DeclList>
    | \epsilon
<FuncDefList> -> <FuncDef> <FuncDefList>
    | \epsilon

There are many relevant productions, so they are not listed here.

The solution is simple. For <DeclList>, read two terminal symbols forward: if the second is LPARENT, it indicates the second production (\epsilon), otherwise the first. For <FuncDefList>, read one terminal symbol forward: if it is MAINTK, it indicates the second production, otherwise the first.

<DeclList>,CHARTK

<DeclList> -> <Decl> <DeclList>
    | \epsilon

This case is handled like the previous one: read two terminal symbols forward; if the second is LPARENT, it indicates the second production, otherwise the first.

5.5 Error Handling

According to the evaluation requirements, the source program may be missing SEMICN, RPARENT, or RBRACK tokens. Whenever such a terminal symbol is missing, we simply create one and add it to the syntax analysis tree; the newly created terminal carries the same line number as the preceding terminal. After processing completes, the error message is printed to the error output stream.
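
A sketch of this recovery step, where current, advance, previous, and makeToken are hypothetical helpers over the token stream:

// Inside the parser; current(), advance(), previous(), and makeToken() are
// hypothetical helpers over the token stream, and `errors` collects messages.
private ParseTreeLeafNode expect(TokenCategory expected, char errorCode) {
    if (current().getCategory() == expected) {
        return new ParseTreeLeafNode(advance());               // the normal case
    }
    // Fabricate the missing SEMICN / RPARENT / RBRACK on the previous token's line.
    Token patched = makeToken(expected, previous().getLineNumber());
    errors.add(previous().getLineNumber() + " " + errorCode);  // e.g. "12 i"
    return new ParseTreeLeafNode(patched);
}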

5.6 Evaluation Instructions

The output of the syntax analyzer is the syntax analysis tree, and the compiler main program will post-order traverse this syntax analysis tree to output the required terminal and non-terminal symbols.

Considering that we have rewritten the grammar, the structure of the syntax analysis tree will be different from the structure of the syntax analysis tree of the original grammar, which will lead to different output results. We can handle it as follows:

  1. For the newly added non-terminal symbols and the existing non-terminal symbols that are not required to be output, do not output their analysis information during the traversal process.
  2. For the rewritten left-recursive grammar, place the operation of outputting the root node in the middle of the operations of outputting the left subtree and the right subtree.

Pseudocode description is as follows:

function ParseAndPrint(node):
    if node is <MulExp> or <AddExp> or <RelExp> or <EqExp> or <LAndExp> or <LOrExp>:
        ParseAndPrint(node.left)
        print(node)
        ParseAndPrint(node.right)
        return
    else:
        for child in node.children:
            ParseAndPrint(child)
        print(node)

To facilitate monitoring of the output results for debugging, the following algorithm can be used to output the syntax analysis tree to a file:

function PrintTree(tree, prefix, isLast):
    print(prefix + (isLast ? "└── " : "├── ") + tree)
    for child in tree.children:
        set newPrefix = prefix + (isLast ? "    " : "│   ")
        PrintTree(child, newPrefix, child is the last element of tree.children)

PrintTree(SyntaxTree, "", true)

6. Semantic Analysis Design

6.1 Main Design Philosophy

The primary task of semantic analysis is to generate a symbol table and check for semantic errors in the source code.

Although semantic analysis can be integrated into the syntax analysis process, to avoid extensive modifications to the existing syntax analysis code and to facilitate the writing of semantic analysis, semantic analysis in this task will be placed after syntax analysis. The semantic analyzer will use the syntax tree output by the syntax analyzer, traverse the syntax tree to generate a symbol table, and check for semantic errors within it.

6.2 Symbol Table Management

In this compiler design, the symbol table used will be represented as a symbol table tree, with the root node representing the global scope and the other nodes representing local scopes of different nesting depths.

During the semantic analysis process, we also maintain a pointer to the symbol table tree node corresponding to the current scope, which we call the top-level scope pointer. Initially, it points to the root node of the symbol table tree, meaning the current scope is the global scope.

Whenever we enter a new scope, we will create a new symbol table tree node and update the top-level scope pointer to point to this newly created node. Whenever we exit a scope, we need to update the top-level scope pointer so that it points to the symbol table tree node corresponding to the outer scope of the original scope. To achieve this, each symbol table tree node, in addition to storing the symbol table items corresponding to its scope, also needs to store a pointer to its upper node and a list of pointers to its lower nodes. Naturally, the upper node pointer of the root node is null.

The final symbol table will be shown in the figure below:

(Figure: SymbolTableTree, the structure of the symbol table tree)

During the semantic analysis process, if we are dealing with variable declarations or function declarations, we will look up the current top-level scope symbol table tree node to check if there is already an entry with the same name. If there is, an error should be reported; otherwise, a new symbol table entry should be added. If we are dealing with a call to a symbol, we will start from the current top-level scope symbol table tree node and search all the way to the root node of the symbol table tree until we find the corresponding symbol table entry. If we search all the way to the root node and still do not find the corresponding symbol table entry, an error should be reported.
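
The following sketch captures the scope tree and both lookup rules; the method names are illustrative, while EnvItem is the class from semantic/env:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Env {
    private final Env parent;                   // null for the global scope
    private final List<Env> children = new ArrayList<>();
    private final Map<String, EnvItem> items = new LinkedHashMap<>();  // keeps insertion order

    public Env(Env parent) {
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this);          // link into the scope tree
        }
    }

    /** Declaration: fails only if the name already exists in the current scope. */
    public boolean define(EnvItem item) {
        return items.putIfAbsent(item.getName(), item) == null;
    }

    /** Use of a symbol: search from the current scope outward to the global scope. */
    public EnvItem resolve(String name) {
        for (Env env = this; env != null; env = env.parent) {
            EnvItem item = env.items.get(name);
            if (item != null) {
                return item;
            }
        }
        return null;                            // caller reports an undefined-symbol error
    }

    public List<Env> getChildren() { return children; }
    public Map<String, EnvItem> getItems() { return items; }
}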

6.3 Evaluation Instructions

With a complete syntax tree already available, the semantic analysis work becomes quite straightforward. What really needs attention are the evaluation requirements.

For correct source programs, the symbol table information needs to be output in the order of scopes. We simply need to perform a pre-order depth-first search on the symbol table tree, outputting the entry information of each node during the traversal. For easy lookup, the entries in the nodes are stored using a hash table, but the evaluation requires that the symbol names be output in the order they were inserted. If writing a compiler in Java, we can utilize the LinkedHashMap class provided by the Java collection framework. It is a hash table structure that can record the order of element insertion, which perfectly meets our needs.
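
Given the Env sketch above, the dump is a short pre-order walk; because each scope's items live in a LinkedHashMap, entries print in insertion order (the exact output format is whatever the evaluation requires):

// Pre-order depth-first dump of the symbol table tree.
static void printScopes(Env env, java.io.PrintStream out) {
    for (EnvItem item : env.getItems().values()) {   // insertion order, via LinkedHashMap
        out.println(item.getName());                 // plus whatever details are required
    }
    for (Env child : env.getChildren()) {
        printScopes(child, out);
    }
}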

For erroneous source programs, since our semantic analysis runs only after syntax analysis is complete, lexical errors, syntax errors, and the newly detected semantic errors cannot simply be printed as they are encountered. Therefore, we use a data structure to store the errors detected during analysis; after the analysis finishes, all errors are merged and sorted before being output together.
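
A sketch of such an error collector; the CompileError record is a hypothetical container keyed by line number and the category codes from Appendix B:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ErrorReporter {
    public record CompileError(int line, char code) {}

    private final List<CompileError> errors = new ArrayList<>();

    /** All phases (lexical, syntax, semantic) append to the same list. */
    public void report(int line, char code) {
        errors.add(new CompileError(line, code));
    }

    /** Called once, after semantic analysis: sort by line number, then print. */
    public void flush() {
        errors.sort(Comparator.comparingInt(CompileError::line));
        for (CompileError e : errors) {
            System.err.println(e.line() + " " + e.code());  // e.g. "12 i"
        }
    }
}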

Appendix

A. Lexical Unit Definition

Word Name      Category Code
Ident          IDENFR
IntConst       INTCON
StringConst    STRCON
CharConst      CHRCON
main           MAINTK
const          CONSTTK
int            INTTK
char           CHARTK
break          BREAKTK
continue       CONTINUETK
if             IFTK
else           ELSETK
!              NOT
&&             AND
||             OR
for            FORTK
getint         GETINTTK
getchar        GETCHARTK
printf         PRINTFTK
return         RETURNTK
+              PLUS
-              MINU
void           VOIDTK
*              MULT
/              DIV
%              MOD
<              LSS
<=             LEQ
>              GRE
>=             GEQ
==             EQL
!=             NEQ
=              ASSIGN

B. Error Category

Error Category                                Code    Description
Illegal Symbol                                a       Lexical error: a lone & or | appears where && or || was intended.
Symbol Redefinition                           b       The symbol name is redefined within the current scope.
Undefined Symbol                              c       An undefined identifier is used.
Mismatched Function Params                    d       The number of actual arguments passed to a function does not match the number of formal parameters defined.
Function Params Type Mismatch                 e       The type of an actual argument passed to a function does not match the type of the corresponding formal parameter.
Mismatched return Statement                   f       A return statement with a value appears in a function that does not return a value.
Missing return Statement                      g       A function that returns a value is missing a return statement at the end.
Attempt to Modify Constant Value              h       The value of a constant left value is modified.
Missing Semicolon                             i       Syntax error.
Missing Right Parenthesis                     j       Syntax error.
Missing Right Bracket                         k       Syntax error.
Format Characters and Expressions Mismatch    l       The number of format characters in a printf control string does not match the number of expressions passed.
Illegal Loop Control Statement                m       continue or break is used outside a loop block.