Module 3: Hands-On with hyggec


This module outlines the implementation of the hyggec compiler, in its minimal version that implements the Hygge0 programming language specification. We explore how hyggec works, and how it translates a Hygge0 program into corresponding RISC-V code. We see how to extend both the Hygge0 language and the hyggec compiler with new features. We conclude this module with some Project Ideas.

Important

This module introduces quite a lot of information that you will use throughout the rest of the course, while working on your project.

Here is a recommendation:

  • first, focus on grasping the overall picture of how hyggec works, without trying to memorise every detail;

  • then, use this module as a reference: come back here when you need to investigate some aspect of the hyggec compiler in more detail. While working on the project, you will also become more familiar with the hyggec internals.

Quick Start

  1. Go to the hyggec Git repository: https://gitlab.gbar.dtu.dk/02247/f26/hyggec (NOTE: you will need to log in with your DTU account)

  2. Create your own private fork of the repository, by clicking on the “Fork” button (to the right of the hyggec title and logo)

    Important

    To avoid possible plagiarism issues, make sure your forked repository is private, and only accessible by you and the other members of your project group. You can configure the access permissions on the forked repository page, on the menu to the left, by clicking “Manage” → “Members”.

  3. Clone your forked repository on your computer (use the “Code” button to the right; you may choose between cloning via “SSH” or “HTTPS”, and the former is probably better, although you may need to configure your SSH keys for authentication)

  4. Follow the instructions in the file README.md in the hyggec repository.

Important

  • The hyggec Git repository linked above will be updated during the course with the new features presented in each Module. Each feature update will have a separate Git tag to distinguish it. The instructions in this Module refer to the hyggec code up to the tag hygge0.

  • If you want, you can find all the hyggec compiler updates (covering the whole course) in the “full” Git repository: https://gitlab.gbar.dtu.dk/02247/f26/hyggec-full. Both the hyggec and hyggec-full repositories will have the same Git history and tags: the teacher will progressively push the changes from hyggec-full into the hyggec repository. You may decide to work on your project by directly cloning (or pulling changes from) the hyggec-full repository: this may help minimise merge conflicts with your project work during the course. The drawback is that the hyggec-full repository contains code that will only be explained later in the course, and may be initially unclear.

The Compiler Phases of hyggec

At the beginning of the course, we discussed what a compiler is, and which typical phases allow a compiler to translate an input program into an output program (Fig. 2).

The hyggec compiler has the phases illustrated in Fig. 2. In this section we explore how each phase is implemented. A more detailed picture is presented in Fig. 3, where:

  • each compiler phase is annotated with the file that handles it in hyggec;

  • each intermediate compilation product is annotated with the data type that hyggec uses to represent it;

  • unlike Fig. 2, hyggec also includes an interpreter that is not part of the basic compilation pipeline. The interpreter provides a reference for the language semantics and can be integrated in the compilation process to implement optimisations (as we will see later in the course).

Diagram of the `hyggec` compiler phases and components

Fig. 3 Phases and main components of the hyggec compiler. (Diagram based on Fig. 2.)

Overview of the hyggec Source Tree

Here is a quick summary of the hyggec source code structure: the files and directories are roughly ordered depending on how soon (and how often) you will use or modify them.

Tip

When exploring the hyggec source code with Visual Studio Code (or similar IDEs), try to hover with the mouse pointer on types, function names, operators, and variables: in most cases, you will see a brief description of their purpose (and you can jump to their definition, if you want).

Table 4 Overview of the hyggec source tree.

  • hyggec and hyggec.bat: Scripts to run hyggec on Unix-like and Windows operating systems. Both scripts automatically recompile hyggec if its source code has changed since the last run.

  • src/: Directory containing the hyggec source files.

  • tests/: Directory containing the hyggec test suite. We discuss the structure of its subdirectories in The Test Suite of hyggec.

  • src/AST.fs: Representation of the Abstract Syntax Tree (AST) of Hygge0, based on Definition 1. This file is discussed below in The Abstract Syntax Tree.

  • src/ASTUtil.fs: Utility functions for manipulating an AST (e.g. applying substitutions).

  • src/Lexer.fs: Lexer that reads Hygge0 source code files and builds a corresponding sequence of tokens. This file is discussed below in The hyggec Lexer.

  • src/Parser.fs and src/ParserUtils.fs: Parser (plus auxiliary functions) that processes a sequence of tokens and builds the corresponding Hygge0 AST. These files are discussed below in The hyggec Parser.

  • src/Interpreter.fs: Interpreter for Hygge programs, based on the Structural Operational Semantics of Hygge0. This file is discussed below in The Built-In Interpreter.

  • src/Type.fs: Definition of Hygge0 types (based on Definition 5). This file is discussed below in Types and Type Checking.

  • src/Typechecker.fs: Functions for type-checking a given Hygge0 program AST, according to Definition 8 and Definition 11 (thus including Subtyping). This file is discussed below in Types and Type Checking.

  • src/RISCVCodegen.fs: Functions for generating RISC-V assembly code from a well-typed Hygge0 program AST. This file is discussed below in Code Generation Strategy.

  • src/RISCV.fs: Functions for creating and manipulating RISC-V assembly code fragments. This file is discussed below in RISC-V Code Generation Utilities.

  • src/PrettyPrinter.fs: Functions to pretty-print various compiler data structures (e.g. ASTs or typing environments).

  • src/Log.fs: Utility functions for logging messages on the console.

  • src/Util.fs: Miscellaneous utility functions (e.g. for generating unique strings or numbers).

  • src/CmdLine.fs: Configuration of the hyggec command line options.

  • src/Program.fs: The main program, with the main function.

  • examples/: A few example Hygge0 programs.

  • src/Test.fs: Test suite configuration and execution functions. The test suite uses the Expecto testing library.

  • hyggec.fsproj: .NET project file.

  • src/RARS.fs: Utility functions to launch RARS (the RISC-V Assembler and Runtime Simulator) and immediately execute the compiled RISC-V assembly code.

  • lib/: A copy of RARS, used by src/RARS.fs (see above).

  • rars and rars.bat: Scripts to launch RARS on Unix-like and Windows operating systems, using the copy of RARS contained in the lib/ directory.

  • README.md and LICENSE.md: Should be self-explanatory…

  • fsharplint.json: Configuration file for FSharpLint (you can ignore it).

  • bin/ and obj/: Auto-generated directories, overwritten when the hyggec source is recompiled. Do not edit their contents!

The Abstract Syntax Tree

After the specification of a programming language is completed, one of the first steps towards developing a compiler is (typically) defining the data structures needed for the internal representation of a syntactically-correct program, after it has been read from an input file. This internal representation is called Abstract Syntax Tree (AST), where the term “abstract” differentiates it from the concrete syntax tree handled by The hyggec Lexer and Parser, which we discuss later. The AST definition follows the grammar of the language being compiled, with possible adjustments for simplifying the overall compiler implementation.

In the case of hyggec, the AST is defined in the file src/AST.fs and is based on the Hygge0 grammar in Definition 1. However, the AST definition includes some adjustments, which we now illustrate.

Defining the AST: First Attempt

In a programming language like F#, we can very naturally specify an AST using discriminated unions. For instance, we can:

  1. define a type Expr for representing a generic Hygge0 expression \(e\), with a dedicated named case to represent each different kind of expression in Definition 1; and

  2. in each named case, use fields of type Expr for representing sub-expressions.

For example, starting from the bottom of Definition 1, we may define our AST type Expr as:

type Expr =
    | UnitVal                  // Unit value
    | BoolVal of value: bool   // Boolean value
    | IntVal of value: int     // Integer value
    // ...
    | Var of name: string      // Variable

    // Addition between left-hand-side and right-hand-side sub-expressions.
    | Add of lhs: Expr
           * rhs: Expr

    // Multiplication between left-hand-side and right-hand-side sub-expressions.
    | Mult of lhs: Expr
            * rhs: Expr

    // ...

Using the type definition above, the Hygge0 expression “\(42 + x\)” would be represented in F# as:

Add( IntVal(42), Var("x") )

This approach is valid, but it has three drawbacks.

  1. The definition of Expr above does not include information about the position (line and column number) of each sub-expression in the original input file being compiled. We will need this information to generate helpful error messages during type checking, to help Hygge0 programmers find and fix errors in their programs.

  2. During type checking, we take a Hygge0 expression \(e\) and compute its typing derivation by following the syntactic structure of \(e\). Therefore, the typing derivation of \(e\) has pretty much the same shape as the syntax tree of \(e\): see, for instance, the typing derivations in Example 14 and Example 15. The main difference is that, unlike a syntax tree, a typing derivation includes information about:

    • the type assigned to each sub-expression, and

    • the typing environment used to assign such a type.

    This information will be necessary for code generation, so it is very convenient to save it somewhere, when it is generated during type checking. And given the similarities between an AST and a typing derivation, it would be nice to somehow represent both of them with the same data type, thus avoiding code duplication.

  3. Several Hygge0 expressions have very similar semantics (Definition 4), typing rules (Definition 8), and code generation logic (as we will see later). To reduce the amount of duplicated code, it is convenient to represent such expressions with a common case in Expr, adding a field to distinguish them when necessary. This consideration applies to binary arithmetic expressions (“\(e_1 + e_2\)” and “\(e_1 * e_2\)”), binary boolean operations (“\(\hygOr{e_1}{e_2}\)” and “\(\hygAnd{e_1}{e_2}\)”), and relational operations (“\(e_1 = e_2\)” and “\(e_1 < e_2\)”).

The hyggec AST Definition

We address the drawbacks discussed above in the final definition of the Hygge0 AST, which is available in the file src/AST.fs. The key idea is that we “wrap” each Expr object with a record of type Node (representing a node of the AST) which includes:

  1. the position of the expression in the original source file, and

  2. information about the typing of the expression. This information is absent before type-checking, and becomes available after type-checking.

Moreover, in the discriminated union type Expr we define some cases that capture multiple Hygge0 expressions:

  • BinNumOp for binary numerical operations, distinguished by a field op of type NumericalOp;

  • BinLogicOp for binary logical operations, distinguished by a field op of type LogicOp;

  • BinRelOp for binary relational operations, distinguished by a field op of type RelationalOp.
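The three operator fields can be small discriminated unions of their own. Here is a minimal sketch, consistent with the comments in the AST listing further below (the actual definitions are in src/AST.fs and may differ in detail):

```fsharp
/// Binary numerical operator (field 'op' of BinNumOp).
type NumericalOp = Add | Mult

/// Binary logical operator (field 'op' of BinLogicOp).
type LogicOp = And | Or

/// Binary relational operator (field 'op' of BinRelOp).
type RelationalOp = Eq | Less
```

With this encoding, supporting e.g. a new numerical operator only requires adding a case to NumericalOp, rather than a whole new case to Expr.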

As a result, we obtain a unified data type that includes the position of each expression, and can represent ASTs both before and after they are type-checked, while avoiding code duplication for similar expressions. This approach is adopted by many compilers (although the details may vary depending on the implementation programming language); the design here is inspired by the Scala 3 compiler.

The final definition of AST Nodes and Expressions in hyggec is the following.

type Node<'E,'T> = 
    {
      Expr: Expr<'E,'T> // Hygge expression contained in this AST node
      Pos: Position     // Position of the expression in the input source file
      Env: 'E           // Typing environment used to type-check the expression
      Type: 'T          // Type assigned to the expression
    }

and Expr<'E,'T> =
    | UnitVal                  // Unit value
    | BoolVal of value: bool   // Boolean value
    | IntVal of value: int     // Integer value
    // ...
    | Var of name: string      // Variable

    /// Binary numerical operation between lhs and rhs.
    | BinNumOp of op: NumericalOp     // 'op' can be either 'Add' or 'Mult'
                * lhs: Node<'E,'T>
                * rhs: Node<'E,'T>

    /// Binary logical operation between lhs and rhs.
    | BinLogicOp of op: LogicOp       // 'op' can be either 'And' or 'Or'
                  * lhs: Node<'E,'T>
                  * rhs: Node<'E,'T>

    /// Logical not.
    | Not of arg: Node<'E,'T>

    /// Binary relational operation between lhs and rhs.
    | BinRelOp of op: RelationalOp    // 'op' can be either 'Eq' or 'Less'
                * lhs: Node<'E,'T>
                * rhs: Node<'E,'T>

    // ...

The two type arguments of Node<'E,'T> and Expr<'E,'T> are used as follows:

  • 'E specifies what typing environment information is associated to each expression in the AST;

  • 'T specifies what type information is assigned to each expression in the AST.

The hyggec compiler handles two kinds of ASTs, both based on the definition of Node<'E,'T> above:

  • UntypedAST, which contains expressions of type UntypedExpr (both defined in src/AST.fs). UntypedAST and UntypedExpr are just type aliases, respectively, for Node<unit, unit> and Expr<unit, unit>: they represent AST nodes and expressions without any information (unit) about their typing environments or types (see Example 18 below). This kind of AST represents a syntactically-valid Hygge0 expression read from an input file; it is produced by The hyggec Lexer and Parser, and it may contain expressions that get stuck, like the ones discussed in Example 13.

  • TypedAST, which contains expressions of type TypedExpr (both defined in src/Typechecker.fs). TypedAST and TypedExpr are just type aliases, respectively, for Node<TypingEnv, Type> and Expr<TypingEnv, Type>: they represent AST nodes and expressions that have been type-checked, and have typing information available (similarly to a typing derivation). We discuss TypedASTs in Types and Type Checking.
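Spelled out in F#, the four aliases described above look as follows (a sketch; the actual definitions are split between src/AST.fs and src/Typechecker.fs, where TypingEnv is defined):

```fsharp
// Untyped AST: no typing information (unit) before type-checking.
type UntypedExpr = Expr<unit, unit>
type UntypedAST  = Node<unit, unit>

// Typed AST: typing environment and type attached after type-checking.
type TypedExpr = Expr<TypingEnv, Type>
type TypedAST  = Node<TypingEnv, Type>
```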

Example 18 (Untyped AST of a Hygge0 Expression)

Consider the Hygge0 expression “\(42 + x\)”. Its representation in F# as an instance of type UntypedAST (i.e. Node<unit, unit>) is the following:

{
    Pos  = ... // Position in the input source file
    Env  = ()
    Type = ()
    Expr = BinNumOp(
                    Add,           // Type of binary numerical operation
                    {
                        Pos  = ... // Position in the input source file
                        Env  = ()
                        Type = ()
                        Expr = IntVal(42)
                    },
                    {
                        Pos  = ... // Position in the input source file
                        Env  = ()
                        Type = ()
                        Expr = Var("x")
                    } )
}

The file src/AST.fs also defines two types called PretypeNode and Pretype that represent, respectively, the syntax tree node of a pretype, and the pretype itself from Definition 1.

type PretypeNode =
    {
      Pos: Position    // Position of the pretype in the source file
      Pretype: Pretype // Pretype contained in this Abstract Syntax Tree node
    }

and Pretype =
    | TId of id: string  // A type identifier

Note

This representation of pretypes with two data types may seem a bit redundant — but it will allow us to easily extend the Pretype definition (and the types supported by hyggec) later in the course.
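As a purely hypothetical illustration of such an extension (this is not part of Hygge0), a new pretype case could be added to Pretype while reusing PretypeNode for its components, without changing PretypeNode at all:

```fsharp
and Pretype =
    | TId of id: string   // A type identifier
    // Hypothetical new case: a function pretype with argument and return pretypes
    | TFun of args: List<PretypeNode> * ret: PretypeNode
```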

The hyggec Lexer and Parser

After we establish the internal representation of the AST, a logical next step in constructing a compiler is to develop the functions that build such ASTs by reading some input text (i.e. the source code we wish to compile). For this purpose, we need to develop two components.

  1. A lexer (a.k.a. tokenizer or scanner) that reads the input text and classifies groups of characters that are either “useful” for building a syntax tree (e.g. identifiers, operators, parentheses…) or “useless” (e.g. sequences of white spaces, newlines, comments…). These groups of characters are typically recognised by matching them against a set of regular expressions; the goal of the lexer is to discard useless groups of characters, and turn the useful groups into tokens (or lexemes): each token captures a group of characters and assigns to it a lexical category. The lexer should also report a tokenization error if it sees a sequence of input characters that it cannot classify.

  2. A parser that reads a stream of tokens produced by a lexer, and tries to match them against a set of grammar rules, thus producing a corresponding syntax tree. The parser should produce a syntax error if it is unable to match the input tokens with any of the grammar rules.

In the following subsections we explore the implementation of The hyggec Lexer and The hyggec Parser.

The hyggec Lexer

The file src/Lexer.fs implements a function tokenize that reads the contents of an input file (which should contain Hygge0 source code) and produces a sequence of tokens.

The beginning of src/Lexer.fs defines a discriminated union type called Token whose cases represent all the tokens of the Hygge0 programming language.

/// A token, possibly carrying a value.
type Token =
    /// Left parenthesis.
    | LPAREN
    /// Right parenthesis.
    | RPAREN
    /// Left curly bracket.
    | LCURLY
    /// Right curly bracket.
    | RCURLY

    // ...

    /// Integer literal.
    | LIT_INT of value: int
    /// Floating-point literal.
    | LIT_FLOAT of value: single
    /// Boolean literal.
    | LIT_BOOL of value: bool
    /// String literal.
    | LIT_STRING of value: string
    /// Unit literal.
    | LIT_UNIT
    /// Generic identifier (might result in a variable, pretype, etc.).
    | IDENT of value: string
    /// End of file.
    | EOF

Then, src/Lexer.fs defines two data types:

  • Position to represent a position in an input file – mainly, line and column numbers; and

  • TokenWithPos, a record that combines a Token with two Positions – one marking the beginning and one marking the end of the token in the input file.
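A sketch of these two types, with field names inferred from their use in the tokenizeRec function below (the actual definitions are in src/Lexer.fs):

```fsharp
/// A position in an input file.
type Position =
    { Filename: string  // Name of the input file
      Line: int         // Line number (starting from 1)
      Column: int       // Column number (starting from 1)
      Offset: int }     // Character offset from the beginning of the input

/// A token with the positions where it begins and ends in the input file.
type TokenWithPos =
    { Token: Token
      Begin: Position
      End: Position }
```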

The core of the tokenization logic is the function tokenizeRec (called by tokenize), which recursively traverses an input string, applies a series of rules to pattern-match its characters and produce corresponding tokens, and accumulates such tokens into a collection acc. The function returns Ok if it tokenizes the whole string, or Error (with a message) if it encounters input characters that cannot be matched by any rule. The function tokenizeRec looks as follows:

 1  /// Recursively tokenize the given 'input' string, using 'pos' as the current
 2  /// position, and accumulating the accepted tokens with positions in 'acc'.
 3  /// Return either Ok (), or Error with an error message.
 4  let rec tokenizeRec (input: string) (pos: Position)
 5                      (acc: ResizeArray<TokenWithPos>): Result<unit,string> =
 6      match input[pos.Offset..] with
 7      | "" -> // We reached the end of the input string: add EOF and return Ok
 8          acc.Add { Token = EOF; Begin = pos; End = pos }
 9          Ok ()
10
11      | Whitespace pos pos'
12      | Comment    pos pos'
13      | Newline    pos pos' ->
14          // Just keep tokenizing the rest of the input at position pos'
15          tokenizeRec input pos' acc
16
17      // IMPORTANT: longer symbols must be matched first, otherwise they may be
18      // tokenized as multiple shorter symbols. E.g., if the rule for token
19      // LIT_UNIT is moved after LPAREN and RPAREN, then LPAREN and RPAREN will be
20      // matched first and LIT_UNIT will never be produced.
21      | Symbol "()" LIT_UNIT pos (accepted, pos')
22      | Symbol "("  LPAREN   pos (accepted, pos')
23      | Symbol ")"  RPAREN   pos (accepted, pos')
24      | Symbol "{"  LCURLY   pos (accepted, pos')
25      | Symbol "}"  RCURLY   pos (accepted, pos')
26      | Symbol "+"  PLUS     pos (accepted, pos')
27      | Symbol "*"  TIMES    pos (accepted, pos')
28      | Symbol "="  EQ       pos (accepted, pos')
29      | Symbol "<"  LT       pos (accepted, pos')
30      | Symbol ";"  SEMI     pos (accepted, pos')
31      | Symbol ":"  COLON    pos (accepted, pos')
32      | Regex @"(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?f" mkFloatLit       pos (accepted, pos')
33      | Regex @"\d+"                               mkIntegerLit     pos (accepted, pos')
34      | Regex "\"(\\\\\"|[^\"])*\""                mkStringLit      pos (accepted, pos')
35      | Regex @"[a-zA-Z_][a-zA-Z0-9_]*"            mkKeywordOrIdent pos (accepted, pos') ->
36          // Add the accepted token to acc and keep tokenizing the rest of the
37          // input at position pos'
38          acc.Add accepted
39          tokenizeRec input pos' acc
40
41      | other -> // There are more input characters but they do not match any rule
42          /// The first 8 (at most) characters of the remaining input
43          let len = min (other.Length-1) 7
44          Error $"{pos.Filename}:{pos.Line}:{pos.Column}: unrecognized input: {other[..len]}"

For example:

  1. Line 26 says that if the first characters in the input string are equal to the string "+", then the lexer produces a token PLUS (a case of the discriminated union Token defined above) at the current position.

  2. Line 33 says that if the first characters in the input string are matched by the .NET regular expression "\d+" (meaning “one or more digits”), then the lexer passes the matched characters to the function mkIntegerLit (also defined in src/Lexer.fs) – which in turn produces a token LIT_INT carrying the integer value matched by the regular expression.

  3. Line 35 is similar to the previous one, except that it matches input characters using a .NET regular expression that describes a generic identifier (starting with a letter or _, followed by zero or more letters, digits, or _). Then, it produces a token by passing the matched string to the function mkKeywordOrIdent, which distinguishes reserved keywords (which are converted into their respective tokens) from identifiers:

    /// Convert a string into a keyword token or an identifier.
    and internal mkKeywordOrIdent (s: string) =
        match s with
        | "and" -> AND
        | "or" -> OR
        | "not" -> NOT
        | "if" -> IF
        // ...
        | "true" -> LIT_BOOL true
        | "false" -> LIT_BOOL false
        | other -> IDENT other
    

In all three cases discussed above, the newly-produced token is assigned to the variable accepted, while the new tokenization position is assigned to pos'. Then, on lines 38-39, the function tokenizeRec adds the accepted token to acc and continues recursively from the new position pos'.

In contrast, lines 11–15 say that if the first characters in the input string form whitespace, a comment (beginning with //), or a newline, then no token is produced: such characters are skipped, and tokenizeRec continues recursively from the updated position pos'.

Note

In the tokenizeRec code snippet above, Whitespace, Comment, Newline, Symbol, and Regex are F# active patterns, i.e. custom high-level pattern matching constructs that make the tokenization rules more readable and intuitive, by hiding some low-level details (character-by-character matching, computation of token positions). These active patterns are defined in src/Lexer.fs (after tokenizeRec).

During your project you will need to use those predefined active patterns – but it is unlikely that you will need to modify them, or write new ones.

Example 19 (Tokenizing a Hygge0 Expression)

Create a file called test.hyg, with the following content:

42 + x // This is a comment!

You can tokenize this file by executing:

./hyggec tokenize test.hyg

After hyggec reads this file, the function tokenizeRec processes its contents character-by-character, grouping and classifying the characters according to its pattern-matching rules. Therefore, the lexer matches the characters in the file as follows:

  • 42, that produces a token LIT_INT carrying the integer value 42;

  • a white space, that is skipped;

  • +, that produces a token PLUS;

  • a white space, that is skipped;

  • x, that produces a token IDENT carrying the value "x";

  • //, which marks the beginning of a comment: all characters until the end of the line are skipped and no token is produced;

  • the end of the input characters, which produces a token EOF.

When the tokenization process is complete, the sequence of tokens produced by tokenizeRec is displayed by hyggec as follows:

LIT_INT 42 (1:1-1:2, offset 0-1)
PLUS (1:4-1:4, offset 3-3)
IDENT "x" (1:6-1:6, offset 5-5)
EOF (2:1-2:1, offset 29-29)

The numbers between parentheses describe the token position. For instance, LIT_INT 42 spans from line 1, column 1 to line 1, column 2 (included); its offset in number of characters from the beginning of the file spans from 0 to 1 (included).

Example 20 (Lexing Errors)

To see how the hyggec lexer reports errors, try adding e.g. the character $ somewhere in the file test.hyg from Example 19 above; then, run ./hyggec tokenize test.hyg again.

The hyggec Parser

The file src/Parser.fs defines a function parse that takes a collection of tokens with positions (produced by the function tokenize in The hyggec Lexer) and returns either Ok with an instance of UntypedAST (see The hyggec AST Definition), or Error with a message describing what went wrong.

The function parse wraps the tokens produced by the lexer into a TokenStream, which is a data structure (defined in ParserUtils.fs) that keeps track of the current parsing position and errors. The main methods of a TokenStream object stream are:

  • stream.Current(): return the TokenWithPos at the head of stream;

  • stream.Advance(): advance the stream to the next TokenWithPos.

Then, parse passes the TokenStream to the parsing function pProgram, which in turn recursively calls other parsing functions to process the token stream and progressively build and return an UntypedAST instance.

The function pProgram and the other parsing functions in src/Parser.fs are implemented using parser combinators, a popular technique for writing parsers in functional programming style. If you scroll through src/Parser.fs you will see various definitions that are intuitively similar to some grammar rules in Definition 1. For example, consider the following definition of pLet:

/// Parse a 'let' binding without type annotation.
let pLet =
    pToken LET ->>- pIdent ->>-
        (pToken EQ >>- pSimpleExpr) ->>-
        (pToken SEMI >>- pExpr)
            |>> fun (((tok, (_, name)), init), scope) ->
                mkNode (AST.Expr.Let (name, init, scope)) ...

Observe that:

  • the part before the symbol |>> follows the syntactic shape of the expressions “\(\hygLetU{x}{e}{e'}\)” (mentioning tokens from src/Lexer.fs);

  • the part after the symbol |>> is a function that creates an AST node containing a Let expression.

The next subsections explain how parsers like pLet above work, the meaning of symbols like ->>-, >>-, and |>> (which are summarised in Table 5 below) and how they are used in src/Parser.fs.

Important

When working on your project, it is unlikely that you will need to modify the functions and parser combinators defined in the file src/ParserUtils.fs (and summarised in Table 5). Still, a basic understanding of how they work will be necessary for extending the parsers in src/Parser.fs to support new expressions.

Table 5 Overview of the hyggec parser combinators.

  • p1 ->>- p2: parse p1 and then p2; if both succeed, return both their results as a tuple. (Explained in Basic Parser Combinators (->>-, ->>, >>-, |>>).)

  • p1 ->> p2: parse p1 and then p2; if both succeed, return only the result of p1. (Explained in Basic Parser Combinators (->>-, ->>, >>-, |>>).)

  • p1 >>- p2: parse p1 and then p2; if both succeed, return only the result of p2. (Explained in Basic Parser Combinators (->>-, ->>, >>-, |>>).)

  • choice [p1; ...; pn]: try each parser in the list p1,…,pn in order, one at a time; if one succeeds, return its result; if one fails, backtrack and try the next parser. If all fail, return the error produced by pn. (Explained in Parser Combinators for Choices and Left-Associative Chaining (choice, chainl1).)

  • chainl1 p op: parse one or more instances of p separated by op, with left-associativity. (Explained in Parser Combinators for Choices and Left-Associative Chaining (choice, chainl1).)

  • p |>> f: parse p, and if it succeeds, apply f to its result. (Explained in Basic Parser Combinators (->>-, ->>, >>-, |>>).)

  • p >>= f: parse p, and if it succeeds, apply f (which returns a parser) to its result; then, continue parsing using the parser returned by f. (Explained in Avoiding Backtracking by Using >>= and preturn.)

  • preturn v: return the value v without consuming any token. (Explained in Avoiding Backtracking by Using >>= and preturn.)
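As a preview of how these combinators fit together, here is a hypothetical sketch of a left-associative parser for additions, written in the style of src/Parser.fs. The names pPlusOp and pAddition are invented for this example, and we assume that chainl1 expects the operator parser op to return a function that combines the two surrounding operands:

```fsharp
/// Hypothetical: parse a PLUS token, returning the function that combines
/// the two surrounding operands into an addition AST node.
let pPlusOp =
    pToken PLUS |>> fun tok ->
        fun lhs rhs ->
            mkNode (AST.Expr.BinNumOp (AST.NumericalOp.Add, lhs, rhs))
                   tok.Begin lhs.Pos.Begin rhs.Pos.End

/// Hypothetical: one or more simple expressions separated by '+', combined
/// left-associatively, so that 1 + 2 + 3 parses as (1 + 2) + 3.
let pAddition = chainl1 pSimpleExpr pPlusOp
```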

Parser Combinators in a Nutshell

A parser is a function that takes a TokenStream as argument, “consumes” its tokens (by calling the method Advance() to move to the next token), and returns some result or an error. The result of a parsing function is usually (but not necessarily!) an instance of the data type UntypedAST presented in The hyggec AST Definition.
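This shape can be captured by a type alias (hypothetical, for illustration; compare with the signatures of pIntVal, pVariable, and pToken in the next subsection):

```fsharp
/// Hypothetical alias: a parser consumes tokens from a TokenStream and
/// produces either a result of type 'Res or an error message.
type Parser<'Res> = TokenStream -> Result<'Res, string>
```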

In this section we discuss how to write complex parsers by combining and reusing simple parsers. We first explore how to write a parser that accepts simple sequences of tokens using Basic Parser Combinators (->>-, ->>, >>-, |>>); then, we improve that parser using Parser Combinators for Choices and Left-Associative Chaining (choice, chainl1).

Basic Parser Combinators (->>-, ->>, >>-, |>>)

Consider a TokenStream containing the sequence of tokens produced at the end of Example 19:

[LIT_INT 42; PLUS; IDENT "x"; EOF]

Suppose that our goal is to write a parser that consumes such tokens and returns the AST for the Hygge0 expression \(42 + x\). To achieve this goal, we can write some simple parsers and then combine them. For example, the tokens in the sequence above may be parsed individually using the following three simple parsers. Observe that each parser advances the token stream if the current token is the expected one – otherwise, the parser returns an error message describing what the parser expected to see.

Note

To create UntypedAST nodes, the code snippets below use the utility function mkNode defined in src/ParserUtils.fs. It takes 4 arguments: the UntypedExpr carried by the new AST node, the main Position of the AST node, the Position where it begins, and the Position where it ends.

/// Parse an integer value, producing an AST node with an IntVal expression.
let pIntVal (stream: TokenStream): Result<UntypedAST, string> =
    let tok = stream.Current()
    match tok.Token with
        | LIT_INT v ->
            stream.Advance()
            Ok (mkNode (AST.Expr.IntVal v) tok.Begin tok.Begin tok.End)
        | _ -> Error "LIT_INT"

/// Parse a variable, producing an AST node with a Var expression.
let pVariable (stream: TokenStream): Result<UntypedAST, string> =
    let tok = stream.Current()
    match tok.Token with
        | IDENT name ->
            stream.Advance()
            Ok (mkNode (AST.Expr.Var name) tok.Begin tok.Begin tok.End)
        | _ -> Error "IDENT"

/// Parse a token if it has the given 'tokenType'. Return either Ok and the
/// token with position, or Error with a string describing the expected token.
let pToken (tokenType: Token) (stream: TokenStream): Result<TokenWithPos, string> =
    let tok = stream.Current()
    match tok.Token with
    | t when t = tokenType ->
        stream.Advance()
        Ok tok
    | _ ->
        Error (tokenType.ToString())

We can now combine the simple parsers above into a parser called pAddIntToVar that can consume the first 3 tokens of the stream [LIT_INT 42; PLUS; IDENT "x"; EOF] to create and return an UntypedAST node (representing the expression \(42 + x\)), as follows:

let pAddIntToVar (stream: TokenStream): Result<UntypedAST, string> =
    match (pIntVal stream) with
    | Ok lhs ->
        match (pToken PLUS stream) with
        | Ok tok ->
            match (pVariable stream) with
            | Ok rhs ->
                Ok (mkNode (AST.Expr.BinNumOp (AST.NumericalOp.Add, lhs, rhs))
                           tok.Begin lhs.Pos.Begin rhs.Pos.End)
            | Error msg -> Error msg
        | Error msg -> Error msg
    | Error msg -> Error msg

The code of pAddIntToVar above contains some patterns that are very common when writing parsing functions:

  • execute parsers in sequence, returning an error as soon as one fails; and

  • use the value returned by a successful parser to compute other values (in this case, using mkNode).

These patterns are captured by the operators ->>- and |>> below, which are examples of parser combinators (and are defined in src/ParserUtils.fs):

/// Create a parser that parses both 'p1' and then 'p2', and returns both their
/// results as a tuple. If either 'p1' or 'p2' fails, return the corresponding
/// error.
let (->>-) (p1: TokenStream -> Result<'A, 'E>)
           (p2: TokenStream -> Result<'B, 'E>)
           (stream: TokenStream): Result<'A * 'B, 'E> = ...

/// Create a parser that parses 'p', then applies 'f' to its result. If 'p'
/// fails, return the corresponding error.
let (|>>) (p: TokenStream -> Result<'A, 'E>)
          (f: 'A -> 'B)
          (stream: TokenStream): Result<'B, 'E> = ...
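To make the behaviour of these combinators concrete, here is one possible way their elided bodies could be written. This is only an illustrative sketch based on the documentation comments above; the actual definitions in src/ParserUtils.fs may differ.

```fsharp
// Sketch: possible bodies for the combinators above.
let (->>-) (p1: TokenStream -> Result<'A, 'E>)
           (p2: TokenStream -> Result<'B, 'E>)
           (stream: TokenStream): Result<'A * 'B, 'E> =
    match p1 stream with
    | Ok a ->
        match p2 stream with
        | Ok b -> Ok (a, b)        // both parsers succeeded: pair their results
        | Error e -> Error e       // 'p2' failed: propagate its error
    | Error e -> Error e           // 'p1' failed: propagate its error

let (|>>) (p: TokenStream -> Result<'A, 'E>)
          (f: 'A -> 'B)
          (stream: TokenStream): Result<'B, 'E> =
    match p stream with
    | Ok a -> Ok (f a)             // 'p' succeeded: transform its result with 'f'
    | Error e -> Error e           // 'p' failed: propagate its error
```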

Using these two combinators, we can rewrite the parsing function pAddIntToVar above as:

1let pAddIntToVar (stream: TokenStream): Result<UntypedAST, string> =
2    ((pIntVal ->>- (pToken PLUS)) ->>- pVariable
3        |>> fun ((lhs, tok), rhs) ->
4            mkNode (AST.Expr.BinNumOp (AST.NumericalOp.Add, lhs, rhs))
5                   tok.Begin lhs.Pos.Begin rhs.Pos.End) stream

Observe that:

  • on line 2, (pIntVal ->>- (pToken PLUS)) ->>- pVariable is a parser that attempts to parse an integer, then a token PLUS, then a variable. The output of this parser is a nested tuple of the form ((lhs, tok), rhs), where lhs and rhs are UntypedAST nodes (produced by pIntVal and pVariable, respectively) and tok is a TokenWithPos (produced by pToken PLUS);

  • on line 3, ... |>> ... constructs a parser that runs the parser above, takes its return value, and passes it to a function (lines 3-5) which creates and returns an UntypedAST node;

  • finally, on line 5, the parser constructed by ... |>> ... above is applied to stream.

We can further simplify this definition of pAddIntToVar by removing unnecessary parentheses (because ->>- is left-associative) and leveraging F#’s partial function application to omit the argument stream:

let pAddIntToVar: TokenStream -> Result<UntypedAST, string> =
    pIntVal ->>- (pToken PLUS) ->>- pVariable
        |>> fun ((lhs, tok), rhs) ->
            mkNode (AST.Expr.BinNumOp (AST.NumericalOp.Add, lhs, rhs))
                   tok.Begin lhs.Pos.Begin rhs.Pos.End

As a final cleanup, we can remove the type annotation (because F# can infer it):

let pAddIntToVar =
    pIntVal ->>- (pToken PLUS) ->>- pVariable
        |>> fun ((lhs, tok), rhs) ->
            mkNode (AST.Expr.BinNumOp (AST.NumericalOp.Add, lhs, rhs))
                   tok.Begin lhs.Pos.Begin rhs.Pos.End

Observe that this definition of pAddIntToVar is quite short and intuitively describes what it tries to parse (to the left of |>>) and what it returns when it succeeds (to the right of |>>). Moreover, the code of pAddIntToVar is not cluttered by error handling (which is performed automatically by the parser combinators ->>- and |>>).

To complete the parsing of the stream [LIT_INT 42; PLUS; IDENT "x"; EOF], we can create another parser called pAddIntToVarEOF that first applies pAddIntToVar above, and then parses the remaining token EOF (using pToken above):

let pAddIntToVarEOF =
    pAddIntToVar ->>- (pToken EOF)
        |>> fun (node, _tok) ->
            node

Observe that this implementation of pAddIntToVarEOF does not use the token _tok produced by the parser pToken EOF. This is another common pattern when writing parsing functions: it may be necessary to parse something (usually a token, like EOF in this case) and then discard the result. This pattern is captured by the parser combinators ->> and >>- below (defined in src/ParserUtils.fs):

/// Create a parser that parses both 'p1' and 'p2', and returns only the result
/// of 'p1'. If either 'p1' or 'p2' fails, return the corresponding error.
let (->>) (p1: TokenStream -> Result<'A, 'E>)
          (p2: TokenStream -> Result<'B, 'E>)
          (stream: TokenStream): Result<'A, 'E> = ...

/// Create a parser that parses both 'p1' and 'p2', and returns only the result
/// of 'p2'. If either 'p1' or 'p2' fails, return the corresponding error.
let (>>-) (p1: TokenStream -> Result<'A, 'E>)
          (p2: TokenStream -> Result<'B, 'E>)
          (stream: TokenStream): Result<'B, 'E> = ...
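Notably, these two combinators need not be written from scratch: assuming the definitions of ->>- and |>> above, they can be derived from them (again, a sketch — the actual code in src/ParserUtils.fs may differ):

```fsharp
// Sketch: ->> and >>- expressed in terms of ->>- and |>>.
let (->>) p1 p2 = p1 ->>- p2 |>> fst   // parse both, keep only the first result
let (>>-) p1 p2 = p1 ->>- p2 |>> snd   // parse both, keep only the second result
```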

Using the parser combinator ->> we can simplify pAddIntToVarEOF above as follows:

let pAddIntToVarEOF =
    pAddIntToVar ->> (pToken EOF)
        |>> fun node ->
            node

Notice that now ... |>> fun node -> node is redundant, because it is just returning the AST node produced by pAddIntToVar ->> (pToken EOF). Therefore, we can simplify pAddIntToVarEOF further:

let pAddIntToVarEOF =
    pAddIntToVar ->> (pToken EOF)

Summing up, here is the code of a parser that can parse the stream of tokens (with positions) [LIT_INT 42; PLUS; IDENT "x"; EOF] and return a corresponding AST node containing the Hygge0 expression \(42 + x\):

let pIntVal (stream: TokenStream) = ... // See above

let pVariable (stream: TokenStream) = ... // See above

let pToken (tokenType: Token) (stream: TokenStream) = ... // See above

let pAddIntToVar =
    pIntVal ->>- (pToken PLUS) ->>- pVariable
        |>> fun ((lhs, tok), rhs) ->
            mkNode (AST.Expr.BinNumOp (AST.NumericalOp.Add, lhs, rhs))
                   tok.Begin lhs.Pos.Begin rhs.Pos.End

let pAddIntToVarEOF =
    pAddIntToVar ->> (pToken EOF)

By reading these parser definitions bottom-up, we can intuitively see that, if we call pAddIntToVarEOF stream, the following happens:

  • pAddIntToVarEOF tries to parse pAddIntToVar and then parse EOF – and if successful, it returns the result of the first parser;

  • pAddIntToVar, in turn, tries to parse an integer, then a token PLUS, then a variable – and if successful, creates and returns an AST node with an addition.

Parser Combinators for Choices and Left-Associative Chaining (choice, chainl1)#

The parser pAddIntToVarEOF written at the end of Basic Parser Combinators (->>-, ->>, >>-, |>>) is very restrictive: it can parse a stream of tokens that correspond to a Hygge0 expression like “\(42 + x\)” – but it cannot parse e.g. the expression “\(x + 42\)” although both expressions should be valid according to Definition 1. To relax the parser, we can introduce choices by using the following parser combinator (defined in src/ParserUtils.fs):

/// Create a parser that tries each parser in the list 'parsers' in order until
/// one succeeds. When a parser fails, backtrack the 'stream' before trying the
/// next parser. If all parsers fail, return the error produced by the last
/// parser.
let choice (parsers: list<TokenStream -> Result<'A, string>>)
           (stream: TokenStream): Result<'A, string> = ...
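Since choice must backtrack the stream when a parser fails, its implementation needs a way to save and restore the stream state. The following sketch assumes two hypothetical TokenStream members, Position and Rewind, for this purpose (the actual TokenStream API and choice implementation in hyggec may differ):

```fsharp
// Sketch of 'choice', assuming hypothetical stream.Position / stream.Rewind
// members for saving and restoring the token stream state.
let rec choice (parsers: list<TokenStream -> Result<'A, string>>)
               (stream: TokenStream): Result<'A, string> =
    match parsers with
    | [] -> Error "choice: no parsers given"  // degenerate case
    | [p] -> p stream                // last parser: return its result or error as-is
    | p :: rest ->
        let pos = stream.Position    // hypothetical: remember the stream state
        match p stream with
        | Ok result -> Ok result
        | Error _ ->
            stream.Rewind pos        // hypothetical: backtrack before retrying
            choice rest stream
```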

We can then use choice to write a parser pVarOrIntVal that parses either an integer or a variable; then, we can use pVarOrIntVal to write the parsers pAdd and pAddEOF below:

let pIntVal (stream: TokenStream) = ... // See above

let pVariable (stream: TokenStream) = ... // See above

let pToken (tokenType: Token) (stream: TokenStream) = ... // See above

let pVarOrIntVal = choice [
    pIntVal
    pVariable
]

let pAdd =
  pVarOrIntVal ->>- (pToken PLUS) ->>- pVarOrIntVal
        |>> fun ((lhs, tok), rhs) ->
            mkNode (AST.Expr.BinNumOp (AST.NumericalOp.Add, lhs, rhs))
                   tok.Begin lhs.Pos.Begin rhs.Pos.End

let pAddEOF =
    pAdd ->> (pToken EOF)

This new parser pAddEOF can parse sequences of tokens corresponding to any addition between variables and integers, followed by EOF. For example:

  • [LIT_INT 42; PLUS; IDENT "x";  EOF] – produces the AST for \(42 + x\)

  • [IDENT "x";  PLUS; LIT_INT 42; EOF] – produces the AST for \(x + 42\)

  • [LIT_INT 42; PLUS; LIT_INT 43; EOF] – produces the AST for \(42 + 43\)

  • [IDENT "x";  PLUS; IDENT "y";  EOF] – produces the AST for \(x + y\)

However, pAddEOF is still quite limited: it cannot parse a stream of tokens corresponding to a sequence of additions with 3 or more operands. For example, consider:

[IDENT "x"; PLUS; IDENT "y"; PLUS; IDENT "w"; PLUS; IDENT "z"; EOF]

This sequence of tokens should be parsed taking into account that \(+\) is left-associative, hence the resulting AST should correspond to \(((x + y) + w) + z\). To achieve this, we need to revise pAdd: we need to keep parsing additions while accumulating the parsing outputs, as follows:

  1. First, try running pVarOrIntVal to parse an integer or variable. If successful, the UntypedAST node returned by this parser will be the initial value for an accumulator acc;

  2. Try parsing a token PLUS followed by another pVarOrIntVal:

    • If the parsing succeeds, create a new UntypedAST node with an addition between the accumulator acc and the result of the newly-parsed pVarOrIntVal. Use this new UntypedAST node as the new value for the accumulator acc and repeat from point 2;

    • Instead, if the parsing fails, return the current accumulator acc.

This is a very common pattern that is captured by the parser combinator chainl1 (defined in src/ParserUtils.fs):

/// Create a parser that parses one or more occurrences of 'p' separated by
/// 'op', with left associativity. If the first parsing of 'p' fails, return the
/// corresponding error. Otherwise, the first parsing of 'p' provides the
/// initial value for an accumulator. The result of parsing 'op' is a function
/// that takes the accumulator and the result of the next parsing of 'p', and
/// returns a new accumulator. Iterate by parsing 'op' and then 'p' and
/// accumulating their result; when either 'op' or 'p' fails, backtrack the
/// 'stream' after the last successful 'p' and return the current value of the
/// accumulator.
let inline chainl1 (p: TokenStream -> Result<'A, string>)
                   (op: TokenStream -> Result<('A -> 'A -> 'A), string>)
                   (stream: TokenStream): Result<'A, string> = ...
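The accumulator-based behaviour described in the documentation comment can be sketched as a recursive loop. As with choice above, this sketch assumes hypothetical Position and Rewind members on TokenStream for backtracking (the actual implementation in src/ParserUtils.fs may differ):

```fsharp
// Sketch of 'chainl1', assuming hypothetical stream.Position / stream.Rewind
// members for backtracking after the last successful 'p'.
let chainl1 (p: TokenStream -> Result<'A, string>)
            (op: TokenStream -> Result<('A -> 'A -> 'A), string>)
            (stream: TokenStream): Result<'A, string> =
    match p stream with
    | Error e -> Error e                // the first 'p' failed: propagate the error
    | Ok first ->
        let rec loop acc =
            let pos = stream.Position   // hypothetical: save the stream state
            match op stream with
            | Error _ ->
                stream.Rewind pos       // backtrack after the last successful 'p'
                Ok acc
            | Ok f ->
                match p stream with
                | Error _ ->
                    stream.Rewind pos   // backtrack after the last successful 'p'
                    Ok acc
                | Ok next -> loop (f acc next)  // accumulate and keep iterating
        loop first
```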

Using chainl1 we can improve the parser pAdd above as follows:

1let pAdd =
2    chainl1 pVarOrIntVal
3        (pToken PLUS
4            |>> fun tok ->
5                fun acc rhs ->
6                    mkNode (BinNumOp (AST.NumericalOp.Add, acc, rhs))
7                           tok.Begin acc.Pos.Begin rhs.Pos.End)

Now, if stream contains [IDENT "x"; PLUS; IDENT "y"; PLUS; IDENT "w"; PLUS; IDENT "z"; EOF], then pAdd stream works as follows:

  1. First, chainl1 tries to parse pVarOrIntVal, which succeeds by producing the AST node for the variable \(x\). This AST node becomes the value of an accumulator: let’s call this accumulator acc;

  2. chainl1 tries to parse pToken PLUS |>> fun tok -> ..., which succeeds and returns the function on lines 5–7. Let’s call this function f;

  3. chainl1 tries to parse pVarOrIntVal again, which succeeds by producing the AST node for the variable \(y\). Let’s call this AST node rhs;

  4. chainl1 calls f acc rhs, which returns the AST node for the addition expression \(x + y\). This AST node becomes the new accumulator acc;

  5. chainl1 tries to parse pToken PLUS |>> fun tok -> ... again, which succeeds and returns the function on lines 5–7. Let’s call this function f;

  6. chainl1 tries to parse pVarOrIntVal again, which succeeds by producing the AST node for the variable \(w\). Let’s call this AST node rhs;

  7. chainl1 calls f acc rhs, which returns the AST node for the addition expression \((x + y) + w\). This AST node becomes the new accumulator acc;

  8. chainl1 tries to parse pToken PLUS |>> fun tok -> ... again, which succeeds and returns the function on lines 5–7. Let’s call this function f;

  9. chainl1 tries to parse pVarOrIntVal again, which succeeds by producing the AST node for the variable \(z\). Let’s call this AST node rhs;

  10. chainl1 calls f acc rhs, which returns the AST node for the addition expression \(((x + y) + w) + z\). This AST node becomes the new accumulator acc;

  11. chainl1 tries to parse pToken PLUS |>> fun tok -> ... again, which fails (because it sees the token EOF instead of PLUS). Therefore, chainl1 returns the current value of the accumulator acc, i.e., the AST node for the expression \(((x + y) + w) + z\).

Overview of the hyggec Parser Implementation#

The file src/Parser.fs implements the Hygge0 grammar in Definition 1 using the parser combinators outlined above. As mentioned earlier, the function parse in src/Parser.fs passes a TokenStream to the parsing function pProgram – which looks as follows:

/// Parse a whole Hygge program (i.e., an expression followed by EOF); if the
/// parsing fails, report a summary of all errors.
let pProgram: TokenStream -> Result<AST.UntypedAST, Lexer2.Position * string> =
    pExpr ->> pToken EOF  |>  summarizeErrors

I.e., pProgram creates the parser pExpr ->> pToken EOF and uses summarizeErrors (defined in src/ParserUtils.fs) to collect and report its parsing errors. The parser pExpr, in turn, parses Hygge0 expressions by combining and recursively calling other parsers that correspond to the various Hygge0 grammar rules in Definition 1.

The next subsections illustrate some key features of src/Parser.fs:

Avoiding Grammar Ambiguities and Implementing Operator Precedence#

Importantly, the way the parsers in src/Parser.fs are composed and recursively call each other is designed to enforce a certain parsing order that avoids Grammar Ambiguities and provides standard precedence to arithmetic operators. In the case of additions and multiplications, this is achieved as follows:

  • When pExpr recursively calls other parsers, it always reaches the parser pAddExpr (which produces addition expressions) before pMultExpr (which produces multiplication expressions);

  • The parser pAddExpr, in turn, is implemented as follows:

    /// Parse an additive expression.
    let pAddExpr =
        chainl1 pMultExpr
            (choice [
                pToken PLUS
                    |>> fun tok ->
                        fun acc rhs ->
                            mkNode (AST.Expr.BinNumOp (AST.NumericalOp.Add, acc, rhs))
                                   tok.Begin acc.Pos.Begin rhs.Pos.End
            ])
    

    In other words, the parser pAddExpr uses the combinator chainl1 to parse one or more pMultExprs separated by +. This means that AST nodes with multiplications are created “earlier” than AST nodes with additions: e.g., [LIT_INT 1; TIMES; LIT_INT 2; PLUS; LIT_INT 3; TIMES; LIT_INT 4] is parsed as \((1 * 2) + (3 * 4)\).

  • The parser pMultExpr, in turn, has a similar shape and is implemented as follows:

    /// Parse a multiplicative expression.
    let pMultExpr =
        chainl1 pUnaryExpr
            (choice [
                pToken TIMES
                    |>> fun tok ->
                        fun acc rhs ->
                            mkNode (AST.Expr.BinNumOp (AST.NumericalOp.Mult, acc, rhs))
                                   tok.Begin acc.Pos.Begin rhs.Pos.End
            ])
    

    In other words, the parser pMultExpr uses the combinator chainl1 to parse one or more pUnaryExprs separated by * – where each pUnaryExpr can be e.g. an integer or a variable.

For example, as a consequence of the recursive calling order between pExpr, pAddExpr, pMultExpr, and pUnaryExpr, the sequence of tokens [LIT_INT 1; PLUS; LIT_INT 2; TIMES; LIT_INT 3] is always parsed as \(1 + (2 * 3)\) (which respects the usual operator precedence between \(+\) and \(*\)) and never as \((1 + 2) * 3\).

A similar technique is used to obtain correct precedence and associativity for boolean expressions and relational expressions.

Avoiding Backtracking by Using >>= and preturn#

The hyggec parser may backtrack: i.e., if a parser fails, src/Parser.fs may “rewind” the TokenStream and try another parser. This is a powerful feature that is mainly provided by the choice parser combinator; however, this feature must be used with care, because it can impact parsing performance.

For instance, consider the following situation. In src/Parser.fs, pExpr is the parser for Hygge0 expressions, while pPretype is the parser for pretypes. By Definition 1, an expression \(e\) may have an optional pretype ascription \(t\), written \(e: t\). To parse this kind of expressions, one may be tempted to write a parser pAscriptionExpr like the following:

let pAscriptionExpr = choice [
    pExpr ->>- (pToken COLON) ->>- pPretype
        |>> fun ((e, tok), pt) ->
            mkNode (AST.Expr.Ascription (pt, e)) ...
    pExpr
]

This means: first, try to parse pExpr, then :, then pPretype, and if successful, create an AST node for an ascription. If this parsing attempt fails, backtrack and try parsing just pExpr and return its result.

This version of pAscriptionExpr would work – but it can be very inefficient. The first parser in the list given to choice may fail because pExpr succeeds but there is no : afterwards; after the failure, choice backtracks the token stream and tries the second parser in the list – where pExpr is parsed again. This backtracking and re-parsing is quite wasteful – and if it happens too often and with long sequences of tokens, it can severely degrade parsing performance.

A better approach to implement the parser pAscriptionExpr is to only parse pExpr once, as follows:

  1. try to parse pExpr, which returns an AST node;

  2. then, try parsing : and then pPretype:

    • if the parsing succeeds, create and return an AST node for an Ascription;

    • otherwise, just return the AST node returned by pExpr at step 1.

This approach is implemented in the following version of pAscriptionExpr (slightly simplified w.r.t. src/Parser.fs):

/// Parse an ascription: an expression with (optional) type annotation.
let pAscriptionExpr =
    pExpr >>= fun node ->
        // Check whether there is a colon and a pretype after the expression: if
        // so, create an AST node for an ascription. Otherwise, return the
        // already-parsed AST node as it is.
        choice [
            pToken COLON ->>- pPretype
                |>> fun (tok, tpe) ->
                    mkNode (AST.Expr.Ascription (tpe, node)) ...
            preturn node
        ]

The code snippet above uses the following parser combinators (defined in src/ParserUtils.fs):

/// Create a parser that parses 'p', then applies 'f' to its result; then, uses
/// the parser returned by 'f' to continue parsing the stream. If either 'p' or
/// the parser returned by 'f' fail, return the corresponding error.
let (>>=) (p: TokenStream -> Result<'A, 'E>)
          (f: 'A -> TokenStream -> Result<'B, 'E>)
          (stream: TokenStream): Result<'B, 'E> = ...

/// Create a parser that returns the given 'value' without consuming any token
/// from the stream.
let preturn (value: 'A) (stream: TokenStream): Result<'A, 'E> = ...
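Given their documentation comments, the elided bodies of these two combinators are short; here is a possible sketch (the actual definitions in src/ParserUtils.fs may differ):

```fsharp
// Sketch: possible bodies for >>= and preturn.
let (>>=) (p: TokenStream -> Result<'A, 'E>)
          (f: 'A -> TokenStream -> Result<'B, 'E>)
          (stream: TokenStream): Result<'B, 'E> =
    match p stream with
    | Ok a -> f a stream    // continue with the parser returned by 'f'
    | Error e -> Error e    // 'p' failed: propagate its error

let preturn (value: 'A) (stream: TokenStream): Result<'A, 'E> =
    Ok value                // succeed without consuming any token
```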

This approach is used by various parsers in src/Parser.fs: see e.g. pRelExpr (for relational expressions).

Handling Mutually-Recursive Parsers#

Several parsers in src/Parser.fs are mutually recursive: e.g., pExpr recursively composes various parsers which, in turn, compose pExpr with other parsers (see e.g. pCurlyExpr, which parses expressions between curly brackets). This cycle of mutually-recursive compositions cannot be directly expressed in F#, because variables (like pExpr) must be defined before their use. Therefore, src/Parser.fs breaks this cycle of dependencies as follows:

  1. pExpr is defined at the beginning of src/Parser.fs as an uninitialised forward parser reference (using pForwardRef() from src/ParserUtils.fs);

  2. other parsers use and compose pExpr as needed;

  3. the actual implementation of pExpr is provided by a parser called pExpr'. Before the parser runs, the forward reference pExpr defined above is initialised with pExpr'.

In src/Parser.fs you can find several parser references that are defined, used, and initialised this way.

The hyggec Parser in Action#

We now have all ingredients to understand how the hyggec parser turns a sequence of tokens into an AST. This is illustrated in Example 21, Example 22, Example 23, and Example 24 below.

Example 21 (Parsing a Hygge0 Expression)

Consider again the program test.hyg in Example 19, containing:

42 + x // This is a comment!

and consider the sequence of tokens produced from test.hyg at the end of Example 19:

[LIT_INT 42; PLUS; IDENT "x"; EOF]

If you follow the parser definitions in src/Parser.fs “bottom-up”, starting from pProgram, you can trace how the parsers try to consume the tokens and generate a corresponding AST:

  • pProgram tries to parse pExpr;
    • pExpr (which is a reference to pExpr') tries a bunch of parsers:
      • pType fails because it expects the token TYPE;
      • pLetT and pLet fail because they expect the token LET;
      • pSequenceExpr (which is a reference to pSequenceExpr') tries:
        • pCurlyExpr, which fails because it expects the token LCURLY;
        • pSimpleExpr (which is a reference to pSimpleExpr') tries:
          • pIfExpr, which fails because it expects the token IF;
          • pOrExpr, which tries (using chainl1):
            • pAndExpr, which tries (using chainl1):
              • pRelExpr, which tries:
                • pAddExpr, which tries (using chainl1):
                  • pMultExpr, which tries:
                    • pUnaryExpr (which is a reference to pUnaryExpr'), which tries a bunch of parsers that fail, until it tries:
                      • pAscriptionExpr, which tries:
                        • pPrimary, which tries:
                          • token LPAREN, which fails;
                          • pValue, which tries:
                            • pIntVal, which succeeds, consumes the token LIT_INT 42, and returns the AST node for the integer value 42. This AST node is returned back to pAddExpr in the call hierarchy;
                  • then, pAddExpr tries to parse pToken PLUS, which succeeds and consumes the token PLUS;
                  • then, pAddExpr tries to parse pMultExpr, which tries:
                    • pUnaryExpr (which is a reference to pUnaryExpr'), which tries a bunch of parsers that fail, until it tries:
                      • pAscriptionExpr, which tries:
                        • pPrimary, which tries:
                          • token LPAREN, which fails;
                          • pValue, which fails;
                          • pVariable, which succeeds, consumes the token IDENT "x", and returns the AST node for the variable x. This AST node is returned back to pAddExpr in the call hierarchy;
                  • therefore, pAddExpr succeeds and builds the AST node for the addition 42 + x. This AST node is returned back to pProgram in the call hierarchy;
  • then, pProgram tries to parse pToken EOF, which succeeds consuming the token EOF. Then, pProgram returns the AST node 42 + x.

The corresponding AST produced by hyggec can be seen by running:

./hyggec parse test.hyg

The output is the following (where the AST nodes show the position of their expressions in test.hyg, ranging from their initial to their final line:column coordinates).

BinNumOp Add (1:1-1:6)
┣╾lhs: IntVal 42 (1:1-1:2)
┗╾rhs: Var x (1:6-1:6)

Note

Example 21 shows that even parsing a simple expression like “\(42 + x\)” requires many parsing attempts and failures. One may wonder whether this may impact parsing performance, since the hyggec parser may backtrack after a failed attempt (as discussed in Avoiding Backtracking by Using >>= and preturn).

Indeed, every failed parsing attempt slows down parsing – but some attempts cannot be avoided, because the parser needs to “see what comes next” (“look ahead”) on the token stream to distinguish different expressions.

An important aspect to consider when designing a parser is: how far does a parser need to “look ahead” before succeeding, and how quickly can it fail?

  • The parsers in src/Parser.fs are designed so that attempts like those outlined in Example 21 fail immediately after trying a few tokens (usually just one token): this avoids long backtracks and reduces the impact on parsing performance.

  • Instead, the first example discussed in Avoiding Backtracking by Using >>= and preturn shows a bad parser design where a parser in a choice may fail late, thus causing the backtracking and re-parsing of many tokens.

Example 22 (Parsing a Hygge0 Expression with Operation Precedence and Associativity (Part 1))

Try creating a file called test2.hyg with the following content:

1 + 2 * 3 + 4

If you try to parse this file by running ./hyggec parse test2.hyg, you will see the following output:

BinNumOp Add (1:1-1:13)
┣╾lhs: BinNumOp Add (1:1-1:9)
┃      ┣╾lhs: IntVal 1 (1:1-1:1)
┃      ┗╾rhs: BinNumOp Mult (1:5-1:9)
┃             ┣╾lhs: IntVal 2 (1:5-1:5)
┃             ┗╾rhs: IntVal 3 (1:9-1:9)
┗╾rhs: IntVal 4 (1:13-1:13)

Observe that this AST corresponds to the expression \((1 + (2 * 3)) + 4\): this means that the hyggec parser has given the correct precedence to multiplication over addition, and the correct left-associativity to additions.

As an exercise, you can try following the parser definitions in src/Parser.fs “bottom-up”, starting from pProgram, to see how this is achieved.

Example 23 (Parsing a Hygge0 Expression with Operation Precedence and Associativity (Part 2))

Try creating a file called test3.hyg with the following content:

1 + 2 < 3 * 4 and 42 = 43 and 44 = 45

If you try to parse this file by running ./hyggec parse test3.hyg, you will see the following output:

BinLogicOp And (1:1-1:37)
┣╾lhs: BinLogicOp And (1:1-1:25)
┃      ┣╾lhs: BinNumRelOp Less (1:1-1:13)
┃      ┃      ┣╾lhs: BinNumOp Add (1:1-1:5)
┃      ┃      ┃      ┣╾lhs: IntVal 1 (1:1-1:1)
┃      ┃      ┃      ┗╾rhs: IntVal 2 (1:5-1:5)
┃      ┃      ┗╾rhs: BinNumOp Mult (1:9-1:13)
┃      ┃             ┣╾lhs: IntVal 3 (1:9-1:9)
┃      ┃             ┗╾rhs: IntVal 4 (1:13-1:13)
┃      ┗╾rhs: BinNumRelOp Eq (1:19-1:25)
┃             ┣╾lhs: IntVal 42 (1:19-1:20)
┃             ┗╾rhs: IntVal 43 (1:24-1:25)
┗╾rhs: BinNumRelOp Eq (1:31-1:37)
       ┣╾lhs: IntVal 44 (1:31-1:32)
       ┗╾rhs: IntVal 45 (1:36-1:37)

Observe that this AST corresponds to the expression \(\hygAnd{(\hygAnd{((1 + 2) < (3 * 4))}{(42 = 43)})}{44 = 45}\): this means that the hyggec parser has given the correct precedence to numerical operations over numerical relations, and to numerical relations over logical operations; moreover, it has correctly treated and as left-associative.

As an exercise, you can try following the parser definitions in src/Parser.fs “bottom-up”, starting from pProgram, to see how this is achieved.

Example 24 (Parsing Errors)

To see how the hyggec parser reports errors, you can try creating a file called parsing-error.hyg with the following content:

let x;

If you try to parse this file by running ./hyggec parse parsing-error.hyg, you will see the following output:

hyggec: error: parsing-error.hyg:1:6: found SEMI but expected one of: COLON, EQ

If you follow the parser definitions in src/Parser.fs “bottom-up”, starting from pProgram, you can see that COLON and EQ were the tokens expected by pLetT and pLet, respectively, after they successfully parsed let x.

References and Further Readings#

Building lexers and parsers is typically a routine job (unless the programming language we wish to parse has a peculiar syntax…), but it is often very time-consuming. For this reason, there are various approaches that try to simplify and automate the work.

Parser combinators are quite popular in programming languages with functional programming features such as F#, Haskell, or Scala – and more recently, also Java, Kotlin, and C#. The parser combinators used by hyggec (outlined in Table 5) are inspired by the F# library FParsec.

There exists also a wide variety of tools called lexer generators and parser generators — i.e. programs that:

  1. read a configuration file containing a specification of the language we wish to parse (as a description of its tokens and/or grammar rules, similar to Definition 1); and

  2. generate the source code of a lexer and/or parser, based on such a configuration file. The generated code can then be included and used as part of a compiler.

Here are some examples of lexer and parser generators.

  • Lex and Yacc are standard tools for generating lexers/parsers in C, available on all Unix-like operating systems. These tools were initially released in the 1970s!

  • Flex and GNU Bison are more modern, improved replacements for Lex and Yacc.

  • FsLex and FsYacc are a lexer and parser generator for F#.

  • ANTLR can generate lexers/parsers in various programming languages, and is especially popular in the Java world.

The tools above adopt a variety of parsing algorithms, which may require writing grammars in a certain style. For an overview of parsing algorithms and their limitations you can have a look at:

  • Bill Campbell, Iyer Swami, Bahar Akbal-Delibas. Introduction to Compiler Construction in a Java World. Chapman and Hall/CRC, 2012. Available on DTU Findit.

    • Chapter 3 (Parsing)

The Built-In Interpreter#

hyggec includes an interpreter that can execute Hygge0 expressions (either typed or untyped). This interpreter is not necessary for developing a compiler: in fact, once hyggec has a typed AST (produced by Typechecker.fs), it can proceed directly to Code Generation. However, an interpreter can be useful in multiple ways:

  1. it provides a direct implementation of the Formal Semantics of Hygge0, to be used as a reference;

  2. it helps testing whether the compiler respects the Formal Semantics of Hygge0: an expression \(e\) compiled and executed under RARS must produce the same computations and outputs observed when interpreting \(e\);

  3. finally, the interpreter can be used inside the compiler for code optimisation (as we will see later in the course).

The Hygge0 interpreter is implemented in the file src/Interpreter.fs, following the Formal Semantics of Hygge0. Most of the interpreter’s types and functions take two type arguments 'E and 'T: they have the same meaning discussed in The hyggec AST Definition, and being generic, they allow the interpreter to support both typed and untyped ASTs.

The type RuntimeEnv defined in src/Interpreter.fs represents the runtime environment \(R\) in Structural Operational Semantics of Hygge0:

  type RuntimeEnv<'E,'T> = {
      Reader: Option<unit -> string>  // Used to read a console input
      Printer: Option<string -> unit> // Used to perform console output
  }

The function reduce is the core of src/Interpreter.fs, and it corresponds to the reduction formalised in Definition 4. The function takes a runtime environment and an AST node, and attempts to perform one reduction step:

  • if it succeeds, it returns Some with a pair consisting of a (possibly updated) runtime environment and reduced AST node;

  • if it cannot perform a reduction (because the AST node contains a value, or a stuck expression), it returns None.

let rec reduce (env: RuntimeEnv<'E,'T>)
               (node: Node<'E,'T>): Option<RuntimeEnv<'E,'T> * Node<'E,'T>> =
    match node.Expr with
    // ...
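Note that reduce performs at most one reduction step. A complete evaluation can be obtained by iterating reduce until it returns None. Here is a minimal hedged sketch of such a driver (src/Interpreter.fs provides its own driver functions, whose names and signatures may differ):

```fsharp
// Hypothetical driver: apply 'reduce' until no further step is possible
let rec reduceFully (env: RuntimeEnv<'E,'T>)
                    (node: Node<'E,'T>): RuntimeEnv<'E,'T> * Node<'E,'T> =
    match (reduce env node) with
    | Some(env', node') -> reduceFully env' node'  // Take one more step
    | None -> (env, node)  // 'node' is a value or a stuck expression
```

The final AST node is then either a value (the result of the program) or a stuck expression (a runtime error).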

Each pattern matching case in the function reduce corresponds to a possible Hygge0 expression, and the function attempts to reduce the expression according to Definition 4. For example, values and variables cannot reduce (because there is no rule to let them reduce):

    | UnitVal
    | BoolVal(_)
    | IntVal(_)
    | FloatVal(_)
    | StringVal(_) -> None

    | Var(_) -> None

For an expression “\(\hygAssert{e}\)”, the function reduce follows rules \(\ruleFmt{R-Assert-Eval-Arg}\) and \(\ruleFmt{R-Assert-Res}\) in Definition 4:

  • if \(e\) is the boolean value \(\text{true}\), it proceeds by reducing the whole expression to the value unit (according to rule \(\ruleFmt{R-Assert-Res}\));

  • otherwise, it tries to recursively reduce \(e\):

    • if \(e\) reduces into \(e'\), then reduce returns an AST node with the updated expression \(e'\), by rule \(\ruleFmt{R-Assert-Eval-Arg}\) (notice that reduce creates the new AST node by copying and updating its original argument node);

    • if \(e\) cannot reduce, then \(e\) is stuck, and thus, reduce returns None (because the whole expression is stuck).

    | Assertion(arg) ->
        match arg.Expr with
        | BoolVal(true) -> Some(env, {node with Expr = UnitVal})
        | _ ->
            match (reduce env arg) with
            | Some(env', arg') -> Some(env', {node with Expr = Assertion(arg')})
            | None -> None

For an expression “\(\hygLetU{x}{e}{e_2}\)”, the function reduce follows rules \(\ruleFmt{R-Let-Eval-Init}\) and \(\ruleFmt{R-Let-Subst}\) in Definition 4. It tries to recursively reduce \(e\), and:

  • if \(e\) can reduce into \(e'\), it returns an AST node with the updated expression “\(\hygLetU{x}{e'}{e_2}\)” (by rule \(\ruleFmt{R-Let-Eval-Init}\));

  • if \(e\) cannot reduce because it is already a value \(v\), it substitutes \(x\) with \(v\) in \(e_2\) (by rule \(\ruleFmt{R-Let-Subst}\)). In this case, reduce invokes ASTUtil.subst, which implements substitution according to Definition 2;

  • otherwise, \(e\) is stuck, and thus, reduce returns None (because the whole expression is stuck).

    | Let(name, init, scope) ->
        match (reduce env init) with
        | Some(env', init') ->
            Some(env', {node with Expr = Let(name, init', scope)})
        | None when (isValue init) ->
            Some(env, {node with Expr = (ASTUtil.subst scope name init).Expr})
        | None -> None

Exercise 19

Consider the following groups of reduction rules from Definition 4:

  • \(\ruleFmt{R-Mul-L}\), \(\ruleFmt{R-Mul-R}\), \(\ruleFmt{R-Mul-Res}\)

  • \(\ruleFmt{R-Seq-Eval}\), \(\ruleFmt{R-Seq-Res}\)

  • \(\ruleFmt{R-Type-Res}\)

  • \(\ruleFmt{R-Ascr-Res}\)

  • \(\ruleFmt{R-Print-Eval}\), \(\ruleFmt{R-Print-Res}\)

  • \(\ruleFmt{R-Read-Int}\), \(\ruleFmt{R-Read-Float}\)

For each group of reduction rules listed above:

  1. find the case of the function reduce (in the file src/Interpreter.fs) that implements reductions for the corresponding Hygge0 expression (e.g. in the case of \(\ruleFmt{R-Type-Res}\), find the case of reduce that handles the expression “\(\hygType{x}{t}{e}\)”);

  2. identify how the premises and conditions of the reduction rules are checked in reduce; and

  3. identify how the expression returned by reduce corresponds to the reduced expression in the conclusion of the reduction rules.

Exercise 20

Consider the reduction rules you wrote when solving Exercise 14. For each of those reduction rules:

  1. find the case of the function reduce (in the file src/Interpreter.fs) that implements reduction for the corresponding Hygge0 expression;

  2. identify how each premise and condition of your reduction rule is checked in reduce; and

  3. identify how the expression returned by reduce corresponds to the reduced expression in the conclusion of your reduction rule.

Do you see any discrepancy? If so, do you think there is a mistake in reduce, or in your reduction rule?

Tip

If you have not yet solved Exercise 14, you can proceed in the opposite direction: see how reduce handles a certain expression, and try to write down the corresponding reduction rule(s).

Types and Type Checking#

The hyggec type checker is implemented in the file src/Typechecker.fs, and its goal is to inspect an UntypedAST (produced by The hyggec Parser), and either:

  • produce a TypedAST, where each AST node and expression has an associated type and typing environment (similarly to a typing derivation); or

  • report typing errors pointing at issues in the input source program.

Consequently, we represent the result of type checking as a type called TypingResult (defined in src/Typechecker.fs). The definition of TypingResult uses the standard Result type provided by F#, so a typing result is either Ok with a TypedAST, or Error with a list of type errors:

type TypingResult = Result<TypedAST, TypeErrors>

As mentioned in The hyggec AST Definition, the type TypedAST above is just an alias for type Node<TypingEnv, Type>, where:

  • the type Type (defined in the file src/Type.fs) is the internal representation of a Hygge0 type in the hyggec compiler, and follows Definition 5:

    type Type =
        | TBool                 // Boolean type.
        | TInt                  // Integer type.
        | TFloat                // Floating-point type (single-precision).
        | TString               // String type.
        | TUnit                 // Unit type.
        | TVar of name: string  // Type variable.
    
  • the type TypingEnv (defined in the file src/Typechecker.fs) is the internal representation of a Hygge0 typing environment in the hyggec compiler, and follows Definition 6:

    type TypingEnv = {
        Vars: Map<string, Type> // Variables in the current scope, with their type
        TypeVars: Map<string, Type> // Type vars in current scope, with their def.
    }
    

Type Checking or Type Inference?#

Before proceeding, it is worth noticing that here (as in most literature about compilers) we often use the term “type checking” in a broad sense, meaning “analysing a source program to see whether it is well-typed”. But more precisely, to implement hyggec we need to solve a type inference problem, according to Definition 12 below.

Definition 12 (Type Checking vs. Type Inference)

Take a set of rules defining a typing judgement \(\hygTypeCheckJ{\Gamma}{e}{T}\) (such as the rules in Definition 8).

  • A type checking problem has the following form:

    • given a typing environment \(\Gamma\), an expression \(e\), and a type \(T\), construct a typing derivation that proves \(\hygTypeCheckJ{\Gamma}{e}{T}\)

  • A type inference problem has the following form:

    • given a typing environment \(\Gamma\) and an expression \(e\), find a type \(T\) for which we can construct a typing derivation that proves \(\hygTypeCheckJ{\Gamma}{e}{T}\)

Luckily, the Hygge0 typing rules in Definition 8 directly suggest how to implement a type inference algorithm, because the rules are syntax-driven: just by looking at the shape of an expression \(e\), and reading the rules “bottom-up” (from the conclusion to the premises), we can determine which rule(s) could possibly be used to type \(e\), and what additional checks they require. For instance, suppose we are trying to infer the type of a Hygge0 expression \(e\):

  • if \(e\) is an integer value \(42\), we can only type it by rule \(\ruleFmt{T-Val-Int}\), hence we immediately infer that \(e\) has type \(\tInt\);

  • if \(e\) is a variable \(x\), we can only type it by rule \(\ruleFmt{T-Var}\), hence we infer that \(e\) must have the type contained in \(\envField{\Gamma}{Vars}(x)\). If \(x\) is not in \(\envField{\Gamma}{Vars}\), there is no other typing rule we can try, so we report a typing error;

  • if \(e\) is a logical negation “\(\hygNot{e'}\)”, then we can only type it by trying the typing rule you wrote as part of Exercise 17. Therefore, we recursively infer the type of \(e'\), and check whether \(e'\) has type \(\tBool\). If this is the case, we infer that \(e\) has type \(\tBool\); otherwise, there is no other typing rule we can try, so we report a typing error;

  • if \(e\) is an addition “\(e_1 + e_2\)”, then we can only type it by trying rule \(\ruleFmt{T-Add}\). Therefore, we recursively infer the types of \(e_1\) and \(e_2\), and check whether:

    • both \(e_1\) and \(e_2\) have type \(\tInt\) — and in this case, we infer that \(e\) has type \(\tInt\); or

    • both \(e_1\) and \(e_2\) have type \(\tFloat\) — and in this case, we infer that \(e\) has type \(\tFloat\).

    If this doesn’t work, there is no other typing rule we can try, so we report a typing error;

  • if \(e\) is a conditional “\(\hygCond{e_1}{e_2}{e_3}\)”, we can only type it by trying rule \(\ruleFmt{T-Cond}\). Therefore, we recursively infer the types of \(e_1\), \(e_2\), and \(e_3\), and check whether:

    • \(e_1\) has type \(\tBool\), and

    • \(e_2\) and \(e_3\) have the same type \(T\) — and in this case, we infer that \(e\) has that type \(T\).

    If this doesn’t work, there is no other typing rule we can try, so we report a typing error.

Implementation of src/Typechecker.fs#

The core of src/Typechecker.fs is a function called typer, which implements the type inference algorithm discussed above. The function typer takes a typing environment and an AST node, performs pattern matching on the expression contained in the AST node, and returns a TypingResult:

let rec typer (env: TypingEnv) (node: UntypedAST): TypingResult =
    match node.Expr with
    // ...

The function typer is initially called with an empty environment, and with the whole AST of the Hygge0 program being compiled. While running, typer recursively traverses the AST, adding information to the environment, as required by the typing rules in Definition 8.

Each case in typer takes the given AST node (which is untyped), and (if type inference succeeds) produces a typed AST node containing the current environment, and the inferred type. Otherwise, typer returns a list of typing errors.

For example, typer assigns types to boolean or integer values, as follows:

    | BoolVal(v) ->
        Ok { Pos = node.Pos; Env = env; Type = TBool; Expr = BoolVal(v) }
    | IntVal(v) ->
        Ok { Pos = node.Pos; Env = env; Type = TInt; Expr = IntVal(v) }

When typing a variable, typer checks whether the variable is in the typing environment, and assigns the type it finds there to the variable — or reports an error:

    | Var(name) ->
        match (env.Vars.TryFind name) with
        | Some(tpe) ->
            Ok { Pos = node.Pos; Env = env; Type = tpe; Expr = Var(name) }
        | None ->
            Error([(node.Pos, $"undefined variable: %s{name}")])

When typing a logical “not” expression, typer recursively infers the type of the argument, and checks whether it is a subtype of \(\tBool\): if so, typer infers that the whole expression has type \(\tBool\); otherwise, it reports type errors.

    | Not(arg) ->
        match (typer env arg) with
        | Ok(targ) when (isSubtypeOf env targ.Type TBool) ->
            Ok { Pos = node.Pos; Env = env; Type = TBool; Expr = Not(targ) }
        | Ok(targ) ->
            Error([(node.Pos, $"expected %O{TBool} arg, found %O{targ.Type}")])
        | Error(es) -> Error(es)

Important

In the last example above (3rd line) we are using isSubtypeOf (defined in src/Typechecker.fs) to compare the types targ.Type and TBool. This check corresponds to an application of the subsumption rule in Definition 11: we want targ to have type \(\tBool\), so it is OK if targ has a subtype of \(\tBool\) (e.g. via some type alias).

Instead of using isSubtypeOf, in the code above we could have simply written targ.Type = TBool: this would work, but it would also make our typing system less flexible, leading to limitations similar to those discussed in Example 16.

Therefore, the rule of thumb is: whenever typer needs to compare two types, it should use isSubtypeOf (instead of plain equality =).

One may ask: instead of checking subtyping in multiple places, can we implement a general type inference case based on the subsumption rule \(\ruleFmt{T-Sub}\) in Definition 11?

The answer, unfortunately, is no — and the reason is that (unlike the typing rules in Definition 8) rule \(\ruleFmt{T-Sub}\) is not syntax-driven: it can be applied to any expression \(e\). Therefore, we need to explicitly check subtyping whenever we compare types.

The rest of the pattern matching cases inspected by the function typer work similarly. Two things to mention:

  • in the cases for “\(\hygLet{x}{t}{e_1}{e_2}\)” and “\(\hygType{x}{t}{e}\)”, typer needs to check whether the pretype \(t\) corresponds to some valid type \(T\). To this end, typer uses the function resolvePretype, which corresponds to the type resolution judgement in Definition 7;

  • in various cases, typer uses auxiliary functions to avoid code duplication. For instance: “\(e_1 + e_2\)” and “\(e_1 * e_2\)” are typed in a similar way (see rule \(\ruleFmt{T-Add}\) in Definition 8, and Exercise 17), and thus, typer uses a function called binaryNumOpTyper to handle both.
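To give a feel for such helpers, here is a hedged sketch of what binaryNumOpTyper could look like. This is only an illustration of the approach — the actual signature, result type, and error messages in src/Typechecker.fs may differ. The sketch assumes the helper is defined mutually recursively with typer (hence the `and`), and returns the inferred common type together with the typed operands:

```fsharp
// Hypothetical sketch: infer the types of both operands, and check that
// they are both int or both float (cf. rule T-Add and its variants)
and binaryNumOpTyper (env: TypingEnv) (lhs: UntypedAST) (rhs: UntypedAST)
                     : Result<Type * TypedAST * TypedAST, TypeErrors> =
    match (typer env lhs), (typer env rhs) with
    | Ok(tl), Ok(tr) when (isSubtypeOf env tl.Type TInt)
                          && (isSubtypeOf env tr.Type TInt) ->
        Ok(TInt, tl, tr)    // Both operands are integers
    | Ok(tl), Ok(tr) when (isSubtypeOf env tl.Type TFloat)
                          && (isSubtypeOf env tr.Type TFloat) ->
        Ok(TFloat, tl, tr)  // Both operands are floats
    | Ok(tl), Ok(tr) ->
        Error([(lhs.Pos, $"invalid operand types: %O{tl.Type} and %O{tr.Type}")])
    | Error(es), _
    | _, Error(es) -> Error(es)
```

The case of typer for “\(e_1 + e_2\)” can then call this helper and wrap the result in a typed AST node for the addition (and similarly for multiplication).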

Example 25 (Untyped vs. Typed ASTs)

To see the difference between the untyped AST and the typed AST of a Hygge0 program, you can try to parse the example in Example 3:

./hyggec parse examples/hygge0-spec-example.hyg

And then compare the untyped AST above with the typed AST produced by src/Typechecker.fs:

./hyggec typecheck examples/hygge0-spec-example.hyg

Exercise 21

Consider the following typing rules from Definition 8:

  • \(\ruleFmt{T-Seq}\)

  • \(\ruleFmt{T-Assert}\)

  • \(\ruleFmt{T-Ascr}\)

  • \(\ruleFmt{T-Print}\)

  • \(\ruleFmt{T-Let}\)

  • \(\ruleFmt{T-Type}\)

For each typing rule listed above:

  1. find the case of the function typer (in the file src/Typechecker.fs) that implements type inference for the corresponding Hygge0 expression (e.g. in the case of \(\ruleFmt{T-Seq}\), find the case of typer that handles the expression \(e_1; e_2\));

  2. identify how each premise of the typing rule is checked in typer; and

  3. identify how the type inferred by typer matches the type expected in the conclusion of the typing rule.

Exercise 22

Consider the typing rules you wrote when solving Exercise 17. For each one of those typing rules:

  1. find the case of the function typer (in the file src/Typechecker.fs) that implements type inference for the corresponding Hygge0 expression;

  2. identify how each premise of your typing rule is checked in typer; and

  3. identify how the type inferred by typer matches the type expected in the conclusion of your typing rule.

Do you see any discrepancy? If so, do you think there is a mistake in typer, or in your typing rule?

Tip

If you have not yet solved Exercise 17, you can proceed in the opposite direction: see how typer handles a certain expression, and try to write down a corresponding typing rule.

Code Generation#

The code generation of hyggec is implemented in the file src/RISCVCodegen.fs. We illustrate its contents when discussing the Code Generation Strategy — but first, we have a look at the RISC-V Code Generation Utilities.

RISC-V Code Generation Utilities#

The file src/RISCV.fs contains various data types and functions to represent RISC-V assembly code, and manipulate assembly code snippets. The goal of the hyggec compiler is to produce text output containing RISC-V assembly — and this internal representation of the output code makes the job simpler, and prevents possible mistakes.

The key components of src/RISCV.fs are the following (we will see them in use in the Code Generation Strategy).

  • The type RV (standing for “RISC-V”) represents a statement in a RISC-V assembly program: it is a discriminated union with one named case for each supported RISC-V instruction (e.g. the RISC-V instruction mv is represented by the named case RV.MV(...)). It also includes named cases for representing labels and comments in RISC-V assembly.

  • The types Reg and FPReg represent, respectively, a base integer register and a floating-point register. Their purpose is twofold:

    1. they help ensure that the RISC-V instructions in RV above can only be used with registers that exist, and have the correct type (otherwise, the F# compiler reports a type error). For example:

      • we cannot use the instruction RV.MV(...) on a floating-point register — if we try, we get an F# type error;

      • to move a value from register t0 to t1, we can write RV.MV(Reg.t1, Reg.t0) — and if there is a typo (e.g. if we write “RV.MV(Reg.t1, Reg.y0)”) it will be spotted by the F# compiler;

    2. they provide generic numbered registers Reg.r(n) and FPReg.r(n), which range over all “temporary” and “saved” RISC-V registers. For example, to move a value from generic register number \(n+1\) to \(n\), one can write: RV.MV(Reg.r(n), Reg.r(n+1)). This greatly simplifies the handling of registers during code generation.

  • The type Asm represents an assembly program with its .data and .text segments. The main features are:

    • the methods AddData and AddText, which allow us to add memory allocations and instructions to the selected memory segment; and

    • the method ++, which allows us to combine two assembly programs (e.g. produced during code generation) into a single, well-formed assembly program.
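As a small illustration, here is a hypothetical snippet that builds an assembly fragment by hand, using only the API elements shown above (the label name my_string is an arbitrary assumption):

```fsharp
// Allocate a string in the .data segment, and load its address into t0
let strAsm = Asm().AddData("my_string", Alloc.String("Hi!"))
                  .AddText(RV.LA(Reg.t0, "my_string"))
// Load the integer 42 into generic register number 1
let intAsm = Asm(RV.LI(Reg.r(1u), 42))
// Combine the two snippets into a single well-formed assembly program
let program = strAsm ++ intAsm
```

This is the same pattern followed by the code generation cases discussed below: each case builds a small Asm value, and the pieces are combined with ++.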

Code Generation Strategy#

The core of the file src/RISCVCodegen.fs is the function doCodegen, which takes a code generation environment and a typed AST node, and produces assembly code. Correspondingly, the declaration of doCodegen has the following types.

let rec doCodegen (env: CodegenEnv) (node: TypedAST): Asm = // ...

The function doCodegen uses a very simple code generation strategy. In a nutshell, when compiling an expression \(e\):

  • the code generated for \(e\) writes its result into a target register with number \(n\);

  • if the computation of \(e\) requires the results of other sub-expressions, then doCodegen recursively compiles each sub-expression by increasing (if necessary) their target register numbers to \(n+1\), \(n+2\), etc.

Important

This compilation strategy only works if the code produced by each doCodegen recursive call never modifies any register with a number lower than the current target.

With this approach, the code generation environment (of type CodegenEnv) used by doCodegen must contain two pieces of information:

  1. which target register number should be used to compile the current expression. More precisely, we need one target for integer expressions, and another target for floating-point expressions; and

  2. a “storage” mapping from known variable names, to the location where the value of each variable is stored (e.g. in a register, or in memory).

Consequently, the CodegenEnv type is a record that looks as follows:

type CodegenEnv = {
    Target: uint      // Target register for the result of integer expressions
    FPTarget: uint      // Target register for the result of float expressions
    VarStorage: Map<string, Storage>     // Storage info about known variables
}

The type Storage used by CodegenEnv tells us whether the value of a variable is stored in an integer register, or in a floating-point register, or in a memory location marked with a label in the generated assembly code.

type Storage =
    | Reg of reg: Reg     // The value is stored in an integer register
    | FPReg of fpreg: FPReg  // The value is stored in a float register
    | Label of label: string // Value is stored in memory, with a label
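Under this setup, the environment used to compile the top-level expression of a program can be expected to start from generic register number 0, with an empty variable storage. A hedged sketch (the actual initialisation in src/RISCVCodegen.fs may differ):

```fsharp
// Hypothetical initial environment: target generic registers number 0,
// with no known variables yet
let initialEnv = { Target = 0u; FPTarget = 0u; VarStorage = Map.empty }
```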

A Tour of doCodegen#

We now have all the ingredients to examine how doCodegen works. Its implementation pattern-matches on the expression contained in the AST node being compiled.

let rec doCodegen (env: CodegenEnv) (node: TypedAST): Asm =
    match node.Expr with
    // ...

Here are some examples of how doCodegen’s main pattern matching handles various kinds of expressions.

An integer value is immediately loaded into the target register (using the assembly instruction li).

    | IntVal(v) ->
        Asm(RV.LI(Reg.r(env.Target), v))

Example 26 (Compiling an Integer)

If we compile the Hygge0 expression 42, the match case for IntVal above is executed, and the output of the compiler is:

.data:

.text:
    li t0, 42            # <-- Produced by match case for IntVal in doCodegen
    li a7, 10  # RARS syscall: Exit
    ecall  # Successful exit with code 0

A string value is added to the .data segment of the generated assembly code, with its memory address marked by a unique label (generated with the function Util.genSymbol). Then, the memory address is loaded into the target register (using the assembly instruction la).

    | StringVal(v) ->
        let label = Util.genSymbol "string_val"
        Asm().AddData(label, Alloc.String(v))
             .AddText(RV.LA(Reg.r(env.Target), label))

Example 27 (Compiling a String)

If we compile the Hygge0 expression "Hello, World!", the match case for StringVal above is executed, and the output of the compiler is:

.data:
string_val:
    .string "Hello, World!"  # <-- Produced by case for StringVal in doCodegen

.text:
    la t0, string_val  # <-- Produced by match case for StringVal in doCodegen
    li a7, 10  # RARS syscall: Exit
    ecall  # Successful exit with code 0

When compiling an addition “\(e_1 + e_2\)” with target register \(n\), doCodegen proceeds as follows:

  1. recursively generates the assembly code for \(e_1\), targeting the register \(n\);

  2. recursively generates the assembly code for \(e_2\), targeting the register \(n+1\);

  3. generates a RISC-V addition operation that adds the contents of registers \(n\) and \(n+1\), and overwrites register \(n\) with the result.

The resulting code looks, intuitively, as follows.

    | BinNumOp(op, lhs, rhs) ->
        let lAsm = doCodegen env lhs    // Generated code for the lhs expression
        
        match node.Type with   // Generated code depends on the type of addition
        | t when (isSubtypeOf node.Env t TInt) ->
            let rtarget = env.Target + 1u  // Target register for rhs expression
            let rAsm = doCodegen {env with Target = rtarget} rhs  // Asm for rhs
            let opAsm =            // Generated code for the numerical operation
                match op with
                    | NumericalOp.Add ->           // Generate code for addition
                        Asm(RV.ADD(Reg.r(env.Target),
                                   Reg.r(env.Target), Reg.r(rtarget)))
                    | NumericalOp.Mult ->    // Generate code for multiplication
                        Asm(RV.MUL(Reg.r(env.Target),
                                   Reg.r(env.Target), Reg.r(rtarget)))
            lAsm ++ rAsm ++ opAsm  // Put asm code together: lhs, rhs, operation

        | t when (isSubtypeOf node.Env t TFloat) ->
            // ** Omitted code similar to above, with float regs and instructs

        | t ->
            failwith $"BUG: addition codegen invoked on invalid type %O{t}"

When compiling a variable \(x\), doCodegen produces code to access the value of the variable, depending on where it is stored: to this end, it inspects the VarStorage mapping in the code generation environment. (For clarity of exposition, the code snippet below omits some cases that are present in the implementation).

    | Var(name) ->
        match node.Type with   // Inspect the var type and where it is stored
        | t when (isSubtypeOf node.Env t TFloat) ->
            match (env.VarStorage.TryFind name) with
            | Some(Storage.FPReg(fpreg)) ->
                Asm(RV.FMV_S(FPReg.r(env.FPTarget), fpreg),
                    $"Load variable '%s{name}'")
            | _ -> failwith $"BUG: float var with bad storage: %s{name}"

        | _ ->      // Default case for variables holding integer-like values
            match (env.VarStorage.TryFind name) with
            | Some(Storage.Reg(reg)) ->
                Asm(RV.MV(Reg.r(env.Target), reg), $"Load variable '%s{name}'")
            | _ -> failwith $"BUG: variable without storage: %s{name}"

When compiling a “let” expression “\(\hygLetU{x}{e_1}{e_2}\)” or “\(\hygLet{x}{t}{e_1}{e_2}\)” with target register \(n\), doCodegen proceeds as follows:

  • recursively generates assembly code for \(e_1\), targeting the register \(n\);

  • adds the variable \(x\) to the env.VarStorage mapping, assigning it to register \(n\) (which contains the result of \(e_1\));

  • compiles \(e_2\) with the updated env.VarStorage (containing \(x\)), targeting register \(n+1\) (because \(e_2\) may use \(x\), and thus its code may use the register \(n\));

  • copies the result of \(e_2\) from register \(n+1\) to \(n\) (thus overwriting the value of \(x\), which is going out of scope).

    | Let(name, init, scope) 
    | LetT(name, _, init, scope) ->
        let initCode = doCodegen env init  // 'let...' initialisation asm code
        match init.Type with
        | t when (isSubtypeOf init.Env t TFloat) ->
            // ** Omitted code similar to the following, using float registers

        | _ ->     // Default case for integer-like initialisation expressions
            let scopeTarget = env.Target + 1u   // Target reg. for 'let' scope
            let scopeVarStorage =     // Var storage for compiling 'let' scope
                env.VarStorage.Add(name, Storage.Reg(Reg.r(env.Target)))
            let scopeEnv = { env with Target = scopeTarget
                                      VarStorage = scopeVarStorage }
            initCode
                ++ (doCodegen scopeEnv scope)
                    .AddText(RV.MV(Reg.r(env.Target), Reg.r(scopeTarget)),
                             "Move 'let' scope result to target register")

Example 28 (Compiling a “Let…” and an Addition)

Consider the following Hygge0 program:

let x = 42;
x + 3

If we save this program in a file example.hyg, we can see its typed AST by executing:

./hyggec typecheck example.hyg

The typed AST looks as follows: (we omit Env.TypeVars for brevity)

Let x (1:1-2:5)
┣╾Env.Vars: ∅
┣╾Type: int
┣╾init: IntVal 42 (1:9-1:10)
┃       ┣╾Env.Vars: ∅
┃       ┗╾Type: int
┗╾scope: BinNumOp Add (2:1-2:5)
         ┣╾Env.Vars: Map
         ┃           ┗╾x: int
         ┣╾Type: int
         ┣╾lhs: Var x (2:1-2:1)
         ┃      ┣╾Env.Vars: Map
         ┃      ┃           ┗╾x: int
         ┃      ┗╾Type: int
         ┗╾rhs: IntVal 3 (2:5-2:5)
                ┣╾Env.Vars: Map
                ┃           ┗╾x: int
                ┗╾Type: int

To compile the program above, we can execute:

./hyggec compile example.hyg

Then, hyggec produces the following RISC-V assembly code.

.data:

.text:
    li t0, 42                   # <-- Produced by case for IntVal in doCodegen
    mv t1, t0  # Load variable 'x' # <-- Produced by case for Var in doCodegen
    li t2, 3                    # <-- Produced by case for IntVal in doCodegen
    add t1, t1, t2                 # <-- Produced by case for Add in doCodegen
    mv t0, t1  # Move 'let' scope result to 'let' target register
    li a7, 10  # RARS syscall: Exit
    ecall  # Successful exit with code 0

Note

The function doCodegen includes more cases, which follow the explanations above. The only exceptions are the cases that require RARS system calls (e.g. Print, ReadInt): their code generation makes use of functions that save register values on the stack, and restore register values from the stack. We will address these aspects later in the course.

Important

The Code Generation Strategy illustrated in this module is quite simple, and it has an important flaw: its naive register allocation policy tends to increase the target register number for each sub-expression being compiled — and in some cases, it may run out of available registers. When this happens, the hyggec compiler crashes, reporting a “BUG”.

Luckily, it takes some effort to write a Hygge0 program that triggers this limitation: therefore, it is unlikely that, by chance, you will write a program that stumbles into the issue. (See Exercise 23 below.)

We will consider better register allocation strategies later in the course.

Exercise 23 (Running Out of Registers)

Write a Hygge0 expression that, by the Code Generation Strategy, causes doCodegen to run out of registers, and makes hyggec crash.

Hint

There is a solution that only uses +, some integer values, and parentheses…

The Test Suite of hyggec#

We conclude this overview by mentioning the hyggec test suite, which is designed to encourage frequent testing of the compiler. To launch the test suite, simply invoke:

./hyggec test

Or, equivalently:

dotnet test

When running, the testing framework explores the tree of directories under tests/, and uses each file with extension .hyg as a test case. The outcome of each test depends on the position of the .hyg file in the tests/ directory tree, according to Table 6 below. (Note that the .hyg files under the fail/ directories are expected to cause some error, in specific ways.)

Table 6 Overview of the hyggec directory tree for tests. A .hyg file under each directory is a passed test if the corresponding condition holds.#

  • lexer/pass/: the tokenization succeeds.

  • lexer/fail/: the tokenization fails with a lexer error.

  • parser/pass/: parsing succeeds.

  • parser/fail/: tokenization succeeds, but parsing fails.

  • interpreter/pass/: tokenization and parsing succeed, and the interpreter reduces the program into a value.

  • interpreter/fail/: tokenization and parsing succeed, but the interpreter reaches a stuck expression (e.g. assert(false)).

  • typechecker/pass/: tokenization, parsing, and type checking succeed.

  • typechecker/fail/: tokenization and parsing succeed, but type checking fails.

  • codegen/pass/: tokenization, parsing, and type checking succeed, and the generated RISC-V assembly program runs under RARS and terminates successfully (exit code 0).

  • codegen/fail/: tokenization, parsing, and type checking succeed, and the generated RISC-V assembly program runs under RARS, but it terminates with an assertion violation (assert(false)).

Note

When looking for test cases, the hyggec test suite will also explore all subdirectories of the directories listed in Table 6. This means that, if you have many new tests for various compiler features, you can organise them in dedicated subdirectories, e.g.:

  • codegen/pass/feature1/test1.hyg

  • codegen/pass/feature1/test2.hyg

  • codegen/pass/feature2/test1.hyg

Example: Extending Hygge0 and hyggec with a Subtraction Operator#

The following sections show how to extend the Hygge0 language and the hyggec compiler with a subtraction operator, in 8 steps:

  1. Defining the Formal Specification of Subtraction

  2. Extending the AST

  3. Extending the Pretty Printer

  4. Extending the Lexer

  5. Extending the Parser

  6. Extending the Interpreter

  7. Extending the Type Checker

  8. Extending the Code Generation

To perform these steps, we use as a reference the most similar expression that is already supported by Hygge0 and hyggec — i.e. the addition \(e_1 + e_2\).

Formal Specification of Subtraction#

Before jumping to the implementation, we specify the formal syntax, semantics, and typing rules of the new subtraction operator.

First, we specify the syntax of subtraction expressions by extending the grammar rules of Hygge0 in Definition 1. We add a new rule:

\[\begin{split} \begin{array}{rrcll} \text{Expression} & e & ::= & \ldots \\ & & \mid & e_1 - e_2 & \text{(Subtraction)} \end{array} \end{split}\]

Then, we specify the semantics of subtraction expressions, in two steps.

  1. We specify how to substitute variable \(x\) with expression \(e'\) inside a subtraction \(e_1 - e_2\), by extending Definition 2 with a new case (similar to the existing case for addition):

    \[ \subst{(e_1 - e_2)}{x}{e'} \;\;=\;\; \subst{e_1}{x}{e'} - \subst{e_2}{x}{e'} \]
  2. We extend the reduction rules in Definition 4 with new rules for subtraction expressions (very similar to the existing rules for addition):

    \[\begin{split} \begin{array}{c} \begin{prooftree} \AxiomC{$\hygEvalConf{R}{e} \to \hygEvalConf{R'}{e'}$} \UnaryInfCLab{R-Sub-L}{$\hygEvalConf{R}{e - e_2} \;\to\; \hygEvalConf{R'}{e' - e_2}$} \end{prooftree} \quad \begin{prooftree} \AxiomC{$\hygEvalConf{R}{e} \to \hygEvalConf{R'}{e'}$} \UnaryInfCLab{R-Sub-R}{$\hygEvalConf{R}{v - e} \;\to\; \hygEvalConf{R'}{v - e'}$} \end{prooftree} \\ \begin{prooftree} \AxiomC{$v_1 - v_2 = v_3$} \UnaryInfCLab{R-Sub-Res}{$\hygEvalConf{R}{v_1 - v_2} \;\to\; \hygEvalConf{R}{v_3}$} \end{prooftree} \end{array} \end{split}\]
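To see the shape of these rules outside of hyggec, here is a small illustrative sketch (plain Python, not part of the compiler): the operands are reduced first, mirroring \(\ruleFmt{R-Sub-L}\) and \(\ruleFmt{R-Sub-R}\), and the final arithmetic step corresponds to \(\ruleFmt{R-Sub-Res}\).

```python
# Illustrative sketch (not hyggec code): expressions are nested tuples, and
# evaluating a subtraction first reduces both operands (rules R-Sub-L and
# R-Sub-R), then computes the result (rule R-Sub-Res).

def evaluate(expr):
    if isinstance(expr, (int, float)):  # a value reduces to itself
        return expr
    op, e1, e2 = expr
    v1 = evaluate(e1)                   # R-Sub-L: reduce the left operand first
    v2 = evaluate(e2)                   # R-Sub-R: then reduce the right operand
    if op == "sub":
        return v1 - v2                  # R-Sub-Res: v1 - v2 = v3
    raise ValueError(f"stuck expression: {expr!r}")

# (42 - 10) - 2 evaluates to 30
print(evaluate(("sub", ("sub", 42, 10), 2)))
```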

Finally, we extend the typing rules in Definition 8 with a new rule for subtraction expressions (very similar to the existing typing rule for addition):

\[ \begin{prooftree} \AxiomC{$T \in \{\text{int}, \text{float}\}$} \AxiomC{$\Gamma \vdash e_1 : T$} \AxiomC{$\Gamma \vdash e_2 : T$} \TrinaryInfCLab{T-Subtract}{$\Gamma \vdash e_1 - e_2 : T$} \end{prooftree} \]
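As a quick illustration (plain Python, not hyggec code), the essence of rule \(\ruleFmt{T-Subtract}\) can be captured by a small function that, given the types of the two operands, either returns the type of the whole subtraction or reports a typing error:

```python
# Illustrative sketch (not hyggec code) of rule T-Subtract: a subtraction
# e1 - e2 has type T when both operands have the same type T, and T is
# either int or float.

def type_of_subtraction(t1, t2):
    if t1 == t2 and t1 in ("int", "float"):
        return t1
    raise TypeError(f"cannot subtract {t2} from {t1}")

print(type_of_subtraction("int", "int"))      # well-typed
print(type_of_subtraction("float", "float"))  # well-typed
try:
    type_of_subtraction("int", "float")       # ill-typed: mismatched operands
except TypeError as err:
    print(err)
```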

Extending the AST#

We can now implement the Formal Specification of Subtraction. The first step is to extend the AST: in the type Expr<'E,'T> (in the file src/AST.fs) we need to ensure that the case for binary numerical expressions BinNumOp allows its op (of type NumericalOp) to be Sub (not just Add or Mult). To achieve this, we add a new case Sub to the type NumericalOp:

type NumericalOp =
    // ...

    /// NEW CASE: Subtraction operation.
    | Sub

Tip

By extending the type NumericalOp as shown above, we cause various warnings in several source files, with messages like:

Incomplete pattern matches on this expression. For example, the value 'Sub' may indicate a case not covered by the pattern(s).

These warnings highlight the parts of the hyggec source code we need to adjust to support the new kind of expression we have just added.

To see all such warnings on the console, we can rebuild hyggec by running:

dotnet clean

followed by

dotnet build

Extending the Pretty Printer#

hyggec needs to know how to display the new subtraction expression when printing an AST on screen. This display is handled in src/PrettyPrinter.fs: in general, every time we add a new expression to the language, we may need to add a new pattern matching case to the function formatASTRec.

In this specific example, formatASTRec already has a case that pretty-prints a generic BinNumOp – and since it already works correctly when the operation is Sub, no change is needed:

    | BinNumOp(op, lhs, rhs) ->
        mkTree $"BinNumOp {op}" node [("lhs", formatASTRec lhs)
                                      ("rhs", formatASTRec rhs)]

Extending the Lexer#

We can now add a new token type and rule to the lexer, to recognise the new token for the subtraction symbol “-”. In the file src/Lexer.fs, we add a new case to the type Token:

type Token =
    // ...

    /// NEW CASE: Minus sign.
    | MINUS

Then we add a new pattern matching rule to tokenizeRec (e.g. near the case that matches "+"):

    // ...
    | Symbol "+"  PLUS     pos (accepted, pos')
    | Symbol "-"  MINUS    pos (accepted, pos') // <-- We add this line
    | Symbol "*"  TIMES    pos (accepted, pos')
    // ...

Testing the Lexer#

To see whether the lexer recognises the new token, we can add a new test. For example, we can add a file called tests/lexer/pass/006-minus.hyg containing just a minus sign:

-

And now, if we run

./hyggec tokenize tests/lexer/pass/006-minus.hyg

we should see (besides the warnings about incomplete pattern matches):

MINUS (1:1-1:1, offset 0-0)
EOF (2:1-2:1, offset 2-2)

Moreover, running ./hyggec test should not report any failed test.

Extending the Parser#

We can now add a new rule for parsing subtractions in src/Parser.fs. Since subtraction should have the same operator precedence as addition, we can look for the parser that handles additions and extend it to support subtractions as well. The parser in question is pAddExpr: inside its choice [ ... ] combinator there is a parser that supports additions; we can copy and paste it, and then tweak it to add a new case for subtractions.

/// Parse an additive expression.
let pAddExpr =
    chainl1 pMultExpr
        (choice [
            pToken PLUS
                |>> fun tok ->
                    fun acc rhs ->
                        mkNode (AST.Expr.BinNumOp (AST.NumericalOp.Add, acc, rhs))
                               tok.Begin acc.Pos.Begin rhs.Pos.End
            // NEW CASE: similar to the above, but it matches token MINUS and
            // creates a BinNumOp AST node with op = AST.NumericalOp.Sub
            pToken MINUS
                |>> fun tok ->
                    fun acc rhs ->
                        mkNode (AST.Expr.BinNumOp (AST.NumericalOp.Sub, acc, rhs))
                               tok.Begin acc.Pos.Begin rhs.Pos.End
        ])

Testing the Parser#

To see whether the parser recognises subtraction expressions, we can add a new test. For example, we can add a file called tests/parser/pass/012-sub.hyg containing:

42 - x

And now, if we run

./hyggec parse tests/parser/pass/012-sub.hyg

we should see (besides the warnings about incomplete pattern matches) a representation of the AST node like the following:

BinNumOp Sub (1:1-1:6)
┣╾lhs: IntVal 42 (1:1-1:2)
┗╾rhs: Var x (1:6-1:6)

Extending the Interpreter#

In the Formal Specification of Subtraction, we have defined new reduction rules that are quite similar to those for addition. Correspondingly, we can extend the hyggec interpreter by adapting existing cases for addition. In general, adding a new expression to the language may require two steps:

  1. we may need to extend the function subst in src/ASTUtil.fs by adding a new case. In this specific example, subst already contains a case that supports generic AST BinNumOps – and since it already works correctly when the operation is Sub, no change is needed:

     | BinNumOp(op, lhs, rhs) ->
         {node with Expr = BinNumOp(op, (subst lhs var sub), (subst rhs var sub))}
    
  2. we extend the function reduce in src/Interpreter.fs by adding a new case for subtractions. In this example, we extend the case for AST BinNumOps: we can copy and adapt the existing code for op = Add to support op = Sub:

        | BinNumOp(op, lhs, rhs) ->
            match op with
            // ...
    
            // NEW CASE copied & adapted from the existing case for NumericalOp.Add
            | NumericalOp.Sub ->
                match (lhs.Expr, rhs.Expr) with
                | (IntVal(v1), IntVal(v2)) ->
                    Some(env, {node with Expr = IntVal(v1 - v2)})
                | (FloatVal(v1), FloatVal(v2)) ->
                    Some(env, {node with Expr = FloatVal(v1 - v2)})
                | (_, _) ->
                    match (reduceLhsRhs env lhs rhs) with
                    | Some(env', lhs', rhs') ->
                        Some(env', {node with Expr = BinNumOp(op, lhs', rhs')})
                    | None -> None
    

Testing the Interpreter#

To test whether the interpreter handles subtraction expressions correctly, we can check whether the result of a subtraction is the value we expect. We can obtain this by writing a test case called e.g. tests/interpreter/pass/008-sub.hyg, with the following content:

assert(42 - 10 = 32);
assert(3.14f - 1.0f = 2.14f)  // Careful when comparing floats! This case is OK

and then running ./hyggec interpret tests/interpreter/pass/008-sub.hyg.

If any of the comparisons inside the assertions is false, the interpreter reaches a stuck expression \(\hygAssert{\hygFalse}\) and reports an error. After we add this test, running ./hyggec test should not report any failed test.
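The comment in the test above warns about comparing floating-point values. The reason is general, and easy to demonstrate in any language (the snippet below uses plain Python): many decimal fractions have no exact binary representation, so equality checks on computed floats can fail even when the ideal mathematical result would match.

```python
# Many decimal fractions have no exact binary floating-point representation,
# so equality tests on computed floats are fragile in general.
print(0.1 + 0.2 == 0.3)   # False: both sides are rounded differently
print(0.1 + 0.2)          # 0.30000000000000004
```

The test case above happens to be safe because the subtraction and the comparison round the same way, but in general it is safer to test floats with assertions like "the difference is below a small tolerance".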

Extending the Type Checker#

In the Formal Specification of Subtraction, we have defined a new typing rule \(\ruleFmt{T-Subtract}\) that is quite similar to rule \(\ruleFmt{T-Add}\) in Definition 8. Correspondingly, we can extend the hyggec function typer (in the file src/Typechecker.fs) by tweaking the case that handles the typing of additions. Concretely, we can look for the case of the function typer that types a generic AST BinNumOp, and add a new case:

    | BinNumOp(op, lhs, rhs) ->
        match op with
        // ...

        // NEW CASE copied & adapted from the existing case for NumericalOp.Add
        | NumericalOp.Sub ->
            match (binaryNumOpTyper "subtraction" node.Pos env lhs rhs) with
            | Ok(tpe, tlhs, trhs) ->
                Ok { Pos = node.Pos; Env = env; Type = tpe; Expr = BinNumOp(op, tlhs, trhs) }
            | Error(es) -> Error(es)

Testing the Type Checker#

To test whether the type checking for subtraction expressions works as intended, we can check whether the subtraction of two integers has type \(\tInt\), and whether the subtraction of two floating-point values has type \(\tFloat\). We can obtain this by writing a test case called e.g. tests/typechecker/pass/012-sub.hyg, with the following content:

(2 - 1): int;
(3.14f - 1.0f): float

and then, running ./hyggec typecheck tests/typechecker/pass/012-sub.hyg.

If the type of the subtraction does not match the type ascription, hyggec reports a typing error. After we add this test, running ./hyggec test should not report any failed test.

Extending the Code Generation#

Code generation for subtraction is very similar to addition (and also multiplication): the only difference is that we need to emit the RISC-V assembly instruction sub (for integers) or fsub.s (for floating point values). Consequently, it is enough to edit the function doCodegen (in the file src/RISCVCodegen.fs), find the pattern matching case for BinNumOp(op, lhs, rhs) and apply the following three changes:

  1. we find the internal pattern matching that generates the assembly instruction for a numerical operation on integer values. We extend that pattern matching with a new case for op = Sub:

        | BinNumOp(op, lhs, rhs) ->
            // ...
            match node.Type with
            | t when (isSubtypeOf node.Env t TInt) ->
                // ...
                let opAsm =
                    match op with
                    // ...
    
                    // NEW CASE copied & adapted from the existing case for NumericalOp.Add
                    | NumericalOp.Sub ->
                        Asm(RV.SUB(Reg.r(env.Target),
                                   Reg.r(env.Target), Reg.r(rtarget)))
    
  2. then, we find the internal pattern matching that generates the assembly instruction for a numerical operation on float values. We extend that pattern matching with a new case for op = Sub:

        | BinNumOp(op, lhs, rhs) ->
            // ...
            match node.Type with
            // ...
            | t when (isSubtypeOf node.Env t TFloat) ->
                let opAsm =
                    match op with
                    // ...
    
                    // NEW CASE copied & adapted from the existing case for NumericalOp.Add
                    | NumericalOp.Sub ->
                        Asm(RV.FSUB_S(FPReg.r(env.FPTarget),
                                      FPReg.r(env.FPTarget), FPReg.r(rfptarget)))
    

Testing the Code Generation#

To test the code generation, it is good practice to reuse the test cases written for the interpreter, with assertions that stop the program execution if the result of a computation is not what we expect.

Consequently, we can create a test case called e.g. tests/codegen/pass/008-sub.hyg with the same content used for Testing the Interpreter. Then, we can execute ./hyggec rars tests/codegen/pass/008-sub.hyg to try the new test.

After we add this test, running ./hyggec test should not report any failed test.

Important

When we reach this point, we should have fixed all the warnings caused by the addition of case Sub in src/AST.fs. To double-check, we can rebuild hyggec by executing dotnet clean and then dotnet build: this should not produce any warning nor error.

Project Ideas#

For your group project, you should implement all the following (easy) project ideas (but notice that some of them give you a choice between different options). These project ideas will allow you to become familiar with the hyggec compiler internals.

You also have the option of choosing the following medium- or hard-difficulty project ideas, instead of (or in addition to) the above (but it is better to start with the easy ideas, to gain experience):

Note

These project ideas are tailored for whole project groups. If you have not yet joined a group, you can address part of them (e.g. implement only one new arithmetic operator instead of 3), and later combine your work with your teammates.

Project Idea (Easy): Extend Hygge0 and hyggec with New Arithmetic Operations#

Add some new arithmetic operations to the Hygge0 language and to the hyggec compiler, by following the steps described in Example: Extending Hygge0 and hyggec with a Subtraction Operator. Choose at least 3 operations between:

  • Division “\(e_1 / e_2\)” (both integer and floating point)

  • Remainder “\(e_1 \mathbin{\%} e_2\)” (only between integers)

  • Square root “\(\text{sqrt}(e)\)” (only floating point)

    Hint

    To support \(\text{sqrt}(e)\), you will need to add a new case to the Expr type in src/AST.fs; correspondingly, you will need to add a pattern matching case in src/PrettyPrinter.fs.

    Then, to perform lexing and parsing of the new \(\text{sqrt}(e)\) expression, you will need to:

    • in src/Lexer.fs, define a new case of the type Token to represent the keyword “\(\text{sqrt}\)”; you may call this token e.g. SQRT, and extend the function mkKeywordOrIdent to generate it;

    • then, in src/Parser.fs, you will need to add a new rule to parse an occurrence of SQRT, followed by LPAREN (left parenthesis), followed by an expression, followed by RPAREN (right parenthesis). You can use as a reference e.g. the existing parser pPrint that parses a \(\hygPrint{...}\) expression, and looks like the following:

      /// Parse a print(...) expression.
      let pPrint =
          pToken PRINT ->>- (pToken LPAREN >>- pSimpleExpr) ->>- pToken RPAREN
              |>> fun ((tok1, expr), tok2) ->
                  mkNode (AST.Expr.Print expr) tok1.Begin tok1.Begin tok2.End
      

      The new parser for “\(\text{sqrt}(e)\)” expressions should have the same precedence as pPrint, so it should also be placed as a case of the parser unaryExpr'.

  • Maximum “\(\text{max}(e_1, e_2)\)” and minimum “\(\text{min}(e_1, e_2)\)” (both integer and floating point)

    Hint

    • To support the new “\(\text{max}(e_1, e_2)\)” and “\(\text{min}(e_1, e_2)\)” operations, you can follow the hints given for \(\text{sqrt}\) above. In addition, you will need to define a token for recognising the comma "," between the two arguments: you may call this token e.g. COMMA.

    • There are several ways to implement both “\(\text{max}(e_1, e_2)\)” and “\(\text{min}(e_1, e_2)\)” in hyggec. Depending on your approach, you may achieve the result without extending the interpreter or the code generation…

    • To implement code generation for “\(\text{max}(e_1, e_2)\)” and “\(\text{min}(e_1, e_2)\)” on integers, the generated RISC-V code will need to perform a conditional jump with a branching instruction, depending e.g. on whether \(e_1\) is smaller than \(e_2\). The implementation for floats is simpler. For an inspiration, see how code is generated for BinRelOp(op, lhs, rhs) in the function doCodegen (in the file src/RISCVCodegen.fs).
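As one hedged illustration of the “several ways” mentioned above (an assumption, not the only valid design), “\(\text{max}(e_1, e_2)\)” could be desugared into constructs Hygge0 already has, e.g. an if-then-else over \(e_1 < e_2\). The Python snippet below only sanity-checks that such a desugaring computes the right values:

```python
# Sanity-check for a candidate desugaring (illustrative, not hyggec code):
#   max(e1, e2)  ~~>  if e1 < e2 then e2 else e1
#   min(e1, e2)  ~~>  if e1 < e2 then e1 else e2
# Note: this duplicates e1 and e2 in the desugared code, which becomes
# observable once expressions can have side effects; binding the operands
# first (e.g. with 'let') avoids the duplication.
for a in range(-3, 4):
    for b in range(-3, 4):
        assert max(a, b) == (b if a < b else a)
        assert min(a, b) == (a if a < b else b)
print("desugaring agrees with max/min on all tested pairs")
```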

Project Idea (Easy): Extend Hygge0 and hyggec with New Relational Operations#

Add some new relational operations to the Hygge0 language and to the hyggec compiler, by following the steps described in Example: Extending Hygge0 and hyggec with a Subtraction Operator — except that you should use the existing expression “\(e_1 < e_2\)” as a reference. Choose at least one operation between:

  • Less than or equal to “\(e_1 \leq e_2\)” (both integer and floating point)

  • Greater than “\(e_1 > e_2\)” (both integer and floating point)

  • Greater than or equal to “\(e_1 \geq e_2\)” (both integer and floating point)

Hint

There are several ways to implement these expressions in hyggec. Depending on your approach, you may implement all of them without extending the interpreter or the code generation…
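One candidate approach (an assumption, not the only design) is to express the new operators through the existing ones, e.g. \(e_1 \leq e_2\) as “\((e_1 < e_2) \text{ or } (e_1 = e_2)\)”. The Python check below only confirms that this logical equivalence holds on sample inputs; note that the desugaring duplicates the operands, which becomes observable once expressions have side effects:

```python
# Check that the candidate desugaring
#   e1 <= e2  ~~>  (e1 < e2) or (e1 = e2)
# agrees with <= on a range of integer pairs (illustrative, not hyggec code).
for a in range(-3, 4):
    for b in range(-3, 4):
        assert (a <= b) == ((a < b) or (a == b))
print("desugaring agrees with <= on all tested pairs")
```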

Project Idea (Easy): Extend Hygge0 and hyggec with the Logical Operator “Exclusive Or”#

Add the “exclusive or” operator “\(e_1 \text{ xor } e_2\)” to the Hygge0 language and to the hyggec compiler, by following the steps described in Example: Extending Hygge0 and hyggec with a Subtraction Operator — except that you should use the existing expression “\(e_1 \text{ or } e_2\)” as a reference.

Hint

There are several ways to implement this operator in hyggec. Depending on your approach, you may achieve the result without extending the interpreter or the code generation…
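One candidate desugaring (an assumption, not the only design, and it presumes the language has logical negation) is “\(e_1 \text{ xor } e_2\)” as “\((e_1 \text{ or } e_2) \text{ and not } (e_1 \text{ and } e_2)\)”. The Python check below only verifies this against the xor truth table:

```python
# Verify a candidate desugaring of xor against its truth table
# (illustrative, not hyggec code):
#   e1 xor e2  ~~>  (e1 or e2) and (not (e1 and e2))
for a in (False, True):
    for b in (False, True):
        assert (a != b) == ((a or b) and not (a and b))
print("desugaring matches xor's truth table")
```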

Project Idea (Medium Difficulty): “And” and “Or” with Short-Circuit-Semantics#

The Hygge0 formal semantics (Definition 4) omits the reduction rules for the expressions “\(e_1 \text{ and } e_2\)” and “\(e_1 \text{ or } e_2\)” (but you may have written them down as part of Exercise 14, and compared them against their implementation in hyggec in Exercise 20).

hyggec implements the following “eager” semantics for the logical “and” and “or”: they reduce both arguments to values, and then reduce to \(\hygTrue\) or \(\hygFalse\) accordingly.

\[\begin{split} \begin{array}{c} \begin{prooftree} \AxiomC{$\hygEvalConf{R}{e} \to \hygEvalConf{R'}{e'}$} \UnaryInfCLab{R-And-L}{$\hygEvalConf{R}{e \text{ and } e_2} \;\to\; \hygEvalConf{R'}{e' \text{ and } e_2}$} \end{prooftree} \;\; \begin{prooftree} \AxiomC{$\hygEvalConf{R}{e} \to \hygEvalConf{R'}{e'}$} \UnaryInfCLab{R-And-R}{$\hygEvalConf{R}{v \text{ and } e} \;\to\; \hygEvalConf{R'}{v \text{ and } e'}$} \end{prooftree} \\\\ \begin{prooftree} \AxiomC{} \UnaryInfCLab{R-And-Res1}{$\hygEvalConf{R}{\hygTrue \text{ and } \hygTrue} \;\to\; \hygEvalConf{R}{\hygTrue}$} \end{prooftree} \\\\ \begin{prooftree} \AxiomC{$v$ is a boolean value} \UnaryInfCLab{R-And-Res3}{$\hygEvalConf{R}{\hygFalse \text{ and } v} \;\to\; \hygEvalConf{R}{\hygFalse}$} \end{prooftree} \\\\ \begin{prooftree} \AxiomC{$v$ is a boolean value} \UnaryInfCLab{R-And-Res2}{$\hygEvalConf{R}{v \text{ and } \hygFalse} \;\to\; \hygEvalConf{R}{\hygFalse}$} \end{prooftree} \\\\ \begin{prooftree} \AxiomC{$\hygEvalConf{R}{e} \to \hygEvalConf{R'}{e'}$} \UnaryInfCLab{R-Or-L}{$\hygEvalConf{R}{e \text{ or } e_2} \;\to\; \hygEvalConf{R'}{e' \text{ or } e_2}$} \end{prooftree} \;\; \begin{prooftree} \AxiomC{$\hygEvalConf{R}{e} \to \hygEvalConf{R'}{e'}$} \UnaryInfCLab{R-Or-R}{$\hygEvalConf{R}{v \text{ or } e} \;\to\; \hygEvalConf{R'}{v \text{ or } e'}$} \end{prooftree} \\\\ \begin{prooftree} \AxiomC{$v$ is a boolean value} \UnaryInfCLab{R-Or-Res1}{$\hygEvalConf{R}{\hygTrue \text{ or } v} \;\to\; \hygEvalConf{R}{\hygTrue}$} \end{prooftree} \\\\ \begin{prooftree} \AxiomC{$v$ is a boolean value} \UnaryInfCLab{R-Or-Res2}{$\hygEvalConf{R}{v \text{ or } \hygTrue} \;\to\; \hygEvalConf{R}{\hygTrue}$} \end{prooftree} \\\\ \begin{prooftree} \AxiomC{} \UnaryInfCLab{R-Or-Res3}{$\hygEvalConf{R}{\hygFalse \text{ or } \hygFalse} \;\to\; \hygEvalConf{R}{\hygFalse}$} \end{prooftree} \end{array} \end{split}\]

Example 29

The “eager” semantics of “and” and “or” in hyggec can be observed e.g. by running the following Hygge0 program (available under examples/and-or-evaluation.hyg):

({println("Left of 'and'"); false}) and ({println("Right of 'and'"); true});
({println("Left of 'or'"); true}) or ({println("Right of 'or'"); true})

The output will be:

Left of 'and'
Right of 'and'
Left of 'or'
Right of 'or'

However, many programming languages implement a short-circuit evaluation semantics for both “and” and “or”. The intuition is the following:

  • the expression “\(e_1 \text{ and } e_2\)” reduces to \(\hygFalse\) when \(e_1\) is \(\hygFalse\). The expression \(e_2\) is only considered (and reduced, if needed) when \(e_1\) is \(\hygTrue\);

  • the expression “\(e_1 \text{ or } e_2\)” reduces to \(\hygTrue\) when \(e_1\) is \(\hygTrue\). The expression \(e_2\) is only considered (and reduced, if needed) when \(e_1\) is \(\hygFalse\).
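This observable difference can be demonstrated in any language with side effects. The snippet below (plain Python, only for illustration) contrasts an eager evaluation, which forces both operands first, with Python's built-in short-circuiting “and”:

```python
# Illustrative demo (not hyggec code): eager vs short-circuit "and".
# Python's 'and' short-circuits; to model hyggec's current eager semantics
# we evaluate both operands before combining them.

def left():
    print("Left of 'and'")
    return False

def right():
    print("Right of 'and'")
    return True

v1, v2 = left(), right()    # eager: both sides print their message
print("eager result:", v1 and v2)

# short-circuit: the left side is False, so right() never runs
print("short-circuit result:", left() and right())
```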

Example 30

With short-circuit semantics for “and” and “or”, the program in Example 29 would only output:

Left of 'and'
Left of 'or'

Write down the reduction semantics rules for “short-circuit and” and “short-circuit or” expressions, and implement them in hyggec. You can choose to either:

  • modify the built-in interpreter and the code generation in hyggec to give new short-circuit semantics to the existing “and” and “or” expressions; or

  • (recommended) add two new operators “&&” and “||” to hyggec. These new operators mean, respectively, “short-circuit and” and “short-circuit or”; you should add them to hyggec without altering the existing “and” and “or” expressions (this is similar to the Kotlin programming language).

Hint

  • There are several ways to implement these short-circuit operators in hyggec. Depending on your approach, you may achieve the result without changing the interpreter or the code generation. You may even make them simpler by removing some of their existing code…

  • To properly test the short-circuit operators, you need to write some tests whose observable behaviour differs depending on whether some part of the short-circuit expression is executed or not. This is not possible with the Hygge0 language — but you can revisit the issue in Module 5: Mutability and Loops, when we introduce Mutable Variables. For example, you may write a test where a short-circuit expression may change the value of a mutable variable (or not), and then write an assertion about that mutable variable…

Project Idea (Medium-Hard Difficulty): Better Output for Assertion Failures#

The goal of this project idea is to improve the usability of assertion failures in the assembly programs generated by hyggec. At the moment, the code generation for assertions (i.e., the case for Assert(...) in the function doCodegen in the file src/RISCVCodegen.fs) simply generates code that exits the RARS simulator with an error code if an assertion fails: as a consequence, it may be hard to understand which assertion in the input program failed.

To improve this, you can extend the hyggec code generation: you can make it generate assembly code that prints some diagnostic information when an assertion fails, before exiting RARS. For instance, you may print an error message with the position of the failed assertion in the input source program.

Note

  • The difficulty of this Project Idea may range from medium to hard, depending on how detailed the diagnostic output is. Please talk with the teacher to agree on an estimate.

  • To report better diagnostic output (e.g., printing the values of the variables involved in a failed assertion) you may revise and improve this Project Idea later in the course — e.g., when we have the tooling to recognise the free variables of an expression.