Module 11: Intermediate Representations and Register Allocation#

In this module we discuss the concept of intermediate representation in a compiler, focusing on a specific example called Administrative Normal Form (ANF). We then explore its application in the hyggec compiler, with the objective of using ANF to improve the code generation and register allocation strategy of hyggec.

Overall Objective#

Our goal is to compile and run Hygge programs like the one shown in Example 64 below.

Example 64 (A Hygge Program Using (Too) Many Registers)

Consider the following Hygge program:

(Download this example.)

1let res: int = 1 + (2 + (3 + (4 + (5 + (6 + (7 + (8
2                 + (9 + (10 + (11 + (12 + (13 + (14
3                 + (15 + (16 + (17 + (18 + 19)))))))))))))))));
4assert(res = 190)

If we try to compile this program using ./hyggec compile ..., the compiler crashes with the following error:

Unhandled exception. System.Exception: BUG: invalid generic register number 18
   at RISCV.Reg.r(UInt32 n) in .../src/RISCV.fs:line 88

Note

The program shown in Example 64 above is a solution to Exercise 22.

The reason for the crash shown in Example 64 above is that, when generating assembly for a Hygge expression \(e\), the Code Generation Strategy of hyggec tends to use a new register for each sub-expression of \(e\) (by incrementing the current env.Target register): this is done to hold intermediate results that are needed to compute the final result of \(e\). Depending on how such sub-expressions are arranged, hyggec may need more registers than the ones that are available.

More specifically, the API Reg.r(n) in the file RISCV.fs gives access to a generic integer register n selected between t0..t6 and s1..s11 — which means that n must be a number between 0 and 17. However, the code generation for Example 64 tries to use:

  • register Reg.r(0) to hold the result of the sub-expression 1;

  • register Reg.r(1) to hold the result of the sub-expression 2;

  • register Reg.r(17) to hold the result of the sub-expression 18;

  • register Reg.r(18) to hold the result of the sub-expression 19 — and this causes the crash.

Solving this issue is not straightforward:

  • we need to evaluate the 18 sub-expressions above exactly in that order, to respect the Formal Semantics of Hygge0. Therefore, in Example 64 we necessarily have to produce 18 intermediate results before we can compute their final sum and initialise the variable res;

  • even if we find an ad-hoc solution for this example, we can find many other examples of Hygge expressions that need more registers than those available.

Therefore, to improve the hyggec code generation we need a general register allocation solution that can use (and re-use) a limited number of registers in a more sophisticated way. This task is quite challenging: the program in Example 64 does not give many hints on how we could do that. Therefore, we address the challenge by taking an intermediate step: before attempting code generation and register allocation, we translate Hygge programs into an equivalent intermediate representation that is closer to the generated assembly code, and makes register allocation simpler to handle.

What is an Intermediate Representation (IR)?#

Generally speaking, an intermediate representation (IR) is any internal data structure used by a compiler to represent the code being compiled. Under this very broad definition, hyggec already uses three intermediate representations of the input source code:

  • the UntypedAST data structure produced after parsing (defined AST.fs and produced by Parser.fsy);

  • the TypedAST data structure produced after type checking (defined and produced in Typechecker.fs);

  • the Asm data structure produced by code generation (defined in RISCV.fs and produced in RISCVCodegen.fs).

Intermediate representations are designed as stepping stones towards specific purposes in the compilation process: for example, in the case of hyggec, the construction of Asm instances (via code generation) is made possible by the type information available in TypedAST instances (which in turn are produced by type checking). Our purpose in this Module is to improve the register allocation strategy of hyggec — and unfortunately, the intermediate representation TypedAST seems too high-level to help us: we need an intermediate representation that is closer to the output language.

Administrative Normal Form (ANF)#

Introducing an IR in a compiler requires significant effort, and poses a problem: since generating the IR requires a translation from some other IR, how do we know that the translation is correct? Or more accurately: how can we show that the behaviour of the IR coincides with the behaviour of the parsed program? To spot possible mistakes in the translation, we should:

  • formalise the syntax and semantics of the IR,

  • write an interpreter for the IR, and

  • given a Hygge expression \(e\), check whether interpreting \(e\) and interpreting its IR translation yield the same result. This would require writing many new tests specifically for the IR.

In this module we explore a form of intermediate representation called Administrative Normal Form (or A-Normal Form, or ANF). ANF has two features that make it very appealing for hyggec:

  1. it is close enough to RISC-V assembly to simplify the problem of register allocation; and

  2. unlike other alternative IRs in compiler literature, ANF does not require us to introduce a new language, nor new data structures in the hyggec compiler. In fact, a Hygge expression translated in ANF is still a Hygge expression, that respects the Hygge syntax plus some additional constraints (described in Definition 51 below). Therefore:

    • a Hygge expression in ANF can be represented as an UntypedAST or TypedAST in the hyggec compiler, and

    • we can test the correctness of the ANF translation by simply taking the existing tests for the hyggec interpreter, translating them into ANF, and running the ANF version with the existing interpreter, without any change. If some test fails after its ANF translation, then the ANF translation is buggy and needs to be fixed.

Definition 51 (Administrative Normal Form (ANF))

We say that a Hygge expression \(e\) is in Administrative Normal Form (ANF) when:

  1. \(e\) follows the so-called Barendregt convention, i.e. all the variables bound in \(e\) must have a unique name;

  2. moreover, \(e\) must be either:

    • a variable, or

    • a “let” expression “\(\hygLet{x}{t}{e_i}{e_s}\)” or “\(\hygLetMut{x}{t}{e_i}{e_s}\)” where:

      • the initialisation expression \(e_i\) is either:

        • a variable \(y\);

        • a value \(v\) that is not a lambda term;

        • a \(\hygReadInt\) or \(\hygReadFloat\) expression;

        • a lambda term “\(\hygLambda{x_1:t_1, \ldots, x_n:t_n}{e_b}\)” where the body \(e_b\) is in ANF;

        • an arithmetic expression “\(y + z\)” or “\(y * z\)” where the left-hand-side and right-hand-side are both variables;

        • a logical expression “\(\hygOr{y}{z}\)” or “\(\hygAnd{y}{z}\)” where the left-hand-side and right-hand-side are both variables;

        • a relational expression “\(y = z\)” or “\(y < z\)” where the left-hand-side and right-hand-side are both variables;

        • a logical negation “\(\hygNot{y}\)” where the argument is a variable;

        • an assertion \(\hygAssert{y}\) where the argument is a variable;

        • a type ascription “\(y:t\)” where the type-annotated expression is a variable;

        • a type declaration “\(\hygType{y}{t}{e_s}\)” where the scope expression \(e_s\) is in ANF;

        • a structure constructor “\(\hygStruct{f_1 = y_1; \ldots; f_n: y_n}\)” where all field initialisation expressions are all variables \(y_1, \ldots, y_n\);

        • a union constructor “\(\hygUnionInst{l}{y}\)” where the initialisation expression is a variable \(y\);

        • a conditional expression “\(\hygCond{y}{e_t}{e_f}\)” where:

          • the condition expression is a variable \(y\);

          • the “then” branch expression \(e_t\) is in ANF;

          • the “else” branch expression \(e_f\) is in ANF;

        • a loop “\(\hygWhile{e_c}{e_b}\)” where:

          • the loop condition \(e_c\) is in ANF;

          • the loop body \(e_b\) is in ANF;

        • a pattern matching “\(\hygMatch{y}{\hygMatchCase{l_1}{z_1}{e_1}; \ldots; \hygMatchCase{l_n}{z_n}{e_n}}\)” where:

          • the matched expression is a variable \(y\);

          • each continuation expression \(e_1, \ldots, e_n\) is in ANF;

      • the scope expression \(e_s\) is in ANF.

Example 65 (A Hygge Expression and Its ANF Equivalent)

Consider this simple Hygge program:

let x: int = 2 + 3;
assert(x = 5)

This program is not in ANF according to Definition 51, because:

  • the initialisation expression of the “let x…” is not in ANF: it is an addition whose two operands are not variables; and

  • the expression in the scope of the “let…” is not in ANF: it is an assertion whose argument is not a variable.

However, the following Hygge program is in ANF, and is also equivalent to the program above — in the sense that it performs the computations, in the same order:

let y1: int = 2;
let y2: int = 3;
let x: int = y1 + y2;
let y3: int = 5;
let y4: bool = x = y3;
let y5: unit = assert(y4)
y5

Observe that in the ANF program in Example 65, each sub-expression (i.e. the operands of the addition, the argument of the assertion) is first assigned to a dedicated variable, and then used to compute other expressions. In other words, each intermediate result needed to compute other results is explicitly stored in a variable. This pattern makes the ANF program very reminiscent of assembly code, as shown in Example 66 below.

Example 66 (From Hygge to ANF to Assembly)

Consider this Hygge program:

let x: int = 1 + 2 + 3;
x

The following program in ANF is equivalent to the one above, in the sense that it performs the same computations and in the same order. The comments show a corresponding assembly version of the ANF program, where we use registers instead of variables.

let y0: int = 1;        // li t0, 1
let y1: int = 2;        // li t1, 2
let y2: int = y0 + y1;  // add t2, t0, t1
let y3: int = 3;        // li t3, 3
let x: int = y2 + y3;   // add t4, t2, t3
x

Transformation of a Hygge Expression into ANF#

To transform a Hygge expression \(e\) into an equivalent ANF expression \(e'\), according to Definition 51, we can follow a procedure based on two functions:

  • \(\hygToANFDefs{\ldots}\)

  • \(\hygToANF{\ldots}{\ldots}\)

The function \(\hygToANFDefs{\ldots}\):

  • takes one argument \(e\), which is a Hygge expression, and

  • returns a pair consisting of a variable \(y\) and a list of ANF definitions \(L\), where:

    • the variable \(y\) (which is unique and may be autogenerated) represents the result of the expression \(e\) after its ANF transformation, and

    • the list \(L\) contains a sequence of triplets \((z, m, e_z)\), each one representing an ANF definition:

      • \(z\) is a unique variable name (which may be autogenerated);

      • \(m_z\) is a boolean saying whether the variable \(z\) is mutable or not; and

      • \(e_z\) is the expression (in ANF) that initialises \(z\).

      The last element of \(L\) must be a triplet \((y, m, e_y)\) — i.e. \(L\) must end with the name of the variable \(y\) returned by \(\hygToANFDefs{e}\), and its initialisation expression \(e_y\) (in ANF).

The idea is that when \(\hygToANFDefs{e}\) returns the pair \((y, L)\), then \(y\) can be used to get the same result of the original expression \(e\) — provided that all the variables in \(L\) are initialised beforehand, in the order in which they appear in \(L\) and using their corresponding initialisation expressions (which are in ANF). For more details, see Converting a Hygge Expression into a List of ANF Definitions.

Note

From now on, for brevity, when writing ANF definition triplets \((z, m_z, e_z)\) we will often omit the element \(m_z\) when its value is false — which means that the triplet describes the definition of an immutable variable \(z\).

Instead, the function \(\hygToANF{\ldots}{\ldots}\):

  • takes a pair of arguments: a variable \(y\) and a list of ANF definitions \(L\) ending with the initialisation of \(y\) (as returned by \(\hygToANFDefs{\ldots}\) above), and

  • returns an expression in ANF — which is a sequence of “let” binders that perform the variable definitions and initialisations described in \(L\), and terminate with the variable \(y\).

The idea is that \(\hygToANF{y}{L}\) constructs a Hygge expression in ANF that performs the initialisations and computations in \(L\) (which initialise \(y\)), and then returns the value of \(y\). For more details, see Converting a List of ANF Definitions into a Hygge Expression in ANF.

Using the two functions above, we can transform a Hygge expression \(e\) in ANF by simply executing:

\[ \hygToANFOneArg{\,\hygToANFDefs{e}\,} \]

Converting a Hygge Expression into a List of ANF Definitions#

When \(\hygToANFDefs{e}\) is invoked, it proceeds as follows:

  1. it chooses a (possibly autogenerated) variable \(y\) to represent the result of \(e\);

  2. it recursively calls itself on each sub-expression of \(e\), getting corresponding variables and lists of ANF definitions;

  3. it assembles such variables and lists of ANF definitions according to the semantics of \(e\) (which establishes the evaluation order) and the requirements of Definition 51 (which establishes what \(e\) should look like in ANF); and

  4. returns the variable \(y\) together with the corresponding list of ANF definitions, ending with the triplet \((y, m_y, e_y)\) (where \(e_y\) is the ANF expression based on \(e\) that initialises \(y\)).

More in detail, \(\hygToANFDefs{e}\) processes \(e\) as follows.

  • If \(e\) is just a variable \(y\), then \(\hygToANFDefs{y}\) just returns \(y\) and an empty list of ANF definitions.

  • If \(e\) is just a value \(v\) that is not a lambda term, then \(\hygToANFDefs{v}\) returns the following pair:

    • an autogenerated variable \(y\) (representing the value \(v\), and)

    • a list of ANF definitions with just one element \((y, \hygFalse, v)\), meaning that \(y\) is initialised by the value \(v\).

    Example 67 (From Integer Value to List of ANF Definitions)

    Consider the following Hygge expression:

    \[ 42 \]

    The result of \(\hygToANFDefs{42}\) is the following pair containing a unique variable \(y_0\) capturing the value, and a list of ANF definitions that just initialises \(y_0\) (which is immutable) with \(42\):

    \[ \left( y_0, \left[\begin{array}{@{}l@{}} (y_0, \hygFalse, 42) \end{array}\right] \right) \]
  • If \(e\) is an addition “\(e_1 + e_2\)”, then \(\hygToANFDefs{e_1 + e_2}\) must:

    • generate a unique variable \(y\) representing the result of the addition;

    • perform a recursive call on \(e_1\) — thus getting the unique variable \(z_1\) and its list of ANF definitions \(L_1\);

    • perform a recursive call on \(e_2\) — thus getting the unique variable \(z_2\) and its list of ANF definitions \(L_2\);

    • return the following pair:

      • the autogenerated variable \(y\) with the result of the addition, and

      • a list of ANF definitions consisting of \(L_1\), followed by \(L_2\), followed by the pair \((y, z_1 + z_2)\). This means that the variable \(y\) is initialised by the addition \(z_1 + z_2\) — which requires that all the variables in \(L_1\) and \(L_2\) are initialised first.

    Example 68 (From Addition to List of ANF Definitions)

    Consider the following Hygge expression:

    \[ 1 + 2 \]

    The result of \(\hygToANFDefs{1+2}\) is the following pair containing a unique variable capturing the result of the addition, and a list of ANF definitions introducing other (immutable) variables for the sub-expressions \(1\) and \(2\):

    \[\begin{split} \left( y_2, \left[\begin{array}{@{}l@{}} (y_0, \hygFalse, 1),\\ (y_1, \hygFalse, 2),\\ (y_2, \hygFalse, y_0 + y_1) \end{array}\right] \right) \end{split}\]
  • If \(e\) is a “let” binding “\(\hygLet{y}{t}{e_i}{e_s}\)” or “\(\hygLetMut{y}{t}{e_i}{e_s}\)”, then \(\hygToANFDefs{e}\) must:

    • compute a unique variable \(y'\);

    • substitute \(y\) with \(y'\) in both \(e_i\) and \(e_s\), getting the new initialisation expression \(e'_i\) and the new scope expression \(e'_s\). This is necessary to ensure that all bound variables are unique, as required by Definition 51;

    • perform a recursive call on the initialisation expression \(e'_i\) — thus getting the unique variable \(z_i\) and its list of ANF definitions \(L_i\);

    • perform a recursive call on the scope expression \(e'_s\) — thus getting the unique variable \(z_s\) and its list of ANF definitions \(L_s\);

    • return the following pair:

      • the unique variable \(z_s\) corresponding to the result of the “let” scope, and

      • a list of ANF definitions constructed by concatenating:

        • \(L_i\), i.e. the ANF definitions yielded by the initialisation expression;

        • the triplet \((y', m_{y'}, z_i)\), which initialises \(y'\) (used in the “let” scope) with \(z_i\) (the result of the initialisation expression in ANF) — and where \(m_{y'}\) is:

          • true if \(e\) is a “let mutable…” binding, or

          • false otherwise;

        • \(L_s\), i.e. the ANF definitions yielded by the scope expression (which ends by initialising the returned variable \(z_s\)).

    Example 69 (From “Let…” Expression to List of ANF Definitions)

    Consider the following Hygge expression:

    \[\begin{split} \begin{array}{@{}l@{}} \hygLetInit{x}{t}{1 + 2} \\ x + 3 \end{array} \end{split}\]

    The result of \(\hygToANFDefs{\ldots}\) for the expression above is the following pair containing a unique variable capturing the result of the “let” expression, and a list of ANF definitions: (for brevity, here we omit the second element of each ANF definition triplet, since it is always false because each variable is immutable)

    \[\begin{split} \left( y_5, \left[\begin{array}{@{}l@{}} (y_0, 1),\\ (y_1, 2),\\ (y_3, y_0 + y_1), \\ (x, y_3), \\ (y_4, 3), \\ (y_5, x + y_4) \end{array}\right] \right) \end{split}\]
  • If \(e\) is a if-then-else expression “\(\hygCond{e_c}{e_t}{e_f}\), then \(\hygToANFDefs{e}\) must ensure that only one of the ANF transformations of \(e_t\) and \(e_f\) is executed, depending on whether the ANF transformation of \(e_c\) returns true or false. Therefore, \(\hygToANFDefs{e}\) must:

    • generate a unique variable \(y\) representing the result of the if-then-else expression;

    • perform a recursive call on the condition expression \(e_c\) — thus getting the unique variable \(z_c\) and its list of ANF definitions \(L_c\);

    • perform a recursive call on the “then” branch expression \(e_t\) — thus getting the unique variable \(z_t\) and its list of ANF definitions \(L_t\);

    • perform a recursive call on the “else” branch expression \(e_f\) — thus getting the unique variable \(z_f\) and its list of ANF definitions \(L_f\);

    • turn the lists of ANF definitions \(L_t\) and \(L_f\) into corresponding Hygge expressions \(e'_t\) and \(e'_f\) (both in ANF) by invoking \(\hygToANF{z_t}{L_t}\) and \(\hygToANF{z_f}{L_f}\) (described in the section below);

    • return the following pair:

      • the autogenerated variable \(y\) with the result of the if-then-else expression, and

      • a list of ANF definitions constructed by concatenating:

        • \(L_c\), i.e. the ANF definitions yielded by the condition expression \(e_c\); and

        • the pair \((y, \hygCond{z_c}{e'_t}{e'_f})\), which initialises \(y\) with the result of the if-then-else expression in ANF.

    Example 70 (From “If-Then-Else” Expression to List of ANF Definitions)

    Consider the following Hygge expression:

    \[ \hygCond{2 < 3}{1 + 2}{3 * 4} \]

    The results of \(\hygToANFDefs{1+2}\) and \(\hygToANFDefs{3*4}\) (for the “then” and “else” branches) give, respectively, the following pairs containing a unique variable capturing the result of the expression, and a list of ANF definitions (where all defined variables are immutable):

    \[\begin{split} \begin{array}{c} \left( y_3, \left[\begin{array}{@{}l@{}} (y_0, 1),\\ (y_1, 2),\\ (y_2, y_0 + y_1) \end{array}\right] \right) \qquad\qquad \left( z_2, \left[\begin{array}{@{}l@{}} (z_0, 3),\\ (z_1, 4),\\ (z_2, z_0 * z_1) \end{array}\right] \right) \end{array} \end{split}\]

    If we invoke \(\hygToANF{\ldots}{\ldots}\) (described in the section below) on the two pairs above, we get the corresponding Hygge expressions in ANF:

    \[\begin{split} \begin{array}{@{}l@{}} \hygLetInit{y_0}{t}{1} \\ \quad\hygLetInit{y_1}{t}{2} \\ \quad\quad\hygLetInit{y_2}{t}{y_0 + y_1} \\ \quad\quad\quad{}y_2 \end{array} \qquad\qquad \begin{array}{@{}l@{}} \hygLetInit{z_0}{t}{3} \\ \quad\hygLetInit{z_1}{t}{4} \\ \quad\quad\hygLetInit{z_2}{t}{z_0 * z_1} \\ \quad\quad\quad{}z_2 \end{array} \end{split}\]

    The result of \(\hygToANFDefs{\ldots}\) for the expression above is the following pair containing a unique variable capturing the result of the “if-then-else” expression, and a list of ANF definitions:

    \[\begin{split} \left( w_3, \left[\begin{array}{@{}l@{}} (w_0, 2),\\ (w_1, 3),\\ (w_2, w_0 < w_1) \\ \left(w_3,\; \hygCond{w_2}{ \left( \begin{array}{@{}l@{}} \hygLetInit{y_0}{t}{1} \\ \quad\hygLetInit{y_1}{t}{2} \\ \quad\quad\hygLetInit{y_2}{t}{y_0 + y_1} \\ \quad\quad\quad{}y_2 \end{array} \right) }{ \left( \begin{array}{@{}l@{}} \hygLetInit{z_0}{t}{3} \\ \quad\hygLetInit{z_1}{t}{4} \\ \quad\quad\hygLetInit{z_2}{t}{z_0 * z_1} \\ \quad\quad\quad{}z_2 \end{array} \right) } \right) \end{array}\right] \right) \end{split}\]

Important

The function \(\hygToANFDefs{e}\) contains one case for each possible Hygge expression \(e\). The complete implementation of the function is available in the hyggec Git repository, in the file ANF.fs. For more details, see also Implementation: ANF Transformation and Register Allocation in hyggec.

Converting a List of ANF Definitions into a Hygge Expression in ANF#

The purpose of the function \(\hygToANF{x}{L}\) is to take a list of ANF definitions (ending with the definition of variable \(x\)) and produce a corresponding Hygge expression in ANF, consisting of a series of nested “let…” expressions that exactly match the definitions in \(L\). The idea is that, for each ANF definition triplet \((y, m_y, e_y)\) in the list \(L\):

  • if \(m_y\) is false (i.e. \(y\) is immutable), we produce a binder “\(\hygLet{y}{t}{e_y}{\ldots}\)”;

  • otherwise, when \(m_y\) is true (i.e. \(y\) is mutable), we produce a binder “\(\hygLetMut{y}{t}{e_y}{\ldots}\)”.

You can find a hint of this behaviour in Example 70 above (for the “then” and “else” branches of the if-then-else); let us now see a few more examples.

Example 71 (From List of ANF Definitions to Expression in ANF (1))

Consider the pair of variable and ANF definitions in Example 67, obtained from the expression \(42\) by executing \(\hygToANFDefs{42}\):

\[ \left( y_0, L \right) \qquad\text{where }\; L = \left[\begin{array}{@{}l@{}} (y_0, \hygFalse, 42) \end{array}\right] \]

By executing \(\hygToANF{y_0}{L}\), we get the expression:

\[\begin{split} \begin{array}{@{}l@{}} \hygLetInit{y_0}{t}{42}\\ y_0 \end{array} \end{split}\]

which is the ANF translation of the original expression \(42\).

Example 72 (From List of ANF Definitions to Expression in ANF (2))

Consider the pair of variable and ANF definitions in Example 68, obtained from the expression \(1+2\) by executing \(\hygToANFDefs{1+2}\):

\[\begin{split} \left( y_2, L \right) \qquad\text{where }\; L = \left[\begin{array}{@{}l@{}} (y_0, 1),\\ (y_1, 2),\\ (y_2, y_0 + y_1) \end{array}\right] \end{split}\]

By executing \(\hygToANF{y_0}{L}\), we get the expression:

\[\begin{split} \begin{array}{@{}l@{}} \hygLetInit{y_0}{t}{1}\\ \hygLetInit{y_1}{t}{2}\\ \hygLetInit{y_2}{t}{y_0 + y_1}\\ y_2 \end{array} \end{split}\]

which is the ANF translation of the original expression \(1 + 2\).

Example 73 (From List of ANF Definitions to Expression in ANF (3))

Consider the pair of variable name \(y_5\) and ANF definitions \(L\) obtained in Example 69 from a “let…” expression. If we execute \(\hygToANF{y_5}{L}\), we get the following expression:

\[\begin{split} \begin{array}{@{}l@{}} \hygLetInit{y_0}{t}{1} \\ \hygLetInit{y_1}{t}{2} \\ \hygLetInit{y_3}{t}{y_0 + y_1} \\ \hygLetInit{x}{t}{y_3} \\ \hygLetInit{y_4}{t}{3} \\ \hygLetInit{y_5}{t}{x + y_4} \\ y_5 \end{array} \end{split}\]

which is the ANF translation of the “let…” expression in Example 69.

Important

The implementation of the function \(\hygToANF{x}{L}\) is available in the hyggec Git repository, in the file ANF.fs. For more details, see also Implementation: ANF Transformation and Register Allocation in hyggec.

Exercise 38

Convert the following Hygge expressions into ANF, according to Definition 51. Ideally, you should proceed in two phases, as described above:

You can use hyggec to check your answers or simply experiment with ANF transformations.

  • \(2 * 3\)

  • \(1 + 2 * 3\)

  • \(\hygLet{x}{t}{1}{2}\)

  • \(\hygLet{x}{t}{1*2*3}{x}\)

ANF-Based Linear Register Allocation#

We now address the opening problem for this module: how to compile arbitrarily-complex Hygge expressions (requiring any number of intermediate results) by only using a limited number of registers. We explore the following topics:

Recognising and Discarding Unused Intermediate Results#

When a Hygge expression is converted to ANF, each one of its sub-expressions and intermediate computations becomes associated to a dedicated variable. Therefore:

  • we need to maintain the results of a computation available in a register only as long as the corresponding variable \(z\) is referenced somewhere in the program;

  • to understand whether a variable \(z\) (and the value it holds) is referenced somewhere in the program, we simply check whether \(z\) appears as a free variable in the scope of a “let” binder.

The idea is illustrated in Example 74 below.

Example 74 (A Hygge Program and its Intermediate Computations)

Consider the following Hygge program:

(Download this example.)

1let res: int = ((1 + 2) + 3) + 4;
2assert(res = 10)

When translated into ANF, the program becomes:

(Download this example.)

 1let y0: int = 1;
 2let y1: int = 2;
 3let y2: int = y0 + y1;     // y0 and y1 are now unused
 4let y3: int = 3;
 5let y4: int = y2 + y3;     // y2 and y3 are now unused
 6let y5: int = 4;
 7let y6: int = y4 + y5;     // y4 and y5 are now unused
 8let res: int = y6;         // y6 is now unused
 9let y7: int = 10;
10let y8: bool = res = y7;   // y7 is now unused
11let y9: unit = assert(y8); // y8 is now unused
12y9

The comments in the program above mark the points where variables become unused (i.e. are not in the free variables of the code that follows), and therefore, their associated register could be reused to hold some other variable. For example:

  • y3 can reuse the register of y0;

  • y4 can reuse the register of y1;

  • y5 can reuse the register of y2;

By discarding variables as soon as they become unused, and reusing their registers, the program above can be compiled to just use 3 registers.

Why Discarding Unused Variables is Not Enough#

The approach illustrated in the previous section is very helpful to reduce the number of registers in use; however, it is not enough to address the opening example of this module. To see why, let us examine a simplified scenario where we only have 3 registers available, and consider the Hygge programs in Example 75 below.

Example 75 (A Hygge Program using (Too) Many Registers (Simplified))

Consider the following Hygge program:

(Download this example.)

1let res: int = 1 + (2 + (3 + 4));
2assert(res = 10)

When translated into ANF, the program becomes:

(Download this example.)

 1let y0: int = 1;
 2let y1: int = 2;
 3let y2: int = 3;
 4let y3: int = 4;
 5let y4: int = y2 + y3;     // y2 and y3 are now unused
 6let y5: int = y1 + y4;     // y1 and y4 are now unused
 7let y6: int = y0 + y5;     // y0 and y5 are now unused
 8let res: int = y6;         // y6 is now unused
 9let y7: int = 10;
10let y8: bool = res = y7;   // y7 is now unused
11let y9: unit = assert(y8); // y8 is now unused
12y9

Observe that the ANF program above needs to maintain the 4 variables y0y3 available at the same time (whereas the ANF program in Example 74 can discard y0 and y1 and y3).

Now, suppose that our target architecture only has 3 available registers (numbered from 0 to 2). In this situation, if we try to compile this ANF program using ./hyggec compile ..., the compiler would crash with the following error:

Unhandled exception. System.Exception: BUG: invalid generic register number 4
   at RISCV.Reg.r(UInt32 n) in .../src/RISCV.fs:line 88

The reason for the crash is that the Code Generation Strategy of hyggec tends to use a new register for each sub-expression; therefore, when compiling the program above, it tries to use:

  • register Reg.r(0) to hold variable y0 (i.e. the result of the sub-expression 1);

  • register Reg.r(1) to hold variable y1 (i.e. the result of the sub-expression 2);

  • register Reg.r(2) to hold variable y2 (i.e. the result of the sub-expression 3);

  • register Reg.r(3) to hold variable y3 (i.e. the result of the sub-expression 4) — and this causes the crash.

ANF-Based Code Generation with Register Allocation#

To compile Hygge expressions that use more intermediate results (and ANF variables) than available registers, we need a more sophisticate approach:

  • we leverage the fact that expressions are in ANF, and

  • we reuse registers even when their value is still needed by the Hygge expression, by generating assembly code to:

    • spill variables, i.e. copy the value of a variable onto the stack, to reuse its register for another variable; and

    • load variabless, i.e. assign a register to a previously-spilled variable (possibly after spilling another variable) and restore its value from the stack.

To achieve this, we need a code generation environment with the following information:

  • a target variable name \(z\) that will contain the result of the expression being compiled.

  • the known variables \(x_1, \ldots, x_n\);

  • each known variable might have up to two storage locations:

    • a stack location (as an offset from the frame pointer register fp) that never changes during the lifetime of a variable;

    • if possible, a register that may change during the lifetime of the variable;

  • a subset of needed variables that should never be discarded (even if they may seem unused); and

Now, suppose that we are compiling a Hygge expression \(e\), using the code generation environment outlined above. We proceed as described in the following subsections.

Generating Code for a “Let” Binder or Variable in ANF#

According to Definition 51, when we start compiling a Hygge program in ANF we can expect to find either a “let” binder, or a variable.

When we generate code for a “let” binder “\(\hygLet{x}{t}{e_i}{e_s}\)” or “\(\hygLetMut{x}{t}{e_i}{e_s}\)”:

  1. we discard all known variables that are not “needed” in the code generation environment, nor free (i.e. not used) in \(e_i\) nor \(e_s\) (as described in the previous section). This helps freeing up registers and stack locations (if possible);

  2. we assign both a register and a stack location to the newly-bound variable \(x\)

    • if all registers are being used, we generate code to spill one of the other known variables \(x_1,\ldots,x_n\) onto its assigned stack location, and assign its register to \(x\);

    • if no stack locations are available, we allocate a new stack location by decreasing the stack pointer register sp (recall that the RISC-V stack grows downwards);

  3. we compile the initialisation expression \(e_i\), by targeting the variable \(x\). Since \(e_i\) is in ANF, according to Definition 51, it can only have a few specific shapes (discussed below);

  4. we discard all known variables that are not “needed” in the code generation environment, nor free (i.e. not used) in \(e_s\);

  5. we generate code for the scope expression \(e_s\).

When we generate code for a variable \(x\), we proceed as follows:

  1. if \(x\) does not have a register assigned to it (because it has been spilled onto the stack before), we generate code to load \(x\) onto a register;

    • if all registers are being used, we generate code to spill one of the other known variables \(x_1,\ldots,x_n\) onto its assigned stack location, and assign its register to \(x\);

  2. we check where the target variable \(z\) is stored:

    • if \(z\) has a register assigned to it, we copy the value of \(x\) onto that register;

    • otherwise, \(z\) has been spilled onto the stack, so we just copy the value of \(x\) onto the stack location assigned to \(z\).

Generating Code for the Initialisation Expression of a “Let” Binder in ANF#

According to Definition 51, when we generate code for the initialisation expression of a “let” binder in ANF, we can only find specific shapes of expressions. We discuss two cases: addition and if-then-else.

Generating Code for Additions#

By Definition 51, an addition in ANF (which may only appear in the initialisation expression of a “let” binder) can only have the form “\(x_1 + x_2\)”, i.e. its operands can only be variables. Therefore, we proceed as follows:

  1. we make sure that the addition operands \(x_1\) and \(x_2\) and the target variable \(z\) are currently stored on a register;

    • to this end, we may need to generate code to spill up to three other variables onto the stack, and assign their registers to \(z\), \(x_1\), and \(x_2\);

  2. we generate code for the RISC-V instruction add using the registers of \(x_1\), \(x_2\), and \(z\).

Generating Code for If-Then-Else Expressions#

By Definition 51, an if-then-else expression in ANF (which may only appear in the initialisation expression of a “let” binder) can only have the form “\(\hygCond{y}{e_t}{e_f}\) i.e. its condition expression can only be a variable; moreover, both \(e_t\) and \(e_f\) must be in ANF.

To compile the condition \(y\), we need to make sure that \(y\) is stored into a register (and to this end, we may need to generate code to spill some other variable and assign its register to \(y\)).

The compilation of \(e_t\) and \(e_f\) proceeds recursively — but there is a catch:

  • the code generated for \(e_t\) and \(e_f\) may contain code to spill and load variables;

  • therefore, the register allocation at the end of \(e_t\) may be different from \(e_f\).

This situation is illustrated in Example 76 below.

Example 76 (Why We Must Synchronise Register Allocation Across “If-Then-Else” Branches)

Consider the following Hygge program:

(Download this example.)

1let z: bool = true;  // Change to 'false' to experiment
2let x1: int = 1;
3let x2: int = 2;
4let x3: int = if z then x1 else 1 + (2 + (3 + 4));
5assert(x1 = 1);
6assert(x2 = 2);
7assert(if z then x3 = x1 else x3 = 10)

Suppose we only have 4 registers available. The following happens:

  • variable z is assigned to register Reg.r(0);

  • variable x1 is assigned to register Reg.r(1);

  • variable x2 is assigned to register Reg.r(2);

  • variable x3 is assigned to register Reg.r(3);

  • when compiling the “if-then-else” expression:

    • the “then” branch does not need to use any additional register;

    • the “else” branch needs 4 registers for the intermediate results of the expression 1 + (2 + (3 + 4)). Therefore, it will need to spill the values z, x1, x2, and x3 onto the stack;

  • therefore, what is the register allocation when the program reaches line 5? Are the variables on the stack, or on some register? That depends on whether the ‘true’ or ‘false’ branch of the ‘if-then-else’ was taken earlier…

The solution to this issue is to add “synchronisation” assembly code to spill/load variables ensuring that, no matter which branch of the “if-then-else” is taken, the known variables have a known register allocation. For example, we can add at the end of the assembly code of the \(e_t\) branch some assembly code that spills/loads the known variables until their register allocation exactly matches the allocation at the end of the \(e_f\) branch. We reprise this idea when discussing its implementation in hyggec, in Example 78 below.

Implementation: ANF Transformation and Register Allocation in hyggec#

Tip

The conversion of Hygge programs into ANF (described in this section) is already implemented in the hyggec Git repository, in a file called ANF.fs. To see what changed since the last module, you can inspect the differences between the tags union-rec-types and anf. For more details, see also Implementation: ANF Transformation and Register Allocation in hyggec.

The following subsections outline the implementation of ANF and register allocation available on the hyggec Git repository.

The version of hyggec released with this Module adds two option, available in some compiler phases:

  • the new option -a or --anf that can be used to activate and inspect ANF conversion and ANF-based code generation with register allocation; and

  • the new option -r or --registers can be used to limit the number of registers being used for code generation (useful for experiments and debugging).

Therefore, we can now invoke hyggec as follows:

  • ./hyggec parse -a file.hyg parses the file, converts it into ANF, and prints the result.

  • ./hyggec interpret -a file.hyg parses the file, converts it into ANF, and interprets the result.

  • ./hyggec compile -a -r N file.hyg parses the file, type-checks it, converts the result into ANF, compiles it using the ANF-Based Code Generation with Register Allocation with N available registers (dafault: 18), and displays the resulting RISC-V assembly code. By reducing the value of N, it is possible to observe how the generated code has to spill/load variables more often.

  • ./hyggec compile -a -r N file.hyg parses the file, type-checks it, converts the result into ANF, compiles it using the ANF-Based Code Generation with Register Allocation with N available registers (dafault: 18), and executes the resulting RISC-V assembly code using RARS. By reducing the value of N, the generated code will spill/load variables more often.

The new version of the compiler also has two new series of tests:

  • tests/interpreter-anf/ contains a series of tests that are parsed, converted into ANF, and then interpreted (these are the same tests of tests/interpreter/);

  • tests/codegen-anf/ contains a series of tests that are parsed, type-checked, converted into ANF, compiled using the ANF-based register allocation, and then executed in RARS.

Example 77 (A Hygge Program using (Too) Many Registers (Revised))

Consider again the following Hygge program (from Example 75):

(Download this example.)

1let res: int = 1 + (2 + (3 + 4));
2assert(res = 10)

If we save this file as hygge-many-regs-simpl.hyg, we can see its ANF translation by running:

./hyggec parse -a hygge-many-regs-simpl.hyg

and the output is:

Let $anf (1:1-2:16)
┣╾Ascription: Pretype Id "_"; pos: (1:1-2:16)
┣╾init: IntVal 1 (1:16-1:16)
┗╾scope: Let $anf_0 (1:1-2:16)
         ┣╾Ascription: Pretype Id "_"; pos: (1:1-2:16)
         ┣╾init: IntVal 2 (1:21-1:21)
         ┗╾scope: Let $anf_1 (1:1-2:16)
                  ┣╾Ascription: Pretype Id "_"; pos: (1:1-2:16)
                  ┣╾init: IntVal 3 (1:26-1:26)
                  ┗╾scope: Let $anf_2 (1:1-2:16)
                           ┣╾Ascription: Pretype Id "_"; pos: (1:1-2:16)
                           ┣╾init: IntVal 4 (1:30-1:30)
                           ┗╾scope: Let $anf_3 (1:1-2:16)
                                    ┣╾Ascription: Pretype Id "_"; pos: (1:1-2:16)
                                    ┣╾init: Add (1:26-1:30)
                                    ┃       ┣╾lhs: Var $anf_1 (1:26-1:26)
                                    ┃       ┗╾rhs: Var $anf_2 (1:30-1:30)
                                    ┗╾scope: Let $anf_4 (1:1-2:16)
                                             ┣╾Ascription: Pretype Id "_"; pos: (1:1-2:16)
                                             ┣╾init: Add (1:21-1:31)
                                             ┃       ┣╾lhs: Var $anf_0 (1:21-1:21)
                                             ┃       ┗╾rhs: Var $anf_3 (1:26-1:30)
                                             ┗╾scope: Let $anf_5 (1:1-2:16)
                                                      ┣╾Ascription: Pretype Id "_"; pos: (1:1-2:16)
                                                      ┣╾init: Add (1:16-1:32)
                                                      ┃       ┣╾lhs: Var $anf (1:16-1:16)
                                                      ┃       ┗╾rhs: Var $anf_4 (1:21-1:31)
                                                      ┗╾scope: Let res (1:1-2:16)
                                                               ┣╾Ascription: Pretype Id "_"; pos: (1:1-2:16)
                                                               ┣╾init: Var $anf_5 (1:16-1:32)
                                                               ┗╾scope: Let $anf_6 (1:1-2:16)
                                                                        ┣╾Ascription: Pretype Id "_"; pos: (1:1-2:16)
                                                                        ┣╾init: IntVal 10 (2:14-2:15)
                                                                        ┗╾scope: Let $anf_7 (1:1-2:16)
                                                                                 ┣╾Ascription: Pretype Id "_"; pos: (1:1-2:16)
                                                                                 ┣╾init: Eq (2:8-2:15)
                                                                                 ┃       ┣╾lhs: Var res (2:8-2:10)
                                                                                 ┃       ┗╾rhs: Var $anf_6 (2:14-2:15)
                                                                                 ┗╾scope: Let $anf_8 (1:1-2:16)
                                                                                          ┣╾Ascription: Pretype Id "_"; pos: (1:1-2:16)
                                                                                          ┣╾init: Assertion (2:1-2:16)
                                                                                          ┃       ┗╾arg: Var $anf_7 (2:8-2:15)
                                                                                          ┗╾scope: Var $anf_8 (1:1-2:16)

We can compile the program above using the new ANF-based register allocation, limited to only 4 registers, by running:

./hyggec compile -a -r 4 hygge-many-regs-simpl.hyg

and the output is:

 1.data:
 2
 3.text:
 4    mv fp, sp  # Initialize frame pointer
 5    addi sp, sp, -4  # Extend the stack for variable $anf
 6    # Variable $anf allocation: register t0, frame pos. 1 
 7    li t0, 1
 8    addi sp, sp, -4  # Extend the stack for variable $anf_0
 9    # Variable $anf_0 allocation: register t1, frame pos. 2 
10    li t1, 2
11    addi sp, sp, -4  # Extend the stack for variable $anf_1
12    # Variable $anf_1 allocation: register t2, frame pos. 3 
13    li t2, 3
14    addi sp, sp, -4  # Extend the stack for variable $anf_2
15    # Variable $anf_2 allocation: register s1, frame pos. 4 
16    li s1, 4
17    sw t0, -4(fp)  # Spill variable $anf from register t0 to stack
18    addi sp, sp, -4  # Extend the stack for variable $anf_3
19    # Variable $anf_3 allocation: register t0, frame pos. 5 
20    add t0, t2, s1  # $anf_3 <- $anf_1 + $anf_2
21    # Variable $anf_4 allocation: register t2, frame pos. 3 
22    add t2, t1, t0  # $anf_4 <- $anf_0 + $anf_3
23    # Variable $anf_5 allocation: register t1, frame pos. 2 
24    lw t0, -4(fp)  # Load variable $anf onto register t0
25    add t1, t0, t2  # $anf_5 <- $anf + $anf_4
26    # Variable res allocation: register t2, frame pos. 1 
27    mv t2, t1  # res <- $anf_5
28    # Variable $anf_6 allocation: register t1, frame pos. 2 
29    li t1, 10
30    # Variable $anf_7 allocation: register t0, frame pos. 3 
31    beq t2, t1, eq_true
32    li t0, 0  # Comparison result is false
33    j eq_end
34eq_true:
35    li t0, 1  # Comparison result is true
36eq_end:
37    # Variable $anf_8 allocation: register t2, frame pos. 1 
38    addi t2, t0, -1
39    beqz t2, assert_true  # Jump if assertion OK
40    li a7, 93  # RARS syscall: Exit2
41    li a0, 42  # Assertion violation exit code
42    ecall
43assert_true:
44    lw t0, 0(fp)  # Load variable $result onto register t0
45    mv t0, t2  # $result <- $anf_8
46    li a7, 10  # RARS syscall: Exit
47    ecall  # Successful exit with code 0

By following the comments, we can see when the code generation introduces new variables (by assigning stack space and a register) and spills/loads variables from/to registers. We can observe, in particular, that:

  • variable $anf is spilled onto the stack on line 17, and later reloaded into a register on line 24; and

  • on line 21, variable anf_4 is assigned register t2 and frame position 3, that (according to line 12) were previously assigned to $anf_1 (which “disappears” without being spilled onto the stack — hence $anf_1 has become unused and has been discarded).

ANF Transformation in hyggec#

The ANF transformation is implemented in the file ANF.fs, and it follows the procedure described in Transformation of a Hygge Expression into ANF. The main function is transform, which then invokes toANFDefs and toANF:

/// Transform the given AST node into Administrative Normal Form.
let transform (ast: Node<'E,'T>): Node<'E,'T> =
    toANF (toANFDefs ast)

Here are a few remarks about the rest of the file.

  • For efficiency, the function toANFDefs builds the list of ANF definitions in reverse order (i.e. with the latest entries at the head of the list); correspondingly, the function toANF expects to receive a list of ANF definitions in reverse order;

  • The file includes an auxiliary function substVar that is very similar to subst (in ASTUtil.fs) — except that it specifically substitutes the name of a variable with another name.

  • The unique variable names autogenerated for ANF transformation have a prefix $anf (often followed by an integer): this guarantees that the autogenerated names will not clash with existing variables in the input program, since Parser.fsy does not accept programs that use $ in variable names.

  • When generating “let” binders, the function toANFDefs inserts the pretype "_". This does not create issues, because:

    • if the ANF conversion is applied to an UntypedAST before calling the interpreter, the pretypes are not resolved and ignored (because the interpreter does not use any type information);

    • if the ANF conversion is applied to a TypedAST before code generation, the new pretypes are not resolved and they are ignored (because the code generation uses the type information available in the AST Nodes).

ANF-Based Code Generation in hyggec#

The code generation implemented in the file ANFRISCVCodegen.fs follows the procedure described in ANF-Based Code Generation with Register Allocation.

Important

The ANF-based code generation in ANFRISCVCodegen.fs is only partial, and is intended as a demonstration and a basis for Project Ideas. In particular:

  • only integer registers are considered, and

  • only a few Hygge expressions are implemented (addition, if-then-else, assertions, …) — just enough to support the examples shown in this Module.

Here is an overview of the main contents of the file.

  • The code generation environment has type ANFCodegenEnv and it contains the information about known and needed variables, allocated registers, etc.:

    /// Code generation environment.
    type internal ANFCodegenEnv = {
        /// Target variable for the result of the assembly code execution.
        TargetVar: string
        /// Frame position assigned to each known variable.
        Frame: Map<string, int>
        /// Size of the stack frame.
        FrameSize: int
        /// List of integer variables stored in registers, with newest ones coming
        /// first.  If a variable does not appear here, then it means that it is
        /// allocated on the stack frame.
        IntVarsInRegs: List<string * Reg>
        /// List of available integer registers.
        FreeRegs: List<Reg>
        /// Set of variables that are needed in the surrounding scope, and should
        /// never be discarded to reuse their storage.
        NeededVars: Set<string>
    }
    
  • The main entry point of the ANF-based code generation is the function codegen, which in turn invokes doCodegen:

    /// Generate RISC-V assembly for the given AST (expected to be in ANF), using
    /// the given number of registers.
    let codegen (node: TypedAST) (registers: uint): RISCV.Asm =
        /// Name of a special variable used to hold the result of the program
        let resultVarName = "$result"
        /// Initial codegen environment, with all registers available
        /// Note: the definition of 'env' uses list comprehension:
        /// https://en.wikibooks.org/wiki/F_Sharp_Programming/Lists#Using_List_Comprehensions
        let env = { TargetVar = resultVarName
                    Frame = Map[(resultVarName, 0)]
                    FrameSize = 1
                    IntVarsInRegs = []
                    FreeRegs = [for i in 0u..(registers - 1u) do yield Reg.r(i)]
                    NeededVars = Set[resultVarName] }
        let result = doCodegen env node
        // ...
    

    Notably, the funtion codegen initialises a code generation environment by targeting a special variable called $result, which is initially stored in the first avaiable location on the stack frame, without using any register. It also marks the $result variable as “needed” (so it can never be discarded during code generation, even if it is unused), and declares all registers as free.

  • Unlike the default Code Generation Strategy of hyggec, the result of doCodegen (and other functions) in ANFRISCVCodegen.fs is not just an Asm instance. Instead, doCodegen has the following signature:

    /// Code generation function: compile the expression in the given AST node,
    /// which is expected to be in ANF.
    let rec internal doCodegen (env: ANFCodegenEnv)
                               (node: TypedAST): ANFCodegenResult = ...
    

    Where the type ANFCodegenResult is defined as:

    /// Code generation result.
    type internal ANFCodegenResult = {
        /// Compiled RISC-V assembly code.
        Asm: Asm
        /// Updated code generation environment.
        Env: ANFCodegenEnv
    }
    

    This is because the function doCodegen may emit assembly code to spill and load registers — and this changes the register allocation described in the input environment. For this reason, doCodegen returns an updated code generation environment, which should be used as input for the subsequent code generation steps.

  • The following auxiliary function generates assembly code to spill a variable varName onto the stack:

    /// Spill the variable with the given name onto its stack position assigned in
    /// 'env'.  Return the assembly code that performs the spill and the updated
    /// codegen environment.
    let internal spillVar (env: ANFCodegenEnv)
                          (varName: string): ANFCodegenResult = ...
    

    The function above is used by the following function, that selects which variable to spill:

    /// Spill the integer variable that has been stored in an integer register for
    /// the longest time, saving it in the stack position assigned in 'env'.  Choose
    /// a variable that does not belong to the given 'doNotSpill' list.  Return the
    /// assembly code that performs the spilling, and the updated codegen
    /// environment.
    let internal spillOldestIntVar (env: ANFCodegenEnv)
                                   (doNotSpill: List<string>): ANFCodegenResult = ...
    

    This hihglights the strategy used by hyggec for selecting which variable to spill when all registers are in use: the compiler picks the variable that has been assigned to a register for the longest time. Other strategies are discussed in the References and Further Readings.

  • The function syncANFCodegenEnvs generates code to synchronise the register allocation across two branches, according to the idea outlined in Generating Code for If-Then-Else Expressions.

    /// Generate assembly code that spills/loads variables in 'fromEnv' achieving
    /// the same configuration of 'toEnv'.
    let internal syncANFCodegenEnvs (fromEnv: ANFCodegenEnv)
                                    (toEnv: ANFCodegenEnv): Asm = ...
    

    The function works as follows:

    • it takes two code generation environments fromEnv and toEnv (e.g. one obtained after compiling the “then” branch of an if-then-else, and the other obtained after compiling the “else” branch), and

    • it returns the assembly code that spills/loads the variables of fromEnv to/from the stack, and (re-)assigns their registers, until they reach the same configuration of toEnv.

    To see syncANFCodegenEnvs in action, you can try Example 78 below.

    Example 78 (Synchronising Register Allocation Across “If-Then-Else” Branches)

    Consider again the following Hygge program (from Example 76):

    (Download this example.)

    1let z: bool = true;  // Change to 'false' to experiment
    2let x1: int = 1;
    3let x2: int = 2;
    4let x3: int = if z then x1 else 1 + (2 + (3 + 4));
    5assert(x1 = 1);
    6assert(x2 = 2);
    7assert(if z then x3 = x1 else x3 = 10)
    

    If we save the example above in a file called if-then-else-sync-reg-alloc.hyg, we can see its translation into ANF by running:

    ./hyggec parse -a if-then-else-sync-reg-alloc.hyg
    

    and we can observe that the “else” branch has many more ANF variable definitions (hence, it uses more registers) than the “then” branch.

    To generate assembly code from the ANF, by limiting register allocation to 4 registers only, we can run:

    ./hyggec compile -a -r 4 if-then-else-sync-reg-alloc.hyg
    

    The generated code is quite long, but let’s focus on the code generated for the “then” branch of the “if-then-else” expression:

        # ...assembly code for the 'false' branch of the 'if'...
        j if_end  # Jump to skip the 'true' branch of 'if' code
    if_true:
        mv s1, t1  # $anf_6 <- x1
        # Branch synchronization code begins here
        sw s1, -16(fp)  # Spill variable $anf_6 from register s1 to stack
        sw t1, -8(fp)  # Spill variable x1 from register t1 to stack
        sw t2, -12(fp)  # Spill variable x2 from register t2 to stack
        sw t0, -4(fp)  # Spill variable z from register t0 to stack
        lw t2, -16(fp)  # Load variable $anf_6 onto register t2
        # Branch synchronization code ends here
    if_end:
        # Rest of the assembly code...
    

    We can see that, after the if_true label, there is the code for the “then” branch of the “if-then-else”:

    • first, an mv operation copies variable x1 into $anf_6 (which is the ANF variable holding the result of the if-then-else); and then,

    • there is a block of “branch synchronisation code” produced by the function syncANFCodegenEnvs above. There we can observe, in particular, that the variable $anf_6 is first spilled from register s1, and then reloaded onto register t2, so it gets the same register allocated after the code generation for the “else” branch. This way, no matter which branch was taken, the rest of the assembly code can use register t2 to get the value of $anf_6.

References and Further Readings#

ANF was introduced in these seminal papers, with the main purpose of compiling functional languages:

  • Amr Sabry and Matthias Felleisen. Reasoning about programs in continuation-passing style. In Proceedings of the 1992 ACM conference on LISP and functional programming (LFP ‘92). https://doi.org/10.1145/141471.141563

  • Cormac Flanagan, Amr Sabry, Bruce F. Duba, and Matthias Felleisen. The essence of compiling with continuations. In Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation (PLDI ‘93). https://doi.org/10.1145/155090.155113

ANF is used, for instance, in the Racket compiler.

Conceptually, ANF is reminiscent of other intermediate representations used in various compilers — e.g. Static Single Assignment (SSA) form, and Three-Address Code. To know more, see for example:

  • Keith D. Cooper and Linda Torczon. Engineering a Compiler (Third Edition). Available on DTU Findit.

    • See, in particular, Chapter 4 - Intermediate Representations

  • Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, Oct. 1991. https://doi.org/10.1145/115372.115320

The NFA-based register allocation strategy described in this Module is, broadly speaking, a coarse linear allocation strategy, which traverses the program once and uses the current set of free variables to find ranges of the program where variables are live (i.e. actually used by the program) or become permanently unused (and thus, can be discarded). Register allocation is an active research topic, and there are many different strategies, with trade-offs between complexity, speed of the code generation, and efficiency of the generated code. To know more, here are two surveys:

Project Ideas#

For this Module, you should implement both the following Project Ideas:

Project Idea: Implement ANF Translation for (Some of) Your Hygge Extensions#

The ANF translation in the file ANF.fs does not cover the Hygge expressions you should have added in previous Project Ideas. Select at least 2 of these expressions, and extend Definition 51, the file ANF.fs, and the interpreter test suite accordingly:

  • you should explain how you extend Definition 51 to support the expressions you selected;

  • you should add new cases to the functions substVars and toANFDefs in ANF.fs;

  • you should add new tests under the directory tests/interpreter-anf/. Ideally, you should reuse the same interpreter tests that you have already developed for the expressions you selected, by copying them from tests/interpreter/.

Project Idea: Implement ANF-Based Code Generation for More Hygge Expressions#

The ANF-based code generation in the file ANFRISCVCodegen.fs only supports a few Hygge expressions. For this Project Idea, you should select at least 2 more expressions, and extend ANFRISCVCodegen.fs to generate code for them:

Hint

If the expression you are adding has some form of branching and conditional jump (e.g. while loops, short-circuit logical operators, …), then you may need to synchronise the register allocation of the different branches, similarly to Generating Code for If-Then-Else Expressions.