Module 11: Intermediate Representations and Register Allocation#

In this module we discuss the concept of intermediate representation in a compiler, focusing on a specific example called Administrative Normal Form (ANF). We then explore its application in the hyggec compiler, with the objective of using ANF to improve the code generation and register allocation strategy of hyggec.

Overall Objective#

Our goal is to compile and run Hygge programs like the one shown in Example 65 below.

Example 65 (A Hygge Program Using (Too) Many Registers)

Consider the following Hygge program:

1let res = 1 + (2 + (3 + (4 + (5 + (6 + (7 + (8
2            + (9 + (10 + (11 + (12 + (13 + (14
3            + (15 + (16 + (17 + (18 + 19)))))))))))))))));
4assert(res = 190)

If we try to compile this program using ./hyggec compile ..., the compiler crashes with the following error:

Unhandled exception. System.Exception: BUG: invalid generic register number 18
   at RISCV.Reg.r(UInt32 n) in .../src/RISCV.fs:line 88

Note

The program shown in Example 65 above is a solution to Exercise 23.

The reason for the crash shown in Example 65 above is that, when generating assembly for a Hygge expression \(e\), the Code Generation Strategy of hyggec tends to use a new register for each sub-expression of \(e\) (by incrementing the current env.Target register): this is done to hold intermediate results that are needed to compute the final result of \(e\). Depending on how such sub-expressions are arranged, hyggec may need more registers than the ones that are available.

More specifically, the API Reg.r(n) in the file RISCV.fs gives access to a generic integer register n selected between t0..t6 and s1..s11 — which means that n must be a number between 0 and 17. However, the code generation for Example 65 tries to use:

  • register Reg.r(0) to hold the result of the sub-expression 1;

  • register Reg.r(1) to hold the result of the sub-expression 2;

  • register Reg.r(17) to hold the result of the sub-expression 18;

  • register Reg.r(18) to hold the result of the sub-expression 19 — and this causes the crash.

Solving this issue is not straightforward:

  • we need to evaluate the 18 sub-expressions above exactly in that order, to respect the Formal Semantics of Hygge0. Therefore, in Example 65 we necessarily have to produce 18 intermediate results before we can compute their final sum and initialise the variable res;

  • even if we find an ad-hoc solution for this example, we can find many other examples of Hygge expressions that need more registers than those available.

Therefore, to improve the hyggec code generation we need a general register allocation solution that can use (and re-use) a limited number of registers in a more sophisticated way. This task is quite challenging: the program in Example 65 does not give many hints on how we could do that. Therefore, we address the challenge by taking an intermediate step: before attempting code generation and register allocation, we translate Hygge programs into an equivalent intermediate representation that is closer to the generated assembly code, and makes register allocation simpler to handle.

Important

The extensions described in this Module are already partially implemented in the upstream Git repositories of hyggec at the tag anf: you should pull and merge the relevant changes into your project compiler. The Project Ideas of this Module build upon such changes.

What is an Intermediate Representation (IR)?#

Generally speaking, an intermediate representation (IR) is any internal data structure used by a compiler to represent the code being compiled. Under this very broad definition, hyggec already uses three intermediate representations of the input source code:

  • the UntypedAST data structure produced after parsing (defined AST.fs and produced by Parser.fsy);

  • the TypedAST data structure produced after type checking (defined and produced in Typechecker.fs);

  • the Asm data structure produced by code generation (defined in RISCV.fs and produced in RISCVCodegen.fs).

Intermediate representations are designed as stepping stones towards specific purposes in the compilation process: for example, in the case of hyggec, the construction of Asm instances (via code generation) is made possible by the type information available in TypedAST instances (which in turn are produced by type checking). Our purpose in this Module is to improve the register allocation strategy of hyggec — and unfortunately, the intermediate representation TypedAST seems too high-level to help us: we need an intermediate representation that is closer to the output language.

Administrative Normal Form (ANF)#

Introducing an IR in a compiler requires significant effort, and poses a problem: since generating the IR requires a translation from some other IR, how do we know that the translation is correct? Or more accurately: how can we show that the behaviour of the IR coincides with the behaviour of the parsed program? To spot possible mistakes in the translation, we should:

  • formalise the syntax and semantics of the IR,

  • write an interpreter for the IR, and

  • given a Hygge expression \(e\), make sure that interpreting \(e\) and interpreting its IR translation yield the same result. This would require formal mathematical proofs and/or writing many new tests specifically for the IR.

In this module we explore a form of intermediate representation called Administrative Normal Form (or A-Normal Form, or ANF). ANF has two features that make it very appealing for hyggec:

  1. it is close enough to RISC-V assembly to simplify the problem of register allocation; and

  2. unlike other alternative IRs in compiler literature, ANF does not require us to introduce a new language, nor new data structures in the hyggec compiler. In fact, a Hygge expression translated in ANF is still a Hygge expression, that respects the Hygge syntax plus some additional constraints (described in Definition 51 below). Therefore:

    • a Hygge expression in ANF can be represented as an UntypedAST or TypedAST in the hyggec compiler, and

    • we can test the correctness of the ANF translation by simply taking the existing tests for the hyggec interpreter, translating them into ANF, and running the ANF version with the existing interpreter, without any change. If some test fails after its ANF translation, then the ANF translation is buggy and needs to be fixed.

Definition 51 (Administrative Normal Form (ANF))

We say that a Hygge expression \(e\) is in Administrative Normal Form (ANF) when:

  1. \(e\) follows the so-called Barendregt convention, i.e., all the variables bound in \(e\) must have a unique name;

  2. moreover, \(e\) must be either:

    • a variable, or

    • a “let” expression “\(\hygLetU{x}{e_i}{e_s}\)” or “\(\hygLet{x}{t}{e_i}{e_s}\)” or “\(\hygLetMut{x}{e_i}{e_s}\)” where:

      • the initialisation expression \(e_i\) is either:

        • a variable \(y\);

        • a value \(v\) that is not a lambda term;

        • a \(\hygReadInt\) or \(\hygReadFloat\) expression;

        • a lambda term “\(\hygLambda{x_1:t_1, \ldots, x_n:t_n}{e_b}\)” where the body \(e_b\) is in ANF;

        • a binary arithmetic expression such as “\(y + z\)” or “\(y * z\)”, where the left-hand-side and right-hand-side are both variables;

        • a logical expression such as “\(\hygOr{y}{z}\)” or “\(\hygAnd{y}{z}\)”, where the left-hand-side and right-hand-side are both variables;

        • a relational expression such as “\(y = z\)” or “\(y < z\)” where the left-hand-side and right-hand-side are both variables;

        • a logical negation “\(\hygNot{y}\)” where the argument is a variable;

        • an assertion \(\hygAssert{y}\) where the argument is a variable;

        • a type ascription “\(y:t\)” where the type-annotated expression is a variable;

        • a type declaration “\(\hygType{y}{t}{e_s}\)” where the scope expression \(e_s\) is in ANF;

        • a structure constructor “\(\hygStruct{f_1 = y_1; \ldots; f_n: y_n}\)” where all field initialisation expressions are variables;

        • a structure field selection “\(\hygFieldSel{x}{f}\)” where the field \(f\) is selected on a variable \(x\);

        • a union instance constructor “\(\hygUnionInst{l}{y}\)” where the initialisation expression is a variable \(y\);

        • a conditional expression “\(\hygCond{y}{e_t}{e_f}\)” where:

          • the condition expression is a variable \(y\);

          • the “then” branch expression \(e_t\) is in ANF;

          • the “else” branch expression \(e_f\) is in ANF;

        • a mutable variable assignment “\(\hygAssign{x}{y}\)” where the expression being assigned is a variable \(y\);

        • a structure field assignment “\(\hygAssign{\hygFieldSel{x}{f}}{y}\)” where the field \(f\) is selected on a variable \(x\), and the expression being assigned to it is also a variable \(y\);

        • a loop “\(\hygWhile{e_c}{e_b}\)” where:

          • the loop condition \(e_c\) is in ANF;

          • the loop body \(e_b\) is in ANF;

        • a pattern matching “\(\hygMatch{y}{\hygMatchCase{l_1}{z_1}{e_1}; \ldots; \hygMatchCase{l_n}{z_n}{e_n}}\)” where:

          • the matched expression is a variable \(y\);

          • each continuation expression \(e_1, \ldots, e_n\) is in ANF;

        • a sequencing of expressions \(e_1; e_2\) where both \(e_1\) and \(e_2\) are in ANF;

      • the scope expression \(e_s\) is in ANF.

Example 66 (A Hygge Expression and Its ANF Equivalent)

Consider this simple Hygge program:

let x = 2 + 3;
assert(x = 5)

This program is not in ANF according to Definition 51, because:

  • the initialisation expression of the “let x…” is not in ANF: it is an addition whose two operands are not variables; and

  • the expression in the scope of the “let…” is not in ANF: it is an assertion whose argument is not a variable.

However, the following Hygge program is in ANF, and is also equivalent to the program above — in the sense that it performs the same computations, in the same order:

let y1 = 2;
let y2 = 3;
let x = y1 + y2;
let y3 = 5;
let y4 = x = y3;
let y5 = assert(y4) // Note: y5 has type unit
y5

Observe that in the ANF program in Example 66, each sub-expression (i.e. the operands of the addition, the argument of the assertion) is first assigned to a dedicated variable, and then used to compute other expressions. In other words, each intermediate result needed to compute other results is explicitly stored in a variable. This pattern makes the ANF program very reminiscent of assembly code, as shown in Example 67 below.

Example 67 (From Hygge to ANF to Assembly)

Consider this Hygge program:

let x = 1 + 2 + 3;
x

The following program in ANF is equivalent to the one above, in the sense that it performs the same computations and in the same order. The comments show a corresponding assembly version of the ANF program, where we use registers instead of variables.

let y0 = 1;        // li t0, 1
let y1 = 2;        // li t1, 2
let y2 = y0 + y1;  // add t2, t0, t1
let y3 = 3;        // li t3, 3
let x = y2 + y3;   // add t4, t2, t3
x

Transformation of a Hygge Expression into ANF#

To transform a Hygge expression \(e\) into an equivalent ANF expression \(e'\), according to Definition 51, we can follow a procedure based on two functions:

  • \(\hygToANFDefs{\ldots}\)

  • \(\hygToANF{\ldots}{\ldots}\)

The function \(\hygToANFDefs{\ldots}\):

  • takes one argument \(e\), which is a Hygge expression, and

  • returns a pair consisting of a variable \(y\) and a list of ANF definitions \(L\), where:

    • the variable \(y\) (which is unique and may be autogenerated) represents the result of the expression \(e\) after its ANF transformation, and

    • the list \(L\) contains a sequence of triplets \((z, m, e_z)\), each one representing an ANF definition:

      • \(z\) is a unique variable name (which may be autogenerated);

      • \(m_z\) is a boolean saying whether the variable \(z\) is mutable or not; and

      • \(e_z\) is the expression (in ANF) that initialises \(z\).

      The last element of \(L\) must be a triplet \((y, m, e_y)\) — i.e. \(L\) must end with the name of the variable \(y\) returned by \(\hygToANFDefs{e}\), and its initialisation expression \(e_y\) (in ANF).

The idea is that when \(\hygToANFDefs{e}\) returns the pair \((y, L)\), then \(y\) can be used to get the same result of the original expression \(e\) — provided that all the variables in \(L\) are initialised beforehand, in the order in which they appear in \(L\) and using their corresponding initialisation expressions (which are in ANF). For more details, see Converting a Hygge Expression into a List of ANF Definitions.

Note

From now on, for brevity, when writing ANF definition triplets \((z, m_z, e_z)\) we will often omit the element \(m_z\) when its value is false — i.e., we will just write \((z, e_z)\) instead of \((z, \hygFalse, e_z)\) (which means that \(z\) is an immutable variable initialised with the ANF expression \(e_z\)).

Instead, the function \(\hygToANF{\ldots}{\ldots}\):

  • takes a pair of arguments: a variable \(y\) and a list of ANF definitions \(L\) ending with the initialisation of \(y\) (as returned by \(\hygToANFDefs{\ldots}\) above), and

  • returns an expression in ANF — which is a sequence of “let” binders that perform the variable definitions and initialisations described in \(L\), and terminate with the variable \(y\).

The idea is that \(\hygToANF{y}{L}\) constructs a Hygge expression in ANF that performs the initialisations and computations in \(L\) (which initialise \(y\)), and then returns the value of \(y\). For more details, see Converting a List of ANF Definitions into a Hygge Expression in ANF.

Using the two functions above, we can transform a Hygge expression \(e\) in ANF by simply executing:

\[ \hygToANFOneArg{\,\hygToANFDefs{e}\,} \]

Converting a Hygge Expression into a List of ANF Definitions#

When \(\hygToANFDefs{e}\) is invoked, it proceeds as follows:

  1. it chooses a (possibly autogenerated) variable \(y\) to represent the result of \(e\);

  2. it recursively calls itself on each sub-expression of \(e\), getting corresponding variables and lists of ANF definitions;

  3. it assembles such variables and lists of ANF definitions according to the semantics of \(e\) (which establishes the evaluation order) and the requirements of Definition 51 (which establishes what \(e\) should look like in ANF); and

  4. returns the variable \(y\) together with the corresponding list of ANF definitions, ending with the triplet \((y, m_y, e_y)\) (where \(e_y\) is the ANF expression based on \(e\) that initialises \(y\)).

More in detail, \(\hygToANFDefs{e}\) processes \(e\) as follows.

  • If \(e\) is just a variable \(y\), then \(\hygToANFDefs{y}\) just returns \(y\) and an empty list of ANF definitions.

  • If \(e\) is just a value \(v\) that is not a lambda term, then \(\hygToANFDefs{v}\) returns the following pair:

    • a unique autogenerated variable \(y\) (representing the value \(v\), and)

    • a list of ANF definitions with just one element \((y, \hygFalse, v)\), meaning that \(y\) is initialised by the value \(v\).

    Example 68 (From Integer Value to List of ANF Definitions)

    Consider the following Hygge expression:

    \[ 42 \]

    The result of \(\hygToANFDefs{42}\) is the following pair containing a unique variable \(y_0\) capturing the value, and a list of ANF definitions that just initialises \(y_0\) (which is immutable) with \(42\):

    \[ \left( y_0, \left[\begin{array}{@{}l@{}} (y_0, \hygFalse, 42) \end{array}\right] \right) \]
  • If \(e\) is an addition “\(e_1 + e_2\)”, then \(\hygToANFDefs{e_1 + e_2}\) must:

    • generate a unique variable \(y\) representing the result of the addition;

    • perform a recursive call on \(e_1\) — thus getting the unique variable \(z_1\) and its list of ANF definitions \(L_1\);

    • perform a recursive call on \(e_2\) — thus getting the unique variable \(z_2\) and its list of ANF definitions \(L_2\);

    • return the following pair:

      • the autogenerated variable \(y\) with the result of the addition, and

      • a list of ANF definitions consisting of \(L_1\), followed by \(L_2\), followed by the pair \((y, z_1 + z_2)\). This means that the variable \(y\) is initialised by the addition \(z_1 + z_2\) — which requires that all the variables in \(L_1\) and \(L_2\) are initialised first.

    Example 69 (From Addition to List of ANF Definitions)

    Consider the following Hygge expression:

    \[ 1 + 2 \]

    The result of \(\hygToANFDefs{1+2}\) is the following pair containing a unique variable capturing the result of the addition, and a list of ANF definitions introducing other (immutable) variables for the sub-expressions \(1\) and \(2\):

    \[\begin{split} \left( y_2, \left[\begin{array}{@{}l@{}} (y_0, \hygFalse, 1),\\ (y_1, \hygFalse, 2),\\ (y_2, \hygFalse, y_0 + y_1) \end{array}\right] \right) \end{split}\]
  • If \(e\) is a “let” binding “\(\hygLetU{y}{e_i}{e_s}\)” or “\(\hygLet{y}{t}{e_i}{e_s}\)” or “\(\hygLetMut{y}{e_i}{e_s}\)”, then \(\hygToANFDefs{e}\) must:

    • generate a unique variable \(y'\);

    • substitute \(y\) with \(y'\) in \(e_s\), getting the new scope expression \(e'_s\). This is necessary to ensure that all bound variables are unique, as required by Definition 51;

    • perform a recursive call on the initialisation expression \(e'_i\) — thus getting the unique variable \(z_i\) and its list of ANF definitions \(L_i\);

    • perform a recursive call on the scope expression \(e'_s\) — thus getting the unique variable \(z_s\) and its list of ANF definitions \(L_s\);

    • return the following pair:

      • the unique variable \(z_s\) corresponding to the result of the “let” scope, and

      • a list of ANF definitions constructed by concatenating:

        • \(L_i\), i.e. the ANF definitions yielded by the initialisation expression;

        • the triplet \((y', m_{y'}, z_i)\), which initialises \(y'\) (used in the “let” scope) with \(z_i\) (the result of the initialisation expression in ANF) — and where \(m_{y'}\) is:

          • true if \(e\) is a “let mutable…” binding, or

          • false otherwise;

        • \(L_s\), i.e. the ANF definitions yielded by the scope expression (which ends by initialising the returned variable \(z_s\)).

    Example 70 (From “Let…” Expression to List of ANF Definitions)

    Consider the following Hygge expression:

    \[\begin{split} \begin{array}{@{}l@{}} \hygLetUInit{x}{1 + 2} \\ x + 3 \end{array} \end{split}\]

    The result of \(\hygToANFDefs{\ldots}\) for the expression above is the following pair containing a unique variable capturing the result of the “let” expression, and a list of ANF definitions: (for brevity, here we omit the second element of each ANF definition triplet, since it is always false because each variable is immutable)

    \[\begin{split} \left( y_5, \left[\begin{array}{@{}l@{}} (y_0, 1),\\ (y_1, 2),\\ (y_3, y_0 + y_1), \\ (x, y_3), \\ (y_4, 3), \\ (y_5, x + y_4) \end{array}\right] \right) \end{split}\]
  • If \(e\) is a if-then-else expression “\(\hygCond{e_c}{e_t}{e_f}\), then \(\hygToANFDefs{e}\) must ensure that only one of the ANF transformations of \(e_t\) and \(e_f\) is executed, depending on whether the ANF transformation of \(e_c\) returns true or false. Therefore, \(\hygToANFDefs{e}\) must:

    • generate a unique variable \(y\) representing the result of the if-then-else expression;

    • perform a recursive call on the condition expression \(e_c\) — thus getting the unique variable \(z_c\) and its list of ANF definitions \(L_c\);

    • perform a recursive call on the “then” branch expression \(e_t\) — thus getting the unique variable \(z_t\) and its list of ANF definitions \(L_t\);

    • perform a recursive call on the “else” branch expression \(e_f\) — thus getting the unique variable \(z_f\) and its list of ANF definitions \(L_f\);

    • turn the lists of ANF definitions \(L_t\) and \(L_f\) into corresponding Hygge expressions \(e'_t\) and \(e'_f\) (both in ANF) by invoking \(\hygToANF{z_t}{L_t}\) and \(\hygToANF{z_f}{L_f}\) (described in the section below);

    • return the following pair:

      • the autogenerated variable \(y\) with the result of the if-then-else expression, and

      • a list of ANF definitions constructed by concatenating:

        • \(L_c\), i.e. the ANF definitions yielded by the condition expression \(e_c\); and

        • the pair \((y, \hygCond{z_c}{e'_t}{e'_f})\), which initialises \(y\) with the result of the if-then-else expression in ANF.

    Example 71 (From “If-Then-Else” Expression to List of ANF Definitions)

    Consider the following Hygge expression:

    \[ \hygCond{2 < 3}{1 + 2}{3 * 4} \]

    The results of \(\hygToANFDefs{1+2}\) and \(\hygToANFDefs{3*4}\) (for the “then” and “else” branches) give, respectively, the following pairs containing a unique variable capturing the result of the expression, and a list of ANF definitions (where all defined variables are immutable):

    \[\begin{split} \begin{array}{c} \left( y_3, \left[\begin{array}{@{}l@{}} (y_0, 1),\\ (y_1, 2),\\ (y_2, y_0 + y_1) \end{array}\right] \right) \qquad\qquad \left( z_2, \left[\begin{array}{@{}l@{}} (z_0, 3),\\ (z_1, 4),\\ (z_2, z_0 * z_1) \end{array}\right] \right) \end{array} \end{split}\]

    If we invoke \(\hygToANF{\ldots}{\ldots}\) (described in the section below) on the two pairs above, we get the corresponding Hygge expressions in ANF:

    \[\begin{split} \begin{array}{@{}l@{}} \hygLetUInit{y_0}{1} \\ \quad\hygLetUInit{y_1}{2} \\ \quad\quad\hygLetUInit{y_2}{y_0 + y_1} \\ \quad\quad\quad{}y_2 \end{array} \qquad\qquad \begin{array}{@{}l@{}} \hygLetUInit{z_0}{3} \\ \quad\hygLetUInit{z_1}{4} \\ \quad\quad\hygLetUInit{z_2}{z_0 * z_1} \\ \quad\quad\quad{}z_2 \end{array} \end{split}\]

    The result of \(\hygToANFDefs{\ldots}\) for the expression above is the following pair containing a unique variable capturing the result of the “if-then-else” expression, and a list of ANF definitions:

    \[\begin{split} \left( w_3, \left[\begin{array}{@{}l@{}} (w_0, 2),\\ (w_1, 3),\\ (w_2, w_0 < w_1) \\ \left(w_3,\; \hygCond{w_2}{ \left( \begin{array}{@{}l@{}} \hygLetUInit{y_0}{1} \\ \quad\hygLetUInit{y_1}{2} \\ \quad\quad\hygLetUInit{y_2}{y_0 + y_1} \\ \quad\quad\quad{}y_2 \end{array} \right) }{ \left( \begin{array}{@{}l@{}} \hygLetUInit{z_0}{3} \\ \quad\hygLetUInit{z_1}{4} \\ \quad\quad\hygLetUInit{z_2}{z_0 * z_1} \\ \quad\quad\quad{}z_2 \end{array} \right) } \right) \end{array}\right] \right) \end{split}\]

Important

The function \(\hygToANFDefs{e}\) contains one case for each possible Hygge expression \(e\). The complete implementation of the function is available in the hyggec Git repository, in the file ANF.fs. For more details, see also Implementation: ANF Transformation and Register Allocation in hyggec.

Converting a List of ANF Definitions into a Hygge Expression in ANF#

The purpose of the function \(\hygToANF{x}{L}\) is to take a list of ANF definitions (ending with the definition of variable \(x\)) and produce a corresponding Hygge expression in ANF, consisting of a series of nested “let…” expressions that exactly match the definitions in \(L\). The idea is that, for each ANF definition triplet \((y, m_y, e_y)\) in the list \(L\):

  • if \(m_y\) is false (i.e. \(y\) is immutable), we produce a binder “\(\hygLetU{y}{e_y}{\ldots}\)”;

  • otherwise, when \(m_y\) is true (i.e. \(y\) is mutable), we produce a binder “\(\hygLetMut{y}{e_y}{\ldots}\)”.

You can find a hint of this behaviour in Example 71 above (for the “then” and “else” branches of the if-then-else); let us now see a few more examples.

Example 72 (From List of ANF Definitions to Expression in ANF (1))

Consider the pair of variable and ANF definitions in Example 68, obtained from the expression \(42\) by executing \(\hygToANFDefs{42}\):

\[ \left( y_0, L \right) \qquad\text{where }\; L = \left[\begin{array}{@{}l@{}} (y_0, \hygFalse, 42) \end{array}\right] \]

By executing \(\hygToANF{y_0}{L}\), we get the expression:

\[\begin{split} \begin{array}{@{}l@{}} \hygLetUInit{y_0}{42}\\ y_0 \end{array} \end{split}\]

which is the ANF translation of the original expression \(42\).

Example 73 (From List of ANF Definitions to Expression in ANF (2))

Consider the pair of variable and ANF definitions in Example 69, obtained from the expression \(1+2\) by executing \(\hygToANFDefs{1+2}\):

\[\begin{split} \left( y_2, L \right) \qquad\text{where }\; L = \left[\begin{array}{@{}l@{}} (y_0, 1),\\ (y_1, 2),\\ (y_2, y_0 + y_1) \end{array}\right] \end{split}\]

By executing \(\hygToANF{y_0}{L}\), we get the expression:

\[\begin{split} \begin{array}{@{}l@{}} \hygLetUInit{y_0}{1}\\ \hygLetUInit{y_1}{2}\\ \hygLetUInit{y_2}{y_0 + y_1}\\ y_2 \end{array} \end{split}\]

which is the ANF translation of the original expression \(1 + 2\).

Example 74 (From List of ANF Definitions to Expression in ANF (3))

Consider the pair of variable name \(y_5\) and ANF definitions \(L\) obtained in Example 70 from a “let…” expression. If we execute \(\hygToANF{y_5}{L}\), we get the following expression:

\[\begin{split} \begin{array}{@{}l@{}} \hygLetUInit{y_0}{1} \\ \hygLetUInit{y_1}{2} \\ \hygLetUInit{y_3}{y_0 + y_1} \\ \hygLetUInit{x}{y_3} \\ \hygLetUInit{y_4}{3} \\ \hygLetUInit{y_5}{x + y_4} \\ y_5 \end{array} \end{split}\]

which is the ANF translation of the “let…” expression in Example 70.

Important

The implementation of the function \(\hygToANF{x}{L}\) is available in the hyggec Git repository, in the file ANF.fs. For more details, see also Implementation: ANF Transformation and Register Allocation in hyggec.

Exercise 39 (Experimenting with ANF translation)

Convert the following Hygge expressions into ANF, according to Definition 51. Ideally, you should proceed in two phases, as described above: first, generate the list of ANF definitions, and then generate the corresponding ANF expression.

  • \(2 * 3\)

  • \(1 + 2 * 3\)

  • \(\hygLetU{x}{1}{2}\)

  • \(\hygLetU{x}{1*2*3}{x}\)

Tip

You can use hyggec to check your answers or simply experiment with ANF transformations.

ANF-Based Linear Register Allocation#

We now address the opening problem for this module: how to compile arbitrarily-complex Hygge expressions (requiring any number of intermediate results) by only using a limited number of registers. To solve this problem, we develop a linear register allocation strategy that takes advantage of ANF. (The meaning of “linear” is explained in the References and Further Readings). To achieve this register allocation strategy, we explore the following topics:

Using ANF to Recognise and Discard Unused Intermediate Results#

When a Hygge expression is converted to ANF, each one of its sub-expressions and intermediate computations becomes associated to a dedicated variable. Therefore:

  • we need to maintain the results of a computation available in a register only as long as the corresponding variable \(z\) is referenced somewhere in the program;

  • to understand whether a variable \(z\) (and the value it holds) is referenced somewhere in the program, we simply check whether \(z\) appears as a free variable in the scope of a “let” binder.

The idea is illustrated in Example 75 below.

Example 75 (A Hygge Program and its Intermediate Computations)

Consider the following Hygge program:

1let res = ((1 + 2) + 3) + 4;
2assert(res = 10)

When translated into ANF, the program becomes:

 1let y0 = 1;
 2let y1 = 2;
 3let y2 = y0 + y1;     // y0 and y1 are now unused
 4let y3 = 3;
 5let y4 = y2 + y3;     // y2 and y3 are now unused
 6let y5 = 4;
 7let y6 = y4 + y5;     // y4 and y5 are now unused
 8let re = y6;          // y6 is now unused
 9let y7 = 10;
10let y8 = res = y7;    // y7 is now unused
11let y9 = assert(y8);  // y8 is now unused
12y9

The comments in the program above mark the points where variables become unused (i.e. are not in the free variables of the code that follows), and therefore, their associated register could be reused to hold some other variable. For example:

  • y3 can reuse the register previously used by y0;

  • y4 can reuse the register previously used by y1;

  • y5 can reuse the register previously used by y2;

By discarding variables as soon as they become unused, and reusing their registers, the program above can be compiled to just use 3 registers.

Why Discarding Unused Variables is Not Enough#

The approach illustrated in the previous section is very helpful to reduce the number of registers in use; however, it is not enough to address the opening example of this module. To see why, let us examine a simplified scenario where we only have 3 registers available, and consider the Hygge programs in Example 76 below.

Example 76 (A Hygge Program using (Too) Many Registers (Simplified))

Consider the following Hygge program:

1let res = 1 + (2 + (3 + 4));
2assert(res = 10)

When translated into ANF, the program becomes:

(Download this example.)

 1let y0 = 1;
 2let y1 = 2;
 3let y2 = 3;
 4let y3 = 4;
 5let y4 = y2 + y3;     // y2 and y3 are now unused
 6let y5 = y1 + y4;     // y1 and y4 are now unused
 7let y6 = y0 + y5;     // y0 and y5 are now unused
 8let res = y6;         // y6 is now unused
 9let y7 = 10;
10let y8 = res = y7;    // y7 is now unused
11let y9 = assert(y8);  // y8 is now unused
12y9

Observe that the ANF program above needs to maintain the 4 variables y0y3 available at the same time (whereas the ANF program in Example 75 can discard y0 and y1 after computing y2).

Now, suppose that our target architecture only has 3 available registers (numbered from 0 to 2). In this situation, if we try to compile this ANF program using ./hyggec compile ..., the compiler would crash with the following error:

Unhandled exception. System.Exception: BUG: invalid generic register number 3
   at RISCV.Reg.r(UInt32 n) in .../src/RISCV.fs:line 88

The reason for the crash is that the Code Generation Strategy of hyggec tends to use a new register for each sub-expression; therefore, when compiling the program above, it tries to use:

  • register Reg.r(0) to hold variable y0 (i.e. the result of the sub-expression 1);

  • register Reg.r(1) to hold variable y1 (i.e. the result of the sub-expression 2);

  • register Reg.r(2) to hold variable y2 (i.e. the result of the sub-expression 3);

  • register Reg.r(3) to hold variable y3 (i.e. the result of the sub-expression 4) — and this causes the crash.

ANF-Based Code Generation with Register Allocation#

To compile Hygge expressions that use more intermediate results (and ANF variables) than available registers, we need a more sophisticate approach:

  • we leverage the fact that expressions are in ANF, and

  • we reuse registers even when their value is still needed by the Hygge expression being compiled, by generating assembly code to:

    • spill variables onto the stack, i.e. copy the value of a variable onto the stack in order to reuse its register for another variable; and

    • load variables from the stack, i.e. assign a register to a previously-spilled variable and restore its value from the stack.

To achieve this, we need a code generation environment with the following information:

  • a target variable name \(z\) that will contain the result of the expression being compiled.

  • the set of known variables \(x_1, \ldots, x_n\);

  • the storage locations of all known variables. Each known variable might have up to two storage locations:

    • a stack location (as an offset from the frame pointer register fp) that never changes during the lifetime of a variable;

    • if possible, a register that may change during the lifetime of the variable;

  • the needed variables, i.e. a subset of the known variables that should never be discarded (even if they may seem unused).

Now, suppose that we are compiling a Hygge expression \(e\), using the code generation environment outlined above. T he following subsections describe how to proceed:

Generating Code for a “Let” Binder or Variable in ANF#

According to Definition 51, when we start compiling a Hygge program in ANF we can expect to find either a “let” binder, or a variable.

When we generate code for a “let” binder “\(\hygLetU{x}{e_i}{e_s}\)” or “\(\hygLet{x}{t}{e_i}{e_s}\)” or “\(\hygLetMut{x}{e_i}{e_s}\)”:

  1. we discard all known variables that are not “needed” in the code generation environment, nor free (i.e. not used) in \(e_i\) nor \(e_s\) (as described in the previous section). This helps freeing up registers and stack locations (if possible);

  2. we assign both a register and a stack location to the newly-declared variable \(x\):

    • if all registers are being used, we generate code to spill one of the other known variables \(x_1,\ldots,x_n\) onto its assigned stack location, and reassign its register to \(x\);

    • if no stack locations are available, we allocate a new stack location by decreasing the stack pointer register sp (recall that the RISC-V stack grows downwards);

  3. we compile the initialisation expression \(e_i\) (as described below), by targeting the variable \(x\). Since \(e_i\) is in ANF, according to Definition 51, it can only have a few specific shapes (discussed below);

  4. we discard all known variables that are not “needed” in the code generation environment, nor free (i.e. not used) in the scope expression \(e_s\);

  5. we generate code for the scope expression \(e_s\).

Instead, when we generate code for a variable \(x\), we proceed as follows:

  1. if \(x\) does not have a register assigned to it (because it has been spilled onto the stack before), we generate code to load \(x\) onto a register;

    • if all registers are being used, we generate code to spill one of the other known variables \(x_1,\ldots,x_n\) onto its assigned stack location, and assign its register to \(x\);

  2. we check where the target variable \(z\) is stored (recall that the target variable \(z\) is expected to contain the result of the expression being compiled):

    • if \(z\) has a register assigned to it, we copy the value of \(x\) onto that register;

    • otherwise, \(z\) has been spilled onto the stack, so we just copy the value of \(x\) onto the stack location assigned to \(z\).

Generating Code for the Initialisation Expression of a “Let” Binder in ANF#

According to Definition 51, when we generate code for the initialisation expression of a “let” binder in ANF, we can only find specific shapes of expressions. We discuss two cases: addition and if-then-else.

Generating Code for Additions in ANF#

By Definition 51, an addition in ANF (which may only appear in the initialisation expression of a “let” binder) can only have the form “\(x_1 + x_2\)”, i.e. its operands can only be variables. Therefore, we proceed as follows:

  1. we make sure that the addition operands \(x_1\) and \(x_2\) and the target variable \(z\) are currently stored on a register;

    • to this end, we may need to generate code to spill up to three other variables onto the stack, and reassign their registers to \(z\), \(x_1\), and \(x_2\);

  2. we generate code for the RISC-V instruction add using the registers of \(x_1\), \(x_2\), storing the result in the register of \(z\).

Generating Code for If-Then-Else Expressions in ANF#

By Definition 51, an if-then-else expression in ANF (which may only appear in the initialisation expression of a “let” binder) can only have the form “\(\hygCond{y}{e_t}{e_f}\) i.e. its condition expression can only be a variable; moreover, both \(e_t\) and \(e_f\) must be in ANF.

To compile the condition \(y\), we need to make sure that \(y\) is stored into a register (and to this end, we may need to generate code to spill some other variable and assign its register to \(y\)).

The compilation of \(e_t\) and \(e_f\) proceeds recursively — but there is a catch:

  • the code generated for \(e_t\) and \(e_f\) may contain code to spill and load variables;

  • therefore, the register allocation at the end of \(e_t\) may be different from \(e_f\)!

This situation is illustrated in Example 77 below.

Example 77 (Why We Must Synchronise Register Allocation Across “If-Then-Else” Branches)

Consider the following Hygge program:

1let z = true;  // Change to 'false' to experiment
2let x1 = 1;
3let x2 = 2;
4let x3 = if z then x1 else 1 + (2 + (3 + 4));
5assert(x1 = 1);
6assert(x2 = 2);
7assert(if z then x3 = x1 else x3 = 10)

Suppose we only have 4 registers available. The following happens:

  • variable z is assigned to register Reg.r(0);

  • variable x1 is assigned to register Reg.r(1);

  • variable x2 is assigned to register Reg.r(2);

  • variable x3 is assigned to register Reg.r(3);

  • when compiling the “if-then-else” expression on line 4:

    • the “then” branch does not need to use any additional register;

    • the “else” branch needs 4 registers for the intermediate results of the expression 1 + (2 + (3 + 4)). Therefore, it will need to spill the variables z, x1, x2, and x3 (which are needed on lines 5–7) onto the stack;

  • therefore, what is the register allocation when the program reaches line 5? Are the variables z, x1, x2, and x3 on the stack, or on some register? That depends on whether the ‘true’ or ‘false’ branch of the ‘if-then-else’ was taken earlier…

The solution to the issue illustrated in Example 77 above is to add some “branch synchronisation” assembly code at the end of the ‘then’ and ‘else’ branches, with the goal of ensuring that, no matter which branch of the “if-then-else” is taken, all the known variables have a known storage on a register and/or on the stack. For example, we can add at the end of the assembly code of the branch \(e_t\) some assembly code that spills/loads the known variables until their register allocation exactly matches the allocation at the end of the branch \(e_f\). We reprise this idea when discussing its implementation in hyggec, in Example 79 below.

Implementation: ANF Transformation and Register Allocation in hyggec#

Tip

The conversion of Hygge programs into ANF (described in this section) is already implemented in the hyggec Git repository, in a file called ANF.fs. To see what changed since the last module, you can inspect the differences between the tags union-rec-types and anf. For more details, see also Implementation: ANF Transformation and Register Allocation in hyggec.

The following subsections outline the implementation of ANF and register allocation available on the hyggec Git repository.

The version of hyggec released with this Module adds two option, available in some compiler phases:

  • the new option -a or --anf that can be used to activate and inspect ANF conversion and ANF-based code generation with register allocation; and

  • the new option -r or --registers can be used to limit the number of registers being used for code generation (useful for experiments and debugging).

Therefore, we can now invoke hyggec as follows:

  • ./hyggec parse -a file.hyg — parses the file, converts it into ANF, and prints the result.

  • ./hyggec interpret -a file.hyg — parses the file, converts it into ANF, and interprets the result.

  • ./hyggec compile -a -r N file.hyg — parses the file, type-checks it, converts the result into ANF, compiles it using the ANF-Based Code Generation with Register Allocation with N available registers (dafault: 18), and displays the resulting RISC-V assembly code. By reducing the value of N, it is possible to observe how the generated code has to spill/load variables more often.

  • ./hyggec rars -a -r N file.hyg — parses the file, type-checks it, converts the result into ANF, compiles it using the ANF-Based Code Generation with Register Allocation with N available registers (dafault: 18), and executes the resulting RISC-V assembly code using RARS. By reducing the value of N, the generated code will spill/load variables more often.

The new version of the compiler also has two new series of tests:

  • tests/interpreter-anf/ contains a series of tests that are parsed, converted into ANF, and then interpreted (these are the same tests of tests/interpreter/);

  • tests/codegen-anf/ contains a series of tests that are parsed, type-checked, converted into ANF, compiled using the ANF-based register allocation, and then executed in RARS.

Example 78 (A Hygge Program using (Too) Many Registers, Revisited)

Consider again the following Hygge program (from Example 76):

1let res = 1 + (2 + (3 + 4));
2assert(res = 10)

If we save this file as hygge-many-regs-simpl.hyg, we can see its ANF translation by running:

./hyggec parse -a hygge-many-regs-simpl.hyg

and the output is:

Let $anf (1:1-2:16)
┣╾init: IntVal 1 (1:11-1:11)
┗╾scope: Let $anf_0 (1:1-2:16)
         ┣╾init: IntVal 2 (1:16-1:16)
         ┗╾scope: Let $anf_1 (1:1-2:16)
                  ┣╾init: IntVal 3 (1:21-1:21)
                  ┗╾scope: Let $anf_2 (1:1-2:16)
                           ┣╾init: IntVal 4 (1:25-1:25)
                           ┗╾scope: Let $anf_3 (1:1-2:16)
                                    ┣╾init: Add (1:21-1:25)
                                    ┃       ┣╾lhs: Var $anf_1 (1:21-1:21)
                                    ┃       ┗╾rhs: Var $anf_2 (1:25-1:25)
                                    ┗╾scope: Let $anf_4 (1:1-2:16)
                                             ┣╾init: Add (1:16-1:26)
                                             ┃       ┣╾lhs: Var $anf_0 (1:16-1:16)
                                             ┃       ┗╾rhs: Var $anf_3 (1:21-1:25)
                                             ┗╾scope: Let $anf_5 (1:1-2:16)
                                                      ┣╾init: Add (1:11-1:27)
                                                      ┃       ┣╾lhs: Var $anf (1:11-1:11)
                                                      ┃       ┗╾rhs: Var $anf_4 (1:16-1:26)
                                                      ┗╾scope: Let res (1:1-2:16)
                                                               ┣╾init: Var $anf_5 (1:11-1:27)
                                                               ┗╾scope: Let $anf_6 (1:1-2:16)
                                                                        ┣╾init: IntVal 10 (2:14-2:15)
                                                                        ┗╾scope: Let $anf_7 (1:1-2:16)
                                                                                 ┣╾init: Eq (2:8-2:15)
                                                                                 ┃       ┣╾lhs: Var res (2:8-2:10)
                                                                                 ┃       ┗╾rhs: Var $anf_6 (2:14-2:15)
                                                                                 ┗╾scope: Let $anf_8 (1:1-2:16)
                                                                                          ┣╾init: Assertion (2:1-2:16)
                                                                                          ┃       ┗╾arg: Var $anf_7 (2:8-2:15)
                                                                                          ┗╾scope: Var $anf_8 (1:1-2:16)

We can compile the program above using the new ANF-based register allocation, limited to only 4 registers, by running:

./hyggec compile -a -r 4 hygge-many-regs-simpl.hyg

and the output is:

 1.data:
 2
 3.text:
 4    mv fp, sp  # Initialize frame pointer
 5    addi sp, sp, -4  # Extend the stack for variable $anf
 6    # Variable $anf allocation: register t0, frame pos. 1 
 7    li t0, 1
 8    addi sp, sp, -4  # Extend the stack for variable $anf_0
 9    # Variable $anf_0 allocation: register t1, frame pos. 2 
10    li t1, 2
11    addi sp, sp, -4  # Extend the stack for variable $anf_1
12    # Variable $anf_1 allocation: register t2, frame pos. 3 
13    li t2, 3
14    addi sp, sp, -4  # Extend the stack for variable $anf_2
15    # Variable $anf_2 allocation: register s1, frame pos. 4 
16    li s1, 4
17    sw t0, -4(fp)  # Spill variable $anf from register t0 to stack
18    addi sp, sp, -4  # Extend the stack for variable $anf_3
19    # Variable $anf_3 allocation: register t0, frame pos. 5 
20    add t0, t2, s1  # $anf_3 <- $anf_1 + $anf_2
21    # Variable $anf_4 allocation: register t2, frame pos. 3 
22    add t2, t1, t0  # $anf_4 <- $anf_0 + $anf_3
23    # Variable $anf_5 allocation: register t1, frame pos. 2 
24    lw t0, -4(fp)  # Load variable $anf onto register t0
25    add t1, t0, t2  # $anf_5 <- $anf + $anf_4
26    # Variable res allocation: register t2, frame pos. 1 
27    mv t2, t1  # res <- $anf_5
28    # Variable $anf_6 allocation: register t1, frame pos. 2 
29    li t1, 10
30    # Variable $anf_7 allocation: register t0, frame pos. 3 
31    beq t2, t1, eq_true
32    li t0, 0  # Comparison result is false
33    j eq_end
34eq_true:
35    li t0, 1  # Comparison result is true
36eq_end:
37    # Variable $anf_8 allocation: register t2, frame pos. 1 
38    addi t2, t0, -1
39    beqz t2, assert_true  # Jump if assertion OK
40    li a7, 93  # RARS syscall: Exit2
41    li a0, 42  # Assertion violation exit code
42    ecall
43assert_true:
44    lw t0, 0(fp)  # Load variable $result onto register t0
45    mv t0, t2  # $result <- $anf_8
46    li a7, 10  # RARS syscall: Exit
47    ecall  # Successful exit with code 0

By following the comments, we can see when the code generation introduces new variables (by assigning stack space and a register) and spills/loads variables from/to registers. We can observe, in particular, that:

  • variable $anf is spilled onto the stack on line 17, and later reloaded into a register on line 24; and

  • on line 21, variable anf_4 is assigned register t2 and frame position 3, that (according to line 12) were previously assigned to $anf_1 (which “disappears” without being spilled onto the stack — hence $anf_1 has become unused and has been discarded).

Exercise 40 (Experimenting with ANF and the number of available registers)

Experiment with some variations of Example 78:

  • make the “let…” expression being compiled more complicated, and observe how this influences the ANF transformation, and in the amount of spills/loads in the generated assembly;

  • try varying the number of available registers for ANF code generation (using the option -r of hyggec), and observe how this influences the amount of spills/loads in the generated assembly.

ANF Transformation in hyggec#

The ANF transformation is implemented in the file ANF.fs, and it follows the procedure described in Transformation of a Hygge Expression into ANF. The main function is transform, which then invokes toANFDefs and toANF:

/// Transform the given AST node into Administrative Normal Form.
let transform (ast: Node<'E,'T>): Node<'E,'T> =
    toANF (toANFDefs ast)

Here are a few remarks about the rest of the file.

  • For efficiency, the function toANFDefs builds the list of ANF definitions in reverse order (i.e. with the latest entries at the head of the list); correspondingly, the function toANF expects to receive a list of ANF definitions in reverse order;

  • The file includes an auxiliary function substVar that is very similar to subst (in ASTUtil.fs) — except that it specifically substitutes the name of a variable with another name.

  • The unique variable names autogenerated for ANF transformation have a prefix $anf (often followed by an integer): this guarantees that the autogenerated names will not clash with existing variables in the input program, since Parser.fsy does not accept programs that use $ in variable names.

ANF-Based Code Generation in hyggec#

The code generation implemented in the file ANFRISCVCodegen.fs follows the procedure described in ANF-Based Code Generation with Register Allocation.

Important

The ANF-based code generation in ANFRISCVCodegen.fs is only partial, and is intended as a demonstration and a basis for Project Ideas. In particular:

  • only integer registers are considered, and

  • only a few Hygge expressions are implemented (addition, if-then-else, assertions, …) — just enough to support the examples shown in this Module.

Here is an overview of the main contents of the file.

  • The code generation environment has type ANFCodegenEnv and it contains the information about known and needed variables, allocated registers, etc.:

    /// Code generation environment.
    type internal ANFCodegenEnv = {
        /// Target variable for the result of the assembly code execution.
        TargetVar: string
        /// Frame position assigned to each known variable.
        Frame: Map<string, int>
        /// Size of the stack frame.
        FrameSize: int
        /// List of integer variables stored in registers, with newest ones coming
        /// first.  If a variable does not appear here, then it means that it is
        /// allocated on the stack frame.
        IntVarsInRegs: List<string * Reg>
        /// List of available integer registers.
        FreeRegs: List<Reg>
        /// Set of variables that are needed in the surrounding scope, and should
        /// never be discarded to reuse their storage.
        NeededVars: Set<string>
    }
    
  • The main entry point of the ANF-based code generation is the function codegen, which in turn invokes doCodegen:

    /// Generate RISC-V assembly for the given AST (expected to be in ANF), using
    /// the given number of registers.
    let codegen (node: TypedAST) (registers: uint): RISCV.Asm =
        /// Name of a special variable used to hold the result of the program
        let resultVarName = "$result"
        /// Initial codegen environment, with all registers available
        /// Note: the definition of 'env' uses list comprehension:
        /// https://en.wikibooks.org/wiki/F_Sharp_Programming/Lists#Using_List_Comprehensions
        let env = { TargetVar = resultVarName
                    Frame = Map[(resultVarName, 0)]
                    FrameSize = 1
                    IntVarsInRegs = []
                    FreeRegs = [for i in 0u..(registers - 1u) do yield Reg.r(i)]
                    NeededVars = Set[resultVarName] }
        let result = doCodegen env node
        // ...
    

    Notably, the funtion codegen initialises a code generation environment by targeting a special variable called $result, which is initially stored in the first available location on the stack frame, without using any register. It also marks the $result variable as “needed” (so it can never be discarded during code generation, even if it is unused), and declares all registers as free.

  • Unlike the default Code Generation Strategy of hyggec, the result of doCodegen (and other functions) in ANFRISCVCodegen.fs is not just an Asm instance. Instead, doCodegen has the following signature:

    /// Code generation function: compile the expression in the given AST node,
    /// which is expected to be in ANF.
    let rec internal doCodegen (env: ANFCodegenEnv)
                               (node: TypedAST): ANFCodegenResult = ...
    

    Where the type ANFCodegenResult is defined as:

    /// Code generation result.
    type internal ANFCodegenResult = {
        /// Compiled RISC-V assembly code.
        Asm: Asm
        /// Updated code generation environment.
        Env: ANFCodegenEnv
    }
    

    This is because the function doCodegen may emit assembly code to spill and load registers — and this changes the register allocation described in the input environment. For this reason, doCodegen returns an updated code generation environment, which should be used as input for the subsequent code generation steps.

  • The following auxiliary function generates assembly code to spill a variable varName onto the stack:

    /// Spill the variable with the given name onto its stack position assigned in
    /// 'env'.  Return the assembly code that performs the spill and the updated
    /// codegen environment.
    let internal spillVar (env: ANFCodegenEnv)
                          (varName: string): ANFCodegenResult = ...
    

    The function spillVar above is used by the function spillOldestIntVar below, that selects which variable to spill:

    /// Spill the integer variable that has been stored in an integer register for
    /// the longest time, saving it in the stack position assigned in 'env'.  Choose
    /// a variable that does not belong to the given 'doNotSpill' list.  Return the
    /// assembly code that performs the spilling, and the updated codegen
    /// environment.
    let internal spillOldestIntVar (env: ANFCodegenEnv)
                                   (doNotSpill: List<string>): ANFCodegenResult = ...
    

    This hihglights the strategy used by hyggec for selecting which variable to spill when all registers are in use: the compiler picks the variable that has been assigned to a register for the longest time. (Other strategies are discussed in the References and Further Readings.)

  • The function syncANFCodegenEnvs below generates code to synchronise the register allocation across two ANFCodegenEnvironments: such environments may be produces by the two branches of an if-then-else, according to the idea outlined in Generating Code for If-Then-Else Expressions in ANF.

    /// Generate assembly code that spills/loads variables in 'fromEnv' achieving
    /// the same configuration of 'toEnv'.
    let internal syncANFCodegenEnvs (fromEnv: ANFCodegenEnv)
                                    (toEnv: ANFCodegenEnv): Asm = ...
    

    The function works as follows:

    • it takes two code generation environments fromEnv and toEnv (e.g. one obtained after compiling the “then” branch of an if-then-else, and the other obtained after compiling the “else” branch), and

    • it returns the assembly code that spills/loads the variables of fromEnv to/from the stack, and (re-)assigns their registers, until they reach the same configuration of toEnv.

    To see syncANFCodegenEnvs in action, you can try Example 79 below.

    Example 79 (Synchronising Register Allocation Across “If-Then-Else” Branches)

    Consider again the following Hygge program (from Example 77):

    1let z = true;  // Change to 'false' to experiment
    2let x1 = 1;
    3let x2 = 2;
    4let x3 = if z then x1 else 1 + (2 + (3 + 4));
    5assert(x1 = 1);
    6assert(x2 = 2);
    7assert(if z then x3 = x1 else x3 = 10)
    

    If we save the example above in a file called if-then-else-sync-reg-alloc.hyg, we can see its translation into ANF by running:

    ./hyggec parse -a if-then-else-sync-reg-alloc.hyg
    

    and we can observe that the “else” branch has many more ANF variable definitions (hence, it uses more registers) than the “then” branch.

    To generate assembly code from the ANF, by limiting register allocation to 4 registers only, we can run:

    ./hyggec compile -a -r 4 if-then-else-sync-reg-alloc.hyg
    

    The generated code is quite long, but let’s focus on the code generated for the “then” branch of the “if-then-else” expression:

        # ...assembly code for the 'false' branch of the 'if'...
        j if_end  # Jump to skip the 'true' branch of 'if' code
    if_true:
        mv s1, t1  # $anf_6 <- x1
        # Branch synchronization code begins here
        sw s1, -16(fp)  # Spill variable $anf_6 from register s1 to stack
        sw t1, -8(fp)  # Spill variable x1 from register t1 to stack
        sw t2, -12(fp)  # Spill variable x2 from register t2 to stack
        sw t0, -4(fp)  # Spill variable z from register t0 to stack
        lw t2, -16(fp)  # Load variable $anf_6 onto register t2
        # Branch synchronization code ends here
    if_end:
        # Rest of the assembly code...
    

    We can see that, after the if_true label, there is the code for the “then” branch of the “if-then-else”:

    • first, an mv operation copies variable x1 into $anf_6 (which is the target variable holding the result of the whole if-then-else expression); and then,

    • there is a block of “branch synchronisation code” produced by the function syncANFCodegenEnvs above. There we can observe, in particular, that the variable $anf_6 is first spilled from register s1, and then reloaded onto register t2, so it gets the same register allocated after the code generation for the “else” branch. This way, no matter which branch was taken, the rest of the assembly code can use register t2 to get the value of $anf_6.

References and Further Readings#

ANF was introduced in these seminal papers, with the main purpose of compiling functional languages:

  • Amr Sabry and Matthias Felleisen. Reasoning about programs in continuation-passing style. In Proceedings of the 1992 ACM conference on LISP and functional programming (LFP ‘92). https://doi.org/10.1145/141471.141563

  • Cormac Flanagan, Amr Sabry, Bruce F. Duba, and Matthias Felleisen. The essence of compiling with continuations. In Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation (PLDI ‘93). https://doi.org/10.1145/155090.155113

ANF is used, for instance, in the Racket compiler.

Conceptually, ANF is reminiscent of other intermediate representations used in various compilers — e.g. Static Single Assignment (SSA) form, and Three-Address Code. To know more, see for example:

  • Keith D. Cooper and Linda Torczon. Engineering a Compiler (Third Edition). Available on DTU Findit.

    • See, in particular, Chapter 4 - Intermediate Representations

  • Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, Oct. 1991. https://doi.org/10.1145/115372.115320

Register allocation is an active research topic: there are many different strategies to decide if/when a certain variable in a program should be spilled/loaded from/into a register; each strategy brings trade-offs in terms of complexity, speed of code generation, and efficiency of the generated code. The ANF-based register allocation strategy described in this Module is, broadly speaking, a coarse linear register allocation strategy:

  • it takes spill/load decisions while traversing the program once, and

  • it uses the current set of free variables to find ranges of the program where variables are live (i.e. actually used by the program) or become permanently unused (and thus, can be discarded).

To know more about register allocation, here are two surveys:

Project Ideas#

Here are some ideas for your group project.

Project Idea (Easy-Medium Difficulty): Implement ANF Translation for (Some of) Your Hygge Extensions#

The ANF translation in the file ANF.fs does not cover the Hygge expressions you should have added in the previous Project Ideas. Select some of those expressions, and extend Definition 51, the file ANF.fs, and the interpreter test suite accordingly:

  • you should explain how you extend Definition 51 to support the expressions you selected;

  • you should add new cases to the functions substVars and toANFDefs in ANF.fs;

  • you should add new tests under the directory tests/interpreter-anf/. Ideally, you should reuse the same interpreter tests that you have already developed for the expressions you selected, by copying them from tests/interpreter/.

Note

  • The difficulty of this Project Idea may range from Easy to Medium depending on which expression(s) you choose to translate to ANF: please speak with the teacher to agree on an estimate.

Project Idea (Medium-Hard Difficulty): Implement ANF-Based Code Generation for More Hygge Expressions#

The ANF-based code generation in the file ANFRISCVCodegen.fs only supports a few Hygge expressions. For this Project Idea, you should select more expressions, and extend ANFRISCVCodegen.fs to generate code for them, using the ANF-based register allocation strategy described in this Module.

  • You can choose the same expressions you converted into ANF for the project idea above — but this is not mandatory;

  • in ANFRISCVCodegen.fs, you should add new cases to the function doCodegen;

  • you should add more corresponding test cases under the directory tests/codegen-anf/. All test cases should compile and run correctly by using just 3 registers.

Hint

If the expression you are adding has some form of branching and conditional jump (e.g. while loops, short-circuit logical operators, …), then you may need to synchronise the register allocation of the different branches, similarly to Generating Code for If-Then-Else Expressions in ANF.

Note

  • The difficulty of this Project Idea may range from Medium to Hard depending on which expression(s) you choose to compile using ANF-based register allocation: please speak with the teacher to agree on an estimate.

  • If you chose Project Idea (Hard): Recursive Functions and you want to add ANF transformation for “let rec x =…”, you should treat it similarly to “let x = …”.