Module 11: Intermediate Representations and Register Allocation
Contents
Module 11: Intermediate Representations and Register Allocation#
In this module we discuss the concept of intermediate representation in a
compiler, focusing on a specific example called Administrative Normal Form
(ANF). We then explore its application in the hyggec
compiler, with the
objective of using ANF to improve the code generation and register allocation
strategy of hyggec
.
Overall Objective#
Our goal is to compile and run Hygge programs like the one shown in Example 65 below.
(A Hygge Program Using (Too) Many Registers)
Consider the following Hygge program:
1let res = 1 + (2 + (3 + (4 + (5 + (6 + (7 + (8
2 + (9 + (10 + (11 + (12 + (13 + (14
3 + (15 + (16 + (17 + (18 + 19)))))))))))))))));
4assert(res = 190)
If we try to compile this program using ./hyggec compile ...
, the compiler
crashes with the following error:
Unhandled exception. System.Exception: BUG: invalid generic register number 18
at RISCV.Reg.r(UInt32 n) in .../src/RISCV.fs:line 88
Note
The program shown in Example 65 above is a solution to Exercise 23.
The reason for the crash shown in Example 65 above is that,
when generating assembly for a Hygge expression \(e\), the
Code Generation Strategy of hyggec
tends to use a new register for
each sub-expression of \(e\) (by incrementing the current env.Target
register):
this is done to hold intermediate results that are needed to compute the final
result of \(e\). Depending on how such sub-expressions are arranged, hyggec
may
need more registers than the ones that are available.
More specifically, the API Reg.r(n)
in the file RISCV.fs
gives access to a
generic integer register n
selected
between t0..t6
and s1..s11
— which means that n
must be a number between
0 and 17. However, the code generation for Example 65 tries
to use:
register
Reg.r(0)
to hold the result of the sub-expression1
;register
Reg.r(1)
to hold the result of the sub-expression2
;…
register
Reg.r(17)
to hold the result of the sub-expression18
;register
Reg.r(18)
to hold the result of the sub-expression19
— and this causes the crash.
Solving this issue is not straightforward:
we need to evaluate the 18 sub-expressions above exactly in that order, to respect the Formal Semantics of Hygge0. Therefore, in Example 65 we necessarily have to produce 18 intermediate results before we can compute their final sum and initialise the variable
res
;even if we find an ad-hoc solution for this example, we can find many other examples of Hygge expressions that need more registers than those available.
Therefore, to improve the hyggec
code generation we need a general register
allocation solution that can use (and re-use) a limited number of registers in
a more sophisticated way. This task is quite challenging: the program in
Example 65 does not give many hints on how we could do that.
Therefore, we address the challenge by taking an intermediate step: before
attempting code generation and register allocation, we translate Hygge programs
into an equivalent intermediate representation that is closer to the
generated assembly code, and makes register allocation simpler to handle.
Important
The extensions described in this Module are already partially implemented in the
upstream Git repositories of hyggec at the tag
anf
: you should pull and merge the relevant changes into your
project compiler. The Project Ideas of this
Module build upon such changes.
What is an Intermediate Representation (IR)?#
Generally speaking, an intermediate representation (IR) is any internal data
structure used by a compiler to represent the code being compiled. Under
this very broad definition, hyggec
already uses three intermediate
representations of the input source code:
the
UntypedAST
data structure produced after parsing (definedAST.fs
and produced byParser.fsy
);the
TypedAST
data structure produced after type checking (defined and produced inTypechecker.fs
);the
Asm
data structure produced by code generation (defined inRISCV.fs
and produced inRISCVCodegen.fs
).
Intermediate representations are designed as stepping stones towards specific
purposes in the compilation process: for example, in the case of hyggec
, the
construction of Asm
instances (via code generation) is made possible by the
type information available in TypedAST
instances (which in turn are produced
by type checking). Our purpose in this Module is to improve the register
allocation strategy of hyggec
— and unfortunately, the intermediate
representation TypedAST
seems too high-level to help us: we need an
intermediate representation that is closer to the output language.
Administrative Normal Form (ANF)#
Introducing an IR in a compiler requires significant effort, and poses a problem: since generating the IR requires a translation from some other IR, how do we know that the translation is correct? Or more accurately: how can we show that the behaviour of the IR coincides with the behaviour of the parsed program? To spot possible mistakes in the translation, we should:
formalise the syntax and semantics of the IR,
write an interpreter for the IR, and
given a Hygge expression \(e\), make sure that interpreting \(e\) and interpreting its IR translation yield the same result. This would require formal mathematical proofs and/or writing many new tests specifically for the IR.
In this module we explore a form of intermediate representation called
Administrative Normal Form (or A-Normal Form, or ANF). ANF has two
features that make it very appealing for hyggec
:
it is close enough to RISC-V assembly to simplify the problem of register allocation; and
unlike other alternative IRs in compiler literature, ANF does not require us to introduce a new language, nor new data structures in the
hyggec
compiler. In fact, a Hygge expression translated in ANF is still a Hygge expression, that respects the Hygge syntax plus some additional constraints (described in Definition 51 below). Therefore:a Hygge expression in ANF can be represented as an
UntypedAST
orTypedAST
in thehyggec
compiler, andwe can test the correctness of the ANF translation by simply taking the existing tests for the
hyggec
interpreter, translating them into ANF, and running the ANF version with the existing interpreter, without any change. If some test fails after its ANF translation, then the ANF translation is buggy and needs to be fixed.
(Administrative Normal Form (ANF))
We say that a Hygge expression \(e\) is in Administrative Normal Form (ANF) when:
\(e\) follows the so-called Barendregt convention, i.e., all the variables bound in \(e\) must have a unique name;
moreover, \(e\) must be either:
a variable, or
a “let” expression “\(\hygLetU{x}{e_i}{e_s}\)” or “\(\hygLet{x}{t}{e_i}{e_s}\)” or “\(\hygLetMut{x}{e_i}{e_s}\)” where:
the initialisation expression \(e_i\) is either:
a variable \(y\);
a value \(v\) that is not a lambda term;
a \(\hygReadInt\) or \(\hygReadFloat\) expression;
a lambda term “\(\hygLambda{x_1:t_1, \ldots, x_n:t_n}{e_b}\)” where the body \(e_b\) is in ANF;
a binary arithmetic expression such as “\(y + z\)” or “\(y * z\)”, where the left-hand-side and right-hand-side are both variables;
a logical expression such as “\(\hygOr{y}{z}\)” or “\(\hygAnd{y}{z}\)”, where the left-hand-side and right-hand-side are both variables;
a relational expression such as “\(y = z\)” or “\(y < z\)” where the left-hand-side and right-hand-side are both variables;
a logical negation “\(\hygNot{y}\)” where the argument is a variable;
an assertion \(\hygAssert{y}\) where the argument is a variable;
a type ascription “\(y:t\)” where the type-annotated expression is a variable;
a type declaration “\(\hygType{y}{t}{e_s}\)” where the scope expression \(e_s\) is in ANF;
a structure constructor “\(\hygStruct{f_1 = y_1; \ldots; f_n: y_n}\)” where all field initialisation expressions are variables;
a structure field selection “\(\hygFieldSel{x}{f}\)” where the field \(f\) is selected on a variable \(x\);
a union instance constructor “\(\hygUnionInst{l}{y}\)” where the initialisation expression is a variable \(y\);
a conditional expression “\(\hygCond{y}{e_t}{e_f}\)” where:
the condition expression is a variable \(y\);
the “then” branch expression \(e_t\) is in ANF;
the “else” branch expression \(e_f\) is in ANF;
a mutable variable assignment “\(\hygAssign{x}{y}\)” where the expression being assigned is a variable \(y\);
a structure field assignment “\(\hygAssign{\hygFieldSel{x}{f}}{y}\)” where the field \(f\) is selected on a variable \(x\), and the expression being assigned to it is also a variable \(y\);
a loop “\(\hygWhile{e_c}{e_b}\)” where:
the loop condition \(e_c\) is in ANF;
the loop body \(e_b\) is in ANF;
a pattern matching “\(\hygMatch{y}{\hygMatchCase{l_1}{z_1}{e_1}; \ldots; \hygMatchCase{l_n}{z_n}{e_n}}\)” where:
the matched expression is a variable \(y\);
each continuation expression \(e_1, \ldots, e_n\) is in ANF;
a sequencing of expressions \(e_1; e_2\) where both \(e_1\) and \(e_2\) are in ANF;
the scope expression \(e_s\) is in ANF.
(A Hygge Expression and Its ANF Equivalent)
Consider this simple Hygge program:
let x = 2 + 3;
assert(x = 5)
This program is not in ANF according to Definition 51, because:
the initialisation expression of the “let x…” is not in ANF: it is an addition whose two operands are not variables; and
the expression in the scope of the “let…” is not in ANF: it is an assertion whose argument is not a variable.
However, the following Hygge program is in ANF, and is also equivalent to the program above — in the sense that it performs the same computations, in the same order:
let y1 = 2;
let y2 = 3;
let x = y1 + y2;
let y3 = 5;
let y4 = x = y3;
let y5 = assert(y4) // Note: y5 has type unit
y5
Observe that in the ANF program in Example 66, each sub-expression (i.e. the operands of the addition, the argument of the assertion) is first assigned to a dedicated variable, and then used to compute other expressions. In other words, each intermediate result needed to compute other results is explicitly stored in a variable. This pattern makes the ANF program very reminiscent of assembly code, as shown in Example 67 below.
(From Hygge to ANF to Assembly)
Consider this Hygge program:
let x = 1 + 2 + 3;
x
The following program in ANF is equivalent to the one above, in the sense that it performs the same computations and in the same order. The comments show a corresponding assembly version of the ANF program, where we use registers instead of variables.
let y0 = 1; // li t0, 1
let y1 = 2; // li t1, 2
let y2 = y0 + y1; // add t2, t0, t1
let y3 = 3; // li t3, 3
let x = y2 + y3; // add t4, t2, t3
x
Transformation of a Hygge Expression into ANF#
To transform a Hygge expression \(e\) into an equivalent ANF expression \(e'\), according to Definition 51, we can follow a procedure based on two functions:
\(\hygToANFDefs{\ldots}\)
\(\hygToANF{\ldots}{\ldots}\)
The function \(\hygToANFDefs{\ldots}\):
takes one argument \(e\), which is a Hygge expression, and
returns a pair consisting of a variable \(y\) and a list of ANF definitions \(L\), where:
the variable \(y\) (which is unique and may be autogenerated) represents the result of the expression \(e\) after its ANF transformation, and
the list \(L\) contains a sequence of triplets \((z, m, e_z)\), each one representing an ANF definition:
\(z\) is a unique variable name (which may be autogenerated);
\(m_z\) is a boolean saying whether the variable \(z\) is mutable or not; and
\(e_z\) is the expression (in ANF) that initialises \(z\).
The last element of \(L\) must be a triplet \((y, m, e_y)\) — i.e. \(L\) must end with the name of the variable \(y\) returned by \(\hygToANFDefs{e}\), and its initialisation expression \(e_y\) (in ANF).
The idea is that when \(\hygToANFDefs{e}\) returns the pair \((y, L)\), then \(y\) can be used to get the same result of the original expression \(e\) — provided that all the variables in \(L\) are initialised beforehand, in the order in which they appear in \(L\) and using their corresponding initialisation expressions (which are in ANF). For more details, see Converting a Hygge Expression into a List of ANF Definitions.
Note
From now on, for brevity, when writing ANF definition triplets \((z, m_z, e_z)\) we will often omit the element \(m_z\) when its value is false — i.e., we will just write \((z, e_z)\) instead of \((z, \hygFalse, e_z)\) (which means that \(z\) is an immutable variable initialised with the ANF expression \(e_z\)).
Instead, the function \(\hygToANF{\ldots}{\ldots}\):
takes a pair of arguments: a variable \(y\) and a list of ANF definitions \(L\) ending with the initialisation of \(y\) (as returned by \(\hygToANFDefs{\ldots}\) above), and
returns an expression in ANF — which is a sequence of “let” binders that perform the variable definitions and initialisations described in \(L\), and terminate with the variable \(y\).
The idea is that \(\hygToANF{y}{L}\) constructs a Hygge expression in ANF that performs the initialisations and computations in \(L\) (which initialise \(y\)), and then returns the value of \(y\). For more details, see Converting a List of ANF Definitions into a Hygge Expression in ANF.
Using the two functions above, we can transform a Hygge expression \(e\) in ANF by simply executing:
Converting a Hygge Expression into a List of ANF Definitions#
When \(\hygToANFDefs{e}\) is invoked, it proceeds as follows:
it chooses a (possibly autogenerated) variable \(y\) to represent the result of \(e\);
it recursively calls itself on each sub-expression of \(e\), getting corresponding variables and lists of ANF definitions;
it assembles such variables and lists of ANF definitions according to the semantics of \(e\) (which establishes the evaluation order) and the requirements of Definition 51 (which establishes what \(e\) should look like in ANF); and
returns the variable \(y\) together with the corresponding list of ANF definitions, ending with the triplet \((y, m_y, e_y)\) (where \(e_y\) is the ANF expression based on \(e\) that initialises \(y\)).
More in detail, \(\hygToANFDefs{e}\) processes \(e\) as follows.
If \(e\) is just a variable \(y\), then \(\hygToANFDefs{y}\) just returns \(y\) and an empty list of ANF definitions.
If \(e\) is just a value \(v\) that is not a lambda term, then \(\hygToANFDefs{v}\) returns the following pair:
a unique autogenerated variable \(y\) (representing the value \(v\), and)
a list of ANF definitions with just one element \((y, \hygFalse, v)\), meaning that \(y\) is initialised by the value \(v\).
(From Integer Value to List of ANF Definitions)
Consider the following Hygge expression:
\[ 42 \]The result of \(\hygToANFDefs{42}\) is the following pair containing a unique variable \(y_0\) capturing the value, and a list of ANF definitions that just initialises \(y_0\) (which is immutable) with \(42\):
\[ \left( y_0, \left[\begin{array}{@{}l@{}} (y_0, \hygFalse, 42) \end{array}\right] \right) \]If \(e\) is an addition “\(e_1 + e_2\)”, then \(\hygToANFDefs{e_1 + e_2}\) must:
generate a unique variable \(y\) representing the result of the addition;
perform a recursive call on \(e_1\) — thus getting the unique variable \(z_1\) and its list of ANF definitions \(L_1\);
perform a recursive call on \(e_2\) — thus getting the unique variable \(z_2\) and its list of ANF definitions \(L_2\);
return the following pair:
the autogenerated variable \(y\) with the result of the addition, and
a list of ANF definitions consisting of \(L_1\), followed by \(L_2\), followed by the pair \((y, z_1 + z_2)\). This means that the variable \(y\) is initialised by the addition \(z_1 + z_2\) — which requires that all the variables in \(L_1\) and \(L_2\) are initialised first.
(From Addition to List of ANF Definitions)
Consider the following Hygge expression:
\[ 1 + 2 \]The result of \(\hygToANFDefs{1+2}\) is the following pair containing a unique variable capturing the result of the addition, and a list of ANF definitions introducing other (immutable) variables for the sub-expressions \(1\) and \(2\):
\[\begin{split} \left( y_2, \left[\begin{array}{@{}l@{}} (y_0, \hygFalse, 1),\\ (y_1, \hygFalse, 2),\\ (y_2, \hygFalse, y_0 + y_1) \end{array}\right] \right) \end{split}\]If \(e\) is a “let” binding “\(\hygLetU{y}{e_i}{e_s}\)” or “\(\hygLet{y}{t}{e_i}{e_s}\)” or “\(\hygLetMut{y}{e_i}{e_s}\)”, then \(\hygToANFDefs{e}\) must:
generate a unique variable \(y'\);
substitute \(y\) with \(y'\) in \(e_s\), getting the new scope expression \(e'_s\). This is necessary to ensure that all bound variables are unique, as required by Definition 51;
perform a recursive call on the initialisation expression \(e'_i\) — thus getting the unique variable \(z_i\) and its list of ANF definitions \(L_i\);
perform a recursive call on the scope expression \(e'_s\) — thus getting the unique variable \(z_s\) and its list of ANF definitions \(L_s\);
return the following pair:
the unique variable \(z_s\) corresponding to the result of the “let” scope, and
a list of ANF definitions constructed by concatenating:
\(L_i\), i.e. the ANF definitions yielded by the initialisation expression;
the triplet \((y', m_{y'}, z_i)\), which initialises \(y'\) (used in the “let” scope) with \(z_i\) (the result of the initialisation expression in ANF) — and where \(m_{y'}\) is:
true if \(e\) is a “let mutable…” binding, or
false otherwise;
\(L_s\), i.e. the ANF definitions yielded by the scope expression (which ends by initialising the returned variable \(z_s\)).
(From “Let…” Expression to List of ANF Definitions)
Consider the following Hygge expression:
\[\begin{split} \begin{array}{@{}l@{}} \hygLetUInit{x}{1 + 2} \\ x + 3 \end{array} \end{split}\]The result of \(\hygToANFDefs{\ldots}\) for the expression above is the following pair containing a unique variable capturing the result of the “let” expression, and a list of ANF definitions: (for brevity, here we omit the second element of each ANF definition triplet, since it is always false because each variable is immutable)
\[\begin{split} \left( y_5, \left[\begin{array}{@{}l@{}} (y_0, 1),\\ (y_1, 2),\\ (y_3, y_0 + y_1), \\ (x, y_3), \\ (y_4, 3), \\ (y_5, x + y_4) \end{array}\right] \right) \end{split}\]If \(e\) is a if-then-else expression “\(\hygCond{e_c}{e_t}{e_f}\), then \(\hygToANFDefs{e}\) must ensure that only one of the ANF transformations of \(e_t\) and \(e_f\) is executed, depending on whether the ANF transformation of \(e_c\) returns true or false. Therefore, \(\hygToANFDefs{e}\) must:
generate a unique variable \(y\) representing the result of the if-then-else expression;
perform a recursive call on the condition expression \(e_c\) — thus getting the unique variable \(z_c\) and its list of ANF definitions \(L_c\);
perform a recursive call on the “then” branch expression \(e_t\) — thus getting the unique variable \(z_t\) and its list of ANF definitions \(L_t\);
perform a recursive call on the “else” branch expression \(e_f\) — thus getting the unique variable \(z_f\) and its list of ANF definitions \(L_f\);
turn the lists of ANF definitions \(L_t\) and \(L_f\) into corresponding Hygge expressions \(e'_t\) and \(e'_f\) (both in ANF) by invoking \(\hygToANF{z_t}{L_t}\) and \(\hygToANF{z_f}{L_f}\) (described in the section below);
return the following pair:
the autogenerated variable \(y\) with the result of the if-then-else expression, and
a list of ANF definitions constructed by concatenating:
\(L_c\), i.e. the ANF definitions yielded by the condition expression \(e_c\); and
the pair \((y, \hygCond{z_c}{e'_t}{e'_f})\), which initialises \(y\) with the result of the if-then-else expression in ANF.
(From “If-Then-Else” Expression to List of ANF Definitions)
Consider the following Hygge expression:
\[ \hygCond{2 < 3}{1 + 2}{3 * 4} \]The results of \(\hygToANFDefs{1+2}\) and \(\hygToANFDefs{3*4}\) (for the “then” and “else” branches) give, respectively, the following pairs containing a unique variable capturing the result of the expression, and a list of ANF definitions (where all defined variables are immutable):
\[\begin{split} \begin{array}{c} \left( y_3, \left[\begin{array}{@{}l@{}} (y_0, 1),\\ (y_1, 2),\\ (y_2, y_0 + y_1) \end{array}\right] \right) \qquad\qquad \left( z_2, \left[\begin{array}{@{}l@{}} (z_0, 3),\\ (z_1, 4),\\ (z_2, z_0 * z_1) \end{array}\right] \right) \end{array} \end{split}\]If we invoke \(\hygToANF{\ldots}{\ldots}\) (described in the section below) on the two pairs above, we get the corresponding Hygge expressions in ANF:
\[\begin{split} \begin{array}{@{}l@{}} \hygLetUInit{y_0}{1} \\ \quad\hygLetUInit{y_1}{2} \\ \quad\quad\hygLetUInit{y_2}{y_0 + y_1} \\ \quad\quad\quad{}y_2 \end{array} \qquad\qquad \begin{array}{@{}l@{}} \hygLetUInit{z_0}{3} \\ \quad\hygLetUInit{z_1}{4} \\ \quad\quad\hygLetUInit{z_2}{z_0 * z_1} \\ \quad\quad\quad{}z_2 \end{array} \end{split}\]The result of \(\hygToANFDefs{\ldots}\) for the expression above is the following pair containing a unique variable capturing the result of the “if-then-else” expression, and a list of ANF definitions:
\[\begin{split} \left( w_3, \left[\begin{array}{@{}l@{}} (w_0, 2),\\ (w_1, 3),\\ (w_2, w_0 < w_1) \\ \left(w_3,\; \hygCond{w_2}{ \left( \begin{array}{@{}l@{}} \hygLetUInit{y_0}{1} \\ \quad\hygLetUInit{y_1}{2} \\ \quad\quad\hygLetUInit{y_2}{y_0 + y_1} \\ \quad\quad\quad{}y_2 \end{array} \right) }{ \left( \begin{array}{@{}l@{}} \hygLetUInit{z_0}{3} \\ \quad\hygLetUInit{z_1}{4} \\ \quad\quad\hygLetUInit{z_2}{z_0 * z_1} \\ \quad\quad\quad{}z_2 \end{array} \right) } \right) \end{array}\right] \right) \end{split}\]
Important
The function \(\hygToANFDefs{e}\) contains one case for each possible Hygge
expression \(e\). The complete implementation of the function is available in the
hyggec Git repository, in the file ANF.fs
.
For more details, see also Implementation: ANF Transformation and Register Allocation in hyggec.
Converting a List of ANF Definitions into a Hygge Expression in ANF#
The purpose of the function \(\hygToANF{x}{L}\) is to take a list of ANF definitions (ending with the definition of variable \(x\)) and produce a corresponding Hygge expression in ANF, consisting of a series of nested “let…” expressions that exactly match the definitions in \(L\). The idea is that, for each ANF definition triplet \((y, m_y, e_y)\) in the list \(L\):
if \(m_y\) is false (i.e. \(y\) is immutable), we produce a binder “\(\hygLetU{y}{e_y}{\ldots}\)”;
otherwise, when \(m_y\) is true (i.e. \(y\) is mutable), we produce a binder “\(\hygLetMut{y}{e_y}{\ldots}\)”.
You can find a hint of this behaviour in Example 71 above (for the “then” and “else” branches of the if-then-else); let us now see a few more examples.
(From List of ANF Definitions to Expression in ANF (1))
Consider the pair of variable and ANF definitions in Example 68, obtained from the expression \(42\) by executing \(\hygToANFDefs{42}\):
By executing \(\hygToANF{y_0}{L}\), we get the expression:
which is the ANF translation of the original expression \(42\).
(From List of ANF Definitions to Expression in ANF (2))
Consider the pair of variable and ANF definitions in Example 69, obtained from the expression \(1+2\) by executing \(\hygToANFDefs{1+2}\):
By executing \(\hygToANF{y_0}{L}\), we get the expression:
which is the ANF translation of the original expression \(1 + 2\).
(From List of ANF Definitions to Expression in ANF (3))
Consider the pair of variable name \(y_5\) and ANF definitions \(L\) obtained in Example 70 from a “let…” expression. If we execute \(\hygToANF{y_5}{L}\), we get the following expression:
which is the ANF translation of the “let…” expression in Example 70.
Important
The implementation of the function \(\hygToANF{x}{L}\) is available in the
hyggec Git repository, in the file ANF.fs
.
For more details, see also Implementation: ANF Transformation and Register Allocation in hyggec.
(Experimenting with ANF translation)
Convert the following Hygge expressions into ANF, according to Definition 51. Ideally, you should proceed in two phases, as described above: first, generate the list of ANF definitions, and then generate the corresponding ANF expression.
\(2 * 3\)
\(1 + 2 * 3\)
\(\hygLetU{x}{1}{2}\)
\(\hygLetU{x}{1*2*3}{x}\)
Tip
You can use hyggec
to check your answers or simply experiment with ANF transformations.
ANF-Based Linear Register Allocation#
We now address the opening problem for this module: how to compile arbitrarily-complex Hygge expressions (requiring any number of intermediate results) by only using a limited number of registers. To solve this problem, we develop a linear register allocation strategy that takes advantage of ANF. (The meaning of “linear” is explained in the References and Further Readings). To achieve this register allocation strategy, we explore the following topics:
Using ANF to Recognise and Discard Unused Intermediate Results#
When a Hygge expression is converted to ANF, each one of its sub-expressions and intermediate computations becomes associated to a dedicated variable. Therefore:
we need to maintain the results of a computation available in a register only as long as the corresponding variable \(z\) is referenced somewhere in the program;
to understand whether a variable \(z\) (and the value it holds) is referenced somewhere in the program, we simply check whether \(z\) appears as a free variable in the scope of a “let” binder.
The idea is illustrated in Example 75 below.
(A Hygge Program and its Intermediate Computations)
Consider the following Hygge program:
1let res = ((1 + 2) + 3) + 4;
2assert(res = 10)
When translated into ANF, the program becomes:
1let y0 = 1;
2let y1 = 2;
3let y2 = y0 + y1; // y0 and y1 are now unused
4let y3 = 3;
5let y4 = y2 + y3; // y2 and y3 are now unused
6let y5 = 4;
7let y6 = y4 + y5; // y4 and y5 are now unused
8let re = y6; // y6 is now unused
9let y7 = 10;
10let y8 = res = y7; // y7 is now unused
11let y9 = assert(y8); // y8 is now unused
12y9
The comments in the program above mark the points where variables become unused (i.e. are not in the free variables of the code that follows), and therefore, their associated register could be reused to hold some other variable. For example:
y3
can reuse the register previously used byy0
;y4
can reuse the register previously used byy1
;y5
can reuse the register previously used byy2
;…
By discarding variables as soon as they become unused, and reusing their registers, the program above can be compiled to just use 3 registers.
Why Discarding Unused Variables is Not Enough#
The approach illustrated in the previous section is very helpful to reduce the number of registers in use; however, it is not enough to address the opening example of this module. To see why, let us examine a simplified scenario where we only have 3 registers available, and consider the Hygge programs in Example 76 below.
(A Hygge Program using (Too) Many Registers (Simplified))
Consider the following Hygge program:
1let res = 1 + (2 + (3 + 4));
2assert(res = 10)
When translated into ANF, the program becomes:
1let y0 = 1;
2let y1 = 2;
3let y2 = 3;
4let y3 = 4;
5let y4 = y2 + y3; // y2 and y3 are now unused
6let y5 = y1 + y4; // y1 and y4 are now unused
7let y6 = y0 + y5; // y0 and y5 are now unused
8let res = y6; // y6 is now unused
9let y7 = 10;
10let y8 = res = y7; // y7 is now unused
11let y9 = assert(y8); // y8 is now unused
12y9
Observe that the ANF program above needs to maintain the 4 variables y0
…y3
available at the same time (whereas the ANF program in
Example 75 can discard y0
and y1
after computing
y2
).
Now, suppose that our target architecture only has 3 available registers
(numbered from 0 to 2). In this situation, if we try to compile this ANF
program using ./hyggec compile ...
, the compiler would crash with the
following error:
Unhandled exception. System.Exception: BUG: invalid generic register number 3
at RISCV.Reg.r(UInt32 n) in .../src/RISCV.fs:line 88
The reason for the crash is that the Code Generation Strategy of
hyggec
tends to use a new register for each sub-expression; therefore, when
compiling the program above, it tries to use:
register
Reg.r(0)
to hold variabley0
(i.e. the result of the sub-expression1
);register
Reg.r(1)
to hold variabley1
(i.e. the result of the sub-expression2
);register
Reg.r(2)
to hold variabley2
(i.e. the result of the sub-expression3
);register
Reg.r(3)
to hold variabley3
(i.e. the result of the sub-expression4
) — and this causes the crash.
ANF-Based Code Generation with Register Allocation#
To compile Hygge expressions that use more intermediate results (and ANF variables) than available registers, we need a more sophisticate approach:
we leverage the fact that expressions are in ANF, and
we reuse registers even when their value is still needed by the Hygge expression being compiled, by generating assembly code to:
spill variables onto the stack, i.e. copy the value of a variable onto the stack in order to reuse its register for another variable; and
load variables from the stack, i.e. assign a register to a previously-spilled variable and restore its value from the stack.
To achieve this, we need a code generation environment with the following information:
a target variable name \(z\) that will contain the result of the expression being compiled.
the set of known variables \(x_1, \ldots, x_n\);
the storage locations of all known variables. Each known variable might have up to two storage locations:
a stack location (as an offset from the frame pointer register
fp
) that never changes during the lifetime of a variable;if possible, a register that may change during the lifetime of the variable;
the needed variables, i.e. a subset of the known variables that should never be discarded (even if they may seem unused).
Now, suppose that we are compiling a Hygge expression \(e\), using the code generation environment outlined above. T he following subsections describe how to proceed:
Generating Code for a “Let” Binder or Variable in ANF#
According to Definition 51, when we start compiling a Hygge program in ANF we can expect to find either a “let” binder, or a variable.
When we generate code for a “let” binder “\(\hygLetU{x}{e_i}{e_s}\)” or “\(\hygLet{x}{t}{e_i}{e_s}\)” or “\(\hygLetMut{x}{e_i}{e_s}\)”:
we discard all known variables that are not “needed” in the code generation environment, nor free (i.e. not used) in \(e_i\) nor \(e_s\) (as described in the previous section). This helps freeing up registers and stack locations (if possible);
we assign both a register and a stack location to the newly-declared variable \(x\):
if all registers are being used, we generate code to spill one of the other known variables \(x_1,\ldots,x_n\) onto its assigned stack location, and reassign its register to \(x\);
if no stack locations are available, we allocate a new stack location by decreasing the stack pointer register
sp
(recall that the RISC-V stack grows downwards);
we compile the initialisation expression \(e_i\) (as described below), by targeting the variable \(x\). Since \(e_i\) is in ANF, according to Definition 51, it can only have a few specific shapes (discussed below);
we discard all known variables that are not “needed” in the code generation environment, nor free (i.e. not used) in the scope expression \(e_s\);
we generate code for the scope expression \(e_s\).
Instead, when we generate code for a variable \(x\), we proceed as follows:
if \(x\) does not have a register assigned to it (because it has been spilled onto the stack before), we generate code to load \(x\) onto a register;
if all registers are being used, we generate code to spill one of the other known variables \(x_1,\ldots,x_n\) onto its assigned stack location, and assign its register to \(x\);
we check where the target variable \(z\) is stored (recall that the target variable \(z\) is expected to contain the result of the expression being compiled):
if \(z\) has a register assigned to it, we copy the value of \(x\) onto that register;
otherwise, \(z\) has been spilled onto the stack, so we just copy the value of \(x\) onto the stack location assigned to \(z\).
Generating Code for the Initialisation Expression of a “Let” Binder in ANF#
According to Definition 51, when we generate code for the initialisation expression of a “let” binder in ANF, we can only find specific shapes of expressions. We discuss two cases: addition and if-then-else.
Generating Code for Additions in ANF#
By Definition 51, an addition in ANF (which may only appear in the initialisation expression of a “let” binder) can only have the form “\(x_1 + x_2\)”, i.e. its operands can only be variables. Therefore, we proceed as follows:
we make sure that the addition operands \(x_1\) and \(x_2\) and the target variable \(z\) are currently stored on a register;
to this end, we may need to generate code to spill up to three other variables onto the stack, and reassign their registers to \(z\), \(x_1\), and \(x_2\);
we generate code for the RISC-V instruction
add
using the registers of \(x_1\), \(x_2\), storing the result in the register of \(z\).
Generating Code for If-Then-Else Expressions in ANF#
By Definition 51, an if-then-else expression in ANF (which may only appear in the initialisation expression of a “let” binder) can only have the form “\(\hygCond{y}{e_t}{e_f}\) i.e. its condition expression can only be a variable; moreover, both \(e_t\) and \(e_f\) must be in ANF.
To compile the condition \(y\), we need to make sure that \(y\) is stored into a register (and to this end, we may need to generate code to spill some other variable and assign its register to \(y\)).
The compilation of \(e_t\) and \(e_f\) proceeds recursively — but there is a catch:
the code generated for \(e_t\) and \(e_f\) may contain code to spill and load variables;
therefore, the register allocation at the end of \(e_t\) may be different from \(e_f\)!
This situation is illustrated in Example 77 below.
(Why We Must Synchronise Register Allocation Across “If-Then-Else” Branches)
Consider the following Hygge program:
1let z = true; // Change to 'false' to experiment
2let x1 = 1;
3let x2 = 2;
4let x3 = if z then x1 else 1 + (2 + (3 + 4));
5assert(x1 = 1);
6assert(x2 = 2);
7assert(if z then x3 = x1 else x3 = 10)
Suppose we only have 4 registers available. The following happens:
variable
z
is assigned to registerReg.r(0)
;variable
x1
is assigned to registerReg.r(1)
;variable
x2
is assigned to registerReg.r(2)
;variable
x3
is assigned to registerReg.r(3)
;when compiling the “if-then-else” expression on line 4:
the “then” branch does not need to use any additional register;
the “else” branch needs 4 registers for the intermediate results of the expression
1 + (2 + (3 + 4))
. Therefore, it will need to spill the variablesz
,x1
,x2
, andx3
(which are needed on lines 5–7) onto the stack;
therefore, what is the register allocation when the program reaches line 5? Are the variables
z
,x1
,x2
, andx3
on the stack, or on some register? That depends on whether the ‘true’ or ‘false’ branch of the ‘if-then-else’ was taken earlier…
The solution to the issue illustrated in
Example 77 above is to add some “branch
synchronisation” assembly code at the end of the ‘then’ and ‘else’ branches,
with the goal of ensuring that, no matter which branch of the “if-then-else” is
taken, all the known variables have a known storage on a register and/or on the
stack. For example, we can add at the end of the assembly code of the branch
\(e_t\) some assembly code that spills/loads the known variables until their
register allocation exactly matches the allocation at the end of the branch
\(e_f\). We reprise this idea when discussing its implementation in hyggec
, in
Example 79 below.
Implementation: ANF Transformation and Register Allocation in hyggec#
Tip
The conversion of Hygge programs into ANF (described in this section) is already
implemented in the hyggec Git repository, in a
file called ANF.fs
. To see what changed since the last module, you can inspect
the differences between the tags union-rec-types
and anf
. For more details,
see also Implementation: ANF Transformation and Register Allocation in hyggec.
The following subsections outline the implementation of ANF and register allocation available on the hyggec Git repository.
The version of hyggec
released with this Module adds two option, available in
some compiler phases:
the new option
-a
or--anf
that can be used to activate and inspect ANF conversion and ANF-based code generation with register allocation; andthe new option
-r
or--registers
can be used to limit the number of registers being used for code generation (useful for experiments and debugging).
Therefore, we can now invoke hyggec
as follows:
./hyggec parse -a file.hyg
— parses the file, converts it into ANF, and prints the result../hyggec interpret -a file.hyg
— parses the file, converts it into ANF, and interprets the result../hyggec compile -a -r N file.hyg
— parses the file, type-checks it, converts the result into ANF, compiles it using the ANF-Based Code Generation with Register Allocation withN
available registers (dafault: 18), and displays the resulting RISC-V assembly code. By reducing the value ofN
, it is possible to observe how the generated code has to spill/load variables more often../hyggec rars -a -r N file.hyg
— parses the file, type-checks it, converts the result into ANF, compiles it using the ANF-Based Code Generation with Register Allocation withN
available registers (dafault: 18), and executes the resulting RISC-V assembly code using RARS. By reducing the value ofN
, the generated code will spill/load variables more often.
The new version of the compiler also has two new series of tests:
tests/interpreter-anf/
contains a series of tests that are parsed, converted into ANF, and then interpreted (these are the same tests oftests/interpreter/
);tests/codegen-anf/
contains a series of tests that are parsed, type-checked, converted into ANF, compiled using the ANF-based register allocation, and then executed in RARS.
(A Hygge Program using (Too) Many Registers, Revisited)
Consider again the following Hygge program (from Example 76):
1let res = 1 + (2 + (3 + 4));
2assert(res = 10)
If we save this file as hygge-many-regs-simpl.hyg
, we can see its ANF
translation by running:
./hyggec parse -a hygge-many-regs-simpl.hyg
and the output is:
Let $anf (1:1-2:16)
┣╾init: IntVal 1 (1:11-1:11)
┗╾scope: Let $anf_0 (1:1-2:16)
┣╾init: IntVal 2 (1:16-1:16)
┗╾scope: Let $anf_1 (1:1-2:16)
┣╾init: IntVal 3 (1:21-1:21)
┗╾scope: Let $anf_2 (1:1-2:16)
┣╾init: IntVal 4 (1:25-1:25)
┗╾scope: Let $anf_3 (1:1-2:16)
┣╾init: Add (1:21-1:25)
┃ ┣╾lhs: Var $anf_1 (1:21-1:21)
┃ ┗╾rhs: Var $anf_2 (1:25-1:25)
┗╾scope: Let $anf_4 (1:1-2:16)
┣╾init: Add (1:16-1:26)
┃ ┣╾lhs: Var $anf_0 (1:16-1:16)
┃ ┗╾rhs: Var $anf_3 (1:21-1:25)
┗╾scope: Let $anf_5 (1:1-2:16)
┣╾init: Add (1:11-1:27)
┃ ┣╾lhs: Var $anf (1:11-1:11)
┃ ┗╾rhs: Var $anf_4 (1:16-1:26)
┗╾scope: Let res (1:1-2:16)
┣╾init: Var $anf_5 (1:11-1:27)
┗╾scope: Let $anf_6 (1:1-2:16)
┣╾init: IntVal 10 (2:14-2:15)
┗╾scope: Let $anf_7 (1:1-2:16)
┣╾init: Eq (2:8-2:15)
┃ ┣╾lhs: Var res (2:8-2:10)
┃ ┗╾rhs: Var $anf_6 (2:14-2:15)
┗╾scope: Let $anf_8 (1:1-2:16)
┣╾init: Assertion (2:1-2:16)
┃ ┗╾arg: Var $anf_7 (2:8-2:15)
┗╾scope: Var $anf_8 (1:1-2:16)
We can compile the program above using the new ANF-based register allocation, limited to only 4 registers, by running:
./hyggec compile -a -r 4 hygge-many-regs-simpl.hyg
and the output is:
1.data:
2
3.text:
4 mv fp, sp # Initialize frame pointer
5 addi sp, sp, -4 # Extend the stack for variable $anf
6 # Variable $anf allocation: register t0, frame pos. 1
7 li t0, 1
8 addi sp, sp, -4 # Extend the stack for variable $anf_0
9 # Variable $anf_0 allocation: register t1, frame pos. 2
10 li t1, 2
11 addi sp, sp, -4 # Extend the stack for variable $anf_1
12 # Variable $anf_1 allocation: register t2, frame pos. 3
13 li t2, 3
14 addi sp, sp, -4 # Extend the stack for variable $anf_2
15 # Variable $anf_2 allocation: register s1, frame pos. 4
16 li s1, 4
17 sw t0, -4(fp) # Spill variable $anf from register t0 to stack
18 addi sp, sp, -4 # Extend the stack for variable $anf_3
19 # Variable $anf_3 allocation: register t0, frame pos. 5
20 add t0, t2, s1 # $anf_3 <- $anf_1 + $anf_2
21 # Variable $anf_4 allocation: register t2, frame pos. 3
22 add t2, t1, t0 # $anf_4 <- $anf_0 + $anf_3
23 # Variable $anf_5 allocation: register t1, frame pos. 2
24 lw t0, -4(fp) # Load variable $anf onto register t0
25 add t1, t0, t2 # $anf_5 <- $anf + $anf_4
26 # Variable res allocation: register t2, frame pos. 1
27 mv t2, t1 # res <- $anf_5
28 # Variable $anf_6 allocation: register t1, frame pos. 2
29 li t1, 10
30 # Variable $anf_7 allocation: register t0, frame pos. 3
31 beq t2, t1, eq_true
32 li t0, 0 # Comparison result is false
33 j eq_end
34eq_true:
35 li t0, 1 # Comparison result is true
36eq_end:
37 # Variable $anf_8 allocation: register t2, frame pos. 1
38 addi t2, t0, -1
39 beqz t2, assert_true # Jump if assertion OK
40 li a7, 93 # RARS syscall: Exit2
41 li a0, 42 # Assertion violation exit code
42 ecall
43assert_true:
44 lw t0, 0(fp) # Load variable $result onto register t0
45 mv t0, t2 # $result <- $anf_8
46 li a7, 10 # RARS syscall: Exit
47 ecall # Successful exit with code 0
By following the comments, we can see when the code generation introduces new variables (by assigning stack space and a register) and spills/loads variables from/to registers. We can observe, in particular, that:
variable
$anf
is spilled onto the stack on line 17, and later reloaded into a register on line 24; andon line 21, variable
anf_4
is assigned registert2
and frame position 3, that (according to line 12) were previously assigned to$anf_1
(which “disappears” without being spilled onto the stack — hence$anf_1
has become unused and has been discarded).
(Experimenting with ANF and the number of available registers)
Experiment with some variations of Example 78:
make the “let…” expression being compiled more complicated, and observe how this influences the ANF transformation, and in the amount of spills/loads in the generated assembly;
try varying the number of available registers for ANF code generation (using the option
-r
ofhyggec
), and observe how this influences the amount of spills/loads in the generated assembly.
ANF Transformation in hyggec#
The ANF transformation is implemented in the file ANF.fs
, and it follows the
procedure described in Transformation of a Hygge Expression into ANF. The main function
is transform
, which then invokes toANFDefs
and toANF
:
/// Transform the given AST node into Administrative Normal Form.
let transform (ast: Node<'E,'T>): Node<'E,'T> =
toANF (toANFDefs ast)
Here are a few remarks about the rest of the file.
For efficiency, the function
toANFDefs
builds the list of ANF definitions in reverse order (i.e. with the latest entries at the head of the list); correspondingly, the functiontoANF
expects to receive a list of ANF definitions in reverse order;The file includes an auxiliary function
substVar
that is very similar tosubst
(inASTUtil.fs
) — except that it specifically substitutes the name of a variable with another name.The unique variable names autogenerated for ANF transformation have a prefix
$anf
(often followed by an integer): this guarantees that the autogenerated names will not clash with existing variables in the input program, sinceParser.fsy
does not accept programs that use$
in variable names.
ANF-Based Code Generation in hyggec#
The code generation implemented in the file ANFRISCVCodegen.fs
follows the
procedure described in ANF-Based Code Generation with Register Allocation.
Important
The ANF-based code generation in ANFRISCVCodegen.fs
is only partial, and is
intended as a demonstration and a basis for
Project Ideas. In particular:
only integer registers are considered, and
only a few Hygge expressions are implemented (addition, if-then-else, assertions, …) — just enough to support the examples shown in this Module.
Here is an overview of the main contents of the file.
The code generation environment has type
ANFCodegenEnv
and it contains the information about known and needed variables, allocated registers, etc.:/// Code generation environment. type internal ANFCodegenEnv = { /// Target variable for the result of the assembly code execution. TargetVar: string /// Frame position assigned to each known variable. Frame: Map<string, int> /// Size of the stack frame. FrameSize: int /// List of integer variables stored in registers, with newest ones coming /// first. If a variable does not appear here, then it means that it is /// allocated on the stack frame. IntVarsInRegs: List<string * Reg> /// List of available integer registers. FreeRegs: List<Reg> /// Set of variables that are needed in the surrounding scope, and should /// never be discarded to reuse their storage. NeededVars: Set<string> }
The main entry point of the ANF-based code generation is the function
codegen
, which in turn invokesdoCodegen
:/// Generate RISC-V assembly for the given AST (expected to be in ANF), using /// the given number of registers. let codegen (node: TypedAST) (registers: uint): RISCV.Asm = /// Name of a special variable used to hold the result of the program let resultVarName = "$result" /// Initial codegen environment, with all registers available /// Note: the definition of 'env' uses list comprehension: /// https://en.wikibooks.org/wiki/F_Sharp_Programming/Lists#Using_List_Comprehensions let env = { TargetVar = resultVarName Frame = Map[(resultVarName, 0)] FrameSize = 1 IntVarsInRegs = [] FreeRegs = [for i in 0u..(registers - 1u) do yield Reg.r(i)] NeededVars = Set[resultVarName] } let result = doCodegen env node // ...
Notably, the funtion
codegen
initialises a code generation environment by targeting a special variable called$result
, which is initially stored in the first available location on the stack frame, without using any register. It also marks the$result
variable as “needed” (so it can never be discarded during code generation, even if it is unused), and declares all registers as free.Unlike the default Code Generation Strategy of
hyggec
, the result ofdoCodegen
(and other functions) inANFRISCVCodegen.fs
is not just anAsm
instance. Instead,doCodegen
has the following signature:/// Code generation function: compile the expression in the given AST node, /// which is expected to be in ANF. let rec internal doCodegen (env: ANFCodegenEnv) (node: TypedAST): ANFCodegenResult = ...
Where the type
ANFCodegenResult
is defined as:/// Code generation result. type internal ANFCodegenResult = { /// Compiled RISC-V assembly code. Asm: Asm /// Updated code generation environment. Env: ANFCodegenEnv }
This is because the function
doCodegen
may emit assembly code to spill and load registers — and this changes the register allocation described in the inputenv
ironment. For this reason,doCodegen
returns an updated code generation environment, which should be used as input for the subsequent code generation steps.The following auxiliary function generates assembly code to spill a variable
varName
onto the stack:/// Spill the variable with the given name onto its stack position assigned in /// 'env'. Return the assembly code that performs the spill and the updated /// codegen environment. let internal spillVar (env: ANFCodegenEnv) (varName: string): ANFCodegenResult = ...
The function
spillVar
above is used by the functionspillOldestIntVar
below, that selects which variable to spill:/// Spill the integer variable that has been stored in an integer register for /// the longest time, saving it in the stack position assigned in 'env'. Choose /// a variable that does not belong to the given 'doNotSpill' list. Return the /// assembly code that performs the spilling, and the updated codegen /// environment. let internal spillOldestIntVar (env: ANFCodegenEnv) (doNotSpill: List<string>): ANFCodegenResult = ...
This hihglights the strategy used by
hyggec
for selecting which variable to spill when all registers are in use: the compiler picks the variable that has been assigned to a register for the longest time. (Other strategies are discussed in the References and Further Readings.)The function
syncANFCodegenEnvs
below generates code to synchronise the register allocation across twoANFCodegenEnv
ironments: such environments may be produces by the two branches of an if-then-else, according to the idea outlined in Generating Code for If-Then-Else Expressions in ANF./// Generate assembly code that spills/loads variables in 'fromEnv' achieving /// the same configuration of 'toEnv'. let internal syncANFCodegenEnvs (fromEnv: ANFCodegenEnv) (toEnv: ANFCodegenEnv): Asm = ...
The function works as follows:
it takes two code generation environments
fromEnv
andtoEnv
(e.g. one obtained after compiling the “then” branch of an if-then-else, and the other obtained after compiling the “else” branch), andit returns the assembly code that spills/loads the variables of
fromEnv
to/from the stack, and (re-)assigns their registers, until they reach the same configuration oftoEnv
.
To see
syncANFCodegenEnvs
in action, you can try Example 79 below.(Synchronising Register Allocation Across “If-Then-Else” Branches)
Consider again the following Hygge program (from Example 77):
1let z = true; // Change to 'false' to experiment 2let x1 = 1; 3let x2 = 2; 4let x3 = if z then x1 else 1 + (2 + (3 + 4)); 5assert(x1 = 1); 6assert(x2 = 2); 7assert(if z then x3 = x1 else x3 = 10)
If we save the example above in a file called
if-then-else-sync-reg-alloc.hyg
, we can see its translation into ANF by running:./hyggec parse -a if-then-else-sync-reg-alloc.hyg
and we can observe that the “else” branch has many more ANF variable definitions (hence, it uses more registers) than the “then” branch.
To generate assembly code from the ANF, by limiting register allocation to 4 registers only, we can run:
./hyggec compile -a -r 4 if-then-else-sync-reg-alloc.hyg
The generated code is quite long, but let’s focus on the code generated for the “then” branch of the “if-then-else” expression:
# ...assembly code for the 'false' branch of the 'if'... j if_end # Jump to skip the 'true' branch of 'if' code if_true: mv s1, t1 # $anf_6 <- x1 # Branch synchronization code begins here sw s1, -16(fp) # Spill variable $anf_6 from register s1 to stack sw t1, -8(fp) # Spill variable x1 from register t1 to stack sw t2, -12(fp) # Spill variable x2 from register t2 to stack sw t0, -4(fp) # Spill variable z from register t0 to stack lw t2, -16(fp) # Load variable $anf_6 onto register t2 # Branch synchronization code ends here if_end: # Rest of the assembly code...
We can see that, after the
if_true
label, there is the code for the “then” branch of the “if-then-else”:first, an
mv
operation copies variablex1
into$anf_6
(which is the target variable holding the result of the whole if-then-else expression); and then,there is a block of “branch synchronisation code” produced by the function
syncANFCodegenEnvs
above. There we can observe, in particular, that the variable$anf_6
is first spilled from registers1
, and then reloaded onto registert2
, so it gets the same register allocated after the code generation for the “else” branch. This way, no matter which branch was taken, the rest of the assembly code can use registert2
to get the value of$anf_6
.
References and Further Readings#
ANF was introduced in these seminal papers, with the main purpose of compiling functional languages:
Amr Sabry and Matthias Felleisen. Reasoning about programs in continuation-passing style. In Proceedings of the 1992 ACM conference on LISP and functional programming (LFP ‘92). https://doi.org/10.1145/141471.141563
Cormac Flanagan, Amr Sabry, Bruce F. Duba, and Matthias Felleisen. The essence of compiling with continuations. In Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation (PLDI ‘93). https://doi.org/10.1145/155090.155113
ANF is used, for instance, in the Racket compiler.
Conceptually, ANF is reminiscent of other intermediate representations used in various compilers — e.g. Static Single Assignment (SSA) form, and Three-Address Code. To know more, see for example:
Keith D. Cooper and Linda Torczon. Engineering a Compiler (Third Edition). Available on DTU Findit.
See, in particular, Chapter 4 - Intermediate Representations
Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, Oct. 1991. https://doi.org/10.1145/115372.115320
Register allocation is an active research topic: there are many different strategies to decide if/when a certain variable in a program should be spilled/loaded from/into a register; each strategy brings trade-offs in terms of complexity, speed of code generation, and efficiency of the generated code. The ANF-based register allocation strategy described in this Module is, broadly speaking, a coarse linear register allocation strategy:
it takes spill/load decisions while traversing the program once, and
it uses the current set of free variables to find ranges of the program where variables are live (i.e. actually used by the program) or become permanently unused (and thus, can be discarded).
To know more about register allocation, here are two surveys:
Fernando Magno Quintão Pereira. A Survey on Register Allocation, 2008. https://homepages.dcc.ufmg.br/~fernando/classes/dcc888/readingMat/RegisterAllocationSurvey.pdf
Roberto Castañeda Lozano and Christian Schulte. Survey on Combinatorial Register Allocation and Instruction Scheduling. ACM Computing Surveys, May 2020. Available on DTU Findit.
Project Ideas#
Here are some ideas for your group project.
Project Idea (Easy-Medium Difficulty): Implement ANF Translation for (Some of) Your Hygge Extensions
Project Idea (Easy-Medium Difficulty): Implement ANF Translation for (Some of) Your Hygge Extensions#
The ANF translation in the file ANF.fs
does not cover the Hygge expressions
you should have added in the previous Project Ideas. Select some of those
expressions, and extend Definition 51, the file ANF.fs
, and the
interpreter test suite accordingly:
you should explain how you extend Definition 51 to support the expressions you selected;
you should add new cases to the functions
substVars
andtoANFDefs
inANF.fs
;you should add new tests under the directory
tests/interpreter-anf/
. Ideally, you should reuse the same interpreter tests that you have already developed for the expressions you selected, by copying them fromtests/interpreter/
.
Note
The difficulty of this Project Idea may range from Easy to Medium depending on which expression(s) you choose to translate to ANF: please speak with the teacher to agree on an estimate.
Project Idea (Medium-Hard Difficulty): Implement ANF-Based Code Generation for More Hygge Expressions#
The ANF-based code generation in the file ANFRISCVCodegen.fs
only supports a
few Hygge expressions. For this Project Idea, you should select more
expressions, and extend ANFRISCVCodegen.fs
to generate code for them,
using the ANF-based register allocation strategy described in this Module.
You can choose the same expressions you converted into ANF for the project idea above — but this is not mandatory;
in
ANFRISCVCodegen.fs
, you should add new cases to the functiondoCodegen
;you should add more corresponding test cases under the directory
tests/codegen-anf/
. All test cases should compile and run correctly by using just 3 registers.
Hint
If the expression you are adding has some form of branching and conditional jump (e.g. while loops, short-circuit logical operators, …), then you may need to synchronise the register allocation of the different branches, similarly to Generating Code for If-Then-Else Expressions in ANF.
Note
The difficulty of this Project Idea may range from Medium to Hard depending on which expression(s) you choose to compile using ANF-based register allocation: please speak with the teacher to agree on an estimate.
If you chose Project Idea (Hard): Recursive Functions and you want to add ANF transformation for “let rec x =…”, you should treat it similarly to “let x = …”.