Module 7: Structured Data Types and the Heap
Contents
Module 7: Structured Data Types and the Heap#
In this module we study how to extend Hygge with structured data types; more
specifically, these lecture notes introduce a data type constructor called
struct
, inspired by the C programming language. A struct
instance is a
sequence of fields, each one having its own name and value.
The struct
data type constructor provides a blueprint for further extensions
of Hygge. Notably, Hygge struct
s are modelled in a way that is very similar to
Java objects: i.e. when a struct
instance is created, the program only gets a
pointer to the memory heap location where the actual data of the
struct
is written. As a consequence, struct
s are passed and returned by
reference when invoking functions, and their data survives outside the scope
where they are created.
Overall Objective#
Our goal is to interpret, compile and run Hygge programs like the one shown in Example 43 below.
(A Hygge Program with Structures)
1// Three type aliases for structure types
2type Shape = struct { name: string;
3 area: float };
4type Circle = struct { name: string;
5 area: float;
6 radius: float };
7type Square = struct { name: string;
8 area: float;
9 side: float };
10
11// Function that takes a structure as argument
12fun displayShape(shape: Shape): unit = {
13 print("Name: "); print(shape.name);
14 print("; area: "); println(shape.area)
15};
16
17// Structure constructors
18let c: Circle = struct { name = "Circle";
19 area = 10.0f * 10.0f * 3.14f;
20 radius = 10.0f };
21let s = struct { name = "Square"; // The type of s matches 'Square'
22 area = 2.0f * 2.0f;
23 side = 2.0f };
24let r: struct {name: string; area: float} = // The type of r matches 'Shape'
25 struct { name = "Rectangle"; // and has less fields than the
26 area = 3.0f * 4.0f; // actual struct instance
27 width = 3.0f;
28 height = 4.0f };
29
30// Function calls with structures as arguments. Note: the structures
31// passed as arguments may have more fields than required by the function.
32displayShape(c);
33displayShape(s);
34displayShape(r);
35
36// Assignment to structure fields
37c.area <- s.area <- r.area <- 0.0f;
38assert(c.area = s.area);
39assert(s.area = r.area);
40assert(r.area = 0.0f)
To introduce structured data types in the Hygge specification, and make it possible to write programs like the one in Example 43, we adopt a design that mixes ideas from Java, F# and TypeScript.
We adopt a so-called structural typing system (inspired by TypeScript), where
struct
types are compared by their fields, and their names are just a convenience for programmers. For instance: on lines 18, 21, and 24 of Example 43, the types of variablesc
,s
, andr
are not called “Shape
”; and yet, on lines 32–34,hyggec
allows those variables to be passed to functiondisplayShape
, which expects an argument of typeShape
. This is allowed because the types ofc
,s
, andr
include all the fields required byShape
(this will be formalised in Definition 32 later on).We dynamically allocate structure instances, by saving them in the memory heap; as a consequence, the creation of a structure just returns a pointer to a heap location, that can be used to access the structure fields. (The heap of typical RISC-V programs has been shown in The RISC-V Memory Layout.) These heap pointers appear in the program semantics (while the program is being executed), but they are not made directly accessible to Hygge programmers.
To access a specific field of a structure instance, the semantics and code generation must use the structure pointer together with an offset which depends on the shape of the structure instance. Therefore, we will need to associate each structure pointer to some information about the shape of the structure it points to. We will keep this information in the runtime environment (for the interpreter) or in the structure types (for the compiled code).
Note
These design decisions imply that we also need some way to deallocate unused
structure instances from the heap to support long-running programs that may
create and discard large quantities of structures. We will mention this topic
in the References and Further Readings, but we will not attempt the
implementation of an actual garbage collector for hyggec
, since this would
require a very substantial effort that is beyond the scope of this course.
Important
The extension described in this Module is already implemented in the
upstream Git repositories of hyggec at the tag
structures
: you should pull and merge the relevant changes into your project
compiler. The Project Ideas of this Module further
extend Hygge with support for other kinds structured data types.
Syntax#
Definition 27 below extends the syntax of Hygge with structures and pointers applications, and with a new pretype that specifies the syntax of a new structure type.
(Syntax of Structures and Pointers)
We define the syntax of Hygge0 with structures and pointers by extending Definition 1 as follows.
Definition 27 introduces the following syntactic elements:
a structure constructor “\(\hygStruct{f_1 {=} e_1; \ldots; f_n {=} e_n}\)” is used to instantiate a structure instance with \(n\) fields called \(f_1,\ldots,f_n\) (with \(i \ge 1\)); each field \(f_i\) is initialised by an expression \(e_i\). Fields names have their own new syntactic category: they are syntactically similar to variables, in the sense that a field name can be any non-reserved identifier. Notice that the structure constructor is an expression, but it is not a value; in other words, structure constructors are not passed around directly (e.g. when calling a function), but they need to be reduced into values first (as we will see shortly);
a field selection “\(\hygFieldSel{e}{f}\,\)” is used to access the field \(f\) of a target expression \(e\) (which, intuitively, is expected to be a structure instance);
a pointer “\(p\)” denotes a memory address. We will see shortly that pointers are produced by structure constructors. Notice that pointers are runtime values: this means that they are part of the formal grammar of Hygge, but they are not supposed to be written directly by Hygge programmers, and should only be created during the execution of a program;
a structure pretype “\(\tStruct{f_1: t_1; \ldots; f_n: t_n}\)” describes the shape of a structure instance with \(n\) fields called \(f_1,\ldots,f_n\) (with \(i \ge 1\)); each field \(f_i\) has pretype \(t_i\).
Operational Semantics#
Definition 28 formalises how substitution works for structure constructors, field selections, and pointers.
(Substitution for Structures Constructors and Pointers)
We extend Definition 2 (substitution) with the following new cases:
According to Definition 28, the substitution of variable \(x\) with expression \(e'\) works as follows:
when performing the substitution on a structure constructor “\(\hygStruct{f_1 {=} e_1; \ldots; f_n {=} e_n}\)”, we simply propagate the substitution through each sub-expression \(e_i\) (for \(i \in 1..n\));
when performing the substitution on a field selection “\(\hygFieldSel{e}{f}\,\)”, we propagate the substitution through the target expression \(e\);
finally, substitutions applied on a pointer \(p\) have no effect.
Definition 29 below formalises the semantics of structure constructors and field selections and assignments. To this purpose, we extend the runtime environment \(R\) with a simple model of a memory heap, as a mapping from memory addresses to values; we also use \(R\) to save information about each known memory pointer \(p\): this is needed to correctly access the contents of the memory block pointed by \(p\).
(Semantics of Structures, Pointers, and the Heap)
We extend the definition of the runtime environment \(R\) in the Structural Operational Semantics of Hygge0 by adding the following fields to the record \(R\):
\(\envField{R}{Heap}\) is a mapping from memory addresses to Hygge values, specifying which memory addresses are currently allocated in the heap, and what is the current value written in the corresponding memory location. Given a heap mapping \(h\), we write \(\hygHeapMax{h}\) to retrieve the highest allocated address in \(h\); if \(h\) is empty, then \(\hygHeapMax{h}\) is \(\hygPtr{0x00000000}\);
\(\envField{R}{PtrInfo}\) is a mapping from memory addresses to lists of structure fields, specifying which structure fields are saved at the given address. We represent lists of structure fields as \([f_0; f_1; \ldots; f_n]\).
Then, we define the semantics of Hygge0 structures and pointers by extending Definition 4 to use the extended runtime environment \(R\) above, and by adding the following rules:
The rules in Definition 29 work as follows.
By rule \(\ruleFmt{R-Struct-F}\), we reduce a structure constructor “\(\hygStruct{f_1 {=} e_1; \ldots; f_n {=} e_n}\)” by first reducing all its field initialisation expressions \(e_1; \ldots; e_n\), from left to right, until all of them become values.
By rule \(\ruleFmt{R-Struct-Res}\), a structure constructor “\(\hygStruct{f_1 {=} v_0; \ldots; f_n {=} v_n}\)” (where all fields are initialised by values) reduces by storing the structure data on the heap, and returning the memory address where the data is located. More in detail, the premises of the rule say that:
\(h\) is the current heap, taken from the runtime environment \(R\) (i.e. \(\envField{R}{Heap}\));
\(p\) is the memory address of the first location after the maximum address currently used in \(h\); (e.g. if \(h\) assigns values to addresses between \(\hygPtr{0x00000001}\) to \(\hygPtr{0x000000a8}\), then \(p\) will be \(\hygPtr{0x000000a9}\))
\(h'\) is an updated heap that is equal to \(h\), except that the memory locations in the range from \(p+0\) to \(p+n\) are mapped to the values \(v_0,\ldots,v_n\) that initialise the structure constructor fields. Notice that the memory addresses from \(p+0\) to \(p+n\) were not used in the original heap \(h\), but are now being used in \(h'\);
\(i'\) is an updated pointer information mapping that is equal to \(\envField{R}{PtrInfo}\), except that the pointer \(p\) is now mapped to the list of fields \([f_0; \ldots; f_n]\) used in the structure constructor;
\(R'\) is an updated runtime environment that is equal to \(R\), except that \(\envField{R'}{Heap}\) is \(h'\) and \(\envField{R'}{PtrInfo}\) is \(i'\), thus reflecting the fact that the structure instance is now allocated on the heap.
When all these premises hold, the reduction produces the updated runtime environment \(R'\), together with the pointer \(p\) where the structure instance has been allocated. As a consequence, \(R'\) is “aware” of \(p\), and it is possible to inspect and use the contents of the heap area pointed by \(p\) via \(\envField{R'}{PtrInfo}\) and \(\envField{R'}{Heap}\).
By rule \(\ruleFmt{R-FieldSel-Eval}\), we reduce a field selection expression “\(\hygFieldSel{e}{f}\,\)” by first reducing \(e\), until it becomes a value;
By rule \(\ruleFmt{R-FieldSel-Res}\), a field selection expression “\(\hygFieldSel{p}{f}\,\)” (where \(p\) is a pointer value) expects that \(p\) is a known pointer in the runtime environment \(R\), where a structure with a field called \(f\) is allocated. More in detail, the premises of the rule say that:
\(\envField{R}{PtrInfo}\) maps the pointer \(p\) to a list of structure fields names \([f_0; \ldots; f_n]\);
one of such fields names \(f_i\) (for some \(i \in 1..n\)) is equal to \(f\) (i.e. the field that is being accessed);
the heap \(\envField{R}{Heap}\) maps the memory address \(p+i\) to the value \(v\).
When all these premises hold, the reduction produces and unchanged runtime environment \(R\) and the value \(v\) retrieved from the heap.
Rule \(\ruleFmt{R-Assign-FieldSel-Target}\) says that if we attempt to perform an assignment onto a structure field selected with “\(\hygFieldSel{e}{f}\,\)”, then we first need to reduce \(e\), until it becomes a value.
Rule \(\ruleFmt{R-Assign-FieldSel-Arg}\) says that to perform an assignment “\(\hygAssign{\hygFieldSel{v}{f}}{e}\)” (where the target of the field selection is a value \(v\)), we first need to reduce the expression \(e\) (on the right-hand-side of the assignment) until it becomes a value.
Rule \(\ruleFmt{R-Assign-FieldSel-Res}\) says that an assignment “\(\hygAssign{\hygFieldSel{p}{f}}{v}\)” (where the field selection target is a pointer, and the right-hand-side of the assignment is a value) reduces by updating the structure data allocated on the heap. More in detail, the rule premises say that:
\(\envField{R}{PtrInfo}\) maps the pointer \(p\) to a list of structure field names \([f_0; \ldots; f_n]\);
one of such field names \(f_i\) (for some \(i \in 1..n\)) is equal to \(f\) (i.e. the field that is receiving the assignment);
\(R'\) is a runtime environment that is equal to \(R\), except that in \(\envField{R'}{Heap}\), the memory address \(p + i\) (where the value of field \(f_i\) is stored) maps to the newly-assigned value \(v\).
When all these premises hold, the reduction produces the updated runtime environment \(R'\) and the newly-assigned value \(v\).
(Reductions of Structure Construction and Assignment)
Consider the following expression, which creates a structure instance and updates one of its fields.
Let us now consider the reductions of this expression, in a runtime environment \(R\) where:
\(\envField{R}{Heap}\) is empty, i.e. there is no allocated memory address on the heap;
\(\envField{R}{PtrInfo}\) is empty, i.e. we have no information about any known pointer.
In the first reduction below, the expression that initialises the “let…” expression is reduced, hence we perform the construction of a “struct” instance, via rule \(\ruleFmt{R-Struct-Res}\) in Definition 29. As a consequence, the reduction produces an updated runtime environment \(R'\) that is equal to \(R\), except that:
\(\envField{R'}{Heap}\) is updated to make room for all the fields of the structure. In this case, we have two fields called \(\hygField{name}\) and \(\hygField{area}\), carrying respectively a string and an integer, and each one of them is stored in a consecutive heap address. Consequently:
the heap address \(\hygPtr{0x0001}\) will point to the string \(\hygStr{Circle}\), and
the heap address \(\hygPtr{0x0002}\) will point to the integer \(42\);
\(\envField{R'}{PtrInfo}\) records the fact that the memory address \(\hygPtr{0x0001}\) is the beginning of a structure with two fields, called \(\hygField{name}\) and \(\hygField{area}\).
The reduction yields the updated runtime environment \(R'\) and the memory address \(\hygPtr{0x0001}\) where the structure is saved.
In the second reduction below, a standard application of rule \(\ruleFmt{R-Let-Subst}\) (from Definition 4) replaces each occurrence of the variable \(s\) with its initialisation value, i.e. the memory address \(\hygPtr{0x0001}\) obtained in the previous reduction.
In the third reduction, the assignment “\(\hygAssign{\hygFieldSel{\hygPtr{0x0001}}{area}\;}{\;\hygFieldSel{\hygPtr{0x0001}}{area} + 2}\)” reduces the addition on its right-hand-side. This reduction, in turn, retrieves the value of the field access “\(\hygFieldSel{\hygPtr{0x0001}}{area}\)” from the runtime environment, via rule \(\ruleFmt{R-FieldSel-Res}\) in Definition 29. The premises of that rule inspect the runtime environment \(R'\) and find out that:
according to \(\envField{R'}{PtrInfo}\), the address \(\hygPtr{0x0001}\) (on which a field selection is being attempted) points to a structure with a list of two fields, called \(\hygField{name}\) and \(\hygField{area}\). We index the field names according to their position on the list, hence the field \(\hygField{area}\) has index 1;
therefore, to access “\(\hygFieldSel{\hygPtr{0x0001}}{area}\)” we need to read the heap at the address \(\hygPtr{0x0002}\) (i.e. the address \(\hygPtr{0x0001}\) plus the offset \(1\) for the field \(\hygField{area}\));
the location \(\envField{R'}{Heap}(\hygPtr{0x0002})\) contains the value \(42\), which becomes the result of the reduction of “\(\hygFieldSel{\hygPtr{0x0001}}{area}\)”.
The fourth reduction computes the addition \(42 + 2\).
The fifth reduction performs the assignment of value \(44\) to the field selection “\(\hygFieldSel{\hygPtr{0x0001}}{area}\)”, via rule \(\ruleFmt{R-Assign-FieldSel-Res}\) in Definition 29. The rule premises inspect the runtime environment \(R'\) similarly to rule \(\ruleFmt{R-FieldSel-Res}\) in the third reduction step above:
the premises determine that “\(\hygFieldSel{\hygPtr{0x0001}}{area}\)” is located at the heap address \(\hygPtr{0x0002}\), and
they produce an updated runtime environment \(R''\) that is equal to \(R'\), except that in \(\envField{R''}{Heap}\) the address \(\hygPtr{0x0002}\) points to the newly-assigned value \(44\).
Therefore, the expression given at the beginning of this example terminates its reductions by reaching the value \(44\). Notice, however, that the runtime environment \(R''\) is different from the initial runtime environment \(R\), because we have:
Write the reductions of the following expression:
Typing Rules#
In order to type-check programs that use structures, we need to introduce a new structure type (Definition 30), a new rule for pretype resolution (Definition 31), a new subtyping rule (Definition 32), and some new typing rules (Definition 33).
(Structure Type)
We extend the Hygge0 typing system with a structure type by adding the following case to Definition 5:
By Definition 30, a structure type describes a structure instance where each field \(f_i\) (for \(i \in i..n\), with \(n \ge 1\)) has a type \(T_i\). Note that the field names must be distinct from each other.
(Structure Types)
The following type describes a structure with a field \(f\) of type \(\tInt\), and a field \(g\) of type \(\tBool\).
The following type describes a structure with a field \(f\) of type \(\tInt\), and a field \(g\) having a structure type, which in turn has a field \(a\) of type \(\tFloat\) and a field \(b\) of type \(\tBool\).
We also need a way to resolve a syntactic structure pretype (from Definition 27) into a valid structure type (from Definition 30): this is formalised in Definition 31 below.
(Resolution of Structure Types)
We extend Definition 7 (type resolution judgement) with this new rule:
According to Definition 31, a function pretype is resolved by ensuring that field names are distinct from each other, and then by recursively resolving the type of each field.
We also extend Definition 10 (subtyping) with a new rule for structure types, according to Definition 32 below. This extension is not strictly necessary, but it adds great flexibility to the Hygge programming language, as we will see shortly. Notice that, without Definition 32, the subtyping for structure types would only be allowed by rule \(\ruleFmt{TSub-Refl}\) in Definition 10, which only relates types that are exactly equal to each other.
(Subtyping for Structure Types)
We define the subtyping of Hygge0 with structure types by extending Definition 10 with the following new rule:
According to rule \(\ruleFmt{TSub-Struct}\) in Definition 32, a structure type “\(\tStruct{f_1: T_1; \ldots; f_m: T_m}\)” is subtype of another structure type “\(\tStruct{f_1: T'_1; \ldots; f_n: T'_n}\)” when:
the subtype contains at least all the structure fields that appear in the supertype, and such fields appear first and in the same order (this is enforced by the shape of the structure types and by the rule premise “\(m \ge n\)”); and
if a field \(f_i\) appears in the structure supertype with type \(T'_i\), then \(T'_i\) is a supertype of the corresponding field type \(T_i\) in the structure subtype.
The rationale for rule \(\ruleFmt{TSub-Struct}\) is based on Definition 9 (Liskov Substitution Principle): an instance of a subtype should be safely usable whenever an instance of the supertype is required. See Example 46 below.
Consider the following structure type:
Consider any well-typed Hygge program that uses an instance of such a structure type: intuitively, that program will only access the field \(f\) (using it as an \(\tInt\)) and the field \(g\) (using it as a \(\tBool\)).
Therefore, that Hygge program will also work correctly if it operates on a structure instance of the following type:
The reason is that the program will still be able to access the structure fields \(f\) and \(g\), since they appear in the expected order and have the expected types; the program will simply ignore the additional field \(h\). For this reason, Definition 32 considers the second structure type as a subtype of the first.
Note
Example 46 above shows that, in some sense, structure subtyping in the Hygge programming language is similar to subclassing in object-oriented programming languages: a structure subtype “derives” a structure supertype because the subtype “implements” all the fields of the supertype, and might contain more fields; therefore, a structure subtype is akin to a “subclass” of its supertype.
This idea is also visible in Example 43, where the function
displayShape
is called with arguments that have all the fields expected by the
structure type Shape
, plus maybe other fields (that displayShape
does not
use). This is allowed because all the arguments passed to displayShape
have a
subtype of Shape
, according to Definition 32.
We now have all the ingredients to define the typing rules for structure constructors and field accesses and assignments: they are formalised in Definition 33 below.
(Typing Rules for Structures and Field Access and Assignment)
We define the typing rules of Hygge0 with structures and field assignments by extending Definition 11 with the following rules (which use the structure type introduced in Definition 30 above):
The typing rules in Definition 33 work as follows.
By rule \(\ruleFmt{T-Struct}\), a structure constructor “\(\hygStruct{f_1 {=} e_1; \ldots; f_n {=} e_n}\)” has a structure type “\(\tStruct{f_1: T_1; \ldots; f_n: T_n}\)”, where (according to the rule premises) each field \(f_i\) has the type of the corresponding initialisation expression \(e_i\). The first premise of the rule also requires that all field names \(f_i\) are distinct from each other.
By rule \(\ruleFmt{T-FieldSel}\), a field selection “\(\hygFieldSel{e}{f}\,\)” is well-typed if (according to the rule premises) \(e\) has a structure type, and there is a field \(f_i\) of that structure type which matches \(f\) (i.e. the field being accessed). If these premises hold, then the whole field access has type \(T_i\) (i.e. the type of the structure field \(f_i\)).
By rule \(\ruleFmt{T-Assign-FieldSel}\), an assignment to a field selection “\(\hygAssign{\hygFieldSel{e_1}{f}}{e_2}\)” is type-checked by ensuring that the field selection “\(\hygFieldSel{e_1}{f}\,\)” has a type \(T\) (first premise of the rule), and that such a type \(T\) is also the type of the assigned expression \(e_2\) (second premise of the rule). If these premises hold, then the whole assignment has type \(T\).
Consider the following expression, which initialises variable \(s\) with a structure instance, and updates the value of the field \(\hygFieldSel{s}{area}\):
We have the following typing derivation — where the typing environment \(\Gamma'\) is equal to \(\Gamma\), except that we have \(\envField{\Gamma'}{Vars}(s) = \tStruct{\hygField{area}: \tInt}\).
(Download this derivation as a PNG image
.)
Notice that the “let…” binder declares a variable \(s\) with type “\(\tStruct{\hygField{area}: \tInt}\)”, but initialises \(s\) with a structure constructor that has an additional field “\(\hygField{name}\)”: to satisfy the premises of rule \(\ruleFmt{T-Let2}\), we need to show that such a structure constructor also has type “\(\tStruct{\hygField{area}: \tInt}\)” (i.e. without the additional field “\(\hygField{name}\)”). To this end, the derivation uses the subsumption rule \(\ruleFmt{T-Sub}\) (from Definition 11) and the new subtyping rule \(\ruleFmt{TSub-Struct}\) (from Definition 32); without rule \(\ruleFmt{TSub-Struct}\), this program would not type-check.
Implementation#
We now have a look at how hyggec
is extended to implement structures and field
accesses and assignments, according to the specification illustrated in the
previous sections.
Tip
To see a summary of the changes described below, you can inspect the differences
in the hyggec Git repositories between the tags
functions
and structures
.
Lexer, Parser, Interpreter, and Type Checking#
These parts of the hyggec
compiler are extended along the lines of
Example: Extending Hygge0 and hyggec with a Subtraction Operator — except that, in order to implement
Definition 32, we also need to extend the function
isSubtypeOf
.
We extend
AST.fs
in two ways, according to Definition 27:we extend the data type
Expr<'E,'T>
with three new cases:StructCons
for the structure constructor “\(\hygStruct{f_1 = t_1; \ldots; f_n = e_n}\)”;FieldSelect
for “\(\hygFieldSel{e}{f}\,\)”; andPointer
for “\(p\)”;
we also extend the data type
Pretype
with a new case calledTStruct
, corresponding to the new structure pretype “\(\tStruct{f_1: t_1; \ldots; f_n:t_n}\)”:and Pretype = // ... /// A structure pretype, with pretypes for each field. | TStruct of fields: List<string * PretypeNode>
We extend
PrettyPrinter.fs
to support the new expressions and pretype.We extend
Lexer.fsl
to support two new tokens:STRUCT
for the new keyword “struct
”, andDOT
for the symbol “.
” used to access structure fields in the expression “\(\hygFieldSel{e}{f}\,\)”.
We extend
Parser.fsy
to recognise the desired sequences of tokens according to Definition 27, and generate AST nodes for the new expressions. We proceed by adding:a new rule under the
pretype
category to generateTStruct
pretype instances;two new rules under the
primary
category to generateStructCons
andFieldSelect
instances;various auxiliary syntactic categories and rules to recognise the syntactic elements needed by the rules above. In particular:
field
matches a structure field;structFieldInitSeq
is a non-empty sequence of field assignments, separated by semicolons (needed to parse structure constructors);structFieldTypeSeq
is a sequence of fields with type annotations, separated by semicolons (needed to parse structure pretypes).
Note
We do not extend
Parser.fsy
with new rules for parsingPointer
instances. The reason is that, as we mentioned when introducing the syntax of structures, we want to treat pointers as runtime values that are only produced by the semantics (while a program reduces) and cannot be written by Hygge programmers.We extend the function
subst
inASTUtil.fs
to support the new expressionsStructCons
,FieldSelect
andPointer
, according to Definition 28.We extend
Interpreter.fs
according to Definition 29:we extend the definition of the record
RuntimeEnv
to include the new fieldsHeap
andPtrInfo
;in the function
isValue
, we add a new case forPointer
; andin the function
reduce
:we add a new case for
Pointer
(which does not reduce);we add a new case for
StructCons
, with an auxiliary function calledheapAlloc
that helps allocating new values on top of the heap, as required by rule \(\ruleFmt{R-Struct-Res}\);we add a new case for
Fieldselect
; andwe add a new case to reduce an
Assign
expression when the left-hand-side of the assignment is a field selection, according rules \(\ruleFmt{R-Assign-FieldSel-Target}\) and \(\ruleFmt{R-Assign-FieldSel-Arg}\).
We extend
Type.fs
by adding a new case to the data typeType
, according to Definition 30: the new case is calledTStruct
. We also add a a corresponding new case to the functionfreeTypeVars
in the same file.Note
Correspondingly, we also extend the pretty-printing function
formatType
inPrettyPrinter.fs
, to display the stucture type we have just introduced.We extend
Typechecker.fs
:we extend the type resolution function
resolvePretype
with a new case for structure types, according to Definition 31;we extend the function
isSubtypeOf
with a new case for structure types, according to Definition 32;we extend the function
typer
according to Definition 33, to support the new cases for the expressionsStructCons
andFieldSelect
, and type-check assignments to structure fields.
As usual, we add new tests for all compiler phases.
Code Generation#
We extend RISCVCodegen.fs
in three ways:
Code Generation for Structure Constructors#
When compiling a structure constructor, we need to add a new case to the
function doCodegen
in RISCVCodegen.fs
matching the expression StructCons
.
We need to generate RISC-V assembly code that:
allocates enough space on the heap to contain all the structure fields, and
initialises each structure fields with the value produced by the corresponding field initialisation expression.
To allocate space on the heap, we use the RARS system call Sbrk
: it extends
the heap by allocating the amount of bytes specified in the register a0
, and
returns the address of the newly-allocated memory block in the register a0
itself. In order to compute the required amount of bytes, we need to know the
the size of the structure instance — which is quite straightforward: each
Hygge value of any type, once compiled, fits in a single 32-bit word (4 bytes),
and therefore, we obtain the structure size by multiplying the number of
structure fields by 4.
The resulting code looks as follows
let rec internal doCodegen (env: CodegenEnv) (node: TypedAST): Asm =
// ...
| Struct(fields) ->
/// Assembly code for initialising each field of the struct
let fieldsInitCode = // ...
/// Assembly code that allocates space on the heap for the new
/// structure, through an 'Sbrk' system call. The size of the structure
/// is computed by multiplying the number of fields by the word size (4)
let structAllocCode =
(beforeSysCall [Reg.a0] [])
.AddText([
(RV.LI(Reg.a0, fields.Length * 4),
"Amount of memory to allocate for a struct (in bytes)")
(RV.LI(Reg.a7, 9), "RARS syscall: Sbrk")
(RV.ECALL, "")
(RV.MV(Reg.r(env.Target), Reg.a0),
"Move syscall result (struct mem address) to target")
])
++ (afterSysCall [Reg.a0] [])
// Put everything together: allocate heap space, init all struct fields
structAllocCode ++ fieldsInitCode
Code Generation for Field Selection#
The assembly code for field selections needs to establish what is the memory location of the selected structure field. To this end:
we compile the target expression of the field selection, which is expected to leave a memory address in the target register; and then
use the target expression type (which should be
TStruct
) to find out the offset of the selected field from that memory address.
We then use the RISC-V instruction lw
(load word) to read a value from the
memory location of the structure field, and write it in the target register.
let rec internal doCodegen (env: CodegenEnv) (node: TypedAST): Asm =
// ...
| FieldSelect(target, field) ->
/// Generated code for the target object whose field is being selected
let selTargetCode = doCodegen env target
/// Assembly code to access the struct field in memory
let fieldAccessCode =
match (expandType node.Env target.Type) with
| TStruct(fields) ->
let (fieldNames, fieldTypes) = List.unzip fields
let offset = List.findIndex (fun f -> f = field) fieldNames
match fieldTypes.[offset] with
// ...
| _ ->
Asm(RV.LW(Reg.r(env.Target), Imm12(offset * 4),
Reg.r(env.Target)),
$"Retrieve value of struct field '%s{field}'")
// Put everything together: compile the target, access the field
selTargetCode ++ fieldAccessCode
Code Generation for Assignments to Structure Fields#
To support assignments to structure fields, we extend the code generation for
the Assign
expression. The new case generates RISC-V assembly code to:
compute the target expression of the field selection (which should produce the memory address of a structure instance on the heap);
compute the assigned value, and
compute the memory address of the structure field that is receiving the assignment (as in the Code Generation for Field Selection).
Then, the generated code overwrites that memory address with the value being assigned.
let rec internal doCodegen (env: CodegenEnv) (node: TypedAST): Asm =
// ...
| Assign(lhs, rhs) ->
match lhs.Expr with
// ...
| FieldSelect(target, field) ->
let selTargetCode = // ...Code for target expression of selection
let rhsCode = // ...Code for the right-hand-side of the assignment
match target.Type with
| TStruct(fields) ->
let (fieldNames, _) = List.unzip fields // Struct field names
/// Offset of the selected struct field beginning of struct
let offset = List.findIndex (fun f -> f = field) fieldNames
let assignCode = // Assembly code to perform field assignment
match rhs.Type with
// ...
| _ ->
Asm([(RV.SW(Reg.r(env.Target + 1u), Imm12(offset * 4),
Reg.r(env.Target)),
$"Assigning value to struct field '%s{field}'")
(RV.MV(Reg.r(env.Target), Reg.r(env.Target + 1u)),
"Copying assigned value to target register")])
// Put everything together
rhsCode ++ selTargetCode ++ assignCode
References and Further Readings#
The treatment of structure types in Hygge is inspired by languages with structural typing systems, such as F#, OCaml, Scala, TypeScript. To know more about the programming language concepts behind this, you can refer to:
Benjamin Pierce. Types and Programming Languages. MIT Press, 2002. Available on DTU Findit.
Chapter 19 (Case Study: Featherweight Java) — in particular, section 19.3 (Nominal and Structural Type Systems)
Although this version of Hygge allocates structures on the heap, it does not provide any facility to deallocate them. Therefore, the heap usage of a program will tend to grow until the program terminates — and it is possible to write programs that allocate too many structures and are killed by the operating system (or by RARS) with an out-of-memory error. This issue can be addressed in two ways.
We could add a C-style expression like
free(e)
which removes from the heap the contents ofe
(which is supposed to reduce into a structure pointer). This is not entirely straightforward to implement — but most importantly, manual memory deallocation would introduce the typical problems of C programs, such as missing deallocations causing memory leaks, or duplicate deallocations that crash a program.We could tweak the semantics of Hygge with the automatic removal of structures that are not pointed from anywhere within the program, nor the heap. Correspondingly, we could extend the
hyggec
code generation to include the assembly code of a garbage collector that is informed every time a new structure is allocated, and deallocates that structure when it becomes unreferenced. Writing (or finding) a garbage collector that can be compiled to run under RARS is not a trivial task. The Boehm-Demers-Weiser garbage collector is a popular option that can be integrated into C/C++ programs, and could also be used in the runtime system of a compiled programming language.
Project Ideas#
Here are some ideas for your group project.
Project Idea (Medium Difficulty): Mutable vs. Immutable Structure Fields
Project Idea (Hard): Extend Hygge with Copying of Structures
Important
If you chose Project Idea (Medium Difficulty): Function Subtyping, please remember to add some test cases to show how function subtyping interplays with structure subtyping.
Project Idea (Easy): Extend Hygge with Reference Cells#
The goal of this project idea is to extend Hygge and hyggec
with a construct
similar to reference cells in the F# programming
language. Hygge
programmers programmer should be able to write, interpret, type-check, compile
and run a program like the following:
let x: ref {int} = ref {40 + 2};
x.value <- x.value + 10;
assert(x.value = 52)
where ref {int}
is the type of a reference to a value of type int
(which is
the type of the initialisation expression 40 + 2
), while the reference
constructor ref {40 + 2}
represents a reference to a location on the heap,
where the result of the expression 40 + 2
is saved.
You should describe how you modify the Hygge language specification and the
hyggec
implementation to achieve this extension. As usual, you should also
provide tests that leverage this extension.
Hint
You can approach this project idea by using F# reference cells as a blueprint:
the type
ref {int}
is just syntactic sugar for a structure typestruct {value: int}
, andthe reference constructor
ref {42}
is just syntactic sugar for a structure constructorstruct {value = 42}
.
Project Idea (Easy): Extend Hygge with Tuples#
The goal of this project idea is to extend Hygge and hyggec
with tuples, i.e.
sequences of elements of a fixed length determined statically (not at
run-time), and with each element having its own predetermined type. Tuples can
have any length (minimum one element). Hygge programmers should be
able to write, interpret, type-check, compile and run a program like the
following:
let t: tuple {string, int, bool} = tuple {"Hello", 40 + 2, true};
t._1 <- "Hej";
t._2 <- t._2 + 10;
assert(t._2 = 52)
where:
tuple {string, int, bool}
is the type of a tuple containing 3 values, such that the first has typestring
, the second has typeint
, and the third has typebool
;tuple {"Hello", 40 + 2, true}
is the constructor of a tuple containing 3 values, initialised as"Hello"
,40 + 2
, andtrue
;the selectors
_1
,_2
, etc. are used to access and assign a value to the corresponding element of the tuple. The tuple length determines which selectors are available: for instance, a tuple of length 12 has selectors ranging from_1
to_12
. (This is inspired by tuples in the Scala 2 programming language.)
You should describe how you modify the Hygge language specification and the
hyggec
implementation to achieve this extension. As usual, you should also
provide tests that leverage this extension.
Hint
The hint given for the project idea on reference cells might be helpful…
Project Idea (Medium Difficulty): Mutable vs. Immutable Structure Fields#
The goal of this project idea is to allow Hygge and hyggec
to distinguish
between mutable and immutable structure fields. You can choose between two
possible alternative designs:
all structure fields are mutable by default (as specified in this module) — but a programmer can optionally specify that some fields are immutable, e.g. by writing a structure type like:
struct { f: int; immutable g: string}
. Or,all structure fields are immutable by default, but a programmer can optionally specify that some fields are mutable, e.g. by writing a structure type like:
struct { f: int; mutable g: string}
. This approach is similar to F#, but it will require the revision of several existinghyggec
tests.
You should describe how you modify the Hygge language specification and the
hyggec
implementation to achieve this extension. As usual, you should also
provide tests that leverage this extension.
Hint
You don’t need to substantially change the Hygge semantics with respect to Definition 28: since mutable/immutable structure fields are only specified in structure types, you can keep treating all fields as mutable in the interpreter.
You do need to revise the types and type checking:
you need to remember which fields of a structure type are mutable, and which fields are immutable;
if a program attempts to perform an assignment “\(\hygAssign{\hygFieldSel{e}{f}}{e'}\)” where \(f\) is an immutable field of the structure type of \(e\), then the assignment should not type-check, and a corresponding error should be reported.
Structure subtyping (Definition 32) should be extended to distinguish mutable and immutable fields. For maximum flexibility, you can extend subtyping as follows:
if a structure field \(f\) is immutable in the supertype, then \(f\) can be either mutable or immutable in the subtype;
if a structure field \(f\) is mutable in the supertype, then \(f\) must also be mutable in the subtype.
Project Idea (Medium Difficulty): Extend Hygge with Arrays#
The goal of this project idea is to extend Hygge and hyggec
with arrays, i.e.
sequences of values of a same type, with a fixed length determined at run-time
(unlike tuples). Arrays can have any length (minimum one element). Hygge
programmers should be able to write, interpret, type-check, compile and run a
program like the following (but feel free to change the syntax as you like):
let n = readInt();
let arr: array {int} = array(n, 40 + 2);
let mutable i: int = 0;
while (i < arrayLength(arr)) do {
arrayElem(arr, i) <- arrayElem(arr, i) + i;
i <- i + 1
};
i <- 0;
while (i < arrayLength(arr)) do {
assert(arrayElem(arr, i) = 42 + i)
i <- i + 1
}
where:
array {int}
is the type of an array containing elements of typeint
;array(n, 40 + 2)
is the constructor of an array containingn
values, each one initialised with the result of the expression40 + 2
;arrayLength(arr)
is the size (number of elements) of the arrayarr
;arrayElem(arr, i)
is used to access and assign a value to elementi
of arrayarr
(with elements numbered from 0). The program should get stuck if the indexi
is not smaller than the size of the array.
You should describe how you modify the Hygge language specification and the
hyggec
implementation to achieve this extension. As usual, you should also
provide tests that leverage this extension.
Hint
Intuitively, an array instance could imagined as a structure with two fields:
a field
length
with the size of the array; anda field
data
with a pointer to the memory location where the array elements are stored.
However, arrays are not just syntactic sugar for structures! The main
difference is that the array size is dynamic, so the data
field in the
intuition above needs a dedicated treatment. Still, you can certainly reuse
and adapt part of the specification and code for structures to design and
implement arrays…
Project Idea (Hard): Arrays with Slicing and For-Each#
This project idea is an harder alternative to Project Idea (Medium Difficulty): Extend Hygge with Arrays. In addition to implementing everything required in that project idea, you should also implement:
Array slicing, i.e. defining an array that shares its elements with (a subset of) another array. For instance, a program using array slicing may look like:
let arr1 = array(10, 3); // Arr1 contains 10 elements, all equal to 3 let arr2 = slice(arr1, 2, 6); // Array slice sharing elements 2..6 of arr1 arrayElem(arr2, 0) <- 42; // This also overwrites arrayElem(arr1, 2) assert(arrayElem(arr1, 2) = 42) assert(arrayElem(arr2, 0) = 42)
Here,
arr2
is constructed as an array slice out ofarr1
, and shares the elements on the heap from position 2 to 6 ofarr1
(and those elements have positions from 0 to 4 inarr2
). As a consequence, assigning the value42
into position0
ofarr2
overwrites the element at position2
ofarr1
(and thus, the assertions above should not fail).For-each loop, i.e. a dedicated
for
loop construct to iterate over all elements of an array — e.g. as follows:let arr = array(10, 3); for (x in arr) { println(x) }
Hint
If you implemented Project Idea (Medium Difficulty): a “For” Loop with a Scoped Variable, you may be able to implement the for-each loop quite easily…
For both features above, you should describe how you modify the Hygge language
specification and the hyggec
implementation to achieve them. Remember to
specify what happens if e.g. slice(arr, ...)
tries to refer to elements
outside the boundaries of arr
. As usual, you should also provide tests that
leverage these extension.
Project Idea (Hard): Extend Hygge with Copying of Structures#
In the current specification of Hygge, and in hyggec
, structures are always
handled by reference. For example, consider the following program:
let s1 = struct {f = 1 + 2};
let s2 = s1;
s1.f <- 42;
assert(s1.f = 42);
assert(s2.f = 42)
The “let…” binder that introduces s2
initialises it with the same heap
address of the structure s1
; as a consequence, any change to the fields of
s1
is reflected in s2
— and vice versa.
The goal of this project idea is to extend Hygge and hyggec
with an expression
copy(e)
, which takes an expression e
(expected to be a structure) and
returns a copy of e
— i.e. a new structure that has the same type, fields,
and values of e
, but does not share any data with e.
As a consequence, Hygge programmers should be able to write, interpret, type-check, compile and run a program like the following:
let s1 = struct {f = 1 + 2};
let s2 = copy(s1);
s1.f <- 42;
assert(s1.f = 42);
assert(s2.f = 3)
You should describe how you modify the Hygge language specification and the
hyggec
implementation to achieve this extension. As usual, you should also
provide tests that leverage this extension.
Note
The formal specification of the semantics of the copy(e)
expression can be
quite complicated, so you can omit it.
You can approach this project idea in two steps.
First, you can extend Hygge and
hyggec
with shallow copying, i.e. copy the contents of structure fields, without recursively copying the structures pointed by those fields (if any). For example, shallow copying would allow to correctly execute both the example above, and the following program:let s1 = struct {f = struct {g = 0.0f}}; let s2 = copy(s1); s1.f.g <- 1.0f; assert(s1.f.g = 1.0f); assert(s2.f.g = 1.0f)
This is because shallow copying only duplicates the “first level” of the structure, and therefore, it copies the structure pointer of
s1.f
intos2.f
. Consequently,s1.f
ands2.f
point to the same structure, and any change tos1.f.g
is visible ins2.f.g
— and vice versa.Then, you can implement deep copying, i.e., copy the contents of structure fields, and recursively copy any structure pointed by those fields. For example, deep copying would allow to correctly execute the following program:
let s1 = struct {f = struct {g = 0.0f}}; let s2 = copy(s1); s1.f.g <- 1.0f; assert(s1.f.g = 1.0f); assert(s2.f.g = 0.0f)
This is because deep copying duplicates the structure pointed by
s1.f
, and initialisess2.f
with a pointer to the structure copy. Consequently,s1.f
ands2.f
point to different structures that do not share any data, and any change tos1.f.g
is not reflected ins2.f.g
— nor vice versa.
Hint
To understand how many fields of a structure should be copied by copy(e)
(in a
shallow or deep way), you can inspect either the runtime environment (when
interpreting the program) or the structure type of e
(during code generation).