Semantics

Let's give meaning to it all.

Christian Gram Kalhauge

The goal of this chapter is to ask and answer the following questions:

1. What are Semantics?

Today we are going to talk about semantics. Program semantics is about assigning meaning to programs. When we can talk about what program mean, it is easier to explain what they do.

We are going to discuss some different approaches to write down the semantics of a program. They all essentially turn programs syntax into mathematical logic.

1.1. Preface: Natural Deduction

If you are unfamiliar with Natural Deduction and Gentzen-style proofs, please refer to the wikipediea page on the topic. The sort story is that we refer to logical rules like this:

๐‘๐‘Ÿ๐‘’๐‘š๐‘–๐‘ 1โ€ฆ๐‘๐‘Ÿ๐‘’๐‘š๐‘–๐‘ n๐‘๐‘œ๐‘›๐‘๐‘™๐‘ข๐‘ ๐‘–๐‘œ๐‘›(name)

Which means that ๐‘๐‘Ÿ๐‘’๐‘š๐‘–๐‘ 1โˆงโ€ฆโˆง๐‘๐‘Ÿ๐‘’๐‘š๐‘–๐‘ n implies ๐‘๐‘œ๐‘›๐‘๐‘™๐‘ข๐‘ ๐‘–๐‘œ๐‘›.

If we want multiple ways of reaching the conclusion, we can make more rules. For example, conjunction AโˆงB only requires one rule, both A and B has to be true but disjunction AโˆจB has two rules: either A has to be true or B has to be true.

ABAโˆงB(โˆง)AAโˆจB(โˆจL)BAโˆจB(โˆจR)

1.2. Axiomatic Semantics, Flowcharts, and Hoare Triplets

The first approach to assigning meaning to programs was Axiomatic Semantics. Here the meaning of the program is described by assigning precondition and postconditions to all statements in a program.

Figure: The flowchart from the original paper on semantics Assigning Meaning to Programs by Robert W. Floyd. The program computes the sum of an array.

In this world we can describe the semantics of a program by denoting three parts, the precondition P, the program C, and the postcondition Q, this is also called Hoare triplets.

{P}C{Q}

These mean that if the world satisfies P before executing C then the world will satisfy Q after.

The great thing about this approach is that we can compose the proofs of correctness of parts of the program into a proof of total correctness. Assume the program C1;C2, where C2 is executed after C1. Then we can describe the correctness of C1;C2, like so:

{P1}C1{Q1}{P2}C2{Q2}Q1โ‡’P2{P}C1;C2{Q}

If you are unfamiliar with the syntax above, it is natural deduction see the section on natural deduction. Essentially it means that given {P1}C1{Q1}, and {P2}C2{Q2}, and that the postcondition of C1 implies the precondition of C2 (Q1โ‡’P2), are all true, we can also prove that {P}C1;C2{Q}.

This approach is every effective at describing the meaning of specific programs. This means that it is very good for doing program verification, but is not used often for program analysis in general.

1.3. Denotational Semantics

Another approach for writing down the semantics of the program is Denotational Semantics. The idea behind denotational semantics is to map the semantics of the program we want to analyse to a program to something we know well. This can either be another programming language or math.

Consider a very simple expressional language called ๐”ผ๐•, which has addition, variables x and natural numbers n:

eโˆˆ๐”ผ๐•:=e1+e2|x|n

If want to give meaning to the ๐”ผ๐• program x + 5, then we can define a map from expressions to a function from a store to integers: โ„ฐ:๐”ผ๐•โ†’(๐•Šโ†’โ„•)

โ„ฐโŸฆ๐šŽ๐Ÿท+๐šŽ๐ŸธโŸงฯƒ=โ„ฐโŸฆ๐šŽ๐ŸทโŸงฯƒ+โ„ฐโŸฆ๐šŽ๐ŸธโŸงฯƒโ„ฐโŸฆ๐šกโŸงฯƒ=lookup(โŸฆ๐šกโŸง,ฯƒ)โ„ฐโŸฆ๐š—โŸงฯƒ=toNat(โŸฆ๐š—โŸง)

Here we explain that the semantic + means the same as math symbol +, that variables are the same as looking up the variable in a store, and that numbers should just be read as natural numbers.

Now we can see that x + 5 can be calculated in the store ฯƒ=[nโ†ฆ3], using normal math

โ„ฐโŸฆ๐šก+๐ŸปโŸงฯƒ=โ„ฐโŸฆ๐šกโŸงฯƒ+โ„ฐโŸฆ๐ŸปโŸงฯƒ=lookup(โŸฆ๐šกโŸง,ฯƒ)+โ„ฐโŸฆ๐ŸปโŸงฯƒ=3+โ„ฐโŸฆ๐ŸปโŸงฯƒ=3+toNat(โŸฆ๐ŸปโŸง)=3+5=8

If you think this just looks like functional programming, you would be right. It also turns that it is best at describing expressional languages

1.4. Operational Semantics

Finally, we can introduce Operational Semantics, which is the semantics we will focus on this semester. Operational semantics, describes it's semantics as changes to a state. This makes it ideal for describing imperative languages like the JVM bytecode. Furthermore, the Structural Operational Semantics are defined exactly as you would write a interpreter, which is handy because you are going to write one.

The Structural Operational Semantics or Small Step Semantics are written as judgments of the type (ฯˆโŠขฯƒโ†’ฯƒโ€พ) which means given the environment ฯˆ, the state of the program ฯƒ is turned into ฯƒโ€ฒ.

The thing that makes this small step semantics is that we we only care about a single operation.

The Natural Operational Semantics or Big Step Semantics, are describing running the program until it halts. (ฯˆโŠขฯƒโ†“v) where v is the final value of the program. Big step semantics often looks nicer than small step semantics, because it does not have to care about execution order.

Big Step semantics have the benefit of being easier to read, however, it has some big disadvantages, namely: we cannot reason about programs that run forever, and we cannot turn big step semantics into a working implementation. In contrast, small step semantics are easy to convert into an interpreter, and we can always recover the big step semantics from the operational semantics by simply applying the single step semantics until the program has terminated with a value:

ฯˆโŠขฯƒโ†’ฯƒโ€พฯˆโŠขฯƒโ€พโ†“vฯˆโŠขฯƒโ†“v(step)ฯƒterminated withvฯˆโŠขฯƒโ†“v(done)

1.5. Transition System and Traces

Using our new found definition of single step semantics, we can define the meaning of a program P as a Transition System: โŸจ๐’๐ญ๐š๐ญ๐žP,ฮดP,IPโŸฉ where ๐’๐ญ๐š๐ญ๐žP is the set program states, ฮดP is the transition relation (defined by the single step semantics) and IP are possible initial states.

A ๐“๐ซ๐š๐œ๐žP is the possible infinite sequence of states and operations of the program.

๐“๐ซ๐š๐œ๐žP=๐’๐ญ๐š๐ญ๐žPโ‹†

The meaning of a program is now the set of traces that it exhibit:

Sem:๐๐ซ๐จ๐ ๐ซ๐š๐ฆโ†’2๐“๐ซ๐š๐œ๐žSemโŸฆ๐™ฟโŸง={ฯ„โˆˆ๐’๐ญ๐š๐ญ๐žPn|nโˆˆ[1,โˆž],ฯ„0โˆˆIP,โˆ€iโˆˆ[1,nโˆ’1],ฮดP(ฯ„iโˆ’1,ฯ„i)}

This is also called the Maximal Trace Semantics. We can now define properties like, does a program halt, using relatively well defined math:

โ„’halt={P|Pโˆˆโ„’,โˆ€ฯ„โˆˆSemโŸฆ๐™ฟโŸง.|ฯ„|โ‰ โˆž}

But more about this next time.

2. What are the Semantics of the JVM?

In this section, we are going to introduce some of the semantics of a limited JVM. It is, however, incomplete and you would have to complete it on your own.

2.1. The Values

The JVM is dynamically typed, this means that every value caries around information about its type. There are three kinds of values, stack values ๐•ฯƒ, local values ๐•ฮป and heap values ๐•ฮท.

๐•ฮป:=(๐š’๐š—๐šn)|(๐š๐š•๐š˜๐šŠ๐šf)|(๐š›๐šŽ๐šr)๐•ฯƒ:=(๐š‹๐šข๐š๐šŽb)|(๐šŒ๐š‘๐šŠ๐š›c)|(๐šœ๐š‘๐š˜๐š›๐šs)|๐•ฮป๐•ฮท:=(๐šŠ๐š›๐š›๐šŠ๐šขnta)|(๐šŒ๐š•๐šŠ๐šœ๐šœcnfs)|๐•ฯƒ

The stack values are an extention of the local values. Ints are signed 32 bit integers, floats are 32 bit floating point values (IEEE 754 Standard (JLS ยง1.7)), and refs are also only 32 bits. Bytes are 8 bits unsinged, chars and shorts are singed and 8 and 16 bits respectively.

The heap also contains arrays, which has a length, a type and the content, and classes which has a name and the values of the fields, which is a mapping from names to stack values.

In this course we'll not cover long and double as they are a pain in the ... Furthermore, we'll try to avoid inner classes as well as bootstrap methods.

2.2. The Final States

Our program can end in either ok(v) or err(โ€˜๐š›๐šŽ๐šŠ๐šœ๐š˜๐š—โ€™).

2.3. The Context

As with all single step semantics rules, the JVM is run in a context. The context is the bytecode ๐š‹๐šŒ. For now we will define a simple operation ๐š‹๐šŒ[ฮน] which looks up the bytecode instruction at ฮน, ฮน is program counter, e.i., the name of the method and the offset in that method. We use the following short hands if ฮน=โŸจm,oโŸฉ then m=ฮนm and o=ฮนo. Furthermore, ฮน+n=โŸจฮนm,ฮนo+nโŸฉ and ฮนโ†n=โŸจฮนm,nโŸฉ.

2.4. The Stack

The JVM is a stack based virtual machine, this means that instead of having registers to put intermediate values in it uses a stack.

The stack is a list of values: ฯƒ=๐•ฯƒโ‹†. ฯต denote the empty stack and we add and remove elemenets from the end of the stack. A stack with the integers 1, 2, and 3, looks like this: ฯƒ=ฯต(๐š’๐š—๐š1)(๐š’๐š—๐š2)(๐š’๐š—๐š3). Most of our operations only operate on the stack and the program counter so we define our first simple SOS judgment like this:

๐š‹๐šŒโŠขโŸจฯƒ,ฮนโŸฉโ†’โŸจฯƒโ€พ,ฮนโ€พโŸฉ

Here are some examples. First we have the noop operation:

b=๐š‹๐šŒ[ฮน]b.๐š˜๐š™๐š›=โ€˜๐š—๐š˜๐š™โ€™๐š‹๐šŒโŠขโŸจฯƒ,ฮนโŸฉโ†’โŸจฯƒ,ฮน+1โŸฉ(nop)

Then we have the add operation:

b=๐š‹๐šŒ[ฮน]b.๐š˜๐š™๐š›=โ€˜๐š‹๐š’๐š—๐šŠ๐š›๐šขโ€™b.๐š˜๐š™๐šŽ๐š›๐šŠ๐š—๐š=โ€˜๐šŠ๐š๐šโ€™b.๐š๐šข๐š™๐šŽ=โ€˜๐š’๐š—๐šโ€™v3=v1+๐š’๐Ÿน๐Ÿธv2๐š‹๐šŒโŠขโŸจฯƒ(๐š’๐š—๐šv1)(๐š’๐š—๐šv2),ฮนโŸฉโ†’โŸจฯƒ(๐š’๐š—๐šv3),ฮน+1โŸฉ(binary)

2.5. The Locals

In the JVM saves local variables of type ๐•ฮป to a local array ฮป. This is were the inputs to the method goes and any data that should be saved on the method stack instead of in the heap. The local array is indexed normally ฮป[0].

We can therefore extend our JVM judgements to the form:

๐š‹๐šŒโŠขโŸจฮป,ฯƒ,ฮนโŸฉโ†’โŸจฮปโ€พ,ฯƒโ€พ,ฮนโ€พโŸฉ

We can run every operation that does not use the local variables, like so:

๐š‹๐šŒโŠขโŸจฯƒ,ฮนโŸฉโ†’โŸจฯƒโ€พ,ฮนโ€พโŸฉ๐š‹๐šŒโŠขโŸจฮป,ฯƒ,ฮนโŸฉโ†’โŸจฮป,ฯƒโ€พ,ฮนโ€พโŸฉ(liftฮป)

We can now also write rules that interact with the locals:

b=๐š‹๐šŒ[ฮน]b.๐š˜๐š™๐š›=โ€˜๐š•๐š˜๐šŠ๐šโ€™b.๐š๐šข๐š™๐šŽ=โ€˜๐š’๐š—๐šโ€™(๐š’๐š—๐šv)=ฮป[b.๐š’๐š—๐š๐šŽ๐šก]๐š‹๐šŒโŠขโŸจฮป,ฯƒ,ฮนโŸฉโ†’โŸจฮป,ฯƒ(๐š’๐š—๐šv),ฮน+1โŸฉ(load๐š’๐š—๐š)

2.6. The Method Stack

The JVM also has a method stack ฮผ, which is a stack of the tuples we have already introduced.

ฮผโˆผโ€ฆโŸจฮป2,ฯƒ2,ฮน2โŸฉโŸจฮป1,ฯƒ1,ฮน1โŸฉ

We can now define the last level of the semantic hirachy.

๐š‹๐šŒโŠขฮผโ†’ฮผโ€พ

Ofcause we can use the operations we defined before, by lifting them into the world of method stacks:

๐š‹๐šŒโŠขโŸจฮป,ฯƒ,ฮนโŸฉโ†’โŸจฮปโ€พ,ฯƒโ€พ,ฮนโ€พโŸฉ๐š‹๐šŒโŠขฮผโŸจฮป,ฯƒ,ฮนโŸฉโ†’ฮผโŸจฮปโ€พ,ฯƒโ€พ,ฮนโ€พโŸฉ(liftฮผ)

But now we can also do things like returning from the program:

b=๐š‹๐šŒ[ฮน]b.๐š˜๐š™๐š›=โ€˜๐š›๐šŽ๐š๐šž๐š›๐š—โ€™b.๐š๐šข๐š™๐šŽ=โ€˜๐š’๐š—๐šโ€™๐š‹๐šŒโŠขฯตโŸจฮป,ฯƒ(๐š’๐š—๐šv),ฮนโŸฉโ†’ok(๐š’๐š—๐šv)(returnฯต)

Or returning from a method:

b=๐š‹๐šŒ[ฮน]b.๐š˜๐š™๐š›=โ€˜๐š›๐šŽ๐š๐šž๐š›๐š—โ€™b.๐š๐šข๐š™๐šŽ=โ€˜๐š’๐š—๐šโ€™๐š‹๐šŒโŠขฮผโŸจฮป2,ฯƒ2,ฮน2โŸฉโŸจฮป,ฯƒ(๐š’๐š—๐šv),ฮนโŸฉโ†’ฮผโŸจฮป2,ฯƒ2(๐š’๐š—๐šv),ฮน2+1โŸฉ(returnฮผ)

2.7. The Heap

Finally we can add information about the heap ฮท. The heap is a mapping from references r or static variables to ๐•ฮท. This also present our last SOS judgement:

๐š‹๐šŒโŠขโŸจฮท,ฮผโŸฉโ†’โŸจฮทโ€พ,ฮผโ€พโŸฉ

And, our last lifting operation:

๐š‹๐šŒโŠขฮผโ†’ฮผโ€พ๐š‹๐šŒโŠขโŸจฮท,ฮผโŸฉโ†’โŸจฮท,ฮผโ€พโŸฉ(liftฮท)

The heap allows use to talk about arrays and classes:

b=๐š‹๐šŒ[ฮน]b.๐š˜๐š™๐š›=โ€˜๐šŠ๐š›๐š›๐šŠ๐šข๐š•๐šŽ๐š—๐š๐š๐š‘โ€™ฮท[r]=(๐šŠ๐š›๐š›๐šŠ๐šขnta)๐š‹๐šŒโŠขโŸจฮท,ฮผโŸจฮป,ฯƒ(๐š›๐šŽ๐šr),ฮนโŸฉโŸฉโ†’โŸจฮท,ฮผโŸจฮป,ฯƒ(๐š’๐š—๐šn),ฮน+1โŸฉโŸฉ(arraylength)

2.8. What about the rest?

So this is a very limited definition of semantics the JVM, we have not covered threads or exceptions, and we probably wont in this course. But even with these restrictions there are still many undefined rules left.

Write out dup as single step semantics

Write dup as single step semantic, given the definitions above. You can use the following resources:

  1. [ https://github.com/kalhauge/jvm2json/blob/main/CODEC.txt? ]: the decompiled codec (search for <ByteCodeInst>).
  2. [ https://en.wikipedia.org/wiki/List_of_Java_bytecode_instructions? ]: the full list of instructions.
  3. [ https://docs.oracle.com/javase/specs/jvms/se22/html/jvms-4.html? ]: the class file format. And, finally
  4. [ https://docs.oracle.com/javase/specs/jvms/se22/html/jvms-6.html#jvms-6.5? ]: the official specification of each instruction.

It is now up to you to help each other by using this new found language to communicate the semantics of the rules of the JVM to each other through single step semantics.

Help each other

Choose one or more operations from the decompiled code and create a rule at this overleaf project:

[ https://www.overleaf.com/2251384699ytnrvmqwpmyh#5063de? ]

You can come back to this activity as needed when doing the rest of the activities.

I hope you can use it as a Wiki for explaining to each other the semantics of the JVM.

Note that Overleaf is not run by DTU, your data is therefore owned by them. Your participation in this excessive is therefore purely optional. I will share the results on request.

3. Getting Started

The goal of today is to get familiar with the semantics of the JVM as well as starting writing our interpreter.

3.1. Get familiar with Java Bytecode

Now that we have some knowledge of java bytecode we should try to read it:

Inspect the Bytecode

First take a look at the decompiled code of the cases. First look at Simple.json. Now find the assertFalse method (it's listed in the methods). Now find the code.bytecode section, which should look like:

[ 
{ "field": { "class": "jpamb/cases/Simple", "name": "$assertionsDisabled", "type": "boolean" }, "offset": 0, "opr": "get", "static": true },
{ "condition": "ne", "offset": 3, "opr": "ifz", "target": 6 },
{ "class": "java/lang/AssertionError", "offset": 6, "opr": "new" },
{ "offset": 9, "opr": "dup", "words": 1 },
{ "access": "special", "method": { "args": [], "is_interface": false, "name": "<init>", "ref": { "kind": "class", "name": "java/lang/AssertionError" }, "returns": null }, "offset": 10, "opr": "invoke" },
{ "offset": 13, "opr": "throw" },
{ "offset": 14, "opr": "return", "type": null }
]

Especially, notice the opr, it explains what operations to run:

in#oprstackdescription
00get[]Get the assertionsDisabled boolean and put it on the stack
01ifz[bool]if it is not equal to zero (true) jump to the 6th instruction (e.i. return)
02new[]otherwise create a new AssertionError object.
03dup[ref]dublicate the reference
04invoke[ref, ref]call the init method on the AssertionErrror (consuming the top reference).
05throw[ref]throw the assertion error.
06return[]otherwise return.

3.2. Write an Interpreter

We should now be able to write an interpreter using the rules. For example, the rule

b=๐š‹๐šŒ[ฮน]b.๐š˜๐š™๐š›=โ€˜๐š™๐šž๐šœ๐š‘โ€™v=b.๐šŸ๐šŠ๐š•๐šž๐šŽ.๐šŸ๐šŠ๐š•๐šž๐šŽ๐š‹๐šŒโŠขโŸจฯƒ,ฮนโŸฉโ†’โŸจฯƒv,ฮน+1โŸฉ(push)

Converts to the following python code:

def step_push(self, b):
    self.stack.append(b["value"]["value"])
    self.pc += 1

Write an Interpreter

Write an interpreter for the JVM, that given a method id and an input, prints the query on the last line:

$ ./interpreter "jpamb.cases.Simple.assertInteger:(I)V" "(1)"
... alot  ...
... of intermediate  ...
... results ...
ok

Here are some good advice getting started:

  1. Start small, one method at a time. You don't have to cover the entire language. Also you don't need a method stack or a heap for most of the cases
  2. do them later.
  3. It's okay to hack some things, like getting the $assertionsDisabled static field. You can assume that is always be false.
  4. Print out the state at every step, this will help you debug.
  5. Look at (and extend) solutions/interpret.py for inspiration.

Test it!

You can use the bin/test.py tool from JPAMB to test that your tool performs correctly. It will also build a report (designated by the -o flag) that you can inspect errors. The same method filters for bin/evaluate.py works here as well:

$ python bin/test.py --filter-methods=justReturn\: -o - -- python solutions/interpret.py
8t> starting: python solutions/interpret.py 'jpamb.cases.Simple.justReturn:()I' '()'
8t> read decompiled classfile decompiled/jpamb/cases/Simple.json
8t> STEP 0:
8t>   PC: 0 {'offset': 0, 'opr': 'push', 'value': {'type': 'integer', 'value': 0}}
8t>   LOCALS: []
8t>   STACK: []
8t> STEP 1:
8t>   PC: 1 {'offset': 1, 'opr': 'return', 'type': 'int'}
8t>   LOCALS: []
8t>   STACK: [0]
8t> DONE ok
8t>   LOCALS: []
8t>   STACK: []
8t> done

A specially effective technique is to save the report to a git repository with the -o parameter, then you get an automatic semantic diff between versions of your tool.

Assuming you are on a linux machine and your project repository is next to jpamb, you can run the following command to get an instant golden test.

$ python ../jpamb/bin/test.py -o golden.log -- ./bin/interpreter
$ git add golden.log
some time parses
$ python ../jpamb/bin/test.py -o golden.log -- ./bin/interpreter
$ git diff golden.log

3.3. Make your first Dynamic analysis.

Write your first Dynamic Analysis

We are going to write our first dynamic analysis given our interpreter. Its called a random tester.

  • Given a method m
  • Run m with n random input i, make sure to limit the run-depth so that it does not run forever.
  • Report any behavior witnessed with 100%, and answer 50% on anything else (or don't answer).