In this section, we first recap the abstract interpretation from last time. We then introduce a new integer abstraction, namely the interval abstraction, and show how to do unbounded static analysis, i.e., analysis without a bound on the number of iterations. To do this we have to introduce fixpoints, the widening operator, and finally a more efficient algorithm.
A big problem, which we have glossed over so far, is dependent variables. Dependent variables are variables whose solution spaces are constrained by each other.
Consider the case in Dependent.safeDivByN:
@Case("(0) -> ok")
@Case("(1) -> ok")
public static int safeDivByN(int n) {
if (n != 0) {
return 1 / n;
}
return 0;
}If we run our sign analysis naively on the decompiled code we get the following. The offset column shows the offset, the instruction column, shows the instruction column, and the abstract state column shows abstract state right before the instruction and at what instructions the results are put.
| Offset | Instruction | Abstract State |
|---|---|---|
| 000 | load:I 0 | locals: {0 ↦ {-, 0, +}}, stack: [] |
| 001 | ifz eq 6 | locals: {0 ↦ {-, 0, +}}, stack: [{-, 0, +}] |
| 002 | push:I 1 | locals: {0 ↦ {-, 0, +}}, stack: [] |
| 003 | load:I 0 | locals: {0 ↦ {-, 0, +}}, stack: [{+}] |
| 004 | binary:I div | locals: {0 ↦ {-, 0, +}}, stack: [{+}, {-, 0, +}] |
| 005 | return:I | locals: {0 ↦ {-, 0, +}}, stack: [{-, 0, +}] |
| 006 | push:I 0 | locals: {0 ↦ {-, 0, +}}, stack: [] |
| 007 | return:I | locals: {0 ↦ {-, 0, +}}, stack: [{0}] |
The problem with the above execution is that we report a divide-by-zero error at offset 004, even though it cannot occur. This happens because we do not capture the dependency between the integer on the stack and the integer in the locals at the branch at offset 001.
To fix this we need to add a notion of named values and constraints. In the example above, we could introduce a named value $v_n$ for the argument, with a constraint indicating that the value on the stack and the value in the locals are the same. We can keep these constraints in addition to the state:
| Offset | Instruction | Abstract State |
|---|---|---|
| 000 | load:I 0 | locals: {0 ↦ $v_n$}, stack: [], $v_n \in$ {-, 0, +} |
| 001 | ifz eq 6 | locals: {0 ↦ $v_n$}, stack: [$v_n$], $v_n \in$ {-, 0, +} |
| 002 | push:I 1 | locals: {0 ↦ $v_n$}, stack: [], $v_n \in$ {-, +} |
| 003 | load:I 0 | locals: {0 ↦ $v_n$}, stack: [{+}], $v_n \in$ {-, +} |
| 004 | binary:I div | locals: {0 ↦ $v_n$}, stack: [{+}, $v_n$], $v_n \in$ {-, +} |
| 005 | return:I | locals: {0 ↦ $v_n$}, stack: [{-, 0, +}], $v_n \in$ {-, +} |
| 006 | push:I 0 | locals: {0 ↦ $v_n$}, stack: [], $v_n \in$ {0} |
| 007 | return:I | locals: {0 ↦ $v_n$}, stack: [{0}], $v_n \in$ {0} |
Now we can see that we can constrain the state-space of the named values inside each of the branches, allowing for more precise results. In particular, at offset 004 the divisor $v_n$ can no longer be zero, so no error is reported.
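To make this concrete, here is a minimal sketch of how named values with sign constraints might be represented. The names (`Named`, `SignState`, `constrain`) are our own invention for illustration, not from the course codebase:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Named:
    # A fresh identifier; a new one is created whenever a variable is overwritten.
    id: int

@dataclass
class SignState:
    # The set of signs each named value may still take.
    signs: dict[Named, frozenset] = field(default_factory=dict)

    def constrain(self, v: Named, allowed: frozenset) -> "SignState":
        # Refining a named value narrows it everywhere it occurs: in the
        # locals, on the stack, and in any copies made of it.
        new = dict(self.signs)
        new[v] = self.signs[v] & allowed
        return SignState(new)

# At the branch `ifz eq 6` we refine the same named value differently in
# each successor, capturing the dependency between the stack and the locals.
n = Named(0)
st = SignState({n: frozenset({"-", "0", "+"})})
eq_branch = st.constrain(n, frozenset({"0"}))       # n == 0: jump to 006
ne_branch = st.constrain(n, frozenset({"-", "+"}))  # n != 0: fall through to 002
```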
A couple of things to keep in mind when using named values:

- They should not change. If a variable, like a local, is overwritten, we need to create a new named value, so that values copied out of the locals do not change when the original variable changes.
- Try to be smart about when to create named values and when to remove them again, as too many named values might make your analysis slow.
- Think about which instructions create dependencies between values. Some examples are dup and load, but array and field accesses do the same.
It is also possible to capture more interesting relationships between values than simple equality. This is especially important when the values constrain each other:
@Case("(0, 0) -> ok")
@Case("(1, 1) -> ok")
public static int safeDivAByB(int a, int b) {
if (a >= 0) {
if (b > a) {
return a / b
}
}
return 0;
}The example above contains no errors. But this would be hard for our current techniques to detect. To maintain these constraints efficiently we can use Polyhedrons (Cousot 1978) or most of the time faster Octagons (Miné 2006).
First, we are going to look at what an implementation might look like. Because we are going to step all states at the same time, we need a manystep function (Bounded Abstract Interpretation (§3)). It takes a set of states (abstracted by A) and yields all the states (or final states) which could be produced by them (The State Abstraction (§3.1)).
```python
def manystep(sts: StateSet[A]) -> Iterable[A | str]:
    ...
```

To run our bounded static analysis, we first compute the initial state from the initial method (methodid), and then simply run manystep a bounded number of times, joining the results each time.
```python
final = set()
sts = A.initialstate_from_method(methodid)
for _ in range(MAX_STEPS):
    for s in manystep(sts):
        if isinstance(s, str):
            final.add(s)
        else:
            sts |= s
log.info(f"The following final states {final} are possible within {MAX_STEPS} steps")
```

The first thing we do is ask the state set to break itself down per instruction (see The Per-Instruction Abstraction (§3.2)); this lets us process the states at the same instruction at the same time.
```python
for pc, state in sts.per_instruction():
    match bc[pc]:
        case jvm.Binary(type=jvm.Int(), operant=opr):
            ...
```

At this point we would like to step the states according to the instructions, but we are at an impasse: we need to get the variables out of the set of states.
One option (as we used in The Per-Variable Abstraction (§3.3)) is to merge all the states and create a single state with merged variables, but this comes at a loss of precision (see Dependent Values (§1)). Alternatively, we can ask the states to group themselves by the variables needed.
Here we use a group method which takes the set of states and groups them by some query. By allowing the states to choose how to group themselves, they can choose the most efficient grouping. In our case, we want to pop two integers:
```python
for ([va1, va2], after) in state.group(pop=[jvm.Int(), jvm.Int()]):
    ...
```

At this point we have two abstract integers which we then use to do our computation. The computation is done using the state.binary method, which returns all possible pairs of values and states, or failures. Even though we already have the abstract values va1 and va2, doing the operation on them might affect other parts of the state, see Dependent Values (§1). If the result is not a failure, we update the state with the new value, using an update method.
```python
for res in after.binary(opr, va1, va2):
    match res:
        case (va3, after):
            yield after.frame.update(push=[va3], pc=pc+1)
        case err:
            yield err
```

Sometimes, as with Ifz, operations can return multiple states. If we compare an abstract value with 0, it can be true that some values are smaller and some are larger. However, this might also have implications for the rest of the state. We need to return multiple states: one where the condition holds and one where it does not.
```python
case jvm.Ifz(condition=con, target=target):
    for ([va1], after) in state.group(pop=[jvm.Int()]):
        for res in after.binary(con, va1, 0):
            match res:
                case (True, after):
                    yield after.frame.update(pc=target)
                case (False, after):
                    yield after.frame.update(pc=pc+1)
                case err:
                    yield err
```

The problem with bounded static analysis is that even if we run our static analysis for 1,000,000 iterations, we have no guarantee that our program will not exhibit bad behaviour in iteration 1,000,001. Unbounded static analysis seeks to fix this by running the static analysis until a fixed point is reached.
Ideally, we want the smallest such fixed point, which we refer to as the Minimal Fixed Point.
Given a function $f : L \to L$ that is monotone over $\sqsubseteq$, the minimal fixed point is the least $x$ s.t. $f(x) = x$; it can be computed as the limit of $f^n(\bot)$.
Here $f^n(x)$ means $f(f(\dots f(x)))$, where $f$ is applied $n$ times.
Not all functions have a fixed point, but if one does, we can run the function until the result no longer changes, and then we know we don't have to continue running the analysis.
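In code, this iteration might look like the following sketch, assuming f is a monotone function on a lattice of finite height so the loop terminates:

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def fixpoint(f: Callable[[T], T], bottom: T) -> T:
    # Kleene iteration: apply f until the result no longer changes.
    x = bottom
    while (nxt := f(x)) != x:
        x = nxt
    return x
```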
For example, given our Sign abstraction we are guaranteed a fixed point in no more than 4 iterations, as the height of the lattice is 4.
However, not all abstractions form a lattice of finite height. One example of such an abstraction is the interval abstraction.
Here we represent a set of integers as an interval between two numbers.
The interval abstraction is a tuple of two numbers $(l, h)$ with $l \le h$, representing the set of integers between $l$ and $h$, both inclusive.
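For example, one standard definition of the join (least upper bound) of two non-empty intervals is:

$$[l_1, h_1] \sqcup [l_2, h_2] = [\min(l_1, l_2),\ \max(h_1, h_2)]$$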
Now implement the interval abstraction in your language of choice. Remember to test your abstraction:
```python
from hypothesis import given
from hypothesis.strategies import integers, sets

@given(sets(integers()))
def test_interval_abstraction_valid(xs):
    r = Interval.abstract(xs)
    assert all(x in r for x in xs)

@given(sets(integers()), sets(integers()))
def test_interval_abstraction_distributes(xs, ys):
    assert (Interval.abstract(xs) | Interval.abstract(ys)) == Interval.abstract(xs | ys)
```

Define the arithmetic operations you need to run your static analysis.
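As a starting point, here is a minimal sketch of such an interval domain. The representation of the empty interval (bottom) by inverted infinite bounds is our own choice; only abstraction, membership, join, and addition are shown:

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    low: int | float
    high: int | float

    @staticmethod
    def abstract(xs: set[int]) -> "Interval":
        # The empty set abstracts to the empty interval (bottom),
        # represented here by inverted infinite bounds.
        return Interval(min(xs, default=math.inf), max(xs, default=-math.inf))

    def __contains__(self, x: int) -> bool:
        return self.low <= x <= self.high

    def __or__(self, other: "Interval") -> "Interval":
        # Join: the smallest interval containing both operands.
        return Interval(min(self.low, other.low), max(self.high, other.high))

    def __add__(self, other: "Interval") -> "Interval":
        # Over-approximating addition: add the bounds pointwise.
        return Interval(self.low + other.low, self.high + other.high)
```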
You can use hypothesis to test them; remember, you want the operations to be an over-approximation:
```python
@given(sets(integers()), sets(integers()))
def test_interval_abstraction_add(xs, ys):
    r = Interval.abstract(xs) + Interval.abstract(ys)
    assert all(x + y in r for x in xs for y in ys)
```

Notice that the lattice of the interval abstraction does not have a finite height: we can always add 1 to either end of an interval.
Consider the following program:
```java
void example() {
    int i = 0;
    while (i >= 0) {
        i = i + 1;
    }
}
```

In this case, our interval analysis of i at the loop head will look something like this: $[0, 0]$, then $[0, 1]$, then $[0, 2]$, and so on.
This might continue forever, unless we find a better way.
Because we are not guaranteed to reach a fixed point with the interval abstraction, we need to force it. This is the job of the widening operator: its goal is to dramatically over-approximate joins in a way that guarantees a fixed point in a finite number of steps.
One way of adding a widening operator to the interval abstraction is to use a constant set, often referred to as $K$. This set contains all the constants in the program or file of interest. This is like the dictionaries we used in the dynamic analysis.
To define the widening operator $\nabla$, we define two operators $\lfloor \cdot \rfloor$ and $\lceil \cdot \rceil$ which, given the set of integers $K$, pick the largest and smallest constant in $K$, respectively, s.t. $\lfloor x \rfloor \le x$ and $x \le \lceil x \rceil$.
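One common way to define the resulting threshold widening on intervals (this exact formulation is ours; the literature has several variants) is:

$$[l_1, h_1] \mathbin{\nabla} [l_2, h_2] = [l', h'] \quad\text{where } l' = \begin{cases} l_1 & \text{if } l_1 \le l_2 \\ \lfloor l_2 \rfloor & \text{otherwise} \end{cases} \quad h' = \begin{cases} h_1 & \text{if } h_2 \le h_1 \\ \lceil h_2 \rceil & \text{otherwise} \end{cases}$$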
We can see that we can only apply this operator a finite number of times until we either get $-\infty$ or $\infty$.
Now, when running our analysis, we use the $\nabla$ operator instead of the $\sqcup$ (join) operator, and we can continue computing until we reach a fixed point.
Add a widening operator to your analysis, so that any chain of merges stabilizes after a finite number of steps.
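A sketch of what this might look like for the Interval class above, assuming K is a finite set of constants harvested from the program, extended with the infinities so the picks always succeed:

```python
import math

# Constants from the program of interest, plus the infinities.
K = {-math.inf, 0, 1, 100, math.inf}

def floor_k(x: int | float) -> int | float:
    # Largest constant in K that is <= x.
    return max(k for k in K if k <= x)

def ceil_k(x: int | float) -> int | float:
    # Smallest constant in K that is >= x.
    return min(k for k in K if k >= x)

def widen(old: Interval, new: Interval) -> Interval:
    # Keep stable bounds; jump unstable bounds to the next constant in K.
    low = old.low if old.low <= new.low else floor_k(new.low)
    high = old.high if new.high <= old.high else ceil_k(new.high)
    return Interval(low, high)
```

With this K, the loop in example() stabilizes at $[0, \infty]$ after a handful of widenings ($[0,0]$, $[0,1]$, $[0,100]$, $[0,\infty]$) instead of diverging.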
Currently our algorithm is a little slow, as we essentially iterate over the entire program in lockstep for every step.
This is of course wasteful, as we would much rather only handle unprocessed events. Suppose you have a stepping function from a program location and an abstract state to a set of program locations and abstract states, which need to be joined at those points.
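In code, such a stepping function might have the following shape; PC and AState are stand-ins for whatever program-counter and abstract-state types your analysis uses:

```python
PC = int        # e.g. a bytecode offset
AState = dict   # stand-in for your abstract state type

def step(pc: PC, astate: AState) -> set[tuple[PC, AState]]:
    """Step the instruction at pc, yielding successor locations and states."""
    ...
```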
Now the analysis algorithm can be written something like this:
```python
def analyse(i: tuple[PC, AState]):
    state: dict[PC, AState] = dict()
    needs_work: list[PC] = []
    # Add the initial state to the set of items that need work, and set the state.
    needs_work.append(i[0])
    state[i[0]] = i[1]
    # Iterate until the needs_work list is empty:
    while needs_work:
        next = needs_work.pop()
        for pc, astate in step(next, state[next]):
            # Merge the new state into the old using the widening operator.
            old = state.get(pc)
            state[pc] = wide(old, astate) if old is not None else astate
            # If the state has changed, append it to the work list.
            if old != state[pc]:
                needs_work.append(pc)
```

This approach uses significantly fewer instructions than running all steps every time, and still produces the same results.
However, there is still some work we can do to improve the algorithm. For example, consider what happens when you analyse the following program:
```java
void fn() {
    int i = 0;
    while (i < 100) {
        i += 1;
    }
    // Many operations on i.
}
```

There is no reason to update the states after the while loop before we have reached a fixpoint of the while loop.
The correct order to do this computation is reverse post-order. If you process locations in this order, you are guaranteed to process loop bodies before the code that follows them.
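A minimal sketch of computing a reverse post-order, assuming the control-flow successors of each location are available as a dictionary:

```python
def reverse_postorder(entry: PC, successors: dict[PC, list[PC]]) -> list[PC]:
    # Depth-first search, emitting a node only after all its successors.
    seen: set[PC] = set()
    order: list[PC] = []

    def dfs(pc: PC) -> None:
        seen.add(pc)
        for succ in successors.get(pc, []):
            if succ not in seen:
                dfs(succ)
        order.append(pc)

    dfs(entry)
    order.reverse()
    return order
```

The worklist can then be kept ordered by reverse post-order index (for example with a priority queue), so loop bodies are stabilized before the code after the loop is processed.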
Until next time, write the best analysis you can and upload the results to Autolab, and ponder the following:
What is the primary problem when doing unbounded static analysis?
Why do we need a widening operator?
Cousot, Patrick; Halbwachs, Nicolas (1978). "Automatic discovery of linear restraints among variables of a program". doi:10.1145/512760.512770
Miné, Antoine (2006). "The octagon abstract domain". doi:10.1007/s10990-006-8609-1
Nielson, Flemming; Nielson, Hanne Riis (2020). "Program Analysis (an Appetizer)".