Unbounded Static Analysis

Christian Gram Kalhauge

Table of Contents
  1. Dependent Values§1
    1. Dependencies between different values.§1.1
    2. Implementing an Abstract Interpreter§2
      1. Grouping Per Instruction§2.1
        1. Grouping Per Variable§2.2
          1. Doing the Operation§2.3
            1. Constraining the Space§2.4
            2. Unbounded Static Analysis§3
              1. Fixed Points§3.1
                1. The Interval Abstraction§3.2
                  1. The Widening Operator§3.3
                    1. The Worklist Algorithm§3.4
                      1. The Reverse Post-Order§3.4.1

                    In this section, we are going to reiterate on the abstract interpretation from last time and then we are going to introduce a new integer abstraction, namely the interval abstraction, then we are going to show how we can do Unbounded static analysis analysis with no limits. To do this we have to introduce fixpoints, the widening operator, and finally a more efficient algorithm.

                    Dependent Values §1

                    A big problem, which we have glazed over so far, are dependent variables. Dependent variables are variables who's solution space are constrained by each other.

                    Safe Div By N

                    Consider the case in Dependent.safeDivByN:

                    @Case("(0) -> ok")
                    @Case("(1) -> ok")
                    public static int safeDivByN(int n) { 
                      if (n != 0) {
                        return 1 / n;
                      }
                      return 0;
                    }

                    If we run our sign analysis naively on the decompiled code we get the following. The offset column shows the offset, the instruction column, shows the instruction column, and the abstract state column shows abstract state right before the instruction and at what instructions the results are put.

                    OffsetInstructionAbstract State
                    000load:I 0{0{+,,0}},ϵ,0{1}
                    001ifz eq 6{0{+,,0}},{+,,0},1{2,6}
                    002push:I 1{0{+,,0}},ϵ,2{3}
                    003load:I 0{0{+,,0}},{+},3{4}
                    004binary:I div{0{+,,0}},{+}{+,,0},4{5,err(‘𝚍𝚒𝚟 𝚋𝚢 𝚣𝚎𝚛𝚘’)}
                    005return:I{0{+,,0}},{+,},5{ok}
                    006push:I 0{0{+,,0}},ϵ,6{7}
                    007return:I{0{+,,0}},{0},7{ok}

                    The problem with the above execution is that we find a divide by zero error, even though it does not exist. This happens because we do not capture the dependency between the integer on the stack and in the locals in the choice in 001.

                    To fix this we need to add a notion of named values and constraints. In the example above, we could add a named value n with constraint n={+,,0} indicating that values are the same. We can keep these constraints in addition to the state:

                    OffsetInstructionAbstract State
                    000load:I 0n={+,,0}{0n},ϵ,0{1}
                    001ifz eq 6n={+,,0}{0n},n,1{2,6}
                    002push:I 1n={+,}{0n},ϵ,2{3}
                    003load:I 0n={+,}{0n},{+},3{4}
                    004binary:I divn={+,}{0n},{+}n,4{5}
                    005return:In={+,}{0n},{+,},5{ok}
                    006push:I 0n={0}{0n},ϵ,6{7}
                    007return:In={0}{0n},{0},7{ok}

                    Now we can see that we can constrain the state-space of the named values inside each of the branches allowing for more precise results.

                    A couple of things to keep in mind when using named values, is that

                    Dependencies between different values. §1.1

                    It is possible to capture more interesting relationships between values. This is especially important when they depend on each other.

                    @Case("(0, 0) -> ok")
                    @Case("(1, 1) -> ok")
                    public static int safeDivAByB(int a, int b) { 
                      if (a >= 0) {
                        if (b > a) {
                          return a / b
                        }
                      }
                      return 0;
                    }

                    The example above contains no errors. But this would be hard for our current techniques to detect. To maintain these constraints efficiently we can use Polyhedrons (Cousot 1978) or most of the time faster Octagons (Miné 2006).

                    Implementing an Abstract Interpreter §2

                    First, we are going to look at what an implementation might look like. Because we are going to step all states at the same time, we need an manystep function (Bounded Abstract Interpretation (§3)). Here we take a set of states (abstracted by A), and yield all the states (or final states) which could be produced by it (The State Abstraction (§3.1)).

                    def manystep(sts : StateSet[A]) -> Iteratable[A | str]:
                      ...

                    To run our bounded static analysis, we first have to compute the initial state, from the initial method (methodid), then we can simply run this a bounded number of times, and joining the results every time.

                    final = {}
                    sts = A.initialstate_from_method(methodid)
                    for i in range(MAX_STEPS):
                      for s in manystep(sts):
                        if isinstance(s, str):
                          final.add(s)
                        else:
                          sts |= s
                    log.info(f"The following final states {final} is possible in {MAX_STEPS}")

                    Grouping Per Instruction §2.1

                    The first thing we do is we ask the state set to break it down per instruction (see The Per-Instruction Abstraction (§3.2)), we do this so we can process the states at the same instruction at the same time.

                    for pc, state in sts.per_instruction():
                      match bc[pc]:
                        case jvm.Binary(type=jvm.Int(), operant=opr):
                          ...

                    Grouping Per Variable §2.2

                    At this point we would like to step the states according to the instructions, but we are at an impasse. We need to get the variables out of the set of states.

                    One option (as we used in The Per-Variable Abstraction (§3.3)) is to merge all the states and create a single state with merged variables. But this comes at a loss of precision (see Dependent Values (§1)), or we can ask the states to group themselves by the variables needed.

                    Here we use a group method which takes the set of states and group them by some query. By allowing the state to choose how group themselves, it can choose the most optimal grouping. In our case, we want to pop two integers:

                    for ([va1, va2], after) for state.group(pop=[jvm.Int(), jvm.Int()]):
                      ...

                    Doing the Operation §2.3

                    At this point we have two abstract integers which we then use to do our computation. The computation is done using the state.binary method, which returns all possible pairs of values and states or failures. Even though we already have the abstract values va1 va2 doing the operation on them might affect some other parts of the state, see Dependent Values (§1). If the result is not a failure, we update the state with the new value, using an update method.

                    for res in after.binary(opr, va1, va2):
                      match res with 
                        case (va3, after):
                            yield after.frame.update(push=[va3], pc=pc+1)
                        case err:
                            yield err

                    Constraining the Space §2.4

                    Sometimes like with Ifz, operations sometimes can return multiples states. If we compare an abstract value with 0, it can both be true that some variables are smaller and some are larger. However, this might also have implications for the rest of the state. We need to return multiple states, one if something bad happens and one if it does not.

                    case jvm.Ifz(condition=con, target=target):
                      for ([va1], after) for state.group(pop=[jvm.Int()]):
                        for res in after.binary(con, va1, 0):
                          match res:
                            case (True, after):
                              yield after.frame.update(pc=target)
                            case (False, after):
                              yield after.frame.update(pc=pc+1)
                            case err:
                              yield err

                    Unbounded Static Analysis §3

                    The problem with bounded static analysis is that even if we run our static analysis for 1.000.000 iterations, we have no gurantee that our program will not exhibit bad behaviour in iteration 1.000.001. Unbounded Static Analysis seeks to fix this, by running the static analysis until a fixed point is reached.

                    Fixed Points §3.1

                    Ideally, we want the smallest such fixed point, which we refere to as the Minimal Fixed Point.

                    Minimal Fixed Point

                    Given a function f:AA, that is monotone over (A,A) (aA.aAf(a)) then the minimal fixed point is the least n st. fn()=fn+1().

                    fn() means f(f(f())), were f is applied n times.

                    Not all functions have a fixed point, but if it does, we can run the function until the result no longer changes, and then we we know we know we don't have to continue running the analysis.

                    For example given our Sign abstraction we are guranteed a fixed point in no more than 4 iterations, as the height of the latice is 4.

                    The Interval Abstraction §3.2

                    However, not all abstractions form a lattice of finite height. One example of such abstractions are the interval abstraction.

                    Here we represent a set integers as a interval between two numbers.

                    Interval Abstraction

                    The interval abstraction 𝐈𝐭𝐯𝐚𝐥 is a tuple of two numbers i, j, representing the interval of integers between [i,j], both inclusive.

                    α(S)=[minS,maxS]γ([i,j])={n|jni}[i,j]𝐈𝐭𝐯𝐚𝐥[k,h]kijh[i,j]𝐈𝐭𝐯𝐚𝐥[k,h][min{i,k},max{j,h}][i,j]𝐈𝐭𝐯𝐚𝐥[k,h][max{i,k},min{j,h}]𝐈𝐭𝐯𝐚𝐥=[,]𝐈𝐭𝐯𝐚𝐥=[,]

                    Implement the Interval abstraction.

                    Now implement the range abstraction in your language of choice. Remember to test your abstraction:

                    from hypothesis import given
                    from hypothesis.strategies import integers, sets
                    
                    @given(sets(integers()))
                    def test_interval_abstraction_valid(xs):
                      r = Interval.abstract(xs) 
                      assert all(x in r for x in xs)
                    
                    @given(sets(integers()), sets(integers()))
                    def test_interval_abstraction_distributes(xs, ys):
                      assert (Interval.abstract(xs) | Interval.abstract(ys)) == Interval.abstract(xs | ys)

                    Implement Operations on Interval Abstraction.

                    Define the arithmetic operations you need to run your static analysis.

                    You can use hypothesis to test them, remember you want the operations to be an over approximation:

                    @given(sets(integers()), sets(integers()))
                    def test_interval_abstraction_add(xs,ys):
                      r = Interval.abstract(xs) + Interval.abstraction(ys)
                      assert all(x + y in r for x in xs for y in ys)

                    Notice that the latice of the interval abstraction does not have a finite height, we can always add 1 to either end of the interval.

                    Runs forever

                    Consider the following program:

                    void example() {
                      int i = 0;
                      while (i >= 0) {
                        i = i + 1;
                      }
                    }

                    In this case our interval analysis will look something like this:

                    1{i[0,0]}2{i[0,0]}𝚜𝚎𝚎 𝚝𝚑𝚊𝚝 𝚒 >= 𝟶3{i[0,0]+[1,1]}𝚒 = 𝚒 + 𝟷4{i[0,0][1,1]}𝚕𝚘𝚘𝚙 𝚋𝚊𝚌𝚔5{i[0,1]}𝚜𝚎𝚎 𝚝𝚑𝚊𝚝 𝚒 >= 𝟶6{i[0,1]+[1,1]}𝚒 = 𝚒 + 𝟷7{i[0,1][1,2]}𝚕𝚘𝚘𝚙 𝚋𝚊𝚌𝚔8...𝚊𝚗𝚍 𝚜𝚘 𝚘𝚗

                    This might continue forever, unless we find a better way.

                    The Widening Operator §3.3

                    Because, we are not guranteed to reach a fixed point with the interval abstraction, we need to force it. This is known as the widening operator, the goal of the widening operator is to dramatically over approximate joins in a way that we are guranteed a fixed point in a finite number of steps.

                    One way of adding a widing operator to the interval abstraction is to us a constant set, often refered to as 𝐊. This set contains all the constants in the program or file of interest. This is like the dictionaries we used in the dynamic analysis.

                    To define the widing operator we define two operators minKJ and maxKJ which given a set of intergers J pics the largest and smallest, respectively, kK{,} st. kminJ and kmaxJ.

                    [i,j][k,h]=[minK{i,k},maxK{j,h}]

                    We can see that we can only apply this operator a finite number of times until we either get or .

                    Now when running our analysis we use the operator instead of the operator, and we can continue computing until we reach a fixed point.

                    Define a widing operator

                    Include a widening operator to your analysis, so that you can merge a finite number of times.

                    The Worklist Algorithm §3.4

                    Currently our algorithm is a little slow, as we essentially in lockstep iterate over the entier program for every step.

                    This is of cause silly, as we would much rather only handle unprocessed events. Consider you have a stepping function from a program locations and an abstract state to a set of program locations and abstract states, which needs to joined at those points 𝚜𝚝𝚎𝚙:𝐏𝐂×𝐒𝐭𝐚𝐭𝐞̂2𝐏𝐂×𝐒𝐭𝐚𝐭𝐞̂.

                    Now the analysis algorithm can be written something like this:

                    def analyse(i : tuple[PC, AState]):
                      state : dict[PC, AState] = dict()
                      needs_work : list(PC) = []
                    
                      # Add the initial state to the set of items that need work, and set the state.
                      needs_work.append(i[0])
                      state[i[0]] = i[1]
                    
                      # Iterate until the needs_work set is empty:
                      while needs_work:
                        next = needs_work.pop()
                        for pc, astate in step(next, state[next]):
                          # merge the new state into the old using the widening operator.
                          old, state[pc] = state[pc], wide(state[pc], astate)
                    
                          # if the state has changed, append it to the work list.
                          if old != state[pc]:
                            needs_work.append(i)

                    This approach uses significantly less instructions than running all steps every time, and still produce the same results.

                    The Reverse Post-Order §3.4.1

                    However there is still some work we can do to improve the algorithm. For-example, when you analyse the following program:

                    void fn() { 
                      int i = 0;
                      while (i < 100) { 
                        i += 1;
                      }
                      // Many operations on i.
                    }

                    There is no reason to update the states after the while loop, before we have reached fixpoint of the while loop.

                    The correct order to do this computation is in reverse post-order. If you do this you are guranteed to process any loops first.

                    Until next time!

                    Until next time, write the best analysis you can and upload the results to Autolab, and ponder the following:

                    1. What is the primary problem when doing unbounded static analysis?

                    2. Why do we need a widening operator?

                    Bibliography

                    1. Cousot, Patrick; Halbwachs, Nicolas (1978). Automatic discovery of linear restraints among variables of a program. doi:10.1145/512760.512770 link

                    2. Miné, Antoine (2006). The octagon abstract domain. doi:10.1007/s10990-006-8609-1 link

                    3. Nielson, Flemming; Nielson, Hanne Riis (2020). Program Analysis (an Appetizer). link