Bounded Static Analysis

Everything Everywhere All at Once

The goal of this chapter is to ask and answer the following questions:

1. What is Bounded Static Analysis?

In bounded static analysis we want to talk about all the behaviors of the program up to some depth. There are no easy way to do this, because we have to handle a possible infinite set of traces, which by itself is hard to do.

The idea behind bounded static analysis is straight forward. Instead of selecting a single initial state and then applying the semantics on that until a trace has been made. This is what we did with dynamic analysis. We start with the set initial states and then apply the semantics to all states at once. Essentially, if dynamic analysis is depth first search, Static Analysis is breath first. While we have the downside of having to work with infinite sets of traces, we can be smart about it, and use abstractions to represent the set using a finite number of traces.

1.1. In Theory

The Bounded Static Analysis on a program P can defined as the set of traces of depth n: 𝐁𝐒𝐀Pn. Using a many-stepping function 𝐬𝐭𝐞𝐩, that step set of traces up to length n to the set of traces up to length n+1.

𝐬𝐭𝐞𝐩P(T)=Tβˆͺ{Ο„β€²s|Ο„β€²βˆˆT,Ξ΄P(Ο„β€²|Ο„β€²|,s)}𝐁𝐒𝐀Pn=𝐬𝐭𝐞𝐩Pn(IP)

In the definition of 𝐬𝐭𝐞𝐩, Ξ΄ is the transition relation defined by the single step semantics, and Ο„β€²s means appending s to Ο„β€². Since we are always talking about the same program P we'll sometimes omit it.

Now, we check for assertion errors down to depth n, let's call this 𝐁𝐀𝐄Pn, by seeing if any trace ends in an assertion error:

𝐁𝐀𝐄Pn≑(Ο„s)βˆˆππ’π€Pn∧s=err(β€˜πšŠπšœπšœπšŽπš›πšπš’πš˜πš— πšŽπš›πš›πš˜πš›β€™)

1.2. May and Must Analyses

Sadly, it's infeasible to do computations over all traces at ones. Therefore, when designing an analysis, we either underestimate the set of traces this is called a must analysis, or overestimate the set of traces which is called a may analysis. A must analysis is called that, because we know what the program must do, and a may analysis is called that because we know what the program may do.

In the may analysis we overestimate every step, and in the must analysis we underestimate every step:

π¬π­πžπ©π—†π—Žπ—Œπ—(T)βŠ†π¬π­πžπ©(T)βŠ†π¬π­πžπ©π—†π–Ίπ—’(T)

Naturally, we'll see that

π¬π­πžπ©π—†π—Žπ—Œπ—n(IP)βŠ†π¬π­πžπ©n(IP)βŠ†π¬π­πžπ©π—†π–Ίπ—’n(IP)

The may analysis is great if we want to make sure that something newer happens. Essentially, if we want to warn about an assertion error may happen, we need a may analysis. If we want to grantee that an assertion error must happen on all runs, then we need a must analysis:

ππ€π„π—†π—Žπ—Œπ—nβŸΉππ€π„n¬𝐁𝐀𝐄nβŸΈΒ¬ππ€π„π—†π–Ίπ—’n

However, even with the concessions, where we are approximating the correct results we still need some extra magic, to make our analyses truly efficient.

We need abstractions!

2. Abstractions

Instead of working with sets of traces, we have to work with finite structures instead. We call these abstractions.

2.1. The Sign Analysis

Before we go down into the theory, let's start with a concrete example.

Assume that we are working with all the possible integer values that an variable has at some point in the program. It would be much easier to simply keep track of whether the set contains positive, negative or zero values.

We can represent this as a set that only contain three elements {+,βˆ’,0}, (0), larger than zero (+) and smaller than zero (βˆ’). This set of size 3!, which is much easier to work with, than the set of size 232!. In the following sections we are going to refer to this set as 𝐒𝐒𝐠𝐧.

𝐒𝐒𝐠𝐧={+,0,βˆ’}

We call 𝐒𝐒𝐠𝐧 an abstraction, as we can covert the concrete domain of integers into the abstract domain of 𝐒𝐒𝐠𝐧. We call this conversion the abstraction Ξ±:

Ξ±({1,2,3})={+}Ξ±({0})={0}Ξ±({βˆ’1,3})={+,βˆ’}Ξ±({βˆ’1,0,3})={0,+,βˆ’}

However before we continue this set, we first have to ensure that a computation in abstract domain, does not invalidate our over-approximation.

In the following sections we'll see that:

Make the Sign Abstraction

Implement the SignSet as an abstract set in the language of your choosing.

In Python, you can start with this:

from dataclasses import dataclass
from typing import TypeAlias, Literal

Sign : TypeAlias = Literal["+"] | Literal["-"] | Literal["0"]

@dataclass
class SignSet:
  signs : set[Sign]

Fasten your seatbelts it's going to be a bumby night!

2.2. Partially Ordered Sets (Posets)

A partially ordered set or poset is a tuple (L,βŠ‘) set of elements L and an ordering βŠ‘, that uphold:

reflexiveβˆ€a.aβŠ‘aantisymetricβˆ€a.aβŠ‘b∧bβŠ‘a⟹a=btransitiveβˆ€a.aβŠ‘b∧bβŠ‘c⟹aβŠ‘c

Common partially ordered sets are the integers (β„€,≀) (also in the other direction (β„€,β‰₯)), the booleans ({𝚝𝚝,𝚏𝚏},β‡’), and the set of 𝐒𝐒𝐠𝐧′s (2𝐒𝐒𝐠𝐧,βŠ†),

Check the rules

Think about if the rules apply to the posets above.

Make Sign a partially ordered set

Make your 𝐒𝐒𝐠𝐧 abstraction into a partially ordered set by implementing βŠ‘ (<=) in the language of your choice.

2.3. Lattices

A lattice is partially ordered sets (L,βŠ‘), with two extra operators βŠ” and βŠ“. βŠ” is the least upper bound aβŠ”b, meaning that βˆ€c.aβŠ‘c∧bβŠ‘c⟹aβŠ”bβŠ‘c.

The dual is true for βŠ“ that is the greatest lower bound βˆ€c.cβŠ‘a∧cβŠ‘b⟹cβŠ‘aβŠ“b.

Furthermore, this implies that there exist a least bound βŠ₯=β¨…L and a greatest bound ⊀=⨆L, from which we have the following identities: βŠ€βŠ“a=a=aβŠ”βŠ₯, and βŠ€βŠ”a=⊀ and aβŠ“βŠ₯=βŠ₯.

2.3.1. Hasse diagrams.

Hasse diagram phi Ø x->phi x {+} y->phi y {0} z->phi z {-} xy->x xy->y xy {+, 0} xz->x xz->z xz {+, -} yz {0, -} yz->y yz->z xyz {+,0,-} xyz->yz xyz->xz xyz->xy
Figure: A hasse diagram over the poset (2𝐒𝐒𝐠𝐧,βŠ†). By I, KSmrq, CC BY-SA 3.0 (with edits)

The reason why they are called latices is that they can be drawn using Hasse digrams which gives these nice structures, which looks like a wooden lattice.

Here we only draw the imitate next elements in the order. So while {+}βŠ†{+,0,βˆ’} we do not draw that edge because {+,0} and {+,βˆ’} are in the way.

Rate your favourite animals

Create a hasse diagram over your favorite animals, The order is (𝐀𝐧𝐒𝐦,β€˜πš’πšœ πš πš˜πš›πšœπšŽ πšπš‘πšŠπš—β€™). Where

𝐀𝐧𝐒𝐦={Cat,Dog,Ant,Worm,Spider,Horse,Rabbit}

While objectively (Catβ€˜πš’πšœ πš πš˜πš›πšœπšŽ πšπš‘πšŠπš—β€™Dog) some of the other animals like Ant and Worm are no worse than each other.

Is your order a lattice: Does every pair of animals have meet and a join?

2.3.2. Back to the Sign Abstraction

In our Sign abstraction, we use set inclusion as our order βŠ‘π’π’π π§=βŠ† and use the intersection and union of the underlying set as the meet (βŠ“) and join (βŠ”) of our lattice. βŠ₯=βˆ… and ⊀={+,βˆ’,0}.

Make the Sign a lattice

To make your 𝐒𝐒𝐠𝐧 abstraction into a lattice, you need to also implement the meet βŠ“ and join βŠ” operators.

In Python, you can implement __and__ and __or__ to get the meet operator a & b, and the join operator a | b.

2.4. Galois Connection

Figure: A Galois Connection is a connection between two ordered sets, with a concretion Ξ³ and an abstraction Ξ± function.

A Galois connection is a relationship between two ordered sets (C,βŠ‘C) and (A,βŠ‘A), where we can abstract information from the concrete domain into the abstract Ξ±:Cβ†’A while preserving the order if going back Ξ³:Aβ†’C. Ξ± explains how to abstract a value, and Ξ³ explains what the abstraction means.

Because the abstraction might be lossy it is not the case that if we abstract a value we get back the same concrete value Ξ³(Ξ±(c))β‰ c. Instead a Galois connection satisfies the following rule:

βˆ€c∈C,a∈A.cβŠ‘CΞ³(a)⇔α(c)βŠ‘Aa

It states, that if have a concrete value c, and abstract it Ξ±(c), then any value larger than this abstraction a will concertize to a value Ξ³(a) that is larger than the original value. At first this might not seem useful, but if we satisfy this rule we get the following laws for free:

cβŠ‘CΞ³(Ξ±(c))– if we abstract we only increase in sizeΞ±(Ξ³(a))βŠ‘Ca– if we concertize we only decrease in sizeΞ±(Ξ³(Ξ±(c)))=Ξ±(c)– we don’t keep loosing informationΞ³(Ξ±(Ξ³(a)))=Ξ³(a)– in both directions

The two first rules gives us confidence that whatever abstraction we choose only lose information never create it. And the second two rules gives us confidence that we don't keep losing that information.

2.4.1. Back to the Sign Analysis

We can see that there exist a Galoi connection between our concrete integer domain and our abstract sign domain (2β„€,βŠ†)↔γα(𝐒𝐒𝐠𝐧,βŠ‘π’π’π π§). Where the abstraction Ξ± is: Ξ±(N)={+|n∈N,n<0}βˆͺ{βˆ’|n∈N,n>0}βˆͺ{0|n∈N,n=0} And Ξ³ is: Ξ³(S)={n|nβˆˆβ„€,n<0,+∈S}βˆͺ{n|nβˆˆβ„€,n>0,βˆ’βˆˆS}βˆͺ{0|0∈S}

We can also see that we satisfy the rules, let's pick a concrete value {0,1}. If we convert that to an abstract value we get: Ξ±({0,1})={0,+}.

Since Ξ±({0,1})βŠ‘{0,+}βŠ‘{0,+,βˆ’}, and since Ξ³({0,+})=β„€0,+ and Ξ³({0,+,βˆ’})=β„€0,+,βˆ’, then we get that {0,1}βŠ†β„€0,+βŠ†β„€.

Build the Sign abstraction

In your favorite language build the sign abstraction.

You need an abstract function, which takes a set and turns it into your abstract domain.

@staticmethod
def abstract(items : set[int]): 
  ...
  return SignSet(...)

Then because we are using possible infinite sets, instead of generating that set, we can just implement the contains metod:

def __contains__(self, member : int): 
  ... 
  if (member == 0 and "0" in self.signs): 
      return true

You can use the Galoi rules to test your abstraction:

βˆ€xβˆˆβ„€:{x}βŠ†Ξ³(Ξ±({x}))

If you are using python you fuzz test this property using hypothesis.

from hypothesis import given
from hypothesis.strategies import integers, sets

@given(sets(integers()))
def test_valid_abstraction(xs):
  s = SignSet.abstract(xs) 
  assert all(x in s for x in xs)

2.5. Abstract Operations

Finally, we have reached the fun part. We can now define operations in the abstract domain, that mimic the operations in the concrete.

Because our concrete domain is the set of integers, we can define +set, as the operation over the product:

A+setB={a+b|a∈A,b∈B}{1,3}+set{0,10}={1,3,11,13}

Now we can also define the plus abstract domain:

{+}+𝐒𝐒𝐠𝐧{+}={+}{+}+𝐒𝐒𝐠𝐧{βˆ’}={βˆ’,+,0}{+}+𝐒𝐒𝐠𝐧{0}={+}{0}+𝐒𝐒𝐠𝐧{+}={+}{0}+𝐒𝐒𝐠𝐧{βˆ’}={βˆ’}{0}+𝐒𝐒𝐠𝐧{0}={0}{βˆ’}+𝐒𝐒𝐠𝐧{+}={βˆ’,+,0}{βˆ’}+𝐒𝐒𝐠𝐧{βˆ’}={βˆ’}{βˆ’}+𝐒𝐒𝐠𝐧{0}={βˆ’}…

And many more.

But, how do we know if we have implemented our abstract operation correctly? Since we want to over-approximate our operations, we know that computing in our abstract domain should newer under-approximate:

A+setBβŠ†Ξ³(Ξ±(A)+𝐒𝐒𝐠𝐧α(B))

This is of cause not easy to test for, instead we can use use our Galoi connection to see that, if only have to show the following:

Ξ±(A+setB)βŠ†Ξ±(A)+𝐒𝐒𝐠𝐧α(B)

Essentially, abstracting the result of an operation, should be smaller than abstracting and doing the operation in the abstract domain.

Implement Abstract Arithmetic

Implement arithmetic for your sign abstraction.

In Python, you can build an Arithmetic class that can handle all the abstract arithmetics. Also if you use hypothesis, you can test that we have done the conversion correctly by, testing the following:

@given(sets(integers()), sets(integers()))
def test_sign_adds(xs, ys):
  assert (
    SignSet.abstract({x + y for x in xs for y in ys}) 
      <= arithmetic.binary("add", SignSet.abstract(xs), SignSet.abstract(ys))
    )

For cases, where we compare like le, we might want to return set of booleans instead.

@given(sets(integers()), sets(integers()))
def test_sign_adds(xs, ys):
    assert (
      {x <= y for x in xs for y in ys} 
      <= arithmetic.compare("le", SignSet.abstract(xs), SignSet.abstract(ys))
      )

3. Bounded Abstract Interpretation

In practice, Galoi connections allows us to map our infinite domain of may or must analyses into a more manageable abstract domain, while giving us guarantees that we are still correctly over- or under-estimating.

Let's assume we are building a may analysis. Since we want to over approximate our 𝐬𝐭𝐞𝐩 function, we therefore introduce an abstract domain (A,βŠ‘A) that has an abstract stepping function 𝐬𝐭𝐞𝐩A. We assume the galoi connection (2π“π«πšπœπž,βŠ†)↔γα(A,βŠ‘A). As with the abstract operations, we want the following to be true:

TβŠ†π¬π­πžπ©(T)βŠ†Ξ³(𝐬𝐭𝐞𝐩A(Ξ±(T)))

Now we can apply our knowledge about Galoi connections, which we can prove by showing the following:

Ξ±(𝐬𝐭𝐞𝐩(T))βŠ‘A𝐬𝐭𝐞𝐩A(Ξ±(T))

We can use this new abstract stepping function to define an bounded static analysis of depth n: 𝐁𝐒𝐀An=𝐬𝐭𝐞𝐩An(Ξ±(IP)), and we can prove by induction over n that stepping in the abstract domain is in fact a may analysis 𝐁𝐒𝐀nβŠ†Ξ³(𝐁𝐒𝐀An).

Show that 𝐁𝐒𝐀An a may analysis

By induction over n, in the base case you have to show that 𝐬𝐭𝐞𝐩0(IP)βŠ†Ξ³(Ξ±(𝐬𝐭𝐞𝐩A0(IP))) and in the induction case you need to show that 𝐬𝐭𝐞𝐩(n+1)(IP)βŠ†Ξ³(𝐬𝐭𝐞𝐩(n+1)(Ξ±(IP))) but are given 𝐬𝐭𝐞𝐩n(IP)βŠ†Ξ³(𝐬𝐭𝐞𝐩n(Ξ±(IP))).

3.1. The State Abstraction

The first, and most useful abstraction, is the state abstraction. We can abstract every trace as it's final state.

(2π“π«πšπœπž,βŠ†)↔γα(2π’π­πšπ­πž,βŠ†)

We can define the abstraction from a set of traces as the set of end states and the concretion of a set of states as the set of traces that end in one of the states.

Ξ±(T)={Ο„|Ο„||Ο„βˆˆT}Ξ³(S)={Ο„|s∈S,Ο„βˆˆπ“π«πšπœπž,Ο„|Ο„|=s}

This is a useful abstraction for us, because we do not need to know where the trace came from to figure out if an assertion error exist in the program, only that a trace ends in an assertion error.

We can now define π¬π­πžπ©π’π­πšπ­πž as only stepping the final state of each trace:

π¬π­πžπ©π’π­πšπ­πž(S)=Sβˆͺ{sβ€²|s∈S,Ξ΄(s,sβ€²)}

Check that π¬π­πžπ©π’π­πšπ­πž is an abstract operation of 𝐬𝐭𝐞𝐩

Ξ±(𝐬𝐭𝐞𝐩(T))βŠ†π¬π­πžπ©π’π­πšπ­πž(Ξ±(T))

Now let ππ’π€π’π­πšπ­πžn=π¬π­πžπ©π’π­πšπ­πžn(Ξ±(IP)). It is now clear by induction on n that 𝐁𝐒𝐀nβŠ†Ξ³(ππ’π€π’π­πšπ­πžn). So if there exist an assertion error within n steps, then there exist a trace Ο„βˆˆππ’π€n st. Ο„|Ο„|=err(β€˜πšŠπšœπšœπšŽπš›πšπš’πš˜πš— πšŽπš›πš›πš˜πš›β€™) and |Ο„|≀n.

From our Galoi connection we can see that

{Ο„err(β€˜πšŠπšœπšœπšŽπš›πšπš’πš˜πš— πšŽπš›πš›πš˜πš›β€™)}βŠ†ππ’π€nβŠ†Ξ³(ππ’π€π’π­πšπ­πžn)⟹α({Ο„err(β€˜πšŠπšœπšœπšŽπš›πšπš’πš˜πš— πšŽπš›πš›πš˜πš›β€™)})βŠ†{err(β€˜πšŠπšœπšœπšŽπš›πšπš’πš˜πš— πšŽπš›πš›πš˜πš›β€™)}βŠ†ππ’π€π’π­πšπ­πžn

Which means that if err(β€˜πšŠπšœπšœπšŽπš›πšπš’πš˜πš— πšŽπš›πš›πš˜πš›β€™)βˆ‰ππ’π€π’π­πšπ­πžn, then {Ο„err(β€˜πšŠπšœπšœπšŽπš›πšπš’πš˜πš— πšŽπš›πš›πš˜πš›β€™)}βŠˆππ’π€n. Which is exactly the grantee we are looking for in a may analysis.

3.2. The Per-Instruction Abstraction

In our next abstraction we take advantage of that we treat states at with the same program counter alike.

Let's focus on a JVM without a method stack, which means that every state is βŸ¨Οƒ,Ξ»,ι⟩. We'll get back to talk about how to handle the method stack in later sections. We can abstract this by collecting the states per ΞΉ. Let 𝐏𝐜=ΞΉβ†’2π’π­πšπ­πž be a mapping from program counters to sets of states. 𝐏𝐜 is a lattice with partial order βŠ‘ππœ were one mapping is less than another if all states are smaller than the other and βŠ”ππœ is pointwise βˆͺ of states.

(2π’π­πšπ­πž,βŠ†)↔γα(𝐏𝐜,βŠ‘ππœ)

Convince yourself this is a Galoi connection

Now we can write a stepping function that can step all states with the same instruction at once 𝐬𝐭𝐞𝐩ι, and then merge the result into the other states.

𝐬𝐭𝐞𝐩𝐏𝐜(C)=CβŠ”ππœβ¨†ππœ{ι′↦{βŸ¨Οƒβ€²,Ξ»β€²,ΞΉβ€²βŸ©}|(ι↦S)∈C,Sβ€²=𝐬𝐭𝐞𝐩ι(S),βŸ¨Οƒβ€²,Ξ»β€²,ΞΉβ€²βŸ©βˆˆSβ€²}

Consider the following python, which does the same thing, given you have defined StateSet as an set of states:

def many_step(states : dict[Pc, StateSet]):
    next_states = states.copy()
    for pc, astate in states.items():
        b = bytecode[pc]
        # run the operation on the state set
        for (pc_, s_) in lookup(self, f"step_{b['opr']}")(b, pc, astate):
          # merge the new state with the current state or bottom
          next_states[pc_] = next_states.get(pc_, StateSet.bot) | s_
    return next_states

3.3. The Per-Variable Abstraction

We are now faced with our first hard choice.

We would love to use Sign abstraction we build in the previous section, one way of doing that is compressing the state, so that instead of having a set of states, we can have distribute the content of the locals and stack.

𝐏𝐯=ΞΉβ†’(2𝐕σ)⋆×(2𝐕λ)⋆βˆͺ{ok,err(β€˜.’)}

Since the stack and the locals are typed at each instruction, these are actually sets of integers, which we know how to abstract using the sign abstraction. Furthermore 𝐏𝐯 is also a partially ordered set, by point wise comparing the set of variables, and a lattice by point-wise joining and meeting the sets of variables. We can also see that (ππœβ†”Ξ³Ξ±ππ―) is a galoi connection.

The problem with this abstraction, and the reason this choice is hard, is that is the first that severely over-approximates our problem. More about this next time.

Define the Abstract State

Define the Abstract (per variable) State in your program analysis, remember to define the ordering as well as the meet and join operations. It might be useful to define three pseudo elements Bot, is an element that joins with all others, Err represent the error state and Ok represent the Ok state.

In Python, we could define an abstract state like so:

class AbstractState:
  stack: list[AbstactValues]
  locals: list[AbstactValues]

Remember to define the abstract, __le__, __and__, and __or__ methods.

Where the abstract values, if we only work with integers, could be the SignSet from before.

3.4. Getting Started

To get started with static analysis we need to make some changes to our interpreter, as our input values are going to be abstractions.

Extending the interpreter

Extend or rebuild your interpreter so that it is able to handle the per-instruction abstraction.

You can use the following snip-it for inspiration:

@dataclass
class AbstractInterpreter:
  bytecode : dict
  states : dict[ByteCodeOffset, AbstractState]
  final : set[str]
  arithmetic : Arithmetic

  def step(self):
    next_states = self.states.copy()
    for pc, astate in self.states.items():
        b = bytecode[pc]
        for (pc_, s_) in lookup(self, f"step_{b['opr']}")(b, pc, astate):
          # if the pc_ is -1 we are "done", and s_ is the end string
          if pc_ == -1:
            final.add(s_)
          else:
            # merge the new state with the current state or bottom
            next_states[pc_] = next_states.get(pc_, AbstractState.bot) | s_
    self.states = next_states
 
  # Stepping functions are now generators that can generate different 
  # states depending on the abstract state.
  def step_ifz(self, b, pc, astate):
    # depending on the defintion of the abstract_state 
      aval = astate.stack.pop()
      # Note that the abstract value might both compare and not compare to 0
      for b in self.arithmetic.compare(b["compare"], aval, 0):
        if b:
          yield (b["target"], astate.copy())
        else:
          yield (pc + 1, astate.copy())

Experiment with different Abstractions

You have been introduced to the Sign Abstraction, but next time we'll look at more different abstractions. If you have time, try to come up with your own.

References