Context Sensitive Analysis

A method for handling methods

Christian Gram Kalhauge

The goal of this chapter is to ask and answer the following questions:

Why are method calls complicated?
Why would you use context sensitive analysis?
What is the drawback of a context sensitive analysis?

In this chapters, we are going to take a look at how to make static analysis work on method calls.

1. Why Method Calls are complicated.

Method calls are one of the most common cases of dynamic control flow. In our previous chapters, every time we encountered a jump instruction it was to a known static location, with dynamic control flow the jump location is only known at run-time. While the jump site is known at statically when calling a static method, the return call is unknown -- it depends on the caller. So when running a static analysis we don't know where to return the results.

1.1. In-lining

The naive solution is to inline the method call. In theory, this means expanding the content of the callee-method, the method that is being called, inside the caller-method, and then continuing normal abstract interpretation.

This is a bad idea, for multiple reasons, it can be hard to correctly change the code, we can not reuse the analysis across calls, and worst of all we can not handle recursive calls. Recursive calls will keep coming back as a recurring problem.

1.2. Abstract Calls

An alternative solution is to think about how the computer actually calls a method. When the computer calls a method, it stores a return address on the stack. When we return from a method, we use this return address as the target.

Figure: The call stack, notice how the return address is stored secretly among the local variables. By R. S. Shaw (modified) - Own work, Public Domain link

So one solution to the call problem is to at each state to keep track of a set of return address alongside the current state. When we now reach a return call, we have a set of targets. We can use this set to update the state with the return value at each of those targets.

In theoretical terms, your state is now a map from program points to a set of return states $(2^{ι})$ as well as an abstraction of the powerset of states $\overline{𝐒 𝐭 𝐚 𝐭 𝐞}$ .

ι \mapsto 2^{ι} \times \overline{𝐒 𝐭 𝐚 𝐭 𝐞}

Notice that we do not have to abstract the set of return addresses, as it is finite (and normally quite small, e.g ., less that 100).

Include a return address in the state

Besides the locals and the stack, have every state also include a set of return addresses. If you are in main this set is empty.

At every call site, you should build a new state containing of the abstract parameters from the stack as well as a return set containing the address of the call-site, and then merge it into the first state of the target method.

At every return site, you should return the value and put it on the stack of all the possible return addresses.

Otherwise, the analysis stay the same.

Hover to see partial implementation

# The step function, now either returns a next state, or 
# a return value that can be blended with the caller state
def step(pc, state : AbstractState) -> Iterator[(PC, NextState | ReturnValue)] :
  # Partial implementation of new things:
  match bytecode[pc_]:
    case Call(method):
      # Create a call_state from the locals with return to this pc
      call_state = AbstractState.from_locals(state.take_args(len(method.args)), return={pc})
      yield (PC(method=method, offset=0), NextState(call_state))
    case Return():
      value = state.pop_stack()
      for rp in state.returns: 
          yield (rp, ReturnValue(value))
  ...

def analyse(pc: PC, inputs: list[AbstractValue]):
  # initialize the state and the worklist
  state : dict[PC, AbstractState] = dict()
  state[pc] = AbstractState.from_locals(locals, return={})
  worklist = [pc]


  # work your way through the worklist
  while worklist:
    pc_ = worklist.pop()
 
    # the step function now can either return a state or 
    # return a value.
    for (next_pc, next_action) in step(pc_, state[pc_]):
      match next_action:
        # We can still return a new next state as we are used to
        case NextState(next_state):
          next_state = next_state
        # But now we can also return a single value.
        case ReturnValue(value):
          # The next_pc points to the method call, so we need to pop the arguments from 
          # the stack and replace them with the return value.
          next_state = AbstractState.simulate_call(state[next_pc], value)
          # also update the pc to the next position.
          next_pc = next_pc + 1
     
        # continue abstract interpretation as usual
        new_state = state[next_pc] | next_state 
        if new_state != state[next_pc]: 
          state[next_pc] = new_state
          worklist.append(next_pc)

However this comes at a cost. Keeping only a single instance of a method, means that every use of that method leaks into every other. Consider the following example, run with the sign analysis.

2. Context Sensitive Analysis

[ 1 ]

Context Sensitive Analysis is about expanding the abstract state to be able to react to the context in which the object is used. In our example from the last section, it would be nice to differentiate between the two call sites sign(-10) and sign(0), as they produce very different results.

2.1. Full Call Stack Context

One way of doing this, is by making the state of a bytecode instruction be dependent on the call stack. We can write the total state as a map from bytecode offsets to a map from a stack of return address to an abstract state:

ι \mapsto (ι * \mapsto \overline{𝐒 𝐭 𝐚 𝐭 𝐞})

Joining the states are straight forward, we simply merge them point-wise by iterating over the domains (the keys) of each of the maps. Here $A [i]$ returns $⊥$ if $i \notin 𝐃𝐨𝐦 A$ .

A ⊔ B = {A [i] ⊔ B [i] | i \in (𝐃𝐨𝐦 A \cup 𝐃𝐨𝐦 B)}

This has the advantage of giving us precise static information. In our example, from above the two calls would be differentiated by their call stack, and we can remove the faulty warning.

Add call context to your analysis

Instead of having a mapping from bytecodes to state, make it into a mapping from bytecode to a mapping from call stack to states.

Context = list[BC]
State = dict[BC, dict[Context, AbstractState]]

To take advantage of this new state you will have to make some modifications to your code:

When calling a method, only update the abstract state at your current context with the current state appended.
When returning from method, only update the abstract state of the bytecode of the top return address of the call-stack in the context of the rest of the call-stack.

static int sign(int i) {  
  if (i < 0) return -1; 
  else if (i == 0) return 0; 
  else return 1; 
}
// Called from: main:10 input [-10,-10] returns [-1,-1]
// Called from: main:11 input [0,0] returns [0,0]

public static void main(String[] args) {
  int[] a = new int { sign(-10) };  
  // a is (ArrayOf size: [-10,-10] length: [1,1])
  return a[sign(0)];                
  // lookup is fine sign (from [main:11] -> [0,0])
}

2.2. (k-CFA) Limited Context Sensitivity

The problem with the approach above is two-fold. The first problem is space, as we have to keep a version of state for each reachable call-stack, this can quickly explode in size. The second problem is recursive methods. In recursive methods, the call-stack is often dependent on the input, and can make the analysis run forever.

As with everything else in static analysis, we have to abstract. Instead of keeping the entire call-stack, we just keep the last $k$ call-sites, this is called k-CFA [ 2, 3 ].

In object oriented programming languages are $k \leq 2$ often feasible, and polynomial [ 3 ], however in functional languages it is EXPTIME-complete meaning that there exist no polynomial-time algorithm that solves the problem.

In Java, it is often fine (and necessary) to choose $k = 1$ to get a good analysis.

References

1. George Balatsouras, George Kastrinis , and Yannis Smaragdakis. Introspective Analysis: Context-Sensitivity, Across the Board.
2. Olin G Shivers. Control-Flow Analysis of Higher-Order Languages.
3. David Van Horn, Yannis Smaragdakis , and Matthew Might. Resolving and Exploiting the k-CFA Paradox.