Dynamic Analysis

Let's run it and see?

Christian Gram Kalhauge

Table of Contents
  1. What is Dynamic Analysis? §1
     1. Trace Selection §1.1
     2. Trace Analysis §1.2
     3. Trace Abstraction and Prediction §1.3
        1. The Path Equivalence Class §1.3.1
        2. Coverage §1.3.2
        3. Trace Abstraction §1.3.3
  2. Running the Program §2
     1. Why Should We Run the Program? §2.1
     2. Why Shouldn't We Run the Program? §2.2
  3. Testing §3
     1. Characterization Trace Testing §3.1
     2. Assertions §3.2
  4. Selecting Input §4
     1. Random Inputs §4.1
     2. Dictionaries §4.2
     3. Coverage-Guided Fuzzing §4.3
     4. Property Based Testing §4.4
     5. The Small-Scope Hypothesis §4.5

In this topic, we are going to introduce the technique of dynamic analysis. We are going to focus on the simplest version: just running the program. First we cover the basics in What is Dynamic Analysis? (§1), then we discuss some limitations and advantages of Running the Program (§2), touch on the simplest form of dynamic analysis, namely Testing (§3), and, finally, we go over different techniques for Selecting Input (§4).

                              What is Dynamic Analysis? §1

Dynamic analysis is finding out information about a program by running it. Actually, you have been doing dynamic analysis most of your career without knowing it. When you have been writing tests for your program, you have been doing dynamic analysis. When you have inserted log statements in your program, you have been doing dynamic analysis. And, when you have searched through a log to recreate a bug in production, you have been doing dynamic analysis.

It might seem obvious that we can learn about a program by running it, but it is far from obvious what we learn and even how to run the program. In theory, dynamic analysis is about extracting information about all the behaviours of a program from a few executions (or traces).

                              Dynamic analysis consists of three phases, Trace Selection (§1.1), Trace Abstraction and Prediction (§1.3), and Trace Analysis (§1.2).

[Figure: the three steps of a dynamic analysis. First (1.1) we select a trace from the program, then (1.3) we try to predict a class of traces from which our trace came, and (1.2) we analyse the set of traces. If we did not find anything, we might repeat the analysis (2.1).]

                              Trace Selection §1.1

First, recall the definition of a Transition System from Transition System and Traces (§2.4). A program can be described as a triple (𝐒𝐭𝐚𝐭𝐞_P, δ_P, I_P), where δ_P is the transition function and I_P is the set of initial states.

In the first phase, we select a trace from the transition system. This consists of choosing an initial state σ ∈ I_P, and then executing the program to pick a trace τ:

τ ∈ 𝐒𝐞𝐥𝐞𝐜𝐭(P, σ) ⟺ τ ∈ Sem(P) ∧ τ_0 = σ
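As a minimal sketch of this selection step (assuming a deterministic program represented by a hypothetical transition function step that returns None in a terminal state), selecting a trace is just running the program while recording every state:

```python
def select(step, sigma, limit=1000):
    # Run the transition function from the initial state sigma,
    # recording every state; stop in a terminal state (signalled
    # here by step returning None) or when the step limit is hit.
    trace = [sigma]
    while len(trace) < limit:
        nxt = step(trace[-1])
        if nxt is None:
            break
        trace.append(nxt)
    return trace

# A toy deterministic program: count down to zero, then stop.
def countdown(n):
    return n - 1 if n > 0 else None

print(select(countdown, 3))  # [3, 2, 1, 0]
```

Since countdown is deterministic, each initial state yields exactly one trace, as the text notes.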

If the program is deterministic, there is only one trace per input state. If the program is non-deterministic, as is for example the case for parallel programs, each initial state might give rise to multiple traces.

                              Selecting the initial state or even executing the program is not trivial in practice (see Running the Program (§2)).

                              Trace Analysis §1.2

                              Now that we have a trace, we can make several deductions about behaviours of the program by analysing the trace. Common analyses are (slightly abusing notation):

• did it end in a failure? 𝐓𝐫𝐚𝐜𝐞_𝚎𝚛𝚛(τ) ⟺ τ_{-1} = err(‘...’), where τ_{-1} is the final state

• did any opened resource not get closed? 𝐓𝐫𝐚𝐜𝐞_𝚛𝚎𝚜(τ) ⟺ ∃f, i. 𝐎(τ_i, f) ∧ ∀j. ¬𝐂(τ_j, f) ∨ j < i

• did we execute this instruction? 𝐓𝐫𝐚𝐜𝐞_𝚌𝚘𝚟𝚎𝚛(i)(τ) ⟺ i ∈ {e.ι | e ∈ τ}
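Over a concrete trace representation, these analyses become one-liners. In the sketch below a trace is simply a list of events, each a pair of an instruction offset and an optional error tag (a simplification of the definitions above; the representation is my own):

```python
def trace_err(trace):
    # Did the trace end in a failure?
    return trace[-1][1] is not None

def trace_cover(i, trace):
    # Did we execute instruction i?
    return i in {offset for offset, _ in trace}

# Events are (instruction offset, error-or-None) pairs.
tau = [(0, None), (4, None), (9, "err: division by zero")]
print(trace_err(tau))       # True
print(trace_cover(4, tau))  # True
print(trace_cover(7, tau))  # False
```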

Any question we can answer correctly for one trace can be turned into a may analysis for the full program. If we find a trace that ends in a failure, the program might fail. If we find a trace where an opened resource is not closed, we know that the program may not close all resources. So, essentially, for any property X we can find in one trace, we can build a may analysis which can detect it:

∃σ ∈ I_P, τ ∈ 𝐒𝐞𝐥𝐞𝐜𝐭(P, σ). 𝐓𝐫𝐚𝐜𝐞_X(τ) ⟹ X(P)

Actually, we can increase the precision of the analysis by running the program multiple times. And, in the limit (assuming that we could run all traces), a dynamic analysis precisely captures any property:

∃σ ∈ I_P, τ ∈ 𝐒𝐞𝐥𝐞𝐜𝐭(P, σ). 𝐓𝐫𝐚𝐜𝐞_X(τ) ⟺ X(P)

No analysis is truly sound!

I was convinced that we could create a sound dynamic memory analysis for a company focusing on embedded devices.

                              The idea was simple, fuzz the program while using a memory sanitizer to check if all memory that was allocated was eventually deallocated.

The analysis per se was sound: every trace we warned about did contain a memory error. But the company in question had a coding style where they didn't clean up the initial memory allocation, because it was cleaned up when the device was shut off. The warnings we produced were not a problem to this company, and therefore false positives.

The lesson: no analysis is fire and forget; they all require some coding conventions to give good results.

                              Trace Abstraction and Prediction §1.3

The problem is, of course, that the number of traces in the semantics of most programs is infinite (or close to it), and it is impossible to cover them all without some trickery. Therefore, we should try to expand the number of traces we cover every time, by inferring information about other traces from one trace. We call this Trace Prediction.

Trace prediction works by mapping each trace τ from the trace space into a possibly finite set of trace classes [τ]. If we can show that an analysis over the trace class is equivalent to a trace analysis over all traces in the trace class, 𝐓𝐫𝐚𝐜𝐞_X([τ]) ⟺ ∃τ′ ∈ [τ]. 𝐓𝐫𝐚𝐜𝐞_X(τ′), we can save a lot of executions. And if we are also able to avoid any already-covered trace class in our next trace selection, we can potentially cover all traces of a program.

This step is especially useful when dealing with parallel programs. It can be very hard to run into deadlocks or data races. With trace prediction, we can infer that there exists a data race in the program even if we don't run into it (Kalhauge 2018).

                              The Path Equivalence Class §1.3.1

The most interesting trace class is the path equivalence class: [τ]_π. In this trace class, we say that all traces that follow the same path through the program are equivalent. This class is interesting because many errors are path dependent. For example, consider the following buggy program, which checks the wrong operand of the division operator:

                              int checkTheWrongThing(int a) { 
                                if (a != 0) { 
                                  return a / 0;
                                } else { 
                                  return 0;
                                }
                              }

This program contains a near-infinite number (2³²) of traces, the ones starting from 0, 1, -1, 2, and so on, but only contains two path equivalence classes: the one ending in an error and the one that does not.

In our JVM language, the path equivalence class is all traces that have the same bytecode offsets in the same order:

[τ]_π = {t ∈ 𝐓𝐫𝐚𝐜𝐞 | ∀i ∈ [1, ∞). (t_i).ι = (τ_i).ι}
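With a concrete event representation (pairs of a bytecode offset and a state sketch; the encoding is my own), checking whether two traces fall in the same path equivalence class is a one-liner:

```python
def same_path(t1, t2):
    # Path equivalence: the same bytecode offsets, in the same order.
    return [off for off, _ in t1] == [off for off, _ in t2]

# Two runs of checkTheWrongThing with nonzero arguments follow the
# same path, so they land in the same path equivalence class:
run_with_1 = [(0, 1), (3, 1), (5, "err")]
run_with_2 = [(0, 2), (3, 2), (5, "err")]
run_with_0 = [(0, 0), (8, 0)]
print(same_path(run_with_1, run_with_2))  # True
print(same_path(run_with_1, run_with_0))  # False
```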

                              Coverage §1.3.2

                              When writing a dynamic analysis, it is nice to know how much of the possible paths you have covered. We can measure coverage in many different ways. The most common are line coverage, branch coverage, and instruction coverage.

Instruction coverage of a set of traces T is the set of instructions covered by them:

#ι(T) = {τ_i.ι | i ∈ [0, ∞), τ ∈ T}

The reason instruction coverage is a nice coverage metric is that the cover of the path equivalence class of a trace is the same as the cover of a single element from the class:

#ι([τ]_π) = #ι({τ})

So if we have uncovered instructions, either we have not found the trace class that covers them, or there is dead code in our program.
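The coverage definition translates directly to code (again with traces as lists of (offset, state) events; the representation is mine):

```python
def instruction_coverage(traces):
    # The set of instruction offsets covered by a set of traces.
    return {off for trace in traces for off, _ in trace}

t1 = [(0, "a"), (3, "b"), (5, "c")]
t2 = [(0, "a"), (8, "d")]
print(sorted(instruction_coverage([t1])))      # [0, 3, 5]
print(sorted(instruction_coverage([t1, t2])))  # [0, 3, 5, 8]
```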

                              Trace Abstraction §1.3.3

Trace abstraction is more of a practical concern. Storing the full state at every step is not only expensive, it is also wasteful. In most cases we only need very little of the state to find the problem. In the case of an error, we only have to emit the final error code if it happened, and in the case of code coverage, we only have to emit the instruction offsets.

At one end of the spectrum, some trace analyses can be done completely at runtime, even allowing the system to react to the analysis; this is known as Runtime Verification.

At the other end of the spectrum, some trace analyses might be so computationally heavy, or only make sense to run after a bug (or intrusion) has been found. Then it makes a lot of sense to trace just enough to run the analysis later.

                              Running the Program §2

                              Running the program has some advantages and disadvantages.

                              Why Should We Run the Program? §2.1

                              Running the program is often a good initial analysis we can do on most programs, and it has a lot of advantages:

• Most programming languages come with a built-in interpreter. Running the program can sometimes be easier than using a syntactic analysis.

• The analysis can present you with a concrete trace that produces the error. This makes it very easy to debug the error.

                              • Since the goal of the dynamic analysis is to underestimate the set of valid traces, if it warns us about a problem, then the problem is real.

                              Why Shouldn't We Run the Program? §2.2

It might seem obvious that we can learn things about the program by running it; however, running the program is not always as easy as we would like.

• Setting up the correct environment of a program can be hard to downright impossible without running it in production.

                              • Warning that something can happen by doing it, can be problematic if the program has side-effects like deleting the database or firing the missiles.

• Running the program will never tell you if something can't happen, like the program running forever.

                              Testing §3

The simplest form of dynamic analysis is testing. In a test, we run the program with a given input and check if it succeeds. Often, in real-life scenarios, we care about other properties than the program not crashing, and those can in most languages be encoded using assertions.

                              Turn your interpreter into a dynamic analysis

                              Your interpreter is actually already a dynamic analysis, but only for cases with no inputs. To use your interpreter as a dynamic analysis do the following:

                              1. If the method takes arguments, exit early, otherwise,

                              2. Run the case with your interpreter,

3. Give the behaviour the case exhibited 100%, and all others 0%.

                              Characterization Trace Testing §3.1

In many cases, it is too complicated (or the assessment of correctness is subjective) to actually check that the output of a test is correct. Instead, we can save the output to a file, or trace the events taken to produce the output. After we have checked that the trace or output is correct, we can store it in the code repository. We call that the snapshot (or golden master). Every time we then re-run the test, if the file did not change, we know the output or trace has remained the same.

This technique is perfect at catching regressions.
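A minimal characterization test can be sketched in a few lines of Python (the snapshot directory and function name are my own, not part of any framework):

```python
from pathlib import Path

def snapshot_test(name, output, snapshot_dir=Path("snapshots")):
    # Compare output against a stored golden master: record it on
    # the first run, and fail on any later change (a regression).
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    snapshot = snapshot_dir / f"{name}.expected"
    if not snapshot.exists():
        snapshot.write_text(output)  # first run: record the snapshot
        return True
    return snapshot.read_text() == output
```

After the first run the recorded file should be inspected by hand and committed to the repository; from then on, any mismatch is flagged.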

                              Snapshot Test Your Interpreter

The benchmark suite is already perfectly set up to record snapshot tests of your interpreter. To also track that you are not making any regressions on your interpreter, you can simply run:

                              $ uv run jpamb interpret -r out.expected <your_interpreter>

                              Assertions §3.2

One specifically useful tool for improving the usefulness of testing is to use assertions to check that the behavior of the system is correct. The cool idea here is that using assertions we can catch programmer errors early and closer to the bug, instead of as part of a complicated crash.

                              Tiger Style

In the TigerBeetle database they use the Tiger Style. In this style they explicitly require two assertions per function, which test both positive space (what you expect) and negative space (what you do not expect).

For example, a function calculating the area of a rectangle could check that the height and width are both positive before doing the calculation.

public static int area(int height, int width) {
  assert height > 0;
  assert width > 0;
  return height * width;
}

                              Negative Space Programming

When writing your interpreter, at every point you have an assumption, add an assertion.

For example, to get started, it is perfectly acceptable to assume that all get operations are to assertionsDisabled, and should return false; however, that might not be the case in the more advanced examples.

                              case Get(field=field):
                                assert field.fieldid.name == "assertionsDisabled"
                                frame.stack.push(Value.int(0)) # Push False

                              Inserting these assertions by hand can be a lot of work, and a good dynamic analysis can insert some of them automatically. Since Java is a memory safe language, most out of bounds memory accesses are caught by the language at runtime. You can see this as automatically inserting assertions.

For other languages like C, there exist tools to extend the capabilities of testing by adding memory checks at runtime. These are called sanitizers.

                              Selecting Input §4

One of the most successful dynamic analysis strategies is called Fuzzing. There exist many different kinds of fuzzing, but I'm going to use the definition of Zeller (2024): fuzz testing is about automating software testing by generating tests automatically.

                              Essentially, fuzz testing is all about trace selection.

                              Random Inputs §4.1

                              The easiest way to get started with fuzzing, is to simply choose an input at random at the start of the program and then see if it can make the program fail.

This works because most traces of a program can be approximated using relatively few inputs. The hypothesis is that the input space roughly corresponds to the path equivalence classes. And this is correct for examples like this:

                              void split(int i) { 
                                if (i > 0) { 
                                  ...
                                } else {
                                  ...
                                }
                              }

Here we can clearly see there are two path equivalence classes, those with i > 0 and those with i ≤ 0. In this case, choosing i randomly gives an equal chance of hitting either branch.
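The whole strategy fits in a few lines. The program under test and the input domain below are stand-ins; any program that raises an exception on a bug would do:

```python
import random

def random_fuzz(program, gen_input, budget=10_000):
    # Run the program on randomly chosen inputs and report the
    # first input that makes it crash, if any.
    for _ in range(budget):
        x = gen_input()
        try:
            program(x)
        except Exception:
            return x  # the program may fail on this input
    return None       # nothing found; NOT a proof of correctness

def buggy(i):
    # Crashes (ZeroDivisionError) only when i == -1.
    return 1 // (i + 1)

random.seed(0)  # fixed seed so the run is reproducible
found = random_fuzz(buggy, lambda: random.randint(-10, 10))
print(found)
```

With the failing value making up roughly one in twenty-one of the domain, random sampling finds it almost immediately; the point of the next sections is what happens when it does not.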

                              Extend your dynamic analysis

Build a new analysis that uses your interpreter. To generate traces, you can choose inputs randomly from their domains and then run the interpreter.

Now, if the program exhibits a queried behavior, emit 100% for that behavior; otherwise return 50% or nothing for the rest of the behaviors.

                              Figure out a good limit for how many random inputs to use.

                              Tag your solution with random and dynamic.

                              Dictionaries §4.2

The problem is, of course, that not all branches are like that; some check for specific values.

                              void isEqualToZero(int i) { 
                                if (i == 0) { 
                                  ...
                                } else {
                                  ...
                                }
                              }

In this case, we can use a trick called dictionaries. First we use a syntactic analysis to scan the code for constants, e.g. 1 and "test". Then we add those constants to a dictionary, which we can sample from when we choose random values.
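A sketch of the idea (the dictionary here is hard-coded; in a real tool it would come from the syntactic scan):

```python
import random

def gen_with_dictionary(dictionary, p=0.5, lo=-2**31, hi=2**31 - 1):
    # With probability p, pick an interesting constant harvested
    # from the program text; otherwise pick a uniformly random value.
    if dictionary and random.random() < p:
        return random.choice(dictionary)
    return random.randint(lo, hi)

# Constants a syntactic scan of isEqualToZero would find:
dictionary = [0]

random.seed(0)  # fixed seed for reproducibility
samples = [gen_with_dictionary(dictionary) for _ in range(100)]
print(0 in samples)  # the i == 0 branch is now hit quickly
```

Without the dictionary, hitting i == 0 by uniform sampling of 32-bit integers would take billions of tries on average.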

                              A hybrid analysis

                              Can you improve your analysis by first scanning the input method for interesting values?

Tag your solution with random, dictionary, syntactic, and dynamic.

                              Coverage-Guided Fuzzing §4.3

The problem with the above-mentioned approaches is that they are not good at directing the execution down specific paths of the program. Consider the arraySpellsHello test case:

                              @Case("([C: 'h','e','l','l','o']) -> ok")
                              @Case("([C: 'x']) -> assertion error")
                              @Case("([C: ]) -> out of bounds")
                              @Tag({ ARRAY })
                              public static void arraySpellsHello(char[] array) {
                                assert array[0] == 'h'
                                    && array[1] == 'e'
                                    && array[2] == 'l'
                                    && array[3] == 'l'
                                    && array[4] == 'o';
                              }

Using random testing or the dictionary approach alone, we would have a hard time finding hello by chance. A cool approach to get around this, which is used in real applications, is Coverage-Guided Fuzzing (Fioraldi 2020).

This approach maintains a list of byte-strings called test cases. The user of the fuzzer is expected to provide a function that can parse those byte-strings into inputs of the program. And now the fun begins.

                              We start with the empty byte-string as input, then we find its coverage (the bytecode offsets it executes). We add that byte-string to a list we call interesting.

                              Now, while we have interesting test-cases:

                              • We pick a test-case, and manipulate it, by either changing a byte, removing a byte or appending a byte.

• Then we run it, and see if it adds coverage compared to the previous cases.

• If it does, add it to the interesting cases, otherwise ignore it.

                              • Repeat.
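The loop above can be sketched as follows. Real fuzzers instrument the binary; here a hypothetical run function stands in for that, returning the set of offsets a byte-string covers:

```python
import random

def mutate(bs):
    # Change, remove, or append a single byte of a test case.
    bs = bytearray(bs)
    op = random.choice(["change", "remove", "append"])
    if op == "append" or not bs:
        bs.append(random.randrange(256))
    elif op == "remove":
        del bs[random.randrange(len(bs))]
    else:
        bs[random.randrange(len(bs))] = random.randrange(256)
    return bytes(bs)

def coverage_guided_fuzz(run, budget=5000):
    interesting = [b""]      # start from the empty byte-string
    covered = run(b"")
    for _ in range(budget):
        candidate = mutate(random.choice(interesting))
        cover = run(candidate)
        if not cover <= covered:     # the candidate adds coverage
            interesting.append(candidate)
            covered |= cover
    return interesting, covered

# A stand-in for instrumentation: every matched prefix of "hello"
# reaches one more offset of arraySpellsHello.
def run_spells_hello(bs):
    n = 0
    while n < min(len(bs), 5) and bs[n] == b"hello"[n]:
        n += 1
    return set(range(n + 1))

random.seed(0)
cases, cov = coverage_guided_fuzz(run_spells_hello)
```

Each new prefix of "hello" adds coverage and is kept, so the fuzzer climbs toward the full string one byte at a time instead of having to guess all five bytes at once.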

                              A Coverage-Guided Analysis

                              Can you improve your analysis by using the coverage-guided technique?

                              Tag your solution with random, coverage, and dynamic.

                              Property Based Testing §4.4

When using random inputs to generate inputs to tests, we also have to change the way we test. Instead of testing for equalities, we now have to test for properties. This approach is called property based testing.

In Python, there is a very nice library called Hypothesis, which helps you do just that. In the following example from its website, they test that their sorting algorithm my_sort is equivalent to the builtin sort on all lists of integers and floats:

                              from hypothesis import given, strategies as st
                              
                              @given(st.lists(st.integers() | st.floats()))
                              def test_sort_correct(lst):
                                  # lst is a random list of numbers
                                  assert my_sort(lst) == sorted(lst)
                              
                              test_sort_correct()

                              The Small-Scope Hypothesis §4.5

                              As an alternative to the fuzzing technique, we have the small-scope hypothesis:

                              The Small-Scope Hypothesis

                              Most bugs in a program can be found by investigating only a small section of the inputs.

Model-checkers like Alloy (Jackson 2006) and testing frameworks like SmallCheck (Runciman 2008) use this idea by only investigating small test cases. Instead of testing random numbers, you test only a few small inputs. The good things about this approach are that

• you can be deterministic and always test the same inputs, because you can cover all small values up to a depth d, and

• if a buggy input is found, it is the smallest input that can create that error.

This technique excels at small cases, like checking for empty lists and the like:

                              void isEqualToZero(int i) { 
                                if (i == 0) { 
                                  ...
                                } else {
                                  ...
                                }
                              }

                              However, the technique does have clear disadvantages: what if the error hides behind a big input?

                              void isEqualToAMillion(int i) { 
                                if (i == 1000000) { 
                                  ...
                                } else {
                                  ...
                                }
                              }

                              In the following activity, you can seek inspiration from the SmallCheck paper Runciman (2008).

                              Make use of the small-check idea

Instead of trying random inputs, create a generator that can generate a list of all inputs smaller than some depth d.

To generate integers you can use the following Python generator:

                              def gen_int(depth): 
                                yield 0
                                for i in range(depth):
                                  yield (i + 1)
                                  yield -(i + 1)

Now use iterative deepening to slowly increase d ∈ [1, 2, 3, …, D] until some max depth D.
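Putting the generator and the deepening loop together gives a tiny SmallCheck-style driver (a sketch; the property interface is my own):

```python
def gen_int(depth):
    yield 0
    for i in range(depth):
        yield (i + 1)
        yield -(i + 1)

def smallcheck(prop, max_depth):
    # Iterative deepening: test all inputs of depth 1, then 2, ...
    # The first counterexample found is also the smallest one.
    for d in range(1, max_depth + 1):
        for x in gen_int(d):
            if not prop(x):
                return x
    return None

# isEqualToZero's special branch is found already at depth 1 ...
print(smallcheck(lambda i: i != 0, max_depth=5))        # 0
# ... but a bug hiding behind a big input is out of reach:
print(smallcheck(lambda i: i != 1000000, max_depth=5))  # None
```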

                              Tag your solution with smallcheck and dynamic.

                              Until next time!

                              Until next time, write the best analysis you can and upload the results to Autolab, and ponder the following:

1. Name cases where the choice of a dynamic analysis makes the most sense.

2. What are the primary limitations and challenges of dynamic analysis?

3. When is random sampling of inputs better than testing small values, and vice versa?

                              Bibliography

                              1. Andoni, Alexandr; Daniliuc, Dumitru; Khurshid, Sarfraz; Marinov, Darko (2003). Evaluating the Small Scope Hypothesis. link

                              2. Fioraldi, Andrea; Maier, Dominik; Eißfeldt, Heiko; Heuse, Marc (2020). AFL++: Combining incremental steps of fuzzing research. link

                              3. Jackson, Daniel (2006). Software Abstractions: Logic, Language, and Analysis.

                              4. Kalhauge, Christian Gram; Palsberg, Jens (2018). Sound deadlock prediction. doi:10.1145/3276516 link

                              5. Prikler, Liliana Marie; Wotawa, Franz (2025). Reevaluating the Small-Scope Testing Hypothesis of Answer Set Programs. doi:10.1007/978-3-031-80889-0_6 link

                              6. Runciman, Colin; Naylor, Matthew; Lindblad, Fredrik (2008). Smallcheck and lazy smallcheck. doi:10.1145/1543134.1411292 link

                              7. Zeller, Andreas; Gopinath, Rahul; Böhme, Marcel; Fraser, Gordon; Holler, Christian (2024). The Fuzzing Book. link