Syntactic Analysis

A poor man's analysis

Christian Gram Kalhauge

The goal of this chapter is to ask and answer the following questions:

1. What is Syntactic Analysis?

As mentioned in the introduction, syntactic analysis is about analysing the structure of a program instead of its meaning. For example, to a syntactic analysis 2 + 1 is different from 1 + 2. This makes syntactic analyses ideal for capturing the intent of the developer, but not great for detecting complex logical bugs.

Many static analyses in the wild, often called linters, are simply syntactic analyses that match patterns in the code. They do this for two reasons. The first is that it is easier to develop and maintain than a correct static analysis. The second is that it can give better feedback to the developer, as you can highlight the pattern in the source.

1.1. The Levels of Syntax

bits → text → tokens → CST → AST → IR → Bytecode   (← more intention | more structure →)
Figure: The levels of syntactic matches range from matching the bits to matching the final compiled bytecode.

When matching syntax it is important to match at the right level of abstraction. Earlier levels contain more of the developer's intention, but later levels give us more structure, making the analysis more robust and easier to write.

We can start by analysing the text directly; this allows us to extract information about the white-space of the program. However, if we don't think that the white-space is important for the analysis, we can look at the tokens instead. The tokens are a list of elements matched by the lexer, which groups patterns like the integer literals 1232 and 232, so that they appear the same to the parser.
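A minimal sketch of such a lexer, using Python's re module (the token set here, integers, identifiers, operators, and parentheses, is my own choice for illustration):

```python
import re

# One alternative per token kind; leading \s* skips whitespace.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<int>\d+)|(?P<id>[a-z]\w*)|(?P<op>[+\-*/])|(?P<paren>[()]))"
)

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at {pos}")
        # lastgroup is the name of the alternative that matched
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

print(tokenize("1232 + 232"))
# both literals become ('int', ...) tokens, so they look like the
# same *kind* of thing to the parser
```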

The parser converts the list of tokens into a Concrete Syntax Tree (CST). A CST is the collection of all the tokens of the program, but placed in a tree-like structure. For example, 10 + (5 + 3) is turned into a tree like (+ 10 (parenthesis (+ 5 3))). This deals with precedence and other headaches we don't want to think about. E.g., is 10 + 3 * 5 equal to 10 + (3 * 5) or (10 + 3) * 5?

In contrast to the CST, the Abstract Syntax Tree (AST) does not have to follow directly from the structure of the text. This allows the AST to remove seemingly unimportant information from the tree, like parentheses: (( 10 )) and ( 10 ) are both stored as 10.

The compiler then converts the AST into some kind of Intermediate Representation (IR); this can be LLVM-IR, ClangIR, or RTL. At this level, a lot of work has been done by the compiler to disambiguate references (making all variables fully qualified) and to optimize some cases: x + 2 + 4 might now be mypackage.Class.x + 6, which makes some kinds of analysis easier. Finally, the IR is compiled to Bytecode; now most of the style from the original language is gone, and only pure functionality is left.
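Constant folding of the kind mentioned above can be sketched in a few lines. This toy pass works on a tuple-based AST of my own invention, not any real compiler's IR:

```python
def fold(expr):
    """Fold constant subexpressions in a tree of ('+', left, right)
    tuples, ints, and variable-name strings."""
    if isinstance(expr, (int, str)):
        return expr                      # leaves are already folded
    op, left, right = expr
    left, right = fold(left), fold(right)
    if op == "+" and isinstance(left, int) and isinstance(right, int):
        return left + right              # both sides known: compute now
    return (op, left, right)

# Folding the constant part of x + (2 + 4) gives x + 6:
print(fold(("+", "x", ("+", 2, 4))))   # ('+', 'x', 6)
```

Note that folding x + 2 + 4, which parses as (x + 2) + 4, additionally requires the compiler to reassociate the additions first.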

While every step down the ladder makes it easier to write analyses, we lose a little of the context and intention of the developer. For example, say we want to warn users about mixing || and && in code, because the precedence is often not known by developers. A good level to attack this is the CST level, as we can differentiate between a && b || c and (a && b) || c. The developer would get a warning for the first example but not for the second.

1.2. Goal of Syntactic Analysis

Since a syntactic analysis can never be sound (accepting only good programs), we often have to aim for completeness (rejecting no good program). This is in line with a study by Christakis and Bird, which shows that most developers would rather have an unsound analysis than too many warnings. Here are two quotes from the findings:

Program analysis design should aim for a false-positive rate no higher than 15–20%. -- Christakis and Bird
When forced to choose between more bugs or fewer false positives, they typically choose the latter. -- Christakis and Bird

2. A little Language Theory

Before we go to the exercises, it is nice to have a little understanding about language theory.

2.1. A Language, Formally Speaking

In Computer Science, we refer to a language as a set of sequences of symbols from an alphabet. If a string is in the language, we refer to it as a word.

Language

A language L over the alphabet Σ is a subset of all strings over the alphabet: L ⊆ Σ*.

The problem is figuring out if a given word is in the language or not. In this world we differentiate between Recognizers and Deciders. A decider is a machine (or Automaton) which can always determine whether a string is in a language or not. A recognizer does the same thing, but is allowed to never give an answer; essentially, it may run forever.
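The difference can be sketched in code. Here both machines handle the (assumed, purely illustrative) language a* of strings of only a's: the decider answers directly, while the recognizer searches through the language and, for harder languages, would have no bound telling it when to give up:

```python
def decider(word):
    """Always terminates with True/False: is the word in a*?"""
    return all(c == "a" for c in word)

def recognizer(word):
    """Enumerates the language a* and compares. Because a* is so
    simple we can bound the search by the word's length; a true
    recognizer for a harder language has no such bound and may
    run forever on words outside the language."""
    n = 0
    while True:
        candidate = "a" * n
        if candidate == word:
            return True
        if len(candidate) > len(word):
            break  # only possible because a* never "shrinks back"
        n += 1
    return False
```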

In the world of program analysis we can actually describe our problems as sub-languages. For example, the language of all terminating Java programs can be described like so:

L_halt = { p | p terminates, p ∈ L_Java }

This is also the reason that static analyses use the terminology they do. A sound analysis is an analysis where, if it decides that a program is in a sublanguage, the program is in fact in that language.

2.2. Production Rules and Grammars

We actually define languages using languages; these are called Grammars. A grammar is a collection of production rules of the form α → β, where α and β are sequences of symbols (represented by a, b, and c) and non-terminals (represented by A, B, and C). They are called non-terminals because we keep applying the production rules, rewriting substrings that look like α into β, until there are no non-terminals left.

We can now see if a string w is in a language by generating all possible strings in the language, and then checking if w is in that set. This algorithm can clearly run forever, so more efficient solutions are needed.

Grammar

Given a grammar G = (N, Σ, R, S), where N is a set of non-terminals (A, B, C), Σ is the alphabet, R is a set of production rules, and S is the start symbol, and →* means applying the rules in R until no non-terminal is left, then the language of the grammar is:

L_G = { w | w ∈ Σ*, S →* w }

This is best illustrated by an example.

Generate ((ab)b)

Imagine the grammar ({S, X}, {a, b, (, )}, R, S), where R is defined like this (the number on each arrow names the rule):

S →1 (SX)
S →2 a
X →3 b

We can generate the word ((ab)b) like so:

S →1 (SX) →3 (Sb) →1 ((SX)b) →2 ((aX)b) →3 ((ab)b)

Some languages actually work like this; think about LaTeX and the C macro system.
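The generate-and-check idea from above can be sketched directly for the example grammar. This brute-force membership test enumerates sentential forms breadth-first; the pruning by length works here because no rule shrinks the string, and the step cap guards against the search running forever:

```python
from collections import deque

# The example grammar: S -> (SX), S -> a, X -> b
RULES = [("S", "(SX)"), ("S", "a"), ("X", "b")]

def in_language(word, max_steps=10_000):
    seen, queue = {"S"}, deque(["S"])
    for _ in range(max_steps):
        if not queue:
            return False          # search space exhausted: not in the language
        current = queue.popleft()
        if current == word:
            return True
        # try every rule at every position of the current string
        for i, ch in enumerate(current):
            for lhs, rhs in RULES:
                if ch == lhs:
                    rewritten = current[:i] + rhs + current[i + 1:]
                    # prune: rules never shrink, so longer strings are dead ends
                    if rewritten not in seen and len(rewritten) <= len(word):
                        seen.add(rewritten)
                        queue.append(rewritten)
    return False

print(in_language("((ab)b)"))   # True
print(in_language("(ab"))       # False
```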

2.3. What Types of Languages Exist?

Since all of syntactic analysis is matching strings, which languages can we efficiently match? This is where the Chomsky hierarchy comes in.

Type 3 (Regular) ⊂ Type 2 (Context-free) ⊂ Type 1 (Context-sensitive) ⊂ Type 0 (Recursively enumerable)
Figure: The Chomsky hierarchy.

It separates languages into four categories; from type 0 down to type 3 they decrease in difficulty and expressive power. A type 0 language can talk about the executable behaviour of a Turing machine. One example is the language of all programs that terminate.

A type 1 language is a language that can keep information about the context it is in. For example, checking a program to see if a variable is in scope is context-sensitive, because it depends on the declarations and how nested we are in the curly braces:

int hello(int x) { 
  // Context { x , y }
  int y; 
  { 
    int z; 
    // Context { x , y, z} is z in scope? Yes
  }
  // Context { x , y} is z in scope? No
}

A type 2 language can no longer talk about the context, but it can still have infinitely nested parentheses. This type is the basis of the syntax of most programming languages (except C: typedef, I'm looking at you!).

A type 3 language is the most restrictive: it only allows for repeats of small languages and cannot match nested parentheses. These are the languages you can match really fast using regular expressions.
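The nesting limitation is easy to see in code. Checking balanced parentheses, a type 2 (context-free) property, needs a counter that can grow without bound, which is exactly what a finite automaton (and hence a regular expression) lacks. A minimal sketch:

```python
def balanced(text):
    """Return True if every '(' has a matching ')'. The unbounded
    depth counter is what no regular expression can simulate."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False   # closed a paren that was never opened
    return depth == 0

print(balanced("((10))"))   # True
print(balanced("(()"))      # False
```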

3. In Practice

In practice we use three kinds of matching systems when writing syntactic analyses: regular expressions, parser grammars, and bespoke matching.

3.1. Regular: Regular Expressions

To match regular languages we can use regular expressions; these are extremely common and very fast:

e := ε          – an empty string
   | a          – a symbol
   | e1 e2      – a concatenation
   | e1 + e2    – an alternative
   | e1*        – a Kleene star
Figure: The grammar for defining regular expressions.

In practice, this language is extended quite heavily to make it easier to write better matches. You can explore different flavors of the language and how they match on https://regex101.com/. Here | is often used instead of +, and + is used to indicate one or more matches: ee* = e+.
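The practical notation maps directly onto Python's re flavor; in particular, the ee* = e+ equivalence from above can be checked by hand:

```python
import re

# | is the alternative (written + in the theory notation)
assert re.fullmatch(r"a|b", "b")
# * is the Kleene star: zero or more repetitions
assert re.fullmatch(r"ab*", "a")
# + is one or more repetitions, i.e. ab+ matches the same strings as abb*
assert re.fullmatch(r"ab+", "abb")
assert re.fullmatch(r"abb*", "abb")
assert re.fullmatch(r"ab+", "a") is None   # + requires at least one b
```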

3.2. Context-Free: Backus-Naur form

The definition of the regular expression in the previous section is written in Backus-Naur form. This is the primary way that we programming language people write down new syntax.

It is created by a set of productions (like with our grammars), but we allow each production to have multiple alternatives, separated by |: A := abc | B | x* and B := a. The BNF is often extended with syntax for many (*) and some (+). Furthermore, to allow for fast matching, some restrictions are made on how and what you can match.

In practice, we can take a look at the grammars written for the Tree-Sitter parser. We can see it contains the same elements as the grammars above, but written in a JavaScript syntax.

{
  rules: {
    source_file: $ => repeat($._expression),
    _expression: $ => choice(
      $.identifier,
      $._parens,
      $.number
    ),
    _parens: $ => seq('(', $._expression, ')'),
    identifier: $ => /[a-z]+/,
    number: $ => /\d+/
  }
}

The cool thing about Tree-Sitter is that there are defined grammars for most languages you would like to analyse.

3.3. Context-Sensitive and Recursively Enumerable: Bespoke

To match languages beyond context-free languages, there exist some tools, but it becomes harder and harder to match full languages efficiently. In this case you have to write bespoke analyses in your favorite language.

4. Getting Started

Today, we are going to work on writing a syntactic analysis.

4.1. Using Regular Expressions.

First, we need to get a hold of the information about the method we need to analyse.

Match the method names.

You might already have recognized that the method names form a little language:

jpamb.cases.Simple.assertPositive:(I)V, where each part represents important information: the class, the method, the arguments, and the return type.

Use https://regex101.com/ to try to match on the method. Here are some tricks:

  • remember to pick the flavor that matches your implementation language (in the menu).
  • use the cases from https://github.com/kalhauge/jpamb/blob/main/stats/cases.txt as test cases (you do that by inserting them in the big box).
  • you can use the quick references to help you write the expression.
  • remember that '.' means everything, so use '\.' to match '.'.
  • use match groups to extract each part in your analysis.
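Following those tricks, one possible solution looks something like the sketch below (the group names are my own choice, and the pattern assumes the method-id shape shown above):

```python
import re

# Match ids like jpamb.cases.Simple.assertPositive:(I)V
METHOD_RE = re.compile(
    r"(?P<classname>(?:\w+\.)*\w+)"  # dotted class name, e.g. jpamb.cases.Simple
    r"\.(?P<method>\w+)"             # method name
    r":\((?P<args>[^)]*)\)"          # argument type descriptors, note the \( and \)
    r"(?P<ret>.+)"                   # return type descriptor
)

m = METHOD_RE.fullmatch("jpamb.cases.Simple.assertPositive:(I)V")
print(m.group("classname"), m.group("method"), m.group("args"), m.group("ret"))
```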

Then, we'll need to actually inspect the code. First we need to figure out the correct file. We do that by replacing '.' with '/' or '\' depending on the platform. Then we want to extract the code of the correct method. We can also do this with regular expressions.
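The class-name-to-file step can be sketched with pathlib, which handles the '/' versus '\' platform difference for us (the source root used here is an assumption, adjust it to your checkout):

```python
from pathlib import Path

def class_to_path(classname, root="src/main/java"):
    """Turn a fully qualified class name into a source-file path,
    e.g. jpamb.cases.Simple -> <root>/jpamb/cases/Simple.java."""
    return Path(root).joinpath(*classname.split(".")).with_suffix(".java")

print(class_to_path("jpamb.cases.Simple"))
```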

Extract the code

Insert the code of jpamb.cases.Simple into the Test String field, and try to extract the content of the methods. Then start with (python flavor):

r"public\W+static\W+(?P<retype>void)\W+(?P<mname>\w*)\(\)\W*{(?P<code>[^}]*)}"

Think about if your solution still works:

  • on other files,
  • if you change the indentation or insert newlines,
  • change the methods from static to non-static, and
  • without the case statement.

4.2. Using A Parser (Tree-Sitter)

At this point it should be clear that using regular expressions for matching a context-free language is a bad idea. Instead we should use a parser. Luckily, many parsers already exist; I recommend using Tree-Sitter.

Tree-Sitter has parsers for most languages and has bindings to many languages including Java, Python, Go and Rust. In this example we are going to use Python (but try your own).

Get syntaxer to work

If you have not already setup a Python environment, do it now.

$ python -m venv .venv
$ # on unix systems
$ source .venv/bin/activate
$ # on windows
PS> .venv\Scripts\Activate.ps1

Now install the requirements.txt and the requirements-treesitter.txt dependencies:

$ python -m pip install -r requirements.txt -r requirements-treesitter.txt

Now syntaxer should work, you should be able to run:

$ python solutions/syntaxer.py 'jpamb.cases.Simple.assertPositive:(I)V'

Okay, now let's get started extending the current solution. Tree-sitter has two parts: it is able to parse a language given a grammar (the Java grammar is available here: Grammar), and it has a built-in language for matching elements in the syntax tree.

Play around with Tree-sitters playground

Go to https://tree-sitter.github.io/tree-sitter/playground and insert the code from jpamb.cases.Simple

Enable Query and try out the first query in the syntaxer file.

You can write your own queries using the Query Syntax.

Try to extend syntaxer

Extend syntaxer to also warn about divide-by-zero errors. You can do this by slightly increasing the probability of a divide-by-zero error if you see a / in the method, and slightly decreasing it if you don't.

Hint

Start with the query:

(binary_expression
  operator: "/"
) @expr

and debug with

public class Simple {
  public static int divideByZero() {
    return 1 / 0;
  }
}