Syntactic Analysis
A poor man's analysis
The goal of this chapter is to ask and answer the following questions:
- What was your most extreme syntactical match?
- When did you feel the limitations of the syntactic analyses?
- When is syntactic analysis a good tool?
1. What is Syntactic Analysis?
As mentioned in the introduction, syntactic analysis is about analysing the structure of a program, instead of looking at its meaning. For example, in syntactic analysis 2 + 1 is different from 1 + 2. This makes it ideal for capturing the intent of the developer, but not great for detecting complex logical bugs.
Many static analyses in the wild, often called linters, are simply syntactic analyses, matching patterns in the code. They do this for two reasons. The first is that it is easier to develop and maintain than a correct static analysis. The second is that it can give better feedback to the developer, as you can highlight the pattern in the source.
1.1. The Levels of Syntax
When matching syntax it is important to match at the right level of abstraction. Earlier levels contain more of the developer's intention, but later levels give us more structure, making the analysis more robust and easier to write.
We can start by analysing the text directly; this allows us to extract information about the white-space of the program. However, if we don't think that the white-space is important for the analysis, we can look at the tokens instead. The tokens are a list of elements matched by the lexer, which groups patterns like the integer literals 1232 and 232, so that they appear the same to the parser.
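To see the token level concretely, here is a sketch using Python's standard tokenize module (Python is just an assumption here; a Java lexer behaves analogously):

```python
import io
import tokenize

def tokens_of(source):
    """Return (type, text) pairs for the tokens of a Python expression."""
    return [
        (tok.type, tok.string)
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type not in (tokenize.NEWLINE, tokenize.ENDMARKER)
    ]

# The literals 1232 and 232 differ as text, but both become a single
# NUMBER token, so they look the same to the parser.
assert tokens_of("1232 + 1")[0][0] == tokenize.NUMBER
assert tokens_of("1232 + 1")[0][0] == tokens_of("232 + 1")[0][0]
```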
The parser converts the list of tokens into a Concrete Syntax Tree (CST). A CST is the collection of all the tokens of the program, but placed in a tree-like structure. For example, 10 + (5 + 3) is turned into the tree (+ 10 (parenthesis (+ 5 3))). This deals with precedence and other headaches we don't want to think about. For example, is 10 + 3 * 5 equal to 10 + (3 * 5) or (10 + 3) * 5?
In contrast to the CST, the Abstract Syntax Tree (AST) does not have to follow directly from the structure of the text. This allows the AST to remove seemingly unimportant information from the tree, like parentheses: (( 10 )) and ( 10 ) are both stored as 10.
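We can check this with Python's built-in ast module (used here as a stand-in; Java ASTs behave the same way):

```python
import ast

# All three sources produce the exact same abstract syntax tree:
# the parentheses are dropped during parsing.
tree_a = ast.dump(ast.parse("(( 10 ))", mode="eval"))
tree_b = ast.dump(ast.parse("( 10 )", mode="eval"))
tree_c = ast.dump(ast.parse("10", mode="eval"))

assert tree_a == tree_b == tree_c
```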
The compiler then converts the AST into some kind of Intermediate Representation (IR); this can be LLVM IR, ClangIR, or RTL. At this level, a lot of work has been done by the compiler to disambiguate references (making all variables fully qualified) and to optimize some cases: x + 2 + 4 might now be mypackage.Class.x + 6, which makes some kinds of analysis easier. Finally, the IR is compiled to bytecode; at this point most of the style of the original language is gone, and only pure functionality is left.
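We can observe this kind of lowering in CPython, whose compiler folds constants before emitting bytecode (a sketch; the Java example above is lowered similarly by javac and the JIT):

```python
import dis

# CPython's compiler folds the constant expression 2 + 4 at compile
# time, so the code object contains the single constant 6 and no addition.
code = compile("2 + 4", "<example>", "eval")
assert 6 in code.co_consts

dis.dis(code)  # disassemble to see the folded constant being loaded
```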
While every step down the ladder makes it easier to write analyses, we lose a little of the context and intention of the developer. For example, say we want to warn users about mixing || and && in code, because the precedence is often not known by developers. A good level to attack this is the CST level, as we can differentiate between a && b || c and (a && b) || c. The developer would get a warning for the first example but not for the second.
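Python's ast module illustrates why the AST is the wrong level for this warning (here with Python's and/or standing in for && and ||): the parentheses are already gone, so the two variants are indistinguishable.

```python
import ast

implicit = ast.dump(ast.parse("a and b or c", mode="eval"))
explicit = ast.dump(ast.parse("(a and b) or c", mode="eval"))

# The AST has already resolved precedence and dropped the parentheses,
# so exactly the information the warning needs is lost.
assert implicit == explicit
```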
1.2. Goal of Syntactic Analysis
Since a syntactic analysis can never be sound (accepting only good programs), we often have to aim for completeness (rejecting no good program). This is in line with a study by Christakis and Bird, which shows that most developers would rather have an unsound analysis than too many warnings. Here are two quotes from the findings:
Program analysis design should aim for a false-positive rate no higher than 15–20%. -- Christakis and Bird
When forced to choose between more bugs or fewer false positives, they typically choose the latter. -- Christakis and Bird
2. A little Language Theory
- Programming language theory, Wikipedia.
Before we go to the exercises, it is nice to have a little understanding of language theory.
2.1. A Language formally speaking
In computer science, we refer to a language as a set of sequences of symbols from an alphabet. If a string is in the language, we refer to it as a word.
Language
A language L over the alphabet Σ is a subset of all strings over that alphabet: L ⊆ Σ*.
The problem is figuring out if a given word is in the language or not. In this world we differentiate between recognizers and deciders. A decider is a machine (or automaton) which can determine whether a string is in a language or not. A recognizer does the same thing, but is allowed to never give an answer; essentially, it may run forever.
In the world of program analysis we can actually describe our problems as sub-languages. For example, the language of all terminating Java programs can be described like so:
L_terminating = { p ∈ Java | p terminates }
This is also the reason that static analyses use the terminology they do. A sound analysis is an analysis where, if it decides a program is in a sub-language, the program is in fact in that language:
- Sound: if the analysis accepts p, then p ∈ L.
- Complete: if p ∈ L, then the analysis accepts p.
2.2. Production Rules and Grammars
- Production (computer science), Wikipedia.
- Formal grammar, Wikipedia.
We actually define languages using languages; these are called grammars. A grammar is a collection of production rules of the form α → β, where α and β are sequences of symbols (represented by a, b, and c) and non-terminals (represented by A, B, and C). They are called non-terminals because we keep applying the production rules, rewriting substrings that look like α into β, until there are no non-terminals left.
We can now see if a string w is in a language by generating all possible strings in the language, and then checking if w is in that set. This algorithm can clearly run forever, so more efficient solutions are needed.
Grammar
Given a grammar G = (N, Σ, P, S), where N is a set of non-terminals (A, B, C), Σ is the alphabet, P is a set of production rules, and S ∈ N is the start symbol, and S ⇒* w means applying the rules in P until no non-terminal is left, the language of the grammar is:
L(G) = { w ∈ Σ* | S ⇒* w }
This is best illustrated by an example.
Generate aⁿbⁿ
Imagine the grammar G = ({S}, {a, b}, P, S), where P is defined like this:
S → a S b | ε
It generates the language L(G) = { aⁿbⁿ | n ≥ 0 }.
Some languages actually work like this, think about LaTeX and the C macro system.
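The generate-and-check idea from above can be sketched in a few lines of Python, here for the assumed example grammar S → a S b | ε (which generates aⁿbⁿ):

```python
from collections import deque

def generate(max_len):
    """All words of the grammar S -> a S b | "" up to length max_len."""
    words = set()
    queue = deque(["S"])
    while queue:
        form = queue.popleft()
        if "S" not in form:
            words.add(form)
            continue
        if len(form) <= max_len:  # bound the expansion so we terminate
            queue.append(form.replace("S", "aSb", 1))
        queue.append(form.replace("S", "", 1))
    return words

def member(word):
    """A (very inefficient) decider: generate everything, then check."""
    return word in generate(len(word))

assert member("aabb")
assert not member("abab")
```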
2.3. What Types of Languages Exist?
- Chomsky hierarchy, Wikipedia.
Since all of syntactic analysis is matching strings, which languages can we efficiently match? This is where the Chomsky hierarchy comes in.
It separates languages into four categories of decreasing difficulty and expressive power. A type 0 language is one that can be recognized by a Turing machine. One example is the language of all programs that terminate.
A type 1 language is a language that can keep information about the context it is in. For example, checking a program to see if a variable is in scope is context-sensitive, because it depends on the declarations and how nested we are in the curly braces:
int hello(int x) {
// Context { x , y }
int y;
{
int z;
// Context { x , y, z} is z in scope? Yes
}
// Context { x , y} is z in scope? No
}
A type 2 language can no longer talk about the context, but it can still have arbitrarily nested parentheses. These languages are the basis of the syntax of most programming languages (except C: typedef, I'm looking at you!).
A type 3 language is the most restrictive: it only allows for repetitions of smaller languages and cannot match nested parentheses. These are the languages you can match really fast using regular expressions.
3. In Practice
In practice we use three kinds of matching systems when writing syntactic analyses: regular expressions, parser grammars, and bespoke matching.
3.1. Regular: Regular Expressions
- Regular expression, Wikipedia.
To match regular languages we can use regular expressions. These are extremely common and very fast. A minimal regular expression R over symbols a from the alphabet can be defined like this:
R := a | R R | R + R | R*
where R R is concatenation, R + R is choice, and R* is zero or more repetitions.
In practice, this language is extended quite heavily to make it easier to write good matches. You can explore different flavors of the language and how they match on https://regex101.com/. Here | is often used instead of +, and + is used to indicate one or more matches: R+ = R R*.
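A quick sketch with Python's re module, using the extended syntax (| for choice, + for one or more):

```python
import re

# (a|b)+ matches one or more symbols, each of which is an a or a b.
pattern = re.compile(r"(a|b)+")

assert pattern.fullmatch("abba") is not None  # a word of the language
assert pattern.fullmatch("abc") is None       # 'c' is not in the alphabet
```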
3.2. Context-Free: Backus-Naur form
- Backus–Naur form, Wikipedia.
- LR parser, Wikipedia.
The definition of the regular expression in the previous section is written in Backus-Naur form. This is the primary way that we programming language people write down new syntax.
It is created by a set of productions (like with our grammar), but we allow each production to have multiple alternatives, separated by |, for example A := a A | B and B := a. The BNF is often extended with syntax for many (R*) and some (R+). Furthermore, to allow for fast matching, some restrictions are made on how and what you can match.
In practice, we can take a look at the grammars written for the Tree-Sitter parser. We can see that they contain the same elements as the grammars above, but written in a JavaScript syntax.
// a grammar.js file wraps the rules in a call to grammar();
// the name "example" is a placeholder
module.exports = grammar({
  name: "example",
  rules: {
    source_file: $ => repeat($._expression),
    _expression: $ => choice(
      $.identifier,
      $._parens,
      $.number
    ),
    _parens: $ => seq('(', $._expression, ')'),
    identifier: $ => /[a-z]+/,
    number: $ => /\d+/
  }
});
The cool thing about Tree-Sitter is that there are defined grammars for most languages you would like to analyse.
3.3. Context-Sensitive and Recursively Enumerable: Bespoke
To match languages beyond context-free languages, some tools exist, but it becomes harder and harder to match full languages efficiently. In this case you have to write bespoke analyses in your favorite language.
4. Getting Started
Today, we are going to work on writing a syntactic analysis.
4.1. Using Regular Expressions.
First, we need to get hold of the information about the method we need to analyse.
Match the method names.
You might already have recognized that the method names form a little language:
jpamb.cases.Simple.assertPositive:(I)V
where each part represents important information: the class, the method, the arguments, and the return type.
Use https://regex101.com/ to try to match on the method. Here are some tricks:
- remember to pick the flavor that matches your implementation language (in the menu).
- use the cases from https://github.com/kalhauge/jpamb/blob/main/stats/cases.txt as test cases (you do that by inserting them in the big box).
- you can use the quick references to help you write the expression.
- remember that '.' means everything, so use '\.' to match '.'.
- use match groups to extract each part in your analysis.
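One possible solution is sketched below (python flavor); the group names are my own choice, not part of the benchmark:

```python
import re

# Split a method id like "jpamb.cases.Simple.assertPositive:(I)V" into
# its four parts using named match groups.
METHOD_ID = re.compile(
    r"(?P<klass>(?:\w+\.)*\w+)"  # the fully qualified class name
    r"\.(?P<method>\w+)"         # the method name
    r":\((?P<args>[^)]*)\)"      # the argument types
    r"(?P<returns>.+)"           # the return type
)

m = METHOD_ID.fullmatch("jpamb.cases.Simple.assertPositive:(I)V")
assert m is not None
assert m["klass"] == "jpamb.cases.Simple"
assert m["method"] == "assertPositive"
assert m["args"] == "I"
assert m["returns"] == "V"
```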
Then, we'll need to actually inspect the code. First we need to figure out the correct file. We do that by replacing '.' with '/' or '\' depending on the platform. Then we want to extract the code of the correct method. We can also do this with regular expressions.
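In Python, pathlib can pick the right separator for us (a sketch; the source root src/main/java is a guess, adjust it to your checkout):

```python
from pathlib import Path

classname = "jpamb.cases.Simple"

# Turn the dotted class name into a relative source-file path; pathlib
# renders it with "/" or "\" depending on the platform.
source_file = Path("src", "main", "java", *classname.split(".")).with_suffix(".java")

assert source_file.parts[-3:] == ("jpamb", "cases", "Simple.java")
```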
Extract the code
Insert the code of jpamb.cases.Simple into the Test String field, and try to extract the content of the methods. Then start with (python flavor):
r"public\s+static\s+(?P<retype>void)\s+(?P<mname>\w+)\s*\((?P<args>[^)]*)\)\s*\{(?P<code>[^}]*)\}"
Think about whether your solution still works:
- on other files,
- if you change the indentation or insert newlines,
- change the methods from static to non-static, and
- without the case statement.
4.2. Using A Parser (Tree-Sitter)
- tree-sitter.github.io/
- harmonia.cs.berkeley.edu/papers/twagner-parsing.pdf
- www-users.cse.umn.edu/~evw/pubs/vanwyk07gpce/vanwyk07gpce.pdf
- www2.eecs.berkeley.edu/Pubs/TechRpts/1997/CSD-97-946.pdf
At this point it should be clear that using regular expressions for matching a context-free language is a bad idea. Instead we should use a parser. Luckily, many parsers already exist. I recommend using Tree-Sitter.
Tree-Sitter has parsers for most languages and has bindings to many languages including Java, Python, Go and Rust. In this example we are going to use Python (but try your own).
Get syntaxer to work
If you have not already setup a Python environment, do it now.
$ python -m venv .venv
$ # on unix systems
$ source .venv/bin/activate
PS> # on windows
PS> .venv\Scripts\Activate.ps1
Now install the requirements.txt and the requirements-treesitter.txt dependencies:
$ python -m pip install -r requirements.txt -r requirements-treesitter.txt
Now syntaxer should work; you should be able to run:
$ python solutions/syntaxer.py 'jpamb.cases.Simple.assertPositive:(I)V'
Okay, now let's get started extending the current solution. Tree-Sitter has two parts: it is able to parse a language given a grammar (the Java grammar is available here: Grammar), and it has a built-in language for matching elements in the syntax tree.
Play around with Tree-Sitter's playground
Go to https://tree-sitter.github.io/tree-sitter/playground and insert the code from jpamb.cases.Simple. Enable Query and try out the first query in the syntaxer file.
You can write your own queries using the Query Syntax.
Try to extend syntaxer
Extend syntaxer to also warn about divide-by-zero errors. You can do this by slightly increasing the probability that the code contains a divide-by-zero error if you see a / in the method, and slightly decreasing it if you don't.
Hint:
Start with the query:
(binary_expression
operator: "/"
) @expr
and debug with
public class Simple {
public static int divideByZero() {
return 1 / 0;
}
}
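The wager itself can stay simple. Here is a sketch of the heuristic described above (the function name and the concrete probabilities are assumptions; tune them against the benchmark):

```python
def divide_by_zero_probability(saw_division: bool) -> float:
    """Nudge the probability up when the method contains a '/'.

    The numbers 0.75 and 0.05 are guesses, not calibrated values.
    """
    return 0.75 if saw_division else 0.05

assert divide_by_zero_probability(True) > divide_by_zero_probability(False)
```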