Syntactic Analysis

A poor mans analysis

Christian Gram Kalhauge

Table of Contents

  1. What is Syntactic Analysis?
  2. What can we Pattern Match on?
  3. How do we Match Patterns in Practice?

Questions

  1. What are the limitations of a syntactic analysis?

  2. When is syntactic analysis a good tool?

  3. Trick Question: Is Java a regular language, context free or recursive enumerable?

What is Syntactic Analysis? (§1)

Pattern Matching on the structure of the program.

The Levels of Syntax (§1.1)

bits text tokens CST AST IR Intention Structure Bytecode

The levels of syntactic matches can range from matching the bits, to matching the final compiled bytecode.

Preventing AND and OR

fn (int a, b, c) { 
  return a && b || c;  # Might be a mistake
}
fn (int a, b, c) { 
  return (a && b) || c;  # Ah, never mind.
}

Goal of Syntactic Analysis (§1.2)

  • No bad warnings (false negatives).

  • Complete is better than Sound.

Program analysis design should aim for a false-positive rate no higher than 15–20%.

— Christakis and Bird

When forced to choose between more bugs or fewer false positives, [developers] typically choose the latter.

— Christakis and Bird

What can we Pattern Match on? (§2)

The set of programs matched by a pattern is called a Language.

A Language — Formally Speaking (§2.1)

Language

A language L over the alphabet Σ, is a subset of all words wΣ* from the alphabet LΣ*.

Lhalt={p|p terminates,pLJava}

  • Sound: SAhalt(p)pLhalt

  • Complete: pLhaltSAhalt(p)

Production Rules and Grammars (§2.2)

Grammar

A grammar G is a quadruple (N,Σ,,S), where N is a set of non-terminals (A, B, C), Σ is the alphabet, is a set of production rules, and S is the start symbol.

The language of the grammar is:

LG={w|wΣ*,S*w}

Where * means applying the rules in until no non-terminal is left.

Generate ((ab)b)

Consider the grammar ({S,X},{a,b,(,)},,S), where is defined like this:

S1(SX)S2aX3b

It can generate the following word in the language S1(SX)3(Sb)1((SX)b)2((aX)b)3((ab)b)

Some languages actually work like this, think about LaTeX and the C macro system.

What Types of Languages Exist? (§2.3)

Chomsky hierarchy

gray Type 3 Regular Type 2 Context-free Type 1 Context-sensitive Type 0 Recursively enumerable

The Chomsky hierarchy, with production rules.

How do we Match Patterns in Practice? (§3)

  1. Regular: Regular Expressions
  2. Context-Free: Grammars and Parsers
  3. Context Sensitive: Matching Patterns in Trees
  4. Recursive Enumerable: Bespoke

Regular: Regular Expressions (§3.1)

XKCD, source: https://xkcd.com/1171

e:=ϵ– an empty string |a– symbol |e1e2– a concatenation |e1+e2– an alternative |e1*– a Kleene star

The grammar for defining regular expressions.

[[ϵ]]={ϵ}[[a]]={a}[[e1e2]]={uw|u[[A]],w[[B]]}[[e1+e2]]=[[A]][[B]][[e1*]]={ϵ}{uw|u[[e1]],w[[e1*]]} where is the smallest solution to the equation.

The denotational semantics for regular expressions

Getting Started (§3.1.1)

https://regex101.com/

Context-Free: Grammars and Parsers (§3.2)

  • Write a Grammar in Backus Naur form.

  • Parse the unstructured text into an CST.

  • ...

  • Profit

Using Tree-Sitter (§3.2.1)

https://tree-sitter.github.io/tree-sitter/playground

Lambda Calculus

e:=λx.e– abstractione|e1e2– applicatione|x– variablee|(e)
{
  rules: {
    source_file: $ => $._exp,
    _exp: $ => choice($.abs, $.app, $.var, $._parens),
    abs: $ => prec.right(seq('λ', $.var, '.', $._exp)),
    app: $ => prec.left(seq($._exp, $._exp)),
    var: $ => /[a-zA-Z]+/,
    _parens: $ => choice("(", $._exp, ")"),
  }
}

Context Sensitive: Matching Patterns in Trees (§3.3)

public class Simple {
  public static int assertFalse() {
    assert False;
  }
  public static int divideByZero() {
    return 1 / 0;
  }
}

Folds: General Matching of Patterns in Trees (§3.3.1)

  • 𝚏𝚘𝚕𝚍:((x,a)+a)[a]a

  • 𝚌𝚊𝚝𝚊:(faa)Fa

Tree Traversals (§3.3.2)

Pre- (red), In- (green), and Post-order (blue) traversal of tree. source

def traverse(item: ts.Tree) -> Iterator[ts.Node]:
    cursor = item.walk()
    godeeper = True
    while True:
        node = cursor.node
        if godeeper:
            yield node # Pre or Post
            if not cursor.goto_first_child():
                godeeper = False
        elif cursor.goto_next_sibling():
            godeeper = True
        elif cursor.goto_parent():
            yield node # Pre or Post
            godeeper = False
        else:
            break

Recursive Enumerable: Bespoke (§3.4)

  • Linters (SpotBugs and EsLint)

  • No good system, yet...

  • LLMs to the rescue