Written by Luka Kerr on May 4, 2019
Regular Expressions and Finite Automata
Thompson’s Construction
- Syntax driven
- Inductive
- Use $S$ to denote the starting state and $A$ to denote the final accepting state
Inductive Base:
For $\epsilon$, construct the NFA consisting of a single $\epsilon$-transition from $S$ to $A$
For $a \in \Sigma$, construct the NFA consisting of a single transition from $S$ to $A$ labelled $a$
Inductive Step: suppose $N(r)$ and $N(s)$ are NFAs for regular expressions $r$ and $s$. Then:
$r | s$: add a new start state $S$ with $\epsilon$-transitions to the start states of $N(r)$ and $N(s)$, and $\epsilon$-transitions from their accepting states to a new accepting state $A$
$rs$: connect the accepting state of $N(r)$ to the start state of $N(s)$ with an $\epsilon$-transition; the start state of $N(r)$ and the accepting state of $N(s)$ become the start and accepting states of $N(rs)$
$r^*$: add a new start state $S$ and accepting state $A$, with $\epsilon$-transitions from $S$ to the start state of $N(r)$ and to $A$, and from the accepting state of $N(r)$ back to the start state of $N(r)$ and to $A$
$(r)$: $N((r)) \equiv N(r)$
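As a rough illustration, a minimal sketch of these constructions in Java is shown below; the State and Nfa classes and their method names are assumptions made for this sketch, not part of any course-supplied code.

import java.util.*;

// An NFA state: labelled transitions plus epsilon transitions
class State {
    Map<Character, State> edges = new HashMap<>();
    List<State> eps = new ArrayList<>();
}

// An NFA fragment with a single start state (S) and a single accepting state (A)
class Nfa {
    State start, accept;
    Nfa(State s, State a) { start = s; accept = a; }

    // Base case: a single symbol a in Sigma (for epsilon, use an epsilon edge instead)
    static Nfa symbol(char c) {
        State s = new State(), a = new State();
        s.edges.put(c, a);
        return new Nfa(s, a);
    }

    // r | s: a new start state with epsilon edges into N(r) and N(s),
    // and epsilon edges from both accepting states into a new accepting state
    static Nfa union(Nfa r, Nfa s) {
        State start = new State(), accept = new State();
        start.eps.add(r.start);
        start.eps.add(s.start);
        r.accept.eps.add(accept);
        s.accept.eps.add(accept);
        return new Nfa(start, accept);
    }

    // rs: link the accepting state of N(r) to the start state of N(s)
    static Nfa concat(Nfa r, Nfa s) {
        r.accept.eps.add(s.start);
        return new Nfa(r.start, s.accept);
    }

    // r*: allow skipping N(r) entirely or looping back through it
    static Nfa star(Nfa r) {
        State start = new State(), accept = new State();
        start.eps.add(r.start);
        start.eps.add(accept);
        r.accept.eps.add(r.start);
        r.accept.eps.add(accept);
        return new Nfa(start, accept);
    }
}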
Subset Construction
- Used to convert an NFA to a DFA
- At most $2^n$ DFA states where $n$ is the total number of NFA states
Operations used:
Operation | Description |
---|---|
$\epsilon$-closure($s$) | Set of NFA states reachable from NFA state $s$ on $\epsilon$-transitions alone |
$\epsilon$-closure($T$) | Set of NFA states reachable from some state $s$ in $T$ on $\epsilon$-transitions alone |
move($T$, $a$) | Set of NFA states to which there is a transition on input $a$ from some state $s$ in $T$ |
where
- $s$ is an NFA state
- $T$ is a set of NFA states
Algorithm
let $s_0$ be the start state of the NFA;
initially, DFAstates contains a single unmarked state, $\epsilon$-closure($s_0$);
while there is an unmarked state $T$ in DFAstates do
$\quad$ mark $T$
$\quad$ for each input symbol $a$ do
$\qquad$ $U := \epsilon$-closure(move($T$, $a$));
$\qquad$ if $U$ is not in DFAstates then
$\qquad \quad$ add $U$ as an unmarked state in DFAstates;
$\qquad$ DFATrans[$T$, $a$] $:= U$;
$\quad$ done
done
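A minimal sketch of this algorithm in Java is shown below, assuming the NFA is represented by transition maps keyed on integer state numbers; the class and field names are assumptions made for this sketch.

import java.util.*;

class SubsetConstruction {
    // nfaMove.get(s).get(a) = set of NFA states reachable from s on input a;
    // epsMove.get(s) = set of NFA states reachable from s on one epsilon edge
    Map<Integer, Map<Character, Set<Integer>>> nfaMove = new HashMap<>();
    Map<Integer, Set<Integer>> epsMove = new HashMap<>();

    // epsilon-closure(T): all NFA states reachable from states in T on epsilon edges only
    Set<Integer> closure(Set<Integer> t) {
        Set<Integer> result = new HashSet<>(t);
        Deque<Integer> work = new ArrayDeque<>(t);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int u : epsMove.getOrDefault(s, Set.of()))
                if (result.add(u)) work.push(u);
        }
        return result;
    }

    // move(T, a): NFA states reachable from some state in T on input a
    Set<Integer> move(Set<Integer> t, char a) {
        Set<Integer> result = new HashSet<>();
        for (int s : t)
            result.addAll(nfaMove.getOrDefault(s, Map.of()).getOrDefault(a, Set.of()));
        return result;
    }

    // The algorithm above: each DFA state is a set of NFA states
    Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(int s0, Set<Character> alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfaTrans = new HashMap<>();
        Set<Integer> start = closure(Set.of(s0));
        Deque<Set<Integer>> unmarked = new ArrayDeque<>();
        unmarked.push(start);
        dfaTrans.put(start, new HashMap<>());
        while (!unmarked.isEmpty()) {
            Set<Integer> t = unmarked.pop();                   // mark T
            for (char a : alphabet) {
                Set<Integer> u = closure(move(t, a));          // U := eps-closure(move(T, a))
                if (u.isEmpty()) continue;
                if (!dfaTrans.containsKey(u)) {                // U is a new, unmarked DFA state
                    dfaTrans.put(u, new HashMap<>());
                    unmarked.push(u);
                }
                dfaTrans.get(t).put(a, u);                     // DFATrans[T, a] := U
            }
        }
        return dfaTrans;
    }
}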
DFA Minimisation
let $\Pi$ be the initial partition with the two groups:
- one is the set of all final states
- the other is the set of all non-final states
let $\Pi_{new} = \Pi$;
for each group $G$ in $\Pi_{new}$ do
$\quad$ partition $G$ into subgroups such that two states $s$ and $t$
$\qquad$ are in the same subgroup iff for all input symbols
$\qquad$ $a$, states $s$ and $t$ have transitions on $a$ to
$\qquad$ states in the same group of $\Pi_{new}$;
$\quad$ replace $G$ in $\Pi_{new}$ by the set of subgroups formed;
done
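A minimal sketch of this refinement in Java is shown below, assuming the DFA is given as a dense transition table; all names are assumptions made for this sketch. Unlike the pseudocode above, it simply repeats the splitting pass until no group changes, which reaches the same final partition.

import java.util.*;

class Minimise {
    // delta[state][symbol] = successor state; isFinal[state] = true for accepting states
    static List<Set<Integer>> minimise(int[][] delta, boolean[] isFinal, int nSymbols) {
        // Initial partition: the final states and the non-final states
        Set<Integer> fin = new HashSet<>(), nonFin = new HashSet<>();
        for (int s = 0; s < delta.length; s++)
            (isFinal[s] ? fin : nonFin).add(s);
        List<Set<Integer>> partition = new ArrayList<>();
        if (!fin.isEmpty()) partition.add(fin);
        if (!nonFin.isEmpty()) partition.add(nonFin);

        boolean changed = true;
        while (changed) {
            changed = false;
            List<Set<Integer>> next = new ArrayList<>();
            for (Set<Integer> group : partition) {
                // Two states stay together iff, for every symbol, their transitions
                // lead into the same group of the current partition
                Map<List<Integer>, Set<Integer>> subgroups = new HashMap<>();
                for (int s : group) {
                    List<Integer> key = new ArrayList<>();
                    for (int a = 0; a < nSymbols; a++)
                        key.add(groupIndex(partition, delta[s][a]));
                    subgroups.computeIfAbsent(key, k -> new HashSet<>()).add(s);
                }
                if (subgroups.size() > 1) changed = true;      // the group was split
                next.addAll(subgroups.values());
            }
            partition = next;
        }
        return partition;   // each group becomes one state of the minimal DFA
    }

    static int groupIndex(List<Set<Integer>> partition, int state) {
        for (int i = 0; i < partition.size(); i++)
            if (partition.get(i).contains(state)) return i;
        return -1;
    }
}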
Chomsky’s Hierarchy
Grammar | Known As | Definition | Machine |
---|---|---|---|
Type 0 | Unrestricted grammar | $\alpha \to \beta$, where $\alpha \ne \epsilon$ | Turing machine |
Type 1 | Context-sensitive grammar (CSG) | $\alpha \to \beta$, where $\vert \alpha \vert \le \vert \beta \vert$ | Linear bounded automaton |
Type 2 | Context-free grammar (CFG) | $A \to \alpha$ | Pushdown (stack) automaton |
Type 3 | Regular grammar | $A \to w \mid B w$ | Finite state automaton |
Context Free Grammars
BNF
BNF is a notation for writing CFGs. Each production $(A, \alpha)$ is written as $A \to \alpha$, where the arrow $\to$ means “is defined to be” or “derives”.
A vertical bar $|$ reads “or else”.
EBNF
EBNF is an extended version of BNF, including regular expressions.
- $( \alpha )^*$ means that $\alpha$ can be repeated zero or more times
- $( \alpha )^+$ means that $\alpha$ can be repeated one or more times
- $( \alpha )?$ means that $\alpha$ is optional
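For example, the EBNF rule $A \to a \ (b)^* \ (c)?$ corresponds to the plain BNF productions
\[\begin{array}{l} A \to a B C \\ B \to b B \mid \epsilon \\ C \to c \mid \epsilon \end{array}\]
where $B$ and $C$ are new nonterminals introduced here for the repetition and the optional part.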
Derivations
A grammar derives sentences by:
- Beginning with the start symbol
- Repeatedly replacing a nonterminal by the right-hand side of a production with that nonterminal on the left-hand side, until there are no more nonterminals to replace
Such a sequence of replacements is called a derivation of the sentence being analysed.
At each step in a derivation, 2 choices are made:
- Which nonterminal to replace?
- Which alternative to use for that nonterminal?
Let $G :=$
\[\begin{array}{l} S \to ABC \\ A \to aA \mid \epsilon \\ B \to b B \mid \epsilon \\ C \to cC \mid \epsilon \end{array}\]
Leftmost Derivation
Always replace the leftmost nonterminal.
Example
Deriving $abbc$ in the grammar $G$ using a leftmost derivation gives:
\[S \Rightarrow ABC \Rightarrow aABC \Rightarrow aBC \Rightarrow abBC \Rightarrow abbBC \Rightarrow abbC \Rightarrow abbcC \Rightarrow abbc\]
Rightmost Derivation
Always replace the rightmost nonterminal.
Example
Deriving $abbc$ in the grammar $G$ using a rightmost derivation gives:
\[S \Rightarrow ABC \Rightarrow ABcC \Rightarrow ABc \Rightarrow AbBc \Rightarrow AbbBc \Rightarrow Abbc \Rightarrow aAbbc \Rightarrow abbc\]
Parse Trees
In a parse tree:
- The start symbol is always at the root of the tree
- Nonterminals are always interior nodes
- Terminals are always leaves in the tree
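For example, a parse tree for $abbc$ in the grammar $G$ above has $S$ at the root, the nonterminals $A$, $B$ and $C$ as interior nodes, and the terminals (together with $\epsilon$) as leaves:

S
├─ A
│  ├─ a
│  └─ A
│     └─ ε
├─ B
│  ├─ b
│  └─ B
│     ├─ b
│     └─ B
│        └─ ε
└─ C
   ├─ c
   └─ C
      └─ ε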
Recursive Grammar
Left recursive productions: $A \to A \alpha$
Right recursive productions: $A \to \alpha A$
Eliminating Left Recursion
Grammar Rewriting
Left recursive grammar
$A \to A \alpha \mid \beta$
Non-left recursive grammar
$A \to \beta A^\prime$ and $A^\prime \to \alpha A^\prime \mid \epsilon$
Regular Operators
Left recursive grammar
$A \to \alpha$ and $A \to A \beta_1 \mid A \beta_2$
Non-left recursive grammar
$A \to \alpha(\beta_1 \mid \beta_2)^*$
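For example, the left recursive grammar $E \to E + T \mid T$ can be rewritten using either technique:
- Grammar rewriting: $E \to T E^\prime$ and $E^\prime \to + T E^\prime \mid \epsilon$
- Regular operators: $E \to T \ (+ T)^*$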
Top Down Parsing
Let $G :=$
\[\begin{array}{l} E \to T Q \\ Q \to +T Q \mid -T Q \mid \epsilon \\ T \to F R \\ R \to * F R \mid / F R \mid \epsilon \\ F \to \text{Int} \mid (E) \end{array}\]
First Sets
First($A$) is the set of terminals that can appear as the first symbol of any string derived from the nonterminal $A$.
Using the grammar $G$
- First($TQ$) = { Int, ( }
- First($*R$) = { * }
Follow Sets
Follow($A$) is the set of terminals that can appear immediately to the right of the nonterminal $A$ in some sentential form. That is, Follow($A$) = $\{ a : S \overset{*}{\implies} \dots A a \dots \}$
By convention, assume every input is terminated by a $ sign.
Follow sets:
- Are only constructed for non-terminals
- Do not contain $\epsilon$
Rules:
- If $A$ is the start symbol of the grammar, then Follow($A$) includes $
- If $A \to pBq$ is a production:
- Follow($B$) includes First($q$) - $\epsilon$
- If First($q$) contains $\epsilon$, then Follow($B$) includes Follow($A$)
- If $A \to pB$ is a production, then Follow($B$) includes Follow($A$)
Using the grammar $G$
- Follow($E$) = {), $}
- Follow($Q$) = {), $}
- Follow($T$) = {+, -} $\cup$ Follow($Q$) = {+, -, ), $}
- Follow($R$) = Follow($T$) = {+, -, ), $}
- Follow($F$) = {*, /} $\cup$ Follow($R$) = {*, /, +, -, ), $}
Select Sets
- One select set for every production in the grammar
- The select set for a production of the form $A \to \alpha$ is:
  - Select($A \to \alpha$) = First($\alpha$) - { $\epsilon$ }
  - If $\epsilon \in$ First($\alpha$), then Select($A \to \alpha$) also includes Follow($A$)
Using the grammar $G$
- Select($E \to TQ$) = First($TQ$) = { Int, ( }
- Select($Q \to +TQ$) = First($+TQ$) = { + }
- Select($R \to \epsilon$) = (First($\epsilon$) - { $\epsilon$ }) $\cup$ Follow($R$) = { +, -, ), $ }
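As a rough sketch, the select set rules above could be computed in Java as follows, assuming First($\alpha$) and Follow($A$) have already been computed as sets of token names; the class name and the "eps" marker are assumptions made for this sketch.

import java.util.*;

class SelectSets {
    static final String EPS = "eps";   // marker used for the empty string in First sets

    // Select(A -> alpha), given First(alpha) and Follow(A)
    static Set<String> select(Set<String> firstAlpha, Set<String> followA) {
        Set<String> sel = new HashSet<>(firstAlpha);
        boolean nullable = sel.remove(EPS);   // Select = First(alpha) - { eps }
        if (nullable)
            sel.addAll(followA);              // if eps is in First(alpha), add Follow(A)
        return sel;
    }
}

For instance, Select($R \to \epsilon$) would be obtained by calling select with First($\epsilon$) = { eps } and Follow($R$).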
$LL(1)$ Grammar
A grammar is $LL(1)$ if for every nonterminal of the form $A \to \alpha_1 \mid \dots \mid \alpha_n$, the select sets are pairwise disjoint, i.e. Select($A \to \alpha_i$) $\cap$ Select($A \to \alpha_j$) = $\emptyset$ for all $i$ and $j$ such that $i \ne j$.
- A grammar with left recursion is not $LL(1)$. Examples of left recursion are:
  - $A \to A \alpha$
  - $A \to B \alpha$ and $B \to A \beta$ (indirect left recursion)
- A grammar with common prefixes is not $LL(1)$. An example of a common prefix is:
  - $A \to \textbf{If } \alpha \ \beta \textbf{ fi}$ and $A \to \textbf{If } ( \alpha ) \ \beta \textbf{ fi}$
- An $LL(1)$ grammar can be parsed deterministically using 1 token of lookahead
$LL(1)$ Parsing Table
for every production $A \to \alpha$ in the grammar do
$\quad$ forall $a$ in Select($A \to \alpha$), set $Table[A, a] = \alpha$
For example, using the grammar $G$, the $LL(1)$ parsing table would look like:
Nonterminal | Int | $+$ | $-$ | $*$ | $/$ | $($ | $)$ | $ |
---|---|---|---|---|---|---|---|---|
$E$ | $TQ$ | | | | | $TQ$ | | |
$Q$ | | $+TQ$ | $-TQ$ | | | | $\epsilon$ | $\epsilon$ |
$T$ | $FR$ | | | | | $FR$ | | |
$R$ | | $\epsilon$ | $\epsilon$ | $*FR$ | $/FR$ | | $\epsilon$ | $\epsilon$ |
$F$ | Int | | | | | $(E)$ | | |
If every entry in the table contains no more than 1 production, the grammar is $LL(1)$.
$LL(1)$ Table Driven Parser
push $ onto the stack;
push the start symbol onto the stack;
while stack not empty do
$\quad$ let $X$ be the top stack symbol and $a$ be the lookahead symbol in the input;
$\quad$ if $X$ is a terminal then
$\qquad$ if $X = a$ then
$\quad \qquad$ pop $X$ and get the next token;
$\qquad$ else
$\quad \qquad$ error;
$\qquad$ fi
$\quad$ else
$\qquad$ if $Table[X, a]$ non-empty then
$\quad \qquad$ pop $X$;
$\quad \qquad$ push $Table[X, a]$ onto stack in the reverse order;
$\qquad$ else
$\quad \qquad$ error;
$\qquad$ fi
$\quad$ fi
done
For example, using the grammar $G$, and the $LL(1)$ parsing table above, a table driven parse of $i + i$ would look like:
Stack | Input | Production | Derivation |
---|---|---|---|
$\$E$ | $i + i\$$ | $E \to TQ$ | $E \implies_{lm} TQ$ |
$\$QT$ | $i + i\$$ | $T \to FR$ | $\implies_{lm} FRQ$ |
$\$QRF$ | $i + i\$$ | $F \to i$ | $\implies_{lm} iRQ$ |
$\$QRi$ | $i + i\$$ | pop and goto next token | |
$\$QR$ | $+ i\$$ | $R \to \epsilon$ | $\implies_{lm} iQ$ |
$\$Q$ | $+ i\$$ | $Q \to +TQ$ | $\implies_{lm} i + TQ$ |
$\$QT+$ | $+ i\$$ | pop and goto next token | |
$\$QT$ | $i\$$ | $T \to FR$ | $\implies_{lm} i + FRQ$ |
$\$QRF$ | $i\$$ | $F \to i$ | $\implies_{lm} i + iRQ$ |
$\$QRi$ | $i\$$ | pop and goto next token | |
$\$QR$ | $\$$ | $R \to \epsilon$ | $\implies_{lm} i + iQ$ |
$\$Q$ | $\$$ | $Q \to \epsilon$ | $\implies_{lm} i + i$ |
$\$$ | $\$$ | accept | |
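A minimal sketch of the table-driven loop in Java is shown below, assuming the parsing table and the set of terminals have already been built; every name here, including the hard-coded start symbol "E", is an assumption made for this sketch.

import java.util.*;

class LL1Parser {
    // table.get(X).get(a) = the right-hand side to push (an empty list for an epsilon production)
    Map<String, Map<String, List<String>>> table;
    Set<String> terminals;

    LL1Parser(Map<String, Map<String, List<String>>> table, Set<String> terminals) {
        this.table = table;
        this.terminals = terminals;
    }

    boolean parse(List<String> input) {
        List<String> tokens = new ArrayList<>(input);
        tokens.add("$");                                   // input terminated by $
        int pos = 0;
        Deque<String> stack = new ArrayDeque<>();
        stack.push("$");
        stack.push("E");                                   // start symbol of G
        while (!stack.isEmpty()) {
            String x = stack.peek();
            String a = tokens.get(pos);
            if (terminals.contains(x) || x.equals("$")) {
                if (!x.equals(a)) return false;            // error: terminal mismatch
                stack.pop();                               // pop X and get the next token
                pos++;
            } else {
                List<String> rhs = table.getOrDefault(x, Map.of()).get(a);
                if (rhs == null) return false;             // error: empty table entry
                stack.pop();                               // pop X
                for (int i = rhs.size() - 1; i >= 0; i--)  // push Table[X, a] in reverse order
                    stack.push(rhs.get(i));
            }
        }
        return pos == tokens.size();                       // stack and input both exhausted
    }
}

Parsing the token list [Int, +, Int] with the table for $G$ above would follow the same sequence of moves as the trace.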
Attribute Grammars
An attribute grammar is a triple $A = (G, V, F)$ where
- $G$ is a CFG
- $V$ is a finite set of distinct attributes
- $F$ is a finite set of semantic rules about the attributes
Each attribute contains a name and a type, and can represent anything we choose.
Let $G :=$
\[\begin{array}{l} S \to E \\ E \to E \mathbin{/} E \\ E \to \textbf{num} \\ E \to \textbf{num} \cdot \textbf{num} \end{array}\]
An example attribute grammar for $G$ is as follows:
Production | Semantic Rules |
---|---|
$S \to E$ | $E$.type = if $E$.isFloat then float else int |
$E \to E_1 \ / \ E_2$ | $E$.isFloat = $E_1$.isFloat or $E_2$.isFloat |
 | $E_1$.type = $E$.type |
 | $E_2$.type = $E$.type |
$E \to \textbf{num}$ | $E$.isFloat = false |
$E \to \textbf{num} \cdot \textbf{num}$ | $E$.isFloat = true |
L-attributed and S-attributed
L-attributed
An attribute grammar is L-attributed if each inherited attribute of $X_i$, $1 \le i \le n$, on the right hand side of $X_0 \to X_1 X_2 \cdots X_n$ depends only on:
- The attributes of the symbols $X_1, X_2, \cdots , X_{i - 1}$ to the left of $X_i$ in the production
- The inherited attributes of $X_0$
S-attributed
An attribute grammar is S-attributed if it uses synthesised attributes only.
Every S-attributed grammar is L-attributed.
Synthesised and Inherited Attributes
Let $X_0 \to X_1 X_2 \cdots X_n$ be a production, and $A(X)$ be the set of attributes associated with a grammar symbol $X$.
Synthesised
A synthesised attribute $syn$ of $X_0$ is computed by:
\[X_0.syn = f(A(X_1), A(X_2), \dots , A(X_n))\]
The attribute $syn$ on a tree node depends on those of its children.
Inherited
An inherited attribute $inh$ of $X_i$, where $1 \le i \le n$, is computed by:
\[X_i.inh = g(A(X_0), A(X_1), \dots , A(X_n))\]
The attribute $inh$ on a tree node depends on those of its parent and/or siblings. $inh$ can also depend on other attributes in $A(X_i)$.
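As an illustration, here is a minimal sketch in Java of evaluating the isFloat (synthesised) and type (inherited) attributes from the example attribute grammar above; the Expr class and its fields are assumptions made for this sketch.

// A node of the expression tree for E -> E1 / E2 | num | num . num
class Expr {
    Expr left, right;       // both non-null only for a division node
    boolean isIntLiteral;   // true for "num", false for "num . num"
    boolean isFloat;        // synthesised attribute: computed bottom-up
    String type;            // inherited attribute: passed down from the parent

    // Synthesised pass: isFloat depends only on the attributes of the children
    boolean synthesise() {
        if (left != null)
            // E.isFloat = E1.isFloat or E2.isFloat (| rather than || so both subtrees are visited)
            isFloat = left.synthesise() | right.synthesise();
        else
            isFloat = !isIntLiteral;   // num -> false, num . num -> true
        return isFloat;
    }

    // Inherited pass: type is handed down unchanged to the children (E1.type = E2.type = E.type)
    void inherit(String t) {
        type = t;
        if (left != null) {
            left.inherit(t);
            right.inherit(t);
        }
    }
}

For the root, the rule for $S \to E$ would set the inherited attribute from the synthesised one: root.inherit(root.synthesise() ? "float" : "int").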
Static Semantics
A semantic analyser enforces a language’s semantic constraints, specifically:
- Scope rules
- Type rules
Scope
The scope of a declaration is the portion of the program over which the declaration takes effect.
A declaration is in scope at a particular point in a program if and only if the declaration’s scope includes that point.
An example of scope:
int k; // scope of k begins
void foo() {
int i; // scope of i begins
int j; // scope of j begins
i = 1;
j = 7;
print(i); // -> 1
print(j); // -> 7
{ // new block
int i; // scope of the block-local i begins
i = 2;
print(i); // -> 2
print(j); // -> 7
}
print(i); // -> 1
print(j); // -> 7
}
Scope Levels
- Declarations in the outermost block are in level 1
- Increment the scope level by 1 each time we move from an enclosing block into an enclosed block, as in the sketch below
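For example, under these rules the declarations below would sit at the indicated levels (a sketch, assuming the function body counts as a block enclosed by the outermost one):

int g;             // level 1: declared in the outermost block
void bar() {
    int x;         // level 2: the function body is enclosed by the outermost block
    {
        int y;     // level 3: this block is enclosed by the function body
    }
}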
Type Checking
Type checking is the process of applying a language’s type rules to infer the type of each construct and to compare that type with the expected type.
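For example, in a Java-like language the checker infers a type for each expression and compares it with the type expected by the context:

int i = 1;
float f = 2.0f;
float g = i + f;   // inferred type of i + f is float, matching the expected type: accepted
int j = i + f;     // inferred type float does not match the expected type int: type error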
Code Generation
Jasmin
Example
Java program:
class Test {
static int add(int a, int b) {
return a + b;
}
public static void main(String argv[]) {
Test.add(1, 2);
}
}
Jasmin output:
.source Test.java
.class Test
.super java/lang/Object
.method static add(II)I
.limit stack 2
.limit locals 2
.var 0 is a I from Label0 to Label1
.var 1 is b I from Label0 to Label1
Label0:
iload_0
iload_1
iadd
Label1:
ireturn
.end method
.method public static main([Ljava/lang/String;)V
.limit stack 2
.limit locals 1
.var 0 is argv [Ljava/lang/String; from Label0 to Label1
Label0:
iconst_1
iconst_2
invokestatic Test/add(II)I
pop
Label1:
return
.end method