# Study

## File detail

- Name: vyp.txt
- Location: study > min
- Size: 9.8 KB
- Last modification: 2009-06-26 20:53

## File content

Copyright (c)  2008  Kamil Dudka.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
Texts.  A copy of the license is included in the section entitled "GNU
Free Documentation License".

Compiler construction
=====================

Compiler
--------
- input: source program
- output: target program
- method: a compiler reads a source program written in the source language
and translates it into a target program in the target language
(the source and target programs are functionally equivalent)

- compilation phases:
1. Lexical analyzer (scanner)
--> sequence of lexical units (lexemes)
- lexical units are represented by tokens
- each token has two parts:
type - what kind of lexical unit it is
value (optional) - typically an address into the symbol table
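The token representation above can be sketched in Python as (type, value) pairs; the token types and patterns below are assumed examples, and a real scanner would store a symbol-table address instead of the raw lexeme:

```python
import re

# Minimal scanner sketch: longest-match-first over a fixed pattern list.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),          # whitespace produces no token
]

def scan(source):
    tokens = []
    pos = 0
    while pos < len(source):
        for ttype, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                if ttype != "SKIP":
                    tokens.append((ttype, m.group()))
                pos += m.end()
                break
        else:
            raise SyntaxError(f"lexical error at position {pos}")
    return tokens
```

For example, `scan("x = y + 42")` yields a token per lexeme, with whitespace dropped.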

2. Syntax analyzer (parser)
- the most complex part of compiler
- it drives the other phases (scanning, code generation, ...)
(syntax-directed translation)
- building derivation tree
- every derivation tree corresponds to a derivation
- checks whether the code is syntactically correct
- looks for a derivation from the start symbol to the string of lexemes
- if such a derivation exists, the program is syntactically correct
- left-most derivation (from root of the tree)
- right-most derivation (from the tree bottom)
- last phase which deals with non-terminals

a) derivation tree - cannot be implemented directly, only simulated
b) syntax tree - can be implemented... (will be discussed later)
c) abstract (computational) syntax tree

3. Semantic analyzer
- set of actions
- type checking (may imply conversions)
- we must check variable declaration here (in some languages)
- if all used variables are declared (print errors)
- if all declared variables are used (print warnings)
- after semantic analysis the compiler has to generate the code!!
- the code has to be correct after this phase

4. Intermediate code generator
- the main goal of generating this code is to enable optimization
- can be architecture independent

5. Optimizer
- for example, we can convert int(60) to float(60) at compile time
to save run time
- methods:
a) constant propagation
b) copy propagation
...
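A compile-time evaluation of this kind (constant folding) can be sketched over a tiny tuple-based AST; the node shapes `("num", n)`, `("var", name)`, `("add"/"mul", l, r)` are a hypothetical encoding, not one from the lecture:

```python
# Constant-folding sketch: collapse subtrees whose operands are
# already constants, leave everything else untouched.
def fold(node):
    if node[0] in ("add", "mul"):
        left, right = fold(node[1]), fold(node[2])
        if left[0] == "num" and right[0] == "num":
            value = left[1] + right[1] if node[0] == "add" else left[1] * right[1]
            return ("num", value)
        return (node[0], left, right)
    return node                      # leaf: number or variable
```

For example, `2 + 3 * 20` folds to the single constant `62`, while a subtree containing a variable is returned unchanged.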

6. Target code generator
- for each instruction of the intermediate language there exists a short
sequence of instructions in the target language (assembly, machine code, ...)
- architecture-dependent part of the compiler
- there can also be architecture-dependent optimization

Symbol table
------------
name <---> position (pointer)

Languages and compilers
-----------------------
- there are two points of view:
- theoretical: does a given string belong to the language?
- practical: translation of the source program to the target one...
- we need to describe errors (not just 'bad input', period)

Regular expressions
-------------------
alphabet - finite non-empty set of symbols

- RE can be simplified using precedence: + < . < *

Finite automata
===============
- a finite tool that can describe an infinite language
- indexes ^+ and ^* are taken from discrete mathematics:
R - a relation
R^+ - transitive closure
R^* - transitive and reflexive closure
- fundamental part of scanner in most cases
- keyword recognition is usually not part of the scanner
(a keyword is recognized first as an identifier, then reclassified as a keyword)
- scope of identifiers
--> leads to a stack structure
- usually two stacks: pointers, lookup (implementation detail)
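The two ideas above — identifier-then-keyword reclassification and a stack of scopes — can be sketched as follows (the keyword set and the single-list scope stack are assumed simplifications):

```python
# Reclassify an already-scanned identifier: a keyword is recognized
# first as an identifier, then looked up in a keyword set.
KEYWORDS = {"if", "else", "while", "return"}

def classify(lexeme):
    return ("KEYWORD", lexeme) if lexeme in KEYWORDS else ("IDENT", lexeme)

# Scoped symbol-table lookup: scopes is a stack (list) of dicts,
# innermost scope last; the innermost binding wins.
def lookup(scopes, name):
    for table in reversed(scopes):
        if name in table:
            return table[name]
    return None
```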

Grammar
=======
- finite set of rules, which generate strings of the language

Context-Free Grammar (CFG) - rules are restricted (we know)
Context-Free Language (CFL) - there exists a CFG which generates the language

derivation step - change of a string by applying a rule (any nonterminal
may be rewritten)
derivation - a sequence of derivation steps

- language generated by grammar G:
$G$ generates a terminal string $w$ by a sequence of derivation steps from
$S$ to $w$

- rule tree - graphically represents the rule
- derivation tree - corresponds to a derivation

- leftmost derivation - the leftmost nonterminal is rewritten
- rightmost derivation - the rightmost nonterminal is rewritten
--> these variants are equivalent (generate the same languages):
Without any loss of generality, we can consider only leftmost or only
rightmost derivations.

- ambiguous grammar - there exists more than one derivation tree for some string
- inherently ambiguous CFL - there is no unambiguous grammar which generates
the language

Pushdown automata
=================
- 7-tuple (we know)
- do not forget epsilon-transition

- one symbol is popped from the stack in one transition (left-hand side of
the rule)
- more symbols can be pushed onto the stack in one transition (right-hand
side of the rule)

- a configuration is defined by the stack content, the control state, and the
remainder of the input

- move ... relation on the set of configurations
- ^+, ^* closures can be defined

- there are three approaches to define language accepted by PDA:
1. by final state
2. by empty pushdown
3. by final state and empty pushdown

- all three approaches are equivalent in their power
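Acceptance by empty pushdown (approach 2) can be illustrated with a tiny simulator for the classic language { a^n b^n | n >= 1 } — an assumed example, not one from the lecture:

```python
# PDA sketch for { a^n b^n | n >= 1 }, accepting by empty pushdown:
# push one stack symbol per 'a', pop one per 'b'; the word is accepted
# when the whole input is read and the stack is empty.
def accepts(word):
    stack = []
    state = "push"
    for symbol in word:
        if state == "push" and symbol == "a":
            stack.append("S")
        elif symbol == "b" and stack:
            state = "pop"            # after the first 'b', no more 'a's
            stack.pop()
        else:
            return False             # no transition: reject
    return state == "pop" and not stack
```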

Deterministic PDA (DPDA)
------------------------
- at most one possible move from any configuration
- PDAs are stronger than DPDAs (we know)

Extended PDA (EPDA)
-------------------
- the pushdown top used in one transition is a string rather than a single
symbol
- there are three approaches to define language accepted by EPDA
(same as PDA)
- PDAs and EPDAs are equivalent

Parsing models for CFGs
-----------------------
1. Top-Down Parsing
2. Bottom-Up Parsing <-- EPDA

- TODO: learn CFG->PDA algorithm from TIN !!
- CFG and PDA language classes are equivalent (we know)

- last VYP lesson is missing here...

Top-Down Parsing
================
- main question: which rule should I choose (ambiguity problem)
- basic idea: construct deterministic table (does not exist for all CFGs)
--> family of deterministic CFLs

Set Empty
---------
Empty(x) = {\epsilon}  <-- if x derives the empty string
Empty(x) = {}          <-- otherwise

- algorithm Empty(X):
input: grammar G
output: Empty(X) for each terminal and non-terminal

- for a terminal it is always the empty set (trivial)
- for each non-terminal initialize the set
- iterate until no Empty set can be changed... (we know)
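The iteration above is a fixed-point computation; a minimal sketch, assuming rules are encoded as (A, rhs) pairs with rhs a tuple of symbols:

```python
# Empty (nullable) sets by fixed-point iteration: a nonterminal derives
# epsilon iff it has a rule whose right-hand side symbols all do.
def nullable(rules):
    empty = set()
    changed = True
    while changed:                   # repeat until nothing changes
        changed = False
        for lhs, rhs in rules:
            if lhs not in empty and all(s in empty for s in rhs):
                empty.add(lhs)
                changed = True
    return empty
```

For the grammar S -> A B, A -> eps, B -> b | eps, all three nonterminals come out nullable.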

Set First
---------
First(x) is the set of all terminals which can begin a sentential form
derivable from x.

- algorithm First(X) - for symbols:
input: grammar G
output: First(X) for each terminal and non-terminal

- for each terminal a, set First(a) = {a}
- for each non-terminal A, set First(A) = {}
- iterate until no First set can be changed... (we know)
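A sketch of the same fixed-point idea for First, under the same assumed grammar encoding (rules as (A, rhs) pairs; a symbol is a nonterminal iff it appears on some left-hand side):

```python
# First sets by fixed-point iteration; nullable symbols let the scan
# continue past them to the next symbol of the right-hand side.
def first_sets(rules):
    nonterminals = {lhs for lhs, _ in rules}
    # nullable nonterminals (same fixed point as algorithm Empty)
    empty = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in empty and all(s in empty for s in rhs):
                empty.add(lhs)
                changed = True
    first = {a: set() for a in nonterminals}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            for sym in rhs:
                add = {sym} if sym not in nonterminals else first[sym]
                if not add <= first[lhs]:
                    first[lhs] |= add
                    changed = True
                if sym not in empty:
                    break            # prefix no longer vanishes
    return first
```

For the classic grammar E -> T R, R -> + T R | eps, T -> i, this yields First(E) = First(T) = {i} and First(R) = {+}.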

Algorithm First, Empty for any string
-------------------------------------
- similar as for symbols
- using First(X), Empty(X)
- Empty(\epsilon) = {\epsilon} (not the empty set!!)

Set Follow
----------
Follow(A) is the set of all terminals that can come right after A in a
sentential form of G.

- algorithm Follow(A):
input: grammar G
output: Follow(A) for each *non-terminal*

TODO: try example
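One way to try an example is to compute the fixed point mechanically; a self-contained sketch under the same assumed grammar encoding as before, with "$" as the end-of-input marker placed into Follow of the start symbol:

```python
# Follow sets by fixed-point iteration, recomputing Empty and First
# internally the same way as the algorithms above.
def follow_sets(rules, start):
    nonterminals = {lhs for lhs, _ in rules}
    empty = set()
    changed = True
    while changed:                               # nullable nonterminals
        changed = False
        for lhs, rhs in rules:
            if lhs not in empty and all(s in empty for s in rhs):
                empty.add(lhs); changed = True
    first = {a: set() for a in nonterminals}     # First sets
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            for sym in rhs:
                add = {sym} if sym not in nonterminals else first[sym]
                if not add <= first[lhs]:
                    first[lhs] |= add; changed = True
                if sym not in empty:
                    break
    follow = {a: set() for a in nonterminals}    # Follow sets
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            for i, sym in enumerate(rhs):
                if sym not in nonterminals:
                    continue
                trailer, rest_nullable = set(), True
                for nxt in rhs[i + 1:]:          # First of what follows sym
                    trailer |= {nxt} if nxt not in nonterminals else first[nxt]
                    if nxt not in empty:
                        rest_nullable = False
                        break
                if rest_nullable:                # sym can end the rule body
                    trailer |= follow[lhs]
                if not trailer <= follow[sym]:
                    follow[sym] |= trailer; changed = True
    return follow
```

For E -> T R, R -> + T R | eps, T -> i this gives Follow(E) = Follow(R) = {$} and Follow(T) = {+, $}.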

Set Predict
-----------
Predict(A-->x) is the set of all terminals that can begin a string obtained
by a derivation that starts by using A-->x.

- defined for rules (not symbols!!)
- algorithm Predict(A-->X1X2...Xn):
if (Empty(X1X2...Xn) == {\epsilon})
    Predict(A-->X1X2...Xn) = First(X1X2...Xn) ∪ Follow(A);
else
    Predict(A-->X1X2...Xn) = First(X1X2...Xn);
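The definition above can be sketched directly, taking already computed First, Follow and the nullable set as inputs (the example sets below are for the assumed grammar E -> T R, R -> + T R | eps, T -> i):

```python
# Predict for one rule A -> X1...Xn: First of the right-hand side,
# plus Follow(A) when the whole right-hand side can vanish.
def predict(lhs, rhs, first, follow, empty, nonterminals):
    result = set()
    all_nullable = True
    for sym in rhs:                  # First(X1...Xn)
        result |= {sym} if sym not in nonterminals else first[sym]
        if sym not in empty:
            all_nullable = False
            break
    if all_nullable:                 # Empty(X1...Xn) == {eps}
        result |= follow[lhs]
    return result
```

So Predict(R -> eps) = Follow(R), while Predict(R -> + T R) = {+}.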

LL Grammar
----------
- \epsilon-rules make the definition trickier
- for every input symbol there is at most one applicable rule (such that...)

Left Recursion Replacement
--------------------------
- viz. TIN

Recursive-Descent Parsing
-------------------------
- if we want to change the grammar, we must change the parser (program)
- "recursive" because a non-terminal calls itself (directly or indirectly)

Predictive Parsing
------------------
- if we want to change the grammar, we change only the table (the program
stays intact)
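A sketch of the table-driven alternative for the same kind of toy grammar (E -> T R, R -> + T R | eps, T -> i; the table contents are an assumed example): the driver below never changes, only LL_TABLE does:

```python
# LL(1) predictive-parser driver: a pushdown of grammar symbols plus
# a table indexed by (nonterminal, lookahead). A missing table cell
# (a "blank") is a syntax error.
LL_TABLE = {
    ("E", "i"): ["T", "R"],
    ("R", "+"): ["+", "T", "R"],
    ("R", "$"): [],                  # R -> epsilon
    ("T", "i"): ["i"],
}
NONTERMINALS = {"E", "R", "T"}

def ll_parse(tokens, start="E"):
    input_tape = list(tokens) + ["$"]
    stack = ["$", start]             # pushdown, top at the end
    pos = 0
    while stack:
        top = stack.pop()
        look = input_tape[pos]
        if top in NONTERMINALS:
            rhs = LL_TABLE.get((top, look))
            if rhs is None:
                return False         # blank cell: syntax error
            stack.extend(reversed(rhs))  # expand by the predicted rule
        elif top == look:
            pos += 1                 # match terminal (or the final $)
        else:
            return False
    return pos == len(input_tape)
```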

Simple Error recovery
---------------------
- if a non-terminal fails, check for expected terminal of rule of next
non-terminal
- skip all tokens up to this terminal
- then continue with the next non-terminal

Panic-Mode (Hartman) Error Recovery
-----------------------------------
TODO

Bottom-Up Parsing
=================
- LR parsing, precedence parsing

Operator-Precedence Parser
--------------------------
- input: precedence table, input string
- output: right parse of input string

- based on precedence table
- rows/columns ... symbols
- cells - one of these values: <, =, > and "blank"
blank - always means a syntax error

handle - right side of the rule
$ - symbol at the pushdown bottom
- precedence table construction is based on the priority and associativity
of operators

LR Parser
=========
- input: LR table based on grammar, input string
- output: right parse of input string

- a pushdown item is a pair <symbol, state>, rather than only a symbol
- the alg. is simple (page 15/42)
- there are many ways to construct the LR table

Simple LR (SLR)
---------------
1. add extra rule S' --> S (to be sure the syntax analysis is successfully
complete)

if A --> x is a rule and x=yz, we call A --> y'.'z an "item"
y ... top of pushdown
z ... part of input which can be reduced to z

Closure(I)
- input: item I
- output: set of items defined by alg.
- look at the alg. (page 21/42)

Set $\Theta_G$
TODO (simple)

Contents(x)
- input: G, $\Theta_G$
- output: Contents(x) for ALL x from $\Theta_G$
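Closure(I) can be sketched as a small fixed point over items; here an item is encoded as a tuple (lhs, rhs, dot) with the dot position as an index into rhs, and the grammar encoding (rules as (A, rhs) pairs) is the same assumption as in the earlier sketches:

```python
# LR(0) Closure sketch: whenever the dot stands before a nonterminal,
# add all rules for that nonterminal with the dot at position 0.
def closure(items, rules):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(result):
            if dot < len(rhs):
                sym = rhs[dot]       # symbol right after the dot
                for a, body in rules:
                    if a == sym and (a, body, 0) not in result:
                        result.add((a, body, 0))
                        changed = True
    return result
```

For S' -> S, S -> a S b | c, closing the initial item S' -> '.'S pulls in both S rules.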