# Study

## File detail

- Name: vyp.txt
- Location: study > min
- Size: 9.8 KB
- Last modification: 2009-06-26 20:53

## File content

Copyright (c) 2008 Kamil Dudka. Permission is granted to copy, distribute
and/or modify this document under the terms of the GNU Free Documentation
License, Version 1.2 or any later version published by the Free Software
Foundation; with no Invariant Sections, no Front-Cover Texts, and no
Back-Cover Texts. A copy of the license is included in the section entitled
"GNU Free Documentation License".

Compiler construction
=====================

Compiler
--------
- input: source program
- output: target program
- method: a compiler reads a source program in the source language and
  translates it to a target program in the target language (the source and
  target programs are functionally equivalent)
- compilation phases:
  1. Lexical analyzer (scanner) --> sequence of lexical units (lexemes)
     - lexical units are represented by tokens
     - each token has two parts:
       - type - what kind of lexical unit it is
       - value (optional) - typically an address into the symbol table
  2. Syntax analyzer (parser)
     - the most complex part of the compiler
     - it dictates what to do in scanning, code generation, ...
       (syntax-directed translation)
     - builds the derivation tree
       - every derivation tree corresponds to a derivation
     - checks whether the code is syntactically correct
       - derivation from the start symbol to a string of lexemes
       - if we find the derivation, the program is correct
     - left-most derivation (from the root of the tree)
     - right-most derivation (from the bottom of the tree)
     - the last phase which deals with non-terminals
     a) derivation tree - can't be implemented, only simulated
     b) syntax tree - can be implemented... (will be discussed later)
     c) abstract (computational) syntax tree
  3. Semantic analyzer
     - set of actions
     - type checking (may imply conversions)
     - in some languages we must check variable declarations here:
       - whether all used variables are declared (print errors)
       - whether all declared variables are used (print warnings)
     - after semantic analysis the compiler has to generate the code!!
       - the code has to be correct after this phase
  4. Intermediate code generator
     - three-address code generation
     - the main goal of generating this code is optimization
     - can be architecture independent
  5. Optimizer
     - for example, we can convert int(60) to float(60) at compile time
       to save some run time
     - methods:
       a) constant propagation
       b) copy propagation
       c) dead code elimination
       ...
  6. Target code generator
     - for each instruction in the intermediate language there exists a
       short sequence of instructions in the target language (assembler,
       machine code, ...)
     - the architecture dependent part of the compiler
     - there can also be architecture dependent optimization

Symbol table
------------
name <---> position (pointer)

Languages and compilers
-----------------------
- there are two points of view:
  - theoretical: does a given string belong to the language?
  - practical: translation of the source program to the target one...
- we need to describe errors (not just 'bad input' and period)

Regular expressions
-------------------
alphabet - finite non-empty set of symbols
- REs can be simplified using operator precedence: + < . < *

Finite automata
===============
- a finite tool which can describe an infinite language
- the ^+ and ^* indexes are taken from discrete mathematics:
  R   - a relation
  R^+ - transitive closure
  R^* - transitive and reflexive closure
- fundamental part of the scanner in most cases
- keyword recognition is usually not part of the scanner (a keyword is
  recognized first as an identifier and only then as a keyword)
- scope of identifiers --> leads to a stack structure
  - usually two stacks: pointers, lookup (implementation detail)

Grammar
=======
- a finite set of rules which generate the strings of the language

Context-Free Grammar (CFG) - the rules are restricted (we know)
Context-Free Language (CFL) - there exists a CFG which generates the
language

derivation step - change of a string by a rule (any nonterminal rewritten)
sequence of derivation steps
- language generated by grammar G: G generates a terminal string w by a
  sequence of derivation steps from S to w
- rule tree - graphically represents a rule
- derivation tree - corresponds to a derivation
- leftmost derivation - the leftmost nonterminal is rewritten
- rightmost derivation - the rightmost nonterminal is rewritten
  --> these variants are equivalent (they generate the same languages):
  without any loss of generality, we can consider only leftmost or only
  rightmost derivations
- ambiguous grammar - there exists more than one derivation tree for one
  string
- inherently ambiguous CFL - there is no unambiguous grammar which
  generates the language

Pushdown automata
=================
- 7-tuple (we know)
- do not forget the epsilon-transition
- one symbol is popped from the stack in one transition (left side of the
  rule)
- more symbols can be pushed onto the stack in one transition (right side
  of the rule)
- a configuration is defined by the stack content, the control state and
  the remainder of the input
- move ... a relation on the set of configurations
  - the ^+, ^* closures can be defined
- there are three approaches to define the language accepted by a PDA:
  1. by final state
  2. by empty pushdown
  3. by final state and empty pushdown
  - all three approaches are equivalent in strength

Deterministic PDA (DPDA)
------------------------
- only one possible move from any configuration
- PDAs are stronger than DPDAs (we know)

Extended PDA (EPDA)
-------------------
- the pushdown top is represented as a string rather than a single symbol
- there are three approaches to define the language accepted by an EPDA
  (same as for a PDA)
- PDAs and EPDAs are equivalent

Parsing models for CFGs
-----------------------
1. Top-Down Parsing
2. Bottom-Up Parsing <-- EPDA
- TODO: learn the CFG->PDA algorithm from TIN !!
- the CFG and PDA language classes are equivalent (we know)
- the last VYP lesson is missing here...

Top-Down Parsing
================
- main question: which rule should I choose (ambiguity problem)
- basic idea: construct a deterministic table (does not exist for all
  CFGs) --> family of deterministic CFLs

Set Empty
---------
Empty(x) = {\epsilon} <-- if x derives the empty string
Empty(x) = {}         <-- if not
- algorithm Empty(X):
  input: grammar G
  output: Empty(X) for each terminal and non-terminal
  - for a terminal it is always the empty set (trivial)
  - for each non-terminal initialize the set
  - iterate until no Empty set can be changed... (we know)

Set First
---------
First(x) is the set of all terminals which can begin a sentential form
derivable from x.
- algorithm First(X) - for symbols:
  input: grammar G
  output: First(X) for each terminal and non-terminal
  - for each terminal a set First(a) to {a}
  - for each non-terminal A set First(A) to the empty set
  - iterate until no First set can be changed... (we know)

Algorithm First, Empty for any string
-------------------------------------
- similar as for symbols
- uses First(X), Empty(X)
- Empty(\epsilon) = {\epsilon} (not the empty set!!)

Set Follow
----------
Follow(A) is the set of all terminals that can come right after A in a
sentential form of G.
- algorithm Follow(A):
  input: grammar G
  output: Follow(A) for each *non-terminal*
  TODO: try an example

Set Predict
-----------
Predict(A-->x) is the set of all terminals that can begin a string
obtained by a derivation that starts by using A-->x.
- defined for rules (not symbols!!)
- algorithm Predict(A-->x), where x = X1 X2 ... Xn:
    if (Empty(X1 X2 ... Xn) == {\epsilon})
        Predict(A-->x) = First(X1 X2 ... Xn) union Follow(A);
    else
        Predict(A-->x) = First(X1 X2 ... Xn);

LL Grammar
----------
- it is tricky with \epsilon-rules
- for every input symbol there is no more than one applicable rule
  (such that...)

Left Recursion Replacement
--------------------------
- see TIN

Recursive-Descent Parsing
-------------------------
- if we want to change the grammar, we must change the parser (the
  program)
- "recursive" because a non-terminal calls itself (directly or
  indirectly)

Predictive Parsing
------------------
- if we want to change the grammar, we change only the table (the
  program stays intact)

Simple Error Recovery
---------------------
- if a non-terminal fails, check for the expected terminal of the rule of
  the next non-terminal
- skip all tokens up to this terminal
- then continue with the next non-terminal

Panic-Mode (Hartman) Error Recovery
-----------------------------------
TODO

Bottom-Up Parsing
=================
- LR parsing, precedence parsing

Operator-Precedence Parser
--------------------------
- input: precedence table, input string
- output: right parse of the input string
- based on the precedence table
  - rows/columns ... symbols
  - cells - one of these values: <, =, > and "blank"
    - blank always means a syntax error
- handle - right side of a rule
- $ - symbol at the pushdown bottom
- the precedence table construction is based on the priority and
  associativity of operators

LR Parser
=========
- input: LR table based on the grammar, input string
- output: right parse of the input string
- a pushdown item is a pair <symbol, state> rather than only a symbol
- the algorithm is simple (page 15/42)
- there are many ways to construct the LR table

Simple LR (SLR)
---------------
1. add an extra rule S' --> S (to be sure the syntax analysis is
   successfully complete)

if A --> x is a rule and x = yz, we call A --> y'.'z an "item"
  y ... top of the pushdown
  z ... part of the input which can be reduced to z

Closure(I)
- input: item I
- output: set of items defined by the algorithm
- look at the algorithm (page 21/42)

Set $\Theta_G$
TODO (simple)

Contents(x)
- input: G, $\Theta_G$
- output: Contents(x) for ALL x from $\Theta_G$
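The scanner section above says each token has a type and an optional value. A minimal sketch of such a scanner, for a hypothetical toy language (the token names and patterns are illustrative assumptions, not from the notes):

```python
import re

# Each token is a (type, value) pair; the value is None for tokens whose
# spelling is fixed (operators), and the lexeme itself otherwise.  In a
# real compiler the value would typically be a symbol-table reference.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("PLUS",   r"\+"),
    ("TIMES",  r"\*"),
    ("SKIP",   r"\s+"),       # whitespace: matched but not emitted
]

def scan(text):
    """Turn an input string into a list of (type, value) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                if name != "SKIP":
                    value = m.group(0) if name in ("NUMBER", "ID") else None
                    tokens.append((name, value))
                pos += m.end()
                break
        else:
            raise ValueError(f"lexical error at position {pos}")
    return tokens
```

For example, `scan("x + 42")` yields `[("ID", "x"), ("PLUS", None), ("NUMBER", "42")]`. Note that the rule order implements the longest-useful-match policy only because NUMBER and ID patterns do not overlap here; keyword recognition would happen afterwards, as the notes describe.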
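The Empty and First algorithms above can be sketched as a fixed-point iteration: terminals never derive epsilon and First(a) = {a}; non-terminal sets grow until nothing changes. The grammar below is a small hypothetical example, not one from the notes:

```python
grammar = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],   # [] is an epsilon rule
    "T":  [["id"]],
}

def is_terminal(sym):
    return sym not in grammar

# Empty: the set of non-terminals that derive the empty string
empty = set()
changed = True
while changed:
    changed = False
    for A, rules in grammar.items():
        if A not in empty and any(all(x in empty for x in r) for r in rules):
            empty.add(A)
            changed = True

# First(A) for every non-terminal (First(a) = {a} is applied inline)
first = {A: set() for A in grammar}
changed = True
while changed:
    changed = False
    for A, rules in grammar.items():
        for rule in rules:
            for x in rule:
                add = {x} if is_terminal(x) else first[x]
                if not add <= first[A]:
                    first[A] |= add
                    changed = True
                if is_terminal(x) or x not in empty:
                    break   # x cannot derive epsilon, so stop here
```

Here the fixed point gives `empty == {"E'"}` and First(E) = First(T) = {id}, First(E') = {+}.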
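The Follow and Predict definitions can be sketched the same way. This assumes Empty and First have already been computed (they are written out by hand below for the same hypothetical grammar); `$` marks the end of input:

```python
grammar = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],   # () is an epsilon rule
    "T":  [("id",)],
}
empty = {"E'"}
first = {"E": {"id"}, "E'": {"+"}, "T": {"id"}, "+": {"+"}, "id": {"id"}}

def first_of(string):
    """First of a string X1..Xn, plus a flag: does it derive epsilon?"""
    out = set()
    for x in string:
        out |= first[x]
        if x not in empty:
            return out, False
    return out, True

follow = {A: set() for A in grammar}
follow["E"].add("$")                  # start symbol can be followed by $
changed = True
while changed:
    changed = False
    for A, rules in grammar.items():
        for rule in rules:
            for i, x in enumerate(rule):
                if x not in grammar:
                    continue          # Follow is defined for non-terminals
                f, eps = first_of(rule[i + 1:])
                add = f | (follow[A] if eps else set())
                if not add <= follow[x]:
                    follow[x] |= add
                    changed = True

def predict(A, rule):
    """Predict(A --> rule), exactly as in the notes."""
    f, eps = first_of(rule)
    return f | (follow[A] if eps else set())
```

The two Predict sets of E' come out as {+} and {$}; they are disjoint, which is precisely the LL(1) condition the "LL Grammar" section hints at.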
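Recursive-descent parsing, as described above, turns each non-terminal into a procedure that may call itself. A sketch for the same hypothetical LL(1) grammar (E --> T E', E' --> + T E' | epsilon, T --> id), where the rule choice is driven by the current token, i.e. by the Predict sets:

```python
class Parser:
    """Hand-written recursive-descent parser: one method per non-terminal."""

    def __init__(self, tokens):
        self.tokens = tokens + ["$"]   # $ terminates the input
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def eat(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected}, got {self.peek()}")
        self.pos += 1

    def parse_E(self):                 # E --> T E'
        self.parse_T()
        self.parse_E_prime()

    def parse_E_prime(self):
        if self.peek() == "+":         # Predict(E' --> + T E') = {+}
            self.eat("+")
            self.parse_T()
            self.parse_E_prime()       # the non-terminal calls itself
        # else: epsilon rule, Predict(E' --> eps) = Follow(E') = {$}

    def parse_T(self):                 # T --> id
        self.eat("id")

    def parse(self):
        self.parse_E()
        self.eat("$")
        return True
```

`Parser(["id", "+", "id"]).parse()` succeeds, while `Parser(["id", "+"]).parse()` raises a SyntaxError. As the notes point out, changing the grammar means rewriting these procedures, which is the price of this approach compared with table-driven predictive parsing.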
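The operator-precedence parser section can be illustrated with a toy table over id, + and * ('*' binds tighter than '+', both left-associative). The cells hold '<' and '>' as in the notes, a missing cell means a syntax error, and a '<' marker on the pushdown records where the current handle starts; the grammar, table and token names are illustrative assumptions:

```python
PREC = {
    ("$", "id"): "<", ("$", "+"): "<", ("$", "*"): "<",
    ("id", "+"): ">", ("id", "*"): ">", ("id", "$"): ">",
    ("+", "id"): "<", ("+", "+"): ">", ("+", "*"): "<", ("+", "$"): ">",
    ("*", "id"): "<", ("*", "+"): ">", ("*", "*"): ">", ("*", "$"): ">",
}
RULES = {("id",): "E", ("E", "+", "E"): "E", ("E", "*", "E"): "E"}

def parse(tokens):
    """Return the right parse (the sequence of reductions) of the input."""
    stack = ["$"]                     # '<' entries mark handle starts
    rest = tokens + ["$"]
    output = []
    while not (stack == ["$", "E"] and rest[0] == "$"):
        top = next(s for s in reversed(stack) if s not in ("E", "<"))
        rel = PREC.get((top, rest[0]))
        if rel == "<":
            # shift: mark the handle start just above the topmost terminal
            i = len(stack) - 1 - stack[::-1].index(top)
            stack.insert(i + 1, "<")
            stack.append(rest.pop(0))
        elif rel == ">":
            # reduce: everything above the last '<' is the handle
            i = len(stack) - 1 - stack[::-1].index("<")
            handle = tuple(stack[i + 1:])
            if handle not in RULES:
                raise SyntaxError(f"no rule for handle {handle}")
            del stack[i:]
            stack.append(RULES[handle])
            output.append(f"E -> {' '.join(handle)}")
        else:
            raise SyntaxError(f"no precedence for ({top}, {rest[0]})")
    return output
```

On `["id", "+", "id", "*", "id"]` the multiplication is reduced before the addition, matching the priorities the table encodes; the returned reduction sequence is the right parse the notes mention as the parser's output.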