Česky
Kamil Dudka

Study

File detail

Name:Downloadvyp.txt [Download]
Location: study > min
Size:9.8 KB
Last modification:2009-06-26 20:53

File content

Copyright (c)  2008  Kamil Dudka.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2
or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
Texts.  A copy of the license is included in the section entitled "GNU
Free Documentation License".

Compiler construction
=====================

Compiler
--------
- input: source program
- output: target program
- method: A compiler reads source program in source language and translates it
  to target program in target language
  (source and target programs are functionally equivalent)

- compilation phases:
    1. Lexical analyzer (scanner)
       --> sequence of lexical units (lexemes)
       - lexical units are represented by tokens
       - each token has two parts:
            type - what kind of lexical unit
            value (optional) - typically address to symbol table

    2. Syntax analyzer (parser)
       - the most complex part of compiler
       - it dictates what to do in scanning, code, ...
         (syntax directed translation)
       - building derivation tree
           - every derivation tree corresponds to derivation
           - check if the code is syntactically correct
           - derivation from start symbol to string of lexemes
           - if we find the derivation then the program is correct
           - left-most derivation (from root of the tree)
           - right-most derivation (from the tree bottom)
       - last phase which deals with non-terminals

       a) derivation tree - can't be implemented but just simulated
       b) syntax tree - can be implemented... (will be discussed later)
       c) abstract (computational) syntax tree

    3. Semantic analyzer
       - set of actions
       - type checking (may imply conversions)
       - we must check variable declaration here (in some languages)
           - if all used variables are declared (print errors)
           - if all declared variables are used (print warnings)
       - after semantic analyse compiler has to generate the code!!
       - the code has to be correct after this phase

    4. Intermediate code generator
       - three-address code generation
       - main goal of generating this code is optimization
       - can be architecture independent

    5. Optimizer
       - for example we can convert int(60) to float(60) in the compile time
         to save some time
       - methods:
            a) constant propagation
            b) copy propagation
            c) dead code elimination
            ...

    6. Target code generator
       - for each instruction in intermediate language exists short sequence
         of instruction in target language (assembler, machine code, ...)
       - architecture dependent part of compiler
       - there can be also architecture dependent optimization

    Symbol table
    ------------
    name <---> position (pointer)

Languages and compilers
-----------------------
- there are two points of view:
    - theoretical: Does given string belongs to the language?
    - practical: translation of the source programme to target...
        - we need to describe errors (not just 'bad input' and period)

Regular expressions
-------------------
    alphabet - finite non-empty set of symbols

    - RE can be simplified using precedence: + < . < *

Finite automata
===============
- finite tool which can describe an infinite language
- indexes ^+ and ^* are taken from discrete mathematics:
    R - a relation
    R^+ - transitive closure
    R^* - transitive and reflexive closure
- fundamental part of scanner in most cases
- keywords recognition is usually not part of scanner
  (it is recognized 1st as identifier, and then as a keyword)
- scope of identifiers
    --> leads to a stack structure
    - usually two stacks: pointers, lookup (implementation detail)

Grammar
=======
- finite set of rules, which generate strings of the language

    Context-Free Grammar (CFG) - rules are restricted (we know)
    Context-Free Language (CFL) - exists CFG which generates the language

    derivation step - change of string by a rule (any nonterminal rewritten)
    sequence of derivation steps

- language generated by grammar G:
    $G$ generates a terminal string $w$ by a sequence of derivation steps from
    $S$ to $w$

- rule tree - graphically represents the rule
- derivation tree - corresponds to a derivation

- leftmost derivation - the leftmost nonterminal is rewritten
- rightmost derivation - the rightmost nonterminal is rewritten
--> these variants are equivalent (generate the same languages):
        Without any loss of generality, we can consider only leftmost or only
        rightmost derivations.

- ambiguous grammar - exist more than one derivation tree for one string
- inherently ambiguous CFL - there is no unambiguous grammar which generates
                             the language

Pushdown automata
=================
- 7-tuple (we know)
- do not forget epsilon-transition

- one symbol is popped from stack in one transition (left side of the rule)
- more symbol can be pushed to stack in one transition (right side of the
  rule)

- configuration is defined by stack content, control state and the reminder of
  input

- move ... relation on the set of configurations
- ^+, ^* closures can be defined

- there are three approaches to define language accepted by PDA:
    1. by final state
    2. by empty pushdown
    3. by final state and empty pushdown

    - all three approaches are equivalent in its strength

Deterministic PDA (DPDA)
------------------------
- only one possible move from any configuration
- PDAs are stronger than DPDAs (we know)

Extended PDA (EPDA)
-------------------
- the pushdown top is represented as string rather than a single symbol
- there are three approaches to define language accepted by EPDA
  (same as PDA)
- PDAs and EPDAs are equivalent

Parsing models for CFGs
-----------------------
    1. Top-Down Parsing
    2. Bottom-Up Parsing <-- EPDA

- TODO: learn CFG->PDA algorithm from TIN !!
- CFG and PDA language classes are equivalent (we know)



- last VYP lesson is missing here...


Top-Down Parsing
================
- main question: which rule should I choose (ambiguity problem)
- basic idea: construct deterministic table (does not exist for all CFGs)
    --> family of deterministic CFLs

Set Empty
---------
    \epsilon    <-- if x derives empty string
    empty set   <-- if not

- algorithm Empty(X):
    input: grammar G
    output: Empty(X) for each terminal and non-terminal

    - for terminal it is always empty (trivial)
    - for each non-terminal initialize the set
    - until no empty set can by changed... (we know)

Set First
---------
    First(x) is the set of all terminals which can begin a sentential form
    derivable from x.

- algorithm First(X) - for symobls:
    input: grammar G
    output: First(X) for each terminal and non-terminal

    - for each terminal set First(a) to a
    - for each non-terminal set First(A) to empty set
    - until no first set can be changed... (we know)

Algorithm First, Empty for any string
-------------------------------------
    - similar as for symbols
    - using First(X), Empty(X)
    - Empty(\epsilon) = \epsilon (not empty set!!)

Set Follow
----------
    Follow(A) is the set of all terminals that can come right after A in a
    sentential form of G.

- algorithm Follow(A):
    input: grammar G
    output: Follow(A) for each *non-terminal*

    TODO: try example

Set Predict
-----------
    Predict(A-->x) is the set of all terminals that can begin a string obtained
    by a derivation that starts by using A-->x.

- for rules(not symbols!!)
- algorithm Predict(A-->x):
    if (Empty(X1X2...Xn)==\epsilon)
        Predict(A-->x) = First(X1X2...Xn) | Follow(A);
    else
        Predict(A-->x) = First(X1X2...Xn);

LL Grammar
----------
- it is uneasy with \epsilon-rules
- for every input symbol there is no more than one rule (such that...)

Left Recursion Replacement
--------------------------
- viz. TIN


Recursive-Descent Parsing
-------------------------
- if we want to change the grammar, we must change the parser (program)
- "recursive" because a non-terminal calls itself (directly or indirectly)

Predictive Parsing
------------------
- if we want to change the grammar, we change only the table (the program
  stays intact)

Simple Error recovery
---------------------
- if a non-terminal fails, check for expected terminal of rule of next
  non-terminal
- skip all tokens up to this terminal
- then continue with the next non-terminal

Panic-Mode (Hartman) Error Recovery
-----------------------------------
    TODO

Bottom-Up Parsing
=================
- LR parsing, precedence parsing

Operator-Precedence Parser
--------------------------
- input: precedence table, input string
- output: right parse of input string

- based on precedence table
- rows/columns ... symbols
- cells - one of these values: <, =, > and "blank"
    blank - means always syntax error

    handle - right side of the rule
    $ - symbol at pushdown bottom

- precedence table construction is based on priority and associativity
  of operators

LR Parser
=========
- input: LR table based on grammar, input string
- output: right parse of input string

- a pushdown item is a pair <symbol, state>, rather than only symbols
- the alg. is simple (page 15/42)
- there are many way to construct LR table

Simple LR (SLR)
---------------
    1. add extra rule S' --> S
       (to by sure syntax analysis is successfully complete)

    if A --> x is rule and x=yz, we call A --> y'.'z as "item"
        y ... top of pushdown
        z ... part of input which can be reduced to z

    Closure(I)
    - input: item I
    - output: set of items defined by alg.
    - look at alg. (page 21/42)

    Set $\Theta_G$
        TODO (simple)

    Contents(x)
    - input: G, $\Theta_G$
    - output: Contents(x) for ALL x from $\Theta_G$