Unit – 1
Introduction
The compilation process consists of a sequence of phases. Each phase takes the source program in one representation and produces output in another representation. Each phase takes its input from the previous phase.
The various phases of a compiler are:
Fig 1: phases of compiler
1. Lexical Analysis: Lexical analysis is the first phase of compilation. The lexical analyzer reads the source program character by character and groups the characters into tokens such as keywords, identifiers, operators, and constants.
2. Syntax Analysis: Syntax analysis is the second phase of the compilation process. It takes tokens as input and generates a parse tree as output. In this phase, the parser checks whether the expression formed by the tokens is syntactically correct.
3. Semantic Analysis: Semantic analysis is the third phase of the compilation process. It checks whether the parse tree complies with the rules of the language. The semantic analyzer keeps track of identifiers, their types, and expressions. The output of this phase is an annotated syntax tree.
4. Intermediate Code Generation: In the intermediate code generation phase, the compiler translates the source code into an intermediate code. Intermediate code lies between the high-level language and the machine language, and it is designed so that it can easily be translated into target machine code (see the example after this list).
5. Code Optimization: Code optimization is an optional phase. It is used to improve the intermediate code so that the program runs faster and takes up less space. It removes redundant lines of code and arranges the sequence of statements to speed up program execution (also illustrated in the example after this list).
6. Code Generation: Code generation is the final stage of the compilation process. It takes the optimized intermediate code as input and maps it to the target machine language. The code generator translates the intermediate code into the machine code of the specified computer.
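For example (a sketch of one common intermediate form; the exact notation varies between compilers), the statement a = b * c + b * c might be translated into three-address intermediate code and then optimized by eliminating the redundant computation:

Intermediate code:        After optimization:
t1 = b * c                t1 = b * c
t2 = b * c                a  = t1 + t1
a  = t1 + t2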
Overview
● A compiler is a translator that translates a high-level language into machine language.
● A developer writes programs in a high-level language, while the processor understands only machine language.
● The compiler is used to report errors in the programmer's code.
● The main objective of the compiler is to translate code written in one language into another without altering the program's meaning.
● When you run a program written in a high-level language, it runs in two parts: the source program is first compiled into a target program, and the target program is then executed on the inputs.
Fig 2: Execution process of Compiler
Features of compilers:
● Preserving the correct meaning of the code
● The speed of the target code
● Recognizing legal and illegal program constructs
● Good error reporting and handling
● Correctness
● Speed of compilation
● Code debugging help
Task of compilers:
Tasks which are performed by compilers:
● Divides the source program into parts and imposes a grammatical structure on them.
● Constructs the desired target program from the intermediate representation and also creates the symbol table.
● Detects and reports errors in the source code.
● Manages the storage of all variables and code.
● Supports separate compilation.
● Reads, analyzes, and translates the entire program into a semantically equivalent target program.
● Converts the source code into object code depending on the type of machine.
Key takeaway
● A compiler translates a high-level source program into machine code through a sequence of phases; each phase takes the program in one representation and produces another.
DFA
DFA stands for Deterministic Finite Automata. "Deterministic" refers to the uniqueness of the computation: for each input character, a DFA moves to exactly one state. The DFA does not allow null (ε) moves, which means that it cannot change state without consuming an input character.
A DFA is defined by the 5-tuple {Q, ∑, δ, q0, F}:
Q: finite set of all states
∑: finite set of input symbols
δ: transition function, δ: Q x ∑ → Q
q0: initial state
F: set of final states
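As an illustration (the machine and names are ours, not from the text), a DFA with Q = {S0, S1}, ∑ = {0, 1}, q0 = S0, and F = {S0} that accepts binary strings containing an even number of 0s can be coded in C directly from its transition function:

#include <stdio.h>

enum state { S0, S1 };   /* S0: even number of 0s seen so far, S1: odd */

/* Transition function delta: Q x Sigma -> Q */
static enum state delta(enum state q, char c) {
    if (c == '0') return (q == S0) ? S1 : S0;  /* a '0' flips the parity */
    return q;                                  /* a '1' leaves the state unchanged */
}

/* Run the DFA: exactly one move per input character, no null moves */
static int accepts(const char *w) {
    enum state q = S0;             /* start in the initial state q0 */
    for (; *w; w++)
        q = delta(q, *w);
    return q == S0;                /* accept iff the final state is in F */
}

int main(void) {
    printf("%d\n", accepts("1001"));   /* prints 1: two 0s (even) */
    printf("%d\n", accepts("10"));     /* prints 0: one 0 (odd) */
    return 0;
}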
NDFA
NDFA stands for Non-Deterministic Finite Automata. For a single input symbol, the machine can move to any number of states. NDFA allows the NULL (ε) move, meaning that it can change state without reading a symbol.
Like the DFA, an NDFA is also defined by a 5-tuple; however, it has a different transition function.
The NDFA's transition function may be described as:
δ: Q x ∑ → 2^Q (where 2^Q is the power set of Q)
Key takeaway:
● Simple expressions called Regular Expressions can readily define the languages accepted by finite automata. They are the most effective way of expressing any regular language.
● Regular languages are referred to as the languages recognized by certain regular expressions.
● A regular expression may also be defined as a sequence of patterns that defines a set of strings.
● For matching character combinations in strings, regular expressions are used. This pattern is used by the string search algorithm to locate operations on a string.
A Regular Expression can be recursively defined as follows −
● ε is a Regular Expression denoting the language containing only the empty string: L(ε) = {ε}.
● φ is a Regular Expression denoting the empty language: L(φ) = { }.
● x is a Regular Expression where L(x) = {x}.
● If X is a Regular Expression denoting the language L(X) and Y is a Regular Expression denoting the language L(Y), then:
● X + Y is a Regular Expression corresponding to the language L(X) ∪ L(Y), where L(X + Y) = L(X) ∪ L(Y).
● X.Y is a Regular Expression corresponding to the language L(X).L(Y), where L(X.Y) = L(X).L(Y).
● R* is a Regular Expression corresponding to the language L(R*), where L(R*) = (L(R))*.
● Anything obtained by applying rules 1 to 5 finitely many times is a Regular Expression.
Unix Operator Extensions
Regular expressions are used frequently in Unix:
● In the command line
● Within text editors
● In the context of pattern matching programs such as grep and egrep
Unix recognizes additional operators. These operators are used for convenience only.
● character classes: ‘[‘<list of chars> ‘]’
● start of a line: ‘^’
● end of a line: ‘$’
● wildcard matching any character except newline: ‘.’
● optional instance: R? = epsilon | R
● one or more instances: R+ == RR*
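These operators can be tried out with the POSIX regex library available on Unix systems. A small sketch in C (the pattern and test strings are ours) uses ^, $, character classes, and * to check whether strings are valid identifiers:

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    /* ^ and $ anchor the match to a whole line; [...] are character classes */
    if (regcomp(&re, "^[A-Za-z_][A-Za-z0-9_]*$", REG_EXTENDED) != 0) {
        fprintf(stderr, "bad pattern\n");
        return 1;
    }
    const char *words[] = { "value", "2nd", "_tmp1" };
    for (int i = 0; i < 3; i++) {
        int miss = regexec(&re, words[i], 0, NULL, 0);  /* 0 means a match */
        printf("%-6s %s\n", words[i], miss ? "no match" : "match");
    }
    regfree(&re);
    return 0;
}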
Regular language
A language is regular if it can be expressed in terms of a regular expression.
Regular expressions have the capability to express finite languages by defining a pattern for finite strings of symbols.
The grammar defined by regular expressions is known as regular grammar. The language defined by regular grammar is known as regular language.
Regular languages are closed under a wide variety of operations.
● Union and intersection
Pick DFAs recognizing the two languages and use the cross-product construction to build a DFA recognizing their union or intersection.
● Set complement
Pick a DFA recognizing the language, then swap the accept/non- accept markings on its states.
● String reversal
Pick an NFA recognizing the language. Create a new final state, with epsilon transitions to it from all the old final states. Then swap the final and start states and reverse all the transition arrows.
● Set difference
Rewrite set difference using a combination of intersection and set complement.
● Concatenation and Star
Pick an NFA recognizing the language and modify.
● Homomorphism
A homomorphism is a function from strings to strings. What makes it a homomorphism is that its output on a multi-character string is just the concatenation of its outputs on each individual character in the string; equivalently, h(xy) = h(x)h(y) for any strings x and y. For example, if h(0) = ab and h(1) = ε, then h(001) = abab. If S is a set of strings, then h(S) is {w: w = h(x) for some x in S}.
To show that regular languages are closed under homomorphism, choose an arbitrary regular language L and a homomorphism h. It can be represented using a regular expression R. But then h(R) is a regular expression representing h(L). So, h(L) must also be regular.
Notice that regular languages are not closed under the subset/superset relation. For example, 0*1* is regular, but its subset {0^n 1^n : n ≥ 0} is not regular, while its subset {01, 0011, 000111} is regular again.
Key takeaway:
● Regular languages are exactly the languages described by regular expressions, and they are closed under union, intersection, complement, difference, reversal, concatenation, star, and homomorphism.
We identify three algorithms that have been used to implement and optimize pattern matchers built from regular expressions.
The first algorithm is useful in a Lex compiler because it constructs a DFA directly from a regular expression, without building an intermediate NFA. The resulting DFA may also have fewer states than a DFA constructed via an NFA.
The second algorithm minimizes the number of states of any DFA by combining states that have the same future behaviour. The algorithm itself is very efficient, running in time O(n log n), where n is the number of states of the DFA.
The third algorithm produces representations of transition tables that are more compact than the standard two-dimensional table.
Lex is a program that generates lexical analyzers, and it is commonly used together with the YACC parser generator. A lexical analyzer is a software component that transforms an input stream into a sequence of tokens. Lex reads the specification and produces the lexical analyzer as C source code.
Fig 3: lexical analyzer generator
If a token is considered invalid by the lexical analyzer, it produces an error. The lexical analyzer works with the syntax analyzer in close collaboration. It reads character streams from the source code, checks for legal tokens, and, when necessary, passes the information to the syntax analyzer.
Tokens
A lexeme is a sequence of (alphanumeric) characters in the source program that is matched as a token. For each lexeme to be identified as a valid token, there are predefined rules. These rules are described by grammar rules by means of a pattern. A pattern describes what a token can be, and these patterns are defined by means of regular expressions.
Keywords, constants, identifiers, sequences, numbers, operators and punctuation symbols can be used as tokens in the programming language.
Example: the variable declaration line in the C language
int value = 100;
contains the tokens:
int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol)
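A minimal Lex specification along these lines (a sketch; the rule set and file name are ours, not from the text) would recognize exactly these tokens:

%{
/* tokens.l - build with: lex tokens.l && cc lex.yy.c */
#include <stdio.h>
%}
%%
"int"                   { printf("keyword: %s\n", yytext); }
[0-9]+                  { printf("constant: %s\n", yytext); }
[A-Za-z_][A-Za-z0-9_]*  { printf("identifier: %s\n", yytext); }
"="                     { printf("operator: %s\n", yytext); }
";"                     { printf("symbol: %s\n", yytext); }
[ \t\n]                 ;  /* skip whitespace */
%%
int yywrap(void) { return 1; }
int main(void) { yylex(); return 0; }

On the input "int value = 100;" it prints the keyword, identifier, operator, constant, and symbol in order. Note that the "int" rule is listed before the identifier rule: Lex prefers the first rule when two rules match the same longest lexeme.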
Key takeaway:
● A lexeme is a sequence of characters matched against a pattern (regular expression) to form a token such as a keyword, identifier, operator, constant, or punctuation symbol.
The parser, or syntactic analyzer, obtains a string of tokens from the lexical analyzer and verifies that the string can be generated by the grammar for the source language. It reports any syntax errors in the program. It also recovers from commonly occurring errors so that it can continue processing its input.
Fig 4: position of parser
Issues:
The parser is unable to detect errors like:
● Re-declaration of Variable
● Initialization of variable before use
● Data type mismatch for an operation.
The above problems are discussed in the Semantic Analysis stage.
Syntax Error Handling:
At several different levels, programmes can contain errors. For instance:
● Lexical, such as misspelling an identifier, keyword, or operator.
● Syntactic, such as unbalanced parentheses in an arithmetic expression.
● Semantic, such as an operator applied to an incompatible operand.
● Logical, such as a call which is infinitely recursive.
Functions of error handler:
● It should clearly and reliably report the existence of errors.
● It should recover quickly enough from each mistake to be able to detect subsequent errors.
● It should not substantially slow down the processing of correct programs.
Error recovery strategies:
The various strategies a parser uses to recover from a syntactic error are described below:
Panic mode recovery
On discovering an error, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found. The synchronizing tokens are typically delimiters, such as end or a semicolon. This approach has the advantage of simplicity and never goes into an infinite loop. It is very useful when multiple errors in the same statement are rare.
Phase level recovery
On finding an error, the parser performs a local correction on the remaining input that allows it to proceed. Examples: insert a missing semicolon, delete an extraneous semicolon, etc.
Error production
The parser is constructed from a grammar augmented with error productions. If the parser uses an error production, suitable error diagnostics can be generated to indicate the erroneous construct recognized in the input.
Global correction
Given an incorrect input string x and a grammar G, some algorithms can be used to find a parse tree for a related string y such that the number of token insertions, deletions, and changes needed to transform x into y is as small as possible. In general, however, these methods are too expensive in terms of time and space.
Key takeaway:
● A parser can recover from syntax errors using panic mode, phrase-level correction, error productions, or global correction.
Context-free grammar (CFG) is a collection of rules (or productions) for recursive rewriting used to produce string patterns.
It consists of:
● a set of terminal symbols,
● a set of non-terminal symbols,
● a set of productions,
● a start symbol.
Context-free grammar G can be defined as:
G = (V, T, P, S)
Where
G - grammar, used to generate strings,
V - finite set of non-terminal symbols, which are denoted by capital letters,
T - finite set of terminal symbols, which are denoted by small letters,
P - set of production rules, used for replacing the non-terminal symbols of a string,
S - start symbol, used to derive strings.
A CFG for Arithmetic Expressions -
An example grammar that generates strings representing arithmetic expressions with the four operators +, -, *, /, and numbers as operands is:
1. <expression> --> number
2. <expression> --> (<expression>)
3. <expression> --> <expression> + <expression>
4. <expression> --> <expression> - <expression>
5. <expression> --> <expression> * <expression>
6. <expression> --> <expression> / <expression>
The only non-terminal symbol in this grammar is <expression>, which is also the start symbol. The terminal symbols are {+, -, *, /, (, ), number}.
Context free language (CFL)
In formal language theory, a context-free language (CFL) is a language generated by a context-free grammar (CFG). Context-free languages have many applications in programming languages; in particular, most arithmetic expressions are generated by context-free grammars.
A grammatical description of a language has four essential components - a set of terminal symbols, a set of non-terminal symbols, a set of productions, and a start symbol.
Key takeaway:
● A context-free grammar G = (V, T, P, S) generates a context-free language; most programming-language constructs can be described by context-free grammars.
A derivation is a sequence of applications of production rules. It is used to obtain the input string through these production rules. During parsing, we have to make two decisions:
● We must determine which non-terminal is to be replaced.
● We must determine the production rule by which the non-terminal will be replaced.
There are two ways of deciding which non-terminal to replace with a production rule.
Left most derivation:
In the leftmost derivation, the input is scanned and replaced with a production rule from left to right: at each step, the leftmost non-terminal is replaced, so the input string is read from left to right.
Example:
Production rules:
S → S + S
S → S - S
S → a | b | c
Input
a - b + c
The left-most derivation is:
S = S + S
S = S - S + S
S = a - S + S
S = a - b + S
S = a - b + c
Right most derivation:
In the rightmost derivation, the input is scanned and replaced with a production rule from right to left: at each step, the rightmost non-terminal is replaced.
Example:
Production rules: same as above.
Input:
a - b + c
The right-most derivation is:
S = S - S
S = S - S + S
S = S - S + c
S = S - b + c
S = a - b + c
Parse Tree
● A parse tree is the graphical representation of a derivation of a string. Its symbols may be terminals or non-terminals.
● In parsing, the string is derived from the start symbol, so the start symbol is the root of the parse tree.
● The parse tree follows the precedence of operators. The deepest sub-tree is traversed first, so the operator in a parent node has lower precedence than the operator in its sub-tree.
A parse tree satisfies the following points:
● All leaf nodes are terminals.
● All interior nodes are non-terminals.
● An in-order traversal yields the original input string.
Example
Production rules:
S → S * S
S → S + S
S → a | b | c
Input
a * b + c
(Steps 1–5, showing the parse tree for a * b + c being built node by node, appeared here as figures.)
Ambiguous grammar
A grammar is said to be ambiguous if there exists more than one leftmost derivation, more than one rightmost derivation, or more than one parse tree for a given input string. If the grammar is not ambiguous, it is called unambiguous.
Example:
For the string aabb, the above grammar generates two parse trees:
Fig 5: two parse trees
If a grammar is ambiguous, it is not good for compiler construction. No method can automatically detect and remove ambiguity, but you can remove ambiguity by rewriting the whole grammar so that it is unambiguous.
Key takeaway:
● A grammar is ambiguous if some input string has more than one parse tree; ambiguity is removed by rewriting the grammar.
Recursive or predictive parsing is known as top-down parsing.
In top-down parsing, the parsing begins with the start symbol and transforms it into the input symbols.
The top-down parser is the parser that generates a parse tree for the given input string with the aid of grammar productions by expanding the non-terminals; i.e., it begins from the start symbol and ends on the terminals.
Parse Tree
The parse tree for the input string acdb is as follows:
Fig 6: parse tree
There are two further types of top-down parsers: the recursive descent parser and the non-recursive descent parser.
I. Recursive descent parser:
It is also known as a brute-force parser or back-tracking parser. Basically, it produces the parse tree by using brute force and backtracking.
A top-down parsing technique that builds the parse tree from the top and reads the input from left to right is recursive descent. For any terminal and non-terminal entity, it uses procedures.
This parsing technique recursively parses the input to build a parse tree, which may or may not require back-tracking; the associated grammar (if not left-factored) cannot avoid back-tracking. The form of recursive descent parsing that does not require any back-tracking is known as predictive parsing.
As it uses context-free grammar that is recursive in nature, this parsing technique is called recursive.
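A minimal sketch of a recursive descent parser in C for a small arithmetic grammar (the grammar and function names are ours, chosen to avoid left recursion; there is one procedure per non-terminal):

/* Grammar, rewritten to remove left recursion:
     expr   -> term { (+|-) term }
     term   -> factor { (*|/) factor }
     factor -> number | ( expr )                */
#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>

static const char *p;            /* cursor into the input string */

static int expr(void);           /* forward declaration: the grammar is recursive */

static int factor(void) {
    if (*p == '(') {             /* factor -> ( expr ) */
        p++;
        int v = expr();
        if (*p == ')') p++; else { puts("missing )"); exit(1); }
        return v;
    }
    int v = 0;                   /* factor -> number */
    while (isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
    return v;
}

static int term(void) {          /* term -> factor { (*|/) factor } */
    int v = factor();
    while (*p == '*' || *p == '/') {
        char op = *p++;
        int r = factor();
        v = (op == '*') ? v * r : v / r;
    }
    return v;
}

static int expr(void) {          /* expr -> term { (+|-) term } */
    int v = term();
    while (*p == '+' || *p == '-') {
        char op = *p++;
        int r = term();
        v = (op == '+') ? v + r : v - r;
    }
    return v;
}

int main(void) {
    p = "2*(3+4)-5";
    printf("%d\n", expr());      /* prints 9 */
    return 0;
}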
II. Non-recursive descent parser:
It is also known as a non-backtracking parser, dynamic parser, LL(1) parser, or predictive parser. It uses a parsing table instead of backtracking to create the parse tree.
LL (1) Parsing
Predictive parser is a recursive descent parser, which has the capability to predict which production is to be used to replace the input string. The predictive parser does not suffer from backtracking.
To accomplish its tasks, the predictive parser uses a look-ahead pointer, which points to the next input symbols. To make the parser back-tracking free, the predictive parser puts some constraints on the grammar and accepts only a class of grammar known as LL(k) grammar.
Predictive parsing uses a stack and a parsing table to parse the input and generate a parse tree. Both the stack and the input contain an end symbol $ to denote that the stack is empty and the input is consumed. To make some decision on the input and stack element combination, the parser refers to the parsing table.
Fig 7: predictive parser
In recursive descent parsing, the parser may have more than one production to choose from for a single instance of input, whereas in predictive parsing, each step has at most one production to choose from. There may be instances where the input string does not match any production, making the parsing procedure fail.
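A minimal sketch of the table-driven LL(1) driver in C, for the toy grammar S → ( S ) S | ε (the grammar and names are ours; the parsing table M[S,'('] = (S)S, M[S,')'] = M[S,'$'] = ε is encoded directly in the code):

#include <stdio.h>

int parse(const char *input) {
    char stack[256];
    int top = 0;
    stack[top++] = '$';          /* bottom-of-stack marker */
    stack[top++] = 'S';          /* push the start symbol */
    size_t i = 0;
    char a = input[i];           /* current lookahead */
    while (top > 0) {
        char X = stack[--top];   /* pop the top of the stack */
        if (X == '$' || X == '(' || X == ')') {
            if (X != a) return 0;     /* a terminal must match the input */
            if (X == '$') return 1;   /* stack empty and input consumed: accept */
            a = input[++i];           /* advance the lookahead */
        } else if (X == 'S') {
            if (a == '(') {           /* M[S,'('] = ( S ) S */
                stack[top++] = 'S';   /* push the RHS in reverse order */
                stack[top++] = ')';
                stack[top++] = 'S';
                stack[top++] = '(';
            }
            /* M[S,')'] = M[S,'$'] = epsilon: push nothing */
        }
    }
    return 0;
}

int main(void) {
    printf("%d\n", parse("(()())$"));  /* prints 1: balanced */
    printf("%d\n", parse("(()$"));     /* prints 0: unbalanced */
    return 0;
}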
Key takeaway:
● Top-down parsing starts from the start symbol and expands productions; predictive (LL(1)) parsers avoid backtracking by using a parsing table and one symbol of lookahead.
Bottom-up parsing is also known as shift-reduce parsing.
Bottom-up parsing is used to construct a parse tree for an input string.
The parsing begins with the input symbols and builds the parse tree up to the start symbol by tracing out the rightmost derivation of the string in reverse.
Shift reduce parsing
● Shift-reduce parsing is a mechanism by which a string is reduced to the start symbol of a grammar.
● Shift-reduce parsing uses a stack to hold grammar symbols and an input tape to hold the string.
● Shift-reduce parsing performs two actions: shift and reduce. This is why it is called shift-reduce parsing.
● In the shift operation, the current symbol in the input string is pushed onto the stack.
● At each reduction, the symbols matching the right-hand side of a production are replaced by the non-terminal on its left-hand side.
Example:
Grammar:
S → S + S
S → S - S
S → (S)
S → a
Input string:
a1-(a2+a3)
Parsing table:
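The original table appeared here as a figure. A sketch of the stack/input trace under the grammar above, treating a1, a2, and a3 as instances of the token a, is:

Stack            Input               Action
$                a1 - (a2 + a3) $    shift a1
$ a1             - (a2 + a3) $       reduce S → a
$ S              - (a2 + a3) $       shift -
$ S -            (a2 + a3) $         shift (
$ S - (          a2 + a3) $          shift a2
$ S - ( a2       + a3) $             reduce S → a
$ S - ( S        + a3) $             shift +
$ S - ( S +      a3) $               shift a3
$ S - ( S + a3   ) $                 reduce S → a
$ S - ( S + S    ) $                 reduce S → S + S
$ S - ( S        ) $                 shift )
$ S - ( S )      $                   reduce S → (S)
$ S - S          $                   reduce S → S - S
$ S              $                   accept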
For bottom-up parsing, shift-reduce parsing uses two special steps. Such steps are referred to as shift-step and reduce-step.
Shift - step: The shift step refers to advancing the input pointer to the next input symbol, which is called the shifted symbol. This symbol is pushed onto the stack and is treated as a single node of the parse tree.
Reduce - step: When the parser finds a complete right-hand side of a grammar rule (RHS) and replaces it with its left-hand side (LHS), this is known as the reduce step. It occurs when a handle appears at the top of the stack. Reducing means performing a POP operation on the stack, popping off the handle and replacing it with the LHS non-terminal symbol.
Key takeaway:
● Shift-reduce parsing builds the parse tree bottom-up by shifting input symbols onto a stack and reducing handles to non-terminals.
The SLR parser is similar to the LR(0) parser except for the reduce entries: reduce actions are placed only in the columns of the symbols in FOLLOW of the variable whose production is being reduced.
Construction of SLR parsing table: -
1. If [A → α•aβ] is in Ii and GOTO(Ii, a) = Ij, then set ACTION[i, a] to "shift j". Here a must be a terminal.
2. If [A → α•] is in Ii, then set ACTION[i, a] to "reduce A → α" for all a in FOLLOW(A); here A may not be S'.
3. If [S' → S•] is in Ii, then set ACTION[i, $] to "accept".
4. For all non-terminals A, the goto transitions for state i are constructed using the rule: if GOTO(Ii, A) = Ij then GOTO[i, A] = j.
5. All entries not defined by rules 1 to 4 are made "error".
If the parsing table contains any multiply defined entries, the grammar is said to have a conflict and is not SLR(1).
Consider the grammar E -> T+E | T
T -> id
Augmented grammar - E’ -> E
E -> T+E | T
T -> id
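Worked out here from the rules above (a sketch computed by hand, not taken from the text), the LR(0) item sets are:

I0: E' → •E, E → •T+E, E → •T, T → •id
I1 = GOTO(I0, E): E' → E•
I2 = GOTO(I0, T): E → T•+E, E → T•
I3 = GOTO(I0, id): T → id•
I4 = GOTO(I2, +): E → T+•E, E → •T+E, E → •T, T → •id
I5 = GOTO(I4, E): E → T+E•
(GOTO(I4, T) = I2 and GOTO(I4, id) = I3.)

With FOLLOW(E) = {$}, FOLLOW(T) = {+, $}, and the productions numbered 1: E → T+E, 2: E → T, 3: T → id, the SLR(1) table is:

State   id    +     $     E   T
0       s3                1   2
1                   acc
2             s4    r2
3             r3    r3
4       s3                5   2
5                   r1

Note that state 2 has a shift on + and a reduce only on $, because $ alone is in FOLLOW(E); this is exactly how SLR avoids the conflict an LR(0) parser would have here.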
LR (1) parser
Different steps involved in LR (1) Parsing:
● Write the context-free grammar for the given input string.
● Check the grammar for ambiguity.
● Add the augment production to the given grammar.
● Create the canonical collection of LR (1) items.
● Draw the DFA (the diagram of states).
● Construct the LR (1) parsing table.
Augment Grammar:
If we add one more production to the given grammar G, the augmented grammar G' is generated. It allows the parser to identify when to stop parsing and announce the acceptance of the input.
Example
Given grammar
S → AA
A → aA | b
The augmented grammar G' is represented by
S' → S
S → AA
A → aA | b
Constructing canonical LR (1) parser
● CLR stands for canonical LR. CLR (1) parsing uses the canonical collection of LR (1) items to create the CLR (1) parsing table. The CLR (1) parsing table contains a greater number of states than the SLR (1) parsing table.
● In CLR (1), the reduce entry is placed only in the columns of the lookahead symbols.
● The steps used in CLR (1) parsing are the same as those listed above for LR (1) parsing.
LR (1) item
An LR (1) item is an LR (0) item combined with a lookahead symbol:
LR (1) item = LR (0) item + lookahead
The lookahead is used to determine where the reduce entry for a completed item will be placed.
For the augment production, the lookahead is the $ symbol.
Example
CLR (1) Grammar
Add Augment Production, insert the symbol '•' for each production in G at the first position and add the lookahead as well.
I0 state:
Add the augment production to the I0 state and compute the closure
I0 = Closure (S' → •S, $)
Add all productions beginning with S to the I0 state, since "•" is followed by the non-terminal. The I0 state then becomes
I0 = S' → •S, $
S → •AA, $
In the modified I0 state, add all productions beginning with A, since "•" is followed by the non-terminal. The I0 state then becomes
I0 = S' → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
I1= Go to (I0, S) = closure (S' → S•, $) = S' → S•, $
I2= Go to (I0, A) = closure (S → A•A, $)
"In I2 State, add all productions beginning with A since "." is followed by the non-terminal. The I2 Condition, then, becomes
I2= S → A•A, $
A → •aA, $
A → •b, $
I3= Go to (I0, a) = Closure (A → a•A, a/b)
In the I3 state, add all productions beginning with A, since "•" is followed by the non-terminal. The I3 state then becomes
I3= A → a•A, a/b
A → •aA, a/b
A → •b, a/b
Go to (I3, a) = Closure (A → a•A, a/b) = (same as I3)
Go to (I3, b) = Closure (A → b•, a/b) = (same as I4)
I4= Go to (I0, b) = closure (A → b•, a/b) = A → b•, a/b
I5= Go to (I2, A) = Closure (S → AA•, $) =S → AA•, $
I6= Go to (I2, a) = Closure (A → a•A, $)
"In I6 State, add all productions beginning with A since "." is followed by the non-terminal. The I6 Condition, then, becomes
I6 = A → a•A, $
A → •aA, $
A → •b, $
Go to (I6, a) = Closure (A → a•A, $) = (same as I6)
Go to (I6, b) = Closure (A → b•, $) = (same as I7)
I7= Go to (I2, b) = Closure (A → b•, $) = A → b•, $
I8= Go to (I3, A) = Closure (A → aA•, a/b) = A → aA•, a/b
I9= Go to (I6, A) = Closure (A → aA•, $) = A → aA•, $
DFA diagram:
Fig 8: DFA
CLR (1) Parsing Table:
Productions are numbered as follows:
1. S → AA
2. A → aA
3. A → b
The placement of the shift entries in the CLR (1) parsing table is the same as in the SLR (1) parsing table.
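The CLR (1) table itself appeared here as a figure; the following sketch is reconstructed from the item sets I0–I9 above (sN = shift and go to state N, rN = reduce by production N, acc = accept):

State   a     b     $     S   A
0       s3    s4          1   2
1                   acc
2       s6    s7              5
3       s3    s4              8
4       r3    r3
5                   r1
6       s6    s7              9
7                   r3
8       r2    r2
9                   r2

Note how the lookaheads place each reduce entry: A → b• with lookahead a/b (state 4) reduces on a and b, while A → b• with lookahead $ (state 7) reduces only on $.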
Constructing LALR parser
LALR refers to lookahead LR. To construct the LALR (1) parsing table, we use the canonical collection of LR (1) items.
In LALR (1) parsing, the LR (1) items that have the same productions but different lookaheads are combined to form a single set of items.
LALR (1) parsing is the same as CLR (1) parsing; the only difference is in the parsing table.
Example
LALR (1) Grammar
Add Augment Production, insert the '•' symbol for each production in G at the first place and add the look ahead as well.
S' → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
I0 State:
Add the augment production to the I0 state and compute the closure
I0 = Closure (S' → •S, $)
Add all productions beginning with S to the I0 state, since "•" is followed by the non-terminal. The I0 state then becomes
I0 = S' → •S, $
S → •AA, $
In the modified I0 state, add all productions beginning with A, since "•" is followed by the non-terminal. The I0 state then becomes
I0 = S' → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
I1= Go to (I0, S) = closure (S' → S•, $) = S' → S•, $
I2= Go to (I0, A) = closure (S → A•A, $)
In the I2 state, add all productions beginning with A, since "•" is followed by the non-terminal. The remaining item sets are constructed exactly as in the CLR (1) example above. To obtain the LALR (1) table, the states that have the same core but different lookaheads are merged: I3 with I6, I4 with I7, and I8 with I9. Each merged state keeps the union of the lookaheads.
Key takeaway:
● LALR (1) parsing merges CLR (1) states that have the same core, producing a smaller table that parses in the same way for most practical grammars.
You may use the LR parser to parse ambiguous grammars. For ambiguous grammars, the LR parser resolves the conflicts (shift/reduce or reduce/reduce) in the parsing table based on certain grammar rules (the precedence and/or associativity of operators).
Example:
Let's take the ambiguous grammar below:
E -> E+E
E -> E*E
E -> id
The precedence and associativity of the grammar operators (+ and *) can be considered to be as follows:
● “+” and “*” both are left associative,
● Precedence of “*” is higher than the precedence of “+”.
If we use an LALR (1) parser, the LR (1) item DFA will be:
Fig 9: LR (1) item DFA
We can see from the LR (1) DFA item that there are shift/reduce conflicts in the I5 and I6 states.
Fig 10: parsing table
"In I5 and I6, moves on "+ and "*" are both moved and decreased. To settle this dispute, we will use the precedence and associativeness of the operators to decide which phase to preserve and which to discard from the table.
Key takeaway:
1. In the parsing table of ambiguous grammars, the LR parser resolves the conflicts based on certain grammar rules.
A parser should detect and report any error in the program. When an error occurs, the parser is expected to handle it and continue processing the rest of the input. Although most error checking is associated with the parser, errors may occur at any point during the compilation process.
Error recovery
1. Panic Mode recovery
○ In this method, successive characters from the input are removed one at a time until a designated set of synchronizing tokens is found. Synchronizing tokens are delimiters such as ; or }.
○ The advantage is that it is easy to implement and guarantees not to go into an infinite loop.
○ The disadvantage is that a considerable amount of input is skipped without checking it for additional errors.
2. Statement Mode recovery
○ In this method, when a parser encounters an error, it performs the necessary correction on the remaining input so that the rest of the input statement allows the parser to parse ahead.
○ The correction can be deleting extra semicolons, replacing a comma with a semicolon, or inserting a missing semicolon.
○ While performing correction, utmost care should be taken not to go into an infinite loop.
○ The disadvantage is that it is difficult to handle situations where the actual error occurred before the point of detection.
Key takeaway:
● Panic mode recovery skips input until a synchronizing token is found; statement mode recovery repairs the input locally so that parsing can continue.
YACC is an automatic tool that generates the parser program.
YACC stands for Yet Another Compiler Compiler. The software is available under the UNIX OS.
Constructing an LR parser by hand requires a lot of work to parse the input string, so the procedure must be automated to achieve efficiency in parsing.
Basically, YACC is an LALR parser generator that reports conflicts or ambiguities (if present) in the form of error messages.
As we addressed YACC earlier in this course, you can go through the fundamentals again to make sure you understand everything.
As shown in the figure, the standard YACC translator can be represented as
Fig 11: YACC automatic parser generator
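For illustration, a minimal YACC specification (a sketch essentially following the classic calculator example; the grammar and names are ours, not from the text) that recognizes and evaluates sums of single digits:

%{
/* sum.y - build with: yacc sum.y && cc y.tab.c */
#include <stdio.h>
#include <ctype.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%token DIGIT
%%
line : expr '\n'        { printf("= %d\n", $1); }
     ;
expr : expr '+' term    { $$ = $1 + $3; }
     | term
     ;
term : DIGIT
     ;
%%
int yylex(void) {
    int c = getchar();
    if (c == EOF) return 0;                       /* end of input */
    if (isdigit(c)) { yylval = c - '0'; return DIGIT; }
    return c;    /* single-character tokens such as '+' and '\n' */
}
int main(void) { return yyparse(); }

On the input "1+2+3" followed by a newline, the generated parser prints "= 6". YACC reports any shift/reduce or reduce/reduce conflicts in the grammar when it processes this file.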
Key takeaway:
● YACC is an LALR parser generator, available on UNIX, that reports grammar conflicts as error messages and produces a parser in C.