Unit – 1
Introduction to Compiler
● A compiler is a translator that translates a high-level language into machine language.
● A developer writes in a high-level language, while the processor understands only machine language.
● The compiler also detects and reports errors in the programmer's code.
● The main objective of the compiler is to translate code written in one language into another language without altering the program's meaning.
● When you run a program written in a high-level language, its execution takes place in two parts: the program is first compiled into machine code, and the machine code is then executed.
Fig 1: Execution process of Compiler
Phases and passes
Phases -
Compilation proceeds as a sequence of phases. Each phase takes the source program in one representation and produces output in another representation; each phase takes its input from the output of the previous phase.
The various phases of the compiler are as follows:
Fig 2: Phases of compiler
Passes -
A pass is a complete traversal of the source program. A compiler may make a single pass or multiple passes over the source program.
Multi-pass compiler - A multi-pass compiler processes the source program several times; each pass takes the output of the previous pass as its input and refines it.
One-pass compiler - A one-pass compiler reads and translates the source program only once.
Key takeaway:
- Each phase takes the source program in one representation and produces output in another representation.
- Compilation proceeds as a sequence of phases.
- A pass is a complete traversal of the source program.
- A compiler may make one or more passes over the source program.
Bootstrapping is commonly used in the development of compilers.
In order to generate a self-hosting compiler, bootstrapping is used. Self-hosting is a type of compiler that is capable of compiling its own source code.
The bootstrap compiler is used to compile the compiler; this compiled compiler can then be used to compile everything else, as well as future versions of the compiler itself.
A compiler can be characterized by three languages: its source language S, its target language T, and the language I in which it is implemented. A compiler for source S and target T, implemented in I, is written SCIT and is shown as a T-diagram.
To generate a compiler for a new language L for machine A, follow these steps:
1. Create a compiler SCAA for a subset S of the desired language L, written in the language of machine A; this compiler runs on machine A and produces code for machine A.
2. Build an LCSA compiler for language L that is written in a subset of L.
3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a compiler for language L, which runs on machine A and produces code for machine A.
The mechanism described by these T-diagrams is called bootstrapping.
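As a concrete illustration (an assumption for exposition, not part of the original figure): suppose L is C and A is an x86 machine. One first writes a compiler for a small subset of C in x86 assembly (SCAA), then writes a full C compiler in that subset (LCSA). Compiling LCSA with SCAA yields LCAA, a C compiler that runs on x86 and generates x86 code; from then on, the compiler can compile new versions of itself.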
Key takeaway:
- Bootstrapping is commonly used in the development of compilers.
- To generate a self-hosting compiler, bootstrapping is used.
➢ To identify patterns, finite state machines are used.
➢ A finite automaton takes a string of symbols as input and changes its state accordingly. When a desired symbol is found in the input, the transition occurs.
➢ During a transition, the automaton may either move to the next state or remain in the same state.
➢ An FA has two decision outcomes: accept or reject. When the input string has been processed successfully and the automaton has reached a final state, the string is accepted.
A finite automaton is defined by:
Q: a finite set of states
∑: a finite set of input symbols
q0: the initial state
F: the set of final states
δ: the transition function
The transition function can be described as:
δ: Q × ∑ → Q
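For example (an illustrative automaton, not from the original notes), a DFA with Q = {q0, q1} and ∑ = {0, 1} that accepts binary strings ending in 1 has the transition function:
δ(q0, 0) = q0, δ(q0, 1) = q1, δ(q1, 0) = q0, δ(q1, 1) = q1
with initial state q0 and F = {q1}.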
An FA can be described in two ways:
DFA
DFA stands for Deterministic Finite Automata. Deterministic refers to the uniqueness of the computation: in a DFA, each input character takes the machine to exactly one state. The DFA does not allow null moves, which means the DFA cannot change state without consuming an input character.
A DFA has five tuples {Q, ∑, q0, F, δ}:
Q: a set of all states
∑: a finite set of input symbols
q0: the initial state
F: the set of final states
δ: the transition function, δ: Q × ∑ → Q
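The tuple description maps directly onto a table-driven implementation. The following is a minimal sketch in C (an illustration, not part of the original notes) of the ending-in-1 DFA described earlier; states q0 and q1 are encoded as the integers 0 and 1.

#include <stdio.h>

/* Table-driven DFA over {0, 1} that accepts strings ending in 1.
   delta[state][symbol] gives the next state. */
int main(void) {
    int delta[2][2] = {
        { 0, 1 },   /* from q0: on '0' stay in q0, on '1' go to q1 */
        { 0, 1 }    /* from q1: on '0' back to q0, on '1' stay in q1 */
    };
    const char *input = "10101";   /* example input string */
    int state = 0;                 /* q0 is the initial state */

    for (const char *p = input; *p; p++)
        state = delta[state][*p - '0'];

    /* F = {q1}: accept only if the machine ends in a final state */
    printf("%s is %s\n", input, state == 1 ? "accepted" : "rejected");
    return 0;
}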
NDFA
NDFA stands for Non-Deterministic Finite Automata. In an NDFA, a single input symbol may take the machine to any number of states. An NDFA permits null moves, meaning it can change state without reading a symbol.
Like the DFA, an NDFA is also defined by five tuples; however, its transition function is different.
The NDFA's transition function can be described as:
δ: Q × ∑ → 2^Q
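Here 2^Q denotes the power set of Q, so the transition function returns a set of states. For example (illustrative), an NDFA might have δ(q0, a) = {q0, q1}: on reading a in state q0, the machine may either stay in q0 or move to q1.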
Key takeaway:
- To identify patterns, finite state machines are used.
- An FA either accepts or rejects an input string.
- In a DFA, each input character leads to exactly one state.
- In an NDFA, a single input symbol can lead to any number of states.
The lexical analyzer needs to scan and identify only a finite set of valid strings/tokens/lexemes that belong to the language in hand. It looks for the patterns defined by the language rules.
Regular expressions can express finite languages by defining a pattern for finite strings of symbols. The grammar defined by regular expressions is known as regular grammar, and the language defined by a regular grammar is known as a regular language.
Regular expressions are an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions serve as names for sets of strings. The tokens of a programming language can be described by regular languages. The specification of regular expressions is an example of a recursive definition. Regular languages are easy to understand and have efficient implementations.
There are a number of algebraic laws obeyed by regular expressions, which can be used to manipulate regular expressions into equivalent forms.
Precedence and Associativity
● The operators *, concatenation (.), and | (pipe) are left-associative.
● * has the highest precedence.
● Concatenation (.) has the second-highest precedence.
● | (pipe) has the lowest precedence.
Representing valid tokens of a language in regular expressions
If x is a regular expression, then:
➔ x* implies zero or more occurrences of x.
It can generate {ε, x, xx, xxx, xxxx, ...}.
➔ x+ implies one or more occurrences of x.
It can generate {x, xx, xxx, xxxx, ...}; that is, x+ = x.x*.
➔ x? implies at most one occurrence of x.
It can generate either {x} or {ε}.
[a-z] denotes all the lower-case letters of the English alphabet.
[A-Z] denotes all the upper-case letters of the English alphabet.
[0-9] denotes all the digits used in mathematics.
Representing tokens of language using regular expressions
Decimal = (sign)?(digit)+
Identifier = (letter)(letter | digit)*
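As a small illustration (a sketch assuming the Identifier pattern above; the function name is_identifier is invented for this example), the pattern (letter)(letter | digit)* can be checked directly in C:

#include <ctype.h>
#include <stdio.h>

/* Checks whether s matches Identifier = (letter)(letter | digit)* */
int is_identifier(const char *s) {
    if (!isalpha((unsigned char)*s))      /* must begin with a letter */
        return 0;
    for (s++; *s; s++)
        if (!isalnum((unsigned char)*s))  /* then only letters or digits */
            return 0;
    return 1;
}

int main(void) {
    printf("%d\n", is_identifier("value1"));  /* prints 1: valid */
    printf("%d\n", is_identifier("1value"));  /* prints 0: starts with a digit */
    return 0;
}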
The only remaining problem for the lexical analyzer is how to verify the validity of a regular expression used to define the patterns of keywords in a language. A well-accepted approach is to use finite automata for this verification.
Key takeaway:
- By defining a pattern for finite strings of symbols, regular expressions can express finite languages.
- Regular grammar is referred to as the grammar described by regular expressions.
We identify three algorithms that have been used to implement and optimize pattern matchers constructed from regular expressions.
The first algorithm is useful in a Lex compiler because it constructs a DFA directly from a regular expression, without constructing an intermediate NFA. The resulting DFA may also have fewer states than the DFA constructed via an NFA.
The second algorithm minimizes the number of states of any DFA by combining states that have the same future behaviour. The algorithm itself is quite efficient, running in time O(n log n), where n is the number of states of the DFA.
The third algorithm produces more compact representations of transition tables than the standard two-dimensional table.
Lex is a program that generates lexical analyzers. It is used together with the YACC parser generator. The lexical analyzer is a program that transforms an input stream into a sequence of tokens. Lex reads an input specification and produces, as output, source code implementing the lexical analyzer as a C program.
Fig 3: lexical analyzer generator
If a token is considered invalid by the lexical analyzer, it produces an error. The lexical analyzer works with the syntax analyzer in close collaboration. It reads character streams from the source code, checks for legal tokens, and, when necessary, passes the information to the syntax analyzer.
Tokens
A lexeme is a sequence of (alphanumeric) characters matched as a token. There are some predefined rules for every lexeme to be identified as a valid token. These rules are described by grammar rules, by means of a pattern. A pattern explains what can be a token, and these patterns are defined by means of regular expressions.
Keywords, constants, identifiers, strings, numbers, operators, and punctuation symbols can be used as tokens in a programming language.
Example: the variable declaration line in the C language
int value = 100 ;
contains the tokens:
int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol)
Key takeaway:
- A lexeme is a sequence of (alphanumeric) characters matched as a token.
- For each lexeme to be identified as a valid token, there are some predefined rules.
- Lex is used together with the YACC parser generator.
Functions of lex:
● First, the lexical analyzer is written as a program lex.l in the Lex language. The Lex compiler then runs on lex.l and generates the C program lex.yy.c.
● Finally, the C compiler runs the program lex.yy.c and generates the object program a.out.
● a.out is the lexical analyzer that transforms an input stream into a sequence of tokens.
Fig 4: Lex compiler
Lex file format:
A Lex program is separated into three sections by %% delimiters. The format of Lex source is as follows:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
Definitions include declarations of constants, variables, and regular definitions.
The rules are statements of the form:
p1 {action1} p2 {action2} ... pn {actionn}
where each pi is a regular expression and each actioni describes the action the lexical analyzer should take when a lexeme matches the pattern pi.
User subroutines are auxiliary procedures needed by the actions. A subroutine can be loaded with the lexical analyzer and compiled separately.
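To make the format concrete, here is a minimal sketch of a complete Lex program (an illustration, assuming the token classes of the earlier C example int value = 100;):

%{
#include <stdio.h>
%}
%%
int                     { printf("keyword: %s\n", yytext); }
[a-zA-Z][a-zA-Z0-9]*    { printf("identifier: %s\n", yytext); }
[0-9]+                  { printf("constant: %s\n", yytext); }
"="                     { printf("operator: %s\n", yytext); }
";"                     { printf("symbol: %s\n", yytext); }
[ \t\n]+                { /* skip whitespace */ }
%%
int yywrap(void) { return 1; }

int main(void) {
    yylex();   /* run the generated analyzer on standard input */
    return 0;
}

Running lex on this file produces lex.yy.c, which is compiled with the C compiler to obtain the analyzer a.out.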
Key takeaway:
- The lexical analyzer is written as a program lex.l in the Lex language.
- The Lex compiler transforms lex.l into the C program lex.yy.c.
Formal grammar is a set of rules. It is used to distinguish correct strings of tokens from incorrect ones in a language. Formal grammar is denoted as G.
Formal grammar is used to generate all possible strings that are syntactically correct in the language using the alphabet.
Formal grammar is often used during the compilation process of syntactic analysis (parsing).
Formal grammar G is written as follows:
G = <V, N, P, S>
Where:
N - a finite set of non-terminal symbols.
V - a finite set of terminal symbols.
P - set of production rules
S - start symbol.
Example:
V = {a, b}, N = {S, R, B}
Production rules:
Through these productions we can generate strings such as bab, baab, baaab, etc.
These productions describe strings of the form baⁿb.
Fig 5: Formal language
Key takeaway:
- Formal grammar is used to generate all possible strings that are syntactically correct in the language using the alphabet.
- A set of rules is formal grammar.
BNF stands for Backus-Naur Form. It is used to write a formal representation of grammar that is context-free. It is often used to define a programming language's syntax.
Basically, BNF notation is just a version of context-free grammar.
Productions in BNF take the form:
Left side → definition
where left side ∈ (Vn ∪ Vt)+ and definition ∈ (Vn ∪ Vt)*. In BNF, the left side contains exactly one non-terminal.
Multiple productions with the same left side can be combined, using the vertical bar symbol "|" to separate the alternatives.
For example, consider a grammar with the productions S → aSa, S → bSb, and S → c.
In BNF, this grammar can be expressed as follows:
S → aSa| bSb| c
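As a further illustration (an invented example, not from the original notes), the syntax of simple arithmetic expressions can be written in BNF as:

<expr> → <expr> + <term> | <expr> - <term> | <term>
<term> → <term> * <factor> | <term> / <factor> | <factor>
<factor> → ( <expr> ) | <digit>
<digit> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

The alternatives for each non-terminal are combined with |, exactly as in S → aSa | bSb | c above.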
Key takeaway:
- It is used to write a formal representation of grammar that is context-free.
- It is often used to define a programming language's syntax.
A grammar is said to be ambiguous if there is more than one leftmost derivation, more than one rightmost derivation, or more than one parse tree for a given input string. If the grammar does not have this property, it is called unambiguous.
Example:
The above grammar produces two parse trees for the string aabb:
If a grammar contains ambiguity, it is not good for constructing a compiler.
No method can detect and remove ambiguity automatically, but ambiguity can be removed by rewriting the whole grammar without ambiguity.
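As a standard illustration (since the original figure is unavailable, this grammar is an assumption, not necessarily the one these notes used): the grammar
E → E + E | E * E | id
is ambiguous, because the string id + id * id has two leftmost derivations:
E ⇒ E + E ⇒ id + E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
Each derivation corresponds to a different parse tree: one groups the string as id + (id * id), the other as (id + id) * id.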
Key takeaway:
- A grammar is said to be ambiguous if there is more than one leftmost or rightmost derivation, or more than one parse tree, for a given input string.
● YACC stands for Yet Another Compiler Compiler.
● For a given grammar, YACC provides a method to generate a parser.
● YACC is a tool designed to compile LALR(1) grammars.
● It is used to generate the source code of a syntactic analyzer for the language defined by an LALR(1) grammar.
● The input to YACC is a set of grammar rules, and the output is a C program.
Some points about YACC are as follows:
Input: a CFG, file.y
Output: a parser, y.tab.c (yacc)
● The output file file.output contains the parsing tables.
● The file file.tab.h contains declarations.
● The parser is called yyparse().
● To get tokens, the parser expects to use a function called yylex().
The essential operational sequence is as follows:
gram.y - a file containing the desired grammar in YACC format
↓
yacc - the YACC program
↓
y.tab.c - the C source program generated by YACC
↓
cc - the C compiler
↓
a.out - an executable program that parses the grammar defined in gram.y
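As an illustration of a grammar file in this sequence (a minimal sketch, with a hand-written yylex() instead of a Lex-generated one; all names are invented for the example):

%{
#include <stdio.h>
#include <ctype.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "error: %s\n", s); }
%}
%token NUM
%left '+' '-'
%%
line : expr '\n'        { printf("valid expression\n"); }
     ;
expr : expr '+' expr
     | expr '-' expr
     | NUM
     ;
%%
int yylex(void) {
    int c = getchar();
    while (c == ' ') c = getchar();
    if (isdigit(c)) {                 /* a run of digits is token NUM */
        while (isdigit(c = getchar()))
            ;
        ungetc(c, stdin);
        return NUM;
    }
    return c == EOF ? 0 : c;          /* '+', '-' and '\n' stand for themselves */
}
int main(void) { return yyparse(); }

Running yacc on this file produces y.tab.c; compiling it with cc yields an a.out that accepts input such as 1 + 2 - 3.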
Key takeaway:
- The input to YACC is a set of grammar rules, and the output is a C program.
- For a given grammar, YACC provides a method to generate a parser.
A context-free grammar is a formal grammar that is used in a given formal language to generate all possible strings.
A context-free grammar G can be described by four tuples:
G= (V, T, P, S)
Where,
G - grammar
T - a finite set of terminal symbols.
V - a finite set of non-terminal symbols
P - set of production rules
S - start symbol.
In a CFG, the start symbol is used to derive a string. You can derive a string by repeatedly replacing a non-terminal with the right-hand side of a production, until all non-terminals have been replaced by terminal symbols.
Example:
L = {wcw^R | w ∈ (a, b)*}
Production rules:
S → aSa
S → bSb
S → c
Now check that the string abbcbba can be derived from the given CFG.
By applying the production S → aSa once and the production S → bSb twice, and finally applying the production S → c, we get the string abbcbba.
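Written out step by step, the derivation is:
S ⇒ aSa ⇒ abSba ⇒ abbSbba ⇒ abbcbba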
Key takeaway:
- In CFG, to derive the string, the start symbol is used.
- A context-free grammar is a formal grammar that is used in a given formal language to generate all possible strings.
A derivation is a sequence of production rules. It is used to obtain the input string through these production rules. During parsing, we have to make two decisions, as follows:
● We must determine which non-terminal is to be replaced.
● We must determine the production rule by which the non-terminal will be replaced.
We have two options for deciding which non-terminal is to be replaced with a production rule.
Leftmost derivation:
In the leftmost derivation, the input is scanned and replaced with production rules from left to right; that is, the leftmost non-terminal is always the one replaced. So in leftmost derivation, we read the input string from left to right.
Example:
Production rules:
S → S + S
S → S - S
S → a | b | c
Input
a - b + c
The left-most derivation is:
S = S + S
S = S - S + S
S = a - S + S
S = a - b + S
S = a - b + c
Rightmost derivation:
In the rightmost derivation, the input is scanned and replaced with production rules from right to left; that is, the rightmost non-terminal is always the one replaced. So in rightmost derivation, we read the input string from right to left.
Example:
Input:
a - b + c
The right-most derivation is:
S = S - S
S = S - S + S
S = S - S + c
S = S - b + c
S = a - b + c
Parse Tree
● A parse tree is a graphical representation of symbols, where a symbol can be a terminal or a non-terminal.
● In parsing, the string is derived using the start symbol. The root of the parse tree is the start symbol.
● The parse tree follows the precedence of operators. The deepest sub-tree is traversed first, so the operator in the parent node has lower precedence than the operator in the sub-tree.
A parse tree follows these points:
● All leaf nodes are terminals.
● All interior nodes are non-terminals.
● In-order traversal gives the original input string.
Example
Production rules:
S → S + S
S → S * S
S → a | b | c
Input
a * b + c
Steps 1-5: figures showing the parse tree being constructed one production at a time (not reproduced here).
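Since the step-by-step figures are not available, the following text sketch shows the final parse tree (assuming the rules above; + sits nearer the root because * has higher precedence and therefore lies deeper):

         S
       / | \
      S  +  S
    / | \   |
   S  *  S  c
   |     |
   a     b

An in-order traversal of the leaves yields the original input string a * b + c.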
Key takeaway:
- A derivation is a sequence of production rules.
- We have to make two decisions during parsing: which non-terminal to replace, and which production rule to use.
- In leftmost derivation, the input string is read from left to right.
- The parse tree follows the operators' precedence.
The CFG has various capabilities:
● Context-free grammars can describe the syntax of most programming languages.
● If a grammar is properly designed, an efficient parser can be constructed from it automatically.