Dependencies in a pipelined processor
There are mainly three types of dependencies possible in a pipelined processor. These are:
1) Structural Dependency
2) Control Dependency
3) Data Dependency
These dependencies may introduce stalls in the pipeline.
Stall: A stall is a cycle in the pipeline without new input.
Structural dependency
This dependency arises due to the resource conflict in the pipeline. A resource conflict is a situation when more than one instruction tries to access the same resource in the same cycle. A resource can be a register, memory, or ALU.
Example:
Instruction / Cycle | 1 | 2 | 3 | 4 | 5 |
I1 | IF(Mem) | ID | EX | Mem | |
I2 | | IF(Mem) | ID | EX | |
I3 | | | IF(Mem) | ID | EX |
I4 | | | | IF(Mem) | ID |
Fig 1 – Example
In the above scenario, in cycle 4, instructions I1 and I4 try to access the same resource (memory), which causes a resource conflict.
To avoid this problem, we have to make the instruction wait until the required resource (memory, in our case) becomes available. This wait introduces stalls in the pipeline, as shown below:
Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
I1 | IF(Mem) | ID | EX | Mem | WB | | | |
I2 | | IF(Mem) | ID | EX | Mem | WB | | |
I3 | | | IF(Mem) | ID | EX | Mem | WB | |
I4 | | | | – | – | – | IF(Mem) | |
Fig 2 – Introducing stalls in the pipeline
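The stall pattern in Fig 2 can be reproduced with a tiny simulation. The sketch below is illustrative (the function name and the single-port-memory model are assumptions, not from the text): an instruction's fetch is delayed while any earlier instruction occupies the shared memory in its Mem stage.

```python
# Minimal sketch, assuming one shared memory port used both by IF and by the
# Mem stage; the older instruction wins, so the fetch stalls. The function
# name and the model are illustrative.
def fetch_cycles(n_instructions, mem_stage=4):
    """Return the fetch cycle of each instruction (1-indexed)."""
    fetch = []
    for _ in range(n_instructions):
        c = (fetch[-1] + 1) if fetch else 1          # earliest possible fetch cycle
        # stall while an earlier instruction occupies memory in its Mem stage
        while any(f + mem_stage - 1 == c for f in fetch):
            c += 1
        fetch.append(c)
    return fetch

print(fetch_cycles(4))   # [1, 2, 3, 7] -> I4 stalls in cycles 4-6, as in Fig 2
```

Running it for four instructions reproduces the figure: I1 to I3 fetch in cycles 1 to 3, while I4 waits until cycle 7, after the Mem stages of I1, I2, and I3 have released the memory.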
Solution for structural dependency
To minimize structural dependency stalls in the pipeline, we use a hardware mechanism called Renaming.
Renaming: In renaming, we divide the memory into two independent modules used to store instructions and data separately, called Code Memory (CM) and Data Memory (DM) respectively. CM contains all the instructions, and DM contains all the operands required by the instructions.
Instruction / Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
I1 | IF(CM) | ID | EX | DM | WB | | |
I2 | | IF(CM) | ID | EX | DM | WB | |
I3 | | | IF(CM) | ID | EX | DM | WB |
I4 | | | | IF(CM) | ID | EX | DM |
I5 | | | | | IF(CM) | ID | EX |
I6 | | | | | | IF(CM) | ID |
I7 | | | | | | | IF(CM) |
Fig 3 – Example
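With separate code and data memories no structural stalls remain, so n instructions on a k-stage pipeline complete in k + (n - 1) cycles. A minimal sketch of this count (the helper name is illustrative):

```python
# Illustrative helper: completion time of an ideal, stall-free pipeline.
def pipeline_cycles(n, k):
    """Cycles to finish n instructions on a k-stage pipeline with no stalls:
    the first instruction takes k cycles, then one completes per cycle."""
    return k + (n - 1)

# Fig 3: 7 instructions, 5 stages (IF, ID, EX, DM, WB); with separate CM and
# DM, I4's fetch no longer clashes with I1's DM access.
print(pipeline_cycles(7, 5))   # 11
```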
Control Dependency (Branch Hazards)
This type of dependency occurs during the transfer-of-control instructions such as BRANCH, CALL, JMP, etc. On many instruction architectures, the processor does not yet know the target address of these instructions when it needs to insert the next instruction into the pipeline. Due to this, unwanted instructions are fed into the pipeline.
Consider the following sequence of instructions in the program:
100: I1
101: I2 (JMP 250)
102: I3
.
.
250: BI1
Expected output: I1 -> I2 -> BI1
NOTE: Generally, the target address of the JMP instruction is known only after the ID stage.
Instruction / Cycle | 1 | 2 | 3 | 4 | 5 | 6 |
I1 | IF | ID | EX | Mem | WB | |
I2 | | IF | ID (PC:250) | EX | Mem | WB |
I3 | | | IF | ID | EX | Mem |
BI1 | | | | IF | ID | EX |
Output Sequence: I1 -> I2 -> I3 -> BI1
So, the output sequence is not equal to the expected output, which means the pipeline is not implemented correctly.
To correct the above problem, we need to stop the instruction fetch until we get the target address of the branch instruction. This can be implemented by introducing a delay slot until we get the target address.
Instruction / Cycle | 1 | 2 | 3 | 4 | 5 | 6 |
I1 | IF | ID | EX | Mem | WB | |
I2 | | IF | ID (PC:250) | EX | Mem | WB |
Delay | – | – | – | – | – | – |
BI1 | | | | IF | ID | EX |
Output Sequence: I1 -> I2 -> Delay (Stall) -> BI1
As the delay slot performs no operation, this output sequence is equal to the expected output sequence. But this slot introduces a stall in the pipeline.
Solution for control dependency: Branch prediction is the method through which stalls due to control dependency can be eliminated. In this method, a prediction about which branch will be taken is made in the first stage itself. When the prediction is correct, the branch penalty is zero.
Branch penalty : The number of stalls introduced during the branch operations in the pipelined processor is known as branch penalty.
NOTE: As we saw, the target address is available after the ID stage, so the number of stalls introduced in the pipeline is 1. Had the branch target address been available only after the EX (ALU) stage, there would have been 2 stalls. Generally, if the target address is available after the kth stage, then there will be (k – 1) stalls in the pipeline.
Total number of stalls introduced in the pipeline due to branch instructions = Branch frequency * Branch Penalty
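Both formulas above can be sketched in a few lines (the function names are illustrative, not from the text):

```python
# Illustrative helpers for the two branch-penalty formulas above.
def branch_stalls(k):
    """Stalls per branch when the target address is known after the k-th stage:
    (k - 1) instructions were fetched down the wrong path."""
    return k - 1

def total_branch_stalls(n_instructions, branch_fraction, penalty):
    """Total stall cycles = branch frequency * branch penalty."""
    return n_instructions * branch_fraction * penalty

print(branch_stalls(2))                    # target known after ID -> 1 stall
print(branch_stalls(3))                    # target known after EX -> 2 stalls
print(total_branch_stalls(1000, 0.2, 1))   # 200 branches in 1000 instructions
```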
Data Dependency (Data Hazard)
Let us consider an ADD instruction S, such that
S : ADD R1, R2, R3
Addresses read by S = I(S) = {R2, R3}
Addresses written by S = O(S) = {R1}
Now, we say that instruction S2 depends on instruction S1 when
[I(S1) ∩ O(S2)] ∪ [O(S1) ∩ I(S2)] ∪ [O(S1) ∩ O(S2)] ≠ { }
This condition is called the Bernstein condition.
Three cases exist:
Flow (data) dependence: O(S1) ∩ I(S2) ≠ { }, i.e., S2 reads a value that S1 writes
Anti-dependence: I(S1) ∩ O(S2) ≠ { }, i.e., S2 writes a location that S1 reads
Output dependence: O(S1) ∩ O(S2) ≠ { }, i.e., S1 and S2 write the same location
Example: Let there be two instructions I1 and I2 such that:
I1 : ADD R1, R2, R3
I2 : SUB R4, R1, R2
When the above instructions are executed in a pipelined processor, a data dependency condition occurs: I2 tries to read R1 before I1 has written it back, and therefore I2 incorrectly gets the old value of R1.
Instruction / Cycle | 1 | 2 | 3 | 4 |
I1 | IF | ID | EX | DM |
I2 | | IF | ID (old value) | EX |
To minimize data dependency stalls in the pipeline, operand forwarding is used.
Operand Forwarding: In operand forwarding, we use the interface registers present between the stages to hold the intermediate output, so that the dependent instruction can access the new value directly from the interface register.
Considering the same example:
I1 : ADD R1, R2, R3
I2 : SUB R4, R1, R2
Instruction / Cycle | 1 | 2 | 3 | 4 |
I1 | IF | ID | EX | DM |
I2 | | IF | ID | EX |
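A toy model can show what forwarding changes (everything below is illustrative, not a real ISA simulator): without forwarding, I2 reads the stale register-file value of R1; with forwarding, it reads I1's result out of the interface register between EX and DM.

```python
# Toy RAW-hazard model for I1: ADD R1, R2, R3 ; I2: SUB R4, R1, R2.
# All names and the structure are illustrative.
regs = {"R1": 0, "R2": 10, "R3": 5}   # register file before I1 writes back

def read_r1(interface_reg, forwarding):
    """Value I2's decode stage obtains for R1."""
    if forwarding:
        return interface_reg          # new value, forwarded from the EX/DM latch
    return regs["R1"]                 # stale value from the register file

i1_result = regs["R2"] + regs["R3"]   # I1 computes R1 = R2 + R3 = 15 in EX
print(read_r1(i1_result, forwarding=False))  # 0  (old value -> wrong result)
print(read_r1(i1_result, forwarding=True))   # 15 (forwarded -> correct result)
```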
Data Hazards
Data hazards occur when instructions that exhibit data dependence modify data in different stages of a pipeline. Hazards cause delays in the pipeline. There are mainly three types of data hazards:
1) RAW (Read after Write) [Flow/True data dependency]
2) WAR (Write after Read) [Anti-Data dependency]
3) WAW (Write after Write) [Output data dependency]
Let there be two instructions I and J, such that J follows I. Then:
RAW (Read after Write): J tries to read a source operand before I writes it.
Eg:
I: R2 <- R1 + R3
J: R4 <- R2 + R3
WAR (Write after Read): J tries to write a destination before I reads it.
Eg:
I: R2 <- R1 + R3
J: R3 <- R4 + R5
WAW (Write after Write): J tries to write a destination before I writes it.
Eg:
I: R2 <- R1 + R3
J: R2 <- R4 + R5
WAR and WAW hazards occur during the out-of-order execution of the instructions.
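The three hazard types follow mechanically from the read and write sets of I and J. A small sketch (the function name is illustrative) classifying the hazards between two instructions:

```python
# Illustrative classifier: I's and J's read/write register sets in, hazards out.
def classify_hazards(read_i, write_i, read_j, write_j):
    """Classify data hazards between instruction I and a later instruction J."""
    hazards = []
    if write_i & read_j:
        hazards.append("RAW")   # J reads what I writes (true dependence)
    if read_i & write_j:
        hazards.append("WAR")   # J writes what I reads (anti-dependence)
    if write_i & write_j:
        hazards.append("WAW")   # J writes what I writes (output dependence)
    return hazards

# I: R2 <- R1 + R3 ; J: R4 <- R2 + R3  -> RAW on R2
print(classify_hazards({"R1", "R3"}, {"R2"}, {"R2", "R3"}, {"R4"}))  # ['RAW']
```

Feeding in the three example pairs above yields ['RAW'], ['WAR'], and ['WAW'] respectively.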
Key Takeaway
There are mainly three types of dependencies possible in a pipelined processor. These are:
1) Structural Dependency
2) Control Dependency
3) Data Dependency
These dependencies may introduce stalls in the pipeline.
Stall: A stall is a cycle in the pipeline without new input.
Bernstein’s Conditions are conditions applied to two statements S1 and S2 that are to be executed on the processor. The three conditions explained below must all be satisfied for two successive statements S1 and S2 to be executed concurrently and still produce the same result.
The intersections between the read-write set, the write-read set, and the write-write set of S1 and S2 must all be null.
The above statement can be expressed in the form of expressions as following:
1. R(S1) ∩ W(S2) = { }
2. W(S1) ∩ R(S2) = { }
3. W(S1) ∩ W(S2) = { }
The read and write set for the statement c = a – b;
R(c=a-b) = {a, b}
W(c=a-b) = {c}
The read and write set for the statement w = c + 1;
R(w=c+1) = {c}
W(w=c+1) = {w}
The read and write set for the statement x = x + 2;
R(x=x+2) = {x}
W(x=x+2) = {x}
Example-1:
Let,
S1 : a = x + y
S2 : b = z + 1
Check whether the two statements S1 and S2 satisfy Bernstein’s conditions.
Solution:
R(S1) = {x, y}
W(S1) = {a}
R(S2) = {z}
W(S2) = {b}
1. R(S1) ∩ W(S2) = {x, y} ∩ {b} = { }
2. W(S1) ∩ R(S2) = {a} ∩ {z} = { }
3. W(S1) ∩ W(S2) = {a} ∩ {b} = { }
Therefore, S1 and S2 can be executed concurrently, i.e., Bernstein’s conditions are satisfied.
Example-2:
Let,
S1 : a = x + y
S2 : b = z + 1
S3 : c = a - b
Check whether the two statements S2 and S3 satisfy Bernstein’s conditions.
Solution:
R(S2) = {z}
W(S2) = {b}
R(S3) = {a, b}
W(S3) = {c}
1. R(S2) ∩ W(S3) = {z} ∩ {c} = { }
2. W(S2) ∩ R(S3) = {b} ∩ {a, b} = {b}
3. W(S2) ∩ W(S3) = {b} ∩ {c} = { }
Therefore, S2 and S3 cannot be executed concurrently, i.e., Bernstein’s conditions are not satisfied.
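Both worked examples can be checked mechanically; the helper below (the name is an assumption, not from the text) simply tests whether all three intersections are empty:

```python
# Illustrative Bernstein's-conditions check on read/write sets.
def bernstein_ok(r1, w1, r2, w2):
    """True iff S1 and S2 satisfy Bernstein's conditions
    (all three intersections empty) and may run concurrently."""
    return not (r1 & w2) and not (w1 & r2) and not (w1 & w2)

# Example-1: S1: a = x + y ; S2: b = z + 1  -> all intersections empty
print(bernstein_ok({"x", "y"}, {"a"}, {"z"}, {"b"}))      # True
# Example-2: S2: b = z + 1 ; S3: c = a - b  -> W(S2) ∩ R(S3) = {b}
print(bernstein_ok({"z"}, {"b"}, {"a", "b"}, {"c"}))      # False
```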
Key Takeaway
Bernstein’s Conditions are conditions applied to two statements S1 and S2 that are to be executed on the processor. The three conditions explained above must all be satisfied for two successive statements S1 and S2 to be executed concurrently and still produce the same result.
The intersections between the read-write set, the write-read set, and the write-write set of S1 and S2 must all be null.
What is Parallelism?
Modern computer architecture implementation requires special hardware and software support for parallelism. Types of parallelism are-
1. Hardware parallelism
2. Software parallelism
1. Hardware Parallelism: Hardware parallelism is defined by the machine architecture and hardware multiplicity, i.e., the number of processor resources (functional units, processors) that can operate simultaneously. It is a function of cost and performance trade-offs and indicates the peak performance of the processor resources.
2. Software Parallelism: Software parallelism is defined by the control and data dependences of the program. It is a function of the algorithm, the programming style, and compiler optimization, and it indicates the degree of parallelism that can be revealed in the program.
Key Takeaway
What is Parallelism?
Modern computer architecture implementation requires special hardware and software support for parallelism. Types of parallelism are-
1. Hardware parallelism
2. Software parallelism
GRAIN SIZE: Grain size (granularity) is a measure of the amount of computation in a software process, commonly expressed as the number of instructions in a program segment; fine, medium, and coarse grain are the typical levels.
LATENCY: Latency is a time measure of the communication overhead between machine subsystems; for example, memory latency is the time a processor needs to access memory.
Key Takeaway
SIMD describes computers with multiple processing elements that can perform the same operation on multiple data points simultaneously.
Granularity is the extent to which a system is broken down into small parts, whether the system itself or its description or observation. It concerns how a larger entity is subdivided: for example, breaking a plot into yards gives much finer granularity than treating the plot as a whole.
SIMD (Single Instruction Multiple Data) can be classified as various types but the 2 main and most important types of SIMD are:
(i) Fine-Grained SIMD:
These work at a detailed level: the system is divided into a large number of very small components, out of which the larger components are composed.
(ii) Coarse-Grained SIMD:
These systems consist of fewer components than Fine-Grained SIMD (though more than the original, undivided system), and each component is much larger than the fine-grained subcomponents.
Difference between Fine-Grained and Coarse-Grained SIMD Architecture:
S.No. | Fine-Grained SIMD | Coarse-Grained SIMD |
1. | Fine-grain SIMD has less computation time per task than the coarse-grain architecture. | Coarse-grain SIMD has more computation time per task than the fine-grain architecture. |
2. | Here, programs are broken into a large number of small tasks. | Here, programs are broken into a small number of large tasks. |
3. | Fine-grain SIMD has a much higher level of parallelism than coarse-grain SIMD. | Coarse-grain SIMD has a lower level of parallelism than fine-grain SIMD. |
4. | Here, grain size is in the range of 2-500 instructions. | Here, grain size is over 1000 instructions. |
5. | Here, the size of subcomponents is much smaller than in the coarse-grained case. | Here, the size of subcomponents is larger than in the fine-grained case. |
6. | Here, two types of parallelism can be obtained – | Here, these two types of parallelism can be obtained – |
7. | In fine-grain SIMD, load balancing is proper. | In coarse-grain SIMD, load balancing is improper. |
8. | Here, parallelism can be detected by the compiler. | Here, parallelism cannot be detected by the compiler. |
9. | Fine-grain SIMD is a much costlier process than coarse-grain SIMD. | Coarse-grain SIMD is much cheaper than fine-grain SIMD. |
10. | Fine grain underlies multi-threaded architectures and will also be used in future architectures. | Coarse grain is one of the earlier concepts, used in single-threaded architectures. |
11. | The detailed description is further divided into many small subcomponents, making the processes less complex than the original and also than the coarse-grained case. | The detailed description is divided into large subcomponents, making the processes less complex than the original but more complex than the fine-grained case. |
12. | Examples – | Examples – |
Key Takeaway
SIMD describes computers with multiple processing elements that can perform the same operation on multiple data points simultaneously.
Granularity is the extent to which a system is broken down into small parts, whether the system itself or its description or observation. It concerns how a larger entity is subdivided: for example, breaking a plot into yards gives much finer granularity than treating the plot as a whole.