Pipelining

Advanced Computer
Architecture
Chapter 4
Advanced Pipelining
Ioannis Papaefstathiou
CS 590.25
Easter 2003
(thanks to Hennesy & Patterson)
Chapter Overview
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP
Chap. 4 - Pipelining I
Chapter Overview
Technique
Reduces
Section
Loop Unrolling
Control Stalls
4.1
Basic Pipeline Scheduling
RAW Stalls
4.1
Dynamic Scheduling with Scoreboarding
RAW stalls
4.2
Dynamic Scheduling with Register Renaming
WAR and WAW stalls
4.2
Dynamic Branch Prediction
Control Stalls
4.3
Issue Multiple Instructions per Cycle
Ideal CPI
4.4
Compiler Dependence Analysis
Ideal CPI & data stalls
4.5
Software pipelining and trace scheduling
Ideal CPI & data stalls
4.5
Speculation
All data & control stalls
4.6
Dynamic memory disambiguation
RAW stalls involving memory
4.2, 4.6
Instruction Level
Parallelism
4.1 Instruction Level Parallelism:
Concepts and Challenges
4.2 Overcoming Data Hazards
with Dynamic Scheduling
4.3 Reducing Branch Penalties
with Dynamic Hardware
Prediction
4.4 Taking Advantage of More ILP
with Multiple Issue
4.5 Compiler Support for
Exploiting ILP
4.6 Hardware Support for
Extracting more Parallelism
4.7 Studies of ILP
ILP is the principle that there are many

instructions in code that dont
depend on each other. That means
its possible to execute those
instructions in parallel.
This is easier said than done:
Issues include:
Building compilers to analyze the
code,
Building hardware to be even
smarter than that code.
This section looks at some of the
problems to be solved.
Instruction Level
Parallelism
Pipeline Scheduling and

Loop Unrolling
Terminology
Basic Block - That set of instructions between entry points and between
branches. A basic block has only one entry and one exit. Typically
this is about 6 instructions long.
Loop Level Parallelism - that parallelism that exists within a loop. Such
parallelism can cross loop iterations.
Loop Unrolling - Either the compiler or the hardware is able to exploit
the parallelism inherent in the loop.
Instruction Level
Parallelism

Loop Unrolling
Simple Loop and its Assembler Equivalent

This is a clean and
simple example!
for (i=1; i<=1000; i++)

x(i) = x(i) + s;
Loop:
LD
ADDD
SD
SUBI
BNEZ
NOP
F0,0(R1)
F4,F0,F2
0(R1),F4
R1,R1,8
R1,Loop
;F0=vector element
;add scalar from F2
;store result
;decrement pointer 8bytes (DW)
;branch R1!=zero
;delayed branch slot
Instruction Level
Parallelism

Loop Unrolling
FP Loop Hazards
Loop:
LD
ADDD
SD
SUBI
BNEZ
NOP
F0,0(R1)
F4,F0,F2
0(R1),F4
R1,R1,8
R1,Loop
Instruction
producing result
FP ALU op
FP ALU op
Load double
Load double
Integer op
Where are the stalls?
;F0=vector element
;add scalar in F2
;store result
;decrement pointer 8B (DW)
;branch R1!=zero
Instruction
using result
Another FP ALU op
Store double
FP ALU op
Store double
Integer op
Latency in
clock cycles
3
2
1
0
0
Instruction Level
Parallelism

Loop Unrolling
FP Loop Showing Stalls

1 Loop:
2
3
4
5
6
7
8
9
10
LD
stall
ADDD
stall
stall
SD
SUBI
stall
BNEZ
stall
F0,0(R1)
;F0=vector element
F4,F0,F2
;add scalar in F2
0(R1),F4
R1,R1,8
;store result
;decrement pointer 8Byte (DW)
R1,Loop
;branch R1!=zero
Instruction
producing result
FP ALU op
FP ALU op
Load double
Load double
Integer op
Instruction
using result
Another FP ALU op
Store double
FP ALU op
Store double
Integer op
10 clocks: Rewrite code

to minimize stalls? Chap.
4 - Pipelining I
Latency in
clock cycles
3
2
1
0
0
Instruction Level
Parallelism

Loop Unrolling
Scheduled FP Loop Minimizing Stalls

1 Loop:
2
3
4
5
6
LD
SUBI
ADDD
stall
BNEZ
SD
F0,0(R1)
R1,R1,8
F4,F0,F2
R1,Loop
8(R1),F4
Stall is because SD
cant proceed.
;delayed branch
;altered when move past SUBI
Swap BNEZ and SD by changing address of SD

Instruction
producing result
FP ALU op
FP ALU op
Load double
Instruction
using result
Another FP ALU op
Store double
FP ALU op
Now 6 clocks: Now unroll

loop 4 times to make faster.
Chap.
Latency in
clock cycles
3
2
1
4 - Pipelining I

Instruction Level
Loop Unrolling
Parallelism
Unroll Loop Four Times (straightforward way)
1 Loop:
2
3
4
5
6
7
8
9
10
11
12
13
14
LD
stall
ADDD
stall
stall
SD
LD
stall
ADDD
stall
stall
SD
LD
stall
F0,0(R1)
F4,F0,F2
0(R1),F4
F6,-8(R1)
F8,F6,F2
-8(R1),F8
F10,-16(R1)
15
16
17
18
19
20
21
22
23
24
25
26
27
28
ADDD
stall
stall
SD
LD
stall
ADDD
stall
stall
SD
SUBI
BNEZ
stall
NOP
F12,F10,F2
-16(R1),F12
F14,-24(R1)
F16,F14,F2
-24(R1),F16
R1,R1,#32
R1,LOOP
15 + 4 x (1+2) +1 = 28 clock cycles, or 7 per iteration

Assumes R1 is multiple of 4
Rewrite loop to minimize stalls.

10
Instruction Level
Parallelism

Loop Unrolling
Unrolled Loop That Minimizes Stalls
1 Loop:
2
3
4
5
6
7
8
9
10
11
12
13
14
LD
LD
LD
LD
ADDD
ADDD
ADDD
ADDD
SD
SD
SD
SUBI
BNEZ
SD
F0,0(R1)
F6,-8(R1)
F10,-16(R1)
F14,-24(R1)
F4,F0,F2
F8,F6,F2
F12,F10,F2
F16,F14,F2
0(R1),F4
-8(R1),F8
-16(R1),F12
R1,R1,#32
R1,LOOP
8(R1),F16
What assumptions made when

moved code?
OK to move store past SUBI
even though changes register
OK to move loads before
stores: get right data?
When is it safe for compiler to
do such changes?
; 8-32 = -24
14 clock cycles, or 3.5 per iteration
No Stalls!!
11
Instruction Level
Parallelism

Loop Unrolling
Summary of Loop Unrolling Example
Determine that it was legal to move the SD after the SUBI and BNEZ,
and find the amount to adjust the SD offset.
Determine that unrolling the loop would be useful by finding that the
loop iterations were independent, except for the loop maintenance
code.
Use different registers to avoid unnecessary constraints that would be
forced by using the same registers for different computations.
Eliminate the extra tests and branches and adjust the loop maintenance
code.
Determine that the loads and stores in the unrolled loop can be
interchanged by observing that the loads and stores from different
iterations are independent.
This requires analyzing the memory
addresses and finding that they do not refer to the same address.
Schedule the code, preserving any dependences needed to yield the
same result as the original code.
12
Instruction Level
Parallelism
Dependencies
Compiler Perspectives on Code Movement

Compiler concerned about dependencies in program. Not concerned if a
HW hazard depends on a given pipeline.
Tries to schedule code to avoid hazards.
Looks for Data dependencies (RAW if a hazard for HW)
Instruction i produces a result used by instruction j, or
Instruction j is data dependent on instruction k, and instruction k is data
dependent on instruction i.
If dependent, cant execute in parallel
Easy to determine for registers (fixed names)
Hard for memory:
Does 100(R4) = 20(R6)?
From different loop iterations, does 20(R6) = 20(R6)?
13
Instruction Level
Parallelism
Data Dependencies
1 Loop:
2
3
4
5
LD
ADDD
SUBI
BNEZ
SD
F0,0(R1)
F4,F0,F2
R1,R1,8
R1,Loop
8(R1),F4
Where are the data

dependencies?
;delayed branch
;altered when move past SUBI
14
Instruction Level
Parallelism
Name Dependencies
Another kind of dependence called name dependence:

two instructions use same name (register or memory location) but dont
exchange data
Anti-dependence (WAR if a hazard for HW)
Instruction j writes a register or memory location that instruction i reads from
and instruction i is executed first
Output dependence (WAW if a hazard for HW)
Instruction i and instruction j write the same register or memory location;
ordering between instructions must be preserved.
15
Instruction Level
Parallelism
Name Dependencies

1 Loop:
2
3
4
5
6
7
8
9
10
11
12
13
14
15
LD
ADDD
SD
LD
ADDD
SD
LD
ADDD
SD
LD
ADDD
SD
SUBI
BNEZ
NOP
F0,0(R1)
F4,F0,F2
0(R1),F4
F0,-8(R1)
F4,F0,F2
-8(R1),F4
F0,-16(R1)
F4,F0,F2
-16(R1),F4
F0,-24(R1)
F4,F0,F2
-24(R1),F4
R1,R1,#32
R1,LOOP
Where are the name

dependencies?
No data is passed in F0, but
cant reuse F0 in cycle 4.
How can we remove these

dependencies?Chap. 4 - Pipelining I
16
Instruction Level
Parallelism
Name Dependencies
Again Name Dependencies are Hard for Memory Accesses

Does 100(R4) = 20(R6)?
From different loop iterations, does 20(R6) = 20(R6)?
Our example required compiler to know that if R1 doesnt change then:
0(R1)?8(R1)?16(R1)?24(R1)
There were no dependencies between some loads and stores so they
could be moved around each other
18
Instruction Level
Parallelism
Control Dependencies
Final kind of dependence called control dependence

Example
ifp1{S1;};
ifp2{S2;};
S1 is control dependent on p1 and S2 is control dependent on p2 but not
on p1.
19
Instruction Level
Parallelism
Two (obvious) constraints on control dependences:

An instruction that is control dependent on a branch cannot be moved
before the branch so that its execution is no longer controlled by the
branch.
An instruction that is not control dependent on a branch cannot be
moved to after the branch so that its execution is controlled by the
branch.
Control dependencies relaxed to get parallelism; get same effect if

preserve order of exceptions (address in register checked by branch
before use) and data flow (value in register depends on branch)
20
Instruction Level
Parallelism

1 Loop:
2
3
4
5
6
7
8
9
10
11
12
13
14
15
....
LD
ADDD
SD
SUBI
BEQZ
LD
ADDD
SD
SUBI
BEQZ
LD
ADDD
SD
SUBI
BEQZ
F0,0(R1)
F4,F0,F2
0(R1),F4
R1,R1,8
R1,exit
F0,0(R1)
F4,F0,F2
0(R1),F4
R1,R1,8
R1,exit
F0,0(R1)
F4,F0,F2
0(R1),F4
R1,R1,8
R1,exit
Where are the control

dependencies?
21
Instruction Level
Parallelism
Loop Level Parallelism
When Safe to Unroll Loop?

Example: Where are data dependencies?
(A,B,C distinct & non-overlapping)
for (i=1; i<=100; i=i+1) {
A[i+1] = A[i] + C[i];
/* S1 */
B[i+1] = B[i] + A[i+1]; /* S2 */
}
1. S2 uses the value, A[i+1], computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since
iteration i computes A[i+1] which is read in iteration i+1. The same
is true of S2 for B[i] and B[i+1].
This is a loop-carried dependence between iterations
Implies that iterations are dependent, and cant be executed in parallel
Note the case for our prior example; each iteration was distinct
22
Instruction Level
Parallelism
When Safe to Unroll Loop?

Example: Where are data dependencies?
(A,B,C,D distinct & non-overlapping)
for (i=1; i<=100; i=i+1) {

A[i+1] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
1.
2.
No dependence from S1 to S2. If there were, then there would be a cycle in

the dependencies and the loop would not be parallel. Since this other
dependence is absent, interchanging the two statements will not affect the
execution of S2.
On the first iteration of the loop, statement S1 depends on the value of B[1]
computed prior to initiating the loop.
23
Instruction Level
Parallelism
Now Safe to Unroll Loop? (p. 240)

OLD:
NEW:
for (i=1; i<=100; i=i+1) {

A[i+1] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i];} /* S2 */
A[1] = A[1] + B[1];

for (i=1; i<=99; i=i+1) {
B[i+1] = C[i] + D[i];
A[i+1] = + A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
No circular dependencies.
Loop caused dependence
on B.
Have eliminated loop

dependence.
24
Dynamic Scheduling
Prediction
with Multiple Issue
Exploiting ILP
4.7 Studies of ILP
Dynamic Scheduling is when the

hardware rearranges the order of
instruction execution to reduce
stalls.
Advantages:
Dependencies unknown at compile
time can be handled by the
hardware.
Code compiled for one type of
pipeline can be efficiently run on
another.
Disadvantages:
Hardware much more complex.
25
Dynamic Scheduling
The idea:
HW Schemes: Instruction Parallelism
Why in HW at run time?

Works when cant know real dependence at compile time
Compiler simpler
Code for one machine runs well on another
Key Idea: Allow instructions behind stall to proceed.
Key Idea: Instructions executing in parallel. There are multiple
execution units, so use them.
DIVD
F0,F2,F4
ADDD
F10,F0,F8
SUBD
F12,F8,F14
Enables out-of-order execution => out-of-order completion
26
Dynamic Scheduling
The idea:
HW Schemes: Instruction Parallelism
Out-of-order execution divides ID stage:

1. Issuedecode instructions, check for structural hazards
2. Read operandswait until no data hazards, then read operands
Scoreboards allow instruction to execute whenever 1 & 2 hold, not

waiting for prior instructions.
A scoreboard is a data structure that provides the information
necessary for all pieces of the processor to work together.
We will use In order issue, out of order execution, out of order
commit ( also called completion)
First used in CDC6600. Our example modified here for DLX.
CDC had 4 FP units, 5 memory reference units, 7 integer units.
DLX has 2 FP multiply, 1 FP adder, 1 FP divider, 1 integer.
27
Dynamic Scheduling
Using A Scoreboard
Scoreboard Implications
Out-of-order completion => WAR, WAW hazards?

Solutions for WAR
Queue both the operation and copies of its operands
Read registers only during Read Operands stage
For WAW, must detect hazard: stall until other completes
Need to have multiple instructions in execution phase => multiple
execution units or pipelined execution units
Scoreboard keeps track of dependencies, state or operations
Scoreboard replaces ID, EX, WB with 4 stages
28
Dynamic Scheduling
Using A Scoreboard
Four Stages of Scoreboard Control

1. Issue decode instructions & check for structural hazards (ID1)
If a functional unit for the instruction is free and no other active
instruction has the same destination register (WAW), the
scoreboard issues the instruction to the functional unit and
updates its internal data structure.
If a structural or WAW hazard exists, then the instruction issue
stalls, and no further instructions will issue until these hazards
are cleared.
29
Dynamic Scheduling
Using A Scoreboard

2.
Read operands wait until no data hazards, then read

operands (ID2)
A source operand is available if no earlier issued active
instruction is going to write it, or if the register containing
the operand is being written by a currently active
functional unit.
When the source operands are available, the scoreboard tells
the functional unit to proceed to read the operands from
the registers and begin execution. The scoreboard
resolves RAW hazards dynamically in this step, and
instructions may be sent into execution out of order.
30
Dynamic Scheduling
Using A Scoreboard

3. Execution operate on operands (EX)
The functional unit begins execution upon receiving
operands. When the result is ready, it notifies the scoreboard
that it has completed execution.
4. Write result finish execution (WB)
Once the scoreboard is aware that the functional unit has
completed execution, the scoreboard checks for WAR
hazards. If none, it writes results. If WAR, then it stalls the
instruction.
Example:
DIVD
F0,F2,F4
ADDD
F10,F0,F8
SUBD
F8,F8,F14
Scoreboard would stall SUBD until ADDD reads operands
31
Using A Scoreboard
Dynamic Scheduling
Three Parts of the Scoreboard

1. Instruction
statuswhich
of
steps
the
instruction
is
in
2. Functional unit statusIndicates the state of the functional unit (FU). 9

fields for each functional unit
BusyIndicates whether the unit is busy or not
OpOperation to perform in the unit (e.g., + or )
FiDestination register
Fj, FkSource-register numbers
Qj, QkFunctional units producing source registers Fj, Fk
Rj,
RkFlags
indicating
when
Fj,
Fk
are
ready
3. Register result statusIndicates which functional unit will write each
register, if one exists. Blank when no pending instructions will write that
register
32
Dynamic Scheduling
Using A Scoreboard
Detailed Scoreboard Pipeline Control

Instruction
status
Wait until
Bookkeeping
Issue
Not busy (FU)

and not result(D)
Busy(FU) yes; Op(FU) op;

Fi(FU) `D; Fj(FU) `S1;
Fk(FU) `S2; Qj Result(S1);
Qk Result(`S2); Rj not Qj;
Rk not Qk; Result(D) FU;
Read
operands
Rj and Rk
Rj No; Rk No
Execution
complete
Functional unit
done
f((Fj( f )Fi(FU)
or Rj( f )=No) &
Write result (Fk( f ) Fi(FU) or
f(if Qj(f)=FU then Rj(f) Yes);

f(if Qk(f)=FU then Rj(f) Yes);
Result(Fi(FU)) 0; Busy(FU) No
Rk( f )=No))
33
Dynamic Scheduling
Using A Scoreboard
Scoreboard Example
This is the sample code well be working with in the example:
LD
LD
MULT
SUBD
DIVD
ADDD
F6, 34(R2)
F2, 45(R3)
F0, F2, F4
F8, F6, F2
F10, F0, F6
F6, F8, F2
What are the hazards in this code?
Latencies (clock cycles):

LD
1
MULT
10
SUBD
2
DIVD
40
ADDD
2
34
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example
Instruction status
Instruction j
k
LD
F6
34+ R2
LD
F2
45+ R3
MULTDF0
F2 F4
SUBD F8
F6 F2
DIVD F10 F0 F6
ADDDF6
F8 F2
Functional unit status
Time Name
Integer
Mult1
Mult2
Add
Divide
Register result status
Read Execution
Write
Issue operands
complete
Result
Busy
No
No
No
No
No
Clock
F0
Op
dest
Fi
S1
Fj
S2
Fk
FU for j FU for k Fj?

Qj
Qk
Rj
Fk?
Rk
F2
F4
F6
F8
F10 F12
F30
...
FU
35
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 1

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
Mult1
Mult2
Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
Busy
Yes
No
No
No
No
Clock
F0
FU
Issue LD #1
Shows in which cycle
the operation occurred.
Op
Load
dest
Fi
F6
S1
Fj
S2
Fk
R2

Qj
Qk
Rj
F2
F4
F6 F8 F10 F12
...
Fk?
Rk
Yes
F30
Integer
36
Using A Scoreboard
Dynamic Scheduling

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
Mult1
Mult2
Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
Clock
F0
FU
Busy Op
Yes
Load
No
No
No
No
F2
S2
Fk
R2
LD #2 cant issue since

integer unit is busy.
MULT cant issue because
we require in-order issue.
dest
Fi
F6
S1
Fj

Qj
Qk
Rj
F4
F6 F8 F10 F12
...
Fk?
Rk
Yes
F30
Integer
37
Using A Scoreboard
Dynamic Scheduling

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
Mult1
Mult2
Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
Clock
F0
FU
Busy Op
Yes
Load
No
No
No
No
F2
dest
Fi
F6
S1
Fj
S2
Fk
R2

Qj
Qk
Rj
F4
F6 F8 F10 F12
...
Fk?
Rk
Yes
F30
Integer
38
Using A Scoreboard
Dynamic Scheduling

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
Mult1
Mult2
Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
Clock
F0
FU
Busy Op
Yes
Load
No
No
No
No
F2
dest
Fi
F6
S1
Fj
S2
Fk
R2

Qj
Qk
Rj
F4
F6 F8 F10 F12
...
Fk?
Rk
Yes
F30
Integer
39
Using A Scoreboard
Dynamic Scheduling

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
Mult1
Mult2
Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
Clock
F0
FU
Busy Op
Yes
Load
No
No
No
No
F2
S2
Fk
R3
Issue LD #2 since integer

unit is now free.
dest
Fi
F2
S1
Fj

Qj
Qk
Rj
F4
F6 F8 F10 F12
...
Fk?
Rk
Yes
F30
Integer
40
Using A Scoreboard
Dynamic Scheduling

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
Mult1
Mult2
Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
6
6
Clock
F0
FU
Busy Op
Yes
Load
Yes
Mult
No
No
No
F2
dest
Fi
F2
F0
S1
Fj
F4
F6 F8 F10 F12
F2
S2
Fk
R3
F4
Issue MULT.

Qj
Qk
Rj
Integer
No
Fk?
Rk
Yes
Yes
...
F30
Mult1 Integer
41
Using A Scoreboard
Dynamic Scheduling

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
Mult1
Mult2
Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
6
7
6
7
Busy
Yes
Yes
No
Yes
No
Clock
F0
FU
Op
Load
Mult
dest
Fi
F2
F0
S1
Fj
Sub
F2
Mult1 Integer
MULT cant read its

operands (F2) because LD
#2 hasnt finished.

Qj
Qk
Rj
F2
S2
Fk
R3
F4
No
Fk?
Rk
Yes
Yes
F8
F6
F2
Integer Yes
No
F4
F6 F8 F10 F12
Integer
...
F30
Add
42
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 8a

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
Mult1
Mult2
Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
6
7
6
7
8
Busy
Yes
Yes
No
Yes
Yes
Clock
F0
FU
Op
Load
Mult
dest
Fi
F2
F0
S1
Fj
Sub
Div
F2
Mult1 Integer
DIVD issues.
MULT and SUBD both
waiting for F2.

Qj
Qk
Rj
F2
S2
Fk
R3
F4
No
Fk?
Rk
Yes
Yes
F8
F10
F6
F0
F2
F6
Integer Yes
Mult1
No
No
Yes
F4
F6 F8 F10 F12
Integer
...
F30
Add Divide
43
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 8b

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
Mult1
Mult2
Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
6
7
8
6
7
8
Busy
No
Yes
No
Yes
Yes
Clock
F0
FU
Mult1
LD #2 writes F2.
Op
dest
Fi
S1
Fj
S2
Fk

Qj
Qk
Rj
Fk?
Rk
Mult
F0
F2
F4
Yes
Yes
Sub
Div
F8
F10
F6
F0
F2
F6
Yes
No
Yes
Yes
F2
F4
F6 F8 F10 F12
...
F30
Mult1
Add Divide
44
Using A Scoreboard
Dynamic Scheduling

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
10 Mult1
Mult2
2 Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
6
7
8
6
9
7
9
8
Busy
No
Yes
No
Yes
Yes
Clock
F0
FU
Mult1
Now MULT and SUBD can

both read F2.
How can both instructions
do this at the same time??
Op
dest
Fi
S1
Fj
S2
Fk

Qj
Qk
Rj
Fk?
Rk
Mult
F0
F2
F4
Yes
Yes
Sub
Div
F8
F10
F6
F0
F2
F6
Yes
No
Yes
Yes
F2
F4
F6 F8 F10 F12
...
F30
Mult1
Add Divide
45
Using A Scoreboard
Dynamic Scheduling

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
8 Mult1
Mult2
0 Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
6
7
8
6
9
7
9
11
8
Busy
No
Yes
No
Yes
Yes
Clock
F0
11
FU
Mult1
ADDD cant start because

add unit is busy.
Op
dest
Fi
S1
Fj
S2
Fk

Qj
Qk
Rj
Fk?
Rk
Mult
F0
F2
F4
Yes
Yes
Sub
Div
F8
F10
F6
F0
F2
F6
Yes
No
Yes
Yes
F2
F4
F6 F8 F10 F12
...
F30
Mult1
Add Divide
46
Using A Scoreboard
Dynamic Scheduling

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
7 Mult1
Mult2
Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
6
7
8
6
9
7
9
11
12
8
Clock
F0
12
FU
Busy Op
No
Yes
Mult
No
No
Yes
Div
F2
Mult1
SUBD finishes.
DIVD waiting for F0.
dest
Fi
S1
Fj
S2
Fk

Qj
Qk
Rj
Fk?
Rk
F0
F2
F4
Yes
Yes
F10
F0
F6
No
Yes
F4
F6 F8 F10 F12
...
F30
Mult1
Divide
47
Using A Scoreboard
Dynamic Scheduling

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
6 Mult1
Mult2
Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
6
7
8
6
9
7
9
11
12
8
13
dest
S1 S2
Busy Op
Fi
Fj
Fk
No
Yes
Mult
F0
F2 F4
No
Yes
Add
F6
F8 F2
Yes
Div
F10
F0 F6
Clock
F0
13
FU
F2
Mult1
F4
ADDD issues.

Qj
Qk
Rj
Fk?
Rk
Yes
Yes
Yes
No
Yes
Yes
...
F30
Mult1
F6 F8 F10 F12
Add
Divide
48
Using A Scoreboard
Dynamic Scheduling

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
5 Mult1
Mult2
2 Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
6
7
8
6
9
7
9
11
12
8
13
14
dest
S1 S2
Busy Op
Fi
Fj
Fk
No
Yes
Mult
F0
F2 F4
No
Yes
Add
F6
F8 F2
Yes
Div
F10
F0 F6
Clock
F0
14
FU
F2
Mult1
F4

Qj
Qk
Rj
Fk?
Rk
Yes
Yes
Yes
No
Yes
Yes
...
F30
Mult1
F6 F8 F10 F12
Add
Divide
49
Using A Scoreboard
Dynamic Scheduling

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
4 Mult1
Mult2
1 Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
6
7
8
6
9
7
9
11
12
8
13
14
dest
S1 S2
Busy Op
Fi
Fj
Fk
No
Yes
Mult
F0
F2 F4
No
Yes
Add
F6
F8 F2
Yes
Div
F10
F0 F6
Clock
F0
15
FU
F2
Mult1
F4

Qj
Qk
Rj
Fk?
Rk
Yes
Yes
Yes
No
Yes
Yes
...
F30
Mult1
F6 F8 F10 F12
Add
Divide
50
Using A Scoreboard
Dynamic Scheduling

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
3 Mult1
Mult2
0 Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
6
7
8
6
9
7
9
11
12
8
13
14
16
dest
S1 S2
Busy Op
Fi
Fj
Fk
No
Yes
Mult
F0
F2 F4
No
Yes
Add
F6
F8 F2
Yes
Div
F10
F0 F6
Clock
F0
16
FU
F2
Mult1
F4

Qj
Qk
Rj
Fk?
Rk
Yes
Yes
Yes
No
Yes
Yes
...
F30
Mult1
F6 F8 F10 F12
Add
Divide
51
Using A Scoreboard
Dynamic Scheduling

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
2 Mult1
Mult2
Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
6
7
8
6
9
7
9
11
12
8
13
14
16
dest
S1 S2
Busy Op
Fi
Fj
Fk
No
Yes
Mult
F0
F2 F4
No
Yes
Add
F6
F8 F2
Yes
Div
F10
F0 F6
Clock
F0
17
FU
F2
Mult1
F4
ADDD cant write because

of DIVD. RAW!

Qj
Qk
Rj
Fk?
Rk
Yes
Yes
Yes
No
Yes
Yes
...
F30
Mult1
F6 F8 F10 F12
Add
Divide
52
Using A Scoreboard
Dynamic Scheduling

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
1 Mult1
Mult2
Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
6
7
8
6
9
7
9
11
12
8
13
14
16
dest
S1 S2
Busy Op
Fi
Fj
Fk
No
Yes
Mult
F0
F2 F4
No
Yes
Add
F6
F8 F2
Yes
Div
F10
F0 F6
Clock
F0
18
FU
F2
Mult1
F4
Nothing Happens!!

Qj
Qk
Rj
Fk?
Rk
Yes
Yes
Yes
No
Yes
Yes
...
F30
Mult1
F6 F8 F10 F12
Add
Divide
53
Using A Scoreboard
Dynamic Scheduling

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
0 Mult1
Mult2
Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
6
7
8
6
9
19
7
9
11
12
8
13
14
16
dest
S1 S2
Busy Op
Fi
Fj
Fk
No
Yes
Mult
F0
F2 F4
No
Yes
Add
F6
F8 F2
Yes
Div
F10
F0 F6
Clock
F0
19
FU
F2
Mult1
F4
MULT completes execution.

Qj
Qk
Rj
Fk?
Rk
Yes
Yes
Yes
No
Yes
Yes
...
F30
Mult1
F6 F8 F10 F12
Add
Divide
54
Using A Scoreboard
Dynamic Scheduling

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
Mult1
Mult2
Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
6
7
8
6
9
19
20
7
9
11
12
8
13
14
16
dest
S1 S2
Busy Op
Fi
Fj
Fk
No
No
No
Yes
Add
F6
F8 F2
Yes
Div
F10
F0 F6
Clock
F0
20
FU
F2
F4
MULT writes.

Qj
Qk
Rj
Fk?
Rk
Yes
Yes
Yes
Yes
...
F30
F6 F8 F10 F12
Add
Divide
55
Using A Scoreboard
Dynamic Scheduling

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
Mult1
Mult2
Add
Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
6
7
8
6
9
19
20
7
9
11
12
8
21
13
14
16
dest
S1 S2
Busy Op
Fi
Fj
Fk
No
No
No
Yes
Add
F6
F8 F2
Yes
Div
F10
F0 F6
Clock
F0
21
FU
F2
F4
DIVD loads operands

Qj
Qk
Rj
Fk?
Rk
Yes
Yes
Yes
Yes
...
F30
F6 F8 F10 F12
Add
Divide
56
Dynamic Scheduling
Using A Scoreboard

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
Mult1
Mult2
Add
40 Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
6
7
8
6
9
19
20
7
9
11
12
8
21
13
14
16
22
dest
S1 S2
Busy Op
Fi
Fj
Fk
No
No
No
No
Yes
Div
F10
F0 F6
Clock
F0
22
FU
F2
F4
Now ADDD can write since

WAR removed.

Qj
Qk
Rj
Fk?
Rk
Yes
Yes
...
F30
F6 F8 F10 F12
Divide
57
Dynamic Scheduling
Using A Scoreboard

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
Mult1
Mult2
Add
0 Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
6
7
8
6
9
19
20
7
9
11
12
8
21
61
13
14
16
22
dest
S1 S2
Busy Op
Fi
Fj
Fk
No
No
No
No
Yes
Div
F10
F0 F6
Clock
F0
61
FU
F2
F4
DIVD completes execution

Qj
Qk
Rj
Fk?
Rk
Yes
Yes
...
F30
F6 F8 F10 F12
Divide
58
Using A Scoreboard
Dynamic Scheduling

Instruction status
Instruction j
k
LD
F6 34+ R2
LD
F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDDF6 F8 F2
Time Name
Integer
Mult1
Mult2
Add
0 Divide
Read Execution
Write
Issue operands
complete
Result
1
2
3
4
5
6
7
8
6
9
19
20
7
9
11
12
8
21
61
62
13
14
16
22
dest
S1 S2
Busy Op
Fi
Fj
Fk
No
No
No
No
No
Clock
F0
62
F2
F4
DONE!!

Qj
Qk
Rj
F6 F8 F10 F12
...
Fk?
Rk
F30
FU
59
Dynamic Scheduling
Using A Scoreboard
Another Dynamic Algorithm:

Tomasulo Algorithm
For IBM 360/91 about 3 years after CDC 6600 (1966)

Goal: High Performance without special compilers
Differences between IBM 360 & CDC 6600 ISA
IBM has only 2 register specifiers / instruction vs. 3 in CDC 6600
IBM has 4 FP registers vs. 8 in CDC 6600
Why Study? lead to Alpha 21264, HP 8000, MIPS 10000, Pentium II,
PowerPC 604,
60
Dynamic Scheduling
Using A Scoreboard
Tomasulo Algorithm vs. Scoreboard
Control & buffers distributed with Function Units (FU) vs.

centralized in scoreboard;
FU buffers called reservation stations; have pending operands
Registers in instructions replaced by values or pointers to

reservation stations(RS); called register renaming ;
avoids WAR, WAW hazards
More reservation stations than registers, so can do optimizations
compilers cant
Results to FU from RS, not through registers, over Common

Data Bus that broadcasts results to all FUs
Load and Stores treated as FUs with RSs as well
Integer instructions can go past branches, allowing
FP ops beyond basic block in FP queue
61
Dynamic Scheduling
Using A Scoreboard
Tomasulo Organization
Load
Buffer
FP Op Queue FP
Registers
Store
Buffer
Common
Data
Bus
FP Add
Res.
Station
FP Mul
Res.
Station
62
Dynamic Scheduling
Using A Scoreboard
Reservation Station Components

OpOperation to perform in the unit (e.g., + or )
Vj, VkValue of Source operands
Store buffers have V field, result to be stored
Qj, QkReservation stations producing source registers (value to be
written)
Note: No ready flags as in Scoreboard; Qj,Qk=0 => ready
Store buffers only have Qi for RS producing result
BusyIndicates reservation station or FU is busy
Register result statusIndicates which functional unit will write each
register, if one exists. Blank when no pending instructions that will
write that register.
63
Dynamic Scheduling
Using A Scoreboard
Three Stages of Tomasulo Algorithm

1. Issueget instruction from FP Op Queue
If reservation station free (no structural hazard),
control issues instruction & sends operands (renames registers).
2. Executionoperate on operands (EX)
When both operands ready then execute;
if not ready, watch Common Data Bus for result
3. Write resultfinish execution (WB)
Write on Common Data Bus to all awaiting units;
mark reservation station available
Normal data bus: data + destination (go to bus)
Common data bus: data + source (come from bus)
64 bits of data + 4 bits of Functional Unit source address
Write if matches expected Functional Unit (produces result)
Does the broadcast
64
Using A Scoreboard
Dynamic Scheduling
Tomasulo Example Cycle 0

Instruction status
Instruction
j
k Issue
LD
F6
34+
R2
LD
F2
45+
R3
MULTDF0
F2
F4
SUBD F8
F6
F2
DIVD F10 F0
F6
ADDDF6
F8
F2
Reservation Stations
Time Name Busy Op
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 No
0 Mult2 No
Execution
complete
S1
Vj
S2
Vk
RS for j
Qj
RS for k
Qk
Clock
F2
F4
F6
F8
F0
Write
Result
Load1
Load2
Load3
Busy
No
No
No
Address
F10 F12 ...
F30
FU
65
Dynamic Scheduling
Using A Scoreboard
Review: Tomasulo
Prevents Register as bottleneck

Avoids WAR, WAW hazards of Scoreboard
Allows loop unrolling in HW
Not limited to basic blocks (provided branch prediction)
Lasting Contributions
Dynamic scheduling
Register renaming
Load/store disambiguation
360/91 descendants are PowerPC 604, 620; MIPS R10000; HP-PA

8000; Intel Pentium Pro
66
Dynamic Hardware
Prediction
Prediction
with Multiple Issue
Exploiting ILP
Dynamic Branch Prediction is the ability

of the hardware to make an educated
guess about which way a branch will
go - will the branch be taken or not.
The hardware can look for clues based
on the instructions, or it can use past
history - we will discuss both of
these directions.

4.7 Studies of ILP
67
Dynamic Hardware
Prediction
Basic Branch Prediction:

Branch Prediction Buffers

Performance = (accuracy, cost of misprediction)
Branch History Lower bits of PC address index table of 1-bit values
Says whether or not branch taken last time
Problem: in a loop, 1-bit BHT will cause two mis-predictions:

End of loop case, when it exits instead of looping as before
First time through loop on next time through code, when it predicts exit instead
of looping
Address
31
0
1
Bits 13 - 2
1023
P
r
e
d
i
c
t
i
o
n
68
Dynamic Hardware
Prediction


Solution: 2-bit scheme where change prediction only if get
misprediction twice: (Figure 4.13, p. 264)
NT
Predict Taken
Predict Taken
T
Predict Not
Taken
T
NT
T
NT
Predict Not
Taken
NT
69
Dynamic Hardware
Prediction

BHT Accuracy
Mispredict because either:
Wrong guess for that branch
Got branch history of wrong branch when index the table
4096 entry table programs vary from 1% misprediction (nasa7,
tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
4096 about as good as infinite table, but 4096 is a lot of HW
70
Dynamic Hardware
Prediction

Correlating Branches
Idea: taken/not taken of
recently executed branches is
related to behavior of next
branch (as well as the history
of that branch behavior)
Branch address
2-bits per branch predictors
Then behavior of recent

branches selects between, say,
four predictions of next branch,
updating just that prediction
Prediction
Prediction
2-bit global branch history
71
Dynamic Hardware
Prediction

Frequency of Mispredictions
Accuracy of Different Schemes
18%
(Figure 4.21,
p. 272)
4096 Entries 2-bits per entry

Unlimited Entries 2-bits per entry
1024 Entries - 2 bits of history,
2 bits per entry
18%
14%
12%
11%
10%
8%
6%
6%
6%
6%
5%
5%
4%
4%
1%
0%
1%
4,096 entries: 2-bits per entry
Unlimited entries: 2-bits/entry
1,024 entries (2,2)
li
eqntott
espresso
gcc
fpppp
spice
m atrix300
0%
doducd
0%
tom catv
2%
nasa7
Fre que ncy o f Mispre dictio ns
16%
72
Dynamic Hardware
Prediction

Branch Target Buffers
Branch Target Buffer
Branch Target Buffer (BTB): Use address of branch as index to get prediction AND branch address (if taken)
Note: must check for branch match now, since cant use wrong branch address (Figure 4.22, p. 273)
Return instruction addresses predicted with stack
Predicted PC
Branch Prediction:
Taken or not Taken
73
Dynamic Hardware
Prediction
Example
Instructions
in Buffer
Yes
Yes
No

Branch Target Buffers
Prediction
Taken
Taken
Actual
Branch
Taken
Not taken
Taken
Penalty
Cycles
0
2
2
Example on page 274.

Determine the total branch penalty for a BTB using the above
penalties. Assume also the following:
Prediction accuracy of 80%
Hit rate in the buffer of 90%
60% taken branch frequency.
Branch Penalty = Percent buffer hit rate X Percent incorrect predictions X 2
( 1 - percent buffer hit rate) X Taken branches X 2
Branch Penalty = ( 90% X 10% X 2) + (10% X 60% X 2)

Branch Penalty = 0.18 + 0.12 = 0.30 clock cycles
74
Multiple Issue
Prediction
with Multiple Issue
Exploiting ILP
4.7 Studies of ILP
Multiple Issue is the ability of the

processor to start more than one
instruction in a given cycle.
Flavor I:
Superscalar processors issue varying
number of instructions per clock - can
be either statically scheduled (by the
compiler) or dynamically scheduled
(by the hardware).
Superscalar has a varying number of
instructions/cycle (1 to 8), scheduled
by compiler or by HW (Tomasulo).
IBM PowerPC, Sun UltraSparc, DEC
Alpha, HP 8000
75
Multiple Issue
Issuing Multiple Instructions/Cycle
Flavor II:
VLIW - Very Long Instruction Word - issues a fixed number of
instructions formatted either as one very large instruction or as a
fixed packet of smaller instructions.
fixed number of instructions (4-16) scheduled by the compiler; put
operators into wide templates
Joint HP/Intel agreement in 1999/2000
Intel Architecture-64 (IA-64) 64-bit address
Style: Explicitly Parallel Instruction Computer (EPIC)
76
Multiple Issue
Flavor II - continued:
3 Instructions in 128 bit groups; field determines if instructions

dependent or independent
Smaller code size than old VLIW, larger than x86/RISC
Groups can be linked to show independence > 3 instr
64 integer registers + 64 floating point registers
Not separate files per functional unit as in old VLIW
Hardware checks dependencies
(interlocks => binary compatibility over time)
Predicated execution (select 1 out of 64 1-bit flags)
=> 40% fewer mis-predictions?
IA-64 : name of instruction set architecture; EPIC is type
Merced is name of first implementation (1999/2000?)
77
Multiple Issue
A SuperScalar Version of DLX

Fetch 64-bits/clock cycle; Int on left, FP on right
Can only issue 2nd instruction if 1st instruction issues
More ports for FP registers to do FP load & FP op in a pair
In our DLX example,

we can handle 2
instructions/cycle:
Floating Point
Anything Else
Type
Pipe Stages
Int. instruction
IF
ID
EX MEM WB
FP instruction
IF
ID
EX MEM WB
Int. instruction
IF
ID
EX MEM WB
FP instruction
IF
ID
EX MEM WB
Int. instruction
IF
ID
EX MEM WB
FP instruction
IF
ID
EX MEM WB
1 cycle load delay causes delay to 3 instructions in Superscalar
instruction in right half cant use it, nor instructions in next slot
78
Multiple Issue
Unrolled Loop Minimizes Stalls for Scalar

1 Loop:
2
3
4
5
6
7
8
9
10
11
12
13
14
LD
LD
LD
LD
ADDD
ADDD
ADDD
ADDD
SD
SD
SD
SUBI
BNEZ
SD
F0,0(R1)
F6,-8(R1)
F10,-16(R1)
F14,-24(R1)
F4,F0,F2
F8,F6,F2
F12,F10,F2
F16,F14,F2
0(R1),F4
-8(R1),F8
-16(R1),F12
R1,R1,#32
R1,LOOP
8(R1),F16
Latencies:
LD to ADDD: 1 Cycle
ADDD to SD: 2 Cycles
; 8-32 = -24
14 clock cycles, or 3.5 per iteration

79
Multiple Issue
Loop Unrolling in Superscalar

Integer instruction
FP instruction
Loop: LD F0,0(R1)
1
LD F6,-8(R1)
2
LD F10,-16(R1)
ADDD F4,F0,F2
LD F14,-24(R1)
ADDD F8,F6,F2
LD F18,-32(R1)
ADDD F12,F10,F2
SD 0(R1),F4
ADDD F16,F14,F2
SD -8(R1),F8
ADDD F20,F18,F2
SD -16(R1),F12
8
SD -24(R1),F16
9
SUBI R1,R1,#40
10
BNEZ R1,LOOP
11
SD
8(R1),F20
12
Unrolled 5 times to avoid delays (+1 due to SS)
12 clocks, or 2.4 clocks per iteration
Clock cycle
3
4
5
6
7
80
Multiple Issue
Multiple Instruction Issue &

Dynamic Scheduling
Dynamic Scheduling in Superscalar

Code compiler for scalar version will run poorly on Superscalar
May want code to vary depending on how Superscalar
Simple approach: separate Tomasulo Control for separate reservation
stations for Integer FU/Reg and for FP FU/Reg
81
Multiple Issue

Dynamic Scheduling
Dynamic Scheduling in Superscalar

How to do instruction issue with two instructions and keep in-order
instruction issue for Tomasulo?
Issue 2X Clock Rate, so that issue remains in order
Only FP loads might cause dependency between integer and FP
issue:
Replace load reservation station with a load queue;
operands must be read in the order they are fetched
Load checks addresses in Store Queue to avoid RAW violation
Store checks addresses in Load Queue to avoid WAR,WAW
82
Multiple Issue

Dynamic Scheduling
Performance of Dynamic Superscalar

Iteration Instructions
Issues Executes Writes result
no.
clock-cycle number
1
LD F0,0(R1)
1
2
4
1
ADDD F4,F0,F2
1
5
8
1
SD 0(R1),F4
2
9
1
SUBI R1,R1,#8
3
4
5
1
BNEZ R1,LOOP
4
5
2
LD F0,0(R1)
5
6
8
2
ADDD F4,F0,F2
5
9
12
2
SD 0(R1),F4
6
13
2
SUBI R1,R1,#8
7
8
9
2
BNEZ R1,LOOP
8
9
4 clocks per iteration
Branches, Decrements still take 1 clock cycle
83
VLIW
Multiple Issue
Loop Unrolling in VLIW

Memory
reference 1
LD F0,0(R1)
LD F10,-16(R1)
LD F18,-32(R1)
LD F26,-48(R1)
Memory
FP
reference 2
operation 1
LD F6,-8(R1)
LD F14,-24(R1)
LD F22,-40(R1) ADDD F4,F0,F2
ADDD F12,F10,F2
ADDD F20,F18,F2
SD 0(R1),F4
SD -8(R1),F8 ADDD F28,F26,F2
SD -16(R1),F12 SD -24(R1),F16
SD -32(R1),F20 SD -40(R1),F24
SD -0(R1),F28
FP
op. 2
Int. op/
branch
Clock
ADDD F8,F6,F2
ADDD F16,F14,F2
ADDD F24,F22,F2
SUBI R1,R1,#48
BNEZ R1,LOOP
Unrolled 7 times to avoid delays

7 results in 9 clocks, or 1.3 clocks per iteration
Need more registers to effectively use VLIW
84
1
2
3
4
5
6
7
8
9
Multiple Issue
Limitations With Multiple Issue
Limits to Multi-Issue Machines

Inherent limitations of ILP
1 branch in 5 instructions => how to keep a 5-way VLIW busy?
Latencies of units => many operations must be scheduled
Need about Pipeline Depth x No. Functional Units of independent
operations to keep machines busy.
Difficulties in building HW
Duplicate Functional Units to get parallel execution
Increase ports to Register File (VLIW example needs 6 read and 3
write for Int. Reg. & 6 read and 4 write for Reg.)
Increase ports to memory
Decoding SS and impact on clock rate, pipeline depth
85
Multiple Issue
Limits to Multi-Issue Machines

Limitations specific to either SS or VLIW implementation
Decode issue in SS
VLIW code size: unroll loops + wasted fields in VLIW
VLIW lock step => 1 hazard & all instructions stall
VLIW & binary compatibility
86
Multiple Issue
Multiple Issue Challenges
While Integer/FP split is simple for the HW, get CPI of 0.5 only for
programs with:
Exactly 50% FP operations
No hazards
If more instructions issue at same time, greater difficulty of decode and

issue
Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2
instructions can issue
VLIW: tradeoff instruction space for simple decoding

The long instruction word has room for many operations
By definition, all the operations the compiler puts in the long instruction word are
independent => execute in parallel
E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
Need compiling technique that schedules across several branches
87
Compiler Support For ILP

Prediction
with Multiple Issue
Exploiting ILP
4.7 Studies of ILP
How can compilers be smart?

1. Produce good scheduling of code.
2. Determine which loops might contain
parallelism.
3. Eliminate name dependencies.
Compilers must be REALLY smart to
figure out aliases -- pointers in C are
a real problem.
Techniques lead to:
Symbolic Loop Unrolling
Critical Path Scheduling
88
Software Pipelining
Observation: if iterations from loops are independent, then can get ILP
by taking instructions from different iterations
Software pipelining: reorganizes loops so that each iteration is made
from instructions chosen from different iterations of the original loop
(Tomasulo in SW)
Iteration
0
Iteration
Iteration
1
2
Iteration
3
Iteration
4
Softwarepipelined
iteration
89
SW Pipelining Example
Before: Unrolled 3 times
1 LD
F0,0(R1)
2 ADDD F4,F0,F2
3 SD
0(R1),F4
4 LD
F6,-8(R1)
5 ADDD F8,F6,F2
6 SD
-8(R1),F8
7 LD
F10,-16(R1)
8 ADDD F12,F10,F2
9 SD
-16(R1),F12
10 SUBI R1,R1,#24
11 BNEZ R1,LOOP
SD
IF
ADDD
LD
After: Software Pipelined
1
2
3
4
5
Read F4
ID
IF
EX
ID
IF
LD
ADDD
LD
SD
ADDD
LD
SUBI
BNEZ
SD
ADDD
SD
Read F0
Mem
EX
ID
F0,0(R1)
F4,F0,F2
F0,-8(R1)
0(R1),F4;
Stores M[i]
F4,F0,F2;
Adds to M[i-1]
F0,-16(R1); loads M[i-2]
R1,R1,#8
R1,LOOP
0(R1),F4
F4,F0,F2
-8(R1),F4
WB Write F4
Mem WB
EX
Mem WB
Chap. 4 - Pipelining Write

I F0
90
SW Pipelining Example
Less code space
Overhead paid only once
vs. each iteration in loop unrolling
Software Pipelining
Loop Unrolling
100 iterations = 25 loops with 4 unrolled iterations each
91
Critical Path Scheduling
Trace Scheduling
Parallelism across IF branches vs. LOOP branches

Two steps:
Trace Selection
Find likely sequence of basic blocks (trace)
of (statically predicted or profile predicted)
long sequence of straight-line code
Trace Compaction
Squeeze trace into few VLIW instructions
Need bookkeeping code in case prediction is wrong
Compiler undoes bad guess
(discards values in registers)
Subtle compiler bugs mean wrong answer
vs. poorer performance; no hardware interlocks
92
Hardware Support For

Parallelism
Software support of ILP is best when

code is predictable at compile time.
But what if theres no predictability?

Prediction
Here
well
talk
about
hardware
techniques. These include:

with Multiple Issue
Conditional
Instructions
Hardware Speculation

Exploiting ILP
or
Predicated
4.7 Studies of ILP
93

Parallelism
Nullified Instructions
Tell the Hardware To Ignore An Instruction

Avoid branch prediction by turning branches into
conditionally executed instructions:
IF (x) then A = B op C else NOP
If false, then neither store result nor cause exception
Expanded ISA of Alpha, MIPs, PowerPC, SPARC,
have conditional move. PA-RISC can annul any
following instruction.
IA-64: 64 1-bit condition fields selected so
conditional execution of any instruction
Drawbacks to conditional instructions:
Still takes a clock, even if annulled
Stalls if condition evaluated late
Complex conditions reduce effectiveness; condition
becomes known late in pipeline.
This can be a major win because there is no time lost by
taking a branch!!
A=
B op C
94

Parallelism
Nullified Instructions
Tell the Hardware To Ignore An Instruction

Suppose we have the code:
if ( VarA == 0 )
VarS = VarT;
Previous Method:
LD
R1, VarA
BNEZ
R1, Label
LD
R2, VarT
SD
VarS, R2
Label:
Compare
and Nullify
Next Instr.
If Not Zero
Compare
and Move
IF Zero
Nullified Method:
LD
R1, VarA
LD
R2, VarT
CMPNNZ
R1, #0
SD
VarS, R2
Label:
Nullified Method:
LD
R1, VarA
LD
R2, VarT
CMOVZ
VarS,R2, R1
95

Parallelism
Compiler Speculation
Increasing Parallelism
The theory here is to move an instruction across a branch so as to
increase the size of a basic block and thus to increase parallelism.
Primary difficulty is in avoiding exceptions. For example
if ( a ^= 0 ) c = b/a; may have divide by zero error in some cases.
Methods for increasing speculation include:
1. Use a set of status bits (poison bits) associated with the registers.
Are a signal that the instruction results are invalid until some later
time.
2. Result of instruction isnt written until its certain the instruction is
no longer speculative.
96

Parallelism
Increasing
Parallelism
Example on Page 305.
Code for
if ( A == 0 )
A = B;
else
A = A + 4;
Assume A is at 0(R3) and
B is at 0(R4)
Note here that only ONE
side needs to take a
branch!!
Original Code:
LW
R1, 0(R3)
BNEZ R1, L1
LW
R1, 0(R2)
J
L2
L1: ADDI R1, R1, #4
L2: SW
0(R3), R1
Load A
Test A
If Clause
Skip Else
Else Clause
Store A
Speculated Code:
LW
R1, 0(R3)
LW
R14, 0(R2)
BEQZ R1, L3
ADDI R14, R1, #4
L3: SW
0(R3), R14
Load A
Spec Load B
Other if Branch
Else Clause
Non-Spec Store
97

Parallelism
Poison Bits
In the example on the last
page, if the LW* produces
an exception, a poison bit
is set on that register. The
if a later instruction tries to
use the register, an
exception is THEN raised.
Speculated Code:
LW
R1, 0(R3)
LW*
R14, 0(R2)
BEQZ R1, L3
ADDI R14, R1, #4
L3: SW
0(R3), R14
Load A
Spec Load B
Other if Branch
Else Clause
Non-Spec Store
98

Parallelism
HW support for More ILP

Need HW buffer for results of
uncommitted instructions: reorder buffer
Reorder buffer can be operand source
FP
Once operand commits, result is
Op
found in register
Queue
3 fields: instr. type, destination, value
Use reorder buffer number instead
of reservation station
Discard instructions on mis-predicted Res Stations
branches or on exceptions
FP Adder
Reorder
Buffer
FP Regs
Res Stations
FP Adder
Figure 4.34, page 311
99

Parallelism
HW support for More ILP

How is this used in practice?
Rather than predicting the direction of a branch, execute the
instructions on both side!!
We early on know the target of a branch, long before we know it if will
be taken or not.
So begin fetching/executing at that new Target PC.
But also continue fetching/executing as if the branch NOT taken.
100
Studies of ILP

Prediction
with Multiple Issue
Exploiting ILP

Conflicting studies of amount of

improvement available
Benchmarks (vectorized FP
Fortran vs. integer C programs)
Hardware sophistication
Compiler sophistication
How much ILP is available using
existing mechanisms with increasing
HW budgets?
Do we need to invent new HW/SW
mechanisms to keep on processor
performance curve?
4.7 Studies of ILP
101
Studies of ILP
Limits to ILP
Initial HW Model here; MIPS compilers.
Assumptions for ideal/perfect machine to start:
1. Register renaminginfinite virtual registers and all WAW & WAR
hazards are avoided
2. Branch predictionperfect; no mispredictions
3. Jump predictionall jumps perfectly predicted => machine with
perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysisaddresses are known & a store can
be moved before a load provided addresses not equal
1 cycle latency for all instructions; unlimited number of instructions
issued per clock cycle
102
Studies of ILP
Upper Limit to ILP: Ideal

Machine
This is the amount of parallelism when

there are no branch mis-predictions and
were limited only by data dependencies.
(Figure 4.38, page 319)
Instruction Issues per cycle
IPC
160
FP: 75 - 150
140
120
150.1
118.7
Integer: 18 - 60
100
75.2
80
60
54.8
62.6
40
17.9
20
0
gcc
Instructions that could

theoretically be issued
per cycle.
espresso
li
fpppp
doducd
tomcatv
Programs
103
Studies of ILP
Impact of Realistic Branch

Prediction
What parallelism do we get when we dont allow perfect branch

prediction, as in the last picture, but assume some realistic model?
Possibilities include:
1. Perfect - all branches are perfectly predicted (the last slide)
2. Selective History Predictor - a complicated but do-able mechanism for
selection.
3. Standard 2-bit history predictor with 512 2-bit entries.
4. Static prediction based on past history of the program.
5. None - Parallelism is limited to basic block.
104
Studies of ILP
Bonus!!
Selective History Predictor

8096 x 2 bits
1
0
11
Choose Non-correlator
10
01 Choose Correlator
00
Branch Addr
2
Global
History
Taken/Not Taken
00
01
10
11
2048 x 4 x 2 bits
8K x 2 bit
Selector
11 Taken
10
01 Not Taken
00
105
Impact of Realistic
Branch Prediction
Studies of ILP
Figure 4.42, Page 325
Limiting the type of

branch prediction.
60
61
48
50
IPC
Instruction issues per cycle
60
58
FP: 15 - 45
46 45
46 45 45
41
40
35
Integer: 6 - 12
29
30
20
12
10
19
16
9
6
13 14
10
7
15
0
gcc
espresso
li
fpppp
doducd
tomcatv
Program
Perfect
Perfect
Selective predictor
Standard 2-bit
Static
Chap. 4BHT
- Pipelining
I
(512)
Profile
Selective Hist
None
106
No prediction
More Realistic HW:

Register Impact
Studies of ILP
Effect of limiting the
number of renaming
registers.
60

59
FP: 11 - 45
54
49
Instruction issues per cycle
IPC
50
45
40
44
35
Integer: 5 - 15
30
29
28
20
20
15 15
11 10 10
10
16
13
12 12 12 11
10
9
5
11
6
15
0
gcc
espresso
li
fpppp
doducd
tomcatv
Program
Infinite
Infinite
256
128
64
32
Chap.
4
Pipelining
I
256 128 64
32
None
None
107
Studies of ILP
More Realistic HW:

Alias Impact
What happens when there

may be conflicts with
memory aliasing?

49
50
FP: 4 - 45
(Fortran,
no heap)
49
Instruction issues per cy cle
IPC
45
45
45
40
35
Integer: 4 - 9
30
25
20
16
15
15
12
10
10
5
16
7
4
9
5
0
gcc
espresso
li
fpppp
Program
Global/Stack perf;
Perfect
Inspec.
Perfect
Global/stack Perfect
Inspection
heapChap.
conflicts
4 - Pipelining
I
Assem.
doducd
None
None
tomcatv
108
Summary
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP
109

Pipelining

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Pipelining

Uploaded by

Copyright:

Available Formats

Advanced Computer

Basic Pipeline Scheduling

Dynamic Scheduling with Scoreboarding

Dynamic Scheduling with Register Renaming

WAR and WAW stalls

Dynamic Branch Prediction

Issue Multiple Instructions per Cycle

Compiler Dependence Analysis

Ideal CPI & data stalls

Software pipelining and trace scheduling

Ideal CPI & data stalls

All data & control stalls

Dynamic memory disambiguation

RAW stalls involving memory

ILP is the principle that there are many

Pipeline Scheduling and

Pipeline Scheduling and

Simple Loop and its Assembler Equivalent

for (i=1; i<=1000; i++)

Pipeline Scheduling and

Where are the stalls?

Pipeline Scheduling and

FP Loop Showing Stalls

10 clocks: Rewrite code

Pipeline Scheduling and

Scheduled FP Loop Minimizing Stalls

Swap BNEZ and SD by changing address of SD

Now 6 clocks: Now unroll

Pipeline Scheduling and

15 + 4 x (1+2) +1 = 28 clock cycles, or 7 per iteration

Rewrite loop to minimize stalls.

Pipeline Scheduling and

Unrolled Loop That Minimizes Stalls

What assumptions made when

14 clock cycles, or 3.5 per iteration

Pipeline Scheduling and

Summary of Loop Unrolling Example

Compiler Perspectives on Code Movement

Compiler Perspectives on Code Movement

Where are the data

Compiler Perspectives on Code Movement

Another kind of dependence called name dependence:

Compiler Perspectives on Code Movement

Where are the name

How can we remove these

Compiler Perspectives on Code Movement

Again Name Dependencies are Hard for Memory Accesses

Compiler Perspectives on Code Movement

Final kind of dependence called control dependence

Compiler Perspectives on Code Movement

Two (obvious) constraints on control dependences:

Control dependencies relaxed to get parallelism; get same effect if

Compiler Perspectives on Code Movement

Where are the control

Loop Level Parallelism

When Safe to Unroll Loop?

Loop Level Parallelism

When Safe to Unroll Loop?

for (i=1; i<=100; i=i+1) {

No dependence from S1 to S2. If there were, then there would be a cycle in

Loop Level Parallelism

Now Safe to Unroll Loop? (p. 240)

for (i=1; i<=100; i=i+1) {

A[1] = A[1] + B[1];