Architecture
Chapter 4
Advanced Pipelining
Ioannis Papaefstathiou
CS 590.25
Easter 2003
(thanks to Hennessy & Patterson)
Chapter Overview
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP
Chap. 4 - Pipelining I
Chapter Overview
Technique                                  | Reduces                       | Section
Loop Unrolling                             | Control Stalls                | 4.1
Basic Pipeline Scheduling                  | RAW Stalls                    | 4.1
Dynamic Scheduling with Scoreboarding      | RAW stalls                    | 4.2
Dynamic Scheduling with Register Renaming  | WAR and WAW stalls            | 4.2
Dynamic Branch Prediction                  | Control Stalls                | 4.3
Issuing Multiple Instructions per Cycle    | Ideal CPI                     | 4.4
Compiler Dependence Analysis               | Ideal CPI & data stalls       | 4.5
Software Pipelining & Trace Scheduling     | Ideal CPI & data stalls       | 4.5
Speculation                                | All data & control stalls     | 4.6
Dynamic Memory Disambiguation              | RAW stalls involving memory   | 4.2, 4.6
Chap. 4 - Pipelining I
Instruction Level
Parallelism
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP
Chap. 4 - Pipelining I
Instruction Level
Parallelism
Terminology
Basic Block - the set of instructions between entry points and between branches. A basic block has only one entry and one exit; typically it is about 6 instructions long.
Loop Level Parallelism - the parallelism that exists within a loop. Such parallelism can cross loop iterations.
Loop Unrolling - a transformation by which either the compiler or the hardware exploits the parallelism inherent in the loop.
Chap. 4 - Pipelining I
Instruction Level
Parallelism
Loop: LD    F0,0(R1)   ;F0=vector element
      ADDD  F4,F0,F2   ;add scalar from F2
      SD    0(R1),F4   ;store result
      SUBI  R1,R1,8    ;decrement pointer 8 bytes (DW)
      BNEZ  R1,Loop    ;branch R1!=zero
      NOP              ;delayed branch slot
Chap. 4 - Pipelining I
Instruction Level
Parallelism
FP Loop Hazards
Loop: LD    F0,0(R1)   ;F0=vector element
      ADDD  F4,F0,F2   ;add scalar in F2
      SD    0(R1),F4   ;store result
      SUBI  R1,R1,8    ;decrement pointer 8B (DW)
      BNEZ  R1,Loop    ;branch R1!=zero
      NOP              ;delayed branch slot

Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1
Load double                  | Store double             | 0
Integer op                   | Integer op               | 0
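The stall counts implied by this latency table can be checked with a small script. This is a sketch, not part of the lecture: each instruction issues at the later of (previous instruction's cycle + 1) and (producer's cycle + latency + 1); the SUBI-to-BNEZ stall is modeled here as an assumed 1-cycle integer-to-branch latency.

```python
# Sketch: compute issue cycles for the FP loop from the latency table.
# Each entry: (instruction text, producer index or None, producer->consumer latency).
loop = [
    ("LD   F0,0(R1)",    None, 0),
    ("ADDD F4,F0,F2",    0, 1),   # Load double -> FP ALU op: 1 cycle
    ("SD   0(R1),F4",    1, 2),   # FP ALU op -> Store double: 2 cycles
    ("SUBI R1,R1,8",     None, 0),
    ("BNEZ R1,Loop",     3, 1),   # assumed 1-cycle SUBI->branch latency
    ("NOP (delay slot)", None, 0),
]

def schedule(insts):
    """Return the cycle in which each instruction issues (in-order, single issue)."""
    cycles = []
    for _name, prod, lat in insts:
        start = cycles[-1] + 1 if cycles else 1
        if prod is not None:
            start = max(start, cycles[prod] + lat + 1)
        cycles.append(start)
    return cycles

cycles = schedule(loop)
print(cycles)   # [1, 3, 6, 7, 9, 10] -> 10 clock cycles per iteration
```

The gaps in the cycle list (2, 4-5, 8) are exactly the stalls shown on the next slide.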
Instruction Level
Parallelism
FP Loop Showing Stalls:

 1 Loop: LD    F0,0(R1)   ;F0=vector element
 2       stall
 3       ADDD  F4,F0,F2   ;add scalar in F2
 4       stall
 5       stall
 6       SD    0(R1),F4   ;store result
 7       SUBI  R1,R1,8    ;decrement pointer 8 bytes (DW)
 8       stall
 9       BNEZ  R1,Loop    ;branch R1!=zero
10       stall            ;delayed branch slot

Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1
Load double                  | Store double             | 0
Integer op                   | Integer op               | 0

10 clock cycles per iteration.
Instruction Level
Parallelism
Scheduled FP Loop Minimizing Stalls:

 1 Loop: LD    F0,0(R1)
 2       SUBI  R1,R1,8
 3       ADDD  F4,F0,F2
 4       stall
 5       BNEZ  R1,Loop    ;delayed branch
 6       SD    8(R1),F4   ;altered when move past SUBI

The stall is because SD cannot proceed until ADDD's result is ready.

Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1

6 clock cycles per iteration.
Unrolled Loop Showing Stalls:

 1 Loop: LD    F0,0(R1)
 2       stall
 3       ADDD  F4,F0,F2
 4       stall
 5       stall
 6       SD    0(R1),F4
 7       LD    F6,-8(R1)
 8       stall
 9       ADDD  F8,F6,F2
10       stall
11       stall
12       SD    -8(R1),F8
13       LD    F10,-16(R1)
14       stall
15       ADDD  F12,F10,F2
16       stall
17       stall
18       SD    -16(R1),F12
19       LD    F14,-24(R1)
20       stall
21       ADDD  F16,F14,F2
22       stall
23       stall
24       SD    -24(R1),F16
25       SUBI  R1,R1,#32
26       BNEZ  R1,LOOP
27       stall
28       NOP

28 clock cycles for four iterations, or 7 per iteration.
Instruction Level
Parallelism
 1 Loop: LD    F0,0(R1)
 2       LD    F6,-8(R1)
 3       LD    F10,-16(R1)
 4       LD    F14,-24(R1)
 5       ADDD  F4,F0,F2
 6       ADDD  F8,F6,F2
 7       ADDD  F12,F10,F2
 8       ADDD  F16,F14,F2
 9       SD    0(R1),F4
10       SD    -8(R1),F8
11       SD    -16(R1),F12
12       SUBI  R1,R1,#32
13       BNEZ  R1,LOOP
14       SD    8(R1),F16   ; 8-32 = -24

No Stalls!! 14 clock cycles for four iterations (3.5 per iteration).
Instruction Level
Parallelism
To produce the unrolled, scheduled loop, the compiler had to:
1. Determine that it was legal to move the SD after the SUBI and BNEZ, and find the amount to adjust the SD offset.
2. Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for the loop maintenance code.
3. Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations.
4. Eliminate the extra tests and branches and adjust the loop maintenance code.
5. Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This requires analyzing the memory addresses and finding that they do not refer to the same address.
6. Schedule the code, preserving any dependences needed to yield the same result as the original code.
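The renaming-and-offset bookkeeping in these steps can be done mechanically. Below is a hypothetical generator for the running LD/ADDD/SD example; the register-numbering rule is chosen only to reproduce the slides' F0/F4, F6/F8, F10/F12, F14/F16 assignment and is an assumption, not a general register allocator.

```python
def unroll(n, step=8):
    """Unroll the LD/ADDD/SD loop n times: rename registers, adjust offsets,
    emit the loop-maintenance code once, and move the last SD past SUBI/BNEZ
    (its offset is rebased by +step*n)."""
    load = lambda k: "F0" if k == 0 else f"F{4 * k + 2}"   # F0, F6, F10, F14, ...
    res = lambda k: f"F{4 * (k + 1)}"                      # F4, F8, F12, F16, ...
    code = [f"LD   {load(k)},{-step * k}(R1)" for k in range(n)]
    code += [f"ADDD {res(k)},{load(k)},F2" for k in range(n)]
    code += [f"SD   {-step * k}(R1),{res(k)}" for k in range(n - 1)]
    code.append(f"SUBI R1,R1,#{step * n}")
    code.append("BNEZ R1,LOOP")
    # last store fills the delay slot; offset -step*(n-1) + step*n = +step
    code.append(f"SD   {step}(R1),{res(n - 1)}")
    return code

for line in unroll(4):
    print(line)
```

With `unroll(4)` this reproduces the 14-instruction loop of the previous slide, including `SUBI R1,R1,#32` and the rebased final store `SD 8(R1),F16`.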
Chap. 4 - Pipelining I
12
Instruction Level
Parallelism
Dependencies
Chap. 4 - Pipelining I
13
Instruction Level
Parallelism
Data Dependencies
1 Loop: LD    F0,0(R1)
2       ADDD  F4,F0,F2
3       SUBI  R1,R1,8
4       BNEZ  R1,Loop    ;delayed branch
5       SD    8(R1),F4   ;altered when move past SUBI
Chap. 4 - Pipelining I
14
Instruction Level
Parallelism
Name Dependencies
Chap. 4 - Pipelining I
15
Instruction Level
Parallelism
Name Dependencies
Loop: LD    F0,0(R1)
      ADDD  F4,F0,F2
      SD    0(R1),F4
      LD    F0,-8(R1)
      ADDD  F4,F0,F2
      SD    -8(R1),F4
      LD    F0,-16(R1)
      ADDD  F4,F0,F2
      SD    -16(R1),F4
      LD    F0,-24(R1)
      ADDD  F4,F0,F2
      SD    -24(R1),F4
      SUBI  R1,R1,#32
      BNEZ  R1,LOOP
      NOP

This unrolled loop reuses F0 and F4 in every copy of the body: these are name dependences, removable by register renaming.
16
Instruction Level
Parallelism
Name Dependencies
Chap. 4 - Pipelining I
18
Instruction Level
Parallelism
Control Dependencies
Chap. 4 - Pipelining I
19
Instruction Level
Parallelism
Control Dependencies
Chap. 4 - Pipelining I
20
Instruction Level
Parallelism
Control Dependencies
LD    F0,0(R1)
ADDD  F4,F0,F2
SD    0(R1),F4
SUBI  R1,R1,8
BEQZ  R1,exit
LD    F0,0(R1)
ADDD  F4,F0,F2
SD    0(R1),F4
SUBI  R1,R1,8
BEQZ  R1,exit
LD    F0,0(R1)
ADDD  F4,F0,F2
SD    0(R1),F4
SUBI  R1,R1,8
BEQZ  R1,exit
Chap. 4 - Pipelining I
21
Instruction Level
Parallelism
Chap. 4 - Pipelining I
22
Instruction Level
Parallelism
Chap. 4 - Pipelining I
23
Instruction Level
Parallelism
Chap. 4 - Pipelining I
No circular dependencies; the loop carries a dependence on B.
24
Dynamic Scheduling
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP
Chap. 4 - Pipelining I
25
Dynamic Scheduling
The idea:
Chap. 4 - Pipelining I
26
Dynamic Scheduling
The idea:
Chap. 4 - Pipelining I
27
Dynamic Scheduling
Using A Scoreboard
Scoreboard Implications
Chap. 4 - Pipelining I
28
Dynamic Scheduling
Using A Scoreboard
Chap. 4 - Pipelining I
29
Dynamic Scheduling
Using A Scoreboard
Chap. 4 - Pipelining I
30
Dynamic Scheduling
Using A Scoreboard
Chap. 4 - Pipelining I
31
Using A Scoreboard
Dynamic Scheduling
Instruction status: which of the steps the instruction is in.
Chap. 4 - Pipelining I
32
Dynamic Scheduling
Using A Scoreboard
Instruction status   | Wait until                           | Bookkeeping
Issue                |                                      |
Read operands        | Rj and Rk                            | Rj <- No; Rk <- No
Execution complete   | Functional unit done                 |
Write result         | ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) & |
                     |     (Fk(f) ≠ Fi(FU) or Rk(f) = No))  |
Chap. 4 - Pipelining I
33
Dynamic Scheduling
Using A Scoreboard
Scoreboard Example
This is the sample code we'll be working with in the example:

LD    F6, 34(R2)
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2
Chap. 4 - Pipelining I
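The cycle-by-cycle snapshots that follow can be reproduced with a simplified model. The sketch below is an assumption-laden reduction of the scoreboard rules (in-order single issue, one per cycle; a unit frees the cycle after it writes; reads wait on RAW hazards; writes wait on WAR hazards; the WAW issue stall is omitted because it does not arise in this example). The 1/10/2/40-cycle latencies for load/multiply/add-subtract/divide match the example.

```python
# (opcode, functional unit, dest, sources, execution latency in cycles)
PROGRAM = [
    ("LD",    "Integer", "F6",  [],           1),
    ("LD",    "Integer", "F2",  [],           1),
    ("MULTD", "Mult1",   "F0",  ["F2", "F4"], 10),
    ("SUBD",  "Add",     "F8",  ["F6", "F2"], 2),
    ("DIVD",  "Divide",  "F10", ["F0", "F6"], 40),
    ("ADDD",  "Add",     "F6",  ["F8", "F2"], 2),
]

def scoreboard(program):
    """Return (issue, read operands, execution complete, write result) per instruction."""
    times = []            # filled in program order; later rows only look back
    unit_free = {}        # unit -> first cycle it can accept a new instruction
    reg_write = {}        # reg  -> write cycle of its most recent producer
    for _op, unit, dest, srcs, lat in program:
        # issue: in order, one per cycle, functional unit must be free
        issue = max(times[-1][0] + 1 if times else 1, unit_free.get(unit, 1))
        # read operands: wait for every pending producer (RAW hazard)
        read = max([issue + 1] + [reg_write[r] + 1 for r in srcs if r in reg_write])
        done = read + lat
        # write result: wait until earlier readers of dest have read (WAR hazard)
        write = done + 1
        for (_o2, _u2, _d2, srcs2, _l2), t2 in zip(program, times):
            if dest in srcs2:
                write = max(write, t2[1] + 1)
        times.append((issue, read, done, write))
        unit_free[unit] = write + 1
        reg_write[dest] = write
    return times

for (op, *_), t in zip(PROGRAM, scoreboard(PROGRAM)):
    print(op, t)
```

The computed timeline matches the slides: ADDD completes execution at cycle 16 but cannot write F6 until cycle 22, after DIVD reads its operands at cycle 21.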
34
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example
Instruction status:

Instruction   dest  j    k     Issue  Read operands  Execution complete  Write result
LD            F6    34+  R2
LD            F2    45+  R3
MULTD         F0    F2   F4
SUBD          F8    F6   F2
DIVD          F10   F0   F6
ADDD          F6    F8   F2

Functional unit status:

Time  Name     Busy  Op  Fi(dest)  Fj(S1)  Fk(S2)  Rj  Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status:

Clock  F0  F2  F4  F6  F8  F10  F12  ...  F30
FU
35
Using A Scoreboard
Dynamic Scheduling
Clock 1: LD #1 issues on the Integer unit (Op=Load, Fi=F6, Fk=R2, Rk=Yes). The numbers in the instruction-status table show the cycle in which each step occurred. Register result status: F6 is owned by Integer.
36
Using A Scoreboard
Dynamic Scheduling
Clock 2: LD #1 reads its operands. LD #2 cannot issue yet because the Integer unit is busy (structural hazard).
37
Using A Scoreboard
Dynamic Scheduling
Clock 3: LD #1 completes execution.
38
Using A Scoreboard
Dynamic Scheduling
Clock 4: LD #1 writes its result to F6, freeing the Integer unit.
39
Using A Scoreboard
Dynamic Scheduling
Clock 5: LD #2 issues on the Integer unit (Fi=F2, Fk=R3). Register result status: F2 is owned by Integer.
40
Using A Scoreboard
Dynamic Scheduling
Clock 6: LD #2 reads its operands; MULT issues on Mult1 (Fi=F0, Fj=F2, Fk=F4). MULT's F2 operand is not ready yet (Rj=No), so it waits on the Integer unit.
41
Using A Scoreboard
Dynamic Scheduling
Clock 7: LD #2 completes execution; SUBD issues on the Add unit (Fi=F8, Fj=F6, Fk=F2). SUBD's F2 operand is not ready either.
42
Using A Scoreboard
Dynamic Scheduling
Clock 8: DIVD issues on the Divide unit (Fi=F10, Fj=F0, Fk=F6). MULT and SUBD are both waiting for F2.
43
Using A Scoreboard
Dynamic Scheduling
Clock 8, write stage: LD #2 writes F2, freeing the Integer unit. MULT and SUBD can read their operands next cycle.
44
Using A Scoreboard
Dynamic Scheduling
Clock 9: MULT and SUBD both read their operands.
45
Using A Scoreboard
Dynamic Scheduling
Clock 11: SUBD completes execution (2 cycles); MULT is still executing. ADDD cannot issue while the Add unit is busy.
46
Using A Scoreboard
Dynamic Scheduling
Clock 12: SUBD finishes: it writes F8 and frees the Add unit. DIVD is still waiting for F0 (produced by MULT).
47
Using A Scoreboard
Dynamic Scheduling
Clock 13: ADDD issues on the Add unit (Fi=F6, Fj=F8, Fk=F2).
48
Using A Scoreboard
Dynamic Scheduling
Clock 14: ADDD reads its operands.
49
Using A Scoreboard
Dynamic Scheduling
Clock 15: ADDD is executing; MULT is still executing.
50
Using A Scoreboard
Dynamic Scheduling
Clock 16: ADDD completes execution.
51
Using A Scoreboard
Dynamic Scheduling
Clock 17: ADDD cannot write its result: DIVD has not yet read its F6 operand, so writing F6 now would be a WAR hazard.
52
Using A Scoreboard
Dynamic Scheduling
Clock 18: Nothing Happens!! ADDD is still blocked by the WAR hazard, and MULT is still executing.
53
Using A Scoreboard
Dynamic Scheduling
Clock 19: MULT completes execution (10 cycles).
54
Using A Scoreboard
Dynamic Scheduling
Clock 20: MULT writes F0, freeing Mult1.
55
Using A Scoreboard
Dynamic Scheduling
Clock 21: DIVD finally reads its operands, now that F0 has been written.
56
Dynamic Scheduling
Using A Scoreboard
Clock 22: ADDD writes F6 (the WAR hazard cleared when DIVD read its operands); DIVD is executing.
57
Dynamic Scheduling
Using A Scoreboard
Clock 61: DIVD completes execution (40 cycles).
58
Using A Scoreboard
Dynamic Scheduling
Clock 62: DIVD writes F10. DONE!!
59
Dynamic Scheduling
Using A Scoreboard
Why study scoreboarding? Its descendants led to the Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, ...
Chap. 4 - Pipelining I
60
Dynamic Scheduling
Using A Scoreboard
Chap. 4 - Pipelining I
61
Dynamic Scheduling
Using A Scoreboard
Tomasulo Organization
(Figure) The FP Op Queue feeds the FP Registers; a Load Buffer and Store Buffer handle memory operands; FP Add and FP Mul reservation stations hold waiting operations; a Common Data Bus broadcasts results to every unit that is waiting for them.
Chap. 4 - Pipelining I
62
Dynamic Scheduling
Using A Scoreboard
Chap. 4 - Pipelining I
63
Dynamic Scheduling
Using A Scoreboard
Chap. 4 - Pipelining I
64
Using A Scoreboard
Dynamic Scheduling
(Initial Tomasulo state) Reservation-station fields: S1 (Vj), S2 (Vk), RS for j (Qj), RS for k (Qk), plus Execution complete and Write Result columns. Load buffers Load1, Load2, Load3: Busy = No, no Address. Register result status (F0, F2, F4, F6, F8, ... F30): FU row empty.
Chap. 4 - Pipelining I
65
Dynamic Scheduling
Using A Scoreboard
Review: Tomasulo
Chap. 4 - Pipelining I
66
Dynamic Hardware
Prediction
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
Chap. 4 - Pipelining I
67
Dynamic Hardware
Prediction
(Figure) Branch prediction buffer: bits 13-2 of the branch address index a 1024-entry table (entries 0 to 1023) of prediction bits.
Dynamic Hardware
Prediction
(Figure) 2-bit predictor state machine: two Predict Taken states and two Predict Not Taken states. A taken branch (T) moves toward Predict Taken and a not-taken branch (NT) moves toward Predict Not Taken, so the prediction flips only after two consecutive mispredictions.
69
Dynamic Hardware
Prediction
BHT Accuracy
Mispredict because either:
- Wrong guess for that branch
- Got the branch history of the wrong branch when indexing the table
With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%.
A 4096-entry table is about as good as an infinite table, but 4096 entries is a lot of HW.
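Each BHT entry is just a 2-bit saturating counter. A minimal sketch (the 12-bit index, word-aligned PCs, and the weakly-not-taken initial state are assumptions):

```python
class BranchHistoryTable:
    """4096-entry table of 2-bit saturating counters (states 0-3;
    2 and 3 predict taken, 0 and 1 predict not taken)."""
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)   # start weakly not-taken

    def _index(self, pc):
        return (pc >> 2) & self.mask           # word-aligned: drop low 2 bits

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# A loop branch taken 9 times then not taken: once warmed up, the 2-bit
# counter mispredicts only the final not-taken outcome of each visit.
bht = BranchHistoryTable()
misses = 0
for _ in range(10):                  # 10 visits to the loop
    for taken in [True] * 9 + [False]:
        if bht.predict(0x400100) != taken:
            misses += 1
        bht.update(0x400100, taken)
print(misses)   # 11: two warm-up misses on the first visit, one per visit after
```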
Chap. 4 - Pipelining I
70
Dynamic Hardware
Prediction
Correlating Branches
Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the next branch (as well as to that branch's own history).
(Figure) The branch address selects a row of 2-bit per-branch predictors; the recent global branch history selects which predictor in the row supplies the prediction.
Chap. 4 - Pipelining I
71
Dynamic Hardware
Prediction
Frequency of Mispredictions (Figure 4.21, p. 272): misprediction rates across the SPEC89 benchmarks (nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, li) range from 0% up to 18%, with the integer programs mispredicting far more often than the FP programs.
72
Dynamic Hardware
Prediction
Branch Target Buffer (BTB): use the address of the branch as an index to get the prediction AND the branch target address (if taken).
Note: must check for a branch match now, since we can't use the wrong branch's address (Figure 4.22, p. 273).
(Figure) Each BTB entry holds the predicted PC and the branch prediction: taken or not taken.
73
Dynamic Hardware
Prediction
Example

Instruction in Buffer | Actual Branch | Penalty Cycles
Yes                   | Taken         | 0
Yes                   | Not taken     | 2
No                    | Taken         | 2
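The expected branch penalty follows from the 2-cycle entries in the table. A sketch with assumed rates (90% of taken branches found in the buffer, 90% prediction accuracy, 60% of branches taken; these numbers are illustrative, not from the slide):

```python
# Expected penalty per branch, from the 2-cycle penalties in the table:
#  - in buffer, predicted taken, actually not taken: 2 cycles
#  - not in buffer but actually taken:               2 cycles
hit_rate = 0.90      # assumed: fraction of taken branches found in the BTB
accuracy = 0.90      # assumed: prediction accuracy on buffer hits
taken    = 0.60      # assumed: fraction of branches that are taken

penalty = hit_rate * (1 - accuracy) * 2 + (1 - hit_rate) * taken * 2
print(round(penalty, 2))   # 0.3 cycles of branch penalty per branch
```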
Chap. 4 - Pipelining I
74
Multiple Issue
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP
Flavor I:
Superscalar processors issue a varying number of instructions per clock; they can be either statically scheduled (by the compiler) or dynamically scheduled (by the hardware).
A superscalar issues a varying number of instructions/cycle (1 to 8), scheduled by the compiler or by HW (Tomasulo).
Examples: IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000.
Chap. 4 - Pipelining I
75
Multiple Issue
Issuing Multiple Instructions/Cycle
Flavor II:
VLIW - Very Long Instruction Word - issues a fixed number of
instructions formatted either as one very large instruction or as a
fixed packet of smaller instructions.
fixed number of instructions (4-16) scheduled by the compiler; put
operators into wide templates
Joint HP/Intel agreement in 1999/2000
Intel Architecture-64 (IA-64) 64-bit address
Style: Explicitly Parallel Instruction Computer (EPIC)
Chap. 4 - Pipelining I
76
Multiple Issue
Issuing Multiple Instructions/Cycle
Flavor II - continued:
Chap. 4 - Pipelining I
77
Multiple Issue
Type              Pipe Stages
Int. instruction  IF ID EX MEM WB
FP instruction    IF ID EX MEM WB
Int. instruction     IF ID EX MEM WB
FP instruction       IF ID EX MEM WB
Int. instruction        IF ID EX MEM WB
FP instruction          IF ID EX MEM WB

A 1-cycle load delay causes a delay to 3 instructions in a superscalar: the instruction in the right half of the same slot can't use the result, nor can the instructions in the next slot.
Chap. 4 - Pipelining I
78
Multiple Issue
Loop: LD    F0,0(R1)
      LD    F6,-8(R1)
      LD    F10,-16(R1)
      LD    F14,-24(R1)
      ADDD  F4,F0,F2
      ADDD  F8,F6,F2
      ADDD  F12,F10,F2
      ADDD  F16,F14,F2
      SD    0(R1),F4
      SD    -8(R1),F8
      SD    -16(R1),F12
      SUBI  R1,R1,#32
      BNEZ  R1,LOOP
      SD    8(R1),F16   ; 8-32 = -24

Latencies: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles.
79
Multiple Issue
Chap. 4 - Pipelining I
80
Multiple Issue
Chap. 4 - Pipelining I
81
Multiple Issue
Chap. 4 - Pipelining I
82
Multiple Issue
Chap. 4 - Pipelining I
83
VLIW
Multiple Issue
(VLIW schedule, loop unrolled 7 times; empty slots left blank)

Memory reference 1 | Memory reference 2 | FP operation 1    | FP operation 2    | Int. op/branch   | Clock
LD F0,0(R1)        | LD F6,-8(R1)       |                   |                   |                  | 1
LD F10,-16(R1)     | LD F14,-24(R1)     |                   |                   |                  | 2
LD F18,-32(R1)     | LD F22,-40(R1)     | ADDD F4,F0,F2     | ADDD F8,F6,F2     |                  | 3
LD F26,-48(R1)     |                    | ADDD F12,F10,F2   | ADDD F16,F14,F2   |                  | 4
                   |                    | ADDD F20,F18,F2   | ADDD F24,F22,F2   |                  | 5
SD 0(R1),F4        | SD -8(R1),F8       | ADDD F28,F26,F2   |                   |                  | 6
SD -16(R1),F12     | SD -24(R1),F16     |                   |                   |                  | 7
SD -32(R1),F20     | SD -40(R1),F24     |                   |                   | SUBI R1,R1,#48   | 8
SD -0(R1),F28      |                    |                   |                   | BNEZ R1,LOOP     | 9
Multiple Issue
Chap. 4 - Pipelining I
85
Multiple Issue
Chap. 4 - Pipelining I
86
Multiple Issue
While the Integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with:
Exactly 50% FP operations
No hazards
Chap. 4 - Pipelining I
87
Chap. 4 - Pipelining I
88
Software Pipelining
Observation: if iterations from loops are independent, then can get ILP
by taking instructions from different iterations
Software pipelining: reorganizes loops so that each iteration is made
from instructions chosen from different iterations of the original loop
(Tomasulo in SW)
(Figure) Instructions chosen from iterations 0 through 4 of the original loop overlap to form one software-pipelined iteration.
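The same reorganization can be shown on a toy array loop. This is a sketch, assuming the loop body is `a[i] += scalar`: each kernel pass stores the result of iteration i, computes iteration i+1, and loads iteration i+2, mirroring the SD/ADDD/LD kernel on the next slide.

```python
def software_pipelined_add(a, scalar):
    """Compute [x + scalar for x in a] with a software-pipelined schedule:
    a prologue fills the pipe, the kernel mixes three original iterations,
    and an epilogue drains it."""
    n = len(a)
    if n < 2:
        return [x + scalar for x in a]
    out = [None] * n
    loaded = a[0]               # prologue: LD for iteration 0
    acc = loaded + scalar       # prologue: ADDD for iteration 0
    loaded = a[1]               # prologue: LD for iteration 1
    for i in range(n - 2):
        out[i] = acc            # SD:   store result of iteration i
        acc = loaded + scalar   # ADDD: compute iteration i+1
        loaded = a[i + 2]       # LD:   load iteration i+2
    out[n - 2] = acc            # epilogue: drain the last two iterations
    out[n - 1] = loaded + scalar
    return out

print(software_pipelined_add([1, 2, 3, 4, 5], 10))   # [11, 12, 13, 14, 15]
```

Inside the kernel no statement depends on the one before it, which is exactly what gives the hardware independent instructions to overlap.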
Chap. 4 - Pipelining I
89
SW Pipelining Example
Before: Unrolled 3 times
 1      LD    F0,0(R1)
 2      ADDD  F4,F0,F2
 3      SD    0(R1),F4
 4      LD    F6,-8(R1)
 5      ADDD  F8,F6,F2
 6      SD    -8(R1),F8
 7      LD    F10,-16(R1)
 8      ADDD  F12,F10,F2
 9      SD    -16(R1),F12
10      SUBI  R1,R1,#24
11      BNEZ  R1,LOOP

After: Software Pipelined
        SD    0(R1),F4     ; Stores M[i]
        ADDD  F4,F0,F2     ; Adds to M[i-1]
        LD    F0,-16(R1)   ; loads M[i-2]
        SUBI  R1,R1,#8
        BNEZ  R1,LOOP

(Pipeline overlap) LD goes IF ID EX Mem WB and writes F0; ADDD follows, reads F0, and writes F4; SD follows and reads F4 in its Mem stage - each of the three comes from a different original iteration.
SW Pipelining Example
Symbolic Loop Unrolling
- Less code space than loop unrolling
- Loop overhead paid only once, vs. once per copy in loop unrolling
Chap. 4 - Pipelining I
91
Trace Scheduling
Chap. 4 - Pipelining I
92
Here we'll talk about hardware techniques. These include:
- Conditional or Predicated Instructions
- Hardware Speculation
Chap. 4 - Pipelining I
93
Nullified Instructions
Chap. 4 - Pipelining I
A = B op C
94
Nullified Instructions
Compare and Nullify Next Instr. If Not Zero:
        LD      R1, VarA
        LD      R2, VarT
        CMPNNZ  R1, #0
        SD      VarS, R2
Label:

Compare and Move If Zero (conditional-move method):
        LD      R1, VarA
        LD      R2, VarT
        CMOVZ   VarS, R2, R1
Chap. 4 - Pipelining I
95
Compiler Speculation
Increasing Parallelism
The theory here is to move an instruction across a branch so as to increase the size of a basic block and thus increase parallelism. The primary difficulty is in avoiding exceptions. For example,
if ( a != 0 ) c = b/a; may raise a divide-by-zero error in some cases.
Methods for increasing speculation include:
1. Use a set of status bits (poison bits) associated with the registers. They signal that an instruction's result is invalid until some later time.
2. The result of an instruction isn't written until it's certain that the instruction is no longer speculative.
Chap. 4 - Pipelining I
96
Increasing
Parallelism
Example on Page 305. Code for:

if ( A == 0 )
    A = B;
else
    A = A + 4;

Assume A is at 0(R3) and B is at 0(R2). Note here that only ONE side needs to take a branch!!
Compiler Speculation
Original Code:
        LW    R1, 0(R3)     ; Load A
        BNEZ  R1, L1        ; Test A
        LW    R1, 0(R2)     ; If Clause
        J     L2            ; Skip Else
L1:     ADDI  R1, R1, #4    ; Else Clause
L2:     SW    0(R3), R1     ; Store A

Speculated Code:
        LW    R1, 0(R3)     ; Load A
        LW    R14, 0(R2)    ; Spec Load B
        BEQZ  R1, L3        ; Other if Branch
        ADDI  R14, R1, #4   ; Else Clause
L3:     SW    0(R3), R14    ; Non-Spec Store
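The transformation can be checked for equivalence in a high-level sketch (function names here are hypothetical; the speculative load corresponds to unconditionally executing `t = B`):

```python
def original(a, b):
    # if (A == 0) A = B; else A = A + 4;
    if a == 0:
        a = b          # If Clause
    else:
        a = a + 4      # Else Clause
    return a

def speculated(a, b):
    t = b              # speculative load of B: executed on every path
    if a != 0:
        t = a + 4      # else clause overwrites the speculative value
    return t           # non-speculative store

for a in (0, 1, -3):
    assert original(a, 99) == speculated(a, 99)
print("equivalent")
```

The else-clause value simply overwrites the speculatively loaded one, which is why only one side of the original if/else still needs a branch.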
Chap. 4 - Pipelining I
97
Compiler Speculation
Poison Bits
In the example on the last page, if the LW* produces an exception, a poison bit is set on that register. Then, if a later instruction tries to use the register, an exception is raised at that point.

Speculated Code:
        LW    R1, 0(R3)     ; Load A
        LW*   R14, 0(R2)    ; Spec Load B
        BEQZ  R1, L3        ; Other if Branch
        ADDI  R14, R1, #4   ; Else Clause
L3:     SW    0(R3), R14    ; Non-Spec Store
98
Hardware Speculation
(Figure) Hardware speculation datapath: Reorder Buffer, FP Regs, Res Stations, FP Adder.
Chap. 4 - Pipelining I
99
Hardware Speculation
Chap. 4 - Pipelining I
100
Studies of ILP
4.1 Instruction Level Parallelism:
Concepts and Challenges
Chap. 4 - Pipelining I
101
Studies of ILP
Limits to ILP
Initial HW Model here; MIPS compilers.
Assumptions for ideal/perfect machine to start:
1. Register renaming: infinite virtual registers, and all WAW & WAR hazards are avoided
2. Branch prediction: perfect; no mispredictions
3. Jump prediction: all jumps perfectly predicted => machine with perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis: addresses are known & a store can be moved before a load provided the addresses are not equal
Also: 1-cycle latency for all instructions; unlimited number of instructions issued per clock cycle
Chap. 4 - Pipelining I
102
Studies of ILP
(Chart) IPC under the ideal model. FP: 75-150; Integer: 18-60. Measured values: gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doducd 118.7, tomcatv 150.1.
Chap. 4 - Pipelining I
103
Studies of ILP
Chap. 4 - Pipelining I
104
Studies of ILP
Bonus!! (Figure) Tournament predictor: the branch address plus 2 bits of global history index a 2048 x 4 x 2-bit array of 2-bit counters (11 Taken, 10, 01, 00 Not Taken), while an 8K x 2-bit selector chooses between the correlator and the non-correlator (11/10 choose the non-correlator, 01/00 choose the correlator); the selector is trained by each branch's taken/not-taken outcome.
Chap. 4 - Pipelining I
105
Impact of Realistic
Branch Prediction
Studies of ILP
(Chart) IPC as the branch predictor is weakened, for gcc, espresso, li, fpppp, doducd, tomcatv: Perfect prediction, Selective predictor, Standard 2-bit BHT (512 entries), Static profile-based, and No prediction. FP: 15-45; Integer: 6-12.
Studies of ILP
Effect of limiting the number of renaming registers.
(Chart) IPC with Infinite, 256, 128, 64, 32, and no renaming registers, for gcc, espresso, li, fpppp, doducd, tomcatv. FP: 11-45; Integer: 5-15.
107
Studies of ILP
(Chart) IPC under different memory-alias analyses, for gcc, espresso, li, fpppp, doducd, tomcatv: Perfect, Global/stack perfect (heap conflicts assumed), Inspection, and None. FP: 4-45 (Fortran, no heap); Integer: 4-9.
108
Summary
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP
Chap. 4 - Pipelining I
109