
CS151B / EE M116C
Computer Systems Architecture
Winter 2015 Midterm Exam II

Instructor: Prof. Lei He

Solution

Problem 1. (10 points) Explain the terms or answer the short problems below. For example: the program counter (PC) is the register containing the address of the instruction in the program being executed. (Hint: if you do not know how to explain a term precisely, you may use examples.)
(1) Explain the concept of delayed load AND give an example piece of code

The loaded data is available only one clock cycle after the load instruction. For
example, in the code:

lw  $1, 0($2)
add $3, $1, $4

the loaded value in $1 is not available right after the lw instruction. Before the
add instruction can execute, we need to add a nop before it, or stall one clock
cycle, to wait for $1 to become available.
(2) Explain the concept of loop unrolling and why we perform loop unrolling

Unroll the loop body n times and rename the registers so that the copies are
independent, forming one larger loop body. Loop unrolling exposes more
instruction-level parallelism and therefore improves performance; a sketch is
shown below.
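As an illustration (not from the exam), here is a minimal C sketch; the function, array, and unroll factor are made-up examples:

#include <stddef.h>

/* Original loop: one multiply per iteration plus loop overhead. */
void scale(float *a, size_t n, float s)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * s;
}

/* The same loop unrolled 4x (assumes n is a multiple of 4 for brevity).
 * The four statements are independent, so the hardware or the compiler
 * scheduler can overlap them; at the assembly level each copy of the
 * body would get its own registers (the "renaming" mentioned above). */
void scale_unrolled(float *a, size_t n, float s)
{
    for (size_t i = 0; i < n; i += 4) {
        a[i]     = a[i]     * s;
        a[i + 1] = a[i + 1] * s;
        a[i + 2] = a[i + 2] * s;
        a[i + 3] = a[i + 3] * s;
    }
}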
(3) Name three techniques (in either software or hardware) to resolve branch
hazards or reduce the performance loss caused by branch hazards.

Stall until the branch outcome is known; guess (predict) the branch direction;
reduce the branch delay; delayed branch (always execute the instruction after
the branch). A sketch of a simple direction predictor follows.
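As a concrete illustration of the "guess the branch direction" technique, here is a minimal C sketch of a standard 2-bit saturating-counter predictor; the table size and PC indexing are illustrative assumptions, not part of the exam:

#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 1024  /* assumed number of predictor entries */

/* One 2-bit saturating counter per entry:
 * 0 or 1 predicts not taken, 2 or 3 predicts taken. */
static uint8_t counters[TABLE_SIZE];

/* Predict the direction of the branch at this PC. */
bool predict(uint32_t pc)
{
    return counters[(pc >> 2) % TABLE_SIZE] >= 2;
}

/* Train the predictor with the actual branch outcome. */
void update(uint32_t pc, bool taken)
{
    uint8_t *c = &counters[(pc >> 2) % TABLE_SIZE];
    if (taken && *c < 3)
        (*c)++;
    else if (!taken && *c > 0)
        (*c)--;
}

Each outcome strengthens or weakens the branch's counter, so a loop branch that is usually taken is mispredicted only at the loop exit.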
(4) State the conditions to share hardware between different stages of a
multi-cycle implementation.

A piece of hardware can be shared if it is used by the same instruction in
different clock cycles. For instance, PC increment and R-type execution both
make use of the same ALU, but they do so in different cycles.
(5) A single-cycle implementation may be divided into five stages for
pipelining. Compare the average CPI between the single-cycle and ideal
pipelined implementations and explain why pipelining may improve performance.

The average CPI of both the single-cycle and the ideal pipelined
implementation is 1. But the critical path (and hence the clock period) of the
single-cycle implementation is much longer (usually N times longer, where N is
the number of pipeline stages; N = 5 in this problem) than that of the ideal
pipeline. Therefore, pipelining can improve performance.
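For example, assume (purely for illustration) that each of the five stages takes 200 ps. The single-cycle clock must cover all five stages, so each instruction takes 5 × 200 ps = 1000 ps. The ideal pipeline clocks at 200 ps and, with CPI = 1, completes one instruction every 200 ps in steady state, a speedup of about 5.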


Problem 2. (10 points) In this exercise, we examine how data dependences affect execution in the
basic 5-stage pipeline described in the textbook. The problems in this exercise refer to the following
sequence of instructions:
lw $5, -16($5)
sw $5, -16($5)
add $5, $5, $5
Also, assume the following cycle times for each of the options related to forwarding:

Without Forwarding              220ps
With Full Forwarding            240ps
With ALU-ALU Forwarding only    230ps

1) Indicate dependences and their type.

I1: lw $5,-16($5)
I2: sw $5,-16($5)
I3: add $5,$5,$5

RAW on $5 from I1 to I2 and I3


WAR on $5 from I1 and I2 to I3
WAW on $5 from I1 to I3

2) Assume there is no forwarding in this pipelined processor. Indicate hazards and add NOP
instructions to eliminate them.

In the basic five-stage pipeline, WAR and WAW dependences do not cause any hazards. Without
forwarding, any RAW dependence between an instruction and either of the next two instructions
causes a hazard (assuming the register write happens in the first half of the clock cycle and the
register read in the second half). The code that eliminates these hazards by inserting nop
instructions is:

lw  $5, -16($5)
nop                  # delay I2 to avoid the RAW hazard on $5 from I1
nop
sw  $5, -16($5)
add $5, $5, $5       # no RAW hazard on $5 from I1 now

3) Assume there is full forwarding. Indicate hazards and add NOP instructions to eliminate them.

With full forwarding, an ALU instruction can forward a value to the EX stage of the next instruction
without a hazard. However, a load cannot forward to the EX stage of the next instruction (but it can
to the instruction after that). The code that eliminates these hazards by inserting nop instructions is:

lw  $5, -16($5)
nop                  # delay I2 to avoid the RAW hazard on $5 from I1
sw  $5, -16($5)      # value of $5 is forwarded from I1 now
add $5, $5, $5       # no RAW hazard on $5 from I1 now
4) What is the total execution time of this instruction sequence WITHOUT forwarding and WITH full
forwarding? What is the speedup achieved by adding full forwarding to a pipeline that had no
forwarding?

The total execution time is the clock cycle time times the number of cycles. Without any stalls, a
three-instruction sequence executes in 7 cycles (5 to complete the first instruction, then one per
remaining instruction). Execution without forwarding must add a stall cycle for every nop inserted
in part 2, and execution with full forwarding must add a stall cycle for every nop inserted in part 3.
Overall, we get:

Without Forwarding           (7 + 2) × 220ps = 1980ps
With Full Forwarding         (7 + 1) × 240ps = 1920ps
Speedup due to forwarding    1980 / 1920 = 1.03

5) Add NOP instructions to this code to eliminate hazards if there is ALU-ALU forwarding only
(no forwarding from the MEM to the EX stage).

lw  $5, -16($5)
nop
nop
sw  $5, -16($5)      # can't use ALU-ALU forwarding ($5 is loaded from memory in MEM)
add $5, $5, $5

6) What is the total execution time of this instruction sequence with only ALU-ALU forwarding?
What is the speedup over a no-forwarding pipeline?

No forwarding                      1980ps
With ALU-ALU forwarding only       (7 + 2) × 230ps = 2070ps
Speedup with ALU-ALU forwarding    1980 / 2070 = 0.96 (this is really a slowdown)

Problem 3. (10 points): Assume that we have a five-stage machine, the same as the one in the
textbook. For the following code,

(a) sub $2, $5, $4
(b) add $4, $2, $5    # $2 depends on (a)
(c) lw  $2, 100($4)   # $4 depends on (b)
(d) add $5, $2, $4    # $2 depends on (c), $4 depends on (b)

(1) Name all data dependencies

See the annotations next to instructions (b), (c), and (d) above; they mark the RAW
dependencies. (There are also WAR dependencies, and a WAW dependency on $2 between (a)
and (c).)

(2)

Which data hazards can be resolved by renaming? Write down the code after renaming,
with minimal data hazards.

Write-after-write hazards can be resolved by renaming. After renaming, the code looks like
the following; $2 in (c) and (d) is renamed to $6:
(a) sub $2, $5, $4
(b) add $4, $2, $5
(c) lw  $6, 100($4)
(d) add $5, $6, $4

(3)

After renaming, which data hazards can be resolved via forwarding? Illustrate all the
forwarding using 5-stage pipeline figures similar to those in the textbook.
[Figure: 5-stage pipeline diagrams (IM, Reg, ALU, DM, Reg) for instructions (a)-(d),
showing the forwarding of $2 from (a) to (b), $4 from (b) to (c), and $6 from (c) to (d).]

Problem 4. (15 points) Data forwarding


Considering data forwarding for the pipeline below, state how to generate the control signal
for MUX A. That is, use plain English AND logic functions such as EX/MEM.RegisterRd !=
0 to explain when the control signal for MUX A should be 00, 01, and 10, respectively.
I. control signal = 00

No data forwarding.
Neither condition below holds.
II. control signal = 01

Forward result from MEM/WB register:


If ((MEM/WB.RegWrite) &&
(MEM/WB.RegisterRd != 0) &&
(MEM/WB.RegisterRd == ID/EX.RegisterRs))
III. control signal = 10

Forward result from EX/MEM register.


If ((EX/MEM.RegWrite) &&
(EX/MEM.RegisterRd != 0) &&
(EX/MEM.RegisterRd == ID/EX.RegisterRs))
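A hedged C sketch of this select logic (the struct and field names simply mirror the pipeline-register notation above; they are illustrative, not a real simulator API). Note that when both conditions hold, the EX/MEM forward must win, since it is the more recent write of the register:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative pipeline-register fields, named after the notation above. */
struct ExMem { bool RegWrite; uint8_t RegisterRd; };
struct MemWb { bool RegWrite; uint8_t RegisterRd; };
struct IdEx  { uint8_t RegisterRs; };

/* Returns the 2-bit select for MUX A. */
uint8_t forward_a(struct ExMem ex_mem, struct MemWb mem_wb, struct IdEx id_ex)
{
    /* EX/MEM hazard checked first: if both pipeline registers are about
     * to write $rs, the EX/MEM value is the more recent one. */
    if (ex_mem.RegWrite && ex_mem.RegisterRd != 0 &&
        ex_mem.RegisterRd == id_ex.RegisterRs)
        return 2; /* 10: forward the previous ALU result */
    if (mem_wb.RegWrite && mem_wb.RegisterRd != 0 &&
        mem_wb.RegisterRd == id_ex.RegisterRs)
        return 1; /* 01: forward the result from two instructions back */
    return 0;     /* 00: use the value read from the register file */
}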

Problem 5. (15 points) Media applications that play audio or video files are part of a
class of workloads called streaming workloads; i.e., they bring in large amounts of
data but do not reuse much of it. Consider a video streaming workload that accesses a
512 KB working set sequentially with the following address stream:
a. Assume a 64 KB direct-mapped cache with a 32-byte line. What is the miss rate for
the address stream above? How is this miss rate sensitive to the size of the cache or
the working set? How would you categorize the misses this workload is
experiencing, and what causes them?

6.25% miss rate. The miss rate doesn't change with the cache size or the working-set size.
These are compulsory (cold) misses: the data are being brought in from memory for the
first time.

b. Re-compute the miss rate when the cache line (block) size is 16 bytes, 64 bytes, and
128 bytes. What kind of locality is this workload exploiting?
12.5% (1/8), 3.125% (1/32), and 1.5625% (1/64) miss rates for 16-byte, 64-byte, and
128-byte blocks, respectively. The workload is exploiting spatial locality.
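These rates equal (access stride) / (block size). The original address stream is not reproduced above, but a 2-byte stride is consistent with every rate given (e.g., 2/32 = 6.25% for part a). A small C check under that assumption:

#include <stdio.h>

int main(void)
{
    /* For a sequential stream, only the first access to each line misses,
     * so miss rate = stride / block size. The 2-byte stride is inferred
     * from the given answers, not taken from the (omitted) address stream. */
    const double stride = 2.0;
    const int blocks[] = {16, 32, 64, 128};
    for (int i = 0; i < 4; i++)
        printf("%3d-byte block: miss rate = %.4f%%\n",
               blocks[i], 100.0 * stride / blocks[i]);
    return 0;
}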

c. Prefetching is a technique that leverages predictable address patterns to
speculatively bring in additional cache lines when a particular cache line is
accessed. One example of prefetching is a stream buffer, which prefetches
sequentially adjacent cache lines into a separate buffer when a particular cache
line is brought in. If the data is found in the prefetch buffer, it is considered a hit,
moved into the cache, and the next cache line is prefetched. Assume a two-entry
stream buffer, and assume that the cache latency is such that a cache line can be
loaded before the computation on the previous cache line is completed. What is the
miss rate for the address stream above?

With next-line prefetching, the miss rate will be near 0%: after the initial cold miss, each
line is already in the stream buffer by the time it is needed. A toy simulation is sketched below.
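A toy C simulation of this two-entry stream buffer (the 2-byte stride is the same assumption as in part b, and load timing is ignored, as the problem allows):

#include <stdbool.h>
#include <stdio.h>

#define LINE  32                    /* bytes per cache line          */
#define LINES (64 * 1024 / LINE)    /* 64 KB direct-mapped cache     */
#define WSET  (512 * 1024)          /* 512 KB working set, read once */

int main(void)
{
    static long tags[LINES];
    static bool valid[LINES];
    long buf[2] = {-1, -1};         /* two-entry stream buffer       */
    long misses = 0, accesses = 0;

    for (long addr = 0; addr < WSET; addr += 2) {  /* assumed 2-byte stride */
        long line = addr / LINE;
        int  idx  = (int)(line % LINES);
        accesses++;
        if (valid[idx] && tags[idx] == line)
            continue;                   /* cache hit                     */
        if (line != buf[0] && line != buf[1])
            misses++;                   /* true miss (not in the buffer) */
        valid[idx] = true;              /* line moves into the cache     */
        tags[idx]  = line;
        buf[0] = line + 1;              /* prefetch the next             */
        buf[1] = line + 2;              /* two sequential lines          */
    }
    printf("miss rate = %.5f%%\n", 100.0 * misses / accesses);
    return 0;
}

Only the very first access misses; every later line is found in the stream buffer, so the printed miss rate is essentially 0%.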

Cache block size (B) can affect both miss rate and miss latency. Assuming a 1-CPI
machine with an average of 1.35 references (both instruction and data) per
instruction, help find the optimal block size given the following miss rates for
various block sizes.

Size (bytes)   8     16    32    64     128
Miss Rate      4%    3%    3%    1.5%   1%
d. What is the optimal block size for a miss latency of 20 × B cycles?
8-byte: the miss penalty grows linearly with the block size (Size × C), so the smallest
block minimizes stall time.

e. What is the optimal block size for a miss latency of 24 + B cycles?
16-byte.
f. For constant miss latency, what is the optimal block size?
128-byte: the miss penalty no longer depends on B, so the block size with the lowest miss
rate is best. A check of all three cases is sketched below.
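A small C check of all three cases, using the table above. Per-instruction stall cycles are 1.35 × miss rate × miss latency; the constant latency used for case (f) is an arbitrary placeholder, since any constant gives the same winner:

#include <stdio.h>

int main(void)
{
    const int    size[] = {8, 16, 32, 64, 128};
    const double mr[]   = {0.04, 0.03, 0.03, 0.015, 0.01};
    const double refs   = 1.35;  /* memory references per instruction */

    for (int c = 0; c < 3; c++) {
        int best = 0;
        double best_cpi = 1e9;
        for (int i = 0; i < 5; i++) {
            double latency = (c == 0) ? 20.0 * size[i]   /* (d) 20*B cycles */
                           : (c == 1) ? 24.0 + size[i]   /* (e) 24+B cycles */
                           : 100.0;                      /* (f) assumed constant */
            double cpi = 1.0 + refs * mr[i] * latency;   /* base CPI = 1 */
            if (cpi < best_cpi) { best_cpi = cpi; best = size[i]; }
        }
        printf("case (%c): optimal block = %d bytes\n", "def"[c], best);
    }
    return 0;
}

Running it reproduces the three answers: 8 bytes for (d), 16 bytes for (e), and 128 bytes for (f).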
