<transcript>
<text start="18.57" dur="8.9">Hello, and welcome to today's lecture on
control hazards. Before we do that, let us have</text>
<text start="27.47" dur="7.71">a quick recap of the various types of dependences
and hazards which we have discussed in the last</text>
<text start="35.18" dur="6.87">couple of lectures. We have seen that the
dependences can be broadly divided into 2</text>
<text start="42.05" dur="9.509">categories data dependences and control depend
ences.
And the data dependences again can be divided</text>
<text start="51.559" dur="9.07">into 2 broad categories. The first one is true
data dependences, which lead to read after write</text>
<text start="60.629" dur="6.111">type of hazards. And we have discussed various
techniques by which you can overcome these</text>
<text start="66.74" dur="6.409">true data dependences by using hardware and
software means. We have seen we can use forwarding.</text>
<text start="73.149" dur="5.711">You can use instruction scheduling:
static instruction scheduling, and dynamic instruction</text>
<text start="78.86" dur="6.04">scheduling by hardware. And by that you can
overcome read after write type of hazards</text>
<text start="84.9" dur="7.38">arising out of true data dependences.
Similarly, we have discussed name dependences</text>
<text start="92.28" dur="5.82">having 2 different verities; one is known
as output dependences. And second one is known</text>
<text start="98.1" dur="7.53">as anti dependences and output dependences
lead to read after write type of hazards.</text>
<text start="105.63" dur="6.70999999999999">And anti dependences lead to write after read
type of hazards, and these 2 types of hazards</text>
<text start="112.34" dur="6.7">can be overcome by using register renaming.
And we have seen how register renaming can</text>
<text start="119.04" dur="7.75">be done by the compiler or by the hardware,
as it is done in Tomasulo's algorithm.</text>
<text start="126.79" dur="11.21">So, this is how the data dependences are tackled,
and hazards arising out of data dependences</text>
<text start="138" dur="6.269">can be overcome by different techniques. So
far we have concentrated on data dependences</text>
<text start="144.269" dur="5.56">and on overcoming the hazards arising out of
data dependences. Now, we shall focus on control</text>
<text start="149.829" dur="9.41">dependences. We have seen that control dependences
lead to control hazards. In simple terms,</text>
<text start="159.239" dur="6.53">we can discuss control dependences, we
can describe control dependences in this</text>
<text start="165.769" dur="5.56">way: control hazards occur due to instructions
changing the program counter. We have seen</text>
<text start="171.329" dur="7.671">the program counter keeps track of the instruction
to be executed next; that is, the program counter</text>
<text start="179" dur="13.51">holds the address of the next instruction. And
particularly when there are branches,</text>
<text start="192.51" dur="5.27">this program counter value may not be known
immediately. And it has been found that control</text>
<text start="197.78" dur="4.76">hazards cause a greater performance loss than
do data hazards.</text>
<text start="202.54" dur="10.4">So, data hazards sometimes lead to some losses
store will
be modified only then the permanent change</text>
<text start="578.92" dur="8.51999999999988">in state is done. So, you can see that before
that happens, both the branch condition and the</text>
<text start="587.44" dur="7.49000000000012">new PC must be known. So, what you have to
do is: all these 3 instructions are to be nullified</text>
<text start="594.93" dur="7.80999999999988">by converting them into no-operation instructions.
And obviously, there will be no change, but</text>
<text start="602.74" dur="6.59">we shall be losing three cycles. And the
instruction fetch will take place only after</text>
<text start="609.33" dur="5.93">the branch is resolved. I mean, only then will it be
known whether it will execute this or it will</text>
<text start="615.26" dur="7.69">jump to this instruction, I mean to the
address 36 where it will jump. So, you</text>
<text start="622.95" dur="6.72000000000012">can say this is how it will happen. So,</text>
<text start="629.67" dur="8.55999999999988">we shall be losing 3 cycles. Now, the question
arises whether it is possible to reduce</text>
<text start="638.23" dur="7.84">the number of stalls; that means, whenever a
branch instruction is encountered, we are finding</text>
<text start="646.07" dur="6.78">that if we do not use any complicated technique,
if we simply introduce stalls, then we shall</text>
<text start="652.85" dur="7.09">be losing 3 cycles for each encounter of a branch,
so let us see.</text>
<text start="659.94" dur="6.99000000000012">What will be our loss, the impact of branch stalls?
So, let us assume your ideal CPI is equal to</text>
<text start="666.93" dur="8.50999999999988">1. And let us assume that 30 percent of the
instructions are branches; the remaining 70 percent</text>
<text start="675.44" dur="7">of instructions are ALU operations. So, since
there is a stall of 3 cycles, the</text>
<text start="682.44" dur="8.75">new CPI is equal to 1 plus 0.3 into 3, that is 1.9.
Of course, we have not considered the situation</text>
<text start="691.19" dur="8.55">that, you know, all branches may not be taken
here. We have assumed that as soon as a branch</text>
<text start="699.74" dur="6.099">instruction is encountered, 3 stalls
will be introduced. But that is not really</text>
<text start="705.839" dur="6.531">necessary, even for the simple 5-stage pipeline that I have
discussed. Because, you see, whether the branch</text>
<text start="712.37" dur="4.86">will be taken or not
taken is known at the execution stage; if it</text>
<text start="717.23" dur="9.87">is not taken, then obviously it is not necessary
to, I mean, wait for the next cycle, because</text>
<text start="727.1" dur="5.239">the branch will not be taken.
So if the branch is not taken then the loss will</text>
<text start="732.339" dur="5.62000000000012">be of 2 cycles. For example, if 50 percent
of these branches are taken then the new CPI</text>
<text start="737.959" dur="9.48099999999988">will be 1 plus 0.3 into (0.5 into 3 plus 0.5 into 2),
that is 1.75; that is the more realistic situation.</text>
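The CPI arithmetic above can be sketched as a small calculation. The figures (ideal CPI of 1, 30 percent branches, a 3-cycle loss on a taken branch, a 2-cycle loss on a not-taken branch, and 50 percent of branches taken) are the lecture's assumed numbers:

```python
# CPI impact of branch stalls, using the lecture's assumed figures.
ideal_cpi = 1.0
branch_fraction = 0.3            # 30% of instructions are branches

# Pessimistic case: every branch costs the full 3 stall cycles.
cpi_all_stalled = ideal_cpi + branch_fraction * 3
print(round(cpi_all_stalled, 2))   # 1.9

# Refined case: a not-taken branch, resolved at the execute stage,
# loses only 2 cycles; assume 50% of branches are taken.
taken_fraction = 0.5
avg_penalty = taken_fraction * 3 + (1 - taken_fraction) * 2
cpi_refined = ideal_cpi + branch_fraction * avg_penalty
print(round(cpi_refined, 2))       # 1.75
```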
<text start="926.51" dur="7.75">done with the help of the ALU which is available
in the processor. Now, if you want to move</text>
<text start="934.26" dur="9.579">it to the instruction decode stage, then we will require
an additional adder which actually performs the</text>
<text start="943.839" dur="5.141">effective address calculation earlier, in the
instruction decode stage. So, you find that</text>
<text start="948.98" dur="6.72">you can shift this hardware,
meaning this multiplexer along with an additional</text>
<text start="955.7" dur="8.50900000000012">adder and this zero detector, to the instruction
decode stage.</text>
<text start="964.209" dur="8.081">Then we find what the outcome
of this is: that means both the condition and</text>
<text start="972.29" dur="8.02">the branch address are known in the second
stage itself; we do not have to go to the fourth</text>
<text start="980.31" dur="8.82">stage. So, the loss or the penalty is reduced
to only 1 cycle, because in the instruction</text>
<text start="989.13" dur="7.26">decode stage both will be known. And accordingly,
depending on the outcome, either the next instruction</text>
<text start="996.39" dur="10.09">can be fetched from PC plus 4 in the next cycle,
I mean after 1 cycle, or from</text>
<text start="1006.48" dur="3.68">the branch address; that means the
program counter will be loaded by PC plus</text>
<text start="1010.16" dur="7.739">4 or by the effective address, which is calculated
in the third cycle itself instead of waiting</text>
<text start="1017.899" dur="7.601">for the fifth cycle. So, you find that
these 2 solutions can be easily accomplished</text>
<text start="1025.5" dur="5.76">with the help of additional hardware. And
so this is how we can reduce the branch</text>
<text start="1031.26" dur="5.35999999999988">penalty to 1 cycle. So, after this we shall
assume that for the simple pipeline that we are</text>
<text start="1036.62" dur="6.19900000000012">discussing, the branch penalty is 1 cycle.
That means we shall assume that this change</text>
<text start="1042.819" dur="9.05">has been made in the hardware and our branch
penalty is now 1 cycle.</text>
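The effect of this hardware change on the earlier CPI estimate can be checked quickly; the figures (ideal CPI of 1, 30 percent branches) are the same assumed numbers used before:

```python
# Resolving branches in the decode stage drops the penalty from 3 cycles to 1.
ideal_cpi = 1.0
branch_fraction = 0.3

cpi_resolve_in_execute = ideal_cpi + branch_fraction * 3  # original pipeline
cpi_resolve_in_decode = ideal_cpi + branch_fraction * 1   # modified pipeline
print(round(cpi_resolve_in_execute, 2))  # 1.9
print(round(cpi_resolve_in_decode, 2))   # 1.3
```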
<text start="1051.869" dur="10.5499999999998">Now, here are some statistics about the control
instructions, based on the SPEC benchmarks on</text>
<text start="1062.419" dur="8.41100000000023">the DLX processor. It is taken from the
book Computer Architecture: A Quantitative</text>
<text start="1070.83" dur="9.069">Approach, second edition; these statistics
are taken from that</text>
<text start="1079.899" dur="5.9">particular book. And about branches, it has been
found that the statistics are like this: branches</text>
<text start="1085.799" dur="6.901">occur with a frequency of 14 to 16 percent in
integer programs and 3 to 12</text>
<text start="1092.7" dur="5.99">percent in floating point programs. So, this
is the branch frequency, the rate at which</text>
<text start="1098.69" dur="5.309">branch instructions are encountered in a program.
And this is more in integer programs than</text>
targets
have also to be known, whether it is PC plus 4</text>
<text start="1269.669" dur="8.39000000000023">or PC plus some immediate value which
is part of the instruction. So, obviously</text>
<text start="1278.059" dur="8.05">this approach is not acceptable if you are interested
in improving the performance. So, what do</text>
<text start="1286.109" dur="8.5">you mean by the second approach? The second approach
is to treat every branch as taken; that means</text>
<text start="1294.609" dur="10.731">the compiler assumes that the branch will always be
taken. So, if the branch is always taken, what</text>
<text start="1305.34" dur="7.07899999999977">will be done? From that address, the next instructions
will be fetched.</text>
<text start="1312.419" dur="7.71000000000023">I mean, I am sorry, in this case the branch is not
taken; so treat every branch as not taken, that</text>
<text start="1320.129" dur="8.92">means it always assumes that the branch is not
taken. So, when the branch is not taken, obviously</text>
<text start="1329.049" dur="5.24">the next instruction to be executed is PC
plus 4. So, it proceeds in that direction,</text>
<text start="1334.289" dur="12.401">so it executes successor instructions in sequence
as if there is no branch. However, whenever this</text>
<text start="1346.69" dur="7.13000000000023">assumption is made,
this is simply an assumption. It does not</text>
<text start="1353.82" dur="5.979">mean that, I mean, all the branches
will not be taken; some branches will be taken.</text>
<text start="1359.799" dur="6.96">So, what will be done in such cases? So, when a
branch is taken, we need to turn the fetched</text>
<text start="1366.759" dur="7.591">instruction into a no-op and restart the fetch
at the target address. So, this is the thing</text>
<text start="1374.35" dur="8.959">we have to do whenever a prediction that is
done by the compiler, or the assumption that is</text>
<text start="1383.309" dur="7.56">made by the compiler, turns out to be wrong. And
it has been found that 47 percent of branches</text>
<text start="1390.869" dur="7.9">are not taken on an average.
So, in 47 percent of the cases, you know, there</text>
<text start="1398.769" dur="8.421">will be no need for any modification, so
there will be no performance loss for those 47</text>
<text start="1407.19" dur="4.589">percent of the cases, but for the remaining
53 percent of the cases there will be some</text>
<text start="1411.779" dur="8.431">performance loss. Because we have to, I mean, we
have to convert the already fetched</text>
<text start="1420.21" dur="4.459">instruction into a no-op. And we have to restart</text>
<text start="1424.669" dur="5.12">the fetch at the target address.
So, this is the situation when the assumption</text>
<text start="1429.789" dur="5.44">has been made that the branch is not taken. Now the
third approach, an alternative scheme, is</text>
<text start="1435.229" dur="6.91000000000023">to treat every branch as taken. So, in such
a case, what is being done is that it is assumed that</text>
<text start="1442.139" dur="8.91">all the branches are taken; that assumption
is made. But unfortunately, even for the simple</text>
<text start="1451.049" dur="6.98">pipeline we have seen that the branch address
is known only in the execution</text>
<text start="1458.029" dur="6.941">stage, when whether the branch
will be taken or not taken is also already known.</text>
<text start="1464.97" dur="7.139">So, as a consequence, for the simple pipeline
that we have discussed there is no gain, no</text>
<text start="1472.109" dur="3.80999999999977">advantage.
So, this approach has no advantage for</text>
<text start="1475.919" dur="6.801">the 5-stage pipeline we are discussing; however,
there is some performance gain whenever,</text>
<text start="1482.72" dur="8.879">in other pipelines, the branch target address is known earlier.
So, now there is another approach</text>
<text start="1491.599" dur="13.54">which is known as delayed branch. We shall see
how the instruction following the branch can</text>
<text start="1505.139" dur="8.58099999999977">be converted into a useful instruction normally.
We have seen that if the prediction is wrong then</text>
<text start="1513.72" dur="6.259">we lose one cycle; the instruction which
was fetched has to be converted to a no-op.</text>
<text start="1519.979" dur="7.971">So, we can overcome that with this particular
scheme: we can execute an instruction and it</text>
<text start="1527.95" dur="7.26">is not necessary to convert it into a no-op.
So, that is known as delayed branch. So</text>
<text start="1535.21" dur="6.91900000000023">we shall discuss these techniques
one after the other. Of course, for the first</text>
<text start="1542.129" dur="10.66">technique there is nothing to discuss; I have already
mentioned that the processor has to simply</text>
<text start="1552.789" dur="7.041">introduce a stall after detecting a branch
instruction in the instruction decode stage.</text>
<text start="1559.83" dur="6.36999999999977">So the first approach needs no
further discussion; we do not need to consider the first</text>
<text start="1566.2" dur="0.24">approach.</text>
<text start="1566.44" dur="7.459">Let us now focus on approach 2, predict not
taken. So, who predicts, who decides</text>
<text start="1573.899" dur="6.16">here? Obviously, the prediction is done
by the compiler. So, the compiler assumes</text>
<text start="1580.059" dur="8.61999999999977">that the branch is not taken. So, in
such a case you can execute successor instructions:</text>
<text start="1588.679" dur="9.05">keep on fetching PC plus 4, PC plus 8,
one after another, and executing; however, in this case PC</text>
<text start="1597.729" dur="6.44">plus 4 is already calculated. So, use it to get
the next instruction; chances are the branch</text>
<text start="1604.169" dur="7.1">is not taken. So, whenever the branch is taken,
as we have seen, we have to modify</text>
<text start="1611.269" dur="6.86">the instruction. And that is why it is
done this way: assuming the branch is not taken, the following</text>
<text start="1618.129" dur="4.60999999999977">instructions you have to squash
in the pipeline if the branch is actually taken.</text>
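The squashing behaviour just described can be illustrated with a toy model. The instruction encoding here is invented purely for illustration; it is a minimal sketch of predict-not-taken, not a real pipeline simulator:

```python
# Toy model of predict-not-taken: instructions are fetched sequentially;
# when a branch turns out to be taken, the sequentially fetched successor
# is squashed (turned into a no-op) and fetching restarts at the target.
def run_predict_not_taken(program):
    executed = []
    pc = 0
    while pc < len(program):
        op = program[pc]
        executed.append(op)
        if op[0] == "branch" and op[1]:      # branch whose condition is true
            executed.append(("nop",))        # squash the wrongly fetched successor
            pc = op[2]                       # restart the fetch at the target
        else:
            pc += 1                          # predicted path: fall through
    return executed

# A taken branch at index 1 skips the "sub" and jumps to index 3 ("mul");
# the "sub" fetched under the not-taken prediction becomes a no-op.
prog = [("add",), ("branch", True, 3), ("sub",), ("mul",)]
print(run_predict_not_taken(prog))
# [('add',), ('branch', True, 3), ('nop',), ('mul',)]
```

A not-taken branch in this model incurs no squash at all, which is exactly why the scheme loses nothing in the 47 percent of cases quoted above.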
</transcript>