<?xml version="1.0" encoding="utf-8"?>
<transcript>
<text start="18.57" dur="8.9">Hello, and welcome to today's lecture on
control hazards. Before we begin, let us have</text>
<text start="27.47" dur="7.71">a quick recap of the various types of dependences
and hazards which we have discussed in the last</text>
<text start="35.18" dur="6.87">couple of lectures. We have seen that the
dependences can be broadly divided into 2</text>
<text start="42.05" dur="9.509">categories: data dependences and control dependences.
And the data dependences again can be divided</text>
<text start="51.559" dur="9.07">into 2 broad categories. First one is true
data dependences, which lead to read after write type of hazards.</text>
<text start="60.629" dur="6.111">And we have discussed various
techniques by which you can overcome these</text>
<text start="66.74" dur="6.409">true data dependences by using hardware and
software means. We have seen we can use</text>
<text start="73.149" dur="5.711">forwarding. You can use instruction scheduling:
static instruction scheduling, dynamic instruction</text>
<text start="78.86" dur="6.04">scheduling by hardware. And by that you can
overcome read after write type of hazards</text>
<text start="84.9" dur="7.38">arising out of true data dependence.
Similarly, we have discussed about name dependences</text>
<text start="92.28" dur="5.82">having 2 different varieties; one is known
as output dependences. And second one is known</text>
<text start="98.1" dur="7.53">as anti dependences. And output dependences
lead to write after write type of hazards.</text>
<text start="105.63" dur="6.70999999999999">And anti dependences lead to write after read
type of hazards, and these 2 types of hazards</text>
<text start="112.34" dur="6.7">can be overcome by using register renaming.
And we have seen how register renaming can</text>
<text start="119.04" dur="7.75">be done by the compiler or by the hardware,
as it has been done in Tomasulo's algorithm.</text>
<text start="126.79" dur="11.21">So, this is how the data dependences are tackled,
and hazards arising out of data dependences</text>
<text start="138" dur="6.269">can be overcome by different techniques. So
far we have concentrated on data dependences</text>
<text start="144.269" dur="5.56">and overcoming the hazards arising out of
data dependences. Now, we shall focus on control</text>
<text start="149.829" dur="9.41">dependences. We have seen control dependences
lead to control hazards. In simple terms,</text>
<text start="159.239" dur="6.53">we can discuss control dependences, we
can describe control dependences in this</text>
<text start="165.769" dur="5.56">way: control hazards occur due to instructions
changing the program counter. We have seen</text>
<text start="171.329" dur="7.671">the program counter keeps track of the instruction
to be executed next; that is, the program counter</text>
<text start="179" dur="13.51">holds the address of the next instruction. And
particularly when there are branches,</text>
<text start="192.51" dur="5.27">this program counter value may not be known
immediately. And it has been found that control</text>
<text start="197.78" dur="4.76">hazards cause a better performance loss than
do data hazards.</text>
<text start="202.54" dur="10.4">So, data hazards sometimes lead to some losses;
we have to introduce stalls. But it has been</text>
<text start="212.94" dur="7.63">found that control hazards are more frequent and lead
to more performance loss. So, we have to focus</text>
<text start="220.57" dur="6.74">attention to control hazards. And we have
to see how they are impact can be minimized</text>
<text start="227.31" dur="9.56">they are the loss can be reduced we know that
a branch can be can have 2 outcomes. Number</text>
<text start="236.87" dur="7.45">one is known as taken, another one is not
taken that means whenever you have got a branch</text>
<text start="244.32" dur="6.509">instruction there are 2 possibilities in case
of taken you have to generate a new address</text>
<text start="250.829" dur="4.87">that means in case of taken branches.</text>
<text start="255.699" dur="9.03">the effective address is computed: the address is
equal to the program counter plus that immediate</text>
<text start="264.729" dur="5.43099999999994">data that is available as the part of the
instruction, added with the program counter,</text>
<text start="270.16" dur="6.9">and effective address is generated. And this
is the address where it will be generate the</text>
<text start="277.06" dur="4.75">instruction execution fruit starts this progra
m
counter has to be loaded by this effective</text>
<text start="281.81" dur="7.81">address; basically it has to be loaded with this
value. Another possibility is not taken;</text>
<text start="289.62" dur="6.69">that means the branch may not be taken, the
condition may not be satisfied. In such case</text>
<text start="296.31" dur="5">the program counter is essentially the address
of the next instruction.</text>
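The two possible next-PC values described above can be sketched in a few lines of Python (a minimal illustration, following the lecture's formulation: target = program counter plus the immediate, fall-through = PC plus 4):

```python
def next_pc(pc, taken, immediate):
    """Select the next program counter value at a branch.

    taken: whether the branch condition is satisfied
    immediate: displacement carried in the branch instruction
    """
    if taken:
        return pc + immediate  # effective (target) address
    return pc + 4              # fall-through: instructions are 4 bytes

# A branch at address 100 carrying an immediate of 36:
print(next_pc(100, True, 36))   # 136
print(next_pc(100, False, 36))  # 104
```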
<text start="301.31" dur="7.94">As we know it is equal to PC plus 4, because
instructions are 4 bytes. So, the next address</text>
<text start="309.25" dur="9.76">is simply the value of the program counter plus
4. So, this is the address of the next instruction.</text>
<text start="319.01" dur="9.51">but so these 2 I mean when we shall know the
new address the branch is taken. And we can</text>
<text start="328.52" dur="7.5">know the address when branch is not taken.
So unless these 2 are known we cannot face</text>
<text start="336.02" dur="6.64999999999994">the next instruction till the value
of PC is known. That is, unless the new value</text>
<text start="342.67" dur="4.97000000000006">is known, depending on whether the
branch is taken or not taken, we cannot really</text>
<text start="347.64" dur="3.88">proceed; we cannot fetch the next instruction
and start execution.</text>
<text start="351.52" dur="6.8">Now, let us see, what are the solutions? The
first solution or the simplest solution is</text>
<text start="358.32" dur="6.62">to install the pipeline upon detecting a branc
h.
That means as soon as detected a branch what</text>
<text start="364.94" dur="11.27">you can do it will install the pipeline and
wait till the branch address is known. So,</text>
<text start="376.21" dur="6.81">that is the simplest solution, and the steps
are given here: the ID stage detects the branch.</text>
<text start="383.02" dur="4.47">That is after the instruction is decoded it
will be known whether it is a branch instruction</text>
<text start="387.49" dur="6.46">or not, and we do not know if the branch is taken
until the execution stage. So, as we shall</text>
<text start="393.95" dur="7.66">see, in our pipeline we have to go to the execution
stage, because there the condition for the branch</text>

<text start="401.61" dur="4.529">is tested and it will be known whether the branch
will be taken or not.</text>
<text start="406.139" dur="6.591">And then the new PC is not changed until the
end of the memory stage. That means until</text>
<text start="412.73" dur="8.54000000000006">we go to the memory stage we do not know what
will be the new address if the branch is taken.</text>
<text start="421.27" dur="6.04">It means after determining the branch is taken,
the new PC value is known only in the memory</text>
<text start="427.31" dur="5.579">stage. And if the branch is taken we need
to repeat some stages and fetch new instructions;</text>
<text start="432.889" dur="5.39100000000006">that means fetching from the new address.
Fetching was taking place from consecutive addresses</text>
<text start="438.28" dur="6.43999999999994">like PC plus 4, PC plus 8; those
instructions are flushed out.</text>
<text start="444.72" dur="5.84">And you have to fetch instructions from
the new address, and that is how it will continue.</text>
<text start="450.56" dur="8.06">As it is clear from this diagram, for the
condition in this simple pipeline that we</text>
<text start="458.62" dur="8.08">have discussed, it checks whether a particular
register content is 0 or not. As already mentioned,</text>
<text start="466.7" dur="6.36">it is done at the execution stage.
So, the condition is known only at the execution</text>
<text start="473.06" dur="5.43">state and where the branch will taken place
you can see the address is known in the memory</text>
<text start="478.49" dur="4.01">access stage. So, in the memory access stage
the address will be known and that content</text>
<text start="482.5" dur="5.919">will be loaded in to the program counter.
So, we have to wait till the memory accessed</text>
<text start="488.419" dur="7.091">stage to know both the things condition whet
her
satisfied or not. And the branch address,</text>
<text start="495.51" dur="5.55">whether if the branch is taken so what will
be the delay?</text>
<text start="501.06" dur="5.2">Obviously, it leads to a delay of
3 cycles, so you have to stall the pipeline</text>
<text start="506.26" dur="6.24">by 3 cycles in the normal situation. So, you
can see here whenever the branch is taken.</text>
<text start="512.5" dur="8.47900000000006">And you can see that after the execution stage
the branch condition is decided, and after the</text>
<text start="520.979" dur="6.37099999999994">memory stage the new target is known.
So, these are the instructions fetched: this</text>
<text start="527.35" dur="12.04">one, this one and this one. That means after
this BEQ R1, R3, 36, you know, these</text>
<text start="539.39" dur="4.86">particular instructions which are
following it, these remaining 3 instructions:</text>
<text start="544.25" dur="9.36">AND R2, R3, R5; OR R6, R1, R7; and ADD
R8, R1, R9.</text>
<text start="553.61" dur="5.39">These instructions, obviously, are to be flushed
out. What do you really mean by that? You</text>
<text start="559" dur="7.87">can say fortunately none of these 3 instructions
have done any permanent damage or permanently</text>
<text start="566.87" dur="4.98">change in the status of the processor.
And that will take place only in the write</text>
<text start="571.85" dur="7.07000000000012">back stage; the content of the register will
be modified, and only then the permanent change</text>
<text start="578.92" dur="8.51999999999988">in state is done. So, you can see that before
that happens, the condition and the</text>
<text start="587.44" dur="7.49000000000012">new address are known. So, what you have to
do is: all these 3 instructions are to be nullified</text>
<text start="594.93" dur="7.80999999999988">by converting them into no operation instructions.
And obviously, there will be no change, but</text>
<text start="602.74" dur="6.59">we shall be losing the three cycles. And the
instruction fetch will take place, if the</text>
<text start="609.33" dur="5.93">branch is taken place where the branch is
taking place. And I mean this when will be</text>
<text start="615.26" dur="7.69">known that it will execute this or it will
jump to this instruction I mean there is a</text>
<text start="622.95" dur="6.72000000000012">address thirty 6 where it will jum
p. So, you
can say the, this is how it will happen. So,</text>
<text start="629.67" dur="8.55999999999988">we shall be losing 3 cycles. Now,
the question
will arises where that it is possible to reduce</text>
<text start="638.23" dur="7.84">the number of stalks that means whenever the
branch instruction encountered. We are finding</text>
<text start="646.07" dur="6.78">that if we do not use any complicated techniqu
e.
If we simply introduce talks then we shall</text>
<text start="652.85" dur="7.09">losing 3 cycles for each encounter of a branch
,
so let us see.</text>
<text start="659.94" dur="6.99000000000012">What will be our loss, the impact of branch stalls?
So, let us assume your ideal CPI is equal to</text>
<text start="666.93" dur="8.50999999999988">1. And let us assume that 30 percent of the
instructions are branches; remaining 70 percent</text>
<text start="675.44" dur="7">instructions are ALU operations. So, since
there is a stall of 3 cycles,</text>
<text start="682.44" dur="8.75">the new CPI is equal to 1 plus 0.3 into 3, that is
1.9. Of course, we have not considered the situation</text>
<text start="691.19" dur="8.55">that you know all branches may not be taken
here. We have assumed that as soon as a branch</text>
<text start="699.74" dur="6.099">installed instruction is encountered 3 stalls
will be introduced. But that is not really</text>
<text start="705.839" dur="6.531">necessary even for the simple 5 9 that I hav
e
discussed. Because you see the branch can</text>
<text start="712.37" dur="4.86">be taken or not taken, and this is known
at the execution stage. If it</text>
<text start="717.23" dur="9.87">is not taken, then obviously it is not necessary
to, I mean, wait for the next cycle, because</text>
<text start="727.1" dur="5.239">the branch will not be taken.
So if the branch is not taken then the loss will</text>
<text start="732.339" dur="5.62000000000012">be of 2 cycles. For example, if 50 percent
of these branches are taken, then the new CPI</text>
<text start="737.959" dur="9.48099999999988">will be 1 plus the branch losses; that is, in this situation,
in 15 percent of the cases the branch is taken.</text>
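The first estimate, where every branch costs 3 stall cycles, can be checked in a couple of lines (a quick sanity check of the arithmetic above):

```python
ideal_cpi = 1.0
branch_fraction = 0.30  # 30 percent of instructions are branches
stall_cycles = 3        # every branch stalls the pipeline 3 cycles

new_cpi = ideal_cpi + branch_fraction * stall_cycles
print(new_cpi)  # 1.9
```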


<text start="747.44" dur="6.18">So, in such a case the loss will be of 3 cycles,
because the address will be known only</text>
<text start="753.62" dur="8.17">at the end of the memory stage and so 0.15
into 3. And then whenever branch is not taken</text>
<text start="761.79" dur="6.89999999999988">for the 15 percent of the cases the loss will
be of 2 cycles. So, if you add up you find that</text>
<text start="768.69" dur="12.29">the new CPI will be 1.75, not 1.9.
Now, what would this penalty be for the current</text>
<text start="780.98" dur="4.93">generation of processors? This will be the case for the
simple pipeline that we have considered. But</text>
<text start="785.91" dur="7.119">in the modern processors now a days for
example, even the instruction exchange or</text>
<text start="793.029" dur="3.741">instruction decode stage is divided into sev
eral
stages.</text>
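The refined CPI quoted a little earlier can be reproduced the same way (30 percent branches, half of them taken; 3 cycles lost on a taken branch, 2 on a not-taken one, as in the discussion above):

```python
branch_fraction = 0.30
taken_ratio = 0.50

taken_loss = branch_fraction * taken_ratio * 3            # 0.15 * 3 cycles
not_taken_loss = branch_fraction * (1 - taken_ratio) * 2  # 0.15 * 2 cycles

new_cpi = 1.0 + taken_loss + not_taken_loss
print(round(new_cpi, 2))  # 1.75
```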
<text start="796.77" dur="6.27">So, in such a case the loss will be more; the number
of clock cycles or number of stalls that will</text>
<text start="803.04" dur="9.65999999999988">be happening will be more, but in our case
it will be restricted to 3. So, how do we</text>
<text start="812.7" dur="5.44000000000012">reduce the impact of branch stalls? The question
is: is there any way by which you can reduce</text>
<text start="818.14" dur="8.65">the impact of branch stalls? There is a 2 part
solution. First of all you have to determine</text>
<text start="826.79" dur="7.63">branch taken or not taken sooner. So, if
you can find out whether the branch will be taken</text>
<text start="834.42" dur="6.62">or not earlier in the pipeline stage. If you
can make some arrangement for that then there</text>
<text start="841.04" dur="6.44">is a possibility of some gain. Similarly,
if we know the branch address earlier even</text>
<text start="847.48" dur="7.34">then we can have some gain in performance
or we can reduce the impact of this branch</text>
<text start="854.82" dur="7.9">stalls. So, these 2 things had to be done
determine branch taken or not sooner and compute</text>
<text start="862.72" dur="7.14">taken branch address earlier. So, these are
the things to be done and there can be a solution</text>
<text start="869.86" dur="3.83">for this. So, let us see what is the hardware
solution?</text>
<text start="873.69" dur="9.68">What can be done? You can see that the zero
detector, which checks whether a</text>
<text start="883.37" dur="8.6">particular register is 0 or not, is in the
execution stage; it can be moved to the instruction</text>
<text start="891.97" dur="4.65">decode stage you can see it has been moved
to the instruction decode stage. And not only</text>
<text start="896.62" dur="5.74">that this multiplexer which was in the memory
access stage can be allowed to be moved to</text>
<text start="902.36" dur="7.169">the instruction decoder stage. So, if we do
that however if we wanted to do that you will</text>
<text start="909.529" dur="7.06">require an additional hardware that is an
adder. You will require an adder earlier this</text>
<text start="916.589" dur="3.97100000000012">addition to generate an defective
address
you know you have to generate an defective</text>
<text start="920.56" dur="5.95">address p c plus immediate value. So, these
values has to be calculated earlier this was</text>

<text start="926.51" dur="7.75">done with the help of the ALU which is available
in the processor. Now, if you want to move</text>
<text start="934.26" dur="9.579">it to the instruction decode stage then we will require
an additional adder which will actually perform the</text>
<text start="943.839" dur="5.141">effective address calculation earlier in the
instruction decode stage. So, you fine that</text>
<text start="948.98" dur="6.72">if you can add this; you can shift this hardwa
re
means this multiplexer along with an additional</text>
<text start="955.7" dur="8.50900000000012">adder and this 0 detector to the in
struction
and decode stage.</text>
<text start="964.209" dur="8.081">Then we find, what is the outcome
of this? That means both the condition and</text>
<text start="972.29" dur="8.02">the branch address are known in the second
stage itself; we do not have to go to the fourth</text>
<text start="980.31" dur="8.82">stage. So, the loss or the penalty is reduced
only to 1 cycle, because in the instruction</text>
<text start="989.13" dur="7.26">decode stage both will be known. And according
ly
depending on the outcome either the next instruction</text>
<text start="996.39" dur="10.09">can be fetched from c p plus form in the cycl
e
in , I mean after 1 cycle. And also or from</text>
<text start="1006.48" dur="3.68">the branch address which is not that means
program counter will be loaded by c p plus</text>
<text start="1010.16" dur="7.739">4 or by the effective address which is calcu
lated
in the third cycle itself instead of waiting</text>
<text start="1017.899" dur="7.601">for the fifth cycle. So, you find that
these 2 solutions can be easily accomplished</text>
<text start="1025.5" dur="5.76">with the help of additional hardware. And
so this is how this can reduce the branch</text>
<text start="1031.26" dur="5.35999999999988">penalty to 1 cycle. So, after this we shall
assume that for the simple pipeline that we are</text>
<text start="1036.62" dur="6.19900000000012">discussing the branch penalty is 1 cycle.
That means we shall assume that this change</text>
<text start="1042.819" dur="9.05">has been made in the hardware and our branch
penalty is now 1 cycle.</text>
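With the penalty brought down to 1 cycle, the earlier estimate improves accordingly (this figure is my extrapolation under the same illustrative 30 percent branch frequency used above, not a number stated in the lecture):

```python
# Every branch now costs a single stall cycle instead of 3.
new_cpi = 1.0 + 0.30 * 1
print(new_cpi)  # 1.3
```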
<text start="1051.869" dur="10.5499999999998">Now, here are some statistics about the control
instructions based on the SPEC benchmark on</text>
<text start="1062.419" dur="8.41100000000023">the DLX processor; it is taken from the
Computer Architecture book, that is the second edition</text>
<text start="1070.83" dur="9.069">of the book Computer Architecture: A Quantitative
Approach; the statistics are from that</text>
<text start="1079.899" dur="5.9">particular book. And for branches it has been
found that the statistics are like this: branches</text>
<text start="1085.799" dur="6.901">occur with a frequency of 14 to 16 percent in
integer programs and 3 percent to 12</text>
<text start="1092.7" dur="5.99">percent in floating point programs So, this
is the branch frequency the rate at which</text>
<text start="1098.69" dur="5.309">branch instructions countered in a program.
And this is more in integer programs than</text>

<text start="1103.999" dur="7.4">in floating point programs. And another statistic is
about whenever a branch is encountered. It</text>
<text start="1111.399" dur="4.35999999999977">has been found that about 75 percent
of the branches are forward branches.</text>
<text start="1115.759" dur="6.87">So, the branch can take place in the forward
direction, in which the address is increased, or it can</text>
<text start="1122.629" dur="5.33099999999977">take place in the backward direction, when
the address is decreased. So, 75 percent of</text>
<text start="1127.96" dur="6.64000000000023">the branches are forward branches, and
moreover 60 percent of the forward branches are</text>
<text start="1134.6" dur="8.82899999999977">taken. And 80 percent of the backward branches are
taken. Actually you may be asking why 80 percent</text>
<text start="1143.429" dur="4.72">of the backward branches are taken, why it is
more. The reason for that is, you know, this</text>
<text start="1148.149" dur="4.89">is because of loops; in case of a loop it goes
back to an earlier instruction. So, because</text>
<text start="1153.039" dur="10.551">of that looping, you know, the percentage of
backward branches taken is more. So these are</text>
<text start="1163.59" dur="1.76">the statistics.</text>
<text start="1165.35" dur="10.3089999999998">Now, let us consider techniques by which we
can deal with control hazards. What are the</text>
<text start="1175.659" dur="6.11">different techniques that we can adopt? The first
technique, I mean, we have already discussed</text>
<text start="1181.769" dur="8.38">some hardware solution by which we can reduce
the number of stalls. Now, the number of approaches</text>
<text start="1190.149" dur="8.08999999999977">available is 4. The first approach, the simplest
method, is to redo the fetch of the instruction</text>
<text start="1198.239" dur="6.65000000000023">following the branch. So, this is essentially
by introducing stalls. So what is being done</text>
<text start="1204.889" dur="8.35999999999977">in this case: until the branch direction is
known, until the address is calculated,</text>
<text start="1213.249" dur="6.92">you continue to stall the pipeline, and so every
branch causes a performance loss. So, whenever</text>
<text start="1220.169" dur="7.95000000000023">a branch occurs you simply introduce
a stall; in the simple pipeline, fortunately, the</text>
<text start="1228.119" dur="6.27">number of stalls has been reduced from 3 to 1.
So, whenever a branch instruction is encountered</text>
<text start="1234.389" dur="9.28999999999977">a stall is introduced, so every branch leads
to 1 instruction loss. So, this solution is</text>
<text start="1243.679" dur="7.09">very simple, simple to implement, because
you do not have to check anything after the</text>
<text start="1250.769" dur="5.171">instruction is decoded. If it is a branch
you introduce a stall then you proceed to</text>
<text start="1255.94" dur="5.92900000000023">the next cycle, where both the condition,
whether the condition is satisfied or</text>
<text start="1261.869" dur="7.79999999999977">not, is known, and also the branch target
is known, whether it is PC plus 4</text>
<text start="1269.669" dur="8.39000000000023">or PC plus that immediate value which
is part of the instruction. So, obviously</text>
<text start="1278.059" dur="8.05">this approach is not accepted if you are int
erested
in improving the performance. So, what do</text>
<text start="1286.109" dur="8.5">you mean by the second approach? Second appro
ach
is to treat every branch as taken that means</text>
<text start="1294.609" dur="10.731">the complier assumes that branch will be a
lways
taken. So, if the branch is always taken what</text>
<text start="1305.34" dur="7.07899999999977">will be done the address, the nex
t instructions
will be fetched.</text>
<text start="1312.419" dur="7.71000000000023">I mean I am sorry in this case b
ranch is not
taken so treat every branch is not taken that</text>
<text start="1320.129" dur="8.92">means it always assumes that branch is not
taken. So, when the branch is not taken obviously;</text>
<text start="1329.049" dur="5.24">the next instruction to be executed is p c
plus 4. So, it proceeds in that direction</text>
<text start="1334.289" dur="12.401">so it executes successor instructions in sequence
as if there is no branch. However, whenever this</text>
<text start="1346.69" dur="7.13000000000023">assumption is made,
this is simply an assumption. It does not</text>
<text start="1353.82" dur="5.979">mean that all the branches
will not be taken; some branches will be taken.</text>
<text start="1359.799" dur="6.96">So, what will be done in such cases? So, when
the branch is taken, we need to turn the fetched</text>
<text start="1366.759" dur="7.591">instruction into a no op and restart the fetch
at the target address. So, this is the thing</text>
<text start="1374.35" dur="8.959">we have to do whenever a prediction that is
done by the compiler, or the assumption that is</text>
<text start="1383.309" dur="7.56">made by the compiler, turns out to be wrong. And
it has been found that 47 percent of branches</text>
<text start="1390.869" dur="7.9">are not taken on an average.
So, in 47 percent of the cases you know there</text>
<text start="1398.769" dur="8.421">will be no need for any modification. So
there will be no performance loss for the 47</text>
<text start="1407.19" dur="4.589">percent of the cases, but for the remaining
53 percent of the cases there will be some</text>
<text start="1411.779" dur="8.431">performance loss. Because we have up to we
have to I mean we have to fetch an instruction</text>
<text start="1420.21" dur="4.459">I mean we have to convert the already fetche
d
instruction into a no op. And we have to restart</text>
<text start="1424.669" dur="5.12">the fetch at the target address.
So, this is the situation when the assumption</text>
<text start="1429.789" dur="5.44">has been made that the branch is not taken. Now the
third approach, an alternative scheme, is</text>
<text start="1435.229" dur="6.91000000000023">to treat every branch as taken. So, in such
a case what is being done? It is assumed that</text>
<text start="1442.139" dur="8.91">all the branches are taken; that assumption
is made. But unfortunately even for the simple</text>
<text start="1451.049" dur="6.98">pipeline we have seen that the branch address
is known only in the execution</text>
<text start="1458.029" dur="6.941">stage, when whether the branch
will be taken or not taken is also already known.</text>
<text start="1464.97" dur="7.139">So, as a consequence for the simple pipeline
that we have discussed there is no gain no</text>
<text start="1472.109" dur="3.80999999999977">advantage.
So, this approach has been no advantage for</text>
<text start="1475.919" dur="6.801">the 5 stages pipelining discussing however
there is a some performance gain whenever</text>
<text start="1482.72" dur="8.879">2 whenever the branch address is not taken.
So, as now there will be another approach</text>
<text start="1491.599" dur="13.54">which is known as delayed branch shall see
how the instruction following the branch can</text>
<text start="1505.139" dur="8.58099999999977">be converted into a useful instr
uction normally.
We have seen if the prediction is wrong then</text>
<text start="1513.72" dur="6.259">we lose one cycle that the instruction which
was executed that has to be conveyed to known</text>
<text start="1519.979" dur="7.971">up. So, that we can overcome with this part
icular
thing we can execute an instruction and it</text>
<text start="1527.95" dur="7.26">is not necessary to converted into a known
off. So, that is known as delayed branch so</text>
<text start="1535.21" dur="6.91900000000023">we shall discuss these techniques
one after the other. Of course, for the first</text>
<text start="1542.129" dur="10.66">technique there is nothing to discuss; I have already
mentioned that you have to simply</text>
<text start="1552.789" dur="7.041">introduce a stall after detecting a branch
instruction in the instruction decode stage.</text>
<text start="1559.83" dur="6.36999999999977">So for the first approach we do not
need further discussion; let us move past the first</text>
<text start="1566.2" dur="0.24">approach.</text>
<text start="1566.44" dur="7.459">Let us now focus on approach 2: predict not
taken. So, who predicts, who decides</text>
<text start="1573.899" dur="6.16">here? Obviously, the prediction has been done
by the compiler. So, the compiler is assumed to</text>
<text start="1580.059" dur="8.61999999999977">believe that the branch is not taken. So, in
such a case you can execute successor instructions,</text>
<text start="1588.679" dur="9.05">keep on fetching plus 4 c p plus 8 is the
one and one executing however in this p c</text>
<text start="1597.729" dur="6.44">plus 4 already calculated. So, use it to get
the next instruction chances are the branch</text>
<text start="1604.169" dur="7.1">is not taken. So, whenever branch is not take
n
as we have seen we have to we have to modify</text>
<text start="1611.269" dur="6.86">the instruction. And that is why it is been
done and that if branch is not taken the following</text>
<text start="1618.129" dur="4.60999999999977">instruction you have to squash i
nstructions
in the pipeline if branch is actually taken.</text>

<text start="1622.739" dur="6.721">So, if the ith instruction is a branch
instruction and the prediction was not taken, and</text>
<text start="1629.46" dur="5.959">unfortunately it turns out to be taken, so
when the branch is taken, in such a case this</text>
<text start="1635.419" dur="9.041">i plus 1th instruction, we have to introduce a
stall here, so one cycle is lost as you can</text>
<text start="1644.46" dur="5.41900000000023">see. And of course, the next inst
ruction is
taken since it is a second instruction it</text>
<text start="1649.879" dur="6.11999999999977">is from the branch target addres
s the instructor
is fetched and then execution continues. So,</text>
<text start="1655.999" dur="12">this is the approach to where predict not
taken is done. And of course, this particular</text>
<text start="1667.999" dur="6.071">thing can be easily done as I have already
explained. Because CPU state is not updated</text>
<text start="1674.07" dur="5.939">till the locate in the pipeline. We have see
n
that CPU state is updated only in the later</text>
<text start="1680.009" dur="5.92">part of the cycle that is in the right the
stage that is your modifying the register</text>
<text start="1685.929" dur="6.31">that means permanent change you are making.
So, before that if the decision is known if</text>
<text start="1692.239" dur="8.44">your prediction is wrong there is no problem
you can knob of course, there is a loss, but</text>
<text start="1700.679" dur="3.37">that has to be accepted.</text>
<text start="1704.049" dur="11.18">And let us consider what happens.
So, 53 percent of branches are taken on an average,</text>
<text start="1715.229" dur="9.39000000000023">but the branch target address is not available.
So, here it is predict branch taken, the third</text>
<text start="1724.619" dur="5.65">approach: 53 percent of branches are taken
on an average. But the branch target address is</text>
<text start="1730.269" dur="7.25">not available after instruction fetch in MIPS,
so MIPS still incurs a 1 cycle branch</text>
<text start="1737.519" dur="5.16">penalty even with predict taken. So, as I
have already mentioned for this simple pipeline</text>
<text start="1742.679" dur="4.761">there is no benefit for this prediction thi
s
assumption.</text>
<text start="1747.44" dur="7.67900000000023">However there are machines where the branch target
is known before the branch outcome is computed.</text>
<text start="1755.119" dur="6.84999999999977">So, there are processors where this
particular situation exists: the branch target</text>
<text start="1761.969" dur="4.82">is known before the branch outcome is computed.
In such a case a significant gain can</text>
<text start="1766.789" dur="7.69">accrue, because there may be processors where
in the execution stage both the target is</text>
<text start="1774.479" dur="6.471">known and the outcome is computed. So, in such
cases, you know, there can be some gain, but</text>
<text start="1780.95" dur="4.809">not for the pipeline that we have discussed.</text>
<text start="1785.759" dur="11.62">Now, we shall focus on the fourth approach,
that is, delayed branch. We have seen we have</text>
<text start="1797.379" dur="8.13">a branch instruction. Following that there
are several sequential instructions, and this</text>
<text start="1805.509" dur="6.55">is the target address: if the branch
is taken it jumps to this address. So, in</text>
<text start="1812.059" dur="8.10999999999977">between the branch instruction and the
branch target, if the branch is taken,</text>
<text start="1820.169" dur="5.76">you have got several instructions. So, these</text>
<text start="1825.929" dur="7.19000000000023">are known as sequential successors of the branch
instruction.</text>
<text start="1833.119" dur="8.34999999999977">And these instructions are considered to be
in the branch delay slot. So, here you have</text>
<text start="1841.469" dur="7.11100000000023">got a branch delay of length n, so
there are n instructions in this branch delay</text>
<text start="1848.58" dur="5.549">slot. However, for this simple pipeline
that we have already discussed, there</text>
<text start="1854.129" dur="7.08099999999977">is only a one-slot delay required in the 5-stage
pipeline. So, you have already seen that</text>
<text start="1861.21" dur="9.049">the branch delay slot has
got only one instruction. And so in general</text>
<text start="1870.259" dur="9.16">there can be n instructions, but in our
simple case only a one-slot delay is required.</text>
<text start="1879.419" dur="11.771">Now, we are interested in filling up that
particular delay slot. And so this</text>
<text start="1891.19" dur="5.68900000000023">is the branch instruction; this is the delay slot
instruction; and these are the post-branch instructions.</text>
<text start="1896.879" dur="7.32099999999977">So, here this is the target; now this is i plus
1; this is the delay slot, and this instruction</text>
<text start="1904.2" dur="10.25">we have to fill up with
some instruction that is useful. So, instructions</text>
<text start="1914.45" dur="6.16900000000023">in the branch delay slot get executed whether
or not the branch is taken. So, the point that</text>
<text start="1920.619" dur="9.86999999999977">you have to understand is that whether the
branch is taken or not taken, the instruction</text>
<text start="1930.489" dur="9.17">following the branch instruction also gets
executed, even if your prediction is wrong.</text>
<text start="1939.659" dur="5.11">I mean, if the branch is taken then you have
to nullify it, but that instruction will always</text>
<text start="1944.769" dur="9.061">get executed. So, based on this observation,
we can think about some solution which will</text>
<text start="1953.83" dur="3.38">help in improving the performance of the processor.</text>
<text start="1957.21" dur="7.649">So, the simple idea is: put an instruction
that would be executed anyway right after</text>
<text start="1964.859" dur="6.13">a branch. So, this is the branch instruction;
this is the delay</text>
<text start="1970.989" dur="5.831">slot and this is the branch target or the
sequential successor of the instruction. Now, the question</text>
<text start="1976.82" dur="6.989">is what instruction do we input in the delay
slot? So, we have to put an instruction in</text>
<text start="1983.809" dur="6.381">the delay slot with some objective what is
objective one that can safely be executed.</text>
<text start="1990.19" dur="8.44900000000023">No matter what the branch does th
at means
whether the branch is taken or not taken that</text>
<text start="1998.639" dur="7.08999999999977">instruction can be executed? And
it will not
lead to any I mean you do not have to be convert</text>
<text start="2005.729" dur="6.86">into a knob that is the basic objective. And
the compiler decides this compiler has to</text>
<text start="2012.589" dur="9.51">decide which instruction to put in this dela
y
slot and there are several approaches.</text>
<text start="2022.099" dur="8.53">One possibility is an instruction from before:
an instruction can be taken from before the</text>
<text start="2030.629" dur="9.951">branch. So, here is a delay slot, DADD; then,
see, there are several solutions.</text>
<text start="2040.58" dur="17.5390000000002">The first solution is an instruction from before.
By this 'from before' we mean you have your</text>
<text start="2058.119" dur="11.4810000000002">branch instruction. And so let us consider
DADD R1, R2, R3. This is an instruction</text>
<text start="2069.6" dur="12.799">before this branch instruction: if R2 is equal
to 0 then it will jump to this; this is the target</text>
<text start="2082.399" dur="6.5">address. And this is the delay slot.
What we are doing: this instruction, you can</text>
<text start="2088.899" dur="6.971">see, this is the normal instruction execution
flow. So, this instruction is executed; after</text>
<text start="2095.87" dur="5.88999999999953">that, this branch instruction is encountered.
So, this instruction will be executed whether</text>
<text start="2101.76" dur="7.32000000000047">this branch instruction is taken or not. Now,
if we move this to this slot, that means we</text>
<text start="2109.08" dur="9.67">are converting it in this way: if R2
is equal to 0, then branch, and we are filling up</text>
<text start="2118.75" dur="10.23">this slot with DADD R1, R2, R3, and then
it will go to this.</text>
<text start="2128.98" dur="6.36">So, this delay slot is filled up with an instruction
from before: it was here, now you have moved</text>
<text start="2135.34" dur="6.61000000000047">it to here. So, we find that, as we know, this
instruction will be executed irrespective</text>
<text start="2141.95" dur="3.9">of whether the branch instruction is taken or
not taken. And this instruction was supposed</text>
<text start="2145.85" dur="6.01">to be executed before this branch instruction.
So, here also this instruction is executed;</text>
<text start="2151.86" dur="8.21">whether the branch is taken
or not taken, there is no loss. So, we are</text>
<text start="2160.07" dur="7.29">able to put an instruction which is useful;
you do not have to convert it into a nop if the</text>
<text start="2167.36" dur="11.37">prediction is wrong. So, obviously, this
can be moved to this delay slot, and this</text>
<text start="2178.73" dur="3.24">is a possible solution.</text>
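The accounting behind this "from before" move can be sketched in a few lines. This is a minimal illustration, not from the lecture itself: it assumes a single branch delay slot in a 5-stage MIPS-style pipeline, where the slot instruction always executes, so a usefully filled slot wastes no cycle while an unfilled one holds a nop.

```python
# Minimal sketch (assumed model, not the lecture's own code) of the single
# branch delay slot: the slot instruction always executes, so a slot filled
# with useful work (e.g. DADD R1, R2, R3 moved from before the branch)
# costs nothing extra, while a nop-filled slot wastes one cycle.

def cycles_for_branch(slot_filled_with_useful_work: bool) -> int:
    """Extra cycles charged to one branch under delayed branching."""
    return 0 if slot_filled_with_useful_work else 1

def average_branch_penalty(fill_fraction: float) -> float:
    """Average penalty when a fraction of delay slots is usefully filled."""
    return (fill_fraction * cycles_for_branch(True)
            + (1 - fill_fraction) * cycles_for_branch(False))

print(average_branch_penalty(0.5))  # 0.5
```

With half the slots usefully filled, the average penalty is 0.5 cycles per branch, which is the figure the lecture uses later when comparing schemes.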
<text start="2181.97" dur="10.1799999999995">So, you get to execute the DADD
for free. So, as if we are getting this instruction</text>
<text start="2192.15" dur="6.84000000000047">executed free of cost. Free of cost means
an instruction that was supposed to be executed</text>
<text start="2198.99" dur="6.26">anyway is doing the work of the
branch delay. So, as we go to the next instruction,</text>
<text start="2205.25" dur="5.87">by that time we will know whether the branch is
taken or not taken, and also the target address.</text>
<text start="2211.12" dur="10.05">So the next instruction can be either
at PC plus 4, or it can be at PC</text>
<text start="2221.17" dur="5.31">plus the immediate value which is available
as part of the instruction, in</text>
<text start="2226.48" dur="9.47">case the branch is taken.
So we find that this is the best solution.</text>
<text start="2235.95" dur="5.54">This is the best solution that we can have,
and this is the preferred approach we can</text>
<text start="2241.49" dur="8.67">follow for filling up instructions
in the branch delay slot. Now, what is the</text>
<text start="2250.16" dur="6.04">second possibility? The second possibility
is an instruction from the target, meaning where</text>
<text start="2256.2" dur="7.71999999999953">it is jumping: so if R1 is equal to
0 then it is jumping to this DSUB R4</text>
<text start="2263.92" dur="10.73">instruction. Now, what can be done? This instruction
can be replicated here; it can be replicated</text>
<text start="2274.65" dur="6.61">here in this delay slot. And then we can change
this branch target address; that means</text>
<text start="2281.26" dur="5.14">you can adjust it so that it will
be pointing to the next instruction, because this one</text>
<text start="2286.4" dur="11.0900000000005">is getting executed here. This particular
thing can be done whenever, in most of the</text>
<text start="2297.49" dur="6.57">situations, your prediction is that the branch
will be taken; then this is very advantageous.</text>
<text start="2304.06" dur="9.19999999999953">So, we are filling up the delay slot with
an instruction from the target, and by doing</text>
<text start="2313.26" dur="5.6">this you are able to improve the performance.
But improvement of performance takes place only</text>
<text start="2318.86" dur="2.77">when the branch is taken.</text>
<text start="2321.63" dur="7.61000000000047">So, yet another possibility is an instruction
from inside the taken path. So, here you are</text>
<text start="2329.24" dur="14.25">taking it from inside the taken path; means,
you can see, here is the delay slot. And this</text>
<text start="2343.49" dur="7.07">particular OR instruction can be moved into
that delay slot only if its execution does</text>
<text start="2350.56" dur="6.4">not disrupt the program execution. So, from
the taken path you have</text>
<text start="2356.96" dur="9.32">to fill up the delay slot by using this instruction.
So, this is another approach you can follow.</text>
<text start="2366.28" dur="14.8400000000005">Now, let us see an example. Here we have got
3 parts; so this is a loop: LD R1,</text>
<text start="2381.12" dur="11.2199999999995">0(R2); DSUBU R1, R1, R3; BEQZ R1,
L. Now, you can see L is the target</text>
<text start="2392.34" dur="6.71">address, this one. And this is essentially
the delay slot, I mean where you have to put</text>
<text start="2399.05" dur="5.06">your instruction, that means following the
branch. We have to fill up the delay</text>
<text start="2404.11" dur="7.98">slot where you have to put an instruction
in this particular code sequence.</text>
<text start="2412.09" dur="10.0700000000005">First thing that you can do is to take this
instruction DSUBU R1, R1, R3 and put it</text>
<text start="2422.16" dur="9.15">after this branch. Unfortunately this cannot be done,
because BEQZ is dependent on it; because</text>
<text start="2431.31" dur="8.88">of this dependency we cannot move this instruction
to after this branch instruction. So, the</text>
<text start="2440.19" dur="3.66">first approach
cannot be followed here.</text>
<text start="2443.85" dur="6.28">Now, what are the other alternatives? If we
know that the branch was almost always taken with</text>
<text start="2450.13" dur="8.63">a high probability, then DADDU could be moved
into the block B1; that means, you see,</text>
<text start="2458.76" dur="5.83">this is the target address. So, we can
take this instruction from the target and</text>
<text start="2464.59" dur="8.24">put it immediately after this instruction,
I mean immediately after this BEQZ R1, L, that</text>
<text start="2472.83" dur="7.56">means the branch instruction, if we know
that the branch is taken with high probability.</text>
<text start="2480.39" dur="7.96">So, if you have a high probability of the
branch being taken then this particular approach</text>
<text start="2488.35" dur="5.53">is followed, since it does not have any dependencies
on block B2. Since there is no dependency, it</text>
<text start="2493.88" dur="4.78000000000047">can be moved; I mean there
is no dependence of this instruction on that.</text>
<text start="2498.66" dur="7.79">So, it can be moved without any
problem, but this solution will be good</text>
<text start="2506.45" dur="9.4">whenever the branch is taken with high probability.
Now, what is the third possibility? The third</text>
<text start="2515.85" dur="5.98">possibility is that, knowing the branch was
not taken, the OR could be moved into</text>
<text start="2521.83" dur="5.75">block B1. So, this instruction, that is the
fall through; that is the approach that I told:</text>
<text start="2527.58" dur="8.88">the fall through. So, this instruction
can be moved, I mean, to the block</text>
<text start="2536.46" dur="5.92">immediately following this branch; the OR
can be moved into block B1 since it does not</text>
<text start="2542.38" dur="9.71">affect anything in B3. So, we can see we
have got 3 possibilities; we can fill</text>
<text start="2552.09" dur="7.83">the slot depending on different situations, and
this particular example illustrates the</text>
<text start="2559.92" dur="4.12">various alternatives possible.</text>
<text start="2564.04" dur="7.82">So, we can summarize the scheduling of branch
delay slot possibilities. First one is: the delay</text>
<text start="2571.86" dur="5.98">slot is scheduled with an independent instruction
from before the branch. So, this is the preferred</text>
<text start="2577.84" dur="5.84000000000047">schedule, as I have already mentioned. Second is:
the delay slot is scheduled from the target of</text>
<text start="2583.68" dur="6.86">the branch; you have to copy an instruction.
And this is useful only if the branch is taken, and</text>
<text start="2590.54" dur="4.38">it is preferred when the branch is taken
with high probability, such as a loop branch.</text>
<text start="2594.92" dur="6.32000000000047">So, in case of a loop branch, we have seen the branch
is taken with larger probability, so in such</text>
<text start="2601.24" dur="7.11999999999953">a situation this is suitable.
And the third possibility, as I have already told, is:</text>
<text start="2608.36" dur="6.08">the delay slot is scheduled from the not-taken fall
through; this is useful if the</text>
<text start="2614.44" dur="4.89">branch is not taken. And if the branch goes
in the other, unexpected direction, it should still</text>
<text start="2619.33" dur="8.46">produce the correct result. So, these are the
3 possibilities by which we can fill up the</text>
<text start="2627.79" dur="7.38">branch delay slot. And we have seen this is
how the performance can be improved.</text>
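The compiler's choice among the three strategies can be summarized as a simple decision procedure. The sketch below is a hypothetical illustration of that logic, not an actual compiler pass; the predicate names and the 0.5 probability threshold are assumptions.

```python
# A minimal sketch (hypothetical names and threshold) of choosing among the
# three delay-slot filling strategies just summarized.

def choose_delay_slot_strategy(has_independent_instr_before: bool,
                               prob_taken: float) -> str:
    """Pick a filling strategy for a single branch delay slot."""
    if has_independent_instr_before:
        # Preferred schedule: useful whatever the branch does.
        return "from before"
    if prob_taken >= 0.5:
        # Copy an instruction from the branch target; pays off when the
        # branch is usually taken, such as a loop-closing branch.
        return "from target"
    # Otherwise take an instruction from the not-taken fall-through path.
    return "from fall-through"

print(choose_delay_slot_strategy(False, 0.9))  # from target
```

A loop-closing branch (taken with high probability, no independent earlier instruction) lands on "from target", matching the preference stated above.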
<text start="2635.17" dur="6.23">And this particular diagram summarizes the
3 possibilities; first one is from before:</text>
<text start="2641.4" dur="6.59000000000047">that means this instruction which is before
the branch can be filled in here. And</text>
<text start="2647.99" dur="6.26">the second approach is from the target;
this is the target address; this is from</text>
<text start="2654.25" dur="8.39">the target. And we will copy this into the slot
and change the direction, I mean the pointer</text>
<text start="2662.64" dur="8.88">value where this branch target address is
there; that address has to be modified. From</text>
<text start="2671.52" dur="8.55">the fall through, this is filled up with this
instruction SUB R4, R5, R6. And as we have</text>
<text start="2680.07" dur="5.59000000000047">seen, this particular approach is suitable
when there is a probability of the branch being</text>
<text start="2685.66" dur="8.05">not taken. So, the basic objective of the
compiler, the job of the compiler, is to make</text>
<text start="2693.71" dur="4.56">the successor instructions valid and useful
in a delayed branch slot. And this is the philosophy</text>
<text start="2698.27" dur="3.98">that has been used to fill up the slot.</text>
<text start="2702.25" dur="10.21">Now, let us have some statistics on compiler
effectiveness for a single branch delay slot.</text>
<text start="2712.46" dur="6.91">So, it has been found that the compiler fills about
60 percent of the branch delay slots, and about</text>
<text start="2719.37" dur="5.88">80 percent of the instructions executed in
the branch delay slot are useful in computation.</text>
<text start="2725.25" dur="8.6">So, that means out of the 60 percent of
branch delay slots which are filled</text>
<text start="2733.85" dur="9.34">up, 80 percent are useful in computation.
So, that means that in about 50 percent of the cases</text>
<text start="2743.19" dur="6.09">these slots are usefully
filled. And so, in other words, we can tell</text>
<text start="2749.28" dur="9.03000000000047">that there is a 50 percent improvement,
you know, improvement compared to whenever we</text>
<text start="2758.31" dur="6.54">follow the first
approach where you introduce a stall.</text>
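The "about 50 percent" figure follows directly from multiplying the two quoted statistics; a quick arithmetic check:

```python
# Checking the "about 50 percent" figure from the two quoted statistics:
# 60% of delay slots are filled, and 80% of those do useful work.
filled = 0.60
useful_when_filled = 0.80
useful_slots = filled * useful_when_filled
print(round(useful_slots, 2))  # 0.48
```

So roughly 48 percent, about half, of all delay slots end up doing useful work, which is why the lecture rounds this to 50 percent.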
<text start="2764.85" dur="8.64">Now, we have not considered a very important
aspect of delayed branch. The downside is:</text>
<text start="2773.49" dur="5.46">what if multiple instructions are issued per clock
cycle? That means whenever it is a superscalar</text>
<text start="2778.95" dur="7.17">processor. We have considered the simple situation:
it is a pipelined processor, so for the branch delay</text>
<text start="2786.12" dur="4.83">slot we are issuing one instruction at a
time, and that delay slot has to be filled</text>
<text start="2790.95" dur="7.52">up with one instruction. If it is a superscalar
processor, then it is necessary to issue 2</text>
<text start="2798.47" dur="8.90999999999953">or 3 or 4, depending on the degree of the superscalar
processor. So, that many instructions</text>
<text start="2807.38" dur="6.26000000000047">have to be filled
into the delay slot. So, the task of the compiler</text>
<text start="2813.64" dur="7.6">becomes difficult in such a
situation for a superscalar processor.</text>
<text start="2821.24" dur="10.11">Now, here the performance for different alternatives
is given. So, pipeline speedup is equal</text>
<text start="2831.35" dur="11.17">to pipeline depth divided by 1 plus branch frequency
times branch penalty. You see that performance</text>
<text start="2842.52" dur="7.19">is dependent on 2 things: number 1 is branch
frequency, and second is branch penalty. And</text>
<text start="2849.71" dur="10.91">it has been found that this branch frequency
varies from program to program.</text>
<text start="2860.62" dur="9.94">And the improvement
that will take place depends</text>
<text start="2870.56" dur="6">not only upon the branch frequency, but also on
the branch penalty; and the branch penalty that</text>
<text start="2876.56" dur="6.73">will take place is dependent on
the approach that we are following, that means the</text>
<text start="2883.29" dur="5.11">technique that we are adopting to improve
the performance. So, the branch penalty</text>
<text start="2888.4" dur="5.38">will be dependent on that. So the first part, the
branch frequency, is dependent on the program,</text>
<text start="2893.78" dur="4.5">and the second part, the branch penalty, is dependent
on the approach that you are following.</text>
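The speedup formula just stated can be written out directly. This sketch assumes an ideal CPI of 1 for the pipelined machine with no other stalls, as the lecture does:

```python
# The speedup formula from the lecture, assuming an ideal base CPI of 1:
#   speedup = pipeline_depth / (1 + branch_frequency * branch_penalty)

def pipeline_speedup(depth: int, branch_freq: float,
                     branch_penalty: float) -> float:
    """Speedup over the unpipelined machine, including branch stalls."""
    cpi = 1 + branch_freq * branch_penalty  # effective cycles per instruction
    return depth / cpi

# With 14% branches and a 3-cycle penalty (the stall-pipeline case):
print(round(pipeline_speedup(5, 0.14, 3), 1))  # 3.5
```

With depth 5, a 14 percent branch frequency, and a 3-cycle penalty, the CPI is 1.42 and the speedup is about 3.5 rather than the ideal 5, matching the numbers quoted next.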
<text start="2898.28" dur="5.06">So, this is based on the assumption that 14
percent of the instructions are branches,</text>
<text start="2903.34" dur="5.86000000000047">that means the branch frequency is 14 percent,
and 30 percent of the branches</text>
<text start="2909.2" dur="6.86999999999953">are not taken, and assuming that 50 percent
of the delay slots can be filled with useful</text>
<text start="2916.07" dur="5.94">instructions. Based
on these assumptions, this is the result for</text>
<text start="2922.01" dur="8.39">different situations. First is the stall pipeline,
where we assume that the branch penalty is</text>
<text start="2930.4" dur="9.83">3 cycles. So, whenever you have got a branch
penalty of 3 cycles you get a CPI of 1.42.</text>
<text start="2940.23" dur="7.19">And the speedup with respect to unpipelined
is 3.5; ideally it should be 5, but what you get is</text>
<text start="2947.42" dur="9.75">3.5. So, that is the speedup with respect
to unpipelined whenever</text>
<text start="2957.17" dur="7.71">it incurs 3 stalls. And the speedup
with respect to the stall pipeline, of course,</text>
<text start="2964.88" dur="4.86000000000047">in this case will be 1. So,
whenever we are adopting the technique of</text>
<text start="2969.74" dur="4.8">introducing stalls, with respect to the stall
pipeline the speedup is 1.</text>
<text start="2974.54" dur="6.19">So, with this let us compare the other techniques.
First one is the fast stall pipeline. Fast stall</text>
<text start="2980.73" dur="6.79">pipeline means we have used additional hardware
to reduce the number of stalls. And as we</text>
<text start="2987.52" dur="7.73999999999953">have discussed, the branch penalty can be reduced
to 1 from 3; as we do that, the CPI</text>
<text start="2995.26" dur="8.75">improves from 1.42 to 1.14. So, there
is significant improvement in CPI, and we find</text>
<text start="3004.01" dur="9.89">that the speedup is 4.4; there is a
significant increase from 3.5 to 4.4.</text>
<text start="3013.9" dur="8.27">And the speedup with respect to the stall
approach is 1.26. Then if we consider</text>
<text start="3022.17" dur="9.28000000000047">the third case, the prediction is branch taken.
And we have seen that for our simple pipeline</text>
<text start="3031.45" dur="6.01">there will always be a loss of one cycle. So,
in this case it is effectively the same as</text>
<text start="3037.46" dur="8.01000000000047">the second approach. So, there is no performance
gain, which is quite obvious. So, the fast</text>
<text start="3045.47" dur="6.61999999999953">stall pipeline and predict taken give
the same performance; that means CPI remains</text>
<text start="3052.09" dur="4.1">1.14.
So, in this case also there is a loss of 1 cycle.</text>
<text start="3056.19" dur="7.62">And in this case also there is always a loss
of 1 cycle, so CPI remains 1.14, and the speedup</text>
<text start="3063.81" dur="6.61">with respect to unpipelined also remains the same. And
the speedup with respect to stall is 1.26; that</text>
<text start="3070.42" dur="7.48">also remains the same. For predict not taken,
in this case the branch penalty is 0.7. We</text>
<text start="3077.9" dur="5.49">have seen that in 30 percent of the cases the branch
is not taken and incurs no penalty.</text>
<text start="3083.39" dur="10.1199999999995">So, the branch penalty is 0.7 and the CPI is 1.10.
So, we find that CPI is improving compared</text>
<text start="3093.51" dur="7.15000000000047">to the previous case, and there is a consequent
improvement in speedup with respect to unpipelined.</text>
<text start="3100.66" dur="5.94">And also there is a consequent speedup with
respect to stalls: whenever we take up this</text>
<text start="3106.6" dur="8.85">approach, the speedup with respect to stall is 1.29. And last but not least
is the delayed branch approach, in which case</text>
<text start="3115.45" dur="10.8">the branch penalty is 0.5 and the CPI is 1.07.
So, it is very close to 1. And the speedup</text>
<text start="3126.25" dur="7.63">is also very good, 4.7 with respect to the
ideal unpipelined case, and the speedup with respect</text>
<text start="3133.88" dur="11.4">to stall is 1.34. So, here we call
it static branch prediction: the prediction</text>
<text start="3145.28" dur="6.77">is done with the help of a compiler. And the
compiler can move a lot of instructions to further</text>
<text start="3152.05" dur="6.46">improve the speedup, as we have already
discussed.</text>
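The whole comparison just walked through can be regenerated from the one formula. This is a sketch under the stated assumptions (depth 5, 14 percent branches, ideal base CPI of 1); the computed ratios may differ from the quoted table in the last decimal because the lecture rounds intermediate values.

```python
# Sketch reproducing the comparison of schemes, under the lecture's
# assumptions: depth 5, 14% branches, ideal base CPI of 1.

SCHEMES = {                       # scheme -> average branch penalty (cycles)
    "stall pipeline": 3.0,
    "fast stall / predict taken": 1.0,
    "predict not taken": 0.7,     # 70% of branches are taken, each pays 1 cycle
    "delayed branch": 0.5,        # 50% of delay slots filled usefully
}

def cpi(branch_freq: float, branch_penalty: float) -> float:
    """Effective cycles per instruction with branch stalls included."""
    return 1 + branch_freq * branch_penalty

for name, penalty in SCHEMES.items():
    c = cpi(0.14, penalty)
    print(f"{name}: CPI={c:.2f}, "
          f"speedup vs unpipelined={5 / c:.1f}, "
          f"speedup vs stall={cpi(0.14, 3.0) / c:.2f}")
```

Running this gives CPIs of 1.42, 1.14, 1.10, and 1.07, and unpipelined speedups of about 3.5, 4.4, 4.5, and 4.7, in line with the figures quoted above.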
<text start="3158.51" dur="9.37">And later on we shall consider another approach,
particularly important because of the importance</text>
<text start="3167.88" dur="9.23">of stall reduction, which is crucial in modern processors
that issue and execute multiple instructions</text>
<text start="3177.11" dur="5.69">every cycle. So, we need to have a steady
stream of instructions to keep the hardware busy, and</text>
<text start="3182.8" dur="5.89">stalls due to control hazards dominate. So,
this importance of stall reduction is very</text>
<text start="3188.69" dur="6.93">high. And so far we have looked at static
schemes for reducing branch penalties, where the</text>
<text start="3195.62" dur="5.01">same scheme applies to every branch instruction.
That means, what do you mean by static?</text>
<text start="3200.63" dur="5.42">Static means if there are hundred branches,
for all the hundred branches you are adopting</text>
<text start="3206.05" dur="4.53000000000047">the same policy, because it is done by the
compiler statically.</text>
<text start="3210.58" dur="5.71">However, there is potential for increased benefits
from dynamic schemes. What do you mean by a</text>
<text start="3216.29" dur="7.73999999999953">dynamic scheme? Here, dynamically,
while instruction execution is in progress,</text>
<text start="3224.03" dur="8.33">for a particular branch the prediction can be
not taken, and for another branch the prediction can</text>
<text start="3232.36" dur="5.91">be taken; so it will dynamically keep on changing
as the instruction execution takes</text>
<text start="3238.27" dur="6.06">place. And that is done at execution time
with the help of hardware. So, it can choose an</text>
<text start="3244.33" dur="5.85">appropriate scheme separately for each
instruction.</text>
<text start="3250.18" dur="7.42999999999953">So, the branches to the top of a loop have different
behavior, taken or not taken, and the hardware can learn an</text>
<text start="3257.61" dur="4.79">appropriate scheme based on the observed behavior.
And dynamic branch prediction schemes can be</text>
<text start="3262.4" dur="8.13">used for both direction prediction, taken or not taken,
and target prediction. So, in my next lecture</text>
<text start="3270.53" dur="5.92000000000047">I shall discuss in detail these dynamic techniques,
that means dynamic prediction schemes.</text>
<text start="3276.45" dur="6.82999999999953">And we shall see how the performance
is improved by adopting the dynamic technique.</text>
<text start="3283.28" dur="0.99">Thank you.</text>

</transcript>
