Pipelining — Instruction Set Parallelism (ISP)
A pipelined implementation
[Figure: pipeline timing diagram — successive instructions advance through the IF, RD, ALU, MEM, and WB stages in overlapping fashion over clock cycles 0–13.]
What is required to pipeline the datapath?
Note that in a pipelined implementation, every instruction passes
through each pipeline stage. This is quite different from the multi-
cycle implementation, where a cycle is omitted if it is not required.
For example, this means that for every instruction requiring a register
write, this action happens four clock periods after the instruction is
fetched from instruction memory.
Furthermore, if an instruction requires no action in a particular
pipeline stage, any information required by a later stage must be
“passed through.”
The next figure shows a first attempt at the datapath with pipeline
registers added.
[Figure: a first attempt at the pipelined datapath — the single-cycle datapath (PC, instruction memory, register file, ALU, data memory) with pipeline registers inserted between the IF, ID, EX, MEM, and WB stages.]
It is useful to note the changes that have been made to the datapath.
The most obvious change is, of course, the addition of the pipeline
registers.
The addition of these registers introduces some questions:
How large should the pipeline registers be?
Will they be the same size in each stage?
The next change is to the location of the MUX that updates the PC.
This must be associated with the IF stage. In this stage, the PC
should also be incremented.
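As a rough sketch of an answer (the layout below is illustrative, not the register layout of any particular implementation), the first and third pipeline registers might be described in C as follows. Because each stage carries different information, the registers are not all the same size:

#include <stdint.h>

/* Hypothetical pipeline register contents for a MIPS-like 32-bit
   datapath; the field names and widths are assumptions. */
struct if_id {                /* written by IF, read by ID            */
    uint32_t pc_plus_4;       /* incremented PC, passed through       */
    uint32_t instruction;     /* the fetched instruction              */
};                            /* about 64 bits                        */

struct ex_mem {               /* written by EX, read by MEM           */
    uint32_t branch_target;   /* PC + 4 + (sign-extended offset << 2) */
    uint32_t alu_result;      /* also serves as the memory address    */
    uint32_t store_data;      /* register value for a store           */
    uint8_t  dest_reg;        /* destination register number, for WB  */
    uint8_t  zero;            /* ALU Zero flag, used by beq           */
    /* ...plus the control bits needed in the MEM and WB stages...    */
};                            /* roughly 100 bits of data and control */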
Pipeline control
For our processor example, since the datapath elements are the same
as for the single-cycle processor, the control signals required are
similar, and can be implemented in a similar way.
All the signals can be generated early (in the ID stage) and passed
along the pipeline until they are required.
[Figure: the pipelined datapath with control — the control signals are decoded in ID from Inst[31−26] and carried along in the pipeline registers: RegDst, ALUSrc, and ALUop are used in EX; Branch, MemRead, and MemWrite in MEM; RegWrite and MemtoReg in WB; PCSrc selects the branch target.]
Executing an instruction
• load
• store
• beq
[Figures: a sequence of datapath diagrams (with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers) showing an instruction advancing one stage per clock cycle through IF, ID, EX, MEM, and WB.]
Representing a pipeline pictorially
[Figure: pictorial pipeline representation — each instruction is drawn as a row of IF, ID, ALU, MEM, and WB stages, offset by one cycle from the instruction before it; a lw is drawn as IM, REG, ALU, DM, REG.]
Structural hazards
Control hazards
What happens to the instructions in the pipeline following a success-
ful branch?
There are several possibilities.
One is to stall the instructions following a branch until the branch
result is determined. (Some texts refer to a stall as a “bubble.”)
This can be done by the hardware, stopping (or stalling) the pipeline
for several cycles when a branch instruction is detected.
[Figure: pipeline diagram of a lw (IF, ID, ALU, MEM, WB) held back by bubbles until the branch result is known.]
Another possibility is to execute the instructions in the pipeline. It
is left to the compiler to ensure that those instructions are either
nops or useful instructions which should be executed regardless of
the branch test result.
This is, in fact, what was done in the MIPS. It had one “branch delay
slot,” which the compiler could fill with a useful instruction about
50% of the time.
We saw earlier that branches are quite common, and inserting many
stalls or nops is inefficient.
Branch prediction
Branches are problematic in that they are frequent, and cause inef-
ficiencies by requiring pipeline flushes. In deep pipelines, the cost of
these flushes can be substantial.
Data hazards
Note that $2 is written by the first instruction, and read by the
second.
In our pipelined implementation, however, $2 is not written until
four cycles after the second instruction begins, and therefore three
bubbles or nops would have to be inserted before the correct value
would be read.
[Figure: pipeline diagram of add $2, $1, $3 (IM, REG, ALU, DM, REG) followed by instructions that read $2, marking the data hazard.]
An astute observer could note that the result of the ALU operation
is stored in the pipeline register at the end of the ALU stage, two
cycles before it is written into the register file.
If instructions could take the value from the pipeline register, it could
reduce or eliminate many of the data hazards.
This idea is called forwarding.
The following figure shows how forwarding would help in the pipeline
example shown earlier.
[Figure: the same pipeline diagram with forwarding — the ALU result of add $2, $1, $3 is fed directly from the pipeline registers to the ALU inputs of the following instructions.]
Note from the previous examples that there are now two potential
additional sources of operands for the ALU during the EX cycle —
the EX/MEM pipeline register, and the MEM/WB pipeline register.
[Figure: the EX stage with forwarding multiplexers — ForwardA and ForwardB select each ALU operand from the register file value in ID/EX, the ALU result in EX/MEM, or the value in MEM/WB; the rt/rd destination MUX is unchanged.]
Forwarding control
The register control signals ForwardA and ForwardB have values
defined as:

MUX control  Source   Explanation
00           ID/EX    Operand comes from the register file (no forwarding)
01           MEM/WB   Operand forwarded from a memory operation or an earlier ALU operation
10           EX/MEM   Operand forwarded from the previous ALU operation
The conditions for a hazard with a value in the EX/MEM stage are:

if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
then ForwardA = 10

if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
then ForwardB = 10
For hazards with the MEM/WB stage, an additional constraint is
required in order to make sure the most recent value is used:

if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
then ForwardA = 01

if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRt)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
then ForwardB = 01
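These conditions translate almost directly into code. The following C sketch models the forwarding unit as combinational logic; the structure and field names are assumptions made for illustration, not part of the original design:

#include <stdint.h>

enum { FWD_NONE = 0, FWD_MEMWB = 1, FWD_EXMEM = 2 };  /* 00, 01, 10 */

/* Pipeline-register fields used by the forwarding unit
   (names assumed for illustration). */
struct fwd_in {
    int     ex_mem_reg_write, mem_wb_reg_write;
    uint8_t ex_mem_rd, mem_wb_rd;   /* destination register numbers */
    uint8_t id_ex_rs, id_ex_rt;     /* ALU source register numbers  */
};

static int forward_a(const struct fwd_in *p)
{
    /* EX hazard: forward the most recent ALU result */
    if (p->ex_mem_reg_write && p->ex_mem_rd != 0 &&
        p->ex_mem_rd == p->id_ex_rs)
        return FWD_EXMEM;                       /* 10 */
    /* MEM hazard: forward from MEM/WB only when EX/MEM does not
       already supply a newer value for the same register */
    if (p->mem_wb_reg_write && p->mem_wb_rd != 0 &&
        p->ex_mem_rd != p->id_ex_rs &&
        p->mem_wb_rd == p->id_ex_rs)
        return FWD_MEMWB;                       /* 01 */
    return FWD_NONE;                            /* 00: register file */
}

/* forward_b is identical, with id_ex_rt in place of id_ex_rs. */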
The datapath with the forwarding control is shown in the next figure.
[Figure: the datapath with the forwarding unit added — the unit compares EX/MEM.RegisterRd and MEM/WB.RegisterRd against the rs and rt fields held in ID/EX, and drives the ForwardA and ForwardB multiplexer controls.]
For a datapath with forwarding, the hazards which are fixed by for-
warding are not considered hazards any more.
Forwarding for other instructions
There is a situation which cannot be handled by forwarding, however.
Consider a load followed by an R-type operation:
Here, the data from the load is not ready when the R-type instruction
requires it — we have a hazard.
What can be done here?
The condition under which the “hazard detection circuit” is required
to insert a pipeline stall is when an operation requiring the ALU
follows a load instruction, and one of the operands comes from the
register to be written:

if (ID/EX.MemRead
    and ((ID/EX.RegisterRt = IF/ID.RegisterRs)
      or (ID/EX.RegisterRt = IF/ID.RegisterRt)))
then STALL
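As a sketch (again with assumed structure and field names), the hazard detection logic can be modelled as:

#include <stdint.h>

/* Load-use hazard detection: the instruction in EX is a load, and the
   instruction in ID reads the register the load will write. */
struct hazard_in {
    int     id_ex_mem_read;       /* instruction in EX is a load       */
    uint8_t id_ex_rt;             /* the load's destination register   */
    uint8_t if_id_rs, if_id_rt;   /* sources of the instruction in ID  */
};

static int must_stall(const struct hazard_in *h)
{
    return h->id_ex_mem_read &&
           (h->id_ex_rt == h->if_id_rs || h->id_ex_rt == h->if_id_rt);
}

/* On a stall, the PC and the IF/ID register are held for one cycle, and
   the control fields in ID/EX are zeroed to insert a bubble. */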
Forwarding with branches
For the beq instruction, if the comparison is done in the ALU, the
forwarding already implemented is sufficient.
In order to correctly implement this instruction in a processor with
forwarding, both forwarding and hazard detection must be employed.
The forwarding must be similar to that for the ALU instructions,
and the hazard detection similar to that for the load/ALU type in-
structions.
Presently, most processors do not use a “branch delay slot” for branch
instructions, but use branch prediction.
Typically, there is a small amount of memory contained in the pro-
cessor which records information about the last few branch decisions
for each branch.
In fact, individual branches are not identified directly in this memory;
the low order address bits of the branch instruction are used as an
identifier for the branch.
This means that sometimes several branches will be indistinguishable
in the branch prediction unit. (The frequency of this occurrence
depends on the size of the memory used for branch prediction.)
Exceptions and interrupts
[Figure: the pipelined datapath with exception support — flush multiplexers (selecting zeros) clear the instructions in the IF, ID, and EX stages, Cause and exception PC registers are added, and the PC can be loaded with the exception handler address 4000 0040hex.]
Interrupts can be handled in a way similar to that for exceptions.
Here, though, the instruction presently being completed may be al-
lowed to finish, and the pipeline flushed.
(Another possibility is to simply allow all instructions presently in
the pipeline to complete, but this will increase the interrupt latency.)
The value of the PC + 4 is stored in the EPC, and this will be the
return address from the interrupt, as discussed earlier.
Note that the effect of an interrupt on every instruction will have to
be carefully considered — what happens if an interrupt occurs near
a branch instruction?
Superscalar and superpipelined processors
Dynamic pipeline scheduling
(This clearly shows that the designers anticipated that there would
be many instructions issued — on average 1/3 of the instructions —
that would be aborted.)
[Figure: dynamically scheduled pipeline — an instruction fetch and decode unit issues instructions in order to several functional units, and a commit unit retires results in order.]
A generic view of the Pentium P-X and the Power PC
pipeline
[Figure: instruction and data caches feed a decode/dispatch unit, which issues to branch, integer, complex integer, load/store, and floating point units; loads and stores access the data cache, and a reorder buffer and commit unit retire results in order.]
Speculative execution
Effects of Instruction Set Parallelism on programs
We have seen that data and control hazards can sometimes dramat-
ically reduce the potential speed gains that ISP, and pipelining in
particular, offer.
Programmers (and/or compilers) can do several things to mitigate
against this. In particular, compiler technology has been developed
to provide code that can run more effectively on such datapaths.
We will look at some simple code modifications that are commonly
used in compilers to develop more efficient code for processors with
ISP.
Let us write this code in simple MIPS assembler, assuming that
we have a multiply instruction that is similar in form to the add
instruction, and that A, X, and Y are 32 bit integer values. Further,
assume N is already stored in register $s1, A is stored in register
$s2, the start of array X is stored in register $s3, and the start of
array Y is stored in register $s4. Variable i will use register $s0.
This is a fairly direct implementation of the loop, and is not the most
efficient code.
For example, the variable i need not be implemented in this code;
we could use the array index for one of the vectors instead, and use
the final array address (+4) as the termination condition.
Also, this code has numerous data dependencies, some of which may
be reduced by reordering the code.
Using this idea, register $s1 would now be set by the compiler to
have the value of the start of array X (or Y), plus 4 × (N + 1).
Reordering and rescheduling the previous code for the MIPS:
Loop unrolling
The following is the rescheduled assembly code for the unrolled loop.
Note that the number of nop instructions is reduced, as is the
number of array pointer additions.
Two additional registers ($t2 and $t3) were required.
Loop merging
In the previous code, there is one more optimization that can improve
the performance (for both pipelined and non-pipelined implementa-
tions). It is equivalent to the following:
Recursion and ISP
main ()
{
    printf ("The factorial of 10 is %d\n", fact(10));
}
main:
subiu $sp,$sp,32 # Allocate stack space for return
# address and local variables (32
# bytes minimum, by convention).
# (Stack "grows" downward.)
sw $ra, 20($sp) # Save return address
sw $fp, 16($sp) # Save old frame pointer
addiu $fp, $sp, 28 # Set up frame pointer
# restore saved registers
lw $ra, 20($sp) # restore return address
lw $fp, 16($sp) # Restore old frame pointer
addiu $sp, $sp, 32 # Pop stack frame
jr $ra # return to caller (shell)
.rdata
$LC:
.ascii "The factorial of 10 is "
Now the factorial function itself, first setting up the function call
stack, then evaluating the function, and finally restoring saved regis-
ter values and returning:
# factorial function
.text # Text section
fact:
subiu $sp,$sp,32 # Allocate stack frame (32 bytes)
sw $ra, 20($sp) # Save return address
sw $fp, 16($sp) # Save old frame pointer
addiu $fp, $sp, 28 # Set up frame pointer
# here we do the required calculation
# first check for terminal condition
# do recursion
$L2:
subiu $a0, $a0, 1 # subtract 1 from n
jal fact # jump to factorial function
# returning fact(n-1) in $v0
lw $v1, 0($fp) # Load n (saved earlier) into $v1
mul $v0, $v0, $v1 # compute (fact(n-1) * n)
# and return result in $v0
For this simple example, the data dependency in the recursion relates
to register $v1.
Branch prediction revisited
[Figure: two-bit branch prediction state machine — the strongly taken and weakly taken states predict taken; the weakly not taken and strongly not taken states predict not taken. Each taken outcome moves the state toward strongly taken and each not-taken outcome toward strongly not taken, so two successive mispredictions are needed to change the prediction.]
Again, looking at what happens in a loop that is repeated, at the end
of the loop there will be a misprediction, and the state machine will
move to the “weakly taken” state. The next time the loop is entered,
the prediction will still be correct, and the state machine will again
move to the “strongly taken” state.
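A two-bit predictor of this kind is easy to model in software. The following C sketch assumes a small table indexed by the low-order bits of the (word-aligned) branch address; the table size is an arbitrary choice for illustration, and, as noted above, distinct branches may alias to the same counter:

#include <stdint.h>

#define BHT_BITS 10
#define BHT_SIZE (1u << BHT_BITS)

/* One two-bit saturating counter per table entry:
   0, 1 predict not taken; 2, 3 predict taken. */
static uint8_t bht[BHT_SIZE];

static int predict(uint32_t pc)
{
    return bht[(pc >> 2) & (BHT_SIZE - 1)] >= 2;
}

static void train(uint32_t pc, int taken)
{
    uint8_t *c = &bht[(pc >> 2) & (BHT_SIZE - 1)];
    if (taken  && *c < 3) (*c)++;    /* saturate at strongly taken     */
    if (!taken && *c > 0) (*c)--;    /* saturate at strongly not taken */
}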
The Memory Architecture
Memory
Memory is often the largest single component in a system, and con-
sequently requires some special care in its design. Of course, it is
possible to use the simple register structures we have seen earlier,
but for large blocks of memory these are usually wasteful of chip
area.
For designs which require large amounts of memory, it is typical to
use “standard” memory chips — these are optimized to provide large
memory capacity at high speed and low cost.
There are two basic types of memory – static and dynamic. The
static memory is typically more expensive and with a much lower
capacity, but very high access speed. (This type of memory is often
used for high performance cache memories.) The single transistor
dynamic memory is usually the cheapest RAM, with very high ca-
pacity, but relatively slow. It also must be refreshed periodically (by
reading or writing) to preserve its data. (This type of memory is
typically used for the main memory in computer systems.)
The following diagrams show the basic structures of some commonly
used memory cells in random access memories.
Static memory
[Figure: static memory cell — a pair of cross-coupled inverters (M1–M4, with pull-ups M2 and M4 to VDD) holds the stored bit; access transistors M5 and M6, gated by X-enable, connect the cell to the bit lines through M7 and M8.]
4-transistor dynamic memory — the pull-up transistors are shared
among a column of cells. Refresh is accomplished here by switching
in the pull-up transistors M9 and M10.
[Figure: 4-transistor dynamic memory cell — as the static cell, but the pull-ups are replaced by transistors M9 and M10, switched in by the refresh line and shared among a column of cells.]
3-transistor dynamic memory — here, the inverter on the left of the
original static cell is also added to the refresh circuitry.
[Figure: 3-transistor dynamic memory cell — the value is stored on the gate capacitance of M3, with M5 and M6 as access transistors; the refresh circuitry, controlled by signals R, P, and W, re-forms the inverter that was on the left of the static cell; data in and data out are gated by Y-enable.]
For refresh, initially R=1, P=1, W=0 and the contents of memory
are stored on the capacitor. R is then set to 0, and W to 1, and the
value is stored back in memory, after being restored in the refresh
circuitry.
1-transistor dynamic memory
[Figure: 1-transistor dynamic memory cell — a single access transistor M5, gated by X-enable, connects a storage capacitor to the data in/out line and to the refresh and control circuitry.]
This memory cell is not only dynamic, but a read destroys the con-
tents of the memory (discharges the capacitor), and the value must
be rewritten. The memory state is determined by the charge on the
capacitor, and this charge is detected by a sense amplifier in the
control circuitry. The amount of charge required to store a value
reliably is important in this type of cell.
For the 1-transistor cell, there are several problems; the gate ca-
pacitance is too small to store enough charge, and the readout is
destructive. (They are often made with a capacitor constructed over
the top of the transistor, to save area.) Also, the output signal is
usually quite small. (1M bit dynamic RAM’s may store a bit using
only ≃ 50,000 electrons!) This means that the sense amplifier must
be carefully designed to be sensitive to small charge differences, as
well as to respond quickly to changes.
The evolution of memory cells
[Figure: the evolution of memory cells — the static cell, the 4-transistor dynamic cell, the 3-transistor dynamic cell, and the 1-transistor dynamic cell, with transistors progressively removed.]
The following slides show some of the ways in which single transistor
memory cells can be reduced in area to provide high storage densities.
(Taken from “Advanced Cell Structures for Dynamic RAMs,”
IEEE Circuits and Devices, Vol. 5, No. 1, pp. 27–36.)
The first figure shows a transistor beside a simple two plate capacitor,
with both the capacitor and the transistor fabricated on the plane of
the surface of the silicon substrate:
The next figure shows a “stacked” transistor structure in which the
capacitor is constructed over the top of the transistor, in order to
occupy a smaller area on the chip:
Another way in which the area of the capacitor is reduced is by
constructing the capacitor in a “trench” in the silicon substrate. This
requires etching deep, steep-walled structures in the surface of the
silicon:
Another useful memory cell, for particular applications, is a dual-port
(or n-port) memory cell. This can be accomplished in the previous
memory cells by adding a second set of x-enable and y-enable lines,
as follows:
[Figure: dual-port static memory cell — a second pair of access transistors and a second set of enable lines (X0-enable and X1-enable) allow two independent accesses to the same cell.]
The memory hierarchy
A modern high speed disk has a track-to-track latency of about
1 ms, and the disk rotates at a speed of 7200 RPM. The disk therefore
makes one revolution in 1/120th of a second, or about 8.3 ms. The
average rotational latency (half a revolution) is therefore about
4.2 ms. Faster disks (using smaller diameter platters) can rotate
even faster.
A typical memory system, connected to a medium-to-large size com-
puter (a desktop or server configuration) might consist of the follow-
ing:
[Figure: a typical memory system — the CPU connects to a cache and main memory, and a disk controller connects the memory system to several disks.]
Cache memory
When a cache is used, there must be some way in which the memory
controller determines whether the value currently being addressed in
memory is available from the cache. There are several ways that this
can be accomplished. One possibility is to store both the address and
the value from main memory in the cache, with the address stored in
a type of memory called associative memory or, more descriptively,
content addressable memory.
An associative memory, or content addressable memory, has the
property that when a value is presented to the memory, the address
of the value is returned if the value is stored in the memory, otherwise
an indication that the value is not in the associative memory is re-
turned. All of the comparisons are done simultaneously, so the search
is performed very quickly. This type of memory is very expensive,
because each memory location must have both a comparator and a
storage element. A cache memory can be implemented with a block
of associative memory, together with a block of “ordinary” memory.
The associative memory holds the address of the data stored in the
cache, and the ordinary memory contains the data at that address.
[Figure: one word of an associative memory — the input address is compared, by a per-word comparator, against the stored address.]
Such a fully associative cache memory might be configured as shown:
[Figure: a fully associative cache — an associative memory holding addresses, side by side with an ordinary memory holding the corresponding data; a match on the input address reads out the data.]
If the address is not found in the associative memory, then the value
is obtained from main memory.
Associative memory is very expensive, because a comparator is re-
quired for every word in the memory, to perform all the comparisons
in parallel.
A cheaper way to implement a cache memory, without using expen-
sive associative memory, is to use direct mapping. Here, part of
the memory address (the low order digits of the address) is used to
address a word in the cache. This part of the address is called the
index. The remaining high-order bits in the address, called the tag,
are stored in the cache memory along with the data.
For example, if a processor has an 18 bit address for memory, and
a cache of 1 K words of 2 bytes (16 bits) length, and the processor
can address single bytes or 2 byte words, we might have the memory
address field and cache organized as follows:
[Figure: direct-mapped cache organization — the 18-bit memory address divides into a tag (bits 17–11), a 10-bit index (bits 10–1) selecting one of 1024 cache words, and a byte-select bit (bit 0); each cache entry holds the tag, the data word, parity bits, and a valid bit.]
This was, in fact, the way the cache was organized in the PDP-11/60.
In the 11/60, however, there are 4 other bits used to ensure that the
data in the cache is valid. 3 of these are parity bits; one for each byte
and one for the tag. The parity bits are used to check that a single
bit error has not occurred to the data while in the cache. A fourth
bit, called the valid bit is used to indicate whether or not a given
location in cache is valid.
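A lookup in this organization can be sketched in C as follows (the types are simplified and parity checking is omitted; the field widths follow the address layout above):

#include <stdint.h>

/* Direct-mapped cache in the style described above: an 18-bit address
   with a 1-bit byte field, a 10-bit index, and a 7-bit tag. */
struct line { uint8_t valid; uint8_t tag; uint16_t data; };
static struct line cache[1024];

static int lookup(uint32_t addr, uint16_t *word)
{
    uint32_t index = (addr >> 1)  & 0x3FF;  /* bits 10..1  */
    uint32_t tag   = (addr >> 11) & 0x7F;   /* bits 17..11 */

    if (cache[index].valid && cache[index].tag == tag) {
        *word = cache[index].data;          /* cache hit */
        return 1;
    }
    return 0;   /* miss: fetch from main memory, then fill cache[index] */
}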
In the PDP-11/60, the data path from memory to cache was the
same size (16 bits) as from cache to the CPU. (In the PDP-11/70,
a faster machine, the data path from the CPU to cache was 16 bits,
while from memory to cache was 32 bits, which means that the cache
had effectively prefetched the next instruction approximately half
of the time.) The number of consecutive words taken from main
memory into the cache on each memory fetch is called the line size
of the cache. A large line size allows the prefetching of a number
of instructions or data words. All items in a line of the cache are
replaced simultaneously, however, resulting in a larger block of data
being replaced for each cache miss.
[Figure: direct-mapped cache with a two-word line — each of the 1024 lines holds a tag, WORD 0, and WORD 1; the 18-bit address divides into tag (bits 17–12), index (bits 11–2), word select (bit 1), and byte select (bit 0).]
For a similar 2K word (or 8K byte) cache, the MIPS processor would
typically have a cache configuration as follows:
[Figure: the corresponding MIPS cache organization — a 32-bit address divides into tag (bits 31–13), index (bits 12–2), and byte select (bits 1–0), again with 1024 indexed entries.]
A characteristic of the direct mapped cache is that a particular
memory address can be mapped into only one cache location.
Many memory addresses are mapped to the same cache location (in
fact, all addresses with the same index field are mapped to the same
cache location.) Whenever a “cache miss” occurs, the cache line will
be replaced by a new line of information from main memory at an
address with the same index but with a different tag.
Note that if the program “jumps around” in memory, this cache
organization will likely not be effective because the index range is
limited. Also, if both instructions and data are stored in cache, it
may well happen that both map into the same area of cache, and
may cause each other to be replaced very often. This could happen,
for example, if the code for a matrix operation and the matrix data
itself happened to have the same index values.
A more interesting configuration for a cache is the set associative
cache, which uses a set associative mapping. In this cache organiza-
tion, a given memory location can be mapped to more than one cache
location. Here, each index corresponds to two or more data words,
each with a corresponding tag. A set associative cache with n tag
and data fields is called an “n-way set associative cache”. Usually
n = 2^k, for k = 1, 2, 3, is chosen for a set associative cache (k = 0
corresponds to direct mapping). Such n-way set associative caches
allow interesting tradeoff possibilities; cache performance can be im-
proved by increasing the number of “ways”, or by increasing the line
size, for a given total amount of memory. An example of a 2–way set
associative cache is shown following, which shows a cache containing
a total of 2K lines, or 1 K sets, each set being 2–way associative.
(The sets correspond to the rows in the figure.)
[Figure: a 2-way set associative cache with 1 K sets (indices 0–1023), each set holding two tag/data lines.]
In a 2-way set associative cache, if one data line is empty for a read
operation corresponding to a particular index, then it is filled. If both
data lines are filled, then one must be overwritten by the new data.
Similarly, in an n-way set associative cache, if all n data and tag fields
in a set are filled, then one value in the set must be overwritten, or
replaced, in the cache by the new tag and data values. Note that an
entire line must be replaced each time.
The line replacement algorithm
• First in, first out (FIFO) — here the first value stored in the
cache, at each index position, is the value to be replaced. For
a 2-way set associative cache, this replacement strategy can be
implemented by setting a pointer to the previously loaded word
each time a new word is stored in the cache; this pointer need
only be a single bit. (For set sizes > 2, this algorithm can be
implemented with a counter value stored for each “line”, or index
in the cache, and the cache can be filled in a “round robin”
fashion).
• Least recently used (LRU) — here the value which was actu-
ally used least recently is replaced. In general, it is more likely
that the most recently used value will be the one required in the
near future. For a 2-way set associative cache, this is readily
implemented by adding a single “USED” bit to each cache location:
when a value is accessed, the USED bit of the other word in the
set is set, and the bit for the word which was accessed is reset.
The value to be replaced is then the value with its USED bit set.
For an n-way set associative cache, this strategy can be implemented
by storing a modulo n counter with each data word, as sketched
below for the 2-way case. (It is an interesting exercise to determine
exactly what must be done for larger n. The required circuitry may
become somewhat complex.)
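A minimal sketch of the 2-way case, with assumed types (keeping the index of the least recently used way is equivalent to keeping a USED bit per word):

#include <stdint.h>

struct set {
    struct { uint8_t valid; uint32_t tag; /* ... data ... */ } way[2];
    uint8_t lru;              /* index of the least recently used way */
};

static void touch(struct set *s, int w)    /* called on every hit/fill */
{
    s->lru = 1 - w;           /* the other way becomes the victim */
}

static int victim(const struct set *s)     /* way to replace on a miss */
{
    if (!s->way[0].valid) return 0;        /* prefer an empty way */
    if (!s->way[1].valid) return 1;
    return s->lru;
}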
Cache memories normally allow one of two things to happen when
data is written into a memory location for which there is a value
stored in cache:
• Write through cache — both the cache and main memory are
updated at the same time. This may slow down the execution
of instructions which write data to memory, because of the rel-
atively longer write time to main memory. Buffering memory
writes can help speed up memory writes if they are relatively
infrequent, however.
Real cache performance
The following figures show the behavior (actually the miss ratio,
which is equal to 1 – the hit ratio) for direct mapped and set as-
sociative cache memories with various combinations of total cache
memory capacity, line size and degree of associativity.
The graphs are from simulations of cache performance using cache
traces collected from the SPEC92 benchmarks, for the paper “Cache
Performance of the SPEC92 Benchmark Suite,” by J. D. Gee, M. D.
Hill, D. N. Pnevmatikatos and A. J. Smith, in IEEE Micro, Vol. 13,
Number 4, pp. 17-27 (August 1993).
The processor used to collect the traces was a SUN SPARC processor,
which has an instruction set architecture similar to the MIPS.
The data is from benchmark programs, and although they are “real”
programs, the data sets are limited, and the size of the code for the
benchmark programs may not reflect the larger size of many newer
or production programs.
The figures show the performance of a mixed cache. The paper shows
the effect of separate instruction and data caches as well.
[Figure: miss ratio (log scale, 0.001–0.1) versus line size (16–256 bytes) for a direct mapped cache, with one curve per cache size from 1 K to 1024 K.]
This figure shows that increasing the line size usually decreases the
miss ratio, unless the line size is a significant fraction of the cache
size (i.e., the cache should contain more than a few lines.)
Note that increasing the line size is not always effective in increas-
ing the throughput of the processor, because of the additional time
required to transfer large lines of data from main memory.
[Figure: miss ratio (log scale) versus cache size (1–1024 Kbytes) for a direct mapped cache, with one curve per line size from 16 to 256 bytes.]
This figure shows that the miss ratio drops consistently with cache
size. (The plot is for a direct mapped cache, using the same data as
the previous figure, replotted to show the effect of increasing the size
of the cache.)
[Figure: miss ratio (log scale) versus associativity (1-, 2-, 4-, 8-way, and fully associative), with one curve per cache size from 1 K to 1024 K.]
For large caches the associativity, or “way size,” becomes less impor-
tant than for smaller caches.
Still, the miss ratio for a larger way size is always better.
[Figure: miss ratio (log scale) versus cache size (1–1024 Kbytes), with one curve per associativity (direct, 2-way, 4-way, 8-way, fully associative).]
This is the previous data, replotted to show the effect of cache size
for different associativities.
Note that direct mapping is always significantly worse than even
2-way set associative mapping.
This is important even for a second level cache.
What happens when there is a cache miss?
Example:
Assume a cache “miss rate” of 5% (a “hit rate” of 95%), with cache
memory of 1 ns cycle time and main memory of 35 ns cycle time. We
can calculate the average cycle time as
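t_avg = h × t_cache + (1 − h) × t_main

Under this (usual) convention, a hit costs one cache cycle and a miss
costs one main memory cycle, so

t_avg = 0.95 × 1 ns + 0.05 × 35 ns = 2.7 ns

(If a miss is instead charged the cache probe plus the memory access,
the average is 0.95 × 1 + 0.05 × 36 = 2.75 ns.)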
Examples — the µVAX 3500 and the MIPS R2000
Both the µVAX 3500 and the MIPS R2000 processors have interest-
ing cache structures, and were marketed at the same time.
(Interestingly, neither of the parent companies which produced these
processors is now an independent company. Digital Equipment Cor-
poration was acquired by Compaq, which in turn was acquired by
Hewlett Packard. MIPS was acquired by Silicon Graphics Corpora-
tion.)
The µVAX 3500 has two levels of cache memory — a 1 Kbyte 2-way
set associative cache is built into the processor chip itself, and there
is an external 64 Kbyte direct mapped cache. The overall cache hit
rate is typically 95 to 99%. If there is an on-chip (first level) cache
hit, the external memory bus is not used by the processor. The first
level cache responds to a read in one machine cycle (90ns), while the
second level cache responds within two cycles. Both caches can be
configured as caches for instructions only, for data only, or for both
instructions and data. In a single processor system, a mixed cache is
typical; in systems with several processors and shared memory, one
way of ensuring data consistency is to cache only instructions (which
are not modified); then all data must come from main memory, and
consequently whenever a processor reads a data word, it gets the
current value.
The behavior of a two-level cache is quite interesting; the second
level cache does not “see” the high memory locality typical of a
single level cache; the first level cache tends to strip away much of
this locality. The second level cache therefore has a lower hit rate
than would be expected from an equivalent single level cache, but
the overall performance of the two-level system is higher than using
only a single level cache. In fact, if we know the hit rates for the two
caches, we can calculate the overall hit rate as H = H1 +(1−H1)H2,
where H is the overall hit rate, and H1 and H2 are the hit rates for
the first and second level caches, respectively. DEC claims¹ that the
hit rate for the second level cache is about 85%, and the first level
cache has a hit rate of over 80%, so we would expect the overall hit
rate to be about 80% + (20% × 85%) = 97%.
¹ See C. J. DeVane, “Design of the MicroVAX 3500/3600 Second Level
Cache,” Digital Technical Journal, No. 7, pp. 87–94, for a discussion of
the performance of this cache.
The MIPS R2000 has no on-chip cache, but it has provision for the
addition of up to 64 Kbytes of instruction cache and 64 Kbytes
of data cache. Both caches are direct mapped. Separation of the
instruction and data caches is becoming more common in processor
systems, especially for direct mapped caches. In general, instructions
tend to be clustered in memory, and data also tend to be clustered,
so having separate caches reduces cache conflicts. This is particularly
important for direct mapped caches. Also, instruction caches do not
need any provision for writing information back into memory.
Simulating cache memory performance
Since much of the effectiveness of the system depends on the cache
miss rate, it is important to be able to measure, or at least accurately
estimate, the performance of a cache system early in the system
design cycle.
Clearly, the type of jobs (the “job mix”) will be important to the
cache simulation, since the cache performance can be highly data
and code dependent. The best simulation results come from actual
job mixes.
Since many common programs can generate a large number of mem-
ory references (document preparation systems like LaTeX, for exam-
ple), the data sets for cache traces for “typical” jobs can be very
large. In fact, large cache traces are required for effective simulation
of even moderate sized caches.
For example, given a cache size of 8K lines with an anticipated miss
rate of, say, 10%, we would require a trace of about 80K memory
references before it could reasonably be expected that each line in
the cache had been replaced. To determine reasonable estimates of
actual cache miss rates, each cache line should be replaced a number
of times (the “accuracy” of the determination depends on the number
of such replacements). The net effect is to require a memory trace
larger by some factor, say another factor of 10, or about 800K
references. That is, the trace length would be at least 100 times the
size of the cache.
this problem. (e.g., for a cache miss rate of 1%, a trace of 100 times
the cache size would be required to, on average, replace each line
in the cache once. A further, larger, factor would be required to
determine the miss rate to the required accuracy.)
The following two results (see High Performance Computer Archi-
tecture by H.S. Stone, Addison Wesley, Chapter 2, Section 2.2.2, pp.
57–70) derived by Puzak, in his Ph.D. thesis (T.R. Puzak, Cache
Memory Design, University of Massachusetts, 1985) can be used to
reduce the size of the traces and still result in realistic simulations.
The first trace reduction, or trace stripping, technique assumes that
a series of caches of related sizes starting with a cache of size N, all
with the same line size, are to be simulated with some cache trace.
The cache trace is reduced by retaining only those memory references
which result in a cache miss for a direct mapped cache.
Note that, for a miss rate of 10%, 90% of the memory trace would
be discarded. Lower miss rates result in higher reductions.
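A minimal sketch of this first reduction in C, assuming a trace of one hexadecimal address per line on standard input and an arbitrary direct-mapped cache configuration:

#include <stdio.h>
#include <inttypes.h>

/* Trace stripping (after Puzak): keep only the references that miss in
   a direct-mapped cache of N_LINES lines; hits are discarded. */
#define N_LINES    1024u
#define LINE_BYTES 16u

int main(void)
{
    static uint32_t tag[N_LINES];
    static int      valid[N_LINES];
    uint32_t addr;

    while (scanf("%" SCNx32, &addr) == 1) {
        uint32_t line  = addr / LINE_BYTES;
        uint32_t index = line % N_LINES;
        uint32_t t     = line / N_LINES;
        if (!valid[index] || tag[index] != t) {  /* miss: keep reference */
            valid[index] = 1;
            tag[index]   = t;
            printf("%" PRIx32 "\n", addr);
        }
    }
    return 0;
}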
The reduced trace will produce the same number of cache misses as
the original trace for:
The second trace reduction technique is not exact; it relies on the
observation that generally each of the N sets behaves statistically
like any other set; consequently observing the behavior of a small
subset of the cache sets is sufficient to characterize the behavior of
the cache. (The accuracy of the simulation depends somewhat on
the number of sets chosen, because some sets may actually have
behaviors quite different from the “average.”) Puzak suggests that
choosing about 10% of the sets in the initial simulation is sufficient.
Combining the two trace reduction techniques typically reduces the
number of memory references required for the simulation of successive
caches by a factor of 100 or more. This gives a concomitant speedup
of the simulation, with little loss in accuracy.
Other methods for fast memory access
There are other ways of decreasing the effective access time of main
memory, in addition to the use of cache.
Some processors have circuitry which prefetches the next instruc-
tion from memory while the current instruction is being executed.
Most of these processors simply prefetch the next instructions from
memory; others check for branch instructions and either attempt to
predict to which location the branch will occur, or fetch both pos-
sible instructions. (The µVAX 3500 has a 12 byte prefetch queue,
which it attempts to keep full by prefetching the next instructions in
memory.)
In some processors, instructions can remain in the “queue” after they
have been executed. This allows the execution of small loops without
additional instructions being fetched from memory.
Another common speed enhancement is to implement the backwards
jump in a loop instruction while the conditional expression is being
evaluated; usually the jump is successful, because the loop condition
fails only when the loop execution is finished.
Interleaved memory
In order to model the expected gain in speed by having an interleaved
memory, we make the simplifying assumption that all instructions are
of the type
Ri ← Rj op Mp[EA]
where Mp[EA] is the content of memory at location EA, the effective
address of the instruction (i.e., we ignore register-to-register opera-
tions). This is a common instruction format for supercomputers, but
is quite different from the RISC model. We can make a similar model
for RISC machines; here we need only model the fetching of instruc-
tions, and the LOAD and STORE instructions. The model does not
apply directly to certain types of supercomputers, but again can be
readily modified.
Here we can have two cases: case (a), where the execution time is
less than the full time for an operand fetch, and case (b), where the
execution time is greater than the time for an operand fetch. The
figures (a) and (b) show cases (a) and (b) respectively.
With an interleaved memory, the time to complete an instruction can
be improved. The following figure shows an example of interleaving
the fetching of instructions and operands.
[Figure: timing of interleaved instruction and operand fetches — the access time ta, the remainder of the memory cycle ts, and the decode time td of successive fetches overlap.]
Note that this example assumes that there is no conflict — the in-
struction and its operand are in separate memory banks. For this
example, the instruction execution time is

ti = 2ta + td + te

If ta ≈ ts and te is small, then ti(interleaved) ≈ ½ ti(non-interleaved).
The previous examples assumed no conflicts between operand and
data fetches. We can make a (pessimistic) assumption that each of
the N memory modules is equally likely to be accessed. Now there
are two potential delays:

1. the operand fetch, with delay length ts − td, and this has prob-
ability 1/N

P_K = (1 − λ)^(K−1) λ

is the probability of a sequence of K − 1 sequential instructions
followed by a branch.
The expected number of instructions to be executed in serial order is

IF = Σ_{K=1}^{N} K (1 − λ)^(K−1) λ = (1/λ) [1 − (1 − λ)^N]

where N is the number of interleaved memory banks. IF is, effec-
tively, the number of memory banks being used.

Example:
If N = 4 and λ = 0.1, then

IF = (1/0.1) (1 − (1 − 0.1)^4) = 10 (1 − 0.9^4) ≈ 3.4

For operands, a simple (but rather pessimistic) assumption is that
the data are randomly distributed among the memory banks. In this
case, the probability Q(K) of a string of length K is

Q(K) = (N/N) ((N−1)/N) ((N−2)/N) ··· ((N−K+1)/N) (K/N)
     = (N−1)! K / ((N−K)! N^K)

and the average number of operand fetches is

OF = Σ_{K=1}^{N} K Q(K) = Σ_{K=1}^{N} (N−1)! K² / ((N−K)! N^K)

which can be shown to be O(N^(1/2)).
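The following C sketch evaluates both expressions for the example values above (N = 4 banks, λ = 0.1), as a numerical check of the model:

#include <stdio.h>
#include <math.h>

int main(void)
{
    int    N      = 4;
    double lambda = 0.1;

    /* IF = (1/lambda) [1 - (1-lambda)^N] */
    double IF = (1.0 - pow(1.0 - lambda, N)) / lambda;

    /* OF = sum over K of K * Q(K), with Q(K) built up as a product */
    double OF = 0.0, prod = 1.0;
    for (int K = 1; K <= N; K++) {
        prod *= (double)(N - K + 1) / N;  /* prod = N! / ((N-K)! N^K)      */
        double Q = prod * K / N;          /* Q(K) = (N-1)! K / ((N-K)! N^K) */
        OF += K * Q;
    }

    printf("IF = %.2f, OF = %.2f\n", IF, OF);  /* IF = 3.44, OF = 2.22 */
    return 0;
}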
A Brief Introduction to Operating
Systems
Typically, more modern texts do not “define” the term operating
system; they merely specify some of the aspects of operating systems.
Usually two aspects receive most attention:
• CPU usage
In addition to resource management (allocation of resources) the
operating system must ensure that different processes do not have
conflicts over the use of particular resources. (Even simple resource
conflicts can result in such things as corrupted file systems or process
deadlocks.)
This is a particularly important consideration when two or more
processes must cooperate in the use of one or more resources.
Processes
[Figure: process state diagram — a process moves among the active, ready, and blocked states via numbered transitions (1–4), including blocked → ready.]
Following is a simplified process state diagram for the UNIX operat-
ing system:
[Figure: simplified UNIX process state diagram, centered on the user running state.]
The system moves from user mode to kernel mode as a result of an
interrupt, exception, or system call.
As an example of a system call, consider the following C code:
int main()
{
    .
    .
    .
    printf("Hello world");
    .
    .
    return(0);
}
Passing parameters in system calls
1. Pass the values in registers (as MIPS does for the first four
arguments.)
2. Pass the values on the stack (as MIPS does when there are more
than four arguments.)
Styles of operating systems
[Figure: layered operating system structure — layers from the hardware at the bottom up to layer N−1.]
Micro-kernel — here, as much as possible is moved into the “user”
space, keeping the kernel as small as possible.
This makes it easier to extend the kernel. Also, since the kernel is
smaller, it is easier to port to other architectures.
One of the most fundamental resources to be allocated among pro-
cesses (in a single CPU system) is the main memory.
A number of allocation strategies are possible:
Early memory management — “static overlay” — done under user
program control:
The graph shows the functional dependence of “code segments”.
[Figure: overlay tree of nine code segments (sizes 8k–20k each), with the cumulative memory required along each path (16k, 32k, 48k, 64k, 80k) marked at each level.]
Clearly, “segments” at the same level in the tree need not be memory
resident at the same time; e.g., in the above example, it would be
appropriate to have segments (1,3,9) and (5,7) in memory simulta-
neously, but not, say, (2,3).
(2) Contiguous Allocation
In the late 1960’s, operating systems began to control, or “manage”
more resources, including memory. The first attempts used very
simple memory management strategies.
One very early system was Fixed-Partition Allocation:
[Figure: fixed-partition allocation — memory is divided into fixed regions (40k for the kernel, 35k for Job 1, and so on), with wasted space inside partly filled partitions.]
This system did not offer a very efficient use of memory; the systems
manager had to determine an appropriate memory partition, which
was then fixed. This limited the number of processes, and the mix
of processes which could be run at any given time.
Also, in this type of system, dynamic data structures pose difficulties.
An obvious improvement over fixed-partition allocation was Movable-
Partition Allocation
Here, dynamic data structures are still a problem — jobs are placed
in areas where they fit at the time of loading.
A “new” problem here is memory fragmentation — it is usually
much easier to find a block of memory for a small job than for a large
job. Eventually, memory may contain many small jobs, separated by
“holes” too small for any of the queued processes.
This effect may seriously reduce the chances of running a large job.
One solution to this problem is to allow dynamic reallocation of
processes running in memory. The following figure shows the result
of dynamic reallocation of Job 5 after Job 1 terminates:
[Figure: memory maps before and after dynamic reallocation — after Job 1 terminates, Job 5 is moved so that the 15k and 35k free blocks coalesce; Job 4 (40k) is unchanged.]
In this system, the whole program must be moved, which may have a
penalty in execution time. This is a tradeoff — how frequently mem-
ory should be “compacted” against the performance lost to memory
fragmentation.
Again, dynamic memory allocation is still difficult, but less so than
for the other systems.
Modern processors generally manage memory using a scheme called
virtual memory — here all processes appear to have access to all
of the memory available to the system. A combination of special
hardware and the operating system maintains some parts of each
process in main memory, but the process is actually stored on disk
memory.
(Main memory acts somewhat like a cache for processes — only the
active portion of the process is stored there. The remainder is loaded
as needed, by the operating system.)
We will look in some detail at how processes are “mapped” from
virtual memory into physical memory.
The idea of virtual memory can be applied to the whole processor, so
we can think of it as a virtual system, where every process has access
to all system resources, and where separate (non-communicating)
processes cannot interfere with each other.
In fact, we are already used to thinking of computers in this way.
We are familiar with the sharing of physical resources like printers
(through the use of a print queue) as well as sharing access to the
processor itself in a multitasking environment.
Virtual Memory Management
The process of translating, or mapping, a virtual address into a phys-
ical address is called virtual address translation. The following
diagram shows the relationship between a named variable and its
physical location in the system.
[Figure: the mapping from a named variable to its virtual address, and from the virtual address to a physical location.]
This mapping can be accomplished in ways similar to those discussed
for mapping main memory into the cache memory. In the case of vir-
tual address mapping, however, the relative speed of main memory to
disk memory (a factor of approximately 100,000 to 1,000,000) means
that the cost of a “miss” in main memory is very high compared
to a cache miss, so more elaborate replacement algorithms may be
worthwhile.
There are two “flavours” of virtual memory mapping; paged memory
mapping and segmented memory mapping. We will look at both in
some detail.
Virtually all processors today use paged memory mapping. In most
systems, pages are placed in memory when addressed by the program
— this is called demand paging.
In many processors, a direct mapping scheme is supported by the
system hardware, in which a page map is maintained in physical
memory. This means that each physical memory reference requires
both an access to the page table and an operand fetch (two
memory references per instruction). In effect, all memory references
are indirect.
The following diagram shows a typical virtual-to-physical address
mapping:
[Figure: virtual-to-physical address mapping — the virtual page number indexes the page map, which supplies the physical page number; the page offset is appended unchanged to form the physical address.]
Note that whole page blocks in virtual memory are mapped to whole
page blocks in physical memory.
This means that the page offset is part of both the virtual and phys-
ical address.
Requiring two memory fetches for each instruction is a large per-
formance penalty, so most virtual addressing systems have a small
associative memory (called a translation lookaside buffer, or TLB)
which contains the last few virtual addresses and their correspond-
ing physical addresses. Then for most cases the virtual to physical
mapping does not require an additional memory access. The follow-
ing diagram shows a typical virtual-to-physical address mapping in
a system containing a TLB:
[Figure: address translation with a TLB — the virtual page number is first looked up in the TLB; on a TLB hit the physical page number is available immediately, otherwise the page map in memory is consulted; the page offset passes through unchanged.]
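A sketch of the combined translation in C (the sizes, structures, and page_map_lookup() below are assumptions for illustration; a real TLB performs all the comparisons in parallel, in hardware):

#include <stdint.h>

#define PAGE_BITS 12u                  /* 4 Kbyte pages */
#define TLB_SIZE  64u

struct tlb_entry { int valid; uint32_t vpn, pfn; };
static struct tlb_entry tlb[TLB_SIZE];

extern uint32_t page_map_lookup(uint32_t vpn);  /* walk of the page map in
                                                   memory; may page fault */

uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);

    for (unsigned i = 0; i < TLB_SIZE; i++)      /* associative search */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].pfn << PAGE_BITS) | offset;

    /* TLB miss: at least one extra memory access, then refill the TLB
       (a trivial replacement policy, for the sketch only). */
    uint32_t pfn = page_map_lookup(vpn);
    tlb[vpn % TLB_SIZE] = (struct tlb_entry){ 1, vpn, pfn };
    return (pfn << PAGE_BITS) | offset;
}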
For many current architectures, including the Intel Pentium
and MIPS, addresses are 32 bits, so the virtual address space is 2^32
bytes, or 4 Gbytes (4096 Mbytes). A physical memory of about 256
Mbytes–2 Gbytes is typical for these machines, so the virtual address
translation must map the 32 bits of the virtual memory address into
a corresponding area of physical memory.
A recent trend (Pentium P4, UltraSPARC, PowerPC 9xx, MIPS
R16000, AMD Opteron) is to have a 64 bit address space, so the
maximum virtual address space is 2^64 bytes (17,179,869,184 Gbytes).
While the memory controller is fetching the required information
from disk, the processor can be executing another program, so the
actual time required to find the information on the disk (the disk
seek time) is not wasted by the processor. In this sense, the disk seek
time usually imposes little (time) overhead on the computation, but
the time required to actually place the information in memory may
impact the time the user must wait for a result. If many disk seeks
are required in a short time, however, the processor may have to wait
for information from the disk.
Normally, blocks of information are taken from the disk and placed
in the memory of the processor. The two most common ways of de-
termining the sizes of the blocks to be moved into and out of memory
are called segmentation and paging, and the term segmented mem-
ory management or paged memory management refer to memory
management systems in which the blocks in memory are segments
or pages.
Mapping in the memory hierarchy
[Figure: per-process mapping in the memory hierarchy — blocks of each process's virtual address space are mapped onto physical addresses by the virtual-to-physical address translation.]
Note that not all the virtual address blocks are in the physical mem-
ory at the same time. Furthermore, adjacent blocks in virtual mem-
ory are not necessarily adjacent in physical memory.
If a block is moved out of physical memory and later replaced, it may
not be at the same physical address.
The translation process must be fast, most of the time.
Segmented memory management
In a segmented memory management system the blocks to be re-
placed in main memory are potentially of unequal length and corre-
spond to program and data “segments.” A program segment might
be, for example, a subroutine or procedure. A data segment might
be a data structure or an array. In both cases, segments correspond
to logical blocks of code or data. Segments, then, are “atomic,” in
the sense that either the whole segment should be in main mem-
ory, or none of the segment should be there. The segments may be
placed anywhere in main memory, but the instructions or data in one
segment should be contiguous, as shown:
[Figure: segments 1, 5, 7, 2, 4, and 9 resident in main memory — each segment is contiguous, but the segments themselves are placed in arbitrary order.]
When segments are replaced, a single segment can only be replaced
by a segment of the same size, or by a smaller segment. After a time
this results in a “memory fragmentation”, with many small segments
residing in memory, having small gaps between them. Because the
probability that two adjacent segments can be replaced simultane-
ously is quite low, large segments may not get a chance to be placed
in memory very often. In systems with segmented memory manage-
ment, segments are often “pushed together” occasionally to limit the
amount of fragmentation and allow large segments to be loaded.
Segmented memory management appears to be efficient because an
entire block of code is available to the processor. Also, it is easy for
two processes to share the same code in a segmented memory system;
if the same procedure is used by two processes concurrently, there
need only be a single copy of the code segment in memory. (Each
process would maintain its own, distinct data segment for the code
to access, however.)
Segmented memory management is not as popular as paged mem-
ory management, however. In fact, most processors which presently
claim to support segmented memory management actually support
a hybrid of paged and segmented memory management, where the
segments consist of multiples of fixed size blocks.
Paged memory management:
Paged memory management is really a special case of segmented
memory management. In the case of paged memory management,
• all of the segments are exactly the same size (typically 256 bytes
to 16 M bytes)
The following is an example of a paged memory management config-
uration using a fully associative page translation table:
Consider a computer system which has 16 Mbytes (2^24 bytes) of main
memory, and a virtual memory space of 2^32 bytes. The following
diagram sketches the page translation table required to manage all
of main memory if the page size is 4K (2^12) bytes. Note that the
associative memory is 20 bits wide (32 bits − 12 bits: the virtual
address size less the page offset size). Also, to manage 16 Mbytes of
memory with a page size of 4 Kbytes, a total of 16M/4K = 2^12 = 4096
associative memory locations are required.
[Figure: fully associative page translation table — virtual address bits 31–12 are matched in a 4096-entry associative memory, which returns the physical page address; bits 11–0 form the page offset.]
Some other attributes are usually included in a page translation ta-
ble, as well, by adding extra fields to the table. For example, pages
or segments may be characterized as read only, read-write, etc. As
well, it is common to include information about access privileges, to
help ensure that one program does not inadvertently corrupt data for
another program. It is also usual to have a bit (the “dirty” bit) which
indicates whether or not a page has been written to, so that the page
will be written back onto the disk if a memory write has occurred
into that page. (This is done only when the page is “swapped”,
because disk access times are too long to permit a “write-through”
policy like cache memory.) Also, since associative memory is very ex-
pensive, it is not usual to map all of main memory using associative
memory; it is more usual to have a small amount of associative mem-
ory which contains the physical addresses of recently accessed pages,
and maintain a “virtual address translation table” in main memory
for the remaining pages in physical memory. A virtual to physical
address translation can normally be done within one memory cycle
if the virtual address is contained in the associative memory; if the
address must be recovered from the “virtual address translation ta-
ble” in main memory, at least one more memory cycle must be used
to retrieve the physical address from main memory.
There is a kind of trade-off between the page size for a system and
the size of the page translation table (PTT). If a processor has a
small page size, then the PTT must be quite large to map all of
the virtual memory space. For example, if a processor has a 32 bit
virtual memory address, and a page size of 512 bytes (2^9 bytes), then
there are 2^23 possible page table entries. If the page size is increased
to 4 Kbytes (2^12 bytes), then the PTT requires “only” 2^20, or 1 M,
page table entries. These large page tables will normally not be very
full, since the number of valid entries is limited by the amount of
physical memory available.
One way these large, sparse PTT’s are managed is by mapping the
PTT itself into virtual memory. (Of course, the pages which map
the virtual PTT must not be mapped out of the physical memory!)
There are also other pages that should not be mapped out of physical
memory. For example, pages mapping to I/O buffers. Even the I/O
devices themselves are normally mapped to some part of the physical
address space.
Note that both paged and segmented memory management pro-
vide the users of a computer system with all the advantages of a
large virtual address space. The principal advantage of the paged
memory management system over the segmented memory manage-
ment system is that the memory controller required to implement a
paged memory management system is considerably simpler. Also,
the paged memory management does not suffer from fragmentation
in the same way as segmented memory management. Another kind
of fragmentation does occur, however. A whole page is swapped in or
out of memory, even if it is not full of data or instructions. Here the
fragmentation is within a page, and it does not persist in the main
memory when new pages are swapped in.
One problem found in virtual memory systems, particularly paged
memory systems, is that when there are a large number of processes
executing “simultaneously” as in a multiuser system, the main mem-
ory may contain only a few pages for each process, and all processes
may have only enough code and data in main memory to execute for
a very short time before a page fault occurs. This situation, often
called “thrashing,” severely degrades the throughput of the proces-
sor because it actually must spend time waiting for information to
be read from or written to the disk.
Examples — the µVAX 3500 and the MIPS R2000
These machines are interesting because the µVAX 3500 was a typical
complex instruction set (CISC) machine, while the MIPS R2000
was a classical reduced instruction set (RISC) machine.
µVAX 3500
Both the µVAX 3500 and the MIPS R2000 use paged virtual memory,
and both also have fast translation look-aside buffers which handle
many of the virtual to physical address translations. The µVAX
3500, like other members of the VAX family, has a page size of 512
bytes. (This is the same as the number of sets in the on-chip cache, so
address translation can proceed in parallel with the cache access —
another example of parallelism in this processor.) The µVAX 3500
has a 28 entry fully associative translation look-aside buffer (TLB)
which uses an LRU algorithm for replacement. Address translation
for TLB misses is supported in the hardware (microcode); the page
table stored in main memory is accessed to find the physical ad-
dresses corresponding to the current virtual address, and the TLB is
updated.
139
MIPS R2000
The MIPS R2000 has a 4 Kbyte page size, and 64 entries in its fully
associative TLB, which can perform two translations in each machine
cycle — one for the instruction to be fetched and one for the data
to be fetched or stored (for the LOAD and STORE instructions).
Unlike the µVAX 3500 (and most other processors, including other
RISC processors), the MIPS R2000 does not handle TLB misses
using hardware. Rather, an exception (the TLB miss exception) is
generated, and the address translation is handled in software. In fact,
even the replacement of the entry in the TLB is handled in software.
Usually, the replacement algorithm chosen is random replacement,
which the processor supports directly: it generates a random number
between 8 and 63 for this purpose. (The lowest 8 TLB locations are
normally reserved for the kernel; e.g., to refer to such things as the
current PTT.)
This is another example of the MIPS designers making a tradeoff —
providing a larger TLB, thus reducing the frequency of TLB misses
at the expense of handling those misses in software, much as if they
were page faults.
140
Virtual memory replacement algorithms
Since page misses interrupt a process in virtual memory systems, it
is worthwhile to expend additional effort to reduce their frequency.
Page misses are handled in the system software, so the cost of this
added complexity is small.
Fixed replacement algorithms
Here, the number of pages allocated to a process is fixed (constant).
Some of these algorithms are the same as those discussed for cache
replacement. The common replacement algorithms are FIFO, CLOCK,
LRU, and OPT (the optimal algorithm, usable only as an after-the-fact
benchmark).
141
Generally, other considerations come into play for page replacement;
for example, it requires more time to replace a “dirty” page (i.e., one
which has been written into) than a “clean” page, because of the
time required to write the page back onto the disk. This may make
it more efficient to preferentially swap clean pages.
Most large disks today have internal buffers to speed up reading and
writing, and can accept several read and write requests, reordering
them for more efficient access.
The following diagram shows the performance of these algorithms
on a small sample program, with a small number of pages allocated.
Note that, in this example, the number of page faults for LRU <
CLOCK < FIFO.
[Figure: page faults (× 1000) versus pages allocated (6–14) for the
FIFO, CLOCK, LRU, and OPT algorithms; FIFO has the most faults
and OPT the fewest at every allocation.]
142
The replacement algorithms LRU and OPT have a useful property
known as the stack property. This property can be expressed as
follows: the set of pages held in a memory of k page frames is always
a subset of the set that would be held in a memory of k + 1 frames,
so increasing the memory allocated to a process can never increase
the number of page faults.
143
[Figure: page faults (× 1000) for FIFO, CLOCK, LRU, and OPT as a
fixed 8K bytes of memory is divided into 4, 8, 16, or 32 pages (page
sizes of 2K bytes down to 256 bytes).]
Note that performance is poor at both extremes of page size. In this
(small) example, the small number of pages loaded in memory degrades
the performance severely for the largest page size (2K bytes,
corresponding to only 4 pages in memory). Performance improves with
an increased number of (smaller) pages in memory, until the page size
becomes small enough that a page no longer holds an entire logical
block of code.
144
Variable replacement algorithms
In fixed replacement schemes, two “anomalies” can occur — a pro-
gram running in a small local region may access only a fraction of
the main memory assigned to it, or the program may require much
more memory than is assigned to it, in the short term. Both cases
are undesirable; the second may cause severe delays in the execution
of the program.
In variable replacement algorithms, the amount of memory available
to a process varies depending on the locality of the program.
The following diagram shows the memory requirements for two sep-
arate runs of the same program, using a different data set each time,
as a function of time (in clock cycles) as the program progresses.
[Figure: memory required versus time (in clock cycles) for two runs
of the same program on different data sets; the memory demand of
each run rises and falls irregularly as the program's locality changes.]
145
Working set replacement
A replacement scheme which accounts for this variation in memory
requirements dynamically may perform much better than a fixed
memory allocation scheme. One such algorithm is the working set
replacement algorithm. This algorithm uses a moving window in
time. Pages which are not referred to in this time are removed from
the working set.
For a window size T (measured in memory references), the working
set at time t is the set of pages which were referenced in the interval
(t − T + 1, t). A page may be replaced when it no longer belongs to
the working set (this is not necessarily when a page fault occurs.)
146
Example:
Given a program with 7 virtual pages {a,b,. . . ,g} and the reference
sequence
a b a c g a f c g a f d b g
with a window of 4 references. The working set at each time is the
set of distinct pages contained in the sliding window of the last 4
references over the sequence above.
The following table shows the working set after each time period:

time  working set        time  working set
  1   a                    8   a c g f
  2   a b                  9   a c g f
  3   a b                 10   a c g f
  4   a b c               11   a c g f
  5   a b c g             12   a g f d
  6   a c g               13   a f d b
  7   a c g f             14   f d b g
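As a concrete check of this table, the following minimal C sketch
slides the window over the reference string and prints each working
set (in alphabetical order, since a set is unordered); the program and
its names are illustrative, not part of the original notes:

#include <stdio.h>
#include <string.h>

#define T 4                                /* window size, in references */

int main(void)
{
    const char refs[] = "abacgafcgafdbg";  /* the reference sequence */
    int n = (int)strlen(refs);

    for (int t = 0; t < n; t++) {
        int start = (t - T + 1 > 0) ? t - T + 1 : 0;
        int in_set[26] = {0};
        for (int i = start; i <= t; i++)   /* scan the current window */
            in_set[refs[i] - 'a'] = 1;
        printf("%2d: ", t + 1);
        for (int p = 0; p < 26; p++)
            if (in_set[p]) putchar('a' + p);
        putchar('\n');
    }
    return 0;
}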
147
A variant of the basic working set replacement, which replaces pages
only when there is a page fault, could do the following on a page
fault:
1. If all pages belong to the working set (i.e., have been accessed in
the window W time units prior to the page fault) then increase
the working set by 1 page.
2. If one or more pages do not belong to the working set (i.e., have
not been referenced in the window W time units prior to the
page fault) then decrease the working set by discarding the least
recently used page. If there is more than one page not in the
working set, discard the two pages which have been least recently
used.
The following diagram shows the behavior of the working set replace-
ment algorithm relative to LRU.
148
[Figure: page faults versus memory allocated for LRU and the working
set (WS) algorithm; for a given amount of memory, WS produces fewer
page faults than LRU.]
149
A related approach measures the page fault frequency (PFF) of a
process and adjusts its memory allocation accordingly:
• Increase the number of pages allocated to the process by 1 when-
ever the PFF is greater than some threshold T_h.
• If T_l < PFF < T_h, then replace a page in memory by some other
reasonable policy; e.g., LRU.
A sketch of such a policy appears below.
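This is a minimal C sketch of such a page-fault-frequency policy.
The helper functions are hypothetical stand-ins, and the third case
(PFF below T_l, shrinking the allocation) is the usual companion rule
and is an assumption here:

#include <stdio.h>

static void grow_allocation(int n)   { printf("add %d page frame(s)\n", n); }
static void shrink_allocation(int n) { printf("release %d page frame(s)\n", n); }
static void replace_lru_page(void)   { printf("replace the LRU page\n"); }

/* called on each page fault with the measured fault frequency */
void pff_adjust(double pff, double t_low, double t_high)
{
    if (pff > t_high)
        grow_allocation(1);      /* faulting too often: give it a frame  */
    else if (pff > t_low)
        replace_lru_page();      /* steady state: replace within the set */
    else
        shrink_allocation(1);    /* assumed rule: reclaim a frame        */
}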
150
Some “real” memory systems — X86-64
The X86-64 translates a 48 bit virtual address through four levels of
tables. The 12 bit offset specifies the byte in a 4KB page. The 9 bit
(512 entry) page table points to the specific page, while the three
higher level (9 bit, 512 entry) tables are used to point eventually to
the page table.
The page table itself maps 512 4KB pages, or 2MB of memory.
Adding one more level increases this by another factor of 512, for
1GB of memory, and so on.
Clearly, most programs do not use anywhere near all the available
virtual memory, so the page tables and the higher level page maps
are very sparse.
Both Windows 7/8 and Linux use a page size of 4KB, although Linux
also supports a 2MB page size for some applications.
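As a concrete illustration, the following C sketch splits a 48 bit
X86-64 virtual address into its four 9 bit indices and the 12 bit
offset (the address value used here is arbitrary):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t va = 0x00007f1234567abcULL;   /* an arbitrary 48 bit address */

    unsigned offset = (unsigned)(va & 0xfff);          /* bits 11..0         */
    unsigned pt     = (unsigned)((va >> 12) & 0x1ff);  /* bits 20..12: PT    */
    unsigned pd     = (unsigned)((va >> 21) & 0x1ff);  /* bits 29..21: PD    */
    unsigned pdpt   = (unsigned)((va >> 30) & 0x1ff);  /* bits 38..30: PDPT  */
    unsigned pml4   = (unsigned)((va >> 39) & 0x1ff);  /* bits 47..39: PML4  */

    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n",
           pml4, pdpt, pd, pt, offset);
    return 0;
}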
151
152
The 32 bit ARM processor
The 32 bit ARM processors support 4KB and 1MB page sizes, as
well as 64KB and 16MB page sizes. The following shows how a 4KB
page is mapped with a 2-level mapping:
 31          22 21          12 11           0
 |  outer page |  inner page |    offset    |  →  4KB page
The 10 bit (1K entry) outer page table points to an inner page
table of the same size. The inner page table contains the map-
ping for the virtual page in physical memory.
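To make the two lookups concrete, here is a C sketch of the walk;
the table layout is a toy model (not the real ARM descriptor format),
and all names are assumptions:

#include <stdint.h>

#define ENTRIES 1024                      /* 2^10 entries per table */

typedef struct { uint32_t frame[ENTRIES]; } inner_table;
static inner_table *outer[ENTRIES];       /* the outer page table   */

uint32_t translate(uint32_t va)
{
    uint32_t oi     = (va >> 22) & 0x3ff; /* bits 31..22: outer index */
    uint32_t ii     = (va >> 12) & 0x3ff; /* bits 21..12: inner index */
    uint32_t offset =  va        & 0xfff; /* bits 11..0 : page offset */

    inner_table *t = outer[oi];           /* first memory lookup      */
    /* a real MMU would raise a fault here if t were unmapped (NULL)  */
    return (t->frame[ii] << 12) | offset; /* second lookup + offset   */
}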
[Figure: block diagram of a typical UNIX system. User programs call
the libraries and, through traps and interrupts, the kernel's system
call interface. Inside the kernel, the file subsystem (with the buffer
cache and the character and block device drivers) sits beside the
process control subsystem (inter-process communication, the scheduler,
and memory management); hardware control below them drives the
hardware, “the computer”.]
154
The allocation of processes to the processor
ready — process is ready to run but has not yet been selected by
the “scheduler”
155
[Figure: UNIX process state diagram. A process is created by fork
(in memory if there is enough memory, otherwise swapped); it moves
between ready and asleep states, in memory or swapped out, via the
schedule, sleep, wakeup, and swap in/out transitions; a scheduled
process runs in kernel mode, returns to user mode, may be preempted,
re-enters the kernel on a system call or interrupt, and becomes a
zombie on exit.]
156
In the operating system, each process is represented by its own pro-
cess control block (sometimes called a task control block, or job
control block). This process control block is a data structure (or set
of structures) which contains information about the process. This
information includes everything required to continue the process if it
is blocked for any reason (or if it is interrupted). Typical information
would include:
• the values of the program counter, stack pointer and other inter-
nal registers
157
In many systems, the space for these process control blocks is allo-
cated (in system space memory) when the system is generated; this
places a firm limit on the number of processes which can be allocated
at one time. (The simplicity of this allocation makes it attractive,
even though it may waste part of system memory by having blocks
allocated which are rarely used.)
Following is a diagram of the process management data structures in
a typical UNIX system:
[Figure: process management data structures in a typical UNIX
system. An entry in the process table points to a per-process region
table and to the u area; the per-process region table points into the
shared region table, whose entries map the regions of the process
(text, stack, and so on) into main memory.]
158
Process scheduling:
159
Criteria for scheduling algorithms (performance)
• CPU utilization
• response time
160
Commonalities in the memory hierarchy
161
Replacement strategies
There are a small number of commonly used replacement strategies
for a block:
• random replacement
Writing
There are two basic strategies for writing data from one level of the
hierarchy to the other:
Write through — both levels are consistent, or coherent.
Write back — only the highest level has the correct value, and it
is written back to the next level on replacement. This implies that
there is a way of indicating that the block has been written into (e.g.,
with a “dirty” bit), as sketched below.
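A minimal C sketch of that bookkeeping for one block; the structure
layout and write_to_next_level() are assumptions standing in for the
transfer to the lower level:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     dirty;                   /* set when the block is written */
    uint32_t tag;
    uint8_t  data[64];
} block;

static void write_to_next_level(block *b) { (void)b; /* copy b->data down */ }

void write_word(block *b, int offset, uint8_t value)
{
    b->data[offset] = value;
    b->dirty = true;                  /* this level now has the only copy */
}

void replace(block *b)
{
    if (b->valid && b->dirty)
        write_to_next_level(b);       /* write back only if modified */
    b->valid = b->dirty = false;      /* block is free for a new tag  */
}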
162
Differences between levels of memory hierarchies
Finding a block
Blocks are found by fast parallel searches at the highest level, where
speed is important (e.g., full associativity, set associative mapping,
direct mapping).
At lower levels, where access time is less important, table lookup can
be used (even multiple lookups may be tolerated at the lowest levels.)
Block sizes
Typically, the block size increases as the hierarchy is descended. In
many levels, the access time is large compared to the transfer time, so
using larger block sizes amortizes the access time over many accesses.
Capacity and cost
Invariably, the memory capacity increases, and the cost per bit de-
creases, as the hierarchy is descended.
163
Input and Output (I/O)
164
ATMEL AVR architecture
165
AVR architecture
166
The program memory is flash programmable, and is fixed until over-
written with a programmer. It is guaranteed to survive at least 10,000
rewrites.
In many processors, self programming is possible. A bootloader can
be stored in protected memory, and a new program downloaded by
a simple serial interface.
In some older small devices, there is no data memory — programs use
registers only. In those processors, there is a small stack in hardware
(3 entries). In the processors with data memory, the stack is located
in the data memory.
The size of on-chip data memory (SRAM — static memory) varies
from 0 to 16 Kbytes.
Most processors also have EEPROM memory, from 0 to 4 Kbytes.
The C compiler can only be used for devices with SRAM data mem-
ory.
Only a few of the older tiny ATMEL devices do not have SRAM
memory.
167
ATMEL AVR datapath
[Figure: the AVR datapath. An 8-bit data bus connects the program
counter, stack pointer, program flash, SRAM (or the hardware stack),
the instruction register and instruction decoder (driving the control
lines), the general purpose registers with the X, Y, and Z pointer
pairs, the ALU, and the status register.]
Note that the datapath is 8 bits, and the ALU accepts two indepen-
dent operands from the register file.
Note also the status register (SREG), which holds information about
the state of the processor; e.g., whether the result of a comparison was 0.
168
Typical ATMEL AVR device
[Figure: a typical AVR device. The processor core (ALU, status
register, control lines) shares the 8-bit data bus with the on-chip
peripherals: the interrupt unit, data EEPROM, the port drivers
(pins Px7–Px0), and an analog comparator.]
169
The AVR memory address space
Note that the general purpose registers and I/O registers are mapped
into the data memory.
170
The AVR instruction set
Like the MIPS, the AVR is a load/store architecture, so most arith-
metic and logic is done in the register file.
The basic register instruction is of the type Rd ← Rd op Rs.
For example, add r1, r2 performs r1 ← r1 + r2, and subi r16, 10
subtracts the immediate value 10 from r16.
Many immediate instructions, like subi, can only use registers 16
to 31, so be careful with those instructions.
There is no add immediate instruction; adding an immediate is
usually done by subtracting its negative (e.g., subi r16, -10).
There is an add word immediate instruction (adiw) which operates on
the pointer registers as 16 bit entities in 2 cycles. The maximum
constant which can be added is 63.
171
There are many data movement operations relating to loads and
stores. Memory is typically addressed through the register pairs X
(r26, r27), Y (r28, r29), and Z (r30, r31).
A typical instruction accessing data memory is load indirect LD.
It uses one of the index registers, and places a byte from the memory
addressed by the index register in the designated register.
There are also push and pop instructions which push a byte onto,
or pop a byte off, the stack. (The stack pointer is in the I/O space,
registers 0x3D and 0x3E.)
172
There are a number of branch instructions, depending on values in the
status register. For example, branch on carry set (BRCS) branches
to a target address by adding a displacement (-64 to +63) to the
program counter (actually, PC +1) if the carry flag is set.
The relative call (rcall) instruction is similar, but places the return
address (PC + 1) on the stack.
The return instruction (ret) returns from a function call by replacing
the PC with the value on the stack.
There are also instructions which skip over the next instruction on
some condition. For example, the instruction skip if bit in register
set (SBRS) skips the next instruction (increments the PC by 2 or 3)
if a particular bit in the designated register is set.
173
There are many instructions for bit manipulation; bit rotation in a
byte, bit shifting, and setting and clearing individual bits.
There are also instructions to set and clear individual bits in the
status register, and to enable and disable global interrupts.
The instructions SEI (set global interrupt flag) and CLI (clear global
interrupt flag) enable and disable interrupts under program control.
When an interrupt occurs, the global interrupt flag is cleared, and
reset when the return from interrupt (reti) is executed.
Individual devices (e.g., timers) can also be set as interrupting de-
vices, and also have their interrupt capability turned off. We will
look at this capability later.
There are also instructions to input values from and output values
to specific I/O pins, and sets of I/O pins called ports.
We will look in more detail at these instructions later.
174
The status register (SREG)
One of the differences between the MIPS processor and the AVR
is that the AVR uses a status register — the MIPS uses the set
instructions for conditional branches.
The SREG has the following format:

bit    7   6   5   4   3   2   1   0
flag   I   T   H   S   V   N   Z   C
175
Input-Output Architecture
In our discussion of the memory hierarchy, it was implicitly assumed
that memory in the computer system would be “fast enough” to
match the speed of the processor (at least for the highest elements
in the memory hierarchy) and that no special consideration need be
given to how long it would take for a word to be transferred from
memory to the processor — an address would be generated by the
processor, and after some fixed time interval, the memory system
would provide the required information. (In the case of a cache miss,
the time interval would be longer, but generally still fixed. For a
page fault, the processor would be interrupted and the page fault
handling software invoked.)
Although input-output devices are “mapped” to appear like memory
devices in many computer systems, I/O devices have characteristics
quite different from memory devices, and often pose special problems
for computer systems. This is principally for two reasons:
• Unlike memory operations, I/O operations and the CPU are not
generally synchronized with each other.
187
I/O devices also have other distinguishing characteristics; one is the
amount of data required for a particular operation. A keyboard, for
example, inputs a single character at a time, while a color display
may use several Mbytes of data at a time.
The following lists several I/O devices and some of their typical prop-
erties:
Device            Data size (KB)   Data rate (KB/s)   Interaction
keyboard          0.001            0.01               human/machine
mouse             0.001            0.1                human/machine
voice input       1                1                  human/machine
laser printer     1 – 1000+        1000               machine/human
graphics display  1000             100,000+           machine/human
magnetic disk     4 – 4000         100,000+           system
CD/DVD            4                1000               system
LAN               1                100,000+           system/system
188
The following figure shows the general I/O structure associated with
many medium-scale processors. Note that the I/O controllers and
main memory are connected to the main system bus. The cache
memory (usually found on-chip with the CPU) has a direct connec-
tion to the processor, as well as to the system bus.
[Figure: the CPU, with its cache, connects directly to the system
bus; main memory and several I/O controllers also attach to the
system bus, with interrupt and control lines running between the
CPU and the controllers; the I/O devices hang off the controllers.]
Note that the I/O devices shown here are not connected directly
to the system bus, they interface with another device called an I/O
controller.
189
In simpler systems, the CPU may also serve as the I/O controller,
but in systems where throughput and performance are important,
I/O operations are generally handled outside the processor.
In higher performance processors (desktop and workstation systems)
there may be several separate I/O buses. The PC today has separate
buses for memory (the FSB, or front-side bus), for graphics (the AGP
bus or PCIe/16 bus), and for I/O devices (the PCI or PCIe bus).
It has one or more high-speed serial ports (USB or Firewire), and
100 Mbit/s or 1 Gbit/s network ports as well. (The PCIe bus is also
serial.)
It may also support several “legacy” I/O systems, including serial
(RS-232) and parallel (“printer”) ports.
190
Synchronization — the “two wire handshake”
Because the I/O devices are not synchronized with the CPU, some
information must be exchanged between the CPU and the device to
ensure that the data is received reliably. This interaction between
the CPU and an I/O device is usually referred to as “handshaking.”
Since communication can be in both directions, it is usual to consider
that there are two types of behavior – talking and listening.
Either the CPU or the I/O device can act as the talker or the listener.
For a complete “handshake,” four events are important:
1. The device providing the data (the talker) must indicate that
valid data is now available.
2. The device accepting the data (the listener) must indicate that
it has accepted the data. This signal informs the talker that it
need not maintain this data word on the data bus any longer.
3. The talker indicates that the data on the bus is no longer valid,
and removes the data from the bus. The talker may then set up
new data on the data bus.
4. The listener indicates that it is not now accepting any data on the
data bus. the listener may use data previously accepted during
this time, while it is waiting for more data to become valid on
the bus.
191
Note that the talker and the listener each supply two signals. The
talker supplies a signal (say, data valid, or DAV) at step (1), and
the complementary signal (data not valid) at step (3). Both can be
coded as a single binary value (DAV) which takes the value 1 at
step (1) and 0 at step (3). The listener supplies a signal (say, data
accepted, or DAC) at step (2), and its complement (data not now
accepted) at step (4); it, too, can be coded as a single binary
variable, DAC. Because only two binary variables are required, the
handshaking information can be communicated over two wires, and
the form of handshaking described above is called a two wire
handshake.
The following figure shows a timing diagram for the signals DAV
and DAC which illustrates the timing of these four events:
[Timing diagram: DAV rises to 1 at event (1) and falls to 0 at
event (3); DAC rises to 1 at event (2) and falls to 0 at event (4).]
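The protocol can also be written out as a C sketch. Here DAV, DAC,
and DATA are assumed to be memory-mapped bits and a byte (all
hypothetical names), and each side simply busy-waits on the other's
wire:

#include <stdint.h>

volatile uint8_t DAV;   /* data valid    (driven by the talker)   */
volatile uint8_t DAC;   /* data accepted (driven by the listener) */
volatile uint8_t DATA;  /* the data "bus"                         */

void talker_send(uint8_t byte)
{
    while (DAC) ;       /* wait until the listener is not accepting */
    DATA = byte;        /* put data on the bus                      */
    DAV = 1;            /* (1) indicate that valid data is available */
    while (!DAC) ;      /* (2) wait for the listener to accept      */
    DAV = 0;            /* (3) data on the bus is no longer valid   */
}

uint8_t listener_receive(void)
{
    uint8_t byte;
    while (!DAV) ;      /* wait for valid data                      */
    byte = DATA;        /* take the data                            */
    DAC = 1;            /* (2) indicate the data has been accepted  */
    while (DAV) ;       /* wait for the talker to drop DAV          */
    DAC = 0;            /* (4) no longer accepting data             */
    return byte;
}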
192
As stated earlier, either the CPU or the I/O device can act as the
talker or the listener. In fact, the CPU may act as a talker at one
time and a listener at another. For example, when communicating
with a terminal screen (an output device) the CPU acts as a talker,
but when communicating with a terminal keyboard (an input device)
the CPU acts as a listener.
This is about the simplest synchronization which can guarantee re-
liable communication between two devices. It may be inadequate
where there are more than two devices.
Other forms of handshaking are used in more complex situations; for
example, where there may be more than one controller on the bus,
or where the communication is among several devices.
For example, there is also a similar, but more complex, 3-wire hand-
shake which is useful for communicating among more than two de-
vices.
193
I/O control strategies
Several I/O strategies are used between the computer system and I/O
devices, depending on the relative speeds of the computer system and
the I/O devices.
194
Program-controlled I/O
195
A typical configuration might look somewhat as shown in the follow-
ing figure.
The outputs labeled “handshake out” would be connected to bits in
the “status” port. The input labeled “handshake in” would typically
be generated by the appropriate decode logic when the I/O port
corresponding to the device was addressed.
[Figure: program-controlled I/O configuration. Each of the N devices
connects to its own port (PORT 1 … PORT N) with data lines in both
directions, a “handshake in” input, and a “handshake out” output.]
196
Program-controlled I/O has a number of advantages:
197
Program controlled I/O is often used for simple operations which
must be performed sequentially. For example, the following may be
used to control the temperature in a room:
DO forever
    INPUT temperature
    IF (temperature < setpoint) THEN
        turn heat ON
    ELSE
        turn heat OFF
    END IF
END DO

Note here that the order of events is fixed in time, and that the
program loops forever. (It is really waiting for a change in the tem-
perature, but it is a “busy wait.”)
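The same busy-wait loop in C, as a sketch with stub functions
standing in for the input and output ports (all names here are
assumptions):

#include <stdbool.h>

#define SETPOINT 20.0

static double read_temperature(void) { return 18.5; /* stub input port  */ }
static void   set_heat(bool on)      { (void)on;    /* stub output port */ }

int main(void)
{
    for (;;)                                  /* DO forever (busy wait) */
        set_heat(read_temperature() < SETPOINT);
}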
198
An example of program-controlled I/O for the AVR
Programming the input and output ports (ports are basically regis-
ters connected to sets of pins on the chip) is interesting in the AVR,
because each pin in a port can be set to be an input or an output
pin, independent of other pins in the port.
Ports have three registers associated with them.
The data direction register (DDR) determines which pins are inputs
(by writing a 0 to the DDR at the bit position corresponding to
that pin) and which are output pins (similarly, by writing a 1 in the
DDR).
The PORT is a register which contains the value written to an output
pin; for an input pin, writing a 1 to the PORT bit enables its pull-up.
Ports can be written to or read from.
The PIN register can only be read, and the value read is the value
presently at the pins in the register. Input is read from a pin.
The short program following shows the use of these registers to con-
trol, read, and write values to two pins of PORTB. (Ports are desig-
nated by letters in the AVR processors.)
In the following example, the button is connected to pin 4 of port B.
Pressing the button connects this pin to ground (0 volts) and would
cause an input of 0 at the pin.
Normally, a pull-up resistor is used to keep the pin high (1) when
the switch is open. These are provided in the processor.
The speaker is connected to pin 5 of port B.
199
A simple program-controlled I/O example
The following program causes the speaker to buzz when the button
is pressed. It is an infinite loop, as are many examples of program
controlled I/O.
The program reads pin 4 of port B until it finds it set to zero (the
button is pressed). Then it jumps to code that sets bit 5 of port
B (the speaker input) to 0 for a fixed time, and then resets it to 1.
(Note that pins are read, ports are written.)
#include <m168def.inc>
.org 0
reset:
ldi R16, 0b00100000 ; load register 16 to set PORTB
; registers as input or output
out DDRB, r16 ; set PORTB 5 to output,
; others to input
ser R16 ; load register 16 to all 1’s
out PORTB, r16 ; set pullups (1’s) on inputs
200
LOOP:                   ; infinite wait loop
    sbic PINB, 4        ; skip next line if button pressed
    rjmp LOOP           ; repeat test
                        ; (the speaker-toggling code below is a
                        ; reconstruction to match the C version)
    cbi PORTB, 5        ; speaker output low
    ldi R16, 255        ; delay count
SPIN1:
    subi R16, 1
    brne SPIN1          ; spin for a fixed time
    sbi PORTB, 5        ; speaker output high
    ldi R16, 255        ; delay count again
SPIN2:
    subi R16, 1
    brne SPIN2
    rjmp LOOP           ; back to testing the button
201
Following is a (roughly) equivalent C program:

#include <avr/io.h>
#include <util/delay.h>

int main(void)
{
    DDRB  = 0B00100000;                  /* PB5 output, others input */
    PORTB = 0B11111111;                  /* pull-ups on the inputs   */
    while (1) {
        while (!(PINB & 0B00010000)) {   /* while button pressed     */
            PORTB |=  0B00100000;        /* speaker high             */
            _delay_loop_1(128);
            PORTB &= ~0B00100000;        /* speaker low; pull-ups kept */
            _delay_loop_1(128);
        }
    }
    return 1;
}
Two words about mechanical switches — they bounce! That is, they
make and break contact several times in the few milliseconds before
full contact is made or broken. This means that a single switch
operation may be seen as several switch actions.
The way this is normally handled is to read the value at a switch (in
a loop) several times over a short period, and report a stable value.
202
Interrupt-controlled I/O
203
In the previous figure, the “handshake out” outputs would be con-
nected to a priority encoder to implement this type of I/O. The other
connections remain the same. (Some systems use a “daisy chain”
priority system to determine which of the interrupting devices is ser-
viced first. “Daisy chain” priority resolution is discussed later.)
[Figure: the same port configuration as before, but with each device's
“handshake out” line routed to a priority interrupt controller instead
of a status port.]
204
Returning control from an interrupt
205
Vectored interrupts
206
Interrupts in the AVR processor
The AVR uses vectored interrupts, with fixed addresses in program
memory for the interrupt handling routines.
Interrupt vectors point to low memory; the following are the locations
of the memory vectors for some of the 26 possible interrupt events in
the ATmega168:
Address Source Event
0X000 RESET power on or reset
0X002 INT0 External interrupt request 0
0X004 INT1 External interrupt request 1
0X006 PCINT0 pin change interrupt request 0
0X008 PCINT1 pin change interrupt request 1
0X00A PCINT2 pin change interrupt request 2
0X00C WDT Watchdog timer interrupt
· ·
· ·
Interrupts are prioritized as listed; RESET has the highest priority.
Normally, the instruction at the memory location of the vector is a
jmp to the interrupt handler.
(In processors with 2K or fewer program memory words, rjmp is
sufficient.)
207
Reducing power consumption with interrupts
208
Enabling external interrupts in the AVR
209
The following code enables pin change interrupt 1 (PCINT1) and
clears any pending interrupts by writing 1 to bit 7 of the respective
registers:
There is also a register associated with the particular pins for the
PCINT interrupts. They are the Pin Change Mask Registers (PCMSK1
and PCMSK0).
We want to enable the input connected to the switch, at PINB[4],
which is pcint12, and therefore set bit 4 of PCMSK1, leaving the
other bits unchanged.
Normally, this would be possible with the code
sbi PCMSK1, 4
Unfortunately, this register is one of the extended I/O registers, and
must be written to as a memory location.
The following code sets up its address in register pair Y, reads the
current value in PCMSK1, sets bit 4 to 1, and rewrites the value in
memory.
.org 0
vects:
    jmp RESET      ; vector for reset
    jmp EXT_INT0   ; vector for int0
    jmp EXT_INT1   ; vector for int1
    jmp PCINT0     ; vector for pcint0
    jmp PCINT1     ; vector for pcint1
    jmp PCINT2     ; vector for pcint2
                   ; ... the remaining vectors (WDT, timers, etc.) follow
The next thing necessary is to set the stack pointer to a high memory
address, since interrupts push values on the stack:
After this, interrupts can be enabled after the I/O ports are set up,
as in the program-controlled I/O example.
211
#include <m168def.inc>
.org 0
VECTS:
jmp RESET ; vector for reset
jmp EXT_INT0 ; vector for int0
jmp EXT_INT1 ; vector for int1
jmp PCINT_0 ; vector for pcint0
jmp BUTTON ; vector for pcint1
jmp PCINT_2 ; vector for pcint2
EXT_INT0:
EXT_INT1:
PCINT_0:
PCINT_2:
reti
RESET:
; set up pin change interrupt 1
ldi r28, PCMSK1 ; load address of PCMSK1 in Y low
clr r29 ; load high byte with 0
ld r16, Y ; read value in PCMSK1
sbr r16,0b00010000 ; allow pin change interrupt on portB pin 4
st Y, r16 ; store new PCMSK1
212
    sei                 ; enable interrupts
    rjmp LOOP           ; enter the main loop

BUTTON:                 ; pin change interrupt 1 handler
    reti                ; waking the processor is all that is needed

SNOOZE:
    sleep               ; sleep until an interrupt occurs
LOOP:
    sbic PINB, 4        ; skip next line if button pressed
    rjmp SNOOZE         ; go back to sleep if button not pressed
                        ; (speaker-toggling code as in the
                        ; program-controlled example, then:)
SPIN2:
    subi R16, 1
    brne SPIN2
    rjmp LOOP
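For comparison, here is a rough avr-libc C sketch of the same
interrupt-driven behavior, following the register and bit choices used
in these notes (PCMSK1 bit 4, pin change interrupt 1); the details
are assumptions, not tested code:

#include <avr/io.h>
#include <avr/interrupt.h>
#include <avr/sleep.h>
#include <util/delay.h>

ISR(PCINT1_vect) { }                 /* waking up is all that is needed */

int main(void)
{
    DDRB    = 0B00100000;            /* PB5 output, others input       */
    PORTB   = 0B11111111;            /* pull-ups on the inputs         */
    PCMSK1 |= (1 << 4);              /* pin change mask bit 4, as above */
    PCICR  |= (1 << PCIE1);          /* enable pin change interrupt 1  */
    sei();                           /* global interrupt enable        */

    while (1) {
        if (PINB & (1 << 4)) {       /* button not pressed             */
            sleep_mode();            /* sleep until a pin changes      */
        } else {
            PORTB |=  (1 << 5);      /* buzz while pressed             */
            _delay_loop_1(128);
            PORTB &= ~(1 << 5);
            _delay_loop_1(128);
        }
    }
}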
213
Direct memory access
214
There are two possibilities for the timing of the data transfer from
the DMA controller to memory:
215
One problem that systems employing several DMA devices have to
address is the contention for the single system bus. There must be
some method of selecting which device controls the bus (acts as “bus
master”) at any given time. There are many ways of addressing the
“bus arbitration” problem; three techniques which are often imple-
mented in processor systems are the following (these are also often
used to determine the priorities of other events which may occur si-
multaneously, like interrupts). They rely on the use of at least two
signals (bus request and bus grant), used in a manner similar
to the two-wire handshake.
Three commonly used arbitration schemes, described in turn below, are:
• Daisy chain arbitration
• Priority encoded arbitration
• Distributed arbitration
216
Daisy chain arbitration Here, the requesting device or devices
assert the signal bus request. The bus arbiter returns the
bus grant signal, which passes through each of the devices
which can have access to the bus, as shown below. Here, the pri-
ority of a device depends solely on its position in the daisy chain.
If two or more devices request the bus at the same time, the high-
est priority device is granted the bus first, then the bus grant
signal is passed further down the chain. Generally a third sig-
nal (bus release) is used to indicate to the bus arbiter that the
first device has finished its use of the bus. Holding bus request
asserted indicates that another device wants to use the bus.
[Figure: the bus grant line threaded through the devices of priority
1, 2, …, n; the bus request and bus release lines are shared.]
217
Priority encoded arbitration Here, each device has a request line
connected to a centralized arbiter that determines which device
will be granted access to the bus. The order may be fixed by the
order of connection (priority encoded), or it may be determined
by some algorithm preloaded into the arbiter. The following
diagram shows this type of system. Note that each device has a
separate line to the bus arbiter. (The bus grant signals have
been omitted for clarity.)
[Figure: centralized arbitration. Each of devices 1 … n has its own
request line into the bus arbiter; the bus grant signals are omitted
for clarity.]
218
Distributed arbitration by self-selection Here, the devices them-
selves determine which of them has the highest priority. Each
device has a bus request line or lines on which it places a code
identifying itself. Each device examines the codes for all the re-
questing devices, and determines whether or not it is the highest
priority requesting device.
219
The I/O address space
220
Some problems can arise with memory mapped I/O in systems which
use cache memory or virtual memory. If a processor uses a virtual
memory mapping, and the I/O ports are allowed to be in a virtual
address space, the mapping to the physical device may not be con-
sistent if there is a context switch or even if a page is replaced.
If physical addressing is used, mapping across page boundaries may
be problematic.
In many operating systems, I/O devices are directly addressable only
by the operating system, and are assigned to physical memory loca-
tions which are not mapped by the virtual memory system.
221
In the “real world” ...
Although we have been discussing fairly complex processors like the
MIPS, the largest market for microprocessors is still for small, simple
processors much like the early microprocessors. In fact, there is still
a large market for 4-bit and 8-bit processors.
These devices are used as controllers for other products. A large part
of their function is often some kind of I/O, from simple switch inputs
to complex signal processing.
One function of such processors is as I/O processors for more so-
phisticated computers. The following diagram shows the sales of
controllers of various types:
[Figure: worldwide microcontroller sales, 1991–1996, rising from
$4.9 billion to $11.7 billion (US), divided among 4-bit, 8-bit,
16/32-bit, and DSP devices.]
The projected microcontroller sales (SIA projection) are 9.8 billion
for 2001; 9.6 billion for 2002; 12.0 billion for 2003; 13.0 billion for
2004; and 14 billion for 2005.
For DSP devices, the projections are 4.9 billion in 2002, 6.5 billion
in 2003, 8.4 billion in 2004, and 9.4 billion in 2005.
223
Magnetic disks
A magnetic disk drive consists of a set of very flat disks, called plat-
ters, coated on both sides with a material which can be magnetized
or demagnetized.
The magnetic state can be read or written by small magnetic heads
located on mechanical arms which can move in and out over the
surfaces of the disks, very close to but not actually touching, the
surfaces.
224
Each platter contains a number of tracks, and each track contains
a set of sectors.
[Figure: a stack of platters, with the concentric tracks on each
surface and the sectors within each track.]
Total storage is
(no. of platters) × (no. of tracks/platter) × (no. of sectors/track)
Typically, disks are formatted and bad sectors are noted in a table
stored in the controller.
225
Disks spin at speeds of 4200 RPM to 15,000 RPM. Typical speeds for
PC desktops are 7200 RPM and 10,000 RPM. Laptop disks usually
spin at 4200 or 5400 RPM.
“Disk speed” is usually characterized by several parameters:
average seek time, which is the average required for the read/write
head to be positioned over the correct track, typically about 8ms.
226
Latency can be reduced in several ways in modern disk systems.
• The controller can optimize the seek path (overall seek time) for
a set of reads, and thereby increase throughput.
In fact, systems are often built with large, redundant disk arrays for
several reasons. Typically, security against disk failure and increased
read speed are the main reasons for such systems.
Large disks are now so inexpensive that the Department now uses
large disk arrays as backup storage devices, replacing the slower and
more cumbersome tape drives. Presently, the department maintains
servers with several terabytes of redundant disk.
227
Disk arrays — RAID
228
There are several defined levels of RAID, as follows:
Some systems combine two RAID levels. The most common example
of this is RAID 0+1: a striped, mirrored disk array. It
provides redundancy and parallelism for both reads and writes.
229
Failure tolerance
230
Networking — the Ethernet
Originally, the physical medium for the Ethernet was a single coaxial
cable with a maximum length of about 500 m and a maximum of
100 connections.
It was basically a single, high speed (at the time) serial bus network.
It had a particularly simple distributed control mechanism, as well
as ways to extend the network (repeaters, bridges, routers, etc.)
We will describe the original form of the Ethernet, and its modern
switched counterpart.
[Figure: a single terminated coaxial cable, with a transceiver tap
connecting each station to the cable.]
231
The network used a variable length packet, transmitted serially at
the rate of 10 Mbits/second, with the following format:
232
One of the more interesting features of the Ethernet protocol is the
way in which a station gets access to the bus.
Each station listens to the bus, and does not attempt to transmit
while another station is transmitting, or in the interpacket delay
period. In this situation, the station is said to be deferring.
A station may transmit if it is not deferring. While a station trans-
mits, it also listens to the bus. If it detects an inconsistency between
the transmitted and received data (a collision, caused by another
station transmitting) then the station aborts transmission, and sends
4-6 bytes of junk (a jam) to ensure every other station transmitting
also detects the collision.
Each transmitting station then waits a random time interval before
attempting to retransmit. On consecutive collisions, the size of the
random interval is doubled, up to a maximum of 10 doublings. The
base interval is 512 bit times (51.2 µs); a sketch of the backoff
computation follows below.
This arbitration mechanism is fair, does not rely on any central
arbiter, and is simple to implement.
While it may seem inefficient, usually there are relatively few colli-
sions, even in a fairly highly loaded network. The average number of
collisions is actually quite small.
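A minimal C sketch of that binary exponential backoff computation
(the function name and use of rand() are assumptions of this sketch):

#include <stdlib.h>

#define SLOT_BIT_TIMES 512           /* the base interval: 51.2 us at 10 Mbit/s */

/* after the n-th consecutive collision, wait a random number of
   slots in [0, 2^min(n,10) - 1] before retransmitting */
unsigned backoff_slots(unsigned collisions)
{
    unsigned k = (collisions < 10) ? collisions : 10;
    return (unsigned)(rand() % (1u << k));
}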
233
Current ethernet systems
The modern variant of the Ethernet is quite different (but the same
protocols apply).
In the present system, individual stations are connected to a switch,
using an 8-conductor wire (Cat-5 wire, but only 4 wires are actually
used) which allows bidirectional traffic from the station to the switch
at 100 Mbits/s in either direction.
A more recent variant uses a similar arrangement, with Cat-6 wire,
and can achieve 1 Gbit/second.
[Figure: host stations connected point-to-point by Cat-5 wire to a
central switch.]
234
The maximum length of a single link is 100 m., and switches are
often linked by fibre optical cable.
The following pictures show the network in the Department:
The first shows the cable plant (where all the Cat-5 wires are con-
nected to the switches).
The second shows the switches connecting the Department to the
campus network. It is an optical fiber network operating at 10 Gbit/s.
The optical fiber cables are orange.
The third shows the actual switches used for the internal network.
Note the orange optical fibre connection to each switch.
235
236
In the following, note the orange optical fibre cable.
237
238
In the previous picture, there were 8 sets of high-speed switches,
each with 24 1 Gbit/s. ports, and 1 fibre optical port at 10 Gbit/s.
Each switch is interconnected to the others by a backplane connector
which can transfer data at 2 Gbits/s.
The 10 Gbit/s. ports are connected to the departmental servers
which collectively provide several Tera-bytes of redundant (raid) disk
storage for departmental use.
239
Multiprocessor systems
In order to perform computation faster, there are two basic strategies:
make a single processor faster, or apply more processors to the problem.
The first of these approaches was successful for several decades, but
the low cost per unit of commercial microprocessors is so attractive
that microprocessor based systems have the potential to provide
very high performance computing at relatively low cost.
240
Multiprocessor systems
241
An alternate system, (a shared memory multiprocessor system) where
processors share a large common memory, could look as follows:
[Figure: a shared memory multiprocessor; the processors and the
memory modules communicate through an interconnection network.]
242
A single bus shared memory multiprocessor system:
[Figure: a single bus shared memory multiprocessor; each processor
has its own cache (with tags) on a local bus, and the caches connect
over the global bus to the shared memory.]
Note that here each processor has its own cache. Virtually all current
high performance microprocessors have a reasonable amount of high
speed cache implemented on chip.
In a shared memory system, this is particularly important to reduce
contention for memory access.
243
The cache, while important for reducing memory contention, must
behave somewhat differently than the cache in a single processor
system.
Recall that a cache had four components:
244
Multiprocessor Cache Coherency
245
2. Cache shared, writable data and use hardware to maintain cache
coherence.
246
One example of an invalidating policy is the write-once policy —
a cache writes data back to memory one time (the first write) and
when the line is flushed. On the first write, other caches in the
system holding this data mark their entries invalid.
A cache line can have 4 states:
247
The Intel Pentium class processors use a similar cache protocol called
the MESI protocol. Most other single-chip multiprocessors use this,
or a very similar protocol, as well.
modified — the cache line has been modified, and is available only
in this cache (dirty)
                         M       E       S       I
Cache line valid?        Yes     Yes     Yes     No
Memory copy is           stale   valid   valid   ?
Multiple cache copies?   No      No      Maybe   Maybe
248
Read hit
For a read hit, the processor takes data directly from the local cache
line, as long as the line is valid. (If it is not valid, it is a cache miss,
anyway.)
Read miss
Here, there are several possibilities:
• If no other cache has the line, the data is taken from memory
and marked exclusive.
249
Write hit
The processor marks the line in cache as modified. If the line was
already in state modified or exclusive, then that cache has the only
copy of the data, and nothing else need be done. If the line was in
state shared, then the other caches should mark their copies invalid.
(A bus transaction is required).
Write miss
The processor first reads the line from memory, then writes the word
to the cache, marks the line as modified, and performs a bus transac-
tion so that if any other cache has the line in the shared or exclusive
state it can be marked invalid.
If, on the initial read, another cache has the line in the modified
state, that cache marks its own copy invalid, suspends the initiating
read, and immediately writes its value to memory. The suspended
read resumes, getting the correct value from memory. The word can
then be written to this cache line, and marked as modified.
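These cases can be summarized as a small state machine. The
following C sketch follows the transitions described above for a single
line; the bus transactions themselves and the write-back of modified
data are only noted in comments, and this is an illustration, not the
Pentium's exact protocol:

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi;

/* local processor read (hit or miss) */
mesi on_read(mesi s, int another_cache_has_line)
{
    if (s == INVALID)                    /* read miss: fetch the line */
        return another_cache_has_line ? SHARED : EXCLUSIVE;
    return s;                            /* read hit: no state change */
}

/* local processor write (hit or miss); from SHARED or INVALID this
   requires a bus transaction to invalidate the other copies */
mesi on_write(mesi s)
{
    (void)s;
    return MODIFIED;
}

/* another processor is observed reading the line on the bus; a line
   in MODIFIED must first be written back (not shown) */
mesi on_bus_read(mesi s)
{
    return (s == INVALID) ? INVALID : SHARED;
}

/* another processor is observed writing (invalidating) the line */
mesi on_bus_write(mesi s)
{
    (void)s;
    return INVALID;
}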
250
False sharing
251
Example — a simple multiprocessor calculation
The first step is to split the set of numbers into subsets of the same
size. Since there is a single, common memory for this machine, there
is no need to partition the data; we just give different starting ad-
dresses in the array to each processor. Pn is the number of the
processor, between 0 and 15.
All processors start the program by running a loop that sums their
subset of numbers:
tmp = 0;
for (i = 1000000 * Pn; i < 1000000 * (Pn + 1); i = i + 1) {
    tmp = tmp + A[i];     /* sum the assigned area */
}
sum[Pn] = tmp;
This loop uses load instructions to bring the correct subset of num-
bers to the caches of each processor from the common main memory.
Each processor must have its own version of the loop counter variable
i, so it must be a “private” variable. Similarly for the partial sum,
tmp. The array sum[Pn] is a global array of partial sums, one from
each processor.
252
The next step is to add these many partial sums, using “divide and
conquer.” Half of the processors add pairs of partial sums, then a
quarter add pairs of the new partial sums, and so on until we have
the single, final sum.
In this example, the two processors must synchronize before the
“consumer” processor tries to read the result written to memory
by the “producer” processor; otherwise, the consumer may read the
old value of the data. Following is the code (half is private also):
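The slide's code is not reproduced in these notes. Here is a sketch of
the usual divide-and-conquer reduction, in the same style as the loop
above; Pn and sum[] are taken from the surrounding discussion, and a
barrier synch() is assumed to be available:

half = 16;                        /* 16 processors */
do {
    synch();                      /* producers must write before consumers read */
    if (half % 2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half - 1];   /* odd count: P0 takes the stray */
    half = half / 2;              /* dividing line between the two halves */
    if (Pn < half)
        sum[Pn] = sum[Pn] + sum[Pn + half];
} while (half > 1);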
253
Does the parallelization actually speed up this computation?
254
Synchronization Using Cache Coherency
255
The processor then races against all other processors that were simi-
larly spin waiting to see who can lock the variable first. All processors
use a test-and-set instruction that reads the old value and stores a
1 (“locked”) into the lock variable. The single winner will see the 0
(“unlocked”), and the losers will see a 1 that was placed there by the
winner. (The losers will continue to write the variable with the locked
value of 1, but that doesn’t change its value.) The winning processor
then executes the code that updates the shared data. When the win-
ner exits, it stores a 0 (“unlocked”) into the lock variable, thereby
starting the race all over again.
The term usually used to describe the code segment between the lock
and the unlock is a “critical section.”
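This lock can be written directly with the C11 atomic_flag, whose
test-and-set is exactly the read-the-old-value-and-store-1 primitive
described above (a minimal sketch, not a production lock):

#include <stdatomic.h>

atomic_flag lock = ATOMIC_FLAG_INIT;

void acquire(void)
{
    /* test-and-set returns the old value: 1 (locked) for the losers,
       0 (unlocked) for the single winner */
    while (atomic_flag_test_and_set(&lock))
        ;                        /* spin */
}

void release(void)
{
    atomic_flag_clear(&lock);    /* store 0: start the race again */
}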
256
[Flowchart: spin lock acquisition. Load the lock variable S; if it is
not 0 (unlocked), load it again. When S reads 0, try to lock it with
test-and-set (setting S = 1); if the test-and-set does not return 0,
go back to spinning and compete for the lock again. On success,
access the shared resource (the critical section), then unlock by
setting S = 0.]
257
Let us see how this spin lock scheme works with bus-based cache
coherency.
One advantage of this scheme is that it allows processors to spin wait
on a local copy of the lock in their caches. This dramatically reduces
the amount of bus traffic. The following table shows the bus and
cache operations for multiple processors trying to lock a variable.
Once the processor with the lock stores a 0 into the lock, all other
caches see that store and invalidate their copy of the lock variable.
Then they try to get the new value of 0 for the lock. (With write
update cache coherency, the caches would update their copy rather
than first invalidate and then load from memory.) This new value
starts the race to see who can set the lock first. The winner gets the
bus and stores a 1 into the lock; the other caches replace their copy
of the lock variable containing 0 with a 1.
They read that the variable is already locked and must return to
testing and spinning.
Because of the communication traffic generated when the lock is
released, this scheme has difficulty scaling up to many processors.
258
Step 1: P0 has the lock; P1 and P2 spin, testing whether lock = 0.
        Bus: none.
Step 2: P0 sets lock = 0, and the 0 is sent over the bus.
        Bus: write-invalidate of the lock variable from P0.
Step 3: P1 and P2 both take cache misses.
        Bus: the bus decides to service P2's miss first.
Step 4: P1 waits while the bus is busy; P2 reads lock = 0.
        Bus: cache miss for P2 satisfied.
Step 5: P1 reads lock = 0; P2 swaps: reads the lock and sets it to 1.
        Bus: cache miss for P1 satisfied.
Step 6: P1 swaps: reads the lock and sets it to 1; P2's swap returns
        0, and its 1 is sent over the bus.
        Bus: write-invalidate of the lock variable from P2.
Step 7: P1's swap returns 1, and its 1 is sent over the bus; P2 owns
        the lock, so it can update the shared data.
        Bus: write-invalidate of the lock variable from P1.
Step 8: P1 returns to testing and spinning.
        Bus: none.
259
Multiprocessing without shared memory — networked
processors
[Figure: a network multiprocessor; each processor has its own private
memory, and the processors communicate only through the
interconnection network.]
260
The following diagrams show some useful network topologies. Typi-
cally, a topology is chosen which maps onto features of the program
or data structures.
[Figures: common topologies — 1D mesh, ring, 2D mesh, 2D torus,
tree, and 3D grid.]
261
In the following, the layout area is (eventually) dominated by the
interconnections:
[Figures: hypercube and butterfly networks.]
262
Let us assume a simple network; for example, a single high-speed
Ethernet connection to a switched hub.
(This is a common approach for achieving parallelism in Linux sys-
tems. Parallel systems like this are often called “Beowulf clusters.”)
[Figure: the cluster's hosts all connect to a single switch.]
263
Parallel Program (Message Passing)
264
The next step is to get the sum of each subset. This step is simply
a loop that every execution unit follows; read a word from local
memory and add it to a local variable:
receive(A1);
sum = 0;
for (i = 0; i < 1000000; i = i + 1)
    sum = sum + A1[i];    /* sum the local array */
Again, the final step is adding these 16 partial sums. Now, each
partial sum is located in a different execution unit. Hence, we must
use the interconnection network to send partial sums to accumulate
the final sum.
Rather than sending all the partial sums to a single processor, which
would result in sequentially adding the partial sums, we again apply
“divide and conquer.” First, half of the execution units send their
partial sums to the other half of the execution units, where two partial
sums are added together. Then one quarter of the execution units
(half of the half) send this new partial sum to the other quarter of the
execution units (the remaining half of the half) for the next round of
sums.
265
This halving, sending, and receiving continues until there is a single
sum of all numbers.
limit = 16;
half = 16;                      /* 16 processors */
repeat
    half = (half + 1) / 2;      /* send vs. receive dividing line */
    if (Pn >= half && Pn < limit) send(Pn - half, sum, 1);
    if (Pn < (limit / 2)) {     /* only the receivers wait for a message */
        receive(tmp);
        sum = sum + tmp;
    }
    limit = half;               /* upper limit of senders */
until (half == 1);              /* exit with the final sum */
This code divides all processors into senders or receivers and each
receiving processor gets only one message, so we can presume that a
receiving processor will stall until it receives a message. Thus, send
and receive can be used as primitives for synchronization as well as
for communication, as the processors are aware of the transmission
of data.
266
How much does parallel processing help?
In the previous course, we met Amdahl’s Law, which stated that, for
a given program and data set, the total amount of speedup of the
program is limited by the fraction of the program that is serial in
nature.
If P is the fraction of a program that can be parallelized, and the
serial (non-parallelizable) fraction of the code is 1 − P, then the total
time taken by the parallel system is (1 − P) + P/N. The speedup
S(N) with N processors is therefore

    S(N) = 1 / ((1 − P) + P/N)

As N becomes large, this approaches 1/(1 − P). For example, with
P = 0.95, even N = 100 processors give S = 1/(0.05 + 0.0095) ≈ 16.8,
and no number of processors can exceed a speedup of 20.
So, for a fixed problem size, the serial component of a program limits
the speedup.
Of course, if the program has no serial component, then this is not
a problem. Such programs are often called “trivially parallelizable”,
but many interesting problems are not of this type.
267
Gustafson’s law
Gustafson’s law applies when the problem size grows with the number
of processors: the serial part stays fixed while the parallel part scales
with N, so the (scaled) speedup is S(N) = N − (1 − P)(N − 1).
For problems fitting this model, this speedup is really the best one
can hope from applying N processors to a problem.
268
So, we have two models for analyzing the potential speedup for par-
allel computation.
They differ in the way they determine speedup.
Let us think of a simple example to show the difference between the
two:
Consider booting a computer system. It may be possible to reduce
the time required somewhat by running several processes simultane-
ously, but the serial nature will pose a lower limit on the amount of
time required. (Amdahl’s Law).
Gustafson’s Law would say that, in the same time that is required
to boot the processor, more facilities could be made available; for
example, initiating more advanced window managers, or bringing up
peripheral devices.
269