may execute one application forever
Tuning the cache configuration (size, associativity, line size) can save a lot of energy
Associativity example
40% difference in memory access energy
[Figure: miss rate (0.0% to 1.0%) and (b) normalized energy (0% to 100%) versus associativity (1, 2, 4) for epic and mpeg2]
Mass production
Unique chips getting more expensive as technology
scales down (ITRS)
Huge benefits to mass producing a single chip
Harder to produce chips distinguished by cache
when we have 50-100 processors per chip
Adapt to program phases
Recent research shows programs have different
cache requirements over time
Much research assumes a configurable cache
Four-way associative base cache
Ways can be concatenated to form two-way
Can be further concatenated to direct-mapped
Concatenation is logical: only 1 array accessed
[Figure: the four-way base cache; Way 1 and Way 2 of the two-way configuration; Way 1 of the direct-mapped configuration]
Chuanjun Zhang, UC Riverside 7
Way-Concatenate Cache Architecture
Trivial area overhead
No performance overhead
• NAND transistors enlarged
[Figure: way-concatenate cache. Address fields: tag address (a31 to a13), a12, a11, index (a10 to a5), line offset (a4 to a0). The configuration circuit reads reg0 and reg1 to select the ways: 0 0 = DM, 0 1 = 2. Six 6x64 index decoders, signals c2 and c3, the data array, and the critical path are marked.]
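The address split in the figure can be illustrated with a minimal sketch. The field widths (a 6-bit row index a10 to a5, bank-select bits a11 and a12) are read from the figure labels; the bank numbering and the pairing of banks in two-way mode are assumptions made for illustration, not the circuit described on the slide.

```python
# Sketch of logical way concatenation (bank numbering is hypothetical).
ROW_SHIFT = 5        # line offset occupies a4..a0
ROW_MASK = 0x3F      # 6-bit row index a10..a5, matching the 6x64 decoders


def row_index(addr: int) -> int:
    """Row selected inside each physical array (bits a10..a5)."""
    return (addr >> ROW_SHIFT) & ROW_MASK


def banks_accessed(addr: int, ways: int) -> list[int]:
    """Physical arrays probed for one access in a given configuration.

    Assumption: a11 picks the bank pair in two-way mode, and a12:a11
    pick the single bank in direct-mapped mode.
    """
    a11 = (addr >> 11) & 1
    a12 = (addr >> 12) & 1
    if ways == 4:
        return [0, 1, 2, 3]            # all four ways are probed
    if ways == 2:
        return [2 * a11, 2 * a11 + 1]  # two ways, each two banks deep
    if ways == 1:
        return [2 * a12 + a11]         # direct-mapped: only 1 array accessed
    raise ValueError("ways must be 1, 2, or 4")
```

For example, `banks_accessed(0x1820, 1)` probes a single bank, while the same address in four-way mode probes all four; the row inside each bank is the same in every mode.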
[Figure: a 64B cache line holding non-consecutive code; 16B are used, 48B are wasted]
[Figure: bar chart, y-axis 0% to 80%, for the benchmarks padpcm, adpcm, mpeg, ucbqsort, auto2, pjepg, g721, parser, jpeg, bcnt, v42, mcf, binary, blit, art, brev, vpr, epic, crc, bilv, g3fax, pegwit, fir]
Energy savings when way concatenation, way shutdown, and cache line size concatenation are implemented. (C. Zhang TECS ACM To Appear)
Simulation-based methods
Drawback: slowness.
Seconds of real-time work may take tens of hours to simulate
Setting up the simulation tools adds to the time
Self-exploring method
Cache parameter explorer
Incorporated on a prototype platform
Pareto parameters: a set of parameters that show the performance and energy trade-off
An explorer is used to detect the Pareto set of cache parameters
The explorer stands aside to collect information used to calculate the energy
[Figure: processor with I$ and D$, memory, and the explorer]
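What a Pareto set of cache configurations means can be shown with a minimal sketch. The measurements below are hypothetical placeholders, not values from the slides; a configuration is kept only if no other configuration is at least as good in both energy and time and strictly better in one.

```python
def pareto_set(points):
    """Return the (name, energy, time) points that no other point dominates.

    Lower is better for both energy and time.
    """
    front = []
    for name, energy, time in points:
        dominated = any(
            e <= energy and t <= time and (e < energy or t < time)
            for _, e, t in points
        )
        if not dominated:
            front.append((name, energy, time))
    return sorted(front, key=lambda p: p[1])  # order by increasing energy


# Hypothetical (name, energy in mJ, time) measurements for illustration only.
measurements = [
    ("A", 0.04, 68),   # lowest energy
    ("C", 0.06, 60),   # in-between trade-off
    ("B", 0.12, 56),   # best performance
    ("D", 0.10, 64),   # worse than C in both energy and time
]

front_names = [name for name, _, _ in pareto_set(measurements)]
```

With these placeholder numbers, D is dropped because C beats it on both axes, and the remaining points form the trade-off curve.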
[Figure: pegwit, time versus energy (mJ). Points A, C, and B show the tradeoff between energy and performance; B is the best performance; D is not a Pareto point.]
Do not need cache flush
Then searching for point B
Last we search for points in region C
[Figure: time versus energy (mJ), showing point B (best performance) and region C]
[Figure: two panels comparing 8k, 4k, and 2k caches across line sizes (16B, 32B, 64B) and associativities (1W, 2W, 4W); one axis runs 0% to 6%, the other 0.0 to 0.4; label: One Way]
[Figure: energy (J) versus cache size, 1KB to 1MB. Benchmark: parser]
Searching for Point A
Search order: cache size, line size, associativity, way prediction
Point A: the least energy cache configuration
[Figure: ways W1 to W4; time versus energy (mJ), with point A at the lowest energy]
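The one-parameter-at-a-time search can be sketched as follows. This is an illustration under stated assumptions, not the explorer's actual logic: the stopping rule (stop stepping a parameter as soon as energy no longer improves), the parameter values, and the `toy_energy` model are all hypothetical. The toy space is an unconstrained cross product, so its size differs from the configuration count given on the slides.

```python
from itertools import product

# Hypothetical parameter values, searched in the order listed on the slide.
ORDER = ["size_kb", "line_b", "ways", "way_pred"]
SPACE = {
    "size_kb": [2, 4, 8],
    "line_b": [16, 32, 64],
    "ways": [1, 2, 4],
    "way_pred": [False, True],
}


def toy_energy(cfg):
    """Made-up energy model, used only to exercise the search."""
    return (abs(cfg["size_kb"] - 4) + abs(cfg["line_b"] - 32) / 16
            + 0.5 * cfg["ways"] + (0.25 if cfg["way_pred"] else 0.0))


def heuristic_search(energy, space, order):
    """Tune one parameter at a time, keeping earlier choices fixed."""
    config = {name: space[name][0] for name in order}
    best = energy(config)
    evaluations = 1
    for name in order:
        for value in space[name][1:]:
            trial = dict(config, **{name: value})
            trial_energy = energy(trial)
            evaluations += 1
            if trial_energy < best:
                config, best = trial, trial_energy
            else:
                break  # assumed rule: stop once energy stops improving
    return config, best, evaluations


config, best, evaluations = heuristic_search(toy_energy, SPACE, ORDER)

# Exhaustive baseline over the same toy space, for comparison.
all_configs = [dict(zip(ORDER, vals)) for vals in product(*(SPACE[n] for n in ORDER))]
exhaustive_best = min(toy_energy(c) for c in all_configs)
```

On this toy model the heuristic evaluates far fewer configurations than the exhaustive sweep and reaches the same minimum; on a real workload it can miss the optimum, since the parameters are not independent.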
Point B: the best performance cache configuration
High associativity doesn't mean high performance
Large line size may not be good for data cache
[Figure: time versus energy (mJ), with point A and point B (best performance)]
[Figure: explorer datapath with control, two muxes, multiplier, adder, memory, configure register, a register holding the lowest energy, and a comparator driving com_out]
Implementing the Heuristic in Hardware
Total size of the explorer
About 4,200 gates, or 0.041 mm² in 0.18 micron CMOS technology.
Area overhead
Compared to the reported size of the MIPS 4Kp with cache, this
represents just over a 3% area overhead.
Power consumption:
2.69 mW at 200 MHz. The power overhead compared with the
MIPS 4Kp would be less than 0.5%.
Furthermore, the exploring hardware is used only during the
exploring stage, and can be shut down after the best
configuration is determined.
How good is the heuristic?
Time complexity:
Search all space: O(m x n x l x p)
Heuristic : O(m + n + l + p)
m: number of associativities, n: number of cache sizes
l: number of cache line sizes, p: way prediction on/off
Efficiency
On average, 5 searches instead of all 27 are enough to find point A
2 out of 19 benchmarks miss the lowest power cache configuration.
With a different search order (line size, associativity, way prediction, then cache size), 11 out of 19 benchmarks miss the best configuration
[Figure: time (cycles) versus energy (mJ) plots for individual benchmarks, including padpcm and crc]