
A Highly Configurable Cache Architecture for Embedded Systems

Chuanjun Zhang*, Frank Vahid**, and Walid Najjar

*Dept. of Electrical Engineering
Dept. of Computer Science and Engineering
University of California, Riverside
**Also with the Center for Embedded Computer Systems at UC Irvine

This work was supported by the National Science Foundation and the Semiconductor Research Corporation

Chuanjun Zhang, UC Riverside
Outline
 Why a Configurable Cache? What Parameters?
 Configurable Associativity by Way Concatenation
 Configurable Size by Way Shutdown
 Configurable Line Size
 How to Configure the Cache
 Cache Parameter Explorer
 A Heuristic Algorithm Searches the Pareto Set of Cache Parameters:
 Trade-off Between Energy Dissipation and Performance
 The Explorer Is Synthesized Using Synopsys
 Conclusions and Future Work



Why Choose Cache: Impacts Performance and Power
 Performance impacts are well known
 Power
 ARM920T: caches consume 50% of total processor system power (Segars 01)
 M*CORE: unified cache consumes 50% of total processor system power (Lee/Moyer/Arends 99)
 We'll show that a configurable cache can cut that power nearly in half on average

Why a Configurable Cache?
 An embedded system may execute one application forever
 Tuning the cache configuration (size, associativity, line size) can save a lot of energy
 Associativity example
 40% difference in memory access energy

[Charts: miss rate (up to 2.0%) and normalized energy vs. associativity (1, 2, 4) for epic and mpeg2, from MediaBench]


Benefits of a Configurable Cache
 Mass production
 Unique chips get more expensive as technology scales down (ITRS)
 Huge benefits to mass-producing a single chip
 Harder to produce chips distinguished by cache when we have 50-100 processors per chip
 Adapt to program phases
 Recent research shows programs have different cache requirements over time
 Much research assumes a configurable cache


Caches Vary Greatly in Embedded Processors

Processor               I$ Size  As.  Line    D$ Size  As.  Line
AMD-K6-IIIE             32K      2    32      32K      2    32
Alchemy AU1000          16K      4    32      16K      4    32
ARM 7                   8K/U     4    16      8K/U     4    16
ColdFire                0-32K    DM   16      0-32K    N/A  N/A
Hitachi SH7750S (SH4)   8K       DM   32      16K      DM   32
Hitachi SH7727          16K/U    4    16      16K/U    4    16
IBM PPC 750CX           32K      8    32      32K      8    32
IBM PPC 7603            16K      4    32      16K      4    32
IBM750FX                32K      8    32      32K      8    32
IBM403GCX               16K      2    16      8K       2    16
IBM Power PC 405CR      16K      2    32      8K       2    32
Intel 960JA             2K       2    N/A     1K       2    N/A
Intel 960JD             4K       2    N/A     2K       2    N/A
Intel 960IT             16K      2    N/A     4K       2    N/A
Motorola MPC8240        16K      4    32      16K      4    32
Motorola MPC8540        32K      4    32/64   32K      4    32/64
Motorola MPC7455        32K      8    32      32K      8    32
NEC VR5500              32K      2    32      32K      2    32
NEC VR4131              16K      2    16/32   16K      2    16/32
NEC VR4181              4K       DM   16      4K       DM   16
NEC VR4181A             8K       DM   32      8K       DM   32
NEC VR4121              16       DM   16      8K       DM   16
PMC Sierra RM9000X2     16K      4    N/A     16K      4    N/A
PMC Sierra RM7000A      16K      4    32      16K      4    32
SandCraft sr71000       32K      4    32      32K      4    32
Sun Ultra SPARC Iie     16K      2    N/A     16K      DM   N/A
SuperH                  32K      4    32      32K      4    32
TI TMS320C6414          16K      DM   N/A     16K      2    N/A
TriMedia TM32A          32K      8    64      16K      8    64
Xilinx Virtex IIPro     16K      2    32      8K       2    32


Configurable Associativity by Way Concatenation
C. Zhang (ISCA 03)
 Four-way set-associative base cache
 Ways can be concatenated to form two-way
 Can be further concatenated to direct-mapped
 Concatenation is logical – only 1 array accessed

[Figure: four physical ways (Way 1-Way 4) grouped as four-way, two-way, or direct-mapped]
Way-Concatenate Cache Architecture
 Trivial area overhead
 No performance overhead
 NAND transistors enlarged to match inverter speed
 Configuration circuit operates concurrently with the decoders

Configuration register settings:
reg0  reg1  ways
0     0     DM
0     1     2
1     0     2
1     1     4

[Figure: the address splits into tag (a31-a13), index (a12-a5), and line offset (a4-a0); the configuration circuit drives enables c0-c3 into the 6x64 decoders of the data array; the tag compare and mux driver sit on the critical path]
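The two-bit configuration in the table above can be sketched as a small behavioral model. This is our own sketch, not the gate-level circuit; the function names are hypothetical. It shows how concatenating ways trades associativity for index depth in the fixed 8KB, 32B-line base cache:

```python
# Behavioral sketch of way concatenation (hypothetical model, not the
# actual circuit). Two configuration bits, reg0 and reg1, select how the
# four physical ways of the base cache are grouped.

def ways_enabled(reg0: int, reg1: int) -> int:
    """Logical associativity selected by the two configuration bits."""
    if reg0 and reg1:
        return 4        # no concatenation: 4-way
    if reg0 or reg1:
        return 2        # pairs concatenated: 2-way
    return 1            # all ways concatenated: direct-mapped

def logical_geometry(total_bytes=8192, line_bytes=32, reg0=1, reg1=1):
    """Concatenating ways deepens the index space: fewer ways, more sets,
    while total capacity stays constant."""
    ways = ways_enabled(reg0, reg1)
    sets = total_bytes // (line_bytes * ways)
    return ways, sets

print(logical_geometry(reg0=0, reg1=0))  # direct-mapped: (1, 256)
print(logical_geometry(reg0=1, reg1=1))  # 4-way:         (4, 64)
```

Note that capacity never changes here, only the way/set split, which is why way concatenation avoids the miss-rate penalty of way shutdown.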


Previous Method – Way Shutdown
 Albonesi proposed a cache where ways could be shut down
 To save dynamic power
 Motorola M*CORE has the same way-shutdown feature
 Unified cache – even allows setting each way as I, D, both, or off
 Reduces dynamic power by accessing fewer ways
 But decreases total size, so may increase miss rate

[Figure: Way 1-Way 4, with some ways shut down]


Way Shutdown Can Be Good for Static Power
 Static power (leakage) is increasingly important in nanoscale technologies
 We combine way shutdown with way concatenation
 Use the sleep-transistor method of Powell (ISLPED 2000)
 When off, the gated-Vdd transistor prevents leakage, but adds a 4% performance overhead

[Figure: SRAM cell with a gated-Vdd control transistor between the cell and Gnd]


Cache Line Size
C. Zhang (ISVLSI 03)

[Figure: (A) with a 64B cache line, 64B of consecutive code fills the whole line; (B) with non-consecutive code, 48B of the 64B line are wasted – a 16B line would avoid the waste]


Configurable Cache Line Size With Line Concatenation
 The physical line size is 16 bytes
 4 physical lines are filled when the line size is 64 bytes
 A programmable counter is used to designate the line size
 An interleaved off-chip memory organization supplies the fills

[Figure: one way with 16B physical lines, the fill counter, and the bus to off-chip memory]
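The counter-based fill can be sketched behaviorally (function names are ours, for illustration): on a miss, the counter determines how many consecutive 16B physical lines to fetch for the configured logical line size.

```python
# Behavioral sketch of line-size concatenation (hypothetical model).
# The physical line is fixed at 16 bytes; a counter fills 1, 2, or 4
# consecutive physical lines per miss for 16B, 32B, or 64B logical lines.

PHYSICAL_LINE = 16  # bytes

def fills_per_miss(logical_line_bytes: int) -> int:
    """Counter value: number of 16B physical-line fills per miss."""
    assert logical_line_bytes in (16, 32, 64)
    return logical_line_bytes // PHYSICAL_LINE

def fill_addresses(miss_addr: int, logical_line_bytes: int):
    """Addresses of the physical lines fetched from the interleaved
    off-chip memory for one miss."""
    base = miss_addr & ~(logical_line_bytes - 1)  # align to logical line
    n = fills_per_miss(logical_line_bytes)
    return [base + i * PHYSICAL_LINE for i in range(n)]

# A 64B-line miss triggers four interleaved 16B fetches:
print([hex(a) for a in fill_addresses(0x1238, 64)])
# ['0x1200', '0x1210', '0x1220', '0x1230']
```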
Computing Total Memory-Related Energy
 Considers CPU stall energy and off-chip memory energy
 Excludes CPU active energy
 Thus, represents all memory-related energy

energy_mem = energy_dynamic + energy_static
energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
energy_miss = energy_offchip_access + energy_uP_stall + energy_cache_block_fill
energy_static = cycles * energy_static_per_cycle
energy_miss = k_miss_energy * energy_hit
energy_static_per_cycle = k_static * energy_total_per_cycle
(We varied the k's to account for different system implementations)

Measured quantities:
 cache_hits, cache_misses, cycles – from SimpleScalar
 energy values – from our layout or data sheets

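The equations above translate directly into code. The numbers in the example below are made up purely for illustration; in the actual methodology, hits, misses, and cycles come from SimpleScalar and the energy constants from layout or data sheets.

```python
# Sketch of the slide's memory-energy model (illustrative values only).

def energy_mem(cache_hits, cache_misses, cycles,
               energy_hit, k_miss_energy, k_static,
               energy_total_per_cycle):
    """Total memory-related energy, per the slide's equations."""
    # energy_miss = offchip access + uP stall + cache block fill,
    # modeled as a multiple k_miss_energy of a hit's energy
    energy_miss = k_miss_energy * energy_hit
    energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
    energy_static_per_cycle = k_static * energy_total_per_cycle
    energy_static = cycles * energy_static_per_cycle
    return energy_dynamic + energy_static

# Hypothetical counts and energies (joules):
e = energy_mem(cache_hits=1_000_000, cache_misses=20_000, cycles=1_200_000,
               energy_hit=0.5e-9, k_miss_energy=50, k_static=0.3,
               energy_total_per_cycle=1.0e-9)
print(f"{e:.2e} J")
```

Varying k_miss_energy and k_static, as the authors did, models systems with cheaper or costlier off-chip access and different leakage regimes.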


Energy Savings

[Bar chart: normalized energy per benchmark (padpcm, adpcm, mpeg, ucbqsort, auto2, pjepg, g721, parser, jpeg, bcnt, v42, mcf, binary, blit, art, brev, vpr, epic, crc, bilv, g3fax, pegwit, fir) for five caches: cnv8K4W32B, cnv8K1W32B, cfg8Kwc32B, cfg8Kwcws32B, cfg8Kwcwslc; a few bars exceed 100% (127%, 620%, 126%)]

 Energy savings when way concatenation, way shutdown, and cache-line-size concatenation are implemented. (C. Zhang, TECS ACM, to appear)


Cache Parameters that Consume the Lowest Energy Vary Across Applications

Best configuration per benchmark:
Ben.      I$        D$
padpcm    8K1W32B   8K1W32B
crc       2K1W32B   4K1W64B
auto      8K2W16B   4K1W32B
bcnt      2K1W32B   2K1W64B
bilv      4K1W32B   2K1W32B
binary    2K1W32B   2K1W32B
blit      2K1W16B   8K2W32B
brev      4K1W32B   2K1W32B
g3fax     4K1W32B   4K1W16B
fir       4K1W32B   2K1W32B
jpeg      8K4W32B   4K2W32B
vpr       8K4W32B   2K1W16B
pjepg     4K1W32B   4K2W64B
ucbqsort  4K1W16B   4K1W64B
v42       8K1W16B   8K2W16B
adpcm     2K1W16B   4K1W16B
epic      2K1W64B   8K1W16B
g721      8K4W16B   2K1W16B
pegwit    4K1W16B   4K1W16B
mpeg2     4K1W32B   8K2W16B
art       2K1W32B   2K1W16B
parser    8K4W16B   8K2W64B
mcf       8K4W16B   8K1W16B
How to Configure the Cache
 Simulation-based methods
 Drawback: slowness
 Seconds of real-time work may take tens of hours to simulate
 Simulation tool setup further increases the time
 Self-exploring method
 Cache parameter explorer
 Incorporated on a prototype platform
 Pareto parameters: a set of parameters that shows the performance/energy trade-off


Cache Self-Exploring Hardware
 An explorer is used to detect the Pareto set of cache parameters
 The explorer stands aside to collect the information used to calculate the energy

[Figure: processor with I$ and D$, memory, and the explorer alongside]


Pareto Parameter Sets

[Scatter plot for pegwit: time (million cycles) vs. energy (mJ). Point A has the lowest energy, point B the best performance, points in region C trade off energy and performance, and point D is dominated – not a Pareto point]

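The Pareto set the explorer reports has a precise definition: a configuration is on the front if no other configuration is at least as good in both energy and time and strictly better in one. A small sketch, with point values that are illustrative rather than measured:

```python
# Pareto-front filtering over (energy, time) measurements.

def pareto_points(points):
    """points: list of (energy, time) tuples; returns the Pareto set."""
    def dominated(p, q):
        # q dominates p: q is no worse in both metrics (and differs),
        # so with <= on both axes it is strictly better in at least one.
        return q != p and q[0] <= p[0] and q[1] <= p[1]
    return [p for p in points if not any(dominated(p, q) for q in points)]

# Points loosely modeled on the pegwit plot: A (lowest energy),
# B (best performance), C (trade-off), D (dominated by C).
A, B, C, D = (0.05, 70), (0.15, 57), (0.09, 62), (0.12, 66)
print(pareto_points([A, B, C, D]))  # D is dropped
```

This brute-force check is O(n^2) in the number of configurations, which is fine here because the heuristic keeps the candidate set small.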


Heuristic Algorithm
 Searching all possible cache configurations is time consuming
 Considering other configurable parameters (voltage levels, bus width, etc.), the search space quickly grows to millions
 A heuristic is proposed
 First, search for point A (lowest energy)
 The order in which parameters are searched matters
 No cache flush is needed
 Then, search for point B (best performance)
 Last, search for points in region C

[Plot: time vs. energy with point A (lowest energy), point B (best performance), and the trade-off region C]


Impact of Cache Parameters on Miss Rate and Energy

[Charts: average instruction-cache miss rate (left, up to 12%) and normalized average instruction-cache energy (right) for 8K, 4K, and 2K caches, sweeping line size (16B, 32B, 64B; one way) and associativity (1W, 2W, 4W; 32B line)]

Average instruction cache miss rate and normalized energy of the benchmarks.


Energy Dissipation in On-Chip Cache and Off-Chip Memory

[Chart: energy (J) of cache, off-chip memory, and total vs. cache size from 1KB to 1MB, for benchmark parser]
Searching for Point A
 Search order: cache size (ways W1-W4), then line size, then associativity, then way prediction
 Point A: the least-energy cache configuration

[Plot: time vs. energy for the configurations probed, converging on point A, the lowest-energy point]


Searching for Point B
 Fix the cache size (ways W1-W4); search line size and associativity; no way prediction
 Point B: the best-performance cache configuration
 High associativity doesn't mean high performance
 A large line size may not be good for the data cache

[Plot: time vs. energy, with point B at the best performance]


Searching for Point C
 Cache parameters in region C represent the trade-off between energy and performance
 Choose cache parameters between points A and B
 Cache sizes at points A and B are 8K and 4K respectively, so points in region C are tested at 8K and 4K
 Combinations of point A's and point B's parameters are tested

[Table and plot: the line sizes, cache sizes, and associativities tested at points A, B, and in region C, and the resulting time-vs-energy trade-off curve]

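The combination step can be sketched as a cross product of point A's and point B's parameter values. The parameter values assigned to A and B below are hypothetical, but the mechanism is the one the slide describes:

```python
from itertools import product

# Hypothetical points A and B as (cache size, associativity, line size).
A = ("8K", "4W", "64B")   # lowest energy
B = ("4K", "1W", "64B")   # best performance

# Region C candidates: every combination of the parameter values seen
# at A and B, excluding A and B themselves.
sizes  = sorted({A[0], B[0]})
assocs = sorted({A[1], B[1]})
lines  = sorted({A[2], B[2]})

candidates = [c for c in product(sizes, assocs, lines) if c not in (A, B)]
print(candidates)  # the mixed configurations to test for region C
```

Because A and B typically share some parameter values (the line size here), the candidate set stays small, keeping the region-C pass cheap.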


FSM and Data Path of the Cache Explorer

[Block diagram: the FSM muxes the hit, miss, and static energies against the hit count, miss count, and execution time through a multiplier and adder; registers accumulate the total, a comparator tracks the lowest energy seen, and the winning configuration is written to the configure register]
Implementing the Heuristic in Hardware
 Total size of the explorer
 About 4,200 gates, or 0.041 mm2 in 0.18 micron CMOS
technology.
 Area overhead
 Compared to the reported size of the MIPS 4Kp with cache, this
represents just over a 3% area overhead.
 Power consumption:
 2.69 mW at 200 MHz. The power overhead compared with the
MIPS 4Kp would be less than 0.5%.
 Furthermore, the exploring hardware is used only during the
exploring stage, and can be shut down after the best
configuration is determined.
How Good Is the Heuristic?
 Time complexity:
 Search all space: O(m x n x l x p)
 Heuristic: O(m + n + l + p)
 m: number of associativities, n: number of cache sizes
 l: number of cache line sizes, p: way prediction on/off
 Efficiency
 On average, 5 searches instead of 27 total searches find point A
 2 out of 19 benchmarks miss the lowest-power cache configuration
 With a different search order (line size, associativity, way prediction, then cache size), 11 out of 19 benchmarks miss the best configuration

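The additive O(m + n + l + p) search can be sketched as follows: each parameter is swept once, in order, while the others stay fixed at their best values so far. Here energy_of is a stand-in for running the benchmark on the explorer hardware, and the cost table is a toy separable model just to exercise the search, not measured data.

```python
# Sketch of the additive heuristic: m + n + l + p probes instead of the
# m * n * l * p exhaustive product.

def heuristic_point_A(energy_of,
                      sizes=("2K", "4K", "8K"),
                      lines=("16B", "32B", "64B"),
                      assocs=("1W", "2W", "4W"),
                      way_pred=(False, True)):
    """Search one parameter at a time, in the order the slides use:
    cache size, then line size, then associativity, then way prediction."""
    cfg = {"size": sizes[0], "line": lines[0],
           "assoc": assocs[0], "pred": way_pred[0]}
    for key, options in (("size", sizes), ("line", lines),
                         ("assoc", assocs), ("pred", way_pred)):
        # Keep the value of this parameter that minimizes energy,
        # holding all previously chosen parameters fixed.
        cfg[key] = min(options, key=lambda v: energy_of({**cfg, key: v}))
    return cfg

# Toy separable energy model (hypothetical) just to exercise the search:
cost = {"2K": 3, "4K": 1, "8K": 2, "16B": 2, "32B": 1, "64B": 3,
        "1W": 1, "2W": 2, "4W": 3, False: 1, True: 0}
e = lambda c: sum(cost[v] for v in c.values())
print(heuristic_point_A(e))
# {'size': '4K', 'line': '32B', 'assoc': '1W', 'pred': True}
```

With 3 sizes, 3 line sizes, 3 associativities, and 2 way-prediction settings, this makes 11 probes instead of the 54-point exhaustive sweep; the slide's 5-vs-27 figure reflects the same additive-versus-multiplicative saving on their parameter space.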


Results of Some Other Benchmarks

[Scatter plots of time (cycles) vs. energy for four more benchmarks – bilv, pegwit, padpcm, and crc – each showing its Pareto trade-off between energy and performance]
Conclusion and Future Work
 A configurable cache architecture is proposed
 Associativity, size, line size
 A cache parameter explorer is implemented to find the cache parameters
 A heuristic algorithm is proposed to search the Pareto cache parameter sets
 The complexity of the heuristic is O(m+n+l) instead of O(m*n*l)
 The heuristic finds 95% of the Pareto points
 Overhead
 Little area and power overhead, and no performance overhead
 Future work
 Dynamically detect the cache parameters