
A Highly Configurable Cache Architecture for Embedded Systems

Chuanjun Zhang*, Frank Vahid**, and Walid Najjar

*Dept. of Electrical Engineering
Dept. of Computer Science and Engineering
University of California, Riverside
**Also with the Center for Embedded Computer Systems at UC Irvine

This work was supported by the National Science Foundation and the Semiconductor Research Corporation

Chuanjun Zhang, UC Riverside
Outline
 Why a Configurable Cache? What Parameters?
 Configurable Associativity by Way Concatenation
 Configurable Size by Way Shutdown
 Configurable Line Size
 How to Configure the Cache
 Cache Parameter Explorer
 A Heuristic Algorithm Searches the Pareto Set of Cache Parameters:
 Trade-off Between Energy Dissipation and Performance
 The Explorer Is Synthesized Using Synopsys
 Conclusions and Future Work



Why Choose Cache: Impacts Performance and Power
 Performance impacts are well known
 Power
 ARM920T: caches consume 50% of total processor system power (Segars 01)
 M*CORE: unified cache consumes 50% of total processor system power (Lee/Moyer/Arends 99)
 We'll show that a configurable cache can cut that power nearly in half on average

Why a Configurable Cache?
 An embedded system may execute one application forever
 Tuning the cache configuration (size, associativity, line size) can save a lot of energy
 Associativity example
 40% difference in memory access energy

[Charts: miss rate (up to 2.0%) and normalized energy vs. associativity (1, 2, 4) for epic and mpeg2, from MediaBench]


Benefits of a Configurable Cache
 Mass production
 Unique chips get more expensive as technology scales down (ITRS)
 Huge benefits to mass-producing a single chip
 Harder to produce chips distinguished by cache when we have 50-100 processors per chip
 Adapt to program phases
 Recent research shows programs have different cache requirements over time
 Much research assumes a configurable cache


Caches Vary Greatly in Embedded Processors

Processor               I$ Size  As.  Line    D$ Size  As.  Line
AMD-K6-IIIE             32K      2    32      32K      2    32
Alchemy AU1000          16K      4    32      16K      4    32
ARM 7                   8K/U     4    16      8K/U     4    16
ColdFire                0-32K    DM   16      0-32K    N/A  N/A
Hitachi SH7750S (SH4)   8K       DM   32      16K      DM   32
Hitachi SH7727          16K/U    4    16      16K/U    4    16
IBM PPC 750CX           32K      8    32      32K      8    32
IBM PPC 7603            16K      4    32      16K      4    32
IBM750FX                32K      8    32      32K      8    32
IBM403GCX               16K      2    16      8K       2    16
IBM Power PC 405CR      16K      2    32      8K       2    32
Intel 960JA             2K       2    N/A     1K       2    N/A
Intel 960JD             4K       2    N/A     2K       2    N/A
Intel 960IT             16K      2    N/A     4K       2    N/A
Motorola MPC8240        16K      4    32      16K      4    32
Motorola MPC8540        32K      4    32/64   32K      4    32/64
Motorola MPC7455        32K      8    32      32K      8    32
NEC VR5500              32K      2    32      32K      2    32
NEC VR4131              16K      2    16/32   16K      2    16/32
NEC VR4181              4K       DM   16      4K       DM   16
NEC VR4181A             8K       DM   32      8K       DM   32
NEC VR4121              16       DM   16      8K       DM   16
PMC Sierra RM9000X2     16K      4    N/A     16K      4    N/A
PMC Sierra RM7000A      16K      4    32      16K      4    32
SandCraft sr71000       32K      4    32      32K      4    32
Sun Ultra SPARC Iie     16K      2    N/A     16K      DM   N/A
SuperH                  32K      4    32      32K      4    32
TI TMS320C6414          16K      DM   N/A     16K      2    N/A
TriMedia TM32A          32K      8    64      16K      8    64
Xilinx Virtex IIPro     16K      2    32      8K       2    32


Configurable Associativity by Way Concatenation
C. Zhang (ISCA 03)
 Four-way set-associative base cache
 Ways can be concatenated to form two-way
 Can be further concatenated to direct-mapped
 Concatenation is logical – only 1 array accessed

[Figure: four physical ways (Way 1-Way 4) grouped as four-way, two-way, or direct-mapped]
Way-Concatenate Cache Architecture
 Trivial area overhead
 No performance overhead
 NAND transistors enlarged to match inverter speed
 Configuration circuit operates concurrently with the decoders

Configuration register settings:
reg0  reg1  ways
0     0     DM
0     1     2
1     0     2
1     1     4

[Figure: the address splits into tag (a31-a13), index (a12-a5), and line offset (a4-a0); the configuration circuit drives enables c0-c3 into the 6x64 decoders of the data array; the tag compare and mux driver sit on the critical path]
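The two-bit configuration in the table above can be sketched as a small behavioral model. This is our own sketch, not the gate-level circuit; the function names are hypothetical. It shows how concatenating ways trades associativity for index depth in the fixed 8KB, 32B-line base cache:

```python
# Behavioral sketch of way concatenation (hypothetical model, not the
# actual circuit). Two configuration bits, reg0 and reg1, select how the
# four physical ways of the base cache are grouped.

def ways_enabled(reg0: int, reg1: int) -> int:
    """Logical associativity selected by the two configuration bits."""
    if reg0 and reg1:
        return 4        # no concatenation: 4-way
    if reg0 or reg1:
        return 2        # pairs concatenated: 2-way
    return 1            # all ways concatenated: direct-mapped

def logical_geometry(total_bytes=8192, line_bytes=32, reg0=1, reg1=1):
    """Concatenating ways deepens the index space: fewer ways, more sets,
    while total capacity stays constant."""
    ways = ways_enabled(reg0, reg1)
    sets = total_bytes // (line_bytes * ways)
    return ways, sets

print(logical_geometry(reg0=0, reg1=0))  # direct-mapped: (1, 256)
print(logical_geometry(reg0=1, reg1=1))  # 4-way:         (4, 64)
```

Note that capacity never changes here, only the way/set split, which is why way concatenation avoids the miss-rate penalty of way shutdown.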


Previous Method – Way Shutdown
 Albonesi proposed a cache where ways could be shut down
 To save dynamic power
 Motorola M*CORE has the same way-shutdown feature
 Unified cache – even allows setting each way as I, D, both, or off
 Reduces dynamic power by accessing fewer ways
 But decreases total size, so may increase miss rate

[Figure: Way 1-Way 4, with some ways shut down]


Way Shutdown Can Be Good for Static Power
 Static power (leakage) is increasingly important in nanoscale technologies
 We combine way shutdown with way concatenation
 Use the sleep-transistor method of Powell (ISLPED 2000)
 When off, the gated-Vdd transistor prevents leakage, but adds a 4% performance overhead

[Figure: SRAM cell with a gated-Vdd control transistor between the cell and Gnd]


Cache Line Size
C. Zhang (ISVLSI 03)

[Figure: (A) with a 64B cache line, 64B of consecutive code fills the whole line; (B) with non-consecutive code, 48B of the 64B line are wasted – a 16B line would avoid the waste]


Configurable Cache Line Size With Line Concatenation
 The physical line size is 16 bytes
 4 physical lines are filled when the line size is 64 bytes
 A programmable counter is used to designate the line size
 An interleaved off-chip memory organization supplies the fills

[Figure: one way with 16B physical lines, the fill counter, and the bus to off-chip memory]
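The counter-based fill can be sketched behaviorally (function names are ours, for illustration): on a miss, the counter determines how many consecutive 16B physical lines to fetch for the configured logical line size.

```python
# Behavioral sketch of line-size concatenation (hypothetical model).
# The physical line is fixed at 16 bytes; a counter fills 1, 2, or 4
# consecutive physical lines per miss for 16B, 32B, or 64B logical lines.

PHYSICAL_LINE = 16  # bytes

def fills_per_miss(logical_line_bytes: int) -> int:
    """Counter value: number of 16B physical-line fills per miss."""
    assert logical_line_bytes in (16, 32, 64)
    return logical_line_bytes // PHYSICAL_LINE

def fill_addresses(miss_addr: int, logical_line_bytes: int):
    """Addresses of the physical lines fetched from the interleaved
    off-chip memory for one miss."""
    base = miss_addr & ~(logical_line_bytes - 1)  # align to logical line
    n = fills_per_miss(logical_line_bytes)
    return [base + i * PHYSICAL_LINE for i in range(n)]

# A 64B-line miss triggers four interleaved 16B fetches:
print([hex(a) for a in fill_addresses(0x1238, 64)])
# ['0x1200', '0x1210', '0x1220', '0x1230']
```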
Computing Total Memory-Related Energy
 Considers CPU stall energy and off-chip memory energy
 Excludes CPU active energy
 Thus, represents all memory-related energy

energy_mem = energy_dynamic + energy_static
energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
energy_miss = energy_offchip_access + energy_uP_stall + energy_cache_block_fill
energy_static = cycles * energy_static_per_cycle
energy_miss = k_miss_energy * energy_hit
energy_static_per_cycle = k_static * energy_total_per_cycle
(We varied the k's to account for different system implementations)

Measured quantities:
 cache_hits, cache_misses, cycles – from SimpleScalar
 energy values – from our layout or data sheets

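The equations above translate directly into code. The numbers in the example below are made up purely for illustration; in the actual methodology, hits, misses, and cycles come from SimpleScalar and the energy constants from layout or data sheets.

```python
# Sketch of the slide's memory-energy model (illustrative values only).

def energy_mem(cache_hits, cache_misses, cycles,
               energy_hit, k_miss_energy, k_static,
               energy_total_per_cycle):
    """Total memory-related energy, per the slide's equations."""
    # energy_miss = offchip access + uP stall + cache block fill,
    # modeled as a multiple k_miss_energy of a hit's energy
    energy_miss = k_miss_energy * energy_hit
    energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
    energy_static_per_cycle = k_static * energy_total_per_cycle
    energy_static = cycles * energy_static_per_cycle
    return energy_dynamic + energy_static

# Hypothetical counts and energies (joules):
e = energy_mem(cache_hits=1_000_000, cache_misses=20_000, cycles=1_200_000,
               energy_hit=0.5e-9, k_miss_energy=50, k_static=0.3,
               energy_total_per_cycle=1.0e-9)
print(f"{e:.2e} J")
```

Varying k_miss_energy and k_static, as the authors did, models systems with cheaper or costlier off-chip access and different leakage regimes.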


Energy Savings

[Bar chart: normalized energy per benchmark (padpcm, adpcm, mpeg, ucbqsort, auto2, pjepg, g721, parser, jpeg, bcnt, v42, mcf, binary, blit, art, brev, vpr, epic, crc, bilv, g3fax, pegwit, fir) for five caches: cnv8K4W32B, cnv8K1W32B, cfg8Kwc32B, cfg8Kwcws32B, cfg8Kwcwslc; a few bars exceed 100% (127%, 620%, 126%)]

 Energy savings when way concatenation, way shutdown, and cache-line-size concatenation are implemented. (C. Zhang, TECS ACM, to appear)


Cache Parameters that Consume the Lowest Energy Vary Across Applications

Best configuration per benchmark:
Ben.      I$        D$
padpcm    8K1W32B   8K1W32B
crc       2K1W32B   4K1W64B
auto      8K2W16B   4K1W32B
bcnt      2K1W32B   2K1W64B
bilv      4K1W32B   2K1W32B
binary    2K1W32B   2K1W32B
blit      2K1W16B   8K2W32B
brev      4K1W32B   2K1W32B
g3fax     4K1W32B   4K1W16B
fir       4K1W32B   2K1W32B
jpeg      8K4W32B   4K2W32B
vpr       8K4W32B   2K1W16B
pjepg     4K1W32B   4K2W64B
ucbqsort  4K1W16B   4K1W64B
v42       8K1W16B   8K2W16B
adpcm     2K1W16B   4K1W16B
epic      2K1W64B   8K1W16B
g721      8K4W16B   2K1W16B
pegwit    4K1W16B   4K1W16B
mpeg2     4K1W32B   8K2W16B
art       2K1W32B   2K1W16B
parser    8K4W16B   8K2W64B
mcf       8K4W16B   8K1W16B
How to Configure the Cache
 Simulation-based methods
 Drawback: slowness
 Seconds of real-time work may take tens of hours to simulate
 Simulation tool setup further increases the time
 Self-exploring method
 Cache parameter explorer
 Incorporated on a prototype platform
 Pareto parameters: a set of parameters that shows the performance/energy trade-off


Cache Self-Exploring Hardware
 An explorer is used to detect the Pareto set of cache parameters
 The explorer stands aside to collect the information used to calculate the energy

[Figure: processor with I$ and D$, memory, and the explorer alongside]


Pareto Parameter Sets

[Scatter plot for pegwit: time (million cycles) vs. energy (mJ). Point A has the lowest energy, point B the best performance, points in region C trade off energy and performance, and point D is dominated – not a Pareto point]

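The Pareto set the explorer reports has a precise definition: a configuration is on the front if no other configuration is at least as good in both energy and time and strictly better in one. A small sketch, with point values that are illustrative rather than measured:

```python
# Pareto-front filtering over (energy, time) measurements.

def pareto_points(points):
    """points: list of (energy, time) tuples; returns the Pareto set."""
    def dominated(p, q):
        # q dominates p: q is no worse in both metrics (and differs),
        # so with <= on both axes it is strictly better in at least one.
        return q != p and q[0] <= p[0] and q[1] <= p[1]
    return [p for p in points if not any(dominated(p, q) for q in points)]

# Points loosely modeled on the pegwit plot: A (lowest energy),
# B (best performance), C (trade-off), D (dominated by C).
A, B, C, D = (0.05, 70), (0.15, 57), (0.09, 62), (0.12, 66)
print(pareto_points([A, B, C, D]))  # D is dropped
```

This brute-force check is O(n^2) in the number of configurations, which is fine here because the heuristic keeps the candidate set small.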


Heuristic Algorithm
 Searching all possible cache configurations is time consuming
 Considering other configurable parameters (voltage levels, bus width, etc.), the search space quickly grows to millions
 A heuristic is proposed
 First, search for point A (lowest energy)
 The order in which parameters are searched matters
 No cache flush is needed
 Then, search for point B (best performance)
 Last, search for points in region C

[Plot: time vs. energy with point A (lowest energy), point B (best performance), and the trade-off region C]


Impact of Cache Parameters on Miss Rate and Energy

[Charts: average instruction-cache miss rate (left, up to 12%) and normalized average instruction-cache energy (right) for 8K, 4K, and 2K caches, sweeping line size (16B, 32B, 64B; one way) and associativity (1W, 2W, 4W; 32B line)]

Average instruction cache miss rate and normalized energy of the benchmarks.


Energy Dissipation in On-Chip Cache and Off-Chip Memory

[Chart: energy (J) of cache, off-chip memory, and total vs. cache size from 1KB to 1MB, for benchmark parser]
Searching for Point A
 Search order: cache size (ways W1-W4), then line size, then associativity, then way prediction
 Point A: the least-energy cache configuration

[Plot: time vs. energy for the configurations probed, converging on point A, the lowest-energy point]


Searching for Point B
 Fix the cache size (ways W1-W4); search line size and associativity; no way prediction
 Point B: the best-performance cache configuration
 High associativity doesn't mean high performance
 A large line size may not be good for the data cache

[Plot: time vs. energy, with point B at the best performance]


Searching for Point C
 Cache parameters in region C represent the trade-off between energy and performance
 Choose cache parameters between points A and B
 Cache sizes at points A and B are 8K and 4K respectively, so points in region C are tested at 8K and 4K
 Combinations of point A's and point B's parameters are tested

[Table and plot: the line sizes, cache sizes, and associativities tested at points A, B, and in region C, and the resulting time-vs-energy trade-off curve]

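The combination step can be sketched as a cross product of point A's and point B's parameter values. The parameter values assigned to A and B below are hypothetical, but the mechanism is the one the slide describes:

```python
from itertools import product

# Hypothetical points A and B as (cache size, associativity, line size).
A = ("8K", "4W", "64B")   # lowest energy
B = ("4K", "1W", "64B")   # best performance

# Region C candidates: every combination of the parameter values seen
# at A and B, excluding A and B themselves.
sizes  = sorted({A[0], B[0]})
assocs = sorted({A[1], B[1]})
lines  = sorted({A[2], B[2]})

candidates = [c for c in product(sizes, assocs, lines) if c not in (A, B)]
print(candidates)  # the mixed configurations to test for region C
```

Because A and B typically share some parameter values (the line size here), the candidate set stays small, keeping the region-C pass cheap.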


FSM and Data Path of the Cache Explorer

[Block diagram: the FSM muxes the hit, miss, and static energies against the hit count, miss count, and execution time through a multiplier and adder; registers accumulate the total, a comparator tracks the lowest energy seen, and the winning configuration is written to the configure register]
Implementing the Heuristic in Hardware
 Total size of the explorer
 About 4,200 gates, or 0.041 mm2 in 0.18 micron CMOS
technology.
 Area overhead
 Compared to the reported size of the MIPS 4Kp with cache, this
represents just over a 3% area overhead.
 Power consumption:
 2.69 mW at 200 MHz. The power overhead compared with the
MIPS 4Kp would be less than 0.5%.
 Furthermore, the exploring hardware is used only during the
exploring stage, and can be shut down after the best
configuration is determined.
How Good Is the Heuristic?
 Time complexity:
 Search all space: O(m x n x l x p)
 Heuristic: O(m + n + l + p)
 m: number of associativities, n: number of cache sizes
 l: number of cache line sizes, p: way prediction on/off
 Efficiency
 On average, 5 searches instead of 27 total searches find point A
 2 out of 19 benchmarks miss the lowest-power cache configuration
 With a different search order (line size, associativity, way prediction, then cache size), 11 out of 19 benchmarks miss the best configuration

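The additive O(m + n + l + p) search can be sketched as follows: each parameter is swept once, in order, while the others stay fixed at their best values so far. Here energy_of is a stand-in for running the benchmark on the explorer hardware, and the cost table is a toy separable model just to exercise the search, not measured data.

```python
# Sketch of the additive heuristic: m + n + l + p probes instead of the
# m * n * l * p exhaustive product.

def heuristic_point_A(energy_of,
                      sizes=("2K", "4K", "8K"),
                      lines=("16B", "32B", "64B"),
                      assocs=("1W", "2W", "4W"),
                      way_pred=(False, True)):
    """Search one parameter at a time, in the order the slides use:
    cache size, then line size, then associativity, then way prediction."""
    cfg = {"size": sizes[0], "line": lines[0],
           "assoc": assocs[0], "pred": way_pred[0]}
    for key, options in (("size", sizes), ("line", lines),
                         ("assoc", assocs), ("pred", way_pred)):
        # Keep the value of this parameter that minimizes energy,
        # holding all previously chosen parameters fixed.
        cfg[key] = min(options, key=lambda v: energy_of({**cfg, key: v}))
    return cfg

# Toy separable energy model (hypothetical) just to exercise the search:
cost = {"2K": 3, "4K": 1, "8K": 2, "16B": 2, "32B": 1, "64B": 3,
        "1W": 1, "2W": 2, "4W": 3, False: 1, True: 0}
e = lambda c: sum(cost[v] for v in c.values())
print(heuristic_point_A(e))
# {'size': '4K', 'line': '32B', 'assoc': '1W', 'pred': True}
```

With 3 sizes, 3 line sizes, 3 associativities, and 2 way-prediction settings, this makes 11 probes instead of the 54-point exhaustive sweep; the slide's 5-vs-27 figure reflects the same additive-versus-multiplicative saving on their parameter space.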


Results of Some Other Benchmarks

[Scatter plots of time (cycles) vs. energy for four more benchmarks – bilv, pegwit, padpcm, and crc – each showing its Pareto trade-off between energy and performance]
Conclusion and Future Work
 A configurable cache architecture is proposed
 Associativity, size, line size
 A cache parameter explorer is implemented to find the cache parameters
 A heuristic algorithm is proposed to search the Pareto cache parameter sets
 The complexity of the heuristic is O(m+n+l) instead of O(m*n*l)
 The heuristic finds 95% of the Pareto points
 Overhead
 Little area and power overhead, and no performance overhead
 Future work
 Dynamically detect the cache parameters