1, JANUARY 2016
I. INTRODUCTION
Manuscript received May 05, 2015; revised June 25, 2015; accepted July 22,
2015. Date of publication August 25, 2015; date of current version December
30, 2015. This paper was approved by Guest Editor Jinuk Luke Shin.
B. Munger, D. Akeson, H. R. Fair, J. Farrell, G. Krishnan, J. White, and
K. Wilcox are with AMD, Boxborough, MA, USA.
S. Arekapudi, T. Burd, and H. McIntyre are with AMD, Sunnyvale, CA, USA.
D. Johnson, S. Naffziger, and R. Schreiber are with AMD, Fort Collins, CO,
USA.
E. McLellan is with Cavium Networks, Marlborough, MA, USA.
S. Sundaram is with AMD, Austin, TX, USA.
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSSC.2015.2464688
0018-9200 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Fig. 3. Universal Curve comparing Carrizo and Kaveri thin oxide devices.
Each process has 3 Vts: the two smallest channel lengths for each Vt are shown.
The 29% density improvement in the Excavator module enables a larger area allocation for graphics IP, multi-media offload, and the integration of a system controller into a single
BGA package. The increased graphics and multi-media area
allocation enables Carrizo to implement twice as many video
compression engines as Kaveri, a new video decoder to facilitate high-throughput H.264 decode, a new high-efficiency video
Fig. 5. Power improvements for Carrizo over Kaveri for common battery-life
cases.
Fig. 6. Frequency-power curves with and without a separate graphics voltage domain.
one-hot and are asserted high if the byte is read. The combination of GWL0, GWL1, and Byte-Sel ensures that the bitslice
within a byte gets a 64-entry zero-hot wordline.
The same array macro is used to implement the data cache and
way-predict arrays. This macro uses a 0.241 µm² 8T bitcell with
8 cells per local bitline and two-stage dynamic sensing. Steamroller uses a 2-stack pulldown in the second dynamic sensing
stage, while the Excavator micro-banking scheme enables the
use of a single-high stack, resulting in a 10% area reduction and
an 8% faster read time in the array macro.
V. POWER DENSITY
The density improvement in Excavator more than offsets the
lower-leakage technology, resulting in higher power density and
increased operating temperatures at the same power as Steamroller. A variety of techniques are used to mitigate power density
in order to enable Carrizo to operate across the desired power
range (up to 35 W).
A useful tool for managing die temperature is pre-silicon
thermal analysis. The inputs to this analysis are power estimates for each IP in the die floorplan and a 2D matrix that
Fig. 13. Near miss statistics and extrapolation to estimate timing margin.
to be able to use a combination of wordline underdrive and wordline boost for read and write assist. The wordline can be driven
to a voltage lower than VDD during the first phase of the access
by turning on a PFET pulldown (Fig. 14), allowing the bitlines
to partially discharge while the wordline is underdriven, which
reduces susceptibility to read disturb. The PFET pulldown can
then be turned off during the second phase of the access, allowing the wordline to reach VDD. Each set of 16 wordlines
shares a power header, which can be turned off after a wordline
returns to VDD. The virtual supply can then be boosted above
VDD via an nFET used as a capacitor (Fig. 14). An nFET keeper
ensures the wordline never leaks further than a Vt below VDD.
The circuit can be configured to allow any combination of first-phase
underdrive, return to VDD, and second-phase boost.
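The configurability described above can be pictured with a small behavioral sketch. It is not the circuit or any AMD tool. The class name, the three flag names, and the 1000 mV supply with 100 mV underdrive and boost steps are hypothetical values chosen only for illustration, and the model simply adds the boost step to whatever level the wordline holds in the second phase.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class WordlineAssist:
    """Hypothetical configuration flags; names are not from the paper."""
    underdrive_phase1: bool  # PFET pulldown on during the first phase
    return_to_vdd: bool      # PFET pulldown released in the second phase
    boost_phase2: bool       # header off, boost capacitor fired


def wordline_levels(cfg, vdd_mv=1000, underdrive_mv=100, boost_mv=100):
    """Return (phase-1, phase-2) wordline levels in millivolts.

    All voltages are assumed, illustrative numbers. Integers are used
    so the arithmetic is exact.
    """
    phase1 = vdd_mv - underdrive_mv if cfg.underdrive_phase1 else vdd_mv
    # Without a return to VDD the wordline simply holds its phase-1 level.
    phase2 = vdd_mv if cfg.return_to_vdd else phase1
    if cfg.boost_phase2:
        # Simplification: the boost is modeled as a fixed additive step.
        phase2 += boost_mv
    return phase1, phase2


def all_configs():
    """Enumerate every combination of the three assist options."""
    return [WordlineAssist(*flags) for flags in product((False, True), repeat=3)]


if __name__ == "__main__":
    for cfg in all_configs():
        print(cfg, wordline_levels(cfg))
```

With all three options enabled, this model gives a 900 mV wordline in the first phase and 1100 mV in the second; with none enabled, the wordline stays at the assumed 1000 mV supply throughout.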
The L2 tag macro wordline is only asserted for half a cycle.
It uses a combination of wordline underdrive and negative bitline for its assist techniques because the phase-bound wordline
does not leave enough time for the bitlines to discharge prior to
the start of the write assist. The negative-bitline circuitry uses a
single capacitor per logical bit column which is coupled to the
bitline through an nFET passgate (Fig. 15). The circuit couples
the bitline down after a self-timed delay which is designed to
delay the coupling event until after the bitline has discharged to
ground.
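A rough way to see why the coupling event is delayed until the bitline has reached ground is the ideal capacitive-divider estimate below. It is a sketch under stated assumptions: the bitline is treated as floating at the moment of coupling, and the function name, the 20 fF coupling capacitor, the 80 fF bitline capacitance, and the 1 V step are hypothetical numbers, not values from the paper.

```python
def negative_bitline_level(v_bitline, v_step, c_couple_ff, c_bitline_ff):
    """Ideal charge-sharing estimate of the bitline voltage after coupling.

    v_bitline    : bitline voltage (V) at the moment the capacitor fires
    v_step       : downward voltage step (V) applied to the far capacitor plate
    c_couple_ff  : coupling capacitance (fF), assumed value
    c_bitline_ff : total bitline capacitance (fF), assumed value

    Assumes the bitline is floating during the coupling event and ignores
    leakage and device non-idealities.
    """
    if c_couple_ff <= 0 or c_bitline_ff <= 0:
        raise ValueError("capacitances must be positive")
    divider = c_couple_ff / (c_couple_ff + c_bitline_ff)
    return v_bitline - v_step * divider


if __name__ == "__main__":
    # Coupling fired after the bitline has fully discharged to ground.
    on_time = negative_bitline_level(0.0, 1.0, 20.0, 80.0)
    # Coupling fired too early, with 0.3 V still left on the bitline.
    too_early = negative_bitline_level(0.3, 1.0, 20.0, 80.0)
    print(f"on time:   {on_time:+.2f} V")
    print(f"too early: {too_early:+.2f} V")
```

Under these assumed numbers, a bitline that has reached ground is coupled to about -0.2 V, while one that still holds 0.3 V ends near +0.1 V and never goes negative, which is the situation the self-timed delay is meant to avoid.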
REFERENCES
[1] K. Wilcox et al., "A 28 nm x86 APU optimized for power and area
efficiency," in IEEE ISSCC Tech. Papers, 2015, pp. 84–85.
[2] D. Bouvier et al., "Applying AMD's Kaveri APU for heterogeneous
computing," presented at the Hot Chips 2014 Symp., Cupertino, CA,
USA, 2014.
ology, technology, and ESD for future AMD processors. From 1996 to 2005
he worked at Sun Microsystems on SRAMs and custom circuits, including
memory compilers, for multiple UltraSPARC processors. Prior to joining Sun,
Hugh was at Inmos Limited, then ST Microelectronics, working on high-speed
SRAMs, Flash EPROMs, and a media processor. Mr. McIntyre holds eleven
U.S. patents and has four other IEEE publications.