Professional Documents
Culture Documents
ARCHITECTURE OVERVIEW
HOT CHIPS TUTORIAL - AUGUST 2013
PHIL ROGERS
HSA FOUNDATION PRESIDENT
AMD CORPORATE FELLOW
HSA FOUNDATION
www.hsafoundation.com
Promoters
Supporters
Contributors
Academic
Associates
Easier to program
Easier to optimize
Higher performance
Lower power
Single-Core Era
Moores
Law
Voltage
Constrained by:
Power
Complexity
Enabled by:
Moores Law
SMP
architecture
Constrained by:
Power
Parallel SW
Scalability
Scaling
?
we are
here
Time
Abundant data
parallelism
Power efficient
GPUs
we are
here
Time (# of processors)
Temporarily
Constrained by:
Programming
models
Comm.overhead
Throughput
Performance
Single-thread
Performance
Enabled by:
Modern Application
Performance
Enabled by:
Heterogeneous
Systems Era
Multi-Core Era
we are
here
** All features subject to change, pending completion and ratification of specifications in the HSA Working Groups
Copyright 2012 HSA Foundation. All Rights Reserved.
HSA Runtime
Hardware companies
Operating Systems
Explicitly parallel
Load.Acquire
Store.Release
Barriers
CPU or GPU
Recursion
Tree traversal
Wavefront reforming
10
HSA SOFTWARE
TITLE
HSA Software Stack
Driver Stack
Apps
Apps
Apps
Apps
Apps
Apps
Apps
Apps
Apps
Apps
Apps
Apps
Domain Libraries
OpenCL, DX Runtimes,
User Mode Drivers
HSA Runtime
Task Queuing
Libraries
HSA JIT
Graphics Kernel Mode Driver
HSA Kernel
Mode Driver
12
13
Finally a single source code base for the CPU and GPU!
Bolt version 1.0 for OpenCL and C++ AMP is available now at
https://github.com/HSA-Libraries/Bolt
14
HSA will feature an open source linux execution and compilation stack
Component Name
IHV or Common
Rationale
Common
Common
Enable research
LLVM Contributions
Common
HSAIL Assembler
Common
HSA Runtime
Common
HSA Finalizer
IHV
IHV
15
ACCELERATING JAVA
GOING BEYOND NATIVE LANGUAGES
17
JAVA HETEROGENEOUS
ENABLEMENT ROADMAP
Application
Application
APARAPI
Application
APARAPI
JVM
Application
APARAPI
JVM
JVM
HSA Runtime
LLVM Optimizer
HSAIL
OpenCL
CPU ISA
CPU
GPU ISA
GPU
HSAIL
HSA Finalizer
CPU ISA
HSA CPU
GPU ISA
HSA CPU
HSAIL
HSA Finalizer
CPU ISA
HSA CPU
GPU ISA
HSA CPU
HSA Finalizer
CPU ISA
HSA CPU
GPU ISA
HSA CPU
18
Application.java
Java Compiler
Development
Runtime
Application.class
Application
Lambda/Stream API
Sumatra Enabled JVM
http://openjdk.java.net/projects/sumatra/
https://wikis.oracle.com/display/HotSpotInternals/Sumatra
http://mail.openjdk.java.net/pipermail/sumatra-dev/
HSA Finalizer
CPU ISA
CPU
GPU ISA
GPU
19
EXAMPLE WORKLOADS
22
More HD Calculations
70% scaling in H and V
Total Pixels = 4.07 Million
Search squares = 3.8 Million
23
Stage N
Feature m
Yes
Face still
possible?
Feature p
No
Feature r
Feature q
Stage N+1
REJECT
FRAME
24
STAGE 1
STAGE 2
STAGE 21
STAGE 22
FACE
CONFIRMED
NO FACE
Final HD Calculations
Search squares = 3.8 million
Average features per square = 124
Calculations per feature = 100
Calculations per frame = 47 GCalcs
Calculation Rate
30 frames/sec = 1.4TCalcs/second
60 frames/sec = 2.8TCalcs/second
and this only gets front-facing faces
25
Live
Dead
Early out algorithms, like HAAR, exhibit divergence between work items
26
20-25
15-20
10-15
5-10
0-5
27
PROCESSING TIME/STAGE
Trinity A10-4600M (6CU@497Mhz, 4 cores@2700Mhz)
100
90
80
Time (ms)
70
60
50
40
30
GPU
20
10
CPU
0
1
9-22
AMD A10 4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 MHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G,
6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL 1.1 (873.1)
Copyright 2012 HSA Foundation. All Rights Reserved.
28
PERFORMANCE CPU-VS-GPU
Trinity A10-4600M (6CU@497Mhz, 4 cores@2700Mhz)
12
10
Images/Sec
4
CPU
HSA
GPU
0
0
3
4
5
6
Number of Cascade Stages on GPU
22
AMD A10 4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 MHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G,
6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL 1.1 (873.1)
Copyright 2012 HSA Foundation. All Rights Reserved.
29
-2.5x
INCREASED
PERFORMANCE
DECREASED ENERGY
PER FRAME
30
SUFFIX ARRAYS
Bio-informatics
32
Radix Sort::GPU
+5.8x
Lexical Rank::CPU
Compute SA::CPU
-5x
Radix Sort::GPU
Merge Sort::GPU
INCREASED
PERFORMANCE
DECREASED
ENERGY
M. Deo, Parallel Suffix Array Construction and Least Common Prefix for the GPU, Submitted to Principles and Practice of Parallel Programming, (PPoPP13) February 2013.
AMD A10 4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 MHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM
33
A way to animate and interact with objects, widely used in games and movie production
Used to drive game play and for visual effects (eye candy)
Games ranging from Angry Birds and Cut the Rope to Tomb Raider and Crysis 3
3D authoring tools such as Autodesk Maya, Unity 3D, Houdini, Cinema 4D, Lightwave
Robotics simulation
35
Mid-Phase
Collision
Detection
Narrow-Phase
Collision
Detection
Compute
contact
points
Setup
constraints
Solve
constraints
B
1
D
4
A
B0
B1
C0
C1
D1
D1
36
Benefits of HSA
GPU enqueue
37
EASE OF PROGRAMMING
CODE COMPLEXITY VS. PERFORMANCE
350
35.00
300
30.00
Init.
250
25.00
Launch
20.00
LOC
Compile
Compile
Copy
150
Launch
Copy
Launch
15.00
Launch
Algorithm
Launch
100
10.00
Launch
Algorithm
Algorithm
Algorithm
Launch
50
5.00
Algorithm
Algorithm
Algorithm
Copy-back
Serial CPU
Copy-back
Algorithm
TBB
Intrinsics+TBB
Launch
Copy
OpenCL-C
Compile
Copy-back
OpenCL -C++
Init
Copy-back
C++ AMP
HSA Bolt
Performance
AMD A10-5800K APU with Radeon HD Graphics CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM.
Software Windows 7 Professional SP1 (64-bit OS); AMD OpenCL 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
39
Performance
200
Lower power, more capable devices in your hand, on the wall, in the cloud
40
THANK YOU