HETEROGENEOUS SYSTEM ARCHITECTURE (HSA)
PHIL ROGERS,
HSA FOUNDATION PRESIDENT
AMD CORPORATE FELLOW
HSA FOUNDATION
www.hsafoundation.com
Promoters
Supporters
Contributors
Academic
Associates
GOALS OF HSA
Easier to program
Easier to optimize
Higher performance
Lower power
Single-Core Era
Enabled by: Moore's Law, voltage scaling
Constrained by: power, complexity
[Chart: single-thread performance vs. time, with a "we are here" marker]
Multi-Core Era
Enabled by: Moore's Law, SMP architecture
Constrained by: power, parallel SW, scalability
[Chart: throughput performance vs. time (# of processors), with a "we are here" marker]
Heterogeneous Systems Era
Enabled by: abundant data parallelism, power efficient GPUs
Temporarily constrained by: programming models, comm. overhead
[Chart: modern application performance vs. time (data-parallel exploitation), with a "we are here" marker]
** All features subject to change, pending completion and ratification of specifications in the HSA Working Groups
HSA Runtime
Hardware companies
Operating Systems
Explicitly parallel
Load.Acquire
Store.Release
Barriers
HSA SOFTWARE
[Figure: Driver Stack compared with HSA Software Stack. Layers shown beneath Apps: Domain Libraries; OpenCL, DX Runtimes, User Mode Drivers; HSA Runtime; Task Queuing Libraries; HSA JIT; Graphics Kernel Mode Driver; HSA Kernel Mode Driver]
Finally a single source code base for the CPU and GPU!
Bolt version 1.0 for OpenCL and C++ AMP is available now at
https://github.com/HSA-Libraries/Bolt
HSA will feature an open-source Linux execution and compilation stack
Component Name
IHV Specific
Rationale
No
No
Enable research
LLVM Contributions
No
HSAIL Assembler
No
HSA Runtime
No
HSA Finalizer
Yes
Yes
ACCELERATING JAVA
GOING BEYOND NATIVE LANGUAGES
JAVA HETEROGENEOUS ENABLEMENT ROADMAP
[Figure: four successive Java software stacks. Components shown: Application, APARAPI, JVM, OpenCL, HSA Runtime, LLVM Optimizer, HSAIL, HSA Finalizer, CPU ISA, GPU ISA, CPU, GPU, HSA CPU]
[Figure: Sumatra flow. Development: Application.java, Java Compiler, Application.class. Runtime: Application, Lambda/Stream API, Sumatra Enabled JVM, HSA Finalizer, CPU ISA and GPU ISA, CPU and GPU]
http://openjdk.java.net/projects/sumatra/
https://wikis.oracle.com/display/HotSpotInternals/Sumatra
http://mail.openjdk.java.net/pipermail/sumatra-dev/
EXAMPLE WORKLOADS
More HD Calculations
70% scaling in H and V
Total Pixels = 4.07 Million
Search squares = 3.8 Million
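The 4.07 million figure can be reproduced with a short calculation. This is a sketch under one assumption that the slide does not state: the frame is 1920x1080. Scaling by 70% in both H and V leaves 49% of the pixels at each pyramid level, so the total approaches a geometric series limit. The function names are illustrative only.

```python
def pyramid_pixel_limit(width, height, scale):
    """Limit of the total pixel count when each level is scaled by `scale` in H and V."""
    area_ratio = scale * scale  # 0.7 * 0.7 = 0.49 of the pixels survive per level
    return width * height / (1.0 - area_ratio)


def pyramid_pixel_sum(width, height, scale, levels):
    """Total pixels over a finite number of pyramid levels (level 0 is full size)."""
    area_ratio = scale * scale
    return sum(width * height * area_ratio ** k for k in range(levels))


# Assumed HD frame size; only the 70% scale factor comes from the slide.
total = pyramid_pixel_limit(1920, 1080, 0.7)
print(f"{total / 1e6:.2f} million pixels")
```

The search-square count (3.8 million) is lower than the pixel total because it also depends on the window size, which the slide does not give.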
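[Figure: cascade stage N evaluates features m, p, q, r. If a face is still possible (Yes), the window proceeds to stage N+1; otherwise (No), REJECT FRAME]
[Figure: 22-stage cascade, STAGE 1 through STAGE 22. A window that passes every stage is FACE CONFIRMED; a window rejected at any stage is NO FACE]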
Final HD Calculations
Search squares = 3.8 million
Average features per square = 124
Calculations per feature = 100
Calculations per frame = 47 GCalcs
Calculation Rate
30 frames/sec = 1.4TCalcs/second
60 frames/sec = 2.8TCalcs/second
and this only gets front-facing faces
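The arithmetic behind these figures is a straight product of the three per-frame quantities, then a multiplication by the frame rate. A minimal sketch using only the numbers above (the function and constant names are illustrative):

```python
SEARCH_SQUARES = 3.8e6          # search squares per HD frame
AVG_FEATURES_PER_SQUARE = 124   # average features evaluated per square
CALCS_PER_FEATURE = 100         # calculations per feature


def calcs_per_frame():
    """Calculations needed to scan one frame."""
    return SEARCH_SQUARES * AVG_FEATURES_PER_SQUARE * CALCS_PER_FEATURE


def calcs_per_second(frames_per_second):
    """Sustained calculation rate at a given frame rate."""
    return calcs_per_frame() * frames_per_second


print(f"{calcs_per_frame() / 1e9:.0f} GCalcs per frame")
for fps in (30, 60):
    print(f"{fps} frames/sec = {calcs_per_second(fps) / 1e12:.1f} TCalcs/second")
```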
Live
Dead
Early out algorithms, like HAAR, exhibit divergence between work items
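The effect of early-out divergence can be illustrated with a small simulation. This is a sketch, not the presenter's implementation: the per-stage scores are random stand-ins for Haar feature sums, the 0.3 threshold and the group size of 64 work items are assumptions, and only the 22-stage depth comes from the slides. A work item that exits early is "dead" but still waits for the deepest item in its lockstep group.

```python
import random


def stages_evaluated(window_scores, thresholds):
    """Return (is_face, stages_run) for one search window with early out."""
    for stage, (score, threshold) in enumerate(zip(window_scores, thresholds), start=1):
        if score < threshold:
            return False, stage  # rejected: this work item goes dead here
    return True, len(thresholds)


def lockstep_utilization(depths):
    """Fraction of lane-steps doing useful work when a group runs in lockstep."""
    return sum(depths) / (len(depths) * max(depths))


def simulate(num_windows=64, num_stages=22, threshold=0.3, seed=0):
    """Cascade depth reached by each window, using random stand-in scores."""
    rng = random.Random(seed)
    thresholds = [threshold] * num_stages  # hypothetical per-stage thresholds
    depths = []
    for _ in range(num_windows):
        scores = [rng.random() for _ in range(num_stages)]
        depths.append(stages_evaluated(scores, thresholds)[1])
    return depths


depths = simulate()
print(f"deepest window: {max(depths)} stages, "
      f"utilization: {lockstep_utilization(depths):.0%}")
```

Low utilization here corresponds to many dead work items idling while a few live ones run deep into the cascade.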
PROCESSING TIME/STAGE
Trinity A10-4600M (6CU@497MHz, 4 cores@2700MHz)
[Chart: Time (ms) per cascade stage, GPU vs. CPU; the last stage bin is labeled 9-22]
AMD A10 4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL 1.1 (873.1)
PERFORMANCE CPU-VS-GPU
Trinity A10-4600M (6CU@497MHz, 4 cores@2700MHz)
[Chart: Images/Sec for CPU, HSA, and GPU]
AMD A10 4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL 1.1 (873.1)
INCREASED PERFORMANCE
-2.5x DECREASED ENERGY PER FRAME
SUFFIX ARRAYS
Bio-informatics
[Figure: suffix array construction split across devices. Stages shown: Radix Sort::GPU, Lexical Rank::CPU, Compute SA::CPU, Radix Sort::GPU, Merge Sort::GPU]
+5.8x INCREASED PERFORMANCE
-5x DECREASED ENERGY
M. Deo, "Parallel Suffix Array Construction and Least Common Prefix for the GPU," submitted to Principles and Practice of Parallel Programming (PPoPP '13), February 2013.
AMD A10 4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM
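For readers unfamiliar with the data structure, a suffix array lists the starting positions of all suffixes of a string in sorted order. The sketch below is a naive CPU reference that sorts the suffixes directly; it is not the GPU radix-sort and merge-sort pipeline from the cited work, and the function names are illustrative. It also computes the common-prefix length between adjacent sorted suffixes.

```python
def suffix_array(text):
    """Naive suffix array: start indices of all suffixes, in sorted suffix order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_lengths(text, sa):
    """Length of the shared prefix between each pair of adjacent sorted suffixes."""
    lengths = []
    for a, b in zip(sa, sa[1:]):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lengths.append(n)
    return lengths


sa = suffix_array("banana")
print(sa)                                   # [5, 3, 1, 0, 4, 2]
print(common_prefix_lengths("banana", sa))  # [1, 3, 0, 0, 2]
```

The direct sort costs far more than specialized constructions on long inputs such as genomes, which is why the work is worth distributing across CPU and GPU.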
A way to animate and interact with objects, widely used in games and movie production
Used to drive game play and for visual effects (eye candy)
Games ranging from Angry Birds and Cut the Rope to Tomb Raider and Crysis 3
3D authoring tools such as Autodesk Maya, Unity 3D, Houdini, Cinema 4D, Lightwave
Robotics simulation
Pipeline stages:
Mid-Phase Collision Detection
Narrow-Phase Collision Detection
Compute contact points
Setup constraints
Solve constraints
Benefits of HSA
Unified Addressing,
Pageable memory,
Coherency
GPU Enqueue
EASE OF PROGRAMMING
CODE COMPLEXITY VS. PERFORMANCE
[Chart: lines of code (LOC, axis 0 to 350) and Performance (axis 0 to 35.00) for Serial CPU, TBB, Intrinsics+TBB, OpenCL-C, OpenCL-C++, C++ AMP, and HSA Bolt; LOC bars are broken down into Init., Compile, Copy, Launch, Algorithm, and Copy-back]
AMD A10-5800K APU with Radeon HD Graphics; CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM.
Software: Windows 7 Professional SP1 (64-bit OS); AMD OpenCL 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
Lower power, more capable devices in your hand, on the wall, in the cloud
THANK YOU