HSA Intro HotChips - Final PDF

HETEROGENEOUS SYSTEM ARCHITECTURE (HSA)
PHIL ROGERS, HSA FOUNDATION PRESIDENT, AMD CORPORATE FELLOW
HSA FOUNDATION

Founded in June 2012
Developing a new platform for heterogeneous systems
www.hsafoundation.com
Specifications under development in working groups
Our first specification, the HSA Programmer's Reference Manual, is already published and available on our web site
Additional specifications for System Architecture, Runtime Software and Tools are in process
Copyright 2012 HSA Foundation. All Rights Reserved.
HSA FOUNDATION MEMBERSHIP, AUGUST 2013

[Figure: member logos grouped by tier: Founders, Promoters, Supporters, Contributors, Academic Associates]
SOCS HAVE PROLIFERATED: MAKE THEM BETTER
SOCs have arrived and are a tremendous advance over previous platforms
SOCs combine CPU cores, GPU cores and other accelerators, with high bandwidth access to memory
How do we make them even better?
  Easier to program
  Easier to optimize
  Higher performance
  Lower power
HSA unites accelerators architecturally
Early focus on the GPU compute accelerator, but HSA goes well beyond the GPU
GOALS OF HSA
Make the unprecedented processing capability of the SOC as accessible to programmers as the CPU is today
Dramatically expand the SOC software ecosystem in client and server systems
Accelerate immersive applications whether hosted locally or in the cloud
INFLECTIONS IN PROCESSOR DESIGN
[Figure: three performance-versus-time curves, one per era, each marked "we are here"]

Single-Core Era
  Enabled by: Moore's Law, voltage scaling
  Constrained by: power, complexity
  Programming: Assembly, C/C++, Java
  Axes: single-thread performance vs. time

Multi-Core Era
  Enabled by: Moore's Law, SMP architecture
  Constrained by: power, parallel SW scalability
  Programming: pthreads, OpenMP / TBB
  Axes: throughput performance vs. time (# of processors)

Heterogeneous Systems Era
  Enabled by: abundant data parallelism, power efficient GPUs
  Temporarily constrained by: programming models, communication overhead
  Programming: Shader, CUDA, OpenCL, C++ and Java (the figure also carries a "?" annotation)
  Axes: modern application performance vs. time (data-parallel exploitation)
HIGH LEVEL FEATURES OF HSA
Features currently being defined in the HSA Working Groups**
  Unified addressing across all processors
  Operation into pageable system memory
  Full memory coherency
  User mode dispatch
  Architected queuing language
  High level language support for GPU compute processors
  Preemption and context switching
** All features subject to change, pending completion and ratification of specifications in the HSA Working Groups
HSA: AN OPEN PLATFORM
Open Architecture, membership open to all
  HSA Programmer's Reference Manual
  HSA System Architecture
  HSA Runtime
Delivered via royalty free standards
  Royalty Free IP, Specifications and APIs
ISA agnostic for both CPU and GPU
Membership from all areas of computing
  Hardware companies
  Operating Systems
  Tools and Middleware
HSA INTERMEDIATE LAYER: HSAIL
HSAIL is a virtual ISA for parallel programs
  Finalized to ISA by a JIT compiler or "Finalizer"
  ISA independent by design for CPU & GPU
Explicitly parallel
  Designed for data parallel programming
Support for exceptions, virtual functions, and other high level language features
Lower level than OpenCL SPIR
  Fits naturally in the OpenCL compilation stack
Suitable to support additional high level languages and programming models:
  Java, C++, OpenMP, etc.
HSA MEMORY MODEL
Defines visibility ordering between all threads in the HSA System
Designed to be compatible with C++11, Java and .NET Memory Models
Relaxed consistency memory model for parallel compute performance
Visibility controlled by:
  Load.Acquire
  Store.Release
  Barriers
HSA SOFTWARE

[Figure: two software stacks side by side on shared hardware; layers listed top to bottom]
Driver Stack: Apps; Domain Libraries; OpenCL, DX Runtimes, User Mode Drivers; Graphics Kernel Mode Driver
HSA Software Stack: Apps; HSA Domain Libraries, OpenCL 2.x Runtime; Task Queuing Libraries, HSA Runtime, HSA JIT; HSA Kernel Mode Driver
Shared bottom layer: Hardware - APUs, CPUs, GPUs
Legend: User mode component; Kernel mode component; Components contributed by third parties
OPENCL AND HSA
HSA is an optimized platform architecture for OpenCL
  Not an alternative to OpenCL
OpenCL on HSA will benefit from
  Avoidance of wasteful copies
  Low latency dispatch
  Improved memory model
  Pointers shared between CPU and GPU
OpenCL 2.0 shows considerable alignment with HSA
Many HSA member companies are also active with Khronos in the OpenCL working group
BOLT: PARALLEL PRIMITIVES LIBRARY FOR HSA
Easily leverage the inherent power efficiency of GPU computing
  Common routines such as scan, sort, reduce, transform
  More advanced routines like heterogeneous pipelines
  Bolt library works with OpenCL
Enjoy the unique advantages of the HSA platform
  Move the computation not the data
Finally a single source code base for the CPU and GPU!
  Developers can focus on core algorithms
Bolt version 1.0 for OpenCL and C++ AMP is available now at https://github.com/HSA-Libraries/Bolt
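To make the four common routines named on this slide concrete, here is a minimal sketch of what scan, sort, reduce and transform each compute. It uses only the Python standard library and runs serially. It does not use or imitate the Bolt API (Bolt itself is a C++ template library), and the function names are illustrative assumptions.

```python
from functools import reduce
from itertools import accumulate
from operator import add


def transform(values, fn):
    """Apply fn to every element independently (a data-parallel map)."""
    return [fn(v) for v in values]


def inclusive_scan(values):
    """Running sum: element i of the result is the sum of values[0..i]."""
    return list(accumulate(values, add))


def reduce_sum(values):
    """Combine all elements into one value with an associative operator."""
    return reduce(add, values, 0)


data = [3, 1, 4, 1, 5]
squared = transform(data, lambda v: v * v)   # [9, 1, 16, 1, 25]
prefix = inclusive_scan(data)                # [3, 4, 8, 9, 14]
total = reduce_sum(data)                     # 14
ordered = sorted(data)                       # [1, 1, 3, 4, 5]
```

Each primitive has no dependence between elements (transform) or only an associative combination (scan, reduce), which is what makes them good candidates for a GPU.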
HSA OPEN SOURCE SOFTWARE
HSA will feature an open source Linux execution and compilation stack
  Allows a single shared implementation for many components
  Enables university research and collaboration in all areas
  Because it's the right thing to do

Component Name          IHV Specific   Rationale
HSA Bolt Library        No             Enable understanding and debug
HSAIL Code Generator    No             Enable research
LLVM Contributions      No             Industry and academic collaboration
HSAIL Assembler         No             Enable understanding and debug
HSA Runtime             No             Standardize on a single runtime
HSA Finalizer           Yes            Enable research and debug
HSA Kernel Driver       Yes            For inclusion in Linux distros
ACCELERATING JAVA
GOING BEYOND NATIVE LANGUAGES
JAVA ENABLEMENT BY APARAPI

Aparapi = Runtime capable of converting Java bytecode to OpenCL
  Developer creates Java source
  Source compiled to class files (bytecode) using standard compiler
  For execution on any OpenCL 1.1+ capable device
  OR execute via a thread pool if OpenCL is not available
JAVA HETEROGENEOUS ENABLEMENT ROADMAP
[Figure: roadmap in four stages. Three stages run the Application on APARAPI over a JVM, starting with OpenCL as the target and moving to HSAIL and the HSA Finalizer, with an LLVM Optimizer, IR and the HSA Runtime appearing along the way. The last stage runs the Application on a Sumatra Enabled JVM feeding the HSA Finalizer. Every stage ends in CPU ISA and GPU ISA on CPU and GPU hardware.]
SUMATRA PROJECT OVERVIEW

AMD/Oracle sponsored Open Source (OpenJDK) project
Targeted at Java 9 (2015 release)
Allows developers to efficiently represent data parallel algorithms in Java
Sumatra "repurposes" Java 8's multi-core Stream/Lambda APIs to enable either CPU or GPU computing
At runtime, the Sumatra enabled Java Virtual Machine (JVM) will dispatch selected constructs to available HSA enabled devices
Developers of Java libraries are already refactoring their library code to use these same constructs
  So developers using existing libraries should see GPU acceleration without any code changes
http://openjdk.java.net/projects/sumatra/
https://wikis.oracle.com/display/HotSpotInternals/Sumatra
http://mail.openjdk.java.net/pipermail/sumatra-dev/

[Figure: at development time, Application.java goes through the Java Compiler to Application.class; at runtime, the Application uses the Lambda/Stream API on a Sumatra Enabled JVM, which feeds the HSA Finalizer to produce CPU ISA for the CPU and GPU ISA for the GPU]
EXAMPLE WORKLOADS
HAAR FACE DETECTION

CORNERSTONE TECHNOLOGY FOR COMPUTER VISION
LOOKING FOR FACES IN ALL THE RIGHT PLACES

Quick HD Calculations
Search square = 21 x 21
Pixels = 1920 x 1080 = 2,073,600
Search squares = 1900 x 1060 = ~2 Million
LOOKING FOR DIFFERENT SIZE FACES BY SCALING THE VIDEO FRAME
More HD Calculations
  70% scaling in H and V
  Total Pixels = 4.07 Million
  Search squares = 3.8 Million
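The totals above can be approximated with a short sketch of the scaled-frame pyramid. The assumptions here are not stated on the slide: the frame is scaled by 0.7 repeatedly until it is smaller than the 21 x 21 search square, sizes are truncated to whole pixels, and the search square steps one pixel at a time. The function names are hypothetical.

```python
def pyramid_levels(width, height, scale=0.7, window=21):
    """Return the (width, height) of each scaled frame that still fits one window."""
    levels = []
    w, h = float(width), float(height)
    while int(w) >= window and int(h) >= window:
        levels.append((int(w), int(h)))
        w *= scale
        h *= scale
    return levels


def search_squares(width, height, window=21):
    """Count window positions in one frame, stepping one pixel at a time.

    The count is (width - window + 1) * (height - window + 1), which gives
    1900 x 1060 for a 1920 x 1080 frame.
    """
    return (width - window + 1) * (height - window + 1)


levels = pyramid_levels(1920, 1080)
total_pixels = sum(w * h for w, h in levels)
total_squares = sum(search_squares(w, h) for w, h in levels)
print(f"levels={len(levels)} pixels={total_pixels:,} squares={total_squares:,}")
```

Under these assumptions the totals land close to the slide's 4.07 million pixels and 3.8 million search squares; the exact digits depend on how each level is rounded.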
HAAR CASCADE STAGES

[Figure: flowchart of the cascade. Stage N evaluates its features (k, l, m, p) and asks "Face still possible?". Yes: continue to Stage N+1 (features q, r). No: REJECT FRAME.]
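The early-out structure of the cascade can be sketched as follows. This is an illustration only: the stage representation (a list of feature callables plus a threshold) and the toy features are assumptions, not the detector used for the measurements in this deck.

```python
def run_cascade(stages, window):
    """Evaluate one search window against an ordered list of stages.

    Each stage is a (features, threshold) pair, where every feature is a
    callable that scores the window. Returns (is_face, stages_evaluated).
    """
    for depth, (features, threshold) in enumerate(stages, start=1):
        score = sum(feature(window) for feature in features)
        if score < threshold:
            return False, depth      # early out: no later stage runs
    return True, len(stages)


# Toy stand-ins for Haar features: each one just reads a value from the window.
bright = lambda w: w["brightness"]
contrast = lambda w: w["contrast"]

stages = [
    ([bright], 0.5),                 # cheap first stage
    ([bright, contrast], 1.2),       # stricter second stage
    ([contrast], 0.8),               # final stage
]

print(run_cascade(stages, {"brightness": 0.2, "contrast": 0.9}))
print(run_cascade(stages, {"brightness": 0.7, "contrast": 0.9}))
```

The first window is rejected after one stage, while the second survives all three, so the work per window varies with how face-like it is.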
22 CASCADE STAGES, EARLY OUT BETWEEN EACH
[Figure: chain of STAGE 1, STAGE 2, ... STAGE 21, STAGE 22 ending in FACE CONFIRMED, with an early exit to NO FACE between stages]
Final HD Calculations
  Search squares = 3.8 million
  Average features per square = 124
  Calculations per feature = 100
  Calculations per frame = 47 GCalcs
Calculation Rate
  30 frames/sec = 1.4 TCalcs/second
  60 frames/sec = 2.8 TCalcs/second
... and this only gets front-facing faces
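The arithmetic behind these figures is a straight product of the three inputs. The short sketch below reproduces it; the variable names are illustrative, and the inputs are taken from the slide as given.

```python
search_squares = 3.8e6        # from the slide
features_per_square = 124     # average, from the slide
calcs_per_feature = 100       # from the slide

calcs_per_frame = search_squares * features_per_square * calcs_per_feature


def calc_rate(frames_per_second):
    """Calculations per second needed to keep up with a given frame rate."""
    return calcs_per_frame * frames_per_second


print(f"per frame: {calcs_per_frame / 1e9:.1f} GCalcs")
print(f"30 fps:    {calc_rate(30) / 1e12:.2f} TCalcs/s")
print(f"60 fps:    {calc_rate(60) / 1e12:.2f} TCalcs/s")
```

The products are 47.12 GCalcs per frame, 1.41 TCalcs/s at 30 frames/sec and 2.83 TCalcs/s at 60 frames/sec, which the slide rounds to 47, 1.4 and 2.8.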
UNBALANCING DUE TO EARLY EXITS
[Figure: grid of work items, each marked Live or Dead]
When running on the GPU, we run each search rectangle on a separate work item
Early out algorithms, like HAAR, exhibit divergence between work items
  Some work items exit early
  Their neighbors continue
  SIMD packing suffers as a result
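A rough way to quantify the loss in SIMD packing is to compare the stage evaluations that are actually needed with the lane-slots the hardware keeps occupied. The sketch below is a simplified model under stated assumptions: work items are packed 64 to a group (an assumed SIMD width), and a group stays occupied until its slowest work item finishes. It is not a measurement from this deck.

```python
def simd_utilization(exit_stage, lanes=64):
    """Fraction of lane-slots doing useful work when work items exit early.

    exit_stage[i] is the number of cascade stages work item i runs before it
    stops. Work items are packed `lanes` at a time, and each group keeps
    stepping until its slowest member has finished.
    """
    useful = 0
    occupied = 0
    for start in range(0, len(exit_stage), lanes):
        group = exit_stage[start:start + lanes]
        useful += sum(group)                 # stage evaluations actually needed
        occupied += max(group) * lanes       # slots the hardware holds busy
    return useful / occupied


# 63 windows rejected after stage 1, one window that runs all 22 stages.
one_face = [1] * 63 + [22]
print(f"{simd_utilization(one_face):.1%}")

# Every window runs the same number of stages: no divergence.
uniform = [5] * 64
print(f"{simd_utilization(uniform):.1%}")
```

In this model a single deep work item holds its whole group for 22 stages, so only 85 of 1408 lane-slots (about 6%) do useful work, while the uniform case reaches 100%.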
CASCADE DEPTH ANALYSIS

[Figure: Cascade Depth plot; depth axis 0 to 25 in steps of 5, legend bands 0-5, 5-10, 10-15, 15-20, 20-25]
PROCESSING TIME/STAGE
Trinity A10-4600M (6CU@497MHz, 4 cores@2700MHz)
[Figure: chart of processing time in ms (axis 0 to 100) per cascade stage for GPU and CPU; the last category covers stages 9-22]
AMD A10 4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL 1.1 (873.1)
PERFORMANCE CPU-VS-GPU
Trinity A10-4600M (6CU@497MHz, 4 cores@2700MHz)
[Figure: chart of Images/Sec (axis 0 to 12) against Number of Cascade Stages on GPU (0 to 8, and 22), with series CPU, HSA and GPU]
AMD A10 4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL 1.1 (873.1)
HAAR SOLUTION: RUN DIFFERENT CASCADES ON GPU AND CPU

By seamlessly sharing data between CPU and GPU, HSA allows the right processor to handle its appropriate workload
+2.5x INCREASED PERFORMANCE
-2.5x DECREASED ENERGY PER FRAME
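The split described on this slide can be sketched as a two-phase filter. Everything here is an illustrative assumption: stages are modeled as simple predicates, both phases run serially on the CPU, and the names are hypothetical. The point is only the shape of the hand-off, where the survivor list is passed from one phase to the next without copying, which is the role shared memory plays on HSA.

```python
def split_cascade(windows, stages, gpu_stages):
    """Run the first `gpu_stages` stages on every window, the rest on survivors.

    `stages` is a list of predicates (window -> bool). The first phase stands
    in for a wide data-parallel GPU pass; the second for a CPU pass over the
    few windows that are still candidates. Returns (faces, work_log).
    """
    head, tail = stages[:gpu_stages], stages[gpu_stages:]

    # Phase 1 ("GPU"): every window enters; only the first few stages run,
    # so early exits stay close together.
    survivors = [w for w in windows if all(stage(w) for stage in head)]

    # Phase 2 ("CPU"): divergent work, each survivor may exit at any stage.
    faces = [w for w in survivors if all(stage(w) for stage in tail)]

    work_log = {"gpu_windows": len(windows), "cpu_windows": len(survivors)}
    return faces, work_log


# Toy stages over integer "windows": divisible by 2, then by 3, then by 5.
stages = [lambda w: w % 2 == 0, lambda w: w % 3 == 0, lambda w: w % 5 == 0]
faces, work_log = split_cascade(list(range(100)), stages, gpu_stages=1)
print(faces, work_log)
```

Choosing `gpu_stages` is the tuning knob: too few stages leave the CPU with too many survivors, too many bring back the divergence problem on the GPU.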
ACCELERATING SUFFIX ARRAY CONSTRUCTION

CLOUD SERVER WORKLOAD
SUFFIX ARRAYS
Suffix Arrays are a fundamental data structure
  Designed for efficient searching of a large text
  Quickly locate every occurrence of a substring S in a text T
Suffix Arrays are used to accelerate in-memory cloud workloads
  Full text index search
  Lossless data compression
  Bio-informatics
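To show how a suffix array locates every occurrence of a substring S in a text T, here is a minimal sketch. The construction is deliberately naive (it sorts whole suffixes) and stands in for the accelerated construction discussed next; the function names are illustrative.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, sa, pattern):
    """Return every start position of `pattern` in `text`, in ascending order."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)
print(find_all(text, sa, "ana"))
```

Because the suffixes are sorted, all suffixes that start with the pattern sit in one contiguous range, so two binary searches are enough to find them all.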
ACCELERATED SUFFIX ARRAY CONSTRUCTION ON HSA

By efficiently sharing data between CPU and GPU, HSA lets us move compute to data without penalty of intermediate copies.
By offloading data parallel computations to GPU, HSA increases performance and reduces energy for Suffix Array Construction versus Single Threaded CPU.
+5.8x INCREASED PERFORMANCE
-5x DECREASED ENERGY
[Figure: Skew Algorithm for Compute SA, each step labeled with the processor that runs it: Radix Sort::GPU, Lexical Rank::CPU, Compute SA::CPU, Radix Sort::GPU, Merge Sort::GPU]
M. Deo, "Parallel Suffix Array Construction and Least Common Prefix for the GPU", Submitted to "Principles and Practice of Parallel Programming" (PPoPP'13), February 2013.
AMD A10 4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM
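As an illustration of the first two labeled steps, the sketch below radix-sorts the character triples at the sample positions (those with index mod 3 not equal to 0) and then assigns lexical ranks. It is a serial sketch of the textbook skew algorithm, not the implementation from the cited paper; the dictionary-based buckets and the function names are assumptions made for brevity.

```python
def radix_sort_triples(text, positions):
    """LSD radix sort of positions by the 3 characters starting at each one.

    Characters past the end of the text count as 0, the usual skew-algorithm
    padding. Each pass is a stable bucket sort on one character.
    """
    codes = [ord(c) for c in text] + [0, 0, 0]
    order = list(positions)
    for offset in (2, 1, 0):                       # least significant first
        buckets = {}
        for p in order:
            buckets.setdefault(codes[p + offset], []).append(p)
        order = [p for key in sorted(buckets) for p in buckets[key]]
    return order


def lexical_ranks(text, sorted_positions):
    """Give equal triples the same rank; ranks start at 1 and grow with order."""
    codes = [ord(c) for c in text] + [0, 0, 0]
    ranks, rank, previous = {}, 0, None
    for p in sorted_positions:
        triple = tuple(codes[p:p + 3])
        if triple != previous:
            rank += 1
            previous = triple
        ranks[p] = rank
    return ranks


text = "yabbadabbado"
sample = [i for i in range(len(text)) if i % 3 != 0]
order = radix_sort_triples(text, sample)
ranks = lexical_ranks(text, order)
unique = len(set(ranks.values())) == len(sample)   # False means: recurse
```

When the ranks are not unique, as in this example where the triple "bba" occurs twice, the skew algorithm recurses on the string of ranks before the remaining sort and merge steps.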
GAMEPLAY RIGID BODY PHYSICS
RIGID BODY PHYSICS SIMULATION
Rigid-Body Physics Simulation is:
  A way to animate and interact with objects, widely used in games and movie production
  Used to drive game play and for visual effects (eye candy)
Physics Simulation is used in much of today's software:
  Middleware Physics engines such as Bullet, Havok, PhysX
  Games ranging from Angry Birds and Cut the Rope to Tomb Raider and Crysis 3
  3D authoring tools such as Autodesk Maya, Unity 3D, Houdini, Cinema 4D, Lightwave
  Industrial applications such as Siemens NX8 Mechatronics Concept Design
  Medical applications such as surgery trainers
  Robotics simulation
But GPU-accelerated rigid-body physics is not used in game play, only in effects
RIGID BODY PHYSICS ALGORITHM

Pipeline stages:
  Broad-Phase Collision Detection
  Mid-Phase Collision Detection
  Narrow-Phase Collision Detection
  Compute contact points
  Setup constraints
  Solve constraints
What the stages do:
  Find potential interacting object pairs using bounding shape approximations
  Perform full overlap testing between potentially interacting pairs
  Compute exact contact information for various shape types
  Compute constraint forces for natural motion and stable stacking
[Figure: example objects A, B, C and D with numbered pair labels]
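The broad-phase step, finding potentially interacting pairs from bounding shapes, can be sketched with a simple sweep over axis-aligned bounding boxes in 2D. This is a minimal, non-incremental version written for illustration; the box representation and function name are assumptions, and production physics engines use more elaborate variants.

```python
def sweep_and_prune(boxes):
    """Broad phase: return index pairs whose axis-aligned boxes overlap.

    Each box is ((min_x, min_y), (max_x, max_y)). Boxes are sorted by min_x,
    swept left to right, and a pair is tested on y only while the two boxes
    still overlap on x.
    """
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][0][0])
    active = []          # boxes whose x-interval may still reach the sweep line
    pairs = []
    for i in order:
        (min_x, min_y), (max_x, max_y) = boxes[i]
        # Drop boxes that end before this one starts on x.
        active = [j for j in active if boxes[j][1][0] >= min_x]
        for j in active:
            (_, other_min_y), (_, other_max_y) = boxes[j]
            if other_min_y <= max_y and min_y <= other_max_y:
                pairs.append((min(i, j), max(i, j)))
        active.append(i)
    return sorted(pairs)


boxes = [
    ((0, 0), (2, 2)),        # 0
    ((1, 1), (3, 3)),        # 1: overlaps 0
    ((5, 0), (6, 1)),        # 2: isolated
    ((2.5, 2.5), (4, 4)),    # 3: overlaps 1
]
print(sweep_and_prune(boxes))
```

The resulting pair list is what the later phases consume, and its length can change a great deal from frame to frame.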
RIGID BODY PHYSICS CHALLENGES & SOLUTIONS

Implementation Challenge:
  Game engine and Physics engine need to interact synchronously during simulation
    Many operations require fast CPU round-trips and CPU modification of simulation state mid-pipeline
    Traditional GPU solutions cannot guarantee frame-time response
Benefits of HSA:
  Fast CPU round-trips (User mode dispatch)
  Immediate access to geometry and modification of simulation state mid-pipeline (Unified Addressing, Pageable memory, Coherency)

Implementation Challenge:
  The set of pairs can be huge and changes from frame to frame
    Thousands to Millions for any given frame
Benefits of HSA:
  Supports as large a pair list as CPU (Entire memory space)
  GPU can resize pair list (Dynamic memory allocation)
RIGID BODY PHYSICS CHALLENGES & SOLUTIONS

Implementation Challenge:
  Simulation is a pipeline of many different algorithms, some of which are more suitable for CPU while others are more suitable for GPU
Benefits of HSA:
  Avoidance of data copies and the overhead of maintaining two copies of simulation state (Unified Addressing, Pageable memory, Coherency)

Implementation Challenge:
  Varying object sizes require more complex and difficult to parallelize broad-phase algorithms
    sweep-and-prune uses incremental sorting and traversal of lists
Benefits of HSA:
  More efficient serial aspects of broad-phase can run on the CPU (Move the compute not the data)

Implementation Challenge:
  Narrow-phase algorithms cause thread divergence
Benefits of HSA:
  Improved handling of thread divergence (GPU Enqueue)
EASE OF PROGRAMMING
CODE COMPLEXITY VS. PERFORMANCE
LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS
(Exemplary ISV "Hessian" Kernel)
[Figure: stacked bar chart of lines of code (LOC, axis 0 to 350), broken down into Init, Compile, Copy, Launch, Algorithm and Copy-back, with Performance (axis 0 to 35) overlaid, for Serial CPU, TBB, Intrinsics+TBB, OpenCL-C, OpenCL-C++, C++ AMP and HSA Bolt]
AMD A10-5800K APU with Radeon HD Graphics; CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM. Software: Windows 7 Professional SP1 (64-bit OS); AMD OpenCL 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
THE HSA FUTURE
Architected heterogeneous processing on the SOC
Programming of accelerators becomes much easier
Accelerated software that runs across multiple hardware vendors
Scalability from phones to super computers on a common architecture
GPU acceleration of parallel processing is the initial target, with DSPs and other accelerators coming to the HSA system architecture model
Heterogeneous software ecosystem evolves at a much faster pace
Lower power, more capable devices in your hand, on the wall, in the cloud
THANK YOU
