
HETEROGENEOUS SYSTEM ARCHITECTURE (HSA)

PHIL ROGERS, HSA FOUNDATION PRESIDENT, AMD CORPORATE FELLOW

HSA FOUNDATION

Founded in June 2012
Developing a new platform for heterogeneous systems
www.hsafoundation.com
Specifications under development in working groups
Our first specification, "HSA Programmer's Reference Manual", is already published and available on our web site
Additional specifications for System Architecture, Runtime Software and Tools are in process

Copyright 2012 HSA Foundation. All Rights Reserved.

HSA FOUNDATION MEMBERSHIP: AUGUST 2013

[Figure: member logos grouped by tier: Founders, Promoters, Supporters, Contributors, Academic Associates]


SOCS HAVE PROLIFERATED: MAKE THEM BETTER

SOCs have arrived and are a tremendous advance over previous platforms
SOCs combine CPU cores, GPU cores and other accelerators, with high bandwidth access to memory
How do we make them even better?

Easier to program
Easier to optimize
Higher performance
Lower power

HSA unites accelerators architecturally
Early focus on the GPU compute accelerator, but HSA goes well beyond the GPU

GOALS OF HSA

Make the unprecedented processing capability of the SOC as accessible to programmers as the CPU is today
Dramatically expand the SOC software ecosystem in client and server systems
Accelerate immersive applications whether hosted locally or in the cloud


INFLECTIONS IN PROCESSOR DESIGN

Single-Core Era
Enabled by: Moore's Law, Voltage Scaling
Constrained by: Power, Complexity
[Chart: Single-thread Performance over Time, annotated Assembly, C/C++, Java, with a "we are here" marker]

Multi-Core Era
Enabled by: Moore's Law, SMP architecture
Constrained by: Power, Parallel SW, Scalability
[Chart: Throughput Performance over Time (# of processors), annotated pthreads, OpenMP / TBB, with a "we are here" marker]

Heterogeneous Systems Era
Enabled by: Abundant data parallelism, Power efficient GPUs
Temporarily Constrained by: Programming models, Comm. overhead
[Chart: Modern Application Performance over Time (Data-parallel exploitation), annotated Shader, CUDA, OpenCL, C++ and Java, with a "we are here" marker]


HIGH LEVEL FEATURES OF HSA

Features currently being defined in the HSA Working Groups**

Unified addressing across all processors
Operation into pageable system memory
Full memory coherency
User mode dispatch
Architected queuing language
High level language support for GPU compute processors
Preemption and context switching

** All features subject to change, pending completion and ratification of specifications in the HSA Working Groups


HSA: AN OPEN PLATFORM

Open Architecture, membership open to all

HSA Programmer's Reference Manual
HSA System Architecture
HSA Runtime

Delivered via royalty free standards

Royalty Free IP, Specifications and APIs

ISA agnostic for both CPU and GPU

Membership from all areas of computing

Hardware companies
Operating Systems
Tools and Middleware


HSA INTERMEDIATE LAYER: HSAIL

HSAIL is a virtual ISA for parallel programs

Finalized to ISA by a JIT compiler or "Finalizer"
ISA independent by design for CPU & GPU

Explicitly parallel

Designed for data parallel programming

Support for exceptions, virtual functions, and other high level language features

Lower level than OpenCL SPIR

Fits naturally in the OpenCL compilation stack

Suitable to support additional high level languages and programming models:

Java, C++, OpenMP, etc


HSA MEMORY MODEL

Defines visibility ordering between all threads in the HSA System
Designed to be compatible with C++11, Java and .NET Memory Models
Relaxed consistency memory model for parallel compute performance
Visibility controlled by:

Load.Acquire
Store.Release
Barriers


HSA SOFTWARE

[Figure: two software stacks side by side on shared hardware (APUs, CPUs, GPUs).
Driver Stack: Apps; Domain Libraries; OpenCL, DX Runtimes, User Mode Drivers; Graphics Kernel Mode Driver.
HSA Software Stack: Apps; HSA Domain Libraries, OpenCL 2.x Runtime; Task Queuing Libraries; HSA Runtime; HSA JIT; HSA Kernel Mode Driver.
Legend: User mode component, Kernel mode component, Components contributed by third parties]


OPENCL AND HSA

HSA is an optimized platform architecture for OpenCL

Not an alternative to OpenCL

OpenCL on HSA will benefit from

Avoidance of wasteful copies
Low latency dispatch
Improved memory model
Pointers shared between CPU and GPU

OpenCL 2.0 shows considerable alignment with HSA

Many HSA member companies are also active with Khronos in the OpenCL working group


BOLT: PARALLEL PRIMITIVES LIBRARY FOR HSA

Easily leverage the inherent power efficiency of GPU computing

Common routines such as scan, sort, reduce, transform
More advanced routines like heterogeneous pipelines
Bolt library works with OpenCL

Enjoy the unique advantages of the HSA platform

Move the computation not the data

Finally a single source code base for the CPU and GPU!

Developers can focus on core algorithms

Bolt version 1.0 for OpenCL and C++ AMP is available now at https://github.com/HSA-Libraries/Bolt
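
For readers unfamiliar with the primitives named on this slide (scan, sort, reduce, transform), the following is a minimal sketch of what each one computes. It uses only the Python standard library and is not the Bolt API; the input list and variable names are made up for illustration.

```python
from functools import reduce
from itertools import accumulate
import operator

# Hypothetical input data, chosen only for illustration.
data = [3, 1, 4, 1, 5, 9, 2, 6]

# transform: apply the same function to every element independently.
squared = list(map(lambda x: x * x, data))

# reduce: combine all elements into one value with an associative operator.
total = reduce(operator.add, data, 0)

# scan (inclusive prefix sum): element i holds the sum of data[0..i].
prefix = list(accumulate(data, operator.add))

# sort: order the elements.
ordered = sorted(data)
```

Each of these operations has a data-parallel formulation (transform is independent per element; reduce and scan can be evaluated as trees), which is why they are common building blocks for GPU libraries.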


HSA OPEN SOURCE SOFTWARE

HSA will feature an open source Linux execution and compilation stack

Allows a single shared implementation for many components
Enables university research and collaboration in all areas
Because it's the right thing to do

Component Name         IHV Specific   Rationale
HSA Bolt Library       No             Enable understanding and debug
HSAIL Code Generator   No             Enable research
LLVM Contributions     No             Industry and academic collaboration
HSAIL Assembler        No             Enable understanding and debug
HSA Runtime            No             Standardize on a single runtime
HSA Finalizer          Yes            Enable research and debug
HSA Kernel Driver      Yes            For inclusion in Linux distros


ACCELERATING JAVA
GOING BEYOND NATIVE LANGUAGES

JAVA ENABLEMENT BY APARAPI


Aparapi = Runtime capable of converting Java bytecode to OpenCL

Developer creates Java source
Source compiled to class files (bytecode) using standard compiler

For execution on any OpenCL 1.1+ capable device
OR execute via a thread pool if OpenCL is not available


JAVA HETEROGENEOUS ENABLEMENT ROADMAP

[Figure: four successive Java software stacks, each ending in CPU ISA and GPU ISA.
1. Application, APARAPI, JVM, OpenCL, running on CPU and GPU.
2. Application, APARAPI, JVM, HSAIL, HSA Finalizer, running on HSA hardware.
3. Application, APARAPI, JVM with IR, LLVM Optimizer and HSA Runtime, HSAIL, HSA Finalizer, running on HSA hardware.
4. Application, Sumatra Enabled JVM, HSAIL, HSA Finalizer, running on HSA hardware.]


SUMATRA PROJECT OVERVIEW


AMD/Oracle sponsored Open Source (OpenJDK) project
Targeted at Java 9 (2015 release)
Allows developers to efficiently represent data parallel algorithms in Java
Sumatra repurposes Java 8's multi-core Stream/Lambda APIs to enable both CPU and GPU computing
At runtime, a Sumatra enabled Java Virtual Machine (JVM) will dispatch selected constructs to available HSA enabled devices
Developers of Java libraries are already refactoring their library code to use these same constructs
So developers using existing libraries should see GPU acceleration without any code changes

[Figure: at development time, Application.java passes through the Java Compiler to Application.class; at runtime, the Application uses the Lambda/Stream API on a Sumatra Enabled JVM, and the HSA Finalizer produces CPU ISA for the CPU and GPU ISA for the GPU]

http://openjdk.java.net/projects/sumatra/
https://wikis.oracle.com/display/HotSpotInternals/Sumatra
http://mail.openjdk.java.net/pipermail/sumatra-dev/


EXAMPLE WORKLOADS

HAAR FACE DETECTION


CORNERSTONE TECHNOLOGY FOR COMPUTER VISION

LOOKING FOR FACES IN ALL THE RIGHT PLACES


Quick HD Calculations
Search square = 21 x 21
Pixels = 1920 x 1080 = 2,073,600

Search squares = 1900 x 1060 = ~2 Million
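
The count of search squares above follows from sliding a 21 x 21 window one pixel at a time across the frame. A small sketch of that arithmetic, assuming a stride of one pixel and windows that must lie fully inside the frame (the function name is hypothetical):

```python
def search_squares(width, height, window=21):
    """Number of positions a window x window square can occupy in a frame.

    Assumes a stride of one pixel and no partial windows at the edges.
    """
    if width < window or height < window:
        return 0
    return (width - window + 1) * (height - window + 1)


pixels = 1920 * 1080                    # 2,073,600
squares = search_squares(1920, 1080)    # 1900 * 1060 = 2,014,000
```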


LOOKING FOR DIFFERENT SIZE FACES BY SCALING THE VIDEO FRAME

More HD Calculations
70% scaling in H and V
Total Pixels = 4.07 Million
Search squares = 3.8 Million
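
The totals on this slide can be approximated by summing over an image pyramid in which each level is 70% of the previous one in both dimensions. The sketch below assumes that dimensions are truncated to whole pixels and that scaling stops once a 21 x 21 window no longer fits; the slides do not state either detail, so the result only lands in the neighborhood of the quoted figures.

```python
def pyramid_totals(width, height, window=21, num=7, den=10):
    """Sum pixels and search squares over a pyramid scaled by num/den per level.

    Integer arithmetic is used so the 70% scaling is exact before truncation.
    Returns (levels, total_pixels, total_squares).
    """
    total_pixels = 0
    total_squares = 0
    level = 0
    while True:
        w = width * num**level // den**level
        h = height * num**level // den**level
        if w < window or h < window:
            break  # the search window no longer fits in the scaled frame
        total_pixels += w * h
        total_squares += (w - window + 1) * (h - window + 1)
        level += 1
    return level, total_pixels, total_squares


levels, total_pixels, total_squares = pyramid_totals(1920, 1080)
```

Because area shrinks by 0.49 per level, the pixel total converges quickly: the full-resolution frame alone accounts for about half of it.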


HAAR CASCADE STAGES


[Figure: cascade flow. Stage N evaluates Features k, l, m and p, then asks "Face still possible?" Yes continues to Stage N+1 (Features q, r); No leads to REJECT FRAME]
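
The early-out structure of the cascade can be sketched as a loop that stops at the first failing stage. This is a toy model under stated assumptions: each stage is a list of feature functions plus a threshold, and a stage passes when the summed feature responses reach the threshold. Real Haar classifiers use weighted rectangle sums; the names `run_cascade`, `stages` and the sample windows are hypothetical.

```python
def run_cascade(stages, window):
    """Evaluate cascade stages in order, rejecting at the first failure.

    stages: list of (features, threshold); features are callables on window.
    Returns (accepted, stages_evaluated).
    """
    for depth, (features, threshold) in enumerate(stages, start=1):
        score = sum(feature(window) for feature in features)
        if score < threshold:
            return False, depth  # early out: a face is no longer possible
    return True, len(stages)


# Toy cascade: each "feature" just reads one value from the window.
stages = [
    ([lambda w: w[0]], 1),
    ([lambda w: w[1], lambda w: w[2]], 2),
    ([lambda w: w[3]], 1),
]

accepted = run_cascade(stages, [1, 1, 1, 1])
rejected_late = run_cascade(stages, [1, 0, 1, 1])
rejected_early = run_cascade(stages, [0, 1, 1, 1])
```

Most windows in a typical frame are rejected in the first few stages, which is what makes the cascade cheap on average and uneven across windows.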


22 CASCADE STAGES, EARLY OUT BETWEEN EACH

[Figure: STAGE 1, STAGE 2, ... STAGE 21, STAGE 22 in sequence; passing every stage gives FACE CONFIRMED, failing any stage gives NO FACE]

Final HD Calculations
Search squares = 3.8 million
Average features per square = 124
Calculations per feature = 100
Calculations per frame = 47 GCalcs

Calculation Rate
30 frames/sec = 1.4 TCalcs/second
60 frames/sec = 2.8 TCalcs/second

and this only gets front-facing faces
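
The per-frame and per-second figures above are a straight product of the quantities listed on the slide. A short worked version, using the slide's rounded inputs (variable names are illustrative):

```python
# Inputs as quoted on the slide.
search_squares = 3_800_000
features_per_square = 124
calcs_per_feature = 100

calcs_per_frame = search_squares * features_per_square * calcs_per_feature
rate_30fps = calcs_per_frame * 30
rate_60fps = calcs_per_frame * 60

# Express in the slide's units.
gcalcs_per_frame = calcs_per_frame / 1e9   # about 47
tcalcs_30 = rate_30fps / 1e12              # about 1.4
tcalcs_60 = rate_60fps / 1e12              # about 2.8
```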


UNBALANCING DUE TO EARLY EXITS

[Figure: grid of work items marked Live or Dead]

When running on the GPU, we run each search rectangle on a separate work item
Early out algorithms, like HAAR, exhibit divergence between work items

Some work items exit early
Their neighbors continue
SIMD packing suffers as a result
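
One way to picture the packing loss is to model a SIMD group as running in lockstep until its longest-lived work item exits. The sketch below is a simplified model, not a measurement: it assumes every cascade stage costs the same, that a group keeps all lanes allocated until its last lane finishes, and that a partially filled final group still occupies a full set of lanes. The function name and sample depths are hypothetical.

```python
def simd_utilization(exit_depths, simd_width):
    """Fraction of issued lane-steps that do useful work.

    exit_depths: number of stages each work item runs before exiting.
    Work items are packed into groups of simd_width in order.
    """
    useful = 0
    issued = 0
    for start in range(0, len(exit_depths), simd_width):
        group = exit_depths[start:start + simd_width]
        steps = max(group)             # group runs until its last live lane exits
        useful += sum(group)           # lane-steps spent on live work items
        issued += steps * simd_width   # dead lanes still occupy SIMD slots
    return useful / issued


# Three early exits next to one full-depth item waste most of the group.
divergent = simd_utilization([1, 1, 2, 22], simd_width=4)
uniform = simd_utilization([5, 5, 5, 5], simd_width=4)
```

In this model a single work item that survives all 22 stages keeps its whole group occupied, so utilization falls to 26/88 for the divergent case.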


CASCADE DEPTH ANALYSIS


[Figure: cascade depth reached per search square, on a scale of 0 to 25, binned 0-5, 5-10, 10-15, 15-20, 20-25]


PROCESSING TIME/STAGE
Trinity A10-4600M (6CU@497MHz, 4 cores@2700MHz)

[Figure: time (ms), 0 to 100, per cascade stage for GPU and CPU; the last category groups stages 9-22]

AMD A10 4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL 1.1 (873.1)


PERFORMANCE CPU-VS-GPU
Trinity A10-4600M (6CU@497MHz, 4 cores@2700MHz)

[Figure: Images/Sec, 0 to 12, for CPU, HSA and GPU versus Number of Cascade Stages on GPU (0 to 8, and 22)]

AMD A10 4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL 1.1 (873.1)


HAAR SOLUTION: RUN DIFFERENT CASCADES ON GPU AND CPU

By seamlessly sharing data between CPU and GPU, HSA allows the right processor to handle its appropriate workload

+2.5x INCREASED PERFORMANCE
-2.5x DECREASED ENERGY PER FRAME


ACCELERATING SUFFIX ARRAY CONSTRUCTION


CLOUD SERVER WORKLOAD

SUFFIX ARRAYS

Suffix Arrays are a fundamental data structure

Designed for efficient searching of a large text

Quickly locate every occurrence of a substring S in a text T

Suffix Arrays are used to accelerate in-memory cloud workloads

Full text index search
Lossless data compression
Bio-informatics
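
To make the data structure concrete, here is a minimal sketch of a suffix array and the substring lookup it enables. The construction below simply sorts suffixes directly, which is far slower than the skew algorithm these slides refer to; it is meant only to show what the array contains and how binary search finds every occurrence of S in T. Function names are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction by direct comparison; fine for small inputs only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return every start position of pattern in text, using binary search on sa."""
    m = len(pattern)

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])


text = "banana"
sa = build_suffix_array(text)
hits = find_occurrences(text, sa, "ana")
```

All suffixes that start with the pattern sit next to each other in the sorted array, so two binary searches bound the whole set of matches.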


ACCELERATED SUFFIX ARRAY CONSTRUCTION ON HSA


By efficiently sharing data between CPU and GPU, HSA lets us move compute to data without penalty of intermediate copies.

By offloading data parallel computations to GPU, HSA increases performance and reduces energy for Suffix Array Construction versus Single Threaded CPU.

+5.8x INCREASED PERFORMANCE
-5x DECREASED ENERGY

[Figure: Skew Algorithm for Compute SA, with stages Radix Sort::GPU, Lexical Rank::CPU, Compute SA::CPU, Radix Sort::GPU, Merge Sort::GPU]

M. Deo, "Parallel Suffix Array Construction and Least Common Prefix for the GPU", Submitted to "Principles and Practice of Parallel Programming" (PPoPP'13), February 2013.
AMD A10 4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM


GAMEPLAY RIGID BODY PHYSICS

RIGID BODY PHYSICS SIMULATION

Rigid-Body Physics Simulation is:

A way to animate and interact with objects, widely used in games and movie production
Used to drive game play and for visual effects (eye candy)

Physics simulation is used in much of today's software:

Middleware Physics engines such as Bullet, Havok, PhysX
Games ranging from Angry Birds and Cut the Rope to Tomb Raider and Crysis 3
3D authoring tools such as Autodesk Maya, Unity 3D, Houdini, Cinema 4D, Lightwave
Industrial applications such as Siemens NX8 Mechatronics Concept Design
Medical applications such as surgery trainers
Robotics simulation

But GPU-accelerated rigid-body physics is not used in game play, only in effects


RIGID BODY PHYSICS ALGORITHM


Find potential interacting object pairs using bounding shape approximations
Perform full overlap testing between potentially interacting pairs
Compute exact contact information for various shape types
Compute constraint forces for natural motion and stable stacking

Pipeline stages:
Broad-Phase Collision Detection
Mid-Phase Collision Detection
Narrow-Phase Collision Detection
Compute contact points
Setup constraints
Solve constraints
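
The first step, finding potentially interacting pairs from bounding shapes, can be sketched with axis-aligned bounding boxes and a single sort-and-sweep along the x axis. This is a simplified, one-shot 2D variant written for illustration; it is not taken from any particular physics engine, and the box names and coordinates are made up.

```python
def broad_phase_pairs(boxes):
    """Find pairs of overlapping axis-aligned bounding boxes.

    boxes: dict mapping name -> (min_x, min_y, max_x, max_y).
    Sorts boxes by min_x, sweeps left to right, and keeps an "active" list
    of boxes whose x-extent still overlaps the current one.
    """
    order = sorted(boxes, key=lambda name: boxes[name][0])
    pairs = []
    active = []
    for name in order:
        min_x, min_y, max_x, max_y = boxes[name]
        # Drop boxes that end before the current box begins on x.
        active = [other for other in active if boxes[other][2] >= min_x]
        for other in active:
            _, o_min_y, _, o_max_y = boxes[other]
            # x-extents already overlap; test the y-extents.
            if o_min_y <= max_y and min_y <= o_max_y:
                pairs.append(tuple(sorted((other, name))))
        active.append(name)
    return sorted(pairs)


# Hypothetical scene with four boxes.
boxes = {
    "A": (0.0, 0.0, 2.0, 2.0),
    "B": (1.0, 1.0, 3.0, 3.0),
    "C": (4.0, 0.0, 5.0, 1.0),
    "D": (2.5, 2.5, 4.5, 4.5),
}
candidate_pairs = broad_phase_pairs(boxes)
```

The output is only a list of candidates: later phases still have to test the actual shapes and compute contact points for each pair.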


RIGID BODY PHYSICS CHALLENGES & SOLUTIONS


Implementation Challenges

Game engine and Physics engine need to interact synchronously during simulation
Many operations require fast CPU round-trips and CPU modification of simulation state mid-pipeline
Traditional GPU solutions cannot guarantee frame-time response

Benefits of HSA

Fast CPU round-trips: User mode dispatch
Immediate access to geometry and modification of simulation state mid-pipeline: Unified Addressing, Pageable memory, Coherency

Implementation Challenges

The set of pairs can be huge and changes from frame to frame
Thousands to Millions for any given frame

Benefits of HSA

Supports as large a pair list as CPU: Entire memory space
GPU can resize pair list: Dynamic memory allocation


RIGID BODY PHYSICS CHALLENGES & SOLUTIONS


Implementation Challenges

Simulation is a pipeline of many different algorithms, some of which are more suitable for CPU while others are more suitable for GPU

Benefits of HSA

Avoidance of data copies and the overhead of maintaining two copies of simulation state: Unified Addressing, Pageable memory, Coherency

Implementation Challenges

Varying object sizes require more complex and difficult to parallelize broad-phase algorithms
sweep-and-prune uses incremental sorting and traversal of lists

Benefits of HSA

More efficient serial aspects of broad-phase can run on the CPU: Move the compute not the data

Implementation Challenges

Narrow-phase algorithms cause thread divergence

Benefits of HSA

Improved handling of thread divergence: GPU Enqueue


EASE OF PROGRAMMING
CODE COMPLEXITY VS. PERFORMANCE

LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS
(Exemplary ISV "Hessian" Kernel)

[Figure: stacked bars of lines of code (LOC, 0 to 350) split into Init, Compile, Copy, Launch, Algorithm and Copy-back, with Performance (0 to 35.00) overlaid, for Serial CPU, TBB, Intrinsics+TBB, OpenCL-C, OpenCL-C++, C++ AMP and HSA Bolt]

AMD A10-5800K APU with Radeon HD Graphics; CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM. Software: Windows 7 Professional SP1 (64-bit OS); AMD OpenCL 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta


THE HSA FUTURE

Architected heterogeneous processing on the SOC
Programming of accelerators becomes much easier
Accelerated software that runs across multiple hardware vendors
Scalability from phones to supercomputers on a common architecture
GPU acceleration of parallel processing is the initial target, with DSPs and other accelerators coming to the HSA system architecture model
Heterogeneous software ecosystem evolves at a much faster pace
Lower power, more capable devices in your hand, on the wall, in the cloud


THANK YOU
