HETEROGENEOUS SYSTEM ARCHITECTURE (HSA)
PHIL ROGERS,
HSA FOUNDATION PRESIDENT
AMD CORPORATE FELLOW

HSA FOUNDATION

Founded in June 2012

Developing a new platform for heterogeneous systems

www.hsafoundation.com

Specifications under development in working groups

Our first specification, the HSA Programmer's Reference Manual, is already published and available on our web site

Additional specifications for System Architecture, Runtime Software and Tools are in process

Copyright 2012 HSA Foundation. All Rights Reserved.

HSA FOUNDATION MEMBERSHIP: AUGUST 2013

Membership tiers (member logos not reproduced): Founders, Promoters, Supporters, Contributors, Academic, Associates

SOCS HAVE PROLIFERATED: MAKE THEM BETTER

SOCs have arrived and are a tremendous advance over previous platforms

SOCs combine CPU cores, GPU cores and other accelerators, with high bandwidth access to memory

How do we make them even better?

Easier to program

Easier to optimize

Higher performance

Lower power

HSA unites accelerators architecturally

Early focus on the GPU compute accelerator, but HSA goes well beyond the GPU

GOALS OF HSA

Make the unprecedented processing capability of the SOC as accessible to programmers as the CPU is today

Dramatically expand the SOC software ecosystem in client and server systems

Accelerate immersive applications whether hosted locally or in the cloud

INFLECTIONS IN PROCESSOR DESIGN

Single-Core Era
Enabled by: Moore's Law, Voltage Scaling
Constrained by: Power, Complexity
Programming models: Assembly, C/C++, Java
Chart: Single-thread Performance vs. Time ("we are here" marker)

Multi-Core Era
Enabled by: Moore's Law, SMP architecture
Constrained by: Power, Parallel SW, Scalability
Programming models: pthreads, OpenMP / TBB
Chart: Throughput Performance vs. Time (# of processors) ("we are here" marker)

Heterogeneous Systems Era
Enabled by: Abundant data parallelism, Power efficient GPUs
Temporarily Constrained by: Programming models, Comm. overhead
Programming models: Shader, CUDA, OpenCL, C++ and Java
Chart: Modern Application Performance vs. Time (Data-parallel exploitation) ("we are here" marker)

HIGH LEVEL FEATURES OF HSA

Features currently being defined in the HSA Working Groups**

Unified addressing across all processors

Operation into pageable system memory

Full memory coherency

User mode dispatch

Architected queuing language

High level language support for GPU compute processors

Preemption and context switching

** All features subject to change, pending completion and ratification of specifications in the HSA Working Groups


HSA: AN OPEN PLATFORM

Open Architecture, membership open to all

HSA Programmer's Reference Manual

HSA System Architecture

HSA Runtime

Delivered via royalty free standards

Royalty Free IP, Specifications and APIs

ISA agnostic for both CPU and GPU

Membership from all areas of computing

Hardware companies

Operating Systems

Tools and Middleware


HSA INTERMEDIATE LAYER: HSAIL

HSAIL is a virtual ISA for parallel programs

Finalized to ISA by a JIT compiler or Finalizer

ISA independent by design for CPU & GPU

Explicitly parallel

Designed for data parallel programming

Support for exceptions, virtual functions, and other high level language features

Lower level than OpenCL SPIR

Fits naturally in the OpenCL compilation stack

Suitable to support additional high level languages and programming models:

Java, C++, OpenMP, etc


HSA MEMORY MODEL

Defines visibility ordering between all threads in the HSA System

Designed to be compatible with C++11, Java and .NET Memory Models

Relaxed consistency memory model for parallel compute performance

Visibility controlled by:

Load.Acquire

Store.Release

Barriers


HSA SOFTWARE

Figure: two stacks side by side.
Driver Stack: Apps; Domain Libraries; OpenCL, DX Runtimes, User Mode Drivers; Graphics Kernel Mode Driver; Hardware - APUs, CPUs, GPUs.
HSA Software Stack: Apps; HSA Domain Libraries, OpenCL 2.x Runtime; Task Queuing Libraries; HSA Runtime; HSA JIT; HSA Kernel Mode Driver; Hardware - APUs, CPUs, GPUs.
Legend: User mode component; Kernel mode component; Components contributed by third parties.

OPENCL AND HSA

HSA is an optimized platform architecture for OpenCL

Not an alternative to OpenCL

OpenCL on HSA will benefit from

Avoidance of wasteful copies

Low latency dispatch

Improved memory model

Pointers shared between CPU and GPU

OpenCL 2.0 shows considerable alignment with HSA

Many HSA member companies are also active with Khronos in the OpenCL working group

BOLT: PARALLEL PRIMITIVES LIBRARY FOR HSA

Easily leverage the inherent power efficiency of GPU computing

Common routines such as scan, sort, reduce, transform

More advanced routines like heterogeneous pipelines

Bolt library works with OpenCL

Enjoy the unique advantages of the HSA platform

Finally a single source code base for the CPU and GPU!

Move the computation not the data

Developers can focus on core algorithms

Bolt version 1.0 for OpenCL and C++ AMP is available now at
https://github.com/HSA-Libraries/Bolt


HSA OPEN SOURCE SOFTWARE

HSA will feature an open source Linux execution and compilation stack

Allows a single shared implementation for many components

Enables university research and collaboration in all areas

Because it's the right thing to do

Component Name          IHV Specific   Rationale
HSA Bolt Library        No             Enable understanding and debug
HSAIL Code Generator    No             Enable research
LLVM Contributions      No             Industry and academic collaboration
HSAIL Assembler         No             Enable understanding and debug
HSA Runtime             No             Standardize on a single runtime
HSA Finalizer           Yes            Enable research and debug
HSA Kernel Driver       Yes            For inclusion in Linux distros

ACCELERATING JAVA
GOING BEYOND NATIVE LANGUAGES

JAVA ENABLEMENT BY APARAPI


Aparapi = Runtime capable of converting Java bytecode to OpenCL
Developer creates Java source

Source compiled to class files (bytecode) using standard compiler

For execution on any OpenCL 1.1+ capable device, OR execute via a thread pool if OpenCL is not available

JAVA HETEROGENEOUS ENABLEMENT ROADMAP

Figure: four successive software stacks. The first runs an Application on APARAPI and a JVM, targeting OpenCL with CPU ISA and GPU ISA back ends. The intermediate stacks keep Application, APARAPI and JVM but emit HSAIL for the HSA Finalizer, one of them through IR, the LLVM Optimizer and the HSA Runtime. The last runs an Application on a Sumatra Enabled JVM that emits HSAIL for the HSA Finalizer, which produces CPU ISA and GPU ISA for HSA devices.

SUMATRA PROJECT OVERVIEW

AMD/Oracle sponsored Open Source (OpenJDK) project

Targeted at Java 9 (2015 release)

Allows developers to efficiently represent data parallel algorithms in Java

Sumatra repurposes Java 8's multi-core Stream/Lambda APIs to enable both CPU and GPU computing

At runtime, a Sumatra enabled Java Virtual Machine (JVM) will dispatch selected constructs to available HSA enabled devices

Developers of Java libraries are already refactoring their library code to use these same constructs

So developers using existing libraries should see GPU acceleration without any code changes

Figure: Development: Application.java, Java Compiler, Application.class. Runtime: Application, Lambda/Stream API, Sumatra Enabled JVM.

http://openjdk.java.net/projects/sumatra/

https://wikis.oracle.com/display/HotSpotInternals/Sumatra

http://mail.openjdk.java.net/pipermail/sumatra-dev/

Figure (continued): HSA Finalizer producing CPU ISA (CPU) and GPU ISA (GPU).

EXAMPLE WORKLOADS

HAAR FACE DETECTION: CORNERSTONE TECHNOLOGY FOR COMPUTER VISION

LOOKING FOR FACES IN ALL THE RIGHT PLACES
Quick HD Calculations
Search square = 21 x 21
Pixels = 1920 x 1080 = 2,073,600

Search squares = 1900 x 1060 = ~2 Million
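The search-square count above follows from sliding a 21 x 21 window across the frame. A minimal sketch in Python, assuming a stride of one pixel and windows that must fit entirely inside the frame (the function name is illustrative and not from the slides):

```python
def search_positions(width, height, window=21):
    """Count top-left positions where a window x window square fits in the frame."""
    return (width - window + 1) * (height - window + 1)


frame_width, frame_height = 1920, 1080
pixels = frame_width * frame_height
squares = search_positions(frame_width, frame_height)

print(pixels)   # 2073600
print(squares)  # 2014000, i.e. 1900 x 1060, roughly 2 million
```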


LOOKING FOR DIFFERENT SIZE FACES BY SCALING THE VIDEO FRAME

More HD Calculations
70% scaling in H and V
Total Pixels = 4.07 Million
Search squares = 3.8 Million
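The totals above can be approximated by summing over a pyramid of frames, each 70% of the previous one in both directions. The sketch below is one possible reading, assuming integer truncation of the scaled sizes and stopping once a 21 x 21 window no longer fits. The slides do not state the rounding or stopping rule, so the result lands near, rather than exactly on, the rounded figures quoted.

```python
def pyramid_totals(width, height, window=21, scale_percent=70):
    """Sum pixels and search squares over successively scaled frames."""
    total_pixels = 0
    total_squares = 0
    levels = 0
    while width >= window and height >= window:
        total_pixels += width * height
        total_squares += (width - window + 1) * (height - window + 1)
        levels += 1
        # Integer arithmetic keeps the scaled sizes deterministic (assumption).
        width = width * scale_percent // 100
        height = height * scale_percent // 100
    return total_pixels, total_squares, levels


total_pixels, total_squares, levels = pyramid_totals(1920, 1080)
print(f"{levels} levels, {total_pixels / 1e6:.2f} M pixels, "
      f"{total_squares / 1e6:.2f} M search squares")
```

Because each level has 0.7 x 0.7 = 0.49 times the pixels of the previous one, the pixel total is close to the geometric-series limit 2,073,600 / (1 - 0.49), about 4.07 million.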


HAAR CASCADE STAGES


Figure: Stage N evaluates Feature k, Feature l and Feature m, then asks "Face still possible?". Yes continues to Stage N+1 (Feature p, Feature q, Feature r); No leads to REJECT FRAME.
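The stage-by-stage decision above can be sketched as a loop that stops at the first failing stage. This is an illustration only: the toy feature functions and thresholds below are hypothetical stand-ins, not real Haar features or trained values.

```python
def run_cascade(stages, window):
    """Evaluate stages in order and reject as soon as one stage fails.

    Each stage is a (features, threshold) pair; every feature is a function
    that maps the window to a score. Returns (is_face, stages_run).
    """
    for stages_run, (features, threshold) in enumerate(stages, start=1):
        score = sum(feature(window) for feature in features)
        if score < threshold:
            return False, stages_run  # early out: later stages never run
    return True, len(stages)


# Hypothetical toy features: sums over halves of a flat list of pixel values.
def left_half(window):
    return sum(window[: len(window) // 2])


def right_half(window):
    return sum(window[len(window) // 2:])


toy_stages = [
    ([left_half], 4),
    ([left_half, right_half], 10),
]

print(run_cascade(toy_stages, [1, 1, 1, 1]))  # (False, 1)
print(run_cascade(toy_stages, [3, 3, 1, 1]))  # (False, 2)
print(run_cascade(toy_stages, [3, 3, 3, 3]))  # (True, 2)
```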


22 CASCADE STAGES, EARLY OUT BETWEEN EACH

Figure: STAGE 1, STAGE 2, ... STAGE 21, STAGE 22 in sequence. Any stage can exit to NO FACE; passing all 22 gives FACE CONFIRMED.
Final HD Calculations
Search squares = 3.8 million
Average features per square = 124
Calculations per feature = 100
Calculations per frame = 47 GCalcs

Calculation Rate
30 frames/sec = 1.4 TCalcs/second
60 frames/sec = 2.8 TCalcs/second
and this only gets front-facing faces
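The figures above are a straight product of the quantities listed. A short check in Python, using the slide's rounded inputs (variable and function names are illustrative):

```python
search_squares = 3.8e6       # from the scaled-frame estimate
features_per_square = 124    # average, as quoted on the slide
calcs_per_feature = 100

calcs_per_frame = search_squares * features_per_square * calcs_per_feature


def calc_rate(frames_per_second):
    """Calculations per second at a given frame rate."""
    return calcs_per_frame * frames_per_second


print(f"{calcs_per_frame / 1e9:.1f} GCalcs per frame")
for fps in (30, 60):
    print(f"{fps} frames/sec = {calc_rate(fps) / 1e12:.1f} TCalcs/second")
```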


UNBALANCING DUE TO EARLY EXITS

Figure legend: Live, Dead

When running on the GPU, we run each search rectangle on a separate work item

Early out algorithms, like HAAR, exhibit divergence between work items

Some work items exit early

Their neighbors continue

SIMD packing suffers as a result
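To make the packing cost concrete, the sketch below models a SIMD group as running until its slowest work item finishes, so lanes that exited early sit idle. The lane width of 4 and the exit depths are made-up illustration values, not measurements from the slides.

```python
def simd_utilization(exit_stages, simd_width=4):
    """Fraction of issued lane-stages that do useful work.

    exit_stages[i] is the number of cascade stages work item i runs before
    it exits. A group keeps issuing until its deepest work item is done.
    """
    useful = 0
    issued = 0
    for start in range(0, len(exit_stages), simd_width):
        group = exit_stages[start:start + simd_width]
        useful += sum(group)
        issued += max(group) * simd_width
    return useful / issued


uniform = [3, 3, 3, 3]
divergent = [1, 1, 22, 2, 1, 1, 1, 1]

print(f"{simd_utilization(uniform):.0%}")    # 100%
print(f"{simd_utilization(divergent):.0%}")  # 33%
```

In the divergent case one work item that runs all 22 stages keeps its three neighbors' lanes occupied but idle, which is the effect the slide describes.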


CASCADE DEPTH ANALYSIS


Figure: Cascade Depth, axis 0 to 25, legend bands 0-5, 5-10, 10-15, 15-20, 20-25.

PROCESSING TIME/STAGE
Trinity A10-4600M (6CU@497MHz, 4 cores@2700MHz)

Figure: processing time in ms (axis 0 to 100) per cascade stage for GPU and CPU; the final category groups stages 9-22.

AMD A10 4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL 1.1 (873.1)

PERFORMANCE CPU-VS-GPU
Trinity A10-4600M (6CU@497MHz, 4 cores@2700MHz)

Figure: Images/Sec (axis 0 to 12) versus Number of Cascade Stages on GPU (0 to 22), with series CPU, HSA and GPU.

AMD A10 4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL 1.1 (873.1)

HAAR SOLUTION: RUN DIFFERENT CASCADES ON GPU AND CPU
By seamlessly sharing data between CPU and GPU, HSA allows the right processor to handle its appropriate workload

+2.5x INCREASED PERFORMANCE
-2.5x DECREASED ENERGY PER FRAME
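One way to picture this split: the first few stages run over every search window (the wide, data-parallel part), and only the survivors go through the remaining stages. The sketch below is a hypothetical illustration in plain Python. Both phases run on the CPU here; the "gpu" and "cpu" labels only mark where such a scheme would place each group of stages, and the predicates are toy stand-ins for cascade stages.

```python
def split_cascade(windows, stages, gpu_stage_count):
    """Run the first stages on all windows, the rest only on survivors.

    stages is a list of predicates; a window is rejected by the first
    predicate that returns False (all() stops at that point).
    """
    gpu_stages = stages[:gpu_stage_count]
    cpu_stages = stages[gpu_stage_count:]
    # Wide phase: every window, few stages (candidate for the GPU).
    survivors = [w for w in windows if all(stage(w) for stage in gpu_stages)]
    # Narrow phase: few windows, many stages (candidate for the CPU).
    faces = [w for w in survivors if all(stage(w) for stage in cpu_stages)]
    return survivors, faces


# Toy stages over integer "windows" (hypothetical stand-ins).
toy_stages = [
    lambda w: w % 2 == 0,
    lambda w: w % 3 == 0,
    lambda w: w > 10,
]

survivors, faces = split_cascade(range(1, 21), toy_stages, gpu_stage_count=1)
print(len(survivors))  # 10
print(faces)           # [12, 18]
```

Because both phases read the same window data, a shared address space means the survivor list can be handed over without copying, which is the benefit the slide attributes to HSA.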


ACCELERATING SUFFIX ARRAY CONSTRUCTION
CLOUD SERVER WORKLOAD

SUFFIX ARRAYS

Suffix Arrays are a fundamental data structure

Designed for efficient searching of a large text

Quickly locate every occurrence of a substring S in a text T

Suffix Arrays are used to accelerate in-memory cloud workloads

Full text index search

Lossless data compression

Bio-informatics
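For readers new to the structure, here is a minimal sketch of a suffix array and the substring lookup it enables. It uses a naive sort of all suffixes for clarity, which is far slower than the skew algorithm this deck accelerates; the function names are illustrative.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return every position where pattern occurs, via two binary searches."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)                                 # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))  # [1, 3]
```

All suffixes that start with the pattern sit next to each other in the sorted order, so locating every occurrence costs two binary searches rather than a scan of the text.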


ACCELERATED SUFFIX ARRAY CONSTRUCTION ON HSA

By efficiently sharing data between CPU and GPU, HSA lets us move compute to data without the penalty of intermediate copies.

By offloading data parallel computations to the GPU, HSA increases performance and reduces energy for Suffix Array Construction versus single-threaded CPU.

Figure: Skew Algorithm for Compute SA. Stages: Radix Sort::GPU, Lexical Rank::CPU, Compute SA::CPU, Radix Sort::GPU, Merge Sort::GPU.

+5.8x INCREASED PERFORMANCE
-5x DECREASED ENERGY

M. Deo, Parallel Suffix Array Construction and Least Common Prefix for the GPU, Submitted to Principles and Practice of Parallel Programming (PPoPP '13), February 2013.
AMD A10 4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM

GAMEPLAY RIGID BODY PHYSICS

RIGID BODY PHYSICS SIMULATION

Rigid-Body Physics Simulation is:

A way to animate and interact with objects, widely used in games and movie production

Used to drive game play and for visual effects (eye candy)

Physics simulation is used in much of today's software:

Middleware Physics engines such as Bullet, Havok, PhysX

Games ranging from Angry Birds and Cut the Rope to Tomb Raider and Crysis 3

3D authoring tools such as Autodesk Maya, Unity 3D, Houdini, Cinema 4D, Lightwave

Industrial applications such as Siemens NX8 Mechatronics Concept Design

Medical applications such as surgery trainers

Robotics simulation

But GPU-accelerated rigid-body physics is not used in game play, only in effects

RIGID BODY PHYSICS ALGORITHM

Find potential interacting object pairs using bounding shape approximations.

Perform full overlap testing between potentially interacting pairs

Compute exact contact information for various shape types

Compute constraint forces for natural motion and stable stacking
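The first step, finding potentially interacting pairs from bounding shapes, can be sketched with axis-aligned bounding boxes (AABBs). This brute-force version tests every pair and is only meant to show the overlap test itself; production engines use structures such as sweep-and-prune instead. The box layout and names are assumptions for illustration.

```python
from itertools import combinations


def aabb_overlap(a, b):
    """True if two boxes, each ((min_x, min_y, min_z), (max_x, max_y, max_z)), overlap."""
    (a_min, a_max), (b_min, b_max) = a, b
    return all(a_min[k] <= b_max[k] and b_min[k] <= a_max[k] for k in range(3))


def broad_phase(boxes):
    """Return index pairs of boxes whose bounding volumes overlap."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(boxes), 2)
        if aabb_overlap(a, b)
    ]


boxes = [
    ((0, 0, 0), (2, 2, 2)),      # 0
    ((1, 1, 1), (3, 3, 3)),      # 1: overlaps 0 and 3
    ((5, 5, 5), (6, 6, 6)),      # 2: isolated
    ((2.5, 0, 0), (4, 2, 2)),    # 3: overlaps 1 only
]

print(broad_phase(boxes))  # [(0, 1), (1, 3)]
```

Only the pairs returned here would go on to the more expensive overlap and contact computations in the later steps.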


Figure: pipeline stages Broad-Phase Collision Detection, Mid-Phase Collision Detection, Narrow-Phase Collision Detection, Compute contact points, Setup constraints, Solve constraints (object labels A, B, C, D not reproduced).

RIGID BODY PHYSICS CHALLENGES & SOLUTIONS

Implementation Challenges

Game engine and Physics engine need to interact synchronously during simulation

Many operations require fast CPU round-trips and CPU modification of simulation state mid-pipeline

Traditional GPU solutions cannot guarantee frame-time response

The set of pairs can be huge and changes from frame to frame

Thousands to Millions for any given frame

Benefits of HSA

Fast CPU round-trips: User mode dispatch

Immediate access to geometry and modification of simulation state mid-pipeline: Unified Addressing, Pageable memory, Coherency

Supports as large a pair list as CPU: Entire memory space

GPU can resize pair list: Dynamic memory allocation

RIGID BODY PHYSICS CHALLENGES & SOLUTIONS

Implementation Challenges

Simulation is a pipeline of many different algorithms, some of which are more suitable for CPU while others are more suitable for GPU

Varying object sizes require more complex and difficult to parallelize broad-phase algorithms

Sweep-and-prune uses incremental sorting and traversal of lists

Narrow-phase algorithms cause thread divergence

Benefits of HSA

Avoidance of data copies and the overhead of maintaining two copies of simulation state: Unified Addressing, Pageable memory, Coherency

Move the compute not the data

More efficient serial aspects of broad-phase can run on the CPU

Improved handling of thread divergence: GPU Enqueue

EASE OF PROGRAMMING
CODE COMPLEXITY VS. PERFORMANCE

LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS
(Exemplary ISV Hessian Kernel)

Figure: stacked bars of lines of code (LOC, axis 0 to 350) broken down into Init, Compile, Copy, Launch, Algorithm and Copy-back, with a Performance series (axis 0 to 35.00), for Serial CPU, TBB, Intrinsics+TBB, OpenCL-C, OpenCL-C++, C++ AMP and HSA Bolt.

AMD A10-5800K APU with Radeon HD Graphics; CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM.
Software: Windows 7 Professional SP1 (64-bit OS); AMD OpenCL 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta

THE HSA FUTURE

Architected heterogeneous processing on the SOC

Programming of accelerators becomes much easier

Accelerated software that runs across multiple hardware vendors

Scalability from phones to super computers on a common architecture

GPU acceleration of parallel processing is the initial target, with DSPs and other accelerators coming to the HSA system architecture model

Heterogeneous software ecosystem evolves at a much faster pace

Lower power, more capable devices in your hand, on the wall, in the cloud


THANK YOU
