Phil Rogers Keynote-FINAL


THE PROGRAMMER'S GUIDE TO THE APU GALAXY

    Phil Rogers, Corporate Fellow

    AMD


    THE OPPORTUNITY WE ARE SEIZING

Make the unprecedented processing capability of the APU as accessible to programmers as the CPU is today.


    OUTLINE

The APU today and its programming environment

    The future of the heterogeneous platform

    AMD Fusion System Architecture

    Roadmap

    Software evolution

A visual view of the new command and data flow


    APU: ACCELERATED PROCESSING UNIT

The APU has arrived and it is a great advance over previous platforms

Combines scalar processing on the CPU with parallel processing on the GPU and high-bandwidth access to memory

    How do we make it even better going forward?

    Easier to program

    Easier to optimize

    Easier to load balance

    Higher performance

    Lower power


    LOW POWER E-SERIES AMD FUSION APU: ZACATE

    2 x86 Bobcat CPU cores

    Array of Radeon Cores

    Discrete-class DirectX 11 performance

    80 Stream Processors

    3rd Generation Unified Video Decoder

PCIe Gen2

    Single-channel DDR3 @ 1066

    18W TDP

    E-Series APU

    Performance:

    Up to 8.5GB/s System Memory Bandwidth

Up to 90 Gflops of Single Precision Compute
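(A quick check, not from the deck: a single 64-bit DDR3 channel at 1066 MT/s moves 1066 x 8 bytes, roughly 8.5 GB/s, which matches the bandwidth figure above.)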


    TABLET Z-SERIES AMD FUSION APU: DESNA

    2 x86 Bobcat CPU cores

    Array of Radeon Cores

    Discrete-class DirectX 11 performance

    80 Stream Processors

    3rd Generation Unified Video Decoder

PCIe Gen2

    Single-channel DDR3 @ 1066

    6W TDP w/ Local Hardware Thermal Control

    Z-Series APU

    Performance:

    Up to 8.5GB/s System Memory Bandwidth

    Suitable for sealed, passively cooled designs


MAINSTREAM A-SERIES AMD FUSION APU: LLANO

Up to four x86 CPU cores

AMD Turbo CORE frequency acceleration

Array of Radeon Cores

Discrete-class DirectX 11 performance

    3rd Generation Unified Video Decoder

    Blu-ray 3D stereoscopic display

PCIe Gen2

    Dual-channel DDR3

    45W TDP

    A-Series APU

    Performance:

    Up to 29GB/s System Memory Bandwidth

    Up to 500 Gflops of Single Precision Compute


    COMMITTED TO OPEN STANDARDS

    AMD drives open and de-facto standards

    Compete on the best implementation

Open standards are the basis for large ecosystems

    Open standards always win over time

SW developers want their applications to run on multiple platforms from multiple hardware vendors

    DirectX


    A NEW ERA OF PROCESSOR PERFORMANCE

[Chart: three eras of processor performance]

Single-Core Era: single-thread performance over time; we are here. Enabled by: Moore's Law, voltage scaling. Constrained by: power, complexity.

Multi-Core Era: throughput performance vs. number of processors; we are here. Enabled by: Moore's Law, SMP architecture. Constrained by: power, parallel SW, scalability.

Heterogeneous Systems Era: modern application performance over time; we are here. Enabled by: abundant data parallelism, power-efficient GPUs.

Programming milestones along the timeline: Assembly, C/C++, Java, pthreads, OpenMP/TBB, Shader, CU...


    EVOLUTION OF HETEROGENEOUS COMPUTING

[Chart: architecture maturity and programmer accessibility, from poor to excellent, across three eras]

Proprietary Drivers Era (2002 - 2008): graphics and proprietary driver-based APIs. Adventurous programmers. Exploit early programmable shader cores in the GPU. Make your program look like graphics to the GPU. CUDA, Brook+, etc.

Standards Drivers Era (2009 - 2011): OpenCL, DirectCompute driver-based APIs. Expert programmers. C and C++ subsets. Compute-centric APIs and data types. Multiple address spaces with explicit data movement. Specialized work queue based structures. Kernel mode dispatch.

Architected Era (2012 - ): AMD Fusion System Architecture, GPU as a peer processor. Mainstream programmers. Full C++. Unified coherent memory. Task parallel runtimes. Nested data parallel programs. User mode dispatch. Pre-emption and context switching. See Herb Sutter's talk tomorrow for plans for the...


    FSA FEATURE ROADMAP

Physical Integration: integrate CPU & GPU in silicon; Unified Memory Controller; common manufacturing technology.

Optimized Platforms: bi-directional power management between CPU and GPU; GPU Compute C++ support; user mode scheduling.

Architectural Integration: unified address space for CPU and GPU; fully coherent memory between CPU & GPU; GPU uses pageable system memory via CPU pointers.


FUSION SYSTEM ARCHITECTURE: AN OPEN PLATFORM

    Open Architecture, published specifications

    FSAIL virtual ISA

    FSA memory model

    FSA dispatch

    ISA agnostic for both CPU and GPU

    Inviting partners to join us, in all areas

    Hardware companies

    Operating Systems

    Tools and Middleware

    Applications

    FSA review committee planned


    FSA INTERMEDIATE LAYER - FSAIL

    FSAIL is a virtual ISA for parallel programs

Finalized to ISA by a JIT compiler or "Finalizer"

    Explicitly parallel

    Designed for data parallel programming

Support for exceptions, virtual functions, and other high-level language features

Syscall methods: GPU code can call directly to system services, IO, printf, etc.

    Debugging support


    FSA MEMORY MODEL

Designed to be compatible with C++0x, Java and .NET memory models

Relaxed consistency memory model for parallel compute performance

Loads and stores can be re-ordered by the finalizer

Visibility controlled by:

Load.Acquire, Store.Release, fences, barriers
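(Not from the deck: a minimal C++0x-style sketch of the acquire/release visibility rule the FSA memory model is designed to be compatible with, written with std::atomic rather than any FSA API.)

    #include <atomic>
    #include <thread>

    std::atomic<bool> ready(false);   // flag published with release semantics
    int payload = 0;                  // ordinary, non-atomic data

    void producer() {
        payload = 42;                                   // plain store
        ready.store(true, std::memory_order_release);   // "Store.Release": publishes payload
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire))  // "Load.Acquire": waits for the flag
            ;                                           // spin
        // The acquire load synchronizes with the release store,
        // so payload is guaranteed to read 42 here.
        int observed = payload;
        (void)observed;
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
        return 0;
    }

Without the acquire/release pair, the compiler or hardware would be free to reorder the two stores; the FSA finalizer is allowed the same freedom, which is why visibility is tied to these operations, fences and barriers.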


[Diagram: today's driver stack next to the FSA software stack, both running on the hardware (APUs, CPUs, GPUs)]

Driver Stack: Apps -> Domain Libraries -> OpenCL 1.x, DX Runtimes, User Mode Drivers -> Graphics Kernel Mode Driver.

FSA Software Stack: Apps -> FSA Domain Libraries, Task Queuing Libraries -> FSA JIT.

Legend distinguishes AMD user mode components and AMD kernel mode components from components contributed by others.



    OPENCL AND FSA

FSA is an optimized platform architecture for OpenCL

    Not an alternative to OpenCL

OpenCL on FSA will benefit from:

Avoidance of wasteful copies

    Low latency dispatch

    Improved memory model

    Pointers shared between CPU and GPU

FSA also exposes a lower-level programming interface, for those that want the ultimate in control and performance

Optimized libraries may choose the lower-level interface
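(Not from the deck: a small host-side C++ sketch of the copy avoidance already available in OpenCL today. CL_MEM_USE_HOST_PTR asks the runtime to use the application's allocation as the buffer's backing store; whether that is truly zero-copy is implementation dependent, so treat this as an illustration of the idea rather than a guarantee.)

    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    int main() {
        cl_platform_id platform;
        cl_device_id device;
        if (clGetPlatformIDs(1, &platform, nullptr) != CL_SUCCESS ||
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr) != CL_SUCCESS) {
            std::fprintf(stderr, "no OpenCL GPU device found\n");
            return 1;
        }
        cl_int err = CL_SUCCESS;
        cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);

        // Host data the GPU should read without an extra staging copy.
        std::vector<float> host(1 << 20, 1.0f);

        // CL_MEM_USE_HOST_PTR: the buffer's storage is the host allocation itself,
        // so on a shared-memory APU the runtime can avoid duplicating the data.
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                    host.size() * sizeof(float), host.data(), &err);
        std::printf("buffer %s\n", err == CL_SUCCESS ? "shares host memory" : "creation failed");

        clReleaseMemObject(buf);
        clReleaseContext(ctx);
        return 0;
    }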



    TASK QUEUING RUNTIMES

Popular pattern for task and data parallel programming on SMP systems today

    Characterized by:

    A work queue per core

Runtime library that divides large loops into tasks and distributes them to queues

A work stealing runtime that keeps the system balanced

FSA is designed to extend this pattern to run on heterogeneous systems
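(Not from the deck: a minimal sketch of the "work queue per core plus work stealing" pattern in plain C++ threads. The queue and runtime names are illustrative; production runtimes use lock-free deques and smarter victim selection.)

    #include <atomic>
    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct WorkerQueue {
        std::deque<std::function<void()>> tasks;
        std::mutex m;

        void push(std::function<void()> t) {
            std::lock_guard<std::mutex> lock(m);
            tasks.push_back(std::move(t));
        }
        bool pop(std::function<void()>& t) {          // owner takes its newest task
            std::lock_guard<std::mutex> lock(m);
            if (tasks.empty()) return false;
            t = std::move(tasks.back()); tasks.pop_back(); return true;
        }
        bool steal(std::function<void()>& t) {        // thief takes the oldest task
            std::lock_guard<std::mutex> lock(m);
            if (tasks.empty()) return false;
            t = std::move(tasks.front()); tasks.pop_front(); return true;
        }
    };

    int main() {
        const unsigned nworkers = 4;
        std::vector<WorkerQueue> queues(nworkers);
        std::atomic<int> done(0);
        const int ntasks = 64;

        // The runtime would divide a large loop into tasks; here each task is trivial.
        for (int i = 0; i < ntasks; i++)
            queues[i % nworkers].push([&done] { done.fetch_add(1); });

        std::vector<std::thread> workers;
        for (unsigned w = 0; w < nworkers; w++) {
            workers.emplace_back([&, w] {
                std::function<void()> task;
                while (done.load() < ntasks) {
                    if (queues[w].pop(task)) { task(); continue; }
                    for (unsigned v = 0; v < nworkers; v++)   // own queue empty: try to steal
                        if (v != w && queues[v].steal(task)) { task(); break; }
                }
            });
        }
        for (auto& t : workers) t.join();
        std::printf("completed %d tasks\n", done.load());
        return 0;
    }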



    TASK QUEUING RUNTIME ON CPUS

[Diagram: four x86 CPU cores, each running a CPU worker with its own queue (Q), coordinated by a work stealing runtime. Legend: CPU threads, GPU threads, memory.]



    TASK QUEUING RUNTIME ON THE FSA PLATFORM

[Diagram: the same work stealing runtime with four x86 CPU workers and their queues, now joined by a Radeon GPU fed through a GPU manager. Legend: CPU threads, GPU threads, memory.]



    TASK QUEUING RUNTIME ON THE FSA PLATFORM

[Diagram: the GPU's SIMD units now appear as additional workers; the GPU manager fetches and dispatches tasks from the queues alongside the four x86 CPU workers. Legend: CPU threads, GPU threads, memory.]



    FSA SOFTWARE EXAMPLE - REDUCTION

    float foo(float);
    float myArray[];

    Task task([myArray](IndexRange index) [[device]] {
        float sum = 0.;
        for (size_t i = index.begin(); i != index.end(); i++) {
            sum += foo(myArray[i]);
        }
        return sum;
    });

    float result = task.enqueueWithReduce(
        Partition(1920),
        [](float x, float y) [[device]] { return x + y; });
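(Reading the example: the task body computes a partial sum over the index range it is handed, the Partition divides the 1920-element domain into such ranges across the available workers, and enqueueWithReduce combines the partial sums using the second, [[device]]-annotated lambda.)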



    HETEROGENEOUS COMPUTE DISPATCH

How compute dispatch operates today in the driver model

How compute dispatch improves tomorrow under FSA



TODAY'S COMMAND AND DISPATCH FLOW

[Diagram: Application A submits work through Direct3D and the user mode driver into a command buffer; the kernel mode driver moves it through a soft queue into a DMA buffer, which feeds the single hardware queue. Command flow and data flow both traverse the full driver stack.]


[Diagram, continued: Applications A, B and C each go through their own Direct3D runtime, user mode driver, command buffer, kernel mode driver, soft queue and DMA buffer.]


[Diagram, continued: the three applications' command streams all converge on the single hardware queue, where packets from A, C and B are serialized one after another.]



    FUTURE COMMAND AND DISPATCH FLOW


[Diagram: under FSA, Applications A, B and C each own a hardware queue (optionally fronted by a dispatch buffer) and place their packets directly into it; the GPU hardware pulls from those queues.]

No APIs

No soft queues

No user mode drivers

No kernel mode transitions

No overhead

Application codes directly to the hardware

User mode queuing

Hardware scheduling

Low dispatch times
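(Not from the deck: a hypothetical sketch, in plain C++, of what user mode queuing means in practice. The application writes dispatch packets into a shared ring buffer and advances a write index; the hardware, simulated here by a consumer thread, polls the queue. Nothing in the submission path crosses into kernel mode. The packet layout and names are invented for illustration.)

    #include <array>
    #include <atomic>
    #include <cstdio>
    #include <thread>

    struct DispatchPacket {          // hypothetical packet layout
        unsigned kernel_id;
        unsigned grid_size;
    };

    struct UserModeQueue {
        static const size_t N = 16;                  // ring size
        std::array<DispatchPacket, N> ring;
        std::atomic<size_t> write_index{0};
        std::atomic<size_t> read_index{0};

        bool submit(const DispatchPacket& p) {       // producer: the application
            size_t w = write_index.load(std::memory_order_relaxed);
            if (w - read_index.load(std::memory_order_acquire) == N) return false; // full
            ring[w % N] = p;
            write_index.store(w + 1, std::memory_order_release);  // publish the packet
            return true;
        }
        bool consume(DispatchPacket& p) {            // consumer: the "hardware"
            size_t r = read_index.load(std::memory_order_relaxed);
            if (r == write_index.load(std::memory_order_acquire)) return false;    // empty
            p = ring[r % N];
            read_index.store(r + 1, std::memory_order_release);
            return true;
        }
    };

    int main() {
        UserModeQueue q;
        std::thread gpu([&] {                        // simulated packet processor
            DispatchPacket p;
            int seen = 0;
            while (seen < 4)
                if (q.consume(p)) { std::printf("dispatch kernel %u\n", p.kernel_id); seen++; }
        });
        for (unsigned i = 0; i < 4; i++)
            while (!q.submit({i, 256})) { /* queue full, retry */ }
        gpu.join();
        return 0;
    }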

FUTURE COMMAND AND DISPATCH: CPU AND GPU


[Diagram, repeated across three build slides: the application / runtime dispatches work items directly to CPU1, CPU2 and the GPU.]


    WHERE ARE WE TAKING YOU?


Switch the compute, don't move the data!

    Platform Design Goal

Every processor now has serial and parallel cores

All cores capable, with performance differences

Simple and efficient program model

Easy support of massive data...

Support for task-based programming models

Solutions for all platforms

Open to all

    THE FUTURE OF HETEROGENEOUS COMPUTING


    The architectural path for the future is clear

Programming patterns established on Symmetric Multi-Processor (SMP) systems migrate to the heterogeneous world

An open architecture, with published specifications and an open source execution software stack

Heterogeneous cores working together seamlessly in coherent memory

    Low latency dispatch

    No software fault lines
