8/6/2019 Phil Rogers Keynote-FINAL
1/34
THE PROGRAMMERS GUIDE TO THE APU GALAXY
Phil Rogers, Corporate Fellow
AMD
2 | The Programmers Guide to the APU Galaxy | June 2011
THE OPPORTUNITY WE ARE SEIZING
Make the unprecedented processing capability of the APU as accessible to programmers as the CPU is today.
OUTLINE
The APU today and its programming environment
The future of the heterogeneous platform
AMD Fusion System Architecture
Roadmap
Software evolution
A visual view of the new command and data flow
APU: ACCELERATED PROCESSING UNIT
The APU has arrived and it is a great advance over previous platforms
Combines scalar processing on the CPU with parallel processing on the GPU, and high bandwidth access to memory
How do we make it even better going forward?
Easier to program
Easier to optimize
Easier to load balance
Higher performance
Lower power
LOW POWER E-SERIES AMD FUSION APU: ZACATE
2 x86 Bobcat CPU cores
Array of Radeon Cores
Discrete-class DirectX 11 performance
80 Stream Processors
3rd Generation Unified Video Decoder
PCIe Gen2
Single-channel DDR3 @ 1066
18W TDP
E-Series APU
Performance:
Up to 8.5GB/s System Memory Bandwidth
Up to 90 Gflops of Single Precision Compute
TABLET Z-SERIES AMD FUSION APU: DESNA
2 x86 Bobcat CPU cores
Array of Radeon Cores
Discrete-class DirectX 11 performance
80 Stream Processors
3rd Generation Unified Video Decoder
PCIe Gen2
Single-channel DDR3 @ 1066
6W TDP w/ Local Hardware Thermal Control
Z-Series APU
Performance:
Up to 8.5GB/s System Memory Bandwidth
Suitable for sealed, passively cooled designs
MAINSTREAM A-SERIES AMD FUSION APU: LLANO
Up to four x86 CPU cores
AMD Turbo CORE frequency acceleration
Array of Radeon Cores
Discrete-class DirectX 11 performance
3rd Generation Unified Video Decoder
Blu-ray 3D stereoscopic display
PCIe Gen2
Dual-channel DDR3
45W TDP
A-Series APU
Performance:
Up to 29GB/s System Memory Bandwidth
Up to 500 Gflops of Single Precision Compute
COMMITTED TO OPEN STANDARDS
AMD drives open and de-facto standards
Compete on the best implementation
Open standards are the basis for large ecosystems
Open standards always win over time
SW developers want their applications to run on multiple platforms from multiple hardware vendors
8/6/2019 Phil Rogers Keynote-FINAL
9/34
9 | The Programmers Guide to the APU Galaxy | June 2011
A NEW ERA OF PROCESSOR PERFORMANCE
(Figure: three eras of processor performance.)
Single-Core Era: single-thread performance vs. time - we are here
 Enabled by: Moore's Law, voltage scaling
 Constrained by: power, complexity
Multi-Core Era: throughput performance vs. number of processors - we are here
 Enabled by: Moore's Law, SMP architecture
 Constrained by: power, parallel SW, scalability
Heterogeneous Systems Era: modern application performance vs. time (data-parallel exploitation) - we are here
 Enabled by: abundant data parallelism, power-efficient GPUs
Programming models across the eras: Assembly, C/C++, Java, pthreads, OpenMP/TBB, Shader, CUDA
EVOLUTION OF HETEROGENEOUS COMPUTING
(Figure: architecture maturity and programmer accessibility, from poor to excellent, across three eras.)
Proprietary Drivers Era (2002 - 2008): graphics & proprietary driver-based APIs
 Adventurous programmers
 Exploit early programmable shader cores in the GPU
 Make your program look like graphics to the GPU
 CUDA, Brook+, etc.
Standards Drivers Era (2009 - 2011): OpenCL, DirectCompute driver-based APIs
 Expert programmers
 C and C++ subsets
 Compute centric APIs, data types
 Multiple address spaces with explicit data movement
 Specialized work queue based structures
 Kernel mode dispatch
Architected Era (2012 - ): AMD Fusion System Architecture, GPU as peer processor
 Mainstream programmers
 Full C++
 GPU as a co-processor
 Unified coherent address space
 Task parallel runtimes
 Nested data parallel programs
 User mode dispatch
 Pre-emption and context switching
 See Herb Sutter's talk tomorrow for the plans in this area
FSA FEATURE ROADMAP
Physical Integration
 Integrate CPU & GPU in silicon
 Unified Memory Controller
 Common Manufacturing Technology
Optimized Platforms
 Bi-Directional Power Mgmt between CPU and GPU
 GPU Compute C++ support
 User mode scheduling
Architectural Integration
 Unified Address Space for CPU and GPU
 Fully coherent memory between CPU & GPU
 GPU uses pageable system memory via CPU pointers
FUSION SYSTEM ARCHITECTURE - AN OPEN PLATFORM
Open Architecture, published specifications
FSAIL virtual ISA
FSA memory model
FSA dispatch
ISA agnostic for both CPU and GPU
Inviting partners to join us, in all areas
Hardware companies
Operating Systems
Tools and Middleware
Applications
FSA review committee planned
FSA INTERMEDIATE LAYER - FSAIL
FSAIL is a virtual ISA for parallel programs
Finalized to ISA by a JIT compiler or Finalizer
Explicitly parallel
Designed for data parallel programming
Support for exceptions, virtual functions, and other high level language features
Syscall methods - GPU code can call directly to system services, IO, printf, etc.
Debugging support
FSA MEMORY MODEL
Designed to be compatible with C++0x, Java and .NET memory models
Relaxed consistency memory model for parallel compute performance
Loads and stores can be re-ordered by the finalizer
Visibility controlled by:
 Load.Acquire, Store.Release
 Fences
 Barriers
OPENCL AND FSA
(Diagram: two software stacks running side by side on the hardware - APUs, CPUs, GPUs. In the driver stack, apps call domain libraries, which sit on the OpenCL 1.x and DX runtimes and user mode drivers, backed by the graphics kernel mode driver. In the FSA software stack, apps call FSA domain libraries and task queuing libraries, backed by the FSA JIT. The legend distinguishes AMD user mode components and AMD kernel mode components from components contributed by others.)
OPENCL AND FSA
FSA is an optimized platform architecture for OpenCL
Not an alternative to OpenCL
OpenCL on FSA will benefit from:
 Avoidance of wasteful copies
 Low latency dispatch
 Improved memory model
 Pointers shared between CPU and GPU
FSA also exposes a lower level programming interface, for those that want the ultimate in control and performance
Optimized libraries may choose the lower level interface
TASK QUEUING RUNTIMES
Popular pattern for task and data parallel programming on SMP systems today
Characterized by:
 A work queue per core
 Runtime library that divides large loops into tasks and distributes to queues
 A work stealing runtime that keeps the system balanced
FSA is designed to extend this pattern to run on heterogeneous systems
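The pattern just described - a queue per worker plus stealing - can be sketched in a few lines of C++. This is a CPU-only illustration of the structure, not FSA code; WorkQueue and run_worker are hypothetical names, and a real runtime would use lock-free deques rather than a mutex:

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <vector>

// One work queue per worker. The owner pushes/pops at the back (LIFO,
// good cache locality); thieves steal from the front (FIFO, taking the
// oldest and typically largest tasks).
class WorkQueue {
    std::deque<std::function<void()>> tasks_;
    std::mutex m_;
public:
    void push(std::function<void()> t) {
        std::lock_guard<std::mutex> g(m_);
        tasks_.push_back(std::move(t));
    }
    std::optional<std::function<void()>> pop() {   // owner end
        std::lock_guard<std::mutex> g(m_);
        if (tasks_.empty()) return std::nullopt;
        auto t = std::move(tasks_.back());
        tasks_.pop_back();
        return t;
    }
    std::optional<std::function<void()>> steal() { // thief end
        std::lock_guard<std::mutex> g(m_);
        if (tasks_.empty()) return std::nullopt;
        auto t = std::move(tasks_.front());
        tasks_.pop_front();
        return t;
    }
};

// A worker drains its own queue, then tries to steal from the others,
// and stops only when no queue holds work.
void run_worker(size_t self, std::vector<WorkQueue>& queues) {
    for (;;) {
        auto t = queues[self].pop();
        if (!t) {
            for (size_t v = 0; v < queues.size() && !t; ++v)
                if (v != self) t = queues[v].steal();
        }
        if (!t) return; // nothing left anywhere
        (*t)();
    }
}
```

Taking opposite ends of the deque is what keeps owners on hot, recently pushed tasks while idle workers rebalance the system.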
TASK QUEUING RUNTIME ON CPUS
(Diagram: a work stealing runtime with one CPU worker and one work queue per x86 CPU core; all queues live in shared memory, with CPU threads only)
TASK QUEUING RUNTIME ON THE FSA PLATFORM
(Diagram: the same work stealing runtime with a GPU manager and its queue added alongside the four per-core CPU workers; a Radeon GPU now sits beside the x86 CPUs, sharing the queues in memory)
TASK QUEUING RUNTIME ON THE FSA PLATFORM
(Diagram: the GPU manager fetches work and dispatches it across the GPU's SIMD units, so tasks from the shared queues can run on the SIMDs as well as the x86 cores)
FSA SOFTWARE EXAMPLE - REDUCTION
float foo(float);
float myArray[];

Task task([myArray](IndexRange index) [[device]] {
    float sum = 0.;
    for (size_t i = index.begin(); i != index.end(); i++) {
        sum += foo(myArray[i]);
    }
    return sum;
});

float result = task.enqueueWithReduce(
    Partition(1920),
    [] (float x, float y) [[device]] { return x + y; });
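The snippet above targets the FSA task runtime and is illustrative rather than compilable today. As a runnable CPU analogue of the same shape - partition, per-partition sum, then combine - here is the reduce written with std::async, assuming foo(x) = x * x purely for demonstration:

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Stand-in for the slide's foo(); assumed to be x squared here.
float foo(float x) { return x * x; }

// Partitioned reduce: each chunk computes a partial sum concurrently
// (the "task" lambda), and the partials are combined with x + y,
// mirroring task.enqueueWithReduce(Partition(...), combine).
float parallel_reduce(const std::vector<float>& a, size_t partitions) {
    std::vector<std::future<float>> parts;
    size_t chunk = (a.size() + partitions - 1) / partitions;
    for (size_t begin = 0; begin < a.size(); begin += chunk) {
        size_t end = std::min(begin + chunk, a.size());
        parts.push_back(std::async(std::launch::async, [&a, begin, end] {
            float sum = 0.f;
            for (size_t i = begin; i != end; ++i) sum += foo(a[i]);
            return sum;
        }));
    }
    float result = 0.f;
    for (auto& p : parts) result += p.get(); // combine step: x + y
    return result;
}
```

On FSA the same partitions could be stolen by the GPU manager and run on SIMD units instead of CPU threads, with no change to the program's shape.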
HETEROGENEOUS COMPUTE DISPATCH
How compute dispatch operates today in the driver model
How compute dispatch improves tomorrow under FSA
TODAY'S COMMAND AND DISPATCH FLOW
(Diagram: Application A's command flow runs through the Direct3D user mode driver into a command buffer, then through a soft queue and the kernel mode driver into a DMA buffer, and finally into the hardware queue; data flow runs alongside the command flow)
TODAY'S COMMAND AND DISPATCH FLOW
(Diagram: the same path replicated for Applications A, B and C - each with its own Direct3D user mode driver, command buffer, soft queue, kernel mode driver and DMA buffer - while Application A still holds the single hardware queue)
TODAY'S COMMAND AND DISPATCH FLOW
(Diagram: work from Applications A, B and C serializes into the single hardware queue - A, C, B, A, B - after each application's commands pass through its own soft queue, user mode driver, command buffer, kernel mode driver and DMA buffer)
FUTURE COMMAND AND DISPATCH FLOW
(Diagram: Applications A, B and C each own a hardware queue and place work packets in it directly, with an optional dispatch buffer, feeding the GPU hardware)
Application codes to the hardware
User mode queuing
Hardware scheduling
Low dispatch times
No APIs
No soft queues
No user mode drivers
No kernel mode transitions
No overhead
FUTURE COMMAND AND DISPATCH: CPU AND GPU
(Diagram, three animation frames: the application / runtime dispatches work directly between CPU1, CPU2 and the GPU)
WHERE ARE WE TAKING YOU?
Platform Design Goals
Switch the compute, don't move the data!
Every processor now has serial and parallel cores
All cores capable, with performance differences
Simple and efficient program model
Easy support of massive data sets
Support for task based programming models
Solutions for all platforms
Open to all
THE FUTURE OF HETEROGENEOUS COMPUTING
The architectural path for the future is clear
Programming patterns established on Symmetric Multi-Processor (SMP) systems migrate to the heterogeneous world
An open architecture, with published specifications and an open source execution software stack
Heterogeneous cores working together seamlessly in coherent memory
Low latency dispatch
No software fault lines