Phil Rogers Keynote-FINAL


THE PROGRAMMER'S GUIDE TO THE APU GALAXY

    Phil Rogers, Corporate Fellow

    AMD


    THE OPPORTUNITY WE ARE SEIZING

Make the unprecedented processing capability of the APU as accessible to programmers as the CPU is today.


    OUTLINE

The APU today and its programming environment

    The future of the heterogeneous platform

    AMD Fusion System Architecture

    Roadmap

    Software evolution

A visual view of the new command and data flow


    APU: ACCELERATED PROCESSING UNIT

The APU has arrived and it is a great advance over previous platforms

Combines scalar processing on the CPU with parallel processing on the GPU and high-bandwidth access to memory

    How do we make it even better going forward?

    Easier to program

    Easier to optimize

    Easier to load balance

    Higher performance

    Lower power


    LOW POWER E-SERIES AMD FUSION APU: ZACATE

    2 x86 Bobcat CPU cores

    Array of Radeon Cores

    Discrete-class DirectX 11 performance

    80 Stream Processors

    3rd Generation Unified Video Decoder

PCIe Gen2

    Single-channel DDR3 @ 1066

    18W TDP

    E-Series APU

    Performance:

    Up to 8.5GB/s System Memory Bandwidth

Up to 90 Gflops of Single Precision Compute
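(A quick check, not from the deck: a single 64-bit DDR3 channel at 1066 MT/s moves 1066 x 8 bytes, roughly 8.5 GB/s, which matches the bandwidth figure above.)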


    TABLET Z-SERIES AMD FUSION APU: DESNA

    2 x86 Bobcat CPU cores

    Array of Radeon Cores

    Discrete-class DirectX 11 performance

    80 Stream Processors

    3rd Generation Unified Video Decoder

PCIe Gen2

    Single-channel DDR3 @ 1066

    6W TDP w/ Local Hardware Thermal Control

    Z-Series APU

    Performance:

    Up to 8.5GB/s System Memory Bandwidth

    Suitable for sealed, passively cooled designs


MAINSTREAM A-SERIES AMD FUSION APU: LLANO

Up to four x86 CPU cores

AMD Turbo CORE frequency acceleration

Array of Radeon Cores

Discrete-class DirectX 11 performance

    3rd Generation Unified Video Decoder

    Blu-ray 3D stereoscopic display

PCIe Gen2

    Dual-channel DDR3

    45W TDP

    A-Series APU

    Performance:

    Up to 29GB/s System Memory Bandwidth

    Up to 500 Gflops of Single Precision Compute


    COMMITTED TO OPEN STANDARDS

    AMD drives open and de-facto standards

    Compete on the best implementation

Open standards are the basis for large ecosystems

    Open standards always win over time

SW developers want their applications to run on multiple platforms from multiple hardware vendors

    DirectX


    A NEW ERA OF PROCESSOR PERFORMANCE

[Chart: three eras of processor performance]

Single-Core Era: single-thread performance over time; we are here. Enabled by: Moore's Law, voltage scaling. Constrained by: power, complexity.

Multi-Core Era: throughput performance vs. number of processors; we are here. Enabled by: Moore's Law, SMP architecture. Constrained by: power, parallel SW, scalability.

Heterogeneous Systems Era: modern application performance over time; we are here. Enabled by: abundant data parallelism, power-efficient GPUs.

Programming milestones along the timeline: Assembly, C/C++, Java, pthreads, OpenMP/TBB, Shader, CU...


    EVOLUTION OF HETEROGENEOUS COMPUTING

[Chart: architecture maturity and programmer accessibility, from poor to excellent, across three eras]

Proprietary Drivers Era (2002 - 2008): graphics and proprietary driver-based APIs. Adventurous programmers. Exploit early programmable shader cores in the GPU. Make your program look like graphics to the GPU. CUDA, Brook+, etc.

Standards Drivers Era (2009 - 2011): OpenCL, DirectCompute driver-based APIs. Expert programmers. C and C++ subsets. Compute-centric APIs and data types. Multiple address spaces with explicit data movement. Specialized work queue based structures. Kernel mode dispatch.

Architected Era (2012 - ): AMD Fusion System Architecture, GPU as a peer processor. Mainstream programmers. Full C++. Unified coherent memory. Task parallel runtimes. Nested data parallel programs. User mode dispatch. Pre-emption and context switching. See Herb Sutter's talk tomorrow for plans for the...


    FSA FEATURE ROADMAP

Physical Integration: integrate CPU & GPU in silicon; Unified Memory Controller; common manufacturing technology.

Optimized Platforms: bi-directional power management between CPU and GPU; GPU Compute C++ support; user mode scheduling.

Architectural Integration: unified address space for CPU and GPU; fully coherent memory between CPU & GPU; GPU uses pageable system memory via CPU pointers.


FUSION SYSTEM ARCHITECTURE: AN OPEN PLATFORM

    Open Architecture, published specifications

    FSAIL virtual ISA

    FSA memory model

    FSA dispatch

    ISA agnostic for both CPU and GPU

    Inviting partners to join us, in all areas

    Hardware companies

    Operating Systems

    Tools and Middleware

    Applications

    FSA review committee planned


    FSA INTERMEDIATE LAYER - FSAIL

    FSAIL is a virtual ISA for parallel programs

Finalized to ISA by a JIT compiler or "Finalizer"

    Explicitly parallel

    Designed for data parallel programming

Support for exceptions, virtual functions, and other high-level language features

Syscall methods: GPU code can call directly to system services, IO, printf, etc.

    Debugging support


    FSA MEMORY MODEL

Designed to be compatible with C++0x, Java and .NET memory models

Relaxed consistency memory model for parallel compute performance

Loads and stores can be re-ordered by the finalizer

Visibility controlled by:

Load.Acquire, Store.Release, fences, barriers
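(Not from the deck: a minimal C++0x-style sketch of the acquire/release visibility rule the FSA memory model is designed to be compatible with, written with std::atomic rather than any FSA API.)

    #include <atomic>
    #include <thread>

    std::atomic<bool> ready(false);   // flag published with release semantics
    int payload = 0;                  // ordinary, non-atomic data

    void producer() {
        payload = 42;                                   // plain store
        ready.store(true, std::memory_order_release);   // "Store.Release": publishes payload
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire))  // "Load.Acquire": waits for the flag
            ;                                           // spin
        // The acquire load synchronizes with the release store,
        // so payload is guaranteed to read 42 here.
        int observed = payload;
        (void)observed;
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
        return 0;
    }

Without the acquire/release pair, the compiler or hardware would be free to reorder the two stores; the FSA finalizer is allowed the same freedom, which is why visibility is tied to these operations, fences and barriers.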


[Diagram: today's driver stack next to the FSA software stack, both running on the hardware (APUs, CPUs, GPUs)]

Driver Stack: Apps -> Domain Libraries -> OpenCL 1.x, DX Runtimes, User Mode Drivers -> Graphics Kernel Mode Driver.

FSA Software Stack: Apps -> FSA Domain Libraries, Task Queuing Libraries -> FSA JIT.

Legend distinguishes AMD user mode components and AMD kernel mode components from components contributed by others.



    OPENCL AND FSA

FSA is an optimized platform architecture for OpenCL

    Not an alternative to OpenCL

OpenCL on FSA will benefit from:

Avoidance of wasteful copies

    Low latency dispatch

    Improved memory model

    Pointers shared between CPU and GPU

FSA also exposes a lower-level programming interface, for those that want the ultimate in control and performance

Optimized libraries may choose the lower-level interface
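(Not from the deck: a small host-side C++ sketch of the copy avoidance already available in OpenCL today. CL_MEM_USE_HOST_PTR asks the runtime to use the application's allocation as the buffer's backing store; whether that is truly zero-copy is implementation dependent, so treat this as an illustration of the idea rather than a guarantee.)

    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    int main() {
        cl_platform_id platform;
        cl_device_id device;
        if (clGetPlatformIDs(1, &platform, nullptr) != CL_SUCCESS ||
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr) != CL_SUCCESS) {
            std::fprintf(stderr, "no OpenCL GPU device found\n");
            return 1;
        }
        cl_int err = CL_SUCCESS;
        cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);

        // Host data the GPU should read without an extra staging copy.
        std::vector<float> host(1 << 20, 1.0f);

        // CL_MEM_USE_HOST_PTR: the buffer's storage is the host allocation itself,
        // so on a shared-memory APU the runtime can avoid duplicating the data.
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                    host.size() * sizeof(float), host.data(), &err);
        std::printf("buffer %s\n", err == CL_SUCCESS ? "shares host memory" : "creation failed");

        clReleaseMemObject(buf);
        clReleaseContext(ctx);
        return 0;
    }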



    TASK QUEUING RUNTIMES

Popular pattern for task and data parallel programming on SMP systems today

    Characterized by:

    A work queue per core

Runtime library that divides large loops into tasks and distributes them to queues

A work stealing runtime that keeps the system balanced

FSA is designed to extend this pattern to run on heterogeneous systems
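(Not from the deck: a minimal sketch of the "work queue per core plus work stealing" pattern in plain C++ threads. The queue and runtime names are illustrative; production runtimes use lock-free deques and smarter victim selection.)

    #include <atomic>
    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct WorkerQueue {
        std::deque<std::function<void()>> tasks;
        std::mutex m;

        void push(std::function<void()> t) {
            std::lock_guard<std::mutex> lock(m);
            tasks.push_back(std::move(t));
        }
        bool pop(std::function<void()>& t) {          // owner takes its newest task
            std::lock_guard<std::mutex> lock(m);
            if (tasks.empty()) return false;
            t = std::move(tasks.back()); tasks.pop_back(); return true;
        }
        bool steal(std::function<void()>& t) {        // thief takes the oldest task
            std::lock_guard<std::mutex> lock(m);
            if (tasks.empty()) return false;
            t = std::move(tasks.front()); tasks.pop_front(); return true;
        }
    };

    int main() {
        const unsigned nworkers = 4;
        std::vector<WorkerQueue> queues(nworkers);
        std::atomic<int> done(0);
        const int ntasks = 64;

        // The runtime would divide a large loop into tasks; here each task is trivial.
        for (int i = 0; i < ntasks; i++)
            queues[i % nworkers].push([&done] { done.fetch_add(1); });

        std::vector<std::thread> workers;
        for (unsigned w = 0; w < nworkers; w++) {
            workers.emplace_back([&, w] {
                std::function<void()> task;
                while (done.load() < ntasks) {
                    if (queues[w].pop(task)) { task(); continue; }
                    for (unsigned v = 0; v < nworkers; v++)   // own queue empty: try to steal
                        if (v != w && queues[v].steal(task)) { task(); break; }
                }
            });
        }
        for (auto& t : workers) t.join();
        std::printf("completed %d tasks\n", done.load());
        return 0;
    }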



    TASK QUEUING RUNTIME ON CPUS

[Diagram: four x86 CPU cores, each running a CPU worker with its own queue (Q), coordinated by a work stealing runtime. Legend: CPU threads, GPU threads, memory.]



    TASK QUEUING RUNTIME ON THE FSA PLATFORM

[Diagram: the same work stealing runtime with four x86 CPU workers and their queues, now joined by a Radeon GPU fed through a GPU manager. Legend: CPU threads, GPU threads, memory.]



    TASK QUEUING RUNTIME ON THE FSA PLATFORM

[Diagram: the GPU's SIMD units now appear as additional workers; the GPU manager fetches and dispatches tasks from the queues alongside the four x86 CPU workers. Legend: CPU threads, GPU threads, memory.]



    FSA SOFTWARE EXAMPLE - REDUCTION

    float foo(float);
    float myArray[];

    Task task([myArray](IndexRange index) [[device]] {
        float sum = 0.;
        for (size_t i = index.begin(); i != index.end(); i++) {
            sum += foo(myArray[i]);
        }
        return sum;
    });

    float result = task.enqueueWithReduce(
        Partition(1920),
        [](float x, float y) [[device]] { return x + y; });
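(Reading the example: the task body computes a partial sum over the index range it is handed, the Partition divides the 1920-element domain into such ranges across the available workers, and enqueueWithReduce combines the partial sums using the second, [[device]]-annotated lambda.)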



    HETEROGENEOUS COMPUTE DISPATCH

How compute dispatch operates today in the driver model

How compute dispatch improves tomorrow under FSA



TODAY'S COMMAND AND DISPATCH FLOW

[Diagram: Application A submits work through Direct3D and the user mode driver into a command buffer; the kernel mode driver moves it through a soft queue into a DMA buffer, which feeds the single hardware queue. Command flow and data flow both traverse the full driver stack.]


[Diagram, continued: Applications A, B and C each go through their own Direct3D runtime, user mode driver, command buffer, kernel mode driver, soft queue and DMA buffer.]


[Diagram, continued: the three applications' command streams all converge on the single hardware queue, where packets from A, C and B are serialized one after another.]



    FUTURE COMMAND AND DISPATCH FLOW


[Diagram: under FSA, Applications A, B and C each own a hardware queue (optionally fronted by a dispatch buffer) and place their packets directly into it; the GPU hardware pulls from those queues.]

No APIs

No soft queues

No user mode drivers

No kernel mode transitions

No overhead

Application codes directly to the hardware

User mode queuing

Hardware scheduling

Low dispatch times
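(Not from the deck: a hypothetical sketch, in plain C++, of what user mode queuing means in practice. The application writes dispatch packets into a shared ring buffer and advances a write index; the hardware, simulated here by a consumer thread, polls the queue. Nothing in the submission path crosses into kernel mode. The packet layout and names are invented for illustration.)

    #include <array>
    #include <atomic>
    #include <cstdio>
    #include <thread>

    struct DispatchPacket {          // hypothetical packet layout
        unsigned kernel_id;
        unsigned grid_size;
    };

    struct UserModeQueue {
        static const size_t N = 16;                  // ring size
        std::array<DispatchPacket, N> ring;
        std::atomic<size_t> write_index{0};
        std::atomic<size_t> read_index{0};

        bool submit(const DispatchPacket& p) {       // producer: the application
            size_t w = write_index.load(std::memory_order_relaxed);
            if (w - read_index.load(std::memory_order_acquire) == N) return false; // full
            ring[w % N] = p;
            write_index.store(w + 1, std::memory_order_release);  // publish the packet
            return true;
        }
        bool consume(DispatchPacket& p) {            // consumer: the "hardware"
            size_t r = read_index.load(std::memory_order_relaxed);
            if (r == write_index.load(std::memory_order_acquire)) return false;    // empty
            p = ring[r % N];
            read_index.store(r + 1, std::memory_order_release);
            return true;
        }
    };

    int main() {
        UserModeQueue q;
        std::thread gpu([&] {                        // simulated packet processor
            DispatchPacket p;
            int seen = 0;
            while (seen < 4)
                if (q.consume(p)) { std::printf("dispatch kernel %u\n", p.kernel_id); seen++; }
        });
        for (unsigned i = 0; i < 4; i++)
            while (!q.submit({i, 256})) { /* queue full, retry */ }
        gpu.join();
        return 0;
    }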

FUTURE COMMAND AND DISPATCH: CPU AND GPU


[Diagram, repeated across three build slides: the application / runtime dispatches work items directly to CPU1, CPU2 and the GPU.]


    WHERE ARE WE TAKING YOU?


Switch the compute, don't move the data!

    Platform Design Goal

Every processor now has serial and parallel cores

All cores capable, with performance differences

Simple and efficient program model

Easy support of massive data...

Support for task-based programming models

Solutions for all platforms

Open to all

    THE FUTURE OF HETEROGENEOUS COMPUTING


    The architectural path for the future is clear

Programming patterns established on Symmetric Multi-Processor (SMP) systems migrate to the heterogeneous world

An open architecture, with published specifications and an open source execution software stack

Heterogeneous cores working together seamlessly in coherent memory

    Low latency dispatch

    No software fault lines
