Luebke Cuda Fundamentals


Beyond Programmable Shading: In Action

CUDA Fundamentals

David Luebke, NVIDIA Research

The New Moore's Law

Computers no longer get faster, just wider
  Many people have not yet gotten this memo

You must re-think your algorithms to be parallel!
  Not just a good idea: the only way to gain performance
  Otherwise: if it's not fast enough now, it never will be

Data-parallel computing is the most scalable solution
  Otherwise: refactor code for 2 cores, 4 cores, 8 cores, 16 cores...
  You will always have more data than cores: build the computation around the data

Enter the GPU

    Massive economies of scale

    Massively parallel

Enter CUDA

CUDA is a scalable parallel programming model and a software environment for parallel computing
  Minimal extensions to the familiar C/C++ environment
  Heterogeneous serial-parallel programming model

NVIDIA's GPU architecture accelerates CUDA
  Exposes the computational horsepower of NVIDIA GPUs
  Enables general-purpose GPU computing

    (CUDA also maps well to multicore CPUs)

The Democratization of Parallel Computing

GPU Computing with CUDA brings parallel computing to the masses
  Over 85,000,000 CUDA-capable GPUs sold
  A developer kit costs ~$100 (for 500 GFLOPS!)

Data-parallel supercomputers are everywhere
  CUDA makes this power accessible
  We're already seeing innovations in data-parallel computing

Massively parallel computing has become a commodity technology

CUDA Programming Model

Some Design Goals

Scale to 100s of cores, 1000s of parallel threads

Let programmers focus on parallel algorithms
  not the mechanics of a parallel programming language

Enable heterogeneous systems (i.e., CPU + GPU)
  CPU is good at serial computation, the GPU at parallel

Key Parallel Abstractions in CUDA

0. Zillions of lightweight threads -> simple decomposition model
1. Hierarchy of concurrent threads -> simple execution model
2. Lightweight synchronization primitives -> simple synchronization model
3. Shared memory model for cooperating threads -> simple communication model

Hierarchy of concurrent threads

Parallel kernels are composed of many threads
  all threads execute the same sequential program

Threads are grouped into thread blocks
  threads in the same block can cooperate

Threads and blocks have unique IDs

[Figure: Thread t; Block b = threads t0, t1, ..., tB; Kernel foo() = many blocks]
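A hedged sketch of how those IDs combine into a unique per-thread index; the kernel name foo comes from the figure, the 1D layout and the data argument are assumptions (the vecAdd example a couple of slides later uses the same pattern):

```cuda
__global__ void foo(float* data, int n)
{
    // threadIdx.x: ID within the block, blockIdx.x: ID of the block,
    // blockDim.x: threads per block -> a unique global index per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = (float)i;   // every thread runs the same program on its own element
}
```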

Heterogeneous Programming

CUDA = serial program with parallel kernels, all in C
  Serial C code executes in a CPU thread
  Parallel kernel C code executes in thread blocks across multiple processing elements

[Figure: timeline alternating Serial Code with Parallel Kernel launches, e.g. foo<<< nBlk, nTid >>>(args); more Serial Code; then bar<<< nBlk, nTid >>>(args);]

Example: Vector Addition Kernel

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition

// Device Code
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

// Host Code
int main()
{
    // Run N/256 blocks of 256 threads each
    vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
}


Synchronization & Coordination

Threads within a block may synchronize with barriers
    ... Step 1 ...
    __syncthreads();
    ... Step 2 ...

Blocks coordinate via atomic memory operations
  e.g., increment a shared queue pointer with atomicInc()

Implicit barrier between dependent kernels
    vec_minus(a, b, c);
    vec_dot(c, c);
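A minimal sketch (not from the slides) of the barrier idea: two steps inside one block that communicate through on-chip shared memory, separated by __syncthreads(). The kernel and buffer names are illustrative only.

```cuda
__global__ void twoSteps(const float* in, float* out, int n)
{
    __shared__ float tile[256];                     // one element per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Step 1: every thread stages one element into shared memory
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;

    __syncthreads();   // barrier: Step 1 is complete for the whole block

    // Step 2: reading a neighbor's element is now safe
    float right = (threadIdx.x + 1 < blockDim.x) ? tile[threadIdx.x + 1] : 0.0f;
    if (i < n) out[i] = tile[threadIdx.x] + right;
}
```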

Threads & Thread Blocks

What is a thread?
  An independent thread of execution
  Has its own PC, variables (registers), processor state, etc.

What is a thread block?
  A virtualized multiprocessor: fit processors to data & computation, each kernel launch
  A (data-)parallel task: all blocks in a kernel have the same entry point, but may execute any code they want

Blocks Must Be Independent

Any possible interleaving of blocks should be valid
  presumed to run to completion without pre-emption
  can run in any order
  can run concurrently OR sequentially

Blocks may coordinate but not synchronize
  shared queue pointer: OK
  shared lock: BAD -- can easily deadlock

    Independence requirement gives scalability
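A hedged illustration of "coordinate but don't synchronize": blocks pull work items through an atomically incremented queue pointer, which is valid under any block ordering. All names are made up for the sketch; a global spinlock here would be the BAD case, since a block may spin on a lock held by a block the hardware has not scheduled yet.

```cuda
__device__ unsigned int nextItem = 0;   // shared queue pointer (OK to coordinate on)

__global__ void drainQueue(const float* work, float* results, unsigned int numItems)
{
    // Each block repeatedly claims the next unprocessed item.
    // This works under ANY interleaving of blocks: no block ever waits on another.
    while (true) {
        __shared__ unsigned int item;
        if (threadIdx.x == 0)
            item = atomicAdd(&nextItem, 1);   // claim one item for the whole block
        __syncthreads();
        if (item >= numItems) break;          // queue drained (uniform across the block)

        // (trivially) process work[item]; a real kernel would use all threads of the block
        if (threadIdx.x == 0)
            results[item] = work[item] * 2.0f;
        __syncthreads();                      // keep 'item' stable until everyone is done
    }
    // A shared lock (spinning on a global flag) would be BAD: a block could spin
    // forever on a lock held by a block that has not even been scheduled yet.
}
```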

Blocks Run on Multiprocessors

Kernel launched by host

[Figure: the host launches a kernel onto the device processor array -- many multiprocessors, each with a multithreaded instruction unit (MT IU), scalar processors (SP), and shared memory, all backed by device memory]

Levels of Parallelism

Thread parallelism
  each thread is an independent thread of execution

Data parallelism
  across threads in a block
  across blocks in a kernel

Task parallelism
  different blocks are independent
  independent kernels

Memory model

Thread: per-thread local memory
Block: per-block shared memory

Memory model

Sequential kernels (Kernel 0, Kernel 1, ...): per-device global memory

Memory model

Host memory <-> Device 0 memory / Device 1 memory via cudaMemcpy()
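A hedged host-side sketch of that picture: allocate on one device, copy host -> device and back with cudaMemcpy. Sizes and names are illustrative.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    float* h_data = (float*)malloc(bytes);          // host memory
    for (int i = 0; i < N; ++i) h_data[i] = 1.0f;

    float* d_data = NULL;                           // device 0 memory
    cudaMalloc((void**)&d_data, bytes);

    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // host -> device
    // ... launch kernels that read/write d_data ...
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // device -> host

    printf("first element after round trip: %f\n", h_data[0]);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```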

CUDA SDK

[Figure: integrated CPU + GPU C source code goes to the NVIDIA C compiler, producing NVIDIA assembly for computing (run on the GPU via the CUDA driver, with debugger and profiler) and CPU host code (built by a standard C compiler for the CPU); libraries include FFT, BLAS, and example source code]

Graphics Interoperability

Go beyond programmable shading

Access textures from within CUDA
  Free filtering, wrapping, 2D/3D addressing, spatial caching...

Share resources (vertex, pixel, texture, surface)
  Map resources into CUDA global memory, manipulate freely

Geometry: physics (collision, response, deformation, ...), lighting, geometry creation, non-traditional rasterization, acceleration structures, AI, ...

Imaging: image processing, (de)compression, procedural texture generation, postprocess effects, computer vision, ...

Summary

CUDA is a powerful parallel programming model
  Heterogeneous - mixed serial-parallel programming
  Scalable - hierarchical thread execution model
  Accessible - minimal but expressive changes to C
  Interoperable - simple graphics interop mechanisms

CUDA is an attractive platform
  Broad - OpenGL, DirectX, WinXP, Vista, Linux, MacOS
  Widespread - over 85M CUDA GPUs, 60K CUDA developers

CUDA provides tremendous scope for innovative graphics research beyond programmable shading

Questions?

David Luebke, [email protected]

CUDA Design Choices

Goal: Scalability

Scalable execution
  Program must be insensitive to the number of cores
  Write one program for any number of SM cores
  Program runs on any size GPU without recompiling

Hierarchical execution model
  Decompose problem into sequential steps (kernels)
  Decompose kernel into computing parallel blocks
  Decompose block into computing parallel threads

Hardware distributes independent blocks to SMs as available

Blocks Run on Multiprocessors

Kernel launched by host

[Figure repeated from earlier: device processor array of multiprocessors (MT IU + SPs + shared memory) over device memory]

Goal: easy to program

Strategies:
  Familiar programming language mechanics
    C/C++ with small extensions
  Simple parallel abstractions
    Simple barrier synchronization
    Shared memory semantics
    Hardware-managed hierarchy of threads

Hardware Multithreading

Hardware allocates resources to blocks
  blocks need: thread slots, registers, shared memory
  blocks don't run until resources are available

Hardware schedules threads
  threads have their own registers
  any thread not waiting for something can run
  context switching is (basically) free -- every cycle

Hardware relies on threads to hide latency
  i.e., parallelism is necessary for performance

Graphics Interop Details

OpenGL Interop Steps

Register a buffer object with CUDA
  cudaGLRegisterBufferObject(GLuint buffObj);
  OpenGL can use a registered buffer only as a source
  Unregister the buffer prior to rendering to it with OpenGL

Map the buffer object to CUDA memory
  cudaGLMapBufferObject(void **devPtr, GLuint buffObj);
  Returns an address in global memory
  Buffer must be registered prior to mapping

Launch a CUDA kernel to process the buffer

Unmap the buffer object prior to use by OpenGL
  cudaGLUnmapBufferObject(GLuint buffObj);

Unregister the buffer object
  cudaGLUnregisterBufferObject(GLuint buffObj);
  Optional: needed if the buffer is a render target

Use the buffer object in OpenGL code

Interop Scenario: Dynamic CUDA-Generated Texture

    Register the texture PBO with CUDA

    For each frame:

    Map the buffer

    Generate the texture in a CUDA kernel

    Unmap the buffer

    Update the texture

    Render the textured object

unsigned char *p_d = 0;
cudaGLMapBufferObject((void**)&p_d, pbo);
prepTexture(p_d, time);
cudaGLUnmapBufferObject(pbo);

glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
glBindTexture(GL_TEXTURE_2D, texID);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 256, 256,
                GL_BGRA, GL_UNSIGNED_BYTE, 0);
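prepTexture() is only named on the slide; one plausible shape for it (an assumption, not the original) is a kernel that writes a time-varying 256x256 BGRA image into the mapped PBO, plus a tiny host wrapper that launches it:

```cuda
// Hypothetical implementation of the slide's prepTexture() helper.
__global__ void prepTextureKernel(unsigned char* pixels, float t)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int idx = 4 * (y * 256 + x);               // BGRA, 4 bytes per texel
    unsigned char v = (unsigned char)(127.5f * (1.0f + __sinf(0.1f * (x + y) + t)));
    pixels[idx + 0] = v;                       // B
    pixels[idx + 1] = 255 - v;                 // G
    pixels[idx + 2] = v / 2;                   // R
    pixels[idx + 3] = 255;                     // A
}

void prepTexture(unsigned char* p_d, float time)
{
    dim3 block(16, 16);
    dim3 grid(256 / block.x, 256 / block.y);   // covers the 256x256 texture
    prepTextureKernel<<<grid, block>>>(p_d, time);
}
```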

Interop Scenario: Frame Post-Processing by CUDA

For each frame:
  Render to PBO with OpenGL
  Register the PBO with CUDA
  Map the buffer
  Process the buffer with a CUDA kernel
  Unmap the buffer
  Unregister the PBO from CUDA

unsigned char *p_d = 0;
cudaGLRegisterBufferObject(pbo);
cudaGLMapBufferObject((void**)&p_d, pbo);
postProcess(p_d);
cudaGLUnmapBufferObject(pbo);
cudaGLUnregisterBufferObject(pbo);
...

Texture Fetch in CUDA

Advantages of using texture:
  Texture fetches are cached
    optimized for 2D locality (swizzled)
  Textures are addressable in 2D
    using integer or normalized coordinates
    means fewer addressing calculations in code
  Get filtering for free
    bilinear interpolation, anisotropic filtering, etc.
  Free wrap modes (boundary conditions)
    clamp to edge
    repeat

Disadvantages:
  Textures are read-only
  May not be a big performance advantage if global reads are already coalesced

Texture Fetch in CUDA

Textures represent 1D, 2D or 3D arrays of memory

The CUDA texture type defines the type of the components and the dimension:
  texture<Type, Dim>

The texfetch function performs the texture fetch:
  Type texfetch(texture<Type, Dim> t,
                float x, float y = 0, float z = 0)

Channel Descriptors

A channel descriptor is used in conjunction with a texture and describes how reads from global memory are interpreted

Predefined channel descriptors exist for built-in types, e.g.:
  cudaChannelFormatDesc cudaCreateChannelDesc<float>(void);
  cudaChannelFormatDesc cudaCreateChannelDesc<float4>(void);
  cudaChannelFormatDesc cudaCreateChannelDesc<unsigned char>(void);

Arrays

Arrays are the host representation of n-dimensional textures

To allocate an array:
  cudaError_t
  cudaMalloc( cudaArray_t **arrayPtr,
              struct cudaChannelFormatDesc &desc,
              size_t width,
              size_t height = 1);

To free an array:
  cudaError_t
  cudaFree( struct cudaArray *array);
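The slide quotes an early form of these calls; in the shipping runtime API the equivalent entry points are cudaMallocArray()/cudaFreeArray(). A hedged sketch with an assumed float channel and size:

```cuda
#include <cuda_runtime.h>

int main()
{
    // Channel descriptor: one 32-bit float component per texel
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

    cudaArray* cu_array = NULL;
    cudaMallocArray(&cu_array, &desc, /*width=*/512, /*height=*/512);   // allocate

    // ... copy data in, bind a texture, run kernels ...

    cudaFreeArray(cu_array);                                            // free
    return 0;
}
```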

Loading Memory from Arrays

Load memory to arrays using a version of cudaMemcpy:

  cudaError_t cudaMemcpy(
      void *dst,
      const struct cudaArray *src,
      size_t count,
      enum cudaMemcpyKind kind
  )

Binding Textures

Textures are bound to arrays using:

  template<class Type, int dim>
  cudaError_t cudaBindTexture(
      const struct texture<Type, dim> &tex,
      const struct cudaArray *array
  )

To unbind:

  template<class Type, int dim>
  cudaError_t cudaUnbindTexture(
      const struct texture<Type, dim> &tex
  )

Example

cudaArray* cu_array;
texture<float, 2> tex;

// Allocate array
cudaMalloc( &cu_array, cudaCreateChannelDesc<float>(), width, height );

// Copy image data to array
cudaMemcpy( cu_array, image, width*height*sizeof(float), cudaMemcpyHostToDevice );

// Bind the array to the texture
cudaBindTexture( tex, cu_array );

// Run kernel
dim3 blockDim(16, 16, 1);
dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
kernel<<< gridDim, blockDim >>>(d_odata, width, height);

cudaUnbindTexture(tex);

Example (cont.)

__global__ void
kernel(float* odata, int height, int width)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    float c = texfetch(tex, x, y);
    odata[y*width+x] = c;
}

CUDA Programming

GPU Computing with CUDA

CUDA = Compute Unified Device Architecture

Co-designed hardware & software for direct GPU computing

Hardware: fully general data-parallel architecture
  General thread launch
  Global load-store
  Parallel data cache
  Scalar architecture
  Integers, bit operations
  Double precision (shortly)
  Scalable data-parallel execution/memory model

Software: program the GPU in C
  C with minimal yet powerful extensions

CUDA SDK

[Figure repeated from earlier: the CUDA SDK toolchain -- NVIDIA C compiler producing GPU assembly and CPU host code, libraries (FFT, BLAS), driver, debugger, profiler]

CUDA: Features available to kernels

Standard mathematical functions
  sinf, powf, atanf, ceil, etc.

Built-in vector types
  float4, int4, uint4, etc. for dimensions 1..4

Texture accesses in kernels
  texture<float4, 2> my_texture;   // declare texture reference
  float4 texel = texfetch(my_texture, u, v);

G8x CUDA = C with Extensions

Philosophy: provide the minimal set of extensions necessary to expose power

Function qualifiers:
  __global__ void MyKernel() { }
  __device__ float MyDeviceFunc() { }

Variable qualifiers:
  __constant__ float MyConstantArray[32];
  __shared__   float MySharedArray[32];

Execution configuration:
  dim3 dimGrid(100, 50);            // 5000 thread blocks
  dim3 dimBlock(4, 8, 8);           // 256 threads per block
  MyKernel <<< dimGrid, dimBlock >>> (...);   // Launch kernel

Built-in variables and functions valid in device code:
  dim3 gridDim;    // Grid dimension
  dim3 blockDim;   // Block dimension
  dim3 blockIdx;   // Block index
  dim3 threadIdx;  // Thread index
  void __syncthreads();   // Thread synchronization
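A compact, hedged end-to-end sketch exercising the extensions listed above; all names (MyKernel, bias, tile) are invented for illustration:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__constant__ float bias[32];                       // variable qualifier: constant memory

__device__ float square(float x) { return x * x; } // function qualifier: device function

__global__ void MyKernel(float* out)               // function qualifier: kernel
{
    __shared__ float tile[256];                    // variable qualifier: shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x; // built-in variables
    tile[threadIdx.x] = square((float)i) + bias[threadIdx.x % 32];
    __syncthreads();                               // built-in barrier
    out[i] = tile[threadIdx.x];
}

int main()
{
    const int nBlocks = 8, nThreads = 256, N = nBlocks * nThreads;
    float h_bias[32] = {0};
    cudaMemcpyToSymbol(bias, h_bias, sizeof(h_bias));

    float* d_out;
    cudaMalloc((void**)&d_out, N * sizeof(float));

    dim3 dimGrid(nBlocks);                         // execution configuration
    dim3 dimBlock(nThreads);
    MyKernel<<< dimGrid, dimBlock >>>(d_out);      // launch
    cudaDeviceSynchronize();

    float h0;
    cudaMemcpy(&h0, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", h0);
    cudaFree(d_out);
    return 0;
}
```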

CUDA: Runtime support

Explicit memory allocation returns pointers to GPU memory
  cudaMalloc(), cudaFree()

Explicit memory copy for host <-> device, device <-> device
  cudaMemcpy(), cudaMemcpy2D(), ...

Texture management
  cudaBindTexture(), cudaBindTextureToArray(), ...

OpenGL & DirectX interoperability
  cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), ...

Example: Vector Addition Kernel

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

Example: Invoking the Kernel

__global__ void vecAdd(float* A, float* B, float* C);

void main()
{
    // Execute on N/256 blocks of 256 threads each
    vecAdd <<< N/256, 256 >>> (d_A, d_B, d_C);
}

Example: Host code for memory

// allocate host (CPU) memory
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
... initialize h_A and h_B ...

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
vecAdd <<< N/256, 256 >>> (d_A, d_B, d_C);
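The slide stops after the launch; a hedged continuation (not on the slide) would copy the result back and release memory:

```cuda
// copy result back to host
float* h_C = (float*) malloc(N * sizeof(float));
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

// free device and host memory
cudaFree(d_A);  cudaFree(d_B);  cudaFree(d_C);
free(h_A);  free(h_B);  free(h_C);
```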

A quick review

device = GPU = set of multiprocessors
Multiprocessor = set of processors & shared memory
Kernel = GPU program
Grid = array of thread blocks that execute a kernel
Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

Memory     Location   Cached           Access       Who
Local      Off-chip   No               Read/write   One thread
Shared     On-chip    N/A - resident   Read/write   All threads in a block
Global     Off-chip   No               Read/write   All threads + host
Constant   Off-chip   Yes              Read         All threads + host
Texture    Off-chip   Yes              Read         All threads + host
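To make the table concrete, a hedged kernel sketch (names invented) touching most of the spaces: a __constant__ coefficient, per-block __shared__ staging, a per-thread local that lives in a register, and a global output array.

```cuda
__constant__ float scale;                          // constant: cached, read-only in kernels

__global__ void touchSpaces(const float* gIn, float* gOut, int n)
{
    __shared__ float s[256];                       // shared: on-chip, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float local = (i < n) ? gIn[i] : 0.0f;         // local scalar held in a register
    s[threadIdx.x] = local;
    __syncthreads();
    if (i < n) gOut[i] = s[threadIdx.x] * scale;   // global: read/write, visible to host
}
```

On the host, scale would be set with cudaMemcpyToSymbol(scale, &value, sizeof(float)); per-thread values only spill to off-chip local memory when the registers run out.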

Myths of GPU Computing

GPUs layer normal programs on top of graphics
  NO: CUDA compiles directly to the hardware

GPU architectures are:
  Very wide (1000s-wide) SIMD machines...
    NO: warps are 32-wide
  ...on which branching is impossible or prohibitive...
    NOPE
  ...with 4-wide vector registers.
    NO: scalar thread processors

GPUs are power-inefficient
  NO: 4-10x perf/W advantage, up to 89x reported for some studies

GPUs don't do real floating point
  (answered by the GPU floating point features that follow)

Double Precision Floating Point

GPU Floating Point Features

  • 8/7/2019 Luebke Cuda Fundamentals

    63/85

    Beyond Programmable Shading: In Action

                                G80                       SSE                          IBM Altivec              Cell SPE
Precision                       IEEE 754                  IEEE 754                     IEEE 754                 IEEE 754
Rounding modes for              Round to nearest and      All 4 IEEE: round to         Round to nearest only    Round to zero/
FADD and FMUL                   round to zero             nearest, zero, inf, -inf                              truncate only
Denormal handling               Flush to zero             Supported, 1000s of cycles   Supported, 1000s of      Flush to zero
                                                                                       cycles
NaN support                     Yes                       Yes                          Yes                      No
Overflow and Infinity support   Yes, only clamps to       Yes                          Yes                      No, infinity
                                max norm
Flags                           No                        Yes                          Yes                      Some
Square root                     Software only             Hardware                     Software only            Software only
Division                        Software only             Hardware                     Software only            Software only
Reciprocal estimate accuracy    24 bit                    12 bit                       12 bit                   12 bit
Reciprocal sqrt estimate        23 bit                    12 bit                       12 bit                   12 bit
accuracy
log2(x) and 2^x estimates       23 bit                    No                           12 bit                   No
accuracy

Do GPUs Do Real IEEE FP?

G8x GPU FP is IEEE 754
  Comparable to other processors / accelerators
  More precise / usable in some ways
  Less precise in other ways

GPU FP is getting better every generation
  Double precision support in GTX 20
  Goal: best of class by 2009

Applications & Sweet Spots

GPU Computing: Motivation

[Figure: reported application speedups: 110-240x, 100x, 45x, 35x, 17x, 13,457x]

GPUs Are Fast & Getting Faster

[Chart: peak GFLOP/s, Sep-02 through Feb-08: NVIDIA GPU vs. Intel CPU]

GPU Computing Sweet Spots


Applications:

  High arithmetic intensity:
    dense linear algebra, PDEs, n-body, finite difference, ...

  High bandwidth:
    sequencing (virus scanning, genomics), sorting, databases

  Visual computing:
    graphics, image processing, tomography, machine vision

GPU Computing Example Markets

Computational Modeling
Computational Chemistry
Computational Medicine
Computational Science
Computational Biology
Computational Finance
Computational Geoscience
Image Processing

GPU Computing Sweet Spots

The personal supercomputing phase change
  From cluster to workstation
  From lab to clinic
  From machine room to engineer and grad-student desks
  From batch processing to interactive
  From interactive to real-time
  GPU-enabled clusters

A 100x or better speedup changes the science
  Solve at different scales
  Direct brute-force methods may outperform cleverness
  New bottlenecks may emerge
  Approaches once inconceivable may become practical

Applications - Condensed

    3D image analysis

    Adaptive radiation therapy

    Acoustics

    Astronomy

    Audio

Automobile vision

Bioinformatics

    Broadcast

    Cellular automata

    Computational Fluid Dynamics

    Computer Vision

    Cryptography

    CT reconstruction

    Data Mining

    Digital cinema/projections

    Electromagnetic simulation

Equity trading

    Film

    Financial - lots of areas

    Languages

    GIS

    Holographics cinema

Imaging (lots)

Mathematics research

    Military (lots)

    Mine planning

    Molecular dynamics

    MRI reconstruction

    Multispectral imaging

    nbody

    Network processing

    Neural network

    Oceanographic research

    Optical inspection

    Particle physics

    Protein folding

    Quantum chemistry

    Ray tracing

    Radar

    Reservoir simulation

    Robotic vision/AI

    Robotic surgery

    Satellite data analysis

    Seismic imaging

    Surgery simulation

    Surveillance

    Ultrasound

    Video conferencing

    Telescope

    Video

    Visualization

    Wireless

    X-ray

Tesla HPC Platform

HPC-oriented product line
  C870: board (1 GPU)
  D870: deskside unit (2 GPUs)
  S870: 1U server unit (4 GPUs)

NVIDIA Decoder Ring

Tesla(TM): High Performance Computing
Quadro: Design & Creation
GeForce: Entertainment

Architecture: TESLA
Chips: G80, G84, G92, ...

New Applications

    Real-time options implied volatility engine

    Swaption volatility cube calculator

    Manifold 8 GIS

    Ultrasound imaging

    HOOMD Molecular Dynamics

    Also

    Image rotation/classification

    Graphics processing toolbox

    Microarray data analysis

    Data parallel primitives

    Astrophysics simulations

SDK: Mandelbrot, computer vision

    Seismic migration

GPU Computing: Biomedical Imaging

Representative Publications

Computed Tomography:

K. Mueller, F. Xu, and N. Neophytou. Why do commodity graphics hardware boards (GPUs) work so well for acceleration of computed tomography? SPIE Electronic Imaging 2007, Computational Imaging V Keynote, 2007.

G. C. Sharp, N. Kandasamy, H. Singh, and M. Folkert. GPU-based streaming architectures for fast cone-beam CT image reconstruction and Demons deformable registration. Physics in Medicine and Biology, 2007.

H. Scherl, B. Keck, M. Kowarschik, and J. Hornegger. Fast GPU-based CT reconstruction using the Common Unified Device Architecture (CUDA). 2007 Nuclear Science Symposium and Medical Imaging Conference, Honolulu, HI.

X. Xue, A. Cheryauka, and D. Tubbs. Acceleration of fluoro-CT reconstruction for a mobile C-arm on GPU and FPGA hardware: a simulation study. SPIE Medical Imaging 2006.

K. Chidlow and T. Moller. Rapid emission tomography reconstruction. Int'l Workshop on Volume Graphics, 2003.

K. Mueller and R. Yagel. Rapid 3-D cone-beam reconstruction with the simultaneous algebraic reconstruction technique (SART) using 2-D texture mapping hardware. IEEE Transactions on Medical Imaging, 19(12):1227-1237, 2000.

Magnetic Resonance Imaging:

S. Stone, J. Haldar, S. Tsao, W. Hwu, Z. Liang, and B. Sutton. Accelerating advanced MRI reconstructions on GPUs. ACM Frontiers in Computing, Ischia, Italy, 2008.

W. Jeong, P. T. Fletcher, R. Tao, and R. T. Whitaker. Interactive visualization of volumetric white matter connectivity in DT-MRI using a parallel-hardware Hamilton-Jacobi solver. Proc. IEEE Visualization 2007, 2007.

T. Schiwietz, T. Chang, P. Speier, and R. Westermann. MR image reconstruction using the GPU. SPIE Medical Imaging 2006.

T. Sorensen, T. Schaeter, K. Noe, and M. Hansen. Accelerating the non-equispaced fast Fourier transform on commodity graphics hardware. IEEE Transactions on Medical Imaging.

K. Mueller and R. Yagel. Rapid 3-D cone-beam reconstruction with the simultaneous algebraic reconstruction technique (SART) using 2-D texture mapping hardware. IEEE Transactions on Medical Imaging, 19(12):1227-1237, 2000.

Registration:

C. Vetter, C. Guetter, C. Xu, and R. Westermann. Non-rigid multi-modal registration on the GPU. Medical Imaging 2007: Image Processing, SPIE, vol. 6512, Mar 2007.

The Future of GPUs

GPU Computing drives new applications
  Reducing time to discovery
  100x speedup changes science and research methods

New applications drive the future of GPUs and GPU Computing
  Drives new GPU capabilities
  Drives hunger for more performance

Some exciting new domains:
  Vision, acoustic, and embedded applications
  Large-scale simulation & physics

Parallel Computing

Parallel Computing's Golden Age


1980s and early '90s: a golden age for parallel computing
  Particularly data-parallel computing

Architectures
  Connection Machine, MasPar, Cray
  True supercomputers: incredibly exotic, powerful, expensive

Algorithms, languages, & programming models
  Solved a wide variety of problems
  Various parallel algorithmic models developed: P-RAM, V-RAM, circuit, hypercube, etc.

Parallel Computing's Dark Age

But... the impact of data-parallel computing was limited
  Thinking Machines sold 7 CM-1s (100s of systems total)
  MasPar sold ~200 systems

Commercial and research activity subsided
  Massively parallel machines were replaced by clusters of ever-more-powerful commodity microprocessors
  Beowulf, Legion, grid computing, ...

Massively parallel computing lost momentum to the inexorable advance of commodity technology

GPU Design

CPU/GPU Parallelism


Moore's Law gives you more and more transistors
  What do you want to do with them?

CPU strategy: make the workload (one compute thread) run as fast as possible
  Tactics:
    Cache (area limiting)
    Instruction/data prefetch
    Speculative execution
  ...limited by "perimeter" - communication bandwidth
  ...then add task parallelism - multi-core

GPU strategy: make the workload (as many threads as possible) run as fast as possible
  Tactics:
    Parallelism (1000s of threads)
    Pipelining
  ...limited by "area" - compute capability

Hardware Implementation: A Set of SIMD Multiprocessors

The device is a set of multiprocessors

Each multiprocessor is a set of 32-bit processors with a Single Instruction Multiple Data architecture

At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp
  The number of threads in a warp is the warp size

[Figure: Device = Multiprocessor 1..N; each multiprocessor = Instruction Unit + Processor 1..M]
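A hedged sketch of what the warp grouping means inside a kernel (warpSize is a built-in variable; the kernel and array names are invented):

```cuda
__global__ void whoAmI(int* warpOf, int* laneOf)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % warpSize;   // position within the warp (0..31 on this hardware)
    int warp = threadIdx.x / warpSize;   // which warp of the block this thread belongs to
    warpOf[tid] = warp;                  // threads sharing 'warp' execute in lockstep:
    laneOf[tid] = lane;                  // the SM issues one instruction per warp per cycle
}
```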

Streaming Multiprocessor (SM)

Processing elements
  8 scalar thread processors (SP)
  32 GFLOPS peak at 1.35 GHz
  8192 32-bit registers (32 KB)
    1/2 MB total register file space!
  usual ops: float, int, branch, ...

Hardware multithreading
  up to 8 blocks resident at once
  up to 768 active threads in total

16 KB on-chip memory
  low-latency storage
  shared amongst the threads of a block
  supports thread communication

[Figure: SM = MT issue unit (MT IU) + SPs + shared memory, running threads t0, t1, ..., tB]
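A hedged back-of-the-envelope for these G80-era limits (host-side C), showing how block size and register use trade off against the 768-thread / 8-block / 8192-register budget; the per-thread register count is an assumption (in practice it is reported by the compiler):

```cuda
#include <stdio.h>

int main(void)
{
    // Per-SM budgets quoted on this slide (G80-era numbers)
    const int maxThreads = 768, maxBlocks = 8, regFile = 8192;

    const int threadsPerBlock = 256;   // assumed launch configuration
    const int regsPerThread   = 10;    // assumed register usage per thread

    int byThreads = maxThreads / threadsPerBlock;                  // 3
    int byRegs    = regFile / (regsPerThread * threadsPerBlock);   // 3
    int resident  = byThreads;
    if (byRegs    < resident) resident = byRegs;
    if (maxBlocks < resident) resident = maxBlocks;

    printf("resident blocks per SM: %d (%d active threads)\n",
           resident, resident * threadsPerBlock);
    return 0;
}
```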

Hardware Implementation: Memory Architecture

The device has local device memory
  Can be read and written by the host and by the multiprocessors

Each multiprocessor has:
  A set of 32-bit registers per processor
  On-chip shared memory
  A read-only constant cache
  A read-only texture cache

[Figure: Device = Multiprocessor 1..N over device memory; each multiprocessor has per-processor registers, shared memory, an instruction unit, a constant cache, and a texture cache]

Hardware Implementation: Memory Model

Each thread can:
  Read/write per-block on-chip shared memory
  Read per-grid cached constant memory
  Read/write non-cached device memory:
    Per-grid global memory
    Per-thread local memory
  Read cached texture memory

[Figure: a Grid contains Blocks; each Block has shared memory and Threads; each Thread has registers and per-thread local memory; all threads access per-grid global, constant, and texture memory]