01_TeslaCUDAIntro

  • 7/29/2019 01_TeslaCUDAIntro

    1/33

    Tesla GPU Computing

    A Revolution in High Performance Computing

    Monash University GPU Computing Workshop

    May 28, 2009

  • 7/29/2019 01_TeslaCUDAIntro

    2/33

    4 cores

    What is GPU Computing?

    Computing with CPU + GPU

    Heterogeneous Computing

  • 7/29/2019 01_TeslaCUDAIntro

    3/33

    GPUs: Turning Point in Supercomputing

    Te

    Su CalcUA $5 Million

    Source: University of Antwerp, Belgium

    Digital Tomography Reconstruction Time

    256 AMD dual-core Opterons: 67.4 secs

    4 Tesla C1060 GPUs: 59.9 secs

    Desktop beats Cluster

  • 7/29/2019 01_TeslaCUDAIntro

    4/33

    146X: Medical Imaging, U of Utah

    36X: Molecular Dynamics, U of Illinois, Urbana

    18X: Video Transcoding, Elemental Tech

    50X: Matlab Computing, AccelerEyes

    149X: Financial simulation, Oxford

    47X: Linear Algebra, Universidad Jaime

    20X: 3D Ultrasound, Techniscan

    130X: Quantum Chemistry, U of Illinois, Urbana

    G

    Not 2x or 3x: Speedups are 20x to 150x

  • 7/29/2019 01_TeslaCUDAIntro

    5/33

    NVIDIA Tesla 10-Series GPU

    Massively parallel, many core a

  • 7/29/2019 01_TeslaCUDAIntro

    6/33

  • 7/29/2019 01_TeslaCUDAIntro

    7/33

    Tesla™

    High-Performance Computing

    QuaDesign &

    GeForce

    Entertainment

    Parallel Computing on All GPUs

    100+ Million CUDA GPUs Deployed

  • 7/29/2019 01_TeslaCUDAIntro

    8/33

    Tesla GPU Computing Products

                             Tesla S1070 1U System    Tesla C1060 Computing Board    TeslSupe (4 Te
    GPUs                     4 Tesla GPUs             1 Tesla GPU                    4
    Single Precision Perf    4.14 Teraflops           933 Gigaflops                  3
    Double Precision Perf    346 Gigaflops            78 Gigaflops                   3
    Memory                   4 GB / GPU               4 GB

  • 7/29/2019 01_TeslaCUDAIntro

    9/33

    Tesla S1070: Green Supercomputing

    20X Better Performance / Watt

    Hess

    Chevron

    Petrobras

    NCSA

    CEA

    Tokyo Tech

    JFCOM

    SAIC

    Federal

    Motorola

    Kodak

    Univer

    Univer

    Univer

    Max Pl

    Rice U

    Univer

    GusGu

    Eotvas

    Univer

    Chines

    Nation

  • 7/29/2019 01_TeslaCUDAIntro

    10/33

    A $5 Million Datacenter

    CPU 1U Server CPU 1

    Tesla 1

    2 Quad-core Xeon CPUs: 8 cores

    0.17 Teraflop (single)

    0.08 Teraflop (double)

    $ 3,000

    700 W

    1819 CPU servers

    310 Teraflops (single)

    155 Teraflops (double)

    Total area 16K sq feet

    Total 1273 KW

    8 CPU Cores + 4 GPUs = 968 cores

    4.14 Teraflops (single)

    0.346 Teraflop (double)

    $ 11,000

    1500 W

    455 CPU servers

    455 Tesla systems

    1961 Teraflops (single)

    196 Teraflops (double)

    Total area 9K sq feet

    Total 682 KW

  • 7/29/2019 01_TeslaCUDAIntro

    11/33

    Introducing the Tesla Personal Supercomp

    Supercomputing Performan

    Massively parallel CUDA Archite

    960 cores. 4 TeraFlops

    250x the performance of a desk

    Personal

    One researcher, one supercomp

    Plugs into standard power strip

    Accessible

    Program in C for Windows, Linu

    Available now worldwide under

  • 7/29/2019 01_TeslaCUDAIntro

    12/33

    More Information

    Tesla main page

    http://www.nvidia.com/tesla

    Product Specs, Marketing Literature

    CUDA Zone

    http://www.nvidia.com/cuda

    Applications, Papers, Videos

    YouTube Videos

    http://www.youtube.com/nvidiatesla

    Hear CUDA devel about their experi

  • 7/29/2019 01_TeslaCUDAIntro

    13/33

    CUDA Parallel Programming Architecture and Programming Model

  • 7/29/2019 01_TeslaCUDAIntro

    14/33

    CUDA Parallel Computing Architecture

    ATI's Compute Solution

    Parallel computing architecture and programming model

    Includes a C compiler plus support for OpenCL and DX11 Compute

    Architected to natively support all computational interfaces (standard languages and APIs)

  • 7/29/2019 01_TeslaCUDAIntro

    15/33

    CUDA Facts

    750+ Research Papers

    50+ universities teaching CUDA

    100 Million CUDA-Enabled GPUs

    25K Active Developers

    www.NVIDIA.com

  • 7/29/2019 01_TeslaCUDAIntro

    16/33

    Simple C Description For Parallelism

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

    Standa

    Para

  • 7/29/2019 01_TeslaCUDAIntro

    17/33

    Compiling C for CUDA Applications

    C for CUDA Application -> NVCC

    C for CUDA Kernels -> CUDACC -> CUDA object files

    Rest of C Application -> CPU Compiler -> CPU object files

    CUDA object files + CPU object files -> Linker -> Combined CPU-GPU Executable

  • 7/29/2019 01_TeslaCUDAIntro

    18/33

    Shared back-end compiler and optimization technology

    NVIDIA C for CUDA and OpenCL

    OpenCL

    C for CUDA

    PTX

    GPU

    Entry poinwho prefe

    Entry point for developers who want low-level API

  • 7/29/2019 01_TeslaCUDAIntro

    19/33

    Different Programming Styles

    C for CUDA

    C with parallel keywords

    C runtime that abstracts driver API

    Memory managed by C runtime (familiar malloc, free)

    Generates PTX

    Low-level driver API optionally available

    OpenCL

    Hardware API - similar to OpenGL and CUDA driver A

    Memory managed by programmer

    Generates PTX

  • 7/29/2019 01_TeslaCUDAIntro

    20/33

    Application Domains

  • 7/29/2019 01_TeslaCUDAIntro

    21/33

    Accelerating Time to Discovery

    CPU Only -> With GPU

    4.6 Days -> 27 Minutes

    2.7 Days -> 30 Minutes

    8 Hours -> 13 Minutes

    3

  • 7/29/2019 01_TeslaCUDAIntro

    22/33

    Computational Finance

    [Chart: Derivativ / Sc (titles truncated). Y axis: Time (secs). Intel Xeon (2.6 GHz): 31.1 secs. Second bar label truncated: 1 T]

    [Chart: 100x faster Rand / for Monte (title truncated). Workload: Mersenne Twiste + Box-Mueller (M (truncated). Y axis: Million Samples per sec. Bars: Intel Xeon Qua (3.0 GHz), Tesla C1060. Data labels: 164, 2116.]

    Source:

    Financial Computing Software vendors

    SciComp: Derivatives pricing modeling

    Hanweck: Options pricing & risk analysis

    Aqumin: 3D visualization of market data

    Exegy: High-volume Tickers & Risk Analysis

    QuantCatalyst: Pricing & Hedging Engine

    Oneye: Algorithmic Trading

    Arbitragis Trading: Trinomial Options Pricing

    Ongoing work

    LIBOR Monte Carlo market model

    Callable Swaps and Continuous Time Finance

    Source

  • 7/29/2019 01_TeslaCUDAIntro

    23/33

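    [Chart: Ion Place (title truncated), 271x Faster. Y axis: Billion Evaluations/sec. CPU bar: Intel QX670 quad-core w (truncated) SSE, data label 0.89.]

    [Chart: Lennard-Jon / LAM (title truncated). Y axis: Timesteps calculated/sec. X axis: Numbe (truncated). Label: N=24,300 N (truncated).]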

    Source: Stone, P

    Molecular Dynamics

    Available MD software

    NAMD / VMD (alpha release)

    HOOMD

    ACE-MD

    MD-GPU

    Ongoing work

    LAMMPS

    CHARMM

    GROMACS

    AMBER

    Source: Anders

  • 7/29/2019 01_TeslaCUDAIntro

    24/33

    [Chart: GAMESS on Intel Pen (truncated) vs CUDA code on Tes (truncated). Y axis: Time (log scale). Molecules: Caffeine, Cholesterol, Taxo (truncated). Data labels: 4.4 secs, 1.1 mins, 4.7 min; 0.2 secs, 1.2 secs, 4 (truncated).]

    Quantum Chemistry

    [Chart: Coulomb Poten (title truncated). Gaussian 03 on Inte (truncated) vs CUDA code on 1 (truncated). Y axis: Time (secs). Cases: Taxol/LSDA/3-21G, Taxol/PW91/6-31G, VLS (truncated). Data labels: 2.8 mins, 8 mins, 4 (truncated); 21.6 secs, 32.0 sec.]

    Source: Ufimtsev,

    Source: Yas

    Available MD software

    NAMD / VMD (alpha release)

    HOOMD

    ACE-MD

    MD-GPU

    Ongoing work

    LAMMPS

    CHARMM

    Q-Chem

    Gaussian

  • 7/29/2019 01_TeslaCUDAIntro

    25/33

    Computational Fluid Dynamics (CFD)

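    [Chart: Incompress (title truncated). Y axis: Gflops. Grid sizes: 128x32x128, 256x3x256. Legend as extracted: AMD Opteron 2.1 Tesla C870, 2 Tesla C870s, 4 Tesla C870s. Data labels: 0.9, 0.6, 24.]

    [Chart: Lattice Boltz (title truncated) for 128x12 (truncated). Y axis: Million Lattice Updates per Sec (MLUPs). Bars: Intel Xeon (3.4 GHz), Inte Itanium (1.4 G (truncated). Data labels: 4.8, 7.6.]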

    Source: Thiba

    Source: To

    Ongoing work

    Navier-Stokes

    Lattice Boltzmann

    3D Euler Solver

    Weather and ocean modeling

  • 7/29/2019 01_TeslaCUDAIntro

    26/33

    Electromagnetics / Electrodynamics

    [Chart: Cell Phone Mo / Simulation si / FDTD Accelerat (titles truncated). Y axis: Speed (Mcells/s). Intel Xeon (2.6 GHz): 9.9 Mcells/s.]

    Source: Ac

    FDTD Solvers

    Acceleware

    EM Photonics

    CUDA Tutorial

    Ongoing work

    Maxwell equation solver

    Ring Oscillator (FDTD)

    Particle beam dynamics simulator

  • 7/29/2019 01_TeslaCUDAIntro

    27/33

    Weather, Atmospheric, & Ocean Modeling

    [Chart: WSM5 Mi (title truncated). Y axis: Mflops/s. Intel Xeon (3.0 GHz): 1,315 Mflops/s.]

    [Chart: 5 day / T3000km (titles truncated). Y axis: Time (mins). First bar label truncated: Intel Xeon (2]

    Source: Mic

    CUDA-accelerated WRF available

    Other kernels in WRF being ported

    Ongoing work

    Tsunami modeling

    Ocean modeling

    Several CFD codes

    Source: Mat

  • 7/29/2019 01_TeslaCUDAIntro

    28/33

    Libraries

  • 7/29/2019 01_TeslaCUDAIntro

    29/33

    FFT Performance: CPU vs GPU (8-Series)

    [Chart: 1D Fast Fourier Transform on CUDA. X axis: Transform Size (Power of 2). Y axis: GFLOPS (0 to 90). Series: CUFFT 2.x, CUFFT 1.1, INTEL MKL 10.0, FFTW 3.x.]

    NVIDIA Tesla C870 GPU (8-series GPU)

    Quad-Core Intel Xeon CPU 5400 Series 3.0GHz,

    In-place, complex, single precision

    Intel

    calc

    sam

    Rea

    ~10

    Source for Intel data : http://www.intel.com/cd/software/products/asmo-na/eng/266852.htm

  • 7/29/2019 01_TeslaCUDAIntro

    30/33

    BLAS: CPU vs GPU

    cuBLAS 2.2: Tesla C1060

    MKL 10.0.3.020: Quad-core 3.0GHz Core2 Extreme (4 thread

    [Chart: Single Precision BLAS: SGEMM. X axis: Matrix Size M (MxM square). Y axis: GFLOPS (0 to 400). Series: cuBLAS 2.2, MKL 10.0.3.]

    [Chart: Double Precision BLA (title truncated). X axis: Matrix Size M (Mx (truncated). Y axis: GFLOPS (0 to 80).]

  • 7/29/2019 01_TeslaCUDAIntro

    31/33

    GPU + CPU DGEMM Performance

    [Chart: GPU + CPU DGEMM performance. X axis: Size (128 to 5696 in steps of 192). Y axis: GFLOPs (0 to 120). Series: GPU + CPU, GPU on (truncated), CPU onl (truncated).]

    Xeon Quad-core 2.8 GHz, MKL 10

    Tesla C1060 GPU (1.296 GHz)

  • 7/29/2019 01_TeslaCUDAIntro

    32/33

    Results: Sparse Matrix-Vector Multiplication (SpMV) on C

    CPU Results from "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et

  • 7/29/2019 01_TeslaCUDAIntro

    33/33

    Questions?

    http://www.nvidia.com/CUDA
