Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision...


[Figure: two charts, Computing Capability and Memory Bandwidth, comparing GPUs (NV30, G80 Ultra) with CPUs (Northwood, Prescott EE, Woodcrest, Harpertown) over 2003-2007.]

[Figure: GPU architecture block diagram. Labels: Input Assembler; Setup / Rstr / ZCull; Vtx Thread Issue, Geom Thread Issue, Pixel Thread Issue; Streaming Multiprocessor (SM); Streaming Processor (SP).]

CUDA: CPU + GPU C

Parallel computing model

Single Instruction, Multiple Thread (SIMT)

•All threads run the same function (thousands of threads in flight)

•Each core deals with different data
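The SIMT idea can be sketched on the host in plain Python. This is only a sequential model, not CUDA code: `saxpy_kernel` and `launch` are hypothetical names chosen for illustration, and the loop in `launch` stands in for what the hardware runs in parallel. Every simulated thread executes the same function; only the index it derives from its block and thread coordinates differs.

```python
def saxpy_kernel(block_idx, block_dim, thread_idx, a, x, y, out):
    """Body that every simulated thread runs; only the index differs."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(x):                          # guard: the grid may overshoot the data
        out[i] = a * x[i] + y[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Run the kernel once per (block, thread) pair, sequentially on the host."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


n = 10
x = list(range(n))
y = [1] * n
out = [0] * n
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # round up so every element is covered
launch(saxpy_kernel, grid_dim, block_dim, 2, x, y, out)
```

The bounds check matters because the grid is rounded up to a whole number of blocks, so the last block has threads with no element to process.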

•Hide I/O latency with many threads (thousands of threads)

•Speed up computing / I/O transactions

•Coalesce the I/O into a single access when the threads of a half-warp access neighboring data

•1 cycle @ GPU vs. ~1000 cycles @ CPU

C = vectorA * Matrix B % prime
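As a point of reference for the operation above, here is a minimal host-side sketch in Python. It is not the GPU implementation from the talk; `vec_mat_mod` is a hypothetical name, and the mapping of one output column to one GPU thread mentioned in the comment is an assumption about how the work might be split.

```python
def vec_mat_mod(a, b, p):
    """Return c with c[j] = (sum over i of a[i] * b[i][j]) mod p."""
    rows, cols = len(b), len(b[0])
    if len(a) != rows:
        raise ValueError("vector length must equal the number of matrix rows")
    c = [0] * cols
    for j in range(cols):      # assumption: on a GPU, one thread could own column j
        acc = 0
        for i in range(rows):
            # Reduce at every step so the accumulator stays below p.
            acc = (acc + a[i] * b[i][j]) % p
        c[j] = acc
    return c


a = [1, 2, 3]
b = [[1, 2],
     [3, 4],
     [5, 6]]
c = vec_mat_mod(a, b, 7)
```

Python integers do not overflow, so the reduction at each step is not strictly needed here. With fixed-width integers the product `a[i] * b[i][j]` needs up to twice as many bits as `p`, which is why the reduction is shown inside the loop.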

There is no cache for global memory on G80/G200

Constant memory & texture memory have only a small cache

I/O latency: 400-600 clock cycles

This is the bottleneck. Key to optimization!
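A rough back-of-the-envelope estimate shows why so many threads are needed to cover this latency. The sketch below is an assumption-laden model, not a measurement: `warps_to_hide_latency` is a hypothetical helper, the figure of 4 clock cycles to issue one warp instruction is assumed (8 SPs serving a 32-thread warp), and the 4 instructions between loads is an arbitrary illustrative value.

```python
import math


def warps_to_hide_latency(latency_cycles, instrs_between_loads,
                          cycles_per_warp_instr=4):
    """Estimate how many other warps must have work to cover one warp's memory wait.

    Assumption: each warp keeps the multiprocessor busy for
    instrs_between_loads * cycles_per_warp_instr cycles before it stalls too.
    """
    busy_cycles = instrs_between_loads * cycles_per_warp_instr
    return math.ceil(latency_cycles / busy_cycles)


low = warps_to_hide_latency(400, 4)   # lower end of the quoted latency range
high = warps_to_hide_latency(600, 4)  # upper end of the quoted latency range
```

Under these assumptions the estimate is 25 to 38 warps per multiprocessor. The number falls quickly as each thread does more arithmetic per load, which is one reason the ratio of computation to memory access matters.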

[Figure: CUDA memory hierarchy. A (Device) Grid contains Block (0, 0) and Block (1, 0). Each block has its own Shared Memory and contains Thread (0, 0) and Thread (1, 0), each with its own Registers and Local Memory. Global Memory, Constant Memory and Texture Memory belong to the grid as a whole.]

Global memory accesses by the threads of a half-warp can be coalesced when the words accessed by all threads lie in the same segment, of size equal to:

•32 bytes if all threads access 8-bit words

•64 bytes if all threads access 16-bit words

•128 bytes if all threads access 32-bit or 64-bit words

This holds for any pattern of addresses requested by the half-warp, including patterns where multiple threads access the same address.

[Figure: Threads 0-15 of a half-warp mapped to addresses 0-252 in 4-byte steps, which span Segment 0 (128B) and Segment 1 (128B).]

Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, and 128 bytes for 32-, 64- and 128-bit data. (Figure label: "Reduced to 32B".)
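The segment rule above can be turned into a small host-side model that predicts how many memory transactions one half-warp generates. This is a sketch based on my reading of the coalescing procedure NVIDIA documents for devices of this generation, not code from the talk: `segment_size` and `half_warp_transactions` are hypothetical names, addresses are assumed to be aligned to the word size, and the halving step (shrinking a transaction when only its lower or upper half is used) is my interpretation of the "Reduced to 32B" label.

```python
def segment_size(word_bytes):
    """Segment size in bytes: 32 for 1-byte words, 64 for 2-byte, else 128."""
    return {1: 32, 2: 64}.get(word_bytes, 128)


def half_warp_transactions(addresses, word_bytes):
    """Return the (start_address, size_bytes) transactions for one half-warp.

    addresses[t] is the byte address requested by thread t.
    """
    seg = segment_size(word_bytes)
    pending = list(addresses)
    transactions = []
    while pending:
        # Segment containing the address of the lowest-numbered pending thread.
        base = (pending[0] // seg) * seg
        served = [a for a in pending if base <= a < base + seg]
        lo, hi = min(served), max(served) + word_bytes  # byte range touched
        start, size = base, seg
        # Halve the transaction while only its lower or upper half is used.
        while size > 32:
            half = size // 2
            if hi <= start + half:
                size = half
            elif lo >= start + half:
                start, size = start + half, half
            else:
                break
        transactions.append((start, size))
        pending = [a for a in pending if not (base <= a < base + seg)]
    return transactions


aligned = half_warp_transactions([4 * t for t in range(16)], 4)
shifted = half_warp_transactions([120 + 4 * t for t in range(16)], 4)
same = half_warp_transactions([0] * 16, 4)
strided = half_warp_transactions([128 * t for t in range(16)], 4)
```

Under this model, sixteen consecutive 4-byte reads starting at address 0 cost one 64-byte transaction, while the same reads starting at address 120 straddle two segments and cost a 32-byte plus a 64-byte transaction. A stride of 128 bytes puts every thread in its own segment and costs sixteen transactions.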

CPU: Intel® Core™ i7 CPU 860 @ 2.80 GHz (single thread)

GPU: XFX GTX280, 1.24 GHz

Summary
