
CUDA
From Wikipedia, the free encyclopedia

Developer(s): Nvidia Corporation
Stable release: 4.2 / April 23, 2012
Operating system: Windows XP and later, Mac OS X, Linux
Platform: Supported GPUs
Type: GPGPU
License: Freeware
Website: www.nvidia.com/object/cuda_home_new.html

Compute Unified Device Architecture (CUDA) is a parallel computing architecture developed by Nvidia for graphics processing.[1] CUDA is the computing engine in Nvidia graphics processing units (GPUs) that is accessible to software developers through variants of industry-standard programming languages. Programmers use 'C for CUDA' (C with Nvidia extensions and certain restrictions), compiled through a PathScale Open64 C compiler,[2] to code algorithms for execution on the GPU. The CUDA architecture shares a range of computational interfaces with two competitors: the Khronos Group's OpenCL[3] and Microsoft's DirectCompute.[4] Third-party wrappers are also available for Python, Perl, Fortran, Java, Ruby, Lua, Haskell, MATLAB and IDL, and native support is available in Mathematica. CUDA programming in the web browser is freely available for individual non-commercial purposes in NCLab.

CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs. Using CUDA, the latest Nvidia GPUs become accessible for computation like CPUs. Unlike CPUs, however, GPUs have a parallel throughput architecture that emphasizes executing many concurrent threads slowly, rather than executing a single thread very quickly. This approach of solving general-purpose problems on GPUs is known as GPGPU.
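To make the contrast concrete, the following minimal sketch (not part of the original article; the kernel name and parameters are illustrative assumptions) shows the same element-wise operation written once as a CUDA kernel, where each of many concurrent threads handles a single element, and once as a sequential CPU loop:

// CUDA kernel: one element per thread, many threads in flight at once.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;                          // each thread touches one element
}

// CPU version: a single thread iterates over every element in turn.
void scale_cpu(float *data, float factor, int n)
{
    for (int i = 0; i < n; ++i)
        data[i] *= factor;
}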

In the computer game industry, in addition to graphics rendering, GPUs are used in game physics calculations (physical effects like debris, smoke, fire, fluids); examples include PhysX and Bullet. CUDA has also been used to accelerate non-graphical applications in computational biology, cryptography and other fields by an order of magnitude or more.[5][6][7][8] An example of this is the BOINC distributed computing client.[9]

CUDA provides both a low-level API and a higher-level API. The initial CUDA SDK was made public on 15 February 2007, for Microsoft Windows and Linux. Mac OS X support was later added in version 2.0,[10] which supersedes the beta released February 14, 2008.[11] CUDA works with all Nvidia GPUs from the G8x series onwards, including GeForce, Quadro and the Tesla line. CUDA is compatible with most standard operating systems. Nvidia states that programs developed for the G8x series will also work without modification on all future Nvidia video cards, due to binary compatibility.

Example of CUDA processing flow:
1. Copy data from main memory to GPU memory
2. CPU instructs the GPU to start processing
3. GPU executes the kernel in parallel on each core
4. Copy the result from GPU memory back to main memory
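As a hedged illustration of this flow (an assumed, simplified host program, not taken from the article; the kernel and function names are illustrative), the four steps map onto standard CUDA runtime calls roughly as follows:

#include <cuda_runtime.h>

// Hypothetical kernel: each thread increments one element.
__global__ void increment(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] += 1.0f;
}

void process(float *h_data, int n)
{
    float *d_data;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&d_data, bytes);                                 // allocate GPU memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // 1. main memory -> GPU memory
    increment<<<(n + 255) / 256, 256>>>(d_data, n);             // 2.+3. CPU launches kernel, GPU runs it in parallel
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // 4. GPU memory -> main memory
    cudaFree(d_data);
}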

Contents

1 Background

2 Advantages

3 Limitations

4 Supported GPUs

5 Version features and specifications

6 Example

7 Language bindings

8 Current CUDA architectures

9 Current and future usages of CUDA architecture

10 See also

11 References

12 External links


Background

See also: GPU

The GPU, as a specialized processor, addresses the demands of real-time, high-resolution 3D graphics, a highly compute-intensive task. As of 2012, GPUs have evolved into highly parallel multi-core systems allowing very efficient manipulation of large blocks of data. This design is more effective than general-purpose CPUs for algorithms where processing of large blocks of data is done in parallel, such as:

push-relabel maximum flow algorithm

fast sort algorithms of large lists

two-dimensional fast wavelet transform

For instance, the parallel nature of molecular dynamics simulations is suitable for CUDA implementation.[citation needed]

Advantages

CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU) using graphics APIs:

Scattered reads – code can read from arbitrary addresses in memory

Shared memory – CUDA exposes a fast shared memory region (up to 48 KB per multiprocessor) that can be shared among threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups (see the sketch after this list).[12]

Faster downloads and readbacks to and from the GPU

Full support for integer and bitwise operations, including integer texture lookups
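As a rough sketch of the shared-memory point above (assumed code, not from the article; the kernel name and TILE size are illustrative): a thread block can stage a tile of global memory in __shared__ storage and reuse it, instead of issuing repeated global-memory or texture reads.

#define TILE 256

// Each block stages TILE elements of the input in fast on-chip shared
// memory; every thread then reads its neighbours from the tile rather
// than from global memory. Assumes n is a multiple of TILE for brevity.
__global__ void blur1d(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;

    tile[threadIdx.x] = in[i];                 // one global read per thread
    __syncthreads();                           // tile now acts as a user-managed cache

    float mid   = tile[threadIdx.x];
    float left  = (threadIdx.x > 0)        ? tile[threadIdx.x - 1] : mid;
    float right = (threadIdx.x < TILE - 1) ? tile[threadIdx.x + 1] : mid;
    out[i] = (left + mid + right) / 3.0f;
}

// Example launch: blur1d<<<n / TILE, TILE>>>(d_in, d_out, n);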

Limitations

Texture rendering is not supported (CUDA 3.2 and up addresses this by introducing "surface writes" to CUDA arrays, the underlying opaque data structure).

Copying between host and device memory may incur a performance hit due to system bus bandwidth and latency (this can be partly alleviated with asynchronous memory transfers, handled by the GPU's DMA engine)
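A hedged sketch of the asynchronous-transfer mitigation mentioned above (assumed code; the function and buffer names are illustrative): with pinned host memory and a CUDA stream, the copy is serviced by the GPU's DMA engine while the CPU continues with other work.

#include <cuda_runtime.h>

void do_other_host_work();   // hypothetical CPU-side work done while the copy is in flight

void copy_async(int n)
{
    float *h_buf, *d_buf;
    cudaStream_t stream;

    cudaMallocHost(&h_buf, n * sizeof(float));   // pinned host memory, required for truly async copies
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);  // handled by the GPU's DMA engine
    do_other_host_work();                             // CPU keeps working during the transfer
    cudaStreamSynchronize(stream);                    // wait for the copy before using d_buf

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
}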

Threads should be running in groups of at least 32 for best performance, with total number of threads numbering in the thousands. Branches in the program code do not impact performance significantly, provided that each of 32 threads takes the same execution path; the SIMD execution model becomes a significant limitation for any inherently divergent task (e.g. traversing a space partitioning data structure during ray tracing).
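The divergence point can be illustrated with a hedged sketch (assumed code, not from the article): in the first kernel every thread of a warp takes the same branch, so the SIMD units stay busy; in the second, threads of the same warp take different paths and the two paths are executed one after the other.

// Uniform branch: the condition depends only on the block index, which is
// shared by the whole warp, so all 32 threads take the same path.
__global__ void uniform_branch(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x % 2 == 0)
        x[i] += 1.0f;
    else
        x[i] -= 1.0f;
}

// Divergent branch: even and odd threads within one warp take different
// paths, so the warp serializes both sides of the branch.
__global__ void divergent_branch(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        x[i] += 1.0f;
    else
        x[i] -= 1.0f;
}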


Unlike OpenCL, CUDA-enabled GPUs are only available from Nvidia[13]

Valid C/C++ may sometimes be flagged and prevent compilation due to optimization techniques the compiler is required to employ to use limited resources.

CUDA (with compute capability 1.x) uses a recursion-free, function-pointer-free subset of the C language, plus some simple extensions. However, a single process must run spread across multiple disjoint memory spaces, unlike other C language runtime environments.

CUDA (with compute capability 2.x) allows a subset of C++ class functionality, for example member functions may not be virtual (this restriction will be removed in some future release). [See CUDA C Programming Guide 3.1 - Appendix D.6]
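A rough sketch of the kind of C++ accepted under compute capability 2.x (assumed example; the class is illustrative, not from the article): plain, non-virtual member functions can be defined as __device__ code and called from a kernel.

// A simple value class with non-virtual __device__ member functions;
// virtual member functions are the part that is not allowed here.
struct Complexf {
    float re, im;
    __host__ __device__ Complexf(float r, float i) : re(r), im(i) {}
    __host__ __device__ float norm2() const { return re * re + im * im; }
};

__global__ void norms(const Complexf *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i].norm2();   // ordinary (non-virtual) member call on the device
}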

Double precision (CUDA compute capability 1.3 and above)[14] deviates from the IEEE 754 standard: round-to-nearest-even is the only supported rounding mode for reciprocal, division, and square root. In single precision, denormals and signalling NaNs are not supported; only two IEEE rounding modes are supported (chop and round-to-nearest-even), and those are specified on a per-instruction basis rather than in a control word; and the precision of division/square root is slightly lower than single precision.
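To illustrate the "per-instruction rather than control word" point, a hedged sketch using CUDA's single-precision intrinsics (the intrinsics __fadd_rn and __fadd_rz are part of the CUDA math API; the kernel itself is an illustrative assumption): the rounding mode is chosen by which intrinsic is called, not by any global floating-point state.

// The rounding mode is encoded in the instruction chosen:
// _rn = round to nearest even, _rz = round toward zero ("chop").
__global__ void rounding_demo(const float *a, const float *b,
                              float *nearest, float *toward_zero, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        nearest[i]     = __fadd_rn(a[i], b[i]);  // round to nearest even
        toward_zero[i] = __fadd_rz(a[i], b[i]);  // round toward zero
    }
}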

Supported GPUs

Compute capability table (version of CUDA supported) by GPU and card. Also available directly from Nvidia

Compute capability (version) | GPUs | Cards
1.0 | G80, G92, G92b, G94, G94b | GeForce 8800GTX/Ultra, 9400GT, 9600GT, 9800GT, Tesla C/D/S870, FX4/5600, 360M, GT 420
1.1 | G86, G84, G98, G96, G96b, G94, G94b, G92, G92b | GeForce 8400GS/GT, 8600GT/GTS, 8800GT/GTS, 9600 GSO, 9800GTX/GX2, GTS 250, GT 120/30/40, FX 4/570, 3/580, 17/18/3700, 4700x2, 1xxM, 32/370M, 3/5/770M, 16/17/27/28/36/37/3800M, NVS420/50
1.2 | GT218, GT216, GT215 | GeForce 210, GT 220/40, FX380 LP, 1800M, 370/380M, NVS 2/3100M
1.3 | GT200, GT200b | GeForce GTX 260, GTX 275, GTX 280, GTX 285, GTX 295, Tesla C/M1060, S1070, Quadro CX, FX 3/4/5800
2.0 | GF100, GF110 | GeForce (GF100) GTX 465, GTX 470, GTX 480, Tesla C2050, C2070, S/M2050/70, Quadro Plex 7000, GeForce (GF110) GTX570, GTX580, GTX590
2.1 | GF104, GF114, GF116, GF108, GF106 | GeForce GT 430, GT 440, GTS 450, GTX 460, GTX 550 Ti, GTX 560, GTX 560 Ti, 500M, Quadro 600, 2000, 4000, 5000, 6000
3.0 | GK104, GK106, GK107 | GeForce GTX 690, GTX 680, GTX 670, GeForce GTX 660M, GeForce GT 650M, GeForce GT 640M
3.5 | GK110 |

A table of devices officially supporting CUDA (note that many applications require at least 256 MB of dedicated VRAM, and some recommend at least 96 CUDA cores).[13]


See the full list here: http://developer.nvidia.com/cuda-gpus

Nvidia GeForce: GeForce GTX 690, GeForce GTX 680, GeForce GTX 670, GeForce GTX 590, GeForce GTX 580, GeForce GTX 570, GeForce GTX 560 Ti, GeForce GTX 560, GeForce GTX 550 Ti, GeForce GT 520, GeForce GTX 480, GeForce GTX 470, GeForce GTX 465, GeForce GTX 460, GeForce GTX 460 SE, GeForce GTS 450, GeForce GT 440, GeForce GT 430, GeForce GT 420, GeForce GTX 295, GeForce GTX 285, GeForce GTX 280, GeForce GTX 275, GeForce GTX 260, GeForce GTS 250, GeForce GTS 240, GeForce GT 240, GeForce GT 220, GeForce 210/G210, GeForce GT 140, GeForce 9800 GX2, GeForce 9800 GTX+, GeForce 9800 GTX, GeForce 9800 GT, GeForce 9600 GSO, GeForce 9600 GT, GeForce 9500 GT, GeForce 9400 GT, GeForce 9400 mGPU, GeForce 9300 mGPU, GeForce 9100 mGPU, GeForce 8800 Ultra, GeForce 8800 GTX, GeForce 8800 GTS, GeForce 8800 GT, GeForce 8800 GS, GeForce 8600 GTS, GeForce 8600 GT, GeForce 8600 mGT, GeForce 8500 GT, GeForce 8400 GS, GeForce 8300 mGPU, GeForce 8200 mGPU, GeForce 8100 mGPU

Nvidia GeForce Mobile: GeForce GTX 660M, GeForce GT 650M, GeForce GT 640M, GeForce GTX 580M, GeForce GTX 570M, GeForce GTX 560M, GeForce GT 555M, GeForce GT 550M, GeForce GT 540M, GeForce GT 525M, GeForce GT 520M, GeForce GTX 480M, GeForce GTX 470M, GeForce GTX 460M, GeForce GT 445M, GeForce GT 435M, GeForce GT 425M, GeForce GT 420M, GeForce GT 415M, GeForce GTX 285M, GeForce GTX 280M, GeForce GTX 260M, GeForce GTS 360M, GeForce GTS 350M, GeForce GTS 260M, GeForce GTS 250M, GeForce GT 335M, GeForce GT 330M, GeForce GT 325M, GeForce GT 320M, GeForce 310M, GeForce GT 240M, GeForce GT 230M, GeForce GT 220M, GeForce G210M, GeForce GTS 160M, GeForce GTS 150M, GeForce GT 130M, GeForce GT 120M, GeForce G110M, GeForce G105M, GeForce G103M, GeForce G102M, GeForce G100, GeForce 9800M GTX, GeForce 9800M GTS, GeForce 9800M GT, GeForce 9800M GS, GeForce 9700M GTS, GeForce 9700M GT, GeForce 9650M GT, GeForce 9650M GS, GeForce 9600M GT, GeForce 9600M GS, GeForce 9500M GS, GeForce 9500M G, GeForce 9400M G, GeForce 9300M GS, GeForce 9300M G, GeForce 9200M GS, GeForce 9100M G, GeForce 8800M GTX, GeForce 8800M GTS, GeForce 8700M GT, GeForce 8600M GT, GeForce 8600M GS, GeForce 8400M GT, GeForce 8400M GS, GeForce 8400M G, GeForce 8200M G

Nvidia Quadro: Quadro 6000, Quadro 5000, Quadro 4000, Quadro 2000, Quadro 600, Quadro FX 5800, Quadro FX 5600, Quadro FX 4800, Quadro FX 4700 X2, Quadro FX 4600, Quadro FX 3800, Quadro FX 3700, Quadro FX 1800, Quadro FX 1700, Quadro FX 580, Quadro FX 570, Quadro FX 380, Quadro FX 370, Quadro NVS 450, Quadro NVS 420, Quadro NVS 295, Quadro NVS 290, Quadro Plex 1000 Model IV, Quadro Plex 1000 Model S4

Nvidia Quadro Mobile: Quadro 5010M, Quadro 5000M, Quadro 4000M, Quadro 3000M, Quadro 2000M, Quadro 1000M, Quadro FX 3800M, Quadro FX 3700M, Quadro FX 3600M, Quadro FX 2800M, Quadro FX 2700M, Quadro FX 1800M, Quadro FX 1700M, Quadro FX 1600M, Quadro FX 880M, Quadro FX 770M, Quadro FX 570M, Quadro FX 380M, Quadro FX 370M, Quadro FX 360M, Quadro NVS 320M, Quadro NVS 160M, Quadro NVS 150M, Quadro NVS 140M, Quadro NVS 135M, Quadro NVS 130M

Nvidia Tesla: Tesla K20, Tesla K10, Tesla C2050/2070, Tesla M2050/M2070, Tesla S2050, Tesla S1070, Tesla M1060, Tesla C1060, Tesla C870, Tesla D870, Tesla S870

Version features and specifications

Feature support (unlisted features are supported for all compute capabilities), by compute capability (version 1.0, 1.1, 1.2, 1.3, 2.x, 3.0, 3.5):

Integer atomic functions operating on 32-bit words in global memory, and atomicExch() operating on 32-bit floating-point values in global memory – No on 1.0; Yes from 1.1

Integer atomic functions operating on 32-bit words in shared memory, atomicExch() operating on 32-bit floating-point values in shared memory, integer atomic functions operating on 64-bit words in global memory, and warp vote functions – No on 1.0–1.1; Yes from 1.2

Double-precision floating-point operations – No on 1.0–1.2; Yes from 1.3

Atomic functions operating on 64-bit integer values in shared memory, floating-point atomic addition operating on 32-bit words in global and shared memory, __ballot(), __threadfence_system(), __syncthreads_count()/__syncthreads_and()/__syncthreads_or(), surface functions, and 3D grid of thread blocks – No on 1.x; Yes from 2.x
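As a brief sketch of the atomic-function rows above (assumed code; the kernel is illustrative): atomicAdd() on a 32-bit integer in global memory requires compute capability 1.1, and the float overload requires 2.x, matching the table.

// Every thread that finds a positive element adds into shared counters.
__global__ void count_positive(const float *x, int *count, float *sum, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] > 0.0f) {
        atomicAdd(count, 1);        // integer atomic in global memory (compute capability >= 1.1)
        atomicAdd(sum, x[i]);       // float atomic in global memory (compute capability >= 2.0)
    }
}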

Technical specifications, by compute capability (version 1.0, 1.1, 1.2, 1.3, 2.x, 3.0, 3.5):

Maximum dimensionality of grid of thread blocks: 2 (1.x), 3 (2.x and later)
Maximum x-, y-, or z-dimension of a grid of thread blocks: 65535 (up to 2.x); 2^31 − 1 for the x-dimension from 3.0
Maximum dimensionality of thread block: 3
Maximum x- or y-dimension of a block: 512 (1.x), 1024 (2.x and later)
Maximum z-dimension of a block: 64
Maximum number of threads per block: 512 (1.x), 1024 (2.x and later)
Warp size: 32
Maximum number of resident blocks per multiprocessor: 8 (up to 2.x), 16 (3.x)
Maximum number of resident warps per multiprocessor: 24 (1.0–1.1), 32 (1.2–1.3), 48 (2.x), 64 (3.x)
Maximum number of resident threads per multiprocessor: 768 (1.0–1.1), 1024 (1.2–1.3), 1536 (2.x), 2048 (3.x)
Number of 32-bit registers per multiprocessor: 8 K (1.0–1.1), 16 K (1.2–1.3), 32 K (2.x), 64 K (3.x)
Maximum amount of shared memory per multiprocessor: 16 KB (1.x), 48 KB (2.x and later)
Number of shared memory banks: 16 (1.x), 32 (2.x and later)
Amount of local memory per thread: 16 KB (1.x), 512 KB (2.x and later)
Constant memory size: 64 KB
Cache working set per multiprocessor for constant memory: 8 KB
Cache working set per multiprocessor for texture memory: device dependent, between 6 KB and 8 KB
Maximum width for 1D texture reference bound to a CUDA array: 8192 (1.x), 65536 (2.x and later)
Maximum width for 1D texture reference bound to linear memory: 2^27
Maximum width and number of layers for a 1D layered texture reference: 8192 × 512 (1.x), 16384 × 2048 (2.x and later)
Maximum width and height for 2D texture reference bound to a CUDA array: 65536 × 32768 (1.x), 65536 × 65535 (2.x and later)
Maximum width and height for 2D texture reference bound to linear memory: 65000 × 65000
Maximum width and height for 2D texture reference bound to a CUDA array supporting texture gather: N/A (1.x), 16384 × 16384 (2.x and later)
Maximum width, height, and number of layers for a 2D layered texture reference: 8192 × 8192 × 512 (1.x), 16384 × 16384 × 2048 (2.x and later)
Maximum width, height and depth for a 3D texture reference bound to linear memory or a CUDA array: 2048 × 2048 × 2048 (up to 2.x), 4096 × 4096 × 4096 (3.x)
Maximum width (and height) for a cubemap texture reference: N/A (1.x), 16384 (2.x and later)
Maximum width (and height) and number of layers for a cubemap layered texture reference: N/A (1.x), 16384 × 2046 (2.x and later)
Maximum number of textures that can be bound to a kernel: 128 (1.x), 256 (2.x and later)
Maximum width for a 1D surface reference bound to a CUDA array: not supported (1.x), 65536 (2.x and later)
Maximum width and height for a 2D surface reference bound to a CUDA array: not supported (1.x), 65536 × 32768 (2.x and later)
Maximum number of surfaces that can be bound to a kernel: not supported (1.x), 8 (2.x–3.0), 16 (3.5)
Maximum number of instructions per kernel: 2 million (1.x), 512 million (2.x and later)
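Many of the limits in the table above can also be queried at run time. The following hedged sketch (assumed host code, not from the article) prints a few of them using cudaGetDeviceProperties():

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Compute capability:      %d.%d\n", prop.major, prop.minor);
    printf("Warp size:               %d\n", prop.warpSize);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers per block:     %d\n", prop.regsPerBlock);
    printf("Constant memory:         %zu bytes\n", prop.totalConstMem);
    return 0;
}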

Architecture specifications, by compute capability (version):

Number of cores for integer and floating-point arithmetic functions: 8 (1.0–1.3),[15] 32 (2.0), 48 (2.1), 192 (3.0/3.5)
Number of special function units for single-precision floating-point transcendental functions: 2 (1.0–1.3), 4 (2.0), 8 (2.1), 32 (3.0/3.5)
Number of texture filtering units for every texture address unit or render output unit (ROP): 2 (1.0–1.3), 4 (2.0), 8 (2.1), 32 (3.0/3.5)
Number of warp schedulers: 1 (1.0–1.3), 2 (2.0), 2 (2.1), 4 (3.0/3.5)
Number of instructions issued at once by scheduler: 1 (1.0–1.3), 1 (2.0), 2 (2.1),[16] 2 (3.0/3.5)

For more information, see http://www.geeks3d.com/20100606/gpu-computing-nvidia-cuda-compute-capability-comparative-table/ and the Nvidia CUDA Programming Guide.[17]

Example

This example code in C++ loads a texture from an image into an array on the GPU:

// (width, height, image and d_data are assumed to be declared and
// initialized elsewhere; this fragment shows only the texture setup.)
texture<float, 2, cudaReadModeElementType> tex;

__global__ void kernel(float* odata, int height, int width);

void foo()
{
    cudaArray* cu_array;

    // Allocate array
    cudaChannelFormatDesc description = cudaCreateChannelDesc<float>();
    cudaMallocArray(&cu_array, &description, width, height);

    // Copy image data to array (no offset into the destination array)
    cudaMemcpyToArray(cu_array, 0, 0, image, width*height*sizeof(float), cudaMemcpyHostToDevice);

    // Set texture parameters (default)
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.addressMode[1] = cudaAddressModeClamp;
    tex.filterMode = cudaFilterModePoint;
    tex.normalized = false;    // do not normalize coordinates

    // Bind the array to the texture
    cudaBindTextureToArray(tex, cu_array);

    // Run kernel
    dim3 blockDim(16, 16, 1);
    dim3 gridDim((width + blockDim.x - 1) / blockDim.x, (height + blockDim.y - 1) / blockDim.y, 1);
    kernel<<< gridDim, blockDim, 0 >>>(d_data, height, width);

    // Unbind the array from the texture
    cudaUnbindTexture(tex);
}  // end foo()

__global__ void kernel(float* odata, int height, int width)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float c = tex2D(tex, x, y);
        odata[y*width + x] = c;
    }
}

Below is an example given in Python that computes the product of two arrays on the GPU. The unofficial Python language bindings can be obtained from PyCUDA.[18]

import pycuda.compiler as comp
import pycuda.driver as drv
import numpy
import pycuda.autoinit

mod = comp.SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400,1,1))

print dest-a*b


Additional Python bindings to simplify matrix multiplication operations can be found in the program pycublas.[19]

import numpy
from pycublas import CUBLASMatrix

A = CUBLASMatrix( numpy.mat([[1,2,3],[4,5,6]],numpy.float32) )
B = CUBLASMatrix( numpy.mat([[2,3],[4,5],[6,7]],numpy.float32) )
C = A*B
print C.np_mat()

Language bindings

Fortran - FORTRAN CUDA, PGI CUDA Fortran Compiler

Haskell - Data.Array.Accelerate

IDL - GPULib

Java - jCUDA, JCuda, JCublas, JCufft

Lua - KappaCUDA

Mathematica - CUDALink

MATLAB - Parallel Computing Toolbox, Distributed Computing Server,[20] and 3rd party packages like Jacket.

.NET - CUDA.NET; CUDAfy.NET .NET kernel and host code, CURAND, CUBLAS, CUFFT

Perl - KappaCUDA, CUDA::Minimal

Python - PyCUDA, KappaCUDA

Ruby - KappaCUDA

Current CUDA architectures

The current generation CUDA architecture (codename Fermi), which is standard on Nvidia's GeForce 400 Series (GF100) GPUs released on 2010-03-27,[21] is designed from the ground up to natively support more programming languages such as C++. It significantly increases the peak double-precision floating-point performance compared to Nvidia's prior-generation Tesla GPUs. It also introduced several new features,[22] including:

up to 1024 CUDA cores and 6.0 billion transistors on the GTX 590

Nvidia Parallel DataCache technology

Nvidia GigaThread engine

ECC memory support


Native support for Visual Studio

Current and future usages of CUDA architecture

Accelerated rendering of 3D graphics

Accelerated interconversion of video file formats

Accelerated encryption, decryption and compression

Distributed Calculations, such as predicting the native conformation of proteins

Medical analysis simulations, for example virtual reality based on CT and MRI scan images.

Physical simulations, in particular in fluid dynamics.

Real-time cloth simulation (OptiTex.com)

The Search for Extra-Terrestrial Intelligence (SETI@Home) program[23][24]