GPU Parallel Computing - SUWON · vmlab.suwon.ac.kr/mwlee/data2/file/(12)GPU_computing_etc.pdf · 2014-05-19

GPU Parallel Computing

박필성, 수원대학교 (University of Suwon)


Terminology

GPGPU: general-purpose computing on a GPU (Graphics Processing Unit), that is, using graphics hardware for non-graphics computation.

nVIDIA's CUDA (Compute Unified Device Architecture): a software architecture for data-parallel programming.


Why GPGPU?

CPU vs. GPU

CPU: "multi-core"
- fast cache
- branching adaptability
- high performance

GPU: "many-core" (hundreds of cores)
- multiple ALUs
- fast onboard memory (nearly 10 times the speed of main memory)
- high throughput on parallel tasks

CPUs excel at task parallelism; GPUs excel at data parallelism.

CPU vs. GPU - Hardware

The GPU devotes more of its hardware to data processing.

GPU Architecture

Processing Element

Processing element = thread processor = ALU

Memory Architecture

Memory spaces:
- Registers
- Local memory
- Shared memory
- Constant memory
- Global memory
- Texture memory

[Figure: memory hierarchy of the device. The (Device) Grid contains Block (0, 0) and Block (1, 0). Each block has its own Shared Memory; each thread, Thread (0, 0) and Thread (1, 0), has its own Registers and Local Memory. Global Memory, Constant Memory, and Texture Memory serve the whole grid and are connected to the Host.]
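The memory spaces named on this slide correspond to CUDA C declarations roughly as sketched below. This is an illustrative example, not code from the slides: the kernel `scaleAndReverse`, the helper `mirror`, the constant `scale`, and the block size of 256 are all hypothetical names and values. Texture memory is left out to keep the sketch short.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define BLOCK 256  // hypothetical block size chosen for this sketch

// Constant memory: read-only inside kernels, written by the host.
__constant__ float scale;

// Hypothetical helper: position of element i in a reversed BLOCK-element tile.
__host__ __device__ inline int mirror(int i) { return BLOCK - 1 - i; }

// Each block reverses its own BLOCK-element tile of `data` and scales it.
__global__ void scaleAndReverse(float *data)      // `data` points to global memory
{
    __shared__ float tile[BLOCK];                 // shared memory: one copy per block

    int local  = threadIdx.x;                     // scalar locals normally live in registers
    int global = blockIdx.x * blockDim.x + local;

    tile[local] = data[global];                   // global -> shared
    __syncthreads();                              // wait until the whole tile is loaded

    data[global] = scale * tile[mirror(local)];   // shared -> global
}

int main()
{
    const int n = 4 * BLOCK;
    float host[n];
    for (int i = 0; i < n; i++) host[i] = (float)i;

    float s = 2.0f;
    cudaMemcpyToSymbol(scale, &s, sizeof(float)); // host -> constant memory

    float *dev;
    cudaMalloc((void**)&dev, n * sizeof(float));  // allocation in global memory
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    scaleAndReverse<<<n / BLOCK, BLOCK>>>(dev);

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[0] = %f\n", host[0]);
    return 0;
}
```

Staging the tile in shared memory lets every thread read a value loaded by another thread of the same block without a second trip to global memory. Per-thread local memory does not appear explicitly: the compiler uses it for data that does not fit in registers.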

Data-parallel Programming

Think of the GPU as a massively threaded co-processor.

Write "kernel" functions that execute on the device, processing multiple data elements in parallel.

Keep it busy: massive threading.
Keep your data close: local memory.

Requirements

Hardware
- CUDA-capable NVIDIA graphics card
- PCI-Express slot

Software & Tools
- CUDA device driver
- CUDA toolkit: nvcc (compiler), …
- CUDA SDK
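One way to check the hardware requirement is to ask the CUDA runtime which devices it can see. The sketch below assumes the driver and toolkit are already installed; the helper name `countCudaDevices` is hypothetical, while `cudaGetDeviceCount` and `cudaGetDeviceProperties` are CUDA runtime API calls.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: number of CUDA-capable devices, or -1 if the
// runtime reports an error (for example, no driver installed).
int countCudaDevices()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA runtime: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return count;
}

int main()
{
    int count = countCudaDevices();
    if (count <= 0) {
        printf("No usable CUDA device found.\n");
        return 1;
    }

    // Print a few properties of each device.
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);
        printf("  compute capability : %d.%d\n", prop.major, prop.minor);
        printf("  global memory      : %zu MB\n", prop.totalGlobalMem >> 20);
        printf("  multiprocessors    : %d\n", prop.multiProcessorCount);
        printf("  max threads/block  : %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}
```

The reported maximum number of threads per block is worth noting before choosing launch dimensions for a kernel.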

Host vs. Device

Host: main computer (CPU + main memory)
Device: graphics card (GPU + graphics memory)

CUDA source code is written in C/C++ (file name ~.cu) and consists of two parts:
- host code: runs on the CPU
- device code ("kernel"): runs on the GPU

Compile
- nvcc VectorAdd.cu

How to compute

- Allocate the variables the CPU will use in main memory.
- Allocate the variables the GPU will use for computation in the graphics card's memory.
- Copy the data from the host computer's main memory to the graphics card's memory.
- The GPU creates thousands to tens of millions of threads and performs the computation using the graphics card's memory.
- Copy the results back to the host computer's main memory; the CPU then uses them for further work or prints them, and the job is done.

Initially:
[Figure: array in the Host's Memory; GPU Card's Memory]

Allocate memory in the GPU card
[Figure: array in the Host's Memory; array_d in the GPU Card's Memory]

Copy content from the host's memory to the GPU card memory
[Figure: array copied to array_d]

Execute code on the GPU
[Figure: the GPU MPs operate on array_d]

Copy results back to the host memory
[Figure: array_d copied back to array]

// VectorAdd.cu
#include <stdio.h>
#include <stdlib.h>

// Device code (kernel): each thread adds one pair of elements.
__global__ void VectorAdd(int *a, int *b, int *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    c[tid] = a[tid] + b[tid];
}

int main()
{
    const int size = 512 * 65535;
    const int BufferSize = size * sizeof(int);
    int *InputA, *InputB, *Result;
    int *dev_A, *dev_B, *dev_R;
    int i;

    // Allocate host memory
    InputA = (int*)malloc(BufferSize);
    InputB = (int*)malloc(BufferSize);
    Result = (int*)malloc(BufferSize);

    // Input data
    for (i = 0; i < size; i++) {
        InputA[i] = i; InputB[i] = i; Result[i] = 0;
    }

    // Allocate device memory
    cudaMalloc((void**)&dev_A, size * sizeof(int));
    cudaMalloc((void**)&dev_B, size * sizeof(int));
    cudaMalloc((void**)&dev_R, size * sizeof(int));

    // Transfer data from host memory to device memory
    cudaMemcpy(dev_A, InputA, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_B, InputB, size * sizeof(int), cudaMemcpyHostToDevice);

    // Create 65535 x 512 threads and perform computation on GPU
    VectorAdd<<<65535, 512>>>(dev_A, dev_B, dev_R);

    // Transfer data from device memory to host memory
    cudaMemcpy(Result, dev_R, size * sizeof(int), cudaMemcpyDeviceToHost);

    // Print results (first five and last five elements)
    for (i = 0; i < 5; i++) {
        printf(" Result[%d] : %d\n", i, Result[i]);
    }
    printf(" ......\n");
    for (i = size - 5; i < size; i++) {
        printf(" Result[%d] : %d\n", i, Result[i]);
    }

    // Free device memory
    cudaFree(dev_A); cudaFree(dev_B); cudaFree(dev_R);

    // Free host memory
    free(InputA); free(InputB); free(Result);

    return 0;
}
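The VectorAdd kernel above works because the array length (512 x 65535) equals the number of threads launched, and it does not check any return codes. A possible variant for arbitrary lengths is sketched below. It is not part of the original slides: the names `VectorAddGuarded`, `blocksFor`, and `CUDA_CHECK`, the length of 1000, and the block size of 256 are hypothetical choices for illustration.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper macro: stop with a message if a CUDA runtime call failed.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Number of blocks needed to cover n elements (ceiling division).
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

__global__ void VectorAddGuarded(const int *a, const int *b, int *c, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)                 // threads past the end of the arrays do nothing
        c[tid] = a[tid] + b[tid];
}

int main()
{
    const int n = 1000;          // deliberately not a multiple of the block size
    const int threads = 256;
    const size_t bytes = n * sizeof(int);

    int *a = (int*)malloc(bytes);
    int *b = (int*)malloc(bytes);
    int *c = (int*)malloc(bytes);
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = i; }

    int *dev_a, *dev_b, *dev_c;
    CUDA_CHECK(cudaMalloc((void**)&dev_a, bytes));
    CUDA_CHECK(cudaMalloc((void**)&dev_b, bytes));
    CUDA_CHECK(cudaMalloc((void**)&dev_c, bytes));
    CUDA_CHECK(cudaMemcpy(dev_a, a, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(dev_b, b, bytes, cudaMemcpyHostToDevice));

    VectorAddGuarded<<<blocksFor(n, threads), threads>>>(dev_a, dev_b, dev_c, n);
    CUDA_CHECK(cudaGetLastError());   // reports an invalid launch configuration

    // This copy waits for the kernel to finish before reading the result.
    CUDA_CHECK(cudaMemcpy(c, dev_c, bytes, cudaMemcpyDeviceToHost));
    printf("c[%d] = %d\n", n - 1, c[n - 1]);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    free(a); free(b); free(c);
    return 0;
}
```

With 1000 elements and 256 threads per block, four blocks are launched and the last 24 threads fall outside the arrays, which is why the `tid < n` guard is needed.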

Some Example: 1024 x 1024 matrix multiplication

[pspark@para kias]$ ./MatrixMul-c
Matrix C (Results)
0.389147 0.418741 : 257.658
0.574162 0.669713 : 254.338
0.674025 0.867991 : 261.301
0.468286 0.619271 : 256.432
Total elapsed time on the CPU chip 10.3449

[pspark@para kias]$ ./MatrixMul-cuda
grid : 32 32 : block : 32 32
Matrix C (Results)
0.389147 0.418741 : 2.93874e-39
0.574162 0.669713 : 3.30608e-39
0.674025 0.867991 : 3.67342e-39
0.468286 0.619271 : 4.04076e-39
Total elapsed time on the GPU card 0.0469801

The reported GPU time is about 220 times shorter than the CPU time (10.3449 / 0.0469801 ≈ 220). Note, however, that the third column printed by the CUDA run (about 3e-39) does not match the CPU results (about 255), so the GPU output shown here may not be the actual product, and the timing should be read with that in mind.
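The source of MatrixMul-cuda is not shown in the slides. The sketch below is one straightforward way to multiply two 1024 x 1024 matrices with the launch configuration printed above (a 32 x 32 grid of 32 x 32 blocks), assuming row-major storage and random inputs in [0, 1). The names `MatrixMul`, `referenceElement`, `N`, and `TILE` are hypothetical, and a block of 32 x 32 = 1024 threads assumes a device that allows 1024 threads per block.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N    1024   // matrix dimension, as in the example above
#define TILE 32     // block edge: 32 x 32 threads per block

// One thread computes one element of C = A * B (row-major storage).
__global__ void MatrixMul(const float *A, const float *B, float *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

// CPU reference for a single element, used to spot-check the GPU result.
float referenceElement(const float *A, const float *B, int n, int row, int col)
{
    float sum = 0.0f;
    for (int k = 0; k < n; k++)
        sum += A[row * n + k] * B[k * n + col];
    return sum;
}

int main()
{
    const size_t bytes = (size_t)N * N * sizeof(float);
    float *A = (float*)malloc(bytes);
    float *B = (float*)malloc(bytes);
    float *C = (float*)malloc(bytes);
    for (int i = 0; i < N * N; i++) {
        A[i] = rand() / (float)RAND_MAX;
        B[i] = rand() / (float)RAND_MAX;
    }

    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);            // block : 32 32
    dim3 grid(N / TILE, N / TILE);     // grid  : 32 32
    MatrixMul<<<grid, block>>>(dA, dB, dC, N);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Copy the product back before printing it.
    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0][0] gpu = %g, cpu = %g\n", C[0], referenceElement(A, B, N, 0, 0));

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(A); free(B); free(C);
    return 0;
}
```

Comparing a few GPU elements against a CPU reference, as in the last `printf`, is a cheap way to confirm that the kernel ran and that the result was copied back. With inputs drawn uniformly from [0, 1), each element of C is a sum of 1024 products and should come out near 256.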

References

Miruware
http://www.miruware.com/

NVidia Developer CUDA Zone
http://developer.nvidia.com/category/zone/cuda-zone
http://ko.wikipedia.org/wiki/CUDA

OpenCL
http://www.khronos.org/opencl/
http://ko.wikipedia.org/wiki/OpenCL

Intel Larrabee
http://ko.wikipedia.org/wiki/%EB%9D%BC%EB%9D%BC%EB%B9%84_(%EB%A7%88%EC%9D%B4%ED%81%AC%EB%A1%9C%EC%95%84%ED%82%A4%ED%85%8D%EC%B2%98)

A formidable rival for AMD and nVIDIA… Intel's "Larrabee"
http://uzys2011.tistory.com/337

Larrabee GPU: development eventually discontinued
http://www.kbench.com/hardware/?no=84965

Quantum computer
http://mirror.enha.kr/wiki/%EC%96%91%EC%9E%90%EC%BB%B4%ED%93%A8%ED%84%B0

D-Wave Systems
http://www.dwavesys.com/

Google acquires a D-Wave 2
http://www.zdnet.co.kr/news/news_view.asp?artice_id=20130704161219

Brook+

SC07 BOF Session

November 13, 2007

What is Brook+?

Brook is an extension to the C-language for stream programming originally developed by Stanford University

Brook+ is an implementation by AMD of the Brook GPU spec on AMD's compute abstraction layer with some enhancements

Example

kernel void sum(float a<>, float b<>, out float c<>)
{
    c = a + b;
}

int main(int argc, char** argv)
{
    int i, j;
    float a<10, 10>;
    float b<10, 10>;
    float c<10, 10>;

    float input_a[10][10];
    float input_b[10][10];
    float input_c[10][10];

    for(i=0; i<10; i++) {
        for(j=0; j<10; j++) {
            input_a[i][j] = (float) i;
            input_b[i][j] = (float) j;
        }
    }

    streamRead(a, input_a);
    streamRead(b, input_b);

    sum(a, b, c);

    streamWrite(c, input_c);
    ...
}

Kernels: program functions that operate on stream elements (sum in the example).

Streams: collections of data elements of the same type which can be operated on in parallel (a, b, and c in the example).

Brook+ access functions: streamRead and streamWrite in the example.

Brook+ Compiler

Converts Brook+ files into C++ code. Kernels, written in C, are compiled to AMD’s IL code for the GPU or C code for the CPU.


Brook+ Runtime

IL code is executed on the GPU. The backend is written in CAL.


Brook+ Features

Brook+ is an extension to the Brook for GPUs source code.

Features of Brook for GPUs relevant to modern graphics hardware are maintained.

Kernels are compiled to AMD’s IL.

Runtime uses CAL for the GPU backend.

Original CPU backend also included.


Folding@Home Stats

Folding@Home client using Brook+

Currently 39 TFLOPS on 664 GPU clients

Avg. 60 GFLOPS per GPU client

Compared to:

Avg. 25 GFLOPS per PS3 client

Avg. 1 GFLOPS per CPU client


Brook+ Release

Brook+ package:

– Compiler and runtime binaries

– Source code and build environments

– Sample applications

Source code released under the BSD License.

Project will also reside on SourceForge.net.


Brook+ Moving Forward

Double precision - FireStream 9170

Mem-export (scatter)

Graphics API interoperability

Multi-GPU support

Other operating systems (Linux, Vista, 64-bit)


Trademark Attribution

AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.

©2007 Advanced Micro Devices, Inc. All rights reserved.

