GPU Parallel Computing, University of Suwon

GPU Parallel Computing - SUWON · vmlab.suwon.ac.kr/mwlee/data2/file/(12)GPU_computing_etc.pdf · 2014-05-19


GPU Parallel Computing

박필성
University of Suwon

[email protected]

Terminology

GPGPU?
- General-purpose computing using a GPU (Graphics Processing Unit)
- That is, using graphics hardware for non-graphics computation

nVIDIA’s CUDA?
- Compute Unified Device Architecture
- A software architecture for data-parallel programming


Why GPGPU?

CPU vs. GPU

CPU
- "multi-core"
- Fast cache
- Branching adaptability
- High performance

GPU
- "many-core" (hundreds of cores)
- Multiple ALUs
- Fast onboard memory (nearly 10 times the speed of main memory)
- High throughput on parallel tasks

The CPU excels at task parallelism; the GPU excels at data parallelism.

CPU vs. GPU - Hardware

The GPU devotes more of its hardware to data processing.

GPU Architecture

Processing Element

Processing element = thread processor = ALU

Memory Architecture

- Registers
- Local memory
- Shared memory
- Constant memory
- Global memory
- Texture memory

[Figure: (Device) Grid containing Block (0, 0) and Block (1, 0). Each block has its own Shared Memory; each thread, Thread (0, 0) and Thread (1, 0), has its own Registers and Local Memory. Global Memory, Constant Memory, and Texture Memory sit outside the blocks and connect to the Host.]
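As a rough illustration of how some of these memory spaces appear in CUDA source, the sketch below keeps three weights in constant memory, stages each block's input in shared memory, and holds per-thread values in registers. The names `smooth`, `smooth_on_gpu`, `kWeights`, and the block size `TILE` are hypothetical choices for this example, not part of the slides; local and texture memory are not shown.

```cuda
#include <cuda_runtime.h>

#define TILE 256                         // assumed block size for this sketch

__constant__ float kWeights[3];          // constant memory: small, read-only on the device

// 3-point weighted sum inside each block (neighbours outside the block count as 0).
__global__ void smooth(const float* in, float* out, int n)   // in/out point to global memory
{
    __shared__ float tile[TILE];         // shared memory: one copy per block
    int t = threadIdx.x;                 // scalars such as t and i live in registers
    int i = blockIdx.x * blockDim.x + t;

    tile[t] = (i < n) ? in[i] : 0.0f;    // stage global data in shared memory
    __syncthreads();                     // wait until the whole tile is loaded

    float left  = (t > 0)        ? tile[t - 1] : 0.0f;
    float right = (t + 1 < TILE) ? tile[t + 1] : 0.0f;
    if (i < n)
        out[i] = kWeights[0] * left + kWeights[1] * tile[t] + kWeights[2] * right;
}

// Hypothetical host wrapper for n > 0: returns the first CUDA error, or cudaSuccess.
cudaError_t smooth_on_gpu(const float* h_in, float* h_out, int n, const float w[3])
{
    float *d_in = NULL, *d_out = NULL;
    size_t bytes = (size_t)n * sizeof(float);

    cudaError_t err = cudaMalloc((void**)&d_in, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&d_out, bytes);
    if (err == cudaSuccess) err = cudaMemcpyToSymbol(kWeights, w, 3 * sizeof(float));
    if (err == cudaSuccess) err = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        smooth<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);
        err = cudaGetLastError();        // reports a failed launch
    }
    if (err == cudaSuccess) err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);                      // cudaFree(NULL) is a no-op
    cudaFree(d_out);
    return err;
}
```

Staging the tile in shared memory means each input value is read from global memory once per block, while neighbouring threads reuse it from the faster on-chip copy.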

Data-parallel Programming

- Think of the GPU as a massively-threaded co-processor
- Write "kernel" functions that execute on the device, processing multiple data elements in parallel
- Keep it busy! (massive threading)
- Keep your data close! (local memory)

Requirements

Hardware
- CUDA-capable NVIDIA graphics card
- PCI-Express slot

Software & Tools
- CUDA device driver
- CUDA toolkit : nvcc (compiler), …
- CUDA SDK
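One way to check that these requirements are met is to ask the CUDA runtime which devices it can see. The following is a minimal sketch using the runtime calls `cudaGetDeviceCount` and `cudaGetDeviceProperties`; the helper names `count_cuda_devices` and `print_cuda_devices` are hypothetical.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Returns the number of CUDA-capable devices, or -1 if the runtime reports an error
// (for example, when no suitable driver is installed).
int count_cuda_devices(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return count;
}

// Prints the name, compute capability, and global memory size of each device.
void print_cuda_devices(void)
{
    int count = count_cuda_devices();
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
            continue;
        printf("Device %d: %s, compute capability %d.%d, %zu MB global memory\n",
               dev, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / (1024 * 1024));
    }
}
```

Calling `print_cuda_devices()` from `main` in a `.cu` file compiled with nvcc lists the available cards; a return value of 0 or -1 from `count_cuda_devices()` suggests the hardware or driver requirement is not satisfied.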


Host vs. Device

Host : main computer (CPU + main memory)
Device : graphics card (GPU + graphics memory)

CUDA source code is written in C/C++ (file name ~.cu) and consists of two parts:
- host code : runs on the CPU
- device code ("kernel") : runs on the GPU

Compile
- nvcc VectorAdd.cu

How to compute

- Allocate the variables used by the CPU in main memory.
- Allocate the variables used by the GPU in the graphics card's memory.
- Copy the data from the host computer's main memory to the graphics card's memory.
- The GPU creates thousands to tens of millions of threads and performs the computation using the graphics card's memory.
- Copy the results back to the host computer's main memory; the CPU then uses them for further work or prints them, and the job is done.

Initially:
[Figure: array in the Host's Memory; GPU Card's Memory]

Allocate Memory in the GPU card
[Figure: array in the Host's Memory; array_d in the GPU Card's Memory]

Copy content from the host's memory to the GPU card memory
[Figure: array (Host's Memory) copied to array_d (GPU Card's Memory)]

Execute code on the GPU
[Figure: GPU MPs working on array_d in the GPU Card's Memory; array in the Host's Memory]

Copy results back to the host memory
[Figure: array_d (GPU Card's Memory) copied back to array (Host's Memory)]

// VectorAdd.cu
#include <stdio.h>
#include <stdlib.h>

// device code (kernel)
__global__ void VectorAdd( int*a, int*b, int*c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    c[tid] = a[tid] + b[tid];
}

int main()
{
    const int size = 512*65535;
    const int BufferSize = size*sizeof(int);
    int *InputA, *InputB, *Result;

    InputA = (int*)malloc(BufferSize); // Assign host memory
    InputB = (int*)malloc(BufferSize);
    Result = (int*)malloc(BufferSize);

    int i = 0;
    int* dev_A; int* dev_B; int* dev_R;

    for( int i = 0; i < size; i++) { // Input data
        InputA[i] = i; InputB[i] = i; Result[i] = 0;
    }

    cudaMalloc((void**)&dev_A, size*sizeof(int)); // Assign device memory
    cudaMalloc((void**)&dev_B, size*sizeof(int));
    cudaMalloc((void**)&dev_R, size*sizeof(int));

    // Transfer data from host memory to device memory
    cudaMemcpy(dev_A, InputA, size*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_B, InputB, size*sizeof(int), cudaMemcpyHostToDevice);

    // Create 65535x512 threads and perform computation on GPU
    VectorAdd<<<65535,512>>>(dev_A, dev_B, dev_R);

    // Transfer data from device memory to host memory
    cudaMemcpy(Result, dev_R, size*sizeof(int), cudaMemcpyDeviceToHost);

    // Print results.
    for( i = 0; i < 5; i++) {
        printf(" Result[%d] : %d\n",i,Result[i]);
    }
    printf(" ......\n");
    for( i = size-5; i < size; i++) {
        printf(" Result[%d] : %d\n",i,Result[i]);
    }

    // Free device memory
    cudaFree(dev_A); cudaFree(dev_B); cudaFree(dev_R);

    // Free host memory
    free(InputA); free(InputB); free(Result);

    return 0;
}
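The VectorAdd program above launches exactly one thread per element and does not check any CUDA return codes. As a hedged variant, the sketch below adds a bounds guard so that the array length need not be a multiple of the block size, and returns the first error reported by the runtime. The names `VectorAddGuarded` and `vector_add_on_gpu` are hypothetical and do not come from the slides.

```cuda
#include <cuda_runtime.h>

// Kernel: one thread per element. The guard lets the grid be larger than n.
__global__ void VectorAddGuarded(const int* a, const int* b, int* c, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        c[tid] = a[tid] + b[tid];
}

// Hypothetical host wrapper: adds two host arrays of length n > 0 on the GPU and
// returns the first CUDA error encountered, or cudaSuccess.
cudaError_t vector_add_on_gpu(const int* a, const int* b, int* c, int n)
{
    const int threads = 512;                          // threads per block, as on the slide
    const int blocks = (n + threads - 1) / threads;   // round up to cover all n elements
    const size_t bytes = (size_t)n * sizeof(int);
    int *dev_A = NULL, *dev_B = NULL, *dev_R = NULL;

    cudaError_t err = cudaMalloc((void**)&dev_A, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&dev_B, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&dev_R, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dev_A, a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dev_B, b, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        VectorAddGuarded<<<blocks, threads>>>(dev_A, dev_B, dev_R, n);
        err = cudaGetLastError();                     // catches an invalid launch configuration
    }
    // cudaMemcpy waits for the kernel to finish before copying the result back.
    if (err == cudaSuccess) err = cudaMemcpy(c, dev_R, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_A); cudaFree(dev_B); cudaFree(dev_R); // cudaFree(NULL) is a no-op
    return err;
}
```

A caller can compare the return value with `cudaSuccess` and print `cudaGetErrorString(err)` on failure, which makes a failed allocation or launch visible instead of leaving stale data in the result array.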

Some Example : 1024 x 1024 matrix multiplication

[pspark@para kias]$ ./MatrixMul-c
Matrix C (Results)
0.389147 0.418741 : 257.658
0.574162 0.669713 : 254.338
0.674025 0.867991 : 261.301
0.468286 0.619271 : 256.432
Total elapsed time on the CPU chip 10.3449

[pspark@para kias]$ ./MatrixMul-cuda
grid : 32 32 : block : 32 32
Matrix C (Results)
0.389147 0.418741 : 2.93874e-39
0.574162 0.669713 : 3.30608e-39
0.674025 0.867991 : 3.67342e-39
0.468286 0.619271 : 4.04076e-39
Total elapsed time on the GPU card 0.0469801

Note: the values in the last column of the GPU run (around 3e-39) do not match those of the CPU run (around 257), so the GPU timing may not correspond to a correctly computed product.
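The source of MatrixMul-cuda is not shown in the slides. As an illustration only, a naive matrix-multiplication kernel in which each thread computes one element of C could look like the sketch below. The names `MatMulNaive` and `matmul_on_gpu` and the 16 x 16 block size are assumptions for this example. The error check after the launch matters here: a 32 x 32 block holds 1024 threads, and devices whose per-block limit is lower reject such a launch, leaving the output buffer unmodified.

```cuda
#include <cuda_runtime.h>

#define TILE 16   // assumed block edge: 16 x 16 = 256 threads per block

// Naive kernel: each thread computes one element of C = A * B (n x n, row-major).
__global__ void MatMulNaive(const float* A, const float* B, float* C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

// Hypothetical host wrapper for n > 0: returns the first CUDA error, or cudaSuccess.
cudaError_t matmul_on_gpu(const float* hA, const float* hB, float* hC, int n)
{
    float *dA = NULL, *dB = NULL, *dC = NULL;
    const size_t bytes = (size_t)n * n * sizeof(float);

    cudaError_t err = cudaMalloc((void**)&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&dC, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(TILE, TILE);
        dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
        MatMulNaive<<<grid, block>>>(dA, dB, dC, n);
        err = cudaGetLastError();        // reports an invalid launch configuration
    }
    // The blocking copy also surfaces errors raised while the kernel was running.
    if (err == cudaSuccess) err = cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return err;
}
```

Comparing a few elements of the GPU result against the CPU result, as the two printouts above allow, is a simple sanity check before trusting a timing.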

References

미루웨어 (Miruware)
http://www.miruware.com/

NVidia Developer CUDA Zone
http://developer.nvidia.com/category/zone/cuda-zone
http://ko.wikipedia.org/wiki/CUDA

OpenCL
http://www.khronos.org/opencl/
http://ko.wikipedia.org/wiki/OpenCL

Intel Larrabee
http://ko.wikipedia.org/wiki/%EB%9D%BC%EB%9D%BC%EB%B9%84_(%EB%A7%88%EC%9D%B4%ED%81%AC%EB%A1%9C%EC%95%84%ED%82%A4%ED%85%8D%EC%B2%98)

A formidable rival for AMD and nVIDIA… Intel's "Larrabee"
http://uzys2011.tistory.com/337

Larrabee GPU development ultimately halted
http://www.kbench.com/hardware/?no=84965

Quantum computers
http://mirror.enha.kr/wiki/%EC%96%91%EC%9E%90%EC%BB%B4%ED%93%A8%ED%84%B0

D-Wave Systems
http://www.dwavesys.com/

Google acquires a D-Wave 2
http://www.zdnet.co.kr/news/news_view.asp?artice_id=20130704161219

Brook+

SC07 BOF Session

November 13, 2007

What is Brook+?

Brook is an extension to the C-language for stream programming originally developed by Stanford University

Brook+ is an implementation by AMD of the Brook GPU spec on AMD's compute abstraction layer with some enhancements


Example

kernel void sum(float a<>, float b<>, out float c<>)
{
    c = a + b;
}

int main(int argc, char** argv)
{
    int i, j;
    float a<10, 10>;
    float b<10, 10>;
    float c<10, 10>;

    float input_a[10][10];
    float input_b[10][10];
    float input_c[10][10];

    for(i=0; i<10; i++) {
        for(j=0; j<10; j++) {
            input_a[i][j] = (float) i;
            input_b[i][j] = (float) j;
        }
    }

    streamRead(a, input_a);
    streamRead(b, input_b);

    sum(a, b, c);

    streamWrite(c, input_c);
    ...
}

Kernels – Program functions that operate on stream elements

Streams – collection of data elements of the same type which can be operated on in parallel.

Brook+ access functions – streamRead and streamWrite

Brook+ Compiler

Converts Brook+ files into C++ code. Kernels, written in C, are compiled to AMD’s IL code for the GPU or C code for the CPU.


Brook+ Runtime

IL code is executed on the GPU. The backend is written in CAL.


Brook+ Features

Brook+ is an extension to the Brook for GPUs source code.

Features of Brook for GPUs relevant to modern graphics hardware are maintained.

Kernels are compiled to AMD’s IL.

Runtime uses CAL for the GPU backend.

Original CPU backend also included.


Folding@Home Stats

Folding@Home client using Brook+

Currently 39 TFLOPS on 664 GPU clients

Avg. 60 GFLOPS per GPU client

Compared to:

Avg. 25 GFLOPS per PS3 client

Avg. 1 GFLOPS per CPU client


Brook+ Release

Brook+ package:

– Compiler and runtime binaries

– Source code and build environments

– Sample applications

Source code released under the BSD License.

Project will also reside on SourceForge.net.


Brook+ Moving Forward

Double precision - FireStream 9170

Mem-export (scatter)

Graphics API interoperability

Multi-GPU support

Other operating systems (Linux, Vista, 64-bit)


Trademark Attribution

AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.

©2007 Advanced Micro Devices, Inc. All rights reserved.

DISCLAIMER

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.