Page 1: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

synergy.cs.vt.edu

VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Shucai Xiao (1), Pavan Balaji (2), Qian Zhu (3), Rajeev Thakur (2), Susan Coghlan (2), Heshan Lin (1), Gaojin Wen (4), Jue Hong (4), and Wu-chun Feng (1)

1. Virginia Tech
2. Argonne National Laboratory
3. Accenture Technology Labs
4. Chinese Academy of Sciences

Page 2: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Motivation
• GPUs are widely used as accelerators for scientific computation
  – Many applications are parallelized on GPUs
  – Speedup is reported compared to execution on CPUs

Page 3: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

GPU-Based Supercomputers

Page 4: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Challenges of GPU Computing
• Provisioning limitations
  – Not all computing nodes are configured with GPUs
    • Budget and power consumption considerations
    • Multiple stages of investment
• Programmability
  – Current GPU programming models: CUDA and OpenCL
  – CUDA and OpenCL only support the use of local GPUs

Page 5: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Our Contributions
• Virtual OpenCL (VOCL) framework for transparent virtualization of GPUs
  – Remote GPUs look like "virtual" local GPUs
    • A program can use non-local GPUs
    • A program can use more GPUs than can be installed locally
• Efficient resource management
  – Optimization of data transfer across different machines

[Figure: diagram with layers labeled OpenCL, VOCL, and MPI]

Page 6: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Outline
• Motivation and Contributions
• Related Work
• VOCL Framework
• VOCL Optimization
• Experimental Results
• Conclusion & Future Work

Page 7: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Existing Frameworks for GPU Virtualization
• rCUDA
  – Good performance
    • Relative performance overhead is about 2% compared to execution on a local GPU (GeForce 9800)
  – Lack of support for CUDA C extensions
    • __kernel<<<….>>>()
  – Partial support for asynchronous data transfer
• MOSIX-VCL
  – Transparent virtualization
  – Large overhead even for local GPUs
    • Average overhead: local GPU 25.95%; remote GPU 317.42%
  – No support for asynchronous data transfer

Page 9: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Virtual OpenCL (VOCL) Framework Components
• VOCL library and proxy process

[Figure: block diagram. Local node: Application, OpenCL API, VOCL Library, Native OpenCL Library, GPU. Remote node: Proxy, Native OpenCL Library, GPU. The two nodes are connected through MPI.]

Page 10: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

VOCL Library
• Located on each local node
• Implements OpenCL functionality
  – Application Programming Interface (API) compatibility
    • API functions in VOCL have the same interface as those in OpenCL
    • VOCL is transparent to application programs
  – Application Binary Interface (ABI) compatibility
    • No recompilation is needed
    • Needs relinking for static libraries
    • Uses an environment variable to preload the library for dynamic libraries
• Deals with both local and remote GPUs in a system
  – Local GPUs: calls native OpenCL functions
  – Remote GPUs: uses MPI API functions to send function calls to remote nodes

Page 11: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

VOCL Abstraction: GPUs on Multiple Nodes
• OpenCL object handle value
  – Same node: each OpenCL object has a unique handle value
  – Different nodes: different OpenCL objects could share the same handle value
• VOCL abstraction
  – VOCL object

    struct voclObj {
        voclHandle vocl;
        oclHandle ocl;
        MPI_Comm com;
        int nodeIndex;
    };

  – Each OpenCL object is translated to a VOCL object with a different handle value

[Figure: native OpenCL libraries on two GPU nodes return handles OCLH1, OCLH2, and OCLH3, where OCLH1 != OCLH2 and OCLH2 == OCLH3. The VOCL library maps them to distinct VOCL handles VOCLH1, VOCLH2, and VOCLH3 for the application.]

Page 12: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

VOCL Proxy
• Daemon process: initialized by the administrator
• Located on each remote node
  – Receives data communication requests (a separate thread)
  – Receives input data from and sends output data to the application process
  – Calls native OpenCL functions for GPU computation

[Figure: the application and VOCL library on the local node communicate over MPI with a proxy on remote node 1 and a proxy on remote node 2; each proxy uses the native OpenCL library to drive the GPUs on its node.]

Page 14: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Overhead in VOCL
• Local GPUs
  – Translation between OpenCL and VOCL handles
  – Overhead is negligible
• Remote GPUs
  – Translation between VOCL and OpenCL handles
  – Data communication between different machines

[Figure: call paths through VOCL and OpenCL to the GPU, on the local node (host) and on the remote node]

Page 15: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Data Transfer: Between Host Memory and Device Memory
• Pipelining approach
  – Single block: each stage is transferred after another
  – Multiple blocks: the transfer of the first stage of one block can be overlapped with the second stage of another block
  – Pre-allocate a buffer pool for data storage in the proxy

[Figure: numbered transfer stages 1 to 4 between the local node (CPU and memory) and the remote node (CPU and memory with buffer pool, GPU and memory)]

Page 16: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Environment for Program Execution
• Node configuration
  – Local node
    • 2 Magny-Cours AMD CPUs
    • 64 GB memory
  – Remote node
    • Host: 2 Magny-Cours AMD CPUs (64 GB memory)
    • 2 Tesla M2070 GPUs (6 GB global memory each)
    • CUDA 3.2 (OpenCL 1.1 specification)
  – Network connection: QDR InfiniBand

[Figure: the application runs on the local node (CPU2, CPU3) and the proxy on the remote node (CPU0, CPU1, GPU0, GPU1); each node attaches to InfiniBand over PCIe.]

Page 17: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Micro-benchmark Results
• Continuously transfer a window of data blocks one after another
• Call the clFinish() function to wait for completion

    for (i = 0; i < N; i++) {
        clEnqueueWriteBuffer();
    }
    clFinish();

[Figure: GPU memory write bandwidth. X axis: data block size (bytes), 512K to 32768K. Left Y axis: bandwidth (GB/s), 0.0 to 3.0. Right Y axis: percentage of the local GPU bandwidth, 0% to 100%. Series: OpenCL, local; VOCL, remote, pipelining; VOCL, remote, no pipelining; percentage of the local GPU bandwidth with pipelining; percentage of the local GPU bandwidth without pipelining.]

• Bandwidth increases from 50% to 80% of that of the local GPU

[Figure: test setup with the application on the local node (CPU2, CPU3) and the proxy on the remote node (CPU0, CPU1, GPU0, GPU1), connected via InfiniBand]

Page 18: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Kernel Argument Setting

• Overhead of kernel execution for aligning one pair of sequences (6K letters) with Smith-Waterman

Function name            Runtime, local GPU   Runtime, remote GPU   Overhead   Number of calls
clSetKernelArg                         4.33                420.45     416.02            86,028
clEnqueueNDRangeKernel              1210.85               1316.92     106.07            12,288
Total time                          1215.18               1737.37     522.19
(Unit: ms)

• Total overhead relative to the local GPU: 42.97%

[Figure: each clSetKernelArg() call and the clEnqueueNDRangeKernel() call is sent separately from the local node to the remote node]

    int a; cl_mem b;
    b = clCreateBuffer(…,…);
    clSetKernelArg(hFoo, 0, sizeof(int), &a);
    clSetKernelArg(hFoo, 1, sizeof(cl_mem), &b);
    clEnqueueNDRangeKernel(…,hFoo,…);

    __kernel void foo(int a, __global int *b) {}

Page 19: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Kernel Argument Setting Caching

• Overhead of functions related to kernel execution for aligning the same pair of sequences

Function name            Runtime, local GPU   Runtime, remote GPU   Overhead   Number of calls
clSetKernelArg                         4.33                  4.03      -0.30            86,028
clEnqueueNDRangeKernel              1210.85               1344.01     133.71            12,288
Total time                          1215.18               1348.04     132.71
(Unit: ms)

• Total overhead relative to the local GPU: 10.92%

[Figure: clSetKernelArg() calls store arguments locally on the local node; only clEnqueueNDRangeKernel() is sent to the remote node]

Page 21: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Evaluation via Application Kernels
• Three application kernels
  – Matrix multiplication
  – Matrix transpose
  – Smith-Waterman
• Program execution time
• Relative overhead
• Relationship to time percentage of kernel execution

[Figure: test setup with the application on one node and the proxy with one GPU on another node, connected via InfiniBand and PCIe]

Page 22: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Matrix Multiplication
• Multiple problem instances are issued consecutively

    for (i = 0; i < N; i++) {
        clEnqueueWriteBuffer();
        clEnqueueNDRangeKernel();
        clEnqueueReadBuffer();
    }
    clFinish();

[Figure: Time percentage of kernel execution. X axis: matrix size, 1K x 1K to 6K x 6K. Y axis: percentage of kernel execution time, 0% to 100%.]

[Figure: Kernel execution time and performance overhead. X axis: matrix size, 1K x 1K to 6K x 6K. Left Y axis: program execution time (ms), ticks at 10, 100, 1000, and 10000. Right Y axis: percentage of slowdown, 0.0% to 4.5%. Series: OpenCL; VOCL, local; VOCL, remote; percentage of slowdown.]

Page 23: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Matrix Transpose
• Multiple problem instances are issued consecutively

    for (i = 0; i < N; i++) {
        clEnqueueWriteBuffer();
        clEnqueueNDRangeKernel();
        clEnqueueReadBuffer();
    }
    clFinish();

[Figure: Time percentage of kernel execution. X axis: matrix size, 1K x 1K to 6K x 6K. Y axis: time percentage of kernel execution, 0% to 8%.]

[Figure: Kernel execution time and performance overhead. X axis: matrix size, 1K x 1K to 6K x 6K. Left Y axis: program execution time (ms), 0 to 350. Right Y axis: percentage of slowdown, 0% to 60%. Series: OpenCL; VOCL, local; VOCL, remote; percentage of slowdown.]

Page 24: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Smith-Waterman

    for (i = 0; i < N; i++) {
        clEnqueueWriteBuffer();
        for (j = 0; j < M; j++) {
            clEnqueueNDRangeKernel();
        }
        clEnqueueReadBuffer();
    }
    clFinish();

[Figure: Time percentage of kernel execution. X axis: 1K x 1K to 6K x 6K. Y axis: time percentage of kernel execution, 0% to 90%.]

[Figure: Kernel execution time and performance overhead. X axis: sequence size, 1K to 6K. Left Y axis: program execution time (s), 0.0 to 1.0. Right Y axis: percentage of slowdown, 0% to 180%. Series: OpenCL; VOCL, local; VOCL, remote; percentage of slowdown.]

Two observations
1. Smith-Waterman (SW) needs many kernel launches, so a large number of small messages are transferred
2. MPI in the proxy is initialized to support multiple threads, which handles the transfer of small messages poorly

Page 26: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Conclusions
• Virtual OpenCL framework
  – Based on the OpenCL programming model
  – Internally uses MPI for data communication
• VOCL framework optimization
  – Kernel argument caching
  – GPU memory write and read pipelining
• Application kernel verification
  – SGEMM, n-body, matrix transpose, and Smith-Waterman
  – Reasonable virtualization cost

Page 27: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Future Work
• Extensions to the VOCL framework
  – Live task migration (already done)
  – Super-GPU
  – Performance model for GPU utilization
  – Resource management strategies
  – Energy-efficient computing

Page 28: VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

For More Information
• Shucai Xiao
  – Email -- [email protected]
• Synergy
  – Website -- http://synergy.cs.vt.edu/

Thanks
Questions?