Is remote GPU virtualization useful?
Federico Silla
Technical University of Valencia, Spain
HPC Advisory Council Spain Conference 2015 2/57
Outline
1st: What is “remote GPU virtualization”?
HPC Advisory Council Spain Conference 2015 3/57
We deal with GPUs, obviously!
HPC Advisory Council Spain Conference 2015 4/57
Basics of GPU computing
Basic behavior of CUDA
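To make the basic behavior concrete, here is a minimal, illustrative CUDA sketch of the usual pattern: copy input data to the GPU, launch a kernel, copy the results back. The toy kernel and names are ours, not from the talk.

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Toy kernel (ours, not from the talk): add 1.0f to every element.
__global__ void addOne(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float* h = (float*)std::malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 0.0f;

    float* d;
    cudaMalloc(&d, bytes);                                  // allocate GPU memory
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);        // copy input over PCI-e (H2D)
    addOne<<<(n + 255) / 256, 256>>>(d, n);                 // run the kernel on the GPU
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);        // copy results back (D2H)
    cudaFree(d);

    std::printf("h[0] = %f\n", h[0]);                       // expected: 1.000000
    std::free(h);
    return 0;
}
```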
HPC Advisory Council Spain Conference 2015 5/57
Remote GPU virtualization
A software technology that enables a more flexible use of GPUs in computing facilities
[Diagram: a node labeled “No GPU” making use of GPUs located in other nodes across the network]
HPC Advisory Council Spain Conference 2015 6/57
Basics of remote GPU virtualization
HPC Advisory Council Spain Conference 2015 7/57
Basics of remote GPU virtualization
HPC Advisory Council Spain Conference 2015 8/57
Remote GPU virtualization allows a new vision of a GPU deployment, moving from the usual cluster configuration:
[Diagram: physical configuration — each node contains a CPU, main memory, a network interface, and its own PCI-e attached GPUs, and the nodes are joined by an interconnection network]
to the following one:
[Diagram: logical configuration — the GPUs are decoupled from the nodes; logical connections over the interconnection network allow any node to use any GPU in the cluster]
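Remote GPU virtualization frameworks generally achieve this logical configuration by intercepting the CUDA API on the client side and forwarding each call over the network to a daemon on the node that owns the GPU. The sketch below only illustrates that idea; the message layout and names are hypothetical and are not rCUDA's actual protocol.

```
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical message types for one forwarded CUDA call (illustrative only).
enum class CallId : uint32_t { Malloc, MemcpyH2D, LaunchKernel, MemcpyD2H, Free };

struct CallHeader {
    CallId   id;     // which CUDA call is being forwarded
    uint64_t bytes;  // size of the payload that follows
};

// Stand-in for "ship this buffer to the daemon on the GPU server".
void sendToServer(const std::vector<uint8_t>& msg) {
    std::printf("would send %zu bytes to the remote GPU server\n", msg.size());
}

// Client-side stub: what an intercepted host-to-device copy might turn into.
void forwardMemcpyH2D(const void* src, uint64_t bytes) {
    std::vector<uint8_t> msg(sizeof(CallHeader) + bytes);
    CallHeader hdr{CallId::MemcpyH2D, bytes};
    std::memcpy(msg.data(), &hdr, sizeof(hdr));
    std::memcpy(msg.data() + sizeof(hdr), src, bytes);  // the data travels with the request
    sendToServer(msg);  // the server would run the real cudaMemcpy on its local GPU
}

int main() {
    float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    forwardMemcpyH2D(data, sizeof(data));
    return 0;
}
```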
HPC Advisory Council Spain Conference 2015 9/57
Outline
2nd: Why is “remote GPU virtualization” needed?
HPC Advisory Council Spain Conference 2015 10/57
Outline
What is the problem with GPU-enabled clusters?
HPC Advisory Council Spain Conference 2015 11/57
Characteristics of GPU-based clusters
A GPU-enabled cluster is a set of independent, self-contained nodes that follow the shared-nothing approach:
• Nothing is directly shared among nodes (MPI is required for aggregating computing resources within the cluster, including GPUs)
• GPUs can only be used within the node they are attached to
[Diagram: six independent nodes, each with a CPU, main memory, a network interface, and two PCI-e attached GPUs, linked by an interconnection network]
HPC Advisory Council Spain Conference 2015 12/57
First concern with accelerated clusters
• Applications can only use the GPUs located within their node:
  • Non-accelerated applications keep GPUs idle in the nodes where they use all the cores
[Diagram: the six-node GPU cluster with a CPU-only application spread over four of its nodes]
A CPU-only application spreading over these four nodes would make their GPUs unavailable for accelerated applications
HPC Advisory Council Spain Conference 2015 13/57
Money leakage in current clusters?
For some workloads, GPUs may be idle for significant periods of time:
• Initial acquisition costs not amortized
• Space: GPUs reduce CPU density
• Energy: idle GPUs keep consuming power
[Plot: idle power consumption (Watts) over time (s) for a 1-GPU node and a 4-GPU node; a 25% value is highlighted]
• 1 GPU node: two E5-2620v2 sockets and 32 GB DDR3 RAM, one Tesla K20 GPU
• 4 GPU node: two E5-2620v2 sockets and 128 GB DDR3 RAM, four Tesla K20 GPUs
HPC Advisory Council Spain Conference 2015 14/57
Second concern with accelerated clusters
• Applications can only use the GPUs located within their node:
  • Non-MPI multi-GPU applications running on a subset of nodes cannot make use of the tremendous GPU resources available at other cluster nodes (even if they are idle)
[Diagram: a non-MPI multi-GPU application runs in one node of the six-node cluster; all the GPUs in the other nodes cannot be used by the multi-GPU application being executed]
HPC Advisory Council Spain Conference 2015 15/57
One more concern with accelerated clusters
• Do applications fully exploit the GPUs available in the cluster?
• When a GPU is assigned to an application, the computational resources inside the GPU may not be fully used:
  • the application presents a low level of parallelism
  • CPU code is being executed (GPU assigned ≠ GPU working)
  • GPU cores stall due to lack of data
  • etc.
HPC Advisory Council Spain Conference 2015 16/57
Why is performance lost in GPU clusters?
In summary …
• There are scenarios where GPUs are available but cannot be used
• Accelerated applications do not make use of the GPUs 100% of the time
In conclusion …
• GPU cycles are lost, thus reducing cluster performance
HPC Advisory Council Spain Conference 2015 17/57
We need something more in the cluster
The current model for using GPUs is too rigid.
What is missing is … some flexibility for using the GPUs in the cluster.
HPC Advisory Council Spain Conference 2015 18/57
We need something more in the cluster
The current model for using GPUs is too rigid.
What is missing is … some flexibility for using the GPUs in the cluster:
a way of seamlessly sharing GPUs across nodes in the cluster (remote GPU virtualization).
HPC Advisory Council Spain Conference 2015 19/57
The remote GPU virtualization vision
[Diagram: without GPU virtualization, each node can only use its real local GPUs; with GPU virtualization, every node can access all the GPUs in the cluster as virtualized remote GPUs]
HPC Advisory Council Spain Conference 2015 20/57
Remote GPU virtualization frameworks
Several efforts to implement remote GPU virtualization have been made over the last few years:
• rCUDA (CUDA 7.0)
• GVirtuS (CUDA 3.2)
• DS-CUDA (CUDA 4.1)
• vCUDA (CUDA 1.1)
• GViM (CUDA 1.1)
• GridCUDA (CUDA 2.3)
• V-GPU (CUDA 4.0)
Some of these frameworks are publicly available; the others are not.
HPC Advisory Council Spain Conference 2015 21/57
Remote GPU virtualization frameworks
[Plots: host-to-device and device-to-host bandwidth, for pageable and pinned memory, attained by the frameworks; FDR InfiniBand + Tesla K20]
HPC Advisory Council Spain Conference 2015 22/57
Outline
3rd: Cons of “remote GPU virtualization”?
HPC Advisory Council Spain Conference 2015 23/57
Problem with remote GPU virtualization
The main drawback of remote GPU virtualization is the reduced bandwidth to the remote GPU: the network link between client and remote GPU server typically offers less bandwidth than the PCIe link a local GPU would use (e.g., roughly 6.8 GB/s for FDR InfiniBand versus about 8 GB/s for the PCIe 2.0 x16 slot of a Tesla K20).
[Diagram: a node with no GPU accesses a GPU located in a remote node across the network]
HPC Advisory Council Spain Conference 2015 24/57
Using InfiniBand networks
HPC Advisory Council Spain Conference 2015 25/57
Initial transfers within rCUDA
[Plots: host-to-device and device-to-host bandwidth, pageable and pinned; legend: rCUDA EDR Orig, rCUDA EDR Opt, rCUDA FDR Orig, rCUDA FDR Opt]
HPC Advisory Council Spain Conference 2015 26/57
Optimized transfers within rCUDA
[Plots: host-to-device and device-to-host bandwidth, pageable and pinned; the pinned-memory transfers reach almost 100% of the available bandwidth; legend: rCUDA EDR Orig, rCUDA EDR Opt, rCUDA FDR Orig, rCUDA FDR Opt]
HPC Advisory Council Spain Conference 2015 27/57
rCUDA optimizations on applications
• Several applications executed with CUDA and rCUDA
• K20 GPU and FDR InfiniBand
• K40 GPU and EDR InfiniBand
[Plots: execution time; lower is better]
HPC Advisory Council Spain Conference 2015 28/57
Outline
4th: Pros of “remote GPU virtualization”?
HPC Advisory Council Spain Conference 2015 29/57
1: more GPUs for a single application
GPU virtualization is useful for multi-GPU applications.
[Diagram: without GPU virtualization, only the GPUs in the node can be provided to the application; with GPU virtualization, many GPUs in the cluster can be provided to the application]
HPC Advisory Council Spain Conference 2015 30/57
1: more GPUs for a single application
64 GPUs!
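One reason this scales so naturally is that many multi-GPU CUDA applications simply enumerate whatever devices the runtime reports; under remote GPU virtualization that count can include GPUs located in other nodes. A hedged sketch of the usual enumeration pattern (illustrative only, not taken from any of the benchmarks shown here):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);      // with remote GPU virtualization this may include remote GPUs
    std::printf("visible GPUs: %d\n", count);

    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);          // select one (possibly remote) GPU
        float* buf;
        cudaMalloc(&buf, 1 << 20);   // each device gets its share of the work
        // ... launch this device's portion of the computation here ...
        cudaFree(buf);
    }
    return 0;
}
```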
HPC Advisory Council Spain Conference 2015 31/57
1: more GPUs for a single application
MonteCarlo Multi-GPU (from the NVIDIA samples)
[Plots: execution time (lower is better) and throughput (higher is better); FDR InfiniBand + NVIDIA Tesla K20]
HPC Advisory Council Spain Conference 2015 32/57
2: busy CPU cores do not block GPUs
[Diagram: physical vs. logical configuration — even when a node's CPU cores are fully busy, its GPUs remain usable from other nodes through logical connections over the interconnection network]
HPC Advisory Council Spain Conference 2015 33/57
3: easier cluster upgrade
• A cluster without GPUs may be easily upgraded to use GPUs with rCUDA
[Diagram: a cluster in which no node has a GPU]
HPC Advisory Council Spain Conference 2015 34/57
3: easier cluster upgrade
• A cluster without GPUs may be easily upgraded to use GPUs with rCUDA
[Diagram: the same cluster after adding GPU-enabled nodes whose GPUs become accessible to every node]
HPC Advisory Council Spain Conference 2015 35/57
4: GPU task migration
Box A has 4 GPUs but only one is busy; Box B has 8 GPUs but only two are busy.
1. Move the jobs from Box B to Box A and switch off Box B
2. Migration should be transparent to applications (decided by the global scheduler)
TRUE GREEN GPU COMPUTING
HPC Advisory Council Spain Conference 2015 36/57
5: virtual machines can easily access GPUs
• With PCI passthrough, the GPU is assigned exclusively to a single virtual machine
• Concurrent usage of the GPU is not possible
HPC Advisory Council Spain Conference 2015 37/57
5: virtual machines can easily access GPUs
[Diagrams: with rCUDA, virtual machines access remote GPUs over the network, both when a high-performance network is available and when only a low-performance network is available]
HPC Advisory Council Spain Conference 2015 38/57
Application performance with KVM
[Plots: performance of LAMMPS, CUDA-MEME, CUDASW++, and GPU-BLAST under KVM; FDR InfiniBand + Tesla K20]
HPC Advisory Council Spain Conference 2015 39/57
Application performance with Xen
[Plots: performance of LAMMPS, CUDA-MEME, CUDASW++, and GPU-BLAST under Xen; FDR InfiniBand + Tesla K20]
HPC Advisory Council Spain Conference 2015 40/57
Overhead of rCUDA within VMs
[Plots: overhead of rCUDA inside KVM and Xen virtual machines; FDR InfiniBand + Tesla K20; overheads shown range from 0.07% to 6.1%]
HPC Advisory Council Spain Conference 2015 41/57
Outline
5th: What happens at the cluster level?
HPC Advisory Council Spain Conference 2015 42/57
rCUDA at the cluster level: Slurm
• GPUs can be shared among jobs running in remote clients
• A job scheduler is required for coordination; Slurm was selected
[Diagram: Slurm dispatching App 1 through App 9 onto the cluster, sharing its GPUs among them]
HPC Advisory Council Spain Conference 2015 43/57
Ongoing work: studying rCUDA+Slurm
Applications used for the tests (execution time; GPUs used; GPU memory):
• GPU-Blast (21 seconds; 1 GPU; 1599 MB)
• LAMMPS (15 seconds; 4 GPUs; 876 MB)
• MCUDA-MEME (165 seconds; 4 GPUs; 151 MB)
• GROMACS (167 seconds)
• NAMD (11 minutes)
• BarraCUDA (10 minutes; 1 GPU; 3319 MB)
• GPU-LIBSVM (5 minutes; 1 GPU; 145 MB)
• MUMmerGPU (5 minutes; 1 GPU; 2804 MB)
The applications span short and long execution times; GROMACS and NAMD were run as non-GPU applications. They are grouped into two sets, Set 1 and Set 2.
Three workloads: Set 1, Set 2, Set 1 + Set 2
Three workload sizes: Small (100 jobs), Medium (200 jobs), Large (400 jobs)
HPC Advisory Council Spain Conference 2015 44/57
Test bench for studying rCUDA+Slurm
• Nodes with two E5-2620v2 Intel Xeon sockets, 32 GB RAM, and a Tesla K20 GPU
• FDR InfiniBand based cluster
• Three cluster sizes: 4, 8, and 16 GPU nodes, each with an additional node running the Slurm scheduler
HPC Advisory Council Spain Conference 2015 45/57
Results for Set 1: execution time
[Plots: execution time on the 4-, 8-, and 16-node clusters; lower is better]
HPC Advisory Council Spain Conference 2015 46/57
Results for Set 1: GPU utilization
[Plots: GPU utilization on the 4-, 8-, and 16-node clusters; higher is better]
HPC Advisory Council Spain Conference 2015 47/57
Results for Set 1: energy consumption
[Plots: energy consumption on the 4-, 8-, and 16-node clusters; lower is better]
HPC Advisory Council Spain Conference 2015 48/57
Reducing the number of GPUs
[Plot: execution time when the cluster runs the workload with fewer GPUs; reductions of 35%, 39%, and 41% are highlighted; lower is better]
HPC Advisory Council Spain Conference 2015 49/57
Reducing the number of GPUs
[Plot: the same execution-time results; reductions of 35%, 39%, and 41%; lower is better]
Cost (in Euros):
a) 16 nodes + 16 K20: 16·2500 + 16·2000 = 72,000 Euros
b) 16 nodes + 4 K20: 16·2500 + 4·2000 = 48,000 Euros
Reducing acquisition expenses by 33% (48,000 vs. 72,000 Euros), performance is still 40% better.
HPC Advisory Council Spain Conference 2015 50/57
Energy when removing GPUs
[Plot: energy consumed when the cluster runs the workload with fewer GPUs; reductions of 39%, 42%, and 44% are highlighted; lower is better]
HPC Advisory Council Spain Conference 2015 51/57
Energy when removing GPUs
[Plot: the same energy results; reductions of 39%, 42%, and 44%; lower is better]
Costs: reducing acquisition expenses by 33%, performance is still 40% better.
Additionally, the electricity bill is significantly reduced.
HPC Advisory Council Spain Conference 2015 52/57
GPU utilization when removing GPUs
[Plot: GPU utilization when the cluster runs the workload with fewer GPUs; higher is better]
HPC Advisory Council Spain Conference 2015 53/57
Outline
6th: … in summary …
HPC Advisory Council Spain Conference 2015 54/57
Pros and cons of rCUDA
• Pros:
  1. Many GPUs for a single application
  2. Concurrent GPU access for virtual machines
  3. Increased cluster throughput
  4. Similar performance with a smaller investment
  5. Easier (cheaper) cluster upgrade
  6. Migration of GPU jobs
  7. Reduced energy consumption
  8. Increased GPU utilization
• Cons:
  1. Reduced bandwidth to the remote GPU (really a concern??)
HPC Advisory Council Spain Conference 2015 55/57
Back to the initial question
… is remote GPU virtualization useful?
HPC Advisory Council Spain Conference 2015 56/57
Get a free copy of rCUDA at http://www.rcuda.net
@rcuda_
More than 650 requests worldwide
rCUDA is a development by the Technical University of Valencia
Team: Sergio Iserte, Fernando Campos, Carlos Reaño, Javier Prades
HPC Advisory Council Spain Conference 2015 57/57
Thanks! Questions?
rCUDA is a development by Technical University of Valencia