
Distributing Computation to Large GPU Clusters

What is this about?

DiCE: a software library for writing applications that scale to many GPUs and CPUs in a cluster


Used since 2003 in our rendering products...

NVIDIA indeX and NVIDIA Iray

(Images courtesy of Vyacheslav Serov, Rüdiger Raab, and Thomas Zancker.)

Why are we presenting this here?

DiCE is a base technology in indeX

— Clustering / networking / distribution based on DiCE

DiCE API exposed by indeX

— Distribute pre-computation of data for indeX

— Do your own computation…

Design Goals

"Provide a software library to be used by rendering experts to write scalable software for GPU clusters."

— Not required: low-level parallelization / networking knowledge

— High level of abstraction / easy to use...

— Not specific to a particular domain (e.g. rendering)

— High performance, meant for interactive applications

Other solutions...

Unique Combination of Features

Simple programming model

Ease of deployment / commodity hardware

Unified multi-core and cluster parallelization

GPU support

Dynamic clustering

Focus on interactive applications

Multi-user support e.g. for web services

Available on Windows, Linux, Mac OS X

Overview

Networking / Clustering

Datastore

Job System

C++ API

Application


DiCE and indeX

(Diagram: indeX sits on top of the DiCE stack of Networking / Clustering, Datastore, Job System, and C++ API; the application builds on both.)


Parallelization Model

Programmer: split the work into n fragments!

— As independent as possible

— Potentially thousands per "frame"!

No a priori knowledge about resources in the cluster is required!

Goal: distribute the work over all GPUs / CPUs in the cluster

Parallelization Model

Fragmented Job

~ similar to a CUDA kernel

Implement a C++ class:

void execute_fragment(int i, int n) {…}

Called once for every fragment

Ask DiCE to execute the job in n fragments

(Figure: the job split into 20 fragments, numbered 0–19.)
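For concreteness, a minimal sketch of such a job in plain C++. The Fragmented_job base class and the execution call in the final comment are assumptions modeled on the slides, not the exact DiCE interface:

#include <vector>

// Assumed base class modeled on the slides; the real DiCE interface
// may differ.
struct Fragmented_job {
    virtual ~Fragmented_job() {}
    virtual void execute_fragment(int i, int n) = 0;
};

class My_job : public Fragmented_job {
public:
    explicit My_job(int height) : m_height(height), m_framebuf(height) {}

    // Called once for every fragment: fragment i of n fills its own
    // stripe of the framebuffer, independently of all other fragments.
    void execute_fragment(int i, int n) override {
        const int begin = i * m_height / n;
        const int end   = (i + 1) * m_height / n;
        for (int y = begin; y < end; ++y)
            m_framebuf[y] = render_scanline(y);  // stand-in for real work
    }

private:
    float render_scanline(int y) { return static_cast<float>(y); }

    int m_height;
    std::vector<float> m_framebuf;
};

// The application then asks DiCE to run the job in n fragments, e.g.
// (hypothetical call): dice->execute_fragmented(&job, 1000);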

Parallelization Model – Cluster

Not a shared memory model!


Idea: Split execution and integration of results

void execute_remote(int i, int n, OUT) {…}   (runs on a remote host)

void receive_result(int i, int n, IN) {…}   (runs on the origin host)

execute_remote() + receive_result() = execute_fragment()
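A sketch of the same job in the split form. The stub Out_stream / In_stream types stand in for whatever stream types DiCE actually passes as OUT and IN; the tile layout and the work itself are placeholders:

#include <vector>

// Stub stream interfaces standing in for the OUT / IN parameters above;
// the real DiCE types differ.
struct Out_stream { virtual void write(const float* p, int n) = 0; virtual ~Out_stream() {} };
struct In_stream  { virtual void read(float* p, int n) = 0;        virtual ~In_stream() {} };

class My_cluster_job {
public:
    // Runs on whichever remote host fragment i was scheduled to: do the
    // work and stream the result out. No memory is shared between hosts.
    void execute_remote(int i, int n, Out_stream* out) {
        std::vector<float> tile(tile_size(i, n), static_cast<float>(i));  // stand-in work
        out->write(tile.data(), static_cast<int>(tile.size()));
    }

    // Runs on the origin host: read fragment i's result back and
    // integrate it into the local framebuffer.
    void receive_result(int i, int n, In_stream* in) {
        in->read(&m_framebuf[tile_begin(i, n)], tile_size(i, n));
    }

private:
    int tile_begin(int i, int n) { return i * static_cast<int>(m_framebuf.size()) / n; }
    int tile_size(int i, int n)  { return tile_begin(i + 1, n) - tile_begin(i, n); }

    std::vector<float> m_framebuf = std::vector<float>(1024);
};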

Parallelization Model – Single Host

(Figure, shown as a build over several slides: one host with 2 GPUs runs My_job, which holds the Scene, Camera, and Framebuf[] members. The six fragments are scheduled across the GPUs, fragments 0, 1, and 4 on GPU 1 and fragments 2, 3, and 5 on GPU 2, and execute_fragment() is called once for each.)

Parallelization Model – 3 Hosts

(Figure, shown as a build over several slides: three hosts with 2 GPUs each. My_job is replicated to every host; only the origin host, Host 1, keeps the Framebuf[] member, while the remote copies carry just Scene and Camera. The six fragments are scheduled across the cluster: 0 on Host 1 / GPU 1, 1 on Host 2 / GPU 1, 2 on Host 2 / GPU 2, 3 on Host 1 / GPU 2, 4 on Host 3 / GPU 1, and 5 on Host 3 / GPU 2. The local fragments 0 and 3 run execute_fragment(); the remote fragments 1, 2, 4, and 5 run execute_remote() on their hosts, and Host 1 integrates each returned result with receive_result().)

Parallelization Model – Hierarchical

(Figure: jobs can be nested across tiers. A viewer host issues a compositor job to compositor hosts; those issue rendering jobs to render hosts; on each render host, GPU jobs with GPU fragments drive the individual GPUs.)

Datastore

(Overview stack with the Datastore layer highlighted.)

Datastore

In-memory NoSQL datastore for arbitrary C++ objects

Store an object on one host / retrieve it on any host

Data transport (mostly) transparent to the application

Datastore Objects

class My_adder                                   // your class
{
    float m_a;                                   // arbitrary member variables
    int m_b;

    float sum() { return m_a + m_b; }            // arbitrary member functions
};

Datastore Objects

class My_adder : public Element< UUID >          // derive from base class
{
    float m_a;
    int m_b;

    float sum() { return m_a + m_b; }
};

Datastore Objects

class My_adder : public Element< UUID >
{
    float m_a;
    int m_b;

    void serialize(ISerializer* serializer)      // implement serialization
    {
        serializer->write(m_a);
        serializer->write(m_b);
    }
};

Datastore Objects

class My_adder : public Element< UUID >
{
    float m_a;
    int m_b;

    void serialize(ISerializer* serializer);

    void deserialize(IDeserializer* deserializer) // implement deserialization
    {
        deserializer->read(m_a);
        deserializer->read(m_b);
    }
};

Datastore Objects

class My_adder : public Element< UUID >
{
    float m_a;
    int m_b;

    void serialize(ISerializer* serializer);
    void deserialize(IDeserializer* deserializer);
};

register_serializable_class< My_adder >();       // register class
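To see why serialize() / deserialize() are all the datastore needs to move an object between hosts, here is a self-contained toy round-trip through a memory buffer. The Toy_serializer / Toy_deserializer types are stand-ins written for this sketch, not the DiCE ISerializer / IDeserializer interfaces:

#include <cstring>
#include <vector>

// Stand-in stream types written for this sketch only.
struct Toy_serializer {
    std::vector<char> buf;
    template <typename T> void write(const T& v) {
        const char* p = reinterpret_cast<const char*>(&v);
        buf.insert(buf.end(), p, p + sizeof(T));
    }
};
struct Toy_deserializer {
    const char* p;
    template <typename T> void read(T& v) {
        std::memcpy(&v, p, sizeof(T));
        p += sizeof(T);
    }
};

struct My_adder {
    float m_a;
    int m_b;
    void serialize(Toy_serializer* s)     { s->write(m_a); s->write(m_b); }
    void deserialize(Toy_deserializer* d) { d->read(m_a);  d->read(m_b); }
    float sum() const { return m_a + m_b; }
};

int main() {
    My_adder original{1.5f, 2};
    Toy_serializer out;
    original.serialize(&out);               // "send": flatten to bytes

    My_adder copy{};
    Toy_deserializer in{out.buf.data()};
    copy.deserialize(&in);                  // "receive": rebuild the object
    return copy.sum() == original.sum() ? 0 : 1;
}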

Datastore: Cache

Per-host cache for objects

— Accessing an object guarantees it is in the cache!

— If necessary, it is fetched from another host

If the cache is full: throw away objects owned by other hosts (LRU)

— Lets the cluster store more data than a single host could

Configurable redundant storage for handling host failure
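A toy sketch of that eviction policy (illustrative only, not DiCE code). Only objects owned by other hosts are evictable, because a dropped local copy can always be re-fetched from its owner:

#include <list>
#include <unordered_map>

// Toy LRU policy for a per-host object cache.
class Object_cache {
public:
    void touch(int tag) {                     // object was accessed
        auto it = m_pos.find(tag);
        if (it != m_pos.end()) m_lru.erase(it->second);
        m_lru.push_front(tag);
        m_pos[tag] = m_lru.begin();
    }
    int evict() {                             // cache full (assumes non-empty)
        int victim = m_lru.back();            // least recently used
        m_lru.pop_back();
        m_pos.erase(victim);
        return victim;                        // re-fetchable from its owner
    }
private:
    std::list<int> m_lru;                     // most recently used at front
    std::unordered_map<int, std::list<int>::iterator> m_pos;
};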


Datastore Transactions

Important for multi-user operation

ACID

— Atomicity: Transaction commit, abort

— Consistency: Cluster wide locks available

— Isolation: Starting transaction “freezes” view on datastore

— Durability: Redundancy

Transaction Isolation

Isolation based on multi-version capability

Copy-on-write

(Figure, shown as a build: the store holds object versions A5 and X9. Transaction T7 keeps reading X9 while transaction T8 writes a new copy, X10; once T7 finishes, the old version X9 is discarded and only A5 and X10 remain.)
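A toy illustration of the multi-version idea (not DiCE code): each committed write creates a new version of an object, and a transaction only ever sees versions that existed when it started:

#include <map>
#include <string>

class Versioned_store {
public:
    // A transaction started at time `start` sees the newest version <= start
    // (assumes such a version exists).
    int read(const std::string& name, int start) const {
        const std::map<int, int>& versions = m_objects.at(name);
        auto it = versions.upper_bound(start);   // first version > start
        --it;                                    // newest version <= start
        return it->second;
    }
    // A commit at time `commit` adds a new version; older readers are
    // unaffected (copy-on-write).
    void write(const std::string& name, int commit, int value) {
        m_objects[name][commit] = value;
    }
private:
    std::map<std::string, std::map<int, int>> m_objects;  // name -> versions
};

int main() {
    Versioned_store s;
    s.write("X", 9, 42);    // X9 exists before the new transaction
    s.write("X", 10, 43);   // a later transaction commits the copy X10
    // An older transaction still sees X9; a newer one sees X10:
    bool ok = s.read("X", 9) == 42 && s.read("X", 10) == 43;
    return ok ? 0 : 1;
}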

Networking / Clustering

(Overview stack with the Networking / Clustering layer highlighted.)

Networking / Clustering

Handles cluster building and data transfers

— Self-organizing, dynamic addition and removal of hosts

— Tested with up to 1000 hosts

— Several networking protocols for different environments…


Network Layer: UDP with Multicast

Unicast: Send to each host

Multicast: Like radio, send once, received by many

Network Layer: UDP with Multicast

Self Organization:

— Multicast address identifies cluster

— Multicast “beacon” packets to announce to other hosts

— “Election” process to elect one synchronizer

Multicast / unicast used for bulk data transfers

— Especially effective for many hosts
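As a concrete picture of the beacon mechanism, a sketch with plain BSD sockets on Linux/POSIX. The port, group address, and payload are made-up examples; DiCE's actual wire protocol is not shown here:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    // One UDP socket is enough to announce ourselves to the group.
    int s = socket(AF_INET, SOCK_DGRAM, 0);

    sockaddr_in cluster{};                                // zero-initialized
    cluster.sin_family = AF_INET;
    cluster.sin_port   = htons(10000);                    // made-up port
    inet_pton(AF_INET, "239.0.0.1", &cluster.sin_addr);   // the multicast address identifies the cluster

    const char beacon[] = "DICE_BEACON host=alpha";       // made-up payload
    for (int i = 0; i < 3; ++i) {                         // re-announce periodically
        sendto(s, beacon, sizeof beacon, 0,
               reinterpret_cast<const sockaddr*>(&cluster), sizeof cluster);
        sleep(1);
    }
    close(s);
    return 0;
}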

Network Layer: TCP

For networks with

— Low-bandwidth multicast, or

— No multicast (e.g. Amazon Web Services)

Discovering hosts:

— Via the UDP multicast layer, or

— Via at least one known host

TCP used for all data transport

Still fully dynamic

Network Layer: Infiniband

Native Infiniband with RDMA

(Figure, shown as a build: the Infiniband adapters copy a buffer at address 0x1234 in Host 1's memory directly into a buffer at address 0x4532 in Host 2's memory, bypassing both CPUs.)

RDMA used for speeding up bulk data transfers

Fastest transmissions > 30 Gbit/s end-to-end

Other Features

More multi-user capabilities (scopes, ...)

"Futures"

Global logging system

HTTP Server

RTMP Video streaming

Cloud Bridge

...

Summary

DiCE is a library for writing scalable applications

DiCE has been used in our rendering products for 10 years

Currently directly usable if you use indeX

Thank you …

Stefan Radig Sr. Manager, NVIDIA Iray and DiCE
