
Distributing Computation to Large GPU Clusters

What is this about?

DiCE: a software library for writing applications that scale to many GPUs and CPUs in a cluster


Used since 2003 in our rendering products...

NVIDIA indeX and NVIDIA Iray

(Images courtesy of Vyacheslav Serov, Rüdiger Raab, and Thomas Zancker.)

Why are we presenting this here?

DiCE is a base technology in indeX

— Clustering / networking / distribution based on DiCE

DiCE API exposed by indeX

— Distribute pre-computation of data for indeX

— Do your own computation…

Design Goals

"Provide a software library to be used by rendering experts to write scalable software for GPU clusters."

— Not required: low-level parallelization / networking knowledge

— High level of abstraction / easy to use...

— Not specific to a particular domain (e.g. rendering)

— High performance, meant for interactive applications

Other solutions...

Unique Combination of Features

Simple programming model

Ease of deployment / commodity hardware

Unified multi-core and cluster parallelization

GPU support

Dynamic clustering

Focus on interactive applications

Multi-user support e.g. for web services

Available on Windows, Linux, Mac OS X

Overview

Networking / Clustering

Datastore

Job System

C++ API

Application


DiCE and indeX

(Diagram: indeX sits on top of the DiCE stack of Networking / Clustering, Datastore, Job System, and C++ API; the application builds on both.)


Parallelization Model

Programmer: split the work into n fragments!

— As independent as possible

— Potentially thousands per "frame"!

No a priori knowledge about resources in the cluster is required!

Goal: distribute the work over all GPUs / CPUs in the cluster

Parallelization Model

Fragmented Job

~ similar to a CUDA kernel

Implement a C++ class:

void execute_fragment(int i, int n) {…}

Called once for every fragment

Ask DiCE to execute the job in n fragments

(Figure: the job split into 20 fragments, numbered 0–19.)
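For concreteness, a minimal sketch of such a job in plain C++. The Fragmented_job base class and the execution call in the final comment are assumptions modeled on the slides, not the exact DiCE interface:

#include <vector>

// Assumed base class modeled on the slides; the real DiCE interface
// may differ.
struct Fragmented_job {
    virtual ~Fragmented_job() {}
    virtual void execute_fragment(int i, int n) = 0;
};

class My_job : public Fragmented_job {
public:
    explicit My_job(int height) : m_height(height), m_framebuf(height) {}

    // Called once for every fragment: fragment i of n fills its own
    // stripe of the framebuffer, independently of all other fragments.
    void execute_fragment(int i, int n) override {
        const int begin = i * m_height / n;
        const int end   = (i + 1) * m_height / n;
        for (int y = begin; y < end; ++y)
            m_framebuf[y] = render_scanline(y);  // stand-in for real work
    }

private:
    float render_scanline(int y) { return static_cast<float>(y); }

    int m_height;
    std::vector<float> m_framebuf;
};

// The application then asks DiCE to run the job in n fragments, e.g.
// (hypothetical call): dice->execute_fragmented(&job, 1000);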

Parallelization Model – Cluster

Not a shared memory model!


Idea: Split execution and integration of results

void execute_remote(int i, int n, OUT) {…}   (runs on a remote host)

void receive_result(int i, int n, IN) {…}   (runs on the origin host)

execute_remote() + receive_result() = execute_fragment()
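A sketch of the same job in the split form. The stub Out_stream / In_stream types stand in for whatever stream types DiCE actually passes as OUT and IN; the tile layout and the work itself are placeholders:

#include <vector>

// Stub stream interfaces standing in for the OUT / IN parameters above;
// the real DiCE types differ.
struct Out_stream { virtual void write(const float* p, int n) = 0; virtual ~Out_stream() {} };
struct In_stream  { virtual void read(float* p, int n) = 0;        virtual ~In_stream() {} };

class My_cluster_job {
public:
    // Runs on whichever remote host fragment i was scheduled to: do the
    // work and stream the result out. No memory is shared between hosts.
    void execute_remote(int i, int n, Out_stream* out) {
        std::vector<float> tile(tile_size(i, n), static_cast<float>(i));  // stand-in work
        out->write(tile.data(), static_cast<int>(tile.size()));
    }

    // Runs on the origin host: read fragment i's result back and
    // integrate it into the local framebuffer.
    void receive_result(int i, int n, In_stream* in) {
        in->read(&m_framebuf[tile_begin(i, n)], tile_size(i, n));
    }

private:
    int tile_begin(int i, int n) { return i * static_cast<int>(m_framebuf.size()) / n; }
    int tile_size(int i, int n)  { return tile_begin(i + 1, n) - tile_begin(i, n); }

    std::vector<float> m_framebuf = std::vector<float>(1024);
};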

Parallelization Model – Single Host

(Figure, shown as a build over several slides: one host with 2 GPUs runs My_job, which holds the Scene, Camera, and Framebuf[] members. The six fragments are scheduled across the GPUs, fragments 0, 1, and 4 on GPU 1 and fragments 2, 3, and 5 on GPU 2, and execute_fragment() is called once for each.)

Parallelization Model – 3 Hosts

(Figure, shown as a build over several slides: three hosts with 2 GPUs each. My_job is replicated to every host; only the origin host, Host 1, keeps the Framebuf[] member, while the remote copies carry just Scene and Camera. The six fragments are scheduled across the cluster: 0 on Host 1 / GPU 1, 1 on Host 2 / GPU 1, 2 on Host 2 / GPU 2, 3 on Host 1 / GPU 2, 4 on Host 3 / GPU 1, and 5 on Host 3 / GPU 2. The local fragments 0 and 3 run execute_fragment(); the remote fragments 1, 2, 4, and 5 run execute_remote() on their hosts, and Host 1 integrates each returned result with receive_result().)

Parallelization Model – Hierarchical

(Figure: jobs can be nested across tiers. A viewer host issues a compositor job to compositor hosts; those issue rendering jobs to render hosts; on each render host, GPU jobs with GPU fragments drive the individual GPUs.)

Datastore

(Overview stack with the Datastore layer highlighted.)

Datastore

In-memory NoSQL datastore for arbitrary C++ objects

Store an object on one host / retrieve it on any host

Data transport (mostly) transparent to the application

Datastore Objects

class My_adder                                   // your class
{
    float m_a;                                   // arbitrary member variables
    int m_b;

    float sum() { return m_a + m_b; }            // arbitrary member functions
};

Datastore Objects

class My_adder : public Element< UUID >          // derive from base class
{
    float m_a;
    int m_b;

    float sum() { return m_a + m_b; }
};

Datastore Objects

class My_adder : public Element< UUID >
{
    float m_a;
    int m_b;

    void serialize(ISerializer* serializer)      // implement serialization
    {
        serializer->write(m_a);
        serializer->write(m_b);
    }
};

Datastore Objects

class My_adder : public Element< UUID >
{
    float m_a;
    int m_b;

    void serialize(ISerializer* serializer);

    void deserialize(IDeserializer* deserializer) // implement deserialization
    {
        deserializer->read(m_a);
        deserializer->read(m_b);
    }
};

Datastore Objects

class My_adder : public Element< UUID >
{
    float m_a;
    int m_b;

    void serialize(ISerializer* serializer);
    void deserialize(IDeserializer* deserializer);
};

register_serializable_class< My_adder >();       // register class
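To see why serialize() / deserialize() are all the datastore needs to move an object between hosts, here is a self-contained toy round-trip through a memory buffer. The Toy_serializer / Toy_deserializer types are stand-ins written for this sketch, not the DiCE ISerializer / IDeserializer interfaces:

#include <cstring>
#include <vector>

// Stand-in stream types written for this sketch only.
struct Toy_serializer {
    std::vector<char> buf;
    template <typename T> void write(const T& v) {
        const char* p = reinterpret_cast<const char*>(&v);
        buf.insert(buf.end(), p, p + sizeof(T));
    }
};
struct Toy_deserializer {
    const char* p;
    template <typename T> void read(T& v) {
        std::memcpy(&v, p, sizeof(T));
        p += sizeof(T);
    }
};

struct My_adder {
    float m_a;
    int m_b;
    void serialize(Toy_serializer* s)     { s->write(m_a); s->write(m_b); }
    void deserialize(Toy_deserializer* d) { d->read(m_a);  d->read(m_b); }
    float sum() const { return m_a + m_b; }
};

int main() {
    My_adder original{1.5f, 2};
    Toy_serializer out;
    original.serialize(&out);               // "send": flatten to bytes

    My_adder copy{};
    Toy_deserializer in{out.buf.data()};
    copy.deserialize(&in);                  // "receive": rebuild the object
    return copy.sum() == original.sum() ? 0 : 1;
}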

Datastore: Cache

Per-host cache for objects

— Accessing an object guarantees it is in the cache!

— If necessary, it is fetched from another host

If the cache is full: throw away objects owned by other hosts (LRU)

— Lets the cluster store more data than a single host could

Configurable redundant storage for handling host failure
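A toy sketch of that eviction policy (illustrative only, not DiCE code). Only objects owned by other hosts are evictable, because a dropped local copy can always be re-fetched from its owner:

#include <list>
#include <unordered_map>

// Toy LRU policy for a per-host object cache.
class Object_cache {
public:
    void touch(int tag) {                     // object was accessed
        auto it = m_pos.find(tag);
        if (it != m_pos.end()) m_lru.erase(it->second);
        m_lru.push_front(tag);
        m_pos[tag] = m_lru.begin();
    }
    int evict() {                             // cache full (assumes non-empty)
        int victim = m_lru.back();            // least recently used
        m_lru.pop_back();
        m_pos.erase(victim);
        return victim;                        // re-fetchable from its owner
    }
private:
    std::list<int> m_lru;                     // most recently used at front
    std::unordered_map<int, std::list<int>::iterator> m_pos;
};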


Datastore Transactions

Important for multi-user operation

ACID

— Atomicity: Transaction commit, abort

— Consistency: Cluster wide locks available

— Isolation: Starting transaction “freezes” view on datastore

— Durability: Redundancy

Transaction Isolation

Isolation based on multi-version capability

Copy-on-write

(Figure, shown as a build: the store holds object versions A5 and X9. Transaction T7 keeps reading X9 while transaction T8 writes a new copy, X10; once T7 finishes, the old version X9 is discarded and only A5 and X10 remain.)
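A toy illustration of the multi-version idea (not DiCE code): each committed write creates a new version of an object, and a transaction only ever sees versions that existed when it started:

#include <map>
#include <string>

class Versioned_store {
public:
    // A transaction started at time `start` sees the newest version <= start
    // (assumes such a version exists).
    int read(const std::string& name, int start) const {
        const std::map<int, int>& versions = m_objects.at(name);
        auto it = versions.upper_bound(start);   // first version > start
        --it;                                    // newest version <= start
        return it->second;
    }
    // A commit at time `commit` adds a new version; older readers are
    // unaffected (copy-on-write).
    void write(const std::string& name, int commit, int value) {
        m_objects[name][commit] = value;
    }
private:
    std::map<std::string, std::map<int, int>> m_objects;  // name -> versions
};

int main() {
    Versioned_store s;
    s.write("X", 9, 42);    // X9 exists before the new transaction
    s.write("X", 10, 43);   // a later transaction commits the copy X10
    // An older transaction still sees X9; a newer one sees X10:
    bool ok = s.read("X", 9) == 42 && s.read("X", 10) == 43;
    return ok ? 0 : 1;
}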

Networking / Clustering

(Overview stack with the Networking / Clustering layer highlighted.)

Networking / Clustering

Handles cluster building and data transfers

— Self-organizing, dynamic addition and removal of hosts

— Tested with up to 1000 hosts

— Several networking protocols for different environments…


Network Layer: UDP with Multicast

Unicast: Send to each host

Multicast: Like radio, send once, received by many

Network Layer: UDP with Multicast

Self Organization:

— Multicast address identifies cluster

— Multicast “beacon” packets to announce to other hosts

— “Election” process to elect one synchronizer

Multicast / unicast used for bulk data transfers

— Especially effective for many hosts
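As a concrete picture of the beacon mechanism, a sketch with plain BSD sockets on Linux/POSIX. The port, group address, and payload are made-up examples; DiCE's actual wire protocol is not shown here:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    // One UDP socket is enough to announce ourselves to the group.
    int s = socket(AF_INET, SOCK_DGRAM, 0);

    sockaddr_in cluster{};                                // zero-initialized
    cluster.sin_family = AF_INET;
    cluster.sin_port   = htons(10000);                    // made-up port
    inet_pton(AF_INET, "239.0.0.1", &cluster.sin_addr);   // the multicast address identifies the cluster

    const char beacon[] = "DICE_BEACON host=alpha";       // made-up payload
    for (int i = 0; i < 3; ++i) {                         // re-announce periodically
        sendto(s, beacon, sizeof beacon, 0,
               reinterpret_cast<const sockaddr*>(&cluster), sizeof cluster);
        sleep(1);
    }
    close(s);
    return 0;
}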

Network Layer: TCP

For networks with

— Low-bandwidth multicast, or

— No multicast (e.g. Amazon Web Services)

Discovering hosts:

— Via the UDP multicast layer, or

— Via at least one known host

TCP used for all data transport

Still fully dynamic

Network Layer: Infiniband

Native Infiniband with RDMA

(Figure, shown as a build: the Infiniband adapters copy a buffer at address 0x1234 in Host 1's memory directly into a buffer at address 0x4532 in Host 2's memory, bypassing both CPUs.)

RDMA used for speeding up bulk data transfers

Fastest transmissions > 30 Gbit/s end-to-end

Other Features

More multi-user capabilities (scopes, ...)

"Futures"

Global logging system

HTTP Server

RTMP Video streaming

Cloud Bridge

...

Summary

DiCE is a library for writing scalable applications

DiCE has been used in our rendering products for 10 years

Currently directly usable if you use indeX

Thank you …

Stefan Radig Sr. Manager, NVIDIA Iray and DiCE
