Revolutionizing the Datacenter
Join the Conversation #OpenPOWERSummit
Resource Disaggregated Platform (RD Platform) Including Power8 and GPUs
for Diverse Cloud Computing Service
Takashi Yoshikawa*
Senior Principal Researcher, NEC Corporation (*and Guest Professor, Osaka University)
Resource Disaggregated Platform In-Service
64 servers and 70 devices in a resource pool spanning six racks, connected via 236 PCIe-over-Ethernet (ExpEther) links.
Cybermedia Center, Osaka University
Devices: GPU (NVIDIA K20), storage (SAS HDD, PCIe Flash), VDI accelerator (GRID K2, PCoIP)
In service since June 2014
Multi-Rack-Scale Software-Defined HPC
Cost-effective implementation of an HPC cloud.
• Scale up performance on demand by allocating PCI Express devices from the resource pool.
• Share expensive devices among multiple racks. (A toy allocation sketch follows the figure.)
[Figure: the Resource Manager software-defines the system configuration along with user requirements, composing SD servers from a server pool of 64 units and a device pool of 70 units (GPUs, FPGAs, Flash, CPUs). Physically, Racks #1-#6 each hold servers behind a TOR L2 switch, with 40T SAS JBODs, GRID K2 / PCoIP VDI accelerators, PCIe Flash, and NICs in the device pool; 236 PCIe-over-10/40G-Ethernet links interconnect the server pool (64 servers) and the device pool (70 devices).]
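To make the software-defined allocation idea concrete, here is a minimal C++ sketch of how a resource manager might hand devices from a shared pool to a server on request. The DeviceType categories, the ResourceManager class, and the allocate() interface are illustrative assumptions for this sketch, not NEC's actual Resource Manager API.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Illustrative device categories drawn from the device pool on the slide.
enum class DeviceType { Gpu, Fpga, Flash, SasJbod, Nic };

struct Device {
    std::string id;        // e.g. "rack3-gpu-02" (hypothetical naming)
    DeviceType  type;
    bool        in_use = false;
};

// Toy resource manager: devices live in one shared pool and are attached
// to servers on demand. In the real platform, the attach step is realized
// by reprogramming ExpEther group IDs over the Ethernet fabric.
class ResourceManager {
public:
    void add_device(Device d) { pool_.push_back(std::move(d)); }

    // Allocate `count` free devices of `type` to `server`; returns the
    // granted device ids, or std::nullopt if the pool cannot satisfy it.
    std::optional<std::vector<std::string>>
    allocate(const std::string& server, DeviceType type, std::size_t count) {
        std::vector<Device*> free;
        for (auto& d : pool_)
            if (!d.in_use && d.type == type) free.push_back(&d);
        if (free.size() < count) return std::nullopt;

        std::vector<std::string> granted;
        for (std::size_t i = 0; i < count; ++i) {
            free[i]->in_use = true;
            granted.push_back(free[i]->id);
        }
        auto& assigned = assignments_[server];
        assigned.insert(assigned.end(), granted.begin(), granted.end());
        return granted;
    }

private:
    std::vector<Device> pool_;
    std::map<std::string, std::vector<std::string>> assignments_;
};
```

A scheduler built this way can grow or shrink a server's device set without touching the other racks, which is the property the slide is selling.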
ExpEther: Distributed PCIe Switch Architecture
PCIe-compliant single-hop switch with kilometer-scale reach and thousands of ports.
• Combines upstream/downstream PCI bridges with an Ethernet transport.
• The Ethernet region is transparent to PCIe, even across multi-hop Ethernet switches. (A conceptual encapsulation sketch follows the figure.)
[Figure: a conventional PCIe switch (an upstream PCI bridge plus downstream PCI bridges connecting a compute root to I/O devices over PCI Express) compared with ExpEther, where an ExpEther host chip (upstream) and ExpEther I/O chips (downstream) carry the internal bus across a transparent Ethernet region through an Ethernet switch; the bridges can be implemented in the same chip.]
ExpEther: PCI Express® over Ethernet
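As a purely conceptual illustration of the bridging idea, a PCIe transaction-layer packet (TLP) can be pictured as riding unmodified inside a standard Ethernet frame between the host chip and the I/O chip. The struct below is an assumption made for illustration only; it is not the actual ExpEther wire format, which this talk does not spell out.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Conceptual framing only: the upstream bridge captures a PCIe TLP,
// wraps it in an Ethernet frame, and the downstream bridge unwraps it.
struct EthernetHeader {
    std::array<std::uint8_t, 6> dst_mac;  // downstream ExpEther chip
    std::array<std::uint8_t, 6> src_mac;  // upstream ExpEther chip
    std::uint16_t ethertype;              // a dedicated EtherType (illustrative)
};

struct EncapsulatedTlp {
    EthernetHeader            eth;       // standard Ethernet framing
    std::uint16_t             group_id;  // which logical PCIe tree this belongs to
    std::uint32_t             seq;       // sequence number for ordering/retry
    std::vector<std::uint8_t> tlp;       // the unmodified PCIe TLP payload
};
```

Because the TLP itself is untouched, host software and devices see an ordinary PCIe switch, and the Ethernet hop (or several hops) in between stays invisible to them.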
ExpEther: Reliable Ethernet
Reliable transport on standard Ethernet by delay-based congestion control and retry.
• N× bandwidth and redundancy by multipath. (A toy rate-control sketch follows the figure.)
[Figure: congestion control and retry — an ExpEther bridge monitors the RTT of its packets to control the packet rate, using ACK packets and retrying lost ExpEther packets; multipath — two ExpEther bridges connected over Ethernet-1 and Ethernet-2 deliver N× performance using N paths and restore traffic when one path fails.]
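The delay-based idea can be sketched as a rate controller that watches the measured RTT and backs off when it inflates, a sign of queue build-up. The thresholds, gains, and class below are made-up illustration values, not ExpEther's actual algorithm.

```cpp
#include <algorithm>

// Toy delay-based rate controller: grow the sending rate while the
// measured RTT stays near the uncongested baseline, back off
// multiplicatively once the RTT inflates.
class DelayBasedRateControl {
public:
    explicit DelayBasedRateControl(double base_rtt_us)
        : base_rtt_us_(base_rtt_us) {}

    // Called once per ACK with the RTT sampled from that ACK.
    void on_ack(double rtt_us) {
        if (rtt_us > base_rtt_us_ * kInflationLimit)
            rate_gbps_ *= kBackoff;        // queues building: slow down
        else
            rate_gbps_ += kAdditiveStep;   // headroom available: speed up
        rate_gbps_ = std::clamp(rate_gbps_, kMinRate, kMaxRate);
    }

    double rate_gbps() const { return rate_gbps_; }

private:
    static constexpr double kInflationLimit = 1.5;   // illustrative values
    static constexpr double kBackoff        = 0.5;
    static constexpr double kAdditiveStep   = 0.1;
    static constexpr double kMinRate        = 0.1;
    static constexpr double kMaxRate        = 10.0;  // e.g. a 10G link
    double base_rtt_us_;
    double rate_gbps_ = 1.0;
};
```

Packets that are not acknowledged are retransmitted by the bridge using their sequence numbers, so the PCIe layer above never observes the loss.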
ExpEther: Resource Allocation by Group ID
A PCIe tree is automatically constructed among ExpEther chips that share the same group ID.
• The group ID can be preset or configured by management software. (A small grouping sketch follows the figure.)
[Figure: ten devices A-J spread across the fabric, each assigned a group ID 1-4; in the logical view, each of the four hosts sees a standard PCIe switch holding only the devices that share its group ID. The grouping is driven by software-defined configuration management of the servers. Note that the figure omits the ExpEther (EE) chips.]
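A minimal sketch of the grouping rule: every ExpEther chip carries a group ID, and a host's logical PCIe tree is simply the set of device-side chips whose ID matches its own. The EeChip struct and build_logical_trees() helper are hypothetical names used only to illustrate the rule.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct EeChip {
    std::string   name;      // "Host1", "Device-B", ... (illustrative)
    bool          is_host;
    std::uint16_t group_id;  // preset, or set by management software
};

// Each host sees exactly the device-side ExpEther chips that share its
// group ID; this mirrors how the PCIe tree assembles itself on the fabric.
std::map<std::string, std::vector<std::string>>
build_logical_trees(const std::vector<EeChip>& chips) {
    std::map<std::string, std::vector<std::string>> trees;
    for (const auto& host : chips) {
        if (!host.is_host) continue;
        auto& devices = trees[host.name];
        for (const auto& dev : chips)
            if (!dev.is_host && dev.group_id == host.group_id)
                devices.push_back(dev.name);
    }
    return trees;
}
```

Moving a device between servers is then just rewriting one group ID; both affected PCIe trees reconfigure automatically.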
OpenStack-Based Resource Management
Ironic (bare-metal provisioning) is modified to manage resources at the device level.
[Figure: the IaaS controller exposes Horizon (GUI), Nova (scheduler), Heat (orchestration), Ceilometer (telemetry), and Ironic (bare-metal management) to the end user; an ExpEther driver in Ironic and an ExpEther Manager / Fabricator abstract the Ethernet fabric and issue the commands that attach devices A-J to hosts 1-4. A GUI on the modified Horizon drives the modified Ironic to control each device.]
Implementation Lineup
1G/10G ExpEther
• ExpEther HBA: x8 PCIe Gen2; Dual 1000BASE-T or Dual 10G SFP+
• ExpEther Client: 2x 1000BASE-T, DVI x1, HDMI x1, USB 3.0 x1, USB 2.0 x3, headphone x1, microphone x1, x1 PCI Express
• ExpEther IO Expansion Unit: x16 PCIe x 1 slot with Dual 1000BASE-T, or x16 PCIe Gen2 x 2 slots (full height / full length) with Dual 10G SFP+ per slot
40G ExpEther (New!)
• ExpEther HBA: IO interface x8 PCI Express 3.0; network I/F QSFP+ x 2; form factor: PCI low profile
• ExpEther IO Expansion Unit: IO interface x8 PCI Express 3.0; slots: x16 slot x 4; network I/F: QSFP+ x 4; supported IO: GPGPU (K80, K40); 3U, 400 mm deep, 19" rack size, 1,000 W PSU
Power8 Server and 2x40G ExpEther
[Photo: a Power8 server fitted with an ExpEther HBA, connected through a 40G Ethernet switch to an ExpEther expansion unit holding an NVIDIA K80.]
Logical View
ExpEther's logical view is a standard PCIe switch.
• A pair of PCI bridges (upstream and downstream).
• The NVIDIA K80 appears as a local device on the local PCIe bus (the two NVIDIA 102d entries below are its two GPUs).
-+-[0002:00]---00.0-[01]--
 +-[0001:00]---00.0-[01-17]----00.0-[02-17]--+-01.0-[03]--+-00.0  Broadcom Corporation BCM57840 NetXtreme II 10 Gigabit Ethernet
 |                                           |            +-00.1  Broadcom Corporation BCM57840 NetXtreme II 10 Gigabit Ethernet
 |                                           |            +-00.2  Broadcom Corporation BCM57840 NetXtreme II 10 Gigabit Ethernet
 |                                           |            \-00.3  Broadcom Corporation BCM57840 NetXtreme II 10 Gigabit Ethernet
 |                                           +-08.0-[04]----00.0  Marvell Technology Group Ltd. Device 9235
 |                                           +-09.0-[05]----00.0  Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller
 |                                           +-0a.0-[06-07]----00.0-[07]----00.0  ASPEED Technology, Inc. ASPEED Graphics Family
 |                                           +-10.0-[08-12]----00.0-[09-12]--+-00.0-[0a]----00.0  Intel Corporation Device 0953
 |                                           |                               +-01.0-[0b-0e]----00.0-[0c-0e]--+-08.0-[0d]----00.0  NVIDIA Corporation Device 102d
 |                                           |                               |                               \-10.0-[0e]----00.0  NVIDIA Corporation Device 102d
 |                                           |                               +-02.0-[0f-10]----00.0-[10]--
 |                                           |                               \-03.0-[11-12]----00.0-[12]--
 |                                           \-11.0-[13-17]--
 \-[0000:00]---00.0-[01]--
CUDA Benchmark: N-Body on K80
Comparable performance to a GPU in a local PCIe slot. (A simplified kernel sketch follows the chart.)
[Chart: performance (Gflops) vs. number of particles for the GPU in a local PCIe slot and the disaggregated GPU; the two configurations are comparable.]
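For reference, the all-pairs force evaluation at the heart of an N-body benchmark looks roughly like the CUDA kernel below. This is a simplified sketch, not the exact benchmark code used for the measurement; the softening constant and launch configuration are illustrative.

```cuda
#include <cuda_runtime.h>

// Simplified all-pairs N-body force kernel: each thread owns one body
// (position in xyz, mass in w) and accumulates the acceleration
// contributed by every other body.
__global__ void nbody_forces(const float4* pos, float3* acc, int n,
                             float softening2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = pos[i];
    float3 a = {0.f, 0.f, 0.f};
    for (int j = 0; j < n; ++j) {
        float4 pj = pos[j];
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float r2 = dx * dx + dy * dy + dz * dz + softening2;
        float invR = rsqrtf(r2);
        float s = pj.w * invR * invR * invR;   // m_j / r^3
        a.x += dx * s; a.y += dy * s; a.z += dz * s;
    }
    acc[i] = a;
}

// Host-side launch: because the disaggregated K80 enumerates as a normal
// local PCIe device, this code is identical for local and ExpEther GPUs.
void compute_forces(const float4* d_pos, float3* d_acc, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    nbody_forces<<<blocks, threads>>>(d_pos, d_acc, n, 1e-4f);
    cudaDeviceSynchronize();
}
```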
Multi GPUs in Resource Disaggregated Platform
Easy programming and high performance (vs. a GPU cluster)
• All GPUs appear local to the host; direct memory access (DMA) is available.
Cost-effective (vs. a dedicated GPU machine)
• No special GPU machine is required, and expensive GPUs can be shared.
• Added latency can degrade performance.
[Figure: three configurations compared — the Resource Disaggregated PF (one CPU reaching multiple GPUs through ExpEther (EE) chips over Ethernet), a GPU cluster system (several CPU+GPU nodes connected via InfiniBand/Ethernet), and a dedicated GPU machine (one CPU with locally attached GPUs).]
Two Application Models Using GPUs
Particle motion simulation by the Runge-Kutta method
• N GPU computing steps with only two data transfers
• No device-to-device communication
Advection term calculation by cubic Lagrange interpolation
• Device-to-device communication in each iteration step
• Influenced by latency
(A sketch contrasting the two transfer patterns follows the figure.)
[Figure: two execution sequences — in the Runge-Kutta model, a host-to-device transfer is followed by a loop of n GPU computing steps and a single device-to-host transfer; in the advection model, each of the n loop iterations performs GPU computing plus a device-to-device transfer before the final device-to-host transfer.]
Shimpei Nomura, Tetsuya Nakahama, Junichi Higuchi, Jun Suzuki, Takashi Yoshikawa, and Hideharu Amano, "The Multi-GPU System with ExpEther," Proc. of the PDPTA, July 2012.
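The difference between the two models shows up directly in the host code. The sketch below contrasts the "transfer once, iterate on the GPU" pattern with the "exchange data between devices every step" pattern; the kernel names, buffer layout, and halo size are placeholders, not the code from the cited paper.

```cuda
#include <cuda_runtime.h>

__global__ void rk_step(float4* particles, int n) { /* placeholder */ }
__global__ void advection_step(float* field, int n) { /* placeholder */ }

// Model 1 (Runge-Kutta particle simulation): one host-to-device copy,
// n kernel steps with no inter-device traffic, one device-to-host copy.
// The disaggregated link latency is paid only twice.
void run_particles(float4* h_buf, float4* d_buf, int n, int steps) {
    cudaMemcpy(d_buf, h_buf, n * sizeof(float4), cudaMemcpyHostToDevice);
    for (int s = 0; s < steps; ++s)
        rk_step<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaMemcpy(h_buf, d_buf, n * sizeof(float4), cudaMemcpyDeviceToHost);
}

// Model 2 (advection by cubic Lagrange interpolation): every iteration
// exchanges boundary (halo) data between the GPUs, so each step pays the
// device-to-device latency of the fabric.
void run_advection(float* d_field0, float* d_field1, int halo, int n,
                   int steps) {
    for (int s = 0; s < steps; ++s) {
        cudaSetDevice(0);
        advection_step<<<(n + 255) / 256, 256>>>(d_field0, n);
        cudaSetDevice(1);
        advection_step<<<(n + 255) / 256, 256>>>(d_field1, n);
        // direct device-to-device halo exchange, no host staging
        cudaMemcpyPeer(d_field1, 1, d_field0, 0, halo * sizeof(float));
        cudaMemcpyPeer(d_field0, 0, d_field1, 1, halo * sizeof(float));
        cudaDeviceSynchronize();
    }
}
```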
Particle motion simulation in an 8-GPU system (GPGPU BOX prototype)
8 GPUs achieved a speedup of 7.13x.
• Performance scales nearly linearly with the number of GPUs; the overhead of data transfer to/from the host is dominant at small problem sizes.
[Chart: updates per second (10^9) vs. problem size (1x1024x1024 for 100 and 1000 steps; 10x1024x1024 for 1000, 2000, and 4000 steps) for 1 to 8 GPUs, showing the 7.13x speedup with 8 GPUs and the host data-transfer overhead at small sizes.]
Latency Masking of Data Communication
[Figure: execution timeline for GPU 0-3 — each GPU alternates kernel execution with send/recv of frontier data, synchronizing with the other GPUs between levels; each kernel first processes its own frontier and then the received frontier.]
• Reduce the frequency of communication
• Overlap communication with GPU processing
• GPU-direct communication without going through the host node
(A CUDA stream-overlap sketch follows below.)
[Figure: testbed — a host connected through an ExpEther (EE) HBA and two Mellanox SX1012 Ethernet switches to four K20 GPUs, each behind its own EE chip.]
Takuji Mitsuishi, Jun Suzuki, Yuki Hayashi, Masaki Kan, and Hideharu Amano, "Breadth First Search on Cost-efficient Multi-GPU Systems," Proc. of the International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART), May 2015.
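The "overlap communication and GPU processing" point maps naturally onto CUDA streams: while a kernel processes the GPU's own frontier on one stream, the frontier received from the other GPUs is copied in on another. The sketch below is a single-GPU illustration of that pattern with hypothetical buffer names; it is not the multi-GPU code from the cited paper.

```cuda
#include <cuda_runtime.h>

__global__ void process_frontier(const int* frontier, int n, int* next) {
    /* placeholder kernel body */
}

// Overlap pattern: kernel work on `compute`, data movement on `copy`.
// For true overlap, h_recv should be pinned memory (cudaHostAlloc).
void bfs_level(const int* d_own, int n_own, int* d_next,
               const int* h_recv, int* d_recv, int n_recv) {
    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    // Process our own frontier while the received frontier streams in.
    process_frontier<<<(n_own + 255) / 256, 256, 0, compute>>>(
        d_own, n_own, d_next);
    cudaMemcpyAsync(d_recv, h_recv, n_recv * sizeof(int),
                    cudaMemcpyHostToDevice, copy);

    cudaStreamSynchronize(copy);   // received frontier is now on the GPU
    process_frontier<<<(n_recv + 255) / 256, 256, 0, compute>>>(
        d_recv, n_recv, d_next);

    cudaStreamSynchronize(compute);
    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
}
```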
Level-Synchronized Breadth-First Graph Search
[Chart: MTEPS vs. graph scale (|V| = 2^scale, |E| = 16 x |V|) for 1 and 2 GPUs, comparing a 4-GPU machine with the resource disaggregated platform.]
Comparable scalability to the GPU machine for big graphs.
• 80% of the communication time is successfully masked.
TEPS: traversed edges per second. (A minimal level-synchronized BFS sketch follows the figure.)
[Figure: level-synchronized BFS — starting from the root, the current frontier at each level (Level 1, 2, 3, ...) expands into the next frontier.]
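The level-synchronized scheme itself, in which the current frontier is fully expanded into the next frontier before the next level starts, can be summarized in a few lines of host code. This is a single-threaded C++ sketch of the algorithm on a CSR graph, not the multi-GPU implementation measured above.

```cpp
#include <vector>

// Level-synchronized BFS on a CSR graph (row_ptr/col_idx). All vertices
// in the current frontier are expanded before the next level starts,
// which is exactly where the per-level synchronization cost appears.
std::vector<int> bfs_levels(const std::vector<int>& row_ptr,
                            const std::vector<int>& col_idx, int root) {
    const int n = static_cast<int>(row_ptr.size()) - 1;
    std::vector<int> level(n, -1);
    std::vector<int> frontier{root};
    level[root] = 0;

    for (int depth = 0; !frontier.empty(); ++depth) {
        std::vector<int> next;                  // next frontier
        for (int u : frontier)
            for (int e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {
                int v = col_idx[e];
                if (level[v] == -1) {           // first visit
                    level[v] = depth + 1;
                    next.push_back(v);
                }
            }
        frontier.swap(next);                    // level barrier
    }
    return level;
}
```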
High-Speed Storage Using a PCIe SSD
Same IOPS for 4 kB random reads
• Local: bw = 1814.3 MB/s, IOPS = 464,436
• ExpEther: bw = 1810.3 MB/s, IOPS = 463,428
• Catalog spec: IOPS = 450,000
Intel SSD P3700 400GB
Resource Disaggregated Storage (RD-Store)
KVS (key-value store) over resource-disaggregated NVMe
• Control CPUs and storage devices can be scaled independently.
• Scales up immediately when servers are added, independent of the amount of data.
8 threads/server with an NVMe 1.1 card: 302 kOPS, GET 89 μs, PUT 263 μs
YCSB (Yahoo! Cloud Serving Benchmark), read-heavy workload (100 B): random GET 95%, PUT 5%. (A toy workload sketch follows.)
[Figure: 1-4 control CPUs and 1-5 devices drawn from the pool, attached via the interconnect (ExpEther).]
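To make the "read-heavy (95% GET / 5% PUT)" workload concrete, here is a toy single-node stand-in for the access pattern. It only mimics the YCSB configuration named on the slide; it is not the benchmark code behind the 302 kOPS figure, and the key space and value size are illustrative.

```cpp
#include <random>
#include <string>
#include <unordered_map>

// Toy read-heavy KVS workload: 95% random GETs and 5% PUTs of ~100-byte
// values, mirroring the YCSB read-heavy configuration on the slide.
void run_workload(std::unordered_map<std::string, std::string>& kvs,
                  int num_keys, long ops) {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> key_dist(0, num_keys - 1);
    std::uniform_real_distribution<double> op_dist(0.0, 1.0);
    const std::string value(100, 'x');          // ~100-byte payload

    for (long i = 0; i < ops; ++i) {
        std::string key = "user" + std::to_string(key_dist(rng));
        if (op_dist(rng) < 0.95) {
            auto it = kvs.find(key);            // GET (95%)
            (void)it;
        } else {
            kvs[key] = value;                   // PUT (5%)
        }
    }
}
```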
Conclusion: Resource Disaggregated Platform
The RD Platform expands the use of cloud datacenters from office applications to high-performance computing.
The RD Platform allocates devices from a resource pool at the device level, scaling up the performance and functionality of individual servers.
Since the fabric is ExpEther (a distributed PCIe switch over Ethernet), open, standard hardware and software can be used to build customers' computer systems.
The combination of the latest Gen3 x8 / 2x 40GbE ExpEther and a Power8 server shows potential for compute-intensive workloads.
The performance of both the 8-GPU machine and the KVS with shared PCIe storage demonstrates how easily the platform scales up in the datacenter.
Acknowledgement
I would like to thank all the members working on developing this technology:
• Hideharu Amano at Keio University
• Susumu Date and Shinji Shimojo at Osaka University
• Richard Solomon, Hiroki Iwasaki, Hironori Ando, Takashi Yagi, Hiroshi Ono, Mitsuhiro Miyata, and other Synopsys people
• Takaaki Terao, Tsuyoshi Yoshioka, Koetsu Narisawa, Nobuhiro Numata, Yosuke Yano, Takayuki Saito, Osamu Nara, Takumi Kosawa, and other ID people
• Yoichi Hidaka, Junichi Higuchi, Jun Suzuki, Yuki Hayashi, Masahiko Takahashi, Masaki Kan, Akira Tsuji, Shinya Miyakawa, Nobuharu Kami, Teruyuki Baba, Nobuyuki Enomoto, Hideyuki Shimonishi, Atsushi Iwata, Shinji Abe, Shinji Ueno, Hiroyoshi Yamane, Ryuta Niino, Kiyoshi Baba, Hiroki Yokoyama, Tatsuya Takada, Shiro Momot, Taka Koishi, Yasushi Kikkawa, Yoshimitsu Okayama, Hiroshi Baba, Tsukasa Kimura, Masayuki Terao, and other NEC people