Revolutionizing the Datacenter
Join the Conversation #OpenPOWERSummit
Resource Disaggregated Platform (RD Platform) Including Power8 and GPUs
for Diverse Cloud Computing Service
Takashi Yoshikawa*
Senior Principal Researcher, NEC Corporation (*and Guest Professor, Osaka University)
Resource Disaggregated Platform In-Service
64 servers and 70 devices in a resource pool spanning six racks, connected via 236 PCIe-over-Ethernet (ExpEther) links.
Cybermedia Center, Osaka University
Devices: GPU (NVIDIA K20), storage (SAS HDD, PCIe Flash), VDI accelerator (GRID K2, PCoIP)
In service since June 2014
Multi-Rack-Scale Software-Defined HPC
Cost-effective implementation of an HPC cloud.
• Scale up performance on demand by allocating PCI Express devices from the resource pool.
• Share expensive devices among multiple racks. (A toy allocation sketch follows the figure.)
[Figure: the Resource Manager software-defines the system configuration along with user requirements, composing SD servers from a server pool of 64 units and a device pool of 70 units (GPUs, FPGAs, Flash, CPUs). Physically, Racks #1-#6 each hold servers behind a TOR L2 switch, with 40T SAS JBODs, GRID K2 / PCoIP VDI accelerators, PCIe Flash, and NICs in the device pool; 236 PCIe-over-10/40G-Ethernet links interconnect the server pool (64 servers) and the device pool (70 devices).]
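To make the software-defined allocation idea concrete, here is a minimal C++ sketch of how a resource manager might hand devices from a shared pool to a server on request. The DeviceType categories, the ResourceManager class, and the allocate() interface are illustrative assumptions for this sketch, not NEC's actual Resource Manager API.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Illustrative device categories drawn from the device pool on the slide.
enum class DeviceType { Gpu, Fpga, Flash, SasJbod, Nic };

struct Device {
    std::string id;        // e.g. "rack3-gpu-02" (hypothetical naming)
    DeviceType  type;
    bool        in_use = false;
};

// Toy resource manager: devices live in one shared pool and are attached
// to servers on demand. In the real platform, the attach step is realized
// by reprogramming ExpEther group IDs over the Ethernet fabric.
class ResourceManager {
public:
    void add_device(Device d) { pool_.push_back(std::move(d)); }

    // Allocate `count` free devices of `type` to `server`; returns the
    // granted device ids, or std::nullopt if the pool cannot satisfy it.
    std::optional<std::vector<std::string>>
    allocate(const std::string& server, DeviceType type, std::size_t count) {
        std::vector<Device*> free;
        for (auto& d : pool_)
            if (!d.in_use && d.type == type) free.push_back(&d);
        if (free.size() < count) return std::nullopt;

        std::vector<std::string> granted;
        for (std::size_t i = 0; i < count; ++i) {
            free[i]->in_use = true;
            granted.push_back(free[i]->id);
        }
        auto& assigned = assignments_[server];
        assigned.insert(assigned.end(), granted.begin(), granted.end());
        return granted;
    }

private:
    std::vector<Device> pool_;
    std::map<std::string, std::vector<std::string>> assignments_;
};
```

A scheduler built this way can grow or shrink a server's device set without touching the other racks, which is the property the slide is selling.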
ExpEther: Distributed PCIe Switch Architecture
PCIe-compliant single-hop switch with kilometer-scale reach and thousands of ports.
• Combines upstream/downstream PCI bridges with an Ethernet transport.
• The Ethernet region is transparent to PCIe, even across multi-hop Ethernet switches. (A conceptual encapsulation sketch follows the figure.)
[Figure: a conventional PCIe switch (an upstream PCI bridge plus downstream PCI bridges connecting a compute root to I/O devices over PCI Express) compared with ExpEther, where an ExpEther host chip (upstream) and ExpEther I/O chips (downstream) carry the internal bus across a transparent Ethernet region through an Ethernet switch; the bridges can be implemented in the same chip.]
ExpEther: PCI Express® over Ethernet
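As a purely conceptual illustration of the bridging idea, a PCIe transaction-layer packet (TLP) can be pictured as riding unmodified inside a standard Ethernet frame between the host chip and the I/O chip. The struct below is an assumption made for illustration only; it is not the actual ExpEther wire format, which this talk does not spell out.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Conceptual framing only: the upstream bridge captures a PCIe TLP,
// wraps it in an Ethernet frame, and the downstream bridge unwraps it.
struct EthernetHeader {
    std::array<std::uint8_t, 6> dst_mac;  // downstream ExpEther chip
    std::array<std::uint8_t, 6> src_mac;  // upstream ExpEther chip
    std::uint16_t ethertype;              // a dedicated EtherType (illustrative)
};

struct EncapsulatedTlp {
    EthernetHeader            eth;       // standard Ethernet framing
    std::uint16_t             group_id;  // which logical PCIe tree this belongs to
    std::uint32_t             seq;       // sequence number for ordering/retry
    std::vector<std::uint8_t> tlp;       // the unmodified PCIe TLP payload
};
```

Because the TLP itself is untouched, host software and devices see an ordinary PCIe switch, and the Ethernet hop (or several hops) in between stays invisible to them.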
ExpEther: Reliable Ethernet
Reliable transport on standard Ethernet by delay-based congestion control and retry.
• N× bandwidth and redundancy by multipath. (A toy rate-control sketch follows the figure.)
[Figure: congestion control and retry — an ExpEther bridge monitors the RTT of its packets to control the packet rate, using ACK packets and retrying lost ExpEther packets; multipath — two ExpEther bridges connected over Ethernet-1 and Ethernet-2 deliver N× performance using N paths and restore traffic when one path fails.]
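The delay-based idea can be sketched as a rate controller that watches the measured RTT and backs off when it inflates, a sign of queue build-up. The thresholds, gains, and class below are made-up illustration values, not ExpEther's actual algorithm.

```cpp
#include <algorithm>

// Toy delay-based rate controller: grow the sending rate while the
// measured RTT stays near the uncongested baseline, back off
// multiplicatively once the RTT inflates.
class DelayBasedRateControl {
public:
    explicit DelayBasedRateControl(double base_rtt_us)
        : base_rtt_us_(base_rtt_us) {}

    // Called once per ACK with the RTT sampled from that ACK.
    void on_ack(double rtt_us) {
        if (rtt_us > base_rtt_us_ * kInflationLimit)
            rate_gbps_ *= kBackoff;        // queues building: slow down
        else
            rate_gbps_ += kAdditiveStep;   // headroom available: speed up
        rate_gbps_ = std::clamp(rate_gbps_, kMinRate, kMaxRate);
    }

    double rate_gbps() const { return rate_gbps_; }

private:
    static constexpr double kInflationLimit = 1.5;   // illustrative values
    static constexpr double kBackoff        = 0.5;
    static constexpr double kAdditiveStep   = 0.1;
    static constexpr double kMinRate        = 0.1;
    static constexpr double kMaxRate        = 10.0;  // e.g. a 10G link
    double base_rtt_us_;
    double rate_gbps_ = 1.0;
};
```

Packets that are not acknowledged are retransmitted by the bridge using their sequence numbers, so the PCIe layer above never observes the loss.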
ExpEther: Resource Allocation by Group ID
A PCIe tree is automatically constructed among ExpEther chips that share the same group ID.
• The group ID can be preset or configured by management software. (A small grouping sketch follows the figure.)
[Figure: ten devices A-J spread across the fabric, each assigned a group ID 1-4; in the logical view, each of the four hosts sees a standard PCIe switch holding only the devices that share its group ID. The grouping is driven by software-defined configuration management of the servers. Note that the figure omits the ExpEther (EE) chips.]
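A minimal sketch of the grouping rule: every ExpEther chip carries a group ID, and a host's logical PCIe tree is simply the set of device-side chips whose ID matches its own. The EeChip struct and build_logical_trees() helper are hypothetical names used only to illustrate the rule.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct EeChip {
    std::string   name;      // "Host1", "Device-B", ... (illustrative)
    bool          is_host;
    std::uint16_t group_id;  // preset, or set by management software
};

// Each host sees exactly the device-side ExpEther chips that share its
// group ID; this mirrors how the PCIe tree assembles itself on the fabric.
std::map<std::string, std::vector<std::string>>
build_logical_trees(const std::vector<EeChip>& chips) {
    std::map<std::string, std::vector<std::string>> trees;
    for (const auto& host : chips) {
        if (!host.is_host) continue;
        auto& devices = trees[host.name];
        for (const auto& dev : chips)
            if (!dev.is_host && dev.group_id == host.group_id)
                devices.push_back(dev.name);
    }
    return trees;
}
```

Moving a device between servers is then just rewriting one group ID; both affected PCIe trees reconfigure automatically.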
OpenStack-Based Resource Management
Ironic (bare-metal provisioning) is modified to manage resources at the device level.
[Figure: the IaaS controller exposes Horizon (GUI), Nova (scheduler), Heat (orchestration), Ceilometer (telemetry), and Ironic (bare-metal management) to the end user; an ExpEther driver in Ironic and an ExpEther Manager / Fabricator abstract the Ethernet fabric and issue the commands that attach devices A-J to hosts 1-4. A GUI on the modified Horizon drives the modified Ironic to control each device.]
Implementation Lineup
1G/10G ExpEther
• ExpEther HBA: x8 PCIe Gen2; Dual 1000BASE-T or Dual 10G SFP+
• ExpEther Client: 2x 1000BASE-T, DVI x1, HDMI x1, USB 3.0 x1, USB 2.0 x3, headphone x1, microphone x1, x1 PCI Express
• ExpEther IO Expansion Unit: x16 PCIe x 1 slot with Dual 1000BASE-T, or x16 PCIe Gen2 x 2 slots (full height / full length) with Dual 10G SFP+ per slot
40G ExpEther (New!)
• ExpEther HBA: IO interface x8 PCI Express 3.0; network I/F QSFP+ x 2; form factor: PCI low profile
• ExpEther IO Expansion Unit: IO interface x8 PCI Express 3.0; slots: x16 slot x 4; network I/F: QSFP+ x 4; supported IO: GPGPU (K80, K40); 3U, 400 mm deep, 19" rack size, 1,000 W PSU
Power8 Server and 2x40G ExpEther
[Photo: a Power8 server fitted with an ExpEther HBA, connected through a 40G Ethernet switch to an ExpEther expansion unit holding an NVIDIA K80.]
Logical View
ExpEther's logical view is a standard PCIe switch.
• A pair of PCI bridges (upstream and downstream).
• The NVIDIA K80 appears as a local device on the local PCIe bus (the two NVIDIA 102d entries below are its two GPUs).
-+-[0002:00]---00.0-[01]--
 +-[0001:00]---00.0-[01-17]----00.0-[02-17]--+-01.0-[03]--+-00.0  Broadcom Corporation BCM57840 NetXtreme II 10 Gigabit Ethernet
 |                                           |            +-00.1  Broadcom Corporation BCM57840 NetXtreme II 10 Gigabit Ethernet
 |                                           |            +-00.2  Broadcom Corporation BCM57840 NetXtreme II 10 Gigabit Ethernet
 |                                           |            \-00.3  Broadcom Corporation BCM57840 NetXtreme II 10 Gigabit Ethernet
 |                                           +-08.0-[04]----00.0  Marvell Technology Group Ltd. Device 9235
 |                                           +-09.0-[05]----00.0  Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller
 |                                           +-0a.0-[06-07]----00.0-[07]----00.0  ASPEED Technology, Inc. ASPEED Graphics Family
 |                                           +-10.0-[08-12]----00.0-[09-12]--+-00.0-[0a]----00.0  Intel Corporation Device 0953
 |                                           |                               +-01.0-[0b-0e]----00.0-[0c-0e]--+-08.0-[0d]----00.0  NVIDIA Corporation Device 102d
 |                                           |                               |                               \-10.0-[0e]----00.0  NVIDIA Corporation Device 102d
 |                                           |                               +-02.0-[0f-10]----00.0-[10]--
 |                                           |                               \-03.0-[11-12]----00.0-[12]--
 |                                           \-11.0-[13-17]--
 \-[0000:00]---00.0-[01]--
CUDA Benchmark: N-Body on K80
Comparable performance to a GPU in a local PCIe slot. (A simplified kernel sketch follows the chart.)
[Chart: performance (Gflops) vs. number of particles for the GPU in a local PCIe slot and the disaggregated GPU; the two configurations are comparable.]
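For reference, the all-pairs force evaluation at the heart of an N-body benchmark looks roughly like the CUDA kernel below. This is a simplified sketch, not the exact benchmark code used for the measurement; the softening constant and launch configuration are illustrative.

```cuda
#include <cuda_runtime.h>

// Simplified all-pairs N-body force kernel: each thread owns one body
// (position in xyz, mass in w) and accumulates the acceleration
// contributed by every other body.
__global__ void nbody_forces(const float4* pos, float3* acc, int n,
                             float softening2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = pos[i];
    float3 a = {0.f, 0.f, 0.f};
    for (int j = 0; j < n; ++j) {
        float4 pj = pos[j];
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float r2 = dx * dx + dy * dy + dz * dz + softening2;
        float invR = rsqrtf(r2);
        float s = pj.w * invR * invR * invR;   // m_j / r^3
        a.x += dx * s; a.y += dy * s; a.z += dz * s;
    }
    acc[i] = a;
}

// Host-side launch: because the disaggregated K80 enumerates as a normal
// local PCIe device, this code is identical for local and ExpEther GPUs.
void compute_forces(const float4* d_pos, float3* d_acc, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    nbody_forces<<<blocks, threads>>>(d_pos, d_acc, n, 1e-4f);
    cudaDeviceSynchronize();
}
```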
Multi GPUs in Resource Disaggregated Platform
Easy programming and high performance (vs. a GPU cluster)
• All GPUs appear local to the host; direct memory access (DMA) is available.
Cost-effective (vs. a dedicated GPU machine)
• No special GPU machine is required, and expensive GPUs can be shared.
• Added latency can degrade performance.
[Figure: three configurations compared — the Resource Disaggregated PF (one CPU reaching multiple GPUs through ExpEther (EE) chips over Ethernet), a GPU cluster system (several CPU+GPU nodes connected via InfiniBand/Ethernet), and a dedicated GPU machine (one CPU with locally attached GPUs).]
Two Application Models Using GPUs
Particle motion simulation by the Runge-Kutta method
• N GPU computing steps with only two data transfers
• No device-to-device communication
Advection term calculation by cubic Lagrange interpolation
• Device-to-device communication in each iteration step
• Influenced by latency
(A sketch contrasting the two transfer patterns follows the figure.)
[Figure: two execution sequences — in the Runge-Kutta model, a host-to-device transfer is followed by a loop of n GPU computing steps and a single device-to-host transfer; in the advection model, each of the n loop iterations performs GPU computing plus a device-to-device transfer before the final device-to-host transfer.]
Shimpei Nomura, Tetsuya Nakahama, Junichi Higuchi, Jun Suzuki, Takashi Yoshikawa, and Hideharu Amano, "The Multi-GPU System with ExpEther," Proc. of the PDPTA, July 2012.
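The difference between the two models shows up directly in the host code. The sketch below contrasts the "transfer once, iterate on the GPU" pattern with the "exchange data between devices every step" pattern; the kernel names, buffer layout, and halo size are placeholders, not the code from the cited paper.

```cuda
#include <cuda_runtime.h>

__global__ void rk_step(float4* particles, int n) { /* placeholder */ }
__global__ void advection_step(float* field, int n) { /* placeholder */ }

// Model 1 (Runge-Kutta particle simulation): one host-to-device copy,
// n kernel steps with no inter-device traffic, one device-to-host copy.
// The disaggregated link latency is paid only twice.
void run_particles(float4* h_buf, float4* d_buf, int n, int steps) {
    cudaMemcpy(d_buf, h_buf, n * sizeof(float4), cudaMemcpyHostToDevice);
    for (int s = 0; s < steps; ++s)
        rk_step<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaMemcpy(h_buf, d_buf, n * sizeof(float4), cudaMemcpyDeviceToHost);
}

// Model 2 (advection by cubic Lagrange interpolation): every iteration
// exchanges boundary (halo) data between the GPUs, so each step pays the
// device-to-device latency of the fabric.
void run_advection(float* d_field0, float* d_field1, int halo, int n,
                   int steps) {
    for (int s = 0; s < steps; ++s) {
        cudaSetDevice(0);
        advection_step<<<(n + 255) / 256, 256>>>(d_field0, n);
        cudaSetDevice(1);
        advection_step<<<(n + 255) / 256, 256>>>(d_field1, n);
        // direct device-to-device halo exchange, no host staging
        cudaMemcpyPeer(d_field1, 1, d_field0, 0, halo * sizeof(float));
        cudaMemcpyPeer(d_field0, 0, d_field1, 1, halo * sizeof(float));
        cudaDeviceSynchronize();
    }
}
```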
Particle motion simulation in an 8-GPU system (GPGPU BOX prototype)
8 GPUs achieved a speedup of 7.13x.
• Performance scales nearly linearly with the number of GPUs; the overhead of data transfer to/from the host is dominant at small problem sizes.
[Chart: updates per second (10^9) vs. problem size (1x1024x1024 for 100 and 1000 steps; 10x1024x1024 for 1000, 2000, and 4000 steps) for 1 to 8 GPUs, showing the 7.13x speedup with 8 GPUs and the host data-transfer overhead at small sizes.]
Latency Masking of Data Communication
[Figure: execution timeline for GPU 0-3 — each GPU alternates kernel execution with send/recv of frontier data, synchronizing with the other GPUs between levels; each kernel first processes its own frontier and then the received frontier.]
• Reduce the frequency of communication
• Overlap communication with GPU processing
• GPU-direct communication without going through the host node
(A CUDA stream-overlap sketch follows below.)
[Figure: testbed — a host connected through an ExpEther (EE) HBA and two Mellanox SX1012 Ethernet switches to four K20 GPUs, each behind its own EE chip.]
Takuji Mitsuishi, Jun Suzuki, Yuki Hayashi, Masaki Kan, and Hideharu Amano, "Breadth First Search on Cost-efficient Multi-GPU Systems," Proc. of the International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART), May 2015.
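The "overlap communication and GPU processing" point maps naturally onto CUDA streams: while a kernel processes the GPU's own frontier on one stream, the frontier received from the other GPUs is copied in on another. The sketch below is a single-GPU illustration of that pattern with hypothetical buffer names; it is not the multi-GPU code from the cited paper.

```cuda
#include <cuda_runtime.h>

__global__ void process_frontier(const int* frontier, int n, int* next) {
    /* placeholder kernel body */
}

// Overlap pattern: kernel work on `compute`, data movement on `copy`.
// For true overlap, h_recv should be pinned memory (cudaHostAlloc).
void bfs_level(const int* d_own, int n_own, int* d_next,
               const int* h_recv, int* d_recv, int n_recv) {
    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    // Process our own frontier while the received frontier streams in.
    process_frontier<<<(n_own + 255) / 256, 256, 0, compute>>>(
        d_own, n_own, d_next);
    cudaMemcpyAsync(d_recv, h_recv, n_recv * sizeof(int),
                    cudaMemcpyHostToDevice, copy);

    cudaStreamSynchronize(copy);   // received frontier is now on the GPU
    process_frontier<<<(n_recv + 255) / 256, 256, 0, compute>>>(
        d_recv, n_recv, d_next);

    cudaStreamSynchronize(compute);
    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
}
```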
Level-Synchronized Breadth-First Graph Search
[Chart: MTEPS vs. graph scale (|V| = 2^scale, |E| = 16 x |V|) for 1 and 2 GPUs, comparing a 4-GPU machine with the resource disaggregated platform.]
Comparable scalability to the GPU machine for big graphs.
• 80% of the communication time is successfully masked.
TEPS: traversed edges per second. (A minimal level-synchronized BFS sketch follows the figure.)
[Figure: level-synchronized BFS — starting from the root, the current frontier at each level (Level 1, 2, 3, ...) expands into the next frontier.]
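The level-synchronized scheme itself, in which the current frontier is fully expanded into the next frontier before the next level starts, can be summarized in a few lines of host code. This is a single-threaded C++ sketch of the algorithm on a CSR graph, not the multi-GPU implementation measured above.

```cpp
#include <vector>

// Level-synchronized BFS on a CSR graph (row_ptr/col_idx). All vertices
// in the current frontier are expanded before the next level starts,
// which is exactly where the per-level synchronization cost appears.
std::vector<int> bfs_levels(const std::vector<int>& row_ptr,
                            const std::vector<int>& col_idx, int root) {
    const int n = static_cast<int>(row_ptr.size()) - 1;
    std::vector<int> level(n, -1);
    std::vector<int> frontier{root};
    level[root] = 0;

    for (int depth = 0; !frontier.empty(); ++depth) {
        std::vector<int> next;                  // next frontier
        for (int u : frontier)
            for (int e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {
                int v = col_idx[e];
                if (level[v] == -1) {           // first visit
                    level[v] = depth + 1;
                    next.push_back(v);
                }
            }
        frontier.swap(next);                    // level barrier
    }
    return level;
}
```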
High-Speed Storage Using a PCIe SSD
Same IOPS for 4 kB random reads
• Local: bw = 1814.3 MB/s, IOPS = 464,436
• ExpEther: bw = 1810.3 MB/s, IOPS = 463,428
• Catalog spec: IOPS = 450,000
Intel SSD P3700 400GB
Resource Disaggregated Storage (RD-Store)
KVS (key-value store) over resource-disaggregated NVMe
• Control CPUs and storage devices can be scaled independently.
• Scales up immediately when servers are added, independent of the amount of data.
8 threads/server with an NVMe 1.1 card: 302 kOPS, GET 89 μs, PUT 263 μs
YCSB (Yahoo! Cloud Serving Benchmark), read-heavy workload (100 B): random GET 95%, PUT 5%. (A toy workload sketch follows.)
[Figure: 1-4 control CPUs and 1-5 devices drawn from the pool, attached via the interconnect (ExpEther).]
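To make the "read-heavy (95% GET / 5% PUT)" workload concrete, here is a toy single-node stand-in for the access pattern. It only mimics the YCSB configuration named on the slide; it is not the benchmark code behind the 302 kOPS figure, and the key space and value size are illustrative.

```cpp
#include <random>
#include <string>
#include <unordered_map>

// Toy read-heavy KVS workload: 95% random GETs and 5% PUTs of ~100-byte
// values, mirroring the YCSB read-heavy configuration on the slide.
void run_workload(std::unordered_map<std::string, std::string>& kvs,
                  int num_keys, long ops) {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> key_dist(0, num_keys - 1);
    std::uniform_real_distribution<double> op_dist(0.0, 1.0);
    const std::string value(100, 'x');          // ~100-byte payload

    for (long i = 0; i < ops; ++i) {
        std::string key = "user" + std::to_string(key_dist(rng));
        if (op_dist(rng) < 0.95) {
            auto it = kvs.find(key);            // GET (95%)
            (void)it;
        } else {
            kvs[key] = value;                   // PUT (5%)
        }
    }
}
```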
Conclusion: Resource Disaggregated Platform
The RD Platform expands the use of cloud datacenters from office applications to high-performance computing.
The RD Platform allocates devices from a resource pool at the device level, scaling up the performance and functionality of individual servers.
Since the fabric is ExpEther (a distributed PCIe switch over Ethernet), open, standard hardware and software can be used to build customers' computer systems.
The combination of the latest Gen3 x8 / 2x 40GbE ExpEther and a Power8 server shows potential for compute-intensive workloads.
The performance of both the 8-GPU machine and the KVS with shared PCIe storage demonstrates how easily the platform scales up in the datacenter.
Acknowledgement
I would like to thank all the members working on developing this technology:
• Hideharu Amano at Keio University
• Susumu Date and Shinji Shimojo at Osaka University
• Richard Solomon, Hiroki Iwasaki, Hironori Ando, Takashi Yagi, Hiroshi Ono, Mitsuhiro Miyata, and other Synopsys people
• Takaaki Terao, Tsuyoshi Yoshioka, Koetsu Narisawa, Nobuhiro Numata, Yosuke Yano, Takayuki Saito, Osamu Nara, Takumi Kosawa, and other ID people
• Yoichi Hidaka, Junichi Higuchi, Jun Suzuki, Yuki Hayashi, Masahiko Takahashi, Masaki Kan, Akira Tsuji, Shinya Miyakawa, Nobuharu Kami, Teruyuki Baba, Nobuyuki Enomoto, Hideyuki Shimonishi, Atsushi Iwata, Shinji Abe, Shinji Ueno, Hiroyoshi Yamane, Ryuta Niino, Kiyoshi Baba, Hiroki Yokoyama, Tatsuya Takada, Shiro Momot, Taka Koishi, Yasushi Kikkawa, Yoshimitsu Okayama, Hiroshi Baba, Tsukasa Kimura, Masayuki Terao, and other NEC people