Upload
hoanghanh
View
227
Download
0
Embed Size (px)
Citation preview
F a c e b o o k F l e x i b l e G P U E x p a n d e r B i g B a s i n R e f r e s h
Whitney Zhao/HW Eng/Facebook Inc.Xiaodong Wang/SW Eng/Facebook Inc.
3
Introduct ion
Agenda
Architecture
Performance
Quest ions
4
Introduct ion
Agenda
5
Impact
LANGUAGE TRANSLATION LANGUAGE TRANSLATION
Facebook’s commitment to developing AI & advancing ML
FACE RECOGNITION
FACE RECOGNITIONSEARCH
SEARCH
ADS
ADSNEWS FEED
NEWS FEED
SIGMA
LUMOS
6
Goal • Open, full contribution to OCP• Disaggregation/Modularity• Serviceability
2016: Big Sur2017: Leopard + Big Basin2018: Tioga Pass + Big Basin V2
7
Big Basin V2 Overview
• 3 OU chassis• Open Rack v2 compatible• 8x Nvidia Tesla V100 GPUs; NVLink capable• 300W TDP for each Tesla V100 GPU• Facebook 2S Server Tioga Pass as Head node
8
A deeper look into Big Basin
Baseboard on sliding tray
Midplane boardBaseboardIO board
9
Serviceability
• Quick repairs at data center
• Telemetries accessible from head node
• Provisioning Big Basin with its head node is not much different from provisioning existing servers; these servers come with additional GPUs.
Thumb screws
10
Agenda
Architecture
11
Architecture (Headnode to Big Basin)
Leopard + Big Basin(Tesla P100) Tioga Pass + Big Basin V2(Tesla V100)
• MiniSAS HD cable(2 for each x16)o Standard PCIe x16o Present Pino USB2.0o IPMB/I2C
12
Architecture (PCIe)
Leopard + Big Basin Tioga Pass + Big Basin V2
Tioga Pass50G
24 4
51 1 2
3 3
5
13
Architecture (NVLINK)
Big Basin W/Nvidia Tesla P100 Big Basin V2 W/ Nvidia Tesla V100
14
Architecture (IPMB/I2C/PMBUS)
15
Agenda Performance
16
Performance
• Hardware Spec Improvement
• Application performance• Computer vision
• Single-GPU• Multi-GPU scalability• TensorCore
• Neural machine translation
17
Performance
Metrics NVIDIA V100 NVIIDA P100 Improvement
Performance
FP‐32 15 TFLOPS 10.6 TFLOPS 1.42x
FP‐16 30 TFLOPS 21.2 TFLOPS
TensorCore 125 TFLOPS NA Up to 5x
Mem Bandwidth 900 GB/s 720 GB/s 1.25x
NVLink 300 GB/s 160 GB/s 1.88x
Power 300 W 300 W
• Comparisons of GPU Hardware
18
Performance
• Comparisons of GPU Hardware
• Head-node upgrade: Tioga Pass• New CPU architecture: Broadwell to Skylake• Double PCIe bandwidth• Upgraded 100G NIC
• CUDA 9 + cudnn 7: faster libraries, etc.
19
Impact - Computer Vision
20
Performance metrics in Computer Vision
• Computer Vision: resnet-50⎻1-GPU training speed: use P100 + CUDA 8 as baseline
P100 V100 + CUDA 8 V100 + CUDA 9 V100 + CUDA 9 +TensorCore
1.66x
21
Computer Vision Performance
1-V100 2-V100 4-V100 8-V100
FP-32
• Computer Vision⎻Multi-GPU speedup vs. 1 P100
Spee
dup
vs. 1
P1
00
1.66x
22
Computer Vision Performance
1-V100 2-V100 4-V100 8-V100
FP-32
FP-16 TensorCore
• Computer Vision⎻High-bandwidth FP-16 TensorCore (WIP)
Spee
dup
vs. 1
P1
00
23
Machine Translation
Better Translation Quality
Phrase-based statistical approach Neural network approach
24
Machine Translation Performance
• Neural Machine translation
V100 + CUDA 9
2.2XV100 + CUDA 8
1.45XP100 + CUDA 8
Training Throughput as
Baseline
25
Questions?
27
OCP Marketplace
• http://www.opencompute.org/products/specsanddesign?keyword=Big+basin