1
HPC’s most Intelligent Storage Solution
2
What is ClusterStor…
Integrated Software: Complete Solution
Modular Building Blocks: Performance + Capacity
Integrated Storage Servers and Storage Capacity featuring NXD Flash I/O Accelerator & De-Clustered RAID
Engineered HPC Storage System: Complete Solution
High density 12G disk enclosures
Integrated HA Object Storage Servers
Dedicated Lustre software development, integration, test & support team
CLI, GUI and API integration and administration
Manufacturing integration & test validation
Maximum performance from each disk drive
Makes Lustre easier to manage at scale
Support and Professional Services
Copyright Cray, Inc. 2017
3
ClusterStor High-End Storage Leadership (8 of Top 20)
Nov 2017 list

| Rank | Name | Computer | File system | Size (PB) | Perf (GB/s) |
|------|------|----------|-------------|-----------|-------------|
| 1 | Sunway TaihuLight | Sunway MPP, Sunway 260C 1.45GHz, Sunway | Sunway GFS (Lustre) | 20 | ~100 |
| 2 | Tianhe-2 | TH-IVB-FEP Cluster, Xeon E5-2692 2.2GHz, TH Express-2, Xeon Phi | Lustre / H2FS | 12.4 | 750 |
| 3 | Piz Daint (CSCS) | Cray XC30, Xeon E5-2670 8C 2.600GHz, Aries interconnect, NVIDIA K20x | Lustre | 6 | 138 |
| 4 | Gyoukou (Japan Agency for Marine-Earth Science and Tech.) | | | | |
| 5 | Titan (ORNL) | Cray XK7, Opteron 6274 16C 2.2GHz, Cray Gemini, NVIDIA K20x | Lustre | 32 | 700 |
| 6 | Sequoia (LLNL) | BlueGene/Q, Power BQC 16C 1.60GHz, Custom Interconnect | Lustre | 55 | 850 |
| 7 | Trinity (LANL) | Cray XC40, Xeon E5-2698v3 16C 2.3GHz, Aries interconnect | Lustre | 78* | 1,700 |
| 8 | Cori (NERSC) | Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Aries interconnect | Lustre | 30* | 1,000 |
| 9 | Oakforest-PACS | PRIMERGY CX1640 M1, Intel Xeon Phi 7250 68C 1.4GHz, Intel Omni-Path | Lustre | 26 | 400 |
| 10 | K computer | Fujitsu, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | Lustre | 40 | 965 |
| 11 | Mira | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | GPFS | 28.8 | 240 |
| 12 | Stampede2 | Dell PowerEdge C6320P | Lustre | 21* | >300 |
| 13 | TSUBAME3.0 | SGI ICE XA, IP139-SXM2, Xeon E5-2680v4 14C 2.4GHz, Intel Omni-Path, NVIDIA Tesla P100 SXM2 | | | |
| 14 | Marconi | CINECA Cluster, Intel Xeon Phi 7250 68C 1.4GHz, Intel Omni-Path | GPFS | 17 | 10 |
| 15 | MetOffice | Cray XC40, Xeon E5-2695v4 18C 2.1GHz, Aries | Lustre | 26* | 1,000 |
| 16 | MareNostrum | Lenovo SD530, Xeon Platinum 8160 | GPFS | 14 | ~100 |
| 17 | Pleiades (NASA) | SGI ICE X, Intel Xeon, Infiniband FDR | Lustre | 29 | ?? |
| 18 | Theta | Cray XC40, Intel Xeon Phi 7230 64C 1.3GHz, Aries interconnect | Lustre | 10.1 | 224 |
| 19 | Hazel Hen (HLRS) | Cray XC40, Xeon E5-2680v3 12C 2.5GHz, Aries | Lustre | 11 | 350 |
| 20 | Shaheen II (KAUST) | Cray XC40, Xeon E5-2698v3 16C 2.3GHz, Aries | Lustre | 18 | >500 |

* usable capacity
4
HPC users who have already switched to ClusterStor
5
ClusterStor News…
• Cray and Seagate form Strategic Partnership
• ClusterStor HPC storage product line moved to Cray
• Transaction closed on Monday, September 25th
• Sell, support, and enhance the ClusterStor product line
• Collaborate with Seagate to integrate future Seagate technologies
• Manufacturing moving to Cray’s facility in Chippewa Falls, WI
• Approximately 130 Seagate employees and contractors join Cray
• Mostly in R&D, Service and Support, and Reseller channel support
• Provides a commercial, fully supported, and up-to-date Lustre build
• Provides a well-established solution proven at all scales
• Ensures continuity of support for ClusterStor customers
6
ClusterStor L300 Platform
• Lustre 2.7
• 12 Gbit SAS enclosures
• Broadwell-based ESMs*
• EDR IB
• OPA
• 100 GbE
• 2x 40 GbE
• 6/8/10 TB HDDs
• 3.2 TB SSDs
• NXD** I/O Manager
• Advanced MMU*** (option)

Few applications with large, sequential I/O workloads: ClusterStor L300
Many applications with mixed I/O workloads: ClusterStor L300N
Price difference: 15-20%

• Lustre 2.7
• 6 Gbit SAS enclosures
• Haswell-based ESMs
• EDR IB
• OPA
• 100 GbE
• 2x 40 GbE
• 6/8/10 TB HDDs
* Embedded Server Module
** Nondisruptive aXeleration of Data
*** Metadata Management Unit
7
Scalable Storage Unit (SSU)
• 5U84 enclosure, completely HA
• Two trays of 42 HDDs, redundantly SAS connected
• Dual-ported 3.5” NL SAS HDD & SSD support
• 300+ MB/s SAS available bandwidth per HDD
• Pair of HA Embedded Server Modules
• L300 = up to 10 GB/sec sustained IOR over IB
• IB EDR, Intel OPA, or 2x40/100 GbE network link
• Support for Lustre®
• 2x SSD OSS journal/WIB disks for increased performance
• Data protection/integrity (GridRAID, 41 drives)
• GridRAID: 2 OSSs per SSU, 1 OST per OSS
• 4/6/8/10 TB drives supported
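A quick consistency check of the enclosure numbers above, as a sketch using only the figures quoted on this slide (actual usable bandwidth depends on configuration):

```python
# Drive-count and raw-bandwidth sanity check for one 5U84 SSU,
# using only the figures quoted on this slide.
hdds_per_ost = 41          # GridRAID pool size
osts_per_ssu = 2           # one OST per OSS, two OSSs per SSU
ssd_journals = 2           # SSD OSS journal/WIB disks

slots_used = hdds_per_ost * osts_per_ssu + ssd_journals
print(slots_used)          # 84 -> fills the 5U84 enclosure exactly

sas_mb_per_hdd = 300       # "300+ MB/s SAS available bandwidth per HDD"
raw_gb_s = hdds_per_ost * osts_per_ssu * sas_mb_per_hdd / 1000
print(raw_gb_s)            # ~24.6 GB/s raw SAS vs ~10 GB/s sustained IOR
```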
8
GridRAID - Faster Time to Recovery
[Diagram: two HA Object Storage Servers, each serving a 41-drive GridRAID OST, with SAS connectivity to the drive trays and Lustre connectivity to the client network (IB or 40GbE).]
Key Product Advantage
› Up to 4x faster HDD Recovery time
› GridRAID = Parity Declustered RAID
o Parity de-clustering is designed to
balance cost against data reliability and
performance during failure recovery.
o It improves on standard parity
organizations by reducing the additional
load on surviving disks during the
reconstruction of a failed disk’s
contents.
o Yields higher user throughput during
recovery, and/or shorter recovery time.
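To make the declustering argument concrete, here is a minimal sketch (illustrative numbers only, not the actual GridRAID layout algorithm) comparing the rebuild read load per surviving disk in a conventional 8+2 RAID 6 group with the same 8+2 stripes declustered across the 41-drive GridRAID pool:

```python
# Illustrative rebuild-load comparison (not the real GridRAID mapping).
# In a conventional 8+2 RAID 6 group, rebuilding one failed disk reads
# the full contents of the 9 surviving group members.
# With the same 8+2 stripes declustered over a 41-drive pool, the failed
# disk's stripes are spread over all 40 survivors, so each one
# contributes only a fraction of its capacity.

drive_tb = 8                       # example drive size (assumption)
stripe_width = 10                  # 8 data + 2 parity

# Conventional RAID 6: each surviving drive in the group is read in full.
conventional_read_per_survivor = drive_tb

# Declustered over 41 drives: (stripe_width - 1) * drive_tb of reads,
# shared by the 40 surviving drives.
declustered_read_per_survivor = (stripe_width - 1) * drive_tb / 40

print(conventional_read_per_survivor)   # 8 TB per survivor
print(declustered_read_per_survivor)    # 1.8 TB per survivor (~4.4x less)
```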
9
Dataflow – Platform 300

[Diagram: data path from the EDR IB / 100GbE host links through the embedded server modules (Haswell for the L300, Broadwell for the L300N), over x16/x8 PCIe lanes to the 12 Gbps SAS controllers and expanders, and out to x14 drive groups (x12 to the other drawer) plus DDR memory. Key figures: ~300 MB/s per drive (~4.3 GB/s per x14 drive group), ~12-13 GB/s of SAS bandwidth available against the ~12.5 GB/s required to deliver ~10 GB/s of user data through 8+2+2 GridRAID; aggregate ~45 GB/s (4.5x), raw = 68 GB/s.]
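The "~12.5 GB/s required" figure follows directly from the 8+2 parity overhead; a minimal sketch of that arithmetic, assuming the 8-data-in-10 stripe ratio shown in the diagram:

```python
# Back-end bandwidth needed to sustain ~10 GB/s of user data through an
# 8+2 parity layout: every 8 data chunks written carry 2 parity chunks.
user_gb_s = 10.0
data_fraction = 8 / 10            # 8 data chunks out of 10 per stripe
backend_gb_s = user_gb_s / data_fraction
print(backend_gb_s)               # 12.5 GB/s, matching the diagram
```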
10
Efficiency @ Scale: HDD Performance

• ClusterStor has pioneered a 10x performance improvement in HPC storage in the last 5 years
• ClusterStor has delivered the fastest and most efficient HPC storage systems in the world

[Chart: realized HDD performance in HPC storage systems, from 2011 (ORNL) through 2013 (NCSA), 2014 (KAUST), 2015 (DKRZ), 2016 (CEA), and 2017 (CS L300N), plotted as file system performance per HDD versus single-HDD maximum raw performance; the difference is the efficiency gap. The L300N delivers 83% efficiency, 244 MB/s per drive.]
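Reading the 83% figure backwards: with 244 MB/s of delivered file system throughput per drive, the implied single-drive raw baseline is roughly 295 MB/s. This is a sketch only; the exact baseline drive rate is not stated on the slide:

```python
# Efficiency = delivered file-system throughput per HDD / single-HDD raw max.
delivered_mb_s = 244
efficiency = 0.83
implied_raw_max = delivered_mb_s / efficiency
print(round(implied_raw_max))     # ~294 MB/s, consistent with the
                                  # "300+ MB/s per HDD" figure quoted earlier
```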
11
ClusterStor: No Single Points Of Failure
12
High Availability Meta Data Server: performance for thousands of compute clients
High Availability MGMT Servers: health and performance
High Availability Capacity Expansion: lower price per PB + performance
High Availability Data Network: provides compute clients an alternative data path to the storage servers & HDD/SSD
High Availability System Management Network: eliminates single points of failure
Factory Integration and Test: faster time to acceptance and production
File System & Linux OS Integration: 100's of SW improvements + test validation
De-clustered Parity RAID: faster time to recovery & higher data resiliency
Modular Performance + Capacity Building Blocks: balanced I/O up to TB/sec
Disk Health Monitoring & Management: faster time to repair
High Availability Storage Servers: failover/failback monitoring and management
Custom SAS Performance Accelerator: provides additional HDD performance of up to 12 GB/sec per enclosure module
Complexity – Managed behind the scenes
13
ClusterStor: Monitoring & Reporting
SNMP
REST API
Guided FRU Repair
Nagios
14
ClusterStor: ClusterStor Manager
15
ClusterStor: Linear Scalability

A unique converged architecture that adds compute, memory, network, SSD, and HDD resources with each modular building block, for linear performance at scale.

[Chart: ClusterStor L300 with 10 TB HDDs; performance (IOR, GB/sec on EDR) scales linearly with raw capacity (PB) from roughly 10 PB to 100 PB as building blocks are added.]
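A sketch of the linear-scaling arithmetic behind the chart, assuming the per-SSU figures quoted earlier in this deck (82 x 10 TB HDDs, so roughly 820 TB raw, and ~10 GB/s IOR per SSU); real sizing depends on configuration:

```python
# Linear scaling sketch: capacity and IOR throughput grow with each SSU.
# Assumes ~820 TB raw (82 x 10 TB HDDs) and ~10 GB/s IOR per SSU, as
# quoted elsewhere in this deck; illustrative only.
raw_tb_per_ssu = 82 * 10
ior_gb_s_per_ssu = 10

for ssus in (1, 2, 4, 8, 16, 32, 64, 128):
    raw_pb = ssus * raw_tb_per_ssu / 1000
    perf = ssus * ior_gb_s_per_ssu
    print(f"{ssus:4d} SSUs: {raw_pb:6.1f} PB raw, ~{perf} GB/s IOR")
```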
16
L300N
17
ClusterStor L300N: Problem Statement

I/O workloads are becoming increasingly unpredictable for the HPC storage system:

1. Many compute applications on the same storage platform
   › Mixed I/O patterns: random, unaligned, strided, small, large, sequential I/O
2. End of Dennard scaling
   › Leads to exploding core counts
   › Makes the mixed I/O problem worse
3. Fast HPC storage needs to be more than "scratch"
   › Home/project directories
   › Storage for analytics workloads

Ways to solve this mixed I/O workload problem:
› Over-provision with additional storage servers & disk drives ($$$)
› Buy expensive All Flash Arrays ($$$)
› Deploy the new ClusterStor L300N solution ($)
18
ClusterStor L300N: Any Workload, Any Time

Compute cluster applications: sensor data capture & processing, data mining, simulations, modelling, software development, visualization of complex data, rapid mathematical calculations.

Mixed I/O patterns (random, unaligned, small block, sequential) = unpredictable application performance.

Goal: support the widest application I/O variety within budget.

Challenge: lost productivity due to poor I/O predictability.
Solution: smart I/O management provides consistent performance.
Result: I/O optimized for each application I/O profile.

ClusterStor NXD = transparent redirection of I/O to the appropriate SSD or HDD medium.
ClusterStor NXD = predictable and fastest compute application time-to-solution.
19
ClusterStor L300N NXD Software Architecture

NXD I/O Manager
– Filter driver implemented as a device-mapper target driver
– Core library compiled as a Linux kernel module with well-defined APIs
– Works at the block layer, so it is transparent to the file system and applications
– Core caching function is an OS-agnostic portable library with well-defined interfaces
– The filter driver intercepts I/O and routes it through the cache management library for caching functions

[Diagram: I/O stack from the system I/O call interface through the virtual file system (Lustre or Spectrum Scale) and file systems (ldiskfs or NSD), down to the device-mapper layer where the NXD DM target filter driver and NXD caching library sit, then the SCSI mid-layer (sd, sg, st, ...) and the NXD low-level driver, and finally the hardware layer: an SSD partition for I/O ≤ 32 KiB and an HDD partition for I/O > 32 KiB.]
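The core routing decision is simple: the filter driver looks at each block I/O's size and steers it to the SSD cache or straight to the HDD tier. A minimal sketch of that policy follows; this is illustrative Python, not the actual kernel driver, which is a device-mapper target implemented in C:

```python
# Illustrative NXD-style routing policy: small block I/O goes to the SSD
# cache partition, large I/O goes straight to the HDD (spindle) partition.
# Sketch of the decision only, not the real device-mapper target.
SMALL_IO_THRESHOLD = 32 * 1024   # 32 KiB cutoff shown in the diagram

def route_io(size_bytes: int) -> str:
    """Return the destination tier for a block I/O of the given size."""
    return "ssd_cache" if size_bytes <= SMALL_IO_THRESHOLD else "hdd"

for size in (4096, 8192, 32 * 1024, 64 * 1024, 1 << 20):
    print(f"{size:>8} bytes -> {route_io(size)}")
```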
20
Basic Nytro IO Manager Data Path

[Diagram: Lustre or Spectrum Scale above the OST/NSD block layer, where the Nytro filter driver and Nytro caching library route I/O either to SSDs (with in-memory state) or directly to spinning media; writes are acknowledged once the data is on media.]

1. Incoming I/Os are profiled & filtered
2. Large I/O goes to spindle, small I/O to SSD
3. Writes are acknowledged once on media
4. Small I/Os are coalesced & flushed to spindle
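The write path in the four steps above can be sketched as a tiny cache model: small writes are acknowledged once they land on SSD, then coalesced and flushed to the spindles in larger chunks. The thresholds and flush policy below are assumptions for illustration, not the product's actual tunables:

```python
# Toy model of the Nytro write path: small writes land on SSD and are
# acknowledged immediately; they are later coalesced and flushed to HDD
# in larger sequential chunks. Thresholds here are illustrative only.
SMALL_IO = 32 * 1024          # route-to-SSD cutoff
FLUSH_CHUNK = 1 << 20         # coalesce ~1 MiB before flushing (assumption)

ssd_buffer = []               # pending small writes cached on SSD
hdd_writes = []               # what eventually reaches the spindles

def write(size_bytes: int) -> str:
    if size_bytes > SMALL_IO:
        hdd_writes.append(size_bytes)      # large I/O goes straight to disk
        return "ack (on HDD)"
    ssd_buffer.append(size_bytes)          # small I/O cached on SSD
    if sum(ssd_buffer) >= FLUSH_CHUNK:     # coalesce & flush to spindle
        hdd_writes.append(sum(ssd_buffer))
        ssd_buffer.clear()
    return "ack (on SSD)"

for s in [4096] * 600 + [4 * 1024 * 1024]:
    write(s)
print(hdd_writes)   # two coalesced ~1 MiB flushes, then the 4 MiB direct write
```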
21
NXD Operation & Block I/O Distribution Example

WRITE I/O history:

| Block size | # Events |
|---|---|
| < 4K | 0 |
| 4K | 23978 |
| 4K+1 to 8K | 499318 |
| 8K+1 to 16K | 19954 |
| 16K+1 to 32K | 1962 |
| 32K+1 to 64K | 127 |
| 64K+1 to 128K | 39 |
| 128K+1 to 256K | 6 |
| 256K+1 to 512K | 2 |
| 512K+1 to 32M | 0 |

READ I/O history:

| Block size | # Events |
|---|---|
| < 4K | 0 |
| 4K | 56796 |
| 4K+1 to 8K | 456385 |
| 8K+1 to 16K | 55 |
| 16K+1 to 32K | 31 |
| 32K+1 to 64K | 4 |
| 64K+1 to 32M | 0 |
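Summing the histogram shows why the 32 KiB routing threshold covers essentially all of this workload; a small sketch using the write-side event counts from the table above (bucket labels simplified):

```python
# Fraction of write I/Os at or below the 32 KiB SSD-routing threshold,
# using the event counts from the WRITE I/O history table above.
write_events = {
    "4K": 23978,
    "4K-8K": 499318,
    "8K-16K": 19954,
    "16K-32K": 1962,
    "32K-64K": 127,
    "64K-128K": 39,
    "128K-256K": 6,
    "256K-512K": 2,
}
small = sum(v for k, v in write_events.items()
            if k in ("4K", "4K-8K", "8K-16K", "16K-32K"))
total = sum(write_events.values())
print(f"{small / total:.4%} of writes are <= 32 KiB")   # ~99.97%
```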
22
L300N NXD – Write/Rewrite, Aligned I/O
16 nodes, 64 processes total, 1x SSU (2x Koho ME vs. 82x 8TB MakaraBP)

[Charts: write and re-write IOPS (y-axis 0 to 90,000) versus transfer size (1 to 128 KB), comparing GridRAID with NXD (NytroXD) against GridRAID without NXD.]
23
L300N NXD – Read Performance
16 nodes, 64 processes total, 1x SSU (2x Koho ME vs. 82x 8TB MakaraBP)

[Charts: aligned (1 to 128 KB transfer size) and unaligned (5 to 160 KB) read IOPS, y-axis 0 to 200,000, comparing GridRAID with NXD (NytroXD) against GridRAID without NXD.]
24
L300N – IOPS Performance

Test conditions: IOR-2.10.3, stonewalling (180 sec)*, preconditioned for random workload, optimal client/server settings.

Performance per SSU (4 KiB IOPS):

| | Write | Write Update | Read |
|---|---|---|---|
| Peak* L300N | 45,000 IOPS | 60,000 IOPS | 200,000 IOPS |
| Peak* L300 | 2,800 IOPS | 3,500 IOPS | 21,000 IOPS |

| | Write | Read |
|---|---|---|
| Sustained L300N | 25,000 IOPS | 110,000 IOPS |
| Sustained L300 | < 2,000 IOPS | < 15,000 IOPS |

Definitions: Write = random sparse write; Write Update = random rewrite (overwrite); Read = random read.
* Stonewalling (180 sec)
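Taken from the peak rows of the table, the L300N's NXD path delivers roughly an order of magnitude more small-block IOPS than the plain L300; a quick ratio check using the values above:

```python
# Peak 4 KiB IOPS ratios, L300N vs L300, from the table above.
peaks = {
    "write": (45_000, 2_800),
    "update": (60_000, 3_500),
    "read": (200_000, 21_000),
}
for name, (l300n, l300) in peaks.items():
    print(f"{name}: {l300n / l300:.1f}x")   # ~16.1x, ~17.1x, ~9.5x
```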
25
ClusterStor and Lustre 2.7 DNE Hardware

• DNE is a hardware OPTION in ClusterStor v3.0 using the standard MMU
• No changes to the base rack MDS/MGT server configuration
• MMU configuration: the base MMU is configured with MDT0 as master and MDT1 as an additional MDT via DNE
• DNE servers are configured in active/active pairs
• ClusterStor 2U24 with 2 MDS embedded server modules using the same standard MMU
• Scale metadata capacity/performance with DNE server pairs via MMU
• Up to 8 additional DNE MDS server pair chassis supported per file system (up to 8 additional MMUs)
• 2 to 18 MDS/MDT supported per file system max

[Diagram: directory tree distributed across MDTs; the root and dir a live on MDT0 (base MMU), with dir b on MDT1, dir c on MDT2, dir d on MDT3, dir e on MDT4, and dir f on MDT5, each holding its own files and subdirectories.]
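The idea behind DNE: new top-level directories are spread across the available MDTs so metadata load scales with the number of MDS/MDT pairs. The round-robin placement and directory names below are a toy illustration only, not Lustre's actual DNE placement logic:

```python
# Toy illustration of DNE: distribute new top-level directories across
# several MDTs so metadata operations scale out. Round-robin placement
# here is illustrative only, not Lustre's real assignment policy.
from itertools import cycle

mdts = [f"MDT{i}" for i in range(6)]          # MDT0 (base MMU) + 5 via DNE
dirs = ["dir_a", "dir_b", "dir_c", "dir_d", "dir_e", "dir_f"]
placement = dict(zip(dirs, cycle(mdts)))
for directory, mdt in placement.items():
    print(f"{directory} -> {mdt}")
```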
26
MMU-Advanced Performance

Test configuration: single MDS/MDT, 2x SSU+1 (32 OSTs), RAID10 (5+5), mdtest-1.9.3, reporting files/s.

| MDS | Create | Stat | Open | Remove |
|---|---|---|---|---|
| MMU - Standard | 55K | 220K | 95K | 70K |
| MMU - Advanced | 105K | 350K | 220K | 80K |
27
What makes NXD a better choice?
• Fully transparent I/O acceleration solution
• No changes for applications, tools, scripts, operators, or users
• Standard Lustre clients
• No custom libraries, no FUSE, no recompile, etc.
• NXD cache size scales with file system size
• Best balance of capacity and value
• Full data integrity
• GridRAID
• Network checksums
• T10-PI
• Easily augmented with future All Flash Arrays
• OST pool deployment
ClusterStor L300N
28
Summary: ClusterStor Value Proposition

Five unique solution values:

1. Performance Efficiency: highest performance throughput per storage device. Less hardware needed to achieve the performance requirement.
2. Engineered Solution: pre-integrated, tested, tuned, and shipped ready to deploy. Days instead of weeks to implementation.
3. Reliability: no-single-point-of-failure architecture. Fewer rebuilds and less downtime.
4. Scalability: sustained linear performance when adding capacity. Predictable application performance.
5. Enterprise-grade Management and Support: comprehensive file system management, RAS/phone home, holistic hardware monitoring with health alerts, API. Less downtime and faster time to problem resolution.
29
Thank You