HPC Storage: Current Status and Futures
Torben Kling Petersen, PhD
Principal Architect, HPC
Agenda ??
• Where are we today ??
• File systems
• Interconnects
• Disk technologies
• Solid state devices
• Solutions
• Final thoughts …

Pan Galactic Gargle Blaster
"Like having your brains smashed out by a slice of lemon wrapped around a large gold brick."
Current Top10 …..

N.B. NCSA Blue Waters: 24 PB, 1,100 GB/s (Lustre 2.1.3)

Rank | Name | Computer | Site | Total Cores | Rmax (GFlop/s) | Rpeak (GFlop/s) | Power (kW) | File system | Size | Perf
1 | Tianhe-2 | TH-IVB-FEP Cluster, Xeon E5-2692 12C 2.2GHz, TH Express-2, Intel Xeon Phi | National Super Computer Center in Guangzhou | 3,120,000 | 33,862,700 | 54,902,400 | 17,808 | Lustre/H2FS | 12.4 PB | ~750 GB/s
2 | Titan | Cray XK7, Opteron 6274 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x | DOE/SC/Oak Ridge National Laboratory | 560,640 | 17,590,000 | 27,112,550 | 8,209 | Lustre | 10.5 PB | 240 GB/s
3 | Sequoia | BlueGene/Q, Power BQC 16C 1.60GHz, Custom interconnect | DOE/NNSA/LLNL | 1,572,864 | 17,173,224 | 20,132,659 | 7,890 | Lustre | 55 PB | 850 GB/s
4 | K computer | Fujitsu, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | RIKEN AICS | 705,024 | 10,510,000 | 11,280,384 | 12,659 | Lustre | 40 PB | 965 GB/s
5 | Mira | BlueGene/Q, Power BQC 16C 1.60GHz, Custom interconnect | DOE/SC/Argonne National Laboratory | 786,432 | 8,586,612 | 10,066,330 | 3,945 | GPFS | 7.6 PB | 88 GB/s
6 | Piz Daint | Cray XC30, Xeon E5-2670 8C 2.6GHz, Aries interconnect, NVIDIA K20x | Swiss National Supercomputing Centre (CSCS) | 115,984 | 6,271,000 | 7,788,853 | 2,325 | Lustre | 2.5 PB | 138 GB/s
7 | Stampede | PowerEdge C8220, Xeon E5-2680 8C 2.7GHz, IB FDR, Intel Xeon Phi | TACC/Univ. of Texas | 462,462 | 5,168,110 | 8,520,112 | 4,510 | Lustre | 14 PB | 150 GB/s
8 | JUQUEEN | BlueGene/Q, Power BQC 16C 1.6GHz, Custom interconnect | Forschungszentrum Juelich (FZJ) | 458,752 | 5,008,857 | 5,872,025 | 2,301 | GPFS | 5.6 PB | 33 GB/s
9 | Vulcan | BlueGene/Q, Power BQC 16C 1.6GHz, Custom interconnect | DOE/NNSA/LLNL | 393,216 | 4,293,306 | 5,033,165 | 1,972 | Lustre | 55 PB | 850 GB/s
10 | SuperMUC | iDataPlex DX360M4, Xeon E5-2680 8C 2.7GHz, Infiniband FDR | Leibniz Rechenzentrum | 147,456 | 2,897,000 | 3,185,050 | 3,423 | GPFS | 10 PB | 200 GB/s
Lustre Roadmap
Other parallel file systems
• GPFS
  – Running out of steam ?? Let me qualify !! (and he then rambles on …..)
• Fraunhofer
  – Excellent metadata performance
  – Many modern features
  – No real HA
• Ceph
  – New, interesting and with a LOT of good features
  – Immature and with limited track record
• Panasas
  – Still RAID 5 and running out of steam ….
Object based storage
• A traditional file system includes a hierarchy of files and directories
  – Accessed via a file system driver in the OS
• Object storage is "flat"; objects are located by direct reference
  – Accessed via custom APIs: Swift, S3, librados, etc.
• The difference boils down to 2 questions:
  – How do you find files?
  – Where do you store metadata?
• Object store + metadata + driver is a file system (sketched below)
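To make the last point concrete, here is a minimal, purely illustrative Python sketch (not any real product's API): a flat store addressed only by object IDs, plus a separate metadata map from POSIX-style paths to those IDs. The path-to-ID map and the read/write methods are exactly the "metadata + driver" part that a file system adds on top of an object store.

import uuid

class FlatObjectStore:
    """Flat object store: objects are found only by direct reference (an object ID)."""
    def __init__(self):
        self._objects = {}                      # object_id -> bytes

    def put(self, data: bytes) -> str:
        oid = uuid.uuid4().hex                  # the direct reference to the object
        self._objects[oid] = data
        return oid

    def get(self, oid: str) -> bytes:
        return self._objects[oid]

class ToyFileSystem:
    """File system = object store + metadata (path -> object ID) + 'driver' methods."""
    def __init__(self, store: FlatObjectStore):
        self.store = store
        self.metadata = {}                      # "/path/to/file" -> object_id

    def write(self, path: str, data: bytes) -> None:
        self.metadata[path] = self.store.put(data)

    def read(self, path: str) -> bytes:
        return self.store.get(self.metadata[path])

fs = ToyFileSystem(FlatObjectStore())
fs.write("/home/user/results.dat", b"hello HPC")
print(fs.read("/home/user/results.dat"))        # b'hello HPC'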
Object Storage Backend: Why?
• It's more flexible. Interfaces can be presented in other ways, without the FS overhead: a generalized storage architecture vs. a file system.
• It's more scalable. POSIX was never intended for clusters, concurrent access, multi-level caching, ILM, usage hints, striping control, etc.
• It's simpler. With the file-system-"isms" removed, an elegant (= scalable, flexible, reliable) foundation can be laid.
Elephants all the way down...
Most clustered file systems and object stores are built on local file systems, and inherit their problems:
• Native FS
  – XFS, ext4, ZFS, btrfs
• Object store on FS
  – Ceph on btrfs
  – Swift on XFS
• FS on object store on FS
  – CephFS on Ceph on btrfs
  – Lustre on OSS on ext4
  – Lustre on OSS on ZFS
The way forward ..
• Object store based solutions offer a lot of flexibility:
  – Next-generation design, for exascale-level size, performance, and robustness
  – Implemented from scratch: "If we could design the perfect exascale storage system..."
  – Not limited to POSIX
  – Non-blocking availability of data
  – Multi-core aware: non-blocking execution model with thread-per-core
  – Support for non-uniform hardware: flash, non-volatile memory, NUMA
  – Using abstractions, guided interfaces can be implemented, e.g. for burst buffer management (pre-staging and de-staging)
Interconnects (Disk and Fabrics)
• S-ATA 6 Gbit
• FC-AL 8 Gbit
• SAS 6 Gbit
• SAS 12 Gbit
• PCI-E direct attach
• Ethernet
• Infiniband
• Next gen interconnect …
12 Gbit SAS
• Doubles bandwidth compared to SAS 6 Gbit
• Triples the IOPS !!
• Same connectors and cables
• 4.8 GB/s in each direction, with 9.6 GB/s following (2 streams moving to 4 streams) – see the arithmetic below
• 24 Gb/s SAS is on the drawing board ….
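As a rough sanity check on the 4.8 GB/s figure, a small illustrative calculation in Python (the 8b/10b line encoding and the 4-lane wide port are assumptions, not stated on the slide):

def sas_wide_port_gb_per_s(line_rate_gbps: float, lanes: int = 4,
                           encoding_efficiency: float = 8 / 10) -> float:
    """Usable bandwidth per direction (GB/s) of a SAS wide port."""
    payload_gbps_per_lane = line_rate_gbps * encoding_efficiency   # strip 8b/10b overhead
    return payload_gbps_per_lane * lanes / 8                       # bits -> bytes

print(sas_wide_port_gb_per_s(6.0))    # ~2.4 GB/s per direction (6 Gbit SAS, x4)
print(sas_wide_port_gb_per_s(12.0))   # ~4.8 GB/s per direction (12 Gbit SAS, x4)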
PCI-E direct attach storage

PCIe Arch | Raw Bit Rate | Data Encoding | Interconnect Bandwidth | BW per Lane per Direction | Total BW for x16 Link (both directions)
PCIe 1.x | 2.5 GT/s | 8b/10b | 2 Gb/s | ~250 MB/s | ~8 GB/s
PCIe 2.0 | 5.0 GT/s | 8b/10b | 4 Gb/s | ~500 MB/s | ~16 GB/s
PCIe 3.0 | 8.0 GT/s | 128b/130b | 8 Gb/s | ~1 GB/s | ~32 GB/s
PCIe 4.0 | ?? | ?? | ?? | ?? | ??

• M.2 solid state storage modules
• Lowest latency / highest bandwidth
• Limitation: number of PCI-E lanes available
  – Ivy Bridge has up to 40 lanes per chip
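The table values follow directly from the raw transfer rate and the encoding overhead; a short illustrative check in Python (assuming, as the numbers suggest, that the x16 total counts both directions):

def pcie_bandwidth(raw_gtps: float, enc_payload: int, enc_total: int, lanes: int = 16):
    """Effective PCIe bandwidth derived from raw transfer rate and line encoding."""
    payload_gbps = raw_gtps * enc_payload / enc_total   # usable Gb/s per lane, one direction
    per_lane_gbs = payload_gbps / 8                     # GB/s per lane, one direction
    total_gbs = per_lane_gbs * lanes * 2                # x16 link, both directions
    return round(per_lane_gbs, 2), round(total_gbs, 1)

print(pcie_bandwidth(2.5, 8, 10))      # (0.25, 8.0)   -> PCIe 1.x
print(pcie_bandwidth(5.0, 8, 10))      # (0.5, 16.0)   -> PCIe 2.0
print(pcie_bandwidth(8.0, 128, 130))   # (0.98, 31.5)  -> PCIe 3.0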
Ethernet – Still going strong ??
• Ethernet has now been around for 40 years !!!
• Currently around 41% of Top500 systems:
  – 28% 1 GbE
  – 13% 10 GbE
• 40 GbE shipping in volume
• 100 GbE being demonstrated
  – Volume shipments expected in 2015
• 400 GbE and 1 TbE are on the drawing board
  – 400 GbE planned for 2017
Infiniband …
Next gen interconnects …
• Intel acquired the QLogic Infiniband team …
• Intel acquired Cray's interconnect technologies …
• Intel has published and shown silicon photonics …
• And this means WHAT ????
Disk drive technologies
Areal density futures
Disk write technology ….
Hard drive futures …
• Sealed helium drives (Hitachi)
  – Higher density – 6 platters / 12 heads
  – Less power (~1.6 W idle)
  – Less heat (~4°C lower temperature)
• SMR drives (Seagate)
  – Denser packaging on current technology
  – Aimed at read-intensive application areas
• Hybrid drives (SSHD)
  – Enterprise edition
  – Transparent SSD/HDD combination (aka Fusion drives)
  – eMLC + SAS
SMR drive deep-dive
Hard drive futures …
• HAMR drives (Seagate)
  – Use a laser to heat the magnetic substrate (iron/platinum alloy)
  – Projected capacity – 30-60 TB per 3.5 inch drive …
  – 2016 timeframe ….
• BPM (bit patterned media recording)
  – Stores one bit per cell, as opposed to regular hard-drive technology where each bit is stored across a few hundred magnetic grains
  – Projected capacity – 100+ TB per 3.5 inch drive …
What about RAID ?
• RAID 5 – No longer viable
• RAID 6 – Still OK
  – But rebuild times are becoming prohibitive (see the sketch below)
• RAID 10
  – OK for SSDs and small arrays
• RAID-Z/Z2 etc.
  – A choice, but limited functionality on Linux
• Parity declustered RAID
  – Gaining a foothold everywhere
  – But …. PD-RAID ≠ PD-RAID ≠ PD-RAID ….
• No RAID ???
  – Using multiple distributed copies works, but …
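Why rebuild times become prohibitive: even in the best case a rebuild has to stream an entire replacement drive, so the lower bound scales with capacity. A hedged, illustrative sketch (the capacities and the 100 MB/s sustained rebuild rate are assumptions, not figures from the slide):

def min_rebuild_hours(capacity_tb: float, rebuild_mb_per_s: float = 100.0) -> float:
    """Lower bound on rebuild time: the whole replacement drive must be written."""
    capacity_mb = capacity_tb * 1_000_000
    return capacity_mb / rebuild_mb_per_s / 3600

for tb in (1, 4, 8):
    print(f"{tb} TB drive: >= {min_rebuild_hours(tb):.1f} h per rebuild")
# 1 TB ~2.8 h, 4 TB ~11.1 h, 8 TB ~22.2 h, and real arrays rebuild while
# serving I/O, so in practice it is often a day or more per failure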
Flash (NAND)
• Supposed to "Take over the World" [cf. Pinky and the Brain]
• But for high performance storage there are issues ….
  – Price and density not following predicted evolution
  – Reliability (even on SLC) not as expected
• Latency issues
  – SLC access ~25 µs, MLC ~50 µs …
  – Larger chips increase contention
    • Once a flash die is accessed, other dies on the same bus must wait
    • Up to 8 flash dies share a bus (see the sketch below)
  – Address translation, garbage collection and especially wear leveling add significant latency
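A rough way to see how sharing a channel inflates latency; this is a deliberately simplified model, and only the 25/50 µs access figures come from the slide (the ~20 µs page transfer time and queue depths are assumptions):

def worst_case_read_latency_us(die_access_us: float, bus_transfer_us: float,
                               requests_ahead: int) -> float:
    """Latency seen by a read whose die shares a bus with other busy dies.

    Die arrays can sense in parallel, but results are serialized over the
    shared channel, so a request also waits for every transfer ahead of it.
    """
    return die_access_us + (requests_ahead + 1) * bus_transfer_us

for ahead in (0, 3, 7):
    print(ahead, worst_case_read_latency_us(50, 20, ahead))
# 0 -> 70 us, 3 -> 130 us, 7 -> 210 us: with up to 8 dies per bus,
# channel contention can dominate the raw MLC cell access time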
Flash (NAND)
• MLC
  – 3-4 bits per cell @ 10K duty cycles
• SLC
  – 1 bit per cell @ 100K duty cycles
• eMLC
  – 2 bits per cell @ 30K duty cycles
• Disk drive formats (S-ATA / SAS bandwidth limitations)
• PCI-E accelerators
• PCI-E direct attach
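What those endurance ratings mean in practice: total write volume before wear-out is roughly capacity × rated cycles divided by write amplification. An illustrative sketch (the 400 GB capacity, 5-year horizon and write-amplification factor of 2 are assumptions, not slide data):

def drive_writes_per_day(capacity_gb: float, pe_cycles: int,
                         write_amplification: float = 2.0,
                         lifetime_years: float = 5.0) -> float:
    """Sustainable full-drive writes per day for a given cell endurance."""
    total_host_writes_gb = capacity_gb * pe_cycles / write_amplification
    return total_host_writes_gb / (lifetime_years * 365) / capacity_gb

for name, cycles in (("MLC", 10_000), ("eMLC", 30_000), ("SLC", 100_000)):
    print(name, round(drive_writes_per_day(400, cycles), 1), "drive writes/day")
# MLC ~2.7, eMLC ~8.2, SLC ~27.4 full-drive writes per day over 5 years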
NV-RAM
• Flash is essentially NV-RAM but ….
• Phase Change Memory (PCM)
  – Significantly faster and more dense than NAND
  – Based on chalcogenide glass
    • Thermal vs electronic process
  – More resistant to external factors
• Currently the expected solution for burst buffers etc …
  – but there's always Hybrid Memory Cubes ……
Solutions …
• Size does matter …..
  – 2014 – 2016: >20 proposals for 40+ PB file systems running at 1 – 4 TB/s !!!! (see the sizing sketch below)
• Heterogeneity is the new buzzword
  – Burst buffers, data capacitors, cache off-loaders …
• Mixed workloads are now taken seriously ….
• Data integrity is paramount
  – T10-DIF/X is a decent start but …
• Storage system resiliency is equally important
  – PD-RAID needs to evolve and become system wide
• Multi-tier storage as standard configs …
• Geographically distributed solutions commonplace
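To put 1 – 4 TB/s in perspective, a quick sizing sketch (the ~150 MB/s per-disk streaming rate and ~5 GB/s per object storage server are assumptions for illustration, not vendor figures):

def drives_and_servers_needed(target_tb_per_s: float,
                              disk_mb_per_s: float = 150.0,
                              oss_gb_per_s: float = 5.0):
    """Rough count of spinning disks and storage servers for a target aggregate rate."""
    disks = target_tb_per_s * 1_000_000 / disk_mb_per_s
    servers = target_tb_per_s * 1000 / oss_gb_per_s
    return round(disks), round(servers)

for tbs in (1, 4):
    print(tbs, "TB/s ->", drives_and_servers_needed(tbs))
# 1 TB/s -> (6667 disks, 200 servers); 4 TB/s -> (26667 disks, 800 servers),
# before any RAID, hot-spare or fill-level overhead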
Final thoughts ???
• Object based storage
  – Not an IF but a WHEN ….
  – Flavor(s) still TBD – DAOS, Exascale10, XEOS, ….
• Data management core to any solution
  – Self-aware data, real-time analytics, resource management ranging from the job scheduler to the disk block ….
  – HPC storage = Big Data
• Live data
  – Cache ↔ Near line ↔ Tier 2 ↔ Tape ? ↔ Cloud ↔ Ice
• Compute with storage ➜ Storage with compute

Storage is no longer a 2nd class citizen
Thank you
[email protected]