HPC Storage: Current Status and Futures. Torben Kling Petersen, PhD, Principal Architect, HPC


Page 1:

HPC Storage: Current Status and Futures

Torben Kling Petersen, PhD, Principal Architect, HPC

Page 2:

• Where are we today ??
• File systems
• Interconnects
• Disk technologies
• Solid state devices
• Solutions
• Final thoughts …

Agenda ??

Pan Galactic Gargle Blaster

"Like having your brains smashed out by a slice of lemon wrapped

around a large gold brick.”

Page 3:

Current Top10 …..

N.B. NCSA Blue Waters: 24 PB, 1100 GB/s (Lustre 2.1.3)

| Rank | Name | Computer | Site | Total Cores | Rmax (GFlop/s) | Rpeak (GFlop/s) | Power (kW) | File system | Size | Perf |
|------|------|----------|------|-------------|----------------|-----------------|------------|-------------|------|------|
| 1 | Tianhe-2 | TH-IVB-FEP Cluster, Xeon E5-2692 12C 2.2GHz, TH Express-2, Intel Xeon Phi | National Super Computer Center in Guangzhou | 3120000 | 33862700 | 54902400 | 17808 | Lustre/H2FS | 12.4 PB | ~750 GB/s |
| 2 | Titan | Cray XK7, Opteron 6274 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x | DOE/SC/Oak Ridge National Laboratory | 560640 | 17590000 | 27112550 | 8209 | Lustre | 10.5 PB | 240 GB/s |
| 3 | Sequoia | BlueGene/Q, Power BQC 16C 1.60GHz, Custom Interconnect | DOE/NNSA/LLNL | 1572864 | 17173224 | 20132659 | 7890 | Lustre | 55 PB | 850 GB/s |
| 4 | K computer | Fujitsu, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | RIKEN AICS | 705024 | 10510000 | 11280384 | 12659 | Lustre | 40 PB | 965 GB/s |
| 5 | Mira | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | DOE/SC/Argonne National Laboratory | 786432 | 8586612 | 10066330 | 3945 | GPFS | 7.6 PB | 88 GB/s |
| 6 | Piz Daint | Cray XC30, Xeon E5-2670 8C 2.600GHz, Aries interconnect, NVIDIA K20x | Swiss National Supercomputing Centre (CSCS) | 115984 | 6271000 | 7788853 | 2325 | Lustre | 2.5 PB | 138 GB/s |
| 7 | Stampede | PowerEdge C8220, Xeon E5-2680 8C 2.7GHz, IB FDR, Intel Xeon Phi | TACC/Univ. of Texas | 462462 | 5168110 | 8520112 | 4510 | Lustre | 14 PB | 150 GB/s |
| 8 | JUQUEEN | BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect | Forschungszentrum Juelich (FZJ) | 458752 | 5008857 | 5872025 | 2301 | GPFS | 5.6 PB | 33 GB/s |
| 9 | Vulcan | BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect | DOE/NNSA/LLNL | 393216 | 4293306 | 5033165 | 1972 | Lustre | 55 PB | 850 GB/s |
| 10 | SuperMUC | iDataPlex DX360M4, Xeon E5-2680 8C 2.70GHz, Infiniband FDR | Leibniz Rechenzentrum | 147456 | 2897000 | 3185050 | 3423 | GPFS | 10 PB | 200 GB/s |

Page 4:

Lustre Roadmap

Page 5:

• GPFS
  – Running out of steam ??
  – Let me qualify !! (and he then rambles on …)
• Fraunhofer (FhGFS)
  – Excellent metadata performance
  – Many modern features
  – No real HA
• Ceph
  – New, interesting and with a LOT of good features
  – Immature and with a limited track record
• Panasas
  – Still RAID 5 and running out of steam …

Other parallel file systems

Page 6:

• A traditional file system includes a hierarchy of files and directories
• Accessed via a file system driver in the OS

• Object storage is "flat"; objects are located by direct reference
• Accessed via custom APIs
  o Swift, S3, librados, etc.

• The difference boils down to 2 questions:
  o How do you find files?
  o Where do you store metadata?

• Object store + metadata + driver is a file system (see the sketch below)

Object based storage
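
To make the "flat namespace, direct reference" point concrete, here is a minimal sketch (not from the slides) contrasting POSIX path access with S3-style object access via boto3; the bucket, key and path names are made up for illustration.

```python
# Hypothetical illustration: the same payload stored via POSIX and via an object API.
import boto3

# POSIX: the kernel file system driver resolves a path through a directory hierarchy.
with open("/scratch/project/run42/output.dat", "wb") as f:
    f.write(b"simulation output")

# Object store: a flat namespace; the object is found by direct reference
# (bucket + key) through a custom API, here S3. The "/" in the key is just
# part of the name, not a directory the server has to traverse.
s3 = boto3.client("s3")
s3.put_object(Bucket="hpc-results", Key="run42/output.dat", Body=b"simulation output")
data = s3.get_object(Bucket="hpc-results", Key="run42/output.dat")["Body"].read()
```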

Page 7:

• It’s more flexible. Interfaces can be presented in other ways, without the FS overhead. A generalized storage architecture vs. a file system

• It’s more scalable. POSIX was never intended for clusters, concurrent access, multi-level caching, ILM, usage hints, striping control, etc.

• It's simpler. With the file-system-isms removed, an elegant (= scalable, flexible, reliable) foundation can be laid

Object Storage Backend: Why?

Page 8:

Most clustered file systems (FS) and object stores (OS) are built on local file systems … and inherit their problems
• Native FS
  o XFS, ext4, ZFS, btrfs
• OS on FS
  o Ceph on btrfs
  o Swift on XFS
• FS on OS on FS
  o CephFS on Ceph on btrfs
  o Lustre on OSS on ext4
  o Lustre on OSS on ZFS

Elephants all the way down...

Page 9:

• Object-store based solutions offer a lot of flexibility:
  – Next-generation design, for exascale-level size, performance, and robustness
  – Implemented from scratch
    • "If we could design the perfect exascale storage system..."
  – Not limited to POSIX
  – Non-blocking availability of data
  – Multi-core aware
  – Non-blocking execution model with thread-per-core (see the sketch below)
  – Support for non-uniform hardware
    • Flash, non-volatile memory, NUMA
  – Using abstractions, guided interfaces can be implemented
    • e.g., for burst buffer management (pre-staging and de-staging)

The way forward ..
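
As a rough illustration of the "non-blocking, thread-per-core" idea (my own sketch, not any particular product's design): one worker is pinned to each core and runs its own event loop, so requests on one core never block another.

```python
# Sketch of a thread-per-core, non-blocking execution model (Linux only,
# because of sched_setaffinity). Request handling is a placeholder.
import asyncio
import multiprocessing as mp
import os

async def handle_request(core: int, req_id: int) -> None:
    await asyncio.sleep(0)  # stand-in for a non-blocking storage operation
    print(f"core {core} handled request {req_id}")

async def serve(core: int) -> None:
    # All requests on this core are multiplexed on one event loop; nothing blocks.
    await asyncio.gather(*(handle_request(core, i) for i in range(4)))

def worker(core: int) -> None:
    os.sched_setaffinity(0, {core})  # pin this worker to a single core
    asyncio.run(serve(core))

if __name__ == "__main__":
    procs = [mp.Process(target=worker, args=(c,)) for c in range(os.cpu_count() or 1)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```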

Page 10:

• S-ATA 6 Gbit
• FC-AL 8 Gbit
• SAS 6 Gbit
• SAS 12 Gbit
• PCI-E direct attach

• Ethernet
• Infiniband
• Next gen interconnect …

Interconnects (Disk and Fabrics)

Page 11:

• Doubles bandwidth compared to SAS 6 Gbit
• Triples the IOPS !!
• Same connectors and cables
• 4.8 GB/s in each direction, with 9.6 GB/s following
  – 2 streams moving to 4 streams (see the bandwidth arithmetic below)
• 24 Gb/s SAS is on the drawing board …

12 Gbit SAS
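
The 4.8 GB/s figure follows from the per-lane line rate, the 8b/10b encoding and a wide port; a quick back-of-the-envelope check (my arithmetic; the x4 wide-port width is an assumption):

```python
# 12 Gbit/s SAS bandwidth, back of the envelope.
line_rate_gbps = 12.0        # line rate per lane
encoding = 8 / 10            # 8b/10b encoding overhead
lanes = 4                    # assumed x4 wide port

per_lane_gbs = line_rate_gbps * encoding / 8   # 1.2 GB/s of payload per lane
port_gbs = per_lane_gbs * lanes                # 4.8 GB/s per direction for x4
print(f"{per_lane_gbs:.1f} GB/s per lane, {port_gbs:.1f} GB/s per direction on a x4 port")
```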

Page 12:

PCI-E direct attach storage

| PCIe Arch | Raw Bit Rate | Data Encoding | Interconnect Bandwidth | BW per Lane per Direction | Total BW for x16 Link |
|-----------|--------------|---------------|------------------------|---------------------------|-----------------------|
| PCIe 1.x | 2.5 GT/s | 8b/10b | 2 Gb/s | ~250 MB/s | ~8 GB/s |
| PCIe 2.0 | 5.0 GT/s | 8b/10b | 4 Gb/s | ~500 MB/s | ~16 GB/s |
| PCIe 3.0 | 8.0 GT/s | 128b/130b | 8 Gb/s | ~1 GB/s | ~32 GB/s |
| PCIe 4.0 | ?? | ?? | ?? | ?? | ?? |

• M.2 solid state storage modules
• Lowest latency / highest bandwidth
• Limitations in # of PCI-E lanes available
  – Ivy Bridge has up to 40 lanes per chip
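
The per-lane and x16 numbers in the table above fall out of the raw bit rate and the encoding overhead; a small sketch of that arithmetic (the x16 totals in the table count both directions):

```python
# PCIe bandwidth from raw bit rate and encoding (per lane, then x16 both directions).
generations = {
    "PCIe 1.x": (2.5, 8 / 10),      # GT/s per lane, 8b/10b encoding
    "PCIe 2.0": (5.0, 8 / 10),
    "PCIe 3.0": (8.0, 128 / 130),   # 128b/130b encoding
}
for gen, (gt_s, efficiency) in generations.items():
    lane_gbs = gt_s * efficiency / 8   # GB/s per lane, per direction
    x16_total = lane_gbs * 16 * 2      # x16 link, both directions
    print(f"{gen}: ~{lane_gbs * 1000:.0f} MB/s per lane, ~{x16_total:.0f} GB/s for x16")
```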

Page 13:

• Ethernet has now been around for 40 years !!!
• Currently around 41% of Top500 systems …
  – 28% 1 GbE
  – 13% 10 GbE
• 40 GbE shipping in volume
• 100 GbE being demonstrated
  – Volume shipments expected in 2015
• 400 GbE and 1 TbE are on the drawing board
  – 400 GbE planned for 2017

Ethernet – Still going strong ??

Page 14:

Infiniband …

Page 15:

• Intel acquired the QLogic InfiniBand team …
• Intel acquired Cray's interconnect technologies …
• Intel has published and shown silicon photonics …

• And this means WHAT ????

Next gen interconnects …

Page 16:

Disk drive technologies

Page 17:

Areal density futures

Page 18:

Disk write technology ….

Page 19:

• Sealed helium drives (Hitachi)
  – Higher density – 6 platters / 12 heads
  – Less power (~1.6 W idle)
  – Less heat (~4°C lower temperature)
• SMR drives (Seagate)
  – Denser packaging on current technology
  – Aimed at read-intensive application areas
• Hybrid drives (SSHD)
  – Enterprise edition
  – Transparent SSD/HDD combination (aka Fusion drives)
    • eMLC + SAS

Hard drive futures …

Page 20:

SMR drive deep-dive

Page 21:

• HAMR drives (Seagate)
  – Using a laser to heat the magnetic substrate (iron/platinum alloy)
  – Projected capacity – 30-60 TB / 3.5-inch drive …
  – 2016 timeframe …
• BPM (bit-patterned media recording)
  – Stores one bit per cell, as opposed to regular hard-drive technology, where each bit is stored across a few hundred magnetic grains
  – Projected capacity – 100+ TB / 3.5-inch drive …

Hard drive futures …

Page 22:

• RAID 5 – No longer viable
• RAID 6 – Still OK
  – But rebuild times are becoming prohibitive (see the estimate below)
• RAID 10
  – OK for SSDs and small arrays
• RAID Z/Z2 etc.
  – A choice, but limited functionality on Linux
• Parity declustered RAID
  – Gaining a foothold everywhere
  – But … PD-RAID ≠ PD-RAID ≠ PD-RAID …
• No RAID ???
  – Using multiple distributed copies works, but …

What about RAID ?
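
To give a sense of why rebuild times are the problem, a rough estimate with assumed numbers (drive size and sustained rebuild rate are illustrative, not from the slide):

```python
# Why RAID 6 rebuilds hurt: one failed drive must be reconstructed in full,
# bottlenecked by a single target drive. Numbers below are illustrative.
capacity_tb = 6.0          # assumed capacity of the failed drive
rebuild_rate_mbs = 50.0    # assumed sustained rebuild rate under production I/O

hours = capacity_tb * 1e6 / rebuild_rate_mbs / 3600
print(f"~{hours:.0f} hours to rebuild one {capacity_tb:.0f} TB drive")  # ~33 hours

# Parity-declustered RAID spreads the rebuild reads and writes across the
# whole pool, so the effective rebuild rate grows with the number of drives.
```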

Page 23:

• Supposed to "Take over the World" [cf. Pinky and the Brain]
• But for high-performance storage there are issues …
  – Price and density not following predicted evolution
  – Reliability (even on SLC) not as expected
• Latency issues
  – SLC access ~25 µs, MLC ~50 µs …
  – Larger chips increase contention
    • Once a flash die is accessed, other dies on the same bus must wait
    • Up to 8 flash dies share a bus
  – Address translation, garbage collection and especially wear leveling add significant latency

Flash (NAND)

Page 24:

• MLC
  – 3-4 bits per cell @ 10K program/erase cycles
• SLC
  – 1 bit per cell @ 100K program/erase cycles
• eMLC
  – 2 bits per cell @ 30K program/erase cycles
(see the rough endurance arithmetic below)

• Disk drive formats (S-ATA / SAS bandwidth limitations)
• PCI-E accelerators
• PCI-E direct attach

Flash (NAND)
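
To put the per-cell cycle counts in drive terms, a rough endurance estimate (drive capacity and write amplification are assumptions of mine, purely for illustration):

```python
# Rough "total bytes written" estimate from per-cell program/erase cycles.
capacity_tb = 0.8                  # assumed drive capacity
write_amplification = 3.0          # assumed; GC and wear levelling multiply writes
pe_cycles = {"SLC": 100_000, "eMLC": 30_000, "MLC": 10_000}

for cell_type, cycles in pe_cycles.items():
    tbw = capacity_tb * cycles / write_amplification
    print(f"{cell_type}: ~{tbw:,.0f} TB written before wear-out")
```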

Page 25:

• Flash is essentially NV-RAM, but …
• Phase Change Memory (PCM)
  – Significantly faster and denser than NAND
  – Based on chalcogenide glass
    • Thermal vs. electronic process
  – More resistant to external factors
• Currently the expected solution for burst buffers etc. …
  – But there's always Hybrid Memory Cubes …

NV-RAM

Page 26:

• Size does matter …
  – 2014 – 2016: >20 proposals for 40+ PB file systems, running at 1 – 4 TB/s !!!!
• Heterogeneity is the new buzzword
  – Burst buffers, data capacitors, cache off-loaders …
• Mixed workloads are now taken seriously …
• Data integrity is paramount
  – T10-DIF/X is a decent start but …
• Storage system resiliency is equally important
  – PD-RAID needs to evolve and become system-wide
• Multi-tier storage as standard configs …
• Geographically distributed solutions commonplace

Solutions …

Page 27:

• Object-based storage
  – Not an IF but a WHEN …
  – Flavor(s) still TBD – DAOS, Exascale10, XEOS, …
• Data management core to any solution
  – Self-aware data, real-time analytics, resource management ranging from the job scheduler to the disk block …
  – HPC storage = Big Data
• Live data
  – Cache ↔ Near line ↔ Tier 2 ↔ Tape? ↔ Cloud ↔ Ice
• Compute with storage ➜ Storage with compute

Final thoughts ???

Storage is no longer a 2nd class citizen

Page 28:

Thank you! [email protected]