Content
• Exascale Storage Challenges (Intel)
• From HDDs to NVM (Intel)
• New Storage Architectures Beyond Lustre (Intel)
• DDN ExaScaler: Software and Hardware Combined
• Enhancing Lustre with NVM
• Options for using NVM in Lustre today
• Tools for Lustre at Scale
DDN & Intel
• Close collaboration with Intel HPDD
– Collaboration on Lustre features such as directory-level quota and novel security features
– DDN ExaScaler is a private branch within Intel IEEL
– DDN automated build & test environment
• Member of the Intel Storage Builders program
• Working with Intel to incorporate novel storage devices
• Collaborating with Intel on new storage architectures towards Exascale
Legal Information
• No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
• Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
• This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
• The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.
• Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.
• Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
• Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation.
• Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at http://www.intel.com/content/www/us/en/software/intel-solutions-for-lustre-software.html.
• Intel may make changes to specifications and product descriptions at any time, without notice.
• Any code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services, and any such use of Intel's internal code names is at the sole risk of the user.
• Intel, Intel Inside, the Intel logo and Xeon Phi are trademarks of Intel Corporation in the United States and other countries.
• Material in this presentation is intended as product positioning and not approved end user messaging.
*Other names and brands may be claimed as the property of others.
© 2016 Intel Corporation
Intel® Scalable System Framework
A Holistic Design Solution for All HPC Needs

• Small clusters through supercomputers
• Compute- and data-centric computing
• Standards-based programmability
• On-premise and cloud-based

Compute: Intel® Xeon® Processors, Intel® Xeon Phi™ Processors, Intel® Xeon Phi™ Coprocessors, Intel® Server Boards and Platforms
Memory/Storage: Intel® Solutions for Lustre*, Intel® Optane™ Technology, 3D XPoint™ Technology, Intel® SSDs
Fabric: Intel® Omni-Path Architecture, Intel® True Scale Fabric, Intel® Ethernet, Intel® Silicon Photonics
Software: HPC System Software Stack, Intel® Software Tools, Intel® Cluster Ready Program, Intel Supported SDVis
Disruptive change: 3D XPoint™ + Intel® Omni-Path Architecture
• Byte-granular storage
• Sub-µs storage latency
• µs network latency

Conventional storage software: high overhead
– Tens of µs lost to communications software
– Hundreds of µs lost to the file system & block I/O stack
– Block/page granular & alignment sensitive; false sharing limits scalability

Exascale storage: low overhead
– OS-bypass communications + storage
– Byte granular & alignment insensitive; ultra-low latency enables fine granularity

Persistent memory: all metadata & fine-grained application data
Block storage:
– NAND: high-performance bulk data
– Disk: high-capacity cold data
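To make the block-versus-byte contrast concrete, here is a minimal Python sketch. The file path is invented, and an ordinary mmap'd file stands in for persistent memory (real NVM would be accessed via DAX/libpmem with cache-line flushes rather than msync), so this only illustrates the programming model, not the actual DAOS I/O path.

```python
import mmap
import os

PATH = "/tmp/pmem_demo.bin"   # hypothetical path, stand-in for a pmem region

fd = os.open(PATH, os.O_RDWR | os.O_CREAT, 0o644)
os.ftruncate(fd, 4096)

# Block-granular update: changing one 8-byte counter still means reading
# and rewriting a whole 4 KiB block through the block I/O stack.
block = bytearray(os.pread(fd, 4096, 0))
block[128:136] = (42).to_bytes(8, "little")
os.pwrite(fd, bytes(block), 0)
os.fsync(fd)

# Byte-granular update: map the region and change just the 8 bytes in place.
m = mmap.mmap(fd, 4096)
m[128:136] = (43).to_bytes(8, "little")
m.flush()          # on real persistent memory this would be a cache-line flush
m.close()
os.close(fd)
```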
Global Namespaces

Containers
• Shared system namespace: "Where's my stuff"
• Private namespaces: "My stuff"; entire simulation datasets

Multiple schemas
• POSIX: shared (system) & private (legacy datasets); no discontinuity for application developers
• Scientific: HDF5*, ADIOS*, SciDB*, …
• Big Data: HDFS*, Spark*, Graph Analytics, …
[Figure: a compute cluster holds hot datasets on compute nodes, while shared storage nodes hold the system namespace plus warm/staging datasets. The system POSIX namespace /projects contains /posix, /Particle, and /Climate subtrees; an HDF5 dataset is a tree of groups containing datasets of data, while ADIOS and POSIX datasets are trees of directories containing files of data.]
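As a concrete illustration of the scientific-schema case, the following sketch uses the standard h5py library to build a small HDF5 dataset tree (groups containing datasets) like the one in the figure. The file, group, and dataset names are invented for the example; this is plain HDF5 on a POSIX namespace, not a DAOS-specific API.

```python
import numpy as np
import h5py

# Hypothetical particle-simulation layout: one group per timestep,
# each holding array datasets plus attributes as metadata.
with h5py.File("particle_run.h5", "w") as f:
    step = f.create_group("timestep_0001")               # like a directory
    step.create_dataset("positions", data=np.zeros((1000, 3)))
    step.create_dataset("velocities", data=np.zeros((1000, 3)))
    step.attrs["simulation_time"] = 0.0                   # fine-grained metadata
```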
Lustre* is evolving in response to:
• A growing customer base
• Changing use cases
• Emerging hardware capabilities

DAOS is exploring new territory:
• What may lie beyond POSIX
• Using new hardware capabilities as storage
• Raising the level of abstraction

The future is both evolutionary & revolutionary.
Intel® Solutions for Lustre* Roadmap (Q2 ’16 – Q2 ’17)

Foundation – Community Lustre software coupled with Intel support
• Foundation Edition 2.8 (Lustre 2.8.x): Metadata directory striping (DNE 2), client metadata RPC scaling, LFSCK performance tuning, SELinux on client + Kerberos updates
• Foundation Edition 2.9 (Lustre 2.9.x): UID/GID mapping, shared key crypto, lock ahead, subdirectory mounts
• Foundation Edition 2.10 (Lustre 2.10.x): OpenZFS snapshots, multi-rail LNET, project quotas

Enterprise – HPC, commercial technical computing, and analytics
• Enterprise Edition 3.0 (Lustre 2.7.x): Client metadata RPC scaling, OpenZFS performance, OpenZFS snapshots, SELinux on client + Kerberos, Intel Omni-Path (OPA) support, RHEL 7 server/client, SLES 12 SP1 client, OpenZFS 0.6.5.3
• Enterprise Edition 3.1 (Lustre 2.7.x): Expanded OpenZFS support in IML, IML graphical UI enhancements, bulk I/O performance (16 MB I/O), OpenZFS metadata performance, subdirectory mounts, patchless server support

Cloud – HPC, commercial technical computing and analytics delivered on cloud services
• Cloud Edition 1.2.1 (Lustre 2.7.x): Client support for EL6 & EL7, support for new regions
• Cloud Edition 1.3 (Lustre 2.8.x): RHEL 7 server support, Lustre release update
• Cloud Edition 1.4 (Lustre 2.9.x): Updated Lustre and base OS builds, new instance support

Key: Development / Shipping / Planning
*Some names and brands may be claimed as the property of others. Product placement is not representative of final launch date within the specified quarter.
DDN | ExaScaler Software Stack

[Stack diagram, components colored by origin (DDN, DDN & Intel, Intel HPDD, other):]
• Management and Monitoring: Intel IML, DDN DirectMon, DDN ExaScaler Monitor
• Global Filesystem: DDN Lustre Edition (DDN ExaScaler), DDN IME, NFS/CIFS/S3, DDN Clients, Intel Hadoop, Intel DSS
• Local Filesystem: ldiskfs, OpenZFS, btrfs
• Storage Hardware: DDN Block Storage with SFX Cache, other hardware
• Data Management: ExaScaler Data Management Framework, Fast Data Copy to Object (WOS), S3 Cloud, and Tape
Roadmap (Q2 2016 – Q3 2017)

ExaScaler 2.3 (Lustre 2.5.x, IEEL 2.4)
Hardware: SFA12K, SFA14K, SFA7700X, ES7K, ES14K
• SFA14K
• ES14K appliance
• SFA14K & ES14K rack configurations
• InfiniBand EDR support

ExaScaler 3.0 (Lustre 2.7.x, IEEL 3.0)
Hardware: SFA12K, SFA14K, SFA7700X, ES7K, ES14K
• RHEL 7 and SLES 12 SP1
• IME 1.0 integration
• L2RC and File Heat
• Enhanced SFX integration and fadvise() support
• Client metadata performance
• SELinux (client) & Kerberos support
• OpenTSDB scalable monitoring
• Scalable POSIX copy tool
• NFS/CIFS gateway tools
• OpenZFS support (not optimized!)
• OpenZFS snapshots

ExaScaler 3.1 (Lustre 2.7.x, IEEL 3.0)
Hardware: SFA12K, SFA14K, SFA7700X, ES7K, ES14K
• Project quota
• Enhanced QoS policies
• S3 protocol export
• Intel Omni-Path (OPA)
• 14KXE
• Online upgrades

ExaScaler 3.2 (Lustre 2.7.x, IEEL 3.2)
Hardware: SFA12K, SFA14K, SFA14KX, SFA8700, SFA12KX, SFA7700X, ES7K, ES14K
• Enhanced management & monitoring
• Platform enhancements

ExaScaler 4.0 (Lustre xx, IEEL 4.0)
Hardware: SFA12K, SFA14K, SFA14KX, SFA8700, SFA12KX, SFA7700X, ES7K, ES14K
• Data on MDT
• MDT striped directories
• IME metadata caching
• New appliance configurations
• Progressive file layouts
• SELinux server
• UID/GID mapping
• Shared key encryption
• MLS design phase 1
• Multi-tenancy and Docker integration

Key: Shipping / Planning / Scheduled / Investigation
DDN | ExaScaler SSUs: Hardware & Software Combined

• XS (ES7K-M): up to 5 GB/s, up to 1148 TB, 7.2K rpm SAS; internal OSS, internal MDS, internal MDT
• M (SFA12KX): up to 38 GB/s, up to 3360 TB, 7.2K rpm SAS; external OSS, external MDS, internal MDT
• S (ES7K): up to 12 GB/s, up to 1148 TB, 7.2K rpm SAS; internal OSS, external MDS, external MDT
• L (ES14K): up to 74 GB/s, up to 4032 TB, 7.2K rpm SAS; internal OSS, internal MDS, internal MDT
• XL (ES14K): up to 96 GB/s, up to 1500 TB, 10K rpm SAS; internal OSS, external MDS, external MDT
• XXL (ES14K): up to 320 GB/s, up to 1700 TB, SAS SSD; internal OSS, external MDS, external MDT
DDN | ES14K: Hardware & Software Combined, Designed for Flash and NVM

Industry-leading performance in 4U:
• Up to 40 GB/s throughput
• Up to 6 million IOPS to cache
• Up to 3.5 million IOPS to storage
• 1 PB+ capacity (with 16 TB SSDs)
• 100 millisecond latency

Configuration options:
• 72 SAS SSDs or 48 NVMe SSDs
• HDDs only, HDDs with SSD caching, or SSDs with HDD tier

Connectivity: FDR/EDR, Omni-Path, 40/100 GbE
ES14K Architecture

[Block diagram: SFA14KE (Haswell) and SFA14KEX (Broadwell) controller layouts. Each controller has two CPUs (CPU0, CPU1) linked by QPI, embedded OSS 0 and OSS 1 instances running alongside SFAOS, 6 x SAS3 connections per CPU to the drive enclosures, and IB/OPA or 40/100 GbE front-end ports.]
SSD Options (1–4)

1. SFX: block-level read cache; instant commit, DSS, fadvise() support; millions of read IOPS
2. L2RC: OSS-level read cache; heuristics with File Heat; millions of read IOPS
3. IME: I/O-level write and read cache; transparent + hints; tens of millions of read/write IOPS
4. All-SSD Lustre: Lustre HSM for data tiering to an HDD namespace; generic Lustre I/O; millions of read and write IOPS
Rack Performance: Lustre

[Charts: IOR file-per-process throughput (GB/s), write and read; 4K random read IOPS]
• 350 GB/s read and write (IOR)
• 4 million IOPS
• Up to 4 PB flash capacity
SFX & ReACT – Accelerating Reads, Integrated with Lustre DSS

[Diagram: OSS with DRAM cache, HDD tier, and SFX (SSD) tier; paths for large reads, small reads, small re-reads, and cache warming; DSS / SFX API.]
Lustre L2RC and File Heat

OSS-based read caching:
• Uses SSDs (or SFA SSD pools) on the OSS as a read cache
• Automatic prefetch management based on file heat
• File heat is a relative (tunable) attribute that reflects file access frequency
• Indexes are kept in memory (worst case: 10 GB of memory per 1 TB of SSD)
• Efficient space management of the SSD cache space (4 KB – 1 MB extents)
• Full support for ladvise in Lustre

[Diagram: each OSS keeps a heap of object heat values and prefetches the hottest objects; higher heat means higher access frequency, lower heat means lower access frequency.]
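A rough sketch of the idea in Python: track per-file heat that decays over time and grows with each access, keep the values where the hottest entries can be pulled off a heap, and prefetch the hottest files. The class name, the decay formula, and the use of posix_fadvise() as the prefetch hint are illustrative assumptions, not Lustre's actual L2RC implementation (which exposes hints through ladvise).

```python
import heapq
import os
import time

class FileHeatTracker:
    """Toy file-heat tracker: heat decays over time and rises with accesses,
    so it is a relative measure of access frequency (names are hypothetical)."""

    def __init__(self, decay_per_second=0.1):
        self.decay = decay_per_second
        self.heat = {}                      # path -> (heat, last_access_time)

    def record_access(self, path, weight=1.0):
        now = time.time()
        old_heat, last = self.heat.get(path, (0.0, now))
        decayed = old_heat * (1.0 - self.decay) ** (now - last)
        self.heat[path] = (decayed + weight, now)

    def hottest(self, n):
        """Return the n hottest files, i.e. the prefetch candidates."""
        return heapq.nlargest(n, self.heat, key=lambda p: self.heat[p][0])

def prefetch(path):
    """Hint the kernel to read the file ahead; a stand-in for an
    OSS-side SSD prefetch, using the standard posix_fadvise() call."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        os.posix_fadvise(fd, 0, size, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
```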
DDN | IME: Application I/O Workflow (compute → NVM tier → Lustre)

1. The lightweight IME client intercepts application I/O and places fragments into buffers plus parity.
2. The IME client sends fragments to the IME servers.
3. The IME servers write the buffers to NVM and manage internal metadata.
4. The IME servers write aligned, sequential I/O to the SFA backend.
5. The parallel file system operates at maximum efficiency.

IME Write Dataflow (compute nodes run Application → IME Client; IME servers sit in front of the parallel filesystem)

1. The application issues fragmented, misaligned I/O.
2. The IME clients send fragments to the IME servers.
3. Fragments land on the IME servers and are accessible via a DHT to all clients.
4. Fragments to be flushed from IME are assembled into PFS stripes.
5. The PFS receives complete, aligned PFS stripes.
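A minimal sketch of the client-side half of this flow: a misaligned write is cut into fixed-size fragments, and each fragment key is hashed onto a server so that any client can locate it, which is the DHT property that step 3 relies on. The fragment size, server names, and hash scheme here are assumptions for illustration, not IME's actual protocol.

```python
import hashlib

FRAGMENT_SIZE = 1 << 20                       # assumed 1 MiB fragments
IME_SERVERS = ["ime00", "ime01", "ime02", "ime03"]   # hypothetical servers

def fragment(file_id, offset, data):
    """Cut a (possibly misaligned) write into fragments keyed by
    (file, fragment index) - the unit the servers keep in NVM."""
    frags = []
    pos = offset
    while data:
        take = min(FRAGMENT_SIZE - pos % FRAGMENT_SIZE, len(data))
        frags.append(((file_id, pos // FRAGMENT_SIZE), pos, data[:take]))
        data, pos = data[take:], pos + take
    return frags

def place(key):
    """Hash a fragment key onto a server, so every client can find any
    fragment without a central lookup (the DHT property)."""
    digest = hashlib.sha1(repr(key).encode()).digest()
    return IME_SERVERS[int.from_bytes(digest[:4], "big") % len(IME_SERVERS)]

# A 3 MiB write at a misaligned offset spreads over several servers:
for key, pos, frag in fragment("checkpoint.42", 12345, b"x" * (3 << 20)):
    print(key, place(key), len(frag))
```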
IME Erasure Coding

• Data protection against IME server or SSD failure is optional (the lost data is "just cache")
• Erasure coding is calculated at the client: it scales to extremely high client counts and the servers don't get clogged up
• Erasure coding does reduce usable client bandwidth and usable IME capacity:
– 3+1: 56 Gb → 42 Gb
– 5+1: 56 Gb → 47 Gb
– 7+1: 56 Gb → 49 Gb
– 8+1: 56 Gb → 50 Gb

[Diagram: compute nodes send data buffers plus a parity buffer into IME, which flushes to the Lustre PFS.]
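To see why 3+1 protection costs roughly a quarter of the raw bandwidth (56 Gb → 42 Gb), here is a client-side N+1 sketch in Python using simple XOR parity. IME's real erasure code is not specified here; this only illustrates computing parity at the client and rebuilding a single lost fragment.

```python
def xor_parity(fragments):
    """Compute one parity fragment over N equal-sized data fragments
    (an N+1 scheme, e.g. 3+1), RAID-5-style XOR parity."""
    size = len(fragments[0])
    assert all(len(f) == size for f in fragments)
    parity = bytearray(size)
    for frag in fragments:
        for i, b in enumerate(frag):
            parity[i] ^= b
    return bytes(parity)

def rebuild(surviving_fragments, parity):
    """Recover the single missing fragment by XOR-ing the parity
    with every surviving data fragment."""
    missing = bytearray(parity)
    for frag in surviving_fragments:
        for i, b in enumerate(frag):
            missing[i] ^= b
    return bytes(missing)

# 3+1: three data fragments plus one parity fragment, so only 3/4 of the
# raw bandwidth/capacity carries data (matching 56 Gb -> 42 Gb above).
data = [b"aaaa", b"bbbb", b"cccc"]
p = xor_parity(data)
assert rebuild([data[0], data[2]], p) == data[1]   # lost fragment recovered
```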
Rack Performance: IME

[Charts: IOR file-per-process throughput (GB/s), write and read; 4K random IOPS (log scale), write and read]
• 500 GB/s read and write
• 50 million IOPS
• 768 TB flash capacity
Monitoring for Lustre at Scale

• Monitoring with OpenTSDB
• Collect tens of millions of metrics per second: Lustre file system and storage backend
• Make full use of Lustre jobstats
• Assist storage administrators in understanding what is happening in their system: Top-N query tool
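For illustration, a small Python sketch that pushes a batch of metrics to OpenTSDB's HTTP /api/put endpoint, tagged the way jobstats-derived data might be. The host, metric, and tag names are invented; only the /api/put interface itself is OpenTSDB's.

```python
import json
import time
import urllib.request

OPENTSDB_URL = "http://tsdb.example.com:4242/api/put"   # hypothetical endpoint

def push_metrics(points):
    """Send a batch of data points to OpenTSDB via its HTTP /api/put API."""
    body = json.dumps(points).encode("utf-8")
    req = urllib.request.Request(
        OPENTSDB_URL, data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.status            # 204 means all points were stored

# Example: one write-bytes sample per job, as might be derived from
# Lustre jobstats on an OSS (metric and tag names are illustrative).
now = int(time.time())
push_metrics([
    {"metric": "lustre.ost.write_bytes",
     "timestamp": now,
     "value": 123456789,
     "tags": {"fsname": "scratch", "ost": "OST0001", "jobid": "slurm.4242"}},
])
```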