Joe Moore, NetApp Array Products Group
The Evolution of Storage to Support Data Analytics at Scale
1
Solutions For Multiple Workloads
Deliver solutions built on open standards with best-in-class partnerships
2
FAS/V Family with Data ONTAP®
E-Series with Hadoop, Lustre, StorNext, StorageGRID, and many others
Agile Data Infrastructure
Infrastructure for a New Era of Enterprise Applications
What is “Big Data”?
Volume · Speed · Complexity
“Big Data” refers to datasets whose volume, speed, and complexity are beyond the ability of typical tools to capture, store, manage, and analyze.
3
Coined in 2000 by Francis Diebold, Professor of Economics at the University of Pennsylvania.
Big Data Solution Portfolio
Insight from extremely large datasets
Performance for data intensive workloads
Secure boundless data storage
Big Data
4
Analytics of Tomorrow
• Traditional and Big Analytics will run side-by-side for years to come.
• Hadoop moves to shared, virtualized infrastructure for better efficiency and ease of management, either:
  – logically distributed, shared-nothing on physically shared-everything, or
  – the same as above, except Hadoop becomes logically shared-everything as HDFS is replaced by a parallel file system (e.g., Lustre, StorNext, or GPFS).
• Enterprise-class resiliency (no SPoF) and reliability with HPC-like performance (no need for triplicas).
• Use of a single copy of data for the map phase (higher storage utilization).
• Natural intersection with Cloud (Analytics as a Service).
5
Application-Aware Storage (Hadoop example)
6
Hadoop (Terasort) Workload Example
• Map/Reduce phases: intermediate results are written, then read back within 20 minutes; cache only until first read.
• 64 MB reads issued as a jumbled IO burst; chunk-aligned prefetch.
• Thin 6 GB slices of the LUN active; sparse working set: 144 GB of 4 TB.

Key Observations
• Dominant workload trend: big-block IO with “pseudo-random” jumps (Lustre, StorNext, Teradata, Hadoop); prefetch only up to the FS block size.
• Little short-term block reuse: LRU replacement is ineffective, and all cache hits come from prefetch. Evict-after-read?
• The FS may split an IO into a random burst, defeating traditional stream-prefetch logic; seen as IO jitter within a stream as wide as the application block size.
• Sub-LUN working sets have distinct IO characteristics; a static LUN-grain caching policy is sub-optimal.
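The observations above suggest a cache policy tuned to this workload: fill only via prefetch, and discard each block on its first read. A minimal sketch of such an "evict-after-read" cache (the class and block names here are hypothetical, not a NetApp API):

```python
class EvictAfterReadCache:
    """Toy cache for a scan-heavy workload with little short-term block reuse."""

    def __init__(self):
        self.blocks = {}   # block_id -> data
        self.hits = 0
        self.misses = 0

    def prefetch(self, block_id, data):
        # Per the workload profile, prefetch is the only fill path:
        # all cache hits come from prefetch, none from demand reads.
        self.blocks[block_id] = data

    def read(self, block_id):
        # Serve the read, then evict immediately: with big-block,
        # pseudo-random scans, a block is rarely read twice soon.
        if block_id in self.blocks:
            self.hits += 1
            return self.blocks.pop(block_id)   # evict after read
        self.misses += 1
        return None                            # caller falls back to disk

cache = EvictAfterReadCache()
cache.prefetch("lun0:chunk42", b"chunk-aligned 64MB payload")
first = cache.read("lun0:chunk42")    # hit: served from prefetch, then evicted
second = cache.read("lun0:chunk42")   # miss: no short-term reuse expected
```

Unlike LRU, this policy never retains blocks in hope of reuse, so the cache footprint stays proportional to the prefetch depth rather than the working set.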
Data Evolution: Scale, Structure and Storage
• Unstructured data is increasingly the predominant format:
  – Scale is a challenge for traditional database technologies.
  – Innovative key-value stores (HBase, BigTable) sacrifice some structure (e.g., relational indexing) to achieve scalability.
  – Large datasets in many analytics domains are not amenable to fixed tabular structures:
    · Adjacency lists for graphs/networks,
    · Feature vectors for machine learning, typically a reduction from unstructured input.
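The two non-tabular shapes named above can be illustrated in a few lines (the graph, vocabulary, and input text here are invented for the example):

```python
# An adjacency list: each node maps to the list of nodes it points to.
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
}

# A fixed-length feature vector reduced from unstructured text,
# here a toy bag-of-words count over a hand-picked vocabulary.
vocabulary = ["storage", "data", "cache"]

def to_feature_vector(text):
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

vec = to_feature_vector("Data analytics at scale: data moves to storage")
# vec holds one count per vocabulary term, in vocabulary order
```

Neither shape fits a fixed relational schema naturally: the adjacency list has variable-length rows, and the feature vector's meaning lives in the reduction step, not in the stored columns.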
• Cluster file systems, with large blocks and write-once semantics, accommodate this evolution of data scale and structure:
  – A 64 MB block size is not optimal for all datasets:
    · Google Colossus uses 1 MB blocks.
  – Cluster-level erasure codes replace replica blocks for data protection.
    · System-level RAID is also a viable alternative to replicas.
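Back-of-the-envelope arithmetic shows why erasure codes displace replicas at scale, assuming HDFS-style 3x replication versus a generic (k data + m parity) erasure code; the specific (10, 4) split below is illustrative, not a product configuration:

```python
def replication_overhead(copies):
    # Raw bytes stored per logical byte under n-way replication.
    return float(copies)

def erasure_overhead(k_data, m_parity):
    # Raw bytes stored per logical byte when k data stripes
    # are protected by m parity stripes.
    return (k_data + m_parity) / k_data

triplica = replication_overhead(3)      # 3.0x raw: 200% extra capacity
erasure = erasure_overhead(10, 4)       # 1.4x raw: 40% extra capacity
```

At equal raw capacity, the erasure-coded layout stores more than twice the logical data of triplicas, which is the "higher storage utilization" argument made earlier for keeping a single copy of map-phase data.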
7