Joe Moore, NetApp Array Products Group
The Evolution of Storage to Support Data Analytics at Scale
1
Solutions For Multiple Workloads
Deliver solutions built on open standards with best-in-class partnerships
2
FAS/V Family with Data ONTAP®
E-Series with Hadoop, Lustre, StorNext, StorageGRID, and many others
Agile Data Infrastructure
Infrastructure for a New Era of Enterprise Applications
What is “Big Data”?
Volume · Speed · Complexity
“Big Data” refers to datasets whose volume, speed, and complexity are beyond the ability of typical tools to capture, store, manage, and analyze.
3
Coined in 2000 by Francis Diebold, Professor of Economics at the University of Pennsylvania.
Big Data Solution Portfolio
Insight from extremely large datasets
Performance for data intensive workloads
Secure boundless data storage
Big Data
4
Analytics of Tomorrow
• Traditional and Big Analytics will run side-by-side for years to come.
• Hadoop moves to shared, virtualized infrastructure for better efficiency and ease of management, either:
  – logically distributed, shared-nothing on physically shared-everything, or
  – the same as above, except Hadoop becomes logically shared-everything as HDFS is replaced by a parallel file system (e.g., Lustre, StorNext, or GPFS).
• Enterprise-class resiliency (no SPoF) and reliability with HPC-like performance (no need for triplicas).
• Use of a single copy of data for the map phase (higher storage utilization).
• Natural intersection with Cloud (Analytics as a Service).
5
Application-Aware Storage (Hadoop example)
6
Hadoop (Terasort) Workload Example
• Map/Reduce phases: intermediate results are written, then read back within 20 minutes; cache only until first read.
• 64 MB reads issued as a jumbled IO burst; chunk-aligned prefetch.
• Thin 6 GB slices of the LUN active; sparse working set: 144 GB of 4 TB.

Key Observations
• Dominant workload trend: big-block IO with “pseudo-random” jumps (Lustre, StorNext, Teradata, Hadoop); prefetch only up to the FS block size.
• Little short-term block reuse: LRU replacement is ineffective, and all cache hits come from prefetch. Evict-after-read?
• The FS may split an IO into a random burst, defeating traditional stream-prefetch logic; seen as IO jitter within a stream as wide as the application block size.
• Sub-LUN working sets have distinct IO characteristics; a static LUN-grain caching policy is sub-optimal.
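The observations above suggest a cache policy tuned to this workload: fill only via prefetch, and discard each block on its first read. A minimal sketch of such an "evict-after-read" cache (the class and block names here are hypothetical, not a NetApp API):

```python
class EvictAfterReadCache:
    """Toy cache for a scan-heavy workload with little short-term block reuse."""

    def __init__(self):
        self.blocks = {}   # block_id -> data
        self.hits = 0
        self.misses = 0

    def prefetch(self, block_id, data):
        # Per the workload profile, prefetch is the only fill path:
        # all cache hits come from prefetch, none from demand reads.
        self.blocks[block_id] = data

    def read(self, block_id):
        # Serve the read, then evict immediately: with big-block,
        # pseudo-random scans, a block is rarely read twice soon.
        if block_id in self.blocks:
            self.hits += 1
            return self.blocks.pop(block_id)   # evict after read
        self.misses += 1
        return None                            # caller falls back to disk

cache = EvictAfterReadCache()
cache.prefetch("lun0:chunk42", b"chunk-aligned 64MB payload")
first = cache.read("lun0:chunk42")    # hit: served from prefetch, then evicted
second = cache.read("lun0:chunk42")   # miss: no short-term reuse expected
```

Unlike LRU, this policy never retains blocks in hope of reuse, so the cache footprint stays proportional to the prefetch depth rather than the working set.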
Data Evolution: Scale, Structure and Storage
• Unstructured data is increasingly the predominant format:
  – Scale is a challenge for traditional database technologies.
  – Innovative key-value stores (HBase, BigTable) sacrifice some structure (e.g., relational indexing) to achieve scalability.
  – Large datasets in many analytics domains are not amenable to fixed tabular structures:
    · Adjacency lists for graphs/networks,
    · Feature vectors for machine learning, typically a reduction from unstructured input.
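The two non-tabular shapes named above can be illustrated in a few lines (the graph, vocabulary, and input text here are invented for the example):

```python
# An adjacency list: each node maps to the list of nodes it points to.
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
}

# A fixed-length feature vector reduced from unstructured text,
# here a toy bag-of-words count over a hand-picked vocabulary.
vocabulary = ["storage", "data", "cache"]

def to_feature_vector(text):
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

vec = to_feature_vector("Data analytics at scale: data moves to storage")
# vec holds one count per vocabulary term, in vocabulary order
```

Neither shape fits a fixed relational schema naturally: the adjacency list has variable-length rows, and the feature vector's meaning lives in the reduction step, not in the stored columns.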
• Cluster file systems, with large blocks and write-once semantics, accommodate this evolution of data scale and structure:
  – A 64 MB block size is not optimal for all datasets:
    · Google Colossus uses 1 MB blocks.
  – Cluster-level erasure codes replace replica blocks for data protection.
    · System-level RAID is also a viable alternative to replicas.
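Back-of-the-envelope arithmetic shows why erasure codes displace replicas at scale, assuming HDFS-style 3x replication versus a generic (k data + m parity) erasure code; the specific (10, 4) split below is illustrative, not a product configuration:

```python
def replication_overhead(copies):
    # Raw bytes stored per logical byte under n-way replication.
    return float(copies)

def erasure_overhead(k_data, m_parity):
    # Raw bytes stored per logical byte when k data stripes
    # are protected by m parity stripes.
    return (k_data + m_parity) / k_data

triplica = replication_overhead(3)      # 3.0x raw: 200% extra capacity
erasure = erasure_overhead(10, 4)       # 1.4x raw: 40% extra capacity
```

At equal raw capacity, the erasure-coded layout stores more than twice the logical data of triplicas, which is the "higher storage utilization" argument made earlier for keeping a single copy of map-phase data.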
7