1
The New Storage Applications: Lots of Data, New Hardware and Machine Intelligence
Nisha Talagala, Parallel Machines, INFLOW 2016
2
Storage Evolution & Application Evolution Combined
• Storage media: Disk & Tape → Flash → DRAM → Persistent Memory
• Scale: Local → Clustered → Geographically Distributed
• Data interfaces: Block → File, Object → Key-Value
• Data management: Classic Enterprise (Transactions, Business Intelligence, Search, etc.) → Advanced Analytics (Machine Learning, Cognitive Functions)
3
In this talk
• What are the new data apps? – with a heavy focus on Advanced Analytics, particularly Machine Learning and Deep Learning
• What are their salient characteristics when it comes to storage and memory?
• How is storage optimized for these apps today?
• Opportunities for the storage stack?
4
Growing Sources of Data
• Teaching Assistants, Elderly Companions, Service Robots, Personal Social Robots
• Smart Cities, Smart Homes, Smart Enterprise
• Robot Drones, Intelligent Vehicles
• Personal Assistants (bots)
Edited version of slide from Balint Fleischer's talk: Flash Memory Summit 2016, Santa Clara, CA
5
The Application Evolution (from Classic Enterprise Transactions and Business Intelligence to Advanced Analytics)
• "Killer" use cases: OLTP, ERP, Email → eCommerce, Messaging, Social Networks, Content Delivery → Discovery of solutions and capabilities; Risk Assessment; Improving customer experience; Comprehending sensory data
• Key functions: RDBMS, BI, Fraud detection → Databases, Social Graphs, SQL and ML Analytics, Streaming → Natural Language Understanding, Object Recognition, Probabilistic Reasoning, Content Analytics
• Data types: Structured, Transactional → Structured, Unstructured, Transactional, Streaming → Mixed; Graphs, Matrices
• Storage types: Enterprise scale, standards driven, SAN/NAS, etc. → Cloud scale, open source, File/Object → ???
Edited version of slide from Balint Fleischer's talk: Flash Memory Summit 2016, Santa Clara, CA
6
A Sample Analytics Stack
• Language bindings, APIs: Python, Scala, Java, etc.
• Libraries: Machine Learning, Deep Learning, SQL, Graph, CEP, etc.
• Optimizers/Schedulers
• Processing Engine (frequently in memory)
• Data from Repositories or Live Streams: Data Lake, Data Repositories (SQL, NoSQL), Data Streams
7
Machine Learning Software Ecosystem – a Partial View
• Layered API providers: Beam (Data Flow), StreamSQL, Keras
• Algorithms and libraries: Mahout, Samsara, MLlib, FlinkML, Caffe, TensorFlow
• Stream processing engines: Flink, Apex, Spark Streaming, Storm, Samza, NiFi
• Batch processing engines: Hadoop, Spark, Flink
• Domain-focused back-end engines: Caffe, Theano, TensorFlow
• Data from Repositories or Live Streams: Data Lake, Data Repositories (SQL, NoSQL), Data Streams
8
In this talk
• What are the new apps? – with a heavy focus on Advanced Analytics, particularly Machine Learning and Deep Learning
• What are their salient characteristics when it comes to storage and memory?
• How is storage optimized for these apps today?
• Opportunities?
9
How ML/DL Workloads Think about Data – Part 1
• Data sizes
  • Incoming datasets can range from MB to TB
  • Models are typically small; the largest models tend to be in deep neural networks and range from tens of MB to single-digit GB
• Common data types
  • Time series and streams
  • Multi-dimensional arrays, matrices and vectors
  • DataFrames
• Common distributed patterns
  • Data parallel, with periodic synchronization
  • Model parallel
• Network sensitivity varies between algorithms; straggler performance issues can be significant
  • 2x performance difference between InfiniBand and 40 Gbit Ethernet for some algorithms like K-Means and SVM
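The data-parallel pattern with periodic synchronization mentioned above can be sketched in a few lines. This is an illustrative toy, not code from the talk: each "worker" runs gradient steps on its own data shard, and every few steps the model replicas are averaged (the classic parameter-averaging scheme). All function and variable names here are made up for the example.

```python
# Sketch of data-parallel training with periodic synchronization:
# each worker fits a 1-D least-squares model y ~ w*x on its own shard,
# and every `sync_every` steps the replicas are averaged.

def sgd_step(w, batch, lr=0.01):
    """One full-batch gradient step for the model y ~ w * x."""
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * grad

def data_parallel_train(shards, steps=100, sync_every=10):
    weights = [0.0] * len(shards)          # one model replica per worker
    for step in range(1, steps + 1):
        weights = [sgd_step(w, shard) for w, shard in zip(weights, shards)]
        if step % sync_every == 0:         # periodic synchronization
            avg = sum(weights) / len(weights)
            weights = [avg] * len(weights)
    return sum(weights) / len(weights)

# Two shards drawn from the same underlying model y = 3x
shard_a = [(1.0, 3.0), (2.0, 6.0)]
shard_b = [(3.0, 9.0), (4.0, 12.0)]
w = data_parallel_train([shard_a, shard_b])
print(round(w, 2))  # converges near 3.0
```

Between synchronization points the replicas only exchange data at the sync barrier, which is why straggler workers and network bandwidth (the InfiniBand vs. Ethernet gap noted above) dominate the scaling behavior.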
10
The Growth of Streaming Data
• Continuous data flows and continuous processing
  • Enabled & driven by sensor data and real-time information feeds
  • Enables a native time component, "event time"
  • Allows complex computations that combine new and old data in deterministic ways
• Several variants with varied functionality
  • True streams; micro-batch (an incremental batch emulation)
• Possible with existing models like SQL; supported natively by models like Google Dataflow / Apache Beam
• The performance of in-memory streaming enables a convergence between stream analytics (aggregation) and Complex Event Processing (CEP)
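The "event time" idea above is what makes streaming computations deterministic: results are keyed to the timestamps the events carry, not to when they happen to arrive. A minimal sketch (illustrative only, not tied to any particular engine):

```python
# Event-time windowing: events carry their own timestamps, so window
# aggregates come out identical regardless of arrival order.
from collections import defaultdict

def windowed_counts(events, window_secs=60):
    """Count events per event-time window, ignoring arrival order."""
    counts = defaultdict(int)
    for event_time, _payload in events:
        window_start = (event_time // window_secs) * window_secs
        counts[window_start] += 1
    return dict(counts)

# The same three events, delivered in two different arrival orders
on_time = [(0, "a"), (30, "b"), (70, "c")]
late    = [(70, "c"), (0, "a"), (30, "b")]   # "a" and "b" arrive late

print(windowed_counts(on_time) == windowed_counts(late))  # True
```

A micro-batch system that windows by *arrival* time instead would assign "a" and "b" to different windows in the two runs, which is exactly the arrival-time vs. event-time non-determinism revisited later in the talk.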
11
Convergence of RDBMS and Analytics
• In-memory DBs are moving to continuous queries
  • Ex: StreamSQL interfaces, PipelineDB (based on PostgreSQL)
• Stream and batch analytic engines support SQL interfaces
  • Ex: SQL support on Spark, Flink
• SQL parsers with pluggable back ends – Apache Calcite
• Good for basic analytics, but extensions are needed to support machine learning and deep learning
  • Joins, sorts, etc. are good for feature engineering and data cleansing
  • Many core machine and deep learning operations require linear algebra ops
"If the idea of a standard database is 'durable data, ephemeral queries', the idea of a streaming database is 'durable queries, ephemeral data'."
http://www.databasesoup.com/2015/07/pipelinedb-streaming-postgres.html
12
The Growing Role of the Edge
• Closest to data ingest, lowest latency
  • Benefits to real-time processing
• Highly varied connectivity to data centers
• Varied hardware architectures and resource constraints
• Differs from geographically distributed data center architectures
  • Asymmetry of hardware
  • Unpredictable connectivity
  • Unpredictable device uptime
[Figure: IoT Reference Model]
13
How ML/DL Workloads Think about Data – Part 2
• The older data gets, the more its "role" changes
  • Older data is used for batch/historical analytics and model reboots
  • Used for model training (sort of), not for inference
• Guarantees can be "flexible" on older data
  • Availability can be reduced (most algorithms can deal with some data loss)
  • A few data corruptions don't really hurt: data is evaluated in aggregate and algorithms are tolerant of outliers
  • Holes are a fact of real-life data – algorithms deal with them
• Quality of service exists but is different
  • Random access is very rare
  • Heavily patterned access (most operations are some form of array/matrix)
  • Shuffle phase in some analytic engines
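The tolerance claimed above, that holes and a few lost records barely affect results because data is evaluated in aggregate, is easy to demonstrate on a toy statistic. This sketch (all numbers illustrative) drops a quarter of a dataset and shows the aggregate estimate is essentially unchanged:

```python
# Aggregate tolerance to data loss: drop 25% of the records at random
# and compare the feature mean computed over the survivors.
import random

random.seed(42)
data = [random.gauss(10.0, 2.0) for _ in range(10_000)]

def mean(xs):
    return sum(xs) / len(xs)

full = mean(data)
survivors = random.sample(data, 7_500)   # pretend 25% of records were lost
with_holes = mean(survivors)
print(abs(full - with_holes) < 0.1)      # the aggregate barely moves -> True
```

The same reasoning is why reduced availability and occasional corruptions are acceptable for this tier of data in a way they never would be for transactional storage.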
14
Correctness, Determinism, Accuracy and Speed
• More complex evaluation metrics than traditional transactional workloads
• Correctness is hard to measure
  • Even two implementations of the "same algorithm" can generate different results
• Determinism/repeatability is not always present for streaming data
  • Ex: micro-batch processing can produce different results depending on arrival time vs. event time
• The accuracy-to-time tradeoff is non-linear
• Exploratory models can generate massive parallelism for the same data set used repeatedly (hyper-parameter search)
[Plots: Error vs. Time for two implementations of the same algorithm, SVM V1 and SVM V2]
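The "massive parallelism for the same data set" point above comes from hyper-parameter search: every grid point is an independent training run over identical input. A minimal sketch (toy model and grid values are made up for illustration):

```python
# Hyper-parameter grid search: the same dataset feeds every trial, and
# each (lr, steps) trial is independent, so all nine could run in parallel.
from itertools import product

data = [(x, 3.0 * x + 1.0) for x in range(1, 6)]  # toy dataset, reused by every trial

def fit_error(lr, steps):
    """Train y ~ w*x with plain gradient descent; return mean squared error."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

grid = list(product([0.001, 0.01, 0.02], [10, 50, 100]))
results = {(lr, steps): fit_error(lr, steps) for lr, steps in grid}
print(len(grid), "independent trials; best error:", round(min(results.values()), 3))
```

For storage, the notable property is the read pattern: one dataset scanned repeatedly by many concurrent consumers, rather than many datasets each read once.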
15
The Role of Persistence
• For ML functions, most computations today are in-memory
  • Data flows from the data lake to the analytic engine, and results flow back
  • Persistent checkpoints can generate large write traffic for very long running computations (streams, large neural network training, etc.)
  • Persistent message storage to enforce exactly-once semantics and determinism; latency-sensitive write traffic
• For in-memory databases, persistence is part of the core engine
  • Log-based persistence is common
• Loading & cleaning of data is still a very large fraction of the pipeline time
  • Most of this involves manipulating stored data
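The checkpoint write traffic mentioned above can be estimated with simple arithmetic. The model size and interval here are illustrative numbers chosen for the example, not figures from the talk:

```python
# Back-of-envelope estimate of checkpoint write traffic for a
# long-running training job that snapshots the full model periodically.
def checkpoint_bytes_written(model_bytes, run_hours, interval_minutes):
    """Total persisted bytes for full-model checkpoints over one run."""
    n_checkpoints = int(run_hours * 60 // interval_minutes)
    return n_checkpoints * model_bytes

# A 2 GiB neural-network model, checkpointed every 10 minutes for a week:
total = checkpoint_bytes_written(2 * 1024**3, run_hours=7 * 24, interval_minutes=10)
print(round(total / 1024**4, 2), "TiB of checkpoint writes")
```

Even a small model, checkpointed frequently enough, produces terabytes of sequential, latency-tolerant writes, which is one reason flash write endurance comes up as a concern later in the talk.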
16
In this talk
• What are the new apps? – with a heavy focus on Advanced Analytics, particularly Machine Learning and Deep Learning
• What are their salient characteristics when it comes to storage and memory?
• How is storage/memory optimized for these apps today?
• Opportunities?
17
Abstractions and the Stack
• ML/DL applications use common abstractions that combine linear algebra, tables, streams, etc.
• These are stored as independent entities inside key-value pairs, objects or files
• The file system is used as a common namespace
• Information is lost at each level down the stack, along with opportunities to optimize layout, tiering, caching, etc.
• Data copies (or transfers) occur frequently, sometimes more than once
The stack: Matrices, Tables, Streams, etc. → Key-Value and Object → File → Block
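The information loss at each layer boundary can be made concrete: a matrix has a shape and a row-major access pattern the application knows about, but once serialized into a key-value pair the store below sees only opaque bytes. A minimal sketch (the key name and the dict standing in for a KV store are invented for the example):

```python
# A 3x4 matrix serialized into a key-value store: the structure that
# would let lower layers optimize layout or prefetching is gone.
import pickle

matrix = [[row * 10 + col for col in range(4)] for row in range(3)]

kv_store = {}                        # stand-in for a KV/object store
kv_store["model/weights"] = pickle.dumps(matrix)

blob = kv_store["model/weights"]
print(type(blob).__name__)           # bytes -- shape and strides are invisible
restored = pickle.loads(blob)        # only the application can reconstruct it
print(restored[2][3])                # 23
```

Every layer below the serialization point (object, file, block) makes placement and caching decisions against those opaque bytes, which is precisely the lost optimization opportunity the slide describes.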
18
Optimizing Storage: Some Examples
• Time-series optimized databases
  • Examples: BTrDB (FAST 2016) and Gorilla (Facebook, VLDB 2015)
  • Streamlined data types, specialized indexing, tiering optimized for access patterns
• API pushdown techniques
  • Ex: Iguazio.io
  • Streams and Spark RDDs as native access APIs
• Lineage
  • Ex: Alluxio (formerly Tachyon)
  • Links data history & compute history; caches intermediate stages in machine learning pipelines
• Memory expansion
  • Many studies on DRAM/Persistent Memory/Flash tiering for analytics
19
Opportunities: Places to Start
• Persistent Memory and Flash offer several opportunities to improve ML/DL capacity and efficiency
• Fast/frequent checkpointing for long-running jobs
  • Note: will put pressure on write endurance
• Low-latency logging for exactly-once semantics
• Memory expansion: DRAM/Persistent Memory/Flash hierarchies
  • Exploit the highly predictable access patterns of ML algorithms
• Accelerate the data load/save stages of ML/DL pipelines
20
Opportunities – More Fundamental Shifts
• Role of storage types in analytics optimizers and schedulers – superficially similar to DB query optimization
• Exploit the more relaxed set of requirements on persistence
  • Even correctness can be relaxed
  • Example in compute land for flexibility in synchronization: the HogWild! approach to SGD, plus asynchronous SGD, etc.
• Leverage Persistent Memory to unify low-latency streaming data requirements and high-throughput batch data requirements
• New(er) data types and repeatable access patterns
• Converged systems with analytics and storage management for cross-stack efficiency
21
Takeaways
• The use of ML/DL in the enterprise is in its infancy and expanding furiously
• These apps put ever larger pressure on data management, latency, and throughput requirements
• These apps also introduce another layer of abstraction and another layer of workload intelligence
  • Further away from block and file
• Opportunities exist to significantly improve storage and memory for these use cases by understanding and exploiting their priorities and non-priorities for data
22
Thank You