1
The New Storage Applications: Lots of Data, New Hardware and Machine Intelligence
Nisha Talagala, Parallel Machines, INFLOW 2016
2
Storage Evolution & Application Evolution Combined
• Storage media: Disk & Tape → Flash → DRAM → Persistent Memory
• Scale: Local → Clustered → Geographically Distributed
• Data interfaces: Block → File, Object → Key-Value
• Data management: Classic Enterprise (Transactions, Business Intelligence, Search, etc.) → Advanced Analytics (Machine Learning, Cognitive Functions)
3
In this talk
• What are the new data apps? – with a heavy focus on Advanced Analytics, particularly Machine Learning and Deep Learning
• What are their salient characteristics when it comes to storage and memory?
• How is storage optimized for these apps today?
• Opportunities for the storage stack?
4
Growing Sources of Data
• Teaching Assistants, Elderly Companions, Service Robots, Personal Social Robots
• Smart Cities, Smart Homes, Smart Enterprise
• Robot Drones, Intelligent Vehicles
• Personal Assistants (bots)
Edited version of slide from Balint Fleischer's talk: Flash Memory Summit 2016, Santa Clara, CA
5
The Application Evolution (from Classic Enterprise Transactions and Business Intelligence to Advanced Analytics)
• "Killer" use cases: OLTP, ERP, Email → eCommerce, Messaging, Social Networks, Content Delivery → Discovery of solutions and capabilities; Risk Assessment; Improving customer experience; Comprehending sensory data
• Key functions: RDBMS, BI, Fraud detection → Databases, Social Graphs, SQL and ML Analytics, Streaming → Natural Language Understanding, Object Recognition, Probabilistic Reasoning, Content Analytics
• Data types: Structured, Transactional → Structured, Unstructured, Transactional, Streaming → Mixed; Graphs, Matrices
• Storage types: Enterprise scale, standards driven, SAN/NAS, etc. → Cloud scale, open source, File/Object → ???
Edited version of slide from Balint Fleischer's talk: Flash Memory Summit 2016, Santa Clara, CA
6
A Sample Analytics Stack
• Language bindings, APIs: Python, Scala, Java, etc.
• Libraries: Machine Learning, Deep Learning, SQL, Graph, CEP, etc.
• Optimizers/Schedulers
• Processing Engine (frequently in memory)
• Data from Repositories or Live Streams: Data Lake, Data Repositories (SQL, NoSQL), Data Streams
7
Machine Learning Software Ecosystem – a Partial View
• Layered API providers: Beam (Data Flow), StreamSQL, Keras
• Algorithms and libraries: Mahout, Samsara, MLlib, FlinkML, Caffe, TensorFlow
• Stream processing engines: Flink, Apex, Spark Streaming, Storm, Samza, NiFi
• Batch processing engines: Hadoop, Spark, Flink
• Domain-focused back-end engines: Caffe, Theano, TensorFlow
• Data from Repositories or Live Streams: Data Lake, Data Repositories (SQL, NoSQL), Data Streams
8
In this talk
• What are the new apps? – with a heavy focus on Advanced Analytics, particularly Machine Learning and Deep Learning
• What are their salient characteristics when it comes to storage and memory?
• How is storage optimized for these apps today?
• Opportunities?
9
How ML/DL Workloads Think about Data – Part 1
• Data sizes
  • Incoming datasets can range from MB to TB
  • Models are typically small; the largest models tend to be in deep neural networks and range from tens of MB to single-digit GB
• Common data types
  • Time series and streams
  • Multi-dimensional arrays, matrices and vectors
  • DataFrames
• Common distributed patterns
  • Data parallel, with periodic synchronization
  • Model parallel
• Network sensitivity varies between algorithms; straggler performance issues can be significant
  • 2x performance difference between InfiniBand and 40 Gbit Ethernet for some algorithms like K-Means and SVM
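The data-parallel pattern with periodic synchronization mentioned above can be sketched in a few lines. This is an illustrative toy, not code from the talk: each "worker" runs gradient steps on its own data shard, and every few steps the model replicas are averaged (the classic parameter-averaging scheme). All function and variable names here are made up for the example.

```python
# Sketch of data-parallel training with periodic synchronization:
# each worker fits a 1-D least-squares model y ~ w*x on its own shard,
# and every `sync_every` steps the replicas are averaged.

def sgd_step(w, batch, lr=0.01):
    """One full-batch gradient step for the model y ~ w * x."""
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * grad

def data_parallel_train(shards, steps=100, sync_every=10):
    weights = [0.0] * len(shards)          # one model replica per worker
    for step in range(1, steps + 1):
        weights = [sgd_step(w, shard) for w, shard in zip(weights, shards)]
        if step % sync_every == 0:         # periodic synchronization
            avg = sum(weights) / len(weights)
            weights = [avg] * len(weights)
    return sum(weights) / len(weights)

# Two shards drawn from the same underlying model y = 3x
shard_a = [(1.0, 3.0), (2.0, 6.0)]
shard_b = [(3.0, 9.0), (4.0, 12.0)]
w = data_parallel_train([shard_a, shard_b])
print(round(w, 2))  # converges near 3.0
```

Between synchronization points the replicas only exchange data at the sync barrier, which is why straggler workers and network bandwidth (the InfiniBand vs. Ethernet gap noted above) dominate the scaling behavior.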
10
The Growth of Streaming Data
• Continuous data flows and continuous processing
  • Enabled & driven by sensor data and real-time information feeds
  • Enables a native time component, "event time"
  • Allows complex computations that combine new and old data in deterministic ways
• Several variants with varied functionality
  • True streams; micro-batch (an incremental batch emulation)
• Possible with existing models like SQL; supported natively by models like Google Dataflow / Apache Beam
• The performance of in-memory streaming enables a convergence between stream analytics (aggregation) and Complex Event Processing (CEP)
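The "event time" idea above is what makes streaming computations deterministic: results are keyed to the timestamps the events carry, not to when they happen to arrive. A minimal sketch (illustrative only, not tied to any particular engine):

```python
# Event-time windowing: events carry their own timestamps, so window
# aggregates come out identical regardless of arrival order.
from collections import defaultdict

def windowed_counts(events, window_secs=60):
    """Count events per event-time window, ignoring arrival order."""
    counts = defaultdict(int)
    for event_time, _payload in events:
        window_start = (event_time // window_secs) * window_secs
        counts[window_start] += 1
    return dict(counts)

# The same three events, delivered in two different arrival orders
on_time = [(0, "a"), (30, "b"), (70, "c")]
late    = [(70, "c"), (0, "a"), (30, "b")]   # "a" and "b" arrive late

print(windowed_counts(on_time) == windowed_counts(late))  # True
```

A micro-batch system that windows by *arrival* time instead would assign "a" and "b" to different windows in the two runs, which is exactly the arrival-time vs. event-time non-determinism revisited later in the talk.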
11
Convergence of RDBMS and Analytics
• In-memory DBs are moving to continuous queries
  • Ex: StreamSQL interfaces, PipelineDB (based on PostgreSQL)
• Stream and batch analytic engines support SQL interfaces
  • Ex: SQL support on Spark, Flink
• SQL parsers with pluggable back ends – Apache Calcite
• Good for basic analytics, but extensions are needed to support machine learning and deep learning
  • Joins, sorts, etc. are good for feature engineering and data cleansing
  • Many core machine and deep learning operations require linear algebra ops
"If the idea of a standard database is 'durable data, ephemeral queries', the idea of a streaming database is 'durable queries, ephemeral data'."
http://www.databasesoup.com/2015/07/pipelinedb-streaming-postgres.html
12
The Growing Role of the Edge
• Closest to data ingest, lowest latency
  • Benefits to real-time processing
• Highly varied connectivity to data centers
• Varied hardware architectures and resource constraints
• Differs from geographically distributed data center architectures
  • Asymmetry of hardware
  • Unpredictable connectivity
  • Unpredictable device uptime
[Figure: IoT Reference Model]
13
How ML/DL Workloads Think about Data – Part 2
• The older data gets, the more its "role" changes
  • Older data is used for batch/historical analytics and model reboots
  • Used for model training (sort of), not for inference
• Guarantees can be "flexible" on older data
  • Availability can be reduced (most algorithms can deal with some data loss)
  • A few data corruptions don't really hurt: data is evaluated in aggregate and algorithms are tolerant of outliers
  • Holes are a fact of real-life data – algorithms deal with them
• Quality of service exists but is different
  • Random access is very rare
  • Heavily patterned access (most operations are some form of array/matrix)
  • Shuffle phase in some analytic engines
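The tolerance claimed above, that holes and a few lost records barely affect results because data is evaluated in aggregate, is easy to demonstrate on a toy statistic. This sketch (all numbers illustrative) drops a quarter of a dataset and shows the aggregate estimate is essentially unchanged:

```python
# Aggregate tolerance to data loss: drop 25% of the records at random
# and compare the feature mean computed over the survivors.
import random

random.seed(42)
data = [random.gauss(10.0, 2.0) for _ in range(10_000)]

def mean(xs):
    return sum(xs) / len(xs)

full = mean(data)
survivors = random.sample(data, 7_500)   # pretend 25% of records were lost
with_holes = mean(survivors)
print(abs(full - with_holes) < 0.1)      # the aggregate barely moves -> True
```

The same reasoning is why reduced availability and occasional corruptions are acceptable for this tier of data in a way they never would be for transactional storage.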
14
Correctness, Determinism, Accuracy and Speed
• More complex evaluation metrics than traditional transactional workloads
• Correctness is hard to measure
  • Even two implementations of the "same algorithm" can generate different results
• Determinism/repeatability is not always present for streaming data
  • Ex: micro-batch processing can produce different results depending on arrival time vs. event time
• The accuracy-to-time tradeoff is non-linear
• Exploratory models can generate massive parallelism for the same data set used repeatedly (hyper-parameter search)
[Plots: Error vs. Time for two implementations of the same algorithm, SVM V1 and SVM V2]
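The "massive parallelism for the same data set" point above comes from hyper-parameter search: every grid point is an independent training run over identical input. A minimal sketch (toy model and grid values are made up for illustration):

```python
# Hyper-parameter grid search: the same dataset feeds every trial, and
# each (lr, steps) trial is independent, so all nine could run in parallel.
from itertools import product

data = [(x, 3.0 * x + 1.0) for x in range(1, 6)]  # toy dataset, reused by every trial

def fit_error(lr, steps):
    """Train y ~ w*x with plain gradient descent; return mean squared error."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

grid = list(product([0.001, 0.01, 0.02], [10, 50, 100]))
results = {(lr, steps): fit_error(lr, steps) for lr, steps in grid}
print(len(grid), "independent trials; best error:", round(min(results.values()), 3))
```

For storage, the notable property is the read pattern: one dataset scanned repeatedly by many concurrent consumers, rather than many datasets each read once.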
15
The Role of Persistence
• For ML functions, most computations today are in-memory
  • Data flows from the data lake to the analytic engine, and results flow back
  • Persistent checkpoints can generate large write traffic for very long running computations (streams, large neural network training, etc.)
  • Persistent message storage to enforce exactly-once semantics and determinism; latency-sensitive write traffic
• For in-memory databases, persistence is part of the core engine
  • Log-based persistence is common
• Loading & cleaning of data is still a very large fraction of the pipeline time
  • Most of this involves manipulating stored data
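The checkpoint write traffic mentioned above can be estimated with simple arithmetic. The model size and interval here are illustrative numbers chosen for the example, not figures from the talk:

```python
# Back-of-envelope estimate of checkpoint write traffic for a
# long-running training job that snapshots the full model periodically.
def checkpoint_bytes_written(model_bytes, run_hours, interval_minutes):
    """Total persisted bytes for full-model checkpoints over one run."""
    n_checkpoints = int(run_hours * 60 // interval_minutes)
    return n_checkpoints * model_bytes

# A 2 GiB neural-network model, checkpointed every 10 minutes for a week:
total = checkpoint_bytes_written(2 * 1024**3, run_hours=7 * 24, interval_minutes=10)
print(round(total / 1024**4, 2), "TiB of checkpoint writes")
```

Even a small model, checkpointed frequently enough, produces terabytes of sequential, latency-tolerant writes, which is one reason flash write endurance comes up as a concern later in the talk.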
16
In this talk
• What are the new apps? – with a heavy focus on Advanced Analytics, particularly Machine Learning and Deep Learning
• What are their salient characteristics when it comes to storage and memory?
• How is storage/memory optimized for these apps today?
• Opportunities?
17
Abstractions and the Stack
• ML/DL applications use common abstractions that combine linear algebra, tables, streams, etc.
• These are stored as independent entities inside key-value pairs, objects or files
• The file system is used as a common namespace
• Information is lost at each level down the stack, along with opportunities to optimize layout, tiering, caching, etc.
• Data copies (or transfers) occur frequently, sometimes more than once
The stack: Matrices, Tables, Streams, etc. → Key-Value and Object → File → Block
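The information loss at each layer boundary can be made concrete: a matrix has a shape and a row-major access pattern the application knows about, but once serialized into a key-value pair the store below sees only opaque bytes. A minimal sketch (the key name and the dict standing in for a KV store are invented for the example):

```python
# A 3x4 matrix serialized into a key-value store: the structure that
# would let lower layers optimize layout or prefetching is gone.
import pickle

matrix = [[row * 10 + col for col in range(4)] for row in range(3)]

kv_store = {}                        # stand-in for a KV/object store
kv_store["model/weights"] = pickle.dumps(matrix)

blob = kv_store["model/weights"]
print(type(blob).__name__)           # bytes -- shape and strides are invisible
restored = pickle.loads(blob)        # only the application can reconstruct it
print(restored[2][3])                # 23
```

Every layer below the serialization point (object, file, block) makes placement and caching decisions against those opaque bytes, which is precisely the lost optimization opportunity the slide describes.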
18
Optimizing Storage: Some Examples
• Time-series optimized databases
  • Examples: BTrDB (FAST 2016) and Gorilla (Facebook, VLDB 2015)
  • Streamlined data types, specialized indexing, tiering optimized for access patterns
• API pushdown techniques
  • Ex: Iguazio.io
  • Streams and Spark RDDs as native access APIs
• Lineage
  • Ex: Alluxio (formerly Tachyon)
  • Links data history & compute history; caches intermediate stages in machine learning pipelines
• Memory expansion
  • Many studies on DRAM/Persistent Memory/Flash tiering for analytics
19
Opportunities: Places to Start
• Persistent Memory and Flash offer several opportunities to improve ML/DL capacity and efficiency
• Fast/frequent checkpointing for long-running jobs
  • Note: will put pressure on write endurance
• Low-latency logging for exactly-once semantics
• Memory expansion: DRAM/Persistent Memory/Flash hierarchies
  • Exploit the highly predictable access patterns of ML algorithms
• Accelerate the data load/save stages of ML/DL pipelines
20
Opportunities – More Fundamental Shifts
• Role of storage types in analytics optimizers and schedulers – superficially similar to DB query optimization
• Exploit the more relaxed set of requirements on persistence
  • Even correctness can be relaxed
  • Example in compute land for flexibility in synchronization: the HogWild! approach to SGD, plus asynchronous SGD, etc.
• Leverage Persistent Memory to unify low-latency streaming data requirements and high-throughput batch data requirements
• New(er) data types and repeatable access patterns
• Converged systems with analytics and storage management for cross-stack efficiency
21
Takeaways
• The use of ML/DL in the enterprise is in its infancy and expanding furiously
• These apps put ever larger pressure on data management, latency, and throughput requirements
• These apps also introduce another layer of abstraction and another layer of workload intelligence
  • Further away from block and file
• Opportunities exist to significantly improve storage and memory for these use cases by understanding and exploiting their priorities and non-priorities for data
22
Thank You