


DATA516/CSED516 Scalable Data Systems and Algorithms

Lecture 5: In-memory Analytics & Machine Learning

(Spark, MLlib, and TensorFlow)

DATA516/CSED516 - Fall 2017 1

References

• M. Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In NSDI, 2012.

• M. Armbrust et al. Spark SQL: Relational Data Processing in Spark. In SIGMOD, 2015.

• X. Meng et al. MLlib: Machine Learning in Apache Spark. Journal of Machine Learning Research, 17(34), 2016.

• M. Abadi et al. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, 2016.

• Project website: https://www.tensorflow.org/

DATA516/CSED516 - Fall 2017 2

Motivation

• Goal: Make better use of distributed memory in a cluster

• Observation:
– Modern data analytics involves iterations
– Users also want to do interactive data mining
– In both cases, want to keep intermediate data in memory and reuse it
– MapReduce does not support this scenario well
• Requires writing data to disk between jobs

DATA516/CSED516 - Fall 2017 3

Approach

• New abstraction: Resilient Distributed Datasets

• RDD properties
– Parallel data structure
– Can be persisted in memory
– Fault-tolerant
– Users can manipulate RDDs with a rich set of operators

DATA516/CSED516 - Fall 2017 4

RDD Details
• An RDD is a partitioned collection of records
– RDDs are typed: RDD[Int] is an RDD of integers
– Records are Java/Python objects

• An RDD is read-only
– This means no updates to individual records
– This contrasts with in-memory key-value stores

• To create an RDD
– Execute a deterministic operation on another RDD
– Or on data in stable storage
– Example operations: map, filter, and join

DATA516/CSED516 - Fall 2017 5

RDD Materialization
• Users control persistence and partitioning

• Persistence
– Should we materialize this RDD in memory?

• Partitioning
– Users can specify a key for partitioning an RDD (see the sketch below)

DATA516/CSED516 - Fall 2017 6
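As a concrete illustration, here is a minimal PySpark sketch of both knobs, persistence and partitioning; the HDFS path and partition count are hypothetical:

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="materialization-demo")

# Pair RDD of (userId, event) records; the path is a placeholder
events = sc.textFile("hdfs://.../events").map(lambda line: (line.split(",")[0], line))

# Partitioning: hash-partition by key into 8 partitions
byUser = events.partitionBy(8)

# Persistence: materialize this RDD in memory once it is first computed
byUser.persist(StorageLevel.MEMORY_ONLY)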

RDDs and Materialized Views

• So an RDD is a lot like a view in a parallel engine

• A view that can be materialized in memory

• A materialized view that can be physically tuned
– Tuning: how to partition for maximum performance

• But RDDs hold arbitrary Java/Python objects– Unlike materialized views in an RDBMS

DATA516/CSED516 - Fall 2017 7

Spark Programming Interface

• RDDs implemented in new Spark system

• Spark exposes RDDs through a language-integrated API similar to DryadLINQ, but in Scala

• Later Spark was extended with SQL

DATA516/CSED516 - Fall 2017 8

Why Scala?From Matei Zaharia (Spark lead author): “When we started Spark, we wanted it to have a concise API for users, which Scala did well. At the same time, we wanted it to be fast (to work on large datasets), so many scripting languages didn't fit the bill. Scala can be quite fast because it's statically typed and it compiles in a known way to the JVM. Finally, running on the JVM also let us call into other Java-based big data systems, such as Cassandra, HDFS and HBase.

Since we started, we've also added APIs in Java (which became much nicer with Java 8) and Python”

https://www.quora.com/Why-is-Apache-Spark-implemented-in-Scala

DATA516/CSED516 - Fall 2017 9

Querying/Processing RDDs
• Programmer first defines RDDs through transformations on data in stable storage
– Map
– Filter
– …

• Then, can use RDDs in actions
– Action returns a value to the app or exports data to storage
– Count (counts elements in dataset)
– Collect (returns elements themselves)
– Save (output to stable storage)

DATA516/CSED516 - Fall 2017 10

Example (from paper)
Search logs stored in HDFS

// First line defines an RDD backed by an HDFS file
lines = spark.textFile("hdfs://...")
// Now we create a new RDD from the first one
errors = lines.filter(_.startsWith("Error"))
// Persist the RDD in memory for reuse later
errors.persist()
errors.collect()
errors.filter(_.contains("MySQL")).count()

DATA516/CSED516 - Fall 2017 11

More on Programming Interface
• Large set of pre-defined transformations:
– Map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, crossProduct, …

• Small set of pre-defined actions:
– Count, collect, reduce, lookup, and save

• Programming interface includes iterations (see the sketch below)

DATA516/CSED516 - Fall 2017 12
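To make the transformation/action split concrete, here is a hedged PySpark sketch combining a few of the transformations above with an action, plus a driver-side loop (iteration happens in ordinary driver code; `sc` is the SparkContext from the earlier sketch):

# Word count: flatMap/map/reduceByKey are transformations, collect is the action
words = sc.textFile("hdfs://.../docs").flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())

# Iteration: the driver loops, reusing a cached RDD each time
data = counts.cache()
for i in range(5):
    # hypothetical per-iteration computation over the cached RDD
    total = data.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)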

More Complex Example

DATA516/CSED516 - Fall 2017 13 [From Zaharia12]

Spark Runtime

DATA516/CSED516 - Fall 2017 14

[From https://spark.apache.org/docs/latest/cluster-overview.html]

1. Input data resides in HDFS or another Hadoop input source
2. User writes a driver program, which includes a SparkContext
   1. SparkContext connects to the cluster manager (e.g., YARN)
   2. Spark acquires executors (= processes) through the cluster manager
   3. Spark ships code to workers, which cache data & execute tasks
3. Each app is an independent set of processes (no sharing across apps); a minimal driver sketch follows
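A minimal PySpark driver sketch matching these steps; the master URL and app name are placeholders:

from pyspark import SparkConf, SparkContext

# Step 2: the driver creates a SparkContext, which connects to the cluster manager
conf = SparkConf().setMaster("yarn").setAppName("demo-app")
sc = SparkContext(conf=conf)

# Step 1: define an RDD over data in HDFS; executors run the resulting tasks
data = sc.textFile("hdfs://.../input")
print(data.count())

sc.stop()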

Query Execution Details

• Lazy evaluation
– RDDs are not evaluated until an action is called

• In-memory caching
– Spark workers are long-lived processes
– RDDs can be materialized in memory in workers
– Base data is not cached in memory

DATA516/CSED516 - Fall 2017 15
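A short, hedged demonstration of laziness in PySpark: the transformations below only build a plan and touch no data until the action runs (path is a placeholder).

lines = sc.textFile("hdfs://.../logs")                   # nothing is read yet
errors = lines.filter(lambda l: l.startswith("ERROR"))   # still nothing
errors.cache()                                           # marks the RDD for caching, no work yet

n = errors.count()   # action: NOW the file is scanned and errors is materialized
m = errors.filter(lambda l: "MySQL" in l).count()        # reuses the cached RDD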

Key Challenge• How to provide fault-tolerance efficiently?

DATA516/CSED516 - Fall 2017 16

Fault-Tolerance Through Lineage

Represent an RDD with five pieces of information
• A set of partitions
• A set of dependencies on parent partitions
– Distinguishes between narrow (one-to-one)
– And wide (one-to-many) dependencies
• A function to compute the dataset based on its parents
• Metadata about the partitioning scheme and data placement

RDD = distributed relation + lineage

DATA516/CSED516 - Fall 2017 17
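Lineage is visible from the API; a small PySpark sketch (continuing the earlier RDDs):

# toDebugString() prints the lineage graph Spark would use for recovery
pairs = sc.textFile("hdfs://.../logs").map(lambda l: (l[:4], 1))
counts = pairs.reduceByKey(lambda a, b: a + b)   # wide dependency (shuffle)
print(counts.toDebugString())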

More Details on Execution

DATA516/CSED516 - Fall 2017 18 [From Zaharia12]

Scheduler builds a DAG of stages based on the lineage graph of the desired RDD.

Pipelined execution within stages

Synchronization barrier with materialization before shuffles

If a task fails, re-run it
Can checkpoint RDDs to disk
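Checkpointing truncates lineage by writing an RDD to stable storage; a hedged PySpark sketch (the directory path is a placeholder):

sc.setCheckpointDir("hdfs://.../checkpoints")
counts.checkpoint()   # marks the RDD; written out when next computed
counts.count()        # action triggers computation and the checkpoint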

Spark Ecosystem Growth

DATA516/CSED516 - Fall 2017 19

Image from: http://spark.apache.org/

Spark SQL Motivation
• Issues with MapReduce:
– Writing low-level programs is onerous
– Requires manual optimization to achieve high performance

• Issues with declarative queries + relational algebra:
– Need complex functions for Extract-Transform-Load
• Input data may be semi-structured or unstructured
– Need to run machine learning or graph algorithms

• Solution: combine relational queries with complex procedural algorithms

DATA516/CSED516 - Fall 2017 20

Spark SQL vs. Functional Programming API
• Spark's original functional programming API
– General
– But limited opportunities for automatic optimization

• Spark SQL simultaneously
– Makes Spark accessible to more users
– Improves opportunities for automatic optimization

DATA516/CSED516 - Fall 2017 21

Spark SQL Detailed Goals
• Support relational processing both within Spark programs (on native RDDs) and on external data sources using a programmer-friendly API
– The "DataFrame API", a lot like an access method API
– So users can easily add new data source access methods

• Provide high performance using established DBMS techniques

• Easily support new data sources, including semi-structured data and external databases amenable to query federation

• Enable extension with advanced analytics algorithms such as graph processing and machine learning

DATA516/CSED516 - Fall 2017 22

Spark SQL Overview

DATA516/CSED516 - Fall 2017 23

DataFrame API
• DataFrame: a collection of structured records
– All records have the same schema
– The collection can be distributed
– So really the same thing as a relation

• A DataFrame can be created from
– Tables in the catalog, based on external data sources
– RDDs of Java/Python objects

• A DataFrame can be manipulated using Spark's procedural API or using relational APIs

• Lazy: each DataFrame object represents a logical plan (see the sketch below)

DATA516/CSED516 - Fall 2017 24
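A hedged sketch of creating a DataFrame from an RDD of Python objects and observing its laziness (Spark 2.x-style SparkSession assumed; the data is made up):

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("df-demo").getOrCreate()

# DataFrame from an RDD of Row objects
rdd = spark.sparkContext.parallelize([Row(name="Ann", age=34), Row(name="Bob", age=29)])
people = spark.createDataFrame(rdd)

# Each DataFrame is a logical plan; explain() shows it without executing anything
young = people.where(people.age < 30)
young.explain(True)
print(young.collect())   # the action triggers execution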

Data Model
• Nested data model based on Hive
– Supports all major SQL data types
– And complex types: structs, arrays, maps, …
– Data types can be nested

• First-class support for complex data types in the query language and the API

DATA516/CSED516 - Fall 2017 25

DataFrame Examples

DATA516/CSED516 - Fall 2017 26

[Figure: creating a DataFrame and an expression in the DataFrame DSL]

DataFrame Operations

• Relational operations using a DSL
– select, where, join, groupBy

• All operations take expression objects in the DSL
• Can register DataFrames as temp tables (see the sketch below)

DATA516/CSED516 - Fall 2017 27
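A hedged sketch of the DSL operations and temp-table registration (a temp view in Spark 2.x); the column and table names are made up:

from pyspark.sql import functions as F

# DSL: select/where/groupBy take expression objects
stats = (people
         .where(people.age > 21)
         .groupBy("name")
         .agg(F.avg("age").alias("avgAge")))

# Register as a temp table/view and query it with SQL
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age < 30").show()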

DataFrame Benefits
• Make it easy to compute multiple aggregates in one pass using a SQL statement

• Automatically store data in a columnar format that is significantly more compact than Java/Python objects
– Good for data caching

• Unlike existing data frame APIs in R and Python, DataFrame operations in Spark SQL go through a relational optimizer, Catalyst

DATA516/CSED516 - Fall 2017 28

Querying Native Datasets
• Construct DataFrames directly against RDDs of objects native to the programming language
• Spark SQL can automatically infer the schema of these objects using reflection (see the sketch below)

DATA516/CSED516 - Fall 2017 29
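A hedged sketch of schema inference over native Python objects:

# The schema is inferred by reflection over the Row fields (names and types)
users = spark.createDataFrame([Row(name="Ann", age=34), Row(name="Bob", age=29)])
users.printSchema()   # prints the inferred schema, roughly: name: string, age: long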

User-Defined Functions
• Defined inline, in the same programming language
– Easy to write and register
– Easy to debug entire pipelines

• UDFs can operate on scalars or tables (a registration sketch follows)

DATA516/CSED516 - Fall 2017 30
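A hedged sketch of registering and using a scalar UDF in PySpark (Spark 2.x; the function and names are made up):

from pyspark.sql.types import IntegerType

# Register a scalar UDF callable from SQL
spark.udf.register("strlen", lambda s: len(s), IntegerType())
spark.sql("SELECT name, strlen(name) FROM people").show()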

Catalyst Optimizer
• Spark SQL introduces a novel extensible optimizer called Catalyst

• Catalyst makes it easy to add data sources, optimization rules, and data types for domains such as machine learning

DATA516/CSED516 - Fall 2017 31

Catalyst Overview
• Tree + rules
– A rule is a function from a tree to another tree
– Rules often use pattern matching to search and replace subtrees
– One rule can specify multiple patterns

• Optimization approach
– Groups rules into batches
– Executes each batch until a fixed point is reached (see the sketch below)

DATA516/CSED516 - Fall 2017 32
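To illustrate the rule-plus-fixed-point idea (a toy Python sketch, not Catalyst's actual Scala API), here is a constant-folding rule over tiny expression trees:

# Toy expression trees: ("lit", v) or ("add", left, right)
def fold_constants(t):
    # Rule: Add(Lit(a), Lit(b)) => Lit(a + b), applied bottom-up
    if t[0] == "add":
        l, r = fold_constants(t[1]), fold_constants(t[2])
        if l[0] == "lit" and r[0] == "lit":
            return ("lit", l[1] + r[1])
        return ("add", l, r)
    return t

def run_to_fixed_point(rule, tree):
    # Catalyst-style: apply a batch of rules until the tree stops changing
    while True:
        new = rule(tree)
        if new == tree:
            return tree
        tree = new

print(run_to_fixed_point(fold_constants, ("add", ("lit", 1), ("add", ("lit", 2), ("lit", 3)))))
# ('lit', 6)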

Optimization Steps

DATA516/CSED516 - Fall 2017 33

• Analyze a logical plan to resolve references
• Logical plan optimization
– Rule-based
• Physical planning
– May generate multiple plans and compare costs
• Code generation → Java bytecode
• Each phase uses different types of tree nodes

Extension Points
• Rules
– Developers can add rules to the optimizer

• Data sources
– createRelation function: schema + size
– TableScan
• Scans all data and returns an RDD of row objects
– PrunedFilteredScan
• Scans only the required columns and/or rows

DATA516/CSED516 - Fall 2017 34

CREATE TEMPORARY TABLE messages

USING com.databricks.spark.avro

OPTIONS (path "messages.avro")

Extension Points

• Ex: register two-dimensional points (x, y) as a UDT:

class PointUDT extends UserDefinedType[Point] {
  def dataType = StructType(Seq(
    StructField("x", DoubleType),
    StructField("y", DoubleType)
  ))
  def serialize(p: Point) = Row(p.x, p.y)
  def deserialize(r: Row) =
    Point(r.getDouble(0), r.getDouble(1))
}

DATA516/CSED516 - Fall 2017 35

Latest Advances

DATA516/CSED516 - Fall 2017 36

Image from: http://spark.apache.org/

MLlib
• Library built on top of Spark
• Provides many ML algorithms
• Exposes an easy-to-use, high-level API
– Java, Scala, and Python
• Executes in a distributed fashion
– Data-parallelism or model-parallelism

DATA516/CSED516 - Fall 2017 37

MLlib Functionality
• Common learning algorithms
– Classification, regression, clustering, collaborative filtering

• Low-level primitives and basic utilities for
– Convex optimization, distributed linear algebra, statistical analysis, and feature extraction

• Tools for constructing & tuning pipelines

DATA516/CSED516 - Fall 2017 38

MLlib Pipelines
Goal: facilitate combining algorithms
• DataFrame: the basic ML dataset
• Transformer: an algorithm that transforms one DataFrame into another DataFrame
– A feature transformer or a learned model
• Estimator: an algorithm that can be fit on a DataFrame to produce a Transformer
– The DataFrame is the training data
– The Transformer is the model learned from the data
– Key method: fit()
• Pipeline: chains multiple Transformers and Estimators together

DATA516/CSED516 - Fall 2017 39

https://spark.apache.org/docs/latest/ml-pipeline.html

Example MLlib Pipeline
• Text processing (training phase)
– Split each document into words
– Convert each document into a numerical feature vector
• See last lecture
• Ex: set a number of buckets, hash each word, count frequencies (a sketch follows this list)
• "World Hello World World" (8 buckets) → [0, 0, 0, 1, 0, 0, 0, 3]
– Learn a prediction model using feature vectors + labels
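A tiny Python sketch of the feature-hashing step, using a stable hash for reproducibility (Python's built-in hash of strings is salted per run):

import zlib

def hash_features(doc, buckets=8):
    # Hash each word to a bucket and count frequencies
    vec = [0] * buckets
    for word in doc.split():
        vec[zlib.crc32(word.encode()) % buckets] += 1
    return vec

# "Hello" lands in one bucket once, "World" in another three times, e.g.:
# hash_features("World Hello World World") -> a vector like [0, 0, 0, 1, 0, 0, 0, 3]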

DATA516/CSED516 - Fall 2017 40

https://spark.apache.org/docs/latest/ml-pipeline.html

[Figure: training pipeline of DataFrame → Transformer → Estimator stages]
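A hedged PySpark sketch of this training pipeline, following the spark.apache.org ml-pipeline example above (the training data is made up):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

training = spark.createDataFrame(
    [("spark rocks", 1.0), ("hadoop disk", 0.0)], ["text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")      # Transformer
hashingTF = HashingTF(inputCol="words", outputCol="features")  # Transformer
lr = LogisticRegression(maxIter=10)                            # Estimator

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)   # fit() returns a PipelineModel (a Transformer)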

Example MLlib Pipeline
• Text processing (classification phase)

DATA516/CSED516 - Fall 2017 41

https://spark.apache.org/docs/latest/ml-pipeline.html
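In the classification phase the fitted PipelineModel acts as a Transformer; a short usage sketch continuing the training example:

test = spark.createDataFrame([("spark is fast",)], ["text"])
model.transform(test).select("text", "prediction").show()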

DATA516/CSED516 - Fall 2017 42

https://spark.apache.org/docs/latest/ml-pipeline.html

TensorFlow

DATA516/CSED516 - Fall 2017 43

Motivation
• ML is driving advances in many fields
• Why is ML so successful now?
– More sophisticated ML models
– Larger datasets available for training models
– Easy access to large amounts of compute resources

• TensorFlow was designed by Google with these goals:
– Experiment with new models
– Train them on large datasets
– Move them into production

DATA516/CSED516 - Fall 2017 44

Before TensorFlow… DistBelief

DATA516/CSED516 - Fall 2017 45

Figure 2 from: Large Scale Distributed Deep Networks. NIPS 2012. Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng

[Figure: parameter server and workers]

DistBelief: distributed system for training neural networks that Google has used since 2011

Parameter Server Limitations
• A job comprises:
– Stateless worker processes
• Perform the bulk of the computation when training a model
– A stateful parameter server
• Maintains the current version of the model parameters

• Deep network example:
– Layers with parameters (weights + biases)
– A loss function at the end to assess error

• Workers compute parameter deltas and send them to the server
• The server combines the updates with its state (see the sketch below)
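A toy NumPy sketch of the parameter-server update loop, purely illustrative (the gradient function is a stand-in, not DistBelief's actual computation):

import numpy as np

params = np.zeros(4)                     # server state: current model parameters

def worker_delta(params, batch):
    # stand-in for a real gradient computation on one stateless worker
    return -0.1 * (params - batch.mean(axis=0))

for step in range(100):
    batch = np.random.randn(32, 4)       # fake training data
    delta = worker_delta(params, batch)  # computed on a worker
    params += delta                      # server combines the update with its state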

Limitations that TensorFlow Seeks to Address

• Make it easier to define new layers
– In DistBelief, layers are C++ classes

• Facilitate refinements to the training algorithm
– Avoid having to modify the system implementation

• Enable definition of new training algorithms
– Not just stochastic gradient descent

DATA516/CSED516 - Fall 2017 47

TensorFlow Overview
• TensorFlow uses a unified dataflow graph to represent
– The computation in an algorithm
– And the state on which the algorithm operates

• Vertices represent either
– Computations, such as convolution or matrix multiplication
– Operations that own mutable state: variables

• Edges carry tensors (multi-dimensional arrays) between nodes
• TensorFlow transparently inserts the appropriate communication between distributed subcomputations
– Can run on a cluster, a workstation, or accelerators (GPUs)

• Programmed through a high-level scripting interface
– TensorFlow optimizes and then executes the entire program at once

DATA516/CSED516 - Fall 2017 48

Low-level operations!

TensorFlow Execution Model
• Graph with vertices and edges
• A vertex represents a unit of local computation
– Vertices are referred to as operations
– An operation takes m ≥ 0 tensors as input and produces n ≥ 0 tensors as output
– Each operation has a named "type" (such as Const, MatMul, or Assign) and may have zero or more compile-time attributes that determine its behavior

• An edge represents a vertex input or output
– Edges carry tensors: n-dimensional arrays of primitive types
– Tensors are dense at the lowest level
(a graph-construction sketch follows)

DATA516/CSED516 - Fall 2017 49
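A minimal graph-construction sketch in TF 1.x style (the session-based API contemporary with this lecture); a sketch, not the paper's code:

import tensorflow as tf

# Build the dataflow graph: Const and MatMul operations, edges carry tensors
a = tf.constant([[1.0, 2.0]])      # 1x2 tensor
b = tf.constant([[3.0], [4.0]])    # 2x1 tensor
c = tf.matmul(a, b)                # MatMul operation

# Nothing has executed yet; a Session runs the graph
with tf.Session() as sess:
    print(sess.run(c))             # [[11.]]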

Mutable State as Vertices
• Stateful operations: variables
– An operation can contain mutable state that is read and/or written each time it executes

• A Variable operation owns a mutable buffer
– Can store the shared parameters of a trained model
– A Variable has no inputs and produces a reference handle, which acts as a typed capability for reading and writing the buffer
– A Read operation takes a reference handle r as input and outputs the value of the variable (State[r]) as a dense tensor
– Other operations can take the capability as input and modify the underlying buffer (see the sketch below)

• Stateful operations: queues

DATA516/CSED516 - Fall 2017 50
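A hedged TF 1.x sketch of a Variable being read and mutated through its handle:

v = tf.Variable(0.0)           # owns a mutable buffer
inc = tf.assign_add(v, 1.0)    # an operation that modifies the buffer

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(inc)
    sess.run(inc)
    print(sess.run(v))         # 2.0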

Executing a Graph

DATA516/CSED516 - Fall 2017 51

• A single dataflow graph represents all computation and state in a machine learning algorithm
• Can execute parts of the graph in parallel or distributed
• The API enables selection of feed and fetch tensors for execution (see the sketch below)
– The runtime prunes the rest of the graph
• Supports multiple concurrent executions of the same graph
– These executions share state through stateful operations
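A hedged TF 1.x sketch of feeding and fetching: only the subgraph needed for the fetched tensor runs.

x = tf.placeholder(tf.float32)   # feed: value supplied at run time
y = x * 2.0
z = y + 1.0                      # unrelated ops elsewhere in the graph get pruned

with tf.Session() as sess:
    # fetch z, feeding x; the runtime executes only the ops z depends on
    print(sess.run(z, feed_dict={x: 3.0}))   # 7.0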

Distributed Execution
• Each operation belongs to a task and resides on a particular device (CPU or GPU)
• A device executes a kernel for each operation assigned to it
– Multiple kernels per operation, with specialized implementations for a particular device or data type

• The runtime places operations on devices, subject to implicit or explicit constraints in the graph (see the sketch below)
• It partitions the operations into per-device subgraphs
– The per-device subgraph for device d contains all of the operations that were assigned to d, with additional Send and Recv operations that replace edges across device boundaries
– Send/Recv have specialized implementations for several device-type pairs
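A hedged TF 1.x sketch of an explicit device constraint (the device string assumes a GPU is present):

with tf.device("/device:GPU:0"):   # explicit placement constraint
    w = tf.Variable(tf.zeros([2, 2]))

with tf.device("/cpu:0"):
    u = tf.matmul(w, w)            # the runtime inserts Send/Recv across the device boundary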

More Complex Graphs
• To implement RNNs and other advanced algorithms, add constructs for
– Conditionals (if statement)
– Iteration (while loop)

• These primitives serve to build higher-order constructs, such as map(), fold(), and scan()

• Implemented using Switch and Merge operators (see the sketch below)
– Switch takes data and control inputs; it produces a real value on one output and a dead value on the other
– Merge propagates only the non-dead input, unless both are dead

• The iteration implementation requires Enter, Exit, and NextIteration operators

DATA516/CSED516 - Fall 2017 53
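The user-facing counterparts of these primitives in TF 1.x are tf.cond and tf.while_loop; a hedged sketch:

i = tf.constant(0)

# while_loop: body runs until the condition is false; built on Switch/Merge/Enter/Exit
loop = tf.while_loop(lambda i: i < 10, lambda i: i + 1, [i])

# cond: only the taken branch executes
branch = tf.cond(tf.constant(True), lambda: tf.constant(1), lambda: tf.constant(2))

with tf.Session() as sess:
    print(sess.run([loop, branch]))   # [10, 1]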

TensorFlow Extensibility
• Differentiation and optimization
– Everything is a variable or a mathematical operation
– So it is easier to tune how learning is done
– No hard-coded learning algorithm

• Training very large models
• Fault tolerance
– User-level checkpointing through save/restore (see the sketch below)
• Synchronous and asynchronous replica coordination

DATA516/CSED516 - Fall 2017 54
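A hedged TF 1.x sketch of user-level checkpointing with tf.train.Saver (the checkpoint path is a placeholder):

w = tf.Variable(tf.zeros([2]))           # some model state to checkpoint
saver = tf.train.Saver()                 # saves/restores all variables by default

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, "/tmp/model.ckpt")      # user-level checkpoint

with tf.Session() as sess:
    saver.restore(sess, "/tmp/model.ckpt")   # recover after a failure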

TensorFlow Implementation

DATA516/CSED516 - Fall 2017 55