
Building Scalable Data Pipelines - 2016 DataPalooza Seattle



Page 1: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Building Scalable Data Pipelines

Evan Chan

Page 2: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Who am I

Distinguished Engineer, Tuplejump

@evanfchan

http://velvia.github.io

User and contributor to Spark since 0.9

Co-creator and maintainer of Spark Job Server

Page 3: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Tuplejump - Big Data Dev Partners

Page 4: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Instant Gratification

I want insights now

I want to act on news right away

I want stuff personalized for me (?)

Page 5: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Fast Data, not Big Data

Page 6: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

How Fast do you Need to Act?

Financial trading - milliseconds

Dashboards - seconds to minutes

BI / Reports - hours to days?

Page 7: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

What’s Your App?

Concurrent video viewers

Anomaly detection

Clickstream analysis

Live geospatial maps

Real-time trend detection & learning

Page 8: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Common Components

Diagram: Events → Message Queue → Stream Processing Layer → State / Database → Happy Users

Page 9: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Example: Real-time trend detection

Events: time, OS, location, asset/product ID

Analyze 1-5 second batches of new “hot” data in stream processor

Combine with recent and historical top K feature vectors in database

Update the recent feature vectors in the database

Serve to users

Page 10: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Example 2: Smart Cities

Page 11: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Smart City Streaming Data

City buses - regular telemetry (position, velocity, timestamp)

Street sweepers - regular telemetry

Transactions from rail, subway, buses, smart cards

311 info

911 info - new emergencies

Page 12: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Citizens want to know…

Where and for how long can I park my car?

Are transportation options affected by 311 and 911 events?

How long will it take the next bus to get here?

Where is the closest bus to where I am?

Page 13: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Cities want to know…

How can I maximize parking revenue?

More granular updates on which parking spots don't need sweeping

How does traffic affect waiting times in public transit, and revenue?

Patterns in subway train times - is a breakdown coming?

Population movement - where should new transit routes be placed?

Page 14: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Diagram: 311, 911, bus, and metro feeds → Message Queue → Stream Processing Layer → event storage, short-term telemetry, and models → ad-hoc queries and dashboards

Page 15: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

The HARD Principle

Highly Available, Resilient, Distributed

Flexibility - do as many transformations as possible with as few components as possible

Real-time: “NoETL”

Community: best of breed OSS projects with huge adoption and commercial support

Page 16: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Message Queue

Page 17: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Diagram: Events → Message Queue → Stream Processing Layer → State / Database → Happy Users

Page 18: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Why a message queue?

Centralized publish-subscribe of events

Need more processing? Add another consumer

Buffer traffic spikes

Replay events in cases of failure

Page 19: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Message Queues help distribute data

A-F

G-M

N-S

T-Z

Input 1

Input 2

Input3

Input4

Processing

Processing

Processing

Processing

Page 20: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Intro to Apache Kafka

Kafka is a distributed publish subscribe system

It uses a commit log to track changes

Kafka was originally created at LinkedIn

Open sourced in 2011

Graduated to a top-level Apache project in 2012
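As a concrete illustration, here is a minimal sketch of publishing an event to Kafka from Scala with the standard Java client; the broker address, topic name, key, and JSON payload are assumptions for this example.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // assumed broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// "bus-telemetry" is a hypothetical topic; the key routes all events for one bus to one partition
producer.send(new ProducerRecord[String, String](
  "bus-telemetry", "bus-42", """{"speed":31.5,"ts":1454284800}"""))
producer.close()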

Page 21: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

On being HARD

Many Big Data projects are open source implementations of closed source products

Unlike Hadoop, HBase or Cassandra, Kafka actually isn't a clone of an existing closed source product

The same codebase, used in production at LinkedIn for years, answers the questions:

Does it scale?

Is it robust?

Page 22: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Ad Hoc ETL

Page 23: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Decoupled ETL

Page 24: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Avro Schemas And Schema Registry

Keys and values in Kafka can be Strings or byte arrays

Avro is a serialization format used extensively with Kafka and Big Data

Kafka deployments use a Schema Registry to keep track of Avro schemas and to verify that producers and consumers are using the correct schemas
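To make this concrete, a minimal Avro sketch in Scala (the BusEvent schema and its field names are made up for illustration): define a schema, build a record, and serialize it to the byte array that becomes the Kafka message value.

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

// Hypothetical schema for a bus telemetry event
val schemaJson =
  """{"type": "record", "name": "BusEvent", "fields": [
    |  {"name": "busId", "type": "string"},
    |  {"name": "speed", "type": "double"},
    |  {"name": "ts",    "type": "long"}]}""".stripMargin
val schema = new Schema.Parser().parse(schemaJson)

val event: GenericRecord = new GenericData.Record(schema)
event.put("busId", "bus-42")
event.put("speed", 31.5)
event.put("ts", 1454284800L)

// Binary-encode the record; these bytes become the Kafka message value
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](schema).write(event, encoder)
encoder.flush()
val bytes = out.toByteArray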

Page 25: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Consumer Groups
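A minimal sketch of a Kafka consumer joining a consumer group (broker address, topic, and group id are assumptions): every consumer sharing the same group.id gets a disjoint subset of the topic's partitions, so adding consumers adds processing capacity.

import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "trend-detector")   // consumers sharing this id split the partitions
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Arrays.asList("bus-telemetry"))
while (true) {
  val records = consumer.poll(1000)        // fetch the next batch of events
  for (r <- records.asScala) println(s"${r.key} -> ${r.value}")
}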

Page 26: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Commit Logs

Page 27: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Kafka Resources

Official docs - https://kafka.apache.org/documentation.html

The Design section is a really good read

http://www.confluent.io/product

Includes schema registry

Page 28: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Stream Processing

Page 29: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Diagram: Events → Message Queue → Stream Processing Layer → State / Database → Happy Users

Page 30: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Types of Stream Processors

Event by Event: Apache Storm, Apache Flink, Intel GearPump, Akka

Micro-batch: Apache Spark

Hybrid? Google Dataflow

Page 31: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Apache Storm and Flink

Transform one message at a time

Very low latency

State and more complex analytics difficult

Page 32: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Akka and Gearpump

Actor to actor messaging. Local state.

Used for extreme low latency (ad networks, etc)

Dynamically reconfigurable topology

Configurable fault tolerance and failure recovery

Cluster or local mode - you don’t always need distribution!

Page 33: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Spark Streaming

Data processed as stream of micro batches

Higher latency (seconds), higher throughput, more complex analysis / ML possible

Same programming model as batch
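A minimal Spark 1.x sketch of the micro-batch model, counting words arriving on a socket in 5-second batches (the host, port, and batch interval are arbitrary choices for illustration). Note that the transformation chain is the same as the batch word count on the next slide; only the context and the source change.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount")
val ssc  = new StreamingContext(conf, Seconds(5))    // 5-second micro-batches

val lines = ssc.socketTextStream("localhost", 9999)  // one of the "basic" sources
lines.flatMap(_.split(" "))
     .map(word => (word, 1))
     .reduceByKey(_ + _)
     .print()

ssc.start()
ssc.awaitTermination()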

Page 34: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Why Spark?

val file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }

}

Page 35: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Spark Production Deployments

Page 36: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Explosion of Specialized Systems

Page 37: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Spark and Berkeley AMP Lab

Page 38: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Benefits of Unified Libraries

Optimizations can be shared between libraries

Diagram: Spark Core's Project Tungsten (GC and memory management) and MLlib's shared statistics libraries benefit Spark Streaming and the other modules

Page 39: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Mix and match modules

Easily go from DataFrames (SQL) to MLLib / statistics, for example:

scala> import org.apache.spark.mllib.stat.Statistics

scala> val numMentions = df.select("NumMentions").map(row => row.getInt(0).toDouble)
numMentions: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[100] at map at DataFrame.scala:848

scala> val numArticles = df.select("NumArticles").map(row => row.getInt(0).toDouble)
numArticles: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[104] at map at DataFrame.scala:848

scala> val correlation = Statistics.corr(numMentions, numArticles, "pearson")

Page 40: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Spark Worker Failure

Rebuild RDD partitions on a worker from lineage

Page 41: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Spark SQL & DataFrames

Page 42: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

DataFrames & Catalyst Optimizer

Page 43: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Catalyst Optimizations

Column and partition pruning (column filters)

Predicate pushdowns (row filters) - see the sketch below
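For example, using the df DataFrame from the "Mix and match modules" slide (an assumption carried over from that example), Catalyst reads only the NumMentions column and hands the row filter to the data source when it supports pushdown; explain(true) shows the plans it produced.

// Column pruning: only NumMentions is read.
// Predicate pushdown: the row filter goes to the data source where supported.
val hot = df.filter(df("NumMentions") > 100).select("NumMentions")
hot.explain(true)   // prints Catalyst's logical and physical plans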

Page 44: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Spark SQL Data Sources API

Enables custom data sources to participate in Spark SQL = DataFrames + Catalyst

Production implementations: spark-csv (Databricks), spark-avro (Databricks), spark-cassandra-connector (DataStax), elasticsearch-hadoop (Elastic.co)
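Reading through two of those implementations looks like this in a Spark 1.x shell (sqlContext is the shell's SQL context; the file path, keyspace, and table names are assumptions).

// spark-csv (Databricks)
val events = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs:///data/events.csv")

// spark-cassandra-connector (DataStax)
val buses = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "city", "table" -> "bus_telemetry"))
  .load()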

Page 45: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Spark Streaming

Page 46: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Streaming Sources

Basic: files, Akka actors, queues of RDDs, sockets

Advanced

Kafka

Kinesis

Flume

Twitter firehose

Page 47: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

DStreams = micro-batches

Page 48: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Streaming Fault Tolerance

Incoming data is replicated to 1 other node

Write Ahead Log for sources that support ACKs

Checkpointing for recovery if the Driver fails

Page 49: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Direct Kafka Streaming: KafkaRDD

No single Receiver; parallelizable

No Write Ahead Log - Kafka *is* the Write Ahead Log!

KafkaRDD stores Kafka offsets

KafkaRDD partitions recover from offsets (see the sketch below)
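A sketch of the direct stream in Spark 1.x, reusing the StreamingContext (ssc) from the earlier streaming sketch; the broker address and topic are assumptions. Each Kafka partition maps to one KafkaRDD partition, and the stored offsets are what recovery replays from.

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = Set("bus-telemetry")

// No receiver, no WAL: offsets are tracked in the KafkaRDD partitions themselves
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)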

Page 50: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Spark MLlib & GraphX

Page 51: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Spark MLlib Common Algos

Classifiers: DecisionTree, RandomForest

Clustering: K-Means, Streaming K-Means

Collaborative Filtering: Alternating Least Squares (ALS)
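As one example from the list, a K-Means sketch with MLlib (the feature file, the columns it contains, and the choice of k are assumptions).

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical comma-separated feature vectors, e.g. (lat, lon, speed) per bus reading
val points = sc.textFile("hdfs:///data/features.csv")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

val model = KMeans.train(points, 10, 20)   // k = 10 clusters, 20 iterations
val cluster = model.predict(Vectors.dense(47.6, -122.3, 25.0))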

Page 52: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Spark Text Processing Algos

TF/IDF

LDA

Word2Vec

*Pro-Tip: Use Stanford CoreNLP!
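A TF/IDF sketch with MLlib's HashingTF and IDF; whitespace tokenization stands in here for real NLP, which is where CoreNLP would come in. The input path is an assumption.

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// One Seq of tokens per document
val docs: RDD[Seq[String]] = sc.textFile("hdfs:///data/articles.txt").map(_.split(" ").toSeq)

val tf: RDD[Vector] = new HashingTF().transform(docs)
tf.cache()                                // IDF makes two passes over the data
val tfidf = new IDF().fit(tf).transform(tf)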

Page 53: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Spark ML Pipelines

Modeled after scikit-learn
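A small pipeline sketch in the scikit-learn style: transformers and an estimator chained into a single fit/transform object. The toy text data and column names are made up for illustration.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val training = sqlContext.createDataFrame(Seq(
  (0L, "bus delayed on 3rd ave", 1.0),
  (1L, "service running normally", 0.0)
)).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model    = pipeline.fit(training)      // fit the whole chain at once
val scored   = model.transform(training)   // adds prediction columns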

Page 54: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Spark GraphX

PageRank: top influencers

Connected Components: measure of clusters

Triangle Counting: measure of cluster density
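A GraphX sketch covering the three algorithms above; the follower edge-list path is an assumption.

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// Edge list of "srcId dstId" pairs; canonical orientation and partitioning
// are required by the triangle count implementation
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/followers.txt",
    canonicalOrientation = true)
  .partitionBy(PartitionStrategy.RandomVertexCut)

val ranks      = graph.pageRank(0.0001).vertices        // top influencers
val components = graph.connectedComponents().vertices   // cluster membership
val triangles  = graph.triangleCount().vertices         // cluster density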

Page 55: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Handling State

Page 56: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Diagram: Events → Message Queue → Stream Processing Layer → State / Database → Happy Users

Page 57: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

What Kind of State?

Non-persistent / in-memory: concurrent viewers

Short term: latest trends

Longer term: raw event & aggregate storage

ML Models, predictions, scored data

Page 58: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Spark RDDs

Immutable, cache in memory and/or on disk

Spark Streaming: UpdateStateByKey

IndexedRDD - can update bits of data

Snapshotting for recovery
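A sketch of updateStateByKey keeping a running count per key across micro-batches. Here events is an assumed DStream[String], ssc is the StreamingContext from the earlier sketch, and the checkpoint path is also an assumption; checkpointing is required for stateful operations.

// Fold each batch's new values into the running count for that key
def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
  Some(newValues.sum + state.getOrElse(0))

ssc.checkpoint("hdfs:///checkpoints/trends")   // needed so state survives failures
val counts = events.map(e => (e, 1)).updateStateByKey(updateCount _)
counts.print()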

Page 59: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Massively Scalable

High Performance

Always On

Masterless

Page 60: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Scale

Apache Cassandra

Scales linearly to as many nodes as you need

Scales whenever you need

Page 61: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Performance

Apache Cassandra

It's fast

Built to sustain massive data insertion rates with irregular spikes

Page 62: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Fault Tolerance & Availability

Apache Cassandra

Automatic replication

Multi-datacenter

Decentralized - no single point of failure

Survive regional outages

New nodes automatically add themselves to the cluster

DataStax drivers automatically discover new nodes

Page 63: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Architecture

Apache Cassandra

Distributed, masterless ring architecture

Network topology aware

Flexible, schemaless - your data structure can evolve seamlessly over time

Page 64: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

To download:

https://cassandra.apache.org/download/

https://github.com/pcmanus/ccm

ccm is highly recommended for local testing and cluster setup

Page 65: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Cassandra Data Modeling

Primary key = (partition keys, clustering keys)

Fast queries = fetch single partition

Range scans by clustering key

Must model for query patterns

Diagram: each partition (Partition 1, 2, 3) holds rows ordered by clustering columns (Clustering 1, 2, 3)

Page 66: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

City Bus Data Modeling Example

Primary key = (Bus UUID, timestamp)

Easy queries: location and speed of single bus for a range of time

Can also query most recent location + speed of all buses (slower)

Diagram: one row per bus (Bus A, B, C), with speed and GPS readings in clustering columns at 1000 s, 1010 s, 1020 s
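In CQL, a table like this might look as follows, executed here through the DataStax Java driver from Scala; the keyspace, table, and column names are assumptions, and the keyspace is assumed to already exist.

import java.util.{Date, UUID}
import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()

// Partition key: bus_id.  Clustering key: ts, newest first.
session.execute(
  """CREATE TABLE IF NOT EXISTS city.bus_telemetry (
    |  bus_id uuid,
    |  ts     timestamp,
    |  speed  double,
    |  lat    double,
    |  lon    double,
    |  PRIMARY KEY (bus_id, ts)
    |) WITH CLUSTERING ORDER BY (ts DESC)""".stripMargin)

// Fast single-partition query: one bus over a time range
val busId = UUID.randomUUID()                                   // hypothetical bus
val since = new Date(System.currentTimeMillis() - 3600 * 1000)  // last hour
session.execute(
  "SELECT ts, speed, lat, lon FROM city.bus_telemetry WHERE bus_id = ? AND ts > ?",
  busId, since)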

Page 67: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Using Cassandra for Short Term Storage

The idea is to store and read small values

Idempotent writes + huge write capacity = ideal for streaming ingestion

For example, store last few (latest + last N) snapshots of buses, taxi locations, recent traffic info
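With the spark-cassandra-connector, streaming ingestion into such a table is one call per DStream. Here busEvents is an assumed DStream of (UUID, Date, Double) tuples matching the columns below.

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

// Writes are idempotent upserts: replaying a batch rewrites the same (bus_id, ts) cells
busEvents.saveToCassandra("city", "bus_telemetry", SomeColumns("bus_id", "ts", "speed"))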

Page 68: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

But Mommy! What about longer term data?

Page 69: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

I need to read lots of data, fast!!

- Ad hoc analytics of events
- More specialized / geospatial analytics
- Building ML models from large quantities of data
- Storing scored/classified data from models
- OLAP / Data Warehousing

Page 70: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Can Cassandra Handle Batch?

Cassandra tables are much better at lots of small reads than big data scans

You CAN store data efficiently in C*

Files seem easier for long term storage and analysis

But are files compatible with streaming?

Page 71: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Lambda Architecture

Page 72: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Lambda is Hard and Expensive

Very high TCO - many moving parts (KV store, real-time, batch)

Lots of monitoring, operations, headache

Running similar code in two places

Lower performance - lots of shuffling data, network hops, translating domain objects

Reconcile queries against two different places

Page 73: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

NoLambda

A unified system

Real-time processing and reprocessing

No ETLs

Fault tolerance

Everything is a stream

Page 74: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Can Cassandra do batch and ad-hoc?

Yes, it can actually be competitive with Hadoop…

If you know how to be creative with storing your data!

Tuplejump/SnackFS - HDFS for Cassandra

github.com/tuplejump/FiloDB - analytics database

Store your data using Protobuf / Avro / etc.

Page 75: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Introduction to FiloDB

Efficient columnar storage - 5-10x better

Scan speeds competitive with Parquet - 100x faster than regular Cassandra tables

Very fine grained filtering for sub-second concurrent queries

Easy BI and ad-hoc analysis via Spark SQL/Dataframes (JDBC etc.)

Uses Cassandra for robust, proven storage
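FiloDB exposes datasets through the Spark Data Sources API; a sketch of writing and querying, following the project's README of that era (df is an assumed events DataFrame, the dataset name is an assumption, and the exact options may differ by FiloDB version).

import org.apache.spark.sql.SaveMode

// Append an events DataFrame to an existing FiloDB dataset
df.write.format("filodb.spark")
  .option("dataset", "bus_events")
  .mode(SaveMode.Append)
  .save()

// Read it back for ad-hoc SQL / BI
val events = sqlContext.read.format("filodb.spark")
  .option("dataset", "bus_events")
  .load()
events.registerTempTable("bus_events")
sqlContext.sql("SELECT bus_id, avg(speed) FROM bus_events GROUP BY bus_id").show()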

Page 76: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Combining FiloDB + Cassandra

Regular Cassandra tables for highly concurrent, aggregate / key-value lookups (dashboards)

FiloDB + C* + Spark for efficient long term event storage

Ad hoc / SQL / BI

Data source for MLLib / building models

Data storage for classified / predicted / scored data

Page 77: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Diagram: Events → Message Queue → Spark Streaming → Cassandra (short-term storage, K-V) and FiloDB (events, ad-hoc, batch) → Spark (ad-hoc, SQL, ML) → dashboards and maps

Page 78: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Diagram: Events → Message Queue → Spark Streaming (models) → Cassandra + FiloDB (long-term event storage) → Spark → learned data

Page 79: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

FiloDB + Cassandra

Robust, peer-to-peer, proven storage platform

Use for short term snapshots, dashboards

Use for efficient long term event storage & ad hoc querying

Use as a source to build detailed models

Page 80: Building Scalable Data Pipelines - 2016 DataPalooza Seattle

Thank you!

@evanfchan

http://tuplejump.com