Building Scalable Data Pipelines

Evan Chan

Who am I

Distinguished Engineer, Tuplejump

@evanfchan

http://velvia.github.io

User and contributor to Spark since 0.9

Co-creator and maintainer of Spark Job Server

TupleJump - Big Data Dev Partners

Instant Gratification

I want insights now

I want to act on news right away

I want stuff personalized for me (?)

Fast Data, not Big Data

How Fast do you Need to Act?

Financial trading - milliseconds

Dashboards - seconds to minutes

BI / Reports - hours to days?

What’s Your App?

Concurrent video viewers

Anomaly detection

Clickstream analysis

Live geospatial maps

Real-time trend detection & learning

Common Components

Events -> Message Queue -> Stream Processing Layer -> State / Database -> Happy Users

Example: Real-time trend detection

Events: time, OS, location, asset/product ID

Analyze 1-5 second batches of new “hot” data in stream processor

Combine with recent and historical top K feature vectors in database

Update database recent feature vectors

Serve to users
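The steps above can be sketched in plain Scala, without Spark or a real database. This is only an illustration under stated assumptions: the `Event` shape, the object name `TrendSketch`, and its helper names are hypothetical, and the "feature vectors" from the slide are simplified to per-asset counts.

```scala
object TrendSketch {
  // Hypothetical event shape, following the fields named on the slide.
  final case class Event(time: Long, os: String, location: String, assetId: String)

  // Count events per asset within one small batch of "hot" data.
  def countBatch(batch: Seq[Event]): Map[String, Long] =
    batch.groupBy(_.assetId).map { case (id, evs) => id -> evs.size.toLong }

  // Combine the new batch counts with the recent counts held in the database
  // (the database is stood in for by an immutable Map, an assumption).
  def mergeCounts(recent: Map[String, Long], fresh: Map[String, Long]): Map[String, Long] =
    fresh.foldLeft(recent) { case (acc, (id, n)) => acc.updated(id, acc.getOrElse(id, 0L) + n) }

  // Top K assets by count; ties are broken by asset ID so the result is deterministic.
  def topK(counts: Map[String, Long], k: Int): Seq[(String, Long)] =
    counts.toSeq.sortBy { case (id, n) => (-n, id) }.take(k)
}
```

In a real pipeline `countBatch` would run inside the stream processor on each 1-5 second batch, and `mergeCounts` would be a read-modify-write against the database.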

Example 2: Smart Cities

Smart City Streaming Data

City buses - regular telemetry (position, velocity, timestamp)

Street sweepers - regular telemetry

Transactions from rail, subway, buses, smart cards

311 info

911 info - new emergencies

Citizens want to know…

Where and for how long can I park my car?

Are transportation options affected by 311 and 911 events?

How long will it take the next bus to get here?

Where is the closest bus to where I am?

Cities want to know…

How can I maximize parking revenue?

More granular updates to parking spots that don't need sweeping

How does traffic affect waiting times in public transit, and revenue?

Patterns in subway train times - is a breakdown coming?

Population movement - where should new transit routes be placed?

[Architecture diagram: Metro; Message Queue; Stream Processing; Short term telemetry; Event storage; Ad-Hoc; Models; Dashboard]

The HARD Principle

Highly Available, Resilient, Distributed

Flexibility - do as many transformations as possible with as few components as possible

Real-time: “NoETL”

Community: best of breed OSS projects with huge adoption and commercial support

Why a message queue?

Centralized publish-subscribe of events

Need more processing? Add another consumer

Buffer traffic spikes

Replay events in cases of failure

Message Queues help distribute data

Input 1

Input 2

Input 3

Input 4

Processing

Intro to Apache Kafka

Kafka is a distributed publish-subscribe system

It uses a commit log to track changes

Kafka was originally created at LinkedIn

Open sourced in 2011

Graduated to a top-level Apache project in 2012

On being HARD

Many Big Data projects are open source implementations of closed source products

Unlike Hadoop, HBase or Cassandra, Kafka actually isn't a clone of an existing closed source product

The same codebase being used for years at LinkedIn answers the questions:

Does it scale?

Is it robust?

Ad Hoc ETL

Decoupled ETL

Avro Schemas And Schema Registry

Keys and values in Kafka can be Strings or byte arrays

Avro is a serialization format used extensively with Kafka and Big Data

A Schema Registry (from Confluent, not part of Apache Kafka itself) keeps track of Avro schemas

It verifies that the correct schemas are being used

Consumer Groups

Commit Logs
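A toy, in-memory sketch of the commit log and consumer group ideas. It is not Kafka's API: `CommitLog`, `ConsumerGroup`, `poll`, and `seek` are hypothetical names chosen for illustration, and partitions, replication, and group membership are left out.

```scala
object CommitLogSketch {
  // An append-only log: records are never modified, only appended.
  final class CommitLog[A] {
    private val records = scala.collection.mutable.ArrayBuffer.empty[A]

    // Append a record and return its offset.
    def append(record: A): Long = { records += record; records.size - 1L }

    // Read every record from the given offset to the end of the log.
    def readFrom(offset: Long): Seq[A] = records.drop(offset.toInt).toList
  }

  // Each consumer group tracks its own committed offset into the same log.
  final class ConsumerGroup[A](log: CommitLog[A]) {
    private var committed = 0L

    // Fetch unread records, then commit the new position.
    def poll(): Seq[A] = {
      val batch = log.readFrom(committed)
      committed += batch.size
      batch
    }

    // Rewind to replay events, for example after a downstream failure.
    def seek(offset: Long): Unit = { committed = offset }
  }
}
```

Because the log itself is never changed by reading, adding another consumer group is cheap, and replay amounts to moving an offset backwards.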

Kafka Resources

Official docs - https://kafka.apache.org/documentation.html

The Design section is a really good read

http://www.confluent.io/product

Includes schema registry

Stream Processing

Types of Stream Processors

Event by Event: Apache Storm, Apache Flink, Intel Gearpump, Akka

Micro-batch: Apache Spark

Hybrid? Google Dataflow

Apache Storm and Flink

Transform one message at a time

Very low latency

State and more complex analytics difficult

Akka and Gearpump

Actor to actor messaging. Local state.

Used for extreme low latency (ad networks, etc)

Dynamically reconfigurable topology

Configurable fault tolerance and failure recovery

Cluster or local mode - you don’t always need distribution!

Spark Streaming

Data processed as stream of micro batches

Higher latency (seconds), higher throughput, more complex analysis / ML possible

Same programming model as batch
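A minimal sketch of the micro-batch idea in plain Scala, with no Spark dependency. `Event`, `microBatches`, and `wordCount` are hypothetical names. The point is that the same function can be applied to a whole dataset (batch) or to each small time slice (streaming).

```scala
object MicroBatchSketch {
  // Hypothetical event: a timestamp in milliseconds plus a payload.
  final case class Event(timeMs: Long, word: String)

  // The same function serves batch and streaming: it only sees a collection.
  def wordCount(events: Seq[Event]): Map[String, Int] =
    events.groupBy(_.word).map { case (w, es) => w -> es.size }

  // Split a stream into micro-batches by batch interval, ordered by time.
  // Intervals that contain no events are simply omitted in this sketch.
  def microBatches(stream: Seq[Event], intervalMs: Long): Seq[Seq[Event]] =
    stream.groupBy(_.timeMs / intervalMs).toSeq.sortBy(_._1).map(_._2)
}
```

For example, `microBatches(stream, 1000L).map(wordCount)` yields one result per second of data, while `wordCount(stream)` is the batch answer over everything.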

Why Spark?

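// spark is assumed to be a SparkContext; the HDFS path is elided on the slide
val file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" "))   // split each line into words
    .map(word => (word, 1))             // pair each word with a count of 1
    .reduceByKey(_ + _)                 // sum the counts per word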

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }

}

Spark Production Deployments

Explosion of Specialized Systems

Spark and Berkeley AMP Lab

Benefits of Unified Libraries

Optimizations can be shared between libraries

Core: Project Tungsten

MLlib: Shared statistics libraries

Spark Streaming: GC and memory management

Mix and match modules

Easily go from DataFrames (SQL) to MLlib / statistics, for example:

scala> import org.apache.spark.mllib.stat.Statistics

scala> val numMentions = df.select("NumMentions").map(row => row.getInt(0).toDouble)
numMentions: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[100] at map at DataFrame.scala:848

scala> val numArticles = df.select("NumArticles").map(row => row.getInt(0).toDouble)
numArticles: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[104] at map at DataFrame.scala:848

scala> val correlation = Statistics.corr(numMentions, numArticles, "pearson")
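For readers without a Spark shell at hand, the quantity requested by the "pearson" option can be sketched on plain Scala collections. This is a textbook formula written for illustration, not MLlib's implementation; `PearsonSketch` and `pearson` are hypothetical names.

```scala
object PearsonSketch {
  // Pearson correlation coefficient of two equally sized, non-constant samples.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.nonEmpty, "samples must be non-empty and the same length")
    val mx = xs.sum / xs.size
    val my = ys.sum / ys.size
    // Sum of products of deviations from the means.
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    // Square roots of the summed squared deviations.
    val sx = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
    val sy = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
    cov / (sx * sy)
  }
}
```

A result near 1 means the two columns rise together, near -1 means one falls as the other rises, and near 0 means no linear relationship.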

Spark Worker Failure

Rebuild RDD Partitions on Worker from Lineage
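The lineage idea can be sketched in a few lines of plain Scala. This is a toy model, not Spark's RDD class: `Dataset`, `Source`, and `Mapped` are hypothetical names, and only `map` is modeled.

```scala
object LineageSketch {
  // A dataset is described by how to compute each partition, not by its data.
  sealed trait Dataset[A] {
    def numPartitions: Int
    def compute(partition: Int): Seq[A]
    def map[B](f: A => B): Dataset[B] = Mapped(this, f)
  }

  // Source data, already split into partitions (stands in for durable storage).
  final case class Source[A](partitions: Vector[Seq[A]]) extends Dataset[A] {
    def numPartitions: Int = partitions.size
    def compute(partition: Int): Seq[A] = partitions(partition)
  }

  // A derived dataset remembers its parent and the function applied: its lineage.
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Seq[B] = parent.compute(partition).map(f)
  }
}
```

If the worker holding partition 1 is lost, calling `compute(1)` on the derived dataset replays the chain of functions from the source for that partition only; the other partitions are untouched.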

Spark SQL & DataFrames

DataFrames & Catalyst Optimizer

Catalyst Optimizations

Column and partition pruning (Column filters)

Predicate pushdowns (Row filters)

Spark SQL Data Sources API

Enables custom data sources to participate in SparkSQL = DataFrames + Catalyst

Production Impls

spark-csv (Databricks)

spark-avro (Databricks)

spark-cassandra-connector (DataStax)

elasticsearch-hadoop (Elastic.co)

Spark Streaming

Streaming Sources

Basic: Files, Akka actors, queues of RDDs, Socket

Advanced: Kinesis, Twitter firehose

DStreams = micro-batches

Streaming Fault Tolerance

Incoming data is replicated to 1 other node

Write Ahead Log for sources that support ACKs

Checkpointing for recovery if Driver fails

Direct Kafka Streaming: KafkaRDD

No single Receiver

Parallelizable

No Write Ahead Log

Kafka *is* the Write Ahead Log!

KafkaRDD stores Kafka offsets

KafkaRDD partitions recover from offsets

Spark MLlib & GraphX

Spark MLlib Common Algos

Classifiers: DecisionTree, RandomForest

Clustering: K-Means, Streaming K-Means

Collaborative Filtering: Alternating Least Squares (ALS)

Spark Text Processing Algos

TF/IDF

Word2Vec

*Pro-Tip: Use Stanford CoreNLP!
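As a reminder of what TF/IDF computes, here is a small sketch on plain Scala collections. It uses one common smoothed IDF variant, log((N + 1) / (df + 1)), chosen here as an assumption; `TfIdfSketch` and its method names are hypothetical, and documents are assumed to be already tokenized.

```scala
object TfIdfSketch {
  // Term frequency: raw count of each term within one document.
  def tf(doc: Seq[String]): Map[String, Int] =
    doc.groupBy(identity).map { case (t, ts) => t -> ts.size }

  // Smoothed inverse document frequency: log((N + 1) / (df + 1)),
  // where N is the number of documents and df the number containing the term.
  def idf(docs: Seq[Seq[String]]): Map[String, Double] = {
    val n = docs.size
    val df = docs.flatMap(_.distinct).groupBy(identity).map { case (t, ts) => t -> ts.size }
    df.map { case (t, d) => t -> math.log((n + 1.0) / (d + 1.0)) }
  }

  // Weight each term count by its IDF; terms unseen in the corpus get weight 0.
  def tfIdf(doc: Seq[String], idfs: Map[String, Double]): Map[String, Double] =
    tf(doc).map { case (t, c) => t -> c * idfs.getOrElse(t, 0.0) }
}
```

Terms that appear in every document get an IDF of zero under this variant, so only distinctive terms contribute to a document's vector.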

Spark ML Pipelines

Modeled after scikit-learn

Spark GraphX

PageRank: Top Influencers

Connected Components: Measure of clusters

Triangle Counting: Measure of cluster density

Handling State

What Kind of State?

Non-persistent / in-memory: concurrent viewers

Short term: latest trends

Longer term: raw event & aggregate storage

ML Models, predictions, scored data

Spark RDDs

Immutable, cache in memory and/or on disk

Spark Streaming: updateStateByKey

IndexedRDD - can update bits of data

Snapshotting for recovery
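The semantics of keyed streaming state can be sketched without Spark. The update function below has the same shape that Spark Streaming's `updateStateByKey` takes, `(Seq[V], Option[S]) => Option[S]`; everything else (`StateSketch`, `step`, `updateCount`) is a hypothetical stand-in that applies one micro-batch to an in-memory `Map`.

```scala
object StateSketch {
  // New values for a key in this batch, plus the previous state (if any),
  // produce the next state. Here the state is a running count.
  def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(state.getOrElse(0) + newValues.sum)

  // Apply one micro-batch of (key, value) pairs to the running state.
  // Keys absent from the batch are still visited with an empty Seq;
  // returning None from the update function drops the key.
  def step[K, V, S](state: Map[K, S], batch: Seq[(K, V)])(
      update: (Seq[V], Option[S]) => Option[S]): Map[K, S] = {
    val grouped = batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    (state.keySet ++ grouped.keySet).toSeq.flatMap { k =>
      update(grouped.getOrElse(k, Seq.empty), state.get(k)).map(s => k -> s)
    }.toMap
  }
}
```

Each call to `step` returns a new immutable map, which mirrors how every batch produces a new state RDD rather than mutating the old one.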

• Massively Scalable

• High Performance

• Always On

• Masterless

Apache Cassandra

• Scales linearly to as many nodes as you need

• Scales whenever you need

Performance

Apache Cassandra

• It’s Fast

• Built to sustain massive data insertion rates in irregular pattern spikes

Fault Tolerance & Availability

Apache Cassandra

• Automatic Replication

• Multi Datacenter

• Decentralized - no single point of failure

• Survive regional outages

• New nodes automatically add themselves to the cluster

• DataStax drivers automatically discover new nodes

Architecture

Apache Cassandra

• Distributed, Masterless Ring Architecture

• Network Topology Aware

• Flexible, Schemaless - your data structure can evolve seamlessly over time

To download:

https://cassandra.apache.org/download/

https://github.com/pcmanus/ccm

^ Highly recommended for local testing/cluster setup

Cassandra Data Modeling

Primary key = (partition keys, clustering keys)

Fast queries = fetch single partition

Range scans by clustering key

Must model for query patterns

[Diagram: a grid with one row per partition (Partition 1, Partition 2, Partition 3) and one column per clustering key (Clustering 1, Clustering 2, Clustering 3)]

City Bus Data Modeling Example

Primary key = (Bus UUID, timestamp)

Easy queries: location and speed of single bus for a range of time

Can also query most recent location + speed of all buses (slower)

[Diagram: one partition per bus (Bus A, Bus B, Bus C); columns ordered by timestamp (1020 s, 1010 s, 1000 s), each cell holding speed, GPS]
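The bus model above can be mimicked with nested maps to show why some queries are cheap and others are not. This is an in-memory analogy, not Cassandra driver code: `BusModelSketch`, `Telemetry`, and the helper names are hypothetical, and a plain `String` stands in for the bus UUID.

```scala
import scala.collection.immutable.SortedMap

object BusModelSketch {
  // Hypothetical telemetry row; field names are illustrative.
  final case class Telemetry(speed: Double, lat: Double, lon: Double)

  // Partition key (bus ID) -> rows sorted by clustering key (timestamp in seconds).
  type Table = Map[String, SortedMap[Long, Telemetry]]

  // Writing the same (bus, timestamp) twice overwrites the row: an idempotent upsert.
  def write(table: Table, bus: String, ts: Long, t: Telemetry): Table =
    table.updated(bus, table.getOrElse(bus, SortedMap.empty[Long, Telemetry]).updated(ts, t))

  // Fast query: one partition, range scan over the clustering key (inclusive bounds).
  def range(table: Table, bus: String, from: Long, to: Long): Seq[(Long, Telemetry)] =
    table.get(bus).toSeq.flatMap(_.range(from, to + 1).toSeq)

  // Slower query: touches every partition to find each bus's latest row.
  def latest(table: Table): Map[String, Telemetry] =
    table.collect { case (bus, rows) if rows.nonEmpty => bus -> rows.last._2 }
}
```

`range` only ever looks inside one bus's sorted rows, while `latest` has to visit every bus, which is the same trade-off the slide describes.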

Using Cassandra for Short Term Storage

The idea is to store and read small values

Idempotent writes + huge write capacity = ideal for streaming ingestion

For example, store last few (latest + last N) snapshots of buses, taxi locations, recent traffic info

But Mommy! What about longer term data?

I need to read lots of data, fast!!

- Ad hoc analytics of events

- More specialized / geospatial

- Building ML models from large quantities of data

- Storing scored/classified data from models

- OLAP / Data Warehousing

Can Cassandra Handle Batch?

Cassandra tables are much better at lots of small reads than big data scans

You CAN store data efficiently in C*

Files seem easier for long term storage and analysis

But are files compatible with streaming?

Lambda Architecture

Lambda is Hard and Expensive

Very high TCO - Many moving parts - KV store, real time, batch

Lots of monitoring, operations, headache

Running similar code in two places

Lower performance - lots of shuffling data, network hops, translating domain objects

Reconcile queries against two different places

NoLambda

A unified system

Real-time processing and reprocessing

No ETLs

Fault tolerance

Everything is a stream

Can Cassandra do batch and ad-hoc?

Yes, it can be competitive with Hadoop actually…

If you know how to be creative with storing your data!

Tuplejump/SnackFS - HDFS for Cassandra

github.com/tuplejump/FiloDB - analytics database

Store your data using Protobuf / Avro / etc.

Introduction to FiloDB

Efficient columnar storage - 5-10x better

Scan speeds competitive with Parquet - 100x faster than regular Cassandra tables

Very fine grained filtering for sub-second concurrent queries

Easy BI and ad-hoc analysis via Spark SQL/DataFrames (JDBC etc.)

Uses Cassandra for robust, proven storage
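Why columnar storage helps analytical scans can be shown with a toy layout comparison. This is not FiloDB's storage format; `ColumnarSketch`, `Row`, and `Columns` are hypothetical, and the field names are borrowed loosely from the earlier `NumMentions` example.

```scala
object ColumnarSketch {
  // Row-oriented: each event stored as one record.
  final case class Row(time: Long, os: String, numMentions: Int)

  // Column-oriented: one contiguous array per field.
  final case class Columns(time: Array[Long], os: Array[String], numMentions: Array[Int])

  // Pivot a sequence of rows into per-field arrays.
  def toColumns(rows: Seq[Row]): Columns =
    Columns(rows.map(_.time).toArray, rows.map(_.os).toArray, rows.map(_.numMentions).toArray)

  // An aggregate over one field reads a single array and skips the other columns.
  def totalMentions(c: Columns): Long = c.numMentions.foldLeft(0L)(_ + _)
}
```

A query that needs only `numMentions` never touches the `time` or `os` arrays, and arrays of a single type also tend to compress better than mixed records.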

Combining FiloDB + Cassandra

Regular Cassandra tables for highly concurrent, aggregate / key-value lookups (dashboards)

FiloDB + C* + Spark for efficient long term event storage

Ad hoc / SQL / BI

Data source for MLlib / building models

Data storage for classified / predicted / scored data

[Architecture diagram: Message Queue; Events; Spark Streaming; Short term storage, K-V; Adhoc, SQL, ML; Cassandra; FiloDB: Events, ad-hoc, batch; Dashboards, maps]

[Architecture diagram: Message Queue; Events; Spark Streaming; Models; Cassandra; FiloDB: Long term event storage; Spark; Learned Data]

FiloDB + Cassandra

Robust, peer to peer, proven storage platform

Use for short term snapshots, dashboards

Use for efficient long term event storage & ad hoc querying

Use as a source to build detailed models

Thank you!

@evanfchan

http://tuplejump.com

Building Scalable Data Pipelines - 2016 DataPalooza Seattle
