Databricks’ Data Pipelines: Journey and Lessons Learned Yu Peng, Burak Yavuz 07/06/2016


Page 1: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Databricks’ Data Pipelines: Journey and Lessons Learned

Yu Peng, Burak Yavuz, 07/06/2016

Page 2: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Who Are We

Yu Peng

Data Engineer at Databricks

Building Databricks’ next-generation data pipeline on top of Apache Spark

BS from Xiamen University; Ph.D. from The University of Hong Kong

Burak Yavuz

Software Engineer at Databricks

Contributor to Spark since Spark 1.1; maintainer of Spark Packages

BS in Mechanical Engineering, Bogazici University; MS in Management Science & Engineering, Stanford University

Page 3: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Building a data pipeline is hard

• At least once or exactly once semantics
• Fault tolerance
• Resource management
• Scalability
• Maintainability

Page 4: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Apache® Spark™ + Databricks = Our Solution

• All ETL jobs are built on top of Apache Spark
  • Unified solution, everything in the same place

• All ETL jobs run on the Databricks platform
  • Platform for Data Engineers and Scientists

• Test out new Spark and Databricks features

Apache, Apache Spark and Spark are trademarks of the Apache Software Foundation

Page 5: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Classic Lambda Data Pipeline

[Diagram: services (service 0 … service x), each with a log collector, ship their logs into a centralized messaging system; a Delta ETL and a Batch ETL then load the data into a storage system.]

Page 6: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Databricks Data Pipeline Overview

[Diagram: customer deployments (CustomerDep 0, 1, 2, …) and the Databricks deployment (DatabricksDep) all feed their logs into Amazon Kinesis.]

Page 7: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Databricks Data Pipeline Overview

[Diagram: each deployment runs multiple clusters (Cluster 0, 1, 2, …); every cluster hosts services (service 0 … service y) plus a log-daemon that ships their logs to Amazon Kinesis.]

Page 8: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Databricks Data Pipeline Overview

[Diagram: same pipeline overview as the previous slide (incremental build).]

Page 9: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Databricks Data Pipeline Overview

[Diagram: as before, with the Databricks deployment expanded to show the Databricks Filesystem and Databricks Jobs alongside the customer deployments feeding Amazon Kinesis.]

Page 10: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Databricks Data Pipeline Overview

[Diagram: as before, now also showing real-time analysis running directly against the Amazon Kinesis stream.]

Page 11: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Databricks Data Pipeline Overview

[Diagram: as before, now showing the Sync daemon reading from Amazon Kinesis and writing raw record batches (JSON) into DBFS.]

Page 12: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Databricks Data Pipeline Overview

[Diagram: as before, now showing ETL jobs (run as Databricks Jobs) turning the raw record batches (JSON) in DBFS into tables (Parquet).]

Page 13: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Databricks Data Pipeline Overview

[Diagram: the complete pipeline. Log-daemons on customer and Databricks clusters ship logs to Amazon Kinesis; the Sync daemon lands raw record batches (JSON) in DBFS; ETL jobs produce tables (Parquet) for data analysis, while real-time analysis runs directly on the Kinesis stream.]

Page 14: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Log collection (Log-daemon)

• Fault tolerance and at least once semantics (see the sketch below)
  • Streaming
  • Batch
• Spark History Server
• Multi-tenant and config driven
• Spark container
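At least once delivery can be achieved by checkpointing the read offset only after a record has been handed off, so a crash replays rather than drops in-flight records. Below is a minimal, illustrative Scala sketch of that idea; the TailReader name, file paths, and the send callback are assumptions, not log-daemon internals.

```scala
import java.io.RandomAccessFile
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Illustrative at-least-once tail reader: the offset is checkpointed in a
// state file only after a line has been handed off, so a restart re-reads
// (rather than loses) anything that was in flight.
class TailReader(logPath: String, statePath: String, send: String => Unit) {

  private def loadOffset(): Long = {
    val p = Paths.get(statePath)
    if (Files.exists(p))
      new String(Files.readAllBytes(p), StandardCharsets.UTF_8).trim.toLong
    else 0L
  }

  private def saveOffset(offset: Long): Unit =
    Files.write(Paths.get(statePath), offset.toString.getBytes(StandardCharsets.UTF_8))

  // Called periodically; picks up where the last run (or crash) left off.
  def poll(): Unit = {
    val file = new RandomAccessFile(logPath, "r")
    try {
      file.seek(loadOffset())
      var line = file.readLine()
      while (line != null) {
        send(line)                      // ship to the messaging system (e.g. Kinesis)
        saveOffset(file.getFilePointer) // checkpoint only after the hand-off
        line = file.readLine()
      }
    } finally file.close()
  }
}
```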

Page 15: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Log Daemon

[Diagram: Log Daemon architecture. Each service (Service 1 … Service x) writes to an active.log that log rotation turns into hourly files (2015-11-30-20.log, 2015-11-30-19.log, …). Inside the Log Daemon, one logStream per service (logStream1 … logStreamX) pairs a reader, which tails the rotated logs and tracks its progress in state files, with a producer; a shared Message Producer ships the records to Kinesis topics (topic-1, topic-2).]

Page 16: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Sync Daemon

• Read from Kinesis and write to DBFS
  • Buffer and write in batches (128 MB or 5 mins); see the sketch below
  • Partitioned by date
• A long-running Apache Spark job
  • Easy to scale up and down
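A minimal sketch of the buffering policy, assuming records arrive as JSON strings and are flushed into a date partition under a DBFS path; the RecordBuffer class and the directory layout are illustrative, not the sync daemon's actual code.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}
import java.time.LocalDate
import scala.collection.mutable.ArrayBuffer

// Illustrative buffer: accumulate raw JSON records and flush them to a
// date-partitioned directory once 128 MB or 5 minutes is reached.
class RecordBuffer(dbfsRoot: String,
                   maxBytes: Long = 128L * 1024 * 1024,
                   maxAgeMs: Long = 5 * 60 * 1000) {
  private val records = ArrayBuffer.empty[String]
  private var bytes = 0L
  private var firstRecordAt = 0L

  def add(record: String): Unit = {
    if (records.isEmpty) firstRecordAt = System.currentTimeMillis()
    records += record
    bytes += record.getBytes(StandardCharsets.UTF_8).length
    val tooOld = System.currentTimeMillis() - firstRecordAt >= maxAgeMs
    if (bytes >= maxBytes || tooOld) flush()
  }

  def flush(): Unit = if (records.nonEmpty) {
    // Partitioning by date keeps downstream ETL jobs able to prune partitions.
    val dir = Paths.get(s"$dbfsRoot/date=${LocalDate.now()}")
    Files.createDirectories(dir)
    val file = dir.resolve(s"batch-${System.currentTimeMillis()}.json")
    Files.write(file, records.mkString("\n").getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.CREATE, StandardOpenOption.WRITE)
    records.clear()
    bytes = 0L
  }
}
```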

Page 17: A Journey into Databricks' Pipelines: Journey and Lessons Learned

ETL Jobs

[Diagram: inside the Databricks deployment, raw records in the Databricks Filesystem feed two scheduled Databricks Jobs. The Delta job (every 10 mins) reads the new files for the current day and appends with no dedup; the Batch job (daily) reads all files for the previous day, dedups, and overwrites. Both write ETL tables (Parquet) back to the Databricks Filesystem.]

Page 18: A Journey into Databricks' Pipelines: Journey and Lessons Learned

ETL Jobs

• Use the same code for Delta and Batch jobs

• Run as scheduled Databricks jobs

• Use spot instances and fallback to on-demand

• Deliver to Databricks as Parquet tables
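The delta/batch code sharing described above could look roughly like the sketch below; transformLogs, the column names, and the paths are illustrative assumptions, not the pipeline's real code. The only differences between the two jobs are the input scope, whether to dedup, and the save mode.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object EtlJob {
  // Shared transformation used by both the delta and the batch job
  // (illustrative: project the raw JSON batches into the table schema).
  def transformLogs(raw: DataFrame): DataFrame =
    raw.selectExpr("timestamp", "level", "service", "message", "date")

  def run(spark: SparkSession, rawPath: String, tablePath: String, isBatch: Boolean): Unit = {
    val transformed = transformLogs(spark.read.json(rawPath))

    if (isBatch) {
      // Daily batch job: read all of the previous day's files, dedup, overwrite.
      transformed.dropDuplicates()
        .write.mode(SaveMode.Overwrite).partitionBy("date").parquet(tablePath)
    } else {
      // 10-minute delta job: append the new files' records, no dedup.
      transformed.write.mode(SaveMode.Append).partitionBy("date").parquet(tablePath)
    }
  }
}
```

With shared code like this, the same jar can simply be scheduled twice as Databricks jobs: every 10 minutes with isBatch = false and daily with isBatch = true.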

Page 19: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Lessons Learned: Partition pruning can save a lot of time and money

• Reduced query time from 2800 seconds to just 15 seconds.
• Don’t partition on too many levels, as it leads to worse metadata discovery performance and higher cost.
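As an illustration (assuming a DataFrame named events with a date column and a SparkSession named spark; the path is made up): writing the table partitioned by date lets a date filter touch only the matching directories instead of the whole table.

```scala
// Each day lands in its own date=... directory.
events.write.partitionBy("date").parquet("dbfs:/tables/usage")

// The filter is applied as partition pruning: only the date=2016-05-26
// directory is listed and scanned, not every file in the table.
val oneDay = spark.read.parquet("dbfs:/tables/usage").where("date = '2016-05-26'")
```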

Page 20: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Lessons Learned: High S3 costs from lots of LIST requests

Metadata discovery on S3 is expensive. Spark SQL tries to refresh its metadata cache even after write operations.
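One way to keep LIST traffic down, sketched here with an assumed path: read the specific partition directory you need, so listing covers only that directory rather than the whole table root.

```scala
// Only the single partition directory is listed, not the entire table.
// (Note: the date column is no longer inferred when reading a leaf
// directory directly, unless a basePath option is supplied.)
val day = spark.read.parquet("s3a://logs-bucket/tables/usage/date=2016-05-26")
```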

Page 21: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Running It All in Databricks - Jobs

Page 22: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Running It All in Databricks - Spark

Page 23: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Data Analysis & Tools

We get the data in. What’s next?

● Monitoring
● Debugging
● Usage Analysis
● Product Design (A/B testing)

Page 24: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Debugging

Access to logs in a matter of seconds, thanks to Apache Spark.
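For example, pulling one service's recent errors out of the ETL'd log tables is a single interactive query; the table and column names below (and the spark session) are illustrative.

```scala
import org.apache.spark.sql.functions.col

spark.table("service_logs")
  .where(col("date") === "2016-05-26" && col("level") === "ERROR" && col("service") === "webapp")
  .orderBy(col("timestamp").desc)
  .show(20, truncate = false)
```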

Page 25: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Monitoring

Monitor logs by log level. A bug introduced at 2016-05-26 01:00:00 UTC was caught and its fix deployed within 2 hours.
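A sketch of the aggregation behind that kind of dashboard (table name, column names, and the spark session are assumptions): count log lines per hour per level and watch for an ERROR spike.

```scala
import org.apache.spark.sql.functions.{col, hour}

spark.table("service_logs")
  .groupBy(col("date"), hour(col("timestamp")).as("hour"), col("level"))
  .count()
  .orderBy(col("date"), col("hour"), col("level"))
  .show(100)
```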

Page 26: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Usage Analysis + Product Design

SparkR + ggplot2 = a match made in heaven

Page 27: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Summary

Databricks + Apache Spark create a unified platform for:
- ETL
- Data Warehousing
- Data Analysis
- Real-time analytics

DevOps issues are off the table:
- No need to manage a huge cluster
- Jobs are isolated; they don’t cannibalize each other’s resources
- Can launch any Spark version

Page 28: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Ongoing & Future Work: Structured Streaming

- Reduce complexity of the pipeline: Sync Daemon + Delta + Batch jobs => a single streaming job (see the sketch below)
- Reduce latency: availability of data in seconds instead of minutes
- Event-time dashboards
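A hedged sketch of what that single streaming job might look like with Spark 2.0 Structured Streaming; the kinesis source name, its schema (a binary data column), and every path and option here are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_date

val spark = SparkSession.builder().appName("logs-pipeline").getOrCreate()

// One continuous query replaces the sync daemon plus the delta and batch jobs:
// read from Kinesis, decode, and keep a date-partitioned Parquet table fresh.
val raw = spark.readStream
  .format("kinesis")             // assumed source name
  .option("streamName", "logs")  // assumed option
  .load()

val records = raw
  .selectExpr("CAST(data AS STRING) AS record") // assumed binary payload column
  .withColumn("date", current_date())           // illustrative: partition by arrival date

val query = records.writeStream
  .format("parquet")
  .option("checkpointLocation", "dbfs:/checkpoints/logs") // checkpoint for recovery
  .partitionBy("date")
  .start("dbfs:/tables/logs")

query.awaitTermination()
```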

Page 29: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Try Apache Spark with Databricks


http://databricks.com/try

Page 30: A Journey into Databricks' Pipelines: Journey and Lessons Learned

Thank you. Have questions about ETL with Spark? Join us at the Databricks booth, 3:45-6:00 pm!