Page 1

Building Data Pipelines with Cask Hydrator

Gokul Gunasekaran
Software Engineer, Cask Data

June 15, 2016

Cask, CDAP, Cask Hydrator and Cask Tracker are trademarks or registered trademarks of Cask Data. Apache Spark, Spark, the Spark logo, Apache Hadoop, Hadoop and the Hadoop logo are trademarks or registered trademarks of the Apache Software Foundation. All other trademarks and registered trademarks are the property of their respective owners.

Page 2

cask.co

INGEST any data from any source, in real-time and batch

BUILD drag-and-drop ETL/ELT pipelines that run on Hadoop

EGRESS any data to any destination, in real-time and batch

A data pipeline provides the ability to automate complex workflows that involve fetching data, performing non-trivial transformations, and deriving and serving insights from the data

Page 3

Web Analytics and Reporting Use Case

✦Hadoop ETL pipeline(s) stitched together using hard-to-maintain, brittle scripts

✦Not many developers with expertise in Hadoop components (HDFS, MapReduce, Spark, YARN, HBase, Kafka)

✦Hard to debug and validate, resulting in frequent failures in the production environment

Challenge: Fetch web access logs from S3 every hour, load them into a Hadoop cluster for backup, perform analytics, and enable real-time reporting of the number of successful/failed responses and of client browser info

Page 4

Demo

Load Log Files from S3 into HDFS and perform aggregations/analysis

• Start with web access logs stored in Amazon S3

• Store the raw logs into HDFS Avro Files

• Parse the access log lines into individual fields

• Find the distribution of status codes

• Find the most commonly used client browser

Page 5

S3 Input

69.181.160.120 - - [08/Feb/2015:04:36:40 +0000] "GET /ajax/planStatusHistory HTTP/1.1" 200 508 "http://builds.cask.co/log" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit Chrome/38.0.2125.122 Safari/537.36"

Fields: IP Address, Timestamp, HTTP Method, URI, HTTP Status, Response Size, Referrer URI, Client Browser
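The parse step in the demo splits each line of this Combined Log Format into the fields listed above. A minimal sketch of that parse in plain Java (the class name, regex, and helper are illustrative, not Hydrator's actual log-parser plugin):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for the Combined Log Format line shown above.
public class AccessLogParser {
    // Capture groups: ip, timestamp, method, uri, status, size, referrer, userAgent
    private static final Pattern LOG_LINE = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) [^\"]*\" "
        + "(\\d{3}) (\\d+|-) \"([^\"]*)\" \"([^\"]*)\"$");

    // Returns the eight fields in order, or null for a malformed line.
    public static String[] parse(String line) {
        Matcher m = LOG_LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[8];
        for (int i = 0; i < 8; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "69.181.160.120 - - [08/Feb/2015:04:36:40 +0000] "
            + "\"GET /ajax/planStatusHistory HTTP/1.1\" 200 508 "
            + "\"http://builds.cask.co/log\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
            + "AppleWebKit Chrome/38.0.2125.122 Safari/537.36\"";
        String[] f = parse(line);
        System.out.println("ip=" + f[0] + " status=" + f[4] + " size=" + f[5]);
    }
}
```

In the pipeline, each parsed field becomes a typed column in the record schema, which is what the downstream aggregation stages operate on.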

Page 6

Hydrator Studio

✦Drag-and-drop GUI for visual Data Pipeline creation

✦Rich library of pre-built sources, transforms, sinks for data ingestion and ETL use cases

✦Separation of pipeline creation from the execution framework (MapReduce, Spark, Spark Streaming, etc.)

✦Hadoop-native and Hadoop-distro agnostic

Page 7

Hydrator Data Pipeline

✦Captures metadata, audit, and lineage info, visualized using Cask Tracker

✦Post-run notification, centralized metrics and log collection for ease of operability

✦Simple Java API to build your own sources, transforms, and sinks, with class loader isolation

✦Spark ML-based plugins and Python transforms for data scientists

Page 8

Out of the box Integrations

✦ElasticSearch, Cassandra, Kafka, SFTP, JMS and many more sources and sinks

✦De-duplicate, Group By Aggregation, Row Denormalizer and other transforms
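The demo's status-code distribution is the kind of work the Group By Aggregation transform does over the parsed records. The same aggregation expressed in plain Java (illustrative, not the plugin's actual implementation):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Counting records per status code: the same shape of computation the
// Group By Aggregation transform performs inside a pipeline.
public class StatusCodeDistribution {
    public static Map<String, Long> countByStatus(List<String> statusCodes) {
        return statusCodes.stream()
            .collect(Collectors.groupingBy(code -> code, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> codes = List.of("200", "200", "404", "200", "500");
        // Prints each status code with its count, e.g. 200 -> 3
        countByStatus(codes).forEach((code, n) -> System.out.println(code + " -> " + n));
    }
}
```

In a real pipeline this runs distributed as part of an MR or Spark job; the group-by key (here the status code) and the aggregate function (here a count) are what you configure on the plugin.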

Page 9

Custom Plugins

✦ Implement your own batch (or realtime) source, transform, and sink plugins using a simple Java API
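A custom plugin boils down to a transform callback that receives one input record and emits zero or more output records. The real contract lives in CDAP's ETL API and operates on structured records; the toy interfaces below are a simplified, self-contained stand-in to show the shape:

```java
import java.util.ArrayList;
import java.util.List;

// Toy analogue of a Hydrator transform plugin. The real API works on
// StructuredRecords with schemas; this simplified version uses Strings
// so the sketch stays self-contained and runnable.
public class CustomPluginSketch {
    interface Emitter<T> {
        void emit(T value);
    }

    interface Transform<IN, OUT> {
        // Called once per input record; may emit zero or more outputs.
        void transform(IN input, Emitter<OUT> emitter);
    }

    // Example transform: drop blank records, upper-case the rest.
    static class CleanupTransform implements Transform<String, String> {
        @Override
        public void transform(String input, Emitter<String> emitter) {
            if (input != null && !input.trim().isEmpty()) {
                emitter.emit(input.trim().toUpperCase());
            }
        }
    }

    // Tiny driver standing in for the framework: runs a transform over a batch.
    static List<String> runBatch(Transform<String, String> t, List<String> records) {
        List<String> out = new ArrayList<>();
        for (String r : records) {
            t.transform(r, out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(runBatch(new CleanupTransform(), List.of("  get ", "", "post")));
    }
}
```

The emitter-based callback is what gives plugins their flexibility: a filter emits nothing for dropped records, a parser emits one record per line, and a flattening transform can emit several.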

Page 10

[Architecture diagram] Use cases (Data Lake, Fraud Detection, Recommendation Engine, Sensor Data Analytics, Customer 360) built with Hydrator and Tracker on the Cask Data App Platform (CDAP), which sits on the Hadoop ecosystem (50 different projects) across the top 6 Hadoop distributions.

Page 11

Pipeline Implementation

[Diagram] Logical Pipeline → Planner → Physical Workflow → MR/Spark Executions, all running on CDAP

✦Planner converts logical pipeline to a physical execution plan

✦Optimizes and bundles functions into one or more MR/Spark jobs

✦CDAP is the runtime environment where all the components of the data pipeline are executed

✦CDAP provides centralized log and metrics collection, transaction, lineage and audit information

Page 12

Upcoming capabilities

✦Join across multiple data sources (CDAP-5588)

✦Pipeline preview

✦Macro substitutions

✦Pre-actions in pipelines, similar to post-run notifications

✦Spark Streaming support for realtime pipelines

Page 13

Thank You

[email protected]

@CaskData

github.com/caskdata/cdap
github.com/caskdata/hydrator-plugins

Questions?

Page 14

Self-Service Data Ingestion and ETL for Data Lakes

Built for Production on CDAP

Rich Drag-and-Drop User Interface

Open Source & Highly Extensible
