Transcript
Page 1: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Hadoop First ETL On Apache Falcon

Srikanth Sundarrajan Naresh Agarwal

Page 2: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

About Authors

- Srikanth Sundarrajan
  - Principal Architect, InMobi Technology Services
- Naresh Agarwal
  - Director – Engineering, InMobi Technology Services

Page 3: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Agenda

- ETL & Challenges with Big Data
- Apache Falcon – Background
- Pipeline Designer – Overview
- Pipeline Designer – Internals

Page 5: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

ETL (Extract Transform Load)

Intelligence

Information

Data

Value

Page 6: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

ETL Use cases

Data Warehouse

Data Migration

Data Consolidation

Master Data Management

Data Synchronization

Data Archiving

Page 7: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

ETL Authoring

Hand coded

In-house tools

Off-the-shelf tools

Page 8: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

ETL & Big Data – Challenges

Challenges

Volume

Variety

Velocity

Page 9: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Big Data ETL

- Mostly hand coded (high cost: implementation + maintenance)
  - MapReduce
  - Hive (i.e. SQL)
  - Pig
  - Crunch / Cascading
  - Spark
- Off-the-shelf tools (scale / performance)
  - Mostly retrofitted

Page 10: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Agenda

- ETL & Challenges with Big Data
- Apache Falcon – Background
- Pipeline Designer – Overview
- Pipeline Designer – Internals

Page 11: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Apache Falcon

- Off the shelf, Falcon provides standard data management functions through declarative constructs
  - Data movement recipes
    - Cross data center replication
    - Cross cluster data synchronization
  - Data retention recipes
    - Eviction
    - Archival

Page 12: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Apache Falcon

- However, ETL-related functions are still largely left to the developer to implement. Falcon today manages only:
  - Orchestration
  - Late data handling / Change data capture
  - Retries
  - Monitoring

Page 13: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Agenda

- ETL & Challenges with Big Data
- Apache Falcon – Background
- Pipeline Designer – Overview
- Pipeline Designer – Internals

Page 14: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Basics

Page 15: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Basics

- Feed
  - A feed is a data entity that Falcon manages and that is physically present in a cluster
  - Data present in a feed conforms to a schema, and its partitions are registered with HCatalog
  - Data management functions such as eviction and archival are declaratively specified through Falcon feed definitions
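
The feed description above can be illustrated with a minimal Python sketch. It is not Falcon's API or its feed definition format: the `Feed` class, its field names, and the `evictable` helper are hypothetical, and only show how a declared retention limit could drive eviction of old partitions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Feed:
    """Hypothetical stand-in for a Falcon feed definition (not Falcon's API)."""
    name: str
    cluster: str
    schema: dict          # column name -> type, as it might be registered in HCatalog
    retention: timedelta  # declared retention limit
    partitions: list = field(default_factory=list)  # partition timestamps

    def evictable(self, now: datetime) -> list:
        """Return the partitions older than the declared retention limit."""
        cutoff = now - self.retention
        return [p for p in self.partitions if p < cutoff]


# Assumed example data: four daily partitions and a two-day retention limit.
clicks = Feed(
    name="clicks",
    cluster="primary",
    schema={"user_id": "string", "ts": "bigint"},
    retention=timedelta(days=2),
    partitions=[datetime(2014, 6, d) for d in (1, 2, 3, 4)],
)
old = clicks.evictable(now=datetime(2014, 6, 4, 12))
```

The point of the sketch is that the feed only declares the retention limit; deciding which partitions to evict is left to the managing system.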

Page 16: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Basics

Page 17: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Basics

- Process
  - Workflow that defines the actions that need to be performed, along with their control flow
  - Executes at a specified frequency on one or more clusters
- Pipelines
  - Logical grouping of Falcon processes owned and operated together
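
As a rough illustration of these two ideas, the sketch below models a process with a frequency and a pipeline label. The `Process` class, its fields, and `group_by_pipeline` are hypothetical names chosen for this example and do not mirror Falcon's process schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Process:
    """Hypothetical model of a Falcon process (illustrative only)."""
    name: str
    pipeline: str         # logical group the process belongs to
    clusters: list        # clusters the process runs on
    frequency: timedelta  # how often an instance is scheduled

    def instances(self, start: datetime, end: datetime) -> list:
        """Nominal instance times in [start, end) at the declared frequency."""
        times, t = [], start
        while t < end:
            times.append(t)
            t += self.frequency
        return times


def group_by_pipeline(processes: list) -> dict:
    """Group process names by the pipeline they are tagged with."""
    groups = {}
    for p in processes:
        groups.setdefault(p.pipeline, []).append(p.name)
    return groups


# Assumed example processes; names and clusters are made up.
procs = [
    Process("clean-clicks", "clickstream", ["primary"], timedelta(hours=1)),
    Process("sessionize", "clickstream", ["primary", "backup"], timedelta(hours=6)),
    Process("billing-rollup", "billing", ["primary"], timedelta(days=1)),
]
runs = procs[1].instances(datetime(2014, 6, 1), datetime(2014, 6, 2))
pipelines = group_by_pipeline(procs)
```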

Page 18: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Basics

Page 19: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Basics

- Actions
  - Actions in the designer are the building blocks for process workflows
  - Actions have access to output variables emitted earlier in the flow and can emit output variables of their own
  - Actions can transition to other actions
    - Default / Success Transition
    - Failure Transition
    - Conditional Transition
  - A transformation action is a special action that is itself a collection of transforms
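
The three transition types and the shared output variables can be sketched as a small flow walker. This is an assumption-laden illustration, not the designer's implementation: the `Action` fields and the `execute` function are hypothetical, a raised exception stands in for an action failure, and the flow is assumed to be acyclic.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    """Hypothetical action node; names and fields are illustrative."""
    name: str
    run: Callable[[dict], dict]       # reads earlier variables, returns new ones
    on_success: Optional[str] = None  # default / success transition
    on_failure: Optional[str] = None  # failure transition
    # Conditional transitions: (predicate over variables, target action name).
    conditions: list = field(default_factory=list)


def execute(flow: dict, start: str) -> tuple:
    """Walk the flow from `start`; return (visited action names, variables)."""
    variables, visited, current = {}, [], start
    while current is not None:
        action = flow[current]
        visited.append(current)
        try:
            variables.update(action.run(variables))
        except Exception:
            # An exception models a failed action: follow the failure transition.
            current = action.on_failure
            continue
        # The first matching conditional transition wins, else the default one.
        current = next(
            (target for pred, target in action.conditions if pred(variables)),
            action.on_success,
        )
    return visited, variables


# Assumed example flow: skip the transformation when no rows were counted.
flow = {
    "count": Action("count", lambda v: {"rows": 0}, on_success="transform",
                    conditions=[(lambda v: v["rows"] == 0, "skip")]),
    "transform": Action("transform", lambda v: {"status": "transformed"}),
    "skip": Action("skip", lambda v: {"status": "skipped"}),
}
visited, variables = execute(flow, "count")
```

Here the `count` action emits `rows`, and the conditional transition reads it to route the flow to `skip` instead of the default `transform`.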

Page 20: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Basics

Page 21: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Basics

- Transforms
  - A transform is a data manipulation function that accepts one or more inputs with a well-defined schema and produces one or more outputs
  - Multiple transform elements can be stitched together to compose a single transformation action, which can in turn be used to build a flow
- Composite Transformations
  - Transforms built by combining multiple primitive transforms
- Possible to add more transforms and extend the system
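
A minimal sketch of stitching primitive transforms into a composite one follows. It simplifies the description above to a single input and a single output per transform, and every name in it (`filter_rows`, `project`, `compose`, the sample columns) is hypothetical rather than taken from the designer.

```python
from typing import Callable, Iterable

Row = dict
Transform = Callable[[Iterable[Row]], Iterable[Row]]


def filter_rows(predicate) -> Transform:
    """Primitive transform: keep rows matching the predicate."""
    return lambda rows: (r for r in rows if predicate(r))


def project(*columns) -> Transform:
    """Primitive transform: keep only the named columns."""
    return lambda rows: ({c: r[c] for c in columns} for r in rows)


def compose(*transforms) -> Transform:
    """Composite transform: apply the given transforms left to right."""
    def run(rows):
        for t in transforms:
            rows = t(rows)
        return rows
    return run


# A composite built from two primitives, usable like any other transform.
valid_clicks = compose(
    filter_rows(lambda r: r["country"] == "IN"),
    project("user_id", "url"),
)

rows = [
    {"user_id": "u1", "url": "/a", "country": "IN"},
    {"user_id": "u2", "url": "/b", "country": "US"},
]
result = list(valid_clicks(rows))
```

Because a composite has the same shape as a primitive, adding a new transform only means writing another function of that shape, which is one way to read the extensibility point above.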

Page 22: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Basics

- Deployment & Monitoring
  - Once a process and its pipeline are composed, they are deployed in Falcon as a standard process

Page 23: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Agenda

- ETL & Challenges with Big Data
- Apache Falcon – Background
- Pipeline Designer – Overview
- Pipeline Designer – Internals

Page 24: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer Service

[Architecture diagram. Labels: Designer UI, Falcon Dashboard, Pipeline Designer, Pipeline Designer Service, REST API, Versioned Storage, Flow / Action / Transforms Compiler + Optimizer, Falcon Server, HCatalog Service, Process, Feed, Schema]

Page 25: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Internals

- Transformation actions are compiled into Pig scripts
- Actions and Flows are compiled into Falcon Process definitions
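
To make the first point concrete, here is a small sketch of what compiling a linear chain of transforms into a Pig Latin script might look like. The slides do not show the actual compiler, so `compile_to_pig`, its step vocabulary (`"filter"`, `"project"`), the relation names, and the paths are all assumptions made for illustration.

```python
def compile_to_pig(source: str, schema: list, steps: list, target: str) -> str:
    """Emit a Pig Latin script for a linear chain of transform steps.

    `steps` is a list of ("filter", condition) or ("project", [columns])
    tuples; this step vocabulary is hypothetical.
    """
    fields = ", ".join(f"{name}:{typ}" for name, typ in schema)
    lines = [f"r0 = LOAD '{source}' AS ({fields});"]
    for i, (kind, arg) in enumerate(steps, start=1):
        prev, cur = f"r{i - 1}", f"r{i}"
        if kind == "filter":
            lines.append(f"{cur} = FILTER {prev} BY {arg};")
        elif kind == "project":
            lines.append(f"{cur} = FOREACH {prev} GENERATE {', '.join(arg)};")
        else:
            raise ValueError(f"unknown transform: {kind}")
    # Store the relation produced by the last step.
    lines.append(f"STORE r{len(steps)} INTO '{target}';")
    return "\n".join(lines)


# Assumed example: filter by country, then keep two columns.
script = compile_to_pig(
    source="/data/clicks/2014-06-01",
    schema=[("user_id", "chararray"), ("url", "chararray"), ("country", "chararray")],
    steps=[("filter", "country == 'IN'"), ("project", ["user_id", "url"])],
    target="/data/valid_clicks/2014-06-01",
)
print(script)
```

Each step becomes one Pig relation that reads from the previous one, so the generated script has one statement per transform plus the initial `LOAD` and the final `STORE`.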

Page 39: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Q & A

Page 40: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Thanks
