Hadoop First ETL On Apache Falcon Srikanth Sundarrajan Naresh Agarwal

Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer


DESCRIPTION

Pipeline Designer allows users to author their processes and provision them on Falcon. This should make building applications on Falcon over Hadoop fairly trivial. Falcon can operate on HCatalog tables natively, which means there is a one-to-one correspondence between a Falcon feed and an HCatalog table. Between the feed definition in Falcon and the underlying table definition in HCatalog, there is adequate metadata about the data stored underneath. One or more of these datasets can then be operated on by a collection of transformations to extract a more refined dataset or feed. This logic (currently expressed as Oozie workflows, Pig scripts, or MapReduce jobs) is typically represented through a Falcon process. In this talk we walk through the details of the pipeline designer and the current state of this feature.


Page 1: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Hadoop First ETL On Apache Falcon

Srikanth Sundarrajan Naresh Agarwal

Page 2: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

About Authors

- Srikanth Sundarrajan
  - Principal Architect, InMobi Technology Services
- Naresh Agarwal
  - Director – Engineering, InMobi Technology Services

Page 3: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Agenda

- ETL & Challenges with Big Data
- Apache Falcon – Background
- Pipeline Designer – Overview
- Pipeline Designer – Internals

Page 5: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

ETL (Extract, Transform, Load)

[Figure: Data, Information, Intelligence, Value]

Page 6: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

ETL Use cases

Data Warehouse

Data Migration

Data Consolidation

Master Data Management

Data Synchronization

Data Archiving

Page 7: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

ETL Authoring

Hand coded

In-house tools

Off-the-shelf tools

Page 8: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

ETL & Big Data – Challenges

[Figure: Challenges: Volume, Variety, Velocity]

Page 9: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Big Data ETL

- Mostly hand coded (high cost: implementation + maintenance)
  - MapReduce
  - Hive (i.e. SQL)
  - Pig
  - Crunch / Cascading
  - Spark
- Off-the-shelf tools (scale/performance)
  - Mostly retrofitted

Page 11: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Apache Falcon

- Off the shelf, Falcon provides standard data management functions through declarative constructs
  - Data movement recipes
    - Cross data center replication
    - Cross cluster data synchronization
  - Data retention recipes
    - Eviction
    - Archival
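
As a rough illustration of what a retention recipe amounts to, the sketch below picks out the dated partitions of a feed that fall outside a retention window. The names `RetentionPolicy` and `partitions_to_evict` are hypothetical and are not part of Falcon's API. In Falcon the retention limit is declared on the feed rather than written as code.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RetentionPolicy:
    """Hypothetical stand-in for a declarative retention rule on a feed."""
    limit_days: int  # keep partitions that are at most this many days old


def partitions_to_evict(partition_dates, policy, today):
    """Return the partition dates older than the retention limit, oldest first."""
    cutoff = today - timedelta(days=policy.limit_days)
    return sorted(d for d in partition_dates if d < cutoff)


partitions = [date(2014, 1, 1), date(2014, 5, 20), date(2014, 6, 1)]
policy = RetentionPolicy(limit_days=30)
evicted = partitions_to_evict(partitions, policy, today=date(2014, 6, 3))
print(evicted)
```

Only the January partition is older than the 30-day cutoff here, so it is the only one selected for eviction.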

Page 12: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Apache Falcon

- However, ETL-related functions are still largely left to the developer to implement. Falcon today manages only
  - Orchestration
  - Late data handling / Change data capture
  - Retries
  - Monitoring
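
To make the "Retries" item concrete, here is a minimal sketch of a periodic retry policy of the kind an orchestrator applies on the developer's behalf. `run_with_retries` and `flaky_job` are hypothetical names used only for illustration. The slides do not show how Falcon implements retries, so this is an assumption about the general shape, not Falcon's code.

```python
import time


def run_with_retries(action, attempts=3, delay_seconds=0.0, sleep=time.sleep):
    """Call `action` until it succeeds or `attempts` is exhausted.

    Hypothetical helper: waits `delay_seconds` between attempts (a periodic
    policy) and re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # a real system would catch narrower errors
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)
    raise last_error


calls = []


def flaky_job():
    """Fails twice, then succeeds, to exercise the retry loop."""
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")
    return "done"


result = run_with_retries(flaky_job, attempts=3)
print(result, len(calls))
```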

Page 14: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Basics

Page 15: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Basics

- Feed
  - A data entity that Falcon manages and that is physically present in a cluster
  - Data in a feed conforms to a schema, and its partitions are registered with HCatalog
  - Data management functions such as eviction and archival are specified declaratively through Falcon feed definitions

Page 16: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Basics

Page 17: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Basics

- Process
  - A workflow that defines the actions that need to be performed, along with their control flow
  - Executes at a specified frequency on one or more clusters
- Pipelines
  - A logical grouping of Falcon processes that are owned and operated together

Page 18: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Basics

Page 19: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Basics

- Actions
  - Actions in the designer are the building blocks of process workflows
  - Actions have access to output variables emitted earlier in the flow and can emit output variables of their own
  - Actions can transition to other actions
    - Default / success transition
    - Failure transition
    - Conditional transition
  - A transformation action is a special action that is itself a collection of transforms
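
The action model described above can be sketched as a small flow runner. Everything here (`Action`, `run_flow`, the field names) is hypothetical and only mirrors the concepts on the slide: output variables shared along the flow, plus success, failure, and conditional transitions. It is not the designer's actual data model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Variables = Dict[str, object]


@dataclass
class Action:
    """Hypothetical action node: runs `body`, then follows a transition."""
    name: str
    body: Callable[[Variables], Variables]  # reads earlier outputs, emits new ones
    on_success: Optional[str] = None        # default / success transition
    on_failure: Optional[str] = None        # failure transition
    # Conditional transitions: (predicate over variables, target action name).
    conditions: Tuple[Tuple[Callable[[Variables], bool], str], ...] = ()


def run_flow(actions, start):
    """Execute actions from `start`; return visited names and final variables."""
    by_name = {a.name: a for a in actions}
    variables, visited, current = {}, [], start
    while current is not None:
        action = by_name[current]
        visited.append(current)
        try:
            variables.update(action.body(variables))
        except Exception:
            current = action.on_failure
            continue
        # The first matching conditional transition wins, else the default one.
        current = next(
            (target for predicate, target in action.conditions if predicate(variables)),
            action.on_success,
        )
    return visited, variables


flow = [
    Action("count", lambda v: {"rows": 0}, on_success="transform",
           conditions=((lambda v: v["rows"] == 0, "skip"),)),
    Action("transform", lambda v: {"status": "transformed"}),
    Action("skip", lambda v: {"status": "skipped"}),
]
visited, variables = run_flow(flow, "count")
print(visited, variables)
```

Because `count` emits `rows == 0`, the conditional transition to `skip` is taken instead of the default transition to `transform`.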

Page 20: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Basics

Page 21: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Basics

- Transforms
  - A transform is a data manipulation function that accepts one or more inputs with a well-defined schema and produces one or more outputs
  - Multiple transforms can be stitched together to compose a single transformation action, which can in turn be used to build a flow
  - Composite transformations
    - Transforms that are built by combining multiple primitive transforms
  - It is possible to add more transforms and extend the system
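
One way to picture primitive and composite transforms is as functions over rows that carry a declared output schema. The sketch below assumes a hypothetical `Transform` type and `compose` helper, and it handles only a single input and output for brevity. The designer's real transform interface is not shown in the slides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


@dataclass(frozen=True)
class Transform:
    """Hypothetical primitive transform with a declared output schema."""
    name: str
    output_schema: Tuple[str, ...]           # column names this transform produces
    apply: Callable[[List[Row]], List[Row]]  # the data manipulation itself


def compose(name, *transforms):
    """Build a composite transform that applies `transforms` left to right."""
    def apply(rows):
        for transform in transforms:
            rows = transform.apply(rows)
        return rows
    # The composite exposes the schema of its last stage.
    return Transform(name, transforms[-1].output_schema, apply)


only_clicks = Transform(
    "filter_clicks", ("user", "event"),
    lambda rows: [r for r in rows if r["event"] == "click"],
)
project_user = Transform(
    "project_user", ("user",),
    lambda rows: [{"user": r["user"]} for r in rows],
)
click_users = compose("click_users", only_clicks, project_user)

rows = [{"user": "a", "event": "click"}, {"user": "b", "event": "view"}]
result = click_users.apply(rows)
print(result)
```

A composite built this way is itself a `Transform`, so it can be reused as a stage in a larger composition.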

Page 22: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Basics

- Deployment & Monitoring
  - Once a process and its pipeline are composed, they are deployed in Falcon as a standard process

Page 24: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer Service

[Architecture diagram. Components: Designer UI, Falcon Dashboard, Pipeline Designer Service (REST API, Versioned Storage, Flow / Action / Transforms Compiler + Optimizer), Falcon Server, HCatalog Service. Labels: Process, Feed, Schema.]

Page 25: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Pipeline Designer – Internals

- Transformation actions are compiled into Pig scripts
- Actions and flows are compiled into Falcon process definitions
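
The slides do not show what the generated scripts look like, so the following is only a guess at the general idea: a linear chain of transforms rendered as Pig Latin that reads from and writes to HCatalog tables. `compile_to_pig`, its step vocabulary, and the table names are all hypothetical. The real compiler and optimizer presumably handle far more than this.

```python
def compile_to_pig(source_table, steps, target_table):
    """Render a linear chain of transform steps as Pig Latin text.

    `steps` is a list of ("filter", condition) or ("project", [columns]) tuples.
    This step vocabulary is hypothetical, not the designer's real model.
    """
    lines = [
        f"r0 = LOAD '{source_table}' USING org.apache.hive.hcatalog.pig.HCatLoader();"
    ]
    for index, (kind, argument) in enumerate(steps, start=1):
        previous, current = f"r{index - 1}", f"r{index}"
        if kind == "filter":
            lines.append(f"{current} = FILTER {previous} BY {argument};")
        elif kind == "project":
            lines.append(f"{current} = FOREACH {previous} GENERATE {', '.join(argument)};")
        else:
            raise ValueError(f"unknown transform kind: {kind}")
    # Store the relation produced by the last step.
    lines.append(
        f"STORE r{len(steps)} INTO '{target_table}' "
        "USING org.apache.hive.hcatalog.pig.HCatStorer();"
    )
    return "\n".join(lines)


script = compile_to_pig(
    "default.events",
    [("filter", "event == 'click'"), ("project", ["user"])],
    "default.click_users",
)
print(script)
```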

Page 39: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Q & A

Page 40: Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer

Thanks
