Pipeline designer lets users author their processes and provision them on Falcon, which should make building applications on Falcon over Hadoop fairly simple. Falcon can operate on HCatalog tables natively: there is a one-to-one correspondence between a Falcon feed and an HCatalog table. Together, the feed definition in Falcon and the underlying table definition in HCatalog provide adequate metadata about the data stored underneath. One or more of these datasets can then be run through a collection of transformations to extract a more refined dataset or feed. This logic (currently expressed as Oozie workflows, Pig scripts, or map-reduce jobs) is typically represented through a Falcon process. In this talk we walk through the details of the pipeline designer and the current state of this feature.
Hadoop First ETL On Apache Falcon
Srikanth Sundarrajan Naresh Agarwal
About Authors
- Srikanth Sundarrajan
  - Principal Architect, InMobi Technology Services
- Naresh Agarwal
  - Director – Engineering, InMobi Technology Services
Agenda
- ETL & Challenges with Big Data
- Apache Falcon – Background
- Pipeline Designer – Overview
- Pipeline Designer – Internals
ETL (Extract Transform Load)
[Figure with labels: Intelligence, Information, Data, Value]
ETL Use cases
Data Warehouse
Data Migration
Data Consolidation
Master Data Management
Data Synchronization
Data Archiving
ETL Authoring
Hand coded
In-house tools
Off-the-shelf tools
ETL & Big Data – Challenges
Volume
Variety
Velocity
Big Data ETL
- Mostly hand coded (High Cost: Implementation + Maintenance)
  - Map Reduce
  - Hive (i.e. SQL)
  - Pig
  - Crunch / Cascading
  - Spark
- Off-the-shelf tools (Scale/Performance)
  - Mostly retrofitted
Apache Falcon
- Off the shelf, Falcon provides standard data management functions through declarative constructs
  - Data movement recipes
    - Cross data center replication
    - Cross cluster data synchronization
  - Data retention recipes
    - Eviction
    - Archival
Apache Falcon
- However, ETL related functions are still largely left to the developer to implement. Falcon today manages only
  - Orchestration
  - Late data handling / Change data capture
  - Retries
  - Monitoring
Pipeline Designer – Basics
- Feed
  - A data entity that Falcon manages and that is physically present in a cluster.
  - Data in a feed conforms to a schema, and its partitions are registered with HCatalog.
  - Data management functions such as eviction and archival are specified declaratively through Falcon feed definitions.
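To make the idea of a declaratively managed feed concrete, here is a minimal Python sketch. The `Feed` class, its fields, and `expired_partitions` are hypothetical names chosen for illustration; they are not Falcon's entity schema or API, and daily partitions are assumed.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Feed:
    """Hypothetical model of a feed; not Falcon's actual entity schema."""
    name: str
    table: str                    # HCatalog table backing the feed (one table per feed)
    schema: dict                  # column name -> type name
    retention_days: int           # declarative retention limit
    expiry_action: str = "evict"  # assumed values: "evict" or "archive"


def expired_partitions(feed, partitions, today):
    """Return the daily partitions that have aged out of the retention window."""
    cutoff = today - timedelta(days=feed.retention_days)
    return sorted(p for p in partitions if p < cutoff)


# Sample data, invented for this sketch.
clicks = Feed(
    name="clicks",
    table="default.clicks",
    schema={"user": "string", "country": "string"},
    retention_days=7,
)
partitions = [date(2014, 6, 1), date(2014, 6, 5), date(2014, 6, 9)]
to_evict = expired_partitions(clicks, partitions, today=date(2014, 6, 10))
print(clicks.expiry_action, [p.isoformat() for p in to_evict])
```

The point of the sketch is that the feed owner only states the retention limit; deciding which partitions to act on is left to the platform.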
Pipeline Designer – Basics
- Process
  - A workflow that defines the actions to be performed, along with their control flow.
  - Executes at a specified frequency on one or more clusters.
- Pipelines
  - A logical grouping of Falcon processes that are owned and operated together.
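The relationship between processes and pipelines can be sketched as follows. This is an illustrative model only: the `Process` fields and the `group_by_pipeline` helper are assumptions made for this example, not Falcon's process definition.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    """Hypothetical model of a process; field names are illustrative."""
    name: str
    frequency_minutes: int  # how often the process executes
    clusters: tuple         # one or more clusters it runs on
    pipeline: str           # logical group it is owned and operated with


def group_by_pipeline(processes):
    """Group process names under the pipeline they belong to."""
    grouped = defaultdict(list)
    for process in processes:
        grouped[process.pipeline].append(process.name)
    return dict(grouped)


# Sample processes, invented for this sketch.
processes = [
    Process("clean-clicks", 60, ("primary",), "clickstream"),
    Process("aggregate-clicks", 1440, ("primary", "backup"), "clickstream"),
    Process("load-billing", 1440, ("primary",), "billing"),
]
pipelines = group_by_pipeline(processes)
print(pipelines)
```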
Pipeline Designer – Basics
- Actions
  - Actions in the designer are the building blocks of process workflows.
  - Actions have access to output variables emitted earlier in the flow and can emit output variables of their own.
  - Actions can transition to other actions
    - Default / Success Transition
    - Failure Transition
    - Conditional Transition
  - A transformation action is a special action that is itself a collection of transforms.
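The three transition types and the passing of output variables can be sketched with a small flow runner. Everything here (`Action`, `run_flow`, the field names) is hypothetical, and the rule that a matching conditional transition takes precedence over the default one is an assumption of this sketch, not something the slides specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Action:
    """Hypothetical action: reads earlier output variables, emits new ones."""
    name: str
    run: Callable[[Dict[str, object]], Dict[str, object]]
    on_success: Optional[str] = None  # default / success transition
    on_failure: Optional[str] = None  # failure transition
    # Conditional transitions as (predicate over variables, target action) pairs.
    conditional: List[Tuple[Callable[[Dict[str, object]], bool], str]] = field(
        default_factory=list
    )


def run_flow(actions, start):
    """Execute actions from `start`, following transitions until none remains."""
    by_name = {a.name: a for a in actions}
    variables, visited = {}, []
    current = start
    while current is not None:
        action = by_name[current]
        visited.append(current)
        try:
            variables.update(action.run(variables))
        except Exception:
            current = action.on_failure
            continue
        # Assumption: the first matching conditional wins, else the default.
        current = next(
            (target for condition, target in action.conditional if condition(variables)),
            action.on_success,
        )
    return visited, variables


flow = [
    Action(
        "count_rows",
        run=lambda v: {"rows": 0},
        on_success="transform",
        conditional=[(lambda v: v["rows"] == 0, "notify_empty")],
    ),
    Action("transform", run=lambda v: {"transformed": True}),
    Action("notify_empty", run=lambda v: {"notified": True}),
]
visited, variables = run_flow(flow, "count_rows")
print(visited, variables)
```

Here `count_rows` emits `rows`, and the conditional transition reads that variable to route the flow to `notify_empty` instead of `transform`.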
Pipeline Designer – Basics
- Transforms
  - A data manipulation function that accepts one or more inputs with a well-defined schema and produces one or more outputs.
  - Multiple transforms can be stitched together to compose a single transformation action, which can in turn be used to build a flow.
  - Composite Transformations
    - Transforms built by combining multiple primitive transforms
  - It is possible to add more transforms and extend the system.
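As a rough illustration of stitching primitive transforms into a composite one, the sketch below treats a transform as a function from rows to rows. The helpers `filter_rows`, `project`, and `compose` are hypothetical and single-input for brevity; the designer's transforms may take several inputs and produce several outputs.

```python
from functools import reduce


def filter_rows(predicate):
    """Primitive transform: keep only the rows matching the predicate."""
    def transform(rows):
        return [row for row in rows if predicate(row)]
    return transform


def project(*columns):
    """Primitive transform: keep only the named columns."""
    def transform(rows):
        return [{c: row[c] for c in columns} for row in rows]
    return transform


def compose(*transforms):
    """Composite transform: apply the given transforms left to right."""
    def composite(rows):
        return reduce(lambda acc, t: t(acc), transforms, rows)
    return composite


# Sample rows, invented for this sketch.
clicks = [
    {"user": "a", "country": "IN", "ms": 120},
    {"user": "b", "country": "US", "ms": 95},
    {"user": "c", "country": "IN", "ms": 40},
]
indian_users = compose(
    filter_rows(lambda row: row["country"] == "IN"),
    project("user", "country"),
)
result = indian_users(clicks)
print(result)
```

Extending the system in this model means writing another function with the same rows-in, rows-out shape.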
Pipeline Designer – Basics
- Deployment & Monitoring
  - Once a process and its pipeline are composed, the result is deployed in Falcon as a standard process.
Pipeline Designer Service
[Architecture diagram. Components: Designer UI, Falcon Dashboard, Pipeline Designer Service (REST API, Versioned Storage, Flow / Action / Transforms, Compiler + Optimizer), Falcon Server, HCatalog Service. Connector labels: Process, Feed, Schema.]
Pipeline Designer – Internals
- Transformation actions are compiled into Pig scripts
- Actions and Flows are compiled into Falcon Process definitions
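A toy version of compiling a transformation action into a Pig script might look like the sketch below. The `compile_to_pig` function and its step format are invented for illustration and say nothing about how the designer's compiler or optimizer actually works. The generated text uses standard Pig Latin statements (`LOAD`, `FILTER`, `FOREACH ... GENERATE`, `STORE`) and assumes the HCatalog loader and storer classes from the `org.apache.hive.hcatalog.pig` package.

```python
HCAT_LOADER = "org.apache.hive.hcatalog.pig.HCatLoader()"
HCAT_STORER = "org.apache.hive.hcatalog.pig.HCatStorer()"


def compile_to_pig(source_table, steps, target_table):
    """Emit one Pig Latin statement per step, chaining relations rel0, rel1, ..."""
    lines = [f"rel0 = LOAD '{source_table}' USING {HCAT_LOADER};"]
    for i, (kind, arg) in enumerate(steps, start=1):
        prev, cur = f"rel{i - 1}", f"rel{i}"
        if kind == "filter":
            # arg is a Pig boolean expression, passed through as-is
            lines.append(f"{cur} = FILTER {prev} BY {arg};")
        elif kind == "project":
            # arg is a list of column names to keep
            lines.append(f"{cur} = FOREACH {prev} GENERATE {', '.join(arg)};")
        else:
            raise ValueError(f"unsupported transform: {kind}")
    lines.append(f"STORE rel{len(steps)} INTO '{target_table}' USING {HCAT_STORER};")
    return "\n".join(lines)


# Table names and steps are sample values for this sketch.
script = compile_to_pig(
    "default.clicks",
    [("filter", "country == 'IN'"), ("project", ["user", "country"])],
    "default.clicks_in",
)
print(script)
```

Because feeds map to HCatalog tables, a compiler of this kind can take table names and schemas from the feed metadata rather than from hand-written paths.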
Q & A