DESCRIPTION
ETL is the process of extracting data from one location, transforming it, and loading it into a different location, often for the purposes of collection and analysis. As Hadoop becomes a common technology for sophisticated analysis and transformation of petabytes of structured and unstructured data, the task of moving data in and out efficiently becomes more important and writing transformation jobs becomes more complicated. Talend provides a way to build and automate complex ETL jobs for migration, synchronization, or warehousing tasks. Using Talend's Hadoop capabilities allows users to easily move data between Hadoop and a number of external data locations using over 450 connectors. Also, Talend can simplify the creation of MapReduce transformations by offering a graphical interface to Hive, Pig, and HDFS. In this talk, Cédric Carbone will discuss how to use Talend to move large amounts of data in and out of Hadoop and easily perform transformation tasks in a scalable way.
Scalable ETL with Talend and Hadoop
Talend, Global Leader in Open Source Integration Solutions
Cédric Carbone – Talend CTO – Twitter: @carbone – [email protected]
Why talk about ETL with Hadoop?
Hadoop is complex, and BI consultants often lack the skills to work with it ➜ The biggest issue in a Hadoop project is finding skilled Hadoop engineers
➜ An ETL tool like Talend can help democratize Hadoop
Trying to get from this…
to this…
Talend generates code that is executed within MapReduce. This open approach removes the limitation of a proprietary "engine" to provide a truly unique and powerful set of tools for big data.
Why Talend…
Big Data Production: RDBMS, Analytical DB, NoSQL DB, ERP/CRM, SaaS, Social Media, Web Analytics, Log Files, RFID, Call Data Records, Sensors, Machine-Generated
Big Data Management: Storage, Processing, Filtering (Big Data Integration and Big Data Quality)
Big Data Consumption: Mining, Analytics, Search, Enrichment
Turn Big Data into actionable information
Parsing, Checking
Two methods for inserting data quality into a big data job:
1. Pipelining: as part of the load process
2. Load the cluster, then implement and execute a data quality MapReduce job
Extract – Transform – Load (E-T-L)
Extract – Improve/Cleanse – Load (E-DQ-L)
Pipelining: data quality with big data
• Use traditional data quality tools
• Once and done
(Diagram: CRM, ERP, Finance, Social Networking and Mobile Devices each pass through a DQ step before landing in Big Data)
Big data alternative: load and improve within the cluster
• Load first, improve later
• Complex matching cannot be done outside the cluster
(Diagram: CRM, ERP, Finance, Social Networking and Mobile Devices load straight into Big Data, where DQ runs inside the cluster)
One key DQ rule: matching
➜ Find duplicates within Hadoop
➜ Today's matching algorithms are processor-intensive
➜ Tomorrow's matching algorithms could be more precise, and even more intensive
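To make this concrete, here is a minimal sketch of such a matching job written directly against the Hadoop MapReduce API. It is not Talend-generated code; the CSV layout and the name/email fields used as the match key are assumptions. The map phase emits a normalized key per record, and the reduce phase keeps one survivor per group of duplicates.

// Minimal duplicate-detection sketch using the Hadoop MapReduce API.
// Assumption: input lines are CSV records whose first two fields are name and email.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DedupJob {

  // Map: emit a normalized match key (lower-cased name + email) with the raw record.
  public static class MatchKeyMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length < 2) return;                       // skip malformed lines
      String matchKey = (fields[0] + "|" + fields[1]).trim().toLowerCase();
      context.write(new Text(matchKey), value);
    }
  }

  // Reduce: records sharing a match key arrive together; keep the first as the survivor.
  public static class KeepFirstReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text record : values) {
        context.write(key, record);                        // survivor record
        break;                                             // drop the duplicates
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "dedup");
    job.setJarByClass(DedupJob.class);
    job.setMapperClass(MatchKeyMapper.class);
    job.setReducerClass(KeepFirstReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

In a Talend job the equivalent map and reduce steps would be generated from graphical components rather than written by hand; fuzzier matching only changes how the match key is computed.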
What is Hadoop?
What’s hadoop
➜ The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
➜ A Java framework for storing and running data transformations on large clusters of commodity hardware
➜ Licensed under the Apache v2 license
➜ Created from Google's MapReduce, BigTable and Google File System (GFS) papers
Hadoop ecosystem
➜ Sqoop ➜ Pig ➜ Hive ➜ Oozie ➜ HCatalog ➜ Mahout ➜ HBase
Talend for Big Data: the Hadoop story
4.0 [April 2010]: put and get data in Hadoop through HDFS connectors
4.1 [Oct 2010]: Hadoop queries (Hive); bulk load & fast export to Hadoop (Sqoop)
4.2 [May 2011]: transformation (Pig)
5.0 [Nov 2011]: HBase NoSQL; extended tPig* components
5.1 [May 2012]: metadata (HCatalog); deployment & scheduling (Oozie); embedded into HDP
5.2 [Oct 2012]: visual ELT mapping (Hive); data lineage & impact analysis
5.3 [June 2013]: visual Pig mapping; machine learning (Mahout); native MapReduce code generation
Democratizing Integration with Data Integration tools for Big Data
WordCount
➜ WordCount • Comes with Hadoop • The "first" demo that everyone tries!
➜ How-to in Talend Big Data • Simple read, count, load results • No coding, just drag-and-drop • Runs remotely
(Diagram: map and reduce tasks distributed across Data Node 1 and Data Node 2)
In Java
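For reference, here is a minimal sketch of the classic Hadoop WordCount in Java, the same read, count and load logic that the graphical job reproduces without coding:

// Classic Hadoop WordCount: the logic a graphical Talend flow reproduces without coding.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);                  // emit (word, 1) for every token
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : values) sum += count.get();
      context.write(key, new IntWritable(sum));    // emit (word, total count)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);     // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}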
WordCount: how-to with a graphical ETL
Thank You!
Let us show you…
Talend Open Studio
Generate pure MapReduce
Pig Latin script generation
FOREACH tPigMap_1_out1_RESULT GENERATE $4 AS Revenu , $6 AS Label
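The statement above is a fragment of a generated Pig Latin script. As a hedged illustration of how such a script can be driven from Java, here is a small sketch using Pig's PigServer API; the input path, delimiter, alias names and field positions are assumptions, not Talend output.

// Running a Pig Latin projection (like the generated FOREACH above) from Java via PigServer.
import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigProjectionDemo {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);   // execute on the cluster, not locally

    // Load the raw data and project two columns, mirroring the FOREACH ... GENERATE above.
    pig.registerQuery("orders = LOAD '/user/talend/input/orders' USING PigStorage(';');");
    pig.registerQuery("result = FOREACH orders GENERATE $4 AS Revenu, $6 AS Label;");

    // Iterating over the result triggers the underlying MapReduce execution.
    Iterator<Tuple> it = pig.openIterator("result");
    while (it.hasNext()) {
      System.out.println(it.next());
    }
    pig.shutdown();
  }
}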
HiveQL generation
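The slide showed generated HiveQL; as an illustration of how such statements reach the cluster, here is a minimal sketch that submits a query through Hive's JDBC driver (HiveServer2). The host, table and column names are assumptions, not Talend output.

// Submitting HiveQL over JDBC; Hive compiles the query into MapReduce jobs on the cluster.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");    // HiveServer2 JDBC driver
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hadoop-master:10000/default", "hive", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT label, SUM(revenue) FROM sales GROUP BY label")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
      }
    }
  }
}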
HDFS Management and Sqoop
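Under the hood, HDFS put and get operations come down to calls on Hadoop's FileSystem API. A minimal sketch, assuming a hypothetical NameNode URI and paths, of copying a local file into the cluster and streaming it back:

// Basic HDFS management with the Hadoop FileSystem API. URI and paths are hypothetical.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutGet {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://hadoop-master:8020"), conf);

    // "Put": copy a local file into HDFS (what an HDFS output component boils down to).
    fs.copyFromLocalFile(new Path("/tmp/customers.csv"),
                         new Path("/user/talend/input/customers.csv"));

    // "Get": stream the file back out of the cluster.
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
             fs.open(new Path("/user/talend/input/customers.csv"))))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
    fs.close();
  }
}

For relational sources, a Sqoop import plays the analogous role, bulk-loading database tables into HDFS instead of copying files.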
Apache Mahout
➜ Big Data can also be a blob of data to an organization
➜ Apache Mahout provides algorithms for understanding data: data mining
➜ "You don't know what you don't know", and Mahout will tell you
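As one concrete example of the kind of mining Mahout offers, here is a small sketch of its user-based recommender (the Taste API). The ratings file and its userID,itemID,rating layout are assumptions.

// User-based collaborative filtering with Mahout's Taste API.
// Assumption: ratings.csv contains lines of the form userID,itemID,rating.
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderDemo {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 item recommendations for user 42, scored from similar users' ratings.
    List<RecommendedItem> items = recommender.recommend(42, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " (score " + item.getValue() + ")");
    }
  }
}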
Metadata Management
➜ Centralized metadata repository for the Hadoop cluster, HDFS, Hive… • Versioning • Impact analysis and data lineage
➜ HCatalog across • HDFS • Hive • Pig
Thank You!
Choose your Hadoop distro
➜ Widely adopted ➜ Management tooling is not OSS
➜ Fully open source ➜ Strong developer ecosystem
➜ More proprietary ➜ GTM partner with AWS
➜ Many more are coming
Choose your Hadoop distro
Distros provide tooling: ➜ for installation ➜ for server monitoring. But: ➜ no GUI for parsing, transforming and easily loading, and no data management
Parse and Standardize
➜ Big Data is not always structured ➜ Correct big data so that it conforms to the same rules
Profiling & monitoring DQ
Implications for Integration
DATA – Challenge: the information explosion increases the complexity of integration and requires governance to maintain data quality. Requirement: information processing must scale
Implications for Integration
APPLICATION – Challenge: brittle, point-to-point connections cannot adapt to evolving business requirements, new channels, and quickly changing topologies. Requirement: application architecture must scale
Implications for Integration
PROCESS – Challenge: competitive market forces drive frequent process changes and increased process complexity. Requirement: business processes must scale
Implications for Integration
Challenge: interdependencies across data, applications and processes require more resources and budget. Requirement: resources and skill sets must scale
Integration at Any Scale
True scalability for • Any integration challenge • Any data volume • Any project size. Enables integration convergence.
Technology that Scales
CODE GENERATOR – No black-box engine means faster maintenance and deployment with improved quality. For Big Data, the "engine" is Hadoop itself, letting jobs scale out with the cluster.
Generated code targets: Java, SQL, MapReduce, Camel, …
STANDARDS-BASED – Easy to learn, flexible to adopt, reduces vendor lock-in
ELT (SQL code gen): • Teradata • Netezza • Vertica
ETL: • Java code • Partitioning • Parallelization
Big Data: • Visual wizard • Hive/Pig/MR code gen • HDFS, Sqoop, Oozie…
Technology Continuum…
Google BigQuery
NoSQL: • MongoDB • Neo4j • Cassandra • HBase • Amazon Redshift
At a glance
▶ Founded in 2005
▶ Offers highly scalable integration solutions addressing Data Integration, Data Quality, MDM, ESB and BPM
▶ Provides: § Subscriptions including 24/7 support and indemnification § Worldwide training and services
▶ Recognized as the open source leader in each of its market categories
Talend today
➜ 400 employees in 7 countries with dual HQ in Los Altos, CA and Paris, France
➜ Over 4,000 paying customers across different industry verticals and company sizes
➜ Backed by Silver Lake Sumeru, Balderton Capital and Idinvest Partners
Talend Overview
Brand Awareness: 20 million downloads
Adoption: 1,000,000 users
Monetization: 4,000 customers
Market Momentum: +50 new customers / month
High growth through a proven model
Talend’s Unique Integration Solution
1. Studio: comprehensive Eclipse-based user interface
2. Repository: consolidated metadata & project information
3. Deployment: web-based deployment & scheduling
4. Execution: same container for batch processing, message routing & services
5. Monitoring: single web-based monitoring console
Talend Unified Platform
Data Quality
Data Integration
MDM
ESB
BPM
➜ Reduce costs ➜ Eliminate risk ➜ Reuse skills ➜ Economies of scale ➜ Incremental adoption
Best-of-Breed Solutions + Talend Unified Platform = Unique Integration Solution
Recognized as the open source leader in each of its market categories by all industry analysts
Solutions that Scale
TALEND UNIFIED PLATFORM
Data Integration
Studio, Execution, Monitoring, Deployment, Repository
Big Data
Data Quality
ESB
MDM
BPM
UNIFIED PLATFORM – A shared foundation and toolset increases resource reuse
CONVERGED INTEGRATION – Use for any data, application and process project
The 6 Dimensions of BIG Data
Primary challenges: • Volume • Velocity • Variety
And also: • Complexity • Validation • Lineage
What Is BIG Data?
"Big data" is informa7on of extreme size, diversity, complexity and need for rapid processing.
Ted Friedman - Information Infrastructure and Big Data Projects Key Initiative Overview - July 2011
275 exabytes of data flowing over the Internet each day (275,000,000,000,000,000,000 bytes)
200 billion intelligent devices (200,000,000,000)
1,000,000 transactions per day at Walmart
3,500 tweets per second (June 2011)
What is Big Data?
How to define Big Data: volume, variety, velocity
Key Takeaway #1
Hans Rosling uses big data to analyze world health trends
Traditional Data Flows
• Scheduled: daily or weekly, sometimes more frequently
• Volumes rarely exceed terabytes
(Diagram: CRM, ERP and Finance data passes through ETL and Data Quality into normalized data in a traditional data warehouse, consumed by business analysts, business users, warehouse administrators and executives)
The new world of big data
(Diagram, built up across three slides: Big Data is fed first by CRM, ERP, Finance and Social Networking, then also by Mobile Devices, and finally also by Transactions, Network Devices and Sensors)
Forces us to think differently
Key Takeaway #2
Data-driven business
(Diagram: data supports information, information enables decisions, decisions drive your business, and governance closes the loop)
Information provides value to the business. If you can't rely on your information, the result can be missed opportunities or higher costs.
Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).
BIG data-driven business
(Diagram: the same cycle at scale: BIG data supports BIG information, which enables BIG decisions, which drive BIG business, under governance)
How is big data integration being used?
Use Cases • Recommendation Engine • Sentiment Analysis • Risk Modeling • Fraud Detection • Behavior Analysis • Marketing Campaign Analysis • Customer Churn Analysis • Social Graph Analysis • Customer Experience Analytics • Network Monitoring
BUT: to what level is DQ required for your use case?
Key Takeaway #3: Define your use case