DESCRIPTION
ETL is the process of extracting data from one location, transforming it, and loading it into a different location, often for the purposes of collection and analysis. As Hadoop becomes a common technology for sophisticated analysis and transformation of petabytes of structured and unstructured data, the task of moving data in and out efficiently becomes more important and writing transformation jobs becomes more complicated. Talend provides a way to build and automate complex ETL jobs for migration, synchronization, or warehousing tasks. Using Talend's Hadoop capabilities allows users to easily move data between Hadoop and a number of external data locations using over 450 connectors. Also, Talend can simplify the creation of MapReduce transformations by offering a graphical interface to Hive, Pig, and HDFS. In this talk, Cédric Carbone will discuss how to use Talend to move large amounts of data in and out of Hadoop and easily perform transformation tasks in a scalable way.
Scalable ETL with Talend and Hadoop
Talend, Global Leader in Open Source Integration Solutions
Cédric Carbone – Talend CTO – Twitter: @carbone – [email protected]
Why talk about ETL with Hadoop?
Hadoop is complex, and BI consultants often lack the skills to work with it ➜ The biggest issue in a Hadoop project is finding skilled Hadoop engineers
➜ An ETL tool like Talend can help democratize Hadoop
Trying to get from this…
to this…
Talend generates code that is executed within MapReduce. This open approach removes the limitation of a proprietary "engine" to provide a truly unique and powerful set of tools for big data.
Why Talend…
Big Data Production: RDBMS, Analytical DB, NoSQL DB, ERP/CRM, SaaS, Social Media, Web Analytics, Log Files, RFID, Call Data Records, Sensors, Machine-Generated
Big Data Management: Storage, Processing, Filtering (Big Data Integration and Big Data Quality)
Big Data Consumption: Mining, Analytics, Search, Enrichment
Turn Big Data into actionable information
Parsing, Checking
Two methods for inserting data quality into a big data job:
1. Pipelining: as part of the load process
2. Load the cluster, then implement and execute a data quality MapReduce job
Extract – Transform – Load (E-T-L)
Extract – Improve/Cleanse – Load (E-DQ-L)
Pipelining: data quality with big data
• Use traditional data quality tools
• Once and done
(Diagram: CRM, ERP, Finance, Social Networking and Mobile Devices each pass through a DQ step before landing in Big Data)
Big data alternative: load and improve within the cluster
• Load first, improve later
• Complex matching cannot be done outside the cluster
(Diagram: CRM, ERP, Finance, Social Networking and Mobile Devices load straight into Big Data, where DQ runs inside the cluster)
One key DQ rule: matching
➜ Find duplicates within Hadoop
➜ Today's matching algorithms are processor-intensive
➜ Tomorrow's matching algorithms could be more precise, and even more intensive
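To make this concrete, here is a minimal sketch of such a matching job written directly against the Hadoop MapReduce API. It is not Talend-generated code; the CSV layout and the name/email fields used as the match key are assumptions. The map phase emits a normalized key per record, and the reduce phase keeps one survivor per group of duplicates.

// Minimal duplicate-detection sketch using the Hadoop MapReduce API.
// Assumption: input lines are CSV records whose first two fields are name and email.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DedupJob {

  // Map: emit a normalized match key (lower-cased name + email) with the raw record.
  public static class MatchKeyMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length < 2) return;                       // skip malformed lines
      String matchKey = (fields[0] + "|" + fields[1]).trim().toLowerCase();
      context.write(new Text(matchKey), value);
    }
  }

  // Reduce: records sharing a match key arrive together; keep the first as the survivor.
  public static class KeepFirstReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text record : values) {
        context.write(key, record);                        // survivor record
        break;                                             // drop the duplicates
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "dedup");
    job.setJarByClass(DedupJob.class);
    job.setMapperClass(MatchKeyMapper.class);
    job.setReducerClass(KeepFirstReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

In a Talend job the equivalent map and reduce steps would be generated from graphical components rather than written by hand; fuzzier matching only changes how the match key is computed.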
What is Hadoop?
What’s hadoop
➜ The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
➜ A Java framework for storing and running data transformations on large clusters of commodity hardware
➜ Licensed under the Apache v2 license
➜ Created from Google's MapReduce, BigTable and Google File System (GFS) papers
Hadoop ecosystem
➜ Sqoop ➜ Pig ➜ Hive ➜ Oozie ➜ HCatalog ➜ Mahout ➜ HBase
Talend for Big Data: the Hadoop story
4.0 [April 2010]: put and get data in Hadoop through HDFS connectors
4.1 [Oct 2010]: Hadoop queries (Hive); bulk load & fast export to Hadoop (Sqoop)
4.2 [May 2011]: transformation (Pig)
5.0 [Nov 2011]: HBase NoSQL; extended tPig* components
5.1 [May 2012]: metadata (HCatalog); deployment & scheduling (Oozie); embedded into HDP
5.2 [Oct 2012]: visual ELT mapping (Hive); data lineage & impact analysis
5.3 [June 2013]: visual Pig mapping; machine learning (Mahout); native MapReduce code generation
Democratizing Integration with Data Integration tools for Big Data
WordCount
➜ WordCount • Comes with Hadoop • The "first" demo that everyone tries!
➜ How-to in Talend Big Data • Simple read, count, load results • No coding, just drag-and-drop • Runs remotely
(Diagram: map and reduce tasks distributed across Data Node 1 and Data Node 2)
In Java
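For reference, here is a minimal sketch of the classic Hadoop WordCount in Java, the same read, count and load logic that the graphical job reproduces without coding:

// Classic Hadoop WordCount: the logic a graphical Talend flow reproduces without coding.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);                  // emit (word, 1) for every token
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : values) sum += count.get();
      context.write(key, new IntWritable(sum));    // emit (word, total count)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);     // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}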
WordCount: how-to with a graphical ETL
Thank You!
Let us show you…
Talend Open Studio
Generate pure MapReduce
Pig Latin script generation
FOREACH tPigMap_1_out1_RESULT GENERATE $4 AS Revenu , $6 AS Label
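The statement above is a fragment of a generated Pig Latin script. As a hedged illustration of how such a script can be driven from Java, here is a small sketch using Pig's PigServer API; the input path, delimiter, alias names and field positions are assumptions, not Talend output.

// Running a Pig Latin projection (like the generated FOREACH above) from Java via PigServer.
import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigProjectionDemo {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);   // execute on the cluster, not locally

    // Load the raw data and project two columns, mirroring the FOREACH ... GENERATE above.
    pig.registerQuery("orders = LOAD '/user/talend/input/orders' USING PigStorage(';');");
    pig.registerQuery("result = FOREACH orders GENERATE $4 AS Revenu, $6 AS Label;");

    // Iterating over the result triggers the underlying MapReduce execution.
    Iterator<Tuple> it = pig.openIterator("result");
    while (it.hasNext()) {
      System.out.println(it.next());
    }
    pig.shutdown();
  }
}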
HiveQL generation
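The slide showed generated HiveQL; as an illustration of how such statements reach the cluster, here is a minimal sketch that submits a query through Hive's JDBC driver (HiveServer2). The host, table and column names are assumptions, not Talend output.

// Submitting HiveQL over JDBC; Hive compiles the query into MapReduce jobs on the cluster.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");    // HiveServer2 JDBC driver
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hadoop-master:10000/default", "hive", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT label, SUM(revenue) FROM sales GROUP BY label")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
      }
    }
  }
}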
HDFS Management and Sqoop
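Under the hood, HDFS put and get operations come down to calls on Hadoop's FileSystem API. A minimal sketch, assuming a hypothetical NameNode URI and paths, of copying a local file into the cluster and streaming it back:

// Basic HDFS management with the Hadoop FileSystem API. URI and paths are hypothetical.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutGet {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://hadoop-master:8020"), conf);

    // "Put": copy a local file into HDFS (what an HDFS output component boils down to).
    fs.copyFromLocalFile(new Path("/tmp/customers.csv"),
                         new Path("/user/talend/input/customers.csv"));

    // "Get": stream the file back out of the cluster.
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
             fs.open(new Path("/user/talend/input/customers.csv"))))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
    fs.close();
  }
}

For relational sources, a Sqoop import plays the analogous role, bulk-loading database tables into HDFS instead of copying files.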
Apache Mahout
➜ Big Data can also be a blob of data to an organization
➜ Apache Mahout provides algorithms for understanding data: data mining
➜ "You don't know what you don't know", and Mahout will tell you
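As one concrete example of the kind of mining Mahout offers, here is a small sketch of its user-based recommender (the Taste API). The ratings file and its userID,itemID,rating layout are assumptions.

// User-based collaborative filtering with Mahout's Taste API.
// Assumption: ratings.csv contains lines of the form userID,itemID,rating.
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderDemo {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 item recommendations for user 42, scored from similar users' ratings.
    List<RecommendedItem> items = recommender.recommend(42, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " (score " + item.getValue() + ")");
    }
  }
}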
Metadata Management
➜ Centralized metadata repository for the Hadoop cluster, HDFS, Hive… • Versioning • Impact analysis and data lineage
➜ HCatalog across • HDFS • Hive • Pig
Thank You!
Choose your Hadoop distro
➜ Widely adopted ➜ Management tooling is not OSS
➜ Fully open source ➜ Strong developer ecosystem
➜ More proprietary ➜ GTM partner with AWS
➜ Many more are coming
Choose your Hadoop distro
Distros provide tooling: ➜ for installation ➜ for server monitoring. But: ➜ no GUI for parsing, transforming and easily loading, and no data management
Parse and Standardize
➜ Big Data is not always structured ➜ Correct big data so that it conforms to the same rules
Profiling & monitoring DQ
Implications for Integration
DATA – Challenge: the information explosion increases the complexity of integration and requires governance to maintain data quality. Requirement: information processing must scale
Implications for Integration
APPLICATION – Challenge: brittle, point-to-point connections cannot adapt to evolving business requirements, new channels, and quickly changing topologies. Requirement: application architecture must scale
Implications for Integration
PROCESS – Challenge: competitive market forces drive frequent process changes and increased process complexity. Requirement: business processes must scale
Implications for Integration
Challenge: interdependencies across data, applications and processes require more resources and budget. Requirement: resources and skill sets must scale
Integration at Any Scale
True scalability for • Any integration challenge • Any data volume • Any project size. Enables integration convergence.
Technology that Scales
CODE GENERATOR – No black-box engine means faster maintenance and deployment with improved quality. For Big Data, the "engine" is Hadoop itself, letting jobs scale out with the cluster.
Generated code targets: Java, SQL, MapReduce, Camel, …
STANDARDS-BASED – Easy to learn, flexible to adopt, reduces vendor lock-in
ELT (SQL code gen): • Teradata • Netezza • Vertica
ETL: • Java code • Partitioning • Parallelization
Big Data: • Visual wizard • Hive/Pig/MR code gen • HDFS, Sqoop, Oozie…
Technology Continuum…
Google BigQuery
NoSQL: • MongoDB • Neo4j • Cassandra • HBase • Amazon Redshift
At a glance
▶ Founded in 2005
▶ Offers highly scalable integration solutions addressing Data Integration, Data Quality, MDM, ESB and BPM
▶ Provides: § Subscriptions including 24/7 support and indemnification § Worldwide training and services
▶ Recognized as the open source leader in each of its market categories
Talend today
➜ 400 employees in 7 countries with dual HQ in Los Altos, CA and Paris, France
➜ Over 4,000 paying customers across different industry verticals and company sizes
➜ Backed by Silver Lake Sumeru, Balderton Capital and Idinvest Partners
Talend Overview
Brand Awareness: 20 million downloads
Adoption: 1,000,000 users
Monetization: 4,000 customers
Market Momentum: +50 new customers / month
High growth through a proven model
Talend’s Unique Integration Solution
1. Studio: comprehensive Eclipse-based user interface
2. Repository: consolidated metadata & project information
3. Deployment: web-based deployment & scheduling
4. Execution: same container for batch processing, message routing & services
5. Monitoring: single web-based monitoring console
Talend Unified Platform
Data Quality
Data Integration
MDM
ESB
BPM
➜ Reduce costs ➜ Eliminate risk ➜ Reuse skills ➜ Economies of scale ➜ Incremental adoption
Best-of-Breed Solutions + Talend Unified Platform = Unique Integration Solution
Recognized as the open source leader in each of its market categories by all industry analysts
Solutions that Scale
TALEND UNIFIED PLATFORM
Data Integration
Studio, Execution, Monitoring, Deployment, Repository
Big Data
Data Quality
ESB
MDM
BPM
UNIFIED PLATFORM – A shared foundation and toolset increases resource reuse
CONVERGED INTEGRATION – Use for any data, application and process project
The 6 Dimensions of BIG Data
Primary challenges: • Volume • Velocity • Variety
And also: • Complexity • Validation • Lineage
What Is BIG Data?
"Big data" is informa7on of extreme size, diversity, complexity and need for rapid processing.
Ted Friedman - Information Infrastructure and Big Data Projects Key Initiative Overview - July 2011
275 exabytes of data flowing over the Internet each day (275,000,000,000,000,000,000 bytes)
200 billion intelligent devices (200,000,000,000)
1,000,000 transactions per day at Walmart
3,500 tweets per second (June 2011)
What is Big Data?
How to define Big Data: volume, variety, velocity
Key Takeaway #1
Hans Rosling uses big data to analyze world health trends
Traditional Data Flows
• Scheduled: daily or weekly, sometimes more frequently
• Volumes rarely exceed terabytes
(Diagram: CRM, ERP and Finance data passes through ETL and Data Quality into normalized data in a traditional data warehouse, consumed by business analysts, business users, warehouse administrators and executives)
The new world of big data
(Diagram, built up across three slides: Big Data is fed first by CRM, ERP, Finance and Social Networking, then also by Mobile Devices, and finally also by Transactions, Network Devices and Sensors)
Forces us to think differently
Key Takeaway #2
Data-driven business
(Diagram: data supports information, information enables decisions, decisions drive your business, and governance closes the loop)
Information provides value to the business. If you can't rely on your information, the result can be missed opportunities or higher costs.
Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).
BIG data-driven business
(Diagram: the same cycle at scale: BIG data supports BIG information, which enables BIG decisions, which drive BIG business, under governance)
How is big data integration being used?
Use Cases • Recommendation Engine • Sentiment Analysis • Risk Modeling • Fraud Detection • Behavior Analysis • Marketing Campaign Analysis • Customer Churn Analysis • Social Graph Analysis • Customer Experience Analytics • Network Monitoring
BUT: to what level is DQ required for your use case?
Key Takeaway #3: Define your use case