Presentation at Hadoop user summit in Bangalore on Feb 16, 2011 by Sanjay Kaluskar
Data Integration on Hadoop
Sanjay Kaluskar, Senior Architect, Informatica
Feb 2011
Introduction
• Challenges
  • Results of analysis or mining are only as good as the completeness & quality of the underlying data
  • Need for the right level of abstraction & tools
• Data integration & data quality tools have tackled these challenges for many years!
More than 4,200 enterprises worldwide rely on Informatica
[Diagram: today's data-integration landscape. Data sources (files, databases, applications) are reached through many access methods & languages (SQL, Transact-SQL, PL/SQL, ODBC, JDBC, JMS, C/C++, OCI, BAPI, XQuery, Java, web services, CLI) using developer tools such as Word, Excel, Notepad, and vi. Alongside sits the Hadoop stack: HDFS, HBase, PIG, Hive, Java, Sqoop. Key concerns: developer productivity and vendor neutrality/flexibility.]
Lookup example

HDFS file:
  'Bangalore', …, 234, …
  'Chennai', …, 82, …
  'Mumbai', …, 872, …
  'Delhi', …, 11, …
  'Chennai', …, 43, …
  'xxx', …, 2, …

Database table:
  Dept id  Name
  82       Stationery
  2        Clothing
  11       Jewellery

Your choices:
• Move the table to HDFS using Sqoop and join
  • Could use PIG/Hive to leverage the join operator
• Implement Java code to look up the database table
  • Need to use an access method based on the vendor
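The second choice (hand-written Java lookup code) can be sketched in plain Java. Everything here is illustrative: the class name, the record layout, and the hard-coded map, which stands in for the vendor-specific JDBC/OCI query a real implementation would need.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the hand-coded option: the department table is fetched once
// (hard-coded here in place of a vendor-specific database call) and cached
// in a map, then each HDFS record is enriched by an in-memory lookup.
public class DeptLookupJoin {

    // In a real implementation this map would be populated via JDBC, OCI, etc.
    static final Map<Integer, String> DEPT_NAMES = new HashMap<>();
    static {
        DEPT_NAMES.put(82, "Stationery");
        DEPT_NAMES.put(2, "Clothing");
        DEPT_NAMES.put(11, "Jewellery");
    }

    // Enrich a comma-separated record whose third field is the department id.
    public static String enrich(String record) {
        String[] fields = record.split(",\\s*");
        int deptId = Integer.parseInt(fields[2]);
        return record + ", " + DEPT_NAMES.getOrDefault(deptId, "UNKNOWN");
    }

    public static void main(String[] args) {
        List<String> hdfsLines = List.of(
                "'Bangalore', x, 234",
                "'Chennai', x, 82",
                "'Delhi', x, 11");
        for (String line : hdfsLines) {
            System.out.println(enrich(line));
        }
    }
}
```

Note the drawback the slide points out: this code must be rewritten for each database vendor's access method, whereas the Sqoop/PIG route stays vendor-neutral.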
Or… leverage Informatica’s Lookup

a = load 'RelLookupInput.dat' as (deptid: double);
b = foreach a generate flatten(com.mycompany.pig.RelLookup(deptid));
store b into 'RelLookupOutput.out';
Or… you could start with a mapping

[Mapping diagram: Load → Filter → Store]
Goals of the prototype
• Enable Hadoop developers to leverage Data Transformation and Data Quality logic
  • Ability to invoke mapplets from Hadoop
• Lower the barrier to Hadoop entry by using Informatica Developer as the toolset
  • Ability to run a mapping on Hadoop
Mapplet Invocation
• Generation of a UDF of the right type
  • Output-only mapplet → Load UDF
  • Input-only mapplet → Store UDF
  • Input/output mapplet → Eval UDF
• Packaging into a jar
  • Compiled UDF
  • Other metadata: connections, reference tables
• Invokes the Informatica engine (DTM) at runtime
Mapplet Invocation (contd.)
• Challenges
  • UDF execution is per-tuple; mapplets are optimized for batch execution
  • Connection info/reference data need to be plugged in
  • Runtime dependencies: 280 jars, 558 native dependencies
• Benefits
  • PIG user can leverage Informatica functionality
    • Connectivity to many (50+) data sources
    • Specialized transformations
  • Re-use of already developed logic
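The per-tuple vs. batch mismatch listed under the challenges can be bridged by buffering. The following is a plain-Java sketch of the idea, not Informatica's actual implementation: tuples arriving one at a time (as a Pig Eval UDF sees them) are handed to a batch-oriented engine in chunks, amortizing the per-invocation overhead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Bridges a per-tuple call interface to a batch-optimized engine by
// buffering tuples and flushing them in fixed-size chunks.
public class BatchingBridge<T> {
    private final int batchSize;
    private final Consumer<List<T>> batchProcessor; // stands in for the engine (DTM) call
    private final List<T> buffer = new ArrayList<>();

    public BatchingBridge(int batchSize, Consumer<List<T>> batchProcessor) {
        this.batchSize = batchSize;
        this.batchProcessor = batchProcessor;
    }

    // Called once per tuple, as a per-tuple UDF would be.
    public void accept(T tuple) {
        buffer.add(tuple);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Must be called when the input is exhausted, to drain the final partial batch.
    public void flush() {
        if (!buffer.isEmpty()) {
            batchProcessor.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        BatchingBridge<Integer> bridge = new BatchingBridge<>(3,
                batch -> System.out.println("engine got " + batch.size() + " tuples"));
        for (int i = 0; i < 7; i++) bridge.accept(i);
        bridge.flush(); // 7 tuples arrive as batches of 3, 3, and 1
    }
}
```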
Mapping Deployment: Idea
• Leverage PIG
  • Map to equivalent operators where possible
  • Let the PIG compiler optimize & translate to Hadoop jobs
• Wrap some transformations as UDFs
  • Transformations with no equivalents, e.g., standardizer, address validator
  • Transformations with richer functionality, e.g., case-insensitive sorter
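As an illustration of the last bullet: a case-insensitive sort is richer than what core PIG's ORDER BY offers (which, to my understanding, compares chararrays case-sensitively), so it would be wrapped as a UDF rather than translated to a native operator. A plain-Java sketch of the comparison logic such a UDF would encapsulate (class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class CaseInsensitiveSort {
    // Returns a copy of the input sorted without regard to letter case.
    public static List<String> sorted(List<String> in) {
        List<String> out = new ArrayList<>(in);
        out.sort(String.CASE_INSENSITIVE_ORDER);
        return out;
    }

    public static void main(String[] args) {
        // A case-sensitive sort would yield [Bravo, alpha, delta] since 'B' < 'a';
        // the case-insensitive comparator yields [alpha, Bravo, delta].
        System.out.println(sorted(List.of("delta", "Bravo", "alpha")));
    }
}
```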
Leveraging PIG Operators
Leveraging Informatica Transformations

[Diagram: a PIG script mixing native PIG operators with Informatica transformations translated to PIG UDFs: source UDFs feeding a Lookup UDF and a Case Converter UDF, ending in a target UDF.]
Mapping Deployment
• Design
  • Leverages PIG operators where possible
  • Wraps other transformations as UDFs
  • Relies on optimization by the PIG compiler
• Challenges
  • Finding equivalent operators and expressions
  • Limitations of the UDF model: no notion of a user-defined operator
• Benefits
  • Re-use of already developed logic
  • Easy way for Informatica users to start using Hadoop; they can also use the designer
Informatica & Hadoop: Big Picture

[Diagram: data from weblogs, enterprise applications, databases, and semi-structured/unstructured sources flows into a Hadoop cluster (HDFS name node, data nodes, job tracker) and out to BI and DW/DM targets. Informatica contributes a graphical IDE for Hadoop development, enterprise connectivity for Hadoop programs, a transformation engine for custom data processing, and a metadata repository.]
Developer tools

[Diagram, revisiting slide 3: the same data sources, access methods & languages, and developer tools, now with Informatica layered over the Hadoop stack (HDFS, HBase, PIG, Hive, Java, Sqoop). Key points: developer productivity (connectivity, rich transforms, designer tool) and vendor neutrality/flexibility, without losing performance.]
Informatica Extras…
• Specialized transformations
  • Matching
  • Address validation
  • Standardization
• Connectivity
• Other tools
  • Data federation
  • Analyst tool
  • Administration
  • Metadata manager
  • Business glossary
Hadoop Connector for Enterprise data access
• Opens up all the connectivity available from Informatica for Hadoop processing
  • Sqoop-based connectors
  • Hadoop sources & targets in mappings
• Benefits
  • Load data from enterprise data sources into Hadoop
  • Extract summarized data from Hadoop to load into DW and other targets
  • Data federation
Informatica Developer tool for Hadoop

[Diagram: an Informatica developer builds Hadoop mappings in the Informatica Developer tool (Hadoop designer plus metadata repository, backed by the eDTM) and deploys them to the Hadoop cluster: mappings become PIG scripts, mapplets etc. become PIG UDFs. On the data nodes the PIG script runs with UDFs executed by the Informatica eDTM, providing complex transformations (address cleansing, dedup/matching, hierarchical data parsing) and enterprise data access.]
Reuse Informatica Components in Hadoop

[Diagram: a Hadoop developer invokes Informatica transformations from MapReduce/PIG scripts. Mapplets are exposed as PIG UDFs executed by the Informatica eDTM on the data nodes, providing complex transformations (dedupe/matching, hierarchical data parsing) and enterprise data access; the mapplets are built in the Informatica Developer tool and stored in the metadata repository.]