Data Ingestion, Extraction, and Preparation for Hadoop. Sanjay Kaluskar, Sr. Architect, Informatica; David Teniente, Data Architect, Rackspace

Data Ingestion, Extraction & Parsing on Hadoop


Page 1: Data Ingestion, Extraction & Parsing on Hadoop


Data Ingestion, Extraction, and Preparation for Hadoop

Sanjay Kaluskar, Sr. Architect, Informatica

David Teniente, Data Architect, Rackspace

Page 2: Data Ingestion, Extraction & Parsing on Hadoop


Safe Harbor Statement

• The information being provided today is for informational purposes only. The development, release and timing of any Informatica product or functionality described today remain at the sole discretion of Informatica and should not be relied upon in making a purchasing decision. Statements made today are based on currently available information, which is subject to change. Such statements should not be relied upon as a representation, warranty or commitment to deliver specific products or functionality in the future.

• Some of the comments we will make today are forward-looking statements including statements concerning our product portfolio, our growth and operational strategies, our opportunities, customer adoption of and demand for our products and services, the use and expected benefits of our products and services by customers, the expected benefit from our partnerships and our expectations regarding future industry trends and macroeconomic development.

• All forward-looking statements are based upon current expectations and beliefs. However, actual results could differ materially. There are many reasons why actual results may differ from our current expectations. These forward-looking statements should not be relied upon as representing our views as of any subsequent date and Informatica undertakes no obligation to update forward-looking statements to reflect events or circumstances after the date that they are made.

• Please refer to our recent SEC filings including the Form 10-Q for the quarter ended September 30th, 2011 for a detailed discussion of the risk factors that may affect our results. Copies of these documents may be obtained from the SEC or by contacting our Investor Relations department.

Page 3: Data Ingestion, Extraction & Parsing on Hadoop

The Hadoop Data Processing Pipeline: Informatica PowerCenter + PowerExchange

[Diagram: the Hadoop data processing pipeline]

1. Ingest Data into Hadoop
2. Parse & Prepare Data on Hadoop
3. Transform & Cleanse Data on Hadoop
4. Extract Data from Hadoop

Data shown: Product & Service Offerings, Customer Profile, Social Media, Account Transactions, Customer Service Logs & Surveys, Marketing Campaigns
Systems shown: Sales & Marketing Data mart, Customer Service Portal
Highlighted: PowerCenter + PowerExchange
Legend: Available Today, 1H / 2012

Page 4: Data Ingestion, Extraction & Parsing on Hadoop

Options

Structured data (e.g. OLTP, OLAP)
• Ingest/Extract Data: Informatica PowerCenter + PowerExchange, Sqoop
• Parse & Prepare Data: N/A
• Transform & Cleanse Data: Hive, PIG, MR; Future: Informatica Roadmap

Unstructured and semi-structured data (e.g. web logs, JSON)
• Ingest/Extract Data: Informatica PowerCenter + PowerExchange, copy files, Flume, Scribe, Kafka
• Parse & Prepare Data: Informatica HParser, PIG/Hive UDFs, MR
• Transform & Cleanse Data: Hive, PIG, MR; Future: Informatica Roadmap

Page 5: Data Ingestion, Extraction & Parsing on Hadoop

Unleash the Power of Hadoop With High Performance Universal Data Access

• Messaging, and Web Services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, Web Services, TIBCO, webMethods
• Packaged Applications: JD Edwards, Lotus Notes, Oracle E-Business, PeopleSoft, SAP NetWeaver, SAP NetWeaver BI, SAS, Siebel
• Relational and Flat Files: Oracle, DB2 UDB, DB2/400, SQL Server, Sybase, Informix, Teradata, Netezza, ODBC, JDBC
• Mainframe and Midrange: ADABAS, Datacom, DB2, IDMS, IMS, VSAM, C-ISAM, Binary Flat Files, Tape Formats…
• Unstructured Data and Files: Word, Excel, PDF, StarOffice, WordPerfect, Email (POP, IMAP), HTTP, Flat files, ASCII reports, HTML, RPG, ANSI, LDAP
• Industry Standards: EDI–X12, EDI-Fact, RosettaNet, HL7, HIPAA, AST, FIX, Cargo IMP, MVR
• XML Standards: XML, LegalXML, IFX, cXML, ebXML, HL7 v3.0, ACORD (AL3, XML)
• SaaS/BPO: Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP By Design, Oracle OnDemand
• Social Media: Facebook, Twitter, LinkedIn
• MPP Appliances: EMC/Greenplum, Vertica, AsterData

Page 6: Data Ingestion, Extraction & Parsing on Hadoop

Ingest Data

[Diagram: ingesting data into Hadoop]

Sources: Web server; Databases, Data Warehouse; Message Queues, Email, Social Media; ERP, CRM; Mainframe
Modes: Batch, CDC, Real-time
Stages: Access Data, Pre-Process (e.g. Filter, Join, Cleanse), Ingest Data
Products: PowerExchange, PowerCenter
Targets: HDFS, HIVE

• Reuse PowerCenter mappings

Page 7: Data Ingestion, Extraction & Parsing on Hadoop

Extract Data

[Diagram: extracting data from Hadoop]

Source: HDFS
Mode: Batch
Stages: Extract Data, Post-Process (e.g. Transform to target schema), Deliver Data
Products: PowerCenter, PowerExchange
Destinations: Web server; ERP, CRM; Databases, Data Warehouse; Mainframe

• Reuse PowerCenter mappings

Page 8: Data Ingestion, Extraction & Parsing on Hadoop

1. Create Ingest or Extract Mapping

2. Create Hadoop Connection

3. Configure Workflow

4. Create & Load Into Hive Table

Page 9: Data Ingestion, Extraction & Parsing on Hadoop

The Hadoop Data Processing Pipeline: Informatica HParser

[Diagram: the Hadoop data processing pipeline]

1. Ingest Data into Hadoop
2. Parse & Prepare Data on Hadoop
3. Transform & Cleanse Data on Hadoop
4. Extract Data from Hadoop

Data shown: Product & Service Offerings, Customer Profile, Social Media, Account Transactions, Customer Service Logs & Surveys, Marketing Campaigns
Systems shown: Sales & Marketing Data mart, Customer Service Portal
Highlighted: HParser
Legend: Available Today, 1H / 2012


Page 11: Data Ingestion, Extraction & Parsing on Hadoop

Informatica HParser
Productivity: Data Transformation Studio

Page 12: Data Ingestion, Extraction & Parsing on Hadoop

Informatica HParser
Productivity: Data Transformation Studio

Supported formats:
• Financial: SWIFT MT, SWIFT MX, NACHA, FIX, Telekurs, FpML, BAI – V2.0\Lockbox, CREST DEX, IFX, TWIST, UNIFI (ISO 20022), SEPA, FIXML, MISMO
• B2B Standards: UN\EDIFACT, EDI-X12, EDI ARR, EDI UCS+WINS, EDI VICS, RosettaNet, OAGI
• Healthcare: HL7, HL7 V3, HIPAA, NCPDP, CDISC
• Insurance: DTCC-NSCC, ACORD-AL3, ACORD XML
• Other: IATA-PADIS, PLMXML, NEIM

• Easy example based visual enhancements and edits
• Definition is done using Business (industry) terminology and definitions
• Enhanced Validations
• Out of the box transformations for all messages in all versions
• Updates and new versions delivered from Informatica

Page 13: Data Ingestion, Extraction & Parsing on Hadoop

Informatica HParser: How does it work?

[Diagram labels: Hadoop cluster, HDFS, Svc Repository, S, S]

1. Develop an HParser transformation
2. Deploy the transformation
3. Run HParser on Hadoop to produce tabular data
4. Analyze the data with HIVE / PIG / MapReduce / Other

hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt
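To make "produce tabular data" concrete, here is a minimal Python sketch of what a parsing step does in general: it turns semi-structured text lines into tab-separated rows that Hive or Pig could read. This is not HParser and does not use any Informatica API. The log format (Apache Common Log Format), the column names, and the function names are all assumptions chosen for illustration.

```python
import re

# Assumption: input lines follow Apache Common Log Format. The slides do not
# say which log format is parsed; this pattern is only an illustration.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Hypothetical column order for the tabular output.
COLUMNS = ("host", "time", "method", "path", "status", "size")


def parse_line(line):
    """Return the fields of one log line as a tuple, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return tuple(match.group(name) for name in COLUMNS)


def to_tabular(lines):
    """Yield one tab-separated row per parseable line and skip the rest."""
    for line in lines:
        fields = parse_line(line.strip())
        if fields is not None:
            yield "\t".join(fields)


sample = [
    '10.0.0.1 - - [10/Oct/2011:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326',
    "not a log line",
]
rows = list(to_tabular(sample))
for row in rows:
    print(row)
```

Unparseable lines are dropped silently here to keep the sketch short; a production parser would more likely route them to a reject file for inspection.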

Page 14: Data Ingestion, Extraction & Parsing on Hadoop

The Hadoop Data Processing Pipeline: Informatica Roadmap

[Diagram: the Hadoop data processing pipeline]

1. Ingest Data into Hadoop
2. Parse & Prepare Data on Hadoop
3. Transform & Cleanse Data on Hadoop
4. Extract Data from Hadoop

Data shown: Product & Service Offerings, Customer Profile, Social Media, Account Transactions, Customer Service Logs & Surveys, Marketing Campaigns
Systems shown: Sales & Marketing Data mart, Customer Service Portal
Legend: Available Today, 1H / 2012


Page 16: Data Ingestion, Extraction & Parsing on Hadoop

Informatica Hadoop Roadmap – 1H 2012

• Process data on Hadoop
  • IDE, administration, monitoring, workflow
  • Data processing flow designed through IDE: Source/Target, Filter, Join, Lookup, etc.
  • Execution on Hadoop cluster (pushdown via Hive)
• Flexibility to plug in custom code
  • Hive and PIG UDFs
  • MR scripts
• Productivity with optimal performance
  • Exploit Hive performance characteristics
  • Optimize end-to-end data flow for performance

Page 17: Data Ingestion, Extraction & Parsing on Hadoop

Mapping for Hive execution

[Screenshot label: Source]

• Logical representation of processing steps
• Validate & configure for Hive translation
• Preview generated Hive code

INSERT INTO STG0 SELECT * FROM StockAnalysis0;
INSERT INTO STG1 SELECT * FROM StockAnalysis1;
INSERT INTO STG2 SELECT * FROM StockAnalysis2;
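The idea of translating a logical mapping into Hive statements can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Step` class and `to_hive` function are hypothetical, are not part of any Informatica product, and model nothing beyond a source, a target, and an optional filter. The statement shape simply mirrors the `INSERT INTO … SELECT * FROM …` statements shown on this slide; depending on the Hive version, the `TABLE` keyword may be required after `INSERT INTO`.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One logical processing step (hypothetical model): copy rows from source to target."""
    source: str
    target: str
    where: str = ""  # optional filter condition, empty means no filter


def to_hive(step):
    """Render one logical step as an INSERT ... SELECT statement string."""
    sql = f"INSERT INTO {step.target} SELECT * FROM {step.source}"
    if step.where:
        sql += f" WHERE {step.where}"
    return sql + ";"


# Rebuild the three statements shown on the slide from a logical description.
steps = [Step(source=f"StockAnalysis{i}", target=f"STG{i}") for i in range(3)]
statements = [to_hive(step) for step in steps]
for statement in statements:
    print(statement)
```

Keeping the logical steps separate from the generated text is what makes a preview possible: the same step list can be validated or rendered for review before anything runs on the cluster.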

Page 18: Data Ingestion, Extraction & Parsing on Hadoop

Takeaways

• Universal connectivity
  • Completeness and enrichment of raw data for holistic analysis
  • Prevent Hadoop from becoming another silo accessible to a few experts
• Maximum productivity
  • Collaborative development environment
  • Right level of abstraction for data processing logic
  • Re-use of algorithms and data flow logic
• Meta-data driven processing
  • Document data lineage for auditing and impact analysis
  • Deploy on any platform for optimal performance and utilization

Page 19: Data Ingestion, Extraction & Parsing on Hadoop

Customer Sentiment - Reaching beyond NPS (Net Promoter Score) and surveys

Gaining insight into our customers’ sentiment will improve Rackspace’s ability to provide Fanatical Support™

Objectives:
• What are “they” saying?
• Gauge the level of sentiment
• Fanatical Support™ for the win
  • Increase NPS
  • Increase MRR
  • Decrease Churn
  • Provide the right products
  • Keep our promises

Page 20: Data Ingestion, Extraction & Parsing on Hadoop

Customer Sentiment Use Cases
Pulling it all together

• Case 1: Match social media posts with Customer. Determine a probable match.
• Case 2: Determine the sentiment of a post, searching key words and scoring the post.
• Case 3: Determine correlations between posts, ticket volume and NPS leading to negative or positive sentiments.
• Case 4: Determine correlations in sentiments with products/configurations which lead to negative or positive sentiments.
• Case 5: The ability to trend all inputs over time…
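Case 2 (searching key words and scoring the post) can be sketched as a simple keyword lookup. The slides do not describe the actual scoring method, so everything below is an assumption for illustration: the keyword list, the weights, the function names, and the three-way label.

```python
import re

# Hypothetical keyword weights; a real lexicon would be larger and tuned
# to the vocabulary customers actually use.
KEYWORD_SCORES = {
    "love": 2, "great": 1, "fanatical": 1,
    "slow": -1, "down": -1, "outage": -2, "hate": -2,
}


def score_post(text):
    """Sum the weights of the known keywords that appear in the post."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(KEYWORD_SCORES.get(word, 0) for word in words)


def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


posts = [
    "Love the great support",
    "Another outage, site is down",
    "Just moved servers",
]
results = [(post, score_post(post), label(score_post(post))) for post in posts]
for post, score, sentiment in results:
    print(f"{sentiment:8} {score:+d}  {post}")
```

A plain keyword sum ignores negation and sarcasm ("not great" still scores as positive), so it is best read as a first-pass filter rather than a finished sentiment model.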

Page 21: Data Ingestion, Extraction & Parsing on Hadoop

Rackspace Fanatical Support™ Big Data Environment

[Diagram: layers of the environment]

• BI Stack
• BI Analytics
• Search, Analytics, Algorithmic
• Greenplum DB
• Hadoop HDFS
• Message bus / port listening
• Data Sources (DBs, Flat files, Data Streams)
  • Oracle, MySql, MS SQL, Postgres, DB2
  • Excel, CSV, Flat File, XML
  • EDI, Binary, Sys Logs, Messaging, APIs

Analytics paths: Indirect Analytics over Hadoop; Direct Analytics over Hadoop

Page 22: Data Ingestion, Extraction & Parsing on Hadoop

Twitter Feed for Rackspace: Using Informatica

[Screenshot: Input Data, Output Data]
