DESCRIPTION
Speaker: Informatica Senior Product Consultant | 尹寒柏. Session overview: In the Big Data era, the real competition is not over the quantity of data but over how deeply you understand it. Now that Big Data technology has matured, CXOs without an IT background can turn CI (Customer Intelligence), once little more than a buzzword, from a noun into a verb: moving from BI to CI, connecting with the pulse of the consumer economy, and gaining insight into customer intent. One mindset matters in the Big Data era: in the end the competition is not simply about growth in data volume, but about who understands their data more deeply, and Informatica is the best answer to that challenge. With Informatica, enterprises relieve the enormous pressure of delivering trustworthy data on time; and as data volume and complexity keep rising, Informatica can also aggregate data faster, making it meaningful and usable for improving efficiency, raising quality, ensuring certainty, and playing to strengths. Informatica provides a faster and more effective way to achieve this goal, and is the best tool for 精誠集團 (SYSTEX Group) in the Big Data era.
Senior Product Specialist
Big Data and Hadoop
The Challenge: Data fragmentation becomes the barrier to business success
[Timeline chart: the evolution of computing. Technology eras run from Mainframe (1960s-1970s, OS/360) through Client-Server (1980s), Web (1990s), Cloud (2007), and Social (2011) to the Internet of Things (2014); users and sources grow from a few employees (~10^2) through many employees, customers/consumers, business ecosystems, and communities & society to devices & machines (~10^11); business value evolves across back-office automation, front-office productivity, e-commerce, line-of-business self-service, social engagement, and real-time optimization.]
[Diagram: data mart sprawl, with many separate data marts fed by batch ETL, illustrating how data fragments across the enterprise.]
Big Data Challenges: Volume, Variety, Velocity, Veracity
Where is the data I need? Can I trust this data?
Source data (Transactions, OLTP, OLAP; Social Media, Web Logs; Machine/Device, Scientific; Documents and Emails) flowing into analytic systems such as the Enterprise Data Warehouse.
80% of the work in big data projects is data integration and quality.
“I spend more than half my time integrating, cleansing, and transforming data without doing any actual analysis.”
“80% of the work in any data project is in cleaning the data.”
“70% of my value is an ability to pull the data, 20% of my value is using data-science…”
Why Informatica for Big Data & Hadoop
PowerCenter Big Data Edition
Big Transaction Data:
• Online Transaction Processing (OLTP): Oracle, DB2, Ingres, Informix, Sybase, SQL Server, …
• Online Analytical Processing (OLAP) & DW appliances: Teradata, Redbrick, Essbase, Sybase IQ, Netezza, Exadata, HANA, Greenplum, DATAllegro, Aster Data, Vertica, ParAccel, …
• Cloud: Salesforce.com, Concur, Google App Engine, Amazon, …
Big Interaction Data:
• Social Media & Web Data: Facebook, Twitter, LinkedIn, YouTube, …; web applications, blogs, discussion forums, communities, partner portals, …
• Other Interaction Data: clickstream, image/text, scientific, genomic/pharma, medical/device, sensors/meters, RFID tags, CDR/mobile, …
Big Data Processing
• Universal Data Access
• High-Speed Data Ingestion and Extraction
• ETL on Hadoop
• Profiling on Hadoop
• Complex Data Parsing on Hadoop
• Entity Extraction and Data Classification on Hadoop
• No-Code Productivity
• Business-IT Collaboration
• Unified Administration
The Vibe™ virtual data machine (9.6)
Get Data Into and Out of Hadoop: PowerExchange for Hadoop, Replication to Hadoop, Streaming to Hadoop, Data Archiving to Hadoop
Data Ingestion and Extraction: moving terabytes of data per hour.
• Methods: Replicate, Streaming, Batch Load, Extract, Archive/Extract to a low-cost store
• Enterprise systems: Data Warehouse, MDM, Applications
• Data types: Transactions, OLTP, OLAP; Social Media, Web Logs; Documents, Email; Industry Standards; Machine/Device, Scientific
PowerExchange Connectors
Enterprise Applications, Software as a Service (SaaS): JDE EnterpriseOne, JDE World, Lotus Notes, Oracle E-Business Suite ✔, PeopleSoft Enterprise, Salesforce (salesforce.com) ✔, SAP NetWeaver ✔, SAP NetWeaver BI ✔, SAS, Siebel, NetSuite, Microsoft Dynamics
Databases and Data Warehouses: Adabas for UNIX/Windows, C-ISAM, DB2 for LUW ✔, Essbase, EMC/Greenplum, Informix Dynamic Server, Netezza Performance Server, ODBC, Oracle ✔, SQL Server ✔, Sybase, Teradata
Messaging Systems: JMS ✔, MSMQ ✔, TIBCO ✔, webMethods Broker ✔, WebSphere MQ ✔
Technology Standards: Email (POP, IMAP), HTTP(S) ✔, LDAP ✔, Web Services ✔, XML
Mainframe: Adabas for z/OS ✔, Datacom ✔, DB2 for z/OS and z/Linux ✔, IDMS ✔, IMS DB ✔, Oracle for z/Linux ✔, Teradata, WebSphere MQ for z/Linux ✔, VSAM ✔
Big Data: Aster Data, Greenplum, Vertica, ParAccel, Microsoft PDW, Kognitio
Social: Facebook, Twitter, LinkedIn, DataSift, Kapow, MongoDB
Hadoop: HDFS, Hive, HBase
✔ = Accessible in real time and/or via Change Data Capture (CDC)
NoSQL Support for HBase
• Read from HBase as a standard source
• Write to HBase as a standard target
• Complete mappings with HBase sources/targets can execute on Hadoop
• Sample HBase column families (stored in JSON/complex formats)
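To show what reading and writing an HBase column family looks like at the API level, here is a minimal sketch with the standard HBase Java client (1.x+ API), not the Informatica source/target itself; the table name, column family, and JSON payload are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customers"))) {  // hypothetical table

            // Write: one row, one column family ("profile") holding a JSON payload
            Put put = new Put(Bytes.toBytes("cust-0001"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("json"),
                          Bytes.toBytes("{\"name\":\"Acme\",\"country\":\"TW\"}"));
            table.put(put);

            // Read the same cell back, as a source would
            Result result = table.get(new Get(Bytes.toBytes("cust-0001")));
            byte[] value = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("json"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```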
NoSQL Support for MongoDB
• Access, integrate, transform, and ingest MongoDB data into other analytic systems (e.g., Hadoop, data warehouse)
• Access, integrate, transform, and ingest data into MongoDB
• Sample MongoDB data and flatten it to relational format
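For a feel of the "sample and flatten" idea, here is a minimal sketch using the plain MongoDB Java driver rather than the Informatica adapter; the connection string, database, collection, and field names are assumptions.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class MongoSampleFlatten {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> customers =
                    client.getDatabase("crm").getCollection("customers");  // hypothetical names

            System.out.println("customer_id,name,city,zip");               // flattened relational header
            for (Document doc : customers.find().limit(100)) {             // sample the first 100 documents
                Document address = doc.get("address", Document.class);     // nested sub-document
                System.out.printf("%s,%s,%s,%s%n",
                        doc.getObjectId("_id"),
                        doc.getString("name"),
                        address == null ? "" : address.getString("city"),
                        address == null ? "" : address.getString("zip"));
            }
        }
    }
}
```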
IDR (Informatica Data Replication) for Replicating to Hadoop
Supported distributions:
• Apache Hadoop: 0.20.203.x, 0.20.204.x, 0.20.205.x, 0.23.x, 1.0.x, 1.1.x, 2.x.x
• Cloudera: CDH3, CDH4
Extract/apply flow: the source system is extracted to intermediate files in a Cycle_1.work directory (Table 1 file, Table 2 file, … Table N file, plus a Schema.ini file), which are then applied to HDFS.
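To ground the apply step, here is a minimal Java sketch that lands one intermediate table file into a per-cycle HDFS directory using the standard Hadoop FileSystem API. It is not the IDR implementation; the NameNode URI, paths, and record content are assumptions.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ApplyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf)) {
            Path cycleDir = new Path("/replication/cycle_1.work");  // per-cycle work directory
            fs.mkdirs(cycleDir);

            // Write one extracted table file into the cycle directory
            try (FSDataOutputStream out = fs.create(new Path(cycleDir, "table1.dat"))) {
                out.write("1|Acme|2014-01-01\n".getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}
```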
Real-Time Data Collection and Streaming
Ultra Messaging Bus (Publish/Subscribe)
Leverage a high-performance messaging infrastructure: publish with Ultra Messaging for global distribution without additional staging or landing.
Sources: web servers, operations monitors, rsyslog, SLF4J, etc.; handhelds, smart meters, etc. (discrete data messages); Internet of Things and sensor data
Targets: HDFS and HBase; real-time analysis and complex event processing; NoSQL databases (Cassandra, Riak, MongoDB)
Management and monitoring via ZooKeeper across the messaging nodes
Informatica Vibe Data Stream for Machine Data
• High-performance, efficient streaming data collection over LAN/WAN
• GUI provides ease of configuration, deployment, and use
• Continuous ingestion of data generated in real time (sensors, logs, etc.), from machine-generated and other data sources
• Enables real-time interactions and response
• Real-time delivery directly to multiple targets (batch/stream processing)
• Highly available, efficient, and scalable
• Available ecosystem of lightweight agents (sources and targets)
Predictive Maintenance with Event Processing and Analytics
United Technologies Aerospace Systems (UTAS) provides engines and aircraft components to leading commercial and defense manufacturers, including the new Airbus A380 and Boeing B787.
The challenge:
• 5,000+ aircraft in service, plus new design wins, exponentially increase the amount of sensor data being generated
• The “Power by the Hour” leasing model means maintenance costs and service outages fall to UTAS
• No proactive capability to predict when a safety issue might occur
• Once-per-day sensor readings moving to real-time, over-the-air
Archive to Hadoop: Compression Extends Hadoop Cluster Capacity
Without INFA Optimized Archive compression: 10 TB, replicated 3X = 30 TB (10 TB + 10 TB + 10 TB)
With INFA Optimized Archive (95% compression): 10 TB compressed 95% = 500 GB, replicated 3X = 1.5 TB (500 GB + 500 GB + 500 GB)
20X less I/O bandwidth required; 20 min vs. 1 min response time; 8 hours vs. 24 min backup window
Parse and Prepare Data on Hadoop: HParser and XMap
4. The DT engine can immediately use this service to process data.
The DT engine is fully embeddable and can be invoked using any of the supported APIs: Java, C++, C, .NET, and web services. For simple integration, a command-line interface is available to invoke services. Internal custom applications can embed transformation services using the various APIs. PowerCenter leverages DT via the Unstructured Data Transformation (UDT), a GUI transformation widget in PowerCenter that wraps the DT API and engine. DT can also be embedded in other middleware technologies: for some (WBIMB, WebMethods, BizTalk) INFA provides similar GUI widgets (agents) for the respective design environments; for others, the API layer can be used directly.
DT can be invoked in two general ways:
1. Filenames can be passed to it, and DT will directly open the file(s) for processing. On the output side, DT can also write directly to the filesystem.
2. The calling application can buffer the data and send buffers to DT for processing. On the output side, DT can also write back to memory buffers which are returned to the calling application.
Though not shown below, the engine fully supports multiple input and output files or buffers as needed by the transformation.
The engine is invoked as a shared library: the DT engine runs fully within the process of the calling application, not as an external engine. This removes any overhead from passing data between processes, across the network, and so on. The engine is also dynamically invoked and does not need to be started up or maintained externally. Finally, the DT engine is thread-safe and re-entrant, which allows the calling application to invoke DT in multiple threads to increase throughput; a good example is DT's support of PowerCenter partitioning to scale up processing.
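As a rough illustration of those two invocation styles, here is a minimal Java sketch. The TransformationEngine interface and its dummy implementation are hypothetical placeholders, not the actual DT Java API; they only mirror the file-based and buffer-based patterns and the multi-threaded use described above.

```java
import java.nio.charset.StandardCharsets;

public class DtInvocationSketch {

    // Hypothetical stand-in for the in-process transformation engine API.
    interface TransformationEngine {
        void transformFile(String serviceName, String inputPath, String outputPath);
        byte[] transformBuffer(String serviceName, byte[] input);
    }

    public static void main(String[] args) throws Exception {
        // Dummy engine so the sketch runs; a real engine would load a deployed service.
        TransformationEngine engine = new TransformationEngine() {
            public void transformFile(String service, String in, String out) {
                // a real engine would open 'in', apply the service, and write 'out'
            }
            public byte[] transformBuffer(String service, byte[] input) {
                return new String(input, StandardCharsets.UTF_8)
                        .toUpperCase().getBytes(StandardCharsets.UTF_8);
            }
        };

        // 1. File-based invocation: pass filenames, the engine reads and writes files itself.
        engine.transformFile("My_Parser", "/data/in/orders.txt", "/data/out/orders.csv");

        // 2. Buffer-based invocation: the caller owns the data in memory on both sides.
        byte[] out = engine.transformBuffer("My_Parser",
                "raw record".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(out, StandardCharsets.UTF_8));

        // Thread-safe and re-entrant: several threads may call the engine concurrently,
        // which is how PowerCenter partitioning raises throughput.
        Thread t1 = new Thread(() -> engine.transformBuffer("My_Parser", new byte[0]));
        Thread t2 = new Thread(() -> engine.transformBuffer("My_Parser", new byte[0]));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```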
As shown below, the actual transformation logic is completely independent of any calling application. This means you can develop a transformation once and leverage it in multiple environments simultaneously, resulting in reduced development and maintenance times and a lower impact of change.
1. The developer uses Studio to develop a transformation.
2. The developer deploys the transformation to the local service repository (directory). All files needed for the transformation are moved.
3. To deploy to the server, this service folder is moved to the server via FTP, copy, script, etc. NOTE: If the server file system is mountable from the developer machine directly, then step 2 would deploy directly to the server.
Parse and Prepare Data on Hadoop
The broadest coverage for Big Data: flat files and documents, interaction data, industry standards, XML, social, device/sensor, and scientific data.
Productivity
• Visual parsing environment
• Predefined translations
Any DI/BI architecture: Pig, EDW, MDM
Example use case: Call Detail Records (CDR)
• Why Hadoop? CDRs are large data sets: every 7 seconds, every mobile phone in the region creates a record
• Desire to analyze behavior and location to personalize and optimize pricing and marketing
Parse and Prepare Data on Hadoop: how does it work?
1. Define the parser in HParser visual studio
2. Deploy the parser on the Hadoop Distributed File System (HDFS)
3. Run HParser to extract data and produce tabular format in Hadoop
hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt
Profiling and Discovering Data: Informatica Profiling & Data Discovery on Hadoop
CUSTOMER_ID and COUNTRY CODE examples:
1. Profiling stats: min/max values, NULLs, inferred data types, etc. Stats identify outliers and anomalies in the data.
2. Value and pattern analysis of Hadoop data: value and pattern frequency isolate inconsistent/dirty data or unexpected patterns.
3. Drilldown analysis (into Hadoop data): drill down into actual data values to inspect results across the entire data set, including potential duplicates.
Hadoop Data Profiling results are exposed to anyone in the enterprise via a browser.
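To make the statistics in step 1 concrete, here is a minimal plain-Java sketch of profiling a sampled column (nulls, min/max, and a crude inferred type). It is not the Informatica profiler; the sample CUSTOMER_ID values are invented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class ColumnProfileSketch {
    public static void main(String[] args) {
        // Invented sample of a CUSTOMER_ID column pulled from Hadoop
        List<String> values = Arrays.asList("10021", "10022", null, "1002X", "10024");

        long nulls = values.stream().filter(Objects::isNull).count();
        long numeric = values.stream()
                .filter(v -> v != null && v.matches("\\d+"))
                .count();
        String min = values.stream().filter(Objects::nonNull).min(String::compareTo).orElse(null);
        String max = values.stream().filter(Objects::nonNull).max(String::compareTo).orElse(null);

        System.out.printf("nulls=%d, min=%s, max=%s%n", nulls, min, max);
        // Crude inferred type: INTEGER only if every non-null value is all digits
        System.out.println("inferred type: "
                + (numeric == values.size() - nulls ? "INTEGER" : "STRING"));
    }
}
```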
Hadoop Data Domain Discovery: finding the functional meaning of data in Hadoop
• Leverage INFA rules/mapplets to identify the functional meaning of Hadoop data
• Sensitive data (e.g., SSN, credit card number, etc.)
• PHI: Protected Health Information; PII: Personally Identifiable Information
• Scalable to look for/discover ANY domain type
• View/share a report of the data domains/sensitive data contained in Hadoop, with the ability to drill down to see suspect data values
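The rule/mapplet idea above can be pictured as pattern rules applied to a column sample. Below is a minimal Java sketch using plain regular expressions; the rules, threshold, and sample values are illustrative assumptions, not Informatica's rule definitions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class DomainDiscoverySketch {
    // Illustrative domain rules; real rules add checksums, reference data, etc.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("SSN", Pattern.compile("\\d{3}-\\d{2}-\\d{4}"));
        RULES.put("CREDIT_CARD", Pattern.compile("\\d{4}([ -]?\\d{4}){3}"));
        RULES.put("EMAIL", Pattern.compile("[^@\\s]+@[^@\\s]+\\.[^@\\s]+"));
    }

    public static void main(String[] args) {
        // Invented sample drawn from one Hadoop column
        List<String> columnSample = List.of("123-45-6789", "987-65-4321", "n/a");

        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            long hits = columnSample.stream()
                    .filter(v -> rule.getValue().matcher(v).matches())
                    .count();
            double ratio = (double) hits / columnSample.size();
            if (ratio >= 0.5) {  // simple threshold before flagging the column as this domain
                System.out.printf("column looks like %s (%.0f%% match)%n", rule.getKey(), ratio * 100);
            }
        }
    }
}
```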
Transforming and Cleansing Data: PowerCenter on Hadoop, Data Quality on Hadoop
No-code visual development environment; preview results at any point in the data flow.
PowerCenter developers are now Hadoop developers
Reuse and Import PC Metadata for Hadoop: import existing PC artifacts into the Hadoop development environment, and validate the import logic before the actual import process to ensure compatibility.
Natural Language Processing: Entity Extraction & Data Classification
Train NLP models to find and classify entities in unstructured data.
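As a generic illustration of model-based entity extraction (not Informatica's NLP feature), the sketch below uses Apache OpenNLP with a pre-trained person-name model; the model file name, path, and sample sentence are assumptions.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class EntityExtractionSketch {
    public static void main(String[] args) throws Exception {
        // Pre-trained NER model; the file name/path is an assumption for this sketch
        try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
            NameFinderME finder = new NameFinderME(new TokenNameFinderModel(modelIn));

            // Already-tokenized sentence from an unstructured document
            String[] tokens = {"Contact", "John", "Smith", "about", "the", "A380", "order", "."};

            for (Span span : finder.find(tokens)) {
                String entity = String.join(" ",
                        Arrays.copyOfRange(tokens, span.getStart(), span.getEnd()));
                System.out.printf("%s entity: %s%n", span.getType(), entity);
            }
            finder.clearAdaptiveData();  // reset document-level context between documents
        }
    }
}
```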
Address Validation & Data Cleansing
Configure Mapping for Hadoop Execution
No need to redesign mapping logic to execute on either traditional or Hadoop infrastructure. Configure where the integration logic should run: Hadoop or native.
FROM (
  SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY,
         customer.C_NAME, customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
  FROM ( SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
         FROM lineitem GROUP BY L_ORDERKEY ) T1
  JOIN orders ON (customer.C_ORDERKEY = orders.O_ORDERKEY)
  JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY)
  JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY)
  WHERE nation.N_NAME = 'UNITED STATES'
) T2
INSERT OVERWRITE TABLE TARGET1 SELECT *
INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY, count(ORDERKEY2) GROUP BY CUSTKEY;
Data Integration & Quality on Hadoop
Hive-QL
1. Entire Informatica mapping translated to Hive Query Language
2. Optimized HQL converted to MapReduce & submitted to Hadoop cluster (job tracker).
3. Advanced mapping transformations executed on Hadoop through User Defined Functions using Vibe
MapReduce
UDF
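To illustrate step 3, a custom transformation can be exposed to the generated HiveQL as a user-defined function. The class below is a minimal generic Hive UDF in Java (not the Vibe runtime itself); the function and column names are hypothetical.

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Registered and called from HiveQL, for example:
//   ADD JAR normalize-phone.jar;
//   CREATE TEMPORARY FUNCTION normalize_phone AS 'NormalizePhone';
//   SELECT normalize_phone(phone) FROM customers;
public final class NormalizePhone extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;  // propagate NULLs unchanged
        }
        // Strip everything except digits so downstream joins see one canonical form
        return new Text(input.toString().replaceAll("[^0-9]", ""));
    }
}
```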
Example Mapping Execution
Sources: an external flat file, external relational data, and an HDFS file, with the engine repository alongside.
• Mapping logic is translated to HQL and submitted to the Hadoop cluster (a cluster of Linux machines).
• The local flat file is staged temporarily on HDFS; relational data is streamed to Hadoop for processing; a lookup file is temporarily staged.
• HDFS file data is read, and the final processed data is loaded into the target HDFS file.
Orchestrating and Monitoring Hadoop: Informatica Workflow & Monitoring for Hadoop, Metadata Manager for Hadoop, Dynamic Data Masking for Hadoop
Mixed Workflow Orchestration: one workflow running tasks on Hadoop and local environments
Workflow tasks: Cmd_ChooseLoadPath, MT_Load2Hadoop + Parse, Cmd_Load2Hadoop, MT_Parse, Cmd_ProfileData, MT_Cleanse, MT_DataAnalysis, Notification
List of variables:
Name | Type | Default Value | Description
$User.LoadOptionPath | Integer | 2 | Load path for the workflow, depending on the output of the command task
$User.DataSourceConnection | String | HiveSourceConnection | Source connection object
$User.ProfileResult | Integer | 100 | Output from the “profiling” command task
Full traceability from workflow to MapReduce jobs
View generated Hive scripts
Unified Administration: Single Place to Manage & Monitor
Data Lineage and Business Glossary
Hadoop Architecture Overview
• PowerCenter on Hadoop • Data Quality on Hadoop • DT on Hadoop • Entity Extraction on Hadoop • Profiling on Hadoop
[Architecture diagram: execution on Hadoop. INFA clients and the PowerCenter SE enterprise grid running PowerCenter services connect to the cluster through PWX for HDFS, PWX for Hive, PWX for Mercury, and PWX for PC; Mercury services, a Hive client, and a MySQL repository sit alongside. Source systems include Transactions, OLTP, OLAP; Documents, Email; Social Media, Web Logs; and Machine/Device, Scientific data. Each DataNode (DataNode1-DataNode3) carries HDFS, Infa-Lib, HParser, MapReduce, and RDBMS clients, with Hive, the NameNode, and the Job Tracker completing the cluster.]