
Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution


DESCRIPTION

Speaker: 尹寒柏 (Yin Han-Po), Senior Product Consultant, Informatica. Session abstract: In the Big Data era, the competition is not about the quantity of data but about the depth at which you understand it. Now that Big Data technology has matured, CXOs without IT backgrounds can turn CI (Customer Intelligence), once little more than a buzzword, from a noun into a verb: moving from BI to CI, connecting to the pulse of the consumer economy, and gaining insight into customer intent. But one mindset matters in the Big Data era: in the end, the competition is not just about growth in data volume, but about who understands the data more deeply. Informatica is the best answer to this challenge. With Informatica, we relieve the enormous pressure on enterprises to deliver trusted data in a timely manner; and as data volume and complexity keep rising, Informatica can also aggregate data faster, making it meaningful and usable for improving efficiency, raising quality, ensuring certainty, and building advantage. Informatica offers a faster and more effective way to achieve this goal, and is SYSTEX Group's (精誠集團) tool of choice in the Big Data era.


Page 1: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Senior Product Specialist

Big Data and Hadoop

Page 2: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

The Challenge: Data fragmentation becomes the barrier to business success

A timeline of computing eras, in which each new wave multiplies the users, data sources, and business value of technology:

Era         | Technology             | Users                 | Approx. scale | Business value
1960s-1970s | Mainframe (OS/360)     | Few employees         | 10^2          | Back-office automation
1980s       | Client-server          | Many employees        | 10^4          | Front-office productivity
1990s       | Web                    | Customers/consumers   | 10^6          | E-commerce
2007        | Cloud                  | Business ecosystems   | 10^7          | Line-of-business self-service
2011        | Social                 | Communities & society | 10^9          | Social engagement
2014        | Internet of Things     | Devices & machines    | 10^11         | Real-time optimization

Page 3: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

The traditional landscape: many siloed data marts, fed by batch ETL from source data into analytic systems. The Big Data challenges are Volume, Variety, Velocity, and Veracity, and users keep asking: "Where is the data I need?" and "Can I trust this data?"

Source data now spans transactions (OLTP, OLAP), the enterprise data warehouse, social media and web logs, machine/device and scientific data, and documents and emails.

Page 4: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution
Page 5: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

80% of the work in big data projects is data integration and quality:

"I spend more than half my time integrating, cleansing, and transforming data without doing any actual analysis."

"80% of the work in any data project is in cleaning the data."

"70% of my value is an ability to pull the data; 20% of my value is using data science…"

Page 6: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Why Informatica for Big Data & Hadoop

Page 7: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

PowerCenter Big Data Edition

Big Transaction Data:
- Online Transaction Processing (OLTP): Oracle, DB2, Ingres, Informix, Sybase, SQL Server, …
- Online Analytical Processing (OLAP) & DW appliances: Teradata, Redbrick, Essbase, Sybase IQ, Netezza, Exadata, HANA, Greenplum, DATAllegro, Aster Data, Vertica, ParAccel, …
- Cloud: Salesforce.com, Concur, Google App Engine, Amazon, …

Big Interaction Data:
- Social media & web data: Facebook, Twitter, LinkedIn, YouTube, …
- Web applications: blogs, discussion forums, communities, partner portals, …
- Other interaction data: clickstream, image/text, scientific, genomic/pharma, medical, medical devices, sensors/meters, RFID tags, CDR/mobile, …

Big Data Processing (version 9.6), built on the Vibe™ virtual data machine:
- Universal data access
- High-speed data ingestion and extraction
- ETL on Hadoop
- Profiling on Hadoop
- Complex data parsing on Hadoop
- Entity extraction and data classification on Hadoop
- No-code productivity
- Business-IT collaboration
- Unified administration

Page 8: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Get Data Into and Out of Hadoop
- PowerExchange for Hadoop
- Replication to Hadoop
- Streaming to Hadoop
- Data Archiving to Hadoop

Page 9: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Data Ingestion and Extraction: moving terabytes of data per hour.

Sources (transactions/OLTP/OLAP, social media and web logs, documents and email, industry standards, machine/device and scientific data) move into Hadoop via replication, streaming, and batch load. Data moves out again via extract to the data warehouse, MDM, and applications, while archive extract feeds a low-cost store.

Page 10: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

PowerExchange Connectors (✔ = accessible in real time and/or via Change Data Capture, CDC)

Enterprise applications, Software as a Service (SaaS): JDE EnterpriseOne, JDE World, Lotus Notes, Oracle E-Business Suite ✔, PeopleSoft Enterprise, Salesforce (salesforce.com) ✔, SAP NetWeaver ✔, SAP NetWeaver BI ✔, SAS, Siebel, NetSuite, Microsoft Dynamics

Databases and data warehouses: Adabas for UNIX/Windows, C-ISAM, DB2 for LUW ✔, Essbase, EMC/Greenplum, Informix Dynamic Server, Netezza Performance Server, ODBC, Oracle ✔, SQL Server ✔, Sybase, Teradata

Messaging systems: JMS ✔, MSMQ ✔, TIBCO ✔, webMethods Broker ✔, WebSphere MQ ✔

Technology standards: Email (POP, IMAP), HTTP(S) ✔, LDAP ✔, Web Services ✔, XML

Mainframe: Adabas for z/OS ✔, Datacom ✔, DB2 for z/OS and z/Linux ✔, IDMS ✔, IMS DB ✔, Oracle for z/Linux ✔, Teradata, WebSphere MQ for z/Linux ✔, VSAM ✔

Big Data: Aster Data, Greenplum, Vertica, ParAccel, Microsoft PDW, Kognitio

Social/NoSQL: Facebook, Twitter, LinkedIn, DataSift, Kapow, MongoDB

Hadoop: HDFS, Hive, HBase

Page 11: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

NoSQL Support for HBase

- Read from HBase as a standard source
- Write to HBase as a standard target
- A complete mapping with an HBase source/target can execute on Hadoop
- Sample HBase column families (stored in JSON/complex formats)
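To make the read/write pattern concrete, here is a minimal sketch using the open-source happybase Thrift client. This is not Informatica's mapping engine; the host, table, and column-family names are hypothetical, chosen only to show JSON documents stored under a column family, as on the slide.

# Minimal HBase read/write sketch (happybase client, NOT Informatica's
# engine). Host, table, and column-family names are hypothetical.
import json
import happybase

conn = happybase.Connection("hbase-host")       # hypothetical host
table = conn.table("customers")                 # hypothetical table

# Write: store a complex/JSON document under a column-family qualifier.
profile = {"name": "Acme Corp", "tier": "gold", "orders": [1001, 1002]}
table.put(b"cust-0001", {b"cf:profile": json.dumps(profile).encode()})

# Read: fetch the row back and parse the JSON column value.
row = table.row(b"cust-0001")
print(json.loads(row[b"cf:profile"]))

# Scan: iterate rows the way a "standard source" would.
for key, data in table.scan(row_prefix=b"cust-"):
    doc = json.loads(data[b"cf:profile"])
    print(key, doc["name"])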

Page 12: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

NoSQL Support for MongoDB

- Access, integrate, transform, and ingest MongoDB data into other analytic systems (e.g. Hadoop, data warehouse)
- Access, integrate, transform, and ingest data into MongoDB
- Sample MongoDB data and flatten it to relational format
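A minimal sketch of what "flattening to relational format" means in practice, using the pymongo client. This is not Informatica's sampler; the connection string, database, and collection names are hypothetical.

# Flatten nested MongoDB documents into relational-style rows
# (pymongo client, NOT Informatica's sampler; names are hypothetical).
from pymongo import MongoClient

client = MongoClient("mongodb://mongo-host:27017")   # hypothetical host
orders = client["shop"]["orders"]                    # hypothetical db/collection

def flatten(doc, prefix=""):
    """Recursively flatten nested dicts into dotted column names."""
    row = {}
    for key, value in doc.items():
        name = prefix + key
        if isinstance(value, dict):
            row.update(flatten(value, name + "."))
        else:
            row[name] = value
    return row

# Each flattened dict is one relational row; arrays would typically be
# split out into child tables keyed by the parent _id.
for doc in orders.find().limit(10):
    print(flatten(doc))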

Page 13: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

IDR for Replicating to Hadoop

Supported distributions:
- Apache Hadoop: 0.20.203.x, 0.20.204.x, 0.20.205.x, 0.23.x, 1.0.x, 1.1.x, 2.x.x
- Cloudera: CDH3, CDH4

Flow: EXTRACT from the source system into intermediate files (a Cycle_1.work directory holding one file per table: Table 1 File, Table 2 File, … Table N File, plus a Schema.ini file), then APPLY to HDFS.

Page 14: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Real-Time Data Collection and Streaming

Leverage a high-performance messaging infrastructure: publish with Ultra Messaging (a publish/subscribe bus) for global distribution without additional staging or landing.

Sources: web servers, operations monitors, rsyslog, SLF4J, etc.; handhelds, smart meters, and other Internet of Things / sensor data, sent as discrete data messages.

Targets: HDFS and HBase; real-time analysis and complex event processing; NoSQL databases such as Cassandra, Riak, and MongoDB.

Messages flow across a cluster of nodes, with ZooKeeper providing management and monitoring.
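A generic sketch of the publish/subscribe pattern described above, using only the Python standard library. This is not the Ultra Messaging API; the bus, topic, and handler names are invented for illustration. Sources publish discrete messages, and subscribers such as an HDFS writer consume them independently.

# Generic pub/sub sketch (NOT the Ultra Messaging API; names invented).
import queue
import threading

class Bus:
    def __init__(self):
        self.subscribers = {}            # topic -> list of subscriber queues
    def subscribe(self, topic):
        q = queue.Queue()
        self.subscribers.setdefault(topic, []).append(q)
        return q
    def publish(self, topic, message):
        for q in self.subscribers.get(topic, []):
            q.put(message)

bus = Bus()
meters = bus.subscribe("smart-meters")

def hdfs_writer():
    # In a real deployment this would append to an HDFS file or HBase table.
    while True:
        msg = meters.get()
        if msg is None:                  # sentinel to stop the demo
            break
        print("write to HDFS:", msg)

t = threading.Thread(target=hdfs_writer)
t.start()
bus.publish("smart-meters", {"meter_id": 42, "kwh": 3.7})
bus.publish("smart-meters", None)
t.join()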

Page 15: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Informatica Vibe Data Stream for Machine Data

- High-performance, efficient streaming data collection over LAN/WAN
- GUI provides ease of configuration, deployment, and use
- Continuous ingestion of data generated in real time (sensors, logs, etc.), machine-generated and other data sources
- Enables real-time interactions and response
- Real-time delivery directly to multiple targets (batch/stream processing)
- Highly available, efficient, scalable
- Available ecosystem of lightweight agents (sources and targets)

Page 16: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Predictive Maintenance with Event Processing and Analytics

United Technologies Aerospace Systems (UTAS) provides engines and aircraft components to leading commercial and defense manufacturers, including the new Airbus A380 and Boeing B787.

The challenge:
- 5,000+ aircraft in service, plus new design wins, exponentially increase the amount of sensor data being generated
- The "Power by the Hour" leasing model means maintenance costs and service outages fall to UTAS
- No proactive capability to predict when a safety issue might occur
- Once-per-day sensor readings are moving to real-time, over-the-air

Page 17: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Archive to Hadoop: Compression Extends Hadoop Cluster Capacity

Without INFA Optimized Archive: 10 TB of data, replicated 3x across DataNodes = 30 TB of cluster storage.

With INFA Optimized Archive (95% compression): 10 TB compresses to 500 GB; replicated 3x = 1.5 TB. Results: 20x less I/O bandwidth required; 1 min vs. 20 min response time; 24 min vs. 8 hour backup window.
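The arithmetic behind those figures, worked through with the values from the slide:

# Worked arithmetic for the compression figures above (slide values).
raw_tb = 10.0                  # original data set, in TB
replication = 3                # HDFS default replication factor
compression = 0.95             # 95% size reduction claimed by the slide

uncompressed_total = raw_tb * replication            # 30.0 TB
compressed_size = raw_tb * (1 - compression)         # 0.5 TB = 500 GB
compressed_total = compressed_size * replication     # 1.5 TB

print("without compression: %.1f TB on the cluster" % uncompressed_total)
print("with 95%% compression: %.1f TB on the cluster" % compressed_total)
print("storage/I-O ratio: %.0fx" % (uncompressed_total / compressed_total))  # 20x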

Page 18: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Parse and Prepare Data on Hadoop: hParser and XMap

Page 19: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Building and deploying a Data Transformation (DT) service:

1. The developer uses DT Studio to develop a transformation.
2. The developer deploys the transformation to the local service repository (a directory); all files needed for the transformation are moved there.
3. To deploy to the server, the service folder is moved to the server via FTP, copy, script, etc. (If the server file system is mountable directly from the developer machine, step 2 can deploy straight to the server.)
4. The DT engine can immediately use this service to process data.

The DT engine is fully embeddable and can be invoked using any of the supported APIs: Java, C++, C, .NET, and web services. For simple integration, a command-line interface is available to invoke services, and internal custom applications can embed transformation services using the various APIs.

PowerCenter leverages DT via the Unstructured Data Transformation (UDT), a GUI transformation widget in PowerCenter that wraps the DT API and engine. DT can also be embedded in other middleware technologies: for some (WBIMB, webMethods, BizTalk), INFA provides similar GUI widgets (agents) for the respective design environments; for others, the API layer can be used directly.

DT can be invoked in two general ways:

1. Filenames can be passed to it, and DT will directly open the file(s) for processing. On the output side, DT can also write directly to the filesystem.
2. The calling application can buffer the data and send buffers to DT for processing. On the output side, DT can also write back to memory buffers, which are returned to the calling application.

Though not shown here, the engine fully supports multiple input and output files or buffers as needed by the transformation.

Engine invocation is via a shared library: the DT engine runs fully within the process of the calling application, not as an external engine. This removes any overhead from passing data between processes or across the network. The engine is dynamically invoked and does not need to be started up or maintained externally. It is also thread-safe and re-entrant, which allows the calling application to invoke DT on multiple threads to increase throughput; a good example is DT's support of PowerCenter partitioning to scale up processing.

The actual transformation logic is completely independent of any calling application. This means you can develop a transformation once and leverage it in multiple environments simultaneously, reducing development and maintenance time and lowering the impact of change.
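To make the two invocation styles concrete, here is a self-contained toy stand-in for an embeddable, in-process engine. This is not Informatica's DT API (which is exposed via Java, C++, C, .NET, and web services, as noted above); the class and method names are invented for illustration only.

# Toy stand-in for an embeddable, in-process transformation engine
# (NOT Informatica's DT API; names invented for illustration).

class ToyEngine:
    """Runs in-process, like a shared library: no external server to start."""
    def transform(self, data: bytes) -> bytes:
        # Trivial placeholder transformation: uppercase the payload.
        return data.upper()

    # Style 1: filename invocation; the engine opens files itself and can
    # write output directly to the filesystem.
    def run_files(self, input_path: str, output_path: str) -> None:
        with open(input_path, "rb") as src, open(output_path, "wb") as dst:
            dst.write(self.transform(src.read()))

    # Style 2: buffer invocation; the caller owns the I/O and output
    # buffers are returned directly, with no intermediate files.
    def run_buffer(self, data: bytes) -> bytes:
        return self.transform(data)

engine = ToyEngine()
print(engine.run_buffer(b"hello dt"))        # b'HELLO DT'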

Parse and Prepare Data on Hadoop: the broadest coverage for Big Data, spanning flat files and documents, interaction data, industry standards, XML, social, and device/sensor/scientific data.

Productivity:
- Visual parsing environment
- Predefined translations

Fits any DI/BI architecture: Pig, EDW, MDM.

Page 20: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Example use case: Call Detail Records (CDR)

Why Hadoop?
- CDR means large data sets: every 7 seconds, every mobile phone in the region creates a record
- Desire to analyze behavior and location to personalize and optimize pricing and marketing
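As a hand-rolled illustration of the parsing work such a project involves (and which hParser automates at scale), here is a short Python sketch. The pipe-delimited field layout is invented; real CDR formats vary by switch vendor.

# Hand-rolled CDR parsing sketch; the field layout below is invented
# for illustration (real CDR formats vary by switch vendor).
from datetime import datetime

def parse_cdr(line: str) -> dict:
    caller, callee, start, seconds, cell_id = line.strip().split("|")
    return {
        "caller": caller,
        "callee": callee,
        "start": datetime.strptime(start, "%Y-%m-%d %H:%M:%S"),
        "duration_s": int(seconds),
        "cell_id": cell_id,          # location key for behavior analysis
    }

sample = "886912000001|886912000002|2014-06-05 09:15:02|73|TPE-042"
print(parse_cdr(sample))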

Page 21: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Parse and Prepare Data on Hadoop: How does it work?

1. Define the parser in the HParser visual studio.
2. Deploy the parser to the Hadoop Distributed File System (HDFS).
3. Run HParser to extract data and produce tabular format in Hadoop:

hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt

Page 22: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution
Page 23: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Profiling and Discovering Data: Informatica Profiling & Data Discovery on Hadoop

Page 24: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Hadoop Data Profiling Results (CUSTOMER_ID and COUNTRY_CODE examples)

1. Profiling stats: min/max values, NULLs, inferred data types, etc. These statistics identify outliers and anomalies in the data.
2. Value and pattern analysis of Hadoop data: value and pattern frequency to isolate inconsistent/dirty data or unexpected patterns.
3. Drill-down analysis into Hadoop data: drill down into actual data values to inspect results across the entire data set, including potential duplicates.

Hadoop data profiling results are exposed to anyone in the enterprise via a browser.
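A minimal sketch of the kind of column profile described above (null counts, min/max, inferred type, pattern frequency), written in plain Python over a CSV. The input file and column names are hypothetical; the real profiler computes this at Hadoop scale.

# Minimal column-profiling sketch (plain Python; the input file and
# column names are hypothetical stand-ins for Hadoop-scale data).
import csv
import re
from collections import Counter

def pattern(value: str) -> str:
    # Generalize characters into a shape: digits -> 9, letters -> X.
    return re.sub(r"[A-Za-z]", "X", re.sub(r"\d", "9", value))

def profile(rows, column):
    values = [r[column] for r in rows]
    non_null = [v for v in values if v not in ("", None)]
    return {
        "nulls": len(values) - len(non_null),
        "min": min(non_null),
        "max": max(non_null),
        "inferred_type": "int" if all(v.isdigit() for v in non_null) else "str",
        "top_patterns": Counter(pattern(v) for v in non_null).most_common(3),
    }

with open("customers.csv", newline="") as f:     # hypothetical input file
    rows = list(csv.DictReader(f))
print(profile(rows, "CUSTOMER_ID"))
print(profile(rows, "COUNTRY_CODE"))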

Page 25: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Hadoop Data Domain Discovery: finding the functional meaning of data in Hadoop

Leverage INFA rules/mapplets to identify the functional meaning of Hadoop data:
- Sensitive data (e.g. SSN, credit card number, etc.)
- PHI: Protected Health Information
- PII: Personally Identifiable Information

Scalable to look for and discover ANY domain type. View and share a report of the data domains and sensitive data contained in Hadoop, with the ability to drill down to see suspect data values.
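A minimal sketch of rule-based domain discovery: classify a column's values against regex rules for sensitive domains. The two patterns are deliberately simplified stand-ins for INFA's rules/mapplets (real SSN and credit-card rules also validate ranges and checksums such as Luhn).

# Rule-based data domain discovery sketch (simplified stand-in rules).
import re

DOMAIN_RULES = {
    "SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "CREDIT_CARD": re.compile(r"^\d{4}([ -]?\d{4}){3}$"),
}

def discover_domains(values, threshold=0.8):
    """Return domains whose rule matches at least `threshold` of values."""
    hits = {}
    for name, rule in DOMAIN_RULES.items():
        matched = sum(1 for v in values if rule.match(v))
        if values and matched / len(values) >= threshold:
            hits[name] = matched / len(values)
    return hits

column = ["123-45-6789", "987-65-4321", "000-12-3456"]
print(discover_domains(column))          # {'SSN': 1.0}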

Page 26: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Transforming and Cleansing Data: PowerCenter on Hadoop, Data Quality on Hadoop

Page 27: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

No-code visual development environment. Preview results at any point in the data flow. PowerCenter developers are now Hadoop developers.

Page 28: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Reuse and Import PC Metadata for Hadoop

- Import existing PowerCenter artifacts into the Hadoop development environment
- Validate import logic before the actual import process to ensure compatibility

Page 29: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Natural Language Processing: Entity Extraction & Data Classification

Train NLP to find and classify entities in unstructured data

Page 30: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Address Validation & Data Cleansing

Page 31: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Configure Mapping for Hadoop Execution

No need to redesign mapping logic to execute on either traditional or Hadoop infrastructure; simply configure where the integration logic should run: Hadoop or native.

Page 32: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Data Integration & Quality on Hadoop

1. The entire Informatica mapping is translated to Hive Query Language.
2. The optimized HQL is converted to MapReduce and submitted to the Hadoop cluster (job tracker).
3. Advanced mapping transformations are executed on Hadoop through user-defined functions using Vibe.

Example of generated Hive-QL (a multi-table insert):

FROM (
    SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY,
           customer.C_NAME, customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
    FROM (
        SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
        FROM lineitem
        GROUP BY L_ORDERKEY
    ) T1
    JOIN orders ON (customer.C_ORDERKEY = orders.O_ORDERKEY)
    JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY)
    JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY)
    WHERE nation.N_NAME = 'UNITED STATES'
) T2
INSERT OVERWRITE TABLE TARGET1 SELECT *
INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY, count(ORDERKEY2) GROUP BY CUSTKEY;
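The SELECT TRANSFORM … USING clause above is Hive's streaming-transform mechanism: the named executable reads tab-separated rows on stdin and emits tab-separated rows on stdout. Here is a generic Python stand-in for such a transform script; it is not Informatica's CustomInfaTx, only an illustration of the mechanism, counting line items per order key to match the T1 subquery's ORDERKEY/li_count shape.

#!/usr/bin/env python
# Generic stand-in for a Hive TRANSFORM script (NOT CustomInfaTx):
# Hive pipes tab-separated input rows to stdin and reads tab-separated
# output rows from stdout.
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    order_key = line.rstrip("\n").split("\t")[0]
    counts[order_key] += 1

for order_key, li_count in counts.items():
    sys.stdout.write("%s\t%d\n" % (order_key, li_count))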

Page 33: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Example Mapping Execution

A mapping reads three sources (an external flat file, external relational data, and an HDFS file) and runs on a cluster of Linux machines, with the Informatica engine and repository outside the cluster:

- The mapping logic is translated to HQL and submitted to the Hadoop cluster.
- Relational data is streamed to Hadoop for processing.
- The local flat file is staged temporarily on HDFS, as is the lookup file.
- HDFS file data is read directly.
- The final processed data is loaded into the target HDFS file.

Page 34: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Orchestrating and Monitoring Hadoop
- Informatica Workflow & Monitoring for Hadoop
- Metadata Manager for Hadoop
- Dynamic Data Masking for Hadoop

Page 35: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Mixed Workflow Orchestration: one workflow running tasks on Hadoop and in local environments

Example tasks: Cmd_ChooseLoadPath, MT_Load2Hadoop + Parse, Cmd_Load2Hadoop, MT_Parse, Cmd_ProfileData, MT_Cleanse, MT_DataAnalysis, Notification.

List of workflow variables:

Name                       | Type    | Default Value        | Description
$User.LoadOptionPath       | Integer | 2                    | Load path for the workflow, depending on output of the cmd task
$User.DataSourceConnection | String  | HiveSourceConnection | Source connection object
$User.ProfileResult        | Integer | 100                  | Output from the "profiling" command task

Page 36: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Unified Administration: a single place to manage and monitor, with full traceability from workflow to MapReduce jobs and the ability to view generated Hive scripts.

Page 37: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Data Lineage and Business Glossary

Page 38: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution
Page 39: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution

Hadoop Architecture Overview

Execution on Hadoop:
- PowerCenter on Hadoop
- Data Quality on Hadoop
- DT on Hadoop
- Entity Extraction on Hadoop
- Profiling on Hadoop

Outside the cluster: INFA clients; the PowerCenter SE enterprise grid and PowerCenter services (with PWX for PC); Mercury services (with PWX for Mercury, PWX for HDFS, and PWX for Hive); a Hive client; and a MySQL repository. Sources include transactions (OLTP, OLAP), documents and email, social media and web logs, and machine/device and scientific data.

On the cluster: a NameNode and Job Tracker, plus DataNodes (DataNode1 through DataNode3), each running HDFS, Infa-Lib, HParser, MapReduce, RDBMS clients, and Hive.

Page 40: Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution
