Unleash Power of Big Data with Informatica for Hadoop

  • Published on
    08-Mar-2018


Transcript

  • Wei Zheng

    Senior Director, Product Management

    Informatica

    Unleash Power of Big Data with

    Informatica for

    http://hadoop.apache.org/

  • Agenda

    Big Data Overview

    What is Hadoop?

    Informatica for Hadoop

    Getting Data In and Out

    Parsing and Preparing Data

    Profiling and Discovering Data

    Transforming and Cleansing Data

    Orchestrating and Monitoring Hadoop

    Roadmap

  • Big Data Overview

  • What's happening?

    Explosive Growth of Data Volume, Variety, Velocity

    Volume, Velocity, Variety

    Chart: Business Value vs. Latency (Years to Sub-Second); Data Volume across time scales. Source: IDC


  • Big Data

    Confluence of Big Transaction, Big Interaction & Big Data Processing

    Online

    Transaction

    Processing

    (OLTP)

    Online Analytical

    Processing

    (OLAP) &

    DW Appliances

    Social

    Media Data

    Device

    Sensor Data

    Scientific, genomic

    Machine/Device

    BIG TRANSACTION DATA BIG INTERACTION DATA

    BIG DATA PROCESSING

    Call detail

    records, image,

    click stream data

    BIG DATA INTEGRATION

    Cloud

    Salesforce.com

    Concur

    Google App Engine

    Amazon

  • What is Hadoop?

  • What is Hadoop?

    Distribution Example: Cloudera (CDH 3.0)

    Hadoop

    Distributed File

    System (HDFS)

    File Sharing & Data

    Protection Across

    Physical Servers

    MapReduce

    Distributed Computing

    Across Physical Servers

    Hadoop is a big data platform for data

    storage and processing that is

    Scalable

    Fault tolerant

    Open source

    CORE HADOOP COMPONENTS

    Coordination: APACHE ZOOKEEPER

    Data Integration: APACHE FLUME, APACHE SQOOP

    Fast Read/Write Access: APACHE HBASE

    Languages / Compilers: APACHE PIG, APACHE HIVE

    Workflow & Scheduling: APACHE OOZIE

    Metadata: APACHE HIVE

    File System Mount: FUSE-DFS

    UI Framework: HUE

    SDK: HUE SDK

    Hadoop Design Axioms

    1. System Shall Manage and Heal Itself

    2. Performance Shall Scale Linearly

    3. Compute Shall Move to Data

    4. Simple Core, Modular and Extensible
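The MapReduce model named above pairs a map phase that emits key/value pairs with a reduce phase that aggregates values per key, with a shuffle step grouping by key in between. A minimal in-process Python sketch of that flow (illustrative only, not Hadoop code; the record set and word-count logic are invented for the example):

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Map: emit (word, 1) for every word in one input record
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values emitted for one key
    return (key, sum(values))

records = ["Hadoop stores data in HDFS",
           "MapReduce processes data in parallel"]
pairs = chain.from_iterable(map_phase(r) for r in records)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["data"])  # 2
```

In real Hadoop the map and reduce tasks run on many data nodes in parallel (axiom 3: compute moves to the data); the sketch only shows the dataflow shape.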

  • Hadoop Distributions

  • What can Hadoop Help You With?

    Improve

    Decisions

    Modernize

    Business

    Improve

    Efficiency

    & Reduce

    Costs

    Mergers

    Acquisitions

    &

    Divestitures

    Acquire &

    Retain

    Customers

    Outsource

    Non-core

    Functions

    Governance

    Risk

    Compliance

    Increase

    Partner

    Network

    Efficiency

    Increase

    Business

    Agility

    Increase Value of Big Data

    Relevant Actionable Timely Holistic Trustworthy Accessible Authoritative Secure

    Lower Cost of Big Data

    Business Costs Labor Costs Software Costs Hardware Costs Storage Costs

    On-Premise Transactions Desktops Mobile Cloud Interactions

    Predictive Analytics

    (Recommendations,

    Outcomes, MRO)

    Customer Analytics

    (Customer Sentiment,

    & Satisfaction)

    Pattern Recognition

    (Fraud Detection

    Risk & Portfolio

    Analysis

    Optimization

    (Pricing, Supply

    Chain)

  • Informatica for Hadoop

  • Unleash the Power of Hadoop With Informatica

    9.5.1

    Available Now

    Sales & Marketing

    Data mart

    Customer Service

    Portal

    Product & Service Offerings Customer Profile Social Media Account Transactions Customer Service Logs & Surveys Marketing Campaigns

    3. Parse & Prepare Data in Hadoop

    (MapReduce)

    1. Ingest Data into Hadoop

    4. Transform & Cleanse/Standardize Data

    in Hadoop (MapReduce)

    Monitor & Manage (Hadoop or non-Hadoop jobs/processes)

    Orchestrate Workflows (Hadoop or non-Hadoop jobs/processes)

    6. Extract Data from Hadoop

    2. Discover Hadoop data for anomalies,

    relationships and domain types

    5. Invoke Custom Business Analytics on

    Hadoop

    Profile Data

  • Repeatability

    Predictable, repeatable deployments and methodology

    Isolation from rapid Hadoop changes

    Frequent new versions and projects

    Avoiding placing bets on the wrong technology

    Reuse of existing assets

    Apply existing integration logic to load data to/from Hadoop

    Reuse existing data quality rules to validate Hadoop data

    Reuse of existing skills

    Enable ETL developers to leverage the power of Hadoop

    Governance

    Enforce and validate data security, data quality and

    regulatory policies

    Why Informatica? What are the Benefits?

  • Get Data Into and Out of Hadoop

    PowerExchange for Hadoop

    hStream with MapR

    Data Archiving for Hadoop

    Replication for Hadoop

  • Data Ingestion and Extraction

    Moving tens of terabytes per hour of transaction, interaction

    and streaming data

    Data

    Warehouse

    MDM

    Applications

    Transactions,

    OLTP, OLAP

    Social Media,

    Web Logs

    Documents,

    Email

    Industry

    Standards

    Machine Device,

    Scientific

    Replicate

    Stream

    Batch Load

    Extract

    Archive Extract

    Low

    Cost

    Store

  • Unleash the Power of Hadoop With High Performance Universal Data Access

    WebSphere MQ JMS MSMQ SAP NetWeaver XI

    JD Edwards Lotus Notes Oracle E-Business PeopleSoft

    Oracle DB2 UDB DB2/400 SQL Server Sybase

    ADABAS Datacom DB2 IDMS IMS

    Word, Excel PDF StarOffice WordPerfect Email (POP, IMAP) HTTP

    Informix Teradata Netezza ODBC JDBC

    VSAM C-ISAM Binary Flat Files Tape Formats

    Web Services TIBCO webMethods

    SAP NetWeaver SAP NetWeaver BI SAS Siebel

    Messaging, and Web Services

    Relational and Flat Files

    Mainframe and Midrange

    Unstructured Data and Files

    Flat files ASCII reports HTML RPG ANSI LDAP

    EDIX12

    EDI-Fact

    RosettaNet

    HL7

    HIPAA

    ebXML

    HL7 v3.0

    ACORD (AL3, XML)

    XML

    LegalXML

    IFX

    cXML

    AST

    FIX

    Cargo IMP

    MVR

    Salesforce CRM

    Force.com

    RightNow

    NetSuite

    ADP Hewitt SAP By Design Oracle OnDemand

    Packaged Applications

    Industry Standards

    XML Standards

    SaaS/BPO

    Social Media

    Facebook, Twitter, LinkedIn

    MPP Appliances

    EMC/Greenplum, Vertica, AsterData


  • PowerExchange for Hadoop

    HDFS and Hive Adapters

    Support pushdown of source and target connections to ensure maximum performance and scale

    Native HDFS and Hive Source/Target Support

    Integrated development environment with metadata and preview support

    Perform any pre-processing needed before ingestion

  • hStream with MapR Continuous Ingestion

    Transactions,

    OLTP, OLAP

    Social Media,

    Web Logs

    Documents,

    Email

    Industry

    Standards

    Machine Device,

    Scientific

    Informatica Ultra Messaging

    Streaming Data Continuously

    Network File System (NFS)

  • Informatica Data Archive

    Archiving to Hadoop

    Production

    Data

    Optimized File Archive

    Stored on Hadoop File System

    Archive data to optimized file format

    for storage reduction

    Compressed (up to 90%)

    Immutable

    Accessible (SQL, ODBC, JDBC)
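Much of the storage reduction claimed above comes from compressing highly repetitive transactional data into an optimized file format. A rough illustration of why such archives compress well, using Python's gzip on invented sample rows (the row format and the exact ratio are hypothetical, not the Informatica archive format):

```python
import gzip

# Invented transactional rows; repetitive archived data compresses well,
# which is where much of the claimed storage reduction comes from
rows = "\n".join(f"{i},ACME Corp,2012-01-15,OPEN" for i in range(10_000)).encode()
packed = gzip.compress(rows)
ratio = 1 - len(packed) / len(rows)
print(f"compressed away {ratio:.0%} of the original size")
```

The archive stays readable (here via decompress; in the product via SQL/ODBC/JDBC) even though it is stored compressed and immutable.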

  • Informatica Data Archive

    Archiving from Hadoop

    File Archive

  • Parse and Prepare Data On

    Hadoop

    HParser

  • Informatica HParser

    Tackling Diversity of Big Data

    / scientific

    Flat Files &

    Documents Interaction data Industry Standards XML

    The broadest coverage for Big Data

    ^/>Delimited

  • Parse and Prepare Data on Hadoop

    How does it work?

    hadoop dt-hadoop.jar My_Parser /input/*/input*.txt

    1. Define parser in HParser visual

    studio

    2. Deploy the parser on Hadoop

    Distributed File System (HDFS)

    3. Run HParser to extract data and

    produce tabular format in

    Hadoop
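Steps 1-3 above can be mimicked in miniature: define a parser for a proprietary format, then run it over raw input to emit rows in tabular form. This Python sketch stands in for HParser only conceptually; the log format, regex pattern, and field names are all invented for the example:

```python
import csv
import io
import re

# Invented "proprietary web log" format: ts=... user=... action=...
LOG_PATTERN = re.compile(
    r"ts=(?P<ts>\S+)\s+user=(?P<user>\S+)\s+action=(?P<action>\S+)")

def parse_to_rows(lines):
    # Extract fields from each matching line and emit a tabular row;
    # non-matching lines are skipped
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            yield (m["ts"], m["user"], m["action"])

lines = ["ts=2012-06-01T10:00 user=alice action=login",
         "noise line that the parser skips",
         "ts=2012-06-01T10:05 user=bob action=purchase"]
out = io.StringIO()
csv.writer(out).writerows(parse_to_rows(lines))
print(out.getvalue())
```

In the real product the parser is defined visually in HParser studio and executed in parallel across HDFS splits, rather than over a Python list.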

  • SWIFT MT

    SWIFT MX

    NACHA

    FIX

    Telekurs

    FpML

    BAI V2.0\Lockbox

    CREST DEX

    IFX

    TWIST

    UNIFI (ISO 20022)

    SEPA

    FIXML

    MISMO

    B2B Standards

    UN/EDIFACT

    EDI-X12

    EDI ARR

    EDI UCS+WINS

    EDI VICS

    RosettaNet

    OAGI

    Financial

    Healthcare

    HL7

    HL7 V3

    HIPAA

    NCPDP

    CDISC

    Insurance

    DTCC-NSCC

    ACORD-AL3

    ACORD XML

    IATA-PADIS

    PLMXML

    NIEM

    Other

    Easy example-based visual enhancements and edits

    Enhanced Validations

    Informatica HParser

    Productivity: Data Transformation Studio

    Out of the box transformations for all messages in all versions

    Updates and new versions delivered from Informatica

  • Why Hadoop?

    Extremely large data sets

    Often information is split across multiple files

    Not sure what we are looking for

    An HParser Example

    Proprietary web logs

  • Profiling and Discovering Data

    Informatica Profiling for Hadoop

  • Discovery of Hadoop Issues/Anomalies

    Repository

    Informatica

    MapReduce

    Hadoop

    Create/Run profile to discover Hadoop data attributes

    Profile auto-converted to Hadoop queries/code (Hive, MapReduce, etc.)

    Executed natively on Hadoop

    Import metadata via native connectivity to Hadoop (Hive, HDFS, Hbase, etc.)

    Review and share results via

    browser or Eclipse clients Single table/data object

    Cross table/data object

    Data Domain Discovery

    HIVE

    HDFS

    HBase


    beta
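The profiling stats listed in the next slide (min/max values, NULL counts, inferred data types, value/pattern frequencies) can be sketched for a single column. A toy Python version over invented COUNTRY CODE data; the actual product converts the profile to Hive/MapReduce and runs it natively on Hadoop rather than in-process:

```python
from collections import Counter

def profile_column(values):
    # Toy column profile: NULL count, min/max, inferred type, value frequencies
    non_null = [v for v in values if v not in ("", None)]
    inferred = "integer" if all(v.lstrip("-").isdigit() for v in non_null) else "string"
    return {
        "nulls": len(values) - len(non_null),
        "min": min(non_null),
        "max": max(non_null),
        "inferred_type": inferred,
        "frequencies": Counter(non_null),
    }

# Invented sample column; "USA" is the kind of unexpected pattern
# that value/frequency analysis surfaces in a COUNTRY CODE field
stats = profile_column(["US", "US", "DE", "", "USA", "DE"])
print(stats["nulls"], stats["inferred_type"], stats["frequencies"].most_common(2))
```

Low-frequency outliers like the lone "USA" are exactly what the drilldown step then inspects against the underlying data.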

  • Hadoop Data Profiling Results

    CUSTOMER_ID example

    COUNTRY CODE example

    3. Drilldown Analysis (into Hadoop Data)

    2. Value &

    Pattern

    Analysis of

    Hadoop Data

    1. Profiling Stats: Min/Max Values, NULLs,

    Inferred Data Types, etc.

    ZIP CODE example

    Drill down into actual

    data values to inspect

    results across entire data

    set, including potential

    duplicates

    Value and Pattern

    Frequency to isolate

    inconsistent/dirty data or

    unexpected patterns

    Hadoop Data Profiling

    results exposed to

    anyone in enterprise via

    browser

    Stats to identify

    outliers and

    anomalies in data

    beta

  • Hadoop Data Domain Discovery

    Finding functional meaning of Hadoop Data

    1. Leverage INFA rules/mapplets to identify

    functional meaning of Hadoop data

    2. Sensitive data (e.g. SSN, Credit Card number,

    etc.)

    3. Liability and Compliance risk?

    PHI: Protected Health Information

    PII: Personally Identifiable Information

    Scalable to look for/discover ANY Domain type

    2. View/share report of data

    domains/sensitive data

    contained in Hadoop. Ability

    to drill down to see suspect

    data values. beta
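Domain discovery as described above amounts to running recognizer rules over column values and tagging any column whose values mostly match a rule. A small Python sketch with two invented regex rules and a match threshold (real INFA rule sets and mapplets are far richer than this):

```python
import re

# Hypothetical rule set mapping a data domain to a recognizer, in the
# spirit of applying rules to label Hadoop columns as sensitive (PII/PHI)
DOMAIN_RULES = {
    "SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "CREDIT_CARD": re.compile(r"^\d{4}(-?\d{4}){3}$"),
}

def discover_domains(column_values, threshold=0.8):
    # Tag the column with every domain whose rule matches most non-empty values
    hits = []
    values = [v for v in column_values if v]
    for domain, rule in DOMAIN_RULES.items():
        matches = sum(bool(rule.match(v)) for v in values)
        if values and matches / len(values) >= threshold:
            hits.append(domain)
    return hits

print(discover_domains(["123-45-6789", "987-65-4321", "000-11-2222"]))  # ['SSN']
```

The threshold keeps a column tagged even when a few values are dirty, which is also why the report lets you drill down to the suspect values themselves.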

  • Transforming and Cleansing Data

    PowerCenter for Hadoop

    Informatica Data Quality for Hadoop

  • Data Integration and Data Quality

    Hadoop MapReduce Processing

    SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY,
           customer.C_NAME, customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
    FROM (
        SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
        FROM lineitem
        GROUP BY L_ORDERKEY
    ) T1
    JOIN orders ON (T1.ORDERKEY2 = orders.O_ORDERKEY)
    JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY)
    JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY)
    WHERE nation.N_NAME = 'UNITED STATES';

    INSERT OVERWRITE TABLE TARGET1 SELECT *;
    INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY, count(ORDERKEY2) GROUP BY CUSTKEY;

    Hive HQL

    Informatica Developer 1. Informatica mapping translated to optimized

    Hive HQL

    2. HQL invokes custom UDF within Informatica

    DTM for certain specialized data transformations

    3. Optimized HQL translated to MapReduce

    4. MapReduce and UDF executed on Hadoop

    Data Nodes

    UDF MapReduce

    Informatica

    Data Transformation Library

    beta
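As a rough mental model of what the generated HQL above computes, the same join-filter-aggregate shape can be written over a few in-memory rows. Everything here (table contents, key values) is invented for illustration; the point is only the dataflow, not the engine:

```python
from collections import Counter

# Invented stand-ins for the Hive tables referenced in the HQL
orders   = [{"O_ORDERKEY": 1, "O_CUSTKEY": 10}, {"O_ORDERKEY": 2, "O_CUSTKEY": 11}]
customer = [{"C_CUSTKEY": 10, "C_NATIONKEY": 100}, {"C_CUSTKEY": 11, "C_NATIONKEY": 200}]
nation   = {100: "UNITED STATES", 200: "FRANCE"}

# Join orders -> customer -> nation, keep US rows (the WHERE clause),
# then count orders per customer (the GROUP BY feeding TARGET2)
cust_by_key = {c["C_CUSTKEY"]: c for c in customer}
target1 = [o for o in orders
           if nation[cust_by_key[o["O_CUSTKEY"]]["C_NATIONKEY"]] == "UNITED STATES"]
target2 = Counter(o["O_CUSTKEY"] for o in target1)
print(target1, dict(target2))
```

On the cluster these joins and the custom UDF run as MapReduce across the data nodes; here the whole pipeline fits in two comprehensions.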

  • Import existing PC artifacts into Hadoop development environment

    Validate import logic before the actual import process to ensure compatibility

    beta

    Reuse and Import PC Metadata for Hadoop

  • Design integration and quality logic for Hadoop in a graphical and metadata driven environment

    Configure where the integration logic should run Hadoop or Native

    beta

    Design Mappings as Usual

  • View complete generated and pushed down Hive or MR code from Hadoop mappings

    beta

    View Generated HiveQL

  • Orchestrating and Monitoring

    Hadoop

    Informatica Workflow & Administration for Hadoop

  • Mixed Workflow Orchestration

    One workflow running tasks on Hadoop and local environments

    beta
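A mixed workflow of this kind can be pictured as an ordered list of tasks, each tagged with where it runs, sequenced by one driver. A minimal Python sketch (the task names and driver are invented; the real orchestration is metadata-driven and submits actual jobs):

```python
# Invented mixed workflow: each task declares where it runs; one driver
# sequences Hadoop-pushed and local steps in a single flow
def run_workflow(tasks):
    log = []
    for name, where, fn in tasks:
        log.append(f"{name}@{where}")
        fn()  # in the product this would submit a Hive/MapReduce or local job
    return log

tasks = [
    ("ingest",  "hadoop", lambda: None),  # e.g. HDFS load
    ("cleanse", "hadoop", lambda: None),  # e.g. generated Hive/MapReduce job
    ("report",  "local",  lambda: None),  # e.g. a non-Hadoop process
]
print(run_workflow(tasks))  # ['ingest@hadoop', 'cleanse@hadoop', 'report@local']
```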

  • Monitoring Hive Query Plan Details

    beta

    The same Hive query is available in the Developer tool.

  • Monitoring Hive Query Drilldown to M/R

    beta

    Traceability to

    individual M/R

    Jobs. Link to Job

    Tracker URLs

    View Hive

    Query Details

    Summary of job tracker

    status

  • Hadoop GA

    (9.5.1 Release)

    Native HDFS and

    Hive connectivity

    Integrated parsing

    on Hadoop

    Data Integration &

    Data Quality push

    down execution on

    Hadoop

    Data Discovery on

    Hadoop

    Mixed workload

    orchestration and

    administration

    Product Roadmap

    Capability

    Hadoop Beta

    (9.5 Release)

    Native HDFS and Hive

    connectivity

    Integrated parsing on

    Hadoop

    Data Integration & Data

    Quality push down

    execution on Hadoop

    Data Discovery on

    Hadoop

    Mixed workload

    orchestration and

    administration

    PowerExchange

    for Hadoop

    (HDFS and PC)

    Hparser

    (including JSON

    Parsing)

    Support for parallel

    processing of large file

    parsing

    Support for parsing of

    archived files

    Managed file transfer

    Metadata Manager &

    Lineage Integration

    Translation to PIG

    support

    Profiling API on Hadoop

    (call from Java or M/R)

    Persistence of profiling

    stats on Hadoop

    Additional DI & DQ

    transformations running

    on Hadoop

    Timeline: Available Now, 1H 2012, 2H 2012, 1H 2013

  • Hadoop Planned Release

    Beta/Early Access: August to October 2012

    GA: 9.5.1 Release, December 2012

    PowerCenter Big Data Edition Q3 2012 (Tentative)

    PowerCenter Standard Edition

    Enterprise Grid Option for PowerCenter

    PowerExchange for Hadoop

    PowerExchange for Social Media

    PowerExchange for Data Warehouse Appliance

    hParser

    PowerCenter on Hadoop (Available Dec 2012)


    When Is It Available?
