DESCRIPTION
Speaker: Informatica Senior Product Consultant | 尹寒柏. Session overview: In the Big Data era, the real competition is not over the quantity of data but over how deeply you understand it. Now that Big Data technology has matured, CXOs without an IT background can turn CI (Customer Intelligence), once little more than a buzzword, from a noun into a verb: moving from BI to CI, connecting with the pulse of the consumer economy, and gaining insight into customer intent. One mindset matters in the Big Data era: in the end the competition is not simply about growth in data volume, but about who understands their data more deeply, and Informatica is the best answer to that challenge. With Informatica, enterprises relieve the enormous pressure of delivering trustworthy data on time; and as data volume and complexity keep rising, Informatica can also aggregate data faster, making it meaningful and usable for improving efficiency, raising quality, ensuring certainty, and playing to strengths. Informatica provides a faster and more effective way to achieve this goal, and is the best tool for 精誠集團 (SYSTEX Group) in the Big Data era.
Senior Product Specialist
Big Data and Hadoop
The Challenge: Data fragmentation becomes the barrier to business success
[Timeline chart: the evolution of computing. Technology eras run from Mainframe (1960s-1970s, OS/360) through Client-Server (1980s), Web (1990s), Cloud (2007), and Social (2011) to the Internet of Things (2014); users and sources grow from a few employees (~10^2) through many employees, customers/consumers, business ecosystems, and communities & society to devices & machines (~10^11); business value evolves across back-office automation, front-office productivity, e-commerce, line-of-business self-service, social engagement, and real-time optimization.]
[Diagram: data mart sprawl, with many separate data marts fed by batch ETL, illustrating how data fragments across the enterprise.]
Big Data Challenges: Volume, Variety, Velocity, Veracity
Where is the data I need? Can I trust this data?
Source data (Transactions, OLTP, OLAP; Social Media, Web Logs; Machine/Device, Scientific; Documents and Emails) flowing into analytic systems such as the Enterprise Data Warehouse.
80% of the work in big data projects is data integration and quality.
“I spend more than half my time integrating, cleansing, and transforming data without doing any actual analysis.”
“80% of the work in any data project is in cleaning the data.”
“70% of my value is an ability to pull the data, 20% of my value is using data-science…”
Why Informatica for Big Data & Hadoop
PowerCenter Big Data Edition
Big Transaction Data:
• Online Transaction Processing (OLTP): Oracle, DB2, Ingres, Informix, Sybase, SQL Server, …
• Online Analytical Processing (OLAP) & DW appliances: Teradata, Redbrick, Essbase, Sybase IQ, Netezza, Exadata, HANA, Greenplum, DATAllegro, Aster Data, Vertica, ParAccel, …
• Cloud: Salesforce.com, Concur, Google App Engine, Amazon, …
Big Interaction Data:
• Social Media & Web Data: Facebook, Twitter, LinkedIn, YouTube, …; web applications, blogs, discussion forums, communities, partner portals, …
• Other Interaction Data: clickstream, image/text, scientific, genomic/pharma, medical/device, sensors/meters, RFID tags, CDR/mobile, …
Big Data Processing
• Universal Data Access
• High-Speed Data Ingestion and Extraction
• ETL on Hadoop
• Profiling on Hadoop
• Complex Data Parsing on Hadoop
• Entity Extraction and Data Classification on Hadoop
• No-Code Productivity
• Business-IT Collaboration
• Unified Administration
The Vibe™ virtual data machine (9.6)
Get Data Into and Out of Hadoop: PowerExchange for Hadoop, Replication to Hadoop, Streaming to Hadoop, Data Archiving to Hadoop
Data Ingestion and Extraction: moving terabytes of data per hour.
• Methods: Replicate, Streaming, Batch Load, Extract, Archive/Extract to a low-cost store
• Enterprise systems: Data Warehouse, MDM, Applications
• Data types: Transactions, OLTP, OLAP; Social Media, Web Logs; Documents, Email; Industry Standards; Machine/Device, Scientific
PowerExchange Connectors
Enterprise Applications, Software as a Service (SaaS): JDE EnterpriseOne, JDE World, Lotus Notes, Oracle E-Business Suite ✔, PeopleSoft Enterprise, Salesforce (salesforce.com) ✔, SAP NetWeaver ✔, SAP NetWeaver BI ✔, SAS, Siebel, NetSuite, Microsoft Dynamics
Databases and Data Warehouses: Adabas for UNIX/Windows, C-ISAM, DB2 for LUW ✔, Essbase, EMC/Greenplum, Informix Dynamic Server, Netezza Performance Server, ODBC, Oracle ✔, SQL Server ✔, Sybase, Teradata
Messaging Systems: JMS ✔, MSMQ ✔, TIBCO ✔, webMethods Broker ✔, WebSphere MQ ✔
Technology Standards: Email (POP, IMAP), HTTP(S) ✔, LDAP ✔, Web Services ✔, XML
Mainframe: Adabas for z/OS ✔, Datacom ✔, DB2 for z/OS and z/Linux ✔, IDMS ✔, IMS DB ✔, Oracle for z/Linux ✔, Teradata, WebSphere MQ for z/Linux ✔, VSAM ✔
Big Data: Aster Data, Greenplum, Vertica, ParAccel, Microsoft PDW, Kognitio
Social: Facebook, Twitter, LinkedIn, DataSift, Kapow, MongoDB
Hadoop: HDFS, Hive, HBase
✔ = Accessible in real time and/or via Change Data Capture (CDC)
NoSQL Support for HBase
• Read from HBase as a standard source
• Write to HBase as a standard target
• Complete mappings with HBase sources/targets can execute on Hadoop
• Sample HBase column families (stored in JSON/complex formats)
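To show what reading and writing an HBase column family looks like at the API level, here is a minimal sketch with the standard HBase Java client (1.x+ API), not the Informatica source/target itself; the table name, column family, and JSON payload are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customers"))) {  // hypothetical table

            // Write: one row, one column family ("profile") holding a JSON payload
            Put put = new Put(Bytes.toBytes("cust-0001"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("json"),
                          Bytes.toBytes("{\"name\":\"Acme\",\"country\":\"TW\"}"));
            table.put(put);

            // Read the same cell back, as a source would
            Result result = table.get(new Get(Bytes.toBytes("cust-0001")));
            byte[] value = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("json"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```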
NoSQL Support for MongoDB
• Access, integrate, transform, and ingest MongoDB data into other analytic systems (e.g., Hadoop, data warehouse)
• Access, integrate, transform, and ingest data into MongoDB
• Sample MongoDB data and flatten it to relational format
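For a feel of the "sample and flatten" idea, here is a minimal sketch using the plain MongoDB Java driver rather than the Informatica adapter; the connection string, database, collection, and field names are assumptions.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class MongoSampleFlatten {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> customers =
                    client.getDatabase("crm").getCollection("customers");  // hypothetical names

            System.out.println("customer_id,name,city,zip");               // flattened relational header
            for (Document doc : customers.find().limit(100)) {             // sample the first 100 documents
                Document address = doc.get("address", Document.class);     // nested sub-document
                System.out.printf("%s,%s,%s,%s%n",
                        doc.getObjectId("_id"),
                        doc.getString("name"),
                        address == null ? "" : address.getString("city"),
                        address == null ? "" : address.getString("zip"));
            }
        }
    }
}
```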
IDR (Informatica Data Replication) for Replicating to Hadoop
Supported distributions:
• Apache Hadoop: 0.20.203.x, 0.20.204.x, 0.20.205.x, 0.23.x, 1.0.x, 1.1.x, 2.x.x
• Cloudera: CDH3, CDH4
Extract/apply flow: the source system is extracted to intermediate files in a Cycle_1.work directory (Table 1 file, Table 2 file, … Table N file, plus a Schema.ini file), which are then applied to HDFS.
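To ground the apply step, here is a minimal Java sketch that lands one intermediate table file into a per-cycle HDFS directory using the standard Hadoop FileSystem API. It is not the IDR implementation; the NameNode URI, paths, and record content are assumptions.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ApplyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf)) {
            Path cycleDir = new Path("/replication/cycle_1.work");  // per-cycle work directory
            fs.mkdirs(cycleDir);

            // Write one extracted table file into the cycle directory
            try (FSDataOutputStream out = fs.create(new Path(cycleDir, "table1.dat"))) {
                out.write("1|Acme|2014-01-01\n".getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}
```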
Real-Time Data Collection and Streaming
Ultra Messaging Bus (Publish/Subscribe)
Leverage a high-performance messaging infrastructure: publish with Ultra Messaging for global distribution without additional staging or landing.
Sources: web servers, operations monitors, rsyslog, SLF4J, etc.; handhelds, smart meters, etc. (discrete data messages); Internet of Things and sensor data
Targets: HDFS and HBase; real-time analysis and complex event processing; NoSQL databases (Cassandra, Riak, MongoDB)
Management and monitoring via ZooKeeper across the messaging nodes
Informatica Vibe Data Stream for Machine Data
• High-performance, efficient streaming data collection over LAN/WAN
• GUI provides ease of configuration, deployment, and use
• Continuous ingestion of data generated in real time (sensors, logs, etc.), from machine-generated and other data sources
• Enables real-time interactions and response
• Real-time delivery directly to multiple targets (batch/stream processing)
• Highly available, efficient, and scalable
• Available ecosystem of lightweight agents (sources and targets)
Predictive Maintenance with Event Processing and Analytics
United Technologies Aerospace Systems (UTAS) provides engines and aircraft components to leading commercial and defense manufacturers, including the new Airbus A380 and Boeing B787.
The challenge:
• 5,000+ aircraft in service, plus new design wins, exponentially increase the amount of sensor data being generated
• The “Power by the Hour” leasing model means maintenance costs and service outages fall to UTAS
• No proactive capability to predict when a safety issue might occur
• Once-per-day sensor readings moving to real-time, over-the-air
Archive to Hadoop: Compression Extends Hadoop Cluster Capacity
Without INFA Optimized Archive compression: 10 TB, replicated 3X = 30 TB (10 TB + 10 TB + 10 TB)
With INFA Optimized Archive (95% compression): 10 TB compressed 95% = 500 GB, replicated 3X = 1.5 TB (500 GB + 500 GB + 500 GB)
20X less I/O bandwidth required; 20 min vs. 1 min response time; 8 hours vs. 24 min backup window
Parse and Prepare Data on Hadoop: HParser and XMap
4. The DT engine can immediately use this service to process data.
The DT engine is fully embeddable and can be invoked using any of the supported APIs: Java, C++, C, .NET, and web services. For simple integration, a command-line interface is available to invoke services. Internal custom applications can embed transformation services using the various APIs. PowerCenter leverages DT via the Unstructured Data Transformation (UDT), a GUI transformation widget in PowerCenter that wraps the DT API and engine. DT can also be embedded in other middleware technologies: for some (WBIMB, WebMethods, BizTalk) INFA provides similar GUI widgets (agents) for the respective design environments; for others, the API layer can be used directly.
DT can be invoked in two general ways:
1. Filenames can be passed to it, and DT will directly open the file(s) for processing. On the output side, DT can also write directly to the filesystem.
2. The calling application can buffer the data and send buffers to DT for processing. On the output side, DT can also write back to memory buffers which are returned to the calling application.
Though not shown below, the engine fully supports multiple input and output files or buffers as needed by the transformation.
The engine is invoked as a shared library: the DT engine runs fully within the process of the calling application, not as an external engine. This removes any overhead from passing data between processes, across the network, and so on. The engine is also dynamically invoked and does not need to be started up or maintained externally. Finally, the DT engine is thread-safe and re-entrant, which allows the calling application to invoke DT in multiple threads to increase throughput; a good example is DT's support of PowerCenter partitioning to scale up processing.
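As a rough illustration of those two invocation styles, here is a minimal Java sketch. The TransformationEngine interface and its dummy implementation are hypothetical placeholders, not the actual DT Java API; they only mirror the file-based and buffer-based patterns and the multi-threaded use described above.

```java
import java.nio.charset.StandardCharsets;

public class DtInvocationSketch {

    // Hypothetical stand-in for the in-process transformation engine API.
    interface TransformationEngine {
        void transformFile(String serviceName, String inputPath, String outputPath);
        byte[] transformBuffer(String serviceName, byte[] input);
    }

    public static void main(String[] args) throws Exception {
        // Dummy engine so the sketch runs; a real engine would load a deployed service.
        TransformationEngine engine = new TransformationEngine() {
            public void transformFile(String service, String in, String out) {
                // a real engine would open 'in', apply the service, and write 'out'
            }
            public byte[] transformBuffer(String service, byte[] input) {
                return new String(input, StandardCharsets.UTF_8)
                        .toUpperCase().getBytes(StandardCharsets.UTF_8);
            }
        };

        // 1. File-based invocation: pass filenames, the engine reads and writes files itself.
        engine.transformFile("My_Parser", "/data/in/orders.txt", "/data/out/orders.csv");

        // 2. Buffer-based invocation: the caller owns the data in memory on both sides.
        byte[] out = engine.transformBuffer("My_Parser",
                "raw record".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(out, StandardCharsets.UTF_8));

        // Thread-safe and re-entrant: several threads may call the engine concurrently,
        // which is how PowerCenter partitioning raises throughput.
        Thread t1 = new Thread(() -> engine.transformBuffer("My_Parser", new byte[0]));
        Thread t2 = new Thread(() -> engine.transformBuffer("My_Parser", new byte[0]));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```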
As shown below, the actual transformation logic is completely independent of any calling application. This means you can develop a transformation once and leverage it in multiple environments simultaneously, resulting in reduced development and maintenance times and a lower impact of change.
1. The developer uses Studio to develop a transformation.
2. The developer deploys the transformation to the local service repository (directory). All files needed for the transformation are moved.
3. To deploy to the server, this service folder is moved to the server via FTP, copy, script, etc. NOTE: If the server file system is mountable from the developer machine directly, then step 2 would deploy directly to the server.
Parse and Prepare Data on Hadoop
The broadest coverage for Big Data: flat files and documents, interaction data, industry standards, XML, social, device/sensor, and scientific data.
Productivity
• Visual parsing environment
• Predefined translations
Any DI/BI architecture: Pig, EDW, MDM
Example use case: Call Detail Records (CDR)
• Why Hadoop? CDRs are large data sets: every 7 seconds, every mobile phone in the region creates a record
• Desire to analyze behavior and location to personalize and optimize pricing and marketing
Parse and Prepare Data on Hadoop: how does it work?
1. Define the parser in HParser visual studio
2. Deploy the parser on the Hadoop Distributed File System (HDFS)
3. Run HParser to extract data and produce tabular format in Hadoop
hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt
Profiling and Discovering Data: Informatica Profiling & Data Discovery on Hadoop
CUSTOMER_ID and COUNTRY CODE examples:
1. Profiling stats: min/max values, NULLs, inferred data types, etc. Stats identify outliers and anomalies in the data.
2. Value and pattern analysis of Hadoop data: value and pattern frequency isolate inconsistent/dirty data or unexpected patterns.
3. Drilldown analysis (into Hadoop data): drill down into actual data values to inspect results across the entire data set, including potential duplicates.
Hadoop Data Profiling results are exposed to anyone in the enterprise via a browser.
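To make the statistics in step 1 concrete, here is a minimal plain-Java sketch of profiling a sampled column (nulls, min/max, and a crude inferred type). It is not the Informatica profiler; the sample CUSTOMER_ID values are invented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class ColumnProfileSketch {
    public static void main(String[] args) {
        // Invented sample of a CUSTOMER_ID column pulled from Hadoop
        List<String> values = Arrays.asList("10021", "10022", null, "1002X", "10024");

        long nulls = values.stream().filter(Objects::isNull).count();
        long numeric = values.stream()
                .filter(v -> v != null && v.matches("\\d+"))
                .count();
        String min = values.stream().filter(Objects::nonNull).min(String::compareTo).orElse(null);
        String max = values.stream().filter(Objects::nonNull).max(String::compareTo).orElse(null);

        System.out.printf("nulls=%d, min=%s, max=%s%n", nulls, min, max);
        // Crude inferred type: INTEGER only if every non-null value is all digits
        System.out.println("inferred type: "
                + (numeric == values.size() - nulls ? "INTEGER" : "STRING"));
    }
}
```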
Hadoop Data Domain Discovery: finding the functional meaning of data in Hadoop
• Leverage INFA rules/mapplets to identify the functional meaning of Hadoop data
• Sensitive data (e.g., SSN, credit card number, etc.)
• PHI: Protected Health Information; PII: Personally Identifiable Information
• Scalable to look for/discover ANY domain type
• View/share a report of the data domains/sensitive data contained in Hadoop, with the ability to drill down to see suspect data values
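The rule/mapplet idea above can be pictured as pattern rules applied to a column sample. Below is a minimal Java sketch using plain regular expressions; the rules, threshold, and sample values are illustrative assumptions, not Informatica's rule definitions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class DomainDiscoverySketch {
    // Illustrative domain rules; real rules add checksums, reference data, etc.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("SSN", Pattern.compile("\\d{3}-\\d{2}-\\d{4}"));
        RULES.put("CREDIT_CARD", Pattern.compile("\\d{4}([ -]?\\d{4}){3}"));
        RULES.put("EMAIL", Pattern.compile("[^@\\s]+@[^@\\s]+\\.[^@\\s]+"));
    }

    public static void main(String[] args) {
        // Invented sample drawn from one Hadoop column
        List<String> columnSample = List.of("123-45-6789", "987-65-4321", "n/a");

        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            long hits = columnSample.stream()
                    .filter(v -> rule.getValue().matcher(v).matches())
                    .count();
            double ratio = (double) hits / columnSample.size();
            if (ratio >= 0.5) {  // simple threshold before flagging the column as this domain
                System.out.printf("column looks like %s (%.0f%% match)%n", rule.getKey(), ratio * 100);
            }
        }
    }
}
```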
Transforming and Cleansing Data: PowerCenter on Hadoop, Data Quality on Hadoop
No-code visual development environment; preview results at any point in the data flow.
PowerCenter developers are now Hadoop developers
Reuse and Import PC Metadata for Hadoop: import existing PC artifacts into the Hadoop development environment, and validate the import logic before the actual import process to ensure compatibility.
Natural Language Processing: Entity Extraction & Data Classification
Train NLP models to find and classify entities in unstructured data.
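As a generic illustration of model-based entity extraction (not Informatica's NLP feature), the sketch below uses Apache OpenNLP with a pre-trained person-name model; the model file name, path, and sample sentence are assumptions.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class EntityExtractionSketch {
    public static void main(String[] args) throws Exception {
        // Pre-trained NER model; the file name/path is an assumption for this sketch
        try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
            NameFinderME finder = new NameFinderME(new TokenNameFinderModel(modelIn));

            // Already-tokenized sentence from an unstructured document
            String[] tokens = {"Contact", "John", "Smith", "about", "the", "A380", "order", "."};

            for (Span span : finder.find(tokens)) {
                String entity = String.join(" ",
                        Arrays.copyOfRange(tokens, span.getStart(), span.getEnd()));
                System.out.printf("%s entity: %s%n", span.getType(), entity);
            }
            finder.clearAdaptiveData();  // reset document-level context between documents
        }
    }
}
```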
Address Validation & Data Cleansing
Configure Mapping for Hadoop Execution
No need to redesign mapping logic to execute on either traditional or Hadoop infrastructure. Configure where the integration logic should run: Hadoop or native.
FROM (
  SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY,
         customer.C_NAME, customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
  FROM ( SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
         FROM lineitem GROUP BY L_ORDERKEY ) T1
  JOIN orders ON (customer.C_ORDERKEY = orders.O_ORDERKEY)
  JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY)
  JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY)
  WHERE nation.N_NAME = 'UNITED STATES'
) T2
INSERT OVERWRITE TABLE TARGET1 SELECT *
INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY, count(ORDERKEY2) GROUP BY CUSTKEY;
Data Integration & Quality on Hadoop
Hive-QL
1. Entire Informatica mapping translated to Hive Query Language
2. Optimized HQL converted to MapReduce & submitted to Hadoop cluster (job tracker).
3. Advanced mapping transformations executed on Hadoop through User Defined Functions using Vibe
MapReduce
UDF
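To illustrate step 3, a custom transformation can be exposed to the generated HiveQL as a user-defined function. The class below is a minimal generic Hive UDF in Java (not the Vibe runtime itself); the function and column names are hypothetical.

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Registered and called from HiveQL, for example:
//   ADD JAR normalize-phone.jar;
//   CREATE TEMPORARY FUNCTION normalize_phone AS 'NormalizePhone';
//   SELECT normalize_phone(phone) FROM customers;
public final class NormalizePhone extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;  // propagate NULLs unchanged
        }
        // Strip everything except digits so downstream joins see one canonical form
        return new Text(input.toString().replaceAll("[^0-9]", ""));
    }
}
```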
Example Mapping Execution
Sources: an external flat file, external relational data, and an HDFS file, with the engine repository alongside.
• Mapping logic is translated to HQL and submitted to the Hadoop cluster (a cluster of Linux machines).
• The local flat file is staged temporarily on HDFS; relational data is streamed to Hadoop for processing; a lookup file is temporarily staged.
• HDFS file data is read, and the final processed data is loaded into the target HDFS file.
Orchestrating and Monitoring Hadoop: Informatica Workflow & Monitoring for Hadoop, Metadata Manager for Hadoop, Dynamic Data Masking for Hadoop
Mixed Workflow Orchestration: one workflow running tasks on Hadoop and local environments
Workflow tasks: Cmd_ChooseLoadPath, MT_Load2Hadoop + Parse, Cmd_Load2Hadoop, MT_Parse, Cmd_ProfileData, MT_Cleanse, MT_DataAnalysis, Notification
List of variables:
Name | Type | Default Value | Description
$User.LoadOptionPath | Integer | 2 | Load path for the workflow, depending on the output of the command task
$User.DataSourceConnection | String | HiveSourceConnection | Source connection object
$User.ProfileResult | Integer | 100 | Output from the “profiling” command task
Full traceability from workflow to MapReduce jobs
View generated Hive scripts
Unified Administration: Single Place to Manage & Monitor
Data Lineage and Business Glossary
Hadoop Architecture Overview
• PowerCenter on Hadoop • Data Quality on Hadoop • DT on Hadoop • Entity Extraction on Hadoop • Profiling on Hadoop
[Architecture diagram: execution on Hadoop. INFA clients and the PowerCenter SE enterprise grid running PowerCenter services connect to the cluster through PWX for HDFS, PWX for Hive, PWX for Mercury, and PWX for PC; Mercury services, a Hive client, and a MySQL repository sit alongside. Source systems include Transactions, OLTP, OLAP; Documents, Email; Social Media, Web Logs; and Machine/Device, Scientific data. Each DataNode (DataNode1-DataNode3) carries HDFS, Infa-Lib, HParser, MapReduce, and RDBMS clients, with Hive, the NameNode, and the Job Tracker completing the cluster.]