1 ©2016 Talend
Talend – Spark Meetup
Edward Ost
2
Talend: A History of Innovation and Growth
[Timeline, 2006 to 2016, with revenue growth. Milestones shown: Data Integration, Master Data Management, Data Quality, Big Data, Application Integration, Hadoop 2.0, Spark & Cloud, Data Preparation]
Key Facts
• Founded in 2006
• 550+ employees worldwide
• 7 countries
• 1300+ customers
• 2M+ open source downloads
3
Top Big Data Challenges
Talend Directly Addresses These Challenges
Source: Gartner, 12 September 2013, G00255160, "Survey Analysis: Big Data Adoption in 2013 Shows Substance Behind the Hype"
4
Talend Real-time Big Data: The first data integration platform on Spark
Internet of Things: Delivers an end-to-end integration platform for IoT
Continuous Delivery: Provides Continuous Delivery data integration with unmatched productivity
New Insight: Easily access master data from Big Data, Mobile, and Cloud Apps using MDM REST APIs
Smarter, More Secure Data: New data masking and semantic discovery capabilities
Unleashing the Power of Spark with Real-time Big Data Integration
Talend 6.0
5
Talend Remains Ahead of the Curve for Big Data
Releases compared: Talend 5.6.x (Dec 2014), Talend 6 (Sept 2015), Talend 6.1 (Dec 2015), Talend 6.2 (Jun 2016)
[Table: supported platform versions per Talend release, grouped into NoSQL, Hadoop Distros, and Hadoop Cloud. The platform labels were logos, so the version cells cannot be matched to platforms in this extraction. One legible entry: BigInsights 4.x. Legend: * Tech Preview]
6
The More Data, The Better Talend Performs
[Chart: speedup by number of records processed (in millions). Record counts on the axis: 5, 9.5, 19, 37, 75. Speedup labels: 2X, 3.5X, 3.8X, 5.4X, 7.8X Faster]
Benchmark: MCG Global Services
7
Easily Convert MapReduce to Spark
MapReduce Performance (runs on disk)
One Click
Spark Performance (runs in-memory & on disk)
5X Faster
8
Information Supply Chain Drivers
Technical Concerns
• Decouple source systems
• Increase agility
• Reduce process latency
• Avoid re-engineering
• At scale
Business Drivers
• Evolving business network
• Data Broker ecosystem
• Transform Data into Information
• Onboarding data sources rapidly
• Accelerate insight
9
Data Vault
Step 1: Establish the business keys (Hubs)
Step 2: Establish the relationships between the business keys (Links)
Step 3: Establish descriptive context around the business keys (Satellites)
Step 4: Add standalone components such as calendars and code/description tables for decoding in data marts
Step 5: Tune for query optimization by adding performance tables such as Bridge tables and Point-in-Time structures
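Steps 1 to 3 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Talend's implementation: the `hash_key` helper, the use of MD5 for surrogate keys, and the audit fields `load_dts` and `record_source` are borrowed from common Data Vault practice and do not appear on the slides.

```python
import hashlib
from datetime import datetime, timezone


def hash_key(*business_keys):
    """Hypothetical helper: derive a surrogate key from one or more business keys."""
    joined = "|".join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def build_vault_rows(account_id, user_id, user_attrs, source, load_dts=None):
    """Split one source record into hub, link, and satellite rows."""
    load_dts = load_dts or datetime.now(timezone.utc)
    audit = {"load_dts": load_dts, "record_source": source}  # assumed audit columns

    # Step 1: one hub row per business key
    hub_account = {"account_hk": hash_key(account_id), "account_id": account_id, **audit}
    hub_user = {"user_hk": hash_key(user_id), "user_id": user_id, **audit}

    # Step 2: a link row records the relationship between the two business keys
    link = {
        "user_account_hk": hash_key(user_id, account_id),
        "user_hk": hub_user["user_hk"],
        "account_hk": hub_account["account_hk"],
        **audit,
    }

    # Step 3: a satellite row carries the descriptive attributes for a hub
    sat_user = {"user_hk": hub_user["user_hk"], **user_attrs, **audit}
    return hub_account, hub_user, link, sat_user
```

Keeping the keys, relationships, and attributes in separate rows is what lets each structure be loaded independently of the others.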
10
Simple Data Vault Design Flow - Relational
ACCOUNTS: Account_ID (Pkey), Company_NM, Address_LN1, Address_LN2, City, State, ZipCode, Status_CODE, Is_AUTHORIZED, Is_LOCKED, Created_DT, Modified_DT
USERS: User_ID (Pkey), Account_ID (Fkey), First_NM, Last_NM, Mobile_PH, Gender, Status_CODE, Is_ACTIVE, Created_DT, Modified_DT
Identify Business Keys
Identify Attributes
Establish Linkages
Control Lineage
Control History
De-Normalize
11
Simple Data Vault Design Flow – Big Data
ACCOUNTS: Account_ID (Pkey), Company_NM, Address_LN1, Address_LN2, City, State, ZipCode, Status_CODE, Is_AUTHORIZED, Is_LOCKED, Created_DT, Modified_DT
USERS: User_ID (Pkey), Account_ID (Fkey), First_NM, Last_NM, Mobile_PH, Gender, Status_CODE, Is_ACTIVE, Created_DT, Modified_DT
Identify Business Keys
Identify Attributes
Establish Linkages
Control Lineage
Control History
De-Normalize
Create PIT & BRIDGE records
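The point-in-time (PIT) step can be sketched as follows. This is one plausible reading of a PIT record, not the deck's own code: for each hub key and snapshot date, it stores which satellite load was current at that moment. The `build_pit` name and the record fields are hypothetical, and Bridge records are not covered.

```python
from bisect import bisect_right
from datetime import date


def build_pit(sat_rows, snapshot_dates):
    """sat_rows: iterable of (hub_key, load_date) pairs taken from a satellite.

    Returns one PIT record per (hub key, snapshot date) naming the most
    recent satellite load on or before that snapshot.
    """
    loads = {}
    for hub_key, load_date in sat_rows:
        loads.setdefault(hub_key, []).append(load_date)

    pit = []
    for hub_key, dates in sorted(loads.items()):
        dates.sort()
        for snap in snapshot_dates:
            idx = bisect_right(dates, snap)
            if idx:  # skip snapshots taken before the key's first load
                pit.append({
                    "hub_key": hub_key,
                    "snapshot_date": snap,
                    "sat_load_date": dates[idx - 1],
                })
    return pit


rows = [("u1", date(2016, 1, 1)), ("u1", date(2016, 3, 1)), ("u2", date(2016, 2, 15))]
pit = build_pit(rows, [date(2016, 2, 1), date(2016, 4, 1)])
```

A query for "the state of `u1` on 1 February 2016" can then join straight to the satellite row loaded on 1 January instead of scanning the satellite's history.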
12
Spark and Data Vault Design Notes
• Focus on business keys and simplicity in source extracts
• Autonomous extracts enable parallel processing
• Capture and preserve auditable data in raw data vault
• Defer more complex business rules to the business vault
• Consider point-in-time tables for operational data vault
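The note on autonomous extracts can be illustrated with a small sketch. It uses a thread pool from the Python standard library rather than any Spark API, and the in-memory `SOURCES` tables and the `extract` function are hypothetical stand-ins for real source systems. The point is only that extracts which do not reference each other can run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "source tables" standing in for real systems.
SOURCES = {
    "ACCOUNTS": [{"Account_ID": 1, "Company_NM": "Acme"}],
    "USERS": [{"User_ID": 10, "Account_ID": 1, "First_NM": "Ann"}],
}
BUSINESS_KEYS = {"ACCOUNTS": "Account_ID", "USERS": "User_ID"}


def extract(table):
    """Extract one table by its business key, without reading any other extract."""
    key = BUSINESS_KEYS[table]
    rows = [{"business_key": row[key], "payload": row} for row in SOURCES[table]]
    return table, rows


def run_extracts(tables):
    # No extract depends on another, so they can all be submitted at once.
    with ThreadPoolExecutor(max_workers=len(tables)) as pool:
        return dict(pool.map(extract, tables))


result = run_extracts(["ACCOUNTS", "USERS"])
```

If the USERS extract had to look up ACCOUNTS first, the two could not be scheduled independently, which is the coupling the design notes advise against.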
13
Basic Ingest
14
Data Vault – Relational Model
• Extract data and write it to DV-ready CSV files
• Push to S3/RDS
• Use ELT to De-Normalize into Columnar DataMart
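The ELT de-normalization step can be sketched with SQLite standing in for the target database (the slides name S3/RDS, not SQLite). The table and column names below (`hub_user`, `sat_user`, `mart_user`, `load_dts`) are assumptions for illustration. What makes this ELT rather than ETL is that the join runs inside the database after loading, not in the client.

```python
import sqlite3


def denormalize(conn):
    """ELT step: build a flat mart table from a hub and its latest satellite rows."""
    conn.executescript("""
        DROP TABLE IF EXISTS mart_user;
        CREATE TABLE mart_user AS
        SELECT h.user_id, s.first_nm, s.last_nm, s.load_dts
        FROM hub_user h
        JOIN sat_user s ON s.user_hk = h.user_hk
        WHERE s.load_dts = (
            SELECT MAX(load_dts) FROM sat_user WHERE user_hk = h.user_hk
        );
    """)


conn = sqlite3.connect(":memory:")
# Load step: raw vault tables are populated first (hypothetical sample rows).
conn.executescript("""
    CREATE TABLE hub_user (user_hk TEXT PRIMARY KEY, user_id INTEGER);
    CREATE TABLE sat_user (user_hk TEXT, load_dts TEXT, first_nm TEXT, last_nm TEXT);
    INSERT INTO hub_user VALUES ('hk1', 10);
    INSERT INTO sat_user VALUES ('hk1', '2016-01-01', 'Ann', 'Lee');
    INSERT INTO sat_user VALUES ('hk1', '2016-03-01', 'Ann', 'Ray');
""")
denormalize(conn)
rows = conn.execute("SELECT user_id, last_nm FROM mart_user").fetchall()
```

The mart keeps only the most recent satellite row per user, so `rows` holds a single flat record even though the vault preserves both versions.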
15
Data Vault – Big Data Analytics
• Sqoop data directly into S3/DV (Redshift)
• Use ELT to De-Normalize into Columnar DataMart
16
Data Vault with Spark – Big Data Real Time
• Sqoop data directly into S3/DV (Hive)
• Transform to Data Vault with Spark Batch
• Operational Data Vault with Spark Streaming
• ELT to De-Normalize into Columnar DataMart
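For the operational data vault, one common way to control history is to append a satellite row only when a record's descriptive attributes change. The sketch below models that per-micro-batch logic in plain Python. It shows no Spark Streaming API, and the `hashdiff` convention, the SHA-256 choice, and the function names are assumptions, not something the slides specify.

```python
import hashlib


def hashdiff(attrs):
    """Hash of the descriptive attributes, computed in a fixed key order."""
    canonical = "|".join(f"{k}={attrs[k]}" for k in sorted(attrs))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def apply_micro_batch(current, batch, load_dts):
    """Return the satellite rows to append for one micro-batch.

    current: dict mapping hub_key -> last seen hashdiff (updated in place).
    batch:   iterable of (hub_key, attrs) pairs.
    """
    new_rows = []
    for hub_key, attrs in batch:
        diff = hashdiff(attrs)
        if current.get(hub_key) != diff:  # new key, or attributes changed
            new_rows.append({"hub_key": hub_key, "load_dts": load_dts,
                             "hashdiff": diff, **attrs})
            current[hub_key] = diff
    return new_rows


state = {}
first = apply_micro_batch(
    state,
    [("u1", {"Status_CODE": "A"}), ("u1", {"Status_CODE": "A"})],
    "t1",
)
second = apply_micro_batch(
    state,
    [("u1", {"Status_CODE": "B"}), ("u2", {"Status_CODE": "A"})],
    "t2",
)
```

Repeated, unchanged records produce no new rows, so the satellite grows with the rate of change in the source rather than with the rate of arrival.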
17
From Data to Information
• OLTP
• Systems of Engagement
• Data Warehouse
• Analytics
• BI

• Supply Chain
• Collaboration
• Self-Service
• On-Demand
18
Lambda Architecture
[Diagram labels]
Sources: IoT, NoSQL, Web Logs, Systems of record, ERP, DBMS, App. Events
Layers: Streaming layer, Batch layer
Processing steps: Extract, Load, Transform, Ingest, Update
Consumers: Reporting, Data Mining, MDD/OLAP, Dashboarding, Data Discovery, API, Analytics, Applications
Feedback loop: Learn, Act
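The core idea of a Lambda architecture, answering queries by combining a batch view with a streaming (speed) view, can be sketched in a few lines. This is a generic illustration, not a description of how Talend wires the two layers together. The event shape and function names are hypothetical.

```python
from collections import Counter


def batch_view(master_events):
    """Batch layer: recomputed from the full event log on each batch run."""
    return Counter(e["account_id"] for e in master_events)


def speed_view(recent_events):
    """Streaming layer: counts for events the batch layer has not absorbed yet."""
    return Counter(e["account_id"] for e in recent_events)


def query(account_id, batch, speed):
    """Serving-side merge: batch result plus the real-time delta."""
    return batch[account_id] + speed[account_id]


batch = batch_view([{"account_id": 1}, {"account_id": 1}, {"account_id": 2}])
speed = speed_view([{"account_id": 1}])
total_for_account_1 = query(1, batch, speed)
```

When the next batch run absorbs the recent events, the speed view for that window is discarded, so each event is counted once.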
19
Talend Big Data Resources
• Discover the Talend Big Data Jumpstart Sandbox
• Starting the Talend Big Data Sandbox
• Big Data Sandbox Forum
• Get It Right, in Real Time with SPARK
• Using AWS EMR, Redshift, and Spark to Power Your Analytics
• TalendForge Big Data Forum
• Data Vault Basics
• Data Vault Series – Agile Modeling not an Option Anymore
20
Questions
Edward Ost
Channels Technical Director
301-666-1039