Talend Spark Meetup (files.meetup.com/14077672/Talend Spark Meetup updated.pdf)

1 ©2016 Talend

Talend – Spark Meetup

Edward Ost

2

Talend: A History of Innovation and Growth

[Figure: revenue growth timeline, 2006 to 2016, annotated with product milestones: Data Integration, Master Data Management, Data Quality, Big Data, Application Integration, Hadoop 2.0, Spark & Cloud, Data Preparation]

Key Facts

• Founded in 2006

• 550+ employees worldwide

• 7 countries

• 1300+ customers

• 2M+ open source downloads

3

Top Big Data Challenges

Talend Directly Addresses these Challenges

Source: Gartner, 12 September 2013, G00255160, "Survey Analysis: Big Data Adoption in 2013 Shows Substance Behind the Hype"

4

Talend Real-time Big Data

The first data integration platform on Spark

Internet of Things

Delivers an end-to-end integration platform for IoT

Continuous Delivery

Provides Continuous Delivery data integration with unmatched productivity

New Insight

Easily access master data from Big Data, Mobile, and Cloud Apps using MDM REST APIs

Smarter, More Secure Data

New data masking and semantic discovery capabilities

Unleashing the Power of Spark with Real-time Big Data Integration

Talend 6.0

5

Talend Remains Ahead of the Curve for Big Data

[Figure: matrix of supported platform versions per Talend release. Releases: Talend 5.6.x (Dec 2014), Talend 6 (Sept 2015), Talend 6.1 (Dec 2015), Talend 6.2 (Jun 2016). Platform groups: NoSQL, Hadoop Distros (including BigInsights 4.x), Hadoop Cloud. * Tech Preview]

6

The More Data, The Better Talend Performs

[Figure: bar chart. Number of Records Processed (in Millions): 5, 9.5, 19, 37, 75. Speedup labels: 2X, 3.5X, 3.8X, 5.4X, 7.8X Faster]

Benchmark

MCG Global Services

https://info.talend.com/hadoopintegrationinformatica.html?utm_source=google&utm_medium=cpc&utm_campaign=NA Search - Branded - Eval&utm_term=talend&utm_content=talend - broad&utm_creative=116202697600&lang=en&src=GoogleAdwordsOD_US&kid=null&gclid=Cj0KEQjwhN-6BRCJsePgxru9iIwBEiQAI8rq81CJi5KMNznPVZi646JU0GIaSRQdW7PtKArDI0t5cUwaAp5c8P8HAQ

http://www.mcknightcg.com/

7

Easily Convert MapReduce to Spark

MapReduce Performance (runs on disk)

One Click

Spark Performance (runs in-memory & on disk)

5X Faster

8

Information Supply Chain Drivers

Technical Concerns

• Decouple source systems

• Increase agility

• Reduce process latency

• Avoid re-engineering

• At scale

Business Drivers

• Evolving business network

• Data Broker ecosystem

• Transform Data into Information

• Onboarding data sources rapidly

• Accelerate insight

9

Step 1: Establish the Business Keys, Hubs

Step 2: Establish the relationships between the Business Keys, Links

Step 3: Establish description around the Business Keys, Satellites

Step 4: Add Standalone components like Calendars and code/descriptions for decoding in Data Marts

Step 5: Tune for query optimization, add performance tables such as Bridge tables and Point-In-Time structures

Data Vault

10

Simple Data Vault Design Flow - Relational

ACCOUNTS: Account_ID (Pkey), Company_NM, Address_LN1, Address_LN2, City, State, ZipCode, Status_CODE, Is_AUTHORIZED, Is_LOCKED, Created_DT, Modified_DT

USERS: User_ID (Pkey), Account_ID (Fkey), First_NM, Last_NM, Mobile_PH, Gender, Status_CODE, Is_ACTIVE, Created_DT, Modified_DT

Identify Business Keys

Identify Attributes

Establish Linkages

Control Lineage

Control History

De-Normalize
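
The design flow above can be sketched for a single USERS row. This is a minimal illustration, not Talend's implementation: the function names, the hashed surrogate keys (a common Data Vault 2.0 convention that the slide does not mention), and the `load_dt` / `record_source` lineage columns are all assumptions. Only the source column names come from the slide.

```python
import hashlib
from datetime import datetime


def hash_key(*business_keys):
    """Deterministic surrogate key from one or more business keys (assumed convention)."""
    normalized = "|".join(str(k).strip().upper() for k in business_keys)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def to_data_vault(user_row, record_source, load_dt):
    """Split one USERS row into a hub, a link, and a satellite record."""
    user_hk = hash_key(user_row["User_ID"])
    account_hk = hash_key(user_row["Account_ID"])
    # Control lineage: every record carries when and from where it was loaded.
    lineage = {"load_dt": load_dt, "record_source": record_source}

    # Identify business keys: the hub holds only the key itself.
    hub_user = {"user_hk": user_hk, "User_ID": user_row["User_ID"], **lineage}

    # Establish linkages: the link relates the user hub to the account hub.
    link_user_account = {
        "user_account_hk": hash_key(user_row["User_ID"], user_row["Account_ID"]),
        "user_hk": user_hk,
        "account_hk": account_hk,
        **lineage,
    }

    # Identify attributes: everything descriptive goes to the satellite.
    attributes = {
        k: v for k, v in user_row.items() if k not in ("User_ID", "Account_ID")
    }
    sat_user = {"user_hk": user_hk, **attributes, **lineage}
    return hub_user, link_user_account, sat_user


sample_row = {
    "User_ID": 42, "Account_ID": 7, "First_NM": "Ada", "Last_NM": "Lovelace",
    "Mobile_PH": "555-0100", "Gender": "F", "Status_CODE": "A", "Is_ACTIVE": True,
}
hub, link, sat = to_data_vault(sample_row, "crm_extract", datetime(2016, 6, 1))
```

Because the keys are derived only from the business keys, hub, link, and satellite records can be built independently of each other, without lookups against the target tables.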

11

Simple Data Vault Design Flow – Big Data

ACCOUNTS: Account_ID (Pkey), Company_NM, Address_LN1, Address_LN2, City, State, ZipCode, Status_CODE, Is_AUTHORIZED, Is_LOCKED, Created_DT, Modified_DT

USERS: User_ID (Pkey), Account_ID (Fkey), First_NM, Last_NM, Mobile_PH, Gender, Status_CODE, Is_ACTIVE, Created_DT, Modified_DT

Identify Business Keys

Identify Attributes

Establish Linkages

Control Lineage

Control History

De-Normalize

Create PIT & BRIDGE records
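
The "Create PIT" step can be illustrated with a small sketch. It assumes a point-in-time (PIT) record stores, for each hub key and snapshot date, the load date of the satellite row that was current at that snapshot. The function name, the input layout, and the satellite names are hypothetical; Bridge records are not covered here.

```python
from bisect import bisect_right
from datetime import date


def build_pit(sat_load_dates, snapshot_dates):
    """Build PIT records.

    sat_load_dates: {satellite_name: {hub_key: ascending list of load dates}}
    snapshot_dates: iterable of snapshot dates
    """
    hub_keys = sorted({hk for loads in sat_load_dates.values() for hk in loads})
    pit = []
    for snapshot in snapshot_dates:
        for hk in hub_keys:
            record = {"hub_key": hk, "snapshot_dt": snapshot}
            for sat_name, loads in sat_load_dates.items():
                dates = loads.get(hk, [])
                # Number of satellite rows loaded on or before the snapshot.
                idx = bisect_right(dates, snapshot)
                # Latest load date at or before the snapshot, or None if none yet.
                record[sat_name + "_load_dt"] = dates[idx - 1] if idx else None
            pit.append(record)
    return pit


satellites = {
    "sat_user_profile": {"u1": [date(2016, 1, 1), date(2016, 3, 1)]},
    "sat_user_status": {"u1": [date(2016, 2, 1)]},
}
pit_rows = build_pit(satellites, [date(2016, 1, 15), date(2016, 3, 15)])
```

A query against the data mart can then join each satellite on the stored load date with an equality predicate instead of searching each satellite's history for the row that was valid at the snapshot.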

12

• Focus on business keys and simplicity in source extracts

• Autonomous extracts enable parallel processing

• Capture and preserve auditable data in raw data vault

• Defer more complex business rules to the business vault

• Consider point-in-time tables for operational data vault

Spark and Data Vault Design Notes
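
The note that autonomous extracts enable parallel processing can be sketched as follows. This is an illustration under stated assumptions: a thread pool stands in for whatever scheduler runs the extracts, and `extract_business_keys` and `run_extracts` are hypothetical names. The point is that each extract reads only its own source and shares no state with the others.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_business_keys(source_name, rows, key_column):
    """One autonomous extract: depends only on its own source rows."""
    return source_name, sorted({row[key_column] for row in rows})


def run_extracts(sources, max_workers=4):
    """Run independent extracts concurrently.

    sources: list of (source_name, rows, key_column) tuples
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(extract_business_keys, *src) for src in sources]
        # No extract waits on another; results are only gathered at the end.
        return dict(f.result() for f in futures)


sources = [
    ("ACCOUNTS", [{"Account_ID": 2}, {"Account_ID": 1}, {"Account_ID": 2}], "Account_ID"),
    ("USERS", [{"User_ID": 10, "Account_ID": 1}], "User_ID"),
]
keys_by_source = run_extracts(sources)
```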

13

Basic Ingest

14

Data Vault – Relational Model

• Extract data and write to DV-ready CSV files

• Push to S3/RDS

• Use ELT to De-Normalize into Columnar DataMart

15

Data Vault – Big Data Analytics

• Sqoop data directly into S3/DV (Redshift)

• Use ELT to De-Normalize into Columnar DataMart

16

Data Vault with Spark – Big Data Real Time

• Sqoop data directly into S3/DV (Hive)

• Transform to Data Vault with Spark Batch

• Operational Data Vault with Spark Streaming

• ELT to De-Normalize into Columnar DataMart
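
One part of the batch transform to Data Vault is deciding which incoming rows become new satellite rows. The sketch below shows an insert-only delta check in plain Python so that it runs without a cluster; in a Spark batch job the same comparison would typically be expressed as a join between the incoming data and the current satellite state. The `hashdiff` column and all function names are assumptions and do not come from the slides.

```python
import hashlib


def hashdiff(attributes):
    """Hash of all descriptive attributes, in a stable column order."""
    payload = "|".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def satellite_delta(current_hashdiffs, incoming_rows, key_column, load_dt):
    """Return only rows whose attributes changed since the last load.

    current_hashdiffs: {business_key: hashdiff of the latest satellite row};
    updated in place so repeated rows within one batch are also skipped.
    """
    new_rows = []
    for row in incoming_rows:
        key = row[key_column]
        attrs = {k: v for k, v in row.items() if k != key_column}
        diff = hashdiff(attrs)
        if current_hashdiffs.get(key) != diff:
            # History is preserved: changed rows are appended, never updated.
            new_rows.append(
                {key_column: key, **attrs, "hashdiff": diff, "load_dt": load_dt}
            )
            current_hashdiffs[key] = diff
    return new_rows


state = {}
first_load = satellite_delta(
    state, [{"User_ID": 1, "Status_CODE": "A"}], "User_ID", "2016-06-01"
)
second_load = satellite_delta(
    state, [{"User_ID": 1, "Status_CODE": "L"}], "User_ID", "2016-06-02"
)
```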

17

From Data to Information

• OLTP

• Systems of Engagement

• Data Warehouse

• Analytics

• BI

• Supply Chain

• Collaboration

• Self-Service

• On-Demand

18

Lambda Architecture

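[Figure: Lambda Architecture diagram. Sources: IOT, NoSQL, Web Logs, Systems of records, ERP, DBMS, App. Events. Processing: Streaming layer (Ingest, Transform, Update) and Batch layer (Extract, Load, Transform). Consumers: Reporting, Data Mining, MDD/OLAP, Dashboarding, Data Discovery, API, Analytics, Applications. Feedback loop: Learn, Act]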
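
The interplay of the batch layer and the streaming layer can be sketched with a toy event counter. This is a single-process illustration of the general Lambda pattern, not the architecture on the slide: the class and method names are hypothetical, and a real system would keep events that arrive during a batch run in the streaming view until the next run.

```python
from collections import Counter


class LambdaCounts:
    """Toy Lambda architecture: batch view plus streaming view, merged at query time."""

    def __init__(self):
        self.master = []             # append-only event log, input to the batch layer
        self.batch_view = Counter()  # recomputed from the full log
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, event_key):
        """Streaming layer: record the event and update the real-time view."""
        self.master.append(event_key)
        self.speed_view[event_key] += 1

    def run_batch(self):
        """Batch layer: recompute the view from all data, then reset the speed view."""
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, event_key):
        """Serving: combine the complete-but-stale view with the fresh one."""
        return self.batch_view[event_key] + self.speed_view[event_key]


counts = LambdaCounts()
counts.ingest("login")
counts.ingest("login")
counts.run_batch()
counts.ingest("login")
total_logins = counts.query("login")
```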

19

• Discover the Talend Big Data Jumpstart Sandbox

• Starting the Talend Big Data Sandbox

• Big Data Sandbox Forum

• Get It Right, in Real Time with SPARK

• Using AWS EMR, Redshift, and Spark to Power Your Analytics

• TalendForge Big Data Forum

• Data Vault Basics

• Data Vault Series – Agile Modeling not an Option Anymore

Talend Big Data Resources

http://www.talend.com/resources/webinars/discover-the-talend-big-data-jumpstart-sandbox

https://www.youtube.com/watch?v=NIh2ZIJTRu4

https://talendforge.org/forum/viewforum.php?id=41

http://www.talend.com/resources/webinars/get-it-right-in-real-time-with-spark

http://www.talend.com/resources/webinars/using-aws-emr-redshift-and-spark-to-power-your-analytics

http://danlinstedt.com/solutions-2/data-vault-basics/

http://www.vertabelo.com/blog/technical-articles/data-vault-series-agile-modeling-not-an-option-anymore

20

Questions

Edward Ost

Channels Technical Director

[email protected]

301-666-1039

