1 © Copyright 2016 EMC Corporation. All rights reserved. BUSINESS DATA LAKE Transform your BI & Analytics Charles Sevior CTO EMC Emerging Technologies Division @CharlesSevior

EMC presentation at The Chief Data Officer Forum, Sydney




Media data

Card transactional data

Shopping basket data

Geo-location data

Clickstream data

Industry sector data

Long tail data

The Data Gap: seeking deep & rich information across a broad population set…

[Chart: The Data Gap. Vertical axis: richness of data, from poor to rich. Horizontal axis: adults in Australia, from 0 to 18m. Deep, rich consumer insights are detailed but have limited coverage; census data is ubiquitous but dull.]

The Problem

Data Sharing is a Bespoke, Bilateral Activity

Companies need more data than they have. But there is no process for companies to access the data they need, so they invent it anew every time.

Current Market Data Sets are NOT complete

[Chart: THE DATA GAP. Vertical axis: data richness. Horizontal axis: data coverage. Plotted: insights, census.]

There is a Poor Time to Value Ratio in Data Sharing

[Chart: value against time in data sharing. Labels: data sharing team; legal documentation; technology, security; analytics.]

• Legal contracts

• Risk mitigation

• Security protocols

• Management’s Time


End User (Website)

End User (Agency)

End User (Retailer)

End User (Property)

End users are companies that use data products and services.

Data Product Developers

Application Developers

Off the shelf BI/GIS

Industry Experts

Partners are companies that create, implement and provide Data Products to End Users.

Data Contributor (Bank)

Data Contributor (Airline)

Data Contributor (Media)

Data Contributor (Retailer)

Data Contributors put data into Data Republic.

Data Bank (resides behind bank-grade firewall)

Data Bank is the repository for PII data, where it is depersonalised.

Data Republic Senate Platform (only uses tokenised data)

Data Republic provides the exchange and analytics platform.

End to End (covers everything)

Partner’s own software

App Store

Tokenisation

Data Republic’s information eco-system: increase value by making information accessible & legally / contractually shareable

• Request master tokens (in case the data requires joining information from disparate datasets based on PII)

• Specific output segments/results get associated contributor token keys

Solving: Security and Privacy

Separation of PII & Attribute data between Databank and Senate platform helps solve privacy and security concerns

• Stores PII in bank grade secure systems

• Generates Master Tokens (tokens associated with the same individual entity are stored securely, to be consumed by DR)

DATA BANK: Compliance Sensitive Information

DATA REPUBLIC: Competitively Sensitive Information

• Once deposited in Data Bank, no PII is ever used in any part of this architecture

• In case of a breach of either or both systems, the complete individual information and the transactional information cannot be joined (reattached)

Data Contributors (DC)

Master Tokens

Request Master Token for Datasets

Output: Specific Segments data with or without token information

Tokens associated with each PII record are returned to the DC to attach to attribute data before sending to DR

Tokens with attribute and transaction data

PII

Input

Processing for segments

Outputs
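The separation of PII from attribute data described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Data Republic's actual implementation: the `DataBank` class, the function names `deposit`, `master_tokens`, `tokenise_dataset` and `join_on_master`, the use of an email address as the identifying field, and the HMAC-derived master token are all hypothetical choices made for the example.

```python
import hashlib
import hmac
import secrets


class DataBank:
    """Hypothetical stand-in for the PII store; not a real Data Republic API."""

    def __init__(self):
        self._key = secrets.token_bytes(32)  # secret never leaves the Data Bank
        self._pii_by_token = {}              # contributor token -> PII record
        self._master_by_token = {}           # contributor token -> master token

    def deposit(self, pii_records):
        """Accept PII records and return one opaque token per record."""
        tokens = []
        for record in pii_records:
            # Assumption: a normalised email address identifies the individual.
            identity = record["email"].strip().lower().encode()
            master = hmac.new(self._key, identity, hashlib.sha256).hexdigest()
            token = secrets.token_hex(16)    # random, reveals nothing about the PII
            self._pii_by_token[token] = record
            self._master_by_token[token] = master
            tokens.append(token)
        return tokens

    def master_tokens(self, tokens):
        """Map contributor tokens to master tokens so datasets can be joined."""
        return {t: self._master_by_token[t] for t in tokens}


def tokenise_dataset(bank, rows):
    """Contributor side: swap PII for tokens before sending rows onward."""
    tokens = bank.deposit([{"email": r["email"]} for r in rows])
    return [
        {"token": t, **{k: v for k, v in r.items() if k != "email"}}
        for t, r in zip(tokens, rows)
    ]


def join_on_master(bank, left, right):
    """Platform side: join two tokenised datasets without ever seeing PII."""
    masters = bank.master_tokens([r["token"] for r in left + right])
    right_by_master = {masters[r["token"]]: r for r in right}
    joined = []
    for row in left:
        match = right_by_master.get(masters[row["token"]])
        if match is not None:
            merged = {**row, **match}
            del merged["token"]              # output segment carries no token here
            joined.append(merged)
    return joined


bank = DataBank()
airline = tokenise_dataset(bank, [
    {"email": "ann@example.com", "trips_overseas": 3},
    {"email": "bob@example.com", "trips_overseas": 0},
])
retailer = tokenise_dataset(bank, [
    {"email": "Ann@Example.com ", "basket_value": 120.0},
])
joined = join_on_master(bank, airline, retailer)
print(joined)
```

In this sketch the platform side only ever handles tokens and attributes, and the two datasets can be linked only by asking the Data Bank for master tokens, which mirrors the "Request Master Token for Datasets" step in the diagram.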

Airline (DC)

Bank (DU)

Legend: DC = Data Contributor, DU = Data User

Single Bilateral shares… Scaling Data Users into Contributors… a comprehensive multi-lateral data eco-system

Airline Bank

(DC)

Retailer

Insurance

Government

Shopping Centre (DU)

Airline Bank

Shopping Centre (DC)

Insurance

(DC)

Gov’t (DC)

Bank

FMCG

Startup

Software Charity

Insurance

(DU)

Insight Required: Bank

When are my customers overseas?

Insight Required: Shopping Centre

What type of retailers should I put in my centre?

Insight Required: Retailer

How many, & what type of customers walk past my shop?

Insight Required: Insurance

Where are medical services located & what type of unhealthy products are being purchased?

Retailer

(DU)

Network Effect of Data Exchanges

Data Users become Contributors. With more layers of contribution, a truly multi-lateral analysis can yield even more detailed & personalised propensity outcomes.


STEPS TO EXTRACT DATA VALUE

Build New Applications, Products, & Business Models

Leverage New Analytics To Predict The Future

Gather as Much Data As Possible


CURRENT STATE ANALYTICS

Existing Enterprise Data Warehouse

$$$$

(Highly Summarized / Processed Data)

ERP

HR

SFDC

Traditional Data Sources

Load

New Data Sources/Formats

Machine

ETL

Backup Storage

Trash

BI / Analytical Tools

Business Users

“This data doesn’t look right. Where’s the detail?”

“I really need data I know we have, but it’s not accessible”

“I can’t afford to keep buying more EDWs at this growth!”


THE BUSINESS DATA LAKE APPROACH

Analytic Sandbox

Analytic Environment – New Insights

Structured BI Reporting Environment

Data Preparation and Enrichment via Hadoop

ALL data fed into Business Data Lake

EDW ETL

Business Data Lake

Offload EDW to Hadoop


HADOOP ENABLES THE DATA LAKE

An ecosystem for storing and processing any data type

Large community of users and developers

Easily extended with new interfaces and tools

Not limited to single data type – can access any data

Store, process, and analyze any size data sets


DATA LAKE IMPLEMENTATION STRATEGY NEEDS TO…

• Combine different data sources

• Minimize data movement

• Leverage the Apache ecosystem

• Evolve seamlessly

• Serve the Enterprise

Production Data

Web Logs

Public

Sales

Billing

CRM

SCM

Social Media

Location

Click Streams

Sensor Data

DATA LAKE


STANDARD HADOOP FOR THE DATA LAKE?

Direct-attached storage

Stand-alone Servers

Single purpose

All commodity environment

Standard Hadoop

Efficiency & Agility

Rapid deployment

Purpose Built Silos

Operational Complexity

Enterprise Challenges

Reintroduces challenges that Enterprise IT solved years ago


WOULD YOU RATHER INTEGRATE OR INNOVATE?


ERP, CRM, RDBMS, Machines

Files, Images, Video, Logs, Clickstreams

External Data Sources

EDWs, SANs, Search Servers, LTO Libraries, NAS

Slow Results

Silos

Inconsistent Security

Access

THE TYPICAL SITUATION TODAY


Faster Time to Insights

Enterprise Security

Multi-protocol Access

Shared Storage

Reporting

Mobile Analytics

Files

Archive

Web

A BETTER BUSINESS DATA LAKE


EMC PARTNERSHIPS AND SUPPORTED SOLUTIONS

IBM BigInsights


Benefits

• Both structured and unstructured data

• Expanded analytics with MapReduce, NoSQL, etc.

BOTTOM-LINE IMPACT

Solution                        Cost/Terabyte   Hadoop Advantage
Hadoop on Isilon                <$1,000
Teradata Warehouse Appliance    $16,500         16x savings
Oracle Exadata                  $14,000         14x savings
IBM Netezza                     $10,000         10x savings
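The "Hadoop Advantage" column follows from dividing each cost per terabyte by the Hadoop on Isilon figure. The sketch below reproduces that arithmetic. It assumes the slide's "<$1,000" can be treated as an upper bound of $1,000, so each result is a lower bound on the multiple; the names `HADOOP_ON_ISILON`, `alternatives` and `savings_multiple` are illustrative.

```python
# Cost per terabyte in USD, as listed in the table above.
# Assumption: "<$1,000" is treated as an upper bound of 1,000.
HADOOP_ON_ISILON = 1_000

alternatives = {
    "Teradata Warehouse Appliance": 16_500,
    "Oracle Exadata": 14_000,
    "IBM Netezza": 10_000,
}


def savings_multiple(cost_per_tb, baseline=HADOOP_ON_ISILON):
    """How many times more expensive a solution is than the baseline."""
    return cost_per_tb / baseline


multiples = {name: savings_multiple(cost) for name, cost in alternatives.items()}

for name, multiple in multiples.items():
    print(f"{name}: at least {multiple:.1f}x the cost per terabyte")
```

The Teradata row works out to 16.5, which the slide states as 16x.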


EMC Data Lake - Hadoop Benchmark Results

Summary of Outcome

• 30TB of TPC-DS dataset

• HDP 2.2 with Hive on Tez & ORC files

• 45 queries executed

• Isilon completes the queries in only 70% of the elapsed time of the baseline

• Total execution time is 219 minutes on Isilon vs 374 minutes on the DAS baseline
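The two total execution times can be compared with simple arithmetic. The sketch below uses only the 219 and 374 minute figures quoted above; the variable names are illustrative. On these totals Isilon finishes in roughly 59% of the baseline elapsed time, or about 1.7 times faster, so the 70% figure in the summary may have been derived on a different basis (for example per query).

```python
# Total execution times in minutes, taken from the benchmark summary.
isilon_minutes = 219
das_minutes = 374

elapsed_ratio = isilon_minutes / das_minutes   # share of the baseline time Isilon needs
speedup = das_minutes / isilon_minutes         # how many times faster Isilon is
minutes_saved = das_minutes - isilon_minutes   # absolute difference

print(f"Isilon elapsed time vs DAS baseline: {elapsed_ratio:.1%}")  # 58.6%
print(f"Speed-up over DAS baseline: {speedup:.2f}x")                # 1.71x
print(f"Minutes saved across the run: {minutes_saved}")             # 155
```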

[Chart: % Performance Increase by Query #. Vertical axis from -200% to 1200%. Queries labelled: 3, 12, 17, 20, 25, 27, 29, 32, 39, 42, 45, 50, 54, 56, 66, 71, 79, 84, 87, 89, 91, 96, 98.]


Data Sources

Ingest: Spring XD, Sqoop, Flume, Kafka, NFS, SMB, FTP, HTTP, Swift

Raw Files

ETL Processing

Hive/HAWQ/Impala

Processed Files

SQOOP HDFS and/or NFS

Hadoop

DATA SOURCES

EDW Business Intelligence

MODERN DATA WAREHOUSE WITH EMC ISILON

Analytical Tools

Data Sources

Location Data

Unstructured

Social Media

Voice Data

Machine Logs

SAP

ERP

CRM

Click Streams

Emails

Ticker Data


DATA LAKE 2.0 – AVAILABLE NOW CONNECTING EDGE TO CORE TO CLOUD

EDGE CORE CLOUD

EXPAND DATA LAKE TO THE EDGE … AND TO THE CLOUD

ESG: Remote office / Branch Office technology Trends, May 2015

IDC: “Preparing for the Internet of Things: Assessing the Impact on your data center and edge IT sites”, June 2015


44% of enterprises have 10 - 50 TB per Branch Office

4 - 10% of Remote servers to support data collection for IoT

Massively Scalable Capacity

Reduced Cost