1 © Copyright 2016 EMC Corporation. All rights reserved.
BUSINESS DATA LAKE
Transform your BI & Analytics
Charles Sevior
CTO, EMC Emerging Technologies Division
@CharlesSevior
Media data
Card transactional data
Shopping basket data
Geo-location data
Clickstream data
Industry sector data
Long tail data
The Data Gap
Seeking deep & rich information across a broad population set…
[Figure: The Data Gap. Vertical axis: richness of data, from poor to rich. Horizontal axis: adults in Australia, from 0 to 18m. Deep, rich consumer insights are detailed but have limited coverage; the census is ubiquitous but dull.]
The Problem
Data Sharing is a Bespoke, Bilateral Activity
Companies need more data than they have. But there is no process for companies to access the data they need, so they invent it anew every time.
Current Market Data Sets are NOT complete
[Figure: THE DATA GAP between insights and census, plotted as data richness against data coverage.]
There is a Poor Time to Value Ratio in Data Sharing
[Figure: value against time, with stages labelled: data sharing team; legal documentation; technology, security; analytics.]
• Legal contracts
• Risk mitigation
• Security protocols
• Management’s Time
End Users: Website, Agency, Retailer, Property
End users are companies that use data products and services.
Partners: Data Product Developers, Application Developers, Off the shelf BI/GIS, Industry Experts
Partners are companies that create, implement and provide Data Products to End Users.
Data Contributors: Bank, Airline, Media, Retailer
Data Contributors put data into Data Republic.
Data Bank (resides behind bank-grade firewall)
Data Bank is the repository for PII data, where it is depersonalised.
Data Republic Senate Platform (only uses tokenised data)
Data Republic provides the exchange and analytics platform.
End to End (covers everything)
Partner’s own software
App Store
Tokenisation
Data Republic’s information eco-system
Increase value by making information accessible & legally / contractually shareable
Solving: Security and Privacy
Separation of PII & attribute data between the Data Bank and the Senate platform helps solve privacy and security concerns.
DATA BANK: Compliance Sensitive Information
• Stores PII in bank-grade secure systems
• Generates Master Tokens (tokens associated with the same individual entity are stored securely, to be consumed by DR)
DATA REPUBLIC: Competitively Sensitive Information
• Requests master tokens (in case the data requires joining disparate datasets based on PII)
• Specific output segments/results get associated contributor token keys
• Once deposited in the Data Bank, no PII is ever used in any part of this architecture
• In case of a breach of either or both systems, the complete individual information and the transactional information cannot be joined (reattached)
Data Contributors (DC)
Master Tokens
Request Master Token for Datasets
Output: Specific Segments data with or without token information
Tokens associated to each PII record are returned to DC to attach to attribute data, before sending to DR
Tokens with attribute and transaction data
PII
Input
Processing for segments
Outputs
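The token flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Data Republic's implementation: the `DataBank` class, its method names, the use of keyed HMAC-SHA256 hashes as tokens, and the sample records are all hypothetical.

```python
import hashlib
import hmac
import secrets


class DataBank:
    """Hypothetical Data Bank: receives PII, hands back only tokens."""

    def __init__(self):
        # Secret keys never leave the Data Bank (assumption for this sketch).
        self._master_key = secrets.token_bytes(32)
        self._contributor_keys = {}
        self._master_by_token = {}

    @staticmethod
    def _hash(key, pii):
        # Keyed hash of the normalised PII value; HMAC-SHA256 is an assumed choice.
        return hmac.new(key, pii.strip().lower().encode(), hashlib.sha256).hexdigest()

    def tokenise(self, contributor, pii_values):
        """Return one contributor-specific token per PII record."""
        key = self._contributor_keys.setdefault(contributor, secrets.token_bytes(32))
        tokens = []
        for pii in pii_values:
            token = self._hash(key, pii)
            # The master token links records that belong to the same individual.
            self._master_by_token[token] = self._hash(self._master_key, pii)
            tokens.append(token)
        return tokens

    def master_tokens(self, tokens):
        """Resolve contributor tokens to master tokens for a cross-dataset join."""
        return [self._master_by_token[t] for t in tokens]


bank = DataBank()

# Each contributor sends PII to the Data Bank and keeps only the returned tokens.
airline_tokens = bank.tokenise("airline", ["ann@example.com", "bob@example.com"])
retail_tokens = bank.tokenise("retailer", ["Ann@Example.com "])

# Attribute data travels onward keyed by tokens, never by PII.
airline_rows = dict(zip(bank.master_tokens(airline_tokens), ["overseas", "domestic"]))
retail_rows = dict(zip(bank.master_tokens(retail_tokens), ["buys luggage"]))

# Joining on master tokens links the same individual across both datasets.
joined = {m: (airline_rows[m], retail_rows[m])
          for m in airline_rows.keys() & retail_rows.keys()}
```

In this sketch the two contributors receive different tokens for the same person, so their datasets can only be joined by asking the Data Bank for master tokens, and the joined attribute data carries no PII.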
Airline
(DC)
Bank
(DU)
Legend: DC = Data Contributor, DU = Data User
Single Bilateral shares…
Scaling Data Users into Contributors…
... a comprehensive multi-lateral data eco-system
Airline Bank
(DC)
Retailer
Insurance
Government
Shopping
Centre (DU)
Airline Bank
Shopping
Centre (DC)
Insurance
(DC)
Gov’t (DC)
Bank
FMCG
Startup
Software Charity
Insurance
(DU)
Insight Required: Bank
When are my customers overseas?
Insight Required: Shopping Centre
What type of retailers should I put in my centre?
Insight Required: Retailer
How many, & what type of customers walk past my shop?
Insight Required: Insurance
Where are medical services located & what type of unhealthy products are being purchased?
Retailer
(DU)
Network Effect of Data Exchanges
Data Users become Contributors. With more layers of contribution, a truly multi-lateral analysis can yield even more detailed & personalised propensity outcomes.
STEPS TO EXTRACT DATA VALUE
Build New Applications, Products, & Business Models
Leverage New Analytics To Predict The Future
Gather as Much Data As Possible
CURRENT STATE ANALYTICS
Existing Enterprise Data Warehouse
$$$$
(Highly Summarized / Processed Data)
ERP
HR
SFDC
Traditional Data Sources
Load
New Data Sources/Formats
Machine
ETL
Backup Storage
Trash
BI / Analytical Tools
Business Users
“This data doesn’t look right. Where’s the detail?”
“I really need data I know we have, but it’s not accessible”
“I can’t afford to keep buying more EDWs at this growth!”
THE BUSINESS DATA LAKE APPROACH
Analytic Sandbox
Analytic Environment – New Insights
Structured BI Reporting Environment
Data Preparation and Enrichment via Hadoop
ALL data fed into Business Data Lake
EDW ETL
Business Data Lake
Offload EDW to Hadoop
HADOOP ENABLES THE DATA LAKE
An ecosystem for storing and processing any data type
Large community of users and developers
Easily extended with new interfaces and tools
Not limited to single data type – can access any data
Store, process, and analyze any size data sets
DATA LAKE IMPLEMENTATION STRATEGY NEEDS TO…
• Combine different data sources
• Minimize data movement
• Leverage the Apache ecosystem
• Evolve seamlessly
• Serve the Enterprise
Production Data
Web Logs
Public
Sales
Billing
CRM
SCM
Social Media
Location
Click Streams
Sensor Data
DATA LAKE
STANDARD HADOOP FOR THE DATA LAKE?
Direct-attached storage
Stand-alone Servers
Single purpose
All commodity environment
Standard Hadoop
Efficiency & Agility
Rapid deployment
Purpose Built Silos
Operational Complexity
Enterprise Challenges
Reintroduces challenges that Enterprise IT solved years ago
THE TYPICAL SITUATION TODAY
ERP, CRM, RDBMS, Machines
Files, Images, Video, Logs, Clickstreams
External Data Sources
EDWs, SANs, Search Servers, LTO Libraries, NAS
Slow Results
Silos Inconsistent Security
Access
A BETTER BUSINESS DATA LAKE
Faster Time to Insights
Enterprise Security
Multi-protocol Access
Shared Storage
Reporting
Mobile Analytics
Files
Archive
Web
EMC PARTNERSHIPS AND SUPPORTED SOLUTIONS
IBM BigInsights
BOTTOM-LINE IMPACT
Benefits
• Both structured and unstructured data
• Expanded analytics with MapReduce, NoSQL, etc.

Solution                        Cost/Terabyte   Hadoop Advantage
Hadoop on Isilon                <$1,000
Teradata Warehouse Appliance    $16,500         16x savings
Oracle Exadata                  $14,000         14x savings
IBM Netezza                     $10,000         10x savings
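The savings multiples quoted above appear to follow from dividing each cost per terabyte by the Hadoop on Isilon figure. A minimal sketch of that arithmetic, assuming $1,000 as the baseline (the slide gives only "<$1,000", so the multiples are lower bounds) and assuming the results are rounded down to whole numbers:

```python
# Cost per terabyte in dollars, as quoted above.
# "Hadoop on Isilon" is listed as "<$1,000", so 1_000 is an assumed upper bound.
HADOOP_ON_ISILON = 1_000

cost_per_tb = {
    "Teradata Warehouse Appliance": 16_500,
    "Oracle Exadata": 14_000,
    "IBM Netezza": 10_000,
}


def savings_multiple(cost, baseline=HADOOP_ON_ISILON):
    """Whole-number multiple by which `cost` exceeds the baseline (rounded down)."""
    return cost // baseline


savings = {name: savings_multiple(cost) for name, cost in cost_per_tb.items()}
for name, multiple in savings.items():
    print(f"{name}: {multiple}x savings")
```

Under these assumptions the sketch reproduces the 16x, 14x and 10x figures.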
EMC Data Lake - Hadoop Benchmark Results
Summary of Outcome
• 30TB of TPC-DS dataset
• HDP 2.2 with Hive on Tez & ORC files
• 45 queries executed
• Isilon completes the queries in only 70% of the elapsed time of the baseline
• Total execution time is 219 minutes on Isilon vs 374 minutes on the DAS baseline
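As a quick check on the two totals quoted in the summary, the arithmetic can be worked through as below. The variable names are illustrative only. On the totals alone, Isilon's elapsed time comes to about 59% of the baseline, or roughly a 1.7x speed-up; the 70% figure in the summary may refer to the latter or to a different aggregation, but the slide does not say.

```python
isilon_minutes = 219  # total execution time on Isilon, from the summary
das_minutes = 374     # total execution time on the DAS baseline, from the summary

elapsed_ratio = isilon_minutes / das_minutes  # fraction of baseline elapsed time
minutes_saved = das_minutes - isilon_minutes
speedup = das_minutes / isilon_minutes

print(f"Isilon elapsed time: {elapsed_ratio:.0%} of baseline")
print(f"Minutes saved: {minutes_saved}")
print(f"Speed-up: {speedup:.2f}x")
```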
[Figure: % Performance Increase by Query #, for queries 3 to 98; vertical axis from -200% to 1200%.]
Data Sources
Ingest: Spring XD, Sqoop, Flume, Kafka, NFS, SMB, FTP, HTTP, Swift
Raw Files
ETL Processing
Hive/HAWQ/Impala
Processed Files
SQOOP
HDFS and/or NFS
Hadoop
DATA SOURCES
EDW Business Intelligence
MODERN DATA WAREHOUSE WITH EMC ISILON
Analytical Tools
Data Sources
Location Data
Unstructured
Social Media
Voice Data
Machine Logs
SAP
ERP
CRM
Click Streams Emails
Ticker Data
DATA LAKE 2.0: AVAILABLE NOW
CONNECTING EDGE TO CORE TO CLOUD
EDGE CORE CLOUD
EXPAND DATA LAKE TO THE EDGE … AND TO THE CLOUD
Sources: ESG, “Remote Office / Branch Office Technology Trends”, May 2015; IDC, “Preparing for the Internet of Things: Assessing the Impact on your data center and edge IT sites”, June 2015
44% of enterprises have 10 - 50 TB per Branch Office
4 - 10% of Remote servers to support data collection for IoT
Massively Scalable Capacity
Reduced Cost