© Copyright 2014 EMC Corporation. All rights reserved.
Redefine Big Data: EMC Data Lake in Action
Andrea Prosperi – Systems Engineer
Agenda
• Data Analytics Today
• Big data
• Hadoop & HDFS
• Different types of analytics
• Data lakes
• EMC Solutions for Data Lakes
The world before big data
Data warehousing: research into, and the definition of, dimensions and facts started in the 1960s. Things really got going in the 1980s.
So what changed?
Big data rocked up to the party.
Traditional solutions struggled
• Too much data
• No Real Time analysis
• No Data Exploration
• More expensive hardware to go faster and deeper
• Overnight batch not good enough
• Not just structured data in a star schema
Thankfully we had Google
Cue Doug Cutting’s son and his elephant, Hadoop…
• Computation Tier uses a framework called MapReduce
• Storage is provided via a distributed filesystem called HDFS
• Hadoop runs on commodity hardware
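As a rough illustration of the MapReduce model named in the bullets above, here is a minimal single-process sketch in Python. It does not use Hadoop or HDFS; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this example. A real job would run the same three steps spread across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big data lake"]))
```

The map and reduce steps touch one line or one key at a time, which is what lets a framework spread them across commodity nodes.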
All analytics aren’t equal
Descriptive, Predictive and Prescriptive. There is also Diagnostic.
[Figure: questions answered by Descriptive, Predictive and Prescriptive analytics. Axes: Degree of Complexity, Competitive Advantage.]
• What exactly is the problem?
• How many, how often, where?
• What happened?
• What will happen next if?
• What if these trends continue?
• What could happen?
• What actions are needed?
• How can we achieve the best outcome including the effects of variability?
• How can we achieve the best outcome?
Source: Based on "Competing on Analytics," Davenport and Harris
Descriptive Analytics
Predictive Analytics
Prescriptive Analytics
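One way to picture the difference between the three kinds of analytics is a toy Python sketch. It assumes a short numeric history (for example, monthly sales). The function names and the idea of scoring each action by a multiplier on the forecast are hypothetical simplifications, not a method taken from the slides.

```python
from statistics import mean


def describe(history):
    """Descriptive: summarize what has already happened."""
    return {"total": sum(history), "average": mean(history)}


def predict_next(history):
    """Predictive: extend a least-squares linear trend by one period.

    Needs at least two observations, otherwise the slope is undefined.
    """
    n = len(history)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(history)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, history))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    slope = numerator / denominator
    return y_bar + slope * (n - x_bar)


def prescribe(forecast, options):
    """Prescriptive: pick the action whose projected outcome is best.

    options maps an action name to an assumed multiplier on the forecast.
    """
    return max(options, key=lambda name: forecast * options[name])


if __name__ == "__main__":
    sales = [10, 12, 14, 16]  # made-up numbers for illustration
    print(describe(sales))
    forecast = predict_next(sales)
    print(forecast)
    print(prescribe(forecast, {"hold": 1.0, "promote": 1.2}))
```

Each step builds on the previous one: the summary describes the past, the forecast extrapolates it, and the prescription chooses among actions using that forecast.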
Data lakes
Today, think of it in terms of co-existence with Enterprise DWH. Both environments are valid.
[Diagram: Enterprise DWH alongside a Hadoop Based Data Lake. Labels: CRM, ERP, OLTP DB; ETL/ELT; Structured Data; Semi-structured & Unstructured Data; Data Transformation; Data Security, Backup; Analyze & Report; Client/Portal Devices.]
What is a Data Lake?
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state.”
James Dixon, who coined the term “Data Lake”
Pragmatic approach to Data Lake
• Identify Domain
• Be Pragmatic/Start Small
• Build Lake infrastructure
• Fill Lake
• Build Fishing Poles, exploration, extract value, then expand
Data Lake Interaction
3 Main Levels of interaction:
• Real Time: for fast analysis and correlation
• Interactive: for transactional processing
• Batch: for large dataset analysis
EMC Solutions for Data Lake Infrastructure
[Diagram: Lake Infrastructure. Interaction levels: BATCH, INTERACTIVE, REAL-TIME. EMC Big Data Storage: ISILON, VNX, Commodity, ECS, DSSD; ViPR Controller; ViPR Services.]
Build Lake Infrastructure
Use General Purpose Arrays/Commodity Disks As Data Lake Store
• Be Fast
– Reuse your current infrastructure to build an HDFS repository
• Reduce risk
– Reduce CAPEX investment required to perform analytics
– Maintain data protection, compliance at array level
• Reduce cost and complexity of dedicated clusters
– Reduce need for new vendor nodes and storage capacity
[Diagram labels: ViPR Data Services; Commodity; 3rd Party; VNX.]
Build Lake Infrastructure
Object, File And HDFS Operations On The Same Data
• ViPR Object & ViPR HDFS access on the same data
– S3, Swift, Atmos API via the Object head
– File protocols in development
• Use your preferred Hadoop distribution
[Diagram labels: VIRTUAL ARRAY; Commodity; Object; HDFS; Object & HDFS.]
Build Lake Infrastructure
Use Specialized Arrays As Data Lake Store
ECS Appliance
• Hyper-scale:
– ECS supports unlimited applications and users on a single, scale-out architecture
– Start at 360 TB and scale to multiple petabytes or even exabytes
• Pre-Engineered and Pre-Built
• Commodity Hardware
• Structured and Unstructured Content
[Diagram label: 3rd platform applications.]
Build Lake Infrastructure
Use Specialized Arrays As Data Lake Store
• Accelerate the benefits of Hadoop for the enterprise
– Proven Hadoop solution, faster implementation
– Greater interoperability with enterprise applications and Hadoop analytics through multi-protocol parallel access from any client
• Enterprise data protection
– Fast snapshots, backup, and recovery
– Simple, reliable data replication for disaster recovery
• Ultimate flexibility
– Scale compute and storage resources separately
– Supports physical and virtualized server environments
EMC/Pivotal Solutions for Data Lake Software
[Diagram: Lake Software, by interaction level.]
• BATCH: Unlimited Pivotal HD
• INTERACTIVE: HAWQ, Greenplum DB
• REAL-TIME: GemFire XD
Pivotal HD Architecture - Apache
[Diagram: Apache components: HDFS; HBase; Pig, Hive, Mahout; MapReduce; Sqoop; Flume; Resource Management & Workflow: YARN, ZooKeeper.]
HAWQ - Full ANSI SQL Engine on Hadoop
[Diagram: Apache components: HDFS; HBase; Pig, Hive, Mahout; MapReduce; Sqoop; Flume; Resource Management & Workflow: YARN, ZooKeeper.]
[Diagram: Pivotal components: Command Center (Configure, Deploy, Monitor, Manage); Data Loader; Spring; Unified Storage Service; Hadoop Virtualization Extension; MADlib Algorithms.]
HAWQ – Advanced Database Services (ANSI SQL + Analytics)
• Xtension Framework
• Catalog Services
• Query Optimizer
• Dynamic Pipelining
GemFire - Real-Time Data Service
[Diagram: Apache components: HDFS; HBase; Pig, Hive, Mahout; MapReduce; Sqoop; Flume; Resource Management & Workflow: YARN, ZooKeeper.]
[Diagram: Pivotal components: Command Center (Configure, Deploy, Monitor, Manage); Data Loader; Spring; Unified Storage Service; Hadoop Virtualization Extension; MADlib Algorithms.]
HAWQ – Advanced Database Services (ANSI SQL + Analytics)
• Xtension Framework
• Catalog Services
• Query Optimizer
• Dynamic Pipelining
GemFire XD – Real-Time Database Services (ANSI SQL + In-Memory)
• Distributed In-memory Store
• Query
• Transactions
• Ingestion
• Processing
• Hadoop Driver – Parallel with Compaction
A Reference Architecture
Standardized, on-demand services are layered around shared data repositories & processing capabilities to form the data lake.
Data Sources
• Existing structured data.
• Unstructured or semi-structured data sources.
• Machine generated data such as logs and sensor data.
• External data sources.
Ingest and data capture
• Scheduled, batch data ingest to capture bulk data sources.
• Micro-batch ingest capturing small quantities of data.
• Low-latency and real-time ingest of data.
• Real-time routing of data to complex event processing and persistent storage.
Shared storage and re-use
• Isilon and ViPR provide shared access to new and existing data sources through HDFS.
• Minimize data copies.
• Smart De-dupe for Hadoop.
• Kerberos Authentication.
Data Analytics
• In-memory performance (GemFire).
• MPP Processing (Pivotal HD).
• High performance SQL access to HDFS data (HAWQ).
Applications and integration
• Cloud Foundry on vSphere.
• Build interactive, data-driven applications using modern frameworks and approaches.
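The micro-batch ingest mode listed under "Ingest and data capture" can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of any EMC or Pivotal product: `micro_batches`, `ingest` and the `sink` callable are hypothetical names, and `sink` stands in for whatever actually writes a batch to storage. It batches by record count only; a real pipeline would typically also flush on a time window.

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Yield successive lists of at most batch_size records from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def ingest(records, batch_size, sink):
    """Pass each micro-batch to sink and return the number of batches written.

    sink is a hypothetical callable standing in for a write to shared storage.
    """
    count = 0
    for batch in micro_batches(records, batch_size):
        sink(batch)
        count += 1
    return count


if __name__ == "__main__":
    stored = []
    written = ingest(["log-1", "log-2", "log-3"], 2, stored.append)
    print(written, stored)
```

Because `micro_batches` pulls lazily from an iterator, the same function works for a finite file and for an open-ended stream of machine-generated records.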
What about services?
Data Science + Data Engineering