© 2009 VMware Inc. All rights reserved
Big Data’s Virtualization Journey
Andrew Yu
Sr. Director, Big Data R&D
VMware
2
Big Data: Not Just for the Web Giants – Now the Intelligent Enterprise
3
Real-time analysis allows instant understanding of market dynamics.
Retailers can have an intimate understanding of their customers' needs and use direct, targeted marketing.
Market Segment Analysis
Personalized Customer Targeting
4
The Emerging Pattern of Big Data Systems: Retail Example
Real-Time Streams
Exa-scale Data Store
Parallel Data Processing
Real-Time Processing
Machine Learning
Data Science
Cloud Infrastructure
Analytics
5
A single GE jet engine produces 10 terabytes of data in one hour, roughly 90 petabytes per year.
This data enables early detection of faults, identification of common-mode failures, and product engineering feedback.
Post Mortem
Proactively Maintained
Connected Product
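As a quick sanity check on the figures above, the hourly rate can be scaled to a year. This is a minimal sketch that assumes the engine produces data around the clock and that 1 PB = 1000 TB; neither assumption is stated on the slide, and the function name is illustrative.

```python
# Scale the slide's hourly data rate to a yearly total.
TB_PER_HOUR = 10             # figure quoted on the slide
HOURS_PER_YEAR = 24 * 365    # assumption: continuous operation
TB_PER_PB = 1000             # assumption: decimal units


def petabytes_per_year(tb_per_hour, hours=HOURS_PER_YEAR):
    """Convert an hourly rate in TB into a yearly total in PB."""
    return tb_per_hour * hours / TB_PER_PB


print(petabytes_per_year(TB_PER_HOUR))  # 87.6
```

Under these assumptions the result is 87.6 PB, which is consistent with the rounded figure of 90 PB per year.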
6
Storage: Plan for Peta-scale Data Storage and Processing
[Chart: PB of data (log scale, 0.01 to 1000) by year, 2000 to 2015; series: Online Apps, Analytics]
Analytics Rapidly Outgrows Traditional Data Size by 100x
7
Cloud Infrastructure Supports Mixed Big Data Workloads
Machine Learning
Hadoop
Real-Time Analytics
Management
Network/Security
Storage/Availability
Compute
8
Cloud Infrastructure Supports Multiple Tenants
Cloud Infrastructure
Management
Network/Security
Storage/Availability
Compute
Web User Analytics
Financial Analysis
Historical Customer Behavior
9
Software-defined Datacenter: Compute
The Core Values of Virtualization Apply to Big Data
1. Agility / Rapid deployment
2. Lower Capex
3. Isolation for resource control and security
4. Operational efficiency
Management
Network/Security
Storage/Availability
Compute
10
Strong Isolation between Workloads is Key
Hungry Workload 1
Reckless Workload 2
Nosy Workload 3
Cloud Infrastructure
11
Consolidation of workloads: Higher Utilization
Hadoop 1
Hadoop 2
HBase
• Without virtualization
- Independent Hadoop clusters each have access to only a fraction of the total physical resources
• Consolidate and virtualize
- Consolidated cluster has access to the entire pool of physical resources
- For common use cases, reduce latency on priority jobs on the consolidated cluster
- Multiple HDFS striped across all physical hosts
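The utilization argument above can be illustrated with a small model. This is only a sketch: the demand and partition numbers are hypothetical, the function names are made up for illustration, and real schedulers are more involved than a simple cap.

```python
# Compare how much demand is served when hosts are split into fixed
# per-cluster partitions versus pooled into one consolidated cluster.
# All numbers are hypothetical "host-equivalents", not measurements.

def served_partitioned(demands, partitions):
    """Each workload can use at most its own fixed partition."""
    return sum(min(d, p) for d, p in zip(demands, partitions))


def served_consolidated(demands, total_capacity):
    """All workloads draw from one shared pool of hosts."""
    return min(sum(demands), total_capacity)


demands = [50, 5, 5]         # Hadoop 1 is busy; Hadoop 2 and HBase are idle-ish
partitions = [20, 20, 20]    # three independent clusters of 20 hosts each
total = sum(partitions)

part_util = served_partitioned(demands, partitions) / total
cons_util = served_consolidated(demands, total) / total

print(part_util)  # 0.5
print(cons_util)  # 1.0
```

In this toy case the busy workload is capped at its own 20 hosts while 30 hosts elsewhere sit idle; pooling lets it use the whole 60.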
12
Big Data Mix of Workloads
Compute layer
- Hadoop: batch analysis
- HBase: real-time queries
- NoSQL: Cassandra, Mongo, etc.
- Big SQL: Impala, Pivotal HAWQ
- Other: Spark, Shark, Solr, Platfora, etc., …
File System/Data Store
Virtualization
Host Host Host Host Host Host Host
13
Software-defined Datacenter: Storage
Requirements of Next Generation Storage
1. 10x lower cost of storage
2. Handle explosive data growth
3. Support a variety of application types
4. Solve the privacy and security issues
Management
Network/Security
Storage/Availability
Compute
14
Software-defined Storage Enables Fundamental Economics
[Chart: Cost per GB ($0 to $5.50) vs. Petabytes Deployed (0.5 to 128); series: Traditional SAN/NAS; Scale-out NAS (Isilon, NTAP); Distributed Object Storage (HDFS, MapR, Ceph)]
15
Big-Data using Local Disks
Top of Rack Switch
• High-performance 10 GbE switch per rack
Servers with Local Disks
• 16-24 core server
• 12-24 SATA disks, 2-4 TB each
• 10 GbE adapter
• iSCSI/NFS for shared storage for vMotion, etc., …
Host Host Host Host Host Host Host
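The disk counts and sizes above imply a range of raw capacity per server. A minimal sketch of that arithmetic follows; the `servers` parameter is a hypothetical knob for scaling to a rack, since the slide does not state how many servers a rack holds, and the result is raw capacity before any replication or filesystem overhead.

```python
# Raw (pre-replication) local disk capacity implied by the server spec.

def raw_capacity_tb(disks, tb_per_disk, servers=1):
    """Total raw capacity in TB for a number of identical servers."""
    return disks * tb_per_disk * servers


low = raw_capacity_tb(12, 2)    # smallest configuration on the slide
high = raw_capacity_tb(24, 4)   # largest configuration on the slide

print(low, high)  # 24 96
```

So a single server in this range carries between 24 TB and 96 TB of raw local disk.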
16
Big Data Storage
Elastic Compute
Scale-out Network Storage
• Hadoop Protocol
• Snapshots
• POSIX Apps
• Full NFS Access
• Replication
• Erasure Coding
17
Customer Success: Hadoop as a Service at FedEx
Scale-out Isilon Cluster
- Shared Data
- NAS + Hadoop
Elastic vSphere Cluster
- Mixed Workloads
- vSphere
- Existing Rack Mount Servers
18
Storage Configuration for Data/Compute Separation With Isilon
[Diagram labels: Virtualization Host; Hadoop Virtual Nodes 1-3, each with an OS Image VMDK and Ext4 VMDKs (Job-tracker, Task-tracker, Temp); Shared storage SAN/NAS; Isilon nodes (NN, data node)]
19
Agile Big Data at FedEx
Security
• Trusted isolation
• Well-known, auditable platform
Agility
• Deploy in minutes
• Optimize for shifts in workload characteristics
Elasticity
• Create true multi-tenancy
• Mixed workloads
20
Breakthrough Use Cases
Web Log Analysis: Initial exploration was around detection of mobile devices accessing the website. Analysis of 570 billion web server log entries took approximately 9 minutes to complete on a small cluster.
ZIP Code Analysis: Analysis of data to determine which ZIP codes are the highest source or destination for shipments.
Shipment Analysis: Analysis of shipment information to determine patterns that may delay a package.
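To make the first two use cases concrete, here is a small single-machine sketch of the kind of counting involved. It is not the FedEx implementation: the user-agent keyword list, the record field names (`origin_zip`, `dest_zip`), and the sample data are all hypothetical, and a real job would run as a distributed computation over the cluster.

```python
from collections import Counter

# Hypothetical substrings used to flag a user agent as mobile.
MOBILE_HINTS = ("Mobile", "Android", "iPhone", "iPad")


def is_mobile(user_agent):
    """Return True if the user-agent string contains a mobile hint."""
    return any(hint in user_agent for hint in MOBILE_HINTS)


def count_mobile(user_agents):
    """Count how many log entries came from mobile devices."""
    return sum(1 for ua in user_agents if is_mobile(ua))


def top_zip_codes(shipments, field, n=3):
    """Rank ZIP codes by how often they appear in the given field."""
    return Counter(s[field] for s in shipments).most_common(n)


# Made-up sample data for illustration only.
agents = [
    "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0) Mobile/10A403",
    "Mozilla/5.0 (Windows NT 6.1) Chrome/27.0",
    "Mozilla/5.0 (Linux; Android 4.2) Mobile Safari/535.19",
]
shipments = [
    {"origin_zip": "38116", "dest_zip": "10001"},
    {"origin_zip": "38116", "dest_zip": "94105"},
    {"origin_zip": "60601", "dest_zip": "10001"},
    {"origin_zip": "38116", "dest_zip": "10001"},
]

print(count_mobile(agents))
print(top_zip_codes(shipments, "origin_zip", n=1))
print(top_zip_codes(shipments, "dest_zip", n=1))
```

Both analyses reduce to a filter-and-count pattern, which is why they map naturally onto parallel batch processing.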
21
Cloud Infrastructure is Ready for Big Data – Are you?
Cloud Infrastructure
22
Q&A