DESCRIPTION
The cloud lowers the barrier to entry into analytics for many small and medium-sized enterprises. Hadoop and related frameworks like Hive, Oozie, and Sqoop are becoming tools of choice for deriving insights from data. However, these frameworks were designed for in-house datacenters, which have different tradeoffs from a cloud environment, and making them run well in the cloud presents some challenges. In this talk, we describe how we've extended Hadoop and Hive to exploit these new tradeoffs and offer them as part of the Qubole Data Service (QDS). We will also present use cases that show how QDS makes it easy for an end user to use these technologies in the cloud. Speaker: Ashish Thusoo, CEO, Qubole
Qubole Inc., Proprietary
Hadoop User Group
Ashish Thusoo
Jan 16, 2013
About Me
• Big Data Veteran
• Ran the data infrastructure team at Facebook before starting Qubole
• Co-created Hive in 2007 @ Facebook
What is Qubole?
• A comprehensive cloud data platform based on Hadoop and Hive for data in the cloud
  - Turnkey Infrastructure
  - Cloud Optimized Stack
  - Open Data Formats
• Useful for exploring data and creating batch processing applications/data pipelines
Why Qubole?
End Users (User Ops, Product Managers, etc.)
Heterogeneous Data (Structured & Unstructured)
The Intermediaries (Data Scientists and Engineers)
BOTTLENECK
Qubole Service
Cloud Data Service
Cloud Data Platform: Elastic . Robust . Fast
DataMarts
Explore Schedule SDK
EC2 / S3
Big Data Technology Stack
ODBC
Connectors
API
Logs
Events
DBs
Metrics
Cloud Sources
Cloud vs Bare Metal
• Dynamic vs Fixed Provisioning
• Separation between Compute and Storage
• Purchasing and Budgeting
Dynamic Provisioning
• Advantage: Transient Clusters
• Burden: How big of a cluster do I need?
• Solution: Auto-scaled Hadoop
Challenges: Auto-scaled Hadoop
http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/
• Adapting to Burstiness
  - Current load is not enough, also need to predict future load
• Adapting State-fully
  - Removing HDFS nodes is risky without decommissioning
Implementation: Auto-scaled Hadoop
http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/
• TaskTrackers report launch times to the JobTracker
• JT computes amount of time required to finish existing workloads
• If the time is above a certain threshold then more nodes are added
• At hourly boundaries the nodes are removed in case of insufficient work
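The upscaling rule on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions, not Qubole's implementation: the `ClusterState` fields, the slot-seconds work estimate, and the `max_nodes` cap are all hypothetical names introduced here.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterState:
    nodes: int                   # current number of worker nodes
    slots_per_node: int          # task slots each node offers (assumed uniform)
    pending_task_seconds: float  # estimated remaining work, in slot-seconds


def estimated_completion_seconds(state: ClusterState) -> float:
    """Time to drain the existing workload at the current cluster size."""
    total_slots = state.nodes * state.slots_per_node
    return state.pending_task_seconds / total_slots


def nodes_to_add(state: ClusterState, threshold_seconds: float, max_nodes: int) -> int:
    """Return how many nodes to add so the estimate drops to the threshold."""
    if estimated_completion_seconds(state) <= threshold_seconds:
        return 0
    # Smallest number of slots that finishes the work within the threshold.
    needed_slots = math.ceil(state.pending_task_seconds / threshold_seconds)
    needed_nodes = math.ceil(needed_slots / state.slots_per_node)
    # Respect the (hypothetical) upper bound on cluster size.
    return max(0, min(needed_nodes, max_nodes) - state.nodes)
```

For example, with 4 nodes of 2 slots each and 7200 slot-seconds of pending work, the estimate is 900 seconds; a 600-second threshold would call for 2 more nodes.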
Implementation: Auto-scaled Hadoop
http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/
• Restrictions on Deleting Nodes:
  - Nodes Containing Task Outputs of Current Jobs
  - Fast Decommissioning Done for Data Nodes
  - Minimum Cluster Size Threshold
• Fast Decommissioning - possible because HDFS is a cache for us
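A minimal sketch of how the removal restrictions above might combine with the hourly boundary check. The `Node` fields, the 300-second `window`, and the reading that a node is removable only once its decommissioning has finished are assumptions made for illustration.

```python
from dataclasses import dataclass

BILLING_PERIOD = 3600  # seconds; the hourly boundary mentioned in the slides


@dataclass
class Node:
    name: str
    launch_time: float              # seconds, as reported at startup
    holds_current_job_output: bool  # stores task outputs of running jobs
    decommissioned: bool            # data node already drained


def near_hour_boundary(node: Node, now: float, window: float = 300) -> bool:
    """True if the node is within `window` seconds of completing a paid hour."""
    elapsed = (now - node.launch_time) % BILLING_PERIOD
    return BILLING_PERIOD - elapsed <= window


def removable_nodes(nodes, now, min_cluster_size, window=300):
    """Nodes that pass every restriction, capped by the minimum cluster size."""
    candidates = [
        n for n in nodes
        if near_hour_boundary(n, now, window)
        and not n.holds_current_job_output
        and n.decommissioned
    ]
    # Never shrink the cluster below the configured minimum.
    allowed = max(0, len(nodes) - min_cluster_size)
    return candidates[:allowed]
```

Checking the boundary per node (rather than per cluster) is why each node's launch time needs to be known.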
Compute & Storage on the Cloud (EC2/S3)
• On the cloud Compute and Storage are Separate!!
• Advantage: Don’t Pay for CPU for Storing Data
• Burden: Separation Can Cause Slowness & Variability
• Solutions:
  - Caching File System
  - Masking S3 Latency
Caching File System
http://www.qubole.com/blog/index.php/columnar-cloud-cache/
• Benefits:
  - Masks the performance variance associated with S3 while reading data
  - Columnar caching on the fly enables data to be persisted in open formats while still giving the benefits of performance
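The idea of caching in columnar form on the fly can be illustrated with a toy read-through cache. This is a sketch only: the source object stays in an open row format (CSV here), and the cache keeps one list per column after the first remote read. `ColumnarCache` and `fetch_object` are hypothetical names, and a real cache would persist to local disk rather than memory.

```python
import csv
import io


class ColumnarCache:
    """Read-through cache that stores remote CSV objects column by column."""

    def __init__(self, fetch_object):
        # fetch_object(path) -> CSV text; stands in for a slow S3 read.
        self._fetch_object = fetch_object
        self._columns = {}      # (path, column name) -> list of values
        self.remote_reads = 0   # counts trips to the backing store

    def _load(self, path):
        self.remote_reads += 1
        reader = csv.DictReader(io.StringIO(self._fetch_object(path)))
        rows = list(reader)
        # Re-lay out the row-oriented data as one list per column.
        for name in reader.fieldnames or []:
            self._columns[(path, name)] = [row[name] for row in rows]

    def read_column(self, path, column):
        if (path, column) not in self._columns:
            self._load(path)
        return self._columns[(path, column)]
```

After the first read of any column, later column reads for the same object are served locally, which is what hides the variance of the remote store.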
Masking S3 Latency
http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/
• File Operations in S3 are much slower than HDFS
• Problem: This leads to bad performance when data is distributed in a lot of files
• Solution:
  - Fast Split Generation Algorithm
  - Pipelined File Opens
Faster Split Generation
http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/
• Directory operations with merging instead of per file metadata (up to 8x speedup)
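One way to read this slide: a single prefix listing that already returns file sizes replaces one metadata request per file. The sketch below counts requests against an in-memory stand-in for S3; `FakeObjectStore` and both split functions are hypothetical, and the real algorithm's merging step is not modelled.

```python
class FakeObjectStore:
    """In-memory stand-in for an object store that counts requests."""

    def __init__(self, objects):
        self._objects = dict(objects)  # key -> size in bytes
        self.requests = 0

    def head(self, key):
        """Per-file metadata lookup: one request per file."""
        self.requests += 1
        return self._objects[key]

    def list_prefix(self, prefix):
        """Directory-style listing: one request returns keys and sizes."""
        self.requests += 1
        return sorted((k, s) for k, s in self._objects.items() if k.startswith(prefix))


def _splits(key, size, split_size):
    # Cut a file of `size` bytes into (key, offset, length) splits.
    return [(key, off, min(split_size, size - off)) for off in range(0, size, split_size)]


def splits_per_file(store, keys, split_size):
    """Baseline: look up each file's size individually."""
    return [s for key in keys for s in _splits(key, store.head(key), split_size)]


def splits_by_listing(store, prefix, split_size):
    """Faster variant: take sizes from a single listing call."""
    return [s for key, size in store.list_prefix(prefix) for s in _splits(key, size, split_size)]
```

Both functions produce the same splits; the difference is that the request count grows with the number of files in the first and stays constant per directory in the second.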
Pipelined File Opens
http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/
• Open S3 files before they are read (30% improvements in simple queries)
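Opening files ahead of the reader can be sketched with a small thread pool. This is an illustration, not the Hadoop-side implementation: `read_pipelined`, `open_file`, `read_all`, and the `lookahead` depth are hypothetical names chosen here.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def read_pipelined(paths, open_file, read_all, lookahead=2):
    """Yield each file's contents in order, opening later files in the background.

    open_file(path) -> handle  (the slow step, e.g. a remote open)
    read_all(handle) -> data
    """
    with ThreadPoolExecutor(max_workers=lookahead) as pool:
        pending = deque()
        for path in paths:
            # Start opening this file before earlier ones have been read.
            pending.append(pool.submit(open_file, path))
            if len(pending) > lookahead:
                yield read_all(pending.popleft().result())
        # Drain the handles that are still queued.
        while pending:
            yield read_all(pending.popleft().result())
```

While the caller processes one file, the opens for the next `lookahead` files are already in flight, so their latency overlaps with useful work.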
Purchasing Instances
• Buying Instances on Spot Prices vs On-Demand Prices
• Benefits: Cheaper on average by 50-60%
• Problems: Spot instances are not guaranteed and can be taken away anytime
  - Bad for MapReduce
  - Disastrous for HDFS
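As a rough worked example of the savings figure, the cost of a mixed cluster relative to an all on-demand one can be computed from the share of spot nodes and the spot discount. `blended_cost_ratio` is a hypothetical helper, and the model assumes every node would otherwise cost the same on-demand price.

```python
def blended_cost_ratio(spot_fraction: float, spot_discount: float) -> float:
    """Cluster cost relative to all on-demand (1.0 means no savings)."""
    if not 0 <= spot_fraction <= 1 or not 0 <= spot_discount <= 1:
        raise ValueError("fractions must be between 0 and 1")
    on_demand_part = 1 - spot_fraction
    spot_part = spot_fraction * (1 - spot_discount)
    return on_demand_part + spot_part
```

With half the nodes on spot at a 50% discount, the cluster costs 0.75 of the all on-demand price, so the overall saving is smaller than the per-instance discount unless every node is a spot node.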
Spotted Hadoop Clusters
http://www.qubole.com/blog/index.php/hadoop-auto-scale-ec2-spot-instances/
• Simplified Spot Bidding Strategy
  - Configuring Bidding Timeouts
  - Configuring % of instances through spot
  - Configuring bid prices
• Spot Instance Aware HDFS Block Placement
  - Ensures One Replica of the Blocks Resides on On-Demand Nodes
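The two ideas on this slide, a configurable spot percentage and a placement rule that anchors one replica on an on-demand node, can be sketched as below. Both functions are hypothetical simplifications: real HDFS placement is also randomized and rack-aware, which this sketch ignores.

```python
def split_instance_counts(total: int, spot_percent: int):
    """Return (on_demand, spot) node counts for a requested cluster size."""
    spot = total * spot_percent // 100
    return total - spot, spot


def place_replicas(on_demand, spot, replication=3):
    """Pick nodes for a block so that at least one replica is on on-demand.

    on_demand, spot: lists of node names. The first replica always goes to an
    on-demand node; the rest are filled from the remaining nodes in order.
    """
    if not on_demand:
        raise ValueError("need at least one on-demand node to anchor a replica")
    chosen = [on_demand[0]]
    others = spot + on_demand[1:]
    chosen.extend(others[:replication - 1])
    return chosen
```

With this rule, losing every spot node at once still leaves one copy of each block on a node that cannot be reclaimed.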
Conclusion
• Cloud is Different from Bare Metal
• Check out more optimizations that we have made to run Hadoop and Hive optimally in the cloud at our blog: http://www.qubole.com/blog/
Thank you.
Free Sign up for Qubole at https://api.qubole.com/users/sign_up
Careers at http://www.qubole.com/careers