Upload
yahoo-developer-network
View
105
Download
0
Embed Size (px)
DESCRIPTION
As Apache Hadoop adoption continues to advance, customers are depending more and more on Hadoop for critical tasks, and deploying Hadoop for use cases with more real-time requirements. In this session, we will discuss the desired performance characteristics of such a deployment and the corresponding challenges. Leave with an understanding of how performance-sensitive deployments can be accelerated using In-Memory technologies that merge the Big Data capabilities of Hadoop with the unmatched performance of In-Memory data management.
Citation preview
In-Memory Accelerator for Hadoop™
www.gridgain.com #gridgain
Slide www.gridgain.com
Hadoop: Pros and Cons
2
What is Hadoop?> Hadoop is a batch system> HDFS - Hadoop Distributed File System> Data must be ETL-ed into HDFS> Parallel processing over data in HDFS> Hive, Pig, HBase, Mahout...> Most popular data warehouse
Pros:> Scales very well> Fault tolerant and resilient> Very active and rich eco-system> Process TBs/PBs in parallel fashion
Cons:> Batch oriented - real time not possible> Complex deployment> Significant execution overhead> HDFS is IO and network bound
Slide www.gridgain.com
In-Memory Accelerator For Hadoop: Overview
3
Up To 100x Faster:1. In-Memory File System
100% compatible with HDFSBoost HDFS performance by removing IO overheadDual-mode: standalone or cachingBlend into Hadoop ecosystem
2. In-Memory MapReduceEliminate Hadoop MapReduce overheadAllow for embedded executionRecord-based
Slide www.gridgain.com
GridGain: In-Memory Computing Platform
4
Slide www.gridgain.com
In-Memory Accelerator For Hadoop: Details
> PnP IntegrationMinimal or zero code change
> Any Hadoop distroHadoop v1 and v2
> In-Memory File System100% compatible with HDFSDual-mode: no ETL needed, read/write-throughBlock-level caching & smart evictionAutomatic pre-fetchingBackground fragmentizerOn-heap and off-heap memory utilization
> In-Memory MapReduceIn-process co-located computations - access GGFS in-processEliminate unnecessary IPCEliminate long task startup timeEliminate mandatory sorting and re-shuffling on reduction
5
Slide www.gridgain.com
GridGain Visor: Unified DevOps
6
HDFS Profiler
File Manager
Slide www.gridgain.com
Benchmarks: GGFS vs HDFS
7
10 nodes cluster of Dell R610> Each has dual 8-core CPU> Ubuntu 12.4, Java 7> 10 GBE network> Stock Apache Hadoop 2.x
Slide www.gridgain.com
Comparison: Hadoop Accelerator vs. Spark
8
> No ETL requiredAutomatic HDFS read-through and write-throughData is loaded on demand
> Per-block file cachingOnly hot data blocks are in memory
> Strong management capabilitiesGridGain Visor - Unified DevOps
> Requires data ETL-ed into SparkChanges to data do not get propagated to HDFSExplicit ETL step consumes time
> Needs to have full file loadedIf does not fit - gets offloaded to disk
> No management capabilities
Slide www.gridgain.com
Customer Use Case: Task & Challenge
9
Task:> Real time search with MapReduce> Dataset size is 5TB > Writes 80%, reads 20%> Perceptual real time SLA (few seconds)
Challenge:> Hadoop MapReduce too slow (> 30 sec)> Data scanning slow due to constant IO> Overall job takes > 1 minute
Slide www.gridgain.com
Customer Use Case: Solution
10
> Utilize existing serversStart GridGain data node on every server
> Only put highly utilized files in GGFSUser controlled caching
> In-Memory MapReduce over GGFSEmbedded processing
> Results under 3 seconds
GridGain Systemswww.gridgain.com
1065 East Hillsdale Blvd, Suite 230Foster City, CA 94404
@gridgain