Presented by:
Jithin Raveendran
S7 IT, Roll No: 31

Guided by:
Prof. Remesh Babu
Big Data?

Buzz-word "big data": large-scale distributed data-processing applications that operate on exceptionally large amounts of data.

Roughly 2.5 exabytes of data are created every day, so much that 90% of the data in the world today has been created in the last two years alone.
Hadoop

An open-source software framework from Apache for distributed processing of large data sets across clusters of commodity servers.

Designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.

Inspired by:
Google MapReduce
GFS (Google File System)

Case study with Hadoop MapReduce. Hadoop's two components: HDFS and Map/Reduce.
Apache Hadoop has two pillars:

• HDFS
  • Self-healing
  • High-bandwidth clustered storage
• MapReduce
  • Retrieval system
  • The Mapper function tells the cluster which data points we want to retrieve
  • The Reducer function then takes all that data and aggregates it
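The mapper/reducer division above can be sketched in plain Python (this is an illustrative simulation of the programming model, not the Hadoop API):

```python
# A minimal sketch of the map/reduce model: the mapper emits key-value
# pairs, a shuffle step groups them by key, and the reducer aggregates
# each group. Word count is the classic example.
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Aggregate all values emitted for one key.
    return word, sum(counts)

def run_word_count(lines):
    groups = defaultdict(list)
    for line in lines:                  # map phase
        for key, value in mapper(line):
            groups[key].append(value)   # shuffle: group by key
    return dict(reducer(k, v) for k, v in groups.items())  # reduce phase

print(run_word_count(["big data big", "data"]))  # {'big': 2, 'data': 2}
```

In real Hadoop the shuffle happens across the network between worker nodes; here it is a single in-memory dictionary.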
HDFS - Architecture

Name Node: the centerpiece of an HDFS file system.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file.
It responds to successful requests by returning a list of the relevant DataNode servers where the data lives.

Data Node:
Stores data in the Hadoop file system.
A functional file system has more than one DataNode, with data replicated across them.
Secondary Name Node:
Acts as a checkpoint of the NameNode.
Takes a snapshot of the NameNode's metadata and uses it whenever a backup is needed.

HDFS features:
Rack awareness
Reliable storage
High throughput
MapReduce Architecture

• Job Client: submits jobs
• Job Tracker: coordinates jobs
• Task Tracker: executes job tasks
1. The client submits a job to the Job Tracker
2. The Job Tracker talks to the NameNode
3. The Job Tracker creates an execution plan
4. The Job Tracker submits work to the Task Trackers
5. The Task Trackers report progress via heartbeats
6. The Job Tracker manages the phases
7. The Job Tracker updates the job status
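The steps above can be sketched as a toy simulation (class and method names are illustrative, not the Hadoop API): the JobTracker plans the work, hands tasks to TaskTrackers, and collects their reports.

```python
# A toy JobTracker/TaskTracker interaction: the tracker splits a job
# into tasks, distributes them round-robin, and gathers status reports
# (standing in for the heartbeat messages in real Hadoop).

class TaskTracker:
    def __init__(self, name):
        self.name = name

    def run(self, task):
        # Execute the task and report completion back to the JobTracker.
        return {"tracker": self.name, "task": task, "status": "done"}

class JobTracker:
    def __init__(self, trackers):
        self.trackers = trackers

    def submit_job(self, tasks):
        # Create an execution plan: round-robin tasks over the trackers.
        reports = []
        for i, task in enumerate(tasks):
            tracker = self.trackers[i % len(self.trackers)]
            reports.append(tracker.run(task))
        return reports

jt = JobTracker([TaskTracker("tt1"), TaskTracker("tt2")])
reports = jt.submit_job(["map-0", "map-1", "reduce-0"])
print([r["tracker"] for r in reports])  # ['tt1', 'tt2', 'tt1']
```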
Current System:

MapReduce provides a standardized framework for large-scale distributed processing.

Limitation:
Inefficiency in incremental processing.
Proposed System

Dache: a data-aware cache system for big-data applications using the MapReduce framework.

Dache's aim: extend the MapReduce framework and provision a cache layer for efficiently identifying and accessing cache items in a MapReduce job.
Related Work

Google BigTable - handles incremental processing
Google Percolator - incremental processing platform
RAMCloud - distributed computing platform keeping data in RAM
Technical challenges to be addressed

Cache description scheme:
Data-aware caching requires each data object to be indexed by its content.
Provide customizable indexing that enables applications to describe their operations and the content of their generated partial results. This is a nontrivial task.

Cache request and reply protocol:
The size of the aggregated intermediate data can be very large. When such data is requested by other worker nodes, determining how to transport it becomes complex.
Cache Description

Map phase cache description scheme:

"Cache" refers to the intermediate data produced by worker nodes/processes during the execution of a MapReduce task.
A piece of cached data is stored in a Distributed File System (DFS).
The content of a cache item is described by the original data and the operations applied to it, as a 2-tuple: {Origin, Operation}.
Origin: the name of a file in the DFS.
Operation: a linear list of the operations performed on the Origin file.
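A minimal sketch of the {Origin, Operation} description, with a reuse check that treats a cached operation list as reusable when it is a prefix of the requested one (the names and the prefix rule are illustrative assumptions, not taken from the Dache source):

```python
# Cache item description as a 2-tuple {Origin, Operation}: the origin
# file in the DFS plus the linear list of operations applied to it.
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheDescription:
    origin: str        # name of the input file in the DFS
    operations: tuple  # linear list of operations applied to it

def reusable(cached, requested):
    # A cached item can serve a request when it comes from the same
    # origin file and its operation list is a prefix of the requested
    # operations; only the remaining operations must then be computed.
    return (cached.origin == requested.origin and
            requested.operations[:len(cached.operations)] == cached.operations)

cached = CacheDescription("input.txt", ("split", "map:word_count"))
request = CacheDescription("input.txt", ("split", "map:word_count", "sort"))
print(reusable(cached, request))  # True
```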
Reduce phase cache description scheme:

The input for the reduce phase is also a list of key-value pairs, where each value could be a list of values.
Both the original input and the applied operations are required.
The original input is obtained by storing the intermediate results of the map phase in the DFS.
Protocol: relationship between job types and cache organization

• When processing each file split, the cache manager reports the previous file-splitting scheme used in its cache item.
To find words starting with 'ab', we use the cached results for words starting with 'a', and also add the new result to the cache.
Find the best match among overlapped results [choose 'ab' instead of 'a'].
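The best-match rule above can be sketched as follows (an illustrative helper, not the Dache implementation): among cached results whose prefix covers the query, pick the longest one, since it leaves the least extra work.

```python
# Best match among overlapped cached results: for a query prefix,
# keep cache items whose prefix is itself a prefix of the query,
# then choose the longest ('ab' rather than 'a' for query 'ab').

def best_cached_prefix(query, cached_prefixes):
    matches = [p for p in cached_prefixes if query.startswith(p)]
    return max(matches, key=len) if matches else None

print(best_cached_prefix("ab", ["a", "ab", "x"]))  # ab
```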
Protocol: cache item submission

Mapper and reducer nodes/processes record cache items in their local storage space.
A cache item should be put on the same machine as the worker process that generates it.
A worker node/process contacts the cache manager each time before it begins processing an input data file.
The worker process receives the tentative description and fetches the cache item.
Lifetime management of cache items

The cache manager determines how long a cache item can be kept in the DFS.
Two types of policies determine the lifetime of a cache item:

1. Fixed storage quota
• Least Recently Used (LRU) eviction is employed.
2. Optimal utility
• Estimates the saved computation time, ts, gained by caching a cache item for a given amount of time, ta.
• ts and ta are used to derive the monetary gain and cost.
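The fixed-storage-quota policy can be sketched with a small LRU cache (a generic illustration using an OrderedDict, not the cache manager's actual store; the quota here counts items rather than bytes for simplicity):

```python
# An LRU cache with a fixed quota: every access moves an item to the
# "most recently used" end, and inserting past the quota evicts the
# least recently used item.
from collections import OrderedDict

class LRUCache:
    def __init__(self, quota):
        self.quota = quota
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.quota:
            self.items.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")            # 'a' is now most recently used
cache.put("c", 3)         # quota exceeded: 'b' is evicted
print(list(cache.items))  # ['a', 'c']
```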
Cache request and reply

Map cache: cache requests must be sent out before the file-splitting phase.
The Job Tracker issues cache requests to the cache manager.
The cache manager replies with a list of cache descriptions.

Reduce cache:
• First, compare the requested cache item with the cached items in the cache manager's database.
• The cache manager identifies the overlaps between the original input files of the requested cache and the stored cache.
• A linear scan is used here.
Implementation

Extend Hadoop to implement Dache by changing the components that are open to application developers.
The cache manager is implemented as an independent server.

Performance Evaluation
Experiment settings

Hadoop is run in pseudo-distributed mode on a server that has:
an 8-core CPU, each core running at 3 GHz,
16 GB of memory,
and a SATA disk.

Two applications are used to benchmark the speedup of Dache over Hadoop: word-count and tera-sort.
Results

[Benchmark result charts comparing Dache and Hadoop omitted.]
Conclusion

Dache requires minimal change to the original MapReduce programming model.
Application code requires only slight changes in order to utilize Dache.
Dache is implemented in Hadoop by extending the relevant components.
Testbed experiments show that it can eliminate all the duplicate tasks in incremental MapReduce jobs, minimizing execution time and CPU utilization.
Future Work

This scheme consumes a large amount of cache storage.
A better cache-management system will be needed.