Suggested Algorithm to improve Hadoop's performance

Research on Scheduling Scheme for Hadoop clusters

By: Jiong Xie, FanJun Meng, HaiLong Wang, HongFang Pan, JinHong Cheng, Xiao Qin

04/13/23 1CSC 8710

Outline

• What is Hadoop?

• Hadoop Characteristics

• Hadoop Objectives

• Big Data Challenges

• Hadoop Architecture

• What is the predictive scheduling and prefetching mechanism?

• Hadoop Issues

• Hadoop Scheduler

• PSP Scheduler

• Conclusion

Goal

• Design a prefetching mechanism that solves the data-movement problem in MapReduce and improves performance.

What is Hadoop?

• Hadoop is an open-source software framework for handling large amounts of data and processing them on clusters of commodity hardware.

Characteristics

• It is a framework of tools, not a single program as some people think.

• Open-source tools.

• Distributed under the Apache license.

• Linux-based tools.

• It works on a distributed model.

- Not one big powerful computer, but numerous low-cost computers.

Objectives

• Hadoop supports running applications on Big Data.

• Therefore, Hadoop addresses Big Data challenges.

[Diagram: Hadoop supports running applications on Big Data.]

Big Data Challenges

Why Do We Need Hadoop?

• A powerful computer can process data only up to the point where the quantity of data exceeds the computer's capacity.

• Beyond that point, a tool such as Hadoop is needed.

• Hadoop uses a different strategy to deal with data.

Hadoop Functionality

• Hadoop breaks the data into smaller pieces and distributes them equally across different nodes so that they are processed at the same time.

• Similarly, Hadoop divides the computation equally among the nodes.

• The results are combined and then sent back to the application.
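
The split, compute, and combine steps above can be sketched in a few lines of Python. This is only a single-machine illustration, not Hadoop's API: the function names (`split_evenly`, `count_words`, `combine`, `run_job`) and the choice of word counting as the computation are assumptions made for the example.

```python
from collections import Counter


def split_evenly(items, num_nodes):
    """Split items into num_nodes chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), num_nodes)
    chunks, start = [], 0
    for i in range(num_nodes):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def count_words(lines):
    """Per-node computation (hypothetical): count the words in one chunk."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def combine(partial_results):
    """Merge the per-node results into one result for the application."""
    total = Counter()
    for partial in partial_results:
        total.update(partial)
    return total


def run_job(lines, num_nodes):
    chunks = split_evenly(lines, num_nodes)
    # On a real cluster each chunk would be processed on a different node
    # at the same time; here the chunks are processed one after another.
    partials = [count_words(chunk) for chunk in chunks]
    return combine(partials)
```

Calling `run_job(lines, 3)` on a list of text lines returns the same word counts as counting the whole list on one machine; only the way the work is divided changes.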

Hadoop Functionality

[Diagram: the input Big Data is divided equally among the nodes; each node performs its computation; the results are combined and the combined result is returned.]

Architecture

• Hadoop consists of two main components:

– MapReduce: divides the workload into smaller pieces

– File System (HDFS): accounts for component failure and keeps a directory of all the data blocks

• Other projects provide additional functionality:

• Pig

• Hive

• HBase

• Flume

• Mahout

• Oozie

• Sqoop

[Diagram: Hadoop consists of MapReduce and the file system (HDFS).]

Architecture

• Each slave computer consists of two components:

- Task Tracker: processes the task it is given; it represents the MapReduce component.

- Data Node: manages the piece of data that has been given to the Task Tracker; it represents HDFS.

Architecture

• The master computer consists of four components:

- Job Tracker: works under the MapReduce component; it breaks the job into smaller tasks and divides them equally among the Task Trackers.

- Task Tracker: processes the task it is given.

- Name Node: keeps an index of all the pieces of data and where they are stored.

- Data Node: manages the piece of data that has been given to the Task Tracker.

Architecture

Fault Tolerance for Data

• Hadoop keeps three copies of each file, and each copy is stored on a different node.

• If any Task Tracker fails, the Job Tracker detects the failure and asks another Task Tracker to take over that job.

• The tables in the Name Node are also backed up on a different computer, which is why the enterprise version of Hadoop keeps two masters: a working master and a backup master.
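
A minimal sketch of the two ideas above, keeping three copies on different nodes and handing work to another node after a failure. The round-robin placement and the names `place_replicas` and `pick_replacement` are assumptions made for illustration; real HDFS uses its own placement policy.

```python
REPLICATION = 3  # the slide above states that three copies are kept


def place_replicas(block_id, nodes, replication=REPLICATION):
    """Choose `replication` distinct nodes for a block.

    Round-robin placement is a hypothetical policy used only for this sketch.
    """
    if len(nodes) < replication:
        raise ValueError("need at least as many nodes as replicas")
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]


def pick_replacement(block_replicas, failed_node, live_nodes):
    """Return a live node that still holds a copy of the block, or None."""
    for node in block_replicas:
        if node != failed_node and node in live_nodes:
            return node
    return None
```

With four nodes, a block placed on three of them survives the loss of any single node, so the task that was reading it can be restarted on a node returned by `pick_replacement`.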

Scalability Cost

• The scalability cost is always linear: to increase the speed, increase the number of computers.

Predictive Scheduling and Prefetching

• A predictive scheduling and prefetching (PSP) mechanism is implemented on Hadoop to improve performance.

• Predictive scheduler:

- A flexible task scheduler that predicts the most appropriate Task Trackers for the next data.

• Prefetching module:

– Forces the preload worker threads to start loading data into the node's main memory before the current task finishes. The timing depends on the estimated finish time.

PSP

• Factors that make PSP possible:

- Underutilization of the CPU

- Importance of MapReduce performance

- The storage availability in HDFS

- Interaction between the nodes

Hadoop’s Issue

• In the current MapReduce model, all tasks are managed by the master node, so the computation nodes must ask the master node to assign the next task to process.

• The master node tells the computing nodes what the next task is and where it is located.

• CPU time is wasted while a computation node communicates with the master node.

Hadoop’s Issue

• The original Hadoop assigns tasks to the computation nodes randomly, and the data is read from a local or remote disk only when it is required.

• The CPU of a computing node does not start processing until all the input data has been loaded into main memory.

• This affects Hadoop’s performance negatively.
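
A rough timing model makes the cost of this stall concrete. It compares a node that loads and then computes each task in turn with a node whose loading overlaps the previous task's computation. The model is an assumption made for illustration (one loader, one CPU, unlimited memory for loaded data), and the function names are hypothetical; it is not a measurement of Hadoop.

```python
def time_without_overlap(tasks):
    """tasks: list of (load_time, compute_time). The CPU idles during every load."""
    return sum(load + compute for load, compute in tasks)


def time_with_overlap(tasks):
    """Loading the next input overlaps with computing the current task."""
    load_done = 0  # time at which the loader has finished the current input
    cpu_free = 0   # time at which the CPU has finished the previous task
    for load, compute in tasks:
        load_done += load                 # the loader works back to back
        start = max(load_done, cpu_free)  # compute needs both the data and the CPU
        cpu_free = start + compute
    return cpu_free
```

For three tasks that each need 2 time units to load and 3 to compute, the first function gives 15 and the second gives 11: only the first load is still paid in full.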

Prefetching

• It forces the preload worker threads to start loading data from the local disk into the node's main memory before the current task finishes.

• The waiting time is reduced, so tasks are processed on time.

• This improves the performance of the MapReduce system.
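
The idea of a preload worker thread can be sketched with Python's standard `threading` module. This is a toy, assuming one block per task and one block prefetched ahead; `load_block` and `process` are hypothetical stand-ins for the disk read and the map computation, and nothing here is Hadoop's or the paper's actual code.

```python
import threading


def load_block(block_id):
    """Stand-in for reading a block from local disk (hypothetical)."""
    return f"data-{block_id}"


def process(data):
    """Stand-in for the task's computation (hypothetical)."""
    return data.upper()


def _preload(block_id, box):
    """Run by the worker thread: load the next block into memory."""
    box["data"] = load_block(block_id)


def run_with_prefetch(block_ids):
    results = []
    if not block_ids:
        return results
    current = load_block(block_ids[0])  # the first load cannot be overlapped
    for i in range(len(block_ids)):
        box, worker = {}, None
        if i + 1 < len(block_ids):
            # Start loading the next block before the current task finishes.
            worker = threading.Thread(target=_preload, args=(block_ids[i + 1], box))
            worker.start()
        results.append(process(current))
        if worker is not None:
            worker.join()  # the next block is now in memory
            current = box["data"]
    return results
```

The results are the same as loading and processing each block in turn; the difference is that every load after the first runs while the CPU is busy with the previous block.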

Hadoop Scheduler

• In the original Hadoop scheduler, the Job Tracker includes a task scheduler module that assigns tasks to the different Task Trackers.

• Task Trackers periodically send a heartbeat to the Job Tracker.

• The Job Tracker checks the heartbeats and sends tasks to the Task Trackers that are available.

• The scheduler assigns tasks randomly to the nodes via the same heartbeat message protocol.

• It assigns tasks randomly and mispredicts stragglers in many cases.
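
The heartbeat exchange described above can be sketched as follows. The class and method names (`JobTracker`, `on_heartbeat`) and the `free_slots` parameter are assumptions made for this illustration and do not reproduce Hadoop's real classes; the random choice follows the slide's description of the original scheduler.

```python
import random


class JobTracker:
    """Toy job tracker that hands out pending tasks in reply to heartbeats."""

    def __init__(self, tasks, seed=None):
        self.pending = list(tasks)
        self.assignments = {}           # task -> name of the task tracker
        self.rng = random.Random(seed)  # seeded only to make runs repeatable

    def on_heartbeat(self, tracker, free_slots):
        """Reply to a heartbeat with up to free_slots randomly chosen tasks."""
        assigned = []
        for _ in range(min(free_slots, len(self.pending))):
            index = self.rng.randrange(len(self.pending))
            task = self.pending.pop(index)
            self.assignments[task] = tracker
            assigned.append(task)
        return assigned
```

Because the choice ignores where the input data lives and how fast each tracker is, a task can land on a node that has to fetch its data remotely or that is already slow.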

Predictive Scheduler

• A predictive scheduler is built by designing a prediction algorithm and integrating it with the original Hadoop.

• The predictive scheduler predicts stragglers and finds the appropriate data blocks.

• The prediction decisions are made by a prediction module during the prefetching stage.
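
The slides do not give the prediction algorithm itself. As one possible illustration of predicting stragglers, the sketch below estimates each task's remaining time from its progress rate and flags tasks that are far above the median. The heuristic, the `factor` threshold, and the function names are assumptions, not the method from the paper.

```python
import statistics


def estimated_time_left(progress, elapsed):
    """Estimate the remaining time from the progress rate observed so far.

    progress is the completed fraction in (0, 1]; elapsed is the running time.
    """
    if not 0 < progress <= 1:
        raise ValueError("progress must be in (0, 1]")
    if elapsed <= 0:
        raise ValueError("elapsed must be positive")
    rate = progress / elapsed
    return (1 - progress) / rate


def find_stragglers(tasks, factor=1.5):
    """tasks maps a task name to (progress, elapsed).

    A task is flagged when its estimate exceeds factor times the median estimate.
    """
    estimates = {name: estimated_time_left(p, e) for name, (p, e) in tasks.items()}
    median = statistics.median(estimates.values())
    return sorted(name for name, est in estimates.items() if est > factor * median)
```

For example, if two tasks are half done after 10 seconds and a third is only 10% done after the same time, the third is the one flagged.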

Hadoop Function

Launching Process

• Three basic steps to launch the tasks:

- Copy the job from the shared file system to the Job Tracker's file system, together with all the required files.

- Create a local directory for the task and un-jar the contents of the JAR into that directory.

- Copy the task to the Task Tracker to be processed.

Launching Process

• In PSP, all of the preceding steps are monitored by the prediction module, which predicts three events:

- The finish time of the task currently being processed.

- The tasks that are going to be assigned to the Task Trackers.

- The launch time of the pending tasks.

Prefetching

• Three issues must be addressed:

- When to prefetch

- What to prefetch

- How much to prefetch
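
One way to read the three questions is as the outputs of a single planning step. The sketch below is a hypothetical decision rule, not the policy from the paper: it prefetches the next pending blocks (what), limited by free memory (how much), and starts once the current task's estimated remaining time no longer exceeds the time needed to load them (when). All parameter names are assumptions.

```python
def plan_prefetch(time_left_current, load_time_per_block, pending_blocks, free_memory_blocks):
    """Return (start_now, blocks_to_prefetch) for one node."""
    # How much: bounded by free memory and by the number of pending blocks.
    count = min(free_memory_blocks, len(pending_blocks))
    # What: the blocks expected to be needed next, in order.
    blocks = pending_blocks[:count]
    # When: start as soon as waiting any longer would delay the next task.
    start_now = count > 0 and time_left_current <= load_time_per_block * count
    return start_now, blocks
```

Starting earlier than this would hold memory that the running task may still need; starting later would leave the CPU waiting for data.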

Conclusion

• A predictive scheduling and prefetching (PSP) mechanism is proposed to enhance Hadoop performance.

• The prediction module predicts the data blocks that will be accessed by the computing nodes in a cluster.

• The prefetching module preloads this future set of data into the cache of the nodes.

• Applied on a cluster of 10 nodes, it reduces the execution time by up to 28%, and by 19% on average.

• It increases the overall throughput and the I/O utilization.

Resources

• http://ac.els-cdn.com/S1877050913005668/1-s2.0-S1877050913005668-main.pdf?_tid=00e2b8e8-8d59-11e3-be92-00000aacb362&acdnat=1391490095_5f34abbe9f98d3b8a0978b2464478da1

• http://blog.vitria.com/bid/87945/Big-Data-Analytics-Challenges-Facing-All-Communications-Service-Providers

• http://blog.raremile.com/hadoop-demystified/

• http://namitkabra.wordpress.com/category/etl/page/2/

• http://odbms.org/download/Pro%20Hadoop%20Ch.%201.pdf

• http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf

• http://wiki.apache.org/hadoop/Defining%20Hadoop

• https://engineering.purdue.edu/~ychu/ee673/Projects.F11/detectstraggeler_finalrpt.pdf
