
CYS14011 - Rithu P Ravi

CYS14012 - Saumya K


Big Data Hadoop... HDFS Map Reduce

Why Hadoop, and what is it?

Apache Hadoop is an open-source software framework

A tool to process big data

Rithu P Ravi,SaumyaK — HADOOP 2/30


Outline

1 Big Data

2 Hadoop...

3 HDFS

4 Map Reduce


Big Data

Data beyond the storage and processing power of conventional systems

The 3 'V's

Volume

Velocity

Variety


Big Data

Exponential growth of data

Challenges to Google, Yahoo, Microsoft, Amazon

How do we go through terabytes (TB) and petabytes (PB) of data?

Existing tools became inadequate for processing such large data sets.


One big elephant, or numerous small chickens?


How to handle such BIG data?

Issues

How to handle a system's ups and downs (failures)?

How to combine the data from all the systems?


Problem 1: System Ups and Downs

Commodity hardware is used for data storage and analysis

Chances of failure are very high

Solution: replicate the data across several machines

GFS (Google File System)

GFS

Divides data into chunks and stores them across the file system

Can store data in the petabyte (PB) range
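
The chunk-and-replicate idea above can be sketched in a few lines of Python. This is only an illustration, not GFS itself: the chunk size, the replication factor of 3, the round-robin placement and the machine names are all assumptions made for the example.

```python
CHUNK_SIZE = 8      # bytes per chunk; tiny on purpose (assumed value)
REPLICATION = 3     # copies kept of each chunk (assumed value)


def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Cut the data into fixed-size chunks; the last one may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks, machines, replication=REPLICATION):
    """Assign each chunk to `replication` distinct machines, round-robin.

    Assumes replication <= len(machines), so the copies land on
    different machines.
    """
    return {
        chunk_id: [machines[(chunk_id + r) % len(machines)]
                   for r in range(replication)]
        for chunk_id in range(num_chunks)
    }


def readable_after_failure(placement, failed):
    """True if every chunk still has a replica on a machine that is up."""
    return all(any(m not in failed for m in replicas)
               for replicas in placement.values())


machines = ["m0", "m1", "m2", "m3"]     # hypothetical commodity machines
data = b"a small file standing in for petabytes"
chunks = split_into_chunks(data)
placement = place_replicas(len(chunks), machines)
```

With three copies of each chunk, `readable_after_failure(placement, {"m0", "m1"})` still holds: losing two machines does not lose any data.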


Problem 2: How to combine the data?

Analyze data across different machines.

Merge: data has to travel across the network.

Doing this is notoriously challenging

Again Google: MapReduce


Map Reduce

Provides a programming model

Abstracts disk reads and writes

Converts data to (key, value) pairs

Two Phases

Map

Reduce


HADOOP

A reliable shared storage system

Analysis system


History

Google was the first to launch GFS and MapReduce

Published the MapReduce paper in 2004

A brand new technology

Already well proven inside Google by 2004


History

Doug Cutting

Created an open-source version of the MapReduce system, called Hadoop

Yahoo and others rallied around to support this effort.

Hadoop is now a core part of: Facebook, Yahoo, LinkedIn, Twitter


Core Concepts

HDFS

Map Reduce


HDFS: Hadoop Distributed File System

Streaming very large files on a commodity cluster

1 Very Large Files: MBs to PBs

2 Streaming

Write-once, read-many approach

No modification

Time to read the whole data set is more important

3 Commodity Cluster

No high-end servers

Yes, high chance of failure (but HDFS is tolerant enough)

Replication is done


HDFS: Hadoop Distributed File System

Services

Masters

Name Node

Secondary Name Node

Job Tracker

Slaves

Data Node

Task Tracker


HDFS: Hadoop Distributed File System

Name Node

Master Node

Maintains the file system namespace

Stores metadata

Secondary Name Node

Periodically updates the fsimage file

Data Node

Slaves

Actual Storage
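
The split between metadata on the Name Node and actual storage on the Data Nodes can be sketched as below. This is a toy model, not the HDFS API: the class and function names, the 4-byte block size and the round-robin placement are assumptions, and replication is left out to keep it short.

```python
class DataNode:
    """Slave: holds the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}            # block id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Master: holds only metadata, never file contents."""

    def __init__(self):
        self.namespace = {}         # file path -> ordered list of block ids
        self.locations = {}         # block id -> name of the Data Node

    def register(self, path, block_id, datanode_name):
        self.namespace.setdefault(path, []).append(block_id)
        self.locations[block_id] = datanode_name

    def lookup(self, path):
        """Return (block id, Data Node name) pairs in file order."""
        return [(b, self.locations[b]) for b in self.namespace[path]]


def write_file(namenode, datanodes, path, data, block_size=4):
    """Split data into blocks, store them round-robin, record the metadata."""
    names = sorted(datanodes)
    for offset in range(0, len(data), block_size):
        index = offset // block_size
        block_id = f"{path}#{index}"
        target = names[index % len(names)]
        datanodes[target].store(block_id, data[offset:offset + block_size])
        namenode.register(path, block_id, target)


def read_file(namenode, datanodes, path):
    """Ask the Name Node where the blocks are, then read from the Data Nodes."""
    return b"".join(datanodes[dn].read(b) for b, dn in namenode.lookup(path))


namenode = NameNode()
datanodes = {name: DataNode(name) for name in ("dn1", "dn2")}
write_file(namenode, datanodes, "/logs/a.txt", b"hello hdfs")
```

Note that `read_file` touches the Name Node only for the lookup; the bytes themselves come from the Data Nodes.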


HDFS Architecture


Map Reduce

Large scale data processing in parallel.

It provides

Automatic parallelization and distribution

Fault tolerance

Two Phases in Map Reduce

Map

Reduce


Map Reduce

Job Tracker

Master

Manages the jobs in the cluster

Task Tracker

Slaves

Responsible for running the map and reduce tasks


Map Reduce

Map Phase

map(inKey, inValue) -> list(outKey, intermediateValue)

Processes input key/value pair

Produces set of intermediate pairs

Reduce Phase

reduce(outKey, list(intermediateValue)) -> list(outValue)

Combines all intermediate values for a particular key

Produces a set of merged output values (usually just one)
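
The two signatures above can be tried out with the classic word-count example. This is a single-process sketch, not Hadoop code: `map_reduce` is a hypothetical driver that does the grouping by key in memory, and the input records are made up.

```python
from collections import defaultdict


def map_fn(in_key, in_value):
    """map(inKey, inValue) -> list(outKey, intermediateValue)"""
    # in_key is the line number (unused); in_value is the line of text.
    return [(word.lower(), 1) for word in in_value.split()]


def reduce_fn(out_key, intermediate_values):
    """reduce(outKey, list(intermediateValue)) -> list(outValue)"""
    return [sum(intermediate_values)]


def map_reduce(records, map_fn, reduce_fn):
    """Run the map phase, group the pairs by key, then run the reduce phase."""
    grouped = defaultdict(list)
    for in_key, in_value in records:
        for out_key, value in map_fn(in_key, in_value):
            grouped[out_key].append(value)
    return {key: reduce_fn(key, values)
            for key, values in sorted(grouped.items())}


records = [(0, "big data needs big tools"),
           (1, "hadoop processes big data")]
result = map_reduce(records, map_fn, reduce_fn)
```

Here `result["big"]` is `[3]`: the reduce phase merged the three intermediate values for that key into a single output value.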




Happy Hadooping.... :)
