CYS14011 - Rithu P Ravi
CYS14012 - Saumya K
Big Data Hadoop... HDFS Map Reduce
Why and What HADOOP?...
Apache Hadoop is an open-source software framework
A tool to process big data
Rithu P Ravi,SaumyaK — HADOOP 2/30
Outline
1 Big Data
2 Hadoop...
3 HDFS
4 Map Reduce
Big Data
Data that exceeds the storage and processing power of a conventional system
3 ‘V’s
Volume
Velocity
Variety
Big Data
Exponential growth of data
Challenges for Google, Yahoo, Microsoft, Amazon
Need to go through TBs and PBs of data
Existing tools became inadequate to process such large data sets.
Big elephant or numerous small chickens?
How to handle such big data?
Issues
How to handle systems going up and down?
How to combine the data from all the systems?
Problem 1: System ups and downs
Commodity hardware for data storage and analysis
Chances of failure are very high
Replication of data across several machines
GFS (Google File System)
GFS
Divides data into chunks and stores them in the file system
Can store data in the petabyte range
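The chunking and replication idea above can be sketched in a few lines of Python. This is a minimal illustration only, not GFS code: the function names, the round-robin placement rule, the chunk size of 256 bytes and the replication factor of 3 are all assumptions chosen for the example.

```python
def split_into_chunks(data, chunk_size):
    """Cut a byte string into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks, machines, replication=3):
    """Assign every chunk to `replication` distinct machines, round-robin.

    Hypothetical placement rule; real systems also weigh racks and load.
    """
    if replication > len(machines):
        raise ValueError("need at least as many machines as replicas")
    placement = {}
    for chunk_id in range(num_chunks):
        placement[chunk_id] = [
            machines[(chunk_id + r) % len(machines)] for r in range(replication)
        ]
    return placement


def readable_after_failure(placement, failed):
    """True if every chunk still has at least one replica on a live machine."""
    return all(
        any(machine not in failed for machine in replicas)
        for replicas in placement.values()
    )


data = b"x" * 1000
chunks = split_into_chunks(data, 256)
machines = ["m0", "m1", "m2", "m3", "m4"]
placement = place_replicas(len(chunks), machines)

print(len(chunks))                                          # number of chunks
print(placement[0])                                         # machines holding chunk 0
print(readable_after_failure(placement, {"m0", "m1"}))      # two machines down
```

With three replicas per chunk, any two machines can fail without losing data; losing all three machines that hold the same chunk makes the file unreadable.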
Problem 2: How to combine the data?
Analyze data across different machines.
To merge, data has to travel across the network.
Doing this is notoriously challenging
Again Google: MapReduce
Map Reduce
Provides a programming model
Abstracts disk reads and writes
Converts data to (key, value) pairs
Two Phases
Map
Reduce
HADOOP
A reliable shared storage system
Analysis system
History
Google was the first to introduce GFS and MapReduce
Published papers: GFS in 2003, MapReduce in 2004
A brand new technology
Already well proven inside Google by 2004
History
Doug Cutting
Created an open-source version of the MapReduce system, called Hadoop
Yahoo and others rallied around to support this effort.
Hadoop is now a core part of: Facebook, Yahoo, LinkedIn, Twitter
Core Concepts
HDFS
Map Reduce
HDFS: Hadoop Distributed File System
Streaming very large files on a commodity cluster
1 Very Large Files: MBs to PBs
2 Streaming
Write once, read many approach
No modification
Time to read the whole data set is more important
3 Commodity Cluster
No high-end servers
Yes, high chance of failure (but HDFS is tolerant enough)
Replication is done
HDFS: Hadoop Distributed File System
Services
Masters
Name Node
Secondary Name Node
Job Tracker
Slaves
Data Node
Task Tracker
HDFS: Hadoop Distributed File System
Name Node
Master Node
Maintains the file system namespace
Meta Data
Secondary Name Node
Periodically merges the edit log into the fsimage file (checkpointing)
Data Node
Slaves
Actual Storage
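The split between metadata on the Name Node and actual storage on the Data Nodes can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the Hadoop API: the class names, method names, the 4-byte block size and the replication factor of 2 are all hypothetical and chosen to keep the example short.

```python
class DataNode:
    """Slave: stores the actual block bytes."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}   # block id -> bytes
        self.alive = True


class NameNode:
    """Master: keeps only metadata (namespace and block locations)."""

    def __init__(self, datanodes, replication=2):
        self.datanodes = datanodes
        self.replication = replication
        self.namespace = {}   # path -> list of block ids
        self.locations = {}   # block id -> list of DataNode objects
        self._next_block_id = 0

    def create(self, path):
        # Write once: an existing file cannot be created again.
        if path in self.namespace:
            raise FileExistsError(path)
        self.namespace[path] = []

    def allocate_block(self, path):
        block_id = self._next_block_id
        self._next_block_id += 1
        n = len(self.datanodes)
        targets = [self.datanodes[(block_id + r) % n]
                   for r in range(self.replication)]
        self.namespace[path].append(block_id)
        self.locations[block_id] = targets
        return block_id, targets

    def block_locations(self, path):
        return [(b, self.locations[b]) for b in self.namespace[path]]


def write_file(namenode, path, data, block_size=4):
    """Client side: ask the NameNode where to write, then send bytes to DataNodes."""
    namenode.create(path)
    for offset in range(0, len(data), block_size):
        block_id, targets = namenode.allocate_block(path)
        for datanode in targets:
            datanode.blocks[block_id] = data[offset:offset + block_size]


def read_file(namenode, path):
    """Client side: look up block locations, then read each block from a live replica."""
    parts = []
    for block_id, replicas in namenode.block_locations(path):
        source = next(dn for dn in replicas if dn.alive)
        parts.append(source.blocks[block_id])
    return b"".join(parts)


nodes = [DataNode("dn0"), DataNode("dn1"), DataNode("dn2")]
namenode = NameNode(nodes)
write_file(namenode, "/logs/a.txt", b"hello hadoop")
nodes[0].alive = False                      # simulate one DataNode failing
print(read_file(namenode, "/logs/a.txt"))
```

Note that the NameNode object never holds file contents; it only answers "which blocks make up this file, and where do they live", which is why the file stays readable after one Data Node goes down.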
HDFS Architecture
Map Reduce
Large scale data processing in parallel.
It provides
Automatic parallelization and distribution
Fault tolerance
Two Phases in Map Reduce
Map
Reduce
Map Reduce
Job Tracker
Master
Manages the jobs in the cluster
Task Tracker
Slaves
Responsible for running the map and reduce tasks
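The master/slave relationship between the Job Tracker and the Task Trackers can be sketched as a toy scheduler. This is an assumption-laden illustration, not Hadoop code: the class and method names are hypothetical, tasks run one after another rather than in parallel, and a failed task is simply retried on the next tracker.

```python
class TaskTracker:
    """Slave: runs the task it is handed, unless the machine is down."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def run(self, task, arg):
        if not self.healthy:
            raise RuntimeError(self.name + " is down")
        return task(arg)


class JobTracker:
    """Master: hands tasks to trackers and reassigns them on failure."""

    def __init__(self, trackers):
        self.trackers = trackers

    def run_job(self, task, inputs):
        results = []
        log = []   # (task index, tracker that completed it)
        n = len(self.trackers)
        for i, arg in enumerate(inputs):
            for attempt in range(n):
                tracker = self.trackers[(i + attempt) % n]
                try:
                    results.append(tracker.run(task, arg))
                    log.append((i, tracker.name))
                    break
                except RuntimeError:
                    continue   # try the next tracker
            else:
                raise RuntimeError("task %d failed on every tracker" % i)
        return results, log


trackers = [TaskTracker("tt0"), TaskTracker("tt1", healthy=False), TaskTracker("tt2")]
job_tracker = JobTracker(trackers)
results, log = job_tracker.run_job(lambda x: x * x, [1, 2, 3])
print(results)
print(log)
```

The task that was first assigned to the failed tracker ends up in the log under a different tracker name, which is the fault tolerance the framework provides without the programmer handling it.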
Map Reduce
Map Phase
map(inKey, inValue) -> list(outKey, intermediateValue)
Processes an input key/value pair
Produces a set of intermediate pairs
Reduce Phase
reduce(outKey, list(intermediateValue)) -> list(outValue)
Combines all intermediate values for a particular key
Produces a set of merged output values (usually just one)
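The two signatures above can be made concrete with the usual word-count example, run on a single machine. This is a minimal sketch, not Hadoop's Java API: `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical names, and the grouping step in the middle stands in for the shuffle that the framework performs between the two phases.

```python
from collections import defaultdict


def map_fn(in_key, in_value):
    """map(inKey, inValue) -> list(outKey, intermediateValue).

    in_key is a line number, in_value is the line text.
    """
    return [(word.lower(), 1) for word in in_value.split()]


def reduce_fn(out_key, intermediate_values):
    """reduce(outKey, list(intermediateValue)) -> list(outValue)."""
    return [sum(intermediate_values)]


def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: every input record yields intermediate (key, value) pairs.
    groups = defaultdict(list)
    for in_key, in_value in records:
        for out_key, value in map_fn(in_key, in_value):
            groups[out_key].append(value)   # group by key (the "shuffle")
    # Reduce phase: combine all intermediate values of each key.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


lines = [(0, "big data big elephant"), (1, "small data")]
result = run_mapreduce(lines, map_fn, reduce_fn)
print(result)
# {'big': [2], 'data': [2], 'elephant': [1], 'small': [1]}
```

Each reduce call returns a list with a single merged value here, matching the "usually just one" remark on the slide.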
Happy Hadooping.... :)