INTRODUCTION TO BIG DATA
NOT EVERYTHING THAT CAN BE COUNTED COUNTS, AND NOT EVERYTHING THAT COUNTS CAN BE COUNTED.
What is Big Data?
• Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information.
WHY BIG DATA?
SERIAL VS PARALLEL PROCESSING
SERIAL VS PARALLEL PROCESSING CONTD.
WHY BIG DATA?
WHY BIG DATA?
• Walmart holds exhaustive customer data on close to 145 million Americans, of which 60% is data on U.S. adults. Walmart tracks and targets every consumer individually.
• Walmart observed a significant 10% to 15% increase in online sales, amounting to $1 billion in incremental revenue.
Differentiating Factors:
• Accessible
• Robust
• Scalable
• Simple
HADOOP ECOSYSTEM
HDFS ARCHITECTURE
HDFS ARCHITECTURE CONTD.
MapReduce Terminology
Job – A “full program”: an execution of a Mapper and Reducer across a data set
Task – An execution of a Mapper or a Reducer on a slice of data
Task Attempt – A particular instance of an attempt to execute a task on a machine
MAPREDUCE CONTD.
[Diagram: A MapReduce job submitted by a client computer goes to the JobTracker on the master node. The JobTracker distributes the work to TaskTrackers running on the slave nodes; each TaskTracker runs task instances on its node.]
MAP PHASE
IMAGINE THIS AS THE INPUT FILE:
• Hello my name is abhishek Hello my name is utsav
• Hello my passion is cricket
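The map step on the input above can be sketched in plain Python (illustrative only, not the Hadoop API; the function name `map_phase` is an assumption):

```python
def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    return [(word, 1) for word in line.split()]

# Each input line is mapped independently - this is what runs in parallel:
pairs = map_phase("Hello my name is abhishek Hello my name is utsav")
# pairs begins [("Hello", 1), ("my", 1), ("name", 1), ("is", 1), ...]
```

Each mapper knows nothing about the other lines; it simply emits one `(word, 1)` pair per word, leaving all aggregation to later phases.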
OPERATION ON OUTPUT OF MAP PHASE
Map output: Hello 1, my 1, name 1, is 1, abhishek 1, Hello 1, my 1, name 1, is 1, utsav 1, Hello 1, my 1, passion 1, is 1, cricket 1
After shuffle and sort, the pairs are grouped by key – key(tuple of values):
Hello(1,1,1)
my(1,1,1)
name(1,1)
is(1,1,1)
abhishek(1)
utsav(1)
passion(1)
cricket(1)
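This grouping (the shuffle-and-sort step) can be sketched as follows; `defaultdict` is standard-library Python, while the function name is illustrative:

```python
from collections import defaultdict

def shuffle_and_sort(pairs):
    """Group (word, 1) pairs by key, collecting the values into a tuple."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return {word: tuple(counts) for word, counts in groups.items()}

pairs = [("Hello", 1), ("my", 1), ("name", 1), ("is", 1), ("abhishek", 1),
         ("Hello", 1), ("my", 1), ("name", 1), ("is", 1), ("utsav", 1),
         ("Hello", 1), ("my", 1), ("passion", 1), ("is", 1), ("cricket", 1)]
grouped = shuffle_and_sort(pairs)
# grouped["Hello"] == (1, 1, 1); grouped["name"] == (1, 1)
```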
REDUCER PHASE
Input to the reducer – key(tuple of values):
Hello(1,1,1)
my(1,1,1)
name(1,1)
is(1,1,1)
abhishek(1)
utsav(1)
passion(1)
cricket(1)
Output of the reducer – key(single value):
Hello(3)
is(3)
my(3)
name(2)
abhishek(1)
cricket(1)
passion(1)
utsav(1)
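The reduce step that collapses each tuple of values into a single count can be sketched as (illustrative Python, not the Hadoop Reducer API):

```python
def reduce_phase(grouped):
    """Reduce: sum each key's tuple of 1s into a single count."""
    return {word: sum(values) for word, values in grouped.items()}

# Grouped output of the shuffle-and-sort step for the example input:
grouped = {"Hello": (1, 1, 1), "my": (1, 1, 1), "name": (1, 1),
           "is": (1, 1, 1), "abhishek": (1,), "utsav": (1,),
           "passion": (1,), "cricket": (1,)}
counts = reduce_phase(grouped)
# counts["Hello"] == 3, counts["name"] == 2
```

In a real cluster each reducer receives only its share of the keys, so this summation also runs in parallel across reducers.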
TYPES OF SPLITS (PARALLEL PROCESSING IN ACTION):
• Two types of splitting of input files are possible.
• HDFS split: splitting a file into blocks of fixed size, e.g. 64 MB blocks, to promote parallel processing.
• N line split: splitting a file into chunks of a fixed number of lines to promote parallel processing.
• Let's see an example in the next slide.
N LINE SPLITTING CONTD.
• Assume the value of n is 3.
• MapReduce is a framework based on processing data in parallel. The algorithm consists of three phases: map, shuffle and sort, and reduce. Here we will observe the effect of the n-line splitter on the number of map tasks, i.e. the number of mappers created. This builds a better understanding of how a file splits.
• With n = 3, a six-line file yields two splits, each sent to a different mapper; in the case of an HDFS split, the amount of data sent to each mapper instead depends on the size of the respective blocks.
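An n-line split can be sketched as follows (plain Python; the helper is illustrative and stands in for Hadoop's n-line input splitting, not its actual API):

```python
def n_line_split(lines, n):
    """Split a list of input lines into chunks of at most n lines;
    each chunk would be handed to its own mapper."""
    return [lines[i:i + n] for i in range(0, len(lines), n)]

lines = ["line %d" % i for i in range(1, 7)]  # a 6-line input file
splits = n_line_split(lines, 3)
# With n = 3, a 6-line file yields 2 splits, hence 2 map tasks.
```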
ANY QUERIES? ASK AWAY!