
INTRODUCTION TO BIG DATA

NOT EVERYTHING THAT CAN BE COUNTED COUNTS, AND NOT EVERYTHING THAT COUNTS CAN BE COUNTED.

What is Big Data?

• Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information.

WHY BIG DATA?

SERIAL VS PARALLEL PROCESSING

SERIAL VS SEQUENTIAL PROCESSING

WHY BIG DATA?

• Walmart has exhaustive customer data on close to 145 million Americans, of which 60% is data on U.S. adults. Walmart tracks and targets every consumer individually.

• Walmart observed a significant 10% to 15% increase in online sales, amounting to $1 billion in incremental revenue.

Differentiating Factors:

• Accessible

• Robust

• Scalable

• Simple

HADOOP ECOSYSTEM

Hadoop was created by Doug Cutting and Mike Cafarella in 2006. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.

HDFS ARCHITECTURE

HDFS ARCHITECTURE CONTD.


MapReduce Terminology

Job – A “full program”: an execution of a Mapper and a Reducer across a data set

Task – An execution of a Mapper or a Reducer on a slice of data

Task Attempt – A particular instance of an attempt to execute a task on a machine
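To make the terminology concrete, here is a minimal driver sketch against the standard Hadoop mapreduce API, assuming the word-count example used in the Map Phase slides. The class names WordCountDriver, WordCountMapper and WordCountReducer and the args[0]/args[1] input and output paths are illustrative, not part of the original slides; submitting this Job is what spawns the map and reduce tasks (and, on failure, new task attempts) described above.

// Minimal sketch of submitting a MapReduce Job (the "full program").
// WordCountMapper and WordCountReducer are the classes sketched in the
// Map Phase and Reducer Phase sections below.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // the Job: a full program
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);        // each map task runs this
        job.setReducerClass(WordCountReducer.class);      // each reduce task runs this
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        // One map task is created per input split; a failed task is retried
        // as a new task attempt, possibly on a different machine.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}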

MapReduce Contd.

[Diagram: a MapReduce job submitted by the client computer goes to the JobTracker on the master node; the JobTracker distributes task instances to TaskTrackers running on the slave nodes.]

Map Phase

IMAGINE THIS AS THE INPUT FILE:

• Hello my name is abhishek

• Hello my name is utsav

• Hello my passion is cricket

OPERATION ON OUTPUT OF MAP PHASE

Hello 1, my 1, name 1, is 1, abhishek 1, Hello 1, my 1, name 1, is 1, utsav 1, Hello 1, my 1, passion 1, is 1, cricket 1

(A mapper sketch for this step follows the grouped output below.)

Hello(1,1,1)

my(1,1,1)

name(1,1)

is(1,1,1)

abhishek(1)

utsav(1)

passion(1)

cricket(1)

Key(tuple of values)
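A minimal sketch of the mapper that produces the (word, 1) pairs listed above, using the standard Hadoop Mapper API; the class name WordCountMapper is illustrative. Each map task runs this over one split of the input, and the grouping into tuples such as Hello(1,1,1) is done afterwards by the framework's shuffle and sort, not by the mapper itself.

// Map phase sketch: emit (word, 1) for every word in an input line.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // value is one line, e.g. "Hello my name is abhishek"
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emits pairs like (Hello, 1), (my, 1), ...
        }
    }
}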

REDUCER PHASE

Hello(1,1,1)

my(1,1,1)

name(1,1)

is(1,1,1)

abhishek(1)

utsav(1)

passion(1)

cricket(1)

Key(tuple of values)

abhishek(1)

cricket(1)

Hello(3)

is(3)

my(3)

name(2)

passion(1)

utsav(1)

Key(single value)
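The reduce step shown above can be sketched as follows, again with the standard Hadoop Reducer API and an illustrative class name WordCountReducer: it simply sums the tuple of 1s grouped under each key, turning Hello(1,1,1) into Hello(3) and name(1,1) into name(2).

// Reduce phase sketch: sum the values grouped under each key by the shuffle.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();          // add up the 1s for this word
        }
        result.set(sum);
        context.write(key, result);      // emits (Hello, 3), (name, 2), ...
    }
}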

TYPES OF SPLITS (PARALLEL PROCESSING IN ACTION):

• Two types of splitting of input files are possible.

• HDFS split: the file is split into blocks of a fixed size, e.g. 64 MB, to promote parallel processing (a block-size sketch follows this list).

• N-line split: the file is split into chunks of a fixed number of lines, again to promote parallel processing.

• Let's see an example on the next slide.
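The HDFS block size is a property of the filesystem rather than of the MapReduce job itself. Below is a hedged sketch of requesting 64 MB blocks when writing an input file with the standard FileSystem API; the path /data/input.txt is illustrative, and older Hadoop releases use the key dfs.block.size instead of dfs.blocksize.

// Sketch: write a file with 64 MB HDFS blocks, so each block becomes one
// HDFS split and, by default, one map task.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask HDFS for 64 MB blocks for files created with this configuration
        // (key name on older versions: "dfs.block.size").
        conf.setLong("dfs.blocksize", 64L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        // Illustrative path for the word-count input used in the earlier slides.
        try (FSDataOutputStream out = fs.create(new Path("/data/input.txt"))) {
            out.writeBytes("Hello my name is abhishek\n");
        }
    }
}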

N LINE SPLITTING CONTD.

• Assume the value of n is 3.

• MapReduce is a framework for processing data in parallel. The algorithm consists of three phases: map, shuffle and sort, and reduce. Here we observe the effect of the n-line splitter on the number of map tasks, i.e. the number of mappers created, which gives a better understanding of how a file is split.

• Both of these splits of the file are sent to two different mappers, while in the case of an HDFS split the amount of data sent to each mapper depends on the size of the respective split. (A driver sketch configuring the n-line split follows below.)
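For the n-line split, Hadoop ships NLineInputFormat, which hands each map task a fixed number of input lines. Here is a minimal driver sketch using it with n = 3, matching the example above; the class name NLineWordCountDriver and the reuse of the WordCountMapper/WordCountReducer sketches are assumptions for illustration.

// Sketch: word count with an n-line split (n = 3), so every map task
// receives exactly three lines of input.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NLineWordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count (n-line split)");
        job.setJarByClass(NLineWordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Use the n-line splitter instead of the default HDFS/block-based split.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 3);   // n = 3, as assumed above
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With n = 3, a nine-line input file would produce three splits and therefore three map tasks, one per three-line chunk.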

ANY QUERIES? ASK AWAY!