
Page 1

INTRODUCTION TO BIG DATA

NOT EVERYTHING THAT CAN BE COUNTED COUNTS, AND NOT EVERYTHING THAT COUNTS CAN BE COUNTED.

Page 2

What is Big Data?

• Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information.

Page 3

Page 4

Page 5

WHY BIG DATA?

SERIAL VS PARALLEL PROCESSING
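To illustrate the idea behind these slides, here is a small sketch that is not from the original deck: it sums a large array first serially and then in parallel using Java's parallel streams. On a multi-core machine the parallel pass typically finishes sooner, which is the same principle MapReduce applies across many machines.

import java.util.Arrays;
import java.util.stream.LongStream;

public class SerialVsParallel {
    public static void main(String[] args) {
        long[] data = LongStream.rangeClosed(1, 50_000_000).toArray();

        long t0 = System.nanoTime();
        long serialSum = Arrays.stream(data).sum();               // one thread walks the whole array
        long t1 = System.nanoTime();
        long parallelSum = Arrays.stream(data).parallel().sum();  // the array is split across CPU cores
        long t2 = System.nanoTime();

        System.out.printf("serial:   sum=%d  time=%d ms%n", serialSum, (t1 - t0) / 1_000_000);
        System.out.printf("parallel: sum=%d  time=%d ms%n", parallelSum, (t2 - t1) / 1_000_000);
    }
}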

Page 6

WHY BIG DATA?

Page 7

WHY BIG DATA?

• Walmart has exhaustive customer data on close to 145 million Americans, roughly 60% of U.S. adults, and it tracks and targets every consumer individually.

• Walmart observed a significant 10% to 15% increase in online sales, amounting to $1 billion in incremental revenue.

Page 8

Differentiating Factors:

• Accessible
• Robust
• Scalable
• Simple

Page 9

HADOOP ECOSYSTEM

Hadoop was created by Doug Cutting and Mike Cafarella in 2006. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
Page 10

HDFS ARCHITECTURE

Page 11

HDFS ARCHITECTURE CONTD.

Sources:

http://www.slideshare.net/narangv43/seminar-presentation-hadoop

http://www.slideshare.net/EdurekaIN/hadoop-week1-release22
Page 12

MapReduce Terminology

Job – A “full program” - an execution of a Mapper and Reducer across a data set

Task – An execution of a Mapper or a Reducer on a slice of data

Task Attempt – A particular instance of an attempt to execute a task on a machine
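To make the "Job" term concrete, here is a minimal driver sketch that is not from the original slides. The class names WordCountDriver, WordCountMapper, and WordCountReducer are illustrative (the mapper and reducer are sketched on the Map Phase and Reducer Phase slides below); it uses Hadoop's standard org.apache.hadoop.mapreduce API to submit one job, i.e. one full program consisting of a map and a reduce across a data set.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // One Job = one full program: a map phase plus a reduce phase over the input.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // sketched on the Map Phase slide
        job.setReducerClass(WordCountReducer.class);   // sketched on the Reducer Phase slide
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path must not already exist
        // The framework breaks the job into map and reduce tasks; a failed task is
        // retried as a new task attempt, possibly on another machine.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}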

Page 13

MapReduce Contd.

A MapReduce job is submitted by a client computer to the JobTracker, which runs on the master node. The JobTracker assigns work to TaskTrackers running on the slave nodes, and each TaskTracker launches task instances on its own machine and reports progress back to the JobTracker.

Page 14

Map Phase

IMAGINE THIS AS THE INPUT FILE:

• Hello my name is abhishek
• Hello my name is utsav
• Hello my passion is cricket
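As a sketch of what the map step might look like for this input (not taken from the slides; the class name WordCountMapper is illustrative and uses Hadoop's org.apache.hadoop.mapreduce API), the mapper reads each line and emits every word with the value 1:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line; input value: the line of text itself.
// Output: one (word, 1) pair per word, e.g. (Hello, 1), (my, 1), (name, 1), ...
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // e.g. ("Hello", 1)
        }
    }
}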

Page 15

OPERATION ON OUTPUT OF MAP PHASE

The map phase emits: (Hello,1) (my,1) (name,1) (is,1) (abhishek,1) (Hello,1) (my,1) (name,1) (is,1) (utsav,1) (Hello,1) (my,1) (passion,1) (is,1) (cricket,1)

The shuffle and sort step then groups these pairs by key:

Hello(1,1,1)

my(1,1,1)

name(1,1)

is(1,1,1)

abhishek(1)

utsav(1)

passion(1)

cricket(1)

Key(tuple of values)
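Outside Hadoop, the same grouping can be sketched in a few lines of plain Java (illustrative only), which makes the Key(tuple of values) shape explicit:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class ShuffleSketch {
    public static void main(String[] args) {
        // The keys emitted by the map phase, in the order they were produced.
        List<String> mapOutputKeys = Arrays.asList(
                "Hello", "my", "name", "is", "abhishek",
                "Hello", "my", "name", "is", "utsav",
                "Hello", "my", "passion", "is", "cricket");

        // Group the (word, 1) pairs by key, mirroring the shuffle-and-sort step.
        Map<String, List<Integer>> grouped = mapOutputKeys.stream()
                .collect(Collectors.groupingBy(word -> word, TreeMap::new,
                        Collectors.mapping(word -> 1, Collectors.toList())));

        // Prints e.g. Hello[1, 1, 1], abhishek[1], cricket[1], is[1, 1, 1], ...
        grouped.forEach((word, ones) -> System.out.println(word + ones));
    }
}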

Page 16

REDUCER PHASE

Reducer input, Key(tuple of values):

Hello(1,1,1)

my(1,1,1)

name(1,1)

is(1,1,1)

abhishek(1)

utsav(1)

passion(1)

cricket(1)

The reducer sums the values of each tuple, producing Key(single value):

Hello(3)

my(3)

name(2)

is(3)

abhishek(1)

utsav(1)

passion(1)

cricket(1)
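A matching sketch of the reduce step (again illustrative, with the assumed class name WordCountReducer): for each key it receives the tuple of values and writes out their sum, e.g. ("Hello", [1,1,1]) becomes ("Hello", 3).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: a word and the tuple of 1s emitted for it, e.g. ("Hello", [1, 1, 1]).
// Output: the word with a single summed count, e.g. ("Hello", 3).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}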

Page 17

TYPES OF SPLITS (PARALLEL PROCESSING IN ACTION):

• Two types of splitting of input files are possible.

• HDFS split: the file is split into blocks of a fixed size, e.g. blocks of 64 MB, to promote parallel processing.

• N-line split: the file is split into chunks of a fixed number of lines, again to promote parallel processing.

• Let's see an example in the next slide.

Page 18

N LINE SPLITTING CONTD.

• Assume the value of n is 3.

• MapReduce is a framework based on processing data in parallel. The algorithm consists of three phases, namely map, shuffle and sort, and reduce. Here we will observe the effect of the n-line splitter on the number of map tasks, i.e. the number of mappers created, which gives a better understanding of how a file splits.

• With n = 3, each split contains three lines of the file, so here the file forms two splits, and both of these splits are sent to two different mappers. In the case of the HDFS split, by contrast, the amount of data sent to each mapper depends on the size of the respective split.
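One way to get this n = 3 behaviour with the driver sketched earlier is Hadoop's standard NLineInputFormat. The helper below is a sketch: the class and method names are illustrative, while setInputFormatClass and NLineInputFormat.setNumLinesPerSplit are part of the regular Hadoop API.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public final class NLineSplitConfig {
    private NLineSplitConfig() {}

    // Call this on the Job built in the WordCountDriver sketch before submitting it.
    // Each map task then receives at most linesPerSplit input lines, so with
    // linesPerSplit = 3 a six-line file yields two splits and hence two mappers.
    public static void useNLineSplit(Job job, int linesPerSplit) {
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, linesPerSplit);
    }
}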

Page 19

ANY QUERIES? ASK AWAY!