
INTRODUCTION TO BIG DATA

NOT EVERYTHING THAT CAN BE COUNTED COUNTS, AND NOT EVERYTHING THAT COUNTS CAN BE COUNTED.

What is Big Data?

• Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information.

WHY BIG DATA?

SERIAL VS PARALLEL PROCESSING

SERIAL VS SEQUENTIAL PROCESSING

WHY BIG DATA?

• Walmart has exhaustive customer data on close to 145 million Americans, of which 60% is data on U.S. adults. Walmart tracks and targets every consumer individually.

• Walmart observed a significant 10% to 15% increase in online sales, amounting to $1 billion in incremental revenue.

Differentiating Factors:

• Accessible

• Robust

• Scalable

• Simple

HADOOP ECOSYSTEM

Hadoop was created by Doug Cutting and Mike Cafarella in 2006. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.

HDFS ARCHITECTURE

HDFS ARCHITECTURE CONTD.


MapReduce Terminology

Job – A “full program”: an execution of a Mapper and a Reducer across a data set

Task – An execution of a Mapper or a Reducer on a slice of data

Task Attempt – A particular instance of an attempt to execute a task on a machine
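To make the terminology concrete, here is a minimal driver sketch against the standard Hadoop mapreduce API, assuming the word-count example used in the Map Phase slides. The class names WordCountDriver, WordCountMapper and WordCountReducer and the args[0]/args[1] input and output paths are illustrative, not part of the original slides; submitting this Job is what spawns the map and reduce tasks (and, on failure, new task attempts) described above.

// Minimal sketch of submitting a MapReduce Job (the "full program").
// WordCountMapper and WordCountReducer are the classes sketched in the
// Map Phase and Reducer Phase sections below.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // the Job: a full program
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);        // each map task runs this
        job.setReducerClass(WordCountReducer.class);      // each reduce task runs this
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        // One map task is created per input split; a failed task is retried
        // as a new task attempt, possibly on a different machine.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}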

MapReduce Contd.

[Diagram: a MapReduce job submitted by the client computer goes to the JobTracker on the master node; the JobTracker distributes task instances to TaskTrackers running on the slave nodes.]

Map Phase

IMAGINE THIS AS THE INPUT FILE:

• Hello my name is abhishek

• Hello my name is utsav

• Hello my passion is cricket

OPERATION ON OUTPUT OF MAP PHASE

Hello 1, my 1, name 1, is 1, abhishek 1, Hello 1, my 1, name 1, is 1, utsav 1, Hello 1, my 1, passion 1, is 1, cricket 1

(A mapper sketch for this step follows the grouped output below.)

Hello(1,1,1)

my(1,1,1)

name(1,1)

is(1,1,1)

abhishek(1)

utsav(1)

passion(1)

cricket(1)

Key(tuple of values)
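A minimal sketch of the mapper that produces the (word, 1) pairs listed above, using the standard Hadoop Mapper API; the class name WordCountMapper is illustrative. Each map task runs this over one split of the input, and the grouping into tuples such as Hello(1,1,1) is done afterwards by the framework's shuffle and sort, not by the mapper itself.

// Map phase sketch: emit (word, 1) for every word in an input line.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // value is one line, e.g. "Hello my name is abhishek"
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emits pairs like (Hello, 1), (my, 1), ...
        }
    }
}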

REDUCER PHASE

Hello(1,1,1)

my(1,1,1)

name(1,1)

is(1,1,1)

abhishek(1)

utsav(1)

passion(1)

cricket(1)

Key(tuple of values)

abhishek(1)

cricket(1)

Hello(3)

is(3)

my(3)

name(2)

passion(1)

utsav(1)

Key(single value)
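The reduce step shown above can be sketched as follows, again with the standard Hadoop Reducer API and an illustrative class name WordCountReducer: it simply sums the tuple of 1s grouped under each key, turning Hello(1,1,1) into Hello(3) and name(1,1) into name(2).

// Reduce phase sketch: sum the values grouped under each key by the shuffle.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();          // add up the 1s for this word
        }
        result.set(sum);
        context.write(key, result);      // emits (Hello, 3), (name, 2), ...
    }
}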

TYPES OF SPLITS (PARALLEL PROCESSING IN ACTION):

• Two types of splitting of input files are possible.

• HDFS split: the file is split into blocks of a fixed size, e.g. 64 MB, to promote parallel processing (a block-size sketch follows this list).

• N-line split: the file is split into chunks of a fixed number of lines, again to promote parallel processing.

• Let's see an example on the next slide.
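The HDFS block size is a property of the filesystem rather than of the MapReduce job itself. Below is a hedged sketch of requesting 64 MB blocks when writing an input file with the standard FileSystem API; the path /data/input.txt is illustrative, and older Hadoop releases use the key dfs.block.size instead of dfs.blocksize.

// Sketch: write a file with 64 MB HDFS blocks, so each block becomes one
// HDFS split and, by default, one map task.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask HDFS for 64 MB blocks for files created with this configuration
        // (key name on older versions: "dfs.block.size").
        conf.setLong("dfs.blocksize", 64L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        // Illustrative path for the word-count input used in the earlier slides.
        try (FSDataOutputStream out = fs.create(new Path("/data/input.txt"))) {
            out.writeBytes("Hello my name is abhishek\n");
        }
    }
}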

N LINE SPLITTING CONTD.

• Assume the value of n is 3.

• MapReduce is a framework for processing data in parallel. The algorithm consists of three phases: map, shuffle and sort, and reduce. Here we observe the effect of the n-line splitter on the number of map tasks, i.e. the number of mappers created, which gives a better understanding of how a file is split.

• Both of these splits of the file are sent to two different mappers, while in the case of an HDFS split the amount of data sent to each mapper depends on the size of the respective split. (A driver sketch configuring the n-line split follows below.)
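For the n-line split, Hadoop ships NLineInputFormat, which hands each map task a fixed number of input lines. Here is a minimal driver sketch using it with n = 3, matching the example above; the class name NLineWordCountDriver and the reuse of the WordCountMapper/WordCountReducer sketches are assumptions for illustration.

// Sketch: word count with an n-line split (n = 3), so every map task
// receives exactly three lines of input.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NLineWordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count (n-line split)");
        job.setJarByClass(NLineWordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Use the n-line splitter instead of the default HDFS/block-based split.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 3);   // n = 3, as assumed above
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With n = 3, a nine-line input file would produce three splits and therefore three map tasks, one per three-line chunk.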

ANY QUERIES? ASK AWAY!