Big Data is Here – Hadoop to the Rescue! Shay Sofer, AlphaCSP

A presentation from TheEdge10 about Hadoop and Big Data.

Page 1: TheEdge10 : Big Data is Here - Hadoop to the Rescue

Big Data is Here – Hadoop to the Rescue!

Shay Sofer, AlphaCSP

Page 2: TheEdge10 : Big Data is Here - Hadoop to the Rescue

2

Today we will:

» Understand what Big Data is
» Get to know Hadoop
» Experience some MapReduce magic
» Persist very large files
» Learn some nifty tricks

On Today's Menu...

Page 3: TheEdge10 : Big Data is Here - Hadoop to the Rescue

3

Data is Everywhere

Page 4: TheEdge10 : Big Data is Here - Hadoop to the Rescue

4

» IDC: “Total data in the universe: 1.2 Zettabytes” (May 2010)

» 1 ZB = 1 trillion gigabytes (or: 1,000,000,000,000,000,000,000 bytes = 10^21)

» 60% growth from 2009
» By 2020 – we will reach 35 ZB

Facts and Numbers

Data is Everywhere

Page 5: TheEdge10 : Big Data is Here - Hadoop to the Rescue

5

Facts and Numbers

Data is Everywhere

Source: www.idc.com

Page 6: TheEdge10 : Big Data is Here - Hadoop to the Rescue

6

» 234M web sites
» 7M new sites in 2009
» New York Stock Exchange – 1 TB of data per day
» Web 2.0:
   147M blogs (and counting…)
   Twitter – ~12 TB of data per day

Facts and Numbers

Data is Everywhere

Page 7: TheEdge10 : Big Data is Here - Hadoop to the Rescue

7

» 500M users
» 40M photos per day
» More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums etc.) shared each month

Facts and Numbers - Facebook

Data is Everywhere

Page 8: TheEdge10 : Big Data is Here - Hadoop to the Rescue

8

» Big Data refers to datasets that grow so large that they become awkward to work with using on-hand database management tools

» Where and how do we store this information?
» How do we perform analyses on such large datasets?

Why are you here?

Data is Everywhere

Page 9: TheEdge10 : Big Data is Here - Hadoop to the Rescue

9

Scale-up Vs. Scale-out

Data is Everywhere

Page 10: TheEdge10 : Big Data is Here - Hadoop to the Rescue

10

» Scale-up: Adding resources to a single node in a system, typically involving the addition of CPUs or memory to a single computer

» Scale-out: Adding more nodes to a system, e.g. adding a new computer with commodity hardware to a distributed software application

Scale-up Vs. Scale-out

Data is Everywhere

Page 11: TheEdge10 : Big Data is Here - Hadoop to the Rescue

11

Introducing…Hadoop!

Page 12: TheEdge10 : Big Data is Here - Hadoop to the Rescue

12

» A framework for writing and running distributed applications that process large amounts of data
» Runs on large clusters of commodity hardware
» A cluster with hundreds of machines is standard
» Inspired by Google’s architecture: MapReduce and GFS

What is Hadoop?

Hadoop

Page 13: TheEdge10 : Big Data is Here - Hadoop to the Rescue

13

» Robust – handles failures of individual nodes
» Scales linearly
» Open source
» A top-level Apache project

Why Hadoop?

Hadoop

Page 14: TheEdge10 : Big Data is Here - Hadoop to the Rescue

14

Hadoop

Page 15: TheEdge10 : Big Data is Here - Hadoop to the Rescue

15

» Facebook holds the largest known Hadoop storage cluster in the world
   2000 machines
   12 TB per machine (some have 24 TB)
   32 GB of RAM per machine
» Total of more than 21 Petabytes (1 Petabyte = 1024 Terabytes)

Facebook (Again…)

Hadoop

Page 16: TheEdge10 : Big Data is Here - Hadoop to the Rescue

16

History

Hadoop

2002 – Apache Nutch, an open source web search engine, founded by Doug Cutting

2004 – Google’s GFS & MapReduce papers published

2006 – Cutting joins Yahoo!, forms Hadoop

2008 – Hadoop hits web scale, being used by Yahoo! for web indexing

2008 – Sorting 1 TB in 62 seconds

2010 – Computing the longest Pi yet

Page 17: TheEdge10 : Big Data is Here - Hadoop to the Rescue

17

Hadoop

[Diagram: the Hadoop ecosystem – Common, MapReduce, HDFS, Pig, Hive, HBase, ZooKeeper, Chukwa]

Page 18: TheEdge10 : Big Data is Here - Hadoop to the Rescue

18

IDE Plugin

Hadoop

Page 19: TheEdge10 : Big Data is Here - Hadoop to the Rescue

19

Hadoop and MapReduce

Page 20: TheEdge10 : Big Data is Here - Hadoop to the Rescue

20

» A programming model for processing and generating large data sets

» Introduced by Google
» Parallel processing of the map/reduce operations

Definition

MapReduce

Page 21: TheEdge10 : Big Data is Here - Hadoop to the Rescue

21

Sam believed “An apple a day keeps a doctor away”

MapReduce – The Story of Sam

[Image: Mother gives Sam an apple]

Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs

Page 22: TheEdge10 : Big Data is Here - Hadoop to the Rescue

22

Sam thought of “drinking” the apple

MapReduce – The Story of Sam

He used a knife to cut the apple and a mixer to make juice [tool images omitted]

Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs

Page 23: TheEdge10 : Big Data is Here - Hadoop to the Rescue

23

Sam applied his invention to all the fruits he could find in the fruit basket

[Images: (map ‘(fruits)) → a list of juices; (reduce ‘(juices)) → one glass of mixed juice]

MapReduce – The Story of Sam

A list of values mapped into another list of values, which gets reduced into a single value

Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs

Page 24: TheEdge10 : Big Data is Here - Hadoop to the Rescue

24

Sam got his first job for his talent in making juice

MapReduce – The Story of Sam

Now, it’s not just one basket but a whole container of fruits

Also, they produce a list of juice types separately

But Sam had just ONE knife and ONE mixer [images omitted]

Large data, and a list of values for output

Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs

Page 25: TheEdge10 : Big Data is Here - Hadoop to the Rescue

25

Sam implemented a parallel version of his innovation

Each map input: a list of <key, value> pairs, e.g. (<a, apple>, <o, orange>, <p, pineapple>, …)

Each map output: a list of <key, value> pairs, e.g. (<a’, apple-juice>, <o’, orange-juice>, <p’, pineapple-juice>, …)

Grouped by key (shuffle), each reduce input is a <key, value-list>, e.g. <a’, (apple-juice, …)>

Reduced into a list of values

Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs


MapReduce – The Story of Sam

Page 26: TheEdge10 : Big Data is Here - Hadoop to the Rescue

26

» Mapper – Takes a series of key/value pairs, processes each and generates output key/value pairs: (k1, v1) → list(k2, v2)

» Reducer – Iterates through the values that are associated with a specific key and generates output: (k2, list(v2)) → list(k3, v3)

» The Mapper takes the input data, filters and transforms it into something the Reducer can aggregate over

First Map, Then Reduce

MapReduce

Page 27: TheEdge10 : Big Data is Here - Hadoop to the Rescue

27

MapReduce

[Diagram: Input → Map (many mappers in parallel) → Shuffle → Reduce → Output]

Page 28: TheEdge10 : Big Data is Here - Hadoop to the Rescue

28

» Hadoop comes with a number of predefined classes: BooleanWritable, ByteWritable, LongWritable, Text, etc. (a short sketch follows below)
» Supports pluggable serialization frameworks
» Apache Avro

Hadoop Data Types

MapReduce
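To make these types concrete, a minimal sketch – nothing below is from the deck beyond the Writable classes it names; the values are illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritablesDemo {
  public static void main(String[] args) {
    // Writables are mutable, reusable boxes around primitive values
    IntWritable count = new IntWritable();
    count.set(42);                           // the same object can be reused across records

    Text word = new Text("hadoop");          // Hadoop's Writable counterpart of String
    LongWritable offset = new LongWritable(128L);

    System.out.println(word + " appeared " + count.get() + " times at offset " + offset.get());
  }
}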

Page 29: TheEdge10 : Big Data is Here - Hadoop to the Rescue

29

» TextInputFormat / TextOutputFormat
» KeyValueTextInputFormat

» SequenceFile – a Hadoop-specific compressed binary file format, optimized for passing data between two MapReduce jobs (sketched below)

Input / Output Formats

MapReduce
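As a rough illustration of SequenceFile, a minimal writer sketch using the classic SequenceFile.createWriter API; the path and records are made up for the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("wordcounts.seq");   // illustrative output path

    // a SequenceFile stores binary <key, value> records, here Text -> IntWritable
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
    try {
      writer.append(new Text("hello"), new IntWritable(1));
      writer.append(new Text("world"), new IntWritable(2));
    } finally {
      writer.close();
    }
  }
}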

Page 30: TheEdge10 : Big Data is Here - Hadoop to the Rescue

30

public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, new IntWritable(1));
    }
  }
}

Input:  <k1, Hello World Bye World>
Output: <Hello, 1> <World, 1> <Bye, 1> <World, 1>

Word Count – The Mapper

Page 31: TheEdge10 : Big Data is Here - Hadoop to the Rescue

31

public static class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Input:  <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Output: <Hello, 1> <World, 2> <Bye, 1>

Word Count – The Reducer

Page 32: TheEdge10 : Big Data is Here - Hadoop to the Rescue

32

public static void main(String[] args) throws IOException {
  JobConf job = new JobConf(WordCount.class);

  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(MapClass.class);
  job.setReducerClass(ReduceClass.class);

  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));

  // job.setInputFormat(KeyValueTextInputFormat.class);

  JobClient.runJob(job);
}

Word Count – The Driver
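To launch the job from the command line, something along these lines would do (the jar and directory names are illustrative, not from the deck):

$ hadoop jar wordcount.jar WordCount input/ output/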

Page 33: TheEdge10 : Big Data is Here - Hadoop to the Rescue

33

» Music discovery website
» Scrobbling / streaming via radio
» 40M unique visitors per month
» Over 40M scrobbles per day
» Each scrobble creates a log line

Hadoop @ Last.FM

MapReduce

Page 34: TheEdge10 : Big Data is Here - Hadoop to the Rescue

34

Page 35: TheEdge10 : Big Data is Here - Hadoop to the Rescue

35

» Goal: Create a “Unique listeners per track” chart

Sample listening data

MapReduce

UserId   TrackId   Scrobbles   Radio   Skip
55551    100       5           10      0
55551    900       0           3       3
55552    101       0           5       0
55553    102       5           0       0

Page 36: TheEdge10 : Big Data is Here - Hadoop to the Rescue

36

public void map(LongWritable position, Text rawLine,
                OutputCollector<IntWritable, IntWritable> output,
                Reporter reporter) throws IOException {

  int scrobbles, radioListens;  // assume they are initialized -
  IntWritable trackId, userId;  // omitted for brevity

  // if the track somehow is marked with zero plays - ignore it
  if (scrobbles <= 0 && radioListens <= 0) {
    return;
  }

  // output user id against track id
  output.collect(trackId, userId);
}

Unique Listens - Mapper

Page 37: TheEdge10 : Big Data is Here - Hadoop to the Rescue

37

public void reduce(IntWritable trackId, Iterator<IntWritable> values,
                   OutputCollector<IntWritable, IntWritable> output,
                   Reporter reporter) throws IOException {

  Set<Integer> usersSet = new HashSet<Integer>();

  // add all userIds to the set, duplicates removed
  while (values.hasNext()) {
    IntWritable userId = values.next();
    usersSet.add(userId.get());
  }

  // output: trackId -> number of unique listeners per track
  output.collect(trackId, new IntWritable(usersSet.size()));
}

Unique Listens - Reducer

Page 38: TheEdge10 : Big Data is Here - Hadoop to the Rescue

38

» Complex tasks sometimes need to be broken down into subtasks
» Output of the previous job goes as input to the next job
» job-a | job-b | job-c
» Simply launch the driver of the 2nd job after the 1st, as sketched below

Chaining

MapReduce
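A minimal sketch of such a chain, assuming jobA and jobB are JobConfs configured like the word count driver above; the intermediate path is illustrative:

// JobClient.runJob() blocks until the job completes,
// so calling the drivers in sequence chains the jobs
FileOutputFormat.setOutputPath(jobA, new Path("/tmp/intermediate"));  // job-a writes here
JobClient.runJob(jobA);

FileInputFormat.addInputPath(jobB, new Path("/tmp/intermediate"));    // job-b reads it back
JobClient.runJob(jobB);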

Page 39: TheEdge10 : Big Data is Here - Hadoop to the Rescue

39

» Hadoop supports other languages via an API called Streaming
» Use UNIX commands as mappers and reducers
» Or use any script that processes line-oriented data streams from STDIN and outputs to STDOUT – Python, Perl, etc.

Hadoop Streaming

MapReduce

Page 40: TheEdge10 : Big Data is Here - Hadoop to the Rescue

40

$ hadoop jar hadoop-streaming.jar -input input/myFile.txt -output output.txt -mapper myMapper.py -reducer myReducer.py

Hadoop Streaming

MapReduce

Page 41: TheEdge10 : Big Data is Here - Hadoop to the Rescue

41

HDFS – Hadoop Distributed File System

Page 42: TheEdge10 : Big Data is Here - Hadoop to the Rescue

42

» A large dataset can and will outgrow the storage capacity of a single physical machine
» Partition it across separate machines – distributed filesystems
» Network-based – complex
» What happens when a node fails?

Distributed FileSystem

HDFS

Page 43: TheEdge10 : Big Data is Here - Hadoop to the Rescue

43

» Designed for storing very large files running on clusters of commodity hardware
» Highly fault-tolerant (via replication)
» A typical file is gigabytes to terabytes in size
» High throughput

HDFS - Hadoop Distributed FileSystem

HDFS

Page 44: TheEdge10 : Big Data is Here - Hadoop to the Rescue

44

Running Hadoop = running a set of daemons on different servers in your network:
» NameNode
» DataNode
» Secondary NameNode
» JobTracker
» TaskTracker

Hadoop’s Building Blocks

HDFS

Page 45: TheEdge10 : Big Data is Here - Hadoop to the Rescue

45

Topology of a Hadoop Cluster

[Diagram: a master node runs the NameNode and JobTracker, a separate node runs the Secondary NameNode, and each of the slave nodes runs a DataNode and a TaskTracker]

Page 46: TheEdge10 : Big Data is Here - Hadoop to the Rescue

46

» HDFS has a master/slave architecture; the NameNode acts as the master
» Single NameNode per HDFS
» Keeps track of:
   How the files are broken into blocks
   Which nodes store those blocks
   The overall health of the filesystem
» Memory and I/O intensive

The NameNode

HDFS

Page 47: TheEdge10 : Big Data is Here - Hadoop to the Rescue

47

» Each slave machine will host a DataNode daemon
» Serves read/write/delete requests from the NameNode
» Manages the storage attached to the nodes
» Sends a periodic Heartbeat to the NameNode

The DataNode

HDFS

Page 48: TheEdge10 : Big Data is Here - Hadoop to the Rescue

48

» Failure is the norm rather than the exception
» Detection of faults and quick, automatic recovery
» Each file is stored as a sequence of blocks (default: 64MB each)
» The blocks of a file are replicated for fault tolerance
» Block size and replicas are configurable per file

Fault Tolerance - Replication

HDFS

Page 49: TheEdge10 : Big Data is Here - Hadoop to the Rescue

49

HDFS

Page 50: TheEdge10 : Big Data is Here - Hadoop to the Rescue

50

Topology of a Hadoop Cluster

[Cluster topology diagram, repeated]

Page 51: TheEdge10 : Big Data is Here - Hadoop to the Rescue

51

» Assistant daemon that should be on a dedicated node
» Takes snapshots of the HDFS metadata
» Doesn’t receive real-time changes
» Helps minimize downtime in case the NameNode crashes

Secondary NameNode

HDFS

Page 52: TheEdge10 : Big Data is Here - Hadoop to the Rescue

52

Topology of a Hadoop Cluster

[Cluster topology diagram, repeated]

Page 53: TheEdge10 : Big Data is Here - Hadoop to the Rescue

53

» One per cluster – on the master node
» Receives job requests submitted by the client
» Schedules and monitors MapReduce jobs on TaskTrackers

JobTracker

HDFS

Page 54: TheEdge10 : Big Data is Here - Hadoop to the Rescue

54

» Run map and reduce tasks
» Send progress reports to the JobTracker

TaskTracker

HDFS

Page 55: TheEdge10 : Big Data is Here - Hadoop to the Rescue

55

» Via file commands:
$ hadoop fs -mkdir /user/chuck
$ hadoop fs -put hugeFile.txt
$ hadoop fs -get anotherHugeFile.txt

» Programmatically (HDFS API; a fuller sketch follows below):
FileSystem hdfs = FileSystem.get(new Configuration());
FSDataOutputStream out = hdfs.create(filePath);
while (...) {
  out.write(buffer, 0, bytesRead);
}

Working with HDFS

HDFS
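Putting the programmatic route together – a self-contained sketch that streams a local file into HDFS; the file names reuse the examples above and error handling is kept minimal:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);   // connects to the configured NameNode

    InputStream in = new BufferedInputStream(new FileInputStream("hugeFile.txt"));
    FSDataOutputStream out = hdfs.create(new Path("/user/chuck/hugeFile.txt"));

    // stream the local file into HDFS, 4 KB at a time
    byte[] buffer = new byte[4096];
    int bytesRead;
    while ((bytesRead = in.read(buffer)) != -1) {
      out.write(buffer, 0, bytesRead);
    }
    out.close();
    in.close();
  }
}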

Page 56: TheEdge10 : Big Data is Here - Hadoop to the Rescue

56

Tips & Tricks

Page 57: TheEdge10 : Big Data is Here - Hadoop to the Rescue

57

Tip #1: Hadoop Configuration Types

Tips & Tricks

Type                      # of Machines                Daemons
Local mode                local machine                no daemons, everything in 1 JVM
Pseudo-distributed mode   local machine                daemons on separate JVMs (“cluster of one”)
Fully-distributed mode    cluster with several nodes   daemons on separate JVMs

Page 58: TheEdge10 : Big Data is Here - Hadoop to the Rescue

58

» Monitoring events in the cluster can prove to be a bit difficult
» Web interface for our cluster
» Shows a summary of the cluster
» Details about jobs that are currently running, completed and failed

Tip #2: JobTracker UI

Tips & Tricks

Page 59: TheEdge10 : Big Data is Here - Hadoop to the Rescue

59

[Screenshot: the JobTracker web UI]

Tips & Tricks

Page 60: TheEdge10 : Big Data is Here - Hadoop to the Rescue

60

» Digging through logs or… running the exact same scenario again, with the same input, on the same node?
» IsolationRunner can rerun the failed task to reproduce the problem
» Attach a debugger
» keep.failed.task.files = true

Tip #3: IsolationRunner – Hadoop’s Time Machine

Tips & Tricks

Page 61: TheEdge10 : Big Data is Here - Hadoop to the Rescue

61

» Output of the map phase (which will be shuffled across the network) can be quite large
» Built-in support for compression
» Different codecs: gzip, bzip2, etc.
» Transparent to the developer

conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(GzipCodec.class);

Tip #4: Compression

Tips & Tricks

Page 62: TheEdge10 : Big Data is Here - Hadoop to the Rescue

62

» A node can experience a slowdown, thus slowing down the entire job
» If a task is identified as “slow”, it will be scheduled to run on another node in parallel
» As soon as one finishes successfully, the others will be killed
» An optimization – not a feature (see the sketch below)

Tip #5: Speculative Execution

Tips & Tricks
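Since it is an optimization, it can also be switched off per job; a small sketch using setters from the old org.apache.hadoop.mapred.JobConf API (worth verifying against your Hadoop version):

// speculative execution is on by default; disable it when
// tasks have side effects that must not run twice
conf.setMapSpeculativeExecution(false);
conf.setReduceSpeculativeExecution(false);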

Page 63: TheEdge10 : Big Data is Here - Hadoop to the Rescue

63

» Input can come from 2 (or more) different sources
» Hadoop has a contrib package called datajoin
» A generic framework for performing reduce-side joins

Tip #6: DataJoin Package

MapReduce

Page 64: TheEdge10 : Big Data is Here - Hadoop to the Rescue

64

Hadoop in the CloudAmazon Web Services

Page 65: TheEdge10 : Big Data is Here - Hadoop to the Rescue

65

» Cloud computing – shared resources and information are provided on demand
» Rent a cluster rather than buy it
» The best-known infrastructure for cloud computing is Amazon Web Services (AWS)
» Launched in July 2002

Cloud Computing and AWS

Hadoop in the Cloud

Page 66: TheEdge10 : Big Data is Here - Hadoop to the Rescue

66

» Elastic Compute Cloud (EC2)
   A large farm of VMs that users can rent to run their computing applications
   Wide range of instance types to choose from (price varies)

» Simple Storage Service (S3) – online storage for persisting MapReduce data for future use

» Hadoop comes with built-in support for EC2 and S3:
$ hadoop-ec2 launch-cluster <cluster-name> <num-of-slaves>

Hadoop in the Cloud – Core Services

Page 67: TheEdge10 : Big Data is Here - Hadoop to the Rescue

67

EC2 Data Flow

[Diagram: our data → HDFS → MapReduce tasks, all inside EC2]

Page 68: TheEdge10 : Big Data is Here - Hadoop to the Rescue

68

EC2 & S3 Data Flow

[Diagram: our data → S3 for persistence → HDFS → MapReduce tasks inside EC2]

Page 69: TheEdge10 : Big Data is Here - Hadoop to the Rescue

69

Hadoop-Related Projects

Page 70: TheEdge10 : Big Data is Here - Hadoop to the Rescue

70

» Thinking at the level of Map, Reduce and job chaining, instead of simple data flow operations, is non-trivial
» Pig simplifies Hadoop programming
» Provides a high-level data processing language: Pig Latin
» Being used by Yahoo! (70% of production jobs), Twitter, LinkedIn, eBay, etc.

» Problem: a Users file & a Pages file. Find the top 5 most visited pages by users aged 18-25

Pig

Hadoop-Related Projects

Page 71: TheEdge10 : Big Data is Here - Hadoop to the Rescue

71

Users = LOAD 'users.csv' AS (name, age);
Fltrd = FILTER Users BY age >= 18 AND age <= 25;

Pages = LOAD 'pages.csv' AS (user, url);

Jnd  = JOIN Fltrd BY name, Pages BY user;
Grpd = GROUP Jnd BY url;
Smmd = FOREACH Grpd GENERATE group, COUNT(Jnd) AS clicks;
Srtd = ORDER Smmd BY clicks DESC;
Top5 = LIMIT Srtd 5;

STORE Top5 INTO 'top5sites.csv';

Pig Latin – Data Flow Language
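Worth noting: each line above only defines a relation; Pig evaluates lazily, and nothing is compiled into MapReduce jobs until the STORE (or a DUMP) statement is reached.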

Page 72: TheEdge10 : Big Data is Here - Hadoop to the Rescue

72

» A data warehousing package built on top of Hadoop
» SQL-like queries on large datasets

Hive

Hadoop-Related Projects

Page 73: TheEdge10 : Big Data is Here - Hadoop to the Rescue

73

» Hadoop database for random read/write access
» Uses HDFS as the underlying file system
» Supports billions of rows and millions of columns
» Facebook chose HBase as the framework for their new version of “Messages”

HBase

Hadoop-Related Projects

Page 74: TheEdge10 : Big Data is Here - Hadoop to the Rescue

74

» A distribution of Hadoop that simplifies deployment by providing the most recent stable version of Apache Hadoop with patches and backports

Cloudera

Hadoop-Related Projects

Page 75: TheEdge10 : Big Data is Here - Hadoop to the Rescue

75

» Machine learning algorithms for Hadoop
» Coming up next… (-:

Mahout

Hadoop-Related Projects

Page 76: TheEdge10 : Big Data is Here - Hadoop to the Rescue

76

» Big Data can and will cause serious scalability problems for your application
» MapReduce for analysis, distributed filesystem for storage
» Hadoop = MapReduce + HDFS, and much more
» AWS integration is easy
» Lots of documentation

Last words

Summary

Page 78: TheEdge10 : Big Data is Here - Hadoop to the Rescue

78

Thank you!