Big Data is Here – Hadoop to the Rescue! Shay Sofer, AlphaCSP

A presentation from TheEdge10 about Hadoop and Big Data.

Page 1: TheEdge10 : Big Data is Here - Hadoop to the Rescue

Big Data is Here – Hadoop to the Rescue!

Shay Sofer, AlphaCSP

Page 2: TheEdge10 : Big Data is Here - Hadoop to the Rescue

2

Today we will:

» Understand what Big Data is
» Get to know Hadoop
» Experience some MapReduce magic
» Persist very large files
» Learn some nifty tricks

On Today's Menu...

Page 3: TheEdge10 : Big Data is Here - Hadoop to the Rescue

3

Data is Everywhere

Page 4: TheEdge10 : Big Data is Here - Hadoop to the Rescue

4

» IDC: “Total data in the universe: 1.2 Zettabytes” (May 2010)

» 1 ZB = 1 trillion gigabytes (or: 1,000,000,000,000,000,000,000 bytes = 10^21)

» 60% growth from 2009
» By 2020 – we will reach 35 ZB

Facts and Numbers

Data is Everywhere

Page 5: TheEdge10 : Big Data is Here - Hadoop to the Rescue

5

Facts and Numbers

Data is Everywhere

Source: www.idc.com

Page 6: TheEdge10 : Big Data is Here - Hadoop to the Rescue

6

» 234M web sites
» 7M new sites in 2009
» New York Stock Exchange – 1 TB of data per day
» Web 2.0:
   147M blogs (and counting…)
   Twitter – ~12 TB of data per day

Facts and Numbers

Data is Everywhere

Page 7: TheEdge10 : Big Data is Here - Hadoop to the Rescue

7

» 500M users
» 40M photos per day
» More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums etc.) shared each month

Facts and Numbers - Facebook

Data is Everywhere

Page 8: TheEdge10 : Big Data is Here - Hadoop to the Rescue

8

» Big Data refers to datasets that grow so large that they become awkward to work with using on-hand database management tools

» Where and how do we store this information?
» How do we perform analyses on such large datasets?

Why are you here?

Data is Everywhere

Page 9: TheEdge10 : Big Data is Here - Hadoop to the Rescue

9

Scale-up Vs. Scale-out

Data is Everywhere

Page 10: TheEdge10 : Big Data is Here - Hadoop to the Rescue

10

» Scale-up: Adding resources to a single node in a system, typically involving the addition of CPUs or memory to a single computer

» Scale-out: Adding more nodes to a system, e.g. adding a new computer with commodity hardware to a distributed software application

Scale-up Vs. Scale-out

Data is Everywhere

Page 11: TheEdge10 : Big Data is Here - Hadoop to the Rescue

11

Introducing…Hadoop!

Page 12: TheEdge10 : Big Data is Here - Hadoop to the Rescue

12

» A framework for writing and running distributed applications that process large amounts of data
» Runs on large clusters of commodity hardware
» A cluster with hundreds of machines is standard
» Inspired by Google’s architecture: MapReduce and GFS

What is Hadoop?

Hadoop

Page 13: TheEdge10 : Big Data is Here - Hadoop to the Rescue

13

» Robust – handles failures of individual nodes
» Scales linearly
» Open source
» A top-level Apache project

Why Hadoop?

Hadoop

Page 14: TheEdge10 : Big Data is Here - Hadoop to the Rescue

14

Hadoop

Page 15: TheEdge10 : Big Data is Here - Hadoop to the Rescue

15

» Facebook holds the largest known Hadoop storage cluster in the world
   2000 machines
   12 TB per machine (some have 24 TB)
   32 GB of RAM per machine
» Total of more than 21 Petabytes (1 Petabyte = 1024 Terabytes)

Facebook (Again…)

Hadoop

Page 16: TheEdge10 : Big Data is Here - Hadoop to the Rescue

16

History

Hadoop

2002 – Apache Nutch, an open source web search engine, founded by Doug Cutting

2004 – Google’s GFS & MapReduce papers published

2006 – Cutting joins Yahoo!, forms Hadoop

2008 – Hadoop hits web scale, being used by Yahoo! for web indexing

2008 – Sorting 1 TB in 62 seconds

2010 – Computing the longest Pi yet

Page 17: TheEdge10 : Big Data is Here - Hadoop to the Rescue

17

Hadoop

[Diagram: the Hadoop ecosystem – Common, MapReduce, HDFS, Pig, Hive, HBase, ZooKeeper, Chukwa]

Page 18: TheEdge10 : Big Data is Here - Hadoop to the Rescue

18

IDE Plugin

Hadoop

Page 19: TheEdge10 : Big Data is Here - Hadoop to the Rescue

19

Hadoop and MapReduce

Page 20: TheEdge10 : Big Data is Here - Hadoop to the Rescue

20

» A programming model for processing and generating large data sets

» Introduced by Google
» Parallel processing of the map/reduce operations

Definition

MapReduce

Page 21: TheEdge10 : Big Data is Here - Hadoop to the Rescue

21

Sam believed “An apple a day keeps a doctor away”

MapReduce – The Story of Sam

[Image: Mother gives Sam an apple]

Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs

Page 22: TheEdge10 : Big Data is Here - Hadoop to the Rescue

22

Sam thought of “drinking” the apple

MapReduce – The Story of Sam

He used a knife to cut the apple and a mixer to make juice [tool images omitted]

Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs

Page 23: TheEdge10 : Big Data is Here - Hadoop to the Rescue

23

Sam applied his invention to all the fruits he could find in the fruit basket

[Images: (map ‘(fruits)) → a list of juices; (reduce ‘(juices)) → one glass of mixed juice]

MapReduce – The Story of Sam

A list of values mapped into another list of values, which gets reduced into a single value

Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs

Page 24: TheEdge10 : Big Data is Here - Hadoop to the Rescue

24

Sam got his first job for his talent in making juice

MapReduce – The Story of Sam

Now, it’s not just one basket but a whole container of fruits

Also, they produce a list of juice types separately

But Sam had just ONE knife and ONE mixer [images omitted]

Large data, and a list of values for output

Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs

Page 25: TheEdge10 : Big Data is Here - Hadoop to the Rescue

25

Sam implemented a parallel version of his innovation

Each map input: a list of <key, value> pairs, e.g. (<a, apple>, <o, orange>, <p, pineapple>, …)

Each map output: a list of <key, value> pairs, e.g. (<a’, apple-juice>, <o’, orange-juice>, <p’, pineapple-juice>, …)

Grouped by key (shuffle), each reduce input is a <key, value-list>, e.g. <a’, (apple-juice, …)>

Reduced into a list of values

Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs


MapReduce – The Story of Sam

Page 26: TheEdge10 : Big Data is Here - Hadoop to the Rescue

26

» Mapper – Takes a series of key/value pairs, processes each and generates output key/value pairs: (k1, v1) → list(k2, v2)

» Reducer – Iterates through the values that are associated with a specific key and generates output: (k2, list(v2)) → list(k3, v3)

» The Mapper takes the input data, filters and transforms it into something the Reducer can aggregate over

First Map, Then Reduce

MapReduce

Page 27: TheEdge10 : Big Data is Here - Hadoop to the Rescue

27

MapReduce

[Diagram: Input → Map (many mappers in parallel) → Shuffle → Reduce → Output]

Page 28: TheEdge10 : Big Data is Here - Hadoop to the Rescue

28

» Hadoop comes with a number of predefined classes: BooleanWritable, ByteWritable, LongWritable, Text, etc. (a short sketch follows below)
» Supports pluggable serialization frameworks
» Apache Avro

Hadoop Data Types

MapReduce
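To make these types concrete, a minimal sketch – nothing below is from the deck beyond the Writable classes it names; the values are illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritablesDemo {
  public static void main(String[] args) {
    // Writables are mutable, reusable boxes around primitive values
    IntWritable count = new IntWritable();
    count.set(42);                           // the same object can be reused across records

    Text word = new Text("hadoop");          // Hadoop's Writable counterpart of String
    LongWritable offset = new LongWritable(128L);

    System.out.println(word + " appeared " + count.get() + " times at offset " + offset.get());
  }
}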

Page 29: TheEdge10 : Big Data is Here - Hadoop to the Rescue

29

» TextInputFormat / TextOutputFormat
» KeyValueTextInputFormat

» SequenceFile – a Hadoop-specific compressed binary file format, optimized for passing data between two MapReduce jobs (sketched below)

Input / Output Formats

MapReduce
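As a rough illustration of SequenceFile, a minimal writer sketch using the classic SequenceFile.createWriter API; the path and records are made up for the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("wordcounts.seq");   // illustrative output path

    // a SequenceFile stores binary <key, value> records, here Text -> IntWritable
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
    try {
      writer.append(new Text("hello"), new IntWritable(1));
      writer.append(new Text("world"), new IntWritable(2));
    } finally {
      writer.close();
    }
  }
}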

Page 30: TheEdge10 : Big Data is Here - Hadoop to the Rescue

30

public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, new IntWritable(1));
    }
  }
}

Input:  <k1, Hello World Bye World>
Output: <Hello, 1> <World, 1> <Bye, 1> <World, 1>

Word Count – The Mapper

Page 31: TheEdge10 : Big Data is Here - Hadoop to the Rescue

31

public static class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Input:  <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Output: <Hello, 1> <World, 2> <Bye, 1>

Word Count – The Reducer

Page 32: TheEdge10 : Big Data is Here - Hadoop to the Rescue

32

public static void main(String[] args) throws IOException {
  JobConf job = new JobConf(WordCount.class);

  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(MapClass.class);
  job.setReducerClass(ReduceClass.class);

  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));

  // job.setInputFormat(KeyValueTextInputFormat.class);

  JobClient.runJob(job);
}

Word Count – The Driver
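To launch the job from the command line, something along these lines would do (the jar and directory names are illustrative, not from the deck):

$ hadoop jar wordcount.jar WordCount input/ output/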

Page 33: TheEdge10 : Big Data is Here - Hadoop to the Rescue

33

» Music discovery website
» Scrobbling / streaming via radio
» 40M unique visitors per month
» Over 40M scrobbles per day
» Each scrobble creates a log line

Hadoop @ Last.FM

MapReduce

Page 34: TheEdge10 : Big Data is Here - Hadoop to the Rescue

34

Page 35: TheEdge10 : Big Data is Here - Hadoop to the Rescue

35

» Goal: Create a “Unique listeners per track” chart

Sample listening data

MapReduce

UserId   TrackId   Scrobbles   Radio   Skip
55551    100       5           10      0
55551    900       0           3       3
55552    101       0           5       0
55553    102       5           0       0

Page 36: TheEdge10 : Big Data is Here - Hadoop to the Rescue

36

public void map(LongWritable position, Text rawLine,
                OutputCollector<IntWritable, IntWritable> output,
                Reporter reporter) throws IOException {

  int scrobbles, radioListens;  // assume they are initialized -
  IntWritable trackId, userId;  // omitted for brevity

  // if the track somehow is marked with zero plays - ignore it
  if (scrobbles <= 0 && radioListens <= 0) {
    return;
  }

  // output user id against track id
  output.collect(trackId, userId);
}

Unique Listens - Mapper

Page 37: TheEdge10 : Big Data is Here - Hadoop to the Rescue

37

public void reduce(IntWritable trackId, Iterator<IntWritable> values,
                   OutputCollector<IntWritable, IntWritable> output,
                   Reporter reporter) throws IOException {

  Set<Integer> usersSet = new HashSet<Integer>();

  // add all userIds to the set, duplicates removed
  while (values.hasNext()) {
    IntWritable userId = values.next();
    usersSet.add(userId.get());
  }

  // output: trackId -> number of unique listeners per track
  output.collect(trackId, new IntWritable(usersSet.size()));
}

Unique Listens - Reducer

Page 38: TheEdge10 : Big Data is Here - Hadoop to the Rescue

38

» Complex tasks sometimes need to be broken down into subtasks
» Output of the previous job goes as input to the next job
» job-a | job-b | job-c
» Simply launch the driver of the 2nd job after the 1st, as sketched below

Chaining

MapReduce
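A minimal sketch of such a chain, assuming jobA and jobB are JobConfs configured like the word count driver above; the intermediate path is illustrative:

// JobClient.runJob() blocks until the job completes,
// so calling the drivers in sequence chains the jobs
FileOutputFormat.setOutputPath(jobA, new Path("/tmp/intermediate"));  // job-a writes here
JobClient.runJob(jobA);

FileInputFormat.addInputPath(jobB, new Path("/tmp/intermediate"));    // job-b reads it back
JobClient.runJob(jobB);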

Page 39: TheEdge10 : Big Data is Here - Hadoop to the Rescue

39

» Hadoop supports other languages via an API called Streaming
» Use UNIX commands as mappers and reducers
» Or use any script that processes line-oriented data streams from STDIN and outputs to STDOUT – Python, Perl, etc.

Hadoop Streaming

MapReduce

Page 40: TheEdge10 : Big Data is Here - Hadoop to the Rescue

40

$ hadoop jar hadoop-streaming.jar -input input/myFile.txt -output output.txt -mapper myMapper.py -reducer myReducer.py

Hadoop Streaming

MapReduce

Page 41: TheEdge10 : Big Data is Here - Hadoop to the Rescue

41

HDFS – Hadoop Distributed File System

Page 42: TheEdge10 : Big Data is Here - Hadoop to the Rescue

42

» A large dataset can and will outgrow the storage capacity of a single physical machine
» Partition it across separate machines – distributed filesystems
» Network-based – complex
» What happens when a node fails?

Distributed FileSystem

HDFS

Page 43: TheEdge10 : Big Data is Here - Hadoop to the Rescue

43

» Designed for storing very large files running on clusters of commodity hardware
» Highly fault-tolerant (via replication)
» A typical file is gigabytes to terabytes in size
» High throughput

HDFS - Hadoop Distributed FileSystem

HDFS

Page 44: TheEdge10 : Big Data is Here - Hadoop to the Rescue

44

Running Hadoop = running a set of daemons on different servers in your network:
» NameNode
» DataNode
» Secondary NameNode
» JobTracker
» TaskTracker

Hadoop’s Building Blocks

HDFS

Page 45: TheEdge10 : Big Data is Here - Hadoop to the Rescue

45

Topology of a Hadoop Cluster

[Diagram: a master node runs the NameNode and JobTracker, a separate node runs the Secondary NameNode, and each of the slave nodes runs a DataNode and a TaskTracker]

Page 46: TheEdge10 : Big Data is Here - Hadoop to the Rescue

46

» HDFS has a master/slave architecture; the NameNode acts as the master
» Single NameNode per HDFS
» Keeps track of:
   How the files are broken into blocks
   Which nodes store those blocks
   The overall health of the filesystem
» Memory and I/O intensive

The NameNode

HDFS

Page 47: TheEdge10 : Big Data is Here - Hadoop to the Rescue

47

» Each slave machine will host a DataNode daemon
» Serves read/write/delete requests from the NameNode
» Manages the storage attached to the nodes
» Sends a periodic Heartbeat to the NameNode

The DataNode

HDFS

Page 48: TheEdge10 : Big Data is Here - Hadoop to the Rescue

48

» Failure is the norm rather than the exception
» Detection of faults and quick, automatic recovery
» Each file is stored as a sequence of blocks (default: 64MB each)
» The blocks of a file are replicated for fault tolerance
» Block size and replicas are configurable per file

Fault Tolerance - Replication

HDFS

Page 49: TheEdge10 : Big Data is Here - Hadoop to the Rescue

49

HDFS

Page 50: TheEdge10 : Big Data is Here - Hadoop to the Rescue

50

Topology of a Hadoop Cluster

[Cluster topology diagram, repeated]

Page 51: TheEdge10 : Big Data is Here - Hadoop to the Rescue

51

» Assistant daemon that should be on a dedicated node
» Takes snapshots of the HDFS metadata
» Doesn’t receive real-time changes
» Helps minimize downtime in case the NameNode crashes

Secondary NameNode

HDFS

Page 52: TheEdge10 : Big Data is Here - Hadoop to the Rescue

52

Topology of a Hadoop Cluster

[Cluster topology diagram, repeated]

Page 53: TheEdge10 : Big Data is Here - Hadoop to the Rescue

53

» One per cluster – on the master node
» Receives job requests submitted by the client
» Schedules and monitors MapReduce jobs on TaskTrackers

JobTracker

HDFS

Page 54: TheEdge10 : Big Data is Here - Hadoop to the Rescue

54

» Run map and reduce tasks
» Send progress reports to the JobTracker

TaskTracker

HDFS

Page 55: TheEdge10 : Big Data is Here - Hadoop to the Rescue

55

» Via file commands:
$ hadoop fs -mkdir /user/chuck
$ hadoop fs -put hugeFile.txt
$ hadoop fs -get anotherHugeFile.txt

» Programmatically (HDFS API; a fuller sketch follows below):
FileSystem hdfs = FileSystem.get(new Configuration());
FSDataOutputStream out = hdfs.create(filePath);
while (...) {
  out.write(buffer, 0, bytesRead);
}

Working with HDFS

HDFS
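Putting the programmatic route together – a self-contained sketch that streams a local file into HDFS; the file names reuse the examples above and error handling is kept minimal:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);   // connects to the configured NameNode

    InputStream in = new BufferedInputStream(new FileInputStream("hugeFile.txt"));
    FSDataOutputStream out = hdfs.create(new Path("/user/chuck/hugeFile.txt"));

    // stream the local file into HDFS, 4 KB at a time
    byte[] buffer = new byte[4096];
    int bytesRead;
    while ((bytesRead = in.read(buffer)) != -1) {
      out.write(buffer, 0, bytesRead);
    }
    out.close();
    in.close();
  }
}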

Page 56: TheEdge10 : Big Data is Here - Hadoop to the Rescue

56

Tips & Tricks

Page 57: TheEdge10 : Big Data is Here - Hadoop to the Rescue

57

Tip #1: Hadoop Configuration Types

Tips & Tricks

Type                      # of Machines                Daemons
Local mode                local machine                no daemons, everything in 1 JVM
Pseudo-distributed mode   local machine                daemons on separate JVMs (“cluster of one”)
Fully-distributed mode    cluster with several nodes   daemons on separate JVMs

Page 58: TheEdge10 : Big Data is Here - Hadoop to the Rescue

58

» Monitoring events in the cluster can prove to be a bit difficult
» Web interface for our cluster
» Shows a summary of the cluster
» Details about jobs that are currently running, completed and failed

Tip #2: JobTracker UI

Tips & Tricks

Page 59: TheEdge10 : Big Data is Here - Hadoop to the Rescue

59

[Screenshot: the JobTracker web UI]

Tips & Tricks

Page 60: TheEdge10 : Big Data is Here - Hadoop to the Rescue

60

» Digging through logs or… running the exact same scenario again, with the same input, on the same node?
» IsolationRunner can rerun the failed task to reproduce the problem
» Attach a debugger
» keep.failed.task.files = true

Tip #3: IsolationRunner – Hadoop’s Time Machine

Tips & Tricks

Page 61: TheEdge10 : Big Data is Here - Hadoop to the Rescue

61

» Output of the map phase (which will be shuffled across the network) can be quite large
» Built-in support for compression
» Different codecs: gzip, bzip2, etc.
» Transparent to the developer

conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(GzipCodec.class);

Tip #4: Compression

Tips & Tricks

Page 62: TheEdge10 : Big Data is Here - Hadoop to the Rescue

62

» A node can experience a slowdown, thus slowing down the entire job
» If a task is identified as “slow”, it will be scheduled to run on another node in parallel
» As soon as one finishes successfully, the others will be killed
» An optimization – not a feature (see the sketch below)

Tip #5: Speculative Execution

Tips & Tricks
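Since it is an optimization, it can also be switched off per job; a small sketch using setters from the old org.apache.hadoop.mapred.JobConf API (worth verifying against your Hadoop version):

// speculative execution is on by default; disable it when
// tasks have side effects that must not run twice
conf.setMapSpeculativeExecution(false);
conf.setReduceSpeculativeExecution(false);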

Page 63: TheEdge10 : Big Data is Here - Hadoop to the Rescue

63

» Input can come from 2 (or more) different sources
» Hadoop has a contrib package called datajoin
» A generic framework for performing reduce-side joins

Tip #6: DataJoin Package

MapReduce

Page 64: TheEdge10 : Big Data is Here - Hadoop to the Rescue

64

Hadoop in the CloudAmazon Web Services

Page 65: TheEdge10 : Big Data is Here - Hadoop to the Rescue

65

» Cloud computing – shared resources and information are provided on demand
» Rent a cluster rather than buy it
» The best-known infrastructure for cloud computing is Amazon Web Services (AWS)
» Launched in July 2002

Cloud Computing and AWS

Hadoop in the Cloud

Page 66: TheEdge10 : Big Data is Here - Hadoop to the Rescue

66

» Elastic Compute Cloud (EC2)
   A large farm of VMs that users can rent to run their computing applications
   Wide range of instance types to choose from (price varies)

» Simple Storage Service (S3) – online storage for persisting MapReduce data for future use

» Hadoop comes with built-in support for EC2 and S3:
$ hadoop-ec2 launch-cluster <cluster-name> <num-of-slaves>

Hadoop in the Cloud – Core Services

Page 67: TheEdge10 : Big Data is Here - Hadoop to the Rescue

67

EC2 Data Flow

[Diagram: our data → HDFS → MapReduce tasks, all inside EC2]

Page 68: TheEdge10 : Big Data is Here - Hadoop to the Rescue

68

EC2 & S3 Data Flow

[Diagram: our data → S3 for persistence → HDFS → MapReduce tasks inside EC2]

Page 69: TheEdge10 : Big Data is Here - Hadoop to the Rescue

69

Hadoop-Related Projects

Page 70: TheEdge10 : Big Data is Here - Hadoop to the Rescue

70

» Thinking at the level of Map, Reduce and job chaining, instead of simple data flow operations, is non-trivial
» Pig simplifies Hadoop programming
» Provides a high-level data processing language: Pig Latin
» Being used by Yahoo! (70% of production jobs), Twitter, LinkedIn, eBay, etc.

» Problem: a Users file & a Pages file. Find the top 5 most visited pages by users aged 18-25

Pig

Hadoop-Related Projects

Page 71: TheEdge10 : Big Data is Here - Hadoop to the Rescue

71

Users = LOAD 'users.csv' AS (name, age);
Fltrd = FILTER Users BY age >= 18 AND age <= 25;

Pages = LOAD 'pages.csv' AS (user, url);

Jnd  = JOIN Fltrd BY name, Pages BY user;
Grpd = GROUP Jnd BY url;
Smmd = FOREACH Grpd GENERATE group, COUNT(Jnd) AS clicks;
Srtd = ORDER Smmd BY clicks DESC;
Top5 = LIMIT Srtd 5;

STORE Top5 INTO 'top5sites.csv';

Pig Latin – Data Flow Language
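Worth noting: each line above only defines a relation; Pig evaluates lazily, and nothing is compiled into MapReduce jobs until the STORE (or a DUMP) statement is reached.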

Page 72: TheEdge10 : Big Data is Here - Hadoop to the Rescue

72

» A data warehousing package built on top of Hadoop
» SQL-like queries on large datasets

Hive

Hadoop-Related Projects

Page 73: TheEdge10 : Big Data is Here - Hadoop to the Rescue

73

» Hadoop database for random read/write access
» Uses HDFS as the underlying file system
» Supports billions of rows and millions of columns
» Facebook chose HBase as the framework for their new version of “Messages”

HBase

Hadoop-Related Projects

Page 74: TheEdge10 : Big Data is Here - Hadoop to the Rescue

74

» A distribution of Hadoop that simplifies deployment by providing the most recent stable version of Apache Hadoop with patches and backports

Cloudera

Hadoop-Related Projects

Page 75: TheEdge10 : Big Data is Here - Hadoop to the Rescue

75

» Machine learning algorithms for Hadoop
» Coming up next… (-:

Mahout

Hadoop-Related Projects

Page 76: TheEdge10 : Big Data is Here - Hadoop to the Rescue

76

» Big Data can and will cause serious scalability problems for your application
» MapReduce for analysis, distributed filesystem for storage
» Hadoop = MapReduce + HDFS, and much more
» AWS integration is easy
» Lots of documentation

Last words

Summary

Page 78: TheEdge10 : Big Data is Here - Hadoop to the Rescue

78

Thank you!