

Big Data and Apache Hadoop’s MapReduce

Michael Hahsler

Computer Science and Engineering, Southern Methodist University

January 23, 2012

Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23


Table of Contents

1 Introduction

2 Hadoop Distributed File System

3 MapReduce

4 More Examples


Single Machine

[Figure: a single machine with disk, CPU, and memory]

Typical setup for data processing/mining!

What are the problems with big data?


Big Data Challenge

Data Sources

Internet and social networks

Sensors and science

Needed Infrastructure

Scale to thousands of CPUs

Run on cheap commodity hardware (fault-tolerant hardware is expensive!)

Automatically handle data replication and node failure

Data distribution and load balancing

Easy to implement solutions (thinking in terms of parallel computing is hard!)


Hardware: Cluster Architecture

[Figure: cluster architecture. Each node has its own disk, CPU, and memory. Nodes are grouped into racks 1 to n, each rack with its own switch; the rack switches are connected by a backbone switch.]

Rack contains 16-64 nodes

Node failure!


Software: Apache Hadoop

What is Apache Hadoop?

A software framework that supports data-intensive (petabytes) distributed applications under a free license.

...inspired by Google’s MapReduce and Google File System (GFS).

Hadoop provides:

Distributed file system HDFS

API to work with MapReduce

Job configuration and scheduling

Track progress and utilization

Written in the Java programming language.

What Problems can be solved with Hadoop?

Characteristics

Processing can easily be parallelized (simple computations)

Process large amounts of unstructured data

Running batch jobs is acceptable

Examples

Creating statistics (word counting)

Searching (distributed grep)

Sorting

Indexing (postings list)

Document clustering

Graph algorithms (e.g., PageRank)


Who uses Hadoop?

Adobe

Amazon.com

AOL

Facebook

Google

IBM

Microsoft

NY Times

Yahoo!

...

Source: http://wiki.apache.org/hadoop/PoweredBy


The Hadoop Distributed File System (HDFS)

Source: http://hadoop.apache.org/common/docs/current/hdfs_design.html

Files are split into large blocks of 64 MB (a typical file system uses 4 kB blocks)

Replication (2-3x on different racks) → Fault-tolerance

Master node stores meta information
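The block and replication ideas above can be sketched in a few lines of Python. This is a minimal illustration, not HDFS code: the 64 MB block size comes from the slide, while `split_into_blocks`, `place_replicas`, the rack and node names, and the round-robin placement policy are all assumptions made up for this example.

```python
from itertools import zip_longest

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return an (offset, length) pair for each block of a file."""
    n_blocks = (file_size + block_size - 1) // block_size  # ceiling division
    return [
        (i * block_size, min(block_size, file_size - i * block_size))
        for i in range(n_blocks)
    ]


def place_replicas(block_id, racks, replication=3):
    """Pick `replication` distinct nodes, alternating between racks.

    Hypothetical round-robin policy for illustration only; it is not the
    placement policy that HDFS actually uses.
    """
    columns = [[(rack, node) for node in racks[rack]] for rack in sorted(racks)]
    # Interleave the racks so that neighbouring slots sit on different racks.
    interleaved = [
        slot for row in zip_longest(*columns) for slot in row if slot is not None
    ]
    if replication > len(interleaved):
        raise ValueError("not enough nodes for the requested replication")
    return [interleaved[(block_id + r) % len(interleaved)] for r in range(replication)]


MB = 1024 * 1024
blocks = split_into_blocks(200 * MB)  # three full blocks plus one 8 MB block
racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
placement = place_replicas(0, racks)
```

In a real cluster the master node would record this block-to-node mapping as part of its meta information.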


MapReduce

Basic Idea

1 Apply a map function to each input element and emit key/value pairs.

map(k1, v1) → list(k2, v2)

2 Summarize the results for each key using a reduce function.

reduce(k2, list(v2)) → (k2, v3)

The user only has to specify the map and reduce functions and the framework takes care of the rest!

The user does not have to think about concurrency, load balancing, data distribution, fault-tolerance!
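To make the two signatures concrete, here is a single-process sketch of what "the rest" amounts to in the simplest case: call map on every input pair, group the emitted values by key, and call reduce once per key. The function `map_reduce` and the toy order data are assumptions for illustration; a real framework such as Hadoop distributes these steps over many machines.

```python
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Run map_fn over (k1, v1) pairs, group by k2, then reduce each group."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):  # map(k1, v1) -> list(k2, v2)
            groups[k2].append(v2)
    # reduce(k2, list(v2)) -> (k2, v3), visiting the keys in sorted order
    return [reduce_fn(k2, groups[k2]) for k2 in sorted(groups)]


# Toy job (made-up data): total order amount per customer.
def map_order(order_id, order):
    customer, amount = order
    return [(customer, amount)]


def reduce_total(customer, amounts):
    return (customer, sum(amounts))


orders = [(1, ("alice", 30)), (2, ("bob", 5)), (3, ("alice", 12))]
totals = map_reduce(orders, map_order, reduce_total)
```

Only `map_order` and `reduce_total` are specific to the job; the driver stays the same for any other pair of functions.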


Example: Word Counting

Count how often each word appears in a large number of documents.

void map(String name, String document) {
    // name: document name
    // document: document contents
    for each word w in document:
        EmitIntermediate(w, "1")
}

void reduce(String word, Iterator partialCounts) {
    // word: a word
    // partialCounts: a list of aggregated partial counts
    int sum = 0;
    for each pc in partialCounts:
        sum += ParseInt(pc);
    Emit(word, AsString(sum));
}
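The pseudocode above is not runnable as written. A possible Python rendering is sketched below; it keeps the string-encoded counts of the pseudocode, and the helper `word_count` is a hypothetical single-process stand-in for the framework, added only so the example can be executed. Splitting on whitespace is also an assumption about what counts as a word.

```python
from collections import defaultdict


def map_words(name, document):
    """Emit (word, "1") for every whitespace-separated word in the document."""
    for word in document.split():
        yield (word, "1")


def reduce_counts(word, partial_counts):
    """Sum the string-encoded partial counts for one word."""
    total = sum(int(pc) for pc in partial_counts)
    return (word, str(total))


def word_count(documents):
    """Stand-in for the framework: map every document, group by word, reduce."""
    grouped = defaultdict(list)
    for name, text in documents.items():
        for word, count in map_words(name, text):
            grouped[word].append(count)
    return dict(reduce_counts(w, counts) for w, counts in grouped.items())


counts = word_count({"a.txt": "to be or not to be", "b.txt": "to do"})
```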


Example: Word Counting

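[Figure: data flow through the four stages split, map, shuffle/sort, reduce]

split (one input line per map task):
    the quick brown fox
    the lazy old dog
    the quick brown dog

map (output of each map task):
    the, 1   quick, 1   brown, 1   fox, 1
    the, 1   lazy, 1    old, 1     dog, 1
    the, 1   quick, 1   brown, 1   dog, 1

shuffle/sort (pairs grouped by key):
    the, 1   the, 1   the, 1
    quick, 1   quick, 1
    brown, 1   brown, 1
    fox, 1
    lazy, 1
    old, 1
    dog, 1   dog, 1

reduce (final counts):
    fox, 1
    lazy, 1
    old, 1
    dog, 2
    brown, 2
    quick, 2
    the, 3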
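The shuffle/sort stage is the part the user never writes, so it is worth seeing in isolation. The sketch below replays the three example sentences in one process and implements the grouping with a sort followed by `itertools.groupby`; this is one simple way to group equal keys and is not meant to describe how Hadoop implements its shuffle.

```python
from itertools import groupby
from operator import itemgetter

lines = ["the quick brown fox", "the lazy old dog", "the quick brown dog"]

# map: one (word, 1) pair per word, one list per input split
mapped = [[(word, 1) for word in line.split()] for line in lines]

# shuffle/sort: merge all pairs, sort by word, and group equal words together
pairs = sorted((pair for split in mapped for pair in split), key=itemgetter(0))
shuffled = {
    word: [count for _, count in group]
    for word, group in groupby(pairs, key=itemgetter(0))
}

# reduce: sum the grouped counts for each word
reduced = {word: sum(counts) for word, counts in shuffled.items()}
```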


Execution of a MapReduce Job

Source: Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI, 2004.


Fault-tolerance

[Figure: the cluster architecture diagram (nodes with disk, CPU, and memory in racks 1 to n, rack switches, backbone switch) with a master node that pings the workers. Block 1 appears on two nodes. The node running Task 1 fails, and the master reschedules Task 1 on another node.]

Rack contains 16-64 nodes

Master re-executes tasks for failed workers.
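A rough sketch of the rescheduling step follows. The data structures (`assignments`, `alive`, `replicas`), the node names, and the preference for a worker that already stores a replica of the block are assumptions made for this illustration; they are not taken from Hadoop's scheduler.

```python
def reschedule_failed(assignments, alive, replicas):
    """Reassign tasks from unresponsive workers to live workers.

    assignments: task -> (worker, block)
    alive:       set of workers that answered the last ping
    replicas:    block -> list of workers storing a copy of that block
    All three structures are hypothetical and chosen for this sketch.
    """
    updated = {}
    for task, (worker, block) in assignments.items():
        if worker in alive:
            updated[task] = (worker, block)  # worker is fine, keep the task
            continue
        # Prefer a live worker that already holds a replica of the block.
        candidates = [w for w in replicas.get(block, []) if w in alive]
        if not candidates:
            candidates = sorted(alive)  # fall back to any live worker
        if not candidates:
            raise RuntimeError(f"no live worker left for {task}")
        updated[task] = (candidates[0], block)
    return updated


assignments = {"task1": ("node3", "block1"), "task2": ("node5", "block2")}
replicas = {"block1": ["node3", "node7"], "block2": ["node5", "node6"]}
alive = {"node5", "node6", "node7"}  # node3 did not answer the ping
new_assignments = reschedule_failed(assignments, alive, replicas)
```

Because the block is replicated, the re-executed task can still read its input locally on the replacement node.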


Locality

[Figure: the cluster architecture diagram (nodes with disk, CPU, and memory in racks 1 to n, rack switches, backbone switch) with a master node. Block 1 and Block 2 are stored on worker nodes.]

Rack contains 16-64 nodes

Schedule the task which needs Block 2 on the node that stores it, or at least on the same rack!

Schedule map tasks close to their data to conserve network bandwidth.
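The preference order (same node, then same rack, then anywhere) can be written as a small ranking function. The names `pick_worker`, `block_locations`, and `rack_of`, and the node and rack labels, are hypothetical; the sketch only shows the ordering and ignores everything else a real scheduler has to consider.

```python
def pick_worker(block, idle_workers, block_locations, rack_of):
    """Choose an idle worker: same node first, then same rack, then any."""
    holders = block_locations[block]
    holder_racks = {rack_of[w] for w in holders}

    def distance(worker):
        if worker in holders:
            return 0  # the block is on the worker's local disk
        if rack_of[worker] in holder_racks:
            return 1  # the block is behind the same rack switch
        return 2  # the block has to cross the backbone switch

    # sorted() makes the choice deterministic when several workers tie
    return min(sorted(idle_workers), key=distance)


rack_of = {"n1": "rack1", "n2": "rack1", "n3": "rack2", "n4": "rack2"}
block_locations = {"block2": {"n2"}}

local = pick_worker("block2", {"n2", "n3"}, block_locations, rack_of)
same_rack = pick_worker("block2", {"n1", "n3"}, block_locations, rack_of)
remote = pick_worker("block2", {"n3", "n4"}, block_locations, rack_of)
```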


More Properties

Load Balancing: Subdivide work into many small tasks (≫ # of workers). Since the master dynamically assigns tasks to idle nodes, this automatically provides load balancing.

Chaining: MapReduce operations can be chained (output of one operation is input for another operation) to solve more complicated computations.

Backup tasks: Reduce phase can only start after all map tasks are finished (use backup tasks to avoid “stragglers”).
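The load-balancing claim can be checked with a small simulation. The sketch below hands each task to whichever worker becomes idle first and reports when the last worker finishes; the task durations are made-up numbers, and `makespan` is a hypothetical helper, not part of Hadoop.

```python
import heapq


def makespan(task_times, n_workers):
    """Finish time when each task goes to the worker that becomes idle first."""
    idle_at = [0] * n_workers  # min-heap of the times workers become idle
    heapq.heapify(idle_at)
    for t in task_times:
        start = heapq.heappop(idle_at)  # earliest idle worker takes the task
        heapq.heappush(idle_at, start + t)
    return max(idle_at)


# The same 24 time units of work on 4 workers, cut two different ways.
coarse = makespan([9, 5, 5, 5], 4)  # one uneven task per worker
fine = makespan([3, 3, 3] + [1] * 15, 4)  # many small tasks
```

With one large task per worker the job takes as long as the slowest task (9 units); with many small tasks the idle workers keep picking up work and the job finishes in 6 units, which is the ideal 24 / 4.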


Example: Distributed Grep

Map task?

Reduce task?
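One possible answer, following the description of distributed grep in Dean and Ghemawat's paper: the map function emits a line if it matches the pattern, and the reduce function is the identity. The sketch below keys the output by document name, which is a choice made for this example, and `distributed_grep` is a hypothetical single-process driver standing in for the framework.

```python
import re
from collections import defaultdict


def grep_map(name, document, pattern):
    """Emit (document name, line) for every line that matches the pattern."""
    regex = re.compile(pattern)
    for line in document.splitlines():
        if regex.search(line):
            yield (name, line)


def grep_reduce(name, lines):
    """Identity reduce: pass the matching lines through unchanged."""
    return (name, list(lines))


def distributed_grep(documents, pattern):
    """Stand-in for the framework: map, group by document name, reduce."""
    grouped = defaultdict(list)
    for name, text in documents.items():
        for key, line in grep_map(name, text, pattern):
            grouped[key].append(line)
    return dict(grep_reduce(k, v) for k, v in grouped.items())


logs = {"a.log": "error: disk\nok\nerror: net", "b.log": "ok"}
matches = distributed_grep(logs, r"^error")
```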


Example: Creating an Inverted Index

Map task?

Reduce task?
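One possible answer, again following Dean and Ghemawat's paper: the map function parses a document and emits (word, document ID) pairs, and the reduce function sorts the document IDs for a word and emits the word with its postings list. Lowercasing, whitespace tokenization, and the driver `build_index` are assumptions added to keep the sketch runnable.

```python
from collections import defaultdict


def index_map(doc_id, document):
    """Emit (word, doc_id) once for each distinct word in the document."""
    for word in set(document.lower().split()):
        yield (word, doc_id)


def index_reduce(word, doc_ids):
    """Collect the sorted list of document IDs (the postings list) for a word."""
    return (word, sorted(doc_ids))


def build_index(documents):
    """Stand-in for the framework: map, group by word, reduce."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, d in index_map(doc_id, text):
            grouped[word].append(d)
    return dict(index_reduce(w, ids) for w, ids in grouped.items())


index = build_index({1: "the quick fox", 2: "The lazy dog", 3: "quick dog"})
```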


Related Projects

Apache HBase: open source, non-relational, distributed database modeled after Google’s BigTable

Apache Hive: data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Provides a SQL-like query language called HiveQL.

Apache Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.

Apache Mahout: free implementations of distributed or otherwise scalable machine learning algorithms on the Hadoop platform.

Apache Lucene: a high-performance, full-featured text search engine library written entirely in Java.


Reading

Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, http://labs.google.com/papers/mapreduce.html

The Apache Hadoop Project, http://hadoop.apache.org/
