CS61A Lecture 41 Amir Kamil UC Berkeley April 26, 2013

Slides: inst.eecs.berkeley.edu/~cs61a/sp13/slides/41-MapReduce_1pp.pdf



Announcements

• HW13 due Wednesday
• Scheme project due Monday
• Scheme contest deadline extended to Friday

CPU Performance

Performance of individual CPU cores has largely stagnated in recent years

Graph of CPU clock frequency, an important component in CPU performance:

[Graph: CPU clock frequency over time; data from http://cpudb.stanford.edu]

Parallelism

Applications must be parallelized in order to run faster
• Waiting for a faster CPU core is no longer an option

Parallelism is easy in functional programming:
• When a program contains only pure functions, call expressions can be evaluated in any order, lazily, and in parallel
• Referential transparency: a call expression can be replaced by its value (or vice versa) without changing the program

But not all problems can be solved efficiently using functional programming

Today: the easy case of parallelism, using only pure functions
• Specifically, we will look at MapReduce, a framework for such computations

Next time: the hard case, where shared data is required
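The order-independence of pure functions can be demonstrated directly in Python. A minimal sketch (not from the lecture): applying the same pure function sequentially and in an executor pool must give identical results, because no call has side effects that another call could observe.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    """A pure function: its result depends only on its argument."""
    return x * x

values = list(range(10))

# Because square has no side effects, the calls can be evaluated
# in any order, sequentially or in parallel, with the same result.
sequential = list(map(square, values))
with ThreadPoolExecutor() as pool:
    parallel = list(pool.map(square, values))

assert sequential == parallel
```

For CPU-bound Python code, a `ProcessPoolExecutor` would typically be used instead of threads to get true parallelism; the referential-transparency argument is the same either way.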

MapReduce

MapReduce is a framework for batch processing of Big Data

What does that mean?
• Framework: A system used by programmers to build applications
• Batch processing: All the data is available at the outset, and results aren't used until processing completes
• Big Data: A buzzword used to describe data sets so large that they reveal facts about the world via statistical analysis

The MapReduce idea:
• Data sets are too big to be analyzed by one machine
• When using multiple machines, systems issues abound
• Pure functions enable an abstraction barrier between data processing logic and distributed system administration

Systems

Systems research enables the development of applications by defining and implementing abstractions:

• Operating systems provide a stable, consistent interface to unreliable, inconsistent hardware
• Networks provide a simple, robust data transfer interface to constantly evolving communications infrastructure
• Databases provide a declarative interface to software that stores and retrieves information efficiently
• Distributed systems provide a single-entity-level interface to a cluster of multiple machines

A unifying property of effective systems:

Hide complexity, but retain flexibility

The Unix Operating System

Essential features of the Unix operating system (and variants):
• Portability: The same operating system on different hardware
• Multi-Tasking: Many processes run concurrently on a machine
• Plain Text: Data is stored and shared in text format
• Modularity: Small tools are composed flexibly via pipes

[Diagram: a process reads text input from standard input and writes text output to standard output, with a separate standard error stream]

The standard streams in a Unix-like operating system are conceptually similar to Python iterators
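The pipe-based composition described above can be driven from Python with the standard subprocess module. A small sketch, assuming a Unix-like system where the `printf` and `wc` tools are available, mirroring the shell pipeline `printf 'a\nb\nc\n' | wc -l`:

```python
import subprocess

# Compose two small tools via a pipe, as the shell does with `|`:
# the standard output of printf becomes the standard input of wc.
p1 = subprocess.Popen(['printf', 'a\nb\nc\n'], stdout=subprocess.PIPE)
result = subprocess.run(['wc', '-l'], stdin=p1.stdout,
                        capture_output=True, text=True)
p1.stdout.close()  # let printf receive SIGPIPE if the reader exits early
p1.wait()
line_count = result.stdout.strip()  # wc -l reports 3 lines
```

Neither program knows about the other; the operating system's stream abstraction is what lets them compose.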

Python Programs in a Unix Environment

The built-in input function reads a line from standard input

The built-in print function writes a line to standard output

The values sys.stdin and sys.stdout also provide access to the Unix standard streams as "files"

A Python "file" is an interface that supports iteration, read, and write methods

Using these "files" takes advantage of the operating system standard stream abstraction
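Because input sources and output sinks share one "file" interface, the same code can run against the standard streams or against in-memory files. A hypothetical line-counting filter, written as a sketch of the Unix style:

```python
import io
import sys

def count_lines(infile, outfile):
    """Count input lines and write the total, Unix-filter style.

    Works on any Python "file": sys.stdin and sys.stdout,
    open disk files, or in-memory StringIO objects."""
    count = sum(1 for _ in infile)   # a "file" supports iteration
    outfile.write(str(count) + '\n')

# In a real pipeline this would be count_lines(sys.stdin, sys.stdout),
# e.g. invoked as:  ls | python3 count.py
# Here we exercise the same interface with in-memory "files":
src = io.StringIO('For batch processing\nIs a Big Data framework\n')
out = io.StringIO()
count_lines(src, out)
assert out.getvalue() == '2\n'
```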

MapReduce Evaluation Model

Map phase: Apply a mapper function to inputs, emitting a set of intermediate key-value pairs
• The mapper takes an iterator over inputs, such as text lines
• The mapper yields zero or more key-value pairs per input

Reduce phase: For each intermediate key, apply a reducer function to accumulate all values associated with that key
• The reducer takes an iterator over key-value pairs

[Diagram: a mapper counts lowercase vowels in each input line. "Google MapReduce" yields o: 2, a: 1, u: 1, e: 3; "Is a Big Data framework" yields i: 1, a: 4, e: 1, o: 1; "For batch processing" yields a: 1, o: 2, e: 1, i: 1]

Reduce phase: For each intermediate key, apply a reducer function to accumulate all values associated with that key• The reducer takes an iterator over key‐value pairs• All pairs with a given key are consecutive

mappero: 2a: 1u: 1e: 3

i: 1a: 4e: 1o: 1

a: 1o: 2e: 1i: 1

For batch processingIs a Big Data frameworkGoogle MapReduce

Page 71: CS61A Lecture 41inst.eecs.berkeley.edu/~cs61a/sp13/slides/41-MapReduce_1pp.pdf · Scheme contest deadline extended to Friday Announcements. CPU Performance. CPU Performance Performance

MapReduce Evaluation Model

Map phase: Apply a mapper function to inputs, emitting a set of intermediate key‐value pairs• The mapper takes an iterator over inputs, such as text lines• The mapper yields zero or more key‐value pairs per input

Reduce phase: For each intermediate key, apply a reducer function to accumulate all values associated with that key• The reducer takes an iterator over key‐value pairs• All pairs with a given key are consecutive• The reducer yields 0 or more values,

each associated with that intermediate key

mappero: 2a: 1u: 1e: 3

i: 1a: 4e: 1o: 1

a: 1o: 2e: 1i: 1

For batch processingIs a Big Data frameworkGoogle MapReduce

Page 72: CS61A Lecture 41inst.eecs.berkeley.edu/~cs61a/sp13/slides/41-MapReduce_1pp.pdf · Scheme contest deadline extended to Friday Announcements. CPU Performance. CPU Performance Performance

MapReduce Evaluation Model

For batch processingIs a Big Data frameworkGoogle MapReduce mapper

o: 2a: 1u: 1e: 3

i: 1a: 4e: 1o: 1

a: 1o: 2e: 1i: 1

Reduce phase: For each intermediate key, apply a reducer function to accumulate all values associated with that key• The reducer takes an iterator over key‐value pairs• All pairs with a given key are consecutive• The reducer yields 0 or more values,

each associated with that intermediate key

Page 73: CS61A Lecture 41inst.eecs.berkeley.edu/~cs61a/sp13/slides/41-MapReduce_1pp.pdf · Scheme contest deadline extended to Friday Announcements. CPU Performance. CPU Performance Performance

MapReduce Evaluation Model

For batch processingIs a Big Data frameworkGoogle MapReduce mapper

o: 2a: 1u: 1e: 3

i: 1a: 4e: 1o: 1

a: 1o: 2e: 1i: 1

Reduce phase: For each intermediate key, apply a reducer function to accumulate all values associated with that key• The reducer takes an iterator over key‐value pairs• All pairs with a given key are consecutive• The reducer yields 0 or more values,

each associated with that intermediate key

...

a: 4a: 1a: 1e: 1e: 3e: 1

Page 74: CS61A Lecture 41inst.eecs.berkeley.edu/~cs61a/sp13/slides/41-MapReduce_1pp.pdf · Scheme contest deadline extended to Friday Announcements. CPU Performance. CPU Performance Performance

MapReduce Evaluation Model

reducera: 6

For batch processingIs a Big Data frameworkGoogle MapReduce mapper

o: 2a: 1u: 1e: 3

i: 1a: 4e: 1o: 1

a: 1o: 2e: 1i: 1

Reduce phase: For each intermediate key, apply a reducer function to accumulate all values associated with that key• The reducer takes an iterator over key‐value pairs• All pairs with a given key are consecutive• The reducer yields 0 or more values,

each associated with that intermediate key

...

a: 4a: 1a: 1e: 1e: 3e: 1

Page 75: CS61A Lecture 41inst.eecs.berkeley.edu/~cs61a/sp13/slides/41-MapReduce_1pp.pdf · Scheme contest deadline extended to Friday Announcements. CPU Performance. CPU Performance Performance

MapReduce Evaluation Model

reducere: 5

reducera: 6

For batch processingIs a Big Data frameworkGoogle MapReduce mapper

o: 2a: 1u: 1e: 3

i: 1a: 4e: 1o: 1

a: 1o: 2e: 1i: 1

Reduce phase: For each intermediate key, apply a reducer function to accumulate all values associated with that key• The reducer takes an iterator over key‐value pairs• All pairs with a given key are consecutive• The reducer yields 0 or more values,

each associated with that intermediate key

...

a: 4a: 1a: 1e: 1e: 3e: 1

Page 76: CS61A Lecture 41inst.eecs.berkeley.edu/~cs61a/sp13/slides/41-MapReduce_1pp.pdf · Scheme contest deadline extended to Friday Announcements. CPU Performance. CPU Performance Performance

MapReduce Evaluation Model

reducere: 5

reducera: 6 i: 2

For batch processingIs a Big Data frameworkGoogle MapReduce mapper

o: 2a: 1u: 1e: 3

i: 1a: 4e: 1o: 1

a: 1o: 2e: 1i: 1

Reduce phase: For each intermediate key, apply a reducer function to accumulate all values associated with that key• The reducer takes an iterator over key‐value pairs• All pairs with a given key are consecutive• The reducer yields 0 or more values,

each associated with that intermediate key

...

a: 4a: 1a: 1e: 1e: 3e: 1

Page 77: CS61A Lecture 41inst.eecs.berkeley.edu/~cs61a/sp13/slides/41-MapReduce_1pp.pdf · Scheme contest deadline extended to Friday Announcements. CPU Performance. CPU Performance Performance

MapReduce Evaluation Model

reducere: 5

reducera: 6 i: 2

o: 5

For batch processingIs a Big Data frameworkGoogle MapReduce mapper

o: 2a: 1u: 1e: 3

i: 1a: 4e: 1o: 1

a: 1o: 2e: 1i: 1

Reduce phase: For each intermediate key, apply a reducer function to accumulate all values associated with that key• The reducer takes an iterator over key‐value pairs• All pairs with a given key are consecutive• The reducer yields 0 or more values,

each associated with that intermediate key

...

a: 4a: 1a: 1e: 1e: 3e: 1

Page 78: CS61A Lecture 41inst.eecs.berkeley.edu/~cs61a/sp13/slides/41-MapReduce_1pp.pdf · Scheme contest deadline extended to Friday Announcements. CPU Performance. CPU Performance Performance

MapReduce Evaluation Model

reducere: 5

reducera: 6 i: 2

o: 5

u: 1

For batch processingIs a Big Data frameworkGoogle MapReduce mapper

o: 2a: 1u: 1e: 3

i: 1a: 4e: 1o: 1

a: 1o: 2e: 1i: 1

Reduce phase: For each intermediate key, apply a reducer function to accumulate all values associated with that key• The reducer takes an iterator over key‐value pairs• All pairs with a given key are consecutive• The reducer yields 0 or more values,

each associated with that intermediate key

...

a: 4a: 1a: 1e: 1e: 3e: 1

Page 79: CS61A Lecture 41inst.eecs.berkeley.edu/~cs61a/sp13/slides/41-MapReduce_1pp.pdf · Scheme contest deadline extended to Friday Announcements. CPU Performance. CPU Performance Performance
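The map-shuffle-reduce flow in the diagram can be simulated in ordinary Python. The sketch below reproduces the slides' vowel-counting example in memory; the helper names (mapper, shuffle, reducer) are illustrative stand-ins for the framework's phases, not framework API.

```python
from collections import defaultdict

def mapper(line):
    """Yield (vowel, count) pairs for one line of input."""
    for vowel in 'aeiou':
        count = line.count(vowel)
        if count > 0:
            yield vowel, count

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    """Accumulate all values for one key; here, by summing counts."""
    return key, sum(values)

lines = ['Google MapReduce', 'Is a Big Data framework', 'For batch processing']
intermediate = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(key, values) for key, values in shuffle(intermediate))
print(result)  # {'a': 6, 'e': 5, 'i': 2, 'o': 5, 'u': 1}
```

The counts match the diagram: each mapper processes one line independently, and each reducer sees all values for one key at once.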

Above‐the‐Line: Execution Model

http://research.google.com/archive/mapreduce‐osdi04‐slides/index‐auto‐0007.html

Below‐the‐Line: Parallel Execution

http://research.google.com/archive/mapreduce‐osdi04‐slides/index‐auto‐0008.html

A "task" is a Unix process running on a machine

[Diagram labels the stages in order: Map phase → Shuffle → Reduce phase]


MapReduce Assumptions

Constraints on the mapper and reducer:
• The mapper must be equivalent to applying a deterministic pure function to each input independently
• The reducer must be equivalent to applying a deterministic pure function to the sequence of values for each key

Benefits of functional programming:
• When a program contains only pure functions, call expressions can be evaluated in any order, lazily, and in parallel
• Referential transparency: a call expression can be replaced by its value (or vice versa) without changing the program

In MapReduce, these functional programming ideas allow:
• Consistent results, however computation is partitioned
• Re-computation and caching of results, as needed

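A tiny illustration of why the purity constraints matter: with a pure, deterministic reduction the framework is free to partition the values however it likes, while hidden state breaks that guarantee. The code below is a standalone sketch, not framework API.

```python
values = [4, 1, 1]  # all intermediate values for the key 'a'

# A pure reducer: the result depends only on its input...
total_single = sum(values)

# ...so the framework may split the work and combine partial results:
partial_a = sum(values[:2])
partial_b = sum(values[2:])
total_split = partial_a + partial_b
assert total_single == total_split == 6

# An impure reducer reads hidden state, so its answer depends on how
# many tasks ran -- different partitionings now disagree:
state = {'calls': 0}
def impure_reduce(vals):
    state['calls'] += 1
    return sum(vals) + state['calls']

assert impure_reduce(values) != impure_reduce(values[:2]) + impure_reduce(values[2:])
```

This is referential transparency at work: sum(values) can be replaced by its value, or by any combination of partial sums, without changing the program.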

Python Example of a MapReduce Application

The mapper and reducer are both self-contained Python programs
• Read from standard input and write to standard output!

Mapper:

    #!/usr/bin/env python3

    import sys
    from ucb import main
    from mapreduce import emit

    def emit_vowels(line):
        for vowel in 'aeiou':
            count = line.count(vowel)
            if count > 0:
                emit(vowel, count)

    for line in sys.stdin:
        emit_vowels(line)

• The #!/usr/bin/env python3 line tells Unix: this is Python
• The emit function outputs a key and value as a line of text to standard output
• Mapper inputs are lines of text provided to standard input

Reducer:

    #!/usr/bin/env python3

    import sys
    from ucb import main
    from mapreduce import emit, group_values_by_key

    for key, value_iterator in group_values_by_key(sys.stdin):
        emit(key, sum(value_iterator))

• group_values_by_key takes and returns iterators
• Input: lines of text representing key-value pairs, grouped by key
• Output: iterator over (key, value_iterator) pairs that give all values for each key

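The mapreduce module used above is course-provided and its internals are not shown on the slides. As a guess at how group_values_by_key might work, the sketch below uses itertools.groupby, assuming the shuffle step has already sorted tab-separated "key<TAB>value" lines by key before the reducer runs (as Hadoop Streaming does); the line format is an assumption.

```python
from itertools import groupby

def group_values_by_key(lines):
    """Yield (key, value_iterator) pairs from key-sorted 'key\tvalue' lines."""
    pairs = (line.rstrip('\n').split('\t', 1) for line in lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, (int(value) for _, value in group)

# Intermediate pairs for 'a' and 'e' from the diagram, already grouped by key
sorted_lines = ['a\t4\n', 'a\t1\n', 'a\t1\n', 'e\t1\n', 'e\t3\n', 'e\t1\n']
result = {key: sum(values) for key, values in group_values_by_key(sorted_lines)}
print(result)  # {'a': 6, 'e': 5}
```

Because groupby only groups consecutive equal keys, this relies on the slides' guarantee that all pairs with a given key are consecutive.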

What the MapReduce Framework Provides

Fault tolerance: A machine or hard drive might crash
• The MapReduce framework automatically re-runs failed tasks

Speed: Some machine might be slow because it's overloaded
• The framework can run multiple copies of a task and keep the result of the one that finishes first

Network locality: Data transfer is expensive
• The framework tries to schedule map tasks on the machines that hold the data to be processed

Monitoring: Will my job finish before dinner?!?
• The framework provides a web-based interface describing jobs
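The "run multiple copies and keep the first result" strategy (often called speculative execution) can be sketched with Python threads standing in for cluster machines. The task body and delays below are invented for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait
import time

def run_task(delay):
    # Simulate the same task on a fast machine vs. an overloaded one
    time.sleep(delay)
    return 'task result'

with ThreadPoolExecutor(max_workers=2) as pool:
    copies = [pool.submit(run_task, d) for d in (0.01, 0.5)]
    done, not_done = wait(copies, return_when=FIRST_COMPLETED)
    result = next(iter(done)).result()  # keep whichever copy finished first
    for slow_copy in not_done:
        slow_copy.cancel()              # discard the slower duplicate

print(result)  # task result
```

Discarding duplicates is safe only because tasks are pure and deterministic: every copy would have produced the same answer.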