Übung Datenbanksysteme II: Web-Scale Data Management. Leon Bornemann. Slides based on Maximilian Jenders, Thorsten Papenbrock.


MapReduce: Introduction

• MapReduce …
  ◦ is a paradigm derived from functional programming.
  ◦ is implemented as a framework.
  ◦ operates primarily data-parallel (not task-parallel).
  ◦ scales out across multiple nodes of a cluster.
  ◦ uses the Hadoop Distributed File System (HDFS).
  ◦ is designed for Big Data analytics:
    ▪ log files
    ▪ weather statistics
    ▪ sensor data
    ▪ …
• "Competitors":

Leon Bornemann | Übung Datenbanksysteme II – WSDM


MapReduce: Introduction

• Who is using Hadoop?
  ◦ Yahoo!: Biggest cluster: 2000 nodes, used to support research for Ad Systems and Web Search.
  ◦ Amazon: Process millions of sessions daily for analytics, using both the Java and streaming APIs. Clusters vary from 1 to 100 nodes.
  ◦ Facebook: Use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics. 600 machine cluster.
  ◦ ...

Source: http://wiki.apache.org/hadoop/PoweredBy


MapReduce: Introduction

[Figures not reproduced. Sources:]
• http://www.josemalvarez.es/web/2013/04/10/mapreduce-design-patterns/
• http://dme.rwth-aachen.de/de/research/projects/mapreduce
• http://mohamednabeel.blogspot.de/2011/03/starting-sub-sandwitch-business.html

MapReduce: Phases

• map-task:
  ◦ record reader
  ◦ mapper
  ◦ combiner
  ◦ partitioner
• reduce-task:
  ◦ shuffle and sort
  ◦ reducer
  ◦ output formatter

MapReduce: Phases (record reader)

• Input: input data (row/split/item)
• Output: (key, value) pairs
  ◦ "key" is usually positional information
  ◦ "value" represents a raw data record
• Translates a given input into records
• Parses the data into records, but does not parse the records themselves
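A minimal Python sketch of a record reader for line-oriented input. It assumes that the key is the byte offset of a line within the split and that the value is the unparsed line; the function name `read_records` and the sample data are hypothetical.

```python
from typing import Iterator, Tuple


def read_records(split: bytes) -> Iterator[Tuple[int, str]]:
    """Yield (key, value) pairs: key = byte offset, value = raw line."""
    offset = 0
    for raw_line in split.splitlines(keepends=True):
        # The record is handed over unparsed; interpreting it is the mapper's job.
        yield offset, raw_line.decode("utf-8").rstrip("\r\n")
        offset += len(raw_line)


# Made-up sample split with two comma-separated rows.
records = list(read_records(b"Golf,VW,red\nA4,Audi,blue\n"))
```

The reader only finds record boundaries. Splitting a row into its columns is left to the mapper.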

MapReduce: Phases (mapper)

• Input: (key, value) pair from the record reader
• Output: (key', value') pairs
  ◦ key' is a problem-specific key, e.g. the word for the word-count task
  ◦ value' is a problem-specific value, e.g. "1" for the occurrence of a word
• Executes user-defined code that starts solving the given task
• Defines the grouping of the data
• A single mapper can emit multiple (key', value') output pairs for a single (key, value) input pair; in practice this is often called "flatMap"
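As an illustration of the word-count task mentioned above, a mapper could look like the following sketch. The function name `word_count_mapper` is hypothetical, and whitespace tokenization with lowercasing is an assumption, not something the slides prescribe.

```python
from typing import Iterator, Tuple


def word_count_mapper(key: int, value: str) -> Iterator[Tuple[str, int]]:
    """Emit one (word, 1) pair per word of the input line."""
    # The positional key is ignored; only the raw record is needed.
    for word in value.lower().split():
        yield word, 1


# A single input pair produces several output pairs ("flatMap" behaviour).
pairs = list(word_count_mapper(0, "to be or not to be"))
```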

MapReduce: Phases (combiner)

• Executes user-defined code that merges a set of values
• Pre-aggregates values to reduce network traffic
• Is an optional, localized reducer
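A small sketch of what a combiner might do for word count, assuming the values are partial counts that can safely be summed in any order. The function name `combine` and the sample pairs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def combine(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Sum the values per key on the map side, before data is sent to reducers."""
    partial: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return sorted(partial.items())


mapper_output = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
combined = combine(mapper_output)
```

Six pairs shrink to four here. This only works because addition is associative and commutative, so the reducer computes the same total from the partial sums.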

MapReduce: Phases (partitioner)

• Output: the number of the reducer that should handle this key/value pair; the reducer might be located on another compute node
• Distributes the keyspace randomly to the reducers
• Calculates the reducer by e.g. hash(key) mod (number of reducers)
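A hash-based partitioner can be sketched as follows. The function name `partition` is hypothetical, and the choice of CRC32 as hash function is an assumption made only so that the result is stable across Python processes.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Return the reducer number in [0, num_reducers) for the given key."""
    # crc32 is used instead of the built-in hash(), which is salted per process for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# Every occurrence of the same key is sent to the same reducer.
assignment = {word: partition(word, 3) for word in ["to", "be", "or", "not"]}
```

Because the reducer number depends only on the key, all values of one key meet at the same reducer, no matter which map task emitted them.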

MapReduce: Phases (shuffle and sort)

• Downloads the partitioned key/value data to the local machines that run the corresponding reducers
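The following sketch simulates this step in a single process. It assumes that the data fetched from the map tasks is sorted by key and that the values of equal keys are grouped before they reach the reducer; the function name `shuffle_and_sort` and the sample data are hypothetical.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterable, List, Tuple


def shuffle_and_sort(
    map_outputs: Iterable[List[Tuple[str, int]]]
) -> List[Tuple[str, List[int]]]:
    """Merge the outputs fetched from several map tasks and group the values by key."""
    fetched = [pair for output in map_outputs for pair in output]
    fetched.sort(key=itemgetter(0))  # stable sort by key only
    return [
        (key, [value for _, value in group])
        for key, group in groupby(fetched, key=itemgetter(0))
    ]


# Two map tasks delivered data for this reducer.
grouped = shuffle_and_sort([[("be", 2), ("to", 2)], [("be", 1), ("or", 1)]])
```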

MapReduce: Phases (reducer)

• Output: (key, result) pair, where "result" is the solution/answer for the given "key"
• Calculates the final solution/answer to the problem statement for the given key
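For word count, a reducer could be as small as this sketch; the function name `word_count_reducer` is hypothetical.

```python
from typing import Iterable, Tuple


def word_count_reducer(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Return the final count for one word from all of its partial counts."""
    return key, sum(values)


result = word_count_reducer("be", [2, 1])
```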

MapReduce: Phases (output formatter)

• Writes the key/result pairs to disk
• Formats the final result and writes it record-wise to disk
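A record-wise writer might look like the sketch below. The tab-separated line format and the function name `write_output` are assumptions, and an in-memory buffer stands in for the file on disk.

```python
import io
from typing import Iterable, TextIO, Tuple


def write_output(results: Iterable[Tuple[str, int]], out: TextIO) -> None:
    """Write one tab-separated key/result record per line."""
    for key, result in results:
        out.write(f"{key}\t{result}\n")


buffer = io.StringIO()  # stands in for a file on disk
write_output([("be", 3), ("to", 2)], buffer)
written = buffer.getvalue()
```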

MapReduce: Phases (summary)

• mapper, reducer: basic building blocks with user-defined code
• partitioner: helpful to build a sorting algorithm
• combiner: useful to increase the performance
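To see how the phases fit together, the following single-process sketch runs a word count through all of them. It is a simulation under simplifying assumptions and not a use of the Hadoop API: the function name `run_word_count` is hypothetical, the mapper and combiner are fused into one loop for brevity, and CRC32 is an arbitrary hash choice.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def run_word_count(splits: List[str], num_reducers: int = 2) -> Dict[str, int]:
    # One bucket per reducer; the list index is the reducer number.
    buckets: List[Dict[str, List[int]]] = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:                        # one map task per split
        partial: Dict[str, int] = defaultdict(int)
        for line in split.splitlines():         # record reader
            for word in line.lower().split():   # mapper emits (word, 1)
                partial[word] += 1              # combiner pre-aggregates locally
        for word, count in partial.items():     # partitioner picks the reducer
            reducer_no = zlib.crc32(word.encode("utf-8")) % num_reducers
            buckets[reducer_no][word].append(count)
    output: Dict[str, int] = {}
    for bucket in buckets:                      # one reduce task per bucket
        for word in sorted(bucket):             # shuffle and sort: grouped, ordered by key
            output[word] = sum(bucket[word])    # reducer
    return output


counts = run_word_count(["to be or not", "to be"])
```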

MapReduce: Example 1: Distinct

• Input: a relational table instance Car(name, vendor, color, speed, price)
• Output: a distinct list of all vendors

[Diagram of the key/value pairs passed between the phases not reproduced.]
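One way the Distinct task could be expressed is sketched below. It is not necessarily the solution shown on the slide: it assumes that the mapper uses the vendor as key with an empty value and that the reducer outputs each key once. All function names and the sample rows are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Car = Tuple[str, str, str, int, int]  # (name, vendor, color, speed, price)


def distinct_mapper(car: Car) -> Iterator[Tuple[str, None]]:
    yield car[1], None  # the vendor becomes the key, no value is needed


def distinct_reducer(vendor: str, values: Iterable[None]) -> str:
    return vendor  # each key reaches the reducer exactly once


def distinct_vendors(cars: List[Car]) -> List[str]:
    grouped: Dict[str, List[None]] = {}
    for car in cars:
        for key, value in distinct_mapper(car):
            grouped.setdefault(key, []).append(value)  # stands in for shuffle and sort
    return [distinct_reducer(key, grouped[key]) for key in sorted(grouped)]


# Made-up sample rows.
cars = [
    ("Golf", "VW", "red", 180, 20000),
    ("Polo", "VW", "blue", 170, 15000),
    ("A4", "Audi", "black", 220, 35000),
]
vendors = distinct_vendors(cars)
```

Duplicates disappear because grouping by key collapses all rows of one vendor into a single reducer call.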

MapReduce: Example 2: Index-Generation

• Input: a relational table instance Car(name, vendor, color, speed, price)
• Output: an index on Car.vendor

[Diagram of the key/value pairs passed between the phases not reproduced.]
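A possible sketch of index generation, assuming the index maps each vendor to the positions of the rows that contain it and that the row position is the key delivered by the record reader. The function names and sample rows are hypothetical, and the slide's own solution may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Car = Tuple[str, str, str, int, int]  # (name, vendor, color, speed, price)


def index_mapper(row_id: int, car: Car) -> Tuple[str, int]:
    """Emit (vendor, row position) for one row."""
    return car[1], row_id


def build_vendor_index(cars: List[Car]) -> Dict[str, List[int]]:
    grouped: Dict[str, List[int]] = defaultdict(list)
    for row_id, car in enumerate(cars):
        vendor, position = index_mapper(row_id, car)
        grouped[vendor].append(position)  # stands in for shuffle and sort
    # Reducer: output the sorted position list per vendor.
    return {vendor: sorted(positions) for vendor, positions in grouped.items()}


# Made-up sample rows.
cars = [
    ("Golf", "VW", "red", 180, 20000),
    ("Polo", "VW", "blue", 170, 15000),
    ("A4", "Audi", "black", 220, 35000),
]
index = build_vendor_index(cars)
```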

MapReduce: Example 3: Join

• Input: two relational table instances
  ◦ Car(name, vendor, color, speed, price)
  ◦ Plane(id, weight, length, speed, seats)
• Output: all pairs of cars and planes with the same speed

[Diagram of the key/value pairs passed between the phases not reproduced.]
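The join could be approached as a reduce-side join, sketched below in a single process. This is an assumption about the technique, not a reconstruction of the slide's diagram: the map phase uses the speed as key and tags each record with its table, and the reduce phase pairs every car with every plane of the same speed. The function name and sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List, Tuple

Car = Tuple[str, str, str, int, int]    # (name, vendor, color, speed, price)
Plane = Tuple[int, int, int, int, int]  # (id, weight, length, speed, seats)


def join_on_speed(cars: List[Car], planes: List[Plane]) -> List[Tuple[Car, Plane]]:
    # Map phase: key = speed, value = (table tag, record).
    grouped: Dict[int, List[Tuple[str, tuple]]] = defaultdict(list)
    for car in cars:
        grouped[car[3]].append(("car", car))
    for plane in planes:
        grouped[plane[3]].append(("plane", plane))
    # Reduce phase: per speed, pair every car with every plane.
    pairs: List[Tuple[Car, Plane]] = []
    for speed in sorted(grouped):
        same_speed_cars = [rec for tag, rec in grouped[speed] if tag == "car"]
        same_speed_planes = [rec for tag, rec in grouped[speed] if tag == "plane"]
        pairs.extend(product(same_speed_cars, same_speed_planes))
    return pairs


# Made-up sample rows.
cars = [("Golf", "VW", "red", 180, 20000), ("Veyron", "Bugatti", "blue", 400, 1500000)]
planes = [(1, 500, 8, 180, 2), (2, 70000, 40, 900, 180)]
joined = join_on_speed(cars, planes)
```

The table tag is needed because records from both tables arrive in the same group, and the reducer has to tell them apart before building the pairs.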

