Übung Datenbanksysteme II: Web-Scale Data Management. Leon Bornemann. Slides based on Maximilian Jenders and Thorsten Papenbrock.



MapReduce:

Introduction

• MapReduce …
  • is a paradigm derived from functional programming.
  • is implemented as a framework.
  • operates primarily data-parallel (not task-parallel).
  • scales out across multiple nodes of a cluster.
  • uses the Hadoop Distributed File System (HDFS).
  • is designed for Big Data analytics:
    • log files
    • weather statistics
    • sensor data
    • …
• "Competitors": [figure not preserved]

Leon Bornemann | Übung Datenbanksysteme II – WSDM


MapReduce:

Introduction

• Who is using Hadoop?
  • Yahoo!
    • Biggest cluster: 2000 nodes, used to support research for Ad Systems and Web Search.
  • Amazon
    • Processes millions of sessions daily for analytics, using both the Java and streaming APIs. Clusters vary from 1 to 100 nodes.
  • Facebook
    • Uses Hadoop to store copies of internal log and dimension data sources and as a source for reporting/analytics. 600-machine cluster.
  • ...

Source: http://wiki.apache.org/hadoop/PoweredBy


MapReduce:

Introduction

[Figure not preserved. Source: http://www.josemalvarez.es/web/2013/04/10/mapreduce-design-patterns/]

MapReduce:

Introduction

[Figure not preserved. Source: http://dme.rwth-aachen.de/de/research/projects/mapreduce]

MapReduce:

Introduction

[Figure not preserved. Source: http://mohamednabeel.blogspot.de/2011/03/starting-sub-sandwitch-business.html]

MapReduce:

Phases

• map-task:
  • record reader
  • mapper
  • combiner
  • partitioner
• reduce-task:
  • shuffle and sort
  • reducer
  • output formatter

MapReduce:

Phases

Phase: record reader (map-task)

• Input: the raw input (row/split/item)
• Output: a key/value pair
  • The key is usually positional information.
  • The value represents a raw data record.
• Translates a given input into records
• Parses the data into records, but does not parse the records themselves
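The record reader's contract can be sketched in a few lines of Python. This is only an illustration, not Hadoop's API: the function name `read_records`, the use of the character offset as key, and the sample lines are all assumptions made for the example.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_records(split: TextIO) -> Iterator[Tuple[int, str]]:
    """Yield (offset, line) pairs; the line itself is left unparsed."""
    offset = 0  # character offset of the current line (assumed key)
    for line in split:
        yield offset, line.rstrip("\n")
        offset += len(line)


# Hypothetical input split with two raw Car records.
split = io.StringIO("Golf,VW,red,180,20000\nA4,Audi,blue,220,35000\n")
records = list(read_records(split))
print(records)
```

The key carries only positional information, and the value is still a raw string: splitting it into columns is left to the mapper.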

MapReduce:

Phases

Phase: mapper (map-task)

• Input: the key/value pair produced by the record reader
• Output: <key,value>
  • "key" is a problem-specific key, e.g. the word in the word-count task
  • "value" is a problem-specific value, e.g. "1" for the occurrence of a word
• Executes user-defined code that starts solving the given task
• Defines the grouping of the data
• A single mapper can emit multiple <key,value> output pairs for a single input pair. In practice, this is often called "flatmap".
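For the word-count task mentioned on this slide, a mapper could look roughly like the following Python generator. The name `word_count_mapper` and the whitespace-based tokenization are assumptions for this sketch.

```python
from typing import Iterator, Tuple


def word_count_mapper(key: int, record: str) -> Iterator[Tuple[str, int]]:
    """Emit one (word, 1) pair per word; the positional input key is ignored."""
    for word in record.lower().split():
        yield word, 1


# One input pair produces several output pairs ("flatmap" behaviour).
pairs = list(word_count_mapper(0, "to be or not to be"))
print(pairs)
```

Because the emitted key is the word, all occurrences of the same word end up in the same group later on, which is what "defines the grouping of the data" refers to.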

MapReduce:

Phases

Phase: combiner (map-task)

• Input: <key,values>
• Output: <key,value>
  • "key" is a problem-specific key, e.g. the word in the word-count task
  • "value" is a problem-specific value, e.g. "1" for the occurrence of a word
• Executes user-defined code that merges a set of values
• Pre-aggregates values to reduce network traffic
• Is an optional, localized reducer
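A combiner for word count might be sketched as below. The function name `combine` and the sample mapper output are assumptions; the sketch only illustrates local pre-aggregation on one map node.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def combine(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Sum the counts per key locally, before anything is sent over the network."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


# Assumed output of one mapper for the line "to be or not to be".
mapper_output = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
combined = combine(mapper_output)
print(combined)
```

Six pairs shrink to four here. This is safe for word count because addition is associative and commutative, so summing partial sums gives the same final result as summing the original values.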

MapReduce:

Phases

Phase: partitioner (map-task)

• Input: <key,value>
• Output: <reducer,key,value>
  • "reducer" is the number of the reducer that should handle this key/value pair; reducers might be located on other compute nodes
• Distributes the key space (pseudo-)randomly across the reducers; pairs with the same key always go to the same reducer
• Calculates the reducer by e.g. hashing the key modulo the number of reducers
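A hash-based partitioner can be sketched as follows. The function name `partition` and the choice of CRC32 are assumptions for this example. CRC32 is used instead of Python's built-in `hash()` because the latter is salted per process for strings, so different nodes would disagree on the reducer number.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Map a key to a reducer number in the range [0, num_reducers)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# Equal keys always receive the same reducer number.
assignments = {key: partition(key, 3) for key in ["VW", "Audi", "BMW", "VW"]}
print(assignments)
```

The concrete reducer numbers depend on the hash function, but every `<key,value>` pair with the same key is guaranteed to reach the same reducer.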

MapReduce:

Phases

Phase: shuffle and sort (reduce-task)

• Input: <reducer,key,value>
• Output: <reducer,key,value>
• Downloads the <key,value> data to the local machines that run the corresponding reducers
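The effect of this phase can be simulated in memory. The sketch below assumes that the sort step groups the values by key in key order, which is how the reducer later receives them; the function name `shuffle_and_sort` and the sample triples are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def shuffle_and_sort(
    triples: Iterable[Tuple[int, str, int]]
) -> Dict[int, List[Tuple[str, List[int]]]]:
    """Collect the pairs per reducer and group them by key, in key order."""
    per_reducer = defaultdict(lambda: defaultdict(list))
    for reducer, key, value in triples:
        per_reducer[reducer][key].append(value)
    return {r: sorted(groups.items()) for r, groups in per_reducer.items()}


# Hypothetical partitioner output coming from two different map nodes.
triples = [(0, "to", 2), (1, "be", 2), (0, "or", 1), (0, "to", 1), (1, "be", 1)]
grouped = shuffle_and_sort(triples)
print(grouped)
```

In a real cluster the "collect" step is a network transfer from the map nodes to the reduce nodes; the dictionary here merely stands in for that.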

MapReduce:

Phases

Phase: reducer (reduce-task)

• Input: <key,values>
• Output: <key,result>
  • "result" is the solution/answer for the given "key"
• Executes user-defined code that merges a set of values
• Calculates the final solution/answer to the problem statement for the given key
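For word count, the reducer simply sums all values of a key. The name `word_count_reducer` and the sample groups are assumptions for this sketch.

```python
from typing import Iterable, Tuple


def word_count_reducer(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Merge all values of one key into the final result for that key."""
    return key, sum(values)


# Assumed reducer input: keys with their (pre-aggregated) counts.
groups = [("be", [2, 1]), ("to", [2, 1])]
results = [word_count_reducer(key, values) for key, values in groups]
print(results)
```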

MapReduce:

Phases

Phase: output formatter (reduce-task)

• Input: <key,result>
• Output: <key,result>
• Writes the key/result pairs to disk
• Formats the final result and writes it record-wise to disk
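A record-wise writer might look like the sketch below. The tab-separated line format and the name `write_output` are assumptions for this example, and an in-memory buffer stands in for the file on disk.

```python
import io
from typing import Iterable, TextIO, Tuple


def write_output(results: Iterable[Tuple[str, int]], out: TextIO) -> None:
    """Write one 'key<TAB>result' line per record."""
    for key, result in results:
        out.write(f"{key}\t{result}\n")


buffer = io.StringIO()  # stands in for an output file
write_output([("be", 3), ("to", 3)], buffer)
text = buffer.getvalue()
print(text, end="")
```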

MapReduce:

Phases

• map-task:
  • record reader
  • mapper
  • combiner
  • partitioner
• reduce-task:
  • shuffle and sort
  • reducer
  • output formatter

Callouts on the phases:
• basic building blocks with user-defined code
• helpful to build a sorting algorithm
• useful to increase the performance

MapReduce:

Example 1: Distinct

• Input: a relational table instance Car(name, vendor, color, speed, price)
• Output: a distinct list of all vendors

map(key, record) {
    emit(<record.vendor, null>)
}

reduce(<key,values>) {
    emit(key)
}
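The Distinct example can be tried out with a small in-memory simulation. Everything except the Car schema is an assumption here: the sample rows, the function names, and the tiny `run` driver that stands in for the framework (grouping by key replaces partitioning and shuffling).

```python
from collections import defaultdict

# Hypothetical Car(name, vendor, color, speed, price) rows, keyed by row number.
CARS = [
    (0, ("Golf", "VW", "red", 180, 20000)),
    (1, ("A4", "Audi", "blue", 220, 35000)),
    (2, ("Polo", "VW", "black", 170, 15000)),
]


def distinct_map(key, record):
    name, vendor, color, speed, price = record
    yield vendor, None  # the value is irrelevant; only the key matters


def distinct_reduce(key, values):
    yield key  # one output per distinct key


def run(records, mapper, reducer):
    """Minimal single-process stand-in for the MapReduce framework."""
    groups = defaultdict(list)
    for key, record in records:
        for out_key, out_value in mapper(key, record):
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):
        output.extend(reducer(out_key, groups[out_key]))
    return output


vendors = run(CARS, distinct_map, distinct_reduce)
print(vendors)
```

The deduplication is done entirely by the grouping step: "VW" is emitted twice by the mapper but reaches the reducer as a single key.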

MapReduce:

Example 2: Index-Generation

• Input: a relational table instance Car(name, vendor, color, speed, price)
• Output: an index on Car.vendor

map(key, record) {
    emit(<record.vendor, key>)
}

reduce(<key,values>) {
    [one statement not recoverable from the source]
    emit(<key, values>)
}
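An index on Car.vendor can be simulated in the same spirit. The sample rows, the function names, and the `run` driver are assumptions; sorting the record keys inside the reducer is also an assumption of this sketch, not something the slide confirms.

```python
from collections import defaultdict

# Hypothetical Car(name, vendor, color, speed, price) rows, keyed by row number.
CARS = [
    (0, ("Golf", "VW", "red", 180, 20000)),
    (1, ("A4", "Audi", "blue", 220, 35000)),
    (2, ("Polo", "VW", "black", 170, 15000)),
]


def index_map(key, record):
    vendor = record[1]
    yield vendor, key  # remember where the record with this vendor lives


def index_reduce(key, values):
    yield key, sorted(values)  # sorting is an assumption of this sketch


def run(records, mapper, reducer):
    """Minimal single-process stand-in for the MapReduce framework."""
    groups = defaultdict(list)
    for key, record in records:
        for out_key, out_value in mapper(key, record):
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):
        output.extend(reducer(out_key, groups[out_key]))
    return output


index = run(CARS, index_map, index_reduce)
print(index)
```

Each index entry maps a vendor to the keys of all records with that vendor, so a lookup by vendor no longer needs a full table scan.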

MapReduce:

Example 3: Join

• Input: two relational table instances
  • Car(name, vendor, color, speed, price)
  • Plane(id, weight, length, speed, seats)
• Output: all pairs of cars and planes with the same speed

MapReduce:

Example 3: Join

Car(name, vendor, color, speed, price)
Plane(id, weight, length, speed, seats)

map(key, record) {
    emit(<record.speed,
          {"table" = table of the record,
           "record" = record}>)
}

reduce(<key,values>) {
    cars = values.where("table", "car")
    planes = values.where("table", "plane")
    for each car in cars
        for each plane in planes
            emit(<car, plane>)
}
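The reduce-side join on speed can be simulated as follows. The sample rows, the function names, and the `run_join` driver are assumptions for illustration. As a simplification, the table name is passed to the mapper in place of the positional key, since the sketch has no other way to tell which table a record came from.

```python
from collections import defaultdict

# Hypothetical rows; speed is the 4th column in both schemas.
CARS = [("Golf", "VW", "red", 180, 20000), ("A4", "Audi", "blue", 220, 35000)]
PLANES = [(1, 600, 8, 220, 2), (2, 70000, 40, 900, 180), (3, 500, 7, 180, 2)]


def join_map(table, record):
    speed = record[3]
    yield speed, {"table": table, "record": record}  # tag record with its origin


def join_reduce(key, values):
    cars = [v["record"] for v in values if v["table"] == "car"]
    planes = [v["record"] for v in values if v["table"] == "plane"]
    for car in cars:  # all combinations that share this speed
        for plane in planes:
            yield car, plane


def run_join(cars, planes):
    """Minimal single-process stand-in for the MapReduce framework."""
    tagged = [("car", r) for r in cars] + [("plane", r) for r in planes]
    groups = defaultdict(list)
    for table, record in tagged:
        for speed, value in join_map(table, record):
            groups[speed].append(value)
    pairs = []
    for speed in sorted(groups):
        pairs.extend(join_reduce(speed, groups[speed]))
    return pairs


pairs = run_join(CARS, PLANES)
print([(car[0], plane[0]) for car, plane in pairs])
```

The plane with speed 900 has no matching car, so its group yields no pair. Using the join attribute as the map output key is what brings matching records from both tables to the same reducer.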
