Upload
others
View
17
Download
0
Embed Size (px)
Citation preview
Übung Datenbanksysteme II
Web-Scale Data Management
Leon Bornemann
Folien basierend auf
Maximilian Jenders,
Thorsten Papenbrock
MapReduce:
Introduction
� MapReduce …
� is a paradigm derived from functional programming.
� is implemented as framework.
� operates primarily data-parallel (not task-parallel).
� scales-out on multiple nodes of a cluster.
� uses the Hadoop distributed filesystem.
� is designed for Big Data Analytics:
� Log-files
� Weather-statistics
� Sensor-data
� …
� “Competitors“:
Leon Bornemann | Übung Datenbanksysteme II – WSDM
2
MapReduce:
Introduction
� Who is using Hadoop?
� Yahoo!
� Biggest cluster: 2000 nodes, used to support research for Ad Systems and Web Search.
� Amazon
� Process millions of sessions daily for analytics, using both the Java and streaming APIs. Clusters vary from 1 to 100
nodes.
� Use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics. 600 machine cluster.
� ...
http://wiki.apache.org/hadoop/PoweredBy
Leon Bornemann | Übung Datenbanksysteme II – WSDM
3
MapReduce:
Introduction
Leon Bornemann | Übung Datenbanksysteme II – WSDM
4
http://www.josemalvarez.es/web/2013/04/10/mapreduce-design-patterns/
MapReduce:
Introduction
5
http://dme.rwth-aachen.de/de/research/projects/mapreduceLeon Bornemann | Übung Datenbanksysteme II – WSDM
MapReduce:
Introduction
6
http://mohamednabeel.blogspot.de/2011/03/starting-sub-sandwitch-business.htmlLeon Bornemann | Übung Datenbanksysteme II – WSDM
MapReduce:
Phases
Leon Bornemann | Übung Datenbanksysteme II – WSDM
7
� map-task:
� record reader
� mapper
� combiner
� partitioner
� reduce-task:
� shuffle and sort
� reducer
� output formater
MapReduce:
Phases
Leon Bornemann | Übung Datenbanksysteme II – WSDM
8
� map-task:
� record reader
� mapper
� combiner
� partitioner
� reduce-task:
� shuffle and sort
� reducer
� output formater
� Input: ����� ������(row/split/item)� Output: ������� ���
� “��“ is usually positional information� “��� ��“ represents a raw data record
� Translates a given input into records� Parses data into records but not the
records itself
� Input: ����� ������(row/split/item)� Output: ������� ���
� “��“ is usually positional information� “��� ��“ represents a raw data record
� Translates a given input into records� Parses data into records but not the
records itself
MapReduce:
Phases
Leon Bornemann | Übung Datenbanksysteme II – WSDM
9
� map-task:
� record reader
� mapper
� combiner
� partitioner
� reduce-task:
� shuffle and sort
� reducer
� output formater
� Input: ������� ���� Output: �����������
� “���“ is a problem-specific key� e.g. the word for the word-count-task
� “�����“ is a problem-specific value� e.g. “1“ for the occurence of a word
� Executes user defined code that startssolving the given task
� Defines the grouping of the data
� A single mapper can emit multiple �����������output pairs for a single������� ���input pair
� Input: ������� ���� Output: �����������
� “���“ is a problem-specific key� e.g. the word for the word-count-task
� “�����“ is a problem-specific value� e.g. “1“ for the occurence of a word
� Executes user defined code that startssolving the given task
� Defines the grouping of the data
� A single mapper can emit multiple �����������output pairs for a single������� ���input pairIn der Praxis dann oft
„flatmap“ genanntIn der Praxis dann oft „flatmap“ genannt
MapReduce:
Phases
Leon Bornemann | Übung Datenbanksysteme II – WSDM
10
� map-task:
� record reader
� mapper
� combiner
� partitioner
� reduce-task:
� shuffle and sort
� reducer
� output formater
� Input: ������������� Output: �����������
� “���“ is a problem-specific key� e.g. the word for the word-count-task
� “�����“ is a problem-specific value� e.g. “1“ for the occurence of a word
� Executes user defined code that mergesa set of values
� Pre-aggregates values to reduce networktraffic
� Is an optional, localized reducer
� Input: ������������� Output: �����������
� “���“ is a problem-specific key� e.g. the word for the word-count-task
� “�����“ is a problem-specific value� e.g. “1“ for the occurence of a word
� Executes user defined code that mergesa set of values
� Pre-aggregates values to reduce networktraffic
� Is an optional, localized reducer
MapReduce:
Phases
Leon Bornemann | Übung Datenbanksysteme II – WSDM
11
� map-task:
� record reader
� mapper
� combiner
� partitioner
� reduce-task:
� shuffle and sort
� reducer
� output formater
� Input: ������������ Output: �������������������
� “�������“ is the reducer number that shouldhandle this key/value pair; reducer mightbe located on other compute nodes
� Distributes the keyspace randomly to thereducers
� Calculates the reducer by e.g.��������� ������������ � ���������
� Input: ������������ Output: �������������������
� “�������“ is the reducer number that shouldhandle this key/value pair; reducer mightbe located on other compute nodes
� Distributes the keyspace randomly to thereducers
� Calculates the reducer by e.g.��������� ������������ � ���������
MapReduce:
Phases
Leon Bornemann | Übung Datenbanksysteme II – WSDM
12
� map-task:
� record reader
� mapper
� combiner
� partitioner
� reduce-task:
� shuffle and sort
� reducer
� output formater
� Input: �������������������� Output: �������������������
� Downloads the �����������data to thelocal machines that run the correspondingreducers
� Input: �������������������� Output: �������������������
� Downloads the �����������data to thelocal machines that run the correspondingreducers
MapReduce:
Phases
Leon Bornemann | Übung Datenbanksysteme II – WSDM
13
� map-task:
� record reader
� mapper
� combiner
� partitioner
� reduce-task:
� shuffle and sort
� reducer
� output formater
� Input: ������������� Output: ������������
� “������“ is the solution/answer for thegiven “���“
� Executes user defined code that mergesa set of values
� Calculates the final solution/answer to theproblem statement for the given key
� Input: ������������� Output: ������������
� “������“ is the solution/answer for thegiven “���“
� Executes user defined code that mergesa set of values
� Calculates the final solution/answer to theproblem statement for the given key
MapReduce:
Phases
Leon Bornemann | Übung Datenbanksysteme II – WSDM
14
� map-task:
� record reader
� mapper
� combiner
� partitioner
� reduce-task:
� shuffle and sort
� reducer
� output formater
� Input: ������������� Output: ������������
� Writes the key/result pairs to disk� Formates the final result and writes it
record-wise to disk
� Input: ������������� Output: ������������
� Writes the key/result pairs to disk� Formates the final result and writes it
record-wise to disk
MapReduce:
Phases
Leon Bornemann | Übung Datenbanksysteme II – WSDM
15
� map-task:
� record reader
� mapper
� combiner
� partitioner
� reduce-task:
� shuffle and sort
� reducer
� output formater
basic building blockswith user defined codebasic building blocks
with user defined code
helpful to build asorting algorithmhelpful to build asorting algorithm
useful to increasethe performanceuseful to increasethe performance
MapReduce:
Example 1: Distinct
Leon Bornemann | Übung Datenbanksysteme II – WSDM
16
� map-task:
� record reader
� mapper
� combiner
� partitioner
� reduce-task:
� shuffle and sort
� reducer
� output formater
� Input:� A relational table instance
Car(name, vendor, color, speed, price)� Output:
� A distinct list of all vendors
��� ������� ����
���� ���� ������� �������
!
������ ������������
"���� ����
!
� Input:� A relational table instance
Car(name, vendor, color, speed, price)� Output:
� A distinct list of all vendors
��� ������� ����
���� ���� ������� �������
!
������ ������������
"���� ����
!
MapReduce:
Example 2: Index-Generation
Leon Bornemann | Übung Datenbanksysteme II – WSDM
17
� map-task:
� record reader
� mapper
� combiner
� partitioner
� reduce-task:
� shuffle and sort
� reducer
� output formater
� Input:� A relational table instance
Car(name, vendor, color, speed, price)� Output:
� An index on Car.vendor
��� ������� ����
���� ���� ������� �����
!
������ ������������
#����$���� %� ������������
"���� ���������
!
� Input:� A relational table instance
Car(name, vendor, color, speed, price)� Output:
� An index on Car.vendor
��� ������� ����
���� ���� ������� �����
!
������ ������������
#����$���� %� ������������
"���� ���������
!
MapReduce:
Example 3: Join
Leon Bornemann | Übung Datenbanksysteme II – WSDM
18
� map-task:
� record reader
� mapper
� combiner
� partitioner
� reduce-task:
� shuffle and sort
� reducer
� output formater
� Input:� Two relational table instances
Car(name, vendor, color, speed, price)Plane(id, weight, length, speed, seats)
� Output:� All pairs of cars and planes with the
same speed
� Input:� Two relational table instances
Car(name, vendor, color, speed, price)Plane(id, weight, length, speed, seats)
� Output:� All pairs of cars and planes with the
same speed
MapReduce:
Example 3: Join
Leon Bornemann | Übung Datenbanksysteme II – WSDM
19
� map-task:
� record reader
� mapper
� combiner
� partitioner
� reduce-task:
� shuffle and sort
� reducer
� output formater
Car(name, vendor, color, speed, price)Plane(id, weight, length, speed, seats)
��� ������� ����
���� ���� ���������
�&�����&%���������� ����
&��� ��&%���� ��!�
!
������ ������������
���� %������'�����&�����&�&���&�
������%������'�����&�����&�&�����&�
� � ���� (�����
� � ������(�������
"���� �������� ������������ ���
!
Car(name, vendor, color, speed, price)Plane(id, weight, length, speed, seats)
��� ������� ����
���� ���� ���������
�&�����&%���������� ����
&��� ��&%���� ��!�
!
������ ������������
���� %������'�����&�����&�&���&�
������%������'�����&�����&�&�����&�
� � ���� (�����
� � ������(�������
"���� �������� ������������ ���
!