Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
HJ-‐Hadoop An Op,mized MapReduce
Run,me for Mul,-‐core Systems
Yunming Zhang Advised by: Prof. Alan Cox and Vivek Sarkar
Rice University
ACM Student Research CompeDDon SPLASH 13
1
Hadoop MapReduce RunDme
2
Job Starts
Map
…….
Reduce
Job Ends
Figure 1. Map Reduce Programming Model
Map
Map Reduce
Hadoop Map Reduce
• Open source implementaDon of Map Reduce RunDme system – Scalable – Reliable – Available
• Popular plaLorm for big data analyDcs 3
Kmeans
Slice 1
Reducer
Join Table
<Key, Val> pairs…
Full Lookup Table
Thread 1
Map: Look up a Key in Lookup Table
Full Lookup Table
Thread 1
Slice 2
Slices ….
Slice n
<Key, Val> pairs… Slice 1
Reducer
Join Table
<Key, Val> pairs…
Full Lookup Table
Thread 1
Map: Look up a Key in Lookup Table
Full Lookup Table
Thread 1
Slice 2
Slices ….
Slice n
<Key, Val> pairs…
Join Table
ComputaDon Memory
Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Look up key
Machine1
Duplicated Tables
Reducer1
Reducer1 Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Machine2
Machines…..
Map task in a JVM
Join Table
ComputaDon Memory
Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Look up key
Machine1
Duplicated Tables
Reducer1
Reducer1 Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Machine2
Machines…..
Map task in a JVM
Join Table
ComputaDon Memory
Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Look up key
Machine1
Duplicated Tables
Reducer1
Reducer1 Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Machine2
Machines…..
Map task in a JVM
Join Table
ComputaDon Memory
Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Look up key
Machine1
Duplicated Tables
Reducer1
Reducer1 Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Machine2
Machines…..
Map task in a JVM
Join Table
ComputaDon Memory
Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Look up key
Machine1
Duplicated Tables
Reducer1
Reducer1 Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Full Lookup Table 1x
Machine2
Machines…..
Map task in a JVM
4
Cluster Centroids
To be Classified Documents
Topics
To be Classified Documents
Kmeans is an application that takes as input a large number of documents and try to classify them into different topics
Kmeans using Hadoop
5
To be classified documents
Computa,on Memory
Machine1 Map task in a JVM
Duplicated In-‐memory Cluster Centroids
Cluster Centroids 1x
Cluster Centroids 1x
Cluster Centroids 1x
Cluster Centroids 1x
Machines …
To be classified documents Computation Memory
Machine 1 Map task in a JVM
Topics
Machines …
Kmeans using Hadoop
6
To be classified documents
Computa,on Memory
Machine1 Map task in a JVM
Duplicated In-‐memory Cluster Centroids
Cluster Centroids 1x
Cluster Centroids 1x
Cluster Centroids 1x
Cluster Centroids 1x
Machines …
To be classified documents Computation Memory
Machine 1 Map task in a JVM
Topics
Topics
Machines …
Kmeans using Hadoop
7
To be classified documents
Computa,on Memory
Machine1 Map task in a JVM
Duplicated In-‐memory Cluster Centroids
Cluster Centroids 1x
Cluster Centroids 1x
Cluster Centroids 1x
Cluster Centroids 1x
Machines …
To be classified documents Computation Memory
Machine 1 Map task in a JVM
Topics
Topics
Topics
Machines …
Kmeans using Hadoop
8
To be classified documents
Computa,on Memory
Machine1 Map task in a JVM
Duplicated In-‐memory Cluster Centroids
Cluster Centroids 1x
Cluster Centroids 1x
Cluster Centroids 1x
Cluster Centroids 1x
Machines …
To be classified documents Computation Memory
Machine 1 Map task in a JVM
Topics
Topics
Topics
Topics
Machines …
Kmeans using Hadoop
9
To be classified documents
Computa,on Memory
Machine1 Map task in a JVM
Duplicated In-‐memory Cluster Centroids
Cluster Centroids 1x
Cluster Centroids 1x
Cluster Centroids 1x
Cluster Centroids 1x
Machines …
To be classified documents Computation Memory
Machine 1 Map task in a JVM
Duplicated In-memory Cluster Centroids
Topics 1x
Topics 1x
Topics 1x
Topics 1x
Machines …
Memory Wall
0
20
40
60
80
100
120
140
160
180
200
0 50 100 150 200 250 300 350 400
Num
ber o
f top
ics/ Tim
e in m
in
Topics data size (MB) with 4KB/topic
KMeans Throughput Benchmark
Hadoop
10
We used 8 mappers from 30 -80 MB, 4 mappers for 100 – 150 MB, 2 mappers for 180 – 380 for sequential Hadoop.
cluster size(MB)
Hadoop Full GC calls
30 3,542 50 4,390 70 5,186 80 1,108,888
Memory Wall
• Hadoop’s approach to the problem – Increase the memory available to each Map Task JVM by reducing the number of map tasks assigned to each machine.
11
Kmeans using Hadoop
12
To be classified documents
Computa,on Memory
Machine1 Map task in a JVM
Duplicated In-‐memory Cluster Centroids
Cluster Centroids 1x
Cluster Centroids 1x
Cluster Centroids 1x
Cluster Centroids 1x
Machines …
To be classified documents Computation Memory
Machine 1 Map task in a JVM
Machines …
Topics 2x
Topics 2x
Memory Wall
0
20
40
60
80
100
120
140
160
180
200
0 50 100 150 200 250 300 350 400
Num
ber o
f top
ics/ Tim
e in m
in
Topics data size (MB) with 4KB/topic
KMeans Throughput Benchmark
Hadoop
13
Decreased throughput due to reduced number of map tasks per machine
We used 8 mappers from 30 -80 MB, 4 mappers for 100 – 150 MB, 2 mappers for 180 – 380 for sequential Hadoop.
HJ-‐Hadoop
14
To be classified documents
Computa,on Memory
Machine1 Map task in a JVM
No Duplicated In-‐memory Cluster Centroids
Cluster Centroids 4x
Machines …
Dynamic chunking
To be classified documents Computation Memory
Machine1 Map task in a JVM
Topics 4x
Machines …
Dynamic chunking
HJ-‐Hadoop
15
To be classified documents
Computa,on Memory
Machine1 Map task in a JVM
No Duplicated In-‐memory Cluster Centroids
Cluster Centroids 4x
Machines …
Dynamic chunking
To be classified documents Computation Memory
Machine1 Map task in a JVM
Topics 4x
Machines …
Dynamic chunking
No Duplicated In-memory Cluster Centroids
Habanero Java(HJ)
• Programming Language and RunDme Developed at Rice University
• OpDmized for mulD-‐core systems – Lightweight async task – Work sharing runDme – Dynamic task parallelism – hcp://habanero.rice.edu
16
Results
0
20
40
60
80
100
120
140
160
180
200
0 50 100 150 200 250 300 350 400
Num
ber o
f top
ics/ Tim
e in m
in
Topics data size (MB) with 4KB/topic
KMeans Throughput Benchmark
HJ-‐Hadoop
Hadoop
We used 2 mappers for HJ-‐Hadoop 17
Results
0
20
40
60
80
100
120
140
160
180
200
0 50 100 150 200 250 300 350 400
Num
ber o
f top
ics/ Tim
e in m
in
Topics data size (MB) with 4KB/topic
KMeans Throughput Benchmark
HJ-‐Hadoop
Hadoop
We used 2 mappers for HJ-‐Hadoop 18
Process 5x Topics efficiently
Results
0
20
40
60
80
100
120
140
160
180
200
0 50 100 150 200 250 300 350 400
Num
ber o
f top
ics/ Tim
e in m
in
Topics data size (MB) with 4KB/topic
KMeans Throughput Benchmark
HJ-‐Hadoop
Hadoop
We used 2 mappers for HJ-‐Hadoop 19
4x throughput improvement
K Nearest Neighbor Join
0
50
100
150
200
250
0 50 100 150 200 250 Documen
ts processed
/ m
in
Input Document size (MB)
K Nearest Neighbor Join
Hadoop
HJ-‐Hadoop
20
Conclusions
• Our goal is to tackle the memory inefficiency in the execution of MapReduce applications on multi-core systems by integrating a shared memory parallel model into Hadoop MapReduce runtime – HJ-Hadoop can be used to solve larger problems
efficiently than Hadoop. HJ-Hadoop can process 5x more data at full throughput of the system
– The HJ-Hadoop can deliver a 4x throughput relative to Hadoop mapper processing large in-memory data sets
21