Intel Software Page 1
Native-Task Performance Test Report
Intel Software
Wang, Huafeng, [email protected]
Zhong, Xiang, [email protected]
1. Background
2. Related Work
3. Preliminary Experiments
3.1 Experimental Environment
Workload | Characteristic
Wordcount | CPU-intensive
Sort | IO-intensive
DFSIO | IO-intensive
Pagerank | Map: CPU-intensive; Reduce: IO-intensive
Hivebench-Aggregation | Map: CPU-intensive; Reduce: IO-intensive
Hivebench-Join | CPU-intensive
Terasort | Map: CPU-intensive; Reduce: IO-intensive
K-Means | Iteration stage: CPU-intensive; Classification stage: IO-intensive
Nutchindexing | CPU-intensive & IO-intensive
Cluster settings
Hadoop version | 1.0.3-Intel (patched with native task)
Cluster size | 4 nodes
Disks | 7 SATA disks per node
Network | GbE
CPU | E5-2680 (32 cores per node)
L3 cache size | 20480 KB
Memory | 64GB per node
Map slots | 3*32 + 1*26 = 122
Reduce slots | 3*16 + 1*13 = 61
Job Configuration
io.sort.mb | 1GB
Compression | enabled
Compression codec | Snappy
dfs.block.size | 256MB
io.sort.record.percent | 0.2
dfs.replication | 3
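The settings above correspond roughly to the following Hadoop 1.x configuration entries (a sketch only; the property names are stock Hadoop 1.x, the values come from the table above):

```xml
<!-- mapred-site.xml -->
<property><name>io.sort.mb</name><value>1024</value></property>
<property><name>io.sort.record.percent</name><value>0.2</value></property>
<property><name>mapred.compress.map.output</name><value>true</value></property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

<!-- hdfs-site.xml -->
<property><name>dfs.block.size</name><value>268435456</value></property>
<property><name>dfs.replication</name><value>3</value></property>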
3.2 Performance Metrics
Workload | Data size before compression | Data size after compression | Native job run time (s) | Original job run time (s) | Job performance improvement | Map stage performance improvement
Wordcount | 1TB | 500GB | 1523.43 | 3957.11 | 159.8% | 159.8%
Sort | 500GB | 249GB | 2662.43 | 3066.97 | 15.2% | 45.4%
DFSIO-Read | 1TB | NA | 1249.68 | 1384.52 | 10.8% | 26%
DFSIO-Write | 1TB | NA | 6639.22 | 7165.97 | 7.9% | 7.9%
Pagerank | Pages: 500M, Total: 481GB | 217GB | 6105.71 | 11644.63 | 90.7% | 133.8%
Hive-Aggregation | Uservisits: 5G, Pages: 600M, Total: 820GB | 345GB | 1113.82 | 1662.74 | 49.3% | 76.2%
Hive-Join | Uservisits: 5G, Pages: 600M, Total: 860GB | 382GB | 1107.55 | 1467.08 | 32.5% | 42.8%
Terasort | 1TB | NA | 4203.35 | 6360.49 | 51.3% | 109.1%
K-Means | Clusters: 5, Samples: 2G, Input file sample: 400M, Total: 378GB | 350GB | 5734.11 | 7045.62 | 22.9% | 22.9%
Nutchindexing | Pages: 40M (22G) | NA | 4387.6 | 4600.56 | 4.9% | 13.2%
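The improvement columns are consistent with improvement = (original time / native time − 1) × 100. A quick check against the Wordcount and Sort rows:

```python
def improvement_pct(original_s, native_s):
    """Percentage speedup of the Native-Task run over the original run."""
    return (original_s / native_s - 1) * 100

# Wordcount: 3957.11 s -> 1523.43 s
print(round(improvement_pct(3957.11, 1523.43), 1))  # ~159.8
# Sort: 3066.97 s -> 2662.43 s
print(round(improvement_pct(3066.97, 2662.43), 1))  # ~15.2
```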
[Figure: Hi-Bench Job Execution Time Analysis — original MapReduce job runtime vs. Native-Task job runtime (seconds) for each workload.]
3.3 Results
3.3.1 Wordcount
Job Details:
Name Maps Reducers
wordcount 1984 22
Native-Task running state:
[Figure: Job Execution Time (Original Wordcount) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
[Figure: Job Execution Time (Native-Task Wordcount) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
Start time: 9:14
Finish time: 9:37
Original running state:
Start time: 10:32
Finish time: 11:28
Analysis
Wordcount is a CPU-intensive workload whose map stage runs through the whole job, so Native-Task delivers a large performance improvement.
3.3.2 Sort
Job Details:
Name Maps Reducers
sorter 1024 33
Native-Task running state:
[Figure: Job Execution Time (Original Sort) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
[Figure: Job Execution Time (Native-Task Sort) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
Start time: 00:25
Finish time: 1:10
Analysis
Sort is IO-intensive in both its map and reduce stages. Its reduce stage occupies most of the total job running time, so the overall performance improvement is limited.
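This pattern follows Amdahl's law: Native-Task accelerates only the map-side work, so the overall gain is capped by the fraction of the runtime the map stage occupies. A hedged sketch (the 40% map fraction is illustrative, not a measured value from this report):

```python
def overall_speedup(map_fraction, map_speedup):
    """Amdahl's law: only the map fraction of the job is accelerated."""
    return 1.0 / ((1.0 - map_fraction) + map_fraction / map_speedup)

# If the map stage were ~40% of the job and ran 1.454x faster
# (a 45.4% map-stage improvement, as in the Sort row), the whole
# job would speed up by only ~14% -- close to Sort's observed 15.2%.
s = overall_speedup(0.40, 1.454)
print(f"{(s - 1) * 100:.1f}% overall improvement")
```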
3.3.3 DFSIO-Read
Job Details:
Name Maps Reducers
Datatools.jar 256 1
Result Analyzer 50 63
3.3.4 DFSIO-Write
Job Details:
Name Maps Reducers
Datatools.jar 512 1
Result Analyzer 50 63
[Figure: Job Execution Time (Original DFSIO-Read) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
[Figure: Job Execution Time (Native-Task DFSIO-Read) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
[Figure: Job Execution Time (Original DFSIO-Write) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
Native-Task running state:
Aggregation start time: 9:58
Aggregation finish time: 10:19
Join start time: 10:19
Join finish time: 12:10
Original running state:
[Figure: Job Execution Time (Native-Task DFSIO-Write) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
Aggregation start time: 20:22
Aggregation finish time: 20:46
Join start time: 20:46
Join finish time: 22:45
Analysis
DFSIO is IO-intensive in both its read and write stages. Its bottleneck is network bandwidth, so the performance improvement is limited.
3.3.5 Pagerank
Job Details:
Name Maps Reducers
Pagerank_Stage1 163 33
Pagerank_Stage2 33 33
Native-Task running state:
Start time: 10:33
[Figure: Job Execution Time (Original Pagerank) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
[Figure: Job Execution Time (Native-Task Pagerank) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
Finish time: 12:08
Original running state:
Start time: 10:59
Finish time: 14:06
Analysis
Pagerank is a CPU-intensive workload whose map stage takes about 50% of the whole job running time, so the performance improvement is substantial.
3.3.6 Hive-Aggregation
Job Details:
Name Maps Reducers
INSERT OVERWRITE TABLE uservisits...sourceIP (Stage-1) 1386 33
Original running state:
Start time: 15:52
[Figure: Job Execution Time (Original Hive-Aggregation) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
[Figure: Job Execution Time (Native-Task Hive-Aggregation) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
Finish time: 16:22
Analysis
Hive-Aggregation is CPU-intensive in its map stage and IO-intensive in its reduce stage. The map stage occupies most of the running time, and in the reduce stage network bandwidth limits performance, so the improvement is concentrated in the map stage.
3.3.7 Hive-Join
Job Details:
Name Maps Reducers
INSERT OVERWRITE TABLE rankings_uservisi...1 (Stage-1) 1551 33
INSERT OVERWRITE TABLE rankings_uservisi...1 (Stage-2) 99 33
INSERT OVERWRITE TABLE rankings_uservisi...1 (Stage-3) 99 1
INSERT OVERWRITE TABLE rankings_uservisi...1 (Stage-4) 1 1
Original running state:
Start time: 16:32
Finish time: 16:58
[Figure: Job Execution Time (Original Hive-Join) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
[Figure: Job Execution Time (Native-Task Hive-Join) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
Analysis
Hive-Join is a CPU-intensive workload whose map stage takes a large share of the whole running time, so the map stage shows a clear improvement with Native-Task.
3.3.8 Terasort
Job Details:
Name Maps Reducers
Terasort 458 34
Native-Task running state:
Original running state:
[Figure: Job Execution Time (Original Terasort) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
[Figure: Job Execution Time (Native-Task Terasort) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
Start time: 8:39
Finish time: 10:24
Analysis
Terasort is CPU-intensive in its map stage and IO-intensive in its reduce stage. The map stage occupies the majority of the running time, so there is a large performance improvement at the map stage.
3.3.9 K-Means
Job Details:
Name Maps Reducers
Cluster Iterator running iteration 1 … 1400 63
Cluster Iterator running iteration 2 … 1400 63
Cluster Iterator running iteration 3 … 1400 63
Cluster Iterator running iteration 4 … 1400 63
Cluster Iterator running iteration 5 … 1400 63
Cluster Classification Driver running 1400 0
Native-Task running state:
[Figure: Job Execution Time (Original K-Means) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
[Figure: Job Execution Time (Native-Task K-Means) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
Start time: 20:38
Finish time: 22:23
Original running state:
Start time: 9:43
Finish time: 11:41
Analysis
From the running-state graph, we can see that the first five iterations are CPU-intensive and the final classification stage is IO-intensive. The two stages split the whole running time almost equally, so the map-stage performance improvement is evident.
3.3.10 Nutchindexing
Job Details:
Name Maps Reducers
index-lucene /HiBench/Nutch/Input/indexes 1171 133
Native-Task running state:
[Figure: Job Execution Time (Original Nutchindexing) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
[Figure: Job Execution Time (Native-Task Nutchindexing) — per-task map/shuffle/merge/reduce timeline (task number vs. seconds).]
Start time: 17:26
Finish time: 18:40
Original running state:
Start time: 18:40
Finish time: 19:56
Analysis
Nutchindexing is CPU-intensive in its map stage, but the reduce stage takes the majority of the whole running time, so the performance improvement is modest.
3.4 Other related results
3.4.1 Cache miss hurts Sorting performance
• Sorting time increases rapidly as the cache miss rate increases.
• We divide a large buffer into several memory units.
• BlockSize is the size of the memory unit within which we do the sorting.
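The idea can be sketched as: sort each fixed-size memory unit independently (small enough to stay cache-resident), then k-way merge the sorted runs. This is an illustrative Python sketch of that design, not the native C++ implementation:

```python
import heapq

def blocked_sort(records, block_size):
    """Sort in cache-sized blocks, then k-way merge the sorted runs.

    Each block of `block_size` records is sorted on its own, so the
    working set of the sort phase fits in cache; heapq.merge then
    combines the runs in a single streaming pass.
    """
    runs = [sorted(records[i:i + block_size])
            for i in range(0, len(records), block_size)]
    return list(heapq.merge(*runs))

data = [9, 3, 7, 1, 8, 2, 6, 4, 5, 0]
assert blocked_sort(data, block_size=4) == sorted(data)
```

Choosing block_size so a block fits in the L3 cache is exactly the knee visible in the figure below: beyond that point, every comparison risks a cache miss.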
[Figure: Sort time (seconds) vs. log2(BlockSize) — sort time rises sharply once the partition size reaches the CPU L3 cache limit.]
3.4.2 Compare with BlockMapOutputBuffer
• The native task collector is 70% faster than the BlockMapOutputBuffer collector.
• BlockMapOutputBuffer supports ONLY BytesWritable.
3.4.3 Effect of JVM reuse
• JVM reuse yields a 4.5% improvement for original Hadoop and an 8% improvement for Native-Task.
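In Hadoop 1.x, task-JVM reuse is controlled by a single job property (set to the number of tasks to run per JVM, or -1 for unlimited reuse):

```xml
<!-- mapred-site.xml: reuse each task JVM for an unlimited number of tasks -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```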
[Figure (for 3.4.2): Job Execution Time Breakdown (WordCount) — average job, map-task, and reduce-task times for the BlockMapOutputBuffer collector vs. the native task collector. No combiner, 16 mappers x 8 reducers, 4 nodes, 3 SATA disks per node, 600MB per task, compression ratio 2:1.]
3.4.4 Hadoop doesn't scale well as the slot count increases
• 4 nodes (32 cores per node), 16 map slots max; CPU, memory, and disk are NOT fully used.
• Performance drops unexpectedly when the number of slots increases.
[Figure (for 3.4.3): Effect of Task JVM Reuse, 4 nodes, 4 map slots per node — Original Hadoop: 1381s without reuse vs. 1322s with reuse (4.5% improvement); Native-Task: 510s vs. 472s (8% improvement).]
[Figure (for 3.4.4): Performance with different map slot counts, Wordcount benchmark — 4*16 map slots: total job time 741s, total map stage time 369s; 4*4 map slots: 718s and 339s.]
3.4.5 Native-Task mode: full task optimization
• Full-task Native-Task optimization is a further 2x faster than the native collector alone.
[Figure (for 3.4.5): Native-Task modes, full task vs. collector only, Wordcount benchmark — total job time and total map stage time for original Hadoop, Native-Task (collector only), and Native-Task (full task).]

4. Conclusions