Culler 1997, CS267 L28 Sort
CS 267 Applications of Parallel Computers
Lecture 28: LogP and the
Implementation and Modeling of Parallel Sorts
James Demmel
(taken from David Culler,
Lecture 18, CS267, 1997)
http://www.cs.berkeley.edu/~demmel/cs267_Spr99
Practical Performance Target (circa 1992)
• Sort one billion large keys in one minute on one thousand processors.
• Good sort on a workstation can do 1 million keys in about 10 seconds
– just fits in memory
– 16 bit Radix Sort
• Performance unit: µs per key per processor; roughly 10 for a single Sparc 2
Studies on Parallel Sorting
[Figure: a spectrum of models, from abstract to concrete: Sorting Networks and PRAM sorts (processors sharing one memory), sorting on a specific network Y (processor/memory nodes on an interconnect), LogP sorts, and sorting on a concrete machine X]
The Study
• Interesting parallel sorting algorithms: Bitonic, Column, Histo-radix, Sample
• Analyze each under LogP, using parameters for the CM-5, to estimate execution time
• Implement each in Split-C and execute on the CM-5
• Compare predicted and measured times
LogP
Deriving the LogP Model
° Processing
– powerful microprocessor, large DRAM, cache => P
° Communication
+ significant latency (100's of cycles) => L
+ limited bandwidth (1 – 5% of memory bw) => g
+ significant overhead (10's – 100's of cycles) => o, on both ends
– no consensus on topology
=> should not exploit structure
+ limited capacity
– no consensus on programming model
=> should not enforce one
LogP
[Figure: P processor/memory modules attached to an interconnection network, annotated with o, L, and g; limited volume: at most L/g messages in flight to or from any one processor]
• L: latency of sending a (small) message between modules
• o: overhead felt by the processor on sending or receiving a message
• g: gap between successive sends or receives (1/bandwidth)
• P: number of processors
Culler 1997CS267 L28 Sort.8
Using the Model
° Send n messages from proc to proc in time 2o + L + g(n-1)
– each processor spends o·n cycles on overhead
– and has (g-o)(n-1) + L compute cycles available
° Send n messages from one proc to many: same time
° Send n messages from many procs to one: same time
– but all except L/g senders block, so fewer compute cycles are available
[Timing diagram: o on send, L in flight, o on receive; successive messages separated by the gap g]
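The send-cost formulas above can be sketched in Python (the function names are mine; the CM-5 parameters L = 6, o = 2.2, g = 4 µs come from the parameter slide later in the lecture):

```python
def time_n_messages(n, L, o, g):
    """LogP time for one processor to stream n small messages to another:
    2o + L for the first message, then one message per gap g."""
    return 2 * o + L + g * (n - 1)

def spare_compute(n, L, o, g):
    """Compute time the sender has free while the n messages drain:
    (g - o) per message after the first, plus the latency L."""
    return (g - o) * (n - 1) + L

# CM-5 parameters in microseconds.
L, o, g = 6.0, 2.2, 4.0
```

For 10 messages on the CM-5 numbers this gives 46.4 µs of wall time with 22.2 µs of spare compute at the sender.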
Use of the Model (cont)
° Two processors exchanging n words with each other take time
2o + L + (n-1)·max(g, 2o)
° P processors each sending n words to all processors (n/P each) in a static, balanced pattern without conflicts (e.g., transpose, FFT, cyclic-to-block, block-to-cyclic): same formula
Exercise: what's wrong with the formula above?
It assumes an optimal pattern of sends and receives, so it can underestimate the time.
LogP "philosophy"
• Think about:
– the mapping of N words onto P processors
– computation within a processor, its cost, and balance
– communication between processors, its cost, and balance
given a characterization of processor and network performance
• Do not think about what happens within the network
This should be good enough!
Typical Sort
Exploits the n = N/P grouping
° Significant local computation
° Very general global communication / transformation
° Computation of the transformation
Split-C
Global Address Space
[Figure: processors P0 through Pprocs-1, each owning the local region of a 2D global address space]
• Explicitly parallel C
• 2D global address space, with a linear ordering on the local spaces
• Local and global pointers; spread arrays too
• Read/Write
• Get/Put (overlap compute and communication)
– x := G; . . . sync();
• Signaling store (one-way)
– G :- x; . . . store_sync(); or all_store_sync();
• Bulk transfer
• Global communication
Basic Costs of operations in Split-C
• Read, Write   x = *G, *G = x   costs 2(L + 2o)
• Store         *G :- x          costs L + 2o
• Get           x := *G          costs o to issue; the data arrives within 2L + 2o; sync() costs o
– successive gets can be issued with interval g
• Bulk store (n words, several words per message): 2o + (n-1)g + L
• Exchange: 2o + 2L + (n - L/g)·max(g, 2o)
• One to many
• Many to one
LogP model
• CM-5:
– L = 6 µs
– o = 2.2 µs
– g = 4 µs
– P varies from 32 to 1024
• NOW:
– L = 8.9 µs
– o = 3.8 µs
– g = 12.8 µs
– P varies up to 100
• What is the processor performance?
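As a quick sanity check on the two parameter sets, the one-way small-message cost L + 2o and the earlier n-word exchange formula can be evaluated for both machines (a sketch; units are microseconds):

```python
def small_message_us(L, o):
    """One-way cost of a single small message: L + 2o."""
    return L + 2 * o

def exchange_us(n, L, o, g):
    """Two processors exchanging n words: 2o + L + (n-1)*max(g, 2o)."""
    return 2 * o + L + (n - 1) * max(g, 2 * o)

cm5 = small_message_us(6.0, 2.2)   # CM-5: 10.4 us
now = small_message_us(8.9, 3.8)   # NOW: 16.5 us, higher on every parameter
```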
Sorting
Local Sort Performance (11-bit radix sort of 32-bit numbers)
[Figure: µs per key vs. log N/P (y-axis 0-10, x-axis 0-20), one curve per key-value entropy: 31, 25.1, 16.9, 10.4, and 6.2 bits; the rise at large N/P is due to TLB misses]
Entropy = -Σᵢ pᵢ log pᵢ, where pᵢ is the probability of key value i
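The entropy metric above is easy to compute for a sample of keys; a minimal sketch (the helper name is mine):

```python
import math
from collections import Counter

def key_entropy(keys):
    """Empirical entropy -sum_i p_i log2 p_i of a list of keys,
    where p_i is the observed frequency of key value i."""
    n = len(keys)
    counts = Counter(keys)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

All-equal keys give 0 bits; keys uniform over 2^k distinct values give k bits.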
Local Computation Parameters - Empirical

Parameter    Operation                              µs per key                   Sort
swap         simulate cycle of butterfly per key    0.025 lg N                   Bitonic
mergesort    sort a bitonic sequence                1.0
scatter      move key for cyclic-to-block           0.46                         Bitonic & Column
gather       move key for block-to-cyclic           0.52 if n <= 64K or P <= 64,
                                                    1.1 otherwise
local sort   local radix sort (11 bit)              4.5 if n < 64K,
                                                    9.0 - (281000/n) otherwise
merge        merge sorted lists                     1.5                          Column
copy         shift key                              0.5
zero         clear histogram bin                    0.2                          Radix
hist         produce histogram                      1.2
add          produce scan value                     1.0
bsum         adjust scan of bins                    2.5
address      determine destination                  4.7
compare      compare key to splitter                0.9                          Sample
localsort8   local radix sort of samples            5.0
Bottom Line (Preview)
[Figure: µs per key vs. N/P (16384 to 1048576), y-axis 0-140, for Bitonic, Column, Radix, and Sample sorts at P = 32 and P = 1024]
• Good fit between predicted and measured (within 10%)
• Different sorts for different settings: the algorithms scale differently with processor count, input size, and key-distribution sensitivity
• All are global/local hybrids, and the local part is hard to implement and model
Odd-Even Merge - classic parallel sort
N values to be sorted; treat them as two lists of M = N/2:
A0 A1 A2 A3 … AM-1 and B0 B1 B2 B3 … BM-1
Sort each list separately.
Redistribute each into even and odd sublists:
A0 A2 … AM-2 and A1 A3 … AM-1 (likewise for B)
Merge evens with evens and odds with odds into two sorted lists:
E0 E1 E2 E3 … EM-1 and O0 O1 O2 O3 … OM-1
Pairwise swaps of Ei and Oi will put it in order.
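The redistribute-merge-swap scheme above can be sketched as a recursive function on two sorted lists of equal power-of-two length (a sequential sketch of the data movement, not the sorting network itself):

```python
def odd_even_merge(a, b):
    """Merge sorted lists a and b (equal, power-of-two lengths):
    merge the even-indexed sublists and the odd-indexed sublists
    recursively, interleave the results, then fix up with pairwise
    compare-and-swaps."""
    if len(a) == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    evens = odd_even_merge(a[0::2], b[0::2])
    odds = odd_even_merge(a[1::2], b[1::2])
    out = []
    for e, o in zip(evens, odds):          # interleave the E and O lists
        out += [e, o]
    for i in range(1, len(out) - 1, 2):    # pairwise compare-and-swaps
        if out[i] > out[i + 1]:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out
```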
Where's the Parallelism?
[Figure: the recursion as a tree of merges: one 1×N merge at the root, built from 2 merges of size N/2, 4 merges of size N/4, and so on; parallelism grows toward the leaves]
Mapping to a Butterfly (or Hypercube)
[Figure: two sorted sublists A0 A1 A2 A3 and B0 B1 B2 B3 mapped onto a butterfly; the order of one list is reversed via the cross edges, giving a bitonic sequence, and pairwise swaps on the way back sort it, e.g. 2 3 4 8 / 7 6 5 1 → 1 2 3 4 5 6 7 8]
Bitonic Sort with N/P per node

all_bitonic(int A[PROCS]::[n])
  sort(tolocal(&A[ME][0]), n, 0)
  for (d = 1; d <= logProcs; d++)
    for (i = d-1; i >= 0; i--) {
      swap(A, T, n, pair(i));
      merge(A, T, n, mode(d, i));
    }
  sort(tolocal(&A[ME][0]), n, mask(i));

A bitonic sequence decreases and then increases (or vice versa).
Bitonic sequences can be merged like monotonic sequences.
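The merge step that the Split-C code applies across processors can be illustrated sequentially: a bitonic sequence of power-of-two length is sorted by one halving pass of compare-and-swaps followed by two recursive merges (a sketch, not the parallel code above):

```python
def bitonic_merge(seq, ascending=True):
    """Sort a bitonic sequence by compare-and-swapping element i with
    element i + n/2, then recursively merging both halves (each half is
    again bitonic, and every left element <= every right element)."""
    n = len(seq)
    if n == 1:
        return list(seq)
    half = n // 2
    s = list(seq)
    for i in range(half):
        if (s[i] > s[i + half]) == ascending:
            s[i], s[i + half] = s[i + half], s[i]
    return bitonic_merge(s[:half], ascending) + bitonic_merge(s[half:], ascending)
```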
Bitonic Sort
lg N/P stages are a local sort (block layout).
Each remaining stage involves a block-to-cyclic remap, local merges (the i - lg N/P columns), a cyclic-to-block remap, and local merges (the lg N/P columns within the stage).
Analysis of Bitonic
• How do you do transpose?
• Reading Exercise
Bitonic Sort: time per key
[Figure, two panels (Predicted and Measured): µs per key vs. N/P (16384 to 1048576), y-axis 0-80, one curve per machine size P = 32, 64, 128, 256, 512]
Bitonic: Breakdown
[Figure, two panels (Predicted and Measured): per-key time vs. N/P (16384 to 1048576), y-axis 0-80 µs/key, broken into Remap B-C, Remap C-B, Mergesort, Swap, and Localsort; P = 512, random keys]
Bitonic: Effect of Key Distributions
[Figure: µs per key vs. key entropy (0, 6, 10, 17, 25, 31 bits), y-axis 0-45, broken into Swap, Merge Sort, Local Sort, Remap C-B, and Remap B-C; P = 64, N/P = 1M]
Column Sort
Treat the data like an n x P array, with n >= P^2, i.e. N >= P^3 (work efficient):
(1) Sort each column
(2) Transpose: block to cyclic
(3) Sort each column
(4) Transpose: cyclic to block, w/o scatter
(5) Sort each column
(6) Shift
(7) Merge
(8) Unshift
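The column-sort steps above (Leighton's columnsort) can be sketched sequentially; the layout helpers are my own choices. The keys are viewed as P columns of n entries each, and the result comes back sorted in column-major order:

```python
def column_sort(keys, P):
    """Sketch of the 8 steps above: view keys as an n x P array
    (n = len(keys)/P, requiring n >= P^2, i.e. N >= P^3) and alternate
    column sorts with fixed permutations. Sentinels handle the
    shift/unshift steps. Output is sorted in column-major order."""
    n = len(keys) // P
    assert len(keys) == n * P and n >= P * P
    NEG, POS = float("-inf"), float("inf")

    def flat(cols):  # column-major flatten
        return [x for c in cols for x in c]

    cols = [sorted(keys[j*n:(j+1)*n]) for j in range(P)]       # (1) sort
    f = flat(cols)                                             # (2) transpose:
    cols = [[f[i*P + j] for i in range(n)] for j in range(P)]  #     read cols, write rows
    cols = [sorted(c) for c in cols]                           # (3) sort
    f2 = [None] * (n * P)                                      # (4) untranspose:
    for j in range(P):                                         #     read rows, write cols
        for i in range(n):
            f2[i*P + j] = cols[j][i]
    cols = [f2[j*n:(j+1)*n] for j in range(P)]
    cols = [sorted(c) for c in cols]                           # (5) sort
    half = n // 2                                              # (6) shift by n/2
    shifted = [NEG] * half + flat(cols) + [POS] * (n - half)
    cols = [shifted[j*n:(j+1)*n] for j in range(P + 1)]
    cols = [sorted(c) for c in cols]                           # (7) merge (sort columns)
    return flat(cols)[half:half + n * P]                       # (8) unshift
```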
Column Sort: Times
[Figure, two panels (Predicted and Measured): µs per key vs. N/P (16384 to 1048576), y-axis 0-40, one curve per machine size P = 32, 64, 128, 256, 512]
Only works for N >= P^3.
Column: Breakdown
[Figure, two panels (Predicted and Measured): per-key time vs. N/P (16384 to 1048576), y-axis 0-40 µs/key, broken into Sort1, Sort2, Sort3, Merge, Trans, Untrans, Shift, and Unshift; P = 64, random keys]
Column: Key distributions
[Figure: µs per key vs. key entropy (0, 6, 10, 17, 25, 31 bits), y-axis 0-35, broken into Merge, Sorts, Remaps, and Shifts; P = 64, N/P = 1M]
Histo-radix sort
[Figure: P processors, n = N/P keys each, 2^r histogram buckets]
Per pass:
1. compute the local histogram
2. compute the position of the 1st member of each bucket in the global array
– 2^r scans with end-around carry
3. distribute all the keys
Only r = 8, 11, 16 make sense for sorting 32-bit numbers.
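One pass of the scheme above, written sequentially (the parallel version splits the histogram and the distribution across processors; the function names are mine):

```python
def radix_pass(keys, shift, r=11):
    """One r-bit pass: (1) histogram the current digit, (2) scan the
    bins to find each bucket's first position, (3) distribute keys."""
    buckets = 1 << r
    hist = [0] * buckets
    for k in keys:                              # (1) histogram
        hist[(k >> shift) & (buckets - 1)] += 1
    pos, total = [0] * buckets, 0
    for b in range(buckets):                    # (2) scan
        pos[b], total = total, total + hist[b]
    out = [0] * len(keys)
    for k in keys:                              # (3) distribute (stable)
        b = (k >> shift) & (buckets - 1)
        out[pos[b]] = k
        pos[b] += 1
    return out

def radix_sort32(keys, r=11):
    """Sort non-negative 32-bit keys; r = 11 means 3 passes."""
    for shift in range(0, 32, r):
        keys = radix_pass(keys, shift, r)
    return keys
```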
Histo-Radix Sort (again)
[Figure: local data and local histograms on each of P processors]
Each pass: form local histograms, form the global histogram, globally distribute the data.
Radix Sort: Times
[Figure, two panels (Predicted and Measured): µs per key vs. N/P (16384 to 1048576), y-axis 0-140, one curve per machine size P = 32, 64, 128, 256, 512]
Radix: Breakdown
[Figure: per-key time vs. N/P (16384 to ~1e6), y-axis 0-140 µs/key, broken into Dist, GlobalHist, and LocalHist, predicted and measured (-m)]
Radix: Key distribution
[Figure: µs per key vs. key entropy (0, 6, 10, 17, 25, 31 bits), y-axis 0-70, broken into Dist, Global Hist, and Local Hist; the cyclic case reaches 118 µs/key]
Slowdown is due to contention in the redistribution.
Radix: Stream Broadcast Problem
[Figure: a stream of n messages forwarded along a chain of processors]
Is the total time (P-1)(2o + L + (n-1)g)? One needs to slow the first processor for the stream to pipeline well.
What's the right communication mechanism?
• Permutation via writes
– consistency model?
– false sharing?
• Reads?
• Bulk transfers?
– what do you need to change in the algorithm?
• Network scheduling?
Sample Sort
1. Compute P-1 splitter values that divide the input keys into roughly equal pieces:
– take S ~ 64 samples per processor
– sort the P·S samples
– take keys S, 2S, . . ., (P-1)S as splitters
– broadcast the splitters
2. Distribute the keys based on the splitters
3. Local sort
[4.] possibly reshift
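The steps above, sketched sequentially in Python (random sampling with a fixed seed; `S` and the seed are my choices, and the "per-processor" structure is flattened):

```python
import bisect
import random

def sample_sort(keys, P, S=64, seed=0):
    """1. Sample S keys 'per processor', sort the P*S samples, and take
    every S-th one as a splitter; 2. bucket every key by splitter;
    3. sort each bucket locally and concatenate."""
    rng = random.Random(seed)
    samples = sorted(rng.choice(keys) for _ in range(P * S))
    splitters = [samples[i * S] for i in range(1, P)]      # P-1 splitters
    buckets = [[] for _ in range(P)]
    for k in keys:                                         # (2) distribute
        buckets[bisect.bisect_right(splitters, k)].append(k)
    return [k for b in buckets for k in sorted(b)]         # (3) local sort
```

Because sampling is random, the buckets are only roughly balanced, which is exactly why the algorithm may need the optional rebalancing step.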
Sample Sort: Times
[Figure, two panels (Predicted and Measured): µs per key vs. N/P (16384 to 1048576), y-axis 0-30, one curve per machine size P = 32, 64, 128, 256, 512]
Sample Breakdown
[Figure: per-key time vs. N/P (16384 to 1048576), y-axis 0-30 µs/key, broken into Split, Sort, and Dist, predicted and measured (-m)]
Comparison
[Figure: µs per key vs. N/P (16384 to 1048576), y-axis 0-140, for Bitonic, Column, Radix, and Sample sorts at P = 32 and P = 1024]
• Good fit between predicted and measured (within 10%)
• Different sorts for different settings: the algorithms scale differently with processor count, input size, and key-distribution sensitivity
• All are global/local hybrids, and the local part is hard to implement and model
Conclusions
• The distributed-memory model leads to hybrid global/local algorithms
• The LogP model is good enough for the global part
– bandwidth (g) or overhead (o) matter most
– including end-point contention
– latency (L) only matters when bandwidth doesn't
– g is going to be what really matters in the days ahead (NOW)
• Local computational performance is hard!
– dominated by effects of the storage hierarchy (TLBs)
– getting trickier with multiple cache levels (physical address determines L2 cache behavior)
– and with real computers at the nodes (VM)
– and with variations in the model (cycle time, caches, . . .)
• See http://www.cs.berkeley.edu/~culler/papers/sort.ps
• See http://now.cs.berkeley.edu/Papers2/Postscript/spdt98.ps (disk-to-disk parallel sorting)