
04/27/2009 CS267 Lecture 24 1

Parallel Sorting

James Demmel

www.cs.berkeley.edu/~demmel/cs267_Spr09

Some Sorting algorithms

• Choice of algorithm depends on:
  • Is the data all in memory, or on disk/tape?

• Do we compare operands, or just use their values?

• Sequential or parallel? Shared or distributed memory? SIMD or not?

• We will consider all data in memory and parallel:
  • Bitonic Sort

• Naturally parallel, suitable for SIMD architectures

• Sample Sort

• Good generalization of Quick Sort to parallel case

• Radix Sort

• Data measured on CM-5
  • Written in Split-C (precursor of UPC)

• www.cs.berkeley.edu/~culler/papers/sort.ps

04/27/2009 CS267 Lecture 24 2

LogP – model to predict, understand performance

[Diagram: P processor/memory modules connected by an interconnection network; the network accepts a limited volume of messages (L/g to or from a processor); the parameters L (latency), o (overhead), and g (gap) are marked]

• L: latency in sending a (small) message between modules

• o: overhead felt by the processor on sending or receiving a message

• g: gap between successive sends or receives (1/BW)

• P: number of processor/memory modules

Bottom Line on CM-5 using Split-C (Preview)

[Chart: time per key (µs/key, 0-140) vs. N/P (16,384 to 1,048,576 keys per processor) for Bitonic, Column, Radix, and Sample sort on 32 and 1024 processors; legend: Algorithm, #Procs]

• Good fit between predicted (using LogP model) and measured (10%)
• No single algorithm always best
  • scaling by processor, input size, sensitivity
• All are global / local hybrids
  • the local part is hard to implement and model

04/27/2009 CS267 Lecture 24 4

04/27/2009 CS267 Lecture 24 5

Bitonic Sort (1/2)
• A bitonic sequence is one that is:

1. Monotonically increasing and then monotonically decreasing

2. Or can be circularly shifted to satisfy 1

• A half-cleaner takes a bitonic sequence and produces:
1. First half is no larger than the smallest element in the 2nd half

2. Both halves are bitonic

• Apply recursively to each half to complete sorting

• Where do we get a bitonic sequence to start with?

[Example: the half-cleaner applied to the bitonic 0/1 sequence 0 0 1 1 1 0 0 0 produces 0 0 0 0 | 1 0 1 1; recursing on each half gives the sorted sequence 0 0 0 0 0 1 1 1]

04/27/2009 CS267 Lecture 24 6

Bitonic Sort (2/2)
• A bitonic sequence is one that is:

1. Monotonically increasing and then monotonically decreasing

2. Or can be circularly shifted to satisfy 1

• Any sequence of length 2 is bitonic
  • So we can sort it using the bitonic sorter on the last slide

• Sort all length-2 sequences in alternating increasing/decreasing order

• Two such length-2 sequences form a length-4 bitonic sequence

• Recursively:
  • Sort length-2^k sequences in alternating increasing/decreasing order
  • Two such length-2^k sequences form a length-2^(k+1) bitonic sequence (that can be sorted), as in the sketch below
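A minimal sequential C sketch of these two building blocks (illustrative only, assuming n is a power of two; the names bitonic_merge and bitonic_sort are ours, not the Split-C routines used later):

#include <stddef.h>

/* Compare-exchange: order the pair increasing (up=1) or decreasing (up=0). */
static void cswap(int *a, int *b, int up) {
    if ((*a > *b) == up) { int t = *a; *a = *b; *b = t; }
}

/* Half-cleaner applied recursively: sorts a bitonic sequence of length n. */
static void bitonic_merge(int *x, size_t n, int up) {
    if (n < 2) return;
    size_t half = n / 2;
    for (size_t i = 0; i < half; i++)
        cswap(&x[i], &x[i + half], up);   /* the half-cleaner */
    bitonic_merge(x, half, up);           /* both halves are now bitonic */
    bitonic_merge(x + half, half, up);
}

/* Sort by building bitonic sequences of doubling length. */
void bitonic_sort(int *x, size_t n, int up) {
    if (n < 2) return;
    bitonic_sort(x, n / 2, 1);            /* increasing half */
    bitonic_sort(x + n / 2, n / 2, 0);    /* decreasing half -> x is bitonic */
    bitonic_merge(x, n, up);
}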

Bitonic Sort for n=16 – all dependencies shown

Block Layout

lg N/p stages are local sort – Use best local sort

remaining stages involve block-to-cyclic remap, local merges (i - lg N/P columns), then cyclic-to-block remap, local merges (lg N/p columns within a stage)

Similar pattern as FFT: similar optimizations possible

Bitonic Sort: time per key

[Charts: predicted (using the LogP model) and measured time per key (µs/key, 0-80) vs. N/P from 16,384 to 1,048,576, for P = 32 to 512 processors]

04/27/2009 CS267 Lecture 24 8

Sample Sort

1. Compute P-1 values of keys that would split the input into roughly equal pieces.

– take S~64 samples per processor

– sort P·S keys (on processor 0)

– let keys S, 2·S, . . . (P-1)·S be “Splitters”

– broadcast Splitters

2. Distribute keys based on Splitters

Splitter(i-1) < keys <= Splitter(i) all sent to proc i

3. Local sort of keys on each proc

[4.] possibly reshift, so each proc has N/p keys

If the samples represent the total population, then the Splitters should divide the population into P roughly equal pieces (see the sketch of splitter selection below).
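A C sketch of the splitter-selection step (steps 1a-1c above), as it might run sequentially on processor 0; the names choose_splitters and cmp_int are illustrative, not from the Split-C code:

#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* samples: the P*S keys gathered on processor 0 (S ~ 64 samples per processor).
 * Picks keys S, 2S, ..., (P-1)S of the sorted samples as the P-1 splitters. */
void choose_splitters(int *samples, int P, int S, int *splitters) {
    qsort(samples, (size_t)P * S, sizeof(int), cmp_int);
    for (int i = 1; i < P; i++)
        splitters[i - 1] = samples[i * S];
    /* A key k is then sent to processor i with splitters[i-1] < k <= splitters[i],
       treating splitters[-1] as -infinity and splitters[P-1] as +infinity. */
}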

04/27/2009 CS267 Lecture 24 9

Sample Sort: Times

[Charts: predicted and measured time per key (µs/key, 0-30) vs. N/P from 16,384 to 1,048,576, for 32 to 512 processors]

04/27/2009 CS267 Lecture 24 10

Sample Sort Timing Breakdown

[Chart: time per key (µs/key, 0-30) vs. N/P, broken down into Split, Sort, and Dist phases]

Predicted and Measured (-m) times

04/27/2009 CS267 Lecture 24 11

004/19/23 CS267 Lecture 24 12

Sequential Radix Sort: Counting Sort

• Idea: build a histogram of the keys and compute position in answer array for each element

A = [3, 5, 4, 1, 3, 4, 1, 4]

• Make temp array B, and write each value into its position: B = [1, 1, 3, 3, 4, 4, 4, 5]

• Cost = O(#keys + size of histogram)

• What if the histogram is too large (e.g., all 32-bit ints? All words?)

[Figure: histogram of A over the key values 1-5]
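A direct C rendering of this counting sort (a sketch under the assumption that keys lie in 0..K-1; array and function names are illustrative):

#include <string.h>

/* Counting sort of A[0..n-1] into B[0..n-1]; keys are in 0..K-1; count has K+1 slots.
 * Cost is O(n + K), i.e., O(#keys + size of histogram) as on the slide. */
void counting_sort(const int *A, int *B, int n, int K, int *count) {
    memset(count, 0, (size_t)(K + 1) * sizeof(int));
    for (int i = 0; i < n; i++)           /* 1. histogram of the keys           */
        count[A[i] + 1]++;
    for (int k = 0; k < K; k++)           /* 2. prefix sum -> starting position */
        count[k + 1] += count[k];
    for (int i = 0; i < n; i++)           /* 3. place each key (stable)         */
        B[count[A[i]]++] = A[i];
}

For A = [3, 5, 4, 1, 3, 4, 1, 4] and K = 6 this produces B = [1, 1, 3, 3, 4, 4, 4, 5].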

004/19/23 CS267 Lecture 24 13

Radix Sort: Separate Key Into Parts

• Divide keys into parts, e.g., by digits (radix)

• Use counting sort on each part (each radix):
  • Start with the least-significant

start      sort on          sort on          sort on
           3rd character    2nd character    1st character
sat        run              sat              pin
saw        pin              saw              run
tip        tip              pin              sat
run        sat              tip              saw
pin        saw              run              tip

• Cost = O(#keys * #characters)
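The same idea for integer keys, as a C sketch (illustrative; r = 8 bits per digit, a stable counting sort per pass, least-significant digit first):

#include <stdint.h>
#include <string.h>

/* LSD radix sort of n 32-bit keys; tmp is scratch space of the same size. */
void radix_sort_u32(uint32_t *keys, uint32_t *tmp, int n) {
    enum { R = 8, BINS = 1 << R };
    int count[BINS + 1];
    for (int shift = 0; shift < 32; shift += R) {
        memset(count, 0, sizeof count);
        for (int i = 0; i < n; i++)                        /* histogram of this digit    */
            count[((keys[i] >> shift) & (BINS - 1)) + 1]++;
        for (int b = 0; b < BINS; b++)                     /* start position of each bin */
            count[b + 1] += count[b];
        for (int i = 0; i < n; i++)                        /* stable redistribution      */
            tmp[count[(keys[i] >> shift) & (BINS - 1)]++] = keys[i];
        memcpy(keys, tmp, (size_t)n * sizeof *keys);
    }
}

Cost is O(#keys x #digits), matching the O(#keys * #characters) bound above.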

Histo-radix sort

Each of the P processors holds n = N/P keys.

Per pass:

1. compute local histogram
  - r-bit keys, 2^r bins

2. compute position of 1st member of each bucket in the global array (sketched below)
  - 2^r scans with end-around

3. distribute all the keys

Only r = 4, 8, 11, 16 make sense for sorting 32-bit numbers
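A sequential stand-in for step 2, computing from the P local histograms where each processor's share of each bucket starts in the global array (illustrative; hist and offset are flattened P x 2^r arrays, and the names are ours):

/* hist[p*BINS + b]   = count of keys with digit value b on processor p
 * offset[p*BINS + b] = global index where processor p writes its first key of bucket b */
void global_bucket_offsets(int P, int BINS, const int *hist, int *offset) {
    int start = 0;                          /* global start of the current bucket           */
    for (int b = 0; b < BINS; b++) {        /* one scan per bin, end-around over processors */
        for (int p = 0; p < P; p++) {
            offset[p * BINS + b] = start;
            start += hist[p * BINS + b];
        }
    }
}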

004/19/23 CS267 Lecture 24 14

Histo-Radix Sort (again)

Each pass: form local histograms, form global histogram, globally distribute data

[Figure: the local data and local histogram held by each of the P processors]

004/19/23 CS267 Lecture 24 15

Radix Sort: Times

[Charts: predicted and measured time per key (µs/key, 0-140) vs. N/P from 16,384 to 1,048,576, for 32 to 512 processors]

004/19/23 CS267 Lecture 24 16

Radix Sort: Timing Breakdown

[Chart: time per key (µs/key, 0-140) vs. N/P, broken down into LocalHist, GlobalHist, and Dist phases]

004/19/23 CS267 Lecture 24 17

Predicted and Measured (-m) times

Local Sort Performance on CM-5

[Chart: local sort time (µs/key, 0-10) vs. log N/P (0-20) for key-value entropies of 31, 25.1, 16.9, 10.4, and 6.2 bits; the rise at large N/P is due to TLB misses; 11-bit radix sort of 32-bit numbers]

004/19/23 18CS267 Lecture 24

Entropy = -Σi pi log2 pi, where pi = probability of key i; ranges from 0 to log2(#different keys)

Radix Sort: Timing dependence on Key distribution

[Chart: time per key (µs/key, 0-70) vs. key entropy (0, 6, 10, 17, 25, 31 bits), broken down into Local Hist, Global Hist, and Dist; a cyclic key pattern takes 118 µs/key]

Slowdown due to contention in redistribution

004/19/23 CS267 Lecture 24 19

Entropy = -Σi pi log2 pi, where pi = probability of key i; ranges from 0 to log2(#different keys)

Bottom Line on CM-5 using Split-C

[Chart: time per key (µs/key, 0-140) vs. N/P (16,384 to 1,048,576 keys per processor) for Bitonic, Column, Radix, and Sample sort on 32 and 1024 processors; legend: Algorithm, #Procs]

• Good fit between predicted (using LogP model) and measured (10%)
• No single algorithm always best
  • scaling by processor, input size, sensitivity
• All are global / local hybrids
  • the local part is hard to implement and model

04/27/2009 CS267 Lecture 24 20

Sorting Conclusions

• Distributed memory model leads to hybrid global / local algorithm
  • Use best local algorithm combined with global part

• LogP model is good enough to model the global part
  • bandwidth (g) or overhead (o) matter most
    • including end-point contention
  • latency (L) only matters when bandwidth doesn't

• Modeling local computational performance is harder
  • dominated by effects of the storage hierarchy (e.g., TLBs)
  • depends on entropy

• See http://www.cs.berkeley.edu/~culler/papers/sort.ps
• See http://now.cs.berkeley.edu/Papers2/Postscript/spdt98.ps

• disk-to-disk parallel sorting

04/27/2009 CS267 Lecture 24 21

EXTRA SLIDES

004/19/23 CS267 Lecture 24 22

Radix: Stream Broadcast Problem

[Figure: processor 0 streams n messages down a chain of P processors]

Is the time (P-1)(2o + L + (n-1)g)? Need to slow the first processor to pipeline well.

• Processor 0 does only sends
• Others receive, then send
• Receives are prioritized over sends, so processor 0 needs to be delayed

004/19/23 CS267 Lecture 24 23

What’s the right communication mechanism?

• Permutation via writes
  • consistency model?
  • false sharing?

• Reads?

• Bulk transfers?
  • what do you need to change in the algorithm?

• Network scheduling?

Comparison

• Good fit between predicted and measured (10%)

• Different sorts for different sorts
  • scaling by processor, input size, sensitivity

• All are global / local hybrids
  • the local part is hard to implement and model

[Chart: time per key (µs/key, 0-140) vs. N/P (16,384 to 1,048,576 keys per processor) for Bitonic, Column, Radix, and Sample sort on 32 and 1024 processors]

04/27/2009 CS267 Lecture 24 26

Outline

• Some Performance Laws

• Performance analysis

• Performance modeling

• Parallel Sorting: combining models with measurements

• Reading:
  • Chapter 3 of Foster's "Designing and Building Parallel Programs" (online text)
  • http://www-unix.mcs.anl.gov/dbpp/text/node26.html
  • Abbreviated as DBPP in this lecture

• David Bailey’s “Twelve Ways to Fool the Masses”

04/19/23 CS267, Yelick 27

Measuring Performance

• Performance criterion may vary with domain

• There may be limits on acceptable running time
  • E.g., a climate model must run 1000x faster than real time.

• Any performance improvement may be acceptable
  • E.g., faster on 4 cores than on 1.

• Throughput may be more critical than latency
  • E.g., number of images processed per minute (throughput) vs. total delay for one image (latency) in a pipelined system.

• Execution rate, possibly per unit cost
  • E.g., GFlop/s, GFlop/s/$, or GFlop/s/Watt

• Parallel scalability (speedup or parallel efficiency)

• Percent relative to best possible (some kind of peak)

04/19/23 CS267, Yelick 28

Amdahl’s Law (review)

• Suppose only part of an application seems parallel

• Amdahl's law:
  • let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable

• P = number of processors

Speedup(P) = Time(1)/Time(P)

<= 1/(s + (1-s)/P)

<= 1/s

• Even if the parallel part speeds up perfectly, performance is limited by the sequential part
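An illustrative calculation (numbers ours, not from the slides): with s = 0.05 and P = 1024, Speedup <= 1/(0.05 + 0.95/1024) ≈ 19.6, barely better than the P-independent bound 1/s = 20, so over 98% of the machine's capacity is idle.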

04/19/23 CS267, Yelick 29

Amdahl’s Law (for 1024 processors)

Does this mean parallel computing is a hopeless enterprise?

Source: Gustafson, Montry, Benner

04/19/23 CS267, Yelick 30

Scaled Speedup

See: Gustafson, Montry, Benner, “Development of Parallel Methods for a 1024 Processor Hypercube”, SIAM J. Sci. Stat. Comp. 9, No. 4, 1988, pp.609.

04/19/23 CS267, Yelick 31

Scaled Speedup (background)

04/19/23 CS267, Yelick 32

Little’s Law

• Latency vs. Bandwidth
  • Latency is physics (wire length)
    • e.g., the network latency on the Earth Simulator is only about 2x the speed-of-light delay across the machine room

• Bandwidth is cost:

• add more cables to increase bandwidth (over-simplification)

• Principle (Little's Law): the relationship of a production system in steady state is:

Inventory = Throughput × Flow Time

• For parallel computing, Little's Law gives the concurrency required to be limited by bandwidth rather than latency:

  Required concurrency = Bandwidth x Latency   (the bandwidth-delay product)

04/19/23 CS267, Yelick 33

Little’s Law

• Example 1: a single processor
  • If the latency to memory is 50 ns, and the bandwidth is 5 GB/s (0.2 ns/byte = 12.8 ns per 64-byte cache line)
  • The system must support 50/12.8 ≈ 4 outstanding cache-line misses to keep things balanced (run at bandwidth speed)
  • An application must be able to prefetch 4 cache-line misses in parallel (without dependencies between them)

• Example 2: 1000-processor system
  • 1 GHz clock, 100 ns memory latency, 100 words of memory in data paths between CPU and memory
  • Main memory bandwidth is ~ 1000 x 100 words x 10^9/s = 10^14 words/sec
  • To achieve full performance, an application needs ~ 10^-7 s x 10^14 words/s = 10^7-way concurrency (some of that may be hidden in the instruction stream)

04/19/23 CS267, Yelick 34

In Performance Analysis: Use more Data

• Whenever possible, use a large set of data rather than one or a few isolated points. A single point has little information.

• E.g., from DBPP:
  • Serial algorithm scales as N + N^2
  • Speedup of 10.8 on 12 processors with problem size N = 100
  • Case 1: T = N + N^2/P
  • Case 2: T = (N + N^2)/P + 100
  • Case 3: T = (N + N^2)/P + 0.6*P^2

• All have speedup ~10.8 on 12 procs

• Performance graphs (n = 100, 1000) show differences in scaling
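A small C check of the three cases (illustrative; it just evaluates the formulas above at N = 100, with speedup = T(1)/T(P)):

#include <stdio.h>

int main(void) {
    const double N = 100.0, work = N + N * N;        /* serial work, N + N^2 */
    const int procs[] = {1, 4, 12, 100, 1000};
    for (int i = 0; i < 5; i++) {
        int P = procs[i];
        double t1 = N + N * N / P;                   /* Case 1 */
        double t2 = work / P + 100.0;                /* Case 2 */
        double t3 = work / P + 0.6 * P * P;          /* Case 3 */
        printf("P=%4d  S1=%7.2f  S2=%7.2f  S3=%7.2f\n",
               P, work / t1, (work + 100.0) / t2, (work + 0.6) / t3);
    }
    return 0;
}

All three print a speedup near 10.8 at P = 12, yet they scale very differently at larger P.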

04/19/23 CS267, Yelick 35

Example: Immersed Boundary Simulation

Joint work with Ed Givelberg, Armando Solar-Lezama

[Chart: time per timestep (seconds, 0-100) vs. # procs for Power3/SP at 256^3 and 512^3, and P4/Myrinet at 512^2 x 256]

• Using Seaborg (Power3) at NERSC and DataStar (Power4) at SDSC
• How useful is this data? What are ways to make it more useful/interesting?

04/19/23 CS267, Yelick 36

Performance Analysis

[Chart: percentage time breakdown (0-100%) for 256^3 on 2 to 64 processors and 512^3 on 32 to 128 processors, by phase: move, Unpack U, Send velocities, Pack U, Copy fluid U, Inverse FFTs, Solve Xform Eqns, Forward FFTs, Upwind, Exchange ghost, Unpack F, Set F = 0, Send F, Pack F, Spread F, Compute F]

04/19/23 CS267, Yelick 37

Building a Performance Model

• Based on measurements/scaling of components
• FFT time is: 5 n log n flops x time per flop (measured for FFT)
• Other costs are linear in either material or fluid points
• Measure constants:
  a) # flops/point (independent of machine or problem size)
  b) flops/sec (measured per machine, per phase)
  • Time is: (a / b) x #points
• Communication done similarly:
  • Find formula for message size as function of problem size
  • Check the formula using tracing of some kind
  • Use model to predict running time: latency + message size / bandwidth

04/19/23 CS267, Yelick 38

A Performance Model

• 512^3 in < 1 second per timestep not possible
• Primarily limited by bisection bandwidth

Performance Model Validation

[Chart: total time (seconds, 0.1-1000, log scale) vs. # procs; model vs. actual for the 256^3 and 512^3 problems]

04/19/23 CS267, Yelick 39

Model Success and Failure

04/19/23 CS267, Yelick 40

OSKI SPMV: What We Expect

• Assume:
  • Cost(SpMV) = time to read matrix
  • 1 double-word = 2 integers
  • r, c in {1, 2, 4, 8}
  • CSR: 1 int / non-zero
  • BCSR(r x c): 1 int / (r*c non-zeros)

• As r*c increases, speedup should:
  • Increase smoothly
  • Approach 1.5 (eliminate all index overhead)

Speedup(r, c) = T_CSR / T_BCSR(r, c) = 1.5 / (1 + 0.5/(r*c))
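For example (an illustrative evaluation of the formula): at r x c = 2x2 the predicted speedup is 1.5/(1 + 0.5/4) ≈ 1.33, and at 4x4 it is 1.5/(1 + 0.5/16) ≈ 1.45, approaching the 1.5 limit as the index overhead is amortized away.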

04/19/23 CS267, Yelick 41

What We Get (The Need for Search)

[Heatmaps: measured Mflop/s for each r x c register block size; Reference (1x1 CSR) vs. Best: 4x2]

04/19/23 CS267, Yelick 42

Using Multiple Models

04/19/23 CS267, Yelick 43

Multiple Models

04/19/23 CS267, Yelick 44

Multiple Models

Extended Example: Using Performance Modeling (LogP) to Explain Data

Application to Sorting

Deriving the LogP Model

° Processing
  - powerful microprocessor, large DRAM, cache => P

° Communication
  + significant latency (100's - 1000's of cycles) => L
  + limited bandwidth (1 - 5% of memory bw) => g
  + significant overhead (10's - 100's of cycles), on both ends => o
  - no consensus on topology => should not exploit structure
  + limited network capacity
  - no consensus on programming model => should not enforce one

LogP

[Diagram: P processor/memory modules connected by an interconnection network; the network accepts a limited volume of messages (L/g to or from a processor); the parameters L (latency), o (overhead), and g (gap) are marked]

• L: latency in sending a (small) message between modules

• o: overhead felt by the processor on sending or receiving a message

• g: gap between successive sends or receives (1/BW)

• P: number of processor/memory modules

Using the LogP Model

° Send n messages from proc to proc in time

2o + L + g (n-1)

– each processor does o*n cycles of overhead

– has (g-o)(n-1) + L available compute cycles

° Send n total messages from one to many in same time

° Send n messages from many to one in same time
  - all but L/g processors block, so fewer available cycles, unless scheduled carefully

[Timeline: send overhead o, latency L, receive overhead o; successive messages separated by the gap g]
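A tiny C helper evaluating the formulas on this slide (a sketch, not part of the original codes; times are in the same units as L, o, and g):

typedef struct { double L, o, g; } logp_t;

/* Time for one processor to stream n small messages to another. */
double logp_stream(logp_t m, int n) {
    return 2 * m.o + m.L + m.g * (n - 1);
}

/* Sender-side overhead, and the slack left for computation while sending. */
double logp_send_overhead(logp_t m, int n) { return m.o * n; }
double logp_compute_slack(logp_t m, int n) { return (m.g - m.o) * (n - 1) + m.L; }

For example, with the CM-5 parameters quoted later (L = 6, o = 2.2, g = 4 µs), logp_stream((logp_t){6, 2.2, 4}, 100) gives 2*2.2 + 6 + 4*99 = 406.4 µs.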

Use of the LogP Model (cont)

° Two processors sending n words to each other (i.e., exchange)

2o + L + (n-1) max(g, 2o)

° P processors each sending n words to all processors (n/P each) in a static, balanced pattern without conflicts , e.g., transpose, fft, cyclic-to-block, block-to-cyclic

exercise: what’s wrong with the formula above?

Assumes optimal pattern of send/receive, so could underestimate time


LogP "philosophy"

• Think about:

  - mapping of N words onto P processors

  - computation within a processor, its cost, and balance

  - communication between processors, its cost, and balance

• given a characterization of processor and network performance

• Do not think about what happens within the network

This should be good enough!

Typical Sort

Exploits the n = N/P grouping

° Significant local computation

° Very general global communication / transformation

° Computation of the transformation

Costs Split-C (UPC predecessor) Operations

• Read, Write (x = *G, *G = x): 2(L + 2o)
• Store (*G :- x): L + 2o
• Get (x := *G): o to issue; ...; sync(): o; data back after ~2L + 2o; issues can be pipelined with interval g
• Bulk store (n words, packed several words per message): 2o + (n-1)g + L
• Exchange: 2o + 2L + (n - L/g) max(g, 2o)
• One to many
• Many to one

LogP model

• CM-5:
  • L = 6 µs, o = 2.2 µs, g = 4 µs
  • P varies from 32 to 1024

• NOW:
  • L = 8.9, o = 3.8, g = 12.8
  • P varies up to 100

• What is the processor performance?
  • Application-specific
  • 10s of Mflops for these machines

04/19/23 CS267, Yelick 54

LogP Parameters Today

[Bar chart: 4*o + 2*L and g for each of the platforms listed below (scale 0-100)]

Millennium (MPICH/M2K)

SP Loopback (MPI)

SP (MPI)

Quadrics loopback (MPI, ORNL)

Quadrics (MPI,ORNL)

Quadrics (Shmem, ORNL)

Quadrics (UPC, ORNL)

Myrinet(GM)

MVICH/Giganet

VIPL/Giganet/VIA

MVICH/M-VIA/Syskonnect

VIPL/M-VIA/SysKonnect

T3E Loopback (MPI/NERSC)

T3E (MPI, NERSC)

T3E UPC (NERSC)

Local Computation Parameters - Empirical

Parameter     Operation                          µs per key                                  Sort
Swap          Simulate cycle butterfly per key   0.025 lg N                                  Bitonic
mergesort     Sort bitonic sequence              1.0
scatter       Move key for cyclic-to-block       0.46
gather        Move key for block-to-cyclic       0.52 if n <= 64K or P <= 64, 1.1 otherwise  Bitonic & Column
local sort    Local radix sort (11 bit)          4.5 if n < 64K, 9.0 - (281000/n) otherwise
merge         Merge sorted lists                 1.5                                         Column
copy          Shift key                          0.5
zero          Clear histogram bin                0.2                                         Radix
hist          Produce histogram                  1.2
add           Produce scan value                 1.0
bsum          Adjust scan of bins                2.5
address       Determine destination              4.7
compare       Compare key to splitter            0.9                                         Sample
localsort8    Local radix sort of samples        5.0
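Two of these rows as C helpers, to show how the table feeds the model (a direct transcription of the constants above; the function names are ours):

#include <math.h>

/* Predicted cost of the local 11-bit radix sort, in µs per key (n = keys per processor). */
double local_sort_us_per_key(double n) {
    return (n < 65536.0) ? 4.5 : 9.0 - 281000.0 / n;
}

/* Predicted cost of the bitonic swap step: 0.025 * lg N µs per key. */
double swap_us_per_key(double N) {
    return 0.025 * log2(N);
}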

Odd-Even Merge - classic parallel sort

N values to be sorted: A0 A1 A2 A3 ... AM-1  B0 B1 B2 B3 ... BM-1

Treat as two lists of M = N/2 and sort each separately

Redistribute into even and odd sublists: A0 A2 ... AM-2  B0 B2 ... BM-2  and  A1 A3 ... AM-1  B1 B3 ... BM-1

Merge into two sorted lists: E0 E1 E2 E3 ... EM-1  and  O0 O1 O2 O3 ... OM-1

Pairwise swaps of Ei and Oi will put it in order (see the sketch below)
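A sequential C rendering of the comparator structure just described (Batcher's odd-even merge; illustrative, assuming the length is a power of two; function names are ours, not from the slides):

static void cswap(int *a, int *b) { if (*a > *b) { int t = *a; *a = *b; *b = t; } }

/* Merge the subsequence x[lo], x[lo+r], x[lo+2r], ... lying below index hi,
 * assuming its first and second halves are already sorted. */
static void oddeven_merge(int *x, int lo, int hi, int r) {
    int step = r * 2;
    if (step < hi - lo) {
        oddeven_merge(x, lo, hi, step);        /* merge the even sublists -> E */
        oddeven_merge(x, lo + r, hi, step);    /* merge the odd sublists  -> O */
        for (int i = lo + r; i + r < hi; i += step)
            cswap(&x[i], &x[i + r]);           /* pairwise swaps of Ei and Oi  */
    } else {
        cswap(&x[lo], &x[lo + r]);
    }
}

/* Odd-even mergesort of x[lo..hi-1], with hi - lo a power of two. */
void oddeven_mergesort(int *x, int lo, int hi) {
    if (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;
        oddeven_mergesort(x, lo, mid);
        oddeven_mergesort(x, mid, hi);
        oddeven_merge(x, lo, hi, 1);
    }
}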

Where’s the Parallelism?

[Figure: the merged lists E0 E1 E2 E3 ... EM-1 and O0 O1 O2 O3 ... OM-1; the recursion performs one merge of size N built from two merges of size N/2, four of size N/4, and so on, so each level's compare-swaps can run in parallel]

Mapping to a Butterfly (or Hypercube)

[Figure: the two sorted lists A0 A1 A2 A3 and B0 B1 B2 B3 are mapped onto a butterfly (or hypercube); the order of one list is reversed via the cross edges (B3 B2 B1 B0), so the two sorted sublists form a bitonic sequence; pairwise swaps on the way back sort it, e.g., 2 3 4 8 7 6 5 1 becomes 1 2 3 4 5 6 7 8]

Bitonic Sort with N/P per node

all_bitonic(int A[PROCS]::[n])
    sort(tolocal(&A[ME][0]), n, 0)           /* local sort of this processor's n keys      */
    for (d = 1; d <= logProcs; d++)
        for (i = d-1; i >= 0; i--) {
            swap(A, T, n, pair(i));          /* exchange keys with partner in dimension i  */
            merge(A, T, n, mode(d, i));      /* merge own and received keys                */
        }
    sort(tolocal(&A[ME][0]), n, mask(i));

A bitonic sequence decreases and then increases (or vice versa).
Bitonic sequences can be merged like monotonic sequences.

Bitonic: Breakdown

[Charts: predicted and measured time per key (µs/key, 0-80) vs. N/P from 16,384 to 1,048,576, broken down into Localsort, Swap, Mergesort, Remap C-B, and Remap B-C phases; P = 512, random keys]

Bitonic: Effect of Key Distributions

[Chart: time per key (µs/key, 0-45) vs. key entropy (0, 6, 10, 17, 25, 31 bits), broken down into Local Sort, Merge Sort, Swap, Remap C-B, and Remap B-C; P = 64, N/P = 1M]

04/19/23 CS267, Yelick 62

Sequential Radix Sort: Counting Sort

• Idea: build a histogram of the keys and compute position in answer array for each element

A = [3, 5, 4, 1, 3, 4, 1, 4]

• Make temp array B, and write values into position

[Figure: histogram of A over the key values 1-5 and the resulting positions used to fill B]