
John Mellor-Crummey

Department of Computer Science
Rice University

[email protected]

Parallel Sorting

COMP 422 Lecture 19 5 April 2011


Topics for Today

• Introduction

• Issues in parallel sorting

• Sorting networks and Batcher’s bitonic sort

• Bubble sort and odd-even transposition sort

• Parallel quicksort


Why Study Parallel Sorting?

• One of the most common operations performed

• Closely related to the task of routing on parallel computers—e.g., the HPC Challenge RandomAccess benchmark


Sorting Algorithm Attributes

• Internal vs. external
  —internal: data fits in memory
  —external: uses tape or disk

• Comparison-based or not
  —comparison sort
    – basic operation: compare elements and exchange as necessary
    – Θ(n log n) comparisons to sort n numbers
  —non-comparison-based sort
    – e.g., radix sort, based on the binary representation of the data
    – Θ(n) operations to sort n numbers

• Parallel vs. sequential


Parallel Sorting Is Intrinsically Interesting

Different algorithms for different architecture variants

• Abstract parallel architecture
  —PRAM

• Network topology
  —hypercube
  —mesh

• Communication mechanism
  —shared address space
  —message passing

Today’s focus: parallel comparison-based sorting


Parallel Sorting Basics

• Where are the input and output lists stored?
  —we assume both input and output lists are distributed

• What is a parallel sorted sequence?
  —sequence partitioned among the processors
  —each processor's sub-sequence is sorted
  —all elements in Pj's sub-sequence < all elements in Pk's sub-sequence if j < k
    – the best process numbering can depend on network topology


Element-wise Parallel Compare-Exchange

When partitioning is one element per process:

1. Processes Pj and Pk send their elements to each other [communication step]
   —each process now has both elements: Pj holds (aj, ak), Pk holds (ak, aj)

2. Process Pj keeps min(aj, ak), and Pk keeps max(aj, ak) [comparison step]
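A minimal sketch of this step in Python with mpi4py (the library choice and the parameter names are assumptions; the slides do not prescribe an implementation):

    # Element-wise compare-exchange: one element per process.
    from mpi4py import MPI

    def compare_exchange(comm, my_value, partner, keep_min):
        # Communication step: exchange single elements with the partner.
        other = comm.sendrecv(my_value, dest=partner, source=partner)
        # Comparison step: the lower-ranked process keeps the minimum,
        # the higher-ranked process keeps the maximum.
        return min(my_value, other) if keep_min else max(my_value, other)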


Bulk Parallel Compare-Split

1. Send block of size n/p to partner

2. Each partner now has both blocks

3. Merge received block with own block

4. Retain only the appropriate half of the merged block: the lower-ranked process Pi retains the smaller values; Pj retains the larger values (a sketch follows)
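A Python sketch of the local work in steps 3-4, assuming both blocks arrive sorted (as the analysis on the next slide requires); compare_split and keep_smaller are hypothetical names:

    import heapq

    def compare_split(my_block, partner_block, keep_smaller):
        # Both blocks are sorted and of length n/p: merge them in O(n/p)
        # time, then retain either the smaller or the larger half.
        merged = list(heapq.merge(my_block, partner_block))
        half = len(my_block)
        return merged[:half] if keep_smaller else merged[half:]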


Basic Analysis

• Assumptions
  —Pi and Pj are neighbors
  —communication channels are bi-directional

• Element-wise compare-exchange: 1 element per processor
  —time = ts + tw

• Bulk compare-split: n/p elements per processor
  —after a compare-split on the pair of processors Pi and Pj, i < j
    – the smaller n/p elements are at processor Pi
    – the larger n/p elements are at Pj
  —time = ts + tw n/p
    – the merge takes O(n/p) time, as long as the partial lists are sorted


Sorting Network

• Network of comparators designed for sorting

• Comparator: two inputs x and y; two outputs x' and y'. Two types (see the sketch below):
  —increasing (denoted ⊕): x' = min(x,y) and y' = max(x,y)
  —decreasing (denoted Ө): x' = max(x,y) and y' = min(x,y)

• Sorting network speed is proportional to its depth

[Figure: schematic symbols for the ⊕ and Ө comparators, routing inputs (x, y) to (min, max) and (max, min) respectively]
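The two comparator types as plain functions (a Python sketch; a network wires many of these into columns):

    def increasing(x, y):    # ⊕: route inputs to (min, max)
        return (x, y) if x <= y else (y, x)

    def decreasing(x, y):    # Ө: route inputs to (max, min)
        return (y, x) if x <= y else (x, y)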


Sorting Networks

• Network structure: a series of columns

• Each column consists of a vector of comparators (in parallel)

• Sorting network organization: input wires feed a sequence of comparator columns; the final column's outputs are the sorted result [figure omitted]


Example: Bitonic Sorting Network

• Bitonic sequence: two parts, one increasing and one decreasing
  —〈1,2,4,7,6,0〉: first increases and then decreases (or vice versa)
  —a cyclic rotation of a bitonic sequence is also considered bitonic
    – 〈8,9,2,1,0,4〉: a cyclic rotation of 〈0,4,8,9,2,1〉

• Bitonic sorting network
  —sorts n elements in Θ(log² n) time
  —network kernel: rearranges a bitonic sequence into a sorted one


Bitonic Split

• Let s = 〈a0, a1, …, an-1〉 be a bitonic sequence such that
  —a0 ≤ a1 ≤ ··· ≤ an/2-1, and
  —an/2 ≥ an/2+1 ≥ ··· ≥ an-1

• Consider the following subsequences of s
    s1 = 〈min(a0, an/2), min(a1, an/2+1), …, min(an/2-1, an-1)〉
    s2 = 〈max(a0, an/2), max(a1, an/2+1), …, max(an/2-1, an-1)〉

• Sequence properties
  —s1 and s2 are both bitonic
  —∀x ∈ s1, ∀y ∈ s2: x < y

• Apply recursively on s1 and s2 to produce a sorted sequence

• Works for any bitonic sequence, even when its increasing and decreasing parts have unequal lengths
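The split is a single pass of pairwise min/max over the two halves; a minimal Python sketch (assumes an even-length input):

    def bitonic_split(s):
        # Returns (s1, s2): both bitonic, and every element of s1 is
        # no larger than every element of s2.
        half = len(s) // 2
        s1 = [min(s[i], s[i + half]) for i in range(half)]
        s2 = [max(s[i], s[i + half]) for i in range(half)]
        return s1, s2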


Splitting Bitonic Sequences - I

[Figure: pairwise min/max across the two halves yields s1 and s2; both are bitonic, and every element of s1 is less than every element of s2]


Splitting Bitonic Sequences - II

[Figure: a second example of the bitonic split, with the same properties: s1 and s2 are both bitonic, and every element of s1 is less than every element of s2]


Bitonic Merge

Sort a bitonic sequence through a series of bitonic splits

Example: use bitonic merge to sort 16-element bitonic sequence

How: perform a series of log2 16 = 4 bitonic splits
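A recursive Python sketch of the merge, assuming a power-of-two length; the ascending flag mirrors the ⊕/Ө distinction introduced on the next slide:

    def bitonic_merge(s, ascending=True):
        # Each level performs one bitonic split; log2(n) levels in all.
        if len(s) <= 1:
            return s
        half = len(s) // 2
        lo = [min(s[i], s[i + half]) for i in range(half)]
        hi = [max(s[i], s[i + half]) for i in range(half)]
        if not ascending:      # for a decreasing result, emit maxima first
            lo, hi = hi, lo
        return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)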


Sorting via Bitonic Merging Network

• A sorting network can implement the bitonic merge algorithm
  —the bitonic merging network

• Network structure
  —log2 n columns
  —each column
    – contains n/2 comparators
    – performs one step of the bitonic merge

• Bitonic merging network with n inputs: ⊕BM[n]
  —yields an increasing output sequence

• Replacing the ⊕ comparators with Ө comparators: ӨBM[n]
  —yields a decreasing output sequence


Bitonic Merging Network ⊕BM[16]

• Input: a bitonic sequence
  —input wires are numbered 0, 1, …, n-1 (shown in binary)

• Output: the sequence in sorted order

• Each column of comparators is drawn separately [network figure omitted]


Bitonic Sort

How do we sort an unsorted sequence using a bitonic merge?

Two steps

• Build a bitonic sequence

• Sort it using a bitonic merging network


Building a Bitonic Sequence

• Build a single bitonic sequence from the given sequence
  —any sequence of length 2 is a bitonic sequence
  —build a bitonic sequence of length 4
    – sort the first two elements using ⊕BM[2]
    – sort the next two using ӨBM[2]

• Repeatedly merge to generate larger bitonic sequences
  —⊕BM[k] & ӨBM[k]: bitonic merging networks of size k
  —a sequential sketch follows
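A sequential Python sketch of the whole recursion (power-of-two length assumed); bitonic_merge is the sketch given earlier:

    def bitonic_sort_seq(s, ascending=True):
        # Sort each half in opposite directions -- their concatenation is
        # bitonic -- then merge, mirroring the ⊕BM/ӨBM construction.
        if len(s) <= 1:
            return s
        half = len(s) // 2
        first = bitonic_sort_seq(s[:half], True)
        second = bitonic_sort_seq(s[half:], False)
        return bitonic_merge(first + second, ascending)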


Building a Bitonic Sequence

Input: sequence of 16 unordered numbers

Output: a bitonic sequence of 16 numbers


Bitonic Sort, n = 16

• First 3 stages create bitonic sequence input to stage 4

• Last stage (⊕BM[16]) yields sorted sequence


Complexity of Bitonic Sorting Networks

• Depth of the network is Θ(log² n)
  —log2 n merge stages
  —depth of the jth merge stage is log2 2^j = j
  —total depth = ∑_{j=1}^{log2 n} j = (log2 n + 1)(log2 n)/2 = Θ(log² n)

• Each stage of the network contains n/2 comparators

• Complexity of a serial implementation: Θ(n log² n)


Mapping Bitonic Sort to a Hypercube

Consider one item per processor

• How do we map the wires of the bitonic network onto a hypercube?

• In the earlier examples
  —compare-exchange occurs between two wires whenever their labels differ in exactly 1 bit

• Direct mapping of wires to processors
  —all communication is then nearest-neighbor (a sketch follows)
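In code, the wire-label observation gives each partner directly (a sketch; rank and dim stand for the process label and the stage's dimension):

    def hypercube_partner(rank, dim):
        # Wires whose labels differ only in bit 'dim' are compare-exchange
        # partners; on a hypercube that is the neighbor across dimension
        # 'dim', reached by flipping that bit.
        return rank ^ (1 << dim)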


Mapping Bitonic Merge to a Hypercube

Communication during the last merge stage of bitonic sort

• Each number is mapped to a hypercube node

• Each connection represents a compare-exchange


Mapping Bitonic Sort to Hypercubes

Communication in bitonic sort on a hypercube

• Processes communicate along the dimensions shown in each stage

• Algorithm is cost optimal w.r.t. its serial counterpart

• Not cost optimal w.r.t. the best sorting algorithm


Batcher’s Bitonic Sort in NESL

function merge(a) =
  if (#a == 1) then a
  else let
         halves = bottop(a);
         mins   = {min(x, y) : x in halves[0]; y in halves[1]};
         maxs   = {max(x, y) : x in halves[0]; y in halves[1]};
       in flatten({merge(x) : x in [mins, maxs]});

function bitonic_sort(a) =
  if (#a == 1) then a
  else let b = {bitonic_sort(x) : x in bottop(a)};
       in merge(b[0] ++ reverse(b[1]));

bitonic_sort([2, 3, -7, 6, 5, 22, -8, 12]);

Try it at: http://www.cs.rice.edu/~johnmc/nesl.html


Bubble Sort and Variants

Sequential bubble sort algorithm

Compares and exchanges adjacent elements of the sequence [algorithm listing omitted; a sketch follows]
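A minimal Python sketch of the standard procedure:

    def bubble_sort(a):
        # Pass k bubbles the k-th largest element into place by
        # compare-exchanging adjacent pairs; Θ(n²) comparisons overall.
        n = len(a)
        for k in range(n - 1):
            for i in range(n - 1 - k):
                if a[i] > a[i + 1]:
                    a[i], a[i + 1] = a[i + 1], a[i]
        return a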


Bubble Sort and Variants

• Bubble sort complexity: Θ(n2)

• Difficult to parallelize—algorithm has no concurrency

• A simple variant uncovers concurrency


Sequential Odd-Even Transposition Sort
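The slide's algorithm listing is missing; a minimal Python sketch of the sequential variant, whose compare-exchanges alternate between the two sets of adjacent pairs:

    def odd_even_transposition_sort(a):
        # n phases alternate between "odd" pairs (a1,a2), (a3,a4), ...
        # and "even" pairs (a0,a1), (a2,a3), ... (0-indexed); each phase
        # compare-exchanges disjoint adjacent pairs.
        n = len(a)
        for phase in range(n):
            start = 1 if phase % 2 == 0 else 0   # odd pairs first
            for i in range(start, n - 1, 2):
                if a[i] > a[i + 1]:
                    a[i], a[i + 1] = a[i + 1], a[i]
        return a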


Odd-Even Transposition Sort, n = 8

In each phase, n = 8 elements are compared


Odd-Even Transposition Sort

• After n phases of odd-even exchanges, the sequence is sorted

• Each phase of algorithm requires Θ(n) comparisons

• Serial complexity is Θ(n2)


Parallel Odd-Even Transposition

Consider one item per processor

• n iterations
  —in each iteration, each processor does one compare-exchange

• Parallel run time of this formulation is Θ(n)

• Cost optimal with respect to the base serial algorithm
  —but not with respect to the optimal one!


Parallel Odd-Even Transposition Sort

note: if the partner id is < 1 or > n, then skip the compare [pseudocode listing omitted; a sketch follows]
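A sketch with mpi4py (an assumed library choice), one element per process with ranks 0..n-1; out-of-range partners are skipped exactly as the note says:

    from mpi4py import MPI

    def parallel_odd_even(comm, my_value):
        rank, n = comm.Get_rank(), comm.Get_size()
        for phase in range(n):
            if rank % 2 == phase % 2:     # left end of this phase's pair
                partner, keep_min = rank + 1, True
            else:                         # right end of the pair
                partner, keep_min = rank - 1, False
            if 0 <= partner < n:          # boundary ranks sit the phase out
                other = comm.sendrecv(my_value, dest=partner, source=partner)
                my_value = min(my_value, other) if keep_min else max(my_value, other)
        return my_value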


Quicksort

• Popular sequential sorting algorithm
  —simplicity, low overhead, optimal average complexity

• Operation
  —select an entry in the sequence to be the pivot
  —divide the sequence into two sublists
    – one with all elements less than the pivot
    – the other with all elements greater

• Apply the process recursively to each sublist


Parallelizing Quicksort

• First approach: recursive decomposition
  —partition the list serially
  —handle each subproblem on a different processor

• Time for this algorithm is lower-bounded by Ω(n)!

• Can we parallelize the partitioning step?
  —can we use n processors to partition a list of length n around a pivot in O(1) time?

• Tricky on real machines


Practical Parallel Quicksort

Each processor initially responsible for n/p elements

• Shared-memory formulation
  —select the first pivot & broadcast it
  —each processor partitions its own data
  —globally rearrange the data into smaller and larger parts (in place)
  —recurse with a proportional number of processors on each part

• Message-passing formulation
  —partitioning
    – each processor first partitions its local portion of the array
    – determine which processes will be responsible for each partition
      (based on the sizes of the smaller-than-pivot and larger-than-pivot groups)
    – divide the data among the processor subsets responsible for each part
  —continue recursively


Data Parallel Quicksort in NESL

• Total work is O(n log n)
• Recursion depth is O(log n)
• Depth of each operation is constant
• Total depth is O(log n) as well

[NESL quicksort listing omitted]
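Below is a Python rendering of the data-parallel quicksort from Blelloch's article (cited in the references); in the NESL original, the three filters are constant-depth parallel steps and the two recursive calls execute in parallel:

    def quicksort(a):
        # Expected O(n log n) work and O(log n) depth when the recursive
        # calls and the three partitioning filters run in parallel.
        if len(a) < 2:
            return a
        pivot = a[len(a) // 2]
        lesser  = [x for x in a if x < pivot]
        equal   = [x for x in a if x == pivot]
        greater = [x for x in a if x > pivot]
        return quicksort(lesser) + equal + quicksort(greater)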


Other Sorting Algorithms

• Shellsort: another variant of bubble sort
  —two-stage process
    – log p rounds of long-distance exchanges
    – followed by rounds of odd-even transposition sort until done
  —key idea: the long-distance moves of the first stage reduce the number of rounds needed in the second stage

• Radix sort: in a series of rounds, sort elements into buckets by digit

• Bucket and sample sort
  —assumes items are evenly distributed over an interval
  —buckets represent evenly-sized subintervals

• Enumeration sort (see the sketch below)
  —determine the rank of each element
  —place it in the correct position
  —CRCW PRAM algorithm: n² processors sort in Θ(1) time
    – assumes all concurrent writes to a location deposit their sum
    – the n processes in column j compare element j against every other element; each writes a 1 into C[j] when its element should precede element j
    – then place A[j] into A[C[j]]
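A sequential Python simulation of the PRAM enumeration sort (the n² comparisons would form one parallel step, with the concurrent writes into C[j] combined by summation; tie-breaking by index is an added assumption so that ranks are distinct):

    def enumeration_sort(a):
        n = len(a)
        # C[j] = rank of a[j]: how many elements must precede it.
        c = [sum(1 for i in range(n) if (a[i], i) < (a[j], j))
             for j in range(n)]
        out = [None] * n
        for j in range(n):
            out[c[j]] = a[j]    # place A[j] into A[C[j]]
        return out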


Other Sorting Algorithms (Cont)

• Histogram sort
  —goal: divide the keys into p evenly-sized pieces
  —uses an iterative approach to do so
  —the initiating processor broadcasts k > p-1 splitter guesses
  —each processor determines how many keys fall into each bin
  —sum the histograms with a global reduction
  —one processor examines the guesses to see which are satisfactory
  —broadcast the finalized splitters and the number of keys for each processor
  —each processor sends its local data to the appropriate processors using all-to-all communication
  —each processor merges the chunks it receives
  —Kale and Solomonik improved this approach (IPDPS 2010)


References

• Adapted from slides “Sorting” by Ananth Grama

• Based on Chapter 9 of “Introduction to Parallel Computing” by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison Wesley, 2003

• “Programming Parallel Algorithms.” Guy Blelloch. Communications of the ACM, volume 39, number 3, March 1996.

• http://www.cs.cmu.edu/~scandal/nesl/algorithms.html#sort

• Edgar Solomonik and Laxmikant V. Kale. Highly Scalable Parallel Sorting. Proceedings of IPDPS 2010.