A Dynamically Tuned Sorting Library
In 2004 International Symposium on Code Generation and Optimization (CGO’04)
Xiaoming Li, María Jesús Garzarán, and David Padua
University of Illinois at Urbana-Champaign
2
Motivation
Sorting
– Core operation in many applications, such as databases
– Well understood symbolic computing problem
Library generators such as ATLAS and SPIRAL have used empirical search to adapt to
– Architectural features of the target machine
– Size of the input data
But the performance of sorting also depends on the distribution of the values to be sorted
3
Motivation
Main difficulties in building a sorting library
1. Theoretical complexity is not sufficient to measure quality
• Cache effects, instructions executed
2. Performance depends on the characteristics of the input
• Amount and distribution of data to sort
• A single algorithm is not optimal for all possible input sets
4
Contributions
1. Identify the architectural and runtime factors that affect the performance of the sorting algorithms.
2. Use empirical search to identify the best shape and parameter values of a sorting algorithm.
3. Use machine learning and runtime adaptation to select the best sorting algorithm for a specific input set.
5
Contributions
[Figure: execution time (cycles) vs. standard deviation of the inputs; IBM Power 3, sorting 12 M keys (32-bit integers)]
6
Outline
– Sorting Algorithms
– Factors that determine performance
– The Library
– Evaluation
– Future Work
– Conclusions
7
Sorting Algorithms
Our sorting library contains
– Quicksort
– CC-Radix
– Multiway Merge
– Insertion Sort (for small partitions)
– Sorting Networks (for small partitions)
8
Quicksort
Divide and conquer in-place sorting algorithm
Our implementation includes Sedgewick’s optimizations:
– Set guardians at both ends of the input array.
– Eliminate recursion.
– Correctly select the pivot.
– Use insertion sort for small partitions.
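The optimizations listed above can be sketched as follows. This is an illustrative Python version, not the library's code: the function names, the `threshold` default of 16, and the use of median-of-three as the pivot rule are assumptions. Median-of-three also leaves a small key at the left end and a large key at the right end of each partition, which plays the role of the guardians.

```python
def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] in place; cheap when the range is almost sorted."""
    for i in range(lo + 1, hi + 1):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key


def quicksort(a, threshold=16):
    """Iterative quicksort; small partitions are finished by one final pass."""
    stack = [(0, len(a) - 1)]          # explicit stack replaces recursion
    while stack:
        lo, hi = stack.pop()
        if hi - lo + 1 <= threshold:
            continue                   # left for the final insertion sort
        mid = (lo + hi) // 2
        # Median-of-three: order a[lo] <= a[mid] <= a[hi].
        if a[mid] < a[lo]:
            a[mid], a[lo] = a[lo], a[mid]
        if a[hi] < a[lo]:
            a[hi], a[lo] = a[lo], a[hi]
        if a[hi] < a[mid]:
            a[hi], a[mid] = a[mid], a[hi]
        pivot = a[mid]
        i, j = lo, hi
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        # Push the larger side first so the smaller side is handled next.
        left, right = (lo, j), (i, hi)
        if j - lo < hi - i:
            stack.append(right)
            stack.append(left)
        else:
            stack.append(left)
            stack.append(right)
    insertion_sort(a, 0, len(a) - 1)
    return a
```

The `threshold` value is exactly the kind of parameter that an empirical search would tune per machine.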
9
Radix sort
Non-comparison algorithm
[Figure: one pass of radix sort, showing the vector to sort, the counter array, the accum. (accumulated counts) array, and the dest. vector]
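The counter, accum. and dest. vector steps shown on this slide correspond to one counting pass per digit. A minimal Python sketch, assuming non-negative integer keys and hypothetical names (`counting_pass`, `radix_sort`, `radix_bits`):

```python
def counting_pass(src, shift, radix_bits):
    """One stable pass on the digit located at bit offset `shift`."""
    buckets = 1 << radix_bits
    mask = buckets - 1
    # counter[d]: number of keys whose digit equals d
    counter = [0] * buckets
    for k in src:
        counter[(k >> shift) & mask] += 1
    # accum[d]: first position in the destination for digit d
    accum = [0] * buckets
    for d in range(1, buckets):
        accum[d] = accum[d - 1] + counter[d - 1]
    # scatter the keys into the destination vector
    dest = [0] * len(src)
    for k in src:
        d = (k >> shift) & mask
        dest[accum[d]] = k
        accum[d] += 1
    return dest


def radix_sort(keys, key_bits=32, radix_bits=8):
    """Least-significant-digit radix sort built from counting passes."""
    for shift in range(0, key_bits, radix_bits):
        keys = counting_pass(keys, shift, radix_bits)
    return keys
```

For example, digit counts of 2, 1, 2, 1 give accumulated start positions 0, 2, 3, 5.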
10
CC-radix (Cache Conscious Radix Sort)
Tries to exploit data locality in caches
Based on radix sort (Jimenez and Larriba – UPC)
CC-radix(bucket)
  if fits in cache (bucket) then
    radix sort (bucket)
  else
    sub-buckets = Reverse sorting(bucket)
    for each sub-bucket in sub-buckets
      CC-radix(sub-bucket)
    endfor
  endif
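The recursion above can be sketched in Python as follows. Several things here are assumptions made for illustration: "Reverse sorting" is read as partitioning on the most significant digit, "fits in cache" is modelled by a hypothetical `cache_keys` capacity expressed in keys, and keys are non-negative integers of at most `key_bits` bits.

```python
def lsd_radix_sort(keys, key_bits, radix_bits=8):
    """Plain radix sort on the low `key_bits` bits, one digit per pass."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        buckets = [[] for _ in range(1 << radix_bits)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        keys = [k for b in buckets for k in b]
    return keys


def cc_radix(keys, key_bits=32, radix_bits=8, cache_keys=1024):
    """Partition until a bucket fits in the (assumed) cache, then radix sort."""
    if len(keys) <= cache_keys or key_bits <= 0:
        return lsd_radix_sort(keys, key_bits, radix_bits)
    # Split on the most significant remaining digit.
    digit_bits = min(radix_bits, key_bits)
    shift = key_bits - digit_bits
    mask = (1 << digit_bits) - 1
    sub_buckets = [[] for _ in range(1 << digit_bits)]
    for k in keys:
        sub_buckets[(k >> shift) & mask].append(k)
    out = []
    for sub in sub_buckets:
        # Keys in one sub-bucket share their upper bits, so only the
        # lower `shift` bits still need sorting.
        out.extend(cc_radix(sub, shift, radix_bits, cache_keys))
    return out
```

Keys that share many high-order bits force extra partitioning passes before any bucket fits, which is one way the value distribution affects running time.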
11
Multiway Merge Sort
[Figure: p sorted subsets feeding a heap with 2*p - 1 nodes]
This algorithm exploits data locality very efficiently
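A minimal sketch of the merge step, assuming Python's `heapq` binary heap with one entry per subset rather than the 2*p - 1 node heap drawn on the slide. The function name and the use of the built-in `sorted` to produce the initial subsets are illustrative choices, not the library's.

```python
import heapq


def multiway_merge_sort(keys, p=4):
    """Sort by splitting into about p sorted subsets and merging via a heap."""
    if not keys:
        return []
    run_len = -(-len(keys) // p)       # ceiling division
    runs = [sorted(keys[i:i + run_len]) for i in range(0, len(keys), run_len)]
    # Heap entries: (current key, subset index, position inside the subset).
    heap = [(run[0], r, 0) for r, run in enumerate(runs)]
    heapq.heapify(heap)
    out = []
    while heap:
        key, r, i = heapq.heappop(heap)
        out.append(key)
        if i + 1 < len(runs[r]):
            heapq.heappush(heap, (runs[r][i + 1], r, i + 1))
    return out
```

Each subset is read sequentially and the heap stays small, which is where the locality comes from.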
12
Sorting algorithms for small partitions
Insertion sort
– Exploits locality in the cache line
Sorting networks
– Register blocking
13
Performance Comparison
[Figure: execution time (cycles) vs. standard deviation (100 to 10,000,000) for Intel MKL and Quicksort; Pentium III Xeon, 16 M keys (float)]
14
Outline
– Sorting Algorithms
– Factors that determine performance
– The Library
– Evaluation
– Future Work
– Conclusions
15
Factors that determine performance
Architectural Factors Considered
– Cache / TLB size
– Number of Registers
– Cache Line Size
Runtime Factors Considered
– Amount of data to sort
– Distribution of the data
16
Architectural: Cache Size/TLB Size
Tiling: Partition the data into subsets that fit in the cache
– Quicksort
• Using multiple pivots to tile
– CC-radix
• Fit each partition into cache
• The # active partitions < TLB size
– Multiway Merge Sort
• Fit the heap into cache
• Fit sorted subsets into cache
17
Architectural: Number of Registers
For small partitions, sort in place using the processor registers
Optimizations like unrolling and scheduling can be applied
cmp&swap(r0,r1) cmp&swap(r2,r3) cmp&swap(r1,r2) cmp&swap(r0,r3) cmp&swap(r4,r5) …..
cmp&swap(r0,r1) cmp&swap(r2,r3) cmp&swap(r4,r5) cmp&swap(r1,r2) cmp&swap(r0,r3)
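To make the cmp&swap idea concrete, here is a small Python sketch of a sorting network. It uses the standard five-comparator network for four inputs, which is not the sequence shown on the slide, and the names `NETWORK_4` and `sort4` are hypothetical.

```python
# Standard 5-comparator sorting network for 4 inputs.
# Each pair (i, j) is one cmp&swap: afterwards r[i] <= r[j].
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sort4(a, b, c, d):
    """Sort four values with a fixed, data-independent comparator sequence."""
    r = [a, b, c, d]                   # stands in for registers r0..r3
    for i, j in NETWORK_4:
        if r[i] > r[j]:
            r[i], r[j] = r[j], r[i]
    return tuple(r)
```

Because the sequence of comparators never depends on the data, comparators that touch disjoint registers, such as (0, 1) and (2, 3), can be reordered or overlapped by the scheduler.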
18
Architectural: Cache Line Size
Fanout = Cache Line Size
Increase cache line utilization when accessing children nodes
[Figure: children nodes laid out within one cache line]
19
Runtime: Amount and Distribution Shape
[Figure: execution time (cycles) vs. number of keys (millions)]
20
Runtime: Amount and Distribution Shape
[Figure: execution time (cycles) vs. number of keys (millions)]
21
Runtime: Standard Deviation
[Figure: execution time (cycles) vs. standard deviation of the keys; Pentium III Xeon, 16 M keys]
22
Outline
– Sorting Algorithms
– Factors that determine performance
– The Library
– Evaluation
– Future Work
– Conclusions
23
Library adaptation
Architectural Factors: Empirical Search
– Cache / TLB size
– Number of Registers
– Cache Line Size
Runtime Factors
– Distribution shape of the data: Does not matter
– Amount of data to sort, Standard Deviation: Machine learning and runtime adaptation
24
The Library
Building the library (installation time)
– Empirical Search
– Learning Procedure
• Use of training data
Running the library (runtime)
– Runtime Procedure
Runtime Adaptation (Learning Procedure and Runtime Procedure)
25
Runtime Adaptation: Learning Procedure
Goal function:
f: (N, E) → {Multiway Merge Sort, Quicksort, CC-radix}
N: amount of input data
E: the entropy vector
– Use N to choose between Multiway Merge or Quicksort
– Use the entropy and the Winnow algorithm to learn the best algorithm
• Output: weight vector (w) and threshold (θ)
26
Runtime Adaptation: Runtime Procedure
Sample the input array
Compute the entropy vector
Compute S = ∑_i w_i * entropy_i
If S ≥ threshold
  choose CC-radix
else
  choose others
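The runtime procedure above could look roughly like this in Python. The sampling scheme (every `stride`-th key), the 8-bit digit width, and the function names are assumptions; in the library the weights and the threshold would come from the learning procedure, whereas the values used in the usage lines below are made up.

```python
import math
from collections import Counter


def entropy_vector(sample, key_bits=32, radix_bits=8):
    """Entropy of each radix digit, estimated from a sample of the keys."""
    mask = (1 << radix_bits) - 1
    n = len(sample)
    vector = []
    for shift in range(0, key_bits, radix_bits):
        counts = Counter((k >> shift) & mask for k in sample)
        vector.append(-sum(c / n * math.log2(c / n) for c in counts.values()))
    return vector


def choose_algorithm(keys, weights, threshold, stride=4):
    """Return "CC-radix" when the weighted entropy sum reaches the threshold."""
    sample = keys[::stride]                      # sample the input array
    entropies = entropy_vector(sample)           # one entropy per digit
    s = sum(w * e for w, e in zip(weights, entropies))
    return "CC-radix" if s >= threshold else "others"


# Hypothetical weights and threshold, for illustration only.
print(choose_algorithm(list(range(256)) * 4, [1, 1, 1, 1], 1.0))
print(choose_algorithm([42] * 1000, [1, 1, 1, 1], 1.0))
```

Only a sample is scanned, so the cost of the decision stays small relative to the sort itself.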
27
Outline
– Sorting Algorithms
– Factors that determine performance
– The Library
– Evaluation
– Future Work
– Conclusions
28
Experimental Setup
Test Platforms:
– SGI R12000: 300 MHz; L1I/D=32KB; L2 = 4MB
– UltraSparcIII: 750 MHz; L1I/D=32KB, 64KB; L2 = 8MB
– PentiumIII Xeon: 550 MHz; L1I/D=16KB; L2 = 512KB
– IBM Power3: 375 MHz; L1I/D=64KB; L2 = 8MB
29
Sun UltraSparcIII: 12 M keys
[Figure: execution time (cycles per key) vs. standard deviation of the keys]
30
IBM Power3: 12 M keys
[Figure: execution time (cycles per key) vs. standard deviation of the keys]
31
Conclusions
Identify the architectural and runtime factors
Use empirical search to find the best parameter values
Our machine learning techniques prove to be quite effective:
– Always selects the best algorithm.
– The wrong decision introduces a 37% average performance degradation
– Overhead (average 5%, worst case 7%)
32
Future Work
1. Search in the space of sorting algorithms using high-level primitives
2. Extend sorting to include more data types
3. Include other comparison strategies (for example, less than to sort vectors, graphs, …)
4. Parallel algorithms
5. Explore other database operations, such as join.
A Memory Hierarchy Conscious and Self-tunable Sorting Library
To appear in 2004 International Symposium on Code Generation and Optimization (CGO’04)
Xiaoming Li, María Jesús Garzarán, and David Padua
University of Illinois at Urbana-Champaign
34
Empirical search for small partitions
Intel Pentium III Xeon
Configuration                     4M keys   16M keys   Threshold
Quicksort                         2.43s     10.89s     --
+ Insertsort at the end           2.17s     9.76s      20
+ Insertsort at each partition    2.32s     10.50s     20
+ Sorting networks                2.081s    9.20s      12
Sorting networks obtain the best performance improvement (average 15%)
35
Runtime: Amount and Distribution Shape
[Figure: execution time (cycles) vs. number of keys (millions)]
36
Performance vs. Distribution
37
Performance vs. Distribution
38
Performance vs. Sdev
39
Performance vs. Sdev
40
Multiway Merge Sort
42
Architectural: Number of Registers
44
Runtime: Distribution of Data
Distribution shapes: Uniform, Normal, Exponential, …
Distribution width:
– Standard deviation (sdev):
• Only good for one-peak distributions
• Expensive to calculate
– Entropy
• Represents the distribution of each bit
The goal is to distinguish the comparison-based algorithm from the radix-based one
45
Entropy
Goal: determine when CC-radix is best
Standard Deviation
– Expensive to compute
– Not a good metric for our goal
Compute the entropy of each digit
Entropy = ∑_i −P_i * log2 P_i,
where P_i = c_i/N; c_i = number of keys that have a particular value for that digit.
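As a worked instance of the formula above, assume (purely for illustration) N = 8 keys whose values for one digit occur with counts c = (4, 2, 1, 1):

```latex
\begin{aligned}
P &= \left(\tfrac{4}{8},\ \tfrac{2}{8},\ \tfrac{1}{8},\ \tfrac{1}{8}\right) \\
\text{Entropy} &= -\sum_i P_i \log_2 P_i
  = \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{8}\cdot 3 + \tfrac{1}{8}\cdot 3
  = 1.75 \text{ bits}
\end{aligned}
```

At the extremes, a digit on which all keys agree has entropy 0, and a digit spread uniformly over 2^b values has entropy b bits.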
46
Learning Procedure
f: (N, E) → {Multiway merge, CC-radix} is a linearly separable problem:
– f(x1, x2, …, xn) is a decision problem where there exists a weight vector w such that f(x) is true if w · x ≥ θ, or false otherwise
Use the machine learning Winnow algorithm to learn f: (N, E).
– The results of the learning are w and θ.
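For reference, a minimal sketch of the textbook Winnow update rule (the Winnow2 variant) on Boolean features. How the slides map N and the entropy vector onto features is not stated here, so the Boolean encoding, the threshold θ = n/2, the factor α = 2, and the function names are all assumptions.

```python
from itertools import product


def winnow_predict(w, theta, x):
    """Linear threshold decision: true when w . x >= theta."""
    return sum(wi * xi for wi, xi in zip(w, x)) >= theta


def winnow_train(samples, n, alpha=2.0, max_epochs=100):
    """Learn weights with multiplicative promotion and demotion."""
    w = [1.0] * n
    theta = n / 2
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in samples:
            if winnow_predict(w, theta, x) == label:
                continue
            mistakes += 1
            # Promote active features on a missed positive,
            # demote them on a false positive.
            factor = alpha if label else 1.0 / alpha
            w = [wi * factor if xi else wi for wi, xi in zip(w, x)]
        if mistakes == 0:
            break
    return w, theta


# Toy target (made up): the label is true when feature 0 or feature 2 is set.
samples = [(x, bool(x[0] or x[2])) for x in product((0, 1), repeat=4)]
w, theta = winnow_train(samples, 4)
```

The learned pair (w, θ) is then all that the runtime decision needs: one dot product and one comparison.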
47
Intel PIII Xeon
48
SGI R12000
49
Runtime: Amount of Data to Sort
Quicksort
– Cache misses increase with the amount of data.
CC-radix
– As the amount of data increases, CC-radix needs more partitioning passes.
Multiway Merge Sort
– Only shows advantages when the amount of data is large, i.e., when the gain in cache misses can compensate for the complexity of the algorithm.
50
Empirical Search
Adaptation to the architecture of the machine
– Quicksort and CC-radix
• The best configuration does not change significantly with the characteristics of the input data set.
• Quicksort, CC-radix:
- Use of insertion sort/sorting networks for small partitions
- Threshold to use them
• CC-radix
- Size of the radix
– Multiway Merge Sort
• The best configuration changes with the amount and the distribution of the input data.
• The best values will be searched during the learning procedure.
51
52
Multiway Merge Sort
[Figure: worked example of four sorted runs being merged through a heap]
53
Empirical Search
Example: Multiway Merge
• Search for the heap size that obtains the best performance:
- Different amounts of data and standard deviations
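An empirical search of this kind can be sketched as a loop that measures each candidate value and keeps the fastest. Everything named here (`empirical_search`, `time_once`, the `measure` hook, `repeats`) is hypothetical. A real search would repeat this for several data sizes and standard deviations, as the slide notes.

```python
import time


def time_once(sort_fn, data, param):
    """Wall-clock time of one run of sort_fn on a fresh copy of data."""
    start = time.perf_counter()
    sort_fn(list(data), param)
    return time.perf_counter() - start


def empirical_search(sort_fn, data, candidates, measure=time_once, repeats=3):
    """Return the candidate parameter with the lowest measured cost."""
    best_param, best_cost = None, float("inf")
    for param in candidates:
        # Take the minimum over a few repeats to reduce timing noise.
        cost = min(measure(sort_fn, data, param) for _ in range(repeats))
        if cost < best_cost:
            best_param, best_cost = param, cost
    return best_param
```

For the Multiway Merge example, `sort_fn` would take the heap size as `param` and `candidates` would list the heap sizes to try.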