Category: Algorithms & Numerical Techniques
Poster AN15
Contact name: Lars Nyland: [email protected]
Parallel Sorting with Skiplists and Atomic Memory Operations
Lars S. Nyland, NVIDIA
Skiplists
• Hierarchical linked lists for ordered data
• Probabilistically sized nodes
• O(log n) find time, walking high to low
• O(log n) insertion time
• Reliably balanced
• About 1.5x the cost in pointers (1 data, ~2 next)
• Concurrent operations proven correct
[Figure: probability of allocation (0 to 5/8) versus number of "next" pointers in a node (1 to 8).]
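The probabilistic node sizing can be sketched in a few lines of C++. This is a minimal host-side illustration, not the poster's GPU code: it assumes each extra level is added with probability 1/2 (which matches the roughly 2 next pointers per node quoted above), and the names `random_height`, `mean_height`, and the cap `kMaxHeight` are hypothetical.

```cpp
#include <cstdint>
#include <random>

// Hypothetical cap on node height (assumption; the chart's axis runs 1 to 8).
constexpr int kMaxHeight = 8;

// Returns a node height in [1, kMaxHeight]. Height k is chosen with
// probability 2^-k; the cap absorbs the remaining tail.
int random_height(std::mt19937& rng) {
    int height = 1;
    // Each coin flip that comes up "heads" adds one more level.
    while (height < kMaxHeight && (rng() & 1u)) {
        ++height;
    }
    return height;
}

// Average height over `samples` draws, to check the pointer overhead.
double mean_height(std::uint32_t seed, int samples) {
    std::mt19937 rng(seed);
    long long total = 0;
    for (int i = 0; i < samples; ++i) {
        total += random_height(rng);
    }
    return static_cast<double>(total) / samples;
}
```

Under this assumption the mean height is close to 2, so a node holds about 3 words (1 data, ~2 next) against 2 words for a plain singly linked list, which is where the 1.5x figure comes from.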
Atomic Memory Ops
Compare-and-swap (CAS) is an atomic memory operation used to manipulate pointers concurrently. A CAS operation takes 3 inputs: an address A, a comparison value C, and a replacement value V. It compares the value in memory at location A (mem[A]) to C, and if they are equal, it stores V in mem[A]. It returns what was originally in mem[A]. Access to mem[A] is blocked during the CAS operation.
For linked structures like skiplists, CAS is used to “swing pointers” to insert nodes, ensuring that the list is never corrupt. The figure below shows two threads trying to insert a node at the same location, requiring updates to the same pointer. Each thread uses CAS to update the “next” pointer (orange arrows). Only one will succeed while the other fails and repeats.
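As a rough illustration of swinging a pointer with CAS, the sketch below inserts into a sorted singly linked list using `std::atomic` on the host. It is not the poster's GPU code: `Node`, `cas`, and `insert_sorted` are hypothetical names, and the list is assumed to be insert-only (no node is ever removed), which is what keeps the retry loop simple.

```cpp
#include <atomic>

struct Node {
    int value;
    std::atomic<Node*> next{nullptr};
    explicit Node(int v) : value(v) {}
};

// CAS as described in the text: if mem equals `expected`, store `replacement`.
// Either way, return what was in mem before the operation.
Node* cas(std::atomic<Node*>& mem, Node* expected, Node* replacement) {
    Node* observed = expected;
    // On failure, compare_exchange_strong writes the current value of mem
    // into `observed`; on success `observed` still equals the old value.
    mem.compare_exchange_strong(observed, replacement);
    return observed;
}

// Inserts `node` into the sorted list that starts after the sentinel `head`.
// Returns the number of failed CAS attempts (each one forces a retry).
int insert_sorted(Node& head, Node* node) {
    int failures = 0;
    for (;;) {
        // Find the insertion point: pred is the last node with a smaller value.
        Node* pred = &head;
        Node* succ = pred->next.load();
        while (succ != nullptr && succ->value < node->value) {
            pred = succ;
            succ = succ->next.load();
        }
        // Point the new node at its successor first, so the list is valid
        // the instant the predecessor's pointer is swung.
        node->next.store(succ);
        if (cas(pred->next, succ, node) == succ) {
            return failures;  // the pointer was unchanged, so the swing succeeded
        }
        ++failures;  // another thread got there first; search again
    }
}
```

When two threads race for the same predecessor, exactly one CAS sees the value it expected; the other gets back a different pointer, counts a failure, and repeats its search.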
Parallel Skiplist Insertion Sort
N items are inserted into a skiplist using P threads (N/P each), in these steps:
1. Allocate a new node with k “next” pointers for the next value.
2. Find the insertion point by chasing the pointers from high to low, staying at one level until the value is exceeded, then stepping down a level, and repeating.
3. From low to high, set the next pointers in the new node, and then swing the previous pointer to the new node using CAS.
By going from low to high, the skiplist is always valid, allowing other threads to chase and update. Figure 1 shows two nodes colliding on a second-level pointer (level 1 in the figure) after their first-level pointers (level 0) have been successfully set.
In total, there are O(N) successful CAS operations and O(N log N) pointers chased. The question is how many CAS operations fail.
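The three insertion steps can be sketched in host-side C++ as follows. This is an illustrative sketch under stated assumptions, not the poster's GPU implementation: the names `SkipNode`, `Skiplist`, `walk`, and `skiplist_sort`, the cap `kMaxLevel`, and the use of `int` keys are all hypothetical, and the list is assumed to be insert-only. The driver at the end inserts sequentially to stay short; in the parallel version each of the P threads would call `insert` on its own N/P values with its own random generator.

```cpp
#include <atomic>
#include <random>
#include <vector>

constexpr int kMaxLevel = 16;  // hypothetical cap on node height

struct SkipNode {
    int value;
    std::vector<std::atomic<SkipNode*>> next;  // k "next" pointers
    SkipNode(int v, int k) : value(v), next(k) {
        for (auto& p : next) p.store(nullptr);
    }
};

class Skiplist {
public:
    Skiplist() : head_(0, kMaxLevel) {}  // sentinel; its value is never compared
    Skiplist(const Skiplist&) = delete;
    Skiplist& operator=(const Skiplist&) = delete;
    ~Skiplist() {
        SkipNode* n = head_.next[0].load();
        while (n != nullptr) {
            SkipNode* after = n->next[0].load();
            delete n;
            n = after;
        }
    }

    // Inserts one value. Safe to call from many threads at once because
    // nodes are only ever added. Returns the number of failed CAS attempts.
    int insert(int value, std::mt19937& rng) {
        // Step 1: allocate a node with k next pointers, P(k) = 2^-k (capped).
        int k = 1;
        while (k < kMaxLevel && (rng() & 1u)) ++k;
        SkipNode* node = new SkipNode(value, k);

        // Step 2: chase pointers from high to low, remembering where the
        // search stopped on each level.
        SkipNode* preds[kMaxLevel];
        SkipNode* pred = &head_;
        for (int lvl = kMaxLevel - 1; lvl >= 0; --lvl) {
            walk(pred, lvl, value);
            preds[lvl] = pred;
        }

        // Step 3: link from low to high, retrying a level when its CAS fails.
        int failures = 0;
        for (int lvl = 0; lvl < k; ++lvl) {
            for (;;) {
                SkipNode* succ = walk(preds[lvl], lvl, value);
                node->next[lvl].store(succ);
                if (preds[lvl]->next[lvl].compare_exchange_strong(succ, node)) break;
                ++failures;  // another thread changed this pointer; search again
            }
        }
        return failures;
    }

    // Reads the values back in sorted order by following the level-0 pointers.
    std::vector<int> to_vector() const {
        std::vector<int> out;
        for (SkipNode* n = head_.next[0].load(); n != nullptr; n = n->next[0].load())
            out.push_back(n->value);
        return out;
    }

private:
    // Advances `pred` along level `lvl` while the next value is smaller than
    // `value`; returns the successor observed at the stopping point.
    static SkipNode* walk(SkipNode*& pred, int lvl, int value) {
        SkipNode* succ = pred->next[lvl].load();
        while (succ != nullptr && succ->value < value) {
            pred = succ;
            succ = succ->next[lvl].load();
        }
        return succ;
    }

    SkipNode head_;
};

// Sequential driver for illustration only.
std::vector<int> skiplist_sort(const std::vector<int>& input, unsigned seed = 1) {
    Skiplist list;
    std::mt19937 rng(seed);
    for (int v : input) list.insert(v, rng);
    return list.to_vector();
}
```

Summing the return values of `insert` across all threads would give the failed-CAS count that the question above asks about; in a single-threaded run it is always zero.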
Concurrency, Collisions & Communication
Thousands of concurrent threads attempt to insert values into the skiplist, retrying if they fail. At the start, when the list is short, there are many failures, but the skiplist doubles in size until its length exceeds the number of concurrent threads, after which failures become rare. Parallel skiplist insertion sort is an example of a lock-free parallel algorithm, since at least one thread makes progress at all times. The number of failed CAS operations is shown in Figure 2, indicating that far more than one thread succeeds on every insertion attempt.
[Figure: Thread j and Thread k each try to insert a node at the same location in a list of (value, next) nodes.]
Conclusions
1. Parallel skiplist insertion sort is work-efficient.
2. Performance is dominated by O(N log N) loads, not O(N) atomic-CAS operations.
3. Performance is limited by memory address divergence.
4. Skiplist traversal is accelerated by L2 hits.
5. CAS insertion failures drop dramatically as the number of items (N) exceeds the number of parallel threads (P).
[Chart: Sorting Time. Time to sort (seconds, 0 to 4.5) versus N, the number of elements to sort (0 to 60,000,000). Series: GTX580 time, GTX680 time, K20c time.]
Figure 3. Performance of Skiplist-insertion Sort. The time needed to insert N elements is shown. The scaling is as expected, O(N log N). We ran the same problems on 3 different GPUs (Fermi GF100, Kepler GK104, Kepler GK110) and found the performance differences surprisingly small.
References
Many thanks to William Pugh for the invention of skiplists. Skiplists and all the topics discussed in this poster are described in detail on Wikipedia. The topic areas are skiplists, atomic memory operations, comparison sorting, lock-free and wait-free parallel algorithms, along with a complete description of NVIDIA GPUs.
Figure 1. Nodes X and Y are being inserted. Their level-0 pointers have been set successfully, and now both X and Y are trying to swing A1, shown with the dotted lines, using CAS. One succeeds, say the CAS of Y1, so the level-1 pointers are now A1-Y1-C1. The thread inserting X1 will see the failure and return to find its level-1 insertion point, now between A1 and Y1, where it will again try to set A1 to point to X1 with CAS, with X1 pointing to Y1.
[Figure 1 diagram: nodes A, X, B, Y, and C with level-0 pointers A0, X0, B0, Y0, C0 and level-1 pointers A1, X1, Y1, C1.]
[Chart: Failed Insertion Rate. Failing insertions, by percent (0.1% to 100.0%) versus number of elements sorted (N, 32768 to 33554432). Series: gtx580 overhead, gtx680 overhead, K20c overhead.]
Figure 2. Overhead from failed CAS operations. When N is small (near P), CAS operations fail nearly as often as not, leading to retries. As N grows, the percentage of operations that fail drops to near 0%. The chart shows the failure rate when 25,000 concurrent threads are running.