Work-Efficient Parallel Skyline Computation for the GPU
Authors: Kenneth S. Bøgh, Sean Chester, Ira Assent (Data-Intensive Systems Group, Aarhus University).
Type: Research Paper
Presented by: Dardan Xhymshiti
Fall 2015
Outline:
- Introduction
- Skyline computation
- Related work
- GPU-Friendly partitioning
- The SkyAlign algorithm
- Experimental evaluation
Introduction Skyline operator:
First introduced:
Stephan Börzsönyi, Donald Kossmann, Konrad Stocker, 2001
(Universität Passau & Technische Universität München, Germany)
Introduction Skyline operator: Example:
1. You go for one day of skiing at one of Colorado's ski resorts. 2. You have spent a lot of money. 3. Your car breaks down. 4. You try to find the nearest and cheapest hotel. 5. You take your phone and launch an unfamiliar tourist application. 6. It lists many hotels in different locations with a variety of prices. 7. You want to find the CHEAPEST and the NEAREST one!?
Introduction Skyline operator: Example:
Query results:

Hotel    Price  Distance (miles)
Hotel A  $120   1.5
Hotel B  $140   1.0
Hotel C  $200   2.0
Hotel D  $150   0.7
...      ...    ...

(Figure: the hotels plotted by price against distance.)
Introduction Skyline operator: Example:
Query results: the same hotel table as above, now with the skyline highlighted.

Skyline set = {Hotel A, Hotel B, Hotel D}
Term: Dominance
Introduction Major problems:
Multidimensional data.
Computation intensive.
Comparison tuple-to-tuple (point-to-point).
What has been done so far: state-of-the-art sequential algorithms and parallel skyline query processing algorithms.
Parallel algorithms often try to achieve the device's maximum theoretical compute throughput.
That throughput is costly: the most efficient GPU algorithm, GSS, does up to 650 times
more work than the best sequential algorithm, even though it executes on 2688 cores.
For benchmark datasets, sequential algorithms can run 3x faster than the GPU ones.
Should we use the GPU or NOT?
Introduction: The high performance of sequential algorithms is achieved by using:
- Trees
- Recursion
- Strict ordering of computation
- Unpredictable branching
Many of these techniques are not compatible with the GPU.
Motivation:
Come up with a new algorithm, called SkyAlign, which:
- MAIN POINT: avoids point-to-point comparisons as much as possible;
- employs a globally static grid scheme to make the dataset GPU-friendly.
This algorithm does not maximize THROUGHPUT, but it is WORK-EFFICIENT.
Introduction
(Diagram: Dataset → Skyline set, computed sequentially vs. in parallel.)
Skyline computation: Notations
- P: the dataset
- n: the number of tuples (points) in the dataset
- d: the number of dimensions (attributes)
- p, q: arbitrary points
- p[i]: the value of the i-th attribute of the tuple (point) p

Example dataset:
Id  d1  d2  d3
1   1   2   3
2   2   2   1
3   2   4   1
4   3   3   3
Skyline computation: Skyline definitions
The skyline is defined through the concept of dominance.
Definition 1: Data point (tuple) p dominates data point q iff:
1. p[i] <= q[i] for all attributes i, and
2. p[i] < q[i] for at least one attribute i.
Definition 2 (used in this paper): Point p dominates point q, denoted p ≺ q, iff p is at least as good as q on every attribute and strictly better on at least one.
If neither p ≺ q nor q ≺ p, we say that p and q are incomparable.
Transitivity: if p ≺ q and q ≺ r, then p ≺ r.
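Definition 2 fits in a few lines; a minimal Python stand-in, assuming smaller values are better on every dimension (the hotel example's price and distance both work this way):

```python
def dominates(p, q):
    """True iff p dominates q: p is no worse than q on every attribute
    and strictly better on at least one (smaller values are better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))
```

With the hotel table, dominates((120, 1.5), (200, 2.0)) holds (Hotel A dominates Hotel C), while Hotels A and B are incomparable.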
Skyline computation: Measuring skyline work
Dominance Test (DT):
Determining whether data point p dominates data point q by comparing the two points attribute by attribute.
Counting the number of DTs performed measures the skyline work done.
Mask Test (MT):
Defining a bitmask for each point by comparing it with a skyline (pivot) point.
Transitivity through the pivot is then used to prune dominance tests.
Mask Tests are much cheaper than Dominance Tests.
Skyline computation: GPU computation
Tesla K80: 4992 CUDA cores. Threads are grouped into warps, usually of size 32. Warps are grouped into thread blocks. All threads within a warp execute the same instruction at the same time. Problem: branch divergence.
Related work: Partition-based skyline algorithms
Divide-and-Conquer:
Halves the dataspace recursively at the median of an arbitrarily chosen dimension and solves each half; afterwards the results are merged.
Sequential partition-based algorithms:
These algorithms employ recursive, point-based partitioning.
For each partition, a skyline point (the pivot) is found, and the other points are partitioned based on their relationship to the pivot.
The work performed varies with the pivot selected.
SkyAlign is a partition-based method, but it is not recursive and has no merge step.
Related work: Sort-based (and GPU) skyline algorithms
These obtain efficiency from monotonicity and transitivity.
Block-nested-loops algorithm (BNL):
Each unprocessed point p is compared with a DT against each point currently in the candidate window. If p is dominated, it is removed and control passes to the next point.
Sort-first skyline (SFS):
Sorts the data points by a monotone score prior to executing BNL. Once a point is added to the solution, it will never be removed.
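A compact sequential sketch of SFS, using the attribute sum as the (assumed) monotone sort key; because no later point can dominate an earlier one, the window only ever grows:

```python
def sfs_skyline(points):
    """Sort-First Skyline sketch: sort by a monotone score (here, the
    attribute sum), then scan BNL-style. A point added to the window
    is never removed."""
    def dominates(p, q):
        return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

    window = []
    for p in sorted(points, key=sum):
        if not any(dominates(s, p) for s in window):
            window.append(p)  # p is a confirmed skyline point
    return window
```

On the hotel data this returns Hotels A, B, and D.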
GNL:
Assigns a thread to every point p. p's thread compares p against the other data points to check the dominance criterion.
GPU-Friendly partitioning
The work efficiency of skyline algorithms comes from skipping DTs. To know which DTs we can skip for two data points p and q, we need to know whether they are incomparable. Transitivity helps here. Example:
1. Say that we have three data points: p, q, and a pivot v.
2. The relationship of p to v is represented with one bit per dimension i ∈ [0, d). (Mask Test)
3. The relationship of q to v is likewise represented with one bit per dimension.
4. The incomparability of p and q can often be detected just by comparing these two masks.
Mask Test (MT) is cheaper than Dominance Test (DT).
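A minimal sketch of this idea: a pivot-relative mask plus the mask-based incomparability check (function names are mine, not the paper's; smaller values are assumed better):

```python
def mask(p, pivot):
    """Bitmask with bit i set iff p[i] >= pivot[i], i.e. p is 'worse'
    than the pivot on dimension i."""
    m = 0
    for i, (a, v) in enumerate(zip(p, pivot)):
        if a >= v:
            m |= 1 << i
    return m

def masks_prove_incomparable(mp, mq):
    """If p is worse than the pivot on a dimension where q is better,
    then q < pivot <= p there, so p cannot dominate q; if the reverse
    holds on another dimension, the points must be incomparable."""
    return (mp & ~mq) != 0 and (mq & ~mp) != 0
```

For pivot (2, 2), the points (1, 3) and (3, 1) get masks 0b10 and 0b01, and the masks alone prove incomparability without spending a single DT.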
GPU-Friendly partitioning: Point-based methods
Point-based recursive partitioning methods use a quad-tree partitioning of the dataset and record skyline points, as they are found, in a tree.

(Figure: example quad tree over the skyline points/pivots A-F.)
GPU-Friendly partitioning
Each tree node contains a bitmask that records on which dimensions it is worse than its parent.
When processing a point p, the quad tree can be used to eliminate DTs for p:
p first builds a bitmask recording its dimension-wise relationship to the root of the tree (in this case C).
If all bits are set (all bits in the bitmask are 1), p is dominated; otherwise, only those children of the root (B, E) whose bitmasks, compared with p's, do not imply incomparability need to be visited.
A deeper tree permits skipping more DTs.
GPU-Friendly partitioning: Why recursive partitioning is not preferred
High divergence:
- Traversal: consider when the points in F are compared with the points in D. First a DT with the root E is performed for each point, generating bitmasks. These bitmasks are then used to determine which branches of D each point of F should traverse; the resulting paths often diverge.
- Partitioning: each partition has to be sub-partitioned relative to its own pivot, and the pivot needs to be a skyline point.
High dimensions:
- Quad-tree partitioning does not scale well with dimensionality.
GPU-Friendly partitioning: A static grid alternative
Each dimension is split at the quartiles computed from that dimension's values.
Three global pivots are defined, one corresponding to each quartile boundary.
For each point, two bitmasks are defined:
1. one bitmask relative to the median;
2. one bitmask relative to either the first or the third quartile.
First level: all points are partitioned by their relationship to the median of the dataset.
Second level: all points are partitioned by their relationship to either the first or the third quartile.
Do we need a third level?
GPU-Friendly partitioning: Definition of masks
Let:
- q1[i] be the first quartile for attribute i,
- q2[i] be the median for attribute i,
- q3[i] be the third quartile for attribute i.
We denote by:
- mp: the median-level-resolution bitmask for point p;
- vp: the quartile-level-resolution bitmask for point p.
For dimension i, bit i of mp is set iff p[i] is larger than or equal to the median on dimension i.
For dimension i, bit i of vp is set iff p[i] is larger than or equal to the relevant quartile on dimension i: the third quartile if bit i of mp is set, the first quartile otherwise.
Example (from the slide's figure): a point that is less than the x-median but greater than the y-median has only its y-bit set in its median mask. The same reasoning applies to the other points.
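The two-level assignment can be sketched as follows (a minimal stand-in; the bit conventions follow the definitions above, and the function name is my own):

```python
def assign_masks(p, q1, med, q3):
    """Two-level static-grid masks for point p:
    - median mask: bit i set iff p[i] >= med[i];
    - quartile mask: bit i set iff p[i] >= q3[i] when the median bit is
      set, or p[i] >= q1[i] when it is not."""
    m_med, m_qrt = 0, 0
    for i, v in enumerate(p):
        if v >= med[i]:
            m_med |= 1 << i
            if v >= q3[i]:
                m_qrt |= 1 << i
        elif v >= q1[i]:
            m_qrt |= 1 << i
    return m_med, m_qrt
```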
GPU-Friendly partitioning: Incomparability from statically-assigned masks
We can detect incomparability between two bitmasks by considering:
1. order (the number of 1-bits in a bitmask);
2. bitwise relationships.
The authors define such conditions for both resolutions; they rely on the transitivity property with respect to the median.
GPU-Friendly partitioning: Median-level resolution
Condition 1: check whether mq has any bit set that is not also set in mp. If so, there exists a dimension i such that p[i] < q2[i] <= q[i], and consequently q cannot dominate p.
Example: mq = 10 and mp = 01; q's x-bit is not set in mp, so q cannot dominate p.
GPU-Friendly partitioning: Median-level resolution
Condition 2: if mq has more 1s than mp does, then it necessarily contains a bit that is not set in mp, so the previous condition applies.
Example: mq = 11 and mp = 01.
GPU-Friendly partitioning: Median-level resolution
Condition 3: if mp and mq have the same order, then the only way every bit set in mq is also set in mp is for the masks to be identical. If the bitmasks are not identical, then neither p dominates q nor q dominates p, because both masks have the same order but different arrangements of 1s.
Example: mp = 01 and mq = 10 have the same order, but p and q are incomparable.
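The three median-level conditions reduce to popcounts and bitwise tests; a sketch (helper names are mine):

```python
def order(m):
    """Number of set bits in a mask (its 'order')."""
    return bin(m).count("1")

def q_cannot_dominate_p(mq, mp):
    """Condition 1: q is worse than the median on a dimension where p
    is better, so p[i] < median[i] <= q[i] and q cannot dominate p."""
    return (mq & ~mp) != 0

def equal_order_incomparable(mp, mq):
    """Condition 3: equal order but different masks means each mask has
    a bit the other lacks, so neither point can dominate the other."""
    return order(mp) == order(mq) and mp != mq
```

Condition 2 is the observation that order(mq) > order(mp) already implies q_cannot_dominate_p, which lets the algorithm compare a point only against points of smaller or equal order.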
GPU-Friendly partitioning: Quartile-level resolution
Analogous conditions are defined at quartile resolution; they apply to points that share the same median-level mask, since only then are their quartile masks relative to the same pivots.
The SkyAlign Algorithm
A global static partitioning of the dataset is computed, and each thread is assigned one data point. At a high level, SkyAlign proceeds in phases, one per mask order. In each phase, the remaining points of that order are compared, each by its own thread, to all points of smaller or equal order, using MTs and DTs as necessary.
After each phase, dominated points are removed and all surviving points of that order move into the solution.
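A sequential, single-pivot stand-in for this phase structure (the real algorithm runs one thread per point on the GPU and uses the two-level quartile masks; one median-style pivot is enough to show the order-by-order pruning):

```python
def skyline_by_order(points, pivot):
    """Phase-structured skyline sketch: bucket points by the order
    (popcount) of their pivot-relative mask; in each phase, a point is
    tested against confirmed points only when the MT cannot already
    rule dominance out."""
    def dominates(p, q):
        return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

    def mask(p):
        return sum(1 << i for i, (a, v) in enumerate(zip(p, pivot)) if a >= v)

    buckets = {}
    for p in points:
        buckets.setdefault(bin(mask(p)).count("1"), []).append(p)

    skyline = []  # (point, mask) pairs confirmed so far
    for k in sorted(buckets):
        candidates = []
        for p in buckets[k]:
            mp = mask(p)
            # MT: s can only dominate p if s's mask has no bit that p lacks
            if any(dominates(s, p) for s, ms in skyline if ms & ~mp == 0):
                continue
            candidates.append(p)
        # equal-order points with identical masks can still dominate each other
        candidates = [p for p in candidates
                      if not any(dominates(q, p) for q in candidates
                                 if q != p and mask(q) == mask(p))]
        skyline.extend((p, mask(p)) for p in candidates)
    return [p for p, _ in skyline]
```

On the four-point example dataset this returns the two skyline points while skipping the DTs that the masks already rule out.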
The SkyAlign Algorithm
Pre-filter: eliminates points that are easy to identify as not being in the skyline, using a threshold defined as the minimum over all points of their maximum attribute value (the smallest "largest value" in the data).
Each thread is responsible for one point and checks whether every value of that point is larger than the threshold; if so, the point is discarded.
Id  d1  d2  d3
1   1   2   3
2   2   2   1
3   2   4   1
4   3   3   3
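A sketch of the pre-filter on the table above (sequential stand-in for the per-thread check; smaller values are assumed better):

```python
def prefilter(points):
    """tau is the smallest per-point maximum. Any point whose smallest
    attribute exceeds tau is strictly worse on every dimension than the
    point attaining tau, hence dominated, and can be dropped early."""
    tau = min(max(p) for p in points)
    return [p for p in points if min(p) <= tau]
```

For the table, tau = 2 (from point 2), and point 4 = (3, 3, 3) is pruned before any phase runs.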
The SkyAlign Algorithm
Mask assignment: masks are assigned to each point, given the quartiles of the dataset for each dimension.
The SkyAlign Algorithm
Data sorting: sort the data points by the order of their masks.
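The sorting step can be sketched as a stable sort on mask popcount (a stand-in; on the GPU this would be a parallel sort):

```python
def sort_by_mask_order(points, masks):
    """Stable-sort points by the number of set bits in their masks, so
    each phase only needs to look at a prefix of smaller orders."""
    paired = sorted(zip(points, masks), key=lambda pm: bin(pm[1]).count("1"))
    return [p for p, _ in paired]
```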
Experimental evaluation
Evaluation is done by comparing SkyAlign against state-of-the-art sequential, multi-core, and GPU skyline algorithms.
Algorithms compared: BSkyTree, Hybrid, GSS, SkyAlign.
Testing uses synthetic data produced by the standard skyline dataset generator, which creates correlated, independent, and anticorrelated datasets.
By default: n = ... and d = ...
Environment: quad-core Intel i7 at 3.40 GHz with 16 GB of RAM, using an NVIDIA GTX Titan GPU.
Experimental evaluation: Run-time performance
Measure the execution time of the four algorithms on datasets with varying distribution, dimensionality, and cardinality.
1. Cardinality (d = 12)
2. Dimensionality (n = ...)
Experimental evaluation: Work-efficiency
Compare the performance of the four algorithms with respect to:
1. Dominance tests (DTs)
2. Overall work-efficiency
Thank You