
Contraction Clustering (RASTER): A Very Fast Big Data Algorithm for Sequential and Parallel

    Density-Based Clustering in Linear Time, Constant Memory, and a Single Pass

    Gregor Ulm∗, Simon Smith, Adrian Nilsson, Emil Gustavsson, Mats Jirstrand

Fraunhofer-Chalmers Research Centre for Industrial Mathematics, Chalmers Science Park, SE-412 88 Gothenburg, Sweden

Fraunhofer Center for Machine Learning, Chalmers Science Park, SE-412 88 Gothenburg, Sweden

    Abstract

Clustering is an essential data mining tool for analyzing and grouping similar objects. In big data applications, however, many clustering algorithms are infeasible due to their high memory requirements and/or unfavorable runtime complexity. In contrast, Contraction Clustering (RASTER) is a single-pass algorithm for identifying density-based clusters with linear time complexity. Due to its favorable runtime and the fact that its memory requirements are constant, this algorithm is highly suitable for big data applications where the amount of data to be processed is huge. It consists of two steps: (1) a contraction step which projects objects onto tiles and (2) an agglomeration step which groups tiles into clusters. This algorithm is extremely fast in both sequential and parallel execution. Our quantitative evaluation shows that a sequential implementation of RASTER performs significantly better than various standard clustering algorithms. Furthermore, the parallel speedup is significant: on a contemporary workstation, an implementation in Rust processes a batch of 500 million points with 1 million clusters in less than 50 seconds on one core. With 8 cores, the algorithm is about four times faster.

Keywords: clustering, unsupervised learning, algorithms, big data analytics, machine learning

    1. Introduction

The goal of clustering is to aggregate similar objects into groups in which objects exhibit similar characteristics. However, when attempting to cluster very large amounts of data, i.e. data in excess of 10^12 elements [1], two limitations of many well-known clustering algorithms become apparent. First, they operate under the premise that all available data fits into main memory, which does not necessarily hold true in a big data context. Second, their time complexity is unfavorable. For instance, the frequently used clustering algorithm DBSCAN [2] runs in O(n log n) in the best case, where n is the number of elements the input data consists of. There are standard implementations that use a distance matrix, but its memory requirements of O(n^2) make big data applications infeasible in practice.

∗Corresponding author
Email addresses: [email protected] (Gregor Ulm), [email protected] (Simon Smith), [email protected] (Adrian Nilsson), [email protected] (Emil Gustavsson), [email protected] (Mats Jirstrand)

In addition, the logarithmic factor is problematic in a big data context. Linear-time clustering methods exist [3, 4]. Unfortunately, their large constant factors make them inapplicable to big data clustering. In practice, even an algorithm with a seemingly favorable time complexity can be unsuitable because a large factor cannot just be conveniently ignored and may instead mean that an algorithm is inadequate. Thus, complexity analysis using O-notation can be deceptive, as the measured performance difference between two algorithms belonging to the same complexity class can be quite high. Generally speaking, many common clustering algorithms are unable to handle comparatively modest amounts of data and may struggle with or even fail on inputs consisting of just hundreds of clusters and thousands of data points.

In this paper, we introduce Contraction Clustering (RASTER).1 In the taxonomy presented in Fahad et al. [5], it is a grid-based clustering algorithm.



RASTER has been designed for big data. Its runtime scales linearly in the size of its input and has a very modest constant factor. In addition, it is able to handle cases where the available data do not fit into main memory because its memory requirements are constant for any fixed-size grid. The available data can therefore be divided into chunks and processed independently. The two distinctive phases of RASTER are parallelizable; the most computationally intensive first one trivially so, while the second one can be expressed in the divide-and-conquer paradigm of algorithm design. According to Shirkhorshidi et al. [6], distributed clustering algorithms are needed for efficiently clustering big data. However, our algorithm is a counterexample to that claim, as it exhibits excellent performance metrics even in single-threaded execution on a single machine. It also effectively uses multiple CPU cores. The novelty of RASTER is that it is a straightforward, easy to implement, and extremely fast big data clustering algorithm with intuitive parameters. It requires only one pass through the input data, which it does not need to retain. In addition, key operations like projecting to a tile and neighborhood lookup are performed in constant time.

The remainder of this paper is organized as follows. In Sect. 2 we introduce the problem description, with a particular focus on the hub identification problem. This is followed by a presentation of the RASTER algorithm for batch processing in Sect. 3, which includes the parallel version P-RASTER. An evaluation follows in Sect. 4. As the presentation of RASTER is done partly in a comparative manner, related work is pointed out when appropriate. However, further related work is discussed in Sect. 5. This is followed by future work in Sect. 6 and a conclusion in Sect. 7. The appendix contains a qualitative comparison of RASTER versus other common clustering methods in Appendix A, a separate comparison of RASTER and CLIQUE in Appendix B, correctness proofs for P-RASTER in Appendix C, and a detailed analysis of the runtime of P-RASTER in Appendix D.

This paper is a substantial revision of a previously published conference proceedings paper, which focused on a qualitative description of RASTER [7]. In the current paper, we added a substantial quantitative component to evaluate RASTER as well as the new variant RASTER′. In addition, we describe and evaluate parallel implementations of these two algorithms.

1 The chosen shorthand may not be immediately obvious: RASTER operates on an implied grid. Resulting clusters can be made to look similar to the dot matrix structure of a raster graphics image. Furthermore, the name RASTER is, using our later terminology, an agglomerated contraction of the words contraction and clustering. The name of this algorithm thus self-referentially illustrates how it contracts an arbitrary number of larger squares, so-called tiles, to single points or, in this case, letters.

    2. Problem Description

In this section we give an overview of the clustering problem in Sect. 2.1, describe the hub identification problem, which is the motivating use case behind RASTER, in Sect. 2.2, and highlight limitations of common clustering methods in Sect. 2.3.

    2.1. The Clustering Problem

Clustering is a standard approach in machine learning for grouping similar items, with the goal of dividing a data set into subsets that share certain features. It is an example of unsupervised learning, which implies that there are potentially many valid ways of clustering data points. Elements belonging to a cluster are normally more similar to other elements in it than to elements in any other cluster. If an element does not belong to a cluster, it is classified as noise. An element normally belongs to only one cluster. However, fuzzy clustering methods [8, 9] can identify non-disjoint clusters, i.e. elements may be part of multiple overlapping clusters. Yet, this is not an area we are concerned with in this paper. Instead, our focus is on problems that are in principle solvable by common clustering methods such as DBSCAN [2] or k-means clustering [10].

    2.2. The Hub Identification Problem

Figure 1: This sample illustrates the hub identification problem, where the goal is to find dense clusters in a noisy data set. RASTER identifies two dense clusters and ignores the less dense points in the center.

The main use case RASTER addresses is solving the hub identification problem. Consider being given a vast data set of unlabeled telemetry data in the form of GPS coordinates that trace the movement of vehicles. The goal is to identify hubs in vehicle transportation networks. More concretely, hubs are locations within a road network at which vehicles stop so that a particular action can be performed, e.g. loading and unloading goods at warehouses,


dropping off goods at delivery points, or passengers boarding and alighting at bus stops. Conceptually, a hub is a node in a vehicle transportation network, and a node is the center of a cluster. By knowing the location of hubs, as well as the IDs of vehicles that made use of any given hub, detailed vehicle transportation networks can be constructed from GPS data, and usage profiles of vehicles created. A particular feature of a hub is that it consists of a relatively large number of points in a relatively small area. We therefore also refer to them as tight, dense clusters. Generally, they cover an area that ranges from a few square meters to a few dozen square meters; the largest hubs cover a few hundred square meters. Hubs also tend to be a certain minimum distance apart from each other, which implies that if multiple hubs are identified that are close to each other, it is generally the result of a misclassification, as they would form only one hub in reality. Figure 1, which shows dense clusters in a noisy data set, illustrates what hubs may look like.

The input to the hub identification problem consists of GPS coordinates, given as latitude and longitude. Their precision is defined by the number of place values after the decimal point. For instance, the coordinate pair (40.748441, −73.985664) specifies the location of the Empire State Building in New York City with a precision of 11.1 centimeters. A seventh decimal place value would specify a given location with a precision of 1.1 centimeters, while truncating to five decimal place values would lower the precision to 1.1 meters. High-precision GPS measurements require special equipment, but consumer-grade GPS is only accurate to within about ten meters under open sky [11]. Altogether, there are three reasons why a reduced precision will lead to practically useful clusters. In addition to the aforementioned imprecision of GPS equipment, one has to take into account vehicle size as well as the size of hubs. The commercial vehicles from which our data originate may have a length ranging from a few meters, in the case of cars or vans, to up to 25 meters or more, in the case of a truck with an attached trailer. Furthermore, hubs cover an area that can reasonably be expected to be much larger than the size of a vehicle. Thus, for the purpose of clustering, lower-precision GPS coordinates could be used without losing a significant amount of information. In practice, a precision of one meter or less is often sufficient. The example just given may seem to imply that scaling, or reducing the precision, is based on powers of ten. However, this interpretation would not be correct, as one could use an arbitrary real number as a scaling factor as well.
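The relationship between decimal places and ground resolution quoted above can be made concrete with a few lines of Python; the only assumption is the approximate figure of 111 km per degree of latitude, and the helper name is ours.

# Approximate side length of a tile obtained by truncating GPS coordinates.
# One degree of latitude corresponds to roughly 111.1 km.
METERS_PER_DEGREE = 111_100

def resolution_in_meters(decimal_places: int) -> float:
    """Approximate precision (in meters) when keeping the given number of decimals."""
    return METERS_PER_DEGREE / 10 ** decimal_places

for places in (5, 6, 7):
    print(places, round(resolution_in_meters(places), 3), "m")
# prints: 5 decimals -> 1.111 m, 6 decimals -> 0.111 m, 7 decimals -> 0.011 m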

The hub identification problem emerged from working with huge real-world data sets. A common approach of clustering algorithms is to return labels of all input data points, indicating to which cluster they belong. This is also the approach taken by one of our comparative benchmarks (cf. Sect. 4). With another linear pass, clusters and their

labels can be extracted. This is a viable approach for smaller data sets. However, one of our goals was also to make the amount of data easier to handle and, for instance, reduce 10^12 points to a minuscule fraction of it that contains only the relevant information. Points that do not belong to a cluster are not of interest to our problem. Instead, RASTER retains counts of scaled inputs that are used to determine the approximate area of a cluster in the underlying data. Via reducing the precision of the input, the GPS canvas is therefore divided into a number of tiles (cf. Sect. 3.2), for each of which a count of the points it contains is maintained. The number of tiles is finite because the dimensions of the GPS canvas are finite, with a latitude ∈ [−90, +90] and a longitude ∈ [−180, 180]. Because the input points are not retained and the range of the input coordinates is finite, the memory requirements of RASTER are constant.

2.3. Limitations of two common clustering methods

In this subsection, we show how the limitations of two common clustering algorithms make them unsuitable for larger data sets, taking into account both runtime and memory considerations. While we focus on showing why the two standard clustering algorithms DBSCAN and k-means clustering are not suitable for big data clustering, the bigger point is that these limitations affect all other clustering algorithms that retain their inputs.

DBSCAN identifies density-based clusters. Its time complexity is O(n log n) in the best case. This depends on whether a query identifying the neighbors of a particular data point can be performed in O(log n). In a pathological case, or in the absence of a fast lookup query, its time complexity is O(n^2). DBSCAN is comparatively fast. Yet, when working with many billions or even trillions of data points, clustering becomes infeasible due to the logarithmic factor, provided all data even fits into main memory. We have found that on a contemporary workstation with 16 GB RAM, the scikit-learn implementation of DBSCAN cannot even handle one million data points. A less efficient implementation using a distance matrix would become infeasible with a much smaller number of data points already. Overall, depending on the implementation, DBSCAN is either too slow or consumes too much memory to be a suitable choice for clustering of big data.

In k-means clustering, the number of clusters k has to be known in advance. The goal is to determine k partitions of the input data. Two aspects make k-means clustering less suitable for our use case. First, when dealing with big data, estimating a reasonable k is non-trivial. Second, its time complexity is unfavorable. An exact solution requires O(n^(dk+1)), where d is the number of dimensions [12]. In our case, using 2-dimensional data, it is O(n^(2k+1)). Lloyd's algorithm [13], which uses heuristics, is likewise not applicable, as its time complexity is O(dnki), where i is the number of iterations until convergence is reached.


(a) Original data points (b) Significant tiles (c) Clustered tiles (d) Clustered points

Figure 2: High-level visualization of RASTER (best viewed in color). The original input is shown in a), followed by projection to tiles in b), where only significant tiles are retained. Tile-based clusters are visualized in c), which corresponds to RASTER. Clusters that are returned as collections of points are shown in d), which corresponds to the variant RASTER′.

In practice, i tends to be small and consequently does not pose much of a problem. On the other hand, with an increasing number of clusters, k affects the runtime more and more negatively. There have been recent efforts to improve k-means clustering [14, 15], but their time complexity is still highly unfavorable for huge data sets.

    3. RASTER

This section contains a thorough presentation of RASTER, starting with a high-level description of the algorithm in Sect. 3.1, a discussion of the concept of tiles as well as their role in creating clusters in Sect. 3.2, and an informal time-complexity analysis in Sect. 3.3. This is followed by an explanation of the variation RASTER′, which retains its input, in Sect. 3.4, as well as the parallel variant P-RASTER in Sect. 3.5. Lastly, we highlight some further technical details in Sect. 3.6.

    3.1. High-Level Description

The goal of RASTER is to reduce a very large number n of 2-dimensional points to a much more manageable number of coordinates that specify the approximate area of clusters in the input data. The input is not retained. Figure 2 provides a visualization. In more detail, the algorithm works as follows. It uses an implicit 2-dimensional grid of a coarser resolution than the input data. Each square of this grid is referred to as a tile; each point in the input data is projected to exactly one tile. A tile containing at least a user-specified threshold value τ of data points is labeled as a significant tile. Afterwards, clusters are constructed from adjacent significant tiles. RASTER creates clusters based on tiles; these clusters cover the area in which the clusters of the original input are located (cf. Fig. 2c). The advantage of this approach is that it provides approximate truth very quickly.

Algorithm 1 RASTER
input: data points, precision prec, threshold τ, distance δ, minimum cluster size µ
output: set of clusters clusters

 1: acc := ∅                              ▷ {(xπ, yπ): count}
 2: clusters := ∅                         ▷ set of sets
 3: for (x, y) in points do
 4:     (xπ, yπ) := project(x, y, prec)   ▷ O(1)
 5:     if (xπ, yπ) ∉ keys of acc then
 6:         acc[(xπ, yπ)] := 1
 7:     else
 8:         acc[(xπ, yπ)] += 1
 9: for (xπ, yπ) in acc do
10:     if acc[(xπ, yπ)] < τ then
11:         delete acc[(xπ, yπ)]
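To make the two phases concrete, the following is a minimal, unoptimized Python sketch of the contraction and agglomeration steps as described in this section and in Sect. 3.2 and 3.3. The function and variable names are ours, the distance is fixed to a Chebyshev distance of 1, and this is an illustration rather than the reference implementation.

from collections import defaultdict

def raster(points, prec, tau, mu):
    """Sketch of RASTER: contraction (projection and thresholding)
    followed by agglomeration of adjacent significant tiles."""
    scale = 10 ** prec

    # Contraction: project every point onto a tile and count points per tile.
    acc = defaultdict(int)                       # {(x_tile, y_tile): count}
    for x, y in points:
        acc[(int(x * scale), int(y * scale))] += 1

    # Keep only significant tiles; the input points are not retained.
    tiles = {t for t, count in acc.items() if count >= tau}

    # Agglomeration: depth-first grouping of significant tiles that are
    # adjacent under a Chebyshev distance of 1 (all eight neighbors).
    clusters = []
    while tiles:
        start = tiles.pop()
        cluster, stack = {start}, [start]
        while stack:
            cx, cy = stack.pop()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbor = (cx + dx, cy + dy)
                    if neighbor in tiles:        # O(1) hash-set lookup
                        tiles.remove(neighbor)
                        cluster.add(neighbor)
                        stack.append(neighbor)
        if len(cluster) >= mu:                   # drop clusters that are too small
            clusters.append(cluster)
    return clusters

For example, raster(points, prec=4, tau=5, mu=4) mirrors the threshold and minimum cluster size used for Table 2.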

RASTER keeps track only of significant tiles and ignores all points that were projected to those tiles. This leads to great memory efficiency, but it may not provide the user with enough information. After all, the resulting clusters of tiles can be imprecise, as they are delimited by the tiles in the implied grid.

The variant RASTER′ addresses this problem (cf. Fig. 2d). In addition to clusters and their constituent significant tiles, this variant also retains the original data points that were projected to those tiles. The downside of this increased amount of information is that RASTER′ does not use constant memory, unlike RASTER.

3.2. Tiles and Clusters

A key component of RASTER is the deliberate reduction of the precision of its input data. This is based on a projection of points to tiles and could, for instance, be achieved by truncating or rounding. The goal is the identification of clusters, which is attained via two distinct and consecutive steps: contraction and agglomeration. The contraction step first determines the number of input data points per tile and then discards all non-significant tiles. The agglomeration step constructs clusters out of adjacent significant tiles. To illustrate the idea of a tile, consider a grid consisting of squares of a fixed side length. A square may contain several coordinate points. Reducing the precision by one decimal digit means removing the last digit of a fixed-precision coordinate. For instance, with a chosen decimal precision of 2, the coordinates (1.005, 1.000), (1.009, 1.002), and (1.008, 1.006) are all truncated (contracted) to the tile identified by the corner point (1.00, 1.00). Thus, (1.00, 1.00) is a tile with three associated points. This tile would be classified as a significant tile with a threshold value of τ ≤ 3 and discarded otherwise. The previous example and the one mentioned in the motivating use case suggest truncating input values, which is implemented as a reduction of the precision of the data by an integer power of 10. The input, which consists of floating-point values, is scaled up and converted into integers. However, it is also possible to project input data points to tiles using arbitrary real numbers, except zero. This makes it possible to fine-tune the clustering results or improve performance. For instance, a slightly larger grid size may lead to coarser clusters, but it also entails an improved runtime, considering that the number of tiles depends on the chosen precision. With a lower precision value, the number of tiles is much lower, which leads to lowered constant memory requirements and thus an improved runtime of the agglomeration and contraction steps. In general, the number of tiles on a finite canvas of side lengths a and b with precision p is ab · 10^(2p). For practical purposes, p should be the lowest possible value that leads to the desired clustering results.
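As a small illustration of this projection with a concrete scaling factor, the sketch below maps the three example points above to the same tile; the helper name is ours, flooring coincides with decimal truncation for the positive coordinates used here, and a scale factor of 10^p reproduces truncation to p decimals.

import math

def project(x: float, y: float, scale: float) -> tuple:
    """Project a point onto its tile for an arbitrary non-zero scaling factor."""
    return (math.floor(x * scale), math.floor(y * scale))

# With a decimal precision of 2 (scale = 100), all three example points
# contract to the tile whose corner point is (1.00, 1.00):
for p in [(1.005, 1.000), (1.009, 1.002), (1.008, 1.006)]:
    print(project(*p, scale=100))    # (100, 100) in scaled integer coordinates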

In the subsequent agglomeration step, RASTER clusters are constructed, which consist of significant tiles that are at most a user-specified Manhattan or Chebyshev distance of δ steps apart.

In our implementations, we use a Chebyshev distance of 1, i.e. we take all eight neighbors of the current tile into account, meaning that significant tiles need to be directly adjacent. Yet, the distance parameter is arbitrary. One alternative would be to use a Manhattan distance > 1, which implies that a cluster could contain significant tiles without direct neighbors. The parameter µ specifies the minimum number of tiles a cluster has to contain to be recognized as such by the algorithm. This is of practical importance, as it filters out isolated significant tiles that do not constitute a hub. The reasoning behind this approach is that hubs generally cover multiple tiles, so very small clusters are interpreted as noise.

    3.3. Time-Complexity Analysis

In this subsection we present some explanations that accompany the RASTER pseudocode in Alg. 1, couched in an informal time-complexity analysis.2 The algorithm consists of three sequential loops. The first two for-loops constitute the contraction step and the subsequent while-loop the agglomeration step. Projecting to a tile consists of associating a data point p to a scaled value representing a tile. RASTER does not exhaustively check every possible tile value, but instead only retains tiles that were encountered while processing data. Due to the efficiency of hash tables, the first for-loop runs in O(n), where n is the number of input data points. Projection is performed in O(1), which is also the time complexity of the various hash table operations used. After the first for-loop, all points have been projected to a tile. The second for-loop traverses all keys of the hash table tiles. Only significant tiles are retained. The intermediate result of this loop is a hash table of significant tiles and their respective number of data points. Deleting an entry is an O(1) operation. At most, and only in the pathological case where there is exactly one data point per tile, there are n tiles. In any case, it holds that m ≤ n, where n is the number of input data points and m the number of tiles. Thus, this step of the algorithm is performed in O(m).

The subsequent while-loop performs the agglomeration step, which constructs clusters from significant tiles. A cluster is a set of the coordinates of significant tiles. There are at most m significant tiles. In order to determine the tiles a cluster consists of, take one tile from the set tiles and recursively determine all neighboring tiles in O(m) in a depth-first manner. Neighborhood lookup with a hash table is in O(1), as the locations of all its neighboring tiles are known.

2 Reference implementations of RASTER and its variants in several programming languages are available at https://github.com/fraunhoferchalmerscentre/raster.


(a) Four subdivisions: clusters c1 to c10. (b) Two subdivisions: clusters c1, c2, c3, c4, c5, c8, c9, c10. (c) Final clusters: c1, c2, c3, c4, c10.

Figure 3: RASTER can be effectively parallelized (figure best viewed in color). This figure shows how clusters that are separated by borders can be joined in parallel in log n steps, where n is the number of initial slices. In Fig. 3a there are multiple clusters that are separated by borders, which are successively joined as borders between slices are removed. The corner cases of clusters repeatedly crossing a border (e.g. c4, c5, c8 in Fig. 3a) and clusters crossing multiple borders are shown (cf. clusters c5, c7, c9 in Fig. 3a). As joining clusters is not a bottleneck in our use case, the idea presented here remains unimplemented. However, it may be instrumental when implementing RASTER for large-scale data processing (cf. Sect. 6). Our code iterates through candidate clusters in slices in a sequential manner from right to left; clusters that do not touch the border of a slice are excluded from this step.

For instance, when performing agglomerations with a Manhattan distance of 1, the neighbors of any coordinate pair (x, y) are the squares to its left, right, top, and bottom in the grid. One caveat regarding higher-dimensional data and how it affects neighborhood lookup should be stated, however. When taking two neighbors per dimension into account, the total number of lookup operations is 2d, which is suppressed in O-notation. Yet, this implies that RASTER is best suited for low-dimensional data. In summary, the third loop likewise runs in O(m). Consequently, the total time complexity is O(n).
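The constant-time neighborhood lookup can be illustrated with a short sketch; the helper names and the example set of significant tiles below are ours, and the Chebyshev case enumerates the 3^d − 1 surrounding tiles, of which there are eight for d = 2.

from itertools import product

def manhattan_neighbors(tile):
    """The 2d axis-aligned neighbors of a d-dimensional tile."""
    for axis in range(len(tile)):
        for step in (-1, 1):
            yield tuple(c + step if i == axis else c for i, c in enumerate(tile))

def chebyshev_neighbors(tile):
    """All 3^d - 1 tiles surrounding a d-dimensional tile."""
    for offset in product((-1, 0, 1), repeat=len(tile)):
        if any(offset):
            yield tuple(c + o for c, o in zip(tile, offset))

significant = {(4, 2), (5, 2), (9, 9)}
# Each membership test against the hash set is O(1).
print([n for n in chebyshev_neighbors((4, 2)) if n in significant])   # [(5, 2)]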

    3.4. The Variation RASTER′

A key element of RASTER is that information about the input data is only retained in the aggregate, and only if it is relevant for clustering. In the projection step, the algorithm determines how many data points were projected to each tile. By discarding the input data, and due to the fact that the range of both the longitude and latitude of GPS data is limited, the resulting memory requirements of our algorithm are constant. Not storing input data is a useful approach when clustering very large amounts of data. However, if all data fits into memory, one could retain all points per tile or all unique points per tile. This increases the memory requirements of the algorithm, as clustering can no longer be performed in constant memory. We refer to this variant of our algorithm as RASTER′. Compared to the pseudocode in Alg. 1, the required changes are confined to the contraction step. The main difference consists of maintaining a set of unmodified input data points together with each key in the hash table tiles instead of incrementing a counter and, subsequently, removing keys if the number of associated stored data points is below the provided threshold value τ.

While this variant is not particularly interesting for our motivating use case (cf. Sect. 2.2), it may allow for a fairer comparison with clustering algorithms that retain their inputs.
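A sketch of the modified contraction step of RASTER′, using the same hypothetical naming as the earlier sketch; storing a set per tile retains the unique points projected to it.

from collections import defaultdict

def contraction_retaining_points(points, prec, tau):
    """RASTER' contraction sketch: keep the original points per tile
    instead of a count, then drop tiles below the threshold tau."""
    scale = 10 ** prec
    acc = defaultdict(set)                       # {(x_tile, y_tile): set of points}
    for x, y in points:
        acc[(int(x * scale), int(y * scale))].add((x, y))
    # Significant tiles keep their points, so memory is no longer constant.
    return {tile: pts for tile, pts in acc.items() if len(pts) >= tau}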

    3.5. Parallelizing the Algorithm

In this section we describe the parallelization of RASTER and RASTER′. The resulting parallel versions are called P-RASTER and P-RASTER′, respectively. As a preliminary step, however, the algorithm first divides the input into unsorted batches. This is followed by processing each batch in parallel by performing the projection and accumulation steps. For each thread, this results in one hashmap with the number of points per tile. For performance reasons, the input is not sorted, which implies that there can be entries for the same tile in different hashmaps. Consequently, these hashmaps have to be joined. After joining, the significant tiles can be identified and the non-significant tiles removed.
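A sketch of this projection phase in Python, assuming the input has already been split into unsorted batches; the helper names and the use of multiprocessing are our own choices for illustration, while the actual P-RASTER implementation is written in Rust.

from collections import Counter
from multiprocessing import Pool

def project_batch(batch, prec=4):
    """Count points per tile for one unsorted batch of (x, y) points."""
    scale = 10 ** prec
    return Counter((int(x * scale), int(y * scale)) for x, y in batch)

def parallel_contraction(batches, tau, workers=8):
    """Project batches in parallel, join the per-worker hash maps,
    and keep only the significant tiles."""
    with Pool(workers) as pool:
        partial_counts = pool.map(project_batch, batches)
    merged = Counter()
    for counts in partial_counts:
        merged.update(counts)        # the same tile may occur in several maps
    return {tile for tile, count in merged.items() if count >= tau}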

For the next step, the metaphor of a vertical slice of the input space is helpful. The resulting significant tiles are assigned to a slice based on their horizontal position, but they are not sorted within a slice. Afterwards, the significant tiles of each vertical slice are clustered concurrently. However, clusters may cross the border between slices. Thus, only clusters that do not touch any border can be dismissed or retained based on the µ criterion. The others need to be joined if they have a neighboring tile in a slice to the right or left. They are only discarded if the final joined cluster is below the specified minimum size. This is done while joining slices, which may happen iteratively from one side to the other, or in a bottom-up divide-and-conquer approach. A pseudo-code specification of the agglomeration step of P-RASTER is provided in Alg. 2; the procedure for joining clusters is shown separately as Alg. 3.


Algorithm 2 P-RASTER: Agglomeration
input: Minimum cluster size µ, parallelism N, set of significant tiles
output: Set of clusters Sc with at least µ tiles

 1: Sort all significant tiles into N vertical slices according to their spatial order
 2: for each slice in parallel do
 3:     initialize Cb                      ▷ clusters next to border
 4:     determine clusters C
 5:     for c in C do
 6:         if c touches a border then
 7:             add c to Cb                ▷ candidates for joining
 8:         else if |c| ≥ µ then
 9:             add c to Sc
10: for border Bi in BN−1 ... B1 do        ▷ iterative joining, right to left
11:     Clr := clusters in Cb that touch border Bi
12:     Cj, Cinter := join_border(Clr)     ▷ cf. Alg. 3
13:     add Cj to Sc
14:     remove clusters that touch Bi and Bi−1 from Cb
15:     add Cinter to Cb

Joining clusters that cross borders has been implemented in a sequential manner, as it only consumes a trivial amount of the total runtime.
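The joining idea can be sketched for the clusters of two neighboring slices as follows; unlike Alg. 3, this simplified version does not track clusters that reach the next border and it compares all tile pairs instead of only border tiles, so it is an illustration rather than the implemented procedure.

def join_slices(left_clusters, right_clusters, mu):
    """Merge clusters (sets of tile coordinates) from two adjacent slices
    whenever they contain Chebyshev-adjacent tiles; drop merged clusters
    that are smaller than the minimum cluster size mu."""
    def adjacent(c1, c2):
        return any(abs(x1 - x2) <= 1 and abs(y1 - y2) <= 1
                   for (x1, y1) in c1 for (x2, y2) in c2)

    clusters = [set(c) for c in list(left_clusters) + list(right_clusters)]
    merged = True
    while merged:                    # repeat until no two clusters touch
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if adjacent(clusters[i], clusters[j]):
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return [c for c in clusters if len(c) >= mu]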

For slicing the input canvas vertically, we utilize domain knowledge, as it is known that the longitude of GPS coordinates is in the range [−180, 180]. Thus, we can easily divide the input into evenly spaced slices. Alternatively, one could dynamically determine the minimum and maximum longitude value. Furthermore, it may be more appropriate, also in the context of GPS data, to dynamically determine the width of each slice. With GPS data, this is particularly relevant as the input is normally not uniformly distributed across the input space. For optimal performance, each slice should contain approximately the same number of data points. These performance benefits relate to clustering but, as stated above, not to the final joining of slices, which happens sequentially. In contrast, Fig. 3 illustrates how this could be done in parallel with a divide-and-conquer approach. Lastly, relevant correctness proofs for P-RASTER are provided in Appendix C.
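A small sketch of the evenly spaced slicing by longitude described above, assuming tile coordinates in the scaled integer space of the chosen precision; the function names are ours.

def slice_index(tile_x, prec, n_slices):
    """Map a tile's scaled x-coordinate to one of n_slices evenly spaced
    vertical slices over the longitude range [-180, 180]."""
    scale = 10 ** prec
    lo, hi = -180 * scale, 180 * scale
    width = (hi - lo) / n_slices
    return min(int((tile_x - lo) // width), n_slices - 1)   # clamp the right edge

def group_by_slice(tiles, prec, n_slices):
    """Group significant tiles by slice so that each slice can be clustered in parallel."""
    slices = [set() for _ in range(n_slices)]
    for tile in tiles:
        slices[slice_index(tile[0], prec, n_slices)].add(tile)
    return slices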

    3.6. Further Technical Details

This subsection contains further technical details, covering the practical aspect of working with arbitrarily large data sets, the mostly theoretical issue of disadvantageous grid layouts, and a brief remark on higher-dimensional input data.

Working with huge data sets. RASTER can handle arbitrarily large data sets as its memory requirements are constant.

Algorithm 3 P-RASTER: Joining clusters
input: Minimum cluster size µ, clusters Clr that touch border Bi
output: Two sets of clusters, Cj and Cinter. The first set contains clusters with at least µ tiles that only touch border Bi. The second set contains clusters that touch both Bi and Bi−1.

 1: procedure join_border(Clr)
 2:     initialize the sets Cinter and Cj
 3:     for c not visited in Clr do
 4:         c_join := c
 5:         to_visit := {c}
 6:         for v in to_visit do
 7:             take neighbors of v from Clr
 8:             add neighbors to to_visit
 9:             merge neighbors into c_join
10:         if c_join touches Bi−1 then
11:             add c_join to Cinter       ▷ add inter-slice cluster
12:         else if |c_join| ≥ µ then
13:             add c_join to Cj

To demonstrate this, assume M available main memory, of which T < M is required for the hash table tiles. The values stored in it are fixed-size integers, and the number of tiles has a constant upper bound as the range of the input values is fixed. This implies that T has a fixed upper bound. Afterwards, partition the input data into chunks of at most size P ≤ M − T and iteratively perform the projection step on them. After processing the entire input, continue with determining significant tiles, followed by clustering. With this approach, RASTER can not only process an arbitrarily large amount of data; the input can also be partitioned in an arbitrary manner and processed in an arbitrary order without affecting the resulting clusters.
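A sketch of this chunked processing in Python, assuming a plain-text input file with one comma-separated coordinate pair per line; the file format, chunk size, and function name are our own assumptions.

from collections import Counter
from itertools import islice

def count_tiles_from_file(path, prec, chunk_size=1_000_000):
    """Stream an arbitrarily large input file in fixed-size chunks and
    accumulate per-tile counts; only the tile counts are kept in memory."""
    scale = 10 ** prec
    acc = Counter()
    with open(path) as f:
        while True:
            chunk = list(islice(f, chunk_size))
            if not chunk:
                break
            for line in chunk:
                x, y = map(float, line.split(","))
                acc[(int(x * scale), int(y * scale))] += 1
    return acc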

Minimum Cluster Size in Disadvantageous Grid Layouts. We consider truncation of a fixed number of decimal digits to be the standard behavior of RASTER. As long as the projection to tiles is injective, any projection can be chosen. Yet, for any possible projection a corner case can be found that illustrates that a significant tile may not be detected, based on the particular implied grid of the chosen projection. However, due to hubs generally consisting of a large number of points, this is merely a theoretical issue. As an illustration, assume a threshold of τ = 4 for significant tiles, and four adjacent points. If all points were located in the same tile of a grid, a significant tile would be detected. However, those four points could also be spread over neighboring tiles, as illustrated by Fig. 4.


Figure 4: RASTER is based on the idea of significant tiles, i.e. tiles that contain at least a threshold τ number of points. However, as this figure illustrates, the location of the grid can interfere and separate points in close vicinity. In the given example, when using a threshold of τ = 4, no significant tile would be detected. This is only a theoretical problem for our algorithm and its primary use case, as hubs contain large numbers of points. However, a post-processing step can solve this minor issue at a very modest cost.

One could shift the grid by choosing a different projection, but an adversary could easily place all points on different tiles in the new grid. In order to alleviate this problem, a second threshold τ′ can be used in a post-processing step (cf. the caption of Fig. 4).

4. Evaluation

The focus is on comparing sequential performance, but some of the Python implementations use more than one CPU core. As this does not affect RASTER, which runs on a single CPU core, it was not of primary concern for us. Yet, this means that the performance of some of the other algorithms is arguably better than it would have been if it were possible to enforce using only a single core. In addition to measuring runtime and cluster quality, we also explored at which points the chosen algorithms run out of memory.

The comparative benchmark in experiment 1 puts our algorithm at a significant disadvantage due to the fact that RASTER⋆ is slower than RASTER. We therefore performed an additional experiment that gives a better insight into the true performance of our algorithm: Experiment 2 evaluates stand-alone implementations of RASTER and RASTER′ in Python and Rust. Unlike with the previous implementations, this allowed us to closely follow the specification provided in Alg. 1. A key aspect of RASTER is that input data points are projected to tiles. The number of implied tiles is based on a precision factor, and the higher that factor is, the greater the number of tiles is as well. With a lower precision value, RASTER is faster and requires less memory as there are fewer tiles to project to, but the resulting clusters are coarser.

The algorithms processed input files with 10^2 to 10^6 clusters, corresponding to inputs ranging from 50K to 500M data points. For the chosen precision values of 3, 3.5, 4, and 5, we recorded the number of clusters identified and the runtime. This experiment also serves as a baseline for a comparison with parallel RASTER, which is described below.

The goal of experiment 3 is to empirically determine the speedup of our parallelization efforts. We benchmarked Rust implementations of RASTER and RASTER′ on large data sets, one with 50M points (10^5 clusters) and one with 500M data points (10^6 clusters). For precision values of 3.5 and 4, we measured the total runtime as well as the time spent on projection and clustering as the number of cores increases from 1 to 8 in powers of 2.

All experiments were executed on a contemporary workstation with an octa-core AMD Ryzen 7 2700X CPU, clocked at 3.70 GHz, and 64 GB RAM. The chosen operating system was CentOS Linux 7.6.1810.

4.2. Data Description

RASTER is designed for clustering unlabeled data where there is no ground truth available. Thus, it is an example of unsupervised learning. However, the input data for both experiments was produced by a custom data generator. Its output is synthetic data that is an idealization of an existing proprietary real-world data set. For such data sets, we know how many clusters the algorithms should detect, but the data is not labeled. For the first experiment, using the GPS coordinate plane as a canvas, the data generator randomly

determines a predefined number of hypothetical cluster centers and spreads 500 points around them in a uniform distribution with random parameters, which means that clusters vary in their density. Cluster centers are located at least a given minimum distance apart. For the second experiment, the data generator was modified to speed up file creation for larger files. This variant of our data generator divides the canvas into a number of rectangles and places cluster centers in them. This affects files with 10^5 and 10^6

cluster centers. The resulting distribution of cluster centers is less random than it is with our data sets that contain fewer clusters. As one idiosyncrasy of the generated data is that data points forming a cluster are produced in sequence, all data was randomly shuffled, based on the reasoning that partially sorted data may unfairly advantage some algorithms, while randomly shuffled data can be expected to not provide a discernible advantage to any algorithm.

    4.3. Results

Experiment 1. Table 1 lists numerical results of ten clustering algorithms, relating to experiment 1. The stated averages are of ten runs, with standard deviations in the case of the runtime. Missing entries are due to memory constraints or excessive runtime. Some of these algorithms, such as Affinity Propagation and Spectral Clustering, were not even able to process 500K points in 64 GB RAM, even though the input corresponds to less than 2 MB. Overall, the hamstrung version of RASTER⋆ clearly outperforms the other algorithms, with a runtime difference ranging from a single-digit multiple in the case of DBSCAN to a triple-digit multiple in the case of CLIQUE. Furthermore, Spectral Clustering did not honor setting the number of jobs to 1 and instead used up all available threads. Since this algorithm is a transformation of k-means, its runtime is strictly worse than the latter's. Gaussian Mixture Modelling also used all available threads at times. The instances of unwanted parallelization are probably due to numpy calling multi-threaded C libraries.4

We also looked at RAM consumption. The maximum input size the various algorithms were able to process is as follows: 5K points (10^1 clusters) in the case of Affinity Propagation and Spectral Clustering; 50K points (10^2 clusters)

in the case of Ward's method, Agglomerative Clustering, and CLIQUE; 500K points (10^3 clusters) in the case of Mini-batch k-means, BIRCH, and Gaussian Mixture; 5M points (10^4 clusters) in the case of DBSCAN; 50M points (10^5 clusters) in the case of RASTER⋆ and Mean Shift. Mini-batch k-means had an excessive runtime, projected to be multiple hours, with 5M points (10^4 clusters);

4 Trying to artificially limit the number of used threads with the Linux tool numactl led to erroneous behavior of the affected algorithms.


Clusters        %        t [s]             s
RASTER⋆
10^1          100.0      0.02 ± 0.00      1.00
10^2          100.0      0.19 ± 0.01      1.00
10^3          100.0      1.93 ± 0.01      1.00
Mean Shift
10^1          100.0      0.03 ± 0.00      1.00
10^2          100.0      0.36 ± 0.01      1.00
10^3          100.0      5.05 ± 0.06      1.00
DBSCAN
10^1          100.0      0.03 ± 0.00      1.00
10^2          100.0      0.42 ± 0.03      1.00
10^3           99.8      6.12 ± 0.12      1.00
Mini-Batch k-Means
10^1          100.0      0.04 ± 0.00      1.00
10^2          100.0      1.22 ± 0.63      1.00
10^3          100.0      4.97 ± 0.05      0.93
BIRCH
10^1          100.0      0.09 ± 0.05      1.00
10^2          100.0      1.52 ± 0.00      1.00
10^3           99.0     17.08 ± 0.60      0.99
Gaussian Mixture
10^1          100.0      0.03 ± 0.00      1.00
10^2          100.0      1.31 ± 0.12      1.00
10^3          100.0    164.92 ± 4.79      1.00
Ward
10^1          100.0      0.34 ± 0.00      1.00
10^2          100.0     47.39 ± 9.38      1.00
10^3            —             —             —
Agglomerative
10^1          100.0      0.34 ± 0.00      1.00
10^2          100.0     52.34 ± 10.56     1.00
10^3            —             —             —
CLIQUE
10^1          100.0      0.28 ± 0.00      1.00
10^2          100.0    312.26 ± 9.15      0.97
10^3            —             —             —
Spectral
10^1          100.0      1.87 ± 0.06      1.00
10^2            —             —             —
10^3            —             —             —
Affinity Propagation
10^1          100.0     22.07 ± 0.97      1.00
10^2            —             —             —
10^3            —             —             —

Table 1: Comparison of sequential RASTER with various standard clustering algorithms. RASTER is labeled RASTER⋆ because it deviates from our specification. In order to integrate with the scikit-learn benchmark, this algorithm has to retain the input in memory and read it two times. The input size is stated as the number of clusters, where each cluster contains 500 points. The percentage given is based on the number of clusters that are identified versus the number generated as the input. All times t are rounded, so values below 0.004 seconds are recorded as 0.00. As a qualitative performance metric, the silhouette coefficient s was used. Missing entries are due to algorithms running out of main memory or excessive runtime.

CLIQUE showed the same problem with 500K points (10^3 clusters) already. Thus, we terminated these benchmarks early. Even though Mean Shift and RASTER⋆ would be able to process larger inputs, the design of that particular benchmark was a limiting factor as all data is stored in memory and retained. Regarding Mini-batch k-means, it is also worth pointing out that the algorithm occasionally seems to get trapped in a local optimum, which leads to runtimes that are over 10 times worse. The implementation in scikit-learn terminates those eventually. In those rare outliers, both the silhouette coefficient and the percentage of identified clusters drop significantly.

Experiment 2. In Table 2, a better overview of the true performance of RASTER and RASTER′ is given. The Python implementations were not able to process files with 500M input points (1M clusters) as they ran out of memory.5

In general, RASTER scales well as the input size increases. The chosen precision value affects both the runtime and the quality of results. For example, a precision of 3 is fast, yet the resulting coarse grid detects fewer clusters.

Experiment 3. As the results in Fig. 5 show, both RASTER and RASTER′ scale well. With RASTER, the total speedup with 8 cores is between 3 and 4. In contrast, the performance improvements with RASTER′ are greatly affected by the chosen precision. With a precision of 3.5, performance improvements are above 30% with four cores, but level off with 8 cores. On the other hand, with a precision of 4, performance only improves when using up to 4 cores with the 50M data set.

5 This could be fixed by modifying the implementation to not use relatively memory-intensive tuples for storing GPS coordinates. We were more interested in a straightforward implementation of RASTER in Python, however, which is why this was not done.


                   prec = 3              prec = 3.5            prec = 4              prec = 5
Clusters        %      t [s]          %      t [s]          %      t [s]          %      t [s]
RASTER (Python)
10^2          87.0    0.03 ± 0.00   100.0    0.03 ± 0.00  100.0    0.04 ± 0.00   87.0    0.04 ± 0.00
10^3          77.6    0.35 ± 0.00   100.0    0.35 ± 0.01  100.0    0.42 ± 0.00   93.7    0.41 ± 0.00
10^4          78.7    3.69 ± 0.03   100.0    4.22 ± 0.06  100.0    5.10 ± 0.04   92.1    4.51 ± 0.03
10^5          99.9   49.71 ± 0.33   100.0   51.40 ± 0.06  100.0   56.53 ± 0.61   99.3   50.03 ± 0.20
10^6            —          —           —          —          —          —          —          —
RASTER′ (Python)
10^2          87.0    0.04 ± 0.00   100.0    0.04 ± 0.00  100.0    0.05 ± 0.00   87.0    0.06 ± 0.00
10^3          77.6    0.44 ± 0.00   100.0    0.49 ± 0.03  100.0    0.81 ± 0.00   93.7    0.84 ± 0.05
10^4          78.7    6.09 ± 0.03   100.0    7.27 ± 0.51  100.0    9.12 ± 0.30   92.1    9.56 ± 0.04
10^5          99.9   74.45 ± 0.51   100.0   84.74 ± 0.67  100.0  107.64 ± 0.37   99.3  121.79 ± 5.49
10^6            —          —           —          —          —          —          —          —
RASTER (Rust)
10^2          87.0  < 0.01 ± 0.00   100.0  < 0.01 ± 0.00  100.0  < 0.01 ± 0.00   87.0  < 0.01 ± 0.00
10^3          77.6    0.01 ± 0.00   100.0    0.01 ± 0.00  100.0    0.01 ± 0.00   93.7    0.02 ± 0.00
10^4          78.7    0.08 ± 0.00   100.0    0.15 ± 0.00  100.0    0.28 ± 0.00   92.1    0.40 ± 0.00
10^5          99.9    1.38 ± 0.00   100.0    2.53 ± 0.00  100.0    4.74 ± 0.01   99.3    5.09 ± 0.01
10^6          99.9   33.85 ± 0.07   100.0   48.59 ± 0.04  100.0   60.61 ± 0.05   99.3   58.05 ± 0.10
RASTER′ (Rust)
10^2          87.0  < 0.01 ± 0.00   100.0  < 0.01 ± 0.00  100.0  < 0.01 ± 0.00   87.0    0.01 ± 0.00
10^3          77.6    0.01 ± 0.00   100.0    0.02 ± 0.00  100.0    0.04 ± 0.00   93.7    0.09 ± 0.00
10^4          78.7    0.24 ± 0.00   100.0    0.50 ± 0.00  100.0    0.95 ± 0.01   92.1    1.62 ± 0.00
10^5          99.9    4.32 ± 0.15   100.0    6.12 ± 0.31  100.0   10.75 ± 0.48   99.3   19.71 ± 0.03
10^6          99.9   64.46 ± 0.89   100.0   82.92 ± 3.20  100.0  140.59 ± 6.81   99.3  248.33 ± 0.37

Table 2: Comparing (sequential) stand-alone implementations of RASTER and RASTER′ in Python 3.6 and Rust 1.35 at different precision values. The provided figures show the average runtime t of five runs, along with their standard deviation. The chosen parameters were threshold τ = 5 and minimum cluster size µ = 4. The input size is stated as the number of clusters, where each cluster contains 500 points. The most noteworthy aspect is that fine-tuning the precision parameter entails a performance improvement without deteriorating the quality of the results, with the caveat that the clusters identified by tiles in RASTER are coarser with a lower precision.

With the 500M data set, performance improves by ca. 22% with two cores, but deteriorates with four and eight cores. A detailed breakdown of the runtime of the projection and clustering steps of P-RASTER and P-RASTER′ is provided in Appendix D.

Discussion. In experiment 1, RASTER⋆ outperforms an assortment of other clustering algorithms by a wide margin. Given the popularity of scikit-learn, these implementations can be considered state-of-the-art. Of course, those algorithms may have been developed for other use cases than ours (cf. Appendix A for a brief discussion of RASTER as a general-purpose density-based clustering algorithm). Not only is their runtime worse, many of them have memory requirements that make them unsuitable for density-based clustering of big data. Even though RASTER⋆ was hamstrung in that comparison, it still has the best performance, and by a wide margin.

Furthermore, as a comparison of the data in Table 1 and Table 2 shows, RASTER is about five times faster than RASTER⋆, without those limitations.

There is a conceptual similarity between a sub-method of CLIQUE and RASTER, but the former performs lookup in a way that scales poorly as the input increases. This may not be obvious when using inputs with very few clusters, as was done in the original papers [25, 26]. Yet, CLIQUE is not an ideal choice in a big data context. We spent a considerable amount of time on comparing RASTER and CLIQUE. There are very few widely available implementations of the latter, however, which perhaps underscores that CLIQUE is more interesting from a theoretical perspective. We consulted the implementations of the pyclustering package6 as well as György Katona's implementation.

6 The code repository of the package pyclustering is https://github.com/annoviko/pyclustering (accessed 15 April 2019).


Figure 5 (plots): P-RASTER (50M), P-RASTER′ (50M), P-RASTER (500M), and P-RASTER′ (500M), each plotted over 1, 2, 4, and 8 cores for prec = 3.5 and prec = 4.0.

Figure 5: Scalability of P-RASTER and P-RASTER′ when clustering 50M and 500M points (best viewed in color). With c cores, P-RASTER achieves a performance improvement of around c/2. The scalability of P-RASTER′, on the other hand, is limited due to cache synchronization and parallel slowdown issues. The latter is, in particular, an issue with very large data sets.

Our implementation is based on the latter, and we improved its performance by a factor of more than 3. It is also noticeably faster than the implementation in pyclustering. Our results comparing RASTER to another widely available implementation of CLIQUE in Java are documented in Appendix B.

In experiment 2, we have shown that RASTER and RASTER′ scale very well. The purpose of the additional implementation of RASTER in Rust is to show our algorithm's potential for real-world workloads. This implementation is able to process 500M points (1M clusters) in less than 50 seconds with a single thread. Unlike many other clustering algorithms, RASTER can be easily parallelized. The provided data also illustrate that there is an ideal range of values for the precision parameter, which is a measure of the resolution of the implied grid. With a precision value of 3, there is some misclassification because, for instance, two clusters in the data may be classified as one. At the other extreme, with a precision value of 5, a grid that is too fine may also lead to a reduced precision, but in that case the reason is that the elements of a cluster may be mapped to tiles that are too far apart for the used distance value. Subsequently, these scattered tiles do not meet the chosen minimum cluster size µ and are filtered out prior to the agglomeration step.

As our results of experiment 3 show, RASTER can not only be elegantly parallelized in theory; the measurements also show a substantial performance gain in practice. Based on these results, we would expect that a further increase in the number of available CPU cores c leads, for P-RASTER, to a speedup of roughly c/2 for the projection step and c for the clustering step. The improvement for clustering in P-RASTER′ is similar, but the performance


improvement for projection seems to level off. As the current picture is ambiguous, it is hard to make predictions on the expected performance increases of that part of the algorithm.

We have found that with a greater precision value in P-RASTER′, total performance not only levels off, as was the case with a precision value of 3.5, but degrades with a precision of 4. This seems to be due to memory allocation for the data that is retained. In addition to increased memory requirements, cache synchronization issues may also play a role. Since each core on a multi-core CPU has its own cache, there is a greater need for cache synchronization. This phenomenon is referred to as parallel slowdown.

We also took a look at related work on parallelizing other clustering algorithms. The state of the art in parallelizing DBSCAN is Song et al.'s work on RP-DBSCAN, which they describe as "superfast" [28]. Yet, the achieved speedup with 10 cores is only around 2, reaching 4.4 with 40 cores. In comparison, P-RASTER achieves a parallel speedup of a factor of around 4 already with 8 cores, due to the excellent scalability inherent in the design of our algorithm. Based on our experiments, we would expect a speedup of around 20 with 40 cores. By design, RASTER can be parallelized rather effectively. Yet, parallelizing many other clustering algorithms, such as DBSCAN, is much less straightforward. Such efforts may also have a relatively low ceiling, as suggested by the empirical performance of RP-DBSCAN.

    5. Related Work

We compared RASTER to a number of canonical clustering algorithms of which implementations were readily available. There are other algorithms, which seem to have been of more academic interest. For reasons related to space as well as a lack of availability of implementations,



  • we therefore briefly discuss other density-based clusteringalgorithms separately below. In general, even though wepoint out some superficial similarities between RASTERand various other algorithms, we would like to stress thatnone of those algorithms, as described in the respectivepapers, seems suitable for big data clustering.

An early approach to grid-based spatial data mining was STING [29]. A key difference between RASTER and this algorithm is that STING performs statistical queries, using distributions of attribute values. WaveCluster [30] shares some similarities with RASTER. It can reduce the resolution of the input, which leads to output that is visually similar to RASTER clusters as defined by their significant tiles. WaveCluster runs in linear time. However, because the computation of wavelets is costlier than the operations RASTER performs, its empirical runtime is presumably worse.

The projection step of RASTER makes use of a standard concept in data processing: assigning values to buckets. Similar projections are a feature of other algorithms as well. For instance, Baker and Valleron [31] present a solution to a problem in spatial epidemiology whose initial step seems similar to the projection step performed by RASTER. For their problem, it is sufficient to count data points in the squares of a grid. The approach taken by GRPDBSCAN, proposed by Darong and Peng [32], is similar to the aforementioned work in this regard, but follows the approach of RASTER′ by retaining the projected input points. Unfortunately, their description seems rather vague. Because it includes only an illustration instead of an implementation or specification, we cannot evaluate how similar their algorithm is to ours. Neither of those papers, however, discusses a variable scaling factor for arbitrarily adjusting the granularity of the grid and the resulting clusters. Lastly, there is some similarity between our algorithm, when focusing only on the hub identification problem, and blob detection in image analysis [33]. A direct application of that method to finding clusters would arguably require very dense cluster centers, which could be achieved by projecting points onto fewer tiles. Also, blob detection is computationally more expensive than RASTER. Furthermore, unlike blob-detection algorithms, RASTER is viable as a general-purpose clustering algorithm, as illustrated in Appendix A.
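To make the bucketing idea concrete, the following minimal Python sketch illustrates such a grid projection. It is a hypothetical illustration, not our implementation; the function name, the choice of 10 raised to the precision value as scaling factor, and the use of a dictionary of counts are assumptions made for the sake of the example.

from collections import defaultdict

def project_to_tiles(points, precision):
    """Assign 2D points to grid tiles by scaling and truncating coordinates.

    A larger precision value yields a finer grid. Returns a mapping from
    tile coordinates to the number of points projected onto each tile.
    """
    scale = 10 ** precision
    counts = defaultdict(int)
    for x, y in points:
        counts[(int(x * scale), int(y * scale))] += 1
    return counts

# With precision 1, points that fall into the same 0.1 x 0.1 square of the
# plane end up on the same tile.
tiles = project_to_tiles([(1.23, 4.56), (1.27, 4.51), (8.00, 9.99)], precision=1)
print(dict(tiles))  # {(12, 45): 2, (80, 99): 1}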

    6. Future Work

Xiaoyun et al. introduced GMDBSCAN [34], a DBSCAN variant that is able to detect clusters of different densities. The inability to detect such clusters is shared between DBSCAN and RASTER. For more general-purpose applications, it may be worth investigating a similar approach for RASTER as well. A starting point is an adaptive distance parameter for significant tiles. While we mentioned that the distance metric δ can be arbitrarily chosen (we picked the eight immediate neighbors of a tile), one could not only choose different Manhattan or Chebyshev distances, but arbitrary adaptive values. On the other hand, by fine-tuning the precision parameter, very similar results could be achieved, possibly with multiple passes over the same input, which would make it possible to detect clusters of varying densities.
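As a rough illustration of such a parameterized distance metric, the sketch below shows how a neighborhood lookup could accept a metric and radius instead of being fixed to the eight immediate neighbors. The code is a hypothetical sketch; the function and parameter names are not taken from our implementation.

def neighbors(tile, radius=1, metric="chebyshev"):
    """Return the neighbors of a tile on an integer grid.

    With metric="chebyshev" and radius=1, this yields the eight immediate
    neighbors; other metrics or radii correspond to other choices of the
    distance parameter delta.
    """
    x, y = tile
    result = []
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            if (dx, dy) == (0, 0):
                continue
            if metric == "manhattan" and abs(dx) + abs(dy) > radius:
                continue
            result.append((x + dx, y + dy))
    return result

print(len(neighbors((0, 0))))                                 # 8
print(len(neighbors((0, 0), radius=1, metric="manhattan")))   # 4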

RASTER does not distinguish between significant tiles. Yet, the number of points projected onto different significant tiles can differ widely. Some tiles could have a very high count of points, while others barely reach the specified threshold value. One obvious modification would be to use those counts for visualization purposes, for instance by rendering tiles with fewer points in a lighter color tone and those with more points in a darker one. However, there are more sophisticated approaches for utilizing the relative densities of tiles. Thus, one could consider an adaptive approach to RASTER clustering, for instance by subdividing such tiles into smaller segments, with the goal of determining more accurate cluster shapes. This idea is related to adaptive mesh refinement, as suggested by Liao et al. [35]. A related idea is to change the behavior of RASTER when detecting a large number of adjacent tiles that have not been classified as significant. This may prompt a coarsening of the grid size for that part of the input space.
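A minimal sketch of the visualization idea could map each significant tile's point count to a shade between light and dark. This is hypothetical helper code under the assumption that tile counts are available as a dictionary, as in the projection sketch above.

def tile_shades(tile_counts, threshold):
    """Map the point count of each significant tile to a shade in [0, 1].

    Tiles that barely reach the threshold are rendered light (0.0), the
    densest tile dark (1.0). Tiles below the threshold are skipped.
    """
    significant = {t: c for t, c in tile_counts.items() if c >= threshold}
    if not significant:
        return {}
    max_count = max(significant.values())
    if max_count == threshold:
        return {t: 1.0 for t in significant}
    return {t: (c - threshold) / (max_count - threshold)
            for t, c in significant.items()}

print(tile_shades({(0, 0): 5, (0, 1): 12, (3, 3): 2}, threshold=5))
# {(0, 0): 0.0, (0, 1): 1.0}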

For practical use, it may be worthwhile to add a contextual relaxation value ε for the threshold value of significant tiles. For instance, in the vicinity of several significant tiles, a neighboring tile with τ − ε data points may be considered part of the agglomeration, in particular if it has multiple significant tiles as neighbors. A key aspect of RASTER is that it returns clusters of significant tiles that cover most of the area of a dense cluster. It may be interesting to not return such clusters at all but instead compute the ellipse of least area that includes all tiles of a cluster. In particular for larger clusters, this would lead to an even more memory-efficient output. This also has a fitting correspondence in RASTER′, as the ellipse could be computed based on all the points projected onto the tiles in the final set of clusters.
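One possible reading of this relaxation is sketched below. The names and the choice of requiring a minimum number of significant neighbors are assumptions for the sake of illustration, not part of RASTER as described above.

def relaxed_significant(tile, tile_counts, significant, threshold, epsilon,
                        min_significant_neighbors=2):
    """Decide whether a below-threshold tile may join an agglomeration.

    A tile with at least threshold - epsilon points is accepted if enough
    of its eight immediate neighbors are already significant tiles.
    """
    count = tile_counts.get(tile, 0)
    if count >= threshold:
        return True
    if count < threshold - epsilon:
        return False
    x, y = tile
    adjacent = sum((x + dx, y + dy) in significant
                   for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                   if (dx, dy) != (0, 0))
    return adjacent >= min_significant_neighbors

counts = {(0, 0): 5, (0, 1): 5, (1, 0): 4}
print(relaxed_significant((1, 0), counts, significant={(0, 0), (0, 1)},
                          threshold=5, epsilon=1))  # True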

RASTER is highly suited to clustering data in batches, but we have also implemented a variant of RASTER for data streams (S-RASTER). In an upcoming paper, we intend to discuss this variant and show how it compares with other algorithms for clustering evolving data streams [36]. Lastly, P-RASTER was discussed as an algorithm running on a many-core CPU. It should be straightforward to adapt it to execution in the cloud, similar to the experiments described in the RP-DBSCAN paper [28]. We would expect it to scale well, as slices can be processed in parallel. Unlike RP-DBSCAN, P-RASTER in the cloud would not need to duplicate any data, implying better scalability in terms of required memory as well. That being said, P-RASTER is already able to process terabytes of data in a very reasonable amount of time on a single workstation. This means that adapting P-RASTER for execution in a data center, for example on Microsoft Azure, Amazon Web Services, or Google Cloud, would be hard to justify in the foreseeable future.

    7. Conclusion

We hope to have shown that RASTER is an excellent density-based clustering algorithm with outstanding single-threaded performance. For our particular problem, it outperforms standard implementations of existing clustering algorithms. Moreover, unlike many other clustering algorithms, RASTER can be effectively parallelized to make use of many-core CPUs. Based on our experiments, we expect RASTER to scale linearly with the number of available CPU cores. As the input space of RASTER can be partitioned without data duplication and without affecting clustering quality, it would be straightforward to adapt our algorithm to cloud-scale computations, although that may not be necessary, considering the performance that can be achieved on a regular workstation.

While we focused on the hub identification problem, RASTER can also be used for solving general-purpose density-based clustering problems. The parameters RASTER uses, i.e. the threshold value τ for significant tiles, the distance metric δ, and the minimum cluster size µ, are all intuitive. In Appendix A, we illustrate on standard data sets how our algorithm can deliver very good results with minimal parameter tweaking. RASTER is particularly suited to low-dimensional data. Overall, we have shown that RASTER is of great practical value due to its very fast sequential and parallel performance and its suitability for general-purpose clustering problems.

    Acknowledgments

This research was supported by the project Fleet telematics big data analytics for vehicle usage modeling and analysis (FUMA) in the funding program FFI: Strategic Vehicle Research and Innovation (DNR 2016-02207), which is administered by VINNOVA, the Swedish Government Agency for Innovation Systems. Initial multi-core benchmarks were performed on resources at the Chalmers Centre for Computational Science and Engineering (C3SE), provided by the Swedish National Infrastructure for Computing (SNIC).

References

[1] R. J. Hathaway, J. C. Bezdek, Extending fuzzy and probabilistic clustering to very large data sets, Computational Statistics & Data Analysis 51 (1) (2006) 215–234.

[2] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., A density-based algorithm for discovering clusters in large spatial databases with noise, in: SIGKDD Conference on Knowledge Discovery and Data Mining, Vol. 96, 1996, pp. 226–231.

[3] A. Kumar, Y. Sabharwal, S. Sen, Linear time algorithms for clustering problems in any dimensions, in: International Colloquium on Automata, Languages, and Programming, Springer, 2005, pp. 1374–1385.

[4] A. Kumar, Y. Sabharwal, S. Sen, Linear-time approximation schemes for clustering problems in any dimensions, Journal of the ACM (JACM) 57 (2) (2010) 5.

[5] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Y. Zomaya, S. Foufou, A. Bouras, A survey of clustering algorithms for big data: Taxonomy and empirical analysis, IEEE Transactions on Emerging Topics in Computing 2 (3) (2014) 267–279.

[6] A. S. Shirkhorshidi, S. Aghabozorgi, T. Y. Wah, T. Herawan, Big data clustering: a review, in: International Conference on Computational Science and Its Applications, Springer, 2014, pp. 707–720.

[7] G. Ulm, E. Gustavsson, M. Jirstrand, Contraction Clustering (RASTER), in: G. Nicosia, P. Pardalos, G. Giuffrida, R. Umeton (Eds.), Machine Learning, Optimization, and Big Data, Springer International Publishing, Cham, Switzerland, 2018, pp. 63–75.

[8] M.-S. Yang, A survey of fuzzy clustering, Mathematical and Computer Modelling 18 (11) (1993) 1–16.

[9] D. L. Pham, Spatial models for fuzzy clustering, Computer Vision and Image Understanding 84 (2) (2001) 285–297.

[10] J. MacQueen, et al., Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, Oakland, CA, USA, 1967, pp. 281–297.

[11] F. van Diggelen, P. Enge, The world's first GPS MOOC and worldwide laboratory using smartphones, in: Proceedings of the 28th International Technical Meeting of The Satellite Division of the Institute of Navigation (ION GNSS+ 2015), ION, 2015, pp. 361–369.

[12] M. Inaba, N. Katoh, H. Imai, Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering (extended abstract), in: Proceedings of the Tenth Annual Symposium on Computational Geometry, SCG '94, ACM, New York, NY, USA, 1994, pp. 332–339.

[13] S. Lloyd, Least squares quantization in PCM, IEEE Transactions on Information Theory 28 (2) (1982) 129–137.

[14] O. Bachem, M. Lucic, H. Hassani, A. Krause, Fast and provably good seedings for k-means, in: Advances in Neural Information Processing Systems, 2016, pp. 55–63.

[15] M. Capó, A. Pérez, J. A. Lozano, An efficient approximation to the k-means clustering for massive data, Knowledge-Based Systems 117 (2017) 56–69.

[16] D. Sculley, Web-scale k-means clustering, in: Proceedings of the 19th International Conference on World Wide Web, ACM, 2010, pp. 1177–1178.

[17] B. J. Frey, D. Dueck, Clustering by passing messages between data points, Science 315 (5814) (2007) 972–976.

[18] D. Comaniciu, P. Meer, Mean shift: A robust approach toward feature space analysis, IEEE Transactions on Pattern Analysis & Machine Intelligence (5) (2002) 603–619.

[19] J. Shi, J. Malik, Normalized cuts and image segmentation, Departmental Papers (CIS) (2000) 107.

[20] X. Y. Stella, J. Shi, Multiclass spectral clustering, in: Ninth IEEE International Conference on Computer Vision (ICCV 2003), IEEE, 2003, p. 313.

[21] J. H. Ward Jr, Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association 58 (301) (1963) 236–244.

[22] E. M. Voorhees, Implementing agglomerative hierarchic clustering algorithms for use in document retrieval, Information Processing & Management 22 (6) (1986) 465–476.

[23] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, in: ACM SIGMOD Record, Vol. 25, ACM, 1996, pp. 103–114.

[24] D. A. Reynolds, Gaussian mixture models, in: Encyclopedia of Biometrics, 2009, pp. 827–832.

[25] R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, in: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, SIGMOD '98, ACM, New York, NY, USA, 1998, pp. 94–105.

[26] R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan, Automatic subspace clustering of high dimensional data, Data Mining and Knowledge Discovery 11 (1) (2005) 5–33.

[27] P. J. Rousseeuw, Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics 20 (1987) 53–65.

[28] H. Song, J.-G. Lee, RP-DBSCAN: A superfast parallel DBSCAN algorithm based on random partitioning, in: Proceedings of the 2018 International Conference on Management of Data, ACM, 2018, pp. 1173–1187.

[29] W. Wang, J. Yang, R. Muntz, et al., STING: A statistical information grid approach to spatial data mining, in: International Conference on Very Large Data Bases (VLDB), Vol. 97, 1997, pp. 186–195.

[30] G. Sheikholeslami, S. Chatterjee, A. Zhang, WaveCluster: A multi-resolution clustering approach for very large spatial databases, in: International Conference on Very Large Data Bases (VLDB), Vol. 98, 1998, pp. 428–439.

[31] D. M. Baker, A.-J. Valleron, An open source software for fast grid-based data-mining in spatial epidemiology (FGBASE), International Journal of Health Geographics 13 (1) (2014) 46.

[32] H. Darong, W. Peng, Grid-based DBSCAN algorithm with referential parameters, Physics Procedia 24 (2012) 1166–1170.

[33] A. J. Danker, A. Rosenfeld, Blob detection by relaxation, IEEE Transactions on Pattern Analysis and Machine Intelligence (1) (1981) 79–92.

[34] C. Xiaoyun, M. Yufang, Z. Yan, W. Ping, GMDBSCAN: multi-density DBSCAN cluster based on grid, in: e-Business Engineering, 2008. ICEBE '08. IEEE International Conference on, IEEE, 2008, pp. 780–783.

[35] W.-k. Liao, Y. Liu, A. Choudhary, A grid-based clustering algorithm using adaptive mesh refinement, in: 7th Workshop on Mining Scientific and Engineering Datasets of SIAM International Conference on Data Mining, 2004, pp. 61–69.

[36] G. Ulm, S. Smith, A. Nilsson, E. Gustavsson, M. Jirstrand, S-RASTER: Contraction clustering for evolving data streams, arXiv preprint arXiv:1911.09447.

[37] E. Müller, S. Günnemann, I. Assent, T. Seidl, Evaluating clustering in subspace projections of high dimensional data, Proceedings of the VLDB Endowment 2 (1) (2009) 1270–1281.

[38] S. Even, Graph Algorithms, Cambridge University Press, 2011.

Appendix A. Qualitative Comparison of Clustering Algorithms

Arguably the most well-known qualitative comparison of prominent clustering algorithms is provided by the online documentation of scikit-learn.7 We have modified RASTER′ to work with that benchmark. The results are provided in Figure A.6. The RASTER parameters we used were threshold τ = 5, minimum cluster size µ = 5, and a precision value of 0.9. The runtime metrics have to be taken with a grain of salt as our algorithm has to process the input twice to produce the desired labels. This is not an inherent flaw of RASTER but instead due to the implementation of that particular comparison. Furthermore, our implementation has not been optimized. For instance, we convert from Python to numpy objects and back. The standalone implementation of RASTER′ furthermore does not require us to return a label for each element of the provided input, in the exact same order. It does not retain all data either, as all noise is simply discarded. In the provided examples, the input size is a mere 1,500 points each, with very few clusters. As we have shown earlier (cf. Sect. 4.3), our algorithm shines with huge data sets and very large numbers of clusters.

7 Refer to the Section "Comparing different clustering algorithms on toy data sets" in the official documentation of scikit-learn, which is available at https://scikit-learn.org/0.20/auto_examples/cluster/plot_cluster_comparison.html (accessed April 4, 2019).
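To give an idea of the extra work required to produce per-point labels for such a benchmark, the following hypothetical wrapper sketches one way to do it, assuming clusters are returned as sets of retained points, as in RASTER′; the function name and the use of -1 for noise are assumptions for illustration.

import numpy as np

def labels_for_input(points, clusters):
    """Return one cluster label per input point, in input order.

    clusters is a list of sets of (x, y) tuples; points that appear in
    no cluster are treated as noise and labeled -1.
    """
    point_to_label = {}
    for label, cluster in enumerate(clusters):
        for p in cluster:
            point_to_label[p] = label
    return np.array([point_to_label.get(tuple(p), -1) for p in points])

points = [(0.1, 0.1), (0.2, 0.1), (5.0, 5.0), (9.9, 9.9)]
clusters = [{(0.1, 0.1), (0.2, 0.1)}, {(5.0, 5.0)}]
print(labels_for_input(points, clusters))  # [ 0  0  1 -1]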

Having pointed out that this benchmark is not an ideal match for RASTER, we can nonetheless confirm that the results are in line with our expectations, as our algorithm turned out to perform very well also with smaller input sizes and fewer clusters. To go through the six input data sets from top to bottom: The concentric circles in the first data set are not suited for the chosen RASTER parameters, for which the outer circle is not dense enough. The inner one, however, is properly identified. By choosing a different precision value, the outer ring can be clearly identified (cf. Fig. A.7), but this would negatively affect the results of the other data sets. The arcs in the second data set are nicely separated, showing that RASTER does not require dense, circular clusters. The third one contains two denser clusters on the left and right, while the middle is less dense. RASTER is the only algorithm that considers the middle as noise. There is no ground truth provided. Yet, that is how we would interpret the GPS data we have worked with as well, which means that this data set is relevant to the hub identification problem, where tight clusters are surrounded by noise. As our example shows, RASTER ignores the less densely spread points between the two dense clusters, while the other algorithms all detect three clusters. This seems to imply that, if speed and memory were not a concern, standard clustering algorithms still could not be used for the hub identification problem without preprocessing the data and filtering out noise.

The fourth and particularly the fifth data set are ideally suited for RASTER, which is shown to reliably identify hubs. Lastly, the sixth data set is worth highlighting. According to the information provided in the scikit-learn documentation, such homogeneous data is a "null situation for clustering", stating that there is "no good clustering" possible. Interestingly, RASTER is the only one of the ten clustering algorithms in this comparison that does not perform any clustering on this data set. Instead, all data is discarded as noise. It is also noteworthy that the indicated runtime of RASTER does not depend on the number of clusters or the points per cluster. Instead, it is primarily dependent on the number of points in the input. As the input size is constant in the various data sets of this example, the resulting runtime is identical. This is in stark contrast to the other algorithms, which have sometimes widely fluctuating runtimes.

Figure A.6: This comparison (best viewed in color) between RASTER′ and various standard clustering algorithms (Mini-Batch k-Means, Affinity Propagation, Mean Shift, Spectral, Ward, Agglomerative, DBSCAN, BIRCH, and Gaussian Mixture) shows that our algorithm approximates the cluster centers of tight, dense clusters very well. The third data set from the top reflects the hub identification problem, where tight, dense clusters have to be found in noisy data sets. RASTER is the only algorithm in this comparison that adequately does this. Furthermore, it is the only algorithm in this comparison that does not produce any clusters with homogeneous input, as seen in the bottom image. Lastly, the output of the first data set can be easily improved by modifying the precision parameter (cf. Fig. A.7).

As stated earlier, adjusting the precision parameter can help improve clustering results. This is shown in Fig. A.7, which contrasts the output of RASTER with a precision value of 0.90 with the results of an execution with a precision of 0.73. As there is no ground truth in unsupervised learning, it is in the eye of the beholder which results are judged to be satisfactory. Yet, as this juxtaposition shows, with minor adjustments very good results can be achieved on each of the reference data sets used. We would argue that either precision value delivers very good results on data sets 2, 4, and 5. Using a precision value of 0.73 seems preferable for data set 1 as it fully clusters both rings. For data set 3, an argument could be made for either of those two parameter values, depending on whether the less densely scattered points in the middle are desired to be clustered or not. Lastly, a precision value of 0.90 delivers superior results for data set 6 as it does not cluster a homogeneous data set.

Appendix B. Performance Comparison between CLIQUE, RASTER, and RASTER′

As there is a superficial similarity between one step of CLIQUE and the neighborhood-lookup step of RASTER, we were particularly interested in comparing the performance of these two algorithms on the hub identification problem. In Sect. 4.3 we show how a Python implementation of CLIQUE compares against RASTER. However, we also came across a Java implementation of CLIQUE as part of the OpenSubspace library [37].8 In order to perform a comparative benchmark, we implemented RASTER and RASTER′ in Java and pitted them against that implementation of CLIQUE. In Table B.3 we show the results, which illustrate that the comparatively substandard performance of CLIQUE was not just due to the Python implementations we used earlier. In Java, the difference is likewise rather substantial. Further, the bigger the input, the wider the gap between CLIQUE and RASTER gets. With an input of 50K points or 100 clusters, RASTER is 300 times faster than CLIQUE. Increasing the input size to 500K points or 1K clusters widens the performance difference to well over 3,000 times.

8 The repository associated with that paper is no longer available online. However, the relevant Java packages were retrieved from version 1.0.4 of the R library subspace, which makes use of those Java libraries.

Figure A.7: Adjusting the precision parameter can improve clustering results (best viewed in color; left column: RASTER with prec. = 0.90, right column: RASTER with prec. = 0.73). With a precision of 0.90, all but the very first data set are clustered satisfactorily. Reducing the precision to 0.73 improves the results of the first data set. It is a matter of debate whether the third data set is better clustered with a precision of 0.73. Also, with a reduced precision, RASTER attempts to cluster the homogeneous data set, while it is noteworthy (cf. Fig. A.6) that RASTER, with a precision of 0.90, is the only algorithm in that comparison that does not cluster this data set.

Size (clusters)   CLIQUE    RASTER    RASTER′
10^1              0.012     < 0.001   < 0.001
10^2              1.498     0.005     0.008
10^3              202.187   0.065     0.084

Table B.3: Performance comparison of CLIQUE, RASTER, and RASTER′ in Java. Runtimes are given in seconds and show how the various algorithms scale with increasing input sizes. For CLIQUE, we used a value of ξ of 20; for RASTER, a precision of 3.5 was used. Not shown are the percentages of clusters identified. With the exception of CLIQUE with 500K points (10^3 clusters) at a 99.6% success rate, the algorithms detect all clusters.

    Appendix C. Correctness Proofs for P-RASTER

Below, we provide correctness proofs for three properties of P-RASTER that hold after the termination of the algorithm. These proofs complement Sec. 3.5, including Alg. 2 and Alg. 3. The three properties are as follows: (1) clusters that have a neighboring significant tile in a different slice will have been joined, regardless of how often that border is crossed (multiple single-border crossings invariant), (2) clusters having neighboring significant tiles in multiple slices will have been joined, regardless of how many borders they cross (consecutive border crossings invariant), and (3) slicing can be arbitrary and does not affect the clusters in any way (arbitrary slicing invariant).

A first potential problem is that there could be a large number O_c of clusters around a border which form one large cluster when joined. Figure 3a illustrates a scenario where O_c = 3 in the case of clusters c4, c5, and c8. The second problem is that there could be a cluster within a slice that spans the length of the slice itself. Such a cluster can be joined from both its left side and its right side, as can be seen in Fig. 3a with c3, c7, and c9. When c7 is joined across the border to its right, the new cluster needs to be carried over to the joining process around the border to its left. This leads to the two theorems below.

Theorem 1. Multiple single-border crossings invariant: Clusters that have neighboring significant tiles in an adjacent slice will be joined, regardless of how many neighboring significant tiles there are across that border and on which side of it they are located.

Proof. Consider arbitrarily many clusters that are separated by the same border and that would form one single cluster without that border. This implies that each cluster has at least one other cluster it is adjacent to in the neighboring slice. One of the original clusters will eventually be added to the to visit set, processed, and have its neighbors determined. Each cluster is removed from that set as it is processed, and its neighbors are added to it if they have not been added before. This procedure continues until there are no clusters remaining in the to visit set. Thus, this part of the algorithm is an example of depth-first search (DFS) [38], where the searched nodes form a cluster.
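The joining procedure described in the proof can be sketched as follows. This is a simplified, hypothetical sketch, not our Rust implementation: clusters are represented as sets of tiles, and the adjacency test is assumed to check whether two clusters contain neighboring significant tiles across a border.

def join_across_borders(clusters, adjacent):
    """Merge clusters that are transitively adjacent across slice borders.

    clusters is a list of sets of tiles; adjacent(a, b) reports whether two
    clusters contain neighboring significant tiles. The traversal mirrors
    the to visit set used in the proof above.
    """
    merged = []
    unvisited = set(range(len(clusters)))
    while unvisited:
        to_visit = [unvisited.pop()]
        component = set()
        while to_visit:
            i = to_visit.pop()
            component |= clusters[i]
            for j in [k for k in unvisited if adjacent(clusters[i], clusters[k])]:
                unvisited.remove(j)
                to_visit.append(j)
        merged.append(component)
    return merged

def chebyshev_adjacent(a, b):
    # Two clusters touch if any pair of their tiles are immediate neighbors.
    return any(abs(x1 - x2) <= 1 and abs(y1 - y2) <= 1
               for x1, y1 in a for x2, y2 in b)

parts = [{(0, 0), (1, 0)}, {(2, 0)}, {(3, 0)}, {(10, 10)}]
print(join_across_borders(parts, chebyshev_adjacent))
# Two clusters remain: one with the four adjacent tiles, one with (10, 10).

Note that the chain of three sub-clusters is joined even though it crosses what could be two separate borders, which is exactly the situation covered by Theorem 2 below.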

Theorem 2. Consecutive border crossings invariant: Clusters having neighboring significant tiles in multiple slices will be joined, regardless of how many borders they may cross.

Proof. Consider the base case of n = 3 adjacent clusters that are separated by two borders: clusters C_left and C_middle are separated by border B_1; clusters C_middle and C_right are separated by border B_2. Assuming the algorithm proceeds from left to right, the first step results in a cluster C_|2| = C_left ∪ C_middle after removing B_1, as the clusters were adjacent and shared that very border. As C_|2| still borders B_2, it remains a candidate cluster for joining. Subsequently, when processing the next border, the algorithm produces C_|3| = C_|2| ∪ C_right, which is the final cluster. It should be easy to see that in the case of k adjacent clusters that are separated by k − 1 borders, the result is a cluster C_|k|. The inductive step is C_|k+1| = C_|k| ∪ C_|1|.

The previous proof is also intuitive, as clusters are processed from one vertical slice to the other, and every newly encountered neighboring cluster entails the removal of one border and also means that the number of sub-clusters added increases by one. Theorems 1 and 2 work in tandem. For instance, if a cluster C were to be split into more than three disjoint clusters around one border, Theorem 1 ensures that they are joined, while Theorem 2 ensures that clusters are joined across multiple borders. As a cluster remains a candidate for as long as it contains at least one tile that touches a border, it is irrelevant which case is addressed first.

P-RASTER slices the tile space vertically in a uniform manner. A practical rule of thumb is to set the number of slices to the number of available threads. The number of slices is only one aspect, as it may also make sense to not slice uniformly but instead set the borders dynamically so that each slice contains approximately the same number of points, which cannot be expected when slicing the tile space uniformly with non-uniformly distributed data. What all these considerations have in common is that they depend on slicing not affecting the resulting clusters, which depends on the correctness of the next theorem.
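Before stating that theorem, here is a minimal sketch of how such point-balanced borders could be chosen. It is hypothetical illustration code, assuming slicing along the x-coordinate and borders placed at quantiles of the x-values; it is not part of P-RASTER as described above.

def balanced_slice_borders(x_values, num_slices):
    """Place slice borders at quantiles of the x-coordinates so that each
    slice receives approximately the same number of points.

    Returns num_slices - 1 border positions.
    """
    xs = sorted(x_values)
    n = len(xs)
    return [xs[(i * n) // num_slices] for i in range(1, num_slices)]

xs = [0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 9.7, 9.8]
print(balanced_slice_borders(xs, num_slices=4))  # [0.3, 5.1, 9.7]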


Theorem 3. Arbitrary slicing invariant: Slicing of the tile space can be arbitrary and does not affect the clusters in any way.

Proof. The width of a slice is irrelevant. It only matters whether a border divides a cluster or not. If it does not, then cluster reconstruction cannot be affected. On the other hand, if it does, then the divided clusters will be unified based on Theorems 1 and 2, which do not depend on the location of the borders.

                   prec = 3.5                                    prec = 4
cores   total          π              κ              total           π               κ

P-RASTER / 50M points (10^5 clusters)
1       2.62 ± 0.00    2.05 ± 0.00    0.58 ± 0.00    4.79 ± 0.03     3.67 ± 0.03     1.12 ± 0.01
2       1.43 ± 0.01    1.13 ± 0.01    0.30 ± 0.00    2.57 ± 0.01     2.00 ± 0.01     0.57 ± 0.00
4       1.01 ± 0.01    0.87 ± 0.00    0.14 ± 0.00    1.85 ± 0.02     1.58 ± 0.02     0.27 ± 0.00
8       0.85 ± 0.01    0.75 ± 0.01    0.09 ± 0.00    1.59 ± 0.01     1.43 ± 0.01     0.16 ± 0.00

P-RASTER / 500M points (10^6 clusters)
1       49.33 ± 0.15   42.19 ± 0.14   7.14 ± 0.02    61.17 ± 0.27    48.56 ± 0.23    12.61 ± 0.10
2       26.67 ± 0.05   23.00 ± 0.06   3.66 ± 0.06    35.90 ± 0.07    29.39 ± 0.12    6.52 ± 0.10
4       17.01 ± 0.08   14.93 ± 0.07   2.08 ± 0.02    23.22 ± 0.14    19.44 ± 0.12    3.78 ± 0.05
8       12.50 ± 0.02   10.88 ± 0.03   1.61 ± 0.01    19.89 ± 0.17    16.91 ± 0.15    2.98 ± 0.02

P-RASTER′ / 50M points (10^5 clusters)
1       6.81 ± 0.43    6.01 ± 0.43    0.80 ± 0.00    11.61 ± 0.98    9.83 ± 0.93     1.78 ± 0.08
2       5.01 ± 0.16    4.68 ± 0.16    0.33 ± 0.01    9.56 ± 0.40     8.93 ± 0.40     0.63 ± 0.00
4       4.61 ± 0.08    4.45 ± 0.08    0.16 ± 0.00    8.11 ± 0.25     7.80 ± 0.25     0.31 ± 0.01
8       4.53 ± 0.10    4.43 ± 0.09    0.10 ± 0.01    8.44 ± 0.22     8.24 ± 0.22     0.19 ± 0.01

P-RASTER′ / 500M points (10^6 clusters)
1       82.32 ± 4.20   72.73 ± 4.20   9.58 ± 0.01    141.52 ± 2.79   118.99 ± 2.80   22.53 ± 0.77
2       63.01 ± 2.23   59.03 ± 2.25   3.98 ± 0.02    110.68 ± 4.31   103.41 ± 4.36   7.28 ± 0.06
4       55.94 ± 2.14   53.70 ± 2.15   2.24 ± 0.01    141.83 ± 4.82   137.64 ± 4.77   4.19 ± 0.05
8       55.42 ± 1.21   53.70 ± 1.22   1.72 ± 0.01    137.75 ± 11.32  134.43 ± 11.30  3.32 ± 0.07

Table C.4: Runtime of parallel implementations of RASTER and RASTER′ in Rust at precision levels 3.5 and 4 (5 runs each). The chosen parameter values for the threshold τ and the minimum cluster size µ were 5 and 4, respectively. In addition to the total runtime, the runtimes for the projection (π) and clustering (κ) steps are given. P-RASTER scales very well, while the parallel speedup of P-RASTER′ is more limited due to the cost incurred by having to join large hash tables at the end of the projection step.

    Appendix D. Runtime of P-RASTER

As part of the evaluation presented in Sec. 4, we benchmarked the performance of P-RASTER and P-RASTER′ with data sets containing 50M and 500M points. In Table C.4, detailed results are listed, showing the total runtime as well as the runtimes of the separate contraction/projection and clustering steps. We find that P-RASTER scales very well. The computationally intensive projection step is sped up by a factor of up to around 4 with 8 cores. That the speedup is not higher can be explained by the sequential bottleneck of joining the hash tables that result from performing the projection in parallel on the various slices of the tile space. The computationally less expensive clustering step, on the other hand, scales approximately with the number of cores; with 8 cores, it is up to 8 times faster. In contrast, the performance improvements with P-RASTER′ are more modest, but not insubstantial. While its clustering step scales similarly well, its projection step improves only mildly with increasing CPU cores in the case of the 50M-point (10^5 clusters) input file. However, the picture with 500M points (10^6 clusters) is mixed, showing that with an increased tile space due to a precision value of 4, performance degrades when more than two cores are used.

