

Information Systems 44 (2014) 134–154

http://dx.doi.org/10.1016/j.is.2014.01.005

Parallel online spatial and temporal aggregations on multi-core CPUs and many-core GPUs

Jianting Zhang a,n, Simin You b, Le Gruenwald c

a Department of Computer Science, the City College of New York, New York, NY 10031, USA
b Department of Computer Science, CUNY Graduate Center, New York, NY 10016, USA
c School of Computer Science, the University of Oklahoma, Norman, OK 73071, USA

Article info

Available online 22 January 2014

Keywords: OLAP; Parallel design; GPU; Multi-core CPU; Spatiotemporal aggregation; Spatial indexing; Spatial join; Large-scale data


Abstract

With the increasing availability of locating and navigation technologies on portable wireless devices, huge amounts of location data are being captured at ever growing rates. Spatial and temporal aggregations in an Online Analytical Processing (OLAP) setting for the large-scale ubiquitous urban sensing data play an important role in understanding urban dynamics and facilitating decision making. Unfortunately, existing spatial, temporal and spatiotemporal OLAP techniques are mostly based on traditional computing frameworks, i.e., disk-resident systems on uniprocessors based on serial algorithms, which makes them incapable of handling large-scale data on the parallel hardware architectures that commodity computers are already equipped with. In this study, we report our designs, implementations and experiments on developing a data management platform and a set of parallel techniques to support high-performance online spatial and temporal aggregations on multi-core CPUs and many-core Graphics Processing Units (GPUs). Our experiment results show that we are able to spatially associate nearly 170 million taxi pickup location points with their nearest street segments among 147,011 candidates in about 5–25 s on both an Nvidia Quadro 6000 GPU device and dual Intel Xeon E5405 quad-core CPUs when their Vector Processing Units (VPUs) are utilized for computing intensive tasks. After spatially associating points with road segments, spatial, temporal and spatiotemporal aggregations are reduced to relational aggregations and can be processed in the order of a fraction of a second on both GPUs and multi-core CPUs. In addition to demonstrating the feasibility of building a high-performance OLAP system for processing large-scale taxi trip data for real-time, interactive data explorations, our work also opens the path to achieving even higher OLAP query efficiency for large-scale applications through integrating domain-specific data management platforms, novel parallel data structures and algorithm designs, and hardware architecture friendly implementations.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

With the increasing availability of locating and navigation technologies on portable wireless devices, huge amounts of location data are being captured at ever growing rates. For example, the approximately 13,000 taxicabs in New York City (NYC) equipped with GPS devices generate more than half a million taxi trip records per day. Cell phone call logs represent a category of data at an even larger scale [1,2]. Also, as visitors travel around the world more frequently, location-dependent social networks such as Foursquare [3], and location-enhanced social media such as text posted to Wiki sites [4], and images and videos posted to Flickr and YouTube [5], can also potentially generate large amounts of spatial and temporal data.


All three types of data have a few features in common: (1) they are produced and collected by commodity sensing devices and are rich in data volumes in urban areas; and (2) they are a special type of spatial and temporal data with an origin location and a destination location in the geo-referenced space domain and a starting time and an ending time in the time domain. However, the intermediate locations between origins and destinations are either unavailable, inaccessible or unimportant. Compared with traditional geographical data collected by government agencies for urban planning and city administration purposes, these data can be more effective in helping people understand the real dynamics of urban areas with respect to spatial/temporal resolutions and representativeness. We term such data Ubiquitous Urban Sensing Origin-Destination data, or U2SOD data, for notation convenience [6]. Despite the close relationships between U2SOD data and Spatial Databases (SDBs) [7] and Moving Object Databases (MODs) [8], our experiences have shown that traditional disk-resident and tuple/row oriented spatial databases and moving object databases are ineffective in processing large-scale U2SOD data for practical applications including multidimensional aggregations, one of the most important modules in Online Analytical Processing (OLAP) [9].

Considerable work on developing efficient data structures and algorithms for multidimensional aggregations on CPU uniprocessors has been proposed in the past few decades [10]. Modern hardware architectures increasingly rely on parallel technologies to increase processing power due to various limits in improving the speeds of uniprocessors [11]. Unfortunately, existing data structures and algorithms that are designed for serial implementations may not be able to effectively utilize the parallel processing power of modern hardware, including multi-core CPUs and many-core Graphics Processing Units (GPUs) [11]. Despite the fact that parallel hardware is already available in the majority of commodity computers, there is still relatively little work in exploiting such parallel processing power for OLAP queries, especially in the areas of spatial and temporal aggregations of large-scale geographical data where complex join operations are required in the aggregations. Examples are counting the number of taxi pickups in each of the community districts or census blocks (spatial aggregation), generating an hourly histogram of drop-offs near the JFK airport (temporal aggregation) and computing the numbers of trips between Times Square and Central Park in morning peak hours (OD aggregation).

In this study, we report our work on developing a data management framework for large-scale U2SOD data and a set of data parallel designs of spatial and temporal aggregations that can be realized on both multi-core CPUs and many-core GPUs in an OLAP setting. Our experiments on aggregating the approximately 170 million taxi trip records in NYC in 2009 have demonstrated the effectiveness of the proposed framework and parallel techniques. By utilizing parallel spatial joins [12] to support efficient online processing, we are able to achieve real-time responses for spatial, temporal and spatiotemporal aggregations at different hierarchical levels. Compared with traditional approaches that rely on relational databases and spatial databases for aggregations, our techniques have reduced the OLAP query response times from hours to seconds. This makes it possible for urban geographers and transportation researchers to explore large-scale origin–destination data in general, and taxi trip data in particular, in an interactive manner.

The rest of the paper is arranged as follows. Section 2 introduces the background, motivation and related work. Section 3 presents a parallelization-friendly data management framework for managing large-scale origin–destination data with a focus on taxi trip records. Section 4 provides the details on the parallel designs and implementations of spatial and temporal aggregations on both multi-core CPUs and many-core GPUs. Section 5 reports the experiment results. Finally, Section 6 concludes and discusses future work.

2. Background, motivation and related work

The increasingly available location data generated by consumer wireless portable devices, such as GPS, GPS-enhanced cameras and GPS/WiFi/Cellular enhanced mobile phones, have significantly changed the ways of collecting, analyzing, disseminating and utilizing urban sensing data. Traditionally, city government agencies are responsible for collecting various types of geographical data for city management purposes, such as urban planning and traffic control. The data collection is usually done through sampling, typically at coarse resolutions, and questionnaire-based investigations which often incur long turn-around times. In contrast, as consumer mobile devices become ubiquitous, similar data obtained from GPS traces and mobile phone call logs have much finer resolutions. With the help of privacy and security related technologies, the aggregated records from such ubiquitous urban sensing data can be enormously helpful in understanding and addressing a variety of urban related issues. Research groups from both academia (e.g., the MIT Senseable City Lab) and industry (e.g., the IBM Smarter Planet initiative and Microsoft Research Asia) have developed techniques to utilize such data and understand the interactions among people and their locations/mobility at the city level (e.g., Beijing [13], Boston [14] and Rome [1]), the social group level (e.g., friends [15] and taxi-passenger pairs [16]) and the individual level [17]. However, most of the existing studies focus on the data mining aspects of such ubiquitous urban sensing data through case studies while largely leaving the data management aspects untouched. Lacking proper data management techniques can result in significant technical hurdles in making full use of such data to address outstanding societal concerns. In this study, we focus on efficiently aggregating large-scale taxi trip records to better understand human mobility and facilitate transportation planning by developing high-performance spatial, temporal and spatiotemporal aggregation techniques in an OLAP setting.

OLAP technologies are attractive for exploring possible patterns in large-scale taxi trip records and other types of origin–destination data. As the taxi trip data has spatial and temporal dimensions for both pickup and drop-off locations, as well as conventional dimensions (e.g., fare and tip), taxi trips can be naturally modeled as spatial, temporal and spatiotemporal data, which requires synergizing existing research on Spatial OLAP [18,19] and Temporal OLAP [18,20] or their combinations [18].


Due to the popularity of geo-referenced data, there are increasing research and application interests in Spatial OLAP. However, most of them focus on data modeling and query languages [18,21–23], and applications on top of spatial databases and Geographical Information Systems (GISs) [24–27]. A few sophisticated indexing and query processing algorithms to speed up certain analytical operations, such as consolidation/aggregation, drill-down, slicing and dicing, have been proposed [10,28–31]. Spatial OLAP applications on top of spatial databases and GIS, while easy to implement, impose additional I/O and computational overheads which may further slow down spatial and temporal aggregations and may not be suitable for applications that involve a large number of data records like our taxi trip application. We also note that the existing research on Spatial OLAP mostly targeted the traditional computing framework, i.e., disk-resident systems on uniprocessors based on serial algorithms, which makes it incapable of handling large-scale data on parallel hardware architectures.

Our experiments using the open source PostgreSQL database have shown that the performance of spatial aggregations on a large-scale dataset, which contains hundreds of millions of taxi trip records, using traditional disk-resident database systems is too poor to be useful for our applications. We note that spatial queries are supported in PostgreSQL through the PostGIS extension [32]. The Appendix at the end of the paper lists 16 SQL statements (Q1–Q16) that are involved in a database based implementation of spatial and temporal queries, where tables t and n represent the taxi trip records data and the street network data, respectively. Note that queries Q1 through Q8 are used for spatial associations (including indexing and spatial join). Q9 and Q10 are used for indexing materialized spatial relationships, i.e., PUSeg and DOSeg are indexed as relational attributes. Q11 and Q12 are used for spatial aggregations based on the materialized spatial relationships. Finally, Q13 and Q14 are used for temporal indexing and Q15 and Q16 are used for temporal aggregations. On a high-end computing node running PostgreSQL 9.2.3, Q5 took dozens of hours. We note that Q5 is already an optimized SQL statement that uses the non-standard "SELECT DISTINCT ON" clause in PostgreSQL and approximates the nearest-neighbor query using the ST_DWithin function and the "ORDER BY distance" clause. Obviously the performance is far from satisfactory for online OLAP queries. While we are aware of the fact that certain optimization techniques, such as setting proper parameters and data partitioning, can potentially improve the overall performance, we believe that Spatial OLAP queries based on traditional database systems cannot achieve the performance level that we are aiming at for data at this scale using existing technologies. Our additional experiment results have also revealed that the performance can be drastically improved by utilizing large main-memory capacities and GPU parallel processing [33,34].

This has motivated us to investigate techniques for boosting the performance of spatial, temporal and spatiotemporal aggregations by making full use of the modern hardware that commodity personal computers are already equipped with.

We refer the readers to Ref. [9] for a brief review of parallel OLAP computation. We note that existing work on parallel OLAP mostly focused on parallelization on shared-nothing architectures while leaving parallelization on shared-memory Symmetric Multiprocessing (SMP) architectures, including both multi-core CPUs and many-core GPUs, largely untouched. The number of processing cores on both single-node CPUs and GPUs is increasing fast. Mainstream Intel CPUs and Nvidia GPUs have 4–8 and 512 cores, respectively. Devices based on the Intel Many Integrated Core (MIC) architecture (such as the Xeon Phi 5110P) have 60 cores [35] and devices based on the Nvidia Kepler architecture equipped with close to 3000 cores [36] are currently available on the market. These inexpensive devices based on shared-memory SMP architectures are cost-effective and relatively easy to program. We believe they are an attractive alternative to cluster computing for solving many practical large-scale data management problems when compared to MapReduce based cloud computing, where computing resources are often utilized inefficiently [37]. Despite the fact that shared-nothing architectures are often considered to have better scalability than shared-memory based ones, we argue that, from a practical perspective, higher scalability can be achieved by integrating the two architectures when necessary. Fully utilizing the parallel processing power of SMP processors (including both CPUs and GPUs) will naturally improve the overall system performance in a cluster computing environment using grid or cloud computing resources. As a first step, we currently focus on parallel aggregations on multi-core CPUs and many-core GPUs equipped in a single computing node, i.e., in a personal computing environment that is more suitable for interaction-intensive applications such as OLAP queries.

There are a few pioneering works on using multi-core CPUs and many-core GPUs for OLAP queries including aggregations. The design and implementation of the HYRISE system [38] have motivated our work in many aspects, such as the column-oriented physical data layout, data compression and in-memory data structures. However, most of the existing systems, including HYRISE, are designed for traditional business data and do not support geo-referenced data. There are also several attempts at using GPUs for OLAP applications with demonstrable performance speedups [10,39,40]. However, again, they do not explicitly support spatial or spatiotemporal aggregations, which are arguably more computationally intensive. Furthermore, while previous studies have shown that parallel scan based GPU implementations can be effective in processing data records in the order of a few million, the number of data records in our application is almost two orders of magnitude larger, which makes GPU implementation more technically challenging.

Our designs and implementations of spatial and temporal aggregations heavily rely on parallel primitives that are supported by several parallel libraries on both CPUs and GPUs. Parallel primitives, such as sort, scan and reduce [41,42], refer to a collection of fundamental algorithms that can be run on parallel machines.


The behaviors of popular parallel primitives are well understood. In particular, we have extensively used the open source Thrust library [36] that comes with the Nvidia CUDA SDK [43] on GPUs and the open source Intel TBB package [44] on CPUs. It is beyond the scope of this paper to provide a comprehensive review of the parallel primitives that we have utilized in this study and we refer the interested readers to Ref. [42] for more details. A brief introduction to several parallel primitives that are involved in our GPU implementations of the aggregations is provided online [45]. The parallel sorting primitive that we have used in this study comes from the GNU libstdc++ Parallel Mode library, which was derived from the Multi-Core Standard Template Library (MCSTL) project [46]. In addition, our multi-core CPU implementation of spatial associations utilizes the Vector Processing Units (VPUs) on CPUs to boost its performance. While several pioneering studies have tried to exploit the Single Instruction Multiple Data (SIMD) parallel processing power of VPUs by calling lower level hardware-specific APIs (SIMD intrinsics) [47–49], we take advantage of the Intel ISPC open source compiler [50] that has recently become available. The ISPC compiler supports programming SIMD units in a way similar to CUDA based GPU programming, which is much more productive. To the best of our knowledge, we are not aware of any previous work on utilizing VPUs' SIMD processing power for spatial associations by speeding up geometrical computation.
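To make the primitive-based style concrete, the following minimal sketch (for illustration only; variable names are ours and it is not an excerpt from our system) counts the number of records per key with sort_by_key-style sorting followed by reduce_by_key, the same group-by pattern that the aggregations in this paper are built on. It compiles with nvcc against Thrust and can be retargeted to the TBB backend by recompiling with Thrust's device-system macro.

// Minimal group-by count with Thrust: bring equal keys together, then reduce.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>

int main() {
    int h_keys[] = {7, 2, 7, 5, 2, 7};               // e.g., grid cell identifiers
    thrust::device_vector<int> keys(h_keys, h_keys + 6);
    thrust::device_vector<int> ones(6, 1);            // one record per input item
    thrust::sort(keys.begin(), keys.end());           // group identical keys
    thrust::device_vector<int> out_keys(6), counts(6);
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), ones.begin(),
                                      out_keys.begin(), counts.begin());
    int n = ends.first - out_keys.begin();             // number of distinct keys
    for (int i = 0; i < n; i++)
        printf("key %d: %d records\n", (int)out_keys[i], (int)counts[i]);
    return 0;
}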

Fig. 1. Illustration of U2SOD-DB prototype system components and interactions.

3. U2SOD-DB: managing origin–destination (OD) data in DBMS

Almost all taxi cabs in cities of developed countries have been equipped with GPS devices and different types of trip related information are recorded. For example, more than 13,000 GPS-equipped medallion taxicabs in New York City (NYC) generate nearly half a million taxi trips per day and approximately 170 million trips per year, serving 300 million passengers. The number of yearly taxi riders is about 1/5 of that of subway riders and 1/3 of that of bus riders in NYC according to Metropolitan Transportation Authority (MTA) ridership statistics [51]. Taxi trips play important roles in the everyday lives of residents and visitors of NYC as well as any major city worldwide. The raw service data has a few dozen attributes such as pick-up/drop-off location and time as well as fare/tip/toll amounts. In this study, we generalize the taxi trip data as a special type of OD data and present the design and implementation of the U2SOD-DB system that is designed for managing U2SOD data on modern hardware.

Our design of U2SOD-DB has three tiers. The lowest tier is closely related to the physical data layout and we have adopted a time-segmented, column-oriented data layout approach. The raw data are first transformed into binary representations and attributes are clustered into groups based on application semantics. The data corresponding to the attribute groups at a certain time granularity are stored as a single database file with the relevant metadata registered with the database system.


We assume that one or more database files can be streamed into CPU main memory as a whole to maximize disk I/O utilization. Multi-core CPU processors can access the data files in parallel once they are loaded into the CPU main memory. They can also be transferred to GPU global memories should the system determine that GPU parallel processing is more advantageous. The middle tier is designed to support efficient data accesses and provides some commonly used routines, such as compression, histogramming and indexing. The top tier is more application specific. For spatiotemporal aggregations in an OLAP setting, the current U2SOD-DB supports efficient spatial joins between OD data and urban infrastructure data, such as road networks and administrative regions. An illustration of the U2SOD-DB prototype system components and interactions is provided in Fig. 1 [6]. For the remainder of this section, we provide technical details on the data layout and timestamp compression and a high-level overview of spatial and temporal aggregations before their parallel designs and implementation details are presented in Section 4.

3.1. Time-segmented column-oriented data layout

As shown in Fig. 2 [6], we categorize the attributes that are associated with trip records into several groups and data of the attributes in the same group are stored in a single database file by following the column-oriented layout design principle: attributes in the same group are likely to be used together and thus it is beneficial to load them into main memory as a whole to reduce I/O overheads. Given a fixed amount of main memory, as the attribute field lengths of individual groups are much smaller than the lengths of all attributes, more records that are related to the analysis can be loaded into main memory for fast data accesses.

Fig. 2. Column-oriented physical data layout in U2SOD-DB.

In general, the column-oriented data layout design improves on traditional tuple-based physical storage in relational databases by avoiding reading unneeded attribute values into main memory buffers and subsequently increasing the number of tuples that can be read into memory in a single I/O request. We also note that combining several attribute groups into one and extracting attributes from multiple groups to form materialized views can be beneficial for certain tasks. For example, in Fig. 2, attribute group 5 (start_x, start_y, end_x and end_y) can be considered as a materialized view of attribute group 2 (start_lon, start_lat, end_lon and end_lat) obtained by applying a local map projection to the latitude/longitude pairs. Since the projected data are frequently used in calculating geometric and shortest path distances and map projections are fairly expensive, materializing attribute group 5 can significantly improve system performance. Another example to demonstrate the utilization of materialized views on the physically grouped attributes is verifying the recorded trip times (indicated by trip_time in group 6) against computed trip times (obtained by subtracting the pickup time from the drop-off time in group 3) by materializing the respective attributes in the two groups. We further note that, among the attributes in the original dataset shown in Fig. 2, some can be derived from the others. For example, both start/end zip codes (group 9) and addresses of pickup and drop-off locations (group 8) can be derived from start/end latitudes and longitudes (group 2) through reverse geocoding or other related techniques.
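As a small sketch of what one attribute group looks like in this layout (field and file names here are illustrative, not the exact ones used by U2SOD-DB), each group is a structure of arrays that can be written to, and streamed back from, a single binary file per time segment:

// Sketch of one attribute group (cf. group 2 in Fig. 2) laid out column-wise.
#include <cstdio>
#include <vector>

struct PickupDropoffCoords {                 // one month's worth of records
    std::vector<float> start_lon, start_lat; // pickup coordinates
    std::vector<float> end_lon, end_lat;     // drop-off coordinates

    // The four columns are written back-to-back so the whole group can be
    // streamed into memory (or on to the GPU) with one sequential read.
    bool write(const char* path) const {
        FILE* f = fopen(path, "wb");
        if (!f) return false;
        size_t n = start_lon.size();
        fwrite(&n, sizeof(n), 1, f);
        fwrite(start_lon.data(), sizeof(float), n, f);
        fwrite(start_lat.data(), sizeof(float), n, f);
        fwrite(end_lon.data(),   sizeof(float), n, f);
        fwrite(end_lat.data(),   sizeof(float), n, f);
        fclose(f);
        return true;
    }
};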

After determining the data layout by grouping attributes (vertical partitioning), the next issue is to determine the appropriate number of records in database files so that these files can be efficiently streamed among hard drives, CPU main memory and GPU device memory.


The sizes of the database files should be large enough to reduce system overheads as much as possible but small enough to accommodate multiple database files simultaneously in CPU/GPU memories so that typical operations can be completed in the designated memory buffers. At the same time, the numbers of records should correspond to certain time granularities as much as possible. In our design, we use the month as the basic unit (temporal granularity) to segment taxi trip records (horizontal partitioning). Given that there are about half a million taxi trip records per day in NYC, assuming that an attribute group has a record length of 16 bytes (e.g., four attributes, each represented by a 4-byte integer), the monthly database file would be 16 bytes × 0.5 million records per day × 30 days = 240 megabytes. Since the main memory capacity in our experiment system is 16 gigabytes, the database file size is appropriate, although other sizes might be suitable as well.

3.2. Timestamp compression and temporal aggregations

The text format of the pickup and drop-off times converted from relational databases, like "2009-01-17 23:52:34", takes 20 bytes. While the format can be easily converted to "struct tm" in the standard C language to satisfy the needs of all temporal aggregations (e.g., month, day of week) on CPUs, we found that the data structure takes 56 bytes on 64-bit Linux and 44 bytes on 32-bit Linux platforms, which may be too much from a memory footprint perspective. Furthermore, the time structure cannot be used on GPUs directly, which brings a significant compatibility issue. Our solution is to compress the pickup and drop-off times (PUT and DOT) into 4-byte (32-bit) memory variables using the following bit layout, starting from the least significant bit: 6 bits for second (0–59), 6 bits for minute (0–59), 5 bits for hour (0–23), 5 bits for day (0–30) and 4 bits for month (0–11). The remaining 6 bits (0–63) can be used to specify the year relative to a beginning year (e.g., 2000), which should be sufficient for a reasonably long study period. Retrieving any of the year, month, day, hour, minute and second fields can be easily done by bitwise operations and integer operations which are efficient on both CPUs and GPUs.


The straightforward technique has reduced the memory footprint to 1/5 (4/20) and is friendly to both CPUs and GPUs.
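A minimal sketch of this bit layout (helper names are ours, shown for illustration only) packs and unpacks a timestamp with shifts and masks; the same functions can be marked __host__ __device__ so they serve both the CPU and GPU code paths.

// Pack a timestamp into 32 bits, starting from the least significant bit:
// 6 bits second, 6 bits minute, 5 bits hour, 5 bits day (0-30),
// 4 bits month (0-11), 6 bits year offset from a base year (e.g., 2000).
#include <cstdint>

inline uint32_t pack_time(int year, int month, int day,
                          int hour, int minute, int second) {
    return  (uint32_t)second        |          // bits  0-5
           ((uint32_t)minute << 6)  |          // bits  6-11
           ((uint32_t)hour   << 12) |          // bits 12-16
           ((uint32_t)day    << 17) |          // bits 17-21
           ((uint32_t)month  << 22) |          // bits 22-25
           ((uint32_t)(year - 2000) << 26);    // bits 26-31
}

inline int second_of(uint32_t t) { return  t        & 0x3F; }
inline int minute_of(uint32_t t) { return (t >> 6)  & 0x3F; }
inline int hour_of(uint32_t t)   { return (t >> 12) & 0x1F; }
inline int day_of(uint32_t t)    { return (t >> 17) & 0x1F; }
inline int month_of(uint32_t t)  { return (t >> 22) & 0x0F; }
inline int year_of(uint32_t t)   { return ((t >> 26) & 0x3F) + 2000; }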

At first glance, the design does not support temporal aggregations based on "day of week" and "day of year" very well, as these two fields are not explicitly stored in our design as they are in "struct tm" in the standard C library. Computing the values of these two fields, while feasible on CPUs (by using the C/C++ mktime function), can incur significant overheads when the number of records is huge. For example, our experiments have shown that computing "day of week" alone for 170 million records can take 22 s, i.e., 100+ CPU cycles per timestamp on average. However, we would like to draw attention to the fact that timestamps can be aggregated by "year+month+day" before they are further aggregated according to "day of week" or "day of year". Since the possible combinations of (year, month, day) in a reasonably long period are limited (on the order of a few thousand), they can be aggregated to "day of week" or "day of year" in a fraction of a millisecond using the C/C++ mktime function on CPUs. The design also eliminates the need for the GPU implementation to compute "day of week" or "day of year" from "year+month+day", which is nontrivial.
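Along the lines just described, the following sketch (illustrative only; the per-day counts below are made up) folds per-day counts keyed by (year, month, day) into seven day-of-week buckets with only a handful of mktime calls:

// Fold per-day counts into day-of-week buckets; only a few thousand
// (year, month, day) combinations exist, so mktime is called rarely.
#include <ctime>
#include <map>
#include <tuple>
#include <cstdio>

int main() {
    // Counts per (year, month 0-11, day 0-30), e.g., produced by a prior
    // reduction over the packed timestamps; values here are made up.
    std::map<std::tuple<int,int,int>, long> per_day;
    per_day[std::make_tuple(2009, 0, 16)] = 480123;
    per_day[std::make_tuple(2009, 0, 17)] = 502456;
    per_day[std::make_tuple(2009, 0, 18)] = 455789;

    long per_dow[7] = {0};                        // Sunday = 0 ... Saturday = 6
    for (const auto& kv : per_day) {
        std::tm t = {};
        t.tm_year = std::get<0>(kv.first) - 1900; // tm_year counts from 1900
        t.tm_mon  = std::get<1>(kv.first);        // already 0-based, as in our layout
        t.tm_mday = std::get<2>(kv.first) + 1;    // tm_mday is 1-based
        std::mktime(&t);                          // normalizes and fills tm_wday
        per_dow[t.tm_wday] += kv.second;
    }
    for (int d = 0; d < 7; d++) printf("day-of-week %d: %ld\n", d, per_dow[d]);
    return 0;
}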

3.3. Overview of spatial and temporal aggregations

While many operations and analytical tasks can be performed on U2SOD data, spatial and temporal aggregations are among the fundamental ones. In a way similar to efficient OLAP operations for decision support on relational data, high-performance spatial and temporal aggregations are crucial in effectively supporting more complex analytical operations on OD data. Fig. 3 illustrates the general framework and example spatial and temporal aggregations that are supported by our U2SOD-DB design. Most of the temporal aggregations (e.g., daily and hourly) and some of the spatial aggregations (e.g., grid based) can use data-independent schemes, while more complex aggregations rely on the schemas provided by infrastructure data such as road networks and administrative hierarchies. Our designs and implementations to be presented in Section 4 focus on complex spatial aggregations. We note that, as shown in Fig. 3, once the OD locations (e.g., pickup and drop-off locations in the taxi trip data) are associated with street segments or different types of zones, spatial, temporal and spatiotemporal aggregations can be reduced to simple relational aggregations without involving expensive spatial and/or temporal operations any more.

Fig. 3. Example spatial and temporal aggregations in U2SOD-DB.


Table 1
List of parallel design and implementation choices.

Module                   Multi-core CPU implementations   GPU implementations
1. Point indexing        GNU parallel (FSG)               Thrust over CUDA (MLQ)
2. Polyline indexing     Thrust over TBB (FSG)            Thrust over CUDA (FSG/MLQ)
3. Spatial filtering     Thrust over TBB (FSG)            Thrust over CUDA (FSG+MLQ)
4. Spatial refinement    TBB (with ISPC) (FSG)            CUDA (FSG/MLQ)


The U2SOD-DB architecture is designed to support multiple types of spatial aggregations by providing a common spatial indexing and spatial filtering framework to efficiently pair subsets of relevant datasets before applying specific refinement approaches for associating individual data items, e.g., based on Nearest Neighbor (NN) search or Point-In-Polygon (PIP) test. In this study, we focus on spatially associating taxi locations with segments in road networks by locating the nearest street segment within distance R for each point location. After the 170 million NYC taxi trip locations are associated with the LION road network dataset published by the NYC Department of City Planning (DCP) [52], as street segments are nicely associated with quite a few types of polygon zones as shown in Fig. 3, they can be used for further relational aggregations.

It is beyond the scope of our study to implement all the types of spatial, temporal and spatiotemporal aggregations that have been modeled in the literature [53,18]. Instead, we focus on the aggregations along the spatial and temporal hierarchies shown in Fig. 3. We divide an aggregation into two phases: the spatial and temporal association phase and the relational aggregation phase. The association phase, typically implemented as a join, can be performed either offline or online. The advantage of materializing spatial, temporal or spatiotemporal relationships offline is that, as computing the relationships typically is expensive, directly accessing the materialized relationships can significantly improve the overall performance. However, when dynamic query criteria are imposed (such as those based on taxi fare and tip), offline materialization becomes infeasible and fast real-time online aggregations become critical. In addition, when spatial aggregations are combined with temporal aggregations (i.e., spatiotemporal aggregations) at arbitrary levels, the possible number of aggregations grows quickly, which makes offline materialization less attractive due to disk storage, I/O and maintenance overheads. Online aggregations are more desirable in such cases.
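As a sketch of how an online spatiotemporal aggregation reduces to a relational group-by once pickup locations carry a street segment identifier (PUSeg) and a packed pickup time (the vector and function names below are illustrative, not our system's code), a composite (segment, hour) key can be built, sorted and reduced with the same primitives used throughout this paper:

// Count trips per (pickup street segment, hour of day); PUSeg identifiers
// and packed pickup timestamps are assumed to be resident on the GPU.
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

struct SegHourKey {                                 // composite (segment, hour) key
    __host__ __device__
    long long operator()(int seg, unsigned int t) const {
        int hour = (t >> 12) & 0x1F;                // hour field of the packed time
        return (long long)seg * 24 + hour;
    }
};

void trips_per_segment_hour(const thrust::device_vector<int>& puseg,
                            const thrust::device_vector<unsigned int>& putime,
                            thrust::device_vector<long long>& out_keys,
                            thrust::device_vector<int>& out_counts) {
    size_t n = puseg.size();
    thrust::device_vector<long long> keys(n);
    thrust::transform(puseg.begin(), puseg.end(), putime.begin(),
                      keys.begin(), SegHourKey());
    thrust::sort(keys.begin(), keys.end());         // group identical keys
    thrust::device_vector<int> ones(n, 1);
    out_keys.resize(n);
    out_counts.resize(n);
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), ones.begin(),
                                      out_keys.begin(), out_counts.begin());
    out_keys.resize(ends.first - out_keys.begin());
    out_counts.resize(ends.second - out_counts.begin());
}

Dynamic query criteria (e.g., a fare or tip filter) can be folded in by masking or compacting the inputs before the transform step, which is why fast online aggregation remains practical even when offline materialization is not.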

4. Parallel designs and implementations of spatial association

The U2SOD-DB architecture supports both serial and parallel designs and implementations of the spatial and temporal aggregation operations discussed above. In this study, we propose a set of data parallel designs that can be implemented on both multi-core CPUs and many-core GPUs. For spatial associations, in addition to providing the design and implementation details of the Multi-Level Quadrants (MLQ) based approach presented in Ref. [34] that focuses on many-core GPUs, we have also provided an alternative set of designs and implementations of a Flatly Structured Grid (FSG) based approach that focuses more on multi-core CPUs. While these designs and implementations will be detailed in Sections 4.1–4.5, Table 1 provides an overview of the modules and their available implementations on multiple parallel computing platforms.

Note that we use "/" to indicate that the implementation is applicable to both designs while we use "+" to indicate that similar but different implementations are required for the MLQ and FSG designs. The rationales are provided when we describe the details of each module. Among the four modules in both approaches, the Point Indexing and Polyline Indexing modules are used to partition points into groups so that only spatially close points and polylines are paired up in the Spatial Filtering module before each point is associated with its nearest polyline in the Spatial Refinement module. Provided that a good spatial filtering strategy is available, it is possible to reduce the nearest neighbor computing overhead from O(N × M) to O(N), where N is the number of points and M is the number of polylines.

The primary reason that motivates us to develop the FSG based design and implementation for point indexing is to overcome the memory limitation on GPUs (currently limited to 6 GB) for larger point datasets. The implementation is based on the parallel sorting algorithm provided by the GNU libstdc++ Parallel Mode library [46], which is very efficient on multi-core CPUs. For the Polyline Indexing module and the Spatial Filtering module, we only provide a parallel primitive based design (whose implementation is based on Thrust) because their runtimes are relatively insignificant on large datasets when compared to the other two modules. By changing the underlying computing platform from CUDA to TBB, the parallel primitive based implementations based on Thrust can be compiled to CUDA code on GPUs and to TBB code on multi-core CPUs. The downside of this highly portable solution is that, in a way similar to using the Standard Template Library (STL) [54], there are library overheads which can be significant in certain cases, especially when using TBB on multi-core CPUs where the Thrust library has not been extensively optimized. However, for the Spatial Refinement module, we provide both a native CUDA implementation on GPUs and a native TBB implementation on multi-core CPUs as we want to reduce the overheads due to parallel libraries and minimize the overall response times. As detailed in Section 5, the modules can be organized into different configurations under different scenarios and the performance of the configurations can be further compared. We next provide the details of the designs and implementations of the four modules in the following sections.


4.1. Point indexing using multi-level quadrants

As illustrated in Fig. 4, the strategy of the MLQ based point indexing is to partition the point data space in a top-down, level-wise manner and identify the quadrants with a maximum of K points at multiple levels. While the point quadrants are being identified level by level, the remaining points get more clustered, the number of remaining points becomes smaller, and the data space is reduced. The process completes when either the maximum level is reached or all points have been grouped into quadrants. In the example shown in Fig. 4, we first sort all points at level 1 using their Z-order [58] code and count the number of points under each level-1 quadrant. As quadrant 2 has only 11 points, which is less than K = 20, its key (2) and the number of points under it (11) are identified and the points are excluded from further processing. We use a (key, # of points) pair to represent a quadrant in Fig. 4. The same procedure is applied to the remaining points at level 2 to identify two quadrants (4,9) and (7,9), and at level 3 to identify six quadrants (9,7), (10,9), (11,8), (12,5), (13,8), (14,7), in an iterative manner. After the number of points in all quadrants is derived, the starting positions of the first points in the quadrants can easily be computed in parallel by using a prefix-sum [41,42] for later accesses to the points.

Fig. 4. Parallel-primitive based point indexing using MLQ design.

An advantage of limiting the maximum number of points in a quadrant to K is to facilitate load balancing in parallel computing. As shall become clear after we introduce the spatial filtering and refinement modules, when the numbers of points and the numbers of polyline vertices are bounded, the workload of parallel processing elements is also bounded and no single processing element can dominate the whole process.
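For instance, reusing the per-quadrant counts of the Fig. 4 example (a small sketch for illustration only), a single exclusive scan yields the starting positions of the identified quadrants:

// Exclusive prefix-sum turns per-quadrant point counts into the starting
// position of each quadrant's first point in the sorted point vector.
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <cstdio>

int main() {
    int counts_h[] = {11, 9, 9, 7, 9, 8, 5, 8, 7};   // points per identified quadrant
    thrust::device_vector<int> counts(counts_h, counts_h + 9), starts(9);
    thrust::exclusive_scan(counts.begin(), counts.end(), starts.begin());
    for (int i = 0; i < 9; i++)                       // 0 11 20 29 36 45 53 58 66
        printf("%d ", (int)starts[i]);
    printf("\n");
    return 0;
}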

The design is highly data parallel and can be implemented using a few parallel primitives as detailed in Ref. [34], where the point indexing module is also used for point-in-polygon test based spatial joins. The design is implemented on top of the Thrust parallel library that provides all the necessary parallel primitives. Previous studies have shown that the parallel sorting primitives implemented in Thrust are capable of handling hundreds of millions of data items per second and are highly efficient [55]. Our experiments have shown that the most expensive step in this module is to sort points based on their quadrant identifiers before counting the numbers of points in the quadrants and determining whether the points in the quadrants should be excluded from further partitioning.


By making full use of high-performance GPU-based sorting parallel primitives, as shown in Section 5.2, the GPU-based implementation of the MLQ approach for point indexing has achieved a 13X speedup over the CPU-based serial implementation.

However, while the underlying radix sort algorithm itself is scalable on shared memory systems (including both CPUs and GPUs), the applicability of the parallel sort primitives to large-scale point datasets is limited by GPU memory capacity (currently 6 GB). Our experiments have shown that the GPU implementation of the Point Indexing module based on the MLQ design allows a maximum number of points in the order of 150–200 million on an Nvidia Quadro 6000 device. While it is possible to use pinned memory on CPUs to "virtually" expand the GPU memory, our experiments suggest that the extensive data communication (partially due to the nature of the underlying radix sort algorithm) incurs significant overheads, which makes the technique impractical. In addition, our previous experiments have shown that copying all the point data from CPU memory to GPU memory incurred 25% overhead, which is quite significant. While the GPU-based design and implementation is still useful for reasonably sized point datasets, it is necessary to seek alternative solutions to efficiently index large point datasets. While our CPU-based point indexing approach on multi-core CPUs that will be presented in the next subsection overcomes these problems, we note that several research efforts are underway to support GPU-based sorting in CPU memory and external memory (e.g., [56]) and we plan to incorporate these techniques once they become available to make the MLQ based point indexing technique more scalable.

4.2. Point indexing using flatly structured grids

We design an alternative approach for point indexing that runs completely on multi-core CPUs. As CPUs can have much larger memory capacity (up to hundreds of GBs on commodity workstations) than GPUs, the approach eliminates the memory constraints to a certain extent. In addition, there is no need to copy data from CPU memory to GPU memory anymore, which can potentially reduce end-to-end runtime considerably. Grid files are among the most classic techniques for spatial indexing [57]. The FSG design for point indexing on multi-core CPUs is based on a flat grid-file structure which is significantly simpler than the MLQ based design. Given a grid resolution, we can generate a key based on the coordinates of a point, i.e., deriving a Space Filling Curve (SFC) code based on row-major order or Z-order [58]. This step is highly parallelizable and is expected to be very fast as only element-wise operations, i.e., operations applied to all vector elements in parallel, are involved. The most important step in the FSG approach is parallel sorting of the point records based on keys. Our implementation is based on the GNU libstdc++ Parallel Mode library [46]. Based on our experiments, the implementation is memory efficient and is effective in making full use of multi-core CPU hardware. In addition to being applicable to large point datasets whose volumes are beyond the capacity of GPU memory, eliminating the need to transfer data back and forth between CPU and GPU memories can potentially improve the end-to-end response time of the FSG approach on CPUs.

As detailed in Section 5.3, the new approach is significantly more efficient than a straightforward port of the MLQ based GPU design and implementation to multi-core CPUs, a feature offered by the Thrust library. The efficiency is largely due to the simplified design using a flatly structured grid. However, the simplified design also loses the load balancing feature provided by the MLQ approach, as the number of points in grid cells can potentially be unbounded. We argue, however, that when the grid resolutions are sufficiently fine (which is typically true in our applications using millions of grid cells), the chance of extreme cases where points in a few grid cells dominate the whole computation (in the Spatial Refinement module) is very small on multi-core CPUs. This is especially true on CPUs where the number of processing units (cores) is typically small (in the order of a few to a few dozen) and tasks can be dynamically balanced by, for example, using the scheduling modules provided by TBB explicitly or implicitly.
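A minimal sketch of this step (our own illustration, assuming a grid origin, cell size and grid width as parameters) computes a row-major cell key per point and sorts (key, point index) pairs with the GNU Parallel Mode sort; compiling with -fopenmp enables the multi-threaded sort.

// Flat-grid point indexing sketch: row-major cell keys + multi-core sort.
#include <parallel/algorithm>   // __gnu_parallel::sort (GNU libstdc++ Parallel Mode)
#include <vector>
#include <utility>
#include <cmath>

std::vector<std::pair<int, int>>              // (cell key, original point index)
index_points(const std::vector<float>& x, const std::vector<float>& y,
             float x0, float y0, float cell, int grid_w) {
    std::vector<std::pair<int, int>> keyed(x.size());
    #pragma omp parallel for                   // element-wise, embarrassingly parallel
    for (long i = 0; i < (long)x.size(); i++) {
        int cx = (int)std::floor((x[i] - x0) / cell);
        int cy = (int)std::floor((y[i] - y0) / cell);
        keyed[i] = {cy * grid_w + cx, (int)i}; // row-major key; Z-order is also possible
    }
    // After sorting, points with the same key become contiguous, forming the
    // per-cell groups consumed by the filtering and refinement phases.
    __gnu_parallel::sort(keyed.begin(), keyed.end());
    return keyed;
}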

4.3. Polyline indexing

Polylines can be indexed in a similar way as polygons based on their Minimum Bounding Boxes (MBBs). While many approaches have been proposed in the serial computing setting, where data partition based approaches (such as R-Trees) are preferred [57], it remains unclear what the best way is to index polylines and polygons in parallel settings. In this study, we index polylines that represent street segments also based on grid files. As will become clear when the Spatial Filtering module is introduced in Section 4.4, after rasterizing the expanded polyline MBBs to grid cells, binary searches can be applied to pair point quadrants (MLQ approach) and point cells (FSG approach) with polylines for refinement. As the number of polylines is typically much smaller than the number of points in spatial associations, we assume that polyline MBBs can fit into GPU memory completely and thus we only provide a GPU-based implementation that is independent of the MLQ and FSG designs.

To support locating the nearest polyline for a point (x, y) within a query window width of R, as shown at the top-right part of Fig. 5, it is sufficient to examine all the polylines whose expanded MBBs intersect with the point. Assuming the MBB of a polyline is (x1, y1, x2, y2), the necessary condition for the polyline to be at most R distance away from the point is that the point intersects with the rectangle (x1−R, y1−R, x2+R, y2+R). Subsequently, for a group of points in a quadrant or a cell, the necessary condition for at least one of the points to be at most R distance away from a polyline is that the bounding box of the point group intersects with the expanded MBB of the polyline (x1−R, y1−R, x2+R, y2+R). This observation requires rasterizing the expanded MBBs of polylines, which we introduce next through an example with two expanded MBBs shown in the bottom part of Fig. 5. Assuming that the widths and heights of the two MBBs are W1, W2 and H1, H2, and the coordinates of their top-left corners are (TPX1, TPY1) and (TPX2, TPY2), the first step is to compute the number of cells that the two MBBs cover by using a Transform primitive.


After computing the starting positions of the first cells in the output cell vector through a Scan (prefix-sum) primitive in Step 2, the identifiers of the MBBs are scattered to the output cell vector and they are subsequently propagated along the vector using a (max) Scan primitive in Step 3. Using the sequences of cells in the MBBs generated in Step 4 and the MBB identifier vector derived in Step 3, Step 5 first retrieves the top-left coordinates of the MBBs and then adds the local offsets of the cells they cover to compute the global cell identifiers for all cells in parallel by using a Transform primitive again. The algorithm is sketched in the top-left part of Fig. 5. We note that, for the derived multi-level point quadrants in the MLQ approach, the same procedure can be applied to rasterize the quadrants to grid cells.
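The following serial sketch mirrors the primitive sequence just described (Transform to count covered cells, Scan for output offsets, then emitting cell and MBB identifiers); it is an illustration only, with our own variable names, and it replaces the scatter plus max-scan identifier propagation with a direct per-MBB write for clarity.

// Rasterize R-expanded polyline MBBs to grid cells (serial illustration of
// the parallel-primitive design, not the actual GPU implementation).
#include <vector>
#include <cmath>

struct MBB { float x1, y1, x2, y2; };

void rasterize_mbbs(const std::vector<MBB>& mbbs, float R, float x0, float y0,
                    float cell, int grid_w,
                    std::vector<int>& cell_ids, std::vector<int>& mbb_ids) {
    size_t n = mbbs.size();
    std::vector<int> counts(n), offsets(n + 1, 0);
    // Step 1 (Transform): number of grid cells covered by each expanded MBB.
    for (size_t i = 0; i < n; i++) {
        int cx1 = (int)std::floor((mbbs[i].x1 - R - x0) / cell);
        int cy1 = (int)std::floor((mbbs[i].y1 - R - y0) / cell);
        int cx2 = (int)std::floor((mbbs[i].x2 + R - x0) / cell);
        int cy2 = (int)std::floor((mbbs[i].y2 + R - y0) / cell);
        counts[i] = (cx2 - cx1 + 1) * (cy2 - cy1 + 1);
    }
    // Step 2 (Scan): exclusive prefix-sum gives each MBB's output offset.
    for (size_t i = 0; i < n; i++) offsets[i + 1] = offsets[i] + counts[i];
    cell_ids.resize(offsets[n]);
    mbb_ids.resize(offsets[n]);
    // Steps 3-5 (Scatter + propagate + Transform): emit (cell id, MBB id) pairs.
    for (size_t i = 0; i < n; i++) {
        int cx1 = (int)std::floor((mbbs[i].x1 - R - x0) / cell);
        int cy1 = (int)std::floor((mbbs[i].y1 - R - y0) / cell);
        int cx2 = (int)std::floor((mbbs[i].x2 + R - x0) / cell);
        int cy2 = (int)std::floor((mbbs[i].y2 + R - y0) / cell);
        int k = offsets[i];
        for (int cy = cy1; cy <= cy2; cy++)
            for (int cx = cx1; cx <= cx2; cx++) {
                cell_ids[k] = cy * grid_w + cx;   // global cell identifier
                mbb_ids[k]  = (int)i;
                k++;
            }
    }
}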

Fig. 5. Parallel-primitive based polyline MBB indexing.

4.4. Spatial filtering

The Spatial Filtering module requires two similar but different implementations for the MLQ and FSG approaches. While the MLQ approach requires rasterizing the resulting quadrants as discussed previously and pairs polylines and point quadrants indirectly through grid cell identifiers, the point grid cells can be directly paired with polylines in the FSG approach, both through parallel binary searches as indicated by the dotted lines in the middle of Fig. 6A and Fig. 6B. In Fig. 6A (top), vectors VQQ and VQC keep the 1-to-many relationship between point quadrants and grid cells, and vectors VPP and VPC keep the 1-to-many relationship between polyline MBBs and grid cells. Identifiers of point quadrants in vector VQQ and identifiers of polyline MBBs can be paired through common grid cell identifiers in vectors VQC and VPC. In Fig. 6B (bottom), vector VGC used in the FSG approach is equivalent to VQC used in the MLQ approach and polyline MBBs are paired with grid cells directly. It is clear that the set of operations needed in the MLQ approach is a superset of those in the FSG approach. Although the implementation of the FSG approach for spatial filtering is simpler, unlike the MLQ approach that is designed explicitly for load balancing, it can potentially suffer from load unbalancing as the numbers of points in grid cells can vary significantly while the number of points in point quadrants is bounded by the threshold K in the MLQ design. This may cause workload unbalancing in the refinement phase. However, similar to point indexing, we argue that when the grid resolutions are sufficiently high and the total number of cells is significantly larger than the number of processing units, load balancing can be achieved to a certain degree by dynamically scheduling (cell, MBB) pairs to parallel processing units.
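A small sketch of the pairing step (our own illustration, assuming the polyline (cell, MBB) pairs produced by rasterization have already been sorted by cell identifier, as for VPC/VPP above) uses binary searches over the sorted cell-id vector to find, for each point cell, the range of candidate polyline MBBs:

// Pair each point grid cell with candidate polyline MBBs via binary search
// over the polyline cell-id vector (sorted by cell id). Illustrative only.
#include <vector>
#include <algorithm>
#include <utility>

std::vector<std::pair<int, int>>               // (point cell id, polyline MBB id)
pair_cells_with_mbbs(const std::vector<int>& point_cells,   // VGC: distinct, sorted
                     const std::vector<int>& poly_cells,    // VPC: sorted by cell id
                     const std::vector<int>& poly_mbbs) {   // VPP: aligned with VPC
    std::vector<std::pair<int, int>> pairs;
    for (int c : point_cells) {
        auto lo = std::lower_bound(poly_cells.begin(), poly_cells.end(), c);
        auto hi = std::upper_bound(poly_cells.begin(), poly_cells.end(), c);
        for (auto it = lo; it != hi; ++it)      // every MBB rasterized into cell c
            pairs.emplace_back(c, poly_mbbs[it - poly_cells.begin()]);
    }
    return pairs;
}

The resulting (cell, MBB) pairs are exactly the units of work dispatched to the refinement phase; in the MLQ approach an extra indirection from cells back to quadrants precedes this step.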

4.5. Spatial refinement

Recall that our goal is to compute the nearest polylines of points that are at most R distance away. After the spatial filtering phase, each point group (multi-level quadrant or grid cell) is associated with a set of polyline identifiers. What needs to be done in the Spatial Refinement module is to identify the polyline that is nearest to each of the points in the group by computing and ordering the distances between the point and the candidate polylines.



Fig. 6. Parallel primitives based spatial filtering in MLQ (top) and FSG (bottom).

Fig. 7. Illustration of computing the shortest distance between a point and a line segment.


Here the distance between a point and a polyline, as illustrated in Fig. 7, is canonically defined as the shortest distance between the point and all the line segments in the polyline. When the point is projected onto a line segment, if the projection point falls between the two ends of the line segment, the point-to-line-segment distance is the distance between the point and the projection point (perpendicular distance); otherwise, the point-to-line-segment distance is the shorter of the distances between the point and the two ends of the line segment. Clearly, the spatial refinement module is more computation intensive than the rest of the modules as a large number of floating point computations are required. For each point in a point group, our design requires looping through all the line segments of all the polylines whose expanded MBBs intersect with the bounding box of the point group.
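This canonical computation can be sketched as follows (a straightforward implementation of the definition above, shown for illustration only); the function can be marked __host__ __device__ so the same code serves both the CPU and GPU paths.

// Shortest distance between point p and segment (a, b), following the
// projection test illustrated in Fig. 7.
#include <cmath>

struct Pt { float x, y; };

inline float point_segment_dist(Pt p, Pt a, Pt b) {
    float dx = b.x - a.x, dy = b.y - a.y;
    float len2 = dx * dx + dy * dy;
    float t = 0.0f;
    if (len2 > 0.0f)                             // guard against a degenerate segment
        t = ((p.x - a.x) * dx + (p.y - a.y) * dy) / len2;
    if (t < 0.0f) t = 0.0f;                      // projection falls before a
    if (t > 1.0f) t = 1.0f;                      // projection falls after b
    float px = a.x + t * dx, py = a.y + t * dy;  // closest point on the segment
    return std::sqrt((p.x - px) * (p.x - px) + (p.y - py) * (p.y - py));
}

// The point-to-polyline distance is then the minimum of point_segment_dist
// over all line segments of the polyline.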

Our GPU implementation of the refinement module uses CUDA directly to achieve better performance, for two reasons. First, there are some unavoidable library overheads when using the Thrust parallel library. Second, and more importantly, the refinement design involves multi-level loops and it does not fit the parallel primitives, which are designed mostly for 1D vectors, very well. In our GPU implementation, as shown in Fig. 8, we assign each point quadrant to a thread block and assign points in the quadrant to threads in the thread block. Threads within the computing block process points in the point quadrant in parallel and each thread loops through all the paired polylines and their line segments to compute the shortest distance. The identifier of the polyline with the shortest distance will be associated with the point. We note that the scheme is similar to the nested loop join in relational databases. Although nested loop joins are typically considered inefficient in disk-resident relational databases, as all GPU threads in a thread block access a contiguous memory chunk that holds the point coordinates and all threads access the same line segment during a looping step on GPUs, the memory accesses are highly coalesced.

thread block access a continuous memory chunk thatholds the point coordinates and all threads accessthe same line segment during a looping step on GPUs,the memory accesses are highly coalesced. Similarly, asthe result of each point is a pair of polyline identifier andthe shortest distance, the thread that processes the pointknows exactly where to output the results. In a way similarto read accesses to point coordinates, neighboring threadsalso access neighboring vector elements of the outputsand memory accesses are also highly coalesced. UnlikeCPUs, GPUs have very limited cache capacities and rely oncoalesced memory accesses to hide high memory accesslatencies and achieve high performance. Our GPU imple-mentation is thus able to achieve high performance due tothe coalesced memory accesses in addition to utilizingGPU's excellent floating point computing power. Wewould like to note that while the GPU implementation isapplicable to both the MLQ and FSG designs (as indicatedin Table 1) as the inputs are in the same format for bothdesigns, the implementation may incur different work-loads and produce different results. This is because thepairs in the MLQ approach are at the quadrant level wherequadrants can have different sizes although the number ofpoints is bounded by K, while the pairs in the FSGapproach are at the grid cell level where grid cells are no

Fig. 8. Illustration of neighbor-polyline-searching based spatial refinement on GPUs.

Fig. 9. Illustration of neighbor-polyline-searching based spatial refinement on CPUs equipped with vector processing units (VPUs).

As such, there will be more points in the quadrants paired with polylines in the MLQ approach than points in the grid cells paired with polylines in the FSG approach. This is very similar to the well-known fact in spatial filtering that using coarse grids incurs more false positives, although the MLQ approach uses quadrants with variable sizes at different levels (which are still larger than the grid cells in the FSG approach in this study). We expect the workload of the implementation based on the MLQ design to be heavier than that based on the FSG design, which is verified in Section 5.3.
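As a hedged illustration of the thread-block/thread assignment described above, the following CUDA sketch processes one point group per block and one point per thread, assuming the number of points per group does not exceed the block size (K in the MLQ design); the array names and the flattened (group, candidate segment) layout are hypothetical simplifications rather than our actual code, and point_segment_distance is the helper sketched earlier, compiled as a device function.

__host__ __device__
float point_segment_distance(float px, float py,
                             float x1, float y1, float x2, float y2);  // as sketched above

// One block per point group, one thread per point; every thread scans the
// group's candidate segments (a nested loop join), so segment reads are
// shared within the block and per-point writes are coalesced.
__global__ void refine_nearest(const float* xs, const float* ys,
                               const int* group_pt_start, const int* group_pt_end,
                               const int* group_seg_start, const int* group_seg_end,
                               const float* sx1, const float* sy1,
                               const float* sx2, const float* sy2,
                               const int* seg_poly_id,
                               int* out_poly, float* out_dist, float R) {
    int g = blockIdx.x;
    int i = group_pt_start[g] + threadIdx.x;
    if (i >= group_pt_end[g]) return;
    float px = xs[i], py = ys[i];
    float best = R;                      // only polylines within distance R matter
    int best_id = -1;
    for (int s = group_seg_start[g]; s < group_seg_end[g]; ++s) {
        float d = point_segment_distance(px, py, sx1[s], sy1[s], sx2[s], sy2[s]);
        if (d < best) { best = d; best_id = seg_poly_id[s]; }
    }
    out_poly[i] = best_id;               // thread i writes slot i: coalesced output
    out_dist[i] = best;
}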

Our multi-core CPU implementation of the refinement module is based on the Intel open source TBB package [44], which typically performs better for computing tasks that are not well balanced. As shown in Fig. 9, we formulate processing a point cell as a task and let TBB handle the mapping between tasks and CPU threads. At runtime, TBB assigns a group of tasks to a CPU thread and lets the CPU thread loop through all the corresponding point cells. For each point cell, the thread loops through all the points in the cell and then through all the line segments of the paired polylines. When compared with the GPU implementation, we can see that a CPU thread does much more work than a GPU thread. In addition, reads and writes to CPU memory are automatically cached by the CPU caching subsystem, which makes programming much easier. As an optimization of the multi-core CPU based implementation, we have developed a module to utilize the Vector Processing Units (VPUs) that are available on modern CPUs. SIMD based parallel processing on VPUs is considered closely related to the Single Instruction Multiple Thread (SIMT) model on CUDA-enabled GPUs [36]. SIMD width has increased from 128-bit (4-way) in Streaming SIMD Extensions (SSE) to 256-bit (8-way) in Advanced Vector Extensions (AVX), and both are available on mainstream CPUs. In addition, the gap between GPU SIMD width (currently 1024-bit, 32-way in a warp) and VPU SIMD width is rapidly decreasing (for example, Intel MIC VPUs have a 512-bit, 16-way SIMD width [35]). As such, exploiting



VPUs for high performance on CPUs is becoming increasingly important. Our implementation utilizes the ISPC compiler [50], which is available from Intel as an open source package. Unlike CUDA threads, which have their own program counters and thus allow complex control logic, all the SIMD lanes of a VPU need to follow the same execution path, which makes their utilization less flexible. Another difference is that, while CUDA allows a much larger logical SIMD width (limited by the maximum number of threads in a computing block, currently in the order of thousands), the SIMD width needs to match the physical VPU SIMD width of multi-core CPUs. As such, our design is to partition the points in a cell into chunks and use the VPUs to process the points in batches through explicit looping, as shown in the mid-right part of Fig. 9.
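A minimal TBB sketch of the task formulation just described is given below (hypothetical names; not our production code); the innermost segment loop is the part that our implementation hands to the VPUs in fixed-width batches via ISPC, whereas here it is written as a plain scalar loop for clarity. point_segment_distance is the helper sketched earlier.

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <vector>

float point_segment_distance(float, float, float, float, float, float);  // as sketched above

// One task = a range of point cells; TBB maps task ranges to CPU threads.
void refine_cells_tbb(const std::vector<int>& cell_pt_start, const std::vector<int>& cell_pt_end,
                      const std::vector<int>& cell_seg_start, const std::vector<int>& cell_seg_end,
                      const std::vector<float>& px, const std::vector<float>& py,
                      const std::vector<float>& sx1, const std::vector<float>& sy1,
                      const std::vector<float>& sx2, const std::vector<float>& sy2,
                      const std::vector<int>& seg_poly_id,
                      std::vector<int>& out_poly, std::vector<float>& out_dist, float R) {
    tbb::parallel_for(tbb::blocked_range<size_t>(0, cell_pt_start.size()),
        [&](const tbb::blocked_range<size_t>& cells) {
            for (size_t c = cells.begin(); c != cells.end(); ++c) {
                for (int i = cell_pt_start[c]; i < cell_pt_end[c]; ++i) {
                    float best = R; int best_id = -1;
                    // Candidate loop for SIMD/VPU execution in batches.
                    for (int s = cell_seg_start[c]; s < cell_seg_end[c]; ++s) {
                        float d = point_segment_distance(px[i], py[i],
                                                         sx1[s], sy1[s], sx2[s], sy2[s]);
                        if (d < best) { best = d; best_id = seg_poly_id[s]; }
                    }
                    out_poly[i] = best_id;
                    out_dist[i] = best;
                }
            }
        });
}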

4.6. Relational aggregations on CPUs and GPUs

After a database tuple is associated with one or more keys based on spatial, temporal and spatiotemporal relations, the next step in spatial and temporal aggregations is to derive statistics over groups defined by the individual keys or their combinations, which can be informative in decision making. Counting and similar relational aggregation operations for deriving group-based statistics are fundamental operators in relational databases and have been well researched, including some recent work on parallel aggregations on many-core GPUs [9,39,40]. A straightforward implementation of the relational aggregation operators on CPUs is to use the STL map data structure to store (key, stat) pairs, where stat can be one or more statistics, such as count/sum/avg/min/max. As the number of cores on multi-core CPUs is currently limited, a commonly used strategy for parallelizing the relational aggregation operations on multi-core CPUs is to provide each thread (typically assigned to a processor exclusively) a private copy of the map structure so that threads can work in parallel over non-overlapping partitions of tuples and avoid expensive locks. The private map structures are then combined across threads/cores by a single thread to derive the final result. Unfortunately, this strategy is largely unrealistic on many-core GPUs, as it is neither possible nor efficient to provide hundreds of thousands of threads with private copies of result structures (e.g., maps). Furthermore, as the map structure is essentially a hashing based structure, it requires random accesses to hash tables and may incur significant cache misses and reduce performance. While large caches on CPUs can tolerate random data accesses to a certain extent, hashing typically performs poorly on GPUs due to the uncoalesced data accesses among threads in computing blocks [59].
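For concreteness, the following sketch illustrates the private-copy strategy in plain C++ (using std::thread for exposition; the names are hypothetical and the sketch is not our TBB-based implementation): each thread counts over its own partition into a private map, and the private maps are merged serially at the end.

#include <algorithm>
#include <cstdint>
#include <map>
#include <thread>
#include <vector>

std::map<uint32_t, int> aggregate_counts(const std::vector<uint32_t>& keys,
                                         unsigned num_threads) {
    std::vector<std::map<uint32_t, int>> partial(num_threads);
    std::vector<std::thread> workers;
    size_t chunk = (keys.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t)
        workers.emplace_back([&, t] {
            size_t begin = t * chunk;
            size_t end = std::min(keys.size(), begin + chunk);
            for (size_t i = begin; i < end; ++i)
                ++partial[t][keys[i]];               // no locking: private map per thread
        });
    for (auto& w : workers) w.join();
    std::map<uint32_t, int> result;                  // single-threaded merge of private copies
    for (const auto& m : partial)
        for (const auto& kv : m) result[kv.first] += kv.second;
    return result;
}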

Given that sorting is among the best studied parallel algorithms and efficient sorting implementations are provided by the major parallel packages on parallel systems, we propose a Transform-Sort-Reduce based approach for relational aggregations on GPUs. The approach first derives keys based on domain knowledge, then sorts the database tuples based on the keys, and finally reduces over the groups (identified by the keys) to produce the desired statistics for the groups. We note that many statistical operators, including the commonly used count/sum/avg/min/max operators

that are supported by relational DBMSs, are associative and can be applied in parallel within groups. Our GPU implementation of the relational aggregation module closely follows this three-step procedure. For the first step, combining key attributes and generating keys for all tuples, the element-wise operation is highly parallelizable and is efficiently supported by several parallel libraries, including Thrust (through the transform primitive). Note that the decision on whether to concatenate relevant value attributes to form a new relation (in a way similar to materialized views in relational databases) or to operate directly on the original relation is left to applications. As we have adopted a column-oriented physical database layout, deriving a new relation is performed by vertically concatenating the relevant in-memory column files. For the second step, sorting the database tuples based on their keys, since radix sorting on fixed-length keys is typically more efficient than comparison based sorting and has a theoretical linear complexity with respect to the number of data items to be sorted, our implementation formulates keys as 32-bit or 64-bit integers to take advantage of efficient parallel sorting implementations on GPUs. We argue that 32-bit keys can be sufficient for many practical applications. Since users are typically interested in only a limited number of top-ranked groups, the keys can be hierarchically refined into multiple levels while still using 32-bit keys at each level. For the third step, applying reduction operators within groups to derive statistics, Thrust provides a reduce_by_key primitive that meets our requirement.

Assuming that we want to perform a spatiotemporal aggregation on street segment and hour, i.e., counting the number of taxi pickup locations at each of the street segments for each of the 24 h, Fig. 10 provides the code segment for illustration purposes. Here we are given two vectors, the first storing the street segments and the second storing the pickup times in the compressed form (Section 3.2), for all taxi trip records using a column-oriented data layout. Note that the zip iterator is used to combine the elements of the two input vectors into tuples in the temp_keys vector so that they can be used by the required functor (C++ function object) in the transform primitive to convert the segment identifiers and pickup hours into keys for reduction. The last five bits of a resulting key are allocated to the hour (24 < 2^5) and the rest of the bits are allocated to segment identifiers, which allows up to 2^27 segment identifiers and is sufficient in our application. The same procedure can be applied when variables of higher dimensions are involved in key formation. The reduce_by_key primitive in Thrust is a segmented version of the regular reduce primitive. To help understand the procedure, a simple example is provided in the top-right part of Fig. 10. After sorting in Step 2, equal keys are arranged consecutively in the temp_keys vector. For each of the unique keys in the temp_keys vector (which is output to the pu_keys vector), the count in the pu_count vector is increased (as defined by the thrust::plus functor) by 1 (as defined by the thrust::constant_iterator). The primitive allows us to define how to determine whether two keys are equal, by providing a functor to replace the thrust::equal_to functor, and how to perform the reduction, by providing a functor to replace the thrust::plus functor, and is thus very flexible.


Inputs: vectors pu_seg and pu_t
Outputs: vectors pu_keys and pu_count representing aggregated keys and counts

Step 1: transform pu_seg and pu_t into a vector of bin keys as follows:
thrust::transform(
    thrust::make_zip_iterator(thrust::make_tuple(pu_seg.begin(), pu_t.begin())),
    thrust::make_zip_iterator(thrust::make_tuple(pu_seg.end(), pu_t.end())),
    temp_keys.begin(),
    make_key());

Step 2: in-place sort the key vector to get ready for reduction:
thrust::sort(temp_keys.begin(), temp_keys.end());

Step 3: reduce by key to count the number of trips in each bin and output the results to pu_count:
int num_keys = thrust::reduce_by_key(
    temp_keys.begin(), temp_keys.end(),
    thrust::constant_iterator<int>(1),
    pu_keys.begin(), pu_count.begin(),
    thrust::equal_to<uint>(), thrust::plus<int>()).first - pu_keys.begin();

struct make_key {
    __host__ __device__
    uint operator()(thrust::tuple<uint, uint> v) {
        uint segid = thrust::get<0>(v) & 0x07FFFFFF;
        uint hour  = (thrust::get<1>(v) >> 12) & 0x0000001F;
        return (segid << 5) | hour;
    }
};

Fig. 10. Code segment of a parallel relational aggregation using spatiotemporal keys.


Since thrust::make_tuple takes up to 10 parameters to make tuples, it should be more than sufficient for combining dimensional values into keys in most cases.
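To illustrate this flexibility, the sketch below reuses the sorted temp_keys from Fig. 10 but replaces the constant iterator with a per-trip value vector so that the same primitive produces per-group sums instead of counts; pu_fare and pu_sum are hypothetical vector names, and the value vector is assumed to have been permuted into the same order as temp_keys (e.g., via thrust::sort_by_key in Step 2).

// Same keys, different reduction: sum a per-trip value within each group.
thrust::reduce_by_key(
    temp_keys.begin(), temp_keys.end(),
    pu_fare.begin(),                        // values to be summed per key
    pu_keys.begin(), pu_sum.begin(),
    thrust::equal_to<uint>(),               // group boundary test
    thrust::plus<float>());                 // reduction operator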

The reader may have observed that the same Transform-Sort-Reduce based relational aggregation scheme can also be applied to multi-core CPUs. Indeed, parallel libraries on CPUs, such as Intel TBB, provide similar parallel primitives, such as parallel_for, parallel_sort and parallel_reduce, which make a multi-core CPU implementation of parallel relational aggregations possible. However, as shown in Section 5, sorting on multi-core CPUs is several times slower than sorting on GPUs in our experiment system, which makes the primitive-based multi-core CPU implementation less preferable. On the other hand, while it is more efficient for GPUs to use the more costly sorting operator to coordinate a large number of threads and make full use of the massively data parallel computing power of the hardware, multi-core CPUs are characterized by small numbers of powerful processors with large caches. The Transform-Sort-Reduce scheme that works well for GPUs may not be the most efficient implementation for multi-core CPUs. In fact, as many of the relational aggregation operators are associative, it is unnecessary to produce a total order through sorting. As mentioned earlier, given that the number of cores on CPUs is typically small, when the numbers of groups are not extremely large, it is technically feasible to provide each CPU thread a private copy of the data structures for storing statistics. The advantages of using private copies are to completely eliminate concurrent data accesses, which typically require expensive locking, and to eliminate expensive sorting. Each thread can simply loop through the partition of tuples assigned to it and generate statistics for the partition serially before the private copies of the statistics are combined. To reduce the memory management overheads of STL map structures, our multi-core CPU implementation instead uses dynamically allocated arrays, as the sizes of the key domains are often known before relational aggregations; a sketch is given below. As shown in Section 5.5, the efficiency of relational aggregations can be significantly improved by using pre-allocated dynamic arrays, narrowing the performance gap between the GPU and CPU implementations.
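A minimal sketch of this array-based variant is given below (hypothetical names; shown with TBB thread-local storage rather than our exact code), assuming the keys are dense integers in a range that is known before the aggregation starts.

#include <tbb/blocked_range.h>
#include <tbb/enumerable_thread_specific.h>
#include <tbb/parallel_for.h>
#include <cstdint>
#include <vector>

std::vector<int> aggregate_counts_arrays(const std::vector<uint32_t>& keys, size_t num_keys) {
    // One pre-allocated private array per thread: no locks and no sorting.
    tbb::enumerable_thread_specific<std::vector<int>> local(std::vector<int>(num_keys, 0));
    tbb::parallel_for(tbb::blocked_range<size_t>(0, keys.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            std::vector<int>& c = local.local();
            for (size_t i = r.begin(); i != r.end(); ++i)
                ++c[keys[i]];
        });
    std::vector<int> total(num_keys, 0);             // serial combine of the private arrays
    for (const auto& c : local)
        for (size_t k = 0; k < num_keys; ++k) total[k] += c[k];
    return total;
}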

We would like to take this opportunity to briefly comment on implementations using native parallel programming languages or similar approaches (e.g., CUDA on GPUs and Pthreads on CPUs) and those that are based on parallel primitives (e.g., Thrust on GPUs and TBB on CPUs). In the context of OLAP aggregations, the work reported in Ref. [9] adopted a native implementation approach, which requires a deep understanding of GPU hardware details and strong parallel programming skills. Even though parallel reduction is a well-studied problem and the mapping between the reduction primitive and OLAP aggregations is relatively straightforward, it remains non-trivial to implement the aggregations with good performance using a native programming approach [9]. Another research effort on MOLAP cubes [39] utilized parallel primitives, which allowed the authors to focus on high-level constructs without diving into too many hardware details. While we are in favor of the parallel primitive based approach in general, due to its simplicity and portability, the tradeoffs between code efficiency and coding complexity of parallel primitive based implementations and native implementations need to be justified for different applications. We also note that the primitive based implementation does incur some overheads that can be avoided if implemented directly on top of native parallel programming languages. For example, the temp_keys vector in Fig. 10 of our example is accessed multiple times when different primitives are invoked. In addition, most parallel primitives are designed for 1-dimensional data that can be represented as arrays or vectors. Similar arguments apply to CPU-based implementations in general, for example, Pthreads and TBB. However, as multi-core CPUs typically favor coarse-grained task level parallelization and allow serial implementation within parallel tasks, the differences on multi-core CPUs are not as significant as those on GPUs from an implementation perspective in the context of data management applications. To the best of our



knowledge, there are no parallel libraries that support multi-dimensional data and provide a similar set of functionality. Our designs for parallel spatial indexing, filtering and refinement have the potential to be abstracted as multi-dimensional primitives for spatial and temporal aggregations, and we leave this interesting topic for future work.

5. Experiments and results

5.1. Data and experiment setup

Through a partnership with the NYC Taxi and Limousine Commission (TLC), we have access to roughly 300 million GPS-based trip records collected during a period of about 2 years (2008–2010). In this study, we use the approximately 170 million pickup locations and times in 2009 for the experiments. The NYC DCP LION street network dataset with 147,011 street segments is used as the urban infrastructure data to associate taxi locations with street segments and for the subsequent spatial and temporal aggregations. Among the 168,379,168 taxi pickup locations in NYC, the majority are successfully associated with their nearest street segments within R = 250 ft. However, there are 867,163 (0.515%) locations whose computed shortest distances are more than 250 ft. They are considered outliers and are excluded from subsequent analysis. All experiments are performed on a Dell Precision T5400 workstation equipped with dual quad-core CPUs running at 2.0 GHz with 16 GB memory, a 500 GB hard drive and an Nvidia Quadro 6000 GPU device with 448 cores (running at 1.15 GHz) and 6 GB memory.

Fig. 11. Plot of runtimes of MLQ-based spatial associations on GPUs.

5.2. Results on MLQ-based spatial associations on GPUs

Fig. 11 shows the results of MLQ-based spatial associations implemented on GPUs using the maximum quadrant depth L = 16 (2 ft cell resolution at the finest level), the maximum point quadrant size K = 512 and the expanded window width of R = 25 ft for polyline MBBs. In this figure, the horizontal axis represents the months of the year 2009 that are used in the spatial associations, N1 is the number of point locations, N2 is the number of derived point quadrants, T1, T2, T3 and T4 are the runtimes measured in seconds of the four modules (point indexing, polyline indexing, spatial filtering and spatial refinement, respectively), and T is the total runtime. We can see that the runtimes of point indexing dominate the total runtime in all tests and increase almost linearly with the number of point locations. This is mostly due to the radix-sort based parallel primitive for sorting, which incurs linear runtimes with respect to the number of points. The runtimes of the remaining parallel primitives are mostly linear with respect to the number of points or quadrants, except in cases where the same point quadrant may be paired with multiple grid cells. We note that the runtimes for point indexing already include the data transfer times between CPUs and GPUs, which account for nearly 25% of the end-to-end runtime using 12 months of data (the whole year of 2009). Also note that the polyline indexing time is fixed across the months and is relatively insignificant.



Table 2
Results on FSG-based spatial associations (runtimes in seconds; "–" indicates not applicable/not measured).

Module                    FSG-SC    FSG-SC-VPU    FSG-MC    FSG-MC-VPU    FSG-GPU    MLQ-GPU
Point Indexing (PNI)      22.447    –             5.382     –             –          12.233
Polyline Indexing (PLI)
  R=25                    –         –             1.750     –             0.181      –
  R=50                    –         –             2.448     –             0.244      –
  R=100                   –         –             4.140     –             0.401      –
  R=250                   –         –             13.450    –             1.148      –
Spatial Filtering (SF)
  R=25                    –         –             0.738     –             0.068      1.221
  R=50                    –         –             0.984     –             0.083      1.391
  R=100                   –         –             1.521     –             0.118      1.702
  R=250                   –         –             4.305     –             0.264      3.207
Spatial Refinement (SR)
  R=25                    21.126    7.251         2.631     0.930         1.002      1.601
  R=50                    28.642    9.480         3.559     1.198         1.267      2.088
  R=100                   43.419    13.913        5.440     1.749         1.839      3.019
  R=250                   100.738   31.117        12.487    3.874         4.156      6.573

(The GPU implementation of polyline indexing is shared by the FSG and MLQ designs.)


From the results we can see that it takes about 15 s to complete the spatial associations between the 168.38 million pickup locations and the 147 thousand street segments based on the nearest-neighbor searching principle within a 25 ft window width. The GPU-based spatial associations are considerably faster than running queries similar to query Q5 (listed in the Appendix at the end of the paper) in PostgreSQL, which takes hours to days. Although it is not our intention to compare our results with PostgreSQL directly, as the two are designed for different purposes and exploit different sets of technologies, the results clearly demonstrate the level of achievable performance when aggregating large-scale geospatial data on modern commodity parallel hardware.

An interesting observation from the experiment results on spatial filtering runtimes is that, as the number of point locations grows across the months, the spatial filtering runtime actually decreases, which is counter-intuitive. A careful examination reveals that, as the threshold K is fixed in our experiments, when the number of point locations is small, the number of derived point quadrants is likely to be large, which results in a large number of cells after rasterization. Subsequently, it takes longer for binary searching, sorting and duplicate removal. As the spatial filtering runtimes can be several times larger than the spatial refinement runtimes, and even larger than the point indexing runtimes (for the cases using smaller numbers of months in Fig. 11), it is desirable to reduce these runtimes as much as possible, since users generally expect lower response times when the numbers of points are small. As such, we suggest using adaptive grid resolutions in the GPU-based MLQ implementation, i.e., coarser grid resolutions when smaller numbers of points are being associated. Assuming that the statistical distributions of point locations do not change significantly across months, it is likely that a cost model can be formally derived to suggest an optimal grid resolution for the implementation. This is left for future work.

To further evaluate the performance of many-core GPUs for spatial associations, we have performed the same sets of operations corresponding to T1 (point indexing runtime) and T4 (spatial refinement runtime) on a single CPU core for the points in all 12 months. The runtimes are T1' = 162.004 s and T4' = 35.338 s, respectively. We have not measured T2' (polyline indexing runtime) and T3' (spatial filtering runtime) on single CPU cores because we are not aware of efficient implementations of main-memory data structures for polyline indexing and spatial filtering, and it would not be fair to simply port their GPU implementations to CPUs, which might not be efficient. However, even if T2' and T3' are excluded, we still obtain a 13× speedup, calculated as (T1' + T4')/(T1 + T2 + T3 + T4) = 197.342/15.236. The result indicates the efficiency of GPU based spatial association based on the MLQ design.

5.3. Results on FSG-based spatial associations on multi-core CPUs and GPUs

We have tested our FSG-based spatial association module on multi-core CPUs under the following settings. A fixed grid cell size G = 16 ft is used for all experiments. We have chosen to experiment with four R values (25 ft, 50 ft, 100 ft and 250 ft) to examine how the end-to-end runtimes change as the polyline expanded window width increases. Using larger expanded window widths typically results in larger numbers of (cell, polyline) pairs that need to be refined after the spatial filtering, so both the spatial filtering and spatial refinement runtimes are likely to increase. As indicated in Table 1 in Section 4, we use GNU Parallel Mode for sorting on multi-core CPUs. As the Polyline Indexing module is independent of the FSG/MLQ designs, we re-use the MLQ implementation on GPUs directly and port it to multi-core CPUs through Thrust by changing the Thrust device backend from CUDA on GPUs to TBB on multi-core CPUs. For spatial filtering, we have implemented the FSG-based design on GPUs to compare with the MLQ based GPU implementation and have ported it to multi-core CPUs in a similar way. Also, as discussed in Section 4.3, although we have ported the FSG based GPU implementations to multi-core CPUs (results listed in Table 2 in the FSG-MC column), we note that the library overheads of the TBB ports may be higher than the overheads of the native CUDA implementation of the Thrust library, and this factor is taken into account in the analysis. While it is beyond the scope of our work to



optimize the Thrust library on multi-core CPUs, we plan to natively implement the parallel primitives that are used in polyline indexing and spatial filtering based on the FSG design to improve their performance in future work. For the spatial refinement module using the FSG design, since we have explored VPUs in our multi-core CPU implementations, the runtimes of the following implementations are listed in Table 2: serial implementation using a single core (SC), parallel implementation on a single core with VPU (SC-VPU), parallel implementation on multiple cores without VPU (MC), and parallel implementation on multiple cores with VPU (MC-VPU). The runtimes of the spatial filtering module using the GPU implementations based on both the FSG and MLQ designs under the four R values are also listed in Table 2 for comparison purposes.

From Table 2, we can see that the FSG-based multi-core CPU implementation for point indexing is about 2.3× faster than the MLQ-based GPU implementation (12.233/5.382). On the other hand, while the GPU implementation achieved a 13× speedup over the serial implementation in the MLQ approach, only a 4.2× speedup (22.447/5.382) has been achieved by the multi-core implementation over the serial implementation in the FSG approach. This can be explained by the fact that, as detailed in Section 4.1, the FSG design is significantly simpler than the MLQ design and subsequently requires much less computation. Unfortunately, the large memory footprints (for both the resulting grid cells and the temporary memory during sorting) have prevented us from implementing the Point Indexing module on GPUs based on the FSG design for direct comparisons. In addition, the CPU-based implementation of the FSG design does not require copying the point data from CPUs to GPUs, which also contributes to its efficiency. Furthermore, as sorting is typically memory intensive and the bandwidth of our GPU device is much higher (144 GB/s vs. 2 × 12.8 GB/s), we would expect the implementation of the FSG design on GPUs to be more efficient than the multi-core CPU implementation when GPU memory capacity gets larger. We also suspect that insufficient memory bandwidth on multi-core CPUs is one of the important reasons that the multi-core CPU implementation achieves only half of the theoretical speedup (8× for 8 cores). As future generation CPUs are likely to have higher memory bandwidths and more cores, we expect the performance of the multi-core CPU based implementation to increase accordingly.

For polyline indexing, the runtimes under all R values are relatively insignificant on GPUs. However, the runtimes are more than 10× higher when the GPU implementation (the same in both the MLQ and FSG designs) is ported to multi-core CPUs. While the low performance can partly be attributed to the small number of cores and low memory bandwidth discussed previously (and the low CPU clock rate in our experiment system), we also believe that the Thrust library overheads of the TBB port on multi-core CPUs are significantly higher than those of its native implementation on GPUs. For future work, we plan to provide an optimized native implementation of the polyline indexing module in order to compare the best performance of the multi-core CPU implementation with that of the GPU implementation.

The Spatial Filtering results provided in Table 2 show that the FSG-based GPU implementation is 10–20× faster than the MLQ-based GPU implementation (by dividing the values in the MLQ-GPU column by those in the FSG-GPU column in the four SF rows), which makes it the preferred choice. When comparing the FSG-based implementations on multi-core CPUs and GPUs (using the values in the FSG-GPU and FSG-MC columns in the same SF rows), we can see that the GPU-based implementation is again more than 10× faster than the multi-core CPU implementation obtained by porting the same implementation from CUDA to TBB. Again, we attribute this result to the small number of cores and low memory bandwidth on multi-core CPUs in our experiment system and to the higher library overheads on CPUs.

More comprehensive comparisons can be performed for the Spatial Refinement module, as all six implementations are available for this module. Our results in Table 2 show that the runtimes of the MLQ-GPU implementation are approximately 60% higher than those of the FSG-GPU implementation for the four R values. This may also indicate that load balancing is not a dominating factor, due to the fine data decompositions discussed in Section 4. It is perhaps more interesting to compare the performance of the five implementations of the FSG design, especially the FSG-MC-VPU implementation, which has the best performance on multi-core CPUs, and the GPU implementation. From Table 2 we can see that the GPU implementation achieved 13–15× speedups over the serial implementation in the FSG approach for the four R values (FSG-SC/FSG-GPU for the SR rows). These are comparable with the speedups of the GPU implementation over the serial implementation in the MLQ based approach (13× for R = 25, as reported in Section 5.2). When comparing the GPU performance (FSG-GPU) and the multi-core CPU performance (FSG-MC), we can see that the GPU implementation achieved 2.6–3.0× speedups over multi-core CPUs (FSG-MC/FSG-GPU for the SR rows). When comparing the runtimes of FSG-MC with those of FSG-SC, we observe near-linear speedups (around 8×), and we attribute these results both to the parallel design with regular data access patterns and to the efficiency of TBB scheduling in balancing the workloads among grid cells. We next turn to analyzing the CPU performance with multi-cores and VPUs.

When comparing the runtimes in the FSG-SC and FSG-SC-VPU columns in Table 2 for the spatial refinement module, we can see that the speedups due to VPUs (vectorization) are around 3× (FSG-SC/FSG-SC-VPU for the SR rows). Slightly lower speedups (but still around 3×) can be observed by comparing the runtimes in the FSG-MC and FSG-MC-VPU columns, where all cores are utilized. Given that the VPUs on the Intel Xeon CPUs in our experiment system have a 128-bit SIMD width (4-way) and the maximum possible speedup is only 4×, the achieved speedups are very significant, especially considering that not all the numbers of points in grid cells are divisible by 4 and the remaining points that cannot be fed to a SIMD batch must be processed sequentially. By combining multi-cores and VPUs, the multi-core CPU implementation achieved 21–24× speedups over the serial implementation without multi-cores and VPUs (FSG-SC/FSG-MC-VPU for the SR rows). The performance is about 5–7% better than the GPU implementation in the FSG design (calculated as (FSG-GPU/FSG-MC-VPU) − 1 for the SR rows). As the CPUs in our experiment system are relatively weak



(released in 2007), we believe more recent systems with more CPU cores and higher CPU frequencies can achieve better performance. The results are exciting in the sense that, when parallel hardware (including multi-cores and VPUs) is fully utilized, multi-core CPUs have the potential to achieve comparable or even better performance than GPUs in our application.

5.4. Results on end-to-end response times of spatial associations on hybrid systems

While we have compared the performance of the individual modules on multi-core CPUs, GPUs or both under different settings for the FSG-based approach in Section 5.3, we next provide comparisons of the end-to-end response times of spatial associations in this subsection. Although the end-to-end response time of the MLQ-based approach for R = 25 ft has been provided and discussed in Section 5.2, we would like to compare the two approaches under different R values. In Table 3, we refer to the MLQ-based approach as Config1, where all the runtimes are based on the GPU implementations. In contrast, for Config2, the runtimes are mixtures of the GPU and CPU implementations obtained by taking the best performance among all available implementations of the respective module. This decision is based on several considerations. First, only the CPU implementation is available for point indexing in the FSG design, as the point data is beyond the memory capacity of the GPU device in our experiment system for a GPU implementation of the FSG design. Second, the multi-core CPU implementations of the polyline indexing and spatial filtering modules are naïve ports of the corresponding GPU designs and are not optimized; it would be unfair to use them in calculating the end-to-end response times for multi-core CPUs. Third, by utilizing VPUs on CPUs, the CPU based implementation of the Spatial Refinement module actually slightly outperforms the corresponding GPU implementation, although the GPU implementation is still about 2.6–3.0× better than the multi-core CPU implementation without VPUs; it would not be appropriate to simply refer to the VPU-enhanced implementation as a multi-core CPU implementation. By taking the best runtimes of the modules in the FSG-based approach, we can compute the best performance of the hybrid CPU-GPU system for the FSG-based approach and compare it with the MLQ-based approach, which uses a pure GPU implementation.

Table 3
Runtimes of multiple design and implementation configurations (in seconds).

                           R=25     R=50     R=100    R=250
Config1 (MLQ-GPU)
  PNI                      12.233   12.233   12.233   12.233
  PLI                      0.181    0.244    0.401    1.148
  SF                       1.221    1.391    1.702    3.207
  SR                       1.601    2.088    3.019    6.573
  Tot                      15.236   15.956   17.355   23.161
Config2 (FSG-Hybrid)
  PNI (FSG-MC)             5.382    5.382    5.382    5.382
  PLI (FSG-GPU)            0.181    0.244    0.401    1.148
  SF (FSG-GPU)             0.068    0.083    0.118    0.264
  SR (FSG-MC-VPU)          0.930    1.198    1.749    3.874
  Tot                      6.561    6.907    7.650    10.668
Speedup (Config1-Tot/Config2-Tot)   2.32     2.31     2.27     2.17


From the last row of Table 3 we can see that the FSG-Hybrid configuration achieves a little over 2× speedup over the MLQ-GPU configuration, which indicates that the FSG design is the better choice for spatial associations on our experiment system. We expect that, after native multi-core CPU implementations of the polyline indexing and spatial filtering modules become available, although they are likely to increase the end-to-end runtimes of the two modules when compared with the current GPU implementations, an FSG implementation that runs completely on multi-core CPUs will still likely outperform the MLQ-based approach on GPUs. The better performance of the FSG approach on multi-core CPUs can be attributed to three major factors. The first factor is the algorithmic improvement that significantly reduces the computing overheads of spatial filtering/refinement as well as point indexing. The efficiency in reducing computing overheads is evidenced by comparing the runtimes in the FSG-GPU and MLQ-GPU columns in Table 2, where GPU implementations of both approaches are available. The efficiency of point indexing based on the FSG design can be inferred from the fact that sorting the 168 million records should take less than 1 s on the GPU if memory is sufficient, based on previous results [55], while it takes more than 12 s based on the MLQ design. The second factor is the elimination of the need to transfer data between CPUs and GPUs (3 s based on our experiments). The third factor is the effective utilization of VPUs on CPUs, which has been largely untouched in previous studies, in the spatial refinement module. We expect the end-to-end runtimes of the pure multi-core CPU based implementation can be further reduced after utilizing the SIMD processing power of VPUs in the other three modules of spatial associations. Finally, despite the fact that different parallel designs and implementations have resulted in different performance, we would like to note that the end-to-end runtimes of both the FSG-Hybrid and MLQ-GPU implementations are in the order of 5–25 s for R values ranging from 25 ft to 250 ft, as shown in Table 3. The results clearly indicate the high efficiency of our parallel designs and implementations, although there




is still room for improvement. The end-to-end response times are getting closer to meeting the requirements of interactive spatial OLAP query processing. We believe interactive OLAP queries that spatially associate hundreds of millions of points with hundreds of thousands of polylines are technically feasible through further algorithmic engineering, implementation optimization and heterogeneous/distributed processing, and these are left for our future work.

5.5. Results on parallel relational aggregations

After the spatial and temporal associations, spatial and temporal aggregations are reduced to relational aggregations. We have performed three groups of experiments on relational aggregations on both many-core GPUs and multi-core CPUs using the 168 million taxi pickup locations after spatial and temporal associations. The first group of experiments counts on the 147,011 street segments after spatial associations based on the nearest neighbor search. The second group of experiments counts on the 24 h (temporal aggregation), and the third group counts on both street segments and hours (spatiotemporal aggregation). Clearly, the number of bins in the spatial aggregations based on segment identifiers is 3–4 orders of magnitude larger than the number of bins in the temporal aggregations based on hours. The two can thus be used to represent the two extremes in aggregations with respect to cardinality. The experiment results for the three groups of queries on multi-core CPUs and many-core GPUs are plotted in Fig. 12 and are compared with serial implementations. For the CPU implementations (serial and multi-core), we have experimented with using both STL map structures and dynamic arrays. For the GPU implementation, as it is based on the Thrust parallel library, we use multiple vectors instead.

Our results show that the speedups of the multi-core CPU implementations over the serial implementations range from 4.5× to 6.0× when STL maps are used and from 1.5× to 2.7× when dynamic arrays are used. The results are expected, as the relational aggregations are mostly memory intensive, which may severely limit the achievable speedups (a maximum of 8× for 8 cores). When dynamic arrays are used, memory bandwidth contention is likely to be a major factor in further limiting the achievable speedups. The GPU implementations have achieved slightly better performance than the multi-core implementations using dynamic arrays. Given that the GPU device used in our experiment system has a larger number of processing cores (although individually weaker than CPU cores) and higher memory bandwidth, we believe it is quite possible to further improve the performance of the GPU implementation by using CUDA directly to reduce the Thrust library overheads.

Fig. 12. Runtimes of parallel relational aggregations.

While it remains nontrivial to design and implement sophisticated parallel primitives that are optimized for OLAP applications using low-level programming languages and libraries on parallel hardware, we plan to provide designs and implementations that are compatible with both GPUs and the VPUs of CPUs and make full use of their SIMD parallel processing power.

6. Conclusion and future work

In this study, we reported our designs, implementations and experiments in developing a data management platform and a set of parallel techniques to support high-performance online spatial and temporal aggregations on multi-core CPUs and many-core GPUs, which are becoming increasingly available but remain largely under-utilized for OLAP applications. Our results have shown that we are able to spatially associate nearly 170 million taxi pickup location points with their nearest street segments among 147,011 candidates in about 5–25 s, depending on the configuration of indexing structures, computing models and parameters such as expanded window widths. After spatially associating points with road segments, spatial, temporal and spatiotemporal aggregations are reduced to relational aggregations and can be processed in the order of a fraction of a second on both multi-core CPUs and GPUs. The experiment results support the feasibility of building a high-performance OLAP system for processing large-scale taxi trip data for real-time, interactive data exploration on GPUs, multi-core CPUs and their hybridizations.

For future work, first of all, we would like to further reduce the processing times for both spatial associations and relational aggregations by fine tuning important parameters and further reducing memory footprints. Second, to ensure usability, we would like to investigate the appropriate spatial and temporal resolutions so that interactive OLAP processing can be smoothly performed on commodity personal computers with different hardware configurations. Third, while the designs and implementations reported in this study are application driven, we are interested in formally analyzing the complexity and scalability of the proposed solution by varying the numbers of CPU/GPU processors and empirically validating the analysis through extensive experiments. Fourth, while our techniques are designed and developed mostly in the context of managing large-scale OD data, many of them are applicable to other types of spatial and spatiotemporal data, and we plan to investigate these possibilities. Finally, we plan to explore cluster computing technologies to process larger scale data, for example, multi-year and multi-city taxi trip data and cell phone call log data.

Acknowledgment

This work is supported in part through PSC-CUNY Grants #65692-00 43 and #66724-00 44 and NSF Grants IIS-1302423 and IIS-1302439. We thank Dr. Camille Kamga at CUNY City College for providing the NYC taxi trip data and Dr. Hongmian Gong at CUNY Hunter College for insightful discussions on urban applications of GPS data.



Appendix. SQL statements for spatial/temporal aggregations in PostgreSQL

Q1: UPDATE t SET PUGeo = ST_SetSRID(ST_Point("PULong", "PULat"), 4326);
Q2: UPDATE t SET DOGeo = ST_SetSRID(ST_Point("DOLong", "DOLat"), 4326);
Q3: CREATE INDEX ti_pugeo ON t USING GIST (PUGeo);
Q4: CREATE INDEX ti_dogeo ON t USING GIST (DOGeo);
Q5: SELECT DISTINCT ON (ID, PUT) ID, PUT, segmentid, ST_Distance(ST_Transform(PUGeo, 2263), the_geom) AS ndis INTO temp_PU FROM t, n WHERE ST_DWithin(ST_Transform(PUGeo, 2263), the_geom, 100) ORDER BY PUT, ID, ndis;
Q6: UPDATE t SET PUSeg = (SELECT segmentid FROM temp_PU WHERE t.ID = temp_PU.ID AND t.PUT = temp_PU.PUT);
Q7: SELECT DISTINCT ON (ID, DOT) ID, DOT, segmentid, ST_Distance(ST_Transform(DOGeo, 2263), the_geom) AS ndis INTO temp_DO FROM t, n WHERE ST_DWithin(ST_Transform(DOGeo, 2263), the_geom, 100) ORDER BY DOT, ID, ndis;
Q8: UPDATE t SET DOSeg = (SELECT segmentid FROM temp_DO WHERE t.ID = temp_DO.ID AND t.DOT = temp_DO.DOT);
Q9: CREATE INDEX ti_pus ON t (PUSeg);
Q10: CREATE INDEX ti_dos ON t (DOSeg);
Q11: SELECT PUSeg, COUNT(*) FROM t GROUP BY PUSeg ORDER BY PUSeg;
Q12: SELECT DOSeg, COUNT(*) FROM t GROUP BY DOSeg ORDER BY DOSeg;
Q13: CREATE INDEX ti_put ON t (PUT);
Q14: CREATE INDEX ti_dot ON t (DOT);
Q15: SELECT EXTRACT(hour FROM PUT) AS hour, COUNT(*) FROM t GROUP BY hour ORDER BY hour;
Q16: SELECT EXTRACT(hour FROM DOT) AS hour, COUNT(*) FROM t GROUP BY hour ORDER BY hour;

References

[1] J. Reades, F. Calabrese, A. Sevtsuk, C. Ratti, Cellular census: explorations in urban data collection, IEEE Pervasive Comput. 6 (3) (2007) 30–38.
[2] F. Calabrese, M. Colonna, P. Lovisolo, D. Parata, C. Ratti, Real-time urban monitoring using cell phones: a case study in Rome, IEEE Trans. Intell. Transp. Syst. 12 (1) (2011) 141–151.
[3] M.A. Vasconcelos, S. Ricci, J. Almeida, F. Benevenuto, V. Almeida, Tips, dones and todos: uncovering user profiles in Foursquare, in: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM '12, 2012, pp. 653–662.
[4] F. Calabrese, K. Kloeckl, WikiCity: real-time location-sensitive tools for the city, in: M. Foth (Ed.), Handbook of Research on Urban Informatics: The Practice and Promise of the Real-Time City, IGI Global, Hershey, PA, USA, 2008, pp. 390–413.
[5] G. Friedland, J. Choi, A. Janin, Video2GPS: a demo of multimodal location estimation on Flickr videos, in: Proceedings of the 19th ACM International Conference on Multimedia, MM '11, 2011, pp. 833–834.
[6] J. Zhang, C. Kamga, H. Gong, L. Gruenwald, U2SOD-DB: a database system to manage large-scale ubiquitous urban sensing origin-destination data, in: Proceedings of the ACM SIGKDD International Workshop on Urban Computing, UrbComp '12, 2012, pp. 163–171.
[7] S. Shekhar, S. Chawla, Spatial Databases: A Tour, Prentice Hall, Upper Saddle River, NJ, USA, 2003.
[8] C. Duntgen, T. Behr, R. Guting, BerlinMOD: a benchmark for moving object databases, VLDB J. 18 (2009) 1335–1368.
[9] T. Lauer, A. Datta, Z. Khadikov, C. Anselm, Exploring graphics processing units as parallel coprocessors for online aggregation, in: Proceedings of the ACM 13th International Workshop on Data Warehousing and OLAP, DOLAP '10, 2010, pp. 77–84.
[10] R. Wrembel, Data warehouse performance: selected techniques and data structures, in: M.-A. Aufaure, E. Zimányi (Eds.), Business Intelligence, Lecture Notes in Business Information Processing, vol. 96, 2012, pp. 27–62.
[11] J. Hennessy, D.A. Patterson, Computer Architecture: A Quantitative Approach, 5th ed., Morgan Kaufmann, Waltham, MA, USA, 2011.
[12] E.H. Jacox, H. Samet, Spatial join techniques, ACM Trans. Database Syst. 32 (1) (2007) 7–24.
[13] Y. Zheng, Y. Liu, J. Yuan, X. Xie, Urban computing with taxicabs, in: Proceedings of the 13th International Conference on Ubiquitous Computing, UbiComp '11, 2011, pp. 89–98.
[14] S. Phithakkitnukoon, T. Horanont, G.D. Lorenzo, R. Shibasaki, C. Ratti, Activity-aware map: identifying human daily activity pattern using mobile phone data, in: Proceedings of the First International Conference on Human Behavior Understanding, HBU '10, 2010, pp. 14–25.
[15] Y. Zheng, Y. Chen, X. Xie, W.-Y. Ma, GeoLife2.0: a location-based social networking service, in: Proceedings of the 10th International Conference on Mobile Data Management: Systems, Services and Middleware, MDM '09, 2009, pp. 357–358.
[16] J. Yuan, Y. Zheng, L. Zhang, X. Xie, G. Sun, Where to find my next passenger, in: Proceedings of the 13th International Conference on Ubiquitous Computing, UbiComp '11, 2011, pp. 109–118.
[17] R. Xie, Y. Ji, Y. Yue, X. Zuo, Mining individual mobility patterns from mobile phone data, in: Proceedings of the 2011 International Workshop on Trajectory Data Mining and Analysis, TDMA '11, 2011, pp. 37–44.
[18] A. Vaisman, E. Zimányi, What is spatio-temporal data warehousing?, in: Proceedings of the 11th International Conference on Data Warehousing and Knowledge Discovery, DaWaK '09, 2009, pp. 9–23.
[19] S. Rivest, Y. Bédard, P. Marchand, Towards better support for spatial decision making: defining the characteristics, in: Proceedings of Geomatica, 2001, pp. 539–555.
[20] A.O. Mendelzon, A.A. Vaisman, Temporal queries in OLAP, in: Proceedings of the 26th International Conference on Very Large Data Bases, VLDB '00, 2000, pp. 242–253.
[21] R. do Nascimento Fidalgo, V.C. Times, J. da Silva, F. da Fonseca de Souza, GeoDWFrame: a framework for guiding the design of geographical dimensional schemas, in: Proceedings of the 6th International Conference on Data Warehousing and Knowledge Discovery, DaWaK '04, 2004, pp. 26–37.
[22] K. Boulil, S. Bimonte, H. Mahboubi, F. Pinet, Towards the definition of spatial data warehouses integrity constraints with spatial OCL, in: Proceedings of the ACM 13th International Workshop on Data Warehousing and OLAP, DOLAP '10, 2010, pp. 31–36.
[23] L.I. Gomez, S. Haesevoets, B. Kuijpers, A.A. Vaisman, Spatial aggregation: data model and implementation, Inf. Syst. 34 (6) (2009) 551–576.
[24] A. Escribano, L.I. Gomez, B. Kuijpers, A.A. Vaisman, Piet: a GIS-OLAP implementation, in: Proceedings of the ACM 10th International Workshop on Data Warehousing and OLAP, 2007, pp. 73–80.
[25] S. Bimonte, A. Tchounikine, M. Miquel, Spatial OLAP: open issues and a web based prototype, in: Proceedings of the 10th AGILE International Conference on Geographic Information Science, 2007, pp. 1–11.
[26] M. Scotch, B. Parmanto, SOVAT: spatial OLAP visualization and analysis tool, in: Proceedings of the 38th Annual Hawaii International Conference on System Sciences, 2005.
[27] O. Glorio, J.-N. Mazon, I. Garrigos, J. Trujillo, Using web-based personalization on spatial data warehouses, in: Proceedings of the 2010 EDBT/ICDT Workshops, EDBT '10, 2010, pp. 8:1–8:8.
[28] T.L.L. Siqueira, C.D. de Aguiar Ciferri, V.C. Times, R.R. Ciferri, The SB-index and the HSB-index: efficient indices for spatial data warehouses, GeoInformatica 16 (1) (2012) 165–205.
[29] J.J. Brito, T.L.L. Siqueira, V.C. Times, R.R. Ciferri, C.D. de Aguiar Ciferri, Efficient processing of drill-across queries over geographic data warehouses, in: Proceedings of the 13th International Conference on Data Warehousing and Knowledge Discovery, DaWaK '11, 2011, pp. 152–166.
[30] K. Choi, W.-S. Luk, Processing aggregate queries on spatial OLAP data, in: Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery, DaWaK '08, 2008, pp. 125–134.
[31] D. Papadias, P. Kalnis, J. Zhang, Y. Tao, Efficient OLAP operations in spatial data warehouses, in: Proceedings of the 7th International Symposium on Advances in Spatial and Temporal Databases, SSTD '01, 2001, pp. 443–459.
[32] R. Obe, L. Hsu, PostGIS in Action, Manning Publications, Stamford, CT, USA, 2011.
[33] J. Zhang, S. You, L. Gruenwald, High-performance online spatial and temporal aggregations on multi-core CPUs and many-core GPUs, in: Proceedings of the ACM 15th International Workshop on Data Warehousing and OLAP, 2012, pp. 89–96.
[34] J. Zhang, S. You, Speeding up large-scale point-in-polygon test based spatial join on GPUs, in: Proceedings of the ACM SIGSPATIAL Workshop on BigSpatial, 2012, pp. 23–32.
[35] J. Jeffers, J. Reinders, Intel Xeon Phi Coprocessor High Performance Programming, Morgan Kaufmann, Waltham, MA, USA, 2013.
[36] D.B. Kirk, W.-M.W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, 2nd ed., Morgan Kaufmann, Burlington, MA, USA, 2012.
[37] K.-H. Lee, Y.-J. Lee, H. Choi, Y.D. Chung, B. Moon, Parallel data processing with MapReduce: a survey, SIGMOD Rec. 40 (4) (2012) 11–20.
[38] M. Grund, J. Kruger, H. Plattner, A. Zeier, P. Cudre-Mauroux, S. Madden, HYRISE: a main memory hybrid storage engine, Proc. VLDB Endow. 4 (2) (2010) 105–116.
[39] K. Kaczmarski, T. Rudny, MOLAP cube based on parallel scan algorithm, in: Proceedings of the 15th International Conference on Advances in Databases and Information Systems, ADBIS '11, 2011, pp. 125–138.
[40] E.A. Sitaridi, K.A. Ross, Ameliorating memory contention of OLAP operators on GPU processors, in: Proceedings of the Eighth International Workshop on Data Management on New Hardware, DaMoN '12, 2012, pp. 39–47.
[41] B. He, M. Lu, K. Yang, R. Fang, N.K. Govindaraju, Q. Luo, P.V. Sander, Relational query coprocessing on graphics processors, ACM Trans. Database Syst. 34 (4) (2009) 21.
[42] M. McCool, A.D. Robison, J. Reinders, Structured Parallel Programming: Patterns for Efficient Computation, Morgan Kaufmann, Waltham, MA, USA, 2012.
[43] Nvidia, Compute Unified Device Architecture (CUDA), ⟨http://www.nvidia.com/object/cuda_home_new.html⟩.
[44] Intel, Threading Building Blocks (TBB), ⟨http://threadingbuildingblocks.org/⟩.
[45] J. Zhang, A brief introduction to parallel primitives, ⟨http://www-cs.ccny.cuny.edu/~jzhang/intro_pp.pdf⟩.
[46] J. Singler, B. Konsik, The GNU libstdc++ parallel mode: software engineering considerations, in: Proceedings of the 1st International Workshop on Multicore Software Engineering, IWMSE '08, 2008.
[47] T. Willhalm, N. Popovici, Y. Boshmaf, H. Plattner, A. Zeier, J. Schaffner, SIMD-scan: ultra fast in-memory table scan using on-chip vector processing units, Proc. VLDB Endow. 2 (1) (2009) 385–394.
[48] C. Kim, J. Chhugani, N. Satish, E. Sedlar, A.D. Nguyen, T. Kaldewey, V.W. Lee, S.A. Brandt, P. Dubey, FAST: fast architecture sensitive tree search on modern CPUs and GPUs, in: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, 2010, pp. 339–350.
[49] T. Yamamuro, M. Onizuka, T. Hitaka, M. Yamamuro, VAST-Tree: a vector-advanced and compressed structure for massive data tree traversal, in: Proceedings of the 15th International Conference on Extending Database Technology, EDBT '12, 2012, pp. 396–407.
[50] M. Pharr, W. Mark, ispc: a SPMD compiler for high-performance CPU programming, in: Proceedings of Innovative Parallel Computing (InPar), 2012.
[51] Metropolitan Transportation Authority (MTA), Subway and bus ridership, ⟨http://www.mta.info/nyct/facts/ridership/⟩.
[52] NYC Department of City Planning (DCP), LION geographic base file of New York City streets, ⟨http://www.nyc.gov/html/dcp/html/bytes/dwnlion.shtml⟩.
[53] I.F. Vega Lopez, R.T. Snodgrass, B. Moon, Spatiotemporal aggregate computation: a survey, IEEE Trans. Knowl. Data Eng. 17 (2) (2005) 271–286.
[54] Wikipedia, Standard Template Library (STL), ⟨http://en.wikipedia.org/wiki/Standard_Template_Library⟩.
[55] D. Merrill, A.S. Grimshaw, High performance and scalable radix sorting: a case study of implementing dynamic parallelism for GPU computing, Parallel Process. Lett. 21 (2) (2011) 245–272.
[56] N. Leischner, V. Osipov, P. Sanders, GPU sample sort, in: Proceedings of the 24th IEEE International Parallel and Distributed Processing Symposium, IPDPS '10, 2010, pp. 1–10.
[57] H. Samet, Foundations of Multidimensional and Metric Data Structures, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.
[58] M.F. Mokbel, W.G. Aref, Irregularity in high-dimensional space-filling curves, Distrib. Parallel Databases 29 (3) (2011) 217–238.
[59] D.A. Alcantara, A. Sharf, F. Abbasinejad, S. Sengupta, M. Mitzenmacher, J.D. Owens, N. Amenta, Real-time parallel hashing on the GPU, ACM Trans. Graph. 28 (5) (2009).