Processing planetary sized datasets

Tim Park @timpark

Outdoor Activity Dataset

# UserId, ActivityId, Latitude, Longitude, Timestamp
101528757,285751033,51.517227,-0.101553,1429087808
101528757,285751033,51.517296,-0.101817,1429087812
101528757,285751033,51.517353,-0.102064,1429087816
101528757,285751033,51.517445,-0.102144,1429087820
101528757,285751033,51.517475,-0.102259,1429087824
101528757,285751033,51.51743,-0.102343,1429087828
101528757,285751033,51.517338,-0.102309,1429087837
101528757,285751033,51.517307,-0.102303,1429087857
101528757,285751033,51.517296,-0.102346,1429087864
101528757,285751033,51.517284,-0.102388,1429087877
101528757,285751033,51.51729,-0.102321,1429087959
101528757,285751033,51.51729,-0.102248,1429087961
101528757,285751033,51.517338,-0.102088,1429087965
101528757,285751033,51.51737,-0.10196,1429087969
101528757,285751033,51.517334,-0.101849,1429087973
101528757,285751033,51.517357,-0.101711,1429087977
101528757,285751033,51.51739,-0.10159,1429087981
101528757,285751033,51.51731,-0.101523,1429087985
101528757,285751033,51.517223,-0.101501,1429087989
101528757,285751033,51.51716,-0.101527,1429087994

Dataset Size

Total Size: 560 GB
Activities: 2.5 million
GPS Locations: 3.5 billion
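A quick back-of-the-envelope check on these figures, as a sketch. It assumes decimal units (1 GB = 10^9 bytes), which the slide does not state.

```python
# Figures from the slide; decimal gigabytes are an assumption.
TOTAL_BYTES = 560 * 10**9
ACTIVITIES = 2_500_000
LOCATIONS = 3_500_000_000

bytes_per_location = TOTAL_BYTES / LOCATIONS
locations_per_activity = LOCATIONS / ACTIVITIES

print(f"{bytes_per_location:.0f} bytes per GPS location")
print(f"{locations_per_activity:.0f} GPS locations per activity")
```

Under that assumption each stored location costs about 160 bytes and an average activity holds about 1,400 locations.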

Demo: Application

GPS Location Storage

• Many activities per user.
• Want to be able to pull a time range of user locations for activity display.

Range Query

User Id            Timestamp      Latitude   Longitude
…
10152875766888406  1445374423000  36.966819  -122.012298
10152875766888406  1445377625000  36.966845  -122.012248
10152875766888406  1445377627000  36.966877  -122.012228
10152875766888406  1445377629000  36.966913  -122.012236
10152875766888406  1445377630000  36.966946  -122.012236
10152875766888406  1445377631000  36.966984  -122.012263
10152875766888406  1445379512000  36.967027  -122.012281
…

Location Storage Options

This is a challenge with a large dataset:
• A traditional relational database (e.g. Postgres) typically requires hand sharding to scale to PBs of data.
• Highly indexed non-relational solutions (e.g. MongoDB) can be very expensive.
• Lightly indexed solutions are a good fit because we really only have one query we need to execute against the data.

Pattern 1: Use Azure Table Storage for bulk data

PartitionKey (userId)  RowKey (timestamp)  Latitude   Longitude
10152875766888406      1445377623000       36.966819  -122.012298
10152875766888406      1445377625000       36.966845  -122.012248
10152875766888406      1445377627000       36.966877  -122.012228
10152875766888406      1445377629000       36.966913  -122.012236
10152875766888406      1445377630000       36.966946  -122.012236
…
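With userId as PartitionKey and timestamp as RowKey, the single range query becomes a filter on those two keys. The sketch below builds such a filter string in the OData style that Table Storage queries use and checks the logic against an in-memory list instead of calling the service. The helper names (row_key, range_filter, query) and the 13-digit zero padding are assumptions for illustration; padding matters because RowKey is compared as a string.

```python
def row_key(timestamp_ms: int) -> str:
    """Zero-pad so that string order matches numeric order."""
    return f"{timestamp_ms:013d}"


def range_filter(user_id: str, start_ms: int, end_ms: int) -> str:
    """Build a filter for one user's locations in [start_ms, end_ms]."""
    return (
        f"PartitionKey eq '{user_id}' "
        f"and RowKey ge '{row_key(start_ms)}' "
        f"and RowKey le '{row_key(end_ms)}'"
    )


def query(rows, user_id, start_ms, end_ms):
    """In-memory stand-in for the table: apply the same comparison locally."""
    low, high = row_key(start_ms), row_key(end_ms)
    return [r for r in rows if r[0] == user_id and low <= r[1] <= high]


USER = "10152875766888406"
rows = [
    (USER, row_key(1445377623000), 36.966819, -122.012298),
    (USER, row_key(1445377625000), 36.966845, -122.012248),
    (USER, row_key(1445377627000), 36.966877, -122.012228),
    (USER, row_key(1445377629000), 36.966913, -122.012236),
    (USER, row_key(1445377630000), 36.966946, -122.012236),
]

print(range_filter(USER, 1445377625000, 1445377629000))
print(len(query(rows, USER, 1445377625000, 1445377629000)), "rows in range")
```

Because every row for a user shares one partition and rows are ordered by RowKey within it, a time-range read touches only a contiguous run of rows.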

Activity Storage

• Want to query a set of activities in a bounding box.

• Also want to filter activities based on distance and duration.

Activity Data

activity id  start (sec)  finish (sec)  distance (m)  Duration (m)  bbox (geometry)
101528       1445377625   1445383025    50023         6222          [-104.990, 39.7392...
101643       1445362577   1445373616    28778         2498          [-122.01228, 36.96…
101843       1445377627   1445382432    4629          701           [0.1278, 51.5074 …
101901       1445362577   1445374713    99691         14232         [139.6917, 35.699...
102102       1445374713   1445374713    25259         6657          [1.3521, 103.8129…
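The activity query described above (bounding box plus distance and duration filters) can be sketched in plain Python to make the logic concrete. The BBox and Activity types, the function names, and the sample boxes are all made up for illustration, since the slide truncates the real bbox values; duration is kept in whatever unit the table's duration column uses. Boxes that cross the antimeridian are not handled.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


class Activity(NamedTuple):
    activity_id: int
    distance_m: float
    duration: float
    bbox: BBox


def intersects(a: BBox, b: BBox) -> bool:
    """True when the two boxes overlap on both axes."""
    return (a.min_lon <= b.max_lon and b.min_lon <= a.max_lon
            and a.min_lat <= b.max_lat and b.min_lat <= a.max_lat)


def find_activities(activities, viewport, min_distance_m=0.0,
                    max_duration=float("inf")):
    """Activities overlapping the viewport that pass both filters."""
    return [
        a for a in activities
        if intersects(a.bbox, viewport)
        and a.distance_m >= min_distance_m
        and a.duration <= max_duration
    ]


# Illustrative data only; the boxes are invented.
activities = [
    Activity(1, 50023, 6222, BBox(-105.0, 39.7, -104.9, 39.8)),
    Activity(2, 4629, 701, BBox(-0.2, 51.4, 0.2, 51.6)),
]
viewport = BBox(-1.0, 51.0, 1.0, 52.0)

print([a.activity_id for a in find_activities(activities, viewport)])
```

A spatially indexed database performs the same overlap test without scanning every activity, which is what makes it a better home for this data than a key-range store.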

Pattern 2: Use “polyglot persistence”

Location Data (Azure Table Storage)

user Id            timestamp   latitude   longitude
10152875766888406  1445377623  36.966819  -122.012298
10152875766888406  1445377625  36.966845  -122.012248
…
10152875766888406  1445383025  36.966913  -122.012236
10152875766888406  1445383030  36.966946  -122.012236

Activity Data (Postgres + PostGIS)

activity id  start       finish      …  bbox
101528       1445362577  1445373616  …  [-104.990, 39.7392...
101643       1445377625  1445383025  …  [-122.01228, 36.96…
101843       1445377627  1445382432  …  [0.1278, 51.5074 …
101901       1445362577  1445374713  …  [139.6917, 35.699...
102102       1445374713  1445374713  …  [1.3521, 103.8129…

Heatmap Generation

• Total number of location samples in a geographical area.
• Whole dataset operation.

HDInsight Spark

• Based on Apache Spark.
• Offered as a service in Azure as HDInsight Spark.
• Can think of it as “Hadoop the Next Generation”:
  • Better performance (10-100x)
  • Cleaner programming model

Pattern 3: XYZ Tiles for summarization

• Divides world up into tiles.
• Each tile has four children at the next higher zoom level.
• Maps 2-dimensional space to 1 dimension.
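The tile scheme can be sketched with the standard Web Mercator (slippy map) formula. This is not the geotile library's code; the function names are hypothetical, and the zoom_row_column id layout is inferred from the tile ids that appear in the slides.

```python
import math


def tile_id(latitude: float, longitude: float, zoom: int) -> str:
    """Map a coordinate to the id of the tile containing it at one zoom level."""
    n = 2 ** zoom  # tiles per axis at this zoom level
    column = int((longitude + 180.0) / 360.0 * n)
    lat_rad = math.radians(latitude)
    row = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so the poles and the +180 meridian stay inside the grid.
    column = min(max(column, 0), n - 1)
    row = min(max(row, 0), n - 1)
    return f"{zoom}_{row}_{column}"


def children(tile: str) -> list:
    """The four tiles at the next zoom level that cover the same area."""
    zoom, row, column = (int(part) for part in tile.split("_"))
    return [
        f"{zoom + 1}_{2 * row + dr}_{2 * column + dc}"
        for dr in (0, 1)
        for dc in (0, 1)
    ]


print(tile_id(36.9741, -122.0308, 10))
print(children("10_398_164"))
```

For (36.9741, -122.0308) at zoom 10 this gives 10_398_164, the tile id used in the slides. The id string is the one-dimensional key: any location in the same tile yields the same string, so counting per key is counting per area.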

Heatmap Spark Mapper

For each location, map to tiles at every zoom level:

(36.9741, -122.0308) → [(10_398_164, 1), (11_797_329, 1), (12_1594_659, 1), (13_3189_1319, 1), (14_6378_2638, 1), (15_12757_5276, 1), (16_25514_10552, 1), (17_51028_21105, 1), (18_102057_42211, 1)]

Heatmap Spark Mapper

def tile_id_mapper(location):
    # Emit a (tileId, 1) pair for every zoom level containing this location,
    # so that counts can later be summed per tile.
    tileMappings = []
    tileIds = Tile.tile_ids_for_zoom_levels(
        location['latitude'],
        location['longitude'],
        MIN_ZOOM_LEVEL,
        MAX_ZOOM_LEVEL
    )
    for tileId in tileIds:
        tileMappings.append((tileId, 1))
    return tileMappings

Heatmap Spark Algorithm

Reduce all these mappings with the same key into an aggregate value:

[(10_398_164, 15), (10_398_164, 28), (10_398_164, 29), (10_398_164, 17), (10_398_164, 31), (10_398_164, 2), (10_398_164, 16), (10_398_164, 2), (10_398_164, 11)] → (10_398_164, 151)
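The reduce step can be imitated without a cluster. The sketch below is a plain-Python stand-in for a reduce-by-key (the helper name reduce_by_key is made up here), run on the partial counts from the slide.

```python
def reduce_by_key(pairs, combine):
    """Fold all values that share a key into one value per key."""
    totals = {}
    for key, value in pairs:
        totals[key] = combine(totals[key], value) if key in totals else value
    return totals


# Partial counts for one tile, as listed on the slide.
partials = [
    ("10_398_164", 15), ("10_398_164", 28), ("10_398_164", 29),
    ("10_398_164", 17), ("10_398_164", 31), ("10_398_164", 2),
    ("10_398_164", 16), ("10_398_164", 2), ("10_398_164", 11),
]

heatmap = reduce_by_key(partials, lambda agg1, agg2: agg1 + agg2)
print(heatmap)
```

Because addition is associative and commutative, the partial sums can be combined in any order, which is what lets the work be spread across many machines.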

Heatmap Spark

Building the heatmap then boils down to this in Spark:

lines = sc.textFile('wasb://[email protected]/')
locations = lines.flatMap(json_loader)
heatmap = (locations
    .flatMap(tile_id_mapper)
    .reduceByKey(lambda agg1, agg2: agg1 + agg2))
heatmap.saveAsTextFile('wasb://[email protected]/')
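The Spark job passes json_loader to flatMap, but the slides do not show that function. One possible version is sketched below, assuming one JSON object per input line with latitude and longitude fields (the field names tile_id_mapper reads). Returning a list lets flatMap drop bad lines by flattening an empty list.

```python
import json


def json_loader(line):
    """Return [record] for a usable line and [] otherwise (hypothetical helper)."""
    try:
        record = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return []
    if not isinstance(record, dict):
        return []
    if "latitude" not in record or "longitude" not in record:
        return []
    return [record]


print(json_loader('{"latitude": 36.9741, "longitude": -122.0308}'))
print(json_loader("not json"))
```

Skipping malformed lines instead of raising keeps a single bad record from failing a job that runs over the whole dataset; whether the original loader did this is not known.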

Spark Shuffle

Pattern 4: Incremental Ingestion

[Diagram: Activity events, Application API, Azure Event Hub, Azure Stream Analytics, and Azure Table Storage, with hourly data slices from 2016-04-28 12:00 to 2016-04-28 17:00.]

Pattern 5: Data Slice Processing

[Diagram: hourly data slices (…, 2016-04-28 13:00 to 2016-04-28 17:00); the 2016-04-28 17:00 Heatmap Partial, the Existing Heatmap, and the New Heatmap.]
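One way to read this pattern: compute a partial heatmap from only the newest slice, then add it to the existing heatmap instead of recomputing over the whole dataset. The sketch below shows that merge with plain dictionaries; the function name and all counts except 151 (from the earlier slide) are illustrative assumptions.

```python
from collections import Counter


def merge_heatmaps(existing, partial):
    """Add the per-tile counts of a new slice to the existing heatmap."""
    merged = Counter(existing)
    merged.update(partial)  # Counter.update adds counts, it does not overwrite
    return dict(merged)


existing_heatmap = {"10_398_164": 151, "11_797_329": 40}
heatmap_partial = {"10_398_164": 9, "11_796_329": 3}

new_heatmap = merge_heatmaps(existing_heatmap, heatmap_partial)
print(new_heatmap)
```

In Spark the same idea could be expressed as a union of the two datasets followed by reduceByKey with addition, so each hourly run scales with the slice size rather than the full history.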

Displaying Heatmaps

Pattern 6: Precomputing Heatmaps

Pattern 6: Precomputing Data Views

[Diagram: 2016-04-28 17:00 Heatmap Deltas, Previous Heatmap, New Heatmap, Updates, and Application API. Tiles shown: 6_25_31; 7_50_62, 7_50_63, 7_50_64; 8_100_124, 8_100_125, 8_100_126; 9_201_245, 9_201_247, 9_201_248, 9_201_249, 9_201_250.]
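The tile ids in this diagram form a hierarchy: halving the row and column of 9_201_249 gives 8_100_124, then 7_50_62, then 6_25_31. How the talk uses that hierarchy is not recoverable from the extracted text, so the sketch below rests on an assumption: a delta to one tile makes the precomputed view for that tile and each of its ancestors stale. The function names are hypothetical.

```python
def parent(tile: str) -> str:
    """The tile one zoom level up that contains this tile."""
    zoom, row, column = (int(part) for part in tile.split("_"))
    return f"{zoom - 1}_{row // 2}_{column // 2}"


def tiles_to_refresh(changed_tiles, min_zoom: int) -> set:
    """Changed tiles plus their ancestors down to min_zoom (assumed policy)."""
    stale = set()
    for tile in changed_tiles:
        zoom = int(tile.split("_")[0])
        while zoom >= min_zoom:
            stale.add(tile)
            tile = parent(tile)
            zoom -= 1
    return stale


print(sorted(tiles_to_refresh(["9_201_249", "9_201_250"], min_zoom=6)))
```

Only the views in the returned set would need to be rebuilt and handed to the Application API; everything else can keep serving the previous precomputed result.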

GPS Trace Dataset

User Id            Activity Id  Timestamp      Latitude   Longitude
…
10152875766888406  57169639     1445377623000  36.966819  -122.012298
10152875766888406  57169639     1445377625000  36.966845  -122.012248
10152875766888406  57169639     1445377627000  36.966877  -122.012228
10152875766888406  57169639     1445377629000  36.966913  -122.012236
10152875766888406  57169639     1445377630000  36.966946  -122.012236
10152875766888406  57169639     1445377631000  36.966984  -122.012263
10152875766888406  57169639     1445377632000  36.967027  -122.012281
…

Pattern 7: Data Cleansing / Enrichment

[Diagram: Raw Data slices (…, 2016-04-28 13:00 to 2016-04-28 17:00), Elevation Enrichment, Bing Elevation, and Elevation Enriched slices (…, 2016-04-28 13:00 to 2016-04-28 17:00).]

Azure Functions

let azure = require('azure-storage'),
    elevationService = require('../services/elevation');

module.exports = function(context, locationBlob) {
    let locations = locationBlob.split('\n');
    elevationService.enrichLocations(locations, (err, enrichedLocations) => {
        if (err) return context.done(err);
        // ... save enrichedLocations to blob ...
        context.done();
    });
};
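The function above delegates to an elevation service that the slides do not show. As a rough sketch of what per-slice enrichment might involve, the Python below appends an elevation to each CSV location line. It assumes the UserId, ActivityId, Latitude, Longitude, Timestamp layout from the opening dataset slide, and lookup_elevation is a hypothetical callable standing in for a real elevation lookup; no real service API is called.

```python
def enrich_locations(lines, lookup_elevation):
    """Append an elevation column to each location line (illustrative only)."""
    enriched = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate the trailing blank line left by split('\n')
        user_id, activity_id, lat, lon, timestamp = line.split(",")
        elevation = lookup_elevation(float(lat), float(lon))
        enriched.append(
            f"{user_id},{activity_id},{lat},{lon},{timestamp},{elevation}"
        )
    return enriched


def fake_lookup(latitude, longitude):
    """Stand-in that returns a constant; a real lookup would call a service."""
    return 12.5


blob = "101528757,285751033,51.517227,-0.101553,1429087808\n"
print(enrich_locations(blob.split("\n"), fake_lookup))
```

Passing the lookup in as a parameter keeps the slice-processing logic testable without network access.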

Pattern 8: Use an extensible binary encoding

JSON: 560 GB
Avro: 224 GB (60% smaller)
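To see where the savings come from, the sketch below encodes one location as JSON and as a fixed binary record. It uses Python's standard struct module, not Avro, so it shows only the size effect of dropping field names and text digits; it has none of Avro's schema or extensibility features. The JSON field names are assumptions.

```python
import json
import struct

# Little-endian: three unsigned 64-bit integers and two 64-bit floats.
RECORD = struct.Struct("<QQQdd")

location = {
    "userId": 10152875766888406,
    "activityId": 57169639,
    "timestamp": 1445377623000,
    "latitude": 36.966819,
    "longitude": -122.012298,
}

as_json = json.dumps(location).encode("utf-8")
as_binary = RECORD.pack(
    location["userId"],
    location["activityId"],
    location["timestamp"],
    location["latitude"],
    location["longitude"],
)

print(len(as_json), "bytes as JSON")
print(len(as_binary), "bytes as fixed binary")
```

The binary record is always 40 bytes, while the JSON form repeats every field name in every record. A schema-based format such as Avro keeps the names in the schema instead, and its schema evolution rules allow fields to be added later.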

Getting Started

• geotile: http://github.com/timfpark/geotile
  • XYZ tile math in C#, JavaScript, and Python
• heatmap: http://github.com/timfpark/heatmap
  • Spark code for building heatmaps


Questions?

Tim Park @timpark
