Processing Planetary Sized Datasets
Tim Park @timpark
Outdoor Activity Dataset
# UserId, ActivityId, Latitude, Longitude, Timestamp
101528757,285751033,51.517227,-0.101553,1429087808
101528757,285751033,51.517296,-0.101817,1429087812
101528757,285751033,51.517353,-0.102064,1429087816
101528757,285751033,51.517445,-0.102144,1429087820
101528757,285751033,51.517475,-0.102259,1429087824
101528757,285751033,51.51743,-0.102343,1429087828
101528757,285751033,51.517338,-0.102309,1429087837
101528757,285751033,51.517307,-0.102303,1429087857
101528757,285751033,51.517296,-0.102346,1429087864
101528757,285751033,51.517284,-0.102388,1429087877
101528757,285751033,51.51729,-0.102321,1429087959
101528757,285751033,51.51729,-0.102248,1429087961
101528757,285751033,51.517338,-0.102088,1429087965
101528757,285751033,51.51737,-0.10196,1429087969
101528757,285751033,51.517334,-0.101849,1429087973
101528757,285751033,51.517357,-0.101711,1429087977
101528757,285751033,51.51739,-0.10159,1429087981
101528757,285751033,51.51731,-0.101523,1429087985
101528757,285751033,51.517223,-0.101501,1429087989
101528757,285751033,51.51716,-0.101527,1429087994
Dataset Size
Total Size: 560 GB
Activities: 2.5 million
GPS Locations: 3.5 billion
Demo: Application
GPS Location Storage
• Many activities per user.
• Want to be able to pull a time range of user locations for activity display.
Range Query
User Id            Timestamp      Latitude   Longitude
…
10152875766888406  1445374423000  36.966819  -122.012298
10152875766888406  1445377625000  36.966845  -122.012248
10152875766888406  1445377627000  36.966877  -122.012228
10152875766888406  1445377629000  36.966913  -122.012236
10152875766888406  1445377630000  36.966946  -122.012236
10152875766888406  1445377631000  36.966984  -122.012263
10152875766888406  1445379512000  36.967027  -122.012281
…
Location Storage Options
This is a challenge with a large dataset:
• A traditional relational database (e.g. Postgres) typically requires hand sharding to scale to petabytes of data.
• Highly indexed non-relational solutions (e.g. MongoDB) can be very expensive.
• Lightly indexed solutions are a good fit because there is really only one query we need to execute against the data.
Pattern 1: Use Azure Table Storage for bulk data
PartitionKey (userId)  RowKey (timestamp)  Latitude   Longitude
10152875766888406      1445377623000       36.966819  -122.012298
10152875766888406      1445377625000       36.966845  -122.012248
10152875766888406      1445377627000       36.966877  -122.012228
10152875766888406      1445377629000       36.966913  -122.012236
10152875766888406      1445377630000       36.966946  -122.012236
…
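With this key layout, the time-range lookup becomes a single-partition range scan. A minimal sketch of building the filter for such a query, assuming the standard OData `$filter` syntax that Azure Table Storage accepts. The function name `location_range_filter` and the `width` parameter are hypothetical, not from the talk. Because RowKeys are strings and compare lexicographically, the sketch zero-pads timestamps to a fixed width so that string order matches numeric order:

```python
def location_range_filter(user_id: str, start_ms: int, end_ms: int, width: int = 13) -> str:
    """Build an OData filter for one user's locations in [start_ms, end_ms).

    Hypothetical helper: assumes PartitionKey holds the user id and RowKey
    holds a millisecond timestamp stored as a fixed-width string.
    """
    if start_ms > end_ms:
        raise ValueError("start_ms must not be after end_ms")
    # Zero-pad so that lexicographic RowKey comparison matches numeric order.
    start_key = str(start_ms).zfill(width)
    end_key = str(end_ms).zfill(width)
    return (
        f"PartitionKey eq '{user_id}' "
        f"and RowKey ge '{start_key}' and RowKey lt '{end_key}'"
    )


example_filter = location_range_filter("10152875766888406", 1445377625000, 1445377631000)
```

The resulting string can be passed as the query filter to whichever Table Storage client is in use; only the partition for that user is touched.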
Activity Storage
• Want to query a set of activities in a bounding box.
• Also want to filter activities based on distance and duration.
Activity Data
activity id  start (sec)  finish (sec)  distance (m)  duration (m)  bbox (geometry)
101528       1445377625   1445383025    50023         6222          [-104.990, 39.7392...
101643       1445362577   1445373616    28778         2498          [-122.01228, 36.96…
101843       1445377627   1445382432    4629          701           [0.1278, 51.5074 …
101901       1445362577   1445374713    99691         14232         [139.6917, 35.699...
102102       1445374713   1445374713    25259         6657          [1.3521, 103.8129…
Pattern 2: Use “polyglot persistence”

Location Data (Azure Table Storage)
user id            timestamp   latitude   longitude
10152875766888406  1445377623  36.966819  -122.012298
10152875766888406  1445377625  36.966845  -122.012248
…
10152875766888406  1445383025  36.966913  -122.012236
10152875766888406  1445383030  36.966946  -122.012236

Activity Data (Postgres + PostGIS)
activity id  start       finish      …  bbox
101528       1445362577  1445373616  …  [-104.990, 39.7392...
101643       1445377625  1445383025  …  [-122.01228, 36.96…
101843       1445377627  1445382432  …  [0.1278, 51.5074 …
101901       1445362577  1445374713  …  [139.6917, 35.699...
102102       1445374713  1445374713  …  [1.3521, 103.8129…
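The activity side of this split could be queried roughly as follows. This is a sketch only: the table name `activities`, the column names, the SRID 4326, and the function `activities_in_bbox_query` are assumptions for illustration, not the schema from the talk. It relies on the PostGIS `ST_MakeEnvelope` function and the `&&` bounding-box overlap operator, and uses `%s` placeholders in the style of psycopg:

```python
def activities_in_bbox_query(west, south, east, north,
                             min_distance_m=None, max_duration=None):
    """Return (sql, params) selecting activities whose bbox overlaps the given box.

    Hypothetical schema: activities(activity_id, distance, duration, bbox geometry).
    """
    # && is the PostGIS bounding-box overlap operator; it can use a spatial index.
    clauses = ["bbox && ST_MakeEnvelope(%s, %s, %s, %s, 4326)"]
    params = [west, south, east, north]
    if min_distance_m is not None:
        clauses.append("distance >= %s")
        params.append(min_distance_m)
    if max_duration is not None:
        clauses.append("duration <= %s")
        params.append(max_duration)
    sql = ("SELECT activity_id, distance, duration FROM activities WHERE "
           + " AND ".join(clauses))
    return sql, params


sql, params = activities_in_bbox_query(-122.1, 36.9, -121.9, 37.0,
                                       min_distance_m=1000, max_duration=7200)
```

Passing the values as parameters rather than formatting them into the string leaves quoting to the database driver.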
Heatmap Generation
• Total number of location samples in a geographical area.
• Whole dataset operation.
HDInsight Spark
• Based on Apache Spark.
• Offered as a service in Azure as HDInsight Spark.
• Can think of it as “Hadoop the Next Generation”.
• Better performance (10-100x).
• Cleaner programming model.
Pattern 3: XYZ Tiles for summarization
• Divides the world up into tiles.
• Each tile has four children at the next higher zoom level.
• Maps two-dimensional space to one dimension.
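A minimal sketch of the tile math behind these bullets, assuming the standard Web Mercator XYZ scheme and a `zoom_row_column` id format (inferred from the tile ids that appear elsewhere in this deck). The function names are hypothetical and this is not the geotile library's API:

```python
import math


def tile_id(latitude, longitude, zoom):
    """Map a WGS84 coordinate to a 'zoom_row_column' tile id (assumed format)."""
    n = 2 ** zoom  # tiles per side at this zoom level
    lat_rad = math.radians(latitude)
    column = int((longitude + 180.0) / 360.0 * n)
    row = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that the poles and the antimeridian stay inside the grid.
    column = min(max(column, 0), n - 1)
    row = min(max(row, 0), n - 1)
    return f"{zoom}_{row}_{column}"


def parent_id(tile):
    """Return the id of the enclosing tile one zoom level up."""
    zoom, row, column = (int(part) for part in tile.split("_"))
    if zoom == 0:
        raise ValueError("the zoom 0 tile has no parent")
    return f"{zoom - 1}_{row // 2}_{column // 2}"


def child_ids(tile):
    """Return the four tiles that subdivide this tile at the next zoom level."""
    zoom, row, column = (int(part) for part in tile.split("_"))
    return [f"{zoom + 1}_{2 * row + dr}_{2 * column + dc}"
            for dr in (0, 1) for dc in (0, 1)]
```

Because a parent id is obtained by halving the row and column, counts for coarse zoom levels can be derived from the same point data without any spatial index.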
Heatmap Spark Mapper
For each location, map to tiles at every zoom level:
(36.9741, -122.0308) → [(10_398_164, 1), (11_797_329, 1), (12_1594_659, 1), (13_3189_1319, 1), (14_6378_2638, 1), (15_12757_5276, 1), (16_25514_10552, 1), (17_51028_21105, 1), (18_102057_42211, 1)]
Heatmap Spark Mapper
def tile_id_mapper(location):
    tileMappings = []
    tileIds = Tile.tile_ids_for_zoom_levels(
        location['latitude'],
        location['longitude'],
        MIN_ZOOM_LEVEL,
        MAX_ZOOM_LEVEL
    )
    for tileId in tileIds:
        tileMappings.append((tileId, 1))
    return tileMappings
Heatmap Spark Algorithm
Reduce all these mappings with the same key into an aggregate value:
[(10_398_164, 15), (10_398_164, 28), (10_398_164, 29), (10_398_164, 17), (10_398_164, 31), (10_398_164, 2), (10_398_164, 16), (10_398_164, 2), (10_398_164, 11)] → (10_398_164, 151)
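The reduce step can be emulated locally to check the arithmetic. The sketch below is plain Python, not Spark: `reduce_by_key` is a hypothetical stand-in that applies the same addition per key that the reduce step performs, using the partial counts from the example above:

```python
from collections import defaultdict


def reduce_by_key(pairs):
    """Sum the counts of all (key, count) pairs that share a key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Partial counts for one tile, as listed in the example.
partials = [
    ("10_398_164", 15), ("10_398_164", 28), ("10_398_164", 29),
    ("10_398_164", 17), ("10_398_164", 31), ("10_398_164", 2),
    ("10_398_164", 16), ("10_398_164", 2), ("10_398_164", 11),
]
heatmap = reduce_by_key(partials)
```

Since addition is associative and commutative, the partial sums can be combined in any order, which is what lets the work be spread across partitions.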
Heatmap Spark
Building the heatmap then boils down to this in Spark:
lines = sc.textFile('wasb://[email protected]/')
locations = lines.flatMap(json_loader)
heatmap = (locations
    .flatMap(tile_id_mapper)
    .reduceByKey(lambda agg1, agg2: agg1 + agg2))
heatmap.saveAsTextFile('wasb://[email protected]/')
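The job refers to a `json_loader` function that the slides do not show. One plausible shape for it, offered as an assumption rather than the author's code: since it is used with `flatMap`, it can return a one-element list for a valid record and an empty list for a line that should be skipped.

```python
import json


def json_loader(line):
    """Parse one text line into a list holding zero or one location dicts.

    Hypothetical sketch: assumes each line is a JSON object with
    'latitude' and 'longitude' fields, as tile_id_mapper expects.
    """
    try:
        record = json.loads(line)
    except ValueError:
        # Malformed line: emit nothing so flatMap drops it.
        return []
    if not isinstance(record, dict):
        return []
    if "latitude" not in record or "longitude" not in record:
        return []
    return [record]
```

Returning a list rather than raising keeps a single bad line from failing the whole job.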
Spark Shuffle
Pattern 4: Incremental Ingestion
[Diagram: Activity events, Application API, Azure Event Hub, Azure Stream Analytics, Azure Table Storage, and hourly data slices 2016-04-28 12:00 through 2016-04-28 17:00]
Pattern 5: Data Slice Processing
[Diagram: hourly data slices 2016-04-28 13:00 through 2016-04-28 17:00; a Heatmap Partial for the 2016-04-28 17:00 slice, the Existing Heatmap, and the New Heatmap]
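One way to read this pattern: only the newest slice is processed, and its partial heatmap is added to the existing one instead of recomputing over the whole dataset. A minimal sketch of that merge, assuming heatmaps are held as tile id to count mappings. The function name `merge_heatmaps` and the sample counts are hypothetical:

```python
def merge_heatmaps(existing, partial):
    """Return a new heatmap with the partial's counts added to the existing ones."""
    merged = dict(existing)  # copy so the existing heatmap is left untouched
    for tile, count in partial.items():
        merged[tile] = merged.get(tile, 0) + count
    return merged


# Illustrative values only.
existing_heatmap = {"10_398_164": 151, "11_797_329": 40}
heatmap_partial = {"10_398_164": 9, "12_1594_659": 3}
new_heatmap = merge_heatmaps(existing_heatmap, heatmap_partial)
```

In Spark the same effect could be had by taking the union of the two keyed datasets and reducing by key again.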
Displaying Heatmaps
Pattern 6: Precomputing Heatmaps
Pattern 6: Precomputing Data Views
[Diagram: 2016-04-28 17:00 Heatmap Deltas, Previous Heatmap, New Heatmap, Updates, Application API; tiles 6_25_31; 7_50_62, 7_50_63, 7_50_64; 8_100_124, 8_100_125, 8_100_126; 9_201_245, 9_201_247, 9_201_248, 9_201_249, 9_201_250]
GPS Trace Dataset
User Id            Activity Id  Timestamp      Latitude   Longitude
…
10152875766888406  57169639     1445377623000  36.966819  -122.012298
10152875766888406  57169639     1445377625000  36.966845  -122.012248
10152875766888406  57169639     1445377627000  36.966877  -122.012228
10152875766888406  57169639     1445377629000  36.966913  -122.012236
10152875766888406  57169639     1445377630000  36.966946  -122.012236
10152875766888406  57169639     1445377631000  36.966984  -122.012263
10152875766888406  57169639     1445377632000  36.967027  -122.012281
…
Pattern 7: Data Cleansing / Enrichment
[Diagram: Raw Data hourly slices (2016-04-28 13:00 through 2016-04-28 17:00, …), Elevation Enrichment, Bing Elevation, and Elevation Enriched hourly slices for the same hours]
Azure Functions
let azure = require('azure-storage'),
    elevationService = require('../services/elevation');

module.exports = function(context, locationBlob) {
    let locations = locationBlob.split('\n');
    elevationService.enrichLocations(locations, (err, enrichedLocations) => {
        if (err) return context.done(err);
        // ... save enrichedlocations to blob ...
        context.done();
    });
};
Pattern 8: Use an extensible binary encoding
JSON: 560 GB
Avro: 224 GB (60% smaller)
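To see where a saving of this kind comes from, the sketch below compares one location record encoded as JSON text with the same record packed as fixed-width binary fields. This is not Avro: Python's `struct` module stands in for a binary encoding here, with a format string playing the role of the schema, so the ratio it produces should not be compared with the figures above. The field names and layout are assumptions:

```python
import json
import struct

# One record from the sample dataset: user, activity, latitude, longitude, timestamp.
record = {
    "userId": 101528757,
    "activityId": 285751033,
    "latitude": 51.517227,
    "longitude": -0.101553,
    "timestamp": 1429087808,
}

# Assumed layout: little-endian, three 64-bit integers and two doubles.
LAYOUT = "<qqddq"

json_bytes = json.dumps(record).encode("utf-8")
binary_bytes = struct.pack(
    LAYOUT,
    record["userId"], record["activityId"],
    record["latitude"], record["longitude"],
    record["timestamp"],
)

# Decoding needs the layout, just as a binary format needs its schema.
decoded = struct.unpack(LAYOUT, binary_bytes)
```

The JSON form repeats every field name in every record and spells numbers out as text, while the binary form stores only the values.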
Getting Started
• geotile: http://github.com/timfpark/geotile
• XYZ tile math in C#, JavaScript, and Python
• heatmap: http://github.com/timfpark/heatmap
• Spark code for building heatmaps
Questions?
Tim Park @timpark