Bring Cartography to the Cloud with Apache Hadoop
Nick Dimiduk, Member of Technical Staff, HBase
FOSS4G-NA, 2013-05-23
© Hortonworks Inc. 2011



DESCRIPTION

If you've used a modern, interactive map such as Google or Bing Maps, you've consumed "map tiles". Map tiles are small images, each rendering one piece of the mosaic that is the whole map. Using conventional means, rendering tiles for the whole globe at multiple resolutions is a huge data processing effort. Even highly optimized, the result spans a couple of terabytes and takes a few days of computation. Enter Hadoop. In this talk, I'll show you how to generate your own custom tiles using Hadoop. There will be pretty pictures.


Page 1: Bring Cartography to the Cloud


Bring Cartography to the Cloud with Apache Hadoop

Nick Dimiduk Member of Technical Staff, HBase FOSS4G-NA, 2013-05-23


Page 2: Bring Cartography to the Cloud


Beginnings…

Architecting the Future of Big Data

mapbox.com/blog/rendering-the-world/

bmander.com/dotmap/index.html

Page 3: Bring Cartography to the Cloud


Definitions


car·tog·ra·phy |kärˈtägrəfē| noun: the science or practice of drawing maps; here, rendering map tiles from some kind of geographic data.

cloud |kloud| noun: a visible mass of condensed water vapor floating in the atmosphere, typically high above the ground; here, on-demand consumption of computation and storage resources.

Page 4: Bring Cartography to the Cloud


Background


Page 5: Bring Cartography to the Cloud


Apache Hadoop in Review

•  Apache Hadoop Distributed Filesystem (HDFS)
   –  Distributed, fault-tolerant, throughput-optimized data storage
   –  Uses a filesystem analogy, not structured tables
   –  The Google File System, 2003, Ghemawat et al.
      http://research.google.com/archive/gfs.html

•  Apache Hadoop MapReduce (MR)
   –  Distributed, fault-tolerant, batch-oriented data processing
   –  Line- or record-oriented processing of the entire dataset *
   –  "[Application] schema on read"
   –  MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and Ghemawat
      http://research.google.com/archive/mapreduce.html

* For more on writing MapReduce applications, see "MapReduce Patterns, Algorithms, and Use Cases": http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/

Page 6: Bring Cartography to the Cloud


MapReduce in Detail


highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/


Page 8: Bring Cartography to the Cloud


What we care about


$ map < input | sort | reduce > output
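That pipeline shape is the whole mental model. A toy word count in the same map | sort | reduce shape makes it concrete (illustrative only; this is not tilebrute code):

```python
from itertools import groupby

# Toy word count in the "map | sort | reduce" shape (illustrative; not tilebrute code)
def mapper(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)          # emit key/value pairs

def reducer(pairs):
    # pairs arrive sorted by key, so groupby sees each key's values together
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

counts = dict(reducer(sorted(mapper(["a b a", "b a"]))))
print(counts)  # {'a': 3, 'b': 2}
```

The sort between map and reduce is what lets the reducer see all values for a key consecutively; Hadoop's shuffle plays exactly this role at scale.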

Page 9: Bring Cartography to the Cloud


How Seamlessly?


$ git show e65731e:bin/10_simulated_hadoop.sh
gzcat "$INPUT_FILES" \
  | python "${PYTHON_DIR}/sample_shapes.py" \
  | sort \
  | python "${PYTHON_DIR}/draw_tiles.py"

$ git show e65731e:bin/11_hadoop_local.sh
hadoop jar target/tile-brute-0.1.0-SNAPSHOT.jar \
  -input /tmp/input.csv \
  -output "$OUTPUT_DIR" \
  -mapper "python ${PYTHON_DIR}/sample_shapes.py" \
  -reducer "python ${PYTHON_DIR}/draw_tiles.py"

Page 10: Bring Cartography to the Cloud


To the Code! github.com/ndimiduk/tilebrute


Page 11: Bring Cartography to the Cloud


Our Tools

•  Python + GIS
   –  GDAL
   –  Shapely
   –  Mapnik
•  Java
•  Apache Hadoop
•  Bash
•  MrJob


Page 12: Bring Cartography to the Cloud


Prepare the Input


TIGER/Line Shapefiles

www.census.gov/geo/maps-data/data/tiger-line.html

$ tail -n6 bin/00_prepare_input.sh
ogr2ogr                  `: invoke gdal tool ogr2ogr` \
  -t_srs epsg:4326       `: reproject the data` \
  -f CSV                 `: in CSV format` \
  $OUTPUT                `: producing output file` \
  $INPUT                 `: from input file` \
  -lco GEOMETRY=AS_WKT   `: including geometries as WKT`

$ head -n2 /tmp/input.csv
WKT,STATEFP10,COUNTYFP10,TRACTCE10,BLOCKCE,BLOCKID10,PARTFLG,HOUSING10,POP10
"POLYGON ((-118.81473 47.233499,...))",53,001,950100,1042,530019501001042,N,1,5
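Reading that CSV back in Python takes nothing beyond the standard library; a minimal sketch, using the sample row from above (geometry abbreviated with "..." exactly as shown there):

```python
import csv
import io

# Parse a row of the ogr2ogr CSV output; sample row taken from the slide,
# with the geometry abbreviated ("...") as it appears there.
sample = (
    "WKT,STATEFP10,COUNTYFP10,TRACTCE10,BLOCKCE,BLOCKID10,PARTFLG,HOUSING10,POP10\n"
    '"POLYGON ((-118.81473 47.233499,...))",53,001,950100,1042,530019501001042,N,1,5\n'
)
row = next(csv.DictReader(io.StringIO(sample)))
geom_wkt = row["WKT"]           # well-known text geometry, quoted because it contains commas
population = int(row["POP10"])  # 2010 census population for the block
```

The mapper needs exactly these two fields: a geometry to sample points from and a population count to decide how many points to sample.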


Page 14: Bring Cartography to the Cloud


Map: Sample Geometries


[,[WKT, population]] => mapper => ['tx,ty,z', 'px,py']

def main():
    for geom, population in read_feature(stdin):
        for lng, lat in sample_geometry(geom, population):
            for key, val in make_kv(lat, lng):
                emit(key, val)

$ map < input | sort | reduce > output
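The 'tx,ty,z' keys follow the standard slippy-map (XYZ) tiling scheme. The usual formula is an assumption about what make_kv computes, not tilebrute's actual code, but it reproduces the keys in the sample output (e.g. 2,5,4 and 10,22,6 for the Washington input point):

```python
import math

# Standard slippy-map (XYZ) tile index for a lng/lat point -- an assumption
# about what make_kv computes, not tilebrute's actual code.
def tile_for(lat, lng, zoom):
    n = 2 ** zoom
    tx = int((lng + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    ty = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return tx, ty

# The first coordinate of the sample polygon (-118.81473, 47.233499) lands in
# tile (2, 5) at zoom 4 and (10, 22) at zoom 6, matching the sample keys.
print(tile_for(47.233499, -118.81473, 4))  # (2, 5)
print(tile_for(47.233499, -118.81473, 6))  # (10, 22)
```

Emitting one key per zoom level is what lets a single pass over the input render every resolution at once.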

Page 15: Bring Cartography to the Cloud


Map: Sample Geometries


$ head -n1 input.csv | python -m tilebrute.sample_shapes
2,5,4 -13224181.65427 5981084.37214
5,11,5 -13224181.65427 5981084.37214
10,22,6 -13224181.65427 5981084.37214
21,44,7 -13224181.65427 5981084.37214
43,89,8 -13224181.65427 5981084.37214
87,179,9 -13224181.65427 5981084.37214
174,359,10 -13224181.65427 5981084.37214
348,718,11 -13224181.65427 5981084.37214
696,1436,12 -13224181.65427 5981084.37214
1392,2873,13 -13224181.65427 5981084.37214
2785,5746,14 -13224181.65427 5981084.37214
5571,11493,15 -13224181.65427 5981084.37214
11142,22986,16 -13224181.65427 5981084.37214
22284,45973,17 -13224181.65427 5981084.37214

$ map < input | sort | reduce > output

Page 16: Bring Cartography to the Cloud


Sort


$ head -n1 input.csv | python -m tilebrute.sample_shapes | sort
10,22,6 -13224414.42332 5983539.01581
10,22,6 -13225723.87449 5981201.60336
10,22,6 -13225793.67181 5983127.53706
10,22,6 -13226046.70101 5983375.66839
10,22,6 -13226331.90155 5984272.31303
11138,22981,16 -13226331.90155 5984272.31303
11139,22983,16 -13225793.67181 5983127.53706
11139,22983,16 -13226046.70101 5983375.66839
11139,22986,16 -13225723.87449 5981201.60336
11141,22982,16 -13224414.42332 5983539.01581

$ map < input | sort | reduce > output

Page 17: Bring Cartography to the Cloud


Reduce: Draw Tiles


def main():
    for tile, points in groupby(read_points(stdin), lambda x: x[0]):
        zoom = get_zoom(tile)
        map = init_map(zoom, points)
        map.zoom_all()
        im = mapnik.Image(256, 256)
        mapnik.render(map, im)
        emit(tile, encode_image(im))

$ map < input | sort | reduce > output

$ head -n1 input.csv | python -m tilebrute.sample_shapes | sort | head -n5 \
  | python -m tilebrute.draw_tiles
10,22,6 iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAYAAABccqhmAAADJ...+aBAAAAAElFTkSuQmCC
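The reducer's groupby works only because the input arrives sorted by tile key. A sketch of what read_points might look like (the tab separator, as in Hadoop streaming's default, and the helper's shape are assumptions):

```python
from itertools import groupby

# Sketch of grouping sorted mapper output by tile key. Tab-separated
# key/value lines are assumed, matching Hadoop streaming's default.
def read_points(lines):
    for line in lines:
        key, val = line.rstrip("\n").split("\t", 1)
        yield key, val

sorted_lines = [
    "10,22,6\t-13224414.42332 5983539.01581",
    "10,22,6\t-13225723.87449 5981201.60336",
    "11141,22982,16\t-13224414.42332 5983539.01581",
]
tiles = {tile: [point for _, point in group]
         for tile, group in groupby(read_points(sorted_lines), key=lambda kv: kv[0])}
# Each tile's points arrive consecutively, so each tile is rendered exactly once.
```

Because sorting is Hadoop's job, the reducer stays a simple streaming loop: one Mapnik render per contiguous run of keys.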

Page 18: Bring Cartography to the Cloud


Write Output


public void write(Text tileId, Text tile) throws IOException {
  String[] tileIdSplits = tileId.toString().split(",");
  assert tileIdSplits.length == 3;
  String tx = tileIdSplits[0];
  String ty = tileIdSplits[1];
  String zoom = tileIdSplits[2];
  Path tilePath = new Path(outputPath, zoom + "/" + tx + "/" + ty + ".png");
  fs.mkdirs(tilePath.getParent());
  byte[] buf = Base64.decodeBase64(tile.toString());
  final FSDataOutputStream fout = fs.create(tilePath, progress);
  fout.write(buf);
  fout.close();
}
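The output format lays tiles out in the familiar z/x/y directory scheme. In Python terms, the path logic reduces to a one-liner (tile_path is a hypothetical helper mirroring the Java above):

```python
# Mirror of the Java path logic above: "tx,ty,zoom" key -> zoom/tx/ty.png
def tile_path(tile_id, out_dir):
    tx, ty, zoom = tile_id.split(",")
    return f"{out_dir}/{zoom}/{tx}/{ty}.png"

print(tile_path("10,22,6", "tiles"))  # tiles/6/10/22.png
```

That layout is exactly what tile-serving clients like Leaflet expect, so the job's output directory can be served as-is.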

Page 19: Bring Cartography to the Cloud


To the Cloud!


Page 20: Bring Cartography to the Cloud


Basic Services: EC2, S3

•  EC2: Elastic Compute Cloud
   –  Virtual machines on demand
   –  Different "instance types" with different hardware profiles
   –  m1.large (2 cores, 7.5G), c1.xlarge (8 cores, 7G)

•  S3: Simple Storage Service
   –  Distributed, replicated storage
   –  Native Hadoop integration
   –  Also exposed over http(s), easy tile hosting


Page 21: Bring Cartography to the Cloud


Add-on Service: EMR

•  EMR: Elastic MapReduce
   –  "Hadoop as a Service"
   –  On-demand, pre-installed and configured Hadoop clusters
   –  +1: standardized provisioning, deployment, monitoring
   –  -1: "stable" (old) software


Page 22: Bring Cartography to the Cloud


MrJob: Python for EMR


class TileBrute(MRJob):
    HADOOP_OUTPUT_FORMAT = 'tilebrute.hadoop.mapred.MapTileOutputFormat'

    def mapper_cmd(self):
        return bash_wrap('$PYTHON -m tilebrute.sample_shapes')

    def reducer_cmd(self):
        return bash_wrap('$PYTHON -m tilebrute.draw_tiles')

github.com/Yelp/mrjob

Page 23: Bring Cartography to the Cloud


Results


Page 24: Bring Cartography to the Cloud


Page 25: Bring Cartography to the Cloud


14z, 2624x, 5722y

Page 26: Bring Cartography to the Cloud


14z, 2624x, 5722y

Page 27: Bring Cartography to the Cloud


How much code?


$ find -f src -f bin | egrep '\.(java|sh|py)$' | grep -v test | xargs cloc --quiet
http://cloc.sourceforge.net v 1.56  T=0.5 s (28.0 files/s, 1868.0 lines/s)
-------------------------------------------------------------------------------
Language        files    blank    comment    code
-------------------------------------------------------------------------------
Python              4       69        105     299
Bourne Shell        8       51         85     210
Java                2       25         16      74
-------------------------------------------------------------------------------
SUM:               14      145        206     583
-------------------------------------------------------------------------------

Page 28: Bring Cartography to the Cloud


Performance


•  1 x m1.large (2 cores)
   –  195575 input features (WA state)
   –  3 zoom levels (6, 7, 8)
   –  1 hour

•  19 x c1.xlarge (152 cores)
   –  308745538 input features (all data)
   –  3 zoom levels (6, 7, 8)
   –  3 hours 15 minutes
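Back-of-the-envelope, the cluster run's per-core throughput came out several times higher than the single node's (both runs rendered the same three zoom levels, so the comparison is rough but like-for-like):

```python
# Features processed per core-hour, computed from the numbers above
single_node = 195_575 / (2 * 1.0)         # 1 x m1.large, 2 cores, 1 hour
cluster = 308_745_538 / (152 * 3.25)      # 19 x c1.xlarge, 152 cores, 3h15m

print(round(single_node))                 # 97788
print(round(cluster / single_node, 1))    # 6.4
```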

Page 29: Bring Cartography to the Cloud


TODOs

•  Macro-level performance optimizations (configuration)
   –  Balancing mappers and reducers, memory allocation, &c.
   –  On-demand Hadoop means tuning the cluster to the application
•  Micro-level performance optimizations (code)
   –  Smarter sampling logic
   –  Mapnik API considerations
   –  Multi-threaded S3 PUTs
      https://forums.aws.amazon.com/thread.jspa?threadID=125135
•  Write tiles in MBTiles format
•  Write tiles to HBase
•  Compression!
•  Ogrbrute?
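On the MBTiles TODO: the format is just a SQLite database with a tiles table (zoom_level, tile_column, tile_row, tile_data), where tile_row uses flipped TMS numbering. A minimal writer sketch, not part of tilebrute:

```python
import os
import sqlite3
import tempfile

# Minimal MBTiles writer sketch (not tilebrute code): one "tiles" table
# per the MBTiles spec, with tile_row flipped from XYZ to TMS numbering.
def write_mbtiles(path, tiles):
    """tiles: iterable of (zoom, x, y, png_bytes) in XYZ numbering."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tiles ("
        "zoom_level INTEGER, tile_column INTEGER, "
        "tile_row INTEGER, tile_data BLOB)")
    for z, x, y, data in tiles:
        tms_y = (2 ** z - 1) - y  # flip the y axis for TMS
        conn.execute("INSERT INTO tiles VALUES (?, ?, ?, ?)",
                     (z, x, tms_y, data))
    conn.commit()
    conn.close()

# Demo: store one placeholder tile and read it back
path = os.path.join(tempfile.mkdtemp(), "demo.mbtiles")
write_mbtiles(path, [(6, 10, 22, b"...png bytes...")])
row = sqlite3.connect(path).execute(
    "SELECT zoom_level, tile_column, tile_row FROM tiles").fetchone()
print(row)  # (6, 10, 41)
```

A single MBTiles file would sidestep the many-small-files problem of one PNG per key, on HDFS and S3 alike.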


Page 30: Bring Cartography to the Cloud


Thanks!


HBase in Action (Manning)
Nick Dimiduk and Amandeep Khurana
Foreword by Michael Stack
hbaseinaction.com

Nick Dimiduk github.com/ndimiduk

@xefyr

n10k.com