Bring Cartography to the Cloud with Apache Hadoop
Nick Dimiduk, Member of Technical Staff, HBase
FOSS4G-NA, 2013-05-23
© Hortonworks Inc. 2011



DESCRIPTION

If you've used a modern, interactive map such as Google or Bing Maps, you've consumed "map tiles". Map tiles are small images, each rendering one piece of the mosaic that is the whole map. Using conventional means, rendering tiles for the whole globe at multiple resolutions is a huge data processing effort. Even highly optimized, the result spans a couple of terabytes and takes a few days of computation. Enter Hadoop. In this talk, I'll show you how to generate your own custom tiles using Hadoop. There will be pretty pictures.


Page 1: Bring Cartography to the Cloud


Bring Cartography to the Cloud with Apache Hadoop

Nick Dimiduk Member of Technical Staff, HBase FOSS4G-NA, 2013-05-23


Page 2: Bring Cartography to the Cloud


Beginnings…

Architecting the Future of Big Data

mapbox.com/blog/rendering-the-world/

bmander.com/dotmap/index.html

Page 3: Bring Cartography to the Cloud


Definitions


car·tog·ra·phy |kärˈtägrəfē| noun: the science or practice of drawing maps; here, rendering map tiles from some kind of geographic data.

cloud |kloud| noun: a visible mass of condensed water vapor floating in the atmosphere, typically high above the ground; here, on-demand consumption of computation and storage resources.

Page 4: Bring Cartography to the Cloud


Background


Page 5: Bring Cartography to the Cloud


Apache Hadoop in Review

•  Apache Hadoop Distributed Filesystem (HDFS)
   –  Distributed, fault-tolerant, throughput-optimized data storage
   –  Uses a filesystem analogy, not structured tables
   –  The Google File System, 2003, Ghemawat et al.
      http://research.google.com/archive/gfs.html

•  Apache Hadoop MapReduce (MR)
   –  Distributed, fault-tolerant, batch-oriented data processing
   –  Line- or record-oriented processing of the entire dataset *
   –  "[Application] schema on read"
   –  MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and Ghemawat
      http://research.google.com/archive/mapreduce.html

* For more on writing MapReduce applications, see "MapReduce Patterns, Algorithms, and Use Cases": http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/

Page 6: Bring Cartography to the Cloud


MapReduce in Detail


highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/


Page 8: Bring Cartography to the Cloud


What we care about


$ map < input | sort | reduce > output
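That pipeline shape is the whole mental model. A toy word count in the same map | sort | reduce shape makes it concrete (illustrative only; this is not tilebrute code):

```python
from itertools import groupby

# Toy word count in the "map | sort | reduce" shape (illustrative; not tilebrute code)
def mapper(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)          # emit key/value pairs

def reducer(pairs):
    # pairs arrive sorted by key, so groupby sees each key's values together
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

counts = dict(reducer(sorted(mapper(["a b a", "b a"]))))
print(counts)  # {'a': 3, 'b': 2}
```

The sort between map and reduce is what lets the reducer see all values for a key consecutively; Hadoop's shuffle plays exactly this role at scale.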

Page 9: Bring Cartography to the Cloud


How Seamlessly?


$ git show e65731e:bin/10_simulated_hadoop.sh
gzcat "$INPUT_FILES" \
  | python "${PYTHON_DIR}/sample_shapes.py" \
  | sort \
  | python "${PYTHON_DIR}/draw_tiles.py"

$ git show e65731e:bin/11_hadoop_local.sh
hadoop jar target/tile-brute-0.1.0-SNAPSHOT.jar \
  -input /tmp/input.csv \
  -output "$OUTPUT_DIR" \
  -mapper "python ${PYTHON_DIR}/sample_shapes.py" \
  -reducer "python ${PYTHON_DIR}/draw_tiles.py"

Page 10: Bring Cartography to the Cloud


To the Code! github.com/ndimiduk/tilebrute


Page 11: Bring Cartography to the Cloud


Our Tools

•  Python + GIS
   –  GDAL
   –  Shapely
   –  Mapnik
•  Java
•  Apache Hadoop
•  Bash
•  MrJob


Page 12: Bring Cartography to the Cloud


Prepare the Input


TIGER/Line Shapefiles

www.census.gov/geo/maps-data/data/tiger-line.html

$ tail -n6 bin/00_prepare_input.sh
ogr2ogr                  `: invoke gdal tool ogr2ogr` \
  -t_srs epsg:4326       `: reproject the data` \
  -f CSV                 `: in CSV format` \
  $OUTPUT                `: producing output file` \
  $INPUT                 `: from input file` \
  -lco GEOMETRY=AS_WKT   `: including geometries as WKT`

$ head -n2 /tmp/input.csv
WKT,STATEFP10,COUNTYFP10,TRACTCE10,BLOCKCE,BLOCKID10,PARTFLG,HOUSING10,POP10
"POLYGON ((-118.81473 47.233499,...))",53,001,950100,1042,530019501001042,N,1,5
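Reading that CSV back in Python takes nothing beyond the standard library; a minimal sketch, using the sample row from above (geometry abbreviated with "..." exactly as shown there):

```python
import csv
import io

# Parse a row of the ogr2ogr CSV output; sample row taken from the slide,
# with the geometry abbreviated ("...") as it appears there.
sample = (
    "WKT,STATEFP10,COUNTYFP10,TRACTCE10,BLOCKCE,BLOCKID10,PARTFLG,HOUSING10,POP10\n"
    '"POLYGON ((-118.81473 47.233499,...))",53,001,950100,1042,530019501001042,N,1,5\n'
)
row = next(csv.DictReader(io.StringIO(sample)))
geom_wkt = row["WKT"]           # well-known text geometry, quoted because it contains commas
population = int(row["POP10"])  # 2010 census population for the block
```

The mapper needs exactly these two fields: a geometry to sample points from and a population count to decide how many points to sample.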


Page 14: Bring Cartography to the Cloud


Map: Sample Geometries


[,[WKT, population]] => mapper => ['tx,ty,z', 'px,py']

def main():
    for geom, population in read_feature(stdin):
        for lng, lat in sample_geometry(geom, population):
            for key, val in make_kv(lat, lng):
                emit(key, val)

$ map < input | sort | reduce > output
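The 'tx,ty,z' keys follow the standard slippy-map (XYZ) tiling scheme. The usual formula is an assumption about what make_kv computes, not tilebrute's actual code, but it reproduces the keys in the sample output (e.g. 2,5,4 and 10,22,6 for the Washington input point):

```python
import math

# Standard slippy-map (XYZ) tile index for a lng/lat point -- an assumption
# about what make_kv computes, not tilebrute's actual code.
def tile_for(lat, lng, zoom):
    n = 2 ** zoom
    tx = int((lng + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    ty = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return tx, ty

# The first coordinate of the sample polygon (-118.81473, 47.233499) lands in
# tile (2, 5) at zoom 4 and (10, 22) at zoom 6, matching the sample keys.
print(tile_for(47.233499, -118.81473, 4))  # (2, 5)
print(tile_for(47.233499, -118.81473, 6))  # (10, 22)
```

Emitting one key per zoom level is what lets a single pass over the input render every resolution at once.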

Page 15: Bring Cartography to the Cloud


Map: Sample Geometries


$ head -n1 input.csv | python -m tilebrute.sample_shapes
2,5,4 -13224181.65427 5981084.37214
5,11,5 -13224181.65427 5981084.37214
10,22,6 -13224181.65427 5981084.37214
21,44,7 -13224181.65427 5981084.37214
43,89,8 -13224181.65427 5981084.37214
87,179,9 -13224181.65427 5981084.37214
174,359,10 -13224181.65427 5981084.37214
348,718,11 -13224181.65427 5981084.37214
696,1436,12 -13224181.65427 5981084.37214
1392,2873,13 -13224181.65427 5981084.37214
2785,5746,14 -13224181.65427 5981084.37214
5571,11493,15 -13224181.65427 5981084.37214
11142,22986,16 -13224181.65427 5981084.37214
22284,45973,17 -13224181.65427 5981084.37214

$ map < input | sort | reduce > output

Page 16: Bring Cartography to the Cloud


Sort


$ head -n1 input.csv | python -m tilebrute.sample_shapes | sort
10,22,6 -13224414.42332 5983539.01581
10,22,6 -13225723.87449 5981201.60336
10,22,6 -13225793.67181 5983127.53706
10,22,6 -13226046.70101 5983375.66839
10,22,6 -13226331.90155 5984272.31303
11138,22981,16 -13226331.90155 5984272.31303
11139,22983,16 -13225793.67181 5983127.53706
11139,22983,16 -13226046.70101 5983375.66839
11139,22986,16 -13225723.87449 5981201.60336
11141,22982,16 -13224414.42332 5983539.01581

$ map < input | sort | reduce > output

Page 17: Bring Cartography to the Cloud


Reduce: Draw Tiles


def main():
    for tile, points in groupby(read_points(stdin), lambda x: x[0]):
        zoom = get_zoom(tile)
        map = init_map(zoom, points)
        map.zoom_all()
        im = mapnik.Image(256, 256)
        mapnik.render(map, im)
        emit(tile, encode_image(im))

$ map < input | sort | reduce > output

$ head -n1 input.csv | python -m tilebrute.sample_shapes | sort | head -n5 \
  | python -m tilebrute.draw_tiles
10,22,6 iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAYAAABccqhmAAADJ...+aBAAAAAElFTkSuQmCC
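The reducer's groupby works only because the input arrives sorted by tile key. A sketch of what read_points might look like (the tab separator, as in Hadoop streaming's default, and the helper's shape are assumptions):

```python
from itertools import groupby

# Sketch of grouping sorted mapper output by tile key. Tab-separated
# key/value lines are assumed, matching Hadoop streaming's default.
def read_points(lines):
    for line in lines:
        key, val = line.rstrip("\n").split("\t", 1)
        yield key, val

sorted_lines = [
    "10,22,6\t-13224414.42332 5983539.01581",
    "10,22,6\t-13225723.87449 5981201.60336",
    "11141,22982,16\t-13224414.42332 5983539.01581",
]
tiles = {tile: [point for _, point in group]
         for tile, group in groupby(read_points(sorted_lines), key=lambda kv: kv[0])}
# Each tile's points arrive consecutively, so each tile is rendered exactly once.
```

Because sorting is Hadoop's job, the reducer stays a simple streaming loop: one Mapnik render per contiguous run of keys.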

Page 18: Bring Cartography to the Cloud


Write Output


public void write(Text tileId, Text tile) throws IOException {
  String[] tileIdSplits = tileId.toString().split(",");
  assert tileIdSplits.length == 3;
  String tx = tileIdSplits[0];
  String ty = tileIdSplits[1];
  String zoom = tileIdSplits[2];
  Path tilePath = new Path(outputPath, zoom + "/" + tx + "/" + ty + ".png");
  fs.mkdirs(tilePath.getParent());
  byte[] buf = Base64.decodeBase64(tile.toString());
  final FSDataOutputStream fout = fs.create(tilePath, progress);
  fout.write(buf);
  fout.close();
}
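The output format lays tiles out in the familiar z/x/y directory scheme. In Python terms, the path logic reduces to a one-liner (tile_path is a hypothetical helper mirroring the Java above):

```python
# Mirror of the Java path logic above: "tx,ty,zoom" key -> zoom/tx/ty.png
def tile_path(tile_id, out_dir):
    tx, ty, zoom = tile_id.split(",")
    return f"{out_dir}/{zoom}/{tx}/{ty}.png"

print(tile_path("10,22,6", "tiles"))  # tiles/6/10/22.png
```

That layout is exactly what tile-serving clients like Leaflet expect, so the job's output directory can be served as-is.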

Page 19: Bring Cartography to the Cloud


To the Cloud!


Page 20: Bring Cartography to the Cloud


Basic Services: EC2, S3

•  EC2: Elastic Compute Cloud
   –  Virtual machines on demand
   –  Different "instance types" with different hardware profiles
   –  m1.large (2 cores, 7.5G), c1.xlarge (8 cores, 7G)

•  S3: Simple Storage Service
   –  Distributed, replicated storage
   –  Native Hadoop integration
   –  Also exposed over http(s), easy tile hosting


Page 21: Bring Cartography to the Cloud


Add-on Service: EMR

•  EMR: Elastic MapReduce
   –  "Hadoop as a Service"
   –  On-demand, pre-installed and configured Hadoop clusters
   –  +1: standardized provisioning, deployment, monitoring
   –  -1: "stable" (old) software


Page 22: Bring Cartography to the Cloud


MrJob: Python for EMR


class TileBrute(MRJob):
    HADOOP_OUTPUT_FORMAT = 'tilebrute.hadoop.mapred.MapTileOutputFormat'

    def mapper_cmd(self):
        return bash_wrap('$PYTHON -m tilebrute.sample_shapes')

    def reducer_cmd(self):
        return bash_wrap('$PYTHON -m tilebrute.draw_tiles')

github.com/Yelp/mrjob

Page 23: Bring Cartography to the Cloud


Results


Page 24: Bring Cartography to the Cloud


Page 25: Bring Cartography to the Cloud


14z, 2624x, 5722y

Page 26: Bring Cartography to the Cloud


14z, 2624x, 5722y

Page 27: Bring Cartography to the Cloud


How much code?


$ find -f src -f bin | egrep '\.(java|sh|py)$' | grep -v test | xargs cloc --quiet
http://cloc.sourceforge.net v 1.56  T=0.5 s (28.0 files/s, 1868.0 lines/s)
-------------------------------------------------------------------------------
Language        files    blank    comment    code
-------------------------------------------------------------------------------
Python              4       69        105     299
Bourne Shell        8       51         85     210
Java                2       25         16      74
-------------------------------------------------------------------------------
SUM:               14      145        206     583
-------------------------------------------------------------------------------

Page 28: Bring Cartography to the Cloud


Performance


•  1 x m1.large (2 cores)
   –  195575 input features (WA state)
   –  3 zoom levels (6, 7, 8)
   –  1 hour

•  19 x c1.xlarge (152 cores)
   –  308745538 input features (all data)
   –  3 zoom levels (6, 7, 8)
   –  3 hours 15 minutes
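Back-of-the-envelope, the cluster run's per-core throughput came out several times higher than the single node's (both runs rendered the same three zoom levels, so the comparison is rough but like-for-like):

```python
# Features processed per core-hour, computed from the numbers above
single_node = 195_575 / (2 * 1.0)         # 1 x m1.large, 2 cores, 1 hour
cluster = 308_745_538 / (152 * 3.25)      # 19 x c1.xlarge, 152 cores, 3h15m

print(round(single_node))                 # 97788
print(round(cluster / single_node, 1))    # 6.4
```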

Page 29: Bring Cartography to the Cloud


TODOs

•  Macro-level performance optimizations (configuration)
   –  Balancing mappers and reducers, memory allocation, &c.
   –  On-demand Hadoop means tuning the cluster to the application
•  Micro-level performance optimizations (code)
   –  Smarter sampling logic
   –  Mapnik API considerations
   –  Multi-threaded S3 PUTs
      https://forums.aws.amazon.com/thread.jspa?threadID=125135
•  Write tiles in MBTiles format
•  Write tiles to HBase
•  Compression!
•  Ogrbrute?
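On the MBTiles TODO: the format is just a SQLite database with a tiles table (zoom_level, tile_column, tile_row, tile_data), where tile_row uses flipped TMS numbering. A minimal writer sketch, not part of tilebrute:

```python
import os
import sqlite3
import tempfile

# Minimal MBTiles writer sketch (not tilebrute code): one "tiles" table
# per the MBTiles spec, with tile_row flipped from XYZ to TMS numbering.
def write_mbtiles(path, tiles):
    """tiles: iterable of (zoom, x, y, png_bytes) in XYZ numbering."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tiles ("
        "zoom_level INTEGER, tile_column INTEGER, "
        "tile_row INTEGER, tile_data BLOB)")
    for z, x, y, data in tiles:
        tms_y = (2 ** z - 1) - y  # flip the y axis for TMS
        conn.execute("INSERT INTO tiles VALUES (?, ?, ?, ?)",
                     (z, x, tms_y, data))
    conn.commit()
    conn.close()

# Demo: store one placeholder tile and read it back
path = os.path.join(tempfile.mkdtemp(), "demo.mbtiles")
write_mbtiles(path, [(6, 10, 22, b"...png bytes...")])
row = sqlite3.connect(path).execute(
    "SELECT zoom_level, tile_column, tile_row FROM tiles").fetchone()
print(row)  # (6, 10, 41)
```

A single MBTiles file would sidestep the many-small-files problem of one PNG per key, on HDFS and S3 alike.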


Page 30: Bring Cartography to the Cloud


Thanks!


HBase in Action (Manning)
Nick Dimiduk and Amandeep Khurana
Foreword by Michael Stack
hbaseinaction.com

Nick Dimiduk github.com/ndimiduk

@xefyr

n10k.com