Flooding Analysis based on LiDAR Point Cloud Data

Fiona Lippert, Maki Gradecak, Peter Petkanic

Abstract

Flood defence is a topic of major importance in the Netherlands, where large parts of the country lie below sea level. To prevent severe catastrophes, a complex system of dikes has been constructed to keep the water under control. Based on the AHN2 LiDAR point-cloud dataset, we aim at identifying critical regions in the Netherlands that rely heavily on protection by dikes. This is achieved by generating a digital elevation model, developing a framework for automatic dike detection from raw point-cloud data, and combining both into an interactive visualisation that allows users to explore the consequences of a dike breach. Facing the massive volume of the given dataset, we propose a cloud-computing approach based on Apache Spark and its Python API. Our proposed pipeline is based on distributed point-cloud rasterisation, a variety of image processing techniques and geometric graph analysis.

Keywords — Point Cloud, Digital Elevation Model, Dike Detection, Big Data, Apache Spark

1. Introduction

In The Netherlands, where large parts of the country lie below sea level, it is estimated that approximately two thirds of the landscape is susceptible to flooding, making flooding analysis, prevention and control extremely important for its residents. There are a number of methods used for flood control, such as dams, floodgates and, most predominantly utilised in The Netherlands, dikes.

A dike is a raised ridge that is constructed parallel to a river or shore. In situations where the water level adjacent to the dike rises unexpectedly, the dike prevents the water from spreading to the area on the other side, thereby preventing destruction of land and property.

Located across The Netherlands are large systems of connected dikes, some natural, some man-made, providing effective flood control. However, what would happen if one of those dike systems were to be compromised, be it through natural erosion, natural disasters or human intervention?

When considering disaster situations such as flooding, it is important not only to evaluate the effectiveness of preventative approaches but also to consider the impact of these approaches failing.

In that regard, this project targets three main goals: to create a coarse-grained elevation map of The Netherlands, to develop a framework for automatic detection of dike systems, and, using these two data products, to create an interactive flooding simulation predicting the flood zone if these dike systems were to fail.

Developing such complex products on the scale of an entire country is generally unattainable without rich data sources to drive the analysis and simulations. However, with the increasing amount of geospatial data being collected and made openly available, projects such as this become increasingly possible. Raw geospatial data sets come in a variety of formats, depending on the purpose and acquisition technique. Among them, point-cloud data in LAS format is extremely rich in information and can be used to model accurate 3D surfaces and to extract complex spatial features.

The disadvantage of this data boom, however, is that it is no longer feasible to operate on such data using single commodity or even high-performance computers, as the data volume often extends into the order of terabytes. Thus, there is an emerging requirement for efficient big data processing techniques for working with geospatial data in order to truly benefit from it. With this in mind, the aim of our project is not only to investigate potential flooding hazards, but also to investigate techniques for efficient processing of large volumes of geospatial data.

To accomplish our task we utilise the Apache Spark framework [1] for large-scale data processing.


The dataset we are working with is the AHN2 dataset, which consists of approximately 10 terabytes of point-cloud data, effectively mapping the entirety of The Netherlands at a 20 cm resolution. In order to store and process such a large amount of data, we deploy our processing pipeline on the Surfsara Hadoop cluster, consisting of 170 compute nodes totalling 1370 CPU cores, backed by 2.3 petabytes of storage capacity.

The remainder of the paper is laid out as follows. In Section 2 we discuss work related to our own. Our research questions are presented in detail in Section 3. Section 4 provides further insight into the data that will be used as a source for this project. We present the setup of our project in Section 5, where we discuss how we accomplished our goals. Section 6 evaluates and discusses the results of our experiments. Finally, in Section 7 we present our conclusions.

2. Related Work

Our work falls into three categories: flooding assessment in The Netherlands, geographic information science (GIS), and massive point-cloud processing using cloud computing. In the following we present relevant related work from all of these areas of research.

2.1. Flooding Assessment in The Netherlands

With flooding being an omnipresent danger, flooding analysis in The Netherlands is a well-studied area, with many different initiatives such as the national Delta Programme [2] aiming at optimised flood defences, or the Floris project [3] exploring topics like risk analysis, loss of life, cost-benefit analysis and uncertainty modelling. The main starting point for such studies is the identification of potential weak links in dike rings, which may be caused by overtopping or loss of stability due to erosion, sliding or piping [4], followed by a complex analysis of consequences.

For an accurate identification of regions affected by a potential dike breach, simulations are used to predict flooded areas. Such simulations range from simple intersection of the water level plane with a surface model of the landscape [5] to sophisticated 3D solutions based on the Navier-Stokes equations and turbulence modelling [6].

Our work differs from those described above in that our analysis is not based on existing terrain models and mapped dike systems, but rather aims to automatically generate these from raw data and perform basic flood plain predictions. In that sense, our project

can be seen as groundwork that transforms available data sources into products to be used in higher-level flooding analysis.

2.2. GIS Methodology and Tools

This leads us to the topic of GIS technologies that facilitate raw geospatial data processing and generation of geographical models. Methods used for transforming 3D points into a digital elevation model (DEM) can be grouped into two stages. First, points need to be filtered to separate ground points from non-ground points. Since ground point features depend strongly on the terrain type, a variety of approaches have been studied and evaluated on diverse terrains. Lowest elevation sampling [7] is a simple initialisation procedure that assumes that bare earth points are usually the lowest features in a local neighbourhood. More sophisticated ground filtering approaches build on the weighted linear least squares interpolation developed by Pfeifer [8], the slope-based filter developed by Vosselman [9], or filtering based on mathematical morphology [10]. Second, the resulting points need to be combined into an elevation model. Two model types dominate the research in this area: raster DEMs, which can be searched and analysed by simple kernels, and so-called Triangulated Irregular Networks (TINs), which allow for accelerated search for neighbouring points [7]. These models are created by interpolating between the irregularly spaced ground points. Here, one of the most popular interpolation methods is Inverse Distance Weighting, which uses the weighted average of points in a neighbourhood, where points closer to the location of interest dominate [11].

Identification of ground points as well as automatic dike detection falls into the category of geospatial classification. Here, two main approaches can be found in the literature: classification based on raw point features, and image classification based on rasterised data. Bao et al. [12] present an interesting approach for separating ground and vegetation in point clouds by analysing the skewness of the intensity distribution.

In general, a variety of powerful GIS software libraries is available, such as ArcGIS, QGIS and the Point Cloud Library (PCL), that provide tools for conversion, rasterisation, segmentation, object identification, spatial indexing and more. However, they are designed for standalone environments only and are hence not feasible for large-scale analysis.

2.3. Large-Scale Point Cloud Processing

In contrast to the well-established point-cloud tools for single machines, little research has been conducted on massive point-cloud processing using distributed computing technologies [13].


Apache Spark frameworks for geospatial data such as GeoSpark and GeoTrellis provide functionality for spatial RDDs of raster and vector data, offering special SQL queries such as distance-based joins as well as efficient spatial indexing. However, they do not come with any LAS point-cloud support. To our knowledge, the only Spark library designed for LAS file import is IQmulus, which relies on an outdated version of Spark. Facing these challenges, several designs of efficient frameworks for distributed point-cloud processing have been proposed in recent years. Boehm et al. [14] utilise an Apache Spark based approach similar to our own: by encoding a list of point-cloud files as an RDD and distributing it amongst workers, point-cloud segments can be individually read and processed concurrently. Wang et al. [15] take a different, MapReduce based approach to point-cloud processing. While this approach proves to be efficient for single transformations, repeated operations on the point cloud can prove challenging due to the limitations of the MapReduce framework. Liu et al. [13] provide the specification and implementation of a Spark library for ingestion of large point clouds. While our initial data exploration was based on the library produced by the authors, our final solution moves away from this approach due to technical incompatibilities, but utilises the core concept of distributing the loading of data to many workers.

3. Research Questions

The main goal of this research is to summarise the Dutch terrain in the form of a DEM and to detect a topology of dike systems that protect critical areas below sea level. These goals will be achieved based on the AHN2 point-cloud dataset, which imposes major challenges due to its sheer volume. With this in mind, we formulate the following questions to be answered:

• How can we efficiently load and process the point-cloud data on the Surfsara Hadoop cluster, transforming the data into a workable format and reducing it to a feasible volume?

• How can we transform the unstructured point records into a DEM that appropriately represents the regional ground levels?

• Is it possible to develop a framework for automatic detection of dike systems that is solely based on point-cloud elevation data?

• Can we, based on the DEM and the topology of dike systems, make predictions about scenarios in which one or multiple dikes breach? Which areas would be flooded, and how would these flood plains change with rising sea level?

The results are to be summarised into an interactive visualisation which facilitates exploration of the Dutch terrain, the detected dikes, as well as predicted flood plains.

4. Data

The AHN2 dataset is a large 3D point-cloud scan of the entirety of The Netherlands, collected over a number of years using airborne LiDAR technology. By utilising lasers and calculating the time taken for a projected laser beam to return, in combination with GPS positioning, distance data can be inferred, and highly accurate X, Y, Z points of the ground as well as features such as vegetation and buildings are extracted. This highly detailed elevation data has many uses, from generating accurate 3D models to detecting abnormalities on the ground; fig 1 shows an example visualisation of such point-cloud data.

Figure 1 A point cloud visualisation of a sample file from the AHN2 dataset

The entire dataset consists of billions of 3D points, resulting in a total of over 10 terabytes of disk space. The data is organised into 225 tiles, each approximately spanning a 6.5 x 5 km area, with each tile further divided into a variable number of data files. The x and y co-ordinates translate to the Amersfoort co-ordinate system, which is one of the reference co-ordinate systems used in The Netherlands [16].

The average density of measurements is about 10 points per 1 m², ranging from detailed descriptions of buildings with hundreds of points per m² to very sparse representations of water bodies.

A major challenge of this dataset is the fact that the points are completely unclassified. In contrast to the more recent AHN3 generation, there is no distinction between ground, water, trees and buildings, and therefore points must be classified or filtered before they can be used in order to avoid distorted results.


For the purposes of this project we are only concerned with the ground level. Performing a quick analysis of some data samples, fig 2 shows that the distribution of height values is highly skewed, indicating a clear distinction between ubiquitous ground points and small-scale landscape features with varying heights. However, the points relevant for dike detection are expected to lie in the same histogram range as irrelevant objects like medium-height vegetation. We will come back to this issue in Section 5.

Figure 2 Distribution of ground point heights in a sample of the AHN2 dataset (height [m] on the horizontal axis, frequency on the vertical axis)

The second big challenge in working with this particular dataset is the format in which the data is provided. The LAS format is an open format for LiDAR data containing binary data split into a header block, a meta-information block, and the actual point data records [17]. However, storing 10 TB of raw data is not feasible in practice, and therefore the data is compressed using a special compression algorithm optimised for point clouds, leaving us with 1.6 TB of raw data in compressed LAZ format. While this might sound like a big advantage, as we will see, the compressed format is more difficult to work with; furthermore, the extra decompression step using the LasTools suite adds a non-negligible delay in processing when the data is on this scale.

5. Project Setup

The envisaged data product, as well as the data used to drive the project, are quite large in scale. With this in mind, we split the process into five stages. In stage 1 (preprocessing) we tackle the challenge posed by the massive size of the dataset and rework it into a more usable format. Building on this, in stage 2 we create a digital elevation model (DEM) of the terrain using the processed data from stage 1. Following on from this, in stage 3 we aim to extract dike segments as polygon information by detecting areas of increased height surrounding bodies of water.

From there, in stage 4 we run a connected components algorithm in order to detect the dike segments that belong to a single dike system. Finally, in stage 5 we bring the data generated by stages 2, 3 and 4 together in a browser-based interactive simulation.

5.1. Preprocessing

The resolution of the points in the AHN2 dataset is far higher than what is realistically required to reliably achieve our goals. While higher resolution may at first sight appear beneficial, in reality the increased data size and its associated computational costs outweigh the benefits of the increased richness of detail. Moreover, the given point-cloud data is completely unstructured and needs to be transformed into a cleaner data structure.

With the above in mind, we chose to reduce the data to a 2D raster with a 1 m x 1 m resolution. This means that all measurements falling into a particular 1 m² sampling window are represented by one single point on the raster, where the window size is chosen such that it facilitates efficient processing while still retaining sufficient information for dike detection. With the aim of identifying the regional ground level, we implemented minimum elevation sampling by aggregating points within the window and assigning the minimum of all height values in the point group to the respective raster point, discarding undesirable small-scale structures with higher elevation. Knowing that the resolution of the original unstructured data is approximately 10 points per 1 m², this rasterisation process allows us not only to project the measurements onto a regular 2D grid but also to reduce the amount of data by almost a factor of 10.
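The minimum elevation sampling step can be sketched locally in pure Python; min_elevation_raster is an illustrative helper, not the project's distributed implementation, and the sample points are made up for demonstration:

```python
import math

def min_elevation_raster(points, cell=1.0):
    # Rasterise (x, y, z) points onto a grid of cell x cell windows,
    # keeping only the minimum height per window.
    raster = {}
    for x, y, z in points:
        key = (math.floor(x / cell), math.floor(y / cell))
        if key not in raster or z < raster[key]:
            raster[key] = z
    return raster

# Three points fall into the same 1 m x 1 m window and collapse
# to the lowest height; the fourth lands in a neighbouring window.
pts = [(0.2, 0.3, 5.0), (0.8, 0.9, 1.2), (0.5, 0.5, 3.4), (1.5, 0.5, 2.0)]
grid = min_elevation_raster(pts)
# grid == {(0, 0): 1.2, (1, 0): 2.0}
```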

The sampling of the tiles is an inherently parallel problem, as each tile is composed of on average 300 files which can be sampled individually without the need for explicit co-ordination, thereby reducing the network overhead. With this in mind, Spark was the framework of choice. A list of files inside a tile is first converted to Spark's Resilient Distributed Dataset (RDD) format. The RDD is then partitioned and distributed across the workers in such a way that each worker task involves reading on average two files (partitions = num_files / 2). Subsequently, a flatMap operation is invoked on the RDD such that each file name is converted into a list of sampled points. The resulting list of lists is then flattened.
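The file-list-to-points flatMap pattern can be illustrated without a cluster; sample_file below is a hypothetical stand-in for the real per-file sampling task, and flat_map mimics what Spark's RDD.flatMap does:

```python
def sample_file(name):
    # Hypothetical placeholder: pretend each file yields two sampled points.
    return [(name, 0), (name, 1)]

def flat_map(fn, items):
    # Apply fn to each item (each yielding a list) and flatten the result,
    # mirroring the behaviour of an RDD flatMap.
    return [out for item in items for out in fn(item)]

files = ["tile_a.laz", "tile_b.laz"]
points = flat_map(sample_file, files)
# points == [("tile_a.laz", 0), ("tile_a.laz", 1),
#            ("tile_b.laz", 0), ("tile_b.laz", 1)]
```

In PySpark the same structure would read roughly as `sc.parallelize(files, len(files) // 2).flatMap(sample_file)`, with the partition count chosen so each task reads about two files, as described above.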

The worker's map task has a degree of complexity to it for a number of reasons. The primary complication is the compressed point-cloud files that it must be able to read. While Laspy, the tool we are using for reading our point-cloud data, has compressed


file support, its compression support is implemented in a rather unusual way. In order for compressed reading to be successful, the worker must have a binary called LasZip (part of the LasTools suite [18]) placed somewhere on its PATH. This external dependency makes the worker task more complicated, as the LasZip binary must be recompiled to match the Linux architecture that the workers run on and be shipped to the worker nodes when the Spark job is submitted. With this out of the way, upon receiving the task, the worker looks up the file in HDFS and downloads it to its local temporary directory. Here the file is decompressed and loaded into memory, where it is sampled using the previously outlined procedure. Note that the sampling process is implemented solely through Spark SQL queries working on the point RDD. The overall structure of the preprocessing stage is illustrated in fig 3.

On completion of stage 1 we effectively reduce the data volume from 10 TB to 150 GB, bringing the size of individual tiles down to less than 1 GB per tile on average.

Figure 3 Overview of the sampling architecture. Parallel task execution is driven by the Spark framework, with HDFS utilised as a datastore. Each worker is issued a task of ingesting point-cloud files; while ingesting, data is subsampled to reduce resolution, and the resulting RDD is written back to HDFS.

5.2. DEM Generation

Generating the DEM follows from the points sampled in the preprocessing stage. The minimum elevation sampling already constitutes the first step of DEM construction, as it filters out non-ground measurements and yields evenly spaced grid points. Facing the lack of Spark point-cloud frameworks, we refrained from implementing complex interpolation-based methods from the literature, which would require computationally expensive repetitive searches for neighbouring points in the unstructured dataframe. Instead, we opted for an image processing approach, where raster points are inserted into a Python numpy matrix on which kernel convolution can efficiently be applied in order to obtain raster elevation data representing a smooth surface. We claim that, for our purpose of generating a coarse-grained elevation map of The Netherlands which provides information about areas susceptible to flooding, it is not essential to create a highly accurate surface model including information about local terrain slopes; a mapping of medium-scale regions to discrete height levels is sufficient. Inspired by the Inverse Distance Weighting interpolation approach, we decided to base the definition of these regions of equal height on the application of a Gaussian filter kernel, which essentially sets the value of each raster point to the weighted average of its neighbours, favouring the nearest ones [19]. This reduces noise due to small-scale features that deviate from the general ground level of a region and that had not been discarded through minimum elevation sampling.
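The weighted-average behaviour of the Gaussian kernel can be seen on a one-dimensional toy profile; the following is a minimal sketch (helper names and the sigma value are illustrative, not the project's parameters):

```python
import numpy as np

def gaussian_kernel(radius=2, sigma=1.0):
    # Normalised 1-D Gaussian weights over [-radius, radius].
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def smooth(row, sigma=1.0):
    # Each value becomes a weighted average of its neighbours,
    # with the nearest neighbours weighted most heavily.
    return np.convolve(row, gaussian_kernel(sigma=sigma), mode="same")

# A single 3 m spike on flat ground is damped and spread out,
# while the total height mass is preserved.
row = np.array([0.0, 0.0, 3.0, 0.0, 0.0])
smoothed = smooth(row)
```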

To obtain a model that represents areas of equal height, we define a number of height bins spanning the entire elevation range of the Dutch terrain (-50 m up to 300 m). These bins are unevenly distributed, such that heights close to sea level, which are more important for flooding analysis, are represented in more fine-grained steps than larger elevations.
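Uneven binning of this kind can be sketched with numpy's digitize; note that the bin edges below are hypothetical, since the paper does not specify the actual edges used:

```python
import numpy as np

# Hypothetical uneven bin edges (metres): fine steps near sea level,
# coarse steps for higher terrain.
edges = np.array([-50.0, -5.0, -2.0, -1.0, 0.0, 1.0, 2.0, 5.0, 10.0, 50.0, 300.0])

heights = np.array([-3.1, -0.4, 0.3, 7.0, 120.0])
bins = np.digitize(heights, edges)
# Each height is mapped to the index of the bin it falls into:
# bins == [2, 4, 5, 8, 10]
```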

Due to the fact that the tiles obtained from stage 1 are still quite large, generating a DEM outright from an entire tile is not only infeasible but also means losing out on the high level of parallelism facilitated by Spark. Therefore, each tile is first subdivided into spatially contiguous blocks of 500x500 points using Spark's DataFrame groupBy functionality before the actual processing is done, allowing us to utilise the cluster once again.
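The grouping key behind such a subdivision can be sketched locally; block_key and group_into_blocks are illustrative helpers, not the project's actual Spark code:

```python
from collections import defaultdict

BLOCK = 500  # block edge length in raster points, matching the text

def block_key(x, y):
    # Integer division maps a raster co-ordinate to its block index,
    # so all points of one 500x500 block share the same key.
    return (int(x) // BLOCK, int(y) // BLOCK)

def group_into_blocks(points):
    blocks = defaultdict(list)
    for x, y, z in points:
        blocks[block_key(x, y)].append((x, y, z))
    return dict(blocks)

pts = [(10, 20, 1.0), (499, 499, 2.0), (500, 20, 3.0)]
blocks = group_into_blocks(pts)
# The first two points share block (0, 0); the third falls in (1, 0).
```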

Following this, a user-defined function (UDF) is applied to each of the blocks. Each worker can apply the UDF on a block independently of the other workers, making this problem embarrassingly parallel, similar to the sampling problem in the preprocessing stage.

The process for transforming a chunk of X, Y, Z points into a partial DEM is implemented as follows. Initially, each worker transforms the chunk into a numpy matrix in order to benefit from the efficient matrix operations and image transformations that Python's numpy, scipy and scikit-image libraries offer. Next, a downsampling function is applied to reduce both the amount of computation per block and the size of the final overall DEM. During


the downsampling, first a Gaussian filter is applied to avoid aliasing effects, and afterwards each block of 10x10 points is merged into one by extracting the median height value. Following this, another Gaussian filter is used to obtain a smooth image representation of the ground surface. Finally, the heights are binned into discrete values, and the resulting DEM is returned to the Spark driver. The driver inserts the returned blocks into an image representing the entire tile, and the discrete height levels are mapped to RGB colours, resulting in an elevation map in PNG format.

Note that due to the distributed processing of individual blocks, artifacts at the block edges may occur. A more accurate approach avoiding such artifacts would involve halo exchange with neighbouring blocks, such that smoothing of edge points would consider their actual neighbours in the overall picture instead of approximating their values using the edge values themselves. However, this would imply significant communication overhead and thus counteract Spark's notion of data partitioning to facilitate fast querying.

Finally, we apply the same approach used to generate the raster DEM to extract a vector representation of areas of equal height, which is easier to incorporate into flood prediction. The only difference is that for this purpose we operate on the scale of the entire dataset, which means that all sampled tiles are loaded at once and more radical downsampling is applied to allow for concurrent processing of the entire dataset. The resulting raster DEM is separated into collections of areas at 8 distinct height levels close to sea level, represented by 8 binary images. Then, for each level, ordered lists of contour points are extracted, transformed back to the original Amersfoort co-ordinates, and converted to a string representation of polygon objects in Well-Known Text (WKT) format, a markup language widely used for representing vector geometry objects through their coordinates. Being of regular string type, these objects can then be collected into a Spark DataFrame at the Spark driver, which can easily be converted into a GeoPandas dataframe containing all identified polygons. The vector DEM, in the form of a set of polygons, is then written to GeoJSON, a format commonly used to encode geographic data structures as geometric objects, and thereby presents a concise and workable product which can easily be used for further analysis and visualisation.
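The conversion of an ordered contour into a WKT polygon string can be done with plain string formatting; contour_to_wkt is an illustrative helper, assuming contour points are already in the target co-ordinate system:

```python
def contour_to_wkt(contour):
    # WKT requires the outer ring to be explicitly closed, so the
    # first point is repeated at the end if necessary.
    ring = list(contour)
    if ring[0] != ring[-1]:
        ring.append(ring[0])
    coords = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON (({coords}))"

wkt = contour_to_wkt([(0, 0), (0, 2), (2, 2), (2, 0)])
# wkt == "POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))"
```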

5.3. Dike Detection

The dike detection process begins similarly to DEM generation: the same general approach is adopted, where sampled tiles are divided into blocks,

distributed among workers, which then apply image processing techniques to their chunk of data. The main difference is that for reliable detection of dike segments a larger area of operation is needed in order to distinguish between long but thin dikes and other small features of similar height. Therefore, the tile is first further downsampled in its entirety by grouping 10x10 areas of points and retaining only the mean height value. The newly sampled tile is then divided into blocks of 500x500 points, representing physical regions of 5 km by 5 km.

To identify dike points, we decided to follow an image classification approach based on edge detection and segmentation. A dike is a raised ridge with a crest between 2 and 5 meters in width and with slopes towards both the water side and the protected land side, resulting in a base width of up to 50 meters. Depending on the dike type (sea dike, river dike or lake dike), the crest is usually between 5 and 10 meters above the surrounding ground level [20]. With this knowledge, we expect dikes in the raster image to appear as bands of points with increased height exhibiting a strong gradient along both edges. The discrete Laplace filter [19], approximating the second spatial derivative, is able to detect exactly these areas of strong intensity change together with the orientation of gradients. Hence, after Laplace filtering, the dike edges manifest as pixel values around zero, whereas pixels close to the edges take negative values if they are on the higher side of the edge, and positive values if they are on the lower side. With dikes being long but narrow structures, it is reasonable to assume that the entirety of the dike appears as negative pixel values, forming segments that can be extracted by simple thresholding.
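This sign convention can be verified on a one-dimensional toy height profile; laplacian_1d is an illustrative helper applying the discrete second derivative f(i-1) - 2 f(i) + f(i+1), and the ridge values are synthetic:

```python
import numpy as np

def laplacian_1d(profile):
    # Discrete second derivative of a height profile; the boundary
    # values are simply left at zero in this sketch.
    p = np.asarray(profile, dtype=float)
    out = np.zeros_like(p)
    out[1:-1] = p[:-2] - 2 * p[1:-1] + p[2:]
    return out

# A synthetic ridge: the crest (local maximum) yields negative
# responses, while the foot of the slopes yields positive ones.
ridge = np.array([0.0, 0.0, 1.0, 3.0, 4.0, 3.0, 1.0, 0.0, 0.0])
lap = laplacian_1d(ridge)
```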

From a technical perspective, a new UDF for dike detection performs a transformation operation on the grouped data chunks. As before, a worker starts by transforming the assigned chunk into a matrix. Listing 1 describes the process of identifying potential dike segments in such a chunk. The first step is to identify the ground level Z_G: a histogram of heights is generated for the given matrix, and the most frequent height bin is selected as the ground level. Following this, all heights in the grid are clipped to the range [Z_G, Z_G + 5], mapping all heights below Z_G to Z_G and all heights 5 meters or more above ground level to Z_G + 5, effectively removing elevation features outside the height range of interest for dike detection. Then downsampling by a factor of 2 is performed to further decrease the computational cost per block, and a Gaussian filter is applied to reduce noise.
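The histogram-based ground-level estimate and the clipping step might be sketched as follows; the 0.5 m bin width and the toy height distribution are assumptions, as the text does not fix these values.

```python
import numpy as np

def estimate_ground_level(heights, bin_width=0.5):
    """Return the lower edge of the most frequent height bin as Z_G."""
    bins = np.arange(heights.min(), heights.max() + bin_width, bin_width)
    counts, edges = np.histogram(heights, bins=bins)
    return edges[np.argmax(counts)]

# toy block: mostly low-lying ground with a few elevated points
heights = np.concatenate([np.full(900, -2.0), np.full(100, 4.0)])
z_g = estimate_ground_level(heights)

# clip to [Z_G, Z_G + 5], removing features outside the range of interest
clipped = np.clip(heights, z_g, z_g + 5)
```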



def dike_detection_udf(chunk):
    np_matrix = numpy.matrix(chunk)
    # limit height values to relevant range
    clipped = clip(np_matrix, Z_G, Z_G + 5)
    # change resolution from 500x500 to 50x50
    sampled = downsample(clipped)
    # apply a Gaussian filter to smooth points
    smoothed = gauss_filter(sampled)
    # detect edges
    edges = laplace_filter(smoothed)
    # threshold: dike interiors appear as negative values
    binary = edges < threshold
    # close small gaps
    closed = binary_closing(binary)
    # extract contour points of candidate segments
    polygons = find_contours(closed)
    return polygons

Listing 1 Pseudocode of the UDF outlining the operations for transforming a point grid into potential dike segments

Once the transformed height map has been smoothed, the Laplace filter and thresholding are applied, resulting in a binary image of potential dike segments. However, the identification is not perfect: there are unconnected segments that belong to a single system, and small segmented 'noise' objects appear in the image. In order to make connecting dike components as easy as possible in the next stage, we apply morphological operators based on mathematical set theory [21]. More precisely, binary closing, a sequence of dilation and erosion, is used to close small gaps between detected segments. Additionally, all segments that have no contact with the image borders are discarded, as dikes are expected to stretch over large areas comprising multiple blocks and even tiles.
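A minimal sketch of this cleanup step, using `scipy.ndimage` for brevity; the structuring element and the toy input are assumptions:

```python
import numpy as np
from scipy.ndimage import binary_closing, label

segments = np.zeros((8, 8), dtype=bool)
segments[0, 0:3] = True  # dike piece along the tile border
segments[0, 4:8] = True  # second piece, separated by a one-pixel gap
segments[5, 5] = True    # isolated noise in the interior

# dilation followed by erosion bridges the small gap at (0, 3)
closed = binary_closing(segments, structure=np.ones((1, 3)))

# discard connected components that never touch the image border
labeled, _ = label(closed)
on_border = set(labeled[0, :]) | set(labeled[-1, :]) | set(labeled[:, 0]) | set(labeled[:, -1])
keep = np.isin(labeled, [lab for lab in on_border if lab != 0])
```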

While the resulting binary image identifies dike segments, it is neither a useful nor a space-efficient format, as the image includes very large irrelevant areas around the dike segments. Hence, the next step is to extract the dike segment contours as ordered lists of points, which are then transformed in the same way as described in Section 5.2 to obtain a GeoPandas dataframe containing all identified polygons.

5.4. Dike Connectivity Analysis

The potential dike segments obtained by the process described in the previous section rely solely on detection within a limited area of 5 by 5 kilometers. However, the Dutch dikes form a large connected system of closed dike rings and long dike lines along rivers. Therefore, our next step is to raise dike detection to a higher level. Based on the collected polygons from the previous stage, we create a geometric network [22], a graph structure consisting of nodes representing geometric objects in space (in our case individual polygons) and edges describing their spatial relationships. If two polygons intersect or touch each other, they are connected in the graph. This data structure allows us to identify connected components that most probably belong to a larger system of dikes, and to remove isolated segments.

Working at tile level, the graph can be obtained by applying a spatial join of the GeoPandas dataframe with itself, which yields all pairs of intersecting polygons. Due to the filtering of irrelevant segments and the closing of gaps between adjacent segments in the previous stage, the number of polygons collected from all blocks is relatively small, and thus GeoPandas' spatial dataframe queries are preferable to Spark's distributed operations. The resulting joined dataframe contains all the edge information required to create the geometric network using Python's networkx. Finally, a connected-components algorithm is applied to obtain sets of connected segments, which are then combined into single polygon objects. Again, polygons that do not have contact with the tile border are discarded to further reduce the number of false positives.
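A minimal sketch of this step, with toy rectangles standing in for the detected dike polygons; the spatial self-join produces the edge list, and networkx finds the connected components:

```python
import geopandas as gpd
import networkx as nx
from shapely.geometry import box
from shapely.ops import unary_union

# two overlapping segments and one isolated segment
gdf = gpd.GeoDataFrame(geometry=[box(0, 0, 2, 1), box(1.5, 0, 4, 1), box(10, 10, 11, 11)])

# spatial self-join: every row is a pair of intersecting polygons
pairs = gpd.sjoin(gdf, gdf, predicate="intersects")

graph = nx.Graph()
graph.add_nodes_from(gdf.index)
graph.add_edges_from(zip(pairs.index, pairs["index_right"]))

# each connected component becomes one merged dike system
systems = [unary_union(list(gdf.geometry.loc[list(c)])) for c in nx.connected_components(graph)]
```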

As a last step, the output from all individual tiles is combined by applying the described connectivity analysis once more, this time to the overall set of segments obtained from the tile-level analysis. This allows us to filter out further isolated segments that probably do not belong to any larger dike system, and to merge adjacent segments into single components.

The final set of detected dike segments is written to GeoJSON.

5.5. Visualisation

The visualisation of the generated DEM and dike segments was created by overlaying our data on top of a map of The Netherlands. To this end we utilised the Mapbox service [23].

Mapbox provides a powerful map API that allows its users to perform a wide range of tasks, from simple data visualisation, as in our use case, to more complicated navigation and live data interactions.

The visualisation procedure is straightforward: we initialised a Mapbox map centred around The Netherlands and set the co-ordinate system to the Amersfoort co-ordinate system, as this is how our dataset is indexed. Following this, the DEM data in image format was added to the map's source data using Mapbox.addSource(). The position of our DEM was encoded as bounding co-ordinates (xmin, ymin, xmax, ymax) in the file name. Once the source is loaded into the map object it can easily be overlaid using Mapbox.addLayer(). The dike overlay as well as the overlay of flood plains follow this approach. In contrast to the DEM in PNG format, co-ordinate information for these polygon objects is encoded directly in the GeoJSON objects. We allow the user to freely combine the different layers by switching individual components on and off, which facilitates the exploration of the Dutch topography as well as of different flooding scenarios.

The polygon representations of both the dikes and the vector DEM plains facilitate fast loading of the data, as their spatial information is compressed into only a few contour points. To allow for smooth interaction with the layered map we refrained from using our high-resolution tile-based raster DEM and instead use an adjusted version at slightly lower resolution in which the tiles are merged into one single DEM.

6. Evaluation and Experiments

Due to the manifold technical obstacles arising during this project, the focus of our work lies on the design and implementation of the data processing pipeline and the generation of the targeted data products. Therefore, the extent of experiments performed on the final output is limited. With this in mind, this section focuses on the validation of our results against reference data. For both the DEM and the dikes we provide a side-by-side comparison of our output with the reference. Finally, preliminary experiments with our data products are presented and discussed.

6.1. DEM Evaluation

In order to evaluate the generated DEM we performed a visual comparison of the generated elevation images against a reference elevation map of The Netherlands provided directly by AHN.

Figure 4 shows a comparison between the DEM generated by our pipeline (bottom) and the reference elevation map (top). From both comparisons, at tile level and at country level, we can see that the overall quality of the generated DEM is high. While our DEMs do not capture the same level of height transitions, our intention was to create a coarse-grained elevation map, and as such our goal was effectively accomplished.

Figure 4 Reference elevation map (top) compared against our generated elevation map (bottom)

6.2. Dike Detection Evaluation

To evaluate dike detection, reference dike data was acquired from the Dutch Rijkswaterstaat [24], a government organisation dealing with the design, construction and management of infrastructure facilities within The Netherlands. The reference data comes in the form of a shape file and consists of connected line segments representing the dikes. In order to achieve an accurate comparison, the reference data was segmented into tiles according to the AHN2 tile structure.

First, a qualitative evaluation was performed. For the sake of brevity, this procedure is based on taking a random sample of the generated tiles and comparing them with the reference tiles. Figure 5 shows what we believe to be adequate representations of the best (top), average (middle) and worst (bottom) case of true positive outputs from our detection algorithm. Throughout, the difference lies mainly in the number of erroneously detected dikes introduced by irrelevant elevated features surrounding the dikes, which is understandable considering that our algorithm is based on height detection. In the worst case, however, besides a lot of noise we also see a loss of dike information. This could potentially be alleviated by decreasing the minimum height threshold for dike classification, but that would, as a side effect, introduce further false positives into an already noisy result.



Figure 5 Best (top), average (middle) and worst (bottom) case outputs of our dike detection algorithm

Second, we provide a quantitative reliability assessment of our dike detection framework. More specifically, the number of detected dike segments overlapping with the reference dikes (TP_detected) as well as the number of reference dike segments that overlap with our detected dikes (TP_ref) are determined and compared to the respective total counts (N_detected, N_ref). With this we define the following two measures of classification accuracy: precision describes the fraction of detected segments that actually match true dikes, and miss rate gives the fraction of reference segments that could not be detected by our classification method. Note that for these measures we consider individual segments before the application of connectivity analysis. The equations and the actual results are as follows:

precision = TP_detected / N_detected = 147 / 206 = 71.36%    (1)

miss rate = (N_ref - TP_ref) / N_ref = 1051 / 1845 = 56.96%    (2)
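As a sanity check, both measures can be recomputed directly from the reported segment counts:

```python
tp_detected, n_detected = 147, 206  # detected segments matching a reference dike / total detected
n_ref, missed = 1845, 1051          # reference segments in total / without a matching detection

precision = tp_detected / n_detected
miss_rate = missed / n_ref
print(f"{precision:.2%}, {miss_rate:.2%}")  # 71.36%, 56.96%
```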

In line with the visual examination, the precision of 71.36% shows that the majority of our dike segments have been classified correctly, while there is still a significant number of cases where other elevated features such as bridges or roads are wrongly identified as dikes in areas where there are no dikes whatsoever. Apart from that, the high miss rate of 56.96% shows that our approach clearly suffers from missing a large number of true dike segments.

A visual overall comparison between the reference dike systems and our results supports the outcomes of the accuracy evaluation. Figure 6 shows that our dike detection framework is able to capture the rough layout of the Dutch dike systems. In particular, sea dikes along the west coast as well as in the north and around the IJsselmeer could be detected with pleasing reliability. However, in line with the measured miss rate, Figure 6 also illustrates weaknesses of our approach: dike lines tend to be fragmented, with a significant number of segments missing in between. Moreover, we see that most false positive detections occur in urban areas. Another significant share of false detections appears as straight lines at tile borders, where smoothing and edge detection are obstructed by missing halo points from neighbouring tiles. While applying the overall connectivity analysis reduces the number of false positives, it also results in a loss of true dike segments due to fragmentation.

6.3. Experiments

In a preliminary experiment we examine which regions of the Netherlands would be flooded if there were no protection through dikes at all. Based on the generated vector DEM data, we identify all areas lying below sea level and present them in Figure 7. Additionally, areas at height levels up to 3 meters above sea level are visualised in steps of 0.5 meters.
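The selection and area summation behind this experiment can be sketched as follows; the `height` column name and the toy polygons are assumptions, with areas in square meters because the Amersfoort CRS uses metric units.

```python
import geopandas as gpd
from shapely.geometry import box

# toy vector DEM: one polygon per height level (heights in meters)
dem = gpd.GeoDataFrame(
    {"height": [-2.0, -0.5, 1.0]},
    geometry=[box(0, 0, 1000, 1000), box(2000, 0, 2500, 500), box(0, 2000, 100, 2100)],
    crs="EPSG:28992",
)

def flooded_area_km2(dem, sea_level=0.0):
    """Sum the area of all polygons lying below the given sea level."""
    below = dem[dem["height"] < sea_level]
    return below.geometry.area.sum() / 1e6  # m^2 -> km^2

print(flooded_area_km2(dem))  # 1.25
```

Raising `sea_level` in this helper mirrors the sea-level-rise scenarios discussed below.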

From the figure it becomes very clear that the Netherlands relies heavily on its dike systems. Without this protection, all areas below sea level that are in some way connected to the sea would be flooded. This applies especially to the surroundings of the IJsselmeer, comprising the provinces of North Holland, West Friesland and Flevoland, as well as to South Holland. From our polygon model we estimate that 9213 km2 would be affected by such a catastrophe. Interestingly, a rising sea level only leads to a moderate increase in the flooded area: our model predicts an area of 11981 km2 being flooded if the sea level rises by 0.5 meters, and 14773 km2 for a rise of 1 meter.

Figure 6 Overall comparison between reference dike data (grey) and our results (blue). Left: without overall connectivity analysis. Right: only connected components comprising at least 3 segments.

Figure 7 Critical areas that lie below (dark red) or up to 3 meters above sea level (lighter red).

7. Conclusions

To sum up, the project was continuously driven by challenges posed by the massive size of the dataset. Many powerful tools could not be applied because they do not support the LAS/LAZ format and/or are not compatible with the Surfsara cluster configuration. However, we were able to overcome these challenges by developing a distributed processing pipeline that combines the potential of Apache Spark's resilient distributed datasets with the algorithmic strengths of standalone libraries such as Python's scikit-image and geopandas.

Based on this technical approach, we successfully generated an image-based raster DEM of the entire Netherlands which provides coarse-grained but reliable information about regions of equal height level, filtering out irrelevant non-ground features such as buildings and vegetation. The same framework was applied to obtain additional vector-based information about these areas of equal height, which proved to be a space-efficient representation that can immediately be used for preliminary flood analysis.

From the analysis perspective, dike detection turned out to be the most challenging part of this research. The high resolution of the raw data, including lots of irrelevant information, together with its unfeasible volume made it difficult to reduce the data to a workable amount that still includes the small-scale dike details needed for automatic classification. Our edge detection approach was able to reliably detect dominant sea dikes where the surrounding terrain is relatively homogeneous. However, smaller river dikes in more urban areas remain a challenge due to distorting features such as elevated roads, bridges and large constructions. In these cases, finding appropriate parameter values for Gaussian smoothing and thresholding is non-trivial and needs further optimisation.

Flooding analysis had to be limited to preliminary flood plain prediction, as more sophisticated simulations based on dike breaches would require not only highly accurate detection of fully connected dike systems but also a deeper study of flood dynamics.

Considering the discussed challenges, this project has great potential for further research. Future work might focus on three different aspects. First, special effort should be made to build tools for distributed point-cloud processing, and the implementation should be moved to an up-to-date cluster architecture which facilitates the full functionality of geospatial Spark libraries like GeoTrellis and GeoSpark. Second, more time should be spent on optimising the automatic detection of dike systems; intelligent parameter tuning and the incorporation of machine learning based on local point features, or of deep learning approaches, are conceivable. And third, with improved outcomes from the first two aspects, it is desirable to apply true flood simulations to the geospatial models, including water flow modelling, the susceptibility of dikes and more. Insights from such studies are expected to be vital for the Dutch and the existence of their country, as improvements in flood defence rely more and more on high-throughput computational simulations.

8. References

[1] M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. HotCloud, 10(10-10):95, 2010.

[2] J. Van Alphen. The delta programme in the Netherlands: a long-term perspective on flood risk management. In Floods: from risk to opportunity. Proceedings of the 5th International Conference on Flood Management, Tokyo, Japan, pages 27–29, 2011.

[3] H. Most and M. Wehrung. Project Floris - flood risks and safety in the Netherlands: flooding in the Netherlands - probabilities and consequences, 2003.

[4] H. Most and M. Wehrung. Dealing with uncertainty in flood risk assessment of dike rings in the Netherlands. Natural Hazards, 36:191–206, 2005.

[5] G. Priestnall, J. Jaafar, and A. Duncan. Extracting urban features from LiDAR-derived digital surface models. Computers, Environment and Urban Systems, 24:65–78, 2000.

[6] T.G. Thomas and J.J.R. Williams. Large eddy simulation of turbulent flow in an asymmetric compound open channel. Journal of Hydraulic Research, 33:27–41, 1995.

[7] X. Meng, N. Currit, and K. Zhao. Ground filtering algorithms for airborne LiDAR data: A review of critical issues. Remote Sensing, 2(3):833–860, 2010.

[8] N. Pfeifer, A. Kostli, and K. Kraus. Interpolation and filtering of laser scanner data - implementation and first results. International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 32 (part 3/2):153–159, 1998.

[9] G. Vosselman. Slope based filtering of laser altimetry data. International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 33 (part B3/2):935–934, 2005.

[10] J. Kilian, N. Haala, and M. Englich. Capture and evaluation of airborne laser scanner data. International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 31 (part B3):383–388, 1996.

[11] E.S. Anderson, J.A. Thompson, and R.E. Austin. LiDAR density and linear interpolator effects on elevation estimates. International Journal of Remote Sensing, 26(18):3889–3900, 2005.

[12] Y. Bao et al. Classification of LiDAR point cloud and generation of DTM from LiDAR height and intensity data in forested area. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 37 B3/B:313–318, 2008.

[13] K. Liu and J. Boehm. Classification of big point cloud data using cloud computing. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XL-3/W3:553–557, 2015.

[14] J. Boehm, K. Liu, and C. Alis. Side loading – ingestion of large point clouds into the Apache Spark big data engine. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLI-B2:343–348, 2016.

[15] C. Wang, F. Hu, D. Sha, and X. Han. Efficient LiDAR point cloud data managing and processing in a Hadoop-based distributed framework. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, IV-4/W2:121–124, 2017.

[16] EPSG.io. Coordinate systems worldwide. Coordinate reference systems for "Netherlands". https://epsg.io/?q=Netherlands.

[17] LAS specification version 1.2. Technical report, ASPRS, 2008.

[18] LAStools. https://rapidlasso.com/LAStools/, Oct 2018.

[19] R.C. Gonzalez and R.E. Woods. Digital Image Processing. Prentice Hall, 2nd edition, 2002.

[20] L. Van Nieuwenhuijze. The challenge of flood risk management, 2015.

[21] J. Serra. Image Analysis and Mathematical Morphology. Academic Press, 1982.

[22] M. Barthelemy. Spatial networks. Physics Reports, 499(1-3):1–101, 2011.

[23] Mapbox. https://www.mapbox.com/.

[24] Rijkswaterstaat. https://www.rijkswaterstaat.nl/, Oct 2018.



Appendix A Work Distribution

Table 1 Who worked on which parts of this project?

Preprocessing      Maki Gradecak
DEM Generation     Fiona Lippert
Dike Detection     Fiona Lippert
Visualization      Maki Gradecak and Peter Petkanic
Report             Maki Gradecak and Fiona Lippert
