Distributed Viewshed Analysispublications.lib.chalmers.se/records/fulltext/238036/238036.pdf · Viewshed analysis can be used for applications other than human sight, such as determining

Distributed Viewshed AnalysisAn Evaluation of Distribution Frameworksfor Geospatial Information Systems

EMIL JOHANSSONJACOB LUNDBERG

Department of Computer Science and EngineeringCHALMERS UNIVERSITY OF TECHNOLOGY AND UNIVERSITY OF GOTHENBURGGothenburg, Sweden 2016

Master’s thesis 2016

Distributed Viewshed Analysis

An Evaluation of Distribution Frameworksfor Geospatial Information Systems

EMIL JOHANSSONJACOB LUNDBERG

Department of Computer Science and EngineeringChalmers University of Technology and University of Gothenburg

Gothenburg, Sweden 2016

Distributed Viewshed AnalysisAn Evaluation of Distribution Frameworksfor Geospatial Information SystemsEMIL JOHANSSONJACOB LUNDBERG

© 2016 EMIL JOHANSSON, JACOB LUNDBERG.

Academic supervisors: Birgit Grohe

Department of Computer Science and Engineering

K.V.S Prasad


Industry supervisor: Calle Hanson

Carmenta AB

Examiner: Graham Kemp


Master’s Thesis 2016 Department of Computer Science and EngineeringChalmers University of Technology and University of GothenburgSE-412 96 GothenburgTelephone +46 31 772 1000

Cover: A viewshed analysis over an area at Lake Tahoe, CA, USA. The eye denotesthe position of the observer and the areas visible are marked in blue.

Gothenburg, Sweden 2016

iii

Distributed Viewshed AnalysisAn Evaluation of Distribution Frameworksfor Geospatial Information SystemsEMIL JOHANSSONJACOB LUNDBERGDepartment of Computer Science and EngineeringChalmers University of Technology and University of Gothenburg

Abstract

Viewshed analysis is the process of computing what areas of a terrain are visible froma certain observation point. In this thesis we evaluated the performance of thesecomputations on cloud clusters using the distribution framework Apache Spark.We implemented three commonly used viewshed algorithms; R3 which is slow buthighly accurate as well as R2 and van Kreveld which are faster but less accurate.Two versions of each algorithm were implemented, one to run on a single multi-coremachine and one to run on a server cluster using Spark. We compared the accuracyand running time of the different algorithms in order to determine when to use thedifferent algorithms. Our results show that viewshed analysis does not perform wellwhen implemented using Spark if real-time results are required. In fact the fasteralgorithms performed consistently worse on the cluster, even for very large inputdata. For the highly accurate, but slow, R3 algorithm we were able to achieve a1.6x speedup using the distribution framework.

Keywords: viewshed, GIS, distributed, cluster, line-of-sight, Apache Spark

iv

Acknowledgements

We would like to thank our supervisors from Chalmers, Birgit Grohe and K.V.SPrasad for their feedback and fruitful meetings. We would also like to thank Car-menta and the people working there for making us feel welcome and for supportingus trough our project, with a special thanks to our industry supervisor Calle Hanson.

Emil Johansson, Jacob Lundberg, Gothenburg, June 2016

vi

Contents

List of Figures x

List of Tables xiii

1 Introduction 11.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.3 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.4 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.5 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.6 Overview of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Theory 52.1 Digital Elevation Model . . . . . . . . . . . . . . . . . . . . . . . . . 52.2 Line of Sight . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.3 Viewshed Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.4 Viewshed Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.4.1 R3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.4.2 R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112.4.3 Van Kreveld . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.5 Framework: Apache Spark . . . . . . . . . . . . . . . . . . . . . . . . 15

3 Implementation 183.1 R3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183.2 R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203.3 Van Kreveld . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4 Results 234.1 Correctness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234.2 Accuracy Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244.3 Timing Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4.3.1 Locally . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264.3.2 Distributed . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

5 Discussion 335.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

viii

Contents

5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

Bibliography 35

A Definitions I

ix

List of Figures

1.1 A viewshed analysis over an area at Lake Tahoe, CA, USA. The eyedenotes the position of the observer and the visible areas are markedin blue. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2.1 A raster is a two-dimensional matrix, that divides a map in cells. Asample map (a) is overlayed with a raster (b), creating a rasterizedmap (c). The higher the resolution in the raster, the more rows andcolumns in the matrix resulting in more and smaller cells, as shownin (d). Every cell in the matrix represents a subarea of the map andthe height information for this subarea is stored in the correspondingcell in the raster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2 Side view of a LoS between observer O and points A, B and C. PointA is visible as there is no part of the terrain between O and A thatobstructs the ray from O to point A, the same is true for point C.Point B is not visible as the ray between O and B is obstructed bythe hill located between the points. . . . . . . . . . . . . . . . . . . . 7

2.3 The elevations contained in a raster-based DEM do not consider thecurvature of the earth. When performing viewshed analysis, a cor-rection value, δE, needs to be subtracted from each elevation valuein the DEM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.4 Viewshed: A LoS is calculated from the observer O to all the pointsin the area. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.5 The impact of DEM raster resolution on viewsheds calculated fromthe center of an area of interest, marked by an eye. The darkerareas represent rectangular prisms placed on a flat surface and thediagonally striped cells are considered visible. . . . . . . . . . . . . . 10

2.6 A LoS from (0,0) to (4,2), the dot is the target point for which to cal-culate visibility. The circles mark the x-crossings where height valueswill be sampled to use in calculation of the target point’s visibility. . 11

2.7 An area-of-interest divided into eight octants. . . . . . . . . . . . . . 112.8 The R2 algorithm calculates a LoS to points A and B, and stores LoS

height information on all x-crossings. Point C’s LoS height will bedetermined by the LoS that pass closest to point C. In this examplethe LoS to point A will determine the LoS height of point C, as dA ≤ dB. 12

x

List of Figures

2.9 van Krevelds algorithm uses a sweep line, that rotates around theobserver and calculates the visibility of a cell when it passes over thecell’s center. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.10 An example binary search three that is used in the van Kreveld-algorithm. α is the angle between the cell and the observer and αmax

is the highest angle on any cell that are positioned between the celland the observer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.11 Three different events for a point, where a sweep-line rotates counter-clockwise around the observer. Line A shows the sweep-line where itenters the point, line B shows the position of the sweep-line duringthe center-event, and the last line, line C, shows the exit-event. . . . . 14

2.12 The area contained between the two angles, A1 and A2, representsone part of the whole viewshed-analysis in the parallelized van Krev-eld algorithm. This part can be calculated independently, thereforemaking it possible to parallelize the algorithm. . . . . . . . . . . . . . 15

2.13 A sample configuration of a Spark cluster with five nodes. . . . . . . 162.14 The components of a Spark application. . . . . . . . . . . . . . . . . 17

3.1 Interpolation to get an approximation of the height of a theoreticalpoint that lies between point y1 and y2. . . . . . . . . . . . . . . . . . 19

3.2 The visibility of point (1,1), denoted by the dot, will be calculated byseveral LoS. The circles denote the points along the LoS which willapproximate (1,1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.3 While an approximation of the height of point (1,1) will be calculatedby LoS(2,3) it will not determine the visibility of point (1,1) sinceLoS(3,3) is closer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.4 A raster where the gray area represents the cells on the edge of thearea-of-interest. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4.1 Comparison between our implementation of the R3 algorithm and thealgorithm used at Carmenta. . . . . . . . . . . . . . . . . . . . . . . . 24

4.2 Differences when comparing the R2 and Kreveld algorithms to the R3algorithm on a 2001x2001 raster DEM representing the city of Paris,France. A negative value indicates that the R3 algorithm reports ahigher LoS height than the other algorithms, and vice versa. Valuesclose to zero means that the two compared algorithms agree on theresults. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.3 Running time when calculating a viewshed with the R2 algorithmlocally, on a raster of size 16001 x 16001, with different number ofthreads. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.4 Comparison when running the R3 algorithm with different number ofthreads. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.5 Comparison when running the R2 algorithm with different number ofthreads. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.6 Comparison when running the van Kreveld algorithm with differentnumber of threads. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.7 Comparison of memory consumption of the three algorithms . . . . . 29

xi

List of Figures

4.8 Comparison of running times of the R2 algorithm on a 16001x16001raster using Spark with varying amounts of executors per worker node. 30

4.9 Comparison of running times of the R3 algorithm, between a singlemulti-core CPU and a cluster using Spark . . . . . . . . . . . . . . . 31

4.10 Comparison of running times of the R2 algorithm, between a singlemulti-core CPU and a cluster using Spark. . . . . . . . . . . . . . . . 32

4.11 Comparison of running times of the van Kreveld algorithm, betweena single multi-core CPU and a cluster using Spark. . . . . . . . . . . 32

xii

List of Tables

4.1 Occurrences of differences in reported LoS height values when com-pared to the R3 algorithm. . . . . . . . . . . . . . . . . . . . . . . . . 25

4.2 Point-by-point comparisons of the visibility results of the R2 andvan Kreveld algorithms to R3’s results. The matching points are thenumber of points where the algorithms agree with R3 on visibility. . . 26

4.3 Error statistics when comparing the results of the R2 and van Kreveldalgorithms to R3’s results using a 2001x2001 raster. . . . . . . . . . . 26

xiii

1Introduction

1.1 Background

Geospatial Information Systems (GIS) are used to capture, analyze and presentdata with geographic or spatial components. With the trend of mobile and web ap-plications becoming more prevalent, the underlying hardware (e.g. mobile phones,tablets, light-weight web servers) used to access these systems might not provide suf-ficient computational capabilities required by GIS. It can therefore be advantageousto be able to distribute the more computationally heavy procedures to a machinewith more capable hardware or to a distributed cluster of servers.

One such computationally heavy procedure is viewshed analysis, the goal of whichis to calculate visible subareas for a particular area of terrain using elevation data.The set of visible points is called a viewshed. Figure 1.1 shows a viewshed analysisover an area at Lake Tahoe, CA, USA.

In this thesis we perform an evaluation on the feasibility of distributing GIS com-putations by distributing viewshed analysis from a single server to a server cluster.This evaluation is performed on behalf of Carmenta1, a company specializing inGIS. The server cluster is built using existing frameworks and hosted on a cloud-computing service. Distributing computations in this manner is not new in and ofitself, which is why we use existing frameworks rather than designing somethingcompletely new.

1.2 Problem Description

The aim of this thesis is two-fold. The first part consists of implementing differentalgorithms for calculating viewsheds in a distributed environment. The algorithmsused are described in Section 2.4. The idea is to examine the feasibility of running

1http://www.carmenta.com

1

1. Introduction

Figure 1.1: A viewshed analysis over an area at Lake Tahoe, CA, USA.The eye denotes the position of the observer and the visible areas are markedin blue.

viewshed analysis on a cluster by using existing frameworks. Some questions we an-swer are what kind of speed-up, if any, one can expect; how the cluster size (numberof nodes) affects the running-time of the viewshed analysis and what changes onemight need to make, in order to reduce potential overhead incurred by a distributedsolution. The second part is focused on the overall usability of distribution and thechosen frameworks. What would be a suitable application for using a cluster? Whatdetermines whether an application is suitable for distribution? What are some ofthe aspects one should take into consideration when deciding to use distribution onclusters? To answer these questions we need to run timing tests to be able to inves-tigate the feasibility of using distribution, as well as gather information regardingthe suitable size of a cluster. We also need to run tests comparing the visibilityresults of the different algorithms, in order to compare their accuracy.

1.3 Previous Work

Ware, Kidner and Rallings [1] anticipated that the demand for parallel processingwhen performing geospatial analysis would increase. This idea has been confirmedby more recent work, for instance by Thai and Olasz [2] and Abdul, Potbar andChahaun [3]. They all agree that this form of distribution is a worthwhile methodfor increasing the performance of GIS calculations.

According to Tabik et al. [4], performing viewshed analysis is highly useful in alarge number of applications, civilian as well as military. It is also well suited forparallelisation as lines-of-sight can be evaluated independently inside an area-of-

2

1. Introduction

interest [5]. Furthermore, since changes in terrain happen slowly, elevation data isfairly static. This means that replicating the elevation data to several distributednodes is straightforward and does not incur the high messaging overhead commonlyfound in distributed applications. This makes viewshed analysis a good candidatefor distribution.

Axell and Fridén [6] describe in their Master’s thesis (also performed at Carmenta)how they were able to obtain faster computations for performing viewshed analysisby utilising GPU parallelisation, compared to a multi-core CPU.

Our thesis examines potential increases in performance when using distributedframeworks and cluster computing. These performance increases are especially use-ful when concurrently performing multiple viewshed analyses on the same view sinceeven a small reduction in execution time will be amplified over multiple executions.Example use cases where this is applicable is calculating a route that is the leastvisible from a number of predetermined observation positions or finding the place-ment of a given number of observers that yields the optimal visibility coverage of aparticular area.

1.4 Goals

Our goals with this thesis are to implement algorithms for viewshed analysis withthe distribution framework Apache Spark. We also need to measure the accuracyof the implemented algorithms, making sure that they achieve adequate results. Inorder to compare our distributed solution to an implementation on a multi-coreCPU, we need to measure the computation times of running viewshed analyses onboth solutions. The last step is to draw conclusions regarding the measured overheadincurred by the distribution framework as well as the size of the cluster and also moregeneral conclusions on the usability of distribution frameworks in GIS applications.

1.5 Limitations

The height data used in all benchmarking will be a raster representation of a DigitalElevation Model (DEM) [7]. A raster is a grid of cells, like a matrix, with elevationdata stored in each cell. Other formats for storing the height data are not consideredin this thesis.

There are several different algorithms for performing viewshed analysis, in this thesiswe consider three of the most commonly used ones: R2, R3 and van Kreveld. An-other widely used algorithm is the Xdraw algorithm described by Franklin et al. [8].Xdraw is among the faster viewshed algorithms but is too inaccurate. According

3

1. Introduction

to Franklin et al. it suffers from a problem where points could influence visibilityresults even though they should not [8]. Therefore, we choose to not consider it inthis thesis in favour of algorithms with higher degrees of accuracy.

Viewshed analysis can be used for applications other than human sight, such asdetermining where a radio signal emitted from a certain point can be received.These other applications might introduce new considerations for the analysis. Inthe radio example, the analysis needs to take into account the diffraction of theradio waves. In this thesis we will only consider viewshed analysis in the context ofhuman sight.

We will not investigate either security or fault tolerance in this thesis.

1.6 Overview of Thesis

The structure of this report is as follows: Chapter 2 contains theory about viewshedanalysis, as well as a review of the selected algorithms that will be used in thisthesis. A detailed description on how we implemented the different algorithms canbe found in Chapter 3. Chapter 4 contains the results we gathered, both in termsof accuracy and execution time. Benchmarking was done both locally and on thecluster, to be able to compare the results. These results are then discussed andanalysed in Chapter 5, which also contains conclusions regarding the feasibility ofusing clusters when performing viewshed analysis.

4

2Theory

This Chapter presents the theory behind viewshed analysis and the different al-gorithms that have been used in this thesis. Information about the distributionframework is also be provided.

2.1 Digital Elevation Model

A topographic map containing elevation data of an area is called a Digital ElevationModel (DEM), which is a 2.5-dimensional representation of an area. A DEM canbe represented with either vector or raster data, with raster-based DEMs being themost common way to represent elevation data [9].

Vector data is visualized with a Triangulated Irregular Network (TIN) [7], whichhas irregularly positioned triangles that represents a surface. The irregularity comesfrom the size and positions of the triangles. In an area with little or no differencein elevation, few triangles are needed to represent this area. Conversely, for an areawith large differences in elevation, more triangles are needed to sufficiently representthat area.

A raster is a grid of squares, where every cell in the grid represents a small subareaof the map, which is visualized in Figure 2.1. The value stored in a cell is the heightinformation for that subarea. The higher the resolution of the raster, the smaller asubarea represented by a cell will be. This means that a higher resolution will givea more accurate representation of the area. The downside is that the raster willrequire more memory to store, as the raster will contain more data points. Sincethe value stored in a cell represents the height of a small subarea it will not be anexact value for every physical point in a terrain (of which there are infinitely many).There are different methods to sample the terrain in order to decide the elevationfor a cell. For example, one could use the elevation of the point in the center of acell, an average over the whole subarea, the minimum elevation in the subarea, themaximum elevation in the subarea or the elevation of the subarea’s center. Whenperforming viewshed analysis one needs to consider the sampling method used by

5

2. Theory

different input data sets since different methods will provide different results.

(a) Sample map1 (b) Empty raster

(c) Resolution: 4x4 (d) Resolution: 8x8

Figure 2.1: A raster is a two-dimensional matrix, that divides a map incells. A sample map (a) is overlayed with a raster (b), creating a rasterizedmap (c). The higher the resolution in the raster, the more rows and columnsin the matrix resulting in more and smaller cells, as shown in (d). Every cellin the matrix represents a subarea of the map and the height informationfor this subarea is stored in the corresponding cell in the raster.

2.2 Line of Sight

Line-of-Sight (LoS) is a procedure where the goal is to calculate the visibility fromone point to another point. An example of LoS can be seen in Figure 2.2. Theprocedure can be visualized by creating a ray in one direction from the observer to

1Sample map taken from the public domain USGS Digital Raster Graphic file o44072d6.tif forthe Stowe quadrangle, VT, USA

6

2. Theory

its closest point, this point is always visible, as there are no points that can influenceits visibility. Another ray is created from the observer to the next closest point, ifthis ray is obstructed by any points that are closer to the observer the second pointis deemed not visible, otherwise the second point is visible. In Figure 2.2, point Ais visible as the ray created between that point and the observer is not obstructedby any points closer to the observer. Point B is not visible as its ray is obstructed.Point C is visible as its ray is not obstructed by any points between point C andthe observer.

O A

B

C

Figure 2.2: Side view of a LoS between observer O and points A, B andC. Point A is visible as there is no part of the terrain between O and A thatobstructs the ray from O to point A, the same is true for point C. Point Bis not visible as the ray between O and B is obstructed by the hill locatedbetween the points.

The check for obstructing rays gives a result in the form of a Boolean, either thecell is visible or not. By calculating the difference in height between a non-visiblepoint and the lowest height needed for that point to be visible, one can calculatethe necessary increase in height needed to make something located on that pointvisible. This can for example be used for determining if an aircraft flying at a certainaltitude above the terrain is visible from the observer.

The curvature of the earth needs to be considered when performing LoS calculationsusing raster DEMs since the rasters contain elevations that are relative to some fixedlevel, typically sea level, and treats this fixed level as if it were flat. This problemis exemplified in Figure 2.3, which shows how the target point of a LoS calculationis lowered relative to the observation point because of earth’s curvature. In orderto address this, the elevation of each point in the raster needs to be corrected bysubtracting a correction value based on the respective point’s distance from theobserver. A simple approximation of the correction value, δE, that works well for

7

2. Theory

distances significantly smaller than earth’s radius is

δE = D2O

2RE

(2.1)

where DO is the distance from the observation to the target point and RE is earth’seffective radius [8]. For areas-of-interest with maximum distances of a few hun-dred meters this correction can be omitted, since the difference in elevation will beinsignificant.

O

δE

Figure 2.3: The elevations contained in a raster-based DEM do not con-sider the curvature of the earth. When performing viewshed analysis, acorrection value, δE, needs to be subtracted from each elevation value inthe DEM.

2.3 Viewshed Analysis

A viewshed analysis calculates the visibility for all points in an area-of-interest. Oneway to perform this analysis is to calculate multiple LoS from a specific observationpoint, to all other points in the region of interest. The idea is visualized in Figure 2.4.A LoS is created between the observer and all other points in the region. All pointsare calculated independently to determine which points, in a specific area, are visible.This is a naive method, described further in Section 2.4.1.

The resolution of a raster DEM used to model an area of terrain will have an impacton any viewshed analysis performed on the DEM. The most obvious impact comesfrom the generation of the raster, since each cell can only contain one height value.The value will be an approximation of the area the cell covers. A higher resolutionraster DEM will be able to provide a closer approximation to the physical terrainsince each cell represents a smaller area.

Higher resolutions will also have an impact on the viewshed analysis even if each cellis simply split into a number of smaller cells with the same elevation as the originalcell. This can be seen in Figure 2.5, which shows viewsheds calculated with differentraster resolutions for an area of interest consisting of a surface with a number ofrectangular prisms of equal height capable of obstructing sight. As the resolutionincreases and each cell becomes smaller, the resulting viewshed is altered. Someareas that were not considered visible are now deemed visible and vice versa.

8

2. Theory

O

Figure 2.4: Viewshed: A LoS is calculated from the observer O to all thepoints in the area.

2.4 Viewshed Algorithms

There exist several different algorithms for performing viewshed analysis, each withtheir own trade-offs. This section aims to give a general idea of how the algorithmsused in this thesis work and a brief look at their respective strengths and weaknesses.

2.4.1 R3

The R3 algorithm, described by Franklin et al. [8], is straightforward and highlyaccurate with relatively low performance, as it scales poorly with the size of theraster. The time complexity is O(R3), where R is the side length of the area-of-interest measured in number of points. The idea of the algorithm is to calculate aLoS from the observation point to every other point in the region of interest. EachLoS is traversed from the observation point toward the target point, this conceptis visualized in Figure 2.6. How the algorithm traverses a LoS is determined bythe position of the target point relative to the observation point. Consider thepartitioning of an area-of-interest in Figure 2.7, if the target point is in octant I, IV,V or VIII the elevation is sampled at every whole step along the x-axis, called anx-crossing. For target points in octants II, III, VI and VII the elevation is sampledat every whole step along the y-axis, called a y-crossing. At each step, the slope ofa line from the observation point is calculated and the maximum slope encounteredthus far along the LoS, smax, is stored. When the LoS reaches the target point,the slope of a line from the observation point to the target point, st, is calculatedand is compared to smax. The target point is visible from the observation point iffst > smax.

Since the x-crossings or y-crossings of a LoS will often not fall directly on points in

9

2. Theory

(a) Area of interest(b) 3D View

(c) 5x5 raster, 75% visibility (d) 15x15 raster, 68% visibility

(e) 25x25 raster, 65% visibility (f) 45x45 raster, 62% visibility

Figure 2.5: The impact of DEM raster resolution on viewsheds calculatedfrom the center of an area of interest, marked by an eye. The darker areasrepresent rectangular prisms placed on a flat surface and the diagonallystriped cells are considered visible.

the raster, the elevation at the crossings needs to be estimated based on the heightsof the points on the intersected line closest to the intersection. There are a numberof different ways these heights can be estimated with each being useful in differentuse cases. For instance, if it is important to avoid false positives for visibility, i.e.that a point is marked as visible when it actually is not, an appropriate estimate isto use the maximum height of the two neighboring points.

10

2. Theory

x axis(0,0)

y axis

(4,2)

Figure 2.6: A LoS from (0,0) to (4,2), the dot is the target point for whichto calculate visibility. The circles mark the x-crossings where height valueswill be sampled to use in calculation of the target point’s visibility.

O

I

IIIII

IV

V

VI VII

VIII

Figure 2.7: An area-of-interest divided into eight octants.

2.4.2 R2

In an effort to reduce the complexity of the R3 algorithm Franklin et al. developedanother algorithm, called R2 [8]. The R2 algorithm aims to provide a way to calcu-late a viewshed in quadratic complexity, O(R2), where R is the side length of thearea-of-interest measured in number of points. The algorithm will calculate a LoSto points that are on the edge of the viewshed, e.g. point A and B in Figure 2.8,while also calculating the LoS height information for every x-crossing, or y-crossingdepending on which quadrant the edge point belongs to, between the observer andthe target point. This height information will be used to calculate the visibility of

11

2. Theory

intermediate points, i.e. point C in Figure 2.8, but only if the distance from theLoS to point C is smaller than the distance of the neighboring LoS to point C. Iftwo LoS have the same distance from a crossing to a point to be calculated, bothLoS will calculate the visibility of that point. These conflicts can be handled in dif-ferent ways, for example to calculate the average or letting each consecutive resultoverwrite the previous.

x axis(0,0)

y axis

A

BCdA

dB

Figure 2.8: The R2 algorithm calculates a LoS to points A and B, andstores LoS height information on all x-crossings. Point C’s LoS height willbe determined by the LoS that pass closest to point C. In this example theLoS to point A will determine the LoS height of point C, as dA ≤ dB.

2.4.3 Van Kreveld

The algorithm introduced by van Kreveld [10] uses a line that sweeps over an areawhere a viewshed is to be calculated, as can be seen in Figure 2.9. This is combinedwith a self-balancing binary search tree (BST) used to keep track of which cells thesweep line currently intersects. The BST (which can be of any type, such as AVLor Red-Black), is sorted by the horizontal distance between a cell and the observer.An example of a BST used by the van Kreveld algorithm can be seen in Figure 2.10.By sorting in this manner, the cell in the left-most leaf in the tree is closest tothe observer (smallest distance) and the right-most cell is furthest away (largestdistance). All cells have three specific events connected to them: an enter-event,which is when the sweep line first enters a cell; a center-event, which is when thesweep-line intersects the center of the cell; and an exit-event, which is when thesweep line exits the cell. These events are visualized in Figure 2.11. When thesweep line encounters an enter-event it will insert the corresponding cell into to the

12

2. Theory

O

Sweep line

Figure 2.9: van Krevelds algorithm uses a sweep line, that rotates aroundthe observer and calculates the visibility of a cell when it passes over thecell’s center.

BST, if the BST is empty the new cell will become the root node. If the event is anexit-event the cell will be removed from the BST.

The last event, the center-event, is when the sweep-line is on the center of a celland this is the event where the actual visibility computation occurs. Every nodein the tree also stores information regarding the maximum angle that any node hascalculated from its subtree. The angle for the current cell will be calculated andcompared with the maximum value, retrieved from the tree, from nodes that have alower distance to the observer than the current cell, i.e. the cells that lie between theobserver and the current cell. If the current cell’s angle is higher than the maximumvalue from the tree, the cell is deemed visible, otherwise it is not. If the cell is notdeemed visible, a calculation is performed to gather how much higher the cell needsto be, to be considered visible. Pseudocode for how the algorithm works can be seenin Listing 2.1.

Listing 2.1: Van Kreveld-algorithm// Ca lcu la te ang le f o r ente r / cente r / ex i t−events// For a l l po in t s add events to pr ior i tyQueue

While (pQ not empty ) :take head o f queuecheck event :

Enter : add po int to t r e eExit : remove po int from t r e eCenter : Check i f po int i s v i s i b l e

13

2. Theory

50

40

30 45

Closer to obs

80

70 90

Further from obs

α = 1.3

αmax = 1.3

αmax = 1.3

α = 0.9

Figure 2.10: An example binary search three that is used in the vanKreveld-algorithm. α is the angle between the cell and the observer andαmax is the highest angle on any cell that are positioned between the celland the observer.

A

O

Enter

(a)O

B

Center

(b)

O

C

Exit

(c)

Figure 2.11: Three different events for a point, where a sweep-line rotatescounter-clockwise around the observer. Line A shows the sweep-line whereit enters the point, line B shows the position of the sweep-line during thecenter-event, and the last line, line C, shows the exit-event.

Ferreira et al. describe in their paper how to parallelize the van Kreveld algo-rithm [11]. After creating a representation of the cells with calculated values forthe different events, one can divide the list of all enter events and parallelize thecalculation on different threads. As can be seen in Figure 2.12, a part of the raster iscontained between the two angles A1 and A2. Cells whose enter-event or exit-eventare between these two angles will be considered one part of the algorithm and canbe calculated independently. The list of these cells will be computed by one thread,creating a separate priority queue and binary search tree for that part.

14

2. Theory

x axis(0,0)

y axis

A1

A2

Figure 2.12: The area contained between the two angles, A1 and A2,represents one part of the whole viewshed-analysis in the parallelized vanKreveld algorithm. This part can be calculated independently, thereforemaking it possible to parallelize the algorithm.

2.5 Framework: Apache Spark

To distribute the calculations of the different viewshed algorithms we used the dis-tribution framework Apache Spark. Spark is described as a "fast and general en-gine for large-scale data processing" [12]. Spark uses Resilient Distributed Datasets(RDDs) [13] which is an immutable dataset, that Spark distributes to the worker-nodes. Spark can perform two types of operations on an RDD, transformationsand actions. Transformations are when functions are applied to each element in anRDD. Due to the partitioning feature of the RDDs, transformations can be appliedin parallel. The other operation that Spark can perform on RDDs are called actions.Actions on RDDs will return a value, for instance the count action will return thenumber of entries that exist in a RDD. Spark uses lazy evaluation, meaning thattransformations will not be performed until an action is triggered. Transformationsand actions are combined to create a Spark job.

One can achieve better performance from Spark by utilizing its configuration options,as well as optimizing how and where data is stored. Spark provides APIs for fourdifferent programming languages: Java, Scala, Python and R.

A cluster is needed to perform distributed computations using Spark, Figure 2.13shows an example of a Spark cluster configuration. Nodes in a Spark cluster can havetwo different roles: cluster managers and workers. A cluster manager is responsiblefor managing the resources of the cluster, i.e. worker nodes, and can either beSpark’s built-in standalone manager, or YARN or Mesos. The latter two are generalcluster managers that also support other distribution frameworks. The standalone

15

2. Theory

manager is suitable for simpler applications where advanced resource scheduling isnot needed. If one wants to run different distribution frameworks alongside Sparkon the same cluster, one needs to use YARN or Mesos. Worker nodes are responsiblefor performing the computations, i.e. the transformations and actions on the RDDs.

Cluster Manager

Worker Worker Worker Worker

Figure 2.13: A sample configuration of a Spark cluster with five nodes.

Spark applications consist of two components: a driver program and at least oneexecutor, Figure 2.14 illustrates the architecture of these components. The drivermanages the connection to the Spark cluster and all work done before distribution,such as setting up common resources shared by all executors. The driver programdoes not need to be run on the cluster as long as full two-way communication ispossible. However, since a lot of communication takes place between the driver andthe cluster, it is advantageous to place the driver as close as possible to the clusterphysically to minimize latency introduced by network communication. An executoris a process running on a worker node that performs the actual transformationsand actions of a Spark job. Each executor is unique to a certain application andexists only during the lifetime of the application. However, it is possible for severaldifferent applications to run its own executors on the same worker. A Spark job isdivided into tasks which consist of particular transformations or actions performedon partitions of an RDD. An executor can be configured to use more than one CPUcore on the worker node on which it resides, if this is the case an executor canperform multiple tasks concurrently.

16

2. Theory

Driver

Worker

Executor

TaskTask

Executor

TaskTask

Worker

Executor

TaskTask

Executor

TaskTask

Figure 2.14: The components of a Spark application.

17

3Implementation

The viewshed algorithms addressed in this thesis are described in general in Chap-ter 2, but there are many ways in which each algorithm can be implemented. Theseimplementations can have different characteristics with regards to e.g. computationtime and memory requirements. This chapter describes the implementations wemade as part of this thesis, describing how they differ from the base algorithms andmotivations for the design choices we made.

One thing all our implementations have in common is that they do not only returna Boolean result whether a point is visible or not, they will also report how much apoint needs to be raised in order to be considered visible. For example, a value ofzero corresponds to a visible point and a value of one corresponds to the number ofmeters (in this case, one meter) that a points needs to be elevated to be visible fromthe observation point. All implementations created in this thesis were made in theprogramming language Java, both for the local implementations and our distributedsolutions.

3.1 R3

The R3 algorithm, described in Section 2.4.1, is a rather naive approach to viewshedanalysis. In order to handle a LoS that does not fall directly on a point when itintersects an x or a y-crossing, our algorithm interpolates the height from the twopoints closest to the intersecting point, y1 and y2. The code for this calculation canbe seen in Listing 3.1, and a descriptive figure can be seen in Figure 3.1

18

3. Implementation

Listing 3.1: Interpolation, for LoS intersection on a x-crossing.

// y = y−coord inate f o r LoS on c r o s s i n g// y1 = c l o s e s t i n t e g e r ≤ y// h1 = height in fo rmat ion f o r coord inate (x , y1 )// y2 = c l o s e s t i n t e g e r ≤ y// h2 = height in fo rmat ion f o r coord inate (x , y2 )

i n t e r s e c tHe i gh t = abs (x−y1 ) ∗ h2 +abs (x−y2 ) ∗ h1

x axis(0,0)

y axis

Ay1

y2(y − y2)

(y − y1)

Figure 3.1: Interpolation to get an approximation of the height of a the-oretical point that lies between point y1 and y2.

The y value is the exact position of the intersection in an x-crossing, y1 is the y valuerounded up to the closest integer, to get the height information from that point inthe raster. y2 is the y value rounded down. The value we get from taking abs(y-y1)is the distance between y and y1, which is a value between zero and one. This valueis then multiplied with the height value from the other point (y2) and then addedto the same calculation for y2. This way the point that is closest to the intersectingpoint will have the most influence on the resulting height value.

Interpolation is the preferable choice of approximation when the most accurate resultis needed [8]. The downside is that it is a more expensive computation comparedto for example taking the minimum or maximum value. Maximum and minimumis used when one wants to obtain results that are more conservative towards eithervisibility or non-visibility. For example, by taking the maximum the LoS-resultswill probably be higher than the correct result, which would mean that a pointthat is deemed visible, is visible with a high probability. In contrast, by taking theminimum, a point that is deemed not visible, is not visible with a high probability.

19

3. Implementation

Our implementation of the R3 algorithm in Spark consists of a main function thatwill read a raster and transform it into a list of points, for all of the cells in the raster.These are later used as keys in a key-value lookup to get the result correspondingto each cell. This list is transformed into an RDD. To calculate the viewshed wecreated a function that takes a coordinate as input and returns a key-value pair. Thevalue is the visibility height for that cell. The algorithm was then run by applyingthis function to every element in the RDD, effectively performing a LoS calculationfor every point in the raster. The result is a key-value map with a point as the key,and the value is either zero for a visible cell, or a positive decimal value for howmuch higher a specific cell needs to be elevated, to be considered visible.

3.2 R2

The R2 algorithm calculates visibility for all points on the grid a LoS passes andnot only those it actually intersects. This fact, combined with the fact that eachLoS is being calculated independently from all other LoS, means that many cellswill have their visibility calculated several times. This is illustrated in Figure 3.2.The highest number of visibility calculations for all nodes can be calculated with theformula 4R2−12R+8, where R is the side length of the area-of-interest measured innumber of points. Each point can only have one visibility value so the results of thedifferent calculations need to be merged. In a distributed setting this merge can bevery costly in terms of computation time since conflicts can occur on different nodes.This means that all results have to be aggregated on a single node for the merge tocomplete successfully, which causes a lot of overhead in the form of communicationand data transfer between nodes.

In order to address the problem of duplicates, we modified the R2 algorithm sothat each LoS keeps track of the destination of its’ neighboring LoS. When a LoScalculates the height of a point, it uses the information about the neighboring LoSto compute the distance from the point to each line. Only if none of the neighboringlines are closer to the point, the current LoS will calculate the visibility for thatpoint. This way the number of duplicates is greatly reduced and since each lineneeds no information about the actual results of its neighboring lines, each LoS canstill be calculated in parallel. Figure 3.3 provides an example of this modification.

The implementation of R2 in Spark is similar to that of R3, with two exceptions.Instead of creating a list with all coordinates we only used coordinates for the cellson the edge of the area-of-interest, visualized in Figure 3.4. The visibility functionfor R2 takes a coordinate as input and returns a list of key-value pairs as a result.This list corresponds to the result of all cells that are on the LoS from the observerto the edge cell. To remove duplicates we also applied a combining function onthe result. As the algorithm only saves results from a calculation if the activeLoS is the one that is closest to the cell that is to be calculated, we know that anyduplicates have the same distance between the calculated cell and the LoS, therefore

20

3. Implementation

x axis(0,0)

y axis

Figure 3.2: The visibility of point (1,1), denoted by the dot, will be cal-culated by several LoS. The circles denote the points along the LoS whichwill approximate (1,1)

x axis(0,0)

y axis (3,3)(2,3)

(1,1)

Figure 3.3: While an approximation of the height of point (1,1) will becalculated by LoS(2,3) it will not determine the visibility of point (1,1) sinceLoS(3,3) is closer.

the combining-function takes the average of the two reported results.

21

3. Implementation

Figure 3.4: A raster where the gray area represents the cells on the edgeof the area-of-interest.

3.3 Van Kreveld

Our implementation of the van Kreveld algorithm in Spark uses a different approachfor parallelization than described in 2.4.3. Rather than partitioning the raster basedon the angles of individual points in the raster, our implementation uses the co-ordinates of the points. The function we implemented takes a start and an endcoordinate as input, these coordinates represent the part of the raster that shouldbe calculated. The function then creates its own priority list of events, as well as abinary search tree. The algorithm would then be computed, but only between thedesignated start and end coordinate. The reason for using coordinates instead ofangles, as shown in Figure 2.12, is that there is no need to calculate the angles forall cells before dividing the raster, which makes it possible to calculate the anglesin parallel as this is done after the partitioning. The result from running the algo-rithm is a list of key-value pairs of all the cells in this part of the raster, with thekey corresponding to the coordinate of the cell and the value corresponding to theheight-information calculated by the algorithm.

The van Kreveld algorithm differs somewhat from the R2 and R3 algorithms whencalculating visibility heights of points. The R3 and the R2 algorithms use interpola-tion, as shown in Listing 3.1, whereas the van Kreveld algorithm uses the maximumheight of obstructing points. The effect of this is that points that the van Kreveldalgorithm deems visible, will most certainly be visible. Furthermore, van Kreveldmight report points as not visible that the two other algorithms will report as visible.

22

4Results

This chapter describes the results we gathered from running tests on our implemen-tations. The results include accuracy and time complexity and are used to comparethe algorithms as well as comparing local computations with running them on acluster. The hardware used locally was a computer with 16 GB RAM, running anIntel i7 870 at 2.93 GHz [14], the JVM was set to use 10 GB of RAM. The Sparktests were performed on the Google Cloud Engine platform [15] using a cluster con-sisting of five n1-highmem-4 instances [16], with one instance acting as the clustermanager, using YARN, and the rest as workers.

4.1 Correctness

To gather results and to test our implementations of the R2 algorithm and the vanKreveld algorithm we compared the results between them and the R3 algorithm.The reason for this is that the R3 algorithm, as previously stated, calculates anindividual LoS to every cell in the raster, meaning that approximation errors arekept to a minimum. To be able to use results from the R3 algorithm we firstneeded to assure that it would give accurate results. To do this we compared it tothe algorithm used by Carmenta, which we used as a base case. The comparison isvisualized in Figure 4.1, where points visible only according to Carmenta’s algorithmare coloured red, points deemed visible only by our R3 are coloured light blue. Thedarker area denotes points where both algorithms agree are visible. The reason forthe large amount of points visible only to our algorithm is that the algorithm usedby Carmenta only calculates viewshed for a circular area and not the entire area-of-interest. There are some points inside the circle that the algorithm used by Carmentadeems visible, which the R3 algorithm does not. This is probably because of some ofthe optimisations that are implemented in the Carmenta algorithm. Overall the twoalgorithms agree on a vast majority of points, which means that our implementationcan be considered correct and accurate.

23

4. Results

Figure 4.1: Comparison between our implementation of the R3 algorithmand the algorithm used at Carmenta.

4.2 Accuracy Tests

In Figure 4.2 the graph visualizes the comparison between the R2 and Kreveldalgorithms to the R3 algorithm, the same data can be viewed in Table 4.1. Desirablevalues are zero, or close to zero, as this means that there are little to no differencein the reported results from the algorithms. For R2, the vast majority of the valuesare concentrated around zero which means that we have achieved high accuracy inour implementation of the R2 algorithm. The van Kreveld algorithm also have mostvalues concentrated around zero, although slightly more spread out than for R2.An interesting result is that the van Kreveld comparison does not display the samesymmetric behaviour as R2. This can likely be attributed to the different ways thealgorithms estimate the height of obstructing points that lie between real points inthe raster. Van Kreveld uses the maximum height of the two closest real pointswhereas R2 and R3 both use interpolation.

Table 4.2 shows a comparison on how many points the algorithms agree are visibleor not. Our results show that both the R2 and van Kreveld algorithms have a highdegree of agreement compared to R3, around 99%, but R2 has a slight edge.

Table 4.3 shows error metrics when comparing the R3 algorithm to both the R2algorithm and the van Kreveld algorithm. The error metrics of the R2 algorithmare comparable to the values presented in the paper by Franklin et al. [8]. Whenevaluating the results from our tests, we can gather that both algorithms havesufficient accuracy, though R2 reports somewhat better results than the van Kreveldalgorithm.

24

4. Results

Figure 4.2: Differences when comparing the R2 and Kreveld algorithmsto the R3 algorithm on a 2001x2001 raster DEM representing the city ofParis, France. A negative value indicates that the R3 algorithm reports ahigher LoS height than the other algorithms, and vice versa. Values closeto zero means that the two compared algorithms agree on the results.

R2 Kreveld

≤-5 4232 1419

-4 4300 2046

-3 8877 4475

-2 22096 11292

-1 106401 45741

0 3717201 3100835

1 103539 515917

2 21061 144766

3 8113 63221

4 4107 37520

≥5 4073 76768

Table 4.1: Occurrences of differences in reported LoS height values whencompared to the R3 algorithm.

25

4. Results

R2 Kreveld

Total Points 4004000 4004000

Matching Points 3977861 (99.3%) 3953654 (98.7%)

Table 4.2: Point-by-point comparisons of the visibility results of the R2and van Kreveld algorithms to R3’s results. The matching points are thenumber of points where the algorithms agree with R3 on visibility.

R2 Kreveld

Mean error -0.0026 meters 0.4455 meters

Min error -13.88 meters -11.01 meters

Max error 13.05 meters 45.89 meters

Std deviation 0.49 meters 1.50 meters

Table 4.3: Error statistics when comparing the results of the R2 and vanKreveld algorithms to R3’s results using a 2001x2001 raster.

4.3 Timing Tests

To be able to draw conclusions on the performance between the different algorithmsand the different implementations we performed timing tests. The recorded timesexclude the time it takes to read and parse the raster.

During the initial timing tests we faced a problem where we did not have enoughRAM for storing each LoS result. To solve this problem, and to ensure that thetimes we reported were the running time of the actual algorithm, we chose to savethe result from every individual LoS in the same variable. Thereby reducing thememory problem, while still running the whole algorithm. The downside of thissolution is that the result from a previously computed LoS will be overwritten bythe next.

4.3.1 Locally

The results from running the R2 algorithm on our local implementation are shownin Figure 4.3. The times are an average over ten runs performed on the multi-coreCPU described previously. As can be seen, the execution time gets lower and lowerdepending on the number of threads the algorithm executes on.

The results from running the R3 algorithm with different numbers of threads can

26

4. Results

Figure 4.3: Running time when calculating a viewshed with the R2 algo-rithm locally, on a raster of size 16001 x 16001, with different number ofthreads.

be seen in Figure 4.4. It is easy to see the difference in complexity between the twoalgorithms. The time it takes to run the R3 algorithm on a raster with size 2001 x2001 is comparable to running the R2 algorithm on a raster with size 16001 x 16001,a raster of almost 64 times the size.

All three algorithms were implemented so that they could be run in parallel. Therunning time when executing the algorithms with different numbers of threads areshown in Figure 4.4, Figure 4.5 and Figure 4.6. The results show that running analgorithm with more threads will decrease the overall running time, which in turnimplies that we have succeeded in our parallel implementation of the algorithms. TheR2 algorithm receives a smaller speed-up when using multiple threads compared toboth the R3 and van Kreveld algorithms. This is probably due to that the runningtime of the algorithm is so short that the overhead incurred by running the algorithmin multiple threads have a measurable impact on the total running time.

One problem with the van Kreveld algorithm is that it uses a lot of RAM. Memoryusage is visualized in Figure 4.7. For a raster with a side length of 5001 cells,the van Kreveld algorithm uses more than three times the RAM compared to theother algorithms. Due to restrictions on our testing hardware, the largest rasterwe were able to run the van Kreveld algorithm on had a side length of 6001. Bycomparison, on our cluster we were able to run a viewshed analysis with the vanKreveld algorithm on a raster with side length of 10001 cells. This implies that byusing a cluster’s ability to scale, one could successfully perform calculations on largerasters that would otherwise not be possible on a single machine.

27

4. Results

Figure 4.4: Comparison when running the R3 algorithm with differentnumber of threads.

Figure 4.5: Comparison when running the R2 algorithm with differentnumber of threads.

4.3.2 Distributed

In order to evaluate how the configuration of the Spark cluster impacts the run-ning time of viewshed analysis we ran a series of tests of the R2 algorithm using a16001x16001 raster with varying number of CPU cores per executor, which deter-mines how many executors can run on a single worker node, and number of cores intotal. The results of these tests are displayed in Figure 4.8. We found that all the

28

4. Results

Figure 4.6: Comparison when running the van Kreveld algorithm withdifferent number of threads.

Figure 4.7: Comparison of memory consumption of the three algorithms

tested variations of cores per executor behaved similarly with regards to the run-ning time. Using two cores per executor, resulting in two executors on each worker,yielded slightly better results and was therefore used in all subsequent timing testsperformed on the Spark cluster.

Even though the R3 algorithm has a higher complexity than the R2 algorithm, westill wanted to gather results from running it. The reason for this is that we wantedto compare running times on a more complex algorithm between the local imple-

29

4. Results

Figure 4.8: Comparison of running times of the R2 algorithm on a16001x16001 raster using Spark with varying amounts of executors perworker node.

mentation and the Spark implementation, to be able to draw conclusions whetherother, more demanding, computations can benefit from being run on a cluster. Theresults of this comparison can be viewed in Figure 4.9. According to the results,Spark calculates the viewshed faster when compared to the local implementation.What one has to keep in mind when comparing the running times is that in thisinstance Spark was run on a cluster with a total of 16 virtual cores and the localimplementation was run with 8 threads on a CPU with 8 virtual cores. Even thoughthe hardware differs, it is still interesting to consider that Spark outperforms theR3 local implementation. We did the same comparison for the R2 algorithm, whichcan be viewed in Figure 4.10. It is apparent that the local implementation outper-forms our distributed solution, and there is nothing that points towards a theoreticalbreak-point for larger rasters where the distributed solution would perform betterthan the local one.

Figure 4.11 shows a comparison between the van Kreveld algorithm run locally andon the cluster. The two implementations have comparable performance, the reasonthat the measuring points for the local implementation stop at 6001 cells is that thealgorithm demanded too much internal memory, and was unable to calculate rastersof greater size.

30

4. Results

(a) Measurements of running times of the R3 algorithm, between a single multi-coreCPU and a cluster using Spark.

(b) Polynomial regression based on the findings presented in Figure 4.9a.

Figure 4.9: Comparison of running times of the R3 algorithm, between asingle multi-core CPU and a cluster using Spark

31

4. Results

Figure 4.10: Comparison of running times of the R2 algorithm, betweena single multi-core CPU and a cluster using Spark.

Figure 4.11: Comparison of running times of the van Kreveld algorithm,between a single multi-core CPU and a cluster using Spark.

32

5Discussion

Comparing our findings with those from Axell and Fridén [6] shows that it is notprobable to achieve better, or even equal, performance when doing viewshed analyseson a cluster compared to on a GPU, at least not when using raster with side lengthup to 16001 cells. This can be attributed to the fact that the computational powerof modern-day GPUs are exceedingly high and the speed of networks has not hadthe same evolution as GPUs.

Considering that running a viewshed analysis on a GPU is so much faster comparedto running the same analysis on multiple CPUs, we think that to utilize the powerof a distributed solution one needs to use it for specific, and suitable, use cases. Oneof these use cases would, for example, be finding a subset of points that has thehighest visibility, so called visibility index. First, to gather this result one has torun (exponentially) more computations, as this is equivalent to running viewshedanalyses regarding all points in the area as observers. Secondly, as the result onlyconsists of a subset of points there is no need to aggregate the whole result to onesingle node, thereby correctly utilizing the distributed aspect of this system.

We think that Spark works best when the total number of calculations that theapplication performs are very large, which is one difference between the R2 and theR3 algorithm. This is highlighted in Figure 4.9 where a comparison between the localand the distributed implementation of the R3 algorithm is shown. As previouslymentioned in Chapter 4 there is a difference in hardware when computing the results,but one also has to consider that it is usually both cheaper and easier to rent thecomputational power of 16 virtual cores than it is to buy the hardware yourself.

5.1 Conclusion

This thesis has examined the feasibility of using a distributed system to performviewshed analysis. As can be seen in Chapter 4 our implementation of the R2-algorithm achieved high accuracy, but the results of the running-time speak againstusing distribution for these kinds of computations. The reason for this seems to be

33

5. Discussion

the overhead incurred by sending data over a network, which compared to internalcommunication, with for example a GPU, is a rather slow medium.

We can also conclude that the idea of using distribution in GIS is a sound one,though one has to carefully select appropriate use cases for a distributed solution, inan effort to properly utilize the power of distribution and to minimize any overheadintroduced by the distributed solution, e.g. transmission overhead.

5.2 Future Work

As previously mentioned it is important to examine the use cases when distributingGIS computations. We think that the process of finding a subset of points withthe highest reach, i.e. the points where an observer would cover as big an area aspossible, would be a suitable use case for distribution. It would be interesting tosee the performance one could achieve on a medium sized cluster, while performingthese calculations.

According to the results we gathered, performing viewshed analysis on a cloud clusterwill not perform better than using a GPU. What would be interesting to examine isto perform analyses on rasters larger than 16001x16001 cells. To be able to do this,one either needs to have more capable hardware in terms of available RAM, or tooptimise the algorithms with respect to how much memory they use.

34

Bibliography

[1] J. A. Ware, D. B. Kidner, and P. J. Rallings, “Parallel distributed viewshedanalysis,” in Proceedings of the 6th ACM international symposium on Advancesin geographic information systems. ACM, 1998, pp. 151–156.

[2] B. N. Thai and A. Olasz, “Raster data partitioning for supporting distributedGIS processing,” The International Archives of Photogrammetry, Remote Sens-ing and Spatial Information Sciences, vol. 40, no. 3, pp. 543–551, 2015.

[3] J. Abdul, M. Potdar, and P. Chauhan, “Parallel and distributed GIS for process-ing geo-data: An overview,” International Journal of Computer Applications,vol. 106, no. 16, pp. 9–16, 2014.

[4] S. Tabik, A. R. Cervilla, E. Zapata, and L. F. Romero, “Efficient data structureand highly scalable algorithm for total-viewshed computation,” Selected Topicsin Applied Earth Observations and Remote Sensing, IEEE Journal of, vol. 8,no. 1, pp. 304–310, 2015.

[5] L. Lu, B. Paulovicks, M. Perrone, and V. Sheinin, “High performance comput-ing of line of sight viewshed,” in Multimedia and Expo (ICME), 2011 IEEEInternational Conference on. IEEE, 2011, pp. 1–6.

[6] T. Axell and M. Fridén, “Comparison between GPU and parallel CPU optimiza-tions in viewshed analysis,” Master’s thesis, Chalmers University of Technology,2015,Summary: The authors make a comparison between running viewshed anal-ysis on a multi-core CPU and a GPU, and concludes that the latter is moreefficient. They manage to observe a speed-up of 3.1x on a GPU compared to aCPU. We used the results from this paper to compare our distributed solutionto the one run on a GPU.

[7] S. Mayhew, A dictionary of geography. Oxford University Press, USA, 2015.

[8] W. R. Franklin, C. K. Ray, and S. Mehta, “Geometric algorithms for siting ofair defense missile batteries,” A Research Project for Battle, no. 2756, 1994,Summary: A paper that describes what a viewshed is, and what algorithms

35

Bibliography

are used to calculate a viewshed. The problem they aim to solve is to develop analgorithm for viewshed analysis that has less complexity than O(R3). The resultof this is the R2 algorithm. We used this paper to deduce how to implementthe R3 and the R2 algorithm, and also for general understanding of viewshedanalysis.

[9] J. Albrecht, Key concepts and techniques in GIS. Sage, 2007, ch. 9.3.

[10] M. van Kreveld, “Variations on sweep algorithms: Efficient computation of ex-tended viewsheds and classifications,” in Proceedings of 7th International Sym-posium on Spatial Data Handling, vol. 13, 1996, pp. 13A.15–13A.27,Summary: A paper that aims to calculate viewsheds with the use of planesweep algorithm. The author, Marc van Kreveld, is the source of both the algo-rithm and the name of the algorithm; the van Kreveld algorithm. We used thispaper as a guide when implementing the van Kreveld algorithm.

[11] C. Ferreira, M. V. Andrade, S. V. Magalhães, W. R. Franklin, and G. C. Pena,“A parallel sweep line algorithm for visibility computation.” in Proceedings ofXIV GEOINFO, 2013, pp. 85–96,Summary: The authors developed a parallelized version of the iterative vanKreveld algorithm. This is done by dividing the viewshed area into partitions,and calculate every partition in parallel. Their solution is used in our imple-mentation of the van Kreveld algorithm.

[12] Apache Spark. Apache Spark home page. (Visited on 2016-05-04). [Online].Available: http://spark.apache.org/

[13] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J.Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing,” in Proceedings of the9th USENIX conference on Networked Systems Design and Implementation.USENIX Association, 2012, pp. 2–2.

[14] Intel Corporation. Ark. Intel® Core™ i7-870 Processor. (Visited on2016-05-09). [Online]. Available: http://ark.intel.com/products/41315/Intel-Core-i7-870-Processor-8M-Cache-2_93-GHz

[15] Google Cloud Platform. (Visited on 2016-05-09). [Online]. Available:https://cloud.google.com/

[16] Machine Types. (Visited on 2016-05-31). [Online]. Available: https://cloud.google.com/compute/docs/machine-types#predefined_machine_types

36

http://spark.apache.org/

http://ark.intel.com/products/41315/Intel-Core-i7-870-Processor-8M-Cache-2_93-GHz

http://ark.intel.com/products/41315/Intel-Core-i7-870-Processor-8M-Cache-2_93-GHz

https://cloud.google.com/

https://cloud.google.com/compute/docs/machine-types#predefined_machine_types

https://cloud.google.com/compute/docs/machine-types#predefined_machine_types

ADefinitions

GIS - Geospatial Information Systems - Information systems that analyze andpresent geographic or spatial data.

DEM - Digital Elevation Model - A digital representation of terrain with elevationinformation.

TIN - Triangulated Irregular Network - A digital data structure that represents aDEM with the use of triangles.

Raster - Used to represent a DEM and is a grid of squares, where every cell in thegrid represents a small subarea.

LoS - Line of sight - The process of calculating whether a target point is visiblefrom an observer’s point of view.

Viewshed - A representation of an area that shows which points are visible, andwhich are not.

Apache Spark - An open-source distribution framework maintained by the ApacheSoftware Foundation.

RDD - Resilient Distributed Dataset - Dataset used by Spark.

Carmenta - The company where this thesis was performed, Carmenta specializesin GIS. www.carmenta.com

I

http://www.carmenta.com

Documents

Distributed Viewshed Analysispublications.lib.chalmers.se/records/fulltext/238036/238036.pdf · Viewshed analysis can be used for applications other than human sight, such as determining