Geographical sharding in MongoDB
Using Voronoi partitioning to geographically shard point data
Ricardo Filipe Amendoeira
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisor: Prof. João Nuno de Oliveira e Silva
Examination Committee
Chairperson: Prof. António Manuel Raminhos Cordeiro Grilo
Supervisor: Prof. João Nuno de Oliveira e Silva
Member of the Committee: Prof. João Carlos Antunes Leitão
May 2017
Acknowledgments
I have tremendous gratitude for the help given by my supervisor, Professor João Nuno de Oliveira
e Silva, who not only started my journey in the world of computer programming and widened my
horizons with various other courses, but was also always available and happy to help me and make me
focus on the important tasks at hand, keeping me motivated and smiling and pushing me to do better
until the very end.
Of course I am incredibly thankful to my parents for my upbringing, their sacrifices so that I could
have a good education, their forgiveness for my dumb mistakes and their unconditional support during
my time at IST.
To my girlfriend, Rita, thank you for always being positive, supportive and enjoying life with me.
I would also like to acknowledge the support of my friends, who made my time at IST a great and
enjoyable experience.
Abstract
The growth of dataset sizes in the Big Data era has created the need for horizontally scalable Database
Management Systems (DBMS), like MongoDB, Apache Cassandra and many others, generally known
as NoSQL DBMSs. These systems must be configurable in ways that best fit the data they will
contain, partitioning the data (sharding) in a context-aware way so that queries can be executed
as efficiently as possible.
Spatial data (associated with a location) availability has been steadily growing as more devices get
geolocation capabilities, making it important for NoSQL DBMSs to be able to handle this type of data.
While NoSQL DBMSs are capable of sharding, not all (MongoDB included), can do it based on
location or do it in an efficient way.
Following the work by João Pedro Martins Graça in his Master's thesis [6], which explores various
spatial partitioning policies, this work aims to implement and evaluate the practical usefulness of
the proposed GeoSharding policy, based on Voronoi indexing, in the popular open-source project
MongoDB.
Benchmark results show that this policy can have a large positive impact on performance when
used with datasets containing large numbers of small documents, such as sensor data or statistical
geography information.
Keywords: Big Data, Spatial Data, Sharding, MongoDB, NoSQL
Resumo
O crescimento dos conjuntos de dados na era da Big Data criou a necessidade de sistemas de gestão de
bases de dados com capacidade para escalar horizontalmente, como o MongoDB, o Apache Cassandra
e muitos outros, geralmente conhecidos como sistemas NoSQL. Estes sistemas devem ser capazes de ser
configurados de formas adequadas aos dados que vão armazenar, segmentando os conjuntos de dados
(técnica conhecida como Sharding) de formas que aproveitem o contexto dos mesmos, permitindo que
as interacções com a base de dados sejam executadas de forma o mais eficiente possível.
A disponibilidade de dados espaciais (associados com uma localização) tem aumentado consistentemente
ao longo dos anos, com o aumento do número de dispositivos com capacidade para geolocalização,
tornando importante que os sistemas NoSQL sejam capazes de lidar e tirar partido deste tipo de dados.
Embora os sistemas NoSQL suportem sharding, nem todos (incluindo o MongoDB) são capazes de
o fazer com base em geolocalização, ou de fazê-lo de forma eficiente.
Em seguimento ao trabalho feito por João Pedro Martins Graça na sua tese de Mestrado [6], explorando
várias políticas de segmentação espacial, este trabalho tem como objectivo implementar e
avaliar a utilidade prática da política proposta GeoSharding, baseada em indexação de Voronoi, no
popular projecto de código aberto MongoDB.
Os resultados dos testes efectuados mostram que esta política pode ter um grande impacto positivo
no desempenho do sistema quando usada com conjuntos de dados que contenham grandes números
de pequenos documentos, como dados de sensores ou informação de geografia estatística.
Palavras-Chave: Big Data, Dados Espaciais, Sharding, MongoDB, NoSQL
Contents
Abstract iii
Resumo v
List of Tables ix
List of Figures xi
List of Listings xiii
Acronyms xv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Proposed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Objectives and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Related work 5
2.1 Geographical Information Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 NoSQL database management systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Geographically aware sharding algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Existing solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 MongoDB (as of version 3.4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 GeoMongo 21
3.1 GeoMongo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4 Evaluation 29
4.1 Performance evaluation methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2 Performance evaluation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 Evaluation of MongoDB as a base for GeoMongo . . . . . . . . . . . . . . . . . . . . . . 39
5 Conclusions 41
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Bibliography 45
A Write performance benchmark script A-1
B Write performance results B-1
B.1 Documents with attached image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-2
B.2 Text-only documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-3
C Read performance benchmark script C-1
D Read performance results D-1
D.1 Documents with attached image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-2
D.2 Text-only documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-3
E Example document from the dataset E-1
List of Tables
B.1 Write benchmark for documents with image attached (total running time) - 1 shard . . . B-2
B.2 Write benchmark for documents with image attached (total running time) - 2 shards . . B-2
B.3 Write benchmark for documents with image attached (total running time) - 4 shards . . B-2
B.4 Write benchmark for text-only documents (total running time) - 1 shard . . . . . . . . . B-3
B.5 Write benchmark for text-only documents (total running time) - 2 shards . . . . . . . . . B-3
B.6 Write benchmark for text-only documents (total running time) - 4 shards . . . . . . . . . B-3
D.1 Read benchmark for documents with image attached (total running time) - 1 shard . . . D-2
D.2 Read benchmark for documents with image attached (total running time) - 2 shards . . . D-2
D.3 Read benchmark for documents with image attached (total running time) - 4 shards . . . D-2
D.4 Read benchmark for text-only documents (total running time) - 1 shard . . . . . . . . . . D-3
D.5 Read benchmark for text-only documents (total running time) - 2 shards . . . . . . . . . D-3
D.6 Read benchmark for text-only documents (total running time) - 4 shards . . . . . . . . . D-3
List of Figures
2.1 A practical example of the MapReduce process, on MongoDB . . . . . . . . . . . . . . . . 8
2.2 An example of kriging interpolation (red full line). Also shown: smooth polynomial
(blue dotted line) and Gaussian confidence intervals (shaded area). Squares indicate
existing data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 A Z-curve used to partition the Earth into a 4x4 grid . . . . . . . . . . . . . . . . . . . . 11
2.4 The quadtree data structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 A spherical Voronoi diagram of the world’s capitals . . . . . . . . . . . . . . . . . . . . . 12
2.6 JSON document example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.7 MongoDB Sharded Cluster Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.8 Partition of the dataset into chunks based on shard key values . . . . . . . . . . . . . . . 15
2.9 Examples of bad shard key properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.10 An index being used to return sorted results . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.11 Replica set architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.12 Replica set failover process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1 A visual summary of what data is stored in each type of machine in a MongoDB cluster . 23
3.2 Great-circle distance, ∆σ , between two points, P and Q. Also shown: λp and ϕp . . . . . . 27
4.1 Architecture of the benchmarking cluster (with varying numbers of shards) . . . . . . . . 31
4.2 Underwater resources of USA (mainland region shown) . . . . . . . . . . . . . . . . . . 32
4.3 Read performance of documents with attached image with: a) 1 shard, b) 2 shards and
c) 4 shards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.4 Horizontal scaling of read performance of documents with attached image, measured
with 64 client threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.5 Write performance of documents with attached image with: a) 1 shard, b) 2 shards and
c) 4 shards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.6 Horizontal scaling of write performance of documents with attached image, measured
with 64 client threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.7 Read performance of text-only documents with: a) 1 shard, b) 2 shards and c) 4 shards . 36
4.8 Horizontal scaling of read performance of text-only documents, measured with 64 client
threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.9 Write performance of text-only documents with: a) 1 shard, b) 2 shards and c) 4 shards 38
4.10 Horizontal scaling of write performance of text-only documents, measured with 64
client threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
List of Listings
A.1 The benchmark script for write performance, in Python3 . . . . . . . . . . . . . . . . . . A-2
C.1 The benchmark script for read performance, in Python3 . . . . . . . . . . . . . . . . . . C-2
Acronyms
DB Database.
DBMS Database Management Systems.
GIS Geographical Information Systems.
HA High Availability.
JSON JavaScript Object Notation.
RDBMS Relational database management systems.
SQL Structured Query Language.
1 Introduction
Contents
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Proposed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Objectives and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 Motivation
Geo-referenced data, data that can be associated with some sort of location, has been growing in
availability as more devices like smart-phones gain geolocation capabilities and satellite imagery and
sensor networks become better and more available [12].
Photos posted to social media, business information that can be searched by customers or navigation
devices, weather data and census data, among many other examples, can all be associated with general
(regions) or specific (coordinates) locations.
As these datasets grow they become increasingly difficult to manage and store while maintaining
good performance for querying or computation [12].
A common way to scale databases is via vertical scaling, where the machine that runs the database
management system is upgraded with better hardware. Vertical scaling eventually hits a limit when the
cost of better hardware becomes uneconomical or there is no possible upgrade path. An alternative way
of scaling exists, horizontal scaling, where the system is scaled by adding more machines (usually with
commodity hardware to keep costs down) [2]. Horizontal scaling has no defined limit on scalability
but can of course be limited by communication overhead within the system or external factors like
network capacity.
The most common type of DBMS, the relational DBMS, is usually very complex to scale horizontally
because of the way data is stored: relations between tables and normalization make tables heavily
dependent on each other to complete each table's information [2].
A more recent kind of DBMS, the NoSQL DBMS, aims to improve horizontal scalability so that
these systems can handle incredibly large datasets and/or very large request loads for web services.
They mostly achieve this by relaxing some of the requirements/promises of RDBMSs, usually data
durability and/or data consistency, and by data de-normalization, which makes tables and rows (or the
equivalent data types) mostly independent, allowing data to be easily partitioned and distributed
among several machines.
This type of horizontal scalability via data partitioning is called sharding. The sharding policy is
usually one of the main things to consider when trying to improve the performance of a sharded
cluster, since the way data is partitioned and distributed affects the efficiency of operations on the
database. Policies should be adequate to both the data itself and the operations that will commonly
be performed on it, so that data locality can be leveraged. Ideally, a query should only require
contacting a single machine to be executed; the goal of choosing a good sharding policy is to
approximate this ideal as much as possible.
Not many NoSQL DBMSs support sharding policies adequate for geographical data [17]. Common
operations on georeferenced data usually involve fetching information related to a certain area, or
doing interpolation based on neighboring data, so a good distribution policy for geographical data
should take this into account to take advantage of data locality. Otherwise, a simple query for data
related to a certain small area may have to contact all machines in the cluster, degrading performance
for any simultaneously occurring database operations.
A previous work compared different sharding policies by means of simulations and concluded that
a newly proposed policy named GeoSharding can perform better than most of the policies currently
used in NoSQL DBMSs.
An implementation and evaluation of this policy could thus be valuable to measure the practical
usefulness of GeoSharding for NoSQL DBMSs.
1.2 Proposed Solution
An implementation of the partitioning policy proposed by João Graça [6] in the DBMS MongoDB is
proposed as a way to shard geo-referenced data without losing the context of the data's location.
MongoDB was chosen as the base for this work because it is a popular open-source NoSQL DBMS
with API-level support for interacting with geographical data (like range queries) and existing support
for spatial indexing (which improves query performance at the machine level), but it doesn't have an
adequate sharding policy for georeferenced data, which would be the next logical step for geographical
data support in MongoDB. MongoDB can also do distributed computation, possibly making it a one-stop
shop for geographical data storage and processing.
The partitioning policy is conceptually quite simple: each shard is assigned a virtual geographical
location in the form of latitude/longitude coordinates (which doesn't have to be related to its physical
location), and geo-referenced data is then placed on the shard determined to be the closest one,
based on the shards' virtual locations. GeoSharding is a form of Voronoi indexing, a spatial partitioning
technique that is used in many different fields [9].
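As an illustrative sketch (not code from GeoMongo; the shard names and virtual locations below are made up), the routing rule can be expressed in a few lines of Python using the great-circle (haversine) distance:

```python
import math

def haversine(p, q):
    """Great-circle distance (km) between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 6371.0 * 2 * math.asin(math.sqrt(a))

# Hypothetical virtual (lat, lon) locations assigned to three shards.
SHARD_SITES = {
    "shard-eu": (38.7, -9.1),     # near Lisbon
    "shard-us": (40.7, -74.0),    # near New York
    "shard-asia": (35.7, 139.7),  # near Tokyo
}

def pick_shard(point):
    """Route a geo-referenced document to the shard whose virtual
    location is closest, i.e. the Voronoi cell the point falls in."""
    return min(SHARD_SITES, key=lambda s: haversine(SHARD_SITES[s], point))

print(pick_shard((41.1, -8.6)))  # Porto -> shard-eu
```

Note that the Voronoi cells never need to be computed explicitly: picking the nearest site with a single `min()` over the shard locations is enough to define the partitioning.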
This policy allows irregularly sized regions to be assigned, as opposed to alternatives like geohashing
(Z-curves) or quadtree hashing (explained in chapter 2), where the regions are of fixed size.
This characteristic makes GeoSharding more flexible and better suited to irregular distributions of
spatial data, which are common (business locations are overwhelmingly located in cities, for example).
1.3 Objectives and results
The main objective of this work is to follow up on the study done in [6], implement the GeoSharding
partitioning policy in an existing database management system, and evaluate its real-world
effectiveness.
This boils down to producing a working version of MongoDB that can be configured to shard by
geographical coordinates and that allows the user to assign different virtual locations to different shards,
to insert and search for geo-referenced data, and to issue range-based queries that can be executed in an
efficient manner.
These objectives were achieved, with the exception of range queries which, while possible, are not
optimized to take full advantage of GeoSharding's data distribution.
The evaluation results in chapter 4 show that GeoSharding is very promising for real-world use:
with a good choice of shard locations, performance is equivalent to or better than hashed sharding
in all tested scenarios, with big performance improvements when using datasets where documents are
relatively small, which is common with sensor data and census information.
GeoSharding also has the advantage of distributing data in a way that allows queries and computations
to take advantage of data locality; for example, range queries can be implemented efficiently
by contacting only a small subset of the cluster's servers, which is not possible when using hashed
sharding.
1.4 Outline
Chapter 2 presents background that is relevant to understanding GeoSharding and GeoMongo, as
well as some related work.
Chapter 3 goes into more detail about the requirements for a practical use of GeoMongo, gives an
overview of the MongoDB platform that GeoMongo is built on, and presents the GeoMongo
implementation details.
Results and evaluation are presented in chapter 4 and the conclusions in chapter 5.
2 Related work
Contents
2.1 Geographical Information Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 NoSQL database management systems . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Geographically aware sharding algorithms . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Existing solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 MongoDB (as of version 3.4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
This chapter presents relevant background on Geographical Information Systems and geographical
data, a small introduction to the requirements of data computation, an introduction to NoSQL systems
and popular algorithms for spatial partitioning, an introduction to some popular systems for data
storage and/or computation, and finally an overview of MongoDB from a user's perspective.
2.1 Geographical Information Systems
Geographical Information Systems (GIS) are systems designed to store, manipulate, analyze, manage
or present many types of spatial or georeferenced data. These systems vary greatly in both intended
use and technology stack, but always involve some kind of data store.
In the following sections several features of these systems are analyzed.
2.1.1 Spatial data types
Spatial data types are the backbone of GISs, as they are used to represent and store the information
in the system. As mentioned in [1], there are two types of spatial data types, which will be somewhat
familiar to those used to dealing with image formats: vector and raster data.
Like with images, vector data types represent geometric features, such as points, lines or polygons,
while raster data types use point grids that cover the desired space.
The use case will determine the ideal type to use, but the vector model can offer more flexibility
and precision. An overview of the most common types of vector data is given below:
• Point Data: This is the simplest data type; it is stored as a pair of X, Y coordinates. It
represents entities whose size is negligible compared to the scale of the whole map, such as
a city or a town on the map of a country.
• Lines and Polylines: A line is formed by joining two points end to end and is represented
as a pair of point data types. In spatial databases, a polyline is represented as a sequence
of lines such that the end point of one line is the starting point of the next.
Polylines are used to represent spatial features such as roads, streets, rivers or routes, for
example.
• Polygons: Polygons are among the most widely used spatial data types. They capture two-dimensional
spatial features such as cities, states and countries. In spatial databases, polygons are
represented by the ordered sequence of coordinates of their vertices, with the first and last
coordinates being the same.
• Regions: The region is another significant spatial data model. A region is a collection of overlapping,
non-overlapping or disjoint polygons. Regions are used to represent spatial features such
as the State of Hawaii, which includes several islands (polygons), for example.
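These vector types can be sketched as GeoJSON-style Python structures (a hypothetical illustration; the coordinates are made-up [longitude, latitude] pairs, not data from this work):

```python
# Minimal GeoJSON-style sketches of the vector types above.
point = {"type": "Point", "coordinates": [-9.14, 38.71]}  # e.g. a city

polyline = {  # e.g. a road: each segment ends where the next begins
    "type": "LineString",
    "coordinates": [[-9.14, 38.71], [-9.10, 38.73], [-9.05, 38.74]],
}

polygon = {  # first and last vertex coincide to close the ring
    "type": "Polygon",
    "coordinates": [[[-9.2, 38.6], [-9.0, 38.6], [-9.0, 38.8],
                     [-9.2, 38.8], [-9.2, 38.6]]],
}

region = {  # a collection of polygons, e.g. an island group
    "type": "MultiPolygon",
    "coordinates": [polygon["coordinates"]],
}

# The closed-ring invariant for polygons can be checked directly:
ring = polygon["coordinates"][0]
assert ring[0] == ring[-1]
```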
2.1.2 Data computation
Computation on spatial data can be useful for many different purposes, such as geostatistics,
simulations (of weather, for example), data mining or even machine learning (a transportation company
could, for example, use collected data to train an efficient route planner that takes several factors
like weather and time into account).
The following are some important factors/features to consider when planning to do computation
on spatial data.
6
2.1 Geographical Information Systems
Server-side computation
Server-side computation can be a very important factor with huge datasets [1]. It avoids transferring
the dataset to the client: instead, the client uploads the program to the server and the server sends
the results back, which in the Big Data space are almost always several orders of magnitude smaller
than the dataset.
Storage servers that have to do double-duty as processing servers can be much more expensive than
simple storage servers and vertical scalability is limited, so cluster processing with horizontal scaling
is often used when server-side computation is desired. This field is mostly dominated by Hadoop and,
more recently, Spark, both discussed in more detail on section 2.4.
Data locality
Data locality is an important concept to take into account when performing computation [1]. The
closer data is to the hardware that processes it, the faster the computation can be done. Note that in
this context closer does not necessarily mean physically closer, but lower-latency access by the CPU.
This is true even with small amounts of data: modern CPUs have several levels of cache beyond the
main memory, which can itself be used to cache data from storage drives, so that the data in use can
be more readily available to the processor.
With large amounts of data this becomes increasingly important, since moving the data can actually
take much longer than the time it takes to process it, depending on how slow the connection used to
load the data is.
In the case of clustered data stores that support server-side computation the partitioning policy
can have big effects on performance, by distributing data appropriately according to the types of
computations that are desired and thus avoiding unnecessary data-transfer between nodes, which is
typically much slower than the internal connections to a machine’s storage hardware.
Temporal efficiency
Temporal efficiency of a computation system can be measured in two ways: throughput and latency.
Throughput refers to the amount of data that can be processed in a given amount of time, and is
usually more relevant for after-the-fact data analysis and data mining. This is the purpose of systems
like Hadoop, which are capable of crunching very large amounts of data at reasonable speed but suffer
from start-up latency, and thus can't be expected to suit real-time or near-real-time applications.
Latency refers to the amount of time the system takes to give the first result after starting the
computation, and is thus crucial for real-time or live computations (a user searching for nearby points
of interest, for example) but not very relevant for analysis applications, where the goal is to handle
larger amounts of data.
2.1.3 MapReduce
MapReduce [4] is a programming model that allows even programmers with little to no experience
in parallel programming to take advantage of massively parallel clusters.
To use MapReduce the programmer writes two pure (side-effect-free) functions, Map and Reduce.
The requirement that both functions be pure limits flexibility: they cannot, for example,
make database or network requests. MapReduce does not allow communication between running Map
or Reduce functions, which limits its use to embarrassingly parallel computations but is also
one of the reasons for its simplicity. When running a MapReduce computation, a set of key/value
pairs is fetched from the data store and the following sequence of operations occurs:
7
2. Related work
1. Some process is used to select the data that is to be processed. Not all implementations of
MapReduce include this step because undesired data can be filtered and thrown out by the Map
function. Making this a separate, explicit step has the advantage of keeping Map functions
cleaner and can allow for some simple but effective optimizations.
2. The Map function is applied to each input key/value pair, executing the desired computation and
emitting an output key/value pair.
3. When all input pairs have been processed by the Map function a new set consisting of the pairs
emitted by the Map function is now available. This set may have keys with more than one value
if certain keys weren’t unique in the input set.
4. The Reduce function is applied for each key in the sub-set of keys from the output set that had
more than one associated value. The Reduce function's purpose is to take the multiple values
and generate a single value for the respective key.
5. After Reduce has been applied to all relevant keys the final set is ready, where every key from
Map’s output set is associated with a single value.
A practical example of the MapReduce process, on MongoDB, can be seen in fig. 2.1. The example
includes the data selection step, via a query filter that selects only the documents with the status
attribute equal to "A". In the case of the key "A123", which has two values after the Map step, 500 and
250, the Reduce function is applied so that in the result set each key has only a single value associated.
Figure 2.1: A practical example of the MapReduce process, on MongoDB
Source: https://docs.mongodb.com
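The steps above can be sketched in plain Python, mirroring the fig. 2.1 example (the input documents are illustrative, built around the key "A123" with the values 500 and 250 mentioned in the text):

```python
from collections import defaultdict

# Input documents, loosely following the fig. 2.1 example.
orders = [
    {"cust_id": "A123", "amount": 500, "status": "A"},
    {"cust_id": "A123", "amount": 250, "status": "A"},
    {"cust_id": "B212", "amount": 200, "status": "A"},
    {"cust_id": "A123", "amount": 300, "status": "D"},
]

# 1. Selection: keep only documents with status == "A".
selected = [d for d in orders if d["status"] == "A"]

# 2. Map: emit one (key, value) pair per document -- a pure function.
mapped = [(d["cust_id"], d["amount"]) for d in selected]

# 3. Group the emitted values by key; non-unique keys get several values.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# 4./5. Reduce: collapse each key's values into a single value.
result = {key: sum(values) for key, values in grouped.items()}
print(result)  # {'A123': 750, 'B212': 200}
```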
2.1.4 Geostatistics
Geostatistics is the use of statistics with a focus on spatial or spatiotemporal data. It is used in many
different industries, like mining (to predict probability distributions of ore in the soil), petroleum
geology, oceanography, agriculture [14], forestry and, of course, geography [13]. It can also be
used by businesses and the military to plan logistics and by health organizations to predict the spread
of diseases [11].
Kriging
Kriging is a widely used class of geostatistical methods used to interpolate the values of arbitrary
data (such as temperature or elevation) at unobserved locations based on observations from nearby
areas.
These methods are based on Gaussian process regression: the interpolated value is modeled using
a Gaussian process that takes prior covariances into account, as opposed to piecewise polynomials that
optimize the smoothness of the fitted values. If the assumptions on the priors are appropriate, kriging
can yield the best linear unbiased prediction of the interpolated values [13].
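As a concrete sketch in standard ordinary-kriging notation (not notation taken from this thesis), the predicted value at an unobserved location is a weighted sum of the n nearby observations:

```latex
\hat{Z}(s_0) = \sum_{i=1}^{n} \lambda_i \, Z(s_i), \qquad \sum_{i=1}^{n} \lambda_i = 1
```

where the weights λ_i are derived from the modeled prior covariances (the variogram) so as to minimize the prediction variance, which is what makes the estimator a best linear unbiased predictor.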
Kriging techniques, when used on processing clusters, benefit from good data locality to avoid the
slow network transfer of data about surrounding areas on which the computation is dependent.
As can be seen in fig. 2.2, kriging interpolation chooses more likely results instead of optimizing
for line smoothness.
Figure 2.2: An example of kriging interpolation (red full line). Also shown: smooth polynomial (blue dotted line) and Gaussian confidence intervals (shaded area). Squares indicate existing data.
Source: Wikimedia Commons
2.2 NoSQL database management systems
NoSQL, or Not Only SQL, is a category of database management systems (DBMS) that differentiate
themselves from the more common SQL DBMSs, which use the relational model, by basing themselves
on non-relational or not-only-relational models. Although NoSQL is a somewhat vague term that also
includes niche categories like Graph DBMSs, it's commonly used to refer to systems that focus on
simplifying the design to more easily achieve horizontal scaling and high availability, which usually
means dropping the relational model [2].
To achieve simpler scalability and/or availability, NoSQL DBMSs usually relax the consistency guar-
antee on data they store. This follows from the CAP theorem, which states that a distributed system
can only provide two of the following guarantees:
• Consistency: Operations on stored data should always be based on the latest write to the system
or return an error
• Availability: Requests should receive non-error responses, although not necessarily from the
latest write
• Partition tolerance: The system continues to function correctly even if parts of it become un-
reachable or messages get delayed
Since many NoSQL DBMSs prioritize horizontal scalability and high availability, partition tolerance
and availability become hard requirements of these systems. Since all three guarantees can't be
achieved simultaneously, a common solution is to degrade consistency to what is known as eventual
consistency: after a write the system might become inconsistent for a short period of time but eventually
achieves consistency as the changes propagate to all machines.
Sharding
Sharding is a method to achieve horizontal scaling of a database via data partitioning between several
machines. This allows data and load to be spread among the various machines, providing high
throughput and supporting very large datasets, bigger than what could fit on a single server.
Because of their focus on horizontal scalability, these systems are useful for very large
datasets (that don't fit in a single machine's memory) or loads bigger than one machine can easily
handle by itself. They are therefore good candidates for computational analysis with Geographical
Information Systems (GIS) or other use cases where consistency isn't a hard requirement and capacity
and performance are priorities.
Some well-known examples of NoSQL DBMSs are MongoDB (used by several big tech companies
such as Google, Facebook and eBay), Cassandra (created and used by Facebook), DynamoDB (part of
Amazon Web Services) and BigTable (part of the Google Cloud Platform).
2.3 Geographically aware sharding algorithms
In this section a quick overview of the relevant algorithms for geographically aware sharding is given. An in-depth analysis of these algorithms can be found in the Master's thesis of Joao Graca [6], to which this work is a follow-up.
Geohashing (Z-curve)
Geohashing¹ is a hierarchical geocoding system that subdivides the space into several equal-sized grid-shaped regions that can be assigned to different machines. A Z-curve is used to reduce the dimensionality of longitude/latitude coordinates by mapping them to a location along the curve, with closely located coordinates usually being mapped to nearby locations on the curve [6].
Some advantages of Geohashing are its simplicity and easily tunable precision (by truncating the code). Z-curves are, however, not very efficient for certain types of queries (like range queries), since points on either end of a Z's diagonal appear to be closely located along the curve when they are not. Moreover, since this method involves projecting the spherical Earth onto a plane, points that are close together may end up on opposite sides of the plane and appear far apart along the curve.
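The bit interleaving at the heart of a Z-curve can be sketched as follows; the grid resolution and the linear scaling of the coordinates are illustrative choices, not the encoding of any particular Geohash implementation:

```python
def z_encode(lon, lat, bits=16):
    # quantize each coordinate to `bits` bits over its valid range,
    # then interleave the bits (x in even positions, y in odd positions)
    x = int((lon + 180) / 360 * ((1 << bits) - 1))
    y = int((lat + 90) / 180 * ((1 << bits) - 1))
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

assert z_encode(-180, -90) == 0            # south-west corner maps to the curve's origin
assert z_encode(180, 90, bits=4) == 0b11111111  # north-east corner: all bits set
```

Truncating low-order bits of the code coarsens the grid cell, which is the tunable-precision property mentioned above.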
Figure 2.3 shows a map of the Earth partitioned by a Z-curve.
¹Not to be confused with xkcd 426.
Figure 2.3: A Z-curve used to partition the Earth into a 4x4 grid
Quadtree hashing
A Quadtree is a tree data structure that recursively subdivides a two-dimensional space into four
quadrants, as shown on fig. 2.4. The subdivision occurs per cell as its capacity (which can be an
arbitrary function) is reached, allowing the tree to be more detailed where appropriate [6].
Figure 2.4: The quadtree data structure
Quadtree hashing allows large empty sections to be quickly pruned when querying.
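A minimal sketch of the structure described above, with a fixed per-cell point capacity standing in for the arbitrary capacity function:

```python
class Quadtree:
    def __init__(self, x0, y0, x1, y1, capacity=4):
        self.bounds = (x0, y0, x1, y1)   # covers [x0, x1) x [y0, y1)
        self.capacity = capacity
        self.points = []
        self.children = None             # four sub-quadrants once subdivided

    def insert(self, x, y):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False                 # point is outside this cell
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((x, y))
                return True
            self._subdivide()            # capacity reached: split into quadrants
        return any(c.insert(x, y) for c in self.children)

    def _subdivide(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [Quadtree(x0, y0, mx, my, self.capacity),
                         Quadtree(mx, y0, x1, my, self.capacity),
                         Quadtree(x0, my, mx, y1, self.capacity),
                         Quadtree(mx, my, x1, y1, self.capacity)]
        for p in self.points:            # push existing points down
            any(c.insert(*p) for c in self.children)
        self.points = []
```

Because subdivision only happens where points accumulate, dense areas get deep, detailed subtrees while empty regions remain single shallow cells that can be pruned in one step.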
GeoSharding (Voronoi partitioning)
GeoSharding is the name of the policy used in this work, proposed by [6]. It is based on Voronoi partitioning, a method for dividing a space into several non-uniform regions based on distance to a given set of points in that space. The number of regions is equal to the number of points in the given set, as each point gets assigned the region of space that no other point is closer to. By computing the boundaries between regions a Voronoi diagram can be drawn, such as the spherical one shown on fig. 2.5, which has the benefit of being easy to understand visually.
Figure 2.5: A spherical Voronoi diagram of the world’s capitals
Source: rendered using jasondavies.com/maps/voronoi/capitals/
Unlike GeoHashing, Voronoi partitioning doesn’t suffer from the same disadvantages: points that
are close to each other, regardless of where they are on the map, are correctly considered close and
vice-versa.
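Region membership in a Voronoi partition needs no explicit diagram: a point simply belongs to the region of its closest site. A planar sketch (squared Euclidean distance for simplicity; the spherical case would substitute great-circle distance):

```python
def voronoi_region(point, sites):
    # sites: {name: (x, y)}; returns the site whose Voronoi region contains the point
    return min(sites, key=lambda s: (sites[s][0] - point[0]) ** 2
                                    + (sites[s][1] - point[1]) ** 2)

sites = {"A": (0, 0), "B": (10, 0), "C": (5, 10)}
assert voronoi_region((1, 1), sites) == "A"
assert voronoi_region((9, 1), sites) == "B"
assert voronoi_region((5, 8), sites) == "C"
```

This is exactly the operation a router needs at query time, which is why the diagram itself only has to be computed when boundaries are of interest (e.g. for range queries).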
GeoSharding is a very versatile and efficient space partitioning policy, since more points can be added where more detail/capacity is required and the partitions are not constrained to a regular grid.
A big drawback of GeoSharding can be its implementation complexity for range and similar queries. In terms of computational complexity, although computing the Voronoi diagram can be slower than some alternatives, it can be pre-computed and cached whenever the cluster configuration changes, so that the impact on actual queries is minimal.
2.4 Existing solutions
In this section some viable existing solutions for storage and/or computation of spatial data are presented.
2.4.1 Spatial Hadoop
Hadoop is an open-source framework for distributed storage and computing of large-scale datasets. It's one of the most mature and popular solutions for Big Data processing, having appeared at a time when large-scale computation was still mostly done on single powerful computers [10].
Hadoop's design is based on taking advantage of data locality: it gets around the problem of moving very large datasets to the processing nodes by instead moving the (small) processing software to the storage nodes and executing it there, in parallel. This design has two major parts: the Hadoop Distributed File System (HDFS) and the MapReduce programming model, which allows the programmer to easily write massively parallel programs at the cost of flexibility.
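The MapReduce model can be illustrated in miniature with a single-process toy (this is not Hadoop's API, just the shape of the computation):

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    # map phase: each record emits (key, value) pairs
    # shuffle phase: values are grouped by key
    # reduce phase: one reducer call per key
    groups = defaultdict(list)
    for rec in records:
        for k, v in mapper(rec):
            groups[k].append(v)
    return {k: reducer(k, vs) for k, vs in groups.items()}

# the classic word count, expressed as a mapper and a reducer
counts = map_reduce(
    ["to be or not to be"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda word, ones: sum(ones),
)
assert counts["to"] == 2 and counts["be"] == 2 and counts["or"] == 1
```

In Hadoop the map and reduce phases run on many machines in parallel, with the shuffle moving data between them; the programmer only supplies the two functions.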
The Hadoop Distributed File System is capable of storing very large files, even in the range of terabytes, with horizontal scalability and high availability via data redundancy. These files are as flexible as regular files: they can be structured or unstructured, JSON, images, etc.
The recommended use of Hadoop is large-scale data analysis or other types of long-running computations that don't need to be done in real time, since Hadoop does not support streaming computation as new data comes in.
Spatial Hadoop [5] extends Hadoop with spatial capabilities: data types, indexes, operations and a new high-level language called Pigeon for use with spatial data. The use of a dedicated language can be seen as a downside, since it makes existing Hadoop programs harder to reuse and might make spatial programs incompatible with other Hadoop extensions. Experienced Hadoop programmers might appreciate the high-level nature of the language and its familiar MapReduce paradigm, as well as the several tutorials available on the project's website.
2.4.2 GeoSpark
GeoSpark [17] is a recent and promising project built on the popular Apache Spark. First released in July 2015 and still in an experimental stage as of this writing, it's an open-source cluster computing system designed for large-scale spatial data processing. GeoSpark inherits many of its characteristics from Spark.
Compared to Hadoop, Spark provides parallel, in-memory processing, whereas traditional Hadoop
is focused on MapReduce and distributed storage.
Spark doesn't have a native data storage solution, so it's often used with the Hadoop distributed file system or other compatible data stores like MongoDB or Cassandra, making its configuration more complex than some alternatives.
The main advantage of Spark (and GeoSpark) is its performance, being up to 100 times faster than Spatial Hadoop's MapReduce [10], which is, in part, a consequence of its much more aggressive memory usage for in-memory computation. The heavy memory usage makes Spark a costly option, since RAM capacity is much more limited and expensive than SSD or HDD capacity.
GeoSpark adds spatial partitioning (including Voronoi diagram) and spatial indexing to Spark along
with geometrical operations, most notably range queries for searching inside regions of space [18].
Since spatial/geometrical support isn't native to Spark but is added by GeoSpark, the API for these operations is GeoSpark-specific, meaning that programs have to be re-written to take advantage of these features. Developing for GeoSpark is complicated not only by the complexity of the underlying system, Spark, but also by some consequences of its experimental status, like the lack of comprehensive and organized documentation for its API.
2.4.3 Cassandra
Cassandra [8] is an open-source NoSQL database designed for use with multiple, geographically distributed data-centers, offering high availability and large total throughput at the cost of extra read and write latency.
It's what's known as a wide column store: data is stored in tables with a large number of columns of different data types. While it may seem similar to relational DBMSs, Cassandra does not support join operations or nested queries.
Cassandra's recommended use is as a storage back-end for online services with large numbers of concurrent interactions, possibly in different geographical locations, with little data involved in each request. The lack of transaction support makes Cassandra unsuitable for applications that require atomic and durable write operations.
It does not support native server-side computation but it can work as a storage back-end for
Hadoop, replacing the Hadoop distributed file system [2].
2.4.4 MongoDB
MongoDB is a very popular open-source NoSQL DBMS that was designed to be easily scalable. To achieve simple horizontal scalability it avoids the complexity of relational DBMSs, relaxes consistency to eventual consistency and provides built-in support for sharding, replication and load-balancing.
It's a document database, meaning that instead of rows in tables it stores JSON documents in schema-less collections. The JSON format allows for the serialization of common programming data types like maps, strings, numbers and arrays. A simple JSON document example is given in fig. 2.6.
Figure 2.6: JSON document example
Source: https://docs.mongodb.com
High Availability (HA) is an important characteristic of MongoDB: data replication and automatic fail-over are built in and easy to set up by using replica sets.
MongoDB supports server-side distributed computation via MapReduce, as well as an abstraction over MapReduce called the Aggregation Pipeline, which aims to be a simpler way of using it. Other types of server-side computation, such as stored procedures, exist mostly as a convenience for small computations, since they are not distributed.
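An aggregation pipeline is expressed as an ordered list of stages. A sketch of that structure, using the standard $match and $group stage names (the field names are made up for illustration):

```python
# a pipeline counting active documents per city; each stage is a one-key dict
pipeline = [
    {"$match": {"status": "active"}},                 # filter early, so less data flows on
    {"$group": {"_id": "$city", "n": {"$sum": 1}}},   # then count documents per city
]

# the order of the stages is the order of execution
assert [list(stage)[0] for stage in pipeline] == ["$match", "$group"]
```

Such a pipeline would normally be handed to a driver call (e.g. a collection's aggregate method) for server-side, per-shard execution.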
Overall MongoDB is mostly designed for low-latency storage and querying but is also quite flexible,
with support for server-side computation, good scaling mechanisms and simple replication [2].
2.5 MongoDB (as of version 3.4)
In this section an overview of the relevant MongoDB features and characteristics is given.
2.5.1 Spatial data support
MongoDB has a native API for spatial data, which includes special queries for these data types and indexing support. It can handle both 2D plane data and 2D sphere data, the latter representing the surface of the Earth and being the concern of this work. MongoDB can understand not only point data but also more complex geometry, such as lines and polygons.
GeoJSON
MongoDB uses GeoJSON as the format for spatial data. A GeoJSON object is essentially a JSON document that includes a special field with the geometry-related information of the object: its type and coordinates.
Each GeoJSON object can be one of the following types: Point, LineString, Polygon, MultiPoint, MultiLineString, and MultiPolygon. The shape of the coordinates field varies with the type; for a simple point it's just a JSON array with the longitude and latitude of the object, in that order.
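A minimal document with a GeoJSON Point, shown here as a Python dict as it might be stored in MongoDB (the name and location field names are illustrative):

```python
doc = {
    "name": "Lisbon",                       # regular fields coexist with the geometry
    "location": {
        "type": "Point",
        "coordinates": [-9.1393, 38.7223],  # longitude first, then latitude
    },
}

assert doc["location"]["type"] == "Point"
assert doc["location"]["coordinates"][0] == -9.1393  # longitude comes first
```

Note the longitude-before-latitude order, which is the reverse of the colloquial "lat/long" convention and a common source of bugs.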
2.5.2 The layout of a MongoDB cluster
MongoDB sharding clusters are built from 3 different components:
• Router: Provides the interface between clients and the cluster and acts as a load balancer for the
shards in the cluster
• Config server: stores the meta-data and configuration of the cluster, like what data partition each
shard owns
• Shard: Stores a subset of the sharded database data
For high availability multiple routers can be used, and config servers and shards can be set up as replica sets, described in 2.5.4. A standard highly-available 2-shard cluster architecture is shown on fig. 2.7.
Figure 2.7: MongoDB Sharded Cluster Architecture
Source: https://docs.mongodb.com
2.5.3 Sharding
Sharding is a method to achieve horizontal scaling of a database via data partitioning between several
machines. This allows the load to be spread among the various machines, providing high-throughput
and supporting bigger datasets than can fit on a single server.
When coupled with MongoDB's router processes, sharding becomes transparent to the client.
MongoDB's sharding mechanism works per collection and uses shard keys to partition the data into several parts called chunks, which are then distributed between the various shards of the cluster.
Shard Keys
A shard key is an immutable field (or set of fields) selected by the user when sharding a collection, and should exist in all documents of that collection. MongoDB requires that each unique shard key value maps to only one chunk, so that routers can contact only the responsible shard when routing queries.
Figure 2.8: Partition of the dataset into chunks based on shard key values
Source: https://docs.mongodb.com
The performance and efficiency of a sharded cluster are affected by the choice of shard key, which should ideally distribute data evenly among all shards in the cluster.
A good shard key should have:
• Many possible values (high cardinality): If a shard key has only 4 possible values there can’t be
more than 4 chunks, limiting the horizontal scalability of the collection
• High frequency in the dataset: If only a small subset of shard key values appears in the dataset, the chunks that contain those values may limit throughput, as most of the dataset will be served from a minority of the existing chunks, as shown in fig. 2.9a
• Non-monotonic behaviour: If the shard key is a field that changes monotonically (like a counter), inserts are very likely to target a single shard at a time, as shown in fig. 2.9b
(a) Low frequency shard key (b) Monotonically increasing shard key
Figure 2.9: Examples of bad shard key properties
Source: https://docs.mongodb.com
Multiple fields in a compound index can be used as a shard key to attain these beneficial properties
if no single field is an ideal choice.
Besides the standard sharding method (called ranged sharding), hashed sharding is the only other supported method. Hashed sharding can be used to avoid monotonic behaviour: in this method the shard key must be a single field, which is passed through a hash function to get a pseudo-random output with a uniform distribution. An unfortunate side effect of hashed sharding is that documents with close shard keys won't end up in the same chunks, but that might not matter depending on the meaning of the chosen field.
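The effect of hashed sharding can be sketched with a stand-in hash function (illustrative only; this is not MongoDB's actual hash):

```python
import hashlib

def hashed_shard_key(value):
    # stand-in 64-bit hash: deterministic, uniform-looking output
    # that destroys any ordering or locality of the original values
    return int.from_bytes(hashlib.sha256(str(value).encode()).digest()[:8], "big")

assert hashed_shard_key(42) == hashed_shard_key(42)   # same input, same chunk
assert hashed_shard_key(1) != hashed_shard_key(2)     # neighbours scatter apart
```

A monotonically increasing field (like a counter or timestamp) hashed this way produces inserts spread across the whole key space, instead of all landing on the chunk holding the current maximum.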
Chunks
Chunks are contiguous ranges of the shard key's key space that are associated with a specific shard. Chunk ranges cannot overlap each other and must collectively cover the whole key space, to guarantee that each shard key value always maps to a single chunk.
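For ranged sharding this mapping can be modeled as a sorted list of split points; a simplified sketch:

```python
import bisect

# split points partition the key space into chunks:
# chunk i covers [splits[i], splits[i+1]) — minimum inclusive, maximum exclusive
splits = [float("-inf"), 10, 25, float("inf")]

def chunk_for(key):
    # binary search finds the unique chunk whose range contains the key
    return bisect.bisect_right(splits, key) - 1

assert chunk_for(5) == 0
assert chunk_for(10) == 1    # a chunk's minimum is inclusive
assert chunk_for(24.9) == 1  # its maximum is exclusive
assert chunk_for(30) == 2
```

Because the ranges tile the key space with no gaps or overlaps, every key resolves to exactly one chunk, which is what lets routers contact a single shard per key.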
To prevent the data distribution throughout the cluster from becoming too unbalanced, the config server periodically runs a background process, called the balancer, that migrates chunks to different shards, and also uses a mechanism called chunk splitting.
Chunk splitting is a process where a big chunk (a chunk with a lot of documents in its range) gets split into two smaller chunks with a similar number of documents that together cover the same key range as the original chunk. Chunk splitting does not cause migrations and happens when a chunk grows over the chunk size (64 MiB by default), a value that can be set by the user. Smaller chunk sizes increase granularity, which allows for better data distribution at the cost of more migration overhead (chunks moving between shards more often). If a chunk's range is already a single value of the key space it becomes indivisible and can become a performance bottleneck.
2.5.3.A Zones
Zones can be used to get more manual control of chunk migration, by allowing the user to choose
specific shard key ranges and assign them to a shard or a group of shards. The balancer prioritizes
zones before other metrics when deciding where to migrate a chunk to. There’s no requirement of
covering the whole key space with zones, so the user can assign zones to only a subset of it.
Some use cases for this feature are to route data based on shard performance or, in larger systems,
to store data based on the location of data centers. An example of the second use case would be to add
a field to each document with the continent where that data is relevant, like where a user is located,
and then assign the various values of that field to different sharding zones, each going to a different
data center.
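Zone-first placement can be modeled as follows; the data structures and the pick_shard helper are invented for illustration and do not mirror MongoDB's balancer code:

```python
def pick_shard(chunk_min, zones, all_shards, load):
    """Pick the least-loaded shard for a chunk, restricted to the zone
    covering the chunk's minimum key if such a zone exists.

    zones: list of (range_min, range_max, shards_in_zone)
    load:  {shard: current chunk count}
    """
    candidates = all_shards
    for lo, hi, zone_shards in zones:
        if lo <= chunk_min < hi:
            candidates = zone_shards  # zone takes priority over other metrics
            break
    return min(candidates, key=lambda s: load[s])

zones = [(0, 100, ["eu1", "eu2"])]           # keys 0..99 pinned to EU shards
load = {"eu1": 5, "eu2": 3, "us1": 0}
assert pick_shard(50, zones, ["eu1", "eu2", "us1"], load) == "eu2"   # zoned
assert pick_shard(200, zones, ["eu1", "eu2", "us1"], load) == "us1"  # unzoned
```

The second call shows that keys outside every zone fall back to ordinary balancing across all shards, matching the fact that zones need not cover the whole key space.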
Indexes
Indexes are required when sharding; more specifically, when sharding on a certain key an index for that same key must exist first. Indexes exist per collection and are essentially sorted lists of key values that exist in the collection, with pointers to the documents that contain those values. Indexes allow certain queries to happen without collection scans, which are very expensive, and also provide sorted results at little cost by using the sorted nature of the index (an example of this is shown on fig. 2.10).
Figure 2.10: An index being used to return sorted results
Source: https://docs.mongodb.com
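The idea can be sketched with a sorted list standing in for the index (a simplified model, not MongoDB's B-tree implementation):

```python
import bisect

# a (key, doc_id) index kept sorted by key
index = sorted([(31, "d1"), (25, "d2"), (42, "d3"), (25, "d4")])
keys = [k for k, _ in index]

# range query 25 <= key < 40 via binary search, without a collection scan
lo, hi = bisect.bisect_left(keys, 25), bisect.bisect_left(keys, 40)
hits = [doc for _, doc in index[lo:hi]]

assert hits == ["d2", "d4", "d1"]  # results come out already sorted by key
```

Both benefits from the text appear here: the scan touches only the matching slice of the index, and the results are sorted as a free by-product of the index order.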
2.5.4 Replica sets
Replica sets are small clusters of MongoDB daemon instances (config servers or shards) that store the same data set. A replica set has one primary instance, which responds to both reads and writes, and one or more secondary instances, which can only respond to reads (and by default don't do even that). A simple replica set architecture with two secondary instances is shown on fig. 2.11.
Figure 2.11: Replica set architecture
Source: https://docs.mongodb.com
Changes to the primary's dataset are propagated asynchronously to the secondary instances, meaning that after a write is confirmed to a client, reads on a secondary instance might still return old data for a short amount of time. This eventual consistency allows for better write performance by not waiting for the secondaries before confirming writes, and is the reason that secondary instances don't respond to read queries by default. Clients have the option of requesting that a write be acknowledged by at least n instances before being confirmed, to ensure write durability at the cost of performance.
In case of a failure of the primary (detected via a periodic heartbeat connection), the failover process consists of a vote in which one of the secondary instances is promoted to primary, as shown on fig. 2.12.
Figure 2.12: Replica set failover process
Source: https://docs.mongodb.com
2.5.5 Storage engines
MongoDB (as of version 3.0) supports multiple storage engines, which can add new capabilities or be optimized for certain hardware, making MongoDB a very versatile DBMS. This is done via an exposed API, with the following engines being shipped with MongoDB:
• WiredTiger storage engine: the default, suited for most use cases. Includes native compression
and concurrency control
• Encrypted storage engine: very similar to WiredTiger but including support for database encryp-
tion
• In-Memory storage engine: stores data in memory to provide low and predictable latency
3. GeoMongo
3.1 GeoMongo
GeoMongo is the name chosen for this work. It aims to make MongoDB capable of geographically aware sharding by using the GeoSharding partitioning policy discussed in section 2.3, and to study the usefulness of this feature.
Some knowledge of MongoDB’s internals is required to understand the reasoning behind or the
necessity of some changes that have been made. For this reason section 3.1.1 gives an overview
of MongoDB’s internals specific to sharding and any extra details that are considered relevant are
mentioned alongside the overview of the changes in section 3.1.2.
3.1.1 Internals of MongoDB’s sharding
Routing table
The routing table is a map of where a collection’s data is located, more specifically it’s a map of chunks
to shards.
Router (mongos)
The mongos process acts as the router/load-balancer for the cluster. It caches the routing table and
shard list in memory and has no persistent state.
The router also acts as a proxy between clients and the database cluster, so that clients can transparently access data without knowing the cluster configuration. A disadvantage of this is that the router's network connection can become a bottleneck, but MongoDB is designed to support many routers being used simultaneously.
Shards (mongod)
Shards are mongod instances in their default operation mode (or a replica set of them). They also cache the routing table and shard list, to be able to perform certain cluster operations like chunk migrations or sharded mapReduce.
Config server (mongod)
A mongod process can be configured as a config server, which is responsible for storing the cluster’s
metadata, like the routing table. Although the other server types cache some of the metadata the
config server acts as the single source of truth and is the only server type that stores the routing table
on disk.
Chunk and shard versions
Chunk versions are used as a way to detect stale routing table caches. Each new chunk of a collection must have a higher version than the previous ones, and chunk migrations increment the version of migrated chunks. Shards have a different shard version for each collection, which is simply the highest version among the chunks of that collection that they own.
Whenever a request is sent to a shard, the request must include the shard version it thinks the shard
is currently at. The shard compares its local version with the sent version and updates its cache if the
sent version is higher. If the sent version is lower, an error is returned telling the sender to update its
cache.
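The version handshake described above can be sketched as follows (a simplified model of the protocol, not MongoDB's actual code):

```python
def on_request(local_version, sent_version):
    """Shard-side handling of the shard version attached to a routed request."""
    if sent_version > local_version:
        return "refresh-own-cache"   # this shard's routing table is stale
    if sent_version < local_version:
        return "stale-config-error"  # the sender must refresh its cache and retry
    return "proceed"                 # both sides agree; serve the request

assert on_request(5, 7) == "refresh-own-cache"
assert on_request(7, 5) == "stale-config-error"
assert on_request(5, 5) == "proceed"
```

Whichever side holds the lower version refreshes from the config server, so caches converge lazily without any broadcast of routing-table updates.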
Chunks
Each chunk has a few important pieces of information:
• Minimum and maximum keys: the values that define the chunk’s range. From the minimum
(inclusive) to the maximum (exclusive)
• Version: chunk version, which includes a time-stamp
• Shard: the shard that owns the chunk
The use of minimum and maximum keys as an integral part of a chunk implies that sharding should
always be done by splitting up along a single dimension. Dimensionality reduction methods (like Z-
curves) can be used to shard spaces with multiple dimensions but a better solution would be to use a
more general approach that could handle multiple-dimension spaces defined by the policies.
Balancer
The balancer is a routine that runs periodically in the config server, checking the data distribution among the shards, and can begin chunk migrations between shards if it detects that a shard is under- or over-utilized.
Chunk splitting
Every time an insert is made, the router handling the request checks if the target chunk goes over the chunk size limit and, if so, issues a chunk split command to the shard that owns the chunk.
Once a chunk is split there is no way to reduce the number of chunks, as there is no chunk merging capability.
Request flow in a router
Whenever the router gets a request, a new thread is created to handle the connection. The request is then interpreted in order to prepare the resources needed to process it, and asynchronous connections to config servers or shards are created, if required. After these connections complete, the result is assembled and sent to the client. Note that the router doesn't just point the client to the relevant shards: it makes the connections itself, so that the cluster configuration is transparent to the client.
Figure 3.1: A visual summary of what data is stored in each type of machine in a MongoDB cluster: the router (mongos) caches sharding information but holds no permanent data; the config server (mongod) is the single source of truth about where data is in the cluster (chunks, chunk ranges, shards, collections, etc.); the shards (mongod) store the actual data (for sharded collections, only a subset of it) and also cache sharding information
3.1.2 GeoMongo Implementation
Most of the time spent implementing GeoMongo went into understanding the code relevant to sharding in MongoDB, which is a very large project with over 5000 files and over 1.2 million lines of source code.
MongoDB's code-base has some internal assumptions about the properties of shard keys and chunks that are broken by the GeoSharding partition policy. This led to a much more complex and time-consuming implementation than originally predicted.
There are five major code sections that had to be changed to add support for a new partition policy:
• User interface for sharding
• Chunk creation
• Sharding initialization
• Write operations
• Read operations
There are also several other minor code sections that required updates, however most of these are
very similar and are either related to chunk validation or assumptions about chunk object structure
and not worth mentioning.
In the following sections the major changes are discussed.
User interface
From the user's perspective not much changed. Two new commands were added to the client shell and the router process: a command to shard a collection with the GeoSharding policy, and one to add or change chunk locations or assigned shards.
The user interface remains very similar to btree and hashed sharding with the difference that the
user gets manual control over chunk ranges and what shards they’re assigned to. Although the choice
of shard can be automated via the balancer, dynamically choosing good locations to assign to each
chunk is a complex problem that should be studied in a future work.
Since MongoDB has native support for geographic queries, like geoNear, the user can possibly get
immediate performance benefits without changing existing application code, just by choosing a new
sharding policy.
Chunk creation
The chunk creation process is different for geographical chunks but also simpler in it’s current form, as
the chunk ranges (the Voronoi diagram) aren’t calculated at creation time, which could be a valuable
optimization.
Apart from verifying if a chunk with the same position already exists, creating a geographical chunk
involves using a higher version than the last created chunk, storing longitude and latitude instead of
min and max, setting the chunk’s name-space (what database and collection it belongs to) and what
shard it will initially be assigned to. In practical terms this boils down to creating a document in
a special collection that is stored on the config server, representing the chunk. These documents
can be fetched and/or updated by other routers or shards when stale caches are detected or chunk
modifications are made.
Unlike what's done for the standard btree and hashed sharding mechanisms, automatically selecting uniformly distributed locations for chunks when setting up sharding on a collection isn't desirable, since expecting a dataset to be uniformly distributed over the entire sphere is unreasonable.
There is, however, an opportunity for automatic adjustment of chunk locations after the dataset is loaded, or dynamically as it grows, perhaps by using statistical analysis or machine learning algorithms.
Sharding initialization
When initializing sharding for an empty collection MongoDB still expects at least one chunk to exist
at the end of the process. Either a default location can be chosen, which would better fit the existing
interface of MongoDB, or the user can be asked for the location of the first chunk.
The implemented solution uses a default location that can later be changed by the user, since this
fits the existing interface and can transparently support a possible future work that dynamically selects
new locations for chunks as the database is used.
Write operations
When writing to the database, whether it's a single document or a batch of multiple documents, each document's shard key goes through a method of the ChunkManager object, called findIntersectingChunk, that is responsible for mapping the document's shard key to the correct chunk. The necessary changes here involved adding a new code path, for when the shard key is of type 2dsphere, that extracts the coordinates and iterates through the existing chunks to find the closest one. A new method, findNearestGeoChunk, was created for this, since it can be reused for read queries. That method is described in detail in GeoSharding routing, at the end of this section.
Note that depending on the database configuration documents might not be required to have a
shard key, in which case they get placed into the primary shard.
Read operations
Reading operations can be made up of multiple filters.
Read operations, unlike write operations, aren't handled document by document, since there's no way to know how many documents will be involved in a query before it's executed. This means that multiple chunks and shards may need to be contacted in order to execute the request. ChunkManager objects have a method, getShardIdsForQuery, responsible for interpreting the query's filters and returning a list of shards that may contain documents that fit the filters.
In the case of GeoSharding the changes are similar to what was done for write operations: A new
code path was added for when one of the filters included coordinates to search for. The same process
of matching the relevant chunk is done and the respective shard is returned.
When sharding is enabled an additional filter is added to every read query that instructs shards that
are contacted to filter out any documents that are found but don’t belong to chunks they own, since
during/after a migration documents might still exist on the old shard for a while.
If none of the filters are related to a document’s coordinates all shards have to be contacted.
Minor changes
There were also other, smaller, changes necessary for GeoMongo to work:
• Chunk ranges: MongoDB includes a mechanism for btree and hashed sharding to cache chunk ranges. In its current form GeoSharding makes no use of this feature, but it could be very useful for range-based queries, allowing the Voronoi diagram to be computed at chunk insertion/modification time rather than query time.
• Chunk IDs: Chunks were previously identified by their minimum key, since the complete shard key space had to be assigned and chunk ranges couldn't overlap. Longitude and latitude don't have the same characteristics, so both are needed to identify a chunk.
• Chunk validation: chunk ranges and min/max values are validated in all server types and in several parts of the code-base. These checks range from verifying that the minimum key is smaller than the maximum key to verifying that the whole key space is accounted for with no overlaps. For GeoSharding the checks are simpler: make sure that the longitude and latitude values are valid, and prevent multiple chunks from having the same location.
• Chunk splitting: Chunk splitting was disabled for GeoSharding since, as stated before, choosing a location for new chunks can't be trivially automated. Furthermore, the policy dictates that each shard should own a single region of space, and there's no clear benefit to allowing more at the cost of extra implementation complexity.
Internal changes
As mentioned before, the GeoSharding policy breaks several of the assumptions that MongoDB makes with regard to shard keys and chunks, the biggest one being that the key space is one-dimensional.
A new shard key type was created, 2dsphere, matching the key type used for geographical indexes. Unlike other shard key types, GeoSharding needs two values: the coordinates.
Most of the changes that had to be made were, unsurprisingly, to code that handles chunks.
Because MongoDB is designed with 1-dimensional sharding in mind, chunks include a minimum and maximum key. Since GeoSharding is a 2-dimensional policy based on locations, the minimum and maximum keys have no meaning, so when using GeoSharding the chunk stores its virtual longitude and latitude instead. This change affected many parts of the code-base and all three types of server, because the minimum and maximum keys are assumed to exist and are validated in several different components responsible for the sharding mechanism, instead of relying on an abstraction for chunk identification and validation.
Chunk creation, validation and chunk IDs also required updates to remove limitations imposed by the lack of an abstraction over chunk identification and validation.
The auto-splitting feature is disabled under GeoSharding because it was outside the scope of this
work to dynamically create and choose locations for chunks, which would have been required to
support the feature.
GeoSharding routing
When inserting data or making queries, the router calculates which shard it should forward the request
to. While this is conceptually simple (select the shard that owns the chunk closest to the relevant
document), computational precision must be taken into account. Since the number of chunks should be
equal to the number of shards, and therefore very small, simply iterating over the chunks and calcu-
lating the great-circle distance to the document's coordinates is expected to have a negligible impact
on performance.
Figure 3.2: Great-circle distance, ∆σ, between two points, P and Q. Also shown: λp and ϕp
Source: Wikimedia Commons
Vincenty's formula was used to compute the great-circle distance, as it's more accurate than the com-
mon alternative, the Haversine formula, and is computationally inexpensive [16]. Although the Earth
is not a sphere but an oblate spheroid, and Vincenty's formula still has a small amount of error, these
details aren't problematic: precision isn't crucial for GeoSharding, since it only affects the locations
of chunk boundaries, and points that are close still end up in neighboring chunks. Performance is a concern
because this calculation is done multiple times for each write or read on the collection, but Vincenty's
formula is a relatively simple computation and should have a negligible impact.
Vincenty's formula for a sphere (which the Earth closely approximates) is shown in eq. 3.1, where
λp and λq are the longitudes of two points, P and Q, ϕp and ϕq are the latitudes, ∆λ and ∆ϕ are
the differences in longitude and latitude between the two points, and ∆σ is the great-circle distance
between them. Some of these angles are depicted in fig. 3.2.
∆σ = arctan( √( (cosϕq · sin∆λ)² + (cosϕp · sinϕq − sinϕp · cosϕq · cos∆λ)² ) / ( sinϕp · sinϕq + cosϕp · cosϕq · cos∆λ ) )    (3.1)
For computation, the function atan2(y, x) should be used in place of arctan(y/x), because it receives
the arguments separately and can thus determine the quadrant of the computed angle and return valid
results when x = 0.
In the case of ties when computing the closest chunk, the one with the smallest ID is chosen.
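Putting the routing rule and the formula together, a minimal sketch in Python (the function names and the `(id, lon, lat)` chunk representation are illustrative, not MongoDB's actual C++ internals):

```python
import math

def great_circle_distance(lon_p, lat_p, lon_q, lat_q, radius=6371.0):
    """Vincenty's formula for a sphere (eq. 3.1), using atan2 instead of
    arctan so the quadrant is handled and x == 0 is a valid input."""
    phi_p, phi_q = math.radians(lat_p), math.radians(lat_q)
    dlam = math.radians(lon_q - lon_p)
    # Numerator of eq. 3.1: sqrt of the two squared terms.
    y = math.hypot(
        math.cos(phi_q) * math.sin(dlam),
        math.cos(phi_p) * math.sin(phi_q)
        - math.sin(phi_p) * math.cos(phi_q) * math.cos(dlam),
    )
    # Denominator of eq. 3.1.
    x = (math.sin(phi_p) * math.sin(phi_q)
         + math.cos(phi_p) * math.cos(phi_q) * math.cos(dlam))
    return radius * math.atan2(y, x)  # central angle times Earth radius, in km

def closest_chunk(doc_lon, doc_lat, chunks):
    """chunks: list of (chunk_id, lon, lat). Ties broken by smallest ID."""
    return min(
        chunks,
        key=lambda c: (great_circle_distance(doc_lon, doc_lat, c[1], c[2]), c[0]),
    )[0]
```

With chunk count equal to shard count, this linear scan is the whole routing step described above.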
Range based queries and similar
Unfortunately, GeoMongo does not have optimized range queries (in its current form, GeoMongo's
range queries send requests to all shards), even though such queries could take full advantage of the
data distribution of GeoSharding: range-based queries (and similar ones, like intersection) aggregate
data points that are geographically close, which means that with GeoSharding there is a good
probability that only a single shard (or very few) would need to be contacted, lowering the resource
impact of the query and leaving the cluster with more capacity to handle other queries simultaneously [7].
If these queries had been optimized, these would be the steps:
1. When a chunk is added/modified, re-compute the Voronoi diagram of the chunk ranges. More
specifically, compute the boundaries (edges) of the ranges using Delaunay triangulation, which
can be done in O(n log n) time (n being the number of regions) with the divide and conquer
algorithm by Leach [9].
2. Cache the chunk range edges, to avoid unnecessary computation when running queries.
3. When a range query is received, contact any chunks with a center within the range, as these are
definitely included in the search.
4. Compute the range boundaries that intersect the query's search circle. The chunks to which these
boundaries belong contain a region within the search circle and should also be contacted.
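The intersection test can also be approximated without computing the diagram explicitly: since Voronoi cells are convex, a cell intersects the query disk only if its site lies inside the disk (step 3) or the cell owns some point of the boundary circle (a sampled stand-in for step 4). A planar, Euclidean sketch of this idea (illustrative only; a real implementation would use great-circle distances and the cached Delaunay edges):

```python
import math

def nearest_site(px, py, sites):
    """Index of the Voronoi site owning point (px, py), planar Euclidean."""
    return min(range(len(sites)),
               key=lambda i: (px - sites[i][0]) ** 2 + (py - sites[i][1]) ** 2)

def chunks_for_range_query(qx, qy, r, sites, samples=360):
    """Approximate the set of chunk regions a circular range query touches.

    A convex Voronoi cell intersects the query disk iff its site is inside
    the disk (step 3) or it owns some point of the boundary circle (step 4);
    sampling the circle makes the second test approximate.
    """
    # Step 3: cells whose site (chunk center) lies inside the query circle.
    hit = {i for i, (sx, sy) in enumerate(sites)
           if (qx - sx) ** 2 + (qy - sy) ** 2 <= r * r}
    # Step 4 (sampled): cells owning a point on the boundary circle.
    for k in range(samples):
        a = 2 * math.pi * k / samples
        hit.add(nearest_site(qx + r * math.cos(a), qy + r * math.sin(a), sites))
    return hit
```

Only the shards owning the returned chunks would then be contacted, instead of broadcasting to every shard.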
4. Evaluation

Contents
4.1 Performance evaluation methodology
4.2 Performance evaluation results
4.3 Evaluation of MongoDB as a base for GeoMongo
In this chapter an evaluation of GeoMongo is done, both in terms of performance and implementation.
4.1 Performance evaluation methodology
The performance testing was done on a cluster of 6 equivalent machines and a 7th, different, machine
connected on the same network was used as the client for all the benchmarks. In this section the
details about the machines and benchmarking methodology are given.
Cluster machines specifications
All the cluster machines were identical, both in terms of hardware, software, and network connection.
The relevant specifications for these machines:
• CPU: Intel Core i7-2600k @ 3.4GHz, a 4-core, hyper-threaded CPU with Sandy Bridge architec-
ture and 8MB of L3 cache.
• RAM: 12GB of DDR3 in dual-channel configuration.
• OS and Storage drive: 1TB, 7200RPM Western Digital Blue hard disk drive.
The machines were running Ubuntu 14.04 - Trusty Tahr, a long term support version of Ubuntu.
Client machine specifications
The client machine differed from the cluster machines in hardware and software. Its relevant
specifications:
• CPU: Intel Xeon E3-1230v5 @ 3.4GHz, a 4-core, hyper-threaded CPU with Skylake architecture
and 8MB of L3 cache.
• RAM: 16GB of DDR4 in dual-channel configuration.
• OS and Storage drive: 1TB, Samsung 850 Pro solid state drive.
The client machine was running Ubuntu 16.04 - Xenial Xerus, a long term support version of
Ubuntu.
Network specifications
All 7 machines were connected on the same local area network via a 1Gb/s switch. During all the
test runs, with the exception of reads for documents with images, the total throughput of any of the
machines never reached the network’s Gigabit limit so there’s no reason to believe it was a bottleneck
on those tests. More details about the exception are given in the test results discussion.
Benchmark software
Two different scripts were written to serve as benchmarking software. The scripts were written in
Python3 and used the PyMongo driver library to interact with the MongoDB router. Thread pools were
used to simulate multiple concurrent clients.
Because CPython uses a Global Interpreter Lock (GIL) to prevent memory corruption from data
races, threads can’t run in parallel, only interleave their execution (unless multiple processes are used).
While this prevents parallel computation, it’s not an issue for network-bound programs that block on
I/O, since threads quickly switch between each other as they block waiting for request responses.
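The effect is easy to demonstrate with a thread pool over a simulated blocking call (a sketch: `fake_request` stands in for a blocking PyMongo call; the actual benchmark scripts are in the appendices):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(doc_id):
    """Stand-in for a blocking PyMongo call; sleeps as if waiting on the network."""
    time.sleep(0.05)
    return doc_id

def run_benchmark(n_requests, n_threads):
    """Issue n_requests through a pool of n_threads, returning elapsed time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(fake_request, range(n_requests)))
    return time.perf_counter() - start, results

# Despite the GIL, 16 threads overlap sixteen ~50 ms waits instead of
# serializing them, so wall-clock time drops by roughly a factor of 16.
serial_time, _ = run_benchmark(16, 1)
parallel_time, _ = run_benchmark(16, 16)
```

The same reasoning is why increasing the number of client threads matters in the results below: the benchmark is network-bound, not CPU-bound.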
With both reads and writes, the insert/query order was randomized for each run, to avoid hitting the
shards in a fixed order, which would concentrate the load on a single machine at a time.
The scripts can be seen in appendix A (write benchmark) and appendix C (read benchmark).
Cluster configurations tested
Three different configurations were tested for the cluster: 1, 2 and 4 shards. Using no more than 4
shards allowed the router, config server and client to live on their own machines, preventing any load
from these processes from interfering with the performance of the shard processes.
Figure 4.1: Architecture of the benchmarking cluster (with varying numbers of shards): shards 1–4 (mongod), the router (mongos) and the config server (mongod), which handles cluster operations
Policies tested
GeoSharding was, of course, tested. For all tests with more than one shard, the chunk locations were
handpicked to achieve a near-uniform distribution between all shards. Besides time constraints, this
decision was made because the data distribution in GeoSharding depends completely on the locations
chosen for the chunks, so they should either be picked at least somewhat competently or be dynamically
calculated (which is, unfortunately, out of the scope of this work). The tests done with a single shard
should be nearly equivalent to the worst-case scenario, where all the data gets allocated to a single
chunk.
Besides GeoSharding, Hashed sharding was also tested as a means of comparison. Hashed sharding
was chosen as it's the most adequate policy when a context-aware one isn't available, and MongoDB
does not have a geography-aware sharding policy. The sharding key used for Hashed sharding was the
auto-generated document ID, since each document gets a unique one.
Dataset used
The performance evaluation was made using a real-world dataset of underground water resources
in the United States of America. The dataset is comprised of 4648 entries distributed throughout
the country’s territory, including Alaska and islands in the Pacific Ocean like Guam but mostly in the
mainland. Figure 4.2 shows that the distribution isn’t uniform even on mainland USA.
Figure 4.2: Underground water resources of the USA (mainland region shown)
Two variations of the dataset were tested: the raw dataset, where each document is very small (on
the order of 100 bytes), and a variation where each document had a 500KB image attached to it, to
compare the effects of document size.
The dataset can be downloaded from 1. An example of a document from the dataset can be seen in
appendix E.
Benchmark execution
Each test was repeated three times and its results were averaged. The variation between runs
was small, so little information is lost by averaging the results. The complete results are included in
appendix B (writes) and appendix D (reads).
The write performance test consisted of writing the whole dataset, randomly ordered, to the cluster.
Different numbers of threads (1, 2, 4, 16 and 64) were used to simulate multiple concurrent clients.
The database was dropped between each run of a write performance test.
For read performance tests, the dataset was loaded once at the start and then several runs of
reads were performed, with different numbers of concurrent threads to simulate multiple clients,
using the same thread counts as the write performance test: 1, 2, 4, 16 and 64.
4.2 Performance evaluation results
In this section the benchmark results are presented and discussed. As said before, two versions of
the dataset were used, with remarkably different, but expected, outcomes: a text-only version and a
version where every document had a moderately sized image attached to it. The same benchmarks
were used for both versions.
4.2.1 Attached image documents
For this set of tests an image of roughly 0.5MB (473.35KB) was attached to each document, bringing
the total size of the dataset to about 2.4GB. Because of the size of each document the tests became
network/communication driven and the difference between GeoSharding and hashed sharding was
hidden by the transfer time of each document.
1 https://nwis.waterdata.usgs.gov/usa/nwis/gwlevels
Read performance
Figure 4.3: Read performance of documents with attached image with: a) 1 shard, b) 2 shards and c) 4 shards (time to read the dataset, in seconds, vs. number of concurrent reads; Hashed Sharding vs. GeoSharding)
Analyzing fig. 4.3, both policies had very similar results, performing equally within margin of error.
Because of how large each document was, the test became communication driven: the transfer time of
each document was much larger than the time taken to find it, which significantly reduced the impact
of the sharding policy.
Figure 4.4: Horizontal scaling of read performance of documents with attached image, measured with 64 client threads (time to read the dataset 10 times, in seconds, vs. number of shards)
In fig. 4.4 the horizontal scaling for reads under both policies can be seen. There is very little
improvement, which seems to be a consequence of even 1 shard getting very close to the theoretical
maximum throughput of the network connection: at 1Gb/s, 2.5GB of data should take 20 seconds
to transfer. 80% of that throughput (800Mb/s, for a running time of 25 seconds) is achieved with a
single shard, so adding any additional machines can yield at most a 20% improvement
in this test. Unfortunately the equipment needed to remove this bottleneck was not available, but as
the running time was dominated by document transfer it's unlikely that the two policies would show a
significant difference between each other.
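The back-of-the-envelope bound can be checked directly (all figures taken from the discussion above):

```python
DATASET_GB = 2.5     # approximate dataset size with attached images
LINK_GBPS = 1.0      # Gigabit network link

ideal_seconds = DATASET_GB * 8 / LINK_GBPS    # 20 s at full line rate
measured_seconds = 25.0                       # observed single-shard run
utilization = ideal_seconds / measured_seconds  # fraction of the link used
headroom = 1 - utilization                    # best possible gain from more shards
```

With 80% of the link already used by one shard, extra shards can recover at most the remaining 20%.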
Write performance
Figure 4.5: Write performance of documents with attached image with: a) 1 shard, b) 2 shards and c) 4 shards (time to insert the dataset, in seconds, vs. number of concurrent insertions)
Writing larger documents is the only test in this evaluation where increasing the number of threads
on the client brings only a very small performance benefit, as seen in fig. 4.5a. This points to a limitation
on the write performance of a single shard. Writing 2.5GB in 125 seconds means a throughput of
20MB/s, which is not unreasonable for a 7200RPM drive when not writing sequentially (as it's writing
several thousand documents and not a single large file). The storage layer of MongoDB features
compression, which might also slow down write performance.
Figure 4.6: Horizontal scaling of write performance of documents with attached image, measured with 64 client threads (time to insert the dataset, in seconds, vs. number of shards)
Horizontal scaling of writes was practically linear with the number of shards, as shown in fig. 4.6;
unlike reads (which were limited by the network), writes seem to be directly limited by the write
capability of each shard, hence the linear scaling. With 4 shards the network throughput approached
the theoretical limit, but with a big enough margin that it's safe to assume it wasn't limited by the
connection.
Performance was pretty much the same for both policies, which is expected for insertions, and
even more so in the case of larger documents, as the transfer time (and in this case, disk-write time)
overshadows their impact.
4.2.2 Text-only documents
The documents in the text-only version of the dataset were quite small, on the order of hundreds of
bytes. GeoSharding performed either on par or significantly better than hashed sharding in both reads
and writes.
Because the text-only version of the dataset is quite small, at roughly 0.5MB in total, some test runs
took less than 1 second. To prevent this, the text-only benchmark was changed to read the dataset
10 times, raising the running time of the fastest runs to an acceptable level, where start-up overheads
and initial network latency should have negligible impact.
Read performance
Figure 4.7: Read performance of text-only documents with: a) 1 shard, b) 2 shards and c) 4 shards (time to read the dataset 10 times, in seconds, vs. number of concurrent reads)
As can be seen in fig. 4.7a multi-threading the client (or using multiple clients) can significantly
improve the read times, even with a single shard in the cluster. This is because MongoDB is capable of
opening multiple threads to respond to clients, parallelizing the data transfer and reducing the effect
of round trip latency between requests. Beyond 16 threads the improvement was small to insignificant,
however.
By default, MongoDB will accept up to 65536 simultaneous connections and the storage engine,
WiredTiger, has a limit of 128 concurrent read/write operations. The bottleneck thus appears to lie
elsewhere, perhaps in the machines' CPUs (4 cores, 8 hardware threads), although the CPUs didn't
seem over-utilized during the test runs.
Still in fig. 4.7a, GeoSharding can be seen to perform on par with or even slightly better than hashed
sharding, showing that GeoSharding routing does not introduce significant read overhead when
compared to hashed sharding.
In fig. 4.7b, with 2 shards, GeoSharding shows significantly smaller running times, especially when
the client uses a large number of threads: with 64 threads GeoSharding is 3.1 times faster
than hashed sharding. This is likely because with hashed sharding the router, when given a pair of
coordinates, can't calculate which shards might have documents with those coordinates and has to
contact both shards. Because there are multiple concurrent requests this can actually slow down the
test more than expected, as each request causes an unnecessary slowdown to other shards, impacting
other requests as well.
GeoSharding allows the router to contact only the relevant shard for each query, lowering the
running time for the query and minimizing the impact on other concurrent queries.
With 4 shards (fig. 4.7c) the same effect can be observed, this time with an even bigger difference,
with GeoSharding being 5.1 times faster than hashed sharding.
Figure 4.8: Horizontal scaling of read performance of text-only documents, measured with 64 client threads (time to read the dataset 10 times, in seconds, vs. number of shards)
Fig. 4.8 shows the best-case scenario for each policy and each number of shards, so that the
horizontal scaling capability of both policies can be analyzed. The advantage of GeoSharding over
hashed sharding is quite large, with GeoSharding even achieving more than linear scaling: moving
from 1 shard to 2 increases performance by a factor of 4.1, and moving to 4 shards by a further
factor of 2.8.
Besides the reasons mentioned in the discussion of fig. 4.7, read locks might also explain the big
improvements in performance. With MongoDB, as with most DBMSs, reads and writes attempt to lock
the database to prevent other requests from using incomplete/incorrect documents. MongoDB allows
multiple reads to share a single lock but there is still a performance impact, as locks are released with
their respective read queries and a new lock must be obtained for the next group of queries. When
using sharding on MongoDB, read locks happen at the level of each shard, essentially multiplying the
number of possible concurrent query batches by the number of shards.
Write performance
Figure 4.9: Write performance of text-only documents with: a) 1 shard, b) 2 shards and c) 4 shards (time to insert the dataset, in seconds, vs. number of concurrent insertions)
With writes (fig. 4.9) a similar trend can be seen in all cluster configurations with regard to threading
on the client: performance steadily improves until somewhere between 4 and 16 threads, probably 8,
as that is the number of hardware threads on the CPUs of all machines involved.
As expected, GeoSharding is very much on par with hashed sharding during insertions, since both
policies only contact one shard when inserting a document.
Figure 4.10: Horizontal scaling of write performance of text-only documents, measured with 64 client threads (time to insert the dataset, in seconds, vs. number of shards)
With regards to horizontal scaling, in fig. 4.10, both policies behave similarly: a modest improve-
ment can be seen when going from 1 to 2 shards but there are no significant gains when going beyond
2 shards.
The bottleneck might be the router, since it acts as a proxy between the client and the cluster. The
default connection limit for this component is 10 thousand connections, however, so at least it doesn’t
seem to be a configuration issue.
Another possibility is that, with a total running time of less than a quarter of a second, the dominant
factors might simply be the initial connection latency and the time it takes to start the threads on the
server-side. Unfortunately the text-only dataset is small enough for this to be a possible issue.
Inserting the data-set multiple times was not a reasonable option because MongoDB’s storage layer
features compression, which might produce inconsistent results, as the insert order was randomized
for each test run.
4.2.3 Network usage
Network usage during the tests was observed but not accurately recorded. It was fairly consistent for
the duration of each test, except for the very small ramp-up and ramp-down times. The following
observations could be made:
• The 1Gb/s connections bottlenecked the tests on the bigger dataset.
• For text-only documents the network was not a bottleneck: none of the machines had a total
throughput close to the Gigabit limit of the connection.
• The router's role as a proxy was clear, as its network throughput in each direction was symmetrical
to the sum of the rest of the machines' throughput.
• The config server used very little bandwidth in all test runs, rarely reaching 15Kb/s and staying
below 1Kb/s most of the time.
4.3 Evaluation of MongoDB as a base for GeoMongo
The decision to use MongoDB came with some unfortunate setbacks, due to lack of knowledge about
the project’s limitations. In this section the pros and cons of this decision are mentioned.
4.3.1 The pros
The native support for geographical queries on MongoDB meant that most of the client interface for
the work would already be in place. It also makes it possible for code already written for this API to
get a performance boost for free, by changing the cluster sharding configuration, which means that
existing libraries like PyMongo (the Python3 client for MongoDB) and many others would be able to
take advantage of the new policy from day 1.
Another big advantage for basing the work on MongoDB was the existence of geographical in-
dex support. Indexes can be very important to query performance by avoiding collection scans and
should come before sharding when attempting to improve query performance, since sharding makes
the system more complex and comes with added costs for hardware. With both existing API for geo-
graphical operations and geographical indexes, geographic sharding was the next logical step in terms
of geographical data support for MongoDB.
Besides the technical aspects, MongoDB's excellent user documentation was of great help when
setting up a cluster, writing the benchmarks, and understanding the impact of various factors on the
performance of a MongoDB cluster.
4.3.2 The cons
The biggest issues were the sheer size and complexity of MongoDB’s code-base, with over 5000 files
of source code, and the very limited documentation about the code-base. Because of this, getting a
decent understanding of the source code organization took a long time, which delayed GeoMongo’s
implementation much more than expected.
The way sharding is implemented in MongoDB assumes too much about partition policies, with
many types of checks and validations in several parts of the code-base that have to be worked around
when implementing a policy that does not fit those assumptions. A more general implementation, using
the object-oriented features of C++, could allow for different policies to define their own validation
rules for chunks and chunk ranges. It also has some other limitations, like forbidding the assignment
of a document to more than one chunk, that may make some of the future work suggestions hard to
implement.
In terms of server-side computation MongoDB was a letdown. Although MongoDB supports several
options for server-side computation, most of them run only on the router, which means that data must
be transferred from the shards, processed by a single node and then sent back to update the results.
MongoDB does support MapReduce and an easier-to-use abstraction called the aggregation pipeline,
which uses several rounds of MapReduce to perform a series of computations one after the other. The
problem is that MapReduce functions can't interact with the database for any reason, meaning
that many types of computations, like Kriging interpolation, can't take full advantage of data locality:
the results have to be sent to the router and then sent back to be written to the shards [3].
4.3.3 Conclusion
Overall, MongoDB was probably not the best choice to test the GeoSharding partitioning policy. Al-
though the feature can be useful for users of this DBMS, it was unnecessarily complex to modify.
Academically it was limiting in several areas, especially for server-side computation, where the advan-
tage of data locality is wasted by the lack of a mechanism to store results without going through the
router.
5. Conclusions

Contents
5.1 Summary
5.2 Future Work
5.1 Summary
The work presented in this dissertation was aimed at evaluating, with a practical implementation,
the usefulness of the GeoSharding partitioning policy as suggested by Joao Graca’s work, which used
simulations to compare different policies in [6].
The very popular database management system MongoDB was chosen as the base for this work. It
seemed on paper like a good fit, but its internal architecture proved to be limiting or difficult
to work with in several ways, for example for distributed computation. Despite the setbacks from
this choice, a working version of MongoDB with GeoSharding for point data, named GeoMongo, was
achieved.
In the evaluation chapter two scenarios were analyzed: datasets with moderately sized data points,
like georeferenced images, and datasets with small data points, like georeferenced sensor data. The
results showed that for larger data points MongoDB's request latency and throughput are good enough
that hardware limitations came into play, and the large transfer times for these bigger data points
made the choice of policy less important. For small data points, where the transfer time is small and
the internal overhead of the database matters more, the outcome was quite different, with GeoSharding
performing remarkably well compared to hashed sharding, making GeoMongo very well suited to
these kinds of datasets.
The results show great promise for use with datasets with very large numbers of small data points,
such as sensor data [15], and the way spatial data is distributed in GeoMongo clusters has great
potential to be leveraged further by future work.
Community contributions
During the development of GeoMongo a bug was found in the most recent released version of Mon-
goDB (3.4), which could crash a config server when receiving malformed input. The bug (issue
#27258) was reported to the project maintainers, who assigned it a critical severity level. The bug
was fixed a few days later, after the maintainers requested some more information and diagnostics to
help determine the root cause.
5.2 Future Work
In this work the GeoSharding partitioning policy was proved to work with point data on MongoDB, and
with very good results when used appropriately. Due to time constraints, some interesting ideas (range
queries and non-point data) could not be tested and unfortunately had to be left out. In this section
these and other relevant ideas are discussed, both on their possible benefits and their complications
5.2.1 Generalizing MongoDB’s sharding internals
As discussed in section 4.3, MongoDB's sharding internals are not a good fit for GeoSharding, requiring
ugly workarounds in several parts of the code-base to work. Generalizing chunks, chunk creation, chunk
ranges, chunk IDs and chunk splits so that they can be overridden by subclasses of chunks could allow
for a very clean and maintainable implementation of GeoSharding, and possibly of other policies that
can't be reduced to a single dimension without information loss.
5.2.2 Leveraging GeoSharding for range and intersect queries
A major feature of GeoSharding is how it keeps points located closely together in the same chunks/shards.
This can be leveraged by the native geographical queries of MongoDB to hit only the necessary shards
when searching for documents, instead of the current behavior of hitting every shard, necessary because
there was previously no way to predict which shards could have data for a certain location.
This is bound to be more complex than searching for specific locations, as the problem becomes
calculating which chunk regions the range query intersects, which can be more than one. To
avoid performance issues, caching the chunk regions might be necessary as well.
5.2.3 GeoSharding for non-point data
Sharding non-point geographical data, such as lines or polygons, in MongoDB would be a very complex
problem, due to certain limitations imposed by MongoDB on its sharding mechanisms (for very valid
reasons).
The two main issues are that in MongoDB each document must be assigned to one and only one
chunk, and documents must be filtered at the shard level. This means that a range query that intersects
a line or polygon may not find it if the range is contained in a set of chunks that don't include
the chunk that the line is assigned to.
One way to solve this problem is to make range queries (and similar queries) also search chunks
outside their range, which is not optimal and can be very inefficient. Another solution would be to place
copies of the document on the chunks it intersects; however, MongoDB's requirements don't allow this,
which means at least a heavy rewrite of several parts of MongoDB would be needed, or it may even be
incompatible with other existing features of MongoDB.
Although seemingly complex, implementing this would be very valuable and make GeoSharding
on MongoDB much more versatile, as lines are often used to represent roads, rivers and other similar
objects while polygons can often be used to represent buildings, land properties, cities, natural parks
and many other types of geographic areas.
Dynamic allocation of chunk regions would make GeoMongo much more useful, practical and intuitive,
since manually calculating good locations based on the geographical distribution of a dataset can
be time-consuming, or impossible if the dataset is still growing. For datasets that grow
over time, dynamic allocation can make it possible to keep the cluster balanced, without manual
intervention, long after it has been set up.
A possible method of doing this allocation is the K-means algorithm, with K equal to the
number of shards, implemented as MapReduce to take advantage of MongoDB's distributed compu-
tation capabilities, as described in [19]. This computation can be run in a similar fashion to garbage
collection in programming languages with managed memory: only periodically, when the
system load is low, or when the data distribution imbalance in the cluster rises above a certain threshold.
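As a rough illustration of the idea, the following single-node sketch computes candidate virtual coordinates with K-means (the `kmeans` helper and the sample coordinates are hypothetical, not part of GeoMongo; [19] expresses the same assignment and update steps as MapReduce jobs over the sharded collection):

```python
import random

def kmeans(points, k, iterations=20):
    """Plain Lloyd's K-means over (x, y) pairs."""
    centroids = random.sample(points, k)  # initialise from k distinct data points
    for _ in range(iterations):
        # Assignment ("map") step: group points by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            i = min(range(k),
                    key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2)
            clusters[i].append((x, y))
        # Update ("reduce") step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = (sum(x for x, _ in cluster) / len(cluster),
                                sum(y for _, y in cluster) / len(cluster))
    return centroids

# Hypothetical dataset: points clustered around two areas, one shard each.
random.seed(42)
pts = [(random.gauss(-9.14, 0.1), random.gauss(38.72, 0.1)) for _ in range(100)]
pts += [(random.gauss(-8.61, 0.1), random.gauss(41.15, 0.1)) for _ in range(100)]

# The resulting centroids are candidate virtual coordinates, one per shard.
virtual_coords = kmeans(pts, k=2)
```

With K set to the number of shards, each centroid lands near the centre of mass of one natural cluster of the data, making it a reasonable Voronoi seed for the corresponding chunk.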
A study on the characteristics of different algorithms for this purpose, using simulations, could be
very valuable not only for GeoMongo but also for other projects that have or plan to have similar
features.
Bibliography
[1] Abhishek S (2014) Distributed Computation On Spatial Data on Hadoop Cluster (10305042):1–39
[2] Abramova V, Bernardino J (2013) NoSQL databases: MongoDB vs cassandra. Proceedings of the
International C* Conference on Computer Science and Software Engineering, ACM 2013 pp 14–22,
DOI 10.1145/2494444.2494447
[3] Akdogan A, Demiryurek U, Banaei-Kashani F, Shahabi C (2010) Voronoi-based geospatial query pro-
cessing with MapReduce. In: Proceedings - 2nd IEEE International Conference on Cloud Computing
Technology and Science, CloudCom 2010, pp 9–16, DOI 10.1109/CloudCom.2010.92
[4] Dean J, Ghemawat S (2004) MapReduce: Simplified Data Processing on Large Clusters. Pro-
ceedings of 6th Symposium on Operating Systems Design and Implementation pp 137–149,
DOI 10.1145/1327452.1327492, URL http://static.googleusercontent.com/media/research.
google.com/es/us/archive/mapreduce-osdi04.pdf
[5] Eldawy A (2014) SpatialHadoop: towards flexible and scalable spatial processing using mapreduce.
Proceedings of the 2014 SIGMOD PhD symposium (December):46–50, DOI 10.1145/2602622.
2602625, URL http://dl.acm.org/citation.cfm?id=2602625
[6] Graca JPM, Silva JNdOe (2016) GeoSharding: Optimization of data partitioning in sharded geo-
referenced databases. PhD thesis, Instituto Superior Tecnico
[7] Ilarri S, Mena E, Illarramendi A (2010) Location-dependent query processing. ACM Computing
Surveys 42(3):1–73, DOI 10.1145/1670679.1670682
[8] Lakshman A, Malik P (2010) Cassandra. ACM SIGOPS Operating Systems Review 44(2):35, DOI
10.1145/1773912.1773922, URL http://portal.acm.org/citation.cfm?doid=1773912.1773922
[9] Leach G (1992) Improving Worst-Case Optimal Delaunay Triangulation Algorithms. In 4th Canadian
Conference on Computational Geometry 2:15, DOI 10.1.1.56.2323, URL http://citeseerx.ist.
psu.edu/viewdoc/summary?doi=10.1.1.56.2323
[10] Lenka RK, Barik RK, Gupta N, Ali SM, Rath A, Dubey H (2016) Comparative Analysis of Spatial-
Hadoop and GeoSpark for Geospatial Big Data Analytics URL http://arxiv.org/abs/1612.07433,
1612.07433
[11] Li D, Wang S (2005) Concepts, principles and applications of spatial data mining and knowledge
discovery. In: ISSTM 2005, pp 27–29, DOI 10.1109/WKDD.2009.13, URL http://citeseerx.ist.
psu.edu/viewdoc/summary?doi=10.1.1.368.6770
[12] Mennis J, Guo D (2009) Spatial data mining and geographic knowledge discovery-An introduction.
Computers, Environment and Urban Systems 33(6):403–408, DOI 10.1016/j.compenvurbsys.2009.
11.001
[13] Olea RA (2009) A Practical Primer on Geostatistics. Tech. rep., URL http://pubs.usgs.gov/of/
2009/1103/
[14] Rajesh D (2011) Application of Spatial Data Mining for Agriculture. International Journal of Com-
puter Applications 15(2):975–8887, DOI 10.5120/1922-2566
[15] Tsai CW, Lai CF, Chiang MC, Yang LT (2014) Data mining for internet of things: A survey. IEEE
Communications Surveys and Tutorials 16(1):77–97, DOI 10.1109/SURV.2013.103013.00206,
1111.6189v1
[16] Vincenty T (1975) Direct and Inverse Solutions of Geodesics on the Ellipsoid With Application of
Nested Equations. Survey Review 23(176):88–93, DOI 10.1179/sre.1975.23.176.88, URL http:
//www.tandfonline.com/doi/full/10.1179/sre.1975.23.176.88
[17] Yu J, Wu J, Sarwat M (2015) GeoSpark: A Cluster Computing Framework for Process-
ing Large-Scale Spatial Data. 23rd International Conference on Advances in Geographic Infor-
mation Systems (3):4–7, DOI 10.1145/2820783.2820860, URL http://www.public.asu.edu/~
jinxuanw/papers/GeoSpark.pdf
[18] Yu J, Wu J, Sarwat M (2016) A demonstration of GeoSpark: A cluster computing framework for
processing big spatial data. In: 2016 IEEE 32nd International Conference on Data Engineering,
ICDE 2016, pp 1410–1413, DOI 10.1109/ICDE.2016.7498357
[19] Zhao W, Ma H, He Q (2009) Parallel K-means clustering based on MapReduce. In: Lecture Notes in
Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), vol 5931 LNCS, pp 674–679, DOI 10.1007/978-3-642-10665-1_71
A Write performance benchmark script
from pymongo import MongoClient
from threading import Thread
from queue import Queue
import json, time, os, random, sys

LIMIT_ENTRIES = None  # Can be set to None to write all
ATTACH_FILE = False
N_THREADS = 1

ROUTER_IP = 'localhost'
ROUTER_PORT = 27017

DB_NAME = 'wells_US'
COLL_NAME = 'points'
DATASET_FILENAME = 'wells_US.json'
BIN_FILENAME = 'earth0.5MB.jpg'

client = MongoClient(ROUTER_IP, ROUTER_PORT)
db = client[DB_NAME]
collection = db[COLL_NAME]

with open(DATASET_FILENAME) as dataset:
    data_entries = json.load(dataset)

if LIMIT_ENTRIES is None:
    LIMIT_ENTRIES = len(data_entries)
data_entries = data_entries[:LIMIT_ENTRIES]  # honour the limit in the writes below

if ATTACH_FILE:
    with open(BIN_FILENAME, 'rb') as binary_file:
        binary_data = binary_file.read()  # read once; each entry gets the same bytes
    for entry in data_entries:
        entry['binary_data'] = binary_data

q = Queue()  # shared FIFO queue, to send work to the threads

def worker():
    while True:
        batch = q.get()
        result = collection.insert_many(batch, ordered=False)
        q.task_done()

for i in range(N_THREADS):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

print("Created", N_THREADS, "threads")

random.shuffle(data_entries)

start_time = time.time()
for i in range(N_THREADS):
    q.put(data_entries[i::N_THREADS])  # N_THREADS batches

q.join()  # wait for all threads to finish
time_insert_many = time.time() - start_time

print("insert many: " + str(time_insert_many) + " seconds elapsed")
print("wrote " + str(LIMIT_ENTRIES) + " documents totaling about " +
      str(LIMIT_ENTRIES * sys.getsizeof(data_entries[0]) / 10**6) + "MB")

Listing A.1: The benchmark script for write performance, in Python 3
B Write performance results
B.1 Documents with attached image
B.1.1 1 shard
Table B.1: Write benchmark for documents with image attached (total running time) - 1 shard

(a) Hashed sharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         145.53        143.53        145.93        145.00
2         131.91        129.67        126.54        129.37
4         130.29        128.19        127.54        128.67
16        124.66        127.26        129.62        127.18
64        124.87        125.34        126.00        125.40

(b) GeoSharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         146.98        139.76        148.42        145.05
2         128.58        131.82        125.71        128.70
4         129.42        126.05        127.13        127.53
16        122.14        124.67        126.04        124.28
64        125.00        127.07        123.58        125.22
B.1.2 2 shards
Table B.2: Write benchmark for documents with image attached (total running time) - 2 shards

(a) Hashed sharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         101.79        100.44        98.60         100.28
2         77.57         79.49         77.83         78.30
4         62.87         64.59         62.69         63.38
16        62.35         63.22         62.27         62.61
64        63.50         63.01         62.58         63.03

(b) GeoSharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         100.37        98.54         96.33         98.41
2         75.70         79.15         79.22         78.02
4         62.96         64.54         62.80         63.43
16        62.10         64.11         63.21         63.14
64        62.23         63.76         62.23         62.74
B.1.3 4 shards
Table B.3: Write benchmark for documents with image attached (total running time) - 4 shards

(a) Hashed sharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         80.56         80.46         79.84         80.29
2         64.90         65.23         64.31         64.81
4         40.65         40.81         41.74         41.07
16        38.12         37.14         36.63         37.30
64        34.44         33.81         33.11         33.79

(b) GeoSharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         80.54         80.10         79.18         79.94
2         64.80         66.68         62.63         64.70
4         40.10         41.58         42.81         41.50
16        38.38         36.95         36.91         37.41
64        34.87         33.62         32.14         33.54
B.2 Text-only documents
B.2.1 1 shard
Table B.4: Write benchmark for text-only documents (total running time) - 1 shard

(a) Hashed sharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         1.16          1.14          1.19          1.16
2         0.59          0.62          0.63          0.61
4         0.42          0.40          0.42          0.41
16        0.28          0.28          0.28          0.28
64        0.28          0.27          0.28          0.28

(b) GeoSharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         1.09          1.18          1.11          1.13
2         0.63          0.59          0.63          0.62
4         0.38          0.39          0.41          0.39
16        0.29          0.29          0.27          0.28
64        0.27          0.26          0.26          0.26
B.2.2 2 shards
Table B.5: Write benchmark for text-only documents (total running time) - 2 shards

(a) Hashed sharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         0.91          0.88          0.90          0.90
2         0.47          0.49          0.47          0.48
4         0.32          0.31          0.31          0.31
16        0.23          0.22          0.23          0.23
64        0.21          0.20          0.21          0.21

(b) GeoSharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         0.91          0.91          0.88          0.90
2         0.50          0.48          0.48          0.49
4         0.32          0.30          0.30          0.31
16        0.21          0.21          0.22          0.21
64        0.20          0.19          0.19          0.19
B.2.3 4 shards
Table B.6: Write benchmark for text-only documents (total running time) - 4 shards

(a) Hashed sharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         0.77          0.76          0.76          0.76
2         0.46          0.45          0.46          0.46
4         0.28          0.27          0.28          0.28
16        0.20          0.20          0.20          0.20
64        0.19          0.20          0.20          0.20

(b) GeoSharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         0.73          0.72          0.77          0.74
2         0.46          0.47          0.46          0.46
4         0.26          0.27          0.26          0.26
16        0.21          0.20          0.20          0.20
64        0.20          0.19          0.19          0.19
C Read performance benchmark script
from pymongo import MongoClient
from threading import Thread
from queue import Queue
import json, time, os, random

NUMBER_READS = None  # Can be set to None to read all
N_THREADS = 1

ROUTER_IP = 'localhost'
ROUTER_PORT = 27017

DB_NAME = 'wells_US'
COLL_NAME = 'points'
DATASET_FILENAME = 'wells_US.json'

client = MongoClient(ROUTER_IP, ROUTER_PORT)
db = client[DB_NAME]
collection = db[COLL_NAME]

with open(DATASET_FILENAME) as dataset:
    data_entries = json.load(dataset)

if NUMBER_READS is None:
    NUMBER_READS = len(data_entries)

q = Queue()  # shared FIFO queue, to send work to the threads

def worker():
    while True:
        coordinates = q.get()
        result = collection.find_one({'geometry': {
            'type': 'Point',
            'coordinates': coordinates
        }})
        q.task_done()

for i in range(N_THREADS):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

print("Created", N_THREADS, "threads")

start_time = time.time()
for i in range(NUMBER_READS):
    coordinates = random.choice(data_entries)['geometry']['coordinates']
    q.put(coordinates)

q.join()  # wait for all threads to finish
time_lookup_fromDB = time.time() - start_time

print('took ' + str(time_lookup_fromDB) + 's to look up ' + str(NUMBER_READS) +
      ' coordinates that exist in the dataset')

Listing C.1: The benchmark script for read performance, in Python 3
D Read performance results
D.1 Documents with attached image
D.1.1 1 shard
Table D.1: Read benchmark for documents with image attached (total running time) - 1 shard

(a) Hashed sharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         162.30        157.80        163.20        161.10
2         85.30         85.00         84.10         84.80
4         46.30         44.50         46.80         45.87
16        26.40         26.10         25.80         26.10
64        25.50         24.90         25.60         25.33

(b) GeoSharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         161.14        161.67        162.23        161.68
2         83.40         83.60         84.20         83.73
4         44.80         44.60         45.30         44.90
16        25.50         25.50         25.30         25.43
64        24.50         24.80         25.10         24.80
D.1.2 2 shards
Table D.2: Read benchmark for documents with image attached (total running time) - 2 shards

(a) Hashed sharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         156.30        162.10        153.90        157.43
2         79.20         75.20         73.80         76.07
4         49.30         47.50         48.10         48.30
16        28.20         33.20         25.80         29.07
64        24.30         25.90         25.00         25.07

(b) GeoSharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         149.80        151.40        152.20        151.13
2         76.20         77.40         75.80         76.47
4         45.00         47.80         41.90         44.90
16        21.90         24.20         23.00         23.03
64        24.10         22.90         22.30         23.10
D.1.3 4 shards
Table D.3: Read benchmark for documents with image attached (total running time) - 4 shards

(a) Hashed sharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         148.00        151.60        145.90        148.50
2         76.10         72.50         72.40         73.67
4         39.90         40.40         42.30         40.87
16        22.40         22.00         23.10         22.50
64        20.70         21.10         23.20         21.67

(b) GeoSharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         153.40        148.20        151.20        150.93
2         74.80         83.70         71.20         76.57
4         37.60         38.70         39.20         38.50
16        21.40         22.10         23.00         22.17
64        19.82         20.30         21.80         20.64
D.2 Text-only documents
D.2.1 1 shard
Table D.4: Read benchmark for text-only documents (total running time) - 1 shard

(a) Hashed sharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         1058.50       1060.50       1059.80       1059.60
2         542.36        541.28        542.38        542.01
4         287.00        286.23        285.98        286.40
16        207.50        208.23        208.49        208.07
64        207.21        210.02        209.10        208.78

(b) GeoSharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         1044.21       1045.70       1046.75       1045.55
2         538.27        536.44        535.58        536.76
4         284.54        282.67        282.18        283.13
16        206.34        206.85        205.73        206.31
64        208.11        200.80        208.96        205.96
D.2.2 2 shards
Table D.5: Read benchmark for text-only documents (total running time) - 2 shards

(a) Hashed sharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         1126.01       1126.52       1126.73       1126.42
2         542.82        543.19        543.36        543.12
4         272.41        272.88        273.20        272.83
16        158.19        158.24        158.40        158.28
64        154.51        154.59        155.37        154.82

(b) GeoSharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         649.85        648.73        648.15        648.91
2         315.57        315.89        315.55        315.67
4         153.90        153.64        153.70        153.75
16        58.33         57.83         58.69         58.28
64        50.20         49.75         49.62         49.86
D.2.3 4 shards
Table D.6: Read benchmark for text-only documents (total running time) - 4 shards

(a) Hashed sharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         834.75        870.27        868.37        857.80
2         393.52        390.89        392.50        392.30
4         177.11        177.85        176.85        177.27
16        92.84         93.71         94.21         93.59
64        90.26         91.56         92.91         91.58

(b) GeoSharding

Threads   Trial 1 [s]   Trial 2 [s]   Trial 3 [s]   Average [s]
1         490.71        494.82        492.53        492.69
2         243.20        242.83        240.90        242.31
4         110.67        113.54        111.83        112.01
16        28.32         28.03         28.21         28.19
64        18.50         17.21         18.01         17.91
E Example document from the dataset
{
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [ -69.08976219, 47.11225721 ]
    },
    "properties": {
        "SITE_NO": 1010500,
        "SNAME": "ST JOHN R A DICKEY ME",
        "NWISDA1": 2680,
        "STATE": "Maine",
        "COUNTY_NAME": "Aroostook",
        "ECO_L3_CODE": "5.3.1",
        "ECO_L3_NAME": "Northern Appalachian and Atlantic Maritime Highlands",
        "ECO_L2_CODE": 5.3,
        "ECO_L2_NAME": "ATLANTIC HIGHLANDS",
        "ECO_L1_NAME": "NORTHERN FORESTS",
        "ECO_L1_CODE": 5,
        "HUC_REGION_NAME": "New England Region",
        "HUC_SUBREGION_NAME": "St. John",
        "HUC_BASIN_NAME": "St. John",
        "HUC_SUBBASIN_NAME": "Upper St. John",
        "HUC_2": 1,
        "HUC_4": 101,
        "HUC_6": "010100",
        "HUC_8": 1010001,
        "PERM": 1.6567,
        "BFI": 48.3947,
        "KFACT": 0.2243,
        "RFACT": 69.79872,
        "PPT30": 98.8643,
        "URBAN": 0.146723998,
        "FOREST": 59.02735921,
        "AGRIC": 0.003490023,
        "MAJ_DAMS": 0,
        "NID_STOR": 0,
        "CLAY": 17.4695,
        "SAND": 22.1427,
        "SILT": 60.3878,
        "BENCHMARK_SITE": "0",
        "NHDP1": 255,
        "NHDP5": 420,
        "NHDP10": 550,
        "NHDP20": 815,
        "NHDP25": 943.5,
        "NHDP30": 1100,
        "NHDP40": 1470,
        "NHDP50": 2000,
        "NHDP60": 2720,
        "NHDP70": 3850,
        "NHDP75": 4735,
        "NHDP80": 6000,
        "NHDP90": 11800,
        "NHDP95": 19700,
        "NHDP99": 39300,
        "sample_count": 30
    }
}