TrashCam - UAV's for marine litter mapping
A technical report by SKARL AB from a project funded by Vinnova
Table of Contents
Introduction 5
Relevant Concepts 5
Resolution types 6
Spatial Resolution 7
Spectral Resolution 7
Radiometric resolution 7
Temporal resolution 8
Ground Sampling Distance 8
Setup 10
Sensor and Vehicle 10
Operational Investigation 13
Context 13
Testing methodology 13
Altitude error 13
Ground based tests 14
Airborne tests 18
Post processing 21
Historical Background - state of the art 21
Implementation issues 23
Workflow Implementation 24
Generating a dataset 24
Training the network 26
Final Implementation 28
Data Augmentation Investigation 30
Manual identification 34
Large litter items 35
Discussion 37
Options for future implementations 37
Spectral signature 37
Large scale monitoring - fixed wings 38
Introduction
The project evaluated the use of Unmanned Aerial Vehicles (UAVs, drones) in identifying beach litter. Internationally, UAV methodology has been applied to traditional environmental fields such as conservation1 and, more recently, to beach litter, as in the UK's Plastic Tide2 project.
In order to understand the problem and the solutions used by this project, we need some
background on the theory, which is covered in the following section.
Relevant Concepts
In order to understand the project's context, scope and implementation, certain concepts need to be clarified first.
The overall discipline under which we operate is remote sensing and environmental monitoring.
Remote sensing has historically been associated with flight and, as technology progressed, with
satellite imaging.
More recently, with unmanned aerial systems becoming widespread, use of drones has started
filling the gaps left by the traditional remote sensing platforms as shown in the image below:
By flying extremely low, close to the target, drones can offer almost arbitrarily better resolution at a dramatically lower cost (but over much smaller areas). To understand this a bit better, let us look at the concept of resolution.
1 https://conservationdrones.org/ 2 https://www.theplastictide.com/
Resolution types
Historically, resolution has been subdivided into four types: spatial, spectral, radiometric and
temporal. These are demonstrated in the two images below:
Spatial Resolution
Spatial resolution is a measure of how finely we subdivide an image into pixels. For example:
[Figure: the same image rendered at 500 000 pixels and at 760 pixels]
Spectral Resolution
Spectral resolution refers to the amount of spectral information contained in a single pixel, for
example:
[Figure: the same image in RGB, grayscale, and black and white]
A “normal” photo in the human visible spectrum is actually three grayscale photographs, one for
each band of Red, Green and Blue. For example:
[Figure: an RGB image decomposed into its red, green and blue bands]
Radiometric resolution
Radiometric resolution is tied to spectral but there is an important distinction. Spectral resolution
refers to the wavelength bands that the sensor responds to (typically three, RGB).
Radiometric resolution corresponds to how finely these bands are graded, i.e. how well we can
distinguish differences between wavelengths. To give a spatial analogy, spectral resolution would
correspond to the area covered by an image whereas radiometric resolution would correspond to
the spatial resolution of the image.
The image below gives an example of this by comparing the response of a multispectral sensor
(low radiometric resolution, lower cost, often used in precision agriculture use cases) to a
hyperspectral sensor (high radiometric resolution, more expensive, used in specialised
applications).
Temporal resolution
Temporal resolution indicates how often we sample the same place. To make an analogy, a
movie’s frame rate corresponds to its temporal resolution.
It is not a useful concept for a single drone mission but after time has elapsed, successive
samplings of the same region show changes in the data (making a historical “movie” of the
location).
Ground Sampling Distance
Having introduced the concepts of resolution, we need to introduce the concept of Ground
Sampling Distance (GSD). This is one of the most important parameters when designing a
remote sensing mission and it ties the spatial resolution to the physical size of the object under
investigation. For a drone mapping mission, GSD shows how finely items are resolved in the
image.
GSD is affected by the camera’s focal length and by the distance from the target.
For a given camera focal length setting, GSD is affected by distance from the target (drone
height), with lower altitudes giving a lower GSD (higher resolution) e.g.:
Similar GSDs can be achieved by combinations of heights and focal lengths; for example, compare the low altitude, 20mm focal length to the high altitude, 40mm focal length in the image below:
So, in summary, we can increase our height while keeping GSD the same by increasing our
focal length (zooming in).
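For readers who prefer a formula, a minimal sketch of this relationship is given below. The numbers are illustrative assumptions only (roughly corresponding to a 1"-type, 20 megapixel sensor at the 30 mm equivalent focal length of a camera like the Sony RX100 III discussed later), not measured project settings.

```python
def gsd_mm_per_pixel(sensor_width_mm, image_width_px, focal_length_mm, height_m):
    """Approximate ground sampling distance (mm/pixel) for a nadir-pointing camera.

    GSD = (sensor width x flight height) / (focal length x image width in pixels)
    """
    return (sensor_width_mm * height_m * 1000.0) / (focal_length_mm * image_width_px)

# Illustrative values: 13.2 mm wide sensor, 5472 pixels across, 11 mm focal length.
for height_m in (5, 10, 20, 50):
    gsd = gsd_mm_per_pixel(13.2, 5472, 11.0, height_m)
    print(f"{height_m:>2} m altitude -> {gsd:.1f} mm/pixel")
```

Under these assumptions, 5 m of altitude gives roughly 1 mm/pixel while 50 m gives roughly 1 cm/pixel, which matches the typical figures quoted below.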
For reference, although there is no one-size-fits-all, typical mapping GSDs are in the order of
cm / pixel. A GSD of 2 cm / pixel is considered quite detailed, while 10 cm / pixel or more can
be standard for a quick survey. Satellite imagery GSD is typically many tens of centimetres to
metres per pixel. Manned platforms can vary a lot, depending on platform and payload but can
generally fit in between (the tradeoff between them and UAVs is higher coverage but for much
higher operational cost and some risk to operators).
Setup
Given the above, we can start discussing the parameters for a setup that is most suited to the mission. For this, we follow a Concept of Operations (CONOPS3) methodology, with the investigation necessary for the first phase as the central focus.
Considerations in terms of importance are:
1. Sensor:
a. Ground sampling distance. What values are needed for identifying litter?
b. Resolution type: should we concentrate on spatial or spectral resolution?
2. Vehicle: What type of vehicle will handle the sensor from point 1?
3. Operational: what type of use case are we imagining in terms of cost, time, operator
expertise and risk given points 1 and 2?
Sensor and Vehicle
The possible sensors for this investigation would rely either on spatial or on spectral resolution. The available sensors would be either normal RGB cameras or specialised cameras such as multispectral, hyperspectral or thermal. Given that the latter group have significantly higher cost but unknown performance advantages, the initial investigation focused on RGB cameras, i.e. on spatial, rather than spectral, resolution.
Given that the focus is on spatial resolution, an ability to sample down to a very fine GSD (a small mm/pixel value) is desirable, at least until a suitable GSD has been established.
Since the initial investigation focused on RGB cameras, the options were either an off the shelf solution or a more dedicated setup. Off the shelf drones are typically equipped with video cameras mounted on gimbals. Apart from video, these can be used for shooting still images and are often used in mapping applications. However, there are various disadvantages to this approach. In terms of sensors, the performance of a video camera shooting stills is typically inferior to that of a dedicated stills camera. The use of proprietary software and hardware makes modification beyond the scope envisioned by the manufacturer difficult, which is a significant hindrance in research projects. Finally, if a dedicated stills camera was to be used instead of the manufacturer supplied one, starting off with a neutral system made more sense, since the resources spent modifying an existing system are comparable to (and might exceed) those spent building a custom system according to the project's needs.
3 https://en.wikipedia.org/wiki/Concept_of_operations
For these reasons, it was decided that the initial investigation would use dedicated RGB still cameras. The ones investigated were the Canon IXUS 160 and the Sony RX100 III.
Canon IXUS 160: this is a popular camera for drone mapping due to its low weight (127 grams), low cost and high resolution (20 megapixels). It can also run the Canon Hacker Development Kit4, allowing very precise parameter setting, specifically for mapping applications using the KAP UAV Exposure Control Script5. Disadvantages of the camera are its small sensor size (leading to noisier images) and relatively small aperture (forcing slower shutter speeds, which risk motion blur, or higher ISO settings, which in turn lead to noisier images).
Sony RX100 III: this offers a significant performance increase compared to the IXUS 160 but comes at the cost of more than double the weight (287 grams). It has the same nominal resolution but a larger sensor, a larger aperture and a faster cycle rate, meaning it can shoot images more frequently and the resulting images are less noisy.
Given the sensors, the next choice is the vehicle. In this case, the options were either a fixed wing or a multirotor platform. Fixed wings offer significantly longer flight times at the cost of being unable to hover or fly slowly. Since the initial phase of the project was an evaluation of the operational parameters, the most flexibility is given by a multirotor, since it can hover and travel at very slow speed if needed (at the cost of very limited flight time). Therefore, the choice was made for a 550mm quadrotor, assembled from a kit (carbon fibre, "Iris", supplied by uCandrone6), for use during the investigation phase.
For the final phase, a twin-motor fixed wing (MFE Believer) was also tested for monitoring of large items over large areas.
4 http://chdk.wikia.com/wiki/CHDK 5 http://chdk.wikia.com/wiki/KAP_UAV_Exposure_Control_Script 6 http://ucandrone.com/
The final choice is that of the autopilot software and hardware. For this size and use case, popular and powerful options are the open source projects PX4 [7] and Ardupilot [8]. Given that the decision was made for SKARL to assemble a custom multirotor, Ardupilot (Arducopter and Arduplane for the different vehicles) was selected, since it is a more mature project with very extensive documentation and community support. The autopilot hardware used throughout the project comprised the Pixhawk [9], Pixhawk 2 (Cube) [10] and Pixracer [11].
7 http://px4.io/ 8 http://ardupilot.org/ 9 https://pixhawk.org/ 10 http://ardupilot.org/copter/docs/common-thecube-overview.html 11 https://docs.px4.io/en/flight_controller/pixracer.html
Operational Investigation
Context
The most immediate consideration was to establish the GSD required and, from there, to see how this fits with operational considerations.
To this end, the project needed to establish the smallest item of interest in our target sample. Historical litter surveys show that the cigarette butt is extremely common and persistent. It is also a very small piece of litter (approximately 30 mm in length). It was, therefore, decided at the early stages of the project to aim for a GSD that would allow identification of cigarette butts. The initial stages of the project were dedicated to establishing the right setup for this.
Testing methodology
Initial airborne tests conducted in August and September 2018 (at locations in Tyresö, Ingarö, Sollentuna, Barkarby and Älvkarleby) indicated that, for a cigarette butt to be reasonably well identified, it ideally needs to span 25-30 pixels in the image. Given that a cigarette butt is approximately 30 mm long, this translates to an ideal GSD of 1 mm/pixel and, preferably, no worse than 1.5 mm/pixel.
However, the sampling height for this GSD was not well defined, though it was experimentally established to be approximately 5-8 metres. The reason these tests were not conclusive was height estimation issues, as described below.
Altitude error
Typically, altitude is measured by a barometric sensor onboard the autopilot (though GPS and other sensor inputs also play a role)12. Although the barometric sensor has very high resolution (of the order of tens of centimetres), it is susceptible to minuscule changes in air pressure, for example when a wind gust blows, in which case the vehicle changes altitude to compensate for what it (erroneously) perceives as a change in altitude. To get a rough idea of the variation possible due to wind gusts, the vehicle was placed on the ground, exposed to the wind, on a moderately windy day (3-4 m/s) and its altitude estimate was recorded. The result is shown in the image below:
12 It has to be noted here that other sensors of ground clearance exist, such as ultrasonic or lidar rangefinders. Ultrasonic rangefinders are very common due to low cost but have very short range (typically up to 8m) so are not useful for flights higher than that. Lidar has a much greater range (of the order of 100m) but is also costlier and bulkier. In this project, the size of the cameras being evaluated did not allow much flexibility in adding extra sensors so these were not considered.
As can be seen, the variation in altitude estimated via barometric pressure is just under 1 metre for a vehicle stationary on the ground, exposed only to gusts of wind. It has to be noted, however, that, due to the wind gradient, wind speed close to the ground is lower than what the vehicle typically encounters aloft. Thus, it is reasonable to assume that the error in altitude introduced by wind gusts on a typical windy day is probably at least 1 metre. The extent to which the vehicle actually changes its altitude to match depends on the duration of the gust and the dynamics of the vehicle (how aggressively it has been tuned to respond and how strong the motors are).
When the vehicle is at its typical operating altitude (for a traditional mapping mission that is many tens of metres), this is not an issue, since a change of, say, 1 metre is only 2% when flying at an altitude of 50 metres (a very common mapping height). However, in this case, trying to establish the optimal GSD required, a change of 1 metre when flying at 5 metres is significant (20%). To account for this, flying altitudes were generally very conservative, erring on the side of low altitude to ensure adequate GSD.
Ground based tests
To establish the impact of the altitude uncertainty, ground based tests were performed by
placing a target (cigarette butt, 30mm in length) and taking pictures at varying horizontal
distances (distance sweep).
Results (without zoom) are shown below for the Canon IXUS 160:
2 m, 38 pixels; 4 m, 24 pixels; 6 m, 17 pixels; 8 m, 17 pixels; 10 m, 13 pixels
And the Sony RX100:
2 m, 68 pixels; 4 m, 34 pixels; 6 m, 23 pixels; 8 m, 17 pixels; 10 m, 14 pixels
These results are summarised in the table and graph below:
Distance (metres)            2       4       6       8       10
IXUS 160  pixel length       38      24      17      17      13
IXUS 160  GSD (mm/pixel)     0.79    1.25    1.76    1.76    2.31
RX100     pixel length       68      34      23      17      14
RX100     GSD (mm/pixel)     0.44    0.88    1.30    1.76    2.14
Looking at these results, it was decided to put the cutoff for an acceptable resolution at a GSD of 1.5 mm/pixel. For the two cameras, this is achieved at approximately 5 metres for the IXUS 160 and 7 metres for the RX100.
Also, note the difference in noise level between the two cameras at similar resolutions. This is a good example of why GSD, although important, is not the only criterion for an acceptable image.
Having established an acceptable reference value for GSD, the next step was to see how much the distance envelope can be extended by zooming (changing focal length).
To test this, a set of photos was taken by the RX100 only, at a distance of 10 metres across a
full zoom sweep. The 35mm equivalent focal lengths were 30, 40, 50, 55, 60 and 70 mm.
NOTE: for ease of comparison with other cameras, we will use the 35mm equivalent focal length, NOT the camera's actual focal length. This is recorded in the EXIF metadata as the focal length in 35mm.
30 mm, 15 pixels; 40 mm, 19 pixels; 50 mm, 24 pixels; 55 mm, 27 pixels; 60 mm, 28 pixels; 70 mm, 34 pixels
Results summarised in table and graph below:
Focal length (35 mm equivalent)   30      40      50      55      60      70
Pixel length                      15      19      24      27      28      34
GSD (mm/pixel)                    2.00    1.58    1.25    1.11    1.07    0.88
Summarising these results: as we saw in the distance sweep for the RX100, the unzoomed GSD at 10 metres is 2.14 mm/pixel (a length of 14 pixels for the 30 mm target). Staying at 10 metres, a fully zoomed image at a focal length of 70 mm (in 35 mm equivalent) gives us a GSD of 0.88 mm/pixel. This means that, at full zoom, we can improve our GSD by a factor of 2.4 (2.14 / 0.88 ≈ 2.4; the quoted zoom in the camera specs is 2.9).
Going back to the graph of the distance sweep, we see that, unzoomed, the GSD cutoff of 1.5 mm/pixel occurs at roughly 7 metres.
Given the results of the zoom sweep, we conclude that, in terms of GSD only, we should theoretically be able to fly at approximately 17 metres (7 m × 2.4) and still achieve a GSD of 1.5 mm/pixel.
Airborne tests
Having got an initial idea of the issues involved through the ground based testing, the next step was to assess airborne performance. When it came to unzoomed performance for both cameras, it was well established that a height of 5 metres was generally guaranteed to yield good results. The question that then needed to be investigated was: can similar results be obtained by flying higher but using zoom?
In principle, the answer should be yes. However, there are various issues that need
addressing first. A zoomed in image (increased focal length) results in a narrower field of
view (angle). This, in turn, means that a given rotation of the camera has a much bigger
impact (in terms of changing the frame) on the highly zoomed (greater focal length) camera
as shown in the image below:
In an airborne multirotor, there are motions due to at least three factors: control inputs,
turbulence and frame vibration.
Control input and turbulence induced motions are low frequency and can be countered by
increased shutter speed of the camera and using a gimbal for stabilisation. Frame vibration
is high frequency and can be countered by vibration isolation, e.g. rubber couplings.
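To get a feel for why zooming amplifies these motions in the frame, the sketch below computes the horizontal field of view from the sensor width and focal length, and then the fraction of the frame that a 1 degree camera rotation sweeps. The sensor and focal length values are the same illustrative assumptions used earlier, not exact project measurements.

```python
import math

def horizontal_fov_deg(sensor_width_mm, focal_length_mm):
    """Horizontal field of view (degrees) of an ideal rectilinear lens."""
    return math.degrees(2 * math.atan(sensor_width_mm / (2 * focal_length_mm)))

# Illustrative: 13.2 mm wide sensor; ~11 mm and ~25.7 mm actual focal lengths
# correspond roughly to 30 mm and 70 mm in 35 mm equivalent terms.
for label, focal_mm in (("wide (~30 mm eq.)", 11.0), ("zoomed (~70 mm eq.)", 25.7)):
    fov = horizontal_fov_deg(13.2, focal_mm)
    print(f"{label}: FOV {fov:.1f} deg, 1 deg of rotation sweeps ~{100 / fov:.1f}% of the frame")
```

Roughly halving the field of view roughly doubles how far the frame shifts for the same small rotation, which is why vibration that is invisible when unzoomed becomes visible at full zoom.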
To test performance, the vehicle was flown at various heights, from 8 to 20 metres, at maximum zoom with the Sony RX100 (70 mm focal length in 35 mm equivalent terms). Since initial results were not good, shutter speed was increased (up to a maximum of 1/2000 s). Many images were captured with the vehicle hovering in place and with the camera mount vibration isolated via rubber couplings.
However, even with the vehicle stationary and a high shutter speed, the results were still not good enough; an example image is shown below:
15 metre altitude, 1.5 mm/pixel GSD but image blurry.
Having multiple images, all of them blurry, with the vehicle stationary excludes control inputs and turbulence as the cause. The most likely conclusion is that frame vibration, while not a problem at the wide field of view (unzoomed) setting, becomes problematic when zooming in.
Evaluating the use of a gimbal for image stabilisation was not an option either. Off the shelf drones with gimbal mounted cameras typically cannot zoom. Using a third party gimbal on a camera as heavy as the RX100 would increase the weight of the system to the point where flight time would be severely impacted, while the benefits were not guaranteed.
In conclusion, given these points, the most suitable strategy for data sampling was flying low
without zoom. This translated to the following operational parameters:
● Altitude of approximately 5-8 metres
● Rate of 1 image per second
● Ground speed of 1 metre per second (to allow adequate overlap between successive images; see the sketch below)
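As a sanity check on these parameters, the sketch below estimates the forward overlap between successive frames from the flight height, camera geometry, image rate and ground speed. The sensor figures are the same illustrative assumptions as before, not the project's flight-planning code.

```python
def forward_overlap(sensor_height_mm, focal_length_mm, height_m,
                    ground_speed_mps, seconds_between_images):
    """Fraction of each image footprint shared with the next image along track."""
    footprint_m = sensor_height_mm * height_m / focal_length_mm  # along-track footprint
    advance_m = ground_speed_mps * seconds_between_images        # distance travelled
    return max(0.0, 1.0 - advance_m / footprint_m)

# Illustrative: 8.8 mm sensor short side, 11 mm focal length, 5 m altitude,
# 1 m/s ground speed, 1 image per second -> roughly 75% forward overlap.
print(f"{forward_overlap(8.8, 11.0, 5.0, 1.0, 1.0):.0%}")
```

Under these assumptions each new frame still shares about three quarters of its footprint with the previous one, which is in the range normally used for drone mapping.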
Post processing
Having established the operational parameters, the next task was to establish how to
process the generated data. Our dataset was made up of high resolution images. Our goal
was to identify and locate litter items in these. One obvious way is to do so manually; this is definitely possible and can yield very satisfactory results, as shown below:
However, manual identification, although possible, is very labour intensive. Therefore, the
ideal implementation would automate some or all of the identification workload. Hence, we
focused on investigating methods for automatic identification of the litter.
Historical Background - state of the art
In recent years, machine learning and, specific to our case, machine vision, have been
advancing at a great pace. This is due to a number of reasons.
On the hardware side, computer graphics cards (GPUs) have become very powerful and
their computational capabilities happen to be very well suited for solving various classes of
problems relevant to machine learning.
On the data side, with the proliferation of the internet, the datasets of tagged images
available to researchers have become enormous.
On the software side, various techniques that had been used in research but not to their full potential are now being refined into very powerful tools, thanks to these advances in hardware and datasets.
Of the various traditional machine learning techniques, the ones that we are interested in are
Artificial Neural Networks (ANNs) and, specifically, Convolutional Neural Networks13
(CNNs).
CNNs have been the focus of very intense research and development recently, due to their ability to process large input data using a smaller sized network compared to other types of ANNs. For this reason, the various machine vision tools that are starting to appear, such as face recognition, object detection etc., are largely made possible through CNNs, as opposed to more traditional computer vision approaches.
Broadly speaking, there are three tasks of increasing complexity that are handled by such
networks: classification, detection and segmentation.
Image from Review of Deep Learning Algorithms for Object Detection
Without going into details14, classification looks at a single item in an image and classifies it.
Detection identifies different classes of items in an image and produces their rough locations
(bounding boxes). Segmentation does the same but produces very fine locations (pixel by
pixel).
In our case, we have multiple items of interest in each image, so we are looking at the latter
two cases.
13 https://en.wikipedia.org/wiki/Convolutional_neural_network#History 14 A more in depth explanation can be found here: https://medium.com/comet-app/review-of-deep-learning-algorithms-for-object-detection-c1f3d437b852
Implementation issues
Our case had various unique characteristics that made it stand out from more traditional
implementations. These had to do with the size of the items and the size of our dataset.
Regarding the size of the items, typical use involves an image where the object or objects of interest are prominent and take up a significant part of the image, as shown in the previous image. By comparison, our images are very large (5472 x 3648 pixels when using the Sony RX100) and the items are very small, as seen below:
In the image above, the longest item, the plastic bottle in the middle bottom, is approximately
150 pixels long. The smallest items, cigarette butts, are about 20-30 pixels long and barely
visible without zooming in (for example, top left). This means that, unlike traditional use
cases, most of our input area is irrelevant.
Regarding the dataset, training a network requires a very large amount of tagged data. Tagged data implies that items of interest are identified, localised and labelled in the data, typically by humans. This is a very labour intensive process and beyond the scope and resources of the current project.
In our case, such datasets of litter do not yet exist in the public domain15. The Plastic Tide
project, which was working on a similar problem, crowdsourced this part of the workflow by
asking the public to tag items in their dataset16. However, the Plastic Tide’s dataset is taken
at a Ground Sampling Distance (GSD) roughly ten times coarser than ours. This meant that,
in their data, items such as cigarette butts, which we had decided we wanted to include,
would barely be identifiable at all. Therefore, using Plastic Tide’s dataset was not an option.
Instead, we had to establish our own workflow that would generate a sufficient dataset for
our purposes.
15 Mohammad Saeed Rad et al, “A Computer Vision System to Localize and Classify Wastes on the Streets”, arXiv:1710.11374, submitted on 31 Oct 2017. Section 4, “The Dataset”. 16 https://www.theplastictide.com/tagit/
Workflow Implementation
Generating a dataset
To circumvent the limitations discussed in the previous section, we tried a different approach
to generating a dataset.
When we capture images of litter, we are actually capturing in exactly the same way as when using a drone to map an area. This requires images with overlap and no blind spots. Using the right software, such images can be used to create a 3D representation of the area in the images. In this project, we have used Agisoft Metashape, for reasons explained in more detail below.
Now, generally, images used for mapping have some overlap but are spread wide, in order
to cover a large area, for example see image below:
The circle colours in the image above correspond to how much overlap there is for various areas, i.e. purple dots are only "seen" in one image, blue dots are "seen" in two images, etc.
Typically, drone mapping missions try to spread out the images over as wide an area as
possible, while retaining sufficient overlap for the map to be generated.
However, it is possible to cover a very small area with a mostly stationary drone shooting at roughly the same place for the whole duration of its flight. This way, we build a 3D model with extreme overlap (the points at the centre of the model are "seen" by hundreds of images).
Here, we see a 3D model where all the images (rectangles at top) are more or less in the same region and the modelled area is very small, approximately 15 metres across (approximately three image lengths at the GSD we use). The upside of this is that any random point reasonably close to the centre of the area will be visible in most, if not all, of the images.
Now, once the 3D model has been generated, a user can manually mark polygons on the
model. These polygons are created in the coordinate system of the model.
[Figure: model point cloud, textured model, and one of the source images]
As shown in the image above, since the (white) polygons are in the coordinate system of the
model, they can be automatically projected at the correct coordinates onto any image of the
dataset which “sees” the location of the polygon.17
One of the most powerful features of Metashape is the ability to run custom python scripts18.
We have written a script that, once all the items are tagged manually, exports all the
polygon data as a .json file19.
This data is then further manipulated by python scripts to generate .xml files in PASCAL
VOC format20, a format used in visual machine learning.
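The exact layout of the exported .json is defined by the project's scripts on GitHub; the sketch below assumes a simplified, hypothetical record format (one entry per image with pixel-space bounding boxes) purely to illustrate how such data maps onto PASCAL VOC XML.

```python
import json
import xml.etree.ElementTree as ET

def record_to_voc(record):
    """Build a PASCAL VOC annotation tree from one hypothetical exported record.

    Assumed (illustrative) record layout:
    {"image": "IMG_0001.JPG", "width": 5472, "height": 3648,
     "items": [{"label": "litter", "xmin": 10, "ymin": 20, "xmax": 60, "ymax": 55}]}
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = record["image"]
    size = ET.SubElement(root, "size")
    for tag, value in (("width", record["width"]), ("height", record["height"]), ("depth", 3)):
        ET.SubElement(size, tag).text = str(value)
    for item in record["items"]:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = item["label"]
        box = ET.SubElement(obj, "bndbox")
        for corner in ("xmin", "ymin", "xmax", "ymax"):
            ET.SubElement(box, corner).text = str(item[corner])
    return ET.ElementTree(root)

if __name__ == "__main__":
    with open("polygons.json") as handle:            # hypothetical export file name
        for record in json.load(handle):
            record_to_voc(record).write(record["image"].rsplit(".", 1)[0] + ".xml")
```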
By using this method, we can end up with large datasets in a relatively short time (1-2 days
depending on the size of the dataset). For example, a ten minute flight can generate
between 400 and 600 usable images. Assuming all items are visible in all images, tagging
one item results in tagging 400 to 600 items, thus multiplying the human labour by a factor of
400 to 600.
It should also be noted that this multiplication of items does not happen via the data augmentation tricks common in computer vision, such as flipping, rotating, etc.; the additional instances are genuine separate images, taken from slightly different locations, lighting conditions, angles etc.
However, the drawback is that, for each flight, the items are at the same location, meaning
that, although they are genuinely different images, they are images of the same items
against the same background. Another drawback is that, due to uncertainty in the transfer
between coordinate systems, the polygon edges can sometimes be misplaced, which means
that they have to be drawn with fairly large margins of error to avoid them cutting across an
item.
Training the network
Once we have the dataset, the next decision concerns, of course, what type of network to
use. A very popular type of network for object detection is YOLO (You Only Look Once),
currently at version 3 [21]. This network performs object detection, i.e. bounding box localisation and classification inside an image.
17 It has to be stressed that this projection is purely geometric. The user draws and labels the polygons by hand by looking for litter items in the model. The software calculates the locations of the polygons’ vertices in image coordinates by projecting the vertices back towards each image where the polygon is visible. No image processing is taking place at this step, the software simply carries out the projection of the points between the two coordinate systems. 18 http://www.agisoft.com/pdf/Metashape_python_api_1_4_0.pdf 19 All software developed as part of this project is open sourced on github at:
https://github.com/gitdillo/Garbage 20 http://host.robots.ox.ac.uk/pascal/VOC/ 21 https://pjreddie.com/media/files/papers/yolo_1.pdf, https://arxiv.org/pdf/1612.08242.pdf, https://pjreddie.com/media/files/papers/YOLOv3.pdf
We used an open source implementation22, which yielded very promising results on our initial dataset of approximately 2000 images (containing 7 000 tagged items); see the example image below:
Once it was apparent that the network was performing well on the initial dataset, the dataset was enriched. By the end of 2018, we had datasets of approximately 15 000 images against grass, gravel, asphalt and roadside backgrounds.
With the experience of training across these diverse datasets, one thing became clear. YOLO is a popular network that performs detection, i.e. both bounding box localisation and classification. However, a network that adequately captures the complexity of the input (diversity of litter shapes and backgrounds) as well as of the output (a large number of possible litter classes) needs to be trained on a very large and diverse dataset and will end up being resource intensive if it is to perform well. On top of that, a known limitation of YOLO is the detection of small items, as discussed in Redmon et al., 2015 [23].
In our case, with a mix of small items (cigarette butts) and arbitrarily large items (bottles, bags or even larger), it would be hard to find a compromise to fit all possibilities. Our network was scoring very high within its training and validation set (the validation set is a set of images not used during training, which is only used to evaluate the performance of the network). However, when presented with data from a new dataset, performance degraded very noticeably, with frequent incorrect classifications.
22 https://github.com/experiencor/keras-yolo3 23 https://arxiv.org/abs/1506.02640
Final Implementation
In our case, items were typically not very tightly clustered, took up a very small percentage of the input image and could belong to a wildly diverse range of classes and sizes. Therefore, instead of trying to accurately classify the outputs, we decided the final implementation should focus on the detection performance of YOLO only.
Given that we are sampling at a roughly constant GSD (around 1-1.5 mm per pixel), we can make some safe assumptions about the item sizes in pixels. The input was sliced into 500 x 500 pixel chunks, making the resulting footprint of each image slice no smaller than half a metre across. On one hand, this is adequate for most common litter items; on the other hand, it is small enough to ensure that our items are not vanishingly small compared to the input image size, thus circumventing, to some extent, YOLO's performance problems on small items.
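The project's slicing code lives in its GitHub repository; the fragment below is an illustrative re-creation of the idea rather than the actual implementation.

```python
from PIL import Image

def slice_image(path, tile_size=500):
    """Yield (left, top, tile) for non-overlapping tile_size x tile_size crops."""
    image = Image.open(path)
    width, height = image.size
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            box = (left, top, min(left + tile_size, width), min(top + tile_size, height))
            yield left, top, image.crop(box)

# Each tile is passed to the detector; the (left, top) offsets allow any detected
# bounding boxes to be mapped back into full-image coordinates.
```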
Finally, the network does not attempt any classification at all. Suspect items are simply
marked as litter as shown below:
Given that, for our use case, the vast majority of the input image was irrelevant, there was
little point in suffering the performance penalty (both in training and during use) of a network
that accurately classifies the detected items. By focusing on the bounding box detection
only, we can get an idea of the litter in our sample. Secondly, having discarded the majority
of the input, the detected items can then be passed to another dedicated classifier network,
whose only job will be classification.
This method offers various advantages:
● The input is parsed quickly and efficiently and most of it is immediately discarded.
● False positives do not matter that much since the classifier can reject them. In fact,
it’s better to get false positives than false negatives.
● The whole system is a lot easier to retrain. If a single network was doing both jobs,
adding or removing a class would require retraining the whole network. This way,
each component can be trained solely on its specific task as more usage data comes
in.
To counter the problem of insufficient diversity in the dataset backgrounds, datasets were
generated against very different backgrounds (sand, grass, asphalt, gravel) and the images
mixed for training.
Again, this method gave very high scores (mAP > 95%) in the training and validation sets.
However, when evaluated in completely fresh datasets, even of the same type of
background, the performance suffered greatly as can be seen in the image below:
In this example, the network is tested on a completely new dataset. Obvious large items
such as cans and coffee cups are missed, bounding boxes (bottom left) are miscalculated,
we have double detections (bottom left, top right) and not a single cigarette butt has been
detected. While these results are in stark contrast with the results of the validation sets
during training, they are not surprising.
Since the tagging is done automatically, based on multiple images of the same, small set of
items against a given background per dataset, the main problem is the lack of variation,
mainly in backgrounds. This results in a network that performs exceptionally well within the
limited envelope of training data but underperforms when faced with new data. While this is
an expected shortcoming of this methodology, it is not easily addressed, other than by
committing significant resources to manually tagging an extensive dataset, something
beyond the resources and scope of the current project.
Data Augmentation Investigation
Given the issues identified so far, a new dataset generating methodology was investigated.
Instead of planting litter items at fixed locations and taking multiple photographs from
different locations, what if images of litter items were “planted” automatically in otherwise
empty photographs? Similar techniques are being used in other computer vision projects,
where photorealistic, computer generated imagery is used as input to a network for training
purposes. In our case, we decided to use the most difficult litter items, cigarette butts, to
investigate whether we could achieve good detection for these against a variety of
backgrounds. Hence, instead of trying to detect multiple items but facing the problem of low
background variation, we concentrated on just one (difficult) item against many types of background.
The workflow for generating datasets is as follows:
1. get high resolution images of many different items of the class we are interested in
2. remove the background and crop so we are left with only the item itself against a
transparent background
3. get multiple images of empty backgrounds representative of the types encountered in
the field
4. downsample the (cropped with transparent background) images of the litter items to a resolution low enough to correspond to the resolution of the background (typically ~30 pixels long for a cigarette butt)
5. rotate and paste the (cropped with transparent background) images of the litter items
onto random locations in the background images
The first step is to get good quality images of the litter items. In order to do that, the camera was mounted on a tripod and each item was placed against a black background but kept away from it (by approximately 20 cm) to avoid reflected light interfering with the edge pixels. An example image is shown below:
The question then becomes how to algorithmically remove the background, in order to be able to handle large numbers of input images automatically. Because we have a black background, we can do that via a relatively simple background subtraction using pixel intensity, followed by identifying the largest item, cropping to the smallest bounding box and adding a transparent layer (alpha channel) to the surrounding pixels which do not belong to the item24.
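A minimal OpenCV sketch of this masking and cropping step is shown below. It is an illustrative re-creation under assumed parameter values and placeholder file names, not the project's actual create_minimal_image function.

```python
import cv2
import numpy as np

def extract_item(path, threshold=40):
    """Crop the largest bright object from a dark background and add an alpha mask."""
    image = cv2.imread(path)                                    # BGR image
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Pixels noticeably brighter than the black background belong to the item.
    _, mask = cv2.threshold(gray, threshold, 255, cv2.THRESH_BINARY)
    # Keep only the largest connected contour (the litter item itself).
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    largest = max(contours, key=cv2.contourArea)
    item_mask = np.zeros_like(mask)
    cv2.drawContours(item_mask, [largest], -1, 255, thickness=cv2.FILLED)
    # Crop to the smallest bounding box and use the mask as an alpha channel.
    x, y, w, h = cv2.boundingRect(largest)
    bgra = cv2.cvtColor(image[y:y + h, x:x + w], cv2.COLOR_BGR2BGRA)
    bgra[:, :, 3] = item_mask[y:y + h, x:x + w]
    return bgra

cv2.imwrite("item.png", extract_item("butt_on_black.jpg"))      # placeholder file names
```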
After these operations, we end up with an image as shown below:
This, of course, is at too high a resolution (this example is 1100 pixels long), compared to
what can be realistically sampled by a drone so it needs to be downsampled. In the next
image, examples of different downsampling methods25 are shown:
24 The code for this is in function ”create_minimal_image” of module “image_utils” in github repository: https://github.com/gitdillo/Garbage 25 As described in: https://docs.opencv.org/2.4/modules/imgproc/doc/geometric_transformations.html#resize
These images can now be pasted against any background in order to create a composite as
shown below:
Here, the cigarette butts inside the red rectangle have been artificially pasted. The
advantage of this technique is that they are at known locations. This way, large datasets can
be generated automatically, with a set of items artificially pasted onto diverse backgrounds.
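A sketch of the pasting step is shown below, using Pillow. The scale factor, file names and random placement policy are illustrative assumptions; the point is that the paste position directly yields the bounding box for the annotation, so no manual tagging is needed.

```python
import random
from PIL import Image

def paste_item(background_path, item_path, scale=0.03):
    """Paste one transparent-background item crop at a random rotation and position."""
    background = Image.open(background_path).convert("RGB")
    item = Image.open(item_path).convert("RGBA")
    # Downsample the high-resolution crop to roughly the GSD of the background.
    item = item.resize((max(1, int(item.width * scale)),
                        max(1, int(item.height * scale))), Image.LANCZOS)
    item = item.rotate(random.uniform(0, 360), expand=True)
    x = random.randint(0, background.width - item.width)
    y = random.randint(0, background.height - item.height)
    background.paste(item, (x, y), item)          # alpha channel acts as the paste mask
    return background, (x, y, x + item.width, y + item.height)

composite, bbox = paste_item("clean_sand.jpg", "cigarette_butt.png")   # placeholder names
composite.save("composite.jpg")
print("bounding box for the annotation file:", bbox)
```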
To investigate this method, a new dataset was generated as follows. First, a small, clean
region on a sandy beach was sampled. Then, cigarette butts were manually placed and the
region was sampled again.
The "clean" images were used as backgrounds for generating the dataset using the pasting method. Evaluation was then performed by identifying how many of the planted items were identified in the map of the "dirty" region.
Using the network trained on this data, the detection rate was approximately 1 in 7 (3 out of 22 cigarette butts); see the image below:
The red dots in this image mark cigarette butts. The rectangles (bottom left and right) are
bounding boxes of detected items.
This version of the network concluded the training attempts of our YOLO network in this
project. However, this left an important unanswered question. Were the generally poor
results indicative of a poor choice of network or training regime, or was the problem itself too
difficult for the current state of the art of machine learning? To test this, the same training
data discussed here (the “clean” sandy region with the pasted litter) was processed by a
commercial operator, nanonets26. The resulting network was then evaluated on the “dirty”
beach images. The resulting network identified 19 out of 22 items as shown below:
26 https://nanonets.com/
The yellow dots are the cigarette butts. The red rectangles are bounding boxes of
detections. Empty rectangles show false positives (approximately 100). Yellow dots without
rectangles show false negatives.
The difference between the two networks (our own YOLO and the network trained by nanonets) shows that there is indeed great potential for machine learning approaches. The modest results of our attempts using YOLO were most likely due to weaknesses in the implementation rather than the problem being too hard for machine learning to address.
Manual identification
Besides automatic identification, there is manual identification from the orthophotos
generated after the drone flight. The workflow for this is as follows:
1. The drone captures images, either via pre-programmed route or manual flight.
2. The images are post processed into an orthophoto of the area.
3. The orthophoto is examined by a human and items of interest are tagged.
An example of the resulting orthophoto and tagged items is shown below:
This process has to be contrasted with the traditional method of measurement, which is as
follows:
1. the area is scanned by humans and the litter found is picked up and categorised by
litter type according to a special protocol.
Both methods are labour intensive but at different stages. The drone flights take less time than the manual picking up of litter; however, the examination of the orthophoto takes longer than the counting of the litter. In other words, the manual drone method leads to less time spent in the field but more time in post processing compared to the traditional method.
In terms of accuracy, at an example site (Rullsand, Älvkarleby municipality), manually picking up litter led to 290 items of litter being identified, while examining the orthophoto led to 75 items being identified. The difference in detection rates is down to uncertainty when examining the orthophoto; for example, compare these images of a fork and an uncertain item below:
While the fork is easily identifiable as litter, the other item is not so easy to identify. This is an inherent issue with images: even if the resolution is sufficient, the item might not be easily identifiable. For a human on the beach, it is easy to get closer and examine an item which might be partially covered or discoloured. In contrast, someone examining the orthophoto only has the image on which to base a decision.
More about the results from the comparison between the manual methods can be found in the project report by Keep Sweden Tidy27.
Large litter items
To identify larger litter items, flights were performed by both a quadrotor flying at 40 metres and a fixed wing flying at 60 metres. Again, the resulting images were collated into a map and the map examined manually. The methodology can be valuable for identifying larger marine litter items (>50 cm) in the MARLIN method, which is done on a 1000 m stretch of beach with a maximum width of 50 m [28], or for finding hot-spots where currents and winds accumulate litter.
Unlike the case of small items, the results when examining the imagery were superior. This is shown in the image below:
27 TrashCam - Drönare för kartläggning av marint skräp i Östersjön. Håll Sverige Rent, 2019. 28 For example, the beach is considered to end where vegetation becomes stronger, cliffs or dunes take over, or where a road or physical obstacle begins (e.g. a wall, fence or house). If the beach is deeper than 50 m, the depth is measured at a distance of 50 m from the waterline, and this point is then taken as the back edge of the beach.
The results shown here were collected using the fixed wing, on a flight that lasted 9 minutes at 60 metres, resulting in a GSD of 1.3 cm/pixel. It has to be noted here that a fixed wing can cover 3 to 4 times the area of a multirotor with similar payload capabilities. In the case of the equipment used in this project, the quadrotor flight time was approximately 12 minutes, whereas the fixed wing's theoretical flight time limit is over two hours.
Discussion
The project evaluated the use of drones in identifying beach litter. A quadrotor and a fixed wing aircraft, each equipped with an RGB camera, were used to capture imagery. The imagery was then post processed into a map and used to identify litter either manually or automatically.
Identifying small litter manually in the imagery is more time consuming and less accurate than doing it in the field. However, it leaves a digital record of the locations where litter was identified, which might be of interest for future reference. Identifying large litter (>50 cm) using drone imagery is far more efficient than trying to identify small items this way.
Automatic identification is an intriguing possibility. While the current project did not produce a
satisfactory algorithm, the results indicate that, given a strong focus on the machine learning
aspect, it is possible that a general purpose algorithm might be within reach in the near future.
It is worth noting that mapping without identification processing is also a possibility. We have seen that manual post processing is very time consuming and yields poor results, while automatic identification remains an open question. However, drone sampling itself is a lot quicker than manually picking up the litter, and the mapping post processing takes minimal human input. Given this, adding a drone sampling component to beach litter monitoring offers a digital archive which might be valuable for two reasons:
1. Given the rapid developments in machine learning, it is entirely possible that in the
future, maps generated today with very little effort can be post processed with algorithms
as yet undeveloped.
2. Even if machine learning does not deliver, a map is an archive that can be accessed in
the future to make comparisons between future recordings and the historical record.
Unlike sampling by picking up litter, which does not preserve location, a digital map
preserves the locations of the items it contains.
Options for future implementations
Spectral signature
In terms of identification of small items, most of what could be investigated has been investigated in this project. Manual identification from drone imagery is labour intensive and automatic identification, while very promising, requires significant effort to yield results across the whole spectrum of items.
However, in this project we focused on RGB imagery. Identification through spectral signatures beyond RGB was not explored due to the high cost of equipment. Efforts in this field are underway internationally, though this is still far from mainstream; examples include the Ocean Cleanup [29] project's "Sensing Ocean Plastics with an Airborne Hyperspectral Shortwave Infrared Imager" [30] and the Plastic Litter Project 2018 [31] and 2019 [32] of the University of the Aegean.
Using information from multiple bands can greatly boost detection performance. For example, consider the following image, where one item stands out as litter due to its colour (top left) whereas another stands out due to its shape (bottom right) but cannot be positively identified.
Currently, the rapid development of sensors beyond the visible bands (e.g. multispectral, hyperspectral, thermal) is continuously pushing down the price of such equipment. It is possible that, in the near future, a combination of multiple bands and machine learning algorithms will yield very satisfactory results for the identification of synthetic materials in the environment.
Large scale monitoring - fixed wings
Given the promising performance of fixed wings for the larger items (>50 cm), it is very
interesting to consider their use in greatly expanded monitoring. The image below shows the
region of the monitored beach at Rullsand, Älvkarleby municipality.
29 https://theoceancleanup.com/ 30 DOI: 10.1021/acs.est.8b02855 31 https://mrsg.aegean.gr/?content=&nav=55 32 https://mrsg.aegean.gr/?content=&nav=65
The area in red is the area where small items are manually picked up and counted. The area
in blue is the longer stretch where large items (>50 cm) are identified and recorded. This is
the area that a fixed wing scanned in 9 minutes, using less than 20% of its battery capacity.
It is, therefore, quite possible to expand the monitored area to cover much larger regions. In
the case of Rullsand, it would be possible to expand to the yellow area, adding another 2.5
kilometres to the scanned area in a single flight.