TrashCam - UAV's for marine litter mapping
A technical report by SKARL AB from a project funded by Vinnova
Table of Contents
Introduction 5
Relevant Concepts 5
Resolution types 6
Spatial Resolution 7
Spectral Resolution 7
Radiometric resolution 7
Temporal resolution 8
Ground Sampling Distance 8
Setup 10
Sensor and Vehicle 10
Operational Investigation 13
Context 13
Testing methodology 13
Altitude error 13
Ground based tests 14
Airborne tests 18
Post processing 21
Historical Background - state of the art 21
Implementation issues 23
Workflow Implementation 24
Generating a dataset 24
Training the network 26
Final Implementation 28
Data Augmentation Investigation 30
Manual identification 34
Large litter items 35
Discussion 37
Options for future implementations 37
Spectral signature 37
Large scale monitoring - fixed wings 38
Introduction
The project evaluated the use of Unmanned Aerial Vehicles (UAVs, drones) in identifying beach litter. Internationally, UAV methodology has been applied to traditional environmental fields such as conservation1 and, more recently, to beach litter, as in the UK's Plastic Tide2 project.
In order to understand the problem and the solutions used by this project, we need some
background on the theory, which is covered in the following section.
Relevant Concepts
In order to understand the project's context, scope and implementation, certain concepts need to be clarified first.
The overall discipline under which we operate is remote sensing and environmental monitoring.
Remote sensing has historically been associated with flight and, as technology progressed, with
satellite imaging.
More recently, with unmanned aerial systems becoming widespread, use of drones has started
filling the gaps left by the traditional remote sensing platforms as shown in the image below:
By flying extremely low, close to the target, drones can offer almost arbitrarily better resolution at a dramatically lower cost (but over much smaller areas). To understand this a bit better, let us look at the concept of resolution.
1 https://conservationdrones.org/ 2 https://www.theplastictide.com/
Resolution types
Historically, resolution has been subdivided into four types: spatial, spectral, radiometric and
temporal. These are demonstrated in the two images below:
Spatial Resolution
Spatial resolution is a measure of how finely we subdivide an image into pixels. For example:
[Figure: the same image rendered at 500 000 pixels and at 760 pixels]
Spectral Resolution
Spectral resolution refers to the amount of spectral information contained in a single pixel, for
example:
[Figure: the same image in RGB, grayscale, and black and white]
A “normal” photo in the human visible spectrum is actually three grayscale photographs, one for
each band of Red, Green and Blue. For example:
[Figure: an RGB image decomposed into its red, green and blue bands]
Radiometric resolution
Radiometric resolution is tied to spectral but there is an important distinction. Spectral resolution
refers to the wavelength bands that the sensor responds to (typically three, RGB).
Radiometric resolution corresponds to how finely these bands are graded, i.e. how well we can
distinguish differences between wavelengths. To give a spatial analogy, spectral resolution would
correspond to the area covered by an image whereas radiometric resolution would correspond to
the spatial resolution of the image.
The image below gives an example of this by comparing the response of a multispectral sensor
(low radiometric resolution, lower cost, often used in precision agriculture use cases) to a
hyperspectral sensor (high radiometric resolution, more expensive, used in specialised
applications).
Temporal resolution
Temporal resolution indicates how often we sample the same place. To make an analogy, a
movie’s frame rate corresponds to its temporal resolution.
It is not a useful concept for a single drone mission but after time has elapsed, successive
samplings of the same region show changes in the data (making a historical “movie” of the
location).
Ground Sampling Distance
Having introduced the concepts of resolution, we need to introduce the concept of Ground
Sampling Distance (GSD). This is one of the most important parameters when designing a
remote sensing mission and it ties the spatial resolution to the physical size of the object under
investigation. For a drone mapping mission, GSD shows how finely items are resolved in the
image.
GSD is affected by the camera’s focal length and by the distance from the target.
For a given camera focal length setting, GSD is affected by distance from the target (drone
height), with lower altitudes giving a lower GSD (higher resolution) e.g.:
Similar GSDs can be achieved by combinations of heights and focal lengths; for example, compare the low altitude, 20mm focal length to the high altitude, 40mm focal length in the image below:
So, in summary, we can increase our height while keeping GSD the same by increasing our
focal length (zooming in).
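For readers who prefer a formula, a minimal sketch of this relationship is given below. The numbers are illustrative assumptions only (roughly corresponding to a 1"-type, 20 megapixel sensor at the 30 mm equivalent focal length of a camera like the Sony RX100 III discussed later), not measured project settings.

```python
def gsd_mm_per_pixel(sensor_width_mm, image_width_px, focal_length_mm, height_m):
    """Approximate ground sampling distance (mm/pixel) for a nadir-pointing camera.

    GSD = (sensor width x flight height) / (focal length x image width in pixels)
    """
    return (sensor_width_mm * height_m * 1000.0) / (focal_length_mm * image_width_px)

# Illustrative values: 13.2 mm wide sensor, 5472 pixels across, 11 mm focal length.
for height_m in (5, 10, 20, 50):
    gsd = gsd_mm_per_pixel(13.2, 5472, 11.0, height_m)
    print(f"{height_m:>2} m altitude -> {gsd:.1f} mm/pixel")
```

Under these assumptions, 5 m of altitude gives roughly 1 mm/pixel while 50 m gives roughly 1 cm/pixel, which matches the typical figures quoted below.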
For reference, although there is no one-size-fits-all, typical mapping GSDs are in the order of
cm / pixel. A GSD of 2 cm / pixel is considered quite detailed, while 10 cm / pixel or more can
be standard for a quick survey. Satellite imagery GSD is typically many tens of centimetres to
metres per pixel. Manned platforms can vary a lot, depending on platform and payload but can
generally fit in between (the tradeoff between them and UAVs is higher coverage but for much
higher operational cost and some risk to operators).
Setup
Given the above, we can start discussing the parameters for a setup that is most suited to the mission. For this, we follow a Concept of Operations (CONOPS3) methodology, with the investigation necessary for the first phase as the central focus.
Considerations in terms of importance are:
1. Sensor:
a. Ground sampling distance. What values are needed for identifying litter?
b. Resolution type: should we concentrate on spatial or spectral resolution?
2. Vehicle: What type of vehicle will handle the sensor from point 1?
3. Operational: what type of use case are we imagining in terms of cost, time, operator
expertise and risk given points 1 and 2?
Sensor and Vehicle
The possible sensors for this investigation would rely either on spatial or on spectral resolution. The available sensors would be either normal RGB cameras or specialised cameras such as multispectral, hyperspectral or thermal. Given that the latter group have significantly higher cost but unknown performance advantages, the initial investigation focused on RGB cameras, i.e. on spatial, rather than spectral, resolution.
Given that the focus is on spatial resolution, an ability to sample down to a very fine GSD (a small mm/pixel value) is desirable, at least until a suitable GSD has been established.
Since the initial investigation focused on RGB cameras, the options were either an off the shelf solution or a more dedicated setup. Off the shelf drones are typically equipped with video cameras mounted on gimbals. Apart from video, these can be used for shooting still images and are often used in mapping applications. However, there are various disadvantages to this approach. In terms of sensors, the performance of a video camera shooting stills is typically inferior to that of a dedicated stills camera. The use of proprietary software and hardware makes modification beyond the scope envisioned by the manufacturer difficult, which is a significant hindrance in research projects. Finally, if a dedicated stills camera was to be used instead of the manufacturer supplied one, starting off with a neutral system made more sense, since the resources spent modifying an existing system are comparable to (and might exceed) those spent building a custom system according to the project's needs.
3 https://en.wikipedia.org/wiki/Concept_of_operations
For these reasons, it was decided that the initial investigation would use dedicated RGB still cameras. The ones investigated were the Canon IXUS 160 and the Sony RX100 III.
Canon IXUS 160: this is a popular camera for drone mapping due to its low weight (127 grams), low cost and high resolution (20 megapixels). It can also run the Canon Hacker Development Kit4, allowing very precise parameter setting, specifically for mapping applications using the KAP UAV Exposure Control Script5. Disadvantages of the camera are its small sensor size (leading to noisier images) and relatively small aperture (forcing slower shutter speeds, which risk motion blur, or higher ISO settings, which in turn lead to noisier images).
Sony RX100 III: this offers a significant performance increase compared to the IXUS 160 but comes at the cost of more than double the weight (287 grams). It has the same nominal resolution but a larger sensor, a larger aperture and a faster cycle rate, meaning it can shoot images more frequently and the resulting images are less noisy.
Given the sensors, the next choice is the vehicle. In this case, the options were either a fixed wing or a multirotor platform. Fixed wings offer significantly longer flight times at the cost of being unable to hover or fly slowly. Since the initial phase of the project was an evaluation of the operational parameters, the most flexibility is given by a multirotor, since it can hover and travel at very slow speed if needed (at the cost of very limited flight time). Therefore, the choice was made for a 550mm quadrotor, assembled from a kit (carbon fibre, "Iris", supplied by uCandrone6), for use during the investigation phase.
For the final phase, a twin-motor fixed wing (MFE Believer) was also tested for monitoring of large items over large areas.
4 http://chdk.wikia.com/wiki/CHDK 5 http://chdk.wikia.com/wiki/KAP_UAV_Exposure_Control_Script 6 http://ucandrone.com/
The final choice is that of the autopilot software and hardware. For this size and use case, popular and powerful options are the open source projects PX4 [7] and Ardupilot [8]. Given that the decision was made for SKARL to assemble a custom multirotor, Ardupilot (Arducopter and Arduplane for the different vehicles) was selected, since it is a more mature project with very extensive documentation and community support. The autopilot hardware used throughout the project comprised the Pixhawk [9], Pixhawk 2 (Cube) [10] and Pixracer [11].
7 http://px4.io/ 8 http://ardupilot.org/ 9 https://pixhawk.org/ 10 http://ardupilot.org/copter/docs/common-thecube-overview.html 11 https://docs.px4.io/en/flight_controller/pixracer.html
Operational Investigation
Context
The most immediate consideration was to establish the GSD required and, from there, to see how this fits with operational considerations.
To this end, the project needed to establish the smallest item of interest in our target sample. Historical litter surveys show that the cigarette butt is extremely common and persistent. It is also a very small piece of litter (approximately 30 mm in length). It was, therefore, decided at the early stages of the project to aim for a GSD that would allow identification of cigarette butts. The initial stages of the project were dedicated to establishing the right setup for this.
Testing methodology
Initial airborne tests conducted in August and September 2018 (at locations in Tyresö, Ingarö, Sollentuna, Barkarby and Älvkarleby) indicated that, for a cigarette butt to be reasonably well identified, it ideally needs to span 25-30 pixels in the image. Given that a cigarette butt is approximately 30 mm long, this translates to an ideal GSD of 1 mm/pixel and, preferably, no worse than 1.5 mm/pixel.
However, the sampling height for this GSD was not well defined, though it was experimentally established to be approximately 5-8 metres. The reason these tests were not conclusive was height estimation issues, as described below.
Altitude error
Typically, altitude is measured by a barometric sensor onboard the autopilot (though GPS and other sensor inputs also play a role)12. Although the barometric sensor has very high resolution (of the order of tens of centimetres), it is susceptible to minuscule changes in air pressure, for example when a wind gust blows, in which case the vehicle changes altitude to compensate for what it (erroneously) perceives as a change in altitude. To get a rough idea of the variation possible due to wind gusts, the vehicle was placed on the ground, exposed to the wind, on a moderately windy day (3-4 m/s) and its altitude estimate was recorded. The result is shown in the image below:
12 It has to be noted here that other sensors of ground clearance exist, such as ultrasonic or lidar rangefinders. Ultrasonic rangefinders are very common due to low cost but have very short range (typically up to 8m) so are not useful for flights higher than that. Lidar has a much greater range (of the order of 100m) but is also costlier and bulkier. In this project, the size of the cameras being evaluated did not allow much flexibility in adding extra sensors so these were not considered.
As can be seen, the variation in altitude estimated via barometric pressure is just under 1 metre for a vehicle stationary on the ground, exposed only to gusts of wind. It has to be noted, however, that, due to the wind gradient, wind speed close to the ground is lower than what the vehicle typically encounters aloft. Thus, it is reasonable to assume that the error in altitude introduced by wind gusts on a typical windy day is probably at least 1 metre. The extent to which the vehicle actually changes its altitude to match depends on the duration of the gust and the dynamics of the vehicle (how aggressively it has been tuned to respond and how strong the motors are).
When the vehicle is at its typical operating altitude (for a traditional mapping mission that is many tens of metres), this is not an issue, since a change of, say, 1 metre is only 2% when flying at an altitude of 50 metres (a very common mapping height). However, in this case, trying to establish the optimal GSD required, a change of 1 metre when flying at 5 metres is significant (20%). To account for this, flying altitudes were generally very conservative, erring on the side of low altitude to ensure adequate GSD.
Ground based tests
To establish the impact of the altitude uncertainty, ground based tests were performed by
placing a target (cigarette butt, 30mm in length) and taking pictures at varying horizontal
distances (distance sweep).
Results (without zoom) are shown below for the Canon IXUS 160:
2 m, 38 pixels; 4 m, 24 pixels; 6 m, 17 pixels; 8 m, 17 pixels; 10 m, 13 pixels
And the Sony RX100:
2 m, 68 pixels; 4 m, 34 pixels; 6 m, 23 pixels; 8 m, 17 pixels; 10 m, 14 pixels
These results are summarised in the table and graph below:
Distance (metres)            2       4       6       8       10
IXUS 160  pixel length       38      24      17      17      13
IXUS 160  GSD (mm/pixel)     0.79    1.25    1.76    1.76    2.31
RX100     pixel length       68      34      23      17      14
RX100     GSD (mm/pixel)     0.44    0.88    1.30    1.76    2.14
Looking at these results, it was decided to put the cutoff for an acceptable resolution at a GSD of 1.5 mm/pixel. For the two cameras, this is achieved at approximately 5 metres for the IXUS 160 and 7 metres for the RX100.
Also, note the difference in noise level between the two cameras at similar resolutions. This is a good example of why GSD, although important, is not the only criterion for an acceptable image.
Having established an acceptable reference value for GSD, the next step was to see how much the distance envelope can be extended by zooming (changing focal length).
To test this, a set of photos was taken by the RX100 only, at a distance of 10 metres across a
full zoom sweep. The 35mm equivalent focal lengths were 30, 40, 50, 55, 60 and 70 mm.
NOTE: for ease of comparison with other cameras, we will use the 35mm equivalent focal length, NOT the camera's actual focal length. This is recorded in the EXIF metadata as the focal length in 35mm.
30 mm, 15 pixels; 40 mm, 19 pixels; 50 mm, 24 pixels; 55 mm, 27 pixels; 60 mm, 28 pixels; 70 mm, 34 pixels
Results summarised in table and graph below:
Focal length (35 mm equivalent)   30      40      50      55      60      70
Pixel length                      15      19      24      27      28      34
GSD (mm/pixel)                    2.00    1.58    1.25    1.11    1.07    0.88
Summarising these results: as we saw in the distance sweep for the RX100, the unzoomed GSD at 10 metres is 2.14 mm/pixel (a length of 14 pixels for the 30 mm target). Staying at 10 metres, a fully zoomed image at a focal length of 70 mm (in 35 mm equivalent) gives us a GSD of 0.88 mm/pixel. This means that, at full zoom, we can improve our GSD by a factor of 2.4 (2.14 / 0.88 ≈ 2.4; the quoted zoom in the camera specs is 2.9).
Going back to the graph of the distance sweep, we see that, unzoomed, the GSD cutoff of 1.5 mm/pixel occurs at roughly 7 metres.
Given the results of the zoom sweep, we conclude that, in terms of GSD only, we should theoretically be able to fly at approximately 17 metres (7 m × 2.4) and still achieve a GSD of 1.5 mm/pixel.
Airborne tests
Having got an initial idea of the issues involved through the ground based testing, the next step was to assess airborne performance. When it came to unzoomed performance for both cameras, it was well established that a height of 5 metres was generally guaranteed to yield good results. The question that then needed to be investigated was: can similar results be obtained by flying higher but using zoom?
In principle, the answer should be yes. However, there are various issues that need
addressing first. A zoomed in image (increased focal length) results in a narrower field of
view (angle). This, in turn, means that a given rotation of the camera has a much bigger
impact (in terms of changing the frame) on the highly zoomed (greater focal length) camera
as shown in the image below:
In an airborne multirotor, there are motions due to at least three factors: control inputs,
turbulence and frame vibration.
Control input and turbulence induced motions are low frequency and can be countered by
increased shutter speed of the camera and using a gimbal for stabilisation. Frame vibration
is high frequency and can be countered by vibration isolation, e.g. rubber couplings.
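To get a feel for why zooming amplifies these motions in the frame, the sketch below computes the horizontal field of view from the sensor width and focal length, and then the fraction of the frame that a 1 degree camera rotation sweeps. The sensor and focal length values are the same illustrative assumptions used earlier, not exact project measurements.

```python
import math

def horizontal_fov_deg(sensor_width_mm, focal_length_mm):
    """Horizontal field of view (degrees) of an ideal rectilinear lens."""
    return math.degrees(2 * math.atan(sensor_width_mm / (2 * focal_length_mm)))

# Illustrative: 13.2 mm wide sensor; ~11 mm and ~25.7 mm actual focal lengths
# correspond roughly to 30 mm and 70 mm in 35 mm equivalent terms.
for label, focal_mm in (("wide (~30 mm eq.)", 11.0), ("zoomed (~70 mm eq.)", 25.7)):
    fov = horizontal_fov_deg(13.2, focal_mm)
    print(f"{label}: FOV {fov:.1f} deg, 1 deg of rotation sweeps ~{100 / fov:.1f}% of the frame")
```

Roughly halving the field of view roughly doubles how far the frame shifts for the same small rotation, which is why vibration that is invisible when unzoomed becomes visible at full zoom.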
To test performance, the vehicle was flown at various heights, from 8 to 20 metres, at maximum zoom with the Sony RX100 (70 mm focal length in 35 mm equivalent terms). Since initial results were not good, shutter speed was increased (up to a maximum of 1/2000 s). Many images were captured with the vehicle hovering in place and with the camera mount vibration isolated via rubber couplings.
However, even with the vehicle stationary and a high shutter speed, the results were still not good enough; an example image is shown below:
15 metre altitude, 1.5 mm/pixel GSD but image blurry.
Having multiple images, all of them blurry, with the vehicle stationary excludes control inputs and turbulence as the cause. The most likely conclusion is that frame vibration, while not a problem at the wide field of view (unzoomed) setting, becomes problematic when zooming in.
Evaluating the use of a gimbal for image stabilisation was not an option either. Off the shelf drones with gimbal mounted cameras typically cannot zoom. Using a third party gimbal on a camera as heavy as the RX100 would increase the weight of the system to the point where flight time would be severely impacted, while the benefits were not guaranteed.
In conclusion, given these points, the most suitable strategy for data sampling was flying low
without zoom. This translated to the following operational parameters:
● Altitude of approximately 5-8 metres
● Rate of 1 image per second
● Ground speed of 1 metre per second (to allow adequate overlap between successive images; see the sketch below)
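As a sanity check on these parameters, the sketch below estimates the forward overlap between successive frames from the flight height, camera geometry, image rate and ground speed. The sensor figures are the same illustrative assumptions as before, not the project's flight-planning code.

```python
def forward_overlap(sensor_height_mm, focal_length_mm, height_m,
                    ground_speed_mps, seconds_between_images):
    """Fraction of each image footprint shared with the next image along track."""
    footprint_m = sensor_height_mm * height_m / focal_length_mm  # along-track footprint
    advance_m = ground_speed_mps * seconds_between_images        # distance travelled
    return max(0.0, 1.0 - advance_m / footprint_m)

# Illustrative: 8.8 mm sensor short side, 11 mm focal length, 5 m altitude,
# 1 m/s ground speed, 1 image per second -> roughly 75% forward overlap.
print(f"{forward_overlap(8.8, 11.0, 5.0, 1.0, 1.0):.0%}")
```

Under these assumptions each new frame still shares about three quarters of its footprint with the previous one, which is in the range normally used for drone mapping.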
Post processing
Having established the operational parameters, the next task was to establish how to
process the generated data. Our dataset was made up of high resolution images. Our goal
was to identify and locate litter items in these. One obvious way is to do so manually; this is definitely possible and can yield very satisfactory results, as shown below:
However, manual identification, although possible, is very labour intensive. Therefore, the
ideal implementation would automate some or all of the identification workload. Hence, we
focused on investigating methods for automatic identification of the litter.
Historical Background - state of the art
In recent years, machine learning and, specific to our case, machine vision, have been
advancing at a great pace. This is due to a number of reasons.
On the hardware side, computer graphics cards (GPUs) have become very powerful and
their computational capabilities happen to be very well suited for solving various classes of
problems relevant to machine learning.
On the data side, with the proliferation of the internet, the datasets of tagged images
available to researchers have become enormous.
On the software side, various techniques that had been used in research but not to their full potential are now being refined into very powerful tools, thanks to these advances in hardware and datasets.
Of the various traditional machine learning techniques, the ones that we are interested in are
Artificial Neural Networks (ANNs) and, specifically, Convolutional Neural Networks13
(CNNs).
CNNs have been the focus of very intense research and development recently, due to their ability to process large input data using a smaller sized network compared to other types of ANNs. For this reason, the various machine vision tools that are starting to appear, such as face recognition, object detection etc., are largely made possible through CNNs, as opposed to more traditional computer vision approaches.
Broadly speaking, there are three tasks of increasing complexity that are handled by such
networks: classification, detection and segmentation.
Image from Review of Deep Learning Algorithms for Object Detection
Without going into details14, classification looks at a single item in an image and classifies it.
Detection identifies different classes of items in an image and produces their rough locations
(bounding boxes). Segmentation does the same but produces very fine locations (pixel by
pixel).
In our case, we have multiple items of interest in each image, so we are looking at the latter
two cases.
13 https://en.wikipedia.org/wiki/Convolutional_neural_network#History 14 A more in depth explanation can be found here: https://medium.com/comet-app/review-of-deep-learning-algorithms-for-object-detection-c1f3d437b852
Implementation issues
Our case had various unique characteristics that made it stand out from more traditional
implementations. These had to do with the size of the items and the size of our dataset.
Regarding the size of the items, typical use involves an image where the object or objects of interest are prominent and take up a significant part of the image, as shown in the previous image. By comparison, our images are very large (5472 x 3648 pixels when using the Sony RX100) and the items are very small, as seen below:
In the image above, the longest item, the plastic bottle in the middle bottom, is approximately
150 pixels long. The smallest items, cigarette butts, are about 20-30 pixels long and barely
visible without zooming in (for example, top left). This means that, unlike traditional use
cases, most of our input area is irrelevant.
Regarding the dataset, training a network requires a very large amount of tagged data. Tagged data implies that items of interest are identified, localised and labelled in the data, typically by humans. This is a very labour intensive process and beyond the scope and resources of the current project.
In our case, such datasets of litter do not yet exist in the public domain15. The Plastic Tide
project, which was working on a similar problem, crowdsourced this part of the workflow by
asking the public to tag items in their dataset16. However, the Plastic Tide’s dataset is taken
at a Ground Sampling Distance (GSD) roughly ten times coarser than ours. This meant that,
in their data, items such as cigarette butts, which we had decided we wanted to include,
would barely be identifiable at all. Therefore, using Plastic Tide’s dataset was not an option.
Instead, we had to establish our own workflow that would generate a sufficient dataset for
our purposes.
15 Mohammad Saeed Rad et al, “A Computer Vision System to Localize and Classify Wastes on the Streets”, arXiv:1710.11374, submitted on 31 Oct 2017. Section 4, “The Dataset”. 16 https://www.theplastictide.com/tagit/
Workflow Implementation
Generating a dataset
To circumvent the limitations discussed in the previous section, we tried a different approach
to generating a dataset.
When we capture images of litter, we are actually capturing in exactly the same way as when using a drone to map an area. This requires images with overlap and no blind spots. Using the right software, such images can be used to create a 3D representation of the area in the images. In this project, we have used Agisoft Metashape, for reasons explained in more detail below.
Now, generally, images used for mapping have some overlap but are spread wide, in order
to cover a large area, for example see image below:
The circle colours in the image above correspond to how much overlap there is for various areas, i.e. purple dots are only "seen" in one image, blue dots are "seen" in two images, etc.
Typically, drone mapping missions try to spread out the images over as wide an area as
possible, while retaining sufficient overlap for the map to be generated.
However, it is possible to cover a very small area with a mostly stationary drone shooting at roughly the same place for the whole duration of its flight. This way, we build a 3D model with extreme overlap (the points at the centre of the model are "seen" by hundreds of images).
Here, we see a 3D model where all the images (rectangles at top) are more or less in the same region and the modelled area is very small, approximately 15 metres across (approximately three image lengths at the GSD we use). The upside of this is that any random point reasonably close to the centre of the area will be visible in most, if not all, of the images.
Now, once the 3D model has been generated, a user can manually mark polygons on the
model. These polygons are created in the coordinate system of the model.
[Figure: model point cloud, textured model, and one of the source images]
As shown in the image above, since the (white) polygons are in the coordinate system of the
model, they can be automatically projected at the correct coordinates onto any image of the
dataset which “sees” the location of the polygon.17
One of the most powerful features of Metashape is the ability to run custom python scripts18.
We have written a script that, once all the items are tagged manually, exports all the
polygon data as a .json file19.
This data is then further manipulated by python scripts to generate .xml files in PASCAL
VOC format20, a format used in visual machine learning.
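The exact layout of the exported .json is defined by the project's scripts on GitHub; the sketch below assumes a simplified, hypothetical record format (one entry per image with pixel-space bounding boxes) purely to illustrate how such data maps onto PASCAL VOC XML.

```python
import json
import xml.etree.ElementTree as ET

def record_to_voc(record):
    """Build a PASCAL VOC annotation tree from one hypothetical exported record.

    Assumed (illustrative) record layout:
    {"image": "IMG_0001.JPG", "width": 5472, "height": 3648,
     "items": [{"label": "litter", "xmin": 10, "ymin": 20, "xmax": 60, "ymax": 55}]}
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = record["image"]
    size = ET.SubElement(root, "size")
    for tag, value in (("width", record["width"]), ("height", record["height"]), ("depth", 3)):
        ET.SubElement(size, tag).text = str(value)
    for item in record["items"]:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = item["label"]
        box = ET.SubElement(obj, "bndbox")
        for corner in ("xmin", "ymin", "xmax", "ymax"):
            ET.SubElement(box, corner).text = str(item[corner])
    return ET.ElementTree(root)

if __name__ == "__main__":
    with open("polygons.json") as handle:            # hypothetical export file name
        for record in json.load(handle):
            record_to_voc(record).write(record["image"].rsplit(".", 1)[0] + ".xml")
```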
By using this method, we can end up with large datasets in a relatively short time (1-2 days
depending on the size of the dataset). For example, a ten minute flight can generate
between 400 and 600 usable images. Assuming all items are visible in all images, tagging
one item results in tagging 400 to 600 items, thus multiplying the human labour by a factor of
400 to 600.
It should also be noted that this multiplication of items does not happen via the data augmentation tricks common in computer vision, such as flipping, rotating, etc.; the additional instances are genuine separate images, taken from slightly different locations, lighting conditions, angles etc.
However, the drawback is that, for each flight, the items are at the same location, meaning
that, although they are genuinely different images, they are images of the same items
against the same background. Another drawback is that, due to uncertainty in the transfer
between coordinate systems, the polygon edges can sometimes be misplaced, which means
that they have to be drawn with fairly large margins of error to avoid them cutting across an
item.
Training the network
Once we have the dataset, the next decision concerns, of course, what type of network to
use. A very popular type of network for object detection is YOLO (You Only Look Once),
currently at version 3 [21]. This network performs object detection, i.e. bounding box localisation and classification inside an image.
17 It has to be stressed that this projection is purely geometric. The user draws and labels the polygons by hand by looking for litter items in the model. The software calculates the locations of the polygons’ vertices in image coordinates by projecting the vertices back towards each image where the polygon is visible. No image processing is taking place at this step, the software simply carries out the projection of the points between the two coordinate systems. 18 http://www.agisoft.com/pdf/Metashape_python_api_1_4_0.pdf 19 All software developed as part of this project is open sourced on github at:
https://github.com/gitdillo/Garbage 20 http://host.robots.ox.ac.uk/pascal/VOC/ 21 https://pjreddie.com/media/files/papers/yolo_1.pdf, https://arxiv.org/pdf/1612.08242.pdf, https://pjreddie.com/media/files/papers/YOLOv3.pdf
We used an open source implementation22, which yielded very promising results on our initial dataset of approximately 2000 images (containing 7 000 tagged items); see the example image below:
Once it was apparent that the network was performing well on the initial dataset, the dataset was enriched. By the end of 2018, we had datasets of approximately 15 000 images against grass, gravel, asphalt and roadside backgrounds.
With the experience of training across these diverse datasets, one thing became clear. YOLO is a popular network that performs detection, i.e. both bounding box localisation and classification. However, a network that adequately captures the complexity of the input (diversity of litter shapes and backgrounds) as well as of the output (a large number of possible litter classes) needs to be trained on a very large and diverse dataset and will end up being resource intensive if it is to perform well. On top of that, a known limitation of YOLO is the detection of small items, as discussed in Redmon et al., 2015 [23].
In our case, with a mix of small items (cigarette butts) and arbitrarily large items (bottles, bags or even larger), it would be hard to find a compromise to fit all possibilities. Our network was scoring very high within its training and validation set (the validation set is a set of images not used during training, which is only used to evaluate the performance of the network). However, when presented with data from a new dataset, performance degraded very noticeably, with frequent incorrect classifications.
22 https://github.com/experiencor/keras-yolo3 23 https://arxiv.org/abs/1506.02640
Final Implementation
In our case, items were typically not very tightly clustered, took up a very small percentage of the input image and could belong to a wildly diverse range of classes and sizes. Therefore, instead of trying to accurately classify the outputs, we decided the final implementation should focus on the detection performance of YOLO only.
Given that we are sampling at a roughly constant GSD (around 1-1.5 mm per pixel), we can make some safe assumptions about the item sizes in pixels. The input was sliced into 500 x 500 pixel chunks, making the resulting footprint of each image slice no smaller than half a metre across. On one hand, this is adequate for most common litter items; on the other hand, it is small enough to ensure that our items are not vanishingly small compared to the input image size, thus circumventing, to some extent, YOLO's performance problems on small items.
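The project's slicing code lives in its GitHub repository; the fragment below is an illustrative re-creation of the idea rather than the actual implementation.

```python
from PIL import Image

def slice_image(path, tile_size=500):
    """Yield (left, top, tile) for non-overlapping tile_size x tile_size crops."""
    image = Image.open(path)
    width, height = image.size
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            box = (left, top, min(left + tile_size, width), min(top + tile_size, height))
            yield left, top, image.crop(box)

# Each tile is passed to the detector; the (left, top) offsets allow any detected
# bounding boxes to be mapped back into full-image coordinates.
```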
Finally, the network does not attempt any classification at all. Suspect items are simply
marked as litter as shown below:
Given that, for our use case, the vast majority of the input image was irrelevant, there was
little point in suffering the performance penalty (both in training and during use) of a network
that accurately classifies the detected items. By focusing on the bounding box detection
only, we can get an idea of the litter in our sample. Secondly, having discarded the majority
of the input, the detected items can then be passed to another dedicated classifier network,
whose only job will be classification.
This method offers various advantages:
● The input is parsed quickly and efficiently and most of it is immediately discarded.
● False positives do not matter that much since the classifier can reject them. In fact,
it’s better to get false positives than false negatives.
● The whole system is a lot easier to retrain. If a single network was doing both jobs,
adding or removing a class would require retraining the whole network. This way,
each component can be trained solely on its specific task as more usage data comes
in.
To counter the problem of insufficient diversity in the dataset backgrounds, datasets were
generated against very different backgrounds (sand, grass, asphalt, gravel) and the images
mixed for training.
Again, this method gave very high scores (mAP > 95%) in the training and validation sets.
However, when evaluated in completely fresh datasets, even of the same type of
background, the performance suffered greatly as can be seen in the image below:
In this example, the network is tested on a completely new dataset. Obvious large items
such as cans and coffee cups are missed, bounding boxes (bottom left) are miscalculated,
we have double detections (bottom left, top right) and not a single cigarette butt has been
detected. While these results are in stark contrast with the results of the validation sets
during training, they are not surprising.
Since the tagging is done automatically, based on multiple images of the same, small set of
items against a given background per dataset, the main problem is the lack of variation,
mainly in backgrounds. This results in a network that performs exceptionally well within the
limited envelope of training data but underperforms when faced with new data. While this is
an expected shortcoming of this methodology, it is not easily addressed, other than by
committing significant resources to manually tagging an extensive dataset, something
beyond the resources and scope of the current project.
Data Augmentation Investigation
Given the issues identified so far, a new dataset generating methodology was investigated.
Instead of planting litter items at fixed locations and taking multiple photographs from
different locations, what if images of litter items were “planted” automatically in otherwise
empty photographs? Similar techniques are being used in other computer vision projects,
where photorealistic, computer generated imagery is used as input to a network for training
purposes. In our case, we decided to use the most difficult litter items, cigarette butts, to
investigate whether we could achieve good detection for these against a variety of
backgrounds. Hence, instead of trying to detect multiple items but facing the problem of low
background variation, we concentrated on just one (difficult) item against many types of background.
The workflow for generating datasets is as follows:
1. get high resolution images of many different items of the class we are interested in
2. remove the background and crop so we are left with only the item itself against a
transparent background
3. get multiple images of empty backgrounds representative of the types encountered in
the field
4. downsample the (cropped with transparent background) images of the litter items to a resolution low enough to correspond to the resolution of the background (typically ~30 pixels long for a cigarette butt)
5. rotate and paste the (cropped with transparent background) images of the litter items
onto random locations in the background images
The first step is to get good quality images of the litter items. In order to do that, the camera was mounted on a tripod and each item was placed against a black background but kept away from it (by approximately 20 cm) to avoid reflected light interfering with the edge pixels. An example image is shown below:
The question then becomes how to algorithmically remove the background, in order to be able to handle large numbers of input images automatically. Because we have a black background, we can do that via a relatively simple background subtraction using pixel intensity, followed by identifying the largest item, cropping to the smallest bounding box and adding a transparent layer (alpha channel) to the surrounding pixels which do not belong to the item24.
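A minimal OpenCV sketch of this masking and cropping step is shown below. It is an illustrative re-creation under assumed parameter values and placeholder file names, not the project's actual create_minimal_image function.

```python
import cv2
import numpy as np

def extract_item(path, threshold=40):
    """Crop the largest bright object from a dark background and add an alpha mask."""
    image = cv2.imread(path)                                    # BGR image
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Pixels noticeably brighter than the black background belong to the item.
    _, mask = cv2.threshold(gray, threshold, 255, cv2.THRESH_BINARY)
    # Keep only the largest connected contour (the litter item itself).
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    largest = max(contours, key=cv2.contourArea)
    item_mask = np.zeros_like(mask)
    cv2.drawContours(item_mask, [largest], -1, 255, thickness=cv2.FILLED)
    # Crop to the smallest bounding box and use the mask as an alpha channel.
    x, y, w, h = cv2.boundingRect(largest)
    bgra = cv2.cvtColor(image[y:y + h, x:x + w], cv2.COLOR_BGR2BGRA)
    bgra[:, :, 3] = item_mask[y:y + h, x:x + w]
    return bgra

cv2.imwrite("item.png", extract_item("butt_on_black.jpg"))      # placeholder file names
```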
After these operations, we end up with an image as shown below:
This, of course, is at too high a resolution (this example is 1100 pixels long), compared to
what can be realistically sampled by a drone so it needs to be downsampled. In the next
image, examples of different downsampling methods25 are shown:
24 The code for this is in function ”create_minimal_image” of module “image_utils” in github repository: https://github.com/gitdillo/Garbage 25 As described in: https://docs.opencv.org/2.4/modules/imgproc/doc/geometric_transformations.html#resize
These images can now be pasted against any background in order to create a composite as
shown below:
Here, the cigarette butts inside the red rectangle have been artificially pasted. The
advantage of this technique is that they are at known locations. This way, large datasets can
be generated automatically, with a set of items artificially pasted onto diverse backgrounds.
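A sketch of the pasting step is shown below, using Pillow. The scale factor, file names and random placement policy are illustrative assumptions; the point is that the paste position directly yields the bounding box for the annotation, so no manual tagging is needed.

```python
import random
from PIL import Image

def paste_item(background_path, item_path, scale=0.03):
    """Paste one transparent-background item crop at a random rotation and position."""
    background = Image.open(background_path).convert("RGB")
    item = Image.open(item_path).convert("RGBA")
    # Downsample the high-resolution crop to roughly the GSD of the background.
    item = item.resize((max(1, int(item.width * scale)),
                        max(1, int(item.height * scale))), Image.LANCZOS)
    item = item.rotate(random.uniform(0, 360), expand=True)
    x = random.randint(0, background.width - item.width)
    y = random.randint(0, background.height - item.height)
    background.paste(item, (x, y), item)          # alpha channel acts as the paste mask
    return background, (x, y, x + item.width, y + item.height)

composite, bbox = paste_item("clean_sand.jpg", "cigarette_butt.png")   # placeholder names
composite.save("composite.jpg")
print("bounding box for the annotation file:", bbox)
```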
To investigate this method, a new dataset was generated as follows. First, a small, clean
region on a sandy beach was sampled. Then, cigarette butts were manually placed and the
region was sampled again.
The "clean" images were used as backgrounds for generating the dataset using the pasting method. Evaluation was then performed by identifying how many of the planted items were identified in the map of the "dirty" region.
Using the network trained on this data, the detection rate was approximately 1 in 7 (3 out of 22 cigarette butts); see the image below:
The red dots in this image mark cigarette butts. The rectangles (bottom left and right) are
bounding boxes of detected items.
This version of the network concluded the training attempts of our YOLO network in this
project. However, this left an important unanswered question. Were the generally poor
results indicative of a poor choice of network or training regime, or was the problem itself too
difficult for the current state of the art of machine learning? To test this, the same training
data discussed here (the “clean” sandy region with the pasted litter) was processed by a
commercial operator, nanonets26. The resulting network was then evaluated on the “dirty”
beach images. The resulting network identified 19 out of 22 items as shown below:
26 https://nanonets.com/
The yellow dots are the cigarette butts. The red rectangles are bounding boxes of
detections. Empty rectangles show false positives (approximately 100). Yellow dots without
rectangles show false negatives.
The difference between the two networks (our own YOLO and the network trained by nanonets) shows that there is indeed great potential for machine learning approaches. The modest results of our attempts using YOLO were most likely due to weaknesses in the implementation rather than the problem being too hard for machine learning to address.
Manual identification
Besides automatic identification, there is manual identification from the orthophotos
generated after the drone flight. The workflow for this is as follows:
1. The drone captures images, either via pre-programmed route or manual flight.
2. The images are post processed into an orthophoto of the area.
3. The orthophoto is examined by a human and items of interest are tagged.
An example of the resulting orthophoto and tagged items is shown below:
This process has to be contrasted with the traditional method of measurement, which is as
follows:
1. the area is scanned by humans and the litter found is picked up and categorised by
litter type according to a special protocol.
Both methods are labour intensive but at different stages. The drone flights take less time than the manual picking up of litter; however, the examination of the orthophoto takes longer than the counting of the litter. In other words, the manual drone method leads to less time spent in the field but more time in post processing compared to the traditional method.
In terms of accuracy, at an example site (Rullsand, Älvkarleby municipality), manually picking up litter led to 290 items of litter being identified, while examining the orthophoto led to 75 items being identified. The difference in detection rates is down to uncertainty when examining the orthophoto; for example, compare these images of a fork and an uncertain item below:
While the fork is easily identifiable as litter, the other item is not so easy to identify. This is an inherent issue with images: even if the resolution is sufficient, the item might not be easily identifiable. For a human on the beach, it is easy to get closer and examine an item which might be partially covered or discoloured. In contrast, someone examining the orthophoto only has the image on which to base a decision.
More about the results from the comparison between the manual methods can be found in the project report by Keep Sweden Tidy27.
Large litter items
To identify larger litter items, flights were performed by both a quadrotor flying at 40 metres and a fixed wing flying at 60 metres. Again, the resulting images were collated into a map and the map examined manually. The methodology can be valuable for identifying larger marine litter items (>50 cm) in the MARLIN method, which is done on a 1000 m stretch of beach with a maximum width of 50 m [28], or for finding hot-spots where currents and winds accumulate litter.
Unlike the case of small items, the results when examining the imagery were superior. This is shown in the image below:
27 TrashCam - Drönare för kartläggning av marint skräp i Östersjön. Håll Sverige Rent, 2019. 28 For example, the beach is considered to end where vegetation becomes stronger, cliffs or dunes take over, or where a road or physical obstacle begins (e.g. a wall, fence or house). If the beach is deeper than 50 m, the depth is measured at a distance of 50 m from the waterline, and this point is then taken as the back edge of the beach.
The results shown here were collected using the fixed wing, on a flight that lasted 9 minutes at 60 metres, resulting in a GSD of 1.3 cm/pixel. It has to be noted here that a fixed wing can cover 3 to 4 times the area of a multirotor with similar payload capabilities. In the case of the equipment used in this project, the quadrotor flight time was approximately 12 minutes, whereas the fixed wing's theoretical flight time limit is over two hours.
Discussion
The project evaluated the use of drones in identifying beach litter. A quadrotor and a fixed wing aircraft, each equipped with an RGB camera, were used to capture imagery. The imagery was then post processed into a map and used to identify litter either manually or automatically.
Identifying small litter manually in the imagery is more time consuming and less accurate than doing it in the field. However, it leaves a digital record of the locations where litter was identified, which might be of interest for future reference. Identifying large litter (>50 cm) using drone imagery is far more efficient than trying to identify small items this way.
Automatic identification is an intriguing possibility. While the current project did not produce a
satisfactory algorithm, the results indicate that, given a strong focus on the machine learning
aspect, it is possible that a general purpose algorithm might be within reach in the near future.
It is worth noting that mapping without identification processing is also a possibility. We have seen that manual post processing is very time consuming and yields poor results, while automatic identification remains an open question. However, drone sampling itself is a lot quicker than manually picking up the litter, and the mapping post processing takes minimal human input. Given this, adding a drone sampling component to beach litter monitoring offers a digital archive which might be valuable for two reasons:
1. Given the rapid developments in machine learning, it is entirely possible that in the
future, maps generated today with very little effort can be post processed with algorithms
as yet undeveloped.
2. Even if machine learning does not deliver, a map is an archive that can be accessed in
the future to make comparisons between future recordings and the historical record.
Unlike sampling by picking up litter, which does not preserve location, a digital map
preserves the locations of the items it contains.
Options for future implementations
Spectral signature
In terms of identification of small items, most of what could be investigated has been investigated in this project. Manual identification from drone imagery is labour intensive and automatic identification, while very promising, requires significant effort to yield results across the whole spectrum of items.
However, in this project we focused on RGB imagery. Identification through spectral signatures beyond RGB was not explored due to the high cost of equipment. Efforts in this field are underway internationally, though this is still far from mainstream; examples include the Ocean Cleanup [29] project's "Sensing Ocean Plastics with an Airborne Hyperspectral Shortwave Infrared Imager" [30] and the Plastic Litter Project 2018 [31] and 2019 [32] of the University of the Aegean.
Using information from multiple bands can greatly boost detection performance. For example, consider the following image, where one item stands out as litter due to its colour (top left) whereas another stands out due to its shape (bottom right) but cannot be positively identified.
Currently, the rapid development of sensors beyond the visible bands (e.g. multispectral, hyperspectral, thermal) is continuously pushing down the price of such equipment. It is possible that, in the near future, a combination of multiple bands and machine learning algorithms will yield very satisfactory results for the identification of synthetic materials in the environment.
Large scale monitoring - fixed wings
Given the promising performance of fixed wings for the larger items (>50 cm), it is very
interesting to consider their use in greatly expanded monitoring. The image below shows the
region of the monitored beach at Rullsand, Älvkarleby municipality.
29 https://theoceancleanup.com/ 30 DOI: 10.1021/acs.est.8b02855 31 https://mrsg.aegean.gr/?content=&nav=55 32 https://mrsg.aegean.gr/?content=&nav=65
The area in red is the area where small items are manually picked up and counted. The area
in blue is the longer stretch where large items (>50 cm) are identified and recorded. This is
the area that a fixed wing scanned in 9 minutes, using less than 20% of its battery capacity.
It is, therefore, quite possible to expand the monitored area to cover much larger regions. In
the case of Rullsand, it would be possible to expand to the yellow area, adding another 2.5
kilometres to the scanned area in a single flight.