Synthetic Meta-Learning
Learning to learn real-world tasks with synthetic data

LUKAS LUNDMARK

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Master in Machine Learning
Date: September 10, 2019
Supervisor: Joel Brynielsson
Supervisor at FOI: Linus Luotsinen
Examiner: Olle Bälter
Swedish title: Syntetisk metainlärning: Lära sig att lära verkliga uppgifter med syntetisk data

Abstract

Meta-learning is an approach to machine learning that teaches models how to learn new tasks with only a handful of examples. However, meta-learning requires a large labeled dataset during its initial meta-learning phase, which restricts the domains in which meta-learning can be used. This thesis investigates whether this labeled dataset can be replaced with a synthetic dataset without a loss in performance. The approach has been tested on the task of military vehicle classification. The results show that for few-shot classification tasks, models trained with synthetic data can come close to the performance of models trained with real-world data. The results also show that adjustments to the data-generation process, such as light randomization, can have a significant effect on performance, suggesting that fine-tuning of the generation process could further improve performance.


Sammanfattning

Meta-learning is a methodology within machine learning that makes it possible to teach a model new tasks with only a handful of training examples. However, meta-learning requires a large amount of training data during the meta-training phase itself, which limits the domains in which the methodology can be used. This degree project investigates whether synthetic image data, generated with the help of a simulator, can replace real image data during the meta-learning phase. The method has been evaluated on military vehicle classification. The results show that for image classification with 1–10 training examples per class, a model meta-trained with synthetic data can approach the performance of a model meta-trained with real data. The results also show that small changes in the generation process, for example the degree of random lighting, have a large impact on final performance, which gives hope that further fine-tuning of the generation process can yield even more performance improvements.


Acronyms

ANN: artificial neural network
CACTU: clustering to automatically construct tasks for unsupervised meta-learning
CNN: convolutional neural network
DR: domain randomization
FOI: Swedish Defence Research Agency
FOMAML: first-order model-agnostic meta-learning
MAML: model-agnostic meta-learning
MLP: multilayer perceptron
RELU: rectified linear unit
SDR: structured domain randomization
SQF: status quo function
TCNN: transfer convolutional neural network
UMTRA: unsupervised meta-learning with tasks constructed by random sampling and augmentation
VBS3: Virtual Battlespace 3
XAI: explainable artificial intelligence

Contents

1 Introduction
    1.1 Background
    1.2 Scientific Gap
    1.3 Problem Statement
    1.4 Scope
    1.5 Novelty and Scientific Relevance

2 Background and Theory
    2.1 Machine Learning
        2.1.1 Generalization and Overfitting
        2.1.2 Deep Learning
        2.1.3 Convolutional Neural Networks
        2.1.4 Training Neural Networks
        2.1.5 Data Augmentation
        2.1.6 Transfer Learning
    2.2 Few-Shot Learning and Meta-Learning
        2.2.1 Summary
    2.3 Related Works
        2.3.1 Synthetic Data
        2.3.2 Domain Randomization (DR)
        2.3.3 Structured Domain Randomization (SDR)
        2.3.4 Summary
    2.4 Thesis Background and Suggested Approach
        2.4.1 FOI
        2.4.2 Approach
        2.4.3 Why Synthetic Data?

3 Methodology
    3.1 Data Generation
        3.1.1 VBS3
        3.1.2 Generation Process
        3.1.3 Image Randomization
        3.1.4 Generated Datasets
        3.1.5 Summary
    3.2 Meta Training
        3.2.1 Task Generation
        3.2.2 Problem Settings
        3.2.3 Image Pre-Processing
        3.2.4 Image Augmentation
        3.2.5 Network Architecture and Hyperparameters
        3.2.6 Test Data
        3.2.7 Performance Evaluation
        3.2.8 Baselines
        3.2.9 Summary
    3.3 Programming Libraries and Frameworks

4 Result
    4.1 Datasets
    4.2 Test Accuracy
    4.3 Training Accuracy

5 Discussion
    5.1 Overall Performance
    5.2 Effect of Image Randomization
    5.3 Realism vs. Visual Variety
    5.4 The Effect of Task Difficulty
    5.5 Are We Actually Learning?
    5.6 Baselines
    5.7 Ethics and Sustainability

6 Conclusions
    6.1 Future Research

Bibliography

A Appendix
    A.1 Hardware
    A.2 VBS3 Vehicle Dataset
        A.2.1 Images
        A.2.2 Vehicle Classes
        A.2.3 Image Background
        A.2.4 Vehicle Data

1 Introduction

1.1 Background

Machine learning and deep learning have developed at a rapid pace. State-of-the-art machine learning models can now outperform humans in a variety of tasks, both in terms of accuracy and efficiency. However, even state-of-the-art machine learning models are still limited when it comes to one of the hallmarks of human intelligence: generalizing from limited information. A human child can, for example, learn to recognize an unknown animal species from only seeing a handful of images. A state-of-the-art deep learning image classifier can require hundreds of labeled example images to perform the same task.

Meta-learning is an approach to machine learning that aims to address this shortcoming and have models learn how to generalize from a limited number of examples. The field and its central ideas date back several decades, but recent years have seen promising developments such as model-agnostic meta-learning (MAML) [5], which have enabled meta-learning to be used on a broader range of tasks.

Broadly outlined, meta-learning involves training machine learning models in two phases: first, an initial meta-learning phase where the model gradually acquires knowledge from various tasks $T_1, \ldots, T_N$; second, a meta-testing phase where the previously trained model is tasked with quickly adapting to previously unseen tasks $T_{N+1}$ with only a couple of examples [17].


1.2 Scientific Gap

The end goal of meta-learning is to leverage the benefits of deep learning without the use of large training sets. However, meta-learning requires large labeled datasets during the initial meta-learning phase, from which it can sample training tasks. This limits the problem domains in which meta-learning can be used.

The idea of this thesis is to use an automatically generated synthetic dataset during the meta-learning phase, and then have the trained model adapt to unseen test tasks consisting of real-world data, an approach we call Synthetic Meta-Learning.

Previous results from Tremblay et al. [26], Prakash et al. [19] and Tobin et al. [25] have shown that synthetic data can be used to complement and even completely replace real-world data for a variety of tasks.

The goal of this thesis is to explore to what degree synthetic data can be used during the meta-learning phase of MAML, how well synthetically trained models can adapt to new tasks with unseen real-world data, and how the synthetic data should be generated in order to maximize performance.

1.3 Problem Statement

Can meta-learning with synthetic image data rival the performance of meta-learning with real-world images, and how do different aspects of the synthetic image data, such as color, lighting, and object position, affect final performance?

1.4 Scope

The scope of this thesis consists of implementing and evaluating an end-to-end meta-learning pipeline, training models using synthetic data and MAML, and evaluating the models' ability to learn new tasks with real-world data (see Figure 1.1). The synthetic data consists of single-object images generated using the Virtual Battlespace 3 (VBS3) military simulator, portraying a variety of military vehicles in a fixed number of settings. A fixed number of randomization methods for introducing variation into the image data are tested in order to determine how synthetic data is best generated. Performance is evaluated on a hand-labeled real-world dataset specifically gathered for this thesis, containing single-object images of military vehicles. The network architecture and hyperparameter settings have been taken from previous related work in order to reduce the number of network configurations to be tested.

1.5 Novelty and Scientific Relevance

The need for large labeled datasets during meta-training is one of the main drawbacks of meta-learning. Several papers have focused on methods that avoid this requirement, such as unsupervised meta-learning [8, 11]. However, to the author's knowledge, there is no previous research that uses synthetic data for meta-training while adapting to real-world tasks, making the suggested approach a novel one.

Figure 1.1: Synthetic meta-learning pipeline (diagram labels: simulated environment, synthetic dataset (large), meta-learning, untrained model, trained model, fine-tuning, real-world task (few samples), real-world environment)

2 Background and Theory

This chapter introduces the background and theoretical motivations that underpin this thesis. Section 2.1 introduces the basic concepts required to understand the thesis, mainly the terminology and methodology of training and evaluating machine learning models, with a focus on deep neural networks. A reader who is well versed in these areas can skip that section. Section 2.2 explains the concepts of meta-learning and few-shot learning and introduces the MAML algorithm and its underlying theory. Section 2.3 covers previous research related to synthetic data generation. Section 2.4 outlines the proposed approach in detail, along with its advantages in comparison to existing methods.

2.1 Machine Learning

Machine learning is a field of study concerned with creating models that learn by utilizing statistics extracted from previously observed data. There are several forms of learning algorithms. Supervised learning refers to training a model on a set of labelled training data $D = \{X, Y\}$, where $X = \{x_1, \ldots, x_n\}$ are the input data and $Y = \{y_1, \ldots, y_n\}$ are labels that define the output of the function $F : \mathbb{R}^n \to \mathbb{R}$ the algorithm should learn. The learning is supervised in the sense that the algorithm can inspect the label $y_i$ of each datum $x_i$ to assess how well it performs. This assessment is commonly done by computing a cost (or loss) function, which measures the disparity between the algorithm's guess $\hat{y}_i$ and the true label $y_i$ [6].

In contrast, unsupervised learning refers to the problem of training an algorithm on unlabeled training data. This lack of labeling forces the algorithm to learn and infer properties of the data on its own, without the aid of a supervisor [6].

2.1.1 Generalization and Overfitting

The goal of any machine learning algorithm is to generalize well to previously unseen data. Generalization can be formally defined through the error rate the model or algorithm exhibits on unseen data. A lower error on this test data implies better generalization, and vice versa [6].

Every machine learning model possesses two attributes that relate to its ability to generalize. One is bias, expressed as $E[\hat{f}(x) - f(x)]$, which is the difference between the expected prediction of the model and the true value it tries to predict. A model with high bias will underfit the training data, resulting in high training error rates. The reason is that high-bias models are inherently limited in what they can learn, and are restricted to a narrower problem domain than what the training task requires [1].

The other is variance, which refers to the expected squared difference between each prediction and the average prediction: $E[(\hat{f}(x) - E[\hat{f}(x)])^2]$. For a model with high variance, small changes in the training data can strongly affect the predictions on the test data. Such a model is sensitive to noise and random patterns in the training data. As a result, it will often overfit the training data, showing low error rates during training while generalizing poorly on the test data [1, 6].

The bias-variance trade-off is a fundamental concept within machine learning that refers to the relationship between the bias and the variance. Reducing the variance of a model will invariably result in an increase in its bias, and vice versa. To optimize the performance of a model, one needs to find a proper balance between bias and variance. This can, for example, be done by regularization, which entails adding certain restrictions to a high-variance model in order to lower its variance [1].
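For squared-error loss, the two quantities above are commonly tied together by the standard bias-variance decomposition. The sketch below is not taken from the thesis: it assumes observations $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$ that is independent of the fitted model (both $\varepsilon$ and $\sigma^2$ are symbols introduced here for illustration), and the expectation is taken over training sets and noise.

```latex
\[
E\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
\]
```

Under these assumptions, the expected test error can only be lowered through the first two terms, which is why the balance between them matters.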

2.1.2 Deep Learning

Deep learning is a subfield within machine learning that focuses on the study of artificial neural networks (ANNs) and deep architectures. The field has, in recent years, seen a substantial increase in popularity, despite the technology dating back to the 1980s, with the multilayer perceptron (MLP) and the back-propagation algorithm. Large public datasets, increasingly powerful hardware, as well as increased knowledge of how these models can be trained, have all contributed to its recent rise to popularity [6].

Figure 2.1: Single neuron with n inputs and activation function α

The core of an ANN is a basic linear unit called a neuron, node or perceptron (see Figure 2.1), which is superficially inspired by the neurons in the human brain. These neurons compute the weighted sum of the input vector x with an added offset called bias (b in Figure 2.1). A non-linear activation function is then applied to the output of the neuron (α in Figure 2.1). The activation function allows the network to express more complex functions than strictly linear ones. This activation function can be any non-linear differentiable function, but the most common function is the rectified linear unit (RELU) [6].

A set of parallel neurons that are joined together make up a layer. An ANN is constructed by stacking several of these layers, letting the output of one layer become the input to the next layer (see Figure 2.2). The network then takes some input vector x, feeds it through each layer in the network, and lets the final layer, the output layer, produce the network's prediction y. If the task is a classification task, a softmax function is often applied to the final prediction vector y. The softmax function normalizes the values in the vector to a valid probability distribution [6].
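The forward pass described above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the thesis: the function names (`dense`, `relu`, `softmax`, `forward`) and the single hidden layer are assumptions chosen for the example.

```python
import math


def dense(x, weights, biases):
    """One layer of neurons: each row of `weights` holds one neuron's weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(z):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in z]


def softmax(z):
    """Normalise a vector to a probability distribution (shifted for stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Hypothetical two-layer network: hidden layer with ReLU, softmax output."""
    h = relu(dense(x, hidden_w, hidden_b))
    return softmax(dense(h, out_w, out_b))


identity = [[1.0, 0.0], [0.0, 1.0]]
probs = forward([1.0, -2.0], identity, [0.0, 0.0], identity, [0.0, 0.0])
```

With identity weights and zero biases, the hidden layer clips the negative input to zero and the softmax turns the resulting values into class probabilities that sum to one.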

Figure 2.2: Simple ANN with a single hidden layer.

2.1.3 Convolutional Neural Networks

The convolutional neural network (CNN) is a type of ANN specialized in processing grid-like input maps, such as images. A standard CNN consists of a set of stacked convolutional layers, followed by a set of fully-connected layers, similar to a standard neural network.

Each convolutional layer consists of a set of trainable filters. These filters apply a linear function in a grid-like fashion to local regions, also called receptive fields, in the input image. The outputs of these filters retain the same spatial ordering as the receptive fields. This gives the convolutional layer an advantage over standard ANNs since it preserves spatial information.

The shape of the convolutional layer is defined by two parameters: filter size f and stride s. The filter size f defines how large the receptive fields are, while the stride s defines the distance between the offsets of adjacent receptive fields. Suppose that the input image x has dimensions $x_{width} \times x_{height} \times d$, where d is the number of channels. Then the matrix-shaped output y from a single filter is expressed as

\[
y[a, b] = \sum_{m=0}^{d-1} \sum_{i=0}^{f-1} \sum_{j=0}^{f-1} w[i, j, m] \cdot x[bs + i,\; as + j,\; m] \tag{2.1}
\]

where w are the trainable weights of the filter, $x[i, j, m]$ is the value at column i, row j and channel m in the input image, $a \in [0, \lfloor (x_{height} - f)/s \rfloor]$ and $b \in [0, \lfloor (x_{width} - f)/s \rfloor]$.
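Equation 2.1 translates almost directly into nested loops. The sketch below is an illustration rather than the thesis implementation: it assumes zero-based indices, an image stored as `x[column][row][channel]`, and that a indexes output positions along the image height while b indexes positions along the width. The function name `conv_single_filter` is hypothetical.

```python
def conv_single_filter(x, w, f, s):
    """Apply one f-by-f filter with stride s; returns y[a][b] as in Eq. 2.1.

    x[col][row][channel] is the input image, w[i][j][m] the filter weights.
    """
    width, height, d = len(x), len(x[0]), len(x[0][0])
    n_a = (height - f) // s + 1  # number of vertical filter positions
    n_b = (width - f) // s + 1   # number of horizontal filter positions
    y = [[0.0] * n_b for _ in range(n_a)]
    for a in range(n_a):
        for b in range(n_b):
            total = 0.0
            for m in range(d):
                for i in range(f):
                    for j in range(f):
                        total += w[i][j][m] * x[b * s + i][a * s + j][m]
            y[a][b] = total
    return y


# 3x3 single-channel image whose value at column c, row r is c + 3r.
image = [[[float(c + 3 * r)] for r in range(3)] for c in range(3)]
ones = [[[1.0] for _ in range(2)] for _ in range(2)]  # 2x2 summing filter
feature_map = conv_single_filter(image, ones, f=2, s=1)
# feature_map == [[8.0, 12.0], [20.0, 24.0]]
```

Each output entry is the sum of one 2x2 receptive field, and the entries keep the spatial ordering of the fields they were computed from.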



Each convolutional layer consists of a number of these filters. The 2D tensors output by the filters, also known as feature maps, are stacked on top of each other, creating a 3D tensor which can be used as input for another convolutional layer. Before that, however, a non-linear activation function, such as RELU, is applied to the feature maps, which allows the layer to express more complex, non-linear functions [6]. A summary operator (or pooling operator), such as max or mean, is also applied to the feature maps, in a similar fashion to how the filters were applied (see Figure 2.3). This reduces the resolution of the resulting feature maps, so that later layers require fewer trainable weights, reducing the risk of overfitting [6].

Figure 2.3: Example of Max Pooling
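As a rough sketch of the pooling operator, the function below slides a square window over a 2D feature map and keeps the maximum of each window. The values in the example are made up for illustration and are not the ones shown in Figure 2.3; the name `max_pool` and its parameters are assumptions.

```python
def max_pool(feature_map, size, stride):
    """Max pooling over a 2D feature map given as a list of rows."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, cols - size + 1, stride)
        ]
        for r in range(0, rows - size + 1, stride)
    ]


example = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool(example, size=2, stride=2)
# pooled == [[4, 2], [2, 8]]
```

A 4x4 map pooled with a 2x2 window and stride 2 shrinks to 2x2, which is the resolution reduction the text refers to.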

Some convolutional layers also include an additional component called a batch-normalization layer. The batch-normalization layer is a function that scales its input to have zero mean and unit variance, with respect to all inputs in the current mini-batch. This function is most commonly applied before the activation function to make the network learn faster [10].
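The normalization step can be sketched as follows. This is a simplified illustration: it only covers the zero-mean, unit-variance scaling described above and leaves out the learnable scale and shift parameters that batch normalization normally adds afterwards. The function name `batch_norm` and the `eps` default are assumptions.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalise each feature over the mini-batch (a list of feature vectors)."""
    n = len(batch)
    n_features = len(batch[0])
    means = [sum(row[k] for row in batch) / n for k in range(n_features)]
    variances = [sum((row[k] - means[k]) ** 2 for row in batch) / n
                 for k in range(n_features)]
    return [
        [(row[k] - means[k]) / math.sqrt(variances[k] + eps)
         for k in range(n_features)]
        for row in batch
    ]


# Two features on very different scales end up on the same scale.
normalised = batch_norm([[1.0, 10.0], [3.0, 30.0]])
```

The small constant `eps` only guards against division by zero when a feature is constant within the mini-batch.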

2.1.4 Training Neural Networks

Feed-forward neural networks are trained in a supervised fashion: the network is presented with data-and-label pairs $(x_i, y_i)$ and aims to minimize a task-specific, real-valued loss or cost function L. This loss function measures the disparity between the network's guess $\hat{y}_i = f_\theta(x_i)$ and the ground-truth value $y_i$.

The networks are trained using an iterative optimization algorithm called gradient descent. At each iteration, the algorithm computes the gradient of the loss with respect to the network's weights using the training data, and moves the weights one step proportional to the negative gradient. Moving along the negative gradient can be conceptualized as trying to go down a hill, at every step walking in the direction with the steepest downward slope [6].

More formally, if some network is parametrized by $w_t$ at iteration t and the goal is to minimize the loss function L, then a single step of gradient descent is expressed as:

\[
w_{t+1} = w_t - \alpha \frac{\partial L}{\partial w_t} \tag{2.2}
\]

Here, the learning rate α is a hyperparameter, a variable that needs to be set manually before training the network. The learning rate defines how big a step the parameters should move in each iteration. If α is too high, the algorithm will move erratically, often overshooting good local minima. On the other hand, if the learning rate is too low, the algorithm will take too long to converge [6].
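The update rule in Equation 2.2 and the effect of the learning rate can be illustrated on a one-dimensional toy problem. The loss $L(w) = (w - 3)^2$, the function name `gradient_descent` and the chosen learning rates are assumptions made for this sketch only.

```python
def gradient_descent(grad, w0, alpha, steps):
    """Repeat w <- w - alpha * dL/dw for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w


def toy_grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2, minimised at w = 3."""
    return 2.0 * (w - 3.0)


w_good = gradient_descent(toy_grad, w0=0.0, alpha=0.1, steps=100)  # approaches 3
w_bad = gradient_descent(toy_grad, w0=0.0, alpha=1.1, steps=20)    # diverges
```

With α = 0.1 the distance to the minimum shrinks by a constant factor each step, whereas with α = 1.1 every step overshoots the minimum by more than the previous error, so the iterates move away from it.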

There exist several versions of gradient descent, each with a different approach to estimating the current gradient. In batch gradient descent, the gradient is computed over all the examples in the training set. Using all data points provides an unbiased estimate of the gradient, but is often computationally expensive to calculate. Stochastic gradient descent instead samples a random data point from the training set at each iteration, and uses that single data point to compute the gradient estimate. This estimate is often noisier than the batch approach but is significantly faster to compute. Lastly, mini-batch gradient descent is a combination of the two previous methods. It randomly samples a small mini-batch of examples at each update, and uses that mini-batch to compute a less noisy estimate of the gradient [6].
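The three variants differ only in how many examples are used per gradient estimate, which the sketch below makes explicit through a `batch_size` parameter. The one-parameter model $y = wx$, the noise-free toy data and the function name `minibatch_sgd` are assumptions for illustration.

```python
import random


def minibatch_sgd(xs, ys, w0, alpha, batch_size, steps, seed=0):
    """Fit y = w * x with squared error, estimating the gradient on random batches.

    batch_size=1 gives stochastic gradient descent, batch_size=len(xs)
    gives batch gradient descent, anything in between is mini-batch.
    """
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = w0
    for _ in range(steps):
        batch = rng.sample(indices, batch_size)
        grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        w = w - alpha * grad
    return w


xs = [float(v) for v in range(1, 9)]
ys = [2.0 * v for v in xs]  # toy data generated with slope 2, no noise
w_mini = minibatch_sgd(xs, ys, w0=0.0, alpha=0.01, batch_size=4, steps=200)
w_full = minibatch_sgd(xs, ys, w0=0.0, alpha=0.01, batch_size=len(xs), steps=200)
```

Because the toy data is noise-free, both runs approach the true slope; on real data the mini-batch estimate would fluctuate around the full-batch gradient.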

The solution space of neural networks is both non-linear and non-convex. Therefore, gradient descent is not guaranteed to reach the global minimum of the loss function, since the negative gradient can be pointing towards local minima. To counteract this, there exist modifications to the gradient descent algorithms that attempt to use previous knowledge of the solution space to guide the network towards better solutions [6]. One example of such an algorithm is the momentum algorithm, which stores an exponential moving average over all previous gradients. This average is then used as the update direction, rather than the standalone gradient. More formally, the update for a single weight $w_t$ at time t with momentum becomes:

\[
w_{t+1} = w_t - \alpha V_t \tag{2.3}
\]

where

\[
V_t = \beta V_{t-1} + (1 - \beta) \frac{\partial L}{\partial w_t} \tag{2.4}
\]

Here, α is the normal learning rate, and β is a manually set hyperparameter that determines how quickly the old gradients should be forgotten.
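Equations 2.3 and 2.4 can be sketched for a single weight as follows. The toy loss $L(w) = (w - 3)^2$, the initial value $V_0 = 0$ and the function name `momentum_descent` are assumptions made for this example.

```python
def momentum_descent(grad, w0, alpha, beta, steps):
    """Gradient descent with an exponential moving average of gradients."""
    w, v = w0, 0.0  # V_0 = 0 is assumed here
    for _ in range(steps):
        v = beta * v + (1.0 - beta) * grad(w)  # Eq. 2.4
        w = w - alpha * v                      # Eq. 2.3
    return w


def toy_grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w_first = momentum_descent(toy_grad, w0=0.0, alpha=0.1, beta=0.9, steps=1)
w_final = momentum_descent(toy_grad, w0=0.0, alpha=0.1, beta=0.9, steps=500)
```

The first step is ten times smaller than a plain gradient step with the same α, because the moving average starts at zero and only a fraction (1 − β) of the new gradient is mixed in.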

Another optimization algorithm is the Adam algorithm. In addition to keeping a moving average over past gradients, this algorithm also uses an adaptive learning rate for each individual weight. These adaptive learning rates are calculated using an exponential moving average over past squared gradients. More formally, the update for a single weight $w_t$ at time t becomes:

\[
w_{t+1} = w_t - \frac{\alpha}{\sqrt{\hat{S}_t} + \varepsilon} \hat{V}_t \tag{2.5}
\]

where

\[
\hat{V}_t = \frac{V_t}{1 - \beta_1^t} \tag{2.6}
\]

\[
\hat{S}_t = \frac{S_t}{1 - \beta_2^t} \tag{2.7}
\]

\[
V_t = \beta_1 V_{t-1} + (1 - \beta_1) \frac{\partial L}{\partial w_t} \tag{2.8}
\]

\[
S_t = \beta_2 S_{t-1} + (1 - \beta_2) \left( \frac{\partial L}{\partial w_t} \right)^2 \tag{2.9}
\]

Here, α is the base learning rate, ε is a small value that prevents division by zero, and $\beta_1$ and $\beta_2$ adjust the two moving averages.
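A single-weight sketch of Equations 2.5 to 2.9 is given below. The toy loss $L(w) = (w - 3)^2$, the default hyperparameter values and the function name `adam` are assumptions for illustration and are not the settings used in the thesis.

```python
import math


def adam(grad, w0, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam update for a single weight, following Eqs. 2.5-2.9."""
    w, v, s = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        v = beta1 * v + (1.0 - beta1) * g       # Eq. 2.8: average of gradients
        s = beta2 * s + (1.0 - beta2) * g * g   # Eq. 2.9: average of squared gradients
        v_hat = v / (1.0 - beta1 ** t)          # Eq. 2.6: bias correction
        s_hat = s / (1.0 - beta2 ** t)          # Eq. 2.7: bias correction
        w = w - alpha * v_hat / (math.sqrt(s_hat) + eps)  # Eq. 2.5
    return w


def toy_grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w_first = adam(toy_grad, w0=0.0, steps=1)
w_final = adam(toy_grad, w0=0.0)
```

After the bias correction, the very first step has a magnitude close to α regardless of how large the gradient is, which illustrates how the per-weight scaling in Equation 2.5 decouples the step size from the raw gradient magnitude.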

2.1.5 Data Augmentation

Data augmentation is a common approach to artificially increase the amount of training data by applying label-invariant transforms to existing data. These transformations are commonly known as augmentation functions. Data augmentation is a standard method within deep learning to reduce overfitting when the amount of actual data is scarce. It can also have the added effect of introducing noise in the training process, which can further help prevent overfitting [6].

The choice of augmentation function is highly domain-specific, since the transforms that are label invariant vary between domains. For example, rotating an image of a dog 180 degrees does not change its natural class; it still portrays a dog. However, rotating an image containing the digit nine 180 degrees would change its class from nine to six.

Generally, if one has a set of classes $C_0, \ldots, C_{m-1}$, a set of data points $x_0, \ldots, x_{n-1}$ and an augmentation function A, then one wants $x_i \in C_j \Rightarrow A(x_i) \in C_j$ to hold for the augmentation function [6].

Standard augmentation functions for image data include: flipping images horizontally/vertically, rotating images, adding random image noise, and adjusting contrast and saturation [6].
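A minimal sketch of label-invariant augmentation is shown below, using a horizontal flip on images stored as nested lists. The helper names `flip_horizontal` and `augment_dataset` and the label "tank" are hypothetical; whether a given transform really is label invariant has to be judged per domain, as discussed above.

```python
def flip_horizontal(image):
    """Mirror a 2D image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]


def augment_dataset(images, labels, augmentations):
    """Return the original data plus one augmented copy per augmentation.

    Every augmented image keeps the label of the image it was derived from,
    which is only valid for label-invariant augmentation functions.
    """
    out_images, out_labels = list(images), list(labels)
    for image, label in zip(images, labels):
        for augment in augmentations:
            out_images.append(augment(image))
            out_labels.append(label)
    return out_images, out_labels


images = [[[1, 2, 3], [4, 5, 6]]]
labels = ["tank"]
aug_images, aug_labels = augment_dataset(images, labels, [flip_horizontal])
```

Here one labeled image becomes two training examples with the same label, doubling the dataset without any additional labeling effort.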

2.1.6 Transfer Learning

Transfer learning is a field within machine learning that focuses on utilizing knowledge gained while solving one problem and applying it to a related but different problem. The idea is that for two related tasks, a source task and a target task, there exists an overlap of knowledge that can be extracted from the source task and used to aid in the learning of the target task. For example, a person who learns to play the piano will have an easier time learning to play the guitar later, compared to a person who has no prior musical experience. The reason is that there is an overlap in knowledge between the source task of learning piano and the target task of learning guitar, for example reading sheet music, a sense of rhythm and finger dexterity. Transfer learning aims to apply the same logic to machine learning tasks [28].

A common approach to transfer learning is to extract the first set of layers of a CNN trained on a large amount of labeled data, such as the ImageNet dataset [2]. These first layers output a set of generic mid-level features that are useful for a variety of visual tasks [23]. To correct for any difference in distribution between the source and the target domain, a set of additional adaptation layers are appended to these extracted layers. These adaptation layers are then trained on a small set of labeled data from the target domain, while the previously extracted layers remain locked in place. This approach is called transfer convolutional neural network (TCNN) and is one of the most common methods when training deep CNNs [28].

2.2 Few-Shot Learning and Meta-Learning

Few-shot learning is a specific form of machine learning problem, where limits are set on how much data a model is allowed to observe during training. For most machine learning and deep learning tasks, the model is provided an extensive training set from which to learn. In contrast, in few-shot learning tasks, the models are only provided a handful of examples during training, with the number of examples being an integral part of the problem definition [5].

For classification problems it is common to use the description N-way K-shot learning, where there are N distinct classes and the model is allowed to train using K examples from each class. K should be small in order for the task to be considered a few-shot learning task. For regression, it is instead common to use the term K-shot learning, where K is the total number of data points the model is allowed to observe [5].

Meta-learning, or learning how to learn, is a concept within machine learning that has its origin in the late 1980s [11]. Although the term has been applied to various scenarios throughout the literature, meta-learning generally refers to a training scenario in which a model learns on two different levels: an initial meta-learning phase where the model gradually acquires knowledge across various tasks $T_1, \ldots, T_N$, and a second meta-testing phase where the previously meta-trained model is trained on previously unseen tasks $T_{N+1}$ with a limited number of examples [5, 11, 21].

Meta-learning is a common method for tackling few-shot learning problems. The idea is that by having a model meta-train on a distribution over similar few-shot learning tasks, the model can learn a method of learning that allows it to quickly learn unseen few-shot tasks.

In order to sample few-shot classification tasks, one needs a large dataset with a large set of classes. As an example, one can consider the Omniglot dataset [13]. It consists of 1623 different characters from a range of different alphabets, with each character having twenty hand-drawn examples. Sampling 5-way 1-shot classification tasks from this dataset would entail randomly selecting five classes from the complete set of 1623 and taking one example from each class as training data.
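The sampling procedure can be sketched as below. This is an illustrative helper, not the task generator used in the thesis: the name `sample_task`, the `n_query` parameter (extra held-out examples per class, used for evaluating the adapted model) and the toy dataset layout are assumptions.

```python
import random


def sample_task(dataset, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot task from a dict mapping class name -> examples.

    Returns (support, query) lists of (example, task_label) pairs, where the
    task labels 0..n_way-1 are assigned per task.
    """
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        examples = rng.sample(dataset[cls], k_shot + n_query)
        support += [(ex, label) for ex in examples[:k_shot]]
        query += [(ex, label) for ex in examples[k_shot:]]
    return support, query


# Toy stand-in for a dataset with many classes and twenty examples per class.
toy_dataset = {
    f"char{c}": [f"char{c}_img{i}" for i in range(20)] for c in range(30)
}
support, query = sample_task(toy_dataset, n_way=5, k_shot=1, n_query=3,
                             rng=random.Random(0))
```

Because the class labels are reassigned for every sampled task, the model cannot memorize a fixed mapping from classes to outputs and has to rely on the few support examples.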

There are many different approaches to meta-learning. One example is memory-augmented neural networks [21]. These recurrent networks have access to an external memory module which they can read from and write to freely. The external memory module allows them to quickly encode new information, which in turn makes them well suited to learning new tasks quickly.

Another example is meta-networks [16]. These networks rely on a concept called fast and slow weights. The slow weights are updated using normal gradient descent. The fast weights, however, are updated by an external meta-learner module that uses input from both the original model and knowledge of previous tasks to predict the new weights. The fast and slow weights are then combined in the final prediction [16].

Another approach to meta-learning is metric learning. Rather than training a model to learn new tasks, metric learning aims to find a task-invariant similarity metric across the set of training tasks [12]. This metric can then be used to classify an entirely new set of classes by comparing the test samples to the labeled training samples. A few examples of these approaches are the Siamese Neural Networks [12] and the Matching Network [27].

The meta-learning method used in this thesis is model-agnostic meta-learning (MAML) [5]. Unlike most other methods, it does not require any specialized model architecture and can be used with any model trained by gradient descent. The goal of MAML is to find a weight initialization that is optimized for learning new tasks. Finding such a weight initialization can be conceptualized as finding an internal feature representation that generalizes to a broad range of tasks [5]. With such a representation, the top layers of, e.g., a neural network can be fine-tuned to a new task in a couple of gradient steps [5].

Algorithm 1: Model-Agnostic Meta-Learning
Input: $p(T)$: distribution over tasks
Input: $\alpha$, $\beta$: step size parameters

Randomly initialize $\theta$
while not done do
    Sample meta-batch of tasks $T_1, \ldots, T_n \sim p(T)$,
        where each task $T_i$ contains training data $x_i$ and validation data $x'_i$
    foreach $T_i$ do
        Update: $\theta'_i \leftarrow \theta - \alpha \nabla_\theta L_{T_i}(f_\theta^{x_i})$
    end
    Update: $\theta \leftarrow \theta - \beta \nabla_\theta \sum_{i=1}^{n} L_{T_i}(f_{\theta'_i}^{x'_i})$
end
return $\theta$

MAML is outlined in Algorithm 1. Input to MAML consist of a distributionover tasks p(T ) and a model fθ parameterized by θ. It also uses two additionalhyperparameters: α, the learning rate for the inner update step and β, thelearning rate for the outer update step.

During training, a fixed number of tasks Ti are sampled for each meta itera-tions. The sampled tasks are referred to as a meta-batch, and the number oftasks to be sampled during each step is referred to as the meta-batch size.

Each training step in MAML is divided into two distinct steps: first, an inner, task-specific update step called the inner update step; second, an outer meta-training step called the outer update step. For each task $T_i$ in a meta-batch, a training set $x_i$ and a validation set $x'_i$ are sampled. The training set $x_i$ is used during the inner update step, while the validation set $x'_i$ is used during the outer update step.

In the inner update step, the current model parameters $\theta$ are updated using batch gradient descent with the sampled task-specific training data $x_i$ and the task-specific loss $\mathcal{L}_{T_i}$. (To keep the notation simple, the number of update steps is limited to one, but in practice it can be extended to any number of steps.) This update step results in the updated parameters $\theta'_i = \theta - \alpha \nabla_\theta \mathcal{L}_{T_i}(f^{x_i}_{\theta})$ for each task $T_i$ in the meta-batch.

In the outer update step, the task-specific loss $\mathcal{L}_{T_i}$ is computed with the task-specific validation data $x'_i$ and the task-specific parameters $\theta'_i$ from the inner update step. The losses of all the tasks in the meta-batch are then added together, serving as a measure of how well $f_\theta$ was able to learn the tasks.
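The inner and outer update steps can be sketched on a toy problem. The example below is an illustration under strong assumptions, not the implementation used in this thesis: the model is a scalar linear function $f_\theta(x) = \theta x$, each task is a regression task with a random slope, and the loss is the mean squared error, so that the gradient and Hessian can be written by hand. All function names are hypothetical.

```python
import random

def loss(theta, data):
    # Mean squared error of the scalar model f_theta(x) = theta * x.
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def grad(theta, data):
    # dL/dtheta for the loss above.
    return sum(2 * x * (theta * x - y) for x, y in data) / len(data)

def hessian(data):
    # d^2L/dtheta^2; for this model it does not depend on theta.
    return sum(2 * x * x for x, _ in data) / len(data)

def sample_task(rng, k=5):
    # A task is a random slope; it yields a training set and a validation set.
    slope = rng.uniform(-2.0, 2.0)
    def draw():
        return [(x, slope * x) for x in (rng.uniform(-1, 1) for _ in range(k))]
    return draw(), draw()

def maml_step(theta, tasks, alpha, beta):
    meta_grad = 0.0
    for train, val in tasks:
        theta_i = theta - alpha * grad(theta, train)       # inner update
        # Chain rule through the inner update (second-order term included).
        meta_grad += grad(theta_i, val) * (1.0 - alpha * hessian(train))
    return theta - beta * meta_grad                        # outer update

rng = random.Random(0)
theta = 3.0
for _ in range(200):
    meta_batch = [sample_task(rng) for _ in range(4)]      # meta-batch size 4
    theta = maml_step(theta, meta_batch, alpha=0.1, beta=0.05)
```

Since the task slopes are symmetric around zero, the learned initialization is expected to drift towards zero, which is the starting point from which a single inner step reduces the loss most on average.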

The training objective of MAML can be formalized as:

$$\min_{\theta} \sum_{T_i \sim p(T)} \mathcal{L}_{T_i}\left(f^{x'_i}_{\theta'_i}\right) \tag{2.10}$$

This objective can be interpreted as finding a parameter setting $\theta$ that minimizes the expected loss after the inner update step on all the tasks sampled from the task distribution. Such a parameter setting $\theta$ would mean the model is in a good position to quickly learn a new task. MAML uses gradient descent to optimize for this objective, computing the gradient with respect to $\theta$ over the current meta-batch. In practice, a more sophisticated gradient-based optimizer, such as Adam, can also be used.

The outer update step in the MAML algorithm requires computing the gradient through a gradient update, which requires second-order derivatives in the form of the Hessian matrix. Computing the Hessian matrix for a neural network can be computationally expensive and can increase training time significantly. In order to speed up learning, it is possible to ignore the second-order gradients by treating the parameters $\theta'_i$ as constant during the outer update. Finn, Abbeel, and Levine [5] showed that although this has a slightly negative effect on performance, it still produces results comparable to the standard MAML method. This first-order modification of the MAML algorithm is called first-order model-agnostic meta-learning (FOMAML).
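To make the role of the Hessian explicit, the outer gradient for a single inner update step can be written out with the chain rule. This is a short derivation under the one-step assumption used in the notation above, with $I$ denoting the identity matrix:

$$\nabla_\theta \mathcal{L}_{T_i}\left(f^{x'_i}_{\theta'_i}\right) = \left(I - \alpha \nabla^2_\theta \mathcal{L}_{T_i}\left(f^{x_i}_{\theta}\right)\right) \nabla_{\theta'_i} \mathcal{L}_{T_i}\left(f^{x'_i}_{\theta'_i}\right), \qquad \theta'_i = \theta - \alpha \nabla_\theta \mathcal{L}_{T_i}\left(f^{x_i}_{\theta}\right)$$

The first-order approximation drops the Hessian term, i.e. it treats $\partial \theta'_i / \partial \theta$ as the identity:

$$\nabla_\theta \mathcal{L}_{T_i}\left(f^{x'_i}_{\theta'_i}\right) \approx \nabla_{\theta'_i} \mathcal{L}_{T_i}\left(f^{x'_i}_{\theta'_i}\right)$$

The approximation is therefore closest to the full gradient when $\alpha$ is small or when the training loss has low curvature around $\theta$.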

A modification of the FOMAML algorithm called Reptile was later proposed by Nichol, Achiam, and Schulman [17]. Similarly to FOMAML, Reptile does not compute the second-order gradients of the outer update. However, Reptile takes this one step further by dropping the validation data entirely. Instead, it uses the average of the task-specific updated weights $\theta'_i$ to compute the outer update step. Averaging over $\theta'_i$ makes it easier for Reptile to apply more complex optimization methods in the inner update step, unlike MAML, which uses standard gradient descent [17].
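A minimal sketch of the Reptile-style outer update is given below, again for a scalar linear model with a squared-error loss. The names `adapt`, `reptile_step`, and the outer step size `epsilon` are assumptions made for this illustration; the sketch moves $\theta$ towards the average of the adapted weights and uses no validation data.

```python
def grad(theta, data):
    # Gradient of the mean squared error for the scalar model theta * x.
    return sum(2 * x * (theta * x - y) for x, y in data) / len(data)

def adapt(theta, data, alpha, steps):
    # Inner loop: plain gradient descent here, but any optimizer could be used.
    for _ in range(steps):
        theta = theta - alpha * grad(theta, data)
    return theta

def reptile_step(theta, tasks, alpha, epsilon, steps=3):
    # Outer update: move theta towards the mean of the task-adapted weights.
    adapted = [adapt(theta, data, alpha, steps) for data in tasks]
    mean_adapted = sum(adapted) / len(adapted)
    return theta + epsilon * (mean_adapted - theta)

# Two tasks with a single sample each (true slopes 2 and -1).
tasks = [[(1.0, 2.0)], [(1.0, -1.0)]]
theta = reptile_step(0.0, tasks, alpha=0.1, epsilon=0.5, steps=1)
# Adapted weights are 0.4 and -0.2, their mean is 0.1, so theta becomes 0.05.
```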

In addition to these supervised approaches to MAML, there also exist a handful of algorithms that try to forgo the use of labeled data during training, aiming instead for an unsupervised approach. One example is clustering to automatically construct tasks for unsupervised meta-learning (CACTU) [8], which uses various clustering approaches to generate multiple artificial class labels for the unlabeled data points, which can then be used to sample tasks.

Unsupervised meta-learning with tasks constructed by random sampling and augmentation (UMTRA) [11] is another example which, similarly to CACTU, automatically generates classification tasks from unlabeled data. UMTRA first modifies MAML to only perform N-way 1-shot tasks. Since each class only needs a single sample, N random samples are taken uniformly from the dataset, and each is assigned a random class label. If there are many natural classes in the dataset, the probability that some of the random samples belong to the same class is low enough to be negligible. A validation set for each class is then created by applying some class-invariant augmentation function to the original sample, and the resulting task is used to train the meta-learning model.
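The task construction described above can be sketched as follows. This is a simplified illustration, not the reference implementation of [11]: samples are plain feature vectors, and `augment` is a hypothetical class-invariant augmentation (small additive jitter) chosen only to keep the example self-contained.

```python
import random

def augment(sample, rng):
    # Hypothetical class-invariant augmentation: small additive jitter.
    return [v + rng.uniform(-0.05, 0.05) for v in sample]

def build_umtra_task(unlabeled, n_way, rng):
    # Draw N distinct samples uniformly and give each its own artificial label.
    picks = rng.sample(unlabeled, n_way)
    train = [(x, label) for label, x in enumerate(picks)]
    # The validation sample of each class is an augmented copy of its training sample.
    val = [(augment(x, rng), label) for label, x in enumerate(picks)]
    return train, val

rng = random.Random(0)
unlabeled = [[rng.random() for _ in range(4)] for _ in range(100)]
train, val = build_umtra_task(unlabeled, n_way=5, rng=rng)
```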

2.2.1 Summary

Few-shot learning is a form of task definition where the number of training samples is a part of the problem definition. The number of samples is usually small, between 1 and 50, in order to force machine learning models to learn how to generalize from smaller amounts of data.

Meta-learning is an approach to solving few-shot problems, where a model is tasked with learning a task-agnostic learning method over tasks from the same task distribution as the target task. This thesis uses a meta-learning algorithm called MAML, which finds a weight initialization for a machine learning model that is optimized for learning fast. There also exist modifications to MAML, such as FOMAML and Reptile, which reduce the computational burden during training at the potential loss of some accuracy. Finally, there exist unsupervised variants of MAML, such as CACTU and UMTRA, which can construct artificial training tasks from unlabeled real-world data.

2.3 Related Works

A significant portion of this thesis focuses on the effect of manipulating synthetic data in order to increase the performance of machine learning models. This section highlights some of the previous works that have used synthetic data for training deep learning models, and whose results have influenced the approach used in this thesis.

2.3.1 Synthetic Data

The term synthetic data is a very general term that refers to data which have been explicitly generated for a specific learning task, rather than being the by-product of an actual event. Synthetic data is a common approach to tackling one of the most common problems of deep learning: the need for large datasets. The data can be fully synthetic [25, 26, 18], meaning no real-world data were used during the generation, or it can be partially synthetic [4], meaning it uses actual data as a basis.

Synthetic data have been used to train deep learning models for a variety of tasks. Some examples include object detection [25, 4, 26, 18], optical flow estimation [3], text detection in natural images [7], and 3D face reconstruction [20].

An important consideration when using synthetic data, especially image data, is how close to reality the generated images need to be to achieve good results. This topic has been covered extensively in previous research. Peng et al. [18] tested, for the task of object detection, how changing low-level textures in their synthetic images affected model performance. They concluded that the effect of making the images more realistic was negligible, suggesting that realism might not be important when generating synthetic image data.

Similarly, both Tobin et al. [25] and Tremblay et al. [26] were able to train machine learning models with highly unrealistic images by randomizing aspects of their rendering processes when producing the training images.

Lastly, Mayer et al. [15] performed a thorough investigation of which kinds of synthetic images are best when training models for optical flow estimation. Their findings were that visual variety is crucial for the model’s ability to generalize, while realism has minimal effect on performance. They also concluded that mixing different datasets, such as more realistic and more simplistic datasets, can improve performance. Finally, they concluded that utilizing camera knowledge could be essential. For example, mimicking the lens distortion of the real camera, i.e., the camera that took the test data, was shown to have a noticeable effect on performance [15].

2.3.2 Domain Randomization (DR)

An inherent problem with using approximations of the real world, such as simulations, is that there will always be a disparity between the simulation and the real world. This disparity can be a problem when training deep neural networks, since these networks are often sensitive to shifts in the domain. Attempting to reduce this disparity is often time-consuming, requires domain-specific knowledge, and is limited by the capability of modern rendering technology.

Domain randomization (DR) [25, 26] is a more straightforward approach to synthetic image generation that attempts to utilize the strengths of modern rendering software: creating large quantities of diverse, unrealistic imagery, rather than producing a few images with perfect photo-realism. This approach aims to create data that force the model to become more robust to domain changes. The idea is that if the network can learn to perform a task successfully regardless of the domain, it should also be able to perform the task in the real world, since the real world is simply another domain instance [25].

Generating such data entails constructing images from a variety of so-called domains, or randomized domains, where a new domain is sampled by randomizing certain aspects of the simulation, such as background, lighting, and object textures. The images are often highly unrealistic, but as a result, they tend to be fast and easy to generate.
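Sampling a randomized domain can be sketched as drawing one value per randomized aspect of the scene. The parameter names, value ranges, and background list below are hypothetical and only illustrate the idea; they are not taken from [25] or [26].

```python
import random

# Hypothetical set of backgrounds available to the renderer.
BACKGROUNDS = ["desert", "forest", "urban"]

def sample_domain(rng):
    """Draw one randomized domain: background, lighting, and object texture."""
    return {
        "background": rng.choice(BACKGROUNDS),
        "light_intensity": rng.uniform(0.2, 1.0),        # assumed range
        "light_azimuth_deg": rng.uniform(0.0, 360.0),    # assumed range
        "object_rgb": tuple(rng.randint(0, 255) for _ in range(3)),
    }

rng = random.Random(42)
domains = [sample_domain(rng) for _ in range(3)]
```

Each sampled dictionary would be passed to the renderer to produce one image, so that no two training images necessarily share the same domain.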

Tobin et al. [25] used this approach to train a mechanical arm to accurately pick up objects in the real world, by only training it in various virtual, highly unrealistic, domain-randomized simulation settings. Tremblay et al. [26] used a similar approach to generate highly unrealistic imagery as complementary data for the task of real-world car detection. Using this approach, they were able to outperform other synthetic approaches that utilized more photo-realistic imagery. Sundermeyer et al. [24] used DR for object detection and 6D pose estimation, achieving results rivaling approaches using real data.



2.3.3 Structured Domain Randomization (SDR)

Structured domain randomization (SDR) is a modification of standard DR. The idea is to incorporate knowledge of the application domain, or the context, in the generation process. Incorporating context can allow the model to not only be robust to changes in lighting and textures but also to, for example, learn how to use background information to find small objects. Instead of randomizing the rendering configuration uniformly, as in DR, the configurations are sampled along splines, which limits the variety in the generated scene along some dimensions.

Prakash et al. [19] applied this technique to the task of vehicle detection. The context is that the test images have been taken using a camera mounted on a car in traffic. During image generation, a setting is selected from a set of predefined options, all of which consist of a main road on which the camera is fixed. After the scene has been sampled, different parts of the scene are randomized based on it. These randomized factors include the number of lanes in the road, the number of cars on the road, the texture of each car, the lighting, the weather, and more [19].

This approach gave better results than standard DR. A model trained using only SDR images was still able to perform well on real, previously unseen data. If real data is used in conjunction with the SDR data, the model outperforms a model trained using only realistic data by a significant margin [19].

2.3.4 Summary

Most previous research in synthetic image generation has concluded that realism is not important [25, 26, 18, 15]. Instead, variations in lighting [15, 25, 26] and textures [25, 26] have been shown to have a positive effect. Others have also shown that utilizing application domain knowledge, either by adjusting the randomization process [19] or by using camera knowledge, can further improve results [15].

Based on this previous research, the best approach seems to be to generate highly varied data, rather than focusing on making the data realistic. However, the problem with these conclusions is that the results may be task-specific and may not generalize well to higher-level tasks such as classification, or to use with MAML. In order to find the optimal approach to synthetic meta-learning, it is important to evaluate the effect different randomization methods have on performance. Since lighting has been shown to be important [15, 25, 26], while also being easy to adjust in most simulation software, the main focus of this thesis’ experiments will be how randomizing aspects of in-game lighting affects performance.

2.4 Thesis Background and Suggested Approach

This section will describe and motivate the suggested approach and explain why the thesis was written.

2.4.1 FOI

This thesis has been written in collaboration with the Swedish Defence Research Agency (FOI). FOI is a Swedish government agency responsible for defense-related research that reports to the Swedish Ministry of Defence. One of the goals of using meta-learning, at least for an organization like FOI, is to be able to build general meta-trained models for general tasks, such as vehicle classification or object detection. These meta-trained models could then act as off-the-shelf models that could be adapted to a new task in a matter of minutes using only a handful of real-world examples.

2.4.2 Approach

The synthetic meta-learning approach outlined in this thesis is straightforward. It can be summarized as training a neural network using MAML while sampling tasks from a large synthetic dataset, in order to train the model on how to quickly learn new tasks.

Creating a dataset from which one can sample few-shot military vehicle classification tasks entails creating a broad set of classes, each with a non-trivial number of examples. For this thesis, the list of classes consists of VBS3’s library of vehicle models, with each vehicle model being considered a unique class. For each vehicle model, multiple image samples are then generated within the simulation. Each resulting dataset consists of approximately 106,000 images (2357 different classes and roughly 45 samples per class on average). This is much larger than other meta-learning datasets, such as miniImageNet [27] with 60,000 images (100 classes with 600 images each) and Omniglot [13] with 32,460 images (1623 classes with 20 images each).

Inspired by previous research in synthetic data generation [15, 25, 26, 19], various levels of randomization are also applied to the data. By introducing variation in the synthetic data, the hypothesis is that the meta-trained model can learn a domain-agnostic feature representation, i.e. features that function regardless of the application domain. This general feature representation should then allow the model to quickly adapt to new real-world tasks.

The strength of this approach is tested by tackling the task of few-shot military vehicle classification. The choice of task type, as well as the choice of generation tool, VBS3, was a consequence of the collaboration with FOI, who wanted to focus on a military-oriented task and who also utilize VBS3 internally for training purposes.

2.4.3 Why Synthetic Data?

The primary reason why the combination of synthetic data and meta-learning is a promising concept is that it removes the need for a large labeled dataset during the initial meta-learning phase. However, unsupervised methods like UMTRA [11] and CACTU [8] have, with relative success, been able to train meta-learning models without any use of labeled data. Their success raises the question of why the synthetic approach should be investigated in the first place.

There are several benefits to using synthetic data for meta-learning when compared to the unsupervised approaches. One is the amount of control it gives over the data. The synthetic data can be coupled with a lot of additional information other than class labels, which can be useful when analyzing how well a model learns. One example is the bitmap, which shows which pixels contain the object of interest. The bitmap can be used in combination with transparent AI techniques, such as Grad-CAM [22], to see how well the model learns to focus on the object of interest during training. Synthetic data also makes it possible to adjust how difficult the sampled tasks should be by adjusting how visually similar all samples of a class should be. Changing the structure of a task and a dataset could offer additional insight into how algorithms like MAML learn.

Another benefit is the number of possible tasks the process can generate and learn, compared to unsupervised approaches. Methods like UMTRA and CACTU are both limited to generating classification tasks because of how the algorithms approach automatic labeling. Since the MAML algorithm can be applied to classification, regression, and reinforcement learning tasks [5], this seems needlessly limiting. In contrast, the synthetic approach can be configured to generate tasks of any kind, as long as the labeling can be extracted from the simulation.


3 Methodology

This chapter will outline the methods and techniques used to implement and evaluate the experiments of this thesis. The experiments can be divided into two distinct steps: data generation, where a synthetic dataset is constructed, and meta-training, where the model is meta-trained on the synthetic data and then tries to adapt to a set of real-world tasks. Section 3.1 will cover the image generation process. Section 3.2 will cover the meta-training setup and the training of the neural networks. Section 3.3 will outline the software tools used in the experiments.

3.1 Data Generation

This section will describe the procedure used to generate synthetic training data from VBS3. It will outline how environments and objects are manipulated in the simulator in order to create realistic imagery and how meta-information is extracted from the simulator during generation. This section will also list the different adjustable parameters in the data generation setup and how changing these influences the quality and characteristics of the generated data.

3.1.1 VBS3

The tool used for generating image data was Virtual Battlespace 3 (VBS3), version 18.3.3.8. VBS3 is a desktop tactical trainer and mission rehearsal software system developed by Bohemia Interactive Simulations. This software is used by many major military organizations, including the U.S. Army and the U.S. Marine Corps. VBS3 has a large library of over 10,000 high-resolution models. This large number of models makes it useful for generating image data for a variety of semi-realistic military-related tasks, such as generating vehicle images.

3.1.2 Generation Process

The image generation process is started by initializing the simulation with a fixed setting and a list of vehicles. The setting is chosen from the available library of terrain maps. There are five different standard environments available in VBS3 (see Section A.2.3). For the vehicle classification tasks, a list of 2357 vehicle models is used. These classes were extracted by iterating over the list of vehicle models in the VBS3 documentation. These vehicles include everything from large military vehicles like tanks, airplanes, and aircraft carriers to smaller objects like remote-controlled cars and drones.

After a setting is decided, the simulator starts iterating over the provided list of vehicle models. For each vehicle, it will take a fixed number of images. Since the task of interest is standard image classification, only a single object at a time is spawned, although the process is easily extended to many objects. After all images of all vehicles have been taken, the simulation will be restarted with a new setting.

The process of generating an image starts by spawning an instance of the current model in a random location in the in-game map. Similarly to Prakash et al. [19], the context of the object and setting are taken into account when generating the images. Vehicles that are based in water, like boats, are more likely to be spawned in a body of water. Flying vehicles like drones, planes, and helicopters are either spawned between 2 and 20 meters in the air or spawned lying on the ground. All other land-based vehicles are always spawned standing on the ground in the most realistic way possible.

After the vehicle has been positioned in the world, an in-game camera is randomly positioned around the object. The camera is positioned in such a way that the object of interest is clearly visible and not completely obscured by the terrain. The distance between the camera and the object depends on the size of the object. For smaller objects like drones, remote-controlled vehicles, and smaller boats, the camera is set to be at most 15 meters from the object. For larger vehicles like transport planes or oil tankers, the camera is given a larger offset, starting from 30 meters, to ensure that the object never covers the entire image.
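The placement rules above can be sketched as two small sampling functions. Only the 2 to 20 meter altitude range, the 15 meter cap for small objects, and the 30 meter minimum for large objects come from the text; the remaining bounds, the airborne probability, and the function names are assumptions made for this illustration, and the sketch does not use the VBS3 scripting interface.

```python
import math
import random

SMALL_MIN_M, SMALL_MAX_M = 3.0, 15.0    # 15 m cap from the text; 3 m floor is assumed
LARGE_MIN_M, LARGE_MAX_M = 30.0, 80.0   # 30 m floor from the text; 80 m cap is assumed
AIRBORNE_PROBABILITY = 0.5              # assumed; the text gives no probability

def spawn_altitude(is_aircraft, rng):
    """Altitude in meters: aircraft are either airborne (2-20 m) or on the ground."""
    if is_aircraft and rng.random() < AIRBORNE_PROBABILITY:
        return rng.uniform(2.0, 20.0)
    return 0.0

def camera_offset(is_large, rng):
    """Horizontal (dx, dy) offset of the camera from the vehicle, in meters."""
    lo, hi = (LARGE_MIN_M, LARGE_MAX_M) if is_large else (SMALL_MIN_M, SMALL_MAX_M)
    distance = rng.uniform(lo, hi)
    bearing = rng.uniform(0.0, 2.0 * math.pi)   # random direction around the object
    return distance * math.cos(bearing), distance * math.sin(bearing)

rng = random.Random(7)
small_offsets = [camera_offset(False, rng) for _ in range(100)]
large_offsets = [camera_offset(True, rng) for _ in range(100)]
altitudes = [spawn_altitude(True, rng) for _ in range(100)]
```

A visibility check against the terrain, as described in the text, would still be needed after sampling and is left out of this sketch.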

After the in-game camera has found a clear view of the object, the world’s lighting and weather effects are randomized, as well as other settings, in order to introduce variance in the images. The camera then takes a photo of the resulting scene, which will be used as training data. Additional meta-information, like the current rotation of the object, is also saved.

After the first photo has been taken, all lighting and weather effects are temporarily disabled in order to create a clearer image. A render mask is then applied to the object, transforming it into a single uniform segment of an easily identified color, such as pink (see Figure 3.1). A second photo is then taken with the same camera as before.

Figure 3.1: (a) Generated image, (b) masked object

Segmentation information and bounding box information are then extracted using the render-masked image. There are several advantages of having access to segmentation information, even if the final target task will not utilize it. It allows for automatic and precise cropping of the object of interest. It also makes it possible to detect bad image samples where the object is hidden or obscured. The simulator, VBS3, is not always able to find perfectly suitable camera angles. Sometimes vegetation and other objects can cover the object. In this thesis, images with an object segment of fewer than 750 pixels were therefore removed in a separate filtering step in order to remove obscured objects. However, this also meant that some of the smaller vehicles had a higher probability of being removed. As a result, some of the classes consisted of fewer images than others.
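The extraction and filtering step can be sketched as follows, assuming the render-masked image is available as a nested list of RGB tuples and that the mask color is pure magenta. Both are assumptions made for this illustration; only the 750 pixel threshold comes from the text.

```python
MASK_COLOR = (255, 0, 255)   # assumed exact mask color ("pink")
MIN_PIXELS = 750             # threshold stated in the text

def mask_stats(image):
    """Return (pixel_count, bounding_box) of the masked object.

    image: list of rows, each a list of (r, g, b) tuples.
    bounding_box: (x_min, y_min, x_max, y_max), inclusive, or None if no mask pixels.
    """
    coords = [(r, c) for r, row in enumerate(image)
              for c, px in enumerate(row) if px == MASK_COLOR]
    if not coords:
        return 0, None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return len(coords), (min(cols), min(rows), max(cols), max(rows))

def keep_sample(image, min_pixels=MIN_PIXELS):
    # Samples whose visible object segment is too small are treated as obscured.
    count, _ = mask_stats(image)
    return count >= min_pixels
```

In practice, a rendered mask may contain anti-aliased edge pixels, in which case a color tolerance rather than an exact match would be needed.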

Between each photo of the same vehicle, the vehicle is moved to a new position within a hundred-meter radius of the previously used position. This range limit is set to allow the simulation to load the background textures properly before the image is taken. If the vehicle is allowed to move over a larger area, the simulation can have problems loading the background textures at the highest resolution in time, resulting in poor image quality.

This generation process is re-run once for each of the five available maps, in order to create a high degree of variety in the dataset.

3.1.3 Image Randomization

The experiments of Tobin et al. [25], Tremblay et al. [26], and Prakash et al. [19] have shown that the variation of the training data can have a substantial effect on the final performance of models trained with synthetic data. Since these methods have never been tested in a meta-learning setting, it is essential to determine whether the same principles still hold.

In order to investigate this, the simulation setup has been built to allow certain simulation parameters to be enabled and disabled, which changes how the scenes are randomized. These parameters are:

• Context: Objects are spawned in scenarios that take their real-world context into account (see Figure 3.2). For example, planes and helicopters are spawned in the air, while boats are spawned in bodies of water. Disregarding the context makes the data generation more akin to standard DR, where vehicles can be in any scenario, in any possible position, thus increasing the difficulty of the generated tasks (see Figure 3.3).

Figure 3.2: With context enabled



Figure 3.3: With context disabled

• Color Scheme/Texture Randomization: For a subset of the vehicles (around 800), a randomized color is applied to each editable part of the vehicle. Randomizing the textures is similar to what is done by both Tremblay et al. [26] and Prakash et al. [19].

Figure 3.4: Different vehicle models with different color schemes

• Lighting: Tremblay et al. [26] showed that variations in lighting could have a huge effect on performance. Outlined here are the different methods for randomizing lighting in the simulation. Examples of their combined effect can be seen in Figure 3.8.

– Light position: The time of day is randomly chosen within the range of hours in which the sun is still visible. Changing the internal time results in the sun, the main lighting source, being in various positions, resulting in both different degrees of lighting intensity and different shadow shapes (see Figure 3.5).



Figure 3.5: Randomized lighting positions

– Crepuscular rays: This setting adjusts the color of light rays coming from the in-game sun, as well as the transparency of these rays. Randomizing this configuration results in a high degree of color variation in the images, but can also make the images blurrier (see Figure 3.6).

Figure 3.6: Randomized ray color using Crepuscular Rays

– Weather: With a given probability, random weather configurations are chosen. The weather is configured using three variables: level of rain, level of fog, and level of overcast (see Figure 3.7).

Figure 3.7: Randomized weather



Figure 3.8: Examples of randomized lighting configurations

3.1.4 Generated Datasets

In order to evaluate how different randomization settings affect final performance, seven datasets were generated with one or more of the image randomization settings turned on or off. The settings are: crepuscular rays, weather, light position, context, and texture randomization. If crepuscular rays (god rays) were enabled, the simulation light in each image was shifted to a random color. If weather was enabled, the weather of the simulation was randomized for each photo. If random light position was enabled, the sun was randomly positioned somewhere in the sky. If context was disabled, the vehicles were positioned without any regard to how their real-world counterparts would behave, while if it was enabled, the vehicles were positioned realistically. If texture was enabled, the texture of the vehicles was randomized in each image. These settings are outlined in further detail in Section 3.1.3.

Not all possible parameter configurations were tested, due to time constraints. For a full outline of all the generated datasets, see Table 3.1. An X indicates that the randomization factor was active during the generation of that dataset.

Lighting was hypothesized to be the most critical factor in the generation process, since it had previously been shown to have a significant effect on other tasks [25, 15]. The first five datasets, D1–D5, were therefore constructed to test the combined and individual effects of each light setting. D1 tests the combined effect of all light settings, while D2 tests how completely static light affects performance. D3–D5 test the effect of disabling individual light parameters, in order to see their individual contributions. The last two, D6 and D7, were constructed to test whether further randomization of positioning and textures could further improve accuracy.

Table 3.1: Randomization configurations for the datasets

Dataset Crepuscular Rays Weather Light Pos. Context Texture
D1 X X X X
D2 X
D3 X X X
D4 X X X
D5 X X X
D6 X X X X X
D7 X X X

Since invalid image samples are filtered out after the generation process, there is not a fixed number of samples per class, and the number varies depending on how small and how likely to be obscured the vehicle models were. Similarly, some of the classes were removed if they contained too few samples to sample training data from. The average number of samples per class can be seen in Table 3.2.

Table 3.2: Synthetic dataset statistics

Dataset   # of classes   Avg. # per class   Total
D1        2357           45.15              106 414
D2        2357           46.08              108 603
D3        2358           46.08              108 667
D4        2356           45.32              106 770
D5        2357           45.15              106 414
D6        2357           45.04              106 154
D7        2342           48.10              112 653

3.1.5 Summary

The image generation process consists of the following: for each map and each vehicle model, randomly position the vehicle in the world, randomly position the camera around it, randomize parts of the scene, take an image, and save it along with meta-information about the scene.

The generation process had five parameters that could be adjusted in order to change how the images were randomized. To test the effect of the different parameters, seven different parameter configurations were selected and used to generate datasets, each containing over 106,000 images.

3.2 Meta Training

After a synthetic dataset with the desired properties had been generated, the data was used to train a model using MAML [5] (see Section 2.2). After the training was finished, the model’s performance was evaluated on a set of real-world tasks. This section will outline the setup and methods that were used when training and evaluating the models.

3.2.1 Task Generation

One of the most fundamental aspects of methods like MAML [5] is sampling from a task distribution $p(T)$. A task is sampled from a dataset by first uniformly sampling a set of N classes without replacement from the list of classes in the dataset. A set of training samples and a set of validation samples are then sampled from the selected classes (see Section 2.2 and Algorithm 1), without replacement and without any overlap between the two sets.
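The sampling procedure can be sketched as follows, assuming the dataset is held as a dictionary that maps each class name to a list of samples. The function name and the data layout are hypothetical, and the sketch is not the code used in the experiments.

```python
import random

def sample_task(dataset, n_way, k_shot, n_val, rng):
    """Sample one N-way K-shot task with disjoint training and validation sets.

    dataset: dict mapping class name -> list of samples (assumed layout).
    """
    # N distinct classes, drawn uniformly without replacement.
    classes = rng.sample(sorted(dataset), n_way)
    train, val = [], []
    for label, cls in enumerate(classes):
        # Drawing both sets in a single call guarantees that they do not overlap.
        picks = rng.sample(dataset[cls], k_shot + n_val)
        train += [(s, label) for s in picks[:k_shot]]
        val += [(s, label) for s in picks[k_shot:]]
    return train, val

# Toy dataset: 30 classes with 20 sample identifiers each.
dataset = {f"vehicle_{i}": [f"img_{i}_{j}" for j in range(20)] for i in range(30)}
train, val = sample_task(dataset, n_way=5, k_shot=1, n_val=10, rng=random.Random(0))
```

Note that `rng.sample` raises a `ValueError` if a class holds fewer than `k_shot + n_val` samples, which is one reason classes with too few samples have to be removed beforehand.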

3.2.2 Problem Settings

For this thesis, three problem settings were explored: 5-way 1-shot, 5-way 5-shot, and 5-way 10-shot classification. The 1-shot and 5-shot tasks used regular MAML during meta-training. The 10-shot task used FOMAML instead, since regular MAML consumed too much video memory for that many samples. For all of the tasks, a meta-batch size of four and five gradient update steps were used. Ten validation images per class were used for each task. These settings were chosen primarily as a result of hardware limitations, since increasing the meta-batch size exhausted the GPU’s memory resources, while setting a lower meta-batch size and lowering the number of update steps slowed down training significantly.

3.2.3 Image Pre-Processing

For each image in each task, a square area around the object of interest was cropped. The crop was always performed such that the entire object was contained within the selected region, but the region was randomly shifted so that the object was not always in the center. The cropped region was then resized to a resolution of 128x128 using bilinear interpolation. The resulting resolution was significantly larger than in other common meta-learning tasks like Omniglot [13] or miniImageNet [27]. This size was chosen in order to preserve as much detail of the object as possible, so that the model could distinguish between superficially similar objects like different kinds of tanks or different kinds of cars.
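One way to select such a crop region is sketched below. It assumes that the bounding box is given in pixel coordinates with exclusive upper bounds and that the side of the square equals the larger bounding box dimension; any margin used in the actual experiments is unknown, and the resizing step is left to an image library.

```python
import random

def random_square_crop(bbox, img_w, img_h, rng):
    """Pick a square region that contains bbox but is randomly shifted.

    bbox: (x0, y0, x1, y1) with x1 and y1 exclusive.
    Returns (left, top, side) of the crop region.
    """
    x0, y0, x1, y1 = bbox
    side = max(x1 - x0, y1 - y0)
    if side > min(img_w, img_h):
        raise ValueError("object too large for a square crop inside the image")
    # Valid positions keep the whole box inside the crop and the crop inside the image.
    left = rng.randint(max(0, x1 - side), min(x0, img_w - side))
    top = rng.randint(max(0, y1 - side), min(y0, img_h - side))
    return left, top, side

rng = random.Random(3)
left, top, side = random_square_crop((100, 120, 180, 160), 640, 480, rng)
```

For the example box, which is 80 pixels wide and 40 pixels high, the horizontal position is fixed while the vertical position can vary by up to 40 pixels.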

3.2.4 Image Augmentation

Data augmentation methods were also applied to all images during the meta-learning phase. These methods were chosen because they have previously been shown to be useful when training with synthetic data [19]. These methods include:

• Random Flipping: Images were flipped horizontally with a 50% chance.

• Random Contrast: The contrast of the images was randomized between 60% and 115% of the original contrast.

• Random Saturation: The saturation of the images was randomized between 60% and 150% of the original image saturation.
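The three augmentations can be sketched on an image stored as rows of RGB tuples with values in [0, 1]. The exact formulas used in the experiments are not stated, so the sketch assumes two common definitions: contrast scales each channel around its mean, and saturation blends each pixel with its gray value. Only the probability and the percentage ranges come from the list above.

```python
import random

def clip(v):
    return min(1.0, max(0.0, v))

def flip_horizontal(img):
    return [row[::-1] for row in img]

def adjust_contrast(img, factor):
    # Assumed definition: scale each channel around its mean over the image.
    n = sum(len(row) for row in img)
    means = [sum(px[ch] for row in img for px in row) / n for ch in range(3)]
    return [[tuple(clip(means[ch] + factor * (px[ch] - means[ch])) for ch in range(3))
             for px in row] for row in img]

def adjust_saturation(img, factor):
    # Assumed definition: blend each pixel with its luma (gray) value.
    out = []
    for row in img:
        new_row = []
        for r, g, b in row:
            gray = 0.299 * r + 0.587 * g + 0.114 * b
            new_row.append(tuple(clip(gray + factor * (v - gray)) for v in (r, g, b)))
        out.append(new_row)
    return out

def augment(img, rng):
    if rng.random() < 0.5:                                   # 50% horizontal flip
        img = flip_horizontal(img)
    img = adjust_contrast(img, rng.uniform(0.60, 1.15))      # 60% to 115%
    img = adjust_saturation(img, rng.uniform(0.60, 1.50))    # 60% to 150%
    return img

image = [[(0.2, 0.4, 0.6), (0.8, 0.1, 0.3)]]
augmented = augment(image, random.Random(0))
```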

3.2.5 Network Architecture and Hyperparameters

The network architecture used in this thesis was similar to the one used by Finn, Abbeel, and Levine [5] in their experiments with CNNs. The network consisted of five convolutional layers, each with 32 convolutional filters with a receptive field size of 3 × 3 and stride 1. Each convolutional layer used ReLU activation and was followed by a 2 × 2 max-pooling layer. Batch normalization was also applied between the convolutional operation and the ReLU activation in each convolutional layer. Lastly, a single fully connected layer with softmax activation was used to output the final prediction vector.

During meta-training, the inner learning rate α was set to 0.01. The outer update was performed using Adam with a learning rate of β = 0.001 and the additional parameters set to β1 = 0.9, β2 = 0.999, and ε = 1e−08.

The same network was used for all the tasks, except that the final fully connected layer was changed when the number of classes changed.
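As a rough check of the size of this network, the sketch below computes the feature-map sizes and parameter counts for a 128 × 128 input. The text does not state the padding mode, the pooling stride, or the number of input channels, so "same" padding, a pooling stride of 2, and three RGB input channels are assumptions. Batch-normalization parameters are not counted.

```python
def feature_map_sizes(input_size=128, num_blocks=5, pool=2):
    """Spatial size after each conv + max-pool block."""
    sizes = [input_size]
    for _ in range(num_blocks):
        # A 3x3 convolution with stride 1 and "same" padding keeps the size;
        # the 2x2 max-pooling (assumed stride 2) halves it.
        sizes.append(sizes[-1] // pool)
    return sizes


def parameter_counts(in_channels=3, filters=32, num_blocks=5, kernel=3,
                     input_size=128, num_classes=5):
    """Return (per-layer conv parameters, flattened size, dense parameters)."""
    conv = []
    channels = in_channels
    for _ in range(num_blocks):
        conv.append(kernel * kernel * channels * filters + filters)  # weights + biases
        channels = filters
    final = feature_map_sizes(input_size, num_blocks)[-1]
    flat = final * final * filters
    dense = flat * num_classes + num_classes
    return conv, flat, dense
```

Under these assumptions the spatial size shrinks as 128, 64, 32, 16, 8, 4, so the fully connected layer sees 4 × 4 × 32 = 512 features for a 5-way task.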

3.2.6 Test Data

In order to evaluate the generalization performance of the models, a small number of real-world images were collected from various sources on the internet. The collected dataset was manually labeled and consists of 42 different object classes with between 15 and 30 images per class. The classes consist of different kinds of military vehicles, such as tanks, boats, airplanes, helicopters, and motorbikes, where each class is a specific vehicle model, for example, JAS Gripen (see Figure 3.9 for examples).



[Image grid: twenty example images, panels (a) to (t)]

Figure 3.9: Examples of real-world images

3.2.7 Performance Evaluation

After a model had been trained using MAML on a synthetically generated dataset, the model’s ability to adapt to a real-world task was evaluated. From the dataset described in Section 3.2.6, tasks were sampled in the same fashion as outlined in Section 3.2.1. The models were given the same number of samples for each class as was defined by the few-shot classification task they were meta-trained on. The model was then updated using batch gradient descent with a learning rate of 0.01, the same as during meta-training.

Five thousand tasks were randomly sampled for evaluation, and the models were trained with ten gradient steps for each task. The mean of the model’s accuracy after the final update step, over the five thousand tasks, was then reported as the final accuracy.
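The task sampling used during evaluation can be sketched as below. The dataset layout (a dictionary from class name to a list of image identifiers), the function name, and the number of held-out images per class (`n_query`) are assumptions; the text does not state how many held-out images per class were used at test time.

```python
import random


def sample_task(dataset, n_way=5, k_shot=1, n_query=5, rng=random):
    """Sample one N-way K-shot task.

    dataset maps a class name to a list of image identifiers. Returns the
    support set used for the gradient steps and a disjoint held-out set used
    to measure accuracy, both as lists of (image, label) pairs.
    """
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label, name in enumerate(classes):
        images = rng.sample(dataset[name], k_shot + n_query)
        support += [(img, label) for img in images[:k_shot]]
        query += [(img, label) for img in images[k_shot:]]
    return support, query


def mean_accuracy(task_accuracies):
    """Final reported number: the mean accuracy over all sampled tasks."""
    return sum(task_accuracies) / len(task_accuracies)
```

Repeating `sample_task` five thousand times, adapting the model on each support set, and passing the resulting accuracies to `mean_accuracy` mirrors the procedure described above.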

3.2.8 Baselines

For this thesis, two baselines were chosen. The first baseline, named BL1, involved training the same network used in all other settings using only the meta-test data, without any meta-training. This baseline was used to give an indication of both how difficult the final tasks were to perform and how much the MAML algorithm improves performance.

The second baseline model was trained using a hand-labeled real-world dataset collected from Bing and Google Image Search. This baseline was named BL2 and was used to highlight how models trained on synthetic data perform in comparison to models trained on real-world data. Its training set consists of 61 different vehicles, with each class consisting of between 80 and 350 RGB images. In total, the number of images is 9954.

3.2.9 Summary

The experiments consisted of training a six-layer CNN using MAML for each of the seven synthetic datasets (see Table 3.2). In addition, two baselines were trained: one where no pre-training was used (BL1), and one that was trained with MAML on a real-world dataset (BL2). Their performance was evaluated by sampling five thousand tasks from a real-world test dataset, having the networks adapt to each task, and then calculating the average accuracy.

3.3 Programming Libraries and Frameworks

A mix of Python 3.7+ and the status quo function (SQF) scripting language was used to generate the images in VBS3.

SQF is a scripting language developed by Bohemia Interactive Simulations and is used for scenario scripting within the simulation. In this thesis, it was the primary tool for controlling the simulation process, creating scenes, spawning and iterating over objects, and collecting and outputting relevant vehicle information.

All other code, covering both the image-generation post-processing and the machine learning, was written in Python 3.7+. Python is a high-level scripting language with a large user base within the machine learning community, making it the obvious choice for most things machine learning.

Google’s TensorFlow [14] library was used to construct the machine learning models. TensorFlow is an open-source software library developed for high-performance numerical computation. It is used both within industry and for research applications. The implementation of the meta-learning algorithm was built using TensorFlow 1.10.0 with GPU support.


4 Result

This chapter presents the results of the experiments outlined in the previous chapter. It begins by showing the randomization configurations used for each generated dataset. It then presents the accuracy and standard deviation of each of the models trained with one of the datasets, as well as the baselines. Lastly, the meta-training accuracy for the different datasets is presented.

4.1 Datasets

Table 4.1 outlines which of the possible randomization options were enabled or disabled for each of the generated datasets. A full explanation of all settings can be found in Section 3.1.4.

Table 4.1: Randomization configurations for synthetic datasets (X = enabled, - = disabled)

Dataset   Crepuscular Rays   Weather   Lighting Pos.   Context   Texture
D1        X                  X         X               X         -
D2        -                  -         -               X         -
D3        -                  X         X               X         -
D4        X                  -         X               X         -
D5        X                  X         -               X         -
D6        X                  X         X               X         X
D7        X                  X         X               -         -


CHAPTER 4. RESULT

Table 4.2: Test accuracy in % for each dataset on 5-way 1-shot classification over 5000 test tasks

Dataset   Mean (%)   Std. (%)   95% CI
BL1       31.95      8.06       ±0.22
BL2       62.94      12.31      ±0.34
D1        46.41      10.26      ±0.28
D2        37.87      9.33       ±0.28
D3        36.93      9.15       ±0.25
D4        41.72      9.67       ±0.27
D5        46.33      10.63      ±0.29
D6        43.80      9.58       ±0.27
D7        44.01      10.05      ±0.28
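The text does not state how the 95% confidence intervals were computed. A common choice, assumed here, is the normal approximation for the mean over the sampled tasks, with half-width 1.96 · std / √n. The sketch below applies it to the standard deviations in Table 4.2; the function name is hypothetical.

```python
import math


def ci95_half_width(std, n_tasks, z=1.96):
    """Half-width of a normal-approximation 95% confidence interval of the mean."""
    return z * std / math.sqrt(n_tasks)


# Standard deviations for BL1 and BL2 from Table 4.2, with 5000 test tasks each
bl1 = round(ci95_half_width(8.06, 5000), 2)
bl2 = round(ci95_half_width(12.31, 5000), 2)
```

With these inputs the formula gives 0.22 for BL1 and 0.34 for BL2, matching the corresponding rows of the table.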

Table 4.3: Test accuracy in % for each dataset on 5-way 5-shot classification over 5000 test tasks

Table 4.4: Test accuracy in % for each dataset on 5-way 10-shot classification (FOMAML) over 5000 test tasks


4.2 Test Accuracy

Tables 4.2, 4.3 and 4.4 show the average accuracy over five thousand randomly sampled test tasks for each of the datasets in the three classification scenarios. Each model was meta-trained for 30,000 meta-updates. (Note that the models in Table 4.4 used FOMAML instead of standard MAML.)

Figures 4.1, 4.2 and 4.3 show the average validation accuracy at each of the ten gradient steps the models take during the meta-test phase.

4.3 Training Accuracy

Figures 4.4a–4.4e display the smoothed training accuracy throughout the meta-training for a subset of trained 5-way 1-shot classifiers. The training accuracy is the average accuracy the network achieves after one to five update steps when adapting to all tasks in a randomly sampled meta-batch. Similarly, Figures 4.5a–4.5e and 4.6a–4.6e show the training accuracy for the 5-way 5-shot and 5-way 10-shot classifiers. These plots are interesting to analyze since different datasets can have different effects on the meta-learning process.

Figures 4.4a–4.4e, 4.5a–4.5e, and 4.6a–4.6e show the training accuracy during all 30,000 meta-training iterations, and each plot shows the average accuracy after a fixed number of gradient steps. For example, Figure 4.5a shows the accuracy after a single gradient update step on an unseen task, while Figure 4.5e shows the accuracy after five gradient update steps on an unseen task.

[Plot: accuracy (0 to 1) versus gradient steps (0 to 10); series BL1, BL2, D1–D7]

Figure 4.1: 5-way 1-shot accuracy during meta-training

[Plots for Figures 4.2 and 4.3: same axes and series as Figure 4.1]

The accuracy values in the figures have been smoothed significantly. The smoothing was necessary since the accuracy values are noisy and heavily dependent on the difficulty of the sampled task, which makes the raw values difficult to plot. The smoothing was done using an exponential moving average:

$$\hat{y}_{t+1} = \alpha \hat{y}_t + (1 - \alpha) y_{t+1}$$

where $y_t$ is the raw accuracy at meta-update $t$, $\hat{y}_t$ is its smoothed value, and α = 0.97.
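A minimal sketch of this exponential moving average is shown below. The initialization is not specified in the text; starting the smoothed series at the first raw value is an assumption, and the function name is hypothetical.

```python
def smooth(values, alpha=0.97):
    """Exponential moving average of a sequence of accuracy values."""
    smoothed = []
    for y in values:
        if not smoothed:
            smoothed.append(y)  # assumed initialization: first raw value
        else:
            smoothed.append(alpha * smoothed[-1] + (1.0 - alpha) * y)
    return smoothed
```

With α = 0.97, each new raw value contributes only 3% to the smoothed curve, which explains why short-lived spikes caused by unusually easy or hard tasks largely disappear from the plots.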


Figure 4.4: Smoothed training accuracy during meta-training for the 5-way 1-shot task, over 30,000 meta-updates. The five plots show the average validation accuracy over the sampled meta-batch after one to five gradient steps.

[Five plots, panels a to e (update steps 1 to 5): accuracy (0 to 1) versus number of meta-updates (axis in units of 10^4); series D1–D7]


Figure 4.5: Smoothed training accuracy during meta-training for the 5-way 5-shot task, over 30,000 meta-updates. The five plots show the average validation accuracy over the sampled meta-batch after one to five gradient steps.

[Five plots, panels a to e (update steps 1 to 5): accuracy (0 to 1) versus number of meta-updates (axis in units of 10^4); series D1–D7]




Figure 4.6: Smoothed training accuracy during meta-training for the 5-way 10-shot task, over 30,000 meta-updates using FOMAML. The five plots show the average validation accuracy over the sampled meta-batch after one to five gradient steps.

[Five plots, panels a to e (update steps 1 to 5): accuracy (0 to 1) versus number of meta-updates (axis in units of 10^4); series D1–D7]




5 Discussion

This chapter interprets the results outlined in the previous chapter and attempts to ground them in the previously presented theory.

5.1 Overall Performance

Overall, D1 was the best-performing synthetic dataset. For the 10-shot task, it reached an accuracy of 72.92%, compared to 73.70% for the BL2 baseline. However, the gap to the BL2 baseline grows as the number of few-shot samples decreases. For the 5-shot task, D1 achieved an accuracy of 67.83%, compared to 72.88% for BL2, a difference of roughly five percentage points. The best-performing synthetic dataset for the 5-shot task, D5, achieved 69.03% accuracy. For the 1-shot classifiers, the best synthetic dataset, D1, reached 46.41% accuracy, roughly 16 percentage points below the BL2 baseline of 62.94%.

Based on these results, none of the synthetic datasets was able to force the models to learn domain-agnostic features that could be used directly. However, as the number of examples in the few-shot task increases, the performance difference between the real-world dataset and the synthetic dataset narrows. This suggests that the synthetically trained networks can at least be quickly fine-tuned to the unseen real-world domain.
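The narrowing gap can be made explicit with the accuracies reported in this section. The short sketch below only restates those numbers (BL2 against the best synthetic dataset for each setting); the variable names are hypothetical.

```python
# (BL2 accuracy, best synthetic accuracy) in %, keyed by number of shots.
# Best synthetic dataset: D1 for 1-shot and 10-shot, D5 for 5-shot.
reported = {1: (62.94, 46.41), 5: (72.88, 69.03), 10: (73.70, 72.92)}

# Gap in percentage points between the real-world baseline and the best synthetic model
gaps = {shots: round(bl2 - synthetic, 2) for shots, (bl2, synthetic) in reported.items()}
```

The gap shrinks from roughly 16.5 percentage points at 1-shot to under 4 at 5-shot and under 1 at 10-shot.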

5.2 Effect of Image Randomization

Variation in lighting has a strong positive effect on the model’s performance, similar to what has been observed by Tobin et al. [25] and Mayer et al. [15].


CHAPTER 5. DISCUSSION

The difference between the best-performing model trained with randomized light settings (using Dataset D1) and the model trained with a static light setting (using Dataset D2) is roughly eight percentage points for the 1-shot, 5-shot and 10-shot classifiers (see Tables 4.2, 4.3 and 4.4).

Crepuscular rays (Dataset D3) is the single lighting setting that has the most significant effect on performance. Disabling it results in a decrease in accuracy of roughly six to eight percentage points on all classifiers. This result is not entirely surprising since randomizing the lighting into different colors fundamentally changes the nature of most scenes and introduces a lot of visual variety.

Not randomizing weather (Dataset D4) had a lesser but noticeable effect. Disabling it results in a decrease in accuracy of roughly six percentage points for the 10-shot classifier, two percentage points for the 5-shot classifier, and more than five percentage points for the 1-shot classifier. It is not surprising that randomizing the weather affects accuracy. The weather itself is randomized across several variables; it changes both brightness and image contrast, and introduces background noise with clouds, lightning, and rainbows.

On the other hand, not randomizing the lighting position (Dataset D5) had a small negative effect on the 10-shot classifier, a positive effect on the 5-shot classifier, and a statistically insignificant effect on the 1-shot classifier. One possible reason why randomizing the lighting position could have a negative effect is that the shadows it generates made some of the vehicles too difficult to detect in the dataset. Another is that the test dataset does not have many images with lighting coming from the side, in which case making the network robust to this kind of dynamic lighting is wasted effort.

Removing context (Dataset D7) also hurts performance. Disabling it results in a reduction of one percentage point for the 1-shot classifier and of more than two percentage points for the 5-shot and 10-shot classifiers. Similarly, randomizing textures (Dataset D6) seems to have an overall negative effect. The result is a reduction of roughly two percentage points for the 1-shot and 10-shot classifiers and 0.7 percentage points for the 5-shot classifier.

The standard deviation is high for all the classifiers, but those trained with the real-world baseline (BL2) have a higher standard deviation than those trained with any of the synthetic datasets. The 10-shot and 5-shot classifiers also have a consistently lower standard deviation, which is to be expected since they are allowed to observe more data. The high standard deviation is a result of the test set being small, but also of the military vehicle classification task itself. Since some vehicles in the test dataset are very similar and hard to distinguish, like different types of tanks or different types of airplanes, the difficulty of a sampled task can vary a lot depending on the visual similarity of the sampled classes.

In conclusion, randomizing light has an overall positive effect on performance for all tasks and can improve performance by several percentage points. Specifically, enabling crepuscular rays and randomizing weather seem to be the two most important factors. However, the exact randomization configuration that generates the best result, as well as the worst result, differs slightly between the 1-shot, 5-shot and 10-shot classifiers.

5.3 Realism vs. Visual Variety

A common question when generating synthetic data is how visually similar to real data it must be in order to be useful for training. Out of the seven generated datasets, D3 is the one that is most visually similar to the real-world test data. This dataset has realistic vehicle positions, lighting without any god rays, no randomized vehicle textures, and variations in both lighting direction and weather. As can be seen in Tables 4.2, 4.3 and 4.4, this dataset is outperformed by all other synthetic datasets for the 1-shot and 10-shot classifiers and by all except D2 for the 5-shot classifier. These results suggest that aiming for visual realism while reducing the visual variety can result in lower accuracy. This result, that variation can be more important than realism, is similar to what has been found in previous research [15, 25, 26].

It would be convenient if synthetic data did not have to be realistic at all, since it would make the data generation process easier. However, merely adding more visual variety does not correlate with improved performance either. This phenomenon is most evident with dataset D5, which is less varied than D1 while still being able to outperform it on the 5-shot tasks. Similarly, positioning the models unrealistically and randomizing textures also harmed performance, as seen with datasets D7 and D6 on both tasks.

These results raise the question of which aspects of synthetic data should be randomized in order to improve performance. Prakash et al. [19] argue in their paper that using knowledge about the application domain, which they call context, can be more beneficial than simply introducing more visual variety in the synthetic data. For example, they generated synthetic images for the task of car detection. The images that they used as test data were all taken in traffic with a camera mounted to a car. Therefore, the synthetic images were generated to always be on a road, and the simulation camera was always placed in a similar position as in the test images. However, aspects that did vary, such as lighting, roads, background, and weather, were heavily randomized in the synthetic dataset.

The test set used in this thesis does not come from one source, and therefore it is difficult to make these kinds of assumptions, either about context or about the camera. One thing that is known, however, is that the vehicles are always positioned realistically, which could explain why simply positioning the vehicles randomly, as in dataset D7, harms performance. Similarly, the vehicles in each class in the test set are generally of the same color or similar colors, which could explain why randomizing vehicle textures did not improve performance.

In conclusion, our hypothesis is that good synthetic data should be randomized along dimensions that make the network robust to the domain shift between synthetic and real data. For vehicle classification, these dimensions seem to be mostly related to the light setting. The images should also be randomized along the dimensions that vary in the test domain, in order for the model to properly learn the tasks. However, randomizing along dimensions that neither vary in the test data nor contribute to the domain shift, such as vehicle positions in our experiments, can have a negative effect, since it needlessly raises the difficulty of the training task.

5.4 The Effect of Task Difficulty

By using synthetic data, there is an opportunity to investigate how changing certain aspects of the data can affect the meta-learning process. An interesting phenomenon that can be seen in the training plots is how the training accuracy plummets during the early update steps on specific datasets for the 1-shot and 5-shot classifiers. For the 5-shot classifier (see Figures 4.5a–4.5e), it is the more randomized datasets (D1, D5, D7, D6) that drop in accuracy after the first and second update steps. For the 1-shot task (see Figures 4.4a–4.4e), however, it is the least randomized datasets (D2, D4, D3) that drop in accuracy during the first two updates. Nichol, Achiam, and Schulman [17] hypothesize in their paper that their Reptile algorithm converges towards a parameter setting that is the closest, in terms of Euclidean distance, to the manifold of optimal solutions for each of the training tasks. Assuming that a similar type of convergence occurs in MAML, the drop in accuracy in the early steps could be due to a high visual variety in the generated tasks, which results in a larger average distance between the task manifolds. As a result, the network would need more fine-tuning steps in order to reach the manifolds, which would explain why the accuracy plummets in the earlier steps.

One problem with this theory is that the drop in validation accuracy happens to different datasets for the 5-shot and the 1-shot classifiers. If the theory held, the accuracy for the D2 dataset would not decrease, whereas the accuracy for all the other datasets would. Therefore, before a solid conclusion can be drawn, further research is needed into how task difficulty, stochastic gradient descent, and MAML all affect the ability of the network to generalize.

5.5 Are We Actually Learning?

Even though the test accuracy of a model is a good indication of how a model will perform in the real world, it does not show what the model has learned. It can, therefore, be a good idea to ascertain that the model is learning what one intends for it to learn. Ensuring that a model learns properly is especially crucial when data is limited and when a model is applied to a new target domain, which is the case for synthetic meta-learning.

To accomplish this, one can use one of the many recently developed explainable artificial intelligence (XAI) methods to visualize which parts of the image influence the network’s decision making. Figure 5.1 shows a few examples of Grad-CAM [22] visualizations. The images and heatmaps showcase how the 5-way 5-shot model, trained with the D1 dataset, changes which pixels influence its decision after each update. These example images all consist of previously unseen validation data that was not used by the network during these five update steps. The five training samples are shown in Figure 5.2.

[Image grid, three rows: Grad-CAM heatmaps after 0, 1, 3, and 5 update steps, followed by the input image; panels (a) to (o)]

Figure 5.1: Network attention on unseen validation samples during training for 5-way 5-shot tasks using D1. Lighter colors indicate pixels with more influence on the final prediction.

[Image row: panels (a) to (e)]

Figure 5.2: Training data for 5-way 5-shot

In Figure 5.1, the attention displayed shows that the helicopter, or parts of it, are the most influential parts of the validation images. The network’s ability to consistently locate the helicopter suggests that the network has learned to find the object in the image, regardless of image filter, rotation, and obfuscation. This is a good indicator that it has learned the given task successfully.

Figure 5.3: Network attention on unseen real-world test samples for 5-way 5-shot tasks. Lighter colors indicate pixels with more influence on the final prediction.

The same visualization technique can also be applied to the real test data during the meta-test phase, in order to ensure that this ability to learn is carried over during the domain transition from synthetic to real data. Some examples can be seen in Figure 5.3. These example images suggest that the model can locate at least a part of the object of interest in all images. However, in this example, it is only able to find the rotor of the helicopter, while the body is seemingly ignored. The focus on the rotor is most likely a result of this model having been trained on vehicle models with more consistent coloring, while the real-world vehicles in the test set can have varied colors.

In contrast, for the 5-way 1-shot classification task, the model appears less reliable in finding the object of interest, both for the training data (see Figure 5.4 and Figure 5.5) and the real-world test data (see Figure 5.6).

Figure 5.4: Grad-CAM visualization on unseen validation samples during training for 5-way 1-shot tasks using D1. Lighter colors indicate pixels with more influence on the final prediction.

[Image: panel (a)]

Figure 5.5: Training data for 5-way 1-shot

Figure 5.6: Grad-CAM visualization on unseen real-world test samples for 5-way 1-shot tasks. Lighter colors indicate pixels with more influence on the final prediction.

5.6 Baselines

For the results of an experiment to be valid, the baseline it is compared against must be as strong as possible. This section covers some of the potential issues with the baselines used in this thesis, to indicate what could be amended in future research.

For the first baseline, no explicit hyperparameter search was done. Instead, it used the same parameters as all the other models. This was simply a result of time constraints when performing the experiments. It is, therefore, possible that this baseline is underperforming since there is no guarantee that these parameters are suitable for this network.

For the second baseline, there is an issue with the number of training samples. The training dataset consists of 9954 samples, 62 classes with 80 to 300 samples each. As a comparison, the miniImageNet dataset [27] consists of 80 training classes with 600 images each, a total of 48,000 images. As a result, this thesis’ baseline is allowed to observe roughly 4.8 times less data than a model trained on miniImageNet. Also, the models trained with synthetic data have access to roughly 106,000 synthetic images. This baseline might therefore not seem like an entirely fair comparison. However, this was the number of real-world images that could be collected by a single person in a couple of days. This highlights one of the problems with the hand-labeled approach, since this dataset took several days to assemble, clean, and prepare. In comparison, creating a new dataset using the simulator takes a couple of hours.

5.7 Ethics and Sustainability

From an ethical point of view, meta-learning, and in particular synthetic meta-learning, has the potential to democratize deep learning technology. Since deep learning requires large amounts of data, state-of-the-art machine learning is mostly carried out by large organizations that have access to ever-increasing amounts of data. If training efficient deep learning models did not require any actual data, but rather easily generated synthetic data and a handful of real-world examples, the number of people who could use and benefit from deep learning would increase significantly.

Meta-learning can also offer increased sustainability. Training contemporary state-of-the-art neural networks is a highly energy-consuming task. Deep neural networks are often trained for days or even weeks on energy-consuming hardware. Meta-learning offers a more energy-conserving approach to training. With meta-training, a very general pre-trained model can be trained in a couple of hours. This model can then later be trained for a broad range of different tasks in only a handful of update steps. If this technique could be perfected enough to rival state-of-the-art approaches, it would remove the need for the long, energy-consuming training process and thus decrease overall energy consumption.


6 Conclusions

This thesis has investigated whether combining meta-learning and synthetic data for real-world few-shot learning problems is a viable alternative to using real data, and how such data should be generated in order to maximize performance. This question has been examined by looking at the task of few-shot military vehicle classification. Synthetic data were first generated using a high-end military simulator. A neural network was then trained using model-agnostic meta-learning on several synthetic datasets with different randomization settings. The model was then evaluated on how well it could learn previously unseen few-shot tasks consisting of real-world images.

The main conclusion of this thesis is that meta-learning with synthetic training data is a viable approach for learning few-shot classification tasks. For 5-way 10-shot tasks, the best-performing classifier trained with synthetic data achieved 72.92% accuracy, compared to 73.70% for an identical classifier trained on real-world data. The results were 69.03% against 72.88% for 5-way 5-shot and 46.41% against 62.94% for 5-way 1-shot classification. Although the results get increasingly worse with smaller amounts of few-shot data, the fast rate at which the gap in accuracy between real-world and synthetic data narrows with only a handful more few-shot samples showcases the strength of synthetic meta-learning.

The results also suggest that small changes to the data generation process can have a significant effect on performance. Randomizing simulation lighting during training alone can, for example, increase the final accuracy by more than eight percentage points. These results also offer hope that further adjustments to the generation process might narrow the performance gap between synthetic and real-world data even further.


CHAPTER 6. CONCLUSIONS

6.1 Future Research

The poor results on the 1-shot tasks suggest that the synthetically trained models were unable to learn a sufficiently domain-agnostic feature representation to handle the domain shift. Instead, more real-world data was needed to fine-tune the models to compensate for the shift in domain. The goal should be to have the model learn a domain-agnostic feature representation directly from the synthetic data, allowing it to be used directly on real-world data. In order to find a domain randomization method in VBS3 that can produce such features for vehicle classification, further research into different domain randomization methods is needed.

As outlined in Section 2.4.3, one of the main advantages of using synthetic data is that it allows for a greater range of possible tasks to be generated, compared to unlabeled approaches. As the results of this thesis show that the synthetic approach can achieve promising accuracy on few-shot tasks, the obvious next step is to apply the synthetic approach in more complex application domains. Some of the more obvious choices for an application would be object detection, object tracking, or meta-training for reinforcement-learning tasks, since that kind of data can be generated easily with existing code.

Another possible avenue of improvement would be the neural network architecture. The network architecture used in this thesis is a very simplistic one, which can seem wasteful, especially since the main advantage of MAML is model independence. There are many reasons for being conservative about the choice of network architecture. Using networks that have been proven to work in the past is significantly more manageable, and complex networks are often difficult to train and often consume more memory, especially when using MAML. However, being able to use more complex architectures is a requirement if more complex tasks, like object detection, are to be used with MAML. In order for this to be viable, there are many aspects of meta-learning that need to be investigated. One is how regularization can be used with MAML. Another is what kinds of networks are best suited. Memory-saving architectures like SqueezeNet [9] and Tiny SSD [29] might be a promising way forward.


Bibliography

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006. ISBN: 0387310738.

[2] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “ImageNet: A large-scale hierarchical image database”. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE. 2009, pp. 248–255.

[3] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. “FlowNet: Learning Optical Flow With Convolutional Networks”. In: The IEEE International Conference on Computer Vision (ICCV). Dec. 2015.

[4] Debidatta Dwibedi, Ishan Misra, and Martial Hebert. “Cut, paste and learn: Surprisingly easy synthesis for instance detection”. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1301–1310.

[5] Chelsea Finn, Pieter Abbeel, and Sergey Levine. “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks”. In: Proceedings of the 34th International Conference on Machine Learning. Ed. by Doina Precup and Yee Whye Teh. Vol. 70. Proceedings of Machine Learning Research. International Convention Centre, Sydney, Australia: PMLR, Aug. 2017, pp. 1126–1135. URL: http://proceedings.mlr.press/v70/finn17a.html.

[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.


BIBLIOGRAPHY

[7] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. “Synthetic Data for Text Localisation in Natural Images”. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2016.

[8] Kyle Hsu, Sergey Levine, and Chelsea Finn. “Unsupervised learning via meta-learning”. In: arXiv preprint arXiv:1810.02334 (2018).

[9] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and < 0.5 MB model size”. In: arXiv preprint arXiv:1602.07360 (2016).

[10] Sergey Ioffe and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift”. In: arXiv preprint arXiv:1502.03167 (2015).

[11] Siavash Khodadadeh, Ladislau Bölöni, and Mubarak Shah. “Unsupervised Meta-Learning For Few-Shot Image and Video Classification”. In: arXiv preprint arXiv:1811.11819 (2018).

[12] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. “Siamese neural networks for one-shot image recognition”. In: ICML deep learning workshop. Vol. 2. 2015.

[13] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. “Human-level concept learning through probabilistic program induction”. In: Science 350.6266 (2015), pp. 1332–1338.

[14] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org. 2015. URL: https://www.tensorflow.org/.


[15] Nikolaus Mayer, Eddy Ilg, Philipp Fischer, Caner Hazirbas, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. “What makes good synthetic training data for learning disparity and optical flow estimation?” In: International Journal of Computer Vision 126.9 (2018), pp. 942–960.

[16] Tsendsuren Munkhdalai and Hong Yu. “Meta networks”. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org. 2017, pp. 2554–2563.

[17] Alex Nichol, Joshua Achiam, and John Schulman. “On first-order meta-learning algorithms”. In: arXiv preprint arXiv:1803.02999 (2018).

[18] Xingchao Peng, Baochen Sun, Karim Ali, and Kate Saenko. “Learning Deep Object Detectors From 3D Models”. In: The IEEE International Conference on Computer Vision (ICCV). Dec. 2015.

[19] Aayush Prakash, Shaad Boochoon, Mark Brophy, David Acuna, Eric Cameracci, Gavriel State, Omer Shapira, and Stan Birchfield. “Structured Domain Randomization: Bridging the Reality Gap by Context-Aware Synthetic Data”. In: arXiv preprint arXiv:1810.10093 (2018).

[20] Elad Richardson, Matan Sela, and Ron Kimmel. “3D face reconstruction by learning from synthetic data”. In: 2016 Fourth International Conference on 3D Vision (3DV). IEEE. 2016, pp. 460–469.

[21] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. “Meta-learning with memory-augmented neural networks”. In: International conference on machine learning. 2016, pp. 1842–1850.

[22] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. “Grad-cam: Visual explanations from deep networks via gradient-based localization”. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 618–626.

[23] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. “CNN features off-the-shelf: an astounding baseline for recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2014, pp. 806–813.

[24] Martin Sundermeyer, Zoltan-Csaba Marton, Maximilian Durner, Manuel Brucker, and Rudolph Triebel. “Implicit 3d orientation learning for 6d object detection from rgb images”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 699–715.


[25] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. “Domain randomization for transferring deep neural networks from simulation to the real world”. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE. 2017, pp. 23–30.

[26] Jonathan Tremblay, Aayush Prakash, David Acuna, Mark Brophy, Varun Jampani, Cem Anil, Thang To, Eric Cameracci, Shaad Boochoon, and Stan Birchfield. “Training deep networks with synthetic data: Bridging the reality gap by domain randomization”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2018, pp. 969–977.

[27] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. “Matching networks for one shot learning”. In: Advances in neural information processing systems. 2016, pp. 3630–3638.

[28] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. “A survey of transfer learning”. In: Journal of Big Data 3.1 (2016), p. 9.

[29] Alexander Wong, Mohammad Javad Shafiee, Francis Li, and Brendan Chwyl. “Tiny ssd: A tiny single-shot detection deep convolutional neural network for real-time embedded object detection”. In: 2018 15th Conference on Computer and Robot Vision (CRV). IEEE. 2018, pp. 95–101.


A Appendix

A.1 Hardware

The training experiments were performed on a high-performance machine learning PC. It had 64 AMD Radian Cores, two Nvidia GeForce RTX 2080 with 11 GB of video memory each, 128 GB of RAM, a 256 GB SSD and a 6 TB HDD.

In addition, a PC running Microsoft’s Windows 10 with 8 Intel i9 processors, one Nvidia GeForce GTX 1080 Ti and 64 GB of RAM was used to run the VBS3 simulations that generated the synthetic dataset. It was possible to run up to 8 parallel instances of VBS3 on the highest graphical settings without significantly slowing down the image generation.

A.2 VBS3 Vehicle Dataset

The work performed during this thesis has resulted in a large meta-learning dataset with high-quality images and a large set of labels for each image. This section describes in detail the dataset used in this thesis, which has been named the VBS3 Vehicle Dataset. It outlines the information provided for each data point, how that information has been used in this thesis, and possible future applications.

A.2.1 Images

Each image is a 1280 x 768 RGB color image in PNG format.


A.2.2 Vehicle Classes

The VBS3 Vehicle Dataset contains a selection of vehicle models taken from VBS3’s internal list of vehicle models available in version 18.3.3.8. In total, the dataset contains 2381 unique models.

A.2.3 Image Background

In order to increase the variation in the data, the models are spawned into a range of pre-built environments. Five environments were selected because they were the most detailed and had the best-looking texture assets. Each image is tagged with the name of the environment in which it was taken. The environments are:

• Tropical: A tropical environment covered mostly with forest and a large river.

• Afghanistan: A desert-like landscape.

• USA: A small American town surrounded by open fields.

• Eastern European: A coniferous forest with a small lake.

• Iraq: A large Middle Eastern city surrounded by open desert.

A.2.4 Vehicle Data

Each object in the image has a set of corresponding meta information: object bitmap, object bounding box, object color scheme, and object rotation.

Bitmap: The bitmap is a binary 2D array that segments the original image into two segments: the pixels that contain the object and those that do not (see Figure A.1). The segmentations produced are pixel-perfect, except for some models where parts of the vehicle are consistently missing because the engine does not mask them appropriately.

Figure A.1: (a) Original image. (b) Bitmap.

Bounding Box: Each bounding box consists of four values: the x and y pixel indices of the leftmost lowest point of the box, and its width and height in pixels.
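
As an illustration of how the bitmap and the bounding box relate, the following minimal sketch derives the four bounding-box values from a binary bitmap. The function name `bbox_from_bitmap` is hypothetical, and the sketch assumes that rows are indexed from the top of the image, so that the "lowest" point is the largest occupied row index; the thesis does not state which origin is used.

```python
from typing import List, Tuple


def bbox_from_bitmap(bitmap: List[List[int]]) -> Tuple[int, int, int, int]:
    """Return (x, y, width, height) of the object pixels in a binary bitmap.

    Assumption (not stated in the thesis): rows are indexed from the top of
    the image, so the lowest point of the box is the largest occupied row.
    """
    # Row indices that contain at least one object pixel.
    rows = [r for r, row in enumerate(bitmap) if any(row)]
    # Column indices of every object pixel.
    cols = [c for row in bitmap for c, value in enumerate(row) if value]
    if not rows:
        raise ValueError("bitmap contains no object pixels")

    x_left, x_right = min(cols), max(cols)
    y_top, y_bottom = min(rows), max(rows)
    width = x_right - x_left + 1
    height = y_bottom - y_top + 1
    # Leftmost lowest point first, then the box size in pixels.
    return x_left, y_bottom, width, height
```

If the dataset instead measures y from the bottom of the image, only the returned y value would need to change.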

Rotation: The rotation of each object is given as yaw, pitch, and bank. All values are given in degrees. Although not utilized in this thesis, these values can be used in a myriad of ways.

• Yaw refers to the rotation around an axis drawn from the top to the bottom of the vehicle. If all other rotations are fixed, a vehicle with a yaw of 0 degrees will be facing the camera head-on, while a yaw of 90 degrees will have the vehicle facing the right of the image.

• Pitch refers to the rotation around an axis going from the left side to the right side of the vehicle. If all other rotations are fixed, a vehicle with a pitch of 0 degrees will be facing straight forward, while a pitch of 90 degrees means that the vehicle will be pointing straight up.

• Bank refers to the rotation around an axis going from the front to the back of the vehicle. If all other rotations are fixed, a bank of 90 degrees means that the object will be lying on its right side.
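
The yaw and pitch definitions above can be sketched as a small conversion to a facing direction. This is only an illustration under stated assumptions: the axis convention (x towards the right of the image, y up, z towards the camera), the order in which the angles are combined, and the function name `forward_vector` are not taken from the thesis or from VBS3. Bank is omitted because a rotation around the front-to-back axis does not change the direction in which the vehicle faces.

```python
import math
from typing import Tuple


def forward_vector(yaw_deg: float, pitch_deg: float) -> Tuple[float, float, float]:
    """Unit vector along which the vehicle faces.

    Assumed axes (hypothetical): x points to the right of the image,
    y points up, and z points from the scene towards the camera.
    """
    yaw = math.radians(yaw_deg)
    pitch = math.radians(pitch_deg)
    x = math.sin(yaw) * math.cos(pitch)  # yaw of 90 degrees faces image right
    y = math.sin(pitch)                  # pitch of 90 degrees faces straight up
    z = math.cos(yaw) * math.cos(pitch)  # yaw 0 and pitch 0 face the camera
    return x, y, z
```

With these assumptions, `forward_vector(0, 0)` points at the camera, `forward_vector(90, 0)` points to the right of the image, and `forward_vector(0, 90)` points straight up, matching the three reference cases in the list.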

Color Scheme: Some of the models have been randomly assigned a set of colors in a subset of the images. Each color is sampled uniformly from the set of possible RGB values, and alpha is sampled uniformly between 0.5 and 1.0. If a random color selection has been applied to a model in an image, this is saved as a list of 4-tuples. Each 4-tuple corresponds to the RGBA values of a reassigned texture.

The degree to which the colors are randomized depends on what VBS3 supports, and the quality of the randomized models can vary. The data has been generated by randomly assigning a color to each texture in the model. As a result, the number of randomized colors per vehicle depends on the number of unique textures that make up the vehicle: some randomized vehicles are given a single uniform color, while others can consist of a large set of colors (see Figure 3.4).
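
The sampling scheme described above, one RGBA value per texture, can be sketched as follows. This is not the generation code used for the dataset: the function name `sample_color_scheme` is hypothetical, and the sketch assumes 8-bit RGB channels (integers from 0 to 255), which the thesis does not specify.

```python
import random
from typing import List, Optional, Tuple

# (red, green, blue, alpha); the 8-bit channel range is an assumption.
RGBA = Tuple[int, int, int, float]


def sample_color_scheme(num_textures: int,
                        rng: Optional[random.Random] = None) -> List[RGBA]:
    """Sample one random RGBA value per texture of a vehicle model."""
    if rng is None:
        rng = random.Random()
    scheme: List[RGBA] = []
    for _ in range(num_textures):
        # RGB channels are drawn uniformly from all possible values.
        r, g, b = (rng.randint(0, 255) for _ in range(3))
        # Alpha is drawn uniformly between 0.5 and 1.0.
        alpha = rng.uniform(0.5, 1.0)
        scheme.append((r, g, b, alpha))
    return scheme
```

A model built from a single texture receives a list with one entry, which corresponds to the single uniform color mentioned above, while a model with many textures receives one entry per texture.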

Weather: Each image is provided with the weather setting that was enabled when the image was taken. The configuration is given as three values between zero and one, corresponding to the level of rain, the level of fog, and the level of overcast, with a higher value corresponding to more intense weather.

Crepuscular Rays: Each image is also provided with the scattering coefficients for the crepuscular rays (god rays). The configuration is provided as three values between zero and thirty. They correspond to the relation between the three RGB channels, as well as how much the light should fracture when colliding with something.
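
The per-image scene labels described in this appendix (environment, weather, and crepuscular rays) could be held in a small validated record such as the one sketched below. The class name `SceneSettings` and its field names are hypothetical and do not reflect the actual file format of the dataset; only the value ranges and the environment names are taken from the text.

```python
from dataclasses import dataclass
from typing import Tuple

# Environment names as listed in A.2.3.
ENVIRONMENTS = ("Tropical", "Afghanistan", "USA", "Eastern European", "Iraq")


@dataclass(frozen=True)
class SceneSettings:
    """Scene-level labels of one image (field names are illustrative)."""

    environment: str
    weather: Tuple[float, float, float]           # (rain, fog, overcast)
    crepuscular_rays: Tuple[float, float, float]  # scattering coefficients

    def __post_init__(self) -> None:
        if self.environment not in ENVIRONMENTS:
            raise ValueError(f"unknown environment: {self.environment!r}")
        # Weather levels lie between zero and one.
        if len(self.weather) != 3 or not all(0.0 <= v <= 1.0 for v in self.weather):
            raise ValueError("weather must be three values in [0, 1]")
        # Crepuscular-ray coefficients lie between zero and thirty.
        if len(self.crepuscular_rays) != 3 or not all(
            0.0 <= v <= 30.0 for v in self.crepuscular_rays
        ):
            raise ValueError("crepuscular_rays must be three values in [0, 30]")
```

Validating the ranges when a record is created makes it easier to notice labels that were parsed in the wrong order or on the wrong scale.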


TRITA-EECS-EX-2019:640

www.kth.se
