
Enhancing Underwater Imagery using Generative Adversarial Networks

Cameron Fabbri¹, Md Jahidul Islam², and Junaed Sattar³

University of Minnesota, Minneapolis MN

{fabbr013¹, islam034², junaed³}@umn.edu

January 15, 2018

Abstract

Autonomous underwater vehicles (AUVs) rely on a variety of sensors – acoustic, inertial and visual – for intelligent decision making. Due to its non-intrusive, passive nature and high information content, vision is an attractive sensing modality, particularly at shallower depths. However, factors such as light refraction and absorption, suspended particles in the water, and color distortion affect the quality of visual data, resulting in noisy and distorted images. AUVs that rely on visual sensing thus face difficult challenges, and consequently exhibit poor performance on vision-driven tasks. This paper proposes a method to improve the quality of visual underwater scenes using Generative Adversarial Networks (GANs), with the goal of improving input to vision-driven behaviors further down the autonomy pipeline. Furthermore, we show how recently proposed methods are able to generate a dataset for the purpose of such underwater image restoration. For any visually-guided underwater robot, this improvement can result in increased safety and reliability through robust visual perception. To that effect, we present quantitative and qualitative data demonstrating that images corrected through the proposed approach are more visually appealing, and also provide increased accuracy for a diver tracking algorithm.

1 Introduction

Underwater robotics has been a steadily growing subfield of autonomous field robotics, assisted by the advent of novel platforms, sensors and propulsion mechanisms. While autonomous underwater vehicles are often equipped with a variety of sensors, visual sensing is an attractive option because of its non-intrusive, passive, and energy-efficient nature. The monitoring of coral reefs [28], deep ocean exploration [32], and mapping of the seabed [5] are a number of tasks where visually-guided AUVs and ROVs (Remotely Operated Vehicles) have seen widespread use. Use of these robots ensures humans are not exposed to the hazards of underwater exploration, as they no longer need to venture to the depths (which was how such tasks were carried out in the past). Despite the advantages of using vision, underwater environments pose unique challenges to visual sensing, as light refraction, absorption, and scattering from suspended particles can greatly affect optics. For example, because red wavelengths are quickly absorbed by water, images tend to have a green or blue hue to them. As one goes deeper, this effect worsens, as more and more red hue is absorbed. This distortion is extremely non-linear in nature, and is affected by a large number of factors, such as the amount of light present (overcast versus sunny, operational depth), the amount of particles in the water, the time of day, and the camera being used. This may cause difficulty in tasks such as segmentation, tracking, or classification due to their indirect or direct use of color.


As color and illumination begin to change with depth, vision-based algorithms need to be generalizable in order to work within the depth ranges a robot may operate in. Because of the high cost and difficulty of acquiring a variety of underwater data to train a visual system on, as well as the high amount of noise introduced, algorithms may (and do) perform poorly across these different domains. Figure 1 shows the high variability in visual scenes that may occur in underwater environments. A step towards a solution to this issue is to be able to restore the images such that they appear to be above water, i.e., with colors corrected and suspended particles removed from the scene. By performing a many-to-one mapping of these domains from underwater to not underwater (what the image would look like above water), algorithms that have difficulty performing across multiple forms of noise may be able to focus on only one clean domain.

Deep neural networks have been shown to be powerful non-linear function approximators, especially in the field of vision [17]. Oftentimes, these networks require large amounts of data, either labeled or paired with ground truth. For the problem of automatically colorizing grayscale images [33], paired training data is readily available due to the fact that any color image can be converted to black and white. However, underwater images distorted by either color or some other phenomenon lack ground truth, which is a major hindrance towards adopting a similar approach for correction. This paper proposes a technique based on Generative Adversarial Networks (GANs) to improve the quality of visual underwater scenes with the goal of improving the performance of vision-driven behaviors for autonomous underwater robots. We use the recently proposed CycleGAN [35] approach, which learns to translate an image from any arbitrary domain X to another arbitrary domain Y without image pairs, as a way to generate a paired dataset. By letting X be a set of undistorted underwater images, and Y be a set of distorted underwater images, we can generate an image that appears to be underwater while retaining ground truth.

Figure 1: Sample underwater images with natural and man-made artifacts (in this case, our underwater robot), displaying the diversity of distortions that can occur. With the varying camera-to-object distances in the images, the distortion and loss of color varies between the different images.

2 Related Work

While there have been a number of successful recent approaches towards automatic colorization [33, 11], most are focused on the task of converting grayscale images to color. Quite a few approaches use a physics-based technique to directly model light refraction [15]. Specifically for restoring color in underwater images, the work of [29] uses an energy minimization formulation based on a Markov Random Field. Most similar to the work proposed in this paper is the recently proposed WaterGAN [20], which uses an adversarial approach towards generating realistic underwater images. Their generator model can be broken down into three stages: 1) attenuation, which accounts for range-dependent attenuation of light; 2) scattering, which models the haze effect caused by photons scattering back towards the image sensor; and 3) vignetting, which produces a shading effect on the image corners that can be caused by certain camera lenses. Differentiating from our work, they use a GAN for generating the underwater images and a strictly Euclidean loss for color correction, whereas we use a GAN for both. Furthermore, they require depth information during the training of WaterGAN, which can often be difficult to attain, particularly for underwater autonomous robotic applications.


Our work only requires images of objects in two separate domains (e.g., underwater and terrestrial) throughout the entire process.

Recent work in generative models, specifically GANs, has shown great success in areas such as inpainting [24], style transfer [8], and image-to-image translation [14, 35]. This is primarily due to their ability to provide a more meaningful loss than simply the Euclidean distance, which has been shown to produce blurry results. In our work, we structure the problem of estimating the true appearance of underwater imagery as a paired image-to-image translation problem, using Generative Adversarial Networks (GANs) as our generative model (see Section 3.2 for details). Much like the work of [14], we use image pairs from two domains as input and ground truth.

3 Methodology

Underwater images distorted by color or other circumstances lack ground truth, which is a necessity for previous colorization approaches. Furthermore, the distortion present in an underwater image is highly nonlinear; simple methods such as adding a hue to an image do not capture all of the dependencies. We propose to use CycleGAN as a distortion model in order to generate paired images for training. Given a domain of underwater images with no distortion and a domain of underwater images with distortion, CycleGAN is able to perform style transfer. Given an undistorted image, CycleGAN distorts it such that it appears to have come from the domain of distorted images. These pairs are then used in our algorithm for image reconstruction.

3.1 Dataset Generation

Depth, lighting conditions, camera model, and physical location in the underwater environment are all factors that affect the amount of distortion an image will be subjected to. Under certain conditions, it is possible that an underwater image may have very little distortion, or none at all. We let IC be an underwater image with no distortion, and ID be the same image with distortion. Our goal is to learn the function f : ID → IC. Because of the difficulty of collecting underwater data, more often than not only ID or IC exists, but never both.

To circumvent the problem of insufficient image pairs, we use CycleGAN to generate ID from IC, which gives us a paired dataset of images. Given two datasets X and Y, where IC ∈ X and ID ∈ Y, CycleGAN learns a mapping F : X → Y. Figure 2 shows paired samples generated by CycleGAN. From this paired dataset we train a generator G to learn the function f : ID → IC. It should be noted that during the training process of CycleGAN, it simultaneously learns a mapping G : Y → X, which is similar to f. In Section 4, we compare images generated by CycleGAN with images generated through our approach.
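As a concrete illustration of this pairing step, the sketch below runs an already-trained CycleGAN mapping F over a directory of undistorted images and writes (IC, ID) pairs to disk. This is a minimal sketch rather than the authors' released code; the `distort` callable, the directory layout, and the 256 × 256 preprocessing are assumptions.

```python
import os
import numpy as np
from PIL import Image

def build_paired_dataset(clean_dir, out_dir, distort):
    """Generate (I_C, I_D) training pairs by distorting every clean image.

    `distort` is assumed to wrap a trained CycleGAN generator F: X -> Y,
    taking and returning an HxWx3 float32 array scaled to [-1, 1].
    """
    for sub in ("clean", "distorted"):
        os.makedirs(os.path.join(out_dir, sub), exist_ok=True)
    for name in sorted(os.listdir(clean_dir)):
        img = Image.open(os.path.join(clean_dir, name)).convert("RGB").resize((256, 256))
        i_c = np.asarray(img, dtype=np.float32) / 127.5 - 1.0   # undistorted image I_C
        i_d = distort(i_c)                                      # distorted counterpart I_D = F(I_C)
        for sub, arr in (("clean", i_c), ("distorted", i_d)):
            out = ((arr + 1.0) * 127.5).clip(0, 255).astype(np.uint8)
            Image.fromarray(out).save(os.path.join(out_dir, sub, name))
```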

3.2 Adversarial Networks

In the machine learning literature, Generative Adversarial Networks (GANs) [9] represent a class of generative models based on game theory, in which a generator network competes against an adversary. From a classification perspective, the generator network G produces instances which actively attempt to 'fool' the discriminator network D. The goal is for the discriminator network to be able to distinguish between 'true' instances coming from the dataset and 'false' instances produced by the generator network. In our case, conditioned on an image ID, the generator is trained to produce an image that tries to fool the discriminator, which is trained to distinguish between distorted and non-distorted underwater images. In the original GAN formulation, our goal is to solve the minimax problem:

$$\min_G \max_D \; \mathbb{E}_{I_C \sim p_{\mathrm{train}}(I_C)}[\log D(I_C)] + \mathbb{E}_{I_D \sim p_{\mathrm{gen}}(I_D)}[\log(1 - D(G(I_D)))] \quad (1)$$

Note that for simplicity of notation, we will henceforth omit IC ∼ ptrain(IC) and ID ∼ pgen(ID). In this formulation, the discriminator is hypothesized as a classifier with a sigmoid cross-entropy loss function, which in practice may lead to issues such as vanishing gradients and mode collapse.
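For reference, a minimal sketch of the losses implied by Eq. (1), treating the discriminator as a sigmoid classifier (this is the baseline formulation being discussed, not the objective UGAN ultimately uses); the function and argument names are our own:

```python
import tensorflow as tf

def vanilla_gan_losses(d_real_logits, d_fake_logits):
    """Standard GAN losses with a sigmoid cross-entropy discriminator (Eq. 1).

    d_real_logits: D(I_C) on true, undistorted images.
    d_fake_logits: D(G(I_D)) on generated images.
    """
    bce = tf.nn.sigmoid_cross_entropy_with_logits
    d_loss = tf.reduce_mean(bce(labels=tf.ones_like(d_real_logits), logits=d_real_logits)) + \
             tf.reduce_mean(bce(labels=tf.zeros_like(d_fake_logits), logits=d_fake_logits))
    # Non-saturating generator loss commonly used in place of log(1 - D(G(I_D))).
    g_loss = tf.reduce_mean(bce(labels=tf.ones_like(d_fake_logits), logits=d_fake_logits))
    return d_loss, g_loss
```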


Figure 2: Paired samples of ground truth and distorted images generated by CycleGAN. Top row: ground truth. Bottom row: generated samples.

As shown by [2], as the discriminator improves, the gradient of the generator vanishes, making it difficult or impossible to train. Mode collapse occurs when the generator "collapses" onto a single point, fooling the discriminator with only one instance. To illustrate the effect of mode collapse, imagine a GAN being used to generate digits from the MNIST [18] dataset, but which only ever generates the same digit; in reality, the desired outcome is a diverse collection of all the digits. To this end, there have been a number of recent methods which hypothesize a different loss function for the discriminator [21, 3, 10, 34]. We focus on the Wasserstein GAN (WGAN) [3] formulation, which proposes to use the Earth-Mover or Wasserstein-1 distance W by constructing a value function using the Kantorovich-Rubinstein duality [31]. In this formulation, W is approximated given a set of k-Lipschitz functions f modeled as neural networks. To ensure f is k-Lipschitz, the weights of the discriminator are clipped to some range [−c, c]. In our work, we adopt the Wasserstein GAN with gradient penalty (WGAN-GP) [10], which, instead of clipping network weights as in [3], ensures the Lipschitz constraint by enforcing a soft constraint on the gradient norm of the discriminator's output with respect to its input. Following [10], our new objective then becomes

$$\mathcal{L}_{WGAN}(G, D) = \mathbb{E}[D(I_C)] - \mathbb{E}[D(G(I_D))] + \lambda_{GP}\, \mathbb{E}_{x \sim P_x}\big[(\|\nabla_x D(x)\|_2 - 1)^2\big] \quad (2)$$

where Px is defined by samples along straight lines between pairs of points coming from the true data distribution and the generator distribution, and λGP is a weighting factor. In order to give G some sense of ground truth, as well as capture low-level frequencies in the image, we also consider the L1 loss

$$\mathcal{L}_{L1} = \mathbb{E}\big[\|I_C - G(I_D)\|_1\big] \quad (3)$$

Combining these, we get the final objective function for our network, which we call Underwater GAN (UGAN):

$$\mathcal{L}^{*}_{UGAN} = \min_G \max_D \; \mathcal{L}_{WGAN}(G, D) + \lambda_1 \mathcal{L}_{L1}(G) \quad (4)$$
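The sketch below transcribes Equations (2)-(4) in TensorFlow, assuming TF 2.x eager execution and that `generator` and `discriminator` are Keras models; sign conventions follow the usual WGAN-GP practice of minimizing the negated critic objective, and the helper name `ugan_losses` is ours:

```python
import tensorflow as tf

def ugan_losses(generator, discriminator, i_c, i_d, lambda_gp=10.0, lambda_1=100.0):
    """WGAN-GP critic loss with gradient penalty plus the L1 term (Eqs. 2-4).

    i_c: batch of ground-truth (undistorted) images in [-1, 1].
    i_d: batch of distorted images in [-1, 1].
    """
    g_out = generator(i_d, training=True)

    # Wasserstein terms: the critic should score real images high and fakes low.
    d_real = discriminator(i_c, training=True)
    d_fake = discriminator(g_out, training=True)
    wgan_d = tf.reduce_mean(d_fake) - tf.reduce_mean(d_real)

    # Gradient penalty on random interpolates between real and generated images (P_x).
    eps = tf.random.uniform([tf.shape(i_c)[0], 1, 1, 1], 0.0, 1.0)
    x_hat = eps * i_c + (1.0 - eps) * g_out
    with tf.GradientTape() as tape:
        tape.watch(x_hat)
        d_hat = discriminator(x_hat, training=True)
    grads = tape.gradient(d_hat, x_hat)
    grad_norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    gp = tf.reduce_mean(tf.square(grad_norm - 1.0))

    d_loss = wgan_d + lambda_gp * gp
    # Generator: maximize the critic score on fakes while staying close to ground truth (L1).
    l1 = tf.reduce_mean(tf.abs(i_c - g_out))
    g_loss = -tf.reduce_mean(d_fake) + lambda_1 * l1
    return d_loss, g_loss
```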

3.3 Image Gradient Difference Loss

Oftentimes generative models produce blurry images. We explore a strategy to sharpen these predictions by directly penalizing the differences of image gradient predictions in the generator, as proposed by [22]. Given a ground truth image IC, a predicted image IP = G(ID), and an integer α ≥ 1, the Gradient Difference Loss (GDL) is given by

$$\mathcal{L}_{GDL}(I_C, I_P) = \sum_{i,j} \Big| |I_{C_{i,j}} - I_{C_{i-1,j}}| - |I_{P_{i,j}} - I_{P_{i-1,j}}| \Big|^{\alpha} + \Big| |I_{C_{i,j-1}} - I_{C_{i,j}}| - |I_{P_{i,j-1}} - I_{P_{i,j}}| \Big|^{\alpha} \quad (5)$$

In our experiments, we denote our network as UGAN-P when considering the GDL, which can be expressed as

$$\mathcal{L}^{*}_{UGAN\text{-}P} = \min_G \max_D \; \mathcal{L}_{WGAN}(G, D) + \lambda_1 \mathcal{L}_{L1}(G) + \lambda_2 \mathcal{L}_{GDL} \quad (6)$$
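Equation (5) can be transcribed directly with finite differences; the sketch below is our own reading of it (using a mean rather than a raw sum, with α = 1 as in the experiments):

```python
import tensorflow as tf

def gradient_difference_loss(i_c, i_p, alpha=1.0):
    """Gradient Difference Loss (Eq. 5): penalize mismatched image gradients.

    i_c, i_p: ground-truth and predicted image batches of shape [N, H, W, C].
    """
    # Vertical (dy) and horizontal (dx) absolute finite differences for both images.
    dy_c = tf.abs(i_c[:, 1:, :, :] - i_c[:, :-1, :, :])
    dx_c = tf.abs(i_c[:, :, 1:, :] - i_c[:, :, :-1, :])
    dy_p = tf.abs(i_p[:, 1:, :, :] - i_p[:, :-1, :, :])
    dx_p = tf.abs(i_p[:, :, 1:, :] - i_p[:, :, :-1, :])
    return tf.reduce_mean(tf.abs(dy_c - dy_p) ** alpha) + tf.reduce_mean(tf.abs(dx_c - dx_p) ** alpha)
```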

3.4 Network Architecture

Our generator network is a fully convolutional encoder-decoder, similar to the work of [14], which is designed as a "U-Net" [26] due to the structural similarity between input and output.


Figure 3: Samples from our ImageNet testing set (rows: Original, UGAN, UGAN-P). The network can both recover color and also correct color if a small amount is present.

Encoder-decoder networks downsample (encode) the input via convolutions to a lower-dimensional embedding, which is then upsampled (decoded) via transpose convolutions to reconstruct an image. The advantage of using a "U-Net" comes from explicitly preserving spatial dependencies produced by the encoder, as opposed to relying on the embedding to contain all of the information. This is done by the addition of "skip connections", which concatenate the activations produced by a convolution layer i in the encoder to the input of a transpose convolution layer n − i + 1 in the decoder, where n is the total number of layers in the network. Each convolutional layer in our generator uses a 4 × 4 kernel with stride 2. Convolutions in the encoder portion of the network are followed by batch normalization [12] and a leaky ReLU activation with slope 0.2, while transpose convolutions in the decoder are followed by a ReLU activation [23] (no batch normalization in the decoder). Exempt from this is the last layer of the decoder, which uses a TanH nonlinearity to match the input distribution of [−1, 1]. Recent work has proposed Instance Normalization [30] to improve quality in image-to-image translation tasks; however, we observed no added benefit.
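A Keras sketch of a generator following the stated conventions (4 × 4 kernels with stride 2, batch normalization and LeakyReLU(0.2) in the encoder, ReLU transpose convolutions with no batch normalization in the decoder, skip connections, and a TanH output); the number of layers and filter counts are assumptions rather than the exact architecture used in the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_generator(filters=(64, 128, 256, 512, 512)):
    """U-Net style fully convolutional generator with skip connections."""
    inp = layers.Input(shape=(256, 256, 3))
    x, skips = inp, []
    for f in filters:                                      # encoder: conv -> batch norm -> LeakyReLU
        x = layers.Conv2D(f, 4, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)
        skips.append(x)
    for f, skip in zip(reversed(filters[:-1]), reversed(skips[:-1])):
        x = layers.Conv2DTranspose(f, 4, strides=2, padding="same", activation="relu")(x)
        x = layers.Concatenate()([x, skip])                # skip connection: concat encoder features
    out = layers.Conv2DTranspose(3, 4, strides=2, padding="same", activation="tanh")(x)
    return tf.keras.Model(inp, out)
```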

Our fully convolutional discriminator is modeled after that of [25], except that no batch normalization is used. This is because WGAN-GP penalizes the norm of the discriminator's gradient with respect to each input individually, which batch normalization would invalidate. The authors of [10] recommend layer normalization [4], but we found no significant improvements. Our discriminator is modeled as a PatchGAN [14, 19], which discriminates at the level of image patches. As opposed to a regular discriminator, which outputs a scalar value corresponding to real or fake, our PatchGAN discriminator outputs a 32 × 32 × 1 feature matrix, which provides a metric for high-level frequencies.
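A matching PatchGAN-style critic sketch (strided 4 × 4 convolutions, LeakyReLU, no batch normalization, and a 32 × 32 × 1 output map of unbounded scores); again, the filter counts are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator():
    """PatchGAN critic: 32x32x1 score map, no batch normalization (required by WGAN-GP)."""
    inp = layers.Input(shape=(256, 256, 3))
    x = inp
    for f in (64, 128, 256):                               # 256 -> 128 -> 64 -> 32 spatially
        x = layers.Conv2D(f, 4, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    x = layers.Conv2D(512, 4, strides=1, padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)
    return tf.keras.Model(inp, layers.Conv2D(1, 4, strides=1, padding="same")(x))
```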

4 Experiments

4.1 Datasets

We used several subsets of ImageNet [7] for training and evaluation of our methods. We also evaluate a frequency- and spatial-domain diver-tracking algorithm on a video of scuba divers taken from YouTube™.¹ Subsets of ImageNet containing underwater images were selected for the training of CycleGAN and manually separated into two classes based on visual inspection. We let X be the set of underwater images with no distortion, and Y be the set of underwater images with distortion. X contained 6143 images, and Y contained 1817 images. We then trained CycleGAN to learn the mapping F : X → Y, such that images from X appeared to have come from Y.

¹ https://www.youtube.com/watch?v=QmRFmhILd5o


Finally, our image pairs for training data were generated by distorting all images in X with F. Figure 2 shows sample training pairs. When comparing with CycleGAN, we used a test set of 56 images acquired from Flickr™.

Figure 4: Running the Canny edge detector on sample images (columns: Original, CycleGAN, UGAN, UGAN-P). Both variants of UGAN contain less noise than CycleGAN, and are closer in the image space to the original. For each pair, the top row is the input image and the bottom row the result of the edge detector. The figure depicts four different sets of images, successively labeled A to D from top to bottom. See Table 1.

4.2 Evaluation

We train UGAN and UGAN-P on the image pairs generated by CycleGAN and evaluate on the images from the test set, Y. Note that these images do not contain any ground truth, as they are original distorted images from ImageNet. Images for training and testing are of size 256 × 256 × 3 and normalized to [−1, 1]. Figure 3 shows samples from the test set. Notably, these images contain varying amounts of noise. Both UGAN and UGAN-P are able to recover lost color information, as well as correct any color information that is present.

While many of the distorted images contain a blue or green hue over the entire image space, that is not always the case. In certain environments, it is possible that objects close to the camera are undistorted with correct colors, while the background of the image contains distortion. In these cases, we would like the network to only correct parts of the image that appear distorted. The last row in Figure 3 shows a sample of such an image. The orange of the clownfish is left unchanged while the distorted sea anemone in the background has its color corrected.

For a quantitative evaluation we compare to CycleGAN, as it inherently learns an inverse mapping during the training of G : Y → X. We first use the Canny edge detector [6], as this provides a color-agnostic evaluation of the images in comparison to ground truth. Second, we compare local image patches to provide sharpness metrics on our images. Lastly, we show how an existing tracking algorithm for an underwater robot improves performance with generated images.

4.3 Comparison to CycleGAN

It is important to note that during the process of learning a mapping F : X → Y, CycleGAN also learns a mapping G : Y → X. Here we give a comparison to our methods. We use the Canny edge detector [6] to provide a color-agnostic evaluation of the images, as the originals contain distorted colors and cannot serve as ground truth.


Figure 5: Local image patches (columns: Original, CycleGAN, UGAN, UGAN-P) extracted for quantitative comparisons, shown in Tables 2 and 3. Each patch was resized to 64 × 64, but is shown enlarged for visibility.

Because restoring color information should not alter the overall structure of the image, we measure the distance in the image space between the edges found in the original and generated images. Figure 4 shows the original images and the results of edge detection. Table 1 provides the measurements from Figure 4, as well as the average over our entire Flickr™ dataset. Both UGAN and UGAN-P are consistently closer in the image space to the original than CycleGAN, suggesting noise due to blur. Next, we evaluate this noise explicitly.
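This comparison can be sketched with OpenCV's Canny detector and an L2 distance between edge maps; the thresholds below are assumptions, not the values used in the paper:

```python
import cv2
import numpy as np

def edge_distance(original_path, generated_path, low=100, high=200):
    """Euclidean distance in image space between Canny edge maps of two images."""
    orig = cv2.Canny(cv2.imread(original_path, cv2.IMREAD_GRAYSCALE), low, high)
    gen = cv2.Canny(cv2.imread(generated_path, cv2.IMREAD_GRAYSCALE), low, high)
    # Edge maps are 0/255; normalize to 0/1 before measuring the distance.
    diff = (orig.astype(np.float32) - gen.astype(np.float32)) / 255.0
    return float(np.linalg.norm(diff))
```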

We explore the artifacts of content loss, as seen in Figure 5. In particular, we compare local statistics of the highlighted image patches, where each image patch is resized to 64 × 64. We use the GDL [22] from (5) as a sharpness measure. A lower GDL measure implies a smoother transition between pixels, as a noisy image would have large jumps in the image's gradient, leading to a higher score. As seen in Table 2, the GDL is lower for both UGAN and UGAN-P. Interestingly, UGAN consistently has a lower score than UGAN-P, despite UGAN-P explicitly accounting for this metric in the objective function. Investigating the reason for this is left for future work.

Table 1: Distances in image space

Row / Method    CycleGAN    UGAN      UGAN-P
A               116.45      85.71     86.15
B               114.49      97.92     101.01
C               120.84      96.53     97.57
D               129.27      108.90    110.50
Mean            111.60      94.91     96.51

Other metrics we use to compare image patches are the mean and standard deviation of a patch. The standard deviation gives us a sense of blurriness because it defines how far the data deviates from the mean; in the case of images, this would suggest a blurring effect due to the data being more clustered toward one pixel value. Table 3 shows the means and standard deviations of the RGB values for the local image patches seen in Figure 5. Despite the qualitative evaluation showing our methods are much sharper, quantitatively they show only slight improvement. Other metrics such as entropy are left as future work.
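The per-patch statistics reported in Table 3 reduce to simple numpy computations (the GDL scores in Table 2 follow from applying Eq. (5) to the corresponding patch pairs); the patch coordinates and size below are assumptions:

```python
import numpy as np

def patch_mean_std(image, y, x, size=64):
    """Mean and standard deviation of values in a size x size local patch (Table 3).

    image: HxWx3 float array scaled to [0, 1]; (y, x) is the patch's top-left corner.
    """
    patch = image[y:y + size, x:x + size]
    return float(patch.mean()), float(patch.std())
```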


Table 2: Gradient Difference Loss metrics

Patch / Method    CycleGAN    UGAN    UGAN-P
Red               11.53       9.39    10.93
Blue              7.52        4.83    5.50
Green             4.15        3.18    3.25
Orange            6.72        5.65    5.79

4.4 Diver Tracking using Frequency-Domain Detection

We investigate the frequency-domain characteristics of the restored images through a case study of periodic motion tracking in a sequence of images. In particular, we compared the performance of the Mixed Domain Periodic Motion (MDPM) tracker [13] on a sequence of images of a diver swimming in arbitrary directions. The MDPM tracker is designed for underwater robots to follow scuba divers by tracking distinct frequency-domain signatures (high-amplitude spectra at 1–2 Hz) pertaining to human swimming. Amplitude spectra in the frequency domain correspond to the periodic intensity variations in image space over time, which are often eroded in noisy underwater images [27].

Figure 6 illustrates the improved performance of the MDPM tracker on generated images compared to the real ones. Underwater images often fail to capture the true contrast in intensity values between foreground and background due to low visibility. The generated images seem to restore these eroded intensity variations to some extent, causing much improved positive detection (a 350% increase in correct detections) for the MDPM tracker.

4.5 Training Details and Inference Performance

In all of our experiments, we use λ1 = 100, λGP = 10, a batch size of 32, and the Adam optimizer [16] with learning rate 1e−4. Following WGAN-GP, the discriminator is updated n times for every update of the generator, where n = 5. For UGAN-P, we set λ2 = 1.0 and α = 1. Our implementation was done using the TensorFlow library [1].² All networks were trained from scratch on a GTX 1080 for 100 epochs. Inference on the GPU takes on average 0.0138 s, which is about 72 frames per second (FPS). On a CPU (Intel Core i7-5930K), inference takes on average 0.1244 s, which is about 8 FPS. In both cases, the input images have dimensions 256 × 256 × 3. We find both of these measures acceptable for underwater tasks.
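Putting the pieces together, a hedged sketch of the alternating training loop with the stated hyperparameters (batch size 32, Adam at 1e−4, five critic updates per generator update); `build_generator`, `build_discriminator`, and `ugan_losses` refer to the earlier sketches, and `dataset` is an assumed iterator over (IC, ID) batches:

```python
import tensorflow as tf

def train(dataset, epochs=100, n_critic=5):
    """Alternating WGAN-GP training: n_critic critic steps per generator step."""
    gen, disc = build_generator(), build_discriminator()
    g_opt, d_opt = tf.keras.optimizers.Adam(1e-4), tf.keras.optimizers.Adam(1e-4)
    for _ in range(epochs):
        for i_c, i_d in dataset:                           # batches of (clean, distorted) pairs
            for _ in range(n_critic):                      # critic (discriminator) updates
                with tf.GradientTape() as tape:
                    d_loss, _ = ugan_losses(gen, disc, i_c, i_d)
                d_grads = tape.gradient(d_loss, disc.trainable_variables)
                d_opt.apply_gradients(zip(d_grads, disc.trainable_variables))
            with tf.GradientTape() as tape:                # single generator update
                _, g_loss = ugan_losses(gen, disc, i_c, i_d)
            g_grads = tape.gradient(g_loss, gen.trainable_variables)
            g_opt.apply_gradients(zip(g_grads, gen.trainable_variables))
    return gen
```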

5 Conclusion

This paper presents an approach for enhancing underwater color images through the use of generative adversarial networks. We demonstrate the use of CycleGAN to generate a dataset of paired images to provide a training set for the proposed restoration model. Quantitative and qualitative results demonstrate the effectiveness of this method, and using a diver tracking algorithm on corrected images of scuba divers shows higher accuracy compared to the uncorrected image sequence.

Future work will focus on creating a larger and more diverse dataset from underwater objects, thus making the network more generalizable. Augmenting the data generated by CycleGAN with noise such as particle and lighting effects would improve the diversity of the dataset. We also intend to investigate a number of different quantitative performance metrics to evaluate our method.

Acknowledgment

The authors are grateful to Oliver Hennigh for his implementation of the Gradient Difference Loss measure.

² Code is available at https://github.com/cameronfabbri/Underwater-Color-Correction


Table 3: Mean and standard deviation metrics

Patch / Method    Original       CycleGAN       UGAN           UGAN-P
Red               0.43 ± 0.23    0.42 ± 0.22    0.44 ± 0.23    0.45 ± 0.25
Blue              0.51 ± 0.18    0.57 ± 0.17    0.57 ± 0.17    0.57 ± 0.17
Green             0.36 ± 0.17    0.36 ± 0.14    0.37 ± 0.17    0.36 ± 0.17
Orange            0.30 ± 0.15    0.25 ± 0.12    0.26 ± 0.13    0.27 ± 0.14

Method       Correct detection    Wrong detection    Missed detection    Total # of frames
Real         42                   14                 444                 500
Generated    147                  24                 329                 500

Figure 6: Performance of the MDPM tracker [13] on both real (top row) and generated (second row) images; the table compares the detection performance for both sets of images over a sequence of 500 frames.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.

[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[4] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[5] Brian Bingham, Brendan Foley, Hanumant Singh, Richard Camilli, Katerina Delaporta, Ryan Eustice, Angelos Mallios, David Mindell, Christopher Roman, and Dimitris Sakellariou. Robotic tools for deep water archaeology: Surveying an ancient shipwreck with an autonomous underwater vehicle. Journal of Field Robotics, 27(6):702–717, 2010.


[6] John Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):679–698, 1986.

[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), 2009 IEEE Conference on, pages 248–255. IEEE, 2009.

[8] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[10] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.

[11] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Let there be color!: Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics (TOG), 35(4):110, 2016.

[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 2015. PMLR.

[13] Md Jahidul Islam and Junaed Sattar. Mixed-domain biological motion tracking for underwater human-robot interaction. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 4457–4464. IEEE, 2017.

[14] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.

[15] Anne Jordt. Underwater 3D reconstruction based on physical models for refraction and underwater light propagation. Department of Computer Science, Univ. Kiel, 2014.

[16] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[18] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.

[19] Chuan Li and Michael Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In European Conference on Computer Vision, pages 702–716. Springer, 2016.

[20] Jie Li, Katherine A. Skinner, Ryan M. Eustice, and Matthew Johnson-Roberson. WaterGAN: Unsupervised generative network to enable real-time color correction of monocular underwater images. arXiv preprint arXiv:1702.07392, 2017.

[21] Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. arXiv preprint arXiv:1611.04076, 2016.

[22] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.


[23] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.

[24] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.

[25] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[26] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[27] Florian Shkurti, Wei-Di Chang, Peter Henderson, Md Jahidul Islam, Juan Camilo Gamboa Higuera, Jimmy Li, Travis Manderson, Anqi Xu, Gregory Dudek, and Junaed Sattar. Underwater multi-robot convoying using visual tracking by detection. In 2017 IEEE International Conference on Intelligent Robots and Systems. IEEE, 2017.

[28] Florian Shkurti, Anqi Xu, Malika Meghjani, Juan Camilo Gamboa Higuera, Yogesh Girdhar, Philippe Giguère, Bir Bikram Dey, Jimmy Li, Arnold Kalmbach, Chris Prahacs, et al. Multi-domain monitoring of marine environments using a heterogeneous robot team. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 1747–1753. IEEE, 2012.

[29] Luz Abril Torres-Mendez and Gregory Dudek. Color correction of underwater images for aquatic robot inspection. In EMMCVPR, pages 60–73. Springer, 2005.

[30] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.

[31] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

[32] Louis Whitcomb, Dana R. Yoerger, Hanumant Singh, and Jonathan Howland. Advances in underwater robot vehicles for deep ocean exploration: Navigation, control, and survey operations. In Robotics Research, pages 439–448. Springer, 2000.

[33] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.

[34] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.

[35] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
