
Perceptually Motivated Method for Image Inpainting Comparison

Ivan Molodetskikh, Mikhail Erofeev, Dmitry Vatolin
Lomonosov Moscow State University

Moscow, Russia
[email protected]

Abstract

The field of automatic image inpainting has progressed rapidly in recent years, but no one has yet proposed a standard method of evaluating algorithms. This absence is due to the problem's challenging nature: image-inpainting algorithms strive for realism in the resulting images, but realism is a subjective concept intrinsic to human perception. Existing objective image-quality metrics provide a poor approximation of what humans consider more or less realistic.

To improve the situation and to better organize both prior and future research in this field, we conducted a subjective comparison of nine state-of-the-art inpainting algorithms and propose objective quality metrics that exhibit high correlation with the results of our comparison.

1. Introduction

Image inpainting, or hole filling, is the task of filling in missing parts of an image. Given an incomplete image and a hole mask, an inpainting algorithm must generate the missing parts so that the result looks realistic. Inpainting is a widely researched topic. Many classical algorithms have been proposed [25, 4], but over the past few years most research has focused on using deep neural networks to solve this problem [18, 22, 16, 11, 31, 15, 30].

Naturally, because of the many avenues of research in this field, the need to evaluate algorithms emerges. The specifics of image inpainting mean this problem has no simple solution. The goal of an inpainting algorithm is to make the final image as realistic as possible, but image realism is a concept intrinsic to humans. Therefore, the most accurate way to evaluate an algorithm's performance is a subjective experiment in which many participants compare the outcomes of different algorithms and choose the one they consider the most realistic.

Unfortunately, conducting a subjective experiment involves considerable time and resources, so many authors resort to evaluating their proposed methods using traditional objective image-similarity metrics such as PSNR, SSIM and mean l2 loss relative to the ground-truth image. This strategy, however, is inadequate. One reason is that evaluation by measuring similarity to the ground-truth image assumes that only a single, best inpainting result exists: a false assumption in most cases. As a trivial example, consider that inpainting is frequently used to erase unwanted objects from photographs. Therefore, at least two realistic outcomes are possible: the original photograph with the object and the desired photograph without the object. Furthermore, we show that the popular SSIM image-quality estimation metric correlates poorly with human responses and that the inpainting result can seem more realistic to a human than the original image does, limiting the applicability of full-reference metrics to this task.

Moreover, owing to the lack of a clear and objective way to evaluate inpainting algorithms, different authors present results of different metrics on different data sets when considering their proposed algorithms. Comparing these algorithms is therefore even harder.

Thus, a perceptually motivated objective metric for inpainting-quality assessment is desirable. The objective metric should approximate the notion of image realism and yield results similar to those of a subjective study when comparing outputs from different algorithms.

We conducted a subjective evaluation of nine state-of-the-art classical and deep-learning-based approaches to image inpainting. Using the results, we examine different methods of objective inpainting-quality evaluation, including both full-reference methods (taking both the resulting image and the ground-truth image as input) and no-reference methods (taking just the resulting image as input).

It is important to note that we are not proposing objective quality-evaluation models trained on a database of subjective scores, as obtaining sufficiently large and diverse databases of this nature is impractical. We do, however, evaluate the proposed models by comparing their correlations with human responses.

arXiv:1907.06296v1 [cs.CV] 14 Jul 2019


Figure 1. Images for the subjective inpainting comparison. The black square in the center is the area to be inpainted.

2. Related work

Little work has been done on objective image-inpainting-quality evaluation or on inpainting detection in general. The somewhat related field of manipulated-image detection has seen moderate research, including both classical and deep-learning-based approaches. This field focuses on detecting altered image regions, usually involving a set of common manipulations: copy-move (copying an image fragment and pasting it elsewhere in the same image), splicing (pasting a fragment from another image), fragment removal (deleting an image fragment and then performing either a copy-move or inpainting to fill in the missing area), various effects such as Gaussian blur and median filtering, and recompression (usually indicating that the image was handled in a photo editor). Among these manipulations, the most interesting for this work is fragment removal with inpainting.

The classical approaches to image-manipulation detection include [19, 12], and the deep-learning-based approaches include [1, 33, 20, 32]. These algorithms aim to locate the manipulated image regions by outputting a mask or a set of bounding boxes enclosing suspicious regions. Unfortunately, they are not directly applicable to inpainting-quality estimation because they have a different goal: whereas an objective quality-estimation metric should strive to accurately compare realistically inpainted images similar to the originals, a forgery-detection algorithm should strive to accurately tell one apart from the other.

3. Inpainting subjective evaluation

The gold standard for evaluating image-inpainting algorithms is human perception, since each algorithm strives to produce images that look the most realistic to humans. Thus, to obtain a baseline for creating an objective inpainting-quality metric, we conducted a subjective evaluation of multiple state-of-the-art algorithms, including both classical and deep-learning-based ones. To assess the overall quality and applicability of the current approaches and to see how they compare with manual photo editing, we also included human-produced images. We asked several professional photo editors to fill in missing regions of the test photos just like an automatic algorithm would.

3.1. Test data set

Since human photo editors were to perform inpainting, our data set excluded publicly available images: we wanted to ensure that finding the original photos online and achieving perfect results would be impossible. We therefore created our own private set of test images by taking photographs of various outdoor scenes, which are the most likely target for inpainting.

Each test image was 512 × 512 pixels with a square hole in the middle measuring 180 × 180 pixels. We chose a square instead of a free-form shape because one algorithm in our comparison [29] lacks the ability to fill in free-form holes. The data set comprised 33 images in total. Figure 1 shows examples.

3.2. Inpainting methods

We evaluated nine classical and deep-learning-based approaches:

• Exemplar-based inpainting [4] is a well-known classical algorithm that finds the image patches that are most similar to the region being filled and copies them into that region in a particular order to correctly preserve image structures.

• Statistics of patch offsets [6] is a more recent classical algorithm that employs distributions of statistics for the relative positions of similar image patches.

• Content-aware fill is a tool in the popular photo-editing software Adobe Photoshop. We picked it because, being part of a popular image editor, it is highly likely to be used for inpainting. Thus, comparing it with other state-of-the-art approaches is valuable. We tested the content-aware fill in Adobe Photoshop CS5; its implementation preceded deep learning's explosion in popularity.


Figure 2. Three images from our test set (Urban Flowers, Splashing Sea, Forest Trail), inpainted by three human artists (Artist #1, Artist #2, Artist #3).

• Deep image prior [26] is an unconventional deep-learning-based method. It relies on deep generator networks converging to a realistic image from a random initial state rather than by learning realistic image priors from a large training set.

• Globally and locally consistent image completion [9] is a deep-learning-based approach that uses global and local discriminators, thereby improving both the coherence of the resulting image as a whole and the local consistencies of image patches.

• High-resolution image inpainting [29], another deep-learning-based method, employs multiscale neural patch synthesis to preserve both contextual structures and high-frequency details in high-resolution images.

• Shift-Net [28] is a U-Net architecture that implements a special shift-connection layer to improve the inpainting quality. The shift-connection layer offsets encoder features of known regions to estimate the missing ones.

• Generative image inpainting with contextual attention [31] uses a two-part convolutional neural network to first predict the structural information and then restore the fine details. It also has a special contextual-attention layer that finds the most similar patches from the known image areas to aid generation of fine details.

• Image inpainting for irregular holes using partial convolutions [15] follows the idea that regular convolutional layers in deep inpainting networks are suboptimal because they unconditionally use both valid (known) and invalid (unknown) pixels from the input image. This method implements a partial-convolution layer that masks out the unknown pixels.

Figure 3. Subjective-comparison results across three images inpainted by human artists.

Additionally, we hired three professional photo-restoration and photo-retouching artists to manually inpaint three randomly selected images from our test data set. To encourage them to produce the best possible results, we offered a 50% honorarium bonus for the ones that outranked the competitors. Although we imposed no strict time limit, all three artists completed their work within 90 minutes. Figure 2 shows their results.

3.3. Test method

The subjective evaluation took place through the http://subjectify.us platform. Human observers were shown pairs of images and asked to pick from each pair the one they found most realistic. Each image pair consisted of two different inpainting results for the same picture (the set also contained the original image). Also included were two verification questions that asked the participants to compare the result of exemplar-based inpainting [4] with the ground-truth image. The final results excluded all responses from participants who failed to correctly answer one or both verification questions. In total, 6,945 valid pairwise judgements were collected from 215 participants.

The judgements were then used to fit a Bradley-Terry model [2]. The resulting subjective scores maximize likelihood given the pairwise judgements.
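For illustration, such a fit can be sketched with the classical minorization-maximization updates for the Bradley-Terry model; the solver below and the small win-count matrix are our own illustrative assumptions, not necessarily the exact fitting procedure used in the study.

```python
import numpy as np

def bradley_terry(wins, n_iter=500, tol=1e-9):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] is the number of times item i was preferred over item j.
    Returns strengths p with P(i beats j) = p[i] / (p[i] + p[j]),
    using the minorization-maximization updates of Hunter (2004).
    """
    n = wins.shape[0]
    comparisons = wins + wins.T        # number of times i and j were compared
    total_wins = wins.sum(axis=1)      # total wins of each item
    p = np.ones(n)
    for _ in range(n_iter):
        denom = comparisons / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p_new = total_wins / denom.sum(axis=1)
        p_new /= p_new.sum()           # fix the arbitrary scale
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p

# Hypothetical example: three inpainting results, item 0 preferred most often.
wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]], dtype=float)
subjective_scores = np.log(bradley_terry(wins))  # log-strengths as scores
```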

3.4. Results

Algorithms vs. artists. Figure 3 shows the results for the three images inpainted by the human artists. The artists outperformed all state-of-the-art automatic algorithms, and out of the deep-learning-based methods, only generative image inpainting [31] trained on the Places2 data set outperformed the classical inpainting methods.

Figure 4. Results of the subjective study comparing images inpainted by human artists with images inpainted by conventional and deep-learning-based methods, for the Urban Flowers, Splashing Sea, and Forest Trail images.

Figure 5. Comparison of inpainting results from Artist #1 and statistics of patch offsets [6] (preferred in the subjective comparison).

The individual results for each of these three images appear in Figure 4. In only one case did an algorithm beat an artist: statistics of patch offsets [6] scored higher than one artist on the "Urban Flowers" photo. Figure 5 shows the respective inpainting results. Additionally, for the "Splashing Sea" photo, two artists actually "outperformed" the original image: their results turned out to be more realistic. This outcome illustrates that the ideal quality-estimation metric should be a no-reference one.

Algorithms vs. algorithms. We additionally performed a subjective comparison of various inpainting algorithms on the entire 33-image test set, collecting 3,969 valid pairwise judgements from 147 participants. The overall results appear in Figure 6. They confirm our observations from the first comparison: among the deep-learning-based approaches we evaluated, generative image inpainting [31] seems to be the only one that can outperform the classical methods.

Figure 6. Subjective-comparison results for 33 images inpainted using automatic methods.

Conclusion. The subjective evaluation allows us to draw the following conclusions:

• Manual inpainting by human artists remains the only viable way to achieve quality close to that of the original images.

• Automatic algorithms can approach manual-inpainting results only for certain images.

• Classical inpainting methods continue to maintain a strong position, even though a deep-learning-based algorithm took the lead.

4. Objective inpainting-quality estimation

Using the results we obtained from the subjective comparison, we evaluated several approaches to objective inpainting-quality estimation. Using these objective metrics, we estimated the inpainting quality of the images from our test set and then compared the estimates with the subjective results. For each of the 33 images, we applied every tested metric to every inpainting result (as well as to the ground-truth image) and computed the Pearson and Spearman correlation coefficients with the subjective result. The final value was an average of the correlations over all 33 test images.
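As an illustrative sketch of this protocol (the variable layout is our assumption), the per-method metric values for each test image are correlated with the per-method subjective scores, and the per-image correlations are then averaged:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def mean_correlations(metric_scores, subjective_scores):
    """Average per-image correlation between a metric and subjective scores.

    metric_scores[i] and subjective_scores[i] are equal-length arrays holding,
    for test image i, one value per inpainting method (plus the ground truth).
    """
    pearsons, spearmans = [], []
    for metric, subjective in zip(metric_scores, subjective_scores):
        pearsons.append(pearsonr(metric, subjective)[0])
        spearmans.append(spearmanr(metric, subjective)[0])
    return float(np.mean(pearsons)), float(np.mean(spearmans))
```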

Below is an overview of each metric we evaluated.

4.1. Full-reference metrics

Full-reference metrics take both the ground-truth (original) image and the inpainting result as an input.

To construct a full-reference metric that encourages semantic similarity rather than per-pixel similarity, as in [10], we evaluated metrics that compute the difference between the ground-truth and inpainted-image feature maps produced by an image-classification neural network. We selected five of the most popular deep architectures: VGG [21] (16- and 19-layer-deep variants), ResNet-V1-50 [7], Inception-V3 [24], Inception-ResNet-V2 [23] and Xception [3]. We used the models pretrained on the ImageNet [5] data set. The mean squared error between the feature maps was the metric result.
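A minimal sketch of such a feature-space metric, assuming Keras, an ImageNet-pretrained VGG-16 and, purely for illustration, its block5_conv3 layer:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

# Feature extractor built from an ImageNet-pretrained VGG-16;
# the layer name and input size are illustrative choices.
base = VGG16(weights="imagenet", include_top=False, input_shape=(512, 512, 3))
extractor = tf.keras.Model(inputs=base.input,
                           outputs=base.get_layer("block5_conv3").output)

def vgg_feature_mse(ground_truth, inpainted):
    """MSE between deep feature maps of the reference and inpainted images.

    Both inputs are 512 x 512 x 3 uint8 arrays.
    """
    batch = preprocess_input(np.stack([ground_truth, inpainted]).astype("float32"))
    features = extractor.predict(batch, verbose=0)
    return float(np.mean((features[0] - features[1]) ** 2))
```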

For VGG we tested the output from several deep layers (convolutional and pooling layers from the last block) and found that the deepest layer has the highest correlation with the subjective-study results. For each of the other architectures, therefore, we tested only the deepest layer.

We additionally included the structural-similarity (SSIM) index [27] as a full-reference metric. SSIM is widely used to compare image quality, but it falls short when applied to inpainting-quality estimation.
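For reference, SSIM can be computed with an off-the-shelf implementation such as scikit-image (illustrative snippet):

```python
from skimage.metrics import structural_similarity

def ssim_score(ground_truth, inpainted):
    """SSIM between two same-size RGB uint8 images.

    channel_axis marks the color dimension; older scikit-image releases
    use multichannel=True instead.
    """
    return structural_similarity(ground_truth, inpainted, channel_axis=-1)
```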

The best correlations among the full-reference metrics emerged when using the last layer of the VGG-16 model.

4.2. No-reference metrics

No-reference metrics take only the target (possibly inpainted) image as an input, so they apply to a much wider range of problems.

We used a deep-learning approach to constructing no-reference metrics. We picked several popular image-classification neural-network architectures and trained them to differentiate "clean" (realistic, original) images without any inpainting from partially inpainted images. The architectures included VGG [21] (16- and 19-layer deep), ResNet-V1-50 [7], ResNet-V2-50 [8], Inception-V3 [24], Inception-V4 [23] and PNASNet-Large [14].

Data set. For training, we used clean and inpainted images based on the COCO [13] data set. To create the inpainted images, we cropped the input images to a square aspect ratio, resized them to 512 × 512 pixels and masked out a square of 180 × 180 pixels from the middle (the same procedure we used when creating our subjective-evaluation data set). We then inpainted the masked images using five inpainting algorithms [4, 9, 6, 28, 31] in eight total configurations. The total number of images in the data set appears in Table 1.
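A minimal sketch of this preparation step, assuming PIL and NumPy and a center crop (the exact crop placement is our assumption):

```python
import numpy as np
from PIL import Image

def make_masked_input(path, size=512, hole=180):
    """Crop an image to a square, resize it, and black out the central hole.

    Returns the masked image (H x W x 3 uint8) and the boolean hole mask,
    mirroring the preparation of our subjective-evaluation data set.
    """
    img = Image.open(path).convert("RGB")
    side = min(img.size)
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side)).resize((size, size))

    arr = np.array(img)
    mask = np.zeros((size, size), dtype=bool)
    start, end = (size - hole) // 2, (size + hole) // 2
    mask[start:end, start:end] = True
    arr[mask] = 0  # the masked square is what the inpainting algorithm fills
    return arr, mask
```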

Table 1. Total number of images in the training data set, by inpainting algorithm.

Inpainting algorithm                    Images
Exemplar-Based (patch size 9) [4]       16,777
Exemplar-Based (patch size 13) [4]      16,777
Globally and Locally Consistent [9]     13,870
Statistics of Patch Offsets [6]          7,955
Shift-Net [28]                           8,928
Generative Inpainting (Places2) [31]     8,928
Generative Inpainting (CelebA) [31]      8,928
Generative Inpainting (ImageNet) [31]    8,927
Total inpainted                         91,090
Total original (clean)                  81,520

Training. The network architectures take a square image as an input and output a score: a single number where 0 means the image contains inpainted regions and 1 means the image is "clean." The loss function was mean squared error. Some network architectures were additionally trained to output the predicted class using one-hot encoding (similar to binary classification); the loss function in this case was softmax cross-entropy.

Figure 7. Inpainting quality estimated by VGG-16 for one image at the peak Pearson correlation and after further training. The score distribution at the correlation peak is similar to a subjective-score distribution; after further training, however, the network starts to heavily underscore most inpainting algorithms.

The network architectures were identical to the ones used for image classification, with one difference: we altered the number of outputs from the last fully connected layer. This change allowed us to initialize the weights of all previous layers from the models pretrained on ImageNet, greatly improving the results compared with training from random initialization.
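A sketch of this adaptation in Keras, using InceptionV3 as a stand-in for any of the backbones listed above; the sigmoid output and optimizer settings are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras.applications import InceptionV3

# ImageNet-pretrained backbone with the classification head removed.
base = InceptionV3(weights="imagenet", include_top=False, pooling="avg",
                   input_shape=(512, 512, 3))

# Single-output head: target 1 for "clean" images, 0 for inpainted ones.
score = tf.keras.layers.Dense(1, activation="sigmoid")(base.output)
model = tf.keras.Model(inputs=base.input, outputs=score)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
```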

For some experiments we additionally tried using the RGB noise features described in [32] and the spectral weight normalization described in [17].

Figure 8. Mean Pearson and Spearman correlations between objective inpainting-quality metrics and subjective human comparisons (including ground-truth scores). The error bars show the standard deviations.

In addition to the typical validation on part of the data set, we also monitored the correlation of network predictions with the subjective scores collected in Section 3. We used the networks to estimate the inpainting quality of the 33-image test set, then computed correlations with the subjective results in the same way as in the final comparison. The training of each network was stopped once the correlation of the network predictions with the subjective scores peaked and started to decrease (possibly because the networks were overfitting to the inpainting results of the algorithms we used to create the training data set). Figure 7 compares the network scores for one test image at the correlation peak and after further training.
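A sketch of this stopping criterion as a Keras callback; the data layout and patience value are illustrative assumptions:

```python
import numpy as np
import tensorflow as tf
from scipy.stats import pearsonr

class CorrelationEarlyStopping(tf.keras.callbacks.Callback):
    """Stop training once correlation with subjective scores stops improving.

    test_images[i] holds the preprocessed inpainting results for test image i;
    subjective[i] holds the matching subjective scores.
    """
    def __init__(self, test_images, subjective, patience=3):
        super().__init__()
        self.test_images = test_images
        self.subjective = subjective
        self.patience = patience
        self.best = -np.inf
        self.wait = 0

    def on_epoch_end(self, epoch, logs=None):
        correlations = []
        for images, scores in zip(self.test_images, self.subjective):
            predictions = self.model.predict(images, verbose=0).ravel()
            correlations.append(pearsonr(predictions, scores)[0])
        mean_corr = float(np.mean(correlations))
        if mean_corr > self.best:
            self.best, self.wait = mean_corr, 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.model.stop_training = True
```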

4.3. Results

We evaluated several objective quality-estimation approaches, both full-reference and no-reference, as well as both classical and deep-learning-based. Figure 8 shows the overall results. We additionally did a comparison that excluded ground-truth scores from the correlation, because the full-reference approaches always yield the highest score when comparing the ground-truth image with itself. The overall results for that comparison appear in Figure 9.

As Figures 8 and 9 show, the no-reference methods achieve slightly weaker correlation with the subjective-evaluation responses than do the best full-reference methods. But the results of most no-reference methods are still considerably better than those of the full-reference SSIM. The best correlations among the no-reference methods came from the Inception-V4 model trained to output one score, with spectral weight normalization.

It is important to emphasize that we did not train the networks to maximize correlation with human responses or with the subjective-evaluation data set in general. We trained them to distinguish "clean" images from inpainted images, yet their output showed good correlation with human responses.

5. Conclusion

We have proposed a number of perceptually motivated no-reference and full-reference objective metrics for evaluating image-inpainting quality. We evaluated the metrics by correlating them with human responses from a subjective comparison of state-of-the-art image-inpainting algorithms.

The results of the subjective comparison indicate that although a deep-learning-based approach to image inpainting holds the lead, classical algorithms remain among the best in the field.

The highest mean Pearson correlation with the human responses from the subjective study was achieved by the block5_conv3 layer of the VGG-16 model trained on ImageNet (full reference) and by the Inception-V4 model with spectral normalization (no reference).

We achieved good correlation with the subjective-comparison results without specifically training our proposed objective quality-evaluation metrics on the subjective-comparison response data set.


Figure 9. Mean Pearson and Spearman correlations between objective inpainting-quality metrics and subjective human comparisons (excluding ground-truth scores). The error bars show the standard deviations.

References

[1] J. H. Bappy, A. K. Roy-Chowdhury, J. Bunk, L. Nataraj, and B. S. Manjunath. Exploiting spatial structure for localizing manipulated image regions. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.

[2] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.

[3] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[4] A. Criminisi, P. Perez, and K. Toyama. Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing, 13(9):1200–1212, 2004.

[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[6] K. He and J. Sun. Statistics of patch offsets for image completion. In European Conference on Computer Vision, pages 16–29. Springer, 2012.

[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[8] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.

[9] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (ToG), 36(4):107, 2017.

[10] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.

[11] H. Li, G. Li, L. Lin, H. Yu, and Y. Yu. Context-aware semantic inpainting. IEEE Transactions on Cybernetics, 2018.

[12] H. Li, W. Luo, X. Qiu, and J. Huang. Image forgery localization via integrating tampering possibility maps. IEEE Transactions on Information Forensics and Security, 12(5):1240–1252, 2017.

[13] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[14] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In The European Conference on Computer Vision (ECCV), September 2018.

[15] G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro. Image inpainting for irregular holes using partial convolutions. In The European Conference on Computer Vision (ECCV), September 2018.

[16] P. Liu, X. Qi, P. He, Y. Li, M. R. Lyu, and I. King. Semantically consistent image completion with fine-grained details. arXiv preprint arXiv:1711.09345, 2017.

[17] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

[18] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[19] C.-M. Pun, X.-C. Yuan, and X.-L. Bi. Image forgery detection using adaptive oversegmentation and feature point matching. IEEE Transactions on Information Forensics and Security, 10(8):1705–1716, 2015.

[20] R. Salloum, Y. Ren, and C.-C. J. Kuo. Image splicing localization using a multi-task fully convolutional network (MFCN). Journal of Visual Communication and Image Representation, 51:201–209, 2018.

[21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[22] Y. Song, C. Yang, Z. Lin, H. Li, Q. Huang, and C. J. Kuo. Image inpainting using multi-scale feature image translation. arXiv preprint arXiv:1711.08590, 2, 2017.

[23] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[24] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[25] A. Telea. An image inpainting technique based on the fast marching method. Journal of Graphics Tools, 9(1):23–34, 2004.

[26] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Deep image prior. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[27] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

[28] Z. Yan, X. Li, M. Li, W. Zuo, and S. Shan. Shift-Net: Image inpainting via deep feature rearrangement. In The European Conference on Computer Vision (ECCV), September 2018.

[29] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-resolution image inpainting using multi-scale neural patch synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[30] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Free-form image inpainting with gated convolution. arXiv preprint arXiv:1806.03589, 2018.

[31] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[32] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis. Learning rich features for image manipulation detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[33] X. Zhu, Y. Qian, X. Zhao, B. Sun, and Y. Sun. A deep learning approach to patch-based image inpainting forensics. Signal Processing: Image Communication, 67:90–99, 2018.