
AIM 2019 Challenge on Bokeh Effect Synthesis: Methods and Results

Andrey Ignatov    Jagruti Patel    Radu Timofte    Bolun Zheng    Xin Ye    Li Huang    Xiang Tian    Saikat Dutta
Kuldeep Purohit    Praveen Kandula    Maitreya Suin    A. N. Rajagopalan    Zhiwei Xiong    Jie Huang    Guanting Dong
Mingde Yao    Dong Liu    Wenjin Yang    Ming Hong    Wenying Lin    Yanyun Qu    Jae-Seok Choi    Woonsung Park
Munchurl Kim    Rui Liu    Xiangyu Mao    Chengxi Yang    Qiong Yan    Wenxiu Sun    Junkai Fang    Meimei Shang
Fei Gao    Sujoy Ghosh    Prasen Kumar Sharma    Dr. Arijit Sur

Abstract

This paper reviews the first AIM challenge on bokeh effect synthesis with a focus on the proposed solutions and results. The participating teams were solving a real-world image-to-image mapping problem, where the goal was to map standard narrow-aperture photos to the same photos captured with a shallow depth-of-field by the Canon 70D DSLR camera. In this task, the participants had to restore the bokeh effect from a single frame, without any additional data from other cameras or sensors. The target metric used in this challenge combined fidelity scores (PSNR and SSIM) with the solutions’ perceptual quality measured in a user study. The proposed solutions significantly improved the baseline results, defining the state of the art for practical bokeh effect simulation.

1. Introduction

The topic of image manipulation related to portable smartphone cameras has recently witnessed an increased interest from the vision and graphics communities, with a large number of works being published on different image quality enhancement aspects [7, 8, 9, 24, 1], photo segmentation and blurring [26, 22, 3, 2], style transfer [4, 12, 16], etc. One of the topics that has gained huge popularity over the past years is automatic bokeh effect synthesis, which can now be found in the majority of smartphone camera applications. The first works on portrait segmentation [10] appeared in 2014, followed by many subsequent papers that substantially improved the baseline results [22, 28].

A. Ignatov and R. Timofte ({andrey,radu.timofte}@vision.ee.ethz.ch, ETH Zurich) are the challenge organizers, J. Patel worked on the dataset collection, while the other authors participated in the challenge. Appendix A contains the authors’ teams and affiliations. AIM 2019 webpage: http://vision.ee.ethz.ch/aim19

A further development in this field was facilitated by the work [26] that described in detail the synthetic depth-of-field rendering method used by Google Pixel phones.

The AIM 2019 challenge on bokeh effect synthesis is a step forward in benchmarking example-based single-image bokeh effect rendering. It uses the large-scale ETH Zurich Bokeh dataset, consisting of photo pairs with shallow and wide depth-of-field captured with the Canon 70D DSLR camera, and takes into account both the quantitative and the qualitative visual results of the proposed solutions. In the next sections we describe the challenge and the corresponding dataset, present and discuss the results, and describe the proposed methods.

2. AIM 2019 Bokeh Effect Synthesis Challenge

The objectives of the AIM 2019 Challenge on bokeh effect synthesis are to gauge and push the state of the art in synthetic shallow depth-of-field rendering, to compare different approaches and solutions, and to promote realistic settings for the considered problem as defined by the ETH Zurich Bokeh dataset described below.

2.1. ETH Zurich Bokeh dataset

In order to tackle the problem of realistic bokeh effect rendering, a large-scale dataset containing more than 10K images was collected in the wild. Images with shallow and wide depth-of-field were taken by controlling the aperture size of the lens. In each photo pair, the first image was captured with a narrow aperture (f/16), which results in a normal sharp photo, whereas the second one was captured with the widest aperture (f/1.8), leading to a strong bokeh effect. The photos were taken during the daytime in a wide variety of places and in various illumination and weather conditions. The photos were captured in automatic mode, and we used default settings throughout the whole collection procedure. An example pair of collected images can be seen in Figure 1.


Figure 1. Example images from the collected ETH Zurich Bokeh dataset. The left image (Canon photo, aperture f/16) was taken with a narrow aperture and has a wide depth-of-field, while the right one (Canon photo, aperture f/1.8) was captured with the widest aperture, leading to a strong bokeh effect.

The captured images are not perfectly aligned, therefore they were first matched using SIFT keypoints and the RANSAC method. They were then cropped to their common (intersection) region and downscaled to a resolution of approximately 1024×1536 pixels. This procedure resulted in over 5 thousand image pairs. Around 200 image pairs were reserved for testing, while the remaining 4.8K photo pairs were used for training and validation.
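For illustration, a minimal sketch of this kind of SIFT + RANSAC alignment is given below, assuming OpenCV; the matching parameters, RANSAC threshold and the exact cropping logic used for the dataset are not specified in the paper, so the values here are placeholders only.

```python
import cv2
import numpy as np

def align_pair(wide_dof, shallow_dof):
    """Roughly align a narrow-aperture photo to its wide-aperture counterpart
    with SIFT keypoints + a RANSAC homography (a sketch, not the exact pipeline)."""
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(cv2.cvtColor(wide_dof, cv2.COLOR_BGR2GRAY), None)
    k2, d2 = sift.detectAndCompute(cv2.cvtColor(shallow_dof, cv2.COLOR_BGR2GRAY), None)

    # Match descriptors and keep the pairs passing Lowe's ratio test.
    matcher = cv2.BFMatcher()
    matches = [m for m, n in matcher.knnMatch(d1, d2, k=2) if m.distance < 0.75 * n.distance]

    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # Robust homography estimation with RANSAC, then warp onto the bokeh image grid.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = shallow_dof.shape[:2]
    aligned = cv2.warpPerspective(wide_dof, H, (w, h))
    return aligned  # cropping to the common region and downscaling would follow
```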

2.2. Tracks and Competitions

The challenge consists of the following phases:

i development: the participants get access to the data;

ii validation: the participants have the opportunity to validate their solutions on the server and compare the results on the validation leaderboard;

iii test: the participants submit their final results, models,and factsheets.

All solutions submitted by the participants were evaluated based on three measures:

• PSNR measuring fidelity score,

• SSIM, a proxy for perceptual score,

• MOS (mean opinion score) measured in a user study for explicit image quality assessment.

The AIM 2019 Bokeh Effect Synthesis challenge consists of two tracks. In the first, “Fidelity”, track, the target is to obtain an output image with the highest pixel fidelity to the ground truth, as measured by the PSNR and SSIM metrics. Since SSIM and PSNR scores do not reflect many aspects of the real quality of the resulting images, in the second, “Perceptual”, track we evaluate the solutions based on their Mean Opinion Scores (MOS). For this, we conducted a user study involving several hundred participants (using Amazon’s MTurk platform, https://www.mturk.com/) who evaluated the visual results of all proposed methods. The users were asked to rate the quality of each submitted solution (based on 10 full-resolution enhanced test images) by selecting one of five quality levels (0 - almost identical, 1 - mostly similar, 2 - similar, 3 - somewhat similar, 4 - mostly different) for each method’s result in comparison with the original Canon images exhibiting the bokeh effect. The expressed preferences were averaged per test image and then per method to obtain the final MOS.
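As a reference, this two-level averaging can be sketched as follows; it assumes the ratings are stored per method, per test image, per rater, and the actual MTurk aggregation details are not given in the paper.

```python
import numpy as np

# ratings[method][image] -> list of per-rater scores in {0, 1, 2, 3, 4}
def mean_opinion_score(ratings):
    mos = {}
    for method, per_image in ratings.items():
        # First average over raters for every test image, then over the images.
        image_means = [np.mean(scores) for scores in per_image.values()]
        mos[method] = float(np.mean(image_means))
    return mos
```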

3. Challenge Results

From the more than 80 registered participants, 9 teams entered the final phase and submitted results, codes / executables, and factsheets. Table 1 summarizes the final challenge results and reports the PSNR, SSIM and MOS scores for each submitted solution, as well as the self-reported runtimes and hardware / software configurations. The methods are briefly described in Section 4, and the team members and affiliations are listed in Appendix A.

3.1. Architectures and Main Ideas

All the proposed methods rely on end-to-end deep learning-based solutions. The best results are achieved by models that combine the ideas of multi-scale image processing, residual learning, and depth or saliency detection modules. For faster training and lower RAM consumption (allowing larger batch sizes), most of the teams performed image processing in a low-resolution space. The most popular loss functions are L1 and MSE, though they were often used in combination with adversarial (GAN), color, Sobel, SSIM and VGG-based losses. Almost all teams used the Adam optimizer [13] to train their deep learning models and PyTorch to implement and train the networks.
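To make the loss terminology concrete, the following is a minimal PyTorch sketch of one common form of the VGG-based (perceptual) loss mentioned above; the layer choice and the weighting in the usage note are illustrative assumptions, not details reported by the teams.

```python
import torch.nn as nn
from torchvision import models

class VGGLoss(nn.Module):
    """L1 distance between VGG-19 feature maps of prediction and target
    (a common 'perceptual' loss variant; the layer choice is illustrative only)."""
    def __init__(self, layer_idx=36):  # features[:36] ends at relu5_4 in torchvision's VGG-19
        super().__init__()
        vgg = models.vgg19(weights="DEFAULT").features[:layer_idx].eval()
        for p in vgg.parameters():
            p.requires_grad = False     # the loss network is kept frozen
        self.vgg = vgg
        self.criterion = nn.L1Loss()

    def forward(self, prediction, target):
        return self.criterion(self.vgg(prediction), self.vgg(target))

# Typical usage: total_loss = l1_loss(pred, gt) + 0.01 * VGGLoss()(pred, gt)
```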



Team              Author            Framework    Hardware, GPU          Runtime, s    Track 1: Fidelity       Track 2: Perceptual
                                                                                      PSNR↑      SSIM↑        PSNR↑    SSIM↑   MOS↓
islab-zju         q19911124         n.a.         GeForce GTX 2080 Ti    0.27          23.43 (2)  0.89         23.43    0.89    0.87 (1)
CVL-IITM          tensorcat         PyTorch      GeForce GTX 1080 Ti    0.8           22.90 (6)  0.87         22.14    0.86    0.88 (2)
IPCV IITM         kuldeeppurohit3   PyTorch      Nvidia TITAN X         2.5           23.17 (5)  0.88         23.17    0.88    0.93 (3)
VIDAR             SchoolYellow      PyTorch      GeForce GTX 1080 Ti    0.53          23.93 (1)  0.89         23.93    0.89    0.98 (4)
XMU-VIPLab        Yangwj            PyTorch      Nvidia TITAN           0.6           23.18 (4)  0.89         23.18    0.89    1.02 (5)
KAIST-VICLAB      JSChoi            TensorFlow   Nvidia TITAN Xp        0.097         23.37 (3)  0.86         23.37    0.86    1.08 (6)
SLYYM             rliu              PyTorch      Nvidia TITAN Xp        0.1           22.25 (7)  0.85         22.25    0.85    1.10 (7)
Bazinga           kmaeii            PyTorch      Nvidia TITAN Xp        0.09          22.15 (8)  0.84         22.15    0.84    1.48 (8)
aps vision@IITG   ghoshsujoy19      PyTorch      Nvidia DGX             270 (CPU)     20.88 (9)  0.77         20.88    0.77    1.53 (9)

Table 1. AIM 2019 bokeh effect synthesis challenge results and final rankings. The results are sorted by the MOS scores.

3.2. Performance

Quality. All the teams except CVL-IITM submitted the same solution to both challenge tracks. As expected, the perceptual ranking according to the MOS does not strongly correlate with fidelity measures such as PSNR and SSIM. In particular, the VIDAR team ranks first in terms of fidelity but only fourth in terms of perceptual quality. Interestingly, the islab-zju team comes second in fidelity but first in perceptual quality, slightly above the CVL-IITM team.

Runtime. The self-reported runtimes of the proposed solutions on standard Nvidia GPU cards vary from 0.1 s to more than 2.5 s per single image. The fastest solutions (around 0.1 s) are also among the worst performing in the perceptual ranking, while the top fidelity method of VIDAR requires 0.53 s and the top perceptual methods of islab-zju and CVL-IITM require 0.27 s and 0.8 s, respectively. Thus, the proposed solutions require further development to meet the requirements of real-time applications on current mobile devices such as smartphones.

3.3. Discussion

The AIM 2019 bokeh effect synthesis challenge promoted realistic settings for shallow depth-of-field rendering: instead of using synthetic datasets capturing only a few aspects of the real bokeh effect, the participants were asked to model a shallow depth-of-field with the ETH Zurich Bokeh dataset containing paired and aligned low- and high-aperture photos captured with a high-end Canon 70D DSLR camera. A diversity of proposed approaches demonstrated visual results with a significantly altered depth-of-field. We conclude that, through the proposed solutions, the challenge defined the state of the art for the practical bokeh synthesis task.

4. Challenge Methods and Teams

This section describes the solutions submitted by all teams participating in the final stage of the AIM 2019 bokeh effect synthesis challenge.

4.1. islab-zju

Team islab-zju proposed a multiscale predictive filter (MPF) CNN, illustrated in Fig. 2, for synthesizing the bokeh effect. The MPF contains 3 levels that process the input image at different scales. The model starts with a pixel shuffle [23] layer that reduces the input image resolution by a factor of two, followed by two 2×2 strided convolutional layers that start the other two scale levels. Each level contains three basic blocks: a gate fusion block (GFB), a constrained predictive filter block (CPFB) and an image reconstruction block (IRB). The GFB contains 5 densely connected convolution layers for extracting deep features and is designed to detect salient objects and fuse the features from salient and non-salient areas. The IRB consists of a 12-channel convolutional layer and a 2×2 pixel shuffle layer and is used to upsample the convolutional features produced by the GFB. The CPFB generates predictive convolutional filters that are merged with the outputs of the IRB using a locally connected 2D convolution. The model is trained on 128×128 image patches concatenated with their vertical and horizontal coordinate maps to indicate their location in the original full-resolution image.

The model was trained using multiscale supervised learning with a combination of L1 and L1 Sobel losses at each of the three scales. The parameters were optimized using the Adam method with a batch size of 16. The learning rate was initialized to 10e−4 and halved every 10 epochs.
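A minimal PyTorch sketch of an L1 Sobel loss of the kind mentioned above is given below; the exact kernel normalization and loss weighting used by islab-zju are not reported, so this is an illustration only.

```python
import torch
import torch.nn.functional as F

def sobel_l1_loss(pred, target):
    """L1 distance between Sobel gradient maps of prediction and target (sketch)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=pred.device)
    ky = kx.t()
    c = pred.shape[1]
    # Depthwise Sobel filtering: apply both kernels to every channel independently.
    weight = torch.stack([kx, ky]).unsqueeze(1).repeat(c, 1, 1, 1)   # (2*c, 1, 3, 3)
    grads_p = F.conv2d(pred, weight, padding=1, groups=c)
    grads_t = F.conv2d(target, weight, padding=1, groups=c)
    return F.l1_loss(grads_p, grads_t)

# e.g. total_loss = F.l1_loss(pred, gt) + sobel_l1_loss(pred, gt)
```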

4.2. CVL-IITM

The solution of the CVL-IITM team (Fig. 3) is based on the depth estimation network (MegaDepth) proposed in [15]. The outputs produced by its last spatial softmax layer are used to compute a weighted sum of the original image and images smoothed by different Gaussian filters. The final output of the model, I_bokeh, is given by:

I_bokeh = W1 ⊙ I_orig + W2 ⊙ G(I_orig, k1) + W3 ⊙ G(I_orig, k2) + W4 ⊙ G(I_orig, k3),

where I_orig is the original input image, ⊙ denotes pixel-wise multiplication, G(I, k) is image I smoothed by a k×k Gaussian filter, and k1, k2, k3 are equal to 25, 45 and 75, respectively. The network was trained to minimize the L1 loss on images resized to 384×512 px. In the perceptual track, the model was additionally fine-tuned on 768×1024 px images, first with the L1 loss and then with the SSIM loss.
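A sketch of this blending step is shown below, assuming the four weight maps W1–W4 come from a softmax over the network's last layer; the kernel sizes are taken from the text, while the Gaussian sigmas (torchvision defaults) and the exact smoothing implementation are assumptions.

```python
import torch
from torchvision.transforms.functional import gaussian_blur

def blend_bokeh(image, logits):
    """image: (B, 3, H, W); logits: (B, 4, H, W) raw outputs of the depth network.
    Returns the depth-aware blend of the sharp image and three blurred copies."""
    weights = torch.softmax(logits, dim=1)            # spatial softmax over the 4 maps
    blurred = [image] + [gaussian_blur(image, kernel_size=k) for k in (25, 45, 75)]
    return sum(w.unsqueeze(1) * b for w, b in zip(weights.unbind(dim=1), blurred))
```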

Figure 2. Multiscale predictive filter (MPF) model proposed by the islab-zju team and the architectures of the gate fusion block (GFB) and the constrained predictive filter block (CPFB).

Figure 3. Depth estimation network used by the CVL-IITM team.

4.3. IPCV IITM

The authors proposed a depth-guided dynamic filtering dense network for rendering shallow depth-of-field (Fig. 4). At the onset, the network uses a space-to-depth module that divides each input channel into a number of blocks concatenated along the channel dimension. The output of this layer is concatenated with the outputs of pre-trained depth estimation [15] and salient object segmentation [5] networks to achieve more accurate rendering results. The resulting feature maps are passed to a U-Net [21] based encoder consisting of densely connected modules. The first dense block contains 12 densely-connected layers, the second block 16, and the third one 24. The weights of each block are initialized using the DenseNet-121 network [6] trained on the ImageNet dataset. The decoder has two different branches whose outputs are summed to produce the final result. The first branch has a U-Net architecture with skip connections and also consists of densely connected blocks. Its output is enhanced through multi-scale context aggregation using pooling and upsampling at 4 scales. The second branch uses the idea of dynamic filtering [11] and generates dynamic blurring filters conditioned on the encoded feature map. These filters are produced locally and on-the-fly depending on the input, and the parameters of the filter-generating network are updated during training. We refer to [19] for more details.
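The dynamic filtering idea can be illustrated with the per-pixel filtering sketch below: a small network predicts a k×k kernel for every spatial location, and the kernels are applied to the feature map via an unfold operation. The shapes and the softmax normalization are assumptions made for illustration; see [11, 19] for the actual formulation.

```python
import torch
import torch.nn.functional as F

def apply_dynamic_filters(features, kernels, k=5):
    """features: (B, C, H, W); kernels: (B, k*k, H, W) predicted per-pixel filters.
    Applies a separate k x k blur kernel at every spatial location (sketch)."""
    b, c, h, w = features.shape
    kernels = torch.softmax(kernels, dim=1)                        # normalize each local kernel
    patches = F.unfold(features, kernel_size=k, padding=k // 2)    # (B, C*k*k, H*W)
    patches = patches.view(b, c, k * k, h, w)
    out = (patches * kernels.unsqueeze(1)).sum(dim=2)              # weighted sum over the window
    return out                                                     # (B, C, H, W)
```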

Figure 4. Depth-guided Dynamic Filtering Dense Network proposed by the IPCV IITM team.

The entire model is trained on 640×640 image patches using the L1 loss and is fine-tuned at the end with the MSE loss. The initial learning rate is set to 10e−4 with a decay cycle of 200 epochs, and the Adam optimizer is used for training the network.

4.4. VIDAR


Figure 5. Multiscale connected residual attention UNet with fusion modules proposed by the VIDAR team.

Team VIDAR used an ensemble of five U-Net based models (Fig. 5) with the residual attention mechanism described in [29]. The model consists of multiple atrous spatial pyramid pooling blocks, fusion modules, channel attention blocks and dual residual blocks. For the loss function, a combination of the MSE, SSIM, VGG and gradient losses operating at different scales is used.

The model was trained using the Adam optimizer with β1 = 0.9, β2 = 0.999 and a batch size of 1. The initial learning rate was set to 1e−3, and a polynomial decay policy was used. At test time, the results of the five models were combined to achieve the highest performance.
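The test-time combination can be as simple as averaging the outputs of the separately trained models, as in the sketch below; whether VIDAR averaged the predictions or used a different fusion rule is not stated in the report, so this is an assumption.

```python
import torch

@torch.no_grad()
def ensemble_predict(models, image):
    """Average the bokeh predictions of several trained models (sketch)."""
    outputs = [m.eval()(image) for m in models]
    return torch.stack(outputs, dim=0).mean(dim=0)
```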

4.5. XMU-VIPLab

Figure 6. The overall model architecture (top) and the structure of the bokehNet (middle and bottom) proposed by XMU-VIPLab.

XMU-VIPLab used the two bokehNets shown in Fig. 6 to extract global and local features from the input images. Each bokehNet consists of several convolutional, residual, transposed convolutional and memory blocks. The obtained features are then concatenated and passed to a selective kernel network described in [14]. A two-stage training procedure was used: in the first phase the bokehNets were trained to perform different feature extraction tasks, and in the second phase the model was trained as a whole.
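For reference, a compact sketch of selective-kernel style fusion in the spirit of [14] is given below: channel-wise attention weights, computed from the sum of two branches, softly select between them. How exactly XMU-VIPLab wired the two bokehNet feature streams into this block is not detailed in the report, and the layer sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SKFusion(nn.Module):
    """Selective-kernel style fusion of two feature branches (sketch in the spirit of [14])."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        hidden = max(channels // reduction, 8)
        self.squeeze = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True))
        self.attend = nn.Linear(hidden, channels * 2)   # one score per channel per branch

    def forward(self, global_feat, local_feat):
        b, c, _, _ = global_feat.shape
        s = (global_feat + local_feat).mean(dim=(2, 3))   # global average pooling
        a = self.attend(self.squeeze(s)).view(b, 2, c)
        a = torch.softmax(a, dim=1)                       # branches compete per channel
        w_g, w_l = a[:, 0].view(b, c, 1, 1), a[:, 1].view(b, c, 1, 1)
        return w_g * global_feat + w_l * local_feat
```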

4.6. KAIST-VICLAB

Figure 7. A low-complexity residual autoencoder network with a non-local block proposed by the KAIST-VICLAB team.

KAIST-VICLAB proposed a low-complexity residual autoencoder network with a non-local block for bokeh effect generation (Fig. 7), inspired by [25]. First, the input image is downscaled by a factor of 4 using a space2depth (S2D) operation, followed by 8 residual blocks working in the low-resolution feature space. The spatial size is further reduced in four subsequent blocks, each consisting of two residual blocks with one space2batch operation in between. Additionally, one non-local block [27] (without a time dimension) is added to make use of all spatial information. The decoder consists of 4 blocks, each with two residual blocks and one batch2space operation in between. The feature maps are then upscaled by a factor of 4 using one convolutional layer with a depth2space operation, and the resulting output is concatenated with the input image and passed to the final convolutional layer.
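The space2depth / depth2space rearrangements used in this encoder-decoder correspond to PyTorch's pixel (un)shuffle operations; the small sketch below illustrates only the downscaling head, with the residual, space2batch and non-local blocks omitted and the channel width chosen for illustration.

```python
import torch.nn as nn

# space2depth with factor 4: trade spatial resolution for channels, and back again.
space_to_depth = nn.PixelUnshuffle(downscale_factor=4)   # (B, 3, H, W) -> (B, 48, H/4, W/4)
depth_to_space = nn.PixelShuffle(upscale_factor=4)       # (B, 48, H/4, W/4) -> (B, 3, H, W)

# Illustrative head of the encoder: all further processing runs at 1/4 resolution.
encoder_head = nn.Sequential(
    space_to_depth,
    nn.Conv2d(48, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)
```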

The network was trained on 1024×1024 px images using the Adam optimizer with L1 regularization and an initial learning rate of 3e−4. The model was trained for 2e5 iterations with a mini-batch size of 1, and the learning rate was decreased by a factor of 10 after 1e5 iterations.

4.7. SLYYM

The SLYYM team proposed a masked bokeh GAN model based on the SPADE architecture [17]. To enhance the performance of the model, the authors additionally used a saliency detection model [5] to obtain coarse saliency masks that are passed to a conditional SPADE generator together with the input image. This generator is trained to produce a global bokeh image I_SPADE and a refined saliency mask m that are used to obtain the final bokeh image I_bokeh as:

I_bokeh = I_orig ⊙ m + I_SPADE ⊙ (1 − m),

where ⊙ denotes pixel-wise multiplication. In addition to the GAN loss used in SPADE, the authors also added a mask matching loss to keep the generated mask well refined. The generator and discriminator were trained using the standard alternating GAN scheme for 350 epochs.
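The standard alternating GAN update referred to above can be sketched as follows. The specific SPADE losses, the mask-matching term and their weights are not detailed in the report, so the `losses.d_adv`, `losses.g_adv` and `losses.mask_match` calls below are hypothetical placeholders.

```python
import torch

def train_step(generator, discriminator, g_opt, d_opt, image, saliency, target, losses):
    """One alternating GAN update: discriminator step, then generator step (sketch)."""
    fake, mask = generator(image, saliency)          # global bokeh image and refined mask

    # --- Discriminator update on real vs. detached fake samples ---
    d_opt.zero_grad()
    d_loss = losses.d_adv(discriminator(target), real=True) + \
             losses.d_adv(discriminator(fake.detach()), real=False)
    d_loss.backward()
    d_opt.step()

    # --- Generator update: adversarial term plus the mask-matching term ---
    g_opt.zero_grad()
    g_loss = losses.g_adv(discriminator(fake)) + losses.mask_match(mask, saliency)
    g_loss.backward()
    g_opt.step()

    # Final bokeh image composited from the sharp input and the generated output.
    return image * mask + fake * (1 - mask)
```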

4.8. Bazinga

Figure 8. Bazinga’s network architecture.

The authors used a pre-trained BASNet [20] model to obtain a saliency map that is passed to the deep residual network illustrated in Fig. 8. The model downscales the input feature maps and further processes them in multiple residual-in-residual dense blocks. The output of the model is upscaled, multiplied by the original image and processed by the last two convolutional layers to obtain the final result. The network was trained with a combination of L1, SSIM and color losses for 50 epochs with an initial learning rate of 1e−4.
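The report does not specify which color loss Bazinga used; one common variant in the image enhancement literature (e.g., [7]) compares Gaussian-blurred versions of the two images so that only overall color, not texture, is penalized. A hedged sketch of that variant (kernel size and sigma are illustrative assumptions):

```python
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def color_loss(pred, target, kernel_size=21, sigma=3.0):
    """MSE between heavily blurred prediction and target: compares color content
    while ignoring fine texture (one possible definition, not Bazinga's confirmed one)."""
    return F.mse_loss(gaussian_blur(pred, kernel_size, [sigma, sigma]),
                      gaussian_blur(target, kernel_size, [sigma, sigma]))
```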

4.9. aps vision@IITG

Team aps vision@IITG first used a Canny edge detector [18] to locate the objects of interest. The obtained gradient maps were concatenated with each of the three images obtained by applying Gaussian blur (with kernels 1, 5 and 14) to the input image, and were then processed in three parallel blocks with four convolutional layers each. The resulting feature maps were concatenated and passed to a residual network consisting of 18 layers, whose output was added to the original image. The authors trained the model on 256×256 px images with a combination of the MSE and Sobel losses for 32 epochs with a learning rate of 2e−5 and a batch size of 24.
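A sketch of this preprocessing stage is given below, assuming OpenCV; the Canny thresholds are placeholders, and the reported blur "kernels 1, 5 and 14" are interpreted here as blur strengths (sigmas), which is an assumption rather than the team's confirmed setting.

```python
import cv2
import numpy as np

def build_inputs(image_bgr):
    """Concatenate a Canny gradient map with three Gaussian-blurred copies of the
    input, giving one 4-channel array per blur level / parallel branch (sketch)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)[..., None].astype(np.float32) / 255.0

    inputs = []
    for sigma in (1, 5, 14):   # interpreting the reported kernel values as sigmas
        blurred = cv2.GaussianBlur(image_bgr, (0, 0), sigmaX=sigma).astype(np.float32) / 255.0
        inputs.append(np.concatenate([blurred, edges], axis=-1))   # H x W x 4
    return inputs   # three inputs, one per parallel convolutional block
```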

Acknowledgments

We thank the AIM 2019 sponsors.

A. Appendix 1: Teams and affiliations

AIM 2019 Bokeh Effect Synthesis Team

Members: Andrey Ignatov – [email protected], Radu Timofte – [email protected]

Affiliation: Computer Vision Lab, ETH Zurich, Switzerland

islab-zju

Title: CNN Based Learnable Multiscale Bandpass Filter

Members: Bolun Zheng – [email protected], Xin Ye, Li Huang, Xiang Tian

Affiliations: Zhejiang University, China

CVL-IITM

Title: DoF Effect Generation using Depth-aware Blending of Smoothed Images

Members: Saikat Dutta – [email protected]

Affiliations: Indian Institute of Technology Madras, India

IPCV IITM

Title: Depth-guided Dynamic Filtering Dense Network for Rendering Synthetic Depth-of-Field Effect

Members: Kuldeep Purohit – [email protected], Praveen Kandula, Maitreya Suin, A. N. Rajagopalan

Affiliations: Indian Institute of Technology Madras, India

VIDAR

Title: Multiscale Connected Residual Attention UNet with Fusion Modules

Members: Zhiwei Xiong – [email protected], Jie Huang, Guanting Dong, Mingde Yao, Dong Liu

Affiliations: University of Science and Technology of China, China

XMU-VIPLab

Title: Selective Kernel Networks for Bokeh Effect Simulation

Members: Wenjin Yang – [email protected], Ming Hong, Wenying Lin, Yanyun Qu

Affiliations: Department of Computer Science, Xiamen University, China

KAIST-VICLAB

Title: A Low-Complex Residual Autoencoder Network with a Non-Local Block for Bokeh Effect Generation

Members: Jae-Seok Choi – [email protected], Woonsung Park, Munchurl Kim

Affiliations: Video and Image Computing Lab, Korea Advanced Institute of Science and Technology (KAIST), Republic of Korea

SLYYM

Title: Masked Bokeh GAN

Members: Rui Liu – [email protected], Xiangyu Mao, Chengxi Yang, Qiong Yan, Wenxiu Sun

Affiliations: SenseTime Group Ltd, China

Bazinga

Title: Saliency Maps for Bokeh Effect Simulation

Members: Junkai Fang – [email protected], Meimei Shang, Fei Gao

Affiliations: None

aps vision@IITG

Title: Bokeh-Image Generation Using Deep CNN

Members: Sujoy Ghosh – [email protected], Prasen Kumar Sharma, Dr. Arijit Sur

Affiliations: Indian Institute of Technology Guwahati, India

References

[1] Codruta O. Ancuti, Cosmin Ancuti, Radu Timofte, et al. NTIRE 2019 challenge on image dehazing: Methods and results. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019.

[2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.

[3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.

[4] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.

[5] Qibin Hou, Ming-Ming Cheng, Xiaowei Hu, Ali Borji, Zhuowen Tu, and Philip H. S. Torr. Deeply supervised salient object detection with short connections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3203–3212, 2017.

[6] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.

[7] Andrey Ignatov, Nikolay Kobyshev, Radu Timofte, Kenneth Vanhoey, and Luc Van Gool. DSLR-quality photos on mobile devices with deep convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), 2017.

[8] Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. AI benchmark: Running deep neural networks on Android smartphones. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[9] Andrey Ignatov, Radu Timofte, et al. PIRM challenge on perceptual image enhancement on smartphones: Report. In European Conference on Computer Vision Workshops, 2018.

[10] Lens Blur in the new Google Camera app. https://ai.googleblog.com/2014/04/lens-blur-in-new-google-camera-app.html.

[11] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems, pages 667–675, 2016.

[12] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.

[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[14] X. Li, W. Wang, and X. Hu. Selective kernel networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 510–519, 2019.

[15] Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.

[16] Fujun Luan, Sylvain Paris, Eli Shechtman, and Kavita Bala. Deep photo style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4990–4998, 2017.

[17] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019.

[18] Pietro Perona and Jitendra Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):629–639, 1990.

[19] Kuldeep Purohit, Maitreya Suin, Praveen Kandula, and A. N. Rajagopalan. Depth-guided dense dynamic filtering network for bokeh effect rendering. In ICCV Workshops, 2019.

[20] Xuebin Qin, Zichen Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, and Martin Jagersand. BASNet: Boundary-aware salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7479–7489, 2019.

[21] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[22] Xiaoyong Shen, Aaron Hertzmann, Jiaya Jia, Sylvain Paris, Brian Price, Eli Shechtman, and Ian Sachs. Automatic portrait segmentation for image stylization. In Computer Graphics Forum, volume 35, pages 93–102. Wiley Online Library, 2016.

[23] Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.

[24] Radu Timofte, Shuhang Gu, Jiqing Wu, and Luc Van Gool. NTIRE 2018 challenge on single image super-resolution: Methods and results. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.

[25] Thang Van Vu, Cao Van Nguyen, Trung X. Pham, Tung Minh Luu, and Chang D. Yoo. Fast and efficient image quality enhancement via desubpixel convolutional neural networks. In European Conference on Computer Vision Workshops, 2018.

[26] Neal Wadhwa, Rahul Garg, David E. Jacobs, Bryan E. Feldman, Nori Kanazawa, Robert Carroll, Yair Movshovitz-Attias, Jonathan T. Barron, Yael Pritch, and Marc Levoy. Synthetic depth-of-field with a single-camera mobile phone. ACM Transactions on Graphics (TOG), 37(4):64, 2018.

[27] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.

[28] Ning Xu, Brian Price, Scott Cohen, and Thomas Huang. Deep image matting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2970–2979, 2017.

[29] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 286–301, 2018.