Joint Inpainting of RGB and Depth Images by Generative Adversarial Network with a Late Fusion approach
Ryo Fujii, Ryo Hachiuma, Hideo Saito
Keio University, Japan
{ryo.fujii0112, ryo-hachiuma, hs}@keio.jp
Background
Diminished reality aims to remove real objects from images and fill the removed regions with plausible textures.
• Multi-view observations [1,2]
Ø Accurate
Ø Cannot restore unobserved areas; requires a multiple-camera setup ✘
• Inpainting using pixels in the image [3]
Ø No need for another camera or recorded observations
Ø Inpainted regions can only be predicted from surrounding regions ✘
If the filled pixels are plausible, inpainting-based methods have an advantage over multi-view-based methods.
Goal: Filling in missing regions of both RGB and depth images with plausible textures and geometries.
Contribution: The output features of the RGB and depth encoders are added and used as the input to the fusion part. This late fusion enables each decoder to utilize the other modality's features as complementary information.
Discriminator Network
Input: a four-channel RGB-D image.
• Global discriminator: judges the consistency of the whole scene
• Local discriminator: assesses the quality of the small completed area
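The two-branch critic above can be sketched as follows. This is a minimal PyTorch sketch, assuming layer counts and channel widths that the poster does not specify; only the 4-channel RGB-D input and the global/local split come from the source.

```python
# Hypothetical sketch of the global/local discriminator pair.
# Layer sizes are assumptions; the poster states only the 4-channel input.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Strided convolution halves the spatial resolution at each step.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2),
        nn.LeakyReLU(0.2),
    )

class Branch(nn.Module):
    """Shared structure for the global and local branches."""
    def __init__(self, in_size, channels=(64, 128, 256, 512)):
        super().__init__()
        layers, ch = [], 4  # 4 input channels: RGB + depth
        for out_ch in channels:
            layers.append(conv_block(ch, out_ch))
            ch = out_ch
        self.features = nn.Sequential(*layers)
        feat_size = in_size // 2 ** len(channels)
        self.fc = nn.Linear(ch * feat_size * feat_size, 1)

    def forward(self, x):
        h = self.features(x)
        return self.fc(h.flatten(1))  # one real/fake score per image

class Discriminator(nn.Module):
    def __init__(self, global_size=160, local_size=80):
        super().__init__()
        self.global_branch = Branch(global_size)  # whole-scene consistency
        self.local_branch = Branch(local_size)    # quality of the completed patch

    def forward(self, full_rgbd, local_rgbd):
        # Sum the two scores; no sigmoid, since WGAN-GP uses a critic.
        return self.global_branch(full_rgbd) + self.local_branch(local_rgbd)
```

The local branch receives a crop around the completed region, so its score depends only on the quality of the filled patch, while the global branch sees the whole RGB-D image.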
Result
[Architecture diagram: the completion network (RGB encoder and depth encoder → fusion part → RGB decoder and depth decoder) fills the masked input; the global and local discriminators each judge the inpainted image as real or fake.]
Task: simultaneous RGB and depth image inpainting
Input: RGB and depth images with missing regions
Output: inpainted RGB and depth images
We are the first to propose a deep learning network that jointly inpaints RGB and depth images, each leveraging the other's information.
Method
Loss Function
To optimize the network, we combine:
• ℓ1 reconstruction loss
• WGAN-GP loss [4]
WGAN-GP works well when combined with the ℓ1 reconstruction loss, as both use the ℓ1 distance metric. WGAN-GP training converges faster and more stably.
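The combined objective can be sketched as below. The poster names only the two loss terms; the loss weights (`lambda_adv`, `lambda_gp`), the masking of the ℓ1 term, and the function names are our assumptions.

```python
# Hypothetical sketch of the combined objective: ℓ1 reconstruction + WGAN-GP [4].
# Weights and masking convention are assumptions, not the poster's values.
import torch

def gradient_penalty(critic, real, fake):
    # WGAN-GP: penalize the critic's gradient norm at random interpolates.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = critic(interp).sum()
    grad, = torch.autograd.grad(score, interp, create_graph=True)
    return ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def generator_loss(critic, completed, target, mask, lambda_adv=1e-3):
    # ℓ1 reconstruction over the completed region plus the adversarial term.
    l1 = (mask * (completed - target)).abs().mean()
    adv = -critic(completed).mean()
    return l1 + lambda_adv * adv

def critic_loss(critic, real, fake, lambda_gp=10.0):
    # Wasserstein estimate plus gradient penalty.
    return (critic(fake).mean() - critic(real).mean()
            + lambda_gp * gradient_penalty(critic, real, fake))
```

Both terms minimize an ℓ1-type distance: the reconstruction loss in pixel space and the critic an estimate of the Wasserstein-1 distance, which is why the combination trains stably.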
Network Architecture: We extended the completion network proposed by Iizuka et al. [5].
Completion Network
• RGB encoder-decoder
• Depth encoder-decoder
• Fusion part
The extracted RGB and depth features are fused, and we employ residual dilated convolutional layers (yellow in the figure) in the fusion part.
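The late-fusion completion network can be sketched as follows. Channel counts, layer depths, the dilation schedule, and the mask-channel convention are illustrative assumptions; only the two encoders, the element-wise addition, the residual dilated convolutions, and the two decoders come from the source.

```python
# Hypothetical sketch of the completion network with late fusion.
# Widths, depths, and dilation rates are assumptions.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, in_ch, feat=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, feat, 5, stride=1, padding=2), nn.ReLU(),
            nn.Conv2d(feat, feat * 2, 3, stride=2, padding=1), nn.ReLU(),      # /2
            nn.Conv2d(feat * 2, feat * 4, 3, stride=2, padding=1), nn.ReLU(),  # /4
        )
    def forward(self, x):
        return self.net(x)

class ResidualDilatedBlock(nn.Module):
    """Dilated convolution with a skip connection (the yellow blocks)."""
    def __init__(self, ch, dilation):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation)
    def forward(self, x):
        return x + torch.relu(self.conv(x))

class Decoder(nn.Module):
    def __init__(self, out_ch, feat=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(feat * 4, feat * 2, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(feat * 2, feat, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, out_ch, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

class CompletionNetwork(nn.Module):
    def __init__(self, feat=64):
        super().__init__()
        self.rgb_enc = Encoder(4, feat)    # RGB + mask channel (assumed)
        self.depth_enc = Encoder(2, feat)  # depth + mask channel (assumed)
        self.fusion = nn.Sequential(
            *[ResidualDilatedBlock(feat * 4, d) for d in (2, 4, 8, 16)])
        self.rgb_dec = Decoder(3, feat)
        self.depth_dec = Decoder(1, feat)

    def forward(self, rgb, depth):
        # Late fusion: the encoder outputs are summed, so each decoder can
        # draw on complementary information from the other modality.
        h = self.fusion(self.rgb_enc(rgb) + self.depth_enc(depth))
        return self.rgb_dec(h), self.depth_dec(h)
```

Summing (rather than concatenating) the encoder outputs keeps the fusion part's width fixed, and the increasing dilation rates grow the receptive field so distant context can inform the filled region.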
Dataset: SceneNet RGB-D dataset [6] — RGB-D images rendered from over 15K trajectories in synthetic layouts.
Training procedure:
• Training images: about 2 million images, resized to 160×160
• Mask: 1/8 to 1/4 of the original size
• Batch size: 96 images
• Iterations: 75,000
• Training time: about 2 days on two NVIDIA GPUs (Quadro GV100 and P6000)
• Optimizer: Adam
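One way to read the mask setting above is a rectangular hole whose side length is sampled between 1/8 and 1/4 of the image size at a random position; the exact sampling scheme is our assumption, not stated on the poster.

```python
# Hypothetical training-mask generator: a random rectangle whose sides are
# between 1/8 and 1/4 of the image size (our reading of the mask setting).
import numpy as np

def random_mask(size=160, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    lo, hi = size // 8, size // 4
    h = rng.integers(lo, hi + 1)
    w = rng.integers(lo, hi + 1)
    top = rng.integers(0, size - h + 1)
    left = rng.integers(0, size - w + 1)
    mask = np.zeros((size, size), dtype=np.float32)
    mask[top:top + h, left:left + w] = 1.0  # 1 marks pixels to inpaint
    return mask
```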
Qualitative evaluation:
[Figure: qualitative results — columns show the input RGB or depth with its mask, the generated result, and the ground truth.]
Future work:
• Quantitative evaluation
Ø Compare our baseline model with an early fusion approach
• Examination of the input
Ø Use a normal map instead of a depth map
• The late fusion approach makes the edges of each restored region clearer.
• Depth completion sometimes fails, as shown in the third column of the figure above.
Ø In the most common failure case, a texture similar to the surroundings appears in the missing region.
References
[1] N. Kawai, T. Sato, and N. Yokoya. Diminished reality based on image inpainting considering background geometry. IEEE Trans. Vis. Comput. Graphics, 22:1–1, Jan. 2015.
[2] S. Mori, J. Herling, W. Broll, N. Kawai, H. Saito, D. Schmalstieg, and D. Kalkofen. 3D PixMix: Image-inpainting in 3D environments. In Adjunct Proc. of IEEE ISMAR, 2018.
[3] S. Meerits and H. Saito. Real-time diminished reality for dynamic scenes. In Proc. of IEEE ISMARW, pp. 53–59, 2015.
[4] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
[5] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Trans. Graphics (SIGGRAPH), 36(4):107:1–107:14, 2017.
[6] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison. SceneNet RGB-D: Can 5M synthetic images beat generic ImageNet pre-training on indoor segmentation? In ICCV, 2017.
• Application
Ø Integrate this model into a virtual reality application with an HMD, filling the holes caused by foreground-object occlusion when the viewpoint changes due to the user's head movements.