
EPI-based Oriented Relation Networks for Light Field Depth Estimation

Kunyuan Li   [email protected]

Jun Zhang∗   [email protected]

Rui Sun   [email protected]

Xudong Zhang   [email protected]

Jun Gao   [email protected]

Abstract

Light field cameras record not only the spatial information of observed scenes but also the directions of all incoming light rays. The spatial and angular information implicitly contain geometrical characteristics such as multi-view or epipolar geometry, which can be exploited to improve the performance of depth estimation. An Epipolar Plane Image (EPI), the unique 2D spatial-angular slice of the light field, contains patterns of oriented lines. The slope of these lines is associated with the disparity. Benefiting from this property of EPIs, some representative methods estimate depth maps by analyzing the disparity of each line in EPIs. However, these methods often extract the optimal slope of the lines from EPIs while ignoring the relationship between neighboring pixels, which leads to inaccurate depth map predictions. Based on the observation that an oriented line and its neighboring pixels in an EPI share a similar linear structure, we propose an end-to-end fully convolutional network (FCN) to estimate the depth value of the intersection point on the horizontal and vertical EPIs. Specifically, we present a new feature-extraction module, called Oriented Relation Module (ORM), that constructs the relationship between the line orientations. To facilitate training, we also propose a refocusing-based data augmentation method to obtain different slopes from EPIs of the same scene point. Extensive experiments verify the efficacy of learning relations and show that our approach is competitive to other state-of-the-art methods. The code and the trained models are available at https://github.com/lkyahpu/EPI_ORM.git.

1 Introduction

Light field cameras record both 2D spatial and 2D angular information of the observed scene [13]. The lenslet-based light field camera [18], a compact and hand-held light field camera, is able to achieve dense sampling of the viewpoints by utilizing a micro-lens array inserted between the main lens and the photo sensor. The captured 4D light field data implicitly contains geometrical characteristics such as multi-view geometry or epipolar geometry, which has attracted much attention in recent years to improve the performance of depth estimation from light fields.

To visualize light fields and extract light field features, the 4D light field data is often converted into various 2D images such as multi-view sub-aperture images [17], Epipolar Plane Images (EPIs) [13], and focal stacks [14]. Some representative methods [20, 21, 24] exploit different depth cues from sub-aperture images and focal stacks for depth estimation.

∗Corresponding author


arXiv:2007.04538v2 [cs.CV] 26 Aug 2020


Figure 1: An overview of the proposed network architecture. The input is a pair of EPI patches obtained from the horizontal and vertical EPIs. Each branch consists of two oriented relation modules (ORMs), seven convolutional blocks, and a residual module (RM). The output of the two branches is integrated by a merging block to estimate the depth value of each pixel.

However, it is difficult to acquire dense and accurate depth maps from lenslet-based cameras owing to the optical distortions [11] and the narrow baseline [18] between sub-aperture images. Besides, these methods are usually accompanied by heavy computational burdens and carefully-designed optimization measures. To avoid these issues, some methods [4, 15, 23, 27] exploit EPIs, which exhibit patterns of oriented lines with constant colors, to visualize light fields. Each of these lines corresponds to the projection of a single 3D scene point, and its slope is called disparity [22]. Therefore, one can infer the depth of the corresponding scene point by analyzing the disparity of the oriented line in the EPI. Moreover, an oriented line and its neighboring pixels share a similar linear structure, which makes it possible to estimate the slope of the EPI by constructing the relationship between the center region of the EPI and its neighborhood. Nonetheless, current methods predict depth maps by extracting the optimal slope of EPIs while ignoring the relationship between neighboring pixels, which makes the results inaccurate. It has been well recognized that relation information offers important visual cues for computer vision tasks, such as spatial and channel relations in semantic segmentation [16] and object detection [8], and temporal relations in activity recognition [29].

In this paper, we propose an end-to-end fully convolutional network to estimate the depth value of the intersection point on the horizontal and vertical EPIs, as shown in Figure 1. We design a Siamese network without sharing weights (i.e. pseudo-Siamese [26]) so that the convolution weights of the horizontal and vertical EPIs can be learned separately. Specifically, we propose a new feature extraction module, called Oriented Relation Module (ORM), to learn and reason about the relationship between oriented lines in EPIs by extracting oriented relation features between the center pixel and its neighborhood from EPI patches. The proposed method can be considered the first work on modeling relation features in EPIs, and it differs from existing relation models in two aspects. First, existing works [16, 29] focus on modeling temporal relations between frames and spatial relations between pixels. In contrast, our method models the geometric relation between line orientations in EPI patches, which helps to extract accurate EPI slopes for light field depth estimation. Second, the proposed method models dependencies between oriented lines without making any assumptions on their feature distributions and locations.


Our network is trained using the 4D light field benchmark dataset [7], where the ground truth disparities are available. However, we find that it is hard to train such a deep network with insufficient data. To mitigate this issue, we propose a data augmentation method that refocuses EPIs, so that EPIs with different slopes as well as the corresponding ground truth disparities can be obtained for the same scene point. We show that the newly proposed ORM and EPI-based data augmentation bring a performance boost for light field depth estimation.

2 Related Work

Conventional depth estimation from light fields mainly relies on different assumptions [14, 21] and handcrafted depth features [20, 24] based on sub-aperture images and focal stacks. In this section, we restrict ourselves to methods that exploit EPIs, and review some representative works with relation reasoning.

Light field depth estimation based on EPIs. There exist a few methods that exploit the EPI for light field depth estimation due to its linear structure associated with depth [22]. For example, Wanner et al. [23] used a structure tensor to compute the slope of each line in vertical and horizontal EPIs. Zhang et al. [27] introduced the Spinning Parallelogram Operator (SPO) to find matching lines in EPIs; the lines with different slopes are located by maximizing the distribution distances of the regions. Zhang et al. [28] located the optimal slope of each line segment on EPIs by using locally linear embedding. Differing from these methods, some methods apply CNNs to extract light field features from EPIs. Sun et al. [25] presented a data-driven approach to estimate object depths from an enhanced EPI feature using a CNN. Heber and Pock [4] used CNNs to predict 2D per-pixel hyperplane slope orientations in EPIs. Heber et al. [5, 6] improved this work by utilizing a U-shaped network and EPI volumes to predict the depth map. Luo et al. [15] designed an EPI-patch based CNN architecture to estimate the depth of each pixel. Feng et al. [2] proposed a two-stream network that learns to estimate the depth values of multiple correlated neighborhood pixels from EPI patches. Shin et al. [19] introduced a multi-stream network to extract features for the epipolar property of four viewpoint directions: horizontal, vertical and both diagonals. This method reaches state-of-the-art results on the 4D light field benchmark [7]. One of the most recent works, by Leistner et al. [12], shifts the light field stack to retain a small receptive field, which improves the performance of depth estimation for large-disparity light fields. Some of these previous methods [2, 6, 12, 15] require data pre-processing and subsequent optimization. In contrast, we present an end-to-end fully convolutional network architecture to predict the depth values of center pixels from the corresponding horizontal and vertical EPIs. We explore the similar linear structure information in EPIs and model the relationship between the oriented lines and their neighboring pixels, which helps to estimate the slope of the oriented line.

Relation modeling. A few recent papers [8, 16, 29] have shown that relations can be exploited to improve the performance of computer vision tasks. Zhou et al. [29] proposed a temporal relation network to learn and reason about temporal dependencies between video frames at multiple time scales. Hu et al. [8] proposed an object relation module to model relationships between sets of objects for object detection. Mou et al. [16] proposed spatial and channel relation modules to learn and reason about global relationships between any two spatial positions or feature maps, and then produced relation-augmented feature representations for semantic segmentation. Motivated by these works, we propose an oriented relation module to construct the relationship between the center pixel and its neighborhood in the EPI, which allows the network to explicitly learn the relationship between the line orientations and improves the performance of depth estimation.


Figure 2: EPIs from the light field: each image in angular coordinates (u, v) yields a sub-aperture view of the scene. Given a pixel P(xi, yi) in the spatial coordinates, its horizontal or vertical EPIs are obtained by fixing the view v0 or u0, respectively. Three pairs of horizontal and vertical EPIs at different refocused depths are shown. The similar linear structure between the oriented line marked in yellow and its neighboring pixels can be observed in the three EPI patches. The disparity ∆x/∆u of the EPI patch b describes the pixel shift of the scene point P when moving between the views.


3 Proposed Method

In this paper, we present an end-to-end fully convolutional network to predict the depth values of center pixels in EPIs of light fields. Two branches are designed to process the horizontal and vertical EPIs separately. The newly proposed oriented relation module models the relationships between neighboring pixels in EPIs. A refocusing-based EPI augmentation method is also proposed to facilitate training and improve the performance of depth estimation. An overview of the network architecture is shown in Figure 1.

3.1 EPI Patches for Learning

The light field, denoted as L(u, v, x, y), is generally represented by the two-plane parameterization [13]. Here, (x, y) and (u, v) are spatial and angular coordinates, respectively. The central sub-aperture (center view) image is formed by the rays passing through the optical center of the camera main lens (u = u0, v = v0). As shown in Figure 2, given a pixel P(xi, yi) in the center view image, the horizontal EPI of the row view v0 can be formulated as L(u, v0, x, yi), which is centered at (u0, xi). Similarly, the vertical EPI of the column view u0, with its center at (v0, yi), is written as L(u0, v, xi, y).

The depth Z can be obtained by analyzing the slope ∆x/∆u of the line [23],

∆x = −(∆u/Z) f,    (1)


Figure 3: The proposed oriented relation module.

where f is the focal distance and Z is the depth value of the point P. The slope of the oriented line is shown in the EPI patch b of Figure 2.

To learn the slope of the oriented line of P(xi, yi), we extract patches of size H × W × C from L(u, v0, x, yi) and L(u0, v, xi, y) as inputs. Here, H and W indicate the height and width of the patch, respectively, and C is the channel dimension. The size of the patch is determined by the range of disparities. The proposed network predicts the depth of the center pixel from the pair of EPI patches.
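A minimal NumPy sketch of this patch extraction is given below, assuming the light field is stored as an array of shape (U, V, X, Y, C) and the patch width matches the 9 × 29 × 3 setting used in Sect. 4.1; the function and variable names are illustrative only.

```python
import numpy as np

def extract_epi_patches(lf, xi, yi, patch_w=29):
    """Extract the horizontal and vertical EPI patches centered on pixel (xi, yi).

    lf: light field array of shape (U, V, X, Y, C); the patch height equals the
        angular resolution, so H = U (or V) and W = patch_w.
    Returns two arrays of shape (U, patch_w, C) and (V, patch_w, C).
    """
    U, V, X, Y, C = lf.shape
    u0, v0 = U // 2, V // 2                          # center view as reference
    half = patch_w // 2

    # Horizontal EPI L(u, v0, x, yi): fix v = v0 and y = yi, vary u and x.
    h_epi = lf[:, v0, :, yi, :]                      # shape (U, X, C)
    h_patch = h_epi[:, xi - half: xi + half + 1, :]  # crop around xi

    # Vertical EPI L(u0, v, xi, y): fix u = u0 and x = xi, vary v and y.
    v_epi = lf[u0, :, xi, :, :]                      # shape (V, Y, C)
    v_patch = v_epi[:, yi - half: yi + half + 1, :]  # crop around yi

    return h_patch, v_patch

# Example: a 9x9 light field of 512x512 RGB views, patch pair for pixel (256, 256).
lf = np.random.rand(9, 9, 512, 512, 3).astype(np.float32)
h_patch, v_patch = extract_epi_patches(lf, xi=256, yi=256)
print(h_patch.shape, v_patch.shape)   # (9, 29, 3) (9, 29, 3)
```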

3.2 Network Architecture

As shown in Figure 1, the proposed network shares a similar structure with the pseudo-Siamese network proposed in [26], where two branches are designed to learn the weights for the horizontal and vertical EPI patches, respectively. Each branch contains two oriented relation modules (ORMs), a set of seven convolutional blocks, and a residual module (RM), followed by a merging block that fuses the two branches. The ORM will be discussed in Sect. 3.3. The convolutional block is composed of 'Conv-ReLU-Conv-BN-ReLU'. To handle the small EPI slope, we apply convolutional filters of size 2×2 or 1×2 with stride 1 to measure small depth values. However, detailed information of the EPI slope is lost as the network goes deeper. Inspired by residual learning [3], which can introduce detailed information of the shallower layers into the deeper layers and effectively improve the network performance, we design a residual module for each branch. The residual module consists of six residual blocks, each of which consists of one convolutional block and one skip connection. We implement the skip connection with a slicing operation that extracts the center region of the input feature. The final merging block, containing two different convolutional blocks ('Conv-ReLU-Conv-BN-ReLU' and 'Conv-ReLU-Conv'), fuses the horizontal and vertical EPI features to predict the depth value of each pixel.
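To make this structure concrete, the following is a heavily reduced Keras sketch of the two-branch design with valid convolutions and a center-cropping skip connection. The block counts, filter numbers, kernel choices, and the final flatten-and-regress head are illustrative assumptions rather than the paper's exact configuration, and the ORM of Sect. 3.3 is omitted here.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters, kernel=(2, 2)):
    # 'Conv-ReLU-Conv-BN-ReLU' block built from small valid convolutions (stride 1).
    x = layers.Conv2D(filters, kernel, padding='valid', activation='relu')(x)
    x = layers.Conv2D(filters, kernel, padding='valid')(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def residual_block(x, filters):
    # Conv block plus skip connection; the skip crops the center region of the
    # input feature so its spatial size matches the conv output (two 2x2 valid
    # convolutions shrink height and width by 2 in total).
    y = conv_block(x, filters, kernel=(2, 2))
    skip = layers.Cropping2D(cropping=((1, 1), (1, 1)))(x)
    return layers.Add()([y, skip])

def branch(inp, filters=64):
    # One EPI branch (far fewer blocks than the paper); every call builds new
    # layers, so the two branches do not share weights (pseudo-Siamese).
    x = conv_block(inp, filters, kernel=(1, 2))
    x = conv_block(x, filters, kernel=(1, 2))
    x = residual_block(x, filters)
    return x

h_in = layers.Input(shape=(9, 29, 3), name='horizontal_epi')
v_in = layers.Input(shape=(9, 29, 3), name='vertical_epi')

# Merging block: fuse horizontal and vertical EPI features, then regress the
# disparity of the center pixel (simplified head).
merged = layers.Concatenate(axis=-1)([branch(h_in), branch(v_in)])
merged = layers.Conv2D(128, (2, 2), padding='valid', activation='relu')(merged)
merged = layers.Conv2D(128, (2, 2), padding='valid')(merged)
out = layers.Dense(1)(layers.Flatten()(merged))

model = Model(inputs=[h_in, v_in], outputs=out)
model.summary()
```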

3.3 Oriented Relation Module

We propose a new Oriented Relation Module (ORM) to reason about the relationship between the center pixel and its neighborhood in each EPI patch. As shown in Figure 3, given an EPI patch I of size H × W × C, we apply two single-layer convolutions of 1 × 1 kernel size to model a compact relationship in the EPI patch. The output features are converted into F1 and F2, respectively, which are followed by a dot product to construct the oriented relation feature F3 of size (H·W) × (H·W). Furthermore, to obtain the relationship between the center pixel and its neighborhood in F3, we extract the feature f3 of size W × (H·W) from the relational feature F3. Then, we apply reshaping and ReLU activation on the feature f3 to obtain a new feature F4 of size W × H × W. Finally, we concatenate the original EPI patch I with the feature F4 to obtain the output feature F5 of size W × H × (W + C).
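The tensor manipulations can be sketched in NumPy as follows. The two learned 1 × 1 convolutions are stood in for by random projection matrices, the intermediate channel count k is an assumed value, and the output is laid out as (H, W, W + C) so that the relation feature can be concatenated with the patch along the channel axis.

```python
import numpy as np

def oriented_relation_module(patch, k=16, rng=np.random.default_rng(0)):
    """Sketch of the ORM forward pass on a single EPI patch of shape (H, W, C).

    The two 1x1 convolutions are replaced by fixed random projections here;
    in the real module they are learned layers.
    """
    H, W, C = patch.shape
    W1 = rng.standard_normal((C, k)).astype(np.float32)   # stand-in 1x1 conv weights
    W2 = rng.standard_normal((C, k)).astype(np.float32)

    f1 = patch.reshape(H * W, C) @ W1                      # (H*W, k)
    f2 = patch.reshape(H * W, C) @ W2                      # (H*W, k)

    F3 = f1 @ f2.T                                         # (H*W, H*W) pairwise relations

    # Keep only the relations between the center row (which contains the center
    # pixel) and every pixel of the patch.
    centre = (H // 2) * W
    f3 = F3[centre: centre + W, :]                         # (W, H*W)

    F4 = np.maximum(f3, 0.0)                               # ReLU
    F4 = F4.reshape(W, H, W).transpose(1, 2, 0)            # relation feature, (H, W, W)

    return np.concatenate([patch, F4], axis=-1)            # (H, W, W + C)

patch = np.random.rand(9, 29, 3).astype(np.float32)
out = oriented_relation_module(patch)
print(out.shape)   # (9, 29, 32)
```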


Figure 4: The light field refocusing. (a) before refocusing. (b) after refocusing.

3.4 EPI Refocusing-based Data Augmentation

To alleviate the problems of insufficient data and overfitting, we propose a new data augmentation method that refocuses EPIs. Differing from general augmentation techniques such as rotation, scaling and flipping [19], we refocus EPIs to generate multiple EPIs focused at different depth levels. Light field refocusing shifts the sub-aperture images to obtain images focused at different depth planes [1]. Figure 4 shows sub-aperture images at the same horizontal or vertical views stacked together. Lines with different slopes (i.e. the lines in EPIs) pass through scene points at different depth planes on the sub-aperture images. The line at the focal depth should be vertical (slope = 0), while the other lines are inclined (slope > 0 or slope < 0). Taking the center view as the reference, the disparity shift between sub-aperture images changes the slope of the line. Thus, refocusing at a different depth plane changes the orientation of the structure in the EPI.

We convert the depth information into a disparity shift in every single EPI according to Eq. 1. The resulting disparity shift ∆x(u) related to the depth Z is defined following [1],

∆x(u) = (u0 − u) (∆u/Z) f    (2)

Here, we assume that the center view (u0, v0) is the reference view. For the sake of simplicity, we also assume that lenslet-based cameras have the same focal length f and the same baseline ∆u for the neighboring views. Similarly, we can obtain the disparity shift ∆y(v). Then we refocus the EPI based on the refocusing principle [18],

L(u, v, x, y) = L(u, v, x + ∆x(u), y + ∆y(v)) (3)

The EPI patches (a, b, c) in Figure 2 show three horizontal EPIs at different refocused depths. Our strategy not only changes the slope of the oriented line but also changes the corresponding ground truth (Eq. 2).
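A NumPy/SciPy sketch of this refocusing-based augmentation is shown below. It assumes the light field is stored as an array of shape (U, V, X, Y, C), uses bilinear sub-pixel shifting, and takes delta_d as the per-view disparity offset; all names are illustrative.

```python
import numpy as np
from scipy.ndimage import shift as subpixel_shift

def refocus_light_field(lf, delta_d):
    """Refocus a light field by shifting every sub-aperture view (Eq. 3).

    lf:      array of shape (U, V, X, Y, C), with the center view as reference.
    delta_d: disparity change in pixels per angular step; each oriented line in
             the EPIs is tilted by this amount, and the ground-truth disparity
             map is offset accordingly (Eq. 2), up to the sign convention.
    """
    U, V, X, Y, C = lf.shape
    u0, v0 = U // 2, V // 2
    out = np.empty_like(lf)
    for u in range(U):
        for v in range(V):
            dx = (u0 - u) * delta_d      # shift along x for this view
            dy = (v0 - v) * delta_d      # shift along y for this view
            # Bilinear sub-pixel shift of the sub-aperture image.
            out[u, v] = subpixel_shift(lf[u, v], shift=(dx, dy, 0),
                                       order=1, mode='nearest')
    return out

# Example: generate one extra refocused training light field.
lf = np.random.rand(9, 9, 64, 64, 3).astype(np.float32)
lf_refocused = refocus_light_field(lf, delta_d=0.5)
```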

4 Experiments

4.1 Implementation Details

Following previous works [2, 15], we use the 4D light field benchmark [7] as our experimental dataset, which provides highly accurate disparity ground truth and performance evaluation metrics. The dataset includes 24 carefully designed scenes with ground-truth disparity maps. Each scene has 9 × 9 angular resolution and 512 × 512 spatial resolution.


Metric    Baseline    w/ ORM    w/ EPIR    Full model
BadPix    9.45        7.98      7.41       5.66
MSE       2.058       1.621     1.475      1.393

Table 1: Effects of ORM and EPIR. Bold: the best.

16 scenes are used for training and the remaining 8 scenes for testing. We randomly sample horizontal and vertical EPI patch pairs of size 9 × 29 × 3 from each scene as inputs. To avoid overfitting, we increase the training data to 8 times the original amount with the proposed EPI refocusing-based data augmentation.

The bad pixel ratio (BadPix) [7], which denotes the percentage of pixels whose disparity error is larger than 0.07 pixels, as well as the Mean Square Error (MSE), are computed as evaluation metrics. Given an estimated disparity map d, the ground truth disparity map gt and an evaluation region M, BadPix is defined as

BadPix = |{x ∈ M : |d(x) − gt(x)| > 0.07}| / |M|,    (4)

and MSE is defined as

MSE = Σ_{x∈M} (d(x) − gt(x))² / |M| × 100.    (5)

Lower scores are better for both metrics.
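A short NumPy sketch of the two metrics, assuming d and gt are disparity maps over the full image and M is a boolean mask (BadPix is scaled to a percentage, as reported in the result tables):

```python
import numpy as np

def badpix(d, gt, mask=None, tau=0.07):
    """Percentage of pixels whose absolute disparity error exceeds tau (Eq. 4)."""
    if mask is None:
        mask = np.ones(gt.shape, dtype=bool)
    err = np.abs(d[mask] - gt[mask])
    return 100.0 * np.mean(err > tau)

def mse_x100(d, gt, mask=None):
    """Mean squared disparity error scaled by 100 (Eq. 5)."""
    if mask is None:
        mask = np.ones(gt.shape, dtype=bool)
    return 100.0 * np.mean((d[mask] - gt[mask]) ** 2)

gt = np.zeros((512, 512), dtype=np.float32)
d = gt + np.random.uniform(-0.1, 0.1, gt.shape).astype(np.float32)
print(badpix(d, gt), mse_x100(d, gt))
```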

We use the Keras library [10] with the mean absolute error (MAE) loss to train the proposed network from scratch. We formulate depth estimation as a multi-label regression problem to estimate the depth value of a single pixel. Note that the network is trained end-to-end and does not rely on any pre- or post-processing. We utilize the RMSprop optimizer [30], set the weight decay rate to 1e−5, and use a batch size of 128. Training takes one day for 750k iterations on an NVIDIA GTX 1080 Ti 11GB GPU, with a memory footprint of about 65%.
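A self-contained Keras sketch of this training setup is given below; the stand-in model only mimics the input/output interface of Sect. 3.2, the learning rate is an assumed value (the paper does not state it), and the data are random placeholders.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Stand-in two-input regression model with the same interface as Sect. 3.2.
h_in = layers.Input(shape=(9, 29, 3))
v_in = layers.Input(shape=(9, 29, 3))
x = layers.Concatenate()([layers.Flatten()(h_in), layers.Flatten()(v_in)])
model = tf.keras.Model([h_in, v_in], layers.Dense(1)(x))

# MAE loss and RMSprop optimizer as in the paper; the learning rate is assumed.
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4), loss='mae')

# Placeholder EPI patch pairs and disparity labels; batch size 128 as in the paper.
h = np.random.rand(256, 9, 29, 3).astype(np.float32)
v = np.random.rand(256, 9, 29, 3).astype(np.float32)
y = np.random.rand(256, 1).astype(np.float32)
model.fit([h, v], y, batch_size=128, epochs=1)
```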

4.2 Ablation Study

We use the proposed network without the oriented relation module (ORM) and without the data augmentation based on EPI refocusing (EPIR) as the Baseline.

Effect of the oriented relation module. Table 1 shows that the network using the ORM brings a significant improvement over the baseline, reducing BadPix by around 1.5. Figure 5 shows qualitative results for comparison. Boxes and Cotton show that the ORM can reduce the streaking artifacts and improve the accuracy in weakly textured areas. The occlusion boundaries in Backgammon, which contains multiple occlusions, can also be better restored with the ORM. In addition, our network with the ORM generates smooth depth maps while preserving discontinuities between different objects, reducing the MSE by about 21% compared to the baseline.

Effect of EPI refocusing-based data augmentation. From Table 1, we can see that the network using the EPIR is better than the baseline. Moreover, by using both the ORM and the EPIR, the performance is further boosted. To further show the effect of EPI refocusing, we compare the performance for different numbers of refocusings in Table 2. We refocus the training data to the foreground and the background of the original depth plane. From the table, we observe that there are performance gains when increasing the number of refocusings. However, there is no further gain from EPIR×8 to EPIR×10.


Figure 5: Qualitative comparison of the baseline and the network with the ORM. (a) Original scenes. (b) Ground truth maps. (c) Baseline. (d) Our network with the ORM.

Metric    Baseline    EPIR×2    EPIR×4    EPIR×6    EPIR×8    EPIR×10
BadPix    9.45        8.03      7.78      7.50      7.41      7.45
MSE       2.058       1.781     1.511     1.482     1.475     1.480

Table 2: Performance in terms of the number of EPI refocusing. Bold: the best.

4.3 Comparison with State-of-the-Arts

We compare our approach with other state-of-the-art methods: LF [9], CAE [24], LF_OCC [21], SPO [27], EPN [15], and EPINET [19]. The qualitative comparison is shown in Figure 6. The Cotton scene contains smooth surfaces and textureless regions, and the Boxes scene consists of occlusions with depth discontinuities. As can be seen from the figure, our approach reconstructs the smooth surfaces and the regions with sharp depth discontinuities better than the other methods. For the Sideboard scene, with its complex shapes and textures, our approach preserves more details and sharper boundaries by distinguishing subtle differences of EPI slopes. In addition, our approach obtains better disparity maps in the Boxes and Sideboard scenes than the recent state-of-the-art method [19], which uses the vertical, horizontal, left diagonal and right diagonal viewpoints as inputs; the number of viewpoints it uses is almost double that of our approach. Compared with the 28-layer, 4-branch network in [19], our network consists of 30 layers with 2 branches, so our trainable parameters are about half of those in [19]. However, our network cannot produce good depth predictions for the Dots scene, which contains a lot of noise; this is a common downside of applying EPIs to CNN-based methods (e.g. EPN [15]). The reason is that noise may lead to false estimates of the straight lines in EPI patches. Therefore, one direction for future work is to introduce global constraints on the oriented lines into our model.

Quantitative results are shown in Tables 3 and 4, which show that the proposed approach performs best in 4 out of 8 scenes. In particular, the proposed approach predicts more accurate disparity values on the Boxes and Backgammon scenes with multiple occlusions. Note that we do not apply any post-processing for depth optimization, while most other methods [9, 15, 21, 24, 27] are accompanied by post-optimization.


Figure 6: Qualitative results on the 4D light field benchmark [7]. For each scene, the top row shows the estimated disparity maps and the bottom row shows the error maps for BadPix. (a) Ground truth. (b) LF [9]. (c) CAE [24]. (d) LF_OCC [21]. (e) SPO [27]. (f) EPN [15]. (g) EPINET [19]. (h) Ours.

Scenes        LF [9]    CAE [24]   LF_OCC [21]   SPO [27]   EPN [15]   EPINET [19]   Ours
boxes         24.572    17.885     24.526        15.889     15.304     14.190        13.373
cotton         8.794     3.369      6.548         2.594      2.060      0.810         0.869
dino          21.478     4.968     15.466         2.184      2.877      2.970         2.814
sideboard     23.906     9.845     17.923         9.297      7.997      6.260         5.580
backgammon     4.810     3.924     18.061         3.781      3.328      4.130         2.511
dots           2.441    12.401      5.109        16.274     39.248      9.370        25.930
pyramids      10.949     1.681      2.830         0.356      0.242      0.540         0.240
stripes       35.394     7.872     17.558        14.987     18.545      5.310         5.893

Table 3: Quantitative comparison of different methods using the BadPix metric. The best three results are shown in red, blue, and green, respectively (best viewed in color).


Scenes        LF [9]    CAE [24]   LF_OCC [21]   SPO [27]   EPN [15]   EPINET [19]   Ours
boxes         16.705     8.424      9.095         9.107      9.314      6.440         4.189
cotton        11.773     1.506      1.103         1.313      1.406      0.270         0.313
dino           1.558     0.382      1.077         0.310      0.565      0.940         0.336
sideboard      4.735     0.876      2.158         1.024      1.744      0.770         0.733
backgammon    15.109     6.074     20.962         4.587      3.699      4.700         1.403
dots           4.803     5.082      2.731         5.238     22.369      3.320         6.754
pyramids       0.243     0.048      0.098         0.043      0.018      0.020         0.016
stripes       17.380     3.556      7.646         6.955      8.731      1.160         1.263

Table 4: Quantitative comparison of different methods using the MSE metric. The best three results are shown in red, blue, and green, respectively (best viewed in color).

5 Conclusion

In this paper, we propose an end-to-end fully convolutional network for depth estimation from light fields by exploiting horizontal and vertical EPIs. We introduce a new relational reasoning module to construct the relationship between oriented lines in EPIs. In addition, we propose a new data augmentation method based on refocusing the EPIs. We demonstrate the effectiveness of our approach on the 4D light field benchmark [7]. Our approach is competitive with the state-of-the-art methods, and is able to predict more accurate disparity maps in some challenging scenes such as Boxes and Sideboard without any post-processing.

Acknowledgments. This work was supported by the National Natural Science Foundation of China, No. 61876057.

References

[1] M. Diebold and B. Goldluecke. Epipolar Plane Image Refocusing for Improved Depth Estimation and Occlusion Handling. In Vision, Modelling & Visualization, 2013.

[2] M. Feng, Y. Wang, J. Liu, L. Zhang, H. F. M. Zaki, and A. Mian. Benchmark Data Set and Method for Depth Estimation From Light Field Images. IEEE Transactions on Image Processing, 2018.

[3] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[4] S. Heber and T. Pock. Convolutional Networks for Shape from Light Field. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[5] S. Heber, W. Yu, and T. Pock. Neural EPI-Volume Networks for Shape from Light Field. In Proceedings of International Conference on Computer Vision (ICCV), 2017.

[6] S. Heber, W. Yu, and T. Pock. U-shaped Networks for Shape from Light Field. In Proceedings of British Machine Vision Conference, 2016.

[7] K. Honauer, O. Johannsen, D. Kondermann, and B. Goldluecke. A Dataset and Evaluation Methodology for Depth Estimation on 4D Light Fields. In Proceedings of Asian Conference on Computer Vision (ACCV), 2016.


[8] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei. Relation Networks for Object Detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[9] H. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y. Tai, and I. S. Kweon. Accurate depth map estimation from a lenslet light field camera. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[10] Z. Jiang and G. Shen. Prediction of House Price Based on The Back Propagation Neural Network in The Keras Deep Learning Framework. In Proceedings of International Conference on Systems and Informatics (ICSAI), 2019.

[11] O. Johannsen, C. Heinze, B. Goldluecke, and C. Perwass. On the Calibration of Focused Plenoptic Cameras. Springer Berlin Heidelberg, 2013.

[12] T. Leistner, H. Schilling, R. Mackowiak, S. Gumhold, and C. Rother. Learning to think outside the box: Wide-baseline light field depth estimation with EPI-shift. In Proceedings of International Conference on 3D Vision (3DV), 2019.

[13] M. Levoy and P. Hanrahan. Light Field Rendering. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, 1996.

[14] H. Lin, C. Chen, S. B. Kang, and J. Yu. Depth Recovery from Light Field Using Focal Stack Symmetry. In Proceedings of International Conference on Computer Vision (ICCV), 2015.

[15] Y. Luo, W. Zhou, J. Fang, L. Liang, H. Zhang, and G. Dai. EPI-Patch Based Convolutional Neural Network for Depth Estimation on 4D Light Field. In International Conference on Neural Information Processing, 2017.

[16] L. Mou, Y. Hua, and X. X. Zhu. A Relation-Augmented Fully Convolutional Network for Semantic Segmentation in Aerial Scenes. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[17] R. Ng, M. Levoy, G. Duval, M. Horowitz, and P. Hanrahan. Light Field Photography with a Hand-held Plenoptic Camera. Computer Science Technical Report CSTR, 2005.

[18] R. Ng. Digital Light Field Photography. PhD thesis, Stanford University, 2006.

[19] C. Shin, H. Jeon, Y. Yoon, I. S. Kweon, and S. J. Kim. EPINET: A Fully-Convolutional Neural Network Using Epipolar Geometry for Depth from Light Field Images. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[20] M. W. Tao, S. Hadap, J. Malik, and R. Ramamoorthi. Depth from Combining Defocus and Correspondence Using Light-Field Cameras. In Proceedings of International Conference on Computer Vision (ICCV), 2013.

[21] T.-C. Wang, A. Efros, and R. Ramamoorthi. Occlusion-Aware Depth Estimation Using Light-Field Cameras. In Proceedings of International Conference on Computer Vision (ICCV), 2017.

[22] S. Wanner and B. Goldluecke. Globally Consistent Depth Labeling of 4D Light Fields. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.


[23] S. Wanner and B. Goldluecke. Variational Light Field Analysis for Disparity Estimation and Super-Resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2014.

[24] W. Williem and I. K. Park. Robust Light Field Depth Estimation for Noisy Scene with Occlusion. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[25] X. Sun, Z. Xu, N. Meng, E. Y. Lam, and H. K.-H. So. Data-driven light field depth estimation using deep Convolutional Neural Networks. In Proceedings of International Joint Conference on Neural Networks (IJCNN), 2016.

[26] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[27] S. Zhang, H. Sheng, C. Li, J. Zhang, and Z. Xiong. Robust depth estimation for light field via spinning parallelogram operator. Computer Vision and Image Understanding (CVIU), 2016.

[28] Y. Zhang, H. Lv, Y. Liu, H. Wang, X. Wang, Q. Huang, X. Xiang, and Q. Dai. Light-Field Depth Estimation via Epipolar Plane Image Analysis and Locally Linear Embedding. IEEE Transactions on Circuits and Systems for Video Technology, 2017.

[29] B. Zhou, A. Andonian, and A. Torralba. Temporal Relational Reasoning in Videos. In Proceedings of European Conference on Computer Vision (ECCV), 2018.

[30] F. Zou, L. Shen, Z. Jie, W. Zhang, and W. Liu. A Sufficient Condition for Convergences of Adam and RMSProp. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
