Computers & Graphics 66 (2017) 103–112
Contents lists available at ScienceDirect
Computers & Graphics
journal homepage: www.elsevier.com/locate/cag
Special Issue on SMI 2017
A multi-view recurrent neural network for 3D mesh segmentation
Truc Le, Giang Bui, Ye Duan ∗
Department of Computer Science, University of Missouri, Columbia, MO 65211
Article info
Article history:
Received 30 March 2017
Revised 17 May 2017
Accepted 25 May 2017
Available online 13 June 2017
Keywords:
Mesh segmentation
Multi-view
3D deep learning
CNN
RNN
LSTM
Abstract
This paper introduces a multi-view recurrent neural network (MV-RNN) approach for 3D mesh
segmentation. Our architecture combines convolutional neural networks (CNNs) with a two-layer
long short-term memory (LSTM) network to yield coherent segmentation of 3D shapes. The
image-based CNNs generate an edge probability feature map for each view, while the LSTM
correlates these edge maps across different views and outputs a well-defined per-view edge
image. Evaluations on the Princeton Segmentation Benchmark dataset show that our framework
significantly outperforms other state-of-the-art methods.
© 2017 Elsevier Ltd. All rights reserved.
1. Introduction
Mesh segmentation has been a classical yet challenging problem in computer graphics for many
decades. Unfortunately, the segmentation problem is ill-posed and there is no general objective
measurement that can be applied universally. Judging the quality of a segmentation largely
depends on the application. For instance, in a LiDAR scan of an urban environment, a desired
segmentation should distinguish between different instances of buildings, people, cars, trees,
ground, etc. However, in a part-based annotation of a 3D model (e.g. a human), the requirement
is usually to segment the head, torso, left/right arms and left/right legs, and sometimes finer
details such as the thumb, index finger, and so on, depending on the specific task.
Consequently, in the scope of this paper, we tackle mesh segmentation with a data-driven
approach. Given a training dataset of input meshes and the corresponding desired segmentations,
we design a deep learning framework to learn the pattern of the segmentation given by the
training dataset so that it can segment an unseen mesh. As a result, we make no geometric or
topological assumptions about the shape, nor exploit any hand-crafted descriptors.
In this paper, we propose a multi-view recurrent neural network (MV-RNN) deep learning
framework to segment 3D models, which significantly outperforms prior methods on the Princeton
Segmentation Benchmark dataset [1]. It is worth mentioning that our goal is to partition the
3D model and not to do the semantic
∗ Corresponding author.
E-mail address: [email protected] (Y. Duan).
http://dx.doi.org/10.1016/j.cag.2017.05.011
0097-8493/© 2017 Elsevier Ltd. All rights reserved.
segmentation. In semantic segmentation, the two wings of an airplane are assigned a single
label wing. On the other hand, in mesh segmentation, the two wings belong to two different
regions and do not have a semantic label. In general, semantic segmentation provides a better
understanding of a 3D model. However, mesh segmentation still has its merits, such as guiding
mesh processing algorithms including skeleton extraction [2,3], modeling [4], morphing [5],
shape-based retrieval [6] and texture mapping [7]. Moreover, in contrast to semantic
segmentation, which requires a fixed set of semantic labels, many mesh segmentation algorithms
can be generalized to unseen object categories. As a result, instead of identifying the surface
area of the 3D model within a segment, we predict its boundary (or edge). The benefits of doing
so are twofold. First, it is usually more expensive to obtain dense surface annotations than
boundary annotations from humans. Second, we only have two semantic labels, i.e. boundary
versus non-boundary, which is simpler for the framework to learn than using hundreds of
semantic labels (e.g. hand, torso, leg, head, etc.). In fact, detecting 3D edges could be
useful for other tasks such as suggestive contours [8,9] and ridge-valley detection [10].
Our approach belongs to the multi-view paradigm, which has recently been shown to succeed in
many visual recognition tasks such as classification and segmentation [11–14]. Typically, in
multi-view segmentation, a 3D model is rendered from multiple views to generate multi-view
images, each of which is fed forward through a (shared weights) convolutional neural network
to obtain densely labeled images before being mapped back to 3D. In general, a multi-view
approach for segmentation must overcome several technical obstacles. Firstly, there must be
enough views to minimize occlusions and cover the shape surface. This can be
Fig. 1. Given a 3D model, we try to detect the boundary between segments using a multi-view
approach. We apply non-maximum suppression [15] to the MV-CNN results shown in the second row
for visualization. The main drawback of MV-CNN is its inconsistency across multiple views
(e.g. the elbow and arm regions). On the other hand, our MV-RNN can correlate multiple views
and generate more coherent results.
achieved by generating a large number of views equally distributed around the object.
Secondly, shape parts can be visible from more than one view; thus, the proposed method should
effectively correlate information from multiple views. The main drawback of existing
multi-view approaches such as the multi-view convolutional neural network (MV-CNN) [11,12] is
that different views may not be correlated, and hence a 3D area may correspond to totally
different outcomes from different views. Let us take the example of a standing person rotating
counter-clockwise (Fig. 1). When the view is front facing, the boundary between the torso and
the right arm is a real boundary. At a certain point, the right arm starts to be occluded.
Then the boundary between the torso and the right arm is no longer a real boundary, but the
MV-CNN cannot distinguish the two cases because it does not correlate the results over
different views.
We propose MV-RNN to overcome this limitation by treating the sequential multiple views as a
temporal sequence and applying a recurrent neural network to capture the redundancy between
adjacent views. More specifically, in this paper we employ the long short-term memory (LSTM)
as the recurrent neural unit. The multi-view outputs from the CNN are correlated through a
two-layer LSTM to obtain consistent fine-detail responses for every view. Finally, the
boundary pixels are back-projected onto the 3D shape surface, followed by region growing and a
Conditional Random Field (CRF) to obtain the final segmentation. The main contribution of our
paper is the MV-RNN, which is, to the best of our knowledge, the first network treating
multiple views as a temporal sequence and applying LSTM to correlate adjacent views. Moreover,
since the proposed framework is purely data driven, it can be easily adapted or extended to
other tasks in shape modeling such as suggestive contours [8,9] and ridge-valley detection
[10].
In the next section, we briefly discuss existing methods related
to 3D segmentation with emphasis on deep learning. To make the
paper self-contained, we review the recurrent neural network in
Section 3 . Section 4 describes our approach in depth followed by
experimental results in Section 5 . Section 6 concludes our work.
2. Related work
Hand-crafted features : Before the era of deep learning, peo-
ple proposed many approaches (region growing [16,17] , hierarchi-
cal clustering [3,18,19] , spectral clustering [20] , k -means [21] , nor-
malized cut [22] , random walk [23] , heat walk [24] , etc.) based
on local features to segment a 3D model such as planarity of various forms, higher degree
geometric proxies (cylinders, cones, spheres, etc.), dihedral angles between triangles [25],
curvatures (Gaussian curvature or mean curvature) [26], geodesic distances on a mesh,
slippage, symmetry, convexity, medial axis, shape diameter [27] and motion characteristics
[28]. Shamir et al. [28], Agathos et al. [29] and Theologou et al. [30] gave comprehensive
overviews of methodologies in 3D segmentation. In general, these approaches are usually built
on some particular property of 3D objects and hence do not generalize well.
Image-based CNN : CNNs [31–34] are currently the mainstream in many visual recognition
problems and have been extensively applied to image semantic segmentation [35–39]. For
example, the fully convolutional network (FCN) [36] was a breakthrough in deep learning based
image semantic segmentation. In this approach, fully connected layers in a standard CNN are
replaced by convolutions with large receptive fields, and the segmentation image is obtained
from coarse class score maps produced by feeding an input image forward. However, the
deconvolution part of the network responsible for upsampling is fixed to bilinear
interpolation and only the CNN part of the network is fine-tuned. In contrast, Noh et al. [37]
proposed the deconvolution network (DeconvNet) with unpooling layers followed by convolutions,
which increases the network's capability to learn more complex deconvolution than using just
bilinear interpolation.
The holistically-nested edge detection (HED) [40] casts classical edge detection as a
CNN-based problem. An interesting idea of this work is that the final edge map is fused from
multiple edge maps obtained at different scales. The multi-scale edge maps are side outputs of
a VGG-16 network [32], and hence the shallow edge maps give fine-detail edges while the deeper
ones capture the more salient edges. The final result is a linear combination of all edge maps
at multiple scales. Our MV-RNN approach adopts HED as the sub-module for its CNN part thanks
to its high performance on natural images.
Deep learning for 3D : While deep learning has been very popular for 2D images for many years,
it has only recently been applied to 3D because, unlike pixels in 2D images, 3D objects do not
have a regular structure. As a result, in the early period, people used deep learning as a
tool to learn high-level features from low-level cues (usually hand-crafted). The unsupervised
shape segmentation proposed by Shu et al. [41] starts by over-segmenting the input model and
computing patch-based local features, and then uses a stacked auto-encoder to learn high-level
features, followed by Graph-Cut based segmentation. Guo et al. [42] compute local features at
different scales for each triangle and arrange them into a rectangular image, which is fed
forward through a convolutional neural network (CNN) to predict the semantic label for each
triangle. Although these two frameworks use deep learning techniques (stacked auto-encoder,
CNN) to learn high-level features from local low-level ones, they do not exploit the full
potential of deep learning.
A natural extension from 2D images to 3D shapes is to discretize the 3D object into 3D voxels
and apply 3D convolutions to them. The 3D ShapeNet [43] used this approach for 3D object
classification. Su et al. [11] were the first to apply a multi-view convolutional neural
network (MV-CNN) to 3D recognition. The 3D shape is rendered in multiple views, each of which
is passed through the identical image-based CNN. Features obtained from multiple views are
combined via view pooling (which is max pooling) and then passed through another CNN to
predict the final object label. Between volumetric and multi-view CNNs, the latter typically
gives higher accuracy [13]. One reason might be the higher computation and memory cost of 3D
convolutions, which in turn limits the image resolution [13]. A similar result has also been
observed in other 3D data such as videos [44–46].
Table 1
The Rand Index scores of segmentation for each category with different methods. Smaller is better.
Object categories MV-RNN MV-CNN [Shu2016] WcSeg RandCuts ShapeDiam NormCuts CoreExtra RandWalks FitPrim KMeans
Human 0.106 0.196 0.116 0.128 0.131 0.179 0.152 0.225 0.219 0.153 0.163
Cup 0.100 0.100 0.096 0.171 0.219 0.358 0.244 0.307 0.358 0.413 0.459
Glasses 0.066 0.115 0.173 0.173 0.101 0.204 0.141 0.301 0.311 0.235 0.188
Airplane 0.085 0.157 0.150 0.089 0.122 0.092 0.186 0.256 0.248 0.166 0.211
Ant 0.021 0.044 0.001 0.021 0.025 0.022 0.047 0.065 0.068 0.086 0.131
Chair 0.051 0.078 0.040 0.103 0.184 0.111 0.088 0.187 0.156 0.212 0.213
Octopus 0.022 0.060 0.036 0.029 0.063 0.045 0.061 0.051 0.067 0.101 0.101
Table 0.072 0.091 0.040 0.091 0.383 0.184 0.093 0.244 0.131 0.181 0.369
Teddy 0.035 0.055 0.024 0.056 0.045 0.057 0.121 0.114 0.128 0.132 0.182
Hand 0.076 0.122 0.135 0.116 0.090 0.202 0.155 0.155 0.189 0.202 0.154
Plier 0.054 0.143 0.151 0.087 0.109 0.375 0.183 0.093 0.230 0.169 0.263
Fish 0.146 0.253 0.288 0.203 0.297 0.248 0.394 0.273 0.388 0.424 0.413
Bird 0.059 0.119 0.171 0.101 0.107 0.115 0.184 0.124 0.250 0.196 0.190
Armadillo 0.060 0.120 0.073 0.081 0.092 0.090 0.116 0.141 0.115 0.091 0.117
Bust 0.162 0.351 0.275 0.266 0.232 0.298 0.316 0.315 0.298 0.300 0.334
Mech 0.121 0.369 0.073 0.182 0.277 0.238 0.159 0.387 0.211 0.306 0.425
Bearing 0.080 0.104 0.056 0.122 0.124 0.119 0.183 0.398 0.246 0.188 0.280
Vase 0.106 0.216 0.212 0.161 0.133 0.239 0.236 0.226 0.246 0.257 0.387
FourLeg 0.135 0.213 0.140 0.152 0.174 0.161 0.208 0.191 0.218 0.185 0.193
Average 0.082 0.154 0.118 0.123 0.153 0.176 0.172 0.211 0.215 0.210 0.251
Fig. 2. Overview of our MV-RNN approach. Given an input 3D mesh model, we render it from a sequence of ordered viewpoints. Each view is passed through an identical
(shared weights) CNN to obtain a boundary probability map; these maps are correlated by a two-layer LSTM followed by a fully connected layer. The consistent edge images from
multiple views are unprojected back to 3D, followed by region growing and a CRF for boundary smoothing.
Xie et al. [12] used multi-view depth images with an extreme learning machine to generate
per-view segmentations and combined them via Graph-Cut. This method is fast because the
extreme learning machine is easy to train, but it does not give high accuracy. Later,
Kalogerakis et al. [47] proposed a more complete multi-view framework. They first render the
3D model from different views, each of which is processed through a shared CNN before being
unprojected to 3D. Label consistency is handled by a conditional random field (CRF), which is
part of the network and is optimized in an end-to-end manner. Although this approach uses the
CRF to resolve consistency after unprojection to 3D, the semantic label images from multiple
views are aggregated in a max-pooling manner and are still not correlated.
Recently, Su et al. proposed PointNet [48] and SyncSpecCNN [49], which directly consume
non-regular 3D data (point clouds and meshes, respectively). These networks demonstrate the
flexibility of neural networks in many visual problems. However, in terms of performance,
these structures still fall behind MV-CNN approaches (if equipped with a large enough number
of views) [48].
3. Background on recurrent neural network
In contrast to a normal feed-forward neural network, which is a one-shot function, a recurrent
neural network (RNN) runs repeatedly through time, which simulates the processing capability
of the human brain. An RNN is a composition of identical feed-forward neural networks, one for
each moment, or step in time, which we will refer to as RNN cells. These cells operate on
their own output, allowing them to be composed. They can also operate on external input and
produce external output. Note that this is a much broader definition of an RNN, which depends
on the choice of RNN cell (e.g. Vanilla RNN, LSTM, etc.). Here is the algebraic description of
a Vanilla RNN cell.
$$ s_t = \phi\left( W x_t + U s_{t-1} + b \right) \tag{1} $$
where $\phi$ is the activation function (e.g. sigmoid, tanh, ReLU [31,50]). Assuming $d$ and
$h$ are the input and state (output) sizes, respectively, $s_t \in \mathbb{R}^h$ is the
current state (and current output); $s_{t-1} \in \mathbb{R}^h$ is the prior state;
$x_t \in \mathbb{R}^d$ is the current input; $W \in \mathbb{R}^{h \times d}$,
$U \in \mathbb{R}^{h \times h}$ and $b \in \mathbb{R}^h$ are weights and biases.
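As an illustration of Eq. (1), the following is a minimal sketch of a Vanilla RNN cell applied to a short sequence, written with plain Python lists. The helper names (`matvec`, `rnn_step`, `rnn_run`) and the toy weights are assumptions made for this example and are not part of the original work.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(W, U, b, x_t, s_prev, phi=math.tanh):
    """One step of Eq. (1): s_t = phi(W x_t + U s_{t-1} + b)."""
    Wx = matvec(W, x_t)      # input contribution, length h
    Us = matvec(U, s_prev)   # recurrent contribution, length h
    return [phi(a + r + c) for a, r, c in zip(Wx, Us, b)]

def rnn_run(W, U, b, xs, s0):
    """Apply the same cell to every element of a sequence and collect all states."""
    states, s = [], s0
    for x_t in xs:
        s = rnn_step(W, U, b, x_t, s)
        states.append(s)
    return states

# Toy sizes (hypothetical values): d = 2 inputs, h = 3 state units.
W = [[0.5, -0.2], [0.1, 0.3], [0.0, 0.4]]
U = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
b = [0.0, 0.1, -0.1]
states = rnn_run(W, U, b, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0, 0.0])
```

The same weights are reused at every step, which is what makes the cells "identical" in the description above; only the state carried from one step to the next changes.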
Although simple and quite powerful, the Vanilla RNN has certain disadvantages. First, it is
very difficult to exploit past information if the information constantly morphs, which leads
to the degeneration problem [34]. Second, vanishing and exploding gradients are common in
training a Vanilla RNN because it is trained with the back-propagation through time algorithm.
If the gradients explode, we cannot train our model. If they vanish, it is difficult for us to
learn
Fig. 3. Multiple views ordered in a helix-like sequence.
Fig. 4. LSTM learning process (only four views are shown due to space limit). First
row : Input shaded images to a CNN. Second row : Outputs from CNN. Third to
Tenth rows : Edges returned from LSTM during training. Last row : Ground truth
edges.
long-term dependencies, since back-propagation will be too sensi-
tive to recent distractions.
To tackle the drawbacks of the Vanilla RNN, the long short-term memory (LSTM) unit [51] was
introduced to ensure the integrity of information thanks to its written memories. Furthermore,
LSTMs use gates as a mechanism to selectively control and coordinate writing (i.e. the cell
memory is written, read and forgotten selectively).
Unlike the Vanilla RNN, an LSTM network is well-suited to learn from experience to classify,
process and predict time series when there are time lags of unknown size and bound between
important events. Consequently, LSTMs achieved the best known results in natural language text
compression and unsegmented connected handwriting recognition. Recently, researchers have been
integrating LSTMs into computer vision tasks such as image segmentation [52], activity
recognition, image captioning, video description [46] and 3D object reconstruction [14].
4. Multi-view recurrent neural network (MV-RNN)
Given an input 3D shape, our goal is to segment it into parts based on the prior knowledge
learned from a pre-segmented training dataset. We design an MV-RNN network to this end. Our
network architecture is visualized in Fig. 2. It takes as input a set of images from multiple
views which are equally distributed over the 3D model; segments these images by generating
per-view boundary probability maps; correlates them by a two-layer LSTM followed by a fully
connected layer; and returns the consistent edges, which are back projected to the 3D surface
and finally integrated by a CRF. In the following sections, we elaborate on the input to our
network, its layers and the training procedure.
4.1. Input
The input to our whole algorithm is a 3D shape represented
as a polygonal mesh. As a pre-processing step, we normalize and
scale it to fit into the unit sphere. Then our algorithm renders the
object in K different views (we set K = 60 based on our experi-
ments). We first equally partition the unit sphere into K regions
using [53]. These regions serve as camera locations. More importantly, to make these views
learnable for the LSTM, we arrange these locations in a sequence so that adjacent locations
are next to each other, as in Fig. 3. To make all views oriented consistently, we choose the
camera up vector pointing to a very far away fixed point (e.g. [0, 0, 100]). The camera always
looks at the origin since the model is normalized.
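To make the idea of an ordered, helix-like camera sequence concrete, the sketch below places K viewpoints on the unit sphere along a simple spherical spiral, so that consecutive viewpoints are spatial neighbors. This is a stand-in chosen for illustration and is not the partition scheme of [53]; the function names, and the choice of centering the mesh at its vertex centroid before scaling, are assumptions.

```python
import math

def spiral_viewpoints(K=60):
    """Return K camera positions on the unit sphere, ordered along a spiral.

    Points sit on equal-height slices (roughly equal area per point) and the
    azimuth grows in proportion to the polar angle, so neighbors in the list
    are also neighbors on the sphere. Illustrative stand-in, not [53].
    """
    turns = math.sqrt(K * math.pi)          # controls how tightly the spiral winds
    cams = []
    for k in range(1, K + 1):
        z = 1.0 - (2.0 * k - 1.0) / K       # height of the k-th slice, in (-1, 1)
        polar = math.acos(z)
        azimuth = turns * polar
        r = math.sin(polar)
        cams.append((r * math.cos(azimuth), r * math.sin(azimuth), z))
    return cams

def normalize_to_unit_sphere(vertices):
    """Center vertices at their centroid (an assumption) and scale to radius 1."""
    n = len(vertices)
    c = [sum(v[i] for v in vertices) / n for i in range(3)]
    centered = [tuple(v[i] - c[i] for i in range(3)) for v in vertices]
    radius = max(math.sqrt(sum(x * x for x in v)) for v in centered) or 1.0
    return [tuple(x / radius for x in v) for v in centered]

cams = spiral_viewpoints(60)
unit_mesh = normalize_to_unit_sphere([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 1.0, 0.0)])
```

Each position in `cams` would then be used as a camera location looking at the origin, with the up vector pointing toward a distant fixed point as described above.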
In general, CNNs are quite robust to lighting and illumination, so we render shaded, grayscale
images using the Phong reflection model [54] with the light source always behind the camera
for every view. We also experimented with depth images (with HHA encoding [55]), normal images
and combinations of them, but the results are not better than using the shaded images alone.
To make the training faster, we opt to use an image resolution of 128 × 128 without
sacrificing the overall segmentation accuracy of the framework.
In addition, for each camera setting, we store the 3D vertex corresponding to each pixel. The
correspondence is determined by the proximity of the 3D point unprojected from the 2D pixel
and the closest 3D vertex (the distance between them must be less than $10^{-3}$; otherwise
there is no 3D vertex corresponding to that pixel). The stored information is used for the
back projection later on.
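The pixel-to-vertex rule can be sketched as a nearest-neighbor lookup with a distance cutoff. The sketch assumes the unprojected 3D point of each pixel is already available (for example from the renderer's depth buffer); the function name and the brute-force search are illustrative choices, and a spatial index would be preferable for large meshes.

```python
import math

def nearest_vertex(point, vertices, tol=1e-3):
    """Return the index of the closest vertex, or None if it is farther than tol."""
    best_idx, best_dist = None, float("inf")
    for idx, v in enumerate(vertices):
        d = math.sqrt(sum((p - q) ** 2 for p, q in zip(point, v)))
        if d < best_dist:
            best_idx, best_dist = idx, d
    # Pixels whose unprojected point is not close to any vertex get no correspondence.
    return best_idx if best_dist < tol else None

# Hypothetical two-vertex mesh and two unprojected pixel points.
mesh_vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
hit = nearest_vertex((1.0, 0.0005, 0.0), mesh_vertices)
miss = nearest_vertex((0.5, 0.0, 0.0), mesh_vertices)
```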
4.2. CNN module
The shaded images produced in the previous step are processed through identical image-based
CNNs. There are many choices of CNN architecture such as FCN [36], DeconvNet [37] and HED
[40]. We opt for HED because of its edge detection nature. Each HED module outputs a grayscale
image of the same size as the input shaded image (i.e. 128 × 128), which is the boundary
probability map. Specifically, in our implementation, we employ the HED architecture suggested
in [40], which adopted the VGG-16 network [32] for dense prediction by truncating after the
pool5 layer and fusing multiple side outputs. Since HED is trained on RGB color images, we
need to replicate our shaded grayscale images into three channels.
Fig. 4 shows the boundary probability maps in multiple views (only four views are shown here).
As we can see, the probability maps are neither well-localized nor consistent. The
inconsistency problem cannot be solved by optimizing individual views alone, but by
aggregating them together in a more intelligent way. Recurrent neural networks (RNNs)
represent a type of neural network with loop connections [56], which allow them to capture
long-range dependencies through gates and memory structures (such as LSTM [51]). Consequently,
multiple views can be cast as a time series which can be learned by such an LSTM.
4.3. LSTM module
As mentioned in Section 1, the goal of this layer is to correlate multiple views and generate
consistent boundary maps. An LSTM network, which treats the view sequence as a time series, is
well-suited here. First, we unroll the 2D boundary probability maps and ground truth boundary
maps into vectors of size 128 × 128 = 16384. A two-layer LSTM (with one LSTM stacked over the
other) is deployed so that the first LSTM takes the sequence of ordered (unrolled) boundary
probability maps and produces a sequence of hidden states for the second LSTM, which
eventually outputs the sequence of coherent boundary maps. We use the same number of hidden
units (1024) for both peephole LSTMs [57] with the following updates.
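$$ i_t = \mathrm{sigmoid}\left( W_i x_t + U_i c_{t-1} + b_i \right) \tag{2} $$
$$ f_t = \mathrm{sigmoid}\left( W_f x_t + U_f c_{t-1} + b_f \right) \tag{3} $$
$$ o_t = \mathrm{sigmoid}\left( W_o x_t + U_o c_{t-1} + b_o \right) \tag{4} $$
$$ c_t = f_t \circ c_{t-1} + i_t \circ \tanh\left( W_c x_t + b_c \right) \tag{5} $$
$$ h_t = o_t \circ c_t \tag{6} $$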
where $x_t \in \mathbb{R}^d$ is the current input; $h_t \in \mathbb{R}^h$ is the current
output; $c_t$ (and $c_{t-1}$) $\in \mathbb{R}^h$ are the current (and prior) memory states;
$W_i, W_f, W_o, W_c \in \mathbb{R}^{h \times d}$, $U_i, U_f, U_o, U_c \in \mathbb{R}^{h \times h}$
and $b_i, b_f, b_o, b_c \in \mathbb{R}^h$ are weights and biases; and '$\circ$' denotes
element-wise multiplication. In our case, $d = 128 \times 128 = 16384$ and $h = 1024$. The
output of the second LSTM is passed through a fully connected layer to map back to a
$d$-dimensional edge image.
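The updates in Eqs. (2)-(6) can be traced with a small sketch of a single peephole LSTM step. It follows the equations as printed (the gates read the previous memory state, and Eq. (5) has no recurrent term, so $U_c$ does not appear); the parameter dictionary layout, the function names and the toy sizes are assumptions for illustration only, far smaller than d = 16384 and h = 1024.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gate(W, U, b, x_t, c_prev):
    """sigmoid(W x_t + U c_{t-1} + b), the common form of Eqs. (2)-(4)."""
    Wx, Uc = matvec(W, x_t), matvec(U, c_prev)
    return [sigmoid(a + r + k) for a, r, k in zip(Wx, Uc, b)]

def lstm_step(p, x_t, c_prev):
    """One peephole LSTM step; returns (h_t, c_t)."""
    i_t = gate(p["Wi"], p["Ui"], p["bi"], x_t, c_prev)   # Eq. (2)
    f_t = gate(p["Wf"], p["Uf"], p["bf"], x_t, c_prev)   # Eq. (3)
    o_t = gate(p["Wo"], p["Uo"], p["bo"], x_t, c_prev)   # Eq. (4)
    cand = [math.tanh(a + k) for a, k in zip(matvec(p["Wc"], x_t), p["bc"])]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, cand)]  # Eq. (5)
    h_t = [o * c for o, c in zip(o_t, c_t)]               # Eq. (6)
    return h_t, c_t

# Toy check with d = 2, h = 1 and all-zero parameters: every gate equals 0.5.
p = {name: [[0.0, 0.0]] for name in ("Wi", "Wf", "Wo", "Wc")}
p.update({name: [[0.0]] for name in ("Ui", "Uf", "Uo")})
p.update({name: [0.0] for name in ("bi", "bf", "bo", "bc")})
h_t, c_t = lstm_step(p, [1.0, 2.0], [1.0])
```

In a two-layer arrangement, the sequence of outputs $h_t$ from the first LSTM would serve as the input sequence $x_t$ of the second.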
Fig. 4 illustrates how the LSTM can help correct and correlate the edge probability maps
produced by the MV-CNN. For example, the boundaries between the torso and the two legs are
quite different among the four views, which may result in inconsistent edge information when
unprojecting them to the 3D mesh. However, as the LSTM consumes the whole view sequence, the
edges at convergence are all consistent.
4.4. Training
We train our network in a two-stage approach. In the first stage, we train the HED module. We
randomly rotate each 3D model in 16 different ways. The network takes as input a pair of
images: a shaded image and a ground truth boundary map. We use the sigmoid cross-entropy loss
for all five side outputs and the fused output. The network is initialized from VGG-16 weights
[32]. We use the Adam optimizer [58] with a fixed learning rate of $10^{-7}$ and a batch size
of 16, and train for 100,000 iterations. The first stage of training takes three days on an
NVIDIA Titan X.
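For reference, the sigmoid cross-entropy loss on one output map can be written in a numerically stable form that works directly on pre-sigmoid scores. The sketch below averages over pixels and sums the six outputs with equal weight; both choices, as well as the function names, are assumptions for illustration, since the text does not state how the per-output losses are weighted.

```python
import math

def sigmoid_cross_entropy(logits, targets):
    """Mean of -[z*log(s(x)) + (1-z)*log(1-s(x))] over pixels, with s the sigmoid.

    Uses the equivalent stable form max(x, 0) - x*z + log(1 + exp(-|x|)).
    """
    total = 0.0
    for x, z in zip(logits, targets):
        total += max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

def total_loss(output_logits, targets):
    """Equal-weight sum over the five side outputs and the fused output (assumed)."""
    return sum(sigmoid_cross_entropy(out, targets) for out in output_logits)

# Hypothetical 2-pixel example: an uninformative prediction (logit 0) costs log 2.
uninformative = sigmoid_cross_entropy([0.0, 0.0], [1.0, 0.0])
confident = sigmoid_cross_entropy([1000.0, -1000.0], [1.0, 0.0])
six_outputs = total_loss([[0.0, 0.0]] * 6, [1.0, 0.0])
```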
After the HED module is trained, it is fixed for training the LSTM module in the second stage.
The two-layer LSTM takes as input a pair of sequences: boundary probability maps from the HED
and ground truth boundary maps. We also use the Adam optimizer [58] with an initial learning
rate of 0.01 (as this optimization algorithm is able to compute adaptive learning rates for
each parameter) and a batch size of 1 (due to the memory limit), and train for 7000 epochs.
Each view sequence is processed bidirectionally, which yields two sequences per shape. The
second stage of training takes three days on an NVIDIA Titan X.
4.5. Back projection to 3D and post-processing
The consistent boundary maps produced by the LSTM network are back projected to the 3D surface
using the stored pixel-to-vertex information (see Section 4.1). It is possible that many
pixels (typically from different views) map to the same vertex, so we take the maximum
response as the final value. For each edge of the mesh model, we assign a boundary probability
defined as the average of the boundary probabilities of the two vertices that it connects.
Finally, a binary boundary edge map is created by thresholding (we set the threshold to 0.5).
These boundary edges function as the borders of the regions to be segmented. Thus, we use
simple region growing to find the initial segmentation with the boundary edges as blockers. A
region with a big enough area is considered a segment. The polygons near the boundaries may be
unlabeled due to projection error. Denote $h_v$ as the initial label
Fig. 5. Representative segmentation results produced by our MV-RNN on PSB dataset.
Fig. 6. Performance plots of different segmentation algorithms with respect to four evaluation metrics. Lower value is better.
Fig. 7. Comparison of segmentation algorithms.
for polygon $v$, where $h_v = 0$ if $v$ has no label. We expect that correct labels will be
propagated to them via a CRF. Let $V$ be the set of all polygons in a 3D shape; a CRF $f$ with
unary and pairwise terms operating on the surface representation is defined as follows.
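$$ E(f) = \sum_{v \in V} E_{\mathrm{unary}}(f_v) + \sum_{(u,v) \in V^2} E_{\mathrm{pairwise}}(f_u, f_v) \tag{7a} $$
$$ E_{\mathrm{unary}}(f_v = l) = \begin{cases} 0, \ \forall l & \text{if } h_v = 0 \\ 0 & \text{if } h_v = l \\ \infty & \text{otherwise} \end{cases} \tag{7b} $$
$$ E_{\mathrm{pairwise}}(f_u = l_u, f_v = l_v) = \begin{cases} e^{-d^2(u,v)} & \text{if } l_u \neq l_v \\ e^{-(1 - d(u,v))^2} & \text{if } l_u = l_v \end{cases} \tag{7c} $$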
where $d(u, v)$ is the geodesic distance [59,60] between polygon $u$ and polygon $v$. All
distances are normalized to [0, 1].
The unary term states that we only want to correct unlabeled polygons, while the pairwise
terms favor the same label for adjacent polygons. We use mean-field approximation [61] to
solve (7a).
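The aggregation that opens Section 4.5 (maximum response per vertex over all views, averaging over the two endpoints of each mesh edge, thresholding at 0.5) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the dictionary-based data layout and the names `back_project` and `boundary_edges` are assumptions, and the text does not say whether the threshold comparison is strict.

```python
def back_project(view_maps, pixel_to_vertex, num_vertices):
    """Aggregate per-view boundary probabilities onto mesh vertices.

    view_maps[k] maps a pixel (row, col) to its boundary probability in view k;
    pixel_to_vertex[k] maps the same pixel to a vertex index, or None when the
    pixel has no corresponding vertex. Each vertex keeps its maximum response.
    """
    vertex_prob = [0.0] * num_vertices
    for probs, mapping in zip(view_maps, pixel_to_vertex):
        for pixel, p in probs.items():
            v = mapping.get(pixel)
            if v is not None:
                vertex_prob[v] = max(vertex_prob[v], p)
    return vertex_prob

def boundary_edges(edges, vertex_prob, threshold=0.5):
    """Keep mesh edges whose mean endpoint probability exceeds the threshold."""
    return [(a, b) for a, b in edges
            if 0.5 * (vertex_prob[a] + vertex_prob[b]) > threshold]

# Hypothetical data: two views observing a three-vertex strip.
view_maps = [{(0, 0): 0.2, (0, 1): 0.8}, {(0, 0): 0.9, (0, 1): 0.1}]
pixel_to_vertex = [{(0, 0): 0, (0, 1): 1}, {(0, 0): 0, (0, 1): 2}]
vertex_prob = back_project(view_maps, pixel_to_vertex, 3)
blockers = boundary_edges([(0, 1), (1, 2)], vertex_prob)
```

The resulting `blockers` are the edges that region growing is not allowed to cross when forming the initial segments.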
5. Evaluation
In this section, we present experimental validations and analyses of our approach. We test the
segmentation algorithm on the well-known Princeton Segmentation Benchmark dataset [1]. This
dataset has been intensively used to evaluate 3D shape segmentation and 3D shape retrieval
algorithms. The dataset has 19 different object categories with 20 objects in each category,
which results in 380 models in total. For each category, we randomly select 16 models for
training and 4 models for testing. Since there are multiple human-generated segmentations for
each model, we manually select the one segmentation that is the most consistent within the
object category. The ground truth edge images can be easily obtained by rendering the edges
between different segments overlaid by the 3D shape with the same color as the background. To
further enhance the quality of the ground truth images, we use polygon offset in OpenGL. The
ground truth edge images are used in training both the MV-CNN and the LSTM. Fig. 5 shows some
representative segmentations of our MV-RNN approach on this dataset.
To evaluate our segmentation method, we adopt four metrics defined by Chen et al. [1]: Rand
Index, Cut Discrepancy, Hamming Distance and Consistency Error. Rand Index, named after
William M. Rand, measures the similarity between two segmentations of the same shape. From a
mathematical point of view, Rand Index is related to accuracy, but is applicable even when
class labels are not used. In this paper, we use the Rand Index Error, which equals one minus
the Rand Index. Cut Discrepancy is a boundary-based method evaluating the distance between
different cuts. It sums the distances from points along the cuts in the computed segmentation
to the closest cuts in the ground truth segmentation, and vice-versa. Hamming Distance, named
after Richard Hamming, is a region-based method and measures the number of substitutions
required to change one region into the other. Hamming Distance is directional, hence it
includes missing rate (Rm) and false alarm (Rf) distances. Consistency Errors, whether the
global version (GCE) or local version (LCE), are used to
Fig. 8. More comparisons of segmentation algorithms.
compute the hierarchical differences and similarities between segmentations, which are based
on the theory that human perceptual organization imposes a hierarchical tree structure on
objects. For all four metrics, a smaller value indicates a better result.
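As a concrete reading of the Rand Index Error, the sketch below counts the pairs of mesh elements on which two segmentations agree (both place the pair in the same segment, or both separate it) and returns one minus the agreeing fraction. It treats every element equally and uses a quadratic pair loop for clarity; the function name is an assumption and the benchmark's own tooling may differ in detail.

```python
from itertools import combinations

def rand_index_error(seg_a, seg_b):
    """Return 1 - Rand Index for two labelings of the same elements."""
    n = len(seg_a)
    if n != len(seg_b):
        raise ValueError("both segmentations must label the same elements")
    if n < 2:
        return 0.0
    agree = 0
    for i, j in combinations(range(n), 2):
        same_a = seg_a[i] == seg_a[j]
        same_b = seg_b[i] == seg_b[j]
        if same_a == same_b:          # the pair is treated alike by both segmentations
            agree += 1
    return 1.0 - agree / (n * (n - 1) / 2)

# Label names do not matter, only the grouping does.
relabeled = rand_index_error([0, 0, 1, 1], [5, 5, 7, 7])
crossed = rand_index_error([0, 0, 1, 1], [0, 1, 0, 1])
```

Because only the grouping of pairs is compared, the measure applies to mesh segmentation, where segments carry no semantic labels.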
Comparison : We compare our method with the following seg-
mentation algorithms:
• MV-CNN: we apply non-maximum suppression [15] on the
boundary probability maps returned from the multi-view CNN
(HED in this case) and unproject them back to 3D (without
LSTM) followed by CRF. This serves as a baseline for multi-view
paradigm.
• [Shu2016] [41] : an unsupervised 3D shape segmentation via
stacked auto-encoders.
• WcSeg [62] : approximate convexity analysis.
• RandCuts [22] : randomized cuts.
• ShapeDiam [27] : shape diameter function.
• NormCuts [22] : normalized cuts.
• CoreExtra [63] : core extraction.
• RandWalks [23] : random walks.
• FitPrim [18] : fitting primitives.
• KMeans [64] : k -means.
Figs. 7 and 8 provide a side-by-side comparison of segmentations obtained from various
algorithms. Although there are large shape variations, the vast majority of our segmentation
results are desirable and consistent with our perception. The baseline MV-CNN indeed yields
better segmentations than some of the methods based on hand-crafted features such as k-means,
fitting primitives and random walks. Due to the inconsistency of the boundary probability maps
across multiple views, the MV-CNN is still not as good as the shape diameter function.
However, the added LSTM contributes significantly to the overall robustness, which vastly
improves the multi-view paradigm.
Numerical comparison: The Rand Index score statistics of our
segmentation on the dataset, as well as those of other methods,
are detailed in Table 1, from which we can see that our algorithm
obtains an average Rand Index of 0.084, which outperforms
the related algorithms. In addition to the Rand Index, our MV-RNN
also outperforms the other methods with respect to the other
evaluation metrics (see Fig. 6 and Table 2). Compared with the
baseline MV-CNN, the LSTM in our framework yields a significant
improvement because it correlates the outputs from the CNN across
multiple views.
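To make the reported score concrete, the following is a minimal sketch of a Rand Index dissimilarity between two per-face labelings of the same mesh. It assumes the score is reported as one minus the fraction of face pairs on which the two segmentations agree, which matches the "smaller is better" convention used above; the function name and the list-of-labels input format are illustrative assumptions, not the benchmark toolbox's interface.

```python
from collections import Counter
from math import comb


def rand_index_dissimilarity(seg_a, seg_b):
    """Return 1 - RI for two labelings of the same faces (lower is better).

    seg_a and seg_b are sequences of segment labels, one per mesh face.
    The label values themselves do not matter, only which faces share a label.
    """
    if len(seg_a) != len(seg_b):
        raise ValueError("both segmentations must label the same faces")
    n = len(seg_a)
    if n < 2:
        raise ValueError("at least two faces are required")

    total_pairs = comb(n, 2)
    # Pairs of faces that share a segment in A, in B, and in both.
    same_a = sum(comb(c, 2) for c in Counter(seg_a).values())
    same_b = sum(comb(c, 2) for c in Counter(seg_b).values())
    same_both = sum(comb(c, 2) for c in Counter(zip(seg_a, seg_b)).values())
    # Pairs separated in both segmentations (inclusion-exclusion).
    diff_both = total_pairs - same_a - same_b + same_both

    agreement = (same_both + diff_both) / total_pairs
    return 1.0 - agreement
```

Because only pair membership is counted, relabeling the segments of either input leaves the score unchanged, so no label matching between the two segmentations is needed.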
Different number of views: We also experiment with various
values of K. According to Fig. 9, using too few views
is not good due to occlusion. As more views, equally distributed
around the object, are used, the object's surface area is more fully
covered, hence we get higher accuracy (i.e., a lower Rand Index score).
We choose K = 60 as a reasonable trade-off between accuracy and
time/memory consumption.
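As an illustration of placing K roughly equally distributed viewpoints around an object, the sketch below uses a Fibonacci lattice on a sphere. This lattice is a substitute chosen for brevity and is not claimed to be the placement scheme used in our experiments; the function name and the `radius` parameter are hypothetical.

```python
import math


def fibonacci_sphere_views(k, radius=1.0):
    """Return k camera positions spread approximately evenly on a sphere.

    Each position is an (x, y, z) tuple at distance `radius` from the
    origin, where the object is assumed to be centered.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    views = []
    for i in range(k):
        # Heights are spaced evenly in (-1, 1), giving equal-area bands.
        z = 1.0 - (2.0 * i + 1.0) / k
        ring = math.sqrt(max(0.0, 1.0 - z * z))
        # Successive points rotate by the golden angle around the z axis.
        theta = golden_angle * i
        views.append((radius * ring * math.cos(theta),
                      radius * ring * math.sin(theta),
                      radius * z))
    return views


cameras = fibonacci_sphere_views(60)
```

Each camera would then look toward the origin; varying `k` in such a routine is one simple way to repeat the experiment summarized in Fig. 9.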
Table 2
Average cut discrepancy, Hamming distance, and consistency error scores of segmentation for each category with different methods. Smaller is better.
Metric           MV-RNN  MV-CNN  [Shu2016]  WcSeg  RandCuts  ShapeDiam  NormCuts  CoreExtra  RandWalks  FitPrim  KMeans
Cut Discrepancy  0.144   0.220   0.212      0.211  0.263     0.275      0.282     0.375      0.367      0.341    0.409
Hamming          0.075   0.129   0.124      0.116  0.136     0.166      0.177     0.169      0.203      0.239    0.277
Hamming-Rm       0.061   0.104   0.130      0.118  0.152     0.187      0.195     0.126      0.209      0.293    0.345
Hamming-Rf       0.089   0.153   0.118      0.114  0.119     0.146      0.158     0.213      0.198      0.186    0.209
GCE              0.060   0.107   0.099      0.098  0.126     0.130      0.159     0.135      0.179      0.217    0.251
LCE              0.041   0.062   0.070      0.065  0.073     0.082      0.102     0.086      0.104      0.142    0.168
Fig. 9. The Rand Index with respect to the number of views. We choose K = 60 as
a reasonable trade-off between accuracy and time/memory usage.
Fig. 10. Limitation of our approach. The area under the torso is occluded and hence
the left and right thighs are not separated although our MV-RNN can detect 2D
edges correctly in all views.
5.1. Limitation
Because our approach belongs to the multi-view paradigm, it
has a common occlusion issue. For example, the left and right
thighs of the man in Fig. 10 are not separated due to occlusion (i.e.
the area under the torso is not revealed from any of the K = 60 views).
Increasing the number of views could reduce the occlusions at the cost
of more computation. Since we can easily compute the occluded
areas given the current set of views, we plan to use adaptive best
view prediction to focus the camera on these areas, which is
similar to the next-best-view prediction in the 3D attention model
proposed by Xu et al. [65].
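One simple way to find the occluded areas for a given set of views is sketched below. It assumes that each view has been rendered into a face-ID buffer, i.e. a 2D grid holding the index of the visible triangle at each pixel and -1 for background; this buffer format and the function name are assumptions made for illustration and may differ from an actual implementation.

```python
def occluded_faces(num_faces, face_id_images, background=-1):
    """Return the sorted indices of faces that are visible in no view.

    face_id_images is an iterable of 2D grids (lists of rows), one per
    view, where each entry is a face index or `background`.
    """
    seen = set()
    for image in face_id_images:
        for row in image:
            for face_id in row:
                if face_id != background:
                    seen.add(face_id)
    return sorted(set(range(num_faces)) - seen)


# Two tiny 2x2 "renderings" of a 5-face mesh: faces 3 and 4 are never seen.
views = [
    [[0, 1], [-1, 1]],
    [[2, -1], [2, 0]],
]
hidden = occluded_faces(5, views)
```

The returned faces mark the regions that an additional, adaptively chosen view would need to cover.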
6. Conclusion
We have presented our novel MV-RNN for 3D shape segmentation,
which combines the MV-CNN and LSTM to enhance the
multi-view paradigm. To the best of our knowledge, we are the
first group that treats multiple views as a temporal sequence and
applies an RNN to predict the edge images by aggregating the
corresponding edge probability maps obtained by feed-forwarding an
MV-CNN. Our MV-RNN detects 3D edges in an end-to-end manner,
and the segmentation is obtained as a post-processing step. The 3D
edges can be either semantic-based (e.g. semantic segmentation)
or geometric-based (e.g. CAD model segmentation, suggestive
contour, ridge and valley). According to our experimental results on
the Princeton Segmentation Benchmark dataset, our MV-RNN
compares favorably with other state-of-the-art methods on mesh
segmentation.
In the future, we would like to conduct more experiments on
different datasets such as those in [49,66]. Additionally, our
framework currently works on meshes only, and we would like
to extend it to handle point clouds as well. The proposed
framework is purely data-driven, thus we would also like to
extend our method to other interesting problems in shape modeling
such as suggestive contours [8,9] and ridge-valley detection [10].
Acknowledgment
We would like to acknowledge the authors of the Princeton
Segmentation Benchmark [1], who made the dataset public and
provided the evaluation toolbox. We also appreciate the authors of HED
[40] for their edge detection network. Last but not least, we would
like to thank all the authors of the other segmentation algorithms
[18,22,23,27,41,62–64] for contributing their segmentation
results on the Princeton Segmentation Benchmark dataset.
References
[1] Chen X, Golovinskiy A, Funkhouser T. A benchmark for 3d mesh segmentation. ACM Trans Graph 2009;28(3). 73:1–73:12.
[2] Biasotti S, Marini S, Mortara M, Patane G. An overview on properties and efficacy of topological skeletons in shape modeling. In: Shape modeling international; 2003. p. 245–54.
[3] Katz S, Tal A. Hierarchical mesh decomposition using fuzzy clustering and cuts. ACM Trans Graph 2003;22(3):954–61.
[4] Funkhouser T, Kazhdan M, Shilane P, Min P, Kiefer W, Tal A, et al. Modeling by example. ACM Trans Graph 2004;23(3):652–63.
[5] Zockler M, Stalling D, Hege H-C. Fast and intuitive generation of geometric shape transitions. Vis Comput 2000;16(5):241–53.
[6] Zuckerberger E, Tal A, Shlafman S. Polyhedral surface decomposition with applications. Comput Graph 2002;26(5):733–43.
[7] Levy B, Petitjean S, Ray N, Maillot J. Least squares conformal maps for automatic texture atlas generation. ACM Trans Graph 2002;21(3):362–71.
[8] DeCarlo D, Finkelstein A, Rusinkiewicz S, Santella A. Suggestive contours for conveying shape. ACM Trans Graph 2003;22(3):848–55.
[9] Burns M, Klawe J, Rusinkiewicz S, Finkelstein A, DeCarlo D. Line drawings from volume data. ACM Trans Graph 2005;24(3):512–18.
[10] Ohtake Y, Belyaev A, Seidel H-P. Ridge-valley lines on meshes via implicit surface fitting. ACM Trans Graph 2004;23(3):609–12.
[11] Su H, Maji S, Kalogerakis E, Learned-Miller EG. Multi-view convolutional neural networks for 3d shape recognition. In: IEEE international conference on computer vision; 2015.
[12] Xie Z, Xu K, Shan W, Liu L, Xiong Y, Huang H. Projective feature learning for 3d shapes with multi-view depth images. Comput Graph Forum 2015;34(7):1–11.
[13] Qi CR, Su H, Nießner M, Dai A, Yan M, Guibas L. Volumetric and multi-view cnns for object classification on 3d data. In: IEEE international conference on computer vision and pattern recognition.
[14] Choy CB, Xu D, Gwak J, Chen K, Savarese S. 3d-r2n2: a unified approach for single and multi-view 3d object reconstruction. In: IEEE european conference on computer vision; 2016. p. 628–44. ISBN 978-3-319-46484-8.
[15] Dollar P, Zitnick CL. Structured forests for fast edge detection. In: IEEE international conference on computer vision; 2013.
[16] Vieira M, Shimada K. Surface mesh segmentation and smooth surface extraction through region growing. Comput Aided Geometr Des 2005;22:771–92.
[17] Jagannathan A, Miller E. Three-dimensional surface mesh segmentation using curvedness-based region growing approach. IEEE Trans Pattern Anal Mach Intell 2007;29(12):2195–204.
[18] Attene M, Falcidieno B, Spagnuolo M. Hierarchical mesh segmentation based on fitting primitives. Vis Comput 2006;22:181–93.
[19] Garland M, Willmott A, Heckbert PS. Hierarchical face clustering on polygonal surfaces. In: Processing of the symposium on interactive 3D graphics. New York, NY, USA: ACM; 2001. p. 49–58. ISBN 1-58113-292-1. I3D '01.
[20] Shi J, Malik J. Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 2000;22(8):888–905.
[21] Yamauchi H, Lee S, Lee Y, Ohtake Y, Belyaev AG, Seidel H-P. Feature sensitive mesh segmentation with mean shift. In: Processing of the international conference on shape modeling and applications. IEEE; 2005. p. 238–45. ISBN 0-7695-2379-X.
[22] Golovinskiy A, Funkhouser T. Randomized cuts for 3d mesh analysis. ACM Trans Graph 2008;27(5). 145:1–145:12.
[23] Lai Y-K, Hu S-M, Martin RR, Rosin PL. Fast mesh segmentation using random walks. In: Processing of the ACM symposium on solid and physical modeling. New York, NY, USA: ACM; 2008. p. 183–91. ISBN 978-1-60558-106-4. SPM '08.
[24] Benjamin W, Polk AW, Vishwanathan S, Ramani K. Heat walk: robust salient segmentation of non-rigid shapes. Comput Graph Forum 2011;30(7):2097–106.
[25] Xiao D, Lin H, Xian C, Gao S. Cad mesh model segmentation by clustering. Comput Graph 2011;35(3):685–91. Shape Modeling International (SMI) Conference 2011.
[26] Lavoue G, Dupont F, Baskurt A. A new cad mesh segmentation method based on curvature tensor analysis. Comput Aided Des 2005;37(10):975–87.
[27] Shapira L, Shamir A, Cohen-Or D. Consistent mesh partitioning and skeletonisation using the shape diameter function. Vis Comput 2008;24(4):249–59.
[28] Shamir A. A survey on mesh segmentation techniques. Comput Graph Forum 2008;27(6):1539–56.
[29] Agathos A, Pratikakis I, Perantonis S, Sapidis N, Azariadis P. 3d mesh segmentation methodologies for cad applications. Comput Aided Des Applic 2007;4(6):827–41.
[30] Theologou P, Pratikakis I, Theoharis T. A comprehensive overview of methodologies and performance evaluation frameworks in 3d mesh segmentation. Comput Vis Image Underst 2015;135:49–82.
[31] Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems; 2012. p. 1097–105.
[32] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: International conference on learning representations.
[33] Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. In: IEEE conference on computer vision and pattern recognition; 2015. p. 1–9.
[34] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: IEEE conference on computer vision and pattern recognition; 2016. p. 770–8.
[35] Farabet C, Couprie C, Najman L, LeCun Y. Learning hierarchical features for scene labeling. IEEE Trans Pattern Anal Mach Intell 2013;35(8):1915–29.
[36] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: IEEE international conference on pattern recognition; 2015.
[37] Noh H, Hong S, Han B. Learning deconvolution network for semantic segmentation. In: IEEE international conference on computer vision; 2015.
[38] Sharma A, Tuzel O, Jacobs DW. Deep hierarchical parsing for semantic segmentation. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR); 2015. p. 530–8.
[39] Hong S, Oh J, Lee H, Han B. Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. Comput Res Repos - arXiv 2015. Url: https://arxiv.org/abs/1512.07928.
[40] Xie S, Tu Z. Holistically-nested edge detection. In: IEEE international conference on computer vision; 2015. p. 1395–403.
[41] Shu Z, Qi C, Xin S, Hu C, Wang L, Zhang Y, et al. Unsupervised 3d shape segmentation and co-segmentation via deep learning. Comput Aided Geometr Des 2016;43:39–52. Geometric Modeling and Processing 2016.
[42] Guo K, Zou D, Chen X. 3d mesh labeling via deep convolutional neural networks. ACM Trans Graph 2015;35(1). 3:1–3:12.
[43] Wu Z, Song S, Khosla A, Yu F, Zhang L, Tang X, et al. 3d shapenets: a deep representation for volumetric shapes. In: IEEE international conference on computer vision and pattern recognition; 2015. p. 1912–20. ISBN 978-1-4673-6964-0.
[44] Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L. Large-scale video classification with convolutional neural networks. In: 2014 IEEE conference on computer vision and pattern recognition; 2014. p. 1725–32.
[45] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems; 2014b. p. 568–76.
[46] Donahue J, Hendricks LA, Rohrbach M, Venugopalan S, Guadarrama S, Saenko K, et al. Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans Pattern Anal Mach Intell 2017;39(4):677–91.
[47] Kalogerakis E, Averkiou M, Maji S, Chaudhuri S. 3d shape segmentation with projective convolutional networks. Comput Res Repos - arXiv 2016. Url: http://arxiv.org/abs/1612.02808.
[48] Qi CR, Su H, Mo K, Guibas LJ. Pointnet: Deep learning on point sets for 3d classification and segmentation. Computing Research Repository - arXiv 2016b. Url: https://arxiv.org/abs/1612.00593.
[49] Yi L, Kim VG, Ceylan D, Shen I-C, Yan M, Su H, et al. A scalable active framework for region annotation in 3d shape collections. ACM Trans Graph 2016;35(6). 210:1–210:12.
[50] Nair V, Hinton GE. Rectified linear units improve restricted boltzmann machines. In: IEEE international conference on machine learning; 2010. p. 807–14.
[51] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9(8):1735–80.
[52] Li Z, Gan Y, Liang X, Yu Y, Cheng H, Lin L. Lstm-cf: unifying context modeling and fusion with lstms for rgb-d scene labeling. In: IEEE european conference on computer vision; 2016. p. 541–57. ISBN 978-3-319-46475-6.
[53] Leopardi P. A partition of the unit sphere into regions of equal area and small diameter. Electron Trans Numer Anal 2006;25.
[54] Phong BT. Illumination for computer generated pictures. Commun ACM 1975;18(6):311–17.
[55] Gupta S, Girshick R, Arbeláez P, Malik J. Learning rich features from rgb-d images for object detection and segmentation. In: IEEE european conference on computer vision; 2014. p. 345–60. ISBN 978-3-319-10584-0.
[56] Schmidhuber J. A local learning algorithm for dynamic feedforward and recurrent networks. Connect Sci 1989;1:403–12.
[57] Gers FA, Schmidhuber E. Lstm recurrent networks learn simple context-free and context-sensitive languages. IEEE Trans Neural Netw 2001;12(6):1333–40.
[58] Kingma DP, Ba J. Adam: a method for stochastic optimization. In: IEEE international conference for learning representations; 2015.
[59] Hilaga M, Shinagawa Y, Kohmura T, Kunii TL. Topology matching for fully automatic similarity estimation of 3d shapes. ACM Trans Graph 2001:203–12. ISBN 1-58113-374-X.
[60] Zhang E, Mischaikow K, Turk G. Feature-based surface parameterization and texture mapping. ACM Trans Graph 2005;24(1):1–27.
[61] Krahenbuhl P, Koltun V. Efficient inference in fully connected crfs with gaussian edge potentials. Neural Inf Process Syst 2011:109–17.
[62] Kaick OV, Fish N, Kleiman Y, Asafi S, Cohen-Or D. Shape segmentation by approximate convexity analysis. ACM Trans Graph 2014;34(1). 4:1–4:11.
[63] Katz S, Leifman G, Tal A. Mesh segmentation using feature point and core extraction. Visual Comput 2005;21(8):649–58.
[64] Shlafman S, Tal A, Katz S. Metamorphosis of polyhedral surfaces using decomposition. Comput Graph Forum 2002;21(3):219–28.
[65] Xu K, Shi Y, Zheng L, Zhang J, Liu M, Huang H, et al. 3d attention–driven depth acquisition for object identification. ACM Trans Graph 2016;35(6). 238:1–238:14.
[66] Wang Y, Asafi S, van Kaick O, Zhang H, Cohen-Or D, Chen B. Active co-analysis of a set of shapes. ACM Trans Graph 2012;31(6). 165:1–165:10.