Fully Convolutional Networks for Semantic Segmentation
By Jonathan Long*, Evan Shelhamer*, Trevor Darrell

Instance-sensitive Fully Convolutional Networks
By Jifeng Dai, Kaiming He, Yi Li, Shaoqing Ren, Jian Sun

Presented by Zilong

Outline

1. What problems do they attempt to solve?

2. Key Contributions

3. Network Architecture Details

4. Experimental Setup and Results

5. Strengths and Weaknesses*

6. Possible Extensions*

a. And other comments

UC Berkeley

Fully Convolutional Networks for Semantic Segmentation

Jonathan Long* Evan Shelhamer* Trevor Darrell

Content Source: https://docs.google.com/presentation/d/1VeWFMpZ8XN7OC3URZP4WdXvOGYckoFWGVN7hApoXVnc/edit#slide=id.g529579d43_3_7

Problem to solve: Image Segmentation. Pixels in, pixels out.

Semantic segmentation

Monocular depth estimation: Eigen & Fergus 2015

Boundary prediction: Xie & Tu 2015

Optical flow: Fischer et al. 2015



Problem to solve

What is semantic segmentation?

Input: Image (2D array of pixels)

Output: Pixels clustered according to their semantic categories, i.e. class-level pixel-wise clustering (supervised).

NOTE: Pixels of two people in the same image will be clustered together by this model. The second paper attempts to fill this gap.

Input Output

Image Source: https://docs.google.com/presentation/d/1VeWFMpZ8XN7OC3URZP4WdXvOGYckoFWGVN7hApoXVnc/edit#slide=id.g529579d43_3_7


Key Contributions

1) AlexNet (VGG, GoogLeNet) -> Fully Convolutional Network

a) From image-level classification to pixel-level clustering

b) Arbitrary sized input images*

c) End-to-end learning model

2) Skip-layer structure to improve segmentation detail

a) Combine deep, coarse, semantic information with shallow, fine, appearance information.

b) WHAT (deeper layers) + WHERE(shallower layers)

Convnets perform classification: an image goes in and a 1000-dim vector comes out (e.g. “tabby cat”), in < 1 millisecond, with end-to-end learning.

Recall: a classification network

NOTE: Implementing layers 6 and 7 as fully connected layers fixes the size of the input images.



Recall: R-CNN does detection

Object detection without modifying the AlexNet architecture (figure: Girshick et al.).

R-CNN takes many seconds per image to output detections such as “cat” and “dog”.

Whether region proposals come from off-the-shelf methods or from in-network layers, these approaches always need bounding boxes, and they are SLOW.


What we want: end-to-end learning at ~1/10 second per image. How? (???)

A classification network (see it again): image in, “tabby cat” out.

How to become fully convolutional

To be honest, "fully convolutional" is just another way of thinking about the same network.

But it makes a significant difference in training and in maintaining the network structure in implementation:
- Only convolution kernels are maintained; downsampling ratios are controlled by strides.
- Inputs of arbitrary size are accepted.
- Faster than the naive implementation.

Layer 6 can be generated with a 13 x 13 x d_5 kernel, and layer 7 with a 1 x 1 x d_6 kernel. At the original input size each kernel covers its entire input, so it is evaluated at a single position: a kernel that does not move around. On a larger input the same kernel slides and produces a spatial map of outputs.
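The equivalence between a fully connected layer and a convolution whose kernel spans the whole input can be checked numerically. The following is a minimal NumPy sketch, not the authors' Caffe code; the sizes (a 6 x 6 x 4 feature map, 5 output units) and the helper name `conv_valid` are illustrative assumptions.

```python
import numpy as np

def conv_valid(x, w, stride=1):
    """Naive 'valid' convolution (cross-correlation, as in most deep learning code).
    x: (H, W, C_in) feature map, w: (kH, kW, C_in, C_out) kernels."""
    H, W, _ = x.shape
    kH, kW, _, c_out = w.shape
    out_h = (H - kH) // stride + 1
    out_w = (W - kW) // stride + 1
    out = np.zeros((out_h, out_w, c_out))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + kH, j * stride:j * stride + kW, :]
            # Contract the patch against every kernel at once.
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 4))           # feature map at the "training" size
w_fc = rng.standard_normal((6 * 6 * 4, 5))   # fully connected weights

fc_out = x.reshape(-1) @ w_fc                # ordinary fully connected layer
w_conv = w_fc.reshape(6, 6, 4, 5)            # same weights viewed as a 6 x 6 kernel
conv_out = conv_valid(x, w_conv)             # kernel fits exactly once: 1 x 1 x 5

x_big = rng.standard_normal((10, 9, 4))      # a larger input feature map
big_out = conv_valid(x_big, w_conv)          # kernel now slides: 5 x 4 x 5
```

On the original-size input the convolution reproduces the fully connected output; on the larger input each output location equals the fully connected layer applied to the corresponding 6 x 6 crop, which is why the converted network accepts arbitrary input sizes.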



Now it is fully convolutional

Upsampling output

NOTE: Upsampled output is H x W x (class number + 1)

Each H x W slice shows the heat map for one category
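As a small illustration of how the H x W x (class number + 1) output becomes a segmentation, the sketch below takes the per-pixel argmax over the class axis. The scores are made-up toy values, and channel 0 is assumed to be the background class.

```python
import numpy as np

# Toy scores for a 2 x 3 image and 3 channels (background + 2 classes).
scores = np.array([
    [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.1, 0.2, 3.0]],
    [[1.0, 0.9, 0.8], [0.0, 2.2, 2.1], [0.5, 0.4, 0.6]],
])  # shape (H, W, C)

heat_map_class1 = scores[:, :, 1]   # one H x W slice = heat map of one category
label_map = scores.argmax(axis=2)   # predicted class index per pixel
```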



End-to-end, pixels-to-pixels network

Each semantic segmentation ground truth image actually needs to be divided into (class number + 1) slices and each slice corresponds to the ground truth heat map of one category.
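The slicing of the ground truth described above can be sketched as a one-hot encoding, followed by a per-pixel loss. This is a minimal NumPy sketch that assumes a per-pixel softmax cross-entropy averaged over pixels; the function names `one_hot_slices` and `pixelwise_softmax_loss` are hypothetical.

```python
import numpy as np

def one_hot_slices(labels, num_channels):
    """labels: (H, W) integer class indices -> (H, W, num_channels) of 0/1 slices."""
    return np.eye(num_channels)[labels]

def pixelwise_softmax_loss(scores, labels):
    """Mean over pixels of the softmax cross-entropy.
    scores: (H, W, C) raw network outputs, labels: (H, W) class indices."""
    shifted = scores - scores.max(axis=2, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=2, keepdims=True))
    target = one_hot_slices(labels, scores.shape[2])
    return -(target * log_probs).sum(axis=2).mean()

labels = np.array([[0, 1], [2, 1]])     # toy 2 x 2 ground truth with 3 channels
target = one_hot_slices(labels, 3)      # target[:, :, c] is the mask of class c
# With all-zero scores every class is equally likely, so the loss is log(3).
uniform_loss = pixelwise_softmax_loss(np.zeros((2, 2, 3)), labels)
```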



conv, pool, nonlinearity -> upsampling -> pixelwise output + loss


If stopped right here (stride 32, no skips), what could we get?

Coarse. Really, really coarse.



Spectrum of deep features

Combine where (local, shallow) with what (global, deep)

fuse features into deep jet

(cf. Hariharan et al. CVPR15 “hypercolumn”)



Skip layers

skip to fuse layers!

interp + sum -> interp + sum -> dense output

End-to-end, joint learning of semantics and location

Content Source: https://computing.ece.vt.edu/~f15ece6504/slides/L13_FCN.pdf

How exactly are layers fused?

Take FCN-16s for instance. It fuses pool4 and conv7 in the following steps:

1. Add a 1 x 1 convolution layer on top of pool4 to produce additional class predictions.
   a. The output predictions of pool4 are at stride 16 (16s).
2. Upsample the output of conv7, which is at stride 32 (32s), by 2x.
   a. The upsampled conv7 predictions are at stride 16 as well.
3. Add these 16s predictions together.
4. Upsample the summed 16s predictions back to image size.

NOTE: ALL the weights can be learned. The upsampling weights can be initialized with bilinear interpolation.
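The four fusion steps can be traced on array shapes with a short NumPy sketch. It is an illustration under assumptions, not the authors' implementation: `upsample` uses nearest-neighbour repetition as a stand-in for the learned upsampling layer, the feature depth of 8 is arbitrary, and the names `fuse_fcn16s`, `upsample` and `bilinear_kernel` are hypothetical. `bilinear_kernel` only shows the kind of weights a learned upsampling filter could start from; it is not applied inside `fuse_fcn16s`.

```python
import numpy as np

def bilinear_kernel(size):
    """2D bilinear interpolation kernel, usable to initialize an upsampling filter."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    return (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)

def upsample(scores, factor):
    """Nearest-neighbour stand-in for the learned upsampling layer.
    scores: (H, W, C) -> (H * factor, W * factor, C)."""
    return scores.repeat(factor, axis=0).repeat(factor, axis=1)

def fuse_fcn16s(pool4, conv7_scores, w_1x1):
    """pool4: (H, W, D) features at stride 16.
    conv7_scores: (H/2, W/2, C) class scores at stride 32.
    w_1x1: (D, C) weights of the 1 x 1 convolution on pool4."""
    pool4_scores = pool4 @ w_1x1             # step 1: 1 x 1 conv = per-pixel matmul
    conv7_up = upsample(conv7_scores, 2)     # step 2: 32s -> 16s
    fused = pool4_scores + conv7_up          # step 3: sum the 16s predictions
    return upsample(fused, 16)               # step 4: back to image size

rng = np.random.default_rng(0)
pool4 = rng.standard_normal((4, 6, 8))           # a 64 x 96 image at stride 16
conv7_scores = rng.standard_normal((2, 3, 21))   # the same image at stride 32
out = fuse_fcn16s(pool4, conv7_scores, rng.standard_normal((8, 21)))
```

The output has the spatial size of the 64 x 96 image with one channel per class (21 here, matching the 500 x 500 x 21 output mentioned in the slides).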

Skip layer refinement

Figure panels: input image; stride 32, no skips; stride 16, 1 skip; stride 8, 2 skips; ground truth.



Training + Testing
- Train on a full image at a time, without patch sampling.
- Reshape the network to take input of any size.
- Forward time is ~100ms for a 500 x 500 x 21 output (this is really fast!).



Qualitative Results

Figure columns: FCN, SDS*, Truth, Input

Relative to prior state-of-the-art SDS:

- 30% relative improvement for mean IoU

- 286× faster

*Simultaneous Detection and Segmentation Hariharan et al. ECCV14
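For reference, mean IoU (intersection over union, averaged over classes) can be computed from a confusion matrix. The sketch below is a generic NumPy version on toy label maps; it is not the evaluation code behind the numbers above, and averaging only over classes that occur in the prediction or the ground truth is an assumption of this example.

```python
import numpy as np

def mean_iou(pred, truth, num_classes):
    """Mean intersection over union across classes present in pred or truth.
    pred, truth: integer label maps of the same shape."""
    # Confusion matrix: rows = ground truth class, columns = predicted class.
    conf = np.bincount(truth.ravel() * num_classes + pred.ravel(),
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    present = union > 0
    return (intersection[present] / union[present]).mean()

truth = np.array([[0, 0, 1], [1, 1, 2]])
pred = np.array([[0, 1, 1], [1, 1, 2]])
# Per-class IoU: class 0 = 1/2, class 1 = 3/4, class 2 = 1/1.
score = mean_iou(pred, truth, num_classes=3)
```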

Ghosts sitting on that boat?!!

Experimental Setup

1) AlexNet architecture
2) VGG nets: pick the VGG 16-layer net
3) GoogLeNet: use only the final loss layer, and improve performance by discarding the final average pooling layer.

*Decapitate each net by discarding the final classifier layer, and convert all fully connected layers to convolutions.

Quantitative Results

Tables: PASCAL VOC 2011 (8498 training images), SIFT Flow, NYUDv2

Potential Extensions

A boring extension: if we directly use shallower layers and upsample without fusing with deeper layers, how bad would it be?

An interesting, promising and intuitive extension: what the next paper attempts to address =>

Instance-sensitive Fully Convolutional Networks

Jifeng Dai, Kaiming He, Jian Sun. Microsoft Research
Yi Li. Tsinghua University (while interning at Microsoft Research)
Shaoqing Ren. University of Science and Technology of China (while interning at Microsoft Research)

Problem to solve: Instance-level Segmentation. Pixels in, pixels out.

Major Contributions

A fully convolutional network architecture for instance-level segmentation that:

1) Computes a set of instance-sensitive score maps
a) Each pixel is a classifier of relative positions to an object instance
b) The score maps are assembled to output an instance candidate at each position
2) Reuses semantic segmentation results
3) Exploits image local coherence
a) without any high-dimensional layer related to the mask resolution (compare with DeepMask)

Recall: Upsampling output

NOTE: Upsampled output is H x W x (class number + 1). Each H x W slice shows the heat map for one category.

> Generate instance-sensitive score maps > Assemble

Generate a set of k x k instance-sensitive score maps (for instance k = 3), one per relative position:

#1 #2 #3
#4 #5 #6
#7 #8 #9

Assemble: at each sliding-window position, an m x m instance candidate is built from the m x m x (k x k) window of score maps. Each of the k x k cells of the candidate is copied from the score map of the matching relative position.

NOTE: Not all positions the sliding window visited were objects.
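The assembling step can be sketched as a copy-and-paste over the k x k grid. This is a minimal NumPy illustration, assuming that score map index r*k + c corresponds to the relative position in row r, column c, and that m is divisible by k; the function name `assemble_instance` and the toy constant-valued score maps are made up for the example.

```python
import numpy as np

def assemble_instance(score_maps, top, left, m, k):
    """Assemble one m x m instance candidate from k*k instance-sensitive score maps.
    score_maps: (k*k, H, W); map index r*k + c is assumed to score the
    relative position in row r, column c of the k x k grid.
    (top, left): top-left corner of the m x m sliding window; m divisible by k."""
    cell = m // k
    out = np.zeros((m, m))
    for r in range(k):
        for c in range(k):
            ys = slice(r * cell, (r + 1) * cell)
            xs = slice(c * cell, (c + 1) * cell)
            window = score_maps[r * k + c, top:top + m, left:left + m]
            out[ys, xs] = window[ys, xs]   # copy this cell from its own score map
    return out

k, m = 3, 6
# Toy score maps: map i is filled with the constant i + 1 (i.e. #1 ... #9).
score_maps = np.stack([np.full((10, 10), i + 1.0) for i in range(k * k)])
instance = assemble_instance(score_maps, top=2, left=3, m=m, k=k)
```

With these toy maps the assembled candidate reproduces the #1 to #9 grid layout, one 2 x 2 cell per score map. No layer with m x m outputs is needed, since the assembling is only a copy.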

Complete Instance-level Segmentation Network - 2 Branches

Upper: Generate instance-sensitive score maps and assemble

Bottom: Generate objectness scores

Experimental Setup

1) Use the VGG-16 network pre-trained on ImageNet as the feature extractor.
2) The 13 convolutional layers in VGG-16 are applied fully convolutionally on an input image of arbitrary size.
3) Reduce the network stride and increase feature map resolution:
a) the max pooling layer pool4 (between conv4_3 and conv5_1) is modified to have a stride of 1 instead of 2,
b) accordingly the filters in conv5_1 to conv5_3 are adjusted by the “hole algorithm”.

*Using this modified VGG network, the effective stride of the conv5_3 feature map is s = 8 pixels w.r.t. the input image.
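The stride arithmetic and the idea behind the "hole algorithm" can be illustrated in a few lines. This is a sketch under assumptions: the effective stride is taken as the product of the pooling strides before conv5_3, and the hole algorithm is shown as a 1D dilated convolution with a made-up filter; neither function comes from the paper's code.

```python
import numpy as np

def effective_stride(layer_strides):
    """Product of per-layer strides = stride of the feature map w.r.t. the input."""
    return int(np.prod(layer_strides))

# VGG-16 up to conv5_3: only pool1..pool4 have stride 2 (conv layers have stride 1).
original = effective_stride([2, 2, 2, 2])    # 16
modified = effective_stride([2, 2, 2, 1])    # pool4 stride set to 1 -> 8

def dilated_conv1d(x, w, dilation):
    """1D 'valid' convolution with holes: filter taps are `dilation` samples apart."""
    span = (len(w) - 1) * dilation + 1
    return np.array([
        sum(w[t] * x[i + t * dilation] for t in range(len(w)))
        for i in range(len(x) - span + 1)
    ])

x = np.arange(8, dtype=float)
w = np.array([1.0, 0.0, -1.0])
dense = dilated_conv1d(x, w, dilation=1)     # taps at i, i+1, i+2
holes = dilated_conv1d(x, w, dilation=2)     # taps at i, i+2, i+4
```

Spreading the taps by a factor of 2 lets the conv5 filters see the same input extent they saw before pool4's stride was removed, while the feature map keeps the finer 8-pixel stride.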

DeepMask

Looks similar, but it does not exploit local coherence.

Quantitative Results

- Ablation comparisons on the PASCAL VOC 2012 validation set
- Performance evaluations on the PASCAL VOC 2012 validation set
- Performance evaluations on the MS COCO validation set

Qualitative Result

Strengths and Weaknesses

Strengths:
1) Both papers address very important questions efficiently with fully convolutional networks.
2) Both papers are novel with respect to network architecture.
3) Both papers have convincing experiments.
a) Visualization and numerical results are clear and convincing.
4) The discussion of the convolution operations in the first paper is helpful for interpreting and better understanding convolutional networks.
5) The second paper doesn’t require a separate process to generate region proposals.

Weaknesses:
1) How the training data is used is never clearly addressed.
a) What ground truth is used together with the forwarded heat maps in the loss functions?
i) The first paper is intuitive in this part, but the second paper is very confusing.
2) Several essential points are unclear in the second paper:
a) Did the second paper use skip layers?
b) Where did the second paper upsample? Or did it not upsample at all?
3) The relative location grids in the second paper work well but look strange:
a) One person’s “left” could be another’s “right”, but each channel is in charge of the relative location for all sliding windows.

Potential directions

1) Other tasks to be solved by fully convolutional networks
a) Scene recognition?
(1) Semantic combination of objects
2) Why is the size of the sliding windows fixed in the second paper?
a) Many small instances crowded together.
3) What about combining box-level object recognition with semantic segmentation?

Image Source: https://www.pinterest.com/pin/369787819374178444/ https://www.pinterest.com/pin/399553798160612769/



Backup Slides

Datasets:

+ NYUD net for multi-modal input and SIFT Flow net for multi-task output

PASCAL VOC: Table 3 gives the performance of our FCN-8s on the test sets of PASCAL VOC 2011 and 2012, and compares it to the previous state-of-the-art, SDS [17], and the well-known R-CNN [12].

NYUDv2 [33] is an RGB-D dataset collected using the Microsoft Kinect. It has 1449 RGB-D images, with pixelwise labels that have been coalesced into a 40 class semantic segmentation task by Gupta et al. [14].

SIFT Flow is a dataset of 2,688 images with pixel labels for 33 semantic categories (“bridge”, “mountain”, “sun”), as well as three geometric categories (“horizontal”, “vertical”, and “sky”).

Past and future history of fully convolutional networks

history

Convolutional Locator Network: Wolf & Platt 1994

Shape Displacement Network: Matan & LeCun 1992

pyramids

Scale Pyramid, Burt & Adelson ‘83

The scale pyramid is a classic multi-resolution representation.

Fusing multi-resolution network layers is a learned, nonlinear counterpart.



jets

Jet, Koenderink & Van Doorn ‘87

The local jet collects the partial derivatives at a point for a rich local description.

The deep jet collects layer compositions for a rich, learned description.

extensions

- more tasks
- random fields
- weak supervision

many pixelwise tasks: semantic segmentation; monocular depth estimation (Eigen & Fergus 2015); boundary prediction (Xie & Tu 2015); optical flow (Fischer et al. 2015)


fully conv. nets + random fields

Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. Chen* & Papandreou* et al. ICLR 2015.

Conditional Random Fields as Recurrent Neural Networks. Zheng* & Jayasumana* et al. arXiv 2015.

Comparison figure: DeepLab (Chen* & Papandreou* et al. ICLR 2015) and CRF-RNN (Zheng* & Jayasumana* et al. ICCV 2015)

[ comparison credit: CRF as RNN, Zheng* & Jayasumana* et al. ICCV 2015 ]


fully conv. nets + weak supervision

Constrained Convolutional Neural Networks for Weakly Supervised Segmentation. Pathak et al. arXiv 2015.

FCNs expose a spatial loss map to guide learning: segment from tags by MIL or pixelwise constraints.

BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation. Dai et al. 2015.

FCNs expose a spatial loss map to guide learning: mine boxes + feedback to refine masks.



leaderboard

FCN == segmentation with Caffe



caffeinated contemporaries

Hypercolumn SDS: Hariharan, Arbeláez, Girshick, Malik

Zoom-Out: Mostajabi, Yadollahpour, Shakhnarovich

Convolutional Feature Masking: Dai, He, Sun



conclusion

fully convolutional networks are fast, end-to-end models for pixelwise problems

- code in Caffe master branch
- models for PASCAL VOC, NYUDv2, SIFT Flow, PASCAL-Context

fcn.berkeleyvision.org: http://fcn.berkeleyvision.org
caffe.berkeleyvision.org: http://caffe.berkeleyvision.org/
github.com/BVLC/caffe: https://github.com/BVLC/caffe

model example: https://github.com/shelhamer/fcn.berkeleyvision.org/tree/master/voc-fcn32s
inference example: https://github.com/shelhamer/fcn.berkeleyvision.org/blob/master/infer.py
solving example: https://github.com/shelhamer/fcn.berkeleyvision.org/blob/master/voc-fcn32s/solve.py

