
1,2 Relja Arandjelovi Josef Sivic - École Normale Supérieure · 2017. 7. 29. · Ignacio Rocco1,2 Relja Arandjelović1,2,* Josef Sivic1,2,3 Model 1DI ENS, École normale supérieure,


At training time:
∙ Inputs:
  - Synthetically warped image pair
  - Ground-truth transformation
∙ Output: Estimated parametric transformation

Convolutional neural network architecture for geometric matching

Ignacio Rocco (1,2), Josef Sivic (1,2,3), Relja Arandjelović (1,2,*)

Model

(1) DI ENS, École normale supérieure, PSL Research University
(2) INRIA
(3) CIIRC, CTU in Prague

Goal

Challenges

Contributions

Overview

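[Figure: model overview. Images I_A and I_B each pass through the feature extraction CNN, giving features f_A and f_B; the matching step produces f_AB, and the regression CNN estimates the transformation parameters. In the two-stage version, feature extraction, matching, and affine regression give θ̂_Aff, I_A is warped, and feature extraction, matching, and TPS regression then give θ̂_TPS.]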

∙ Instance-level and category-level image alignment

‣ Output: smooth dense correspondence field

Results on the Proposal Flow dataset

Methods                                        PCK (%)
DeepFlow [1]                                   20
GMK [2]                                        27
SIFT Flow [3]                                  38
DSP [4]                                        29
Proposal Flow [5]                              56
RANSAC with our features (affine)              47
Ours (affine)                                  49
Ours (affine + thin-plate spline)              56
Ours (affine ensemble + thin-plate spline)     57

Results on the Caltech-101 dataset

Methods                              LT-ACC   IoU    LOC-ERR
DeepFlow [1]                         0.74     0.40   0.34
GMK [2]                              0.77     0.42   0.34
SIFT Flow [3]                        0.75     0.48   0.32
DSP [4]                              0.77     0.47   0.35
Proposal Flow [5]                    0.78     0.50   0.25
Ours (affine)                        0.79     0.51   0.25
Ours (affine + thin-plate spline)    0.82     0.56   0.25

∙ Substantial appearance differences
∙ Presence of background clutter
∙ Lack of a large annotated dataset of real image pairs

∙ CNN architecture suitable for category-level image alignment
∙ The model is trainable from synthetically warped image pairs
∙ Matching layer enables generalization to real image pairs

At evaluation time:
∙ Input: Real image pair
∙ Output: Estimated parametric transformation

∙ Three-stage siamese CNN architecture mimicking the classical matching pipeline
  1. Feature extraction CNN: pre-trained VGG-16 model + per-column L2-normalization
  2. Matching: correlation layer + per-column L2-normalization
  3. Regression CNN: small CNN, trained from scratch

1. Feature extraction
   Output: w×h dense d-dimensional features

2. Matching
   Layers: correlation layer, L2 normalization
   Output: L2-normalized pairwise correlation tensor (pairwise matching scores)

3. Regression
   Layers: conv1 (7×7×225×128), BN1, ReLU1, conv2 (5×5×128×64), BN2, ReLU2, FC (5×5×64×P)
   Output: Estimated parameters of geometric transformation
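The matching stage can be illustrated with a small pure-Python sketch. It is an assumption-laden toy, not the authors' implementation: the feature maps are nested lists instead of VGG-16 activations, the names `l2_normalize` and `correlation_layer` are hypothetical, and "per-column" normalization is read here as normalizing each descriptor, and each position's vector of matching scores, to unit L2 norm.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def correlation_layer(feat_a, feat_b):
    """Pairwise matching scores between two dense feature maps.

    feat_a, feat_b: h x w grids (nested lists) of d-dimensional descriptors.
    Returns corr[i][j][k]: the L2-normalized score between the descriptor at
    position (i, j) of B and the k-th position (row-major order) of A.
    """
    flat_a = [l2_normalize(f) for row in feat_a for f in row]
    corr = []
    for row in feat_b:
        corr_row = []
        for f_b in row:
            f_b = l2_normalize(f_b)
            # Scalar product of this B descriptor with every A descriptor.
            scores = [sum(x * y for x, y in zip(f_a, f_b)) for f_a in flat_a]
            # Normalize the score vector of this position (assumed reading
            # of "per-column L2-normalization").
            corr_row.append(l2_normalize(scores))
        corr.append(corr_row)
    return corr

# Toy 1 x 2 feature maps with d = 2 (illustrative values only).
feat_a = [[[1.0, 0.0], [0.0, 1.0]]]
feat_b = [[[0.0, 2.0], [3.0, 0.0]]]
corr = correlation_layer(feat_a, feat_b)
print(len(corr), len(corr[0]), len(corr[0][0]))
```

Each spatial position of the output holds one score per position of the other image, which is what allows the regression CNN to take the correlation tensor as a multi-channel input.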

Coarse-to-fine matching architecture
∙ The same architecture can be applied with increasing geometric model complexity
  1. Coarse alignment using an affine transformation
  2. Refined alignment using a thin-plate spline transformation
∙ The final transformation is the composition of both stages
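The composition of the two stages can be sketched on point coordinates. This is a minimal illustration under stated assumptions: `affine` and `compose` are hypothetical helper names, the order of application (coarse first, then fine) is assumed, and the second stage is a simple smooth stand-in function rather than an actual thin-plate spline.

```python
def affine(theta):
    """Return the point map p -> A p + t for theta = (a11, a12, a21, a22, tx, ty)."""
    a11, a12, a21, a22, tx, ty = theta

    def warp(p):
        x, y = p
        return (a11 * x + a12 * y + tx, a21 * x + a22 * y + ty)

    return warp

def compose(first, second):
    """Return the map that applies `first`, then `second`."""
    return lambda p: second(first(p))

# Stage 1: coarse affine alignment (here a pure translation in x).
coarse = affine((1.0, 0.0, 0.0, 1.0, 0.5, 0.0))

# Stage 2: stand-in for the thin-plate spline refinement. This is NOT a TPS,
# only a smooth non-linear map used to show the composition.
def fine(p):
    x, y = p
    return (x + 0.1 * y * y, y)

final = compose(coarse, fine)
print(final((1.0, 2.0)))
```

Because the composed map is just a function of a point, any refinement model that maps points to points could be substituted for the stand-in.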

*Now at DeepMind

Training from synthetic imagery
∙ Training pairs: each pair consists of a real image and a synthetically warped version of it

[Figure: a Tokyo StreetView image and a synthetic training pair (I_A, I_B) featuring a thin-plate spline transformation]
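Generating a training pair with a known ground-truth transformation can be sketched as follows. The sketch makes several assumptions: it uses an affine warp rather than the thin-plate spline shown in the figure, nearest-neighbour sampling on a nested-list image, normalized coordinates in [-1, 1], and illustrative sampling ranges. The names `sample_affine`, `warp_image`, and `make_training_pair` are hypothetical.

```python
import random

def sample_affine(rng, max_shift=0.2, max_scale=0.2):
    """Sample affine parameters near the identity (ranges are illustrative)."""
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    tx = rng.uniform(-max_shift, max_shift)
    ty = rng.uniform(-max_shift, max_shift)
    return (s, 0.0, 0.0, s, tx, ty)

def warp_image(image, theta):
    """Warp a 2-D image (nested lists, at least 2 x 2) with an affine map.

    Each output pixel is mapped to a source location in normalized
    [-1, 1] coordinates and filled by nearest-neighbour lookup.
    """
    h, w = len(image), len(image[0])
    a11, a12, a21, a22, tx, ty = theta
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            x = 2.0 * c / (w - 1) - 1.0
            y = 2.0 * r / (h - 1) - 1.0
            xs = a11 * x + a12 * y + tx
            ys = a21 * x + a22 * y + ty
            cs = round((xs + 1.0) * (w - 1) / 2.0)
            rs = round((ys + 1.0) * (h - 1) / 2.0)
            if 0 <= rs < h and 0 <= cs < w:
                out[r][c] = image[rs][cs]  # pixels mapped outside stay 0
    return out

def make_training_pair(image, rng):
    """Return (I_A, I_B, theta_gt) with I_B a synthetically warped I_A."""
    theta_gt = sample_affine(rng)
    return image, warp_image(image, theta_gt), theta_gt

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
image_a, image_b, theta_gt = make_training_pair(image, random.Random(0))
print(theta_gt)
```

Since the warp is sampled by the generator, the ground-truth parameters come for free, which is what removes the need for manually annotated real image pairs during training.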

[Figure: source image, coarse alignment (affine), fine alignment (affine + TPS), target image]

Insight: L2 normalization penalizes ambiguous matches
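A small worked example makes the effect concrete. It assumes the normalization is applied to the vector of matching scores that one position has against all candidate positions in the other image; the score values are made up for illustration.

```python
import math

def l2_normalize(scores):
    """Scale a vector of matching scores to unit L2 norm."""
    norm = math.sqrt(sum(s * s for s in scores))
    return [s / norm for s in scores]

# One clear match among four candidates.
distinct = l2_normalize([1.0, 0.0, 0.0, 0.0])
# Four equally good candidates, i.e. an ambiguous match.
ambiguous = l2_normalize([1.0, 1.0, 1.0, 1.0])

print(max(distinct))   # 1.0
print(max(ambiguous))  # 0.5
```

The raw best score is the same in both cases, but after normalization the ambiguous position contributes a lower peak than the distinctive one.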

Insight: The first layer convolutional filters can specialize to detect local neighbourhood consensus

Averaged peaks from conv1 filters

∙ Evaluated using annotated keypoints

∙ Metric: Percentage of correct keypoints (PCK)
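A minimal sketch of the PCK computation follows. The poster does not state the distance threshold, so the rule used here (a keypoint counts as correct when it lies within `alpha * max(h, w)` of its ground-truth location, with `alpha = 0.1`) is an assumption, as is the function name `pck`.

```python
import math

def pck(predicted, ground_truth, image_size, alpha=0.1):
    """Percentage of keypoints within alpha * max(h, w) of their ground truth.

    predicted, ground_truth: equal-length lists of (x, y) points.
    image_size: (h, w) of the target image.
    The threshold rule and alpha are assumptions, not taken from the poster.
    """
    h, w = image_size
    threshold = alpha * max(h, w)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return 100.0 * correct / len(ground_truth)

predicted = [(10, 10), (50, 50), (90, 90), (0, 0)]
ground_truth = [(12, 10), (50, 80), (90, 95), (0, 0)]
print(pck(predicted, ground_truth, image_size=(100, 100)))  # 75.0
```

In the example the threshold is 10 pixels and the four distances are 2, 30, 5, and 0, so three of four keypoints count as correct.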

Qualitative results:


∙ Evaluated using annotated object segmentation masks
∙ Metrics: Label transfer accuracy (LT-ACC), Intersection over union (IoU), Localization error (LOC-ERR)
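Two of the mask-based metrics can be sketched on binary masks. The poster only names the metrics, so the definitions below are assumptions: IoU is the standard ratio of intersection to union of the foreground regions, and label transfer accuracy is taken to be the fraction of pixels whose transferred label matches the target label. LOC-ERR is omitted because its definition is not given here.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Two empty masks are treated as a perfect match (a convention choice).
    return inter / union if union else 1.0

def label_transfer_accuracy(transferred, target):
    """Fraction of pixels whose transferred label equals the target label
    (assumed definition)."""
    total = correct = 0
    for row_t, row_g in zip(transferred, target):
        for t, g in zip(row_t, row_g):
            total += 1
            correct += 1 if t == g else 0
    return correct / total

warped_source_mask = [[1, 1, 0], [0, 1, 0]]
target_mask = [[1, 0, 0], [0, 1, 1]]
print(iou(warped_source_mask, target_mask))  # 0.5
print(label_transfer_accuracy(warped_source_mask, target_mask))
```

In practice the first mask would be the source segmentation warped by the estimated transformation, compared against the annotated target mask.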

Qualitative comparison to other methods:

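[Figure: qualitative comparison. Columns: image pair, DeepFlow [1], GMK [2], SIFT Flow [3], DSP [4], Proposal Flow [5], our method. Panel labels: source image, aligned result, target image.]

[Figure: evaluation-time diagram. Labels: real image pair (I_A, I_B), CNN model, warp.]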

Insight: The loss computes a pixel distance and can be used with any type of differentiable geometric transformation
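One way to realize such a loss is to compare where a fixed grid of points lands under the estimated and the ground-truth transformations. The sketch below is an illustration under assumptions: the grid size, the normalized [-1, 1] coordinates, the use of the mean squared distance, and the names `grid_points`, `apply_affine`, and `grid_loss` are not taken from the poster.

```python
def grid_points(n=5):
    """n x n grid of points in normalized [-1, 1] x [-1, 1] coordinates."""
    step = 2.0 / (n - 1)
    return [(-1.0 + i * step, -1.0 + j * step) for j in range(n) for i in range(n)]

def apply_affine(theta, p):
    """Map point p with theta = (a11, a12, a21, a22, tx, ty)."""
    a11, a12, a21, a22, tx, ty = theta
    x, y = p
    return (a11 * x + a12 * y + tx, a21 * x + a22 * y + ty)

def grid_loss(theta_est, theta_gt, transform=apply_affine, n=5):
    """Mean squared distance between grid points warped by both transformations.

    `transform` can be any function (theta, point) -> point, so the same
    loss applies to other parametric transformation families.
    """
    total = 0.0
    for p in grid_points(n):
        xe, ye = transform(theta_est, p)
        xg, yg = transform(theta_gt, p)
        total += (xe - xg) ** 2 + (ye - yg) ** 2
    return total / (n * n)

identity = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
shifted = (1.0, 0.0, 0.0, 1.0, 0.5, 0.0)
print(grid_loss(identity, identity))  # 0.0
print(grid_loss(shifted, identity))   # 0.25
```

Measuring the error on transformed points rather than on raw parameters is what makes the loss independent of how a particular transformation is parametrized.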


∙ Generalization: We show that the method is relatively unaffected by the nature of the training images

References

[1] P. Weinzaepfel, et al. DeepFlow: Large displacement optical flow with deep matching. In Proc. ICCV, 2013

[2] O. Duchenne, et al. A graph-matching kernel for object categorization. In Proc. ICCV, 2011

[3] C. Liu, et al. SIFT Flow: Dense correspondence across scenes and its applications. IEEE PAMI, 2011

[4] J. Kim, et al. Deformable spatial pyramid matching for fast dense correspondences. In Proc. CVPR, 2013

[5] B. Ham, M. Cho, C. Schmid, and J. Ponce. Proposal Flow. In Proc. CVPR, 2016