Course Program
9.30-10.00 Introduction (Andrew Blake)
10.00-11.00 Discrete Models in Computer Vision (Carsten Rother)
15min Coffee break
11.15-12.30 Message Passing: DP, TRW, LP relaxation (Pawan Kumar)
12.30-13.00 Quadratic pseudo-boolean optimization (Pushmeet Kohli)
1 hour Lunch break
14:00-15.00 Transformation and move-making methods (Pushmeet Kohli)
15:00-15.30 Speed and Efficiency (Pushmeet Kohli)
15min Coffee break
15:45-16.15 Comparison of Methods (Carsten Rother)
16:15-17.30 Recent Advances: Dual-decomposition, higher-order, etc. (Carsten Rother + Pawan Kumar)
All material will be available online (after the conference): http://research.microsoft.com/en-us/um/cambridge/projects/tutorial/
Discrete Models in Computer Vision
Carsten Rother
Microsoft Research Cambridge
Overview
• Introduce factor graph notation
• Categorization of models in Computer Vision:
– 4-connected MRFs
– Highly-connected MRFs
– Higher-order MRFs
Model : discrete or continuous variables? discrete or continuous space? Dependence between variables? …
Markov Random Field Models for Computer Vision
Inference: Graph Cut (GC)
Belief Propagation (BP)
Tree-Reweighted Message Passing (TRW)
Iterated Conditional Modes (ICM)
Cutting-plane
Dual-decomposition
…
Learning: Exhaustive search (grid search)
Pseudo-Likelihood approximation
Training in Pieces
Max-margin
…
Applications: 2D/3D image segmentation, object recognition, 3D reconstruction, stereo matching, image denoising, texture synthesis, pose estimation, panoramic stitching, …
Recap: Image Segmentation
P(x|z) ~ P(z|x) P(x)      (Posterior ~ Likelihood x Prior)
P(x|z) ~ exp{-E(x)}       (Gibbs distribution)
Energy: E: {0,1}^n → R
E(x) = ∑i θi(xi) + w ∑i,j Є N4 θij(xi,xj)      (unary terms + pairwise terms)
Maximum-a-posteriori (MAP): x* = argmax_x P(x|z) = argmin_x E(x)
(Figure: input z and MAP solution x*)

Min-Marginals (uncertainty of the MAP solution)
Definition: ψv;i = min{ E(x) : xv = i }
Can be used in several ways:
• Insights on the model
• For optimization (TRW, comes later)
(Figure: image, MAP, min-marginals of the foreground; bright = very certain)
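As a minimal sketch (not from the slides), this energy can be evaluated for a binary labelling on a 4-connected grid as follows; the unary cost array `theta_unary` and the weight `w` are hypothetical inputs:

```python
import numpy as np

def segmentation_energy(x, theta_unary, w):
    """Evaluate E(x) = sum_i theta_i(x_i) + w * sum_{i,j in N4} |x_i - x_j|
    for a binary labelling x on a 4-connected grid.

    x           : (H, W) integer array of labels in {0, 1}
    theta_unary : (H, W, 2) array, theta_unary[i, j, l] = cost of label l at pixel (i, j)
    w           : weight of the pairwise (Ising) term
    """
    # Unary term: pick the cost of the chosen label at every pixel.
    unary = np.take_along_axis(theta_unary, x[..., None], axis=2).sum()

    # Pairwise Ising term: count label disagreements along vertical and horizontal edges.
    pairwise = np.abs(np.diff(x, axis=0)).sum() + np.abs(np.diff(x, axis=1)).sum()

    return unary + w * pairwise
```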
Introducing Factor Graphs
Write probability distributions as Graphical models:
- Directed graphical model
- Undirected graphical model (… what Andrew Blake used)
- Factor graphs
References:
- Pattern Recognition and Machine Learning [Bishop '08, book, chapter 8]
- several lectures at the Machine Learning Summer School 2009 (see video lectures)
Factor Graphs
P(x) ~ θ(x1,x2,x3) θ(x2,x4) θ(x3,x4) θ(x3,x5)      "4 factors"
P(x) ~ exp{-E(x)}  with  E(x) = θ(x1,x2,x3) + θ(x2,x4) + θ(x3,x4) + θ(x3,x5)      (Gibbs distribution)
(Factor graph over the unobserved/latent/hidden variables x1,…,x5)
A factor node connects exactly the variables that appear in the same factor.
Definition "Order": the arity (number of variables) of the largest factor.
Example: P(x) ~ θ(x1,x2,x3) θ(x2,x4) θ(x3,x4) θ(x3,x5)
(Factor graph with order 3: one factor of arity 3, three factors of arity 2)
Examples - Order

4-connected, pairwise MRF:         E(x) = ∑i,j Є N4 θij(xi,xj)                     Order 2
Higher(8)-connected, pairwise MRF: E(x) = ∑i,j Є N8 θij(xi,xj)                     Order 2
Higher-order MRF:                  E(x) = ∑i,j Є N4 θij(xi,xj) + θ(x1,…,xn)        Order n

"Pairwise energy" vs. "higher-order energy"
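As a small illustration (a hypothetical representation, not from the slides), factors can be stored simply as tuples of variable indices; the order of the model is then just the largest tuple length:

```python
# Hypothetical factor lists for the example energies above (variable indices only).
pairwise_4conn = [(1, 2), (1, 3), (2, 4), (3, 4)]       # pairwise factors -> order 2
pairwise_8conn = pairwise_4conn + [(1, 4), (2, 3)]      # more edges, still order 2
higher_order   = pairwise_4conn + [tuple(range(1, 6))]  # one factor over all n = 5 variables

def order(factors):
    """Order of a factor graph = arity (number of variables) of its largest factor."""
    return max(len(f) for f in factors)

print(order(pairwise_4conn), order(pairwise_8conn), order(higher_order))  # 2 2 5
```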
Example: Image segmentation
P(x|z) ~ exp{-E(x)}
E(x) = ∑i θi(xi,zi) + ∑i,j Є N4 θij(xi,xj)
(Factor graph: unobserved (latent) variables xi, xj; observed variables zi)
Simplest inference technique: ICM (iterated conditional modes)

Goal: x* = argmin_x E(x)
E(x) = θ12(x1,x2) + θ13(x1,x3) + θ14(x1,x4) + θ15(x1,x5) + …
Idea: visit one variable at a time, set it to the label that minimizes the energy given its (observed or currently fixed) neighbours, and repeat until no variable changes.

ICM can get stuck in local minima! (Figure: ICM result vs. global minimum)
Simulated annealing: accept a move even if the energy increases (with a certain probability).
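A rough ICM sketch under the same assumptions as the earlier energy sketch (binary labels, 4-connected grid, hypothetical inputs `theta_unary` and `w`): each pixel is set in turn to the label with the lowest energy given its current neighbours, so the energy never increases, which is also why ICM can stop in a local minimum.

```python
import numpy as np

def icm(theta_unary, w, max_sweeps=20):
    """Iterated conditional modes for E(x) = sum_i theta_i(x_i) + w * sum_{i,j in N4} |x_i - x_j|."""
    H, W, _ = theta_unary.shape
    x = theta_unary.argmin(axis=2)            # start from the unary-only labelling
    for _ in range(max_sweeps):
        changed = False
        for i in range(H):
            for j in range(W):
                nbrs = [x[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                        if 0 <= a < H and 0 <= b < W]
                # Local energy of each candidate label, with all neighbours held fixed.
                costs = [theta_unary[i, j, l] + w * sum(abs(l - n) for n in nbrs)
                         for l in (0, 1)]
                best = int(np.argmin(costs))
                if best != x[i, j]:
                    x[i, j] = best
                    changed = True
        if not changed:                        # no single change lowers E: local minimum
            break
    return x
```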
Overview
• Introduce factor graph notation
• Categorization of models in Computer Vision:
– 4-connected MRFs
– Highly-connected MRFs
– Higher-order MRFs
Stereo matching
(Figure: left image (a), right image (b), ground-truth depth; example disparities d = 0 and d = 4)
• Images are rectified
• Ignore occlusion for now
Labels: d (depth/shift), one label di per pixel
Energy: E(d): {0,…,D-1}^n → R
Stereo matching - Energy
Energy: E(d): {0,…,D-1}^n → R
E(d) = ∑i θi(di) + ∑i,j Є N4 θij(di,dj)
Unary: θi(di) = |li − r(i−di)|   "SAD: sum of absolute differences" (many others possible: NCC, …)
(Figure: left and right image; e.g. di = 2 matches left pixel i to right pixel i−2)
Pairwise: θij(di,dj) = g(|di−dj|)
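A sketch of the SAD unary term for rectified grey-scale images (assumptions: disparity d shifts the right image, `left` and `right` are 2-D arrays, D disparity labels):

```python
import numpy as np

def sad_unary_costs(left, right, D):
    """theta_i(d) = |left_i - right_{i-d}| for every pixel i and disparity d in {0,...,D-1}."""
    left = left.astype(float)
    right = right.astype(float)
    H, W = left.shape
    costs = np.full((H, W, D), np.inf)     # inf where pixel i-d falls outside the right image
    for d in range(D):
        costs[:, d:, d] = np.abs(left[:, d:] - right[:, :W - d])
    return costs
```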
Stereo matching - prior
[Olga Veksler PhD thesis; Daniel Cremers et al.]
θij(di,dj) = g(|di−dj|)
(Figure: cost as a function of |di−dj|)
• No truncation: the global minimum can be computed
• With truncation: discontinuity-preserving potentials [Blake & Zisserman '83, '87], but NP-hard optimization
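A small sketch of the two pairwise costs g(|di−dj|) contrasted above: an untruncated linear cost and a truncated, discontinuity-preserving one; the truncation threshold `T` is a hypothetical parameter.

```python
def linear_cost(di, dj):
    """Untruncated linear cost: convex in |di - dj|."""
    return abs(di - dj)

def truncated_linear_cost(di, dj, T=2):
    """Truncated linear cost: discontinuity preserving, but makes optimization NP-hard."""
    return min(abs(di - dj), T)
```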
Stereo matching results (see http://vision.middlebury.edu/stereo/)
• No MRF: per-pixel independent (winner-takes-all, WTA)
• No horizontal links: efficient, since the chains are independent
• Pairwise MRF [Boykov et al. '01] vs. ground truth
Texture synthesis
[Kwatra et al., SIGGRAPH '03]
(Figure: input texture → synthesized output)
The binary label xi Є {0,1} selects whether pixel i is copied from source image a or source image b.
E: {0,1}^n → R
E(x) = ∑i,j Є N4 |xi−xj| ( |ai−bi| + |aj−bj| )
(Figure: good case — the seam runs where a and b agree; bad case — where they disagree)
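A sketch of this seam energy, assuming `a` and `b` are two aligned float images and `x` is the binary label map that chooses the source per pixel:

```python
import numpy as np

def seam_energy(x, a, b):
    """E(x) = sum_{i,j in N4} |x_i - x_j| * (|a_i - b_i| + |a_j - b_j|).
    The cost is paid only across seams (neighbours taken from different sources),
    and it is small where a and b agree ('good case') and large where they differ."""
    diff = np.abs(a - b)                 # per-pixel disagreement |a_i - b_i|
    cut_v = np.abs(np.diff(x, axis=0))   # 1 where a vertical edge crosses a seam
    cut_h = np.abs(np.diff(x, axis=1))   # 1 where a horizontal edge crosses a seam
    return (cut_v * (diff[:-1, :] + diff[1:, :])).sum() + \
           (cut_h * (diff[:, :-1] + diff[:, 1:])).sum()
```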
Video Synthesis
(Figure: input video; output video (duplicated))
Panoramic stitching
Panoramic stitching
AutoCollage
http://research.microsoft.com/en-us/um/cambridge/projects/autocollage/ [Rother et al., SIGGRAPH '05]
Recap: 4-connected MRFs
• Many useful vision systems are based on 4-connected pairwise MRFs.
• Possible reason (see the inference part): many fast and good (sometimes globally optimal) inference methods exist.
Overview
• Introduce factor graph notation
• Categorization of models in Computer Vision:
– 4-connected MRFs
– Highly-connected MRFs
– Higher-order MRFs
Why larger connectivity?
We have seen…
• "Knock-on" effect (each pixel influences every other pixel)
• Many good systems
What is missing:
1. Modelling real-world texture (images)
2. Reduce discretization artefacts
3. Encode complex prior knowledge
4. Use non-local parameters
Reason 1: Texture modelling
(Figure: training images; test image; test image with 60% noise; results with a 4-connected MRF, a 4-connected MRF (neighbours), and a 9-connected MRF (7 attractive; 2 repulsive))

Reason 1: Texture modelling
(Figure: input / output) [Zalesny et al. '01]
Reason 2: Discretization artefacts
[Boykov et al. '03, '05]
Larger connectivity can model the true Euclidean length (other metrics are also possible).
Length of the two example paths in the figure:
  Euclidean:    5.08   6.75
  4-connected:  5.65   8
  8-connected:  6.28   6.28
Reason 2: Discretization artefacts
Higher connectivity can model the true Euclidean length.
(Figure: segmentation results with 4-connected Euclidean, 8-connected Euclidean, and 8-connected geodesic metrics)
[Boykov et al. '03, '05]
3D reconstruction
[Slide credits: Daniel Cremers]
Reason 3: Encode complex prior knowledge — Stereo with occlusion
Each pixel is connected to D pixels in the other image (left view ↔ right view).
E(d): {1,…,D}^(2n) → R
Pairwise terms θlr(dl,dr) couple a left-image disparity dl with a right-image disparity dr:
(Figure: e.g. d = 10 gives a match (match cost), d = 20 gives cost 0, d = 1 gives cost ∞)
Stereo with occlusion
(Figure: ground truth; stereo with occlusion [Kolmogorov et al. '02]; stereo without occlusion [Boykov et al. '01])
Reason 4: Use non-local parameters — Interactive Segmentation (GrabCut)
[Boykov and Jolly '01]; GrabCut [Rother et al. '04]
(Example image: "A meeting with the Queen")
Reason 4: Use non-local parameters — Interactive Segmentation (GrabCut)
[Rother et al., SIGGRAPH '04]
An object is a compact set of colours (figure: foreground/background colour distributions).
Model segmentation and colour model w jointly:
E(x,w): {0,1}^n x {GMMs} → R
E(x,w) = ∑i θi(xi,w) + ∑i,j Є N4 θij(xi,xj)
Reason 4: Use non-local parameters — Segmentation and Recognition
Large set of example segmentations (exemplars) T(1), T(2), T(3), … — up to 2,000,000 exemplars.
Goal: segment the test image.
E(x,w): {0,1}^n x {Exemplar} → R
E(x,w) = ∑i |T(w)i − xi| + ∑i,j Є N4 θij(xi,xj)
The first sum is the "Hamming distance" between x and the chosen exemplar T(w).
[Lempitsky et al. ECCV '08]
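A sketch of this energy, where the first sum is simply the Hamming distance between x and the selected exemplar T(w); the exemplar list `T` and the Ising pairwise term are assumptions made for illustration:

```python
import numpy as np

def exemplar_energy(x, w, T, pairwise_weight=1.0):
    """E(x, w) = Hamming(x, T(w)) + weighted Ising pairwise term on a 4-connected grid.

    x : (H, W) binary label map
    w : index of the chosen exemplar
    T : list/array of binary exemplar masks, each of shape (H, W)   (hypothetical input)
    """
    hamming = np.abs(T[w].astype(int) - x.astype(int)).sum()   # sum_i |T(w)_i - x_i|
    pairwise = np.abs(np.diff(x, axis=0)).sum() + np.abs(np.diff(x, axis=1)).sum()
    return hamming + pairwise_weight * pairwise
```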
Reason 4: Use non-local parameters — Segmentation and Recognition
[Lempitsky et al. ECCV '08]
UIUC dataset; 98.8% accuracy
Overview
• Introduce factor graph notation
• Categorization of models in Computer Vision:
– 4-connected MRFs
– Highly-connected MRFs
– Higher-order MRFs
Why Higher-order Functions?
In general θ(x1,x2,x3) ≠ θ(x1,x2) + θ(x1,x3) + θ(x2,x3)
Reasons for higher-order MRFs:
1. Even better image (texture) models:
– Field of Experts [FoE, Roth et al. '05]
– Curvature [Woodford et al. '08]
2. Use global priors:
– Connectivity [Vicente et al. '08, Nowozin et al. '09]
– Encode better training statistics [Woodford et al. '09]
Reason 1: Better texture modelling
[Rother et al. CVPR '09]
(Figure: training images; test image; test image with 60% noise; pairwise 9-connected MRF result — higher-order structure not preserved; higher-order MRF result)
Reason 2: Use a global prior — the foreground object must be connected
(Figure: user input; standard MRF: removes noise (+) but shrinks the boundary (−); result with the connectivity prior)
E(x) = P(x) + h(x)   with   h(x) = { ∞ if the foreground of x is not 4-connected; 0 otherwise }
[Vicente et al. '08, Nowozin et al. '09]
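A sketch of the hard global term h(x), assuming the foreground is the set of pixels with x = 1: a flood fill from one foreground pixel decides whether the foreground forms a single 4-connected component.

```python
import numpy as np
from collections import deque

def connectivity_term(x):
    """h(x) = 0 if the foreground (x == 1) is one 4-connected component, infinity otherwise."""
    fg = np.argwhere(x == 1)
    if len(fg) == 0:
        return 0.0
    H, W = x.shape
    seen = np.zeros_like(x, dtype=bool)
    start = tuple(fg[0])
    seen[start] = True
    queue = deque([start])
    while queue:                               # BFS flood fill over the foreground
        i, j = queue.popleft()
        for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1)):
            if 0 <= a < H and 0 <= b < W and x[a, b] == 1 and not seen[a, b]:
                seen[a, b] = True
                queue.append((a, b))
    return 0.0 if seen.sum() == len(fg) else float('inf')
```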
Reason 2: Use a global prior — what is the prior of a MAP-MRF solution?
[Woodford et al. ICCV '09] (see poster on Friday)
Training image: 60% black, 40% white (8 pixels in this toy example).
MAP: prior(x) = 0.6^8 = 0.0168 (the all-black labelling)
Others are less likely, e.g. prior(x) = 0.6^5 · 0.4^3 ≈ 0.0050
The MRF is a bad prior, since the marginal statistics of the input are ignored!
Introduce a global term which controls the global statistics:
(Figure: noisy input; ground truth; pairwise MRF with increasing prior strength; global gradient prior)
• Introduce factor graph notation
• Categorization of models in Computer Vision:
– 4-connected MRFs
– Highly-connected MRFs
– Higher-order MRFs
Summary
…. all useful models, but how do I optimize them?
Course Program
9.30-10.00 Introduction (Andrew Blake)
10.00-11.00 Discrete Models in Computer Vision (Carsten Rother)
15min Coffee break
11.15-12.30 Message Passing: DP, TRW, LP relaxation (Pawan Kumar)
12.30-13.00 Quadratic pseudo-boolean optimization (Pushmeet Kohli)
1 hour Lunch break
14:00-15.00 Transformation and move-making methods (Pushmeet Kohli)
15:00-15.30 Speed and Efficiency (Pushmeet Kohli)
15min Coffee break
15:45-16.15 Comparison of Methods (Carsten Rother)
16:15-17.30 Recent Advances: Dual-decomposition, higher-order, etc. (Carsten Rother + Pawan Kumar)
All material will be available online (after the conference): http://research.microsoft.com/en-us/um/cambridge/projects/tutorial/
END
unused slides …
Markov Property
• Markov Property: each variable is only connected to a few others,
i.e. many pixels are conditionally independent.
• This makes inference easier (possible at all).
• But still… every pixel can influence every other pixel (knock-on effect).
Recap: Factor Graphs
• Factor graphs: a very good representation, since it directly reflects the given energy.
• MRF (Markov property) means many pixels are conditionally independent.
• Still … all pixels influence each other (knock-on effect).
Interactive Segmentation - Tutorial example
Goal
Given z and unknown (latent) variables x:
z Є (R,G,B)^n (observed image), x Є {0,1}^n (segmentation)
Posterior probability: P(x|z) = P(z|x) P(x) / P(z) ~ P(z|x) P(x)
Likelihood (data-dependent) x Prior (data-independent)
Maximum a posteriori (MAP): x* = argmax_x P(x|z)
Likelihood: P(x|z) ~ P(z|x) P(x)
(Figure: colour models log P(zi|xi=1) and log P(zi|xi=0), plotted over Red/Green)
Maximum likelihood: x* = argmax_x P(z|x) = argmax_x ∏i P(zi|xi)
Likelihood P(x|z) ~ P(z|x) P(x)
Prior: P(x|z) ~ P(z|x) P(x)
P(x) = 1/f ∏i,j Є N θij(xi,xj)
f = ∑x ∏i,j Є N θij(xi,xj)   "partition function"
θij(xi,xj) = exp{−|xi−xj|}   "Ising prior"   (exp{−1} = 0.36; exp{0} = 1)
Posterior "Gibbs" distribution:
P(x|z) ~ P(z|x) P(x)   (likelihood x prior)
P(x|z) = 1/f(z,w) exp{−E(x,z,w)},   f(z,w) = ∑x exp{−E(x,z,w)}
Energy:
E(x,z,w) = ∑i θi(xi,zi) + w ∑i,j θij(xi,xj)   (unary terms + pairwise terms)
Unary: θi(xi,zi) = −log P(zi|xi=1) xi − log P(zi|xi=0) (1−xi)
Pairwise: θij(xi,xj) = |xi−xj|
Note: the likelihood can be an arbitrary function of the data.
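For intuition, a brute-force sketch (illustrative only) of the partition function f: it enumerates all 2^n binary labellings, which is feasible only for tiny n and is exactly why approximate inference is needed.

```python
import itertools
import numpy as np

def partition_function(theta_unary, w, edges):
    """f = sum over all binary labellings x of exp(-E(x)), with
    E(x) = sum_i theta_unary[i][x_i] + w * sum_{(i,j) in edges} |x_i - x_j|.

    theta_unary : list of [cost_label0, cost_label1] per variable
    edges       : list of (i, j) index pairs
    """
    n = len(theta_unary)
    f = 0.0
    for x in itertools.product((0, 1), repeat=n):     # 2^n labellings: tiny n only!
        E = sum(theta_unary[i][x[i]] for i in range(n))
        E += w * sum(abs(x[i] - x[j]) for i, j in edges)
        f += np.exp(-E)
    return f
```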
Energy minimization
P(x|z) = 1/f(z,w) exp{−E(x,z,w)},   f(z,w) = ∑x exp{−E(x,z,w)}
−log P(x|z) = −log(1/f(z,w)) + E(x,z,w)
⇒ the MAP solution is the same as the minimum-energy solution:
x* = argmin_x E(x,z,w)
(Figure: ML result vs. MAP / global minimum of E)
Weight prior and likelihood
E(x,z,w) = ∑i θi(xi,zi) + w ∑i,j θij(xi,xj)
(Figure: segmentation results for w = 0, 10, 40, 200)
Moving away from a pure prior …
E(x,z,w) = ∑i θi(xi,zi) + w ∑i,j θij(xi,xj,zi,zj)
θij(xi,xj,zi,zj) = |xi−xj| exp{−ß||zi−zj||²}   with   ß = ( 2 Mean(||zi−zj||²) )⁻¹
(Figure: contrast cost vs. Ising cost as a function of ||zi−zj||²)
"Going from a Markov random field to a conditional random field"
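A sketch of this contrast term and the ß heuristic as written above (assumptions: z is a float image, grey-scale or RGB, and only horizontal/vertical 4-connected neighbours are used):

```python
import numpy as np

def contrast_weights(z):
    """Returns exp(-beta * ||z_i - z_j||^2) for vertical and horizontal 4-connected edges,
    with beta = (2 * mean(||z_i - z_j||^2))^-1 over all neighbouring pairs.
    The full pairwise term is |x_i - x_j| times this factor, so label changes are
    cheap at strong image edges and expensive in flat regions."""
    def sq_diff(a, b):
        d = (a - b) ** 2
        return d.sum(axis=-1) if d.ndim == 3 else d    # sum over colour channels if RGB
    dz_v = sq_diff(z[1:, :], z[:-1, :])
    dz_h = sq_diff(z[:, 1:], z[:, :-1])
    beta = 1.0 / (2.0 * np.mean(np.concatenate([dz_v.ravel(), dz_h.ravel()])))
    return np.exp(-beta * dz_v), np.exp(-beta * dz_h)
```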
Tree vs. loopy graphs
Chain / tree (with root xi) [Felzenszwalb, Huttenlocher '01]:
• MAP is tractable
• Marginals, e.g. P(foot), are tractable
Loopy graphs:
• MAP is (in general) NP-hard (see inference part)
• Marginals P(xi) are also NP-hard
Markov blanket of xi: all variables that are in the same factor as xi.
Stereo matching - prior
[Olga Veksler PhD thesis]
θij(di,dj) = g(|di−dj|)
(Figure: cost as a function of |di−dj| for the Potts model; left image and result: smooth disparities)
Modelling texture [Zalesny et al. '01]
(Figure: input; "unary only"; "8-connected MRF"; "13-connected MRF")
Reason 2: Discretization artefacts
[Boykov et al. '03, '05]
Larger connectivity can model the true Euclidean length (also any Riemannian metric, e.g. geodesic length, can be modelled).
Length of the paths (figure): Euclidean 5.08 / 6.75; 4-connected 5.65 / 8; 8-connected 6.28 / 6.28.
Edge weights: θij(xi,xj) = ∆a / (2·dist(xi,xj)) · |xi−xj|,   ∆a = π/4
(for 8-connectivity, dist(xi,xj) Є {1, √2})
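A small sketch of these edge weights for an 8-connected grid, assuming ∆a = π/4 and dist Є {1, √2} as in the figure:

```python
import math

def edge_weight(dist, delta_a=math.pi / 4):
    """Coefficient delta_a / (2 * dist(x_i, x_j)); the full pairwise term is this
    coefficient times |x_i - x_j|. With 8-connectivity, dist is 1 for axis-aligned
    neighbours and sqrt(2) for diagonal neighbours."""
    return delta_a / (2.0 * dist)

w_axis = edge_weight(1.0)            # horizontal / vertical edges
w_diag = edge_weight(math.sqrt(2))   # diagonal edges
```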
References: Higher-order Functions
In general θ(x1,x2,x3) ≠ θ(x1,x2) + θ(x1,x3) + θ(x2,x3)
Field of Experts model (2x2; 5x5): [Roth, Black CVPR '05], [Potetz, CVPR '07]
Minimize curvature (3x1): [Woodford et al. CVPR '08]
Large neighbourhood (10x10 → whole image):
[Rother, Kolmogorov, Minka & Blake, CVPR '06], [Vicente, Kolmogorov, Rother, CVPR '08],
[Komodakis, Paragios, CVPR '09], [Rother, Kohli, Feng, Jia, CVPR '09],
[Woodford, Rother, Kolmogorov, ICCV '09], [Vicente, Kolmogorov, Rother, ICCV '09],
[Ishikawa, CVPR '09], [Ishikawa, ICCV '09]
Conditional Random Field (CRF)
Definition CRF: all factors may depend on the data z.
This is no problem for inference (but it matters for parameter learning).
E(x) = ∑i θi(xi,zi) + ∑i,j Є N4 θij(xi,xj,zi,zj)
with θij(xi,xj,zi,zj) = |xi−xj| exp(−ß||zi−zj||²)
(Figure: contrast cost vs. Ising cost as a function of ||zi−zj||²; factor graph with variables xi, xj and observed zi, zj)
Recommended