Probabilistic Models for Parsing Images
Xiaofeng Ren, University of California, Berkeley
Parsing Images
[Figure: parsing an image. Scene: outdoor, wildlife. Regions: Tiger, Grass, Water, Sand. Tiger parts: tail, eye, legs, head, back, shadow, mouth.]
A Classical View of Visual Processing
Pixels & Pixel Features → Contours & Regions (Tiger, Grass, Water, Sand) → Objects & Scenes
Low-level: Image Processing → Mid-level: Perceptual Organization → High-level: Recognition
Models for Parsing Images
Pixels → Contours & Regions → Objects & Scenes
Low-level: Image Processing → Mid-level: Perceptual Organization → High-level: Recognition
A unified framework incorporating all levels of abstraction
Probabilistic Models for Images
Markov Random Fields [Geman & Geman 84]
[Figure: grid MRF linking pixels and labels]
Very limited representational power.
Image restoration, edge detection, texture synthesis, segmentation, super-resolution, contour completion, …
Empirical evidence against pixel-based MRF [Ren & Malik 02]
Where is Structure?
Our perception of structure is disrupted.
We cannot efficiently reason about structure if we cannot represent it.
Outline
• Parsing Images: Building a Mid-level Representation
• Probabilistic Models for Mid-level Vision: Contour Completion; Figure/Ground Organization
• Combining Mid- and High-level Vision: Object Segmentation; Finding People
• Conclusion & Future Work
Local Edge Detection
Use the Pb (probability of boundary) edge detector, which combines local brightness, texture and color contrasts.
Piece-wise Linear Approximation
Recursively split the boundaries (using angles) until each piece is approximately straight.
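The angle-based recursive split can be sketched as follows; splitting at the sharpest interior vertex and the threshold value are illustrative assumptions, since the slide only says "using angles":

```python
import math

def turn_angle(a, b, c):
    # absolute turning angle at vertex b along the path a -> b -> c
    d = abs(math.atan2(c[1] - b[1], c[0] - b[0]) - math.atan2(b[1] - a[1], b[0] - a[0]))
    return min(d, 2 * math.pi - d)

def split_recursive(pts, thresh=0.3):
    # Split the polyline at its sharpest interior vertex until every piece
    # bends less than `thresh` radians (i.e. is approximately straight).
    if len(pts) <= 2:
        return [pts]
    angles = [turn_angle(pts[i - 1], pts[i], pts[i + 1]) for i in range(1, len(pts) - 1)]
    k = max(range(len(angles)), key=angles.__getitem__) + 1
    if angles[k - 1] < thresh:
        return [pts]                               # approximately straight already
    return split_recursive(pts[:k + 1], thresh) + split_recursive(pts[k:], thresh)
```

An L-shaped polyline, for example, splits into its two straight arms at the corner.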
Constrained Delaunay Triangulation (CDT)
• A variant of the standard Delaunay triangulation
• Keeps a given set of edges in the triangulation
• Widely used in geometric modeling and finite elements
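For illustration, here is a brute-force check of the (unconstrained) Delaunay empty-circumcircle criterion; a real CDT additionally forces the detected contour segments to appear as triangulation edges (e.g. via CGAL or the `triangle` library), which this sketch omits:

```python
from itertools import combinations

def ccw(a, b, c):
    # twice the signed area of triangle abc (>0 if counter-clockwise)
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def in_circumcircle(a, b, c, d):
    # True if d lies strictly inside the circumcircle of triangle abc
    if ccw(a, b, c) < 0:
        a, c = c, a                                # orient counter-clockwise
    rows = [(p[0] - d[0], p[1] - d[1],
             (p[0] - d[0]) ** 2 + (p[1] - d[1]) ** 2) for p in (a, b, c)]
    det = (rows[0][0] * (rows[1][1] * rows[2][2] - rows[1][2] * rows[2][1])
         - rows[0][1] * (rows[1][0] * rows[2][2] - rows[1][2] * rows[2][0])
         + rows[0][2] * (rows[1][0] * rows[2][1] - rows[1][1] * rows[2][0]))
    return det > 0

def delaunay(points):
    # O(n^4) brute force: keep every non-degenerate triangle whose
    # circumcircle contains no other input point
    tris = []
    for i, j, k in combinations(range(len(points)), 3):
        a, b, c = points[i], points[j], points[k]
        if ccw(a, b, c) == 0:
            continue
        if not any(in_circumcircle(a, b, c, points[m])
                   for m in range(len(points)) if m not in (i, j, k)):
            tris.append((i, j, k))
    return tris
```

For four points with one inside the others' hull, this yields the three triangles fanning around the interior point.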
Scale Invariance of CDT
The CDT Graph: Summary
• millions of pixels → ~1000 edges
• fast to compute
• scale-invariant
• completes gaps
• little loss of structure
Pixels
Superpixels
Principle of Uniform Connectedness: use homogeneous regions as entry-level units in perceptual organization. [Palmer and Rock 94]
• longer ranges of interaction
[Ren & Malik; ICCV 2003][Ren, Fowlkes & Malik; ICCV 2005]
Analogy with Natural Language Parsing
Letters → Words → Phrases → Sentences & Paragraphs
Pixels → Superpixels → Contours & Regions → Objects & Scenes
Outline
• Parsing Images: Building a Mid-level Representation
• Probabilistic Models for Mid-level Vision: Contour Completion; Figure/Ground Organization
• Combining Mid- and High-level Vision: Object Segmentation; Finding People
• Conclusion & Future Work
Mid-level Vision
It is not low-level vision (which can be computed independently in a local neighborhood).
It is not high-level vision (which assumes knowledge of particular object categories & scenes).
Problems in mid-level vision: curvilinear grouping, figure/ground organization, region segmentation.
Curvilinear Grouping
Boundaries are smooth in nature!
A number of associated visual phenomena: good continuation, visual completion, illusory contours.
Beyond Local Edge Detection
There is psychophysical evidence that we are approaching the limit of local edge detection.
Smoothness of boundaries in natural images provides an important contextual cue.
Inference on the CDT Graph
[Figure: CDT graph with a binary variable Xe on each edge]
Xe ∈ {0,1}; 1: boundary, 0: non-boundary
Estimate the marginal P(Xe)
Random Field: defines a joint probability distribution over all {Xe}.
Conditional Random Fields (CRF)
[Pietra, Pietra & Lafferty 97] [Lafferty, McCallum & Pereira 01]
Edge potentials exp(Σi βi φi); junction potentials exp(Σj αj ψj)
P(X | I) = (1/Z) exp( Σi βi φi(X, I) + Σj αj ψj(X, I) )
where Z = ΣX exp( Σi βi φi + Σj αj ψj ) and X = {X1, X2, …, Xm}
An undirected graphical model with potential functions in the exponential family.
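A toy version of such a model, with made-up contrast values and parameters, small enough that the marginals P(Xe) can be computed by brute-force enumeration rather than belief propagation:

```python
import itertools, math

# Toy CRF on three edges meeting at a single junction. The edge feature is the
# local contrast (hypothetical Pb values); the junction feature is the junction
# degree. beta and the alpha values are made-up parameters for illustration.
contrast = [0.9, 0.8, 0.1]
beta = 2.0
alpha = {0: 1.0, 1: -1.0, 2: 0.5, 3: -2.0}   # favors continuation, penalizes endings

def score(x):
    edge_term = beta * sum(c * xi for c, xi in zip(contrast, x))
    junction_term = alpha[sum(x)]            # degree = number of "on" edges
    return math.exp(edge_term + junction_term)

states = list(itertools.product([0, 1], repeat=3))
Z = sum(score(x) for x in states)            # partition function
marginals = [sum(score(x) for x in states if x[e] == 1) / Z for e in range(3)]
```

On a full CDT graph the state space is far too large to enumerate, which is why inference is done with loopy belief propagation instead.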
Edge Potential: Local Contrast
Potentials exp(Σi βi φi); φ = average contrast along each edge e
Junction Potential: Degree
The degree of a junction depends on the assignments of {Xe}:
deg=0 (no lines), deg=1 (line ending), deg=2 (continuation), deg=3 (T-junction)
ψj = 1(deg = j); potentials exp(Σj αj ψj)
Junction Potential: Continuity
For degree-2 junctions (continuation): ψ = g(θ) · 1(deg = 2)
Learning the Parameters
Learned parameters: 2.46, 0.87, 1.14, 0.01
mid-level representation + probabilistic framework + large annotated datasets
Compare to [Geman and Geman 84]
Evaluation: Precision vs Recall
[Plot: precision vs recall; detections matched to groundtruth]
Precision = matched pairs / total detections
Recall = matched pairs / total groundtruth
High threshold: few detections. Low threshold: lots of detections.
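The benchmark quantities can be computed directly from the matched counts; the numbers below are invented for illustration:

```python
def precision_recall(matched_pairs, total_detections, total_groundtruth):
    # Boundary-detection benchmark scores as defined on the slide.
    precision = matched_pairs / total_detections
    recall = matched_pairs / total_groundtruth
    return precision, recall

# High threshold: few detections, mostly correct -> high precision, low recall.
p_hi, r_hi = precision_recall(matched_pairs=90, total_detections=100, total_groundtruth=1000)
# Low threshold: many detections -> lower precision, higher recall.
p_lo, r_lo = precision_recall(matched_pairs=600, total_detections=2000, total_groundtruth=1000)
```

Sweeping the detector threshold traces out the precision-recall curve shown in the evaluation.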
Curvilinear grouping improves boundary detection, at both low and high recall.
Horse dataset of [Borenstein and Ullman 02]: 175 training images, 175 test images.
“Mid-level vision is useful”
[Ren, Fowlkes & Malik; ICCV 2005]
Image Pb CRF
Mid-level Vision
Problems in mid-level vision: curvilinear grouping, figure/ground organization, region segmentation.
Figure/Ground Organization
A contour belongs to one of the two (but not both) abutting regions.
[Figure: face/goblet illusion. Figure (face) vs Ground (shapeless); Figure (goblet) vs Ground (shapeless)]
Important for the perception of shape
Inference on the CDT Graph
[Figure: CDT graph with a binary variable Xe on each edge]
Xe ∈ {-1,1}; 1: left side is figure, -1: right side is figure
Local Model: Convexity, Parallelism,…
Global Model: Consistency at T-junctions
Results
Chance: 50.0%
Baseline (size/convexity): 55.6%
Local shapemes: 64.8%
Shapemes averaged on segmentation boundaries: 72.0%
Shapemes + CRF: 78.3%
Dataset consistency: 88.0%
(using human segmentations)
[Ren, Fowlkes & Malik; ECCV 2006]
Models for Contour Labeling
[Figure: Tiger, Grass, Water, Sand parse]
Labels {Xe}
Curvilinear Grouping
Figure/Ground Assignment
[Hierarchy: Pixels → Superpixels → Contours & Regions → Objects & Scenes; the CRF operates at the contours & regions level]
Line Labeling
> : contour direction
+ : convex edge
- : concave edge
Reviving the old tradition with modern technologies, for more realistic applications
possible junctions (constraints)
CSP
[Clowes 1971, Huffman 1971; Waltz 1972]
Parsing Images
[Figure: Tiger, Grass, Water, Sand]
Add region-based variables and cues
Joint contour and region inference
Add high-level knowledge (objects)
[Hierarchy: Pixels → Superpixels → Contours & Regions → Objects & Scenes]
Object Segmentation
Object-specific cues: shape, region support, color/texture, …
Inference on the CDT Graph
[Figure: CDT graph with contour variables Xe on edges, region variables Yt on triangles, and an object variable Z]
Contour variables {Xe}; region variables {Yt}; object variable Z
Integrating {Xe}, {Yt} and Z: low/mid/high-level cues
[Figure: the same graph with Z connected to all regions]
Z encodes location, scale, pose, etc.
Grouping Cues
Low-level cues:
• Edge energy along edge e: L1(Xe | I)
• Brightness/texture similarity between two regions s and t: L2(Ys, Yt | I)
Mid-level cues:
• Edge collinearity and junction frequency at vertex V: M1(XV | I)
• Consistency between edge e and two adjoining regions s and t: M2(Xe, Ys, Yt)
High-level cues:
• Texture similarity of region t to exemplars: H1(Yt | I)
• Compatibility of region support with pose: H2(Yt, Z | I)
• Compatibility of local edge shape with pose: H3(Xe, Z | I)
Cue Integration in CRF
P(X, Y, Z | I) = (1/Z(I)) exp( E(X, Y, Z, I) )
E = Σe L1(Xe|I) + Σ(s,t) L2(Ys,Yt|I) + ΣV M1(XV|I) + Σ(e,s,t) M2(Xe,Ys,Yt) + Σt H1(Yt|I) + Σt H2(Yt,Z|I) + Σe H3(Xe,Z|I)
Estimate the marginal posteriors of X, Y and Z
Object knowledge helps a lot
Mid-level cues are still useful
[Ren, Fowlkes & Malik; NIPS 2005]
Input Input Pb Output Contour Output Figure
Finding People
The challenges:
• Pose: articulation + self-occlusion
• Clothing, lighting, clutter
• …
Finding People: Top-Down
[Hierarchy: Pixels → Objects & Scenes, skipping the mid-level]
Top-down approaches:
• 3D model-based: fails most of the time
• 2D template-based: needs lots of training data
Finding People: Bottom-Up
[Hierarchy: Pixels → Superpixels → Contours & Regions → Objects & Scenes]
[Ren, Berg & Malik; ICCV 2005]
Tracking People as Blobs
Blob tracking != Rectangle tracking
… k-1, k, k+1, …
Figure/Ground Segmentation
Object/Background Appearance Model
Temporal Coherence
Preliminary Results
Tracking = Repeated Segmentation
(video)
Conclusion
Constrained Delaunay Triangulation (CDT)
Conditional Random Fields (CRF)
Quantitative evaluations
Integration of mid-level with high-level vision
Future Work
[Hierarchy: Pixels → Superpixels → Contours & Regions → Objects & Scenes]
A richer and more consistent mid-level representation
Higher-order potential functions
Using mid-level representation for general object recognition
A high-fidelity tracking system
Finding people in static images
Thank You
Acknowledgements Joint work with Charless Fowlkes, Alex Berg, and Jitendra Malik.
References
• X. Ren, C. Fowlkes and J. Malik. Figure/Ground Assignment in Natural Images. In ECCV 2006.
• X. Ren, C. Fowlkes and J. Malik. Cue Integration in Figure/Ground Labeling. In NIPS 2005.
• X. Ren, A. Berg and J. Malik. Recovering Human Body Configurations using Pairwise Constraints between Parts. In ICCV 2005.
• X. Ren, C. Fowlkes and J. Malik. Scale-Invariant Contour Completion using Conditional Random Fields. In ICCV 2005.
• X. Ren and J. Malik. Learning a Classification Model for Segmentation. In ICCV 2003.
• X. Ren and J. Malik. A Multi-Scale Probability Model for Contour Completion based on Image Statistics. In ECCV 2002.
Finding People from Bottom-Up
• Detecting parts: superpixels
• Assembling parts: Integer Quadratic Programming (IQP)
Finding People in Video
[Hierarchy: Pixels → Superpixels → Contours & Regions]
Additional information: motion, appearance, temporal consistency.
How much can we do without object model (blob tracking)?
I stand at the window and see a house, trees, sky. Theoretically I might say there were 327 brightnesses and nuances of colour.
Do I have "327"?
No. I have sky, house, and trees.
---- Max Wertheimer, 1923
Learning the Parameters
Maximum-likelihood estimation in CRF. Let X̃ denote the groundtruth labeling on the CDT graph.
log L(X̃) = Σi βi φi(X̃) + Σj αj ψj(X̃) − log Z
For each parameter λt with feature ft, the gradient is
∂ log L / ∂λt = ft(X̃) − E P(X) [ ft(X) ]
Gradient descent works well.
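A minimal sketch of maximum-likelihood learning in an exponential-family model: the gradient for each weight is the empirical feature value minus the model's expected feature value. The two-variable model and its features below are illustrative assumptions, not the talk's CDT features:

```python
import itertools, math

# Tiny log-linear model illustrating the ML gradient:
#   grad_t = f_t(groundtruth) - E_P[f_t]
features = lambda x: [float(x[0]), float(x[1]), float(x[0] == x[1])]
states = list(itertools.product([0, 1], repeat=2))

def expectations(w):
    # model expectations E_P[f_t] under the current weights
    scores = [math.exp(sum(wi * fi for wi, fi in zip(w, features(x)))) for x in states]
    Z = sum(scores)
    return [sum((s / Z) * features(x)[t] for s, x in zip(scores, states))
            for t in range(len(w))]

x_true = (1, 1)                      # plays the role of the groundtruth labeling
w = [0.0, 0.0, 0.0]
for _ in range(200):                 # plain gradient ascent on the log-likelihood
    exp_f = expectations(w)
    w = [wi + 0.5 * (ft - et)
         for wi, ft, et in zip(w, features(x_true), exp_f)]
```

After training, the model puts most of its probability mass on the observed labeling, since the expected feature counts have been pulled toward the empirical ones.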
Global Consistency
[Figure: figure/ground label configurations at junctions: common vs uncommon]
Use junction potentials to encode junction type.
Image Groundtruth Local Global
Results
Chance: 50.0%
Baseline (size/convexity): N/A
Local shapemes: 64.9%
Shapemes averaged on segmentation boundaries: 66.5%
Shapemes + CRF: 68.9%
Dataset consistency: 88.0%
(without human segmentations)
Image Pb Local Global
Outline
• Parsing Images: Building a Mid-level Representation
• Probabilistic Models for Mid-level Vision: Contour Completion; Figure/Ground Organization
• Combining Mid- and High-level Vision: Object Segmentation; Finding People
• Conclusion & Future Work
Detecting Parts: CDT
• Candidate parts as parallel line segments (Ebenbreite)
• Automatic scale selection from bottom-up
• Feature combination with a logistic classifier
Assembling Parts: IQP
Candidates {Ci}, parts {Lj}
An assignment ω maps parts to candidates: (Lj1, Ci1 = ω(Lj1)), (Lj2, Ci2 = ω(Lj2))
Cost for a partial assignment {(Lj1, Ci1), (Lj2, Ci2)}:
H((j1,i1),(j2,i2)) = Σk λk fk( (Lj1, Ci1), (Lj2, Ci2) )
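A sketch of the pairwise-cost structure behind the IQP: a binary assignment of parts to candidates is scored by a quadratic cost over pairs, with brute-force enumeration standing in for the actual IQP solver. The cost entries are made up, and the real model also has unary terms and exclusion constraints, omitted here:

```python
import itertools

n_parts, n_cands = 2, 3
# sparse pairwise costs H[((part j1, cand i1), (part j2, cand i2))]
H = {((0, 0), (1, 1)): -2.0,   # part 0 -> cand 0 with part 1 -> cand 1: good pair
     ((0, 0), (1, 2)): 1.0,
     ((0, 1), (1, 0)): 0.5}

def cost(assign):              # assign[j] = candidate index chosen for part j
    c = 0.0
    for j1, j2 in itertools.combinations(range(n_parts), 2):
        c += H.get(((j1, assign[j1]), (j2, assign[j2])), 0.0)
    return c

# brute force over all assignments (the IQP solver replaces this in practice)
best = min(itertools.product(range(n_cands), repeat=n_parts), key=cost)
```
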
Testing the Markov Assumption
The Markov Model for Contours:
• Curvature = white noise (independent)
• Tangent direction t(s) = random walk
• P( t(s+1) | t(s), … ) = P( t(s+1) | t(s) )
Dynamic Programming
[Figure: dynamic-programming lattice over tangent states t(s) → t(s+1) between steps s and s+1]
[Mumford 1994, Williams & Jacobs 1995]
Testing the Markov Assumption
Segment the contours at high-curvature positions.
If the Markov assumption holds:
• at each step, a high-curvature event happens with probability p
• high-curvature events are independent from step to step
Therefore, if L is the length of a contour segment between high-curvature points,
P(L = k) = p (1 − p)^k
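The geometric prediction can be checked numerically (p here is an arbitrary illustrative value):

```python
# If high-curvature events are independent with probability p per step, the
# segment length L between events is geometric: P(L = k) = p * (1 - p)**k.
p = 0.2
pmf = [p * (1 - p) ** k for k in range(500)]

total = sum(pmf)                                  # ~1: a proper distribution
mean = sum(k * pk for k, pk in enumerate(pmf))    # ~(1 - p) / p
# The exponential tail (1 - p)**k decays much faster than the power-law tail
# observed in natural images, which is the evidence against the Markov assumption.
```
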
Berkeley Segmentation Dataset
[Martin, Fowlkes, Tal and Malik, ICCV 2001]: 1,000 images, >14,000 segmentations
Exponential vs Power Law
[Plot: probability vs contour segment length L]
Power law (scale invariance): P(L) ∝ 1 / L^2.4
Exponential law (Markov assumption)
Scale Invariance
Arbitrary viewing distance
Hierarchy of Parts
Finger, Leg, Torso
A Scale-Invariant Representation
[Figure: the Tiger/Grass/Water/Sand scene re-scaled in scale space]
Goal: a scale-invariant representation for contours
Gap-Filling Property of CDT
A typical scenario of contour completion: a low-contrast gap between two high-contrast edges.
CDT picks the “right” edge, completing the gap.
No Loss of Structure
Use P_human, the soft groundtruth label defined on CDT graphs: precision close to 100%.
Pb averaged over CDT edges: no worse than the original Pb.
Increase in asymptotic recall rate: completion of gradientless contours.
Uniform Connectedness
Connected regions of homogeneous properties (brightness, color, texture) are perceived as entry-level units. [Palmer & Rock, 1994]
“Classical principles of grouping operate after UC, creating superordinate units consisting of two or more entry-level units.”
“… UC (uniform connectedness) cannot be reduced to grouping principles, because it is not a form of grouping at all…”
Local Model
“Bi-gram” model:
• contrast + continuity
• binary classification (0,0) vs (1,1)
• logistic classifier
“Tri-gram” model
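A minimal logistic classifier combining two cues, in the spirit of the contrast + continuity model above; the synthetic cue distributions, sample size and learning settings are all assumptions for illustration:

```python
import math, random

random.seed(0)

def make_example():
    # synthetic data: boundary pairs (y=1) have higher contrast and continuity
    y = 1.0 if random.random() < 0.5 else 0.0
    contrast = random.gauss(0.7 if y else 0.3, 0.15)
    continuity = random.gauss(0.8 if y else 0.4, 0.2)
    return (contrast, continuity), y

data = [make_example() for _ in range(500)]
w = [0.0, 0.0, 0.0]                       # two cue weights + bias
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

for _ in range(500):                      # batch gradient ascent on log-likelihood
    grad = [0.0, 0.0, 0.0]
    for (c1, c2), y in data:
        p = sigmoid(w[0] * c1 + w[1] * c2 + w[2])
        for t, f in enumerate((c1, c2, 1.0)):
            grad[t] += (y - p) * f
    w = [wi + 0.5 * g / len(data) for wi, g in zip(w, grad)]

acc = sum((sigmoid(w[0] * c1 + w[1] * c2 + w[2]) > 0.5) == (y > 0.5)
          for (c1, c2), y in data) / len(data)
```

The learned weights show how much each cue contributes to the boundary decision.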
Building a CRF Model
What are the features?
• edge features: low-level “edgeness” (Pb)
• junction features: junction type, continuity
How to make inference? Loopy Belief Propagation.
How to learn the parameters? Gradient descent on maximum likelihood.
X={X1,X2,…,Xm}
Estimate the marginals P(Xi | I)
Junction and Continuity
Junction types (deg_g, deg_c): e.g. deg_g=1, deg_c=0; deg_g=0, deg_c=2; deg_g=1, deg_c=2; deg_g=0, deg_c=0
f(a,b)(X) = 1(deg_g = a, deg_c = b); potentials exp( Σ(a,b) α(a,b) f(a,b) )
Continuity term for degree-2 junctions (deg_g + deg_c = 2): exp( γ g(θ) · 1(deg_g + deg_c = 2) )
Interpreting the Parameters
Learned values: 2.46, 0.87, 1.14, 0.01, -0.59, -0.98
Line endings and junctions are rare
Completed edges are weak
Continuity improves boundary detection in both low-recall and high-recall ranges
Global inference helps; mostly in low-recall/high-precision
Roughly speaking, CRF > Local > CDT only > Pb
Image Pb Local Global
Figure/Ground Principles
• Convexity
• Parallelism
• Surroundedness, symmetry, common fate, familiar configuration
• …
[Figure: figure/ground examples]
Figure/Ground Dataset
Figure/Ground Assignment in Natural Images
Local model: use shapemes (prototypical local shapes) to capture contextual information.
Global model: use a CRF to enforce consistency at junctions.
Shapemes: Prototypical Local Shapes
Collect local shapes and cluster them.
[Figure: the average shape in each shapeme cluster]
Shapemes for F/G Discrimination
[Figure: shapeme examples with P(left side is figure): 93.84%, 49.80%, 89.59%, 11.69%, 66.52%, 4.98%]
Which side is Figure?
Train a logistic classifier to linearly combine the shapeme cues
CRF for Figure/Ground
F={F1,F2,…,Fm}
Fi ∈ {Left, Right}
• Put potential functions at junctions
• One feature for each junction type
[Figure: junctions with figure/ground label configurations, e.g. {(F,G),(G,F),(F,G)}, {(G,F),(F,G)}, {(F,G),(F,G),(F,G)}]
Results
CDT vs K-Neighbor
An alternative scheme for completion: connect to k-nearest neighbor vertices, subject to visibility
CDT achieves higher asymptotic recall rates
Inference w/ Belief Propagation
Loopy Belief Propagation:
• just like belief propagation: iterates message passing until convergence
• lacks theoretical guarantees and is known to have convergence issues
• nevertheless becoming popular in practice; typically applied on the pixel grid
Works well on CDT graphs:
• converges fast (<10 iterations)
• produces empirically sound results
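A self-contained sketch of sum-product loopy BP on a tiny three-variable cycle, compared against brute-force marginals; the potentials are arbitrary illustrative numbers, not the CDT potentials:

```python
import itertools, math

unary = [[1.0, 2.0], [1.5, 1.0], [1.0, 1.0]]                 # psi_i(x_i)
edges = [(0, 1), (1, 2), (2, 0)]
pair = {e: [[2.0, 1.0], [1.0, 2.0]] for e in edges}          # mildly attractive

def pot(i, j, xi, xj):
    return pair[(i, j)][xi][xj] if (i, j) in pair else pair[(j, i)][xj][xi]

# directed messages m[(i, j)][x_j], initialised uniform
msgs = {(i, j): [1.0, 1.0] for i, j in edges + [(j, i) for i, j in edges]}

for _ in range(50):                                          # iterate to convergence
    new = {}
    for (i, j) in msgs:
        m = [sum(unary[i][xi] * pot(i, j, xi, xj)
                 * math.prod(msgs[(k, i)][xi] for k in range(3) if k not in (i, j))
                 for xi in range(2)) for xj in range(2)]
        s = sum(m)
        new[(i, j)] = [v / s for v in m]                     # normalise for stability
    msgs = new

def bp_marginal(i):
    b = [unary[i][xi] * math.prod(msgs[(k, i)][xi] for k in range(3) if k != i)
         for xi in range(2)]
    s = sum(b)
    return [v / s for v in b]

def exact_marginal(i):
    p = [0.0, 0.0]
    for x in itertools.product(range(2), repeat=3):
        w = math.prod(unary[k][x[k]] for k in range(3))
        w *= math.prod(pot(a, b, x[a], x[b]) for a, b in edges)
        p[x[i]] += w
    s = sum(p)
    return [v / s for v in p]
```

On this single loop with mild potentials, the BP beliefs land close to the exact marginals, matching the talk's experience on CDT graphs.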
Shape Context
Count the number of edge points inside each bin of a “log-polar” histogram.
[Figure: log-polar bins over an edge map, e.g. count=4, count=6]
[Belongie, Malik & Puzicha, ICCV 2001] [Berg & Malik, CVPR 2001]
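A sketch of the log-polar histogram; the 5 radial × 12 angular bins and the inner/outer radii follow commonly used defaults, which is an assumption here:

```python
import math

def shape_context(center, points, r_inner=0.125, r_outer=2.0, n_r=5, n_theta=12):
    # Log-polar histogram of edge points around `center`.
    hist = [[0] * n_theta for _ in range(n_r)]
    # log-spaced outer radii of the n_r radial bins
    edges = [r_inner * (r_outer / r_inner) ** (k / n_r) for k in range(1, n_r + 1)]
    for x, y in points:
        dx, dy = x - center[0], y - center[1]
        r = math.hypot(dx, dy)
        if r == 0 or r > r_outer:
            continue                          # skip the center itself and far outliers
        rbin = next(k for k, e in enumerate(edges) if r <= e) if r > r_inner else 0
        theta = (math.atan2(dy, dx) + 2 * math.pi) % (2 * math.pi)
        tbin = min(int(theta / (2 * math.pi / n_theta)), n_theta - 1)
        hist[rbin][tbin] += 1
    return hist
```

Each edge point increments exactly one (radius, angle) bin, so the histogram totals equal the number of in-range points.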
Compare to DDMCMC: we try to solve the same problem.
• A unified framework for image parsing
• Mid-level representation: CDT vs “atomic regions”
• Probabilistic model: discriminative vs generative
• Inference mechanism: belief propagation vs MCMC
• Quantitative evaluation
We try to develop models step by step.