Probabilistic Models for Parsing Images
Xiaofeng Ren, University of California, Berkeley
Parsing Images
[Figure: parsing an image. Scene: outdoor, wildlife. Regions: Tiger, Grass, Water, Sand. Tiger parts: tail, eye, legs, head, back, shadow, mouth.]
A Classical View of Visual Processing
Pixels & Pixel Features → Contours & Regions (Tiger, Grass, Water, Sand) → Objects & Scenes
Low-level: Image Processing → Mid-level: Perceptual Organization → High-level: Recognition
Models for Parsing Images
Pixels → Contours & Regions → Objects & Scenes
Low-level: Image Processing → Mid-level: Perceptual Organization → High-level: Recognition
A unified framework incorporating all levels of abstraction
Probabilistic Models for Images
Markov Random Fields [Geman & Geman 84]
[Figure: grid MRF linking pixels and labels]
Very limited representational power.
Image restoration, edge detection, texture synthesis, segmentation, super-resolution, contour completion, …
Empirical evidence against pixel-based MRF [Ren & Malik 02]
Where is Structure?
Our perception of structure is disrupted.
We cannot efficiently reason about structure if we cannot represent it.
Outline
• Parsing Images: Building a Mid-level Representation
• Probabilistic Models for Mid-level Vision: Contour Completion; Figure/Ground Organization
• Combining Mid- and High-level Vision: Object Segmentation; Finding People
• Conclusion & Future Work
Local Edge Detection
Use the Pb (probability of boundary) edge detector, which combines local brightness, texture and color contrasts.
Piece-wise Linear Approximation
Recursively split the boundaries (using angles) until each piece is approximately straight.
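The angle-based recursive split can be sketched as follows; splitting at the sharpest interior vertex and the threshold value are illustrative assumptions, since the slide only says "using angles":

```python
import math

def turn_angle(a, b, c):
    # absolute turning angle at vertex b along the path a -> b -> c
    d = abs(math.atan2(c[1] - b[1], c[0] - b[0]) - math.atan2(b[1] - a[1], b[0] - a[0]))
    return min(d, 2 * math.pi - d)

def split_recursive(pts, thresh=0.3):
    # Split the polyline at its sharpest interior vertex until every piece
    # bends less than `thresh` radians (i.e. is approximately straight).
    if len(pts) <= 2:
        return [pts]
    angles = [turn_angle(pts[i - 1], pts[i], pts[i + 1]) for i in range(1, len(pts) - 1)]
    k = max(range(len(angles)), key=angles.__getitem__) + 1
    if angles[k - 1] < thresh:
        return [pts]                               # approximately straight already
    return split_recursive(pts[:k + 1], thresh) + split_recursive(pts[k:], thresh)
```

An L-shaped polyline, for example, splits into its two straight arms at the corner.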
Constrained Delaunay Triangulation (CDT)
• A variant of the standard Delaunay triangulation
• Keeps a given set of edges in the triangulation
• Widely used in geometric modeling and finite elements
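For illustration, here is a brute-force check of the (unconstrained) Delaunay empty-circumcircle criterion; a real CDT additionally forces the detected contour segments to appear as triangulation edges (e.g. via CGAL or the `triangle` library), which this sketch omits:

```python
from itertools import combinations

def ccw(a, b, c):
    # twice the signed area of triangle abc (>0 if counter-clockwise)
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def in_circumcircle(a, b, c, d):
    # True if d lies strictly inside the circumcircle of triangle abc
    if ccw(a, b, c) < 0:
        a, c = c, a                                # orient counter-clockwise
    rows = [(p[0] - d[0], p[1] - d[1],
             (p[0] - d[0]) ** 2 + (p[1] - d[1]) ** 2) for p in (a, b, c)]
    det = (rows[0][0] * (rows[1][1] * rows[2][2] - rows[1][2] * rows[2][1])
         - rows[0][1] * (rows[1][0] * rows[2][2] - rows[1][2] * rows[2][0])
         + rows[0][2] * (rows[1][0] * rows[2][1] - rows[1][1] * rows[2][0]))
    return det > 0

def delaunay(points):
    # O(n^4) brute force: keep every non-degenerate triangle whose
    # circumcircle contains no other input point
    tris = []
    for i, j, k in combinations(range(len(points)), 3):
        a, b, c = points[i], points[j], points[k]
        if ccw(a, b, c) == 0:
            continue
        if not any(in_circumcircle(a, b, c, points[m])
                   for m in range(len(points)) if m not in (i, j, k)):
            tris.append((i, j, k))
    return tris
```

For four points with one inside the others' hull, this yields the three triangles fanning around the interior point.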
Scale Invariance of CDT
The CDT Graph: Summary
• millions of pixels → ~1000 edges
• fast to compute
• scale-invariant
• completes gaps
• little loss of structure
Pixels
Superpixels
Principle of Uniform Connectedness: use homogeneous regions as entry-level units in perceptual organization. [Palmer and Rock 94]
• longer ranges of interaction
[Ren & Malik; ICCV 2003][Ren, Fowlkes & Malik; ICCV 2005]
Analogy with Natural Language Parsing
Letters → Words → Phrases → Sentences & Paragraphs
Pixels → Superpixels → Contours & Regions → Objects & Scenes
Outline
• Parsing Images: Building a Mid-level Representation
• Probabilistic Models for Mid-level Vision: Contour Completion; Figure/Ground Organization
• Combining Mid- and High-level Vision: Object Segmentation; Finding People
• Conclusion & Future Work
Mid-level Vision
It is not low-level vision (which can be computed independently in a local neighborhood).
It is not high-level vision (which assumes knowledge of particular object categories & scenes).
Problems in mid-level vision: curvilinear grouping, figure/ground organization, region segmentation.
Curvilinear Grouping
Boundaries are smooth in nature!
A number of associated visual phenomena: good continuation, visual completion, illusory contours.
Beyond Local Edge Detection
There is psychophysical evidence that we are approaching the limit of local edge detection.
Smoothness of boundaries in natural images provides an important contextual cue.
Inference on the CDT Graph
[Figure: CDT graph with a binary variable Xe on each edge]
Xe ∈ {0,1}; 1: boundary, 0: non-boundary
Estimate the marginal P(Xe)
Random Field: defines a joint probability distribution over all {Xe}.
Conditional Random Fields (CRF)
[Pietra, Pietra & Lafferty 97] [Lafferty, McCallum & Pereira 01]
Edge potentials exp(Σi βi φi); junction potentials exp(Σj αj ψj)
P(X | I) = (1/Z) exp( Σi βi φi(X, I) + Σj αj ψj(X, I) )
where Z = ΣX exp( Σi βi φi + Σj αj ψj ) and X = {X1, X2, …, Xm}
An undirected graphical model with potential functions in the exponential family.
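A toy version of such a model, with made-up contrast values and parameters, small enough that the marginals P(Xe) can be computed by brute-force enumeration rather than belief propagation:

```python
import itertools, math

# Toy CRF on three edges meeting at a single junction. The edge feature is the
# local contrast (hypothetical Pb values); the junction feature is the junction
# degree. beta and the alpha values are made-up parameters for illustration.
contrast = [0.9, 0.8, 0.1]
beta = 2.0
alpha = {0: 1.0, 1: -1.0, 2: 0.5, 3: -2.0}   # favors continuation, penalizes endings

def score(x):
    edge_term = beta * sum(c * xi for c, xi in zip(contrast, x))
    junction_term = alpha[sum(x)]            # degree = number of "on" edges
    return math.exp(edge_term + junction_term)

states = list(itertools.product([0, 1], repeat=3))
Z = sum(score(x) for x in states)            # partition function
marginals = [sum(score(x) for x in states if x[e] == 1) / Z for e in range(3)]
```

On a full CDT graph the state space is far too large to enumerate, which is why inference is done with loopy belief propagation instead.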
Edge Potential: Local Contrast
Potentials exp(Σi βi φi); φ = average contrast along each edge e
Junction Potential: Degree
The degree of a junction depends on the assignments of {Xe}:
deg=0 (no lines), deg=1 (line ending), deg=2 (continuation), deg=3 (T-junction)
ψj = 1(deg = j); potentials exp(Σj αj ψj)
Junction Potential: Continuity
For degree-2 junctions (continuation): ψ = g(θ) · 1(deg = 2)
Learning the Parameters
Learned parameters: 2.46, 0.87, 1.14, 0.01
mid-level representation + probabilistic framework + large annotated datasets
Compare to [Geman and Geman 84]
Evaluation: Precision vs Recall
[Plot: precision vs recall; detections matched to groundtruth]
Precision = matched pairs / total detections
Recall = matched pairs / total groundtruth
High threshold: few detections. Low threshold: lots of detections.
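The benchmark quantities can be computed directly from the matched counts; the numbers below are invented for illustration:

```python
def precision_recall(matched_pairs, total_detections, total_groundtruth):
    # Boundary-detection benchmark scores as defined on the slide.
    precision = matched_pairs / total_detections
    recall = matched_pairs / total_groundtruth
    return precision, recall

# High threshold: few detections, mostly correct -> high precision, low recall.
p_hi, r_hi = precision_recall(matched_pairs=90, total_detections=100, total_groundtruth=1000)
# Low threshold: many detections -> lower precision, higher recall.
p_lo, r_lo = precision_recall(matched_pairs=600, total_detections=2000, total_groundtruth=1000)
```

Sweeping the detector threshold traces out the precision-recall curve shown in the evaluation.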
Curvilinear grouping improves boundary detection, at both low and high recall.
Horse dataset of [Borenstein and Ullman 02]: 175 training images, 175 test images.
“Mid-level vision is useful”
[Ren, Fowlkes & Malik; ICCV 2005]
Image Pb CRF
Mid-level Vision
Problems in mid-level vision: curvilinear grouping, figure/ground organization, region segmentation.
Figure/Ground Organization
A contour belongs to one of the two (but not both) abutting regions.
[Figure: face/goblet illusion. Figure (face) vs Ground (shapeless); Figure (goblet) vs Ground (shapeless)]
Important for the perception of shape
Inference on the CDT Graph
[Figure: CDT graph with a binary variable Xe on each edge]
Xe ∈ {-1,1}; 1: left side is figure, -1: right side is figure
Local Model: Convexity, Parallelism,…
Global Model: Consistency at T-junctions
Results
Chance: 50.0%
Baseline (size/convexity): 55.6%
Local shapemes: 64.8%
Shapemes averaged on segmentation boundaries: 72.0%
Shapemes + CRF: 78.3%
Dataset consistency: 88.0%
(using human segmentations)
[Ren, Fowlkes & Malik; ECCV 2006]
Models for Contour Labeling
[Figure: Tiger, Grass, Water, Sand parse]
Labels {Xe}
Curvilinear Grouping
Figure/Ground Assignment
[Hierarchy: Pixels → Superpixels → Contours & Regions → Objects & Scenes; the CRF operates at the contours & regions level]
Line Labeling
> : contour direction
+ : convex edge
- : concave edge
Reviving the old tradition with modern technologies, for more realistic applications
possible junctions (constraints)
CSP
[Clowes 1971, Huffman 1971; Waltz 1972]
Parsing Images
[Figure: Tiger, Grass, Water, Sand]
Add region-based variables and cues
Joint contour and region inference
Add high-level knowledge (objects)
[Hierarchy: Pixels → Superpixels → Contours & Regions → Objects & Scenes]
Object Segmentation
Object-specific cues: shape, region support, color/texture, …
Inference on the CDT Graph
[Figure: CDT graph with contour variables Xe on edges, region variables Yt on triangles, and an object variable Z]
Contour variables {Xe}; region variables {Yt}; object variable Z
Integrating {Xe}, {Yt} and Z: low/mid/high-level cues
[Figure: the same graph with Z connected to all regions]
Z encodes location, scale, pose, etc.
Grouping Cues
Low-level cues:
• Edge energy along edge e: L1(Xe | I)
• Brightness/texture similarity between two regions s and t: L2(Ys, Yt | I)
Mid-level cues:
• Edge collinearity and junction frequency at vertex V: M1(XV | I)
• Consistency between edge e and two adjoining regions s and t: M2(Xe, Ys, Yt)
High-level cues:
• Texture similarity of region t to exemplars: H1(Yt | I)
• Compatibility of region support with pose: H2(Yt, Z | I)
• Compatibility of local edge shape with pose: H3(Xe, Z | I)
Cue Integration in CRF
P(X, Y, Z | I) = (1/Z(I)) exp( E(X, Y, Z, I) )
E = Σe L1(Xe|I) + Σ(s,t) L2(Ys,Yt|I) + ΣV M1(XV|I) + Σ(e,s,t) M2(Xe,Ys,Yt) + Σt H1(Yt|I) + Σt H2(Yt,Z|I) + Σe H3(Xe,Z|I)
Estimate the marginal posteriors of X, Y and Z
Object knowledge helps a lot
Mid-level cues are still useful
[Ren, Fowlkes & Malik; NIPS 2005]
Input Input Pb Output Contour Output Figure
Finding People
The challenges:
• Pose: articulation + self-occlusion
• Clothing, lighting, clutter
• …
Finding People: Top-Down
[Hierarchy: Pixels → Objects & Scenes, skipping the mid-level]
Top-down approaches:
• 3D model-based: fails most of the time
• 2D template-based: needs lots of training data
Finding People: Bottom-Up
[Hierarchy: Pixels → Superpixels → Contours & Regions → Objects & Scenes]
[Ren, Berg & Malik; ICCV 2005]
Tracking People as Blobs
Blob tracking != Rectangle tracking
… k-1, k, k+1, …
Figure/Ground Segmentation
Object/Background Appearance Model
Temporal Coherence
Preliminary Results
Tracking = Repeated Segmentation
(video)
Conclusion
Constrained Delaunay Triangulation (CDT)
Conditional Random Fields (CRF)
Quantitative evaluations
Integration of mid-level with high-level vision
Future Work
[Hierarchy: Pixels → Superpixels → Contours & Regions → Objects & Scenes]
A richer and more consistent mid-level representation
Higher-order potential functions
Using mid-level representation for general object recognition
A high-fidelity tracking system
Finding people in static images
Thank You
Acknowledgements Joint work with Charless Fowlkes, Alex Berg, and Jitendra Malik.
References
• X. Ren, C. Fowlkes and J. Malik. Figure/Ground Assignment in Natural Images. In ECCV 2006.
• X. Ren, C. Fowlkes and J. Malik. Cue Integration in Figure/Ground Labeling. In NIPS 2005.
• X. Ren, A. Berg and J. Malik. Recovering Human Body Configurations using Pairwise Constraints between Parts. In ICCV 2005.
• X. Ren, C. Fowlkes and J. Malik. Scale-Invariant Contour Completion using Conditional Random Fields. In ICCV 2005.
• X. Ren and J. Malik. Learning a Classification Model for Segmentation. In ICCV 2003.
• X. Ren and J. Malik. A Multi-Scale Probability Model for Contour Completion based on Image Statistics. In ECCV 2002.
Finding People from Bottom-Up
• Detecting parts: superpixels
• Assembling parts: Integer Quadratic Programming (IQP)
Finding People in Video
[Hierarchy: Pixels → Superpixels → Contours & Regions]
Additional information: motion, appearance, temporal consistency.
How much can we do without object model (blob tracking)?
I stand at the window and see a house, trees, sky. Theoretically I might say there were 327 brightnesses and nuances of colour.
Do I have "327"?
No. I have sky, house, and trees.
---- Max Wertheimer, 1923
Learning the Parameters
Maximum-likelihood estimation in CRF. Let X̃ denote the groundtruth labeling on the CDT graph.
log L(X̃) = Σi βi φi(X̃) + Σj αj ψj(X̃) − log Z
For each parameter λt with feature ft, the gradient is
∂ log L / ∂λt = ft(X̃) − E P(X) [ ft(X) ]
Gradient descent works well.
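A minimal sketch of maximum-likelihood learning in an exponential-family model: the gradient for each weight is the empirical feature value minus the model's expected feature value. The two-variable model and its features below are illustrative assumptions, not the talk's CDT features:

```python
import itertools, math

# Tiny log-linear model illustrating the ML gradient:
#   grad_t = f_t(groundtruth) - E_P[f_t]
features = lambda x: [float(x[0]), float(x[1]), float(x[0] == x[1])]
states = list(itertools.product([0, 1], repeat=2))

def expectations(w):
    # model expectations E_P[f_t] under the current weights
    scores = [math.exp(sum(wi * fi for wi, fi in zip(w, features(x)))) for x in states]
    Z = sum(scores)
    return [sum((s / Z) * features(x)[t] for s, x in zip(scores, states))
            for t in range(len(w))]

x_true = (1, 1)                      # plays the role of the groundtruth labeling
w = [0.0, 0.0, 0.0]
for _ in range(200):                 # plain gradient ascent on the log-likelihood
    exp_f = expectations(w)
    w = [wi + 0.5 * (ft - et)
         for wi, ft, et in zip(w, features(x_true), exp_f)]
```

After training, the model puts most of its probability mass on the observed labeling, since the expected feature counts have been pulled toward the empirical ones.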
Global Consistency
[Figure: figure/ground label configurations at junctions: common vs uncommon]
Use junction potentials to encode junction type.
Image Groundtruth Local Global
Results
Chance: 50.0%
Baseline (size/convexity): N/A
Local shapemes: 64.9%
Shapemes averaged on segmentation boundaries: 66.5%
Shapemes + CRF: 68.9%
Dataset consistency: 88.0%
(without human segmentations)
Image Pb Local Global
Outline
• Parsing Images: Building a Mid-level Representation
• Probabilistic Models for Mid-level Vision: Contour Completion; Figure/Ground Organization
• Combining Mid- and High-level Vision: Object Segmentation; Finding People
• Conclusion & Future Work
Detecting Parts: CDT
• Candidate parts as parallel line segments (Ebenbreite)
• Automatic scale selection from bottom-up
• Feature combination with a logistic classifier
Assembling Parts: IQP
Candidates {Ci}, parts {Lj}
An assignment ω maps parts to candidates: (Lj1, Ci1 = ω(Lj1)), (Lj2, Ci2 = ω(Lj2))
Cost for a partial assignment {(Lj1, Ci1), (Lj2, Ci2)}:
H((j1,i1),(j2,i2)) = Σk λk fk( (Lj1, Ci1), (Lj2, Ci2) )
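A sketch of the pairwise-cost structure behind the IQP: a binary assignment of parts to candidates is scored by a quadratic cost over pairs, with brute-force enumeration standing in for the actual IQP solver. The cost entries are made up, and the real model also has unary terms and exclusion constraints, omitted here:

```python
import itertools

n_parts, n_cands = 2, 3
# sparse pairwise costs H[((part j1, cand i1), (part j2, cand i2))]
H = {((0, 0), (1, 1)): -2.0,   # part 0 -> cand 0 with part 1 -> cand 1: good pair
     ((0, 0), (1, 2)): 1.0,
     ((0, 1), (1, 0)): 0.5}

def cost(assign):              # assign[j] = candidate index chosen for part j
    c = 0.0
    for j1, j2 in itertools.combinations(range(n_parts), 2):
        c += H.get(((j1, assign[j1]), (j2, assign[j2])), 0.0)
    return c

# brute force over all assignments (the IQP solver replaces this in practice)
best = min(itertools.product(range(n_cands), repeat=n_parts), key=cost)
```
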
Testing the Markov Assumption
The Markov Model for Contours:
• Curvature = white noise (independent)
• Tangent direction t(s) = random walk
• P( t(s+1) | t(s), … ) = P( t(s+1) | t(s) )
Dynamic Programming
[Figure: dynamic-programming lattice over tangent states t(s) → t(s+1) between steps s and s+1]
[Mumford 1994, Williams & Jacobs 1995]
Testing the Markov Assumption
Segment the contours at high-curvature positions.
If the Markov assumption holds:
• at each step, a high-curvature event happens with probability p
• high-curvature events are independent from step to step
Therefore, if L is the length of a contour segment between high-curvature points,
P(L = k) = p (1 − p)^k
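The geometric prediction can be checked numerically (p here is an arbitrary illustrative value):

```python
# If high-curvature events are independent with probability p per step, the
# segment length L between events is geometric: P(L = k) = p * (1 - p)**k.
p = 0.2
pmf = [p * (1 - p) ** k for k in range(500)]

total = sum(pmf)                                  # ~1: a proper distribution
mean = sum(k * pk for k, pk in enumerate(pmf))    # ~(1 - p) / p
# The exponential tail (1 - p)**k decays much faster than the power-law tail
# observed in natural images, which is the evidence against the Markov assumption.
```
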
Berkeley Segmentation Dataset
[Martin, Fowlkes, Tal and Malik, ICCV 2001]: 1,000 images, >14,000 segmentations
Exponential vs Power Law
[Plot: probability vs contour segment length L]
Power law (scale invariance): P(L) ∝ 1 / L^2.4
Exponential law (Markov assumption)
Scale Invariance
Arbitrary viewing distance
Hierarchy of Parts
Finger, Leg, Torso
A Scale-Invariant Representation
[Figure: the Tiger/Grass/Water/Sand scene re-scaled in scale space]
Goal: a scale-invariant representation for contours
Gap-Filling Property of CDT
A typical scenario of contour completion: a low-contrast gap between two high-contrast edges.
CDT picks the “right” edge, completing the gap.
No Loss of Structure
Use P_human, the soft groundtruth label defined on CDT graphs: precision close to 100%.
Pb averaged over CDT edges: no worse than the original Pb.
Increase in asymptotic recall rate: completion of gradientless contours.
Uniform Connectedness
Connected regions of homogeneous properties (brightness, color, texture) are perceived as entry-level units. [Palmer & Rock, 1994]
“Classical principles of grouping operate after UC, creating superordinate units consisting of two or more entry-level units.”
“… UC (uniform connectedness) cannot be reduced to grouping principles, because it is not a form of grouping at all…”
Local Model
“Bi-gram” model:
• contrast + continuity
• binary classification (0,0) vs (1,1)
• logistic classifier
“Tri-gram” model
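A minimal logistic classifier combining two cues, in the spirit of the contrast + continuity model above; the synthetic cue distributions, sample size and learning settings are all assumptions for illustration:

```python
import math, random

random.seed(0)

def make_example():
    # synthetic data: boundary pairs (y=1) have higher contrast and continuity
    y = 1.0 if random.random() < 0.5 else 0.0
    contrast = random.gauss(0.7 if y else 0.3, 0.15)
    continuity = random.gauss(0.8 if y else 0.4, 0.2)
    return (contrast, continuity), y

data = [make_example() for _ in range(500)]
w = [0.0, 0.0, 0.0]                       # two cue weights + bias
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

for _ in range(500):                      # batch gradient ascent on log-likelihood
    grad = [0.0, 0.0, 0.0]
    for (c1, c2), y in data:
        p = sigmoid(w[0] * c1 + w[1] * c2 + w[2])
        for t, f in enumerate((c1, c2, 1.0)):
            grad[t] += (y - p) * f
    w = [wi + 0.5 * g / len(data) for wi, g in zip(w, grad)]

acc = sum((sigmoid(w[0] * c1 + w[1] * c2 + w[2]) > 0.5) == (y > 0.5)
          for (c1, c2), y in data) / len(data)
```

The learned weights show how much each cue contributes to the boundary decision.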
Building a CRF Model
What are the features?
• edge features: low-level “edgeness” (Pb)
• junction features: junction type, continuity
How to make inference? Loopy Belief Propagation.
How to learn the parameters? Gradient descent on maximum likelihood.
X={X1,X2,…,Xm}
Estimate the marginals P(Xi | I)
Junction and Continuity
Junction types (deg_g, deg_c): e.g. deg_g=1, deg_c=0; deg_g=0, deg_c=2; deg_g=1, deg_c=2; deg_g=0, deg_c=0
f(a,b)(X) = 1(deg_g = a, deg_c = b); potentials exp( Σ(a,b) α(a,b) f(a,b) )
Continuity term for degree-2 junctions (deg_g + deg_c = 2): exp( γ g(θ) · 1(deg_g + deg_c = 2) )
Interpreting the Parameters
Learned values: 2.46, 0.87, 1.14, 0.01, -0.59, -0.98
Line endings and junctions are rare
Completed edges are weak
Continuity improves boundary detection in both low-recall and high-recall ranges
Global inference helps; mostly in low-recall/high-precision
Roughly speaking, CRF > Local > CDT only > Pb
Image Pb Local Global
Figure/Ground Principles
• Convexity
• Parallelism
• Surroundedness, symmetry, common fate, familiar configuration
• …
[Figure: figure/ground examples]
Figure/Ground Dataset
Figure/Ground Assignment in Natural Images
Local model: use shapemes (prototypical local shapes) to capture contextual information.
Global model: use a CRF to enforce consistency at junctions.
Shapemes: Prototypical Local Shapes
Collect local shapes and cluster them.
[Figure: the average shape in each shapeme cluster]
Shapemes for F/G Discrimination
[Figure: shapeme examples with P(left side is figure): 93.84%, 49.80%, 89.59%, 11.69%, 66.52%, 4.98%]
Which side is Figure?
Train a logistic classifier to linearly combine the shapeme cues
CRF for Figure/Ground
F={F1,F2,…,Fm}
Fi ∈ {Left, Right}
• Put potential functions at junctions
• One feature for each junction type
[Figure: junctions with figure/ground label configurations, e.g. {(F,G),(G,F),(F,G)}, {(G,F),(F,G)}, {(F,G),(F,G),(F,G)}]
Results
CDT vs K-Neighbor
An alternative scheme for completion: connect to k-nearest neighbor vertices, subject to visibility
CDT achieves higher asymptotic recall rates
Inference w/ Belief Propagation
Loopy Belief Propagation:
• just like belief propagation: iterates message passing until convergence
• lacks theoretical guarantees and is known to have convergence issues
• nevertheless becoming popular in practice; typically applied on the pixel grid
Works well on CDT graphs:
• converges fast (<10 iterations)
• produces empirically sound results
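A self-contained sketch of sum-product loopy BP on a tiny three-variable cycle, compared against brute-force marginals; the potentials are arbitrary illustrative numbers, not the CDT potentials:

```python
import itertools, math

unary = [[1.0, 2.0], [1.5, 1.0], [1.0, 1.0]]                 # psi_i(x_i)
edges = [(0, 1), (1, 2), (2, 0)]
pair = {e: [[2.0, 1.0], [1.0, 2.0]] for e in edges}          # mildly attractive

def pot(i, j, xi, xj):
    return pair[(i, j)][xi][xj] if (i, j) in pair else pair[(j, i)][xj][xi]

# directed messages m[(i, j)][x_j], initialised uniform
msgs = {(i, j): [1.0, 1.0] for i, j in edges + [(j, i) for i, j in edges]}

for _ in range(50):                                          # iterate to convergence
    new = {}
    for (i, j) in msgs:
        m = [sum(unary[i][xi] * pot(i, j, xi, xj)
                 * math.prod(msgs[(k, i)][xi] for k in range(3) if k not in (i, j))
                 for xi in range(2)) for xj in range(2)]
        s = sum(m)
        new[(i, j)] = [v / s for v in m]                     # normalise for stability
    msgs = new

def bp_marginal(i):
    b = [unary[i][xi] * math.prod(msgs[(k, i)][xi] for k in range(3) if k != i)
         for xi in range(2)]
    s = sum(b)
    return [v / s for v in b]

def exact_marginal(i):
    p = [0.0, 0.0]
    for x in itertools.product(range(2), repeat=3):
        w = math.prod(unary[k][x[k]] for k in range(3))
        w *= math.prod(pot(a, b, x[a], x[b]) for a, b in edges)
        p[x[i]] += w
    s = sum(p)
    return [v / s for v in p]
```

On this single loop with mild potentials, the BP beliefs land close to the exact marginals, matching the talk's experience on CDT graphs.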
Shape Context
Count the number of edge points inside each bin of a “log-polar” histogram.
[Figure: log-polar bins over an edge map, e.g. count=4, count=6]
[Belongie, Malik & Puzicha, ICCV 2001] [Berg & Malik, CVPR 2001]
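A sketch of the log-polar histogram; the 5 radial × 12 angular bins and the inner/outer radii follow commonly used defaults, which is an assumption here:

```python
import math

def shape_context(center, points, r_inner=0.125, r_outer=2.0, n_r=5, n_theta=12):
    # Log-polar histogram of edge points around `center`.
    hist = [[0] * n_theta for _ in range(n_r)]
    # log-spaced outer radii of the n_r radial bins
    edges = [r_inner * (r_outer / r_inner) ** (k / n_r) for k in range(1, n_r + 1)]
    for x, y in points:
        dx, dy = x - center[0], y - center[1]
        r = math.hypot(dx, dy)
        if r == 0 or r > r_outer:
            continue                          # skip the center itself and far outliers
        rbin = next(k for k, e in enumerate(edges) if r <= e) if r > r_inner else 0
        theta = (math.atan2(dy, dx) + 2 * math.pi) % (2 * math.pi)
        tbin = min(int(theta / (2 * math.pi / n_theta)), n_theta - 1)
        hist[rbin][tbin] += 1
    return hist
```

Each edge point increments exactly one (radius, angle) bin, so the histogram totals equal the number of in-range points.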
Compare to DDMCMC: we try to solve the same problem.
• A unified framework for image parsing
• Mid-level representation: CDT vs “atomic regions”
• Probabilistic model: discriminative vs generative
• Inference mechanism: belief propagation vs MCMC
• Quantitative evaluation
We try to develop models step by step.