Scene Collaging: Analysis and Synthesis of Natural Images with Semantic Layers – Supplementary Materials

Phillip Isola, MIT, phillipi@mit.edu
Ce Liu, Microsoft Research, celiu@microsoft.com
1. Maximum entropy prior

Our scenegraph grammar (Section 2.3, main text) serves as a binary prior: scenes are either valid or not. In order to compare against a method that can capture more subtle gradations in the prior probability, we also implemented a maximum entropy (maxent) prior over object co-occurrence statistics [3]. With this approach, we express the prior probability of a scene X as:
P(X) = \frac{1}{Z(\Lambda)} \exp\Big( -\sum_i \lambda_i f_i(X) \Big), \qquad (1)
where Z normalizes and the distribution is parameterized by λi ∈ Λ. Many methods exist for learning Λ from training data; we employ gradient descent [2].
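For concreteness, the gradient-ascent update can be sketched as follows (a minimal NumPy sketch; the sampler, step size, and sample count are our own placeholders, and the actual learning procedure of [2] is more involved):

```python
import numpy as np

def maxent_gradient_step(lmbda, data_feats, sample_model, step=0.1, n_samples=200):
    """One gradient-ascent step on the maxent log-likelihood.

    With P(X) = exp(-sum_i lmbda_i f_i(X)) / Z(Lambda), the average
    log-likelihood gradient w.r.t. lmbda_i is E_model[f_i] - E_data[f_i],
    so learning stops when the model's feature expectations match the data's.

    lmbda        : (d,) current parameters Lambda
    data_feats   : (n, d) feature vectors f(X) of the training scenes
    sample_model : callable drawing one scene from the current model and
                   returning its (d,) feature vector (hypothetical, e.g. MCMC)
    """
    e_data = data_feats.mean(axis=0)
    e_model = np.mean([sample_model(lmbda) for _ in range(n_samples)], axis=0)
    return lmbda + step * (e_model - e_data)
```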
The functions fi measure object class count statistics and object class co-count statistics.
We approximate the distribution of object class counts with a histogram with bins R. For each object class i,

f^{(\mathrm{count})}_{i,r}(X) = \mathbb{1}\big[\, n_i(X) \in r \,\big], \qquad (2)
where n_i(X) is the number of objects of class i in the scene X, that is:

n_i(X) = \sum_{\ell \in L} \mathbb{1}\big[\, \tilde{c}_\ell = i \,\big]. \qquad (3)
Recall from Section 2 of the main text that L is the set of dictionary indices of the objects used in a scene X and c̃ℓ is the class of an object ℓ. r ∈ R is a range of integer values. In our current implementation, R = {[0], [1], [2, 3], [4, 7], [8, ∞)}.
We approximate the distribution of object class co-counts with a 2D histogram R × R. For each pair of object classes (i, j), and histogram bin (r_1, r_2), r_1, r_2 ∈ R, we have the feature:
f^{(\mathrm{co\text{-}count})}_{i,j,r_1,r_2}(X) = \mathbb{1}\big[\, n_i(X) \in r_1,\; n_j(X) \in r_2 \,\big]. \qquad (4)
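Putting Equations (2)-(4) together, the full feature vector f(X) of a scene can be assembled as in the sketch below (the helper name and the flattened ordering are our own choices; the paper does not specify one):

```python
import numpy as np

# The count bins R of the text; np.inf encodes the open-ended last range.
R = [(0, 0), (1, 1), (2, 3), (4, 7), (8, np.inf)]

def maxent_features(class_ids, n_classes):
    """Count (Eq. 2) and co-count (Eq. 4) indicator features for one scene.

    class_ids : class indices c_l of the objects used in the scene X.
    Returns a flat 0/1 vector: one entry per (class, bin), then one per
    (class pair, bin pair).
    """
    counts = np.bincount(np.asarray(class_ids, dtype=int),
                         minlength=n_classes)          # n_i(X), Eq. (3)
    in_bin = np.array([[float(lo <= c <= hi) for lo, hi in R]
                       for c in counts])               # (n_classes, |R|)
    f_count = in_bin.ravel()
    # Co-counts: indicator that n_i falls in r1 AND n_j falls in r2.
    f_cocount = np.einsum('ir,js->ijrs', in_bin, in_bin).ravel()
    return np.concatenate([f_count, f_cocount])
```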
In Section 5.2.1 of the main text, we compare our system with just the scenegraph grammar as a prior to our system with both the scenegraph grammar prior and the maxent prior described above. As shown in Table 4, the addition of the maxent prior decreases performance compared to our scenegraph grammar alone.
2. Segment transformation – details

Translation and scaling:

As summarized in Section 4.2 of the main text, each object from our dictionary is translated and scaled independently. We seek to place the object at a location and scale where it well explains the appearance of the query image, I. Using θt to represent the center position of the object's mask and θs to represent how much we scale the object's mask, we start with the following objective function:
E_1(\theta_t, \theta_s) = \sum_{q \in Q_\ell} \tilde{g}_\ell(f(I)_q) + \lambda_t \sum_{q \in \tilde{Q}_\ell} p_t(\tilde{c}_\ell, q) + \lambda_s\, p_s(\theta_s), \qquad (5)

Q_\ell = T_{ts}(\tilde{Q}_\ell, \{\theta_t, \theta_s\}), \qquad (6)
where g̃ℓ is the segment's appearance model, p_t(c̃ℓ, q) is a prior over mask position for objects of class c̃ℓ, and p_s(θs) is a prior over scale factor. T_ts applies a similarity transformation to the dictionary object mask Q̃ℓ. In order to encourage objects to stay largely within the image frame, we assign a default penalty to all pixels in the transformed mask that fall outside the image frame.
We calculate p_t(c, q) by averaging masks of class c across our dictionary:

p_t(c, q) \propto \sum_{\ell \in L \,:\, \tilde{c}_\ell = c} \mathbb{1}\big[\, q \in \tilde{Q}_\ell \,\big]. \qquad (7)
We manually set p_s(θs) = {0.75, 0.875, 1, 0.875, 0.75} for the discrete set of scales θs ∈ {0.5, 0.75, 1, 1.5, 2}. That is, objects slightly prefer to remain at the scale at which they were found in our dictionary.
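A minimal sketch of the averaging in Equation (7), under the simplifying assumption that all dictionary masks are rendered in a common image frame (the normalization to sum to one is our choice; the text only states proportionality):

```python
import numpy as np

def position_prior(dictionary_masks, classes, c):
    """p_t(c, q) of Eq. (7): average the binary masks of all dictionary
    objects of class c, then normalize.

    dictionary_masks : list of HxW boolean masks, assumed here to be
                       rendered in a common image frame.
    classes          : list of class ids c_l, parallel to the masks.
    """
    acc = np.zeros(dictionary_masks[0].shape)
    for m, cl in zip(dictionary_masks, classes):
        if cl == c:
            acc += m
    total = acc.sum()
    return acc / total if total > 0 else acc
```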
The objective E_1 prefers that objects scale down as much as possible until they achieve a precise fit onto the query image. In order to prevent all objects from choosing the smallest scale, we add a term that trades off between precision and coverage of the detections:
E_2(\theta_t, \theta_s) = \frac{E_1(\theta_t, \theta_s)}{|Q_\ell| + \eta(\tilde{Q}_\ell, \tilde{z}_\ell)}. \qquad (8)
Here we have scaled E_1 by a softened version of the area the transformed object mask covers. The softening factor, η(Q̃ℓ, z̃ℓ), controls the preference for precision versus recall in the object detections. We set this term so as to prefer recall for large, background objects and to prefer precision for small, foreground objects. The intuition for this choice is that objects in the background are likely to be largely covered in the final scene explanation, and consequently, it is okay if only parts of them match the image well. This term is a function of the object's untransformed mask Q̃ℓ and the layer of the object in its dictionary scene, z̃ℓ:
\eta(\tilde{Q}_\ell, \tilde{z}_\ell) \propto \frac{\tilde{z}_\ell}{|\tilde{Q}_\ell|}. \qquad (9)
We choose transformation parameters that maximize our objective:
\{\theta_t^*, \theta_s^*\} = \arg\max_{\theta_t, \theta_s} E_2(\theta_t, \theta_s). \qquad (10)
To choose θs*, we try each of our discrete set of scales. To choose θt*, we adopt a sliding window approach, searching over all possible placements of the object mask such that its center is within the image frame (however, for dictionary objects that touch an image border, we only consider placements such that the object still touches the image border). We use convolution between the object mask and image features to perform this search efficiently.
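The search can be sketched as below, folding the appearance term and the position prior of Equation (5) into one precomputed per-pixel score map and scoring every center placement by cross-correlation; the out-of-frame penalty value, the nearest-neighbour resize, and the omission of the border-touching special case are our simplifications:

```python
import numpy as np
from scipy.signal import fftconvolve

def nn_resize(mask, s):
    """Nearest-neighbour rescale of a binary mask by factor s."""
    h = max(1, int(round(mask.shape[0] * s)))
    w = max(1, int(round(mask.shape[1] * s)))
    ys = np.minimum((np.arange(h) / s).astype(int), mask.shape[0] - 1)
    xs = np.minimum((np.arange(w) / s).astype(int), mask.shape[1] - 1)
    return mask[np.ix_(ys, xs)].astype(float)

def best_placement(score_map, mask, eta,
                   scales=(0.5, 0.75, 1.0, 1.5, 2.0),
                   p_scale=(0.75, 0.875, 1.0, 0.875, 0.75),
                   lam_s=1.0, out_penalty=-1.0):
    """Sliding-window search over (theta_t, theta_s), Eqs. (5)-(10).

    score_map : HxW per-pixel score, here g(f(I)_q) + lam_t * p_t(c, q),
                precomputed once for this object's class.
    eta       : softening factor of Eq. (9) for this object.
    Returns (best E2 value, center position theta_t, scale theta_s).
    """
    best = (-np.inf, None, None)
    for s, ps in zip(scales, p_scale):
        m = nn_resize(mask, s)
        h, w = m.shape
        # Pad so mask pixels falling outside the frame pick up the
        # default out-of-frame penalty mentioned in the text.
        padded = np.pad(score_map, ((h // 2,), (w // 2,)),
                        constant_values=out_penalty)
        # Cross-correlation (convolution with a flipped kernel) sums the
        # score under the mask for every center position in the frame.
        e1 = fftconvolve(padded, m[::-1, ::-1], mode='valid') + lam_s * ps
        e2 = e1 / (m.sum() + eta)                       # Eq. (8)
        ij = np.unravel_index(np.argmax(e2), e2.shape)
        if e2[ij] > best[0]:
            best = (e2[ij], ij, s)
    return best
```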
Trimming and growing:

We edit object mask silhouettes after each iteration of greedy optimization (Algorithm 1 in the main text; labeled as TRIMANDGROW). We formulate this part of the problem as 2D MRF-based segmentation, in which each object in the scene becomes a segment class label. Here, we denote object segment labels as a 2D array ℓ(q), and minimize the following energy function using BP-S [1]:
-\log p(\ell \mid I, X) = \sum_q \tilde{g}_{\ell(q)}(f(I)_q) + \lambda_1 \sum_q \psi_{\ell(q)}(q) \;+ \qquad (11)

\lambda_2 \sum_{\{p,q\} \in \mathcal{E}} \phi(\ell(p), \ell(q); I) + \log Z, \qquad (12)
where Z normalizes and \mathcal{E} is the set of neighboring pixel pairs. The data term g̃ℓ(·) is our appearance model from Section 2.2 of the main text. Spatial priors ψℓ(·) are provided by our object visibility masks (Equation 4 in the main text):
\psi_\ell(q) = (G * V_\ell)(q), \qquad (13)
where G is a Gaussian kernel with scale proportional to the object's mask area, |Qℓ|.
For the spatial smoothness term φ(·), we borrow the image-sensitive smoothness potential from [1]:
\phi(\ell(p), \ell(q); I) = \mathbb{1}\big[\ell(p) \neq \ell(q)\big] \cdot \frac{\xi + e^{-\gamma \|I(p) - I(q)\|^2}}{\xi + 1}, \qquad (14)
with γ = (2⟨‖I(p) − I(q)‖²⟩)^{-1}, where ⟨·⟩ averages over the image.

This silhouette editing process leaves us with a 2D map of object labels ℓ(q). We use this map to update our object visibility masks: V_ℓ = {q : ℓ(q) = ℓ}. These visibility masks affect the synthesized scene's likelihood (Equation 6 in the main text), and are used to calculate pixel-wise and class-wise label accuracy (Section 5 of the main text). Upon each invocation of TRIMANDGROW, we discard the old silhouette edits and start afresh from the visibility masks given by Equation 4 in the main text.
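The potentials of Equations (13) and (14) can be tabulated before handing the labeling problem to the MRF solver, as in the sketch below (the BP-S solver itself is not shown; the value of ξ and the Gaussian proportionality constant are our guesses, not values from the paper):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smoothness_weights(img, xi=0.1):
    """Pairwise costs of Eq. (14) on a 4-connected pixel grid.

    img : HxWx3 float image. Returns the cost charged to each horizontal
    and each vertical edge whose two pixels take different labels;
    gamma is set from the mean squared color difference, as in the text.
    """
    dh = np.sum((img[:, 1:] - img[:, :-1]) ** 2, axis=-1)  # horizontal edges
    dv = np.sum((img[1:, :] - img[:-1, :]) ** 2, axis=-1)  # vertical edges
    gamma = 1.0 / (2.0 * np.mean(np.concatenate([dh.ravel(), dv.ravel()])))
    def phi(d2):
        return (xi + np.exp(-gamma * d2)) / (xi + 1.0)
    return phi(dh), phi(dv)

def spatial_prior(vis_mask, sigma_per_area=1e-3):
    """psi of Eq. (13): the object's visibility mask blurred by a Gaussian
    whose scale is proportional to the mask area |Q_l|."""
    sigma = sigma_per_area * float(vis_mask.sum())
    return gaussian_filter(vis_mask.astype(float), sigma)
```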
3. Random scene synthesis – details

During random scene synthesis, we do not have a target image to match, so several changes need to be made to our collaging algorithm. First, the likelihood term is set to be proportional to the number of pixels the collage covers: log P(I|X) ∝ Σ_{ℓ∈L} |V_ℓ|. This biases inference toward collages that fill the entire image frame. Second, segment transformation and trimming and growing are skipped; instead, segments are placed exactly where they were found in the dictionary scene they came from. Third, we choose the layer on which to place each object segment ℓ by finding the layer in the collage that best matches the object's layer in the dictionary scene from which it was sampled. Let A(z̃ℓ) be the histogram of object class counts above z̃ℓ in the dictionary scene and B(z̃ℓ) be the histogram below. Further, let A′(z) and B′(z) represent the histograms of object class counts above and below z in the scene collage being synthesized. We place the object on the layer z*ℓ that maximizes the following sum of histogram intersections:
z^*_\ell = \arg\max_z \sum_k \Big[ \min\big(A(\tilde{z}_\ell)_k, A'(z)_k\big) + \min\big(B(\tilde{z}_\ell)_k, B'(z)_k\big) \Big]. \qquad (15)
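A sketch of this layer selection (the data layout and the direction of the layer ordering, with larger indices "above", are our assumptions):

```python
import numpy as np

def class_hist(class_ids, n_classes):
    return np.bincount(np.asarray(class_ids, dtype=int), minlength=n_classes)

def choose_layer(A, B, collage, n_classes, n_layers):
    """Pick the collage layer z* for a sampled object via Eq. (15).

    A, B    : class-count histograms of the objects above / below the
              object's layer in its dictionary scene.
    collage : list of (class_id, layer) pairs already in the collage.
    """
    best_z, best_score = 0, -np.inf
    for z in range(n_layers):
        above = class_hist([c for c, zl in collage if zl > z], n_classes)
        below = class_hist([c for c, zl in collage if zl < z], n_classes)
        score = np.minimum(A, above).sum() + np.minimum(B, below).sum()
        if score > best_score:
            best_z, best_score = z, score
    return best_z
```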
4. Additional results

In Figures 1, 2, and 3 we display additional example parses on each dataset. Figures 4, 5, and 6 show the average per-class accuracy of our algorithm on the top 30 most frequent classes in each dataset. Figure 7 shows several characteristic failure cases for our algorithm.
[Figure 1: grids of example parses. Columns: Query image, Human annotation, Ground truth scenegraph, Inferred appearance, Inferred annotations, Inferred scenegraph, Inferred layer-world.]
Figure 1: Additional example results parsing LMO images. Zoom in to see details, such as the object labels on the scenegraphs. The last row shows an example in which the dictionary contained the same building as in the query image (although a different photo of that building). This occurs fairly frequently in LMO and SUN since these datasets contain many cases of multiple photos of more or less the same place.
[Figure 2: grids of example parses, same column layout as Figure 1.]
Figure 2: Additional example results parsing SUN images. Zoom in to see details, such as the object labels on the scenegraphs.
[Figure 3: grids of example parses, same column layout as Figure 1.]
Figure 3: Additional example results parsing NYU RGBD images. Zoom in to see details, such as the object labels on the scenegraphs.
References

[1] C. Liu, J. Yuen, and A. Torralba. Nonparametric Scene Parsing via Label Transfer. PAMI, 2011.
[2] C. Liu, S. Zhu, and H. Shum. Learning inhomogeneous Gibbs model of faces by minimax entropy. In Proc. Eighth IEEE International Conference on Computer Vision (ICCV), volume 1, pages 281-287, 2001.
[3] J. Yuen, C. Zitnick, C. Liu, and A. Torralba. A Framework for Encoding Object-level Image Priors. Microsoft Research Technical Report, 2011.
[Bar chart over the 30 most frequent classes; y-axis: mean per-class pixel labeling accuracy, 0 to 1.]
Figure 4: Mean per-class pixel labeling accuracy on the LMO dataset.
[Bar chart over the 30 most frequent classes; y-axis: mean per-class pixel labeling accuracy, 0 to 1.]
Figure 5: Mean per-class pixel labeling accuracy on the top 30 most common object classes in the SUN test set.
[Bar chart over the 30 most frequent classes; y-axis: mean per-class pixel labeling accuracy, 0 to 1.]
Figure 6: Mean per-class pixel labeling accuracy on the top 30 most common object classes in the NYU RGBD test set.
[Figure 7: four rows of failure examples. Columns: Query image, Human annotation, Inferred appearance, Inferred annotations, Failure type. Failure types, top to bottom: Wrong category, Underexplaining, Overexplaining, Bad alignment.]
Figure 7: Characteristic failure modes of our algorithm. Top row: Sometimes the algorithm gets the entire scene category wrong. Second row: Our algorithm often runs into difficulty with cluttered scenes full of small objects. Groups of small objects are sometimes explained by one big object that resembles the ensemble appearance of the small objects. One reason for this is that often the human annotations do not mark all small objects in a scene. In this example, the human annotations did not mark the people in either the query image or in the scene whose segments were used in the scene collage. Third row: Conversely, sometimes the algorithm hallucinates small objects where none are necessary, as in the addition of the foreground plants in this simple image of a field. Together, the second and third rows demonstrate that the algorithm has difficulty choosing between explaining a region with several small objects versus one big object. Bottom row: Our current segment transformation algorithm produces fairly coarse alignments, and often makes false matches. The system often finds reasonable small objects to place into a scene collage (such as the people in this collage), but gets their position wrong, leading to low per-class accuracy; better small object detection and alignment is an important direction for future work.