Scene Collaging: Analysis and Synthesis of Natural Images with Semantic Layers – Supplementary Materials

Phillip Isola, MIT, phillipi@mit.edu
Ce Liu, Microsoft Research, celiu@microsoft.com
1. Maximum entropy prior

Our scenegraph grammar (Section 2.3, main text) serves as a binary prior: scenes are either valid or not. In order to compare against a method that can capture more subtle gradations in the prior probability, we also implemented a maximum entropy (maxent) prior over object co-occurrence statistics [3]. With this approach, we express the prior probability of a scene X as:
P(X) = \frac{1}{Z(\Lambda)} \exp\Big( -\sum_i \lambda_i f_i(X) \Big), \qquad (1)
where Z normalizes and the distribution is parameterized by λi ∈ Λ. Many methods exist for learning Λ from training data; we employ gradient descent [2].
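For concreteness, the gradient-ascent update can be sketched as follows (a minimal NumPy sketch; the sampler, step size, and sample count are our own placeholders, and the actual learning procedure of [2] is more involved):

```python
import numpy as np

def maxent_gradient_step(lmbda, data_feats, sample_model, step=0.1, n_samples=200):
    """One gradient-ascent step on the maxent log-likelihood.

    With P(X) = exp(-sum_i lmbda_i f_i(X)) / Z(Lambda), the average
    log-likelihood gradient w.r.t. lmbda_i is E_model[f_i] - E_data[f_i],
    so learning stops when the model's feature expectations match the data's.

    lmbda        : (d,) current parameters Lambda
    data_feats   : (n, d) feature vectors f(X) of the training scenes
    sample_model : callable drawing one scene from the current model and
                   returning its (d,) feature vector (hypothetical, e.g. MCMC)
    """
    e_data = data_feats.mean(axis=0)
    e_model = np.mean([sample_model(lmbda) for _ in range(n_samples)], axis=0)
    return lmbda + step * (e_model - e_data)
```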
The functions fi measure object class count statistics and object class co-count statistics.
We approximate the distribution of object class counts with a histogram with bins R. For each object class i,

f^{(\mathrm{count})}_{i,r}(X) = \mathbb{1}\big[\, n_i(X) \in r \,\big], \qquad (2)
where n_i(X) is the number of objects of class i in the scene X, that is:

n_i(X) = \sum_{\ell \in L} \mathbb{1}\big[\, \tilde{c}_\ell = i \,\big]. \qquad (3)
Recall from Section 2 of the main text that L is the set of dictionary indices of the objects used in a scene X and c̃ℓ is the class of an object ℓ. r ∈ R is a range of integer values. In our current implementation, R = {[0], [1], [2, 3], [4, 7], [8, ∞)}.
We approximate the distribution of object class co-counts with a 2D histogram R × R. For each pair of object classes (i, j), and histogram bin (r_1, r_2), r_1, r_2 ∈ R, we have the feature:
f^{(\mathrm{co\text{-}count})}_{i,j,r_1,r_2}(X) = \mathbb{1}\big[\, n_i(X) \in r_1,\; n_j(X) \in r_2 \,\big]. \qquad (4)
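Putting Equations (2)-(4) together, the full feature vector f(X) of a scene can be assembled as in the sketch below (the helper name and the flattened ordering are our own choices; the paper does not specify one):

```python
import numpy as np

# The count bins R of the text; np.inf encodes the open-ended last range.
R = [(0, 0), (1, 1), (2, 3), (4, 7), (8, np.inf)]

def maxent_features(class_ids, n_classes):
    """Count (Eq. 2) and co-count (Eq. 4) indicator features for one scene.

    class_ids : class indices c_l of the objects used in the scene X.
    Returns a flat 0/1 vector: one entry per (class, bin), then one per
    (class pair, bin pair).
    """
    counts = np.bincount(np.asarray(class_ids, dtype=int),
                         minlength=n_classes)          # n_i(X), Eq. (3)
    in_bin = np.array([[float(lo <= c <= hi) for lo, hi in R]
                       for c in counts])               # (n_classes, |R|)
    f_count = in_bin.ravel()
    # Co-counts: indicator that n_i falls in r1 AND n_j falls in r2.
    f_cocount = np.einsum('ir,js->ijrs', in_bin, in_bin).ravel()
    return np.concatenate([f_count, f_cocount])
```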
In Section 5.2.1 of the main text, we compare our system with just the scenegraph grammar as a prior to our system with both the scenegraph grammar prior and the maxent prior described above. As shown in Table 4, the addition of the maxent prior decreases performance compared to our scenegraph grammar alone.
2. Segment transformation – details

Translation and scaling:

As summarized in Section 4.2 of the main text, each object from our dictionary is translated and scaled independently. We seek to place the object at a location and scale where it well explains the appearance of the query image, I. Using θt to represent the center position of the object's mask and θs to represent how much we scale the object's mask, we start with the following objective function:
E_1(\theta_t, \theta_s) = \sum_{q \in Q_\ell} \tilde{g}_\ell(f(I)_q) + \lambda_t \sum_{q \in \tilde{Q}_\ell} p_t(\tilde{c}_\ell, q) + \lambda_s\, p_s(\theta_s), \qquad (5)

Q_\ell = T_{ts}(\tilde{Q}_\ell, \{\theta_t, \theta_s\}), \qquad (6)
where g̃ℓ is the segment's appearance model, p_t(c̃ℓ, q) is a prior over mask position for objects of class c̃ℓ, and p_s(θs) is a prior over scale factor. T_ts applies a similarity transformation to the dictionary object mask Q̃ℓ. In order to encourage objects to stay largely within the image frame, we assign a default penalty to all pixels in the transformed mask that fall outside the image frame.
We calculate p_t(c, q) by averaging masks of class c across our dictionary:

p_t(c, q) \propto \sum_{\ell \in L \,:\, \tilde{c}_\ell = c} \mathbb{1}\big[\, q \in \tilde{Q}_\ell \,\big]. \qquad (7)
We manually set p_s(θs) = {0.75, 0.875, 1, 0.875, 0.75} for the discrete set of scales θs ∈ {0.5, 0.75, 1, 1.5, 2}. That is, objects slightly prefer to remain at the scale at which they were found in our dictionary.
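A minimal sketch of the averaging in Equation (7), under the simplifying assumption that all dictionary masks are rendered in a common image frame (the normalization to sum to one is our choice; the text only states proportionality):

```python
import numpy as np

def position_prior(dictionary_masks, classes, c):
    """p_t(c, q) of Eq. (7): average the binary masks of all dictionary
    objects of class c, then normalize.

    dictionary_masks : list of HxW boolean masks, assumed here to be
                       rendered in a common image frame.
    classes          : list of class ids c_l, parallel to the masks.
    """
    acc = np.zeros(dictionary_masks[0].shape)
    for m, cl in zip(dictionary_masks, classes):
        if cl == c:
            acc += m
    total = acc.sum()
    return acc / total if total > 0 else acc
```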
The objective E_1 prefers that objects scale down as much as possible until they achieve a precise fit onto the query image. In order to prevent all objects from choosing the smallest scale, we add a term that trades off between precision and coverage of the detections:
E_2(\theta_t, \theta_s) = \frac{E_1(\theta_t, \theta_s)}{|Q_\ell| + \eta(\tilde{Q}_\ell, \tilde{z}_\ell)}. \qquad (8)
Here we have scaled E_1 by a softened version of the area the transformed object mask covers. The softening factor, η(Q̃ℓ, z̃ℓ), controls the preference for precision versus recall in the object detections. We set this term so as to prefer recall for large, background objects and to prefer precision for small, foreground objects. The intuition for this choice is that objects in the background are likely to be largely covered in the final scene explanation, and consequently, it is okay if only parts of them match the image well. This term is a function of the object's untransformed mask Q̃ℓ and the layer of the object in its dictionary scene, z̃ℓ:
\eta(\tilde{Q}_\ell, \tilde{z}_\ell) \propto \frac{\tilde{z}_\ell}{|\tilde{Q}_\ell|}. \qquad (9)
We choose transformation parameters that maximize our objective:
\{\theta_t^*, \theta_s^*\} = \arg\max_{\theta_t, \theta_s} E_2(\theta_t, \theta_s). \qquad (10)
To choose θs*, we try each of our discrete set of scales. To choose θt*, we adopt a sliding window approach, searching over all possible placements of the object mask such that its center is within the image frame (however, for dictionary objects that touch an image border, we only consider placements such that the object still touches the image border). We use convolution between the object mask and image features to perform this search efficiently.
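The search can be sketched as below, folding the appearance term and the position prior of Equation (5) into one precomputed per-pixel score map and scoring every center placement by cross-correlation; the out-of-frame penalty value, the nearest-neighbour resize, and the omission of the border-touching special case are our simplifications:

```python
import numpy as np
from scipy.signal import fftconvolve

def nn_resize(mask, s):
    """Nearest-neighbour rescale of a binary mask by factor s."""
    h = max(1, int(round(mask.shape[0] * s)))
    w = max(1, int(round(mask.shape[1] * s)))
    ys = np.minimum((np.arange(h) / s).astype(int), mask.shape[0] - 1)
    xs = np.minimum((np.arange(w) / s).astype(int), mask.shape[1] - 1)
    return mask[np.ix_(ys, xs)].astype(float)

def best_placement(score_map, mask, eta,
                   scales=(0.5, 0.75, 1.0, 1.5, 2.0),
                   p_scale=(0.75, 0.875, 1.0, 0.875, 0.75),
                   lam_s=1.0, out_penalty=-1.0):
    """Sliding-window search over (theta_t, theta_s), Eqs. (5)-(10).

    score_map : HxW per-pixel score, here g(f(I)_q) + lam_t * p_t(c, q),
                precomputed once for this object's class.
    eta       : softening factor of Eq. (9) for this object.
    Returns (best E2 value, center position theta_t, scale theta_s).
    """
    best = (-np.inf, None, None)
    for s, ps in zip(scales, p_scale):
        m = nn_resize(mask, s)
        h, w = m.shape
        # Pad so mask pixels falling outside the frame pick up the
        # default out-of-frame penalty mentioned in the text.
        padded = np.pad(score_map, ((h // 2,), (w // 2,)),
                        constant_values=out_penalty)
        # Cross-correlation (convolution with a flipped kernel) sums the
        # score under the mask for every center position in the frame.
        e1 = fftconvolve(padded, m[::-1, ::-1], mode='valid') + lam_s * ps
        e2 = e1 / (m.sum() + eta)                       # Eq. (8)
        ij = np.unravel_index(np.argmax(e2), e2.shape)
        if e2[ij] > best[0]:
            best = (e2[ij], ij, s)
    return best
```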
Trimming and growing:

We edit object mask silhouettes after each iteration of greedy optimization (Algorithm 1 in the main text; labeled as TRIMANDGROW). We formulate this part of the problem as 2D MRF-based segmentation, in which each object in the scene becomes a segment class label. Here, we denote object segment labels as a 2D array ℓ(q), and minimize the following energy function using BP-S [1]:
-\log p(\ell \mid I, X) = \sum_q \tilde{g}_{\ell(q)}(f(I)_q) + \lambda_1 \sum_q \psi_{\ell(q)}(q) \;+ \qquad (11)

\lambda_2 \sum_{\{p,q\} \in \mathcal{E}} \phi(\ell(p), \ell(q); I) + \log Z, \qquad (12)
where Z normalizes and \mathcal{E} is the set of neighboring pixel pairs. The data term g̃ℓ(·) is our appearance model from Section 2.2 of the main text. Spatial priors ψℓ(·) are provided by our object visibility masks (Equation 4 in the main text):
\psi_\ell(q) = (G * V_\ell)(q), \qquad (13)
where G is a Gaussian kernel with scale proportional to the object's mask area, |Qℓ|.
For the spatial smoothness term φ(·), we borrow the image-sensitive smoothness potential from [1]:
\phi(\ell(p), \ell(q); I) = \mathbb{1}\big[\ell(p) \neq \ell(q)\big] \cdot \frac{\xi + e^{-\gamma \|I(p) - I(q)\|^2}}{\xi + 1}, \qquad (14)
with γ = (2⟨‖I(p) − I(q)‖²⟩)^{-1}, where ⟨·⟩ averages over the image.

This silhouette editing process leaves us with a 2D map of object labels ℓ(q). We use this map to update our object visibility masks: V_ℓ = {q : ℓ(q) = ℓ}. These visibility masks affect the synthesized scene's likelihood (Equation 6 in the main text), and are used to calculate pixel-wise and class-wise label accuracy (Section 5 of the main text). Upon each invocation of TRIMANDGROW, we discard the old silhouette edits and start afresh from the visibility masks given by Equation 4 in the main text.
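The potentials of Equations (13) and (14) can be tabulated before handing the labeling problem to the MRF solver, as in the sketch below (the BP-S solver itself is not shown; the value of ξ and the Gaussian proportionality constant are our guesses, not values from the paper):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smoothness_weights(img, xi=0.1):
    """Pairwise costs of Eq. (14) on a 4-connected pixel grid.

    img : HxWx3 float image. Returns the cost charged to each horizontal
    and each vertical edge whose two pixels take different labels;
    gamma is set from the mean squared color difference, as in the text.
    """
    dh = np.sum((img[:, 1:] - img[:, :-1]) ** 2, axis=-1)  # horizontal edges
    dv = np.sum((img[1:, :] - img[:-1, :]) ** 2, axis=-1)  # vertical edges
    gamma = 1.0 / (2.0 * np.mean(np.concatenate([dh.ravel(), dv.ravel()])))
    def phi(d2):
        return (xi + np.exp(-gamma * d2)) / (xi + 1.0)
    return phi(dh), phi(dv)

def spatial_prior(vis_mask, sigma_per_area=1e-3):
    """psi of Eq. (13): the object's visibility mask blurred by a Gaussian
    whose scale is proportional to the mask area |Q_l|."""
    sigma = sigma_per_area * float(vis_mask.sum())
    return gaussian_filter(vis_mask.astype(float), sigma)
```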
3. Random scene synthesis – details

During random scene synthesis, we do not have a target image to match, so several changes need to be made to our collaging algorithm. First, the likelihood term is set to be proportional to the number of pixels the collage covers: log P(I|X) ∝ Σ_{ℓ∈L} |V_ℓ|. This biases inference toward collages that fill the entire image frame. Second, segment transformation and trimming and growing are skipped; instead, segments are placed exactly where they were found in the dictionary scene they came from. Third, we choose the layer on which to place each object segment ℓ by finding the layer in the collage that best matches the object's layer in the dictionary scene from which it was sampled. Let A(z̃ℓ) be the histogram of object class counts above z̃ℓ in the dictionary scene and B(z̃ℓ) be the histogram below. Further, let A′(z) and B′(z) represent the histograms of object class counts above and below z in the scene collage being synthesized. We place the object on the layer z*ℓ that maximizes the following sum of histogram intersections:
z^*_\ell = \arg\max_z \sum_k \Big[ \min\big(A(\tilde{z}_\ell)_k, A'(z)_k\big) + \min\big(B(\tilde{z}_\ell)_k, B'(z)_k\big) \Big]. \qquad (15)
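A sketch of this layer selection (the data layout and the direction of the layer ordering, with larger indices "above", are our assumptions):

```python
import numpy as np

def class_hist(class_ids, n_classes):
    return np.bincount(np.asarray(class_ids, dtype=int), minlength=n_classes)

def choose_layer(A, B, collage, n_classes, n_layers):
    """Pick the collage layer z* for a sampled object via Eq. (15).

    A, B    : class-count histograms of the objects above / below the
              object's layer in its dictionary scene.
    collage : list of (class_id, layer) pairs already in the collage.
    """
    best_z, best_score = 0, -np.inf
    for z in range(n_layers):
        above = class_hist([c for c, zl in collage if zl > z], n_classes)
        below = class_hist([c for c, zl in collage if zl < z], n_classes)
        score = np.minimum(A, above).sum() + np.minimum(B, below).sum()
        if score > best_score:
            best_z, best_score = z, score
    return best_z
```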
4. Additional results

In Figures 1, 2, and 3 we display additional example parses on each dataset. Figures 4, 5, and 6 show the average per-class accuracy of our algorithm on the top 30 most frequent classes in each dataset. Figure 7 shows several characteristic failure cases for our algorithm.
[Figure 1: grids of example parses. Columns: Query image, Human annotation, Ground truth scenegraph, Inferred appearance, Inferred annotations, Inferred scenegraph, Inferred layer-world.]
Figure 1: Additional example results parsing LMO images. Zoom in to see details, such as the object labels on the scenegraphs. The last row shows an example in which the dictionary contained the same building as in the query image (although a different photo of that building). This occurs fairly frequently in LMO and SUN since these datasets contain many cases of multiple photos of more or less the same place.
[Figure 2: grids of example parses, same column layout as Figure 1.]
Figure 2: Additional example results parsing SUN images. Zoom in to see details, such as the object labels on the scenegraphs.
[Figure 3: grids of example parses, same column layout as Figure 1.]
Figure 3: Additional example results parsing NYU RGBD images. Zoom in to see details, such as the object labels on the scenegraphs.
References

[1] C. Liu, J. Yuen, and A. Torralba. Nonparametric Scene Parsing via Label Transfer. PAMI, 2011.
[2] C. Liu, S. Zhu, and H. Shum. Learning inhomogeneous Gibbs model of faces by minimax entropy. In Proc. Eighth IEEE International Conference on Computer Vision (ICCV), volume 1, pages 281-287, 2001.
[3] J. Yuen, C. Zitnick, C. Liu, and A. Torralba. A Framework for Encoding Object-level Image Priors. Microsoft Research Technical Report, 2011.
[Bar chart over the 30 most frequent classes; y-axis: mean per-class pixel labeling accuracy, 0 to 1.]
Figure 4: Mean per-class pixel labeling accuracy on the LMO dataset.
[Bar chart over the 30 most frequent classes; y-axis: mean per-class pixel labeling accuracy, 0 to 1.]
Figure 5: Mean per-class pixel labeling accuracy on the top 30 most common object classes in the SUN test set.
[Bar chart over the 30 most frequent classes; y-axis: mean per-class pixel labeling accuracy, 0 to 1.]
Figure 6: Mean per-class pixel labeling accuracy on the top 30 most common object classes in the NYU RGBD test set.
[Figure 7: four rows of failure examples. Columns: Query image, Human annotation, Inferred appearance, Inferred annotations, Failure type. Failure types, top to bottom: Wrong category, Underexplaining, Overexplaining, Bad alignment.]
Figure 7: Characteristic failure modes of our algorithm. Top row: Sometimes the algorithm gets the entire scene category wrong. Second row: Our algorithm often runs into difficulty with cluttered scenes full of small objects. Groups of small objects are sometimes explained by one big object that resembles the ensemble appearance of the small objects. One reason for this is that often the human annotations do not mark all small objects in a scene. In this example, the human annotations did not mark the people in either the query image or in the scene whose segments were used in the scene collage. Third row: Conversely, sometimes the algorithm hallucinates small objects where none are necessary, as in the addition of the foreground plants in this simple image of a field. Together, the second and third rows demonstrate that the algorithm has difficulty choosing between explaining a region with several small objects versus one big object. Bottom row: Our current segment transformation algorithm produces fairly coarse alignments, and often makes false matches. The system often finds reasonable small objects to place into a scene collage (such as the people in this collage), but gets their position wrong, leading to low per-class accuracy; better small object detection and alignment is an important direction for future work.