Learning from
Descriptive Text
Tamara L Berg
Stony Brook University
Vision
Humans
Language
Tags: canon, eos, macro, japan, vacation, frog, animal, toad, amphibian, pet, eye, feet, mouth, finger, hand, prince, photo, art, light, photo, flickr, blurry, favorite, nice.
It's the perfect party dress. With
distinctly feminine details such as a wide
sash bow around an empire waist and a
deep scoopneck, this linen dress will
keep you comfortable and feeling
elegant all evening long.
Visually Descriptive Text
Visually descriptive language provides:
• information about how people construct natural language for imagery.
• information about the world, especially the visual world.
• guidance for computational visual recognition.
How do people
describe the world?
“It was an arresting face, pointed of chin, square of jaw. Her eyes
were pale green without a touch of hazel, starred with bristly black
lashes and slightly tilted at the ends. Above them, her thick black
brows slanted upward, cutting a startling oblique line in her
magnolia-white skin–that skin so prized by Southern women and so
carefully guarded with bonnets, veils and mittens against hot
Georgia suns” – from Gone with the Wind by Margaret Mitchell
How does the world work?
What should we recognize?
What’s in a description?
What’s in this image?
man, baby, sling, shirt, glasses, ladder, fridge, table, watermelon, chair, boxes, cups, water bottle, wall, pacifier, beard, …
What do people describe?
“A bearded man is holding a child in a sling.”
“A bearded man stands while holding a small child in a green sheet.”
“A bearded man with a baby in a sling poses.”
“Man standing in kitchen with little girl in green sack.”
“Man with beard and baby”
What’s in a description?
1) Given an image: predict what people will describe
2) Given a caption: predict what’s in the image
“looking for castles in the clouds out my car window”
clouds ✔, car ✖, window ✖, castle ?
“two women sitting brunette blonde on bench reading magazine”
women ✔, bench ✔, magazine ✔, grass ✖, skirt ✖, …
e.g. Spain & Perona, 2010
President George W. Bush makes a
statement in the Rose Garden while
Secretary of Defense Donald Rumsfeld
looks on, July 23, 2003. Rumsfeld said
the United States would release graphic
photographs of the dead sons of
Saddam Hussein to prove they were
killed by American troops. Photo by
Larry Downing/Reuters
Who’s in the picture?
T.L. Berg, A.C. Berg, J. Edwards, D.A. Forsyth
Model                               Accuracy of labeling
Vision model, no language model     67%
Vision model + language model       78%
Vision is hard
World knowledge (from descriptive text) can be used to smooth noisy vision predictions!
Green sheep
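One way to picture the smoothing idea is a weighted combination of a vision score and a text-derived prior. The sketch below is only an illustration under stated assumptions: the function name `smooth_label`, the log-linear weighting, and all of the numbers are hypothetical and are not taken from the talk.

```python
import math

def smooth_label(vision_scores, text_prior, text_weight=0.5, floor=1e-6):
    """Return the label with the best weighted sum of log vision score and log text prior."""
    best_label, best_score = None, -math.inf
    for label, v in vision_scores.items():
        t = text_prior.get(label, floor)  # unseen labels get a tiny prior
        score = ((1 - text_weight) * math.log(max(v, floor))
                 + text_weight * math.log(max(t, floor)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up numbers: a noisy color classifier run on a sheep standing in grass,
# and a made-up prior for how often each color word is used with "sheep" in text.
vision = {"green": 0.5, "white": 0.4, "black": 0.1}
prior = {"green": 0.01, "white": 0.6, "black": 0.39}

vision_only = smooth_label(vision, prior, text_weight=0.0)
smoothed = smooth_label(vision, prior, text_weight=0.5)
print(vision_only, smoothed)
```

With the text weight at zero the noisy vision score wins and the sheep is labeled green; with the prior mixed in, the label moves to a color that descriptive text supports.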
Learning World Knowledge
Attributes
Relationships
green green grass by the lake
a very shiny car in the car museum in my hometown of upstate NY.
Our cat Tusik sleeping on the sofa near a hot radiator.
very little person in a big rocking chair
BabyTalk: Understanding and Generating Simple Image Descriptions
Kulkarni, Premraj, Dhar, Li, Choi, A.C. Berg, T.L. Berg, CVPR 2011
System Flow
Input Image
Extract Objects/stuff
a) dog
b) person
c) sofa
Predict attributes
brown 0.32, striped 0.09, furry 0.04, wooden 0.2, feathered 0.04, ...
brown 0.94, striped 0.10, furry 0.06, wooden 0.8, feathered 0.08, ...
brown 0.01, striped 0.16, furry 0.26, wooden 0.2, feathered 0.06, ...
Predict prepositions
near(a,b) 1, near(b,a) 1, against(a,b) 0.11, against(b,a) 0.04, beside(a,b) 0.24, beside(b,a) 0.17, ...
near(a,c) 1, near(c,a) 1, against(a,c) 0.3, against(c,a) 0.05, beside(a,c) 0.5, beside(c,a) 0.45, ...
near(b,c) 1, near(c,b) 1, against(b,c) 0.67, against(c,b) 0.33, beside(b,c) 0.0, beside(c,b) 0.19, ...
Predict labeling – vision potentials smoothed with text potentials
<<null,person_b>,against,<brown,sofa_c>>
<<null,dog_a>,near,<null,person_b>>
<<null,dog_a>,beside,<brown,sofa_c>>
Generate natural language description
This is a photograph of one person and one brown sofa and one dog. The person is against the brown sofa. And the dog is near the person, and beside the brown sofa.
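The last step, turning labeled triples into sentences, can be sketched with simple templates. This is a minimal illustration, not the BabyTalk generation procedure: the helper names `noun_phrase` and `describe`, the tuple encoding of the triples, and the sentence templates are all assumptions made for the example.

```python
def noun_phrase(attr, obj):
    """Render an (attribute, object) pair; a missing attribute (None) is dropped."""
    return f"{attr} {obj}" if attr else obj

def describe(triples):
    """Build a short description from ((attr, obj), preposition, (attr, obj)) triples."""
    # Collect the distinct objects in first-seen order for the opening sentence.
    seen = []
    for subj, _, obj in triples:
        for pair in (subj, obj):
            if pair not in seen:
                seen.append(pair)
    intro = ("This is a picture of "
             + " and ".join("one " + noun_phrase(*pair) for pair in seen) + ".")
    # One templated sentence per triple.
    sentences = [
        f"The {noun_phrase(*subj)} is {prep} the {noun_phrase(*obj)}."
        for subj, prep, obj in triples
    ]
    return " ".join([intro] + sentences)

# The three triples from the slide, with "null" attributes written as None.
triples = [
    ((None, "person"), "against", ("brown", "sofa")),
    ((None, "dog"), "near", (None, "person")),
    ((None, "dog"), "beside", ("brown", "sofa")),
]
text = describe(triples)
print(text)
```

The output is plainer than the description shown on the slide, which also merges sentences that share a subject; the sketch only shows how the attribute, object, and preposition slots map into text.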
BabyTalk results
This is a picture of one sky, one road and one sheep. The gray sky is over the gray road. The gray sheep is by the gray road.
Here we see one road, one sky and one bicycle. The road is near the blue sky, and near the colorful bicycle. The colorful bicycle is within the blue sky.
This is a picture of two dogs. The first dog is near the second furry dog.
Objects, Attributes, Prepositions
What should we recognize?
• Recognition is beginning to work
• Open question – what should we recognize?
• Maybe objects aren’t (always) the right base level entities
Object Recognition
Parts, Poselets, Attributes
For example: [Fergus, Perona, Zisserman 2003], [Bourdev, Malik 2009], …
Slide Credit: Ali Farhadi
Learn which terms in descriptions are depictable
Fully beaded with megawatt
crystals, this Christian Louboutin suede
pump matches the gleam in your eye.
Pump's linear heel plays up the alluring
curves of its dipped sides.
Round toe frames low-cut vamp.
Tonally topstitched collar.
4" straight, covered heel shows off
signature red sole.
Creamy leather lining with padded
insole.
"Fifi" is made in Italy.
attributes
Automatically Discovering Attributes from Noisy Web Data
T.L. Berg, A.C. Berg, J. Shih, ECCV 2010
Given Web Images + Noisy Text Descriptions:
1) Discover visual attribute terms in text descriptions - likely domain dependent
2) Learn appearance models for attributes without labeled data
3) Characterize attributes by: type, localizability
Object Recognition
For example: [Oliva, Torralba 2001], [SUN 2010], …
Slide Credit: Ali Farhadi
Scenes
What are the right quanta of recognition?
Farhadi & Sadeghi, Recognition using Visual Phrases, CVPR 2011
Participating in phrases profoundly affects the appearance of objects
Maybe descriptive text can inform entity hypotheses!
“the dog is sleeping”
“sleeping dog in delhi”
“A dog is sleeping in”
“a sleeping dog in NTHU”
What should we recognize?
“cat in a bag”
“cat in the bag”
“cat in bag”
“the cat is in the bag”
What should we recognize?
Maybe descriptive text can inform entity hypotheses!
Conclusion
Use large pools of descriptive text to:
Learn how people describe the visual world
Learn how the world works
Guide future efforts in recognition
Apply this knowledge to multi-modal collections & applications
Acknowledgements
• Collaborators: Alex Berg, David Forsyth, Jaety Edwards, Jonathan Shih, Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Vicente Ordonez, Siming Li, Yejin Choi, Kota Yamaguchi
• Funded by NSF Faculty Early Career Development (CAREER) Program: Award #1054133