
Understanding Visual Scenes
Dependency Graphs, Word Senses, and Multimodal Embeddings

Mirella Lapata
School of Informatics
University of Edinburgh

EdinburghNLP
University of Edinburgh Natural Language Processing

Joint Work with

Carina Silberer, Spandana Gella, Frank Keller, Jasper Uijlings


Structure in Multimodal Processing

Lots of recent work on multimodal processing:

image description generation;
visual question answering;
multimodal machine translation;
video summarization.

We need to understand the meaning of images and text: Who does what to whom?

Understanding requires structure, not just an unordered set of labels:

linguistic structure;
image structure.


A man is playing a trumpet in front of a little boy.


Linguistic Structure

Output of dependency parser (with PoS labels):

http://nlp.stanford.edu:8080/corenlp/process


Output of a semantic role labeler (with word senses):

http://cogcomp.cs.illinois.edu/page/demo_view/srl


Structure in Multimodal Processing

Linguistic structure:

discrete base units (words), ordered in 1D;
span-based labels (e.g., PoS, phrases);
tree-based hierarchies;
clear distinction between syntax and semantics;
canonical representations defined by linguistic theory.

Now let’s compare this to image structure.


Image Structure

Output of an image labeler:

https://www.clarifai.com/demo

We could also label: attributes, scene type, colors, textures, etc.

Output of an object recognizer:

Output of a Fast R-CNN model with AlexNet architecture, trained on PASCAL VOC 2007.

Hierarchical segmentation (indicates part-whole relationships):

http://www.socher.org/index.php/Main/ParsingNaturalScenesAndNaturalLanguageWithRecursiveNeuralNetworks


Structure in Multimodal Processing

Linguistic structure:

discrete base units (words), ordered in 1D;
span-based labels (e.g., PoS, phrases);
tree-based hierarchies;
clear distinction between syntax and semantics;
canonical representations defined by linguistic theory.

Image structure:

continuous base units (pixels), ordered in 2D;
region-based labels (e.g., objects, attributes);
part–whole structure;
no clear distinction between syntax and semantics;
no “correct” canonical representations.


Representational Divergence

Representational divergence: for multimodal processing, we need to fuse linguistic and image structures, but they are very different.

Hypothesis: We need to align visual representations.

Two examples in this talk:

visual dependency representations;
visual sense disambiguation.


1 Representing Visual Structure
  Visual Dependency Representations
  Visual Constituency Representations
  Applications

2 Visual Sense Disambiguation
  Task Definition
  Dataset Construction
  Unsupervised Model for VSD

3 Conclusions

1 Representing Visual Structure

Spatial Relations

We need a grammar that defines the relations between the objects in an image: Visual Dependency Grammar (Elliott & Keller 2013).

It assumes eight relations that can hold between pairs of objects, based on three geometric properties:

pixel overlap;
angle between objects;
distance between objects.
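The three geometric properties can be sketched for a pair of detected objects as follows. This is only an illustration under stated assumptions: objects are approximated by axis-aligned bounding boxes (so "pixel overlap" becomes box overlap), distance and angle are measured between box centroids, and the `Box` class, function names, and coordinates are hypothetical. The thresholds that Visual Dependency Grammar uses to turn these quantities into relations are not given in the slides and are not modelled here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in image coordinates: (x0, y0) top-left, (x1, y1) bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self):
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    @property
    def centroid(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def overlap_ratio(a, b):
    """Fraction of a's area covered by b (box-level stand-in for pixel overlap)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    if w <= 0 or h <= 0 or a.area == 0:
        return 0.0
    return (w * h) / a.area


def centroid_distance(a, b):
    """Euclidean distance between the two box centroids."""
    (ax, ay), (bx, by) = a.centroid, b.centroid
    return math.hypot(bx - ax, by - ay)


def centroid_angle(a, b):
    """Angle in degrees in [0, 360) of the vector from a's centroid to b's.

    The image y axis points down, so it is flipped here: 90 means b lies
    straight above a, 270 means b lies straight below a.
    """
    (ax, ay), (bx, by) = a.centroid, b.centroid
    return math.degrees(math.atan2(ay - by, bx - ax)) % 360.0


# Hypothetical detections: a cake standing on a dining table.
cake = Box(40, 60, 60, 80)
d_table = Box(0, 50, 100, 100)
print(overlap_ratio(cake, d_table), centroid_distance(cake, d_table), centroid_angle(cake, d_table))
```

A relation classifier would sit on top of these three numbers; with segmentation masks instead of boxes, only `overlap_ratio` would need to change.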


The eight relations:

X --on--> Y
X --surrounds--> Y
X --beside--> Y
X --opposite--> Y
X --above--> Y
X --below--> Y
X --infront--> Y
X --behind--> Y


Visual Tuples

An image is represented as a bag of VDR tuples (Ortiz et al., 2015).

person close person
person on_beside d_table
d_table surrounds cake
person near cake
person close d_table
person above_close cake
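A bag of tuples is simply a multiset, which the following sketch makes concrete. The tuples are the ones listed above; the second scene and the multiset Jaccard overlap used to compare two bags are illustrative assumptions, not a measure taken from the talk.

```python
from collections import Counter


def parse_tuples(lines):
    """Turn 'subject relation object' lines into a multiset of 3-tuples."""
    return Counter(tuple(line.split()) for line in lines if line.strip())


def bag_overlap(a, b):
    """Multiset Jaccard overlap between two tuple bags (an illustrative choice)."""
    inter = sum((a & b).values())
    union = sum((a | b).values())
    return inter / union if union else 0.0


scene = parse_tuples([
    "person close person",
    "person on_beside d_table",
    "d_table surrounds cake",
    "person near cake",
    "person close d_table",
    "person above_close cake",
])

# Hypothetical second image that shares two of the tuples.
other = parse_tuples([
    "person on_beside d_table",
    "d_table surrounds cake",
])

print(bag_overlap(scene, other))
```

Because the bag ignores how tuples connect to each other, two scenes with the same tuples are indistinguishable, which is the limitation the tree representations address.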


Visual Dependency Representations

An image is represented as a dependency tree (Silberer et al., 2017).

[Figure: dependency tree over the nodes person, person, d_table, cake, with edge labels close, on_beside, surrounds, and root.]


Visual Constituency Representations

An image is represented as a constituency tree (Silberer et al., 2017).

[Figure: constituency tree with nonterminals NP, SR, and R over the objects person, person, d_table, cake and the relations on_beside, close, surrounds.]


Tree Construction

[Figure: detected objects (tv, person, bottle, table, pizza) connected in a graph whose edges are labeled with pairwise distances.]

Build a fully connected graph with all objects as nodes;
edge weights correspond to spatial distance;
minimum spanning tree (MST): visual dependency representation;
use grammar to generate visual constituency representation.

[Figure: resulting visual dependency tree over pizza, d_table, and person, with edge labels on_beside, below_close, and root.]

[Figure: corresponding visual constituency tree with nonterminals NP, SR, and R over the same objects and relations.]
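The first three construction steps can be sketched with Prim's algorithm over the complete graph of objects. This is a minimal illustration, assuming that spatial distance is the Euclidean distance between object centroids and that the root is chosen by the caller; the coordinates and the choice of `d_table` as root are hypothetical, and the sketch does not assign relation labels to the edges.

```python
import math


def minimum_spanning_tree(points, root):
    """Prim's algorithm over the complete graph on `points`.

    points: dict mapping an object name to its (x, y) centroid.
    Returns (parent, child, distance) edges directed away from `root`,
    in the order in which they were added.
    """
    in_tree = {root}
    edges = []
    while len(in_tree) < len(points):
        best = None
        # Scan every edge that leaves the current tree and keep the shortest.
        for u in in_tree:
            for v in points:
                if v in in_tree:
                    continue
                d = math.dist(points[u], points[v])
                if best is None or d < best[2]:
                    best = (u, v, d)
        edges.append(best)
        in_tree.add(best[1])
    return edges


# Hypothetical centroids for four detected objects.
centroids = {
    "d_table": (50, 75),
    "pizza": (50, 70),
    "person": (50, 30),
    "tv": (90, 20),
}

tree = minimum_spanning_tree(centroids, root="d_table")
for parent, child, dist in tree:
    print(f"{parent} -> {child} ({dist:.1f})")
```

The quadratic scan is fine for the handful of objects in a typical image; labelling each edge with a spatial relation and applying the grammar are separate steps.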


Image Description Generation via Machine Translation

Repurpose existing NLP technology to construct visual representations;
use machine translation models: focus on tree-to-string translation;
trees are task-independent, do not take descriptions into account:
  create parallel corpus of trees with multiple descriptions;
translation is loose: not all visual objects are verbalized; multiple descriptions can focus on different aspects of a scene:
  generation model performs content selection.


Parallel Corpus Creation

Step 1: Grounding objects to linguistic expressions.

person, d_table, person, cake, plate, cup

Little kids sitting around a table that has a birthday cake on it.
A group of young children standing around a cake.

Descriptions annotated with semantic roles and verb senses:

[Little kids]A1 sitting (sit.01) [around a table]A2 that has (has.01) [a birthday cake]A2 on it.
[A group of young children]A1 standing (stand.01) [around a cake]A2.


Step 2: Render scenes as trees and generate corpus.

[Figure: the same dependency tree (nodes person, person, d_table, cake; edge labels close, on_beside, surrounds, root) is paired with each description in turn:]

Kids sitting around a table.
A table that has a birthday cake.
Children standing around a cake.


MT Model: Surface Realization

We train a translation model on our parallel corpus using the MT framework implemented in Moses (Koehn et al., 2007):

\[ t^{*} = \operatorname*{arg\,max}_{t} P(t \mid s) \]

\[ P(t \mid s) = \operatorname*{arg\,max}_{d} \left( \sum_{k=1}^{K} \lambda_k h_k(d) \right) \]

d ∈ D(s, t) are derivations in a synchronous grammar;
h_k are feature functions (language model, translation table, word penalty model);
constants λ_k scale different models, tuned during training.
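The weighted feature sum at the core of this model can be illustrated in a few lines. The sketch below is not the Moses API: the feature names mirror the three feature functions listed on the slide, while the weights, feature values, and candidate strings are invented for illustration, and a real decoder searches over derivations rather than scoring a fixed list.

```python
def derivation_score(features, weights):
    """Weighted sum over k of lambda_k * h_k(d) for one derivation d."""
    return sum(weights[name] * value for name, value in features.items())


def best_derivation(candidates, weights):
    """Return the (target string, features) pair with the highest score."""
    return max(candidates, key=lambda cand: derivation_score(cand[1], weights))


# Hypothetical weights lambda_k (in practice tuned on held-out data).
weights = {"language_model": 1.0, "translation_table": 0.8, "word_penalty": -0.1}

# Hypothetical candidates: target string plus feature values h_k(d)
# (log-probabilities for the two models, word count for the penalty).
candidates = [
    ("a pizza on a table",
     {"language_model": -4.0, "translation_table": -2.0, "word_penalty": 5}),
    ("pizza table",
     {"language_model": -9.0, "translation_table": -1.5, "word_penalty": 2}),
]

target, feats = best_derivation(candidates, weights)
print(target, derivation_score(feats, weights))
```

Changing a single weight can flip the winner, which is why the constants are tuned rather than fixed by hand.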


MT Model: Content Selection

At test time we must decide which objects to talk about:

predict whether a detected object is relevant for the scene;
we use logistic regression with L2 regularization;
trained on positive and negative instances;
positives: objects aligned to SRL arguments;
negatives: unaligned objects;
features: object detection score, relative size, relative distance between two objects, object occurrences, spatial features.
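A content selector of this kind can be sketched as a small L2-regularized logistic regression trained by batch gradient descent. Everything concrete here is an assumption made for illustration: only two of the listed features are used (detection score and relative size), the six training instances are invented, and the learning rate, regularization strength, and number of epochs are arbitrary. The talk does not say how the model was actually fitted.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, l2=0.1, lr=0.5, epochs=2000):
    """Minimise mean log-loss + (l2 / 2) * ||w||^2; the bias is not penalised."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_relevant(w, b, x):
    """True if the object is predicted to be worth mentioning."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5


# Hypothetical instances: [detection score, relative size].
X = [[0.90, 0.40], [0.80, 0.30], [0.95, 0.50],
     [0.30, 0.05], [0.20, 0.02], [0.40, 0.04]]
y = [1, 1, 1, 0, 0, 0]  # 1 = aligned to an SRL argument, 0 = unaligned

w, b = train_logreg(X, y)
print(predict_relevant(w, b, [0.90, 0.45]), predict_relevant(w, b, [0.25, 0.03]))
```

At test time the selector would be applied to every detected object, and only the objects predicted relevant would be passed on to the translation model.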


Query-by-Example Image Retrieval

Let I denote an image collection;
for every query image q, produce a ranking of the images in I in order of similarity to q;
subtree kernels measure similarity of constituency trees;
partial tree kernels measure similarity of dependency trees.

[Figure: example fragments compared by the kernels: subtrees of a constituency tree over pizza, on_beside, d_table, and partial trees of the dependency tree over pizza, d_table, person (edge labels on_beside, below_close).]
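Ranking by a subtree kernel can be sketched as follows. The sketch uses the simplest variant, which counts pairs of identical complete subtrees (a node together with all of its descendants); whether the talk's system uses exactly this variant, and whether it normalises the kernel, is not stated, so treat both as assumptions. The nested-tuple tree encoding, the example trees, and the image ids are hypothetical, and the partial tree kernel for dependency trees is not shown.

```python
from collections import Counter


def complete_subtrees(tree):
    """Multiset of complete subtrees rooted at internal nodes.

    Trees are nested tuples (label, child, ...); leaves are plain strings
    and are not counted on their own.
    """
    bag = Counter()

    def visit(node):
        if isinstance(node, str):
            return
        bag[node] += 1
        for child in node[1:]:
            visit(child)

    visit(tree)
    return bag


def subtree_kernel(t1, t2):
    """Number of pairs of identical complete subtrees shared by t1 and t2."""
    b1, b2 = complete_subtrees(t1), complete_subtrees(t2)
    return sum(count * b2[sub] for sub, count in b1.items())


def rank_by_similarity(query, collection):
    """Image ids sorted by decreasing kernel value with the query tree."""
    return sorted(collection,
                  key=lambda image_id: subtree_kernel(query, collection[image_id]),
                  reverse=True)


# Hypothetical constituency trees built from NP, SR, and R nodes.
sr = ("SR", ("R", "on_beside"), ("NP", "d_table"))
query = ("NP", ("NP", "pizza"), sr)
collection = {
    "img_a": ("NP", ("NP", "pizza"), sr),
    "img_b": ("NP", ("NP", "person"), sr),
    "img_c": ("NP", ("NP", "person"), ("SR", ("R", "close"), ("NP", "person"))),
}

print(rank_by_similarity(query, collection))
```

Dividing by the square root of the product of the two self-kernels would keep large trees from dominating the ranking.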


Results: Image Description Generation

[Figure: bar chart of CIDEr (%) on the COCO 2015 test set. Systems in legend order: Template, Bag-of-Objects, Tuples, Constituency, Dependency, NeuralTalk. Bar values in order of appearance: 43.8, 44.1, 47.9, 52, 54.3, 58.8.]


Results: Image Retrieval

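[Figure: bar chart of macro-averaged precision at P@1, P@5, and P@10. Systems in legend order: Bag-of-Objects, Tuples, Constituency, Dependency, NeuralTalk. Bar values as extracted, in order of appearance: 8.6, 10.5, 13.3, 11.7, 13.4, 13.6, 19.9, 14.2, 11.9, 15.2, 15.7, 13.9, 10.7, 29.6, 42.3.]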


Example Output

System        Image 1                            Image 2
Template      5) a couch has a couch             2) an airplane is near a car
Tuples        4) the room has a couch            5) a airplane sitting on a street
Dependency    1) a dog sitting on a couch        3) a airplane parked next to a car
Constituency  2) dog laying on a couch           4) a airplane parked next to a car
Human         3) a dog is looking at something   1) a large plane with a red tail

2 Visual Sense Disambiguation

Aligning Actions and Verbs

So far, we have looked at syntactic structure only: how do the objects in an image relate to each other.

To really understand the content of an image, we need semantics: represent the event depicted, its participants, and the roles they play.

We can achieve this using verb senses:

well established in linguistics (e.g., WordNet);
more general than the action labels used in computer vision;
can be aligned with both sentences and images.


Word Sense Disambiguation

Word sense disambiguation is a standard NLP task:

(1) A man is playing a guitar.

play:1 perform music on musical instrument

(2) The children are playing across the street.

play:2 engage in a fun or recreational (childlike) activity

(3) Two men playing doubles tennis on a grass court.

play:3 engage in or make moves related to competition or sport


Visual Sense Disambiguation

We can apply this task to an image/verb pair:

play

play:1 perform music on musical instrument

New task: visual sense disambiguation (VSD, Gella et al. 2016).


Existing Action Recognition Datasets

Dataset                               Verbs  Actions  Sense
PPMI (Yao & Fei-Fei 2010)             2      24       N
Stanford 40 (Yao et al. 2011)         33     40       N
PASCAL 2012 (Everingham et al. 2015)  9      11       N
TUHOI (Le et al. 2014)                –      2974     N

Actions: verb phrases or verb-object pairs;
verb senses are more general than actions;
no existing datasets with verb sense annotation.


Dataset for Visual Verb Sense Disambiguation

Design a new dataset using images from:

MSCOCO: 123k images with object labels, image descriptions:
  not designed for action recognition;
  use verbs in descriptions as labels.

TUHOI: 10,805 images with object labels:
  labeled with actions (verb-object pairs);
  use verbs as labels.


We use the OntoNotes inventory of verb senses (less fine-grained than WordNet).
But: not all verb senses are visual.

[Figure: examples of visual and non-visual verb senses.]

Solution: annotate only the visual senses:
  annotators decide which senses are visual (about 50% in MSCOCO);
  new annotators select correct visual sense for each image.


Annotating Image and Verb with Visual Sense


VerSe Dataset

Comparison of VerSe with existing action recognition datasets:

Dataset                               Verbs  Actions  Sense
PPMI (Yao & Fei-Fei 2010)             2      24       N
Stanford 40 (Yao et al. 2011)         33     40       N
PASCAL 2012 (Everingham et al. 2015)  9      11       N
TUHOI (Le et al. 2014)                –      2974     N
VerSe (our dataset)                   90     –        Y (163)


VerSe dataset divided into motion and non-motion verbs:

Verb type   Verbs  Images  Senses  Examples
Motion      39     1812    5.79    run, walk, jump, swing, hit, kick
Non-motion  51     1698    4.86    sleep, sit, lean, read, write, look


Unsupervised Model for VSD

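[Figure: model overview. The sense inventory D for the verb play (engage in competition or sport; perform or transmit music; engage in a playful activity) is mapped to sense representations s1, s2, s3. The image is mapped to image representations built from object labels (O: person, guitar, microphone), captions (C: A man playing guitar.), and CNN-fc7 features. A scoring function Φ compares the two and selects a sense (here s2).]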


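[Figure: a VGG CNN processes the image and provides CNN-fc7 features and object labels (O: person, guitar, microphone); the object labels are embedded with word2vec.]

Object labels obtained using VGG (Simonyan & Zisserman 2014).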


[Figure: a VGG CNN followed by an LSTM generates a caption (C: A man playing a guitar), which is embedded with word2vec.]

Image descriptions from Show and Tell (Vinyals et al. 2015).


Visual Representation for Senses


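[Figure: visual representations for two senses of "play". The sense "perform or transmit music" is paired with the queries q11, q12, q13 (playing guitar, playing music, playing in a band); the sense "engage in competition or sport" is paired with the queries q21, q22, q23 (playing tennis, playing sport, playing game). Each query is shown with images #1, #2, #3, ... The images are encoded with CNN-fc7 features, which are mean-pooled to give the representations of play #1 and play #2.]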
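The mean-pooling step in the figure can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sense_representation` and the three-dimensional toy vectors are assumptions made for the example, standing in for real CNN-fc7 features of the images associated with one sense.

```python
import numpy as np


def sense_representation(image_features):
    """Mean-pool per-image feature vectors into a single vector for one sense.

    image_features: array-like of shape (num_images, dim), e.g. CNN-fc7
    features of the images associated with a sense (hypothetical input).
    """
    feats = np.asarray(image_features, dtype=float)
    if feats.ndim != 2 or feats.shape[0] == 0:
        raise ValueError("expected a non-empty (num_images, dim) array")
    # Average over the image axis; the result has shape (dim,).
    return feats.mean(axis=0)


# Toy stand-ins for the features of two images of one sense of "play".
music_images = np.array([[1.0, 0.0, 2.0],
                         [3.0, 0.0, 0.0]])
play_music = sense_representation(music_images)
```

Averaging keeps the sense vector in the same space as a single image's features, so the two can be compared directly with a similarity measure.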


Scoring Function

Use vector similarity (cosine) as scoring function:

$$\hat{s} = \arg\max_{s \in S(v)} \Phi(s, i, v, D)$$

Representations:

textual: O, C embeddings;
visual: CNN features;
multi-modal: fused textual and visual features using Canonical Correlation Analysis.
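As a rough sketch of the selection rule above, the snippet below scores each candidate sense by cosine similarity with the image vector and returns the highest-scoring one. The function names `cosine` and `disambiguate` and the three-dimensional toy vectors are assumptions for illustration; in the model described here the vectors would be the textual, visual, or multi-modal representations.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def disambiguate(image_vec, sense_vecs):
    """Return the sense whose vector is most similar to the image vector.

    sense_vecs: dict mapping each candidate sense in S(v) to its vector.
    """
    scores = {sense: cosine(image_vec, vec) for sense, vec in sense_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy vectors for the three senses of "play" and one image.
sense_vecs = {
    "engage in competition or sport": [1.0, 0.0, 0.0],
    "perform or transmit music": [0.0, 1.0, 0.0],
    "engage in a playful activity": [0.0, 0.0, 1.0],
}
image_vec = [0.1, 0.9, 0.2]
best, scores = disambiguate(image_vec, sense_vecs)
```

Because the procedure only compares vectors, it needs no sense-annotated training images, which is what makes the approach unsupervised.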


Results


Accuracy scores (bar chart):

Model          Motion   Non-motion
First-sense    70.8     80.6
Text           65.1     64.3
Visual         58.3     56.1
Multi-modal    72.6     66.3


Results: Gold Standard Image Descriptions

Accuracy scores (bar chart):

Model          Motion   Non-motion
First-sense    70.8     80.6
Text           75.6     72.7
Visual         58.3     56.1
Multi-modal    75.4     72.2


Verb Prediction

[Figure: ConvNet classifier. fc7 features of shape (2048,12,12) pass through linear and sigmoid layers and MIL noisy-OR pooling to give an output for each verb v (e.g., play, swing, throw).]

detect the verbs that are present in an image (250 classes);
use multiple instance learning (we do not know which bounding boxes correspond to which verbs).
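The noisy-OR form of multiple instance learning named on the slide can be sketched as below. This is a generic illustration under stated assumptions, not the classifier from the talk: the function name `noisy_or_verb_probs`, the array layout (one row of logits per image region, one column per verb), and the toy values are all hypothetical. The idea is that a verb counts as present in the image if at least one region supports it, so the image-level probability is one minus the probability that every region misses it.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def noisy_or_verb_probs(region_logits):
    """Combine per-region verb scores into image-level verb probabilities.

    region_logits: array of shape (num_regions, num_verbs) holding the
    linear-layer outputs for each region (hypothetical layout).
    Returns an array of shape (num_verbs,) with
        p(verb) = 1 - prod_r (1 - sigmoid(logit[r, verb])).
    """
    p_region = sigmoid(np.asarray(region_logits, dtype=float))
    return 1.0 - np.prod(1.0 - p_region, axis=0)


# Two regions, two verbs: both regions are undecided about the first verb
# and confident that the second verb is absent.
logits = np.array([[0.0, -10.0],
                   [0.0, -10.0]])
probs = noisy_or_verb_probs(logits)
```

With this pooling, only image-level verb labels are needed for training, which matches the setting where the correspondence between bounding boxes and verbs is unknown.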


Examples: Verb Prediction

[Figure: three example images with their predicted verbs: (1) play, perform; (2) hit, swing, play; (3) hold, sit, use.]


Verb Prediction and Sense Disambiguation


1 Representing Visual Structure
    Visual Dependency Representations
    Visual Constituency Representations
    Applications

2 Visual Sense Disambiguation
    Task Definition
    Dataset Construction
    Unsupervised Model for VSD

3 Conclusions


Conclusions

Image understanding (like text understanding) requires structured representations;
for multimodal tasks, we need to align linguistic and image structure;
syntactic example: visual dependency representations align geometric structure of an image with syntactic structure of a sentence;
application in image description and image retrieval;
semantic example: visual word senses align event depicted in an image with event described in a sentence;
unsupervised VSD model using multimodal embeddings.


Other Approaches to Image Structure

Other approaches that align linguistic structure and image structure:

Scene (description) graphs (Johnson et al. 2015; Aditya et al. 2015):

triples of object, attribute, relation;
aligned with image regions and region descriptions;
no explicit alignment with linguistic structure (but could be derived).

Visual semantic roles (Yatskar et al. 2016):

uses semantic frames from FrameNet;
annotates images with frames, participants, and roles;
not aligned with regions or image descriptions; no verb senses.


Scene Graphs

http://cs.stanford.edu/people/jcjohns/cvpr15_supp/


Visual Semantic Roles

http://imsitu.org/demo/


References I

Aditya, S., Yang, Y., Baral, C., Fermuller, C., & Aloimonos, Y. (2015). From images to sentences through scene description graphs using commonsense reasoning and knowledge. arXiv preprint arXiv:1511.03292.

Elliott, D., & Keller, F. (2013). Image description using visual dependency representations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, (pp. 1292–1302), Seattle, WA.

Everingham, M., Eslami, S. M. A., Gool, L. V., Williams, C. K. I., Winn, J. M., & Zisserman, A. (2015). The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111, 98–136.

Gella, S., Lapata, M., & Keller, F. (2016). Unsupervised visual sense disambiguation for verbs using multimodal embedding. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, (pp. 182–192), San Diego, CA.

Johnson, J., Krishna, R., Stark, M., Li, L.-J., Shamma, D. A., Bernstein, M., & Fei-Fei, L. (2015). Image retrieval using scene graphs. In Proceedings of the Conference on Computer Vision and Pattern Recognition, (pp. 3668–3678), Boston, MA.

Le, D.-T., Uijlings, J., & Bernardi, R. (2014). TUHOI: Trento Universal Human Object Interaction dataset. In Proceedings of the Third Workshop on Vision and Language, (pp. 17–24). Dublin City University and the Association for Computational Linguistics.


References II

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.

Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, (pp. 3156–3164).

Yao, B., & Fei-Fei, L. (2010). Grouplet: A structured image representation for recognizing human and object interactions. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, (pp. 9–16). IEEE.

Yao, B., Jiang, X., Khosla, A., Lin, A. L., Guibas, L., & Fei-Fei, L. (2011). Human action recognition by learning bases of action attributes and parts. In Computer Vision (ICCV), 2011 IEEE International Conference on, (pp. 1331–1338). IEEE.

Yatskar, M., Zettlemoyer, L., & Farhadi, A. (2016). Situation recognition: Visual semantic role labeling for image understanding. In Computer Vision and Pattern Recognition.
