
Towards a Watson That Sees: Language-Guided Action Recognition for Robots

Ching L. Teo, Yezhou Yang, Hal Daumé III, Cornelia Fermüller and Yiannis Aloimonos

University of Maryland Institute for Advanced Computer Studies

Apr 9, 2012


Outline

1. Why Vision and Language

2. Action Recognition

3. Future Work

4. Conclusions


Motivation

Computer Vision seeks to achieve perception through visual signals.

Current approaches to perception extract edges, texture, motion, color, and surfaces.

What comes next?

(Figure: Felzenszwalb, P. et al., Object detection with discriminatively trained part-based models, PAMI 2010)


This is not enough: high-level knowledge is required.

High-level knowledge: semantics, function, attributes, associations.

Enables reasoning from perception.

Approach: use language as a resource of high-level information to solve problems in Computer Vision.

(Figure: Jia Deng et al., ImageNet: A Large-Scale Hierarchical Image Database, CVPR 2009)


Why Language?

Language provides:

1. A lexicon (dictionary) that encodes contextual relationships between entities: e.g. ladle --occurs--> kitchen.

2. Prior knowledge: e.g. knife --cuts--> cucumber.

3. Generalization beyond what is "seen" (non-visual): e.g. sun, beaches, people --relates--> vacation.

Computational Linguistics has provided tools and relational databases that encode relationships such as is-a, cause-effect, performs-function, and motivated-by.


Approach: The Cognitive Dialog

A model of a reasoning process that involves the Visual Executive (VE) and the Language Executive (LE).

VE provides: observations, low-level feature extraction.

LE provides: constraints, plausibility of observations from the VE.

Each iteration of the "dialog" seeks to optimize a global function related to the task, e.g. scene/object/action recognition or object segmentation.
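As a toy illustration of one dialog iteration (a minimal Python sketch under assumed numbers, not the paper's implementation): the VE supplies noisy per-action evidence, the LE supplies tool-conditioned plausibilities, and the two are fused and renormalized.

```python
import numpy as np

# Toy sketch of one Cognitive Dialog iteration. The action set, the VE
# scores, and the LE prior below are made-up numbers for illustration.

actions = ["cut", "stir", "pour"]

# VE: noisy visual evidence for each action (e.g. from hand trajectories).
ve_scores = np.array([0.40, 0.35, 0.25])

# LE: plausibility of each action given a detected tool, e.g. "knife".
le_prior = np.array([0.80, 0.10, 0.10])

# One iteration: weigh the observations by their language plausibility
# and renormalize the belief over actions.
belief = ve_scores * le_prior
belief /= belief.sum()

for action, b in zip(actions, belief):
    print(f"P({action}) = {b:.3f}")
```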


Action Recognition is Hard

Goal: recognizing various actions involving hand tools.

Challenge: actions are ambiguous given only the trajectories of the hand.

Key idea: use language as a source of prior knowledge that relates tools to the actions performed.


Corpus-Guided Action Recognition

Input: a set of unlabeled videos. The goal is to estimate a model that assigns clusters of videos to the correct actions.

VE: detects tools and action features (noisy).

LE: provides a language model for computing likelihoods of tool-action co-occurrence.

Strategy: EM updates the action assignment model at each iteration.

(Figure: (a) The language model is trained from a large text corpus. (b) Detected tools are queried against the language model. (c) The language model returns a prediction of the action. (d) Action features are compared and beliefs are updated.)


Active Tool Detection

Overview of the tool detection strategy: (1) Optical flow is first computed from the input video frames. (2) We train a CRF segmentation model based on optical flow and skin color. (3) Guided by the flow computations, we segment out hand-like regions (removing faces if necessary) to obtain the moving hand regions, i.e. the active hand that is holding the tool. (4) The active hand region is where the tool is localized. (5) Using the PLS detector, we compute a detection score for the presence of a tool.
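A minimal sketch of steps (1) and (3), assuming OpenCV: dense Farneback optical flow plus a crude HSV skin mask standing in for the paper's trained CRF segmentation model. The thresholds and the input path are illustrative assumptions.

```python
import cv2
import numpy as np

# Sketch of steps (1) and (3): a simple HSV skin mask replaces the
# paper's CRF model; thresholds and the video path are assumptions.

cap = cv2.VideoCapture("sushi_video.avi")  # hypothetical input video
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # (1) Dense optical flow between consecutive frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    motion = np.linalg.norm(flow, axis=2)

    # Crude skin-color mask in HSV space.
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    skin = cv2.inRange(hsv, (0, 40, 60), (25, 180, 255))

    # (3) Moving, skin-colored pixels approximate the active hand
    # region, which is then handed to the tool detector.
    active_hand = (motion > 2.0) & (skin > 0)

    prev_gray = gray

cap.release()
```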


Extracting Action Features

Detected hands are tracked using a Kalman Filter across the entire video sequence.

This yields a 20-dimensional feature vector consisting of Fourier coefficients and normalized averaged velocities.

(Figure: Detected hand trajectories. The x and y coordinates are denoted by the red and blue curves, respectively.)
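A sketch of how such a descriptor could be computed from a tracked trajectory. The slide does not spell out the 20-dimensional layout, so the coefficient count and the velocity normalization below are assumptions.

```python
import numpy as np

# Build a fixed-length descriptor from a tracked hand trajectory:
# low-frequency Fourier magnitudes capture the trajectory's shape,
# and normalized average velocities capture its speed.

def trajectory_features(xy, n_coeffs=4):
    """xy: (T, 2) array of hand positions over T frames."""
    feats = []
    for dim in range(2):                      # x and y curves separately
        signal = xy[:, dim] - xy[:, dim].mean()
        spectrum = np.fft.rfft(signal)
        feats.extend(np.abs(spectrum[1:1 + n_coeffs]))
    vel = np.abs(np.diff(xy, axis=0)).mean(axis=0)
    feats.extend(vel / (vel.sum() + 1e-8))    # normalized average velocity
    return np.array(feats)

# Example: a synthetic circular "stirring" trajectory.
t = np.linspace(0, 4 * np.pi, 120)
print(trajectory_features(np.stack([np.cos(t), np.sin(t)], axis=1)).shape)
```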


Action-Tool Language Model

Computes the conditional probability that an action occurred given that a tool has been detected:

P_L(v_j | n_i) = #(v_j, n_i) / #(n_i)

Uses the Gigaword Corpus, containing 10 years' worth of NYT newswire data.

(Figure: Gigaword co-occurrence matrix of tools and predicted actions.)
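A toy sketch of this estimate: count action-tool co-occurrences within sentences and normalize by the tool's count. The mini-corpus and whitespace tokenization are illustrative stand-ins for Gigaword-scale processing.

```python
from collections import Counter

tools = {"knife", "ladle", "spoon"}
actions = {"cut", "stir", "pour"}

# Tiny illustrative corpus standing in for 10 years of newswire text.
corpus = [
    "she used the knife to cut the cucumber",
    "stir the soup with a ladle",
    "he took the knife and cut the bread",
]

pair_counts, tool_counts = Counter(), Counter()
for sentence in corpus:
    words = set(sentence.split())
    for n in words & tools:
        tool_counts[n] += 1
        for v in words & actions:
            pair_counts[(v, n)] += 1

def p_action_given_tool(v, n):
    """P_L(v | n) = #(v, n) / #(n), as on the slide."""
    return pair_counts[(v, n)] / tool_counts[n] if tool_counts[n] else 0.0

print(p_action_given_tool("cut", "knife"))   # -> 1.0 in this toy corpus
```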


Guiding Action Recognition Via EM

E step: compute the expectation of the assignment variable for action j, tool i, video d:

W_ijd ∝ P_I(i) · P_L(j | i) · P_en(d | j)

M step: update the model parameter C via:

arg max_C ( -∑_{i,j,d} W_ijd ||F_d - C_j||^2 )
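A compact numerical sketch of this loop, with random stand-ins for the tool scores P_I, the language prior P_L, and the per-video features F_d; the Gaussian form chosen for P_en and all shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tools, n_actions, n_videos, dim = 3, 3, 8, 20

P_I = rng.dirichlet(np.ones(n_tools))               # tool detection scores
P_L = rng.dirichlet(np.ones(n_actions), n_tools)    # P_L(j | i), rows sum to 1
F = rng.normal(size=(n_videos, dim))                # per-video action features
C = rng.normal(size=(n_actions, dim))               # action models (centers)

for _ in range(20):
    # E step: W_ijd ∝ P_I(i) * P_L(j | i) * P_en(d | j), with a
    # Gaussian-style fit standing in for P_en.
    dist2 = ((F[:, None, :] - C[None, :, :]) ** 2).sum(-1)   # (d, j)
    P_en = np.exp(-dist2).T                                  # (j, d)
    W = P_I[:, None, None] * P_L[:, :, None] * P_en[None, :, :]
    W /= W.sum()

    # M step: each C_j becomes the W-weighted mean of the features,
    # which maximizes -sum_{i,j,d} W_ijd * ||F_d - C_j||^2.
    Wjd = W.sum(axis=0)                                      # (j, d)
    C = (Wjd @ F) / Wjd.sum(axis=1, keepdims=True)
```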


Experiments and Results

Evaluated over a dataset of 4 humans making sushi: the UMD Sushi-Making Dataset (http://www.umiacs.umd.edu/research/POETICON/umd_sushi/).

2 experiments: (1) Unsupervised Clustering (K-Means vs. EM) and (2) Supervised Classification.

(Figure: (a) Unsupervised recognition accuracy: no language (K-Means) versus language (EM). (b) Classification accuracy: no language versus language. All reported results have variances within ±0.5%.)


Future Work

Integrating the framework into a robotic agent, completing the Cognitive Dialog.

Better language modelling: using NER or shallow parsing methods to obtain better P_L estimates.

Using Kinect as input to reduce viewpoint dependency.

Using attributes for more robust tool and action detection.


Conclusions

When used properly, language can be exploited to solve important problems in vision.

This talk presented a Cognitive Dialog framework for action recognition.

Promising results point to the feasibility of using language in even more Computer Vision problems.


Thank you

Contact information: [email protected]

Questions?