Upload
others
View
6
Download
0
Embed Size (px)
Citation preview
Towards a Watson That Sees:Language-Guided Action Recognition for Robots
Ching L. Teo, Yezhou Yang, Hal Daume III, Cornelia Fermullerand
Yiannis Aloimonos
University of Maryland Institute for Advanced Computer Studies
Apr 9, 2012
Outline
1 Why Vision and Language
2 Action Recognition
3 Future Work
4 Conclusions
Motivation
Computer Vision seeks toachieve perception throughvisual signals.
Current approaches toperception: extract edges,texture, motion, color,surfaces.
What comes next? Felzenszwalb, P. et al, Objectdetection with discriminativelytrained part-based models, PAMI2010
This is not enough: high-levelknowledge is required.
High-level knowledge: semantics,function, attributes, associations.
Enables reasoning from perception.
Approach: use language as aresource of high-level informationto solve problems in ComputerVision. Jia Deng et al, ImageNet: A
Large-Scale Hierarchical ImageDatabase, CVPR 2009
Why Language?
Language provides:
1 A lexicon (dictionary) that encodes contextual relationshipbetween entities: e.g. ladle
occurs−−−−→ kitchen.
2 Prior knowledge: e.g. knifecuts−−→ cucumber.
3 Generalizes to beyond what is “seen” (non-visual): e.g. sun,
beaches, peoplerelates−−−−→ vacation.
Computational Linguistics have provided tools and relationaldatabases that encodes relationships such as: is-a, cause-effect,performs-functions, motivated by etc.
Approach: The Cognitive Dialog
A model of a reasoning process that involves the VisualExecutive (VE) and Language Executive (LE).
VE provides: observations, low-level feature extraction.
LE provides: constraints, plausibility of observations from VE.
Each iteration of the “dialog” seeks to optimize a globalfunction related to the task: e.g. scene/object/actionrecognition, object segmentation.
Action Recognition is Hard
Goal: Recognizing various actionsinvolving hand-tools.
Challenge: ambiguity of actionssimply from the trajectories of thehand alone.
Key idea is to use language assource of prior knowledge thatrelates tools with the actionsperformed.
Corpus-Guided Action Recognition
Input: set of unlabeled videos.Goal is to estimate a model thatassigns clusters of videos to thecorrect actions.
VE: detects tools and actionfeatures (noisy).
LE: provides a language model forcomputing likelihoods oftool-action co-occurrence.
Strategy: EM to update the actionassignment model at each iteration.
(a) Training the language model from alarge text corpus. (b) Detected tools arequeried into the language model. (c)Language model returns prediction ofaction. (d) Action features are comparedand beliefs updated.
Active Tool Detection
Overview of the tool detection strategy: (1) Optical flow is first computedfrom the input video frames. (2) We train a CRF segmentation model basedon optical flow + skin color. (3) Guided by the flow computations, wesegment out hand-like regions (and removed faces if necessary) to obtain thehand regions that are moving (the active hand that is holding the tool). (4)The active hand region is where the tool is localized. Using the PLS detector(5), we compute a detection score for the presence of a tool.
Extracting Action Features
Detected hands are trackedusing a Kalman Filter acrossthe entire video sequence.
Creates a 20-dim featurevector consisting of Fouriercoefficients and normalizedaveraged velocities.
Detected hand trajectories. x and ycoordinates are denoted as red and bluecurves respectively.
Action-Tool Language Model
Computes the conditionalprobability that an actionoccured given that a toolhas been detected:PL(vj |ni ) =
#(vj ,ni )#(ni )
.
Uses the Gigaword Corpuscontaining 10 years worth ofNYT newswire data.
Gigaword co-occurrence matrix of toolsand predicted actions.
Guiding Action Recognition Via EM
E step: compute the expectation of the assignment variablefor action j , tool i , video d : Wijd ∝ PI (i)PL(j |i)Pen(d |j)M step: update the model parameter C via:
arg maxC
(−∑
i ,j ,dWijd ||Fd − Cj ||2)
Experiments and Results
Evaluated over a dataset of4 humans making sushi –UMD-Sushi MakingDataseta.
2 experiments: (1)Unsupervised Clustering(K-Means vs EM) and (2)Supervised Classification.
ahttp://www.umiacs.umd.edu/research/POETICON/umd sushi/
(a) Unsupervised recognition accuracy: nolanguage (K-Means) versus language(EM). (b) Classification accuracy: nolanguage versus language. All reportedresults have variances within ±0.5%.
Future Work
Integrating the framework into arobotic agent – completes theCognitive Dialog.
Better language modelling – usingNER or shallow parsing methods toobtain better PL estimates.
Using Kinect as input to reduceviewpoint dependency.
Using attributes for more robusttool and action detection.
Conclusions
When used properly, language can be exploited to solveimportant problems in vision.
In this talk, used a Cognitive Dialog framework for actionrecognition.
Promising results point to feasibility of using Language ineven more Computer Vision problems.