Computer Science
Vision-Based Retrieval of Dynamic Hand Gestures
Thesis Proposal by
Jonathan Alon
Thesis Committee:
Stan Sclaroff, Margrit Betke, George Kollios,
and Trevor Darrell
Computer Science
Example Application
Computer Science
Isolated Gesture Recognition
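Given: a query gesture Q and a database of gesture examples $M_g$ with class labels $C_g$, $1 \le g \le N$.
Problem: predict the class label $C_Q$ both accurately and efficiently.
[Figure: query Q with unknown label $C_Q = ?$ and database examples $M_1$ to $M_4$ with labels $C_1$ = ‘CAR’, $C_2$ = ‘BUY’, $C_3$ = ‘CAR’, $C_4$ = ‘BUY’.]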
Computer Science
Research Goals
Problem: predict the class label $C_Q$ accurately and efficiently:
1. Accurately: design a distance measure D such that similarity in input space using D => similarity in class space.
2. Efficiently: do better than brute force, which computes $D(Q, M_g)$ for all $g$, $1 \le g \le N$.
Example: a small $D(Q, M_3)$ => $C_Q = C_3$ = ‘CAR’; a large $D(Q, M_4)$ => $C_Q \ne C_4$ = ‘BUY’.
Computer Science
Example Hand Gesture Data
[Figure: example data, “video gestures” and American Sign Language.]
Computer Science
Related Work (ASL Recognition)
Hand segmentation:
  Previous: higher-level recognition models assume perfect segmentation, and methods are either too simple [Starner&Pentland 95, Vogler&Metaxas 99, Yang&Ahuja 02] or too complicated [Cui&Weng 95, Ong&Bowden 04].
  Proposed: a more sophisticated distance measure will allow simple hand segmentation and tolerate more general backgrounds, textured clothes, and hand occlusions.
Vocabulary size:
  Previous (vision-based): tens. Proposed: hundreds.
Data:
  Previous: usually the researcher is the signer [Starner&Pentland 95, Cui&Weng 95].
  Proposed: native signers, fast gesture speeds, more realistic gesture variations.
Computer Science
Proposed methods (1)
1. Accurately: propose a Dynamic Space Time Warping (DSTW) algorithm that can accommodate multiple hypotheses about the hand location in every frame of the query gesture sequence.
DSTW will enable a simple and efficient multiple candidate hand detection algorithm.
Computer Science
Proposed methods (2)
2. Efficiently: use a filtering method, which consists of two steps:
1. Filter step: compute $D'(Q, M_g)$ for all $g$, $1 \le g \le N$, using a fast but approximate distance D’. Retain the P most promising gesture examples.
2. Refine step: compute $D(Q, M_h)$ for $h$, $1 \le h \le P$, using the slow but exact distance D. Predict $C_Q$ from the class labels of the Nearest Neighbors (NN).
Computer Science
Outline
Introduction
  Motivation
  Research Goals
  Related Work
  Proposed Methods
System Overview
  Multiple Candidate Hand Detection
  Feature Extraction and Processing
  Dynamic Space-Time Warping (DSTW)
  Approximate Matching via Prototypes
Feasibility Study
Thesis Roadmap
Conclusion
Computer Science
Isolated Gesture Recognition System Diagram
[Figure: system pipeline.]
Query path: query gesture sequence -> multiple candidate hand detection -> multiple candidate hand subimages -> feature extraction and processing -> query features Q.
Database path: video database of isolated gestures -> database features Mg.
Matching: Filter (approximate matching using D’) -> candidate matches -> Refine (exact matching using D) -> best matches -> browsing -> retrieval results.
Computer Science
Contributions
[Figure: the isolated gesture recognition system diagram, repeated: multiple candidate hand detection, feature extraction and processing, Filter (approximate matching using D’), and Refine (exact matching using D).]
Computer Science
Multiple Candidate Hand Detection (1)
Key observation: the gesturing hand cannot be reliably and unambiguously detected, regardless of the visual features used for detection.
However, the gesturing hand is consistently among the top K candidates identified by, e.g., skin detection (K=15 in this example).
[Figure: input frame and candidate hand regions.]
Computer Science
Multiple Candidate Hand Detection (2)
[Figure: input sequence.]
Computer Science
Feature Extraction (1)
[Figure: input gesture sequence converted to a multi-dimensional time series.]
$M = (M_1, M_2, \ldots, M_i, \ldots, M_m)$, where $M_i = (x_i, y_i, u_i, v_i)^T$.
Computer Science
Feature Extraction (2)
Feature requirements:
  Low resolution hand image => coarse shape features.
  Hand localization is not accurate => use histograms.
Features:
  Position: hand centroid.
  Velocity: optical flow.
  Motion: optical flow direction histograms [Ardizzone and LaCascia 97].
  Texture: edge orientation histograms [Roth&Freeman 95].
  Shape: parameters of ellipse fit to hand [Starner 95].
  Color: used for detection; not useful for recognition.
Computer Science
Dynamic Time Warping (DTW) Recognition
Given a query sequence Q and a database sequence M, DTW computes the optimal alignment (or warping path) W and matching cost D.
However, DTW assumes that a single feature vector (e.g., 2D position of the hand) can be reliably extracted from each query frame.
[Figure: DTW alignment grid for Q and M showing the warping path W, the matching cost D, and the local distances $D_G(M_i, Q_j)$; frame labels 1, 32, 51 on one axis and 1, 50, 80 on the other.]
Computer Science
DTW Math (1): Distance between feature vectors
$M_i$, $Q_j$ are F-dimensional vectors. The distance measure between two feature vectors can be the Euclidean distance:
$D_G(M_i, Q_j) = \| M_i - Q_j \|_2 = \left( \sum_{f=1}^{F} (M_i^f - Q_j^f)^2 \right)^{1/2}$
$D_G$ can be more general, for example a (weighted) $L_p$ norm.
Computer Science
DTW Math (2): Distance between (sub)sequences
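Initialization:
$D_{cum}(0,0) = 0$
$D_{cum}(0,j) = \infty, \quad j = 1, \ldots, n$
$D_{cum}(i,0) = \infty, \quad i = 1, \ldots, m$
Iteration:
$D_{cum}(i,j) = D_G(M_i, Q_j) + \min\{ D_{cum}(i-1,j-1),\ D_{cum}(i-1,j),\ D_{cum}(i,j-1) \}, \quad i = 1, \ldots, m,\ j = 1, \ldots, n$
Termination:
$D_{DTW}(M,Q) = D_{cum}(m,n)$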
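The initialization, iteration, and termination steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the function names `euclidean` and `dtw_distance` and the toy sequences are assumptions, and sequences are represented as plain lists of feature tuples.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(M, Q, dist=euclidean):
    """Cumulative DTW cost between sequences M (length m) and Q (length n)."""
    m, n = len(M), len(Q)
    inf = float("inf")
    # D[i][j] is the best cost of aligning M[:i] with Q[:j].
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0  # initialization; the rest of row 0 and column 0 stays infinite
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            local = dist(M[i - 1], Q[j - 1])
            D[i][j] = local + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[m][n]  # termination

# Toy data (hypothetical): the query repeats one frame of the model.
model = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
query = [(0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (2.0, 2.0)]
print(dtw_distance(model, query))
```

The table has (m+1)(n+1) cells and each cell costs one F-dimensional distance, which matches the O(F·L2) cost quoted later in the deck.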
Computer Science
Dynamic Space-Time Warping (DSTW) Recognition
DSTW can accommodate multiple candidate feature vectors at every time step.
DSTW simultaneously localizes the gesturing hand in every frame of the query sequence and recognizes the gesture.
[Figure: DSTW alignment volume for Q and M with a third axis over the candidates 1, 2, ..., K; warping path W and matching cost D.]
Computer Science
DSTW Math
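Initialization:
$D_{cum}(0,0,k) = 0, \quad k = 1, \ldots, K$
$D_{cum}(0,j,k) = \infty, \quad j = 1, \ldots, n,\ k = 1, \ldots, K$
$D_{cum}(i,0,k) = \infty, \quad i = 1, \ldots, m,\ k = 1, \ldots, K$
Iteration:
$D_{cum}(w_t) = D_G(M_i, Q_{jk}) + \min_{w_{t-1} \in N(w_t)} C(w_{t-1}, w_t), \quad i = 1, \ldots, m,\ j = 1, \ldots, n,\ k = 1, \ldots, K$
where $w_t = (i,j,k)$, $C(w_{t-1}, w_t) = D_{cum}(w_{t-1})$, and
$N(i,j,k) = \{ (i-1,j),\ (i,j-1),\ (i-1,j-1) \} \times \{ 1, \ldots, K \}$
Termination:
$D_{DSTW}(M,Q) = \min_k D_{cum}(m,n,k)$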
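A minimal Python sketch of the DSTW recurrence follows. It assumes that the query is given as a list of frames, each holding K candidate feature vectors, and that the neighborhood of a cell (i, j, k) is the three DTW predecessors paired with every candidate index, as read from this slide. The names `dstw_distance` and `euclidean` and the toy data are illustrative assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dstw_distance(M, Q, dist=euclidean):
    """DSTW cost between model M (m vectors) and query Q (n frames x K candidates)."""
    m, n, K = len(M), len(Q), len(Q[0])
    inf = float("inf")
    # D[i][j][k]: best cost of aligning M[:i] with Q[:j], ending on candidate k.
    D = [[[inf] * K for _ in range(n + 1)] for _ in range(m + 1)]
    D[0][0] = [0.0] * K  # initialization
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Cheapest predecessor over the three DTW moves and all candidates.
            best_prev = min(min(D[i - 1][j]), min(D[i][j - 1]), min(D[i - 1][j - 1]))
            for k in range(K):
                D[i][j][k] = dist(M[i - 1], Q[j - 1][k]) + best_prev
    return min(D[m][n])  # termination: best candidate in the last frame

# Toy data (hypothetical): each query frame has one true hand and one distractor.
model = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
query = [
    [(5.0, 5.0), (0.0, 0.0)],
    [(1.0, 1.0), (9.0, 9.0)],
    [(7.0, 7.0), (2.0, 2.0)],
]
print(dstw_distance(model, query))
```

Because the minimum over predecessors does not depend on k, it is computed once per (i, j) cell, so the sketch does O(K) distance evaluations per cell, consistent with the factor-of-K increase over DTW stated in the deck.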
Computer Science
Translation-Invariance (1)
2.1. The user may gesture in any part of the image.
Solution: Run K separate DSTW processes Pk in parallel
Pk subtracts the position of the kth candidate in the first frame from all candidates in subsequent frames.
Select Pk with the best matching score.
Computer Science
Translation-Invariance (2)
2.2. False matches occur frequently when only the position feature is used. For example, notice how spurious detections on the face in the query sequence falsely match model digit 1.
Solution: include velocity in the feature vector.
[Figure: model digit 1 and query digit 1, shown at frames 1, 24, and 36.]
Computer Science
Translation-Invariance (3)
2.1. The user may gesture in any part of the image.
Solution: Use centroid of face detector’s bounding box.
Computer Science
Scale-Invariance
1. Use an image pyramid.
2. Compare size of face bounding box. (Face detector internally uses image pyramid.)
Computer Science
Complexity
F: number of features
L: average sequence length
K: number of hand candidates

DTW: $O(F \cdot L^2)$
DSTW: $O(K \cdot F \cdot L^2)$
DSTW with translation invariance: $O(K^2 \cdot F \cdot L^2)$
Computer Science
Approximate Distance D’: Motivation
Lipschitz embeddings and BoostMap are embedding methods that represent each object by a vector of distances from the object to a set of d prototypes.
Distances between objects can then be computed efficiently in the embedded space (requiring only O(d) operations).
The same idea can be applied to time series; however, the distance representation loses all information about the alignment.
Computer Science
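Approximate Distance D’: Alignment via Prototypes
[Figure: database sequence M (frames 1 to 7) aligned to prototype R1 (frames 1 to 6).]
$E^{R}(M) = \left( M_1,\ (M_2 + M_3)/2,\ M_3,\ M_4,\ M_5,\ (M_6 + M_7)/2 \right)^T \in \mathbb{R}^{F L \times 1}$
$E(M) = \left( E^{R_1}(M),\ E^{R_2}(M),\ \ldots,\ E^{R_d}(M) \right) \in \mathbb{R}^{d \cdot F \cdot L}$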
Computer Science
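Approximate Distance D’: Alignment via Prototypes
[Figure: database sequence M (frames 1 to 7) and query Q (frames 1 to 5), each aligned to prototype R (frames 1 to 6).]
$E^{R}(M) = (M'_1, \ldots, M'_6) = \left( M_1,\ (M_2 + M_3)/2,\ M_3,\ M_4,\ M_5,\ (M_6 + M_7)/2 \right)$
$E^{R}(Q) = (Q'_1, \ldots, Q'_6) = \left( Q_1,\ Q_2,\ Q_2,\ Q_3,\ Q_3,\ (Q_4 + Q_5)/2 \right)$
$D'(M, Q) = \sum_{l=1}^{L} D_G(M'_l, Q'_l)$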
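The embedding idea on this slide can be sketched as follows: align a sequence to a prototype with DTW, average the frames that land on each prototype frame, and then compare two embedded sequences position by position. This is only a sketch under assumptions: the helper names (`dtw_path`, `embed`, `approx_distance`) and the toy data are hypothetical, ties in the backtracking are broken in favor of the diagonal move, and the distances for several prototypes are simply summed.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_path(X, R, dist=euclidean):
    """Optimal DTW warping path between X and R as 0-based (i, l) index pairs."""
    m, n = len(X), len(R)
    inf = float("inf")
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = dist(X[i - 1], R[j - 1]) + min(
                D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    path, i, j = [], m, n
    while i > 0 and j > 0:  # backtrack from (m, n) to (1, 1)
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1))
    return path[::-1]

def embed(X, R):
    """One vector per prototype frame: the mean of the X frames aligned to it."""
    groups = [[] for _ in R]
    for i, l in dtw_path(X, R):
        groups[l].append(X[i])
    return [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups]

def approx_distance(M, Q, prototypes):
    """D'(M, Q): sum of frame-wise distances between the embedded sequences."""
    return sum(euclidean(a, b)
               for R in prototypes
               for a, b in zip(embed(M, R), embed(Q, R)))

# Toy 1D data (hypothetical).
R = [(0.0,), (3.0,)]
M = [(0.0,), (1.0,), (2.0,), (3.0,)]
Q = [(0.0,), (3.0,)]
print(embed(M, R), approx_distance(M, Q, [R]))
```

In a filter-and-refine setting the database embeddings would be computed once offline, so only the query-to-prototype alignments are paid for at query time.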
Computer Science
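Approximate Distance D’: Alignment via Prototypes
[Figure: the alignments of M (frames 1 to 7) and Q (frames 1 to 5) to prototype R (frames 1 to 6) induce an approximate alignment between M and Q.]
$D'(M, Q) \approx D(M, Q)$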
Computer Science
Justifying the Approximation
Why does it work? Two properties:
1. If the query and prototype are identical (R = Q), then the approximate distance and the exact distance are identical:
   $D'(E^{R}(M), E^{R}(Q)) = D'(E^{Q}(M), E^{Q}(Q)) = D'(E^{Q}(M), Q) = D(M, Q)$
2. If the query and database object are identical (Q = M), then the approximate distance is 0, and the database object will be retrieved as Nearest Neighbor:
   $D'(E^{R}(M), E^{R}(Q)) = D'(E^{R}(M), E^{R}(M)) = 0$
3. More information…
Computer Science
Prototype Selection
Approach: Sequential Forward Search (SFS):
1. Select the first prototype R1 that minimizes classification error.
2. For i=2 to d do
   Select the next prototype Ri that, together with the set of prototypes selected so far {R1,…,Ri-1}, gives the lowest classification error.
Can do Sequential Backward Search (SBS) by removing the worst prototype at every step.
Can give weights to individual prototypes or individual features.
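The greedy forward search can be sketched generically. In this sketch `error_fn` is a hypothetical callable that returns the classification error obtained with a given list of prototypes (for example, nearest-neighbor error on a validation set); it is replaced here by a toy function so the example runs on its own.

```python
def sequential_forward_search(candidates, d, error_fn):
    """Greedily pick up to d prototypes, each minimizing error_fn(selected + [r])."""
    selected = []
    remaining = list(candidates)
    for _ in range(min(d, len(remaining))):
        best = min(remaining, key=lambda r: error_fn(selected + [r]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy stand-in for classification error (hypothetical): distance of the sum to 10.
toy_error = lambda prototypes: abs(10 - sum(prototypes))
print(sequential_forward_search([1, 4, 6, 9], d=2, error_fn=toy_error))
```

Each of the d rounds evaluates `error_fn` once per remaining candidate, so the search is greedy rather than exhaustive and does not guarantee the best size-d subset.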
Computer Science
Filter and Refine
Offline:
0. Select prototypes Ri.
1. Embed all database gestures E(Mg).
Online:
1. Embed query E(Q).
2. Filter: compute approximate distance D’(Q,Mg) between query and all database gestures in the embedded space.
3. Retain P NN as candidate matches.
4. Refine: rerank P candidates based on the exact distance D.
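The online steps can be sketched as one small function. The distance callables, the parameter `k` for the final vote, and the toy numbers are all assumptions made for illustration; in the proposed system the approximate distance would be D’ in the embedded space and the exact distance would be DTW or DSTW.

```python
from collections import Counter

def filter_and_refine(query, database, labels, approx_dist, exact_dist, P, k=1):
    """Return (predicted label, reranked candidate indices) for one query."""
    # Filter: rank the whole database by the cheap approximate distance, keep P.
    order = sorted(range(len(database)), key=lambda g: approx_dist(query, database[g]))
    candidates = order[:P]
    # Refine: rerank only the P candidates by the expensive exact distance.
    candidates.sort(key=lambda g: exact_dist(query, database[g]))
    # Predict from the labels of the k nearest candidates.
    votes = Counter(labels[g] for g in candidates[:k])
    return votes.most_common(1)[0][0], candidates

# Toy 1D example (hypothetical): the approximate distance compares rounded values.
database = [1.0, 5.2, 4.9, 9.0]
labels = ["A", "B", "C", "D"]
approx = lambda q, x: abs(round(q) - round(x))
exact = lambda q, x: abs(q - x)
print(filter_and_refine(5.0, database, labels, approx, exact, P=2))
```

The exact distance is evaluated only P times per query, which is where the savings over brute force come from.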
Computer Science
Complexity
Example parameter values:
F = 3: number of features
L = 50: average sequence length
N = 10,000: number of database sequences
d = 10: number of prototypes
P = 10: number of retrieved database sequences

Brute force: $O(N \cdot F \cdot L^2)$. Compute N exact $D_{DTW}$ distances.
Filter step: $O(d \cdot F \cdot L^2 + N \cdot d \cdot F \cdot L)$. Compute d exact $W_{DTW}$ alignments, then N approximate $D'_{DTW}$ distances.
Refine step: $O(P \cdot F \cdot L^2)$. Compute P exact $D_{DTW}$ distances.
Dividing through by $F \cdot L^2$, filter-and-refine is cheaper than brute force when $N > d + N \cdot d / L + P$.
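Plugging the example values from this slide into the three cost expressions gives a rough operation count. This is back-of-the-envelope arithmetic that ignores constant factors, not a measured running time.

```python
F, L, N, d, P = 3, 50, 10_000, 10, 10

brute_force = N * F * L**2                   # N exact DTW distances
filter_step = d * F * L**2 + N * d * F * L   # d alignments + N approximate distances
refine_step = P * F * L**2                   # P exact DTW distances
total = filter_step + refine_step

print(brute_force, total, round(brute_force / total, 2))
print(N > d + N * d / L + P)  # condition for filter-and-refine to be cheaper
```

With these values the second term of the filter step (N·d·F·L) accounts for almost all of the filter-and-refine cost.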
Computer Science
Reducing Complexity
Filter step = $O(d \cdot F \cdot L^2 + N \cdot d \cdot F \cdot L)$
The second term is expensive. This is a well-known NN shortcoming.
Proposed solutions:
1. Feature selection: reduce the number of features, d·F·L.
2. Condensing: reduce the number of objects, N.
Computer Science
Feasibility Study
1. Exact distance $D_{DSTW}$
   Application: recognition of “video digits”.
   Compare DTW vs. DSTW accuracy.
   Verify that translation-invariance works.
   What is the right K? Use cross-validation.
2. Approximate distance $D'_{DTW}$
   Application: recognition of UNIPEN digits.
   Measure the accuracy vs. time tradeoff of approximate DTW vs. BoostMap and CSDTW.
   Recognition of NIST digits, using approximate shape context distance.
Computer Science
Video Digit Recognition Experiment
3 users, 10 digits, 3 examples per digit.
DSTW without translation invariance.
Features: position and velocity (x,y,u,v).
Performance measure: classification accuracy (%).
Result: 11.1%-21.1% increase in classification accuracy.
Computer Science
UNIPEN Digit Recognition Experiment
15,953 digit samples.
Features: position and angle (x,y,theta).
Performance measure: classification error (%) vs. number of exact distance computations.
Comparing the query against the entire database gives 1.90% error using 10,630 exact $D_{DTW}$ computations.
CSDTW gives 2.90% error using 150 $D_{DTW}$ computations.
At a test error of 2.80%, the method is about twice as fast as BoostMap and about ten times as fast as CSDTW.
Computer Science
Conclusions: DSTW
Pros:
  Hand detection is not merely a bottom-up procedure.
  Recognition can be achieved even in the presence of multiple “distractors”, and overlaps between the gesturing hand and the face or the other hand.
  Recognition is translation-invariant.
  For real-time performance, hand detection can afford to use more efficient features with higher false positive rates, and rely on DSTW’s capability to handle multiple candidates to reject many false detections.
  DSTW provides a general method for matching time series that can accommodate multiple candidate feature vectors at each time step.
Cons:
  Space and time complexity increase by a factor of K for translation-dependent recognition, and by a factor of $K^2$ for translation-invariant recognition.
Computer Science
Conclusions: Approximate Alignment via Prototypes
Pros:
  Approximate alignment via prototypes is fast.
  Approximate alignment via prototypes provides a general method for efficiently approximating distance measures that are based on expensive alignment methods (e.g., shape context distance).
  The number of points in the two objects does not have to be equal.
  The more expensive the exact alignment method, the greater the benefit from approximation.
Cons:
  Cannot guarantee the absence of false dismissals in the filter step.
  Every point in one object has to be matched with at least one point from the other object. That excludes approximating the Longest Common Subsequence (LCS) similarity measure.
Computer Science
Gesture Spotting
Computer Science
Isolated Gesture Recognition vs. Gesture Spotting
[Figure: whole matching (query Q against isolated database examples M1 to M4) vs. subsequence matching (query Q against a long database sequence M).]
Computer Science
Gesture Spotting: Research Agenda
Indirect temporal segmentation (segmentation by recognition):
  Implement brute-force search using a sliding window.
  Now we do not know the hand locations in the database sequence M. Extend DSTW to include a 4th spatial axis.
  Alternatively, assume a cooperative user who marks hand locations in the query.
Direct temporal segmentation: are there hand motion features that can predict gesture boundaries?
How to combine gesture boundary estimates from the direct and indirect approaches?
Computer Science
Thesis Roadmap
Data collection and annotation:
  Isolated gesture recognition.
  Gesture spotting.
Algorithms:
  Hand features.
  Approximate DSTW, or alternative indexing method(s).
  Temporal segmentation.
Implement demos.
Computer Science
Thank You!
Computer Science
Example Model Digits
Computer Science
Example Correct Match
Computer Science
Digit Recognition Experiment
3 users.
Database models:
  3 examples per digit per user.
  User wears a colored glove.
  Color detection finds a single correct hand region.
Queries:
  3 examples per digit per user.
  User wears a shirt with long sleeves in one experiment, and short sleeves in another.
  Skin detection generates 15 candidate hand regions.
Features: 2D position (x,y) and 2D velocity (u,v).
[Figure: example model digits.]
Computer Science
Results
For translation-invariant recognition, the inclusion of velocity in the feature vector is essential, and improves classification rates by about 22 and 10 percentage points for user-dep. and user-indep. recognition respectively (73.3% to 95.6% and 64.4% to 74.4%).
User-indep. results are perhaps not satisfactory for real HCI applications, but user-dependent results are, and user-dependent recognition is desirable in many real HCI applications.

Experiment   User-dep. classification accuracy (%)   User-indep. classification accuracy (%)
LS-TD-P      96.7                                     85.6
SS-TI-P      73.3                                     64.4
SS-TI-PV     95.6                                     74.4

(LS: Long Sleeves, SS: Short Sleeves. TD: Translation Dependent, TI: Translation Invariant. P: Position, PV: Position and Velocity.)
Computer Science
Problem 2: Translation-Invariant Recognition
Goal: maintain recognition rates even when the gesture is globally translated, i.e., signed in any part of the image.
Solution: given the K candidate regions detected in the first frame:
1. Run K separate DSTW processes Pk in parallel. Pk assumes that k was the correct candidate in the first frame, and subtracts the position of the kth candidate in the first frame from all candidates in subsequent frames.
2. Select Pk with the best matching score.
Problem: many false matches occur when only the position feature is used.
Computer Science
Recognition Framework cont’d
DTW:
$D(i,j) = d(i,j) + \min\{ D(i,j-1),\ D(i-1,j-1),\ D(i-1,j) \}$, where
  $d(i,j)$: Euclidean distance between features $M_i$ and $Q_j$,
  $M_i = (x_i, y_i, u_i, v_i)$: model 2D position and velocity feature,
  $Q_j = (x_j, y_j, u_j, v_j)$: query 2D position and velocity feature,
  $D(i,j)$: cumulative distance between subsequences $M_{1:i}$ and $Q_{1:j}$,
  $W^* = (w_1, \ldots, w_L)$: optimal warping path, and
  $D^* = D(m,n)$: optimal matching score.
DSTW:
$D(w) = d(w) + \min_{w' \in N(w)} \{ D(w') + \tau(w', w) \}$, where
  $N(w)$: neighbors of $w = (i,j,k)$,
  $\tau(w', w)$: transition cost, and
  $Q_{jk} = (x_{jk}, y_{jk}, u_{jk}, v_{jk})$: query feature of frame j and candidate k.
Computer Science
Problem 2: Translation-Invariant Recognition
2.2. False matches occur frequently when only the position feature is used. For example, notice how the elbow in query digit 3 is falsely matched with the bottom part of the digit 7.
Solution: include velocity in the feature vector.
[Figure: model digit 7 and query digit 3, shown at frames 1, 45, and 85.]
Computer Science
Multi-dimensional time series examples
[Figure: “video gestures”, American Sign Language, and cursive handwriting.]
Computer Science
Conclusions & Future Work
Conclusions
+ DSTW is a general framework for matching time series that can accommodate multiple (K) candidate feature vectors at each time step.
+ Translation-invariance is incorporated in the framework.
- Space and time complexity increase by a factor of K for translation-dependent recognition, and $K^2$ for translation-invariant recognition.
Future Work
  Dynamic feature selection.
  Gesture verification.
  Temporal segmentation.
Computer Science
Problem Statement (2)
Gesture Spotting Problem: given a long image sequence of gestures M (the database), a gesture query sequence Q, a distance measure D, and a distance tolerance ε, find those data subsequences x ⊆ M which satisfy D(x,Q) ≤ ε.
M can be an ASL story.
Q can be:
  An ASL sign (e.g., “CAR”)
  Finger spelling (e.g., “John”)
  Any hand motion between signs (motion epenthesis)
D will be the Dynamic Time Warping (DTW) distance or a variant of it.
Computer Science
Problem Statement (1)
Visual ASL Dictionary Problem: given a database (dictionary) of gesture image sequences Mi, a sign query sequence Q, a distance measure D, and a distance tolerance ε, find those data exemplars Mj which satisfy D(Mj,Q) ≤ ε.
Q is a sign performed by a novice ASL student in front of a camera.
Mi are examples of isolated signs.
Computer Science
Problem Statement (1)
Visual ASL Dictionary Problem: given a database (dictionary) of gesture image sequences Mi, a sign query sequence Q, a distance measure D, and a distance tolerance ε, find those data exemplars Mj which satisfy D(Mj,Q) ≤ ε.
Application assumptions:
  In producing Q, the ASL student may be cooperative.
  Examples Mi can be collected with any constraints that would improve the task performance. For example:
    Colored gloves
    Slow gestures
Computer Science
Problem Statement (1)
Visual ASL Dictionary Problem: given a database (dictionary) of gesture image sequences Mi, a sign query sequence Q, a distance measure D, and a distance tolerance ε, find those data exemplars Mj which satisfy D(Mj,Q) ≤ ε.
Search alternatives:
  Search for neighbors in ε-ball.
  Search for k Nearest Neighbors (kNN).
  Rank the entire database.
Computer Science
Problem Statement (1)
Visual ASL Dictionary Problem: given a database (dictionary) of gesture image sequences Mi, a sign query sequence Q, a distance measure D, and a distance tolerance ε, find those data exemplars Mj which satisfy D(Mj,Q) ≤ ε.
Real goal:
  Assume:
    Class labels are known, e.g., C(Mi) = “CAR”.
    Mj are sorted in ascending order based on D.
  We want: Sign(M1) = Sign(Q), or Sign(Mj) = Sign(Q) for as many examples Mj as possible with small enough j.
  We want: similarity in input (feature) space => similarity in class space.
Computer Science
Outline
Introduction
  Assumptions
  Challenges
Formal Problem Statement
  ASL Dictionary Problem
  Gesture Spotting
System Overview
  Multiple Candidate Hand Detection
  Feature Extraction and Processing
  Dynamic Space-Time Warping (DSTW)
  Approximate Matching via Prototypes
  Temporal Segmentation
Feasibility Study
Related Work
Schedule
Conclusion
Computer Science
Research Goals
Problem: predict the class label $C_Q$ accurately and efficiently:
1. Accurately: design a distance measure D such that similarity in input space using D => similarity in class space.
2. Efficiently: do better than brute force, which computes the exact distance between the query gesture and all database gesture examples ($D(Q, M_i)$ for all i).
Example: a small $D(Q, M_3)$ => $C_Q = C_3$ = ‘CAR’; a large $D(Q, M_4)$ => $C_Q \ne C_4$ = ‘BUY’.
Computer Science
Proposed methods
1. Accurately: propose a Dynamic Space-Time Warping (DSTW) algorithm that can accommodate multiple hypotheses about the hand location in every frame of the query gesture sequence.
   DSTW will enable a simple and efficient multiple candidate hand detection algorithm.
2. Efficiently: use a filtering method, which consists of two steps:
   1. Filter step: compute $D'(Q, M_i)$ for all $i$, $1 \le i \le N$, using a fast but approximate distance D’. Retain the P most promising gesture examples.
   2. Refine step: compute $D(Q, M_j)$ for $j$, $1 \le j \le P$, using the slow but exact distance D. Predict $C_Q$ from the class labels of the Nearest Neighbors (NN).
Computer Science
Feature Extraction (1)
Show image with (x,y,u,v) Or image with (x,y,theta)
Computer Science
Assumptions
Sensor: single color camera.
Background: not necessarily uniform.
Viewing condition: frontal upper body view.
Foreground: single gesturer. Objects of interest: hands.
Static camera and static gesturer.
Lighting: constant or slowly varying.
Computer Science
Challenges
Geometric variation: translation and scale.
User (signer) independence:
  Body kinematics: shape and size of different body parts.
  Style: speed, emphasis.
Different gesture durations.
Textured clothes.
Native signers and high gesture speeds.
Hand occlusion and self-occlusion.
Difficult sign types: repetitions, agentive forms, location- and context-dependent signs.
Computer Science
Contributions
[Figure: system diagram. Query gesture sequence -> multiple candidate hand detection -> multiple candidate hand subimages -> feature extraction and processing -> query features; video database of isolated gestures -> database features; approximate matching (filter) -> candidate matches -> exact matching (refine) -> best matches -> browsing -> retrieval results.]
Computer Science
ASL Dictionary System Diagram
[Figure: system diagram. Query gesture sequence -> multiple candidate hand detection -> multiple candidate hand subimages -> feature extraction and processing -> query features; video database of isolated gestures -> database features; approximate matching (filter) -> candidate matches -> exact matching (refine) -> best matches -> browsing -> retrieval results.]
Computer Science
DTW Math
$M_i$, $Q_j$ are F-dimensional vectors. $w_f$ are weights. The distance measure between two feature vectors is a weighted $L_p$ norm:
$D_G(M_i, Q_j) = \left( \sum_{f=1}^{F} w_f \, | M_i^f - Q_j^f |^p \right)^{1/p}$
For example, if $w_f = 1$ and $p = 2$ we get the Euclidean distance.
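A short sketch of the weighted Lp distance between two feature vectors; the function name `weighted_lp` and its default arguments are assumptions made for illustration.

```python
def weighted_lp(a, b, weights=None, p=2.0):
    """Weighted L_p distance between two equal-length feature vectors."""
    if weights is None:
        weights = [1.0] * len(a)  # unit weights
    return sum(w * abs(x - y) ** p for w, x, y in zip(weights, a, b)) ** (1.0 / p)

print(weighted_lp((0.0, 0.0), (3.0, 4.0)))         # unit weights, p = 2: Euclidean
print(weighted_lp((0.0, 0.0), (3.0, 4.0), p=1.0))  # unit weights, p = 1
```

Non-unit weights could be used to emphasize, say, velocity over position, which is one way to realize the per-feature weighting mentioned in the prototype selection slides.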
Computer Science
Justifying the Approximation
We want a contractive embedding:
$D'(M, Q) \le D_{DTW}(M, Q), \quad \forall M, Q$
Why? Because then we can filter unlikely matches and guarantee no false dismissals.
But we could not prove it.
Computer Science
Prototype Selection
Approach: Sequential Forward Search
  Select the first prototype R1 that minimizes classification error.
  For i=2 to d do
    Select the next prototype Ri that, together with the set of prototypes selected so far {R1,…,Ri-1}, gives the lowest classification error.
Can do backward search too by removing the worst prototype at every step.
Can give weights to individual prototypes or individual features.
Computer Science
Complexity
F: number of features
L: average sequence length
K: number of hand candidates
N: number of database objects
d: number of prototypes
$O(d F L^2 + N d F L) = O(d F L (L + N))$
Example: in the UNIPEN digit dataset, L = 50, N = 10,000.
The dominance of the second term is due to the NN shortcoming.
Approach:
  Feature selection to reduce the number of features, dFL.
  Condensing to reduce the number of objects, N.
Computer Science
Feature Extraction (1)
[Figure: input gesture sequence converted to 1D time series and to a multi-dimensional time series.]
$M = (M_1, M_2, \ldots, M_i, \ldots, M_m)$, where $M_i = (x_i, y_i, u_i, v_i)^T$.
Computer Science
Approximate Distance D’: Alignment via Prototype
[Figure: sequence M (frames 1 to 7) aligned to prototype R.]
$F(M) = \left( M_1,\ (M_2 + M_3)/2,\ M_3,\ M_4,\ M_5,\ (M_6 + M_7)/2 \right)^T$
Computer Science
Alignment via Prototype
$F(M) = (M'_1, M'_2) = (f(M_1, M_2),\ f(M_3, M_4))$
$F(Q) = (Q'_1, Q'_2) = (f(Q_1, Q_2),\ f(Q_3))$
$D'(F(M), F(Q)) = \sum_{j=1}^{2} D_G(M'_j, Q'_j)$
For d prototypes, concatenate the d vectors.
Computer Science
Conclusions
Pros:
  Hand detection is not merely a bottom-up procedure. The gesture model is used to select hand locations in a way that the query-to-model matching cost is optimized.
  Recognition can be achieved even in the presence of multiple “distractors”, like moving objects, or skin-colored objects (e.g., face, non-gesturing hand, background objects).
  Recognition is robust to overlaps between the gesturing hand and the face or the other hand.
  Recognition is translation-invariant; the gesture can occur in any part of the image.
  For real-time performance, hand detection can afford to use more efficient features with higher false positive rates, and rely on DSTW’s capability to handle multiple candidates to reject many false detections.
  DSTW provides a general method for matching time series that can accommodate multiple candidate feature vectors at each time step.
Cons:
  Space and time complexity increase by a factor of K for translation-dependent recognition, and by a factor of $K^2$ for translation-invariant recognition.
Computer Science
Filter: Approximate Distance D’
Offline:
0. Select prototypes Ri.
1. Compute correspondence between database sequences Mg and prototypes Ri using exact alignment W(Mg,Ri).
2. Use alignments to embed database sequences F(Mg).
Online:
1. Compute correspondence between query sequence Q and prototypes Ri using exact alignment W(Q,Ri).
2. Use alignment to embed the query sequence F(Q). This induces an approximate alignment WR(Q,Mg) between the query and any database sequence.
3. Use approximate alignment WR(Q,Mg) to compute approximate distance D’(Q,Mg) in the embedded space.