Page 1: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Vision-Based Retrieval of Dynamic Hand Gestures

Thesis Proposal by

Jonathan Alon

Thesis Committee:

Stan Sclaroff, Margrit Betke, George Kollios,

and Trevor Darrell

Page 2: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Example Application

Page 3: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Isolated Gesture Recognition

Given: a query gesture Q, and a database of gesture examples Mg with class labels Cg, 1 ≤ g ≤ N (in the illustration: C1 = ‘CAR’, C2 = ‘BUY’, C3 = ‘CAR’, C4 = ‘BUY’; CQ = ?).

Problem: predict the class label CQ both accurately and efficiently.

Page 4: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Research Goals

Problem: Predict the class label CQ accurately and efficiently:

1. Accurately: design a distance measure D such that similarity in input space under D => similarity in class space.

2. Efficiently: better than brute force, which computes D(Q,Mg) for all g, 1 ≤ g ≤ N.

A small D(Q, M3) => CQ = C3 = ‘CAR’

A large D(Q, M4) => CQ ≠ C4 = ‘BUY’

Page 5: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Example Hand Gesture Data

“Video Gestures”: American Sign Language

Page 6: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Related Work (ASL Recognition)

Hand segmentation:
Previous: higher-level recognition models assume perfect segmentation, and methods are either too simple [Starner&Pentland 95, Vogler&Metaxas 99, Yang&Ahuja 02] or too complicated [Cui&Weng 95, Ong&Bowden 04].
Proposed: a more sophisticated distance measure will enable simple hand segmentation, and will allow more general backgrounds, textured clothes, and hand occlusions.

Vocabulary size:
Previous (vision-based): tens. Proposed: hundreds.

Data:
Previous: usually the researcher is the signer [Starner&Pentland 95, Cui&Weng 95]. Proposed: native signers, fast gesture speeds, and more realistic gesture variations.

Page 7: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Proposed methods (1)

1. Accurately: propose a Dynamic Space-Time Warping (DSTW) algorithm that can accommodate multiple hypotheses about the hand location in every frame of the query gesture sequence.

DSTW will enable a simple and efficient multiple candidate hand detection algorithm.

Page 8: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Proposed methods (2)

2. Efficiently: use a filtering method, which consists of two steps:

1. Filter step: compute D’(Q,Mg) for all g, 1 ≤ g ≤ N, based on a fast but approximate distance D’. Retain the P most promising gesture examples.

2. Refine step: compute D(Q,Mh) for h, 1 ≤ h ≤ P, based on the slow but exact distance D. Predict CQ based on the class labels of the Nearest Neighbors (NN).

Page 9: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Outline

Introduction: Motivation, Research Goals, Related Work, Proposed Methods

System Overview: Multiple Candidate Hand Detection, Feature Extraction and Processing, Dynamic Space-Time Warping (DSTW), Approximate Matching via Prototypes

Feasibility Study, Thesis Roadmap, Conclusion

Page 10: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Isolated Gesture Recognition: System Diagram

[System diagram] query gesture sequence → multiple candidate hand detection → multiple candidate hand subimages → feature extraction and processing → query features Q → Filter: approximate matching using D’ → candidate matches → Refine: exact matching using D → best matches → browsing → retrieval results; the video database of isolated gestures supplies the database features Mg to the filter and refine steps.

Page 11: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Contributions

[System diagram] query gesture sequence → multiple candidate hand detection → multiple candidate hand subimages → feature extraction and processing → query features Q → Filter: approximate matching using D’ → candidate matches → Refine: exact matching using D → best matches → browsing → retrieval results; the video database of isolated gestures supplies the database features Mg to the filter and refine steps.

Page 12: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

System Diagram

[System diagram] query gesture sequence → multiple candidate hand detection → multiple candidate hand subimages → feature extraction and processing → query features Q → Filter: approximate matching using D’ → candidate matches → Refine: exact matching using D → best matches → browsing → retrieval results; the video database of isolated gestures supplies the database features Mg to the filter and refine steps.

Page 13: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Multiple CandidateHand Detection (1)

Key observation: the gesturing hand cannot be reliably and unambiguously detected, regardless of the visual features used for detection.

However, the gesturing hand is consistently among the top K candidates identified by e.g., skin detection (K=15 in this example).

[Figure panels: Input Frame; Candidate Hand Regions]
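The slides do not specify how the detector is implemented; the following is a minimal sketch of one way to obtain the top K skin-colored candidate regions per frame, assuming OpenCV is available. The YCrCb thresholds and the helper name top_k_skin_candidates are illustrative assumptions, not the detector used in the thesis.

```python
import cv2
import numpy as np

def top_k_skin_candidates(frame_bgr, k=15):
    """Return centroids and bounding boxes of the K largest skin-colored blobs."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    # Rough skin range in YCrCb; a real system would use a trained skin-color model.
    mask = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))
    num, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    # Skip label 0 (background); keep the k largest components by area.
    order = np.argsort(stats[1:, cv2.CC_STAT_AREA])[::-1][:k] + 1
    return [(tuple(centroids[i]), tuple(stats[i, :4])) for i in order]
```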

Page 14: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Multiple CandidateHand Detection (2)

Input Sequence

Page 15: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Isolated Gesture Recognition: System Diagram

[System diagram] query gesture sequence → multiple candidate hand detection → multiple candidate hand subimages → feature extraction and processing → query features Q → Filter: approximate matching using D’ → candidate matches → Refine: exact matching using D → best matches → browsing → retrieval results; the video database of isolated gestures supplies the database features Mg to the filter and refine steps.

Page 16: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Feature Extraction (1)

Multi-dimensional time series

Input Gesture Sequence

Each frame i contributes a feature vector $M_i = (x_i, y_i, u_i, v_i)$ (hand position and velocity), and the gesture is the sequence $M = (M_1, M_2, \ldots, M_i, \ldots, M_m)$.
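As a small illustration of this representation, the sketch below builds the (m, 4) time series from per-frame hand centroids, using frame-to-frame differences as a stand-in for optical-flow velocity; this is a simplified assumption (one candidate per frame), not the thesis feature pipeline.

```python
import numpy as np

def gesture_time_series(centroids):
    """centroids: (m, 2) array of hand positions, one per frame -> (m, 4) series."""
    pos = np.asarray(centroids, dtype=float)
    vel = np.diff(pos, axis=0, prepend=pos[:1])  # simple frame-to-frame velocity (u, v)
    return np.hstack([pos, vel])                 # columns: x, y, u, v
```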

Page 17: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Feature Extraction (2)

Feature requirements:
Low-resolution hand image => coarse shape features.
Hand localization is not accurate => use histograms.

Features:
Position: hand centroid.
Velocity: optical flow.
Motion: optical flow direction histograms [Ardizzone and LaCascia 97].
Texture: edge orientation histograms [Roth&Freeman 95].
Shape: parameters of an ellipse fit to the hand [Starner 95].
Color: used for detection; not useful for recognition.

Page 18: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

System Diagram

[System diagram] query gesture sequence → multiple candidate hand detection → multiple candidate hand subimages → feature extraction and processing → query features Q → Filter: approximate matching using D’ → candidate matches → Refine: exact matching using D → best matches → browsing → retrieval results; the video database of isolated gestures supplies the database features Mg to the filter and refine steps.

Page 19: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Dynamic Time Warping (DTW) Recognition

Given a query sequence Q and a database sequence M, DTW computes the optimal alignment (or warping path) W and matching cost D.

However, DTW assumes that a single feature vector (e.g., 2D position of the hand) can be reliably extracted from each query frame.

[Figure: warping path W aligning query Q (frames 1, 32, 51) with model M (frames 1, 50, 80); each link carries the local cost DG(Mi, Qj) and the full path yields the matching cost D.]

Page 20: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

DTW Math (1): Distance between feature vectors

Mi, Qj are F-dimensional vectors. The distance measure between two feature vectors can be the Euclidean distance:

$D_G(M_i, Q_j) = \left( \sum_{f=1}^{F} (M_i^f - Q_j^f)^2 \right)^{1/2}$

DG can be more general, for example a (weighted) Lp norm.

Page 21: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

DTW Math (2): Distance between (sub)sequences

Initialization:
$D_{cum}(0,0) = 0$
$D_{cum}(0,j) = \infty, \quad j = 1, \ldots, n$
$D_{cum}(i,0) = \infty, \quad i = 1, \ldots, m$

Iteration (for $i = 1, \ldots, m$ and $j = 1, \ldots, n$):
$D_{cum}(i,j) = D_G(M_i, Q_j) + \min\{D_{cum}(i-1,j-1),\; D_{cum}(i-1,j),\; D_{cum}(i,j-1)\}$

Termination:
$D_{DTW}(M,Q) = D_{cum}(m,n)$
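A minimal sketch of the recurrences above, assuming the Euclidean local distance D_G and sequences stored as row-wise NumPy arrays:

```python
import numpy as np

def dtw_distance(M, Q):
    """M: (m, F) model sequence, Q: (n, F) query sequence (NumPy arrays)."""
    m, n = len(M), len(Q)
    D = np.full((m + 1, n + 1), np.inf)   # D_cum with a padded 0-th row/column
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(M[i - 1] - Q[j - 1])  # D_G(M_i, Q_j)
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[m, n]                        # D_DTW(M, Q) = D_cum(m, n)
```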

Page 22: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Dynamic Space-Time Warping (DSTW) Recognition

DSTW can accommodate multiple candidate feature vectors at every time step.

DSTW simultaneously localizes the gesturing hand in every frame of the query sequence and recognizes the gesture.

[Figure: DSTW warping path W between model M and query Q, where each query frame contributes K candidate feature vectors (k = 1, 2, …, K); the path yields the matching cost D.]

Page 23: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

DSTW Math

Initialization:
$D_{cum}(0,0,k) = 0, \quad k = 1, \ldots, K$
$D_{cum}(0,j,k) = \infty, \quad j = 1, \ldots, n, \; k = 1, \ldots, K$
$D_{cum}(i,0,k) = \infty, \quad i = 1, \ldots, m, \; k = 1, \ldots, K$

Iteration (for $i = 1, \ldots, m$, $j = 1, \ldots, n$, $k = 1, \ldots, K$, with $w_t = (i, j, k)$):
$D_{cum}(w_t) = D_G(M_i, Q_{jk}) + \min_{w_{t-1} \in N(w_t)} \{ D_{cum}(w_{t-1}) + C(w_{t-1}, w_t) \}$

where $N(i,j,k) = \{(i-1, j, k'),\ (i, j-1, k'),\ (i-1, j-1, k') : k' = 1, \ldots, K\}$ is the set of neighbors of $(i, j, k)$ and $C(w_{t-1}, w_t)$ is a transition cost.

Termination:
$D_{DSTW}(M,Q) = \min_{k} D_{cum}(m, n, k)$
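A minimal sketch of the DSTW recurrences above, under the simplifying assumption of a zero transition cost C and a Euclidean local distance; Q[j] holds the K candidate feature vectors of query frame j:

```python
import numpy as np

def dstw_distance(M, Q):
    """M: (m, F) model sequence; Q: (n, K, F) candidate features per query frame."""
    m, n, K = len(M), Q.shape[0], Q.shape[1]
    D = np.full((m + 1, n + 1, K), np.inf)
    D[0, 0, :] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # best predecessor over the three neighboring cells and all candidates k'
            prev = min(D[i - 1, j - 1].min(), D[i - 1, j].min(), D[i, j - 1].min())
            for k in range(K):
                cost = np.linalg.norm(M[i - 1] - Q[j - 1, k])  # D_G(M_i, Q_jk)
                D[i, j, k] = cost + prev
    return D[m, n].min()                  # min over k of D_cum(m, n, k)
```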

Page 24: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Translation-Invariance (1)

2.1. The user may gesture in any part of the image.

Solution: Run K separate DSTW processes Pk in parallel

Pk subtracts the position of the kth candidate in the first frame from all candidates in subsequent frames.

Select Pk with the best matching score.

Page 25: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Translation-Invariance (2)

2.2. False matches occur frequently when only the position feature is used. For example, notice how spurious detections on the face in the query sequence falsely match model digit 1.

Solution: include velocity in the feature vector.

[Figure: query digit 1 vs. model digit 1, frames 1, 24, 36.]

Page 26: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Translation-Invariance (3)

2.1. The user may gesture in any part of the image.

Solution: Use centroid of face detector’s bounding box.

Page 27: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Scale-Invariance

1. Use an image pyramid.
2. Compare the size of the face bounding box (the face detector internally uses an image pyramid).

Page 28: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Complexity

F – number of features
L – average sequence length
K – number of hand candidates

DTW: O(F·L²)
DSTW: O(K·F·L²)
DSTW with translation invariance: O(K²·F·L²)

Page 29: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

System Diagram

[System diagram] query gesture sequence → multiple candidate hand detection → multiple candidate hand subimages → feature extraction and processing → query features Q → Filter: approximate matching using D’ → candidate matches → Refine: exact matching using D → best matches → browsing → retrieval results; the video database of isolated gestures supplies the database features Mg to the filter and refine steps.

Page 30: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Approximate Distance D’: Motivation

Lipschitz embeddings and BoostMap are embedding methods that represent each object by a vector of distances from the object to a set of d prototypes.

Can efficiently compute distances between objects in the embedded space (requiring only O(d) operations).

The same idea can be applied to time series; however, the distance representation then loses all information about the alignment.
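A minimal sketch of the prototype-distance embedding idea, assuming a hypothetical dtw_distance helper for the exact distance between time series; each object becomes a d-dimensional vector of distances to the prototypes, and embedded objects are compared in O(d):

```python
import numpy as np

def embed(series, prototypes, dtw_distance):
    """Represent a time series by its exact distances to the d prototypes."""
    return np.array([dtw_distance(series, R) for R in prototypes])

def embedded_distance(e1, e2):
    """L1 distance between two embedded objects: O(d) operations."""
    return np.abs(e1 - e2).sum()
```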

Page 31: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Approximate Distance D’: Alignment via Prototypes

[Figure: a model sequence M (frames 1–7) is aligned to a prototype R1 (frames 1–6) by exact DTW; model frames mapped to the same prototype frame are averaged, e.g., (M2 + M3)/2, yielding an embedded sequence $E_{R_1}(M) \in \mathbb{R}^{F L_1}$ with the prototype's length $L_1$.]

With d prototypes, the full embedding is
$E(M) = (E_{R_1}(M), E_{R_2}(M), \ldots, E_{R_d}(M)) \in \mathbb{R}^{F L_1} \times \mathbb{R}^{F L_2} \times \cdots \times \mathbb{R}^{F L_d}$.

Page 32: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Approximate Distance D’: Alignment via Prototypes

[Figure: the model M and the query Q are each aligned to the same prototype R by exact DTW; frames mapped to the same prototype frame are averaged, yielding embedded sequences M’ and Q’ of the prototype's length L.]

$D'(M, Q) = \sum_{l=1}^{L} D_G(M'_l, Q'_l)$

Page 33: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Approximate Distance D’: Alignment via Prototypes

[Figure: the alignment between M and Q induced through the prototype R approximates the direct DTW alignment between M and Q.]

$D'(M, Q) \approx D(M, Q)$

Page 34: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Justifying the Approximation

Why does it work? Two properties:

1. If the query and prototype are identical, then the approximate distance and the exact distance are identical.

2. If the query and database object are identical, then the approximate distance is 0, and the database object will be retrieved as the Nearest Neighbor.

3. More information…

Page 35: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Justifying the Approximation

Why does it work? Two properties:

1. If the query and prototype are identical, then the approximate distance and the exact distance are identical:
$D'(E_R(M), E_R(Q)) = D'(E_Q(M), E_Q(Q)) = D(E_Q(M), Q) = D(M, Q)$

2. If the query and database object are identical, then the approximate distance is 0, and the database object will be retrieved as the Nearest Neighbor:
$D'(E_R(M), E_R(Q)) = D'(E_R(M), E_R(M)) = 0$

3. More information…

Page 36: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Prototype Selection

Approach: Sequential Forward Search (SFS):

1. Select the first prototype R1 that minimizes the classification error.

2. For i = 2 to d: select the next prototype Ri that, together with the set of prototypes selected so far {R1,…,Ri−1}, gives the lowest classification error.

Page 37: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Prototype Selection

Approach: Sequential Forward Search (SFS):
1. Select the first prototype R1 that minimizes the classification error.
2. For i = 2 to d: select the next prototype Ri that, together with the set of prototypes selected so far {R1,…,Ri−1}, gives the lowest classification error (see the sketch below).

Can also do Sequential Backward Search (SBS), removing the worst prototype at every step.

Can give weights to individual prototypes or individual features.
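A minimal sketch of Sequential Forward Search, assuming a hypothetical classification_error(selected) helper that measures nearest-neighbor error on a validation set when only the prototypes in `selected` are used:

```python
def sequential_forward_search(candidates, d, classification_error):
    """Greedily pick d prototypes that minimize the (validation) classification error."""
    selected = []
    for _ in range(d):
        best = min(
            (r for r in candidates if r not in selected),
            key=lambda r: classification_error(selected + [r]),
        )
        selected.append(best)
    return selected
```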

Page 38: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Filter and Refine

Offline:
0. Select prototypes Ri.
1. Embed all database gestures: E(Mg).

Online:
1. Embed the query: E(Q).
2. Filter: compute the approximate distance D’(Q,Mg) between the query and all database gestures in the embedded space.
3. Retain the P nearest neighbors as candidate matches.
4. Refine: rerank the P candidates based on the exact distance D (see the sketch below).
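A minimal filter-and-refine sketch, assuming precomputed database embeddings and hypothetical helpers embed (online query embedding), approx_distance (the fast D’) and exact_distance (the slow D, e.g., DTW or DSTW):

```python
import numpy as np

def filter_and_refine(query, database, db_embeddings, labels, P,
                      embed, approx_distance, exact_distance):
    eq = embed(query)                                             # online: embed the query
    approx = [approx_distance(eq, em) for em in db_embeddings]    # filter: D'(Q, Mg) for all g
    candidates = np.argsort(approx)[:P]                           # keep the P most promising examples
    refined = [(exact_distance(query, database[g]), g) for g in candidates]  # refine: exact D
    _, best = min(refined)
    return labels[best]                                           # predict C_Q from the nearest neighbor
```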

Page 39: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Complexity

F = 3: number of features
L = 50: average sequence length
N = 10,000: number of database sequences
d = 10: number of prototypes
P = 10: number of retrieved database sequences

Brute force = O(N·F·L²): compute N exact D_DTW distances.

Filter step = O(d·F·L² + N·d·F·L): compute d exact DTW alignments (warping paths) + N approximate D'_DTW distances.

Refine step = O(P·F·L²): compute P exact D_DTW distances.

Filtering is faster than brute force when N > d + N·d/L + P.

Page 40: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Reducing Complexity

Filter step = O(d·F·L² + N·d·F·L). The second term is expensive: a well-known NN shortcoming.

Proposed solutions:
1. Feature selection: reduce the number of features, d·F·L.
2. Condensing: reduce the number of objects, N.

Page 41: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Feasibility Study

1. Exact distance D_DSTW
Application: recognition of “video digits”.
Compare DTW vs. DSTW accuracy.
Verify that translation-invariance works.
What is the right K? Use cross-validation.

2. Approximate distance D'_DTW
Application: recognition of UNIPEN digits.
Measure the accuracy vs. time tradeoff of approximate DTW vs. BoostMap and CSDTW.
Recognition of NIST digits, using an approximate shape context distance.

Page 42: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Video Digit Recognition Experiment

3 users, 10 digits, 3 examples per digit.
DSTW without translation invariance.
Features: position and velocity (x, y, u, v).
Performance measure: classification accuracy (%).

11.1%-21.1% increase in classification accuracy.

Page 43: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

UNIPEN Digit Recognition Experiment

15,953 digit samples.
Features: position and angle (x, y, θ).
Performance measure: classification error (%) vs. number of exact distance computations.

Using the query against the entire database gives 1.90% error with 10,630 exact D_DTW computations.

CSDTW gives 2.90% error with 150 D_DTW computations.

At a test error of 2.80%, the proposed method is about twice as fast as BoostMap and about ten times faster than CSDTW.

Page 44: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Conclusions: DSTW

Pros:
Hand detection is not merely a bottom-up procedure.
Recognition can be achieved even in the presence of multiple “distractors”, and of overlaps between the gesturing hand and the face or the other hand.
Recognition is translation-invariant.
For real-time performance, hand detection can afford to use more efficient features with higher false positive rates, relying on DSTW's capability to handle multiple candidates in order to reject many false detections.
DSTW provides a general method for matching time series that can accommodate multiple candidate feature vectors at each time step.

Cons:
Space and time complexity increase by a factor of K for translation-dependent recognition, and by a factor of K² for translation-invariant recognition.

Page 45: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Conclusions: Approximate Alignment via Prototypes

Pros:
Approximate alignment via prototypes is fast.
It provides a general method for efficiently approximating distance measures that are based on expensive alignment methods (e.g., the shape context distance).
The number of points in the two objects does not have to be equal.
The more expensive the exact alignment method, the greater the benefit from approximation.

Cons:
Cannot guarantee the absence of false dismissals in the filter step.
Every point in one object has to be matched with at least one point from the other object; this excludes approximating the Longest Common Subsequence (LCS) similarity measure.

Page 46: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Gesture Spotting

Page 47: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Isolated Gesture Recognition vs. Gesture Spotting

[Figure: whole matching compares a query Q against isolated database examples M1–M4; subsequence matching compares a query Q against subsequences of one long sequence M.]

Whole Matching vs. Subsequence Matching

Page 48: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Gesture Spotting: Research Agenda

Indirect temporal segmentation (segmentation by recognition): implement brute-force search using a sliding window (see the sketch after this list). Now we do not know the hand locations in the database sequence M; either extend DSTW to include a 4th (spatial) axis, or assume a cooperative user who marks hand locations in the query.

Direct temporal segmentation: are there hand motion features that can predict gesture boundaries?

How to combine the gesture boundary estimates from the direct and indirect approaches?
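One possible brute-force realization of the sliding-window idea above, assuming a hypothetical dtw_distance helper for the exact distance and illustrative window bounds; every candidate window of M is scored against the query, and windows within the tolerance eps are reported:

```python
def spot_gestures(M, Q, dtw_distance, eps, min_len, max_len, step=1):
    """Return (start, end, distance) for subsequences of M that match Q within eps."""
    hits = []
    for start in range(0, len(M) - min_len + 1, step):
        for length in range(min_len, min(max_len, len(M) - start) + 1):
            d = dtw_distance(M[start:start + length], Q)
            if d <= eps:
                hits.append((start, start + length, d))
    return hits
```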

Page 49: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Thesis Roadmap

Data collection and annotation: isolated gesture recognition; gesture spotting.

Algorithms: hand features; approximate DSTW, or alternative indexing method(s); temporal segmentation.

Implement demos.

Page 50: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Thank You!

Page 51: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Example Model Digits

Page 52: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Example Correct Match

Page 53: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Digit Recognition Experiment

3 users.

Database models: 3 examples per digit per user. The user wears a colored glove, and color detection finds a single correct hand region.

Queries: 3 examples per digit per user. The user wears a shirt with long sleeves in one experiment and short sleeves in another. Skin detection generates 15 candidate hand regions.

Features: 2D position (x, y) and 2D velocity (u, v).

Example Model Digits

Page 54: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Results

For translation invariant recognition, the inclusion of velocity in the feature vector is essential for recognition, and improves classification rates by 20% and 10% for user-dep. and user-indep. recognition respectively.

User-indep. results are perhaps not satisfactory for real HCI applications, but user-dependent results are, and user-dependent recognition is desirable in many real HCI applications.

Experiment abbreviations: LS: Long Sleeves, SS: Short Sleeves; TD: Translation Dependent, TI: Translation Invariant; P: Position, PV: Position and Velocity.

Experiment   User-dep. Classification Accuracy (%)   User-indep. Classification Accuracy (%)
LS-TD-P      96.7                                     85.6
SS-TI-P      73.3                                     64.4
SS-TI-PV     95.6                                     74.4

Page 55: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Problem 2: Translation-Invariant Recognition

Goal: maintain recognition rates even when the gesture is globally translated, i.e., signed in any part of the image.

Solution: given the K candidate regions detected in the first frame:
1. Run K separate DSTW processes Pk in parallel. Pk assumes that k was the correct candidate in the first frame, and subtracts the position of the kth candidate in the first frame from all candidates in subsequent frames.
2. Select the Pk with the best matching score.

Problem: many false matches occur when only the position feature is used.

Page 56: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Recognition Framework cont’d

DTW:
$D(i,j) = d(i,j) + \min\{D(i,j-1),\ D(i-1,j-1),\ D(i-1,j)\}$

where
$d(i,j)$ – Euclidean distance between features $M_i$ and $Q_j$,
$M_i = (x_i, y_i, u_i, v_i)$ – model 2D position and velocity feature,
$Q_j = (x_j, y_j, u_j, v_j)$ – query 2D position and velocity feature,
$D(i,j)$ – cumulative distance between subsequences $M_{1:i}$ and $Q_{1:j}$,
$W^* = (w_1, \ldots, w_T)$ – optimal warping path,
$D^* = D(m,n)$ – optimal matching score.

DSTW:
$D(w_{jk}) = d(w_{jk}) + \min_{w' \in N(w_{jk})}\{D(w') + \tau(w', w_{jk})\}$

where
$Q_{jk} = (x_{jk}, y_{jk}, u_{jk}, v_{jk})$ – query feature of frame $j$ and candidate $k$,
$N(w)$ – neighbors of $(i, j, k)$, and
$\tau$ – transition cost.

Page 57: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Problem 2: Translation-Invariant Recognition

2.2. False matches occur frequently when only the position feature is used. For example, notice how the elbow in query digit 3 is falsely matched with the bottom part of the digit 7.

Solution: include velocity in the feature vector.

[Figure: query digit 3 vs. model digit 7, frames 1, 45, 85.]

Page 58: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Multi-dimensional time series examples

“Video Gestures”: American Sign Language

Cursive Handwriting

Page 59: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Conclusions & Future Work

Conclusions:
+ DSTW is a general framework for matching time series that can accommodate multiple (K) candidate feature vectors at each time step.
+ Translation-invariance is incorporated into the framework.
− Space and time complexity increase by a factor of K for translation-dependent recognition, and K² for translation-invariant recognition.

Future Work: dynamic feature selection; gesture verification; temporal segmentation.

Page 60: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Problem Statement (2)

Gesture Spotting Problem: Given a long image sequence of gestures M (the database), a gesture query sequence Q, a distance measure D, and a distance tolerance ε, find those data subsequences x ⊆ M which satisfy D(x,Q) ≤ ε.

M can be an ASL story. Q can be:
an ASL sign (e.g., “CAR”),
finger spelling (e.g., “John”), or
any hand motion between signs (motion epenthesis).

D will be the Dynamic Time Warping (DTW) distance or a variant of it.

Page 61: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

A small D(Q, M3) => CQ = C3 = ‘CAR’

A large D(Q, M4) => CQ ≠ C4 = ‘BUY’

Page 62: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Problem Statement (1)

Visual ASL Dictionary Problem: Given a database (dictionary) of gesture image sequences Mi, a sign query sequence Q, a distance measure D, and a distance tolerance ε, find those data exemplars Mj which satisfy D(Mj,Q) ≤ ε.

Page 63: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Problem Statement (1)

Visual ASL Dictionary Problem: Given a database (dictionary) of gesture image sequences Mi, a sign query sequence Q, a distance measure D, and a distance tolerance ε, find those data exemplars Mj which satisfy D(Mj,Q) ≤ ε.

Q is a sign performed by a novice ASL student in front of a camera.

Mi are examples of isolated signs.

Page 64: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Problem Statement (1)

Visual ASL Dictionary Problem: Given a database (dictionary) of gesture image sequences Mi, a sign query sequence Q, a distance measure D, and a distance tolerance ε, find those data exemplars Mj which satisfy D(Mj,Q) ≤ ε.

Application Assumptions:
In producing Q, the ASL student may be cooperative.
Examples Mi can be collected with any constraints that would improve task performance, for example: colored gloves, slow gestures.

Page 65: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Problem Statement (1)

Visual ASL Dictionary Problem: Given a database (dictionary) of gesture image sequences Mi, a sign query sequence Q, a distance measure D, and a distance tolerance ε, find those data exemplars Mj which satisfy D(Mj,Q) ≤ ε.

Search Alternatives: Search for neighbors in ε-ball. Search for k Nearest Neighbors (kNN). Rank the entire database.

Page 66: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Problem Statement (1)

Visual ASL Dictionary Problem: Given a database (dictionary) of gesture image sequences Mi, a sign query sequence Q, a distance measure D, and a distance tolerance ε, find those data exemplars Mj which satisfy D(Mj,Q) ≤ ε.

Real Goal:
Assume class labels are known, e.g., C(Mi) = “CAR”, and that the Mj are sorted in ascending order based on D.
We want Sign(M1) = Sign(Q), or Sign(Mj) = Sign(Q) for as many examples Mj with small enough j.
We want: similarity in input (feature) space => similarity in class space.

Page 67: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Outline

Introduction: Assumptions, Challenges, Formal Problem Statement (ASL Dictionary Problem, Gesture Spotting)

System Overview: Multiple Candidate Hand Detection, Feature Extraction and Processing, Dynamic Space-Time Warping (DSTW), Approximate Matching via Prototypes, Temporal Segmentation

Feasibility Study, Related Work, Schedule, Conclusion

Page 68: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Research Goals

Problem: Predict the class label CQ accurately and efficiently:

1. Accurately: design a distance measure D such that similarity in input space under D => similarity in class space.

2. Efficiently: better than brute force, which computes the exact distance between the query gesture and all database gesture examples (D(Q,Mi), for all i).

A small D(Q, M3) => CQ = C3 = ‘CAR’

A large D(Q, M4) => CQ ≠ C4 = ‘BUY’

Page 69: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Proposed methods

1. Accurately: propose a Dynamic Space-Time Warping (DSTW) algorithm that can accommodate multiple hypotheses about the hand location in every frame of the query gesture sequence.

DSTW will enable a simple and efficient multiple candidate hand detection algorithm.

2. Efficiently: use a filtering method, which consists of two steps:

1. Filter step: compute D’(Q,Mi) for all i, 1 ≤ i ≤ N, based on a fast but approximate distance D’. Retain the P most promising gesture examples.

2. Refine step: compute D(Q,Mj) for j, 1 ≤ j ≤ P, based on the slow but exact distance D. Predict CQ based on the class labels of the Nearest Neighbors (NN).

Page 70: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Feature Extraction (1)

Show image with (x, y, u, v), or image with (x, y, θ).

Page 71: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Assumptions

Sensor: single color camera.
Background: not necessarily uniform.
Viewing condition: frontal upper-body view.
Foreground: single gesturer; objects of interest: hands; static camera and static gesturer.
Lighting: constant or slowly varying.

Page 72: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Challenges

Geometric variation: translation and scale.
User (signer) independence: body kinematics (shape and size of different body parts); style (speed, emphasis).
Different gesture durations.
Textured clothes.
Native signers and high gesture speeds.
Hand occlusion and self-occlusion.
Difficult sign types: repetitions, agentive forms, location- and context-dependent signs.

Page 73: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Contributions

[System diagram] query gesture sequence → multiple candidate hand detection → multiple candidate hand subimages → feature extraction and processing → query features → approximate matching (filter) → candidate matches → exact matching (refine) → best matches → browsing → retrieval results; the video database of isolated gestures supplies the database features to the matching steps.

Page 74: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

ASL Dictionary: System Diagram

[System diagram] query gesture sequence → multiple candidate hand detection → multiple candidate hand subimages → feature extraction and processing → query features → approximate matching (filter) → candidate matches → exact matching (refine) → best matches → browsing → retrieval results; the video database of isolated gestures supplies the database features to the matching steps.

Page 75: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

System Diagram

[System diagram] query gesture sequence → multiple candidate hand detection → multiple candidate hand subimages → feature extraction and processing → query features → approximate matching (filter) → candidate matches → exact matching (refine) → best matches → browsing → retrieval results; the video database of isolated gestures supplies the database features to the matching steps.

Page 76: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

ASL Dictionary: System Diagram

[System diagram] query gesture sequence → multiple candidate hand detection → multiple candidate hand subimages → feature extraction and processing → query features → approximate matching (filter) → candidate matches → exact matching (refine) → best matches → browsing → retrieval results; the video database of isolated gestures supplies the database features to the matching steps.

Page 77: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

ASL Dictionary: System Diagram

[System diagram] query gesture sequence → multiple candidate hand detection → multiple candidate hand subimages → feature extraction and processing → query features → approximate matching (filter) → candidate matches → exact matching (refine) → best matches → browsing → retrieval results; the video database of isolated gestures supplies the database features to the matching steps.

Page 78: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

DTW Math

Mi, Qj are F-dimensional vectors and wf are weights. The distance measure between two feature vectors is a weighted Lp norm:

$D_G(M_i, Q_j) = \left( \sum_{f=1}^{F} w_f \, |M_i^f - Q_j^f|^p \right)^{1/p}$

For example, with wf = 1 and p = 2 we get the Euclidean distance.
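A one-line sketch of the weighted Lp local distance above; with wf = 1 and p = 2 it reduces to the Euclidean distance used earlier:

```python
import numpy as np

def weighted_lp(Mi, Qj, w, p=2):
    """Weighted Lp distance between two F-dimensional feature vectors."""
    return float((w * np.abs(Mi - Qj) ** p).sum() ** (1.0 / p))
```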

Page 79: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Justifying the Approximation

We want a contractive embedding:

$D'(M, Q) \le D_{DTW}(M, Q), \quad \forall\, M, Q$

Why? Because then we can filter out unlikely matches and guarantee no false dismissals. However, we could not prove this property.

Page 80: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Prototype Selection

Approach: Sequential Forward Search.
1. Select the first prototype R1 that minimizes the classification error.
2. For i = 2 to d: select the next prototype Ri that, together with the set of prototypes selected so far {R1,…,Ri−1}, gives the lowest classification error.

Can do backward search too, by removing the worst prototype at every step.

Can give weights to individual prototypes or individual features.

Page 81: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Complexity

F – number of features; L – average sequence length; K – number of hand candidates; N – number of database objects; d – number of prototypes.

Filter step: O(d·F·L² + N·d·F·L) = O(d·F·L·(L + N)).

Example: in the UNIPEN digit dataset, L = 50 and N = 10,000, so the second term dominates; this is the well-known NN shortcoming.

Approach:
Feature selection to reduce the number of features, d·F·L.
Condensing to reduce the number of objects, N.

Page 82: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Feature Extraction (1)

1D time series

Multi-dimensional time series

Input Gesture Sequence

$M_i = (x_i, y_i, u_i, v_i)$; $M = (M_1, M_2, \ldots, M_i, \ldots, M_m)$

Page 83: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Approximate Distance D’: Alignment via Prototype

[Figure: a model sequence M is aligned to a prototype R; model frames mapped to the same prototype frame are averaged, e.g., (M2 + M3)/2 and (M6 + M7)/2.]

Page 84: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Approximate Distance D’: Alignment via Prototype

Page 85: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Alignment via Prototype

$F(M) = (M'_1, M'_2) = (f(M_1, M_2), f(M_3, M_4))$

$F(Q) = (Q'_1, Q'_2) = (f(Q_1, Q_2), f(Q_3))$

$D'(F(M), F(Q)) = \sum_{j=1}^{2} D_G(M'_j, Q'_j)$

For d prototypes, concatenate the d vectors.

Page 86: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Conclusions

Pros:
Hand detection is not merely a bottom-up procedure: the gesture model is used to select hand locations in a way that optimizes the query-to-model matching cost.
Recognition can be achieved even in the presence of multiple “distractors”, like moving objects or skin-colored objects (e.g., the face, the non-gesturing hand, background objects).
Recognition is robust to overlaps between the gesturing hand and the face or the other hand.
Recognition is translation-invariant; the gesture can occur in any part of the image.
For real-time performance, hand detection can afford to use more efficient features with higher false positive rates, relying on DSTW's capability to handle multiple candidates in order to reject many false detections.
DSTW provides a general method for matching time series that can accommodate multiple candidate feature vectors at each time step.

Cons:
Space and time complexity increase by a factor of K for translation-dependent recognition, and by a factor of K² for translation-invariant recognition.

Page 87: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Filter: Approximate Distance D’

Offline:
0. Select prototypes Ri.
1. Compute the correspondence between the database sequences Mg and the prototypes Ri using the exact alignment W(Mg,Ri).
2. Use the alignments to embed the database sequences: F(Mg).

Online:
1. Compute the correspondence between the query sequence Q and the prototypes Ri using the exact alignment W(Q,Ri).
2. Use the alignment to embed the query sequence: F(Q). This induces an approximate alignment WR(Q,Mg) between the query and any database sequence.
3. Use the approximate alignment WR(Q,Mg) to compute the approximate distance D’(Q,Mg) in the embedded space (see the sketch below).
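A minimal sketch of the embedding step above, assuming a hypothetical dtw_path helper that returns the exact warping path as (sequence frame, prototype frame) index pairs. Frames of a sequence aligned to the same prototype frame are averaged, so every embedded sequence has the prototype's length and D’ can then be computed frame-by-frame without any further warping:

```python
import numpy as np

def embed_via_prototype(seq, prototype, dtw_path):
    """Embed a sequence as one averaged feature vector per prototype frame."""
    emb = np.zeros((len(prototype), seq.shape[1]))
    counts = np.zeros(len(prototype))
    for i, l in dtw_path(seq, prototype):
        emb[l] += seq[i]
        counts[l] += 1
    return emb / counts[:, None]

def approx_distance(emb_m, emb_q):
    """D'(M, Q): sum of local distances between aligned averages."""
    return np.linalg.norm(emb_m - emb_q, axis=1).sum()
```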

Page 88: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Assumptions

Sensor: single color camera.
Background: not necessarily uniform.
Viewing condition: frontal upper-body view.
Foreground: single gesturer; objects of interest: hands; static camera and static gesturer.
Lighting: constant or slowly varying.

Page 89: Vision-Based Retrieval of Dynamic Hand Gestures

Computer Science

Challenges

Geometric variation: translation and scale.
User (signer) independence: body kinematics (shape and size of different body parts); style (speed, emphasis).
Different gesture durations.
Textured clothes.
Native signers and high gesture speeds.
Hand occlusion and self-occlusion.
Difficult sign types: repetitions, agentive forms, location- and context-dependent signs.