Nonlinear Manifold Learning for Visual Speech Recognition
Christoph Bregler and Stephen Omohundro
University of California, Berkeley & NEC Research Institute, Inc.
1/25
Overview
- Manifold Learning
- Applications to Lip Tracking, Image Interpolation, and Feature Extraction
- Experiments with a Full Speech Recognition System
A technique for representing and learning smooth nonlinear manifolds is presented and applied to several lip reading tasks. Given a set of points drawn from a smooth manifold in an abstract feature space, the technique is capable of determining the structure of the surface and of finding the closest manifold point to a given query point.
We use this technique to learn the "space of lips" in a visual speech recognition task. The learned manifold is used for tracking and extracting the lips, for interpolating between frames in an image sequence and for providing features for recognition.
We describe a system based on Hidden Markov Models and this learned lip manifold that significantly improves the performance of acoustic speech recognizers in degraded environments. We also present preliminary results on a purely visual lip reader.
2/25
Problem
1. Boundary Tracking
2. Image Interpolation
3. Lip Features
4. Speech Recognition [HMM]
[Figure: video frames V1 ... V7 and acoustic feature frames A1 ... A7 feeding the recognizer]
3/25
Full Recognition System
[Diagram: "Visual Speech" -> "Contour-Surfing" lip tracking -> "Eigenlips" features and lip interpolation; "Acoustic Speech" -> RASTA-PLP features; both feed a hybrid MLP/HMM speech recognition system; tested on full sentences, car noise, and cross-talk]
4/25
Lip Tracking using Active Contour Models
"Snake" Tracking
Snake Model = "Controlled Continuity Spline"
5/25
Space of Contours as a Constrained Manifold
N Dimensional Features are Points in N Dimensional Space.
N Dimensional Feature Space
K Dimensional Subspace
6/25
Abstract Manifold Learning Task
[Figure: left, scattered training samples; right, the learned representation as a mixture of local adaptive patches]
7/25
Mixture of Patches Architecture

[Diagram: linear patches P_i, each with a Gaussian influence function G_i]

    P(x) = Σ_i G_i(x) · P_i(x) / Σ_i G_i(x)
8/25
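The patch-mixture projection above can be sketched as follows. This is a minimal illustration only: the Gaussian gating functions `G_i`, the shared `sigma`, and orthonormal local bases are my assumptions, not the exact gating used in the system.

```python
import numpy as np

def patch_projection(x, mean, basis):
    """Project x onto the affine plane mean + span(basis).
    basis rows are assumed orthonormal."""
    d = x - mean
    return mean + basis.T @ (basis @ d)

def gate(x, mean, sigma=1.0):
    """Gaussian gating weight G_i(x) centered on the patch mean."""
    return np.exp(-np.sum((x - mean) ** 2) / (2 * sigma ** 2))

def manifold_projection(x, patches, sigma=1.0):
    """Gate-weighted average of the local patch projections:
    P(x) = sum_i G_i(x) P_i(x) / sum_i G_i(x)."""
    weights = np.array([gate(x, m, sigma) for m, _ in patches])
    projs = np.array([patch_projection(x, m, B) for m, B in patches])
    return (weights[:, None] * projs).sum(0) / weights.sum()
```

With two flat patches along the x-axis, a query point off the axis projects back onto it, e.g. `manifold_projection([1.0, 0.5], ...)` lands at `[1.0, 0.0]`.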
Manifold Learning
• Initial Estimate
- Cluster Feature Space (K-Means)
- Local PCA
• Minimize MSE between Training Set and Projection
- EM Algorithm
- Gradient Descent
• Best-First Model Merging
9/25
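The initial-estimate steps above can be sketched as K-means clustering of the feature space followed by a local PCA in each cluster. The function names and the plain Lloyd iteration are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def kmeans(data, k, iters=20, seed=0):
    """A few Lloyd iterations: assign points to the nearest
    center, then move each center to its cluster mean."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(
            ((data[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(0)
    return centers, labels

def local_pca(data, labels, k, dim):
    """Fit a dim-dimensional PCA plane (mean + top principal
    directions) to each cluster, giving one local patch each."""
    patches = []
    for j in range(k):
        pts = data[labels == j]
        mean = pts.mean(0)
        _, _, vt = np.linalg.svd(pts - mean, full_matrices=False)
        patches.append((mean, vt[:dim]))
    return patches
```

Run on points drawn from a curve, this yields one flat patch per cluster, tangent to the curve locally.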
Synthetic Examples
Learned surface
Training Samples
10/25
Comparison to Nearest Neighbor
[Plot: MSE (0.0 to 0.3) vs. number of training samples -- projecting the query onto the learned surface (surface prior) vs. taking the nearest training point]
11/25
Manifold Dimension?

[Plot: 1st, 2nd, and 3rd eigenvalues (0 to 180) vs. samples per PCA (0 to 200); the 1st eigenvalue keeps growing with patch size while the 2nd and 3rd eigenvalues stay small]
12/25
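The dimension estimate suggested by the eigenvalue plot can be sketched as counting how many local-covariance eigenvalues stay large relative to the first; the `ratio` threshold here is an assumed heuristic, not a value from the slides:

```python
import numpy as np

def local_spectrum(points):
    """Eigenvalues of the local covariance, largest first.
    Directions along the surface get large eigenvalues;
    noise directions get small ones."""
    centered = points - points.mean(0)
    cov = centered.T @ centered / len(points)
    return np.sort(np.linalg.eigvalsh(cov))[::-1]

def estimate_dim(points, ratio=0.1):
    """Count eigenvalues above ratio * largest eigenvalue."""
    ev = local_spectrum(points)
    return int(np.sum(ev > ratio * ev[0]))
```

For a noisy line segment embedded in 3-D, the estimate is 1.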
Manifold-based Snake Tracking

[Diagram: directed graylevel gradients along the snake; the contour points are constrained by the learned manifold in LIP-Space]

Manifold-Snake Energy = Image Term + Learned Manifold Term
Use Manifold-Snakes to estimate position, scale, and rotation for gray level matrix extraction.
13/25
Graylevel Manifolds
[Diagram: linear dimension reduction gives a flat "Linear Lip-Space"; the real lip-space is curved, so nonlinear dimension reduction is needed]
14/25
Linear vs. Nonlinear Interpolation
[Figure: interpolation paths in graylevel dimensions -- the linear path leaves the manifold of lip images, the nonlinear path stays on it]

- Learn nonlinear subspace of "legal" images
- Constrain interpolated images to the subspace

(16x16 pixel = 256 dim. space)
15/25
Nonlinear Interpolation Technique
[Figure: the "Manifold-Snake" pulls a sequence of points onto the learned manifold]

    E = Σ_i ( |v_{i-1} - 2 v_i + v_{i+1}|² + distance(v_i) )
- Begin with a sequence of linearly interpolated points
- Iteratively move the points toward the manifold
- Each iteration is a gradient descent step on the Manifold-Snake energy
[Figure: interpolated sequence after 1, 2, 5, and 10 iterations]
16/25
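The interpolation loop can be sketched with a toy stand-in manifold: the unit circle, whose closest-point projection is simply x/|x|. The real system projects onto the learned lip manifold instead, and the step size and pull weights here are assumptions:

```python
import numpy as np

def project_to_circle(v):
    """Closest point on the toy manifold (the unit circle)."""
    return v / np.linalg.norm(v)

def manifold_snake_interp(a, b, n=8, iters=50, step=0.3):
    """Interpolate n points between a and b along the manifold:
    start from the linear interpolation, then descend the
    Manifold-Snake energy (curvature term + manifold term)."""
    pts = np.linspace(a, b, n + 2)          # linear initialization
    for _ in range(iters):
        for i in range(1, n + 1):
            smooth = 0.5 * (pts[i - 1] + pts[i + 1])  # curvature pull
            target = project_to_circle(pts[i])        # manifold pull
            pts[i] += step * ((smooth - pts[i]) + (target - pts[i]))
    return pts
```

Starting from the chord between two points on the circle, the interior points migrate onto the arc within a few dozen iterations.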
[Figure: rows of natural lip images, linear interpolation, and nonlinear interpolation]
24x24 images projected into a 32-D linear space and an 8-D nonlinear manifold
18/25
Phone Probability Estimation

P(phone | acoustics, visual-input)

[Diagram: Multi-Layer Perceptron]
- 63 Softmax probability outputs
- 256-512 hidden units
- Input: 19 frames of 9 acoustic features (1 Energy, 8 RASTA-PLP) + 10 Eigen-Features (+ 10 Delta-Eigen-Features)
20/25
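A shape-level sketch of the posterior estimator described above. The weights are random and untrained here (the real network is trained with backpropagation on labeled frames), and the tanh hidden nonlinearity is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
frames, feats = 19, 9 + 10 + 10   # 19-frame context window x 29 features
hidden, phones = 256, 63          # hidden units, softmax outputs

W1 = rng.normal(0, 0.1, (frames * feats, hidden))
W2 = rng.normal(0, 0.1, (hidden, phones))

def phone_posteriors(window):
    """window: (19, 29) feature matrix -> P(phone | A, V),
    a 63-way softmax distribution."""
    h = np.tanh(window.reshape(-1) @ W1)   # hidden layer
    logits = h @ W2
    e = np.exp(logits - logits.max())      # numerically stable softmax
    return e / e.sum()
```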
Hidden Markov Models
Viterbi Approximation

[Trellis diagram: states of word model M_i over time]

M_i (Word-Model)
Likelihood: P(A, V | M_i)

P(A, V | phone) ∝ P(phone | A, V) / P(phone)
21/25
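The hybrid decoding step can be sketched as dividing the MLP posteriors by the phone priors to obtain scaled likelihoods, then scoring each left-to-right word model with Viterbi. The fixed transition probability and the toy model topology are assumed placeholders:

```python
import numpy as np

def scaled_log_likelihoods(posteriors, priors):
    """log P(A, V | phone) up to a constant:
    log posterior minus log prior."""
    return np.log(posteriors) - np.log(priors)

def viterbi_score(loglik, phone_seq, stay=np.log(0.5)):
    """Best-path log score of a left-to-right word model M_i
    whose states emit the phones in phone_seq."""
    T, S = len(loglik), len(phone_seq)
    score = np.full(S, -np.inf)
    score[0] = loglik[0, phone_seq[0]]
    for t in range(1, T):
        prev = score.copy()
        for s in range(S):
            best = prev[s] + stay               # stay in state s
            if s > 0:
                best = max(best, prev[s - 1] + stay)  # advance
            score[s] = best + loglik[t, phone_seq[s]]
    return score[-1]
```

Given frames whose posteriors favor phone 0 then phone 1, the word model [0, 1] outscores [1, 0].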
Recognition Results

Task                   Acoustic   Eigenlips   Delta-Lips   Rel. Err. Red.
clean                  11.0 %     10.1 %      11.3 %       -2.7 %
20 dB SNR              33.5 %     28.9 %      26.0 %       22.4 %
10 dB SNR              56.1 %     51.7 %      48.0 %       14.4 %
15 dB SNR crosstalk    67.3 %     51.7 %      46.0 %       31.6 %

Table 1: Spelling Task (6 Speakers), Word Error (wrong + insertion + deletion)
Task         Pure Lipreading Error
Bar-Tender   4.5 %

Table 2: Pure Lipreading (1 Speaker, "Bar-Task": 4 Cocktail Names)
22/25
Related Work
Related Linear Representations:
- Kirby et al., Pentland et al.: Linear Subspaces
- Simard: Tangent Distance

Related Nonlinear Representations:
- Jacobs, Jordan, Nowlan, Hinton: Mixture of Experts
- Kambhatla, Leen: Local PCA

Related Interpolation Techniques:
- Poggio et al.: RBFs for Image Interpolation
23/25
Computer Lipreading History
- U.S. Patent 3192321 (1965), E. Nassimbene (IBM)
- E. Petajan (1984), Dissertation
- B. Yuhas, M. Goldstein, T. Sejnowski (1989)
- K. Mase, A. Pentland (1991)
- O. Garcia, A. Goldschen, E. Petajan (1992)
- D. Stork, G. Wolff, K. V. Prasad, M. Hennecke (1992)
- P.L. Silsbee (1993), Dissertation
- A. J. Goldschen (1993), Dissertation
- J. Movellan (1994)
24/25
Summary
Lipreading:
- Improves Speech Recognition Performance significantly
- Full System Solution
Manifold Learning:
- New fundamental technique for learning in geometric domains
- Applicable to variety of different tasks
25/25