Nonlinear Manifold Learning for Visual Speech Recognition
Christoph Bregler and Stephen Omohundro
University of California, Berkeley & NEC Research Institute, Inc.
1/25
Overview
- Manifold Learning
- Applications to Lip Tracking, Image Interpolation, and Feature Extraction
- Experiments with a Full Speech Recognition System
A technique for representing and learning smooth nonlinear manifolds is presented and applied to several lip reading tasks. Given a set of points drawn from a smooth manifold in an abstract feature space, the technique is capable of determining the structure of the surface and of finding the closest manifold point to a given query point.
We use this technique to learn the "space of lips" in a visual speech recognition task. The learned manifold is used for tracking and extracting the lips, for interpolating between frames in an image sequence and for providing features for recognition.
We describe a system based on Hidden Markov Models and this learned lip manifold that significantly improves the performance of acoustic speech recognizers in degraded environments. We also present preliminary results on a purely visual lip reader.
2/25
Problem
1. Boundary Tracking
2. Image Interpolation
3. Lip Features
4. Speech Recognition [HMM]
[Figure: video frames V1 ... V7 and acoustic feature frames A1 ... A7 feeding the recognizer]
3/25
Full Recognition System
[Diagram: "Visual Speech" -> "Contour-Surfing" lip tracking -> "Eigenlips" features and lip interpolation; "Acoustic Speech" -> RASTA-PLP features; both feed a hybrid MLP/HMM speech recognition system; tested on full sentences, car noise, and cross-talk]
4/25
Lip Tracking using Active Contour Models
"Snake" Tracking
Snake Model = "Controlled Continuity Spline"
5/25
Space of Contours as a Constrained Manifold
N Dimensional Features are Points in N Dimensional Space.
N Dimensional Feature Space
K Dimensional Subspace
6/25
Abstract Manifold Learning Task
[Figure: left, scattered training samples; right, the learned representation as a mixture of local adaptive patches]
7/25
Mixture of Patches Architecture

[Diagram: linear patches P_i, each with a Gaussian influence function G_i]

    P(x) = Σ_i G_i(x) · P_i(x) / Σ_i G_i(x)
8/25
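The patch-mixture projection above can be sketched as follows. This is a minimal illustration only: the Gaussian gating functions `G_i`, the shared `sigma`, and orthonormal local bases are my assumptions, not the exact gating used in the system.

```python
import numpy as np

def patch_projection(x, mean, basis):
    """Project x onto the affine plane mean + span(basis).
    basis rows are assumed orthonormal."""
    d = x - mean
    return mean + basis.T @ (basis @ d)

def gate(x, mean, sigma=1.0):
    """Gaussian gating weight G_i(x) centered on the patch mean."""
    return np.exp(-np.sum((x - mean) ** 2) / (2 * sigma ** 2))

def manifold_projection(x, patches, sigma=1.0):
    """Gate-weighted average of the local patch projections:
    P(x) = sum_i G_i(x) P_i(x) / sum_i G_i(x)."""
    weights = np.array([gate(x, m, sigma) for m, _ in patches])
    projs = np.array([patch_projection(x, m, B) for m, B in patches])
    return (weights[:, None] * projs).sum(0) / weights.sum()
```

With two flat patches along the x-axis, a query point off the axis projects back onto it, e.g. `manifold_projection([1.0, 0.5], ...)` lands at `[1.0, 0.0]`.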
Manifold Learning
• Initial Estimate
- Cluster Feature Space (K-Means)
- Local PCA
• Minimize MSE between Training Set and Projection
- EM Algorithm
- Gradient Descent
• Best-First Model Merging
9/25
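The initial-estimate steps above can be sketched as K-means clustering of the feature space followed by a local PCA in each cluster. The function names and the plain Lloyd iteration are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def kmeans(data, k, iters=20, seed=0):
    """A few Lloyd iterations: assign points to the nearest
    center, then move each center to its cluster mean."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(
            ((data[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(0)
    return centers, labels

def local_pca(data, labels, k, dim):
    """Fit a dim-dimensional PCA plane (mean + top principal
    directions) to each cluster, giving one local patch each."""
    patches = []
    for j in range(k):
        pts = data[labels == j]
        mean = pts.mean(0)
        _, _, vt = np.linalg.svd(pts - mean, full_matrices=False)
        patches.append((mean, vt[:dim]))
    return patches
```

Run on points drawn from a curve, this yields one flat patch per cluster, tangent to the curve locally.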
Synthetic Examples
Learned surface
Training Samples
10/25
Comparison to Nearest Neighbor
[Plot: MSE (0.0 to 0.3) vs. number of training samples -- projecting the query onto the learned surface (surface prior) vs. taking the nearest training point]
11/25
Manifold Dimension?

[Plot: 1st, 2nd, and 3rd eigenvalues (0 to 180) vs. samples per PCA (0 to 200); the 1st eigenvalue keeps growing with patch size while the 2nd and 3rd eigenvalues stay small]
12/25
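The dimension estimate suggested by the eigenvalue plot can be sketched as counting how many local-covariance eigenvalues stay large relative to the first; the `ratio` threshold here is an assumed heuristic, not a value from the slides:

```python
import numpy as np

def local_spectrum(points):
    """Eigenvalues of the local covariance, largest first.
    Directions along the surface get large eigenvalues;
    noise directions get small ones."""
    centered = points - points.mean(0)
    cov = centered.T @ centered / len(points)
    return np.sort(np.linalg.eigvalsh(cov))[::-1]

def estimate_dim(points, ratio=0.1):
    """Count eigenvalues above ratio * largest eigenvalue."""
    ev = local_spectrum(points)
    return int(np.sum(ev > ratio * ev[0]))
```

For a noisy line segment embedded in 3-D, the estimate is 1.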
Manifold-based Snake Tracking

[Diagram: directed graylevel gradients along the snake; the contour points are constrained by the learned manifold in LIP-Space]

Manifold-Snake Energy = Image Term + Learned Manifold Term
Use Manifold-Snakes to estimate position, scale, and rotation for gray level matrix extraction.
13/25
Graylevel Manifolds
[Diagram: linear dimension reduction gives a flat "Linear Lip-Space"; the real lip-space is curved, so nonlinear dimension reduction is needed]
14/25
Linear vs. Nonlinear Interpolation
[Figure: interpolation paths in graylevel dimensions -- the linear path leaves the manifold of lip images, the nonlinear path stays on it]

- Learn nonlinear subspace of "legal" images
- Constrain interpolated images to the subspace

(16x16 pixel = 256 dim. space)
15/25
Nonlinear Interpolation Technique
[Figure: the "Manifold-Snake" pulls a sequence of points onto the learned manifold]

    E = Σ_i ( |v_{i-1} - 2 v_i + v_{i+1}|² + distance(v_i) )
- Begin with a sequence of linearly interpolated points
- Iteratively move the points toward the manifold
- Each iteration is a gradient descent step on the Manifold-Snake energy
[Figure: interpolated sequence after 1, 2, 5, and 10 iterations]
16/25
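The interpolation loop can be sketched with a toy stand-in manifold: the unit circle, whose closest-point projection is simply x/|x|. The real system projects onto the learned lip manifold instead, and the step size and pull weights here are assumptions:

```python
import numpy as np

def project_to_circle(v):
    """Closest point on the toy manifold (the unit circle)."""
    return v / np.linalg.norm(v)

def manifold_snake_interp(a, b, n=8, iters=50, step=0.3):
    """Interpolate n points between a and b along the manifold:
    start from the linear interpolation, then descend the
    Manifold-Snake energy (curvature term + manifold term)."""
    pts = np.linspace(a, b, n + 2)          # linear initialization
    for _ in range(iters):
        for i in range(1, n + 1):
            smooth = 0.5 * (pts[i - 1] + pts[i + 1])  # curvature pull
            target = project_to_circle(pts[i])        # manifold pull
            pts[i] += step * ((smooth - pts[i]) + (target - pts[i]))
    return pts
```

Starting from the chord between two points on the circle, the interior points migrate onto the arc within a few dozen iterations.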
[Figure: rows of natural lip images, linear interpolation, and nonlinear interpolation]
24x24 images projected into a 32-D linear space and an 8-D nonlinear manifold
18/25
Phone Probability Estimation

P(phone | acoustics, visual-input)

[Diagram: Multi-Layer Perceptron]
- 63 Softmax probability outputs
- 256-512 hidden units
- Input: 19 frames of 9 acoustic features (1 Energy, 8 RASTA-PLP) + 10 Eigen-Features (+ 10 Delta-Eigen-Features)
20/25
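A shape-level sketch of the posterior estimator described above. The weights are random and untrained here (the real network is trained with backpropagation on labeled frames), and the tanh hidden nonlinearity is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
frames, feats = 19, 9 + 10 + 10   # 19-frame context window x 29 features
hidden, phones = 256, 63          # hidden units, softmax outputs

W1 = rng.normal(0, 0.1, (frames * feats, hidden))
W2 = rng.normal(0, 0.1, (hidden, phones))

def phone_posteriors(window):
    """window: (19, 29) feature matrix -> P(phone | A, V),
    a 63-way softmax distribution."""
    h = np.tanh(window.reshape(-1) @ W1)   # hidden layer
    logits = h @ W2
    e = np.exp(logits - logits.max())      # numerically stable softmax
    return e / e.sum()
```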
Hidden Markov Models
Viterbi Approximation

[Trellis diagram: states of word model M_i over time]

M_i (Word-Model)
Likelihood: P(A, V | M_i)

P(A, V | phone) ∝ P(phone | A, V) / P(phone)
21/25
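The hybrid decoding step can be sketched as dividing the MLP posteriors by the phone priors to obtain scaled likelihoods, then scoring each left-to-right word model with Viterbi. The fixed transition probability and the toy model topology are assumed placeholders:

```python
import numpy as np

def scaled_log_likelihoods(posteriors, priors):
    """log P(A, V | phone) up to a constant:
    log posterior minus log prior."""
    return np.log(posteriors) - np.log(priors)

def viterbi_score(loglik, phone_seq, stay=np.log(0.5)):
    """Best-path log score of a left-to-right word model M_i
    whose states emit the phones in phone_seq."""
    T, S = len(loglik), len(phone_seq)
    score = np.full(S, -np.inf)
    score[0] = loglik[0, phone_seq[0]]
    for t in range(1, T):
        prev = score.copy()
        for s in range(S):
            best = prev[s] + stay               # stay in state s
            if s > 0:
                best = max(best, prev[s - 1] + stay)  # advance
            score[s] = best + loglik[t, phone_seq[s]]
    return score[-1]
```

Given frames whose posteriors favor phone 0 then phone 1, the word model [0, 1] outscores [1, 0].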
Recognition Results

Task                   Acoustic   Eigenlips   Delta-Lips   Rel. Err. Red.
clean                  11.0 %     10.1 %      11.3 %       -2.7 %
20 dB SNR              33.5 %     28.9 %      26.0 %       22.4 %
10 dB SNR              56.1 %     51.7 %      48.0 %       14.4 %
15 dB SNR crosstalk    67.3 %     51.7 %      46.0 %       31.6 %

Table 1: Spelling Task (6 Speakers), Word Error (wrong + insertion + deletion)
Task         Pure Lipreading Error
Bar-Tender   4.5 %

Table 2: Pure Lipreading (1 Speaker, "Bar-Task": 4 Cocktail Names)
22/25
Related Work
Related Linear Representations:
- Kirby et al., Pentland et al.: Linear Subspaces
- Simard: Tangent Distance

Related Nonlinear Representations:
- Jacobs, Jordan, Nowlan, Hinton: Mixture of Experts
- Kambhatla, Leen: Local PCA

Related Interpolation Techniques:
- Poggio et al.: RBFs for Image Interpolation
23/25
Computer Lipreading History
- U.S. Patent 3192321 (1965), E. Nassimbene (IBM)
- E. Petajan (1984), Dissertation
- B. Yuhas, M. Goldstein, T. Sejnowski (1989)
- K. Mase, A. Pentland (1991)
- O. Garcia, A. Goldschen, E. Petajan (1992)
- D. Stork, G. Wolff, K. V. Prasad, M. Hennecke (1992)
- P.L. Silsbee (1993), Dissertation
- A. J. Goldschen (1993), Dissertation
- J. Movellan (1994)
24/25
Summary
Lipreading:
- Improves Speech Recognition Performance significantly
- Full System Solution
Manifold Learning:
- New fundamental technique for learning in geometric domains
- Applicable to variety of different tasks
25/25