MikeTalk: An Adaptive Man-Machine Interface


Tony Ezzat, Volker Blanz

Tomaso Poggio

TTVS Overview

• Input: Text

• Output: Photo-realistic talking face uttering text

Desktop Agents

You have received 1 email from Tommy Poggio.

Customer Support

You have bought 20 shares of SONY at $40 each.

Advertisements

Hi Tony, would you be interested in a ticket from Boston to New York for $50.00?

Modules

Phoneme Corpus

Step 1:

– collect a visual corpus from a subject

– corpus contains 44 words

– one word for each American English phoneme

6 Consonantal Visemes

Step 2:

– extract one image per phoneme: viseme

– group visemes together by visual similarity (an illustrative grouping is sketched below)

9 Vocalic Visemes (+ 1 Silence Viseme)
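The slides do not list the exact grouping MikeTalk uses, so the table below is only illustrative: phonemes that share the same lip shape (e.g. the bilabials /p/, /b/, /m/) collapse into a single viseme class, and the mapping is inverted for lookup at synthesis time.

```python
# Illustrative only, not the exact MikeTalk grouping: phonemes that look the
# same on the lips are collapsed into a single viseme class.
VISEME_CLASSES = {
    "bilabial":    ["p", "b", "m"],   # lips pressed together
    "labiodental": ["f", "v"],        # lower lip against upper teeth
    "silence":     ["sil"],           # closed, neutral mouth
}

# Invert the table so a phoneme can be mapped to its viseme at synthesis time.
PHONEME_TO_VISEME = {ph: vis for vis, phonemes in VISEME_CLASSES.items()
                     for ph in phonemes}

print(PHONEME_TO_VISEME["b"])  # -> "bilabial"
```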

Problem 1: Need to Interpolate!

Solution: Morphing!

Simultaneous interpolation of shape & texture (Beier & Neely 1992)

Problem 2: Too tedious to specify correspondence by hand across many images!

Solution: Optical Flow

• To interpolate between two visemes, optical flow is first computed

• A 2D motion vector field (dx(x, y), dy(x, y)) is produced (see the flow sketch below)

(Horn & Schunck 1981; Lucas & Kanade 1981)
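A minimal sketch of computing the flow between two viseme images. The slides cite Horn & Schunck and Lucas & Kanade; OpenCV's Farnebäck routine is used here only as a readily available dense-flow stand-in, and the two synthetic "mouth" images exist just so the snippet runs on its own.

```python
import cv2
import numpy as np

# Two synthetic grayscale viseme images: a filled circle standing in for the mouth.
viseme_a = np.zeros((120, 160), dtype=np.uint8)
viseme_b = np.zeros((120, 160), dtype=np.uint8)
cv2.circle(viseme_a, (80, 60), 20, 255, -1)   # mouth in image A
cv2.circle(viseme_b, (85, 60), 25, 255, -1)   # slightly shifted and opened in image B

# Dense optical flow from A to B (Farneback as a stand-in for the cited methods).
# flow[y, x] = (dx(x, y), dy(x, y)): where each pixel of A moves to match B.
flow = cv2.calcOpticalFlowFarneback(viseme_a, viseme_b, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
dx, dy = flow[..., 0], flow[..., 1]
print(dx.shape, dy.shape)  # two 2D fields, one per image axis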

Morphing

• Forward warping A to B

• Forward warping B to A

• Blending

• Hole filling (a warp-and-blend sketch follows)
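A minimal sketch of the warp-and-blend step, assuming flow fields like those from the previous snippet. The splatting warp and the single-pixel hole filling here are deliberately crude stand-ins for the actual forward-warping and hole-filling used in the system.

```python
import numpy as np

def forward_warp(image, dx, dy, alpha):
    """Push each pixel of `image` a fraction `alpha` of the way along the flow.
    A simple splatting warp; gaps it leaves are filled with the source pixel."""
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.float64)
    hits = np.zeros((h, w), dtype=np.float64)
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.clip(np.round(xs + alpha * dx).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + alpha * dy).astype(int), 0, h - 1)
    np.add.at(out, (yt, xt), image)    # splat source pixels onto their targets
    np.add.at(hits, (yt, xt), 1.0)
    filled = hits > 0
    out[filled] /= hits[filled]        # average where several pixels land
    out[~filled] = image[~filled]      # crude hole filling
    return out

def morph(A, B, flow_ab, flow_ba, alpha):
    """Blend a forward warp of A toward B with a forward warp of B toward A."""
    warped_a = forward_warp(A, flow_ab[..., 0], flow_ab[..., 1], alpha)
    warped_b = forward_warp(B, flow_ba[..., 0], flow_ba[..., 1], 1.0 - alpha)
    return (1.0 - alpha) * warped_a + alpha * warped_b

# With zero flow the morph reduces to a plain cross-dissolve.
A, B = np.zeros((4, 4)), np.ones((4, 4))
zero_flow = np.zeros((4, 4, 2))
print(morph(A, B, zero_flow, zero_flow, 0.5))
```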

Synthesis Database

• 16 Visemes total

• 256 optical flow fields in total: one from each of the 16 visemes to each of the 16 (see the table-building sketch below)
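A minimal sketch of how such an all-pairs table could be precomputed: with 16 visemes the nested loop yields 16 × 16 = 256 entries, including the trivial self-to-self flows. The circle images and the Farnebäck call are stand-ins so the snippet runs on its own.

```python
import itertools
import cv2
import numpy as np

# Stand-in viseme images: mouths that "open" progressively.
visemes = []
for r in range(16):
    img = np.zeros((120, 160), dtype=np.uint8)
    cv2.circle(img, (80, 60), 10 + r, 255, -1)
    visemes.append(img)

# One dense flow field per ordered viseme pair, keyed by (from, to).
flow_table = {}
for i, j in itertools.product(range(len(visemes)), repeat=2):
    flow_table[(i, j)] = cv2.calcOpticalFlowFarneback(
        visemes[i], visemes[j], None, 0.5, 3, 15, 3, 5, 1.2, 0)

print(len(flow_table))  # 256
```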

Concatenation and Lip Sync

• Load the correct viseme transitions

• Concatenate viseme transitions

• Sample the viseme transitions using audio durations (see the sampling sketch below)
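A minimal sketch of the sampling step, under the assumption that each phoneme arrives with a duration taken from the audio. For every consecutive viseme pair, the transition is sampled at evenly spaced morph parameters so the number of video frames matches the phoneme's duration; `sample_transitions` and its input format are hypothetical.

```python
FPS = 30  # assumed video frame rate

def sample_transitions(phoneme_track):
    """phoneme_track: list of (viseme_index, duration_seconds) pairs.
    Returns one (from_viseme, to_viseme, alpha) triple per output video frame."""
    frames = []
    for (v_from, dur), (v_to, _) in zip(phoneme_track, phoneme_track[1:]):
        n = max(1, round(dur * FPS))              # frames allotted to this phoneme
        for k in range(n):
            frames.append((v_from, v_to, k / n))  # alpha runs from 0 toward 1
    return frames

# Example: three phonemes (as viseme indices) with their audio durations.
print(sample_transitions([(3, 0.12), (7, 0.20), (0, 0.15)])[:5])
```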

Examples

“1, 2, 3, 4, 5”

“cat, dog, pig, cow, moose, horse, sheep”

“You have received 10 email messages.”

Current Work

• Coarticulation

• Eye + head movements

• Emotion

• 3D instead of 2D

• Psychophysics

3D (with Volker Blanz)

The End

Coarticulation

• Problem: Current method does not handle coarticulation, so speech looks overly articulated

• Can record all possible triphones/quadriphones, but this approach requires a lot of data!

• Best method is to learn a model for coarticulation, but what is the representation for the lips?

Principal Components Analysis

• Each image is a vector in a high-dimensional space

• Using PCA, find the optimal set of vectors that span the space

• Project the entire corpus onto those basis vectors (see the sketch below)
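A minimal sketch of this step: each lip image is flattened into a vector, the corpus becomes a data matrix, and an SVD of the centered data yields the principal components onto which the corpus is projected. The random matrix stands in for the real corpus so the snippet runs.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.random((200, 32 * 32))      # stand-in: 200 flattened 32x32 lip images

mean = corpus.mean(axis=0)
X = corpus - mean                        # center the data
U, S, Vt = np.linalg.svd(X, full_matrices=False)

basis = Vt[:2]                           # top-2 PCA bases (cf. the /buut/ and /get/ slides)
coeffs = X @ basis.T                     # projection of the whole corpus onto them
print(basis.shape, coeffs.shape)         # (2, 1024) (200, 2)
```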

Top 2 PCA Bases for /buut/

Top 2 PCA Bases for /get/

Problem: Too nonlinear!

Flow Component Analysis

• Compute optical flow from a reference lip image to all other images in the corpus

• Compute PCA on all the flows (see the sketch below)
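A minimal sketch of flow component analysis, mirroring the PCA snippet above: dense optical flow is computed from one reference lip image to every other image, each (dx, dy) field is flattened into a vector, and PCA is run on those flow vectors instead of on raw pixels. The circle images and the Farnebäck call are stand-ins so the snippet runs on its own.

```python
import cv2
import numpy as np

# Stand-in lip images that "open" progressively; the first one is the reference.
corpus = []
for r in range(8, 28):
    img = np.zeros((64, 64), dtype=np.uint8)
    cv2.circle(img, (32, 32), r, 255, -1)
    corpus.append(img)
reference = corpus[0]

# One flattened flow vector per corpus image (dx and dy concatenated).
flows = []
for img in corpus[1:]:
    flow = cv2.calcOpticalFlowFarneback(reference, img, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    flows.append(flow.reshape(-1))

F = np.stack(flows)
F = F - F.mean(axis=0)                   # center the flow vectors
U, S, Vt = np.linalg.svd(F, full_matrices=False)
flow_basis = Vt[:2]                      # top-2 flow components (cf. the FPCA slides)
print(flow_basis.shape)
```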

Top 2 FPCA Bases for /buut/

Top 2 FPCA Bases for /get/

Much more linear behavior!

Current Work

• Now that we have parameterized the mouth, what is the model for mouth synthesis?

• How is that model fit to the PCA data?
