
Page 1: Pattern Recognition in Acoustic Signal Processing

Pattern Recognition in Acoustic Signal Processing

Mark Hasegawa-Johnson

These slides: http://www.isle.uiuc.edu/slides/2009/Hasegawa-Johnson09mlss.pdf

Machine Learning Summer School, June 11, 2009

Page 2: Pattern Recognition in Acoustic Signal Processing

Outline

1 Why Use Pattern Recognition in Acoustic Signal Processing?

2 Four Criteria for Choosing a Pattern Recognizer

3 Discriminative Methods
   Hypothesis Space: Universal Approximators
   Training Criteria: Differentiable Error Metric
   Complications: Label Dynamics, Data Sparsity

4 Bayesian Methods
   Hypothesis Space: Latent Variables
   Training Criteria: Maximum Likelihood, MAP, MaxEnt
   Complications: HMM Regression, Switching Kalman Smoothers

5 Hybrid Methods
   Discriminative Hidden Nodes, Bayesian Inference
   Discriminative Feature Selection for Acoustic Event Detection
   Bayesian Re-Estimation of the “Discriminative” Hidden Nodes

6 Conclusions

Page 3: Pattern Recognition in Acoustic Signal Processing

Why Use Pattern Recognition in Acoustic Signal Processing?

The Scientific Method

y = h(x)

Hypothesize-Measure-Test

1 Based on knowledge of the physical situation, form:
   1 a hypothesis
   2 a null hypothesis

2 Collect data: (x_t, y_t), 1 ≤ t ≤ T.

3 Test the hypothesis: measure P(data|null hypothesis)

Page 4: Pattern Recognition in Acoustic Signal Processing

Why Use Pattern Recognition in Acoustic Signal Processing?

The Pattern Recognition Method

y = h(x)

Hypothesize-Measure-Learn-Test

1 Form an infinite set of hypotheses (called the “hypothesis space”), usually a parameterized universal approximator.

2 Collect:
   1 training data (x_t, y_t), 1 ≤ t ≤ T
   2 testing data (x_t, y_t), T + 1 ≤ t ≤ T + N

3 Train the hypothesis: maximize P(hypothesis|training data)

4 Test the hypothesis: measure P(testing data|hypothesis)

Page 5: Pattern Recognition in Acoustic Signal Processing

Why Use Pattern Recognition in Acoustic Signal Processing?

Example: PR in the Scientific Method

Page 6: Pattern Recognition in Acoustic Signal Processing

Four Criteria for Choosing a Pattern Recognizer

Criteria for Choosing a Pattern Recognizer

1 Structure of the Model
   1 Discriminative Training: all parameters in the model can be simultaneously adjusted to minimize a global error metric
   2 Bayesian Training: components must be separately trained, then combined without blowing up

2 Size of the Training Database
   1 Empirical Risk Minimization: the training database includes 10,000 independent trials; train the model to minimize training-database error
   2 Structural Risk Minimization: the training database is smaller than 1000 trials; train the model to minimize

P(Error) \le (\text{Training Corpus Error}) + \lambda \frac{(\text{Model Complexity})}{(\text{Training Corpus Size})}

3 Dynamic State
   1 y = h(x) has no hidden state (classification, regression)
   2 y = h(x) has hidden state (recognition, tracking)

4 Function Range
   1 y = h(x) is an integer (classification, recognition)
   2 y = h(x) is a real-valued vector (regression, tracking)

Page 10: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Discriminative Training—Gradient Descent Methods

1 Choose a hypothesis space (a universal approximator)

2 Choose a differentiable error metric

3 Apply the Chain Rule

Page 11: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Hypothesis Space: Universal Approximators

Universal Approximators

Universal Approximator: Definition

A parameterized function space h_Θ(x), with parameter vector Θ ∈ ℜ^{αK}, is called a universal approximator if, for any bounded h(x) with finite domain,

\lim_{K \to \infty} \min_{\Theta} \| h_\Theta(x) - h(x) \| = 0

Example: Sigmoidal Neural Network

1 Sigmoidal Neural Network (Θ = {c_1, . . . , c_K, w_1, . . . , w_K}):

h_\Theta(x) = \sum_{k=1}^{K} c_k \frac{1}{1 + e^{-x^T w_k}}
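To make the definition concrete, here is a minimal NumPy sketch of the sigmoidal network above (not from the slides; the dimensions, random parameters, and test input are assumptions):

```python
import numpy as np

def sigmoid_net(x, c, W):
    """h_Theta(x) = sum_k c_k / (1 + exp(-x^T w_k)).

    x: (D,) input; c: (K,) output weights; W: (K, D) rows w_k.
    """
    return c @ (1.0 / (1.0 + np.exp(-W @ x)))

rng = np.random.default_rng(0)
K, D = 50, 3                      # K sigmoid units, D input dimensions
c = rng.normal(size=K)
W = rng.normal(size=(K, D))
print(sigmoid_net(rng.normal(size=D), c, W))
```

With enough units K, networks of this form can approximate any bounded target on a compact domain, which is exactly the universal-approximation property defined above.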

Page 12: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Hypothesis Space: Universal Approximators

Universal Approximators

More Examples: Universal Approximators

1 Sigmoidal Neural Network (Θ = {c_1, . . . , c_K, w_1, . . . , w_K})

2 Mixture Gaussian (Θ = {c_1, . . . , c_K, µ_1, . . . , µ_K, Σ_1, . . . , Σ_K}):

h_\Theta(x) = \sum_{k=1}^{K} c_k\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}

3 Classification and Regression Tree (CART), K Nearest Neighbors (KNN), etc. (Θ = {b_1, . . . , b_K, w_1, . . . , w_K, R_1, . . . , R_K}):

h_\Theta(x) = x^T w_k + b_k \quad \text{if } x \in R_k

where R_k is a region with piece-wise linear boundaries.

Page 13: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Training Criteria: Differentiable Error Metric

Differentiable Error Metric

Minkowski Norm Error Metrics

E_L = \frac{1}{T} \sum_{t=1}^{T} \| h_\Theta(x_t) - y_t \|_L^L

The “Best” Metric: The Zero Norm

\| h_\Theta(x_t) - y_t \|_0 \triangleq \begin{cases} 0 & y_t = h_\Theta(x_t) \\ 1 & \text{otherwise} \end{cases}

Problem: if E_0 ≠ 0, what do we do next? The zero-one loss is flat almost everywhere, so its gradient carries no information about which direction to move the parameters.

Page 14: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Training Criteria: Differentiable Error Metric

Differentiable Error Metric

Minkowski Norm Error Metrics

E_L = \frac{1}{T} \sum_{t=1}^{T} \| h_\Theta(x_t) - y_t \|_L^L

Differentiable Error Metrics: L = 1, L = 2

1 The One Norm (Manhattan Distance, L = 1):

\frac{\partial E_1}{\partial h_\Theta(x_t)} = \operatorname{sign}\left( h_\Theta(x_t) - y_t \right)

2 The Two Norm (Euclidean Distance, L = 2):

\frac{\partial E_2}{\partial h_\Theta(x_t)} = h_\Theta(x_t) - y_t

Page 15: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Training Criteria: Differentiable Error Metric

Apply the Chain Rule

The Error Back-Propagation Algorithm

1 Initialize: choose some initial parameter set Θ^{(0)}

2 Iterate: for i = 1, . . . , until E(Θ) stops changing:

\Theta^{(i)} = \Theta^{(i-1)} - \eta \nabla_\Theta E

\nabla_\Theta E \triangleq \sum_{t=1}^{T} \left( \frac{\partial E}{\partial h_\Theta(x_t)} \right) \left( \nabla_\Theta h_\Theta(x_t) \right)
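Putting the three steps together, here is a minimal sketch of error back-propagation for the sigmoidal network under the two-norm metric (not from the slides; the toy sine-regression data, unit count, learning rate, and iteration budget are all assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, K=20, eta=0.5, iters=5000, seed=0):
    """Gradient descent on E_2 for h(x) = sum_k c_k * sigmoid(x^T w_k)."""
    rng = np.random.default_rng(seed)
    T, D = X.shape
    c = rng.normal(scale=0.5, size=K)
    W = rng.normal(scale=0.5, size=(K, D))
    for _ in range(iters):
        S = sigmoid(X @ W.T)              # (T, K): sigma(x_t^T w_k)
        h = S @ c                         # (T,): network outputs
        err = h - y                       # dE_2/dh(x_t); constants fold into eta
        grad_c = S.T @ err / T            # chain rule through the output weights
        grad_W = ((err[:, None] * S * (1 - S) * c).T @ X) / T
        c -= eta * grad_c                 # Theta_i = Theta_{i-1} - eta * grad
        W -= eta * grad_W
    return c, W

# Toy 1-D regression; a constant column plays the role of a bias input.
x = np.linspace(-3, 3, 200)
X = np.column_stack([x, np.ones_like(x)])
c, W = train(X, np.sin(x))
```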

Page 16: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Complications: Label Dynamics, Data Sparsity

Wrinkle #1: Recognition, Tracking

A Recursive Neural Net is a neural net with one or more hidden state variables:

h_\Theta(x_t) = \sum_{k=1}^{K} c_k \frac{1}{1 + e^{-[x_t^T,\, h_\Theta(x_{t-1})]\, w_k}}

Training is performed using Back-Propagation Through Time:

\frac{\partial E_2}{\partial h_\Theta(x_t)} = \left( h_\Theta(x_t) - y_t \right) + \left( h_\Theta(x_{t+1}) - y_{t+1} \right) \left( \frac{\partial h_\Theta(x_{t+1})}{\partial h_\Theta(x_t)} \right) + \ldots
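For concreteness, a forward-pass sketch of the recursive net above (an illustration, not the slides' implementation: the shapes and the zero initial state are assumptions; gradients would follow the BPTT sum just shown):

```python
import numpy as np

def rnn_forward(X, c, W, h0=0.0):
    """h_Theta(x_t) with the previous output appended to the current input.

    X: (T, D) input frames; c: (K,) output weights; W: (K, D+1) rows w_k.
    Returns the (T,) sequence of outputs.
    """
    T = X.shape[0]
    h = np.zeros(T)
    prev = h0
    for t in range(T):
        z = np.append(X[t], prev)                   # [x_t^T, h(x_{t-1})]
        h[t] = c @ (1.0 / (1.0 + np.exp(-W @ z)))
        prev = h[t]
    return h
```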

Page 17: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Complications: Label Dynamics, Data Sparsity

Recursive Neural Nets: Example Application

Task Description

Goal of the RNN: detect pitch accents (syllables that the talker has marked as “prominent” by some type of F0 movement)

Blue = x_t (F0 = RNN input)

Yellow = y_t (pitch accent = target RNN output)

Pink = f(x_t) (RNN-estimated pitch-accent probability)

Example Results

Page 18: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Complications: Label Dynamics, Data Sparsity

Wrinkle #2: Small Training Corpus

Test Corpus Error Bounds based on the Central Limit Theorem

P(\text{Error}) \le E(X, Y, \Theta) + G(\Theta)

Example Bounds

Minimum Description Length: K(Θ) describes h_Θ(x) as a binary program written in signed bits; G(Θ) is its Kolmogorov complexity:

G(\Theta) \propto \| K(\Theta) \|_0

Support Vector Machines: ψ(Θ) describes h_Θ(x) as a linear classifier in an augmented feature space, and

G(\Theta) \propto \| \psi(\Theta) \|_2^2

Page 19: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Complications: Label Dynamics, Data Sparsity

Support Vector Machines: Example Application

Consonant-vowel transitions, and similar manner-change landmarks, are good places to look for information about speech [Stevens, Interspeech 2000]

Some landmarks occur relatively infrequently; it’s hard to learn what they sound like.

SVMs do the job [Niyogi, Burges, and Ramesh, 1999; Borys and Hasegawa-Johnson, 2005; chance = 50%]:

SVM             %ACC    SVM             %ACC
-+Silence       92.1    +-Silence       91.6
-+Continuant    79.7    +-Continuant    81.1
-+Sonorant      86.4    +-Sonorant      91.1
-+Syllabic      88.6    +-Syllabic      78.5
-+Consonantal   78.1    +-Consonantal   73.1
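A minimal scikit-learn sketch of a landmark detector of this kind, δ_t = sign(g(x_t)) (not the authors' system: the synthetic data, feature dimension, and RBF-kernel settings are all assumptions):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for acoustic feature vectors at candidate landmarks;
# label +1 if the landmark type is present, -1 otherwise.
X = rng.normal(size=(400, 40))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)

svm = SVC(kernel="rbf", C=1.0).fit(X[:300], y[:300])
g = svm.decision_function(X[300:])    # real-valued discriminant g(x_t)
delta = np.sign(g)                    # landmark present / absent
print("accuracy:", np.mean(delta == y[300:]))
```

The structural-risk bound above is one reason an SVM suits this small-sample regime: maximizing the margin directly minimizes the complexity term ‖ψ(Θ)‖² in the bound.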

Page 20: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Complications: Label Dynamics, Data Sparsity

Wrinkle #3: Label Dynamics AND Small Dataset

Landmark detectors (δ_{it} = sign(g_i(x_t))) must be trained using a small dataset, so we use SVMs

Phonetic landmarks tell us about the phonemes that were produced (Q = [q_1, . . . , q_T], q_t ∈ set of phoneme labels)

. . . but each phoneme also depends on the previous phoneme:

p(q_t = j) = f_j\left( \vec{\delta}_t, q_{t-1} \right)

Functions like f_j(·) are hard to train using purely discriminative methods, but relatively easy using Bayesian methods

Page 23: Pattern Recognition in Acoustic Signal Processing

Bayesian Methods

Bayesian Classification and Recognition:

y^* = \arg\max_y p_\Theta(y \mid x)

Bayesian Regression and Tracking:

y^* = E\{ y \mid x \}

Advantages and Disadvantages

Disadvantage: one must learn p(x, y); this usually requires more data, and is subject to more error, than learning y = h(x) directly

Advantage: Bayesian inference allows modeling of latent variables, state dynamics, and extra sources of information in a principled manner

Page 24: Pattern Recognition in Acoustic Signal Processing

Bayesian Methods

Hypothesis Space: Latent Variables

Hypothesis Space Example: Hidden Markov Model

Let X = [x_1, . . . , x_N] be the observations and Y = [y_1, . . . , y_N] the labels. A Hidden Markov Model posits the existence of some latent variables Q = [q_1, . . . , q_N] such that

h_\Theta(X) = \arg\max_Y p_\Theta(X, Y)

p_\Theta(X, Y) = \sum_Q \prod_{t=1}^{T} p_\Theta(y_t \mid y_{t-1})\, p_\Theta(q_t \mid q_{t-1}, y_t)\, p_\Theta(x_t \mid q_t)

p_Θ(y_t | y_{t−1}) (the “language model”) is a lookup table

p_Θ(q_t | q_{t−1}, y_t) (the “pronunciation model”) is a lookup table

p_Θ(x_t | q_t) (the “acoustic model”) is a Gaussian, with mean vector µ_q and covariance matrix Σ_q
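Decoding such a model is a dynamic program. A generic Viterbi sketch over the latent states (an illustration, not the slides' word/phone-factored recognizer; all shapes are assumptions, and the per-frame Gaussian log-likelihoods log_B are assumed precomputed):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state path arg max_Q p(X, Q) for a generic HMM.

    log_pi: (S,)   log p(q_1)
    log_A:  (S, S) log p(q_t = j | q_{t-1} = i)
    log_B:  (T, S) log p(x_t | q_t = j)
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]             # best log-prob ending in each state
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A   # (S, S): extend each path i -> j
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_B[t]
    path = [int(np.argmax(delta))]        # backtrace from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```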

Page 25: Pattern Recognition in Acoustic Signal Processing

Bayesian Methods

Training Criteria: Maximum Likelihood, MAP, MaxEnt

Learn the Distributions: Maximum Likelihood

Maximum Likelihood Parameter Estimation

\Theta = \arg\max_\Theta \log p_\Theta(X, Y)

What About Small Training Corpora?

Maximum Likelihood is a form of empirical risk minimization

Related forms of structural risk minimization include

MAP (maximum a posteriori probability):

\Theta = \arg\max_\Theta \left( \log p(X, Y \mid \Theta) + \log p(\Theta) \right)

MaxEnt (maximum entropy):

\Theta = \arg\max_\Theta \left( \log p_\Theta(X, Y) + H(p_\Theta) \right)

Page 26: Pattern Recognition in Acoustic Signal Processing

Bayesian Methods

Training Criteria: Maximum Likelihood, MAP, MaxEnt

Apply Bayes’ Rule

Bayes’ Rule (a.k.a. the Definition of Conditional Probability)

p_\Theta(X, Y) = \sum_Q \prod_{t=1}^{T} p_\Theta(y_t \mid y_{t-1})\, p_\Theta(q_t \mid q_{t-1}, y_t)\, p_\Theta(x_t \mid q_t)

Training a Bayesian Classifier: Maximum Likelihood

\Theta = \arg\max_\Theta p_\Theta(X, Y)

Testing a Bayesian Classifier: Minimum Probability of Error

Y = \arg\max_Y p_\Theta(X, Y)

Page 27: Pattern Recognition in Acoustic Signal Processing

Bayesian Methods

Complications: HMM Regression, Switching Kalman Smoothers

Bayesian Regression and Tracking

Hidden Markov Regression

Suppose y_t ∈ ℜ^D is a real-valued vector, and (x_t, y_t) are jointly Gaussian:

p(x, y \mid q) \propto \exp\left( -\frac{1}{2} \begin{bmatrix} x_t - \bar{x}_q \\ y_t - \bar{y}_q \end{bmatrix}^T \begin{bmatrix} A_q & B_q \\ B_q^T & C_q \end{bmatrix}^{-1} \begin{bmatrix} x_t - \bar{x}_q \\ y_t - \bar{y}_q \end{bmatrix} \right)

Then h(x_t) = arg min E_2 is the posterior mean:

E\{ y_t \mid X \} = \sum_{q_t} P(q_t \mid X) \left( \bar{y}_q + B_q^T A_q^{-1} (x - \bar{x}_q) \right)
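A compact sketch of that posterior-mean estimator (hypothetical shapes; the state posteriors P(q_t | X) are assumed to come from an HMM forward-backward pass, which is not shown):

```python
import numpy as np

def hmm_regression(x, post_q, xbar, ybar, A, B):
    """E{y_t | X} = sum_q P(q_t = q | X) (ybar_q + B_q^T A_q^{-1} (x - xbar_q)).

    x:      (Dx,)        current observation
    post_q: (S,)         P(q_t = q | X)
    xbar:   (S, Dx)      per-state means of x;  ybar: (S, Dy) means of y
    A:      (S, Dx, Dx)  per-state covariance of x
    B:      (S, Dx, Dy)  per-state cross-covariance of x and y
    """
    y = np.zeros(ybar.shape[1])
    for q in range(len(post_q)):
        cond = ybar[q] + B[q].T @ np.linalg.solve(A[q], x - xbar[q])
        y += post_q[q] * cond     # mixture of per-state linear regressions
    return y
```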

Page 28: Pattern Recognition in Acoustic Signal Processing

Bayesian Methods

Complications: HMM Regression, Switching Kalman Smoothers

Switching Kalman Smoother

[Graphical model: state chain s_{t−1} → s_t → s_{t+1}, with observations x_{t−1}, x_t, x_{t+1} and a chain of targets y_{t−1}, y_t, y_{t+1}]

Setup: exactly like HMM regression, except that (x_t, y_t, y_{t−1}) are jointly Gaussian

Result: exactly like HMM regression, except that ȳ_{t|q,X}, A_{t|q,X}, and B_{t|q,X} must be updated using interacting multiple Kalman filters:

E\{ y_t \mid X \} = \sum_{q_t} P(q_t \mid X) \left( \bar{y}_{t|q,X} + B_{t|q,X}^T A_{t|q,X}^{-1} (x - \bar{x}_q) \right)

Page 29: Pattern Recognition in Acoustic Signal Processing

Bayesian Methods

Complications: HMM Regression, Switching Kalman Smoothers

Switching Kalman Smoother: Example

Task Description

x_t is an acoustic spectrum; y_t is the corresponding vector of speech articulator positions

Results (unpublished):

E2(Switching Kalman Smoother) < E2(HMM Regression)

Difference is consistent but very small

Page 30: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Hybrid Discriminative-Bayesian Systems

Task Scenario

y_t is very difficult to classify without dynamic information (e.g., speech recognition: y_t = words, x_t = short-time spectrum)

An auxiliary variable f_t can be inferred very accurately using a local classifier (e.g., f_t = phonological distinctive features):

f_t = h_Θ(x_t)

Training database includes X = [x_1, . . . , x_N], Y = [y_1, . . . , y_N], and F = [f_1, . . . , f_N]

Testing database includes only X = [x_{N+1}, . . . , x_{N+M}]

Page 31: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Hidden Nodes, Bayesian Inference

Hybrid Training Methods

Training Algorithm

Train h_Θ(x) using discriminative methods

Minimize

E_L = \frac{1}{T} \sum_{t=1}^{T} \| f_t - h_\Theta(x_t) \|_L^L

h_Θ(x_t) is a real-valued vector that approximates f_t

Train a probability model p_Θ(F, X, Y) using Bayesian methods

Using p_Θ(F, X, Y), Bayesian inference can model dynamics of hidden states, incorporate multiple knowledge sources, etc.

Page 32: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Hidden Nodes, Bayesian Inference

Example: Landmark-Based Speech Recognizer

SVM-HMM Hybrid Landmark-Based Speech Recognizer

SVM computes a real-valued distinctive feature that optimally discriminates between the case f_t = 1 (landmark of a specified type is present) and f_t = −1 (landmark absent)

HMM computes p(phoneme sequence|landmark sequence)

Page 33: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Hidden Nodes, Bayesian Inference

Phone Recognition Accuracy vs. Mixture Size, Telephone Speech

MFCC = mel frequency cepstral coefficients

Landmark = detect manner-to-manner landmarks, e.g., obstruent-to-sonorant

Manner = detect manner-onset landmarks, e.g., onset of sonorant region

Page 34: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Hidden Nodes, Bayesian Inference

Example: RNN with Kalman Smoothing

Mitra et al., in review

[Figure: synthetic utterance “ground”: spectrogram (frequency vs. time, 0-4000 Hz) above panels of tract variables VEL, TBCL, TBCD, TTCL, TTCD, comparing ANN+Kalman (AP) estimates against true data]

Page 35: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Feature Selection for Acoustic Event Detection

Example: Non-Speech Acoustic Event Detection

Motivation

“Activity detection and description is a key functionality of perceptually aware interfaces working in collaborative human communication environments. . . detection and classification of acoustic events may help to detect and describe human activity. . . ” (CLEAR-AED Task Brief)

Difficulties

Negative SNR (speech is “background noise”)

Unknown spectral structure

Different spectral structure for each event type

Page 36: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Feature Selection for Acoustic Event Detection

Difficulty #1: Negative SNR

Page 37: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Feature Selection for Acoustic Event Detection

Difficulty #2: Unknown Spectral Structure

[Spectrogram panels: key jingle, footsteps, speech]

Page 38: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Feature Selection for Acoustic Event Detection

Discriminative Feature Selection for AED

Zhuang et al., ICASSP 2008

Problem: what acoustic features are relevant for detecting non-speech acoustic events?

Input: x_t ∈ ℜ^D includes many acoustic features invented for speech processing (MFCC, PLP, energy, ZCR)

Output: f_t ∈ ℜ^d selects the most useful features:

f_t = W x_t

where W^T = [w_1, . . . , w_K], and each w_k is an indicator vector (only one non-zero element)

Hidden Markov Modeling: the label sequence Y^* = [y^*_1, . . . , y^*_T], y_t ∈ {key jingle, footstep, . . .}, is chosen by a hidden Markov model observing F = [f_1, . . . , f_N]:

Y^* = \arg\max_Y p(F \mid Y)\, p(Y)
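The selection step itself is just a sparse linear map. A tiny sketch (hypothetical dimensions; choosing which columns go in idx is the subject of the next two slides):

```python
import numpy as np

def select_features(X, idx):
    """f_t = W x_t where each row of W is an indicator vector.

    X: (T, D) frames by features; idx: the d selected feature columns.
    """
    W = np.zeros((len(idx), X.shape[1]))
    W[np.arange(len(idx)), idx] = 1.0   # one nonzero element per row
    return X @ W.T                      # (T, d) selected features

X = np.random.default_rng(0).normal(size=(100, 39))  # e.g., 39-dim MFCCs
F = select_features(X, [0, 3, 7])
```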

Page 39: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Feature Selection for Acoustic Event Detection

Bayes Error Rate

Zhuang et al., ICASSP 2008

Bayes Error Rate

Let w_k be an indicator vector (all zeros except for one element). The Bayes-optimal error rate of a classifier observing the feature w_k^T x is

P(\text{error}) = \int\!\!\int P\left( y \neq \arg\max_{y'} p(w_k^T x, y') \right) dy\, dx

Bayes Error Rate Approximated on a Database

F(w_k) = \frac{1}{T} \sum_{t=1}^{T} \delta\left( y_t \neq \arg\max_{y} p(w_k^T x_t, y) \right)

Page 40: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Feature Selection for Acoustic Event Detection

Feature Selection Algorithms

Hard-Bayes-Error Feature Selection

For k = 1, . . . , K, choose the indicator vector w_k (all zeros except for one nonzero element) to minimize

F(w_k) = \frac{1}{T} \sum_{t=1}^{T} \delta\left( y_t \neq \arg\max_{y} p(w_k^T x_t, y) \right)

Soft-Bayes-Error Feature Selection

For k = 1, . . . , K, choose the indicator vector w_k (all zeros except for one nonzero element) to minimize

F_S(w_k) = \frac{1}{T} \sum_{t=1}^{T} \operatorname{rank}\left( y_t \mid w_k^T x_t \right)
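A sketch of the hard-Bayes-error criterion, with the joint density p(w_k^T x, y) approximated by a histogram (the binning, integer labels, and greedy top-K selection are all assumptions; the slides do not specify the density estimator):

```python
import numpy as np

def hard_bayes_error(f, y, bins=20):
    """Empirical F(w_k) for one scalar feature f = w_k^T x_t: quantize f,
    predict the most frequent label in each bin, and count errors.
    y: integer class labels."""
    edges = np.quantile(f, np.linspace(0, 1, bins + 1)[1:-1])
    cell = np.digitize(f, edges)
    errors = 0
    for c in np.unique(cell):
        counts = np.bincount(y[cell == c])
        errors += (cell == c).sum() - counts.max()
    return errors / len(y)

def select_k_features(X, y, K):
    """Greedy: keep the K single features with lowest empirical Bayes error."""
    scores = [hard_bayes_error(X[:, d], y) for d in range(X.shape[1])]
    return np.argsort(scores)[:K]
```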

Page 41: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Feature Selection for Acoustic Event Detection

Acoustic Event Detection Results

Zhuang et al., ICASSP 2008

MFCC26DAZ = 26 Mel-frequency cepstral coefficients + deltas + acceleration

DERIVE26DAZ = 26 Derived features + deltas + acceleration

DERIVE78 = 78 Derived features

Page 42: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Bayesian Re-Estimation of the “Discriminative” Hidden Nodes

Bayesian Re-Estimation of Discriminative Nodes

Problem: what projection of the acoustic spectrogram is relevant for recognizing acoustic events?

Discriminative Transform: f_t ∈ ℜ^d selects the most useful features:

f_t = \sum_{k=1}^{K} c_k\, \sigma(w_k^T x_t)

where c_k ∈ ℜ^d and w_k ∈ ℜ^D are arbitrary real-valued weight vectors, and σ(z) = 1/(1 + e^{−z}).

Bayesian Inference: the label sequence Y^* = [y^*_1, . . . , y^*_N], y_t ∈ {key jingle, footstep, . . .}, is chosen by a hidden Markov model observing F = [f_1, . . . , f_N]:

Y^* = \arg\max_Y p(F \mid Y)\, p(Y)

Page 43: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Bayesian Re-Estimation of the “Discriminative” Hidden Nodes

The Baum-Welch Algorithm

Hidden Markov model parameters are trained to maximize the expected log likelihood, with the expectation taken over the unknown state sequence Q = [q_1, . . . , q_N]:

F = E_Q \{ \log p(F, Q) \}

F = -\frac{1}{2} \sum_{t=1}^{T} \sum_{q} p(q_t = q \mid F, Y)\, (f_t - \mu_q)^T \Sigma_q^{-1} (f_t - \mu_q) - \ldots

Page 44: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Bayesian Re-Estimation of the “Discriminative” Hidden Nodes

Baum-Welch Back-Propagation

The neural network can be trained, using standard gradient descent methods, to maximize F. For example,

f_t = \sum_{k=1}^{K} c_k\, \sigma(w_k^T x_t)

\frac{\partial F}{\partial c_k} = \sum_{t=1}^{T} \sum_{q} p(q_t = q \mid F) \left( \frac{\partial F}{\partial f_t} \Big|_{q_t = q} \right) \left( \frac{\partial f_t}{\partial c_k} \right) = \sum_{t=1}^{T} \sum_{q} p(q_t = q \mid F)\, \Sigma_q^{-1} (\mu_q - f_t)\, \sigma(w_k^T x_t)
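A sketch of that gradient computation (hypothetical shapes; the state posteriors post = p(q_t = q | F) are assumed to come from a forward-backward pass that is not shown):

```python
import numpy as np

def grad_c(X, post, mu, Sigma_inv, c, W):
    """dF/dc_k for f_t = sum_k c_k sigma(w_k^T x_t); ascend with c += eta * g.

    X: (T, D) inputs; post: (T, S) state posteriors; mu: (S, d) state means;
    Sigma_inv: (S, d, d) inverse covariances; c: (K, d); W: (K, D).
    """
    act = 1.0 / (1.0 + np.exp(-X @ W.T))      # (T, K): sigma(w_k^T x_t)
    f = act @ c                               # (T, d): hidden nodes f_t
    g = np.zeros_like(c)
    for q in range(post.shape[1]):
        resid = (mu[q] - f) @ Sigma_inv[q]    # (T, d): Sigma_q^{-1}(mu_q - f_t)
        g += act.T @ (post[:, q, None] * resid)   # sum over t and q
    return g
```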

Page 45: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Bayesian Re-Estimation of the “Discriminative” Hidden Nodes

The Problem of Spurious Maxima

It is always possible to train a mixture Gaussian so that F = ∞:

   Give one of the Gaussians a nonzero variance (|Σ_1| > 0), and one a zero variance (|Σ_2| = 0)
   Set the mean of the second Gaussian equal to any training token (µ_2 = f_t for any particular t)
   Result: F = ∞ + finite = ∞
   This is called “over-training”

In Baum-Welch Back-Propagation, the same result is obtained for ‖c_k‖ → 0 or ‖w_k‖ → 0

Solution: require ‖c_k‖ = 1, or, more generally, ‖∂f_t/∂x_t‖ = 1

Page 50: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Bayesian Re-Estimation of the “Discriminative” Hidden Nodes

Methods for Avoiding Spurious Maxima

Constrained optimization: maximize

L = F + \sum_k \left[ \alpha_k (\| c_k \| - 1) + \beta_k (\| w_k \| - 1) \right]

with Lagrange multipliers α_k and β_k chosen so that ‖c_k‖ = 1 and ‖w_k‖ = 1

Symplectic Maximum Likelihood Transform (SMLT, Omar and Hasegawa-Johnson, 2004): replace the neural network with one that computes a volume-preserving transform:

\left| \det J_f(x) \right| = 1

where J_f(x) = ∂f/∂x is the Jacobian of the transform

Page 51: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Bayesian Re-Estimation of the “Discriminative” Hidden Nodes

The Reflecting Symplectic Transform
Omar and Hasegawa-Johnson, 2004

Divide x and y arbitrarily into equal-length sub-vectors, x^T = [x_1^T, x_2^T], y^T = [y_1^T, y_2^T]. Interpret as follows:

x_1 is a vector of object positions

x_2 is a vector of velocities

V(x_2) is a scalar called the “kinetic energy”

T(y_1) is a scalar called the “potential energy”

Then the following transform is volume-preserving:

\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} x_1 - \nabla_{x_2} V \\ x_2 - \nabla_{y_1} T \end{bmatrix} = \begin{bmatrix} x_1 - g_1(x_2) \\ x_2 - g_2(x_1 - g_1(x_2)) \end{bmatrix}

g_1(x_2) and g_2(y_1) must be irrotational. The easiest way to guarantee this: train V(x_2) and T(y_1) directly, using Baum-Welch back-propagation
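A sketch of this transform with toy quadratic potentials (the potentials, dimensions, and input are assumptions; in the SMLT, V and T would be trained). Each half-step has a unit-triangular Jacobian, so |det J| = 1 by construction:

```python
import numpy as np

def reflecting_symplectic(x, grad_V, grad_T):
    """y1 = x1 - grad_V(x2); y2 = x2 - grad_T(y1). Volume-preserving."""
    x1, x2 = np.split(x, 2)
    y1 = x1 - grad_V(x2)          # Jacobian [[I, -dg1], [0, I]]: det = 1
    y2 = x2 - grad_T(y1)          # Jacobian [[I, 0], [-dg2, I]]: det = 1
    return np.concatenate([y1, y2])

# Toy potentials V(x2) = 0.5*a*||x2||^2 and T(y1) = 0.5*b*||y1||^2,
# whose gradients a*x2 and b*y1 are trivially irrotational.
a, b = 0.3, 0.7
y = reflecting_symplectic(np.ones(4),
                          grad_V=lambda x2: a * x2,
                          grad_T=lambda y1: b * y1)
print(y)
```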

Page 52: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Bayesian Re-Estimation of the “Discriminative” Hidden Nodes

SMLT+GMM for Phone Classification

Omar and Hasegawa-Johnson, 2004

Compute the phone label y_t given the MFCC cepstrum x_t

Symplectic maximum likelihood transform (SMLT) computes f_t(x_t)

Maximum likelihood linear transform (MLLT) computes f_t = W x_t

Gaussian mixture model (GMM) computes p(f_t | y_t)

Database: TIMIT

Features   Classifier   Accuracy
MFCC       GMM          73.7%
MLLT       GMM          74.6%
SMLT       GMM          75.6%

Page 53: Pattern Recognition in Acoustic Signal Processing

Conclusions

Pattern recognition for acoustic applications:

   Choose a hypothesis space (a family of universal approximators)
   Gradient descent to minimize error

Bayesian learning simplifies the use of structured models:

   Hidden-state dynamics
   External sources of information

Hybrid discriminative-Bayesian methods sometimes give the best of both worlds:

   Discriminative training = minimum error locally
   Bayesian inference = principled integration of disparate information sources
   “Discriminative” hidden nodes can be re-trained to optimize a global criterion if the training is constrained to avoid spurious maxima (e.g., volume-preserving)

Page 56: Pattern Recognition in Acoustic Signal Processing

Conclusions

Thank You!

http://www.isle.uiuc.edu/slides/2009/Hasegawa-Johnson09mlss.pdf