
Page 1: Pattern Recognition in Acoustic Signal Processing

Pattern Recognition in Acoustic Signal Processing

Mark Hasegawa-Johnson

These slides: http://www.isle.uiuc.edu/slides/2009/Hasegawa-Johnson09mlss.pdf

Machine Learning Summer School, June 11, 2009

Page 2: Pattern Recognition in Acoustic Signal Processing

Outline

1 Why Use Pattern Recognition in Acoustic Signal Processing?

2 Four Criteria for Choosing a Pattern Recognizer

3 Discriminative Methods
   Hypothesis Space: Universal Approximators
   Training Criteria: Differentiable Error Metric
   Complications: Label Dynamics, Data Sparsity

4 Bayesian Methods
   Hypothesis Space: Latent Variables
   Training Criteria: Maximum Likelihood, MAP, MaxEnt
   Complications: HMM Regression, Switching Kalman Smoothers

5 Hybrid Methods
   Discriminative Hidden Nodes, Bayesian Inference
   Discriminative Feature Selection for Acoustic Event Detection
   Bayesian Re-Estimation of the “Discriminative” Hidden Nodes

6 Conclusions

Page 3: Pattern Recognition in Acoustic Signal Processing

Why Use Pattern Recognition in Acoustic Signal Processing?

The Scientific Method

y = h(x)

Hypothesize-Measure-Test

1 Based on knowledge of the physical situation, form:
   1 a hypothesis
   2 a null hypothesis

2 Collect data: (x_t, y_t), 1 ≤ t ≤ T.

3 Test the hypothesis: measure P(data|null hypothesis)

Page 4: Pattern Recognition in Acoustic Signal Processing

Why Use Pattern Recognition in Acoustic Signal Processing?

The Pattern Recognition Method

y = h(x)

Hypothesize-Measure-Learn-Test

1 Form an infinite set of hypotheses (called the “hypothesis space”), usually a parameterized universal approximator.

2 Collect:
   1 training data (x_t, y_t), 1 ≤ t ≤ T
   2 testing data (x_t, y_t), T + 1 ≤ t ≤ T + N

3 Train the hypothesis: maximize P(hypothesis|training data)

4 Test the hypothesis: measure P(testing data|hypothesis)

Page 5: Pattern Recognition in Acoustic Signal Processing

Why Use Pattern Recognition in Acoustic Signal Processing?

Example: PR in the Scientific Method

Page 6: Pattern Recognition in Acoustic Signal Processing

Four Criteria for Choosing a Pattern Recognizer

Criteria for Choosing a Pattern Recognizer

1 Structure of the Model
   1 Discriminative Training: all parameters in the model can be simultaneously adjusted to minimize a global error metric
   2 Bayesian Training: components must be separately trained, then combined without blowing up

2 Size of the Training Database
   1 Empirical Risk Minimization: the training database includes 10,000 independent trials; train the model to minimize training-database error
   2 Structural Risk Minimization: the training database is smaller than 1000 trials; train the model to minimize

P(Error) \le (\text{Training Corpus Error}) + \lambda \frac{(\text{Model Complexity})}{(\text{Training Corpus Size})}

3 Dynamic State
   1 y = h(x) has no hidden state (classification, regression)
   2 y = h(x) has hidden state (recognition, tracking)

4 Function Range
   1 y = h(x) is an integer (classification, recognition)
   2 y = h(x) is a real-valued vector (regression, tracking)

Page 10: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Discriminative Training—Gradient Descent Methods

1 Choose a hypothesis space (a universal approximator)

2 Choose a differentiable error metric

3 Apply the Chain Rule

Page 11: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Hypothesis Space: Universal Approximators

Universal Approximators

Universal Approximator: Definition

A parameterized function space h_Θ(x), with parameter vector Θ ∈ ℜ^{αK}, is called a universal approximator if, for any bounded h(x) with finite domain,

\lim_{K \to \infty} \min_{\Theta} \| h_\Theta(x) - h(x) \| = 0

Example: Sigmoidal Neural Network

1 Sigmoidal Neural Network (Θ = {c_1, . . . , c_K, w_1, . . . , w_K}):

h_\Theta(x) = \sum_{k=1}^{K} c_k \frac{1}{1 + e^{-x^T w_k}}
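To make the definition concrete, here is a minimal NumPy sketch of the sigmoidal network above (not from the slides; the dimensions, random parameters, and test input are assumptions):

```python
import numpy as np

def sigmoid_net(x, c, W):
    """h_Theta(x) = sum_k c_k / (1 + exp(-x^T w_k)).

    x: (D,) input; c: (K,) output weights; W: (K, D) rows w_k.
    """
    return c @ (1.0 / (1.0 + np.exp(-W @ x)))

rng = np.random.default_rng(0)
K, D = 50, 3                      # K sigmoid units, D input dimensions
c = rng.normal(size=K)
W = rng.normal(size=(K, D))
print(sigmoid_net(rng.normal(size=D), c, W))
```

With enough units K, networks of this form can approximate any bounded target on a compact domain, which is exactly the universal-approximation property defined above.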

Page 12: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Hypothesis Space: Universal Approximators

Universal Approximators

More Examples: Universal Approximators

1 Sigmoidal Neural Network (Θ = {c_1, . . . , c_K, w_1, . . . , w_K})

2 Mixture Gaussian (Θ = {c_1, . . . , c_K, µ_1, . . . , µ_K, Σ_1, . . . , Σ_K}):

h_\Theta(x) = \sum_{k=1}^{K} c_k\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}

3 Classification and Regression Tree (CART), K Nearest Neighbors (KNN), etc. (Θ = {b_1, . . . , b_K, w_1, . . . , w_K, R_1, . . . , R_K}):

h_\Theta(x) = x^T w_k + b_k \quad \text{if } x \in R_k

where R_k is a region with piece-wise linear boundaries.

Page 13: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Training Criteria: Differentiable Error Metric

Differentiable Error Metric

Minkowski Norm Error Metrics

E_L = \frac{1}{T} \sum_{t=1}^{T} \| h_\Theta(x_t) - y_t \|_L^L

The “Best” Metric: The Zero Norm

\| h_\Theta(x_t) - y_t \|_0 \triangleq \begin{cases} 0 & y_t = h_\Theta(x_t) \\ 1 & \text{otherwise} \end{cases}

Problem: if E_0 ≠ 0, what do we do next? The zero-one loss is flat almost everywhere, so its gradient carries no information about which direction to move the parameters.

Page 14: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Training Criteria: Differentiable Error Metric

Differentiable Error Metric

Minkowski Norm Error Metrics

E_L = \frac{1}{T} \sum_{t=1}^{T} \| h_\Theta(x_t) - y_t \|_L^L

Differentiable Error Metrics: L = 1, L = 2

1 The One Norm (Manhattan Distance, L = 1):

\frac{\partial E_1}{\partial h_\Theta(x_t)} = \operatorname{sign}\left( h_\Theta(x_t) - y_t \right)

2 The Two Norm (Euclidean Distance, L = 2):

\frac{\partial E_2}{\partial h_\Theta(x_t)} = h_\Theta(x_t) - y_t

Page 15: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Training Criteria: Differentiable Error Metric

Apply the Chain Rule

The Error Back-Propagation Algorithm

1 Initialize: choose some initial parameter set Θ^{(0)}

2 Iterate: for i = 1, . . . , until E(Θ) stops changing:

\Theta^{(i)} = \Theta^{(i-1)} - \eta \nabla_\Theta E

\nabla_\Theta E \triangleq \sum_{t=1}^{T} \left( \frac{\partial E}{\partial h_\Theta(x_t)} \right) \left( \nabla_\Theta h_\Theta(x_t) \right)
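Putting the three steps together, here is a minimal sketch of error back-propagation for the sigmoidal network under the two-norm metric (not from the slides; the toy sine-regression data, unit count, learning rate, and iteration budget are all assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, K=20, eta=0.5, iters=5000, seed=0):
    """Gradient descent on E_2 for h(x) = sum_k c_k * sigmoid(x^T w_k)."""
    rng = np.random.default_rng(seed)
    T, D = X.shape
    c = rng.normal(scale=0.5, size=K)
    W = rng.normal(scale=0.5, size=(K, D))
    for _ in range(iters):
        S = sigmoid(X @ W.T)              # (T, K): sigma(x_t^T w_k)
        h = S @ c                         # (T,): network outputs
        err = h - y                       # dE_2/dh(x_t); constants fold into eta
        grad_c = S.T @ err / T            # chain rule through the output weights
        grad_W = ((err[:, None] * S * (1 - S) * c).T @ X) / T
        c -= eta * grad_c                 # Theta_i = Theta_{i-1} - eta * grad
        W -= eta * grad_W
    return c, W

# Toy 1-D regression; a constant column plays the role of a bias input.
x = np.linspace(-3, 3, 200)
X = np.column_stack([x, np.ones_like(x)])
c, W = train(X, np.sin(x))
```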

Page 16: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Complications: Label Dynamics, Data Sparsity

Wrinkle #1: Recognition, Tracking

A Recursive Neural Net is a neural net with one or more hidden state variables:

h_\Theta(x_t) = \sum_{k=1}^{K} c_k \frac{1}{1 + e^{-[x_t^T,\, h_\Theta(x_{t-1})]\, w_k}}

Training is performed using Back-Propagation Through Time:

\frac{\partial E_2}{\partial h_\Theta(x_t)} = \left( h_\Theta(x_t) - y_t \right) + \left( h_\Theta(x_{t+1}) - y_{t+1} \right) \left( \frac{\partial h_\Theta(x_{t+1})}{\partial h_\Theta(x_t)} \right) + \ldots
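For concreteness, a forward-pass sketch of the recursive net above (an illustration, not the slides' implementation: the shapes and the zero initial state are assumptions; gradients would follow the BPTT sum just shown):

```python
import numpy as np

def rnn_forward(X, c, W, h0=0.0):
    """h_Theta(x_t) with the previous output appended to the current input.

    X: (T, D) input frames; c: (K,) output weights; W: (K, D+1) rows w_k.
    Returns the (T,) sequence of outputs.
    """
    T = X.shape[0]
    h = np.zeros(T)
    prev = h0
    for t in range(T):
        z = np.append(X[t], prev)                   # [x_t^T, h(x_{t-1})]
        h[t] = c @ (1.0 / (1.0 + np.exp(-W @ z)))
        prev = h[t]
    return h
```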

Page 17: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Complications: Label Dynamics, Data Sparsity

Recursive Neural Nets: Example Application

Task Description

Goal of the RNN: detect pitch accents (syllables that the talker has marked as “prominent” by some type of F0 movement)

Blue = x_t (F0 = RNN input)

Yellow = y_t (pitch accent = target RNN output)

Pink = f(x_t) (RNN-estimated pitch-accent probability)

Example Results

Page 18: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Complications: Label Dynamics, Data Sparsity

Wrinkle #2: Small Training Corpus

Test Corpus Error Bounds based on the Central Limit Theorem

P(\text{Error}) \le E(X, Y, \Theta) + G(\Theta)

Example Bounds

Minimum Description Length: K(Θ) describes h_Θ(x) as a binary program written in signed bits; G(Θ) is its Kolmogorov complexity:

G(\Theta) \propto \| K(\Theta) \|_0

Support Vector Machines: ψ(Θ) describes h_Θ(x) as a linear classifier in an augmented feature space, and

G(\Theta) \propto \| \psi(\Theta) \|_2^2

Page 19: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Complications: Label Dynamics, Data Sparsity

Support Vector Machines: Example Application

Consonant-vowel transitions, and similar manner-change landmarks, are good places to look for information about speech [Stevens, Interspeech 2000]

Some landmarks occur relatively infrequently; it’s hard to learn what they sound like.

SVMs do the job [Niyogi, Burges, and Ramesh, 1999; Borys and Hasegawa-Johnson, 2005; chance = 50%]:

SVM             %ACC    SVM             %ACC
-+Silence       92.1    +-Silence       91.6
-+Continuant    79.7    +-Continuant    81.1
-+Sonorant      86.4    +-Sonorant      91.1
-+Syllabic      88.6    +-Syllabic      78.5
-+Consonantal   78.1    +-Consonantal   73.1
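A minimal scikit-learn sketch of a landmark detector of this kind, δ_t = sign(g(x_t)) (not the authors' system: the synthetic data, feature dimension, and RBF-kernel settings are all assumptions):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for acoustic feature vectors at candidate landmarks;
# label +1 if the landmark type is present, -1 otherwise.
X = rng.normal(size=(400, 40))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)

svm = SVC(kernel="rbf", C=1.0).fit(X[:300], y[:300])
g = svm.decision_function(X[300:])    # real-valued discriminant g(x_t)
delta = np.sign(g)                    # landmark present / absent
print("accuracy:", np.mean(delta == y[300:]))
```

The structural-risk bound above is one reason an SVM suits this small-sample regime: maximizing the margin directly minimizes the complexity term ‖ψ(Θ)‖² in the bound.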

Page 20: Pattern Recognition in Acoustic Signal Processing

Discriminative Methods

Complications: Label Dynamics, Data Sparsity

Wrinkle #3: Label Dynamics AND Small Dataset

Landmark detectors (δ_{it} = sign(g_i(x_t))) must be trained using a small dataset, so we use SVMs

Phonetic landmarks tell us about the phonemes that were produced (Q = [q_1, . . . , q_T], q_t ∈ set of phoneme labels)

. . . but each phoneme also depends on the previous phoneme:

p(q_t = j) = f_j\left( \vec{\delta}_t, q_{t-1} \right)

Functions like f_j(·) are hard to train using purely discriminative methods, but relatively easy using Bayesian methods

Page 23: Pattern Recognition in Acoustic Signal Processing

Bayesian Methods

Bayesian Classification and Recognition:

y^* = \arg\max_y p_\Theta(y \mid x)

Bayesian Regression and Tracking:

y^* = E\{ y \mid x \}

Advantages and Disadvantages

Disadvantage: one must learn p(x, y); this usually requires more data, and is subject to more error, than learning y = h(x) directly

Advantage: Bayesian inference allows modeling of latent variables, state dynamics, and extra sources of information in a principled manner

Page 24: Pattern Recognition in Acoustic Signal Processing

Bayesian Methods

Hypothesis Space: Latent Variables

Hypothesis Space Example: Hidden Markov Model

Let X = [x_1, . . . , x_N] be the observations and Y = [y_1, . . . , y_N] the labels. A Hidden Markov Model posits the existence of some latent variables Q = [q_1, . . . , q_N] such that

h_\Theta(X) = \arg\max_Y p_\Theta(X, Y)

p_\Theta(X, Y) = \sum_Q \prod_{t=1}^{T} p_\Theta(y_t \mid y_{t-1})\, p_\Theta(q_t \mid q_{t-1}, y_t)\, p_\Theta(x_t \mid q_t)

p_Θ(y_t | y_{t−1}) (the “language model”) is a lookup table

p_Θ(q_t | q_{t−1}, y_t) (the “pronunciation model”) is a lookup table

p_Θ(x_t | q_t) (the “acoustic model”) is a Gaussian, with mean vector µ_q and covariance matrix Σ_q
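Decoding such a model is a dynamic program. A generic Viterbi sketch over the latent states (an illustration, not the slides' word/phone-factored recognizer; all shapes are assumptions, and the per-frame Gaussian log-likelihoods log_B are assumed precomputed):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state path arg max_Q p(X, Q) for a generic HMM.

    log_pi: (S,)   log p(q_1)
    log_A:  (S, S) log p(q_t = j | q_{t-1} = i)
    log_B:  (T, S) log p(x_t | q_t = j)
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]             # best log-prob ending in each state
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A   # (S, S): extend each path i -> j
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_B[t]
    path = [int(np.argmax(delta))]        # backtrace from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```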

Page 25: Pattern Recognition in Acoustic Signal Processing

Bayesian Methods

Training Criteria: Maximum Likelihood, MAP, MaxEnt

Learn the Distributions: Maximum Likelihood

Maximum Likelihood Parameter Estimation

\Theta = \arg\max_\Theta \log p_\Theta(X, Y)

What About Small Training Corpora?

Maximum Likelihood is a form of empirical risk minimization

Related forms of structural risk minimization include

MAP (maximum a posteriori probability):

\Theta = \arg\max_\Theta \left( \log p(X, Y \mid \Theta) + \log p(\Theta) \right)

MaxEnt (maximum entropy):

\Theta = \arg\max_\Theta \left( \log p_\Theta(X, Y) + H(p_\Theta) \right)

Page 26: Pattern Recognition in Acoustic Signal Processing

Bayesian Methods

Training Criteria: Maximum Likelihood, MAP, MaxEnt

Apply Bayes’ Rule

Bayes’ Rule (a.k.a. the Definition of Conditional Probability)

p_\Theta(X, Y) = \sum_Q \prod_{t=1}^{T} p_\Theta(y_t \mid y_{t-1})\, p_\Theta(q_t \mid q_{t-1}, y_t)\, p_\Theta(x_t \mid q_t)

Training a Bayesian Classifier: Maximum Likelihood

\Theta = \arg\max_\Theta p_\Theta(X, Y)

Testing a Bayesian Classifier: Minimum Probability of Error

Y = \arg\max_Y p_\Theta(X, Y)

Page 27: Pattern Recognition in Acoustic Signal Processing

Bayesian Methods

Complications: HMM Regression, Switching Kalman Smoothers

Bayesian Regression and Tracking

Hidden Markov Regression

Suppose y_t ∈ ℜ^D is a real-valued vector, and (x_t, y_t) are jointly Gaussian:

p(x, y \mid q) \propto \exp\left( -\frac{1}{2} \begin{bmatrix} x_t - \bar{x}_q \\ y_t - \bar{y}_q \end{bmatrix}^T \begin{bmatrix} A_q & B_q \\ B_q^T & C_q \end{bmatrix}^{-1} \begin{bmatrix} x_t - \bar{x}_q \\ y_t - \bar{y}_q \end{bmatrix} \right)

Then h(x_t) = arg min E_2 is the posterior mean:

E\{ y_t \mid X \} = \sum_{q_t} P(q_t \mid X) \left( \bar{y}_q + B_q^T A_q^{-1} (x - \bar{x}_q) \right)
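A compact sketch of that posterior-mean estimator (hypothetical shapes; the state posteriors P(q_t | X) are assumed to come from an HMM forward-backward pass, which is not shown):

```python
import numpy as np

def hmm_regression(x, post_q, xbar, ybar, A, B):
    """E{y_t | X} = sum_q P(q_t = q | X) (ybar_q + B_q^T A_q^{-1} (x - xbar_q)).

    x:      (Dx,)        current observation
    post_q: (S,)         P(q_t = q | X)
    xbar:   (S, Dx)      per-state means of x;  ybar: (S, Dy) means of y
    A:      (S, Dx, Dx)  per-state covariance of x
    B:      (S, Dx, Dy)  per-state cross-covariance of x and y
    """
    y = np.zeros(ybar.shape[1])
    for q in range(len(post_q)):
        cond = ybar[q] + B[q].T @ np.linalg.solve(A[q], x - xbar[q])
        y += post_q[q] * cond     # mixture of per-state linear regressions
    return y
```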

Page 28: Pattern Recognition in Acoustic Signal Processing

Bayesian Methods

Complications: HMM Regression, Switching Kalman Smoothers

Switching Kalman Smoother

[Graphical model: state chain s_{t−1} → s_t → s_{t+1}, with observations x_{t−1}, x_t, x_{t+1} and a chain of targets y_{t−1}, y_t, y_{t+1}]

Setup: exactly like HMM regression, except that (x_t, y_t, y_{t−1}) are jointly Gaussian

Result: exactly like HMM regression, except that ȳ_{t|q,X}, A_{t|q,X}, and B_{t|q,X} must be updated using interacting multiple Kalman filters:

E\{ y_t \mid X \} = \sum_{q_t} P(q_t \mid X) \left( \bar{y}_{t|q,X} + B_{t|q,X}^T A_{t|q,X}^{-1} (x - \bar{x}_q) \right)

Page 29: Pattern Recognition in Acoustic Signal Processing

Bayesian Methods

Complications: HMM Regression, Switching Kalman Smoothers

Switching Kalman Smoother: Example

Task Description

x_t is an acoustic spectrum; y_t is the corresponding vector of speech articulator positions

Results (unpublished):

E2(Switching Kalman Smoother) < E2(HMM Regression)

Difference is consistent but very small

Page 30: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Hybrid Discriminative-Bayesian Systems

Task Scenario

y_t is very difficult to classify without dynamic information (e.g., speech recognition: y_t = words, x_t = short-time spectrum)

An auxiliary variable f_t can be inferred very accurately using a local classifier (e.g., f_t = phonological distinctive features):

f_t = h_Θ(x_t)

Training database includes X = [x_1, . . . , x_N], Y = [y_1, . . . , y_N], and F = [f_1, . . . , f_N]

Testing database includes only X = [x_{N+1}, . . . , x_{N+M}]

Page 31: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Hidden Nodes, Bayesian Inference

Hybrid Training Methods

Training Algorithm

Train h_Θ(x) using discriminative methods

Minimize

E_L = \frac{1}{T} \sum_{t=1}^{T} \| f_t - h_\Theta(x_t) \|_L^L

h_Θ(x_t) is a real-valued vector that approximates f_t

Train a probability model p_Θ(F, X, Y) using Bayesian methods

Using p_Θ(F, X, Y), Bayesian inference can model dynamics of hidden states, incorporate multiple knowledge sources, etc.

Page 32: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Hidden Nodes, Bayesian Inference

Example: Landmark-Based Speech Recognizer

SVM-HMM Hybrid Landmark-Based Speech Recognizer

SVM computes a real-valued distinctive feature that optimally discriminates between the case f_t = 1 (landmark of a specified type is present) and f_t = −1 (landmark absent)

HMM computes p(phoneme sequence|landmark sequence)

Page 33: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Hidden Nodes, Bayesian Inference

Phone Recognition Accuracy vs. Mixture Size, Telephone Speech

MFCC = mel frequency cepstral coefficients

Landmark = detect manner-to-manner landmarks, e.g., obstruent-to-sonorant

Manner = detect manner-onset landmarks, e.g., onset of sonorant region

Page 34: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Hidden Nodes, Bayesian Inference

Example: RNN with Kalman Smoothing

Mitra et al., in review

[Figure: synthetic utterance “ground”: spectrogram (frequency vs. time, 0-4000 Hz) above panels of tract variables VEL, TBCL, TBCD, TTCL, TTCD, comparing ANN+Kalman (AP) estimates against true data]

Page 35: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Feature Selection for Acoustic Event Detection

Example: Non-Speech Acoustic Event Detection

Motivation

“Activity detection and description is a key functionality of perceptually aware interfaces working in collaborative human communication environments. . . detection and classification of acoustic events may help to detect and describe human activity. . . ” (CLEAR-AED Task Brief)

Difficulties

Negative SNR (speech is “background noise”)

Unknown spectral structure

Different spectral structure for each event type

Page 36: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Feature Selection for Acoustic Event Detection

Difficulty #1: Negative SNR

Page 37: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Feature Selection for Acoustic Event Detection

Difficulty #2: Unknown Spectral Structure

[Spectrogram panels: key jingle, footsteps, speech]

Page 38: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Feature Selection for Acoustic Event Detection

Discriminative Feature Selection for AED

Zhuang et al., ICASSP 2008

Problem: what acoustic features are relevant for detecting non-speech acoustic events?

Input: x_t ∈ ℜ^D includes many acoustic features invented for speech processing (MFCC, PLP, energy, ZCR)

Output: f_t ∈ ℜ^d selects the most useful features:

f_t = W x_t

where W^T = [w_1, . . . , w_K], and each w_k is an indicator vector (only one non-zero element)

Hidden Markov Modeling: the label sequence Y^* = [y^*_1, . . . , y^*_T], y_t ∈ {key jingle, footstep, . . .}, is chosen by a hidden Markov model observing F = [f_1, . . . , f_N]:

Y^* = \arg\max_Y p(F \mid Y)\, p(Y)
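The selection step itself is just a sparse linear map. A tiny sketch (hypothetical dimensions; choosing which columns go in idx is the subject of the next two slides):

```python
import numpy as np

def select_features(X, idx):
    """f_t = W x_t where each row of W is an indicator vector.

    X: (T, D) frames by features; idx: the d selected feature columns.
    """
    W = np.zeros((len(idx), X.shape[1]))
    W[np.arange(len(idx)), idx] = 1.0   # one nonzero element per row
    return X @ W.T                      # (T, d) selected features

X = np.random.default_rng(0).normal(size=(100, 39))  # e.g., 39-dim MFCCs
F = select_features(X, [0, 3, 7])
```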

Page 39: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Feature Selection for Acoustic Event Detection

Bayes Error Rate

Zhuang et al., ICASSP 2008

Bayes Error Rate

Let w_k be an indicator vector (all zeros except for one element). The Bayes-optimal error rate of a classifier observing the feature w_k^T x is

P(\text{error}) = \int\!\!\int P\left( y \neq \arg\max_{y'} p(w_k^T x, y') \right) dy\, dx

Bayes Error Rate Approximated on a Database

F(w_k) = \frac{1}{T} \sum_{t=1}^{T} \delta\left( y_t \neq \arg\max_{y} p(w_k^T x_t, y) \right)

Page 40: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Feature Selection for Acoustic Event Detection

Feature Selection Algorithms

Hard-Bayes-Error Feature Selection

For k = 1, . . . , K, choose the indicator vector w_k (all zeros except for one nonzero element) to minimize

F(w_k) = \frac{1}{T} \sum_{t=1}^{T} \delta\left( y_t \neq \arg\max_{y} p(w_k^T x_t, y) \right)

Soft-Bayes-Error Feature Selection

For k = 1, . . . , K, choose the indicator vector w_k (all zeros except for one nonzero element) to minimize

F_S(w_k) = \frac{1}{T} \sum_{t=1}^{T} \operatorname{rank}\left( y_t \mid w_k^T x_t \right)
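A sketch of the hard-Bayes-error criterion, with the joint density p(w_k^T x, y) approximated by a histogram (the binning, integer labels, and greedy top-K selection are all assumptions; the slides do not specify the density estimator):

```python
import numpy as np

def hard_bayes_error(f, y, bins=20):
    """Empirical F(w_k) for one scalar feature f = w_k^T x_t: quantize f,
    predict the most frequent label in each bin, and count errors.
    y: integer class labels."""
    edges = np.quantile(f, np.linspace(0, 1, bins + 1)[1:-1])
    cell = np.digitize(f, edges)
    errors = 0
    for c in np.unique(cell):
        counts = np.bincount(y[cell == c])
        errors += (cell == c).sum() - counts.max()
    return errors / len(y)

def select_k_features(X, y, K):
    """Greedy: keep the K single features with lowest empirical Bayes error."""
    scores = [hard_bayes_error(X[:, d], y) for d in range(X.shape[1])]
    return np.argsort(scores)[:K]
```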

Page 41: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Discriminative Feature Selection for Acoustic Event Detection

Acoustic Event Detection Results

Zhuang et al., ICASSP 2008

MFCC26DAZ = 26 Mel-frequency cepstral coefficients + deltas + acceleration

DERIVE26DAZ = 26 Derived features + deltas + acceleration

DERIVE78 = 78 Derived features

Page 42: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Bayesian Re-Estimation of the “Discriminative” Hidden Nodes

Bayesian Re-Estimation of Discriminative Nodes

Problem: what projection of the acoustic spectrogram is relevant for recognizing acoustic events?

Discriminative Transform: f_t ∈ ℜ^d selects the most useful features:

f_t = \sum_{k=1}^{K} c_k\, \sigma(w_k^T x_t)

where c_k ∈ ℜ^d and w_k ∈ ℜ^D are arbitrary real-valued weight vectors, and σ(z) = 1/(1 + e^{−z}).

Bayesian Inference: the label sequence Y^* = [y^*_1, . . . , y^*_N], y_t ∈ {key jingle, footstep, . . .}, is chosen by a hidden Markov model observing F = [f_1, . . . , f_N]:

Y^* = \arg\max_Y p(F \mid Y)\, p(Y)

Page 43: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Bayesian Re-Estimation of the “Discriminative” Hidden Nodes

The Baum-Welch Algorithm

Hidden Markov model parameters are trained to maximize the expected log likelihood, with the expectation taken over the unknown state sequence Q = [q_1, . . . , q_N]:

F = E_Q \{ \log p(F, Q) \}

F = -\frac{1}{2} \sum_{t=1}^{T} \sum_{q} p(q_t = q \mid F, Y)\, (f_t - \mu_q)^T \Sigma_q^{-1} (f_t - \mu_q) - \ldots

Page 44: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Bayesian Re-Estimation of the “Discriminative” Hidden Nodes

Baum-Welch Back-Propagation

The neural network can be trained, using standard gradient descent methods, to maximize F. For example,

f_t = \sum_{k=1}^{K} c_k\, \sigma(w_k^T x_t)

\frac{\partial F}{\partial c_k} = \sum_{t=1}^{T} \sum_{q} p(q_t = q \mid F) \left( \frac{\partial F}{\partial f_t} \Big|_{q_t = q} \right) \left( \frac{\partial f_t}{\partial c_k} \right) = \sum_{t=1}^{T} \sum_{q} p(q_t = q \mid F)\, \Sigma_q^{-1} (\mu_q - f_t)\, \sigma(w_k^T x_t)
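A sketch of that gradient computation (hypothetical shapes; the state posteriors post = p(q_t = q | F) are assumed to come from a forward-backward pass that is not shown):

```python
import numpy as np

def grad_c(X, post, mu, Sigma_inv, c, W):
    """dF/dc_k for f_t = sum_k c_k sigma(w_k^T x_t); ascend with c += eta * g.

    X: (T, D) inputs; post: (T, S) state posteriors; mu: (S, d) state means;
    Sigma_inv: (S, d, d) inverse covariances; c: (K, d); W: (K, D).
    """
    act = 1.0 / (1.0 + np.exp(-X @ W.T))      # (T, K): sigma(w_k^T x_t)
    f = act @ c                               # (T, d): hidden nodes f_t
    g = np.zeros_like(c)
    for q in range(post.shape[1]):
        resid = (mu[q] - f) @ Sigma_inv[q]    # (T, d): Sigma_q^{-1}(mu_q - f_t)
        g += act.T @ (post[:, q, None] * resid)   # sum over t and q
    return g
```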

Page 45: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Bayesian Re-Estimation of the “Discriminative” Hidden Nodes

The Problem of Spurious Maxima

It is always possible to train a mixture Gaussian so that F = ∞:

   Give one of the Gaussians a nonzero variance (|Σ_1| > 0), and one a zero variance (|Σ_2| = 0)
   Set the mean of the second Gaussian equal to any training token (µ_2 = f_t for any particular t)
   Result: F = ∞ + finite = ∞
   This is called “over-training”

In Baum-Welch Back-Propagation, the same result is obtained for ‖c_k‖ → 0 or ‖w_k‖ → 0

Solution: require ‖c_k‖ = 1, or, more generally, ‖∂f_t/∂x_t‖ = 1

Page 50: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Bayesian Re-Estimation of the “Discriminative” Hidden Nodes

Methods for Avoiding Spurious Maxima

Constrained optimization: maximize

L = F + \sum_k \left[ \alpha_k (\| c_k \| - 1) + \beta_k (\| w_k \| - 1) \right]

with Lagrange multipliers α_k and β_k chosen so that ‖c_k‖ = 1 and ‖w_k‖ = 1

Symplectic Maximum Likelihood Transform (SMLT, Omar and Hasegawa-Johnson, 2004): replace the neural network with one that computes a volume-preserving transform:

\left| \det J_f(x) \right| = 1

where J_f(x) = ∂f/∂x is the Jacobian of the transform

Page 51: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Bayesian Re-Estimation of the “Discriminative” Hidden Nodes

The Reflecting Symplectic Transform
Omar and Hasegawa-Johnson, 2004

Divide x and y arbitrarily into equal-length sub-vectors, x^T = [x_1^T, x_2^T], y^T = [y_1^T, y_2^T]. Interpret as follows:

x_1 is a vector of object positions

x_2 is a vector of velocities

V(x_2) is a scalar called the “kinetic energy”

T(y_1) is a scalar called the “potential energy”

Then the following transform is volume-preserving:

\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} x_1 - \nabla_{x_2} V \\ x_2 - \nabla_{y_1} T \end{bmatrix} = \begin{bmatrix} x_1 - g_1(x_2) \\ x_2 - g_2(x_1 - g_1(x_2)) \end{bmatrix}

g_1(x_2) and g_2(y_1) must be irrotational. The easiest way to guarantee this: train V(x_2) and T(y_1) directly, using Baum-Welch back-propagation
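A sketch of this transform with toy quadratic potentials (the potentials, dimensions, and input are assumptions; in the SMLT, V and T would be trained). Each half-step has a unit-triangular Jacobian, so |det J| = 1 by construction:

```python
import numpy as np

def reflecting_symplectic(x, grad_V, grad_T):
    """y1 = x1 - grad_V(x2); y2 = x2 - grad_T(y1). Volume-preserving."""
    x1, x2 = np.split(x, 2)
    y1 = x1 - grad_V(x2)          # Jacobian [[I, -dg1], [0, I]]: det = 1
    y2 = x2 - grad_T(y1)          # Jacobian [[I, 0], [-dg2, I]]: det = 1
    return np.concatenate([y1, y2])

# Toy potentials V(x2) = 0.5*a*||x2||^2 and T(y1) = 0.5*b*||y1||^2,
# whose gradients a*x2 and b*y1 are trivially irrotational.
a, b = 0.3, 0.7
y = reflecting_symplectic(np.ones(4),
                          grad_V=lambda x2: a * x2,
                          grad_T=lambda y1: b * y1)
print(y)
```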

Page 52: Pattern Recognition in Acoustic Signal Processing

Hybrid Methods

Bayesian Re-Estimation of the “Discriminative” Hidden Nodes

SMLT+GMM for Phone Classification

Omar and Hasegawa-Johnson, 2004

Compute the phone label y_t given the MFCC cepstrum x_t

Symplectic maximum likelihood transform (SMLT) computes f_t(x_t)

Maximum likelihood linear transform (MLLT) computes f_t = W x_t

Gaussian mixture model (GMM) computes p(f_t | y_t)

Database: TIMIT

Features   Classifier   Accuracy
MFCC       GMM          73.7%
MLLT       GMM          74.6%
SMLT       GMM          75.6%

Page 53: Pattern Recognition in Acoustic Signal Processing

Conclusions

Pattern recognition for acoustic applications:

   Choose a hypothesis space (a family of universal approximators)
   Gradient descent to minimize error

Bayesian learning simplifies the use of structured models:

   Hidden-state dynamics
   External sources of information

Hybrid discriminative-Bayesian methods sometimes give the best of both worlds:

   Discriminative training = minimum error locally
   Bayesian inference = principled integration of disparate information sources
   “Discriminative” hidden nodes can be re-trained to optimize a global criterion if the training is constrained to avoid spurious maxima (e.g., volume-preserving)

Page 56: Pattern Recognition in Acoustic Signal Processing

Conclusions

Thank You!

http://www.isle.uiuc.edu/slides/2009/Hasegawa-Johnson09mlss.pdf