81
Natural Language Processing Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition Vladimir Kulyukin www.vkedco.blogspot.com

NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Embed Size (px)

Citation preview

Page 1: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Natural Language Processing

Audio Processing, Zero Crossing Rate,

Dynamic Time Warping, Spoken Word

Recognition

Vladimir Kulyukin

www.vkedco.blogspot.com

Page 2: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Outline

Audio Processing

Zero Crossing Rate

Dynamic Time Warping

Spoken Word Recognition

Page 3: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Audio Processing

Page 4: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Samples

Samples are successive snapshots of a

specific signal

Audio files are samples of sound waves

Microphones convert acoustic signals into

analog electrical signals and then analog-to-

digital converter transform analog signals

into digital samples

Page 5: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Digital Audio Signal

time

Sound

pressure

Page 6: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Amplitude

Amplitude (in audio processing) is a

measure of sound pressure

Amplitude is measured at a specific rate

Amplitude measures result in digital

samples

Some samples have positive values

Some samples have negative values

Page 7: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Digital Approximation Accuracy

Any digitization of analog signals carries some

inaccuracy

Approximation accuracy depends on two

factors: 1) sampling rate and 2) resolution

In audio processing, sampling is reduction of

continuous signal to discrete signal

Sampling rate is the number of samples per unit

of time

Resolution is the size of a sample (e.g., the

number of bits)

Page 8: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Sampling Rate & Resolution

Sampling rate is measured in Hertz

Hertz or Hz are measured in samples per

second

For example, if the audio is sampled at a

rate of 44100 per second, then its sampling

rate is 44100Hz

Some typical resolutions are 8 bits, 16 bits,

and 32 bits

Page 9: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Nyquist-Shannon Sampling Theorem

This theorem states that perfect reconstruction of

a signal is possible if the sampling frequency is

greater than two times the maximum frequency of

the signal being sampled

For example, if a signal has a maximum frequency

of 50Hz, then it can, theoretically, be

reconstructed if sampled at a rate of 100Hz and

avoid aliasing (the effect of indistinguishable

sounds)

Page 10: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Audio File Formats

WAVE (WAV) is often associated with Windows but

are now implemented on other platforms

AIFF is common on Mac OS

AU is common on Unix/Linux

These are similar formats that vary in how they

represent data, pack samples (e.g., little-endian vs.

big-endian), etc.

Java example of how to manipulate Wav files can

be downloaded from WavFileManip.java

Page 11: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Zero Crossing Rate

Page 12: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

What is Zero Crossing Rate (ZCR)?

Zero Crossing Rate (ZCR) is a measure of the

number of times, in a given sample, when

amplitude crosses the horizontal line at 0

ZCR can be used to detect silence vs. non-

silence, voice vs. unvoiced, speaker’s identity,

etc.

ZCR is essentially the count of successive

samples changing algebraic signs

Page 13: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

ZCR Source

public class ZeroCrossingRate {

public static double computeZCR01(double[] signals, double normalizer)

{

long numZC = 0;

for(int i = 1; i < signals.length; i++) {

if ( (signals[i] >= 0 && signals[i-1] < 0) ||

(signals[i] < 0 && signals[i-1] >= 0) ) {

numZC++;

}

}

return numZC/normalizer;

}

}

source code is in ZeroCrossingRate.java

Page 14: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

ZCR in Voiced vs. Unvoiced Speech

Voiced speech is produced when vowels are spoken

Voiced speech is characterized of constant

frequency tones of some duration

Unvoiced speech is produced when consonants are

spoken

Unvoiced speech is non-periodic, random-like

because air passes through a narrow constriction of

the vocal tract

Page 15: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

ZCR in Voiced vs. Unvoiced Speech

Phonetic theory states that voiced speech

has a smooth air flow through the vocal tract

whereas unvoiced speech has a turbulent air

flow that produces noise

Thus, voiced speech should have a low ZCR

whereas unvoiced speech should have a high

ZCR

Page 16: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Amplitude of Voiced vs. Unvoiced Speech

Amplitude of unvoiced speech is low

Amplitude of voiced speech is high

Given a digital sample, we can use average

amplitude as a measure of the sample’s

energy

This can be used to classify samples as

vowels and consonants

Page 17: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

ZCR & Amplitude of Voiced & Unvoiced Speech

ZCR Amplitude

Voiced LOW HIGH

Unvoiced HIGH LOW

Page 18: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Detection of Silence & Non-Silence

silence_buffer = [];

non_silence_buffer = [];

buffer = [];

while ( there are still frames left ) {

Read a specific number of frames into buffer;

Compute ZCR and average amplitude of buffer;

if ( ZCR and average amplitude are below specific thresholds ) {

add the buffer to silence_buffer;

}

else {

add the buffer to non_silence_buffer;

}

}

source code is in WavFileManip.detectSilence()

Page 20: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Introduction

Dynamic Time Warping (DTW) is a method to

find an optimal alignment between two time-

dependent sequences

DTW aligns (“warps”) two sequences in a non-

linear way to match each other

DTW has been successfully used in automatic

speech recognition (ASR), bioinformatics

(genetic sequence matching), and video

analysis

Page 21: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Basic Definitions

There are two sequences:

𝑋 = 𝑥1, … , 𝑥𝑁 and 𝑌 = 𝑦1, … , 𝑦𝑀

There is a feature space F such that:

𝑥𝑖 ∈ 𝐹 & 𝑦𝑗 ∈ 𝐹 where 1 ≤ 𝑖 ≤ 𝑁, 1 ≤ 𝑗 ≤ 𝑀

There is a local cost measure mapping 2-

tuples of features to non-negative reals:

𝑐: 𝐹 x 𝐹 → 𝑅 ≥ 0

Page 22: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Sample Sequences

Page 23: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Sample Alignment

Page 24: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

X

Cost Matrix DTW(N, M)

Y

1 2 …. i … N

M

1

2

𝑑𝑡𝑤 𝑖, 𝑗 is the cost of warping X[1:i] with Y[1:j]

j

X and Y are sequences X[1:N] and Y[1:M]

Page 25: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Warping Path

𝑃 = 𝑝1, … , 𝑝𝐿 , where 𝑝 = 𝑛𝑗 , 𝑚𝑗 ∈ 1, 𝑁 × [1,𝑀] and

𝑗 ∈ 1, 𝐿 is a warping path if

1) 𝑝1 = 1,1 and 𝑝𝐿 = 𝑁,𝑀 2) 𝑛1 ≤ 𝑛2 ≤ … ≤ 𝑛𝑁 and 𝑚1 ≤ 𝑚2 ≤ … ≤ 𝑚𝑀

3) 𝑝𝑙+1 − 𝑝𝑙 ∈ 1, 0 , 0, 1 , 1, 1 , 1 ≤ 𝑙 ≤ 𝐿 − 1

Page 26: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Valid Warping Path

1 2 3 4

1

2

3

4

5

𝑃 = 𝑝1, 𝑝2, 𝑝3, 𝑝4, 𝑝5, 𝑝6 , where

𝑝1 = 1, 1 , 𝑝2 = 1, 2 , 𝑝3 = 2, 3 , 𝑝4 = 2, 4 , 𝑝5 = 3, 5 , 𝑝6 = (4, 5)

𝑝1

𝑝2

𝑝3

𝑝4

𝑝5 𝑝6

Page 27: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Invalid Warping Path

1 2 3 4

1

2

3

4

5

𝑝1 ≠ 1, 1 so constraint 1 is not satisfied

𝑝1 𝑝2

𝑝3

𝑝4

𝑝5 𝑝6

Page 28: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Invalid Warping Path

1 2 3 4

1

2

3

4

5

𝑝3 = 3, 3 , 𝑝4 = 2, 4 , 3 > 2 so 2nd constraint is not satisfied

𝑝1

𝑝2

𝑝3

𝑝4

𝑝5 𝑝6

Page 29: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Invalid Warping Path

1 2 3 4

1

2

3

4

5

𝑝2 = 2, 2 , 𝑝3 = 3, 4 , 𝑝3 − 𝑝2 = 3,4 − 2,2 = 1, 2 ∉1, 0 , 0, 1 , 1, 1 so 3rd condition is not satisfied

𝑝1

𝑝2

𝑝3

𝑝4 𝑝5

Page 30: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Total Cost of a Warping Path

𝑃 = 𝑝1, … , 𝑝𝐿 , is a warping path between sequences X

and Y, then its total cost is

𝑐𝑝 𝑋, 𝑌 = 𝑐(𝑥𝑛𝑗 , 𝑦𝑚𝑗)

𝐿

𝑗=1

Page 31: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Example

1 2 3 4

1

2

3

4

5 Assume that 𝑃 = 𝑝1, 𝑝2, 𝑝3, 𝑝4, 𝑝5, 𝑝6 , where 𝑝1 = 1, 1 , 𝑝2 = 1, 2 , 𝑝3 =2, 3 , 𝑝4 = 2, 4 , 𝑝5 = 3, 5 , 𝑝6 = 4, 5 ,

is a warping path b/w X[1:4] and Y[1:5].

Then the total cost of P is

𝑐 𝑥1, 𝑦1 + 𝑐 𝑥1, 𝑦2 + 𝑐 𝑥2, 𝑦3 +𝑐 𝑥2, 𝑦4 + 𝑐 𝑥3, 𝑦5 + 𝑐 𝑥4, 𝑦5 .

This notation 𝑐 𝑥𝑖 , 𝑦𝑗 can be simplified

to read 𝑐(𝑖, 𝑗) or 𝑐 𝑋 𝑖 , 𝑌 𝑗 .

𝑝1

𝑝2

𝑝3

𝑝4

𝑝5 𝑝6

X

Y

Page 32: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(X, Y) – Cost of an Optimal Warping Path

𝐷𝑇𝑊 𝑋, 𝑌 = min 𝑐𝑝 𝑋, 𝑌 𝑝 is a warping path}

Page 33: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Remarks on DTW(X, Y)

There may be several warping paths of the

same DTW(X, Y)

DTW(X, Y) is symmetric whenever the local

cost measure is symmetric

DTW(X, Y) does not necessarily satisfy the

triangle inequality (the sum of the lengths of

two sides is greater than the length of the

remaining side)

Page 34: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

X

DTW Equations: Base Cases

Y

1 2 …. i … N

M

1

2

Initial condition: 𝑑𝑡𝑤 1,1 = 𝑐(1,1)

j

1st Row: 𝑑𝑡𝑤 𝑖, 1 = 𝑑𝑡𝑤 𝑖 − 1,1 + 𝑐(𝑖, 1)

1st Column:

𝑑𝑡𝑤 1, 𝑗 = 𝑑𝑡𝑤 1, 𝑗 − 1 +𝑐(1, 𝑗)

Page 35: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

X

DTW Equations: Recursion

Y

1 2 … i … N

M

1

2

j

Inner Cell: 𝑑𝑡𝑤 𝑖, 𝑗 = min 𝑑𝑡𝑤 𝑖 − 1, 𝑗 , 𝑑𝑡𝑤 𝑖 − 1, 𝑗 − 1 , 𝑑𝑡𝑤 𝑖, 𝑗 − 1 + 𝑐(𝑖, 𝑗)

Interpretation: Cost of

warping X[1:i] with Y[1:J] is

the cost of warping X[i] with

Y[j] plus the minimum of the

following three costs: 1) the

cost of warping X[1:i-1] with

Y[1:j]; 2) the cost of warping

X[1:i-1] with Y[1:j-1]; 3) the

cost of warping X[1:i] with

Y[1:j-1]

Page 36: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Example

Let the sequences be:

𝑋 = 𝑎, 𝑏, 𝑔 𝑌 = 𝑎, 𝑏, 𝑏, 𝑔 𝑍 = (𝑎, 𝑔, 𝑔)

Let the feature space 𝐹 = 𝑎, 𝑏, 𝑔 .

Let the local cost measure be

defined as follows:

𝑐 𝑥, 𝑦 = 0 𝑖𝑓 𝑥 = 𝑦1 𝑖𝑓 𝑥 ≠ 𝑦

Let us compute dtw(X,Y), dtw(Y,Z), and dtw(X, Z).

Work it out on paper.

Page 37: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(X, Y)

Page 38: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Example: DTW(X,Y)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 1,1 = 𝑐 𝑎, 𝑎 = 0

Page 39: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Example: DTW(X,Y)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 2,1 = 𝑐 2,1 + 𝑑𝑡𝑤 1,1= 𝑐 𝑏, 𝑎 + 𝑑𝑡𝑤 1,1= 1 + 0 = 1

1

Page 40: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Example: DTW(X,Y)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 3,1 = 𝑐 3,1 + 𝑑𝑡𝑤 2,1= 𝑐 𝑔, 𝑎 + 𝑑𝑡𝑤 2,1= 1 + 1 = 2

1 2

Page 41: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Example: DTW(X,Y)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 1,2 = 𝑐 1,2 + 𝑑𝑡𝑤 1,1= 𝑐 𝑎, 𝑏 + 𝑑𝑡𝑤 1,1= 1 + 0 = 1

1 2

1

Page 42: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Example: DTW(X,Y)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 1,3 = 𝑐 1,3 + 𝑑𝑡𝑤 1,2= 𝑐 𝑎, 𝑏 + 𝑑𝑡𝑤 1,2= 1 + 1 = 2

1 2

1

2

Page 43: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Example: DTW(X,Y)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 1,4 = 𝑐 1,4 + 𝑑𝑡𝑤 1,3= 𝑐 𝑎, 𝑔 + 𝑑𝑡𝑤 1,3= 1 + 2 = 3

1 2

1

2

3

Page 44: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Example: DTW(X,Y)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 2,2= 𝑐 2,2

+min𝑑𝑡𝑤 1,2 ,𝑑𝑡𝑤 1,1 ,𝑑𝑡𝑤 2,1

= 𝑐 𝑏, 𝑏 + min 1,0,1 = 0 + 0= 0 1 2

1

2

3

0

Page 45: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Example: DTW(X,Y)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 3,2= 𝑐 3,2

+min𝑑𝑡𝑤 2,2 ,𝑑𝑡𝑤 2,1 ,𝑑𝑡𝑤 3,1

= 𝑐 𝑔, 𝑏 + min 0,1,2 = 1 + 0= 1 1 2

1

2

3

0 1

Page 46: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Example: DTW(X,Y)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 2,3= 𝑐 2,2

+min𝑑𝑡𝑤 1,3 ,𝑑𝑡𝑤 1,2 ,𝑑𝑡𝑤 2,2

= 𝑐 𝑏, 𝑏 + min 2,1,0 = 0 + 0= 0 1 2

1

2

3

0 1

0

Page 47: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Example: DTW(X,Y)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 3,3= 𝑐 3,3

+min𝑑𝑡𝑤 2,3 ,𝑑𝑡𝑤 2,2 ,𝑑𝑡𝑤 3,1

= 𝑐 𝑔, 𝑏 + min 0,0,1 = 1 + 0= 1 1 2

1

2

3

0 1

0 1

Page 48: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Example: DTW(X,Y)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 2,4= 𝑐 2,4

+min𝑑𝑡𝑤 1,4 ,𝑑𝑡𝑤 1,3 ,𝑑𝑡𝑤 2,3

= 𝑐 𝑏, 𝑔 + min 3,2,0 = 1 + 0= 1 1 2

1

2

3

0 1

0 1

1

Page 49: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Example: DTW(X,Y)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 3,4= 𝑐 3,4

+min𝑑𝑡𝑤 2,4 ,𝑑𝑡𝑤 2,3 ,𝑑𝑡𝑤 3,3

= 𝑐 𝑔, 𝑔 +min 1,0,1 = 0 + 0= 0

So DTW(X,Y) = 0

1 2

1

2

3

0 1

0 1

1 0

Page 50: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Example: DTW(X,Y)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

DTW(X, Y) = 0.

Optimal Warping Path

(OWP) P can be found by

chasing pointers (red

arrows): P = ((1,1), (2, 2),

(2, 3), (3, 4)). 1 2

1

2

3

0 1

0 1

1 0

Page 51: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(Y, Z)

Page 52: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(Y, Z)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 1,1 = 𝑐 𝑎, 𝑎 = 0 Z

Page 53: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(Y, Z)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 2,1= 𝑐 𝑏, 𝑎 + 𝑑𝑡𝑤 1,1= 1 + 0 = 1

Z

1

Page 54: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(Y, Z)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 3,1= 𝑐 𝑏, 𝑎 + 𝑑𝑡𝑤 2,1= 1 + 1 = 2

Z

1 2

Page 55: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(Y, Z)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 4,1= 𝑐 𝑔, 𝑎 + 𝑑𝑡𝑤 3,1= 1 + 2 = 3

Z

1 2 3

Page 56: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(Y, Z)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 1,2= 𝑐 𝑎, 𝑔 + 𝑑𝑡𝑤 1,1= 1 + 0 = 1

Z

1 2 3

1

Page 57: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(Y, Z)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 1,3= 𝑐 𝑎, 𝑔 + 𝑑𝑡𝑤 1,2= 1 + 1 = 2

Z

1 2 3

1

2

Page 58: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(Y, Z)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 2,2= 𝑐 𝑏, 𝑔+ min {𝑑𝑡𝑤 1,2 ,

𝑑𝑡𝑤 1,1 , 𝑑𝑡𝑤 2,1 }

= 1 + 0 = 1

Z

1 2 3

1

2

1

Page 59: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(Y, Z)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 3,2= 𝑐 𝑏, 𝑔+ min {𝑑𝑡𝑤 2,2 ,

𝑑𝑡𝑤 2,1 , 𝑑𝑡𝑤 3,1 }

= 1 +min 1,1,2 = 1 + 1 = 2

Z

1 2 3

1

2

1 2

Page 60: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(Y, Z)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 4,2= 𝑐 𝑔, 𝑔+ min {𝑑𝑡𝑤 3,2 ,

𝑑𝑡𝑤 3,1 , 𝑑𝑡𝑤 4,1 }

= 0 +min 2,2,3 = 0 + 2 = 2

Z

1 2 3

1

2

1 2 2

Page 61: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(Y, Z)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 2,3= 𝑐 𝑏, 𝑔+ min {𝑑𝑡𝑤 1,3 ,

𝑑𝑡𝑤 1,2 , 𝑑𝑡𝑤 2,2 }

= 1 +min 2,1,1 = 1 + 1 = 2

Z

1 2 3

1

2

1 2 2

2

Page 62: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(Y, Z)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 3,3= 𝑐 𝑏, 𝑔+ min {𝑑𝑡𝑤 2,3 ,

𝑑𝑡𝑤 2,2 , 𝑑𝑡𝑤 3,2 }

= 1 +min 2,1,2 = 1 + 1 = 2

Z

1 2 3

1

2

1 2 2

2 2

Page 63: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(Y, Z)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 4,3= 𝑐 𝑔, 𝑔+ min {𝑑𝑡𝑤 3,4 ,

𝑑𝑡𝑤 3,2 , 𝑑𝑡𝑤 4,2 }

= 0 +min 2,2,2 = 0 + 2 = 2

Z

1 2 3

1

2

1 2 2

2 2 2

Page 64: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(Y, Z)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

DTW(Y, Z) = 2.

Optimal Warping Path (OWP) P

can be found by chasing pointers

(red arrows): P = ((1,1), (2, 2), (3,

2), (4, 3)).

Z

1 2 3

1

2

1 2 2

2 2 2

Page 65: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(X, Z)

Page 66: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(X, Z)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1

𝑑𝑡𝑤 1,1 = 𝑐 𝑎, 𝑎 = 0

0

Page 67: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(X, Z)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1

𝑑𝑡𝑤 2,1 = 𝑐 𝑏, 𝑎 + 𝑑𝑡𝑤 1,1= 1 + 0 = 1

0 1

Page 68: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(X, Z)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1

𝑑𝑡𝑤 3,1 = 𝑐 𝑔, 𝑎 + 𝑑𝑡𝑤 2,1= 1 + 1 = 2

0 1 2

Page 69: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(X, Z)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1

𝑑𝑡𝑤 1,2 = 𝑐 𝑎, 𝑔 + 𝑑𝑡𝑤 1,1= 1 + 0 = 1

0 1 2

1

Page 70: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(X, Z)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1

𝑑𝑡𝑤 1,3 = 𝑐 𝑎, 𝑔 + 𝑑𝑡𝑤 1,2= 1 + 1 = 2

0 1 2

1

2

Page 71: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(X, Z)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1

𝑑𝑡𝑤 2,2= 𝑐 𝑏, 𝑔

+ min𝑑𝑡𝑤 1,2 ,𝑑𝑡𝑤 1,1 ,𝑑𝑡𝑤 2,1

= 1 +min 1,0,1= 1 + 0 = 1

0 1 2

1

2

1

Page 72: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(X, Z)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1

𝑑𝑡𝑤 3,2= 𝑐 𝑔, 𝑔

+min𝑑𝑡𝑤 2,2 ,𝑑𝑡𝑤 2,1 ,𝑑𝑡𝑤 3,1

= 0 +min 1,1,2= 0 + 1 = 1

0 1 2

1

2

1 1

Page 73: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(X, Z)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1

𝑑𝑡𝑤 2,3= 𝑐 𝑏, 𝑔

+ min𝑑𝑡𝑤 1,3 ,𝑑𝑡𝑤 1,2 ,𝑑𝑡𝑤 2,2

= 1 +min 2,1,1= 1 + 1 = 2

0 1 2

1

2

1 1

2

Page 74: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(X, Z)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1

𝑑𝑡𝑤 3,3= 𝑐 𝑔, 𝑔

+min𝑑𝑡𝑤 2,3 ,𝑑𝑡𝑤 2,2 ,𝑑𝑡𝑤 3,2

= 0 +min 2,1,2= 0 + 1 = 1

0 1 2

1

2

1 1

2 1

Page 75: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

DTW(X, Z)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1 0 1 2

1

2

1 1

2 1 DTW(X, Z) = 1.

Optimal Warping Path (OWP)

P can be found by chasing

pointers (red arrows): P =

((1,1), (2, 2), (3, 3)).

Page 76: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Possible Optimizations of DTW

The computation of DTW can be optimized so that only the

cells within a specific window are considered

Page 77: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Possible Optimizations of DTW

You may have realized by now that if we care

only about the total cost of warping sequence X

with sequence Y, we do not need to compute

the entire N x M cost matrix – we need only two

columns

The storage savings are huge, but the running

time remains the same – O(N x M)

We can also normalize the DTW cost by N x M

to keep it low

Page 78: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Spoken Word Recognition

source code is in WavAudioDictionary.java

Page 79: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

General Outline

Given a directory of audio files with spoken words, process

each file into a table that maps specific words (or phrases)

to digital signal vectors

These signal vectors can be pre-processed to eliminate

silences

An input audio file is taken and digitized into a digital

signal vector

The input vector is compared with DTW scores b/w the

input vector and the digital vectors in the table

Page 80: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Optimizations

If we use DTW to compute the similarity b/w the

digital audio input vector and the vectors in the table,

it is vital to keep the vectors as short as possible w/o

sacrificing precision

Possible suggestions: decreasing the sampling rate

and merging samples into super-features (e.g., Haar

coefficients)

Parallelizing similarity computations

Page 81: NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

References

M. Muller. Information Retrieval for Music and

Motion, Ch.04. Springer, ISBN 978-3-540-74047-6

Bachu, R. G., et al. “Separation of Voiced and

Unvoiced using Zero Crossing Rate and Energy of the

Speech Signal." American Society for Engineering

Education (ASEE) Zone Conference Proceedings. 2008.