Nonlinear Aspects of Speech Production: Fractals and Chaotic …cvsp.cs.ntua.gr/interspeech2018/slides/PM_Nonlinear... · 2018-09-27 · Nonlinear Aspects of Speech Production: Fractals

Computer Vision, Speech Communication & Signal Processing Group,Intelligent Robotics and Automation Laboratory

National Technical University of Athens, Greece (NTUA)Robot Perception and Interaction Unit,

Athena Research and Innovation Center (Athena RIC)

Nonlinear Aspects of Speech Production: Fractals and Chaotic Dynamics

Petros Maragos

1

Summer School on Speech Signal Processing (S4P)DA-IICT, Gandhinagar, India, 9-11 Sept. 2018

2

Outline

Nonlinear Speech Processing Turbulence: Fractals, Chaotic Dynamics

Multiscale Fractal Dimensions of Speech Sounds Fractal Modulations for Fricative SoundsChaotic Dynamics of Speech SoundsAlgorithms for Speech Fractal & Chaos Analysis Application to Speech RecognitionApplication to Music Recognition

)(ns

IMPULSETRAIN

GENERATOR

PITCH PERIOD

GLOTTALPULSE

MODELG(z)

X

RANDOMNOISE

GENERATORX

VOCALTRACTMODEL

V(z)

VOCAL TRACT

PARAMETERS

RADIATIONMODEL

R(z)

VA

NA

)(nu

VOICED/UNVOICED SWITCH

(Rabiner & Schafer, 1978)

Linear Source-Filter Model

Nonlinear Fluid Dynamic of the Vocal Tract(Kaiser 1993)

Physics of Speech Airflow• airflow variables: = air density; = pressure

= 3D air particle velocity

• governing equations:mass conservation (continuity eqn):

momentum conservation (Navier-Stokes eqn):

state equation:

• time-varying boundary conditions

u

0ut

2 13

u u u p g u ut

1.4 const.p

p

Speech Aerodynamics• Reynolds number:

• low viscosity μ high Reinertia forces viscous forces

• “aerodynamic” phenomena (Re >>1):air jet, rotational motion, separated airflow,boundary layers, vortices, turbulence

• experimental & theoretical evidence for nonlinear phenomena:

Teager (1970s–1980s), Kaiser (1983 – ), Thomas (1986),

McGowan (1988), Barney, Shadle & Davis (1999), ...

( ) ( )Re velocity scale length scale

Vortices• vorticity:• VORTEX is a flow region of similar• a vortex can be generated by:

– velocity gradients in boundary layers– separated air flow– curved geometry of vocal tract

• dynamics of vortex propagation:

vorticity twisting & stretchingdiffusion of vorticity

u

2u ut

u

2

Nonlinear Speech Processing

• Modulations

• Turbulence– Fractals– Chaos

Turbulence• flow state with broad-spectrum rapidly-varying (in space and

time) velocity and vorticity• transition to turbulence is easier for higher Re flows• eddies: vortices of a characteristic size

• Energy Cascade Theory (Richardson,1922)(multiscale hierarchy of eddies)

• 5/3 spectral law (Kolmogorov, 1941):

wavenumber energy dissipation rate

velocity wavenumber spectrum

2 3 5 3,S k r r k 2 /k

r ,S k r

Turbulence, Fractals and Chaos

• fractal geometry quantifies multiscale structuresin turbulence

• Kolmogorov’s 5/3 law

• we use fractal dimension to quantify “amount” of turbulence in speech

• chaos turbulence

2 3Var u x u x x x

Multiscale Fractal Dimension of Speech Spounds

0 0.5 1 1.5 2 2.5 3 3.5 41

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

1.9

2

SCALE (millisec)

FR

AC

TA

L D

IME

NS

ION

of

/ F /

0 5 10 15 20 25 30−400

0

400

TIME (millisec)

SP

EE

CH

SIG

NA

L: /

F /

0 5 10 15 20 25 30−800

0

800

TIME (millisec)

SP

EE

CH

SIG

NA

L: /

V /

0 0.5 1 1.5 2 2.5 3 3.5 41

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

1.9

2

SCALE (millisec)

FR

AC

TA

L D

IME

NS

ION

of

/ V /

0 0.5 1 1.5 2 2.5 3 3.5 41

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

1.9

2

SCALE (millisec)

FR

AC

TA

L D

IME

NS

ION

of

/ IY

/

0 5 10 15 20 25 30−3000

0

3000

TIME (millisec)

SP

EE

CH

SIG

NA

L: /

IY /

/f/ /iy//v/

[ P. Maragos & A. Potamianos, JASA 1999 ]

−1

−0.5

0

0.5

1

−1

−0.5

0

0.5

1−1

−0.5

0

0.5

1

/ao/,DE=6, #1846

−1

−0.5

0

0.5

1

−1

−0.5

0

0.5

1−1

−0.5

0

0.5

1

/iy/,DE=5, #1068

Speech Attractors0 500 1000 1500

−1

−0.5

0

0.5

1

Time

X(t

)/ao/

0 200 400 600 800 1000−1

−0.5

0

0.5

1

Time

X(t

)

/iy/

−1.5−1

−0.50

0.51

−1.5

−1

−0.5

0

0.5

1−1.5

−1

−0.5

0

0.5

1

/k/,DE=6, #816

0 200 400 600 800−1

−0.5

0

0.5

1

Time

X(t

)

/k/

−1

−0.5

0

0.5

1

−1

−0.5

0

0.5

1−1

−0.5

0

0.5

1

/s/,DE=5, #829

0 200 400 600 800−1

−0.5

0

0.5

1

Time

X(t

)

/s/

[ Pitsikalis & Maragos, Speech Commun 2009 ]

Multiscale Fractal Dimensionsfor

Speech Sounds

Refs: • P. Maragos and A. Potamianos, “Fractal Dimensions of Speech Sounds: Computation and

Application to Automatic Speech Recognition”, Journal of Acoustical Society of America, March 1999.

• P. Maragos, “Fractal Signal Analysis Using Mathematical Morphology”, in Advances in Electronics and Electron Physics, vol.88, Academic Press, 1994.

14

FRACTALS: Definitions• Mandelbrot’s definition

set is fractal Hausdorff dim topological dim

• Examples

• SignalsA function is a fractal if its graph

is a fractal set in is continuous

S ( ) HD S ( )TD S

1 2 T HD D 0 1 T HD D

2 3 T HD D

S =S =

S =fractal curvefractal surface

fractal dust

: vf ( )Gr f 1v

f [ ( )] 1T Hv D D Gr f v

15

‘FRACTAL’ DIMENSIONS (OF SETS IN Rν)

Hausdorff dimension

Minkowski-Bouligand dimension

box counting dimension

similarity dimension

0 T H MB BCD D D D v£ £ £ = £

H SD D£

HD =

MBD =

BCD =

SD =

Morphological Measurement of Fractal Dimension

• Minkowski cover of curve

• Fractal (Minkowski-Bouligand) dimension

• Least-Squares line fit to data

: ( )Bz G

G rB z C r

1,2D

1;2B D

B B

A rA r area C r length of G r r

r

2log , log 1BA r r r D

Morphological (Flat & Weighted) FiltersDilation (Max-plus convolution):

Erosion (Min-plus correlation):

( )( ) max ( ) ( )yf g x f y g x y

( )( ) min ( ) ( )yf g x f y g y x

( )f g f g g

( )f g f g g

0 100 200 300

0

50

100

Sample Index

ORIGINAL SIGNAL

−10 0 100

10

Sample Index

PARABOLA PULSE

0 100 200 300−10

0

50

100

110DILATION BY FLAT & PARABOLIC SE

0 100 200 300−10

0

50

100

110EROSION BY FLAT & PARABOLIC SE

0 100 200 300−10

0

50

100

110OPENING BY FLAT & PARABOLIC SE

0 100 200 300−10

0

50

100

110CLOSING BY FLAT & PARABOLIC SE

Opening:

Closing:

18

Minkowski Fractal Dimension of 1D Curveand Morphological Algorithm for 1D Signals

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

11.21.41.61.8

2

TIME (in sec)

ZERO−CROSSINGSMS−AMPLITUDE

FRACTAL DIMENSION

/ SOOTHING /

SPEECH SIGNAL

ST Speech & Fractal Dimension

Multiscale Speech Fractal Dimension

• short-time speech signal

• signal graph

• fractal constant power law

• variable power law

• multiscale fractal “dimension”

(speech fractogram):

of

short-time speech segment

around time

,tS Tt 0

2 , : 0G t S t R t T

2 , as 0Darea G B C

DtMFD ,

2 Darea G B C

t

Multiscale Fractal Dimension of Speech Spounds

0 0.5 1 1.5 2 2.5 3 3.5 41

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

1.9

2

SCALE (millisec)

FR

AC

TA

L D

IME

NS

ION

of

/ F /

0 5 10 15 20 25 30−400

0

400

TIME (millisec)

SP

EE

CH

SIG

NA

L: /

F /

0 5 10 15 20 25 30−800

0

800

TIME (millisec)

SP

EE

CH

SIG

NA

L: /

V /

0 0.5 1 1.5 2 2.5 3 3.5 41

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

1.9

2

SCALE (millisec)

FR

AC

TA

L D

IME

NS

ION

of

/ V /

0 0.5 1 1.5 2 2.5 3 3.5 41

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

1.9

2

SCALE (millisec)

FR

AC

TA

L D

IME

NS

ION

of

/ IY

/

0 5 10 15 20 25 30−3000

0

3000

TIME (millisec)

SP

EE

CH

SIG

NA

L: /

IY /

/f/ /iy//v/

[ P. Maragos & A. Potamianos, JASA 1999 ]

0 1 2 3 41

1.2

1.4

1.6

1.8

2

/AA/

SCALE (millisec)

FR

AC

TA

L D

IME

NS

ION

of /

AA

/

0 1 2 3 41

1.2

1.4

1.6

1.8

2

/B/

SCALE (millisec)

FR

AC

TA

L D

IME

NS

ION

of /

B/

0 1 2 3 41

1.2

1.4

1.6

1.8

2

/F/

SCALE (millisec)

FR

AC

TA

L D

IME

NS

ION

of /

F/

0 1 2 3 41

1.2

1.4

1.6

1.8

2

/R/

SCALE (millisec)

FR

AC

TA

L D

IME

NS

ION

of /

R/

0 1 2 3 41

1.2

1.4

1.6

1.8

2

/EN/

SCALE (millisec)

FR

AC

TA

L D

IME

NS

ION

of /

EN

/

0 1 2 3 41

1.2

1.4

1.6

1.8

2

/M/

SCALE (millisec)

FR

AC

TA

L D

IME

NS

ION

of /

M/

Mean and standard deviation (error bars) of the multiscale fractal dimension for the phonemes /aa/, /b/, /en/, /f/, /m/, /r/ from the TIMIT database (20 ms window, updated every 10 ms. Average over 200 phonemic instances.)

Mean MFD for /sh/, /zh/, /uh/, /t/, /d/

81.2% 83.5% 84.5%

11,

,,,DD

CECE

, , ,E C E C

1 16 1 16

, , ,

,

E C E C

D D

Features

Models5-mixture Gaussians 85.6% 86.3%10-mixture Gaussians 88.6% 88.9%

},,,,,{

CECECE

},{},

,,,,{

DDCE

CECE

Word Percent Correct For the E-set Recognition Task (ISOLET Database, 5-Mixture Gaussians per HMM State)

Word Percent Correct for the E-set Recognition Task

Maragos & Potamianos, JASA 1999

Fractal Modulationsfor

Fricative Sounds

Ref: • A. G. Dimakis and P. Maragos, “Phase Modulated Resonances Modeled as Self-

Similar Processes With Application to Turbulent Sounds”, IEEE Transactions on Signal Processing, Nov. 2005.

• An important class of statistically self-similar randomprocesses defined by their measured power spectra:

A truly enormous collection of natural phenomena exhibit 1/f-type spectral behavior over a wide frequency range:(frequency variations in quartz crystal oscillators, geophysical variations,heart rate variations, electronic device noises, network traffic flow andeconomic time series.)

• Most popular mathematical model for Gaussian 1/fprocesses: Fractional Brownian Motion (FBM)

1/f Noises

2

( )| |

S

27

Noises• Stochastic processes with power spectrum • Filtering white noise with convolution kernel

(Fractional Integration)• Non – exponential autocorrelation• White noise• Pink noise• Fractal Brownian Motion• Brown noise• Black noise• Applications: electronics, geophysics, astronomy, music,

acoustics, optics, economics, traffic flows, communications, geometry of nature

1 f

1 f 2 1t

1t 0

1

1 3

2 2

Examples of FFT-based Synthesis of 1D FBM

29

FBM Synthesis of Fractal Landscapes

D = 2.15

D = 2.5

D = 2.8

R. Voss, 1988

1/f Speech Modulation Model

• Model a resonance of a random speech phoneme as a phase-modulated 1/f signal:

• Nonlinear phase signal P(t) modeled as 1/f randomprocess.

• Useful model for broad resonances often observed infricative voiced or unvoiced sounds and probably caused bynonlinear phenomena during speech production.

( )

( ) cos ( )c

t

S t A t P t

• Isolate resonance: Bandpass filter the speech signal.• Demodulate filtered signal using ESA, obtain instant

frequency F(t), and median filter to reduce spikes.• Estimate phase modulation signal P(t) by integrating IF:

• Fit 1/fγ model to P(t). Methods tested include: Linear regression on Periodogram Estimation using variance of wavelet coefficients Maximum Likelihood estimation

Parameter Estimation in 1/f-PM

0( ) 2 ( ( ) )

tP t F F d

1 2 3 4 5 6 7-20

-15

-10

-5

0

5

10

Scale m101

102

103

104

-300

-250

-200

-150

-100

-50

0

Frequency (Hz)

=2.99

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.052800

3000

3200

3400

3600

3800

4000

4200

4400

4600

4800

Time (sec)0 1000 2000 3000 4000 5000 6000 7000 8000

-180

-160

-140

-120

-100

-80

-60

Frequency (Hz)

/S/ phoneme experiment

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05-1

-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

Time (sec)

Speech signal

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05-2

-1

0

1

2

3

4

5

6

7

8

Time (sec)

Power Spectrum

Phase modulation P(t)

Instant Frequency

PSD of P(t) Var. of wavelet coefficients

101

102

103

104

-350

-300

-250

-200

-150

-100

-50

0

50

Frequency (Hz)

=3.63

0 0.01 0.02 0.03 0.04 0.05 0.06 0.07-1

-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

Time (sec)

/Z/ phoneme experimentSpeech signal

0 1000 2000 3000 4000 5000 6000 7000 8000

-180

-160

-140

-120

-100

-80

Frequency (Hz)

Power Spectrum

0 0.01 0.02 0.03 0.04 0.05 0.06 0.072400

2600

2800

3000

3200

3400

3600

3800

4000

4200

4400

Time (sec)

Instant Frequency

0 0.01 0.02 0.03 0.04 0.05 0.06 0.07-20

-15

-10

-5

0

5

10

Time (sec)

1 2 3 4 5 6 7-20

-15

-10

-5

0

5

10

Scale m

Phase modulation P(t) PSD of P(t) Var. of wavelet coefficients

Chaotic Dynamics of Speech Sounds

Refs: • V. Pitsikalis and P. Maragos, “Filtered Dynamics and Fractal Dimensions for

Noisy Speech Recognition”, IEEE Signal Processing Letters, Nov. 2006.• V. Pitsikalis and P. Maragos, “Analysis and Classification of Speech Signals by

Generalized Fractal Dimension Features”, Speech Communication, Dec. 2009.• I. Kokkinos and P. Maragos, “Nonlinear Speech Analysis Using Models for

Chaotic Systems”, IEEE Transactions Speech and Audio Processing, Nov. 2005.

Embedding-Attractor Reconstruction

•Parameters to specify: , ET D

h G 1x t

1 2x t T

1x t T

0 200 400 600 800 1000 1200 1400 1600 1800 2000−10

−8

−6

−4

−2

0

2

4

6

8

10X−projection (Lorenz)

Time

X(t

) 1Y t 1x t 1x t T

1 1Ex t D T

−10−8−6−4−20246810

−10

−8

−6

−4

−2

0

2

4

6

8

10

−10

0

10

Reconstructed Lorenz Attractor,20000 iterations

X

Y

Z

−10 −8 −6 −4 −2 0 2 4 6 8 10−20

−10

0

10

20

0

5

10

15

20

25

30

Lorenz Attractor;σ=5 R=15 B=1;Δt=0.25; 20000 iterations

X

Y

Z

• Nonlinear Dynamic System (Lorenz)

• Attractor

• 1D Projection

dxx y

dtdy

R x y x zdtdz

B z x ydt

−10 −8 −6 −4 −2 0 2 4 6 8 10−20

−10

0

10

20

0

5

10

15

20

25

30

Lorenz Attractor;σ=5 R=15 B=1;Δt=0.25; 20000 iterations

X

Z

Y

0 500 1000 1500 2000 2500 3000 3500 4000−10

−8

−6

−4

−2

0

2

4

6

8

10X−projection (Lorenz)

Time

X(t

)

Time Delay

• Average Mutual Information between

• “Optimum” Time Delay:

( ), ( )x t x t T

Pr ( ),Pr , log

Pr Prx t x t T

I T x t x t Tx t x t T

min arg min ( )optT

T I T

0 50 100 1500

1

2

3

4

5Lorenz System,σ=5 R=15 B=1,Δt=0.25,#10000

Time Delay T

Ave

rage

Mut

ual I

nfor

mat

ion

Embedding Dimension• Sufficient: • False Neighbors: from projection• True Neighbors: from dynamics• False Neighbors Criterion

• When % false neighbors =0,

Attractor is unfolded

2E AttractorD D

1 1,

( ) ( ) ( ) ( )( ) ( )

d d d di j

d d

y i y j y i y jR Threshold

y i y j

1 2 3 4 50

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4Lorenz System,σ=5 R=15 B=1,Δt=0.25,#2000

Embedding Dimension DE

% F

alse

Nei

ghbo

rs

−1

−0.5

0

0.5

1

−1

−0.5

0

0.5

1−1

−0.5

0

0.5

1

/ao/,DE=6, #1846

−1

−0.5

0

0.5

1

−1

−0.5

0

0.5

1−1

−0.5

0

0.5

1

/iy/,DE=5, #1068

Speech Attractors0 500 1000 1500

−1

−0.5

0

0.5

1

Time

X(t

)/ao/

0 200 400 600 800 1000−1

−0.5

0

0.5

1

Time

X(t

)

/iy/

−1.5−1

−0.50

0.51

−1.5

−1

−0.5

0

0.5

1−1.5

−1

−0.5

0

0.5

1

/k/,DE=6, #816

0 200 400 600 800−1

−0.5

0

0.5

1

Time

X(t

)

/k/

−1

−0.5

0

0.5

1

−1

−0.5

0

0.5

1−1

−0.5

0

0.5

1

/s/,DE=5, #829

0 200 400 600 800−1

−0.5

0

0.5

1

Time

X(t

)

/s/

[ Pitsikalis & Maragos, Speech Commun 2009 ]

Correlation Dimension (Speech)

Correlation Dimension:

Correlation integral:

N: # of points, r: scale, x: set points,Η: Ηeavyside function

1

1,1

N

i ji j i

C N r H r x xN N

0

log ,lim lim

logC r N

C N rD

r

Correlation Dimension (Lorenz)

•

•

1

1,1

N

i ji j i

C N r H r x xN N

0

log ,lim lim

logC r N

C N rD

r

10−1

100

101

104

105

106

Lorenz System,σ=5 R=15 B=1,Δt=0.25,#4000

Cor

rela

tion

Inte

gral

Scale0.4 0.6 0.8 1 1.2 1.4 1.6 1.8

1.8

1.85

1.9

1.95

2

2.05

2.1

2.15

2.2

Scale

Loca

l Slo

pe

Lorenz System,σ=5 R=15 B=1,Δt=0.25,#4000

10−3

10−2

10−1

100

101

100

101

102

103

104

105

106

107

averaging over 8 phonemes of type /ao/

scale

corr

elat

ion

inte

gral

plain averaging weighted averaging 10

−210

−110

010

110

−1

100

101

102

103

104

105

106

107

averaging over 15 phonemes of type /iy/

scale

corr

elat

ion

inte

gral

plain averaging weighted averaging

10−2

10−1

100

101

100

101

102

103

104

105

106

107

averaging over 13 phonemes of type /s/

scale

corr

elat

ion

inte

gral


10−3

10−2

10−1

100

101

100

101

102

103

104

105

106

averaging over 11 phonemes of type /k/

scale

corr

elat

ion

inte

gral


Correlation Integrals of Speech Sounds

/ao//iy/

/k//s/

Fractal Features

800 1000 1200 1400−1

−0.5

0

0.5

1

(ms)

N-d

CleanedEmbedding N-d

SignalLocal SVDspeech

signalFiltered Dynamics -Correlation Dimension (8)

Noisy Embedding Filtered Embedding

FDCD

Multiscale Fractal Dimension (6)MFDGeometrical

Filtering

−0.200.20.40.6−0.2 0 0.2 0.4

−0.2

0

0.2

0.4

−0.200.20.40.6−0.2 0 0.2 0.4

−0.2

0

0.2

0.4

0.1 0.2 0.30

0.05

0.1

0.15

Median Neighborhood Distance

Den

sity

FilteredNoisy

Neighborhood Distance Reduction

Projection

50 100 150 200 250 300 350 400

−600

−400

−200

0

200

400

600 Mproj−noisyMproj−cleanMproj−filt

Enhanced Speech

[ Pitsikalis & Maragos, IEEE SPL 2006 ]

Noisy Speech Database: Aurora 2

Task: Speaker Independent Recognition of Digit Sequences TI - Digits at 8kHz Training (8440 Utterances per scenario, 55M/55F) Clean (8kHz, G712) Multi-Condition (8kHz, G712) 4 Noises (artificial): subway, babble, car, exhibition 5 SNRs : 5, 10, 15, 20dB , clean

Testing, artificially added noise 7 SNRs: [-5, 0, 5, 10, 15, 20dB , clean] A: noises as in multi-cond train., G712 (28028 Utters) B: restaurant, street, airport, train station, G712 (28028 Utters) C: subway, street (MIRS) (14014 Utters)

Average Recognition Results on Aurora 2: plain CD vs FDCD

0

20

40

60

80

100

Wor

d A

ccur

acy

(%)

20dB 15dB 10dB 5dB 0dB Aver

SNR

+Plain CD +FDCD

Plain CD: Correlation Dimension without Dynamical Filtering

Average Recognition Results on Aurora 2: FDCD

0102030405060708090

100W

ord

Acc

urac

y (%

)

Clean 20dB 15dB 10dB 5dB 0dB Aver

SNR

Baseline +FDCD

Up to +40%

Average Recognition Results on Aurora 2: MFD

0102030405060708090

100W

ord

Accu

racy

clean 20 dB 15 dB 10 dB 5 dB 0 dB Ave.

SNRBaseline MFD

Up to +27%

Average Recognition Results on Aurora 2

2

12

22

32

42

52

62

72

Accuracy

Clean 20 15 10 5 0 average

SNR

Plain Fractal Features (Aurora 2)

CDFDCDMFD

Average Recognition Results on Aurora 2:Hybrid Features: Fractals and Modulations

3040

5060

7080

9010

0

Acc

urac

y

20 dB 10 dB 5 dB

SNR

Baseline +FMP +FDCD +FMP+FDCD

Up to +61%

[ Pitsikalis & Maragos, IEEE SPL 2006; Speech Commun 2009 ]

Lyapunov Exponents (L.E.s)

k-th Lyapunov Number:k-th Lyapunov Exponent:

1/

lim ( )nn

k n kL rln( )k kL

1 2 1 ek k D

Lyapunov Exponents (II)

Quantify signal predictability (orbits convergence-divergence rates in phase space)

Positive L.E. exponential divergenceNegative L.E. exponential convergence

Dissipative system sum of L.Es <0Chaotic system at least one L.E >0

Invariants of system dynamics useful for characterization /recognition purposes

Determine prediction horizon(upper bound of system predictability)

Prediction on Reconstructed Attractor(Kokkinos & Maragos, T-SAP 2005)

Goal: capture dynamics of MIMO systemfrom input-output pairs

Models tested: Local Polynomials Global Polynomials Radial Basis Function networks Takagi-Sugeno-Kang models Support Vector Machines

1 ( )n nX F X

1 ( )n nX X f

Computation of Lyapunov Exponents Consider an orbit Oseledec matrix:

i-th L.E. , is i-th eigenvalue of OSL Limitations:• Only approximation of Jacobian J of f is available

( F is an approximation to f )• Ill-conditioned nature of OSL

recursive QR decomposition technique• Limited data set local L.E.s

1 1lim ( ) ( ) ( ) ( )T T TN F N F F F NX X X X OSL J J J J

1 ( ), 1, 2, ,n nX X n N f

log( )i is is

Validation of Lyapunov Exponents

Inverse time sequencing of data • True exponents flip sign (divergence of nearby orbits

becomes convergence & vice versa)• False exponents remain negative

(artifact of embedding processno dependence on system dynamics)

Models that learn the data (and not the system dynamics) fail to give such results.

RBF nets, TSK-0, Global Polynomials ... failed SVM, TSK-1 succeeded

Applications to Speech Signals(Kokkinos & Maragos 2005)

Prediction – coding with global polynomials(smaller MSE than LPC with same # of params )

Speech analysis using Lyapunov exponents• Vowels have small positive L.E.s• Voiced fricatives have bigger positive L.E.s• Unvoiced fricatives have no validated L.E.s

(too noisy)• Stop sounds have no validated L.E.s

(non-stationary)• Non-validated L.E.s are still useful

Speech Results: validated L.E.s Phoneme: /aa/

Phoneme: /v/ 0 200 400 600 800 1000 1200

−1

−0.5

0

0.5

1

# of Jacobians used

Lyapunov Exponents of /aa/

Direct L.E.−Inverse L.E.

−0.50

0.51

−0.50

0.51

−0.5

0

0.5

1

X1

Reconstructed Attractor of /aa/

X2

X3

200 400 600 800 1000 1200−0.6

−0.4

−0.2

0

0.2

0.4

0.6

0.8

1Original Signal: /aa/

Sample index

1D

Sig

na

l

200 400 600 800 1000

−0.5

0

0.5

1Original Signal: /v/

Sample index

1D

Sig

na

l

−0.50

0.51

−0.50

0.51

−0.5

0

0.5

1

X1

Reconstructed Attractor of /v/

X2

X3

200 400 600 800

−0.6

−0.4

−0.2

0

0.2

0.4

0.6

# of Jacobians used

Lyapunov Exponents of /v/


Speech Results: Non-validated L.E.s Phoneme: /sh/

Phoneme: /t/500 1000 1500

−1

−0.5

0

0.5

1Original Signal: /sh/

Sample index

1D S

igna

l

−1

0

1

−1

0

1−1

−0.5

0

0.5

1

X1

Reconstructed Attractor of /sh/

X2

X3

−0.50

0.51

−0.50

0.51

−0.5

0

0.5

1

X1

Reconstructed Attractor of /t/

X2

X3

200 400 600 800 1000

−0.6

−0.4

−0.2

0

0.2

0.4

0.6

0.8

1Original Signal: /t/

Sample index

1D S

igna

l

0 500 1000 1500

−2

−1

0

1

2

# of Jacobians used

Lyapunov Exponents of /sh/


0 200 400 600 800 1000

−1.5

−1

−0.5

0

0.5

1

1.5

2

# of Jacobians used

Lyapunov Exponents of /t/


Speech Lyapunov Exponents

[ Kokkinos & Maragos, IEEE T-SAP 2005 ]

Speech Sound ClassificationUsing only L.E.s (PCA projection of 3 first L.E.s):

x :Unvoiced Fric., o :Unvoiced Stop

x :Unvoiced Fric., o :Voiced Fric.

x :Vowel, o :Voiced Stop

x :Vowel, o :Unvoiced Fric.

> :Unvoiced Stop, o :Voiced Stop

x :Vowel, o :Unvoiced Stop

When combined with MFCC: (4 classes) ~12% smaller error using K-NN classifier

Other Works on Speech Fractals or Chaotic Dynamics

C. A. Pickover and A. Khorasani, ‘‘Fractal Characterization of Speech Waveform Graphs,’’ Computer Graphics 1986.

P. J. B. Jackson and C. H. Shadle, “Frication noise modulated by voicing, as revealed by pitch-scaled decomposition”, J. Acoust. Soc. Amer. 2000.

S. McLaughlin and P. Maragos, “Nonlinear Methods for Speech Analysis and Synthesis”, in Advances in Nonlinear Signal and Image Processing, edited by S. Marshall and G. L. Sicuranza, EURASIP Book Series on Signal Processing and Communications, Hindawi Publ. Corp., 2006, pp.103-140.

M. Zaki, J. N. Shah and H. A. Patil, “Effectiveness of Multiscale Fractal Dimension-based Phonetic Segmentation in Speech Synthesis for Low Resource Language”, in Proc. Int’l Conf. on Asian Language Processing (IALP) 2014.

K. López-de-Ipina, J. Solé-Casals, H. Eguiraun, J.B. Alonso, C.M. Travieso, A.Ezeiza, N Barroso, M. Ecay-Torres, P. Martinez-Lage, Blanca Beitia, “Feature selection for spontaneous speech analysis to aid in Alzheimer’s disease diagnosis: A fractal dimension approach”, Computer Speech & Language 2015.

E. Tzinis, G. Paraskevopoulos, C. Baziotis, A. Potamianos, “Integrating Recurrence Dynamics for Speech Emotion Recognition”, in Proc. Interspeech 2018.

61

Fractals and Music

Ref: • A. Zlatintsi and P. Maragos, “Multiscale Fractal Analysis of Musical Instrument Signals

with Application to Recognition”, IEEE Transactions on Audio, Speech and Language Processing, Apr. 2013.

Multiscale Fractal Dimension of Music Sounds

0 1 2 3 4 53

4

5

6

7

8

LOG SCALE

LOG

AR

EA

BassBassoonBb ClarinetCelloFluteHornTuba

[ Zlatintsi & Maragos, T‐ASLP 2013 ]

63

Morphological Covering MethodD 2 lim log[AB(s) / s2 ]

log(1 / s)

Double Bass steady state (solid line), its multiscale flat dilations and erosions at scales s=25,75, where B is a 3-sample symmetric horizontal segment with zero height.

P. Maragos and A. Potamianos. J. Acoust. Soc. Amer., 1999.

0 1 2 3 4 53

4

5

6

7

8

LOG SCALELO

G A

REA

BassBassoonBb ClarinetCelloFluteHornTuba

Multiscale Fractal Analysis of Musical Instrument Signals

log[AB(s)] vs log(s) for the seven analyzed instruments for the note C3 except for Bb Clarinet and Flute shown for C5 instead. Notethe difference in the slope for larger scales . (for 30ms analysis window).

64

MFD Analysis for Steady State of the Note

Upright Bass Clarinet Cello

Flute Clarinet Oboe Piano

Mean MFD (middle line) and standard deviation (error bars) (for 30 ms analysis window, updated every 15 ms).

65

MFD Analysis for Steady State of the Note

D = 1.42

D = 1.35

D = 1.65

D = 1.46

D = 1.37

D = 1.89

Horn Tuba Bassoon

Bb Clarinet Flute Bb Clarinet Flute

66

MFD Analysis on Synthesized Signals

Strong dependence on the frequency

f=5Hz f=50Hz f=100Hz

f=200Hz

f=300Hz f=400Hz f=500Hz f=600Hz

67

Analysis of MFD during the Attack

MFDs estimated for the 7 analyzed instruments attacks, averaged over the whole range (using 30 msanalysis windows).

Similar as for the steady state Higher D for small scales st and

more fragmentation.

Increased value of D(s = 1).

Clear distinction of D among some of the analyzed instruments.

Mean MFD and standard deviation of the attack and steady state of A3 for Cello (left images) and F4 for Flute (right images).

68

MFD Variability of the Steady State for the Same Instrument over One Octave

Dependence on the acoustical frequency and the MFD profile that increases rapidly for higher frequency sounds.

The instruments’ specific MFDs beholds the shape observed for the specific octave.

MFD of Bb Clarinet steady state notes, over one octave for one 30 ms analysis window.

69

Experimental Evaluation

Double Bass, Bassoon & Tuba best recognized Low discriminability between Bb Clarinet & Flute

Enhanced discriminability for Bassoon, Bb Clarinet and Horn Decreased for Cello & Flute On average MFD+MFCC features improve the recognition over the baseline

Mean AccuracyΗΜΜ Ν=5 Μ=3

Example of the 13 logarithmically sampled points of the MFD, for Bb Clarinet (A3), forming the MFDLG feature vector.

70

Conclusions

Existence-Importance of nonlinear speech structure of turbulence type (fractals, chaotic dynamics)

Speech technology systems can benefit from including such nonlinear features

Find/extract robust nonlinear features of turbulence type Improve computational algorithms Fuse nonlinear with linear features Applications also to other sound signals, e.g. music

For more information, demos, and current results: http://cvsp.cs.ntua.gr and http://robotics.ntua.gr

Documents

Nonlinear Aspects of Speech Production: Fractals and Chaotic …cvsp.cs.ntua.gr/interspeech2018/slides/PM_Nonlinear... · 2018-09-27 · Nonlinear Aspects of Speech Production: Fractals