Upload
others
View
10
Download
0
Embed Size (px)
Citation preview
Computer Vision, Speech Communication & Signal Processing Group,Intelligent Robotics and Automation Laboratory
National Technical University of Athens, Greece (NTUA)Robot Perception and Interaction Unit,
Athena Research and Innovation Center (Athena RIC)
Nonlinear Aspects of Speech Production: Fractals and Chaotic Dynamics
Petros Maragos
1
Summer School on Speech Signal Processing (S4P)DA-IICT, Gandhinagar, India, 9-11 Sept. 2018
2
Outline
Nonlinear Speech Processing Turbulence: Fractals, Chaotic Dynamics
Multiscale Fractal Dimensions of Speech Sounds Fractal Modulations for Fricative SoundsChaotic Dynamics of Speech SoundsAlgorithms for Speech Fractal & Chaos Analysis Application to Speech RecognitionApplication to Music Recognition
)(ns
IMPULSETRAIN
GENERATOR
PITCH PERIOD
GLOTTALPULSE
MODELG(z)
X
RANDOMNOISE
GENERATORX
VOCALTRACTMODEL
V(z)
VOCAL TRACT
PARAMETERS
RADIATIONMODEL
R(z)
VA
NA
)(nu
VOICED/UNVOICED SWITCH
(Rabiner & Schafer, 1978)
Linear Source-Filter Model
Nonlinear Fluid Dynamic of the Vocal Tract(Kaiser 1993)
Physics of Speech Airflow• airflow variables: = air density; = pressure
= 3D air particle velocity
• governing equations:mass conservation (continuity eqn):
momentum conservation (Navier-Stokes eqn):
state equation:
• time-varying boundary conditions
u
0ut
2 13
u u u p g u ut
1.4 const.p
p
Speech Aerodynamics• Reynolds number:
• low viscosity μ high Reinertia forces viscous forces
• “aerodynamic” phenomena (Re >>1):air jet, rotational motion, separated airflow,boundary layers, vortices, turbulence
• experimental & theoretical evidence for nonlinear phenomena:
Teager (1970s–1980s), Kaiser (1983 – ), Thomas (1986),
McGowan (1988), Barney, Shadle & Davis (1999), ...
( ) ( )Re velocity scale length scale
Vortices• vorticity:• VORTEX is a flow region of similar• a vortex can be generated by:
– velocity gradients in boundary layers– separated air flow– curved geometry of vocal tract
• dynamics of vortex propagation:
vorticity twisting & stretchingdiffusion of vorticity
u
2u ut
u
2
Nonlinear Speech Processing
• Modulations
• Turbulence– Fractals– Chaos
Turbulence• flow state with broad-spectrum rapidly-varying (in space and
time) velocity and vorticity• transition to turbulence is easier for higher Re flows• eddies: vortices of a characteristic size
• Energy Cascade Theory (Richardson,1922)(multiscale hierarchy of eddies)
• 5/3 spectral law (Kolmogorov, 1941):
wavenumber energy dissipation rate
velocity wavenumber spectrum
2 3 5 3,S k r r k 2 /k
r ,S k r
Turbulence, Fractals and Chaos
• fractal geometry quantifies multiscale structuresin turbulence
• Kolmogorov’s 5/3 law
• we use fractal dimension to quantify “amount” of turbulence in speech
• chaos turbulence
2 3Var u x u x x x
Multiscale Fractal Dimension of Speech Spounds
0 0.5 1 1.5 2 2.5 3 3.5 41
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
2
SCALE (millisec)
FR
AC
TA
L D
IME
NS
ION
of
/ F /
0 5 10 15 20 25 30−400
0
400
TIME (millisec)
SP
EE
CH
SIG
NA
L: /
F /
0 5 10 15 20 25 30−800
0
800
TIME (millisec)
SP
EE
CH
SIG
NA
L: /
V /
0 0.5 1 1.5 2 2.5 3 3.5 41
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
2
SCALE (millisec)
FR
AC
TA
L D
IME
NS
ION
of
/ V /
0 0.5 1 1.5 2 2.5 3 3.5 41
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
2
SCALE (millisec)
FR
AC
TA
L D
IME
NS
ION
of
/ IY
/
0 5 10 15 20 25 30−3000
0
3000
TIME (millisec)
SP
EE
CH
SIG
NA
L: /
IY /
/f/ /iy//v/
[ P. Maragos & A. Potamianos, JASA 1999 ]
−1
−0.5
0
0.5
1
−1
−0.5
0
0.5
1−1
−0.5
0
0.5
1
/ao/,DE=6, #1846
−1
−0.5
0
0.5
1
−1
−0.5
0
0.5
1−1
−0.5
0
0.5
1
/iy/,DE=5, #1068
Speech Attractors0 500 1000 1500
−1
−0.5
0
0.5
1
Time
X(t
)/ao/
0 200 400 600 800 1000−1
−0.5
0
0.5
1
Time
X(t
)
/iy/
−1.5−1
−0.50
0.51
−1.5
−1
−0.5
0
0.5
1−1.5
−1
−0.5
0
0.5
1
/k/,DE=6, #816
0 200 400 600 800−1
−0.5
0
0.5
1
Time
X(t
)
/k/
−1
−0.5
0
0.5
1
−1
−0.5
0
0.5
1−1
−0.5
0
0.5
1
/s/,DE=5, #829
0 200 400 600 800−1
−0.5
0
0.5
1
Time
X(t
)
/s/
[ Pitsikalis & Maragos, Speech Commun 2009 ]
Multiscale Fractal Dimensionsfor
Speech Sounds
Refs: • P. Maragos and A. Potamianos, “Fractal Dimensions of Speech Sounds: Computation and
Application to Automatic Speech Recognition”, Journal of Acoustical Society of America, March 1999.
• P. Maragos, “Fractal Signal Analysis Using Mathematical Morphology”, in Advances in Electronics and Electron Physics, vol.88, Academic Press, 1994.
14
FRACTALS: Definitions• Mandelbrot’s definition
set is fractal Hausdorff dim topological dim
• Examples
• SignalsA function is a fractal if its graph
is a fractal set in is continuous
S ( ) HD S ( )TD S
1 2 T HD D 0 1 T HD D
2 3 T HD D
S =S =
S =fractal curvefractal surface
fractal dust
: vf ( )Gr f 1v
f [ ( )] 1T Hv D D Gr f v
15
‘FRACTAL’ DIMENSIONS (OF SETS IN Rν)
Hausdorff dimension
Minkowski-Bouligand dimension
box counting dimension
similarity dimension
0 T H MB BCD D D D v£ £ £ = £
H SD D£
HD =
MBD =
BCD =
SD =
Morphological Measurement of Fractal Dimension
• Minkowski cover of curve
• Fractal (Minkowski-Bouligand) dimension
• Least-Squares line fit to data
: ( )Bz G
G rB z C r
1,2D
1;2B D
B B
A rA r area C r length of G r r
r
2log , log 1BA r r r D
Morphological (Flat & Weighted) FiltersDilation (Max-plus convolution):
Erosion (Min-plus correlation):
( )( ) max ( ) ( )yf g x f y g x y
( )( ) min ( ) ( )yf g x f y g y x
( )f g f g g
( )f g f g g
0 100 200 300
0
50
100
Sample Index
ORIGINAL SIGNAL
−10 0 100
10
Sample Index
PARABOLA PULSE
0 100 200 300−10
0
50
100
110DILATION BY FLAT & PARABOLIC SE
0 100 200 300−10
0
50
100
110EROSION BY FLAT & PARABOLIC SE
0 100 200 300−10
0
50
100
110OPENING BY FLAT & PARABOLIC SE
0 100 200 300−10
0
50
100
110CLOSING BY FLAT & PARABOLIC SE
Opening:
Closing:
18
Minkowski Fractal Dimension of 1D Curveand Morphological Algorithm for 1D Signals
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
11.21.41.61.8
2
TIME (in sec)
ZERO−CROSSINGSMS−AMPLITUDE
FRACTAL DIMENSION
/ SOOTHING /
SPEECH SIGNAL
ST Speech & Fractal Dimension
Multiscale Speech Fractal Dimension
• short-time speech signal
• signal graph
• fractal constant power law
• variable power law
• multiscale fractal “dimension”
(speech fractogram):
of
short-time speech segment
around time
,tS Tt 0
2 , : 0G t S t R t T
2 , as 0Darea G B C
DtMFD ,
2 Darea G B C
t
Multiscale Fractal Dimension of Speech Spounds
0 0.5 1 1.5 2 2.5 3 3.5 41
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
2
SCALE (millisec)
FR
AC
TA
L D
IME
NS
ION
of
/ F /
0 5 10 15 20 25 30−400
0
400
TIME (millisec)
SP
EE
CH
SIG
NA
L: /
F /
0 5 10 15 20 25 30−800
0
800
TIME (millisec)
SP
EE
CH
SIG
NA
L: /
V /
0 0.5 1 1.5 2 2.5 3 3.5 41
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
2
SCALE (millisec)
FR
AC
TA
L D
IME
NS
ION
of
/ V /
0 0.5 1 1.5 2 2.5 3 3.5 41
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
2
SCALE (millisec)
FR
AC
TA
L D
IME
NS
ION
of
/ IY
/
0 5 10 15 20 25 30−3000
0
3000
TIME (millisec)
SP
EE
CH
SIG
NA
L: /
IY /
/f/ /iy//v/
[ P. Maragos & A. Potamianos, JASA 1999 ]
0 1 2 3 41
1.2
1.4
1.6
1.8
2
/AA/
SCALE (millisec)
FR
AC
TA
L D
IME
NS
ION
of /
AA
/
0 1 2 3 41
1.2
1.4
1.6
1.8
2
/B/
SCALE (millisec)
FR
AC
TA
L D
IME
NS
ION
of /
B/
0 1 2 3 41
1.2
1.4
1.6
1.8
2
/F/
SCALE (millisec)
FR
AC
TA
L D
IME
NS
ION
of /
F/
0 1 2 3 41
1.2
1.4
1.6
1.8
2
/R/
SCALE (millisec)
FR
AC
TA
L D
IME
NS
ION
of /
R/
0 1 2 3 41
1.2
1.4
1.6
1.8
2
/EN/
SCALE (millisec)
FR
AC
TA
L D
IME
NS
ION
of /
EN
/
0 1 2 3 41
1.2
1.4
1.6
1.8
2
/M/
SCALE (millisec)
FR
AC
TA
L D
IME
NS
ION
of /
M/
Mean and standard deviation (error bars) of the multiscale fractal dimension for the phonemes /aa/, /b/, /en/, /f/, /m/, /r/ from the TIMIT database (20 ms window, updated every 10 ms. Average over 200 phonemic instances.)
Mean MFD for /sh/, /zh/, /uh/, /t/, /d/
81.2% 83.5% 84.5%
11,
,,,DD
CECE
, , ,E C E C
1 16 1 16
, , ,
,
E C E C
D D
Features
Models5-mixture Gaussians 85.6% 86.3%10-mixture Gaussians 88.6% 88.9%
},,,,,{
CECECE
},{},
,,,,{
DDCE
CECE
Word Percent Correct For the E-set Recognition Task (ISOLET Database, 5-Mixture Gaussians per HMM State)
Word Percent Correct for the E-set Recognition Task
Maragos & Potamianos, JASA 1999
Fractal Modulationsfor
Fricative Sounds
Ref: • A. G. Dimakis and P. Maragos, “Phase Modulated Resonances Modeled as Self-
Similar Processes With Application to Turbulent Sounds”, IEEE Transactions on Signal Processing, Nov. 2005.
• An important class of statistically self-similar randomprocesses defined by their measured power spectra:
A truly enormous collection of natural phenomena exhibit 1/f-type spectral behavior over a wide frequency range:(frequency variations in quartz crystal oscillators, geophysical variations,heart rate variations, electronic device noises, network traffic flow andeconomic time series.)
• Most popular mathematical model for Gaussian 1/fprocesses: Fractional Brownian Motion (FBM)
1/f Noises
2
( )| |
S
27
Noises• Stochastic processes with power spectrum • Filtering white noise with convolution kernel
(Fractional Integration)• Non – exponential autocorrelation• White noise• Pink noise• Fractal Brownian Motion• Brown noise• Black noise• Applications: electronics, geophysics, astronomy, music,
acoustics, optics, economics, traffic flows, communications, geometry of nature
1 f
1 f 2 1t
1t 0
1
1 3
2 2
Examples of FFT-based Synthesis of 1D FBM
29
FBM Synthesis of Fractal Landscapes
D = 2.15
D = 2.5
D = 2.8
R. Voss, 1988
1/f Speech Modulation Model
• Model a resonance of a random speech phoneme as a phase-modulated 1/f signal:
• Nonlinear phase signal P(t) modeled as 1/f randomprocess.
• Useful model for broad resonances often observed infricative voiced or unvoiced sounds and probably caused bynonlinear phenomena during speech production.
( )
( ) cos ( )c
t
S t A t P t
• Isolate resonance: Bandpass filter the speech signal.• Demodulate filtered signal using ESA, obtain instant
frequency F(t), and median filter to reduce spikes.• Estimate phase modulation signal P(t) by integrating IF:
• Fit 1/fγ model to P(t). Methods tested include: Linear regression on Periodogram Estimation using variance of wavelet coefficients Maximum Likelihood estimation
Parameter Estimation in 1/f-PM
0( ) 2 ( ( ) )
tP t F F d
1 2 3 4 5 6 7-20
-15
-10
-5
0
5
10
Scale m101
102
103
104
-300
-250
-200
-150
-100
-50
0
Frequency (Hz)
=2.99
0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.052800
3000
3200
3400
3600
3800
4000
4200
4400
4600
4800
Time (sec)0 1000 2000 3000 4000 5000 6000 7000 8000
-180
-160
-140
-120
-100
-80
-60
Frequency (Hz)
/S/ phoneme experiment
0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05-1
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
Time (sec)
Speech signal
0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05-2
-1
0
1
2
3
4
5
6
7
8
Time (sec)
Power Spectrum
Phase modulation P(t)
Instant Frequency
PSD of P(t) Var. of wavelet coefficients
101
102
103
104
-350
-300
-250
-200
-150
-100
-50
0
50
Frequency (Hz)
=3.63
0 0.01 0.02 0.03 0.04 0.05 0.06 0.07-1
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
Time (sec)
/Z/ phoneme experimentSpeech signal
0 1000 2000 3000 4000 5000 6000 7000 8000
-180
-160
-140
-120
-100
-80
Frequency (Hz)
Power Spectrum
0 0.01 0.02 0.03 0.04 0.05 0.06 0.072400
2600
2800
3000
3200
3400
3600
3800
4000
4200
4400
Time (sec)
Instant Frequency
0 0.01 0.02 0.03 0.04 0.05 0.06 0.07-20
-15
-10
-5
0
5
10
Time (sec)
1 2 3 4 5 6 7-20
-15
-10
-5
0
5
10
Scale m
Phase modulation P(t) PSD of P(t) Var. of wavelet coefficients
Chaotic Dynamics of Speech Sounds
Refs: • V. Pitsikalis and P. Maragos, “Filtered Dynamics and Fractal Dimensions for
Noisy Speech Recognition”, IEEE Signal Processing Letters, Nov. 2006.• V. Pitsikalis and P. Maragos, “Analysis and Classification of Speech Signals by
Generalized Fractal Dimension Features”, Speech Communication, Dec. 2009.• I. Kokkinos and P. Maragos, “Nonlinear Speech Analysis Using Models for
Chaotic Systems”, IEEE Transactions Speech and Audio Processing, Nov. 2005.
Embedding-Attractor Reconstruction
•Parameters to specify: , ET D
h G 1x t
1 2x t T
1x t T
0 200 400 600 800 1000 1200 1400 1600 1800 2000−10
−8
−6
−4
−2
0
2
4
6
8
10X−projection (Lorenz)
Time
X(t
) 1Y t 1x t 1x t T
1 1Ex t D T
−10−8−6−4−20246810
−10
−8
−6
−4
−2
0
2
4
6
8
10
−10
0
10
Reconstructed Lorenz Attractor,20000 iterations
X
Y
Z
−10 −8 −6 −4 −2 0 2 4 6 8 10−20
−10
0
10
20
0
5
10
15
20
25
30
Lorenz Attractor;σ=5 R=15 B=1;Δt=0.25; 20000 iterations
X
Y
Z
• Nonlinear Dynamic System (Lorenz)
• Attractor
• 1D Projection
dxx y
dtdy
R x y x zdtdz
B z x ydt
−10 −8 −6 −4 −2 0 2 4 6 8 10−20
−10
0
10
20
0
5
10
15
20
25
30
Lorenz Attractor;σ=5 R=15 B=1;Δt=0.25; 20000 iterations
X
Z
Y
0 500 1000 1500 2000 2500 3000 3500 4000−10
−8
−6
−4
−2
0
2
4
6
8
10X−projection (Lorenz)
Time
X(t
)
Time Delay
• Average Mutual Information between
• “Optimum” Time Delay:
( ), ( )x t x t T
Pr ( ),Pr , log
Pr Prx t x t T
I T x t x t Tx t x t T
min arg min ( )optT
T I T
0 50 100 1500
1
2
3
4
5Lorenz System,σ=5 R=15 B=1,Δt=0.25,#10000
Time Delay T
Ave
rage
Mut
ual I
nfor
mat
ion
Embedding Dimension• Sufficient: • False Neighbors: from projection• True Neighbors: from dynamics• False Neighbors Criterion
• When % false neighbors =0,
Attractor is unfolded
2E AttractorD D
1 1,
( ) ( ) ( ) ( )( ) ( )
d d d di j
d d
y i y j y i y jR Threshold
y i y j
1 2 3 4 50
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4Lorenz System,σ=5 R=15 B=1,Δt=0.25,#2000
Embedding Dimension DE
% F
alse
Nei
ghbo
rs
−1
−0.5
0
0.5
1
−1
−0.5
0
0.5
1−1
−0.5
0
0.5
1
/ao/,DE=6, #1846
−1
−0.5
0
0.5
1
−1
−0.5
0
0.5
1−1
−0.5
0
0.5
1
/iy/,DE=5, #1068
Speech Attractors0 500 1000 1500
−1
−0.5
0
0.5
1
Time
X(t
)/ao/
0 200 400 600 800 1000−1
−0.5
0
0.5
1
Time
X(t
)
/iy/
−1.5−1
−0.50
0.51
−1.5
−1
−0.5
0
0.5
1−1.5
−1
−0.5
0
0.5
1
/k/,DE=6, #816
0 200 400 600 800−1
−0.5
0
0.5
1
Time
X(t
)
/k/
−1
−0.5
0
0.5
1
−1
−0.5
0
0.5
1−1
−0.5
0
0.5
1
/s/,DE=5, #829
0 200 400 600 800−1
−0.5
0
0.5
1
Time
X(t
)
/s/
[ Pitsikalis & Maragos, Speech Commun 2009 ]
Correlation Dimension (Speech)
Correlation Dimension:
Correlation integral:
N: # of points, r: scale, x: set points,Η: Ηeavyside function
1
1,1
N
i ji j i
C N r H r x xN N
0
log ,lim lim
logC r N
C N rD
r
Correlation Dimension (Lorenz)
•
•
1
1,1
N
i ji j i
C N r H r x xN N
0
log ,lim lim
logC r N
C N rD
r
10−1
100
101
104
105
106
Lorenz System,σ=5 R=15 B=1,Δt=0.25,#4000
Cor
rela
tion
Inte
gral
Scale0.4 0.6 0.8 1 1.2 1.4 1.6 1.8
1.8
1.85
1.9
1.95
2
2.05
2.1
2.15
2.2
Scale
Loca
l Slo
pe
Lorenz System,σ=5 R=15 B=1,Δt=0.25,#4000
10−3
10−2
10−1
100
101
100
101
102
103
104
105
106
107
averaging over 8 phonemes of type /ao/
scale
corr
elat
ion
inte
gral
plain averaging weighted averaging 10
−210
−110
010
110
−1
100
101
102
103
104
105
106
107
averaging over 15 phonemes of type /iy/
scale
corr
elat
ion
inte
gral
plain averaging weighted averaging
10−2
10−1
100
101
100
101
102
103
104
105
106
107
averaging over 13 phonemes of type /s/
scale
corr
elat
ion
inte
gral
plain averaging weighted averaging
10−3
10−2
10−1
100
101
100
101
102
103
104
105
106
averaging over 11 phonemes of type /k/
scale
corr
elat
ion
inte
gral
plain averaging weighted averaging
Correlation Integrals of Speech Sounds
/ao//iy/
/k//s/
Fractal Features
800 1000 1200 1400−1
−0.5
0
0.5
1
(ms)
N-d
CleanedEmbedding N-d
SignalLocal SVDspeech
signalFiltered Dynamics -Correlation Dimension (8)
Noisy Embedding Filtered Embedding
FDCD
Multiscale Fractal Dimension (6)MFDGeometrical
Filtering
−0.200.20.40.6−0.2 0 0.2 0.4
−0.2
0
0.2
0.4
−0.200.20.40.6−0.2 0 0.2 0.4
−0.2
0
0.2
0.4
0.1 0.2 0.30
0.05
0.1
0.15
Median Neighborhood Distance
Den
sity
FilteredNoisy
Neighborhood Distance Reduction
Projection
50 100 150 200 250 300 350 400
−600
−400
−200
0
200
400
600 Mproj−noisyMproj−cleanMproj−filt
Enhanced Speech
[ Pitsikalis & Maragos, IEEE SPL 2006 ]
Noisy Speech Database: Aurora 2
Task: Speaker Independent Recognition of Digit Sequences TI - Digits at 8kHz Training (8440 Utterances per scenario, 55M/55F) Clean (8kHz, G712) Multi-Condition (8kHz, G712) 4 Noises (artificial): subway, babble, car, exhibition 5 SNRs : 5, 10, 15, 20dB , clean
Testing, artificially added noise 7 SNRs: [-5, 0, 5, 10, 15, 20dB , clean] A: noises as in multi-cond train., G712 (28028 Utters) B: restaurant, street, airport, train station, G712 (28028 Utters) C: subway, street (MIRS) (14014 Utters)
Average Recognition Results on Aurora 2: plain CD vs FDCD
0
20
40
60
80
100
Wor
d A
ccur
acy
(%)
20dB 15dB 10dB 5dB 0dB Aver
SNR
+Plain CD +FDCD
Plain CD: Correlation Dimension without Dynamical Filtering
Average Recognition Results on Aurora 2: FDCD
0102030405060708090
100W
ord
Acc
urac
y (%
)
Clean 20dB 15dB 10dB 5dB 0dB Aver
SNR
Baseline +FDCD
Up to +40%
Average Recognition Results on Aurora 2: MFD
0102030405060708090
100W
ord
Accu
racy
clean 20 dB 15 dB 10 dB 5 dB 0 dB Ave.
SNRBaseline MFD
Up to +27%
Average Recognition Results on Aurora 2
2
12
22
32
42
52
62
72
Accuracy
Clean 20 15 10 5 0 average
SNR
Plain Fractal Features (Aurora 2)
CDFDCDMFD
Average Recognition Results on Aurora 2:Hybrid Features: Fractals and Modulations
3040
5060
7080
9010
0
Acc
urac
y
20 dB 10 dB 5 dB
SNR
Baseline +FMP +FDCD +FMP+FDCD
Up to +61%
[ Pitsikalis & Maragos, IEEE SPL 2006; Speech Commun 2009 ]
Lyapunov Exponents (L.E.s)
k-th Lyapunov Number:k-th Lyapunov Exponent:
1/
lim ( )nn
k n kL rln( )k kL
1 2 1 ek k D
Lyapunov Exponents (II)
Quantify signal predictability (orbits convergence-divergence rates in phase space)
Positive L.E. exponential divergenceNegative L.E. exponential convergence
Dissipative system sum of L.Es <0Chaotic system at least one L.E >0
Invariants of system dynamics useful for characterization /recognition purposes
Determine prediction horizon(upper bound of system predictability)
Prediction on Reconstructed Attractor(Kokkinos & Maragos, T-SAP 2005)
Goal: capture dynamics of MIMO systemfrom input-output pairs
Models tested: Local Polynomials Global Polynomials Radial Basis Function networks Takagi-Sugeno-Kang models Support Vector Machines
1 ( )n nX F X
1 ( )n nX X f
Computation of Lyapunov Exponents Consider an orbit Oseledec matrix:
i-th L.E. , is i-th eigenvalue of OSL Limitations:• Only approximation of Jacobian J of f is available
( F is an approximation to f )• Ill-conditioned nature of OSL
recursive QR decomposition technique• Limited data set local L.E.s
1 1lim ( ) ( ) ( ) ( )T T TN F N F F F NX X X X OSL J J J J
1 ( ), 1, 2, ,n nX X n N f
log( )i is is
Validation of Lyapunov Exponents
Inverse time sequencing of data • True exponents flip sign (divergence of nearby orbits
becomes convergence & vice versa)• False exponents remain negative
(artifact of embedding processno dependence on system dynamics)
Models that learn the data (and not the system dynamics) fail to give such results.
RBF nets, TSK-0, Global Polynomials ... failed SVM, TSK-1 succeeded
Applications to Speech Signals(Kokkinos & Maragos 2005)
Prediction – coding with global polynomials(smaller MSE than LPC with same # of params )
Speech analysis using Lyapunov exponents• Vowels have small positive L.E.s• Voiced fricatives have bigger positive L.E.s• Unvoiced fricatives have no validated L.E.s
(too noisy)• Stop sounds have no validated L.E.s
(non-stationary)• Non-validated L.E.s are still useful
Speech Results: validated L.E.s Phoneme: /aa/
Phoneme: /v/ 0 200 400 600 800 1000 1200
−1
−0.5
0
0.5
1
# of Jacobians used
Lyapunov Exponents of /aa/
Direct L.E.−Inverse L.E.
−0.50
0.51
−0.50
0.51
−0.5
0
0.5
1
X1
Reconstructed Attractor of /aa/
X2
X3
200 400 600 800 1000 1200−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1Original Signal: /aa/
Sample index
1D
Sig
na
l
200 400 600 800 1000
−0.5
0
0.5
1Original Signal: /v/
Sample index
1D
Sig
na
l
−0.50
0.51
−0.50
0.51
−0.5
0
0.5
1
X1
Reconstructed Attractor of /v/
X2
X3
200 400 600 800
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
# of Jacobians used
Lyapunov Exponents of /v/
Direct L.E.−Inverse L.E.
Speech Results: Non-validated L.E.s Phoneme: /sh/
Phoneme: /t/500 1000 1500
−1
−0.5
0
0.5
1Original Signal: /sh/
Sample index
1D S
igna
l
−1
0
1
−1
0
1−1
−0.5
0
0.5
1
X1
Reconstructed Attractor of /sh/
X2
X3
−0.50
0.51
−0.50
0.51
−0.5
0
0.5
1
X1
Reconstructed Attractor of /t/
X2
X3
200 400 600 800 1000
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1Original Signal: /t/
Sample index
1D S
igna
l
0 500 1000 1500
−2
−1
0
1
2
# of Jacobians used
Lyapunov Exponents of /sh/
Direct L.E.−Inverse L.E.
0 200 400 600 800 1000
−1.5
−1
−0.5
0
0.5
1
1.5
2
# of Jacobians used
Lyapunov Exponents of /t/
Direct L.E.−Inverse L.E.
Speech Lyapunov Exponents
[ Kokkinos & Maragos, IEEE T-SAP 2005 ]
Speech Sound ClassificationUsing only L.E.s (PCA projection of 3 first L.E.s):
x :Unvoiced Fric., o :Unvoiced Stop
x :Unvoiced Fric., o :Voiced Fric.
x :Vowel, o :Voiced Stop
x :Vowel, o :Unvoiced Fric.
> :Unvoiced Stop, o :Voiced Stop
x :Vowel, o :Unvoiced Stop
When combined with MFCC: (4 classes) ~12% smaller error using K-NN classifier
Other Works on Speech Fractals or Chaotic Dynamics
C. A. Pickover and A. Khorasani, ‘‘Fractal Characterization of Speech Waveform Graphs,’’ Computer Graphics 1986.
P. J. B. Jackson and C. H. Shadle, “Frication noise modulated by voicing, as revealed by pitch-scaled decomposition”, J. Acoust. Soc. Amer. 2000.
S. McLaughlin and P. Maragos, “Nonlinear Methods for Speech Analysis and Synthesis”, in Advances in Nonlinear Signal and Image Processing, edited by S. Marshall and G. L. Sicuranza, EURASIP Book Series on Signal Processing and Communications, Hindawi Publ. Corp., 2006, pp.103-140.
M. Zaki, J. N. Shah and H. A. Patil, “Effectiveness of Multiscale Fractal Dimension-based Phonetic Segmentation in Speech Synthesis for Low Resource Language”, in Proc. Int’l Conf. on Asian Language Processing (IALP) 2014.
K. López-de-Ipina, J. Solé-Casals, H. Eguiraun, J.B. Alonso, C.M. Travieso, A.Ezeiza, N Barroso, M. Ecay-Torres, P. Martinez-Lage, Blanca Beitia, “Feature selection for spontaneous speech analysis to aid in Alzheimer’s disease diagnosis: A fractal dimension approach”, Computer Speech & Language 2015.
E. Tzinis, G. Paraskevopoulos, C. Baziotis, A. Potamianos, “Integrating Recurrence Dynamics for Speech Emotion Recognition”, in Proc. Interspeech 2018.
61
Fractals and Music
Ref: • A. Zlatintsi and P. Maragos, “Multiscale Fractal Analysis of Musical Instrument Signals
with Application to Recognition”, IEEE Transactions on Audio, Speech and Language Processing, Apr. 2013.
Multiscale Fractal Dimension of Music Sounds
0 1 2 3 4 53
4
5
6
7
8
LOG SCALE
LOG
AR
EA
BassBassoonBb ClarinetCelloFluteHornTuba
[ Zlatintsi & Maragos, T‐ASLP 2013 ]
63
Morphological Covering MethodD 2 lim log[AB(s) / s2 ]
log(1 / s)
Double Bass steady state (solid line), its multiscale flat dilations and erosions at scales s=25,75, where B is a 3-sample symmetric horizontal segment with zero height.
P. Maragos and A. Potamianos. J. Acoust. Soc. Amer., 1999.
0 1 2 3 4 53
4
5
6
7
8
LOG SCALELO
G A
REA
BassBassoonBb ClarinetCelloFluteHornTuba
Multiscale Fractal Analysis of Musical Instrument Signals
log[AB(s)] vs log(s) for the seven analyzed instruments for the note C3 except for Bb Clarinet and Flute shown for C5 instead. Notethe difference in the slope for larger scales . (for 30ms analysis window).
64
MFD Analysis for Steady State of the Note
Upright Bass Clarinet Cello
Flute Clarinet Oboe Piano
Mean MFD (middle line) and standard deviation (error bars) (for 30 ms analysis window, updated every 15 ms).
65
MFD Analysis for Steady State of the Note
D = 1.42
D = 1.35
D = 1.65
D = 1.46
D = 1.37
D = 1.89
Horn Tuba Bassoon
Bb Clarinet Flute Bb Clarinet Flute
66
MFD Analysis on Synthesized Signals
Strong dependence on the frequency
f=5Hz f=50Hz f=100Hz
f=200Hz
f=300Hz f=400Hz f=500Hz f=600Hz
67
Analysis of MFD during the Attack
MFDs estimated for the 7 analyzed instruments attacks, averaged over the whole range (using 30 msanalysis windows).
Similar as for the steady state Higher D for small scales st and
more fragmentation.
Increased value of D(s = 1).
Clear distinction of D among some of the analyzed instruments.
Mean MFD and standard deviation of the attack and steady state of A3 for Cello (left images) and F4 for Flute (right images).
68
MFD Variability of the Steady State for the Same Instrument over One Octave
Dependence on the acoustical frequency and the MFD profile that increases rapidly for higher frequency sounds.
The instruments’ specific MFDs beholds the shape observed for the specific octave.
MFD of Bb Clarinet steady state notes, over one octave for one 30 ms analysis window.
69
Experimental Evaluation
Double Bass, Bassoon & Tuba best recognized Low discriminability between Bb Clarinet & Flute
Enhanced discriminability for Bassoon, Bb Clarinet and Horn Decreased for Cello & Flute On average MFD+MFCC features improve the recognition over the baseline
Mean AccuracyΗΜΜ Ν=5 Μ=3
Example of the 13 logarithmically sampled points of the MFD, for Bb Clarinet (A3), forming the MFDLG feature vector.
70
Conclusions
Existence-Importance of nonlinear speech structure of turbulence type (fractals, chaotic dynamics)
Speech technology systems can benefit from including such nonlinear features
Find/extract robust nonlinear features of turbulence type Improve computational algorithms Fuse nonlinear with linear features Applications also to other sound signals, e.g. music
For more information, demos, and current results: http://cvsp.cs.ntua.gr and http://robotics.ntua.gr