
SOUND SOURCE RECOGNITION AND MODELING

CASA seminar, summer 2000
Antti Eronen, [email protected]
Audio Research Group, TUT

Contents:
• Basics of human sound source recognition
• Timbre
• Voice recognition
• Recognition of environmental sounds and events
• Musical instrument recognition

Human sound source recognition abilities
• Different acoustic properties of sound producing objects enable us to recognize sound sources by listening
• These properties are the result of the production process
• The sound waves produced are different at each event
• Acoustic properties change over time
• The acoustic world is linear: sound waves from different sound sources combine together into larger mixtures
• The combination and interaction of the properties of single objects in the mix generate new, emergent properties belonging to the larger sound producing system

Timbre (= äänen väri, Finnish for "sound colour")
• The perceptual qualities of objects and events; that is, "what it sounds like"
• ANSI 1973: "The quality of a sound by which a listener can tell that two sounds of the same loudness and pitch are dissimilar"
• There are many stable and time-varying acoustic properties affecting timbre
• It is unlikely that any one property or combination of properties uniquely determines timbre
• The sense of timbre comes from the emergent, interactive properties of the vibration pattern
• Identification is the result of
  • the apprehension of acoustical invariants (the bowing of a violin sounds like this)
  • inferences made according to learned experience (we learn how the violin sounds in different acoustic environments)

Source-filter model of sound production
• The source is excited by energy to generate a vibration pattern
• The filter acts as a resonator, having different vibration modes
• Each mode can be characterized by its resonant frequency and by its damping or quality factor Q
• When the excitation is imposed on the filter, it modifies the relative amplitudes of the components of the source input
• This results in peaks in the frequency spectrum of the signal at the resonant frequencies
• The damping of a vibration mode is a measure of its sharpness of tuning and temporal response
• A lightly damped mode (high Q) results in a sharp peak in the spectrum and a longer time delay in the signal (and vice versa)

• We can hear both the change in the sound spectrum and the time differences (if they are more than a few milliseconds)
• The final sound is the result of the effects of the excitation, the resonators and the radiation characteristics
• In sound producing mechanisms that can be modeled as linear systems, the transfer function of the resulting signal is the product of the transfer functions of the partial systems (if they are in cascade); mathematically,

  Y(z) = X(z) \prod_{i=1}^{N} H_i(z),   (1)

  where Y(z) and X(z) are the z-transforms of the output and the excitation signal, respectively, and H_i(z) are the transfer functions of the N subsystems (for instance, the vocal tract and the reflections at the lips)
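
As a quick sanity check of equation (1), the sketch below (not part of the original slides) cascades two second-order all-pole resonators and verifies that filtering in cascade equals filtering once with the product of the transfer functions. The resonator() helper and its crude Q-to-pole-radius mapping are made up for illustration.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                                   # sample rate (assumed)
x = np.zeros(fs // 2); x[::80] = 1.0        # 100 Hz impulse-train excitation

def resonator(f0, q, fs):
    """2nd-order all-pole resonator at f0 Hz; pole radius from Q is a rough assumption."""
    r = np.exp(-np.pi * f0 / (q * fs))
    a = [1.0, -2 * r * np.cos(2 * np.pi * f0 / fs), r * r]
    return [1.0], a

b1, a1 = resonator(500.0, 10.0, fs)
b2, a2 = resonator(1500.0, 5.0, fs)

# Filtering in cascade ...
y_cascade = lfilter(b2, a2, lfilter(b1, a1, x))

# ... equals filtering once with the product of the transfer functions,
# i.e. convolved numerator and denominator polynomials.
b12, a12 = np.convolve(b1, b2), np.convolve(a1, a2)
y_product = lfilter(b12, a12, x)

print(np.allclose(y_cascade, y_product))    # True: cascade == product of H_i(z)
```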

Machine sound source recognition

A good sound source recognition system should:
• Exhibit generalization. Different instances of the same kind of sound should be recognized as similar (for instance, musical instruments played in different environments or by different players).
• Handle real-world complexity. It should be able to work under realistic recording conditions, with noise, reverberation and even competing sound sources.
• Be scalable. The ability to learn to recognize additional sounds, and the effect this has on performance.
• Exhibit graceful degradation. The system's performance should worsen gradually as the noise, the degree of reverberation and the number of competing sound sources increase.

• Employ a flexible learning strategy. It should be able to introduce new categories as necessary and refine its classification criteria.
• Simplicity, computational efficiency. The simpler of two systems performing equally well is the better one (memory and processing requirements, how easy it is to understand how the system works).

A typical sound source recognition system
• Preprocessing (filtering, noise removal)
• Feature extraction
• Training and learning (supervised or unsupervised)
• Classification (pattern recognition, neural networks, stochastic models)
• Has to work with a limited number of sound classes and limited test data

Features for sound source recognition

Frequency spectrum
• The spectral centroid measures the spectral energy distribution and corresponds to the perceived "brightness":

  f_c = \frac{\sum_{k=1}^{N} A(k) f(k)}{\sum_{k=1}^{N} A(k)},   (2)

  where k indexes the spectral components, and f(k) and A(k) are the frequency and amplitude of component k, respectively. A normalized version f_{c,norm} = f_c / f_0 can also be used, where f_0 is the fundamental frequency of a harmonic sound.
• f_c is the first moment of the spectrum; higher-order moments have also been used as features
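
A minimal numpy sketch of equation (2), assuming the amplitudes A(k) are taken from the magnitude spectrum of a Hann-windowed frame (an assumption, not something the slides prescribe):

```python
import numpy as np

def spectral_centroid(x, fs, f0=None):
    """Spectral centroid of a frame as in equation (2);
    if f0 is given, also return the f0-normalized version."""
    spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))   # A(k)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)              # f(k)
    fc = np.sum(spectrum * freqs) / np.sum(spectrum)
    return fc if f0 is None else (fc, fc / f0)

# Example: a 440 Hz tone with a strong 3rd harmonic has a centroid well above 440 Hz.
fs = 44100
t = np.arange(0, 0.04, 1 / fs)                               # 40 ms frame
x = np.sin(2 * np.pi * 440 * t) + 0.8 * np.sin(2 * np.pi * 1320 * t)
fc, fc_norm = spectral_centroid(x, fs, f0=440.0)
print(round(fc), round(fc_norm, 2))
```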

• The power spectrum across a set of critical bands or successive frequency regions
• The power spectrum of a signal x(n) is the Fourier transform of its autocorrelation sequence r(n):

  P(\omega) = \sum_{n=-\infty}^{\infty} r(n) e^{-j\omega n}.   (3)

• It can also be calculated as the magnitude squared Fourier transform of the signal x(n):

  P(\omega) = |X(\omega)|^2,   (4)   where   X(\omega) = \sum_{n=-\infty}^{\infty} x(n) e^{-j\omega n}.   (5)
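
The equivalence of equations (3)-(5) can be checked numerically for a finite frame; the zero-padding below, which makes the circular autocorrelation equal the linear one, is an implementation detail and not something the slides specify.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256
x = rng.standard_normal(N)
xp = np.concatenate([x, np.zeros(N)])        # zero-pad so circular == linear autocorrelation

P_direct = np.abs(np.fft.fft(xp)) ** 2       # P(w) = |X(w)|^2, equations (4)-(5)

r_lin = np.correlate(x, x, mode="full")      # r(n) for lags -(N-1) .. N-1
# Arrange the lags in DFT order: 0..N-1, then the (empty) lag N, then -(N-1)..-1
r_dft_order = np.concatenate([r_lin[N - 1:], [0.0], r_lin[:N - 1]])
P_from_r = np.fft.fft(r_dft_order).real      # equation (3)

print(np.allclose(P_direct, P_from_r))       # True
```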

• Spectral irregularity
  • Corresponds to the standard deviation of the time-averaged harmonic amplitudes from a spectral envelope:

    IRR = 20 \log \sum_{k=2}^{n-1} \left| A_k - \frac{A_{k-1} + A_k + A_{k+1}}{3} \right|   (6)

• Even and odd harmonic content in the signal spectrum
  • Even harmonic content:

    h_{ev} = \frac{A_2^2 + A_4^2 + A_6^2 + \ldots}{A_1^2 + A_2^2 + A_3^2 + \ldots} = \frac{\sum_{k=1}^{M} A_{2k}^2}{\sum_{n=1}^{N} A_n^2}, \quad M = \frac{N}{2}   (7)
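
A small sketch of equations (6) and (7), assuming the harmonic amplitudes A_1...A_N have already been estimated (for example by picking spectral peaks at multiples of the fundamental); the clarinet-like test spectrum is made up for illustration.

```python
import numpy as np

def irregularity(A):
    """Spectral irregularity, equation (6): deviation of each harmonic amplitude
    from a 3-point smoothed spectral envelope, in dB."""
    A = np.asarray(A, dtype=float)
    envelope = (A[:-2] + A[1:-1] + A[2:]) / 3.0      # (A[k-1] + A[k] + A[k+1]) / 3
    return 20.0 * np.log10(np.sum(np.abs(A[1:-1] - envelope)))

def even_harmonic_content(A):
    """Even harmonic energy ratio, equation (7); the odd ratio is analogous."""
    A2 = np.asarray(A, dtype=float) ** 2
    return A2[1::2].sum() / A2.sum()                 # harmonics 2, 4, 6, ... over all

# A clarinet-like spectrum with weak even harmonics gives a low even-harmonic ratio.
A = np.array([1.0, 0.05, 0.6, 0.04, 0.4, 0.03, 0.25, 0.02])
print(irregularity(A), even_harmonic_content(A))
```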

• Odd harmonic content:

  h_{odd} = \frac{A_1^2 + A_3^2 + A_5^2 + \ldots}{A_1^2 + A_2^2 + A_3^2 + \ldots} = \frac{\sum_{k=0}^{L} A_{2k+1}^2}{\sum_{n=1}^{N} A_n^2}, \quad L = \frac{N}{2} - 1   (8)

• Formants
  • Spectral prominences created by one or more resonances in the filter of the sound source
  • A robust feature for measuring formants are the cepstral coefficients
  • The cepstrum of a signal x(n) is defined as

    c(n) = F^{-1}\{ \log |F\{ x(n) \}| \}   (9)
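
A minimal sketch of equation (9) using the log-magnitude spectrum (the usual choice for recognition features); the synthetic 125 Hz harmonic tone is an assumption for illustration.

```python
import numpy as np

def real_cepstrum(x):
    """Real cepstrum of a frame, equation (9)."""
    log_mag = np.log(np.abs(np.fft.fft(x)) + 1e-12)   # small offset avoids log(0)
    return np.fft.ifft(log_mag).real

# Synthetic harmonic tone, f0 = 125 Hz: low-order coefficients describe the smooth
# spectral envelope, and a peak near the pitch period reveals f0.
fs = 8000
t = np.arange(256) / fs                               # 32 ms frame
x = sum(np.sin(2 * np.pi * 125 * h * t) / h for h in range(1, 21)) * np.hanning(256)
c = real_cepstrum(x)
print(c[:4])                                          # envelope-related coefficients
print(np.argmax(c[20:120]) + 20)                      # ~64 samples = fs / 125 Hz
```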

• In practice the coefficients may be obtained for instance with linear prediction (LP)
• In LP, the filter of the sound source is approximated with an all-pole filter
• The coefficients of the all-pole filter can be solved
• These coefficients describe the magnitude spectrum of the sound source filter
• The coefficients are converted into cepstral coefficients, which behave nicely for recognition purposes

[Figure: Source-filter block diagram. The normalized input u(n) (pulse train or white noise) is scaled by the gain G and filtered with the all-pole filter 1/A(z) to produce the output s(n).]
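
A rough numpy/scipy sketch of the chain described above: LP coefficients via the autocorrelation method, then the standard LPC-to-cepstrum recursion. The lpc() and lpc_to_cepstrum() helpers, the Hamming window, the model order and the test signal are illustrative choices, not those of the original work.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(x, order):
    """All-pole coefficients of A(z) = 1 + a1 z^-1 + ... via the autocorrelation method."""
    x = x * np.hamming(len(x))
    r = np.correlate(x, x, mode="full")[len(x) - 1:]       # autocorrelation, lags 0..N-1
    a = solve_toeplitz(r[:order], r[1:order + 1])          # normal equations R a = r
    return np.concatenate([[1.0], -a])                     # A(z) coefficients

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients of the all-pole model 1/A(z) (standard recursion)."""
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = -a[n] if n < len(a) else 0.0
        for k in range(1, n):
            if n - k < len(a):
                acc -= (k / n) * c[k - 1] * a[n - k]
        c[n - 1] = acc
    return c

fs = 8000
t = np.arange(0, 0.04, 1 / fs)
x = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)
x += 0.001 * np.random.default_rng(0).standard_normal(len(x))   # keep R well conditioned
a = lpc(x, order=10)
print(lpc_to_cepstrum(a, n_ceps=12))
```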

[Figure: Magnitude spectrum (dB) of a 40 ms frame of a guitar tone, and the approximating LPC spectrum of order 15, plotted against frequency (Hz).]

[Figure: Average LPC spectra (dB vs. frequency in Hz) of a violin tone and a trumpet tone, respectively.]

Onset and offset transients
• Rise time (the duration of the attack)
  • The time interval between the onset and the instant of maximal amplitude
  • Usually some kind of energy thresholds are used to locate these points from an overall amplitude envelope
• Onset asynchrony
  • Calculate the individual rise times of different harmonics or different frequency ranges
  • Onset harmonic skew: a linear fit to the onset times of the harmonic partials as a function of frequency

Modulations
• Frequency modulation
  • Vibrato (periodic), jitter (random)
  • Difficult to measure reliably
  • Presence/absence/degree of periodic and random modulations
• Amplitude modulation
  • Tremolo
  • Presence/absence/degree of periodic and random modulations
• These features can be extracted from an amplitude envelope
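
A sketch of how the rate and strength of periodic amplitude modulation (tremolo) could be read from the spectrum of an amplitude envelope; the 4-8 Hz search band and the synthetic envelope are assumptions for illustration.

```python
import numpy as np

def tremolo(env, env_rate, band=(4.0, 8.0)):
    """Modulation frequency and relative strength from the envelope spectrum."""
    env = env - np.mean(env)                         # ignore the DC (average level)
    spec = np.abs(np.fft.rfft(env * np.hanning(len(env))))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / env_rate)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    peak = np.argmax(np.where(in_band, spec, 0.0))
    strength = spec[peak] / (np.sum(spec) + 1e-12)   # relative modulation energy
    return freqs[peak], strength

env_rate = 100.0                                     # envelope sampled at 100 Hz
t = np.arange(0, 2.0, 1 / env_rate)
env = 1.0 + 0.3 * np.sin(2 * np.pi * 6.0 * t)        # 6 Hz tremolo, 30% depth
print(tremolo(env, env_rate))                        # ~ (6.0, ...)
```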

[Figure: Intensity (dB) as a function of time (s) and Bark frequency for flute, violin, trumpet and clarinet tones.]

Classification
• Pattern recognition
  • The data is presented as N-dimensional feature vectors, which are assigned to different classes or clusters
  • Supervised classification: the input pattern is identified as a member of a predefined class
  • Unsupervised classification: the pattern is assigned to a hitherto unknown class (e.g. clustering)

[Figure: Two-dimensional feature space (Feature 1 vs. Feature 2) with samples from two classes, class1 and class2.]
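
A toy example of the supervised/unsupervised distinction on two-dimensional feature vectors, using scikit-learn; the synthetic Gaussian clusters stand in for real features such as spectral centroid and rise time.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
class1 = rng.normal([0.0, 0.0], 0.3, size=(50, 2))
class2 = rng.normal([2.0, 1.5], 0.3, size=(50, 2))
X = np.vstack([class1, class2])
y = np.array([0] * 50 + [1] * 50)

# Supervised: labels are known, a new pattern is assigned to a predefined class.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[1.9, 1.4]]))              # -> the second class

# Unsupervised: no labels, the patterns are grouped into clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])
```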

Speaker recognition
• The most studied sound source recognition problem
• Recognition and verification

Three major approaches
• Long-term averages of acoustic features
  • Average out the phonetic variations affecting the features, leaving only the speaker-dependent component
  • The earliest approach; has been used successfully in demanding applications
  • Discards much speaker-dependent information
  • Can require long speech utterances to derive stable long-term statistics
• Model the speaker-dependent features within phonetic sounds
  • Compare within similar phonetic sounds in the training and test utterances

• Explicit segmentation: a Hidden Markov Model (HMM)-based continuous speech recognizer as a front end -> little or no improvement in performance
• Implicit segmentation: unsupervised clustering of acoustic features during training and recognition (Gaussian mixture models, GMM)
• Discriminative neural networks (NN)
  • NNs are trained to model the decision function which best discriminates speakers within a known set
• Problems
  • Fundamental frequency information is not used
  • Speech rhythm is not used
  • Lack of generality: do not work well when acoustic conditions vary from those used in training
  • Cannot deal with mixtures of sounds
  • Performance suffers as the population size grows

Case Reynolds 1995
• 20 mel-frequency cepstral coefficients from 20 ms frames
• Given a recorded utterance, a probabilistic model is formed based on Gaussian distributions
• Motivations for using Gaussian Mixture Models (GMM) in speaker recognition:
  • The individual component Gaussians in a speaker-dependent GMM can be interpreted to represent some broad acoustic classes
  • A Gaussian mixture density is able to model the long-term sample distribution smoothly

• The performance of the system depends on
  • The noise characteristics of the signal
  • The population size
• Nearly perfect performance with pristine recordings (630 talkers)
• Under varying acoustic conditions (e.g. using different telephone handsets during testing and training):
  • 94% with a population of 10 talkers
  • 83% with a population of 113 talkers

Automatic noise recognition

Case Gaunard 1998
• Car, truck, moped, aircraft, train
• 12 cepstral coefficients from 50-100 ms frames
• 1-5 state HMM
• Recognition performance
  • 90-95% with cepstral coefficients as features
  • 80% with a 1/3-octave filter bank as front end
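
Analogously to the GMM example above, a rough sketch of HMM-based noise classification: one small Gaussian HMM per noise class, scored against an unknown feature sequence. The random stand-in features and the third-party hmmlearn package are assumptions, not Gaunard's implementation.

```python
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(2)
train = {                                    # noise class -> (n_frames, 12) cepstral features
    "car": rng.normal(0.0, 1.0, (400, 12)),
    "train": rng.normal(1.0, 1.5, (400, 12)),
}
models = {}
for label, feats in train.items():
    m = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20, random_state=0)
    models[label] = m.fit(feats)

test = rng.normal(1.0, 1.5, (150, 12))       # an unknown frame sequence
scores = {label: m.score(test) for label, m in models.items()}
print(max(scores, key=scores.get))           # -> "train"
```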

Case El-Maleh 1999
• Frame-level noise classification for mobile environments
• Car, voice babble, street, bus and factory
• Line spectral frequencies (LSFs) based on order-10 LPC analysis as features
• 89% average performance
• Shows some ability to generalize, and robustness
  • New noises were classified as the most similar training noises (restaurant babble, music -> babble or bus noise)
  • Human speech-like noise (superimposed independent speech signals) was classified as speech when the number of superimpositions was low
  • As the number of superimposed signals increased, it was classified more as babble than as speech

Musical instrument recognition
• Difficulties
  • Wide pitch ranges
  • Variety of playing techniques
  • Properties of sounds may change completely with different techniques and different notes
  • Interfering sounds in polyphony
  • Different recording conditions
  • Differences between instrument pieces (Stradivarius vs. a cheap violin)
• Psychological research as a starting point for finding features
  • A lot of work has been done to resolve what makes musical instrument sounds distinguishable (timbre)

• This knowledge has been used in musical instrument recognition systems
• Also a lot of work with human voices
• Much less knowledge about environmental sounds
• The state of the art is still quite low
  • Good results with isolated tones, but with only one example of a particular instrument
  • Good results with monophonic phrases, but with only four instruments
  • Not so good results with monophonic phrases with several instruments
  • Some first attempts towards polyphonic recognition