
Voice Transformation, Part II

Yannis Stylianou

Computer Science Department, Multimedia Informatics [email protected]

Interspeech 2007, August 27th 2007, Antwerpen, Belgium



Outline of the talk

1 Simple control of voice quality
    Voice quality in TTS
    Detection of voice quality problems
    Compensation

2 For more complex voice transformations

3 Control of source and filter characteristics
    Voice Morphing
    Voice Conversion
    A baseline probabilistic approach

4 Discussion

5 Extensions of the mapping function

6 References



Voice quality in TTS

The key to natural-sounding TTS is the use of large speech databases, in which we wish to have:

Many instances of basic units

Variety of prosodic characteristics

Variety of spectral information

while we wish to avoid:

Variability in voice quality



Problems associated with the variability of voice quality

Degradation of the overall quality of synthesis

Problems in the unit selection algorithm

Problems in the unit concatenation algorithm

A big part of the database may be useless



Intra and Inter session variability

Intra-session variability

Movements of the microphone during the recording session
Fatigue of the speaker after a long recording session (5 hours)

Inter-session variability

Modifications of the recording equipment from one recording session to the other
Variability of emotional state and/or health of the speaker



Task

Given a large speech database

Automatically detect voice quality problems and

Correct voice quality problems with NO degradation of the speech signals.



Intra-session variability

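Modeling of the acoustic space of the speaker using GMMs, $\lambda_{r_i}$ (with $i = 1, \dots, N$), based on the first $k$ of $L$ observations from each recording session $r_i$:

$$O_{r_i} = \left[\, O^{(1)}_{r_i}, O^{(2)}_{r_i}, \dots, O^{(k)}_{r_i} \;\vdots\; O^{(k+1)}_{r_i}, \dots, O^{(L)}_{r_i} \,\right]$$

Estimation of the log-likelihood function:

$$L\left(O^{(l)}_{r_i} \mid \lambda_{r_i}\right) = \frac{1}{T} \sum_{t=1}^{T} \log p\left(o^{(l)}_t \mid \lambda_{r_i}\right), \qquad l = 1, \dots, L$$

The variance of $L(O^{(l)}_{r_i} \mid \lambda_{r_i})$ over $l$ reflects intra-session variability; the session with the minimum variance defines the reference recording session, $r_p$.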
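The scoring step above can be sketched in a few lines. This is only an illustration under stated assumptions: observations are scalars, each GMM is a hypothetical list of (weight, mean, variance) triples, and the names `avg_loglik` and `session_variance` as well as the toy data are invented here, not taken from the talk.

```python
import math
from statistics import pvariance

def gmm_pdf(o, gmm):
    """Density of a scalar observation under a GMM of (weight, mean, var) triples."""
    return sum(w * math.exp(-(o - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
               for w, m, v in gmm)

def avg_loglik(segment, gmm):
    """Average log-likelihood of one segment: (1/T) * sum_t log p(o_t | model)."""
    return sum(math.log(gmm_pdf(o, gmm)) for o in segment) / len(segment)

def session_variance(segments, gmm):
    """Variance of the per-segment scores, used as a proxy for intra-session variability."""
    return pvariance([avg_loglik(seg, gmm) for seg in segments])

# Hypothetical toy data: two sessions, each with its own model and three segments.
gmm_a = [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]
gmm_b = [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]
session_a = [[-1.0, 1.0, 0.5], [-0.9, 1.1, 0.4], [-1.1, 0.9, 0.6]]
session_b = [[-1.0, 1.0, 0.5], [3.0, 3.5, 4.0], [-4.0, -3.0, 0.0]]

variances = {"a": session_variance(session_a, gmm_a),
             "b": session_variance(session_b, gmm_b)}
# The session whose scores vary least is taken as the reference session.
reference = min(variances, key=variances.get)
```

In a real system the observations would be spectral feature vectors and the GMMs would be trained on the first k observations of each session.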



Inter-session variability

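Measurement of

$$L\left(O^{(l)}_{r_i} \mid \lambda_{r_p}\right), \qquad \forall\, i, l$$

Compute z-score:

$$z^{l}_{r_i} = \frac{L\left(O^{(l)}_{r_i} \mid \lambda_{r_p}\right) - \mu_L}{\sigma_L}, \qquad i \neq p$$

Test null hypothesis ($r_p \equiv r_i(l)$) against alternative hypothesis ($r_p \not\equiv r_i(l)$) with a 0.01 level of alpha error.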
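The detection test can be illustrated as follows. The slide does not say which scores the mean and standard deviation are estimated from, nor whether the test is one- or two-sided; this sketch assumes they come from the reference session's own segment scores and that the test is two-sided. The function name `flag_segments` is hypothetical.

```python
from statistics import NormalDist, mean, stdev

def flag_segments(scores, ref_scores, alpha=0.01):
    """Return a list of (z, rejected) pairs, one per segment score.

    scores: log-likelihoods of the tested segments under the reference model.
    ref_scores: log-likelihoods of the reference session's own segments
                (assumed here to provide the mean and standard deviation).
    """
    mu, sigma = mean(ref_scores), stdev(ref_scores)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    results = []
    for s in scores:
        z = (s - mu) / sigma
        results.append((z, abs(z) > z_crit))  # True: null hypothesis rejected
    return results
```

A segment flagged `True` is one whose voice quality is judged to differ from the reference session and is therefore a candidate for compensation.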



Schematically

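[Figure: schematic of the detection procedure. For each recording session R1, ..., Rk, the likelihood P(O) of its observations and the variance (Var) of these scores are computed; the session with the minimum variance is selected as the reference database (Ref DB).]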



Example of Voice quality problem

[Figure: two panels plotting score against # of segment (0 to 20): "Intravariability" (intra-session variability) and "Intervariability" (inter-session variability).]



Compensation

Given a segment of $r_i$ with voice quality problems and the reference recording session, $r_p$, their difference is given by:

$$\rho^{(l)}_{r_i}(\tau) = \int_{-1/2}^{1/2} \left( P_{r_p}(f) - P^{(l)}_{r_i}(f) \right) \exp(j 2 \pi f \tau)\, df$$

where $P_{\cdot}(f)$ denotes the power spectral density.

Compute the coefficients of an AR corrective filter using the standard Levinson-Durbin algorithm.

Filter the speech signal from $r_i$ with the computed AR corrective filter.
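As a reminder of what the Levinson-Durbin step computes, here is a minimal sketch. It assumes a valid autocorrelation sequence is already available as a list `r`; how the corrective filter is derived from the spectral difference in the talk is not detailed, so this shows only the generic recursion and an all-pole filtering helper. Both function names are illustrative.

```python
def levinson_durbin(r, order):
    """Solve the Yule-Walker equations by the Levinson-Durbin recursion.

    r: autocorrelation values r[0..order].
    Returns (a, err) with a[0] = 1 and err the final prediction error.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err

def ar_filter(x, a):
    """All-pole filtering: y[n] = x[n] - sum_{k>=1} a[k] * y[n-k]."""
    y = []
    for n, xn in enumerate(x):
        y.append(xn - sum(a[k] * y[n - k]
                          for k in range(1, len(a)) if n - k >= 0))
    return y
```

For the autocorrelation of a first-order process, `levinson_durbin([1.0, 0.5, 0.25], 2)` yields coefficients close to `[1, -0.5, 0]`, so the second coefficient correctly vanishes.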



Evaluation

Objective:

$L(O^{(l)}_{r_i} \mid \lambda_{r_p})$ should increase after compensation.
Spectral distance should decrease after compensation.

Subjective:

A-B listening tests



Results

                                       i = 1             i = 2              i = 3
L(O_ri | r_p), before compensation     9.0974            8.0646             9.6506
L(O_ri | r_p), after compensation      9.2654 (1.84%)    9.2160 (14.27%)    9.6506



Example of Voice quality problem resolved

[Figure: "Intervariability after correction", score against # of segment (0 to 20).]



Speech Models

Harmonic plus Noise Model, HNM (Stylianou et al., 1995) [1]

Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum, STRAIGHT (Kawahara, 1997) [2]

Auto-Regressive eXogenous Liljencrants-Fant, ARX-LF (Vincent et al., 2007) [3]



Motivation for HNM

[Figure: four panels. (a) Original speech signal, amplitude against time in samples; (b) original magnitude spectrum, dB against frequency (Hz); (c) harmonic part (0-5000 Hz), amplitude against time in samples; (d) noise part (5000-8000 Hz), amplitude against time in samples.]



Brief overview of HNM

HNM is a pitch-synchronous harmonic plus noise representation of the speech signal.

Speech spectrum is divided into a low and a high band delimited by the so-called maximum voiced frequency.

The low band of the spectrum (below the maximum voiced frequency) is represented solely by harmonically related sine waves.

The upper band is modeled as a noise component modulated by a time-domain amplitude envelope.

HNM allows high-quality copy synthesis and prosodic modifications.



HNM in equations

Harmonic part:

$$h(t) = \sum_{k=-L(t)}^{L(t)} A_k(t)\, e^{j k \omega_0(t) t}$$

Noise part:

$$n(t) = e(t) \left[ v(\tau, t) \star b(t) \right]$$

Speech:

$$s(t) = h(t) + n(t)$$
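The harmonic sum can be made concrete with a short synthesis sketch. It is a simplification under stated assumptions: the fundamental frequency and the complex amplitudes are held constant over the frame (the model lets them vary with time), the k = 0 term is taken as zero, and negative-index amplitudes are the complex conjugates of the positive ones so that the output is real. The function name and parameters are illustrative.

```python
import cmath
import math

def harmonic_part(amps, f0, fs, n_samples):
    """Synthesize h[n] = sum_{k=-L}^{L} A_k exp(j k w0 n) for one frame.

    amps[k-1] holds the complex amplitude A_k for k = 1..L.
    A_{-k} is set to conj(A_k) and A_0 to 0, so h[n] is real-valued.
    """
    w0 = 2 * math.pi * f0 / fs           # fundamental in radians per sample
    out = []
    for n in range(n_samples):
        total = 0j
        for k, a in enumerate(amps, start=1):
            total += a * cmath.exp(1j * k * w0 * n)
            total += a.conjugate() * cmath.exp(-1j * k * w0 * n)
        out.append(total.real)
    return out
```

With a single harmonic of amplitude 0.5 at f0 = 100 Hz and fs = 800 Hz, the result is one period of a unit-amplitude cosine over 8 samples.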



Audio examples from HNM

Original

Time-scale by 0.7

Time-scale by 1.6

Pitch modification by 0.8

Pitch modification by 1.6

Original

Time-varying pitch and time modif.

Original

Time-scale by 4

Time-scale by 6



STRAIGHT

Speech signal is represented as a sum of minimum phase impulse responses [2]:

$$s(t) = \sum_{t_i \in Q} \frac{1}{\sqrt{G(f_0(t_i))}}\, u_{t_i}\left(t - T(t_i)\right)$$

where $Q$ represents a set of positions and $G(\cdot)$ represents a pitch modification function.

Minimum phase impulse responses are modified using all-pass filters

Filter information is reconstructed in the time-frequency region

Excitation information is manipulated through phase.



Audio examples from STRAIGHT

Original utterance: /kohi ni miruku wo iremasu ka/ ("Coffee with milk?")

Original

2 times of F0 and 1.25 times frequency axis

3 times of F0 and 1.44 times frequency axis

0.5 times of F0 and 0.8 times frequency axis



ARX-LF

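ARX model:

$$s(t) = \sum_{k=1}^{p} a_k(t)\, s(t-k) + b_0\, u(t) + r(t)$$

LF model:

$$u(t) = E_1 e^{\alpha t} \sin(w t), \qquad 0 \le t \le T_e \qquad (1)$$

$$u(t) = -E_2 \left[ e^{-b (t - T_e)} - e^{-b (T_0 - T_e)} \right], \qquad T_e \le t \le T_0 \qquad (2)$$

Residual signal, $r(t)$, is modeled by HNM [4].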



Audio examples from ARX-LF

Original

Time scale by 2.0

Pitch scale by 0.7

Pitch scale by 1.4



Voice Morphing

From two speakers (source speakers) who utter the same sentence, create a new speaker who utters the same sentence as the source speakers and with voice characteristics taken from the source speakers.

Dynamic Time Warping (DTW) between the two sentences

Linear Interpolation between corresponding frames

Synthesis
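The interpolation step can be sketched as below, assuming the two utterances have already been analysed into per-frame parameter vectors and that an alignment path (a list of frame index pairs, as a DTW would produce) is available. The function name `morph_frames` and the mixing weight `alpha` are illustrative choices, not part of the talk.

```python
def morph_frames(frames_a, frames_b, path, alpha=0.5):
    """Linearly interpolate aligned frames: (1 - alpha) * a + alpha * b.

    frames_a, frames_b: lists of equal-length parameter vectors.
    path: list of (i, j) pairs aligning frames_a[i] with frames_b[j].
    alpha: 0 reproduces speaker A, 1 reproduces speaker B.
    """
    return [[(1 - alpha) * xa + alpha * xb
             for xa, xb in zip(frames_a[i], frames_b[j])]
            for i, j in path]
```

Which parameters interpolate well (for example spectral envelope parameters as opposed to raw waveform samples) depends on the speech model used for analysis and synthesis.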



Voice Conversion

Definition: Voice conversion aims at transforming the characteristics of the speech signal uttered by a speaker (Source Speaker), in such a way that a human listener could believe that the transformed speech is produced by another specific speaker (Target Speaker).

Control of the source and filter characteristics



Overview of techniques

Abe et al. (1988) [5]: VQ mapping

Valbret et al. (1992) [6]: Linear Multivariate Regression (LMR), Dynamic Frequency Warping (DFW)

Iwahashi et al. (1994) [7]: Speaker Interpolation

Kuwabara et al. (1995) [8]: Fuzzy VQ

Stylianou et al. (1995) [9]: Probabilistic approach (GMM)

Kain et al. (1998) [10]: Probabilistic approach (GMM)

Toda et al. (2001) [11]: Probabilistic approach (GMM) and DFW

Toda et al. (2005) [12]: Probabilistic approach (GMM)

Turk et al. (2005) [13]: Correction filters

Mouchtaris et al. (2006) [14]: Probabilistic approach (GMM) and speaker adaptation



Main steps for voice conversion

Source (prosody) modifications

Filter modification

1 Representation
2 Mapping



Example of spectral envelopes mapping

[Figure: spectral envelopes, dB against frequency (0 to 4000 Hz). Panel title: "Fulltype: Dist. to src: 2.6 dB, Dist. to tar: 15 dB".]



Main steps for learning filter mappings

Alignment

Explicitly for parallel data (DTW, HMM)
Implicitly for non-parallel data (through speaker adaptation [14])

Define the mapping function (VQ, GMM)



Parallel data: alignment with DTW

Distance/Correlation

Short/long sentences

Using anchor points

Constraints (steps)
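A minimal DTW sketch may help fix ideas. It assumes Euclidean distance between feature vectors and the basic step pattern with three allowed moves (horizontal, vertical, diagonal); the anchor points and other constraints listed above are not implemented. The function name `dtw` is illustrative.

```python
def dtw(src, tgt, dist=None):
    """Align two sequences of feature vectors; return (total_cost, path).

    path is a list of (i, j) pairs pairing src[i] with tgt[j].
    """
    if dist is None:
        # Euclidean distance between two vectors (an assumption of this sketch).
        dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(src[i - 1], tgt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end of both sequences to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path
```

The returned path provides the frame pairs on which a mapping function can then be trained.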



Parallel data: alignment with HMM

Forced alignment of the Source and Target speakers, given the orthographic transcription of the utterance

Sentence HMM ([15])

Template sentences from source (phonetically balanced)
Left-to-Right HMM for each sentence by adding a new state at a constant rate (i.e. every 40 ms)
Forced alignment using the Viterbi algorithm (find the best sequence of states)
Alignment using state indices.



Non-parallel data

Assuming[14]:

1 Parallel data for two speakers (Speaker 1 and Speaker 2) exist

2 Conversion function (mapping) between these two speakers is known

then:

Adapt Speaker 1 to the Source speaker

Adapt Speaker 2 to the Target speaker

Compute Conversion function by using:

the initial conversion function of the parallel data
the adaptation parameters



Overview of a baseline GMM-based approach

Data: Parallel data alignment with DTW

Probabilistic classification: The acoustic space of a speaker is described by a parametric Gaussian mixture model (GMM).

Mapping function: A mapping function associates the acoustic space of the source speaker with the acoustic space of the target speaker.

Iterative approach: Re-alignment after conversion.



Probabilistic classification

Modeling of the acoustic space of a speaker by a GMM:

$$p(x) = \sum_{i=1}^{m} \alpha_i N(x; \mu_i, \Sigma_i)$$

Classification:

$$P(C_i \mid x) = \frac{\alpha_i N(x; \mu_i, \Sigma_i)}{\sum_{j=1}^{m} \alpha_j N(x; \mu_j, \Sigma_j)}$$

Estimation using an Expectation-Maximization (EM) algorithm initialized by a standard binary splitting VQ procedure.
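The classification rule is Bayes' rule applied to the mixture components. The sketch below works on scalar observations to stay short (real systems use vectors and full or diagonal covariance matrices); the function names are illustrative.

```python
import math

def gauss(x, mean, var):
    """Scalar Gaussian density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posteriors(x, weights, means, variances):
    """Return P(C_i | x) for every component of a scalar GMM."""
    joint = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)                 # this is the mixture density p(x)
    return [p / total for p in joint]
```

For two equally weighted unit-variance components centred at -1 and +1, an observation at 0 gets posterior 0.5 for each class, and the posteriors always sum to one.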



Mapping function

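Mapping function [16]:

$$F(x_t) = \sum_{i=1}^{m} P(C_i \mid x_t) \left[ \nu_i + \Gamma_i \Sigma_i^{-1} (x_t - \mu_i) \right]$$

Motivation:

$$E[y \mid x = x_t] = \nu + \Gamma \Sigma^{-1} (x_t - \mu)$$

Estimation of mapping function:

$$\epsilon = \sum_{t=1}^{n} \| y_t - F(x_t) \|^2$$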
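To show how the mapping is evaluated once its parameters are known, here is a scalar sketch. It assumes the per-class offset and cross-covariance terms (called `nu` and `gamma` in the code, names chosen for this illustration) have already been estimated, for instance by least squares on aligned source-target pairs; the estimation itself is not shown.

```python
import math

def posteriors(x, weights, means, variances):
    """P(C_i | x) for a scalar GMM."""
    joint = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [p / total for p in joint]

def convert(x, weights, means, variances, nu, gamma):
    """Scalar mapping: F(x) = sum_i P(C_i|x) * (nu_i + gamma_i / var_i * (x - mean_i))."""
    post = posteriors(x, weights, means, variances)
    return sum(p * (n + g / v * (x - m))
               for p, n, g, v, m in zip(post, nu, gamma, variances, means))
```

With a single class the mapping reduces to the linear regression of the motivation formula; with several classes it becomes a posterior-weighted blend of such regressions, which is what makes the transform continuous.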



Conversion using HNM

Conversion of the harmonic part:

[Figure: block diagram of the training stage. Source and target data each pass through HNM analysis (asynchronous mode) to give spectral envelopes x and y; DTW provides the alignment path, a GMM is estimated with EM, and LS optimization yields the conversion function.]

Conversion of the noise part: use of two separate time-invariant 6th-order all-pole corrective filters; one for voiced frames (upper band) and one for unvoiced frames (full band).



The voice conversion system

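[Figure: block diagram of the conversion system. The speech signal passes through HNM analysis (synchronous mode); the spectral envelope of the voiced part is transformed with the conversion function (envelope transformation), corrective filters are applied to the noise part, and HNM synthesis (synchronous mode), driven by the prosodic specifications and a mapping of the time instants t_i, produces the converted speech.]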



Corpus and test conditions for the formal listening test

Conversion between two French male speakers. Data have been provided by FT (5 minutes per speaker)

Sampling frequency: 16 kHz

Frame size for the asynchronous HNM analysis: 10 msec.

Cepstrum order: 20

Maximum voiced frequency was fixed at a constant value of 4 kHz.

Twenty adult listeners familiar with listening tests of speech coding but unfamiliar with the voice conversion task.

Prosody of the source speaker has been altered to match as closely as possible the prosody of the target speaker.



Results - XAB test

Task: Listeners were asked to select either A or B as being most similar to X.

                   PO     16 GMM   64 GMM   64 GMM(2)
Correct answers    18%    83%      88%      97%



Results - Opinion test

Task: rate similarity of each pair of speakers (0: the same speaker, 9: very different speaker).

[Figure: score (0 to 9) for the pairs TT, SS, M2, M1, PT and ST.]



Audio examples of Voice Conversion: HNM + GMM

Source Converted Target

Source Converted Target



Discussion

Quality

Alignment

Source processing

Filter processing

Interaction between source and filter

Voice conversion / Emotions

Before that ...



Jointly model Source and Target

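Kain et al. [10] suggest jointly modeling the target and the source by a GMM:

$$F(x_t) = \sum_{i=1}^{m} P(C_i \mid x_t) \left[ \mu^{y}_i + \Sigma^{yx}_i \left( \Sigma^{xx}_i \right)^{-1} (x_t - \mu^{x}_i) \right]$$

where

$$P(C_i \mid x) = \frac{\alpha_i N(x; \mu^{x}_i, \Sigma^{xx}_i)}{\sum_{j=1}^{m} \alpha_j N(x; \mu^{x}_j, \Sigma^{xx}_j)}$$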
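For reference, the quantities in this mapping come from a single GMM fitted on stacked source-target vectors. The block below writes out that standard partitioned form as a sketch; the stacked vector $z_t$ is notation introduced here and does not appear on the slide.

```latex
% Joint vector built from aligned source and target frames
z_t = \begin{bmatrix} x_t \\ y_t \end{bmatrix},
\qquad
p(z) = \sum_{i=1}^{m} \alpha_i \, N(z; \mu_i, \Sigma_i)

% Each component mean and covariance is partitioned accordingly
\mu_i = \begin{bmatrix} \mu^{x}_i \\ \mu^{y}_i \end{bmatrix},
\qquad
\Sigma_i = \begin{bmatrix} \Sigma^{xx}_i & \Sigma^{xy}_i \\ \Sigma^{yx}_i & \Sigma^{yy}_i \end{bmatrix}

% Conditional mean of y given x within component i
E[y \mid x, C_i] = \mu^{y}_i + \Sigma^{yx}_i \left( \Sigma^{xx}_i \right)^{-1} (x - \mu^{x}_i)
```

The conversion function is then the posterior-weighted sum of these per-component conditional means, so all of its parameters are obtained directly from EM training of the joint model.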



Joint GMM and DFW

Toda et al. [11] combined the previous joint GMM approach and DFW to avoid over-smoothing of the converted spectral envelope:

$$|S_c(f)| = \exp\left[ \ln |S_d(f)| + w \left( \ln |S_g(f)| - \ln |S_d(f)| \right) \right]$$

where $S_d(f)$ and $S_g(f)$ denote the spectrum after DFW and after conversion, respectively. The weight $w$ varies between 0 and 1.
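The combination is a weighted interpolation in the log-magnitude domain, which the following sketch makes explicit. It assumes both spectra are given as lists of strictly positive magnitudes on the same frequency grid; the function name `mix_spectra` is illustrative.

```python
import math

def mix_spectra(s_dfw, s_gmm, w):
    """Per-bin |S_c| = exp(ln|S_d| + w * (ln|S_g| - ln|S_d|)).

    s_dfw: magnitudes after dynamic frequency warping.
    s_gmm: magnitudes after GMM-based conversion.
    w: weight in [0, 1]; 0 keeps the DFW spectrum, 1 keeps the GMM spectrum.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return [math.exp(math.log(d) + w * (math.log(g) - math.log(d)))
            for d, g in zip(s_dfw, s_gmm)]
```

At w = 0.5 each output bin is the geometric mean of the two input magnitudes, which shows how the weight trades the spectral detail kept by DFW against the conversion accuracy of the GMM.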



Joint GMM and use of Global Variance

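Toda et al. [12] suggested:

Combining the joint GMM with the global variance of the converted spectra in each utterance to cope with over-smoothing

Using delta features to alleviate spectral discontinuities

$$F(x_t) = \left( W^{T} D_m^{-1} W \right)^{-1} W^{T} D_m^{-1} E_m$$

where

$$E_m = \left[ E_1(m_{i_1}),\; E_2(m_{i_2}),\; \dots,\; E_N(m_{i_N}) \right]$$

$$D_m^{-1} = \mathrm{diag}\left[ D_{m_{i_1}}^{-1},\; D_{m_{i_2}}^{-1},\; \dots,\; D_{m_{i_N}}^{-1} \right]$$

$$E_n(m_i) = \mu^{y}_i + \Sigma^{yx}_i \left( \Sigma^{xx}_i \right)^{-1} (x_t - \mu^{x}_i)$$

$$D_{m_i} = \Sigma^{yy}_i - \Sigma^{yx}_i \left( \Sigma^{xx}_i \right)^{-1} \Sigma^{xy}_i$$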



Another GMM based ...

Mesbahi et al. [17] suggest a modified mapping function that tries to overcome over-smoothing effects:

$$F(x_t) = \sum_{i=1}^{m} P(C_i \mid x_t) \left[ \mu^{y}_i + A_i (x_t - \mu^{x}_i) \right]$$

where $A_i$ is constrained to be diagonal, prohibiting cross-correlation between the coordinates of the acoustic vectors.



THANK YOU

Time for questions ...



References

[1] Y. Stylianou, J. Laroche, and E. Moulines, "High-Quality Speech Modification based on a Harmonic + Noise Model," Proc. EUROSPEECH, 1995.

[2] H. Kawahara, "Speech representation and transformation using adaptive interpolation of weighted spectrum: vocoder revisited," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Munich, Germany), pp. 1303-1306, 1997.

[3] D. Vincent, O. Rosec, and T. Chonavel, "Estimation of LF glottal source parameters based on ARX model," in Proc. Interspeech, (Lisbon, Portugal), pp. 333-336, 2005.

[4] D. Vincent, O. Rosec, and T. Chonavel, "A new method for speech synthesis and transformation based on an ARX-LF source-filter decomposition and HNM modeling," ICASSP, 2007.

[5] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," in Proc. ICASSP88, pp. 655-658, 1988.

[6] H. Valbret, E. Moulines, and J. Tubach, "Voice transformation using PSOLA techniques," Speech Communication, vol. 11, no. 2-3, pp. 175-187, 1992.

[7] N. Iwahashi and Y. Sagisaka, "Speech spectrum transformation based on speaker interpolation," in Proc. ICASSP94, 1994.

[8] H. Kuwabara and Y. Sagisaka, "Acoustic characteristics of speaker individuality: Control and conversion," Speech Communication, vol. 16, no. 2, pp. 165-173, 1995.

[9] Y. Stylianou, O. Cappe, and E. Moulines, "Statistical methods for voice quality transformation," Proc. EUROSPEECH, 1995.

[10] A. Kain and M. Macon, "Spectral voice conversion for text-to-speech synthesis," in Proc. ICASSP98, pp. 285-288, 1998.

[11] T. Toda, H. Saruwatari, and K. Shikano, "Voice Conversion Algorithm based on Gaussian Mixture Model with Dynamic Frequency Warping of STRAIGHT spectrum," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Salt Lake City, USA), pp. 841-844, 2001.

[12] T. Toda, A. Black, and K. Tokuda, "Spectral Conversion Based on Maximum Likelihood Estimation considering Global Variance of Converted Parameter," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Philadelphia, USA), pp. 9-12, 2005.

[13] O. Turk and L. M. Arslan, "Robust processing techniques for voice conversion," Computer Speech and Language, vol. 20, pp. 441-467, 2006.

[14] A. Mouchtaris, J. Van der Spiegel, and P. Mueller, "Non parallel training for voice conversion based on a parameter adaptation approach," IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 952-963, 2006.

[15] L. Arslan and D. Talkin, "Speaker transformation algorithm using segmental codebooks," Speech Communication, vol. 28, pp. 211-226, 1999.

[16] Y. Stylianou, O. Cappe, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech and Audio Processing, vol. 6, no. 2, pp. 131-142, 1998.

[17] L. Mesbahi, V. Barreaud, and O. Boeffard, "GMM-based Speech Transformation Systems under Data Reduction," 6th ISCA Workshop on Speech Synthesis, pp. 119-124, August 22-24, 2007.

