Voice Transformation, Part II
Yannis Stylianou
Outline of the talk
Simple control of voice quality
For more complex voice transformations
Control of source and filter characteristics
Discussion
Extensions of the mapping function
References
Voice Transformation, Part II
Yannis Stylianou
Computer Science Department, Multimedia Informatics [email protected]
Interspeech 2007, August 27th 2007, Antwerpen, Belgium
1 Simple control of voice quality
  Voice quality in TTS
  Detection of voice quality problems
  Compensation
2 For more complex voice transformations
3 Control of source and filter characteristics
  Voice Morphing
  Voice Conversion
  A baseline probabilistic approach
4 Discussion
5 Extensions of the mapping function
6 References
Voice quality in TTS
The key to natural-sounding TTS is the use of large speech databases, where we wish to have:
Many instances of basic units
Variety of prosodic characteristics
Variety of spectral information
while we wish to avoid:
Variability in voice quality
Problems associated with the variability of voice quality
Degradation of the overall quality of synthesis
Problems in the unit selection algorithm
Problems in the unit concatenation algorithm
A big part of the database may be useless
Intra- and inter-session variability
Intra-session variability
  Movements of the microphone during the recording session
  Fatigue of the speaker after a long recording session (5 hours)
Inter-session variability
  Modifications of the recording equipment from one recording session to the other
  Variability of emotional state and/or health of the speaker
Task
Given a large speech database
Automatically detect voice quality problems and
Correct voice quality problems with NO degradation of the speech signals.
Intra-session variability
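Modeling of the acoustic space of the speaker using GMMs, $\lambda_{r_i}$ (with $i = 1, \dots, N$), based on the first $k$ of $L$ observations from each recording session $r_i$:
\[
O_{r_i} = \left[ O^{(1)}_{r_i}, O^{(2)}_{r_i}, \dots, O^{(k)}_{r_i} \;\vdots\; O^{(k+1)}_{r_i}, \dots, O^{(L)}_{r_i} \right]
\]
Estimation of the log-likelihood function:
\[
L(O^{(l)}_{r_i} \mid \lambda_{r_i}) = \frac{1}{T} \sum_{t=1}^{T} p(o^{(l)}_t \mid \lambda_{r_i}), \quad \text{for } l = 1, \dots, L
\]
The variance of $L(O^{(l)}_{r_i} \mid \lambda_{r_i})$ reflects intra-session variability and defines the reference recording session, $r_p$.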
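The choice of the reference session can be sketched in a few lines of Python. This is only an illustration under stated assumptions: features are scalar, the session model is a toy one-dimensional GMM, p(.) in the score is evaluated in the log domain, and the names `log_gmm_density`, `segment_score`, `pick_reference` and all numbers are hypothetical, not taken from the talk.

```python
import math
from statistics import pvariance

def log_gmm_density(x, weights, means, variances):
    """Log density of a 1-D Gaussian mixture at x (toy stand-in for a session GMM)."""
    terms = [
        math.log(w) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def segment_score(frames, gmm):
    """Average per-frame log-likelihood of one segment under a session model."""
    return sum(log_gmm_density(x, *gmm) for x in frames) / len(frames)

def pick_reference(scores_by_session):
    """Return the session whose segment scores have the smallest variance."""
    return min(scores_by_session, key=lambda s: pvariance(scores_by_session[s]))

# Hypothetical scores for three sessions, four segments each.
scores = {
    "r1": [9.1, 9.0, 9.2, 9.1],
    "r2": [8.0, 9.5, 8.7, 7.9],
    "r3": [9.6, 9.7, 9.6, 9.7],
}
reference = pick_reference(scores)
```

With these toy numbers the third session has the most stable scores, so it would play the role of $r_p$.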
Inter-session variability
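Measurement of
\[
L(O^{(l)}_{r_i} \mid \lambda_{r_p}), \quad \forall\, i, l
\]
Compute z-score:
\[
z^{l}_{r_i} = \frac{L(O^{(l)}_{r_i} \mid \lambda_{r_p}) - \mu_L}{\sigma_L}, \quad i \neq p
\]
Test the null hypothesis ($r_p \equiv r_i(l)$) against the alternative hypothesis ($r_p \not\equiv r_i(l)$) with a 0.01 level of alpha error.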
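A minimal Python sketch of the z-score test follows. It assumes that the mean and standard deviation in the z-score are those of the reference session's own segment scores and that the test is two-sided; neither detail is stated on the slide. The function name `flag_segments` and the score values are hypothetical.

```python
from statistics import NormalDist, mean, pstdev

def flag_segments(ref_scores, test_scores, alpha=0.01):
    """Return (index, z) for segments whose z-score rejects the null hypothesis."""
    mu, sigma = mean(ref_scores), pstdev(ref_scores)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold, about 2.576
    flagged = []
    for l, score in enumerate(test_scores):
        z = (score - mu) / sigma
        if abs(z) > z_crit:
            flagged.append((l, z))
    return flagged

ref = [9.6, 9.7, 9.6, 9.7, 9.65]   # hypothetical reference-session scores
other = [9.62, 8.1, 9.7, 9.0]      # hypothetical scores of another session
problems = flag_segments(ref, other)
```

Segments returned by `flag_segments` are the candidates for compensation; the others are treated as consistent with the reference session.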
Schematically
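[Figure: block diagram of the detection procedure. For each recording session (R1 ... Rk), likelihoods P(O) of the segments are computed and their variance is taken (Var 1 ... Var k); the session with the minimum variance becomes the reference database (Ref DB), against which each segment is then scored.]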
Example of Voice quality problem
[Figure: two panels, score versus # of segment (0 to 20). Left panel: intra-session variability. Right panel: inter-session variability.]
Compensation
Given a segment $r_i$ with voice quality problems and the reference recording session, $r_p$, their difference is given by:
\[
\rho^{(l)}_{r_i}(\tau) = \int_{-1/2}^{1/2} \left( P_{r_p}(f) - P^{(l)}_{r_i}(f) \right) \exp(j 2 \pi f \tau)\, df
\]
where $P_{\cdot}(f)$ denotes power spectral density.
Computing the coefficients of an AR corrective filter using the standard Levinson-Durbin algorithm.
Filtering the speech signal from $r_i$ with the computed AR corrective filter.
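The last two steps can be sketched in Python as follows. The sketch assumes the autocorrelation-like sequence is already available and well behaved, and that the AR corrective filter is applied as an all-pole filter 1/A(z); both are assumptions, and the helper names `levinson_durbin` and `all_pole_filter` are hypothetical.

```python
def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for AR coefficients.

    r: autocorrelation values r[0..order]; returns (a, err) with a[0] == 1
    and the prediction filter A(z) = sum_k a[k] z^-k.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                  # updated prediction error
    return a, err

def all_pole_filter(x, a, gain=1.0):
    """Filter x with gain / A(z): y[n] = gain*x[n] - sum_k a[k]*y[n-k]."""
    y = []
    for n, xn in enumerate(x):
        yn = gain * xn - sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(yn)
    return y

# Autocorrelation of a first-order process with a pole at 0.5.
a, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
impulse_response = all_pole_filter([1.0, 0.0, 0.0], a)
```

For this toy input the recursion recovers a single pole at 0.5, and the impulse response of the resulting filter decays as 1, 0.5, 0.25.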
Evaluation
Objective:
  $L(O^{(l)}_{r_i} \mid \lambda_{r_p})$ should increase after compensation.
  Spectral distance should decrease after compensation.
Subjective:
A-B listening tests
Results
                                                        i = 1            i = 2             i = 3
$L(O_{r_i} \mid \lambda_{r_p})$ before compensation     9.0974           8.0646            9.6506
$L(O_{r_i} \mid \lambda_{r_p})$ after compensation      9.2654 (1.84%)   9.2160 (14.27%)   9.6506
Example of Voice quality problem resolved
[Figure: score versus # of segment (0 to 20), inter-session variability after correction.]
Speech Models
Harmonic plus Noise Model, HNM (Stylianou et al., 1995)[1]
Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum, STRAIGHT (Kawahara, 1997)[2]
Auto-Regressive eXogenous Liljencrants-Fant, ARX-LF (Vincent et al., 2007)[3]
Motivation for HNM
[Figure, four panels: (a) original speech signal, amplitude versus time in samples; (b) original magnitude spectrum, dB versus frequency (Hz); (c) harmonic part (0-5000 Hz), amplitude versus time in samples; (d) noise part (5000-8000 Hz), amplitude versus time in samples.]
Brief overview of HNM
HNM is a pitch-synchronous harmonic plus noise representation of the speech signal.
Speech spectrum is divided into a low and a high band delimited by the so-called maximum voiced frequency.
The low band of the spectrum (below the maximum voiced frequency) is represented solely by harmonically related sine waves.
The upper band is modeled as a noise component modulated by a time-domain amplitude envelope.
HNM allows high-quality copy synthesis and prosodic modifications.
HNM in equations
Harmonic part:
\[
h(t) = \sum_{k=-L(t)}^{L(t)} A_k(t)\, e^{j k \omega_0(t) t}
\]
Noise part:
\[
n(t) = e(t) \left[ v(\tau, t) \star b(t) \right]
\]
Speech:
\[
s(t) = h(t) + n(t)
\]
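As an illustration of the harmonic part only, the Python sketch below evaluates the sum over one short frame. It assumes the amplitudes, the fundamental frequency and the number of harmonics are constant within the frame, and that the amplitude at index -k is the complex conjugate of the one at index k so that the result is real. The function name `harmonic_part` and the parameter values are hypothetical; the noise part is not modeled here.

```python
import cmath
import math

def harmonic_part(t, amps, f0):
    """h(t) = sum_{k=-L}^{L} A_k exp(j k w0 t) for constant A_k and f0.

    amps[k] is the complex amplitude A_k for k = 0..L; A_{-k} is taken as
    the conjugate of A_k so that the sum is real.
    """
    w0 = 2 * math.pi * f0
    total = complex(amps[0])
    for k in range(1, len(amps)):
        total += amps[k] * cmath.exp(1j * k * w0 * t)
        total += amps[k].conjugate() * cmath.exp(-1j * k * w0 * t)
    return total

# One frame: 100 Hz fundamental, two harmonics, sampled at 16 kHz.
amps = [0.0, 0.5, 0.25j]
frame = [harmonic_part(n / 16000, amps, 100.0).real for n in range(160)]
```

With conjugate-symmetric amplitudes the imaginary part cancels, so taking `.real` only discards rounding noise.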
Audio examples from HNM
Original
Time-scale by 0.7
Time-scale by 1.6
Pitch modification by 0.8
Pitch modification by 1.6
Original
Time-varying pitch and time modif.
Original
Time-scale by 4
Time-scale by 6
STRAIGHT
Speech signal is represented as a sum of minimum phase impulse responses[2]:
\[
s(t) = \sum_{t_i \in Q} \frac{1}{\sqrt{G(f_0(t_i))}}\, u_{t_i}(t - T(t_i))
\]
where $Q$ represents a set of positions and $G(\cdot)$ represents a pitch modification function.
Minimum phase impulse responses are modified using all-pass filters
Filter information is reconstructed in the time-frequency region
Excitation information is manipulated through phase.
Audio examples from STRAIGHT
Original utterance /kohi ni miruku wo iremasu ka/ ("Coffee with milk?")
Original
2 times of F0 and 1.25 times frequency axis
3 times of F0 and 1.44 times frequency axis
0.5 times of F0 and 0.8 times frequency axis
ARX-LF
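ARX model:
\[
s(t) = -\sum_{k=1}^{p} a_k(t)\, s(t-k) + b_0 u(t) + r(t)
\]
LF model:
\[
u(t) = E_1 e^{\alpha t} \sin(w t), \quad 0 \le t \le T_e \tag{1}
\]
\[
u(t) = -E_2 \left[ e^{-b(t - T_e)} - e^{-b(T_0 - T_e)} \right], \quad T_e \le t \le T_0 \tag{2}
\]
Residual signal, r(t), is modeled by HNM[4].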
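A rough Python sketch of one period of the LF source follows, assuming the usual sign convention in which the return phase is negative and decays to zero at the end of the period. The function name `lf_pulse` and the parameter values are hypothetical; in a fitted LF model the parameters are tied together by continuity and area constraints, which this toy example does not enforce.

```python
import math

def lf_pulse(t, E1, alpha, w, E2, b, Te, T0):
    """Glottal flow derivative u(t) over one period [0, T0]."""
    if 0.0 <= t <= Te:
        # Open phase, eq. (1): growing sinusoid.
        return E1 * math.exp(alpha * t) * math.sin(w * t)
    if Te < t <= T0:
        # Return phase, eq. (2): exponential recovery that reaches zero at T0.
        return -E2 * (math.exp(-b * (t - Te)) - math.exp(-b * (T0 - Te)))
    return 0.0

# Hypothetical parameter values for a 10 ms period (100 Hz pitch).
params = dict(E1=1.0, alpha=300.0, w=2 * math.pi / 0.008,
              E2=5.0, b=2000.0, Te=0.006, T0=0.010)
pulse = [lf_pulse(n / 16000, **params) for n in range(161)]
```

The list `pulse` holds one period sampled at 16 kHz; it could serve as the excitation u(t) of the ARX filter.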
Audio examples from ARX-LF
Original
Time scale by 2.0
Pitch scale by 0.7
Pitch scale by 1.4
Voice Morphing
From two speakers (source speakers) who utter the same sentence, create a new speaker who utters the same sentence as the source speakers and with voice characteristics taken from the source speakers.
Dynamic Time Warping (DTW) between the two sentences
Linear Interpolation between corresponding frames
Synthesis
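The alignment and interpolation steps can be sketched in Python as below. This is a toy version under stated assumptions: each frame is a single scalar feature rather than a spectral vector, the local distance is the absolute difference, no step constraints are applied, and synthesis is left out. The names `dtw_path` and `morph` are hypothetical.

```python
def dtw_path(a, b):
    """Align two sequences of scalar features; returns a list of (i, j) index pairs."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, preferring the diagonal on ties.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps, key=lambda s: s[0])
    return path[::-1]

def morph(a, b, weight=0.5):
    """Linear interpolation between DTW-aligned frames (weight=0 gives a, 1 gives b)."""
    return [(1 - weight) * a[i] + weight * b[j] for i, j in dtw_path(a, b)]

speaker_a = [1.0, 2.0, 3.0]
speaker_b = [1.0, 2.0, 2.0, 3.0]
path = dtw_path(speaker_a, speaker_b)
morphed = morph(speaker_a, speaker_b)
```

Varying `weight` between 0 and 1 moves the morphed frames from the first speaker towards the second.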
Voice Conversion
Definition: Voice conversion aims at transforming the characteristics of the speech signal uttered by a speaker (Source Speaker), in such a way that a human listener could believe that the transformed speech is produced by another specific speaker (Target Speaker).
Control of the source and filter characteristics
Overview of techniques
Abe et al. (1988)[5]: VQ mapping
Valbret et al. (1992)[6]: Linear Multivariate Regression (LMR), Dynamic Frequency Warping (DFW)
Iwahashi et al. (1994)[7]: Speaker Interpolation
Kuwabara et al. (1995)[8]: Fuzzy VQ
Stylianou et al. (1995)[9]: Probabilistic approach (GMM)
Kain et al. (1998)[10]: Probabilistic approach (GMM)
Toda et al. (2001)[11]: Probabilistic approach (GMM) and DFW
Toda et al. (2005)[12]: Probabilistic approach (GMM)
Turk et al. (2005)[13]: Correction filters
Mouchtaris et al. (2006)[14]: Probabilistic approach (GMM) and speaker adaptation
Main steps for voice conversion
Source (prosody) modifications
Filter modification
  1 Representation
  2 Mapping
Example of spectral envelopes mapping
[Figure: spectral envelopes, dB versus frequency (0 to 4000 Hz). Panel title: "Fulltype: Dist. to src: 2.6 dB, Dist. to tar: 15 dB".]
Main steps for learning filter mappings
Alignment
  Explicitly for parallel data (DTW, HMM)
  Implicitly for non-parallel data (through speaker adaptation[14])
Define mapping function (VQ, GMM)
Parallel data: alignment with DTW
Distance/Correlation
Short/long sentences
Using anchor points
Constraints (steps)
Parallel data: alignment with HMM
Forced-alignment: Source and Target speakers, given orthographic transcription of the utterance
Sentence HMM ([15])
  Template sentences from source (phonetically balanced)
  Left-to-Right HMM for each sentence by adding a new state at a constant rate (i.e. every 40 ms)
  Forced-alignment using the Viterbi algorithm (find the best sequence of states)
  Alignment using state indices.
Non-parallel data
Assuming[14]:
1 Parallel data for two speakers (Speaker 1 and Speaker 2) exist
2 Conversion function (mapping) between these two speakers is known
then:
Adapt Speaker 1 to the Source speaker
Adapt Speaker 2 to the Target speaker
Compute Conversion function by using:
  the initial conversion function of the parallel data
  the adaptation parameters
Overview of a baseline GMM-based approach
Data: Parallel data alignment with DTW
Probabilistic classification: The acoustic space of a speaker is described by a parametric Gaussian mixture model (GMM).
Mapping function: A mapping function associates the acoustic space of the source speaker with the acoustic space of the target speaker.
Iterative approach: Re-alignment after conversion.
Probabilistic classification
Modeling of the acoustic space of a speaker by a GMM:
\[
p(x) = \sum_{i=1}^{m} \alpha_i N(x; \mu_i, \Sigma_i)
\]
Classification:
\[
P(C_i \mid x) = \frac{\alpha_i N(x; \mu_i, \Sigma_i)}{\sum_{j=1}^{m} \alpha_j N(x; \mu_j, \Sigma_j)}
\]
Estimation using an Expectation-Maximization (EM) algorithm initialized by a standard binary splitting VQ procedure.
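The classification step amounts to Bayes' rule over the mixture components. A minimal Python sketch, assuming scalar features and already-trained parameters (the EM training itself is not shown), with hypothetical names `gaussian` and `posteriors`:

```python
import math

def gaussian(x, mean, var):
    """1-D normal density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posteriors(x, weights, means, variances):
    """P(C_i | x) for every mixture component, by Bayes' rule."""
    joint = [w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)          # this is the mixture density p(x)
    return [j / total for j in joint]

# Two equally weighted classes centred at -1 and +1.
post = posteriors(0.0, weights=[0.5, 0.5], means=[-1.0, 1.0], variances=[1.0, 1.0])
```

A point halfway between the two class means is assigned to each class with equal probability; moving it towards one mean shifts the posterior accordingly.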
Mapping function
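Mapping function[16]:
\[
F(x_t) = \sum_{i=1}^{m} P(C_i \mid x_t) \left[ \nu_i + \Gamma_i \Sigma_i^{-1} (x_t - \mu_i) \right]
\]
Motivation:
\[
E[y \mid x = x_t] = \nu + \Gamma \Sigma^{-1} (x_t - \mu)
\]
Estimation of the mapping function by minimizing the total squared conversion error:
\[
\epsilon = \sum_{t=1}^{n} \| y_t - F(x_t) \|^2
\]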
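For scalar features the mapping function and its error criterion can be written out directly. The Python sketch below assumes the GMM parameters and the mapping parameters (here called `nu` and `gamma`) are already given; in practice the latter would come from a least-squares fit on aligned data. The function names `convert` and `squared_error` and the numbers are hypothetical.

```python
import math

def convert(x, weights, means, variances, nu, gamma):
    """F(x) = sum_i P(C_i|x) * (nu_i + gamma_i / var_i * (x - mean_i)), scalar case."""
    dens = [
        w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    ]
    total = sum(dens)
    return sum(
        (d / total) * (n + g / v * (x - m))
        for d, m, v, n, g in zip(dens, means, variances, nu, gamma)
    )

def squared_error(xs, ys, **model):
    """Total squared conversion error over aligned pairs (x_t, y_t)."""
    return sum((y - convert(x, **model)) ** 2 for x, y in zip(xs, ys))

# With a single class the mapping reduces to the linear-regression form.
model = dict(weights=[1.0], means=[2.0], variances=[4.0], nu=[10.0], gamma=[2.0])
y_hat = convert(3.0, **model)
```

With one component the posterior is 1, so `y_hat` is simply 10 + (2 / 4) * (3 - 2), matching the conditional-expectation form given as motivation.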
Conversion using HNM
Conversion of the harmonic part:
[Figure: training block diagram. Source and target data each go through HNM analysis (asynchronous mode), giving spectral envelopes (x) and (y); DTW provides the alignment path; a GMM is estimated with EM; LS optimization yields the conversion function.]
Conversion of the noise part: use of two separate time-invariant 6th-order all-pole corrective filters; one for voiced frames (upper band) and one for unvoiced frames (full band).
The voice conversion system
[Figure: conversion block diagram. Blocks: speech signal, HNM analysis (synchronous mode), spectral envelope (voiced part), conversion function, envelope transformation, corrective filters (noise part), prosodic specifications, mapping, HNM synthesis (synchronous mode), converted speech.]
Corpus and test conditions for the formal listening test
Conversion between two French male speakers. Data have been provided by FT (5 minutes per speaker)
Sampling frequency: 16 kHz
Frame size for the asynchronous HNM analysis: 10 msec.
Cepstrum order: 20
Maximum voiced frequency was fixed at a constant value of 4 kHz.
Twenty adult listeners familiar with listening tests of speech coding but unfamiliar with the voice conversion task.
Prosody of the source speaker has been altered to match as closely as possible the prosody of the target speaker.
Results - XAB test
Task: Listeners were asked to select either A or B as being most similar to X.
                  PO     16 GMM   64 GMM   64 GMM(2)
Correct answers   18%    83%      88%      97%
Results - Opinion test
Task: rate similarity of each pair of speakers (0: the same speaker; 9: very different speaker).
[Figure: score (0 to 9) for the pairs TT, SS, M2, M1, PT, ST.]
Audio examples of Voice Conversion: HNM + GMM
Source Converted Target
Source Converted Target
Discussion
Quality
Alignment
Source processing
Filter processing
Interaction between source and filter
Voice conversion / Emotions
Before that ...
Jointly model Source and Target
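Kain et al.[10] suggest jointly modeling the target and the source with a GMM:
\[
F(x_t) = \sum_{i=1}^{m} P(C_i \mid x_t) \left[ \mu_i^{y} + \Sigma_i^{yx} \left(\Sigma_i^{xx}\right)^{-1} (x_t - \mu_i^{x}) \right]
\]
where
\[
P(C_i \mid x) = \frac{\alpha_i N(x; \mu_i^{x}, \Sigma_i^{xx})}{\sum_{j=1}^{m} \alpha_j N(x; \mu_j^{x}, \Sigma_j^{xx})}
\]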
Joint GMM and DFW
Toda et al. [11] combined the previous joint GMM approach with dynamic frequency warping (DFW) to avoid over-smoothing of the converted spectral envelope:
\[ |S_c(f)| = \exp\left[ \ln |S_d(f)| + w \left( \ln |S_g(f)| - \ln |S_d(f)| \right) \right] \]
where $S_d(f)$ and $S_g(f)$ denote the spectrum after DFW and after GMM-based conversion, respectively. The weight $w$ varies between 0 and 1.
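The weighting above is a linear interpolation in the log-magnitude domain. A short sketch, assuming both spectra are given as lists of strictly positive magnitudes on the same frequency grid (the function name and the sample values are hypothetical):

```python
import math

def blend_log_spectra(mag_dfw, mag_gmm, w):
    """|S_c(f)| = exp[ln|S_d(f)| + w (ln|S_g(f)| - ln|S_d(f)|)] for each bin."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return [
        math.exp(math.log(d) + w * (math.log(g) - math.log(d)))
        for d, g in zip(mag_dfw, mag_gmm)
    ]

# Illustrative magnitudes only: w = 0 keeps the DFW spectrum,
# w = 1 keeps the GMM-converted spectrum.
blended = blend_log_spectra([1.0, 2.0], [4.0, 8.0], 0.5)
```

With a single scalar weight, $w = 0.5$ yields the geometric mean of the two magnitudes in each bin.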
Joint GMM and use of Global Variance
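Toda et al. [12] suggested:
combining the joint GMM with the global variance of the converted spectra in each utterance to cope with over-smoothing
using delta features to alleviate spectral discontinuities
\[ F(x_t) = \left( W^{T} D_m^{-1} W \right)^{-1} W^{T} D_m^{-1} E_m \]
where
\[ E_m = \left[ E_1(m_{i_1}),\ E_2(m_{i_2}),\ \dots,\ E_N(m_{i_N}) \right] \]
\[ D_m^{-1} = \mathrm{diag}\left[ D_{m_{i_1}}^{-1},\ D_{m_{i_2}}^{-1},\ \dots,\ D_{m_{i_N}}^{-1} \right] \]
\[ E_n(m_i) = \mu_i^{y} + \Sigma_i^{yx} \left( \Sigma_i^{xx} \right)^{-1} (x_t - \mu_i^{x}) \]
\[ D_{m_i} = \Sigma_i^{yy} - \Sigma_i^{yx} \left( \Sigma_i^{xx} \right)^{-1} \Sigma_i^{xy} \]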
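A rough sketch of the closed-form solution $(W^T D_m^{-1} W)^{-1} W^T D_m^{-1} E_m$ under strong simplifying assumptions: a scalar static feature per frame, a delta defined as half the difference between the neighbouring frames (edge frames repeated), and a diagonal $D_m^{-1}$ passed as a list of precisions. The global variance term is not modelled here, and all names and numbers are hypothetical.

```python
def build_w(n):
    """W maps n static values to 2n rows: [static_t, delta_t] for each frame t."""
    w = [[0.0] * n for _ in range(2 * n)]
    for t in range(n):
        w[2 * t][t] = 1.0                       # static row: y_t
        prev_t, next_t = max(t - 1, 0), min(t + 1, n - 1)
        w[2 * t + 1][prev_t] -= 0.5             # delta row: 0.5 * (y_next - y_prev)
        w[2 * t + 1][next_t] += 0.5
    return w

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def ml_trajectory(means, precisions):
    """Return y = (W^T D^-1 W)^-1 W^T D^-1 E for a diagonal D^-1."""
    n = len(means) // 2
    w = build_w(n)
    rows = range(2 * n)
    # Normal equations: A = W^T D^-1 W and b = W^T D^-1 E.
    a = [[sum(w[k][i] * precisions[k] * w[k][j] for k in rows) for j in range(n)]
         for i in range(n)]
    b = [sum(w[k][i] * precisions[k] * means[k] for k in rows) for i in range(n)]
    return solve(a, b)

# Hypothetical targets, interleaved as [static_0, delta_0, static_1, delta_1, ...]:
# jumpy static means, zero delta means held with a higher precision.
means = [1.0, 0.0, 5.0, 0.0, 1.0, 0.0]
precisions = [1.0, 10.0] * 3
smoothed = ml_trajectory(means, precisions)
```

Because the delta targets are weighted more heavily than the static ones in this toy input, the resulting trajectory is flatter than the static means alone, which is the effect the delta features are meant to provide.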
Another GMM based ...
Mesbahi et al. [17] suggest a modified mapping function that tries to overcome over-smoothing effects:
\[ F(x_t) = \sum_{i=1}^{m} P(C_i \mid x_t) \left[ \mu_i^{y} + \Gamma (x_t - \mu_i^{x}) \right] \]
where the transformation matrix, written here as $\Gamma$, is constrained to be diagonal, prohibiting cross-correlation between coordinates of the acoustic vectors.
THANK YOU
Time for questions ...
References
[1] Y. Stylianou, J. Laroche, and E. Moulines, High-Quality Speech Modification based on a Harmonic + Noise Model, Proc. EUROSPEECH, 1995.
[2] H. Kawahara, Speech representation and transformation using adaptive interpolation of weighted spectrum: vocoder revisited, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Munich, Germany), pp. 1303-1306, 1997.
[3] D. Vincent, O. Rosec, and T. Chonavel, Estimation of LF glottal source parameters based on ARX model, in Proc. Interspeech, (Lisbon, Portugal), pp. 333-336, 2005.
[4] D. Vincent, O. Rosec, and T. Chonavel, A new method for speech synthesis and transformation based on an ARX-LF source-filter decomposition and HNM modeling, ICASSP, 2007.
[5] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, Voice conversion through vector quantization, in Proc. ICASSP88, pp. 655-658, 1988.
[6] H. Valbret, E. Moulines, and J. Tubach, Voice transformation using PSOLA techniques, Speech Communication, vol. 11, no. 2-3, pp. 175-187, 1992.
[7] N. Iwahashi and Y. Sagisaka, Speech spectrum transformation based on speaker interpolation, in Proc. ICASSP94, 1994.
[8] H. Kuwabara and Y. Sagisaka, Acoustic characteristics of speaker individuality: Control and conversion, Speech Communication, vol. 16, no. 2, pp. 165-173, 1995.
[9] Y. Stylianou, O. Cappé, and E. Moulines, Statistical methods for voice quality transformation, Proc. EUROSPEECH, 1995.
[10] A. Kain and M. Macon, Spectral voice conversion for text-to-speech synthesis, in Proc. ICASSP98, pp. 285-288, 1998.
[11] T. Toda, H. Saruwatari, and K. Shikano, Voice Conversion Algorithm based on Gaussian Mixture Model with Dynamic Frequency Warping of STRAIGHT spectrum, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Salt Lake City, USA), pp. 841-844, 2001.
[12] T. Toda, A. Black, and K. Tokuda, Spectral Conversion Based on Maximum Likelihood Estimation considering Global Variance of Converted Parameter, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Philadelphia, USA), pp. 9-12, 2005.
[13] O. Turk and L. M. Arslan, Robust processing techniques for voice conversion, Computer Speech and Language, vol. 20, pp. 441-467, 2006.
[14] A. Mouchtaris, J. Van der Spiegel, and P. Mueller, Non parallel training for voice conversion based on a parameter adaptation, IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 952-963, 2006.
[15] L. Arslan and D. Talkin, Speaker transformation algorithm using segmental codebooks, Speech Communication, vol. 28, pp. 211-226, 1999.
[16] Y. Stylianou, O. Cappé, and E. Moulines, Continuous probabilistic transform for voice conversion, IEEE Trans. Speech and Audio Processing, vol. 6, no. 2, pp. 131-142, 1998.
[17] L. Mesbahi, V. Barreaud, and O. Boeffard, GMM-based Speech Transformation Systems under Data Reduction, 6th ISCA Workshop on Speech Synthesis, pp. 119-124, August 22-24, 2007.