  • Voice Transformation, Part II

    Yannis Stylianou

    Computer Science Department, Multimedia Informatics Lab, [email protected]

    Interspeech 2007, August 27th 2007, Antwerpen, Belgium

  • Outline of the talk

    1 Simple control of voice quality

      Voice quality in TTS

      Detection of voice quality problems

      Compensation

    2 For more complex voice transformations

    3 Control of source and filter characteristics

      Voice Morphing

      Voice Conversion

      A baseline probabilistic approach

    4 Discussion

    5 Extensions of the mapping function

    6 References

  • Voice quality in TTS

    The key to natural-sounding TTS is the use of large speech databases, where we wish to have:

    Many instances of basic units

    Variety of prosodic characteristics

    Variety of spectral information

    while we wish to avoid:

    Variability in voice quality

  • Problems associated with the variability of voice quality

    Degradation of the overall quality of synthesis

    Problems in the unit selection algorithm

    Problems in the unit concatenation algorithm

    A big part of the database may be useless

  • Intra- and inter-session variability

    Intra-session variability

      Movements of the microphone during the recording session

      Fatigue of the speaker after a long recording session (5 hours)

    Inter-session variability

      Modifications of the recording equipment from one recording session to the other

      Variability of emotional state and/or health of the speaker

  • Task

    Given a large speech database

    Automatically detect voice quality problems and

    Correct voice quality problems with NO degradation of the speech signals.

  • Intra-session variability

    Modeling of the acoustic space of the speaker using GMMs, $\lambda_{r_i}$ (with $i = 1, \dots, N$), based on the first $k$ of $L$ observations from each recording session $r_i$:

    $$O_{r_i} = \left[ O^{(1)}_{r_i}, O^{(2)}_{r_i}, \dots, O^{(k)}_{r_i} \;\vdots\; O^{(k+1)}_{r_i}, \dots, O^{(L)}_{r_i} \right]$$

    Estimation of the log-likelihood function:

    $$\mathcal{L}\big(O^{(l)}_{r_i} \mid \lambda_{r_i}\big) = \frac{1}{T} \sum_{t=1}^{T} \log p\big(o^{(l)}_t \mid \lambda_{r_i}\big), \quad l = 1, \dots, L$$

    The variance of $\mathcal{L}\big(O^{(l)}_{r_i} \mid \lambda_{r_i}\big)$ reflects intra-session variability and defines the reference recording session, $r_p$.
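The detection step on this slide can be sketched as follows. This is a minimal illustration, assuming one-dimensional features and session GMMs whose parameters are already trained; the helper names (`gmm_logpdf`, `segment_score`, `reference_session`) and the toy numbers are hypothetical and not taken from the talk.

```python
import math
from statistics import pvariance

def gmm_logpdf(x, weights, means, variances):
    """Log density of a 1-D Gaussian mixture at x (log-sum-exp for stability)."""
    logs = [
        math.log(w) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

def segment_score(frames, gmm):
    """Average per-frame log-likelihood of one segment under a session GMM."""
    return sum(gmm_logpdf(x, *gmm) for x in frames) / len(frames)

def reference_session(sessions, gmms):
    """Return (index of the session with the smallest score variance, all variances)."""
    variances = [
        pvariance([segment_score(seg, gmm) for seg in segments])
        for segments, gmm in zip(sessions, gmms)
    ]
    return min(range(len(variances)), key=variances.__getitem__), variances

# Toy data (assumed): two sessions of three segments each; session 1 drifts over time.
gmm = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
sessions = [
    [[-1.0, 1.0, 0.9], [-0.9, 1.1, 1.0], [-1.1, 0.9, 1.0]],
    [[-1.0, 1.0, 0.9], [0.0, 2.0, 2.1], [1.5, 3.0, 3.2]],
]
best, variances = reference_session(sessions, [gmm, gmm])
```

The session whose segment scores vary least is taken as the reference; in a real system each GMM would be trained on the first segments of its own session, with multidimensional spectral features.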

  • Inter-session variability

    Measurement of

    $$\mathcal{L}\big(O^{(l)}_{r_i} \mid \lambda_{r_p}\big), \quad \forall\, i, l$$

    Compute z-score:

    $$z^{l}_{r_i} = \frac{\mathcal{L}\big(O^{(l)}_{r_i} \mid \lambda_{r_p}\big) - \mu_{\mathcal{L}}}{\sigma_{\mathcal{L}}}, \quad i \neq p$$

    Test the null hypothesis ($r_p \equiv r_i(l)$) against the alternative hypothesis ($r_p \not\equiv r_i(l)$) with a 0.01 level of alpha error.
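A minimal sketch of the z-score test follows. It assumes that scores are normalised by the mean and standard deviation of the reference session's own scores and that the test is one-sided (only abnormally low scores are flagged); the slide does not state either choice, and the function name and numbers are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def flag_segments(scores, ref_scores, alpha=0.01):
    """Flag segments whose score under the reference model is significantly low.

    Assumptions: normalisation by the reference session's mean and standard
    deviation, and a one-sided test at level alpha.
    """
    mu, sigma = mean(ref_scores), stdev(ref_scores)
    threshold = NormalDist().inv_cdf(alpha)   # critical value of the standard normal
    z = [(s - mu) / sigma for s in scores]
    return [zi < threshold for zi in z], z

ref_scores = [9.6, 9.7, 9.5, 9.65, 9.55]   # hypothetical scores of the reference session
test_scores = [9.6, 8.1, 9.5]              # hypothetical scores of another session
flags, z = flag_segments(test_scores, ref_scores)
```

Segments for which the flag is set would be the ones passed on to the compensation stage.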

  • Schematically

    [Figure: schematic of the detection procedure. For each recording session R_1, ..., R_k the likelihood scores P(O | λ) of its segments are computed and their variance (Var_1, ..., Var_k) evaluated; the session with the minimum variance becomes the reference database (Ref DB), against which the remaining segments are scored.]

  • Example of voice quality problem

    [Figure: two plots of score versus segment number (0 to 20), titled "Intravariability" (intra-session variability) and "Intervariability" (inter-session variability).]

  • Compensation

    Given a segment from $r_i$ with voice quality problems and the reference recording session, $r_p$, their difference is given by:

    $$\rho^{(l)}_{r_i}(\tau) = \int_{-1/2}^{1/2} \left( P_{r_p}(f) - P^{(l)}_{r_i}(f) \right) \exp(j 2 \pi f \tau)\, df$$

    where $P_{\cdot}(f)$ denotes power spectrum density.

    Computing the coefficients of an AR corrective filter using the standard Levinson-Durbin algorithm.

    Filtering the speech signal from $r_i$ with the computed AR corrective filter.
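The last two steps can be sketched as below: a textbook Levinson-Durbin recursion that turns a sequence of autocorrelation values into AR coefficients, followed by all-pole filtering. How the autocorrelation sequence is obtained from the spectra is left out; the toy input here is the autocorrelation of a first-order AR process, and the function names are illustrative.

```python
def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for AR coefficients.

    r: autocorrelation values r[0..order]. Returns (a, err) with a[0] = 1, so the
    all-pole filter is 1 / (a[0] + a[1] z^-1 + ... + a[order] z^-order).
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        k_m = -acc / err                      # reflection coefficient
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + k_m * prev[m - k]
        a[m] = k_m
        err *= 1.0 - k_m * k_m                # prediction error update
    return a, err

def ar_filter(x, a, gain=1.0):
    """Apply the all-pole filter y[n] = gain * x[n] - sum_k a[k] * y[n - k]."""
    y = []
    for n, xn in enumerate(x):
        acc = gain * xn
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y

# Toy input (assumed): autocorrelation of x[n] = 0.5 x[n-1] + e[n], i.e. r[k] = 0.5**k.
a, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
impulse_response = ar_filter([1.0, 0.0, 0.0, 0.0], a)
```

For this input the recursion recovers a single pole at 0.5 and a zero second coefficient, which is a quick sanity check before applying it to real data.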

  • Evaluation

    Objective:

      $\mathcal{L}\big(O^{(l)}_{r_i} \mid \lambda_{r_p}\big)$ should increase after compensation.

      Spectral distance should decrease after compensation.

    Subjective:

    A-B listening tests

  • Results

                                               i = 1             i = 2              i = 3

    L(O_ri | λ_rp), before compensation        9.0974            8.0646             9.6506

    L(O_ri | λ_rp), after compensation         9.2654 (1.84%)    9.2160 (14.27%)    9.6506

  • Example of voice quality problem resolved

    [Figure: score versus segment number (0 to 20), titled "Intervariability after correction".]

  • Speech Models

    Harmonic plus Noise Model, HNM (Stylianou et al., 1995) [1]

    Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum, STRAIGHT (Kawahara, 1997) [2]

    Auto-Regressive eXogenous Liljencrants-Fant, ARX-LF (Vincent et al., 2007) [3]

  • Motivation for HNM

    [Figure, four panels: (a) original speech signal, amplitude versus time in samples; (b) original magnitude spectrum, dB versus frequency (Hz); (c) harmonic part (0 to 5000 Hz), amplitude versus time in samples; (d) noise part (5000 to 8000 Hz), amplitude versus time in samples.]

  • Brief overview of HNM

    HNM is a pitch-synchronous harmonic plus noise representation of the speech signal.

    Speech spectrum is divided into a low and a high band delimited by the so-called maximum voiced frequency.

    The low band of the spectrum (below the maximum voiced frequency) is represented solely by harmonically related sine waves.

    The upper band is modeled as a noise component modulated by a time-domain amplitude envelope.

    HNM allows high-quality copy synthesis and prosodic modifications.

  • HNM in equations

    Harmonic part:

    $$h(t) = \sum_{k=-L(t)}^{L(t)} A_k(t)\, e^{j k \omega_0(t) t}$$

    Noise part:

    $$n(t) = e(t) \left[ v(\tau, t) \star b(t) \right]$$

    Speech:

    $$s(t) = h(t) + n(t)$$
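A minimal synthesis sketch for one short frame follows, assuming constant amplitudes and fundamental frequency within the frame, a real-valued cosine form of the harmonic sum, and a crude first-difference high-pass filter standing in for the noise-shaping filter. All parameter values and function names are illustrative, not those of an actual HNM implementation.

```python
import math
import random

def harmonic_part(n_samples, fs, f0, amps, phases):
    """Sum of harmonically related sinusoids with constant parameters (one frame)."""
    return [
        sum(a * math.cos(2 * math.pi * (k + 1) * f0 * n / fs + p)
            for k, (a, p) in enumerate(zip(amps, phases)))
        for n in range(n_samples)
    ]

def noise_part(n_samples, envelope, rng):
    """White noise, shaped by a first-difference high-pass filter, times a time envelope."""
    b = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    shaped = [b[n] - (b[n - 1] if n else 0.0) for n in range(n_samples)]
    return [envelope(n) * v for n, v in enumerate(shaped)]

fs, f0, n_samples = 16000, 200.0, 160
max_voiced_freq = 4000.0
n_harm = int(max_voiced_freq // f0)        # harmonics up to the maximum voiced frequency
amps = [1.0 / (k + 1) for k in range(n_harm)]
phases = [0.0] * n_harm

def envelope(i):
    """Hypothetical pitch-synchronous amplitude envelope for the noise part."""
    return 0.05 * (1.0 + math.cos(2 * math.pi * f0 * i / fs))

h = harmonic_part(n_samples, fs, f0, amps, phases)
noise = noise_part(n_samples, envelope, random.Random(0))
s = [hi + ni for hi, ni in zip(h, noise)]   # speech = harmonic part + noise part
```

In a full system the amplitudes, phases, fundamental frequency and noise filter would be re-estimated for every analysis frame and interpolated over time.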

  • Audio examples from HNM

    Original

    Time-scale by 0.7

    Time-scale by 1.6

    Pitch modification by 0.8

    Pitch modification by 1.6

    Original

    Time-varying pitch and time modif.

    Original

    Time-scale by 4

    Time-scale by 6

  • STRAIGHT

    Speech signal is represented as a sum of minimum phase impulse responses [2]:

    $$s(t) = \sum_{t_i \in Q} \frac{1}{\sqrt{G(f_0(t_i))}}\, u_{t_i}\big(t - T(t_i)\big)$$

    where $Q$ represents a set of positions and $G(\cdot)$ represents a pitch modification function.

    Minimum phase impulse responses are modified using all-pass filters

    Filter information is reconstructed in the time-frequency region

    Excitation information is manipulated through phase.

  • Audio examples from STRAIGHT

    Original utterance: /kohi ni miruku wo iremasu ka/ ("Coffee with milk?")

    Original

    2 times of F0 and 1.25 times frequency axis

    3 times of F0 and 1.44 times frequency axis

    0.5 times of F0 and 0.8 times frequency axis

  • ARX-LF

    ARX model:

    $$s(t) = -\sum_{k=1}^{p} a_k(t)\, s(t-k) + b_0\, u(t) + r(t)$$

    LF model:

    $$u(t) = E_1 e^{\alpha t} \sin(w t), \quad 0 \le t \le T_e \qquad (1)$$

    $$u(t) = -E_2 \left[ e^{-b (t - T_e)} - e^{-b (T_0 - T_e)} \right] \qquad (2)$$

    Residual signal, $r(t)$, is modeled by HNM [4].
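The two ingredients of this slide can be sketched as below: one period of an LF-style glottal source with an exponentially growing sinusoid up to `te` and an exponential return phase afterwards, driving a time-invariant AR recursion. The sign conventions follow the common textbook form of these models and are an assumption here, as are the parameter values and the choice of `e2` that makes the two branches meet at `te`.

```python
import math

def lf_pulse(n_samples, fs, t0, te, alpha, w, b, e1=1.0):
    """One period of an LF-style glottal flow derivative, sampled at fs."""
    # Assumption of this sketch: e2 is chosen so both branches are equal at t = te.
    e2 = -e1 * math.exp(alpha * te) * math.sin(w * te) / (1.0 - math.exp(-b * (t0 - te)))
    u = []
    for n in range(n_samples):
        t = n / fs
        if t <= te:
            u.append(e1 * math.exp(alpha * t) * math.sin(w * t))
        else:
            u.append(-e2 * (math.exp(-b * (t - te)) - math.exp(-b * (t0 - te))))
    return u

def arx_synthesis(u, a, b0, r=None):
    """s[n] = -sum_k a[k] s[n-k] + b0 u[n] + r[n], with a = [a_1, ..., a_p] held constant."""
    s = []
    for n, un in enumerate(u):
        acc = b0 * un + (r[n] if r else 0.0)
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * s[n - k]
        s.append(acc)
    return s

fs, t0, te = 16000, 0.01, 0.006            # hypothetical 100 Hz pitch period
u = lf_pulse(int(t0 * fs), fs, t0, te, alpha=200.0, w=math.pi / 0.0045, b=2000.0)
s = arx_synthesis(u, a=[-0.9], b0=1.0)     # one-pole stand-in for the vocal tract filter
```

The residual term `r` is left empty here; on the slide it is the part that is modelled separately.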

  • Audio examples from ARX-LF

    Original

    Time scale by 2.0

    Pitch scale by 0.7

    Pitch scale by 1.4

  • Voice Morphing

    From two speakers (source speakers) who utter the same sentence, create a new speaker who utters the same sentence as the source speakers and with voice characteristics taken from the source speakers.

    Dynamic Time Warping (DTW) between the two sentences

    Linear interpolation between corresponding frames

    Synthesis
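The first two steps can be sketched as below, assuming each sentence is already represented as a list of feature frames. The DTW uses the basic step pattern (diagonal, vertical, horizontal) with squared Euclidean distance; the function names and toy frames are illustrative, and the synthesis step is not covered.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two feature frames."""
    return sum((p - q) ** 2 for p, q in zip(x, y))

def dtw_path(a, b, dist=sq_dist):
    """Classic DTW with steps (1,1), (1,0), (0,1); returns the aligned index pairs."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = dist(a[i - 1], b[j - 1]) + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end of both sequences to the start.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    return path[::-1]

def morph(frames_a, frames_b, weight=0.5):
    """Linear interpolation of aligned frames; weight 0 gives speaker A, 1 gives speaker B."""
    return [
        [(1 - weight) * p + weight * q for p, q in zip(frames_a[i], frames_b[j])]
        for i, j in dtw_path(frames_a, frames_b)
    ]

frames_a = [[0.0], [1.0], [2.0]]
frames_b = [[0.0], [0.0], [1.0], [2.0]]
path = dtw_path(frames_a, frames_b)
morphed = morph(frames_a, frames_b)
```

Varying `weight` between 0 and 1 moves the morphed frames from one speaker towards the other.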

  • Voice Conversion

    Definition: Voice conversion aims at transforming the characteristics of the speech signal uttered by a speaker (Source Speaker), in such a way that a human listener could believe that the transformed speech is produced by another specific speaker (Target Speaker).

    Control of the source and filter characteristics

  • Overview of techniques

    Abe et al. (1988) [5]: VQ mapping

    Valbret et al. (1992) [6]: Linear Multivariate Regression (LMR), Dynamic Frequency Warping (DFW)

    Iwahashi et al. (1994) [7]: Speaker Interpolation

    Kuwabara et al. (1995) [8]: Fuzzy VQ

    Stylianou et al. (1995) [9]: Probabilistic approach (GMM)

    Kain et al. (1998) [10]: Probabilistic approach (GMM)

    Toda et al. (2001) [11]: Probabilistic approach (GMM) and DFW

    Toda et al. (2005) [12]: Probabilistic approach (GMM)

    Turk et al. (2005) [13]: Correction filters

    Mouchtaris et al. (2006) [14]: Probabilistic approach (GMM) and speaker adaptation

  • Main steps for voice conversion

    Source (prosody) modifications

    Filter modification

      1 Representation

      2 Mapping

  • Example of spectral envelopes mapping

    [Figure: (a) spectral envelopes in dB versus frequency (0 to 4000 Hz). Panel title: "Fulltype: Dist. to src: 2.6 dB, Dist. to tar: 15 dB".]

  • Main steps for learning filter mappings

    Alignment

      Explicitly for parallel data (DTW, HMM)

      Implicitly for non-parallel data (through speaker adaptation [14])

    Define mapping function (VQ, GMM)

  • Parallel data: alignment with DTW

    Distance/Correlation

    Short/long sentences

    Using anchor points

    Constraints (steps)

  • Parallel data: alignment with HMM

    Forced alignment of the source and target speakers, given the orthographic transcription of the utterance

    Sentence HMM [15]

      Template sentences from the source speaker (phonetically balanced)

      Left-to-right HMM for each sentence, built by adding a new state at a constant rate (i.e. every 40 ms)

      Forced alignment using the Viterbi algorithm (find the best sequence of states)

      Alignment using state indices.

  • Non-parallel data

    Assuming [14]:

    1 Parallel data for two speakers (Speaker 1 and Speaker 2) exist

    2 Conversion function (mapping) between these two speakers is known

    then:

    Adapt Speaker 1 to the Source speaker

    Adapt Speaker 2 to the Target speaker

    Compute Conversion function by using:

      the initial conversion function of the parallel data

      the adaptation parameters

  • Overview of a baseline GMM-based approach

    Data: Parallel data alignment with DTW

    Probabilistic classification: The acoustic space of a speaker is described by a parametric Gaussian mixture model (GMM).

    Mapping function: A mapping function associates the acoustic space of the source speaker with the acoustic space of the target speaker.

    Iterative approach: Re-alignment after conversion.

  • Probabilistic classification

    Modeling of the acoustic space of a speaker by a GMM:

    $$p(x) = \sum_{i=1}^{m} \alpha_i\, N(x; \mu_i, \Sigma_i)$$

    Classification:

    $$P(C_i \mid x) = \frac{\alpha_i\, N(x; \mu_i, \Sigma_i)}{\sum_{j=1}^{m} \alpha_j\, N(x; \mu_j, \Sigma_j)}$$

    Estimation using an Expectation-Maximization (EM) algorithm initialized by a standard binary splitting VQ procedure.
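A minimal sketch of the classification and estimation steps follows, assuming scalar features for brevity. The E-step computes the class posteriors by Bayes' rule and the M-step re-estimates weights, means and variances. The starting parameters below are a hand-picked stand-in for the binary-splitting VQ initialisation, and the variance floor is an added safeguard, not something stated on the slide.

```python
import math

def gaussian(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posteriors(x, weights, means, variances):
    """P(C_i | x) for each mixture component (Bayes' rule)."""
    joint = [w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

def em_step(data, weights, means, variances, var_floor=1e-6):
    """One EM iteration for a 1-D GMM: E-step posteriors, then M-step re-estimation."""
    resp = [posteriors(x, weights, means, variances) for x in data]
    new_w, new_m, new_v = [], [], []
    for i in range(len(weights)):
        n_i = sum(r[i] for r in resp)
        mean_i = sum(r[i] * x for r, x in zip(resp, data)) / n_i
        var_i = sum(r[i] * (x - mean_i) ** 2 for r, x in zip(resp, data)) / n_i
        new_w.append(n_i / len(data))
        new_m.append(mean_i)
        new_v.append(max(var_i, var_floor))
    return new_w, new_m, new_v

data = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.1]         # toy observations (assumed)
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])   # stand-in for the VQ initialisation
for _ in range(20):
    params = em_step(data, *params)
weights, means, variances = params
```

With real spectral features the Gaussians are multivariate and the variances become covariance matrices, but the structure of the two steps is the same.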

  • Mapping function

    Mapping function [16]:

    $$F(x_t) = \sum_{i=1}^{m} P(C_i \mid x_t) \left[ \nu_i + \Gamma_i \Sigma_i^{-1} (x_t - \mu_i) \right]$$

    Motivation:

    $$E[y \mid x = x_t] = \nu + \Gamma \Sigma^{-1} (x_t - \mu)$$

    Estimation of mapping function:

    $$\epsilon = \sum_{t=1}^{n} \lVert y_t - F(x_t) \rVert^2$$
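The conversion function and its training criterion can be sketched as below, assuming scalar features so that every matrix reduces to a number. The names `nu` and `gamma` stand for the per-component offset and cross term of the mapping; the least-squares solution for them is not shown, only the evaluation of the function and of the squared error it would minimise. All values are hypothetical.

```python
import math

def gaussian(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def convert(x, weights, mu, sigma, nu, gamma):
    """F(x) = sum_i P(C_i | x) * (nu_i + gamma_i / sigma_i * (x - mu_i)), scalar case."""
    joint = [w * gaussian(x, m, s) for w, m, s in zip(weights, mu, sigma)]
    total = sum(joint)
    return sum(
        (j / total) * (n + g / s * (x - m))
        for j, m, s, n, g in zip(joint, mu, sigma, nu, gamma)
    )

def squared_error(xs, ys, *params):
    """Training criterion: sum over aligned frames of |y_t - F(x_t)|^2."""
    return sum((y - convert(x, *params)) ** 2 for x, y in zip(xs, ys))

# Hypothetical two-component model of the source space and mapping parameters.
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0], [-2.0, 2.0], [0.5, 0.5])
xs = [-1.2, -0.8, 0.9, 1.1]                 # aligned source frames (toy values)
ys = [convert(x, *params) for x in xs]      # targets generated by the same mapping
error = squared_error(xs, ys, *params)
```

Each component contributes a local linear regression, and the posteriors blend these regressions smoothly across the acoustic space.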

  • Conversion using HNM

    Conversion of the harmonic part:

    [Figure: training block diagram. Source data and target data each go through HNM analysis (asynchronous mode) to give the spectral envelopes (x) and (y); DTW provides the alignment path; EM estimates the GMM; LS optimization yields the conversion function.]

    Conversion of the noise part: use of two separate time-invariant 6th-order all-pole corrective filters; one for voiced frames (upper band) and one for unvoiced frames (full band).

  • The voice conversion system

    [Figure: block diagram of the conversion system. Blocks: speech signal, HNM analysis (synchronous mode), envelope transformation using the conversion function for the spectral envelope (voiced part), corrective filters (noise part), mapping of time instants according to the prosodic specifications, HNM synthesis (synchronous mode), converted speech.]

    Corpus and test conditions for the formal listening test

    Conversion between two French male speakers. Data were provided by FT (5 minutes per speaker).

    Sampling frequency: 16 kHz

    Frame size for the asynchronous HNM analysis: 10 ms

    Cepstrum order: 20

    The maximum voiced frequency was fixed at a constant value of 4 kHz.

    Twenty adult listeners, familiar with listening tests of speech coding but unfamiliar with the voice conversion task.

    The prosody of the source speaker was altered to match the prosody of the target speaker as closely as possible.

    Results - XAB test

    Task: Listeners were asked to select either A or B as being most similar to X.

                       PO     16 GMM   64 GMM   64 GMM(2)
    Correct answers    18%    83%      88%      97%

    Results - Opinion test

    Task: rate the similarity of each pair of speakers (0: the same speaker; 9: very different speakers).

    [Figure: opinion scores for the conditions TT, SS, M2, M1, PT, and ST; vertical axis: Score (0 to 9).]

    Audio examples of Voice Conversion: HNM + GMM

    Source Converted Target

    Source Converted Target

    Discussion

    Quality

    Alignment

    Source processing

    Filter processing

    Interaction between source and filter

    Voice conversion / Emotions

    Before that ...

    Jointly model Source and Target

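    Kain and Macon [10] suggest jointly modeling the source and the target with a GMM:

    $$F(x_t) = \sum_{i=1}^{m} P(C_i \mid x_t)\left[\mu_i^{y} + \Sigma_i^{yx}\left(\Sigma_i^{xx}\right)^{-1}(x_t - \mu_i^{x})\right]$$

    where

    $$P(C_i \mid x) = \frac{\alpha_i\, N(x;\, \mu_i^{x}, \Sigma_i^{xx})}{\sum_{j=1}^{m} \alpha_j\, N(x;\, \mu_j^{x}, \Sigma_j^{xx})}$$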
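To illustrate where the blocks in the joint-model mapping come from, the sketch below stacks aligned source and target vectors into z = [x; y], estimates one joint mean and covariance, and reads off the source mean, target mean, source covariance and cross-covariance. It is deliberately restricted to a single Gaussian component; a full system would fit an m-component GMM on z with EM. The helper names `split_joint` and `conditional_mean`, and the toy affine data, are assumptions made for this example only.

```python
import numpy as np

def split_joint(mean_z, cov_z, dx):
    """Partition joint statistics of z = [x; y] into the blocks used by F."""
    mu_x, mu_y = mean_z[:dx], mean_z[dx:]
    cov_xx = cov_z[:dx, :dx]   # source covariance block
    cov_yx = cov_z[dx:, :dx]   # target/source cross-covariance block
    return mu_x, mu_y, cov_xx, cov_yx

def conditional_mean(x, mu_x, mu_y, cov_xx, cov_yx):
    """mu_y + Sigma_yx Sigma_xx^{-1} (x - mu_x) for one component."""
    return mu_y + cov_yx @ np.linalg.solve(cov_xx, x - mu_x)

# Toy aligned data (an assumption): y is an exact affine function of x.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
A = np.array([[2.0, 0.0], [1.0, -1.0]])
b = np.array([0.5, -0.5])
Y = X @ A.T + b

# Joint statistics of the stacked vectors z_t = [x_t; y_t].
Z = np.hstack([X, Y])
mean_z = Z.mean(axis=0)
cov_z = np.cov(Z, rowvar=False)

blocks = split_joint(mean_z, cov_z, dx=2)
x_new = np.array([0.3, -1.2])
y_hat = conditional_mean(x_new, *blocks)
```

Because the toy target is an exact affine function of the source, `y_hat` matches `A @ x_new + b` up to rounding, which shows that the regression term is obtained directly from the joint covariance rather than from a separate least-squares step.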

    Joint GMM and DFW

    Toda et al. [11] combined the previous joint GMM approach with dynamic frequency warping (DFW) to avoid over-smoothing of the converted spectral envelope:

    $$|S_c(f)| = \exp\left[\ln|S_d(f)| + w\left(\ln|S_g(f)| - \ln|S_d(f)|\right)\right]$$

    where $S_d(f)$ and $S_g(f)$ denote the spectrum after DFW and after conversion, respectively. The weight $w$ varies between 0 and 1.
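The weighting rule is a linear interpolation in the log-magnitude domain, that is, a weighted geometric mean of the two magnitude spectra. A small sketch, assuming both spectra are given as strictly positive magnitude arrays on the same frequency grid and a single scalar weight (`blend_spectra` is a hypothetical name):

```python
import numpy as np

def blend_spectra(mag_dfw, mag_gmm, w):
    """Interpolate two magnitude spectra in the log domain.

    mag_dfw: magnitudes after dynamic frequency warping, |S_d(f)|
    mag_gmm: magnitudes after GMM-based conversion, |S_g(f)|
    w:       weight in [0, 1]; 0 keeps the DFW spectrum, 1 the GMM one
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie between 0 and 1")
    log_d = np.log(mag_dfw)
    log_g = np.log(mag_gmm)
    return np.exp(log_d + w * (log_g - log_d))
```

With `w = 0.5`, magnitudes 1 and 4 blend to 2, their geometric mean, so spectral detail preserved by the warped spectrum is not averaged away as it would be by an arithmetic mean.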

    Joint GMM and use of Global Variance

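    Toda et al. [12] suggested:

    combining the joint GMM with the global variance of the converted spectra in each utterance to cope with over-smoothing

    using delta features to alleviate spectral discontinuities

    $$F(x_t) = \left(W^{T} D_m^{-1} W\right)^{-1} W^{T} D_m^{-1} E_m$$

    where

    $$E_m = \left[E_1(m_{i_1}),\, E_2(m_{i_2}),\, \ldots,\, E_N(m_{i_N})\right]$$

    $$D_m^{-1} = \operatorname{diag}\left[D_{m_{i_1}}^{-1},\, D_{m_{i_2}}^{-1},\, \ldots,\, D_{m_{i_N}}^{-1}\right]$$

    $$E_n(m_i) = \mu_i^{y} + \Sigma_i^{yx}\left(\Sigma_i^{xx}\right)^{-1}(x_t - \mu_i^{x})$$

    $$D_{m_i} = \Sigma_i^{yy} - \Sigma_i^{yx}\left(\Sigma_i^{xx}\right)^{-1}\Sigma_i^{xy}$$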
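The closed-form expression is a weighted least-squares solve: W maps a static trajectory to stacked static and delta features, E holds the per-frame conditional means, and the inverse of D supplies the weights. The sketch below covers only this delta-feature part and does not include the global variance term. It assumes a one-dimensional feature stream, a delta window of 0.5 * (y[t+1] - y[t-1]) that is truncated at the edges, and diagonal precisions; these choices and the names `build_window_matrix` and `ml_trajectory` are illustrative assumptions.

```python
import numpy as np

def build_window_matrix(T):
    """Build W (2T x T): row 2t is the static value, row 2t+1 the delta."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                  # static feature of frame t
        if t > 0:
            W[2 * t + 1, t - 1] = -0.5     # delta uses the previous frame
        if t < T - 1:
            W[2 * t + 1, t + 1] = 0.5      # and the next frame
    return W

def ml_trajectory(E, precisions, W):
    """Solve (W^T D^{-1} W) y = W^T D^{-1} E for the static trajectory y.

    E:          stacked target means [static_0, delta_0, static_1, ...]
    precisions: diagonal of D^{-1}, one positive value per entry of E
    """
    WtP = W.T * precisions   # same as W.T @ np.diag(precisions)
    return np.linalg.solve(WtP @ W, WtP @ E)
```

When E is exactly consistent with some trajectory (E = W y), the solve returns that trajectory for any positive precisions. With real, inconsistent static and delta means, the delta rows pull neighbouring frames together, which is the mechanism that reduces frame-to-frame discontinuities.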

    Another GMM based ...

    Mesbahi et al. [17] suggest a modified mapping function that tries to overcome over-smoothing effects:

    $$F(x_t) = \sum_{i=1}^{m} P(C_i \mid x_t)\left[\mu_i^{y} + \Gamma\,(x_t - \mu_i^{x})\right]$$

    where $\Gamma$ is constrained to be diagonal, prohibiting cross-correlation between the coordinates of the acoustic vectors.
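A diagonal constraint means that each output coordinate depends only on the matching input coordinate. The sketch below applies such a mapping with the diagonal stored as a vector, assuming the posteriors have already been computed; `diagonal_map` and `gamma_diag` are hypothetical names, and how the diagonal entries are estimated is not covered here.

```python
import numpy as np

def diagonal_map(x, post, mu_x, mu_y, gamma_diag):
    """F(x) = sum_i P(C_i|x) [mu_y_i + diag(gamma_diag) (x - mu_x_i)].

    post:       posterior probability of each component for this frame
    mu_x, mu_y: per-component source and target mean vectors
    gamma_diag: diagonal of the regression matrix, stored as a vector
    """
    out = np.zeros(len(x), dtype=float)
    for p, mx, my in zip(post, mu_x, mu_y):
        # Element-wise product: no mixing between coordinates.
        out += p * (my + gamma_diag * (x - mx))
    return out
```

Changing only the first coordinate of `x` while keeping the posteriors fixed changes only the first coordinate of the output, which is exactly the absence of cross-coordinate terms that the constraint enforces.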

    THANK YOU

    Time for questions ...

    References

    [1] Y. Stylianou, J. Laroche, and E. Moulines, "High-Quality Speech Modification based on a Harmonic + Noise Model," Proc. EUROSPEECH, 1995.

    [2] H. Kawahara, "Speech representation and transformation using adaptive interpolation of weighted spectrum: vocoder revisited," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Munich, Germany), pp. 1303-1306, 1997.

    [3] D. Vincent, O. Rosec, and T. Chonavel, "Estimation of LF glottal source parameters based on ARX model," in Proc. Interspeech, (Lisbon, Portugal), pp. 333-336, 2005.

    [4] D. Vincent, O. Rosec, and T. Chonavel, "A new method for speech synthesis and transformation based on an ARX-LF source-filter decomposition and HNM modeling," ICASSP, 2007.

    [5] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," in Proc. ICASSP88, pp. 655-658, 1988.

    [6] H. Valbret, E. Moulines, and J. Tubach, "Voice transformation using PSOLA techniques," Speech Communication, vol. 11, no. 2-3, pp. 175-187, 1992.

    [7] N. Iwahashi and Y. Sagisaka, "Speech spectrum transformation based on speaker interpolation," in Proc. ICASSP94, 1994.

    [8] H. Kuwabara and Y. Sagisaka, "Acoustic characteristics of speaker individuality: Control and conversion," Speech Communication, vol. 16, no. 2, pp. 165-173, 1995.

    [9] Y. Stylianou, O. Cappe, and E. Moulines, "Statistical methods for voice quality transformation," Proc. EUROSPEECH, 1995.

    [10] A. Kain and M. Macon, "Spectral voice conversion for text-to-speech synthesis," in Proc. ICASSP98, pp. 285-288, 1998.

    [11] T. Toda, H. Saruwatari, and K. Shikano, "Voice Conversion Algorithm based on Gaussian Mixture Model with Dynamic Frequency Warping of STRAIGHT spectrum," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Salt Lake City, USA), pp. 841-844, 2001.

    [12] T. Toda, A. Black, and K. Tokuda, "Spectral Conversion Based on Maximum Likelihood Estimation considering Global Variance of Converted Parameter," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Philadelphia, USA), pp. 9-12, 2005.

    [13] O. Turk and L. M. Arslan, "Robust processing techniques for voice conversion," Computer Speech and Language, vol. 20, pp. 441-467, 2006.

    [14] A. Mouchtaris, J. Van der Spiegel, and P. Mueller, "Non parallel training for voice conversion based on a parameter adaptation," IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 952-963, 2006.

    [15] L. Arslan and D. Talkin, "Speaker transformation algorithm using segmental codebooks," Speech Communication, vol. 28, pp. 211-226, 1999.

    [16] Y. Stylianou, O. Cappe, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech and Audio Processing, vol. 6, no. 2, pp. 131-142, 1998.

    [17] L. Mesbahi, V. Barreaud, and O. Boeffard, "GMM-based Speech Transformation Systems under Data Reduction," 6th ISCA Workshop on Speech Synthesis, pp. 119-124, August 22-24, 2007.
