Introduction to Voice Conversion. Hsin-Te Hwang, [email protected]. Department of Communication Engineering, Chiao Tung University, Hsinchu.


Page 1: Introduction to Voice Conversion

Introduction to Voice Conversion

Hsin-Te Hwang, [email protected]

Department of Communication Engineering, Chiao Tung University, Hsinchu


Page 2: Introduction to Voice Conversion

Outline
- Introduction
- VC baseline (GMM-based VC)
- Problems
- Summary
- References

Page 3: Introduction to Voice Conversion

What is voice conversion (VC)?

Definition: to modify the speech signal of one speaker (the source) so that it sounds like another speaker (the target).

More generalized definition: to modify (transform) the characteristics of the speech signal, e.g., emotional voice conversion [1,2].

Page 4: Introduction to Voice Conversion

Application of VC

In TTS:
- Building a new voice with a current state-of-the-art TTS system, such as corpus-based TTS, is hard.
- The same problem arises in building an emotional TTS [1,2].
- Using VC, one can take a recorded database and convert it to a target voice using as few as 10-20 sentences [3].

Others:
- Converting narrow-band speech to wide-band speech for telecommunication [4].
- Modeling of speech production [5].

Page 5: Introduction to Voice Conversion

Conversion?
- Spectrum: convert the spectrum only; prosody remains unchanged or is converted in a simple way.
- Prosody: convert prosody only.
- Spectrum + prosody: convert both.

Page 6: Introduction to Voice Conversion

Overview of Techniques
- Abe et al. (1988) [6]: VQ mapping
- Valbret et al. (1992) [7]: Linear Multivariate Regression (LMR), Dynamic Frequency Warping (DFW)
- Kuwabara et al. (1995) [8]: Fuzzy VQ mapping
- Narendranath et al. (1995) [9]: ANN based
- Stylianou et al. (1995) [10]: GMM based
- Kain et al. (1998) [11]: GMM based
- Toda et al. (2001) [12]: GMM and DFW
- Toda et al. (2005) [13]: GMM considering Global Variance
- Mouchtaris et al. (2006) [14]: GMM and speaker adaptation

Page 7: Introduction to Voice Conversion

Outline
- Introduction
- VC baseline (GMM-based VC)
- Problems
- Summary
- References

Page 8: Introduction to Voice Conversion

The block diagram for building a VC system

[Figure: block diagram of a voice conversion system. Training phase: source and target speech from the training corpus pass through feature extraction; the source and target features are aligned, and training produces the conversion function. Synthesis phase: source speech passes through feature extraction, the conversion function, and a synthesizer to produce the synthesized speech.]

Page 9: Introduction to Voice Conversion

Review of GMM-based VC
- Start from Minimum Mean-Square Error (MMSE) estimation
- Time alignment
- Derive the transfer function of GMM-based VC

Page 10: Introduction to Voice Conversion

Mean-Square Estimation (1/4)

If we use a constant c to estimate the random variable y, the MS estimate (i.e., the estimate that minimizes the mean-square error) is derived as follows:

\[ e = E\{(c - y)^2\} = \int (c - y)^2 f(y)\,dy \]

\[ \frac{de}{dc} = 2 \int (c - y) f(y)\,dy = 0 \quad\Rightarrow\quad c = \int y\, f(y)\,dy = E\{y\} \]
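The derivation above can be checked numerically. This is a toy sketch: it searches a grid of candidate constants c and confirms that the empirical mean-square error is minimized next to the sample mean E{y}.

```python
import numpy as np

# Toy check: the constant c minimizing E[(c - y)^2] is the mean of y.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=10_000)

candidates = np.linspace(0.0, 4.0, 401)          # grid step 0.01
errors = [np.mean((c - y) ** 2) for c in candidates]
best_c = candidates[int(np.argmin(errors))]

# The empirical minimizer sits next to the sample mean.
print(best_c, y.mean())
```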

Page 11: Introduction to Voice Conversion

Mean-Square Estimation (2/4)

Now consider nonlinear MS estimation: estimating a random variable y from another random variable x.

\[ e = E\{(y - c(x))^2\} = \iint (y - c(x))^2 f(x, y)\,dx\,dy = \int f(x) \left[ \int (y - c(x))^2 f(y \mid x)\,dy \right] dx \]

Since f(x) is positive and the inner integral is positive, if for every given x the choice of c(x) makes the inner integral minimal, then e is minimal (i.e., instead of minimizing the whole integral over f(x) dx at once, it suffices to minimize the inner integral for each x).

Page 12: Introduction to Voice Conversion

Mean-Square Estimation (3/4)

To minimize the inner integral for each given x, note that c(x) is deterministic (a constant) once x is given; from the previous case, the minimizer is \( c(x) = E\{y \mid x\} \). Letting x vary, the estimator remains \( c(x) = E\{y \mid x\} \).

If the random variables y and x are independent, then \( E\{y \mid x\} = E\{y\} = \text{constant} \).

Page 13: Introduction to Voice Conversion

Mean-Square Estimation (4/4)

With a single-mixture Gaussian, assume x and y are jointly Gaussian (the source follows a Gaussian distribution). By MMSE, the conversion function is

\[ \hat{y}_t = F(x_t) = E[y \mid x_t] = \nu + \Gamma (\Sigma^{xx})^{-1} (x_t - \mu^x), \]

where \( \nu = \mu^y \) and \( \Gamma = \Sigma^{yx} \).

Page 14: Introduction to Voice Conversion

Stylianou-GMM based mapping function (1/2)

Probabilistic classification: model the acoustic space of the source speaker with a GMM:

\[ P(x) = \sum_{i=1}^{M} \alpha_i\, N(x; \mu_i, \Sigma_i) \]

Classification:

\[ P(C_i \mid x) = \frac{\alpha_i\, N(x; \mu_i, \Sigma_i)}{\sum_{j=1}^{M} \alpha_j\, N(x; \mu_j, \Sigma_j)} \]

Page 15: Introduction to Voice Conversion

Stylianou-GMM based mapping function (2/2)

Mapping function [10]:

\[ F(x_t) = \sum_{i=1}^{M} P(C_i \mid x_t)\,\bigl[\nu_i + \Gamma_i \Sigma_i^{-1} (x_t - \mu_i)\bigr] \]

Motivation (single-Gaussian case):

\[ E[y \mid x] = \nu + \Gamma \Sigma^{-1} (x - \mu) \]

Estimation of the mapping function (least squares over the training frames):

\[ \min \sum_{t=1}^{n} \lVert y_t - F(x_t) \rVert^2 \]
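A minimal sketch of this mapping, assuming diagonal covariances and toy (untrained) values for the parameters alpha_i, mu_i, Sigma_i, nu_i, and Gamma_i; a real system estimates all of these from data.

```python
import numpy as np

# Stylianou-style mapping (toy parameters, diagonal covariances):
# posterior P(C_i|x) over M source-space Gaussians, then
# F(x) = sum_i P(C_i|x) [nu_i + Gamma_i Sigma_i^{-1} (x - mu_i)].
rng = np.random.default_rng(0)
M, D = 3, 2                      # mixtures, feature dimension
alpha = np.full(M, 1.0 / M)      # mixture weights
mu = rng.normal(size=(M, D))     # source means
var = np.full((M, D), 0.5)       # diagonal source covariances
nu = rng.normal(size=(M, D))     # target-side bias terms (trained in practice)
Gamma = np.ones((M, D))          # diagonal Gamma_i (trained in practice)

def posteriors(x):
    # log N(x; mu_i, Sigma_i) for diagonal Sigma_i, then normalize
    logp = -0.5 * np.sum((x - mu) ** 2 / var + np.log(2 * np.pi * var), axis=1)
    w = alpha * np.exp(logp)
    return w / w.sum()

def convert(x):
    p = posteriors(x)
    terms = nu + Gamma * (x - mu) / var     # nu_i + Gamma_i Sigma_i^{-1}(x - mu_i)
    return p @ terms                        # posterior-weighted sum

x = np.array([0.3, -0.1])
y_hat = convert(x)
print(posteriors(x).sum(), y_hat.shape)
```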

Page 16: Introduction to Voice Conversion

Parallel data time alignment using DTW (1/2)

Source features: \( X = [x_1, \ldots, x_M] \); target features: \( Y = [y_1, \ldots, y_N] \). Align X and Y with DTW.

Page 17: Introduction to Voice Conversion

Parallel data time alignment using DTW (2/2)

Time alignment: after DTW, the new training vectors pair each source frame with its aligned target frame, e.g.:

\[ Z = \begin{bmatrix} x_1 & x_2 & x_3 & x_4 & x_5 & x_6 & x_7 & x_8 \\ y_1 & y_1 & y_2 & y_2 & y_3 & y_4 & y_5 & y_8 \end{bmatrix} \]
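The alignment step can be sketched with a toy DTW over 1-D frames; `dtw_path` here is a hypothetical helper, and a real system would align spectral feature vectors rather than scalars, forming the paired training vectors z_t from the returned path.

```python
import numpy as np

# Toy DTW alignment of a source and a target feature sequence (1-D frames).
def dtw_path(x, y):
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])                 # local frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    # backtrack the optimal warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.1, 2.0, 3.0, 4.0])
pairs = dtw_path(x, y)
print(pairs)
```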

Page 18: Introduction to Voice Conversion

Kain-GMM based mapping function

Use a GMM to model the joint pdf of X and Y [11]. After alignment, \( z_t = [x_t^{\top}, y_t^{\top}]^{\top},\ t = 1, \ldots, N \); model z with a GMM:

\[ P(z_t) = \sum_{i=1}^{M} w_i\, P(x_t, y_t \mid i) = \sum_{i=1}^{M} w_i\, N(z_t; \mu_i^z, \Sigma_i^z), \]

where

\[ \mu_i^z = \begin{bmatrix} \mu_i^x \\ \mu_i^y \end{bmatrix} \quad \text{and} \quad \Sigma_i^z = \begin{bmatrix} \Sigma_i^{xx} & \Sigma_i^{xy} \\ \Sigma_i^{yx} & \Sigma_i^{yy} \end{bmatrix}. \]

From MMSE, \( F(x_t) = E[y_t \mid x_t] \):

\[ F(x_t) = \sum_{i=1}^{M} P(C_i \mid x_t)\,\bigl[\mu_i^y + \Sigma_i^{yx} (\Sigma_i^{xx})^{-1} (x_t - \mu_i^x)\bigr] \quad \text{(Kain)}, \]

where

\[ P(C_i \mid x_t) = \frac{w_i\, N(x_t; \mu_i^x, \Sigma_i^{xx})}{\sum_{j=1}^{M} w_j\, N(x_t; \mu_j^x, \Sigma_j^{xx})}. \]
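A sketch of this conversion function for the single-mixture case (M = 1), where the joint mean and covariance of z = [x, y] are estimated from toy synthetic frames; with more mixtures each term would be weighted by P(C_i | x_t). The data and the matrix A are invented for illustration.

```python
import numpy as np

# Kain-style joint-density conversion with one Gaussian on z = [x, y].
rng = np.random.default_rng(1)
x = rng.normal(size=(500, 2))
A = np.array([[1.5, 0.2], [0.0, 0.8]])
y = x @ A.T + 0.05 * rng.normal(size=(500, 2))   # synthetic "target" frames

z = np.hstack([x, y])                            # joint vectors z_t
mu_z = z.mean(axis=0)
Sigma_z = np.cov(z, rowvar=False)
mu_x, mu_y = mu_z[:2], mu_z[2:]
Sxx, Sxy = Sigma_z[:2, :2], Sigma_z[:2, 2:]
Syx = Sxy.T

def convert(xt):
    # F(x) = E[y | x] = mu_y + Syx Sxx^{-1} (x - mu_x)
    return mu_y + Syx @ np.linalg.solve(Sxx, xt - mu_x)

xt = np.array([1.0, -0.5])
print(convert(xt), xt @ A.T)    # estimate recovers the near-linear relation
```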

Page 19: Introduction to Voice Conversion

Stylianou based vs. Kain based VC

The Kain [11] method makes no assumptions about the target distributions: clustering takes place on the source and the target vectors jointly.

In theory, modeling the joint density rather than the source density alone should lead to a more judicious allocation of mixtures for the regression problem.

The Kain method is computationally more expensive during the EM step than Stylianou [10].

Page 20: Introduction to Voice Conversion

Outline
- Introduction
- VC baseline (GMM-based VC)
- Problems
- Summary
- References

Page 21: Introduction to Voice Conversion

Problems
- Making the training more flexible (non-parallel training)
- Improving the quality and similarity of the converted speech
- Prosody conversion
- Other issues

Page 22: Introduction to Voice Conversion

Problems of parallel training for VC

To derive the conversion function, a speech corpus is needed that contains the same utterances from both the source and target speakers. Such a corpus is called a parallel corpus.

The disadvantage of this method is that such a corpus is difficult or even impossible to collect:
- Cross-lingual voice conversion.
- Most of the available databases are nonparallel.

Page 23: Introduction to Voice Conversion

Nonparallel training for VC
- Mouchtaris et al. (2004, 2006) [14,15]: GMM and speaker adaptation
- D. Sündermann et al. (2003) [16]: VTLN based
- H. Ye et al. (2004) [17]: VC for an unknown speaker
- M. Mashimo et al. (2001) [18]: cross-language VC

Page 24: Introduction to Voice Conversion

Nonparallel Training for Voice Conversion by ML Constrained Adaptation (1/2)

Mouchtaris et al. (2004, 2006) [14,15]. Assuming that:
1. Parallel data for two speakers, S1 and S2, exist.
2. The conversion function between these two speakers is known.

Then:
- Adapt S1 to the source speaker.
- Adapt S2 to the target speaker.
- Compute the conversion function using the initial conversion function of the parallel data and the adaptation parameters.

Page 25: Introduction to Voice Conversion


Nonparallel Training for Voice Conversion by ML Constrained Adaptation (2/2)

Block diagram of nonparallel VC [14,15]

Page 26: Introduction to Voice Conversion

Quality improvement

Two major problems of GMM-based VC:
- The time-independent assumption
- Over-smoothing


Page 27: Introduction to Voice Conversion

Time-independent assumption (1/2)

The GMM-based mapping function operates on a frame-by-frame basis (a time-independent approach). The correlation of the target feature vectors between frames is ignored in the conventional mapping.

From MMSE, \( F(x_t) = E[y_t \mid x_t] \):

\[ F(x_t) = \sum_{i=1}^{M} P(C_i \mid x_t)\,\bigl[\nu_i + \Gamma_i \Sigma_i^{-1} (x_t - \mu_i)\bigr] \quad \text{(Stylianou)} \]

\[ F(x_t) = \sum_{i=1}^{M} P(C_i \mid x_t)\,\bigl[\mu_i^y + \Sigma_i^{yx} (\Sigma_i^{xx})^{-1} (x_t - \mu_i^x)\bigr] \quad \text{(Kain)} \]

Page 28: Introduction to Voice Conversion

Time-independent assumption (2/2)

Example of converted and natural target parameter trajectories. [24]

Page 29: Introduction to Voice Conversion

Solution for the time-independent assumption (1/3)

Duxans et al. [23] (HMM-based voice conversion): HMMs are well-known models that can capture the dynamics of the training data using states; they model the dynamics of vector sequences with transition probabilities between states.

HMM based VC system block diagram [23]

Page 30: Introduction to Voice Conversion

Solution for the time-independent assumption (2/3)

Chi-Chun Hsia et al. [21] (Gaussian mixture bigram model): adopt the Gaussian mixture bigram model to characterize temporal and spectral evolution in the conversion function.

Page 31: Introduction to Voice Conversion

Solution for the time-independent assumption (3/3)

GMM (Kain) based: \( z_t = [x_t^{\top}, y_t^{\top}]^{\top} \); model z with a GMM:

\[ P(z_t) = \sum_{i=1}^{M} w_i\, P(x_t, y_t \mid i) = \sum_{i=1}^{M} w_i\, N(z_t; \mu_i^z, \Sigma_i^z) \]

From MMSE, \( F(x_t) = E[y_t \mid x_t] \):

\[ F(x_t) = \sum_{i=1}^{M} P(C_i \mid x_t)\,\bigl[\mu_i^y + \Sigma_i^{yx} (\Sigma_i^{xx})^{-1} (x_t - \mu_i^x)\bigr] \quad \text{(Kain)} \]

Gaussian mixture bigram model: \( z_t = [y_t^{\top}, y_{t-1}^{\top}, x_t^{\top}, x_{t-1}^{\top}]^{\top} \); model z with a GMM:

\[ P(z_t) = \sum_{i=1}^{M} w_i\, P(y_t, y_{t-1}, x_t, x_{t-1} \mid i) = \sum_{i=1}^{M} w_i\, N(z_t; \mu_i^z, \Sigma_i^z) \]

From MMSE,

\[ F(x_t) = E[y_t \mid y_{t-1}, x_t, x_{t-1}] \]

Page 32: Introduction to Voice Conversion

Over-smoothing problem (1/3)

\[ F(x_t) = \sum_{i=1}^{M} P(C_i \mid x_t)\,\bigl[\nu_i + \Gamma_i \Sigma_i^{-1} (x_t - \mu_i)\bigr] \quad \text{(Stylianou)} \]

\[ F(x_t) = \sum_{i=1}^{M} P(C_i \mid x_t)\,\bigl[\mu_i^y + \Sigma_i^{yx} (\Sigma_i^{xx})^{-1} (x_t - \mu_i^x)\bigr] \quad \text{(Kain)} \]

\( \Gamma_i \Sigma_i^{-1} \) for Stylianou or \( \Sigma_i^{yx} (\Sigma_i^{xx})^{-1} \) for Kain can be very small, because the correlation between the source and target speakers is weak. This leads to an over-smoothing of the converted speech: the smoothing causes errors in the spectral conversion and quality degradation of the converted speech.

Page 33: Introduction to Voice Conversion

Over-smoothing problem (2/3)

With a single-mixture Gaussian, assume x and y are jointly Gaussian and both of dimension 1. By MMSE, the conversion function is

\[ \hat{y}_t = F(x_t) = E[y \mid x_t] = \mu^y + \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} (x_t - \mu^x). \]

If \( \mathrm{Cov}(x, y) \) is small, then \( \hat{y}_t \approx E[y] \) and/or \( \mathrm{Var}(\hat{y}_t) \) is much smaller than \( \mathrm{Var}(y) \).
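This shrinkage can be illustrated numerically with toy, weakly correlated data: the converted trajectory collapses toward the target mean, and its variance falls far below the natural target variance.

```python
import numpy as np

# Scalar over-smoothing demo:
# y_hat = mu_y + (Cov(x,y)/Var(x)) (x - mu_x). When Cov(x,y) is weak,
# the slope shrinks toward 0 and Var(y_hat) << Var(y).
rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 0.1 * x + rng.normal(size=n)       # weakly correlated "target"

slope = np.cov(x, y)[0, 1] / np.var(x)
y_hat = y.mean() + slope * (x - x.mean())

print(np.var(y_hat), np.var(y))        # converted variance << natural variance
```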

Page 34: Introduction to Voice Conversion

Over-smooth problem (3/3)

Example of converted and natural target spectra. [24]

Page 35: Introduction to Voice Conversion

Solutions for the over-smoothing problem (1/2)

Meshabi et al. [28] suggest a modified mapping function to overcome the over-smoothing effect:

\[ F(x_t) = \sum_{i=1}^{M} P(C_i \mid x_t)\,\bigl[\mu_i^y + D_i (x_t - \mu_i^x)\bigr], \]

where the matrix \( D_i \) multiplying \( (x_t - \mu_i^x) \) is constrained to be diagonal, prohibiting cross-correlation between the coordinates of the acoustic vectors.

Page 36: Introduction to Voice Conversion

Solutions for the over-smoothing problem (2/2)

Toda et al. [13,29]:
- Combine the joint GMM with the global variance of the converted spectra in each utterance to cope with over-smoothing.
- Delta features have been used to alleviate spectral discontinuities.

Page 37: Introduction to Voice Conversion

CART based voice conversion (1/2)

Duxans et al. [23]: using GMMs or HMMs, only spectral information is available to identify the classes, but decision trees can also use phonetic information.

Phonetic information for each frame includes the phone, a vowel/consonant flag, the point of articulation, the manner, and voicing.

Page 38: Introduction to Voice Conversion

CART based voice conversion (2/2)

[Figure: a decision tree whose internal nodes ask phonetic questions (e.g., Q1: voiced?) and whose leaf nodes each hold a conversion function (GMM1-GMM4).]

Multiple conversion functions improve conversion performance: GMM based vs. HMM based vs. CART based.
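One way to picture the tree is as routing code. This is a hypothetical sketch: each frame carries phonetic information, a question routes it to a leaf, and each leaf applies its own conversion function (linear toy functions here, standing in for the per-leaf GMMs).

```python
# CART-style conversion: a phonetic question routes each frame to a leaf,
# and each leaf holds its own conversion function (toy linear stand-ins).
def leaf_voiced(x):      # conversion function for voiced frames (toy values)
    return 1.2 * x + 0.5

def leaf_unvoiced(x):    # conversion function for unvoiced frames (toy values)
    return 0.8 * x - 0.1

def convert(frame):
    # Q1: is the frame voiced? (phonetic information attached to each frame)
    if frame["voiced"]:
        return leaf_voiced(frame["x"])
    return leaf_unvoiced(frame["x"])

frames = [{"x": 1.0, "voiced": True}, {"x": 1.0, "voiced": False}]
print([convert(f) for f in frames])
```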

Page 39: Introduction to Voice Conversion

Prosody conversion
- Chi-Chun Hsia, Chung-Hsien Wu (2007) [21]: "A Study on Synthesis Unit Selection and Voice Conversion for Text-to-Speech Synthesis"
- Hanzlíček, Zdeněk et al. (2007) [22]: "F0 transformation within the voice conversion framework"
- Guoyu Zuo et al. (2005) [19]: "Mandarin Voice Conversion Using Tone Codebook Mapping"
- E. E. Helander et al. (2007) [20]: "A Novel Method for Prosody Prediction in Voice Conversion"

Page 40: Introduction to Voice Conversion

Other issues
- Subjective and objective evaluation
- Cross-lingual voice conversion [25]
- Time alignment
- A novel VC framework [26]
- Residual prediction [27]

A normalized mean-square error for objective evaluation:

\[ \text{norm-mse} = \frac{\frac{1}{N} \sum_{n=1}^{N} \lVert y_n - F(x_n) \rVert^2}{\frac{1}{N} \sum_{n=1}^{N} \lVert y_n - x_n \rVert^2} \]
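This measure can be sketched directly on toy feature matrices; the normalization by the raw source-target distance used here is an assumption about the intended formula.

```python
import numpy as np

# Normalized MSE sketch: conversion error relative to the raw
# source-target distance (toy data, assumed normalization).
def norm_mse(Y, Y_conv, X):
    err = np.mean(np.sum((Y - Y_conv) ** 2, axis=1))   # ||y_n - F(x_n)||^2
    base = np.mean(np.sum((Y - X) ** 2, axis=1))       # ||y_n - x_n||^2
    return err / base

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X + 1.0                    # target offset from source by 1 per dimension
Y_conv = X + 0.9               # conversion closes most of the gap
print(norm_mse(Y, Y_conv, X))
```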

Page 41: Introduction to Voice Conversion

Summary

To increase the usefulness of a voice conversion system, practical aspects should be considered:
- A flexible training framework
- Quality and similarity
- Objective evaluation

Page 42: Introduction to Voice Conversion

References (1/5)

[1] Chung-Hsien Wu, Chi-Chun Hsia, Te-Hsien Liu, and Jhing-Fa Wang, "Voice Conversion Using Duration-Embedded Bi-HMMs for Expressive Speech Synthesis," IEEE Trans. Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1109-1116, July 2006.

[2] Chi-Chun Hsia, Chung-Hsien Wu, Jian-Qi Wu, "Conversion Function Clustering and Selection Using Linguistic and Spectral Information for Emotional Voice Conversion," IEEE Trans. Computers (Special Issue on Emergent Systems, Algorithms and Architectures for Speech-based Human-Machine Interaction), vol. 56, no. 9, pp. 1225-1233, September 2007.

[3] http://festvox.org/transform/transform.html
[4] K. Y. Park and H. S. Kim, "Narrowband to wideband conversion of speech using GMM based transformation," in Proc. ICASSP, Istanbul, Turkey, Jun. 2000, pp. 1847-1850.

[5] K. Richmond, S. King, and P. Taylor, “Modelling the uncertainty in recovering articulation from acoustics,” Comput. Speech Lang., vol. 17, pp. 153–172, 2003.

[6] M. Abe, S. Nakamura, K. Shikano and H. Kuwabara, "Voice conversion through vector quantization," in Proc. ICASSP, New York, NY, USA, pp. 655-658, Apr. 1988.

[7] N. Iwahashi and Y. Sagisaka, "Speech spectrum transformation based on speaker interpolation," in Proc. ICASSP, 1994.

Page 43: Introduction to Voice Conversion

References (2/5)

[8] H. Kuwabara and Y. Sagisaka, "Acoustic characteristics of speaker individuality: Control and conversion," Speech Communication, vol. 19, no. 2, pp. 165-173, 1995.

[9] M. Narendranath, H. A. Murthy, S. Rajendran, and B. Yegnanarayana, “Transformation of formants for voice conversion using artificial neural networks,” Speech Commun., vol. 16, no. 2, pp. 207–216, 1995.

[10] Y. Stylianou, "Continuous probabilistic transform for voice conversion," IEEE Trans. on Speech and Audio Processing, vol. 6, no. 2, pp. 131-142, Mar. 1998.

[11] A. Kain and M. W. Macon, “Spectral Voice Conversion for Text-to-Speech Synthesis,” in Proc. of ICASSP, vol. 1, pp. 285-288, Seattle, Washington, USA, May 1998.

[12] T. Toda, H. Saruwatari, and K. Shikano, "Voice Conversion Algorithm based on Gaussian Mixture Model with Dynamic Frequency Warping of STRAIGHT spectrum," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Salt Lake City, USA, pp. 841-844, 2001.

[13] T. Toda, A. Black, and K. Tokuda, "Spectral Conversion Based on Maximum Likelihood Estimation considering Global Variance of Converted Parameter," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Philadelphia, USA, pp. 9-12, 2005.

[14] A. Mouchtaris, J. Van der Spiegel, and P. Mueller, "Non-Parallel Training for Voice Conversion Based on a Parameter Adaptation Approach," IEEE Trans. Audio, Speech and Language Processing, vol. 14, no. 3, pp. 952-963, May 2006.

Page 44: Introduction to Voice Conversion

References (3/5)

[15] A. Mouchtaris, J. Van der Spiegel, and P. Mueller, "Non-Parallel Training for Voice Conversion by Maximum Likelihood Constrained Adaptation," in Proc. of ICASSP'04, Montreal, Canada, 2004.

[16] D. Sündermann, H. Ney, and H. Höge, "VTLN-Based Cross-Language Voice Conversion," in Proc. of ASRU'03, St. Thomas, USA, 2003.

[17] H. Ye and S. J. Young, "Voice Conversion for Unknown Speakers," in Proc. of ICSLP'04, Jeju, South Korea, 2004.

[18] M. Mashimo, T. Toda, K. Shikano, and N. Campbell, "Evaluation of Cross-Language Voice Conversion Based on GMM and STRAIGHT," in Proc. of Eurospeech'01, Aalborg, Denmark, 2001.

[19] Guoyu Zuo, Yao Chen, Xiaogang Ruan, Wenju Liu, "Mandarin Voice Conversion Using Tone Codebook Mapping," ICMLC 2005, pp. 965-973.

[20] E. E. Helander and J. Nurminen, "A Novel Method for Prosody Prediction in Voice Conversion," in Proc. ICASSP 2007, vol. 4, pp. 509-512.

[22] Hanzlíček, Zdeněk and Matoušek, Jindřich, "F0 transformation within the voice conversion framework," in INTERSPEECH-2007, pp. 1961-1964.


Page 45: Introduction to Voice Conversion

References (4/5)

[21] Chi-Chun Hsia, Chung-Hsien Wu, "A Study on Synthesis Unit Selection and Voice Conversion for Text-to-Speech Synthesis," Ph.D. dissertation, Department of Computer Science and Information Engineering, NCKU, December 2007.

[23] Duxans, H., Bonafonte, A., Kain, A. and van Santen, J., “Including Dynamic and Phonetic Information in Voice Conversion Systems,” in Proc. of ICSLP 2004, pp. 5-8, Jeju Island, South Korea, 2004.

[24] T. Toda, A.W. Black, K. Tokuda, ''Voice Conversion Based on Maximum Likelihood Estimation of Spectral Parameter Trajectory,'' IEEE Transactions on Audio, Speech and Language Processing, Vol. 15, No. 8, pp. 2222-2235, Nov. 2007.

[25] D. Sündermann, H. Ney, and H. Höge, "VTLN-Based Cross-Language Voice Conversion," in Proc. of ASRU'03, Virgin Islands, USA, 2003.

[26] T. Toda, Y. Ohtani, and K. Shikano, “One-to-many and many-to-one voice conversion based on eigenvoices,” in Proc. ICASSP, Honolulu, HI, Apr. 2007, vol. 4, pp. 1249–1252.

[27] A. Kain and M. W. Macon, "Design and evaluation of a voice conversion algorithm based on spectral envelope mapping and residual prediction," in Proc. ICASSP, Salt Lake City, UT, May 2001, pp. 813-816.

Page 46: Introduction to Voice Conversion

References (5/5)

[28] L. Meshabi, V. Barreaud, and O. Boeffard, "GMM-based Speech Transformation Systems under Data Reduction," 6th ISCA Workshop on Speech Synthesis, pp. 119-124, August 22-24, 2007.

[29] T. Toda, A. W. Black, K. Tokuda, "Voice Conversion Based on Maximum Likelihood Estimation of Spectral Parameter Trajectory," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 8, pp. 2222-2235, Nov. 2007.

Page 47: Introduction to Voice Conversion

Thank you for listening!

Q&A