1
http://www.crestmuse.jp/sympo2008/ Symposium 2008 STRAIGHT, TANDEM-STRAIGHT and beyond Hideki Kawahara, Masanori Morise, Toru Takahashi, Ryuichi Nisimura Hideki Banno and Toshio Irino (Wakayama Univ., Kwansei Gakuin Univ., Kyoto Univ. and Meijo Univ.) Overview/Summary STFT power spectrum T0/2 time shift + averaging TANDEM spectrum TANDEM-STRAIGHT architecture STRAIGHT, a speech analysis, modification synthesis system, is an extension of the classical channel VOCODER that exploits the advantages of progress in information processing technologies and a new conceptualization of the role of repetitive structures in speech sounds. STRAIGHT was designed to provide representation consistent with our auditory perception which decomposes input sounds in terms of excitation (source) and resonant (filter) characteristics. This architecture and sophisticated implementation made it the most flexible and high-quality speech manipulation system to date. However, underlying algorithms were not well formulated mathematically sound manner. TANDEM, a method to extract temporally stable power spectral representation and consistent sampling theory, a new mathematical foundation of sampling made possible to reformulate STRAIGHT completely based on sound foundation. It also simplified codes and enhanced execution speed drastically. Examples/Applications Partial morphing for singing style, timbre and identity Vowel-based voice conversion TANDEM-STRAIGHT provides a unified method to extract the excitation source information with supra- and sub- harmonic structures. The baseline performance of the proposed method without any post processing is comparable to available state of the art methods. Morphing based on STRAIGHT provides means to investigate physical correlates of perceptual attributes in singing. Preliminary tests indicated perceptual dominance of voice timbre in judgement of singers' identity. This finding and a concept of speech texture mapping are basis of a voice conversion solely based on vowel information.

sample panel sympo2008 - CrestMuse

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

http://www.crestmuse.jp/sympo2008/Symposium 2008STRAIGHT, TANDEM-STRAIGHT and beyond

Hideki Kawahara, Masanori Morise, Toru Takahashi, Ryuichi NisimuraHideki Banno and Toshio Irino

(Wakayama Univ., Kwansei Gakuin Univ., Kyoto Univ. and Meijo Univ.)Overview/Summary

STFT power spectrum

T0/2 time shift + averaging

TANDEM spectrum

TANDEM-STRAIGHT architecture

STRAIGHT, a speech analysis, modi�cation synthesis system, is an extension of the classical channel VOCODER that exploits the advantages of progress in information processing technologies and a new conceptualization of the role of repetitive structures in speech sounds. STRAIGHT was designed to provide representation consistent with our auditory perception which decomposes input sounds in terms of excitation (source) and resonant (�lter) characteristics. This architecture and sophisticated implementation made it the most �exible and high-quality speech manipulation system to date. However, underlying algorithms were not well formulated mathematically sound manner.

TANDEM, a method to extract temporally stable power spectral representation and c o n s i s t e n t s a m p l i n g t h e o r y , a n e w mathematical foundation of sampling made p o s s i b l e t o r e f o r m u l a t e S T R A I G H T completely based on sound foundation. It a l s o s i m p l i � e d c o d e s a n d e n h a n c e d execution speed drastically.

Examples/Applications

Partial morphing for singingstyle, timbre and identity

Vowel-basedvoice conversion

TANDEM-STRAIGHT provides a uni�ed method to extract the excitation source information with supra- a n d s u b - h a r m o n i c s t r u c t u r e s . T h e b a s e l i n e performance of the proposed method without any post processing is comparable to available state of the art methods.Morphing based on STRAIGHT provides means to investigate physical correlates of perceptual attributes in singing. Preliminary tests indicated perceptual dominance of voice timbre in judgement of singers' identity. This �nding and a concept of speech texture mapping are basis of a voice conversion solely based on vowel information.