Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

Acknowledgement:

Yvonne Lee, Paul Chan, Dong Minghui

International Symposium on Next-Generation Artificial Intelligence

3 March 2016, Tokyo

Personalized Singing Synthesis - Science Meeting Arts

Haizhou Li

Institute for Infocomm Research, A*STAR

Singapore

1. Speech

2. Singing

3. Music

4. Speech to Singing

Agenda: Personalized Singing Synthesis

• Speech is the most natural way of human communication.

• Singing is to augment regular speech by the use of both tonality and rhythm.

• Music is an artistic way of expression, with instrumental sounds and vocals.

Speech, Singing and Music

Speech

Singing

Music

Speech

4

• Who you are

Elements in Human Voice

Human Language Technology

• what you want to say

Speech

Prosody

Timbre Content

• Expression of affective

state/emotion

http://www.google.com.sg/url?sa=i&rct=j&q=speech+synthesis&source=images&cd=&cad=rja&docid=v4-IMd_b5HQNEM&tbnid=ZUuKEvfgORIgqM:&ved=0CAUQjRw&url=http://ptolemy.eecs.berkeley.edu/~eal/audio/voder.html&ei=1feMUczTM46FrAfhyoDQDQ&bvm=bv.46340616,d.bmk&psig=AFQjCNFdBdCuK3y29FLN14qkI407EFZoQw&ust=1368279371367347

6

• Speech and singing can be modelled as two processes: source generation and filtering

• Air flow passes through the vocal folds as the sound source (periodic pulse & noise)

• It is then filtered by our vocal tract to produce the sounds

Vocal Tract (Filter)

Air Flow (Source)

Speech

Speech Theory

7

• Kratzenstein’s acoustic resonators – Apparatus created in St. Petersburg (1779)

– Figure shown from Schroeter (1993)

• Vowel tubes (SF Science Museum)

First Set of Vowel Tubes

8

Homer Dudley (1896-1987) , 1939 World Fair in New

York City – Bell Labs VODER

Electronic Synthesizer: VODER

Source: 120 Years of Electronic Music: The history of electronic music from 1800 to

2015, http://120years.net/the-voder-vocoderhomer-dudleyusa1940/

9

• Haskins, 1959

• KTH – Stockholm, 1962

• Bell Labs, 1973

• MIT, 1976

• MIT-talk, 1979

• Speak ‘N Spell, 1980

• BELL Labs, 1985

• DECtalk (voice morphing), 1987

• I2R Abacus Engine 2013

Continuing Evolution (1959-2013)

Elements in Human Voice

Human Language Technology

speech

Prosody

Timbre Content

11

Analysis and Synthesis

Content

Prosody

Timbre

Analysis

Synthesis

speech speech

12

Singing

• Singing is emotional musical vocalization. It is an emotional expression of feelings which has the power to alter the mood of both the singer and the listener.

• Singing requires expert knowledge and operations. Vocalists acquire singing skills by extensive training.

• By singing, we – Reduce stress and improve mood

– Lower blood pressure

– Help improve sleep

– Reduce perceived pain

– Motivate and empower …

13

Singing – a part of our life

14

• Almost all singers develop vibrato during voice training. Vibrato is a periodic modulation of the pitch frequency. Vibrato reduces the demand of accuracy in pitch frequency and the singer can use vibrato artistically for expressive purposes [Sundberg87]

[Sundberg87] J. Sundberg, The Science of the Singing Voice, Northern Illinois University Press, 1987.

Singing Theory – Vibrato

15

• By varying the vocal tract to achieve different voice spectra, singers may produce different voice timbres, styles, and achieve efficient sound transmission

[Wolfe09] J. Wolfe, M. Garnier, and J. Smith, Voice Acoustics: An Introduction to the Science of Speech and Singing, 2009. [Sundberg77] J. Sundberg, The Acoustics of the Singing Voice, Scientific American, 1977.

• Singing formant

By adjusting the larynx and the vocal tract near glottis, singing formant is made and the singing can be projected further far away [Wolfe09]

Singing Theory – Singing Formant (Tenor)

La Lore Loo Ler Lee

low pitch

high pitch

high pitch, order changed

16

[Joliveau04] E. Joliveau, J. Smith, and J. Wolfe, “Tuning of vocal tract resonance by sopranos”, Nature, vol. 427, pp. 116, Jan. 2004. [Wolfe09] J. Wolfe, M. Garnier, and J. Smith, Voice Acoustics: An Introduction to the Science of Speech and Singing, 2009. [Wolfe12] J. Wolfe, Sopranos: Resonance Tuning and Vowel Changes, 2012.

[Wolfe12] [Joliveau04]

[Wolfe09]

Singing Theory – Resonance Tuning (Soprano)

17

Music

18

• A piece of music is described by a sequence of notes, showing the pitch and the relative durations

• In Western music, 12 notes of fixed frequencies are used. They are different from each other by a semitone (a ratio of ). An octave span over 12 semitones.

12 2

name whole half quarter eighth sixteenth thirty-second

sixty-fourth

pitched note

rest note

name C4 C4# D4 D4# E4 F4 F4# G4 G4# A4 A4# B4

freq. (Hz)

262 277 294 311 330 349 370 392 415 440 466 494

Music Theory

19

Speech to Singing Synthesis

Are these possible? – Projecting vocal techniques in singing

– Personalizing a voice [Kenmochi10]

– Automating the accompaniment

[Kenmochi10] H. Kenmochi, “VOCALOID and Hatsune Miku phenomenon in Japan,” in Proc. Intersinging, pp. 1-4, 2010.

Singing Synthesis

Singing

Prosody

Timbre Content

Synchronization

21

Analysis and Synthesis

Content

Prosody

(F0)

Timbre

Synthesized

Singing

Music

Accompaniment

Speech Singing

Analysis

Synthesis

• Input: speech or singing

• Output: time-aligned vocal and music

• Changing prosody and timbre from speech to singing vocal

Siu Wa Lee, Ling Cen, Haizhou Li, Yaozhu Paul Chan, Minghui Dong, Method and system for template-based

personalized singing synthesis, US Patent: 20150025892 A1

Personalized Singing Synthesis

(shift to a higher pitch)

pitch

time

pitch

time

pitch

time

pitch

time

Singing

Syncopated Singing

(horizontal nudge

time warping)

Transposed Singing

(shift to a lower pitch)

pitch

time

pitch

time

pitch

time

Example of Speech to Singing Synthesis

Harmonized Singing

(Sum of Melody

& Accompaniment

Vocals)

Speech

pitch

time

24

Template singing voice

time (s)

frequ

ency

(Hz)

0 0.5 1 1.5 20

2000

4000

6000

8000

Speaking voice under conversion

time (s)

frequ

ency

(Hz)

0 0.5 10

2000

4000

6000

8000

Converted singing voice

time (s)

frequ

ency

(Hz)

0 0.5 1 1.5 20

2000

4000

6000

8000

Personalized Singing Synthesis – voice alignment

pitch

time

Singing

pitch

time

Speech

Siu Wa Lee, Ling Cen, Haizhou Li, Yaozhu Paul Chan, Minghui Dong, Method and system for template-based

personalized singing synthesis, US Patent: 20150025892 A1

25

• Generalized Pitch Modeling

– To learn and generate these fluctuations note-by-note [Lee12]

• Various types of fluctuation are implicitly modeled using the same representation

[Lee12] S. W. Lee, S. T. Ang, M. Dong, and H. Li, “Generalized F0 modeling with absolute and relative pitch features for signing voice synthesis,” in Proc. ICASSP, pp. 429-432, 2012.

Generalized Pitch Modeling

Content Independent Method

Pitch Modeling

• Context Independent Method – A fixed formula to assign pitch

fluctuation patterns

• However, pitch fluctuations are context dependent, e.g.

overshoot1 in high frequency notes is different from that in low frequency notes. Vibrato is present more often in long notes, etc.

1. Overshoot is a deflection exceeding the target note frequency after a note change 2. LEE S.W.Y, , DONG M.H., "Singing voice synthesis: Singer-dependent vibrato modeling and coherent processing of spectral envelope", INTERSPEECH 2011

Singing with Vibrato

Personalized Singing Voice Synthesis

(NDP 2013 Mobile App)

• Thank you!

28

Documents

Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time