28
Acknowledgement: Yvonne Lee, Paul Chan, Dong Minghui International Symposium on Next-Generation Artificial Intelligence 3 March 2016, Tokyo Personalized Singing Synthesis - Science Meeting Arts Haizhou Li Institute for Infocomm Research, A*STAR Singapore

Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

Acknowledgement:

Yvonne Lee, Paul Chan, Dong Minghui

International Symposium on Next-Generation Artificial Intelligence

3 March 2016, Tokyo

Personalized Singing Synthesis - Science Meeting Arts

Haizhou Li

Institute for Infocomm Research, A*STAR

Singapore

Page 2: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

1. Speech

2. Singing

3. Music

4. Speech to Singing

Agenda: Personalized Singing Synthesis

Page 3: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

• Speech is the most natural way of human communication.

• Singing is to augment regular speech by the use of both tonality and rhythm.

• Music is an artistic way of expression, with instrumental sounds and vocals.

Speech, Singing and Music

Speech

Singing

Music

Page 4: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

Speech

4

Page 6: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

6

• Speech and singing can be modelled as two processes: source generation and filtering

• Air flow passes through the vocal folds as the sound source (periodic pulse & noise)

• It is then filtered by our vocal tract to produce the sounds

Vocal Tract (Filter)

Air Flow (Source)

Speech

Speech Theory

Page 7: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

7

• Kratzenstein’s acoustic resonators – Apparatus created in St. Petersburg (1779)

– Figure shown from Schroeter (1993)

• Vowel tubes (SF Science Museum)

First Set of Vowel Tubes

Page 8: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

8

Homer Dudley (1896-1987) , 1939 World Fair in New

York City – Bell Labs VODER

Electronic Synthesizer: VODER

Source: 120 Years of Electronic Music: The history of electronic music from 1800 to

2015, http://120years.net/the-voder-vocoderhomer-dudleyusa1940/

Page 9: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

9

• Haskins, 1959

• KTH – Stockholm, 1962

• Bell Labs, 1973

• MIT, 1976

• MIT-talk, 1979

• Speak ‘N Spell, 1980

• BELL Labs, 1985

• DECtalk (voice morphing), 1987

• I2R Abacus Engine 2013

Continuing Evolution (1959-2013)

Page 10: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

Elements in Human Voice

Human Language Technology

speech

Prosody

Timbre Content

Page 11: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

11

Analysis and Synthesis

Content

Prosody

Timbre

Analysis

Synthesis

speech speech

Page 12: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

12

Singing

Page 13: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

• Singing is emotional musical vocalization. It is an emotional expression of feelings which has the power to alter the mood of both the singer and the listener.

• Singing requires expert knowledge and operations. Vocalists acquire singing skills by extensive training.

• By singing, we – Reduce stress and improve mood

– Lower blood pressure

– Help improve sleep

– Reduce perceived pain

– Motivate and empower …

13

Singing – a part of our life

Page 14: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

14

• Almost all singers develop vibrato during voice training. Vibrato is a periodic modulation of the pitch frequency. Vibrato reduces the demand of accuracy in pitch frequency and the singer can use vibrato artistically for expressive purposes [Sundberg87]

[Sundberg87] J. Sundberg, The Science of the Singing Voice, Northern Illinois University Press, 1987.

Singing Theory – Vibrato

Page 15: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

15

• By varying the vocal tract to achieve different voice spectra, singers may produce different voice timbres, styles, and achieve efficient sound transmission

[Wolfe09] J. Wolfe, M. Garnier, and J. Smith, Voice Acoustics: An Introduction to the Science of Speech and Singing, 2009. [Sundberg77] J. Sundberg, The Acoustics of the Singing Voice, Scientific American, 1977.

• Singing formant

By adjusting the larynx and the vocal tract near glottis, singing formant is made and the singing can be projected further far away [Wolfe09]

Singing Theory – Singing Formant (Tenor)

Page 16: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

La Lore Loo Ler Lee

low pitch

high pitch

high pitch, order changed

16

[Joliveau04] E. Joliveau, J. Smith, and J. Wolfe, “Tuning of vocal tract resonance by sopranos”, Nature, vol. 427, pp. 116, Jan. 2004. [Wolfe09] J. Wolfe, M. Garnier, and J. Smith, Voice Acoustics: An Introduction to the Science of Speech and Singing, 2009. [Wolfe12] J. Wolfe, Sopranos: Resonance Tuning and Vowel Changes, 2012.

[Wolfe12] [Joliveau04]

[Wolfe09]

Singing Theory – Resonance Tuning (Soprano)

Page 17: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

17

Music

Page 18: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

18

• A piece of music is described by a sequence of notes, showing the pitch and the relative durations

• In Western music, 12 notes of fixed frequencies are used. They are different from each other by a semitone (a ratio of ). An octave span over 12 semitones.

12 2

name whole half quarter eighth sixteenth thirty-second

sixty-fourth

pitched note

rest note

name C4 C4# D4 D4# E4 F4 F4# G4 G4# A4 A4# B4

freq. (Hz)

262 277 294 311 330 349 370 392 415 440 466 494

Music Theory

Page 19: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

19

Speech to Singing Synthesis

Page 20: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

Are these possible? – Projecting vocal techniques in singing

– Personalizing a voice [Kenmochi10]

– Automating the accompaniment

[Kenmochi10] H. Kenmochi, “VOCALOID and Hatsune Miku phenomenon in Japan,” in Proc. Intersinging, pp. 1-4, 2010.

Singing Synthesis

Singing

Prosody

Timbre Content

Synchronization

Page 21: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

21

Analysis and Synthesis

Content

Prosody

(F0)

Timbre

Synthesized

Singing

Music

Accompaniment

Speech Singing

Analysis

Synthesis

Page 22: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

• Input: speech or singing

• Output: time-aligned vocal and music

• Changing prosody and timbre from speech to singing vocal

Siu Wa Lee, Ling Cen, Haizhou Li, Yaozhu Paul Chan, Minghui Dong, Method and system for template-based

personalized singing synthesis, US Patent: 20150025892 A1

Personalized Singing Synthesis

Page 23: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

(shift to a higher pitch)

pitch

time

pitch

time

pitch

time

pitch

time

Singing

Syncopated Singing

(horizontal nudge

time warping)

Transposed Singing

(shift to a lower pitch)

pitch

time

pitch

time

pitch

time

Example of Speech to Singing Synthesis

Harmonized Singing

(Sum of Melody

& Accompaniment

Vocals)

Speech

pitch

time

Page 24: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

24

Template singing voice

time (s)

frequ

ency

(Hz)

0 0.5 1 1.5 20

2000

4000

6000

8000

Speaking voice under conversion

time (s)

frequ

ency

(Hz)

0 0.5 10

2000

4000

6000

8000

Converted singing voice

time (s)

frequ

ency

(Hz)

0 0.5 1 1.5 20

2000

4000

6000

8000

Personalized Singing Synthesis – voice alignment

pitch

time

Singing

pitch

time

Speech

Siu Wa Lee, Ling Cen, Haizhou Li, Yaozhu Paul Chan, Minghui Dong, Method and system for template-based

personalized singing synthesis, US Patent: 20150025892 A1

Page 25: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

25

• Generalized Pitch Modeling

– To learn and generate these fluctuations note-by-note [Lee12]

• Various types of fluctuation are implicitly modeled using the same representation

[Lee12] S. W. Lee, S. T. Ang, M. Dong, and H. Li, “Generalized F0 modeling with absolute and relative pitch features for signing voice synthesis,” in Proc. ICASSP, pp. 429-432, 2012.

Generalized Pitch Modeling

Content Independent Method

Pitch Modeling

Page 26: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

• Context Independent Method – A fixed formula to assign pitch

fluctuation patterns

• However, pitch fluctuations are context dependent, e.g.

overshoot1 in high frequency notes is different from that in low frequency notes. Vibrato is present more often in long notes, etc.

1. Overshoot is a deflection exceeding the target note frequency after a note change 2. LEE S.W.Y, , DONG M.H., "Singing voice synthesis: Singer-dependent vibrato modeling and coherent processing of spectral envelope", INTERSPEECH 2011

Singing with Vibrato

Page 27: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

Personalized Singing Voice Synthesis

(NDP 2013 Mobile App)

Page 28: Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000 6000 8000 Personalized Singing Synthesis – voice alignment h time Singing h time

• Thank you!

28