Upload
others
View
6
Download
0
Embed Size (px)
Citation preview
Speech Synthesis asA Machine Learning Problem
Keiichi TokudaNagoya Institute of Technology
Introduction
Rule-based, formant synthesis (~’90s)– Hand-crafting each phonetic units by rules
Corpus-based, concatenative synthesis (’90s~)– Concatenate speech units (waveform) from a database
• Single inventory: diphone synthesis• Multiple inventory: unit selection synthesis
Corpus-based, statistical parametric synthesis– Source-filter model + statistical acoustic model
• Hidden Markov model: HMM-based synthesis
2
How we can formulate and understand the whole corpus-based speech synthesis process in a unified statistical framework?
Problem of speech synthesis (1/2)
3
We have a speech database, i.e., a set of texts and corresponding speech waveforms.Given a text to be synthesized, what is the speech waveform corresponding to the text?
: speech waveform
: speech waveforms
: texts
: text to be synthesized
databaseGiven
?
w
WX
x
Problem of speech synthesis (2/2)
Problem of statistical parametric speech synthesis:
4
( )LOlooo
,,|maxargˆ P=
( ) ( ) λLOλλloo
dPP∫= ,|,|maxarg
: Synthesis data (speech parameter sequence)
: Training data (speech parameter sequence)
: Label sequence for training data(pronunciation, stress, POS, pause position, etc.)
: Label sequence for synthesis data
: Model parameters (HMM parameters)λ
l
L
Ο
o
Approximation
Maximum A Posterior (MAP) & ML estimation
Speech parameter generation using estimated parameters
5
( )LOλλλ
,|maxargˆMAP P=
( ) ( )λλLOλ
PP ,|maxarg=
( )λLOλλ
,|maxargˆML P=
MAP estimation
ML estimation
( )λlooo
ˆ,|maxargˆ P= Speech parametergeneration
HMM-based speech synthesis system
SPEECHDATABASE
ExcitationParameterextraction
SpectralParameterExtraction
Excitationgeneration
Synthesis Filter
TEXT
Text analysis
SYNTHESIZEDSPEECH
Training HMMs
Parameter generationfrom HMMs
Context-dependent HMMs& state duration models
Labels Excitationparameters
Excitation
Spectralparameters
Spectralparameters
Excitationparameters
Labels
Speech signal Training part
Synthesis part6
Text analysis
HMM-based speech synthesis system
SPEECHDATABASE
ExcitationParameterextraction
SpectralParameterExtraction
Excitationgeneration
Synthesis Filter
TEXT
Text analysis
SYNTHESIZEDSPEECH
Parameter generationfrom HMMsLabels Excitation
parametersExcitation
Spectralparameters
Speech signal Training part
Synthesis part7
Text analysis
Training HMMs
Context-dependent HMMs& state duration models
Spectralparameters
Excitationparameters
Labels
),|(maxargˆ λLOλλ
P=
Hidden Markov model (HMM)
11a 22a 33a
12a 23a
)(1 tb o )(2 tb o )(3 tb o
1o 2o 3o 4o 5o To ・ ・
1 2 3
1 1 1 1 2 2 3 3
o
q
Observation sequence
State sequence
8
ija
)( tqb o
: state transition probability
: output probability
Output probability of HMM
1 2 3 4 5 6 7 8
i
=T t
∑∏∑=
−==
qqoλqολο
T
ttqqq ttt
baPP1
)()|,()|(1
9
33a
1
2
3
23a )( 73 ob
1π
Model parameters can be estimated by an EM algorithm
HMM-based speech synthesis system
SPEECHDATABASE
ExcitationParameterextraction
SpectralParameterExtraction
Excitationgeneration
Synthesis Filter
TEXT
Text analysis
SYNTHESIZEDSPEECH
Training HMMs
Excitation
Spectralparameters
Excitationparameters
Labels
Speech signal Training part
Synthesis part10
Text analysis
Parameter generationfrom HMMs
Context-dependent HMMs& state duration models
Labels Excitationparameters
Spectralparameters
)ˆ,|(maxargˆ λlooo
P=
Speech parameter generation algorithm
)|(),|(max
)|(),|()|(
λqλqo
λqλqoλo
q
q
PP
PPP
≈
=∑
),ˆ|(maxargˆ
),|(maxargˆ
λqoo
λlqq
o
q
p
p
=
=
⇒
11
For given HMM , determine a speech parameter vectorSequence which maximizes
λTTTT ],,,[ 21 Toooo =
Determination of state sequence
Observation sequence
State sequence
4 10 5dState duration
The state sequence can be determined by state durations12
11a 22a 33a
12a 23a
)(1 tb o )(2 tb o )(3 tb o
1o 2o 3o 4o 5o To ・ ・
1 2 3
1 1 1 1 2 2 3 3
ija
)( tqb o
: state transition probability
: output probability
o
q
HMM (Hidden Markov Model)– State duration prob. depends only on transition prob.– State duration probability exponentially decreases
HSMM (Hidden Semi Markov Model)– HMM + explicit duration model ⇒ HSMM
State durations are given by means of Gaussians.
Hidden Semi Markov Model
13
1O 2O 3O 4O 5O 6O 7O 9O8O
1q 2q 3q
1d 2d 3d
Introduce state duration probability
)|( qdP
Gaussian
Speech parameter generation algorithm
)|(),|(max
)|(),|()|(
λqλqo
λqλqoλo
q
q
PP
PPP
≈
=∑
),ˆ|(maxargˆ
),|(maxargˆ
λqoo
λlqq
o
q
p
p
=
=
⇒
14
For given HMM ,determine a speech parameter vectorSequence which maximizes
λTTTT ],,,[ 21 Toooo =
Generated feature sequence
Mean Variance
becomes a sequence of mean vectors⇒ discontinuous outputs between stateso
15
frame
a i
Dynamic features
112
22
11
2
)(5.0
−+
−+
+−≈∂∂
=∆
−≈∂∂
=∆
ttt
t
ttt
tccccc
cctcc
t
t
tc1−tc 1+tc
tc∆
tc1−tc 1+tc
tc2∆
16
1−∆ tc 1+∆ tc 1−∆ tc 1+∆ tc
5.0− 5.0 1 12−
S
Integration of dynamic features
Relationship between speech parm. vec. & static feat
17
o cW
Solution for the problem (1/2)
0c
qWc=
∂∂ ),ˆ|(log λP
qqq μWWcW ˆ1
ˆ1
ˆ−− = ΣΣ TT
TTTT ],,,[ 21 Tcccc =TTTT ],,,[ ˆˆˆˆ 21 Tqqqq μμμμ =
TTTT ],,,[ ˆˆˆˆ 21 Tqqqq ΣΣΣΣ =
By setting
we obtain
where
18
Generated speech parameter trajectory
Mean Variance c
19
frame
c∆
c
a i
Generated spectra
200
12
34
5(kH
z)0
12
34
5(kH
z)
a i silsil
w/o dyn
w dyn
Spectra changing smoothly between phonemes
Conventional HMM Trajectory HMM
Training
Synthesis
Trajectory HMM
Solve inconsistency between training & synthesis
21
),|( λLCP
)ˆ,|( λlcP
),|( λLOP
Wcoλlo =|)ˆ,|(P
⇒ improving the model accuracy
is not a proper distribution of c),|(),|( λlWcλlo PP =
HMM-based speech synthesis system
TEXT
Text analysis
Training HMMs
Parameter generationfrom HMMs
Context-dependent HMMs& state duration models
Labels
Labels
Training part
Synthesis part22
Text analysis
SPEECHDATABASE
ExcitationParameterextraction
SpectralParameterExtraction
Spectralparameters
Excitationparameters
Speech signal
Excitationgeneration
Synthesis Filter
SYNTHESIZEDSPEECH
Excitationparameters
Excitation
Spectralparameters
∑=
−=M
m
mzmczH0
)(exp)(
Mel-cepstral representation of speech spectra
ML-estimation of mel-cepstrum
ML estimation of spectral parameter
)|(maxarg cxcc
p=
23
xc
: speech waveform (Gaussian process): mel-cepstrum
War
ped
frequ
ency
(rad
)
Frequency (rad)
warped frequencymel-scale frequency
ω
αα ~
1
11
1~ je
zzz −
−
−− =
−−
=
∑=
−=M
m
mzmczH0
~)(exp)(
ω
π
2π
π0 2/π
Synthesis filter
Overview of speech vocoding
Original speech
Mel-cepstrumUnvoiced/voicedF0
24
)(zH)(ne )(nxexcitation
pulse train
white noise
speech
These speech parameter modeled by HMM
HMM-based speech synthesis system
SPEECHDATABASE
ExcitationParameterextraction
SpectralParameterExtraction
Excitationgeneration
Synthesis Filter
TEXT
Text analysis
SYNTHESIZEDSPEECH
Training HMMs
Parameter generationfrom HMMs
Context-dependent HMMs& state duration models
Labels Excitationparameters
Excitation
Spectralparameters
Spectralparameters
Excitationparameters
Labels
Speech signal Training part
Synthesis part25
Text analysis
Observation of F0
Time
Log
Freq
uenc
y
Unable to model by continuous or discrete distributions⇒ Multi-space distribution HMM (MSD-HMM)
26
11 R=Ω
voiced0
2 R=Ωunvoiced
・
Structure of MSD-HMM
27
2,1w 2,2w 2,3w2
2nR=Ω
)(12 xN )(22 xN )(32 xN
3,1w 3,2w 3,3w3
3nR=Ω
)(13 xN )(23 xN )(33 xN
1 2 3
1,1w 1,2w 1,3w1
1nR=Ω
)(11 xN )(21 xN )(31 xN
MSD-HMM for F0 modeling
1 2 3HMM for F0
0,1w 0,2w 0,3w
1,1w 1,2w 1,3w
unvoiced
voiced
unvoiced
voiced
unvoiced
voiced
・ ・ ・01 R=Ω
12 R=Ω
voiced/unvoiced weights
28
Generated F0
29
Speech samples
Mel-cepstrum
w/ dyn. w/o dyn.
log F0w/ dyn.
w/o dyn.
30
Inclusion of speech analysis & waveform reconstruction
Problem of statistical parametric speech synthesis
31
( )LXlxxx
,,|maxargˆ P=
( )∑∑∫=c Cx
cx |maxarg P ( )λlc ,|P
( )LCλ ,|P× ( ) λXC dP |
Waveform reconstruction Speech parameter generation
HMM parameter estimation Speech analysis
c and consist of mel-cepstrum & F0 parametersC
Context clustering factors preceding, succeeding two phonemes Current phoneme Position of current phoneme in current syllable # of phonemes at preceding, current, succeeding syllable accent, stress of preceding, current, succeeding syllable Position of current syllable in current word # of preceding, succeeding accented, stressed syllable in current phrase # of syllables from previous, to next accented, stressed syllable Vowel within current syllable Guess at part of speech of preceding, current, succeeding word # of syllables in preceding, current, succeeding word Position of current word in current phrase # of preceding, succeeding content words in current phrase # of words from previous, to next content word # of syllables in preceding, current, succeeding phrase…
Vast # of combinations ⇒ Difficult to have possible models32
Decision tree-based state clustering
k-a+b
t-a+h
…
…
……
…
yes
yesyesyes
yes no
no
no
no
no
w-a+t
w-a+sil w-a+sh
gy-a+sil
gy-a+pau
g-a+pau
leaf nodes
synthesizedstates
R=silence?
L=“gy”?
L=voice?
L=“w”?
R=silence?
33
Stream-dependent tree-based clustering
Decision treesfor
mel-cepstrum
Decision treesfor F0
HMM
State durationmodel
Decision treefor state dur.
models
34
Inclusion of model structure parameter
Problem of statistical parametric speech synthesis
35
( )LOlooo
,,|maxargˆ P=
( )∑∫=m
mP ,,|maxarg λloo
( )LOλ ,,| mP× ( ) λLO dmP ,|
Generate coefficient
Posterior ofmodel parameters
Posterior ofmodel structure
Usually fixed is usedm
HMM-based speech synthesis system
SPEECHDATABASE
ExcitationParameterextraction
SpectralParameterExtraction
Excitationgeneration
Synthesis Filter
SYNTHESIZEDSPEECH
Excitationparameters
Excitation
Spectralparameters
Speech signal Training part
Synthesis part36
),|(maxargˆ ΛWLLL
P=
TEXT
Text analysisParameter generation
from HMMsLabels
),|(maxargˆ Λwlll
P=
Training HMMs
Context-dependent HMMs& state duration models
Spectralparameters
Excitationparameters
Labels
Text analysis
Inclusion of text analysis
Problem of statistical parametric speech synthesis
37
( )WOwooo
,,|maxargˆ P=
( )∑∑∫ ∫=l Lo
λlo ,|maxarg P ( )Λ,| wlP
( )LOλ ,|P× ( ) ( ) ΛΛΛ ddPP λWL ,|
Speech parameter generation
HMM parameter estimation Text processing
Text processing
Prior
HMM-based speech synthesis system
SPEECHDATABASE
ExcitationParameterextraction
SpectralParameterExtraction
Excitationgeneration
Synthesis Filter
TEXT
Text analysis
SYNTHESIZEDSPEECH
Training HMMs
Parameter generationfrom HMMs
Context-dependent HMMs& state duration models
Labels Excitationparameters
Excitation
Spectralparameters
Spectralparameters
Excitationparameters
Labels
Speech signal Training part
Synthesis part38
Text analysis
Inclusion of all components
Problem of statistical parametric speech synthesis
39
( )WXwxxx
,,|maxargˆ P=
( )∑∑∑∫ ∫=Cc Llx
cx, ,
|maxargm
P ( )mP ,,| λlc ( )Λ,| wlP
Waveform generation Parameter generation Text processing
Posterior of model parameter
Text processing Prior
( ) ΛΛ ddP λ( )Λ,|WLP× ( )XC|P
Speech analysis
( )LCλ ,,| mP× ( )LC ,|mP
Posterior of model structure
Emotional Speech Synthesis
text neutral angry
「授業中に携帯いじってんじゃねえよ!
電源切っとけ!」
“Don’t touch your cell phone during a class! Turn off it!”
「ミーティングには毎週参加しなさい!」
“You must attend the weekly meeting!”
trained with 200 utterances
Speaker Adaptation (mimicking voices)
w/o adaptation (initial model) Adapted with 4 utterances Adapted with 50 utterances Speaker-dependent model
Speaker-independent
Adaptationdata
Adapted model
MLLR-based adaptation
adaptation
?
Speaker Interpolation (mixing voices)
Linear combination of two speaker-dependent models
Model A Model BInterpolated model
1.00 0.75 0.50 0.25 0.00
0.00 0.25 0.50 0.75 1.00
A:
B:
Voice Morphing
A BTwo voices:
Four voices:A B
Interpolation of Speaking Styles
Neutral High Tension
Interpolation extrapolation
Base model A Base model B
Multilingual Speech Synthesis Japanese American English Chinese (Mandarin) (by ATR) Brazilian Portuguese (by Nitech, and UFRJ) European Portuguese
(by Nitech, Univ of Porto, and UFRJ) Slovenian
(by Bostjan Vesnicer, University of Ljubljana, Slovenia ) Swedish
(by Anders Lundgren, KTH, Sweden) German (by University of Bonn, and Nitech) Korean (by Sang-Jin Kim, ETRI, Korea) Hungarian (by Budapest University of Technology and
Economics) Finish (by TKK, Finland) Baby English (by Univ of Edinburgh, UK) Polish, Slovak, Finnish, Arabic, Farsi, Polyglot, etc.
Singing Voice Synthesis
Singing voice database
Musical score
Trained HMMs
Singing voice forany piece of music
male:
female:
male:
female:
Conclusion
• Whole speech synthesis process is described as a statistical framework
• It gives a unified view and reveals what is correct and what is wrong
• Importance of the databaseFuture work• Still we have many problems should be solved:
• Speech waveform modeling• Combination with text processing part
47
Statistical approach to speech synthesis