Implementation of a speech Analysis-Synthesis Toolbox using Harmonic plus Noise Model
Didier Cadic (1), engineering student
supervised by
Olivier Cappé (1), Maurice Charbit (1), Gérard Chollet (1), Eric Moulines (1)
(presented here by Guido Aversano (1,2))
(1) Département TSI, ENST, Paris, France
(2) IIASS, Vietri sul Mare (SA), Italy
Plan of the presentation
Text-to-speech: classic methods
HNM model
Analysis
Synthesis
Analysis-Synthesis examples
Conclusions
Text-To-Speech by concatenation
Examples produced on the AT&T web site:
English, male
English, female (vocal server example)
English, female (another vocal server example)
German, male
French, female
Text-To-Speech by concatenation
2 major challenges:
smooth connection between acoustic units
flexible prosody
TD-PSOLA method
Analysis:
Pitch estimation
Pitch-synchronous windowing
Synthesis:
Rearrangement of frames
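As a rough illustration of the analysis and synthesis steps listed above, the sketch below time-scales a signal by overlap-adding pitch-synchronous, Hann-windowed frames. It is a simplification under stated assumptions: the pitch period is constant and known, the pitch marks are supplied by the caller (no pitch estimation is performed), and `rate < 1` is taken to mean slowing down. The name `td_psola_stretch` and its parameters are illustrative, not taken from the toolbox.

```python
import numpy as np

def td_psola_stretch(x, marks, period, rate):
    """Time-scale x by re-using pitch-synchronous frames (constant period assumed).

    marks  : sample indices of pitch marks, each at least `period` samples
             away from both ends of x
    period : pitch period in samples (assumed constant here)
    rate   : playback rate; 0.5 gives an output about twice as long
    """
    marks = np.asarray(marks)
    win = np.hanning(2 * period + 1)          # two-period window centred on a mark
    n_out = int(round(len(x) / rate))
    y = np.zeros(n_out)
    # Synthesis marks keep the original spacing, so the pitch is unchanged.
    for ts in range(period, n_out - period, period):
        # Pick the analysis mark closest to the corresponding input time.
        ta = marks[np.argmin(np.abs(marks - ts * rate))]
        frame = win * x[ta - period: ta + period + 1]
        y[ts - period: ts + period + 1] += frame   # overlap-add
    return y
```

Frames are repeated when slowing down and skipped when speeding up; a real implementation would also handle a varying pitch period and unvoiced regions.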
TD-PSOLA method
Some very good-quality results:
Singing, original
Singing, modified
Time-scaling
Cello, original
Cello, modified
Pitch-shifting
TD-PSOLA method
Artifacts appearing in non-voiced sounds:
"rain", original
"rain", 0.5 rate
"ss", original
"ss", slowed down (classic method)
"ss", slowed down (improved)
Phase Vocoder method
Intuitive description:
Compression/stretching of the (narrow-band) spectrogram's time-frequency scales…
time-scaling
pitch-shifting
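To make the intuitive description concrete, here is a minimal time-scaling phase vocoder sketch. It reads the short-time spectrum at fractional frame positions (stretching the time axis of the spectrogram) and re-accumulates the phase of each bin so that consecutive output frames stay consistent. The function name, FFT size and hop are assumptions chosen for illustration; output amplitude is not normalised for the window overlap, and `rate < 1` is assumed to mean slowing down.

```python
import numpy as np

def phase_vocoder_stretch(x, rate, n_fft=1024, hop=256):
    """Time-scale x by the given rate using a basic phase vocoder."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    stft = np.array([np.fft.rfft(win * x[i * hop: i * hop + n_fft])
                     for i in range(n_frames)])
    # Expected phase advance of each bin centre over one hop.
    omega = 2 * np.pi * hop * np.arange(n_fft // 2 + 1) / n_fft
    steps = np.arange(0, n_frames - 1, rate)      # fractional analysis positions
    phase = np.angle(stft[0])
    y = np.zeros(len(steps) * hop + n_fft)
    for j, s in enumerate(steps):
        i = int(s)
        frac = s - i
        # Interpolate the magnitude between the two neighbouring frames.
        mag = (1 - frac) * np.abs(stft[i]) + frac * np.abs(stft[i + 1])
        y[j * hop: j * hop + n_fft] += win * np.fft.irfft(mag * np.exp(1j * phase), n_fft)
        # Measured phase advance, wrapped around the expected one.
        dphi = np.angle(stft[i + 1]) - np.angle(stft[i]) - omega
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        phase = phase + omega + dphi
    return y
```

Each bin's phase is propagated independently, so the phase relationships between bins of the original signal are not preserved in the output.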
Phase Vocoder method
Examples:
"rain", male voice
Slow-motion by Vocoder (PSOLA: )
"The quick fox …", female voice
Slow-motion by Vocoder
Main problem: phase coherence is lost in the synthesized signal
TD-PSOLA and Vocoder allow basic prosodic modifications.
The problem of unit concatenation for TTS is not solved.
Other kinds of modifications (timbre, denoising, …) should be considered.
We need a parametric model
Harmonic plus Noise Model (HNM)
Main assumption:
stationary segments of a speech signal can always be seen as the superposition of a periodic part and a noisy part
HNM Model
Modelling:
$s(t) = H(t) + B(t)$
where: $H(t) = \sum_k A_k \cos(2\pi k f_0 t + \varphi_k)$
and $B(t)$ = white noise passed through an AR filter
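The model equation can be sketched numerically as follows. The code builds the harmonic part from a list of amplitudes and phases and adds a noise signal supplied by the caller; the function names and the choice to pass the noise in as an array are assumptions made for illustration only.

```python
import numpy as np

def harmonic_part(amps, phases, f0, n_samples, fs):
    """H(t) = sum_k A_k cos(2 pi k f0 t + phi_k), sampled at rate fs."""
    t = np.arange(n_samples) / fs
    k = np.arange(1, len(amps) + 1)[:, None]          # harmonic numbers 1..K
    A = np.asarray(amps, dtype=float)[:, None]
    phi = np.asarray(phases, dtype=float)[:, None]
    return np.sum(A * np.cos(2 * np.pi * k * f0 * t + phi), axis=0)

def hnm_signal(amps, phases, f0, noise, fs):
    """s(t) = H(t) + B(t), with B(t) given as an already filtered noise array."""
    return harmonic_part(amps, phases, f0, len(noise), fs) + np.asarray(noise)
```

For example, `harmonic_part([1.0, 0.5], [0.0, np.pi / 2], 100.0, 160, 8000.0)` yields two periods of a 100 Hz tone with two harmonics.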
HNM analysis of a frame
1. Pitch estimation
Spectral comb method
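One common reading of a spectral comb, assumed here because the slide does not give details, is to score each candidate pitch by summing the magnitude spectrum at its first few harmonic positions and to keep the best-scoring candidate. The search range, number of comb teeth and FFT size below are illustrative choices, not values from the toolbox.

```python
import numpy as np

def comb_pitch(frame, fs, f_min=60.0, f_max=400.0, n_teeth=8, n_fft=8192):
    """Estimate f0 (Hz) as the candidate whose harmonic comb collects most energy."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed, n_fft))
    candidates = np.arange(f_min, f_max, 1.0)          # 1 Hz search grid
    scores = np.empty(len(candidates))
    for i, f0 in enumerate(candidates):
        # FFT bins closest to f0, 2*f0, ..., n_teeth*f0
        bins = np.round(np.arange(1, n_teeth + 1) * f0 * n_fft / fs).astype(int)
        bins = bins[bins < len(spectrum)]
        scores[i] = spectrum[bins].sum()
    return candidates[np.argmax(scores)]
```

A comb placed at half the true pitch still lands on every true harmonic with its even teeth, so a scoring rule of this kind can confuse $f_0$ with $f_0/2$ when few harmonics are present.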
HNM analysis of a frame
1. Pitch estimation
Good results are obtained
In some cases the method erroneously returns $f_0/2$
Possibility of tracking…
"aka…aga"
HNM analysis of a frame
2. Harmonic part: extraction of amplitudes
Least squares method
$H(t) = \sum_k \left[ a_k \cos(2\pi k f_0 t) + b_k \sin(2\pi k f_0 t) \right]$
$\min_{a_k, b_k} \| s(t) - H(t) \|^2$
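The least squares problem above is linear in $a_k$ and $b_k$, so it can be solved directly once $f_0$ is known. The sketch below builds a cosine/sine basis and calls `numpy.linalg.lstsq`; the function name and return values are illustrative assumptions. It also converts each pair to an amplitude and phase using $a_k \cos\theta + b_k \sin\theta = A_k \cos(\theta + \varphi_k)$ with $A_k = \sqrt{a_k^2 + b_k^2}$ and $\varphi_k = \operatorname{atan2}(-b_k, a_k)$.

```python
import numpy as np

def fit_harmonics(frame, fs, f0, n_harm):
    """Least squares fit of n_harm harmonics of f0 to one frame."""
    t = np.arange(len(frame)) / fs
    k = np.arange(1, n_harm + 1)
    arg = 2 * np.pi * np.outer(t, k * f0)               # shape (samples, harmonics)
    basis = np.hstack([np.cos(arg), np.sin(arg)])
    coeffs, *_ = np.linalg.lstsq(basis, frame, rcond=None)
    a, b = coeffs[:n_harm], coeffs[n_harm:]
    harmonic = basis @ coeffs                           # fitted H(t)
    amplitude = np.hypot(a, b)                          # A_k
    phase = np.arctan2(-b, a)                           # phi_k
    return a, b, amplitude, phase, harmonic
```

Subtracting the returned `harmonic` from the frame gives the part of the signal that the harmonics do not explain.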
HNM analysis of a frame
2. Extraction of amplitudes
Problem: the noisy part gives a non-null contribution to the spectral power
Gain correction for the harmonics (using a heuristic formula $g(DV)$, where $DV$ is the estimated voicing degree)
HNM analysis of a frame
2. Extraction of amplitudes
Residual: $R(t) = s(t) - H(t)$
HNM analysis of a frame
2. Extraction of amplitudes
Possibility of improving harmonic estimation
HNM analysis of a frame
3. AR filter estimation for the residual:
Linear prediction method
$R(t) = (B_g * F)(t)$
where $B_g$ = Gaussian white noise
and $F(t)$ = AR filter, $F(z) = \dfrac{1}{a_0 + a_1 z^{-1} + \dots + a_N z^{-N}}$
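A minimal sketch of linear prediction on the residual, assuming the autocorrelation (Yule-Walker) method and the normalisation $a_0 = 1$; the slide does not say which variant the toolbox uses. The function name `lpc` and the returned gain (standard deviation of the white-noise excitation) are illustrative.

```python
import numpy as np

def lpc(residual, order):
    """Estimate AR coefficients [1, a_1, ..., a_N] and the excitation gain."""
    x = np.asarray(residual, dtype=float)
    # Autocorrelation at lags 0..order (unnormalised sums).
    r = np.array([np.dot(x[:len(x) - lag], x[lag:]) for lag in range(order + 1)])
    # Normal equations: sum_j a_j r[|i - j|] = -r[i], for i = 1..order.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:order + 1])
    coeffs = np.concatenate(([1.0], a))
    # Prediction error energy, converted to a per-sample standard deviation.
    err = r[0] + np.dot(a, r[1:order + 1])
    gain = np.sqrt(err / len(x))
    return coeffs, gain
```

With this sign convention the synthesis filter is $1 / (1 + a_1 z^{-1} + \dots + a_N z^{-N})$, which matches the form of $F(z)$ when $a_0 = 1$.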
HNM Synthesis
Interpolation for each harmonic between two successive frames
$H(t) = \sum_k \left[ a_k(t) \cos(2\pi k f_0(t)\, t) + b_k(t) \sin(2\pi k f_0(t)\, t) \right] = \sum_k A_k(t) \cos \varphi_k(t)$
$\varphi_k'(t_a) = 2\pi k f_0(t_a)$ is known by pitch analysis.
$A_k(t_a)$ and $\varphi_k(t_a)$ are known at analysis instants $t_a$
HNM Synthesis
Erroneous pitch (usually $f_0/2$)
The resulting harmonic correspondence problem is solved by introducing fictitious harmonics
HNM Synthesis
$A_k(t) \cos \varphi_k(t)$
$A_k(t)$: linear interpolation
$\varphi_k(t)$: unwrapping + cubic interpolation
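The sketch below shows one standard way to combine phase unwrapping with cubic interpolation for a single harmonic between two analysis instants, in the style of McAulay-Quatieri sinusoidal synthesis; whether the toolbox uses exactly this formulation is an assumption. The cubic phase matches the measured phase and frequency at both ends, with the integer `M` selecting the unwrapped end phase that gives the smoothest track. Frequencies are in radians per sample and all names are illustrative.

```python
import numpy as np

def synth_harmonic(A0, A1, phi0, phi1, w0, w1, n_samples):
    """Synthesise one harmonic over n_samples between two analysis instants.

    A0, A1     : amplitudes at the start and end (linear interpolation)
    phi0, phi1 : measured phases at the start and end (any 2*pi branch)
    w0, w1     : frequencies at the start and end, in radians per sample
    """
    T = float(n_samples)
    # Unwrapping: number of 2*pi turns that makes the phase track smoothest.
    M = np.round(((phi0 + w0 * T - phi1) + (w1 - w0) * T / 2) / (2 * np.pi))
    D = phi1 + 2 * np.pi * M - phi0 - w0 * T
    alpha = 3 * D / T**2 - (w1 - w0) / T
    beta = -2 * D / T**3 + (w1 - w0) / T**2
    n = np.arange(n_samples)
    phase = phi0 + w0 * n + alpha * n**2 + beta * n**3   # cubic phase track
    amp = A0 + (A1 - A0) * n / T                         # linear amplitude track
    return amp * np.cos(phase)
```

For a stationary sinusoid (`w0 == w1` and consistent phases) the quadratic and cubic terms vanish and the result is a plain cosine.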
HNM Synthesis
Noisy part
Generation of normally distributed random numbers
AR filtering (abrupt changes of coefficients between 2 windows have no noticeable effect…)
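The two steps above can be sketched as a frame-by-frame loop: draw Gaussian samples, then run them through the all-pole filter of the current frame. In this sketch the filter coefficients switch abruptly at frame boundaries while the past output samples are carried over as filter state; the function name, the per-frame gain and the fixed hop are assumptions for illustration.

```python
import numpy as np

def synth_noise(frame_coeffs, gains, hop, rng):
    """Generate the noisy part, one AR filter per frame of `hop` samples.

    frame_coeffs : list of arrays [a_0, a_1, ..., a_N], one per frame
    gains        : excitation standard deviation for each frame
    rng          : numpy random Generator used for the Gaussian excitation
    """
    out = np.zeros(hop * len(frame_coeffs))
    for f, (a, g) in enumerate(zip(frame_coeffs, gains)):
        start = f * hop
        excitation = g * rng.standard_normal(hop)
        for n in range(start, start + hop):
            acc = excitation[n - start]
            # a_0 y[n] + a_1 y[n-1] + ... + a_N y[n-N] = e[n]
            for i in range(1, len(a)):
                if n - i >= 0:
                    acc -= a[i] * out[n - i]
            out[n] = acc / a[0]
    return out
```

A call such as `synth_noise([np.array([1.0, -0.9]), np.array([1.0, 0.5])], [1.0, 2.0], 100, np.random.default_rng(1))` produces two frames with different spectral colouring and no reset of the filter memory between them.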
HNM Synthesis
Results
"Carottes": synthesized, original
"Lawyer": synthesized, original
Tuba: synthesized, original
"wazi": synthesized, original
a-e-i-o-u: synthesized, original
singing: synthesized, original
HNM Synthesis
Results
Discours: synthesized, original
"aka aga": synthesized, original
Dussolier: synthesized, original
Andie: synthesized, original, noisy part
"coiffe": synthesized, original
Synthesis with time-stretching
Synthesis instants ($t_s$) differ from analysis instants ($t_a$)
The following parameters remain unchanged:
Noisy part parameters
The pitch
The amplitudes $A_k$ of the harmonics
Synthesis with time-stretching
Phase adaptation:
Simple phase trajectories resampling
or
"harmonic" rephasing
Phase adaptation
a-e-i-o-u: original
slow-motion with phase "stretching"
slow-motion with "harmonic" rephasing
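The sketch below illustrates the idea of moving the analysis parameters to new synthesis instants while leaving the pitch untouched. The phase of each harmonic is then rebuilt by integrating $2\pi k f_0$ over the stretched time axis. This is only one plausible way to keep phases consistent with an unchanged pitch; it is not claimed to be the "harmonic" rephasing used in the toolbox, and the function names and the convention that `rate < 1` slows the signal down are assumptions.

```python
import numpy as np

def stretch_instants(t_a, rate):
    """Map analysis instants t_a (seconds) to synthesis instants t_s."""
    t_a = np.asarray(t_a, dtype=float)
    return t_a[0] + (t_a - t_a[0]) / rate

def rephase(f0, t_s, k, phi_start=0.0):
    """Phase of harmonic k at each synthesis instant, by integrating k * f0."""
    f0 = np.asarray(f0, dtype=float)
    dt = np.diff(t_s)                                  # stretched frame spacing
    mean_f0 = 0.5 * (f0[:-1] + f0[1:])                 # trapezoidal integration
    advance = 2 * np.pi * k * mean_f0 * dt
    return phi_start + np.concatenate(([0.0], np.cumsum(advance)))
```

With analysis instants 10 ms apart and `rate=0.5`, the synthesis instants are 20 ms apart, and a constant 100 Hz pitch advances the first harmonic by $4\pi$ per synthesis frame.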
Final results
Audio examples: original, and synthesized with rate 0.4, 0.5, 0.6, 0.7, 0.8, 1, 1.2, 1.5, 2
Samples: "carottes", "lawyer", tuba, "wazi", singing, "a-e-i-o-u", Dussolier, Discours, Andie, "aka aga", "coiffe"
Conclusions
Good results, showing the method's potential for different applications, including TTS
Future work will include other kinds of modifications (pitch shifting, timbre, etc.)