ON THE REPRESENTATION OF VOICE SOURCE APERIODICITIES IN THE MBE
SPEECH CODING MODEL
Preeti Rao and Pushkar Patwardhan
Department of Electrical Engineering, Indian Institute of Technology, Bombay, India
The MBE Speech Model (Griffin & Lim, 1988)
[Figure: MBE modeling; panels: Original, Modeled]
Frame-based analysis
Within the window, assume: a constant-amplitude, constant-frequency sinusoidal model
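A minimal sketch of this frame-level assumption: within one window the signal is treated as a sum of harmonics of the pitch, each with fixed amplitude and frequency. The function name, the sampling rate, and the example amplitudes are illustrative assumptions, not values from the slides.

```python
import numpy as np

def harmonic_frame(f0, amplitudes, phases, n_samples, fs):
    """Sum of constant-amplitude, constant-frequency harmonics of f0 over one frame."""
    n = np.arange(n_samples)
    k = np.arange(1, len(amplitudes) + 1)            # harmonic numbers 1..K
    # One row of phase arguments per harmonic: 2*pi*k*f0*n/fs + phase_k
    arg = 2 * np.pi * f0 * np.outer(k, n) / fs + np.asarray(phases)[:, None]
    return (np.asarray(amplitudes)[:, None] * np.cos(arg)).sum(axis=0)

fs = 8000                                            # assumed narrowband sampling rate
frame = harmonic_frame(f0=100.0, amplitudes=[1.0, 0.5, 0.25],
                       phases=[0.0, 0.0, 0.0], n_samples=160, fs=fs)
```

With f0 = 100 Hz at 8 kHz the frame repeats every 80 samples, which is what the quasi-stationarity assumption implies for a steady vowel.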
MBE Speech Model Parameters
• Pitch
• Harmonic amplitudes
• Band-wise voicing decisions
[Block diagram: Windowed speech -> Parameter Estimation -> model parameters]
(Phase is predicted for smoothness)
MBE Analysis: Parameter Estimation
Pitch and Spectral Amplitudes: Analysis-by-synthesis matching of a predicted harmonic spectrum with the actual signal spectrum.
Voicing decision per frequency band (3 harmonics): Based on the error between the actual and predicted spectra.
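The two estimation steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the Hamming window, the FFT size, the integer-Hz candidate grid, and the fixed threshold of 0.2 are all assumptions, and the search range is kept narrow so that sub-multiples of the pitch are not candidates (a full analyser needs extra safeguards against them).

```python
import numpy as np

def harmonic_match(frame, f0, fs, nfft=2048):
    """Fit one complex amplitude per harmonic of f0 to the windowed spectrum.

    Returns per-harmonic residual energy, per-harmonic signal energy and
    the fitted cosine amplitudes.
    """
    n = np.arange(len(frame))
    win = np.hamming(len(frame))
    half = nfft // 2 + 1
    spec = np.fft.fft(frame * win, nfft)[:half]
    freqs = np.arange(half) * fs / nfft
    n_harm = int(fs / (2 * f0) - 0.5)        # harmonics whose band fits below fs/2
    err, energy, amps = np.zeros(n_harm), np.zeros(n_harm), np.zeros(n_harm)
    for k in range(1, n_harm + 1):
        band = (freqs >= (k - 0.5) * f0) & (freqs < (k + 0.5) * f0)
        # Spectrum of the analysis window shifted to the k-th harmonic.
        wk = np.fft.fft(win * np.exp(2j * np.pi * k * f0 * n / fs), nfft)[:half][band]
        a = np.vdot(wk, spec[band]) / np.vdot(wk, wk)   # least-squares amplitude
        err[k - 1] = np.sum(np.abs(spec[band] - a * wk) ** 2)
        energy[k - 1] = np.sum(np.abs(spec[band]) ** 2)
        amps[k - 1] = 2 * np.abs(a)          # cosine amplitude = 2 * |complex amplitude|
    return err, energy, amps

def estimate_pitch(frame, fs, candidates):
    """Pick the candidate f0 with the smallest normalised matching error."""
    errors = []
    for f0 in candidates:
        err, energy, _ = harmonic_match(frame, f0, fs)
        errors.append(err.sum() / energy.sum())
    return float(candidates[int(np.argmin(errors))])

def band_voicing(err, energy, threshold=0.2, harmonics_per_band=3):
    """Declare each group of harmonics voiced if its normalised error is small."""
    starts = range(0, len(err), harmonics_per_band)
    ratio = np.array([err[s:s + harmonics_per_band].sum() /
                      energy[s:s + harmonics_per_band].sum() for s in starts])
    return ratio < threshold

# Synthetic test frame: harmonics 1..13 of 150 Hz plus weak white noise.
fs = 8000
rng = np.random.default_rng(0)
n = np.arange(400)
frame = sum(np.cos(2 * np.pi * 150 * k * n / fs) for k in range(1, 14))
frame = frame + 0.01 * rng.standard_normal(len(n))

f0 = estimate_pitch(frame, fs, np.arange(100, 201))
err, energy, amps = harmonic_match(frame, f0, fs)
voiced = band_voicing(err, energy)
```

In this toy frame the bands below about 2 kHz contain clean harmonics and the bands above contain only noise, so the voicing decisions split accordingly.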
MBE Analysis: Spectral Matching
Voicing thresholds are frame-adapted as determined by experimental tuning.
MBE Synthesis
[Block diagram: Voiced speech synthesis. Inputs: Pitch, Voiced amplitudes. Block: Bank of Harmonic Oscillators. Output: Voiced speech]
[Block diagram: Unvoiced speech synthesis. Inputs: White noise, Unvoiced amplitudes. Blocks: Linear Interpolation, STFT, Replace Envelope, Weighted Overlap-Add. Output: Unvoiced speech]
[Block diagram: Voiced speech + Unvoiced speech -> Reconstructed speech]
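A rough sketch of the two synthesis paths, under several assumptions that are not in the slides: voiced amplitudes are linearly interpolated across each hop, oscillator phase simply accumulates at the current frame's pitch, the noise path uses square-root Hann windows, and each unvoiced band is scaled to a target RMS spectral magnitude. The function name and the frame tuple layout are hypothetical.

```python
import numpy as np

def mbe_synthesize(frames, hop, fs, seed=0):
    """frames: list of (f0, amps, voiced), one amplitude and flag per harmonic.

    Returns (voiced_part, unvoiced_part); their sum is the reconstructed speech.
    """
    rng = np.random.default_rng(seed)
    n_frames, K = len(frames), len(frames[0][1])
    nfft = 2 * hop
    out_v = np.zeros(n_frames * hop)
    out_u = np.zeros((n_frames + 1) * hop)
    k = np.arange(1, K + 1)
    ramp = np.arange(hop) / hop
    # sqrt-Hann for analysis and synthesis: their product overlap-adds to 1.
    win = np.sqrt(np.hanning(nfft + 1)[:-1])
    freqs = np.fft.rfftfreq(nfft, 1 / fs)
    phase = np.zeros(K)
    prev_amp = np.zeros(K)
    for i, (f0, amps, voiced) in enumerate(frames):
        amps = np.asarray(amps, float)
        voiced = np.asarray(voiced, bool)
        # Voiced path: bank of harmonic oscillators, amplitudes interpolated.
        cur_amp = np.where(voiced, amps, 0.0)
        a = prev_amp[:, None] + (cur_amp - prev_amp)[:, None] * ramp
        ph = phase[:, None] + 2 * np.pi * f0 * k[:, None] * np.arange(1, hop + 1) / fs
        out_v[i * hop:(i + 1) * hop] = (a * np.cos(ph)).sum(axis=0)
        phase = ph[:, -1] % (2 * np.pi)
        prev_amp = cur_amp
        # Unvoiced path: white noise -> STFT -> replace envelope -> overlap-add.
        noise_spec = np.fft.rfft(rng.standard_normal(nfft) * win)
        shaped = np.zeros_like(noise_spec)
        for j in range(K):
            band = (freqs >= (j + 0.5) * f0) & (freqs < (j + 1.5) * f0)
            if not voiced[j] and band.any():
                rms = np.sqrt(np.mean(np.abs(noise_spec[band]) ** 2))
                shaped[band] = noise_spec[band] * amps[j] / rms
        out_u[i * hop:i * hop + nfft] += np.fft.irfft(shaped, nfft) * win
    return out_v, out_u[:n_frames * hop]

all_voiced = [(100.0, [1.0, 0.5], [True, True])] * 4
v1, u1 = mbe_synthesize(all_voiced, hop=80, fs=8000)
all_unvoiced = [(100.0, [1.0, 0.5], [False, False])] * 4
v2, u2 = mbe_synthesize(all_unvoiced, hop=80, fs=8000)
```

Bands declared voiced contribute nothing to the noise path and vice versa, so a devoiced band is rendered purely as shaped random noise.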
Narrowband Speech Coding with MBE
The efficient quantisation of MBE parameters has led to:
• IMBE (Inmarsat) @ 4.15 kbps
• DVSI MBE codecs @ >2 kbps
• LR MBE (IITB) @ 1.5 kbps
• Research groups (Univ. Surrey, UCSB, Sony Corp.) @ 1.2 kbps to 3 kbps
[Sample labels: reference, modeled]
Related Models: Speech Synthesis
• Harmonics+Noise Model (HNM): Stylianou
• Harmonic/Stochastic Model (H/S): Dutoit, 1996
Emphasis is on natural sounding wideband speech and easy prosody modification.
Both use essentially the Griffin & Lim MBE analysis. Important differences:
• Analysis and synthesis are pitch synchronous
• Estimated harmonic phases are utilised in synthesis
MBE Model: Limitations
The codec speech quality does not improve with increasing bit rate => the model has its limitations
Assumption of frame-level quasi-stationarity: enables the accurate representation only of
• vowels
• unvoiced and voiced fricatives
(not plosives, onsets, …)
“Steady” Sounds: Voice Quality
• Glottal pulse shape variation (brightness, vocal effort)
• Pitch cycle variations: jitter / shimmer (roughness / harshness)
• Frication and aspiration (friction, breathiness)
[Figure: Glottal pulse (shapes labelled dark and sharp; pitch periods T1, T2, Tm) and Vocal tract response giving the Speech signal; time axis in msec]
Role of Model Excitation Parameters
The glottal spectral shape (glottal waveform shape) can be captured by the spectral envelope parameters. But the perceptual effects of
• vocal cord vibration aperiodicities
• aspiration / frication noise
must be reproduced (if at all) by the MB excitation.
Effect of Aperiodicities on MBE Parameters
Voice source aperiodicities distort the harmonic spectrum (esp. if the frame contains several pitch cycles).
• Modulation (jitter-shimmer) aperiodicities => smearing of harmonic lobe structure; noise and subharmonics may be introduced.
• Aspiration noise => additive noise in harmonic regions
MBE Analysis: Aperiodic Vowel
Increase in the spectral matching error during analysis => the affected frequency bands are declared unvoiced (UV) and synthesized by MBE as random noise
Previous: On Multi-band Excitation
• Fujimura, 1968: “A crude approximation of aperiodicity observed in natural speech can be made by distributing patches of random noise signals in the time-frequency space of the speech signal.”
• Makhoul, 1978: “Spectral devoicing due to vocal cord vibration irregularities is an artifact of the spectral estimation, and it may not be appropriate to use a noise source for the synthesis…”
• Griffin and Lim, 1988: Justify MBE model by quoting Fujimura, and also their own observations with speech in noise.
Synthetic Vowel: Modulation Aperiodicities
[Figure: % shimmer vs. Pitch (Hz), 50 to 300 Hz, high shimmer; curves: reference, modeled]
[Figure: % jitter vs. Pitch (Hz), 50 to 300 Hz, high jitter; curves: reference, modeled]
[Sample grid: Periodic ref., HIGH JITTER, HIGH SHIMMER at 80 Hz, 160 Hz, 250 Hz]
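A hedged sketch of how modulation aperiodicities of this kind could be generated and measured. It covers only the excitation (an impulse train); the vocal tract filter is omitted. The uniform perturbation model and the "local" percentage measure (mean absolute cycle-to-cycle difference over the mean) are assumptions, since the slides do not state how jitter and shimmer were defined.

```python
import numpy as np

def perturbed_pulse_train(f0, n_cycles, jitter_pct, shimmer_pct, fs, seed=0):
    """Impulse train whose periods and amplitudes vary randomly per cycle."""
    rng = np.random.default_rng(seed)
    t0 = 1.0 / f0
    # Each period / amplitude deviates by at most the given percentage.
    periods = t0 * (1 + jitter_pct / 100 * rng.uniform(-1, 1, n_cycles))
    amps = 1 + shimmer_pct / 100 * rng.uniform(-1, 1, n_cycles)
    onsets = np.round(np.cumsum(periods) * fs).astype(int)
    x = np.zeros(onsets[-1] + 1)
    x[onsets] = amps
    return x, periods, amps

def local_perturbation_pct(values):
    """Mean absolute difference of consecutive values, as % of their mean."""
    values = np.asarray(values, float)
    return 100 * np.mean(np.abs(np.diff(values))) / np.mean(values)

fs = 8000
x_ref, p_ref, a_ref = perturbed_pulse_train(160.0, 50, 0.0, 0.0, fs)
x_jit, p_jit, a_jit = perturbed_pulse_train(160.0, 50, 3.0, 0.0, fs)
x_shim, p_shim, a_shim = perturbed_pulse_train(160.0, 50, 0.0, 10.0, fs)

jitter_ref = local_perturbation_pct(p_ref)
jitter_high = local_perturbation_pct(p_jit)
shimmer_high = local_perturbation_pct(a_shim)
```

Applying the same measure to pitch periods and peak amplitudes extracted from the reference and from the MBE-modeled signal would give curves of the kind plotted above.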
Fujimura-type Experiment
Highly jittered vowel /ɑ/
Reference
MBE (note “unfused” noise)
MBE-modeled with forced decisions
Experiments with Natural Speech
Goal: to study the MBE representation of
• Unvoiced and voiced fricatives
• Breathy voice
• Rough and hoarse voices
• Speech in noisy background
To understand the implications of simplifying the excitation to single-band (SBE) or two-band excitation (TBE)
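One way to picture the simplification is as a mapping from the band-wise MBE voicing decisions to a coarser pattern. The rules below are hypothetical choices for illustration (majority vote for SBE, best-matching voiced/unvoiced cutoff for TBE); the slides do not say how the SBE and TBE decisions were derived.

```python
import numpy as np

def to_single_band(voiced):
    """SBE: one decision for the whole frame (majority vote, an assumed rule)."""
    voiced = np.asarray(voiced, bool)
    return np.full(voiced.shape, voiced.mean() >= 0.5)

def to_two_band(voiced):
    """TBE: voiced below a cutoff band, unvoiced above it.

    The cutoff is the one whose pattern disagrees with the fewest MBE
    decisions (an assumed rule); ties go to the lowest cutoff.
    """
    voiced = np.asarray(voiced, bool)
    n = len(voiced)
    # Mismatches for cutoff c: unvoiced bands below c plus voiced bands at or above c.
    mismatches = [np.sum(~voiced[:c]) + np.sum(voiced[c:]) for c in range(n + 1)]
    cutoff = int(np.argmin(mismatches))
    return np.arange(n) < cutoff

mbe = [True, True, False, True, False, False]
sbe = to_single_band(mbe)
tbe = to_two_band(mbe)
```

For this example the isolated voiced band above the cutoff is lost under TBE, and SBE voices every band; per-band detail of this kind is exactly what the simplification gives up.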
VCV: /ɑzɑ/
[Panels: Reference, MBE-modeled, SBE-modeled]
VCV: /ɑʒɑ/
[Panels: Reference, MBE-modeled]
Voice Quality: Breathy
[Panels: Reference, MBE-modeled, TBE-modeled (buzzy)]
Voice Quality: Harsh
[Panels: Reference, MBE-modeled, TBE-modeled]
Voice Quality: Rough
[Panels: Reference, MBE-modeled]
Noise Corrupted Speech (15 dB SNR)
[Panels: Reference, MBE-modeled, TBE-modeled (buzzy)]
Conclusions
• MB excitation represents frication and aspiration accurately; esp. crucial for noisy speech.
• Modulation aperiodicities are not captured at high pitches except through devoiced bands. Depending on the setting of thresholds, the noise bands may not fuse perceptually.
• It is possible to simulate partially the perceptual effects of jitter/shimmer by the controlled devoicing of bands in the t-f space.
Thank you