Copyright (c) Andreas Spanias 10-1
Title/number:
DSP
by Andreas Spanias, Ph.D.
Speech Processing
Phone: 480 965 1837, Fax: 480 965 8325
http://www.eas.asu.edu/~spanias
Topics
1. Speech Spectrum and Source System Coders
2. Speech Processing Analysis-Synthesis Algorithms
3. Historical Perspective on Algorithmic Research
4. The Standards on Speech Coding
5. Algorithm Examples
6. Remarks
Voiced and Unvoiced Speech
[Figure: two examples (tape times 3840 and 8014). Each shows a time-domain speech segment (amplitude from -1.0 to 1.0 versus time, 0 to 32 ms) and its spectrum (magnitude in dB versus frequency, 0 to 4 kHz). Annotations mark the fundamental frequency and the formant structure.]
Fine (Pitch) and Formant Structure of the Short-time Speech Spectrum
Fine Harmonic Structure: reflects the quasi-periodicity of speech and is attributed to the vibrating vocal cords. It appears as the narrow peaks of the spectrum.
Formant Structure (Spectral Envelope): is due to the interaction of the source and the vocal tract. The vocal tract consists of the pharynx and the mouth cavity. The formant structure appears as the peaks of the spectral envelope.
Formants: peaks of the spectral envelope representing the resonant modes of the vocal tract. There are typically 3-5 formants below 5 kHz. The first 3 formants, usually occurring below 3 kHz, are quite important both in speech synthesis and perception. Higher formants are important for wideband and unvoiced speech representations.
[Figure: spectral envelope with the formants f1, f2, f3, f4, and f5 marked.]
Speech Analysis/Synthesis
Speech analysis-synthesis: speech is analyzed (represented) in terms of a compact parametric set which is then used for speech synthesis. Speech coding at medium rates and below is achieved using an analysis-synthesis process.
Closed-loop analysis or analysis-by-synthesis: In closed-loop analysis, the parameters are extracted and encoded by explicitly minimizing the difference between the original and reconstructed speech. CELP-type algorithms belong to this category. Closed-loop analysis usually has high complexity.
Open-loop analysis: In open-loop analysis, the parameters are extracted and encoded without considering the difference between the original and the reconstructed speech.
Speech Synthesis Model (1)
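[Figure: source-system synthesis model. In the frequency domain the excitation spectrum is multiplied by the vocal tract response, S(z) = X(z) · 1/A(z); in the time domain this corresponds to convolving the excitation with the impulse response of the filter.]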
Simple Speech Synthesis Model (2)
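[Figure: simple speech synthesis model. A voiced/unvoiced (V/UV) switch selects the excitation, with the pitch controlling the voiced excitation; the excitation is scaled by a gain and passed through the vocal tract filter to produce synthetic speech.]
Requires “hard” (binary) voicing information.
Vocal tract filter: $H(z) = \frac{b_0}{1 - \sum_{i=1}^{M} a_i z^{-i}}$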
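The synthesis model on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `synthesize`, the choice of an impulse train for voiced frames and white Gaussian noise for unvoiced frames, the parameter values, and the sign convention H(z) = b0 / (1 - sum_i a_i z^-i) are all assumptions made for the example.

```python
import random

def synthesize(num_samples, voiced, pitch_period, gain, a, b0=1.0, seed=0):
    """Synthesize samples from a two-state (V/UV) excitation model.

    a  : assumed predictor coefficients a_1..a_M of the all-pole filter
    b0 : numerator of the assumed H(z) = b0 / (1 - sum_i a_i z^-i)
    """
    rng = random.Random(seed)
    out = []
    for n in range(num_samples):
        if voiced:
            # Voiced: impulse train with one pulse every pitch_period samples.
            x = 1.0 if n % pitch_period == 0 else 0.0
        else:
            # Unvoiced: white Gaussian noise.
            x = rng.gauss(0.0, 1.0)
        # All-pole recursion: s[n] = b0 * gain * x[n] + sum_i a_i * s[n - i]
        s = b0 * gain * x
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                s += ai * out[n - i]
        out.append(s)
    return out

# Hypothetical one-pole "vocal tract" and a 40-sample pitch period.
voiced_frame = synthesize(160, voiced=True, pitch_period=40, gain=1.0, a=[0.9])
unvoiced_frame = synthesize(160, voiced=False, pitch_period=40, gain=1.0, a=[0.9])
```

The single binary `voiced` flag is the "hard" voicing decision the slide refers to: every frame is treated as either fully periodic or fully noise-like.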
Speech Analysis-by-Synthesis (closed-loop)
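[Figure: analysis-by-synthesis block diagram. An excitation is selected or formed, scaled by a gain, and passed through the LTP and LP synthesis filters (labeled A_L(z) and A(z)) to produce the synthetic speech ŝ(n). The difference between the input speech s(n) and ŝ(n) is weighted by W(z), and the excitation is chosen to minimize the MSE.]
Synthetic speech is forced to match the input speech.
[Figure: frequency responses of the two synthesis filters, LTP and LP.]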
LTP synthesis filter example: $\frac{1}{1 - 0.95\, z^{-30}}$
LTP excited by a random signal creates pseudo-periodicity.
[Figure: impulse response of the filter, and its frequency response (magnitude response in dB versus normalized frequency, Nyquist = 1).]
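The pseudo-periodicity claim can be checked numerically. The sketch below assumes the filter on this slide is 1 / (1 - 0.95 z^-30), drives it with white Gaussian noise, and looks for the lag at which the output is most strongly correlated with itself. The helper names `ltp_synthesize` and `autocorr`, the seed, and the signal length are choices made for the illustration.

```python
import random

def ltp_synthesize(excitation, lag=30, gain=0.95):
    """Filter by 1 / (1 - gain * z^-lag): y[n] = x[n] + gain * y[n - lag]."""
    y = []
    for n, x in enumerate(excitation):
        past = y[n - lag] if n >= lag else 0.0
        y.append(x + gain * past)
    return y

def autocorr(x, k):
    """Unnormalized autocorrelation of x at lag k."""
    return sum(x[n] * x[n - k] for n in range(k, len(x)))

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]
y = ltp_synthesize(noise)

r0 = autocorr(y, 0)
normalized = {k: autocorr(y, k) / r0 for k in range(1, 61)}
# Lag (between 10 and 50 samples) with the strongest self-similarity.
best_lag = max(range(10, 51), key=normalized.get)
```

Although the input noise has no periodic structure, the output correlates strongly with itself at the filter delay, which is the pseudo-periodicity the LTP contributes to the synthetic excitation.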
Subjective Speech Quality
Broadcast: Broadcast wideband speech refers to high quality “commentary” speech at rates above 64 kbits/s.
Network or toll: Toll or network quality refers to quality comparable to the classical analog speech (200-3200 Hz).
Communications: Communications quality implies somewhat degraded speech quality but adequate for cellular communications.
Synthetic: Synthetic speech is usually intelligible but can be unnatural and associated with a loss of speaker recognizability.
The Mean Opinion Score
MOS Scale Speech Quality
1 Bad
2 Poor
3 Fair
4 Good
5 Excellent
The Mean Opinion Score (2)
The MOS range relates to speech quality as follows:
MOS 4.0 - 4.5: network or toll quality
MOS 3.5 - 4.0: communications quality
MOS 2.5 - 3.5: synthetic quality
Remarks: MOS ratings may differ significantly from test to test and hence they are not absolute measures for the comparison of different coders.
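As a small illustration, the ranges above can be written as a lookup. The function name `quality_class` is hypothetical, and because the listed ranges share their end points, assigning a boundary value (for example 4.0) to the higher class is an assumption of this sketch.

```python
def quality_class(mos):
    """Map a MOS value to the quality label listed on the slide."""
    if not 1.0 <= mos <= 5.0:
        raise ValueError("MOS must lie on the 1-5 scale")
    if 4.0 <= mos <= 4.5:
        return "network or toll quality"
    if 3.5 <= mos < 4.0:
        return "communications quality"
    if 2.5 <= mos < 3.5:
        return "synthetic quality"
    return "outside the listed ranges"

labels = [quality_class(m) for m in (4.1, 3.8, 3.2, 2.3)]
```

Given the remark that MOS ratings vary from test to test, such a mapping is only a rough guide when scores come from different listening tests.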
First Generation Analysis-by-Synthesis LPC
This class includes: IS-54 VSELP, RPE-LTP GSM, FS-1016, LD-CELP G.728, IS-96 QCELP
Mostly Encode Reflection Coefficients or LARs
Employ for the most part full searches of the code books and LTPs
High MIPS (most of them 20 MIPS+)
Modest MOS (~ 3.5)
Code Excited Linear Prediction (CELP)
- produced low-rate coded speech comparable to that of medium-rate waveform coders
- bridged the gap between waveform coders and vocoders
- codebook originally consisted of Gaussian sequences: 1024 vectors, each 40 samples (5 ms) long
- gain scales the excitation vector, and the excitation is filtered by the LTP and LP synthesis filters
- “optimum” vector is selected such that the perceptually weighted MSE is minimized
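[Figure: CELP block diagram. A codebook vector x_c(n), chosen by the VQ index and scaled by the gain g, is passed through the LTP and LP synthesis filters (labeled A_L(z) and A(z)) and the weighting filter W(z) to give the weighted synthetic speech ŝ_w(n). The input speech s(n) is weighted by W(z) to give s_w(n). The error e_c(n) between s_w(n) and ŝ_w(n) drives the error minimization block that selects the codebook entry.]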
Code Excited Linear Prediction (2)
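The N×1 error vector is
$e_c(k) = s_w - \hat{s}_w^0 - g_k\, \hat{s}_w(k)$
where $\hat{s}_w^0$ is the output due to the initial filter state.
Writing $\bar{s}_w = s_w - \hat{s}_w^0$ and minimizing $\varepsilon_c(k) = e_c^T(k)\, e_c(k)$ w.r.t. $g_k$ we get
$g_k = \frac{\bar{s}_w^T\, \hat{s}_w(k)}{\hat{s}_w^T(k)\, \hat{s}_w(k)}$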
Code Excited Linear Prediction (3)
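$\varepsilon_c(k) = \bar{s}_w^T\, \bar{s}_w - \frac{\left(\bar{s}_w^T\, \hat{s}_w(k)\right)^2}{\hat{s}_w^T(k)\, \hat{s}_w(k)}$, with $\bar{s}_w = s_w - \hat{s}_w^0$
The k-th excitation vector, $x_c(k)$, that minimizes $\varepsilon_c(k)$ is selected.
Closed-loop analysis is used for the LTP parameters; the range of values for the lag is within the integers 20 to 147.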
M.R. Schroeder and B. Atal, "Code-Excited Linear Prediction (CELP): High Quality Speech at Very Low Bit Rates," Proc. ICASSP-85, p. 937, Tampa, Apr. 1985.
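The codebook search described on the CELP slides can be sketched as follows. For each candidate, the optimal gain is the projection of the target onto the filtered codevector, and the winner is the candidate that leaves the smallest residual error. This is a simplified illustration: the names `dot` and `search_codebook` are hypothetical, the target is taken to be the weighted speech with the initial-state filter response already removed, and the codevectors are assumed to have been passed through the weighted synthesis filters beforehand (that filtering step is omitted here).

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def search_codebook(target, filtered_codebook):
    """Return (index, gain, error) of the best codevector.

    target            : weighted speech minus the initial-state response
    filtered_codebook : candidate vectors after the (assumed) synthesis
                        and weighting filters
    """
    best = None
    energy = dot(target, target)
    for k, y in enumerate(filtered_codebook):
        yy = dot(y, y)
        if yy == 0.0:
            continue  # an all-zero response cannot match anything
        ty = dot(target, y)
        gain = ty / yy                  # optimal gain for this candidate
        error = energy - ty * ty / yy   # squared error left after optimal scaling
        if best is None or error < best[2]:
            best = (k, gain, error)
    return best

# Gaussian codebook of 1024 vectors, 40 samples each, as on the CELP slide.
rng = random.Random(1)
codebook = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(1024)]
target = [2.5 * v for v in codebook[7]]   # synthetic target for the demo
index, gain, error = search_codebook(target, codebook)
```

Because the demo target is an exact scaled copy of entry 7, the search recovers that index with a gain of about 2.5 and an error close to zero. An exhaustive loop like this over every codevector in every subframe is what makes full-search CELP computationally expensive.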
The IS-54 and GSM VSELP
- developed by Motorola - part of IS-54 and GSM cellular standards
- speech sampled at 8 kHz - segmented in 20ms frames - sub-frames of 5 ms
- complexity estimated at 30 MIPS - MOS 3.45
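[Figure: VSELP block diagram. The long-term filter state (addressed by the lag index) and two codebooks (addressed by the VQ-1 and VQ-2 indices) supply excitation vectors that are scaled by the gains ga, g1, and g2 and summed; the sum is passed through the synthesis filter A(z) and a postfilter to produce speech.]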
I. Gerson and M. Jasiuk, "Vector Sum Excited Linear Prediction (VSELP) Speech Coding at 8 kbits/s," Proc. ICASSP-90, pp. 461-464, New Mexico, Apr. 1990.
A. Spanias, M. Deisher, P. Loizou and G. Lim, "Fixed-Point Implementation of the VSELP algorithm," ASU-TRC Technical Report, TRC-SP-ASP-9201, July 1992.
Second Generation Analysis-by-Synthesis LPC
This class includes: G.723.1, G.729, CDMA EVRC IS-127, GSM EFR, IS-641
Encode Line Spectrum Pairs (LSP) using Split Vector Quantizers
Employ for the most part partial searches of the LTPs; usually an open-loop estimate is refined by a closed-loop search around the neighborhood of the estimate
Codebooks have Algebraic structure (ACELP)
High MIPS (most of them 20 MIPS)
Provisions for channel errors
Very Good MOS (~ 3.8+)
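The two-stage LTP search mentioned above (an open-loop estimate refined by a closed-loop search in its neighborhood) can be sketched as below. This is an illustrative simplification rather than the procedure of any particular standard: the function names, the search radius of 3, the lag range 20 to 147, and the test signal are assumptions, and the closed-loop stage here matches the delayed past excitation directly against the target instead of first passing it through weighted synthesis filters.

```python
import math
import random

def open_loop_lag(signal, lag_min=20, lag_max=147):
    """Coarse estimate: lag with the largest normalized autocorrelation."""
    def score(lag):
        num = sum(signal[n] * signal[n - lag] for n in range(lag, len(signal)))
        den = sum(signal[n - lag] ** 2 for n in range(lag, len(signal)))
        return num / math.sqrt(den) if den > 0.0 else 0.0
    return max(range(lag_min, lag_max + 1), key=score)

def closed_loop_lag(target, past_excitation, estimate, radius=3,
                    lag_min=20, lag_max=147):
    """Refine the estimate by testing only the lags near it."""
    n = len(target)
    best_lag, best_score = estimate, float("-inf")
    lo = max(lag_min, estimate - radius)
    hi = min(lag_max, estimate + radius)
    for lag in range(lo, hi + 1):
        # Candidate: past excitation delayed by `lag` (repeated if lag < n).
        start = len(past_excitation) - lag
        cand = [past_excitation[start + (i % lag)] for i in range(n)]
        num = sum(t * c for t, c in zip(target, cand))
        den = sum(c * c for c in cand)
        # Maximizing num^2 / den minimizes the error under the optimal gain.
        if den > 0.0 and num > 0.0 and num * num / den > best_score:
            best_lag, best_score = lag, num * num / den
    return best_lag

# Hypothetical test signal: one random 57-sample period, repeated.
rng = random.Random(3)
period = [rng.gauss(0.0, 1.0) for _ in range(57)]
signal = [period[n % 57] for n in range(400)]

coarse = open_loop_lag(signal)
past, target = signal[:240], signal[240:280]
# Start from a deliberately offset estimate to show the refinement.
refined = closed_loop_lag(target, past, estimate=coarse - 2)
```

The closed-loop stage evaluates 7 lags here instead of all 128, which illustrates why partial searches reduce complexity relative to a full closed-loop search.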
The CDMA IS-127 Enhanced Variable Rate Coder (EVRC) Algorithm
- It is an RCELP (Relaxed CELP) algorithm
- Differs from classical CELP in that a time-warped (downsampled) version of the residual is matched instead of the actual speech
- Operates at 3 rates, 9.6/4.8/1.2 kbits/s, and also blank
- Unless a rate is requested by the network, the rate is determined based on voice activity
- Upon command it may generate blank or Rate 1/2
- includes an FFT-based speech enhancement pre-processing block
- Estimated Pitch at higher rates has to conform with a pitch contour
- No pitch estimation at 1.2 kbits/s
- LPC coefficients encoded as LSPs - subframe LSPs by interpolation
- The random codebook is searched using Algebraic CELP techniques
- includes postfilters
- MOS 3.8 at 9.6 kbits/s and Complexity around 20 MIPS
THE GSM ENHANCED FULL-RATE (EFR) CODER
- Bit rate 12.2 kbits/s
- Speech is sampled at 8 kHz and segmented into 20 ms frames (160 samples)
- 10 LPC parameters determined by Levinson-Durbin and vector quantized as LSPs
- Subframes are 5 ms each
- Uses an algebraic codebook
- The pitch is first estimated open loop and refined using a closed-loop search, much like the IS-641 pitch search
GSM: Enhanced Full Rate Speech Transcoder, ETSI GSM 6.60, Nov. 1996
Vendors for EFR GSM Coder (figures are approximate - check with the vendor for more accurate estimates)
- VLSI VWS22030 based on the DSPGroup OakDSPcore, contact VLSI Inc. (www.vlsi.com)
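The Levinson-Durbin step named on this slide can be sketched generically as follows. This is the textbook recursion, not code from the GSM EFR specification: the function name, the convention A(z) = 1 - sum_i a_i z^-i, the sign convention of the reflection coefficients, and the toy autocorrelation sequence are all assumptions of the example.

```python
def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations recursively.

    r : autocorrelation values r[0..order]
    Returns (a, k, err): predictor coefficients a_1..a_M for the assumed
    A(z) = 1 - sum_i a_i z^-i, reflection coefficients k_1..k_M, and the
    final prediction-error energy.
    """
    a = [0.0] * (order + 1)   # a[0] is unused
    k = [0.0] * (order + 1)
    err = r[0]
    for m in range(1, order + 1):
        # Correlation not yet explained by the order-(m-1) predictor.
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        k[m] = acc / err
        new_a = a[:]
        new_a[m] = k[m]
        for i in range(1, m):
            new_a[i] = a[i] - k[m] * a[m - i]
        a = new_a
        err *= 1.0 - k[m] ** 2
    return a[1:], k[1:], err

# Toy autocorrelation of a first-order process with correlation 0.9.
r = [0.9 ** i for i in range(11)]
a, k, err = levinson_durbin(r, 10)
```

For this toy input the 10th-order solution collapses to a single nonzero coefficient, a_1 = 0.9, with an error energy of 0.19, as expected for a first-order process. In a coder, `r` would come from the windowed autocorrelation of each 20 ms frame.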
Third Generation Vocoders
Some Recent and Ongoing Standardization Efforts
CDMA 2000 (supports next generation data services envisioned up to 2 Mb/s)
GSM AMR - Adaptive Multirate Speech Coder (multiple coders)
ITU-4 - ITU Standardization Efforts for 4kb/s (on-going)
CDMA SMV - Selectable Mode Vocoder for the next generation CDMA
Wideband CDMA
Objective to meet IMT 2000 requirements (at least 144 Kb/s in a vehicular environment, 384 Kb/s in a pedestrian environment, and 2048 Kb/s in an indoor office environment)
To support next generation data services envisioned up to 2 Mb/s (full coverage and mobility for 144 Kb/s, preferably 384 Kb/s; limited coverage and mobility for 2 Mb/s)
Enhanced Voice Services (audioconferencing & voice mail)
Concurrent high-quality video/audio
Backward compatible with IS-95B
high security & low power
Significantly enhanced version of EVRC for voice services
- http://www.comsoc.org/pubs/surveys/4q98issue/prasad.html
- D. Knisely et al., "Evolution of Wireless Data Services: IS-95 to CDMA 2000," IEEE Communications Magazine, pp. 140-149, October 1998
- Vijay K. Garg (University of Illinois, Chicago, Illinois), IS-95 CDMA and cdma2000: Cellular/PCS Systems Implementation, 1/e, Prentice Hall PTR (ECS Professional), December 1999
GSM Adaptive Multirate Coder
Adjusts its bit-rate according to network load
Rates: 12.2, 10.2, 7.95, 6.7, 5.9, 5.15, 4.75 kb/s
Based on CELP with 20 ms frames and 5 ms subframes
Multirate ACELP with 10th-order short-term LPC and perceptual weighting (uses Levinson-Durbin)
Encodes LSPs using split VQ
An open loop LTP is first obtained and refined by closed loop
Highest bit rate provides toll quality & half rate provides communications quality
- ETSI TS 126 090 V.3.1.0 2000-01 - AMR SPEECH CODEC TRANSCODING FUNCTIONS 3G-TS 26.090 Technical Specification
- R. Ekudden, R. Hagen, I. Johansson, and J. Svedburg, "The Adaptive Multi-Rate speech coder," Proc. IEEE Workshop on Speech Coding, pp. 117-119, 1999
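As a worked example of what the listed rates mean per frame, the bit budget of one 20 ms frame is simply the bit rate multiplied by the frame duration. The names below are illustrative, and rates are written in bits/s so the arithmetic stays in integers.

```python
# AMR rates from the slide, expressed in bits per second.
AMR_RATES_BPS = [12200, 10200, 7950, 6700, 5900, 5150, 4750]
FRAME_MS = 20  # frame length stated on the slide

def bits_per_frame(rate_bps, frame_ms=FRAME_MS):
    """Number of coded bits carried by one frame at the given bit rate."""
    return rate_bps * frame_ms // 1000

frame_bits = {rate: bits_per_frame(rate) for rate in AMR_RATES_BPS}
```

At 12.2 kb/s each frame carries 244 bits, and at 4.75 kb/s only 95 bits, so the lowest mode must describe the same 160 speech samples with well under half the bits of the highest mode.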
The Selectable Mode Vocoder
• Algorithm to provide higher quality, flexibility, and capacity over existing IS-96C, IS-127 EVRC, and IS-733 (that replaced IS-96C but working at higher average rate)
• The Conexant SMV algorithm became the core technology for 3G CDMA (core SMV algorithm to be refined in the interim by participating companies according to the publication below)
• Based on 4 codecs: full rate at 8.5 kbps, half rate at 4 kbps, quarter rate at 2 kbps, and eighth rate at 800 bps
• Pre-processing includes noise suppression similar to IS 127 EVRC
• Full rate and half rate are based on Conexant's eXtended CELP (eX-CELP), a core technology also used in the ITU G.4 Conexant submission to ITU-4
• Performed better than IS-733 and IS-127 in tests with and without background noise
• Scored as high as 4.1 MOS at full rate with clean speech. Performed very well with background noise
REFERENCES:
[1] Y. Gao, E. Schlomot, A. Benyassine, J. Thyssen, H. Su, and C. Murgia, “The SMV algorithm selected for TIA and 3GPP2 for CDMA applications,” conference paper by Conexant Systems (portions published at ICASSP-2001)
STANDARDS AT A GLANCE
• ITU Wideband Coding
  – G.722 Coding of 7 kHz speech at 64, 56, 48 kbps - Sub-band ADPCM
  – G.WB1 Coding of 7 kHz speech at 32/24 kbps - Combined Transform and CELP coding
  – G.WB2 Coding of 7 kHz speech at 16 kbps or less (ongoing)
• ITU Telephony
  – G.711 PCM (64 kbps) late 60’s
  – G.726 ADPCM (32/40/24/16 kbps) 1988
  – G.728 LD-CELP coding (16 kbps) 1992
  – G.723.1 True Speech (5.3/6.3 kbps) 1995
  – G.729 CS-ACELP (8/12.8/6.4 kbps) 1996 and Annex in 1998
  – G.4kbps Toll quality at 4 kbps (ongoing)
• Non-ITU
  – MPEG1/Audio (includes MP3), 1991
  – MPEG2/Audio: 64 kbps (1992)
  – MPEG4/Audio: audio/speech coding at bit rates between 64 and 2 kbps (1998)
  – MPEG7/Audio: audio/speech/MIDI coding (ongoing)
STANDARDS AT A GLANCE (2)
• TIA
  – CDMA
    • IS96 8, 4, 2 kbps Q-CELP (Qualcomm CELP, 1992)
    • IS127 8.55, 4, 0.8 kbps EVRC (Enhanced Variable Rate Coder, 1996)
    • IS733 13.3, 6.2, 2.7, 1 kbps VRC (Variable Rate Coder, 1998)
    • 3GPP2 0.8-8.55 kbps SMV (Selectable Mode Vocoder, 2001)
  – TDMA
    • IS54 7.95 kbps VSELP (Vector-Sum Excited Linear Prediction, 1989)
    • IS641 7.4 kbps CELP (Similar to EFR but at lower rate, 1997)
  – PCS1800 (GSM variant working at 1800 MHz)
    • IS136-410 12.2 kbps US1 (1999)
• ETSI (GSM):
  – 13 kbps RPE-LTP (Full rate GSM, 1988)
  – 6.5 kbps VSELP (Half-rate GSM, 1993)
  – 12.2 kbps EFR (Enhanced full-rate GSM, 1996)
  – 12.2 - 4.75 kbps AMR (Adaptive Multi Rate, 1999)
• ARIB Japan
  – Full-rate PDC (Personal Digital Communication) 6.7 kbps VSELP
  – Half-rate PDC 3.45 kbps Multimode CELP
Vocoder/Waveform/Hybrid
[Figure: MOS (1-5) versus bit rate (kbps; axis ticks at 1, 2, 4, 8, 16, 32, 64) for three coder classes: vocoders (LPC10e, MELP), hybrid coders (CELP, SMV), and waveform coders (ADPCM, PCM).]
PERFORMANCE OF SOME STANDARDIZED ALGORITHMS
Algorithm              Bit rate (kbits/s)   MOS           Complexity (MIPS)   Frame size (ms)
PCM G.711              64                   4.3           0.01                0
ADPCM G.726            32                   4.1           2+                  0.125
SBC G.722              48/56/64             4.1           5                   0.125
LD-CELP G.728          16                   4             ~30                 0.625
CS-ACELP G.729         8                    4             ~20                 10
CS-ACELP-A G.729       8                    3.76          11                  10
MPC-MLQ G.723.1        6.3/5.3              3.98/3.7      ~16                 30
GSM FR RPE-LTP         13                   3.7 (ave)     5                   20
GSM EFR                13                   4             14                  20
GSM HR VSELP           6.3                  ~3.4          14                  20
IS-54 VSELP            8                    3.5           14                  20
IS-641 EFR             8                    3.8           14                  20
Conexant eX-CELP SMV   8.55/4/2/0.8         ~4.1 (8.55)   ~20                 20
IS-96 QCELP            1.2/2.4/4.8/9.6      3.33 (9.6)    15                  20
IS-127 EVRC            1.2/4.8/9.6          ~3.8 (9.6)    20                  20
PDC VSELP              6.3                  3.5           14                  20
PDC PCI-CELP           3.45                 ~3.4          ~48                 40
FS 1015 – LPC 10e      2.4                  2.3           7                   22.5
FS 1016 – CELP 4.8     4.8                  3.2           16                  30
MELP                   2.4                  3.2           ~30                 22.5
Inmarsat-B APC         9.6/12.8             ~3.1/3.4      10                  20
Inmarsat-M IMBE        6.3                  3.4           ~13                 20