Lecture on Speech Processing

Page 1: Lecture on Speech Processing

Copyright (c) Andreas Spanias 10-1

Title/number:

DSP

by Andreas Spanias, Ph.D.

Speech Processing

[email protected]

Phone: 480 965 1837, Fax: 480 965 8325

http://www.eas.asu.edu/~spanias


Topics

1. Speech Spectrum and Source System Coders

2. Speech Processing Analysis-Synthesis Algorithms

3. Historical Perspective on Algorithmic Research

4. The Standards on Speech Coding

5. Algorithm Examples

6. Remarks


Voiced and Unvoiced Speech

[Figure: two time-domain speech segments (amplitude -1.0 to 1.0 vs. time, 0 to 32 ms; tape times 3840 and 8014), each shown with its short-time magnitude spectrum (dB vs. frequency, 0 to 4 kHz). Annotations mark the fundamental frequency and the formant structure.]


Fine (Pitch) and Formant Structure of the Short-time Speech Spectrum

Fine Harmonic Structure: reflects the quasi-periodicity of speech and is attributed to the vibrating vocal cords. (Note the narrow peaks.)

Formant Structure (Spectral Envelope): is due to the interaction of the source and the vocal tract. The vocal tract consists of the pharynx and the mouth cavity. (Note the envelope peaks.)


Formants: peaks of the spectral envelope representing the resonant modes of the vocal tract. There are 3-5 formants below 5 kHz. The first 3 formants, usually occurring below 3 kHz, are quite important both in speech synthesis and perception. Higher formants are important for wideband and unvoiced speech representations.

[Figure: spectral envelope with the formants f1 to f5 marked]
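One common way to locate formants is to fit an all-pole (LPC) envelope and pick its peaks. The sketch below is illustrative only. It assumes a synthetic frame (the impulse response of a hypothetical three-resonance filter at 500, 1500 and 2500 Hz), the autocorrelation method solved with the Levinson-Durbin recursion, and a grid search for envelope maxima. The helper names, resonance values and sampling rate are assumptions, not taken from the slides.

```python
import cmath
import math

FS = 8000  # Hz, assumed sampling rate

def resonator_coeffs(freqs_hz, bandwidths_hz):
    """A(z) = product of (1 - 2 r cos(w) z^-1 + r^2 z^-2), returned as [1, a1, ...]."""
    a = [1.0]
    for f, bw in zip(freqs_hz, bandwidths_hz):
        r = math.exp(-math.pi * bw / FS)
        w = 2 * math.pi * f / FS
        section = [1.0, -2 * r * math.cos(w), r * r]
        out = [0.0] * (len(a) + 2)
        for i, ai in enumerate(a):
            for j, sj in enumerate(section):
                out[i + j] += ai * sj
        a = out
    return a

def all_pole_filter(a, x):
    """y[n] = x[n] - sum_{k>=1} a[k] * y[n-k], starting from zero state."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k in range(1, min(n, len(a) - 1) + 1):
            acc -= a[k] * y[n - k]
        y.append(acc)
    return y

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations; returns A(z) as [1, a1, ...]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def envelope_peaks(a, n_points=800):
    """Frequencies (Hz) of local maxima of |1/A(e^{jw})| on a grid up to FS/2."""
    mags = []
    for m in range(n_points + 1):
        w = math.pi * m / n_points
        denom = sum(ak * cmath.exp(-1j * w * k) for k, ak in enumerate(a))
        mags.append(1.0 / abs(denom))
    return [m * (FS / 2) / n_points
            for m in range(1, n_points)
            if mags[m] > mags[m - 1] and mags[m] > mags[m + 1]]

true_a = resonator_coeffs([500.0, 1500.0, 2500.0], [60.0, 90.0, 120.0])
frame = all_pole_filter(true_a, [1.0] + [0.0] * 1023)
r = [sum(frame[n] * frame[n + k] for n in range(len(frame) - k)) for k in range(7)]
lpc = levinson_durbin(r, order=6)
formants = envelope_peaks(lpc)
print([round(f) for f in formants])
```

The LPC order of 6 matches the three assumed resonances (two poles each). Real speech frames are usually windowed and analysed with a somewhat higher order.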


Speech Analysis/Synthesis

Speech analysis-synthesis: speech is analyzed (represented) in terms of a compact parametric set, which is then used for speech synthesis. Speech coding at medium rates and below is achieved using an analysis-synthesis process.

Closed-loop analysis or analysis-by-synthesis: In closed-loop analysis, the parameters are extracted and encoded by explicitly minimizing the difference between the original and reconstructed speech. CELP-type algorithms belong to this category. Closed-loop analysis usually has high complexity.

Open-loop analysis: In open-loop analysis, the parameters are extracted and encoded without considering the difference between the original and the reconstructed speech.


Speech Synthesis Model (1)

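[Figure: source-system synthesis model. In the frequency domain the excitation spectrum is multiplied by the filter response, S(z) = X(z) · 1/A(z); in the time domain the excitation is convolved with the filter impulse response.]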


Simple Speech Synthesis Model (2)

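[Figure: an excitation controlled by the pitch and the V/UV decision is scaled by a gain and drives the VOCAL TRACT FILTER, which outputs SYNTHETIC SPEECH]

Requires "hard" (binary) voicing info (V/UV)

$H(z) = \dfrac{b_0}{1 - \sum_{i=1}^{M} a_i z^{-i}}$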
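The simple synthesis model can be sketched in a few lines. This is an illustration under stated assumptions, not the slide's implementation. The excitation generators (an impulse train for voiced frames, white noise for unvoiced frames) are the conventional choices and are assumed here. The filter uses the sign convention H(z) = gain / (1 - sum_i a_i z^-i), and the single 500 Hz resonance is hypothetical.

```python
import math
import random

FS = 8000  # Hz, assumed sampling rate

def excitation(voiced, pitch_period, n_samples, rng):
    """Hard V/UV switch: impulse train if voiced, white Gaussian noise otherwise."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def synthesize(exc, gain, a):
    """All-pole filter: y[n] = gain * x[n] + sum_i a[i-1] * y[n-i]."""
    y = []
    for n, x in enumerate(exc):
        acc = gain * x
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc += ai * y[n - i]
        y.append(acc)
    return y

# Hypothetical vocal tract with one resonance at 500 Hz (pole radius 0.95).
R, F_RES = 0.95, 500.0
w = 2 * math.pi * F_RES / FS
a = [2 * R * math.cos(w), -R * R]

rng = random.Random(0)
voiced = synthesize(excitation(True, 64, 512, rng), gain=1.0, a=a)
unvoiced = synthesize(excitation(False, 64, 512, rng), gain=0.1, a=a)
```

After the start-up transient, the voiced output repeats every 64 samples (a 125 Hz pitch at 8 kHz), while the unvoiced output is noise shaped by the same resonance.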


Speech Analysis-by-Synthesis (closed-loop)

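[Figure: closed-loop analysis-by-synthesis. A selected or formed excitation is scaled by a gain and passed through the LTP synthesis filter 1/A_L(z) and the LP synthesis filter 1/A(z) to produce ŝ(n). The error s(n) - ŝ(n) is weighted by W(z) and its MSE is minimized.]

Synthesized speech is forced to match the input speech

[Figure: frequency responses of the two synthesis filters (LTP and LP)]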


LTP synthesis filter: $1/A_L(z) = \dfrac{1}{1 - a z^{-\tau}}$

LTP excited by a random signal creates pseudo-periodicity

[Figure: impulse response and frequency response (magnitude in dB vs. normalized frequency, Nyquist = 1) of the LTP synthesis filter]
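The pseudo-periodicity produced by an LTP synthesis filter can be checked numerically. The sketch below assumes a one-tap filter of the form 1/(1 - a z^-lag) with hypothetical values a = 0.9 and lag = 40 samples. These values are chosen for illustration and are not read from the slide.

```python
import random

def ltp_synthesis(x, a, lag):
    """One-tap long-term predictor synthesis: y[n] = x[n] + a * y[n - lag]."""
    y = []
    for n, xn in enumerate(x):
        y.append(xn + (a * y[n - lag] if n >= lag else 0.0))
    return y

def normalized_autocorr(x, lag):
    """Autocorrelation at `lag`, normalized by the signal energy."""
    num = sum(x[n] * x[n + lag] for n in range(len(x) - lag))
    den = sum(v * v for v in x)
    return num / den

A, LAG = 0.9, 40  # hypothetical LTP gain and lag (samples)

# Impulse response: nonzero only at multiples of LAG, with values 1, a, a^2, ...
h = ltp_synthesis([1.0] + [0.0] * 199, A, LAG)

# White noise through the LTP becomes strongly correlated at the lag.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]
y = ltp_synthesis(noise, A, LAG)
rho_at_lag = normalized_autocorr(y, LAG)
rho_off_lag = normalized_autocorr(y, LAG // 2)
```

The correlation at the lag approaches the filter gain a, while the correlation at half the lag stays near zero. This is the comb-like behaviour that the frequency response shows as equally spaced peaks.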


Subjective Speech Quality

Broadcast: wideband speech refers to high-quality "commentary" speech at rates above 64 kbits/s.

Network or toll: quality comparable to classical analog speech (200-3200 Hz).

Communications: somewhat degraded speech quality, but adequate for cellular communications.

Synthetic: speech is usually intelligible but can be unnatural and associated with a loss of speaker recognizability.


The Mean Opinion Score

MOS Scale Speech Quality

1 Bad

2 Poor

3 Fair

4 Good

5 Excellent


The Mean Opinion Score (2)

The MOS range relates to speech quality as follows:

MOS 4.0 - 4.5: network or toll quality

MOS 3.5 - 4.0: communications quality

MOS 2.5 - 3.5: synthetic quality

Remarks: MOS ratings may differ significantly from test to test; hence they are not absolute measures for comparing different coders.
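As a small worked example, a MOS is the arithmetic mean of listener ratings on the 1 to 5 scale. The sketch below computes it and maps it to the ranges listed above. The listener ratings are made up for illustration, and assigning a shared boundary value (3.5 or 4.0) to the higher category is an assumption, since the listed ranges overlap at their endpoints.

```python
from statistics import mean

def mean_opinion_score(ratings):
    """Average of listener ratings on the 1 (Bad) to 5 (Excellent) scale."""
    if not ratings or any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(ratings)

def quality_category(mos):
    """Map a MOS to the listed ranges; boundary values go to the higher class (assumption)."""
    if 4.0 <= mos <= 4.5:
        return "network or toll quality"
    if 3.5 <= mos < 4.0:
        return "communications quality"
    if 2.5 <= mos < 3.5:
        return "synthetic quality"
    return "outside the listed ranges"

ratings = [4, 4, 5, 4, 3, 4, 5, 4]   # hypothetical listener scores
mos = mean_opinion_score(ratings)     # 33 / 8 = 4.125
print(mos, quality_category(mos))
```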


First Generation Analysis-by-Synthesis LPC

This class includes: IS-54 VSELP, RPE-LTP GSM, FS-1016, LD-CELP G.728, IS-96 QCELP

Mostly Encode Reflection Coefficients or LARs

Employ for the most part full searches of the code books and LTPs

High MIPS (most of them 20 MIPS+)

Modest MOS (~ 3.5)


Code Excited Linear Prediction (CELP)

- produced low-rate coded speech comparable to that of medium-rate waveform coders

- bridged the gap between waveform coders and vocoders

- codebook originally consisted of Gaussian sequences; 1024 vectors, each 40 samples (5 ms) long

- the gain scales the excitation vector, and the excitation is filtered by the LTP and LP synthesis filters

- the "optimum" vector is selected such that the perceptually weighted MSE is minimized.

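[Figure: CELP analysis-by-synthesis. A codebook vector x_c(n), selected by the VQ index and scaled by the gain g, is passed through the synthesis filters 1/A_L(z) and 1/A(z). The synthetic speech and the input speech s(n) are each weighted by W(z) to give ŝ_w(n) and s_w(n), and their difference e_c(n) drives the error minimization.]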


Code Excited Linear Prediction (2)

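The N×1 error vector:

$e_c^{(k)} = s_w - \hat{s}_w^{0} - g_k \hat{s}_w^{(k)}$

where $\hat{s}_w^{0}$ is the output due to the initial filter state.

Minimizing $\epsilon_c^{(k)} = e_c^{(k)T} e_c^{(k)}$ w.r.t. $g_k$, with $\bar{s}_w = s_w - \hat{s}_w^{0}$, we get

$g_k = \dfrac{\bar{s}_w^{T} \hat{s}_w^{(k)}}{\hat{s}_w^{(k)T} \hat{s}_w^{(k)}}$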


Code Excited Linear Prediction (3)

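$\epsilon_c^{(k)} = \bar{s}_w^{T} \bar{s}_w - \dfrac{\left(\bar{s}_w^{T} \hat{s}_w^{(k)}\right)^{2}}{\hat{s}_w^{(k)T} \hat{s}_w^{(k)}}$, where $\bar{s}_w = s_w - \hat{s}_w^{0}$

The k-th excitation vector, $x_c^{(k)}$, that minimizes $\epsilon_c^{(k)}$ is selected

closed-loop analysis is used for the LTP parameters; the range of values for the LTP lag is within the integers 20 to 147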
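The codebook search can be sketched as an exhaustive loop over the codebook: filter each candidate, compute its optimal gain, and keep the index with the smallest error. This is a simplified illustration, not the CELP coder of the cited paper. It assumes a first-order synthesis filter with a hypothetical coefficient, omits the LTP and the perceptual weighting W(z), and treats `target` as the weighted speech with the zero-input response already removed. The codebook size (1024 Gaussian vectors of 40 samples) follows the slide; everything else is an assumption.

```python
import random

def synth_filter(x, a):
    """Zero-state all-pole filter: y[n] = x[n] + sum_i a[i-1] * y[n-i]."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc += ai * y[n - i]
        y.append(acc)
    return y

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def celp_search(target, codebook, a):
    """Return (index, gain, error) minimizing ||target - gain * filtered codevector||^2."""
    target_energy = dot(target, target)
    best = None
    for k, code in enumerate(codebook):
        s_hat = synth_filter(code, a)       # filtered k-th codevector
        energy = dot(s_hat, s_hat)
        corr = dot(target, s_hat)
        gain = corr / energy                # optimal gain for this codevector
        err = target_energy - corr * corr / energy
        if best is None or err < best[2]:
            best = (k, gain, err)
    return best

rng = random.Random(1)
N, K = 40, 1024                             # vector length and codebook size
codebook = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(K)]
a = [0.9]                                   # hypothetical first-order synthesis filter

# Plant a known answer: the target is codevector 123, filtered and scaled by 2.5.
target = [2.5 * v for v in synth_filter(codebook[123], a)]
index, gain, err = celp_search(target, codebook, a)
print(index, round(gain, 3))
```

The full search costs one filtering operation and two inner products per codevector, which is the main reason closed-loop coders of this kind are computationally demanding.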

M.R. Schroeder and B. Atal, "Code-Excited Linear Prediction (CELP): High Quality Speech at Very Low Bit Rates," Proc. ICASSP-85, p. 937, Tampa, Apr. 1985.


The IS-54 and GSM VSELP

- developed by Motorola; part of the IS-54 and GSM cellular standards

- speech sampled at 8 kHz and segmented into 20 ms frames with sub-frames of 5 ms

- complexity estimated at 30 MIPS; MOS 3.45

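[Figure: VSELP excitation and synthesis. The long-term filter state (lag index), Codebook 1 (VQ-1 index) and Codebook 2 (VQ-2 index) are scaled by the gains ga, g1 and g2 and summed; the sum is passed through the A(z) synthesis block and a postfilter to produce the output speech.]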

I. Gerson and M. Jasiuk, "Vector Sum Excited Linear Prediction (VSELP) Speech Coding at 8 kbits/s," Proc. ICASSP-90, pp. 461-464, New Mexico, Apr. 1990.

A. Spanias, M. Deisher, P. Loizou and G. Lim, "Fixed-Point Implementation of the VSELP algorithm," ASU-TRC Technical Report, TRC-SP-ASP-9201, July 1992.


Second Generation Analysis-by-Synthesis LPC

This class includes: G.723.1, G.729, CDMA EVRC IS-127, GSM EFR, IS-641

Encode Line Spectrum Pairs (LSP) using Split Vector Quantizers

Employ for the most part partial searches of the LTPs; usually an open-loop estimate is refined by a closed-loop search in the neighborhood of the estimate

Codebooks have Algebraic structure (ACELP)

High MIPS (most of them 20 MIPS)

Provisions for channel errors

Very Good MOS (~ 3.8+)


The CDMA IS-127 Enhanced Variable Rate Coder (EVRC) Algorithm

- It is an RCELP (Relaxed CELP) algorithm

- Differs from classical CELP in that a time-warped (downsampled) version of the residual is matched instead of the actual speech

- Operates at 3 rates 9.6/4.8/1.2 kbits/s and also blank

- Unless requested by network rate is determined based on voice activity

- Upon command it may generate blank or Rate 1/2

- includes an FFT-based speech enhancement pre-processing block

- Estimated Pitch at higher rates has to conform with a pitch contour

- No pitch estimation at 1.2 kbits/s

- LPC coefficients encoded as LSPs - subframe LSPs by interpolation

- The random codebook is searched using Algebraic CELP techniques

- includes postfilters

- MOS 3.8 at 9.6 kbits/s and Complexity around 20 MIPS


THE GSM ENHANCED FULL-RATE (EFR) CODER

- Bit Rate 12.2 kbits/s

- Speech is sampled at 8 kHz and segmented into 20 ms frames (160 samples)

- 10 LPC parameters determined by Levinson-Durbin and vector quantized as LSPs

- subframes are 5 ms each

- Uses an Algebraic codebook

- The pitch is first estimated open loop and refined using a closed-loop search, much like the IS-641 pitch search

GSM: Enhanced Full Rate Speech Transcoder, ETSI GSM 6.60, Nov. 1996

Vendors for EFR GSM Coder (figures are approximate - check with the vendor for more accurate estimates)

- VLSI VWS22030 based on the DSPGroup OakDSPcore, contact VLSI Inc. (www.vlsi.com)


Third Generation Vocoders

Some Recent and Ongoing Standardization Efforts

CDMA 2000 (supports next-generation data services envisioned up to 2 Mb/s)

GSM AMR - Adaptive Multirate Speech Coder (multiple coders)

ITU-4 - ITU Standardization Efforts for 4kb/s (on-going)

CDMA SMV - Selectable Mode Vocoder for the next generation CDMA


Wideband CDMA

Objective to meet IMT 2000 requirements (at least 144 Kb/s in a vehicular environment, 384 Kb/s in a pedestrian environment, and 2048 Kb/s in an indoor office environment)

Supports next-generation data services envisioned up to 2 Mb/s (full coverage and mobility for 144 Kb/s, preferably 384 Kb/s; limited coverage and mobility for 2 Mb/s)

Enhanced Voice Services (audioconferencing & voice mail)

Concurrent high-quality video/audio

Backward compatible with IS-95B

high security & low power

Significantly enhanced version of EVRC for voice services

- http://www.comsoc.org/pubs/surveys/4q98issue/prasad.html

- D. Knisely et al, Evolution of Wireless Data Services: IS-95 to CDMA 2000, IEEE Communications Magazine, pp. 140-149, October 1998

- Vijay K. Garg (University of Illinois, Chicago, Illinois), IS-95 CDMA and cdma2000: Cellular/PCS Systems Implementation, 1/e, Prentice Hall PTR (ECS Professional), December 1999


GSM Adaptive Multirate Coder

Adjusts its bit-rate according to network load

Rates 12.2, 10.2, 7.95, 6.7, 5.9, 5.15, 4.75kb/s

Based on CELP with 20 ms frame and 5 ms subframe

Multirate ACELP with 10th-order short-term LPC and perceptual weighting (uses Levinson-Durbin)

Encodes LSPs using split VQ

An open loop LTP is first obtained and refined by closed loop

Highest bit rate provides toll quality & half rate provides communications quality
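The relation between the listed bit rates and the 20 ms frame can be made concrete with a little arithmetic: the number of bits per frame is the rate multiplied by the frame duration. The sketch below assumes 8 kHz narrowband sampling for the sample counts; the rates, frame and subframe durations are the ones listed above.

```python
AMR_RATES_KBPS = [12.2, 10.2, 7.95, 6.7, 5.9, 5.15, 4.75]
FRAME_MS = 20
SUBFRAME_MS = 5
FS_HZ = 8000  # assumed narrowband sampling rate

def bits_per_frame(rate_kbps, frame_ms=FRAME_MS):
    """kbit/s times ms gives bits; round() absorbs floating-point error."""
    return round(rate_kbps * frame_ms)

samples_per_frame = FS_HZ * FRAME_MS // 1000        # samples in one 20 ms frame
subframes_per_frame = FRAME_MS // SUBFRAME_MS       # 5 ms subframes per frame
bits = {rate: bits_per_frame(rate) for rate in AMR_RATES_KBPS}

for rate, n_bits in bits.items():
    print(f"{rate:5.2f} kb/s -> {n_bits} bits per {FRAME_MS} ms frame")
```

For example, the 12.2 kb/s mode has 244 bits to spend on each 160-sample frame, while the 4.75 kb/s mode has only 95.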

- ETSI TS 126 090 V.3.1.0 2000-01 - AMR SPEECH CODEC TRANSCODING FUNCTIONS 3G-TS 26.090 Technical Specification

- R. Ekudden, R. Hagen, I. Johansson, and J. Svedburg, "The Adaptive Multi-Rate speech coder," Proc. IEEE Workshop on Speech Coding, pp. 117-119, 1999


The Selectable Mode Vocoder

• Algorithm to provide higher quality, flexibility, and capacity over existing IS-96C, IS-127 EVRC, and IS-733 (that replaced IS-96C but working at higher average rate)

• The Conexant SMV algorithm became the core technology for 3G CDMA (core SMV algorithm to be refined in the interim by participating companies according to the publication below)

• Based on 4 codecs: full rate at 8.5 kbps, half rate at 4 kbps, quarter rate at 2 kbps, and eighth rate at 800 bps

• Pre-processing includes noise suppression similar to IS 127 EVRC

• Full rate and half rate are based on Conexant's eXtended CELP (eX-CELP), a core technology also used in the Conexant submission to ITU-4

• Performed better than IS-733 and IS-127 in tests with and without background noise

• Scored as high as 4.1 MOS at full rate with clean speech. Performed very well with background noise

REFERENCES:

[1] Y. Gao, E. Schlomot, A. Benyassine, J. Thyssen, H. Su, and C. Murgia, "The SMV algorithm selected for TIA and 3GPP2 for CDMA applications," conference paper by Conexant Systems (portions published at ICASSP-2001)


STANDARDS AT A GLANCE

• ITU Wideband Coding

  – G.722 Coding of 7 kHz speech at 64, 56, 48 kbps - Sub-band ADPCM

  – G.WB1 Coding of 7 kHz speech at 32/24 kbps - Combined Transform and CELP coding

  – G.WB2 Coding of 7 kHz speech at 16 kbps or less (ongoing)

• ITU Telephony

  – G.711 PCM (64 kbps) late 60's

  – G.726 ADPCM (32/40/24/16 kbps) 1988

  – G.728 LD-CELP coding (16 kbps) 1992

  – G.723.1 True Speech (5.3/6.3 kbps) 1995

  – G.729 CS-ACELP (8/12.8/6.4 kbps) 1996 and Annex in 1998

  – G.4kbps Toll quality at 4 kbps (ongoing)

• Non-ITU

  – MPEG1/Audio (includes MP3), 1991

  – MPEG2/Audio: 64 kbps (1992)

  – MPEG4/Audio: audio/speech coding at bit rates between 64 and 2 kbps (1998)

  – MPEG7/Audio: audio/speech/MIDI coding (ongoing)


STANDARDS AT A GLANCE (2)

• TIA

  – CDMA

    • IS96 8, 4, 2 kbps Q-CELP (Qualcomm CELP, 1992)

    • IS127 8.55, 4, 0.8 kbps EVRC (Enhanced Variable Rate Coder, 1996)

    • IS733 13.3, 6.2, 2.7, 1 kbps VRC (Variable Rate Coder, 1998)

    • 3GPP2 0.8-8.55 kbps SMV (Selectable Mode Vocoder, 2001)

  – TDMA

    • IS54 7.95 kbps VSELP (Vector-Sum Excited Linear Prediction, 1989)

    • IS641 7.4 kbps CELP (similar to EFR but at a lower rate, 1997)

  – PCS1800 (GSM variant working at 1800 MHz)

    • IS136-410 12.2 kbps US1 (1999)

• ETSI (GSM)

  – 13 kbps RPE-LTP (Full-rate GSM, 1988)

  – 6.5 kbps VSELP (Half-rate GSM, 1993)

  – 12.2 kbps EFR (Enhanced full-rate GSM, 1996)

  – 12.2 - 4.75 kbps AMR (Adaptive Multi Rate, 1999)

• ARIB Japan

  – Full-rate PDC (Personal Digital Communication) 6.7 kbps VSELP

  – Half-rate PDC 3.45 kbps Multimode CELP


[Figure: Vocoder/Waveform/Hybrid. MOS (1-5) versus bit rate (1, 2, 4, 8, 16, 32, 64 kbps), showing the regions for vocoders, hybrid coders and waveform coders, with LPC10e, MELP, CELP, SMV, ADPCM and PCM marked.]


PERFORMANCE OF SOME STANDARDIZED ALGORITHMS

Algorithm               Bit Rate (kbits/s)   MOS            Complexity (MIPS)   Frame size (ms)

PCM G.711               64                   4.3            0.01                0
ADPCM G.726             32                   4.1            2+                  0.125
SBC G.722               48/56/64             4.1            5                   0.125
LD-CELP G.728           16                   4              ~30                 0.625
CS-ACELP G.729          8                    4              ~20                 10
CS-ACELP-A G.729        8                    3.76           11                  10
MPC-MLQ G.723.1         6.3/5.3              3.98/3.7       ~16                 30
GSM FR RPE-LTP          13                   3.7 (ave)      5                   20
GSM EFR                 13                   4              14                  20
GSM HR VSELP            6.3                  ~3.4           14                  20
IS-54 VSELP             8                    3.5            14                  20
IS-641 EFR              8                    3.8            14                  20
Conexant eX-CELP SMV    8.55/4/2/0.8         ~4.1 (8.55)    ~20                 20
IS-96 QCELP             1.2/2.4/4.8/9.6      3.33 (9.6)     15                  20
IS-127 EVRC             1.2/4.8/9.6          ~3.8 (9.6)     20                  20
PDC VSELP               6.3                  3.5            14                  20
PDC PCI-CELP            3.45                 ~3.4           ~48                 40
FS 1015 – LPC 10e       2.4                  2.3            7                   22.5
FS 1016 – CELP 4.8      4.8                  3.2            16                  30
MELP                    2.4                  3.2            ~30                 22.5
Inmarsat-B APC          9.6/12.8             ~3.1/3.4       10                  20
Inmarsat-M IMBE         6.3                  3.4            ~13                 20