Speech Signal Analysis and Coding

Dr. Arun Kumar

Centre for Applied Research in Electronics (CARE), IIT Delhi

arunkm@care.iitd.ernet.in

Contents

• Speech Processing Applications

• Speech Signal Understanding

– Speech Production

– Speech Signal Characteristics and Analysis

• Speech Coding

– Coding Standards

– Coder Attributes including Quality Evaluation

– Coding Methodologies

• Speech Transmission

– Trunk-line telephony

– Wireless telephony

• Speech Storage

– Voice mail, voice memo, answering machines

• Speech Synthesis

– Text-to-speech-synthesis

– Automatic information services

Speech Processing Applications

• Speaker Verification and Identification

– Phone banking

– Secure entry

• Aids for the Handicapped

– Variable rate playback

– Hearing aids

– Reading machine for visually impaired

– Visual display of speech information for the hearing impaired

Speech Processing Applications

• Speech Enhancement

– Echo and noise cancellation

• Speech Recognition

– Automatic language translation

• Voice Personality Transformation

– Voice conversion from “source” to “target”

Speech Processing Applications

“It is the variation of pressure from atmospheric pressure, as a function of time, caused by traveling waves from the speaker’s mouth (and also from the nostrils, cheeks and throat).”

The Speech Signal

Units: SPL (Sound Pressure Level) in dB relative to a reference level.

Reference: 10^-16 W/cm², which corresponds to ‘just barely audible’.
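The dB figure can be illustrated with a minimal sketch. It assumes the usual intensity-level definition, 10·log10(I / I_ref), with the reference intensity quoted on the slide; the function name `intensity_level_db` is only illustrative.

```python
import math

# Reference intensity from the slide: 10^-16 W/cm^2 ("just barely audible").
REFERENCE_INTENSITY_W_PER_CM2 = 1e-16


def intensity_level_db(intensity_w_per_cm2):
    """Return the level in dB of an intensity relative to the reference.

    Assumes the standard definition 10*log10(I / I_ref).
    """
    return 10.0 * math.log10(intensity_w_per_cm2 / REFERENCE_INTENSITY_W_PER_CM2)


# The reference itself sits at 0 dB; an intensity one million times larger sits near 60 dB.
level_at_reference = intensity_level_db(1e-16)
level_million_times = intensity_level_db(1e-10)
```

Every factor of 10 in intensity adds 10 dB, which is why the scale on the following slide spans such a wide range of sounds in only 120 dB.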

The Intensity Level of Speech

[Figure: sound intensity scale in dB with ticks at 0, 20, 55, 60, 70, 80, 100 and 120 dB. Labeled points: just barely audible, whisper, variations in normal voice level (1 meter distance from mouth), heavy traffic, airplane, rock concert.]

The Intensity Level of Speech

• Energy of speech during 1 s

– 2 × 10^-5 joules (for comparison, it takes 100 joules to light a 100 W bulb for 1 s)

• Strongest vowel: /a/ as in “talk”

• Weakest vowel: /i/ as in “see”

• Strongest consonant: /r/ as in “run”

• Weakest consonant: /θ/ as in “thin”

The Intensity Level of Speech

Audio Signal Category    Bandwidth (Hz)   Sampling Rate (kHz)   Source Rate (kbps)
Telephone Band Speech    300-3400         8.0                   128
Wideband Speech          50-7000          16.0                  256
Wideband Audio           20-20,000        44.1/48.0             705/768

Speech & Audio Signal Specs.
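The source rates in the specs table follow from sampling rate × bits per sample. A small sketch, assuming 16-bit linear PCM on a single channel (the 16-bit figure is the one used later in these slides for telephone-band speech; the function name is illustrative):

```python
BITS_PER_SAMPLE = 16  # assumed: 16-bit linear PCM, one channel


def source_rate_kbps(sampling_rate_khz, bits_per_sample=BITS_PER_SAMPLE):
    """Uncompressed source rate in kbps for a given sampling rate in kHz."""
    return sampling_rate_khz * bits_per_sample


# Sampling rates (kHz) taken from the table above.
rates = {
    "Telephone band speech": source_rate_kbps(8.0),
    "Wideband speech": source_rate_kbps(16.0),
    "Wideband audio (44.1 kHz)": source_rate_kbps(44.1),
    "Wideband audio (48 kHz)": source_rate_kbps(48.0),
}

for category, rate in rates.items():
    print(f"{category}: {rate:.1f} kbps")
```

Under these assumptions the 44.1 kHz case comes to 705.6 kbps, which the table lists as 705.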

Speech Articulation by the Vocal System

Reproduced from: D. O’Shaughnessy, Human and machine speech communication, IEEE Press, 2000

Speech Classes by Articulation

• Voiced speech

• Unvoiced speech

• Transient (stop) sounds

The relationship between speech sounds (phonemes) and their acoustic realizations

– Waveform

– Spectrum

– Spectrogram

Acoustic Analysis of Speech

Time Waveform of a Speech Sentence

[Figure: time waveform of the sentence “THIS IS GOOD”. x-axis: Time (s), 0 to 1.4; y-axis: Amplitude, -1 to 0.8. Segments are labeled in order: ð (TH), ɪ (I), s (S), ɪ (I), s (S), ɡ (G), U (O), d (D).]

• Vowels: high energy, periodic, steady-state utterance

• Unvoiced fricatives: low energy, noise-like, steady-state utterance

• Voiced fricatives: low energy, element of periodicity, steady-state utterance

• Stops: transient release, medium to low energy

• Nasals: low-to-medium energy, periodic, steady-state utterance
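Two simple frame-level measurements capture part of the distinctions listed above: short-time energy (high for vowels, low for fricatives) and zero-crossing rate (low for periodic sounds, high for noise-like sounds). The sketch below is an illustration, not a method taken from the slides; the two synthetic frames and the 8 kHz sampling rate are assumptions chosen for the example.

```python
import math
import random


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(v * v for v in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


FS = 8000          # assumed sampling rate (Hz)
FRAME_LEN = 240    # 30 ms frame at 8 kHz

# Vowel-like frame: strong 200 Hz sinusoid (periodic, high energy).
vowel_like = [0.8 * math.sin(2 * math.pi * 200 * n / FS) for n in range(FRAME_LEN)]

# Fricative-like frame: weak white noise (noise-like, low energy).
rng = random.Random(0)
fricative_like = [rng.uniform(-0.05, 0.05) for _ in range(FRAME_LEN)]

for name, frame in (("vowel-like", vowel_like), ("fricative-like", fricative_like)):
    print(name, short_time_energy(frame), zero_crossing_rate(frame))
```

Real speech needs more care (stops and voiced fricatives fall between these two extremes), so these features are a starting point rather than a classifier.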

Waveform Analysis of a Speech

Fundamental frequency F0 / Pitch period

F0 Male Female

Average (Hz) 132 223

Range (Hz) 50-250 120-500
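One common way to estimate F0 from a voiced frame is to look for the lag at which the frame's autocorrelation peaks. The sketch below assumes that approach (it is not stated in the slides) and restricts the search to 50-500 Hz, the overall range in the table above; the synthetic test frame and the 8 kHz sampling rate are also assumptions.

```python
import math


def autocorrelation(frame, lag):
    """Unnormalized autocorrelation of a frame at one lag."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))


def estimate_f0(frame, fs, f_min=50.0, f_max=500.0):
    """Estimate F0 (Hz) as fs divided by the lag with the largest autocorrelation."""
    min_lag = int(fs / f_max)
    max_lag = min(int(fs / f_min), len(frame) - 1)
    best_lag = max(
        range(min_lag, max_lag + 1),
        key=lambda lag: autocorrelation(frame, lag),
    )
    return fs / best_lag


FS = 8000  # assumed sampling rate (Hz)

# Synthetic voiced frame (40 ms): 125 Hz fundamental plus its second harmonic.
voiced_frame = [
    math.sin(2 * math.pi * 125 * n / FS) + 0.5 * math.sin(2 * math.pi * 250 * n / FS)
    for n in range(320)
]

f0_estimate = estimate_f0(voiced_frame, FS)
print(f"Estimated F0: {f0_estimate:.1f} Hz")
```

The frame has to cover at least two pitch periods of the lowest F0 of interest, which is why a 40 ms frame is used for a 50 Hz lower limit.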

Acoustic Analysis of Vowels

• Stop Consonants

– Momentary blockage of the vocal tract (50-100 ms): closure phase

– Release burst (shortest acoustic event)

– Voice onset time (VOT)

• Fricatives

– Narrow constriction somewhere in the vocal tract

– Turbulent airflow through the constriction

Acoustic Analysis of Consonants

[Figure: The International Phonetic Alphabet (IPA)]

Universal Speech Production Model

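[Block diagram: an impulse train generator followed by a glottal pulse model and a voiced gain forms the voiced branch; a white noise generator followed by an unvoiced gain forms the unvoiced branch. A voiced or unvoiced switch selects one branch as the excitation of the vocal tract filter, which is followed by the radiation model to give the output speech.]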

Vocal Tract Model

• Time-varying all-pole linear filter excited by a source signal.

• H(z) models the vocal tract system: the excitation e[n] is passed through H(z) = 1/A(z) to produce the speech signal s[n].

$$H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{i=1}^{P} a_i z^{-i}}$$
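In the time domain, an all-pole filter H(z) = 1/A(z) is a simple recursion. The sketch below assumes the sign convention A(z) = 1 − Σ a_i z^{-i}, so that s[n] = e[n] + Σ a_i s[n−i]; it drives the filter with either an impulse train (voiced) or white noise (unvoiced). The glottal pulse and radiation models of the full production model are left out, and all names and coefficient values are illustrative.

```python
import random


def all_pole_filter(excitation, a):
    """Apply s[n] = e[n] + sum_{i=1..P} a[i-1] * s[n-i] to an excitation."""
    output = []
    for n, e in enumerate(excitation):
        acc = e
        for i, coeff in enumerate(a, start=1):
            if n - i >= 0:
                acc += coeff * output[n - i]
        output.append(acc)
    return output


def impulse_train(n_samples, period):
    """Voiced source: unit impulses spaced one pitch period apart."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def white_noise(n_samples, seed=0):
    """Unvoiced source: zero-mean Gaussian noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def synthesize(n_samples, a, voiced, gain=1.0, pitch_period=64):
    """Pick the source with the voiced/unvoiced switch, scale it, and filter it."""
    source = impulse_train(n_samples, pitch_period) if voiced else white_noise(n_samples)
    return all_pole_filter([gain * e for e in source], a)


# Hypothetical stable 2nd-order vocal tract filter (one resonance).
example_coeffs = [1.3, -0.8]
voiced_speech = synthesize(160, example_coeffs, voiced=True)
unvoiced_speech = synthesize(160, example_coeffs, voiced=False, gain=0.1)
```

In a real coder the coefficients a_i change from frame to frame, which is what "time-varying" refers to in the bullet above.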

[Figures: voiced speech spectrum, magnitude (dB, -100 to 80) versus frequency (Hz, 0 to 4000), shown first on its own and then with superimposed LP envelopes: 2nd order; 2nd and 6th order; 2nd, 6th and 10th order; 2nd, 6th, 10th and 16th order.]
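LP envelopes of a given order are typically obtained by solving the autocorrelation normal equations for the predictor coefficients and then evaluating gain/|A(e^{jω})|. The sketch below uses the Levinson-Durbin recursion, which is one standard way to do this; the slides do not say which method produced the plots. It assumes the convention A(z) = 1 − Σ a_i z^{-i}, and the function names are illustrative.

```python
import cmath
import math


def autocorrelation(x, max_lag):
    """Autocorrelation values r[0..max_lag] of a frame."""
    return [
        sum(x[n] * x[n + k] for n in range(len(x) - k))
        for k in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the LP normal equations; return (a[1..order], prediction error)."""
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[m] = k
        for i in range(1, m):
            new_a[i] = a[i] - k * a[m - i]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def lp_envelope_db(a, gain, freq_hz, fs):
    """Magnitude in dB of gain / A(e^{jw}) at one frequency."""
    w = 2.0 * math.pi * freq_hz / fs
    a_of_w = 1.0 - sum(
        coeff * cmath.exp(-1j * w * i) for i, coeff in enumerate(a, start=1)
    )
    return 20.0 * math.log10(gain / abs(a_of_w))


# Autocorrelation of a first-order process with correlation 0.9 per sample.
example_r = [1.0, 0.9, 0.81, 0.729]
example_a, example_err = levinson_durbin(example_r, order=2)
```

Raising the order lets the envelope follow more spectral peaks, which is the progression the 2nd, 6th, 10th and 16th order plots illustrate.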

Unvoiced Speech and 10th order LP Residual

[Figure: two panels, amplitude versus time (ms), 0 to 40 ms: an unvoiced speech segment and its 10th-order LP residual.]

Voiced Speech and 10th-order LP Residual

[Figure: two panels, amplitude versus time (ms), 0 to 40 ms: a voiced speech segment and its 10th-order LP residual.]
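The LP residual is what remains after the inverse filter A(z) removes the predictable part of each sample. A minimal sketch, assuming the convention A(z) = 1 − Σ a_i z^{-i} so that e[n] = s[n] − Σ a_i s[n−i]; the coefficients and the impulse-train test signal are hypothetical.

```python
def all_pole_filter(excitation, a):
    """Synthesis filter 1/A(z): s[n] = e[n] + sum_i a[i-1] * s[n-i]."""
    output = []
    for n, e in enumerate(excitation):
        acc = e
        for i, coeff in enumerate(a, start=1):
            if n - i >= 0:
                acc += coeff * output[n - i]
        output.append(acc)
    return output


def lp_residual(speech, a):
    """Inverse filter A(z): e[n] = s[n] - sum_i a[i-1] * s[n-i]."""
    residual = []
    for n, s in enumerate(speech):
        prediction = sum(
            coeff * speech[n - i]
            for i, coeff in enumerate(a, start=1)
            if n - i >= 0
        )
        residual.append(s - prediction)
    return residual


# Hypothetical 2nd-order predictor and a pitch-like impulse train (period 40).
coeffs = [1.3, -0.8]
excitation = [1.0 if n % 40 == 0 else 0.0 for n in range(160)]
speech = all_pole_filter(excitation, coeffs)
recovered = lp_residual(speech, coeffs)
```

When the predictor matches the filter that produced the signal, the residual is the original excitation. For voiced speech this leaves pitch pulses in the residual, which is the long-term correlation that short-term prediction does not remove.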

• Short-term correlation

• Long-term correlation

Speech Coding

• For telephone band (or narrowband) speech:

– Signal bandwidth: 300-3400 Hz

– Sampling rate: 8000 Hz

– Resolution: 16 bits/sample linear PCM

• Uncompressed bit rate: 16 bits/sample × 8000 samples/s = 128 kbit/s

• What is the minimum coding rate for transmitting the message information?

Coding Rates

Coder Classes according to Bit-Rate

B > 16 kbps         High bit rate coders
4 < B <= 16 kbps    Medium bit rate coders
1 < B <= 4 kbps     Low bit rate coders
B <= 1 kbps         Very low bit rate coders
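The bit-rate classes can be written as a small lookup. This is only a restatement of the table as code; the function name is illustrative, and a rate of exactly 1 kbps is assigned to the very low class here as an assumption.

```python
def coder_class(bit_rate_kbps):
    """Map a bit rate in kbps to the coder class used in these slides."""
    if bit_rate_kbps > 16:
        return "High bit rate"
    if bit_rate_kbps > 4:
        return "Medium bit rate"
    if bit_rate_kbps > 1:
        return "Low bit rate"
    # Assumption: exactly 1 kbps is treated as very low bit rate.
    return "Very low bit rate"


# Example rates (kbps): 64 and 16 are toll-quality PCM and LD-CELP rates, 2.4 is an LPC-10 rate.
for rate in (64, 16, 2.4, 0.8):
    print(rate, "kbps ->", coder_class(rate))
```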

• ITU-T: International Telecommunication Union, Telecommunication Standardization Sector (UN)

• MPEG: Moving Picture Experts Group (ISO/IEC)

• INMARSAT: International Maritime Satellite Organization, for geosynchronous satellites

• US Government: DoD, NATO

• TIA: Telecommunications Industry Association, for North American telecom standards

• ETSI: European Telecommunications Standards Institute

Standards Organizations

Name                       Coding Type        Bit-rate (kbps)   Organization   Year
G.711/G.712                PCM µ-law/A-law    64                ITU-T          1972
G.721/G.723/G.726/G.727    ADPCM              32/24/40/16       ITU-T          1984/86/88/90
G.728                      LD-CELP            16                ITU-T          1992
G.729                      CS-ACELP           8.0               ITU-T          1995
G.723.1                    ACELP              6.3/5.3           ITU-T          1995
G.722 (Wideband)           SB-ADPCM           48/56/64          ITU-T          1985

Speech Coding Standards

Name                 Coding Type   Bit-rate (kbps)   Organization   Year
G.722.1 (Wideband)   Transform     24/32             ITU-T          1999
Inmarsat             IMBE          4.15              INMARSAT       1990
IS-54 (old)          VSELP         7.95              TIA            1992
GSM-FR               RPE-LTP       13                GSM            1991
GSM-HR               CELP          5-6               GSM            1994
GSM-EFR              CELP          12.2              GSM            1997

Speech Coding Standards

Name           Coding Type   Bit-rate (kbps)   Organization   Year
IS-641 (new)   ACELP         7.4               TIA            1997
Iridium        AMBE          2.4               Iridium        1996
MPEG-4         HVXC          2-4               MPEG/ISO       1999
MPEG-4         CELP          4-24              MPEG/ISO       1999
FS-1015        LPC-10        2.4               US-DoD/NATO    1984
FS-1016        CELP          4.8               US-DoD/NATO    1989
MELP           MELP          2.4               US-DoD/NATO    1996

Speech Coding Standards

• Coding Methodologies

– Waveform coding

– Vocoding or parametric coding

– Hybrid coding

Coding Methodologies

Classes according to Coding Type

[Figure: speech quality (Poor, Fair, Good, Excellent) versus bit rate (kbps; ticks at 1, 2, 4, 8, 16, 32, 64), with one curve each for parametric coders, hybrid coders and waveform-approximating coders.]

Coding Standards

[Figure: the same quality (Poor, Fair, Good, Excellent) versus bit rate (kbps; ticks at 1, 2, 4, 8, 16, 32, 64) plot for parametric, hybrid and waveform-approximating coders, with these standards marked: linear PCM, G.711, G.726, G.728, G.729, G.723.1, GSM EFR, GSM FR, GSM/2, IS96, MELP, FS1015.]

PCM Coding

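[Block diagram: input x[n] → quantizer Q[·] → quantized output x’[n], with codeword index i[n].]

• Instantaneous, non-uniform quantization

• For signals with time-varying energy, e.g. speech, uniform quantization is inefficient.

• With a uniform quantizer, SQNR falls by 6 dB each time the signal amplitude is halved.

• With a logarithmic quantizer, SQNR is nearly independent of signal level.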

ADPCM Coding

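[Block diagram: ADPCM encoder and decoder. Encoder: the prediction x’[n] from predictor P is subtracted from the input x[n] to give the difference d[n]; the quantizer Q[·] codes d[n] as c[n]; the quantized difference d’[n] is added to x’[n] to give the reconstruction x”[n], which feeds P. Decoder: c[n] is decoded to d’[n] and added to the prediction x’[n] from P to give the output x”[n].]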

Prediction in the context of Coding

[Figure: two waveform panels, amplitude versus time (ms), 0 to 20 ms.]

Signal and first-difference signal

• DPCM with fixed predictor can give 4-11 dB improvement over PCM.

• PCM with adaptive quantization can give ~5 dB improvement over µ-law non-adaptive PCM.

• DPCM with adaptive prediction can give 10-12 dB improvement over fixed predictor.
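A minimal DPCM loop shows why prediction helps: only the difference between each sample and its prediction is quantized, and the encoder predicts from its own reconstruction so that the decoder stays in step. The sketch assumes a fixed first-order predictor and a fixed uniform quantizer (no adaptation, so it is DPCM rather than ADPCM); the predictor coefficient, step size and test signal are hypothetical.

```python
import math


def dpcm_encode(signal, predictor_coeff, step, levels):
    """Return (codes, reconstruction) for a first-order closed-loop DPCM encoder."""
    codes, reconstruction = [], []
    previous = 0.0
    for sample in signal:
        prediction = predictor_coeff * previous
        difference = sample - prediction
        index = int(round(difference / step))
        index = max(-levels // 2, min(levels // 2 - 1, index))
        previous = prediction + index * step   # same value the decoder will form
        codes.append(index)
        reconstruction.append(previous)
    return codes, reconstruction


def dpcm_decode(codes, predictor_coeff, step):
    """Rebuild the signal from the code indices."""
    output = []
    previous = 0.0
    for index in codes:
        previous = predictor_coeff * previous + index * step
        output.append(previous)
    return output


# Hypothetical settings: 5-bit codes, step 0.01, predictor coefficient 0.9.
STEP = 0.01
LEVELS = 32
COEFF = 0.9

signal = [0.5 * math.sin(2 * math.pi * 0.01 * n) for n in range(400)]
codes, encoder_view = dpcm_encode(signal, COEFF, STEP, LEVELS)
decoded = dpcm_decode(codes, COEFF, STEP)
max_error = max(abs(s - d) for s, d in zip(signal, decoded))
```

With these settings the difference signal stays well inside the 5-bit range, so the reconstruction error is bounded by half a step even though the signal itself spans ±0.5. ADPCM adds adaptation of the quantizer step, the predictor, or both.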

ADPCM Coding

Code Excited Linear Prediction (CELP) Coding

• Most coders in the 4.8-16 kbps range are based on Linear Prediction Analysis-by-Synthesis (LPAS) coding.

• CELP belongs to the LPAS paradigm of speech coding.

Generic Linear Prediction Analysis Synthesis (LPAS) Coder

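[Block diagram: an excitation generator drives the synthesis filter, whose coefficients come from LP analysis of the input speech. The synthesized signal is subtracted from the input speech, and the error minimization block uses the difference to select the excitation.]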
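The analysis-by-synthesis loop can be sketched as an exhaustive codebook search: each candidate excitation is passed through the synthesis filter, the best gain is computed in closed form, and the candidate with the smallest squared error wins. This is a simplified illustration only. It omits perceptual weighting, the adaptive (pitch) codebook and filter memory between frames, and the random codebook, filter coefficients and function names are all assumptions.

```python
import random


def synthesis_filter(excitation, a):
    """All-pole filter 1/A(z): s[n] = e[n] + sum_i a[i-1] * s[n-i]."""
    output = []
    for n, e in enumerate(excitation):
        acc = e
        for i, coeff in enumerate(a, start=1):
            if n - i >= 0:
                acc += coeff * output[n - i]
        output.append(acc)
    return output


def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the codevector that best matches target."""
    best = None
    for index, codevector in enumerate(codebook):
        synthesized = synthesis_filter(codevector, a)
        energy = sum(v * v for v in synthesized)
        if energy == 0.0:
            continue
        # Least-squares optimal gain for this candidate.
        gain = sum(t * v for t, v in zip(target, synthesized)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, synthesized))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


# Hypothetical setup: 16 Gaussian codevectors of 20 samples, 2nd-order filter.
rng = random.Random(0)
codebook = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(16)]
lp_coeffs = [1.3, -0.8]

# Build a target that is exactly codevector 5 scaled by 0.7 and filtered.
target = [0.7 * v for v in synthesis_filter(codebook[5], lp_coeffs)]
best_index, best_gain, best_error = search_codebook(target, codebook, lp_coeffs)
```

Only the winning index and gain need to be transmitted, together with the LP parameters; the decoder regenerates the excitation from its own copy of the codebook.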

CELP Decoder

[Block diagram: the excitation parameters drive the excitation generator, whose output is filtered by G/A(z), set by the LP and gain parameters, to give the synthesized speech.]

• Speech Quality

– Objective measures

• Segmental SNR

• Itakura-Saito distance measure

• Spectral distortion (SD)

• ITU-T P.862 Recommendation

– Subjective measures

• Mean opinion score (MOS)

• Diagnostic Rhyme Test (DRT)

• Diagnostic Acceptability Measure (DAM)
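Of the objective measures listed, segmental SNR is the simplest to compute: the SNR is evaluated frame by frame in dB and then averaged. The sketch below assumes 20 ms frames at 8 kHz and clamps each frame's SNR to [-10, 35] dB, a common practical choice that is not taken from the slides; the function name is illustrative.

```python
import math


def segmental_snr_db(reference, degraded, frame_len=160,
                     floor_db=-10.0, ceil_db=35.0):
    """Average of per-frame SNRs (dB), each clamped to [floor_db, ceil_db]."""
    frame_snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        deg = degraded[start:start + frame_len]
        signal_energy = sum(r * r for r in ref)
        noise_energy = sum((r - d) ** 2 for r, d in zip(ref, deg))
        if signal_energy == 0.0:
            continue  # skip silent frames
        if noise_energy == 0.0:
            snr = ceil_db
        else:
            snr = 10.0 * math.log10(signal_energy / noise_energy)
        frame_snrs.append(min(ceil_db, max(floor_db, snr)))
    if not frame_snrs:
        raise ValueError("no non-silent frames to evaluate")
    return sum(frame_snrs) / len(frame_snrs)


# Example: a tone and a copy attenuated by 10 percent (error is 0.1 times the signal).
reference = [0.5 * math.sin(2 * math.pi * 0.02 * n) for n in range(800)]
degraded = [0.9 * v for v in reference]
example_snr = segmental_snr_db(reference, degraded)
```

Averaging in the dB domain gives quiet frames the same weight as loud ones, which is why segmental SNR tracks perceived quality better than a single SNR over the whole utterance for waveform coders.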

Speech Quality Measurement

• Listening quality scale

Excellent 5

Good 4

Fair 3

Poor 2

Bad 1

Absolute Category Rating Tests (MOS)

• Measures speech intelligibility

• Listeners are presented with one word from a pair of words that differ only in the leading consonant

– Examples:

• Meet - Beat

• Than - Dan

• Met - Net

• Jest - Guest

Diagnostic Rhyme Test

• Total possible pairs = 96

• Intelligibility score, S, is given by:

$$S = 100 \times \frac{N(\text{correct}) - N(\text{incorrect})}{N(\text{test pairs})}$$
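As a worked example of the score, suppose a listener answers 90 of the 96 pairs correctly and 6 incorrectly; these counts are made up for illustration. A short sketch with an illustrative function name:

```python
def drt_score(n_correct, n_incorrect, n_test_pairs):
    """DRT intelligibility score: 100 * (correct - incorrect) / test pairs."""
    return 100.0 * (n_correct - n_incorrect) / n_test_pairs


# Hypothetical listener result on the full set of 96 pairs.
example_score = drt_score(n_correct=90, n_incorrect=6, n_test_pairs=96)
print(f"S = {example_score:.1f}")
```

Subtracting the incorrect responses corrects for guessing: a listener who guesses at random on a two-choice test gets about half right and scores near 0, not 50.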

Coder Rate (kbps) DRT MOS

FS1016 4.8 91.7 3.3

G.728 16 93.0 3.9

Diagnostic Rhyme Test

• Specified in ITU-T Recommendation P.862

• Objective is to mimic sound perception by persons in real life

• PESQ simulates experiments in which subjects judge speech quality

• Physical signals are mapped to psychophysical representations that match internal representations in the head

Perceptual evaluation of speech quality (PESQ)

• Complexity

– Computational complexity

• Simplex/half-duplex/full-duplex real-time performance on a single DSP

• Fixed point vs. floating point

• CELP coders are computationally complex

– Memory requirement

• Storage of look-up tables, codebooks etc.

Speech Coder Complexity Issues

Timing Diagram for various Coding Delays

[Timing diagram, time axis in frame index (0 to 5): input speech frames 1 to 5 are buffered in turn; each frame is encoded once it has been buffered, its bits are then transmitted, decoded, and played back. The total one-way coding delay is made up of the algorithmic buffering delay, the encoder processing delay, the bit transmission delay and the decoder processing delay. The sum of the encoder and decoder processing delays is the total processing delay.]
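The delay budget in the timing diagram is a sum of four components. The sketch below only restates that sum; the millisecond values in the example are hypothetical and do not describe any particular coder.

```python
def one_way_coding_delay_ms(buffering_ms, encode_ms, transmit_ms, decode_ms):
    """Return the total processing delay and the total one-way coding delay (ms)."""
    processing_ms = encode_ms + decode_ms
    total_ms = buffering_ms + processing_ms + transmit_ms
    return {"processing_ms": processing_ms, "total_ms": total_ms}


# Hypothetical figures: 20 ms frame buffering, 5 ms encode, 20 ms transmit, 3 ms decode.
example_delay = one_way_coding_delay_ms(
    buffering_ms=20.0, encode_ms=5.0, transmit_ms=20.0, decode_ms=3.0
)
print(example_delay)
```

The buffering term is fixed by the frame length (plus any look-ahead), so it cannot be reduced by a faster processor, unlike the two processing terms.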

Thank You!
