Speech Signal Analysis and Coding - ERNETpkalra/OLD-COURSES/siv864-2010/session-0… · MPEG-4 HVXC...

Speech Signal Analysis and Coding

Dr. Arun Kumar

Centre for Applied Research in Electronics (CARE), IIT Delhi

arunkm@care.iitd.ernet.in

Contents

• Speech Processing Applications

• Speech Signal Understanding

– Speech Production

– Speech Signal Characteristics and Analysis

• Speech Coding

– Coding Standards

– Coder Attributes including Quality Evaluation

– Coding Methodologies

• Speech Transmission

– Trunk-line telephony

– Wireless telephony

• Speech Storage

– Voice Mail, Voice Memo, Answering machines

• Speech Synthesis

– Text-to-speech-synthesis

– Automatic information services

Speech Processing Applications

• Speaker Verification and Identification

– Phone banking

– Secure entry

• Aids for the Handicapped

– Variable rate playback

– Hearing aids

– Reading machine for visually impaired

– Visual display of speech information for hearing impaired

• Speech Enhancement

– Echo and noise cancellation

• Speech Recognition

– Automatic language translation

• Voice Personality Transformation

– Voice conversion from “source” to “target”

“It is the variation of pressure, from atmospheric pressure, as a function of time, caused by traveling waves from the speaker’s mouth (apart from nostrils, cheeks and throat).”

The Speech Signal

Units:

SPL (Sound Pressure Level) in dB

relative to a reference level.

Reference: 10^-16 W/cm^2

- Corresponds to ‘just barely audible’
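A level in dB relative to the reference above can be sketched as follows. This assumes the usual definition for a power-like quantity, ten times the base-10 logarithm of the intensity ratio; the function name and the example intensity are illustrative, not taken from the slides.

```python
import math

# Reference intensity quoted on the slide ("just barely audible").
I_REF_W_PER_CM2 = 1e-16

def intensity_level_db(intensity_w_per_cm2: float) -> float:
    """Level in dB of an acoustic intensity relative to I_REF_W_PER_CM2."""
    if intensity_w_per_cm2 <= 0:
        raise ValueError("intensity must be positive")
    return 10.0 * math.log10(intensity_w_per_cm2 / I_REF_W_PER_CM2)

# The reference itself sits at 0 dB; every factor of 10 in intensity adds 10 dB.
print(intensity_level_db(1e-16))
print(intensity_level_db(1e-10))  # a million times the reference, about 60 dB
```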

The Intensity Level of Speech

[Figure: intensity-level scale marked with: just barely audible; whisper; variations in normal voice level (1 meter distance from mouth); heavy traffic; airplane; rock concert]

• Energy of speech during 1 s
– 2 x 10^-5 Joules
(It takes 100 Joules to light a 100 W bulb for 1 s)

• Strongest vowel: /a/ as in “talk”

• Weakest vowel: /i/ as in “see”

• Strongest consonant: /r/ as in “run”

• Weakest consonant: /θ/ as in “thin”

Signal Category     Bandwidth (Hz)   Sampling rate (kHz)   Source rate (kbps)
Telephone Speech    300-3400         8.0                   128
Wideband Speech     50-7000          16.0                  256
Wideband Audio      20-20,000        44.1/48.0             705/768

Speech & Audio Signal Specs.
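The source rates in the table are consistent with single-channel linear PCM at 16 bits per sample. The sketch below reproduces them under that assumption; the function name and the channel parameter are illustrative.

```python
def pcm_bit_rate_kbps(sampling_rate_khz: float,
                      bits_per_sample: int = 16,
                      channels: int = 1) -> float:
    """Uncompressed PCM bit rate in kbps (kHz x bits/sample x channels)."""
    return sampling_rate_khz * bits_per_sample * channels

rows = [
    ("Telephone speech", 8.0),
    ("Wideband speech", 16.0),
    ("Wideband audio", 44.1),
    ("Wideband audio", 48.0),
]
for name, fs_khz in rows:
    print(f"{name}: {fs_khz} kHz -> {pcm_bit_rate_kbps(fs_khz):.1f} kbps")
```

At 44.1 kHz the product is 705.6 kbps, which the table lists as 705.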

Speech Articulation by the Vocal System

Reproduced from: D. O’Shaughnessy, Human and machine speech communication, IEEE Press, 2000

Speech Classes by Articulation

• Voiced speech

• Unvoiced speech

• Transient (stop) sounds

The relationship between speech sounds (phonemes) and their acoustic realizations

– Waveform

– Spectrum

– Spectrogram

Acoustic Analysis of Speech

Time Waveform of a Speech Sentence

[Figure: time waveform of the sentence “THIS IS GOOD”; axis: Time (s), 0 to 1.4; segments labelled (TH) (i) s, (i) s, (G) U (O) d]

• Vowels: High energy, periodic, steady-state utterance
• Unvoiced fricatives: Low energy, noise-like, steady-state utterance
• Voiced fricatives: Low energy, element of periodicity, steady-state utterance
• Stops: Transient release, medium to low energy
• Nasals: Low-to-medium energy, periodic, steady-state utterance

Waveform Analysis of a Speech

Fundamental frequency F0 / Pitch period

F0             Male      Female
Average (Hz)   132       223
Range (Hz)     50-250    120-500
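One simple way to measure F0 on a voiced frame is to pick the strongest autocorrelation peak within the plausible pitch-lag range. The slides do not specify an estimator, so this is a swapped-in illustration: the function name, the 50 to 500 Hz search range (taken from the table's overall range), and the synthetic test frame are all assumptions.

```python
import numpy as np

def estimate_f0_autocorr(frame, fs, f0_min=50.0, f0_max=500.0):
    """Estimate F0 (Hz) of a voiced frame from its autocorrelation peak."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    # Autocorrelation at non-negative lags.
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    lag_min = int(fs / f0_max)                    # shortest period searched
    lag_max = min(int(fs / f0_min), len(x) - 1)   # longest period searched
    lag = lag_min + int(np.argmax(r[lag_min:lag_max + 1]))
    return fs / lag

fs = 8000
t = np.arange(int(0.04 * fs)) / fs   # one 40 ms frame
# Synthetic "voiced" frame: 132 Hz fundamental plus two weaker harmonics.
frame = sum(np.sin(2 * np.pi * 132 * k * t) / k for k in (1, 2, 3))
f0 = estimate_f0_autocorr(frame, fs)
print(f"estimated F0 = {f0:.1f} Hz, pitch period = {1000 / f0:.2f} ms")
```

Because the lag is an integer number of samples, the estimate is quantized; at 8 kHz the resolution near 132 Hz is roughly 2 Hz.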

Acoustic Analysis of Vowels

• Stop Consonants

– Momentary blockage of the vocal tract (50-100 ms): Closure phase

– Release burst (shortest acoustic event)

– Voice onset time (VOT)

• Fricatives

– Narrow constriction somewhere in the vocal tract

– Turbulent airflow through the constriction

Acoustic Analysis of Consonants

International Phonetic Alphabet

Universal Speech Production Model

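[Block diagram. Voiced branch: Impulse Train Generator, Glottal Pulse Model, Voiced Gain. Unvoiced branch: White Noise Generator, Unvoiced Gain. A Voiced or Unvoiced switch selects the excitation for the Vocal Tract Filter, which is followed by the Radiation Model and gives the output speech.]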

Vocal Tract Model

• Time-varying all-pole linear filter excited by a source signal.

• H(z) models the vocal tract system.

e[n] → H(z) = 1/A(z) → s[n]
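A minimal sketch of the all-pole model, assuming the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, so that s[n] = e[n] - a_1 s[n-1] - ... - a_p s[n-p]. The coefficients are held fixed here, whereas the model on the slide is time-varying (in practice they would be updated frame by frame). The resonator values and the excitation are hypothetical.

```python
import numpy as np

def all_pole_filter(e, a):
    """Filter excitation e[n] through H(z) = 1/A(z).

    a holds a_1..a_p of A(z) = 1 + a_1 z^-1 + ... + a_p z^-p (sign
    convention assumed for this sketch).
    """
    e = np.asarray(e, dtype=float)
    s = np.zeros(len(e))
    for n in range(len(e)):
        acc = e[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * s[n - k]
        s[n] = acc
    return s

# Hypothetical single resonance: pole pair at radius 0.95 and 500 Hz, fs = 8 kHz.
fs, f_res, radius = 8000, 500.0, 0.95
theta = 2 * np.pi * f_res / fs
a = [-2 * radius * np.cos(theta), radius ** 2]

# Voiced-like excitation: impulse train with a 100 Hz repetition rate.
e = np.zeros(400)
e[::80] = 1.0
s = all_pole_filter(e, a)
print(s[:5])
```

Passing s[n] back through the inverse filter A(z) recovers e[n], which is the sense in which the LP residual approximates the excitation.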

[Figure: five panels, axis: Frequency (Hz), 0 to 4000. Panel titles: Voiced Speech Spectrum; Superimposed 2nd-order LP Envelope; Superimposed 2nd and 6th order LP Envelopes; Superimposed 2nd, 6th and 10th order LP Envelopes; Superimposed 2nd, 6th, 10th and 16th order LP Envelopes]
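LP envelopes of increasing order, as in the panels above, can be obtained with the autocorrelation method and the Levinson-Durbin recursion. The slides do not say which method was used, so this is one common choice shown as a sketch. The test signal is a synthetic second-order autoregressive process, not speech, and its coefficients are hypothetical.

```python
import numpy as np

def lpc_autocorr(frame, order):
    """Levinson-Durbin on the frame's autocorrelation.

    Returns (a, err): a[0] = 1 and a[1..order] are the coefficients of
    A(z) = 1 + sum_k a[k] z^-k; err is the prediction-error energy.
    """
    x = np.asarray(frame, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

# Synthetic AR(2) signal with a known A(z) (hypothetical values).
rng = np.random.default_rng(0)
true_a = np.array([1.0, -1.6, 0.9])
e = rng.standard_normal(4000)
s = np.zeros_like(e)
for n in range(len(e)):
    s[n] = e[n] - sum(true_a[k] * s[n - k] for k in (1, 2) if n - k >= 0)

a2, _ = lpc_autocorr(s, 2)
print("estimated A(z) coefficients:", np.round(a2, 3))
errors = {p: lpc_autocorr(s, p)[1] for p in (2, 6, 10, 16)}
for p, err in errors.items():
    print(f"order {p:2d}: prediction-error energy {err:.1f}")
```

The error energy never increases with order, since each stage multiplies it by 1 - k^2. On real speech, higher orders follow more of the formant detail, which is what the superimposed envelopes illustrate.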

Unvoiced Speech and 10th order LP Residual

[Figure: two panels, unvoiced speech waveform and its 10th-order LP residual; axis: Time (ms), 0 to 40]

Voiced Speech and 10th-order LP Residual

[Figure: two panels, voiced speech waveform and its 10th-order LP residual; axis: Time (ms), 0 to 40]

• Short-term correlation

• Long-term correlation

Speech Coding

• For telephone band (or narrowband) speech:
– Signal Bandwidth: 300-3400 Hz

– Sampling Rate: 8000 Hz

– Resolution: 16 bits / sample linear PCM

• Uncompressed bit rate: 16 bits/sample x 8000 samples/s = 128 kbit/s

• What is the minimum coding rate for transmitting the message information?

Coding Rates

Coder Classes according to Bit-Rate

B > 16 kbps          High bit rate coders
4 < B <= 16 kbps     Medium bit rate coders
1 < B <= 4 kbps      Low bit rate coders
B < 1 kbps           Very low bit rate coders
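The bit-rate classes can be expressed as a small lookup. The class boundaries come from the table; the function name is illustrative. The table leaves B = 1 kbps exactly unassigned, and the sketch places it in the very low class as an assumption.

```python
def coder_class(bit_rate_kbps: float) -> str:
    """Map a coder bit rate B (kbps) to the class used in the table."""
    if bit_rate_kbps <= 0:
        raise ValueError("bit rate must be positive")
    if bit_rate_kbps > 16:
        return "high bit rate"
    if bit_rate_kbps > 4:
        return "medium bit rate"
    if bit_rate_kbps > 1:
        return "low bit rate"
    # Assumption: B = 1 kbps exactly is grouped with the very low class.
    return "very low bit rate"

for name, rate in [("64 kbps PCM", 64), ("8 kbps CELP", 8), ("2.4 kbps vocoder", 2.4)]:
    print(f"{name}: {coder_class(rate)}")
```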

• ITU-T: International Telecommunication Union, Telecommunication Standardization Sector (UN)
• MPEG: Moving Picture Experts Group (ISO/IEC)
• INMARSAT: International Maritime Satellite Organization, for geosynchronous satellites
• US Government: DoD, NATO
• TIA: Telecommunications Industry Association, for North American telecom standards
• ETSI: European Telecommunications Standards Institute

Standards Organizations

Name                       Coding Type        Bit-rate (kbps)   Organization   Year
G.711                      PCM µ-law/A-law    64                ITU-T          1972
G.721/G.723/G.726/G.727    ADPCM              32/24/40/16       ITU-T          1984/86/
G.728                      LD-CELP            16                ITU-T          1992
G.729                      CS-ACELP           8.0               ITU-T          1995
G.723.1                    ACELP              6.3/5.3           ITU-T          1995
G.722 (Wideband)           SB-ADPCM           48/56/64          ITU-T          1985

Speech Coding Standards

Name                       Coding Type        Bit-rate (kbps)   Organization   Year
G.722.1 (Wideband)         Transform          24/32             ITU-T          1999
Inmarsat                   IMBE               4.15              INMARSAT       1990
IS-54 (old)                VSELP              7.95              TIA            1992
GSM-FR                     RPE-LTP            13                GSM            1991
GSM-HR                     CELP               5-6               GSM            1994
GSM-EFR                    CELP               12.2              GSM            1997
IS-641 (new)               ACELP              7.4               TIA            1997
Iridium                    AMBE               2.4               Iridium        1996
MPEG-4                     HVXC               2-4               MPEG/ISO       1999
MPEG-4                     CELP               4-24              MPEG/ISO       1999
FS-1015                    LPC-10             2.4               US-DoD/NATO    1984
FS-1016                    CELP               4.8               US-DoD/NATO    1989
MELP                       MELP               2.4               US-DoD/NATO    1996

• Coding Methodologies

– Waveform coding

– Vocoding or parametric coding

– Hybrid coding

Coding Methodologies

Classes according to Coding Type

[Figure: speech quality (up to Excellent) versus bit rate (kbps; ticks at 1, 2, 4, 8, 16, 32, 64), with regions for parametric coders, hybrid coders and waveform-approximating coders]

Coding Standards

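[Figure: speech quality (up to Excellent) versus bit rate (kbps; ticks at 1, 2, 4, 8, 16, 32, 64), with regions for parametric coders, hybrid coders and waveform-approximating coders; standards marked on the plot: FS1015, G.723.1, GSM FR, GSM EFR, G.726, G.711, Linear]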

PCM Coding

x[n] → Q[.] → x’[n]

• Instantaneous, non-uniform quantization

• For signals whose energy varies with time, e.g. speech, uniform quantization is inefficient.
• With a uniform quantizer, SQNR falls by 6 dB each time the signal amplitude is halved.
• With a logarithmic quantizer, SQNR is independent of signal level.
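The level dependence of SQNR can be checked numerically. The sketch compares an 8-bit uniform quantizer with an 8-bit quantizer applied after continuous µ-law companding (µ = 255 is assumed; G.711 itself uses a segmented approximation, which is not reproduced here). The test signal is uniformly distributed noise, not speech.

```python
import numpy as np

MU = 255.0  # assumed companding constant; the slide does not give a value

def uniform_quantize(x, bits=8):
    """Mid-rise uniform quantizer covering [-1, 1)."""
    levels = 2 ** bits
    step = 2.0 / levels
    idx = np.clip(np.floor(x / step), -levels // 2, levels // 2 - 1)
    return (idx + 0.5) * step

def mu_law_quantize(x, bits=8):
    """Compress, quantize uniformly in the compressed domain, then expand."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    yq = uniform_quantize(y, bits)
    return np.sign(yq) * np.expm1(np.abs(yq) * np.log1p(MU)) / MU

def sqnr_db(x, xq):
    """Signal-to-quantization-noise ratio in dB."""
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - xq) ** 2))

rng = np.random.default_rng(0)
unit = rng.uniform(-1.0, 1.0, 20000)   # synthetic test signal
results = {}
for amplitude in (0.5, 0.25):
    x = amplitude * unit
    results[amplitude] = (sqnr_db(x, uniform_quantize(x)),
                          sqnr_db(x, mu_law_quantize(x)))
    print(f"amplitude {amplitude}: uniform {results[amplitude][0]:.1f} dB, "
          f"mu-law {results[amplitude][1]:.1f} dB")
```

Halving the amplitude should cost the uniform quantizer about 6 dB while leaving the µ-law figure nearly unchanged.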

ADPCM Coding

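[Block diagram: ADPCM encoder and decoder. Encoder: x’[n] is subtracted from the input x[n] to form the difference d[n], which Q[.] quantizes to the code c[n]; d’[n] and x’[n] are summed to give x”[n]. Decoder: d’[n] and x’[n] are summed to give x”[n].]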

Prediction in the context of Coding

[Figure: two panels, signal and first-difference signal; axis: Time (ms), 0 to 20]

• DPCM with fixed predictor can give 4-11 dB improvement over PCM.

• PCM with adaptive quantization can give ~5 dB improvement over µ-law non-adaptive PCM.

• DPCM with adaptive prediction can give 10-12 dB improvement over fixed predictor.
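A minimal sketch of DPCM with a fixed first-order predictor, to show why coding the prediction difference helps: for a correlated signal the difference has much less energy than the signal itself. The predictor coefficient 0.9, the step size, and the sinusoidal test signal are assumptions for illustration; the gain printed for this synthetic signal is not comparable to the speech figures quoted above.

```python
import numpy as np

def dpcm(x, step, a=0.9):
    """First-order DPCM with the quantizer inside the prediction loop.

    Returns (d, recon): the unquantized difference signal and the
    reconstruction a decoder would produce from the quantized differences.
    """
    x = np.asarray(x, dtype=float)
    d = np.zeros(len(x))
    recon = np.zeros(len(x))
    prev = 0.0
    for n in range(len(x)):
        pred = a * prev                        # predict from previous reconstruction
        d[n] = x[n] - pred                     # prediction difference
        d_q = step * np.round(d[n] / step)     # uniform quantization
        recon[n] = pred + d_q                  # same value the decoder computes
        prev = recon[n]
    return d, recon

fs = 8000
t = np.arange(160) / fs                        # 20 ms of signal
x = 0.5 * np.sin(2 * np.pi * 200 * t)          # slowly varying, highly correlated
step = 0.01
d, recon = dpcm(x, step)

gain_db = 10 * np.log10(np.sum(x ** 2) / np.sum(d ** 2))
print(f"prediction gain on this signal: {gain_db:.1f} dB")
print(f"max reconstruction error: {np.max(np.abs(x - recon)):.4f}")
```

Because the predictor runs on reconstructed samples, quantization errors do not accumulate: the reconstruction error stays within half a quantizer step.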

ADPCM Coding

Code Excited Linear Prediction (CELP) Coding

• Most coders in the 4.8-16 kbps range are based on Linear Prediction Analysis-by-Synthesis (LPAS) coding.

• CELP belongs to the LPAS paradigm of speech coding.

Generic Linear Prediction Analysis Synthesis (LPAS) Coder

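[Block diagram: the input speech feeds LP Analysis, which sets the Synthesis Filter; the Excitation Generator drives the Synthesis Filter, and a Minimization block adjusts the excitation to reduce the error between the input speech and the synthesized speech.]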

CELP Decoder

[Block diagram: excitation parameters → Excitation Generator → G/A(z), set by the LP and gain parameters → synthesized speech]
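The analysis-by-synthesis idea behind CELP can be sketched as an exhaustive codebook search: each candidate excitation is passed through 1/A(z), the best gain is computed in closed form, and the candidate with the least squared error is kept. This is a heavily simplified illustration, not any standard's algorithm; it omits perceptual weighting, the adaptive (pitch) codebook, and filter memory between frames. The codebook, the coefficients, and all names are hypothetical.

```python
import numpy as np

def synthesize(excitation, a, gain=1.0):
    """Decoder side: gain * excitation through 1/A(z), A(z) = 1 + sum a[k-1] z^-k."""
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = gain * excitation[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * s[n - k]
        s[n] = acc
    return s

def search_codebook(target, codebook, a):
    """Encoder side: return (index, gain, error) of the best codevector."""
    best = (None, 0.0, np.inf)
    for idx, code in enumerate(codebook):
        y = synthesize(code, a)                 # unit-gain candidate
        energy = np.dot(y, y)
        if energy == 0.0:
            continue
        gain = np.dot(target, y) / energy       # least-squares optimal gain
        err = np.sum((target - gain * y) ** 2)
        if err < best[2]:
            best = (idx, gain, err)
    return best

rng = np.random.default_rng(1)
codebook = rng.standard_normal((64, 40))        # 64 random vectors of 40 samples
a = [-1.6, 0.9]                                 # hypothetical LP coefficients
target = synthesize(codebook[17], a, gain=0.5)  # frame built from entry 17

index, gain, err = search_codebook(target, codebook, a)
print(index, round(gain, 3), err)
```

Only the index and the gain (plus the LP parameters) need to be transmitted; the decoder repeats the `synthesize` step with the same codebook.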

• Speech Quality

– Objective measures

• Segmental SNR

• Itakura-Saito distance measure

• Spectral distortion (SD)

• ITU-T P.862 Recommendation

– Subjective measures

• Mean opinion score (MOS)

• Diagnostic Rhyme Test (DRT)

• Diagnostic Acceptability Measure (DAM)

Speech Quality Measurement
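Of the objective measures listed, segmental SNR is the simplest to sketch: the SNR is computed per frame in dB and then averaged. The 160-sample frame (20 ms at 8 kHz) and the clamping of per-frame values to the range -10 to 35 dB are conventions assumed here, not taken from the slides.

```python
import numpy as np

def segmental_snr_db(clean, coded, frame_len=160, floor_db=-10.0, ceil_db=35.0):
    """Mean of per-frame SNRs (dB), each clamped to [floor_db, ceil_db]."""
    clean = np.asarray(clean, dtype=float)
    coded = np.asarray(coded, dtype=float)
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        ref = clean[start:start + frame_len]
        err = ref - coded[start:start + frame_len]
        sig_energy = np.sum(ref ** 2)
        if sig_energy == 0.0:
            continue                              # skip silent frames
        noise_energy = max(np.sum(err ** 2), 1e-12)
        snr = 10.0 * np.log10(sig_energy / noise_energy)
        snrs.append(min(max(snr, floor_db), ceil_db))
    if not snrs:
        raise ValueError("no non-silent frames to evaluate")
    return float(np.mean(snrs))

fs = 8000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 200 * t)
coded = 1.1 * clean            # error is 10% of the signal in every frame
seg_snr = segmental_snr_db(clean, coded)
print(f"segmental SNR = {seg_snr:.2f} dB")
```

Averaging in the dB domain stops loud frames from dominating the result, which is the usual argument for segmental over global SNR.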

• Listening quality scale

Excellent   5
Good        4
Fair        3
Poor        2
Bad         1

Absolute Category Rating Tests (MOS)

• Measures speech intelligibility

• Listeners are presented with one of two words that differ only in the leading consonant

– Examples:

• Meet - Beat

• Than - Dan

• Met - Net

• Jest - Guest

Diagnostic Rhyme Test

• Total possible pairs = 96

• Intelligibility score, S, is given by:
S = 100 x [N(correct) – N(incorrect)] / N(test pairs)

Coder     Rate (kbps)   DRT     MOS
FS1016    4.8           91.7    3.3
G.728     16            93.0    3.9

Diagnostic Rhyme Test
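The DRT intelligibility score is a direct calculation. The response counts below are hypothetical and are not results for any coder.

```python
def drt_score(n_correct: int, n_incorrect: int, n_pairs: int) -> float:
    """S = 100 * (N_correct - N_incorrect) / N_test_pairs."""
    if n_pairs <= 0:
        raise ValueError("number of test pairs must be positive")
    return 100.0 * (n_correct - n_incorrect) / n_pairs

# Hypothetical listener: 90 correct and 6 incorrect out of 96 pairs.
print(drt_score(90, 6, 96))
# All 96 correct gives the maximum score; 48 right and 48 wrong gives 0.
print(drt_score(96, 0, 96), drt_score(48, 48, 96))
```

Subtracting the incorrect responses corrects for guessing: with two alternatives per item, random answers score about zero.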

• Part of ITU-T P.862 standard

• Objective is to mimic sound perception by persons in real life

• PESQ simulates experiments in which subjects judge speech quality

• Physical signals are mapped to psychophysical representations that match internal representations in the head

Perceptual evaluation of speech quality (PESQ)

• Complexity

– Computational complexity

• Simplex/half-duplex/full-duplex real-time performance on a single DSP

• Fixed point vs. floating point

• CELP coders are computationally complex

– Memory requirement

• Storage of look-up tables, codebooks etc.

Speech Coder Complexity Issues

Timing Diagram for various Coding Delays

[Timing diagram; axis: Time (frame index), 0 to 5. Rows, each starting later than the one above: buffer input speech frames 1 to 5; encode frames 1 to 4; transmit bits of frames 1 to 3; decode frames 1 to 3; play back decoded speech frames 1 and 2. Marked intervals: algorithmic buffering delay, encoder processing, bit transmission, decoder processing. The sum of encoder and decoder processing is the total processing delay; together the intervals make up the total one-way coding delay.]
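The delay components named in the timing diagram add up as in the sketch below. The look-ahead term is an assumption (the diagram shows only frame buffering) and defaults to zero; the example numbers are hypothetical and do not describe any particular coder.

```python
def one_way_coding_delay_ms(frame_ms: float,
                            encode_ms: float,
                            transmit_ms: float,
                            decode_ms: float,
                            lookahead_ms: float = 0.0) -> float:
    """Total one-way coding delay as the sum of its components (ms)."""
    algorithmic = frame_ms + lookahead_ms   # buffering before encoding can start
    processing = encode_ms + decode_ms      # total processing delay
    return algorithmic + processing + transmit_ms

# Hypothetical coder: 20 ms frames, 8 ms to encode, 20 ms to send, 2 ms to decode.
total = one_way_coding_delay_ms(frame_ms=20, encode_ms=8, transmit_ms=20, decode_ms=2)
print(f"total one-way coding delay: {total} ms")
```

For real-time operation on a single processor, encoding plus decoding of one frame has to finish within one frame period, which bounds the processing term.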

Thank You!
