NTT Labs. 2005 2005.12.16 NTT Communication Science Labs. Takehiro Moriya 守谷健弘 Coding Technologies for Speech and Audio Signals ISPACS 2005


Self introduction

• 1980 Joined NTT, basic research
– Transform domain interleave VQ
– Conjugate VQ
• 1989 Guest researcher at AT&T Bell Labs
• 1990 Standardization for Japanese PDC (PSI-CELP)
• 1993 Standardization for ITU-T (CS-ACELP)
• 1995 Standardization for MPEG-4 (TwinVQ)
• 2001 Standardization for MPEG lossless audio

Technologies of speech and audio coding

[Figure: bit rate [kbit/s] (2 to 1024) versus year (1975 to 2005). Plotted technologies and standards: PARCOR, LSP, APC-AB, VSELP, PSI-CELP, G.711, G.726, G.728, G.729, G.722, CD, DAT, MPEG-1, MPEG-2, MPEG-4, MPEG-4 (lossless), MP3, AAC. Application labels: vocoder, telephone, mobile, mobile phone, VoIP/mobile, wideband, music, streaming, archive, ubiquitous.]

Outline

• 1. Fundamentals
– 1.1 Time domain for speech
– 1.2 Frequency domain for audio
• 2. Standardization
– 2.1 ITU-T speech coding
– 2.2 MPEG audio coding
• 3. Hot topics
– 3.1 MPEG lossless (ALS, SLS, DST)
– 3.2 MPEG SBR and SSC
– 3.3 MPEG surround


Fundamentals

Category of coding

[Figure: taxonomy of coding. Node labels: coding; compression, presentation, metadata; lossless, lossy; time-domain, frequency-domain; media types text, speech, audio, image, video; speech, language.]

Time-domain

• linear prediction -> CELP
• predictive coefficients
– PARCOR (partial auto correlation)
– LSP (line spectral pair)
• vector quantization of excitation source
– algebraic structure (ACELP)
• Big market for cellular phone and VoIP

LPC (Linear Predictive Coding)

[Figure: LPC synthesis filter. The excitation (innovation, prediction residual) enters a summing node Σ. The synthesized output is fed back through delays z^-1 weighted by the predictive coefficients α1, α2, ..., αp.]
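The feedback structure in the LPC diagram can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function names `lpc_synthesize` and `lpc_residual` and the sign convention y[n] = e[n] + Σ αi·y[n-i] are assumptions chosen to match the summing node with delayed, weighted feedback.

```python
def lpc_synthesize(excitation, alphas):
    """All-pole synthesis: y[n] = e[n] + sum_i alphas[i-1] * y[n-i]."""
    output = []
    for n, e in enumerate(excitation):
        acc = e
        for i, a in enumerate(alphas, start=1):
            if n - i >= 0:
                acc += a * output[n - i]  # feedback through i delays
        output.append(acc)
    return output


def lpc_residual(signal, alphas):
    """Inverse (analysis) filter: e[n] = y[n] - sum_i alphas[i-1] * y[n-i]."""
    residual = []
    for n, y in enumerate(signal):
        prediction = sum(
            a * signal[n - i]
            for i, a in enumerate(alphas, start=1)
            if n - i >= 0
        )
        residual.append(y - prediction)
    return residual


if __name__ == "__main__":
    alphas = [0.9, -0.5]                 # hypothetical 2nd-order predictor
    excitation = [1.0, 0.0, 0.0, 0.0, 0.0]
    synthesized = lpc_synthesize(excitation, alphas)
    print(synthesized)
    print(lpc_residual(synthesized, alphas))  # recovers the excitation
```

Running the analysis filter on the synthesized output returns the excitation, which is why the excitation is also called the prediction residual.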

Family of LPC parameters

• predictive coefficients α1 .... αp
• PARCOR coefficients k1 .... kp
• LSP parameters ω1 .... ωp (positions ω1, ω2, ..., ωp on the frequency axis)

Merits of LSP
• stability
• interpolation
• quantization
• prediction
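The link between the first two parameter sets can be illustrated with the Levinson-Durbin recursion, which produces predictive coefficients and PARCOR (reflection) coefficients together from an autocorrelation sequence. This is a sketch under assumptions: the function name `levinson_durbin` is hypothetical, the predictor convention is x[n] ≈ Σ αi·x[n-i], and the sign of the PARCOR coefficients varies between textbooks. The conversion to LSP is not shown.

```python
def levinson_durbin(r, order):
    """Solve the normal equations for autocorrelation r[0..order].

    Returns (alphas, parcor, error), where alphas predict
    x[n] ~ sum_i alphas[i-1] * x[n-i], parcor[m-1] is the reflection
    coefficient of stage m, and error is the final residual energy.
    """
    alphas = []
    parcor = []
    error = r[0]
    for m in range(1, order + 1):
        # Correlation not yet explained by the order (m-1) predictor.
        acc = r[m] - sum(alphas[i] * r[m - 1 - i] for i in range(m - 1))
        k = acc / error
        # Order update: a_m[i] = a_{m-1}[i] - k * a_{m-1}[m-i].
        updated = [alphas[i] - k * alphas[m - 2 - i] for i in range(m - 1)]
        updated.append(k)
        alphas = updated
        parcor.append(k)
        error *= 1.0 - k * k
    return alphas, parcor, error


if __name__ == "__main__":
    # Autocorrelation of a first-order process with correlation 0.9.
    alphas, parcor, error = levinson_durbin([1.0, 0.9, 0.81], order=2)
    print(alphas, parcor, error)
    # The synthesis filter is stable when every |k| is below 1.
    print(all(abs(k) < 1.0 for k in parcor))
```

The stability test on the PARCOR values is a simple magnitude check, whereas checking stability directly on α1 .... αp requires finding the roots of the predictor polynomial.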

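CELP (Code Excited Linear Prediction)

[Figure: CELP block diagram. Blocks: adaptive codebook (periodic), random codebook (noise, pulse), gain, summation (+), LPC synthesis driven by the LSP parameters, comparison with the input, and perceptual error. The perceptual error is fed back to the codebook search (analysis by synthesis).]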
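The analysis-by-synthesis loop in the CELP diagram can be sketched as an exhaustive codebook search. This is a deliberately reduced illustration, with loud assumptions: there is no adaptive codebook, the perceptual weighting is replaced by plain squared error, and the names `synthesize` and `search_codebook` as well as the tiny codebook are hypothetical.

```python
def synthesize(excitation, alphas):
    """All-pole LPC synthesis: y[n] = e[n] + sum_i alphas[i-1] * y[n-i]."""
    output = []
    for n, e in enumerate(excitation):
        acc = e
        for i, a in enumerate(alphas, start=1):
            if n - i >= 0:
                acc += a * output[n - i]
        output.append(acc)
    return output


def search_codebook(target, codebook, alphas):
    """Return (index, gain, error) of the code vector whose synthesized,
    gain-scaled output is closest to the target (squared error)."""
    best = None
    for index, code in enumerate(codebook):
        synth = synthesize(code, alphas)
        energy = sum(s * s for s in synth)
        if energy == 0.0:
            continue  # an all-zero vector cannot match anything
        correlation = sum(t * s for t, s in zip(target, synth))
        gain = correlation / energy  # optimal gain for this code vector
        error = sum((t - gain * s) ** 2 for t, s in zip(target, synth))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


if __name__ == "__main__":
    alphas = [0.5]
    codebook = [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [1.0, -1.0, 1.0, -1.0],
    ]
    target = [2.0 * s for s in synthesize(codebook[1], alphas)]
    print(search_codebook(target, codebook, alphas))
```

Only the winning index and the quantized gain would need to be transmitted, since the decoder holds the same codebook and synthesis filter.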

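Synthesis model for vocoder

[Figure: excitation defined by pitch interval and gain (or a random source), summed (Σ) and passed through the synthesis filter.]

Synthesis model for multi-pulse

[Figure: excitation defined by pitch interval and gain plus the amplitude and position of each pulse, summed (Σ) and passed through the synthesis filter.]

Synthesis model for regular multi-pulse

[Figure: excitation defined by pitch interval and gain plus the amplitude of regularly spaced pulses, summed (Σ) and passed through the synthesis filter.]

Synthesis model for CELP

[Figure: excitation defined by pitch interval and gain plus the selection of a code vector, summed (Σ) and passed through the synthesis filter.]

Synthesis model of VSELP

[Figure: excitation defined by pitch interval and gain plus the polarity (+/-) of each base vector, summed (Σ) and passed through the synthesis filter.]

Synthesis model for CS-CELP

[Figure: excitation defined by pitch interval and gain plus the selection of a vector pair with polarity (+/-), summed (Σ) and passed through the synthesis filter.]

Synthesis model of ACELP

[Figure: excitation defined by pitch interval and gain plus the selection of unit pulse positions with polarity (+/-), summed (Σ) and passed through the synthesis filter.]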

Simplicity is the seal of truth

Frequency-domain

• Lapped transform: MDCT
– no frame-boundary noise and no information loss, thanks to overlap
• Filter bank: QMF
– trades off time resolution against frequency resolution
• adaptive noise control
• psycho-acoustics

Transform coding

[Figure: transform coding block diagram. Blocks: input, transform (time to frequency), envelope estimation, adaptive bit allocation, quantization, side information, transform (frequency to time), output.]

Base of DCT

[Figure: DCT basis functions, plotted against time and ordered by frequency.]

Base of MDCT

[Figure: MDCT basis functions, showing the overlap with the previous frame and the overlap with the next frame. Labels: symmetry, anti-symmetry.]
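The overlap property of the MDCT can be checked numerically. The sketch below is an illustration under assumptions: it uses a sine window (one common choice that satisfies the Princen-Bradley condition), direct O(N²) sums instead of a fast algorithm, and hypothetical function names. Each frame of 2N samples yields only N coefficients, yet overlap-add of the windowed inverse transforms cancels the aliasing and reconstructs the input.

```python
import math


def sine_window(n_half):
    """Sine window of length 2N; w[n]^2 + w[n+N]^2 = 1."""
    size = 2 * n_half
    return [math.sin(math.pi * (n + 0.5) / size) for n in range(size)]


def mdct(frame, window):
    """2N windowed samples -> N coefficients."""
    n_half = len(frame) // 2
    return [
        sum(
            window[n] * frame[n]
            * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
            for n in range(2 * n_half)
        )
        for k in range(n_half)
    ]


def imdct(coeffs, window):
    """N coefficients -> 2N windowed, time-aliased samples."""
    n_half = len(coeffs)
    return [
        window[n] * (2.0 / n_half) * sum(
            coeffs[k]
            * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
            for k in range(n_half)
        )
        for n in range(2 * n_half)
    ]


def analysis_synthesis(signal, n_half):
    """MDCT then IMDCT with 50% overlap-add.

    len(signal) must be a multiple of n_half. Zero padding on both
    sides ensures every sample is covered by two frames.
    """
    window = sine_window(n_half)
    padded = [0.0] * n_half + list(signal) + [0.0] * n_half
    output = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * n_half + 1, n_half):
        frame = padded[start:start + 2 * n_half]
        block = imdct(mdct(frame, window), window)
        for n, value in enumerate(block):
            output[start + n] += value
    return output[n_half:n_half + len(signal)]


if __name__ == "__main__":
    signal = [math.sin(0.3 * n) + 0.1 * n for n in range(16)]
    restored = analysis_synthesis(signal, n_half=4)
    print(max(abs(a - b) for a, b in zip(signal, restored)))
```

The printed maximum difference should be at the level of floating-point rounding, which is the sense in which the overlap causes no information loss.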

QMF for MPEG-1, 2 Layer-I, II

[Figure: 32-band QMF filter bank (analysis) and 32-band QMF filter bank (synthesis) along the frequency axis, with the bit stream between them and the reconstruction at the output.]

• down sample
• adaptive bit allocation for 32 equal bands (energy, masking)
• adaptive quantization

QMF for MPEG-1, 2 Layer-III

• down sample
• long and short MDCT
• adaptive bit allocation for Bark-scale (energy, masking)
• adaptive quantization (Huffman coding), bit reservoir

QMF for MPEG extension tools

• SBR (Spectral Band Replication)
• PS (Parametric Stereo)
• Surround


Masking effect

[Figure: log spectrum versus frequency, showing the original spectrum, the allowable noise level, the audible level, and the masked region.]

Physical and perceptual distortion

[Figure: the original surrounded by an un-noticeable region (masking). Distortions shown: result of compression, additive noise, additive echo. Labels: characteristics of perception, application.]

Distortion by additional noise

[Figure: original and distortion shown over time and as log spectrum versus frequency. The distortion is noticeable.]

Distortion by data compression

[Figure: original and distortion shown over time and as log spectrum versus frequency. Quantization noise is controlled so that the distortion is masked.]

Distortion by echo

[Figure: original and distortion shown over time (time axis marked at 40 ms) and as log spectrum versus frequency. The echo is masked. Applications: watermark, search or recognition.]

Predictive coding and transform coding

• method: time-domain (prediction) vs. frequency-domain (transform, subband)
• signal: Speech (5 ms) vs. Audio (30 ms)
• small correlation: unpredictable vs. flat spectrum
• large correlation: predictable vs. varied spectrum
• gain: prediction gain (waveform energy / residual energy) = transform gain (arithmetic mean / geometric mean)
• effect: closed-loop quantization vs. adaptive bit allocation, weighted quantization
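The equality of prediction gain and transform gain can be checked on a simple model. The sketch below assumes a first-order autoregressive signal x[n] = a·x[n-1] + e[n] with unit-variance innovation, which is not an example from the talk. The arithmetic and geometric means are taken over samples of the power spectrum, and the function names are hypothetical.

```python
import math


def ar1_power_spectrum(a, num_bins):
    """Power spectrum 1 / |1 - a*exp(-jw)|^2 sampled on a uniform grid."""
    spectrum = []
    for m in range(num_bins):
        omega = 2.0 * math.pi * (m + 0.5) / num_bins
        spectrum.append(1.0 / (1.0 - 2.0 * a * math.cos(omega) + a * a))
    return spectrum


def transform_gain(spectrum):
    """Arithmetic mean divided by geometric mean of the power spectrum."""
    arithmetic = sum(spectrum) / len(spectrum)
    geometric = math.exp(sum(math.log(s) for s in spectrum) / len(spectrum))
    return arithmetic / geometric


def prediction_gain_ar1(a):
    """Waveform energy / residual energy for the ideal AR(1) predictor."""
    return 1.0 / (1.0 - a * a)


if __name__ == "__main__":
    for a in (0.0, 0.5, 0.9):
        spectrum = ar1_power_spectrum(a, num_bins=4096)
        print(a, prediction_gain_ar1(a), transform_gain(spectrum))
```

For a = 0 the spectrum is flat and both gains equal 1 (nothing to exploit). As the correlation grows, the spectrum becomes more varied and both gains grow together.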


Standards

Example of standard

• ITU-T
– cellular phone
– VoIP
– TV-phone
– FAX
• ISO/IEC JPEG, MPEG
– digital camera, video
– digital broadcasting
– portable music player, DVD

Merits of standard

• interoperability
• open source
– long term maintenance
– visible patent holders
• Integration of the highest technologies
• cost reduction by mass production

market creation

Circulatory evolution of market

[Figure: cycle diagram. Node labels: basic research, R & D, disclosure of technology, patent, patent pool, standard, service, product, service and products, market, research, users. Labels on the cycle: royalty, cost reduction, competition, convenient.]

Standardization for speech

• ITU-T G.
• IMT-2000 (International Mobile Telecommunication)
• GSM (European, Asia)
• TIA (North America)
• US FS-1015 (LPC-10), 1016 (CELP), 1017 (MELP)
• Japanese Cellular
– PDC full/half rate
– PHS
– cdmaOne
– PDC enhanced full rate

ITU-T standard for speech

• Telephone band (8 kHz sample)
– G.711 PCM 64 kbit/s
– G.726 ADPCM 32 kbit/s (16, 24, 40 kbit/s)
– G.727 Embedded ADPCM 32 kbit/s (16, 24, 40 kbit/s)
– G.728 Low-delay CELP 16 kbit/s
– G.723.1 ACELP/MP-MLQ 5.3/6.3 kbit/s
– G.729 CS-ACELP 8 kbit/s
• Wide band (16 kHz sample)
– G.722 SB-ADPCM 64, 56, 48 kbit/s
– G.722.1 Transform coding 24, 32 kbit/s
– G.722.2 AMR-WB 6.6 to 24 kbit/s

Standard for IMT-2000

• 3GPP (3rd Generation Partnership Project) (ARIB, TTC, T1, ETSI, TTA)
• 3GPP2
• bi-directional CODEC
– AMR (Adaptive Multi-Rate)
– AMR-WB (wide band)
• video phone (H.263)
• Audio/Low rate speech
• packet transmission (MPEG-4)

Bandwidth and bitrate for audio coding

[Figure: bandwidth [kHz] (0 to 24) versus rate [kbit/s] (24 to 768). Plotted formats: MPEG-4, MPEG-1, MPEG-2 (1/2 sample), MPEG-2 multi-channel, AC-3, AAC, MD, CD, DAT.]

Basic technology for audio coding

Coder          Transform       Quantization
MPEG-1 L1,2    subband         adaptive bit
MPEG-1 L3      subband+MDCT    adaptive+Huffman
ATRAC          subband+MDCT    adaptive bit
AC-3           MDCT            adaptive+Huffman
AAC            MDCT            adaptive+Huffman
TwinVQ         MDCT            adaptive VQ

MPEG-1, 2/audio

• MPEG-1
– sampling rate: 32, 44.1, 48 kHz stereo
– algorithm:
  Layer-I: 32 band split
  Layer-II: + improved quantizer
  Layer-III: + MDCT + Variable length + bit reservoir ++
• MPEG-2
– low sampling rate 16, 22.05, 24 kHz
– multi channel 5.1ch
– backward compatibility

MPEG-2/AAC

• 3 profiles: Main, LC (Low Complexity), SSR (Scalable Sampling Rate)
• sampling rate: 32, 44.1, 48 kHz, plus x2, x1/2, x1/4
• channel: 1-48
• bit rate: 8-576 kbit/s/ch
• MDCT 1024 or 128
• TNS (Temporal Noise Shaping)
• MS (Middle-Side) stereo/intensity stereo
• non-linear scale quantizer + variable length code (2 and 4 dimension Huffman code)

Tools in MPEG-4 audio

• Low rate speech: HVXC (Harmonic Vector eXcitation Coder)
• Speech (narrow/wide): CELP
• Low rate audio: TwinVQ (Transform domain Weighted Interleave VQ)
• Audio: MPEG-2 AAC (Advanced Audio Coding)
• Error resilient framework
• Parametric audio coding: HILN
• Fine granular scalable audio coding: BSAC
• Low delay audio coding: LD-AAC
• Low overhead Audio Transport: LATM

MPEG-4 General audio

[Figure: decoder structure. Coder-specific tools: TwinVQ (interleave VQ for MDCT), AAC (scale factor, Huffman coding), BSAC (scale factor, bit-slice arithmetic). Common tools: stereo coding, scalability, TNS, LTP, IMDCT, leading to the output.]


Audio Demo (low rate)

• ITU-T G.711 64 kbit/s




• PDC Full 6.7 kbit/s

• PDC Half 3.45 kbit/s

• MPEG4 HVXC 2 kbit/s

• MPEG4 TwinVQ 8 kbit/s


Hot Topics

Background of lossless coding

• Demand for lossless compression of audio
– archiving analog and digital contents
– delivery over broadband network
– high quality audio format
  • up to 24 bit 192 kHz sampling
– multi-channel
  • medical data, seismic data, sensor array, etc.
• MPEG-4 extension
– official tools (open source)
– interoperability (good for over 100 years)

Family of MPEG lossless

• ALS
– one-step compression in time domain
• SLS
– scalable to lossless from MPEG lossy core
– fine grain scalability in frequency domain
– Integer MDCT
• DST
– 1-bit oversample format
– compatible with Sony-Philips SACD format

Property of ALS

• Time domain adaptive prediction
– simple to high-performance backward prediction
– BGMC for prediction residual
– Golomb-Rice Code for PARCOR
– Progressive order prediction
– Long-term prediction
– Hierarchical block switching
• extension
– Floating-point support
– Multi-channel predictive coding
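The Golomb-Rice code named in the list above can be sketched as follows. This is a generic illustration of Rice coding for signed integers, not the ALS bitstream syntax: the zigzag mapping of signed values, the unary convention (ones terminated by a zero), the bit-string representation, and all function names are assumptions made for readability.

```python
def zigzag(value):
    """Map signed to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * value if value >= 0 else -2 * value - 1


def unzigzag(code):
    """Inverse of zigzag."""
    return code >> 1 if code % 2 == 0 else -((code + 1) >> 1)


def rice_encode(values, k):
    """Encode integers as unary quotient plus k-bit remainder."""
    bits = []
    for value in values:
        u = zigzag(value)
        quotient, remainder = u >> k, u & ((1 << k) - 1)
        bits.append("1" * quotient + "0")           # unary part
        if k > 0:
            bits.append(format(remainder, "0{}b".format(k)))
    return "".join(bits)


def rice_decode(bits, k, count):
    """Decode `count` integers from a bit string produced by rice_encode."""
    values = []
    pos = 0
    for _ in range(count):
        quotient = 0
        while bits[pos] == "1":
            quotient += 1
            pos += 1
        pos += 1                                    # skip the terminating 0
        remainder = int(bits[pos:pos + k], 2) if k > 0 else 0
        pos += k
        values.append(unzigzag((quotient << k) | remainder))
    return values


if __name__ == "__main__":
    data = [0, -1, 3, -4, 2, 0, 1]
    encoded = rice_encode(data, k=1)
    print(encoded, len(encoded), "bits")
    print(rice_decode(encoded, k=1, count=len(data)) == data)
```

Small magnitudes get short codewords, so the code suits values that cluster around zero. The parameter k would be chosen per block to match the spread of the values.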

Prediction residual

[Figure: amplitude versus time for the original wave and the prediction residual wave.]
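How a prediction residual supports lossless coding can be shown with integer arithmetic. The sketch below swaps in a fixed second-order predictor (2·x[n-1] - x[n-2]) for simplicity; this is an assumption for illustration and not the adaptive predictor of ALS. The function names are hypothetical.

```python
def residual_order2(samples):
    """Integer residual e[n] = x[n] - (2*x[n-1] - x[n-2])."""
    residual = []
    for n, x in enumerate(samples):
        prev1 = samples[n - 1] if n >= 1 else 0
        prev2 = samples[n - 2] if n >= 2 else 0
        residual.append(x - (2 * prev1 - prev2))
    return residual


def reconstruct_order2(residual):
    """Exact inverse: the decoder repeats the same integer prediction."""
    samples = []
    for n, e in enumerate(residual):
        prev1 = samples[n - 1] if n >= 1 else 0
        prev2 = samples[n - 2] if n >= 2 else 0
        samples.append(e + 2 * prev1 - prev2)
    return samples


if __name__ == "__main__":
    wave = [0, 10, 19, 27, 34, 40]
    residual = residual_order2(wave)
    print(residual)
    print(reconstruct_order2(residual) == wave)
```

After the start-up samples the residual has a much smaller amplitude than the waveform, so it needs fewer bits, and because every step uses integers the reconstruction is bit-exact.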

Predictive coding

[Figure: one prediction and synthesis structure used three ways. Vocoder (compression ratio 1/30): parameters, pulse interval. Waveform coding (ratio 1/10): codebook for residual. Lossless coding (ratio 1/2): all residual. Other labels: input, residual (magnified 30 times), prediction, synthesis.]

different framework, rich commonality

Compression and decoding time

[Figure: compression ratio [%] (45 to 50) versus averaged decoding time [sec] for 30 sec files (48, 96, 192 kHz), in two panels with time axes 0 to 15 and 20 to 140. Plotted coders: ALS (reference decoder), ALS (enhanced decoder), ALS (high-compression), MPEG-4 SLS, Monkey's Audio (free software), OptimFrog (free software).]

Quality improvements by SBR and PS

[Figure: relative quality versus stereo bit rate [kbit/s] (24 to 144) for MP3, AAC, HE-AAC, and HE-AAC V2. Profile diagram: AAC profile (AAC), HE-AAC profile (adds SBR), HE-AAC V2 profile (adds PS). Annotations: Japanese digital broadcasting (2003), Japanese mobile digital broadcasting (2006).]

MPEG SBR (HE-AAC)

[Figure: HE-AAC block diagram. Encoder side: full-band input, downsample, low-pass input, AAC stereo encoder, AAC stereo bit stream; in parallel, high frequency analysis (Spectral Band Replication) produces the SBR bit stream. Decoder side: AAC stereo decoder, low-pass output, high frequency synthesis (envelope, excitation), full-band output.]

MPEG SBR+PS (HE-AAC v2)

[Figure: HE-AAC v2 block diagram. Encoder side: stereo input, mixdown, monaural input, AAC monaural encoder, AAC monaural bit stream; in parallel, PS (parametric stereo) analysis produces the PS bit stream. Decoder side: AAC monaural decoder, monaural output, PS (parametric stereo) synthesis, stereo output.]

PS parameters:
• Channel level differences
• Inter channel correlation
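The two parametric stereo parameters can be illustrated with a broadband calculation. This sketch is a simplification under assumptions: an actual parametric stereo analysis works per frequency band and per time segment, whereas here a single value is computed over the whole block, and the function name `stereo_parameters` and the simple average mixdown are hypothetical choices.

```python
import math


def stereo_parameters(left, right):
    """Return (mono, level_difference_db, correlation) for one block.

    mono                : mixdown (L + R) / 2
    level_difference_db : 10 * log10(energy_left / energy_right)
    correlation         : normalized cross-correlation at lag zero
    """
    energy_left = sum(x * x for x in left)
    energy_right = sum(x * x for x in right)
    if energy_left == 0.0 or energy_right == 0.0:
        raise ValueError("both channels need non-zero energy")
    cross = sum(l * r for l, r in zip(left, right))
    mono = [(l + r) / 2.0 for l, r in zip(left, right)]
    level_difference_db = 10.0 * math.log10(energy_left / energy_right)
    correlation = cross / math.sqrt(energy_left * energy_right)
    return mono, level_difference_db, correlation


if __name__ == "__main__":
    left = [1.0, 2.0, 3.0, 4.0]
    right = [0.5, 1.0, 1.5, 2.0]   # same waveform at half amplitude
    mono, level_db, corr = stereo_parameters(left, right)
    print(mono, level_db, corr)
```

In this example the right channel is a scaled copy of the left, so the correlation is 1 and the level difference alone is enough to restore the stereo image from the monaural signal.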

MPEG surround

[Figure: MPEG surround block diagram. Encoder side: 5-ch input, mix-down, stereo input, AAC stereo encoder, AAC stereo bit stream; in parallel, surround analysis produces the surround bit stream. Decoder side: AAC stereo decoder, stereo output, surround synthesis, 5-ch output.]

Surround parameters:
• Channel level differences
• Inter channel correlation
• Channel prediction coefficients

History of MPEG Audio

[Figure: timeline from 1992 to 2006. Entries: MPEG-1; MPEG-2 MC/LSF*; MPEG-2 AAC; MPEG-4 V1, V2; 2001; SBR; SSC; MP3 on 4; surround; lossless (2005: DST, ALS, SLS). Annotation: forward and backward compatibility.]

*Multi-channel and Low Sampling Frequency

Future challenge

• Open problems
– universal coder for both speech and audio at less than 16 kbit/s
– Wave field synthesis (multi-channel)
• Integrated service
– video
– copyright management
