NTT Labs. 2005
2005.12.16
NTT Communication Science Labs.
Takehiro Moriya 守谷 健弘
Coding Technologies for Speech
and Audio Signals
ISPACS 2005
Self introduction
• 1980 Joined NTT, basic research
– Transform domain interleave VQ
– Conjugate VQ
• 1989 Guest researcher at AT&T Bell Labs
• 1990 Standardization for Japanese PDC (PSI-CELP)
• 1993 Standardization for ITU-T (CS-ACELP)
• 1995 Standardization for MPEG-4 (TwinVQ)
• 2001 Standardization for MPEG lossless audio
Technologies of speech and audio coding
[Figure: bit rate [kbit/s] (2 to 1024, in powers of two) versus year (1975 to 2005). Plotted technologies: PARCOR, LSP, APC-AB, VSELP, PSI-CELP, G.711, G.726, G.728, G.729, G.722, MPEG-1, MPEG-2, MPEG-4, MPEG-4 (lossless), MP3, AAC, CD, DAT. Application labels: vocoder, mobile, telephone, mobile phone, VoIP/mobile, wideband, music, streaming, archive, ubiquitous.]
Outline
• 1. Fundamentals
– 1.1 Time domain for speech
– 1.2 Frequency domain for audio
• 2. Standardization
– 2.1 ITU-T speech coding
– 2.2 MPEG audio coding
• 3. Hot topics
– 3.1 MPEG lossless (ALS, SLS, DST)
– 3.2 MPEG SBR and SSC
– 3.3 MPEG surround
Fundamentals
Category of coding
[Diagram: classification of coding. Labels: coding; compression, presentation, metadata; lossless, lossy; time-domain, frequency-domain; media: text, speech, language, audio, image, video.]
Time-domain
• linear prediction -> CELP
• predictive coefficients
– PARCOR (partial auto correlation)
– LSP (line spectral pair)
• vector quantization of excitation source
– algebraic structure (ACELP)
• Big market for cellular phones and VoIP
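LPC (Linear Predictive Coding)
[Block diagram: the excitation (innovation, prediction residual) enters an adder Σ that produces the synthesized output; the output is fed back through a chain of unit delays z^-1 weighted by the predictive coefficients α1, α2, ..., αp.]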
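The feedback structure above corresponds to the recursion s[n] = e[n] + α1·s[n-1] + ... + αp·s[n-p]. The following is a minimal Python sketch of that recursion and of its inverse, which returns the prediction residual. The function names and the example coefficients are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def lpc_synthesize(excitation, alpha):
    """All-pole synthesis: s[n] = e[n] + sum_{i=1..p} alpha[i-1] * s[n-i]."""
    out = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for i, a in enumerate(alpha, start=1):
            if n - i >= 0:
                acc += a * out[n - i]
        out[n] = acc
    return out

def lpc_residual(signal, alpha):
    """Inverse (analysis) filter: e[n] = s[n] - sum_{i=1..p} alpha[i-1] * s[n-i]."""
    res = np.array(signal, dtype=float)
    for n in range(len(signal)):
        for i, a in enumerate(alpha, start=1):
            if n - i >= 0:
                res[n] -= a * signal[n - i]
    return res

alpha = [1.3, -0.4]            # hypothetical stable predictor (poles at 0.8 and 0.5)
impulse = np.zeros(8)
impulse[0] = 1.0
response = lpc_synthesize(impulse, alpha)   # impulse response of the synthesis filter
recovered = lpc_residual(response, alpha)   # inverse filtering returns the excitation
```

Running the inverse filter on the synthesized output gives back the excitation, which is why the excitation is also called the prediction residual.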
Family of LPC parameters
• predictive coefficients: α1 .... αp
• PARCOR coefficients: k1 .... kp
• LSP parameters: ω1 .... ωp (placed in order on the frequency axis: ω1, ω2, ..., ωp)
• merits of LSP
– stability
– interpolation
– quantization
– prediction
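The three parameter sets describe the same filter. As a hedged sketch, the Levinson-Durbin recursion below derives the predictive coefficients and PARCOR coefficients from an autocorrelation sequence, and a root-finding routine derives LSP frequencies from the predictive coefficients. It assumes the convention A(z) = 1 - Σ αi z^-i; the sign convention for PARCOR varies between texts, and the function names and test values are illustrative only.

```python
import numpy as np

def levinson_durbin(r, order):
    """Return (alpha, parcor, residual_energy) from autocorrelation r[0..order]."""
    alpha = np.zeros(order)
    parcor = np.zeros(order)
    energy = r[0]
    for m in range(order):
        # Reflection (PARCOR) coefficient for stage m + 1.
        k = (r[m + 1] - np.dot(alpha[:m], r[m:0:-1])) / energy
        parcor[m] = k
        prev = alpha[:m].copy()
        alpha[:m] = prev - k * prev[::-1]
        alpha[m] = k
        energy *= 1.0 - k * k
    return alpha, parcor, energy

def lsp_from_alpha(alpha):
    """LSP frequencies (radians, ascending) of A(z) = 1 - sum alpha_i z^-i."""
    a = np.concatenate([[1.0], -np.asarray(alpha, dtype=float), [0.0]])
    p_poly = a + a[::-1]   # symmetric polynomial P(z)
    q_poly = a - a[::-1]   # anti-symmetric polynomial Q(z)
    angles = np.angle(np.concatenate([np.roots(p_poly), np.roots(q_poly)]))
    # Keep one root of each conjugate pair; drop the trivial roots at z = +1 and z = -1.
    eps = 1e-6
    return np.sort(angles[(angles > eps) & (angles < np.pi - eps)])

# Autocorrelation of a first-order process with correlation 0.5 (illustrative).
alpha, parcor, energy = levinson_durbin(np.array([1.0, 0.5, 0.25]), 2)
lsp = lsp_from_alpha([0.9, -0.2])   # hypothetical stable second-order predictor
```

For a stable filter every PARCOR coefficient has magnitude below 1 and the LSP frequencies come out strictly ordered, which is the stability merit listed above.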
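CELP (Code Excited Linear Prediction)
[Block diagram: an adaptive codebook (periodic) and a random codebook (noise, pulse) are each scaled by a gain and added; the sum drives the LPC synthesis filter, which is controlled by the LSP parameters. The synthesized signal is compared with the input, and the perceptual error is fed back to the codebook selection (analysis by synthesis).]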
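The analysis-by-synthesis loop can be sketched as an exhaustive search: each code vector is passed through the synthesis filter, the best gain is computed in closed form, and the entry with the smallest error is kept. This is a simplified illustration only. It assumes a single random codebook, omits the adaptive codebook and the perceptual weighting shown in the diagram, and uses hypothetical names and sizes.

```python
import numpy as np

def synthesize(excitation, alpha):
    """All-pole LPC synthesis filter with predictive coefficients alpha."""
    out = np.zeros(len(excitation))
    for n in range(len(excitation)):
        past = sum(a * out[n - i] for i, a in enumerate(alpha, start=1) if n - i >= 0)
        out[n] = excitation[n] + past
    return out

def search_codebook(target, codebook, alpha):
    """Return (index, gain) minimizing ||target - gain * synthesize(code)||^2."""
    best_index, best_gain, best_error = -1, 0.0, np.inf
    for index, code in enumerate(codebook):
        y = synthesize(code, alpha)
        energy = np.dot(y, y)
        if energy == 0.0:
            continue
        gain = np.dot(target, y) / energy            # optimal gain for this entry
        error = np.dot(target, target) - gain * np.dot(target, y)
        if error < best_error:
            best_index, best_gain, best_error = index, gain, error
    return best_index, best_gain

rng = np.random.default_rng(0)
codebook = rng.standard_normal((16, 40))   # hypothetical: 16 entries, 40-sample subframe
alpha = [1.3, -0.4]                        # hypothetical predictor
target = 0.7 * synthesize(codebook[5], alpha)
index, gain = search_codebook(target, codebook, alpha)
```

Because the target here was built from entry 5 with gain 0.7, the search recovers exactly that pair.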
Synthesis model for vocoder
[Diagram: a periodic component (pitch interval, gain) and a random component are summed (Σ) and fed to the synthesis filter.]
Synthesis model for multi-pulse
[Diagram: a periodic component (pitch interval, gain) and pulses with coded amplitude and position are summed (Σ) and fed to the synthesis filter.]
Synthesis model for regular multi-pulse
[Diagram: a periodic component (pitch interval, gain) and regularly spaced pulses with coded amplitude are summed (Σ) and fed to the synthesis filter.]
Synthesis model for CELP
[Diagram: a periodic component (pitch interval, gain) and a code vector selected from a codebook are summed (Σ) and fed to the synthesis filter.]
Synthesis model of VSELP
[Diagram: a periodic component (pitch interval, gain) and a set of base vectors, each with a coded polarity (+/-), are summed (Σ) and fed to the synthesis filter.]
Synthesis model for CS-CELP
[Diagram: a periodic component (pitch interval, gain) and a selected pair of vectors, each with a polarity (+/-), are summed (Σ) and fed to the synthesis filter.]
Synthesis model of ACELP
[Diagram: a periodic component (pitch interval, gain) and unit pulses with selected positions and polarities (+/-) are summed (Σ) and fed to the synthesis filter.]
Simplicity is the seal of truth
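The ACELP excitation needs no stored codebook: it is defined by a few unit pulses, each with a position and a sign. The sketch below assumes a hypothetical layout (32-sample subframe, 4 interleaved tracks, one pulse per track); it is not the pulse layout of any particular standard.

```python
import numpy as np

SUBFRAME = 32   # hypothetical subframe length
N_TRACKS = 4    # hypothetical: track t holds positions t, t + 4, t + 8, ...

def track_positions(track):
    """Sample positions that a pulse on the given track may occupy."""
    return list(range(track, SUBFRAME, N_TRACKS))

def build_excitation(position_indices, signs):
    """Place one +1 or -1 unit pulse per track; position_indices[t] selects within track t."""
    excitation = np.zeros(SUBFRAME)
    for track, (idx, sign) in enumerate(zip(position_indices, signs)):
        excitation[track_positions(track)[idx]] = sign
    return excitation

def bits_per_subframe():
    """Position bits plus one sign bit for each track."""
    position_bits = int(np.log2(len(track_positions(0))))
    return N_TRACKS * (position_bits + 1)

excitation = build_excitation([0, 3, 7, 2], [+1, -1, +1, -1])
pulse_locations = np.flatnonzero(excitation)
total_bits = bits_per_subframe()
```

With 8 candidate positions per track, each pulse costs 3 position bits and 1 sign bit, so the whole excitation is described by 16 bits in this layout.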
Frequency-domain
• Lapped transform: MDCT
– overlapping frames remove frame-boundary noise without information loss
• Filter bank: QMF
– trades off time resolution against frequency resolution
• adaptive noise control
• psycho-acoustics
Transform coding
[Block diagram: input -> transform (time to frequency) -> quantization -> transform (frequency to time) -> output. Envelope estimation drives the adaptive bit allocation that controls the quantizer; the envelope is transmitted as side information.]
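Adaptive bit allocation can be illustrated with a greedy rule: give the next bit to the band whose quantization noise is currently highest, assuming each extra bit lowers the noise by about 6.02 dB. This is a textbook-style sketch, not the allocation procedure of any specific coder, and it ignores masking. The band energies are made-up values.

```python
def allocate_bits(band_energy_db, total_bits):
    """Greedy allocation: one bit at a time to the band with the highest noise level."""
    noise_db = list(band_energy_db)     # with 0 bits, noise equals the band energy
    bits = [0] * len(noise_db)
    for _ in range(total_bits):
        worst = max(range(len(noise_db)), key=noise_db.__getitem__)
        bits[worst] += 1
        noise_db[worst] -= 6.02         # roughly 6 dB less noise per added bit
    return bits

envelope_db = [30.0, 18.0, 6.0, 0.0]    # hypothetical spectral envelope per band
allocation = allocate_bits(envelope_db, 6)
print(allocation)                       # [4, 2, 0, 0]
```

Bands with high energy receive most of the bits, which flattens the noise across bands. The decoder can repeat the same computation from the envelope sent as side information.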
Basis functions of DCT
[Figure: DCT basis functions; axes: time and frequency.]
Basis functions of MDCT
[Figure: MDCT basis functions. Each frame overlaps with the previous frame and with the next frame; the two overlap regions show symmetry and anti-symmetry.]
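The overlap property can be checked numerically. The sketch below implements a direct (matrix) MDCT with a sine window, which is one common choice satisfying w[n]^2 + w[n+N]^2 = 1, and reconstructs the signal by overlap-add. The frame length, window, and function names are assumptions for illustration; practical coders use fast algorithms.

```python
import numpy as np

def mdct_basis(N):
    """N x 2N matrix of MDCT basis functions for frames of 2N samples."""
    n = np.arange(2 * N)
    k = np.arange(N)
    return np.cos(np.pi / N * np.outer(k + 0.5, n + 0.5 + N / 2))

def sine_window(N):
    n = np.arange(2 * N)
    return np.sin(np.pi * (n + 0.5) / (2 * N))

def mdct_analysis(x, N):
    """Windowed MDCT with hop size N; len(x) must be a multiple of N."""
    C, w = mdct_basis(N), sine_window(N)
    padded = np.concatenate([np.zeros(N), x, np.zeros(N)])
    n_frames = len(padded) // N - 1
    return np.array([C @ (w * padded[t * N:t * N + 2 * N]) for t in range(n_frames)])

def mdct_synthesis(coeffs, N):
    """Inverse MDCT, windowing, and overlap-add; time aliasing cancels between frames."""
    C, w = mdct_basis(N), sine_window(N)
    out = np.zeros((len(coeffs) + 1) * N)
    for t, X in enumerate(coeffs):
        out[t * N:t * N + 2 * N] += w * (2.0 / N) * (C.T @ X)
    return out[N:-N]

N = 8
rng = np.random.default_rng(0)
signal = rng.standard_normal(8 * N)
coeffs = mdct_analysis(signal, N)     # N coefficients per frame of 2N samples
rebuilt = mdct_synthesis(coeffs, N)
```

Each frame of 2N samples yields only N coefficients, yet the overlap-added output matches the input, because the aliasing in the symmetric and anti-symmetric halves of adjacent frames cancels.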
QMF for MPEG-1, 2 Layer-I, II
[Diagram: 32-band QMF filter bank (analysis) -> bit stream -> reconstruction -> 32-band QMF filter bank (synthesis), drawn along the frequency axis.]
• down-sampling
• adaptive bit allocation for 32 equal bands (energy, masking)
• adaptive quantization
QMF for MPEG-1, 2 Layer-III
[Diagram: 32-band QMF filter bank (analysis) -> bit stream -> reconstruction -> 32-band QMF filter bank (synthesis), drawn along the frequency axis.]
• down-sampling
• long and short MDCT
• adaptive bit allocation for Bark-scale bands (energy, masking)
• adaptive quantization (Huffman coding), bit reservoir
QMF for MPEG extension tools
[Diagram: 32-band QMF filter bank (analysis) -> bit stream -> reconstruction -> 32-band QMF filter bank (synthesis), drawn along the frequency axis.]
• SBR (Spectral Band Replication)
• PS (Parametric Stereo)
• Surround
Masking effect
[Figure: log spectrum versus frequency, showing the original spectrum, the allowable noise level, the audible level, and the masked region.]
Physical and perceptual distortion
[Diagram: the original signal is surrounded by an un-noticeable region (masking). The result of compression, additive noise, and additive echo are shown as distortions relative to that region. Caption: application of the characteristics of perception.]
Distortion by additional noise
[Figure: waveforms (versus time) and log spectra (versus frequency) of the original and of the distortion; the distortion is noticeable.]
Distortion by data compression
• control quantization noise
• distortion is masked
[Figure: waveforms (versus time) and log spectra (versus frequency) of the original and of the distortion.]
Distortion by echo
• echo is masked
• watermark
• search or recognition
[Figure: waveforms (versus time, with a 40 ms interval marked) and log spectra (versus frequency) of the original and of the distortion.]
Predictive coding and transform coding
• time-domain (prediction): Speech (5 ms)
• frequency-domain (transform, subband): Audio (30 ms)
• gain: prediction gain (waveform energy / residual energy) = transform gain (arithmetic mean / geometric mean)
• small effect (small correlation): unpredictable, flat spectrum
• large effect (large correlation): predictable, varied spectrum
• method: closed-loop quantization, adaptive bit allocation, weighted quantization
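The equality of the two gains can be checked on a simple model. The sketch below assumes a first-order autoregressive signal with coefficient rho, for which the prediction gain is 1 / (1 - rho^2), and compares it with the ratio of the arithmetic mean to the geometric mean of the power spectrum. The model and the names are illustrative assumptions.

```python
import numpy as np

def ar1_power_spectrum(rho, n_points=4096):
    """Power spectrum of x[n] = rho * x[n-1] + e[n] with unit-variance e[n]."""
    omega = 2 * np.pi * np.arange(n_points) / n_points
    return 1.0 / np.abs(1.0 - rho * np.exp(-1j * omega)) ** 2

def transform_gain(power_spectrum):
    """Arithmetic mean over geometric mean of the power spectrum."""
    arithmetic = power_spectrum.mean()
    geometric = np.exp(np.log(power_spectrum).mean())
    return arithmetic / geometric

def ar1_prediction_gain(rho):
    """Waveform energy over residual energy for the AR(1) model."""
    return 1.0 / (1.0 - rho ** 2)

rho = 0.9
g_transform = transform_gain(ar1_power_spectrum(rho))
g_prediction = ar1_prediction_gain(rho)
g_flat = transform_gain(ar1_power_spectrum(0.0))   # white signal, flat spectrum
```

Both gains come out at about 5.26 for rho = 0.9, and the flat spectrum gives a gain of 1, matching the small-correlation case where neither method helps.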
Standards
Examples of standards
• ITU-T
– cellular phone
– VoIP
– TV-phone
– FAX
• ISO/IEC JPEG, MPEG
– digital camera, video
– digital broadcasting
– portable music player, DVD
Merits of standard
• interoperability
• open source
– long term maintenance
– visible patent holders
• Integration of the highest technologies
• cost reduction by mass production
market creation
Circulatory evolution of market
[Diagram: a cycle linking basic research, R & D, disclosure of technology, patent, patent pool, standard, service and products, market, users, royalty, and research. Annotations: competition, cost reduction, convenient.]
Standardization for speech
• ITU-T G series
• IMT-2000 (International Mobile Telecommunication)
• GSM (Europe, Asia)
• TIA (North America)
• US FS-1015 (LPC-10), 1016 (CELP), 1017 (MELP)
• Japanese Cellular
– PDC full/half rate
– PHS
– cdmaOne
– PDC enhanced full rate
ITU-T standard for speech
• Telephone band (8 kHz sampling)
– G.711 PCM 64 kbit/s
– G.726 ADPCM 32 kbit/s (16, 24, 40 kbit/s)
– G.727 Embedded ADPCM 32 kbit/s (16, 24, 40 kbit/s)
– G.728 Low-delay CELP 16 kbit/s
– G.723.1 ACELP/MPC-MLQ 5.3/6.3 kbit/s
– G.729 CS-ACELP 8 kbit/s
• Wide band (16 kHz sampling)
– G.722 SB-ADPCM 64, 56, 48 kbit/s
– G.722.1 Transform coding 24, 32 kbit/s
– G.722.2 AMR-WB 6.6 to 24 kbit/s
Standard for IMT-2000
• 3GPP (3rd Generation Partnership Project) (ARIB, TTC, T1, ETSI, TTA)
• 3GPP2
• bi-directional CODEC
– AMR (Adaptive Multi-Rate)
– AMR-WB (wideband)
• video phone (H.263)
• Audio/Low rate speech
• packet transmission (MPEG-4)
Bandwidth and bitrate for audio coding
[Figure: bandwidth [kHz] (0 to 24) versus rate [kbit/s] (24 to 768). Plotted: MPEG-4, MPEG-1, MPEG-2 (1/2 sample), MPEG-2 multi-channel, AC-3, AAC, MD, CD, DAT.]
Basic technology for audio coding
Coder          Transform        Quantization
MPEG-1 L1,2    subband          adaptive bit
MPEG-1 L3      subband+MDCT     adaptive+Huffman
ATRAC          subband+MDCT     adaptive bit
AC-3           MDCT             adaptive+Huffman
AAC            MDCT             adaptive+Huffman
TwinVQ         MDCT             adaptive VQ
MPEG-1, 2/audio
• MPEG-1
– sampling rate: 32, 44.1, 48 kHz stereo
– algorithm:
Layer-I: 32 band split
Layer-II: + improved quantizer
Layer-III: + MDCT + variable length + bit reservoir ++
• MPEG-2
– low sampling rate 16, 22.05, 24 kHz
– multi channel 5.1ch
– backward compatibility
MPEG-2/AAC
• 3 profiles
– main
– LC (Low Complexity)
– SSR (Scalable Sampling Rate)
• sampling rate: 32, 44.1, 48 kHz, + ×2, ×1/2, ×1/4
• channel: 1-48
• bit rate: 8-576 kbit/s/ch
• MDCT 1024 or 128
• TNS (Temporal Noise Shaping)
• MS (Mid/Side) stereo / intensity stereo
• non-linear scale quantizer + variable length code (2 and 4 dimension Huffman code)
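Of the tools listed, mid/side stereo is the simplest to illustrate. The sketch below shows the sum/difference transform and its inverse on time-domain samples. In AAC the tool operates on spectral coefficients and can be switched per band; that detail is left out here, and the function names are illustrative.

```python
import numpy as np

def ms_encode(left, right):
    """Mid is the average of the channels, side is half their difference."""
    return (left + right) / 2.0, (left - right) / 2.0

def ms_decode(mid, side):
    """Exact inverse of ms_encode."""
    return mid + side, mid - side

left = np.array([0.5, 0.25, -0.125, 0.0])
right = np.array([0.5, 0.25, -0.125, 0.0])   # identical channels: centered source
mid, side = ms_encode(left, right)
left_out, right_out = ms_decode(mid, side)
```

For strongly correlated channels the side signal is close to zero, so it needs few bits.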
Tools in MPEG-4 audio
• Low rate speech: HVXC (Harmonic Vector eXcitation Coder)
• Speech (narrow/wide): CELP
• Low rate audio: TwinVQ (Transform domain Weighted Interleave VQ)
• Audio: MPEG-2 AAC (Advanced Audio Coding)
• Error resilient framework
• Parametric audio coding: HILN
• Fine granular scalable audio coding: BSAC
• Low delay audio coding: LD-AAC
• Low overhead Audio Transport: LATM
MPEG-4 General audio
[Block diagram: three quantization paths, TwinVQ (interleave VQ for MDCT), AAC (scale factor, Huffman coding), and BSAC (scale factor, bit-slice arithmetic coding), share common tools (stereo coding, scalability, TNS, LTP, IMDCT) that produce the output.]
Audio Demo (low rate)
• ITU-T G.711 64 kbit/s
• ITU-T G.726 32 kbit/s
• ITU-T G.728 16 kbit/s
• ITU-T G.729 8 kbit/s
• PDC Full 6.7 kbit/s
• PDC Half 3.45 kbit/s
• MPEG4 HVXC 2 kbit/s
• MPEG4 TwinVQ 8 kbit/s
Hot Topics
Background of lossless coding
• Demand for lossless compression of audio
– archiving analog and digital contents
– delivery over broadband network
– high quality audio format (up to 24 bit, 192 kHz sampling)
– multi-channel (medical data, seismic data, sensor array, etc.)
• MPEG-4 extension
– official tools (open source)
– interoperability (good for over 100 years)
Family of MPEG lossless
• ALS
– one-step compression in time domain
• SLS
– scalable to lossless from MPEG lossy core
– fine grain scalability in frequency domain
– Integer MDCT
• DST
– 1-bit oversampled format
– compatible with Sony-Philips SACD format
Property of ALS
• Time domain adaptive prediction
– simple to high-performance backward prediction
– BGMC for prediction residual
– Golomb-Rice Code for PARCOR
– Progressive order prediction
– Long-term prediction
– Hierarchical block switching
• extension
– Floating-point support
– Multi-channel predictive coding
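The principle behind time-domain lossless coding (predict, then entropy-code the integer residual) can be shown in a few lines. This sketch is far simpler than ALS: it assumes a fixed first-order predictor and a plain Rice code with a fixed parameter k applied to the residual, with no adaptive prediction, BGMC, or block switching. All names are illustrative.

```python
def zigzag(v):
    """Map signed integers to non-negative ones: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return 2 * v if v >= 0 else -2 * v - 1

def unzigzag(u):
    return u >> 1 if u % 2 == 0 else -((u + 1) >> 1)

def rice_encode(u, k):
    """Unary quotient (ones ended by a zero) followed by a k-bit remainder."""
    q, r = u >> k, u & ((1 << k) - 1)
    return "1" * q + "0" + (format(r, f"0{k}b") if k else "")

def encode(samples, k):
    """Code the difference between each sample and the previous one."""
    bits, prev = [], 0
    for s in samples:
        bits.append(rice_encode(zigzag(s - prev), k))
        prev = s
    return "".join(bits)

def decode(bits, count, k):
    samples, prev, pos = [], 0, 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":
            q += 1
            pos += 1
        pos += 1                                   # skip the terminating "0"
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        prev += unzigzag((q << k) | r)
        samples.append(prev)
    return samples

pcm = [0, 3, 5, 4, 2, -1, -3, -2]
stream = encode(pcm, k=1)
restored = decode(stream, len(pcm), k=1)
```

The decoder rebuilds every sample exactly, and small residuals receive short codes, which is where the compression comes from.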
Prediction residual
[Figure: amplitude versus time for the original wave and the prediction residual wave.]
Predictive coding
• vocoder: compression ratio 1/30
• waveform coding: ratio 1/10
• lossless coding: ratio 1/2
[Diagram: input -> prediction -> residual -> synthesis. Labels for what is transmitted: parameters, pulse interval, codebook for residual, all residual. The residual is shown magnified 30 times.]
• different framework, rich commonality
Compression and decoding time
[Figure: compression ratio [%] (45 to 50) versus averaged decoding time [sec] for 30 sec files (48, 96, 192 kHz), shown over two time ranges, 0 to 15 and 20 to 140. Plotted: ALS (reference decoder), ALS (enhanced decoder), ALS (high-compression), MPEG-4 SLS, Monkey's Audio (free software), OptimFrog (free software).]
Quality improvements by SBR and PS
[Figure: relative quality versus stereo bit rate [kbit/s] (24 to 144) for MP3, AAC, HE-AAC, and HE-AAC V2.]
• AAC profile: AAC
• HE-AAC profile: AAC + SBR
• HE-AAC V2 profile: AAC + SBR + PS
• Japanese digital broadcasting (2003)
• Japanese mobile digital broadcasting (2006)
MPEG SBR (HE-AAC)
[Block diagram. Encoder: full-band input -> downsample -> low-pass input -> AAC stereo encoder -> AAC stereo bit stream; in parallel, high frequency analysis (Spectral Band Replication) of envelope and excitation -> SBR bit stream. Decoder: AAC stereo decoder -> low-pass output -> high frequency synthesis using the SBR bit stream -> full-band output.]
MPEG SBR+PS (HE-AAC v2)
[Block diagram. Encoder: stereo input -> mixdown -> monaural input -> AAC monaural encoder -> AAC monaural bit stream; in parallel, PS (parametric stereo) analysis -> PS bit stream. Decoder: AAC monaural decoder -> monaural output -> PS (parametric stereo) synthesis using the PS bit stream -> stereo output.]
• PS parameters: channel level differences, inter channel correlation
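The two PS parameters can be computed from a stereo frame as follows. This is a minimal full-band sketch under simple textbook definitions (energy ratio in dB and normalized cross-correlation); the actual tool works per frequency band on QMF subband signals, and the names here are illustrative.

```python
import numpy as np

def stereo_parameters(left, right):
    """Return (channel level difference in dB, inter channel correlation)."""
    energy_left = np.dot(left, left)
    energy_right = np.dot(right, right)
    cld_db = 10.0 * np.log10(energy_left / energy_right)
    icc = np.dot(left, right) / np.sqrt(energy_left * energy_right)
    return cld_db, icc

t = np.arange(256)
left = np.sin(2 * np.pi * 5 * t / 256)
right = 0.5 * left                    # same source, half the amplitude on the right
cld_db, icc = stereo_parameters(left, right)
```

A source panned toward the left with no decorrelation gives a level difference of about 6 dB and a correlation of 1; a decoder can use such values to spread the monaural signal back over two channels.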
MPEG surround
[Block diagram. Encoder: 5-ch input -> mix-down -> stereo input -> AAC stereo encoder -> AAC stereo bit stream; in parallel, surround analysis -> surround bit stream. Decoder: AAC stereo decoder -> stereo output -> surround synthesis using the surround bit stream -> 5-ch output.]
• surround parameters: channel level differences, inter channel correlation, channel prediction coefficients
History of MPEG Audio
[Timeline, 1992 to 2006, in the order shown: MPEG-1; MPEG-2 MC/LSF*; MPEG-2 AAC; MPEG-4 V1, V2; 2001; SBR; SSC; MP3 on 4; 2005; DST, ALS, SLS (lossless); surround. Annotation: forward and backward compatibility.]
* Multi-channel and Low Sampling Frequency
Future challenge
• Open problems
– all-purpose coder for both speech and audio at less than 16 kbit/s
– Wave field synthesis (multi-channel)
• Integrated service
– video
– copyright management