SPEECH ENHANCEMENT Sefik Emre Eskimez Dept. of Electrical and Computer Engineering University of Rochester, Rochester, NY

SPEECH ENHANCEMENT - University of Rochesterzduan/teaching/ece477/lectures/Speech... · Spectral Subtraction ... Park et al. (Park and Lee 2016) ... Speech Enhancement Using Nonnegative


SPEECH ENHANCEMENT

Sefik Emre Eskimez

Dept. of Electrical and Computer Engineering

University of Rochester, Rochester, NY

Motivation

Corruption in a speech signal degrades the performance of automatic processes such as:

Automatic speech recognition (ASR)

Automatic speaker identification/verification (ASID/ASV)

Automatic speech emotion recognition (ASER)

Try it with Amazon’s Alexa and Google’s assistant

Hearing implant performance suffers in noisy conditions


Ideal Cases

Problem Definition – Additive Noise

$s(t)$ is the speech signal

$n(t)$ is the noise signal

$m(t) = s(t) + n(t)$

Given $m(t)$, estimate $s(t)$!
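A minimal sketch of the additive model, assuming NumPy. The helper `mix_at_snr` and its `snr_db` parameter are hypothetical (not part of the lecture); they scale the noise to a chosen signal-to-noise ratio before adding it to the speech.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Return m(t) = s(t) + g * n(t), with g chosen to hit snr_db.

    Assumes the noise is at least as long as the speech.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Solve speech_power / (g^2 * noise_power) = 10^(snr_db / 10) for g.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise
```

The enhancement task is then to recover `speech` when only the first return value is observed.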

Approaches

1. Spectral Subtraction

Estimate the noise spectrum and subtract it from the noisy speech spectrum.

a. Wiener Filtering

A linear time-invariant (LTI) filter that estimates the clean speech.

b. Log-Minimum Mean Square Error (MMSE) Short-Time Spectral Amplitude (STSA) Estimator

A short-time spectral amplitude (STSA) estimator that minimizes the mean-square error of the log-spectra.

2. Non-negative Dictionary Learning

Uses sparse coding and a voice activity detector to find which frames belong to noise and which belong to speech. Usually two dictionaries are built, one for speech and one for noise.

3. Deep Learning Approaches

Early work

1979-1984

Spectral Subtraction

Taking the Fourier transform yields:

$$m(t) = s(t) + n(t) \;\longleftrightarrow\; M(e^{j\omega}) = S(e^{j\omega}) + N(e^{j\omega})$$

The speech spectrum $S(e^{j\omega})$ can be estimated as:

$$\hat{S}(e^{j\omega}) = \left[\,|M(e^{j\omega})| - N_\mu(e^{j\omega})\,\right] e^{j\theta_M},$$

where $\hat{S}$ is the speech estimate, $N_\mu$ is the noise estimate and $\theta_M$ is the phase of the noisy signal.

$$N_\mu(e^{j\omega}) = E\{\,|N(e^{j\omega})|\,\}$$

The noise estimate is usually calculated from the first few frames of the input signal.
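A minimal sketch of this procedure, assuming NumPy and an already computed complex STFT of shape (frequency bins, frames). The function name, the `n_noise_frames` parameter and the clamping of negative magnitudes to zero are assumptions for illustration, not details given in the slides.

```python
import numpy as np

def spectral_subtraction(M, n_noise_frames=5):
    """Subtract an average noise magnitude from a noisy STFT.

    M: complex STFT of the noisy signal, shape (n_freq, n_frames).
    The first n_noise_frames frames are assumed to contain noise only.
    """
    mag = np.abs(M)
    phase = np.angle(M)  # theta_M: the noisy phase is reused unchanged
    # N_mu: mean noise magnitude per frequency bin.
    noise_mag = mag[:, :n_noise_frames].mean(axis=1, keepdims=True)
    # Assumption: negative results are clamped to zero.
    speech_mag = np.maximum(mag - noise_mag, 0.0)
    return speech_mag * np.exp(1j * phase)
```

The result can be passed to an inverse STFT to obtain a time-domain estimate.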

Wiener Filtering

$$M(e^{j\omega}) = S(e^{j\omega}) + N(e^{j\omega})$$

A filter can be defined as follows:

$$H(e^{j\omega}) = \frac{S(e^{j\omega})}{M(e^{j\omega})}$$

The filter can be estimated using the noise estimate:

$$H(e^{j\omega}) = \frac{M(e^{j\omega}) - \hat{N}(e^{j\omega})}{M(e^{j\omega})}$$

[Block diagram: $s(t) + n(t)$ is passed through $H(e^{j\omega})$ to give $\hat{s}(t)$]
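A minimal sketch of the estimated filter applied as a per-bin gain, assuming NumPy. Working on magnitude spectra and clipping the gain to [0, 1] are assumptions made here for numerical safety; the function name is hypothetical.

```python
import numpy as np

def wiener_style_gain(mix_mag, noise_mag, eps=1e-10):
    """Gain H = (M - N_hat) / M computed on magnitude spectra.

    mix_mag:   noisy magnitudes, shape (n_freq, n_frames)
    noise_mag: noise estimate, broadcastable to mix_mag
    """
    gain = (mix_mag - noise_mag) / np.maximum(mix_mag, eps)
    # Assumption: keep the gain in [0, 1] so no bin is amplified or sign-flipped.
    return np.clip(gain, 0.0, 1.0)
```

Multiplying the noisy spectrum by this gain gives the enhanced spectrum.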

Log-Minimum Mean Square Error (MMSE) Short-Time Spectral Amplitude (STSA)

Let’s simplify the notation: $S(e^{j\omega}) \rightarrow S$

Log-MMSE STSA minimizes the logarithmic mean square error

$$E\left\{ \left( \log_{10} S - \log_{10} \hat{S} \right)^2 \right\}$$
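A minimal sketch of this error criterion, assuming NumPy and magnitude spectra stored as arrays. It only evaluates the log-spectral error for a given estimate; it is not the Log-MMSE estimator itself. The function name and the `eps` guard are assumptions.

```python
import numpy as np

def log_spectral_mse(S, S_hat, eps=1e-10):
    """Mean of (log10 S - log10 S_hat)^2 over all time-frequency bins."""
    diff = np.log10(S + eps) - np.log10(S_hat + eps)
    return float(np.mean(diff ** 2))
```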

Non-negative Dictionary Learning

Let us denote the basis matrices of speech and noise as $W_s$ and $W_n$, respectively.

The basis matrix for the noisy signal is

$$W = [\,W_s \;\; W_n\,]$$

The noisy spectrogram can be represented as $M \approx WH$, where the noisy NMF coefficients are defined as

$$H = [\,H_s^T \;\; H_n^T\,]^T$$

Non-negative Dictionary Learning

The mask can be obtained as follows:

$$m_s = \frac{W_s H_s}{W_s H_s + W_n H_n}$$
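A minimal sketch of how such a mask could be computed, assuming NumPy and dictionaries $W_s$ and $W_n$ that were already learned and are kept fixed. The activations are estimated with multiplicative updates for the Euclidean cost, which is one common choice; the slides do not say which update rule is used, and all function names are hypothetical.

```python
import numpy as np

def nmf_activations(M, W, n_iter=200, eps=1e-10, seed=0):
    """Estimate non-negative H such that M is approximately W @ H, with W fixed."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], M.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative update for the Euclidean cost; keeps H non-negative.
        H *= (W.T @ M) / (W.T @ W @ H + eps)
    return H

def nmf_speech_mask(M, W_s, W_n, n_iter=200, eps=1e-10):
    """Speech mask m_s = W_s H_s / (W_s H_s + W_n H_n)."""
    W = np.hstack([W_s, W_n])            # W = [W_s  W_n]
    H = nmf_activations(M, W, n_iter, eps)
    k = W_s.shape[1]
    H_s, H_n = H[:k], H[k:]              # H = [H_s^T  H_n^T]^T
    speech = W_s @ H_s
    noise = W_n @ H_n
    return speech / (speech + noise + eps)
```

The enhanced magnitude spectrogram is then the element-wise product of the mask and `M`.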

Time-Frequency (T-F) Masks

T-F masks operate on the magnitude spectra of the signal.

Let $S_t(f)$, $N_t(f)$ and $M_t(f)$ be the magnitude spectra of the speech, noise and mixture signals, respectively.

Time-Frequency (T-F) Masks

[Figure: spectrograms of the speech $S$, the noise $N$ and the mixture $M$]

T-F Masks

Ideal Binary Masks (IBM), 0 dB

$$IBM_t(f) = \begin{cases} 1, & \text{if } S_t(f) > N_t(f) \\ 0, & \text{otherwise} \end{cases}$$

Ideal Binary Masks (IBM)

[Figure: spectrograms of the speech $S$, the mixture $M$, the mask $IBM$ and the masked mixture $IBM \odot M$]

Ideal Binary Masks (IBM)

The problem becomes a binary classification task!

Given $M_t(f)$, determine whether it belongs to speech or noise

The mask can be estimated with any machine learning classifier

Problem: even the results obtained with the ground-truth IBM contain “musical noise”
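A minimal sketch of the ideal binary mask, assuming NumPy and magnitude spectrograms of equal shape. The function name is hypothetical. Computing the ideal mask requires the clean speech and noise, so it serves as a training target rather than something available at test time.

```python
import numpy as np

def ideal_binary_mask(S, N):
    """1 where the speech magnitude exceeds the noise magnitude, else 0."""
    return (S > N).astype(float)

# Usage: enhanced_mag = ideal_binary_mask(S, N) * M  (element-wise product)
```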

T-F Masks

Amplitude Soft Masks (ASM) or Ideal Ratio Masks (IRM)

$$IRM_t(f) = \frac{S_t(f)}{S_t(f) + N_t(f)}$$

Ideal Ratio Masks (IRM)

[Figure: spectrograms of the speech $S$, the mixture $M$, the mask $IRM$ and the masked mixture $IRM \odot M$]
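A minimal sketch of the ideal ratio mask, assuming NumPy and magnitude spectrograms of equal shape. The function name and the `eps` guard against division by zero are assumptions.

```python
import numpy as np

def ideal_ratio_mask(S, N, eps=1e-10):
    """Soft mask in [0, 1]: the speech share of each time-frequency bin."""
    return S / (S + N + eps)

# Usage: enhanced_mag = ideal_ratio_mask(S, N) * M  (element-wise product)
```

If the mixture magnitude were exactly $S + N$, the masked mixture would reproduce $S$; with real signals the magnitudes are only approximately additive.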

Predicting Masks – System Overview

Predicting Masks – Features

Mel-Frequency Cepstrum (MFC)

Magnitude Spectra

Raw waveform

Can be supplemented by traditional features
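Of the features listed, magnitude spectra are the most direct to compute. A minimal sketch, assuming NumPy, a Hann window, and hypothetical `frame_len` and `hop` values that the slides do not specify:

```python
import numpy as np

def magnitude_spectrogram(x, frame_len=512, hop=256):
    """Magnitude STFT of a 1-D signal, shape (frame_len // 2 + 1, n_frames)."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Cut the signal into overlapping, windowed frames (one frame per column).
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )
    return np.abs(np.fft.rfft(frames, axis=0))
```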

Autoencoder based methods

Two types:

1. Trained with only clean speech

Network learns a speech representation

2. Trained with noisy-clean speech pairs

Network learns a transfer function from noisy to clean speech

Recurrent Neural Network (RNN)

RNNs are useful for modeling temporal relations

Huang et al. (Huang, Kim et al. 2015) proposed predicting masks with the following cost function:

$$\min \; \|\hat{m}_s - m_s\|^2 + \|\hat{m}_n - m_n\|^2 - \|\hat{m}_s - m_n\|^2 - \|\hat{m}_n - m_s\|^2$$

where $m_s$ and $m_n$ are the speech and noise masks, respectively, and the hat marks the network's estimate.
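A minimal sketch of this cost function, assuming NumPy arrays for the masks. The function name is hypothetical, and the `gamma` weight on the two subtracted terms is an assumption added for illustration; `gamma=1.0` reproduces the expression as written on the slide.

```python
import numpy as np

def discriminative_mask_loss(ms_hat, mn_hat, ms, mn, gamma=1.0):
    """Reward matching the own target, penalize matching the other source."""
    def sq(a, b):
        return float(np.sum((a - b) ** 2))

    own = sq(ms_hat, ms) + sq(mn_hat, mn)
    cross = sq(ms_hat, mn) + sq(mn_hat, ms)
    return own - gamma * cross
```

The subtracted terms push each predicted mask away from the other source's target, so a prediction that swaps speech and noise scores worse than a perfect one.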

Redundant Convolutional Encoder-Decoder (R-CED)

Park et al. (Park and Lee 2016) proposed a convolutional network that uses one-dimensional convolutions along the frequency axis

Convolutional networks have fewer parameters than RNNs, which makes them feasible for small devices such as hearing implants!
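A minimal sketch of a single one-dimensional filter sliding along the frequency axis, assuming NumPy and a spectrogram of shape (frequency bins, frames). This illustrates only the operation, not the R-CED architecture; the function name and the ReLU placement are assumptions.

```python
import numpy as np

def conv1d_along_frequency(spec, kernel):
    """Apply one 1-D filter along the frequency axis of every frame, then ReLU.

    Uses cross-correlation, which is what deep learning libraries
    usually call convolution.
    """
    filtered = np.apply_along_axis(
        lambda frame: np.correlate(frame, kernel, mode="same"), 0, spec
    )
    return np.maximum(filtered, 0.0)
```

The same few kernel weights are shared across all frequency positions and frames, which is one reason such layers need few parameters.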

Predicting Masks – Our Methods

Convolutional Encoder-Decoder (CED) network with skip connections

Bidirectional Long Short-Term Memory (BLSTM) network

Convolutional Encoder-Decoder (CED)

[Figure: CED architecture. Input spectrogram; Conv + BN + ReLU layers with 64, 128, 256 and 512 filters; Deconv + BN + ReLU layers with 256, 128 and 64 filters; skip connections; two output Conv + BN + ReLU layers with 1 filter each, producing the speech mask and the noise mask]

Bidirectional Long Short-Term Memory (BLSTM)

Predicting Masks – Comparison with other methods

(a) Noisy spectrogram

(b) Clean spectrogram

(c) Enhanced (SS) spectrogram

(d) Enhanced (Log-MMSE) spectrogram

(e) Enhanced (RNN) spectrogram

(f) Enhanced (R-CED) spectrogram

(g) Enhanced (BLSTM) spectrogram

(h) Enhanced (CED) spectrogram

Evaluation metrics

• Objective measures:

• Perceptual evaluation of speech quality (PESQ) – Ranges from -0.5 to 4.5

• Short-time Objective Intelligibility (STOI) – Ranges from 0 to 1

• Segmental SNR (SSNR, in dB)

• Log-spectral distortion (LSD, in dB)

• Hearing-Aid Speech Quality Index (HASQI)

• Hearing-Aid Speech Perception Index (HASPI)

• Speech distortion index (SDI)

• Subjective measures:

• Listening tests

The most important metrics!
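A minimal sketch of one of the objective measures, segmental SNR, assuming NumPy and time-aligned clean and enhanced waveforms. The frame length and the clipping of per-frame values to [-10, 35] dB are assumed conventions, not values given in the slides.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0, eps=1e-12):
    """Mean of per-frame SNRs in dB between a clean and an enhanced signal."""
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    n_frames = len(clean) // frame_len
    snrs = []
    for i in range(n_frames):
        seg = slice(i * frame_len, (i + 1) * frame_len)
        signal_energy = np.sum(clean[seg] ** 2)
        error_energy = np.sum((clean[seg] - enhanced[seg]) ** 2)
        snr = 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))
        # Assumption: clip each frame so silent or perfect frames do not dominate.
        snrs.append(np.clip(snr, lo, hi))
    return float(np.mean(snrs))
```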

RESULTS - PESQ

RESULTS - STOI

More examples…

The End…

Thank you!

References

Loizou, Philipos C. Speech enhancement: theory and practice. CRC press, 2013.

Boll, Steven. "Suppression of acoustic noise in speech using spectral subtraction." IEEE Transactions on acoustics, speech, and signal processing 27.2 (1979): 113-120.

Ephraim, Yariv, and David Malah. "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator." IEEE Transactions on Acoustics, Speech, and Signal Processing 33.2 (1985): 443-445.

Huang, Po-Sen, et al. "Joint optimization of masks and deep recurrent neural networks for monaural source separation." IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 23.12 (2015): 2136-2147.

Park, Se Rim, and Jinwon Lee. "A fully convolutional neural network for speech enhancement." arXiv preprint arXiv:1609.07132 (2016).

Mohammadiha, Nasser. Speech Enhancement Using Nonnegative Matrix Factorization and Hidden Markov Models. Diss. KTH Royal Institute of Technology, 2013.

Wang, Yuxuan, Arun Narayanan, and DeLiang Wang. "On training targets for supervised speech separation." IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 22.12 (2014): 1849-1858.

Lu, Xugang, et al. "Speech enhancement based on deep denoising autoencoder." Interspeech. 2013.