Optimal Filtering and Speech Recognition With Microphone Arrays

OPTIMAL FILTERING AND SPEECH RECOGNITION WITH

MICROPHONE ARRAYS

ByJohn E. Adcock

Sc.B., Computer Science, Brown University, 1989Sc.M., Engineering, Brown University, 1993

A dissertation submitted in partial fulfillment of the requirements for the degree ofDoctor of Philosophy in the Division of Engineering at Brown University

Providence, Rhode IslandMay 2001

This dissertation by John E. Adcock is accepted in its present form by the Division of Engineering assatisfying the dissertation requirement for the degree of Doctor of Philosophy.

DateHarvey F. Silverman, Director

Recommended to the Graduate Council

DateMichael S. Brandstein, Reader

DateAllan E. Pearson, Reader

Approved by the Graduate Council

DatePeder J. EstrupDean of the Graduate School and Research

ii

c�

Copyright 2001 by John E. Adcock

THE VITA OF JOHN E. ADCOCK

John was born October 19, 1967 in Boston, Massachusetts. As a child he spent five years with his familyin Paris, France returning in 1980 to Pound Ridge, NY where he lived while attending high school inBedford Hills, NY. John received the Bachelor of Science degree in Computer Science, magna cum laude,from Brown University in 1989. He subsequently spent two years as an engineer in the Signal ProcessingCenter of Technology at Lockheed Sanders in Nashua, New Hampshire before returning to BrownUniversity to pursue advanced degrees in Engineering. John earned his Master of Science Degree inEngineering from Brown in 1993. John received a University Fellowship in 1991, and the Doris I.Eccleston ‘25 Fellowship in 1996 for support of his graduate studies. In 1993 John worked with BrownEngineering alumnus Krishna Nathan in the handwriting recognition group at IBM Watson Researchlaboratories in Hawthorne, NY and in 1995 as Brown Engineering alumnus Professor MichaelBrandstein’s teaching assistant at the John’s Hopkins University Center for Talented Youth summerprogram in Lancaster, PA. In 1998 and 1999 John’s fingers provided the thundering bass lines for thepopular local rock band Wave. John is co-inventor of a 1998 Brown University patent describing a methodfor microphone-array source location. John has worked as a self-employed programmer/technicalconsultant and was briefly a partner in an Internet bingo venture.

iii

ACKNOWLEDGMENTS

I thank my advisor, Professor Harvey Silverman, for his support and trust over the course of my graduatecareer at Brown. Thanks to my readers, Professor Allan E. Pearson and especially Professor MichaelBrandstein whose critical input and encouragement were vital to the completion of this work.In my time at Brown I have had the privilege of working and playing with many wonderful and talentedpeople. Michael Hochberg, Jon Foote and Alexis Tzannes who were here with me at the beginning and JoeDiBiase, Michael Brandstein, Aaron Smith and Michael Blane who shared my trials in the remainder. Ialso thank my wonderful friends Lance Riek, Laura Maxwell and Carina Quezada for their relentlesssupport.Thanks to Ginny Novak for the uncomplaining effort she has consistently made to extend my deadlinesand otherwise keep me in the good graces of the Registrar and Graduate School.Finally, at the culmination of my formal education, I thank my parents for all they’ve taught me over thecourse of many years.

iv

CONTENTS

1 Introduction 11.1 Methods for Acquiring Speech With Microphone Arrays . . . . . . . . . . . . . . . . . . 11.2 The Scope of This Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Evaluating Speech Enhancement 42.1 Listening Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.2 Objective Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2.1 Signal-to-Noise Ratio (SNR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.2.2 Segmental Signal-to-Noise Ratio (SSNR) . . . . . . . . . . . . . . . . . . . . . . 52.2.3 Bark Spectral Distortion (BSD) . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.3 Speech Recognition Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.3.1 Feature Distortion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 Speech Recognition With Microphone Arrays 93.1 Experimental Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3.1.1 Data Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.1.2 Beamforming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103.1.3 Recognizer Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.1.4 Signal Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143.1.5 Recognition Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.2 Noisy Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.2.1 Signal Measurements and Recognition Performance . . . . . . . . . . . . . . . . 20

3.3 Correlation Between Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213.3.1 Linear Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213.3.2 Fit Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253.3.3 Nonlinear Fits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4 Towards Enhancing Delay and Sum Beamforming 314.1 Overview of the Delay-and-Sum beamformer . . . . . . . . . . . . . . . . . . . . . . . . 314.2 Delay-Weight-and-Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324.3 Delay-Filter-and-Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.3.1 Optimal-SNR Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354.4 A Reverberant Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.4.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

v

5 Optimal Filtering 415.1 The Single Channel Wiener Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.1.1 Additive Uncorrelated Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425.2 Multi-channel Wiener Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

5.2.1 Additive Uncorrelated Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445.2.2 Direct Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455.2.3 Filtered Signal Plus Additive Independent Noise . . . . . . . . . . . . . . . . . . 465.2.4 Filtered Signal Plus Semi-Independent Noise Model . . . . . . . . . . . . . . . . 48

5.3 A Non-Optimal Filter and Sum Framework . . . . . . . . . . . . . . . . . . . . . . . . . 495.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

6 Signal Spectrum Estimation 536.1 Spectral-Subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536.2 Cross-Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

6.2.1 Computational Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556.3 Combining Spectral-Subtraction and Cross-Power . . . . . . . . . . . . . . . . . . . . . . 566.4 Comparison of Signal Estimate Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 566.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

7 Implementations and Analysis 597.1 Optimal-SNR Filter-and-Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

7.1.1 Subjective Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607.1.2 Objective Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

7.2 Wiener Sum-and-Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637.2.1 Subjective Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647.2.2 Objective Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

7.3 Wiener Filter-and-Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677.3.1 Subjective Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677.3.2 Objective Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

7.4 Multi-Channel Wiener . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717.4.1 Subjective Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 727.4.2 Objective Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

8 Summary and Conclusions 848.1 Directions for Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

Bibliography 86

vi

LIST OF TABLES

3.1 Breakdown of the experimental database . . . . . . . . . . . . . . . . . . . . . . . . . 93.2 Distortion measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.3 Word error rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173.4 Distortion measurements for the noisy database . . . . . . . . . . . . . . . . . . . . . . 203.5 Word error rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223.6 Matrix of correlation coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233.7 Matrix of correlation coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253.8 RMS linear fit error for linear predictors of recognition error rate. . . . . . . . . . . . . 253.9 RMS fit error for linear predictors of recognition error rate . . . . . . . . . . . . . . . . 263.10 RMS errors for linear least squares estimators of recognition error rates . . . . . . . . . 293.11 RMS errors for linear least squares estimators of recognition error rates . . . . . . . . . 29

7.1 Recognition performance for the OSNR beamformer . . . . . . . . . . . . . . . . . . . 607.2 Summary of the measured average FD, BSD, SSNR and peak SNR . . . . . . . . . . . 617.3 Recognition performance for the WSF beamformers . . . . . . . . . . . . . . . . . . . 657.4 Recognition performance for the WSFosnr beamformers . . . . . . . . . . . . . . . . . 657.5 Summary of the measured average FD, BSD, SSNR and peak SNR . . . . . . . . . . . 657.6 Summary of the measured average FD, BSD, SSNR and peak SNR . . . . . . . . . . . 667.7 Recognition performance for the WFS beamformer . . . . . . . . . . . . . . . . . . . . 697.8 Recognition performance for the WFSosnr beamformer . . . . . . . . . . . . . . . . . . 697.9 Distortion values for the WFS beamformer. . . . . . . . . . . . . . . . . . . . . . . . . 697.10 Distortion values for the WFSosnr beamformer. . . . . . . . . . . . . . . . . . . . . . . 707.11 Recognition performance for the MCW beamformer . . . . . . . . . . . . . . . . . . . 737.12 Recognition performance for the MCWosnr beamformer . . . . . . . . . . . . . . . . . 737.13 Measured distortion for the MCW beamformer. . . . . . . . . . . . . . . . . . . . . . . 737.14 Measured distortion for the MCWosnr beamformer. . . . . . . . . . . . . . . . . . . . . 74

vii

LIST OF FIGURES

2.1 Relationship between Hz and Bark. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.2 Equal-loudness curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.3 Spreading function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3.1 Layout of the LEMS microphone-array system . . . . . . . . . . . . . . . . . . . . . . 103.2 Data flow for the array recording process. . . . . . . . . . . . . . . . . . . . . . . . . . 103.3 Example recorded time sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.4 Log-magnitude spectrograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.5 Outline of the method for delay steering the array recordings. . . . . . . . . . . . . . . 123.6 Measured talker locations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.7 Distortion measurements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153.8 Word recognition error rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173.9 Layout of the recording room . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183.10 PCM sequence of a talker recording with pink noise . . . . . . . . . . . . . . . . . . . 183.11 Spectrograms of the noisy recordings . . . . . . . . . . . . . . . . . . . . . . . . . . . 193.12 Aliasing spectral bands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193.13 Distortion measurements for the added-noise database. . . . . . . . . . . . . . . . . . . 213.14 Word recognition error rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223.15 Scatter plots comparing the various distortion measures . . . . . . . . . . . . . . . . . . 243.16 Scatter plots comparing the various distortion measures . . . . . . . . . . . . . . . . . . 263.17 Scatter plot of baseline-HMM error rate versus MAP-HMM error rate . . . . . . . . . . 273.18 Scatter plot of baseline-HMM error rate versus MAP-HMM error rate . . . . . . . . . . 283.19 Scatter plots of the linear least squares fit . . . . . . . . . . . . . . . . . . . . . . . . . 293.20 Scatter plots of the linear least squares fit . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.1 Idealized SNR gain as a function of the number of sensors . . . . . . . . . . . . . . . . 314.2 A linear microphone array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334.3 Simulation showing the improvement in SNR . . . . . . . . . . . . . . . . . . . . . . . 334.4 Linear microphone array with interfering noise point-source . . . . . . . . . . . . . . . 344.5 SNR improvement for DSBF with point-source noise . . . . . . . . . . . . . . . . . . . 344.6 Location of the source, noise, and microphones for the reverberant simulation. . . . . . 374.7 The simulated impulse response for the talker received at microphone 1. . . . . . . . . . 374.8 Optimal filter for microphone 1 for the 40 microphone case . . . . . . . . . . . . . . . 374.9 Signal-to-noise+reverb improvement for a simulated room . . . . . . . . . . . . . . . . 394.10 BSD measure for the 4 different beamforming schemes . . . . . . . . . . . . . . . . . . 394.11 Signal-to-noise-only ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404.12 Signal-to-reverberation ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.1 Flow diagram for a Wiener filter-and-sum beamformer . . . . . . . . . . . . . . . . . . 50

5.2 The attenuation of Φ�1 �

m � ω � . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.3 The attenuation of Φ�1 �

m � ω � . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515.4 Ad hoc methods of warping the filter gains . . . . . . . . . . . . . . . . . . . . . . . . 52

viii

6.1 Average BSD, SSNR and peak SNR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546.2 Average BSD, peak SNR and SSNR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576.3 Average BSD, peak SNR and SSNR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

7.1 The structure of the DSBF with optimal-SNR based channel filtering (OSNR). . . . . . 597.2 Narrowband spectrogram of a noisy utterance processed with OSNR. . . . . . . . . . . 617.3 The structure of the DSBF with Wiener post-filtering . . . . . . . . . . . . . . . . . . . 637.4 Narrowband spectrograms for the WSF beamformer. . . . . . . . . . . . . . . . . . . . 647.5 The structure of the Wiener filter-and-sum (WFS) beamformer. . . . . . . . . . . . . . 677.6 Narrowband spectrograms from the WFS beamformer. . . . . . . . . . . . . . . . . . . 687.7 Diagram of the optimal multi-channel Wiener (MCW) beamformer. . . . . . . . . . . . 717.8 Narrowband spectrograms from the MCW beamformer. . . . . . . . . . . . . . . . . . 727.9 Summary of word error rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757.10 Word error rates with OSNR input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 767.11 The best performing filtering schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . 777.12 Summary of FD and BSD values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 787.13 Summary of SSNR and SNR values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 797.14 Scatter plots of error rate and distortion measures. . . . . . . . . . . . . . . . . . . . . 807.15 Scatter plots comparing correlation of distortion measures . . . . . . . . . . . . . . . . 817.16 Scatter plot of distortion measurements by algorithm type. . . . . . . . . . . . . . . . . 827.17 Scatter plots of OSNR distortion measurements . . . . . . . . . . . . . . . . . . . . . . 83

ix

CHAPTER 1:INTRODUCTION

Microphone arrays are becoming an increasingly popular tool for speech capture[1, 2, 3, 4, 5, 6] and maysoon render the traditional desk-top or headset microphone obsolete. Unlike conventional directionalmicrophones, microphone arrays are electronically steerable which gives them the ability to acquire ahigh-quality signal (or signals) from a desired direction (or directions) while attenuating off-axis noise orinterference. Because the steering is implemented by software and not by a physical realignment ofsensors, moving targets can be tracked anywhere within the receptive area of the microphone array and thenumber of simultaneously active targets is limited only by the available processing power. Theapplications for microphone-array speech interfaces include telephony and teleconferencing in home,office[7, 4, 8] and car environments[3, 9], speech recognition and automatic dictation[6, 10, 11, 12], andacoustic surveillance[13, 14] to name a few. To realize the promise of unobtrusive hands-free speechinterfaces that microphone arrays offer, they must perform effectively and robustly in a wide variety ofchallenging environments.Microphone-array systems face several sources of signal degradation:

� additive noise uncorrelated with the speech signal (background noise),

� convolutional distortion of the speech signal (reverberation)1,

� additive noise correlated with the speech signal (correlated noise).

In speech-acquisition applications the primary source of interference will vary. In a teleconferencing orspeaker-phone application there may be interfering speech present in addition to background noise andreverberation. In a car-phone application the primary interference will be non-speech correlated anduncorrelated noise. In an auditorium or concert hall reverberation may be the primary source ofinterference.

1.1 Methods for Acquiring Speech With Microphone Arrays

Beamforming

Delay-and-sum beamforming (DSBF) is the mainstay of sensor-array signal processing[15].Delay-and-sum beamforming is relatively simple to implement and suppresses correlated as well asuncorrelated noise (although uncorrelated noise is suppressed most consistently)[16, 17]. It requires nosignal model or measurement of signal statistics to implement. Delay-and-sum beamforming also has thedesirable property that it introduces no nonlinear distortion products into the desired signal. The onlyadded signal distortion from the beamforming process is a possible linear distortion of the frequencyresponse due to the variation of the beamwidth over frequency. Unfortunately the idealized SNR gain of adelay-and-sum beamformer is 10log10 � M � , where M is the number of microphones in the array. Toachieve a significant improvement in SNR (say � 30dB) with simple delay-and-sum beamformingrequires an impractical number of microphones, even under idealized noise conditions. This built-in

1Which contains both correlated and uncorrelated components.

1

2

limitation of DSBF motivates research into supplemental processing techniques to improvemicrophone-array performance.

Inversion Techniques

Multiple input/output inverse theorem (MINT) techniques are aimed at inverting the impulse response ofthe room which is assumed to be known a priori and thereby eliminating the effects ofreverberation[18, 19]. Although a room impulse response is not certain to be invertible[20, 21], undercertain constraints on the nature of the individual transfer functions between the source and eachmicrophone, a perfect inverse system can be realized for the multiple-input/single-output system[19].Although this is an effective method for reducing the effects of reverberation it requires an accurateestimate of the transfer function between the source and each microphone, a measurement that can bequite difficult to make in practice. A simultaneous recording and playback system is required to accuratelymeasure the transfer function between the sound source and a microphone, and a separate transfer functionmust be measured for every point in the room from which a talker may speak. The room transfer functionwill change as the room configuration changes, whether due to rearrangement of furniture or the numberof occupants. To further complicate the task of measuring the room transfer functions, significantvariations in impulse response occur over time as temperature and humidity, and therefore the speed ofpropagation of sound, vary[22] and impulse response inversion techniques may be very sensitive to theaccuracy of the measured impulse response[21].Matched-filtering compensates for the effects of reverberation by whitening the received signal with thetime-reverse of the estimated impulse response [23]. Matched-filtering relies upon the generallyuncorrelated nature of the channel impulse response to strengthen the main impulse while spreadingenergy away from the central peak. As such, it performs best when significant reverberant energy exists.Although a sub-optimal inversion technique, matched-filtering does not require the inverse of the channelimpulse response. Like MINT techniques, matched-filtering requires prior knowledge of the impulseresponse.

Adaptive Beamformers

Traditional adaptive beamformers[24, 25, 26, 15] optimize a set of channel filters under some set ofconstraints. A typical optimization is for minimal output power subject to flat overall frequency response.These techniques do well in narrowband, far-field applications and where the signal of interest hasgenerally stationary statistics but are not as well suited for use in speech applications where:

� the signal of interest has a wide bandwidth

� the signal of interest is non-stationary

� interfering signals also have a wide bandwidth

� interfering signals may be spatially distributed

� interfering signals may be non-stationary

Adaptive noise cancellation (ANC)[27] techniques exploit knowledge of the interference signal to cancel itout of the desired input system. In some ANC applications the interference signal can be obtainedindependently (perhaps with a sensor located close to the interferer but far from the desired source). Theremaining problem is then to eliminate the interfering signal from a mix of desired and interfering signalstypically by matching the delay and gain of the measured noise and subtracting it from the beamformeroutput. A particular sort of adaptive array employing ANC has been used with some success inmicrophone-array applications [28, 29]. The generalized sidelobe canceler (GSC) uses an adaptive arraystructure to measure a noise-only signal which is then canceled from the beamformer output. Obtaining anoise measurement that is free from signal leakage, especially in reverberant environments, is generallywhere the difficulty lies in implementing a robust and effective GSC.

3

Another variant of adaptive beamformers that has found some application in microphone arrays is thesuperdirective array[30, 31, 32]. Superdirective array structures generally exhibit greater off-axis noisesuppression than delay-and-sum arrays given similar aperture size but at the expense of being geometrydependent (endfire arrays are a common superdirective configuration) which restricts the effective steeringrange.

Noise Reduction Filtering

Single-channel noise reduction techniques are of also applicable to microphone-array processing, eitherbefore or after any summing operation. A widely known technique for single-channel noise reduction isspectral subtraction[33] in which the magnitude of the noise bias is estimated and subtracted in theshort-time frequency (Fourier) domain to obtain a noise-suppressed short-term signal magnitude spectrumestimate. Variants on the spectral-subtraction idea include using the spectral estimate in a Wiener filter[34]and other methods aimed at tuning the degree of noise suppression with varying SNR to best preservesubjective speech quality[35, 36, 37]. These methods are all similar in the sense that they generally involveprocessing the short-term magnitude spectrum to generate a noise-reduced signal spectrum estimate whichis then used to generate a filter through which the signal is processed. Wiener filtering techniques, whichfall in this general family of algorithms, have been applied to microphone-array applications, typically as apost-processing or post-filtering step[38, 39, 40, 41].

1.2 The Scope of This Work

All the algorithms discussed in the preceding section have their advantages and disadvantages.Beamforming in particular has the advantage that it is relatively simple to implement and requires no priorknowledge of the signal or the environment to be effective. Noise reduction filtering strategies share asimilar advantage in that they generally have low computational complexity and the signal and noiseparameters required for their implementation can be estimated fairly readily and robustly from thereceived signals. In the following chapters the performance of a delay-and-sum-beamformer (DSBF) willbe evaluated and extensions to the DSBF developed to incorporate noise reduction filtering in a novel way.Chapter 2 will introduce a set of objective measures to be used in evaluating the effectiveness of a speechenhancement technique. In Chapter 3 a set of microphone-array recordings will be introduced and baselinemeasurements on a delay-and-sum beamformer presented, including the performance of a speechrecognition system. In Chapter 4 an optimal microphone weighting will be derived and the performance ofa weighted beamformer in a reverberant simulation analyzed. In Chapter 5 the derivation of MMSE filters(Wiener filters) for signal enhancement will be described and a novel multi-input extension to thesingle-channel method will be derived. Methods for noise and signal power estimation will be presented inChapter 6 and then applied in Chapter 7 where the results of implementations of the optimal weighting andfiltering strategies introduced in Chapters 4 and 5 will be presented and contrasted with the performance ofthe baseline delay-and-sum beamformer as well as a standard Wiener post-filter algorithm.

CHAPTER 2:EVALUATING SPEECH ENHANCEMENT

When evaluating a speech or signal enhancement technique, the intended application influences the choiceof benchmark. If the application is speech recognition then clearly recognition performance is theobjective measure of interest. If the microphone-array system is acquiring speech for a teleconferencingtask the intelligibility and subjective quality of the speech as perceived by human listeners is the importantmeasure. In this chapter some signal quality measures will be described. These measures will be used inlater chapters to evaluate the performance of proposed speech-enhancement algorithms.

2.1 Listening Scores

Listening scores are commonly used for evaluating speech coders[42, 43, 44, 45, 46] and are typicallybroken down into intelligibility tests which measure the intelligibility of distorted or processed speech andquality or acceptability tests which measure the subjective quality of distorted or processed speech.Speech intelligibility tests include the Diagnostic Rhyme Test (DRT), Modified Rhyme Test (MRT) andPhonetically Balanced Word Lists (PB) tests. Speech quality tests include the Diagnostic AcceptabilityMeasure (DAM), Mean Opinion Score (MOS) and Degradation Mean Opinion Score (DMOS) tests.ANSI standards exist for DRT, MRT, and PB tests. Although these and similar tests are widely considereduseful for tasks such as comparing vocoder standards [47], it is impractical to do a new set of listeningtests every time a parameter is changed or an algorithm is updated. Quality tests in particular requireprofessional evaluation by expert listeners1 and intelligibility tests generally require a specializedvocabulary for the test speech. For this reason easily computed objective measures derived directly fromthe output waveform that accurately reflect speech quality are a very valuable commodity.

2.2 Objective Measures

A large study of correlations between objective measures of speech quality and the results of listening testsis described in [48]. Recent publications cite the high correlation of Bark Spectral Distortion (BSD) withMOS scores[49, 50]. A modified BSD that employs an explicitly perceptual masking model has exhibitedeven better correlation with MOS scores[51, 52].

2.2.1 Signal-to-Noise Ratio (SNR)

Signal-to-noise ratio (SNR) is a ubiquitous measure of signal quality.

SNR � x � y �� 10log10

�∑N

n � 1 � x � n � � 2∑N

n � 1 � x � n �� y � n � � 2 � (2.1)

where x � n � is an undistorted reference signal and y � n � is the distorted (for instance with additive noise) testsignal. In some cases the signal and noise may not be available separately to form the ratio in Equation

1Dynastat, Inc. is one company that performs these services.

4

5

(2.1). In such situations an alternative is to estimate the peak SNR directly from the measured signal. If therecording under analysis has regions with only noise and no signal (recordings of speech almost alwaysdo) and the noise is assumed to be stationary, these regions can be used to estimate the noise power.Correspondingly a region where the signal is active can be used to estimate the signal+noise power. Thepeak SNR can be formed from the ratio of these measured powers:

peakSNR � x � y � � 10log10

�∑K

k � 1 � y � nmax � k � � 2 � ∑Kk � 1 � y � nmin � k � � 2

∑Kk � 1 � y � nmin � k � � 2 � (2.2)

where the signal is broken down into frames of K samples, y � n � k � indicates the kth sample of the nth

analysis frame of the observed signal y. The power in each frame is measured and nmax indicates theanalysis frame with the highest power and nmin denotes the analysis frame with the lowest power in the testsignal. The difference in the numerator of Equation (2.2) is based on the assumption that the noise andsignal are statistically independent which implies that E

� � s � n � 2 � � E� � s � 2 � � E

� � n � 2 � . In this way anestimate of the peak SNR is made without access to the reference clean signal as in Equation (2.1).SNR is attractive for several reasons:

� It is simple to compute, using Equation (2.1) if the reference signal and noise may be separated orusing Equation (2.2) if an estimate must be made only from an observed signal.

� It is intuitively appealing, especially to electrical or communications engineers who are accustomedto the idea that improved SNR indicates improved information transfer.

Unfortunately SNR correlates poorly with subjective speech quality[53, 48]. SNR as written in Equation(2.1) is sensitive to phase shift (delay), bias and scaling which are often perceptually insignificant.Meanwhile, the peak SNR as measured in Equation (2.2) has no clean signal reference to work from and isreally measuring the dynamic range rather than the distortion in the recording under test. As such it ispossible for the peak SNR to improve even as the signal is becoming more distorted. SNR and peak SNRare still useful measurements, but care must be taken to interpret them in the proper context.

2.2.2 Segmental Signal-to-Noise Ratio (SSNR)

The segmental signal-to-noise ratio (SSNR) has been determined to be a better estimator of subjectivespeech quality[53, 48]:

SSNR � x � y � � 1N

N

∑n � 1

10log10

�K

∑k � 1

� x � n � k � � 2� x � n � k �� y � n � k � � 2 � (2.3)

where x � n � k � denotes the kth sample of the nth frame of the reference signal x, and y � n � k � thecorresponding frame and sample of the distorted test signal. Because the ratio is evaluated on individualframes loud and soft portions contribute equally to the measure. Speech detection is desirable to preventsilence frames from unduly biasing the average with extremely low segment SNR’s[53, 48]. Alsolong-term frequency response adjustment of the test signal may be desirable to avoid biasing the error withfrequency response effects that could be easily compensated for.

2.2.3 Bark Spectral Distortion (BSD)

The Bark frequency scale is based upon a variety of psycho-acoustic experiments that investigaterelationships between the bandwidth of an acoustic stimuli its perceived loudness or its ability to masktone stimuli[54]. These and other experiments form the basis for the concept of a critical bandwidth whichcorresponds in a rough sense to the bandwidth resolution of human hearing. The Bark frequency scale isnormalized by critical bandwidth; at any frequency a unit of 1 Bark corresponds to 1 critical bandwidth.The transformation from linear frequency in Hertz, f , to Bark frequency, z, is commonly approximated bythe following relationship[54] which is plotted in Figure 2.1.

6

0 2000 4000 6000 80000

2

4

6

8

10

12

14

16

18

20

22

Frequency (Hz)

Bark

estimatedmeasured

Figure 2.1: Relationship between Hz and Bark. The dashed line is from Equation (2.4) and the�

marksindicate the band edges derived from the psycho-acoustical experiments undertaken by Zwicker[54].

101

102

103

104

0

20

40

60

80

100

120

140

10

20

30

40

50

60

70

80

90

100

Frequency (Hz)

Inte

nsity

(dB)

Figure 2.2: Equal-loudness curves from Robinson[55]. Each line is at the constant phon value indicated onthe plot. The phon and intensity (dB) values are equal at 1kHz by definition.

z � 13atan �� 00076 f �� 3 � 5atan � f7500

� 2 (2.4)

In addition to frequency warping, the Bark spectrum incorporates spreading, preemphasis, and loudnessscaling to model the perceived excitation of the auditory nerve along the basilar membrane in the ear. Inthis work the Bark spectrum, Lx � z � , of a discrete time signal, x � k � , is formed by the following steps:

1. Take a windowed time segment: xw � k �� w � k � x � k �A 32ms (512 points at 16kHz) Hanning window is used to form a tapered time segment.

2. Compute the PSD: X � l ��F � xw � k �� 2The magnitude squared of a 1024 point DFT is used.

3. Apply preemphasis: Xe � l �� W � l � X � l �The equal-loudness curve[55] for 70dB loudness is used to form a preemphasis filter[50]. SeeFigure 2.2. This equalizes the perceptual contribution of energy at different frequencies.

7

−2 0 2 4 6−60

−50

−40

−30

−20

−10

0

Bark

dB

Figure 2.3: Spreading function from Wang[50].

4. Warp to Bark frequency: X � z �� warp � Xe � l � �The linear frequency spectrum of the DFT is warped onto the constant rate bandwidth Bark scale.See Figure 2.1.

5. Apply spreading function: Xs � z �� SF � z � � X � z �Excitation spreading is approximated by convolution with a spreading function[50] pictured inFigure 2.3. This is a first approximation to account for the effects of simultaneous masking.

6. Convert to loudness in phons: Px � z �� 10log10 � X � z � �The power excitation is converted to dB.

7. Convert to loudness in sones: Lx � z �� 2� �

Px�z �� 40 � � 10 �

The loudness in dB is warped onto the (approximate) sone scale[50] where a doubling in perceivedloudness has a constant distance. Each 10dB corresponds approximately to a doubling in perceivedloudness for phones above 40dB. In the absence of an absolute loudness calibration the 40dB offsetterm in this expression is not especially meaningful.

The BSD measure itself[51] is computed by taking the mean difference between reference and test Barkspectra and then normalizing by the mean Bark spectral power of the reference signal:

BSD � x � y � � 1N ∑N

n � 1 ∑Zz � 1 �Lx � n � z �� Ly � n � z � � 2

1N ∑N

n � 1 ∑Zz � 1 �Lx � n � z � � 2 (2.5)

where Lx � n � z � and Ly � n � z � are the discrete Bark power-spectra in dB of the reference and test signals fortime frame index n and Bark frequency index z. Speech detection is performed by an energy thresholdingoperation on the reference signal so that the distortion measure averages the distortion only over speechsegments. Also the test signal is filtered before computing the Bark spectrum to equalize the averagespectral power in 6 octave-spaced frequency bands to prevent the measure from being overly sensitive tothe long term spectral density of the test signal.The Modified Bark Spectral Distortion (MBSD)[51] measure incorporates an explicit model ofsimultaneous masking to determine if distortion at a particular frequency is audible. If the error betweentest and reference signals at a particular frequency falls below the masking threshold, then that error is notincluded in the error sum (since it is deemed to be inaudible). Accurate computation of the maskingthreshold is a very involved process. Yang[51] cites a simplified method for determining the overallmasking threshold given by Johnston[56]. Even the simplified method is quite involved so this methodwill be omitted in the use of BSD contained herein.

8

2.3 Speech Recognition Performance

The use of speech-recognition accuracy as a means of evaluating a speech enhancement method had someadvantages. Certainly if the goal is to achieve robust hands-free use of a speech recognition system,recognition accuracy is the measurement of interest. Unfortunately the evaluation of the performance of aspeech recognition system is not without drawbacks. If the recognition system being used is sensitive tothe acoustic environment (and all are) in which it is used it may require some form of retraining oradaptation. This retraining may require significant training data.

2.3.1 Feature Distortion

In lieu of retraining a speech recognition system, an obvious metric (when a reference signal is available)is the difference between the features of the processed speech and the features of the reference speech. TheLEMS speech recognizer uses a feature vector made up of the real Mel-warped cepstra and its timederivative and the energy of the speech signal and its time derivative[57, 58, 59]. Specifically, the LEMSspeech recognizer models the speech signal observations with Gaussian distributions of 3 ubiquitous[60]feature vectors for each analysis frame: Mel cepstral values 1 to 12 (12 features), time derivatives of theMel cepstral values 1 to 12 (12 features) and the signal energy and time derivative of the energy (2features). The speech signals are sampled at 16kHz. The features are evaluated with a 640 point (40ms)Hamming window with a 160 point (10ms) frame shift. The energy is gain normalized for each utterance.The mean cepstral vector for the test utterance is subtracted from the vectors for that utterance and eachcepstra is normalized by the standard deviation of the cepstra at that quefrency measured over theutterance.To measure the feature distortion (FD) the test features are subtracted from the reference features, squared,then normalized within each of the 3 sub-vectors by the squared sum of the reference features in thatsub-vector. The 3 mean distortion values are then averaged together giving an equal weight to thedistortion in each sub-vector:

FD � R � T � l � � 13

3

∑m � 1

∑Nl � 1 ∑Km

k � 1

�� R

�m �

k � l �� T�m �

k � l ��

2

∑Nl � 1 ∑Km

k � 1

�� R

�m �

k � l ��

2 (2.6)

where FD � R � T � denotes the feature distortion between the reference signal, R, and the signal being tested,

T . R�m �

k � l � denotes the kth feature of the mth sub-vector for analysis frame l for the features derived from

the reference signal. T�m �

k � l � denotes the corresponding feature value for the test signal.Although cepstral distance is used in a variety of speech processing applications including speech qualityassessment[61, 62], the measure presented above is referred to instead as the feature distortion because itoperates directly on the features used by the LEMS speech recognizer. Mel-cepstral distortion is similar innature to Bark spectral distortion as both measures are derived from a warped and smoothed (or liftered)log spectral representation.

2.4 Summary

In this chapter several objective measures of speech quality were introduced, varying from the traditionalsignal-to-noise ratio measure to a perceptually motivated Bark spectral distortion measure, and thespeech-recognition targeted feature distortion measure. In the following chapters the measures presentedhere will be used to evaluate the performance of proposed microphone-array processing techniques. Byusing a set of measures with different underlying principles a multifaceted view of the performance of thealgorithms to follow will be possible.

CHAPTER 3:SPEECH RECOGNITION WITH MICROPHONE

ARRAYS

In this chapter tests on the performance of a 16 element microphone array will be presented. A database ofmultichannel recordings was collected and speech recognition tests performed on delay-and-sumbeamformer configurations using from 1 to 16 microphones. The methods used to record and process themultichannel recordings will be described and the performance of each microphone-array configurationwill be presented. Each tested configuration will be evaluated with the measures described in Chapter 2:feature distortion (FD), Bark spectral distortion (BSD), segmental signal-to-noise ratio (SSNR), peak SNR(SNR). Also each array configuration will be used as a front end to the LEMS alphadigit speechrecognition system and its performance assessed in that role. Finally, the significance of and relationshipsbetween the various performance measures will be discussed.

3.1 Experimental Database

A microphone-array speech database was collected from 22 talkers of American English. The vocabularycomprises the American English alphabet (A � Z), the digits (0 � 9), “space” and “period”. The typicalutterance contains approximately 12 vocabulary items and is approximately 4 seconds long. Each talkercontributed the same number of utterances. Table 3.1 shows the data sets broken down by gender for eachof the training and test sets. The training set is used to retrain the recognizer for the novel acousticenvironment (see Section 3.1.3).

3.1.1 Data Acquisition

The microphone-array environment used in this experiment is depicted in Figure 3.1. It consists of16 pressure gradient microphones, 8 on each of two orthogonal walls of a 3.5 � 4.8 m enclosure,horizontally placed at a height of 1.6 m. Within each 8-microphone sub-array the microphones areuniformly spaced at 16.5 cm intervals. The microphone-array is in a partially walled-off area of a 7 x 8 macoustically-untreated workstation lab. Approximately 70% of the surface area of the enclosure walls iscovered with 7.5 cm acoustic foam, the 3 m ceiling is painted concrete, and the floor is carpeted. Thereverberation time within the enclosure is approximately 200 ms.The utterances were recorded with the talker standing approximately 2m away from each of themicrophone sub-arrays. The microphone-array recording was performed with a custom-built 12-bit20 kHz multichannel acquisition system[63]. The 20 kHz datastream was resampled to match the 16 kHzsampling rate used by the recognition system. During recording, the talker wore a close-talking headset

data set female male # utterances # words

training 5 6 436 4415testing 5 6 438 4497

Table 3.1: Breakdown of the experimental database by the number of talkers of each gender, the number ofutterance and the number of words in each of the training and test sets.

9

10

118

acousticfoam

480

233

( units: cm )

talker area1

8

9 16

16.5

16.5

260

145

350

Figure 3.1: Layout of the LEMS microphone-array system using 16 pressure gradient microphones.

Alignand

Segment

16

Synchronization

SBUS

Array Mic’s

A/DSUN Sparc

LEMSMicrophone

Array IIA/D

Close Talking Mic

ResampleSUN Sparc

Disk File

Disk File

Figure 3.2: Data flow for the array recording process.

microphone. This is the same microphone used to collect the high-quality speech data for training thebaseline HMM system (see Section 3.1.3). Using the analog-to-digital conversion unit of a Sparc10workstation, the signal from the close-talking microphone was digitized to 16 bits at 16 kHzsimultaneously with the 16 remote microphones in the array system. Both the close-talking and the arrayrecordings were segmented by hand to remove leading and trailing silence and then the close-talkingrecording was time-aligned to the first channel of the multi-channel recordings. See Figure 3.2.Figures 3.3 and 3.4 show data from an example recording from the recognition database. Figure 3.3 showsthe time-sequences for a single utterance recorded from the close-talking microphone (a), a singlemicrophone in the array (b) and the output of the 16 channel DSBF (c). Figure 3.4 shows thecorresponding spectrograms for the sequences shown in Figure 3.3. The noise-suppressing effect of thebeamformer is evident in both Figures 3.3 and 3.4. Also evident is that the output of the beamformer,though greatly improved from the single microphone, is quite a bit more noisy than the recording from theclose-talking microphone.

3.1.2 Beamforming

After recording and preliminary segmentation and alignment, the channels of every multichannel data filewere time-aligned with the reference close-talking microphone recording. Figure 3.5 shows an outline ofthe process. The close-talking microphone recording was used as a reference to ensure the best possibletime alignment. This is critical when computing distortion measurements (BSD, SSNR etc...) that assumethe test and reference signals are precisely aligned. The time-alignment was achieved by using an

11

(a)

0.5 1 1.5 2 2.5 3 3.5 4

−6000

−4000

−2000

0

2000

4000

sec.

(b)

0.5 1 1.5 2 2.5 3 3.5 4

−60

−40

−20

0

20

40

60

sec.

(c)

0.5 1 1.5 2 2.5 3 3.5 4

−60

−40

−20

0

20

40

60

80

sec.

Figure 3.3: Example recorded time sequences. The recording from the close talking microphone (a), chan-nel 1 from the microphone array (b) and the output of the beamformer (c). The talker is male. The spokentext is "GRAPH 600".

implementation of the all-phase transform (PHAT)[13, 64, 65, 66]. The all-phase transform of twotime-domain sequences x � k � and y � k � is given by the inverse DFT of their magnitude normalizedcross-correlation:

PHAT � x � y � � F � 1

�X � ω � Y � � ω ��X � ω � Y � � ω � �� (3.1)

where F � 1 denotes the inverse Fourier transform and X � ω � and Y � ω � the Fourier transforms of x � k � andy � k � , respectively. A 512 point (32ms) Hamming window with a 256 point (16ms) shift was used incomputing the cross-spectra. The cross-spectra were smoothed by averaging 7 adjacent frames1thenmagnitude normalized and an inverse Fourier transform applied. The resulting cross-correlation was thenupsampled by a factor of 202 and the lag corresponding to the peak value chosen. Some post-processingwas performed to eliminate spurious estimates and to constrain the delay estimates during non-speechperiods. Each channel was steered using the estimated time delays and a delay-steered version of the

1Although not entirely still during the recording, the talker movements were generally limited to leaning or shifting, rarely resultingin a change of more than 0.1 m in location. The resulting time-delay changes were generally small and slowly varying and not adverselyaffected by the time-averaging of the cross-spectra.

2This fairly high upsampling factor wasn’t chosen because resolution to 120 of a sample is necessary, but because the higher

sampling rate makes it more likely that the peak sampled value will correspond with the actual peak value of the underlying waveform.

12

(a)

sec.

Her

tz

0 0.5 1 1.5 2 2.5 3 3.5 40

1000

2000

3000

4000

5000

6000

7000

8000

(b)

sec.

Her

tz

0 0.5 1 1.5 2 2.5 3 3.5 40

1000

2000

3000

4000

5000

6000

7000

8000

(c)

sec.

Her

tz

0 0.5 1 1.5 2 2.5 3 3.5 40

1000

2000

3000

4000

5000

6000

7000

8000

Figure 3.4: Log-magnitude spectrograms of the example sequences shown in Figure 3.3.The recordingfrom the close talking microphone (a), channel 1 from the microphone array (b) and the output of thebeamformer (c). The talker is male. The spoken text is "GRAPH 600". The analysis window is a 512 point(32ms) Hamming window with a half-window (16ms) shift.

Conj

IDFTFindPeak

DFT

Upsample

ApplyDelay

DFTInter−Frame

Smooth

ReferenceRecording

ArrayRecording

d(t)

x(n)

D’(f)

C(k)X(k)

D(k)

p(t)P(k)

t

SteeredResult

y(n)

1

|C(k)|

Figure 3.5: Outline of the method for delay steering the array recordings.

multichannel data file saved to disk. Figure 3.6(a) shows locations derived from the estimated delays usinga maximum-likelihood (ML) location estimator[67]. Figure 3.6(b) shows the distribution of the x and ycoordinates for ML location estimates of every analysis frame from the entire database. The talker locationestimates were generated for analysis only; the channel delays were not constrained to correspond to anyparticular source radiation model during processing.For the purposes of establishing a baseline for the beamformer performance, no channel weighting ornormalization was performed. The channels were simply delay-steered according to the estimated delays

13

0 0.5 1 1.5 2 2.5 30

0.5

1

1.5

2

2.5

3

x coord (meters)

y co

ord

(met

ers)

(a)

1.4 1.6 1.8 2 2.2 2.4 2.60

0.1

0.2

0.3

0.4

0.5

0.6

x/y coord (meters)

frac

tion

xy

(b)

Figure 3.6: Measured talker locations. (a) Scatter plot of locations from a single talker (b) Distribution ofthe measured x and y talker locations taken over the entire database.

and summed3 in sequential order (See Figure 3.1 for the microphone numbering).

3.1.3 Recognizer Training

As mentioned previously, speech recognizers tend to be very sensitive to changes in acoustic environmentor other changes to the dynamic range, noise floor, frequency response, etc.. of the data under test. Half ofthe data was used to retrain the recognizer to the novel acoustic space for each different beamformer ormicrophone.

Incremental MAP Training

It was reported in[68] that substantial speed improvements in HMM training can be obtained usingincremental maximum a posteriori (MAP) estimation. The significance of this approach is that it does notlose any recognition performance while speeding the convergence. The learning technique presented is avariation on the recursive Bayes approach for performing sequential estimation of model parameters givenincremental data. Let

�x1 �� xT be i.i.d. observations and θ be a random variable such that f

� �xt � θ � is a

likelihood on θ given by�xt . The posterior distribution of θ is

f�θ � �x1 �� xt � f

� �xt � θ � f

�θ � �x1 �� xt � 1 � (3.2)

where f�θ � �x1 � f

� �x1 � θ � f

�θ � and f

�θ � is the prior distribution on the parameters. The recursive Bayes

approach results in a sequence of MAP estimations of θ,

θt � argmaxθ

f�θ � �x1 �� xt �� (3.3)

A corresponding sequence of posterior parameters acts as the memory for previously observed data. If thelikelihood f

� �xt � θ � is from the exponential family (That is, a sufficient statistic of fixed dimension exists)

and f�θ � is the conjugate prior, then the posterior f

�θ � �x1 �� xt � is a member of the same distribution as

3The sum was used rather than the mean since the microphone-array recordings are 12 bit and the recognition system uses 16 bitPCM input data so the sum of the 16 12-bit channels will never overflow a 16 bit word. Using the mean of the channels instead of thesum would involve a requantization step. Although the effect of this requantization is almost certainly unmeasurable by any methodused herein, there was no reason not to preserve the data with maximum precision.

14

the prior regardless of sample size t. This implies that the representation of the posterior remains fixed asadditional data is observed.In the case of missing-data problems (e.g., HMMs), the expectation-maximization (EM) algorithm can beused to provide an iterative solution for estimation of the MAP parameters[69]. The iterative EM MAPestimation process can be combined with the recursive Bayes approach. The approach that incorporates(3.2) and (3.3) with the incremental EM method[70] (That is, randomly selecting a subset of data from thetraining set and immediately applying the updated model) is fully described in[68]. Also, Gauvain and Leehave presented the expressions for computing the posterior distributions and MAP estimates of continuousobservation density HMM (CD-HMM) parameters[71]. Because the posterior is from the same family asthe prior, (3.2) and (3.3) are equivalent to the update expressions in[71] and are not repeated here.

Baseline Model

A baseline talker-independent continuous density hidden Markov model (CD-HMM) is obtained by aconventional maximum likelihood (ML) training scheme using a training database of high-quality dataacquired with a close-talking headset microphone. The training set contains contained 3484 utterancesfrom 80 talkers. The initial parameters of the CD-HMM are derived from a discrete observation hiddensemi-Markov model (HSMM) using a Poisson distribution to model state duration. This model is thenconverted to a tied-mixture HSMM by simply replacing each discrete symbol with a multivariate normaldistribution. Normal means and full covariances are estimated from the training data.

Prior Generation

The initial prior distributions are also derived from the training data set used to train the baseline HMM.The employed prior distributions are the normal-Wishart distribution for the parameters of the normaldistribution and the Dirichlet distribution for the rest of model parameters. The parameters describing thepriors are set such that the mode of the distribution corresponded to the initial CD-HMM. The strength ofthe prior (That is, the amount of observed data required for the posterior to significantly differ from theprior) is set to “moderately” strong belief. A subjective measure of prior strength is used[6, 72, 73] wherea very weak prior is (almost) equivalent to a non-informative prior and a very strong prior (almost)corresponds to impulses at the initial parameter values4.

Model Parameter Adjustment

Starting from the baseline HMM, incremental MAP training is performed to adjust the model parametersfor the novel database. 10 utterances are randomly chosen at each iteration for 100 iterations. Note that thetraining data size for the second stage (428 utterances from 11 talkers) is an order of magnitude smallerthan that used for creating the baseline HMM. Gotoh [72, 73] presents extensive information on the effectof varying the parameters (training set size, prior strength, number of iterations) of the MAP training.

3.1.4 Signal Measurements

Figure 3.7 and Table 16 show baseline distortion measurements for a delay-and-sum beamformer with avariable number of microphones and for each microphone taken individually. It is worth noting that thebest measuring single channel has distortion and SNR comparable to the 2-microphone beamformer. Thedistortion and SNR measurements were made on each of the 438 utterances in the recognizer test set andthen averaged to form the ensemble values shown in Figure 3.7. Of the 4 objective measures presentedbelow only the peak SNR can be measured from the reference close-talking microphone data. The peakSNR of the recordings made with the close-talking microphone is 43.14dB.

4In [72, 6] Gotoh characterizes the initial prior weights as “weak” or “moderate” or “strong” often without attributing a specificvalue. Presumably the intention was to drive me insane. The prior strength value used herein is 0 � 1 and corresponds to a “moderate”prior strength.

15

2 4 6 8 10 12 14 16

0.55

0.6

0.65

0.7

0.75

0.8

0.85

# mics/mic #

Fea

ture

Dis

tort

ion

(FD

)

(a)

2 4 6 8 10 12 14 160.055

0.06

0.065

0.07

0.075

0.08

0.085

0.09

0.095

0.1

# of mics / mic #

BS

D

(b)

2 4 6 8 10 12 14 16

4

4.5

5

5.5

6

# of mics / mic #

SS

NR

(dB

)

(c)

2 4 6 8 10 12 14 16

18

19

20

21

22

23

24

25

26

27

# of mics / mic #

Pea

k S

NR

(dB

)

(d)

Figure 3.7: Distortion measurements. (a) Feature distortion (FD), (b) Bark spectral distortion (BSD), (c)segmental SNR (SSNR) and (d) peak SNR shown as a function of the number of microphones used ina delay-and-sum beamformer and for each channel individually. The measurements shown were averagedover all recorded utterances for the 11 test talkers. The

�’s indicate the beamformer with the x-axis showing

the number of microphones included in the sum. The microphones were added in order, starting withmicrophone 1 (see Figure 3.1). The � ’s denote the average distortion or SNR value for that channel takenalone.

The overall improvement in peak SNR from the 1 microphone beamformer to the 16 microphonebeamformer is 9.7dB. As expected, this is somewhat less than the ideal value derived in Chapter 4 due tothe non-ideal noise cancellation and signal reinforcement. Note also that this improvement figure is greatlydependent on which microphone is chosen for the 1 microphone beamformer. Microphone 1 has one ofthe lowest peak SNR’s of any channel and the 9.7dB figure is therefore somewhat generous. By choosingthe single microphone with the highest SNR (microphone 12) the total improvement in peak SNR is only7.65dB.

3.1.5 Recognition Performance

The LEMS speech recognition system was used to evaluate the speech recognition performance of thebeamformer processed speech. Figure 3.8 and Table 3.3 show the per-microphone and delay-and-sumbeamformer recognition error rates before MAP training (baseline-HMM) after MAP training

16

By number of microphones in DSBF

#mics in DSBF � 1 2 3 4 5 6 7 8

FD 0.89 0.81 0.74 0.69 0.66 0.63 0.61 0.60BSD .096 .087 .079 .074 .070 .068 .067 .066

SSNR (dB) 3.83 4.23 4.55 4.77 5.00 5.14 5.31 5.38Peak SNR (dB) 18.18 19.84 20.90 21.90 22.93 23.59 24.21 24.53

#mics in DSBF � 9 10 11 12 13 14 15 16

FD 0.59 0.57 0.56 0.55 0.54 0.53 0.52 0.51BSD .063 .061 .059 .058 .056 .055 .055 .054


By individual microphone used

mic # � 1 2 3 4 5 6 7 8

FD 0.89 0.90 0.83 0.82 0.82 0.82 0.81 0.87BSD .096 .094 .085 .090 .086 .090 .094 .100


mic # � 9 10 11 12 13 14 15 16

FD 0.89 0.82 0.81 0.80 0.79 0.81 0.84 0.83BSD .098 .096 .096 .090 .089 .087 .091 .090


Table 3.2: Distortion measurements plotted in Figure 3.7. Distortion is shown as a function of the numberof microphones included in a DSBF and as a function of the single microphone used.

(MAP-HMM) as described in Section 3.1.3. Note that after MAP training the best single-channelrecognition result is marginally better than the performance of the 2-channel beamformer; a testament tothe usefulness of the MAP training. A different choice of microphones to include in the 2-channelbeamformer would change this result considering that microphone 1 is one of the poorest performingindividual channels. To put these values in perspective, the word error rate for the data collected with theclose-talking microphone is 8.16%. The improvement in MAP-HMM recognition accuracy from the 1microphone beamformer to the 16 microphone beamformer is 9.38% (or a 44% reduction in error).Comparing the 16 microphone beamformer against the best performing single microphone (12) the errorrate is reduced by 5.45% (a 31% reduction in error). The performance of the array is close enough to theclose-talking microphone performance that a small number of errors is a large fraction of the gap betweenthe array performance and the close-talking microphone performance.

3.2 Noisy Database

The experimental database described above contains relatively low levels of noise. The best achievablespeech recognition with the 16 channel beamformer (using MAP training) is only 3.76 percentage pointsbelow the MAP-HMM close-talking microphone results. To create a noisier condition with moredramatically degraded recognition rates a recording of a noise source was made with the same microphonearray and added into the experimental speech database. Pink noise5[74] was played out through a 4"diameter speaker and recorded by the same set of microphones previously used to record the talkers. Thenoise source was located near and directed towards microphone 1 as indicated in Figure 3.9.

5Pink noise contains equal power in each octave. This corresponds to a -3dB per octave slope of the power spectral density.

17

2 4 6 8 10 12 14 165

10

15

20

25

30

35

40

45

# mics / mic #

Wor

d E

rror

Rat

e (%

)

Headset Error Rate: 8.16%

(a) Baseline model

Figure 3.8: Word recognition error rates as a function of the number of microphones used in a delay-and-sum beamformer and as a function of the single microphone used alone, before and after MAP training.�

denotes the recognition performance before MAP training of the beamformer with varying numbers ofmicrophones and � the beamformer performance after MAP training.

�denotes the recognition perfor-

mance before MAP training of each channel taken individually and�

the performance after MAP training.The microphones were added in numerical order for the beamformer (see Figure 3.1). The strong line atthe base of the graph corresponds to the error rate for the data acquired with the close-talking microphone,8.16%. These values are tabulated in Table 3.3.


# mics in DSBF � 1 2 3 4 5 6 7 8

Baseline-HMM 41.49 36.36 32.16 30.09 28.13 26.35 24.66 24.28MAP-HMM 21.3 18.52 16.43 15.86 15.23 14.63 14.14 14.03

# mics in DSBF � 9 10 11 12 13 14 15 16Baseline-HMM 23.04 23.22 22.19 21.12 20.46 20.81 19.81 19.44

MAP-HMM 13.81 13.52 12.85 12.67 12.28 12.32 11.81 11.92


mic # � 1 2 3 4 5 6 7 8


mic # � 9 10 11 12 13 14 15 16Baseline-HMM 40.60 37.62 38.65 35.05 35.49 36.16 37.40 38.16

MAP-HMM 20.41 18.72 18.77 17.37 17.37 18.01 17.81 18.83

Table 3.3: Word error rates (%) for the HMM before and after MAP training as a function of the number ofmicrophones included in the DSBF or the single microphone used. The beamformer values are plotted inFigure 3.8 with

�’s and � ’s and the single microphone values are plotted in Figure 3.8 with

�’s and

�’s.

18

118

acousticfoam

480

233

( units: cm )

talker area1

8

9 16

16.5

16.5

260

145

350

pink noise source

300

115

Figure 3.9: Layout of the recording room as in Figure 3.1 but showing the position of the interfering noisesource.

0.5 1 1.5 2 2.5 3 3.5 4

−60

−40

−20

0

20

40

60

80

sec.

0.5 1 1.5 2 2.5 3 3.5 4

−60

−40

−20

0

20

40

60

80

sec.

Figure 3.10: PCM sequence of a talker recording with pink noise recording added. This is the same talkerand utterance used in Figures 3.3 and 3.4. The top plot is channel 16 alone and the bottom plot is the outputof the 16 channel beamformer.

Each channel of the noise recording was added to the corresponding channel of the talker recordings. Forbeamforming the inter-microphone delays estimated from the clean speech were used to steer both theoriginal speech channels and the added noise. Figures 3.10 and 3.11 show time and spectrogram plots forchannel 16 and the beamformer output for a talker recording with pink noise recording added.The discerning reader may notice the pronounced bands of noise visible in Figure 3.11 around 2800 and5600Hz and assume that these are a result of a resonance in the playback or recording system. Thesebands result from the beamforming operation and the spatial aliasing inherent in the geometry of thisparticular microphone array. The bands appear in all recordings but their exact location in frequency varieswith each recording as a function of the applied steering delays. To illustrate this, Figure 3.12 shows thespectrum of the DSBF output with no speech present as the steering location is moved in a spiral ofincreasing radius starting at x=2, y=2. The noise bands appear at harmonically spaced intervals which vary

19

sec.

Her

tz

0 0.5 1 1.5 2 2.5 3 3.5 40

1000

2000

3000

4000

5000

6000

7000

8000

sec.

Her

tz

0 0.5 1 1.5 2 2.5 3 3.5 40

1000

2000

3000

4000

5000

6000

7000

8000

Figure 3.11: Spectrograms of the noisy recordings shown in Figure 3.10. A single channel of a noise-addedrecording on top and the output of the 16 channel DSBF on the bottom.

sec

Her

tz

0 0.5 1 1.5 2 2.5 3 3.50

1000

2000

3000

4000

5000

6000

7000

8000

0.5 1 1.5 2 2.5 3 3.50.5

1

1.5

2

2.5

3

sec

met

ers

xy

Figure 3.12: Aliasing spectral bands in the response of the beamformer to a stationary noise input as thebeamformer is steered to different locations indicated in the bottom plot. The x-coordinate of the steeringlocation is indicated with a solid line and the y-coordinate with the dashed line.

smoothly with the steering location. As the beamformer is steered to different locations the noise source isfound in different portions of the beamformer sidelobes and aliasing pattern. Spatial aliasing is a sideeffect of having an array aperture larger than the wavelength of the target signal. Microphone arrays maybe designed with a constant width main lobe[75, 76, 77] which will eliminate the sort of aliasing seen herethough the tradeoff is that the main lobe is wider throughout most of the bandwidth. The particular arraygeometry used here is prone to aliasing and an explicit solution to this issue is outside the scope of thiswork, though processing techniques presented in later chapters will significantly reduce this effect.

20


#mics in DSBF � 1 2 3 4 5 6 7 8

FD 1.94 1.77 1.55 1.41 1.33 1.27 1.21 1.17BSD .221 .218 .185 .161 .150 .142 .131 .124


#mics in DSBF � 9 10 11 12 13 14 15 16

FD 1.14 1.11 1.06 1.03 1.02 1.00 0.99 0.98BSD .118 .114 .108 .104 .101 .099 .097 .096



mic # � 1 2 3 4 5 6 7 8

FD 1.94 2.03 1.66 1.68 1.68 1.52 1.48 1.40BSD .221 .248 .185 .191 .195 .170 .169 .166


mic # � 9 10 11 12 13 14 15 16

FD 1.49 1.52 1.21 1.34 1.48 1.42 1.36 1.45BSD .176 .184 .141 .164 .173 .170 .184 .189


Table 3.4: Distortion measurements for the noisy database plotted in Figure 3.13. Distortion as a func-tion of the number of microphones included in the delay-and-sum beamformer and as a function of eachmicrophone taken alone using the noisy speech data.

3.2.1 Signal Measurements and Recognition Performance

The measurements and recognition tests made on the basic talker recordings were repeated on theadded-noise database. Figures 3.13 and 3.14 summarize these results along with Table 3.4. The proximityof the noise source to microphone 1 is evident in the low recognition rates and high distortion values forthe microphones closest to the noise source. For perspective, note that the best performing singlemicrophone, 11 (see Figure 3.14), is on a par with the 8 channel beamformer using the noisiermicrophones 1-8. This discrepancy suggests strongly that an appropriately weighted sum of microphonescould outperform the uniform weighted beamformer presented here.The overall improvement in peak SNR from the 1 microphone beamformer to the 16 microphonebeamformer is 13.95dB. This is greater than the improvement predicted in Chapter 4 of10log10 16 � 12 � 04dB. In the noisy recordings here though the noise in each channel is not equal.Microphone 1 has the highest level of noise using it as the starting point for improvements inflates theapparent improvement. If microphone 11 is used for the single microphone beamformer the total SNRimprovement comes out to only 3.6dB.The improvement in MAP-HMM recognition accuracy from the 1 microphone beamformer to the 16microphone beamformer is 58.4% (or a 71% reduction in error). Comparing the 16 microphonebeamformer against the best performing single microphone (11) the error rate is reduced by 9.02% (a 27%reduction in error).

21

2 4 6 8 10 12 14 161

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

1.9

2

# mics/mic #

Fea

ture

Dis

tort

ion

(FD

)

(a)

2 4 6 8 10 12 14 160.1

0.12

0.14

0.16

0.18

0.2

0.22

0.24

# of mics / mic #

BS

D

(b)

2 4 6 8 10 12 14 16

1

1.5

2

2.5

3

3.5

4

# of mics / mic #

SS

NR

(dB

)

(c)

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

# of mics / mic #

Pea

k S

NR

(dB

)

(d)

Figure 3.13: Distortion measurements for the added-noise database. (a) Feature distortion (FD) (b) Barkspectral distortion (BSD), (c) segmental SNR (SSNR) and (d) peak SNR shown as a function of the numberof microphones used in a delay-and-sum beamformer and for each channel individually. The measurementsshown were averaged over all recorded utterances with added noise for the 11 test talkers. The

�’s indicate

the beamformer with the x-axis showing the number of microphones included in the sum. The microphoneswere added in order, starting with microphone 1 (see Figure 3.9). The � ’s denote the average distortion orSNR value for that channel taken alone. These values are tabulated in Table 3.4.

3.3 Correlation Between Measures

All the graphs in Sections 3.1.4 and 3.1.5 show a generally similar trend as a function of microphone(s).These trends have been averaged over many presentations and don’t show the variance of the measure orthe strength of the correlation between the different distortion or SNR measures. A distortion or SNRmeasure that correlates strongly with recognition performance has value as a quick means of evaluatingrecognition performance. On the other hand if all the measures are very strongly correlated then it makeslittle sense to perform them all since one would suffice.

3.3.1 Linear Correlation

A linear correlation analysis of the individual measurements summarized in Section 3.1.4 and 3.1.5 ispresented below. Each test set utterance provides one observation of FD, BSD, SSNR, peak SNR and

22

2 4 6 8 10 12 14 1620

30

40

50

60

70

80

90

100

Wor

d E

rror

Rat

e (%

)

# mics / mic #

Figure 3.14: Word recognition error rates as a function of the number of microphone used in a delay-and-sum beamformer and as a function of channel, before and after MAP training.

�denotes the performance

before MAP training of the beamformer with varying numbers of microphones and � the beamformerperformance after MAP training.

�denotes the recognition performance before MAP training of each

channel taken individually and�

the performance after MAP training. The microphones were added innumerical order for the beamformer (see Figure 3.9). These values are tabulated in Table 3.5.


# mics in DSBF � 1 2 3 4 5 6 7 8Baseline-HMM 96.69 92.59 84.32 77.21 71.78 67.62 63.40 60.44

MAP-HMM 81.90 72.09 56.10 47.74 41.45 38.43 35.45 32.49

# mics in DSBF � 9 10 11 12 13 14 15 16



mic # � 1 2 3 4 5 6 7 8


mic # � 9 10 11 12 13 14 15 16


Table 3.5: Word error rates (%) for the model before MAP training and model after MAP training as a func-tion of microphones included in the delay-and-sum beamformer and as a function of the single microphoneused alone. These values are plotted in Figure 3.14 with

�’s and � ’s for the beamformer values, and

�’s

and�

’s for the single microphone values.

recognition performance before and after MAP retraining. The coefficient of correlation is given by[78]:

ρ � Cov � Y1 � Y2 �σ1σ2

(3.4)

23

Baseline MAP FD BSD SSNR SNRError % Error %

Baseline Error % 1.00 0.94 0.95 0.86 -0.75 -0.89MAP Error % 1.00 0.89 0.82 -0.69 -0.81

FD 1.00 0.94 -0.81 -0.95BSD 1.00 -0.90 -0.94

SSNR 1.00 0.89SNR 1.00

Table 3.6: Matrix of correlation coefficients for the different types of measurements without per-talkernormalization.

where Cov � Y1Y2 � denotes the covariance of the joint distribution of the two variables under examination

and σ1 � 2 are the corresponding standard deviations, σ1 � E�

� Y1 � Y1 � 2 � 12 . The covariance is given by:

Cov � Y1 � Y2 � � E�Y1Y2 � � E

�Y1 � E

�Y2 � (3.5)

Table 3.6 shows the inter-measure correlation coefficients for the recognition scores and distortionmeasures presented above. For each talker the average value of each measure was computed over the set ofutterances for that talker. For each of the 11 test talkers 31 values of each measure were taken (15 valuesmeasured from beamformers with a varying number of microphones > 1 and 16 values measured fromeach microphone taken individually) both for the original database and the added-noise database for a totalof 682 values over which the correlation coefficient was measured. The correlation coefficients are shownas signed values since the distortion measures and signal-to-noise ratio measures are inversely correlated.The signs of the measured correlation coefficients are all appropriate for the measures being compared;measures of goodness correlate positively with other measures of goodness and negatively with measuresof badness. Table 3.6 shows generally very strong correlations. The correlation between FD andbaseline-HMM error rate is strongest with a generally lower correlation between any other distortion orSNR measure and any recognition error rate. The strong correlation between FD and baseline-HMM errorrate can be clearly seen in the scatter plots in Figure 3.15(c). Note that FD is (slightly) more stronglycorrelated with the baseline-HMM error rate than the MAP-HMM error rate is. Although it is notsurprising that the recognizer performance, in particular before retraining, would be closely related to thefeature distortion, it is somewhat unexpected that the feature distortion would be more closely coupledthan the MAP-HMM error rate. Note also that all distortion and SNR measures are less strongly correlatedwith the MAP-HMM error rate than with the baseline-HMM error rate.The preceding analysis fails to take into account the talker-dependence of the error rates. That is, thespeech recognition error rates can vary quite a bit between talkers or even individual utterances from asingle talker. Using the results from the recordings acquired with the close-talking microphone, theinter-talker standard deviation of the error rates of the 11 talkers in the test set is 5.9%. This is largely dueto a single talker with a very high error rate - this same talker stands out clearly in the scatter plots inFigure 3.15. Excluding this one outlying talker the standard deviation of the per-talker error rates is 2.0%.This phenomenon is not necessarily an issue of sound quality but often one of the manner of pronunciationor elocution6. To eliminate this talker variability the per-talker word error rates were computed and themean error rate for each talker was subtracted from that talker’s values. The result is a differential errorrate for each talker relative to their mean error rate. The same normalization was performed with thedistortion and SNR measures7 rendering them per-talker difference measures as well. Table 3.7 shows theinter-measure correlation coefficients after this normalization. Figure 3.15 shows scatter plots of the more

6In particular, the very worst performing talker is a female talker with very high pitch.

7Although this may not have been strictly necessary, little is gained by maintaining the absolute distortion measurements in theabsence of absolute recognition error rates. That is, all the results were already rendered relative by the per-talker normalization of theerror rates.

24

20 40 60 80 100

10

20

30

40

50

60

70

80

90

Baseline error rate

MA

P e

rror

rat

e

(a)

0.5 1 1.5 2

20

40

60

80

100

Feature Distortion

Bas

elin

e er

ror

rate

(b)

0.05 0.1 0.15 0.2 0.25

20

40

60

80

100

BSD

Bas

elin

e er

ror

rate

(c)

0 10 20 30

20

40

60

80

100

SNR

Bas

elin

e er

ror

rate

(d)

0.5 1 1.5 2

10

20

30

40

50

60

70

80

90

Feature Distortion

MA

P e

rror

rat

e

(e)

0.05 0.1 0.15 0.2 0.25

10

20

30

40

50

60

70

80

90

BSD

MA

P e

rror

rat

e

(f)

0 10 20 30

10

20

30

40

50

60

70

80

90

SNR

MA

P e

rror

rat

e

(g)

Figure 3.15: Scatter plots comparing the various distortion measures with recognition error rates. The blackpoints correspond to the values with per-talker bias removed and the lighter patches to the values withoutany normalization. One point is shown for each of 62 measurements for 11 talkers (682 points in total).

strongly correlated pairs. Most obviously and not surprisingly, the per-talker bias normalization hasgenerally increased the correlation values. This effect can be seen in the plots of Figure 3.15 as the spreadof the data away from the primary linear trend is greatly reduced in the talker-normalized data. As in theunnormalized case peak FD is still most strongly correlated with the baseline-HMM error rate though nowboth BSD and peak SNR are much closer than in the unnormalized case.

25


Baseline Error % 1.00 0.95 0.98 0.95 -0.90 -0.97MAP Error % 1.00 0.96 0.93 -0.85 -0.90

FD 1.00 0.97 -0.90 -0.97BSD 1.00 -0.91 -0.94

SSNR 1.00 0.93SNR 1.00

Table 3.7: Matrix of correlation coefficients for the different measurements with per-talker bias normaliza-tion of each of the 11 � 62 measurements.


Without Talker NormalizationBaseline Error % 0.00 8.18 7.86 12.40 16.06 10.99

MAP Error % 6.54 0.00 8.68 11.25 14.12 11.34With Talker Normalization

Baseline Error % 0.00 7.33 4.27 7.18 10.17 5.66MAP Error % 5.84 0.00 5.41 6.87 9.84 8.04

Table 3.8: RMS linear fit error for linear predictors of recognition error rate. The column corresponds tothe type of data used to predict with and each row corresponds to the target of the predictor. Errors are inthe units of the recognition error rate, % words in error.

3.3.2 Fit Error

While the correlation coefficient is useful for comparing the relative correlation between measurements ondifferent scales, the RMS linear fit error can be used to put the differences between correlation coefficientsinto perspective. The fit error gives an indication of how much error would be expected when using onemeasure (or group of measures) to predict another.Each measure was fit with a linear function of every other measure. The least-squares coefficients of thepolynomial were obtained and the RMS error from the linear model measured. Table 3.8 shows the RMSlinear fit errors for linear predictors of recognition error rates, both with and without per-talkernormalization of the measures. For each measure, the fit and fit error are computed over 682 totalobservations from the 11 test talkers and 62 different single-microphone and beamformer configurations.The errors shown in Table 3.8 are quite large; too large for practical use as a predictor of recognition rateon a talker-by-talker basis. The lowest talker-normalized error value at 4.27% is comparable to thedifference between the 8 and 16 microphone beamformers in Figure 3.8.The fit error is also useful for examining the correlation of the overall average measures, grouping alltalkers together as in Figures 3.7 and 3.8. When the overall averages are used there are only 62 valuesavailable for each measurement type and this isn’t sufficient for a good estimate of the correlationcoefficients8. Averaging all the talkers together greatly reduces the variance of the measures and the linearfit error provides a way to quantify this effect. The reduced variance of the measures can be seen inFigure 3.16 which shows scatter plots of the measures most strongly correlated with recognition error rate.Table 3.9 shows the linear fit errors for the average values of each measurement. It is apparent fromFigure 3.16(c) that the strongest linear correlation is still between peak FD and baseline-HMM error rate.The correlation coefficients shown in Table 3.9 bear this out. The SNR scatter plot shows a slightnonlinear trend. The BSD scatter plot shows a similar linear relationship with baseline-HMM error rate,but with a greater spread from the linear trend.

8The self -correlation coefficient of 62 random points is only around 0.98.

26

20 40 60 80

20

30

40

50

60

70

80

Baseline error rate

MA

P e

rror

rat

e

(a)

1 1.5 220

30

40

50

60

70

80

90

FD

Bas

elin

e er

ror

rate

(b)

0.1 0.15 0.220

30

40

50

60

70

80

90

BSD

Bas

elin

e er

ror

rate

(c)

5 10 15 20 2520

30

40

50

60

70

80

90

SNR

Bas

elin

e er

ror

rate

(d)

1 1.5 2

20

30

40

50

60

70

80

FD

MA

P e

rror

rat

e

(e)

0.1 0.15 0.2

20

30

40

50

60

70

80

BSD

MA

P e

rror

rat

e

(f)

5 10 15 20 25

20

30

40

50

60

70

80

SNR

MA

P e

rror

rat

e

(g)

Figure 3.16: Scatter plots comparing the various distortion measures with recognition error rates. All talkersare averaged together for each of 62 data points.


Baseline Error % 0.00 6.38 2.14 4.07 5.68 3.14MAP Error % 5.05 0.00 3.96 4.39 7.65 6.83

Table 3.9: RMS fit error for linear predictors of recognition error rate using the overall average value ofeach measure. The column corresponds to the type of data used to predict with and each row correspondsto the measure being predicted. Errors are in the units of the recognition error rate, % words in error.

27

0 20 40 60 800

10

20

30

40

50

60

70

80

Baseline error rate

MA

P e

rror

rat

e

(a) 1st order. E=6.15

0 20 40 60 800

10

20

30

40

50

60

70

80

Baseline error rate

MA

P e

rror

rat

e

(b) 2nd order. E=3.41

0 20 40 60 800

10

20

30

40

50

60

70

80

Baseline error rate

MA

P e

rror

rat

e

(c) 3rd order. E=0.79

0 20 40 60 800

10

20

30

40

50

60

70

80

Baseline error rate

MA

P e

rror

rat

e

(d) 4th order. E=0.74

0 20 40 60 800

10

20

30

40

50

60

70

80

Baseline error rate

MA

P e

rror

rat

e

(e) 5th order. E=0.61

0 20 40 60 800

10

20

30

40

50

60

70

80

Baseline error rate

MA

P e

rror

rat

e

(f) 6th order. E=0.59

Figure 3.17: Scatter plot of baseline-HMM error rate versus MAP-HMM error rate with polynomial fits ofvarious orders. One data point is shown for each error rate averaged over all the talkers.

3.3.3 Nonlinear Fits

The relationship between baseline-HMM error rate and MAP-HMM error rate has a decidedly nonlineartrend to it (see Figures 3.15(a) and 3.16(a)) but a relatively low variance away from that trend.Incorporating the appropriate nonlinearity could provide a stronger correlation between the two error rates.Figure 3.17 shows the scatter plot of the MAP-HMM error rate against a 1st order through 6th orderpolynomial function of the baseline-HMM error rate along with the trajectory of the polynomial fit. Forthese polynomials the constant term was set to 0 to constrain the fit to pass through the origin sincepresumably when there are no errors with the baseline model there will be none with the MAP trainedmodel.Predictably, the fit error declines significantly with higher orders of polynomial fit. For orders higher than2 the polynomials can fit the convex shape of the data points and still change slope to intersect the origin.The drop in error from 4th to 5th order is similarly due to the polynomial fitting another zig in the data.With only 62 data points in a fairly sparse distribution this is most likely the result of overfitting.Figure 3.18 shows polynomial fit of the talker-by-talker data points. Here the fit error improves not at allfor polynomials of greater than 3rd order, reinforcing the conclusion that the improvement at higher ordersshown in in Figure 3.17 is the result of overfitting and not indicative of any underlying trend in the data.It is also reasonable to expect that a combination of FD, BSD, SNR, SSNR and functions thereof mightform a better fit to recognition rate than any one taken alone. A general linear least squares[79] fitprovides an approach for forming estimates of one variable by linear combinations of arbitrary functionsof another variable or variables. The general form of the least squares model is

28

0 20 40 60 80 1000

20

40

60

80

Baseline error rate

MA

P e

rror

rat

e

(a) 1st order. E=6.81

0 20 40 60 80 1000

20

40

60

80

Baseline error rate

MA

P e

rror

rat

e

(b) 2nd order. E=4.57

0 20 40 60 80 1000

20

40

60

80

Baseline error rate

MA

P e

rror

rat

e

(c) 3rd order. E=3.66

0 20 40 60 80 1000

20

40

60

80

Baseline error rate

MA

P e

rror

rat

e

(d) 4th order. E=3.66

0 20 40 60 80 1000

20

40

60

80

Baseline error rate

MA

P e

rror

rat

e

(e) 5th order. E=3.66

0 20 40 60 80 1000

20

40

60

80

Baseline error rate

MA

P e

rror

rat

e

(f) 6th order. E=3.66

Figure 3.18: Scatter plot of baseline-HMM error rate versus MAP-HMM error rate with polynomial fits ofvarious orders. One data point is shown for each talker and each processing type. Only talker normalizeddata is shown.

y � x �� M

∑k � 1

akFk � x � (3.6)

where each Fk � x � is an arbitrary fixed basis function of the input x. The optimal least squares coefficients,ak, are those that minimize the total squared error, χ2, over the set of observations, � xi � yi � :

χ2 � N

∑i � 1

�yi � y � xi �

σi � 2

(3.7)

where σi is the standard deviation of measurement i, or 1 if it is unknown or they are all equal.Tables 3.10 and 3.11 show the RMS error for general least squares fit of the baseline-HMM andMAP-HMM error rates using different sets of measures as input. In each column another function of ameasure is added to the set of basis functions: F1 � x � � 1, F2 � x � � FD � x � , F3 � x � � SNR � x � and so on. Foreach set of basis functions the optimal linear coefficients were computed and the RMS error measured.The standard deviation factor in Equation (3.7) was set to a constant value of 1. When fitting to thebaseline-HMM error rate, powers of the MAP-HMM error rate are added as basis functions and whenfitting to the MAP-HMM error rate, powers of the baseline-HMM error rate are added as basis functions.Except for error rate the functions are added in decreasing order of their linear correlation. As would beexpected the fit error decreases significantly as basis functions are added. In particular the fit of theMAP-HMM error rates is greatly enhanced by the addition of the squared functions which allows theoptimization to fit the curvature of the relationship and achieve a fit as good or nearly as good as the fit ofthe baseline-HMM error rate. Figure 3.19 shows the scatter plots of the linear predictors including 1st and

29

FD SNR BSD SSNR FD2 SNR2 Err%1 � 2 � 3BSD2 SSNR2

Without Talker NormalizationBaseline Err% 7.86 7.83 7.53 7.42 7.00 3.42

MAP Err% 8.68 8.32 8.30 8.20 6.96 4.11With Talker Normalization

Baseline Err% 4.27 4.09 4.05 4.03 3.79 3.08MAP Err% 5.41 4.74 4.70 4.58 3.71 2.76

Table 3.10: RMS errors for linear least squares estimators of recognition error rates using per-utterancevalues. Units are % words in error. Basis functions are added to the least squares estimator starting with FD.Each column adds another function to the set of basis functions and the 5th column adds squared functionsof the 4 distortion measures. For the column labeled "Err%1 � 2 � 3” powers of the baseline-HMM error rateis added to the basis functions for predicting MAP-HMM error rate and vice versa. The optimization anderror computation is performed over the 662 per-talker values as in Table 3.8.

FD SNR BSD SSNR FD2 SNR2 Err%1 � 2 � 3BSD2 SSNR2

Baseline Err% 2.14 1.59 1.57 1.35 0.81 0.51MAP Err% 3.96 2.11 2.10 2.07 1.06 0.58

Table 3.11: RMS errors for linear least squares estimators of recognition error rates using the ensembleaverage values of each measure. Units are % words in error. Basis functions are added to the least squaresestimator starting with FD. Each column adds another function to the set of basis functions and the 5th

column adds squared functions of the 4 distortion measures. For the column labeled "Err%1 � 2 � 3” powers 1through 3 of the baseline-HMM error rate are added to the basis functions for predicting MAP-HMM errorrate and vice versa. The optimization and error computation is performed over the 62 average values as inTable 3.9.

2nd powers of FD, SNR, BSD and SSNR. The fits with 3rd powers included aren’t shown since adding the3rd power reduced the error only very marginally.Figure 3.19 shows how the general least squares fit has a strong linear relationship with the predicted errorrate (compared to Figure 3.15) but the variance around this linear trend remains significant. As beforewhen the data is averaged across all the talkers (see Figure 3.20) the variance from the linear trend is (notsurprisingly) greatly reduced.

20 40 60 80 100

20

40

60

80

100

LLS Prediction

Bas

elin

e er

ror

rate

20 40 60 80

10

20

30

40

50

60

70

80

90

LLS Prediction

MA

P e

rror

rat

e

Figure 3.19: Scatter plots of the linear least squares fit including all 1st and 2nd powers of FD, SNR,BSD and SSNR against the predicted error rate. The corresponding fit error is shown in Table 3.10. Theblack points correspond to the values with per-talker bias removed and the lighter patches to the valueswithout any normalization. As before there are 682 data points plotted, one for each of 11 talkers times 62measurements.

30

20 40 60 8020

30

40

50

60

70

80

90

LLS Prediction

Bas

elin

e er

ror

rate

20 40 60 80

20

30

40

50

60

70

80

LLS Prediction

MA

P e

rror

rat

e

Figure 3.20: Scatter plots of the linear least squares fit including all 1st and 2nd powers of FD, SNR, BSDand SSNR against the predicted error rate for the overall averages. The corresponding fit error is shown inTable 3.11.

3.4 Summary

A set of multichannel recordings was made and processed through a uniformly weighted delay-and-sumbeamformer using from 1 to 16 microphones. The resulting enhanced recordings were evaluated withdistortion measures (FD, BSD, SSNR, peak SNR) and with the performance of a speech recognitionsystem. As expected, the performance of the DSBF improves as the number of microphones usedincreases.

� Recognition performance improves steadily as microphones are added to the DSBF resulting inapproximately a 40% decrease in error rate for the quiet data and 70% decrease in error rate for thenoisy data (compared to the performance of microphone 1).

� Every distortion measure shows similar monotonic improvement as microphones are added to theDSBF.

� For the added-noise case, the large range of performance measured for each single microphonesuggests that a non-uniform weighting of the microphones should provide an improvement over theuniform weighting used here.

� The feature distortion measure (FD) tracked the recognition performance very closely and mayprovide a way to predict the recognition performance without actually running a large data setthrough the speech recognizer.

� The improvement in peak SNR was also quite well correlated with the improvement in speechrecognition score. In general this is not expected to be true; there are trivial operations that couldincrease SNR while destroying the speech signal (adding noise only during times of active speechfor instance). For the DSBF however the overall performance is well reflected by its ability tosuppress noise.

� The MAP training greatly improved the speech recognition accuracy in every instance. Theimprovement due to MAP training is of the same order of magnitude as the improvement from theDSBF.

In the following chapters methods intended to improve upon the performance of the simple unweightedDSBF used here will be developed. Chapter4 investigates some alternative weighting methods based uponthe array and source geometry. Chapter 5 offers development of an MMSE multi-input noise-suppressionfiltering system and Chapter 2 shows experimental results on the recorded database using those methods.

CHAPTER 4:TOWARDS ENHANCING DELAY AND SUM

BEAMFORMING

This chapter will examine the performance that can be expected from delay-and-sum beamformers in asimple microphone-array scenario and investigate the improvements possible through an optimalmicrophone weighting scheme. Reverberant room simulations with multiple noise sources will be used toevaluate the impact of adding microphones to a linear array. The beamformer performance will beassessed with objective measures including signal-to-noise ratio (SNR), signal-to-reverberation ratio(SRR) and Bark spectral distortion (BSD).

4.1 Overview of the Delay-and-Sum beamformer

Consider the idealized model with a signal s � t � impinging upon M sensors. For convenience assume s � t � iszero mean. The signal received at each sensor is a delayed version of the original signal plus an additivenoise component:

ym � t � � hms � t � τm � � nm � t � 1�

m � M (4.1)

where nm � t � is a zero-mean normally distributed noise signal with variance σ2m, τm is the time delay to the

mth sensor and hm is the signal attenuation at the mth sensor. nm � t � is uncorrelated with s � t � and all nl � t �for m �� l. Assuming that the τm are known each received signal can be appropriately delayed. Thebeamformed output is then the sum of M copies of the signal s � t � with M uncorrelated additive noisesources:

y � t �� M � 1

∑m � 0

hms � t � � nm � t � τm � � s � t �M

∑m � 1

hm � M

∑m � 1

nm � t � τm �

0 200 400 600 800 10000

5

10

15

20

25

30

Number of microphones

SNR

impr

ovem

ent (

dB)

Figure 4.1: Idealized SNR gain as a function of the number of sensors in a DSBF beamformer given thatthe noise at each sensor is uncorrelated with the noise at all other sensors and the signal.

31

32

To simplify the analysis assume that the noise power is identical in each channel, σ2m � σ2, and the signal

gain is likewise identical, hm � 1 for all 1�

m � M. This assumption implies that no sensor contributesmore than any other sensor; SNR at each sensor is identical. The signal-to-noise ratio of a single (delayed)channel is

SNR1 � E�s2 � t � �

E�n2

1 � t � τi � � � E�s2 � t � �σ2

and the signal-to-noise ratio of the beamformer output is

SNRuni f orm � E� � ∑M

m � 1 s � t � � 2 �E

� � ∑Mm � 1 nm � t � τm � � 2 � � M2E

�s2 � t � �

∑Mm � 1 σ2

m� ME

�s2 � t � �

σ2

Taking the log of the ratio of these two SNR’s to get the improvement in dB yields:

10log10 � SNRuni f orm

SNR1� � 10log10 � M � (4.2)

Figure 4.1 plots this gain function. For every doubling of the number of sensors a 3dB improvement inoutput SNR is realized. Even in this idealized scenario 100 sensors are needed to achieve a 20dBimprovement in SNR.The situation becomes more complicated if each sensor measures a different level of signal and noise.That is, hm �� hl and σ2

m �� σ2l . In this case the SNR of the simple delay-and-sum beamformer is

SNRnonuni f orm � E� � ∑M

m � 1 hms � t � � 2

� ∑Mm � 1 nm � t � � 2 � � � ∑M

m � 1 hm � 2E�s2 � t � �

∑Mm � 1 σ2

m(4.3)

4.2 Delay-Weight-and-Sum

The output SNR of the beamformer can be maximized by weighting each ym � t � before performingsumming the channels. If gm is the weight applied to channel m, then the SNR of the output of thisdelay-weight-and-sum beamformer is

SNRweighted � � ∑Mm � 1 gmhm � 2E

�s2 � t � �

∑Mm � 1 g2

mσ2m

(4.4)

A simple optimization can be performed on the express in Equation (4.4) by taking the derivative withrespect to gl and setting the result equal to zero:

∂∂gl

SNRweighted � 2 � ∑Mm � 1 gmhm � hlE

�s2 � t � �

∑Mm � 1 g2

mσ2m

� � ∑Mm � 1 gmhm � 22glσ2

l E�s2 � t � �

� ∑Mm � 1 g2

mσ2m � 2

� 0

Which simplifies to the expression

gl � hl

σ2l

∑Mm � 1 gmhm

∑Mm � 1 g2

mσ2m

which is trivially satisfied by

gl � hl

σ2l

(4.5)

Substituting the optimal weight from Equation (4.5) into Equation (4.4) yields the following expressionfor the SNR of this optimally weighted beamformer:

SNRoptimalweighted ��∑M

m � 1h2

mσ2

m � 2E

�s2 � t � �

∑Mm � 1

h2m

σ2m

� M

∑m � 1

h2m

σ2m

E�s2 � t � � (4.6)

33

21 M−1

d

r

M 3

Figure 4.2: A linear microphone array with inter-microphone spacing d and distance from the talker to thearray midpoint r.

100

101

102

0

2

4

6

8

10

12

14

16

18

20

Number of microphones

SNR

impr

ovm

ent (

dB)

ideal optimal unweighted

Figure 4.3: Simulation showing the improvement in SNR for the optimally-weighted beamformer and theunweighted beamformer for a near-field linear array. The ideal SNR curve from Figure 4.1 is included forcomparison.

To get a sense of what this could mean in practice consider the example in Figure 4.2. A talker standsr � 1m away from the center of a symmetrical linear microphone array with constant microphone spacingof d � 10cm. Each microphone receives an equal level of independent noise (σ2

m � 1). The hm (and gm)terms are inversely proportional to the distance from the talker to the microphone and given byhm � 1�

r2 �� m � 1 �� 2 2d2(This choice of distances results in unity gain at the center microphone). As

microphones are added at the ends of the array we can compute the improvement in SNR as a function ofthe number of microphones with Equation (4.6). This is plotted in Figure 4.3.It is apparent from Figure 4.3 that when even a simple model of signal attenuation is taken into account therealizable gain from a delay-and-sum or delay-weight-and-sum beamformer can be quite limited. Asmicrophones are added in this simple example the SNR in each added microphone drops as the addedmicrophones are further and further away from the source, eventually negating the gain of adding thedistant microphone at all1.Taking this example one step further, supplant the independent noise at each microphone with a Gaussian

1Granted, this is a contrived example and with the 10cm spacing described the array is 4 1m wide when there are 40 microphonesin the array, 10m wide when there are 100 microphones in the array. Clearly this is not the ideal array geometry for a talker 1 meteraway from the array but it makes the point.

34

r

d

M 3 1 2 M−1

N

Figure 4.4: Linear microphone array with interfering noise point-source located 2m to the right of the talker.

100

101

102

0

2

4

6

8

10

12

14

16

18

20

# of microphones

SNR

impr

ovem

ent (

dB)

ideal optimal unweighted

Figure 4.5: SNR improvement for DSBF with point-source noise and ambient noise.

point-noise source 2m to the right of the talker. Assume that the level of noise measured by themicrophone closest to the noise source is equal to the ambient noise level. This is depicted in Figure 4.4.In this case the optimal weighting according to Equation (4.5) is no longer simply inversely proportional tothe distance from the talker, but also proportional to the square of the distance from the interfering noisesource (because in the simple spherical propagation model the noise power in the denominator ofEquation (4.5) is inversely proportional to the square-root of the distance from the noise source).Figure 4.5 shows SNR improvement as a function of the number of microphones in the array2. Thisparticular result obviously will vary with the ratio of the ambient noise power to the source noise power. Ifthe ambient noise power is much greater than the point-noise power the result will approach the oneplotted in Figure 4.3.The curves shown in Figures 4.3 and 4.5 show only a subtle difference. The achievable SNR is marginallyhigher ( � 1dB) in Figure 4.5 for both the unweighted and optimally weighted schemes, but in eitherscenario the weighted solution only enjoys about 1dB of improvement in SNR over the unweightedsolution.

2In this scenario the noise is no longer independent and the expression in equation 4.6 is not truly applicable. Although correlatednoise may add destructively or constructively depending upon the geometry of the array and source, a reasonable (or pessimistic)expectation is that the array will do worse in the presence of correlated noise. The SNR improvements shown here are over-estimatesunder that expectation.

35

4.3 Delay-Filter-and-Sum

The obvious extension of the simple source model in Equation (4.1) is to generalize the signal scalingfactor hm to a convolutional element or channel impulse response. Each sensor in this model receives afiltered version of the signal plus an independent noise signal:

ym � t �� hm � t � � s � t � � nm � t �(where � denotes convolution) and to introduce a corresponding convolutional element into the channelweighting in the beamformer:

y � t �� M

∑m � 1

gm � t � � � hm � t � � s � t � � nm � t � � (4.7)

The channel dependent filtering function can be distributed to write the beamformer output in terms of thesignal-derived portion, ys � t � , and the noise-derived portion, yn � t � :

ys � t � � M

∑m � 1

gm � t � � hm � t � � s � t �

yn � t � � M

∑m � 1

gm � t � � nm � t �

Rewriting these expressions in the frequency domain facilitates the formulation of a frequency-dependentoptimal weighting:

Ys � ω � � M

∑m � 1

Gm � ω � Hm � ω � S � ω �

Yn � ω � � M

∑m � 1

Gm � ω � Nm � ω �

where Hm � ω � , S � ω � and Nm � ω � are the Fourier transforms of hm � t � , s � t � and nm � t � , respectively. Gm � ω � isthe frequency dependent weighting. Ys � ω � is the Fourier transform of the signal-derived portion of thebeamformer output, and Yn � ω � is the Fourier transform of the noise-derived portion of the beamformeroutput. The output SNR can be written as a function of frequency:

SNRweighted � ω � � E� �Ys � ω � � 2 �

E� �Yn � ω � � 2 � � E

� � ∑Mm � 1 Gm � ω � Hm � ω � S � ω � � 2 �

E� � ∑M

m � 1 Gm � ω � Nm � ω � � 2 � (4.8)

� � ∑Mm � 1 Gm � ω � Hm � ω � � 2E

� � S � ω � � 2 �∑M

m � 1 �Gm � ω � � 2σ2m � ω � (4.9)

where σ2m � ω �� E

� �Nm � ω � � 2 � . Once again the assumption in place is that the noise in each channel isindependent of the noise in every other channel.

4.3.1 Optimal-SNR Solution

As before this SNR can be optimized by taking the derivative, with respect to G �l � ω � this time, and settingthe result equal to 0. Dropping the � ω � notation and omitting the limits on the summations (they are allm � 1 � � M) for the sake of brevity, the following solution results:

∂∂Gl

SNRweighted ∝ � ∑ �Gm � 2σ2m � � ∑GmHm � H �l � � ∑GmHm � 2Glσ2

l

� ∑ �Gm � 2σ2m � 2 � 0 (4.10)

36

which simplifies to

Gl � H �lσ2

l

� ∑ �Gm � 2σ2m � � ∑GmHm �� ∑GmHm � 2

and, not surprisingly, this is satisfied by

Gl � ω �� H �l � ω �σ2

l � ω � (4.11)

The solution in Equation (4.11) features the time-reverse of the impulse response, H �l � ω � , and is anoise-weighted variant of the matched-filter method[23]. The H �l � ω � filter acts as a sort of pseudo-inversefor Hl � ω � .

Magnitude-only Solution

If the channel impulse response is modeled as a magnitude-only filter, Hl � ω � � �Hl � ω � � , thenEquation (4.11) becomes

Gl � ω �� Hl � ω � �σ2

l � ω � (4.12)

and the filter-and-sum strategy in this case can be considered as a filterbank implementation whereEquation (4.5) is used to derive a real-valued weight for every filterbank frequency ω.

Beamformer Frequency Response

The set of filters described in Equations (4.11) and (4.12) will distort the frequency response of thebeamformer (That is,

�� ∑M

m � 1 Gm � ω � �� 1) if left in the form presented. To preserve a flat frequency

response to the beamformer the filters once derived from Equations (4.11) and (4.12) should have theirmagnitudes normalized by a factor of 1� ∑M

m � 1 Gm�ω � � for each analysis frequency, ω, to ensure that the gain of

the beamformer is uniform across frequency.It may seem an obvious step to apply a frequency weighting that implements the same sort of SNRoptimization as the microphone weighting, but if Equation (4.9) is rewritten as a function of frequencyweights instead of channel weights it will not lead to a similar solution for a frequency weighting.Frequency weighting strategies are discussed in Chapter 5.

4.4 A Reverberant Simulation

The following section describes a simulation of a linear microphone array using the channel weightingstrategies described above.

4.4.1 Methods

Figure 4.6 depicts the layout of the simulated room. 40 microphones are arranged in a linear array with10cm microphone separation across the short wall of a 4x6m room. 3 interfering noise sources aresimulated with recordings of computer equipment fan noise and placed as shown in Figure 4.6. A digitalrecording of a male talker made with a close-talking microphone is used as the desired source signal. Thesampling rate is 16000Hz. The impulse response from each source (the talker and 3 separate noisesources) to each microphone is simulated using the image method[80], see Figure 4.7 for an exampleimpulse response. The resultant reverberation time is approximately 250ms and the unprocessed

37

0

2

4

02

460

1

2

x

1

40

y

z

Microphones Talker Noise source

Figure 4.6: Location of the source, noise, and microphones for the reverberant simulation.

0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18

0

0.1

0.2

0.3

0.4

0.5

0.6

time (s)

Figure 4.7: The simulated impulse response for the talker received at microphone 1.

−0.04 −0.035 −0.03 −0.025 −0.02 −0.015 −0.01 −0.005 0 0.005 0.01

−2

0

2

4

6

8

10

12

14

16

18

x 10−3

time (s)

Figure 4.8: Optimal filter for microphone 1 for the 40 microphone case using the complex model of thechannel transfer function (based on Equation (4.11)).

38

signal-to-noise ratio at microphone 1 is approximately � 3dB. Signal or noise power in the following iscomputed by taking the mean squared value of the time signal over the length of the test signal3.The microphone outputs are delay-steered to the source location then the statistics of the noise and signalis measured over the entire 3 second utterance and used to derive the weight or filter for each channel. Thechannels are weighted (or filtered) and summed in the following 4 ways:

1. Uniformly weighted (unweighted). This is simply a delay and sum beamformer; each channel isweighted equally.

2. Weighted according to Equation (4.5) (weighted). The noise power and the magnitude of the directpath in the impulse response are measured for each microphone to form each weight according toEquation (4.5).

3. Weighted at each analysis frequency according to Equation (4.11) (freq weighted). The PSD of thenoise and the time inverse of the impulse response for each microphone is used to generate theoptimal filter according to Equation (4.11).

4. Weighted at each analysis frequency according to Equation (4.12) (mag freq weighted). Amagnitude-only version of the filter computed in item 3 above is used to weight each frequency ateach microphone.

In this simulation a 512 point analysis window (32ms) was used for measuring the power spectral densityof the signal and noise. The filters in the filter-and-sum beamformer are 1024 points long for the complexmodel of impulse response (Gm � H �m �

σ2m; method (3), freq weighted) and 512 points for the

magnitude-only case (Gm � �Hm � �σ2

m; method (4), mag freq weighted). A 1024 point tapered truncation ofthe channel impulse responses, hm � t � , is used to compute the numerator for the filters4. The shorterwindow length was used for the magnitude-only version only because the shorter window is a morepractical analysis length for speech signals. For the complex case the window had to be lengthened toinclude a reasonable portion of the impulse response.For cases (3) and (4) the overall array frequency response is normalized as described in Section 4.3.1. Thisfrequency normalization smoothes out the overall spectral shape of the beamformer response, butespecially in case (3) where the conjugate of the channel transfer function is being used as the beamformerfilter (see Figure 4.8), there are zeros in the derived channel filters; the resulting total beamformerfrequency response still contains these zeros. Because of this beamformer frequency response distortionand the action of the matched filter the signal-to-noise and reverb ratios in Figures 4.9, 4.11 and 4.12 showan improvement even for the case of a single microphone.

4.4.2 Results

Figure 4.9 depicts the SNR yielded by the different weighting strategies as a function of the number ofmicrophones added in order of increasing distance from the source. For the SN ratio shown in Figure 4.9,the numerator is the energy of the direct path signal of the talker and the denominator includes the energyof the reverberation due to the talker as well as all direct and reverberant energy from the noise sources. Inother words, anything other than the talker direct path is considered to be noise in this ratio. The signalimpulse response is divided into direct and reverberant components by applying a 4ms5 wide window

3If the unprocessed SNR seems low consider that this is an average SNR. Even at � 3dB the target speech is intelligible. The peakSNR is approximately 5dB.

4The truncated impulse response was used to try to inject a little bit of practicality into the implementation of the filter-and-sumbeamformers. This can be increased along with the lengths of the derived filters to correspond to the total length of the simulatedchannel impulse responses at the cost of increased computational complexity but the results do not significantly change.

5This value was chosen fairly ad hoc. It should be noted that the measured signal power is very sensitive to the value of thisparameter. The wider the time window that is considered to be direct path energy, the higher the measured direct signal energy will beand subsequently the higher all the signal power ratios (SNR, SRR) will be.

39

5 10 15 20 25 30 35 400

2

4

6

8

10

12

# microphonesS

NR

Impr

ovem

ent (

dB)

(noi

se+

reve

rb)

unweighted weighted freq weighted mag freq weighted

Figure 4.9: Signal-to-noise+reverb improvement for a simulated room with 3 equipment fan noise sourcesas a function of the number of microphones in the array.

5 10 15 20 25 30 35 400.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

# microphones

Bar

k S

pect

ral D

isto

rtio

n


Figure 4.10: BSD measure for the 4 different beamforming schemes as a function of the number of micro-phones in the array.

around the impulse corresponding to the direct path in the simulated channel impulse response. The talkersignal is then convolved with the direct path component and the reverberant component separately andsummed separately for cases (1), (2), and (4). For case (3) one of the actions of the derived filter is toincrease the power in the main lobe while spreading reverberant energy out away from the mainimpulse[23], consequently for this case the derived filter is convolved with the simulated channel impulseresponse and then decomposed into direct and reverberant components so that the “matched-filtering”effect can be measured accurately.For all methods the SNR improvement falls well short of the 10log10 40 � 16dB theoretical array gainfrom Equation (4.2). This is hardly surprising given the correlated nature of the noise included in thissimulation; both from the simulated noise sources and from the talker reverberance. Figure 4.9 shows thatthe simple weighted beamformer (2) and the magnitude-only filter-and-sum beamformer (4) hardly do anybetter than the uniformly weighted beamformer (1). Although method (3) does noticeably better accordingto the signal power ratio measures (Figures 4.9, 4.11 and 4.12) the Bark spectral distortion measure(Figure 4.10) is only marginally lower than even the simplest unweighted beamformer. Note also how theincremental improvement in SNR is quite similar for all methods. Method (3)’s higher SNR starts right atthe single microphone case suggesting that the main source of the higher ratios can be attributed to thematched-filtering and spectral distortion effects rather than the optimization in the microphone weighting.

40

5 10 15 20 25 30 35 400

2

4

6

8

10

12

# microphonesS

NR

Impr

ovem

ent (

dB)

(noi

se o

nly)


Figure 4.11: Signal-to-noise-only ratios as a function of the number of microphones in the array.

5 10 15 20 25 30 35 400

2

4

6

8

10

12

# microphones

SR

R Im

prov

emen

t (dB

)


Figure 4.12: Signal-to-reverberation ratios as a function for the number of microphones in the array.

Informal listening tests confirm that method (3) sounds marginally clearer than the other methods, but thisimprovement comes at the cost of knowing the channel impulse response exactly. The computation of theoptimal filters in simulation is trivial since the channel impulse responses are all known, but in a practicalsituation accurate channel impulse responses will be quite difficult to measure and will vary widely withchanges in the room[21] not to mention the position of the talker.The marginal results of the weighting methods investigated in this simulation suggest that a differentstrategy may be more fruitful in providing improvement over the basic unweighted DSBF. Chapter 5 willintroduce another form of optimization and the resulting algorithm will be implemented in Chapter 7.

CHAPTER 5:OPTIMAL FILTERING

In the previous chapter an optimal-SNR weighting strategy was derived based upon a combination of noisestatistics and geometric signal propagation models. The derived weights or filters were constant for aparticular arrangement of talker and noise sources in the room with no dependence on the signal receivedat the array sensors. Additionally, no frequency shaping or distortion was permitted in the array frequencyresponse. In this chapter, data-dependent optimal filtering strategies that use spectral shaping in an attemptto improve signal quality will be investigated, in particular the Wiener filter and a novel multi-channelvariant of the Wiener filter. Also a non-optimal application of Wiener pre-filtering to microphone arrayswill be introduced.

5.1 The Single Channel Wiener Filter

Assume a signal, y � t � , that contains the desired signal, s � t � , corrupted by some as yet unspecified noise orother distortion. A filter φ � t � that when convolved with the received signal, y � t � , approximates s � t � isdesired. The signal estimate is given by:

s � t �� φ � t � � y � t � (5.1)

and the error by:e � t � � s � t � � s � t �

If minimum mean-squared error is the criterion for choosing φ � t � the expression for the mean-squarederror can be written:

ξ � E� �� ∞

� ∞� s � t � � s � t � � 2dt � � E

� �� ∞

� ∞� e � t � � 2dt � (5.2)

Rewriting the expression for the error in the frequency domain and employing Parseval’s relation yields:

S � ω � � Φ � ω � Y � ω �E � ω � � S � ω �� S � ω �

ξ � 12π

E� � � ∞

� ∞� S � ω �� S � ω � � 2dω � � 1

2πE

� � � ∞

� ∞�E � ω � � 2dω � (5.3)

where e � t � � s � t �� s � t � and S � ω � , S � ω � , Φ � ω � , Y � ω � and E � ω � are the Fourier transforms of s � t � , s � t � ,φ � t � , y � t � and e � t � , respectively. The filter Φ � ω � is chosen to minimize the total squared error, ξ. Movingthe expected value operation inside the integral, taking the derivative1 of the integrand with respect toΦ � � ω � and setting it equal to 0 yields

1A frequently omitted detail is that this is not, strictly speaking, the derivative of the total squared error, ξ. Nevertheless, the valuesof Φ that minimize this pseudo-derivative will also minimize ξ.

41

42

∂∂Φ � � ω � E

�E � ω � E � � ω � � � E

�E � ω � ∂E � � ω �

∂Φ � � ω � � � E� � E � ω � Y � � ω � � � 0

Substituting back in for E � ω � and solving for Φ � ω � yields

E� � � S � ω �� Φ � ω � Y � ω � � Y � � ω � � � 0

E� � S � ω � Y � � ω � � � Φ � ω � E

�Y � ω � Y � � ω � � � 0

Φ � ω � E� �Y � ω � � 2 � � E

�S � ω � Y � � ω � �

Φ � ω � � E�S � ω � Y � � ω � �

E� �Y � ω � � 2 � (5.4)

Equation (5.4) is the most general form of the Wiener filter.

5.1.1 Additive Uncorrelated Noise

A simple model for the received signal, y � t � , in which the desired signal, s � t � � is corrupted by an additivenoise signal, n � t � , is simply y � t �� s � t � � n � t � . Using this signal model in Equation (5.4) would result in avariety of cross terms unless some assumptions are made about the nature of the signal, s � t � , and noise,n � t � .A commonly made assumption that simplifies Equation (5.4) is that the signal and noise are uncorrelated.More explicitly, that the expected value of the cross-correlation of the signal and noise is equal to 0.

E� � � ∞

� ∞S � ω � N � � ω � dω � � 0 (5.5)

where E�

� � denotes expected value. Using the signal model y � t �� s � t � � n � t � or its frequency domaincounterpart, Y � ω � � S � ω � � N � ω � , we get a new expression for the Wiener filter:

Φ � E�SY � �

E� �Y � 2 � � � S � 2� S � 2 � E

� �N � 2 � (5.6)

or rewriting this in terms of the measured signal Y and the noise statistic E� �N � 2 � :

Φ � �Y � 2 � E� �N � 2 ��Y � 2 (5.7)

Equations (5.6) and (5.7) are the most commonly seen forms of the Wiener filter. Note that the filtercoefficients are strictly real and non-negative. In Equation (5.6) it is clear that Φ �

1. The result ofEquation (5.7) may not satisfy this condition if the noise power in the observation of �Y � 2 is less thanE

� �N � 2 � . Care must be taken in the implementation to insure that noise in the observations doesn’t createdegenerate filter coefficients.The model for y � t � can be altered to include a convolutional distortion of the signal component:y � t � � h � t � � s � t � � n � t � . In this case, using the same assumption about the lack of correlation betweennoise and signal, Equation (5.4) becomes

Φ � E�S � HS � N � � �

E� �HS � N � 2 � � H � � S � 2�H � 2 � S � 2 � E

� �N � 2 � (5.8)

Details of Implementation

Note that if the estimate of the signal power already incorporates the transfer function, H, then theformulation in Equation (5.8) shouldn’t be applied. For instance, if deriving a filter, Φ, according to

43

Equation (5.8) but the estimate of � S � 2 , � S � 2, is formed by subtracting the expected value of the noise powerfrom the instantaneous measurement of the input signal:

� S � 2 � �Y � 2 � E� �N � 2 � � �HS � N � 2 � E

� �N � 2 � � �H � 2 � S � 2Then it is inappropriate to directly substitute this signal estimate into Equation (5.8),

Φ � H � � S � 2�Y � 2because according to the signal model � S � 2 already includes a factor of �H � 2. To achieve the form inEquation (5.8), � S � 2 needs to be divided by H rather than multiplied by H � . That is,

Φ � 1H

� �Y � 2 � E� �N � 2 ��Y � 2 � 1

H �H � 2 � S � 2�Y � 2 � H � � S � 2�Y � 25.2 Multi-channel Wiener Filter

In a multi-channel configuration the most obvious application of a Wiener filter is to apply it after thebeamforming operation:

yB � t � � M

∑m � 1

ym � t �

A post-filter φ � t � can be designed to minimize the squared error between s � t � � φ � t � � yB � t � and s � t � in amanner analogous to the development of the Wiener filter above. Since the channels have been summedinto a single output channel the MMSE solution is simply a Wiener post-filter on the beamformed signal:

Φ � ω �� E�

S � ω � Y �B � ω ��YB � ω � � 2 � (5.9)

This sum-and-filter formulation has been used variously in[38, 39].An alternative is to derive a filter-and-sum process. In the following formulation each of the M channels isfiltered independently:

s � t � � M

∑m � 1

φm � t � � ym � t � (5.10)

Or expressed in the frequency domain:

S � ω �� M

∑m � 1

Φm � ω � Ym � ω � (5.11)

In a manner similar to the derivation of the single channel Wiener filter a solution for Φm � ω � can bederived starting from Equation (5.3). Substituting S � ω � as it is defined as in Equation (5.11) yields thefollowing expression for the total squared error:

ξ � E� �� ∞

� ∞�E � ω � � � � E

� �� ∞

� ∞

�� S � ω �� S � ω � �

� �� E

� �� ∞

� ∞

��S � ω �� M

∑m � 1

Φm � ω � Ym � ω �� (5.12)

Taking the derivative of the integrand of Equation (5.12) with respect to Φ �m � ω � and setting it equal to zeroyields:

44

∂∂Φ �m � ω � E

�E � ω � E � � ω � � � E

�E � ω � ∂E � � ω �

∂Φ �m � ω � � � E�E � ω � � � Y �m � ω � � � � 0

The resulting system of M equations can be written in matrix form,

E

�� Y1 � 2 Y2Y �1 � � � YMY �1Y1Y �2 �Y2 � 2 � � � YMY �2

......

. . ....

Y1Y �M Y2Y �M � � � �YM � 2��

Φ1

Φ2...

ΦM

�� E

��

SY �1SY �2

...SY �M

��

and the general solution written as:��

Φ1

Φ2...

ΦM

�� E

�� Y1 � 2 Y2Y �1 � � � YMY �1Y1Y �2 �Y2 � 2 � � � YMY �2

......

. . ....

Y1Y �M Y2Y �M � � � �YM � 2�� 1

E

��

SY �1SY �2

...SY �M

�� (5.13)

The discerning reader will note that the matrix in Equation (5.13) is the spatial correlation matrix[81] andcan be written as the outer product of the input signal vector:

E

��

Y1

Y2...

YM

��

Y1 Y2� � � YM �

� �� (5.14)

Measurement of the spatial correlation matrix in Equation (5.14) can be problematic in practice. Toestimate this matrix by averaging different instances of the Y vectors requires at least M instances toachieve a spatial correlation matrix with full rank, since a particular instance of YY � has rank 1. Considerwhat this means for a typical speech signal where it can only be considered stationary for 40ms or so; ifthere are 16 microphones in the array the spatial correlation estimate requires at least 16 independentframes within that 40ms window to form a spatial-correlation matrix of full rank. With a half-overlaphamming analysis window this would imply an individual analysis frame of 4 � 7ms. At 16kHz samplingrate this results in a frequency resolution of approximately 212hz. For larger numbers of microphones thefrequency resolution only gets worse.

5.2.1 Additive Uncorrelated Noise

Once again the simplest model is the additive noise model. In the multi-channel case each received signalhas an independent noise signal:

ym � t � � s � t � � nm � t � �� Ym � ω �� S � ω � � Nm � ω � (5.15)

Assuming initially that not only are signal and noise uncorrelated, but that each nm � t � is uncorrelated witheach nl � t � for m �� l implies the following substitutions

E� �Ym � 2 � � � S � 2 � �Nm � 2 � E

�SY �m � � � S � 2 � E

�YmY �l � � � S � 2 (5.16)

Using the compact overbar notation for expected value, E�X � � X , σ2

l� E

� �Nl � 2 � � �Nl � 2, andincorporating the simplifications from Equation (5.16) into Equation (5.13) yields��

�Φ1

Φ2...

ΦM

� ��

�� S � 2 � σ2

1 � S � 2 � � � � S � 2� S � 2 � S � 2 � σ22

� � � � S � 2...

.... . .

...� S � 2 � S � 2 � � � � S � 2 � σ2M

� ��

� 1 �� S � 2� S � 2

...� S � 2� �� (5.17)

45

The matrix in Equation (5.17) can be written as the sum of a constant matrix and a diagonal matrix ofnoise autocorrelation values.

� S � 2��

1 1 � � � 11 1 � � � 1...

.... . .

...1 1 � � � 1

��

��

σ21 0 � � � 0

0 σ22

� � � 0...

.... . .

...0 0 � � � σ2

M

��

The diagonal matrix is full rank and non-negative so its addition with a non-negative constant matrix isalso full rank. The matrix inverse in Equation (5.17) exists in general and can be formed without long termaveraging of observations of YY � .5.2.2 Direct Solution

The highly structured form of the matrix in Equation (5.17) suggests that there may be a simplified form ofthe solution that does away with the matrix inversion. The form of this simplified solution can bediscerned by writing the inversion in terms of the adjoint matrix (or matrix of cofactors) and thedeterminant. Specifically:

A � 1 � 1det A

Aco f

where for a matrix, A, det A is its determinant and Aco f is the matrix of cofactors. Applying this basicformula for the inverse to Equation (5.17) yields:��

�Φ1

Φ2...

ΦM

�� S � 2� S � 2 ∑M

k � 1

�∏M

m�� k σ2

m � � ∏Mm σ2

m

��

∏Mm

�� 1 σ2m

∏Mm

�� 2 σ2m

...∏M

m�� M σ2

m

�� (5.18)

where ∏Mm

�� k σ2m denotes the product of all σ2

m terms except for the m � k term. It may be clearer to viewEquation (5.18) with the product of the σ2

m’s factored out. Note that

M

∏m

�� k

σ2m

� σ2k

∏Mm σ2

m

so Equation (5.18) can be rewritten as:��

Φ1

Φ2...

ΦM

� �� S � 2 ∏M

m � 1 σ2m� S � 2 ∑M

k � 1

�∏M

m � 1 σ2m

σ2k � � ∏M

m � 1 σ2m

��

1σ2

11

σ22...1

σ2M

��

and the product terms in numerator and denominator cancel out to yield:��

Φ1

Φ2...

ΦM

�� S � 2� S � 2 ∑M

k � 1

�1

σ2k � � 1

��

1σ2

11

σ22...1

σ2M

�� (5.19)

or written more succinctly in terms of the weight for a single microphone (and including the dependenceon ω previously omitted for brevity):

46

Φm � ω �� S � ω � � 2� S � ω � � 2 ∑Mk � 1

�1

σ2k

�ω � � � 1

1σ2

m � ω � (5.20)

In this form it can be clearly seen that each each Φm is the reciprocal of the noise power for that channelwith a common overall weighting that is a function of � S � 2 and σ2

m. In this form it is also more apparentthat the computational complexity of this solution is now O � M � rather than the O � M3 � typically requiredby the matrix inverse2 . Note that for M � 1 this solution is identical to the Wiener filter in Equation (5.6).Also, if the noise power is the same in each channel, σ2

m � ω �� σ2 � ω � , then the resulting Φm � ω � is also thesame for each channel and given by:

Φm � ω � � 1M

� S � ω � � 2� S � ω � � 2 � σ2 � ω �which is precisely a Wiener filter on each individual channel. Since this filter is the same for each channel,by the principles of linear systems it can be applied after the beamforming summation, resulting in abeamformer with Wiener post-filter, as would be expected.

5.2.3 Filtered Signal Plus Additive Independent Noise

Proceeding as above, a slightly more comprehensive signal model is one where each channel is subject toconvolutional distortion as well as independent additive noise:

ym � t � � hm � t � � s � t � � nm � t � �� Ym � ω �� Hm � ω � S � ω � � Nm � ω � (5.21)

In this case the following substitutions apply, where P�Y �

m � l � YmY �l ,

�Ym � 2 � �Hm � 2 � S � 2 � σ2m � SY �m � H �m � S � 2 � P

�Y �

m � l � HmH �l � S � 2 (5.22)

and lead to the solution (from Equation (5.13))

Φ ��H1 � 2 � S � 2 � σ2

1 P�H �

2 � 1 � S � 2 � � � P�H �

M � 1 � S � 2P

�H �

1 � 2 � S � 2 �H2 � 2 � S � 2 � σ22

� � � P�H �

M � 2 � S � 2...

.... . .

...

P�H �

1 � M � S � 2 P�H �

2 � M � S � 2 � � � �HM � 2 � S � 2 � σ2M

� ��

� 1 ��

H �1 � S � 2H �2 � S � 2

...H �M � S � 2

� �� (5.23)

The matrix to be inverted in Equation (5.23) is the sum of a vector cross product and a diagonal matrix ofnoise autocorrelation values.

� S � 2��

H1

H2...

HM

� ��

H1 H2� � � HM � �

��

σ21 0 � � � 0

0 σ22

� � � 0...

.... . .

...0 0 0 σ2

M

� ��

The diagonal matrix is positive and full rank (barring a vanishing noise signal) so the sum is also full rank(barring a vanishing Hm) so the matrix inversion in Equation (5.23) above exists in general. As in theprevious case this expression for the optimal filter can be rewritten in a simplified form that obviates theuse of the generalized matrix inversion in Equation (5.23). The simplified solution is given by:

2A matrix inverse can be computed in a manner that has complexity O�M log2 7 � but at the expense of a very large constant factor[79].

The typical LU decomposition algorithm for matrix inversion is a O�M3 � process.

47

��

Φ1

Φ2...

ΦM

� �� S � 2� S � 2 ∑M

k � 1

� �Hk � 2ΠMm

�� kσ2k � � ΠM

m � 1σ2m

��

H �1 ΠMm

�� 1σ2m

H �2 ΠMm

�� 2σ2m

...H �MΠM

m�� Mσ2

m

� �� (5.24)

This result can be rewritten in a more revealing form by factoring out the product of the noise variances.As above, note that

M

∏m

�� k

σ2m

� σ2k

∏Mm � 1 σ2

m

So Equation (5.25) can be rewritten as:

��

Φ1

Φ2...

ΦM

�� S � 2 ∏M

m � 1 σ2m� S � 2 ∑M

k � 1

� �Hk � 2 ∏Mm � 1 σ2

m

σ2k � � ∏M

m � 1 σ2m

��

H�1

σ21

H�2

σ22...

H2M

σ2M

� ��

and the product terms cancel out resulting in:

��

Φ1

Φ2...

ΦM

� �� S � 2� S � 2 ∑M

k � 1

� �Hk

� 2σ2

k � � 1

��

H�1

σ21

H�2

σ22...

H2M

σ2M

�� (5.25)

or written more succinctly in terms of the weight for a single microphone (and including the dependenceon ω previously omitted for brevity):

Φm � ω �� S � ω � � 2� S � ω � � 2 ∑Mk � 1

�Hk

�ω � � 2

σ2k

�ω � � 1

H �m � ω �σ2

m � ω � (5.26)

As in the previous case, the computational complexity of the solution written in this form is only O � M � asopposed to the O � M3 � for the form including the matrix inversion. Equation (5.25) is very similar to theoptimal-SNR weighting derived in Equation (4.11). Each Φm is the ratio of the conjugated channeltransfer function and the channel noise power, but now also includes an overall weighting at eachfrequency. This is consistent with the optimal-SNR result of Equation (4.11).Note that when Hm � 1 �� m � 1 � � � M Equation (5.25) is identical to the solution for the model withoutsignal filtering n Equation (5.19). Also, for the case where M � 1 Equation (5.25) becomes

Φ � H � � S � 2�H � 2 � S � 2 � σ2

which is the single channel Wiener filter.A reasonable question to ask is if the solution in Equation (5.26) is equivalent to an optimal-SNRweighting as in Equation (4.11) followed by a Wiener post filter. The optimal-SNR weighting withnormalization is given by

Φosnr�0 �

m � ω �� H�m

�ω �

σ2m

�ω �

∑Mk � 1

��H�k

�ω �

σ2k

�ω �

��

(5.27)

48

where the denominator is designed to normalize the gain of the array so that

M

∑k � 1

�Φosnrm � ω � � � 1

. Adding a Wiener weighting on top of this weighting adds in a factor of�S

�ω � � 2

E� �

Y�ω � � 2 � to the weighting,

where Y � ω � is the output of the optimal-SNR weighted beamformer. Incorporating this factor into

Φosnr�0 �

m � ω � from Equation (5.27) yields a new weighting:

φosnr�1 �

m � ω �� H�m

�ω �

σ2m

�ω �

∑Mk � 1

��H�k

�ω �

σ2k

�ω �

��

� S � ω � � 2E

� �� ∑M

k � 1 Φosnr�0 �

k � ω � Xk � ω ��

2 � (5.28)

Using the independent noise model from above to simplify the expected value in Equation (5.28) (anddropping the � ω � notation for brevity’s sake) yields:

E

��

M

∑k � 1

Φosnr�0 �

k Xk

��

2� � � E

��

M

∑k � 1

Φosnr�0 �

k HkS � M

∑k � 1

Φosnr�0 �

k Nk

��

2� �

� E� � S � 2 � �

��

M

∑k � 1

Φosnr�0 �

k Hk

��

2 � M

∑k � 1

�� Φosnr

�0 �

k

��

2σ2

k

Comparing this to Equation (5.26), the H�m

�ω �

σ2m

and � S � ω � � 2 terms (both in numerator and denominator) arein common. What remains are the normalization terms in the denominators:

��

M

∑k � 1

Φosnr�0 �

k � ω � Hk � ω ��

2 � M

∑k � 1

�� Φosnr

�0 �

k � ω ��

2σ2

k � ω � ?� M

∑k � 1

�Hk � ω � � 2σ2

k � ω � � 1

This can be re-expanded in terms of Hm � ω � and σ2m � ω � in search of an equivalence:

��

M

∑k � 1

�Hk

�ω � � 2

σ2k

�ω �

∑Ml � 1

��H�l

�ω �

σ2l

�ω �

��

��

� M

∑k � 1

��

H�k

�ω �

σ2k

�ω �

∑Ml � 1

��H�l

�ω �

σ2l

�ω �

��

��

2

σ2k � ω � ?� M

∑k � 1

�Hk � ω � � 2σ2

k � ω � � 1

��

1

∑Ml � 1

��H�l

�ω �

σ2l

�ω �

��

��

2 ��

M

∑k � 1

�Hk � ω � � 2σ2

k � ω �

��

2 � M

∑k � 1

��

H �k � ω �σ2

k � ω ��

2 �� M

∑k � 1

�Hk � ω � � 2σ2

k � ω � � 1

In this form it is clear (or at least more clear) that the two weightings are not equivalent; there arecross-terms introduced on the left side that will not be cancelled on the right side. The relative weightingbetween the channels is the same as for the optimal-SNR weighting, but the overall weighting at eachfrequency is not.

5.2.4 Filtered Signal Plus Semi-Independent Noise Model

Changing the signal model one more time to one where the signal undergoes convolutional distortion, asabove, but in this case the corrupting noise is not independent from channel to channel. That is to say,

P�N �

l � m �� 0. The assumption that the signal and noise are uncorrelated is still in effect. This signal modelleads to the following substitutions:

49

�Ym � 2 � �Hm � 2 � S � 2 � σ2m � SY �m � H �m � S � 2 � P

�Y �

m � l � P�H �

m � l � S � 2 � P�N �

m � lApplying these substitutions to Equation (5.13) yields

Φ ��

�H1 � 2 � S � 2 � σ21 P

�H �

2 � 1 � S � 2 � P�N �

2 � 1 � � � P�H �

M � 1 � S � 2 � P�N �

M � 1P

�H �

1 � 2 � S � 2 � P�N �

1 � 2 �H2 � 2 � S � 2 � σ22

� � � P�H �

M � 2 � S � 2 � P�N �

M � 2...

.... . .

...

P�H �

1 � M � S � 2 � P�N �

1 � M P�H �

2 � M � S � 2 � P�N �

2 � M � � � �HM � 2 � S � 2 � σ2M

� ��

� 1 ��

H �1 � S � 2H �2 � S � 2

...H �M � S � 2

�� (5.29)

On the face of it this matrix might appear singular, but because it is the expected value of the noisecovariance that is added it is generally not singular. That is, the matrix in Equation (5.29) can be written asthe sum of two cross products:

� S � 2��

H1

H2...

HM

��

H1 H2� � � HM � � E

��

N1

N2...

NM

��

N1 N2� � � NM �

� ��

Noting that the second cross product is the expected value of the noise cross-correlation. This is aHermitian matrix and except under degenerate values of noise correlation it will be full rank, and thereforethe matrix inverse in Equation (5.29) will exist. In Equations (5.17) and (5.23) the noise in each channelwas assumed to be independent of the noise in any other channel simplifying this cross-correlation matrixto a diagonal matrix of noise autocorrelation values. Effective estimation of the noise correlation matrixthrough the averaging of multiple observations may be made if the noise is slowly varying. This is incontrast to the spatial correlation matrix in Equation (5.14) which contains an estimate of the rapidlyvarying speech signal.Unlike the previous cases, the matrix in Equation (5.29) does not lend itself to a simplified inverseoperation. Also Equation (5.29) requires the estimation of the complete noise cross-correlation matrixrather than just the noise autocorrelation terms used in Equations (5.17) and (5.23).

5.3 A Non-Optimal Filter and Sum Framework

An alternative way to incorporate Wiener filtering into a beamformer is to simply apply the Wiener filtersbefore sum of the beamformer. The DSBF output can be used to provide the clean signal estimate since ithas reduced noise compared to the individual channels. The advantage of this method is that the DSBFitself can provide the signal estimate and the matrix inversion of the previous section can be avoidedaltogether. Also, since the Wiener filtering occurs before the sum it is possible that the artifacts from theWiener filtering in each channel will tend to cancel in the beamformer output, resulting in a lower level offiltering artifacts in the final output. This process is diagrammed in Figure 5.1. After delay steering, thechannels are averaged together forming the basic DSBF output. This signal is then used as a signalestimate to design a Wiener filter for each individual channel. Because the DSBF output is used as thesignal reference, the individual channel filters will implement the same noise-canceling andsignal-reinforcing behavior that the DSBF accomplishes. That is to say, if the beamformer provides goodnoise attenuation at some frequency, that frequency will be weighted more strongly by the channel Wienerfilters and conversely where the beamformer does not provide noise attenuation the channel Wiener filterswill attenuate at that frequency. Since this filtering is done on each channel before the beamformer sum,the two processes (Wiener filtering and beamforming) are additive.In general the process can be iterative, reusing the filter-and-sum beamformer output at iteration k, S

�k � , as

the new signal reference to generate the channel filters for iteration k � 1, Φ�k � 1 � . Explicitly:

50

jωτe M

Y1SY’

e YY’1

MY

1

S(0)

τ

M Σ

1

ω

1

MΣ��

��

1

1j

�� SY’

YY’M

MM

Delay Sum SumFilter

S(1)

Figure 5.1: Flow diagram for a Wiener filter-and-sum beamformer using the delay-and-sum beamformeroutput as the signal estimate for the Wiener filters.

Φ�0 �

m � ω � � 1

S�k � � ω � � 1

M

M

∑m � 1

Φ�k �

m � ω � Ym � ω �

Φ�k � 1 �

m � ω � � S�k � � ω � Y

�

m � ω ��Ym � ω � � 2 (5.30)

This is illustrated in an idealized example. Consider an M channel array. The noise in each channel isGaussian, uncorrelated between the channels and of equal power in each channel. Let the signal of interest

be a sine wave of frequency ω0 at a nominal power, E� � S � ω0 � � 2 � � ψ2. The noise spectrum has a

constant power, E� �N � ω � � 2 � � σ2. The power spectrum of a single channel is then:

E� �Y � ω � � 2 � � �

ψ2 � σ2 � ω � ω0

σ2 � ω �� ω0

The noise power in the initial DSBF output is reduced by a factor of 1M :

E� � S �

0 � � ω � � 2 � � �ψ2 � σ2

M� ω � ω0

σ2

M� ω �� ω0

Now forming the ratio in Equation (5.30) to generate a Wiener filter for each channel results in a filter withthe following transfer function:

Φ�1 �

m � ω �� ψ2 � σ2

Mψ2 � σ2 � ω � ω0

1M� ω �� ω0

(5.31)

In the Wiener filter the factor of 1M reappears but now in magnitude rather than power, effectively doubling

(in dB) the noise suppression achieved by the beamformer. The gain at the signal frequency is not unity,but will approach 1 for ψ2 � σ2 and as M increases it approaches the minimum mean squared error

optimal gain of ψ2

ψ2 � σ2 . Figure 5.2 shows the value of this term for varying number of microphones and

signal-to-noise ratio. In this simple example the value of Φ�1 �

m � ω � is directly related to the SNR of channelm. In a more realistic situation the attenuation of the noise by the beamformer will not be so reliable -coherent noise may sum constructively at some frequencies and destructively at others - and this direct

mapping of channel SNR to Φ�1 �

m � ω � will not hold. Applying Φ�1 �

m � ω � to each channel3 and beamforming(averaging) to generate S

�1 � � ω � results in

3Since each channel has identical statistics and therefore identical Φm in this example, it is mathematically equivalent to apply thefilter on the beamformer output.

51

0 5 10 15 20 25 30−6

−5

−4

−3

−2

−1

0

SNR in dBF

ilter

res

pons

e in

dB

2 3 4 8 256

Figure 5.2: The attenuation of Φ�1 �

m � ω � from Equation (5.31) at the signal frequency for different values of

SNR, 10log10ψ2

σ2 , and number of microphones in the beamformer, M.

0 5 10 15 20 25 30−40

−35

−30

−25

−20

−15

−10

−5

0

SNR in dB

Filt

er r

espo

nse

in d

B

137

Figure 5.3: The attenuation of Φ�1 �

m � ω � as a function of input SNR raised to different powers corresponding

to Φ�2 �

m � ω � and Φ�3 �

m � ω � . The number of channels is fixed at 16.

E� � S �

1 � � ω � � 2 � ��

ψ2 � σ2

M ��

ψ2

σ2� 1

M

ψ2

σ2� 1 � 2 � ω � ω0

σ2

M3 � ω �� ω0

where the gain at ω � ω0 has been rewritten to more clearly separate the influences of the signal-to-noiseratio and the number of microphones. Note the 1

M3 reduction in noise power. This is the cube of thereduction in noise power achieved by the DSBF.

Repeating the process to generate Φ�2 �

m � ω � yields

Φ�2 �

m � ω � ��

Φ�1 �

m � ω0 � � 2 ψ2 � σ2M

ψ2 � σ2� ω � ω0�

Φ�1 �

m � ω �� ω0 � � 21M� ω �� ω0

which shows that subsequent iterations of Φ�k �

m are simply powers of Φ�1 �

m . In particular,

Φ�k � 1 �

m � �Φ

�k �

m � 2Φ

�1 �

m . Figure 5.3 shows the value of Φ�1 �

m � ω � raised to the 3rd and 7th powers,

corresponding to Φ�2 � 3 �

m � ω � . Note how the attenuation falls off steeply; signals at different frequencies willbe attenuated to a degree that is greatly sensitive to the SNR at that frequency, potentially resulting in

undesirable signal coloration if a higher power of Φ�1 �

m � ω � is employed. This is illustrated in Figure 5.3.One way to avoid this sort of signal distortion while increasing the noise-suppression of the filter is bymapping the filter response non-uniformly. Compressing Φm � ω � in the neighborhood of 0dB while

52

0 5 10 15 20 25 30−35

−30

−25

−20

−15

−10

−5

0

SNR in dBF

ilter

res

pons

e in

dB

Figure 5.4: Ad hoc methods of warping the filter gains to create a flatter response at moderately high SNR

while preserving a strong attenuation at low SNR.�

denotes the curve for Φ�1 �

m and�

denotes the result

after a warping by Equation (5.32). For�

the attenuation was held at 1 for values of Φ�1 �

m greater than -2dB

and set to�Φ

�1 �

m � 5for values below that.

maintaining a strong attenuation in low SNR regions. For instance, any gain above some threshold could

be set to unity while leaving gains below the threshold alone. Another strategy would be to raise Φ�1 �

m � ω �to a variable power based on its value. A possible (absolutely ad-hoc) warping which maintains a longerflat region and faster dropoff below some threshold is

Φ� � �m � ω ��

Φ�1 �

m � ω � ��20log10

�Φ

�1 �

m�ω � � �

(5.32)

The effect of this ad hoc warping is shown in Figure 5.4. Both warped curves are flatter than Φ�1 �

m above9dB SNR and then drop off sharply at lower SNRs.

5.4 Summary

The derivation of the single channel Wiener filter was presented and extended to a multi-input MMSEsolution, multi channel Wiener (MCW). The form of this multi-channel Wiener filter was simplified for thecases of additive noise, and convolution plus additive noise signal scenarios, resulting in solutions of lowcomputational complexity. The MCW method was shown to incorporate the optimal-SNRinter-microphone weighting derived in Chapter 4 and the overall frequency weighting of the MCWalgorithm was shown to be different from that of the Wiener post-filter (WSF) algorithm. Anothernon-optimal but intuitively appealing application of Wiener filters to microphone arrays as pre-filters wasdescribed and its behavior as an iterative process explored. These methods, along with a reference Wienerpost-filter (WSF), will be implemented in Chapter 7.All the Wiener algorithms presented in this chapter require a noise-free or at least noise-reduced estimateof the signal spectrum. Chapter 6 addresses the spectrum estimation problem in the context of microphonearrays.

CHAPTER 6:SIGNAL SPECTRUM ESTIMATION

The Wiener filter requires knowledge of the power spectrum of the desired signal (see Equation (5.6)). Insome communications applications the statistics of the desired signal may be reasonably wellapproximated by an a priori distribution, but when the signal of interest is speech, ongoing signalmeasurements are required to estimate the rapidly changing signal power spectrum. In this chapter somemethods of spectrum estimation will be investigated. The cross-spectrum signal estimation method whichis often used in microphone-array systems[39, 38, 41] will be shown to be a special case of the ubiquitousnoise-spectrum subtraction methods[33]. A novel spectral estimate that combines the cross-spectrum andminimum-noise subtraction methods will be developed with some investigation of parameter optimizationfor the method.

6.1 Spectral-Subtraction

In the classical single-channel spectral-subtraction case[33] the signal spectrum is commonly estimated bymeasuring the noise power during silence regions and estimating the signal power with:

� S � ω � � 2 � � S � ω � � N � ω � � 2 � E� �N � ω � � 2 � (6.1)

To the extent that the noise is stationary and uncorrelated with the signal this is a good estimate of thesignal power, though care must be taken to avoid over-estimating the instantaneous noise power andinserting negative values into the signal power-spectrum estimate[33].One way to estimate the noise power[82, 34, 83] is to use the minimum observed value of the powerspectrum over some interval:

� Nk � ω � � � min � �Y�k � N � � � k � � ω � � 2 � (6.2)

where the noise power spectrum estimate for analysis frame k and frequency ω, � Nk � ω � � 2, is the minimumvalue taken from the last N analysis frames of the noise corrupted signal power spectrum, �Y �

k � N � � � k � � ω � � 2.The advantages of this approach are

� Explicit speech/no-speech decision is not necessary.

� Over estimation of the noise power is less likely.

� The noise estimate will adapt to changing noise.

� Very simple to implement.

Implementations of this technique typically use a smoothed version of the power spectrum. Theimplementation used herein weights past analysis frames with an exponentially decaying weight factor:

� Yk � ω � � 2 � � 1 � α � �Yk � ω � � 2 � α �Yk � 1 � ω � � 2 (6.3)

where � Yk � ω � � 2 , the smoothed spectrum estimate for frame k, is formed by weighting the raw estimate forthe current frame, �Yk � ω � � 2 by � 1 � α � and the estimate for the previous frame, �Yk � 1 � ω � � 2, by α. The noise

53

54

2040

6080

00.2

0.40.6

0.8

0.09

0.1

0.11

0.12

0.13

Framesalpha

BS

D

(a) BSD

2040

6080

00.2

0.40.6

0.8

3.5

4

4.5

Framesalpha

SS

NR

(b) SSNR

2040

6080

00.2

0.40.6

0.8

22

24

26

28

30

32

Framesalpha

SN

R

(c) peak SNR

Figure 6.1: Average BSD, SSNR and peak SNR for minimum spectral subtraction scheme as describedin Equation (6.3) as a function of the averaging constant, α, and the number of past analysis frames fromwhich the minimum is taken, N.

estimate for frame k is then formed by taking the minimum value of � Yk � ω � � 2 for k � k � N � � � k.Unfortunately, because the processing is done on an utterance-by-utterance basis only 3 or 4 seconds ofinput are processed at a given time. This leads to a distinct “start-up” phenomenon at the beginning ofeach utterance whereby the noise estimate at the beginning of the utterance is significantly lower than theestimate towards the end of the utterance. To ameliorate this problem the minimum choosing process isdone forward and backwards and the two results averaged together. To determine appropriate values of theweighting factor α and the history length N various values were used to do spectral subtraction on a set of4 noisy utterances from the noisy database and 4 utterances from the quiet database. The data for eachutterance was processed with a delay-and-sum beamformer using 1 to 16 microphones, so a total of16*8=128 signals with various noise characteristics were processed. For each test instance the BSD,SSNR and peak SNR were measured from the resulting power spectra. Figure 6.1 shows the average BSD,SSNR, and peak SNR as a function of α and N for the quiet and noisy databases1 using a 512 pointHamming window was used and a 1024 point zero padded FFT2. A reasonable choice of parametersfalling near the optimal areas of both BSD and SSNR while staying as high on the SNR curve as possibleis α � 0 � 6 and N � 80 (1.28 seconds)3. Note that the peak SNR is a monotonically increasing function ofα. As the time constant of the averaging function increases the magnitude of the noise estimate will tendto increase as more speech energy is incorporated into the noise estimate. The increase in peak SNR is aby-product of the increase in the magnitude of the noise estimate and doesn’t account for the “quality” ofthe noise estimate.

6.2 Cross-Power

When multiple input channels are available the cross-power spectra of the channels can be used to formthe signal power estimate, � S � ω � � 2, in a way that takes advantage of the correlated nature of the signal andthe (hopefully) uncorrelated nature of the interference. Using the signal plus independent noise modelfrom Equation (5.15) the expected value of the cross-spectrum of 2 channels is (from Equation (5.16))

1For this optimization data from the training set rather than the test set was used.

2The analysis length and FFT size were chosen to correspond with the parameters of the BSD measure (See Section 2.2.3) becausethe BSD is being measured directly from the power spectra estimated by the spectral subtraction process.

3It is expected that the best values for these parameters will vary with the test conditions; noise levels, channel variations etc. Thevalues derived here are entirely specific to the database used.

55

E�Ym � ω � Y �l � ω � � � � S � ω � � 2 (6.4)

since the noise is assumed to be uncorrelated the expected value of its cross-correlation is 0. A pair ofmicrophones, m and l, can be used to form an estimate of the signal power by taking the real portion oftheir cross-spectrum:

Pml � ω � � re � Ym � ω � Y �l � ω � � (6.5)

In general there are M microphones available so there are�M2� independent estimates of � S � ω � � 2 that can be

formed from Equation (6.5). Taking the average of these individual estimates and the applying a half-waverectification yields an estimate of the signal power incorporating information from all the microphones:

� S � ω � � 2 � max

�0 � 1� M

2� M � 1

∑m � 1

M

∑l � m � 1

Pml � ω � � (6.6)

This is essentially the development done by Zelinski in[38]. The signal power spectrum was estimated byaveraging together the cross-power spectra of all possible microphone combinations in a 4-microphonearray. This estimate of the signal spectrum was then used in a Wiener filter applied to the output of thebeamformer as in Equation (5.9). This formulation of the spectral estimate has been used elsewhereincluding[40][30][84]. In [84] taking the real portion of the mean cross-power spectrum is eschewed infavor the magnitude. The rationale behind doing this is to design the derived Wiener filter to attenuate thespatially uncorrelated noise while ignoring coherent noise[84] (or rather including it equally in thenumerator and denominator of the Wiener filter transfer function) and thereby leaving the attenuation ofcoherent noise to the spatial selectivity of the beamformer.

6.2.1 Computational Considerations

Although Equation (6.6) is written as the mean of� M

2� cross-spectrum computations it should be noted that

this value can be measured without forming the cross-spectra individually. The power-spectrum of thedelay-and-sum beamformer output contains all the cross-spectra from Equation (6.6) as well as theauto-spectrum of each microphone:

�YB � 2 � ��

1M

M

∑m � 1

Ym

��

2 � �1M

M

∑m � 1

Ym � �1M

M

∑m � 1

Y �m �� 1

M2

M

∑m � 1

�Ym � 2 � 1M2

M � 1

∑l � 1

M

∑m � l � 1

YlY �m � 1M2

M � 1

∑l � 1

M

∑m � l � 1

Y �l Ym

� 1M2

M

∑m � 1

�Ym � 2 � 2M2

M � 1

∑l � 1

M

∑m � l � 1

real � YlY �m � (6.7)

where YB is the delay-and-sum beamformer output and Ym is the mth channel (expressed in the frequencydomain). The estimate in Equation (6.6) can be formed by computing and subtracting out theauto-spectrum terms from Equation (6.7) and scaling appropriately. This entails the computation of onlyM � 1 power-spectrum estimates rather than

� M2� cross-spectrum estimates4. Specifically, given YB as

expressed in Equation (6.7) above, the spectral estimate in Equation (6.6) can be realized:

4This relationship between the power spectrum of the beamformer output and the desired cross-spectrum estimate is also pointedout in[30].

56

1� M2� M � 1

∑m � 1

M

∑l � m � 1

re � YmY �l � �� YB � 2 � 1

M2 ∑Mm � 1 �Ym � 2 � M2

2� M2�

� MM � 1

� �YB � 2 � 1M2

M

∑m � 1

�Ym � 2 � (6.8)

This economy of computation is only available if the function chosen to combine the cross-spectrumestimates is the mean and the function chosen to project the complex-valued cross-spectra is the realfunction. If the absolute value of the cross-spectral estimates is used[84][85][39] this breakdown of thebeamformer power-spectrum does not apply. Likewise, if some function other than the mean is used tocombine the individual cross-spectral estimates (e.g. the median) this simplification also does not apply.

6.3 Combining Spectral-Subtraction and Cross-Power

In light of Equation (6.8) it is apparent that the cross-power spectral estimate is a special case of spectralsubtraction (followed by a scaling factor). In this case the noise power estimate to be subtracted is a scaledaverage of the individual channel power spectra as opposed to an average or minimum statistic of thepower spectrum of the beamformer output. The two different strategies for forming noise estimates can beused simultaneously. The rationale behind doing this is that the cross-power estimate lumps anything thatis correlated between channels into the signal estimate. Because of this, when the noise combinescoherently the noise will tend to be underestimated. The spectral subtraction method on the other handuses long term statistics of the beamformer output power spectrum to estimate the noise bias; coherentnoise will show up in the beamformer output and in the corresponding noise estimate. Because thetendency is for the cross-power estimate of the noise to be too small the appropriate combination is to usethe larger of the two noise estimates at any given time:

� Ncp � 2 � 1M2

M

∑m � 1

�Ym � 2� Nss � 2 � min � � Y�

k � N � � � k � � 2 �� S � 2 � �YB � 2 � max � � Ncp � 2 � � Nss � 2 � (6.9)

where � Ncp � 2 is the noise estimate from the cross-power method of Equation (6.8) and � Nss � is the noiseestimate from the spectral-subtraction estimate of Equation (6.2). The signal estimate, � S � 2 is formed bysubtracting the larger of the two noise estimates from the beamformer power spectrum, �YB � 2.

6.4 Comparison of Signal Estimate Methods

The recordings described in Section 3.1.1 were used to evaluate the performance of some of the variationsof the signal spectrum estimation methods described above. The 4 signal power estimates evaluated are:

1. power spectrum of the beamformer output

2. spectral-subtraction method

3. cross-power method

4. combination cross-power and spectral-subtraction method

The signal power spectrum estimate types enumerated above were then measured against power-spectragenerated from the close-talking microphone reference recordings. Peak SNR (SNR), segmental SNR

57

BF Spec. Sub. Cross Power Combo0.05

0.055

0.06

0.065

0.07

0.075

BS

D

(a) BSD

BF Spec. Sub. Cross Power Combo

5

5.2

5.4

5.6

5.8

6

6.2

SS

NR

(b) SSNR

BF Spec. Sub. Cross Power Combo24

26

28

30

32

34

36

38

Peak

SN

R

(c) peak SNR

Figure 6.2: Average BSD, peak SNR and SSNR for the different spectral estimation methods for the quietdatabase using 8 microphones (

�) and 16 microphones ( � ). The values were averaged across all 438

utterances in the test set. See Figure 3.7 for scale comparisons.

BF Spec. Sub. Cross Power Combo

0.08

0.09

0.1

0.11

0.12

BS

D

(a) BSD


3.5

4

4.5

SS

NR

(b) SSNR


12

14

16

18

20

22

Peak

SN

R

(c) peak SNR

Figure 6.3: Average BSD, peak SNR and SSNR for the different spectral estimation methods for the noisydatabase using 8 microphones (

�) and 16 microphones ( � ). The values were averaged across all 438

utterances in the test set. See Figure 3.13 for scale comparisons.

(SSNR) and Bark spectral distortion (BSD) values are computed and averaged over the 438 test set talkerutterances. For all estimation techniques the data segmentation was done with a 512 point (32ms)Hamming window with a half-window overlap and a 1024 point FFT used.The most glaring feature of the quiet-database results in Figure 6.2 is that the Bark distortion (BSD) andsegmental SNR (SSNR) are worse after any sort of processing of the beamformer spectrum. UsingFigures 3.7 and 3.13 for comparison the degradation in the SSNR and BSD measurements of Figure 6.2 ismarginal. The total increase in BSD shown in Figure 6.2 is approximately 5% of the difference betweenthe values measured for the 1 and 16 microphone beamformers in Figure 3.7(a). For the noisy data inFigure 6.3 the BSD improves (decreases) slightly with all post-processing methods. The magnitude of theimprovement in Figure 6.3(a) is approximately 8 times greater than the decline in Figure 6.2(a). TheSSNR deteriorates with all post-processing methods for both data sets but the decrease for the noisy dataset is about half as much as with the quiet data. In both cases the change in SSNR is approximately anorder of magnitude smaller than the total range shown in Figure 3.7(b). In contrast, the peak SNR isimproved significantly by all 3 post-processing methods for both quiet and noisy data, and though thecross-power method has a worse peak SNR than the spectral-subtraction method on the quiet data, thecombined cross-power/spectral-subtraction method displays the best peak SNR in both noisy and quietcases. Also, unlike the marginal decline or improvement in BSD and SSNR, the magnitude of theimprovement in peak SNR in Figure 6.2(c) is comparable to the improvement achieved by the 16

58

microphone beamformer shown in Figure 3.7(c).Since BSD and SSNR are measured only during active speech segments this suggests that the minimumspectral subtraction method does a good job of attenuating noise during silence passages, but is lesseffective at reducing distortion during segments of speech. For the quiet database this tradeoff betweenreducing the noise and distorting the speech results in a slight overall performance degradation preciselybecause the noise is minimal; the distortion introduced by the processing is on the same order as thedistortion already present in the signal. The combination algorithm incorporates the improvement in signaldistortion of the cross-power method and the improvement in peak SNR of the spectral subtraction method.

6.5 Summary

In this chapter methods for signal spectrum estimation were described. The cross-spectrum method wasshown to be a special case of spectral subtraction. An algorithm combining a minimum noise estimate andthe cross-spectrum noise estimate was motivated and described. A preliminary comparison of the differentestimate methods using signal distortion measures was presented supporting the use of the combinationestimate. In Chapter 7 optimal filtering strategies will be implemented using the spectrum estimationmethods described here.

CHAPTER 7:IMPLEMENTATIONS AND ANALYSIS

In this chapter variations on the filtering strategies described in Chapters 5 and 4 will be implemented andevaluated. In particular, implementations of the optimal-SNR filter-and sum strategy (OSNR) fromEquation (4.12), the Wiener sum-and-filter (WSF) from Equation (5.9), Wiener filter-and-sum (WFS) fromEquation (5.30) and Multi-Channel Wiener (MCW) from Equation (5.26) beamformers will described, andthe distortion measures and speech-recognition performance results for 8 and 16 microphone versionspresented. For the Wiener techniques, the different methods of estimating the signal spectrum described inChapter 6 will be used and compared.

7.1 Optimal-SNR Filter-and-Sum

Figure 7.1 shows the basic structure of the optimal-SNR weighted beamformer (OSNR). Each channelwas filtered with a magnitude weighting derived as in Equation (4.12).

� A 512 point (32ms at 16kHz) rectangular window with a half window shift was used for spectralanalysis.

� To preserve a linear convolution in the frequency domain a 1024 point FFT was used.

� The noise power in each channel, σm � ω � , was estimated using the minimum statistic methoddescribed in Section 6.1.

� To estimate the channel transfer functions, Hm � ω � , the signal power (after subtracting the noiseestimate) was averaged over the utterance. These long-term power estimates were then divided bythe power estimate of the first channel to provide an estimate of the channel transfer function. The

Y

M

1

je

Yω

ω

τ

1j

M

τ1

2

H

e

1

M Σ(0)

S

N1

Delay

HM

NM

2

Filter Sum

Figure 7.1: The structure of the DSBF with optimal-SNR based channel filtering (OSNR). Signal and noisestatistics are generated for each channel. The resulting filter weights are normalized across the array toyield a flat total frequency response for the array.

59

60

Database � Quiet NoisyModel � Baseline MAP Baseline MAP# mics � 8 16 8 16 8 16 8 16

DSBF 24.28 19.44 14.03 11.92 60.44 47.59 32.49 23.48

OSNR 23.93 19.01 14.08 11.52 53.92 40.25 27.13 18.81

Table 7.1: Recognition performance for the OSNR beamformer expressed in % words in error. The lowestword error rate in each column is highlighted in boldface.

result is a constant magnitude transfer function estimate for each channel. For instance:

�Hm � ω � � ��

� � �Ym � ω � � 2 � σ2m � ω � �� Y1 � ω � � 2 � σ21 � ω � �

�� 1

2

� To guard against divide by zero or multiply by zero situations the gain at each frequency wasconstrained to lie within -40dB to 3dB by hard limiting at both ends of the range.

� To reduce artifacts in the reconstructed signal the derived filters were truncated to 512 points in thetime domain to preserve the linear convolution property. This corresponds to a smoothing of thefilter in the frequency domain.

� The filter weights were normalized across the array at each frequency so that the total arrayfrequency response was flat.

� The filters were applied to the beamformer output in the frequency domain and reconstructed in thetime domain with an overlap-add technique. A Hanning window was used to taper the overlappingsegments together during reconstruction.

� After filtering the individual channels they were summed to form the final output.

The microphone-array databases (quiet and noise-added) described in Chapter 3 were processed with theOSNR algorithm as described above, using 8 and 16 microphones. The results are presented the sectionsthat follow.

7.1.1 Subjective Observations

The output of the OSNR beamformer sounds uncolored1 and free of the sort of warbling artifacts that aretypical of noise-suppression filters (including those in ensuing sections). For the quiet database the noise,although not noticeably attenuated, sounds slightly whiter; periodicities in the background noise are subtlybut audibly reduced2. For the noisy database the character of the background noise is greatly changed. Thebands of noise visible in Figure 3.11 are now attenuated and the whistling quality of the noise is gone.Overall the background noise sounds much whiter than with the unweighted beamformer. Figure 7.2shows a comparison of noisy optimal-SNR weighted and unweighted DSBF spectrograms.

7.1.2 Objective Performance

Table 7.1 shows the recognition performance when using the optimal-SNR weighting filter-and-sumtechnique. The least difference is seen for the 8 microphone beamformer and the quiet database where the

1The uncolored nature of the beamformer output is a given since the overall frequency response is constrained to be uniform.

2This different quality to the noise is extremely subtle and probably would not be noticed in casual listening or listening in lessthan optimal environments, but the beamformers can be reliably distinguished in a blind test.

61

DSBF Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

OSNR Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

Figure 7.2: Narrowband spectrogram of a noisy utterance processed with OSNR. The top figure is from theunweighted 16 channel beamformer and the bottom figure is from the optimal-SNR weighted 16 channelbeamformer. The talker is female and the text being spoken is the alpha-digit string ’BUCKLE83566’. Theoverall reduction in background noise is apparent, especially the bands around 4500Hz and 6500Hz.

Database � Quiet Noisy Quiet Noisy# mics � 8 16 8 16 8 16 8 16measure � FD BSD

DSBF 0.60 0.51 1.17 0.98 .066 .054 .124 .096OSNR 0.59 0.51 1.06 0.85 .065 .053 .106 .083

measure � SSNR peak SNR

DSBF 5.38 6.17 3.25 4.06 24.53 27.86 10.86 14.90OSNR 5.48 6.26 3.69 4.41 25.21 28.90 14.09 18.33

Table 7.2: Summary of the measured average FD, BSD, SSNR and peak SNR values for the OSNR beam-former.

measured performance actually decreases. The .05% increase in error rate corresponds to 2 added errors3.All other results for the quiet database are nearly an order of magnitude greater and are improvements inperformance. Admittedly these are very small improvements in terms of the number of word errorsinvolved, but note that the difference between the DSBF (11.92%) and the close-talking microphone(8.16%) performance is only 3.76%. A .5% change in performance is 13% of that margin. The .4%improvement for the 16 microphone MAP-HMM case is slightly more than 10% that performance gap.With the noisy data, the performance is significantly improved relative to the unweighted DSBF. In the 16microphone MAP-HMM case the 4.5% decrease in error rate brings the beamformer performance 30%closer to the 8.16% error rate of the close-talking microphone. In the noisy case the SNR varies enoughacross the array for the weighting to provide significant gain. In the quiet case the noise is very similar ineach channel and little can be gained by weighting the microphones nonuniformly.Table 7.2 shows the distortion measurements for the optimal-SNR weighted beamformer. All themeasurements show some improvement over the unweighted DSBF, but the most notable improvement isin peak SNR which shows only about 4dB improvement for both 8 and 16 microphone arrays using the

3There are 4497 total test words, so each error contributes 100 � 14497

�� 022% to the error rate.

62

noisy data. This improvement in peak SNR may seem small compared to the improvement shown by othertechniques presented herein; unlike the Wiener filtering strategies in the following sections, the OSNRbeamformer achieves this improvement in SNR while maintaining a flat overall frequency response. Thatis, the peak SNR values from the OSNR beamformer are not inflated by arbitrarily high noise suppressionduring silence passages. The overall array response is uniform during silence as well as during speech.

63

Y1

S(0)

j τω

jωτ1

MYe

Σ S(0)

M

��

��

e

M

1 NR

Sum N.R.

NoiseReduc.

Delay

Figure 7.3: The structure of the DSBF with Wiener post-filtering or Wiener sum-and-filter (WSF). Notethat the channels may feed forward into the noise-reduction step as they may be necessary to generate thepost-filter.

7.2 Wiener Sum-and-Filter

A delay-and-sum beamformer with Wiener post-filter (Herein this will be termed “Wiener sum-and-filter”or WSF) in the manner of [39] was implemented as follows:

� A 512-point (32ms at 16kHz) rectangular window with a half window shift was used for spectralanalysis.

� To preserve a linear convolution in the frequency domain a 1024 point FFT was used.

� 3 different methods were used to form the signal spectral estimate, or rather 3 different methodswere used to estimate the noise power spectrum to be subtracted from the beamformer powerspectrum:

1. The cross-spectral power signal estimate from (6.8) was used. This is consistent with theimplementation generally found in the literature[39]. In the results this is denoted by WSFcor.

2. The minimum statistic noise power estimate described in Sections 6.1 and 6.4. This is denotedbelow by WSFmin.

3. The combination noise power estimate described in Sections 6.3 and 6.4. This is denotedbelow by WSFcom.

� The spectral densities used in the formulation of the Wiener filter from (5.6) were smoothed in timewith an exponential weighting factor of 0.4. That is,

�Sk � ω � � 2

11 4 � � Sk � ω � � 2 �� 4 � Sk � 1 � ω � � 2 � . This value was chosen to correspond with the smoothing reported in[39] and is intended to strike a balance between a low variance estimate of the spectral densitieswhile still accommodating rapid variation in the speech spectrum.

� To reduce artifacts in the reconstructed signal, the post filter is truncated to 512 points in the timedomain to preserve the linear convolution property. This corresponds to a smoothing of the filter inthe frequency domain.


The microphone-array databases (quiet and noise-added) described in Chapter 3 were processed with theWSF algorithm as described above, using 8 and 16 microphones. The results are presented the sectionsthat follow.

64

Her

tzsec.

0 0.5 1 1.5 2 2.5 3 3.50

1000

2000

3000

4000

5000

6000

7000

8000

Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

cor

Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

min Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

com

Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

WSF WSFosnr

Figure 7.4: Narrowband spectrograms for the WSF beamformer. The bottom 3 rows correspond to thedifferent ways of forming the signal power spectrum estimate. The left hand column images are generatedfrom data with no extra pre-filtering and the right hand column images are from data that was processedwith the OSNR filtering prior to the WSF processing. The spectrograms for the unweighted DSBF andOSNR outputs are shown at the top for reference. The talker is female and the text being spoken is thealpha-digit string ’BUCKLE83566’. All examples are from 16 microphone implementations.

7.2.1 Subjective Observations

Figure 7.4 shows example spectrograms for an utterance processed with the WSF algorithm. In general thecom spectral estimate shows a markedly reduced background noise level compared to the other spectralestimate methods. This observation is confirmed by listening; the background noise is more stronglysuppressed by the min and com methods. For the noisy data the com method sounds noticeably better thaneither of the other two methods (though when OSNR pre-processing is used com and min sound verysimilar). The cor method also introduces a greater degree of warbling and tonal noise in the residualbackground noise. With the quiet data, the level of tone and warble artifacts is low for the com and minspectral estimate types. With the noisy data a greater level of warble and tone artifacts is introduced. Somedouble talk and reverberation artifacts can be heard in the 8 microphone case; less so when 16microphones are used. As when OSNR is used without any post-filtering, the versions using OSNR as apreprocessing step show dramatically reduced spectral peaks in the background noise.


Tables 7.3 and 7.4 show the recognition performance of the WSF and WSFosnr beamformers. The valuesfor the quiet database show a slight performance improvement in all cases. The improvement of WSF overDSBF is comparable to the improvement of WSFosnr over OSNR; the improvement is somewhat additive.In the noisy cases the com spectral estimate performs marginally better in every case; with the quiet datathe cor estimate edges out the other methods. This isn’t unexpected since the cor estimate has a generally

65


DSBF 24.28 19.44 14.03 11.92 60.44 47.59 32.49 23.48

WSFcor 21.77 16.70 13.03 11.10 50.39 36.60 27.15 20.61WSFmin 21.19 18.50 13.41 11.50 51.66 36.83 27.20 20.35WSFcom 20.72 18.14 13.14 11.54 47.43 34.29 26.75 19.70

Table 7.3: Recognition performance for the WSF beamformers expressed in % words in error. The lowestword error rate in each column is highlighted in boldface.


OSNR 23.93 19.01 14.08 11.52 53.92 40.25 27.13 18.81

WSFosnrcor 21.57 16.41 12.88 10.96 44.03 30.11 25.35 17.92

WSFosnrmin 21.06 18.52 13.43 10.87 43.59 28.71 23.35 16.90

WSFosnrcom 20.84 18.12 13.16 10.90 40.41 28.06 23.30 16.59

Table 7.4: Recognition performance for the WSFosnr beamformers expressed in % words in error. Thelowest word error rate in each column is highlighted in boldface.

lower estimate of the noise and the com estimate (by definition) the highest noise estimate; the corestimate performs best in the low-noise cases and the com estimate performs best in the high-noise cases.Note that the OSNR performance is better than the WSF performance (without OSNR pre-processing).The best improvement for the quiet data (WSFosnr

min ) makes up 27% of the difference from the DSBF to theclose-talking microphone performance. The best noisy performance (WSFosnr

min ) is 45% of the performancedifference.The largest improvements in the distortion measures can be seen in the peak SNR values. The peak SNRmeasured for the quiet data is approximately 1.5 times greater, and 2 times greater for the noisy data. Forthe other measures the difference is generally much smaller. For the quiet data especially the difference inmeasured distortion is sometimes vanishingly small. The noisy data shows significantly more of a

Database � Quiet Noisy Quiet Noisy# mics � 8 16 8 16 8 16 8 16

measure � FD BSD

DSBF 0.60 0.51 1.17 0.98 .066 .054 .124 .096

WSFcor 0.57 0.51 0.82 0.73 .068 .055 .100 .075WSFmin 0.54 0.49 0.86 0.74 .075 .063 .098 .077WSFcom 0.54 0.49 0.77 0.69 .075 .062 .100 .076


DSBF 5.38 6.17 3.25 4.06 24.53 27.86 10.86 14.90

WSFcor 5.31 6.24 3.62 4.49 28.41 32.71 15.78 20.41WSFmin 5.66 6.38 4.03 4.88 40.94 45.02 21.15 26.56WSFcom 5.63 6.37 3.98 4.89 39.91 44.54 22.63 28.45

Table 7.5: Summary of the measured average FD, BSD, SSNR and peak SNR values for the WSF beam-formers. The baseline values for the delay-and-sum beamformer are shown for reference (DSBF). The best(lowest for distortions and highest for SNR’s) is highlighted in bold-face.

66


measure � FD BSD

OSNR 0.59 0.51 1.06 0.85 .065 .053 .106 .083

WSFosnrcor 0.56 0.49 0.78 0.68 .067 .055 .090 .068

WSFosnrmin 0.53 0.49 0.80 0.67 .075 .063 .092 .074

WSFosnrcom 0.54 0.49 0.74 0.65 .075 .062 .092 .073


OSNR 5.48 6.26 3.69 4.41 25.21 28.90 14.09 18.33

WSFosnrcor 5.57 6.41 4.00 4.81 30.08 35.22 19.59 24.96

WSFosnrmin 5.72 6.41 4.36 5.12 41.49 45.97 25.20 31.39

WSFosnrcom 5.69 6.41 4.33 5.14 40.29 45.36 26.22 32.49

Table 7.6: Summary of the measured average FD, BSD, SSNR and peak SNR values for the WSFosnr

beamformers. The baseline values of the delay-and-sum beamformer with optimal-SNR weighting areshown for reference (OSNR).The best value (lowest for distortions and highest for SNR’s) in each categoryis highlighted in bold-face.

difference. Note that for the quiet data the BSD is worsened by any version of WSF with the cor spectralestimate type degrading the least of the 3 estimate types. This is consistent with the observation above thatthe cor spectral estimate with its conservative noise estimate performs best on the quiet data whereas themin and com spectral estimate types are most likely over-estimating the noise resulting in a degree ofsignal distortion that outweighs the noise suppression.

67

SS(0)

j

(1)

1

ωM

jωτ1

τ

SY’

e

YY’e 1

MM

SY’M

1

M

1

1

MΣS

1

NR

��

��

��

Y

Σ

MY

(0)

Sum Filter SumN.R.

NoiseReduc.

Delay

YY’

Figure 7.5: The structure of the Wiener filter-and-sum (WFS) beamformer. This is the same as Figure5.1 with the addition of a configurable post-filtering step on the first beamformer output. The individualchannels may feed forward into the noise reduction step.

7.3 Wiener Filter-and-Sum

A version of the ad-hoc Wiener filter-and-sum (WFS) strategy described in Section 5.3 is implemented andevaluated. Figure 7.5 shows the structure of the algorithm. The delay-and-sum beamformer output isformed, an optional noise-reduction filter applied and the result is fed back as a signal reference togenerate Wiener prefilters for each channel.As described in section 5.3 the output of the beamformer can be used as a signal reference to pre-filter thechannels individually. This algorithm was implemented and applied to the databases described inSections 3.1 and 3.2.

� For the pre-filtering step a 512 point (32ms) Hanning window was used, half overlapped, with a1024 point FFT.

� 4 different signal spectrum estimates were used for the numerator of the Wiener filters:

1. The power spectrum of the unfiltered DSBF output (bf ).

2. The cross-correlation spectral estimate (cor) (see Section 6.2).

3. The minimum-statistic noise spectral estimate (min) (see Section 6.1).

4. The combination spectral estimate (com) (see Section 6.3).

� The same exponential smoothing described in Section 7.2 was used to smooth the spectral estimatesused in the generation of the channel filters.

� To reduce artifacts the filters were smoothed by taking them back into the time domain andtruncating them to 512 points, preserving the linear convolution property of the frequency domainimplementation.

The microphone-array databases (quiet and noise-added) described in Chapter 3 were processed with theWFS algorithm as described above, using 8 and 16 microphones. The results are presented the sectionsthat follow.

7.3.1 Subjective Evaluation

The amount of warbling and tonal artifacts in the residual background noise is noticeably lower for theWFS algorithm than for the WSF algorithm. In particular the speech processed with WFScor has a moresuppressed and more natural sounding residual noise than WSFcor. Of the spectral estimate types, the bfversion has the least effective suppression of background noise; the “sound” of the original backgroundcan still be discerned in those recordings. WFScor has a lower level of residual noise than WFSb f as can beseen in the spectrograms in Figure 7.6. The min and com versions sound virtually indistinguishable from

68

Her

tzsec.

0 0.5 1 1.5 2 2.5 3 3.50

1000

2000

3000

4000

5000

6000

7000

8000

Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

bf Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

cor

Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

min Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

com

Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

WFS WFSosnr

Figure 7.6: Narrowband spectrograms from the WFS beamformer. The 4 bottom rows correspond to thedifferent ways of forming the signal power spectrum estimate. The left hand column images are generatedfrom data with no extra pre-filtering and the right hand column images are from data that was processedwith the OSNR filtering prior to the WFS processing. The spectrograms for unweighted DSBF and OSNRoutputs are shown at the top for reference. The talker is female and the text being spoken is the alpha-digitstring ’BUCKLE83566’.

each other and have a noticeably lower level of residual noise than either the bf or cor processing types.The noisy examples are greatly enhanced by the use of OSNR as a pre-processing step; the spectral peaksin the background noise are greatly attenuated by the OSNR processing. The application of OSNRpre-processing to the quiet data is impossible to distinguish by listening. The WFS recordings exhibit adegree of “breathing” at the transitions from speech to silence and silence to speech. This “breathing” ismore apparent in the quiet recordings (also when the min and com processing types with greater noisesuppression are used) where less residual background noise is available to mask the artifact. The processedspeech also exhibits a varying degree of echo and similar processing artifacts comparable to that observedwith WSF processing. Artifacts are lower when 16 microphones are used.


Tables 7.7 and 7.8 summarize the recognition performance of the WFS algorithm. Most notably therecognition performance is generally worse with the quiet data. A slight improvement can be seen in some

69


DSBF 24.28 19.44 14.03 11.92 60.44 47.59 32.49 23.48

WFSb f 23.46 18.55 14.10 11.88 51.63 35.40 27.89 19.77WFScor 23.55 19.46 14.28 12.08 48.63 32.93 28.37 19.77WFSmin 23.24 21.10 14.77 12.79 47.45 32.04 26.91 18.88WFScom 23.86 20.72 14.59 12.47 46.25 31.42 26.42 19.23

Table 7.7: Recognition performance for the WFS beamformer expressed in % words in error.


OSNR 23.93 19.01 14.08 11.52 53.92 40.25 27.13 18.81

WFSosnrb f 23.35 18.12 13.68 11.92 46.25 30.06 25.04 17.92

WFSosnrcor 22.55 18.97 14.45 12.21 44.16 28.17 25.66 17.75

WFSosnrmin 23.37 20.72 14.30 12.56 41.18 27.13 23.95 16.66

WFSosnrcom 23.64 20.57 14.30 12.52 40.45 26.91 24.06 16.50

Table 7.8: Recognition performance for the WFSosnr beamformer expressed in % words in error.

cases before MAP training but once MAP training has been applied any improvement disappears. Thenoisy data on the other hand does show a significant decrease in error rate with (as in the precedingsection) the com spectral estimate type leading the way.Tables 7.9 and 7.10 show the measured distortion values for the WFS beamformer. For all measuresexcept for peak SNR, the quiet data shows no improvement with the WFS processing. The noisy datashows some improvement though no particular spectral estimate method stands out from the others.


measure � FD BSD

DSBF 0.60 0.51 1.17 0.98 .066 .054 .124 .096

WFSb f 0.58 0.50 0.90 0.72 .078 .066 .106 .077WFScor 0.58 0.51 0.81 0.67 .082 .069 .110 .079WfSmin 0.56 0.50 0.81 0.67 .090 .076 .109 .088WFScom 0.57 0.51 0.78 0.66 .090 .076 .112 .089


DSBF 5.38 6.17 3.25 4.06 24.53 27.86 10.86 14.90

WFSb f 5.09 5.90 3.57 4.50 30.28 37.62 16.51 24.15WFScor 5.03 5.84 3.55 4.51 32.89 41.07 18.95 27.45WFSmin 5.06 5.74 3.79 4.61 41.84 48.92 24.38 32.64WFScom 5.03 5.72 3.72 4.59 41.12 48.60 24.71 33.33

Table 7.9: Distortion values for the WFS beamformer. The best value in each column is highlighted inboldface.

70


measure � FD BSD

OSNR 0.59 0.51 1.06 0.85 .065 .053 .106 .083

WFSosnrb f 0.57 0.49 0.86 0.68 .077 .065 .097 .073

WFSosnrcor 0.57 0.51 0.80 0.65 .081 .069 .100 .076

WFSosnrmin 0.56 0.51 0.78 0.64 .088 .075 .102 .085

WFSosnrcom 0.57 0.51 0.77 0.64 .089 .075 .104 .085


OSNR 5.48 6.26 3.69 4.41 25.21 28.90 14.09 18.33

WFSosnrb f 5.13 5.92 3.72 4.63 30.58 38.31 18.67 26.54

WFSosnrcor 5.07 5.87 3.70 4.65 33.22 41.83 21.44 29.94

WFSosnrmin 5.12 5.77 3.96 4.75 42.11 49.61 27.30 36.20

WFSosnrcom 5.07 5.76 3.89 4.73 41.38 49.26 27.34 36.37

Table 7.10: Distortion values for the WFSosnr beamformer. The best value in each column is higlighted inboldface.

71

(1)S

jωτe

ωτ1e

M

1

j

Σ

1 1Y

MΦ MY

Mm

σm

Φ2

|S|

��

��

��

Y

Φ

M

m

H

2

Apply Filters SumDelay Estimate Parameters

1Y

Figure 7.7: Diagram of the optimal multi-channel Wiener (MCW) beamformer. The delay compensationstage is followed by a parameter estimation stage which feeds into the channel filters applied before thefinal summation.

7.4 Multi-Channel Wiener

Figure 7.7 shows the basic structure of the multi-channel Wiener (MCW) algorithm described in Section5.2. Equation (5.26) forms the basis of the channel filters.

� A 512 point analysis window was used with a 1024 point FFT length.

� The channel filters were derived according to Equation (5.26):

– The noise power spectrum (σ2m � ω � ) for each channel was estimated in the same manner as for

the OSNR processing (Section 7.1) see with the minimum statistic method as described inSection 6.1.

– The transfer function for each channel (Hm � ω � in Equation (5.26)) was estimated the same wayas in the OSNR processing (see Section 7.1) with the normalized average power spectrum foreach channel after noise subtraction.

– The signal power (�S� 2 in the numerator of Equation (5.26)) was estimated from the input

channel data with the 3 different methods described in Section 6. The same 3 methods used inSections 7.2 and 7.3 above.

– The signal power in the denominator of Equation (5.26) (The�S � ω � � 2 �Hm � ω � � 2 term) was

estimated with the channel power after subtracting the noise estimate,�Ym � ω � � 2 � σ2

m .

� The same exponential smoothing described in Section 7.2 was used to smooth the spectral estimatesin the numerator of Equation (5.26).

� To guard against divide by zero or multiply by zero situations the gain at each frequency wasconstrained to lie within -40dB to 3dB by truncating at both ends of the range.

� To reduce artifacts in the reconstructed signal, the post filter is truncated to 512 points in the timedomain to preserve the linear convolution property. This corresponds to a smoothing of the filter inthe frequency domain.


� As in the preceding sections the 3 different spectral estimate types were used (corÿ min, com).

� For the MCW implementations the OSNR version denotes the use of OSNR weighted data whenforming the signal spectrum estimate in the numerator of Equation (5.26). As opposed to thepreceding algorithms where the OSNR process was used as a pre-processing step, since the MCWalgorithm incorporates the OSNR weighting the MCWosnr implementations use the OSNRpre-processing only on the data used in the signal spectrum estimate.

72

Her

tzsec.

0 0.5 1 1.5 2 2.5 3 3.50

1000

2000

3000

4000

5000

6000

7000

8000

Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

cor

Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

min Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

com

Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

Her

tz

sec.0 0.5 1 1.5 2 2.5 3 3.5

0

1000

2000

3000

4000

5000

6000

7000

8000

MCW MCWosnr

Figure 7.8: Narrowband spectrograms from the MCW beamformer. The 3 rows correspond to the differentways of forming the signal power spectrum estimate. The spectrogram for the unweighted DSBF outputis show at the top for reference. The talker is female and the text being spoken is the alpha-digit string’BUCKLE83566’.

The microphone-array databases (quiet and noise-added) described in Chapter 3 were processed with theMCW algorithm as described above, using 8 and 16 microphones. The results are presented the sectionsthat follow.

7.4.1 Subjective Evaluation

Figure 7.8 shows spectrograms for the example utterance for the MCW processed speech. Thespectrograms suggest a very strong suppression of the background noise and this is confirmed by listening.The increased noise suppression for min and com can also be seen in Figure 7.8 as well as the greatersuppression of the noise bands in the OSNR processed column. The MCW processed speech has adistinctly processed quality. The residual background noise is at a lower level than for the correspondingoutputs from WSF or WFS but overwhelmingly consists of warbling tones; the background noise issquelched to such a degree that any sense of the ambiance of the original recordings is lost, even in the 8microphone case. The different methods of signal estimation are virtually indistinguishable from eachother by listening.


Tables 7.11 and 7.12 show the recognition performance for the MCW beamformer. An improvement isshown for every tabulated case with the min and com spectral estimation types performing better than thecor type. The min spectral estimate performs marginally better than the com estimate in most cases. Forthe quiet data the difference between MCW and MCWosnr is virtually nonexistent; for the noisy data the

73


DSBF 24.28 19.44 14.03 11.92 60.44 47.59 32.49 23.48

MCWcor 24.02 18.26 13.36 11.23 51.41 34.27 27.37 18.70MCWmin 20.26 17.14 12.83 11.36 46.56 30.80 25.77 17.88MCWcom 20.81 17.37 12.88 11.14 46.79 30.71 25.84 18.35

Table 7.11: Recognition performance for the MCW beamformer expressed in % words in error.


OSNR 23.93 19.01 14.08 11.52 53.92 40.25 27.13 18.81

MCWosnrcor 23.15 17.46 13.43 11.34 44.14 30.15 24.31 17.39

MCWosnrmin 20.15 16.86 12.81 11.25 40.63 27.06 22.79 16.61

MCWosnrcom 20.53 17.21 12.74 11.12 41.16 27.44 23.19 16.81

Table 7.12: Recognition performance for the MCWosnr beamformer expressed in % words in error.

use of the OSNR data in the spectral estimation step does improve performance by a non-negligibleamount. The best case improvement with MAP training, using the quiet data is 21% of the differencebetween the DSBF baseline and the close-talking microphone performance for both the 8 microphone and16 microphone case. Using the noisy data the improvement is 44% for 16 microphones and 39% for 8microphones.The distortion measures shown in tables and are qualitatively similar to those in the preceding sections.BSD is made worse by all cases, quiet and noisy. FD shows marginal improvement with the quiet data andsomewhat greater improvement with the noisy data. SSNR declines in nearly all cases though it declinesless for the noisy data than for the quiet data. Peak SNR improves by nearly a factor of 2 in all cases.


measure � FD BSD

DSBF 0.60 0.51 1.17 0.98 .066 .054 .124 .096

MCWcor 0.59 0.52 0.83 0.70 .099 .093 .125 .099MCWmin 0.56 0.50 0.82 0.69 .106 .097 .125 .109MCWcom 0.57 0.50 0.81 0.69 .106 .098 .129 .109


DSBF 5.38 6.17 3.25 4.06 24.53 27.86 10.86 14.90

MCWcor 4.54 4.98 3.22 4.07 34.86 44.17 21.20 30.84MCWmin 4.63 4.94 3.52 4.14 53.55 52.44 26.67 36.35MCWcom 4.57 4.91 3.42 4.14 42.52 51.94 26.54 36.48

Table 7.13: Measured distortion for the MCW beamformer.

74


measure � FD BSD

OSNR 0.59 0.51 1.06 0.85 .065 .053 .106 .083

MCWosnrcor 0.59 0.51 0.80 0.68 .100 .093 .115 .102

MCWosnrmin 0.56 0.50 0.78 0.66 .105 .097 .123 .113

MCWosnrcom 0.57 0.50 0.78 0.67 .106 .098 .123 .112


OSNR 5.48 6.26 3.69 4.41 25.21 28.90 14.09 18.33

MCWosnrcor 4.59 5.01 3.50 4.16 35.74 45.69 26.48 34.64

MCWosnrmin 4.67 4.96 3.74 4.19 44.28 53.58 32.18 40.44

MCWosnrcom 4.60 4.93 3.67 4.19 43.18 53.03 31.49 39.95

Table 7.14: Measured distortion for the MCWosnr beamformer.

75

QUIET DATA

8mic

10

15

20

Word

Err

or

%

89

58

27 % Im

pro

ved

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

10

12

14

Word

Err

or

%

69

35

1

% Im

pro

ved

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

16mic

10

15

20

Wo

rd E

rro

r %

84

39

−5

% Im

pro

ve

d

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

9

10

11

12

13

Wo

rd E

rro

r %

78

51

24

−2

−29

% Im

pro

ve

d

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

NOISY DATA

8mic

20

40

60

Word

Err

or

%

77

39

1

% Im

pro

ved

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

10

20

30W

ord

Err

or

%

92

51

10

% Im

pro

ved

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

16mic

10

20

30

40

Word

Err

or

%

95

70

45

19 % Im

pro

ved

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

10

15

20

Word

Err

or

%

88

55

23 % Im

pro

ved

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

Baseline HMM MAP HMM

Figure 7.9: Summary of word error rates in % words in error. The bottom level of each graph correspondsto the close-talking microphone error rate. The axis on the right hand side shows the percent improvementfrom the DSBF baseline. That is, the DSBF is 0% improved and the close-talking microphone is 100%improved.

7.5 Summary

Figures 7.9 and 7.10 show the recognition performance for each tested combination of microphones (8 or16) and database (quiet or noisy). Figure 7.9 shows the performance for algorithms using the unweightedchannels as input and Figure 7.10 shows the performance for algorithms using the OSNR filtering as apreprocessing stage. These are the same values tabulated in the previous section, but presented graphicallyand side by side to facilitate comparisons over the full range of algorithms.

76

QUIET DATA

8mic

10

15

20

Word

Err

or

%

89

58

27 % Im

pro

ved

DS

BF

OS

NR

WS

Fo

sn

rco

r

WS

Fo

sn

rm

in

WS

Fo

sn

rco

m

WF

So

sn

rb

f

WF

So

sn

rco

r

WF

So

sn

rm

in

WF

So

sn

rco

m

MC

Wo

sn

rco

r

MC

Wo

sn

rm

in

MC

Wo

sn

rco

m

10

12

14

Word

Err

or

%

69

35

1%

Impro

ved

DS

BF

OS

NR

WS

Fo

sn

rco

r

WS

Fo

sn

rm

in

WS

Fo

sn

rco

m

WF

So

sn

rb

f

WF

So

sn

rco

r

WF

So

sn

rm

in

WF

So

sn

rco

m

MC

Wo

sn

rco

r

MC

Wo

sn

rm

in

MC

Wo

sn

rco

m

16mic

10

15

20

Wo

rd E

rro

r %

84

39

−5

% Im

pro

ve

d

DS

BF

OS

NR

WS

Fo

sn

rco

r

WS

Fo

sn

rm

in

WS

Fo

sn

rco

m

WF

So

sn

rb

f

WF

So

sn

rco

r

WF

So

sn

rm

in

WF

So

sn

rco

m

MC

Wo

sn

rco

r

MC

Wo

sn

rm

in

MC

Wo

sn

rco

m

9

10

11

12

Wo

rd E

rro

r %

78

51

24

−2

% Im

pro

ve

d

DS

BF

OS

NR

WS

Fo

sn

rco

r

WS

Fo

sn

rm

in

WS

Fo

sn

rco

m

WF

So

sn

rb

f

WF

So

sn

rco

r

WF

So

sn

rm

in

WF

So

sn

rco

m

MC

Wo

sn

rco

r

MC

Wo

sn

rm

in

MC

Wo

sn

rco

m

NOISY DATA

8mic

20

40

60

Word

Err

or

%

77

39

1

% Im

pro

ved

DS

BF

OS

NR

WS

Fo

sn

rco

r

WS

Fo

sn

rm

in

WS

Fo

sn

rco

m

WF

So

sn

rb

f

WF

So

sn

rco

r

WF

So

sn

rm

in

WF

So

sn

rco

m

MC

Wo

sn

rco

r

MC

Wo

sn

rm

in

MC

Wo

sn

rco

m

10

20

30W

ord

Err

or

%

92

51

10

% Im

pro

ved

DS

BF

OS

NR

WS

Fo

sn

rco

r

WS

Fo

sn

rm

in

WS

Fo

sn

rco

m

WF

So

sn

rb

f

WF

So

sn

rco

r

WF

So

sn

rm

in

WF

So

sn

rco

m

MC

Wo

sn

rco

r

MC

Wo

sn

rm

in

MC

Wo

sn

rco

m

16mic

10

20

30

40

Word

Err

or

%

95

70

45

19 % Im

pro

ved

DS

BF

OS

NR

WS

Fo

sn

rco

r

WS

Fo

sn

rm

in

WS

Fo

sn

rco

m

WF

So

sn

rb

f

WF

So

sn

rco

r

WF

So

sn

rm

in

WF

So

sn

rco

m

MC

Wo

sn

rco

r

MC

Wo

sn

rm

in

MC

Wo

sn

rco

m

10

15

20

Word

Err

or

%

88

55

23 % Im

pro

ved

DS

BF

OS

NR

WS

Fo

sn

rco

r

WS

Fo

sn

rm

in

WS

Fo

sn

rco

m

WF

So

sn

rb

f

WF

So

sn

rco

r

WF

So

sn

rm

in

WF

So

sn

rco

m

MC

Wo

sn

rco

r

MC

Wo

sn

rm

in

MC

Wo

sn

rco

m

Baseline HMM MAP HMM

Figure 7.10: Word error rates with OSNR input in % words in error. The bottom level of each graphcorresponds to the close-talking microphone error rate. The axis on the right hand side shows the percentimprovement from the DSBF baseline. That is, the DSBF is 0% improved and the close-talking microphoneis 100% improved.

Looking at the MAP-HMM column of Figure 7.9. For tests of the quiet data OSNR, WSF and MCWprocessing all improve recognition rates and WFS reduces recognition performance. For tests of the noisydata every filtering strategy improves recognition performance though the OSNR filtering out-performs allbut the MCW algorithm. This is a strong result considering that the OSNR is the only algorithm in thiscomparison (apart from DSBF) that is distortionless. That is, OSNR has an overall flat system frequencyresponse whereas the other methods (WSF, WFS, MCW) all impose a non-uniform overall frequencyweighting that distorts the spectrum.

77

9

10

11

12

Wor

d E

rror

%78

51

24

−2

% Im

proved

DS

BF

OS

NR

WS

Fco

r

WS

Fos

nrco

r

MC

Wco

m

MC

Wos

nrco

m

10

15

20

Wor

d E

rror

%

88

55

23 % Im

proved

DS

BF

OS

NR

WS

Fco

m

WS

Fos

nrco

m

MC

Wm

in

MC

Wos

nrm

in

Quiet data Noisy Data

Figure 7.11: The best performing filtering schemes using 16 microphones and MAP training. These valuesare culled from Figures 7.9 and 7.10.

In Figure 7.10 results for the quiet data show a similar trend as in Figure 7.9: improved performance isshown for every strategy except WFS for which performance declines. Unlike the tests without OSNRpre-processing, for the noisy data the performance of WFSosnr, WSFosnr and MCWosnr are all better thanthe performance of the OSNR filtering alone. The gains from the OSNR weighting and the gain from thenoise-reduction filtering which follows are additive which is not unexpected. Using the OSNR as apre-processing step simultaneously improves the signal estimate available to the filtering step and providesan inter-microphone weighting that is missing from the WFS and WSF processing types. MCW alreadyincorporates the non-uniform microphone weighting but gains from using the OSNR weighted data for thespectral estimate.The relatively poor performance of the WFS algorithms on the quiet data may seem somewhatcounter-intuitive since the WFS algorithms are arguably the best sounding processing types on the quietdata - at least in terms of having adding the least amount of artifacts into the processed speech. The ad-hocnature of WFS algorithm and the way it may distort the spectrum seems to be reflected in the recognitionresults. WFS is the only algorithm that isn’t based directly on an optimization and it’s also the onlyalgorithm that reduces the recognition performance on quiet data. This probably isn’t a coincidence.To better compare the best performing algorithms, Figure 7.11 shows a subset of the results in Figures 7.9and 7.10 using 16 microphones and the MAP model for both quiet and noisy databases. With the quietdata the performance of the MCW and WSF variants are virtually identical. With the noisy data the MCWoutperforms WSF but the versions with OSNR included are deadlocked again. As discussed in Section5.2.3 WSFosnr and MCWosnr use the same inter-microphone weighting function and differ only in thespecifics of the final frequency shaping. What the results here show is that the difference in frequencyweighting between the WFS and MCW methods is not significant enough to affect the recognitionperformance. The difference in performance between WFS and MCW goes away when the OSNRweighting is used equally in both methods.Figures 7.12 and 7.13 graphically summarize the values of the various distortion measures applied to thefiltering algorithms4. In Figure 7.12 the FD measure varies only slightly with the different algorithmswhen used on the quiet data, with the noisy data on the other hand every algorithm significantly lowers themeasured FD. This is not unlike the recognition results where with the noisy data any distortion introducedby the processing methods is outweighed by the degree to which they suppress the noise. For the quietdata the level of noise is low enough that the gain from reducing it and the penalty paid for introducingfiltering distortions are much more similar in magnitude.The BSD values measured on the quiet data increase for all algorithms. The increase for WSFcor isminimal but virtually every other algorithm shows a significant increase. The increase in BSD is generallygreater for those methods with greater noise suppression. Using the com and min spectral estimatesgenerally results in a higher BSD, and these two methods generally result in a larger estimate of the noise(and greater corresponding noise suppression) than the cor method. The measurements on the noisy datashow a similar upward trend in the BSD though in this case only the MCW values are worse than theDSBF and OSNR baselines. The SSNR values shown in Figure 7.13 show the complementary trend with

4The values for the implementations using OSNR pre-processing are qualitatively extremely similar and are not plotted here.

78

FD

8mic

0

0.1

0.2

0.3

0.4

0.5

0.6

FD

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

0

0.2

0.4

0.6

0.8

1

FD

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

16mic

0

0.1

0.2

0.3

0.4

0.5

FD

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

0

0.2

0.4

0.6

0.8

1

FD

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

BSD

8mic

0

0.02

0.04

0.06

0.08

0.1

BS

D

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

0

0.02

0.04

0.06

0.08

0.1

0.12

BS

D

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

16mic

0

0.02

0.04

0.06

0.08

0.1

BS

D

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

0

0.02

0.04

0.06

0.08

0.1

BS

D

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

Quiet Noisy

Figure 7.12: Summary of FD and BSD values measured on the variety of beamforming algorithms.

slight differences. The similarity of these trends is entirely expected since the SSNR measurement isessentially a linear-frequency version of the BSD measurement. In this measurement the WSF algorithmsshow slight improvement even on the quiet data and the MCW algorithms (as with the BSDmeasurements) still show the worst performance by this measure. This ordering is reversed on the peakSNR graphs. Every algorithm shows a significant increase in SNR and the WFS and MCW algorithmsshow greater SNR than the WSF algorithm. These results point towards the tradeoff between introducingdistortion and suppressing noise. The more aggressively the noise is suppressed (indicated by SNR) themore unwanted signal distortions (indicated by BSD and SSNR) will creep in. The surprised is how theFD measure does not follow the other distortion measures as tightly as it did in Chapter 3. Despite havingthe worst BSD performance in the group, the MCW algorithms FD scores and recognition performance

79

SSNR

8mic

0

1

2

3

4

5

dB

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

00.5

11.5

22.5

33.5

4

dB

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

16mic

0

1

2

3

4

5

6

dB

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

0

1

2

3

4

5

dB

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

SNR

8mic

0

10

20

30

40

50

dB

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

0

5

10

15

20

25

dB

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

16mic

0

10

20

30

40

50

dB

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

0

5

10

15

20

25

30

35

dB

DS

BF

OS

NR

WS

Fco

r

WS

Fm

in

WS

Fco

m

WF

Sb

f

WF

Sco

r

WF

Sm

in

WF

Sco

m

MC

Wco

r

MC

Wm

in

MC

Wco

m

Quiet Noisy

Figure 7.13: Summary of SSNR and SNR values measured on the variety of beamforming algorithms.

are among the best observed.Figure 7.14 shows scatter plots of the recognition error rate for each of the 84 individual trials5 as afunction of each distortion measure along with a superimposed linear fit. The RMS linear fit errors for thebaseline-HMM and the MAP-HMM are shown below each plot. By far FD shows the strongest linearcorrelation with recognition error rate with SSNR, SNR and BSD following in that order. Note that theMAP-HMM and baseline-HMM linear trends all intersect at an error rate of approximately 5%. This

54 trials for OSNR ( 8 and 16 microphones, quiet and noisy), 24 trials each for WSF and MCW (8 and 16 microphones, quiet andnoisy, OSNR pre-processed or not, 3 spectral estimate types) and 32 trials for WFS (8 and 16 microphones, quiet and noisy, OSNRpre-processed or not, 4 spectral estimate types).

80

0.4 0.6 0.8 1 1.2 1.4

10

20

30

40

50

60

FD

Err

or R

ate

(%)

BaselineMAP

0.02 0.04 0.06 0.08 0.1 0.12 0.14

10

20

30

40

50

60

BSD

Err

or R

ate

(%)

BaselineMAP

RMSE: 2.8%(Base) 1.8%(MAP) RMSE: 8.9%(Base) 4.4%(MAP)

3 4 5 6 7

10

20

30

40

50

60

SSNR (dB)

Err

or R

ate

(%)

BaselineMAP

10 20 30 40 50

10

20

30

40

50

60

SNR (dB)

Err

or R

ate

(%)

BaselineMAP

RMSE: 6.0%(Base) 2.9%(MAP) RMSE: 6.1%(Base) 3.2%(MAP)

Figure 7.14: Scatter plots of error rate and distortion measures. Each figure plots word error rate as afunction of measured distortion values FD, BSD, SSNR, SNR. The

�’s denote results from the baseline

HMM and � ’s denote the results after MAP training. There are 84 data points. The linear fits to each set ofpoints is overlayed and the RMS errors from the linear fit for the baseline-HMM and the MAP-HMM areshown below each plot.

somewhat better than the reference close-talking microphone error rate of 8.16% suggesting that datapoints at lower distortions and error rates than those plotted here would fall above the linear trend shownhere. It is interesting though that the MAP-HMM and baseline-HMM linear trends for the differentmeasures intersect at such similar performance points.The corresponding RMS linear fit error figures for the baseline measurements made on the DSBF data isshown in Table 3.9. Compared to the measurements made on the DSBF processed data in Section 3.3.2 thedistortion measurements here are generally less correlated with error rate; the baseline-HMM error rateespecially. Only FD has a better linear fit here than with the DSBF measurements. This disparity is despitethe restricted range over which the error rates here fall; in Section 3.3.2 a good deal of the linear fit errorwas due to a strong nonlinearity in the relation between distortion measures and error rates. In themeasurements presented here the trends appear quite linear, but with a greater variance from the trend.The increase in overall apparent linearity is at least in part because the range of error rates observed in thisdata is significantly smaller than the range observed in Section 3.3.2.Figure 7.15 shows scatter plots of the FD, BSD, and SNR as functions of each other for measurementstaken on the noise suppressed data (OSNR, WSF, WFS, MCW) and for measurements taken on the DSBFbaseline data presented in Chapter 3. The greatly reduced correlation between measures seen with thenoise reduction algorithms is readily apparent. With the DSBF measurements the successive addition ofmicrophones yields data points that travel somewhat continuously through all the measurement spaces.With the nonlinear nature of the noise-suppression algorithms this property appears to no longer hold. In

81

0.6 0.8 1 1.2 1.4 1.60.05

0.1

0.15

0.2

FD

BS

D

WSF,WFS,MCW,OSNRDSBF

0.5 1 1.5

10

20

30

40

50

FD

SN

R


0.06 0.08 0.1 0.12 0.14 0.16 0.18

10

20

30

40

50

BSD

SN

R


Figure 7.15: Scatter plots comparing correlation of distortion measures for the noise-suppression algorithms(OSNR, WSF, WFS, MCW) and for DSBF. The

�marks represent the values measured in Chapter 3 on

the output of the DSBF. The�

marks represent the values measured from the noise reduction algorithms;OSNR, WSF, WFS, MCW.

Figure 7.15 the scatter plot of BSD and SNR reflects the low correlation between these measurescompared to the relatively tight correlation that was observed with the DSBF measurements. Figure 7.16repeats the scatter plot from Figure 7.15 but breaks each algorithm out into its own marker and linear fitline. From this it is apparent that within each algorithm type the correlation between measures is muchstronger than it is between algorithm types, though certainly the reduced number of points in each clustercontributes somewhat to that perception. In particular for the FD vs BSD scatter plot the differencebetween the algorithms is primarily a different bias in BSD for each algorithm. The FD vs SNR plot showsquite a bit less separation between the different algorithms. Note also that taken on their own the OSNRpoints fall very neatly along a linear trend much more like the DSBF measurements in Chapter 3. In factthe OSNR measurements generally fall very close to the trends established by the DSBF data as shown inFigure 7.17. In contrast to Figure 7.15 the OSNR measurements taken alone are quite consistent with thetrends set by the DSBF measurements. This is reflective of how closely related the OSNR algorithm is tothe DSBF algorithm in that it doesn’t employ the Wiener filtering noise suppression that is common to allthe other algorithm types.

82

0.4 0.6 0.8 10.05

0.06

0.07

0.08

0.09

0.1

0.11

0.12

0.13

FD

BS

D

OSNRWSF WFS MCW

0.4 0.6 0.8 110

20

30

40

50

60

FD

SN

R

OSNRWSF WFS MCW

0.04 0.06 0.08 0.1 0.12 0.1410

15

20

25

30

35

40

45

50

55

BSD

SN

R

OSNRWSF WFS MCW

Figure 7.16: Scatter plot of distortion measurements by algorithm type. When broken out into the differentalgorithm types the distortion measures show a stronger linear correlation with each other.

83

0.5 1 1.5 2 2.50.05

0.1

0.15

0.2

0.25

FD

BS

D

OSNRDSBF

0.5 1 1.5 2 2.50

5

10

15

20

25

30

FD

SN

R

OSNRDSBF

0.05 0.1 0.15 0.2 0.250

5

10

15

20

25

30

BSD

SN

R

OSNRDSBF

Figure 7.17: Scatter plots of OSNR distortion measurements along with the DSBF measurements. TheOSNR measurements fall much more close to the DSBF trends than the other tested algorithms.

CHAPTER 8:SUMMARY AND CONCLUSIONS

The goal of this work was to measure the performance of a delay-and-sum beamformer and to investigatetechniques for improving upon that performance. Several measures by which to judge performance wereintroduced in Chapter 2. The measures introduced vary from traditional signal-to-noise ratio measures(SNR, SSNR) to perceptually motivated measures that more closely reflect subjective speech quality(BSD). The feature distortion (FD) measure was also introduced as an attempt to predict the performanceof a speech recognition system. In Chapter 3 a database of microphone-array recordings was described.This database of recordings was originally collected to make direct comparisons between the performanceof the microphone array and a close-talking microphone in a speech recognition task[6]. Because themicrophone-array recordings include simultaneous recordings with a close-talking microphone, signalquality measures that require a reference signal (FD, SSNR, BSD) could be used to evaluate the results ofbeamforming algorithms. Chapter 3 also describes how a high-noise database was created by adding noiserecorded by the same microphone array to the original, relatively quiet, recordings. Chapter 3 describesthe performance of a delay-and-sum beamformer using from 1 to 16 microphones. The results wereevaluated with the measures described in Chapter 2 and with the performance of an alphadigit speechrecognition system. The MAP retraining method was used to adapt the speech recognition models andoptimize the performance on the novel microphone-array data. Chapter 4 used simulations of a linearmicrophone array to investigate the limits of delay-and-sum techniques in noisy and reverberantenvironments. Motivated by the results in Chapter 4, in Chapter 5 MMSE optimizations for single channelsignal enhancement were extended to an optimal multi-input solution (MCW). The optimal multi-inputsolution was solved for signal-plus-noise and filtered-signal-plus-noise models for the received signal. Inaddition, an intuitively appealing but non-optimal filter-and-sum approach (WFS) was presented andanalyzed. Chapter 6 describes some methods for generating the spectral estimate required for theimplementation of the Wiener filtering strategies including a novel combination of cross-spectrum andminimum-noise-subtraction spectrum estimation. Finally, Chapter 7 presents implementations andevaluations of the various speech enhancement algorithms. Significant points of the results include:

� Overall the noise-reduction techniques were quite successful in improving recognition performance,reducing the gap between the DSBF performance and the close-talking microphone performance byup to 27% on the quiet data and 45% on the noisy data.

� The OSNR weighting is very successful in the noisy data tests, outperforming all but the MCWalgorithm. This is significant in that, unlike the other algorithms, the OSNR weighting is adistortion-free filtering.

� The MCW algorithm has the best speech recognition performance on the noisy data and is withinthe smallest of margins of the WSF algorithm on the quiet data.

� When OSNR is used as a pre-processing step, the MCW and WSF algorithms perform nearlyidentically on the speech-recognition task. The OSNR pre-processing is the deciding factor in thespeech recognition performance; the difference between the frequency weightings of the WSF andMCW algorithms is insignificant by comparison.

� The min and com spectrum estimates generally result in better recognition scores and worsedistortion scores than the cor cross-spectrum method. This is largely due to the generally largernoise spectrum estimates from these two methods.

84

85

� The WFS algorithm has the worst speech recognition scores and distortion measures of the 3 Wienerfiltering schemes although it shows strong improvement in SNR and informal subjective evaluationsof subjective quality. WFS is the only algorithm that has worse recognition performance than theDSBF on the quiet data set.

� The FD measure does a consistently good job of predicting speech recognition performance.

� The Wiener-based methods show very different relationships between measurements than the DSBFand OSNR algorithms. FD is still strongly related to speech recognition performance, but the strongrelationships with BSD, SSNR and SNR observed with the DSBF tests are not seen here. This wasforeseen in Chapter 3; the DSBF is unique in that adding microphones simultaneously reduces thenoise and enhances the signal in a fairly uniform manner. The Wiener filtering strategies on theother hand are based upon amplifying the signal in high-SNR regions and squelching it in low-SNRregions, and does so in a nonlinear fashion. The result is that noise is suppressed at the cost ofincreased signal distortion.

� The efficacy of the MAP training technique was very effective at tuning the recognizer to the noveldata. The MAP training reduced the error rate often by nearly 50%. On the other hand, thebaseline-HMM recognition performance closely follows the MAP-HMM performance; forcomparing the performance of two speech enhancement methods it may not be necessary to doMAP retraining; the performance given by the baseline model may reflect the MAP resultssufficiently well.

8.1 Directions for Further Study

Throughout this work no attempt to incorporate a speech model was made. Neither was any specific noisemodel imposed. Incorporating a speech model certainly has the potential for improving the recoveredspeech by imposing constraints on the trajectory of the estimated speech signal rather than relying uponunconstrained non-parametric spectral estimates[86, 87, 88, 89]. The difficulty lies in having a model thatcan simultaneously represent all sorts of speech accurately while being sufficiently constrained to avoidmodeling noise elements. In a similar manner some gain may be realized by using a noise model. Thesemodels could model a particular method of source production or could attempt to track noise sources withspecific statistical constraints. The large variety of noise types that may be encountered (narrowband,broadband, coherent, ambient, impulsive) indicates that a flexible model or multiple simultaneous modelswould be required for accurate modeling. Ultimately the speech production model could be integratedwith the speech recognition system providing a single global model guiding optimal filtering for subjectivequality and for speech recognition performance within one speech modeling framework.From a strictly signal-processing point of view, all the processing herein would most likely be enhanced bythe use of wavelet transforms or some other nonlinearly spaced filterbank processing[41]. Thelinearly-spaced FFT is a very convenient tool, but ideally the signal processing would be tailored to thesensitivity of the human auditory system. The features used in Bark spectral distortion and theMel-warped features used by the speech recognition system are based on nonlinear frequency scales (Barkand Mel, respectively) though the underlying processing is made with linear filterbanks. Why notincorporate the varying frequency resolution (and at the same time a varying time resolution) into theunderlying signal processing front end and corresponding noise reduction. An auditory model can beincorporated to help determine where and when in the received signal the greatest noise reduction gainscan be achieved or where the greatest penalty for added distortion will be incurred[90, 91].

BIBLIOGRAPHY

[1] M. S. Brandstein and D. B. Ward, editors. Microphone Arrays: Signal Processing Techniques andApplications. Springer Verlag, 2001.

[2] J. L. Flanagan, A. Surendran, and E. Jan. Spatially selective sound capture for speech and audioprocessing. Speech Communication, 13(1-2):207–222, 1993.

[3] Y. Grenier. A microphone array for car environments. In Proceedings of ICASSP-92 [92], pages305–309.

[4] W. Kellerman. A self-steering digital microphone array. In Proceedings of ICASSP-91 [93], pages3581–3584.

[5] J. Adcock, J. DiBiase, M. Brandstein, and H. F. Silverman. Practical issues in the use of afrequency-domain delay estimator for microphone-array applications. In Proceedings of AcousticalSociety of America Meeting, Austin, Texas, November 1994.

[6] J. Adcock, Y. Gotoh, D. J. Mashao, and H. F. Silverman. Microphone-array speech recognition viaincremental MAP training. In Proceedings of ICASSP-96 [94], pages 897–900.

[7] J. L. Flanagan. Bandwidth design for speech-seeking microphone arrays. In Proceedings ofICASSP-85, pages 732–735, Tampa, FL, March 1985. IEEE.

[8] J. L. Flanagan, D. Berkley, G. Elko, J. West, and M. Sondhi. Autodirective microphone systems.Acustica, 73:58–71, 1991.

[9] S. Oh, V. Viswanathan, and P. Papamichalis. Hands-free voice communication in an automobile witha microphone array. In Proceedings of ICASSP-92 [92], pages 281–284.

[10] H. F. Silverman. Some analysis of microphone arrays for speech data acquisition. IEEE Trans.Acoust. Speech Signal Process., ASSP-35(2):1699–1712, December 1987.

[11] C. Che, M. Rahim, and J. Flanagan. Robust speech recognition in a multimedia teleconferencingenvironment. J. Acoust. Soc. Am., 92(4, pt.2):2476(A), 1992.

[12] D. Giuliani, M. Omologo, and P. Svaizer. Talker localization and speech recognition using amicrophone array and a cross-power spectrum phase analysis. In Proceedings of ICSLP, volume 3,pages 1243–1246, September 1994.

[13] Maurizio Omologo and Piergiorgio Svaizer. Acoustic event localization using acrosspower-spectrum phase based technique. In Proceedings of ICASSP-94, volume II, pages273–276, Adelaide, Australia, April 1994. IEEE.

[14] M. Omologo and P. Svaizer. Use of the cross-power spectrum phase in acoustic event localization.Technical Report Technical Report No. 9303-13, IRST, Povo di Trento, Italy, March 1993.

[15] B. D. Van Veen and K. M. Buckley. Beamforming: A versatile approach to spatial filtering. IEEEASSP Magazine, 5(2):4–24, April 1988.

86

87

[16] J. L. Flanagan, J. D. Johnson, R. Zahn, and G. W. Elko. Computer-steered microphone arrays forsound transduction in large rooms. J. Acoust. Soc. Am., 78(5):1508–1518, November 1985.

[17] H. F. Silverman. Some analysis of microphone arrays for speech data acquisition. LEMS TechnicalReport 27, LEMS, Division of Engineering, Brown University, Providence, RI 02912, September1986.

[18] Masato Miyoshi and Yutaka Kaneda. Inverse filtering of room acoustics. IEEE Transactions onAcoustics, Speech, and Signal Processing, 36(2):145–152, February 1988.

[19] Hideaki Yamada, Hong Wang, and Fumitada Itakura. Recovering of broad band reverberant speechsignal by sub-band MINT method. In Proceedings of ICASSP-91 [93], pages 969–972.

[20] S. T. Neely and J. B. Allen. Invertibility of a room impulse response. J. Acoust. Soc. Amer.,66(1):165–169, July 1979.

[21] J. Mourjopolous. On the variation and invertibility of room impulse response functions. Journal ofSound and Vibration, 102(2):217–228, 1985.

[22] Takafumi Hikichi and Fumitada Itakura. Time variation of room acoustic transfer functions and itseffects on a multi-microphone dereverberation approach. Preprint received at 2nd InternationalWorkshop on Microphone Arrays, Rutgers University, NJ, 1994.

[23] E. Jan, P. Svaizer, and J. Flanagan. Matched-filter processing of microphone array for spatial volumeselectivity. In Proceedings of ICASSP-95 [95], pages 1460–1463.

[24] O. L. Frost. An algorithm for linearly constrained adaptive array processing. Proceedings of theIEEE, 60(8):926–935, August 1972.

[25] L. J. Griffiths and C. W. Jim. An alternative approach to linearly constrained adaptive beamforming.IEEE Transactions on Antennas and Propagation, AP-30(1):27–34, January 1982.

[26] B. Widrow, P. E. Mantey, L. J. Griffiths, and B. B. Goode. Adaptive antenna systems. Proceedings ofthe IEEE, 55:2143–2159, 1967.

[27] B. Widrow, J. R. Glover, J. M. McCool, J. Kaunitz, C. S. Williams, R. H. Hearn, J. R. Ziedler,E. Dong, and R. C. Goodlin. Adaptive noise cancelling: Principles and applications. Proceedings ofthe IEEE, 63(12):1692–1716, December 1975.

[28] Osamu Hoshuyama and Akihiko Sugiyama. A robust adaptive beamformer for microphone arrayswith a blocking matrix using constrained adaptive filters. In Proceedings of ICASSP-96 [94], pages925–928.

[29] Jens Meyer and Carsten Sydow. Noise cancelling for microphone arrays. In Proceedings ofICASSP-97 [96], pages 211–213.

[30] Joerg Bitzer, Klaus Uwe Simmer, and Karl-Dirk Kammeyer. Multi-microphone noise reduction bypost-filter and superdirective beamformer. In Proceedings of International Workshop on AcousticEcho and Noise Control, pages 100–103, Pocono Manor, USA, September 1999.

[31] Peter L. Chu. Superdirective microphone array for a set-top videoconferencing system. InProceedings of ICASSP-97 [96], pages 235–2358.

[32] J. Kates. Superdirective arrays for hearing aids. Journal of the acoustical society of america,94(4):1930–1933, 1993.

[33] S. F. Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions onAcoustics, Speech and Signal Processing, 27(2):113–120, April 1979.

88

[34] Levent Arslan, Alan McCree, and Vishu Viswanathan. New methods for adaptive noise suppression.In Proceedings of ICASSP-95 [95], pages 812–815.

[35] T. S. Sun, S. Nandkumar, J. Carmody, J. Rothweiler, A. Goldschen, N. Russell, S. Mpasi, andP. Green. Speech enhancement using a ternary-decision based filter. In Proceedings of ICASSP-95[95], pages 820–823.

[36] R. J. McAulay and M. L. Malpass. Speech enhancement using a soft-decision noise suppressionfilter. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28:137–145, 1980.

[37] E. Bryan George. Single-sensor speech enhancement using a soft-decision/variable attenuationalgorithm. In Proceedings of ICASSP-95 [95], pages 816–819.

[38] R. Zelinski. A microphone array with adaptive post-filtering for noise reduction in reverberantrooms. In Proceedings of ICASSP-88, pages 2578–2580, New York, April 1988. IEEE.

[39] Claude Marro, Yannick Mahieux, and K. Uwe Simmer. Analysis of Noise Reduction andDereverberation Techniques Based on Microphone Arrays with Postfiltering. IEEE Transactions onSpeech and Audio Processing, 6(3):240–259, May 1998.

[40] Joerg Meyer and Klaus Uwe Simmer. Multi-channel speech enhancement in a car environment usingWiener filtering and spectral subtraction. In Proceedings of ICASSP-97 [96], pages 1167–1171.

[41] Djamila Mahmoudi and Andrzej Drygajlo. Combined Wiener and coherence filtering in waveletdomain for microphone array speech enhancement. In Proceedings of ICASSP-98 [97], pages385–389.

[42] T. E. Tremain, M. A. Kohler, and T. G. Champion. Pilosophy and goals of the d.o.d. 2400 bpsvocoder selection process. In Proceedings of ICASSP-96 [94], pages 1137–1140.

[43] Matthew R. Bielefeld and Lynn M. Supplee. Developing a test program for the dod 2400 bpsvocoder selection process. In Proceedings of ICASSP-96 [94], pages 1141–1144.

[44] John D. Tardelli and Elizabeth Woodard Kreamer. Vocoder intelligibility and quality test methods. InProceedings of ICASSP-96 [94], pages 1145–1148.

[45] Elizabeth Woodard Kreamer and John D. Tardelli. Communicability testing for voice coders. InProceedings of ICASSP-96 [94], pages 1153–1156.

[46] M. A. Kohler, Philip A. LaFollette, and Matthew R. Bielefeld. Criteria for the d.o.d. 2400 bpsvocoder selection. In Proceedings of ICASSP-96 [94], pages 1161–1164.

[47] M. A. Kohler, Philip La Follette, and Matthew R. Bielefeld. Criteria for the dod 2400 bps vocoderselection. In Proceedings of ICASSP-96 [94], pages 1161–1164.

[48] Schuyler R. Quackenbush, Thomas P. Barnwell III, and Mark A. Clements. Objective Measures ofSpeech Quality. Prentice Hall, Englewood Cliffs, NJ, 1988.

[49] K. Lam, O. Au, C. Chan, K. Hui, and S. Lau. Objective speech quality measure for cellular phone. InProceedings of ICASSP-96 [94], pages 487–490.

[50] Shihua Wang, Andrew Sekey, and Allen Gersho. An objective measure for predicting subjectivequality of speech coders. IEEE Journal on Selected Areas in Communications, 10(5):819–829, June1992.

[51] Wonho Yang, Majid Benbouchta, and Robert Yantorno. Performance of the modified Bark spectraldistortion as an objective speech quality measure. In Proceedings of ICASSP-98 [97], pages541–544.

89

[52] Wonho Yang and Robert Yantorno. Improvement of MBSD by scaling noise masking threshold andcorrelation analysis with MOS difference instead of MOS. In Proceedings of ICASSP-99, Phoenix,Arizona, April 1999. IEEE.

[53] Jr. John R. Deller, John G. Proakis, and John H. L. Hansen. Discrete-Time Processing of SpeechSignals. Prentice Hall, Upper Saddle River, NJ, 1987.

[54] E. Zwicker and H. Fastl. Psychoacoustics Facts and Models. Springer-Verlag, 1990.

[55] D. W. Robinson and R. S. Dadson. A re-determination of the equal-loudness relations for pure tones.British Journal of Applied Physics, 7:166–181, may 1956.

[56] James D. Johnston. Transform coding of audio signals using perceptual noise criteria. IEEE Journalon Selected Areas in Communications, 6(2):314–323, Feb 1988.

[57] D. J. Mashao, Y. Gotoh, and H. F. Silverman. Analysis of LPC/DFT features for an HMM-basedalphadigit recognizer. IEEE Signal Processing Letters, 3(4):103–106, April 1996.

[58] H. F. Silverman and Yoshihiko Gotoh. On the implementation and computation of training an HMMrecognizer having explicit state durations and multiple-feature-set, tied-mixture output probabilities.LEMS Technical Report 129, LEMS, Division of Engineering, Brown University, Providence, RI02912, December 1993.

[59] M. Hochberg, J. Foote, and H. Silverman. The LEMS talker-independent connected speechalphadigit recognition system. Technical Report 82, LEMS, Division of Engineering, BrownUniversity, Providence RI, 1991.

[60] Lawrence R. Rabiner and Biing Hwang Juang. Fundamentals of Speech Recognition. Prentice-Hall,Englewood Cliffs, N.J., 1993.

[61] Stefan Gustafsson, Peter Jax, and Peter Vary. A novel psychoacoustically motivated audioenhancement algorithm preserving background noise characteristics. In Proceedings of ICASSP-98[97], pages 397–400.

[62] Yohichi Yohkura. A weighted cepstral distance measure for speech recognition. IEEE Trans. onAcoustics Speech and Signal Processing, 35(10):1414–1422, 1987.

[63] S. E. Kirtman and H. F. Silverman. A user-friendly system for microphone-array research. InProceedings of ICASSP-95 [95], pages 3015–3018.

[64] Maurizio Omologo and Piergiorgio Svaizer. Acoustic source location in noisy and reverberantenvironment using CSP analysis. In Proceedings of ICASSP-96 [94], pages 921–924.

[65] P. Svaizer, M. Matassoni, and M. Omologo. Acoustic source location in a three-dimensional spaceusing crosspower spectrum phase. In Proceedings of ICASSP-97 [96], pages 231–234.

[66] M. Omologo and P. Svaizer. Use of the cross-power spectrum phase in acoustic event localization.IEEE Transactions on Speech and Audio Processing, 5(3):288–292, 1997.

[67] S. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall, firstedition, 1993.

[68] Y. Gotoh, M. M. Hochberg, D. J. Mashao, and H. F. Silverman. Incremental MAP estimation ofHMMs for efficient training and improved performance. In Proceedings of ICASSP-95 [95], pages457–460.

[69] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via theEM algorithm. Journal of the Royal Statistical Society, series B, 39(1):1–38, 1977.

90

[70] Radford M. Neal and Geoffrey E. Hinton. A new view of the EM algorithm that justifies incrementaland other variants. submitted to Biometrika, 1993.

[71] Jean-Luc Gauvain and Chin-Hui Lee. Maximum a posteriori estimation for multivariate Gaussianmixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing,2(2):291–298, April 1994.

[72] Y. Gotoh and H. F. Silverman. Incremental ML estimation of HMM parameters for efficient training.In Proceedings of ICASSP-96 [94].

[73] Y. Gotoh, M. M. Hochberg, and H. F. Silverman. Efficient training algorithms for HMMs usingincremental estimation. IEEE Transactions on Speech and Audio Processing, 6(6):539–548,November 1996.

[74] William W Seto. Schaum’s Outline of Theory and Problems of Acoustics. Schaum’s Outline Series.McGraw-Hill Publishing Company, New York, 1971.

[75] F. Pirz. Design of a wideband, constant beamwidth, array microphone for use in the near field. BellSystem Technical Journal, 58(8):1839–1850, October 1979.

[76] M. Goodwin and G. Elko. Constant beamwidth beamforming. In Proceedings of ICASSP-93 [98],pages 169–172.

[77] J. Lardies. Acoustic ring array with constant beamwidth over a very wide frequency range. Acoust.Letters, 13(5):77–81, 1989.

[78] William Mendenhall, Dennis D. Wackerly, and Richard L. Scheaffer. Mathematical Statistics withApplications. The Duxbury Series in Statistics and Decision Sciences. PWS-KENT, Boston,Massachusetts, fourth edition, 1990.

[79] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C: The Artof Scientific Computing. Cambridge University Press, Cambridge, UK CB2 1RP, 2nd edition, 1992.

[80] J. B. Allen and D. A. Berkley. Image method for efficiently simulating small room acoustics. J.Acoust. Soc. Am., 65(4):943–950, April 1979.

[81] D. Johnson and D. Dudgeon. Array Signal Processing- Concepts and Techniques. Prentice Hall, firstedition, 1993.

[82] Peter L. Chu. Desktop mic array for teleconferencing. In Proceedings of ICASSP-95 [95], pages2999–3002. Volume 5.

[83] H. G. Hirsche and C. Ehrlicher. Noise estimation techniques for robust speech recognition. InProceedings of ICASSP-95 [95], pages 153–156.

[84] Sven Fischer and Karl-Dirk Kammeyer. Broadband beamforming with adaptive postfiltering forspeech acquisition in noisy environments. In Proceedings of ICASSP-97 [96], pages 359–363.

[85] Regine Le Bouquin-Jeannes, Ahmad Akbari Azirani, and Gerard Faucon. Enhancement of speechdegraded by coherent and incoherent noise using a cross-spectral estimator. IEEE Transactions onSpeech and Audio Processing, 5(5):484–487, September 1997. Correspondence.

[86] Chang D. Yoo and Jae S. Lim. Speech enhancement based on the generalized dual excitation modelwith adaptive analysis window. In Proceedings of ICASSP-95 [95], pages 832–835.

[87] C. d’Alessandro, B. Yegnanarayana, and V. Darsinos. Decomposition of speech signals intodeterministic and stochastic components. In Proceedings of ICASSP-95 [95], pages 760–763.

[88] John Hardwick, Chang D. Yoo, and Jae S. Lim. Speech enhancement using the dual excitationspeech model. In Proceedings of ICASSP-93 [98], pages 367–370.

91

[89] Zenton Goh, Kah Chye Tan, and B. T. G. Tan. Speech enhancement based on a voiced-unvoicedspeech model. In Proceedings of ICASSP-98 [97], pages 401–404.

[90] Lance Riek and Randy Goldberg. A Practical Handbook of Speech Coders. CRC Press, Boca Raton,FL, 2000.

[91] Nathalie Virag. Speech enhancement based on masking properties of the auditory system. InProceedings of ICASSP-95 [95], pages 796–799.

[92] IEEE. International Conference on Acoustics, Speech, and Signal Processing, San Francisco, CA,March 1992.

[93] IEEE. International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, May1991.

[94] IEEE. International Conference on Acoustics, Speech, and Signal Processing, Atlanta, GA, May1996.

[95] IEEE. International Conference on Acoustics, Speech, and Signal Processing Signal Processing,Detroit, MI, May 1995.

[96] IEEE. International Conference on Acoustics, Speech, and Signal Processing, Munich, Germany,April 1997.

[97] IEEE. International Conference on Acoustics, Speech, and Signal Processing, Seattle, Washington,May 1998.

[98] IEEE. International Conference on Acoustics, Speech, and Signal Processing, Minneapolis, MN,April 1993.

Documents

Optimal Filtering and Speech Recognition With Microphone Arrays