Data-Adaptive Source Separation for Audio Spatialization
Submitted in partial fulfilment of the requirements for the degree of
Master of Technology
(Electronic Systems)
by
Pradeep Gaddipati
08307029
Under the guidance of
Prof. Preeti Rao
and
Prof. V. Rajbabu
Department of Electrical Engineering
INDIAN INSTITUTE OF TECHNOLOGY BOMBAY
June 2010
Dedication
I dedicate this thesis to my family. Without their patience, understanding,
support and most of all love, the completion of this work would not have been
possible.
Dissertation Approval for Master of Technology
This dissertation entitled Data-adaptive source separation for audio
spatialization by Pradeep Gaddipati (Roll no. 08307029) is approved for the
degree of Master of Technology in Electrical Engineering.
Prof. Preeti Rao _______________________ (Supervisor)
Prof. V. Rajbabu _______________________ (Co-supervisor)
Dr. Samudravijaya K. _____ __________________ (External Examiner)
Prof. Prem C. Pandey _______________________ (Internal Examiner)
Prof. K. P. Karunakaran _______________________ (Chairman)
June 17th, 2010
Acknowledgments
I express my sincere gratitude towards Prof. Preeti Rao and Prof. V. Rajbabu for the guidance
and support they gave me during this project. The regular discussions with them on every
aspect of the research work helped me refine my approach towards the problem and motivated
me to give my best. Working with them in the field of audio signal processing was a very
pleasant learning experience and my interest in the subject has considerably grown.
I am extremely thankful to Nokia, India and specifically, Dr. Pushkar Patwardhan for
providing me with the opportunity of pursuing research in such a remarkable domain. I would
like to thank them for providing the financial support and technical inputs for the work.
I would like to thank Vishweshwara Rao for his valuable suggestions and help during various
stages of my project. I thank all the members of the Digital Audio Processing lab, Department
of Electrical Engineering, IIT Bombay for providing a friendly and enjoyable working
environment.
I would also like to thank my family for their love and moral support. Finally, I thank all the
people who have contributed ideas, concepts and corrections to be incorporated in my project.
The mistakes if any in the final draft finally are all my own.
Pradeep Gaddipati
Abstract
The existing surround audio needs to be spatialized to obtain a signal which can generate the
effect of auditory immersion over headphones. This spatialization process comprises two
stages: separating the individual sources from the available mixtures, and then combining
them to re-create audio compatible with the desired output configuration (in the case of
headphones, the individual sources are convolved with the HRIRs for localization and then
mixed together to form the final output audio). The source separation technique itself
involves four stages: transformation of the mixtures into a sparse time-frequency
representation, estimation of the mixing parameters (i.e. the direction and location of the
sources), estimation of the sources in the time-frequency domain, and finally inversion back
into the time domain using an appropriate inverse time-frequency transformation technique.
Various sparsity-based source separation techniques namely degenerate un-mixing estimation
technique (DUET), lq-basis pursuit (LQBP) and delay and scale subtraction scoring (DASSS)
have been explored for the purpose of estimating mixing parameters and individual sources
from the mixtures. However, their performance is directly coupled to two factors: the
sparsity of the time-frequency representation and the W-disjoint orthogonality of the
underlying sources in the time-frequency representation of the mixtures.
This thesis endeavours to find a time-frequency representation that is sparser and provides a
higher degree of W-disjoint orthogonality amongst the underlying sources in the mixtures
than the time-frequency representation obtained using the short-time Fourier transform
(STFT). With this objective, a time-varying data-adaptive time-frequency representation was
developed and its performance in terms of the aforementioned measures was compared to
that of the fixed-window STFT. The data-adaptive time-frequency representation leads to
better estimation of the mixing parameters, which translates into better separation of sources
from the stereo mixtures. This enables the sources to be better spatialized in the auditory
space with fewer artifacts, as has been observed.
Table of Contents
Dedication ................................................................................................................................... i
Dissertation Approval for Master of Technology .................................................................. ii
Acknowledgments .................................................................................................................... iii
Abstract .................................................................................................................................... iv
Table of Contents ...................................................................................................................... v
List of Figures ........................................................................................................................ viii
List of Tables ............................................................................................................................. x
List of Abbreviations ............................................................................................................... xi
Declaration of Academic Honesty and Integrity ................................................................ xiii
Chapter 1. Introduction ........................................................................................................ 1
Chapter 2. Spatial Audio ....................................................................................................... 4
2.1. Sound localization ........................................................................................................ 4
2.1.1. Binaural cues ........................................................................................................ 5
2.1.2. Monaural spectral cue .......................................................................................... 5
2.1.3. Rotation of the human head .................................................................................. 5
2.1.4. Head related impulse response ............................................................................ 6
2.2. Surround sound generation .......................................................................................... 7
2.3. Panning laws ................................................................................................................ 8
Chapter 3. Audio Spatialization ......................................................................................... 11
3.1. Stages of audio spatialization .................................................................................... 11
3.1.1. Analysis – source separation .............................................................................. 12
3.1.2. Re-synthesis – convolution with HRIRs .............................................................. 12
Chapter 4. Sparsity-based Source Separation .................................................................. 15
4.1. Classification of source separation algorithms .......................................................... 15
4.1.1. Based on mixing parameters considered in the mixing model ........................... 15
4.1.2. Based on number of mixtures and sources in the mixing model ........................ 16
4.2. Source separation algorithms: A review .................................................................... 16
4.3. Mixing models ........................................................................................................... 17
4.4. Sparsity-based source separation ............................................................................... 18
4.5. Stages of sparsity-based source separation ................................................................ 19
4.6. Source assumptions .................................................................................................... 19
4.6.1. Local stationarity ................................................................................ 19
4.6.2. Microphone spacing ........................................................................... 20
4.6.3. W-disjoint orthogonality ..................................................................... 20
4.7. Mixing parameter estimation technique .................................................................... 20
4.8. Source estimation techniques ..................................................................................... 21
4.8.1. Degenerate unmixing estimation technique (DUET) ......................................... 22
4.8.2. Lq-basis pursuit (LQBP) ..................................................................................... 23
4.8.3. Delay and scale subtraction scoring (DASSS) ................................................... 24
Chapter 5. Adaptive Time-Frequency Representation .................................................... 27
5.1. Short-time Fourier transform ..................................................................................... 27
5.2. Need for data-adaptive time-frequency representations ............................................ 28
5.3. Data-adaptive time-frequency representations .......................................................... 29
5.3.1. Steps to obtain a data-adaptive time-frequency representation of a signal ....... 31
5.4. Invertibility of time-frequency representations ......................................................... 33
5.4.1. Frame-based transition-window re-construction technique .............................. 34
5.4.2. Modified (extended) window re-construction technique .................................... 34
5.4.3. Segment-based transition-window re-construction technique ........................... 36
Chapter 6. Concentration Measure .................................................................................... 37
6.1. W-disjoint orthogonality ............................................................................................ 37
6.2. Sparsity ...................................................................................................................... 39
6.2.1. Characteristics of sparsity measures .................................................................. 39
6.2.2. Sparsity measures ............................................................................................... 41
6.3. Relation between sparsity measures and WDO measure ........................................... 42
6.3.1. Steps for obtaining the W-disjoint orthogonality measure for a set of signals .. 44
6.3.2. Steps for obtaining the sparsity measure for a set of signals ............................. 45
Chapter 7. Experiments and Results ................................................................................. 48
7.1. Datasets ...................................................................................................................... 48
7.1.1. BSS Oracle database .......................................................................................... 48
7.1.2. TIMIT speech database ...................................................................................... 48
7.2. Performance evaluation measures.............................................................................. 48
7.3. Performance evaluation ............................................................................................. 49
7.3.1. Setup for performance evaluation test ................................................................ 50
7.3.2. Mixing parameters estimation stage................................................................... 51
7.3.3. Source estimation stage ...................................................................................... 52
Chapter 8. Conclusions and Future Work ........................................................................ 54
8.1. Conclusions ................................................................................................................ 54
8.2. Future work ................................................................................................................ 55
Appendix A. Sinusoid Detection using Data-Adaptive Time-Frequency Representation ..... 56
A.1. Sinusoid detection ...................................................................................................... 56
A.2. Data-adaptive time-frequency representation for sinusoid detection ........................ 57
A.3. Performance of data-adaptive time-frequency representation ................................... 58
A.3.a. Sinusoid signals .................................................................................................. 59
A.3.b. Chirp signals ...................................................................................................... 60
A.3.c. Frequency modulated signals ............................................................................. 61
A.3.d. Mixture of sinusoids and frequency modulated signals...................................... 62
A.3.e. Music/speech signals (real signals) .................................................................... 64
References ............................................................................................................................... 66
List of Figures
Figure 2.1 Binaural cues – interaural time difference (ITD) ...................................................... 6
Figure 2.2 Binaural cues – interaural level difference (ILD) ..................................................... 6
Figure 2.3 Cone of confusion ..................................................................................................... 6
Figure 2.4 Rotation of human head ............................................................................................ 6
Figure 2.5 Monaural spectral cues .............................................................................................. 6
Figure 2.6 Reproduction of two-channel stereo ......................................................................... 8
Figure 3.1 Audio spatialization block diagram ........................................................................ 12
Figure 3.2 Time-domain virtualization based on HRIRs ......................................................... 13
Figure 4.1 Mixing models - anechoic mixing........................................................................... 18
Figure 4.2 Mixing model - echoic mixing ................................................................................ 18
Figure 4.3 Block diagram of sparsity-based source separation ................................................ 19
Figure 5.1: Data-adaptive time-frequency representation of a singing voice using frame-based
adaptation (window function: hamming; window sets for adaptation: 30, 60 and 90 ms; hop
size: 10 ms; concentration measure: kurtosis; adaptation region: 1000 to 3000 Hz) ............... 32
Figure 5.2: Data-adaptive time-frequency representation of a singing voice using segment-
based adaptation (window function: hamming; window sets for adaptation: 30, 60 and 90 ms;
hop size: 10 ms; concentration measure: kurtosis; adaptation region: 1000 to 3000 Hz) ........ 33
Figure 5.3: Frame-based transition-window reconstruction technique .................................... 34
Figure 5.4: Modified (extended) window re-construction technique ....................................... 35
Figure 6.1: W-disjoint orthogonality for time-frequency representations of speech source mixtures as
a function of window size used in the time-frequency transformation. ................................... 43
Figure 6.2: The kurtosis (left) and Gini Index (right) sparsity measures applied to speech
signals in the time-frequency domain as a function of window size. ....................................... 43
Figure 6.3: WDO vs. window size ........................................................................................... 46
Figure 6.4: Sparsity measure (kurtosis) vs. window size ......................................................... 47
Figure 6.5: Sparsity measure (Gini Index) vs. window size ..................................................... 47
Figure A.1 Time-frequency representation of sinusoid signals ................................................ 59
Figure A.2 True hits vs. false alarms plot for the sinusoid signals .......................................... 60
Figure A.3 Time-frequency representation of chirp signal ...................................................... 61
Figure A.4 True hits vs. false alarms plot for chirp signals ..................................................... 61
Figure A.5 Time-frequency representation of frequency modulated signals ........................... 62
Figure A.6 True hits vs. false alarms plot for the frequency modulated signals ...................... 62
Figure A.7 Time-frequency representation of mixture of sinusoid signals and frequency
modulated signal (signal energy, frequency modulated to sinusoid signal = 7 dB) ................. 63
Figure A.8 Time-frequency representation of mixture of sinusoid signals and frequency
modulated signal (signal energy, frequency modulated to sinusoid signal = -3 dB) ............... 64
Figure A.9 True hits vs. false alarms for a mixture of sinusoid signals and frequency
modulated signal ....................................................................................................................... 64
Figure A.10: Data-adaptive time-frequency representation of a singing voice signal ............. 65
List of Tables
Table 6-A: Validation table showing the characteristics satisfied by the sparsity measures
(kurtosis/Gini Index) ................................................................................................................ 42
Table 6-B: Counter-examples for testing whether a sparsity measure satisfies a particular
property, with the desired outcome if the sparsity measure satisfies the property. S(x) denotes
the sparsity measure of x ................................................................................................ 42
Table 7-A: Performance of the mixing parameter estimation stage on BSS oracle (music)
dataset ....................................................................................................................................... 52
Table 7-B: Performance of the mixing parameter estimation stage on BSS oracle (speech)
dataset ....................................................................................................................................... 52
Table 7-C: Performance of the source estimation stage (in time-frequency domain) using
DUET and LQBP algorithms on BSS oracle dataset ............................................................... 53
Table A-A: True hits percentage of sinusoid detection for singing voice for different
frequency bands ........................................................................................................................ 65
List of Abbreviations
Abbreviation Meaning
ATFR Adaptive Time-Frequency Representation
BSS Blind Source Separation
CASA Computational Auditory Scene Analysis
CIPIC Centre for Image Processing and Integrated Computing
COLA Constant Over-Lap Add
DASSS Delay And Scale Subtraction Scoring
DFT Discrete Fourier Transform
DUET Degenerate Unmixing Estimation Technique
DVD Digital Video Disc
EEG Electroencephalography
HRIR Head Related Impulse Response
HRTF Head Related Transfer Function
ICA Independent Component Analysis
ICLD Inter-Channel Level Difference
IEEE Institute of Electrical and Electronics Engineers
ILD Interaural Level Difference
ITD Interaural Time Difference
KEMAR Knowles Electronics Manikin for Acoustic Research
LQBP Lq-Basis Pursuit
MIDI Musical Instrument Digital Interface
OLA Over-Lap Add
PCA Principal Component Analysis
PSR Preserved Signal Ratio
SAR Source to Artifacts Ratio
SD Sparse Decomposition
SDR Source to Distortion Ratio
SIR Source to Interference Ratio
SNR Signal to Noise Ratio
SRS Sound Retrieval System
STFT Short-Time Fourier Transform
TIMIT Texas Instruments Massachusetts Institute of Technology
WDO W-Disjoint Orthogonality
Declaration of Academic Honesty and Integrity
I declare that this written submission represents my ideas in my own words and where others'
ideas or words have been included, I have adequately cited and referenced the original
sources. I also declare that I have adhered to all principles of academic honesty and integrity
and have not misrepresented or fabricated or falsified any idea/data/fact/source in my
submission. I understand that any violation of the above will be cause for disciplinary action
by the Institute and can also evoke penal action from the sources which have thus not been
properly cited or from whom proper permission has not been taken when needed.
Pradeep Gaddipati
08307029
June 17th, 2010
Chapter 1. Introduction
With the proliferation of portable media devices, headphone listening has become
increasingly common; in both mobile and non-mobile listening scenarios, providing a high-
fidelity listening experience over headphones is thus a key value-add (or arguably even a
necessary feature) for modern consumer electronic products. This enhanced headphone
reproduction is relevant for both stereo content such as legacy music recordings as well as
multichannel music and movie soundtracks. The audio, when properly generated, can be used
to render a realistic auditory experience with auditory immersion. An audio signal capable of
this is known as spatial audio.
Spatial audio refers to the rendering of the realistic auditory experience with auditory
immersion. Surround sound, an outcome of the extensive research on spatial audio, refers to
the use of multiple loudspeakers to envelop a person watching a movie or listening to music,
making them feel as if they are in the middle of the action or the concert [1]. The surround
sound tracks enable the audience to hear sounds coming from all around them, contributing to
the sensation of what movie-makers call the suspension of disbelief. Such a technique is only
applicable in the case when the playback devices are placed at a considerable distance from
the listener. The same audio signals are not as effective when headphones are used for
listening.
Headphone reproduction simply consists of presenting a left-channel signal to the
listener’s left ear and likewise a right-channel signal to the right ear. In such headphone
systems, stereo music recordings can obviously be directly rendered by routing the respective
channel signals to the headphone transducers. However, such rendering, which is the default
practice in consumer devices, leads to an in-the-head listening experience, which is counter-
productive to the goal of spatial immersion: sources panned between the left and right
channels are perceived to be originating from a point between the listener’s ears [2]. For audio
content intended for multichannel surround playback (perhaps most notably movie
soundtracks), typically with a front centre channel and multiple surround channels in addition
to the front left and right, direct headphone rendering calls for a down-mix of these additional
channels; in-the-head localization again occurs as for stereo content, and furthermore the
surround spatial image is compromised by elimination of front/back discrimination cues.
Hence this surround audio needs to be spatialized to obtain a signal that can generate the
effect of auditory immersion over headphones. However, re-recording the existing audio in
the new format is infeasible. One possible solution to this problem is audio spatialization,
where the existing spatial audio is processed to obtain surround sound that creates auditory
immersion over headphones.
Given a multi-channel audio mixture as input in any available format, audio spatialization is
the process of realistic spatial rendering of audio in the desired listening configuration (e.g.
over headphones). One approach to this problem involves separating the individual sources
from the multi-channel audio mixture, and then re-creating the desired listener-end mixtures
by suitable recombination of the individual spatialized sources. The success of this approach
hinges on achieving the proper separation of sources from the input multi-channel mixtures.
Various source separation algorithms [3] have been developed based on the different source
models and mixing models.
There exist several successful techniques for blind source separation, such as independent
component analysis (ICA) and sparse decomposition. The sparsity-based techniques require
the sources to be sparse and disjoint-orthogonal in some time-frequency representation; they
exploit the sparsity of music/speech signals in the short-time Fourier transform (STFT)
domain to construct binary time-frequency masks, which are then used to extract
several sources from only two mixtures. It is expected that the performance of the source
separation process can be improved by obtaining a sparser time-frequency representation. The
STFT performs well in terms of concentration and resolution of a given signal component
when a properly chosen window is used. But the proper window function depends on the
data, and no automated procedure currently exists for determining a good window. For
signals such as music and speech, which are composed of several different components at
different time instants, the best window differs from one time instant to another. The fact that
different windows are appropriate for different time instants suggests the use of a data-
dependent, time-varying time-frequency representation [4].
Chapter 2 describes the various aspects of spatial audio. Chapter 3 discusses the various
stages involved in the audio spatialization process and presents techniques for re-synthesis of
the surround sound for headphones. Chapter 4 provides a brief review of the various source
separation algorithms, discusses the source models and mixing models considered for solving
the blind source separation problem, describes the generalized staged procedure for sparsity-
based source separation, and details three sparsity-based source separation techniques, viz.
degenerate unmixing estimation technique (DUET), lq-basis pursuit (LQBP) and delay and
scale subtraction scoring (DASSS). Chapter 5 discusses the time-frequency representations
used in source separation algorithms, the need for data-adaptive time-frequency
representations, and the adaptive time-frequency representation used in this work. Chapter 6
investigates the various concentration measures that can be used to drive the adaptation in the
adaptive time-frequency representations. Experiments to evaluate the performance of the
source separation techniques discussed in Chapter 4 and the time-frequency representations
discussed in Chapter 5 are described in Chapter 7. In Chapter 8, the conclusions and the
future work are presented. Finally, in the appendix, a
detailed discussion of the role of the adaptive time-frequency representation in the sinusoid
detection problem is presented.
Chapter 2. Spatial Audio
Everyday life is full of three-dimensional sound experiences. Humans have the capability to
localize these sound sources even in noisy and reverberant environments. This ability of
humans to make sense of their environments and to interact with them depends strongly on
spatial awareness, and hearing plays a major part in this process. The human auditory
system identifies various cues in the sounds heard at the two ears which indicate the spatial
locations of the sources in the three-dimensional space around the listener. The mechanisms
of sound source localization involve the detection of timing or phase difference between the
ears and of amplitude or spectral difference between the ears. The majority of spatial
perception is dependent on the listener having two ears, although certain monaural cues have
been shown to exist – in other words it is mainly the differences in signals received by the two
ears that matter.
2.1. Sound localization
We listen to speech (as well as other sounds) with two ears, and it is quite remarkable how
well we can separate and selectively attend to individual sound sources in a cluttered
acoustical environment. This ability of the listener to determine the location of the
origination of a sound is termed sound localization. In fact, the familiar term cocktail
party processing was coined in an early study of how the binaural system enables us to
selectively attend to individual conversations when many are present, as in, of course, a
cocktail party. This phenomenon illustrates the important contribution that binaural hearing
makes to auditory scene analysis, by enabling us to localize and separate sound sources. In
addition, the binaural system plays a major role in improving speech intelligibility in noisy and
reverberant environments.
Humans can deduce the various parameters of a source's location, viz. azimuth, elevation and
distance, as well as the spaciousness of the auditory environment, from the sounds heard. This
is on the basis of the different cues introduced into the sound by the pinna, proximate parts of
the human body and the surrounding acoustic environment as it travels from the source to the
eardrum of the listener. Thereafter, the cues are processed by the human brain for determining
the acoustic characteristics of the source and the auditory environment. In general, a potential
acoustical localization cue is any physical aspect of the acoustical waveform reaching a
listener’s ears that is altered by a change in the position of the sound source relative to that of
the listener. The most important cues [5] used by humans are discussed below.
2.1.1. Binaural cues
Binaural localization relies on the comparison of auditory input from two separate detectors;
most evolved auditory systems feature two ears, one on each side of the head.
• Interaural time difference (ITD): This cue arises because of the difference in the
distances between the source and the two ears as seen in Figure 2.1. The resulting
phase shift is used for localization of frequencies below 1.5 kHz. This cue is also
sensitive to the shift in the envelope of the signals at higher frequencies.
• Interaural level difference (ILD): The shadowing of the sound wave by the head as
seen in Figure 2.2 results in the sound having a higher intensity at the ear nearest to the
source, depending on the azimuth. It results in a difference in the energy levels
depending on the frequency. This cue is primarily used at frequencies above 1.5 kHz.
As a result of the symmetry of the human head, sounds originating from many different
directions can share the same ITD and ILD. The locus of all source locations that share the same ITD and
ILD is called the cone of confusion as shown in Figure 2.3. Within the cone of confusion, the
estimation of source location is on the basis of monaural spectral cues and the effect of head
rotation.
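To give a feel for the magnitudes involved, the following sketch evaluates Woodworth's classical spherical-head approximation of the ITD; the model and the head radius assumed in it are standard textbook values rather than anything measured in this work.

```python
import numpy as np

# Woodworth's spherical-head approximation of the interaural time
# difference as a function of source azimuth theta (in radians):
#   ITD ~= (r / c) * (theta + sin(theta))
# Assumed typical values: head radius r = 8.75 cm, speed of sound c = 343 m/s.
def itd_woodworth(theta_rad, r=0.0875, c=343.0):
    return (r / c) * (theta_rad + np.sin(theta_rad))

for az_deg in (0, 30, 60, 90):
    itd_us = itd_woodworth(np.radians(az_deg)) * 1e6
    print(f"azimuth {az_deg:3d} deg -> ITD ~ {itd_us:5.0f} us")  # ~0 to ~660 us
```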
2.1.2. Monaural spectral cue
This cue is primarily used to determine the elevation of the source. Figure 2.5 shows the
measured energies at different frequencies for two different directions of arrival. In each case,
there are two paths from the source to the ear canal – a direct path and a longer path following
a reflection from the pinna. For frequencies in the range 6-16 kHz, the delayed signal is out of
phase with the direct signal, and destructive interference occurs. The greatest interference
occurs when the difference in length is half the wavelength. This produces a notch in the
spectrum as seen in the Figure 2.5. Thus, the elevation of the source can be estimated from the
location of this notch.
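As a worked example of this relationship: the first notch occurs at the frequency whose half-wavelength equals the path-length difference, i.e. f = c/(2Δl); for an assumed pinna path-length difference of 2 cm and c ≈ 343 m/s, this gives 343/0.04 ≈ 8.6 kHz, consistent with the 6-16 kHz range quoted above.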
2.1.3. Rotation of the human head
Typically, a listener directs his head towards the interesting sound source. The change in ITD,
ILD and monaural spectral cues with the rotation of the head helps the listener further
localize the source and resolve confusions (see Figure 2.4).
Figure 2.1 Binaural cues – interaural time
difference (ITD)
Figure 2.2 Binaural cues – interaural level
difference (ILD)
Figure 2.3 Cone of confusion
Figure 2.4 Rotation of human head
Figure 2.5 Monaural spectral cues
Source: Aureal Corporation, “3-D Audio Primer,” Aureal Semiconductor A3D White Paper, 1998.
2.1.4. Head related impulse response
The frequency and position dependent characteristics of the pinna, proximate parts of human
body and ear canal are summarized in the form of the head-related transfer function (HRTF)
and its time-domain analogue is the head-related impulse response (HRIR). As the HRTF
depends on the diffraction and reflection properties of the head, pinna and torso which differ
from one person to another, it is unique for each person. The HRIR is measured in an
anechoic room and hence depends solely on the morphology of the listener.
2.2. Surround sound generation
Today a variety of multichannel transmission formats and end-user configurations are
available for conveying a 3-D audio scene to a listener. An example is a 5.1 DVD recording
intended for reproduction over a standard 3/2-stereo loudspeaker layout. In addition to the
practical choice of a multichannel transmission or storage and rendering format, various
microphone recording techniques or electronic spatialization methods can be used to encode
the directional information in the chosen multichannel format. This section gives a brief
introduction to the commonly available techniques for reproducing the desired directional
information over headphones or a number of loudspeakers located at known positions
surrounding the listening area. These techniques can be classified into three main approaches
[6]:
• Sound field reconstruction methods: The objective is to control an acoustical
variable of the sound field (pressure, velocity) at or around a reference measuring
point in the listening area. This reference point is usually the sweet spot where the
auditory image created during rendition is as desired by the mixer.
• Discrete panning techniques: The knowledge of the desired apparent direction of the
sound is used to selectively feed the closest loudspeakers in the reproduction system
based on a panning law.
• Head-related stereophony (binaural recording or binaural synthesis): The intent is to
control the acoustic pressure at the ears of the listener via headphone or loudspeaker
playback.
The most extensively used method for creating surround sound of various formats is discrete
amplitude panning. During the recording of the surround sound, each input source of the
mixing console receives a monophonic recorded or synthetic signal which is devoid of the
room effect, from an individual sound source. A panning module called the panoramic
potentiometer (or panpot) is used to spatialize each source by multiplying the source signal
with gains corresponding to each of the output channels. These gains are determined by a
panning law depending on the desired source location. The commonly used panning laws
include constant gain optimization (or amplitude preserving law) and constant power
optimization (or energy preserving law). All the individual source components are then added
together to give the final multichannel audio. The inter-channel level difference (ICLD)
arising from the different channel gains for each source is translated into an ITD at the
listener’s ears for frequencies below 1.5 kHz. Additionally, the source signals might be fed to
an artificial reverberator which delivers several uncorrelated reverberation signals to the main
output channels, thus reproducing a diffuse immersive room effect, in which every sound
source can contribute a different intensity. The direct sound level and reverberation level can
be adjusted individually in each source channel in order to control the perceived distance of
the corresponding sound source.
2.3. Panning laws
Amplitude panning refers to techniques in which a monophonic audio channel is applied to all
or a subset of the loudspeakers with different gains. Depending on the gain relationships, the
listener perceives a virtual source, also known as a phantom source, in a direction that does
not necessarily match with the direction of any of the loudspeakers. Although the created
sound field does not match the sound field created by a single sound source, listeners perceive
it as such [2]. The best playback of stereo audio is obtained by placing the two speakers
symmetrically with respect to the median plane, in front of the listener. Consequently, they
are referred to as the left (L) and right (R) speakers. Usually the speakers are placed at an
angle of 30˚ with respect to the median plane, as shown in Figure 2.6.
Figure 2.6 Reproduction of two-channel stereo
The total system gain and the total power are two important attributes of a panning law. For a
system with an N-channel output, the total gain and the total power for source i are given by

$G_i = \sum_{j=1}^{N} a_{ij}$   (2.1)

$P_i = \sum_{j=1}^{N} a_{ij}^2$   (2.2)

where $a_{ij}$ is the gain for the ith source at the jth channel.
The constant gain law requires that the total gain, which is the sum of the gains for all
channels corresponding to a particular source, be a constant. In the two channel case, this
implies that the gain linearly decreases in one channel as it is increased in the other. The
angles are considered to be positive when measured in the anticlockwise direction. The gains
aL and aR given to the left and right speakers are obtained as follows
$a_L = \frac{\theta_0 + \theta}{2\theta_0} \quad \& \quad a_R = \frac{\theta_0 - \theta}{2\theta_0}, \quad \text{where } \theta_0 \geq \theta \geq -\theta_0$   (2.3)

where θ is the desired angle and $\theta_0$ is the angle of the speakers with respect to the median plane (30° in Figure 2.6).
The constant power law requires that the total power, which is the sum of the squares of the
gains for all channels corresponding to a source, be a constant. In the two-channel case, this
constraint results in the gains $a_L$ and $a_R$ as follows

$a_L = \cos\theta' \quad \& \quad a_R = \sin\theta', \quad \text{where } \theta' = \frac{(\theta_0 - \theta)}{2\theta_0} \cdot 90°$   (2.4)

If there are N sources present in the system, the mixing parameters for each of them are
determined as described previously. Thereafter, the left and right channels i.e. XL(t) and XR(t)
are obtained by adding the individual components as follows
$X_L(t) = \sum_{i=1}^{N} a_{iL}\, s_i(t)$ and $X_R(t) = \sum_{i=1}^{N} a_{iR}\, s_i(t)$   (2.5)

where $s_i(t)$ is the time-domain signal corresponding to the ith source, and $a_{iL}$ and $a_{iR}$ are the gains
or mixing parameters corresponding to the ith source. Thus, each channel is actually a mixture
of the individual sources obtained by their linear combination. Besides, there is no relative
delay in the components corresponding to each source in the two channels. Thus, the mixing
process is linear and instantaneous.
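To make the two panning laws concrete, here is a minimal sketch of constant-gain and constant-power stereo panning following equations (2.3)-(2.5); the function names, the default 30° speaker angle and the example signals are illustrative choices, not prescriptions from this thesis.

```python
import numpy as np

def pan_gains(theta_deg, theta0_deg=30.0, law="power"):
    """Stereo gains (aL, aR) for a source at azimuth theta (eqs. 2.3/2.4).

    Angles are positive anticlockwise; theta0 is the speaker angle.
    """
    if law == "gain":   # constant-gain law, eq. (2.3): aL + aR = 1
        aL = (theta0_deg + theta_deg) / (2 * theta0_deg)
        aR = (theta0_deg - theta_deg) / (2 * theta0_deg)
    else:               # constant-power law, eq. (2.4): aL^2 + aR^2 = 1
        theta_p = np.radians((theta0_deg - theta_deg) * 90.0 / (2 * theta0_deg))
        aL, aR = np.cos(theta_p), np.sin(theta_p)
    return aL, aR

def mix_stereo(sources, angles_deg, law="power"):
    """Instantaneous stereo mixing of eq. (2.5): each channel is a linear
    combination of the sources with no relative delay between channels."""
    xL = sum(pan_gains(a, law=law)[0] * s for s, a in zip(sources, angles_deg))
    xR = sum(pan_gains(a, law=law)[1] * s for s, a in zip(sources, angles_deg))
    return xL, xR

# Example: two 1 s tones panned half-left and half-right.
fs = 44100
t = np.arange(fs) / fs
s1, s2 = np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 660 * t)
xL, xR = mix_stereo([s1, s2], [15.0, -15.0])
```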
Using these techniques, it is possible to spatialize sources to locations between the speakers
only. To place virtual sources outside this region, some additional processing based on
psychoacoustic principles is required as is being done in sound retrieval system (SRS)
technology. When more than two speakers are present in the system, only the two speakers
closest to the desired source location can be considered to be active. Using this assumption,
the gains for the two adjacent speakers can be determined using one of the aforementioned
laws, while the gains for all the other speakers are set to zero. This is known as pair-wise
panning.
Chapter 3. Audio Spatialization
Sounds we hear are normally perceived to be located in the space around us and are usually
associated with sources which are visible or which we know to be there. During stereophonic
recording of music, the audio mixer virtually moves the sound sources using amplitude
panning that would give the desired response when reproduced with loudspeakers placed at
some distance from the user. But it is a common experience that when these stereophonic
recordings are presented by means of headphones, the sound images are localized within the
head [2], this phenomenon of in-head localization is known as lateralization. This is because
of the fact that these recorded signals lack the appropriate interaural time difference (ITD),
interaural intensity difference (IID) and body reflection cues associated with the real-world
sources. The consequence is that the music thus recorded is not ideal to be reproduced via
headphones, at least in terms of the truthfulness of reproduction of the desired auditory
environment. The sound arriving at the listener’s ears corresponds to an unnatural sound field
increasing the listener’s fatigue. Hence the stereophonic loudspeaker audio needs to be
specially processed to obtain a signal that can generate the effect of auditory immersion over
headphones. The techniques for including the appropriate real-world cues into the
stereophonic audio and making it compatible for the headphones are discussed in the
following sections and chapters. This process of spatial rendering for conversion of the
available audio configuration into the desired listening configuration is termed as audio
spatialization.
3.1. Stages of audio spatialization
As described in the previous section audio spatialization is a process of realistic spatial
rendering of audio into the desired listening configuration from the available audio format, in
our case, the available format is the stereophonic loudspeaker audio and the desired listening
configuration is headphones. The approach considered in this work
involves separating the individual sources from the multi-channel audio mixture and then re-
creating the desired listener-end configuration by suitable re-combination of the individual
spatialized sources. The stages involved in the audio spatialization process are shown in
Figure 3.1.
• Analysis (source separation) – the individual source signals and their locations in the 3D space are estimated from the available mixtures
• Re-synthesis (convolving with HRIRs) – the estimated sources are externalized to the desired locations by filtering them with the head related impulse responses (HRIRs)
Figure 3.1 Audio spatialization block diagram
3.1.1. Analysis – source separation
The process of extracting the individual sources from a set of observations from sensors such
as microphones (mixtures) is called source separation. When the information about the mixing
process and sources is limited, the problem is known as ‘blind’ source separation (BSS). The
classical example is the cocktail party problem, where a number of people are talking
simultaneously in a room (like at a cocktail party), and one is trying to follow one of the
discussions. The human brain can handle this sort of auditory source separation problem, but
it is a very tricky problem in signal processing. Several approaches have been proposed for
the solution of this problem but development is currently still very much in progress. The
separation of a superposition of multiple signals is accomplished by taking into account the
structure of the mixing process and by making assumptions about the sources. Some of the
successful approaches are principal component analysis (PCA) and independent component
analysis (ICA), which work well when there are no delays or echoes present; that is, the
problem is simplified a great deal. By assuming that sources can be represented sparsely in a
given basis, recent research has demonstrated that solutions to previously problematic blind source
separation problems can be obtained. In some cases, solutions are possible to problems
intractable by previous non-sparse methods. Indeed, sparse methods provide a powerful
approach to the separation of mixtures [3].
3.1.2. Re-synthesis – convolution with HRIRs
Each surround sound system has a pre-determined position for each of the speakers. The
audio for each system is recorded taking this factor into account. With this a priori knowledge
about the speaker locations, one of the simplest methods of spatialization would be to
convolve each of the individual input channels with the HRIRs of the corresponding speaker
and then sum the results [5]. The location of each speaker determines the HRIR to be used
for that speaker. The HRIRs can be obtained from the CIPIC [7] or the KEMAR [8] database.
This process is depicted in Figure 3.2.
Let $x_m(t)$ be the mth channel signal. The filters $h_{mL}(t)$ and $h_{mR}(t)$ represent the HRIRs
corresponding to the mth speaker for the left and right ear respectively. The left and right ear
signals for playback over headphones are then given by yL(t) and yR(t) as follows:
$y_L(t) = \sum_{m} h_{mL}(t) * x_m(t)$   (3.1)

$y_R(t) = \sum_{m} h_{mR}(t) * x_m(t)$   (3.2)
Figure 3.2 Time-domain virtualization based on HRIRs
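A minimal sketch of the time-domain virtualization in equations (3.1) and (3.2); the HRIRs would in practice be read from a database such as CIPIC or KEMAR, and here are simply assumed to be supplied as NumPy arrays.

```python
import numpy as np
from scipy.signal import fftconvolve

def virtualize(channels, hrirs_left, hrirs_right):
    """Headphone virtualization per eqs. (3.1)-(3.2).

    channels    : list of M channel signals x_m(t), all of one length
    hrirs_left  : list of M left-ear HRIRs h_mL(t), all of one length
    hrirs_right : list of M right-ear HRIRs h_mR(t), all of one length
    Returns the binaural pair (y_L, y_R).
    """
    # Convolve each channel with its speaker's HRIR and sum over speakers.
    yL = sum(fftconvolve(x, hL) for x, hL in zip(channels, hrirs_left))
    yR = sum(fftconvolve(x, hR) for x, hR in zip(channels, hrirs_right))
    return yL, yR
```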
Using this method, the sources that are active only on a single channel can be convincingly
virtualized over headphones, i.e. a rendering can be achieved that generates a sense of
externalization and accurate spatial positioning of the source. However, a sound source that is
partially panned across channels in the recording may not be convincingly reproduced.
Consider a set of input signals, each of which is an amplitude-scaled version of the source s(t):

$x_m(t) = a_m s(t)$   (3.3)

With these inputs, equations (3.1) and (3.2) become

$y_L(t) = s(t) * \left( \sum_m a_m h_{mL}(t) \right)$   (3.4)

$y_R(t) = s(t) * \left( \sum_m a_m h_{mR}(t) \right)$   (3.5)

The source s(t) is thus rendered through a combination of HRIRs for multiple different
directions instead of via the correct HRIR for the actual desired source location. Unless the
combined HRIRs correspond to closely spaced channels, this combination of HRIRs will
significantly degrade the spatial image. This is one of the drawbacks of this method. To
rectify this, the desired source signals and locations can be estimated from the multiple
channels and then the corresponding source signals can be spatialized to the appropriate
location obtained from the estimation.
Chapter 4. Sparsity-based Source Separation
Source separation arises in a variety of signal processing applications, ranging from speech
processing to medical image analysis. The process of extracting the individual sources from a
set of observations from sensors such as microphones (mixtures) is called source separation.
When the information about the mixing process and sources is limited, the problem is known
as ‘blind’ source separation (BSS). Generally, the problem is stated as follows:
Given M mixtures of N sources mixed via an unknown (M x N) mixing matrix A,
estimate the underlying sources from the mixtures.
BSS of acoustic signals is often referred to as the cocktail party problem, that is, the separation
of individual voices from a myriad of voices in an uncontrolled acoustic environment such as
a cocktail party.
4.1. Classification of source separation algorithms
BSS algorithms can be categorized according to the assumptions they make about the mixing
model. Thus, one can classify them based on mixing parameters considered in the mixing
model or based on the number of mixtures and sources considered in the mixing model.
4.1.1. Based on mixing parameters considered in the mixing model
Environmental assumptions about the surroundings in which the sensor observations are made
also influence the complexity of the problem. Sensor observations in a natural environment
are confounded by signal reverberations, and consequently, the estimated un-mixing process
needs to identify a source arriving from multiple directions at different times as one individual
source. Generally, BSS techniques depart from this difficult real world scenario and make less
realistic assumptions about the environment so as to make the problem more tractable. There
are typically three assumptions that are made about the environment:
• instantaneous mixtures
• anechoic mixtures
• echoic mixtures
The most rudimentary of these is the instantaneous case, where sources are assumed to arrive
instantly at the sensors but with different signal intensity. An extension of the previous
assumption, where arrival delays between sensors are also considered, is known as the
anechoic case. The anechoic case can be further extended by considering multiple paths
between each source and sensor, which results in the echoic case, sometimes also known as
convolutional mixing. Each case can be extended to incorporate linear additive noise. But the
presence of noise in the system increases the complexity of the source separation process.
Separation becomes even more challenging if the sources are assumed to be mobile. Most
systems assume that the sources are static as in the case of instantaneous and anechoic
mixtures, whereas the echoic case signifies the most natural and general situation.
4.1.2. Based on number of mixtures and sources in the mixing model
The source separation algorithms can also be categorized based upon the assumptions made
related to the number of mixtures and the number of sources considered in the mixing model.
There are typically three assumptions that are made
• over-determined (M > N)
• even-determined (M = N)
• under-determined (M < N)
where M is the number of mixtures and N is the number of sources.
When M ≥ N, separation of sources can be achieved by constructing an unmixing matrix W,
where $W = A^{-1}$ up to permutation and scaling of the rows. The dimensionality of the mixing
process influences the complexity of source separation. If M = N, the mixing process is
defined by an even-determined (i.e. square) matrix A and, provided that it is non-singular, the
underlying sources can be estimated by a linear transformation. And if M > N, the mixing
process is defined by an over-determined matrix A and, provided that it is full rank, the
underlying sources can be estimated by least-squares optimization or linear transformation
involving matrix pseudo-inversion. If M < N, the mixing process A is defined by an under-
determined matrix and consequently source estimation becomes more involved and is usually
achieved by some non-linear techniques.
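Since this thesis focuses on the harder under-determined case, it is worth seeing how direct the other two cases are; below is a minimal numerical sketch under the instantaneous, noise-free mixing assumption (the Laplacian sources and random mixing matrix are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(0)

# Instantaneous, noise-free mixing: X = A @ S, with M mixtures of N sources.
N, M, T = 2, 3, 1000                 # over-determined example (M > N)
S = rng.laplace(size=(N, T))         # source signals (rows)
A = rng.normal(size=(M, N))          # mixing matrix, assumed full rank
X = A @ S

# With A known, the even-determined case (M == N) uses inv(A); the
# over-determined case (M > N) uses the least-squares pseudo-inverse.
# np.linalg.pinv covers both.
S_hat = np.linalg.pinv(A) @ X
print(np.allclose(S, S_hat))         # True in the noise-free case
```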
4.2. Source separation algorithms: A review
The separation process is accomplished by taking into account the structure of the mixing
process and making assumptions about the sources. Several methods exist that attempt to
solve the BSS problem under various assumptions and conditions. Usually the following
assumptions about the nature of the sources are made in order to make the source separation
algorithm more tractable:
• statistical independence of sources [9] • sparse decomposition of sources into some basis (time-frequency dictionaries) [10] • sparsity of sources in some time-frequency representations [11] [12]
One such approach is independent component analysis (ICA) which is based on the
assumption that the sources are statistically independent [9]. This technique extracts N sources
from N instantaneous mixtures. This algorithm can be extended for the case of instantaneous
under-determined mixtures. There also exist algorithms that demix under-determined
anechoic mixtures; one such algorithm is a complex independent component analysis
technique used to solve the BSS problem for electroencephalographic (EEG) data.
An alternative approach to the BSS problem for under-determined instantaneous mixtures is
to assume that the sources have sparse expansion with respect to some basis. In this case, one
can formulate the source extraction problem as a constrained l1 minimization problem, which
typically yields a convex program [10].
Another approach to demix under-determined anechoic mixtures, called degenerate unmixing
estimation technique (DUET), was proposed by Yilmaz and Rickard [11]. This algorithm uses
sparsity of music/speech signals in the short time Fourier transform (STFT) domain to
construct binary time-frequency masks, which are then used to extract several sources from
only two mixtures. Another algorithm, presented in [12], uses an lq-minimization-based
approach with q < 1 for estimation of the sources in the STFT domain.
4.3. Mixing models
Suppose we have N time domain sources s1(t), s2(t), . . . , sN(t) and M mixtures x1(t), x2(t), . . . ,
xM(t) such that
$x_i(t) = \sum_{j=1}^{N} a_{ij}\, s_j(t - \delta_{ij}), \quad i = 1, 2, \ldots, M$   (4.1)

where $a_{ij}$ are the attenuation coefficients and $\delta_{ij}$ are the time delays associated with the path
from the jth source to the ith receiver (sensor). Equation (4.1) defines an anechoic mixing model.
With $\delta_{ij} = 0$, equation (4.1) defines the instantaneous mixing model. The problem of anechoic
signal unmixing is therefore to identify the attenuation coefficient and the relative delay
associated with each source. An illustration of the anechoic and under-determined case (M =
2, N = 3) is provided in Figure 4.1.
The echoic case of BSS considers not only transmission delays but reverberations too. This
results in a more involved generative model that in turn makes finding a solution more
difficult.
$x_i(t) = \sum_{j=1}^{N} \sum_{l=1}^{L} a_{ij}^{l}\, s_j(t - \delta_{ij}^{l}), \quad i = 1, 2, \ldots, M$   (4.2)

where L is the number of paths the source signal can take to the sensors. An illustration of the
echoic case is provided in Figure 4.2.
Figure 4.1 Mixing models - anechoic mixing
Figure 4.2 Mixing model - echoic mixing
Source: P. O’Grady, B. Pearlmutter and S. Rickard, “Survey of sparse and non-sparse methods in source separation,” International Journal of Imaging Systems and Technology, vol. 15, no. 1, 2005
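For concreteness, the discrete-time sketch below synthesizes the anechoic, under-determined stereo case of equation (4.1) with M = 2 and N = 3; the attenuation values, integer-sample delays and Laplacian test sources are arbitrary illustrative choices.

```python
import numpy as np

def anechoic_mix(sources, attens, delays):
    """Discrete-time anechoic mixing per eq. (4.1).

    sources : (N, T) array of source signals s_j
    attens  : (M, N) attenuation coefficients a_ij
    delays  : (M, N) delays delta_ij in samples (non-negative integers)
    """
    N, T = sources.shape
    M = attens.shape[0]
    X = np.zeros((M, T))
    for i in range(M):
        for j in range(N):
            d = delays[i, j]
            # Delayed, attenuated copy of source j at sensor i.
            X[i, d:] += attens[i, j] * sources[j, : T - d]
    return X

# Under-determined stereo example: M = 2 mixtures of N = 3 sources.
rng = np.random.default_rng(1)
S = rng.laplace(size=(3, 16000))
A = np.array([[1.0, 1.0, 1.0], [0.9, 0.7, 0.5]])   # attenuations
D = np.array([[0, 0, 0], [2, 5, 9]])               # delays (samples)
X = anechoic_mix(S, A, D)
```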
4.4. Sparsity-based source separation
In order to make the source separation problem more tractable, one usually needs to make
certain assumptions about the nature of the sources. Such assumptions form the basis for most
source separation algorithms and include statistical properties such as independence and
stationarity. One increasingly popular and powerful assumption is that the sources have a
sparse representation in a given basis. These methods are known as sparsity-based source
separation methods. A signal is said to be sparse when most of its coefficients are zero (or
nearly zero) valued, i.e. the major part of the signal energy is concentrated in very few
coefficients of the signal.
The advantage of a sparse signal representation is that the probability of two or more sources
being simultaneously active is low. Thus, sparse representations lend themselves to good
separability because most of the energy in a basis coefficient at any time instant belongs to a
single source. Additionally, sparsity can be used in many instances to perform source
separation in the case when there are more sources than sensors. A sparser representation of
an acoustic signal can often be achieved by a transformation into a Fourier, Gabor or Wavelet
basis.
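As a quick numerical illustration of this point, the sketch below compares the Gini-index sparsity (one of the measures examined in Chapter 6, here in the Hurley-Rickard formulation) of a multi-tone signal in the time domain and in an STFT basis; the test signal and window length are arbitrary illustrative choices.

```python
import numpy as np
from scipy.signal import stft

def gini_index(c):
    """Gini-index sparsity: 0 for a flat signal, approaching 1 for a sparse one.

    Computed on the sorted magnitudes of the coefficients.
    """
    c = np.sort(np.abs(np.ravel(c)))
    n = c.size
    k = np.arange(1, n + 1)
    return 1 - 2 * np.sum((c / c.sum()) * (n - k + 0.5) / n)

# Illustrative signal: a few sinusoids, dense in time but sparse in frequency.
fs = 8000
t = np.arange(fs) / fs
x = sum(np.sin(2 * np.pi * f0 * t) for f0 in (300, 900, 1700))

_, _, X = stft(x, fs=fs, nperseg=512)
print(f"Gini (time domain): {gini_index(x):.3f}")   # lower
print(f"Gini (STFT domain): {gini_index(X):.3f}")   # higher: sparser basis
```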
4.5. Stages of sparsity-based source separation
The block representation of sparsity-based source separation algorithm is shown in Figure 4.3.
The various steps involved in the algorithm are as follows (a minimal code skeleton of the pipeline is sketched after Figure 4.3):
• Time-frequency transform – Transformation of the available mixtures into some
sparse time-frequency representation such as short-time Fourier transform (STFT)
• Mixing parameter estimation – Estimation of the mixing parameters is done by
clustering the ratios of the time-frequency representations of the mixtures
• Source estimation – Using the estimates of the mixing parameters, the estimates of
each of the individual sources in the time-frequency domain is obtained by using an
appropriate source estimation algorithm like DUET, LQBP or DASSS
• Inverse time-frequency transform – Finally the time-frequency estimates of each of
the individual sources are inverted back to time-domain using an appropriate inverse
time-frequency transformation to recover the original sources
Figure 4.3 Block diagram of sparsity-based source separation
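Below is a minimal code skeleton of this four-stage pipeline for the two-mixture case; the estimator arguments stand in for DUET, LQBP or DASSS, and the 1024-sample STFT window is an assumed setting rather than one prescribed here.

```python
from scipy.signal import stft, istft

def separate(x1, x2, fs, estimate_mixing_parameters, estimate_sources):
    """Sparsity-based source separation skeleton (Figure 4.3).

    The two estimation stages are passed in as functions, standing in for
    DUET, LQBP or DASSS.
    """
    # 1. Time-frequency transform of both mixtures.
    _, _, X1 = stft(x1, fs=fs, nperseg=1024)
    _, _, X2 = stft(x2, fs=fs, nperseg=1024)

    # 2. Mixing parameter estimation from the mixture ratios.
    params = estimate_mixing_parameters(X1, X2)

    # 3. Source estimation in the time-frequency domain.
    S_hats = estimate_sources(X1, X2, params)

    # 4. Inverse time-frequency transform back to the time domain.
    return [istft(S, fs=fs, nperseg=1024)[1] for S in S_hats]
```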
4.6. Source assumptions
4.6.1. Local stationarity
The windowed Fourier transform of a signal s(t) is obtained as

$F^W[s(\cdot)](\omega, \tau) = \int_{-\infty}^{\infty} W(t - \tau)\, s(t)\, e^{-i\omega t}\, dt$   (4.3)

The windowed Fourier transform of s(t) defined in equation (4.3) will be referred to as
$S^W(\omega, \tau)$ where appropriate. Using equation (4.3) and the following Fourier transform pair,

$s(t - \delta) \leftrightarrow e^{-i\omega\delta}\, \hat{s}(\omega)$   (4.4)
we have

$F^W[s(\cdot - \delta)](\omega, \tau) = e^{-i\omega\delta}\, F^W[s(\cdot)](\omega, \tau)$   (4.5)

when W(t) ≡ 1. However, when W(t) is a windowing function, equation (4.5) is not necessarily
true. This can be thought of as a form of a narrowband assumption in array processing [13], but
this label is perhaps misleading in that speech is not narrowband, and local stationarity seems
a more appropriate moniker. For DUET, it is necessary that equation (4.5) holds for all δ, |δ| ≤
Δ, even when W(t) has finite support [14]. Here Δ is the maximum time difference possible in the
mixing model (the microphone spacing divided by the speed of sound signal propagation).
4.6.2. Microphone spacing
Additionally, one crucial issue is that the DUET algorithm is based on extracting an attenuation and delay parameter estimate for each time-frequency bin. We utilize the local stationarity assumption to turn a delay in time into a multiplicative factor in time-frequency. Of course, this multiplicative factor $e^{-i\omega\delta}$ uniquely specifies δ only if $|\omega\delta| < \pi$, as otherwise we have an ambiguity due to phase-wrap [15]. So we require
$$|\omega\delta| < \pi, \quad \forall \omega, \forall \delta \qquad (5.4)$$
to avoid phase ambiguity. This is guaranteed when the microphones are separated by less than $\pi c/\omega_m$, where $\omega_m$ is the maximum (angular) frequency present in the sources and c is the speed of sound. For example, for sources band-limited to 8 kHz and c ≈ 343 m/s, the required spacing is under about 2.1 cm.
4.6.3. W-disjoint orthogonality
Given a windowing function W(t), we call two functions $s_j(t)$ and $s_k(t)$ W-disjoint orthogonal if the supports of the windowed Fourier transforms of $s_j(t)$ and $s_k(t)$ are disjoint [11]. The W-disjoint orthogonality assumption can be stated concisely as
$$\hat{s}_j(\omega, \tau)\, \hat{s}_k(\omega, \tau) = 0, \quad \forall j \neq k,\ \forall \omega, \tau \qquad (5.5)$$
This assumption is the mathematical idealization of the condition that every time-frequency point in the mixture with significant energy is likely to be dominated by the contribution of a single source.
4.7. Mixing parameter estimation technique
The assumptions of anechoic mixing and local stationarity allow us to rewrite the mixing
equation (4.1) in the time-frequency domain as,
$$\begin{bmatrix} \hat{x}_1(\omega, \tau) \\ \hat{x}_2(\omega, \tau) \end{bmatrix} = \begin{bmatrix} 1 & \cdots & 1 \\ a_1 e^{-i\omega\delta_1} & \cdots & a_N e^{-i\omega\delta_N} \end{bmatrix} \begin{bmatrix} \hat{s}_1(\omega, \tau) \\ \vdots \\ \hat{s}_N(\omega, \tau) \end{bmatrix} \qquad (5.6)$$
With the further assumption of W-disjoint orthogonality, at most one source is active at every (ω, τ), so the mixing process can be described for each (ω, τ) and for some j as
$$\begin{bmatrix} \hat{x}_1(\omega, \tau) \\ \hat{x}_2(\omega, \tau) \end{bmatrix} = \begin{bmatrix} 1 \\ a_j e^{-i\omega\delta_j} \end{bmatrix} \hat{s}_j(\omega, \tau) \qquad (5.7)$$
where j is the index of the source active at (ω, τ).
Now, we can calculate the relative amplitude and delay parameters associated with one
source, using
$$\left( \tilde{a}(\omega, \tau),\, \tilde{\delta}(\omega, \tau) \right) = \left( \left| \frac{\hat{x}_2(\omega, \tau)}{\hat{x}_1(\omega, \tau)} \right|,\ -\frac{1}{\omega}\, \Im\!\left( \log \frac{\hat{x}_2(\omega, \tau)}{\hat{x}_1(\omega, \tau)} \right) \right) \qquad (5.8)$$
for some j, where $\Im(\cdot)$ denotes taking the imaginary part. Using (5.8), every (ω, τ) yields an estimate pair for the relative attenuation-delay parameters associated with one source. For W-disjoint orthogonal signals, if we calculate the attenuation-delay estimates from a number of time-frequency points, we would expect to see clusters around the true mixing parameters for each source.
If we now construct a two-dimensional weighted histogram of these estimates, the number of peaks found gives an estimate of the number of sources, and the peak centres give the estimates of the attenuation-delay parameters associated with each source. From these estimates of the mixing parameters we then construct the time-frequency masks which de-mix the mixtures.
The main observation that DUET leverages is that the ratio of the time-frequency
representations of the mixtures does not depend on the source components but only on the
mixing parameters associated with the active source component [15]. Thus the successful extraction of the mixing parameters relies on the sparsity of speech in the time-frequency domain.
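The following is a minimal sketch, under the anechoic two-mixture model above, of how the per-bin estimates of equation (5.8) can be pooled into the two-dimensional weighted histogram. It assumes omega holds the non-zero angular frequency of each STFT row; the histogram is weighted by the mixture power so that low-energy bins contribute little.

```python
# A sketch of mixing-parameter estimation via eq. (5.8) and a 2-D
# weighted histogram; peaks in `hist` locate the (attenuation, delay)
# pairs of the sources.
import numpy as np

def attenuation_delay_histogram(X1, X2, omega, bins=50):
    eps = 1e-12                                # guard against division by zero
    R = (X2 + eps) / (X1 + eps)                # mixture ratio per T-F bin
    a = np.abs(R)                              # relative attenuation estimates
    d = -np.imag(np.log(R)) / omega[:, None]   # relative delay estimates
    w = np.abs(X1 * X2)                        # power weighting per bin
    hist, a_edges, d_edges = np.histogram2d(
        a.ravel(), d.ravel(), bins=bins, weights=w.ravel())
    return hist, a_edges, d_edges
```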
4.8. Source estimation techniques
Once the mixing parameters are estimated, each of the individual source signals is extracted
from the mixtures. The unmixing can be either a hard assignment of each time-frequency component of the mixture to a single source, a soft assignment to multiple sources, or a combination of both (where hard assignment is used for some time-frequency bins and soft assignment for others, based on some criterion that decides which to apply).
4.8.1. Degenerate unmixing estimation technique (DUET)
If the number of sources is equal to the number of mixtures, the non-degenerate case, the
standard demixing method is to invert the mixing matrix. The mixing model for two sources
can be written as,
$$\begin{bmatrix} \hat{x}_1(\omega, \tau) \\ \hat{x}_2(\omega, \tau) \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ a_1 e^{-i\omega\delta_1} & a_2 e^{-i\omega\delta_2} \end{bmatrix} \begin{bmatrix} \hat{s}_1(\omega, \tau) \\ \hat{s}_2(\omega, \tau) \end{bmatrix} \qquad (5.9)$$
When the number of sources is greater than the number of mixtures, the degenerate case,
matrix inversion is no longer possible. Nevertheless, in this case one can still de-mix by
partitioning the time-frequency plane using one of the mixtures based on the estimated mixing
parameters [11].
For W-disjoint orthogonal signals, using equation (5.5), we know that every time-frequency bin in the mixture corresponds to $\hat{s}_i(\omega, \tau)$ for some i. Moreover, the ratio $\hat{x}_2(\omega, \tau)/\hat{x}_1(\omega, \tau)$ depends only on the mixing parameters associated with one source. Thus, for each time-frequency point, we can determine which of the N peaks in the 2-D histogram of attenuation-delay estimates is closest to the $(\tilde{a}, \tilde{\delta})$ estimate for the given $(\omega, \tau)$ [11]. The following likelihood function is used to produce a measure of closeness:
$$J(\omega, \tau) = \operatorname*{argmin}_{j} \frac{\left| a_j e^{-i\omega\delta_j}\, \hat{x}_1(\omega, \tau) - \hat{x}_2(\omega, \tau) \right|^2}{1 + a_j^2} \qquad (5.10)$$
and then each time-frequency point is assigned to the mixing parameter estimate via the mask
$$M_j(\omega, \tau) = \begin{cases} 1, & J(\omega, \tau) = j \\ 0, & \text{otherwise} \end{cases} \qquad (5.11)$$
Essentially, (5.10) and (5.11) assign each time-frequency point to the mixing parameter pair which best explains the mixtures at that particular time-frequency point. We de-mix via masking and maximum-likelihood combining:
$$\hat{s}_j(\omega, \tau) = M_j(\omega, \tau)\, \frac{\hat{x}_1(\omega, \tau) + a_j e^{i\omega\delta_j}\, \hat{x}_2(\omega, \tau)}{1 + a_j^2} \qquad (5.12)$$
Then the original sources are reconstructed from their time-frequency representations by converting them back into the time domain.
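A minimal sketch of this demixing step, assuming the peak attenuations a and delays d have already been extracted from the histogram, might look as follows; X1 and X2 are the mixture STFTs and omega the angular frequency of each row.

```python
# DUET demixing per equations (5.10)-(5.12): assign each T-F point to the
# best-fitting (a_j, d_j) pair, then mask and ML-combine the two mixtures.
import numpy as np

def duet_demix(X1, X2, a, d, omega):
    W = omega[:, None]                          # broadcast over time frames
    costs = [np.abs(aj * np.exp(-1j * W * dj) * X1 - X2) ** 2 / (1.0 + aj ** 2)
             for aj, dj in zip(a, d)]           # eq. (5.10) for each source j
    J = np.argmin(np.stack(costs), axis=0)      # index of best-fitting source
    sources = []
    for j, (aj, dj) in enumerate(zip(a, d)):
        mask = (J == j)                         # binary mask, eq. (5.11)
        # maximum-likelihood combining of the two mixtures, eq. (5.12)
        sources.append(mask * (X1 + aj * np.exp(1j * W * dj) * X2)
                       / (1.0 + aj ** 2))
    return sources
```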
4.8.2. Lq-basis pursuit (LQBP)
We have seen that DUET assumes only one active source at every time-frequency bin, but in practice this assumption does not always hold. Since multiple sources may be present at many time-frequency bins, the LQBP algorithm relaxes this assumption: it allows at most M (the number of mixtures) sources to be active at every time-frequency bin [12].
The LQBP algorithm proposed in [12] separates N sources from M mixtures. The task is accomplished by extracting, at each time-frequency point, the at most M active sources that minimize an lq sparsity measure via lq-basis-pursuit. The following assumptions are required to ensure an accurate recovery of the sources.
• No more than M sources are active at each time-frequency point.
• The columns of the mixing matrix were accurately extracted in the mixing model recovery stage.
• The mixing matrix is full rank.
First, the mixing matrix is constructed from the mixing parameter estimates obtained from the previous stage:
$$\hat{A} = \begin{bmatrix} \hat{a}_{11} e^{-i\omega\hat{\delta}_{11}} & \cdots & \hat{a}_{1N} e^{-i\omega\hat{\delta}_{1N}} \\ \vdots & & \vdots \\ \hat{a}_{M1} e^{-i\omega\hat{\delta}_{M1}} & \cdots & \hat{a}_{MN} e^{-i\omega\hat{\delta}_{MN}} \end{bmatrix} \qquad (5.13)$$
where the $\hat{a}_{jk}$ are the estimated attenuation parameters and the $\hat{\delta}_{jk}$ are the estimated delay parameters, computed as discussed in the previous section. Note that each column of $\hat{A}$ is normalized to be a unit vector.
The goal now is to compute good estimates $\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_N$ of the original sources $s_1, s_2, \ldots, s_N$. These estimates must satisfy
$$\hat{A}\hat{s} = \hat{x} \qquad (5.14)$$
where $\hat{s} = [\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_N]^T$ is the vector of source estimates in the time-frequency domain. At each time-frequency bin, equation (5.14) provides M equations (corresponding to the M available mixtures) with N > M unknowns $(\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_N)$. Assuming that this system of equations is consistent, it has infinitely many solutions. To choose a reasonable estimate among these solutions, we exploit the sparsity of the source vector in the time-frequency domain.
The problem can be formally stated as [12]
$$\min_{\hat{s}} \|\hat{s}\| \quad \text{subject to} \quad \hat{A}\hat{s} = \hat{x} \qquad (5.15)$$
where $\|\hat{s}\|$ denotes some measure of the sparsity of a vector. Given a vector $u = (u_1, u_2, \ldots, u_n) \in \mathbb{R}^n$, one measure of its sparsity is simply the number of non-zero components of u, commonly denoted $\|u\|_0$; the corresponding problem is called $P_0$. But in general, the sparsity of the Gabor coefficients of speech signals essentially means that most of the coefficients are small, though not identically zero. In this case, $P_0$ fails miserably. Alternatively, one can consider
$$\|u\|_q = \left( \sum_{i} |u_i|^q \right)^{1/q} \qquad (5.16)$$
where $0 < q \leq 1$, as a measure of sparsity. Here, a smaller q signifies increased importance of the sparsity of u. Such a problem statement is commonly called the $P_q$ problem.
The solution to $P_q$ is identical to the solution of the lq-basis-pursuit (LQBP) problem, given by
$$\text{LQBP}: \quad \min_{\hat{s}} \|\hat{s}\|_q \quad \text{subject to} \quad \hat{A}\hat{s} = \hat{x}, \ \|\hat{s}\|_0 \leq M \qquad (5.17)$$
Note that to solve the LQBP problem, one needs to find the best basis for the column space of $\hat{A}$ that minimizes the lq norm of the solution vector. The solution of LQBP is given by the solution of
$$\min_{\sigma} \left\| B_{\sigma}^{-1} \hat{x} \right\|_q \qquad (5.18)$$
for $B_{\sigma} = [\hat{A}(\sigma_1) \,|\, \ldots \,|\, \hat{A}(\sigma_M)]$, where $\sigma = (\sigma_1, \ldots, \sigma_M)$ ranges over the choices of M columns of $\hat{A}$.
4.8.3. Delay and scale subtraction scoring (DASSS)
We have seen that DUET uses a nearest-neighbour approach to demix the sources, which suffers in cases where W-disjoint orthogonality is violated. Specifically, if two sources contribute to the energy in a particular time-frequency bin, the mixing parameter estimates will lie between those of the contributing sources. In some cases, this might generate mixing parameter estimates whose nearest neighbour is actually a third source, so we may spuriously assign energy to a source that is not contributing. On the other hand, we saw that LQBP assumes at most M (the number of mixtures) active sources at every time-frequency bin, which may not always be the case, since only one source might be active in some time-frequency bins. In this case too, we might spuriously assign energy to a source which is not present at that particular time-frequency bin. Thus a new demixing method called delay and scale subtraction scoring (DASSS) was presented in [16]; it is less erratic than the nearest-neighbour method and highlights when the W-disjoint orthogonality assumptions of the DUET system are not valid. Furthermore, this technique identifies the time-frequency bins where multiple sources are actually present and uses a source-aware demixing technique for those bins.
It should be noted that we require reliable estimates of the mixing parameters, which is also the case with the other two source separation techniques (DUET and LQBP). In this technique, N new signals $Y_i$, each of which entirely eliminates a particular source $\hat{s}_i$, are created in the following manner:
$$Y_i = X_1 - \frac{1}{a_i e^{-i\omega\delta_i}}\, X_2 \qquad (5.19)$$
It should be noted that the multiplicative factor applied to $X_2$ corresponds to a scaling and delay in the time domain. Hence this source-eliminating technique is called delay and scale subtraction scoring, or DASSS. $Y_i$ can also be written in the following way:
$$Y_i = \alpha_{1,i}\, \hat{s}_1 + \alpha_{2,i}\, \hat{s}_2 + \cdots + \alpha_{N,i}\, \hat{s}_N, \quad \text{where } \alpha_{k,i} \equiv 1 - \frac{a_k e^{-i\omega\delta_k}}{a_i e^{-i\omega\delta_i}} \qquad (5.20)$$
If exactly one source, say source j, is active at a specific time-frequency bin, we have the following:
$$Y_j = 0 \qquad (5.21)$$
$$Y_i = \alpha_{j,i}\, \hat{s}_j \qquad (5.22)$$
$$Y_i = \alpha_{j,i}\, X_1 \qquad (5.23)$$
Equation (5.23) reveals that if only one source is active, the N values in the set of $Y_i$ for a given bin can be predicted using only the known α values and the given mixture $X_1$. In fact, N sets of such predictions can be formed, each assuming one guessed active source g. (We will use $\hat{Y}_i^g$ to denote the prediction of the i-th Y value when assuming only source g is active.) Now the predicted set of values can be compared with the actual observed set of $Y_i$. If exactly one source is active, only its corresponding prediction will fit the observed $Y_i$. Further, the following scoring function can be used to compare predicted values to the observed values:
$$f(g) = \frac{\sum_i \left| Y_i - \hat{Y}_i^g \right|}{\sum_i |Y_i|} \qquad (5.24)$$
If the error is sufficiently small for a particular g in a given bin, we consider that only one source, namely g, is active in that bin and assign the energy of the mixture in that bin to source g. If the error is too large, we apply the multi-source demixing approach discussed below.
For the bins where no single-source model scores well enough, it is reasonable to conclude that at least two sources must be present, and hence the problem of partitioning the input into those two sources may be solved by simply inverting the mixing matrix restricted to those two sources. The two sources whose fractional errors f(g) are lowest are assumed to be the active sources.
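A minimal sketch of the DASSS scoring at a single time-frequency bin, under the reconstruction of equations (5.19)-(5.24) given above, might look as follows; X1 and X2 are the (scalar) mixture values at the bin, a and d the estimated parameters, and w the bin's angular frequency.

```python
# DASSS scoring at one T-F bin: Y[i] eliminates source i (eq. 5.19);
# if a single source g is active, Y is predicted by alpha[g, i] * X1,
# and the fractional error f(g) of eq. (5.24) measures the fit.
import numpy as np

def dasss_scores(X1, X2, a, d, w):
    m = a * np.exp(-1j * w * d)          # per-source mixing factors
    Y = X1 - X2 / m                      # eq. (5.19): Y[i] cancels source i
    f = np.empty(len(a))
    for g in range(len(a)):
        alpha = 1.0 - m[g] / m           # alpha[g, i] of eq. (5.20), all i
        Y_pred = alpha * X1              # prediction assuming only source g
        f[g] = np.sum(np.abs(Y - Y_pred)) / np.sum(np.abs(Y))
    return f                             # small f[g] suggests source g alone
```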
Chapter 5. Adaptive Time-Frequency Representation
Time-frequency representations describe signals in terms of their frequency content at a given
time. These representations are useful for analyzing signals varying both in time and
frequency. For speech and music signals where we have continuously time-varying frequency
content, frequency-domain representations alone are inadequate because they give only spectral information and no time information, i.e. they fail to convey when, in time, the different events in the signal occur. The short-time Fourier transform is one of the most widely used
approaches to time-frequency analysis.
In case of the sparsity-based source separation techniques, all the processing i.e. the
estimation of mixing parameters and the estimation of the sources is carried out in the time-
frequency domain. The major assumption in such source separation techniques is that the
underlying sources are W-disjoint orthogonal i.e. only one source is active in every time-
frequency bin. But practically such an assumption is not always valid; so it is at least assumed
that at every time-frequency bin only one of the sources has dominant energy. It is expected
that if the time-frequency representation of the mixture is sparse, then the assumption of the
W-disjoint orthogonality can be satisfied to a greater extent. One such time-frequency
representation which provides a sparse representation is the short-time Fourier transform.
5.1. Short-time Fourier transform
The short-time Fourier transform is the most widely used method for studying non-stationary
signals like music/speech. The concept behind it is simple and powerful. Suppose we listen to
a piece of music that lasts an hour where in the beginning there are violins and at the end
drums. If we Fourier analyze the whole hour, the energy spectrum will show peaks at the
frequencies corresponding to the violins and drums. That will tell us that there were violins
and drums but will not give us any indication of when the violins and drums were played. The
most straightforward thing to do is to break up the hour into five minute segments and Fourier
analyze each interval. Upon examining the spectrum of each segment we will see in which
five minute intervals the violins and drums occurred. If we want to localize even better, we
break up the hour into one minute segments or even smaller time intervals and Fourier
analyze each segment. That is the basic idea of the short-time Fourier transform – break up
the signal into small time segments and Fourier analyze each time segment to ascertain the
frequencies that existed in that segment. The totality of such spectra indicates how the
spectrum is varying in time.
The short-time Fourier transform (STFT) of the signal x(t) is defined as [17]
$$F^W(x(\cdot))(\omega, \tau) = \int_{-\infty}^{\infty} W(t - \tau)\, x(t)\, e^{-i\omega t}\, dt \qquad (6.1)$$
where W(t) is the window function. W(t) can be considered as a window that selects a
particular portion of the signal centred around the given time location, and the Fourier
transform of the windowed signal yields the frequency content of the signal at the given time.
If we want good time localization we have to pick a narrow window in the time domain, and
if we want good frequency localization we have to pick a narrow window in the frequency
domain (i.e. a long window in time domain). But both the time domain window and the
frequency domain window cannot be made arbitrarily narrow; hence there is an inherent
trade-off between time and frequency localization in the spectrogram for a particular window.
The degree of trade-off depends on the window, signal, time, and frequency. We have just
seen that one window, in general, cannot give good time and frequency localization. That
should not cause any problem of principle as long as we look at the spectrogram as a tool at
our disposal that has many options including the choice of window. There is no reason why
we cannot change the window depending on what we want to study. That can sometimes be
done effectively, but not always. Sometimes a compromise window does very well.
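The trade-off can be made concrete with a small experiment: analyzing a signal containing both a steady tone and a click with a short and a long window. The numbers printed below are only the frame-grid resolutions, but they illustrate that the long window resolves the tone sharply in frequency while smearing the click in time, and vice versa.

```python
# Illustration of the time-frequency trade-off of a fixed-window STFT.
import numpy as np
from scipy.signal import stft

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)          # steady tone
x[fs // 2] += 10.0                        # impulsive click at 0.5 s

for nperseg in (256, 2048):               # short vs long analysis window
    f, tt, X = stft(x, fs=fs, nperseg=nperseg)
    print(f"window {nperseg:4d} samples: "
          f"freq. resolution {f[1] - f[0]:6.1f} Hz, "
          f"time step {(tt[1] - tt[0]) * 1e3:5.1f} ms")
```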
5.2. Need for data-adaptive time-frequency representations
Most algorithms for underdetermined separation as mentioned earlier are based on the
assumption that the signals are sparse in some domain. In most cases, the sparser the sources,
the less they will overlap when mixed (i.e. the more disjoint their mixture will be), and
consequently the easier their separation will be. The most widely used transform for the
purpose of sparsification in the context of blind source separation has been the STFT. The
choice of the window significantly affects the signal concentration in the STFT. In fact, the
uniform frequency and time resolution it offers are disadvantageous for the task of speech or
music separation. When the pitch of a source varies slightly, the variation of its higher harmonics is proportionally larger. This amplified variation of the higher harmonic frequencies can be accurately tracked only with shorter windows. Hence, a variation in the window duration with
frequency is expected to result in a more concentrated representation for some signals.
A second problem is that for signals having several different components occurring at
different instants, the best window differs for each component. A sparse representation is
obtained for the harmonic and impulsive parts of the signal by analyzing the segments with a
long and a short window respectively. Thus the fact that different windows are appropriate for different signal components suggests the use of a data-dependent, time- and frequency-varying window function for analysis, to achieve high concentration and resolution of any signal component present at any time-frequency location.
5.3. Data-adaptive time-frequency representations
As discussed in the previous section, the STFT performs well in terms of the concentration
and resolution of a given signal component when a properly chosen window is used. An
adaptive time-frequency representation was proposed in [4] by D. L. Jones and T. W. Parks.
In order to track the different signal components we need to have an adaptive window whose
parameters are dependent on the signal. The window function selected for this purpose was
the Gaussian function. Thus there are two parameters (the real and the imaginary part of the
Gaussian parameter) of the window function that are equivalent in terms of their fundamental
time-frequency concentration. They differ, though, in the time-frequency concentration they
provide for a particular signal component. A local signal concentration measure is used to
compute the Gaussian window parameter to achieve maximum concentration of the locally
dominant signal component at every time-frequency location. This procedure automates the
choice of the window and thus overcomes the problem of window selection in the short-time
Fourier transform. The adaptive time-frequency representation of a signal x(t) is given as [4]
$$A(t, \omega) = \int_{-\infty}^{\infty} x(\tau) \left( \frac{2\,\Re[C_{t,\omega}]}{\pi} \right)^{1/4} e^{-C_{t,\omega}(\tau - t)^2}\, e^{-i\omega\tau}\, d\tau \qquad (6.2)$$
which projects the signal x(τ) onto the unit-energy Gaussian basis elements
$$\left( \frac{2\,\Re[C_{t,\omega}]}{\pi} \right)^{1/4} e^{-C_{t,\omega}(\tau - t)^2}\, e^{-i\omega\tau} \qquad (6.3)$$
Here $C_{t,\omega}$ is the Gaussian parameter, which can vary with time and frequency. The adaptive time-frequency representation in equation (6.2) differs from the STFT with a Gaussian window in that the Gaussian parameter $C_{t,\omega}$ may vary with time and frequency. The basic idea behind the
adaptive time-frequency representation is that the extra degrees of freedom, namely the real
and the imaginary parts of the Gaussian parameter at every time-frequency location, can
improve the performance over that of a fixed-window STFT. The performance of the adaptive
time-frequency representation depends on the selection of the adaptive Gaussian parameters.
Those Gaussian parameters are selected for a particular time-frequency location which
maximizes the local concentration measure. The following local concentration measure was
used in [4]
$$V = \frac{\displaystyle\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} |F(t, \omega)|^4\, dt\, d\omega}{\left( \displaystyle\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} |F(t, \omega)|^2\, dt\, d\omega \right)^2} \qquad (6.4)$$
which is the fourth power of the L4 norm divided by the fourth power of the L2 norm of the magnitude of the short-time Fourier transform. This measure is very similar to kurtosis in statistics and to other equivalent measures of peakedness or sharpness.
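For a discrete STFT the measure reduces to a simple ratio of sums; a minimal sketch:

```python
# Concentration measure of eq. (6.4) for a discrete T-F array X: the
# fourth power of the L4 norm over the fourth power of the L2 norm of
# the magnitudes. Larger values indicate a peakier representation.
import numpy as np

def concentration(X):
    mag = np.abs(X)
    return np.sum(mag ** 4) / np.sum(mag ** 2) ** 2
```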
Motivated by this concept of data-adaptive time-frequency representation, which provides better resolution than a fixed-window STFT, we can apply the same adaptiveness to obtain sparser time-frequency representations for the application of blind source separation. The requirement in the blind source separation problem is that the time-frequency representations of the mixtures be as sparse as possible, so that the underlying sources satisfy the W-disjoint orthogonality criterion to a greater extent. So here the adaptation can be used to maximize the sparsity of the time-frequency representation.
Most real world signals are essentially stationary over short intervals of time. Consequently, a
sparse representation could be obtained by analyzing each frame with a window that has been
optimized for the frame. Long windows give a sparser representation for frames containing
steady frequency components than when shorter windows are used. On the contrary, the time-
frequency representation of impulses or onsets of events is sparser with short windows. It has also been observed that simply varying the length of the analysis window changes the sparsity of the time-frequency representation [21], and usually there exists an optimum length for which the sparsity is maximum. But the selection of the optimum analysis window length depends on the signal.
So instead of adapting the time-frequency representation at every time-frequency location as in [4], for the application of blind source separation the adaptation can be restricted to time alone, i.e. different analysis window lengths can be used at different time instants. The reason for restricting the adaptation to time only is that the blind source separation application demands reconstruction of the time-frequency representations for estimation of the source signals in the time domain (this problem is discussed in detail in the next section). Now
the next problem that needs to be addressed is which concentration measure to use for the adaptation process, i.e. by what selection criterion to choose the optimum analysis window length. Commonly used sparsity measures such as kurtosis and the Gini index can serve this purpose. This aspect is investigated thoroughly in the next chapter.
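For reference, a minimal sketch of the Gini index as a sparsity measure, following its usual definition over the sorted coefficient magnitudes normalized by the l1 norm (values near 1 indicate a sparse vector), is:

```python
# Gini index of a coefficient vector c as a sparsity measure; assumes
# c is not identically zero.
import numpy as np

def gini_index(c):
    mag = np.sort(np.abs(np.ravel(c)))     # ascending magnitudes
    N = mag.size
    k = np.arange(1, N + 1)
    return 1.0 - 2.0 * np.sum((mag / mag.sum()) * (N - k + 0.5) / N)
```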
The adaptive transformations used to obtain the time-frequency representation are non-linear, i.e. the representation of the sum of two signals might not be equal to the sum of the time-frequency representations of the individual signals. This depends on the window sequence
chosen for obtaining the time-frequency components in each of the signals. However, if the
same sequence of windows is used to obtain the time-frequency representations for the
mixture as well as the individual signals, the transformation can be considered to be linear.
This linearity property is vital during the estimation of sources in the source separation
algorithm.
5.3.1. Steps to obtain a data-adaptive time-frequency representation of a signal
The procedure to obtain the data-adaptive time-frequency representation is as follows (a minimal code sketch follows the list):
a) first select the set of analysis window sizes to be used for adaptation purpose (say 30
ms, 60 ms, 90 ms)
b) now for a particular time-instant, using an analysis window select a portion of the
signal and then Fourier analyze the selected signal
c) repeat step (b) using all the analysis window sizes selected for the purpose of
adaptation
d) once the Fourier spectra using all the different analysis windows selected for adaptation have been obtained, select, using an appropriate concentration measure, the optimal spectrum which gives the best resolution (which concentration measures to use for the adaptation is discussed in chapter 7); furthermore, the adaptation can be carried out over various frequency bands depending on the requirement
e) note down the analysis window size that was used for obtaining the best resolution
f) then based on the analysis window size used for this time-instant and the technique to be used for reconstruction of the signal (discussed in section 6.4), decide an
appropriate hop size (i.e. step size) and proceed to the next time-instant
g) finally at the new time-instant, follow the adaptation procedure discussed in steps (b),
(c), (d), (e) and (f) until the end of the signal is reached
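The following is a minimal sketch of steps (a)-(g), assuming kurtosis as the concentration measure and a fixed hop size; the band limits restrict the adaptation region as in Figure 5.1.

```python
# Frame-based adaptive window selection per steps (a)-(g): each frame is
# analyzed with every candidate window length and the length maximizing
# the spectral kurtosis within [lo_hz, hi_hz] is kept.
import numpy as np
from scipy.stats import kurtosis

def adaptive_windows(x, fs, sizes_ms=(30, 60, 90), hop_ms=10,
                     lo_hz=1000, hi_hz=3000):
    hop = int(fs * hop_ms / 1000)
    chosen = []                                 # selected window size per frame
    n = 0
    while n + int(fs * max(sizes_ms) / 1000) <= len(x):
        best_size, best_score = None, -np.inf
        for ms in sizes_ms:                     # steps (b)-(c): try each window
            N = int(fs * ms / 1000)
            spec = np.abs(np.fft.rfft(x[n:n + N] * np.hamming(N)))
            freqs = np.fft.rfftfreq(N, 1 / fs)
            band = spec[(freqs >= lo_hz) & (freqs <= hi_hz)]
            score = kurtosis(band)              # step (d): concentration measure
            if score > best_score:
                best_score, best_size = score, ms
        chosen.append(best_size)                # step (e): record the winner
        n += hop                                # step (f): advance by the hop
    return chosen
```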
Figure 5.1: Data-adaptive time-frequency representation of a singing voice using frame-based adaptation (window function: hamming; window sets for adaptation: 30, 60 and 90 ms; hop size: 10 ms; concentration measure: kurtosis; adaptation region: 1000 to 3000 Hz)
Figure 5.1 shows the data-adaptive time-frequency representation of a singing voice obtained using the above procedure. The red dashed line shows the window size selected for each frame. The window function used is Hamming, the window sizes used for the adaptation are 30, 60 and 90 ms, the hop size is 10 ms, the concentration measure used for adaptation is kurtosis, and the region over which the adaptation is carried out is 1000 to 3000 Hz.