Data-Adaptive Source Separation for Audio Spatialization
Submitted in partial fulfilment of the requirements for the degree of
Master of Technology
(Electronic Systems)
by
Pradeep Gaddipati
08307029
Under the guidance of
Prof. Preeti Rao
and
Prof. V. Rajbabu
Department of Electrical Engineering
INDIAN INSTITUTE OF TECHNOLOGY BOMBAY
June 2010
Dedication
I dedicate this thesis to my family. Without their patience, understanding,
support and most of all love, the completion of this work would not have been
possible.
Dissertation Approval for Master of Technology
This dissertation entitled Data-adaptive source separation for audio
spatialization by Pradeep Gaddipati (Roll no. 08307029) is approved for the
degree of Master of Technology in Electrical Engineering.
Prof. Preeti Rao _______________________ (Supervisor)
Prof. V. Rajbabu _______________________ (Co-supervisor)
Dr. Samudravijaya K. _____ __________________ (External Examiner)
Prof. Prem C. Pandey _______________________ (Internal Examiner)
Prof. K. P. Karunakaran _______________________ (Chairman)
June 17th, 2010
Acknowledgments
I express my sincere gratitude towards Prof. Preeti Rao and Prof. V. Rajbabu for the guidance
and support they gave me during this project. The regular discussions with them on every
aspect of the research work helped me refine my approach towards the problem and motivated
me to give my best. Working with them in the field of audio signal processing was a very
pleasant learning experience and my interest in the subject has considerably grown.
I am extremely thankful to Nokia, India and specifically, Dr. Pushkar Patwardhan for
providing me with the opportunity of pursuing research in such a remarkable domain. I would
like to thank them for providing the financial support and technical inputs for the work.
I would like to thank Vishweshwara Rao for his valuable suggestions and help during various
stages of my project. I thank all the members of the Digital Audio Processing lab, Department
of Electrical Engineering, IIT Bombay for providing a friendly and enjoyable working
environment.
I would also like to thank my family for their love and moral support. Finally, I thank all the
people who have contributed ideas, concepts and corrections to be incorporated in my project.
The mistakes if any in the final draft finally are all my own.
Pradeep Gaddipati
Abstract
The existing surround audio needs to be spatialized to obtain a signal which can generate the
effect of auditory immersion over headphones. This spatialization process comprises two
stages: separating the individual sources from the available mixtures, and then combining
them to re-create audio compatible with the desired output configuration (in the case of
headphones, the individual sources are convolved with the HRIRs for localization and then
mixed together to form the final output audio). The source separation technique itself
involves four stages: transformation of the mixtures into a sparse time-frequency
representation, estimation of the mixing parameters (i.e. the direction and location of the
sources), estimation of the sources in the time-frequency domain, and finally inversion back
into the time domain using an appropriate inverse time-frequency transformation technique.
Various sparsity-based source separation techniques namely degenerate un-mixing estimation
technique (DUET), lq-basis pursuit (LQBP) and delay and scale subtraction scoring (DASSS)
have been explored for the purpose of estimating mixing parameters and individual sources
from the mixtures. However, their performance is directly coupled to two factors: the
sparsity of the time-frequency representation and the W-disjoint orthogonality of the
underlying sources in the time-frequency representation of the mixtures.
This thesis endeavours to find a time-frequency representation that is sparser and provides a
higher degree of W-disjoint orthogonality amongst the underlying sources in the mixtures
than the time-frequency representation obtained using the short-time Fourier transform
(STFT). With this objective, a time-varying data-adaptive time-frequency representation was
developed and its performance in terms of the aforementioned measures was compared to
that of the fixed-window STFT. The data-adaptive time-frequency representation leads to
better estimation of the mixing parameters, which translates into better separation of sources
from the stereo mixtures. This enables the sources to be better spatialized in the auditory
space with fewer artifacts, as has been observed.
Table of Contents
Dedication ................................................................................................................................... i
Dissertation Approval for Master of Technology .................................................................. ii
Acknowledgments .................................................................................................................... iii
Abstract .................................................................................................................................... iv
Table of Contents ...................................................................................................................... v
List of Figures ........................................................................................................................ viii
List of Tables ............................................................................................................................. x
List of Abbreviations ............................................................................................................... xi
Declaration of Academic Honesty and Integrity ................................................................ xiii
Chapter 1. Introduction ........................................................................................................ 1
Chapter 2. Spatial Audio ....................................................................................................... 4
2.1. Sound localization ........................................................................................................ 4
2.1.1. Binaural cues ........................................................................................................ 5
2.1.2. Monaural spectral cue .......................................................................................... 5
2.1.3. Rotation of the human head .................................................................................. 5
2.1.4. Head related impulse response ............................................................................ 6
2.2. Surround sound generation .......................................................................................... 7
2.3. Panning laws ................................................................................................................ 8
Chapter 3. Audio Spatialization ......................................................................................... 11
3.1. Stages of audio spatialization .................................................................................... 11
3.1.1. Analysis – source separation .............................................................................. 12
3.1.2. Re-synthesis – convolution with HRIRs .............................................................. 12
Chapter 4. Sparsity-based Source Separation .................................................................. 15
4.1. Classification of source separation algorithms .......................................................... 15
4.1.1. Based on mixing parameters considered in the mixing model ........................... 15
4.1.2. Based on number of mixtures and sources in the mixing model ........................ 16
4.2. Source separation algorithms: A review .................................................................... 16
4.3. Mixing models ........................................................................................................... 17
4.4. Sparsity-based source separation ............................................................................... 18
4.5. Stages of sparsity-based source separation ................................................................ 19
4.6. Source assumptions .................................................................................................... 19
4.6.1. Local stationarity ................................................................................ 19
4.6.2. Microphone spacing ........................................................................... 20
4.6.3. W-disjoint orthogonality ..................................................................... 20
4.7. Mixing parameter estimation technique .................................................................... 20
4.8. Source estimation techniques ..................................................................................... 21
4.8.1. Degenerate unmixing estimation technique (DUET) ......................................... 22
4.8.2. Lq-basis pursuit (LQBP) ..................................................................................... 23
4.8.3. Delay and scale subtraction scoring (DASSS) ................................................... 24
Chapter 5. Adaptive Time-Frequency Representation .................................................... 27
5.1. Short-time Fourier transform ..................................................................................... 27
5.2. Need for data-adaptive time-frequency representations ............................................ 28
5.3. Data-adaptive time-frequency representations .......................................................... 29
5.3.1. Steps to obtain a data-adaptive time-frequency representation of a signal ....... 31
5.4. Invertibility of time-frequency representations ......................................................... 33
5.4.1. Frame-based transition-window re-construction technique .............................. 34
5.4.2. Modified (extended) window re-construction technique .................................... 34
5.4.3. Segment-based transition-window re-construction technique ........................... 36
Chapter 6. Concentration Measure .................................................................................... 37
6.1. W-disjoint orthogonality ............................................................................................ 37
6.2. Sparsity ...................................................................................................................... 39
6.2.1. Characteristics of sparsity measures .................................................................. 39
6.2.2. Sparsity measures ............................................................................................... 41
6.3. Relation between sparsity measures and WDO measure ........................................... 42
6.3.1. Steps for obtaining the W-disjoint orthogonality measure for a set of signals .. 44
6.3.2. Steps for obtaining the sparsity measure for a set of signals ............................. 45
Chapter 7. Experiments and Results ................................................................................. 48
7.1. Datasets ...................................................................................................................... 48
7.1.1. BSS Oracle database .......................................................................................... 48
7.1.2. TIMIT speech database ...................................................................................... 48
7.2. Performance evaluation measures.............................................................................. 48
7.3. Performance evaluation ............................................................................................. 49
7.3.1. Setup for performance evaluation test ................................................................ 50
7.3.2. Mixing parameters estimation stage................................................................... 51
7.3.3. Source estimation stage ...................................................................................... 52
Chapter 8. Conclusions and Future Work ........................................................................ 54
8.1. Conclusions ................................................................................................................ 54
8.2. Future work ................................................................................................................ 55
Appendix A. Sinusoid Detection using Data-Adaptive Time-Frequency Representation ..... 56
A.1. Sinusoid detection ...................................................................................................... 56
A.2. Data-adaptive time-frequency representation for sinusoid detection ........................ 57
A.3. Performance of data-adaptive time-frequency representation ................................... 58
A.3.a. Sinusoid signals .................................................................................................. 59
A.3.b. Chirp signals ...................................................................................................... 60
A.3.c. Frequency modulated signals ............................................................................. 61
A.3.d. Mixture of sinusoids and frequency modulated signals...................................... 62
A.3.e. Music/speech signals (real signals) .................................................................... 64
References ............................................................................................................................... 66
List of Figures
Figure 2.1 Binaural cues – interaural time difference (ITD) ...................................................... 6
Figure 2.2 Binaural cues – interaural level difference (ILD) ..................................................... 6
Figure 2.3 Cone of confusion ..................................................................................................... 6
Figure 2.4 Rotation of human head ............................................................................................ 6
Figure 2.5 Monaural spectral cues .............................................................................................. 6
Figure 2.6 Reproduction of two-channel stereo ......................................................................... 8
Figure 3.1 Audio spatialization block diagram ........................................................................ 12
Figure 3.2 Time-domain virtualization based on HRIRs ......................................................... 13
Figure 4.1 Mixing models - anechoic mixing........................................................................... 18
Figure 4.2 Mixing model - echoic mixing ................................................................................ 18
Figure 4.3 Block diagram of sparsity-based source separation ................................................ 19
Figure 5.1: Data-adaptive time-frequency representation of a singing voice using frame-based
adaptation (window function: hamming; window sets for adaptation: 30, 60 and 90 ms; hop
size: 10 ms; concentration measure: kurtosis; adaptation region: 1000 to 3000 Hz) ............... 32
Figure 5.2: Data-adaptive time-frequency representation of a singing voice using segment-
based adaptation (window function: hamming; window sets for adaptation: 30, 60 and 90 ms;
hop size: 10 ms; concentration measure: kurtosis; adaptation region: 1000 to 3000 Hz) ........ 33
Figure 5.3: Frame-based transition-window reconstruction technique .................................... 34
Figure 5.4: Modified (extended) window re-construction technique ....................................... 35
Figure 6.1: W-disjoint orthogonality for time-frequency representations of speech source mixtures as
a function of window size used in the time-frequency transformation. ................................... 43
Figure 6.2: The kurtosis (left) and Gini Index (right) sparsity measures applied to speech
signals in the time-frequency domain as a function of window size. ....................................... 43
Figure 6.3: WDO vs. window size ........................................................................................... 46
Figure 6.4: Sparsity measure (kurtosis) vs. window size ......................................................... 47
Figure 6.5: Sparsity measure (Gini Index) vs. window size ..................................................... 47
Figure A.1 Time-frequency representation of sinusoid signals ................................................ 59
Figure A.2 True hits vs. false alarms plot for the sinusoid signals .......................................... 60
Figure A.3 Time-frequency representation of chirp signal ...................................................... 61
Figure A.4 True hits vs. false alarms plot for chirp signals ..................................................... 61
Figure A.5 Time-frequency representation of frequency modulated signals ........................... 62
Figure A.6 True hits vs. false alarms plot for the frequency modulated signals ...................... 62
Figure A.7 Time-frequency representation of mixture of sinusoid signals and frequency
modulated signal (signal energy, frequency modulated to sinusoid signal = 7 dB) ................. 63
Figure A.8 Time-frequency representation of mixture of sinusoid signals and frequency
modulated signal (signal energy, frequency modulated to sinusoid signal = -3 dB) ............... 64
Figure A.9 True hits vs. false alarms for a mixture of sinusoid signals and frequency
modulated signal ....................................................................................................................... 64
Figure A.10: Data-adaptive time-frequency representation of a singing voice signal ............. 65
List of Tables
Table 6-A: Validation table showing the characteristics satisfied by the sparsity measures
(kurtosis/Gini Index) ................................................................................................................ 42
Table 6-B: Counter-examples for testing whether a sparsity measure satisfies a particular
property, with the desired outcome if the sparsity measure satisfies the property. S(x) denotes
the sparsity measure of x ................................................................................................ 42
Table 7-A: Performance of the mixing parameter estimation stage on BSS oracle (music)
dataset ....................................................................................................................................... 52
Table 7-B: Performance of the mixing parameter estimation stage on BSS oracle (speech)
dataset ....................................................................................................................................... 52
Table 7-C: Performance of the source estimation stage (in time-frequency domain) using
DUET and LQBP algorithms on BSS oracle dataset ............................................................... 53
Table A-A: True hits percentage of sinusoid detection for singing voice for different
frequency bands ........................................................................................................................ 65
List of Abbreviations
Abbreviation Meaning
ATFR Adaptive Time-Frequency Representation
BSS Blind Source Separation
CASA Computational Auditory Scene Analysis
CIPIC Centre for Image Processing and Integrated Computing
COLA Constant Over-Lap Add
DASSS Delay And Scale Subtraction Scoring
DFT Discrete Fourier Transform
DUET Degenerate Unmixing Estimation Technique
DVD Digital Video Disc
EEG Electroencephalography
HRIR Head Related Impulse Response
HRTF Head Related Transfer Function
ICA Independent Component Analysis
ICLD Inter-Channel Level Difference
IEEE Institute of Electrical and Electronics Engineers
ILD Interaural Level Difference
ITD Interaural Time Difference
KEMAR Knowles Electronics Manikin for Acoustic Research
LQBP Lq-Basis Pursuit
MIDI Musical Instrument Digital Interface
OLA Over-Lap Add
PCA Principal Component Analysis
PSR Preserved Signal Ratio
SAR Source to Artifacts Ratio
SD Sparse Decomposition
SDR Source to Distortion Ratio
SIR Source to Interference Ratio
SNR Signal to Noise Ratio
SRS Sound Retrieval System
STFT Short-Time Fourier Transform
TIMIT Texas Instruments Massachusetts Institute of Technology
WDO W-Disjoint Orthogonality
Declaration of Academic Honesty and Integrity
I declare that this written submission represents my ideas in my own words and where others'
ideas or words have been included, I have adequately cited and referenced the original
sources. I also declare that I have adhered to all principles of academic honesty and integrity
and have not misrepresented or fabricated or falsified any idea/data/fact/source in my
submission. I understand that any violation of the above will be cause for disciplinary action
by the Institute and can also evoke penal action from the sources which have thus not been
properly cited or from whom proper permission has not been taken when needed.
Pradeep Gaddipati
08307029
June 17th, 2010
Chapter 1. Introduction
With the proliferation of portable media devices, headphone listening has become
increasingly common; in both mobile and non-mobile listening scenarios, providing a high-
fidelity listening experience over headphones is thus a key value-add (or arguably even a
necessary feature) for modern consumer electronic products. This enhanced headphone
reproduction is relevant for both stereo content such as legacy music recordings as well as
multichannel music and movie soundtracks. The audio, when properly generated, can be used
to render a realistic auditory experience with auditory immersion. An audio signal capable of
this is known as spatial audio.
Spatial audio refers to the rendering of the realistic auditory experience with auditory
immersion. Surround sound, an outcome of the extensive research on spatial audio, refers to
the use of multiple loudspeakers to envelop a person watching a movie or listening to music,
making them feel as if they are in the middle of the action or the concert [1]. The surround
sound tracks enable the audience to hear sounds coming from all around them, contributing to
the sensation of what movie-makers call the suspension of disbelief. Such a technique is only
applicable in the case when the playback devices are placed at a considerable distance from
the listener. The same audio signals are not as effective when headphones are used for
listening.
Headphone reproduction simply consists of presenting a left-channel signal to the
listener’s left ear and likewise a right-channel signal to the right ear. In such headphone
systems, stereo music recordings can obviously be directly rendered by routing the respective
channel signals to the headphone transducers. However, such rendering, which is the default
practice in consumer devices, leads to an in-the-head listening experience, which is counter-
productive to the goal of spatial immersion: sources panned between the left and right
channels are perceived to be originating from a point between the listener’s ears [2]. For audio
content intended for multichannel surround playback (perhaps most notably movie
soundtracks), typically with a front centre channel and multiple surround channels in addition
to the front left and right, direct headphone rendering calls for a down-mix of these additional
channels; in-the-head localization again occurs as for stereo content, and furthermore the
surround spatial image is compromised by elimination of front/back discrimination cues.
Hence this surround audio needs to be spatialized to obtain a signal that can generate the
effect of auditory immersion over headphones. However, re-recording the existing audio in
the new format is infeasible. One possible solution to this problem is audio spatialization,
where the existing spatial audio is processed to obtain surround sound that creates auditory
immersion over headphones.
Given a multi-channel audio mixture as input in any available format, audio spatialization is
the process of realistic spatial rendering of audio in the desired listening configuration (e.g.
over headphones). One approach to this problem involves separating the individual sources
from the multi-channel audio mixture, and then re-creating the desired listener-end mixtures
by suitable recombination of the individual spatialized sources. The success of this approach
hinges on achieving the proper separation of sources from the input multi-channel mixtures.
Various source separation algorithms [3] have been developed based on the different source
models and mixing models.
There exist several successful techniques for blind source separation, such as independent
component analysis (ICA) and sparse decomposition. The sparsity-based techniques require
the sources to be sparse and disjoint-orthogonal in some time-frequency representation; they
exploit the sparsity of music/speech signals in the short-time Fourier transform (STFT)
domain to construct binary time-frequency masks, which are then used to extract
several sources from only two mixtures. It is expected that the performance of the source
separation process can be improved by obtaining a sparser time-frequency representation. The
STFT performs well in terms of concentration and resolution of a given signal component
when a properly chosen window is used. But the proper window function depends on the
data, and no automated procedure currently exists for determining a good window. For
signals such as music and speech, which are composed of several different components at
different time instants, the best window differs from one time instant to another. The fact that
different windows are appropriate for different time instants suggests the use of a data-
dependent, time-varying time-frequency representation [4].
Chapter 2 describes the various aspects of spatial audio. Chapter 3 discusses the various
stages involved in the audio spatialization process and presents techniques for re-synthesis of
the surround sound for headphones. Chapter 4 provides a brief review of the various source
separation algorithms, discusses the source models and mixing models considered for solving
the blind source separation problem, describes the generalized staged procedure for sparsity-
based source separation, and details three sparsity-based source separation techniques, viz.
degenerate unmixing estimation technique (DUET), lq-basis pursuit (LQBP) and delay and
scale subtraction scoring (DASSS). Chapter 5 discusses the time-frequency representations
used in source separation algorithms, the need for data-adaptive time-frequency
representations, and the adaptive time-frequency representation used in this work. Chapter 6
investigates the various concentration measures that can be used to drive the adaptation in the
adaptive time-frequency representations. Experiments to evaluate the performance of the
source separation techniques discussed in Chapter 4 and the time-frequency representations
discussed in Chapter 5 are described in Chapter 7. In Chapter 8, the conclusions and the
future work are presented. Finally, in the appendix, a
detailed discussion of the role of the adaptive time-frequency representation in the sinusoid
detection problem is presented.
Chapter 2. Spatial Audio
Everyday life is full of three-dimensional sound experiences. Humans have the capability to
localize these sound sources even in noisy and reverberant environments. This ability of
humans to make sense of their environments and to interact with them depends strongly on
spatial awareness, and hearing plays a major part in this process. The human auditory
system identifies various cues in the sounds heard at the two ears which indicate the spatial
locations of the sources in the three-dimensional space around the listener. The mechanisms
of sound source localization involve the detection of timing or phase difference between the
ears and of amplitude or spectral difference between the ears. The majority of spatial
perception is dependent on the listener having two ears, although certain monaural cues have
been shown to exist – in other words it is mainly the differences in signals received by the two
ears that matter.
2.1. Sound localization
We listen to speech (as well as other sounds) with two ears, and it is quite remarkable how
well we can separate and selectively attend to individual sound sources in a cluttered
acoustical environment. This ability of the listener to determine the location of the
origination of a sound is termed sound localization. In fact, the familiar term cocktail
party processing was coined in an early study of how the binaural system enables us to
selectively attend to individual conversations when many are present, as in, of course, a
cocktail party. This phenomenon illustrates the important contribution that binaural hearing
makes to auditory scene analysis, by enabling us to localize and separate sound sources. In
addition, the binaural system plays a major role in improving speech intelligibility in noisy and
reverberant environments.
Humans can deduce the various parameters of a source's location, viz. azimuth, elevation and
distance, as well as the spaciousness of the auditory environment, from the sounds heard. This
is on the basis of the different cues introduced into the sound by the pinna, proximate parts of
the human body and the surrounding acoustic environment as it travels from the source to the
eardrum of the listener. Thereafter, the cues are processed by the human brain for determining
the acoustic characteristics of the source and the auditory environment. In general, a potential
acoustical localization cue is any physical aspect of the acoustical waveform reaching a
listener’s ears that is altered by a change in the position of the sound source relative to that of
the listener. The most important cues [5] used by humans are discussed below.
2.1.1. Binaural cues
Binaural localization relies on the comparison of auditory input from two separate detectors;
most evolved auditory systems feature two ears, one on each side of the head.
• Interaural time difference (ITD): This cue arises because of the difference in the
distances between the source and the two ears as seen in Figure 2.1. The resulting
phase shift is used for localization of frequencies below 1.5 kHz. This cue is also
sensitive to the shift in the envelope of the signals at higher frequencies.
• Interaural level difference (ILD): The shadowing of the sound wave by the head as
seen in Figure 2.2 results in the sound having a higher intensity at the ear nearest to the
source, depending on the azimuth. It results in a difference in the energy levels
depending on the frequency. This cue is primarily used at frequencies above 1.5 kHz.
As a result of the symmetry of the human head, sounds originating from many different
directions can share the same ITD and ILD. The locus of all source locations that share the same ITD and
ILD is called the cone of confusion as shown in Figure 2.3. Within the cone of confusion, the
estimation of source location is on the basis of monaural spectral cues and the effect of head
rotation.
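To give a feel for the magnitudes involved, the following sketch evaluates Woodworth's classical spherical-head approximation of the ITD; the model and the head radius assumed in it are standard textbook values rather than anything measured in this work.

```python
import numpy as np

# Woodworth's spherical-head approximation of the interaural time
# difference as a function of source azimuth theta (in radians):
#   ITD ~= (r / c) * (theta + sin(theta))
# Assumed typical values: head radius r = 8.75 cm, speed of sound c = 343 m/s.
def itd_woodworth(theta_rad, r=0.0875, c=343.0):
    return (r / c) * (theta_rad + np.sin(theta_rad))

for az_deg in (0, 30, 60, 90):
    itd_us = itd_woodworth(np.radians(az_deg)) * 1e6
    print(f"azimuth {az_deg:3d} deg -> ITD ~ {itd_us:5.0f} us")  # ~0 to ~660 us
```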
2.1.2. Monaural spectral cue
This cue is primarily used to determine the elevation of the source. Figure 2.5 shows the
measured energies at different frequencies for two different directions of arrival. In each case,
there are two paths from the source to the ear canal – a direct path and a longer path following
a reflection from the pinna. For frequencies in the range 6-16 kHz, the delayed signal is out of
phase with the direct signal, and destructive interference occurs. The greatest interference
occurs when the difference in length is half the wavelength. This produces a notch in the
spectrum as seen in the Figure 2.5. Thus, the elevation of the source can be estimated from the
location of this notch.
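As a worked example of this relationship: the first notch occurs at the frequency whose half-wavelength equals the path-length difference, i.e. f = c/(2Δl); for an assumed pinna path-length difference of 2 cm and c ≈ 343 m/s, this gives 343/0.04 ≈ 8.6 kHz, consistent with the 6-16 kHz range quoted above.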
2.1.3. Rotation of the human head
Typically, a listener directs his head towards the interesting sound source. The change in ITD,
ILD and monaural spectral cues with the rotation of the head helps the listener further
localize the source and resolve confusions (see Figure 2.4).
Figure 2.1 Binaural cues – interaural time
difference (ITD)
Figure 2.2 Binaural cues – interaural level
difference (ILD)
Figure 2.3 Cone of confusion
Figure 2.4 Rotation of human head
Figure 2.5 Monaural spectral cues
Source: Aureal Corporation, “3-D Audio Primer,” Aureal Semiconductor A3D White Paper, 1998.
2.1.4. Head related impulse response
The frequency and position dependent characteristics of the pinna, proximate parts of human
body and ear canal are summarized in the form of the head-related transfer function (HRTF)
and its time-domain analogue is the head-related impulse response (HRIR). As the HRTF
depends on the diffraction and reflection properties of the head, pinna and torso which differ
from one person to another, it is unique for each person. The HRIR is measured in an
anechoic room and hence depends solely on the morphology of the listener.
2.2. Surround sound generation
Today a variety of multichannel transmission formats and end-user configurations are
available for conveying a 3-D audio scene to a listener. An example is a 5.1 DVD recording
intended for reproduction over a standard 3/2-stereo loudspeaker layout. In addition to the
practical choice of a multichannel transmission or storage and rendering format, various
microphone recording techniques or electronic spatialization methods can be used to encode
the directional information in the chosen multichannel format. This section gives a brief
introduction to the commonly available techniques for reproducing the desired directional
information over headphones or a number of loudspeakers located at known positions
surrounding the listening area. These techniques can be classified into three main approaches
[6]:
• Sound field reconstruction methods: The objective is to control an acoustical
variable of the sound field (pressure, velocity) at or around a reference measuring
point in the listening area. This reference point is usually the sweet spot where the
auditory image created during rendition is as desired by the mixer.
• Discrete panning techniques: The knowledge of the desired apparent direction of the
sound is used to selectively feed the closest loudspeakers in the reproduction system
based on a panning law.
• Head-related stereophony (binaural recording or binaural synthesis): The intent is to
control the acoustic pressure at the ears of the listener via headphone or loudspeaker
playback.
The most extensively used method for creating surround sound of various formats is discrete
amplitude panning. During the recording of the surround sound, each input source of the
mixing console receives a monophonic recorded or synthetic signal which is devoid of the
room effect, from an individual sound source. A panning module called the panoramic
potentiometer (or panpot) is used to spatialize each source by multiplying the source signal
with gains corresponding to each of the output channels. These gains are determined by a
panning law depending on the desired source location. The commonly used panning laws
include constant gain optimization (or amplitude preserving law) and constant power
optimization (or energy preserving law). All the individual source components are then added
together to give the final multichannel audio. The inter-channel level difference (ICLD)
arising from the different channel gains for each source is translated into an ITD at the
listener’s ears for frequencies below 1.5 kHz. Additionally, the source signals might be fed to
an artificial reverberator which delivers several uncorrelated reverberation signals to the main
output channels, thus reproducing a diffuse immersive room effect, in which every sound
source can contribute a different intensity. The direct sound level and reverberation level can
be adjusted individually in each source channel in order to control the perceived distance of
the corresponding sound source.
2.3. Panning laws
Amplitude panning refers to techniques in which a monophonic audio channel is applied to all
or a subset of the loudspeakers with different gains. Depending on the gain relationships, the
listener perceives a virtual source, also known as a phantom source, in a direction that does
not necessarily match with the direction of any of the loudspeakers. Although the created
sound field does not match the sound field created by a single sound source, listeners perceive
it as such [2]. The best playback of stereo audio is obtained by placing the two speakers
symmetrically with respect to the median plane, in front of the listener. Consequently, they
are referred to as the left (L) and right (R) speakers. Usually the speakers are placed at an
angle of 30˚ with respect to the median plane, as shown in Figure 2.6.
Figure 2.6 Reproduction of two-channel stereo
The total system gain and the total power are two important attributes of a panning law. For a
system with an N-channel output, the total gain and the total power for source i are given by

$G_i = \sum_{j=1}^{N} a_{ij}$   (2.1)

$P_i = \sum_{j=1}^{N} a_{ij}^2$   (2.2)

where $a_{ij}$ is the gain for the ith source at the jth channel.
The constant gain law requires that the total gain, which is the sum of the gains for all
channels corresponding to a particular source, be a constant. In the two channel case, this
implies that the gain linearly decreases in one channel as it is increased in the other. The
angles are considered to be positive when measured in the anticlockwise direction. The gains
aL and aR given to the left and right speakers are obtained as follows
$a_L = \frac{\theta_0 + \theta}{2\theta_0} \quad \& \quad a_R = \frac{\theta_0 - \theta}{2\theta_0}, \quad \text{where } \theta_0 \geq \theta \geq -\theta_0$   (2.3)

where θ is the desired angle and $\theta_0$ is the angle of the speakers with respect to the median plane (30° in Figure 2.6).
The constant power law requires that the total power, which is the sum of the squares of the
gains for all channels corresponding to a source, be a constant. In the two-channel case, this
constraint results in the gains $a_L$ and $a_R$ as follows

$a_L = \cos\theta' \quad \& \quad a_R = \sin\theta', \quad \text{where } \theta' = \frac{(\theta_0 - \theta)}{2\theta_0} \cdot 90°$   (2.4)

If there are N sources present in the system, the mixing parameters for each of them are
determined as described previously. Thereafter, the left and right channels i.e. XL(t) and XR(t)
are obtained by adding the individual components as follows
$X_L(t) = \sum_{i=1}^{N} a_{iL}\, s_i(t)$ and $X_R(t) = \sum_{i=1}^{N} a_{iR}\, s_i(t)$   (2.5)

where $s_i(t)$ is the time-domain signal corresponding to the ith source, and $a_{iL}$ and $a_{iR}$ are the gains
or mixing parameters corresponding to the ith source. Thus, each channel is actually a mixture
of the individual sources obtained by their linear combination. Besides, there is no relative
delay in the components corresponding to each source in the two channels. Thus, the mixing
process is linear and instantaneous.
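To make the two panning laws concrete, here is a minimal sketch of constant-gain and constant-power stereo panning following equations (2.3)-(2.5); the function names, the default 30° speaker angle and the example signals are illustrative choices, not prescriptions from this thesis.

```python
import numpy as np

def pan_gains(theta_deg, theta0_deg=30.0, law="power"):
    """Stereo gains (aL, aR) for a source at azimuth theta (eqs. 2.3/2.4).

    Angles are positive anticlockwise; theta0 is the speaker angle.
    """
    if law == "gain":   # constant-gain law, eq. (2.3): aL + aR = 1
        aL = (theta0_deg + theta_deg) / (2 * theta0_deg)
        aR = (theta0_deg - theta_deg) / (2 * theta0_deg)
    else:               # constant-power law, eq. (2.4): aL^2 + aR^2 = 1
        theta_p = np.radians((theta0_deg - theta_deg) * 90.0 / (2 * theta0_deg))
        aL, aR = np.cos(theta_p), np.sin(theta_p)
    return aL, aR

def mix_stereo(sources, angles_deg, law="power"):
    """Instantaneous stereo mixing of eq. (2.5): each channel is a linear
    combination of the sources with no relative delay between channels."""
    xL = sum(pan_gains(a, law=law)[0] * s for s, a in zip(sources, angles_deg))
    xR = sum(pan_gains(a, law=law)[1] * s for s, a in zip(sources, angles_deg))
    return xL, xR

# Example: two 1 s tones panned half-left and half-right.
fs = 44100
t = np.arange(fs) / fs
s1, s2 = np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 660 * t)
xL, xR = mix_stereo([s1, s2], [15.0, -15.0])
```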
Using these techniques, it is possible to spatialize sources to locations between the speakers
only. To place virtual sources outside this region, some additional processing based on
psychoacoustic principles is required as is being done in sound retrieval system (SRS)
technology. When more than two speakers are present in the system, only the two speakers
closest to the desired source location can be considered to be active. Using this assumption,
the gains for the two adjacent speakers can be determined using one of the aforementioned
laws, while the gains for all the other speakers are set to zero. This is known as pair-wise
panning.
Chapter 3. Audio Spatialization
Sounds we hear are normally perceived to be located in the space around us and are usually
associated with sources which are visible or which we know to be there. During stereophonic
recording of music, the audio mixer virtually moves the sound sources using amplitude
panning that would give the desired response when reproduced with loudspeakers placed at
some distance from the user. But it is a common experience that when these stereophonic
recordings are presented by means of headphones, the sound images are localized within the
head [2], this phenomenon of in-head localization is known as lateralization. This is because
of the fact that these recorded signals lack the appropriate interaural time difference (ITD),
interaural intensity difference (IID) and body reflection cues associated with the real-world
sources. The consequence is that the music thus recorded is not ideal to be reproduced via
headphones, at least in terms of the truthfulness of reproduction of the desired auditory
environment. The sound arriving at the listener’s ears corresponds to an unnatural sound field
increasing the listener’s fatigue. Hence the stereophonic loudspeaker audio needs to be
specially processed to obtain a signal that can generate the effect of auditory immersion over
headphones. The techniques for including the appropriate real-world cues into the
stereophonic audio and making it compatible for the headphones are discussed in the
following sections and chapters. This process of spatial rendering for conversion of the
available audio configuration into the desired listening configuration is termed as audio
spatialization.
3.1. Stages of audio spatialization
As described in the previous section audio spatialization is a process of realistic spatial
rendering of audio into the desired listening configuration from the available audio format, in
our case, the available format is the stereophonic loudspeaker audio and the desired listening
configuration is headphones. The approach considered in this work
involves separating the individual sources from the multi-channel audio mixture and then re-
creating the desired listener-end configuration by suitable re-combination of the individual
spatialized sources. The stages involved in the audio spatialization process are shown in
Figure 3.1.
• Analysis (source separation) – the individual source signals and their locations in the 3D space are estimated from the available mixtures
• Re-synthesis (convolving with HRIRs) – the estimated sources are externalized to the desired locations by filtering them with the head related impulse responses (HRIRs)
Figure 3.1 Audio spatialization block diagram
3.1.1. Analysis – source separation
The process of extracting the individual sources from a set of observations from sensors such
as microphones (mixtures) is called source separation. When the information about the mixing
process and sources is limited, the problem is known as ‘blind’ source separation (BSS). The
classical example is the cocktail party problem, where a number of people are talking
simultaneously in a room (like at a cocktail party), and one is trying to follow one of the
discussions. The human brain can handle this sort of auditory source separation problem, but
it is a very tricky problem in signal processing. Several approaches have been proposed for
the solution of this problem but development is currently still very much in progress. The
separation of a superposition of multiple signals is accomplished by taking into account the
structure of the mixing process and by making assumptions about the sources. Some of the
successful approaches are principal component analysis (PCA) and independent component
analysis (ICA), which work well when there are no delays or echoes present; that is, the
problem is simplified a great deal. By assuming that sources can be represented sparsely in a
given basis, recent research has demonstrated that solutions to previously problematic blind source
separation problems can be obtained. In some cases, solutions are possible to problems
intractable by previous non-sparse methods. Indeed, sparse methods provide a powerful
approach to the separation of mixtures [3].
3.1.2. Re-synthesis – convolution with HRIRs
Each surround sound system has a pre-determined position for each of the speakers. The
audio for each system is recorded taking this factor into account. With this a priori knowledge
about the speaker locations, one of the simplest methods of spatialization would be to
convolve each of the individual input channels with the HRIRs of the corresponding speaker
and then sum the results [5]. The location of each speaker determines the HRIR to be used
for that speaker. The HRIRs can be obtained from the CIPIC [7] or the KEMAR [8] database.
This process is depicted in Figure 3.2.
Let $x_m(t)$ be the mth channel signal. The filters $h_{mL}(t)$ and $h_{mR}(t)$ represent the HRIRs
corresponding to the mth speaker for the left and right ear respectively. The left and right ear
signals for playback over headphones are then given by yL(t) and yR(t) as follows:
$y_L(t) = \sum_{m} h_{mL}(t) * x_m(t)$   (3.1)

$y_R(t) = \sum_{m} h_{mR}(t) * x_m(t)$   (3.2)
Figure 3.2 Time-domain virtualization based on HRIRs
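A minimal sketch of the time-domain virtualization in equations (3.1) and (3.2); the HRIRs would in practice be read from a database such as CIPIC or KEMAR, and here are simply assumed to be supplied as NumPy arrays.

```python
import numpy as np
from scipy.signal import fftconvolve

def virtualize(channels, hrirs_left, hrirs_right):
    """Headphone virtualization per eqs. (3.1)-(3.2).

    channels    : list of M channel signals x_m(t), all of one length
    hrirs_left  : list of M left-ear HRIRs h_mL(t), all of one length
    hrirs_right : list of M right-ear HRIRs h_mR(t), all of one length
    Returns the binaural pair (y_L, y_R).
    """
    # Convolve each channel with its speaker's HRIR and sum over speakers.
    yL = sum(fftconvolve(x, hL) for x, hL in zip(channels, hrirs_left))
    yR = sum(fftconvolve(x, hR) for x, hR in zip(channels, hrirs_right))
    return yL, yR
```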
Using this method, the sources that are active only on a single channel can be convincingly
virtualized over headphones, i.e. a rendering can be achieved that generates a sense of
externalization and accurate spatial positioning of the source. However, a sound source that is
partially panned across channels in the recording may not be convincingly reproduced.
Consider a set of input signals, each of which is an amplitude-scaled version of the source s(t):

$x_m(t) = a_m s(t)$   (3.3)

With these inputs, equations (3.1) and (3.2) become

$y_L(t) = s(t) * \left( \sum_m a_m h_{mL}(t) \right)$   (3.4)

$y_R(t) = s(t) * \left( \sum_m a_m h_{mR}(t) \right)$   (3.5)

The source s(t) is thus rendered through a combination of HRIRs for multiple different
directions instead of via the correct HRIR for the actual desired source location. Unless the
combined HRIRs correspond to closely spaced channels, this combination of HRIRs will
significantly degrade the spatial image. This is one of the drawbacks of this method. To
rectify this, the desired source signals and locations can be estimated from the multiple
channels and then the corresponding source signals can be spatialized to the appropriate
location obtained from the estimation.
Chapter 4. Sparsity-based Source Separation
Source separation arises in a variety of signal processing applications, ranging from speech
processing to medical image analysis. The process of extracting the individual sources from a
set of observations from sensors such as microphones (mixtures) is called source separation.
When the information about the mixing process and sources is limited, the problem is known
as ‘blind’ source separation (BSS). Generally, the problem is stated as follows:
Given M mixtures of N sources mixed via an unknown (M x N) mixing matrix A,
estimate the underlying sources from the mixtures.
BSS of acoustic signals is often referred to as the cocktail party problem, that is, the separation
of individual voices from a myriad of voices in an uncontrolled acoustic environment such as
a cocktail party.
4.1. Classification of source separation algorithms
BSS algorithms can be categorized according to the assumptions they make about the mixing
model. Thus, one can classify them based on mixing parameters considered in the mixing
model or based on the number of mixtures and sources considered in the mixing model.
4.1.1. Based on mixing parameters considered in the mixing model
Environmental assumptions about the surroundings in which the sensor observations are made
also influence the complexity of the problem. Sensor observations in a natural environment
are confounded by signal reverberations, and consequently, the estimated un-mixing process
needs to identify a source arriving from multiple directions at different times as one individual
source. Generally, BSS techniques depart from this difficult real world scenario and make less
realistic assumptions about the environment so as to make the problem more tractable. There
are typically three assumptions that are made about the environment:
• instantaneous mixtures
• anechoic mixtures
• echoic mixtures
The most rudimentary of these is the instantaneous case, where sources are assumed to arrive
instantly at the sensors but with different signal intensity. An extension of the previous
assumption, where arrival delays between sensors are also considered, is known as the
anechoic case. The anechoic case can be further extended by considering multiple paths
between each source and sensor, which results in the echoic case, sometimes also known as
convolutional mixing. Each case can be extended to incorporate linear additive noise. But the
presence of noise in the system increases the complexity of the source separation process.
Separation becomes even more challenging if the sources are assumed to be mobile. Most
systems assume that the sources are static as in the case of instantaneous and anechoic
mixtures, whereas the echoic case signifies the most natural and general situation.
4.1.2. Based on number of mixtures and sources in the mixing model
The source separation algorithms can also be categorized based upon the assumptions made
related to the number of mixtures and the number of sources considered in the mixing model.
There are typically three assumptions that are made
• over-determined (M > N)
• even-determined (M = N)
• under-determined (M < N)
where M is the number of mixtures and N is the number of sources.
When M ≥ N, separation of sources can be achieved by constructing an unmixing matrix W,
where $W = A^{-1}$ up to permutation and scaling of the rows. The dimensionality of the mixing
process influences the complexity of source separation. If M = N, the mixing process is
defined by an even-determined (i.e. square) matrix A and, provided that it is non-singular, the
underlying sources can be estimated by a linear transformation. And if M > N, the mixing
process is defined by an over-determined matrix A and, provided that it is full rank, the
underlying sources can be estimated by least-squares optimization or linear transformation
involving matrix pseudo-inversion. If M < N, the mixing process A is defined by an under-
determined matrix and consequently source estimation becomes more involved and is usually
achieved by some non-linear techniques.
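Since this thesis focuses on the harder under-determined case, it is worth seeing how direct the other two cases are; below is a minimal numerical sketch under the instantaneous, noise-free mixing assumption (the Laplacian sources and random mixing matrix are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(0)

# Instantaneous, noise-free mixing: X = A @ S, with M mixtures of N sources.
N, M, T = 2, 3, 1000                 # over-determined example (M > N)
S = rng.laplace(size=(N, T))         # source signals (rows)
A = rng.normal(size=(M, N))          # mixing matrix, assumed full rank
X = A @ S

# With A known, the even-determined case (M == N) uses inv(A); the
# over-determined case (M > N) uses the least-squares pseudo-inverse.
# np.linalg.pinv covers both.
S_hat = np.linalg.pinv(A) @ X
print(np.allclose(S, S_hat))         # True in the noise-free case
```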
4.2. Source separation algorithms: A review
The separation process is accomplished by taking into account the structure of the mixing
process and making assumptions about the sources. Several methods exist that attempt to
solve the BSS problem under various assumptions and conditions. Usually the following
assumptions about the nature of the sources are made in order to make the source separation
algorithm more tractable:
• statistical independence of sources [9] • sparse decomposition of sources into some basis (time-frequency dictionaries) [10] • sparsity of sources in some time-frequency representations [11] [12]
One such approach is independent component analysis (ICA) which is based on the
assumption that the sources are statistically independent [9]. This technique extracts N sources
from N instantaneous mixtures. This algorithm can be extended for the case of instantaneous
under-determined mixtures. There also exist algorithms that demix under-determined
anechoic mixtures; one such algorithm is a complex independent component analysis
technique used to solve the BSS problem for electroencephalographic (EEG) data.
An alternative approach to the BSS problem for under-determined instantaneous mixtures is
to assume that the sources have sparse expansion with respect to some basis. In this case, one
can formulate the source extraction problem as a constrained l1 minimization problem, which
typically yields a convex program [10].
Another approach to demix under-determined anechoic mixtures, called degenerate unmixing
estimation technique (DUET), was proposed by Yilmaz and Rickard [11]. This algorithm uses
sparsity of music/speech signals in the short time Fourier transform (STFT) domain to
construct binary time-frequency masks, which are then used to extract several sources from
only two mixtures. Another algorithm, presented in [12], uses an lq-minimization-based
approach with q < 1 for estimation of the sources in the STFT domain.
4.3. Mixing models
Suppose we have N time domain sources s1(t), s2(t), . . . , sN(t) and M mixtures x1(t), x2(t), . . . ,
xM(t) such that
$x_i(t) = \sum_{j=1}^{N} a_{ij}\, s_j(t - \delta_{ij}), \quad i = 1, 2, \ldots, M$   (4.1)

where $a_{ij}$ are the attenuation coefficients and $\delta_{ij}$ are the time delays associated with the path
from the jth source to the ith receiver (sensor). Equation (4.1) defines an anechoic mixing model.
With $\delta_{ij} = 0$, equation (4.1) defines the instantaneous mixing model. The problem of anechoic
signal unmixing is therefore to identify the attenuation coefficient and the relative delay
associated with each source. An illustration of the anechoic and under-determined case (M =
2, N = 3) is provided in Figure 4.1.
The echoic case of BSS considers not only transmission delays but reverberations too. This
results in a more involved generative model that in turn makes finding a solution more
difficult.
$x_i(t) = \sum_{j=1}^{N} \sum_{l=1}^{L} a_{ij}^{l}\, s_j(t - \delta_{ij}^{l}), \quad i = 1, 2, \ldots, M$   (4.2)

where L is the number of paths the source signal can take to the sensors. An illustration of the
echoic case is provided in Figure 4.2.
Figure 4.1 Mixing models - anechoic mixing
Figure 4.2 Mixing model - echoic mixing
Source: P. O’Grady, B. Pearlmutter and S. Rickard, “Survey of sparse and non-sparse methods in source separation,” International Journal of Imaging Systems and Technology, vol. 15, no. 1, 2005
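For concreteness, the discrete-time sketch below synthesizes the anechoic, under-determined stereo case of equation (4.1) with M = 2 and N = 3; the attenuation values, integer-sample delays and Laplacian test sources are arbitrary illustrative choices.

```python
import numpy as np

def anechoic_mix(sources, attens, delays):
    """Discrete-time anechoic mixing per eq. (4.1).

    sources : (N, T) array of source signals s_j
    attens  : (M, N) attenuation coefficients a_ij
    delays  : (M, N) delays delta_ij in samples (non-negative integers)
    """
    N, T = sources.shape
    M = attens.shape[0]
    X = np.zeros((M, T))
    for i in range(M):
        for j in range(N):
            d = delays[i, j]
            # Delayed, attenuated copy of source j at sensor i.
            X[i, d:] += attens[i, j] * sources[j, : T - d]
    return X

# Under-determined stereo example: M = 2 mixtures of N = 3 sources.
rng = np.random.default_rng(1)
S = rng.laplace(size=(3, 16000))
A = np.array([[1.0, 1.0, 1.0], [0.9, 0.7, 0.5]])   # attenuations
D = np.array([[0, 0, 0], [2, 5, 9]])               # delays (samples)
X = anechoic_mix(S, A, D)
```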
4.4. Sparsity-based source separation
In order to make the source separation problem more tractable, one usually needs to make
certain assumptions about the nature of the sources. Such assumptions form the basis for most
source separation algorithms and include statistical properties such as independence and
stationarity. One increasingly popular and powerful assumption is that the sources have a
sparse representation in a given basis. These methods are known as sparsity-based source
separation methods. A signal is said to be sparse when most of its coefficients are zero (or
nearly zero) valued, i.e. the major part of the signal energy is concentrated in very few
coefficients of the signal.
The advantage of a sparse signal representation is that the probability of two or more sources
being simultaneously active is low. Thus, sparse representations lend themselves to good
separability because most of the energy in a basis coefficient at any time instant belongs to a
single source. Additionally, sparsity can be used in many instances to perform source
separation in the case when there are more sources than sensors. A sparser representation of
an acoustic signal can often be achieved by a transformation into a Fourier, Gabor or Wavelet
basis.
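As a quick numerical illustration of this point, the sketch below compares the Gini-index sparsity (one of the measures examined in Chapter 6, here in the Hurley-Rickard formulation) of a multi-tone signal in the time domain and in an STFT basis; the test signal and window length are arbitrary illustrative choices.

```python
import numpy as np
from scipy.signal import stft

def gini_index(c):
    """Gini-index sparsity: 0 for a flat signal, approaching 1 for a sparse one.

    Computed on the sorted magnitudes of the coefficients.
    """
    c = np.sort(np.abs(np.ravel(c)))
    n = c.size
    k = np.arange(1, n + 1)
    return 1 - 2 * np.sum((c / c.sum()) * (n - k + 0.5) / n)

# Illustrative signal: a few sinusoids, dense in time but sparse in frequency.
fs = 8000
t = np.arange(fs) / fs
x = sum(np.sin(2 * np.pi * f0 * t) for f0 in (300, 900, 1700))

_, _, X = stft(x, fs=fs, nperseg=512)
print(f"Gini (time domain): {gini_index(x):.3f}")   # lower
print(f"Gini (STFT domain): {gini_index(X):.3f}")   # higher: sparser basis
```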
4.5. Stages of sparsity-based source separation
The block representation of sparsity-based source separation algorithm is shown in Figure 4.3.
The various steps involved in the algorithm are as follows (a minimal code skeleton of the pipeline is sketched after Figure 4.3):
• Time-frequency transform – Transformation of the available mixtures into some
sparse time-frequency representation such as short-time Fourier transform (STFT)
• Mixing parameter estimation – Estimation of the mixing parameters is done by
clustering the ratios of the time-frequency representations of the mixtures
• Source estimation – Using the estimates of the mixing parameters, the estimates of
each of the individual sources in the time-frequency domain is obtained by using an
appropriate source estimation algorithm like DUET, LQBP or DASSS
• Inverse time-frequency transform – Finally the time-frequency estimates of each of
the individual sources are inverted back to time-domain using an appropriate inverse
time-frequency transformation to recover the original sources
Figure 4.3 Block diagram of sparsity-based source separation
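Below is a minimal code skeleton of this four-stage pipeline for the two-mixture case; the estimator arguments stand in for DUET, LQBP or DASSS, and the 1024-sample STFT window is an assumed setting rather than one prescribed here.

```python
from scipy.signal import stft, istft

def separate(x1, x2, fs, estimate_mixing_parameters, estimate_sources):
    """Sparsity-based source separation skeleton (Figure 4.3).

    The two estimation stages are passed in as functions, standing in for
    DUET, LQBP or DASSS.
    """
    # 1. Time-frequency transform of both mixtures.
    _, _, X1 = stft(x1, fs=fs, nperseg=1024)
    _, _, X2 = stft(x2, fs=fs, nperseg=1024)

    # 2. Mixing parameter estimation from the mixture ratios.
    params = estimate_mixing_parameters(X1, X2)

    # 3. Source estimation in the time-frequency domain.
    S_hats = estimate_sources(X1, X2, params)

    # 4. Inverse time-frequency transform back to the time domain.
    return [istft(S, fs=fs, nperseg=1024)[1] for S in S_hats]
```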
4.6. Source assumptions
4.6.1. Local stationarity
The windowed Fourier transform of a signal s(t) is obtained as

$F^W[s(\cdot)](\omega, \tau) = \int_{-\infty}^{\infty} W(t - \tau)\, s(t)\, e^{-i\omega t}\, dt$   (4.3)

The windowed Fourier transform of s(t) defined in equation (4.3) will be referred to as
$S^W(\omega, \tau)$ where appropriate. Using equation (4.3) and the following Fourier transform pair,

$s(t - \delta) \leftrightarrow e^{-i\omega\delta}\, \hat{s}(\omega)$   (4.4)
we have

$F^W[s(\cdot - \delta)](\omega, \tau) = e^{-i\omega\delta}\, F^W[s(\cdot)](\omega, \tau)$   (4.5)

when W(t) ≡ 1. However, when W(t) is a windowing function, equation (4.5) is not necessarily
true. This can be thought of as a form of a narrowband assumption in array processing [13], but
this label is perhaps misleading in that speech is not narrowband, and local stationarity seems
a more appropriate moniker. For DUET, it is necessary that equation (4.5) holds for all δ, |δ| ≤
Δ, even when W(t) has finite support [14]. Here Δ is the maximum time difference possible in the
mixing model (the microphone spacing divided by the speed of sound signal propagation).
4.6.2. Microphone spacing
Additionally, one crucial issue is that the DUET algorithm is based on extracting an attenuation and delay parameter estimate for each time-frequency bin. We utilize the local stationarity assumption to turn a delay in time into a multiplicative factor in time-frequency. Of course, this multiplicative factor $e^{-i\omega\delta}$ uniquely specifies δ only if $|\omega\delta| < \pi$, as otherwise we have an ambiguity due to phase-wrap [15]. So we require
$$|\omega\delta| < \pi, \quad \forall \omega, \forall \delta \qquad (5.4)$$
to avoid phase ambiguity. This is guaranteed when the microphones are separated by less than $\pi c/\omega_m$, where $\omega_m$ is the maximum (angular) frequency present in the sources and c is the speed of sound. For example, for sources band-limited to 8 kHz and c ≈ 343 m/s, the required spacing is under about 2.1 cm.
4.6.3. W-disjoint orthogonality
Given a windowing function W(t), we call two functions $s_j(t)$ and $s_k(t)$ W-disjoint orthogonal if the supports of the windowed Fourier transforms of $s_j(t)$ and $s_k(t)$ are disjoint [11]. The W-disjoint orthogonality assumption can be stated concisely as
$$\hat{s}_j(\omega, \tau)\, \hat{s}_k(\omega, \tau) = 0, \quad \forall j \neq k,\ \forall \omega, \tau \qquad (5.5)$$
This assumption is the mathematical idealization of the condition that every time-frequency point in the mixture with significant energy is likely to be dominated by the contribution of a single source.
4.7. Mixing parameter estimation technique
The assumptions of anechoic mixing and local stationarity allow us to rewrite the mixing
equation (4.1) in the time-frequency domain as,
$$\begin{bmatrix} \hat{x}_1(\omega, \tau) \\ \hat{x}_2(\omega, \tau) \end{bmatrix} = \begin{bmatrix} 1 & \cdots & 1 \\ a_1 e^{-i\omega\delta_1} & \cdots & a_N e^{-i\omega\delta_N} \end{bmatrix} \begin{bmatrix} \hat{s}_1(\omega, \tau) \\ \vdots \\ \hat{s}_N(\omega, \tau) \end{bmatrix} \qquad (5.6)$$
With the further assumption of W-disjoint orthogonality, at most one source is active at every (ω, τ), so the mixing process can be described for each (ω, τ) and for some j as
$$\begin{bmatrix} \hat{x}_1(\omega, \tau) \\ \hat{x}_2(\omega, \tau) \end{bmatrix} = \begin{bmatrix} 1 \\ a_j e^{-i\omega\delta_j} \end{bmatrix} \hat{s}_j(\omega, \tau) \qquad (5.7)$$
where j is the index of the source active at (ω, τ).
Now, we can calculate the relative amplitude and delay parameters associated with one
source, using
$$\left( \tilde{a}(\omega, \tau),\, \tilde{\delta}(\omega, \tau) \right) = \left( \left| \frac{\hat{x}_2(\omega, \tau)}{\hat{x}_1(\omega, \tau)} \right|,\ -\frac{1}{\omega}\, \Im\!\left( \log \frac{\hat{x}_2(\omega, \tau)}{\hat{x}_1(\omega, \tau)} \right) \right) \qquad (5.8)$$
for some j, where $\Im(\cdot)$ denotes taking the imaginary part. Using (5.8), every (ω, τ) yields an estimate pair for the relative attenuation-delay parameters associated with one source. For W-disjoint orthogonal signals, if we calculate the attenuation-delay estimates from a number of time-frequency points, we would expect to see clusters around the true mixing parameters for each source.
If we now construct a two-dimensional weighted histogram of these estimates, the number of peaks found gives an estimate of the number of sources, and the peak centres give the estimates of the attenuation-delay parameters associated with each source. From these estimates of the mixing parameters we then construct the time-frequency masks which de-mix the mixtures.
The main observation that DUET leverages is that the ratio of the time-frequency
representations of the mixtures does not depend on the source components but only on the
mixing parameters associated with the active source component [15]. Thus the successful extraction of the mixing parameters relies on the sparsity of speech in the time-frequency domain.
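The following is a minimal sketch, under the anechoic two-mixture model above, of how the per-bin estimates of equation (5.8) can be pooled into the two-dimensional weighted histogram. It assumes omega holds the non-zero angular frequency of each STFT row; the histogram is weighted by the mixture power so that low-energy bins contribute little.

```python
# A sketch of mixing-parameter estimation via eq. (5.8) and a 2-D
# weighted histogram; peaks in `hist` locate the (attenuation, delay)
# pairs of the sources.
import numpy as np

def attenuation_delay_histogram(X1, X2, omega, bins=50):
    eps = 1e-12                                # guard against division by zero
    R = (X2 + eps) / (X1 + eps)                # mixture ratio per T-F bin
    a = np.abs(R)                              # relative attenuation estimates
    d = -np.imag(np.log(R)) / omega[:, None]   # relative delay estimates
    w = np.abs(X1 * X2)                        # power weighting per bin
    hist, a_edges, d_edges = np.histogram2d(
        a.ravel(), d.ravel(), bins=bins, weights=w.ravel())
    return hist, a_edges, d_edges
```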
4.8. Source estimation techniques
Once the mixing parameters are estimated, each of the individual source signals is extracted
from the mixtures. The unmixing can be either a hard assignment of each time-frequency component of the mixture to a single source, a soft assignment to multiple sources, or a combination of both (where hard assignment is used for some time-frequency bins and soft assignment for others, based on some criterion that decides which to apply).
4.8.1. Degenerate unmixing estimation technique (DUET)
If the number of sources is equal to the number of mixtures, the non-degenerate case, the
standard demixing method is to invert the mixing matrix. The mixing model for two sources
can be written as,
$$\begin{bmatrix} \hat{x}_1(\omega, \tau) \\ \hat{x}_2(\omega, \tau) \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ a_1 e^{-i\omega\delta_1} & a_2 e^{-i\omega\delta_2} \end{bmatrix} \begin{bmatrix} \hat{s}_1(\omega, \tau) \\ \hat{s}_2(\omega, \tau) \end{bmatrix} \qquad (5.9)$$
When the number of sources is greater than the number of mixtures, the degenerate case,
matrix inversion is no longer possible. Nevertheless, in this case one can still de-mix by
partitioning the time-frequency plane using one of the mixtures based on the estimated mixing
parameters [11].
For W-disjoint orthogonal signals, using equation (5.5), we know that every time-frequency bin in the mixture corresponds to $\hat{s}_i(\omega, \tau)$ for some i. Moreover, the ratio $\hat{x}_2(\omega, \tau)/\hat{x}_1(\omega, \tau)$ depends only on the mixing parameters associated with one source. Thus, for each time-frequency point, we can determine which of the N peaks in the 2-D histogram of attenuation-delay estimates is closest to the $(\tilde{a}, \tilde{\delta})$ estimate for the given $(\omega, \tau)$ [11]. The following likelihood function is used to produce a measure of closeness:
$$J(\omega, \tau) = \operatorname*{argmin}_{j} \frac{\left| a_j e^{-i\omega\delta_j}\, \hat{x}_1(\omega, \tau) - \hat{x}_2(\omega, \tau) \right|^2}{1 + a_j^2} \qquad (5.10)$$
and then each time-frequency point is assigned to the mixing parameter estimate via the mask
$$M_j(\omega, \tau) = \begin{cases} 1, & J(\omega, \tau) = j \\ 0, & \text{otherwise} \end{cases} \qquad (5.11)$$
Essentially, (5.10) and (5.11) assign each time-frequency point to the mixing parameter pair which best explains the mixtures at that particular time-frequency point. We de-mix via masking and maximum-likelihood combining:
$$\hat{s}_j(\omega, \tau) = M_j(\omega, \tau)\, \frac{\hat{x}_1(\omega, \tau) + a_j e^{i\omega\delta_j}\, \hat{x}_2(\omega, \tau)}{1 + a_j^2} \qquad (5.12)$$
Then the original sources are reconstructed from their time-frequency representations by converting them back into the time domain.
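A minimal sketch of this demixing step, assuming the peak attenuations a and delays d have already been extracted from the histogram, might look as follows; X1 and X2 are the mixture STFTs and omega the angular frequency of each row.

```python
# DUET demixing per equations (5.10)-(5.12): assign each T-F point to the
# best-fitting (a_j, d_j) pair, then mask and ML-combine the two mixtures.
import numpy as np

def duet_demix(X1, X2, a, d, omega):
    W = omega[:, None]                          # broadcast over time frames
    costs = [np.abs(aj * np.exp(-1j * W * dj) * X1 - X2) ** 2 / (1.0 + aj ** 2)
             for aj, dj in zip(a, d)]           # eq. (5.10) for each source j
    J = np.argmin(np.stack(costs), axis=0)      # index of best-fitting source
    sources = []
    for j, (aj, dj) in enumerate(zip(a, d)):
        mask = (J == j)                         # binary mask, eq. (5.11)
        # maximum-likelihood combining of the two mixtures, eq. (5.12)
        sources.append(mask * (X1 + aj * np.exp(1j * W * dj) * X2)
                       / (1.0 + aj ** 2))
    return sources
```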
4.8.2. Lq-basis pursuit (LQBP)
We have seen that DUET assumes only one active source at every time-frequency bin, but in practice this assumption does not always hold. Since multiple sources may be present at many time-frequency bins, the LQBP algorithm relaxes this assumption: it allows at most M (the number of mixtures) sources to be active at every time-frequency bin [12].
The LQBP algorithm proposed in [12] separates N sources from M mixtures. The task is accomplished by extracting, at each time-frequency point, the at most M active sources that minimize an lq sparsity measure via lq-basis-pursuit. The following assumptions are required to ensure an accurate recovery of the sources.
• No more than M sources are active at each time-frequency point.
• The columns of the mixing matrix were accurately extracted in the mixing model recovery stage.
• The mixing matrix is full rank.
First, the mixing matrix is constructed from the mixing parameter estimates obtained from the previous stage:
$$\hat{A} = \begin{bmatrix} \hat{a}_{11} e^{-i\omega\hat{\delta}_{11}} & \cdots & \hat{a}_{1N} e^{-i\omega\hat{\delta}_{1N}} \\ \vdots & & \vdots \\ \hat{a}_{M1} e^{-i\omega\hat{\delta}_{M1}} & \cdots & \hat{a}_{MN} e^{-i\omega\hat{\delta}_{MN}} \end{bmatrix} \qquad (5.13)$$
where the $\hat{a}_{jk}$ are the estimated attenuation parameters and the $\hat{\delta}_{jk}$ are the estimated delay parameters, computed as discussed in the previous section. Note that each column of $\hat{A}$ is normalized to be a unit vector.
The goal now is to compute good estimates $\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_N$ of the original sources $s_1, s_2, \ldots, s_N$. These estimates must satisfy
$$\hat{A}\hat{s} = \hat{x} \qquad (5.14)$$
where $\hat{s} = [\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_N]^T$ is the vector of source estimates in the time-frequency domain. At each time-frequency bin, equation (5.14) provides M equations (corresponding to the M available mixtures) with N > M unknowns $(\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_N)$. Assuming that this system of equations is consistent, it has infinitely many solutions. To choose a reasonable estimate among these solutions, we exploit the sparsity of the source vector in the time-frequency domain.
The problem can be formally stated as [12]
$$\min_{\hat{s}} \|\hat{s}\| \quad \text{subject to} \quad \hat{A}\hat{s} = \hat{x} \qquad (5.15)$$
where $\|\hat{s}\|$ denotes some measure of the sparsity of a vector. Given a vector $u = (u_1, u_2, \ldots, u_n) \in \mathbb{R}^n$, one measure of its sparsity is simply the number of non-zero components of u, commonly denoted $\|u\|_0$; the corresponding problem is called $P_0$. But in general, the sparsity of the Gabor coefficients of speech signals essentially means that most of the coefficients are small, though not identically zero. In this case, $P_0$ fails miserably. Alternatively, one can consider
$$\|u\|_q = \left( \sum_{i} |u_i|^q \right)^{1/q} \qquad (5.16)$$
where $0 < q \leq 1$, as a measure of sparsity. Here, a smaller q signifies increased importance of the sparsity of u. Such a problem statement is commonly called the $P_q$ problem.
The solution to $P_q$ is identical to the solution of the lq-basis-pursuit (LQBP) problem, given by
$$\text{LQBP}: \quad \min_{\hat{s}} \|\hat{s}\|_q \quad \text{subject to} \quad \hat{A}\hat{s} = \hat{x}, \ \|\hat{s}\|_0 \leq M \qquad (5.17)$$
Note that to solve the LQBP problem, one needs to find the best basis for the column space of $\hat{A}$ that minimizes the lq norm of the solution vector. The solution of LQBP is given by the solution of
$$\min_{\sigma} \left\| B_{\sigma}^{-1} \hat{x} \right\|_q \qquad (5.18)$$
for $B_{\sigma} = [\hat{A}(\sigma_1) \,|\, \ldots \,|\, \hat{A}(\sigma_M)]$, where $\sigma = (\sigma_1, \ldots, \sigma_M)$ ranges over the choices of M columns of $\hat{A}$.
4.8.3. Delay and scale subtraction scoring (DASSS)
We have seen that DUET uses a nearest-neighbour approach to demix the sources, which suffers in cases where W-disjoint orthogonality is violated. Specifically, if two sources contribute to the energy in a particular time-frequency bin, the mixing parameter estimates will lie between those of the contributing sources. In some cases, this might generate mixing parameter estimates whose nearest neighbour is actually a third source, so we may spuriously assign energy to a source that is not contributing. On the other hand, we saw that LQBP assumes at most M (the number of mixtures) active sources at every time-frequency bin, which may not always be the case, since only one source might be active in some time-frequency bins. In this case too, we might spuriously assign energy to a source which is not present at that particular time-frequency bin. Thus a new demixing method called delay and scale subtraction scoring (DASSS) was presented in [16]; it is less erratic than the nearest-neighbour method and highlights when the W-disjoint orthogonality assumptions of the DUET system are not valid. Furthermore, this technique identifies the time-frequency bins where multiple sources are actually present and uses a source-aware demixing technique for those bins.
It should be noted that we require reliable estimates of the mixing parameters, which is also the case with the other two source separation techniques (DUET and LQBP). In this technique, N new signals $Y_i$, each of which entirely eliminates a particular source $\hat{s}_i$, are created in the following manner:
$$Y_i = X_1 - \frac{1}{a_i e^{-i\omega\delta_i}}\, X_2 \qquad (5.19)$$
It should be noted that the multiplicative factor applied to $X_2$ corresponds to a scaling and delay in the time domain. Hence this source-eliminating technique is called delay and scale subtraction scoring, or DASSS. $Y_i$ can also be written in the following way:
$$Y_i = \alpha_{1,i}\, \hat{s}_1 + \alpha_{2,i}\, \hat{s}_2 + \cdots + \alpha_{N,i}\, \hat{s}_N, \quad \text{where } \alpha_{k,i} \equiv 1 - \frac{a_k e^{-i\omega\delta_k}}{a_i e^{-i\omega\delta_i}} \qquad (5.20)$$
If exactly one source, say source j, is active at a specific time-frequency bin, we have the following:
$$Y_j = 0 \qquad (5.21)$$
$$Y_i = \alpha_{j,i}\, \hat{s}_j \qquad (5.22)$$
$$Y_i = \alpha_{j,i}\, X_1 \qquad (5.23)$$
Equation (5.23) reveals that if only one source is active, the N values in the set of $Y_i$ for a given bin can be predicted using only the known α values and the given mixture $X_1$. In fact, N sets of such predictions can be formed, each assuming one guessed active source g. (We will use $\hat{Y}_i^g$ to denote the prediction of the i-th Y value when assuming only source g is active.) Now the predicted set of values can be compared with the actual observed set of $Y_i$. If exactly one source is active, only its corresponding prediction will fit the observed $Y_i$. Further, the following scoring function can be used to compare predicted values to the observed values:
$$f(g) = \frac{\sum_i \left| Y_i - \hat{Y}_i^g \right|}{\sum_i |Y_i|} \qquad (5.24)$$
If the error is sufficiently small for a particular g in a given bin, we consider that only one source, namely g, is active in that bin and assign the energy of the mixture in that bin to source g. If the error is too large, we apply the multi-source demixing approach discussed below.
For the bins where no single-source model scores well enough, it is reasonable to conclude that at least two sources must be present, and hence the problem of partitioning the input into those two sources may be solved by simply inverting the mixing matrix restricted to those two sources. The two sources whose fractional errors f(g) are lowest are assumed to be the active sources.
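A minimal sketch of the DASSS scoring at a single time-frequency bin, under the reconstruction of equations (5.19)-(5.24) given above, might look as follows; X1 and X2 are the (scalar) mixture values at the bin, a and d the estimated parameters, and w the bin's angular frequency.

```python
# DASSS scoring at one T-F bin: Y[i] eliminates source i (eq. 5.19);
# if a single source g is active, Y is predicted by alpha[g, i] * X1,
# and the fractional error f(g) of eq. (5.24) measures the fit.
import numpy as np

def dasss_scores(X1, X2, a, d, w):
    m = a * np.exp(-1j * w * d)          # per-source mixing factors
    Y = X1 - X2 / m                      # eq. (5.19): Y[i] cancels source i
    f = np.empty(len(a))
    for g in range(len(a)):
        alpha = 1.0 - m[g] / m           # alpha[g, i] of eq. (5.20), all i
        Y_pred = alpha * X1              # prediction assuming only source g
        f[g] = np.sum(np.abs(Y - Y_pred)) / np.sum(np.abs(Y))
    return f                             # small f[g] suggests source g alone
```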
Chapter 5. Adaptive Time-Frequency Representation
Time-frequency representations describe signals in terms of their frequency content at a given
time. These representations are useful for analyzing signals varying both in time and
frequency. For speech and music signals where we have continuously time-varying frequency
content, frequency-domain representations alone are inadequate because they give only spectral information and no time information, i.e. they fail to convey when, in time, the different events in the signal occur. The short-time Fourier transform is one of the most widely used
approaches to time-frequency analysis.
In case of the sparsity-based source separation techniques, all the processing i.e. the
estimation of mixing parameters and the estimation of the sources is carried out in the time-
frequency domain. The major assumption in such source separation techniques is that the
underlying sources are W-disjoint orthogonal i.e. only one source is active in every time-
frequency bin. But practically such an assumption is not always valid; so it is at least assumed
that at every time-frequency bin only one of the sources has dominant energy. It is expected
that if the time-frequency representation of the mixture is sparse, then the assumption of the
W-disjoint orthogonality can be satisfied to a greater extent. One such time-frequency
representation which provides a sparse representation is the short-time Fourier transform.
5.1. Short-time Fourier transform
The short-time Fourier transform is the most widely used method for studying non-stationary
signals like music/speech. The concept behind it is simple and powerful. Suppose we listen to
a piece of music that lasts an hour where in the beginning there are violins and at the end
drums. If we Fourier analyze the whole hour, the energy spectrum will show peaks at the
frequencies corresponding to the violins and drums. That will tell us that there were violins
and drums but will not give us any indication of when the violins and drums were played. The
most straightforward thing to do is to break up the hour into five minute segments and Fourier
analyze each interval. Upon examining the spectrum of each segment we will see in which
five minute intervals the violins and drums occurred. If we want to localize even better, we
break up the hour into one minute segments or even smaller time intervals and Fourier
analyze each segment. That is the basic idea of the short-time Fourier transform – break up
the signal into small time segments and Fourier analyze each time segment to ascertain the
frequencies that existed in that segment. The totality of such spectra indicates how the
spectrum is varying in time.
The short-time Fourier transform (STFT) of the signal x(t) is defined as [17]
$$F^W(x(\cdot))(\omega, \tau) = \int_{-\infty}^{\infty} W(t - \tau)\, x(t)\, e^{-i\omega t}\, dt \qquad (6.1)$$
where W(t) is the window function. W(t) can be considered as a window that selects a
particular portion of the signal centred around the given time location, and the Fourier
transform of the windowed signal yields the frequency content of the signal at the given time.
If we want good time localization we have to pick a narrow window in the time domain, and
if we want good frequency localization we have to pick a narrow window in the frequency
domain (i.e. a long window in time domain). But both the time domain window and the
frequency domain window cannot be made arbitrarily narrow; hence there is an inherent
trade-off between time and frequency localization in the spectrogram for a particular window.
The degree of trade-off depends on the window, signal, time, and frequency. We have just
seen that one window, in general, cannot give good time and frequency localization. That
should not cause any problem of principle as long as we look at the spectrogram as a tool at
our disposal that has many options including the choice of window. There is no reason why
we cannot change the window depending on what we want to study. That can sometimes be
done effectively, but not always. Sometimes a compromise window does very well.
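The trade-off can be made concrete with a small experiment: analyzing a signal containing both a steady tone and a click with a short and a long window. The numbers printed below are only the frame-grid resolutions, but they illustrate that the long window resolves the tone sharply in frequency while smearing the click in time, and vice versa.

```python
# Illustration of the time-frequency trade-off of a fixed-window STFT.
import numpy as np
from scipy.signal import stft

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)          # steady tone
x[fs // 2] += 10.0                        # impulsive click at 0.5 s

for nperseg in (256, 2048):               # short vs long analysis window
    f, tt, X = stft(x, fs=fs, nperseg=nperseg)
    print(f"window {nperseg:4d} samples: "
          f"freq. resolution {f[1] - f[0]:6.1f} Hz, "
          f"time step {(tt[1] - tt[0]) * 1e3:5.1f} ms")
```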
5.2. Need for data-adaptive time-frequency representations
Most algorithms for underdetermined separation as mentioned earlier are based on the
assumption that the signals are sparse in some domain. In most cases, the sparser the sources,
the less they will overlap when mixed (i.e. the more disjoint their mixture will be), and
consequently the easier their separation will be. The most widely used transform for the
purpose of sparsification in the context of blind source separation has been the STFT. The
choice of the window significantly affects the signal concentration in the STFT. In fact, the
uniform frequency and time resolution it offers are disadvantageous for the task of speech or
music separation. When the pitch of a source varies slightly, the variation of its higher harmonics is proportionally larger. This amplified variation of the higher harmonic frequencies can be accurately tracked only with shorter windows. Hence, a variation in the window duration with
frequency is expected to result in a more concentrated representation for some signals.
A second problem is that for signals having several different components occurring at
different instants, the best window differs for each component. A sparse representation is
obtained for the harmonic and impulsive parts of the signal by analyzing the segments with a
long and a short window respectively. Thus the fact that different windows are appropriate for different signal components suggests the use of a data-dependent, time- and frequency-varying window function for analysis, to achieve high concentration and resolution of any signal component present at any time-frequency location.
5.3. Data-adaptive time-frequency representations
As discussed in the previous section, the STFT performs well in terms of the concentration
and resolution of a given signal component when a properly chosen window is used. An
adaptive time-frequency representation was proposed in [4] by D. L. Jones and T. W. Parks.
In order to track the different signal components we need to have an adaptive window whose
parameters are dependent on the signal. The window function selected for this purpose was
the Gaussian function. Thus there are two parameters (the real and the imaginary part of the
Gaussian parameter) of the window function that are equivalent in terms of their fundamental
time-frequency concentration. They differ, though, in the time-frequency concentration they
provide for a particular signal component. A local signal concentration measure is used to
compute the Gaussian window parameter to achieve maximum concentration of the locally
dominant signal component at every time-frequency location. This procedure automates the
choice of the window and thus overcomes the problem of window selection in the short-time
Fourier transform. The adaptive time-frequency representation of a signal x(t) is given as [4]
$$A(t, \omega) = \int_{-\infty}^{\infty} x(\tau) \left( \frac{2\,\Re[C_{t,\omega}]}{\pi} \right)^{1/4} e^{-C_{t,\omega}(\tau - t)^2}\, e^{-i\omega\tau}\, d\tau \qquad (6.2)$$
which projects the signal x(τ) onto the unit-energy Gaussian basis elements
$$\left( \frac{2\,\Re[C_{t,\omega}]}{\pi} \right)^{1/4} e^{-C_{t,\omega}(\tau - t)^2}\, e^{-i\omega\tau} \qquad (6.3)$$
Here $C_{t,\omega}$ is the Gaussian parameter, which can vary with time and frequency. The adaptive time-frequency representation in equation (6.2) differs from the STFT with a Gaussian window in that the Gaussian parameter $C_{t,\omega}$ may vary with time and frequency. The basic idea behind the
adaptive time-frequency representation is that the extra degrees of freedom, namely the real
and the imaginary parts of the Gaussian parameter at every time-frequency location, can
improve the performance over that of a fixed-window STFT. The performance of the adaptive
time-frequency representation depends on the selection of the adaptive Gaussian parameters.
Those Gaussian parameters are selected for a particular time-frequency location which
maximizes the local concentration measure. The following local concentration measure was
used in [4]
$$V = \frac{\displaystyle\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} |F(t, \omega)|^4\, dt\, d\omega}{\left( \displaystyle\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} |F(t, \omega)|^2\, dt\, d\omega \right)^2} \qquad (6.4)$$
which is the fourth power of the L4 norm divided by the fourth power of the L2 norm of the magnitude of the short-time Fourier transform. This measure is very similar to kurtosis in statistics and to other equivalent measures of peakedness or sharpness.
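For a discrete STFT the measure reduces to a simple ratio of sums; a minimal sketch:

```python
# Concentration measure of eq. (6.4) for a discrete T-F array X: the
# fourth power of the L4 norm over the fourth power of the L2 norm of
# the magnitudes. Larger values indicate a peakier representation.
import numpy as np

def concentration(X):
    mag = np.abs(X)
    return np.sum(mag ** 4) / np.sum(mag ** 2) ** 2
```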
Motivated by this concept of data-adaptive time-frequency representation, which provides better resolution than a fixed-window STFT, we can apply the same adaptiveness to obtain sparser time-frequency representations for the application of blind source separation. The requirement in the blind source separation problem is that the time-frequency representations of the mixtures be as sparse as possible, so that the underlying sources satisfy the W-disjoint orthogonality criterion to a greater extent. So here the adaptation can be used to maximize the sparsity of the time-frequency representation.
Most real world signals are essentially stationary over short intervals of time. Consequently, a
sparse representation could be obtained by analyzing each frame with a window that has been
optimized for the frame. Long windows give a sparser representation for frames containing
steady frequency components than when shorter windows are used. On the contrary, the time-
frequency representation of impulses or onsets of events is sparser with short windows. It has also been observed that simply varying the length of the analysis window changes the sparsity of the time-frequency representation [21], and usually there exists an optimum length for which the sparsity is maximum. But the selection of the optimum analysis window length depends on the signal.
So instead of adapting the time-frequency representation at every time-frequency location as in [4], for the application of blind source separation the adaptation can be restricted to time alone, i.e. different analysis window lengths can be used at different time instants. The reason for restricting the adaptation to time only is that the blind source separation application demands reconstruction of the time-frequency representations for estimation of the source signals in the time domain (this problem is discussed in detail in the next section). Now
the next problem that needs to be addressed is which concentration measure to use for the adaptation process, i.e. by what selection criterion to choose the optimum analysis window length. Commonly used sparsity measures such as kurtosis and the Gini index can serve this purpose. This aspect is investigated thoroughly in the next chapter.
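For reference, a minimal sketch of the Gini index as a sparsity measure, following its usual definition over the sorted coefficient magnitudes normalized by the l1 norm (values near 1 indicate a sparse vector), is:

```python
# Gini index of a coefficient vector c as a sparsity measure; assumes
# c is not identically zero.
import numpy as np

def gini_index(c):
    mag = np.sort(np.abs(np.ravel(c)))     # ascending magnitudes
    N = mag.size
    k = np.arange(1, N + 1)
    return 1.0 - 2.0 * np.sum((mag / mag.sum()) * (N - k + 0.5) / N)
```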
The adaptive transformations used to obtain the time-frequency representation are non-linear, i.e. the representation of the sum of two signals might not be equal to the sum of the time-frequency representations of the individual signals. This depends on the window sequence
chosen for obtaining the time-frequency components in each of the signals. However, if the
same sequence of windows is used to obtain the time-frequency representations for the
mixture as well as the individual signals, the transformation can be considered to be linear.
This linearity property is vital during the estimation of sources in the source separation
algorithm.
5.3.1. Steps to obtain a data-adaptive time-frequency representation of a signal
The procedure to obtain the data-adaptive time-frequency representation is as follows (a minimal code sketch follows the list):
a) first select the set of analysis window sizes to be used for adaptation purpose (say 30
ms, 60 ms, 90 ms)
b) now for a particular time-instant, using an analysis window select a portion of the
signal and then Fourier analyze the selected signal
c) repeat step (b) using all the analysis window sizes selected for the purpose of
adaptation
d) once the Fourier spectra using all the different analysis windows selected for adaptation have been obtained, select, using an appropriate concentration measure, the optimal spectrum which gives the best resolution (which concentration measures to use for the adaptation is discussed in chapter 7); furthermore, the adaptation can be carried out over various frequency bands depending on the requirement
e) note down the analysis window size that was used for obtaining the best resolution
f) then based on the analysis window size used for this time-instant and the technique to be used for reconstruction of the signal (discussed in section 6.4), decide an
appropriate hop size (i.e. step size) and proceed to the next time-instant
g) finally at the new time-instant, follow the adaptation procedure discussed in steps (b),
(c), (d), (e) and (f) until the end of the signal is reached
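The following is a minimal sketch of steps (a)-(g), assuming kurtosis as the concentration measure and a fixed hop size; the band limits restrict the adaptation region as in Figure 5.1.

```python
# Frame-based adaptive window selection per steps (a)-(g): each frame is
# analyzed with every candidate window length and the length maximizing
# the spectral kurtosis within [lo_hz, hi_hz] is kept.
import numpy as np
from scipy.stats import kurtosis

def adaptive_windows(x, fs, sizes_ms=(30, 60, 90), hop_ms=10,
                     lo_hz=1000, hi_hz=3000):
    hop = int(fs * hop_ms / 1000)
    chosen = []                                 # selected window size per frame
    n = 0
    while n + int(fs * max(sizes_ms) / 1000) <= len(x):
        best_size, best_score = None, -np.inf
        for ms in sizes_ms:                     # steps (b)-(c): try each window
            N = int(fs * ms / 1000)
            spec = np.abs(np.fft.rfft(x[n:n + N] * np.hamming(N)))
            freqs = np.fft.rfftfreq(N, 1 / fs)
            band = spec[(freqs >= lo_hz) & (freqs <= hi_hz)]
            score = kurtosis(band)              # step (d): concentration measure
            if score > best_score:
                best_score, best_size = score, ms
        chosen.append(best_size)                # step (e): record the winner
        n += hop                                # step (f): advance by the hop
    return chosen
```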
Figure 5.1: Data-adaptive time-frequency representation of a singing voice using frame-based adaptation (window function: hamming; window sets for adaptation: 30, 60 and 90 ms; hop size: 10 ms; concentration measure: kurtosis; adaptation region: 1000 to 3000 Hz)
Figure 5.1 shows the data-adaptive time-frequency representation of a singing voice obtained using the above procedure. The red dashed line shows the window size selected for each frame. The window function used is Hamming, the window sizes used for the adaptation are 30, 60 and 90 ms, the hop size is 10 ms, the concentration measure used for adaptation is kurtosis, and the region over which the adaptation is carried out is 1000 to 3000 Hz.