EE E6820: Speech & Audio Processing & Recognition
Lecture 10: Signal Separation
Dan Ellis <dpwe@ee.columbia.edu>, Michael Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
April 14, 2009
1 Sound mixture organization
2 Computational auditory scene analysis
3 Independent component analysis
4 Model-based separation
E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 1 / 45
Sound Mixture Organization
[Figure: spectrogram (0–12 s, 0–4000 Hz, level −60 to 0 dB) analyzed into labeled events: Voice (evil), Stab, Rumble, Strings, Choir, Voice (pleasant)]
Auditory Scene Analysis: describing a complex sound in terms of high-level sources / events
- . . . like listeners do
Hearing is ecologically grounded
- reflects ‘natural scene’ properties
- subjective, not absolute
Sound, mixtures, and learning
[Figure: spectrograms (0–4 kHz, 0–1 s) of Speech + Noise = Speech + Noise mixture]
Sound
- carries useful information about the world
- complements vision
Mixtures
- . . . are the rule, not the exception
- medium is ‘transparent’, sources are many
- must be handled!
Learning
- the ‘speech recognition’ lesson: let the data do the work
- like listeners
The problem with recognizing mixtures
“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregman, 1990)
Received waveform is a mixture
- two sensors, N signals . . . underconstrained
Disentangling mixtures as the primary goal?
- perfect solution is not possible
- need experience-based constraints
Approaches to sound mixture recognition
Separate signals, then recognize
- e.g. Computational Auditory Scene Analysis (CASA), Independent Component Analysis (ICA)
- nice, if you can do it
Recognize combined signal
- ‘multicondition training’
- combinatorics. . .
Recognize with parallel models
- full joint-state space?
- divide signal into fragments, then use missing-data recognition
What is the goal of sound mixture analysis?
Separate signals?
- output is unmixed waveforms
- underconstrained, very hard . . .
- too hard? not required?
Source classification?
- output is a set of event-names
- listeners do more than this. . .
Something in-between? Identify independent sources + characteristics
- standard task, results?
Segregation vs. Inference
Source separation requires attribute separation
- sources are characterized by attributes (pitch, loudness, timbre, and finer details)
- need to identify and gather different attributes for different sources. . .
Need a representation that segregates attributes
- spectral decomposition
- periodicity decomposition
Sometimes values can’t be separated, e.g. unvoiced speech
- maybe infer factors from a probabilistic model? p(O, x, y) → p(x, y | O)
- or: just skip those values & infer from higher-level context
Auditory Scene Analysis (Bregman, 1990)
How do people analyze sound mixtures?
- break mixture into small elements (in time-freq)
- elements are grouped into sources using cues
- sources have aggregate attributes
Grouping ‘rules’ (Darwin and Carlyon, 1995)
- cues: common onset/offset/modulation, harmonicity, spatial location, . . .
[Diagram: Sound → frequency analysis → elements → grouping mechanism (onset map, harmonicity map, spatial map) → sources + source properties]
Cues to simultaneous grouping
Elements + attributes
[Figure: time-frequency elements, 0–9 s, 0–8000 Hz]
Common onset
- simultaneous energy has a common source
Periodicity
- energy in different bands with the same cycle
Other cues
- spatial (ITD/IID), familiarity, . . .
The effect of context
Context can create an ‘expectation’
i.e. a bias towards a particular interpretation
e.g. Bregman’s “old-plus-new” principle:
- A change in a signal will be interpreted as an added source whenever possible
[Figure: spectrograms of old-plus-new stimuli, 0.0–1.2 s, 0–2 kHz]
- a different division of the same energy depending on what preceded it
Computational Auditory Scene Analysis (CASA)
[Diagram: a sound mixture perceived as Object 1, Object 2, and Object 3 percepts]
Goal: Automatic sound organization
- systems to ‘pick out’ sounds in a mixture . . . like people do
- e.g. voice against a noisy background, to improve speech recognition
Approach
- psychoacoustics describes grouping ‘rules’ . . . just implement them?
CASA front-end processing
Correlogram: Loosely based on known/possible physiology
[Diagram: sound → cochlea filterbank → envelope follower → short-time autocorrelation (delay line) → correlogram slice (frequency channels × lag × time)]
- linear filterbank cochlear approximation
- static nonlinearity
- zero-delay slice is like spectrogram
- periodicity from delay-and-multiply detectors
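The delay-and-multiply periodicity detection in each channel amounts to a short-time autocorrelation. A minimal numpy sketch (the signal and parameters are invented for illustration, not the course code): the lag of the first off-zero peak estimates the channel's period.

```python
import numpy as np

def short_time_autocorr(x, sr, max_lag_s=0.02):
    """Normalized autocorrelation of one filterbank channel; the
    first peak away from zero lag estimates the periodicity."""
    max_lag = int(max_lag_s * sr)
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) - 1 + max_lag]
    return ac / (ac[0] + 1e-12)

# A 100 Hz periodic signal at sr = 8000 should peak near lag 80 samples.
sr = 8000
t = np.arange(0, 0.1, 1 / sr)
x = np.sign(np.sin(2 * np.pi * 100 * t))   # crude periodic source
ac = short_time_autocorr(x, sr)
period = np.argmax(ac[20:]) + 20           # skip the zero-lag peak
```

In a correlogram, the same computation runs per cochlear channel, so harmonics in different bands reveal a shared period as aligned peaks across channels.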
Bottom-Up Approach (Brown and Cooke, 1994)
Implement psychoacoustic theory:
[Diagram: input mixture → front end (signal feature maps: onset, period, freq. mod) → object formation (discrete objects) → grouping rules → source groups]
- left-to-right processing
- uses common onset & periodicity cues
Able to extract voiced speech
[Figure: spectrograms of the ‘v3n7’ original mix and the Hu & Wang ’02 resynthesis, 0–1.4 s, 0–8 kHz, level −40 to +20 dB]
Problems with ‘bottom-up’ CASA
Circumscribing time-frequency elements
- need to have ‘regions’, but hard to find
Periodicity is the primary cue
- how to handle aperiodic energy?
Resynthesis via masked filtering
- cannot separate within a single t-f element
Bottom-up leaves no ambiguity or context
- how to model illusions?
Restoration in sound perception
Auditory ‘illusions’ = hearing what’s not there
The continuity illusion & Sinewave Speech (SWS)
[Figure: spectrograms of continuity-illusion and sinewave-speech stimuli, 0–8 kHz and 0–5 kHz]
- duplex perception
What kind of model accounts for this?
- is it an important part of hearing?
Adding top-down constraints: Prediction-Driven CASA
Perception is not direct, but a search for plausible hypotheses
Data-driven (bottom-up). . .
[Diagram: input mixture → front end (signal features) → object formation (discrete objects) → grouping rules → source groups]
- objects irresistibly appear
vs. Prediction-driven (top-down)
[Diagram: input mixture → front end → compare & reconcile (prediction errors) → hypothesis management (hypotheses) → predict & combine (periodic components, noise components) → predicted features]
- match observations with a ‘world-model’
- need world-model constraints. . .
Generic sound elements for PDCASA (Ellis, 1996)
Goal is a representational space that
- covers real-world perceptual sounds
- minimal parameterization (sparseness)
- separates attributes in separate parameters
[Figure: three generic element models]
- NoiseObject: spectral profile H[k] × magnitude contour A[n]:  X_N[n,k] = H[k] · A[n]
- WeftObject: smooth spectrum H_W[n,k] × periodic excitation P[n,k], with period track p[n] from correlogram analysis:  X_W[n,k] = H_W[n,k] · P[n,k]
- ClickObject: magnitude profile H[k] with per-channel decay times T[k]:  X_C[n,k] = H[k] · exp(−(n − n0) / T[k])
Object hierarchies built on top. . .
PDCASA for old-plus-new
Incremental analysis of the input signal at times t1, t2, t3:
- Time t1: initial element created
- Time t2: additional element required
- Time t3: second element finished
PDCASA for the continuity illusion
Subjects hear the tone as continuous
. . . if the noise is a plausible masker
[Figure: spectrogram of the ‘ptshort’ tone + noise stimulus, 0–1.4 s, 1–4 kHz]
Data-driven analysis gives just visible portions:
Prediction-driven can infer masking:
Prediction-Driven CASA
Explain a complex sound with basic elements
[Figure: PDCASA analysis of a city-street recording (level −70 to −40 dB) into Noise1, Noise2 + Click1, and Wefts1–12, matched against listener percepts: Horn1 (10/10), Crash (10/10), Horn2 (5/10), Truck (7/10), Horn3 (5/10), Squeal (6/10), Horn4 (8/10), Horn5 (10/10)]
Aside: Ground Truth
What do people hear in sound mixtures?
- do interpretations match?
→ Listening tests to collect ‘perceived events’
Aside: Evaluation
Evaluation is a big problem for CASA
- what is the goal, really?
- what is a good test domain?
- how do you measure performance?
SNR improvement
- tricky to derive from before/after signals: correspondence problem
- can do with a fixed filtering mask
- differentiate removing signal from adding noise
Speech Recognition (ASR) improvement
- recognizers often sensitive to artifacts
‘Real’ task?
- mixture corpus with specific sound events. . .
Independent Component Analysis (ICA) (Bell and Sejnowski, 1995, etc.)
If mixing is like matrix multiplication, then separation is searching for the inverse matrix
[Diagram: sources s1, s2 mixed by A = [a11 a12; a21 a22] to give x1, x2; unmixing matrix W = [w11 w12; w21 w22], adapted by −δMutInfo/δwij, recovers estimates ŝ1, ŝ2]
i.e. W ≈ A^−1
- with N different versions of the mixed signals (microphones), we can find N different input contributions (sources)
- how to rate quality of outputs? i.e. when do outputs look separate?
Gaussianity, Kurtosis, & Independence
A signal can be characterized by its PDF p(x)
i.e. as if successive time values are drawn from a random variable (RV)
- Gaussian PDF is ‘least interesting’
- sums of independent RVs (PDFs convolved) tend to a Gaussian PDF (central limit theorem)
Measures of deviation from Gaussianity: the 4th moment is kurtosis (“bulging”):

kurt(y) = E[((y − µ)/σ)^4] − 3
[Figure: PDFs with kurtosis values: Gaussian, kurt = 0; exp(−|x|), kurt = 2.5; exp(−x^4), kurt = −1]
- kurtosis of a Gaussian is zero (with this definition)
- ‘heavy tails’ → kurt > 0
- closer to uniform dist. → kurt < 0
- directly related to KL divergence from the Gaussian PDF
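The excess-kurtosis definition above is easy to check on sampled data. A quick numpy sketch (the sample sizes and the particular distributions are my own choices for illustration):

```python
import numpy as np

def excess_kurtosis(y):
    """kurt(y) = E[((y - mu)/sigma)^4] - 3; zero for a Gaussian."""
    z = (y - y.mean()) / y.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(0)
n = 200_000
k_gauss = excess_kurtosis(rng.standard_normal(n))    # near 0
k_laplace = excess_kurtosis(rng.laplace(size=n))     # positive: heavy tails
k_uniform = excess_kurtosis(rng.uniform(-1, 1, n))   # negative: sub-Gaussian
```

The signs match the bullets above: heavy-tailed (super-Gaussian) samples give kurt > 0, near-uniform samples give kurt < 0.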
Independence in Mixtures
Scatter plots & Kurtosis values
[Figure: scatter plots of s1 vs. s2 and x1 vs. x2, with marginal kurtosis values: p(s1) = 27.90, p(s2) = 53.85, p(x1) = 18.50, p(x2) = 27.90 — the mixtures are less kurtotic than the sources]
Finding Independent Components
Sums of independent RVs are more Gaussian
→ minimize Gaussianity to undo sums
i.e. search over the w_ij terms in the inverse matrix
[Figure: scatter of mix 1 vs. mix 2, and kurtosis of the projection as a function of angle θ/π — kurtosis maxima pick out the source directions s1, s2]
Solve by gradient descent or Newton-Raphson (“Fast ICA”, Hyvarinen and Oja, 2000):

w+ = E[x g(w^T x)] − E[g′(w^T x)] w
w = w+ / ‖w+‖

http://www.cis.hut.fi/projects/ica/fastica/
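The one-unit update above can be sketched in a few lines of numpy with the common choice g(u) = tanh(u). This is a minimal illustration on a synthetic 2×2 mixture, assuming pre-whitened data (the mixing matrix, source distributions, and iteration count are invented; it is not the FastICA package itself):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
# Two super-Gaussian (Laplacian) sources, linearly mixed.
s = rng.laplace(size=(2, n))
A = np.array([[1.0, 0.6], [0.4, 1.0]])
x = A @ s

# Whiten: zero mean, identity covariance.
x = x - x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(x))
z = (E / np.sqrt(d)) @ E.T @ x

# One-unit fixed-point iteration: w+ = E[z g(w'z)] - E[g'(w'z)] w,
# then renormalize, as in the update above.
w = rng.standard_normal(2)
w /= np.linalg.norm(w)
for _ in range(100):
    wx = w @ z
    w_new = (z * np.tanh(wx)).mean(axis=1) - (1 - np.tanh(wx) ** 2).mean() * w
    w_new /= np.linalg.norm(w_new)
    converged = abs(abs(w_new @ w) - 1) < 1e-8   # equal up to sign
    w = w_new
    if converged:
        break

y = w @ z   # estimate of one source, up to scale and sign
```

Whitening first is what lets a unit-norm w parameterize all candidate unmixing directions; the fixed point of the iteration is an extremum of non-Gaussianity.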
Limitations of ICA
Assumes instantaneous mixing
- real-world mixtures have delays & reflections
- STFT domain?

x1(t) = a11(t) ∗ s1(t) + a12(t) ∗ s2(t)
⇒ X1(ω) = A11(ω) S1(ω) + A12(ω) S2(ω)

- solve the ω subbands separately, match up the answers
Searching for the best possible inverse matrix
- cannot find more than N outputs from N inputs
- but: “projection pursuit” ideas + time-frequency masking. . .
Cancellation is inherently fragile
- ŝ1 = w11 x1 + w12 x2 to cancel out s2
- sensitive to noise in the x channels
- time-varying mixtures are a problem
Model-Based Separation: HMM decomposition
(e.g. Varga and Moore, 1990; Gales and Young, 1993)
Independent state sequences for 2+ component source models
[Diagram: independent state paths for model 1 and model 2 against shared observations over time]
New combined state space q′ = q1 × q2
- need pdfs for combinations p(X | q1, q2)
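The product state space is easy to enumerate directly. A toy sketch (the state log-spectral means are made up, and the elementwise max stands in for a proper combination pdf p(X | q1, q2) — the "max approximation" often used for log-spectra, not necessarily what any particular paper above uses):

```python
import numpy as np
from itertools import product

# Hypothetical log-spectral state means (dB-like) for two small models:
# 3 states x 2 freq bands, and 2 states x 2 freq bands.
means1 = np.array([[0.0, -10.0], [-10.0, 0.0], [-5.0, -5.0]])
means2 = np.array([[-20.0, -2.0], [-2.0, -20.0]])

# Combined state space q' = q1 x q2: one entry per (q1, q2) pair.
# Under the max approximation, the mixture log-spectrum in each band
# is dominated by whichever source is louder there.
combined = {(i, j): np.maximum(means1[i], means2[j])
            for i, j in product(range(len(means1)), range(len(means2)))}
```

The point of the sketch is the combinatorics: the joint space has |q1| × |q2| states, which is why 1000-state models give a million combinations and motivate the approximations discussed later.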
One-channel Separation: Masked Filtering
Multichannel → ICA: inverse filter and cancel
[Diagram: target s and interference n are mixed; processing subtracts an estimate of n to recover s]
One channel: find a time-frequency mask
[Diagram: mix s+n → STFT → spectrogram × mask → ISTFT → ŝ]
Cannot remove overlapping noise in t-f cells, but surprisingly effective (psych masking?):
[Figure: noise mixes and mask resyntheses at 0, −12, and −24 dB SNR; 0–8 kHz, 0–3 s]
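The masked-filtering path (STFT → binary mask → ISTFT) can be sketched end-to-end on a toy signal. Here the mask is an oracle, computed from the known target and noise; a real system must estimate it, which is the hard part. The STFT/ISTFT, signals, and parameters are my own minimal choices:

```python
import numpy as np

def stft(x, nfft=512, hop=256):
    win = np.hanning(nfft)
    frames = [win * x[i:i + nfft] for i in range(0, len(x) - nfft, hop)]
    return np.array([np.fft.rfft(f) for f in frames]).T    # freq x time

def istft(X, nfft=512, hop=256):
    win = np.hanning(nfft)
    out = np.zeros(hop * X.shape[1] + nfft)
    for t in range(X.shape[1]):                            # overlap-add
        out[t * hop : t * hop + nfft] += win * np.fft.irfft(X[:, t])
    return out

sr = 8000
t = np.arange(sr) / sr
target = np.sin(2 * np.pi * 440 * t)
noise = np.random.default_rng(2).standard_normal(sr)
mix = target + noise

S, N, Y = stft(target), stft(noise), stft(mix)
mask = (np.abs(S) > np.abs(N)).astype(float)   # oracle binary t-f mask
est = istft(mask * Y)                          # masked resynthesis
```

Noise sharing a t-f cell with the target survives the mask, as the slide notes, but the mostly noise-only cells are zeroed, which is where the large perceived improvement comes from.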
“One microphone source separation”
(Roweis, 2001)
State sequences → t-f estimates → mask
[Figure: for Speaker 1 and Speaker 2: original voices, state means, resynthesis masks, and the mixture; 0–4 kHz, 0–6 s]
- 1000 states/model (→ 10^6 transition probs.)
- simplify by subbands (coupled HMM)?
Speech Fragment Recognition
(Barker et al., 2005)
Signal separation is too hard! Instead:
- segregate features into partially-observed sources
- then classify
Made possible by missing-data recognition
- integrate over uncertainty in observations for the true posterior distribution
Goal: relate clean speech models P(X | M) to speech-plus-noise mixture observations
. . . and make it tractable
E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 35 / 45
Missing Data Recognition
Speech models p(x | m) are multidimensional. . .
i.e. means, variances for every freq. channel
- need values for all dimensions to get p(·)
But: can evaluate over a subset of dimensions x_k:

p(x_k | m) = ∫ p(x_k, x_u | m) dx_u
[Figure: joint PDF p(x_k, x_u), with the bounded marginal p(x_k | x_u < y) vs. the full marginal p(x_k)]
For a diagonal-covariance model the likelihood factors over dimensions, so the ‘present data mask’ simply selects terms:

P(x | q) = P(x1 | q) · P(x2 | q) · P(x3 | q) · P(x4 | q) · P(x5 | q) · P(x6 | q)

[Figure: present-data mask over the time × dimension plane]
Hence, missing data recognition:
- hard part is finding the mask (segregation)
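For diagonal-Gaussian state models, marginalizing the unobserved dimensions in p(x_k | m) = ∫ p(x_k, x_u | m) dx_u just drops their terms from the log-likelihood. A small sketch (the three-dimensional model, observation, and mask are invented for illustration):

```python
import numpy as np

def marginal_loglik(x, mask, mean, var):
    """log p(x_k | m) for a diagonal Gaussian: the integral over the
    unobserved dims x_u evaluates to 1, so their terms simply vanish."""
    k = mask.astype(bool)
    d = x[k] - mean[k]
    return -0.5 * np.sum(np.log(2 * np.pi * var[k]) + d ** 2 / var[k])

mean = np.array([0.0, 5.0, -2.0])
var = np.ones(3)
x = np.array([0.1, 99.0, -2.2])     # dim 1 is noise-dominated
mask = np.array([1, 0, 1])          # present-data mask: dim 1 missing
ll_masked = marginal_loglik(x, mask, mean, var)
ll_full = marginal_loglik(x, np.ones(3), mean, var)
```

With the corrupted dimension masked out, the clean-speech model scores the observation far better than when forced to explain the noisy value, which is exactly why mask estimation matters.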
Missing Data Results
Estimate the static background noise level N(f)
Cells with energy close to the background are considered “missing”
[Figure: “1754” + noise → SNR mask → missing-data classifier → “1754”; digit-recognition accuracy vs. SNR (0–20 dB) in factory noise for a priori mask, missing data, and MFCC+CMN]
- must use spectral features!
But: nonstationary noise → spurious mask bits
- can we try removing parts of the mask?
Comparing different segregations
Standard classification chooses between models M to match source features X:

M* = argmax_M p(M | X) = argmax_M p(X | M) p(M)

Mixtures: observed features Y, segregation S, all related by p(X | Y, S)
[Figure: source X(f), segregation S, and observation Y(f) along frequency]
Joint classification of model and segregation:

p(M, S | Y) = p(M) ∫ [ p(X | M) p(X | Y, S) / p(X) ] dX · p(S | Y)

- p(X) no longer constant
Calculating fragment matches
p(M, S | Y) = p(M) ∫ [ p(X | M) p(X | Y, S) / p(X) ] dX · p(S | Y)

- p(X | M) – the clean-signal feature model
- p(X | Y, S) / p(X) – is X ‘visible’ given the segregation?
- integration collapses some bands. . .
- p(S | Y) – segregation inferred from the observation: just assume uniform and find S for the most likely M, or use extra information in Y to distinguish the S’s. . .
Result: a probabilistically correct relation between the clean-source models p(X | M) and the inferred, recognized source + segregation p(M, S | Y)
Using CASA features
p(S | Y) links acoustic information to segregation
- is this segregation worth considering? how likely is it?
Opportunity for CASA-style information to contribute
- periodicity/harmonicity: these different frequency bands belong together
- onset/continuity: this time-frequency region must be whole
[Figure: fragment cues: frequency proximity, harmonicity, common onset]
Fragment decoding
Limiting S to whole fragments makes the hypothesis search tractable:
[Diagram: fragments over time intervals T1–T6 combining into competing hypotheses w1–w8]
- choice of fragments reflects p(S | Y) p(X | M), i.e. the best combination of segregation and match to speech models
Merging hypotheses limits space demands
. . . but erases specific history
Speech fragment decoder results
Simple p(S | Y) model forces contiguous regions to stay together
- big efficiency gain when searching the S space
[Diagram: “1754” + noise → SNR mask → fragments → fragment decoder → “1754”]
Clean-models-based recognition rivals trained-in-noise recognition
Multi-source decoding
Search for more than one source
[Diagram: observations Y(t) explained by two masks S1(t), S2(t) with state sequences q1(t), q2(t)]
Mutually-dependent data masks
- disjoint subsets of cells for each source
- each model match p(Mx | Sx, Y) is independent
- masks are mutually dependent: p(S1, S2 | Y)
Huge practical advantage over full search
Summary
Auditory Scene Analysis:
- hearing: partially understood, very successful
Independent Component Analysis:
- simple and powerful, some practical limits
Model-based separation:
- real-world constraints, implementation tricks
Parting thought
Mixture separation is the main obstacle in many applications, e.g. soundtrack recognition
References
Albert S. Bregman. Auditory Scene Analysis: The Perceptual Organization of Sound. The MIT Press, 1990. ISBN 0262521954.
C. J. Darwin and R. P. Carlyon. Auditory grouping. In B. C. J. Moore, editor, The Handbook of Perception and Cognition, Vol. 6, Hearing, pages 387–424. Academic Press, 1995.
G. J. Brown and M. P. Cooke. Computational auditory scene analysis. Computer Speech and Language, 8:297–336, 1994.
D. P. W. Ellis. Prediction-driven computational auditory scene analysis. PhD thesis, Department of Electrical Engineering and Computer Science, M.I.T., 1996.
Anthony J. Bell and Terrence J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995. ftp://ftp.cnl.salk.edu/pub/tony/bell.blind.ps.Z.
A. Hyvarinen and E. Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4–5):411–430, 2000. http://www.cis.hut.fi/aapo/papers/IJCNN99_tutorialweb/.
A. Varga and R. Moore. Hidden Markov model decomposition of speech and noise. In Proceedings of the 1990 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 845–848, 1990.
M. J. F. Gales and S. J. Young. HMM recognition in noise using parallel model combination. In Proc. Eurospeech-93, volume 2, pages 837–840, 1993.
S. Roweis. One-microphone source separation. In NIPS, volume 11, pages 609–616. MIT Press, Cambridge, MA, 2001.
Jon Barker, Martin Cooke, and Daniel P. W. Ellis. Decoding speech in the presence of other sources. Speech Communication, 45(1):5–25, 2005. http://www.ee.columbia.edu/~dpwe/pubs/BarkCE05-sfd.pdf.