Singer Identification and Clustering of Popular Music Recordings
Wei-Ho [email protected]
Institute of Information Science, Academia Sinica
Extracting Information From Music
Music Information Retrieval (MIR)
– To develop ways of managing collections of musical material for preservation, access, research, and other uses.
MIR communities & research areas [after Futrelle & Downie, 2002]
Communities
– Computer Science
– Audio Engineering
– Psychology & Philosophy
– Library Science
– Law
– Musicology
Research Areas
– Representation
– Indexing
– User Interface Design
– Compression
– Feature Detection
– Machine Learning
– Metadata
– Musical Analysis
– Epistemology & Ontology
– Perception
– Intellectual Property
– Classification
Extracting Voice Information From Music
Viewing MIR from a speech-processing perspective
Speech Processing
– Analysis/Synthesis
– Recognition
  • Speech Recognition: Phone Recognition, Word Recognition, Tone Recognition
  • Speaker Recognition: Speaker Identification, Speaker Verification, Speaker Clustering
  • Language Recognition: Language Identification, Language Verification, Dialect Identification
– Coding
Singing Processing
– Analysis/Synthesis
– Recognition
  • "Singing Recognition": Phone Recognition, Lyric Transcription, Melody Extraction
  • Singer Recognition: Singer Identification, Singer Detection, Singer Clustering
  • Language Recognition: Language Identification, Language Verification, Dialect Identification
– Coding
Speech/Singing/Music/Other Sounds Discrimination
Singer Recognition Tasks (I)
Singer Identification
– Determining who is singing
Who performed this music recording?
Singer Recognition Tasks (II)
Singer Detection
– Determining whether or not a specified singer is present in a music recording
Is Coco's voice in this music recording?
Singer Recognition Tasks (III)
Singer Tracking
– Locating where a specified singer is present in a music recording
Where is Coco's voice?
Singer Recognition Tasks (IV)
Singer Clustering
– Grouping music recordings by the same singer into a cluster
Clustering by singer
Potential Applications
Indexing
– Finding cameos or guest appearances in live concert recordings.
– Identifying the singers in a movie’s musical interludes.
Music recommendation systems
– Suggesting music by singers with similar voices.
Karaoke services
– Efficiently organizing the customer’s recordings.
– Personalization.
Copyright protection
– Distinguishing an original song from a cover-band version.
– Rapidly scanning suspect websites for piracy.
Singer’s Vocal Characteristics
Humans use several levels of perceptual cues for distinguishing among singers
Learned traits
– Characteristics: singing pronunciation and idiosyncrasies. Source: culture and socio-background
– Characteristics: modulation of pitch, rhythm, speed, intonation, and volume. Source: personality type
Physical traits
– Characteristics: acoustic aspects of singing, e.g., nasal, deep, breathy, and rough. Source: anatomical structure of the vocal apparatus
Major Challenges In Singer Recognition
The vast majority of popular music contains background accompaniment during most or all vocal passages
– Infeasible to acquire isolated solo voice data for extracting the singer’s vocal characteristics
The proposed solution:
Vocal segment detection followed by solo vocal signal modeling
Vocal/Non-vocal Segmentation
Block diagram
– Music Recording → Frame Segmentation → Filter Bank → Feature Extraction: Mel-scale Frequency Cepstral Coefficients (MFCCs)
– Feature Vectors → Sliding Window → Decision against the Vocal Model and the Non-vocal Model → Vocal or Non-vocal
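The decision stage of the diagram can be sketched as a log-likelihood ratio test over a window of frames. This is a minimal illustration, not the system's implementation: it assumes scalar features instead of MFCC vectors, non-overlapping windows instead of a sliding one, and hypothetical models given as lists of (weight, mean, variance) tuples.

```python
import math

def gmm_loglik(x, gmm):
    """Log-density of scalar x under a 1-D GMM given as [(weight, mean, var), ...]."""
    density = sum(
        w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
        for w, m, v in gmm
    )
    return math.log(density)

def segment(frames, vocal_gmm, nonvocal_gmm, win=3):
    """Label each block of `win` frames as vocal (True) or non-vocal (False)."""
    labels = []
    for start in range(0, len(frames), win):
        window = frames[start:start + win]
        # Accumulated log-likelihood ratio of the vocal vs. non-vocal model.
        llr = sum(gmm_loglik(x, vocal_gmm) - gmm_loglik(x, nonvocal_gmm)
                  for x in window)
        labels.append(llr > 0.0)
    return labels

# Hypothetical single-component models and toy frame values.
vocal_gmm = [(1.0, 2.0, 1.0)]
nonvocal_gmm = [(1.0, -2.0, 1.0)]
frames = [2.1, 1.9, 2.3, -2.2, -1.8, -2.0]
print(segment(frames, vocal_gmm, nonvocal_gmm))
```

Accumulating the ratio over a window rather than deciding frame by frame smooths out isolated frame errors, at the cost of coarser segment boundaries.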
Gaussian Mixture Model (I)
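Model description
– The distribution of the feature vector $\mathbf{x}$ is represented by a mixture of M component Gaussian densities, i.e.,
\[ p(\mathbf{x} \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) \]
  • $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ is the i-th Gaussian density with mean $\boldsymbol{\mu}_i$ and covariance matrix $\boldsymbol{\Sigma}_i$
– A Gaussian mixture model (GMM) is characterized by
\[ \lambda = \{ w_i, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i \mid 1 \le i \le M \} \]
[Figure: $\mathbf{x}$ is evaluated by the component densities $\mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1), \ldots, \mathcal{N}(\boldsymbol{\mu}_M, \boldsymbol{\Sigma}_M)$, whose outputs are weighted by $w_1, \ldots, w_M$ and summed to give $p(\mathbf{x} \mid \lambda)$.]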
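The mixture density p(x | λ) can be evaluated directly from its definition. The sketch below assumes diagonal covariance matrices, a common simplification that the slide does not state, and the function names are illustrative.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density N(x; mean, diag(var)) for a vector x, computed in the log domain."""
    log_p = 0.0
    for xd, md, vd in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
    return math.exp(log_p)

def gmm_pdf(x, weights, means, variances):
    """p(x | lambda) = sum_i w_i N(x; mu_i, Sigma_i) with diagonal Sigma_i."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Two equally weighted 1-D components centered at -1 and +1, evaluated at 0.
print(gmm_pdf([0.0], [0.5, 0.5], [[-1.0], [1.0]], [[1.0], [1.0]]))
```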
Gaussian Mixture Model (II)
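Parameter estimation
– Using the EM algorithm, an initial model $\lambda$ is created, and the new model $\hat{\lambda}$ is then estimated by maximizing the auxiliary function
\[ Q(\hat{\lambda}) = \sum_{t=1}^{T} \sum_{i=1}^{M} p(i \mid \mathbf{x}_t, \lambda) \, \log p(i, \mathbf{x}_t \mid \hat{\lambda}), \]
where
\[ p(i, \mathbf{x}_t \mid \hat{\lambda}) = \hat{w}_i \, \mathcal{N}(\mathbf{x}_t; \hat{\boldsymbol{\mu}}_i, \hat{\boldsymbol{\Sigma}}_i) \]
and
\[ p(i \mid \mathbf{x}_t, \lambda) = \frac{w_i \, \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)}{\sum_{m=1}^{M} w_m \, \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m)} \]
– Letting $\nabla Q(\hat{\lambda}) = 0$ for each parameter to be re-estimated, we have
\[ \hat{w}_i = \frac{1}{T} \sum_{t=1}^{T} p(i \mid \mathbf{x}_t, \lambda) \]
\[ \hat{\boldsymbol{\mu}}_i = \frac{\sum_{t=1}^{T} p(i \mid \mathbf{x}_t, \lambda) \, \mathbf{x}_t}{\sum_{t=1}^{T} p(i \mid \mathbf{x}_t, \lambda)} \]
\[ \hat{\boldsymbol{\Sigma}}_i = \frac{\sum_{t=1}^{T} p(i \mid \mathbf{x}_t, \lambda) \, \mathbf{x}_t \mathbf{x}_t'}{\sum_{t=1}^{T} p(i \mid \mathbf{x}_t, \lambda)} - \hat{\boldsymbol{\mu}}_i \hat{\boldsymbol{\mu}}_i' \]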
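One iteration of these re-estimation formulas can be sketched for scalar data as follows. This is an illustration only: it assumes 1-D features, and the variance floor is an added safeguard that is not part of the slide's formulas.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    return sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )

def em_step(data, weights, means, variances):
    """One EM iteration for a 1-D GMM; returns updated (weights, means, variances)."""
    M, T = len(weights), len(data)
    # E-step: posteriors p(i | x_t, lambda) for every frame and component.
    post = []
    for x in data:
        joint = [weights[i] * normal_pdf(x, means[i], variances[i]) for i in range(M)]
        total = sum(joint)
        post.append([j / total for j in joint])
    # M-step: weight, mean, and variance updates from the posteriors.
    new_w, new_mu, new_var = [], [], []
    for i in range(M):
        occ = sum(post[t][i] for t in range(T))
        mu = sum(post[t][i] * data[t] for t in range(T)) / occ
        second = sum(post[t][i] * data[t] ** 2 for t in range(T)) / occ
        new_w.append(occ / T)
        new_mu.append(mu)
        new_var.append(max(second - mu * mu, 1e-6))  # floor: added assumption
    return new_w, new_mu, new_var

# Toy data with two groups and a deliberately poor initial model.
data = [0.1, -0.2, 0.0, 0.3, 4.9, 5.2, 5.0, 4.8]
w, mu, var = [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]
for _ in range(10):
    w, mu, var = em_step(data, w, mu, var)
print(w, mu, var)
```

Each iteration should not decrease the data log-likelihood, which gives a simple convergence check in practice.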
Distilling Singers’ Voices From Music
Substantial similarities exist between the instrumental regions and the accompaniment of the vocal signal
Solo voice can be modeled via suppressing the background music estimated from the instrumental regions.
[Figure: Solo Voice + Accompaniments = Accompanied Voice]
Solo Vocal Signal Modeling (I)
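Model Description
– An accompanied voice $V = \{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_T\}$ (observable) is a mixture, $V = f(S, B)$, of a solo voice $S = \{\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_T\}$ (unobservable) and a background music $B = \{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_T\}$ (unobservable)
– The solo voice and the background music are each modeled by a GMM:
\[ \lambda_s = \{ w_{s,i}, \boldsymbol{\mu}_{s,i}, \boldsymbol{\Sigma}_{s,i} \mid 1 \le i \le M \}, \qquad \lambda_b = \{ w_{b,j}, \boldsymbol{\mu}_{b,j}, \boldsymbol{\Sigma}_{b,j} \mid 1 \le j \le N \} \]
– The likelihood of the accompanied voice is
\[ p(V \mid \lambda_s, \lambda_b) = \prod_{t=1}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} w_{s,i} \, w_{b,j} \, p(\mathbf{v}_t \mid i, j, \lambda_s, \lambda_b), \]
\[ p(\mathbf{v}_t \mid i, j, \lambda_s, \lambda_b) = \iint_{\mathbf{v}_t = f(\mathbf{s}_t, \mathbf{b}_t)} \mathcal{N}(\mathbf{s}_t; \boldsymbol{\mu}_{s,i}, \boldsymbol{\Sigma}_{s,i}) \, \mathcal{N}(\mathbf{b}_t; \boldsymbol{\mu}_{b,j}, \boldsymbol{\Sigma}_{b,j}) \, d\mathbf{s}_t \, d\mathbf{b}_t . \]
– $\lambda_b$ can be approximately estimated using the instrumental regions of music
– Our aim is to find an optimal $\lambda_s$ such that (in the maximum likelihood sense)
\[ \lambda_s^{*} = \arg\max_{\lambda_s} \, p(V \mid \lambda_s, \lambda_b). \]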
Solo Vocal Signal Modeling (II)
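Parameter estimation
– Defining an auxiliary function
\[ Q(\hat{\lambda}_s) = \sum_{t=1}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} p(i, j \mid \mathbf{v}_t, \lambda_s, \lambda_b) \, \log p(i, j, \mathbf{v}_t \mid \hat{\lambda}_s, \lambda_b), \]
where
\[ p(i, j, \mathbf{v}_t \mid \hat{\lambda}_s, \lambda_b) = \hat{w}_{s,i} \, w_{b,j} \, p(\mathbf{v}_t \mid i, j, \hat{\lambda}_s, \lambda_b), \]
\[ p(i, j \mid \mathbf{v}_t, \lambda_s, \lambda_b) = \frac{w_{s,i} \, w_{b,j} \, p(\mathbf{v}_t \mid i, j, \lambda_s, \lambda_b)}{\sum_{m=1}^{M} \sum_{n=1}^{N} w_{s,m} \, w_{b,n} \, p(\mathbf{v}_t \mid m, n, \lambda_s, \lambda_b)} . \]
– Letting $\nabla Q(\hat{\lambda}_s) = 0$ for each parameter to be re-estimated, we have
\[ \hat{w}_{s,i} = \frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{N} p(i, j \mid \mathbf{v}_t, \lambda_s, \lambda_b), \]
\[ \hat{\boldsymbol{\mu}}_{s,i} = \frac{\sum_{t=1}^{T} \sum_{j=1}^{N} p(i, j \mid \mathbf{v}_t, \lambda_s, \lambda_b) \, E\{\mathbf{s}_t \mid \mathbf{v}_t, i, j, \lambda_s, \lambda_b\}}{\sum_{t=1}^{T} \sum_{j=1}^{N} p(i, j \mid \mathbf{v}_t, \lambda_s, \lambda_b)}, \]
\[ \hat{\boldsymbol{\Sigma}}_{s,i} = \frac{\sum_{t=1}^{T} \sum_{j=1}^{N} p(i, j \mid \mathbf{v}_t, \lambda_s, \lambda_b) \, E\{\mathbf{s}_t \mathbf{s}_t' \mid \mathbf{v}_t, i, j, \lambda_s, \lambda_b\}}{\sum_{t=1}^{T} \sum_{j=1}^{N} p(i, j \mid \mathbf{v}_t, \lambda_s, \lambda_b)} - \hat{\boldsymbol{\mu}}_{s,i} \hat{\boldsymbol{\mu}}_{s,i}' . \]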
Solo Vocal Signal Modeling (III)
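Re-estimation formulas for linear spectral features
– Suppose V is a linear spectral feature, and S and B are additive in the time domain; then $v_t = s_t + b_t$
– $p(v_t \mid i, j, \lambda_s, \lambda_b)$ is the convolution of the solo and background music densities, i.e.,
\[ p(v_t \mid i, j, \lambda_s, \lambda_b) = \int \mathcal{N}(s_t; \mu_{s,i}, \sigma_{s,i}^2) \, \mathcal{N}(v_t - s_t; \mu_{b,j}, \sigma_{b,j}^2) \, ds_t \]
– $E\{s_t \mid v_t, i, j, \lambda_s, \lambda_b\}$ and $E\{s_t^2 \mid v_t, i, j, \lambda_s, \lambda_b\}$ can be shown in the following form:
\[ E\{s_t \mid v_t, i, j, \lambda_s, \lambda_b\} = \mu_{s,i} + \frac{\sigma_{s,i}^2}{\sigma_{s,i}^2 + \sigma_{b,j}^2} \left( v_t - \mu_{s,i} - \mu_{b,j} \right), \]
\[ E\{s_t^2 \mid v_t, i, j, \lambda_s, \lambda_b\} = \frac{\sigma_{s,i}^2 \, \sigma_{b,j}^2}{\sigma_{s,i}^2 + \sigma_{b,j}^2} + \left( E\{s_t \mid v_t, i, j, \lambda_s, \lambda_b\} \right)^2 . \]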
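For the additive case, the conditional moments of the solo component follow from the standard result for a sum of two independent Gaussians. The sketch below computes them for one feature dimension and one component pair (i, j); the function name and argument names are illustrative.

```python
def posterior_moments(v, mu_s, var_s, mu_b, var_b):
    """Return E{s | v} and E{s^2 | v} when v = s + b with independent Gaussian s, b."""
    gain = var_s / (var_s + var_b)
    # The solo mean is shifted by its share of the unexplained part of v.
    mean = mu_s + gain * (v - mu_s - mu_b)
    # Second moment = posterior variance + squared posterior mean.
    second = var_s * var_b / (var_s + var_b) + mean ** 2
    return mean, second

print(posterior_moments(5.0, 1.0, 1.0, 2.0, 1.0))
```

With equal variances, the residual v - mu_s - mu_b is split evenly between the solo and background components, so the solo estimate in the example is 2.0 and the implied background estimate is 3.0.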
Solo Vocal Signal Modeling (IV)
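Re-estimation formulas for cepstral features
– Suppose V is a cepstral feature, and S and B are additive in the time domain; then $v_t = \log[\exp(s_t) + \exp(b_t)]$. We approximate $v_t \approx \max(s_t, b_t)$.
– It can be shown that
\[ p(v_t \mid i, j, \lambda_s, \lambda_b) = \mathcal{N}(v_t; \mu_{s,i}, \sigma_{s,i}^2) \, \Phi\!\left( \frac{v_t - \mu_{b,j}}{\sigma_{b,j}} \right) + \mathcal{N}(v_t; \mu_{b,j}, \sigma_{b,j}^2) \, \Phi\!\left( \frac{v_t - \mu_{s,i}}{\sigma_{s,i}} \right), \]
where
\[ \Phi(u) = \int_{-\infty}^{u} \frac{1}{\sqrt{2\pi}} \, e^{-w^2/2} \, dw . \]
\[ E\{s_t \mid v_t, i, j, \lambda_s, \lambda_b\} = p(s_t = v_t \mid v_t, i, j, \lambda_s, \lambda_b) \, v_t + \left[ 1 - p(s_t = v_t \mid v_t, i, j, \lambda_s, \lambda_b) \right] E\{s_t \mid s_t < v_t, i, j, \lambda_s, \lambda_b\}, \]
\[ E\{s_t^2 \mid v_t, i, j, \lambda_s, \lambda_b\} = p(s_t = v_t \mid v_t, i, j, \lambda_s, \lambda_b) \, v_t^2 + \left[ 1 - p(s_t = v_t \mid v_t, i, j, \lambda_s, \lambda_b) \right] E\{s_t^2 \mid s_t < v_t, i, j, \lambda_s, \lambda_b\}, \]
\[ p(s_t = v_t \mid v_t, i, j, \lambda_s, \lambda_b) = \frac{\mathcal{N}(v_t; \mu_{s,i}, \sigma_{s,i}^2) \, \Phi\!\left( \frac{v_t - \mu_{b,j}}{\sigma_{b,j}} \right)}{\mathcal{N}(v_t; \mu_{s,i}, \sigma_{s,i}^2) \, \Phi\!\left( \frac{v_t - \mu_{b,j}}{\sigma_{b,j}} \right) + \mathcal{N}(v_t; \mu_{b,j}, \sigma_{b,j}^2) \, \Phi\!\left( \frac{v_t - \mu_{s,i}}{\sigma_{s,i}} \right)}, \]
\[ E\{s_t \mid s_t < v_t, i, j, \lambda_s, \lambda_b\} = \mu_{s,i} - \sigma_{s,i}^2 \, \frac{\mathcal{N}(v_t; \mu_{s,i}, \sigma_{s,i}^2)}{\Phi\!\left( \frac{v_t - \mu_{s,i}}{\sigma_{s,i}} \right)}, \]
\[ E\{s_t^2 \mid s_t < v_t, i, j, \lambda_s, \lambda_b\} = \mu_{s,i}^2 + \sigma_{s,i}^2 - \sigma_{s,i}^2 \, (v_t + \mu_{s,i}) \, \frac{\mathcal{N}(v_t; \mu_{s,i}, \sigma_{s,i}^2)}{\Phi\!\left( \frac{v_t - \mu_{s,i}}{\sigma_{s,i}} \right)} . \]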
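Under the max approximation, the per-component likelihood and the expected solo value can be computed with the standard normal CDF, which Python exposes through `math.erf`. The sketch below treats a single feature dimension and one component pair (i, j); all function names are illustrative.

```python
import math

def npdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def ncdf(u):
    """Standard normal CDF, Phi(u)."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def max_model(v, mu_s, var_s, mu_b, var_b):
    """Return (p(v | i, j), p(s = v | v, i, j)) for v = max(s, b)."""
    s_wins = npdf(v, mu_s, var_s) * ncdf((v - mu_b) / math.sqrt(var_b))
    b_wins = npdf(v, mu_b, var_b) * ncdf((v - mu_s) / math.sqrt(var_s))
    total = s_wins + b_wins
    return total, s_wins / total

def expected_solo(v, mu_s, var_s, mu_b, var_b):
    """E{s | v, i, j}: v if the solo dominates, else the mean of s truncated below v."""
    _, p_solo = max_model(v, mu_s, var_s, mu_b, var_b)
    trunc_mean = mu_s - var_s * npdf(v, mu_s, var_s) / ncdf((v - mu_s) / math.sqrt(var_s))
    return p_solo * v + (1.0 - p_solo) * trunc_mean

print(max_model(0.0, 0.0, 1.0, 0.0, 1.0), expected_solo(0.0, 0.0, 1.0, 0.0, 1.0))
```

In the symmetric example, the solo and the background are equally likely to be the larger term, so the probability that the solo equals the observation is one half.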
Singer Identification (SID)
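Block diagram
Training Phase
– Training Data → Vocal/Instrumental Segmentation
– Instrumental Signal B (non-vocal portion) → Gaussian Mixture Modeling, $\max_{\lambda_b} p(B \mid \lambda_b)$ → Background Music Model $\lambda_b$
– Accompanied Signal V (vocal portion) and $\lambda_b$ → Solo Signal Modeling, $\max_{\lambda_s} p(V \mid \lambda_s, \lambda_b)$ → Solo Model $\lambda_s$
Testing Phase
– Test Data X → Vocal/Instrumental Segmentation
– Instrumental Signal $\tilde{B}$ → Gaussian Mixture Modeling, $\max_{\tilde{\lambda}_b} p(\tilde{B} \mid \tilde{\lambda}_b)$ → Background Music Model $\tilde{\lambda}_b$
– Accompanied Signal $X_V$, $\tilde{\lambda}_b$, and the Solo Models for P Singers $\lambda_{s,1}, \lambda_{s,2}, \ldots, \lambda_{s,P}$ → Maximum Likelihood Decision
– Hypothesized Singer: $\arg\max_i \, p(X_V \mid \lambda_{s,i}, \tilde{\lambda}_b)$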
SID Experiments
Music data
– 200 tracks from Mandarin pop music CDs
– 10 female & 10 male singers
– 5 tracks/singer for training; 5 tracks/singer for testing
– 20-min instrumental-only data for training the non-vocal GMM
– 22.05 kHz sampling rate (down-sampled from 44.1 kHz)
Vocal/Non-vocal segmentation
– 82.3% frame accuracy
[Figure: SID accuracy (in %) versus recording length (1000, 3000, 6000, and 9000 frames, and the entire recording) for four systems: GMM with manual segmentation; GMM with automatic segmentation; solo modeling with manual segmentation; solo modeling with automatic segmentation.]
Singer Clustering (I)
Block diagram
– Feature vectors of music recordings: $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$
– Solo Modeling of each recording → Models $\lambda_1, \lambda_2, \ldots, \lambda_N$
– Log-likelihood Computation: $L_{ij} = \log p(\mathbf{x}_i \mid \lambda_j)$, giving the log-likelihoods $L_{i1}, L_{i2}, \ldots, L_{iN}$ for each recording $\mathbf{x}_i$
– Transform of each row of log-likelihoods → Characteristic Vectors $F_1, F_2, \ldots, F_N$
– Vector Clustering → Cluster 1: $\{\mathbf{x}_3, \mathbf{x}_7, \ldots\}$, Cluster 2: $\{\mathbf{x}_1, \mathbf{x}_9, \ldots\}$, ..., Cluster M: $\{\mathbf{x}_6, \mathbf{x}_{10}, \ldots\}$
Singer Clustering (II)
An example of the characteristic vectors
[Figure: example vectors V, L, and F plotted over recordings by Singer 1 to Singer 5. The characteristic-vector element $F_{i,j}$ is obtained from the log-likelihoods $L_{i,j}$ and $\arg\max_k L_{i,k}$.]
Singer Clustering (III)
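Determining the number of clusters
– Bayesian Information Criterion (BIC)
  • Measuring how well the model fits a data set, and how simple the model is, specifically
\[ \mathrm{BIC}(\lambda) = \log p(D \mid \lambda) - \frac{1}{2} \, \gamma \, d \, \log |D| \]
    d: no. of free parameters in model $\lambda$; $|D|$: size of the data set D; $\gamma$: a penalty factor
– The BIC for a K-clustering is computed by:
\[ \mathrm{BIC}(K) = -\sum_{k=1}^{K} \frac{n_k}{2} \log |\boldsymbol{\Sigma}_k| - \frac{1}{2} \, \gamma \, K \left( M + \frac{1}{2} M (M + 1) \right) \log M \]
    M: total no. of elements; $n_k$: no. of elements of the cluster k; $\boldsymbol{\Sigma}_k$: covariance matrix of the characteristic vectors in the cluster k
– A reasonable number of clusters can be determined by
\[ K^{*} = \arg\max_{1 \le K \le M} \mathrm{BIC}(K). \]
[Figure: the same data clustered with K=3 and with K=4; BIC increases or not?]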
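The trade-off that BIC encodes can be illustrated on scalar data, where each cluster covariance reduces to a variance and each cluster contributes two free parameters (a mean and a variance). This is a simplified sketch, not the slide's setting: it assumes 1-D "characteristic values" instead of characteristic vectors, a penalty factor of 1.0, and a variance floor added to avoid taking the log of zero.

```python
import math

def bic_1d(clusters, gamma=1.0):
    """BIC of a clustering of scalar values: fit term minus complexity penalty."""
    total = sum(len(c) for c in clusters)
    fit = 0.0
    for c in clusters:
        mean = sum(c) / len(c)
        var = max(sum((x - mean) ** 2 for x in c) / len(c), 1e-6)  # floor: assumption
        fit -= 0.5 * len(c) * math.log(var)
    # Two free parameters per cluster in one dimension.
    penalty = 0.5 * gamma * len(clusters) * 2 * math.log(total)
    return fit - penalty

low = [0.0, 0.1, -0.1, 0.2]
high = [5.0, 5.1, 4.9, 5.2]
candidates = {
    1: [low + high],
    2: [low, high],
    4: [low[:2], low[2:], high[:2], high[2:]],
}
best_k = max(candidates, key=lambda k: bic_1d(candidates[k]))
print(best_k)
```

Splitting the data further always tightens the clusters, but past the natural grouping the penalty term grows faster than the fit term, which is why the criterion peaks at an intermediate K.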
Singer Clustering Experiments (I)
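Music data
– 200 tracks (20 singers; 10 tracks/singer)
Assessment method
– Cluster purity
\[ \rho_k = \sum_{p=1}^{P} \frac{n_{kp}^2}{n_k^2}, \]
  • $\rho_k$ is the purity of the cluster k, $n_k$ the total no. of recordings in the cluster k, and $n_{kp}$ the no. of recordings in the cluster k that were performed by singer p
– Average purity
\[ \bar{\rho} = \frac{1}{M} \sum_{k=1}^{K} n_k \, \rho_k, \]
  • M is the total no. of recordings, and K the no. of clusters
Examples
\[ \rho = \frac{2^2 + 1^2 + 1^2 + 6^2}{10^2} = 0.42, \qquad \rho = \frac{6^2}{6^2} = 1.0, \qquad \rho = \frac{1^2 + 1^2 + 1^2 + 1^2}{4^2} = 0.25 \]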
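The purity measures are straightforward to compute from the singer label of each recording in a cluster. The sketch below reproduces the three worked examples; the singer labels "A" to "D" are made up for illustration.

```python
from collections import Counter

def cluster_purity(singer_labels):
    """Purity of one cluster: sum over singers of n_kp^2, divided by n_k^2."""
    n_k = len(singer_labels)
    counts = Counter(singer_labels)
    return sum(c * c for c in counts.values()) / (n_k * n_k)

def average_purity(clusters):
    """Size-weighted mean purity over all clusters: (1/M) * sum_k n_k * rho_k."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) * cluster_purity(c) for c in clusters) / total

mixed = ["A"] * 2 + ["B"] + ["C"] + ["D"] * 6   # singer counts 2, 1, 1, 6
pure = ["A"] * 6                                # one singer only
scattered = ["A", "B", "C", "D"]                # four different singers

print(cluster_purity(mixed), cluster_purity(pure), cluster_purity(scattered))
print(average_purity([mixed, pure, scattered]))
```

Because the counts are squared, a cluster dominated by one singer scores much higher than one whose recordings are spread evenly over several singers.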
Singer Clustering Experiments (II)
Results
[Figure: average purity (0.00 to 1.00) versus no. of clusters (0 to 90) for four systems: manual segmentation with a 32-mix solo GMM & 8-mix background GMM per recording; manual segmentation with a 32-mix vocal GMM per recording; automatic segmentation with a 24-mix solo GMM & 8-mix background GMM per recording; automatic segmentation with a 24-mix vocal GMM per recording.]
[Figure: BIC versus no. of clusters (0 to 50) for data sets of 5, 10, 15, and 20 singers, with the appropriate no. of clusters marked.]
Summary
We have
– Separated vocal from non-vocal segments of music;
– Isolated singers’ vocal characteristics from the background music;
– Distinguished singers from one another.
We will
– Handle a wider variety of music data, including duets, trios, choruses, background vocals, and music with multiple simultaneous or non-simultaneous singers;
– Deal with the other problems of voice information retrieval from music, such as lyric transcription and singing language recognition.
To Probe Further (I)
Selected references
– Music information retrieval
  • A. L. Uitdenbogerd, “Music IR: past, present, and future,” Proceedings of International Symposium on Music Information Retrieval, 2000.
  • J. Futrelle and J. S. Downie, “Interdisciplinary communities and research issues in music information retrieval,” Proceedings of International Conference on Music Information Retrieval, pp. 215–221, 2002.
– Artist recognition
  • B. Whitman, G. Flake, and S. Lawrence, “Artist detection in music with Minnowmatch,” Proceedings of IEEE Workshop on Neural Networks for Signal Processing, 2001.
  • A. Berenzweig, D. P. W. Ellis, and S. Lawrence, “Using voice segments to improve artist classification of music,” Proceedings of International Conference on Virtual, Synthetic and Entertainment Audio, 2002.
– Singer identification
  • Y. E. Kim and B. Whitman, “Singer identification in popular music recordings using voice coding features,” Proceedings of International Conference on Music Information Retrieval, pp. 164–169, 2002.
  • C. C. Liu and C. S. Huang, “A singer identification technique for content-based classification of MP3 music objects,” Proceedings of International Conference on Information and Knowledge Management, pp. 438–445, 2002.
  • T. Zhang, “Automatic singer identification,” Proceedings of International Conference on Multimedia and Expo, 2003.
  • W. H. Tsai, H. M. Wang, and D. Rodgers, “Automatic singer identification of popular music recordings via estimation and modeling of solo vocal signal,” Proceedings of European Conference on Speech Communication and Technology, 2003.
– Singer clustering
  • W. H. Tsai, H. M. Wang, D. Rodgers, S. S. Cheng, and H. M. Yu, “Blind clustering of popular music recordings based on singer voice characteristics,” to appear in Proceedings of International Conference on Music Information Retrieval, 2003.
To Probe Further (II)
General resources
– Important conferences
  • International Conference on Music Information Retrieval
  • International Computer Music Conference
  • IEEE International Conference on Multimedia and Expo
  • ACM International Multimedia Conference
  • International Conference on New Interfaces for Musical Expression
– Organizations
  • International Computer Music Association (http://www.computermusic.org/)
  • The Australasian Computer Music Association (http://www.acma.asn.au/)
  • ACM Multimedia (http://www.acm.org/sigmm/)
  • Acoustical Society of America (http://asa.aip.org/)
– Journals
  • Computer Music Journal (http://www-mitpress.mit.edu/catalog/item/default.asp?ttype=4&tid=15)
  • Journal of New Music Research (http://www.swets.nl/jnmr/jnmr.html)
  • Computing in Musicology (http://www.ccarh.org/publications/books/cm/)
– Useful links
  • http://www.leighsmith.com/Browsers/Cmusic.html
  • http://www2.siba.fi/Kulttuuripalvelut/computers.html