Singer Identification and Clustering of Popular Music Recordings

Wei-Ho Tsai

Institute of Information Science, Academia Sinica


Extracting Information From Music

Music Information Retrieval (MIR)
– To develop ways of managing collections of musical material for preservation, access, research, and other uses.

MIR communities & research areas [after Futrelle & Downie, 2002]

Communities
• Computer Science
• Audio Engineering
• Psychology & Philosophy
• Library Science
• Law
• Musicology

Research Areas
• Representation
• Indexing
• User Interface Design
• Compression
• Feature Detection
• Machine Learning
• Metadata
• Musical Analysis
• Epistemology & Ontology
• Perception
• Intellectual Property
• Classification


Extracting Voice Information From Music

Viewing MIR from a speech-processing perspective

Speech Processing
– Analysis/Synthesis
– Recognition
• Speech Recognition: Phone Recognition, Word Recognition, Tone Recognition
• Speaker Recognition: Speaker Identification, Speaker Verification, Speaker Clustering
• Language Recognition: Language Identification, Language Verification, Dialect Identification
– Coding

Singing Processing
– Analysis/Synthesis
– Recognition
• "Singing Recognition": Phone Recognition, Lyric Transcription, Melody Extraction
• Singer Recognition: Singer Identification, Singer Detection, Singer Clustering
• Language Recognition: Language Identification, Language Verification, Dialect Identification
– Coding

Speech/Singing/Music/Other Sounds Discrimination


Singer Recognition Tasks (I)

Singer Identification
– Determining who is singing

Who performed this music recording?


Singer Recognition Tasks (II)

Singer Detection
– Determining whether or not a specified singer is present in a music recording

Is Coco's voice in this music recording?


Singer Recognition Tasks (III)

Singer Tracking
– Locating where a specified singer is present in a music recording

Where is Coco's voice?


Singer Recognition Tasks (IV)

Singer Clustering
– Grouping the same-singer music recordings into a cluster

Clustering by singer


Potential Applications

Indexing
– Finding cameos or guest appearances in live concert recordings.
– Identifying the singers in a movie’s musical interludes.

Music recommendation systems
– Suggesting music by singers with similar voices.

Karaoke services
– Efficiently organizing the customer’s recordings.
– Personalization.

Copyright protection
– Distinguishing between an original song and a cover-band version.
– Rapidly scanning suspect websites for piracy.


Singer’s Vocal Characteristics

Humans use several levels of perceptual cues for distinguishing among singers

Learned traits
– Characteristics: singing pronunciation and idiosyncrasies. Sources: culture and socio-background.
– Characteristics: modulation of pitch, rhythm, speed, intonation, and volume. Sources: personality type.

Physical traits
– Characteristics: acoustic aspect of singing, e.g., nasal, deep, breathy, and rough. Sources: anatomical structure of vocal apparatus.


Major Challenges In Singer Recognition

The vast majority of popular music contains background accompaniment during most or all vocal passages
– Infeasible to acquire isolated solo voice data for extracting the singer’s vocal characteristics

The proposed solution:
– Vocal segment detection followed by solo vocal signal modeling


Vocal/Non-vocal Segmentation

[Block diagram: Music Recording → Feature Extraction (Frame Segment, Filter Bank, Mel-scale Frequency Cepstral Coefficients (MFCCs)) → Feature Vectors → Sliding Window → Decision using the Vocal Model and the Non-vocal Model → Vocal or Non-vocal]
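The decision stage of the segmentation above can be sketched as follows. This is a minimal illustration, not the system used in the talk: the features are scalar stand-ins for MFCC vectors, the two-component mixtures `VOCAL` and `NONVOCAL` are made-up parameters, and the non-overlapping window of 3 frames is an assumption.

```python
import math

def gmm_logpdf(x, weights, means, variances):
    """Log density of a 1-D Gaussian mixture evaluated at x."""
    total = 0.0
    for w, m, v in zip(weights, means, variances):
        total += w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
    return math.log(total)

def segment(frames, vocal, nonvocal, win=3):
    """Label each non-overlapping window of `win` frames as vocal or non-vocal.

    The window log-likelihood is the sum of the frame log-likelihoods, and the
    model with the larger value wins.
    """
    labels = []
    for start in range(0, len(frames), win):
        window = frames[start:start + win]
        ll_vocal = sum(gmm_logpdf(x, *vocal) for x in window)
        ll_nonvocal = sum(gmm_logpdf(x, *nonvocal) for x in window)
        labels.append("vocal" if ll_vocal > ll_nonvocal else "non-vocal")
    return labels

# Hypothetical models: (weights, means, variances) of 2-component mixtures.
VOCAL = ([0.5, 0.5], [2.0, 3.0], [0.5, 0.5])
NONVOCAL = ([0.5, 0.5], [-2.0, -1.0], [0.5, 0.5])

frames = [2.1, 2.8, 3.2, -1.5, -2.2, -0.9, 2.5, 1.9, 3.1]
labels = segment(frames, VOCAL, NONVOCAL)
print(labels)
```

Summing log-likelihoods over a window smooths the frame-level decisions, which is presumably why the diagram places a sliding window before the decision block.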


Gaussian Mixture Model (I)

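Model description
– The distribution of the feature vector $\mathbf{x}$ is represented by a mixture of $M$ component Gaussian densities, i.e.,

$$p(\mathbf{x} \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$$

• $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ is the $i$-th Gaussian density with mean $\boldsymbol{\mu}_i$ and covariance matrix $\boldsymbol{\Sigma}_i$
– A Gaussian mixture model (GMM) is characterized by

$$\lambda = \{ w_i, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i \mid 1 \le i \le M \}$$

[Diagram: the feature vector $\mathbf{x}$ feeds $M$ Gaussian densities $\mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1), \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2), \ldots, \mathcal{N}(\boldsymbol{\mu}_M, \boldsymbol{\Sigma}_M)$, whose outputs are weighted by $w_1, w_2, \ldots, w_M$ and summed to give $p(\mathbf{x} \mid \lambda)$]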


Gaussian Mixture Model (II)

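Parameter estimation
– Using the EM algorithm, an initial model $\lambda$ is created, and the new model $\hat{\lambda}$ is then estimated by maximizing the auxiliary function

$$Q(\hat{\lambda}) = \sum_{t=1}^{T} \sum_{i=1}^{M} p(i \mid \mathbf{x}_t, \lambda) \log p(i, \mathbf{x}_t \mid \hat{\lambda}),$$

where

$$p(i, \mathbf{x}_t \mid \hat{\lambda}) = \hat{w}_i \, \mathcal{N}(\mathbf{x}_t; \hat{\boldsymbol{\mu}}_i, \hat{\boldsymbol{\Sigma}}_i)$$

and

$$p(i \mid \mathbf{x}_t, \lambda) = \frac{w_i \, \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)}{\sum_{m=1}^{M} w_m \, \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m)}$$

– Letting $\nabla Q(\hat{\lambda}) = 0$ for each parameter to be re-estimated, we have

$$\hat{w}_i = \frac{1}{T} \sum_{t=1}^{T} p(i \mid \mathbf{x}_t, \lambda)$$

$$\hat{\boldsymbol{\mu}}_i = \frac{\sum_{t=1}^{T} p(i \mid \mathbf{x}_t, \lambda) \, \mathbf{x}_t}{\sum_{t=1}^{T} p(i \mid \mathbf{x}_t, \lambda)}$$

$$\hat{\boldsymbol{\Sigma}}_i = \frac{\sum_{t=1}^{T} p(i \mid \mathbf{x}_t, \lambda) \, \mathbf{x}_t \mathbf{x}_t^{\prime}}{\sum_{t=1}^{T} p(i \mid \mathbf{x}_t, \lambda)} - \hat{\boldsymbol{\mu}}_i \hat{\boldsymbol{\mu}}_i^{\prime}$$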
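The re-estimation step can be illustrated with a small sketch. It assumes scalar features so that each covariance matrix reduces to a variance; the data, the initial parameters, the 20 iterations, and the variance floor are illustrative choices that do not come from the talk.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a scalar Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_step(data, weights, means, variances):
    """One EM iteration for a 1-D GMM; returns the re-estimated parameters."""
    num_mix, num_frames = len(weights), len(data)
    # E-step: posterior p(i | x_t, lambda) of each component for each frame.
    post = []
    for x in data:
        joint = [weights[i] * normal_pdf(x, means[i], variances[i])
                 for i in range(num_mix)]
        total = sum(joint)
        post.append([value / total for value in joint])
    # M-step: weight, mean, and variance updates from the posteriors.
    new_w, new_mu, new_var = [], [], []
    for i in range(num_mix):
        occupancy = sum(post[t][i] for t in range(num_frames))
        mu = sum(post[t][i] * data[t] for t in range(num_frames)) / occupancy
        second = sum(post[t][i] * data[t] ** 2 for t in range(num_frames)) / occupancy
        new_w.append(occupancy / num_frames)
        new_mu.append(mu)
        new_var.append(max(second - mu * mu, 1e-6))  # floor is an added safeguard
    return new_w, new_mu, new_var

def log_likelihood(data, weights, means, variances):
    """Total log-likelihood of the data under the mixture."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )

data = [-2.1, -1.9, -2.3, -1.7, 1.8, 2.2, 2.0, 1.6, 2.4]
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])  # initial model (assumed)
history = [log_likelihood(data, *params)]
for _ in range(20):
    params = em_step(data, *params)
    history.append(log_likelihood(data, *params))
print(params[1], history[-1])
```

Tracking the log-likelihood in `history` is a convenient sanity check, since each EM iteration should not decrease it.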


Distilling Singers’ Voices From Music

Substantial similarities exist between the instrumental regions and the accompaniment of the vocal signal

Solo voice can be modeled by suppressing the background music estimated from the instrumental regions.

[Diagram: Accompanied Voice = Solo Voice + Accompaniments]


Solo Vocal Signal Modeling (I)

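Model Description
– A solo voice $S = \{\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_T\}$ (unobservable) is modeled by a GMM

$$\lambda_s = \{ w_{s,i}, \boldsymbol{\mu}_{s,i}, \boldsymbol{\Sigma}_{s,i} \mid 1 \le i \le I \}$$

– A background music $B = \{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_T\}$ (unobservable) is modeled by a GMM

$$\lambda_b = \{ w_{b,j}, \boldsymbol{\mu}_{b,j}, \boldsymbol{\Sigma}_{b,j} \mid 1 \le j \le J \}$$

– An accompanied voice $V = \{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_T\}$ (observable) results from mixing, $V = f(S, B)$, so that

$$p(V \mid \lambda_s, \lambda_b) = \prod_{t=1}^{T} \sum_{i=1}^{I} \sum_{j=1}^{J} w_{s,i} \, w_{b,j} \, p(\mathbf{v}_t \mid i, j, \lambda_s, \lambda_b),$$

$$p(\mathbf{v}_t \mid i, j, \lambda_s, \lambda_b) = \iint_{\mathbf{v}_t = f(\mathbf{s}_t, \mathbf{b}_t)} \mathcal{N}(\mathbf{s}_t; \boldsymbol{\mu}_{s,i}, \boldsymbol{\Sigma}_{s,i}) \, \mathcal{N}(\mathbf{b}_t; \boldsymbol{\mu}_{b,j}, \boldsymbol{\Sigma}_{b,j}) \, d\mathbf{s}_t \, d\mathbf{b}_t.$$

– $\lambda_b$ can be approximately estimated using the instrumental regions of music
– Our aim is to find an optimal $\lambda_s$ such that (in maximum likelihood sense)

$$\lambda_s^{*} = \arg\max_{\lambda_s} \, p(V \mid \lambda_s, \lambda_b).$$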


Solo Vocal Signal Modeling (II)

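Parameter estimation
– Defining an auxiliary function

$$Q(\hat{\lambda}_s) = \sum_{t=1}^{T} \sum_{i=1}^{I} \sum_{j=1}^{J} p(i, j \mid \mathbf{v}_t, \lambda_s, \lambda_b) \log p(i, j, \mathbf{v}_t \mid \hat{\lambda}_s, \lambda_b),$$

where

$$p(i, j, \mathbf{v}_t \mid \hat{\lambda}_s, \lambda_b) = \hat{w}_{s,i} \, w_{b,j} \, p(\mathbf{v}_t \mid i, j, \hat{\lambda}_s, \lambda_b),$$

$$p(i, j \mid \mathbf{v}_t, \lambda_s, \lambda_b) = \frac{w_{s,i} \, w_{b,j} \, p(\mathbf{v}_t \mid i, j, \lambda_s, \lambda_b)}{\sum_{m=1}^{I} \sum_{n=1}^{J} w_{s,m} \, w_{b,n} \, p(\mathbf{v}_t \mid m, n, \lambda_s, \lambda_b)}.$$

– Letting $\nabla_{\hat{\lambda}_s} Q(\hat{\lambda}_s) = 0$ for each parameter to be re-estimated, we have

$$\hat{w}_{s,i} = \frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{J} p(i, j \mid \mathbf{v}_t, \lambda_s, \lambda_b),$$

$$\hat{\boldsymbol{\mu}}_{s,i} = \frac{\sum_{t=1}^{T} \sum_{j=1}^{J} p(i, j \mid \mathbf{v}_t, \lambda_s, \lambda_b) \, E\{\mathbf{s}_t \mid \mathbf{v}_t, i, j, \lambda_s, \lambda_b\}}{\sum_{t=1}^{T} \sum_{j=1}^{J} p(i, j \mid \mathbf{v}_t, \lambda_s, \lambda_b)},$$

$$\hat{\boldsymbol{\Sigma}}_{s,i} = \frac{\sum_{t=1}^{T} \sum_{j=1}^{J} p(i, j \mid \mathbf{v}_t, \lambda_s, \lambda_b) \, E\{\mathbf{s}_t \mathbf{s}_t^{\prime} \mid \mathbf{v}_t, i, j, \lambda_s, \lambda_b\}}{\sum_{t=1}^{T} \sum_{j=1}^{J} p(i, j \mid \mathbf{v}_t, \lambda_s, \lambda_b)} - \hat{\boldsymbol{\mu}}_{s,i} \hat{\boldsymbol{\mu}}_{s,i}^{\prime}.$$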


Solo Vocal Signal Modeling (III)

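Re-estimation formulas for linear spectral features
– Suppose V is a linear spectral feature, and S and B are additive in the time domain, then $v_t = s_t + b_t$
– $p(v_t \mid i, j, \lambda_s, \lambda_b)$ is the convolution of the solo and background music densities, i.e.,

$$p(v_t \mid i, j, \lambda_s, \lambda_b) = \int \mathcal{N}(s_t; \mu_{s,i}, \sigma_{s,i}^2) \, \mathcal{N}(v_t - s_t; \mu_{b,j}, \sigma_{b,j}^2) \, ds_t$$

– $E\{s_t \mid v_t, i, j, \lambda_s, \lambda_b\}$ and $E\{s_t^2 \mid v_t, i, j, \lambda_s, \lambda_b\}$ can be shown in the following form:

$$E\{s_t \mid v_t, i, j, \lambda_s, \lambda_b\} = \frac{\sigma_{s,i}^2}{\sigma_{s,i}^2 + \sigma_{b,j}^2} \, (v_t - \mu_{b,j}) + \frac{\sigma_{b,j}^2}{\sigma_{s,i}^2 + \sigma_{b,j}^2} \, \mu_{s,i},$$

$$E\{s_t^2 \mid v_t, i, j, \lambda_s, \lambda_b\} = \frac{\sigma_{s,i}^2 \, \sigma_{b,j}^2}{\sigma_{s,i}^2 + \sigma_{b,j}^2} + \left( E\{s_t \mid v_t, i, j, \lambda_s, \lambda_b\} \right)^2.$$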
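For the additive case $v_t = s_t + b_t$ with one scalar solo Gaussian and one scalar background Gaussian, the standard Gaussian conditioning results can be checked numerically. The sketch below compares the closed forms with a brute-force summation over $s_t$; the function names and the test values are illustrative assumptions.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a scalar Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def closed_form(v, mu_s, var_s, mu_b, var_b):
    """Return p(v | i, j), E{s | v, i, j}, and E{s^2 | v, i, j} for v = s + b."""
    total = var_s + var_b
    # Convolving two Gaussians gives a Gaussian with summed mean and variance.
    likelihood = normal_pdf(v, mu_s + mu_b, total)
    mean = (var_s / total) * (v - mu_b) + (var_b / total) * mu_s
    second = var_s * var_b / total + mean ** 2
    return likelihood, mean, second

def numeric(v, mu_s, var_s, mu_b, var_b, step=1e-3, span=12.0):
    """The same three quantities by Riemann summation over s."""
    half_width = span * math.sqrt(var_s)
    count = int(2 * half_width / step)
    z = m1 = m2 = 0.0
    for k in range(count + 1):
        s = mu_s - half_width + k * step
        weight = normal_pdf(s, mu_s, var_s) * normal_pdf(v - s, mu_b, var_b)
        z += weight
        m1 += weight * s
        m2 += weight * s * s
    return z * step, m1 / z, m2 / z

exact = closed_form(3.0, 1.0, 2.0, 0.5, 1.0)
approx = numeric(3.0, 1.0, 2.0, 0.5, 1.0)
print(exact)
print(approx)
```

In a full re-estimation these per-pair moments would be weighted by the posteriors $p(i, j \mid v_t, \lambda_s, \lambda_b)$ and accumulated over $t$ and $j$.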


Solo Vocal Signal Modeling (IV)

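Re-estimation formulas for cepstral features
– Suppose V is a cepstral feature, and S and B are additive in the time domain, then $v_t = \log[\exp(s_t) + \exp(b_t)]$. We approximate $v_t \approx \max(s_t, b_t)$.
– It can be shown that

$$p(v_t \mid i, j, \lambda_s, \lambda_b) = \mathcal{N}(v_t; \mu_{s,i}, \sigma_{s,i}^2) \, \Phi\!\left( \frac{v_t - \mu_{b,j}}{\sigma_{b,j}} \right) + \mathcal{N}(v_t; \mu_{b,j}, \sigma_{b,j}^2) \, \Phi\!\left( \frac{v_t - \mu_{s,i}}{\sigma_{s,i}} \right),$$

$$\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \, e^{-w^2/2} \, dw,$$

$$E\{s_t \mid v_t, i, j, \lambda_s, \lambda_b\} = p(s_t = v_t \mid v_t, i, j, \lambda_s, \lambda_b) \, v_t + \left[ 1 - p(s_t = v_t \mid v_t, i, j, \lambda_s, \lambda_b) \right] E\{s_t \mid s_t < v_t, i, j, \lambda_s, \lambda_b\},$$

$$E\{s_t^2 \mid v_t, i, j, \lambda_s, \lambda_b\} = p(s_t = v_t \mid v_t, i, j, \lambda_s, \lambda_b) \, v_t^2 + \left[ 1 - p(s_t = v_t \mid v_t, i, j, \lambda_s, \lambda_b) \right] E\{s_t^2 \mid s_t < v_t, i, j, \lambda_s, \lambda_b\},$$

$$p(s_t = v_t \mid v_t, i, j, \lambda_s, \lambda_b) = \frac{\mathcal{N}(v_t; \mu_{s,i}, \sigma_{s,i}^2) \, \Phi\!\left( \frac{v_t - \mu_{b,j}}{\sigma_{b,j}} \right)}{\mathcal{N}(v_t; \mu_{s,i}, \sigma_{s,i}^2) \, \Phi\!\left( \frac{v_t - \mu_{b,j}}{\sigma_{b,j}} \right) + \mathcal{N}(v_t; \mu_{b,j}, \sigma_{b,j}^2) \, \Phi\!\left( \frac{v_t - \mu_{s,i}}{\sigma_{s,i}} \right)},$$

$$E\{s_t \mid s_t < v_t, i, j, \lambda_s, \lambda_b\} = \mu_{s,i} - \sigma_{s,i}^2 \, \frac{\mathcal{N}(v_t; \mu_{s,i}, \sigma_{s,i}^2)}{\Phi\!\left( \frac{v_t - \mu_{s,i}}{\sigma_{s,i}} \right)},$$

$$E\{s_t^2 \mid s_t < v_t, i, j, \lambda_s, \lambda_b\} = \mu_{s,i}^2 + \sigma_{s,i}^2 - \sigma_{s,i}^2 \, (\mu_{s,i} + v_t) \, \frac{\mathcal{N}(v_t; \mu_{s,i}, \sigma_{s,i}^2)}{\Phi\!\left( \frac{v_t - \mu_{s,i}}{\sigma_{s,i}} \right)}.$$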


Singer Identification (SID)

Block diagram

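[Block diagram, Training Phase: Training Data → Vocal/Instrumental Segmentation. The non-vocal portion (Instrumental Signal B) → Gaussian Mixture Modeling, $\max p(B \mid \lambda_b)$ → Background Music Model $\lambda_b$. The vocal portion (Accompanied Signal V), together with $\lambda_b$ → Solo Signal Modeling, $\max p(V \mid \lambda_s, \lambda_b)$ → Solo Model $\lambda_s$.]

[Block diagram, Testing Phase: Test Data X → Vocal/Instrumental Segmentation. Instrumental Signal $\tilde{B}$ → Gaussian Mixture Modeling, $\max p(\tilde{B} \mid \tilde{\lambda}_b)$ → Background Music Model $\tilde{\lambda}_b$. Accompanied Signal $X_V$, $\tilde{\lambda}_b$, and the Solo Models for P Singers $\lambda_{s,1}, \lambda_{s,2}, \ldots, \lambda_{s,P}$ → Maximum Likelihood Decision, $\arg\max_i \, p(X_V \mid \lambda_{s,i}, \tilde{\lambda}_b)$ → Hypothesized Singer.]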


SID Experiments

Music data
– 200 tracks from Mandarin pop music CDs
– 10 female & 10 male singers
– 5 tracks/singer for training; 5 tracks/singer for testing
– 20-min instrumental-only data for training the non-vocal GMM
– 22.05 kHz sampling rate (down-sampled from 44.1 kHz)

Vocal/Non-vocal segmentation
– 82.3% frame accuracy

[Figure: SID accuracy (in %, axis from 65.0 to 100.0) versus recording length (# frames: 1000, 3000, 6000, 9000, Entire). Curves: GMM with manual segmentation; GMM with automatic segmentation; Solo Modeling with manual segmentation; Solo Modeling with automatic segmentation.]


Singer Clustering (I)

Block diagram

[Block diagram: Feature Vectors of Music Recordings $x_1, x_2, \ldots, x_N$ → Solo Modeling (one per recording) → Models $\lambda_1, \lambda_2, \ldots, \lambda_N$ → Log-likelihood Computation, $L_{ij} = \log p(x_i \mid \lambda_j)$ → Log-likelihoods $[L_{i1}, L_{i2}, \ldots, L_{iN}]$ for each recording $i$ → Transform → Characteristic Vectors $F_1, F_2, \ldots, F_N$ → Vector Clustering → Cluster 1: $\{x_3, x_7, \ldots\}$, Cluster 2: $\{x_1, x_9, \ldots\}$, ..., Cluster M: $\{x_6, x_{10}, \ldots\}$]


Singer Clustering (II)

An example of the characteristic vectors

[Figure: log-likelihood vectors L and the corresponding characteristic vectors F for recordings grouped by Singer 1 to Singer 5. Annotations on the figure: $L_{i,j}$, $F_{i,j}$, 1.0, and $\arg\max_k L_{i,k}$.]


Singer Clustering (III)

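Determining the number of clusters
– Bayesian Information Criterion (BIC)
• Measuring how well the model fits a data set, and how simple the model is, specifically

$$\mathrm{BIC}(\lambda) = \log p(D \mid \lambda) - \frac{1}{2} \, \gamma \, d \, \log |D|,$$

where $d$ is the no. of free parameters in model $\lambda$, $|D|$ is the size of the data set $D$, and $\gamma$ is a penalty factor
– The BIC for a K-clustering is computed by:

$$\mathrm{BIC}(K) = \sum_{k=1}^{K} \left( -\frac{n_k}{2} \log |\boldsymbol{\Sigma}_k| \right) - \frac{1}{2} \, \gamma \, K \left( M + \frac{1}{2} M (M + 1) \right) \log M,$$

where $M$ is the total no. of elements, $n_k$ is the no. of elements of the cluster $k$, and $\boldsymbol{\Sigma}_k$ is the covariance matrix of the characteristic vectors in the cluster $k$
– A reasonable number of clusters can be determined by

$$K^{*} = \arg\max_{1 \le K \le M} \mathrm{BIC}(K).$$

[Figure: a K=3 clustering compared with a K=4 clustering: BIC increases or not?]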


Singer Clustering Experiments (I)

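Music data
– 200 tracks (20 singers; 10 tracks/singer)

Assessment method
– Cluster purity

$$\rho_k = \sum_{p=1}^{P} \frac{n_{kp}^2}{n_k^2},$$

• $\rho_k$ is the purity of the cluster k, $n_k$ the total no. of recordings in the cluster k, and $n_{kp}$ the no. of recordings in the cluster k that were performed by singer p
– Average purity

$$\bar{\rho} = \frac{1}{M} \sum_{k=1}^{K} n_k \, \rho_k,$$

• M is the total no. of recordings, and K the no. of clusters

Examples:

$$\rho = \frac{6^2}{6^2} = 1.0, \qquad \rho = \frac{1^2 + 1^2 + 1^2 + 1^2}{4^2} = 0.25, \qquad \rho = \frac{6^2 + 2^2 + 1^2 + 1^2}{10^2} = 0.42$$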
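The two purity measures are simple to compute from the singer labels of the recordings in each cluster. The sketch below reuses the three example clusters from this slide; the singer names "A" to "D" and the choice to treat the three examples as a single clustering of 20 recordings are illustrative assumptions.

```python
from collections import Counter

def cluster_purity(singers):
    """Purity of one cluster: sum over singers of (n_kp / n_k) squared."""
    n_k = len(singers)
    return sum(count * count for count in Counter(singers).values()) / (n_k * n_k)

def average_purity(clusters):
    """Average purity: cluster purities weighted by cluster size."""
    total = sum(len(cluster) for cluster in clusters)
    return sum(len(cluster) * cluster_purity(cluster) for cluster in clusters) / total

# Each list holds the true singer of every recording assigned to that cluster.
clusters = [
    ["A"] * 6,                           # 6 recordings by one singer
    ["A", "B", "C", "D"],                # 4 recordings by four singers
    ["A", "A", "B", "C"] + ["D"] * 6,    # 10 recordings split 2, 1, 1, 6
]

purities = [cluster_purity(cluster) for cluster in clusters]
overall = average_purity(clusters)
print(purities, overall)
```

A purity of 1.0 means every recording in the cluster comes from one singer, while a cluster of four recordings by four different singers scores 0.25.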


Singer Clustering Experiments (II)

Results

[Figure: average purity (0.00 to 1.00) versus no. of clusters (0 to 90). Curves: manual segmentation with 32-mix solo GMM & 8-mix background GMM per recording; manual segmentation with 32-mix vocal GMM per recording; automatic segmentation with 24-mix solo GMM & 8-mix background GMM per recording; automatic segmentation with 24-mix vocal GMM per recording.]

[Figure: BIC (-520 to -40) versus no. of clusters (0 to 50) for 5, 10, 15, and 20 singers, with the appropriate no. of clusters marked.]


Summary

We have
– Separated vocal from non-vocal segments of music;
– Isolated singers’ vocal characteristics from the background music;
– Distinguished singers from one another.

We will
– Handle a wider variety of music data, including duets, trios, choruses, background vocals, and music with multiple simultaneous or non-simultaneous singers;
– Deal with other problems of voice information retrieval from music, such as lyric transcription and singing language recognition.


To Probe Further (I)

Selected references
– Music information retrieval
• A. L. Uitdenbogerd, “Music IR: past, present, and future,” Proceedings of International Symposium on Music Information Retrieval, 2000.
• J. Futrelle and J. S. Downie, “Interdisciplinary communities and research issues in music information retrieval,” Proceedings of International Conference on Music Information Retrieval, pp. 215–221, 2002.
– Artist recognition
• B. Whitman, G. Flake, and S. Lawrence, “Artist detection in music with Minnowmatch,” Proceedings of IEEE Workshop on Neural Networks for Signal Processing, 2001.
• A. Berenzweig, D. P. W. Ellis, and S. Lawrence, “Using voice segments to improve artist classification of music,” Proceedings of International Conference on Virtual, Synthetic and Entertainment Audio, 2002.
– Singer identification
• Y. E. Kim and B. Whitman, “Singer identification in popular music recordings using voice coding features,” Proceedings of International Conference on Music Information Retrieval, pp. 164–169, 2002.
• C. C. Liu and C. S. Huang, “A singer identification technique for content-based classification of MP3 music objects,” Proceedings of International Conference on Information and Knowledge Management, pp. 438–445, 2002.
• T. Zhang, “Automatic singer identification,” Proceedings of International Conference on Multimedia and Expo, 2003.
• W. H. Tsai, H. M. Wang, and D. Rodgers, “Automatic singer identification of popular music recordings via estimation and modeling of solo vocal signal,” Proceedings of European Conference on Speech Communication and Technology, 2003.
– Singer clustering
• W. H. Tsai, H. M. Wang, D. Rodgers, S. S. Cheng, and H. M. Yu, “Blind clustering of popular music recordings based on singer voice characteristics,” to appear in Proceedings of International Conference on Music Information Retrieval, 2003.


To Probe Further (II)

General resources
– Important conferences
• International Conference on Music Information Retrieval
• International Computer Music Conference
• IEEE International Conference on Multimedia and Expo
• ACM International Multimedia Conference
• International Conference on New Interfaces for Musical Expression
– Organizations
• International Computer Music Association (http://www.computermusic.org/)
• The Australasian Computer Music Association (http://www.acma.asn.au/)
• ACM Multimedia (http://www.acm.org/sigmm/)
• Acoustical Society of America (http://asa.aip.org/)
– Journals
• Computer Music Journal (http://www-mitpress.mit.edu/catalog/item/default.asp?ttype=4&tid=15)
• Journal of New Music Research (http://www.swets.nl/jnmr/jnmr.html)
• Computing in Musicology (http://www.ccarh.org/publications/books/cm/)
– Useful links
• http://www.leighsmith.com/Browsers/Cmusic.html
• http://www2.siba.fi/Kulttuuripalvelut/computers.html
