E4896 Music Signal Processing (Dan Ellis) 2013-04-15 - /22
Lecture 12: Alignment and Matching
Dan Ellis
Dept. Electrical Engineering, Columbia University
[email protected] http://www.ee.columbia.edu/~dpwe/e4896/
1. Music Alignment
2. Cover Song Detection
3. Echo Nest Analyze
ELEN E4896 MUSIC SIGNAL PROCESSING
1. Music Alignment
• Often have versions of the same music with unmatched time axes
different performances
performance vs. score
• Various applications for aligning them
synchronizing different tracks (with TSM)
synchronized score display
ground truth transcriptions
Kurth et al., 2007
[Figure: chroma (C to B) vs. time / beats for "Let It Be - Nick Cave" and "Let It Be - The Beatles", with their Euclidean Distance matrix]
The Similarity Matrix
• Point-to-point comparison of sequences
e.g. Euclidean distance
$$d_{euc}(i, j) = \sum_k |x_i(k) - y_j(k)|^2$$
or normalized inner product (cosine distance)
$$d_{cos}(i, j) = 1 - \frac{\sum_k x_i(k)\, y_j(k)}{\sqrt{\sum_k |x_i(k)|^2 \sum_k |y_j(k)|^2}}$$
[Diagram: matrix entry d_ij, indexed by i and j]
Foote 1999
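The point-to-point comparison can be sketched in a few lines of plain Python. This is a minimal illustration, not the course code: the function names are hypothetical, frames are plain lists of feature values (e.g. 12 chroma bins), and the Euclidean version is written as a sum of squared differences with no square root, which is an assumption about the intended form.

```python
import math

def euclidean_distance(x, y):
    """Sum of squared differences between two frames (assumed form: no square root)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def cosine_distance(x, y):
    """1 minus the normalized inner product; 1.0 if either frame is all zeros (assumption)."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return 1.0 - num / den if den > 0 else 1.0

def distance_matrix(X, Y, dist=cosine_distance):
    """d[i][j] = dist(X[i], Y[j]) for every pair of frames from the two sequences."""
    return [[dist(x, y) for y in Y] for x in X]

# Two short sequences of 3-bin feature frames (toy data)
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
Y = [[2.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
D = distance_matrix(X, Y)
```

Because the cosine distance normalizes each frame, `X[0]` and `Y[0]` are at distance 0 even though their magnitudes differ, which is often what is wanted when the two versions are recorded at different levels.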
[Figure: 8 × 8 grid of local costs d_ij (axes 1 to 8, color scale 0 to 1), annotated with the values below; the values highlighted along the path are 0.1, 0.3, 0.4, 0.6, 0.7, 0.9, 1.2, 1.5]
0.1 0.4 0.6 0.8 1.1 1.6 2.3 2.9
0.3 0.3 0.4 0.6 1.0 1.5 2.1 2.6
0.5 0.5 0.4 0.5 0.9 1.5 2.2 2.7
0.8 0.8 0.7 0.6 0.7 1.2 1.8 2.4
1.2 1.2 1.0 1.0 0.7 0.9 1.2 1.5
1.4 1.5 1.3 1.2 1.0 0.9 1.3 1.5
2.1 2.2 2.0 1.8 1.5 1.3 1.2 1.6
2.7 2.8 2.5 2.3 2.0 1.7 1.6 1.5
Dynamic Programming
• Find best path combining local + transitions
works for any kind of similarity matrix
Finds path {i_k, j_k} to minimize cost ...
$$C^*_{i_{max}, j_{max}} = \sum_k \left[ d(i_k, j_k) + T(i_k - i_{k-1},\; j_k - j_{k-1}) \right]$$
... recursively
$$C^*_{i,j} = \min_{(x,y) \in \{(1,1),(1,0),(0,1)\}} \left\{ d(i, j) + T(x, y) + C^*_{i-x,\,j-y} \right\}$$
Allowable transitions, from {i_{k-1}, j_{k-1}} to {i_k, j_k}:
T(1,0) = 0.1
T(0,1) = 0.1
T(1,1) = 0.0
[Figure legend: local costs d_ij ; C* ; paths]
Bellman 1957
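The recursion can be written out directly. The sketch below is a plain-Python illustration under a few assumptions: `dp_align` is a hypothetical name, the path is forced to start at the first cell and end at the last one, and the default transition costs are the T values quoted on the slide.

```python
def dp_align(d, T=None):
    """Return (total_cost, path) for the minimum-cost path from cell (0, 0) to the last cell.

    d is a list of rows of local costs; T maps a step (x, y) to its transition cost.
    """
    if T is None:
        T = {(1, 1): 0.0, (1, 0): 0.1, (0, 1): 0.1}
    n, m = len(d), len(d[0])
    INF = float("inf")
    C = [[INF] * m for _ in range(n)]      # cumulative best cost C*
    back = [[None] * m for _ in range(n)]  # predecessor of each cell on its best path
    C[0][0] = d[0][0]
    for i in range(n):
        for j in range(m):
            for (x, y), t in T.items():
                pi, pj = i - x, j - y
                if pi < 0 or pj < 0:
                    continue  # step would start outside the matrix
                cand = d[i][j] + t + C[pi][pj]
                if cand < C[i][j]:
                    C[i][j] = cand
                    back[i][j] = (pi, pj)
    # Trace back from the final cell to recover the path
    path = [(n - 1, m - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        path.append(back[path[-1][0]][path[-1][1]])
    path.reverse()
    return C[n - 1][m - 1], path

# Toy local-cost matrix with a cheap diagonal
d = [[0.1, 0.9, 0.9],
     [0.9, 0.1, 0.9],
     [0.9, 0.9, 0.1]]
cost, path = dp_align(d)
```

For this toy matrix the best path is the diagonal, since every detour pays both a higher local cost and a 0.1 transition cost.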
Audio-to-Audio Alignment
• Dynamic programming to get time mapping
+ phase vocoder time scaling
Audio-Score Alignment
• Aligning a score representation (e.g. MIDI)
is a proxy for polyphonic transcription
[Figure: "Let It Be + aligned MIDI labels", freq / Hz (0 to 1000) vs. time / sec (0 to 22)]
Peak Structure Distance
• How do we match spectra to score notes?
synthesize audio from MIDI & compare audio?
“Peak Structure distance”: is energy where we expect?
[Figure panels: MIDI “Piano roll” (note, C2 to C6, vs. time / frames); Synthesized audio (freq / kHz vs. time / sec); Predicted spectrum = mask M[k] (freq / bins vs. time / frames); “Peak Structure” = energy below mask (freq / bins vs. time / frames)]
$$d_{psd} = 1 - \frac{\sum_k M[k]\,|X[k]|}{\sum_k |X[k]|}$$
Orio & Schwarz 2001
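For a single frame, the peak structure distance is one minus the fraction of spectral magnitude that falls inside the predicted mask. A minimal sketch, assuming a binary mask `mask` and a spectrum `spectrum` of equal length (both names, and the handling of a silent frame, are illustrative assumptions):

```python
def peak_structure_distance(mask, spectrum):
    """d_psd = 1 - sum_k M[k]|X[k]| / sum_k |X[k]| for one frame."""
    total = sum(abs(v) for v in spectrum)
    if total == 0:
        return 1.0  # assumption: treat a silent frame as maximally distant
    inside = sum(m * abs(v) for m, v in zip(mask, spectrum))
    return 1.0 - inside / total

# Mask predicts energy in bins 1 and 4 (toy data)
mask = [0, 1, 0, 0, 1, 0]
good = [0.0, 2.0, 0.0, 0.0, 2.0, 0.0]  # all energy where the score expects it
bad = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]   # all energy elsewhere
half = [1.0, 1.0, 0.0, 0.0, 1.0, 1.0]  # half the energy inside the mask
```

The distance is 0 for `good`, 1 for `bad`, and 0.5 for `half`, so it can serve directly as the local cost d(i, j) between an audio frame and a score frame.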
2. Cover Song Detection
• Musicians are fond of ‘cover versions’
usually alter melody, harmony, instrumentation, rhythm, style
can be hard to spot even for a human!
• Can try to match via alignment ... with some threshold on best alignment cost?
Smith-Waterman Local Alignment
• Cover version may have different form
different number, ordering of verse/chorus/bridge
want to find any large aligned regions
• “Local alignment” measure
want largest score S*
similarity s(i, j) must exceed penalty P(x,y) on avg. (e.g. 0.96 for diagonal, 1.2 for off-diagonal)
$$S^*_{i,j} = \max_{x,y} \left\{ \max\{0,\; s(i, j) - P(x, y) + S^*_{i-x,\,j-y}\} \right\}$$
[Figure panels: "Beatles vs. Carol Woods - cosine dist"; "Smith-Waterman cd/2, .96, 1.2"; axes time / beats]
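The local-alignment recursion differs from the global one in two ways: it maximizes a score rather than minimizing a cost, and the floor at zero lets an alignment start anywhere. A minimal plain-Python sketch follows; `smith_waterman` is a hypothetical name, the default penalties are the example values quoted on the slide, and the scale of the similarity values in the toy matrix is an assumption.

```python
def smith_waterman(s, P=None):
    """Return (best_score, (i, j)): the largest local-alignment score and the cell where it ends."""
    if P is None:
        P = {(1, 1): 0.96, (1, 0): 1.2, (0, 1): 1.2}
    n, m = len(s), len(s[0])
    # S carries a border of zeros so that S[i - x][j - y] is defined at the edges
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_end = 0.0, None
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                max(0.0, s[i - 1][j - 1] - p + S[i - x][j - y])
                for (x, y), p in P.items()
            )
            if S[i][j] > best:
                best, best_end = S[i][j], (i - 1, j - 1)
    return best, best_end

# Toy similarity matrix: three matching frames along the diagonal
s = [[2.0, 0.0, 0.0],
     [0.0, 2.0, 0.0],
     [0.0, 0.0, 2.0]]
best, best_end = smith_waterman(s)
```

Each diagonal match here gains 2.0 - 0.96 = 1.04, so the score grows along the matching run and would decay back toward zero once the similarity drops below the penalty.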
Local Alignment Cover Detection
• Smith-Waterman needs predictable values
use binary similarity based on best transposition
[Figure panels: Euclidean; Binary; Non-cover]
Serrà & Gómez, 2008
Cross-correlation Covers System
• DP is good for time-warping, but expensive
beat-timing is tempo independent (if it works)
simply cross-correlate beat-chroma patches?
how big are the pieces?
how do we combine individual scores?
also expensive
[Figure: Query and Candidate beat-chroma matrices (chroma bins vs. beats); a patch is extracted from one and cross-correlated with the other]
Global Cross-Correlation
• Cross-correlate entire beat-chroma matrices
... at all possible transpositions (circular)
implicit combination of match quality and duration
• One good matching fragment is sufficient...?
Ellis & Poliner, 2007
[Figure: beat-chroma matrices (chroma bins vs. beats @281 BPM) for "Elliott Smith - Between the Bars" and "Glen Phillips - Between the Bars", and their Cross-correlation (skew / semitones, -5 to +5, vs. skew / beats)]
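The global cross-correlation over every beat skew and every circular transposition can be sketched as below. This is an unnormalized, brute-force illustration in plain Python, not the published system: the function names are hypothetical, each matrix is a list of per-beat chroma frames, and any normalization or filtering of the result is left out.

```python
def xcorr_at(C1, C2, lag, shift):
    """Sum over beats t and bins f of C1[t][f] * C2[t + lag][(f + shift) % nbins].

    Only beats where the two matrices overlap contribute.
    """
    nbins = len(C1[0])
    total = 0.0
    for t, frame in enumerate(C1):
        u = t + lag
        if 0 <= u < len(C2):
            other = C2[u]
            total += sum(frame[f] * other[(f + shift) % nbins] for f in range(nbins))
    return total

def best_match(C1, C2):
    """Search every beat skew and circular transposition; return (peak, lag, shift)."""
    nbins = len(C1[0])
    best = (float("-inf"), 0, 0)
    for lag in range(-(len(C1) - 1), len(C2)):
        for shift in range(nbins):
            v = xcorr_at(C1, C2, lag, shift)
            if v > best[0]:
                best = (v, lag, shift)
    return best

def one_hot(pitch, nbins=12):
    """Toy chroma frame with all energy in one bin; None gives a silent frame."""
    return [1.0 if f == pitch else 0.0 for f in range(nbins)]

# The second "song" is the first one delayed by 2 beats and transposed up 2 semitones
C1 = [one_hot(p) for p in (0, 4, 7, 4)]
C2 = [one_hot(p) for p in (None, None, 2, 6, 9, 6)]
peak, lag, shift = best_match(C1, C2)
```

The peak lands at a skew of 2 beats and a transposition of 2 semitones. Because the sum grows with the number of overlapping matching beats, the peak value mixes match quality with match duration, as the slide notes.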
Filtered Cross-Correlation
• Raw correlation not as important as precise local match
looking for large contrast at ±1 beat skew
i.e. high-pass filter
[Figure: Cross-correlation (skew / semitones, -5 to +5, vs. skew / beats); Cross-correlation @ skew = +2 semitones, raw and filtered, vs. skew / beats]
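One simple way to emphasize contrast against neighboring skews is to subtract a local moving average from the raw cross-correlation along the beat-skew axis. The slide does not say which high-pass filter is used, so the filter below (and the name `highpass` and its `half_width` parameter) is an assumed stand-in for illustration only.

```python
def highpass(xc, half_width=2):
    """Subtract a centered moving average from each sample (window shrinks at the edges)."""
    out = []
    for n in range(len(xc)):
        lo, hi = max(0, n - half_width), min(len(xc), n + half_width + 1)
        local_mean = sum(xc[lo:hi]) / (hi - lo)
        out.append(xc[n] - local_mean)
    return out

# Raw correlation with a broad offset and one sharp peak at index 4 (toy data)
raw = [5.0, 5.0, 5.0, 5.0, 9.0, 5.0, 5.0, 5.0, 5.0]
filtered = highpass(raw)
```

The constant offset is removed while the sharp peak survives, so a broad, slowly varying correlation no longer outranks a precise local match.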
Cover Song Results
• 23 Covers found in 8700 song ‘uspop2002’
popular ‘decoys’ – normalization issues
[Figure: Test vs. Query grid over the 23 tracks listed below (Test labels abbreviated to two letters)]
Abracadabra/sugar_ray
Addicted_To_Love/tina_turner
All_Along_The_Watchtower/dave_matthews_band
America/simon_and_garfunkel
Before_You_Accuse_Me/eric_clapton
Between_The_Bars/glen_phillips
Blue_Collar_Man/styx
Caroline_No/brian_wilson
Cecilia/simon_and_garfunkel
Claudette/roy_orbison
Cocaine/nazareth
Come_Together/beatles
Day_Tripper/cheap_trick
Enjoy_The_Silence/tori_amos
Faith/limp_bizkit
God_Only_Knows/brian_wilson
Gold_Dust_Woman/sheryl_crow
Grand_Illusion/styx
Hush/milli_vanilli
I_Can_t_Get_No_Satisfaction/rolling_stones
I_Love_You/faith_hill
Let_It_Be/nick_cave
Take_Me_To_The_River/annie_lennox
Cover Songs - dpwe23 - 12/23 correct
Analyzing Cover Song Correlation
• Look inside global cross-correlation to find matching fragments...
$$\mathrm{xcorr} = \sum_t \sum_f C_1(t, f) \cdot C_2(t, f)$$
view along time
[Figure: chroma (A to G) vs. time / beats for "Let It Be / Beatles (beats 11-441)" and "Let It Be / Nick Cave (beats 13-443)", with a third panel (values about -0.2 to 0.4) vs. time / beats]
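Viewing the correlation along time means keeping the sum over chroma bins for each beat rather than collapsing everything into one number. A minimal sketch, assuming the two beat-chroma matrices have already been trimmed to the matching beat ranges and transposed to the same key (the function name is hypothetical):

```python
def correlation_along_time(C1, C2):
    """c[t] = sum_f C1[t][f] * C2[t][f]; summing c over t gives the xcorr value at zero skew."""
    n = min(len(C1), len(C2))
    return [sum(a * b for a, b in zip(C1[t], C2[t])) for t in range(n)]

# Toy 2-bin chroma: beats 0 and 2 match, beat 1 does not
C1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
C2 = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
c = correlation_along_time(C1, C2)
```

Stretches of beats where `c` stays high are the matching fragments; isolated high values point to chance agreement.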
Cover Song False Alarm
• Correlation can be weak
“Cocaine” (Clapton) vs. “Satisfaction” (Stones)
[Figure: chroma (A to G) vs. beats for "Eric Clapton - Cocaine - beats 17:1027" and "Rolling Stones - Satisfaction - beats 1:1011", with a third panel (values -2 to 2) over the same beat axis]
3. Echo Nest Analyze
• Web service to provide
beat, chroma, ... analysis (and much more)
register for free API key: http://developer.echonest.com/account/register/
upload MP3, get back XML with analysis data
[Figure: "TRKUYPW128F92E1FC0 - Tori Amos - Smells Like Teen Spirit"; panels Original and Resynth (freq / Hz) and EN Features (chroma, C to B), vs. time / sec]
EN Analyze Usage
• Matlab wrapper function
Million Song Dataset (MSD)
• Commercial-scale dataset available to MIR researchers
1M pop songs
250 GB of features
(6 years of listening)
• EN Analyze features + ...
Lyrics, Tags, Covers, Listeners ...
http://labrosa.ee.columbia.edu/millionsong
Thierry Bertin-Mahieux
MSD Metadata
EN Metadata
artist: 'Tori Amos'
release: 'LIVE AT MONTREUX'
title: 'Smells Like Teen Spirit'
id: 'TRKUYPW128F92E1FC0'
key: 5
mode: 0
loudness: -16.6780
tempo: 87.2330
time_signature: 4
duration: 216.4502
sample_rate: 22050
audio_md5: '8'
7digitalid: 5764727
familiarity: 0.8500
year: 1992

SHS Covers
%5489,4468, Smells Like Teen Spirit
TRTUOVJ128E078EE10 Nirvana
TRFZJOZ128F4263BE3 Weird Al Yankovic
TRJHCKN12903CDD274 Pleasure Beach
TRELTOJ128F42748B7 The Flying Pickets
TRJKBXL128F92F994D Rhythms Del Mundo feat. Shanade
TRIHLAW128F429BBF8 The Bad Plus
TRKUYPW128F92E1FC0 Tori Amos

MxM Lyric Bag-of-Words
12 hello 11 i 10 a 9 and 7 it 6 are 6 we 6 now
6 here 6 us 6 entertain 4 the 4 feel 4 yeah 3 to 3 my
3 is 3 with 3 oh 3 out 3 an 3 light 3 less 3 danger

Last.fm Tags
100.0 – cover; 57.0 – covers; 43.0 – female vocalists; 42.0 – piano; 34.0 – alternative
14.0 – singer-songwriter; 11.0 – acoustic; 8.0 – tori amos; 7.0 – beautiful; 6.0 – rock
6.0 – pop; 6.0 – Nirvana; 6.0 – female vocalist; 6.0 – 90s; 5.0 – out of genre covers
5.0 – cover songs; 4.0 – soft rock; 4.0 – nirvana cover; 4.0 – Mellow; 4.0 – alternative rock
3.0 – chick rock; 3.0 – Ballad; 3.0 – Awesome Covers; 2.0 – melancholic; 2.0 – k00l ch1x
2.0 – indie; 2.0 – female vocalistist; 2.0 – female; 2.0 – cover song; 2.0 – american
Summary
• Music Alignment
Dynamic Programming finds correspondence
• Cover Songs
DP, or cross-correlation for efficiency
• EN Analyze
Web service to analyze audio
References
• R. Bellman, Dynamic Programming, Princeton University Press, 1957.
• D. Ellis and G. Poliner, “Identifying Cover Songs With Chroma Features and Dynamic Programming Beat Tracking,” Proc. ICASSP-07, Hawai'i, pp. IV-1429-1432, 2007.
• J. Foote, “Visualizing Music and Audio using Self-Similarity,” In Proc. ACM Multimedia, Orlando, pp. 77-80, 1999.
• Frank Kurth, Meinard Müller, Christian Fremerey, Yoon ha Chang, and Michael Clausen, “Automated synchronization of scanned sheet music with audio recordings,” Proc. 8th International Conference on Music Information Retrieval (ISMIR), Vienna, pp. 261-266, 2007.
• N. Orio & D. Schwarz, “Alignment of monophonic and polyphonic music to a score,” Proc. Int. Comp. Music Conf., Havana, pp.155-158, 2001.
• J. Serrà, E. Gómez, P. Herrera, X. Serra, “Chroma Binary Similarity and Local Alignment Applied to Cover Song Identification,” IEEE Trans. on Audio, Speech and Lang. Proc., 16(6), pp. 1138-1151, 2008.