112/04/19 1
Two Paradigms for Music IR: Query by Singing/Humming and Audio Fingerprinting
J.-S. Roger Jang ( 張智星 )
Multimedia Information Retrieval Lab
CS Dept., Tsing Hua Univ., Taiwan
http://mirlab.org/jang
-2-
Outline
Introduction to MIR
QBSH (query by singing/humming)
  Intro, demos, conclusions
AFP (audio fingerprinting)
  Intro, demos, conclusions
-3-
Introduction to QBSH
QBSH: Query by Singing/Humming
  Input: Singing or humming from microphone
  Output: A ranking list retrieved from the song database
Progression
  First paper: Around 1994
  Extensive studies since 2001
  State of the art: QBSH tasks at ISMIR/MIREX
-4-
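Workflow of Query by Singing/Humming
Preprocessing:
  Collect single-track reference melodies (usually MIDI files)
  Convert them into an intermediate format suitable for comparison
Online processing:
  Convert the user's audio input into a pitch vector
  Convert the pitch vector into notes (optional)
  Compare against the reference melodies
  List the ranked results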
-5-
Pitch Tracking for QBSH
Two categories for pitch tracking algorithms
  Time domain
    ACF (Autocorrelation function)
    AMDF (Average magnitude difference function)
    SIFT (Simple inverse filtering tracking)
  Frequency domain
    Harmonic product spectrum method
    Cepstrum method
-6-
Frame Blocking for Pitch Tracking
Frame size = 256 points
Overlap = 84 points
Frame rate = 11025/(256-84) ≈ 64 pitch values/sec
[Figure: waveform with one frame and its overlap marked, plus a zoomed-in view of the frame]
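The frame-blocking arithmetic on this slide can be sketched in Python. This is a minimal sketch, assuming NumPy and a synthetic one-second 220 Hz test tone; the function name `frame_blocking` and the test signal are illustrative and not taken from the slides.

```python
import numpy as np

def frame_blocking(signal, frame_size=256, overlap=84):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    hop = frame_size - overlap                      # 172 samples between frame starts
    n_frames = 1 + (len(signal) - frame_size) // hop
    if n_frames < 1:
        return np.empty((0, frame_size))            # signal shorter than one frame
    return np.stack([signal[k * hop : k * hop + frame_size]
                     for k in range(n_frames)])

fs = 11025                                          # sample rate from the slide
signal = np.sin(2 * np.pi * 220 * np.arange(fs) / fs)   # assumed test tone, 1 second
frames = frame_blocking(signal)
frame_rate = fs / (256 - 84)                        # pitch values per second
print(frames.shape, round(frame_rate, 1))
```

Each row of `frames` would then be passed to a pitch tracker to produce one pitch value per frame.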
-7-
ACF: Auto-correlation Function
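Frame s(i)
Shifted frame s(i+τ), shown for τ = 30
acf(30) = inner product of the overlapping part
$$\mathrm{acf}(\tau) = \sum_{i=0}^{n-1-\tau} s(i)\, s(i+\tau)$$
Pitch period: the shift τ > 0 at which acf(τ) peaks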
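A small Python sketch of ACF-based pitch estimation for a single frame follows. It assumes NumPy, a synthetic 220 Hz frame, and a search range limited to 82 Hz - 1047 Hz; the function names and the search-range restriction are illustrative choices, not necessarily what the author's system does.

```python
import numpy as np

def acf(frame):
    """acf(tau) = sum_{i=0}^{n-1-tau} s(i) * s(i+tau) for tau = 0..n-1."""
    n = len(frame)
    return np.array([np.dot(frame[:n - tau], frame[tau:]) for tau in range(n)])

def acf_pitch(frame, fs, fmin=82.0, fmax=1047.0):
    """Estimate pitch (Hz) as fs / tau, where tau maximizes the ACF in range."""
    values = acf(frame)
    tau_min = int(fs / fmax)                         # shortest period considered
    tau_max = min(int(fs / fmin), len(frame) - 1)    # longest period considered
    tau = tau_min + int(np.argmax(values[tau_min:tau_max + 1]))
    return fs / tau

fs = 11025
n = 353                                              # about 32 ms at 11025 Hz
frame = np.sin(2 * np.pi * 220 * np.arange(n) / fs)  # assumed test frame
pitch = acf_pitch(frame, fs)
print(pitch)
```

Because the period is restricted to whole samples, the estimate is quantized; at this sample rate a 220 Hz tone lands within a few Hz of the true value.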
-8-
Pitch Tracking via ACF
Specs
  Sample rate = 11025 Hz
  Frame size = 32 ms
  Overlap = 0
  Frame rate = 31.25 frames/sec
Playback
  soo.wav
  sooPitch.wav
-9-
Frequency to Semitone Conversion
Semitone: A music scale based on A440
$$\text{semitone} = 69 + 12 \log_2\left(\frac{\text{freq}}{440}\right)$$
Reasonable pitch range: E2 - C6 (82 Hz - 1047 Hz)
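The conversion can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `freq_to_semitone` is illustrative.

```python
import math

def freq_to_semitone(freq_hz):
    """Map a frequency in Hz to a semitone number, with A440 mapped to 69."""
    return 69.0 + 12.0 * math.log2(freq_hz / 440.0)

for name, freq in [("E2", 82.41), ("A4", 440.0), ("C6", 1046.5)]:
    print(name, round(freq_to_semitone(freq), 2))
```

Doubling the frequency adds exactly 12 semitones, so a pitch vector in semitones can be transposed to another key by adding a constant.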
-10-
Typical Result of Pitch Tracking
Pitch tracking via autocorrelation for a recording of 茉莉花 (Jasmine Flower)
-11-
Comparison of Pitch Vectors
Yellow line: Target pitch vector
-12-
Comparison Methods of QBSH
Categories of approaches to QBSH
  Histogram/statistics-based
  Note vs. note
    Edit distance
  Frame vs. note
    HMM
  Frame vs. frame
    Linear scaling, DTW, recursive alignment
-13-
Linear Scaling (LS)
Concept
  Scale the query linearly to match the candidates
Example:
-14-
Linear Scaling (II)
Strength
  One-shot for dealing with key transposition
  Efficient and effective
  Indexing methods available
Weakness
  Cannot deal with non-uniform tempo variations
Typical mapping path
-15-
Linear Scaling (III)
Distance function for LS
  Normalized L1-norm
  Normalized L2-norm
Rest handling
  Extend previous non-zero note
Alignment example
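The linear scaling idea from the preceding slides can be sketched as follows. This is a minimal sketch under stated assumptions: NumPy, seven scaling factors between 0.5 and 2.0, key transposition handled by shifting the query to the mean of the reference segment, and a length-normalized L1 distance. The slides do not specify the factor range or the transposition rule, so treat both as illustrative, along with all function names.

```python
import numpy as np

def fill_rests(pitch):
    """Rest handling: replace each rest (0) with the previous non-zero pitch."""
    out = np.array(pitch, dtype=float)
    for i in range(1, len(out)):
        if out[i] == 0:
            out[i] = out[i - 1]
    return out                                   # leading rests, if any, stay 0

def resample(vec, new_len):
    """Linearly stretch or compress a pitch vector to new_len points."""
    old_x = np.linspace(0.0, 1.0, len(vec))
    new_x = np.linspace(0.0, 1.0, new_len)
    return np.interp(new_x, old_x, vec)

def linear_scaling_distance(query, reference, factors=None):
    """Smallest normalized L1 distance over a set of scaling factors."""
    if factors is None:
        factors = np.linspace(0.5, 2.0, 7)       # assumed scaling range
    best = np.inf
    for s in factors:
        length = int(round(s * len(query)))
        if length < 2 or length > len(reference):
            continue
        q = resample(query, length)
        r = np.asarray(reference[:length], dtype=float)
        q = q - q.mean() + r.mean()              # assumed: transpose by mean shift
        best = min(best, float(np.abs(q - r).sum() / length))
    return best

# Toy check: the query is the first 45 reference points, sung 1.5x faster
# and 5 semitones higher.
reference = np.arange(60, dtype=float)
query = resample(reference[:45], 30) + 5.0
distance = linear_scaling_distance(query, reference)
print(distance)
```

Because every scaling factor stretches the whole query uniformly, this approach cannot follow a singer who speeds up or slows down within the query.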
-16-
Dynamic Time Warping (DTW)
Goal: Allow comparison with high tolerance to tempo variation
Characteristics:
  Robust for irregular tempo variations
  Trial-and-error for dealing with key transposition
  Expensive in computation
  Does not conform to the triangle inequality
  Some indexing algorithms do exist
#1 method for task 2 in QBSH/MIREX 2006
-17-
Dynamic Time Warping: Type 1
t: input pitch vector (8 sec, 128 points)
r: reference pitch vector
Local paths: 27-45-63 degrees
DTW recurrence:
$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j-2),\ D(i-1,j-1),\ D(i-2,j-1)\}$$
[Figure: DTW grid with axes i and j, marking t(i-1), t(i), r(j-1), r(j), and the local paths into D(i,j)]
-18-
Dynamic Time Warping: Type 2
t: input pitch vector (8 sec, 128 points)
r: reference pitch vector
Local paths: 0-45-90 degrees
DTW recurrence:
$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j),\ D(i-1,j-1),\ D(i,j-1)\}$$
[Figure: DTW grid with axes i and j, marking t(i-1), t(i), r(j-1), r(j), and the local paths into D(i,j)]
-19-
Local Path Constraints
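Type 1: 27-45-63 local paths
$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j-2),\ D(i-1,j-1),\ D(i-2,j-1)\}$$
[Figure: D(i,j) reached from D(i-1,j-2), D(i-1,j-1), and D(i-2,j-1)]
Type 2: 0-45-90 local paths
$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j),\ D(i-1,j-1),\ D(i,j-1)\}$$
[Figure: D(i,j) reached from D(i-1,j), D(i-1,j-1), and D(i,j-1)]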
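Both recurrences can be sketched with one Python function that only changes the set of allowed predecessor steps. This is a minimal sketch, assuming NumPy and a path anchored at the first and last points of both vectors; the function name, the `path_type` parameter, and the anchoring choice are illustrative rather than the author's implementation.

```python
import numpy as np

# Predecessor offsets (di, dj): D(i, j) is reached from D(i - di, j - dj).
LOCAL_PATHS = {
    1: [(1, 2), (1, 1), (2, 1)],   # Type 1: 27-45-63 degrees
    2: [(1, 0), (1, 1), (0, 1)],   # Type 2: 0-45-90 degrees
}

def dtw_distance(t, r, path_type=2):
    """DTW distance between input t and reference r, anchored at both ends."""
    steps = LOCAL_PATHS[path_type]
    m, n = len(t), len(r)
    D = np.full((m, n), np.inf)                 # inf marks unreachable cells
    D[0, 0] = abs(t[0] - r[0])
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            prev = [D[i - di, j - dj] for di, dj in steps
                    if i - di >= 0 and j - dj >= 0]
            if prev:
                D[i, j] = abs(t[i] - r[j]) + min(prev)
    return D[m - 1, n - 1]

t = [1.0, 2.0, 3.0]
print(dtw_distance(t, [1.0, 1.0, 2.0, 2.0, 3.0, 3.0], path_type=2))
print(dtw_distance(t, [1.0, 1.0, 2.0, 3.0, 3.0], path_type=1))
```

With Type 1 paths every step advances both indices, so the overall tempo ratio is confined to between 1/2 and 2 and an end point outside that range stays unreachable (infinite distance). Type 2 paths allow arbitrary stretching.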
-20-
DTW Path of “Match Beginning”
-21-
DTW Path of “Match Anywhere”
-22-
DTW Path of “Match Anywhere”
-23-
Challenges in QBSH Systems
Song database preparation
  MIDIs, singing clips, or audio music
Reliable pitch tracking for acoustic input
  Input from mobile devices or noisy karaoke bar
Efficient/effective retrieval
  Karaoke machine: ~10,000 songs
  Internet music search engine: ~500,000,000 songs
-24-
-25-
Goal and Approach
Goal: To retrieve songs effectively within a given response time, say 5 seconds or so
Our strategy
  Multi-stage progressive filtering
  Indexing for different comparison methods
  Repeating pattern identification
-26-
MIRACLE
MIRACLE
  Music Information Retrieval Acoustically via Clustered and paralleL Engines
Database (~13000)
  MIDI files
  Solo vocals (<100)
  Melody extracted from polyphonic music (<100)
Comparison methods
  Linear scaling
  Dynamic time warping
Top-10 accuracy
  70~75%
Platform
  Single CPU+GPU
-27-
MIRACLE Before Oct. 2011
Client-server distributed computing
Cloud computing via clustered PCs
[Figure: architecture diagram. Clients (PC, PDA/smartphone, cellular) exchange requests (pitch vector) and responses (search result) with the master server, which is backed by clustered slave servers]
Database size: ~12,000
-28-
Current MIRACLE
Single server with GPU
  NVIDIA 560 Ti, 384 cores (speedup factor = 66)
[Figure: architecture diagram. Clients (PC, PDA/smartphone, cellular) exchange requests (pitch vector) and responses (search result) with a single master server]
Database size: ~13,000
-29-
MIRACLE in the Future
Multi-modal retrieval
  Singing, humming, speech, audio, tapping…
[Figure: architecture diagram. Clients (PC, PDA/smartphone, cellular) exchange requests (feature vector) and responses (search result) with the master server, which is backed by clustered slave servers]
-30-
Outlook of MIRACLE
Web version Stand-alone version
-31-
QBSH for Other Platforms
Embedded systems
  Karaoke machines
Smartphones
  iPhone/iPad
  Android phone
Toys
-32-
Returned Results
Typical results of MIRACLE
-33-
QBSH Demo
Demo page of MIR Lab: http://mirlab.org/mir_products.asp
MIRACLE demo: http://mirlab.org/demo/miracle
Existing commercial QBSH systems
  www.midomi.com
  www.soundhound.com
-34-
To Make QBSH More Efficient
Algorithms
  Indexing of LS/DTW
  Progressive filtering
New platforms
  GPU (10 times faster for QBSH!)
  Grid/clustered computing
  Multi-core platforms
-35-
Conclusions for QBSH
QBSH
  Fun and interesting way to retrieve music
  Can be extended to singing scoring
  Commercial applications are maturing
Challenges
  How to deal with massive music databases?
  How to extract melody from audio music?
-36-
Audio Fingerprinting (AFP)
Goal
  Identify a noisy version of a given audio clip (no “cover versions”)
Technical barriers
  Robustness
  Efficiency (6M tags/day for Shazam)
  Database collection (15M tracks for Shazam)
Applications
  Song purchase
  Royalty assignment (over radio)
  Confirmation of commercials (over TV)
  Copyright violation (over web)
  TV program ID
-37-
Company: Shazam
Facts
  First commercial product of audio fingerprinting
  Since 2002, UK
Technology
  Audio fingerprinting
Founder
  Avery Wang (PhD at Stanford, 1994)
-38-
Company: Soundhound
Facts
  First product with multi-modal music search
  AKA: midomi
Technologies
  Audio fingerprinting
  Query by singing/humming
  Speech recognition
Founder
  Keyvan Mohajer (PhD at Stanford, 2007)
-39-
Two Stages in AFP
Offline: Database construction
  Robust feature extraction (audio fingerprinting)
  Hash table construction
  Inverted indexing
Online: Application
  Robust feature extraction
  Hash table search
  Ranked list of the retrieved songs/music
-40-
Robust Feature Extraction
Various kinds of features for AFP
  Invariance along time and frequency
  Landmark of a pair of local maxima
  Wavelets
  …
Extensive tests are required for choosing the best features
-41-
Representative Approaches to AFP
Philips
  J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system”, ISMIR 2002.
Shazam
  A. Wang, “An industrial-strength audio search algorithm”, ISMIR 2003.
Google
  S. Baluja and M. Covell, “Content fingerprinting using wavelets”, Euro. Conf. on Visual Media Production, 2006.
  V. Chandrasekhar, M. Sharifi, and D. A. Ross, “Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications”, ISMIR 2011.
-42-
Philips: Thresholding as Features
Observation
  The sign of energy differences is robust to various operations
    Lossy encoding
    Range compression
    Added noise
Thresholding as Features
$$F(t,f) = \begin{cases} 1, & \text{if } S(t,f) - S(t,f+1) - \big(S(t-1,f) - S(t-1,f+1)\big) > 0 \\ 0, & \text{otherwise} \end{cases}$$
  Fingerprint F(t, f)
  Magnitude spectrum S(t, f)
(Source: Dan Ellis)
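The thresholding step can be sketched in a few lines of Python. This is a minimal sketch, assuming NumPy, a spectrogram-like array `S` indexed as [time, frequency band], and the sign-of-energy-difference rule from Haitsma and Kalker's paper (difference between adjacent bands, then between adjacent frames). The function name and the toy input are illustrative.

```python
import numpy as np

def philips_fingerprint(S):
    """Return fingerprint bits F of shape (T-1, B-1) from S of shape (T, B).

    F(t, f) = 1 if (S(t,f) - S(t,f+1)) - (S(t-1,f) - S(t-1,f+1)) > 0, else 0.
    """
    S = np.asarray(S, dtype=float)
    band_diff = S[:, :-1] - S[:, 1:]               # difference along frequency
    time_diff = band_diff[1:] - band_diff[:-1]     # difference along time
    return (time_diff > 0).astype(np.uint8)

S = np.array([[1.0, 2.0, 3.0],
              [3.0, 1.0, 0.0],
              [0.0, 2.0, 5.0]])                    # assumed toy spectrogram
F = philips_fingerprint(S)
print(F)
```

Only the sign of the difference is kept, so multiplying the whole spectrogram by a positive gain leaves the bits unchanged, which is one reason the feature tolerates level changes.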
-43-
Philips: Thresholding as Features (II)
Robust to low-bitrate MP3 encoding (see the right)
Sensitive to “frame time difference”
  Hop size is kept small!
[Figure: original fingerprint vs. fingerprint after MP3 encoding; BER = 0.078]
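The bit error rate (BER) quoted for the figure is the fraction of fingerprint bits that differ between two versions of the same audio. A minimal sketch, assuming NumPy and two equally shaped bit arrays; the function name and the toy bit patterns are illustrative.

```python
import numpy as np

def bit_error_rate(fp_a, fp_b):
    """Fraction of positions where two fingerprint bit arrays disagree."""
    a = np.asarray(fp_a, dtype=bool)
    b = np.asarray(fp_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("fingerprints must have the same shape")
    return float(np.mean(a != b))

original = [[1, 0, 1, 1], [0, 0, 1, 0]]   # assumed toy fingerprint
encoded  = [[1, 0, 1, 0], [0, 0, 1, 0]]   # same clip after a lossy operation
ber = bit_error_rate(original, encoded)
print(ber)
```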
-44-
Philips: Robustness of Features
BER of the features after various operations
  Generally low
  High for speed and time-scale changes (which are not likely to occur under query by example)
-45-
Philips: Search Strategies
Via hashing
Inverted indexing
-46-
Shazam: Landmarks as Features
(Source: Dan Ellis)
Spectrogram
Local peaks of spectrogram
Pair peaks to form landmarks
Landmark: [t1, f1, t2, f2]
20-bit hash key:
  f1: 8 bits
  Δf = f2-f1: 6 bits
  Δt = t2-t1: 6 bits
Hash value: Song ID & offset time
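The 20-bit key and the inverted index can be sketched in Python. This is a minimal sketch using only the standard library. The bit order (f1 in the top 8 bits, then Δf, then Δt), the wrap-around of negative Δf into 6 bits, and the function names are assumptions made for illustration; the slide only gives the field widths.

```python
from collections import defaultdict

def landmark_to_hash(t1, f1, t2, f2):
    """Pack a landmark into a 20-bit key: f1 (8 bits), df (6 bits), dt (6 bits)."""
    df = (f2 - f1) & 0x3F            # 6 bits; negative values wrap (assumed)
    dt = (t2 - t1) & 0x3F            # 6 bits
    return ((f1 & 0xFF) << 12) | (df << 6) | dt

def build_index(songs):
    """songs: {song_id: [(t1, f1, t2, f2), ...]} -> {key: [(song_id, t1), ...]}."""
    index = defaultdict(list)
    for song_id, landmarks in songs.items():
        for t1, f1, t2, f2 in landmarks:
            index[landmark_to_hash(t1, f1, t2, f2)].append((song_id, t1))
    return index

songs = {"songA": [(10, 100, 15, 110), (20, 50, 30, 45)],   # assumed toy data
         "songB": [(7, 100, 12, 110)]}
index = build_index(songs)
key = landmark_to_hash(10, 100, 15, 110)
print(key, index[key])
```

Two landmarks with the same frequencies and time difference share a key regardless of where they occur, which is why the stored value also carries the offset time t1.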
-47-
Shazam: Landmarks as Features (II)
Pick peaks based on local decaying surface
Matched landmarks
(Source: Dan Ellis)
-48-
Shazam: Time-justified Landmarks
Valid landmarks based on offset time (which avoids hash collision)
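One common way to use the offset time is to vote on the difference between the database time and the query time of each matched key, then keep the song with the most consistent offset. The sketch below assumes that voting scheme, a dictionary-based index mapping each key to (song ID, time) pairs, and toy data; the names are illustrative and the author's engine may differ.

```python
from collections import Counter

def best_match(query_hashes, index):
    """query_hashes: [(key, query_time)]; index: {key: [(song_id, song_time)]}.

    Returns (song_id, offset, votes) for the most consistent time offset,
    or None if no key matches.
    """
    votes = Counter()
    for key, q_time in query_hashes:
        for song_id, s_time in index.get(key, []):
            votes[(song_id, s_time - q_time)] += 1   # same offset => same alignment
    if not votes:
        return None
    (song_id, offset), count = votes.most_common(1)[0]
    return song_id, offset, count

index = {1: [("A", 10), ("B", 3)],      # assumed toy index
         2: [("A", 12)],
         3: [("A", 15), ("B", 40)]}
query = [(1, 0), (2, 2), (3, 5)]        # keys observed at query times 0, 2, 5
result = best_match(query, index)
print(result)
```

Song B shares two keys with the query, but at inconsistent offsets (3 and 35), so those accidental collisions never accumulate into a single high count.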
-49-
Our AFP Engine
Database (~2500)
  Personally collected MP3 (currently)
  Music collected from YouTube (in the future)
Driving forces
  Fundamental issues in CS (hashing, indexing…)
  Requests from local companies
Methods
  Landmarks as features (Shazam)
  Speedup by hash tables
Platform
  Single CPU over MS Windows
-50-
Experiments
Corpora
  Database: 2550 tracks
  Test files: 5 songs recorded with mobile phones in a noisy environment, then chopped into segments of 5, 10, 15, and 20 seconds
Accuracy
  5-second clips: 161/275 (58.5%)
  10-second clips: 121/136 (89.0%)
  15-second clips: 88/90 (97.8%)
  20-second clips: 65/66 (98.5%)
-51-
Accuracy and Efficiency
Accuracy vs. query duration
Computing time vs. query duration
Accuracy vs. computing time
-52-
Demos of Audio Fingerprinting
Commercial apps
  Shazam
  Soundhound
Ours
  http://mirlab.org/demo/afpFarmer2550
-53-
Conclusions For AFP
Conclusions
  Landmark-based methods are effective
Future work
  Scale-up
    15M tracks in database, 6M tags/day
  Speed-up
    Progressive filtering
    GPU