2015/9/15 Two Paradigms for Music IR: Query by Singing/Humming and Audio Fingerprinting, J.-S. Roger Jang (張智星), Multimedia Information Retrieval Lab


Two Paradigms for Music IR: Query by Singing/Humming and Audio Fingerprinting

J.-S. Roger Jang ( 張智星 )

Multimedia Information Retrieval Lab

CS Dept., Tsing Hua Univ., Taiwan

http://mirlab.org/jang


Outline

Introduction to MIR
QBSH (query by singing/humming)
  Intro, demos, conclusions
AFP (audio fingerprinting)
  Intro, demos, conclusions


Introduction to QBSH

QBSH: Query by Singing/Humming
  Input: Singing or humming from a microphone
  Output: A ranked list retrieved from the song database
Progression
  First paper: Around 1994
  Extensive studies since 2001
  State of the art: QBSH tasks at ISMIR/MIREX


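Flow of Query by Singing/Humming

Preprocessing:
  Collect single-track ground-truth melodies (usually MIDI files)
  Convert them to an intermediate format suitable for comparison
Online processing:
  Convert the user's audio input to a pitch vector
  Convert the pitch vector to notes (optional)
  Compare against the ground-truth melodies
  List the ranked results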


Pitch Tracking for QBSH

Two categories of pitch tracking algorithms
  Time domain
    ACF (Autocorrelation function)
    AMDF (Average magnitude difference function)
    SIFT (Simple inverse filtering tracking)
  Frequency domain
    Harmonic product spectrum method
    Cepstrum method


Frame Blocking for Pitch Tracking

Frame size = 256 points
Overlap = 84 points
Frame rate = 11025/(256-84) ≈ 64 pitch/sec

[Figure: waveform with one frame and its overlap marked, plus a zoomed-in view]
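The frame-blocking arithmetic can be sketched in a few lines of Python. This is only an illustration of the numbers on this slide; the function name `frame_blocking` and the choice to drop a trailing partial frame are assumptions, not part of the original deck.

```python
def frame_blocking(signal, frame_size=256, overlap=84):
    """Split a sample sequence into overlapping frames.

    Assumption: a trailing partial frame is dropped.
    """
    hop = frame_size - overlap  # 256 - 84 = 172 samples between frame starts
    if hop <= 0:
        raise ValueError("overlap must be smaller than frame_size")
    last_start = len(signal) - frame_size
    return [signal[s:s + frame_size] for s in range(0, last_start + 1, hop)]


fs = 11025                      # sample rate in Hz
one_second = list(range(fs))    # dummy samples standing in for audio
frames = frame_blocking(one_second)
frame_rate = fs / (256 - 84)    # about 64 pitch values per second
```

Each frame later yields one pitch value, so the hop size (frame size minus overlap) is what sets the pitch rate.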


ACF: Auto-correlation Function

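Frame s(i)
Shifted frame s(i+τ), with τ = 30
acf(30) = inner product of the overlapping part

$$\mathrm{acf}(\tau) = \sum_{i=0}^{n-1-\tau} s(i)\, s(i+\tau)$$

[Figure: frame, shifted frame, and ACF curve with the pitch period marked]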
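A minimal sketch of ACF-based pitch detection for a single frame follows. The search range (82 Hz to 1047 Hz) is borrowed from the pitch range quoted elsewhere in these slides; the function names and the plain "pick the largest ACF value" rule are assumptions for illustration, and practical trackers usually add normalization and smoothing.

```python
import math


def acf(frame, lag):
    """Inner product of the frame with itself shifted by `lag` samples."""
    return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))


def acf_pitch(frame, fs, fmin=82.0, fmax=1047.0):
    """Estimate pitch (Hz) as fs divided by the lag that maximizes the ACF."""
    min_lag = max(1, int(fs / fmax))
    max_lag = min(len(frame) - 1, int(fs / fmin))
    best_lag = max(range(min_lag, max_lag + 1), key=lambda lag: acf(frame, lag))
    return fs / best_lag


fs = 11025
# A 32 ms test frame containing a pure tone at 220.5 Hz (period = 50 samples).
frame = [math.sin(2 * math.pi * 220.5 * i / fs) for i in range(int(0.032 * fs))]
pitch = acf_pitch(frame, fs)  # expected to land close to 220.5 Hz
```

Restricting the lag range keeps the trivial peak at lag 0 out of the search and avoids implausibly low or high pitch estimates.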


Pitch Tracking via ACF

Specs
  Sample rate = 11025 Hz
  Frame size = 32 ms
  Overlap = 0
  Frame rate = 31.25 frames/sec
Playback
  soo.wav
  sooPitch.wav


Frequency to Semitone Conversion

Semitone: A music scale based on A440

$$\text{semitone} = 12 \log_2\left(\frac{\text{freq}}{440}\right) + 69$$

Reasonable pitch range: E2 - C6 (82 Hz - 1047 Hz)
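The conversion is a one-liner; the sketch below simply evaluates the formula above (the function name is an assumption). On this scale A440 maps to 69, and the quoted range E2 to C6 corresponds to roughly 40 to 84.

```python
import math


def freq_to_semitone(freq_hz):
    """Convert a frequency in Hz to the semitone scale where A440 = 69."""
    return 69 + 12 * math.log2(freq_hz / 440.0)


a4 = freq_to_semitone(440.0)     # 69.0
e2 = freq_to_semitone(82.41)     # about 40
c6 = freq_to_semitone(1046.50)   # about 84
```

Working in semitones makes key transposition a constant shift of the pitch vector, which the comparison methods later in the deck rely on.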


Typical Result of Pitch Tracking

Pitch tracking via autocorrelation for 茉莉花 (Jasmine Flower), with audio clip


Comparison of Pitch Vectors

Yellow line: Target pitch vector


Comparison Methods of QBSH

Categories of approaches to QBSH
  Histogram/statistics-based
  Note vs. note
    Edit distance
  Frame vs. note
    HMM
  Frame vs. frame
    Linear scaling, DTW, recursive alignment


Linear Scaling (LS)

Concept
  Scale the query linearly to match the candidates

Example: [Figure: linear scaling example]


Linear Scaling (II)

Strength
  One-shot for dealing with key transposition
  Efficient and effective
  Indexing methods available
Weakness
  Cannot deal with non-uniform tempo variations

[Figure: Typical mapping path]


Linear Scaling (III)

Distance function for LS
  Normalized L1-norm
  Normalized L2-norm
Rest handling
  Extend previous non-zero note

[Figure: Alignment example]
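The linear scaling idea from the last three slides can be sketched as follows. Everything concrete here is an assumption made for illustration: the set of scaling factors, linear interpolation for resampling, shifting by the mean difference as the one-shot key transposition, and matching from the beginning of the reference only.

```python
import math


def fill_rests(pitch):
    """Rest handling: replace each rest (0) with the previous non-zero pitch."""
    filled, last = [], None
    for p in pitch:
        if p > 0:
            last = p
        filled.append(p if last is None else last)  # leading rests stay 0
    return filled


def resample(vec, new_len):
    """Linearly interpolate vec to new_len points (new_len >= 2)."""
    step = (len(vec) - 1) / (new_len - 1)
    out = []
    for k in range(new_len):
        pos = k * step
        lo = min(int(pos), len(vec) - 1)
        hi = min(lo + 1, len(vec) - 1)
        frac = pos - lo
        out.append(vec[lo] * (1 - frac) + vec[hi] * frac)
    return out


def ls_distance(query, ref, factors=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0)):
    """Smallest normalized L1 distance over a set of linear scaling factors."""
    best = math.inf
    for factor in factors:
        n = round(len(query) * factor)
        if n < 2 or n > len(ref):
            continue
        scaled = resample(query, n)
        target = ref[:n]
        # One-shot key transposition: align the means of the two vectors.
        shift = sum(target) / n - sum(scaled) / n
        dist = sum(abs(q + shift - r) for q, r in zip(scaled, target)) / n
        best = min(best, dist)
    return best


ref = [60] * 4 + [62] * 4 + [64] * 4 + [65] * 4      # reference, in semitones
sung = [65, 65, 67, 67, 69, 69, 70, 70]              # faster and 5 semitones up
other = [70, 60, 70, 60, 70, 60, 70, 60]             # unrelated contour
```

In this toy case `ls_distance(sung, ref)` comes out smaller than `ls_distance(other, ref)`, since the sung query differs from the reference only by tempo and key.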


Dynamic Time Warping (DTW)

Goal: Allow comparison with high tolerance to tempo variation

Characteristics:
  Robust to irregular tempo variations
  Trial-and-error for dealing with key transposition
  Expensive in computation
  Does not satisfy the triangle inequality
  Some indexing algorithms do exist

#1 method for task 2 in QBSH/MIREX 2006


Dynamic Time Warping: Type 1

t: input pitch vector (8 sec, 128 points)
r: reference pitch vector
Local paths: 27-45-63 degrees

DTW recurrence:

$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j-2),\ D(i-1,j-1),\ D(i-2,j-1)\}$$

[Figure: DTW grid with axes i (t(i-1), t(i)) and j (r(j-1), r(j)), showing the three local paths into D(i,j)]
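A direct, unoptimized sketch of the type-1 recurrence is given below. It assumes both ends are anchored (the path starts at the first points of t and r and ends at their last points); the function name and the use of infinity for unreachable cells are illustrative choices.

```python
import math


def dtw_type1(t, r):
    """DTW distance with 27-45-63 degree local paths, both ends anchored."""
    n, m = len(t), len(r)
    inf = math.inf
    D = [[inf] * m for _ in range(n)]
    D[0][0] = abs(t[0] - r[0])
    for i in range(1, n):
        for j in range(1, m):
            prev = min(
                D[i - 1][j - 2] if j >= 2 else inf,  # 63-degree step
                D[i - 1][j - 1],                     # 45-degree step
                D[i - 2][j - 1] if i >= 2 else inf,  # 27-degree step
            )
            D[i][j] = abs(t[i] - r[j]) + prev
    return D[n - 1][m - 1]


same = dtw_type1([1, 2, 3], [1, 2, 3])                    # 0
stretched = dtw_type1([1, 2, 3, 4], [1, 2, 2, 3, 3, 4])   # 0
too_long = dtw_type1([1, 2], [1, 2, 3, 4, 5])             # inf: no valid path
```

Because every step advances both indices by one or two, the two sequences can differ in length by at most a factor of about two, which is why the last call finds no path.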


Dynamic Time Warping: Type 2

t: input pitch vector (8 sec, 128 points)
r: reference pitch vector
Local paths: 0-45-90 degrees

DTW recurrence:

$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j),\ D(i-1,j-1),\ D(i,j-1)\}$$

[Figure: DTW grid with axes i (t(i-1), t(i)) and j (r(j-1), r(j)), showing the three local paths into D(i,j)]
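The type-2 recurrence is the textbook form of DTW. The sketch below assumes both ends are anchored and pads the table with one extra row and column so that the boundary needs no special cases; the names are illustrative.

```python
import math


def dtw_type2(t, r):
    """DTW distance with 0-45-90 degree local paths, both ends anchored."""
    n, m = len(t), len(r)
    # D[i][j] is the best cost of aligning t[:i] with r[:j].
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(t[i - 1] - r[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    return D[n][m]


held = dtw_type2([1, 2, 3], [1, 1, 2, 2, 3, 3])   # 0: repeated points are free
off_by_one = dtw_type2([1, 2, 3], [1, 2, 4])      # 1
long_ref = dtw_type2([1, 2], [1, 2, 3, 4, 5])     # 6
```

Unlike type 1, horizontal and vertical steps let one point absorb arbitrarily many points of the other sequence, so a path always exists regardless of the length ratio.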


Local Path Constraints

Type 1: 27-45-63 local paths

$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j-2),\ D(i-1,j-1),\ D(i-2,j-1)\}$$

Type 2: 0-45-90 local paths

$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j),\ D(i-1,j-1),\ D(i,j-1)\}$$

[Figure: predecessor cells of D(i,j) for each type]


DTW Path of “Match Beginning”


DTW Path of “Match Anywhere”


DTW Path of “Match Anywhere”


Challenges in QBSH Systems

Song database preparation
  MIDIs, singing clips, or audio music
Reliable pitch tracking for acoustic input
  Input from mobile devices or noisy karaoke bar
Efficient/effective retrieval
  Karaoke machine: ~10,000 songs
  Internet music search engine: ~500,000,000 songs


Goal and Approach

Goal: To retrieve songs effectively within a given response time, say 5 seconds or so

Our strategy
  Multi-stage progressive filtering
  Indexing for different comparison methods
  Repeating pattern identification


MIRACLE

MIRACLE
  Music Information Retrieval Acoustically via Clustered and paralleL Engines
Database (~13000)
  MIDI files
  Solo vocals (<100)
  Melody extracted from polyphonic music (<100)
Comparison methods
  Linear scaling
  Dynamic time warping
Top-10 accuracy
  70~75%
Platform
  Single CPU+GPU


MIRACLE Before Oct. 2011

Client-server distributed computing
Cloud computing via clustered PCs

[Diagram: clients (PC, PDA/smartphone, cellular) send a request (pitch vector) to the master server, which is backed by clustered slave servers, and receive a response (search result). Database size: ~12,000]


Current MIRACLE

Single server with GPU
  NVIDIA 560 Ti, 384 cores (speedup factor = 66)

[Diagram: clients (PC, PDA/smartphone, cellular) send a request (pitch vector) to a single master server and receive a response (search result). Database size: ~13,000]


MIRACLE in the Future

Multi-modal retrieval
  Singing, humming, speech, audio, tapping…

[Diagram: clients (PC, PDA/smartphone, cellular) send a request (feature vector) to the master server, which is backed by clustered slave servers, and receive a response (search result)]


Outlook of MIRACLE

Web version
Stand-alone version


QBSH for Other Platforms

Embedded systems
  Karaoke machines
Smartphones
  iPhone/iPad
  Android phone
Toys


Returned Results

Typical results of MIRACLE


QBSH Demo

Demo page of MIR Lab: http://mirlab.org/mir_products.asp

MIRACLE demo: http://mirlab.org/demo/miracle

Existing commercial QBSH systems
  www.midomi.com
  www.soundhound.com


To Make QBSH More Efficient

Algorithms
  Indexing of LS/DTW
  Progressive filtering
New platforms
  GPU (10 times faster for QBSH!)
  Grid/clustered computing
  Multi-core platforms


Conclusions for QBSH

QBSH
  Fun and interesting way to retrieve music
  Can be extended to singing scoring
  Commercial applications are maturing
Challenges
  How to deal with massive music databases?
  How to extract melody from audio music?


Audio Fingerprinting (AFP)

Goal
  Identify a noisy version of a given audio clip (no “cover versions”)
Technical barriers
  Robustness
  Efficiency (6M tags/day for Shazam)
  Database collection (15M tracks for Shazam)
Applications
  Song purchase
  Royalty assignment (over radio)
  Confirmation of commercials (over TV)
  Copyright violation (over web)
  TV program ID


Company: Shazam

Facts
  First commercial product of audio fingerprinting
  Since 2002, UK
Technology
  Audio fingerprinting
Founder
  Avery Wang (PhD at Stanford, 1994)


Company: Soundhound

Facts
  First product with multi-modal music search
  AKA: midomi
Technologies
  Audio fingerprinting
  Query by singing/humming
  Speech recognition
Founder
  Keyvan Mohajer (PhD at Stanford, 2007)


Two Stages in AFP

Offline: Database construction
  Robust feature extraction (audio fingerprinting)
  Hash table construction
  Inverted indexing
Online: Application
  Robust feature extraction
  Hash table search
  Ranked list of the retrieved songs/music
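The two stages can be sketched with an ordinary dictionary acting as the hash table. Feature extraction is left out; the sketch assumes each song has already been reduced to a list of (hash key, offset time) pairs, and the names `build_index` and `search` are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(db):
    """Offline stage: invert {song_id: [(hash_key, offset), ...]} into
    {hash_key: [(song_id, offset), ...]}."""
    index = defaultdict(list)
    for song_id, prints in db.items():
        for key, offset in prints:
            index[key].append((song_id, offset))
    return index


def search(index, query_prints, top_n=10):
    """Online stage: look up each query hash and rank songs by hit count."""
    hits = Counter()
    for key, _offset in query_prints:
        for song_id, _db_offset in index.get(key, ()):
            hits[song_id] += 1
    return hits.most_common(top_n)


db = {
    "A": [(1, 0), (2, 1), (3, 2)],
    "B": [(2, 0), (9, 1)],
}
index = build_index(db)
ranking = search(index, [(2, 0), (3, 1)])
```

Ranking by raw hit count ignores the stored offsets; using them to reject coincidental hits is the subject of the time-justified landmarks slide later in the deck.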


Robust Feature Extraction

Various kinds of features for AFP
  Invariance along time and frequency
  Landmark of a pair of local maxima
  Wavelets
  …
Extensive tests required for choosing the best features


Representative Approaches to AFP

Philips
  J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system”, ISMIR 2002.
Shazam
  A. Wang, “An industrial-strength audio search algorithm”, ISMIR 2003.
Google
  S. Baluja and M. Covell, “Content fingerprinting using wavelets”, Euro. Conf. on Visual Media Production, 2006.
  V. Chandrasekhar, M. Sharifi, and D. A. Ross, “Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications”, ISMIR 2011.


Philips: Thresholding as Features

Observation
  The sign of energy differences is robust to various operations
    Lossy encoding
    Range compression
    Added noise

Thresholding as features

$$F(t,f) = \begin{cases} 1, & \text{if } S(t,f) + S(t-1,f+1) > S(t,f+1) + S(t-1,f) \\ 0, & \text{otherwise} \end{cases}$$

Fingerprint F(t, f)
Magnitude spectrum S(t, f)

(Source: Dan Ellis)


Philips: Thresholding as Features (II)

Robust to low-bitrate MP3 encoding (see the figure)
Sensitive to “frame time difference”
  Hop size is kept small!

[Figure: original fingerprint vs. fingerprint after MP3 encoding; BER = 0.078]
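A small sketch of the thresholding rule and of the bit error rate (BER) used to compare two fingerprints is shown below. It takes the spectrogram as a plain list of frames, each a list of band values; how the bands are computed is outside this sketch, and the function names are assumptions.

```python
def philips_bits(S):
    """Fingerprint bits F(t, f) from a spectrogram S[t][f].

    Bits are defined for t >= 1 and f < (number of bands - 1).
    """
    F = []
    for t in range(1, len(S)):
        row = []
        for f in range(len(S[t]) - 1):
            bit = S[t][f] + S[t - 1][f + 1] > S[t][f + 1] + S[t - 1][f]
            row.append(1 if bit else 0)
        F.append(row)
    return F


def bit_error_rate(F1, F2):
    """Fraction of differing bits between two equally shaped fingerprints."""
    total = errors = 0
    for row1, row2 in zip(F1, F2):
        for a, b in zip(row1, row2):
            total += 1
            errors += a != b
    return errors / total


S = [[1, 2, 3], [3, 2, 1]]
louder = [[2 * v for v in frame] for frame in S]   # same audio, doubled gain
F = philips_bits(S)
ber_gain = bit_error_rate(F, philips_bits(louder))
```

Only the sign of the difference is kept, so a uniform gain change leaves every bit unchanged and the BER is zero in this toy case.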


Philips: Robustness of Features

BER of the features after various operations
  Generally low
  High for speed and time-scale changes (which are not likely to occur under query by example)


Philips: Search Strategies

Via hashing

Inverted indexing


Shazam: Landmarks as Features

(Source: Dan Ellis)

Spectrogram

Local peaks of spectrogram

Pair peaks to form landmarks

Landmark: [t1, f1, t2, f2]
20-bit hash key:
  f1: 8 bits
  Δf = f2-f1: 6 bits
  Δt = t2-t1: 6 bits
Hash value: Song ID & offset time
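Packing a landmark into a 20-bit key can be sketched with bit shifts. The field widths come from the slide; the field order (f1 in the high bits, then Δf, then Δt) and the wrap-around treatment of negative Δf are assumptions made for this illustration.

```python
def landmark_hash(t1, f1, t2, f2):
    """Pack a landmark into a 20-bit key: f1 (8 bits), df (6 bits), dt (6 bits).

    t and f are assumed to be integer frame and frequency-bin indices.
    """
    df = (f2 - f1) & 0x3F   # 6 bits; negative differences wrap around
    dt = (t2 - t1) & 0x3F   # 6 bits
    return ((f1 & 0xFF) << 12) | (df << 6) | dt


def unpack_hash(key):
    """Recover (f1, df, dt) from a 20-bit key, with df and dt unsigned."""
    return (key >> 12) & 0xFF, (key >> 6) & 0x3F, key & 0x3F


key = landmark_hash(t1=10, f1=3, t2=15, f2=5)   # f1=3, df=2, dt=5
fields = unpack_hash(key)
```

The absolute time t1 is deliberately left out of the key and stored as the hash value together with the song ID, so the same landmark hashes identically wherever it occurs in a recording.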


Shazam: Landmark as Features (II)

Pick peaks based on local decaying surface

Matched landmarks

(Source: Dan Ellis)


Shazam: Time-justified Landmarks

Keep only the matched landmarks whose offset times are consistent (which filters out hash collisions)
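One common way to apply the offset-time check is to vote on the difference between database time and query time, as sketched below. The index layout ({hash key: [(song ID, offset time), ...]}) and the function name are assumptions for illustration.

```python
from collections import Counter


def best_match(index, query_prints):
    """Vote on (song, time shift) pairs and return the winner.

    index:        {hash_key: [(song_id, db_time), ...]}
    query_prints: [(hash_key, query_time), ...]
    Returns (song_id, shift, votes), or None if nothing matched.
    """
    votes = Counter()
    for key, q_time in query_prints:
        for song_id, db_time in index.get(key, ()):
            votes[(song_id, db_time - q_time)] += 1
    if not votes:
        return None
    (song_id, shift), count = votes.most_common(1)[0]
    return song_id, shift, count


index = {
    1: [("A", 10), ("B", 3)],
    2: [("A", 11), ("B", 40)],
    3: [("A", 12)],
}
query = [(1, 0), (2, 1), (3, 2)]
result = best_match(index, query)
```

In this toy index, song A matches all three landmarks at the same shift of 10, while the two hits on song B disagree on the shift and so never accumulate.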


Our AFP Engine

Database (~2500)
  Personally collected MP3 (currently)
  Music collected from YouTube (in the future)
Driving forces
  Fundamental issues in CS (hashing, indexing…)
  Requests from local companies
Methods
  Landmarks as features (Shazam)
  Speedup by hash tables
Platform
  Single CPU over MS Windows


Experiments

Corpora
  Database: 2550 tracks
  Test files: 5 songs recorded with mobile phones (in a noisy environment), then chopped into segments of 5, 10, 15, and 20 seconds
Accuracy
  5-second clips: 161/275 (58.5%)
  10-second clips: 121/136 (89.0%)
  15-second clips: 88/90 (97.8%)
  20-second clips: 65/66 (98.5%)


Accuracy and Efficiency

[Figure panels:]
  Accuracy vs. query duration
  Computing time vs. query duration
  Accuracy vs. computing time


Demos of Audio Fingerprinting

Commercial apps
  Shazam
  Soundhound
Ours
  http://mirlab.org/demo/afpFarmer2550


Conclusions For AFP

Conclusions
  Landmark-based methods are effective
Future work
  Scale-up
    15M tracks in database, 6M tags/day
  Speed-up
    Progressive filtering
    GPU