2015/9/15 Two Paradigms for Music IR: Query by Singing/Humming and Audio Fingerprinting, J.-S. Roger Jang ( 張智星 ), Multimedia Information Retrieval Lab

112/04/19 1

Two Paradigms for Music IR: Query by Singing/Humming and Audio Fingerprinting

J.-S. Roger Jang ( 張智星 )

Multimedia Information Retrieval Lab

CS Dept., Tsing Hua Univ., Taiwan

http://mirlab.org/jang

http://www.cs.nthu.edu.tw/~jang

-2-

Outline

Introduction to MIR
QBSH (query by singing/humming)
  Intro, demos, conclusions
AFP (audio fingerprinting)
  Intro, demos, conclusions

-3-

Introduction to QBSH

QBSH: Query by Singing/Humming
  Input: Singing or humming from microphone
  Output: A ranking list retrieved from the song database
Progression
  First paper: Around 1994
  Extensive studies since 2001
  State of the art: QBSH tasks at ISMIR/MIREX

-4-

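Flow of Query by Singing/Humming

Preprocessing
  Collect single-track reference melodies (usually MIDI files)
  Convert them into an intermediate format suitable for comparison

Online processing
  Convert the user's audio input into a pitch vector
  Convert the pitch vector into notes (optional)
  Compare with the reference melodies
  List the ranking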

-5-

Pitch Tracking for QBSH

Two categories of pitch tracking algorithms
  Time domain
    ACF (Autocorrelation function)
    AMDF (Average magnitude difference function)
    SIFT (Simple inverse filtering tracking)
  Frequency domain
    Harmonic product spectrum method
    Cepstrum method

-6-

Frame Blocking for Pitch Tracking

Frame size = 256 points
Overlap = 84 points
Frame rate = 11025/(256-84) ≈ 64 pitch/sec

[Figure: waveform (amplitude vs. sample index) with a zoomed-in view marking one frame and the overlap between adjacent frames]
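The frame blocking described on this slide can be sketched as follows. This is a minimal illustration, assuming the signal is a plain Python list of samples; the function name `frame_blocking` is hypothetical and the silent one-second signal is only a stand-in.

```python
def frame_blocking(signal, frame_size=256, overlap=84):
    """Split a sequence of samples into overlapping frames."""
    hop = frame_size - overlap  # 172 samples between consecutive frame starts
    frames = []
    # Only keep frames that fit entirely inside the signal.
    for start in range(0, len(signal) - frame_size + 1, hop):
        frames.append(signal[start:start + frame_size])
    return frames


sample_rate = 11025
signal = [0.0] * sample_rate           # one second of silence as a stand-in
frames = frame_blocking(signal)
frame_rate = sample_rate / (256 - 84)  # about 64 pitch values per second
```

Each frame yields one pitch value, so the hop size (frame size minus overlap) determines the pitch rate.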

-7-

ACF: Auto-correlation Function

Frame s(i)
Shifted frame s(i+τ), with τ = 30
acf(30) = inner product of the overlapping part
The lag τ that maximizes acf(τ) gives the pitch period

$$\mathrm{acf}(\tau) = \sum_{i=0}^{n-1-\tau} s(i)\, s(i+\tau)$$
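A minimal sketch of ACF-based pitch estimation for a single frame, under stated assumptions: the lag search range is limited to the 82 Hz to 1047 Hz pitch range quoted elsewhere in these slides, the function names are hypothetical, and no voicing decision or smoothing is applied.

```python
import math


def acf(frame, tau):
    """Inner product of the frame with itself shifted by tau samples."""
    n = len(frame)
    return sum(frame[i] * frame[i + tau] for i in range(n - tau))


def acf_pitch(frame, sample_rate, fmin=82.0, fmax=1047.0):
    """Return the pitch (Hz) whose period maximizes the ACF."""
    tau_min = int(sample_rate / fmax)                        # shortest period
    tau_max = min(int(sample_rate / fmin), len(frame) - 1)   # longest period
    best_tau = max(range(tau_min, tau_max + 1), key=lambda tau: acf(frame, tau))
    return sample_rate / best_tau


# A 220.5 Hz sine at 11025 Hz has a period of exactly 50 samples.
sample_rate = 11025
frame = [math.sin(2 * math.pi * 220.5 * i / sample_rate) for i in range(352)]
pitch = acf_pitch(frame, sample_rate)
```

Restricting the lag range is one simple way to avoid picking the trivial peak at lag 0 or a multiple of the true period.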

-8-

Pitch Tracking via ACF

Specs
  Sample rate = 11025 Hz
  Frame size = 32 ms
  Overlap = 0
  Frame rate = 31.25 frames/sec

Playback: soo.wav, sooPitch.wav

-9-

Frequency to Semitone Conversion

Semitone: A music scale based on A440

Reasonable pitch range: E2 - C6 (82 Hz - 1047 Hz)

$$\text{semitone} = 12 \log_2\left(\frac{\text{freq}}{440}\right) + 69$$
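The conversion can be checked numerically with a short sketch. The function name `freq_to_semitone` is hypothetical; the values 82 Hz and 1047 Hz are the range limits from this slide.

```python
import math


def freq_to_semitone(freq_hz):
    """Map a frequency in Hz to a semitone number, with A440 mapped to 69."""
    return 69 + 12 * math.log2(freq_hz / 440.0)


a4 = freq_to_semitone(440.0)     # reference pitch
low = freq_to_semitone(82.0)     # close to E2
high = freq_to_semitone(1047.0)  # close to C6
```

Working in semitones makes a key transposition a constant offset of the whole pitch vector, which the comparison methods later in the slides rely on.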

-10-

Typical Result of Pitch Tracking

Pitch tracking via autocorrelation for 茉莉花 (Jasmine Flower) [audio]

-11-

Comparison of Pitch Vectors

Yellow line: Target pitch vector

-12-

Comparison Methods of QBSH

Categories of approaches to QBSH
  Histogram/statistics-based
  Note vs. note
    Edit distance
  Frame vs. note
    HMM
  Frame vs. frame
    Linear scaling, DTW, recursive alignment

-13-

Linear Scaling (LS)

Concept: Scale the query linearly to match the candidates

Example:

-14-

Linear Scaling (II)

Strength
  One-shot for dealing with key transposition
  Efficient and effective
  Indexing methods available

Weakness
  Cannot deal with non-uniform tempo variations

Typical mapping path

-15-

Linear Scaling (III)

Distance function for LS
  Normalized L1-norm
  Normalized L2-norm

Rest handling
  Extend previous non-zero note

Alignment example
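The three linear scaling slides can be summarized in a minimal sketch. Several choices here are assumptions, not taken from the slides: the set of scaling factors, linear interpolation for resampling, matching from the beginning of the reference, and a mean shift as the one-shot key transposition. All function names are hypothetical.

```python
def fill_rests(pitch):
    """Rest handling: replace each rest (0) with the previous non-zero pitch."""
    out, last = [], 0.0
    for p in pitch:
        if p != 0:
            last = p
        out.append(last)
    return out


def resample(vec, new_len):
    """Stretch or compress vec to new_len points by linear interpolation."""
    out = []
    for k in range(new_len):
        pos = k * (len(vec) - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, len(vec) - 1)
        frac = pos - lo
        out.append(vec[lo] * (1 - frac) + vec[hi] * frac)
    return out


def linear_scaling_distance(query, reference,
                            factors=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0)):
    """Smallest normalized L1 distance over all scaling factors."""
    query = fill_rests(query)
    best = float("inf")
    for factor in factors:
        n = round(len(query) * factor)
        if n < 2 or n > len(reference):
            continue
        scaled = resample(query, n)
        target = reference[:n]  # compare against the start of the reference
        # One-shot key transposition: shift the query to the target's mean.
        shift = sum(target) / n - sum(scaled) / n
        dist = sum(abs(s + shift - t) for s, t in zip(scaled, target)) / n
        best = min(best, dist)
    return best
```

Because every scaling factor stretches the whole query uniformly, this sketch shares the weakness listed above: a query whose tempo drifts within the clip cannot be matched exactly.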

-16-

Dynamic Time Warping (DTW)

Goal: Allow comparison with high tolerance to tempo variation

Characteristics
  Robust for irregular tempo variations
  Trial-and-error for dealing with key transposition
  Expensive in computation
  Does not conform to triangle inequality
  Some indexing algorithms do exist

#1 method for task 2 in QBSH/MIREX 2006

-17-

Dynamic Time Warping: Type 1

t: input pitch vector (8 sec, 128 points)
r: reference pitch vector
Local paths: 27-45-63 degrees

DTW recurrence:

$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j-2),\ D(i-1,j-1),\ D(i-2,j-1)\}$$

[Figure: DTW table with t(i-1), t(i) along the i axis and r(j-1), r(j) along the j axis]

-18-

Dynamic Time Warping: Type 2

t: input pitch vector (8 sec, 128 points)
r: reference pitch vector
Local paths: 0-45-90 degrees

DTW recurrence:

$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j),\ D(i-1,j-1),\ D(i,j-1)\}$$

[Figure: DTW table with t(i-1), t(i) along the i axis and r(j-1), r(j) along the j axis]

-19-

Local Path Constraints

Type 1: 27-45-63 local paths

$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j-2),\ D(i-1,j-1),\ D(i-2,j-1)\}$$

  Predecessors of D(i,j): D(i-1,j-2), D(i-1,j-1), D(i-2,j-1)

Type 2: 0-45-90 local paths

$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j),\ D(i-1,j-1),\ D(i,j-1)\}$$

  Predecessors of D(i,j): D(i-1,j), D(i-1,j-1), D(i,j-1)
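Both local path constraints fit one dynamic-programming routine that takes the allowed predecessor offsets as a parameter. This is a minimal sketch assuming both ends of the path are anchored, so the "match anywhere" variants are not covered, and key transposition is not handled. The names `dtw`, `TYPE1` and `TYPE2` are hypothetical.

```python
def dtw(t, r, local_paths):
    """DTW distance between pitch vectors t and r.

    local_paths lists the (di, dj) offsets of the allowed predecessors
    D(i-di, j-dj) of each cell D(i, j).
    """
    inf = float("inf")
    m, n = len(t), len(r)
    D = [[inf] * n for _ in range(m)]
    D[0][0] = abs(t[0] - r[0])
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            best = inf
            for di, dj in local_paths:
                pi, pj = i - di, j - dj
                if pi >= 0 and pj >= 0:
                    best = min(best, D[pi][pj])
            if best < inf:
                D[i][j] = abs(t[i] - r[j]) + best
    return D[m - 1][n - 1]  # inf if no allowed path reaches the end


TYPE1 = [(1, 2), (1, 1), (2, 1)]  # 27-45-63 degree local paths
TYPE2 = [(1, 0), (1, 1), (0, 1)]  # 0-45-90 degree local paths
```

Type 1 limits the tempo ratio between query and reference to the range 1/2 to 2, so a reference more than about twice as long as the query has no valid path. Type 2 has no such limit.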

-20-

DTW Path of “Match Beginning”

-21-

DTW Path of “Match Anywhere”

-22-

DTW Path of “Match Anywhere”

-23-

Challenges in QBSH Systems

Song database preparation
  MIDIs, singing clips, or audio music

Reliable pitch tracking for acoustic input
  Input from mobile devices or noisy karaoke bar

Efficient/effective retrieval
  Karaoke machine: ~10,000 songs
  Internet music search engine: ~500,000,000 songs

-24-

-25-

Goal and Approach

Goal: To retrieve songs effectively within a given response time, say 5 seconds or so

Our strategy
  Multi-stage progressive filtering
  Indexing for different comparison methods
  Repeating pattern identification

-26-

MIRACLE

MIRACLE: Music Information Retrieval Acoustically via Clustered and paralleL Engines

Database (~13000)
  MIDI files
  Solo vocals (<100)
  Melody extracted from polyphonic music (<100)

Comparison methods
  Linear scaling
  Dynamic time warping

Top-10 accuracy: 70~75%

Platform: Single CPU+GPU

-27-

MIRACLE Before Oct. 2011
  Client-server distributed computing
  Cloud computing via clustered PCs

[Diagram: clients (PC, PDA/Smartphone, Cellular) send requests to a master server, which dispatches them to clustered slave servers]
  Request: pitch vector
  Response: search result
  Database size: ~12,000

-28-

Current MIRACLE
  Single server with GPU
  NVIDIA 560 Ti, 384 cores (speedup factor = 66)

[Diagram: clients (PC, PDA/Smartphone, Cellular) send requests to a single master server]
  Request: pitch vector
  Database size: ~13,000

-29-

MIRACLE in the Future
  Multi-modal retrieval
  Singing, humming, speech, audio, tapping…

[Diagram: clients (PC, PDA/Smartphone, Cellular) send requests to a master server, which dispatches them to clustered slave servers]
  Request: feature vector


-30-

Outlook of MIRACLE

[Figures: Web version; Stand-alone version]

-31-

QBSH for Other Platforms

Embedded systems
  Karaoke machines

Smartphones
  iPhone/iPad
  Android phone

Toys

-32-

Returned Results

Typical results of MIRACLE

-33-

QBSH Demo

Demo page of MIR Lab: http://mirlab.org/mir_products.asp

MIRACLE demo: http://mirlab.org/demo/miracle

Existing commercial QBSH systems
  www.midomi.com
  www.soundhound.com

-34-

To Make QBSH More Efficient

Algorithms
  Indexing of LS/DTW
  Progressive filtering

New platforms
  GPU (10 times faster for QBSH!)
  Grid/clustered computing
  Multi-core platforms

-35-

Conclusions for QBSH

QBSH
  Fun and interesting way to retrieve music
  Can be extended to singing scoring
  Commercial applications are maturing

Challenges
  How to deal with massive music databases?
  How to extract melody from audio music?

-36-

Audio Fingerprinting (AFP)

Goal
  Identify a noisy version of a given audio clip (no "cover versions")

Technical barriers
  Robustness
  Efficiency (6M tags/day for Shazam)
  Database collection (15M tracks for Shazam)

Applications
  Song purchase
  Royalty assignment (over radio)
  Confirmation of commercials (over TV)
  Copyright violation (over web)
  TV program ID

-37-

Company: Shazam

Facts
  First commercial product of audio fingerprinting
  Since 2002, UK

Technology
  Audio fingerprinting

Founder
  Avery Wang (PhD at Stanford, 1994)

-38-

Company: Soundhound

Facts
  First product with multi-modal music search
  AKA: midomi

Technologies
  Audio fingerprinting
  Query by singing/humming
  Speech recognition

Founder
  Keyvan Mohajer (PhD at Stanford, 2007)

-39-

Two Stages in AFP

Offline: Database construction
  Robust feature extraction (audio fingerprinting)
  Hash table construction
  Inverted indexing

Online: Application
  Robust feature extraction
  Hash table search
  Ranked list of the retrieved songs/music
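The two stages can be sketched with an in-memory inverted index. This is an illustration only: it assumes each song has already been reduced to a list of (hash key, offset) pairs by some feature extractor, and the function names and toy keys are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(database):
    """Offline stage: map each hash key to the (song_id, offset) pairs that contain it."""
    index = defaultdict(list)
    for song_id, prints in database.items():
        for key, offset in prints:
            index[key].append((song_id, offset))
    return index


def search(index, query_keys):
    """Online stage: rank songs by the number of hash keys shared with the query."""
    hits = Counter()
    for key in query_keys:
        for song_id, _offset in index.get(key, []):
            hits[song_id] += 1
    return hits.most_common()


database = {
    "A": [(10, 0), (11, 1), (12, 2)],
    "B": [(11, 0), (99, 5)],
}
index = build_index(database)
ranking = search(index, [11, 12])
```

Lookup cost depends on the number of query keys and the length of their posting lists, not on the number of songs in the database.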

-40-

Robust Feature Extraction

Various kinds of features for AFP
  Invariance along time and frequency
  Landmark of a pair of local maxima
  Wavelets
  …

Extensive tests are required for choosing the best features

-41-

Representative Approaches to AFP

Philips
  J. Haitsma and T. Kalker, "A highly robust audio fingerprinting system", ISMIR 2002.

Shazam
  A. Wang, "An industrial-strength audio search algorithm", ISMIR 2003.

Google
  S. Baluja and M. Covell, "Content fingerprinting using wavelets", Euro. Conf. on Visual Media Production, 2006.
  V. Chandrasekhar, M. Sharifi, and D. A. Ross, "Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications", ISMIR 2011.

-42-

Philips: Thresholding as Features

Observation
  The sign of energy differences is robust to various operations
    Lossy encoding
    Range compression
    Added noise

Thresholding as features

$$F(t,f) = \begin{cases} 1 & \text{if } S(t,f) - S(t,f+1) > S(t-1,f) - S(t-1,f+1) \\ 0 & \text{otherwise} \end{cases}$$

Fingerprint F(t, f)
Magnitude spectrum S(t, f)

(Source: Dan Ellis)
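A minimal sketch of the thresholding step, written after the sign-of-energy-difference rule in the Haitsma and Kalker paper cited earlier. The index convention (previous frame t-1, next band f+1) is an assumption, since the formula on this slide did not survive extraction intact. `S` is assumed to be a list of frames, each a list of band magnitudes, and the function name is hypothetical.

```python
def fingerprint_bits(S):
    """Return F[t][f] in {0, 1} for t >= 1 and f < (number of bands - 1)."""
    F = []
    for t in range(1, len(S)):
        row = []
        for f in range(len(S[t]) - 1):
            # Difference across adjacent bands, then across adjacent frames.
            diff = (S[t][f] - S[t][f + 1]) - (S[t - 1][f] - S[t - 1][f + 1])
            row.append(1 if diff > 0 else 0)
        F.append(row)
    return F


S = [[1.0, 2.0, 3.0],
     [3.0, 2.0, 1.0]]
bits = fingerprint_bits(S)
louder = fingerprint_bits([[10 * x for x in frame] for frame in S])
```

Only the sign of the difference is kept, so scaling all magnitudes by a positive constant leaves the bits unchanged, which is one reason the feature tolerates level changes.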

-43-

Philips: Thresholding as Features (II)

Robust to low-bitrate MP3 encoding (see the right)

Sensitive to "frame time difference"
  Hop size is kept small!

[Figure: original fingerprint vs. fingerprint after MP3 encoding, BER = 0.078]
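The bit error rate (BER) quoted in the figure is the fraction of fingerprint bits that differ between two versions of the same clip. A minimal sketch, assuming both fingerprints are equally sized lists of bit rows; the function name and the toy bit patterns are hypothetical.

```python
def bit_error_rate(fp_a, fp_b):
    """Fraction of positions at which two binary fingerprints differ."""
    total = 0
    errors = 0
    for row_a, row_b in zip(fp_a, fp_b):
        for bit_a, bit_b in zip(row_a, row_b):
            total += 1
            errors += bit_a != bit_b
    return errors / total


original = [[1, 0, 1, 1], [0, 0, 1, 0]]
degraded = [[1, 1, 1, 1], [0, 0, 0, 0]]
ber = bit_error_rate(original, degraded)
```

A BER near 0 means the two clips almost certainly match, while unrelated clips are expected to give a BER near 0.5.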

-44-

Philips: Robustness of Features

BER of the features after various operations
  Generally low
  High for speed and time-scale changes (which are not likely to occur under query by example)

-45-

Philips: Search Strategies

Via hashing

Inverted indexing

-46-

Shazam: Landmarks as Features

(Source: Dan Ellis)

Spectrogram

Local peaks of spectrogram

Pair peaks to form landmarks

Landmark: [t1, f1, t2, f2]
20-bit hash key:
  f1: 8 bits
  Δf = f2-f1: 6 bits
  Δt = t2-t1: 6 bits
Hash value: Song ID & offset time
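Packing a landmark into the 20-bit key can be sketched with bit shifts. The field order (f1 in the high bits, then Δf, then Δt) and the wrap-around encoding of a negative Δf are assumptions; the slide only gives the bit widths. Function names are hypothetical.

```python
def landmark_to_key(t1, f1, t2, f2):
    """Pack [t1, f1, t2, f2] into a 20-bit integer: 8 + 6 + 6 bits."""
    df = (f2 - f1) & 0x3F  # 6 bits; a negative difference wraps around
    dt = (t2 - t1) & 0x3F  # 6 bits
    return ((f1 & 0xFF) << 12) | (df << 6) | dt


def key_to_fields(key):
    """Unpack a key into (f1, df, dt) as stored in the key."""
    return (key >> 12) & 0xFF, (key >> 6) & 0x3F, key & 0x3F


key = landmark_to_key(t1=100, f1=17, t2=105, f2=20)
fields = key_to_fields(key)
```

The absolute time t1 is left out of the key on purpose: it is stored as the offset in the hash value, so the same landmark hashes to the same key wherever it occurs in a recording.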

-47-

Shazam: Landmark as Features (II)

Pick peaks based on local decaying surface

Matched landmarks

(Source: Dan Ellis)

-48-

Shazam: Time-justified Landmarks

Valid landmarks based on offset time (which avoids hash collision)
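One way to find the time-justified landmarks is to vote on the difference between the database offset and the query offset of every hash hit; true matches share one difference, while accidental collisions scatter. A minimal sketch, assuming hits arrive as (song ID, database time, query time) tuples. The function name and the toy data are hypothetical.

```python
from collections import Counter


def best_aligned_song(matches):
    """Return (song_id, offset, count) for the most consistent time offset."""
    votes = Counter((song, db_t - q_t) for song, db_t, q_t in matches)
    (song, offset), count = votes.most_common(1)[0]
    return song, offset, count


matches = [
    ("A", 50, 0), ("A", 53, 3), ("A", 57, 7),  # consistent offset of 50
    ("A", 12, 7),                              # accidental collision
    ("B", 10, 0), ("B", 40, 3),                # inconsistent offsets
]
result = best_aligned_song(matches)
```

Counting only hits that agree on the offset discards most random collisions, which matters when the key space is only 20 bits wide.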

-49-

Our AFP Engine

Database (~2500)
  Personally collected MP3 (currently)
  Music collected from YouTube (in the future)

Driving forces
  Fundamental issues in CS (hashing, indexing…)
  Requests from local companies

Methods
  Landmarks as features (Shazam)
  Speedup by hash tables

Platform
  Single CPU over MS Windows

-50-

Experiments

Corpora
  Database: 2550 tracks
  Test files: 5 songs recorded with mobile phones (in a noisy environment), and then chopped into segments of 5, 10, 15, and 20 seconds

Accuracy
  5-second clips: 161/275
  10-second clips: 121/136
  15-second clips: 88/90
  20-second clips: 65/66
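Converting the counts above to percentages makes the trend easier to read. The snippet only redoes the arithmetic on the figures from this slide.

```python
# (correct, total) per clip length in seconds, as listed on the slide.
results = {5: (161, 275), 10: (121, 136), 15: (88, 90), 20: (65, 66)}

accuracy = {
    seconds: round(100 * correct / total, 1)
    for seconds, (correct, total) in results.items()
}
```

Accuracy rises from roughly 58.5% for 5-second clips to 89.0%, 97.8% and 98.5% for 10, 15 and 20 seconds, so most of the gain comes from the first few extra seconds of audio.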

-51-

Accuracy and Efficiency

Accuracy vs. query duration

Computing time vs. query duration

Accuracy vs. computing time

-52-

Demos of Audio Fingerprinting

Commercial apps
  Shazam
  Soundhound

Ours
  http://mirlab.org/demo/afpFarmer2550

-53-

Conclusions For AFP

Conclusions
  Landmark-based methods are effective

Future work
  Scale-up
    15M tracks in database, 6M tags/day
  Speed-up
    Progressive filtering
    GPU
