Two Paradigms for Music IR: Query by Singing/Humming and Audio Fingerprinting
J.-S. Roger Jang (張智星 )
Multimedia Information Retrieval Lab
CS Dept., Tsing Hua Univ., Taiwan
http://mirlab.org/jang
-2-
Outline
Introduction to MIR
QBSH (query by singing/humming): intro, demos, conclusions
AFP (audio fingerprinting): intro, demos, conclusions
-3-
Content-based Music Information Retrieval (MIR) via Acoustic Inputs
Melody: query by humming (usually “ta” or “da”), query by singing, query by whistling
Note onsets: query by tapping (at the onsets of notes)
Metadata: query by speech (for metadata such as title, artist, lyrics)
Audio contents: query by examples (noisy versions of original clips)
Drums: query by beatboxing
-4-
Introduction to QBSH
QBSH: Query by Singing/Humming
Input: singing or humming from a microphone
Output: a ranked list retrieved from the song database
Progression:
First paper: around 1994
Extensive studies since 2001
State of the art: QBSH tasks at ISMIR/MIREX
-5-
Two Stages in QBSH
Offline stage:
Database preparation, from MIDI files, audio music (e.g., MP3), or human vocals
Indexing (if necessary)
Online stage:
Perform pitch tracking on the user’s query
Compare the query pitch with songs in the database
Return the ranked list according to similarity
-6-
Frame Blocking for Pitch Tracking
Frame size = 256 points; overlap = 84 points; frame rate = 11025/(256-84) ≈ 64 pitch values/sec
[Figure: waveform (~2500 samples) with a zoomed-in view illustrating frame size and overlap]
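The frame-blocking step above can be sketched in a few lines of Python with the parameters on this slide (frame size 256, overlap 84, fs = 11025 Hz); the function name is illustrative, not taken from the MIRACLE code.

```python
def frame_blocking(signal, frame_size=256, overlap=84):
    """Split a signal into overlapping frames; hop = frame_size - overlap."""
    hop = frame_size - overlap          # 256 - 84 = 172 samples per step
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frames.append(signal[start:start + frame_size])
    return frames

fs = 11025
frames = frame_blocking([0.0] * fs)     # one second of audio
print(len(frames))                      # 63 frames, i.e. ~64 frames/sec
```

One second of 11025 Hz audio yields 63 full frames, matching the ~64 pitch values per second quoted above.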
-7-
ACF: Auto-correlation Function
Frame s(i); shifted frame s(i+τ), here with shift τ = 30
acf(τ) = Σ_{i=0}^{n-1-τ} s(i)·s(i+τ), so acf(30) is the inner product of the overlapping part
The lag τ at which acf(τ) peaks is the pitch period
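A minimal sketch of ACF-based pitch detection, using the formula above; the lag search range and function names are illustrative assumptions, not the lab's implementation.

```python
import math

def acf(frame, tau):
    """Autocorrelation at lag tau: inner product of the overlapping part."""
    return sum(frame[i] * frame[i + tau] for i in range(len(frame) - tau))

def pitch_freq(frame, fs=11025, min_lag=20, max_lag=200):
    """Pick the lag that maximizes the ACF; that lag is the pitch period."""
    best_lag = max(range(min_lag, max_lag + 1), key=lambda t: acf(frame, t))
    return fs / best_lag

# A pure 220 Hz tone: the detected pitch should be close to 220 Hz,
# since its period is about 11025/220 ≈ 50 samples.
fs = 11025
frame = [math.sin(2 * math.pi * 220 * i / fs) for i in range(256)]
print(pitch_freq(frame))   # close to 220 Hz
```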
-8-
Frequency to Semitone Conversion
Semitone: a musical scale based on A440
Reasonable pitch range: E2 to C6, i.e., 82 Hz to 1047 Hz (semitones 40 to 84)
semitone = 69 + 12·log2(freq/440)
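The conversion formula is a one-liner; the checks below use the endpoints of the pitch range quoted above.

```python
import math

def freq_to_semitone(freq):
    """Semitone (MIDI note number) relative to A440: 69 + 12*log2(freq/440)."""
    return 69 + 12 * math.log2(freq / 440.0)

print(round(freq_to_semitone(440.0)))    # 69 (A4)
print(round(freq_to_semitone(82.4)))     # 40 (E2)
print(round(freq_to_semitone(1046.5)))   # 84 (C6)
```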
-9-
Typical Result of Pitch Tracking
Pitch tracking via autocorrelation on a recording of 茉莉花 (Jasmine Flower)
-10-
Comparison of Pitch Vectors
Yellow line: target pitch vector
-11-
Comparison Methods of QBSH
Categories of approaches to QBSH:
Histogram/statistics-based
Note vs. note: edit distance
Frame vs. note: HMM
Frame vs. frame: linear scaling, DTW, recursive alignment
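Of the frame-vs-frame methods listed, DTW is the easiest to sketch. This is a textbook dynamic-time-warping distance, not the lab's tuned implementation:

```python
def dtw(x, y):
    """Dynamic time warping distance between two pitch vectors."""
    INF = float("inf")
    n, m = len(x), len(y)
    # D[i][j] = cost of best alignment of x[:i] and y[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

# A tempo-stretched copy (one note held longer) still matches perfectly:
print(dtw([60, 62, 64], [60, 60, 62, 64]))   # 0.0
```

Unlike linear scaling, DTW can absorb local tempo variations, at a higher computational cost.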
-12-
Linear Scaling
Scale the query pitch linearly to match the candidates: the original input pitch is stretched (e.g., by 1.25 or 1.5) and compressed (e.g., by 0.75 or 0.5), each version is compared against the target pitch in the database, and the best match wins.
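Linear scaling can be sketched as follows; the resampling method and the distance measure (mean squared error) are illustrative choices, not necessarily the ones used in MIRACLE.

```python
def stretch(pitch, factor):
    """Nearest-neighbor resampling of a pitch vector to factor * its length."""
    n = max(1, int(len(pitch) * factor))
    return [pitch[min(int(i / factor), len(pitch) - 1)] for i in range(n)]

def linear_scaling_distance(query, target, factors=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Try each stretch factor and keep the smallest mean squared distance."""
    best = float("inf")
    for f in factors:
        q = stretch(query, f)
        m = min(len(q), len(target))
        d = sum((q[i] - target[i]) ** 2 for i in range(m)) / m
        best = min(best, d)
    return best

query = [60, 60, 62, 64, 64, 62, 60, 59]
target = stretch(query, 1.25)                   # a tempo-shifted rendition
print(linear_scaling_distance(query, target))   # 0.0: factor 1.25 matches exactly
```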
-13-
Challenges in QBSH Systems
Song database preparation: MIDIs, singing clips, or audio music
Reliable pitch tracking for acoustic input: input from mobile devices or noisy karaoke bars
Efficient/effective retrieval: karaoke machine ~10,000 songs; Internet music search engine ~500,000,000 songs
-14-
Goal and Approach
Goal: to retrieve songs effectively within a given response time, say 5 seconds or so
Our strategy:
Multi-stage progressive filtering
Indexing for different comparison methods
Repeating-pattern identification
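Multi-stage progressive filtering can be illustrated schematically: a cheap comparison prunes the database, then a slower, more precise one reranks the survivors. The scoring functions below are placeholders, not the lab's actual comparison methods.

```python
def progressive_filtering(query, songs, cheap_score, precise_score, keep=100):
    """Stage 1: rank everything with a cheap method and keep the best few.
       Stage 2: rerank only the survivors with a slower, more precise method."""
    survivors = sorted(songs, key=lambda s: cheap_score(query, s))[:keep]
    return sorted(survivors, key=lambda s: precise_score(query, s))

# Toy example: "songs" are numbers and distance is just how far they are
# from the query, so the true match should survive both stages.
songs = list(range(1000))
ranked = progressive_filtering(42, songs,
                               cheap_score=lambda q, s: abs(q - s),
                               precise_score=lambda q, s: (q - s) ** 2,
                               keep=10)
print(ranked[0])   # 42
```

Only `keep` songs ever reach the expensive stage, which is what makes a 5-second response budget plausible on a large database.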
-15-
MIRACLE
MIRACLE: Music Information Retrieval Acoustically via Clustered and paralleL Engines
Database (~13,000): MIDI files; solo vocals (<100); melodies extracted from polyphonic music (<100)
Comparison methods: linear scaling, dynamic time warping
Top-10 accuracy: 70–75%
Platform: single CPU + GPU
-16-
Current MIRACLE: Single Server with GPU
NVIDIA 560 Ti, 384 cores (speedup factor = 10)
Clients (PC, PDA/smartphone, cellular) send a request (pitch vector) to the master server and receive the search result
Database size: ~13,000
-17-
QBSH for Various Platforms
PC: Web version
Embedded systems: karaoke machines
Smartphones: iPhone/iPad, Android phones
Toys
-18-
QBSH Demo
Demo page of MIR Lab: http://mirlab.org/mir_products.asp
MIRACLE demo: http://mirlab.org/demo/miracle
Existing commercial QBSH systems: www.midomi.com, www.soundhound.com
-19-
Conclusions for QBSH
QBSH:
A fun and interesting way to retrieve music
Can be extended to singing scoring
Commercial applications are maturing
Challenges:
How to deal with massive music databases?
How to extract melody from audio music?
-20-
Audio Fingerprinting (AFP)
Goal: identify a noisy version of a given audio clip (query by example, not by “cover version”)
Technical barriers: robustness; efficiency (6M tags/day for Shazam); effectiveness (15M tracks for Shazam)
Applications: song purchase; royalty assignment (over radio); confirmation of commercials (over TV); copyright violation (over the web); TV program ID
-21-
Two Stages in AFP
Offline:
Robust feature extraction (audio fingerprinting)
Hash table construction (inverted indexing)
Online:
Robust feature extraction
Hash table search
Ranked list of the retrieved songs/music
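The offline stage's inverted index can be sketched as a plain mapping from fingerprint hash key to the places it occurs; the data layout here is an illustrative assumption.

```python
from collections import defaultdict

def build_inverted_index(songs):
    """Map each fingerprint hash key to the (song_id, time) pairs where it occurs."""
    index = defaultdict(list)
    for song_id, fingerprints in songs.items():
        for t, key in fingerprints:
            index[key].append((song_id, t))
    return index

# Toy database: two songs, each a list of (time, hash_key) fingerprints.
songs = {"songA": [(0, 11), (1, 22)], "songB": [(0, 22), (2, 33)]}
index = build_inverted_index(songs)
print(index[22])   # [('songA', 1), ('songB', 0)]
```

At query time, a lookup in this index replaces a scan of the whole database, which is where the speedup comes from.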
-22-
Representative Approaches to AFP
Philips: J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system”, ISMIR 2002.
Shazam: A. Wang, “An industrial-strength audio search algorithm”, ISMIR 2003.
Google: S. Baluja and M. Covell, “Content fingerprinting using wavelets”, Euro. Conf. on Visual Media Production, 2006.
Survey: V. Chandrasekhar, M. Sharifi, and D. A. Ross, “Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications”, ISMIR 2011.
-23-
Shazam: Landmarks as Features
(Source: Dan Ellis)
Spectrogram
Local peaks of spectrogram
Pair peaks to form landmarks
Landmark: [t1, f1, t2, f2]
20-bit hash key: f1 (8 bits), Δf = f2−f1 (6 bits), Δt = t2−t1 (6 bits)
Hash value: song ID & offset time
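Packing a landmark into the 20-bit key described above is simple bit arithmetic; this is a sketch of the layout on the slide, not Shazam's actual code.

```python
def landmark_hash(t1, f1, t2, f2):
    """Pack a landmark [t1, f1, t2, f2] into a 20-bit key:
       f1 in 8 bits, Δf = f2 - f1 in 6 bits, Δt = t2 - t1 in 6 bits."""
    df = (f2 - f1) & 0x3F              # Δf, 6 bits
    dt = (t2 - t1) & 0x3F              # Δt, 6 bits
    return ((f1 & 0xFF) << 12) | (df << 6) | dt

key = landmark_hash(t1=5, f1=100, t2=20, f2=110)
print(key)   # (100 << 12) | (10 << 6) | 15 = 410255
```

With only 2^20 ≈ 1M distinct keys, collisions are inevitable in a large database; the offset-time check on the next slide is what keeps matching robust anyway.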
-24-
Shazam: Landmark as Features (II)
Peak picking after smoothing
Matched landmarks (green)
(Source: Dan Ellis)
-25-
Shazam: Time-justified Landmarks
Valid landmarks are those that agree on the offset time, which maintains robustness even under hash collisions
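The offset-time justification amounts to voting on (song, offset) pairs: for the true song, matched landmarks pile up at a single offset, while collisions scatter. A minimal sketch, assuming the inverted-index layout from the offline stage:

```python
from collections import defaultdict

def match_by_offset(query_fingerprints, index):
    """Vote on (song_id, t_song - t_query); return the winning pair and
       its vote count, or None if nothing matched."""
    votes = defaultdict(int)
    for t_q, key in query_fingerprints:
        for song_id, t_s in index.get(key, ()):
            votes[(song_id, t_s - t_q)] += 1
    return max(votes.items(), key=lambda kv: kv[1]) if votes else None

# Toy index: song "A" has keys 7, 8, 9 at times 10, 11, 12; song "B"
# happens to collide on key 8. A query with the same keys shifted by 10
# should win for "A" at offset 10 with 3 votes.
index = {7: [("A", 10)], 8: [("A", 11), ("B", 3)], 9: [("A", 12)]}
query = [(0, 7), (1, 8), (2, 9)]
print(match_by_offset(query, index))   # (('A', 10), 3)
```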
-26-
Our AFP Engine
Database: ~2,500 tracks currently; 50K tracks soon; 1M tracks in the future
Driving forces: fundamental issues in computer science (hashing, indexing, …); requests from local companies
Methods: landmarks as features (Shazam); speedup by hash tables and inverted files
Platform: currently a single CPU; in the future, multiple CPUs & GPUs
-27-
Experiments
Corpora: database of 2,550 tracks; test files: 5 mobile-recorded songs chopped into segments of 5, 10, 15, and 20 seconds
Accuracy test:
5-sec clips: 161/275 = 58.6%
10-sec clips: 121/136 = 89.0%
15-sec clips: 88/90 = 97.8%
20-sec clips: 65/66 = 98.5%
Plots: accuracy vs. duration; computing time vs. duration; accuracy vs. computing time
-28-
Demos of Audio Fingerprinting
Commercial apps Shazam Soundhound
Our demo http://mirlab.org/demo/afpFarmer2550
-29-
Conclusions for AFP
Conclusions:
Landmark-based methods are effective
Machine learning is indispensable for further improvement
Future work, scaling up:
Shazam: 15M tracks in database, 6M tags/day
Our goal: 50K tracks with a single PC and GPU; 1M tracks with cloud computing on 10 PCs