2015/10/25

Two Paradigms for Music IR:Query by Singing/Humming and

Audio Fingerprinting

J.-S. Roger Jang (張智星 )

Multimedia Information Retrieval Lab

CS Dept., Tsing Hua Univ., Taiwan

http://mirlab.org/jang


-2-

Outline

Introduction to MIR

QBSH (query by singing/humming): intro, demos, conclusions

AFP (audio fingerprinting): intro, demos, conclusions


-3-

Content-based Music Information Retrieval (MIR) via Acoustic Inputs

Melody: query by singing, humming (usually "ta" or "da"), or whistling

Note onsets: query by tapping (at the onsets of notes)

Metadata: query by speech (for metadata such as title, artist, lyrics)

Audio contents: query by example (noisy versions of original clips)

Drums: query by beatboxing


-4-

Introduction to QBSH

QBSH: Query by Singing/Humming
Input: singing or humming from a microphone
Output: a ranked list retrieved from the song database

Progression
First paper: around 1994
Extensive studies since 2001
State of the art: QBSH tasks at ISMIR/MIREX


-5-

Two Stages in QBSH

Offline stage: database preparation
From MIDI files
From audio music (e.g., MP3)
From human vocals
Indexing (if necessary)

Online stage
Perform pitch tracking on the user's query
Compare the query pitch with songs in the database
Return the ranked list according to similarity


-6-

Frame Blocking for Pitch Tracking

Frame size = 256 points
Overlap = 84 points
Frame rate = 11025/(256-84) ≈ 64 pitch values/sec

[Figure: waveform split into frames, with the frame, the overlap between adjacent frames, and a zoomed-in view highlighted]
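The frame blocking above can be sketched as follows. This is a minimal illustration; the function name and the use of NumPy are assumptions, not the system's actual code.

```python
import numpy as np

def frame_blocking(signal, frame_size=256, overlap=84):
    """Split a signal into overlapping frames.

    With fs = 11025 Hz, frame_size = 256, and overlap = 84, the hop is
    256 - 84 = 172 samples, giving 11025/172 ~ 64 pitch values per second.
    """
    hop = frame_size - overlap
    n_frames = 1 + (len(signal) - frame_size) // hop
    return np.stack([signal[i * hop : i * hop + frame_size]
                     for i in range(n_frames)])

fs = 11025
signal = np.random.randn(fs)      # one second of dummy audio
frames = frame_blocking(signal)
print(frames.shape)               # about 64 frames for one second of audio
```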


-7-

ACF: Auto-correlation Function

Frame: s(i); shifted frame: s(i+τ)

acf(τ) = Σ_{i=0}^{n-1-τ} s(i)·s(i+τ)

For example, acf(30) is the inner product of the overlapping parts of the frame and the frame shifted by τ = 30. The lag τ that maximizes acf(τ) is taken as the pitch period.
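The ACF pitch estimator above can be sketched as follows. This is an illustrative implementation under the stated definition; the function names and lag search range are my own assumptions.

```python
import numpy as np

def acf(frame, tau):
    """Auto-correlation at lag tau: inner product of the frame and its
    shifted copy over their overlapping part."""
    n = len(frame)
    return float(np.dot(frame[: n - tau], frame[tau:]))

def pitch_by_acf(frame, fs, tau_min=32, tau_max=200):
    """Pick the lag with the largest ACF as the pitch period."""
    best_tau = max(range(tau_min, tau_max), key=lambda t: acf(frame, t))
    return fs / best_tau    # pitch in Hz

fs = 11025
t = np.arange(256) / fs
frame = np.sin(2 * np.pi * 150 * t)   # a 150 Hz sinusoid as a test frame
print(pitch_by_acf(frame, fs))        # close to 150 Hz
```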


-8-

Frequency to Semitone Conversion

Semitone: a musical scale based on A440

semitone = 69 + 12·log2(freq/440)

Reasonable pitch range: E2 to C6 (82 Hz to 1047 Hz)
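The conversion formula above, as a one-liner (the function name is my own; A4 = 440 Hz maps to semitone 69 as on the MIDI scale):

```python
import math

def freq_to_semitone(freq):
    """Convert frequency (Hz) to semitones, with A440 mapped to 69."""
    return 69 + 12 * math.log2(freq / 440.0)

print(round(freq_to_semitone(440)))   # A4 -> 69
print(round(freq_to_semitone(880)))   # one octave up -> 81
```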


-9-

Typical Result of Pitch Tracking

Pitch tracking via autocorrelation for 茉莉花 (Jasmine Flower)


-10-

Comparison of Pitch Vectors

Yellow line: target pitch vector


-11-

Comparison Methods of QBSH

Categories of approaches to QBSH
Histogram/statistics-based
Note vs. note: edit distance
Frame vs. note: HMM
Frame vs. frame: linear scaling, DTW, recursive alignment
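As one frame-vs.-frame example, DTW aligns two pitch vectors while tolerating local tempo variation. A textbook sketch (not the system's actual code; semitone values are illustrative):

```python
import numpy as np

def dtw(query, target):
    """Classic dynamic time warping distance between two pitch vectors."""
    n, m = len(query), len(target)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - target[j - 1])
            # extend the cheapest of the three allowed alignment moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A held note in the target costs nothing: warping absorbs the repeat.
print(dtw([60, 62, 64], [60, 60, 62, 64]))   # 0.0
```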


-12-

Linear Scaling

Scale the query pitch linearly to match the candidates

[Figure: the original input pitch stretched by 1.25 and 1.5 and compressed by 0.75 and 0.5, each compared against the target pitch in the database to find the best match]
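The linear-scaling comparison above can be sketched as follows. The function name, the set of scaling factors, and resampling by linear interpolation are my own assumptions for illustration.

```python
import numpy as np

def linear_scaling_distance(query, target, factors=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Stretch/compress the query pitch vector by several factors and keep
    the best match against the beginning of the target pitch vector."""
    best = np.inf
    for f in factors:
        n = int(len(query) * f)
        if n == 0 or n > len(target):
            continue
        # resample the query to length n by linear interpolation
        scaled = np.interp(np.linspace(0, len(query) - 1, n),
                           np.arange(len(query)), query)
        best = min(best, np.mean(np.abs(scaled - target[:n])))
    return best

target = np.linspace(60, 72, 150)   # target melody in the database
query = np.linspace(60, 72, 100)    # same melody sung 1.5x faster
print(linear_scaling_distance(query, target))   # ~0, best factor is 1.5
```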


-13-

Challenges in QBSH Systems

Song database preparation: MIDIs, singing clips, or audio music

Reliable pitch tracking for acoustic input: input from mobile devices or noisy karaoke bars

Efficient/effective retrieval
Karaoke machine: ~10,000 songs
Internet music search engine: ~500,000,000 songs


-14-

Goal and Approach

Goal: retrieve songs effectively within a given response time, say, 5 seconds

Our strategy
Multi-stage progressive filtering
Indexing for different comparison methods
Repeating-pattern identification


-15-

MIRACLE

MIRACLE: Music Information Retrieval Acoustically via Clustered and paralleL Engines

Database (~13,000)
MIDI files
Solo vocals (<100)
Melodies extracted from polyphonic music (<100)

Comparison methods
Linear scaling
Dynamic time warping

Top-10 accuracy: 70-75%

Platform: single CPU + GPU


-16-

Current MIRACLE: Single Server with GPU

NVIDIA 560 Ti, 384 cores (speedup factor = 10)

Clients: PC, PDA/smartphone, cellular

Master server
Request: pitch vector
Response: search result

Database size: ~13,000


-17-

QBSH for Various Platforms

PC: web version

Embedded systems: karaoke machines

Smartphones: iPhone/iPad, Android phones

Toys


-18-

QBSH Demo

Demo page of MIR Lab: http://mirlab.org/mir_products.asp

MIRACLE demo: http://mirlab.org/demo/miracle

Existing commercial QBSH systems www.midomi.com www.soundhound.com


-19-

Conclusions for QBSH

QBSH
A fun and interesting way to retrieve music
Can be extended to singing scoring
Commercial applications are maturing

Challenges
How to deal with massive music databases?
How to extract melody from audio music?


-20-

Audio Fingerprinting (AFP)

Goal
Identify a noisy version of a given audio clip (query by example, not by "cover versions")

Technical barriers
Robustness
Efficiency (6M tags/day for Shazam)
Effectiveness (15M tracks for Shazam)

Applications
Song purchase
Royalty assignment (over radio)
Confirmation of commercials (over TV)
Copyright violation (over web)
TV program ID


-21-

Two Stages in AFP

Offline
Robust feature extraction (audio fingerprinting)
Hash table construction
Inverted indexing

Online
Robust feature extraction
Hash table search
Ranked list of the retrieved songs/music


-22-

Representative Approaches to AFP

Philips: J. Haitsma and T. Kalker, "A highly robust audio fingerprinting system", ISMIR 2002.

Shazam: A. Wang, "An industrial-strength audio search algorithm", ISMIR 2003.

Google: S. Baluja and M. Covell, "Content fingerprinting using wavelets", European Conference on Visual Media Production, 2006.

V. Chandrasekhar, M. Sharifi, and D. A. Ross, "Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications", ISMIR 2011.


-23-

Shazam: Landmarks as Features

(Source: Dan Ellis)

Spectrogram

Local peaks of spectrogram

Pair peaks to form landmarks

Landmark: [t1, f1, t2, f2]
20-bit hash key: f1 (8 bits), Δf = f2-f1 (6 bits), Δt = t2-t1 (6 bits)
Hash value: song ID & offset time
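Packing the 20-bit hash key described above can be sketched as bit shifts. This is an illustration of the key layout only; the function name is mine, and for simplicity it assumes Δf and Δt are non-negative and within range.

```python
def landmark_hash(f1, f2, t1, t2):
    """Pack a landmark into a 20-bit key:
    f1 (8 bits) | df = f2 - f1 (6 bits) | dt = t2 - t1 (6 bits)."""
    df, dt = f2 - f1, t2 - t1
    assert 0 <= f1 < 256 and 0 <= df < 64 and 0 <= dt < 64
    return (f1 << 12) | (df << 6) | dt

key = landmark_hash(f1=100, f2=130, t1=5, t2=20)
print(key)   # (100 << 12) | (30 << 6) | 15 = 411535
```

The key indexes a hash table whose values are (song ID, offset time) pairs, so one lookup retrieves every database landmark with the same spectral shape.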


-24-

Shazam: Landmark as Features (II)

Peak picking after smoothing

Matched landmarks (green)

(Source: Dan Ellis)


-25-

Shazam: Time-justified Landmarks

Valid landmarks are selected based on offset time, which maintains robustness even under hash collisions.
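The offset-time justification can be sketched as a voting scheme: matched landmarks from the correct song share a consistent offset (database time minus query time), so the winner is the (song, offset) pair with the most votes. A toy illustration with hypothetical hash keys and times, not Shazam's actual implementation:

```python
from collections import Counter

def match_by_offset(query_landmarks, hash_table):
    """Vote for (song_id, offset) pairs; the true match appears as a large
    group of landmarks sharing one offset (db_time - query_time).

    hash_table: dict mapping hash key -> list of (song_id, db_time)
    query_landmarks: list of (hash_key, query_time)
    """
    votes = Counter()
    for key, q_time in query_landmarks:
        for song_id, db_time in hash_table.get(key, []):
            votes[(song_id, db_time - q_time)] += 1
    return votes.most_common(1)[0] if votes else None

# Toy database: three keys hit song 7 at a consistent offset of 10;
# one colliding entry for song 9 gets only a single vote.
hash_table = {1: [(7, 10)], 2: [(7, 12), (9, 3)], 3: [(7, 14)]}
query = [(1, 0), (2, 2), (3, 4)]
print(match_by_offset(query, hash_table))   # ((7, 10), 3)
```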


-26-

Our AFP Engine

Database (~2,500)
2,500 tracks currently
50K tracks soon
1M tracks in the future

Driving forces
Fundamental issues in computer science (hashing, indexing, ...)
Requests from local companies

Methods
Landmarks as features (Shazam)
Speedup by hash tables and inverted files

Platform
Currently: single CPU
In the future: multiple CPUs & GPUs


-27-

Experiments

Corpora
Database: 2,550 tracks
Test files: 5 mobile-recorded songs chopped into segments of 5, 10, 15, and 20 seconds

Accuracy test
5-sec clips: 161/275 = 58.5%
10-sec clips: 121/136 = 89.0%
15-sec clips: 88/90 = 97.8%
20-sec clips: 65/66 = 98.5%

Plots: accuracy vs. duration, computing time vs. duration, accuracy vs. computing time


-28-

Demos of Audio Fingerprinting

Commercial apps Shazam Soundhound

Our demo http://mirlab.org/demo/afpFarmer2550


-29-

Conclusions for AFP

Conclusions
Landmark-based methods are effective.
Machine learning is indispensable for further improvement.

Future work: scale up
Shazam: 15M tracks in database, 6M tags/day
Our goal:
50K tracks with a single PC and GPU
1M tracks with cloud computing on 10 PCs