Two Paradigms for Music IR: Query by Singing/Humming and Audio Fingerprinting
J.-S. Roger Jang (張智星 )
Multimedia Information Retrieval Lab
CS Dept., Tsing Hua Univ., Taiwan
http://mirlab.org/jang
-2-
Outline
Introduction to MIR
QBSH (query by singing/humming): intro, demos, conclusions
AFP (audio fingerprinting): intro, demos, conclusions
-3-
Content-based Music Information Retrieval (MIR) via Acoustic Inputs
Melody: query by humming (usually “ta” or “da”), query by singing, query by whistling
Note onsets: query by tapping (at the onsets of notes)
Metadata: query by speech (for metadata such as title, artist, lyrics)
Audio contents: query by examples (noisy versions of original clips)
Drums: query by beatboxing
-4-
Introduction to QBSH
QBSH: Query by Singing/Humming
Input: singing or humming from a microphone
Output: a ranked list retrieved from the song database
Progression:
First paper: around 1994
Extensive studies since 2001
State of the art: QBSH tasks at ISMIR/MIREX
-5-
Two Stages in QBSH
Offline stage:
Database preparation, from MIDI files, audio music (e.g., MP3), or human vocals
Indexing (if necessary)
Online stage:
Perform pitch tracking on the user’s query
Compare the query pitch with songs in the database
Return the ranked list according to similarity
-6-
Frame Blocking for Pitch Tracking
Frame size = 256 points; overlap = 84 points; frame rate = 11025/(256-84) ≈ 64 pitch values/sec
[Figure: waveform (~2500 samples) with a zoomed-in view illustrating frame size and overlap]
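The frame-blocking step above can be sketched in a few lines of Python with the parameters on this slide (frame size 256, overlap 84, fs = 11025 Hz); the function name is illustrative, not taken from the MIRACLE code.

```python
def frame_blocking(signal, frame_size=256, overlap=84):
    """Split a signal into overlapping frames; hop = frame_size - overlap."""
    hop = frame_size - overlap          # 256 - 84 = 172 samples per step
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frames.append(signal[start:start + frame_size])
    return frames

fs = 11025
frames = frame_blocking([0.0] * fs)     # one second of audio
print(len(frames))                      # 63 frames, i.e. ~64 frames/sec
```

One second of 11025 Hz audio yields 63 full frames, matching the ~64 pitch values per second quoted above.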
-7-
ACF: Auto-correlation Function
Frame s(i); shifted frame s(i+τ), here with shift τ = 30
acf(τ) = Σ_{i=0}^{n-1-τ} s(i)·s(i+τ), so acf(30) is the inner product of the overlapping part
The lag τ at which acf(τ) peaks is the pitch period
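A minimal sketch of ACF-based pitch detection, using the formula above; the lag search range and function names are illustrative assumptions, not the lab's implementation.

```python
import math

def acf(frame, tau):
    """Autocorrelation at lag tau: inner product of the overlapping part."""
    return sum(frame[i] * frame[i + tau] for i in range(len(frame) - tau))

def pitch_freq(frame, fs=11025, min_lag=20, max_lag=200):
    """Pick the lag that maximizes the ACF; that lag is the pitch period."""
    best_lag = max(range(min_lag, max_lag + 1), key=lambda t: acf(frame, t))
    return fs / best_lag

# A pure 220 Hz tone: the detected pitch should be close to 220 Hz,
# since its period is about 11025/220 ≈ 50 samples.
fs = 11025
frame = [math.sin(2 * math.pi * 220 * i / fs) for i in range(256)]
print(pitch_freq(frame))   # close to 220 Hz
```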
-8-
Frequency to Semitone Conversion
Semitone: a musical scale based on A440
Reasonable pitch range: E2 to C6, i.e., 82 Hz to 1047 Hz (semitones 40 to 84)
semitone = 69 + 12·log2(freq/440)
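The conversion formula is a one-liner; the checks below use the endpoints of the pitch range quoted above.

```python
import math

def freq_to_semitone(freq):
    """Semitone (MIDI note number) relative to A440: 69 + 12*log2(freq/440)."""
    return 69 + 12 * math.log2(freq / 440.0)

print(round(freq_to_semitone(440.0)))    # 69 (A4)
print(round(freq_to_semitone(82.4)))     # 40 (E2)
print(round(freq_to_semitone(1046.5)))   # 84 (C6)
```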
-9-
Typical Result of Pitch Tracking
Pitch tracking via autocorrelation on a recording of 茉莉花 (Jasmine Flower)
-10-
Comparison of Pitch Vectors
Yellow line: target pitch vector
-11-
Comparison Methods of QBSH
Categories of approaches to QBSH:
Histogram/statistics-based
Note vs. note: edit distance
Frame vs. note: HMM
Frame vs. frame: linear scaling, DTW, recursive alignment
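Of the frame-vs-frame methods listed, DTW is the easiest to sketch. This is a textbook dynamic-time-warping distance, not the lab's tuned implementation:

```python
def dtw(x, y):
    """Dynamic time warping distance between two pitch vectors."""
    INF = float("inf")
    n, m = len(x), len(y)
    # D[i][j] = cost of best alignment of x[:i] and y[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

# A tempo-stretched copy (one note held longer) still matches perfectly:
print(dtw([60, 62, 64], [60, 60, 62, 64]))   # 0.0
```

Unlike linear scaling, DTW can absorb local tempo variations, at a higher computational cost.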
-12-
Linear Scaling
Scale the query pitch linearly to match the candidates: the original input pitch is stretched (e.g., by 1.25 or 1.5) and compressed (e.g., by 0.75 or 0.5), each version is compared against the target pitch in the database, and the best match wins.
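Linear scaling can be sketched as follows; the resampling method and the distance measure (mean squared error) are illustrative choices, not necessarily the ones used in MIRACLE.

```python
def stretch(pitch, factor):
    """Nearest-neighbor resampling of a pitch vector to factor * its length."""
    n = max(1, int(len(pitch) * factor))
    return [pitch[min(int(i / factor), len(pitch) - 1)] for i in range(n)]

def linear_scaling_distance(query, target, factors=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Try each stretch factor and keep the smallest mean squared distance."""
    best = float("inf")
    for f in factors:
        q = stretch(query, f)
        m = min(len(q), len(target))
        d = sum((q[i] - target[i]) ** 2 for i in range(m)) / m
        best = min(best, d)
    return best

query = [60, 60, 62, 64, 64, 62, 60, 59]
target = stretch(query, 1.25)                   # a tempo-shifted rendition
print(linear_scaling_distance(query, target))   # 0.0: factor 1.25 matches exactly
```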
-13-
Challenges in QBSH Systems
Song database preparation: MIDIs, singing clips, or audio music
Reliable pitch tracking for acoustic input: input from mobile devices or noisy karaoke bars
Efficient/effective retrieval: karaoke machine ~10,000 songs; Internet music search engine ~500,000,000 songs
-14-
Goal and Approach
Goal: to retrieve songs effectively within a given response time, say 5 seconds or so
Our strategy:
Multi-stage progressive filtering
Indexing for different comparison methods
Repeating-pattern identification
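Multi-stage progressive filtering can be illustrated schematically: a cheap comparison prunes the database, then a slower, more precise one reranks the survivors. The scoring functions below are placeholders, not the lab's actual comparison methods.

```python
def progressive_filtering(query, songs, cheap_score, precise_score, keep=100):
    """Stage 1: rank everything with a cheap method and keep the best few.
       Stage 2: rerank only the survivors with a slower, more precise method."""
    survivors = sorted(songs, key=lambda s: cheap_score(query, s))[:keep]
    return sorted(survivors, key=lambda s: precise_score(query, s))

# Toy example: "songs" are numbers and distance is just how far they are
# from the query, so the true match should survive both stages.
songs = list(range(1000))
ranked = progressive_filtering(42, songs,
                               cheap_score=lambda q, s: abs(q - s),
                               precise_score=lambda q, s: (q - s) ** 2,
                               keep=10)
print(ranked[0])   # 42
```

Only `keep` songs ever reach the expensive stage, which is what makes a 5-second response budget plausible on a large database.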
-15-
MIRACLE
MIRACLE: Music Information Retrieval Acoustically via Clustered and paralleL Engines
Database (~13,000): MIDI files; solo vocals (<100); melodies extracted from polyphonic music (<100)
Comparison methods: linear scaling, dynamic time warping
Top-10 accuracy: 70–75%
Platform: single CPU + GPU
-16-
Current MIRACLE: Single Server with GPU
NVIDIA 560 Ti, 384 cores (speedup factor = 10)
Clients (PC, PDA/smartphone, cellular) send a request (pitch vector) to the master server and receive the search result
Database size: ~13,000
-17-
QBSH for Various Platforms
PC: Web version
Embedded systems: karaoke machines
Smartphones: iPhone/iPad, Android phones
Toys
-18-
QBSH Demo
Demo page of MIR Lab: http://mirlab.org/mir_products.asp
MIRACLE demo: http://mirlab.org/demo/miracle
Existing commercial QBSH systems: www.midomi.com, www.soundhound.com
-19-
Conclusions for QBSH
QBSH:
A fun and interesting way to retrieve music
Can be extended to singing scoring
Commercial applications are maturing
Challenges:
How to deal with massive music databases?
How to extract melody from audio music?
-20-
Audio Fingerprinting (AFP)
Goal: identify a noisy version of a given audio clip (query by example, not by “cover version”)
Technical barriers: robustness; efficiency (6M tags/day for Shazam); effectiveness (15M tracks for Shazam)
Applications: song purchase; royalty assignment (over radio); confirmation of commercials (over TV); copyright violation (over the web); TV program ID
-21-
Two Stages in AFP
Offline:
Robust feature extraction (audio fingerprinting)
Hash table construction (inverted indexing)
Online:
Robust feature extraction
Hash table search
Ranked list of the retrieved songs/music
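The offline stage's inverted index can be sketched as a plain mapping from fingerprint hash key to the places it occurs; the data layout here is an illustrative assumption.

```python
from collections import defaultdict

def build_inverted_index(songs):
    """Map each fingerprint hash key to the (song_id, time) pairs where it occurs."""
    index = defaultdict(list)
    for song_id, fingerprints in songs.items():
        for t, key in fingerprints:
            index[key].append((song_id, t))
    return index

# Toy database: two songs, each a list of (time, hash_key) fingerprints.
songs = {"songA": [(0, 11), (1, 22)], "songB": [(0, 22), (2, 33)]}
index = build_inverted_index(songs)
print(index[22])   # [('songA', 1), ('songB', 0)]
```

At query time, a lookup in this index replaces a scan of the whole database, which is where the speedup comes from.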
-22-
Representative Approaches to AFP
Philips: J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system”, ISMIR 2002.
Shazam: A. Wang, “An industrial-strength audio search algorithm”, ISMIR 2003.
Google: S. Baluja and M. Covell, “Content fingerprinting using wavelets”, Euro. Conf. on Visual Media Production, 2006.
Survey: V. Chandrasekhar, M. Sharifi, and D. A. Ross, “Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications”, ISMIR 2011.
-23-
Shazam: Landmarks as Features
(Source: Dan Ellis)
Spectrogram
Local peaks of spectrogram
Pair peaks to form landmarks
Landmark: [t1, f1, t2, f2]
20-bit hash key: f1 (8 bits), Δf = f2−f1 (6 bits), Δt = t2−t1 (6 bits)
Hash value: song ID & offset time
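Packing a landmark into the 20-bit key described above is simple bit arithmetic; this is a sketch of the layout on the slide, not Shazam's actual code.

```python
def landmark_hash(t1, f1, t2, f2):
    """Pack a landmark [t1, f1, t2, f2] into a 20-bit key:
       f1 in 8 bits, Δf = f2 - f1 in 6 bits, Δt = t2 - t1 in 6 bits."""
    df = (f2 - f1) & 0x3F              # Δf, 6 bits
    dt = (t2 - t1) & 0x3F              # Δt, 6 bits
    return ((f1 & 0xFF) << 12) | (df << 6) | dt

key = landmark_hash(t1=5, f1=100, t2=20, f2=110)
print(key)   # (100 << 12) | (10 << 6) | 15 = 410255
```

With only 2^20 ≈ 1M distinct keys, collisions are inevitable in a large database; the offset-time check on the next slide is what keeps matching robust anyway.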
-24-
Shazam: Landmark as Features (II)
Peak picking after smoothing
Matched landmarks (green)
(Source: Dan Ellis)
-25-
Shazam: Time-justified Landmarks
Valid landmarks are those that agree on the offset time, which maintains robustness even under hash collisions
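The offset-time justification amounts to voting on (song, offset) pairs: for the true song, matched landmarks pile up at a single offset, while collisions scatter. A minimal sketch, assuming the inverted-index layout from the offline stage:

```python
from collections import defaultdict

def match_by_offset(query_fingerprints, index):
    """Vote on (song_id, t_song - t_query); return the winning pair and
       its vote count, or None if nothing matched."""
    votes = defaultdict(int)
    for t_q, key in query_fingerprints:
        for song_id, t_s in index.get(key, ()):
            votes[(song_id, t_s - t_q)] += 1
    return max(votes.items(), key=lambda kv: kv[1]) if votes else None

# Toy index: song "A" has keys 7, 8, 9 at times 10, 11, 12; song "B"
# happens to collide on key 8. A query with the same keys shifted by 10
# should win for "A" at offset 10 with 3 votes.
index = {7: [("A", 10)], 8: [("A", 11), ("B", 3)], 9: [("A", 12)]}
query = [(0, 7), (1, 8), (2, 9)]
print(match_by_offset(query, index))   # (('A', 10), 3)
```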
-26-
Our AFP Engine
Database: ~2,500 tracks currently; 50K tracks soon; 1M tracks in the future
Driving forces: fundamental issues in computer science (hashing, indexing, …); requests from local companies
Methods: landmarks as features (Shazam); speedup by hash tables and inverted files
Platform: currently a single CPU; in the future, multiple CPUs & GPUs
-27-
Experiments
Corpora: database of 2,550 tracks; test files: 5 mobile-recorded songs chopped into segments of 5, 10, 15, and 20 seconds
Accuracy test:
5-sec clips: 161/275 = 58.6%
10-sec clips: 121/136 = 89.0%
15-sec clips: 88/90 = 97.8%
20-sec clips: 65/66 = 98.5%
Plots: accuracy vs. duration; computing time vs. duration; accuracy vs. computing time
-28-
Demos of Audio Fingerprinting
Commercial apps Shazam Soundhound
Our demo http://mirlab.org/demo/afpFarmer2550
-29-
Conclusions for AFP
Conclusions:
Landmark-based methods are effective
Machine learning is indispensable for further improvement
Future work, scaling up:
Shazam: 15M tracks in database, 6M tags/day
Our goal: 50K tracks with a single PC and GPU; 1M tracks with cloud computing on 10 PCs