THE SHAZAM MUSIC RECOGNITION SERVICE

Guided by a user’s query-by-example music sample, it delivers the matching song, as well as related music information of immediate interest to the user.

By Avery Wang


People are routinely exposed to music in everyday environments (car, home, restaurant, movie theater, shopping mall) but are frustrated by not being able to learn more about what they hear. They may, for instance, be interested in a particular piece of music and want to know its title and the name of the artist who created it. They may want to buy a digital download of the song or a ringtone. To address these limitations, my colleagues and I at Shazam Entertainment have developed a query-by-example (QBE) music search service that enables users to learn the identity of audible prerecorded music by sampling a few seconds of audio using a mobile phone as a recording device.


Others have also sought to deliver music identification. For example, in 1999 an early pioneer, StarCD, introduced a service enabling users with mobile phones to identify songs playing on certain radio stations [1]. Identification was accomplished by users calling into the StarCD service while an unknown song was playing. They would then enter the radio station call letters onto the phone’s keypad. StarCD would automatically consult a third-party playlist database provider to determine what song was playing on the specified radio station at the time of the call. Learning the identity of the music, users would be given the opportunity to buy the CD. The StarCD system was thus limited to songs playing on radio stations being monitored by the third-party playlist provider.

QBE is a more flexible music-recognition modality than the one provided by StarCD, recognizing music from a sample recording rather than requiring the user to key in radio station information. QBE music recognition research has been pursued by a number of groups worldwide. For example, in the early 1980s Broadcast Data Systems (www.bdsonline.com/) was an early pioneer, using a variety of methods to correlate waveforms [4]. In 1996 Musclefish (www.musclefish.com/) developed a method based on multidimensional feature analysis and Euclidean distance metrics [6]. However, these methods were all best suited for fairly clean audio samples and do not work well in the presence of significant noise and distortion.

When we founded Shazam Entertainment in London in 2000, we aimed to develop a commercial QBE music recognition service that could be offered through mobile phones [5]. A user would capture audio samples by calling the service through a short dial code; after sampling, the server would hang up and return the identification results via an SMS text message. In 2004, we introduced a newer version called Song Identity, providing a more interactive interface, available first through Verizon Wireless in the U.S. Since then, we have ported the application to a variety of handset platforms.

Figure. Shazam interface: (a) recording a 10-second audio sample; (b) displaying the query results.

Before using Song Identity, users first download and install its applet into their handsets. Upon hearing music they wish to identify, they launch the applet, which records about 10 seconds of ambient audio and either sends an audio file to Shazam or performs feature extraction on the phone to generate a small signature file. This signature file is then sent to a central server that performs a search and provides the matching metadata (title, artist, album) back to the applet (see the figure here). Users are then presented with the title, artist, and album information, as well as the option of downloading the corresponding ringtone, if a matching one is available. Future versions on some platforms may also allow the purchase of full-track downloads, CDs, concert tickets, lyrics, and other follow-on data or activities associated with the music.

Commercial QBE music recognition systems, including Song Identity, must overcome a number of daunting technical challenges:

Noise. Noise competes with the target music in some environments; for example, music is generally in the background in noisy cafés and shopping malls. Being in a car adds the complication of traffic or engine noise. Noise power may indeed be significantly greater than signal power, so a recognition algorithm must be robust enough to deal with significant noise.

Distortion. The system must be able to deal with distortion arising from a variety of sources, including imperfect playback or sampling equipment, as well as environmental factors (such as reverberation and absorption). Sampling through telephony equipment reduces the frequency response to about 300Hz–3,400Hz. Distortion may also arise from the audio sample being subject to low bit-rate voice compression in the mobile phone. Nonlinear noise-suppression and voice-quality-enhancement algorithms built into the handset or into the mobile carrier’s network may pose a further challenge: sampling background music through them can result in recordings that contain mostly silence (a sketch simulating such degraded queries follows this list).

Database management. The system must be able to index “fingerprints” of millions of songs in its online database without requiring an inordinate number of servers. Thus the fingerprint, or unique feature representation, of each song must be reasonably small, on the order of only a few kilobytes. Moreover, scaling to millions of songs must not incur a significant processing load on the backend search engine, as the system may need to dispatch hundreds or thousands of queries per second. The system must also scale statistically, meaning the addition of many millions of songs into the database must not significantly decrease the probability of finding a correct match or significantly increase the probability of reporting a false positive.
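To make the noise and distortion constraints concrete, here is a small test-harness sketch that degrades a clean clip the way a phone query might be degraded: noise is mixed in at greater power than the music, and the result is crudely limited to the roughly 300Hz–3,400Hz telephony band with an FFT mask. This is an assumption-laden illustration for experimentation, not a model of any real handset codec or of Shazam's test setup; the synthetic signals at the bottom are placeholders for real recordings.

import numpy as np

def simulate_phone_query(music, noise, sample_rate, snr_db=-3.0,
                         band=(300.0, 3400.0)):
    """Mix noise into a clean clip at the requested SNR (negative means
    more noise power than music), then crudely band-limit the result to
    the telephony passband by zeroing FFT bins outside it."""
    n = min(len(music), len(noise))
    music, noise = music[:n].astype(float), noise[:n].astype(float)

    # Scale the noise so that 10*log10(P_music / P_noise) equals snr_db.
    p_music = np.mean(music ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    noise = noise * np.sqrt(p_music / (p_noise * 10 ** (snr_db / 10.0)))
    mix = music + noise

    # Remove all spectral content outside roughly 300-3,400 Hz.
    spectrum = np.fft.rfft(mix)
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    spectrum[(freqs < band[0]) | (freqs > band[1])] = 0.0
    return np.fft.irfft(spectrum, n)

if __name__ == "__main__":
    sr = 8000
    t = np.arange(sr * 5) / sr
    music = np.sin(2 * np.pi * 440 * t)                 # placeholder "music"
    noise = np.random.default_rng(0).normal(size=len(t))
    query = simulate_phone_query(music, noise, sr, snr_db=-3.0)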

We were dismayed by the early audio samples we collected through mobile phones; the music was often so distorted we could barely recognize the presence of music with our own ears. We were often hard-pressed to match an audio sample over the phone against its known master recording, let alone scale it to millions of recordings. We were even at risk of having to abandon our efforts. Fortunately, in 2000, after three months of work, we arrived at a solution involving temporally aligned combinatorial hashing, generally overcoming these challenges.

Another challenge we managed to address is more logistical than technological: how to cost-effectively compile a database of millions of songs. Shazam has been purchasing music assets, as well as extracting fingerprints from content partners with large catalogs of music. Confronting these constraints actually helped simplify development of our music recognition algorithm. We were forced to disregard approaches that could not scale to large numbers of recordings, especially in the presence of more noise than signal. This thinking produced a number of insights [5]; first among them, we would have to find robust features that could be reproduced in the presence of significant noise and distortion. We considered a number of candidate features (such as power envelopes and mel-frequency cepstral coefficients), but most weren’t robust enough for our needs.

We needed features that could be linearly superposed (transparently) and recovered in the presence of noise. We thus turned to spectrogram peaks [1], which provide a map of energy distribution in terms of time and frequency.¹ The locations of the peaks, though the result of nonlinear processing, are substantially linearly superposable; that is, a spectrogram peak analysis of a mixture of music and noise contains the spectral peaks due to the music as well as those due to the noise, much as if each had been analyzed separately. The presence of corresponding spectrogram peaks in a noisy versus noiseless music signal makes it possible to determine with high probability whether an audio sample matches a recording in the database.

¹An overlapping Short-Time Fourier Transform is calculated at regular intervals on the audio data, and a power level is calculated for each resulting time-frequency bin. A bin is a peak if its power level is greater than that of all the other bins in a bounded region around the bin.
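As a rough illustration of the procedure in the footnote, the sketch below computes an overlapping STFT with numpy and keeps a time-frequency bin only if its power exceeds every other bin in a bounded neighborhood around it. The window length, hop size, and neighborhood dimensions are arbitrary choices for illustration, not Shazam's parameters.

import numpy as np

def spectrogram_peaks(samples, n_fft=1024, hop=256, nbhd_t=15, nbhd_f=15):
    """Return a list of (frame_index, bin_index) spectrogram peaks.
    A bin is kept only if its power is strictly greater than every other
    bin in a (2*nbhd_t+1) x (2*nbhd_f+1) region centered on it."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(samples) - n_fft + 1, hop):
        segment = samples[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(segment)) ** 2)   # power spectrum
    power = np.array(frames)                               # shape: (time, freq)

    peaks = []
    n_t, n_f = power.shape
    for t in range(n_t):
        for f in range(n_f):
            t0, t1 = max(0, t - nbhd_t), min(n_t, t + nbhd_t + 1)
            f0, f1 = max(0, f - nbhd_f), min(n_f, f + nbhd_f + 1)
            region = power[t0:t1, f0:f1]
            # Only the bin itself may be >= its own power in the region.
            if power[t, f] > 0 and np.count_nonzero(region >= power[t, f]) == 1:
                peaks.append((t, f))
    return peaks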

Though robust, sets of individual spectrogram peaks used as fingerprint features provide insufficient entropy (the number of unique features is too small) to allow for an efficient search, especially in a very large database. In order to increase the entropy while maintaining transparency, we hit upon a scheme we call “combinatorial hashing,” in which we construct fingerprint hash tokens using pairs of spectrogram peaks chosen from the set of all spectrogram peaks present in the signal being analyzed. In the fingerprint formation process, we use a subset of spectrogram peaks as “anchor points,” each with a target zone defined by a range of time and frequency values offset from the anchor point’s coordinates. Each anchor point is paired with a number of target points in the target zone. The frequency information and relative time offset from each pair of points are used to create a 32-bit fingerprint hash token.
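The pairing step might look like the following sketch, which reuses the peak list from the previous sketch and, for simplicity, treats every peak as an anchor (the article notes that only a subset need be used). It packs the two frequency bins plus the time offset into a 32-bit token; the bit layout (10 bits per frequency bin, 12 bits for the offset) and the target-zone bounds are illustrative guesses, not Shazam's actual format.

def fingerprint_tokens(peaks, fan_out=5, dt_min=1, dt_max=64, df_max=127):
    """Pair each anchor peak with up to `fan_out` later peaks inside a
    target zone and pack (f_anchor, f_target, dt) into a 32-bit token.
    Returns (token, t_anchor) pairs; the anchor time is kept alongside
    the token for the later alignment stage."""
    peaks = sorted(peaks)                      # sort by (time, frequency)
    tokens = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > dt_max:
                break                          # peaks are time-sorted
            if dt < dt_min or abs(f2 - f1) > df_max:
                continue                       # outside the target zone
            token = ((f1 & 0x3FF) << 22) | ((f2 & 0x3FF) << 12) | (dt & 0xFFF)
            tokens.append((token, t1))
            paired += 1
            if paired == fan_out:
                break
    return tokens

In a matching system, each database track would be fingerprinted the same way and its tokens stored in an index keyed by token value, so that query tokens can be looked up directly.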

This combinatorial expansion results in perhaps a ten-fold increase in the number of tokens searched in the database over the original number of spectrogram peaks. However, the increased entropy in the hash tokens helps accelerate the index search by a factor of more than a million, resulting in a significant speed improvement when identifying a particular song. This speedup is due to the fact that more descriptive bits of information allow for cutting more efficiently through ambiguous clutter.
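A back-of-the-envelope calculation suggests why the extra bits pay for themselves. The per-track token count below is an assumed round number, not a published figure; only the catalog size and the 32-bit token width come from the article.

# Illustrative arithmetic only; tokens_per_track is an assumed round number.
tracks = 3_000_000           # "more than three million tracks" (June 2006)
tokens_per_track = 3_000     # assumption: average hash tokens per track
hash_space = 2 ** 32         # 32-bit fingerprint hash tokens

total_tokens = tracks * tokens_per_track
print(total_tokens / hash_space)  # ~2 tokens share each 32-bit hash value
print(total_tokens / 2 ** 10)     # ~9 million would share each value if a
                                  # single 10-bit peak frequency were the key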

Applying this combinatorial hashing to a mixture of music and noise generates three classes of fingerprint tokens:

• Both spectrogram peaks belong to the target signal;
• One peak belongs to the target signal and one to a noise signal; and
• Both peaks belong to noise.

Only the tokens having peaks from the target signal are important to the search process. In the kind of low signal-to-noise situation frequently encountered in the Shazam application, most of the tokens generated from the audio sample are garbage. But the presence of even a small percentage of good matching tokens is sufficient to flag a statistically significant probability of finding the correct song in a large database of songs.

We also found that the fingerprint features must be aligned temporally; that is, if a set of features appears in both the original recording in the database and in a sample query, the relative positions of each feature within each recording must be the same. Unlike speech recognition, in which dynamic time warping may be used to match up loosely corresponding features with a time-varying and nondeterministic rate of progression, the Shazam technique assumes the correspondence is directly linear; that is, if you plot the relative times of occurrence of each token in a time-versus-time scatterplot, a valid match should have points that accumulate on a straight diagonal line.

Such a line is quickly detected by searching for a peak in a histogram of relative time differences. This assumption concerning temporal alignment greatly accelerates the matching process and strengthens the acceptance/rejection criteria determining whether a given fingerprint feature is valid, thus providing a quick and effective way to filter out the large number of garbage tokens generated in the combinatorial hashing stage.
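The offset-histogram test can be sketched in a few lines: for every query token that also occurs in a candidate track, record the difference between its database time and its query time; a true match piles these differences into a single bin, while garbage tokens scatter across many bins. The index layout here (a dict from token to a list of occurrence times within one track) is a toy representation assumed for illustration.

from collections import defaultdict

def score_candidate(query_tokens, track_index, bin_width=1):
    """query_tokens: list of (token, t_query) pairs from the sample.
    track_index: dict mapping token -> list of t_db occurrences in one track.
    Returns the height of the tallest peak in the histogram of
    (t_db - t_query) offsets; a large peak indicates temporally
    aligned matches rather than chance hits."""
    histogram = defaultdict(int)
    for token, t_query in query_tokens:
        for t_db in track_index.get(token, ()):
            offset = (t_db - t_query) // bin_width
            histogram[offset] += 1
    return max(histogram.values(), default=0)

A track whose best offset bin collects many matching tokens is reported as the identification; tokens arising from noise spread across many bins and contribute little to any single peak.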

We implemented the recognition algorithm in C++ with certain speed-critical sections optimized in assembly. The recognition server is implemented as a cluster of a few dozen commodity off-the-shelf 64-bit x86-based servers, each with 8GB of RAM and running an optimized Linux kernel.

As of June 2006, the Shazam database contained more than three million tracks. Each incoming request is received by a master process that then broadcasts the query to a farm of slave processors, each holding a piece of the database index in memory. Each slave independently searches its chunk of the universe of fingerprint tokens and reports its identification results to the master. The master collects the results and returns a report (concerning recognition) to the remote client. For a discussion on performance characteristics of the recognition algorithm, see [5].
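The fan-out can be pictured with the toy, in-process sketch below, in which each "slave" is simply a shard object holding the token index for part of the catalog, the master loops over the shards, and the best-scoring track above an arbitrary threshold is reported. It reuses score_candidate from the earlier sketch; a real deployment would run the shards on separate servers and query them in parallel, and the class and function names here are hypothetical.

class IndexShard:
    """Holds the inverted index for a subset of the catalog:
    track_id -> {token -> [t_db, ...]} (toy in-memory layout)."""
    def __init__(self, tracks):
        self.tracks = tracks

    def search(self, query_tokens):
        # Score every track in this shard and return the local best hit.
        best_score, best_track = 0, None
        for track_id, token_index in self.tracks.items():
            score = score_candidate(query_tokens, token_index)
            if score > best_score:
                best_score, best_track = score, track_id
        return best_score, best_track

def master_search(query_tokens, shards, min_score=10):
    """Broadcast the query to all shards and keep the overall best hit.
    min_score is an arbitrary acceptance threshold for this sketch."""
    best_score, best_track = 0, None
    for shard in shards:
        score, track_id = shard.search(query_tokens)
        if score > best_score:
            best_score, best_track = score, track_id
    return best_track if best_score >= min_score else None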

The Shazam music recognition service (www.shazam.com) has been publicly available on mobile phones in the U.K. since 2002. Since then, it has expanded into more than 20 countries, hosted through a variety of local partners under various service brands, including Verizon Wireless and Cingular in the U.S. As of June 2006, nearly six million paying customers worldwide had used the service.

Meanwhile, Philips Electronics demonstrated its own “robust hashing” audio fingerprinting algorithm in 2001. Like Shazam, it is capable of free-field audio identification (QBE), though it forms hashes from differential energy flux in a time-frequency grid [2]. This technology was acquired in 2005 by Gracenote, a music database company in Emeryville, CA. Also in 2001, the Fraunhofer Institut in Erlangen, Germany (www.iis.fraunhofer.de/), demonstrated a technique based on “spectral flatness” [3].

QBE music recognition is a commercial reality. In the next few years, as more carriers and phone manufacturers offer related services, QBE will likely become part of the standard mobile phone feature set, like camera phones are today. The cost per query should drop, and more integration into follow-on sales and discovery services should also be possible. Other search modalities (such as query-by-humming or query-by-similarity) may also be added.

References
1. Bond, P. StarCD: A star is born nationally seeking stellar CD sales. Hollywood Reporter CCCLX, 13 (Nov. 1, 1999), 3.
2. Haitsma, J., Kalker, T., and Oostveen, J. Robust audio hashing for content identification. In Proceedings of the International Workshop on Content-Based Multimedia Indexing (Brescia, Italy, Sept. 19–21, 2001).
3. Herre, J., Allamanche, E., and Helmuth, O. Robust matching of audio signals using spectral flatness features. In Proceedings of the 2001 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (Mohonk, NY, 2001), 127–130.
4. Kenyon, S., Simkins, L., Brown, L., and Sebastian, R. U.S. Patent 4,450,531: Broadcast Signal Recognition System and Method. U.S. Patent and Trademark Office, Washington, D.C.; www.uspto.gov.
5. Wang, A. An industrial-strength audio search algorithm. In Proceedings of the Fourth International Conference on Music Information Retrieval (Baltimore, Oct. 26–30, 2003); www.ismir.net.
6. Wold, E., Blum, T., Keislar, D., and Wheaton, J. Content-based classification, search, and retrieval of audio. IEEE Multimedia 3, 3 (Fall 1996), 27–36.

Avery Wang ([email protected]) is the chief scientist of Shazam Entertainment, Ltd., London, U.K.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

© 2006 ACM 0001-0782/06/0800 $5.00
