Content-Based Classification, Search & Retrieval of Audio Erling Wold, Thom Blum, Douglas Keislar, James Wheaton Presented By: Adelle C. Knight


Content-Based Classification,

Search & Retrieval of Audio

Erling Wold, Thom Blum, Douglas Keislar, James Wheaton

Presented By: Adelle C. Knight

Agenda

• Introduction
• Previous Research
• Analysis Techniques
• Statistical Techniques
• Performance
• Applications
• Future Work

Previous Research

• Sounds are traditionally described by pitch, loudness, duration, and timbre
• Timbre: tones of the same timbre can be identified by their similar spectral energy distributions
• Too much variation across the range of pitches and dynamic levels to “fingerprint” a single instrument tone
• Algorithms that extract audio structure (e.g. find the first occurrence of G-sharp)
  – Algorithms were tuned to specific musical constructs and not appropriate for all sounds
• Neural nets to index audio databases
  – Some success, but it was difficult for the user to specify which features were important and which to ignore

Methods To Access Sounds

• Simile
• Acoustical/Perceptual Features
• Subjective Features
• Onomatopoeia

Accomplish Methods

1. Analysis Techniques
   • Reduce sound to a small set of parameters
2. Statistical Techniques
   • To accomplish classification & retrieval

Analysis Techniques

Analysis & Retrieval Engine

Exact Text Search

Sound Level

Fuzzy Text Search

Speech or Musical content

1. Measure a variety of acoustical features of each sound:
   1. Loudness
   2. Pitch
   3. Brightness
   4. Bandwidth
   5. Harmonicity
2. The set of N features is represented as an N-vector.
3. Different aural properties map to different regions of N-space.

Acoustical Features: Loudness

• Approximated by the signal’s root-mean-square (RMS) level, measured in decibels
  – RMS is calculated by taking a series of windowed frames of the sound and computing the square root of the mean of the squares of the sample values in each frame
• Human ear: 120 dB range
• Software: 100 dB range from 16-bit recordings
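
The loudness measure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the frame length, the hop size, the rectangular (unweighted) window, and the -100 dB floor are all choices made for this example, and samples are assumed to be 16-bit values so that full scale is 32768.

```python
import math

def rms_db(frame, full_scale=32768.0, floor_db=-100.0):
    """RMS level of one frame in dB relative to full scale (16-bit assumed)."""
    if not frame:
        raise ValueError("frame must contain at least one sample")
    # Root of the mean of the squared sample values
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    if rms == 0.0:
        return floor_db  # digital silence: avoid log10(0)
    return max(20.0 * math.log10(rms / full_scale), floor_db)

def loudness_trajectory(samples, frame_len=1024, hop=512):
    """Loudness of each complete frame, stepping through the signal by `hop`."""
    last_start = len(samples) - frame_len
    return [rms_db(samples[i:i + frame_len]) for i in range(0, last_start + 1, hop)]
```

A frame at half of full scale comes out near -6 dB, and the resulting list is the loudness trajectory over time.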

Acoustical Features: Pitch

• Estimated by taking a series of short-time Fourier spectra
• Frequencies & amplitudes of the peaks are measured for each frame
• An approximate greatest-common-divisor algorithm calculates an estimate of the pitch
• Stored as log frequency
• Human ear: 20 Hz to 20 kHz
• Software: 50 Hz to 10 kHz
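
The slides do not spell out the approximate greatest-common-divisor step, so the following is only one plausible sketch of the idea. It assumes the spectral peaks have already been found, tries the lowest peak divided by 1, 2, 3, ... as candidate fundamentals, and accepts the largest candidate of which every peak is nearly an integer multiple. The names `approximate_gcd` and `is_near_multiple`, the 3% tolerance, and the divisor limit are hypothetical; the 50 Hz lower bound follows the software range quoted above.

```python
import math

def is_near_multiple(freq, fundamental, tolerance):
    """True if freq is within a relative `tolerance` of an integer multiple."""
    ratio = freq / fundamental
    nearest = round(ratio)
    return nearest >= 1 and abs(ratio - nearest) <= tolerance * nearest

def approximate_gcd(peaks_hz, tolerance=0.03, max_divisor=20, f_min=50.0):
    """Largest candidate fundamental that explains every peak, or None."""
    if not peaks_hz:
        return None
    lowest = min(peaks_hz)
    for k in range(1, max_divisor + 1):
        candidate = lowest / k
        if candidate < f_min:
            break  # below the pitch range the software estimates
        if all(is_near_multiple(p, candidate, tolerance) for p in peaks_hz):
            return candidate
    return None

def log_pitch(peaks_hz):
    """Pitch stored as log frequency; None when no pitch is found."""
    f0 = approximate_gcd(peaks_hz)
    return math.log(f0) if f0 is not None else None
```

With peaks at 600, 900, and 1200 Hz the sketch reports a fundamental of 300 Hz even though no peak sits at 300 Hz, which is the behaviour a common-divisor approach is meant to give.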

Acoustical Features: Brightness

• Measure of the higher-frequency content of the signal
• Computed as the centroid of the short-time Fourier magnitude spectra
• Stored as log frequency
• Varies over the same range as pitch
• Can’t be less than the pitch estimate at any given instant
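
As a rough illustration of the centroid computation, the sketch below takes one frame's magnitude spectrum as two parallel lists (bin frequencies in Hz and magnitudes) and returns the magnitude-weighted mean frequency. The function names and the list-based input format are assumptions made for this example; computing the spectrum itself is left out.

```python
import math

def spectral_centroid(freqs_hz, magnitudes):
    """Magnitude-weighted mean frequency of one spectrum, or None if silent."""
    total = sum(magnitudes)
    if total == 0:
        return None
    return sum(f * m for f, m in zip(freqs_hz, magnitudes)) / total

def brightness(freqs_hz, magnitudes):
    """Brightness stored as log frequency, as on the slide."""
    centroid = spectral_centroid(freqs_hz, magnitudes)
    # A centroid of 0 Hz (all energy at DC) has no logarithm
    return math.log(centroid) if centroid else None
```

A spectrum with more energy in its upper bins moves the centroid, and therefore the brightness, upward.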

Acoustical Features: Bandwidth

• The difference between each frequency component and the centre frequency is taken
• The differences are summed
• The sum is divided by the number of components to get the average
• Examples:
  – A single sine wave has a bandwidth of 0
  – Ideal white noise has infinite bandwidth
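
One way to sketch this average is shown below. Two points are assumptions rather than anything stated on the slide: the centre frequency is taken to be the spectral centroid, and each difference is weighted by the magnitude of its component, so that empty bins do not count. With that weighting a single sine wave gives a bandwidth of 0, matching the first example above.

```python
def bandwidth(freqs_hz, magnitudes):
    """Magnitude-weighted mean absolute deviation from the centroid, in Hz."""
    total = sum(magnitudes)
    if total == 0:
        return None  # silent frame: bandwidth is undefined
    centroid = sum(f * m for f, m in zip(freqs_hz, magnitudes)) / total
    return sum(abs(f - centroid) * m for f, m in zip(freqs_hz, magnitudes)) / total
```

Two equal components at 100 Hz and 300 Hz have a centroid of 200 Hz and a bandwidth of 100 Hz under this definition.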

Acoustical Features: Harmonicity

• Harmonic vs. inharmonic vs. noise
• Computed by measuring the deviation of the sound’s line spectrum from a perfectly harmonic spectrum
• Normalized range 0 to 1
• Optional feature

Storage – Feature Vector

• The trajectory in time of each feature is computed but not stored
• For each trajectory, the system computes & stores:
  – Average
  – Variance
  – Autocorrelation
  – Duration of sound
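
The reduction from trajectories to a stored feature vector might look like the sketch below. It is an illustration only: the slide does not say at which lag the autocorrelation is taken, so a lag of one frame is assumed, and the dictionary input, the alphabetical ordering of features, and the function names are hypothetical.

```python
def trajectory_stats(values, lag=1):
    """Mean, variance, and normalized autocorrelation at `lag` of one trajectory."""
    n = len(values)
    if n <= lag:
        raise ValueError("trajectory is shorter than the autocorrelation lag")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0:
        return mean, variance, 0.0  # constant trajectory: correlation left at 0
    covariance = sum((values[i] - mean) * (values[i + lag] - mean)
                     for i in range(n - lag)) / n
    return mean, variance, covariance / variance

def feature_vector(trajectories, duration_s):
    """Concatenate the statistics of every trajectory, then append the duration."""
    vector = []
    for name in sorted(trajectories):  # fixed order so vectors are comparable
        vector.extend(trajectory_stats(trajectories[name]))
    vector.append(duration_s)
    return vector
```

Two trajectories therefore yield a 7-element vector: three statistics per feature plus the duration.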

Training The System

• For each sound entered into the database, the N-vector a is computed
• The mean vector µ and covariance matrix R of the a vectors in each class are calculated over the M sounds in the class:

$\mu = \frac{1}{M} \sum_{j} a[j]$

$R = \frac{1}{M} \sum_{j} (a[j]-\mu)(a[j]-\mu)^{T}$

• Mean + covariance = the system’s model of the perceptual property being trained by the user
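
The two training formulas translate directly into code. The sketch below uses plain Python lists for the vectors and the matrix; the function name `class_model` is hypothetical, and the division by M (rather than M - 1) follows the formulas as given on the slide.

```python
def class_model(vectors):
    """Return (mu, R): the mean vector and covariance matrix of a class."""
    m = len(vectors)
    if m == 0:
        raise ValueError("a class needs at least one training sound")
    n = len(vectors[0])
    mu = [sum(a[i] for a in vectors) / m for i in range(n)]
    # R[i][j] averages the product of deviations in dimensions i and j
    cov = [[sum((a[i] - mu[i]) * (a[j] - mu[j]) for a in vectors) / m
            for j in range(n)]
           for i in range(n)]
    return mu, cov
```

For the two training vectors [1, 2] and [3, 6] this gives a mean of [2, 4] and a covariance matrix of [[1, 2], [2, 4]].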

Statistical Techniques

Classifying Sounds

• When a new sound needs to be classified, a distance measure is calculated between the new sound’s vector a and the class model (µ, R)
• Using a weighted L2 (Euclidean) distance:

$D = \left((a-\mu)^{T} R^{-1} (a-\mu)\right)^{1/2}$

• A likelihood value L, based on the normal distribution, is given by:

$L = \exp(-D^{2}/2)$
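
A minimal sketch of the distance and likelihood computations follows. Rather than forming the inverse of R explicitly, it solves R y = (a - µ) by Gaussian elimination and then takes the dot product, which gives the same quadratic form. The helper `solve` and the function names are written for this example only, and the sketch assumes R is non-singular.

```python
import math

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def distance(a, mu, cov):
    """D = sqrt((a - mu)^T R^-1 (a - mu))."""
    diff = [ai - mi for ai, mi in zip(a, mu)]
    y = solve(cov, diff)  # y = R^-1 (a - mu)
    return math.sqrt(max(sum(d * yi for d, yi in zip(diff, y)), 0.0))

def likelihood(a, mu, cov):
    """L = exp(-D^2 / 2); equals 1 when the sound sits exactly at the class mean."""
    d = distance(a, mu, cov)
    return math.exp(-d * d / 2.0)
```

A dimension with a large variance in R contributes less to D, so a class that varies widely in, say, loudness is tolerant of loudness differences in the sound being classified.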

Retrieving Sounds

• Sort sounds by all acoustic features
• Example: retrieve the top M sounds in a class
  – Get all sounds in a hyper-rectangle centered around the mean with volume V such that

    $V/V_0 = M/M_0$

  – Compute the distance measure for all of these sounds
  – Return the closest M sounds
  – Increase the ratio & iterate if not enough sounds are returned
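
The retrieval loop can be sketched as below, under assumptions the slide leaves open: V0 is taken to be the volume of the box bounding the whole database and M0 the total number of sounds in it, the rectangle keeps the proportions of that bounding box, and the ratio is doubled on each retry. The function name, the `growth` parameter, and the fall-back to a full scan are hypothetical. Because the rectangle is only a pre-filter, the result is an approximation of the true closest M sounds.

```python
def retrieve_top_m(database, mu, m, distance_fn, growth=2.0):
    """Indices of (approximately) the m sounds closest to mu.

    database: list of feature vectors; distance_fn: maps a vector to its
    distance from the class model.
    """
    m0 = len(database)
    if m0 == 0 or m <= 0:
        return []
    m = min(m, m0)
    n = len(mu)
    lows = [min(v[i] for v in database) for i in range(n)]
    highs = [max(v[i] for v in database) for i in range(n)]
    ratio = m / m0  # target V / V0
    while True:
        # Scaling every side by ratio**(1/n) scales the volume by ratio
        scale = ratio ** (1.0 / n)
        half = [scale * (hi - lo) / 2.0 for lo, hi in zip(lows, highs)]
        inside = [idx for idx, v in enumerate(database)
                  if all(abs(v[i] - mu[i]) <= half[i] for i in range(n))]
        if len(inside) >= m:
            break
        if ratio >= 2.0 ** n:
            inside = list(range(m0))  # stop pruning and scan everything
            break
        ratio *= growth
    # Only the pre-filtered candidates pay for the full distance computation
    return sorted(inside, key=lambda idx: distance_fn(database[idx]))[:m]
```

The point of the rectangle is cost: membership is a handful of comparisons per sound, so the more expensive distance measure runs on a small candidate set instead of the whole database.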

2 Quality Measures

1. Magnitude of the covariance matrix R
   • Measure of the compactness of the class
   • Quality measure of the classification
2. Size of the individual elements of the covariance matrix
   • Measure of a particular dimension’s importance to the class
   • User can see if a feature is too important or not important enough

Segmentation

• Apply acoustical analyses
• Look for transitions
• Transitions define segments of the signal to be treated like individual sounds
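
The slide does not say how transitions are detected. As a simple stand-in, the sketch below marks a transition wherever a single feature trajectory (for example, loudness in dB per frame) jumps by more than a threshold between consecutive frames. The threshold rule, the single-feature input, and the function names are all assumptions of this example.

```python
def find_transitions(trajectory, threshold):
    """Frame indices where the feature jumps by more than `threshold`."""
    return [i for i in range(1, len(trajectory))
            if abs(trajectory[i] - trajectory[i - 1]) > threshold]

def segments(trajectory, threshold):
    """(start, end) frame ranges between transitions; end is exclusive."""
    bounds = [0] + find_transitions(trajectory, threshold) + [len(trajectory)]
    return [(start, end) for start, end in zip(bounds, bounds[1:]) if end > start]
```

Each resulting frame range can then be analysed and classified as if it were a separate sound.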

Performance & Results

• Laughter classification
• Touchtone classification

Example: Laughter classification

Returned:

• Laughing sounds
• Animal sounds

Example: Touchtone classification

Returned:

• 1 recording out of training set
• Low likelihood: touchtone, 7-digit telephone number
• High likelihood: single-digit tones

Applications

• Audio databases & file systems
  – Fields: file name, sample rate, sample size, file format, channels, dates, keywords, analysis feature vector, etc.
• Audio database browser
  – Front-end database application (e.g. SoundFisher) lets the user search for sounds using queries that can be content based
  – Permits general maintenance of entries: adding, deleting, describing sounds

Applications

• Audio editors
  – Include knowledge of audio content
  – Search commands like queries, build new classes on the fly
• Surveillance
  – Identical to the editor, but identification & classification are done in real time
  – Detect sounds associated with criminal activity (e.g. glass breaking, screams)
• Automatic segmentation of audio & video
  – For large archives of raw audio & video
  – Audio-to-MIDI (Studio Vision Pro 3.0)

Future Work

• More analytic features
• General phrase-level content based retrieval
• Source separation
• Sound synthesis

Conclusions