Content-Based Classification, Search & Retrieval of Audio
Erling Wold, Thom Blum, Douglas Keislar, James Wheaton
Presented By: Adelle C. Knight


Agenda

Introduction
Previous Research
Analysis Techniques
Statistical Techniques
Performance
Applications
Future Work

Previous Research

Sounds traditionally described by pitch, loudness, duration, timbre
Tones from the same instrument can be identified by timbre because of their similar spectral energy distributions
Too much variation across the range of pitches and dynamic levels to “fingerprint” an instrument from a single tone

Algorithms that extract audio structure (e.g. find first occurrence of G-sharp)
– Algorithms were tuned to specific musical constructs and not appropriate for all sounds
Neural nets to index audio databases
– Some success but it was difficult for user to specify which features were important and which to ignore

Methods To Access Sounds

Simile
Acoustical/Perceptual Features
Subjective Features
Onomatopoeia

Accomplishing These Methods

1. Analysis Techniques
• Reduce sound to small set of parameters
2. Statistical Techniques
• To accomplish classification & retrieval

Analysis Techniques

[Figure: Analysis & Retrieval Engine; labels: Exact Text Search, Fuzzy Text Search, Sound Level, Speech or Musical content]

1. Measure variety of acoustical features of each sound
   1. Loudness
   2. Pitch
   3. Brightness
   4. Bandwidth
   5. Harmonicity
2. Set of N features is represented as an N-vector.
3. Different aural properties map to different regions of N-space.

Acoustical Features: Loudness

Approximated by signal’s Root-Mean-Square (RMS) level measured in decibels
– RMS calculated by taking a series of windowed frames of the sound and computing the square root of the mean of the squares of the sample values
Human ear: 120 dB range
Software: 100 dB range from 16-bit recordings
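The loudness measure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rms_db`, the frame size, the hop size, the rectangular window, and the -100 dB floor are all choices made for this sketch and do not come from the slides.

```python
import math

def rms_db(samples, frame_size=1024, hop=512, floor_db=-100.0):
    """Return the RMS level of each frame in dB relative to full scale (1.0).

    Assumptions for this sketch: samples are floats in [-1.0, 1.0],
    frames use a rectangular window, and silence is clamped to floor_db.
    """
    levels = []
    last_start = max(len(samples) - frame_size, 0)
    for start in range(0, last_start + 1, hop):
        frame = samples[start:start + frame_size]
        if not frame:
            break
        # Square root of the mean of the squared sample values.
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        level = 20.0 * math.log10(rms) if rms > 0.0 else floor_db
        levels.append(max(level, floor_db))
    return levels
```

A full-scale sine wave comes out near -3 dB in every frame, since its RMS value is 1/√2 of its peak.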

Acoustical Features: Pitch

Estimated by taking a series of short-time Fourier spectra
Frequencies & amplitudes of peaks measured for each frame
Approximate Greatest Common Divisor algorithm used to calculate estimate of pitch
Stored as log frequency
Human ear: 20 Hz – 20 kHz
Software: 50 Hz – 10 kHz

Acoustical Features: Brightness

Measure of higher frequency content of signal
Computed as centroid of the short-time Fourier magnitude spectra
Stored as log frequency
Varies over same range as pitch
Can’t be less than pitch estimate at any given instant
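A minimal sketch of the centroid computation for one frame, assuming a naive DFT so that the example needs nothing beyond the standard library (a real system would use an FFT). The names `magnitude_spectrum` and `brightness` are hypothetical, and no window function is applied here.

```python
import cmath
import math

def magnitude_spectrum(frame, sample_rate):
    """Naive DFT of one frame: (frequencies in Hz, magnitudes) for bins 0..N/2."""
    n = len(frame)
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        freqs.append(k * sample_rate / n)
        mags.append(abs(acc))
    return freqs, mags

def brightness(frame, sample_rate):
    """Spectral centroid of the frame in Hz, or None for a silent frame."""
    freqs, mags = magnitude_spectrum(frame, sample_rate)
    total = sum(mags)
    if total == 0:
        return None
    # Magnitude-weighted mean frequency.
    return sum(f * m for f, m in zip(freqs, mags)) / total
```

For a pure tone the centroid sits at the tone's frequency; the value would then be stored as a log frequency, for example `math.log(brightness(frame, sample_rate))` (the log base is not specified in the slides).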

Acoustical Features: Bandwidth

Difference between each frequency component and the centre frequency is taken
Differences are summed
Sum is divided by number of components to get average
Examples:
– Single sine wave has bandwidth of 0
– Ideal white noise has infinite bandwidth
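The bandwidth steps can be sketched as follows. One assumption is made loudly here: the differences are weighted by spectral magnitude, which the slide does not state but which is needed for the single-sine example to come out as 0. The function name and the `(freqs, mags)` input format are hypothetical.

```python
def bandwidth(freqs, mags):
    """Magnitude-weighted average distance of spectral components from the centroid.

    freqs: bin frequencies in Hz; mags: matching non-negative magnitudes.
    The magnitude weighting is an assumption of this sketch.
    """
    total = sum(mags)
    if total == 0:
        return 0.0
    centre = sum(f * m for f, m in zip(freqs, mags)) / total
    return sum(abs(f - centre) * m for f, m in zip(freqs, mags)) / total
```

With all energy in one bin, `bandwidth([100.0, 200.0, 300.0], [0.0, 1.0, 0.0])` is 0, matching the sine-wave example; spreading energy over more bins increases the value.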

Acoustical Features: Harmonicity

Harmonic vs. Inharmonic vs. Noise
Computed by measuring deviation of sound’s line spectrum from a perfectly harmonic spectrum
Normalized range 0-1
Optional feature

Storage – Feature Vector

Trajectory in time computed but not stored
For each trajectory, computes & stores:
– Average
– Variance
– Autocorrelation
– Duration of sound
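As an illustration, the stored statistics might be computed like this. The slide does not say which autocorrelation lag is used, so the lag of one frame is an assumption, as are the function names and the layout of the resulting vector.

```python
def trajectory_stats(trajectory, lag=1):
    """Return (average, variance, normalized autocorrelation at `lag`)."""
    n = len(trajectory)
    mean = sum(trajectory) / n
    var = sum((x - mean) ** 2 for x in trajectory) / n
    if var == 0 or n <= lag:
        return mean, var, 0.0
    acov = sum((trajectory[i] - mean) * (trajectory[i + lag] - mean)
               for i in range(n - lag)) / n
    return mean, var, acov / var

def feature_vector(trajectories, duration):
    """Concatenate the statistics of each feature trajectory, then the duration."""
    vector = []
    for trajectory in trajectories:
        vector.extend(trajectory_stats(trajectory))
    vector.append(duration)
    return vector
```

For example, `feature_vector([loudness, pitch, brightness, bandwidth], duration)` would yield a 13-element vector: three statistics per trajectory plus the duration.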

Training The System

For each sound entered into the db, the N-vector, a, is computed

Mean vector µ and covariance matrix R for the a vectors in each class are calculated, where M is the number of sounds in the class:

$\mu = \frac{1}{M} \sum_{j} a[j]$

$R = \frac{1}{M} \sum_{j} (a[j]-\mu)(a[j]-\mu)^{T}$

Mean + Covariance = System’s model of perceptual property being trained by user
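The training step amounts to computing a mean vector and a covariance matrix per class. A plain-Python sketch follows; the function name `train_class` and the list-of-lists representation are assumptions for illustration.

```python
def train_class(vectors):
    """Return (mu, R): mean vector and covariance matrix of a class.

    vectors: list of M feature vectors (lists of N floats) chosen by the user.
    Uses the 1/M normalization from the formulas in the slides.
    """
    m = len(vectors)
    n = len(vectors[0])
    mu = [sum(v[i] for v in vectors) / m for i in range(n)]
    R = [[sum((v[i] - mu[i]) * (v[j] - mu[j]) for v in vectors) / m
          for j in range(n)]
         for i in range(n)]
    return mu, R
```

Note that with very few training sounds R can be singular, in which case it cannot be inverted for the distance computation without some extra handling.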

Statistical Techniques

Classifying Sounds

When a new sound needs to be classified, a distance measure is calculated between the new sound’s vector a and the trained class model (µ, R)

Using weighted L2 (Euclidean) distance:

$D = \left((a-\mu)^{T} R^{-1} (a-\mu)\right)^{1/2}$

Likelihood value L based on normal distribution and given by:

$L = \exp(-D^{2}/2)$
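A small sketch of the distance and likelihood formulas, kept dependency-free by solving R x = (a - µ) with Gaussian elimination instead of forming the inverse explicitly. The helper names are hypothetical, and the code assumes R is non-singular.

```python
import math

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def weighted_distance(a, mu, R):
    """D = sqrt((a - mu)^T R^-1 (a - mu))."""
    d = [ai - mi for ai, mi in zip(a, mu)]
    y = solve(R, d)  # y = R^-1 (a - mu)
    return math.sqrt(max(sum(di * yi for di, yi in zip(d, y)), 0.0))

def likelihood(a, mu, R):
    """L = exp(-D^2 / 2)."""
    D = weighted_distance(a, mu, R)
    return math.exp(-D * D / 2.0)
```

A sound whose vector equals the class mean gets D = 0 and L = 1; the likelihood falls toward 0 as the sound moves away from the mean, faster along dimensions where the class has small variance.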

Retrieving Sounds

Sort sounds by all acoustic features

Example:

– Retrieve top M sounds in class
– Get all sounds in hyper-rectangle centered around mean with volume V such that

$V/V_0 = M/M_0$

– Compute distance measure for all sounds
– Return closest M sounds
– Increase ratio & iterate if not enough sounds returned
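The retrieval loop can be sketched as below. Several things are assumptions of this sketch and not stated in the slides: V0 is taken to be the volume of the bounding box of the whole database and M0 the total number of sounds, the hyper-rectangle is a scaled copy of that bounding box, the ratio is doubled on each retry, and the names `retrieve`, `growth` and `max_rounds` are hypothetical.

```python
import math

def retrieve(sounds, mu, m, distance, growth=2.0, max_rounds=64):
    """Return the m sounds closest to the class mean mu.

    sounds: list of N-vectors; distance: function (vector, mu) -> float.
    Only sounds inside a hyper-rectangle around mu are ranked, so the
    distance measure is not evaluated for the whole database.
    """
    n = len(mu)
    m0 = len(sounds)
    m = min(m, m0)
    lo = [min(s[i] for s in sounds) for i in range(n)]
    hi = [max(s[i] for s in sounds) for i in range(n)]
    ratio = m / m0  # target V / V0
    for _ in range(max_rounds):
        # Scaling every side by ratio**(1/n) scales the volume by ratio.
        scale = ratio ** (1.0 / n)
        half = [(hi[i] - lo[i]) * scale / 2.0 for i in range(n)]
        inside = [s for s in sounds
                  if all(abs(s[i] - mu[i]) <= half[i] for i in range(n))]
        if len(inside) >= m:
            break
        ratio *= growth  # not enough sounds: enlarge the box and retry
    else:
        inside = list(sounds)  # fallback: rank the whole database
    return sorted(inside, key=lambda s: distance(s, mu))[:m]
```

For a quick experiment `math.dist` can be passed as `distance`; in the system described here the weighted distance D from the classification step would be used instead.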

2 Quality Measures

1. Magnitude of covariance matrix R
• Measure of the compactness of the class
• Quality measure of classification
2. Size of covariance matrix in each dimension
• Measure of particular dimension’s importance to the class
• User can see if feature is too important or not important enough

Segmentation

Apply acoustical analyses
Look for transitions
Transitions define segments of the signal to be treated like individual sounds
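One simple way to picture the segmentation step is to split a feature trajectory wherever adjacent frames differ by more than a threshold. This is only a sketch: the slides do not say how transitions are detected, so the frame-difference test, the `threshold` parameter and the function name are all assumptions.

```python
def segment(trajectory, threshold):
    """Split a per-frame feature trajectory at sudden jumps.

    Returns a list of (start, end) frame index pairs, end exclusive.
    """
    if not trajectory:
        return []
    boundaries = [0]
    for i in range(1, len(trajectory)):
        if abs(trajectory[i] - trajectory[i - 1]) > threshold:
            boundaries.append(i)  # a transition starts a new segment
    boundaries.append(len(trajectory))
    return [(boundaries[k], boundaries[k + 1])
            for k in range(len(boundaries) - 1)]
```

Applied to a loudness trajectory such as `[-60, -60, -10, -10, -10, -55]` with a 20 dB threshold, this gives three segments, each of which could then be analysed and classified like an individual sound.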

Performance & Results

Laughter classification
Touchtone classification

Example: Laughter classification

Returned:

• Laughing sounds
• Animal sounds

Example: Touchtone classification

Returned:

• 1 recording out of training set
• Low likelihood touchtone – 7 digit telephone #
• High likelihood – single digit tones

Applications

Audio databases & file systems
– Fields: file name, sample rate, sample size, file format, channels, dates, keywords, analysis feature vector, etc.
Audio database browser
– Front-end db application (e.g. SoundFisher) lets user search for sounds using queries that can be content based
– Permits general maintenance of entries – adding, deleting, describing sounds

Applications

Audio editors
– Include knowledge of audio content
– Search commands like queries, build new classes on the fly
Surveillance
– Identical to editor but identification & classification done in real time
– Detect sounds associated with criminal activity (e.g. glass breaking, screams)
Automatic segmentation of audio & video
– For large archives of raw audio & video
– Audio-to-MIDI (Studio Vision Pro 3.0)

Future Work

More analytic features
General phrase-level content based retrieval
Source separation
Sound synthesis

Conclusions
