Engineering Perceptually Valid Spatial Audio Ramani Duraiswami Perceptual Interfaces and Reality Laboratory Institute for Advanced Computer Studies University

Engineering Perceptually Valid Spatial Audio

Ramani Duraiswami

Perceptual Interfaces and Reality Laboratory

Institute for Advanced Computer Studies

University of Maryland, College Park

Work supported by NSF, ONR/DARPA and UMIACS

http://www.umiacs.umd.edu/users/ramani [email protected]

Create Virtual Environments

• Virtual environments: stereoscopic rendering can place objects (people and things) in their proper spatial location
• But what about sound?

Virtual audio must
• Convincingly place sound sources at their proper three-dimensional locations
• Simulate the environment: room size, scattering properties
• Change dynamically with the pose and motion of the listener

Key => the rendered sound must include cues that
• Encode the acoustical location of a source
• Encode the size of the environment

Perceptual system

The human perceptual system is a sophisticated sensing, measuring and computing system

• Designed by evolution
• Performs real-time measurements and decisions

Our goal is to fool this system into believing that it is perceiving an object that is not there

Applications

• User interfaces for mobile devices ("eyes-free") and for visually impaired people
• High-bandwidth interfaces for high-dimensional data: UAV operators, air-traffic controllers, pilots
• Virtual reality: entertainment and games, training and simulation, remote exploration
• Augmented reality: assisted navigation, mobile information look-up
• Design of sound systems: headphones, speakers

Perception of Sound Location

• Measure attributes of the received sound at the two ears
• Solve the inverse problem of finding the source location based on these measurements
• Humans do this effortlessly in near real-time

What cues are available? Cues based on comparing the sound received at the two ears: "interaural cues"

• Interaural Level Differences (ILD): for distant sources, level differences are small
• Interaural Time Differences (ITD): with a sound speed of 340 m/s and a distance between the ears of 0.17 m, the maximum ITD is 500 µs
• The human auditory system is able to resolve time differences of 20 µs
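The ITD figures above can be checked with a few lines of arithmetic. The sketch below assumes a simple far-field, straight-line model (ITD = d sin(azimuth) / c) that ignores diffraction around the head; the function name and the azimuth convention (0 degrees straight ahead, 90 degrees toward one ear) are illustrative assumptions, not taken from the slides.

```python
import math

SPEED_OF_SOUND = 340.0  # m/s, value used on the slide
EAR_DISTANCE = 0.17     # m, value used on the slide


def itd_seconds(azimuth_deg, ear_distance=EAR_DISTANCE, c=SPEED_OF_SOUND):
    """Interaural time difference under an assumed straight-line, far-field model."""
    # Path-length difference between the two ears, divided by the sound speed.
    return ear_distance * math.sin(math.radians(azimuth_deg)) / c


if __name__ == "__main__":
    for az in (0, 30, 90, 150):
        print(f"azimuth {az:3d} deg -> ITD {itd_seconds(az) * 1e6:6.1f} microseconds")
```

Under this model a source at 30 degrees and one at 150 degrees produce the same ITD, so the time difference alone cannot separate front from back.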

Interaural cues are not sufficient

• Time delay: the locus of points x with $|x - x_L| - |x - x_R| = c\,\Delta t$, where $x_L$ and $x_R$ are the ear positions, is a hyperboloid of revolution
• All points on this "cone of confusion" give rise to the same delays
• For distant sources, level differences are also small

Other mechanisms are necessary
• Scattering of sound: off our bodies, off the environment
• Purposive motion

[Figure: head with left ear and right ear, and a source.]

Sound and Human Spaces

wavelengths are comparable to our rooms, bodies, and features

Not an accident but evolutionary selection!

[Figure: wavelength (m, 10^-2 to 10^1) versus frequency (Hz, 10^2 to 10^4), marking pinna dimensions, head dimensions, shoulder dimensions, workspace, rooms/offices, large rooms, and the speech range.]

Sound wavelengths are comparable to human dimensions and to the dimensions of the spaces we live in.

λf = c

• When λ >> a, the wave is unaffected by the object
• When λ ~ a, the behavior of the scattered wave is complex and diffraction effects are important
• When λ << a, the wave behaves like a ray

Sound sources are coherent, as opposed to light, and so interference effects are important
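As a rough illustration of the three regimes, the sketch below computes the wavelength from λf = c and compares it with an object size a. The factor-of-ten thresholds and the function names are assumptions chosen for illustration; the slides only give the qualitative conditions.

```python
SPEED_OF_SOUND = 340.0  # m/s


def wavelength(frequency_hz, c=SPEED_OF_SOUND):
    """Wavelength in metres from lambda * f = c."""
    return c / frequency_hz


def scattering_regime(frequency_hz, size_m, ratio=10.0):
    """Classify the interaction of a wave with an object of size `size_m`.

    `ratio` is an assumed threshold standing in for '>>' and '<<'.
    """
    lam = wavelength(frequency_hz)
    if lam > ratio * size_m:
        return "unaffected"    # lambda >> a
    if lam < size_m / ratio:
        return "ray-like"      # lambda << a
    return "diffraction"       # lambda ~ a


if __name__ == "__main__":
    # Sizes are illustrative: a ~4 cm pinna, a ~25 cm head, a 5 m room.
    for name, size in (("pinna", 0.04), ("head", 0.25), ("room", 5.0)):
        for f in (100.0, 1000.0, 10000.0):
            print(f"{name:5s} at {f:7.0f} Hz: {scattering_regime(f, size)}")
```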

Scattering Cues

[Figure: scattering patterns for ka = 1, ka = 5, and ka = 10; plots of amplitude versus frequency.]

Characteristics of interactions

Interactions change the received sound waves

They can encode the path of the sound, and thus location

Interactions with bodies and ears
• Bodies ~ 50 cm, heads ~ 25 cm, ears ~ 4 cm
• Not much multiple scattering
• Smaller sizes => higher frequencies
• Fewer sharp edges but open cavities
• Closer to receivers: ears

Interactions with surroundings
• Rooms ~ 2 m – 10 m
• More multiple scattering
• Larger sizes => lower frequencies
• Some sharp edges
• Relatively distant from ears

Guiding principle

Axiom: to create the virtual scene, it is sufficient to recreate the sound pressure levels at the eardrum, or a sufficiently fine approximation to them …

Of course, with more precise knowledge of the cues the brain actually uses, it might be possible to simplify the recreation. This is the subject of ongoing work by many neuroscientists, including our collaborators.

Scattering characterization:

• Linear systems can be characterized by their impulse response (IR). Knowing the IR, we can compute the response to a general source by convolution
• The IR is the response to an impulsive source at a particular location
  -- Scattering off the person is described by the Head Related Impulse Response (HRIR)
  -- Room scattering is described by the Room Impulse Response (RIR)
• The response differs according to source and receiver locations, and thus encodes source location
• The HRTF and RTF are the Fourier transforms of the impulse responses. Convolution is cheaper in the Fourier domain (it becomes a multiplication)

Thus to create virtual audio we need
• Head-related impulse responses
• Room impulse responses
• Head tracking
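The convolution step can be sketched as follows: a mono source is convolved with a left and a right HRIR, with the convolution carried out as a multiplication in the Fourier domain. This is a minimal illustration using NumPy; the function names and the random stand-in HRIRs are assumptions, not measured data or the presenter's implementation.

```python
import numpy as np


def fft_convolve(x, h):
    """Linear convolution of x and h computed as a product in the Fourier domain."""
    n = len(x) + len(h) - 1              # length of the full linear convolution
    nfft = 1 << (n - 1).bit_length()     # next power of two, avoids circular wrap-around
    spectrum = np.fft.rfft(x, nfft) * np.fft.rfft(h, nfft)
    return np.fft.irfft(spectrum, nfft)[:n]


def spatialize(mono, hrir_left, hrir_right):
    """Return a (2, n) array: the mono signal filtered by each ear's impulse response."""
    return np.stack([fft_convolve(mono, hrir_left),
                     fft_convolve(mono, hrir_right)])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mono = rng.standard_normal(1000)
    # Hypothetical 64-tap HRIRs used only to exercise the code.
    hl, hr = rng.standard_normal(64), rng.standard_normal(64)
    print(spatialize(mono, hl, hr).shape)
```

For long filters such as a room response, the Fourier-domain product is much cheaper than direct time-domain convolution, which is the point made on the slide.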

Head Related Transfer Function


• Scattering causes selective amplification or attenuation at certain frequencies, depending on source location
  -- Ears act as directional acoustic probes
  -- Effects can be of the order of tens of dB
• This is encoded in a Head Related Transfer Function (HRTF): the ratio of the Fourier transform of the sound pressure level at the eardrum to that which would have been obtained at the head center without the listener

HRTFs are very individual
• Humans have different sizes and shapes
• Ear shapes are very individual as well
  -- Before fingerprints, Alphonse Bertillon used a system of identification of criminals that included 11 measurements of the ear
  -- Even today ear shots are part of mugshots and INS photographs
  -- DARPA research in human identification
• If ear shapes and body sizes are different, the properties of the scattered wave are different, and HRTFs will be very individual

Need individual HRTFs for creating virtual audio

HRTF potpourri

• HRTF dependence on range is usually neglected, and the HRTF is parameterized as a function of azimuth, elevation and frequency. Valid for distances >1.3 m
• Experiments show that:
  -- With personalized HRTFs, localization is close to natural limits (3-5 degrees)
  -- In an anechoic room people cannot distinguish between external speakers and open-ear headphones
  -- You can learn a new set of HRTFs in weeks if your ears are modified (e.g., by placing molds, Hofman, 2001)
  -- You do not forget the original HRTF set: when the molds are removed, localization is still perfect
  -- Some smoothing still allows good perception
  -- Features beyond 15 kHz seem unimportant

HRTF measurement

• Sound is presented using a speaker and measured using in-ear microphones
• The source is placed at several azimuths and elevations. Considerable experimental ingenuity is required
• Speakers are moved and locations sampled, e.g., speakers slide along a hoop for five different sets, and the hoop moves along 25 elevations, for 50 x 25 measurements
• Takes about 2 hours
• The subject must be still, and is given feedback to keep steady
• The hoop is usually >1 m away (no range data)
• Robotic apparatus is used to reduce interference and speed up measurement (still needs 2 hours)

CIPIC HRTF Database

• Algazi et al (WASPAA 2001)
• 45 HRTFs + ear dimensions, available on the web
• Demonstrates significant person-to-person variability of both HRTFs and ear shapes
• Allows study of HRTFs

[Figure: contralateral and ipsilateral ear HRTF slices for the 45 degree azimuth and varying elevation for a human. Panels: left ear, right ear; axes: elevation, frequency.]

• Torso reflections are the wide arches at low frequencies
• Pinna notches are the dark streaks in the high-frequency range

However, there are no models yet that let us go directly from geometry to response

HRTFs can be computed

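Wave equation:

\[ \frac{\partial^2 p'}{\partial t^2} = c^2 \nabla^2 p' = c^2 \left( \frac{\partial^2 p'}{\partial x^2} + \frac{\partial^2 p'}{\partial y^2} + \frac{\partial^2 p'}{\partial z^2} \right) \]

Fourier transform from time to frequency domain:

\[ P(x, y, z, \omega) = \int_{-\infty}^{\infty} p'(x, y, z, t)\, e^{i \omega t}\, dt \]

Helmholtz equation:

\[ \nabla^2 P + k^2 P = 0 \]

Boundary conditions:

Sound-hard boundaries: \( \dfrac{\partial P}{\partial n} = 0 \)

Sound-soft boundaries: \( P = 0 \)

Impedance conditions: \( \dfrac{\partial P}{\partial n} + i \sigma P = g \)

Sommerfeld radiation condition: \( \lim_{r \to \infty} r \left( \dfrac{\partial P}{\partial r} - i k P \right) = 0 \)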

Idea for rapidly obtaining individual HRTFs

• Discretize the equation using surface meshes of individuals
• Obtain these meshes via computer vision
• Basis for an NSF ITR award in 2000

Boundary Integral Formulations:

Discretization

Prior Work: Computation of HRTFs

• Katz (1998, 2002), Kahana & Nelson (2001)
  -- Has been demonstrated to be viable
  -- Mesh obtained via laser scanning; required much processing of scans to obtain good meshes
  -- Relied on commercial BEM acoustics codes developed for auto industry noise calculations
  -- Both scanning and computations are very slow
• Our project goals
  -- Speed up computations via the FMM
  -- Speed up the process of obtaining meshes via computer vision

Fast Multipole Method Solution

Wave equation solution: panel size should be small relative to wavelength

For N points the direct algorithm scales as O(N^3) time and O(N^2) memory

The FMM is an algorithm that, combined with iterative methods, can reduce the cost to O(kN log N) time and O(N) memory

The basic idea is to speed up dense matrix-vector products, using the addition theorem for solutions of the Helmholtz equation. Given a matrix-vector product like [equation not recoverable from the slide], write [equation not recoverable from the slide]

FMM factorization

Operation reduction trick

We evaluate the p^2 coefficients c_lm in O(Np^2) operations

Then we evaluate the matrix-vector product at the N points y_i in O(Np^2) operations

The direct product takes O(N^2) operations. If N ≈ 10^5 and p = 50, this gives a 40-fold speed-up. For larger problems the speed-up is larger
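The operation-reduction trick can be illustrated on a toy problem. The sketch below does not use the Helmholtz kernel or spherical-harmonic expansions; it assumes a one-dimensional kernel (y - x)^n, chosen only because its binomial expansion separates exactly into p = n + 1 terms. All names are hypothetical. First the p coefficients are accumulated over the sources, then the sum is evaluated at each target, so the cost is O((N + M)p) instead of O(NM).

```python
from math import comb

import numpy as np


def direct_product(x, y, u, n):
    """Reference O(N*M) evaluation of v_i = sum_j (y_i - x_j)**n * u_j."""
    return ((y[:, None] - x[None, :]) ** n) @ u


def factored_product(x, y, u, n):
    """Same product via (y - x)**n = sum_m C(n, m) * y**(n - m) * (-x)**m."""
    m = np.arange(n + 1)
    # Step 1: p = n + 1 coefficients, each a sum over all sources: O(N * p).
    c = ((-x[:, None]) ** m).T @ u
    # Step 2: evaluate the short expansion at every target: O(M * p).
    binom = np.array([comb(n, k) for k in m], dtype=float)
    return (y[:, None] ** (n - m)) @ (binom * c)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 2000)   # source points
    y = rng.uniform(-1, 1, 1500)   # evaluation points
    u = rng.standard_normal(2000)  # source strengths
    err = np.max(np.abs(direct_product(x, y, u, 6) - factored_product(x, y, u, 6)))
    print(f"max abs difference: {err:.2e}")
```

In the FMM proper the expansion is only valid locally, which is why the method needs hierarchical data structures and translation operators on top of this basic factorization.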

Fast solution of the wave equation

• However, many technical problems remain …: data structures, translation operators, stability, iterative methods
• Conventional FMM algorithms for the Helmholtz equation scale as O(Np^5), or are inaccurate and scale as O(Np^3)
  -- p is a parameter that depends on the desired accuracy and wavenumber, p ≈ 10-100
• Recently, we developed accurate O(Np^2) algorithms, with speedups by factors of 100 or more over the conventional FMM
• Could have major impact on solutions of problems in computational electromagnetics and acoustics
• Also developed general data structures and black-box implementations

Gumerov & Duraiswami (2001-2004)

Our Other FMM work

• FMM for sums of high-dimensional Gaussians
  -- Useful for kernel density estimation (a statistical technique to estimate probability densities)
  -- Applied to computer vision and computational learning theory problems
• FMM for non-uniform Fourier transforms
  -- Multi-rate filtering
  -- FFTs for real non-uniform processes
  -- Medical imaging
• Implicit function interpolation
  -- Graphics and visualization

Problem Decomposition

• As people move, head and shoulder locations change, and the HRTF changes as well
• It is impractical to solve the wave equation for every configuration, or to recreate meshes for each configuration dynamically
• It is possible to simplify the problem. Recently, in joint work with Duda, Algazi and Gumerov, we have developed a decompositional approach
  -- Simple spherical/ellipsoidal models for the head and torso, which can be computed on-line
  -- Exact numerical solution models for the pinna scattering, computed off-line
  -- Solution to the full problem synthesized from these

Decomposition

• Analytical solution to a simplified head and torso model (Gumerov & Duraiswami, ICASSP'01/'02; Algazi et al 2002, JASA). The solution can be evaluated online as the geometry changes
• Exact offline computation of the pinna transfer function
• Online synthesis of the two solutions to be attempted. Simple addition for now (Algazi et al, WASPAA 2002)

Scattering from multiple spheres

Multiple Scattering from 100 spheres

Applications to Laplace, Helmholtz, Stokes, Maxwell problems

Nanotechnology simulation Fluid/particle mechanics

ka=1.6 ka=2.8 ka=4.8

FMM also used here for visualization

How good are the spherical HRTFs?

We compared measured HRTFs for a manikin and for a bocce ball + bowling ball model with computed HRTFs.

More extensive tests in Algazi et al , JASA, Nov. 2002

Validate strategy

Decomposition: Computer Vision

Simplifies the computer vision problem to
• Some gross estimation of head and torso sizes
• Tracking of head pose and person position
• Determining accurate meshes of the pinna

Determining pinna structure has proven hard.
• Use an accurate prior mesh model of a person's ear
• Deform this mesh, so that it produces the same images as that of a particular person's ear

Next step: compute pinna transfer functions

[Figure: pinna mould image; pinna mesh.]

Diverted by a breakthrough!

A recent breakthrough

• Measurements in seconds
• Get range HRTFs as well in this time frame
• Much more compact representation of HRTFs
• Good, scientifically valid interpolation
• May be the proper scientific setting to answer
  -- How many measurement points?
  -- Where should they be placed?

Patent pending; available to license from the UMD Office of Technology Commercialization

Reciprocity principle

[Figure: two configurations with the source and receiver positions interchanged.]

Reciprocity Principle:

If sound waves are excited at a point A in a space filled with air that is partly bounded by stationary bodies and partly unbounded, the pressure at any other point B coincides, in magnitude and phase, with the pressure that would be observed at A if the sound source were placed at B

• Pressure scattering off an object is a well studied area
• Helmholtz equation
• Linear system, so decompose the pressure as [equation not recoverable from the slide]
• Pressure outside a compact scatterer can be written as [equation not recoverable from the slide]
  -- h_l are Hankel functions, Y_lm are spherical harmonics
• Complete basis for radiating solutions of the wave equation. Interpolating with these satisfies the physics!
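For orientation, a decomposition and expansion of this kind is commonly written as below. This is the standard textbook form, offered as a sketch and not recovered from the slides: the subscripts "in" and "scat" and the coefficient symbol \alpha_{lm} are assumed notation.

```latex
P = P_{\mathrm{in}} + P_{\mathrm{scat}}, \qquad
P_{\mathrm{scat}}(r, \theta, \varphi; k)
  = \sum_{l=0}^{\infty} \sum_{m=-l}^{l}
    \alpha_{lm}(k)\, h_l(kr)\, Y_{lm}(\theta, \varphi)
```

In practice the sum over l is truncated, for example at l = p - 1, which leaves p^2 coefficients to be determined from measurements.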

2nd key idea

• Measurement? We want to measure HRTFs over the entire hearing range, 20 Hz - 20 kHz
• Good low-frequency measurements are problematic
  -- Large loudspeakers are necessary to emit energy at low frequencies
  -- Anechoic chambers are not so anechoic at low frequencies
• Thus current speakers are already too small!
  -- Measured HRTFs below 500 Hz are not very good
  -- Subject of active research, including in the current project (see Algazi et al, JASA, Nov 2002)
• Low-frequency HRTFs are well represented by such models (see Algazi et al, AES 2002)
  -- Perceptually they may even sound better!
• So all we need to do is measure things at higher frequencies

Approach
• Turned-out headphone drivers
• Array of tiny microphones
• Send out a highpass signal and measure the received signal
• Use the analytical anthropometric representation for low frequencies, and compose
• Extrapolate range

Amplifiers and Data Acquisition

Custom design: amplifiers, band pass filters, preamplifiers

Fabricated using surface-mount components into two 32-channel boxes

Comparison

Apply idea to existing HRTFs

• Existing HRTF measurements on a spherical grid can be viewed as measurements by microphones at the speaker locations, from a source at the ear location
• Can we use this idea to interpolate existing HRTFs? Yes! It also serves to validate the approach
• Take HRTF measurements. Write them in the form [equation not recoverable from the slide]
• Fit the measured data f(r_i, θ_i, φ_i)

Validation!

Regularized fitting
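A regularized fit of expansion coefficients to measured samples can be sketched as follows. This is a deliberately simplified stand-in, not the presenter's method: it assumes an axisymmetric case in which the angular basis reduces to Legendre polynomials P_l(cos θ) (the m = 0 spherical harmonics up to normalization), leaves out the Hankel-function radial dependence, and uses plain Tikhonov regularization with an assumed parameter `lam`. The function names are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre


def fit_expansion(theta, f, n_terms, lam=1e-6):
    """Tikhonov-regularized least-squares fit of f(theta) ~ sum_l alpha_l P_l(cos theta)."""
    # Design matrix: column l holds P_l(cos(theta_i)) for each sample i.
    A = legendre.legvander(np.cos(theta), n_terms - 1)
    # Regularized normal equations: (A^T A + lam I) alpha = A^T f.
    return np.linalg.solve(A.T @ A + lam * np.eye(n_terms), A.T @ f)


def evaluate_expansion(theta, alpha):
    """Evaluate the fitted expansion at new angles (interpolation between samples)."""
    return legendre.legval(np.cos(theta), alpha)


if __name__ == "__main__":
    true_alpha = np.array([1.0, 0.5, -0.3, 0.2])
    theta = np.linspace(0.0, np.pi, 40)                 # sampled directions
    samples = legendre.legval(np.cos(theta), true_alpha)
    alpha = fit_expansion(theta, samples, n_terms=4, lam=1e-10)
    print(np.round(alpha, 4))
```

The regularization term keeps the solve stable when the number of basis functions approaches the number of measurement points or when the samples are noisy.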

Range Interpolation

• The expansions are valid in a region outside the scatterer
• So the expansion is valid at different ranges within its region of validity
• The HRTF can then be extrapolated in range

[Figure: region of validity of the expansion, with labels O, x*, R*, Y.]

Range (Kemar)

Room Impulse Response

Environmental reflections

Fundamental component of all sound

Humans expect some reverberation

Use comparisons between sound and its reflection to disambiguate locations

Room Reflection Locations

[Figure: room containing a source and a listener, with first-order virtual (image) sources outside the walls and a second-order virtual source.]

Obtaining the Room Impulse Response

• The room impulse response depends upon the location of the source and the receiver
  -- Since the two ears are spatially separated, they have different RIRs
• Primary contributions are reflections and diffractions
  -- Early reflections are distinct and spatially separated
• Including the room impulse response appears very important for the perception of range
  -- Also for "spaciousness" and "ambience"
• Can also be measured, computed, or approximated

Approximate Room Impulse Responses

• A simple classical method is the image method (Allen & Berkley, 1979)
  -- Images of the source in the walls
  -- Images of images …: 6, 6^2, 6^3, 6^4, …
  -- Boundary conditions of absorption are accounted for approximately
• Even this simple model is expensive to compute
  -- For M receiver locations and N real/image sources the method requires O(MN) computations
• We developed an efficient method for this computation using the multipole method
  -- Requires O(M+N) time
  -- Described in Duraiswami et al 2001, Proc. WASPAA 2001
• Recently we have developed algorithms for general polyhedral rooms with general boundary conditions
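The image method for a rectangular room can be sketched in a few lines. This is the direct evaluation, not the multipole-accelerated version. It assumes a single frequency-independent reflection coefficient `beta` for all six walls, 1/(4πd) spherical spreading, and arrival times rounded to the nearest sample; all function and parameter names are illustrative.

```python
import itertools

import numpy as np


def image_sources(room, src, max_order):
    """Image positions and reflection orders for a shoebox room with walls at 0 and L."""
    per_axis = []
    for length, s in zip(room, src):
        axis = []
        for n in range(-max_order, max_order + 1):
            for q in (0, 1):
                order = abs(2 * n - q)  # number of wall reflections along this axis
                if order <= max_order:
                    axis.append(((1 - 2 * q) * s + 2 * n * length, order))
        per_axis.append(axis)
    positions, orders = [], []
    for (x, ox), (y, oy), (z, oz) in itertools.product(*per_axis):
        if ox + oy + oz <= max_order:
            positions.append((x, y, z))
            orders.append(ox + oy + oz)
    return np.array(positions), np.array(orders)


def room_impulse_response(room, src, rcv, max_order, beta=0.9, fs=8000, c=340.0):
    """Sum delayed, attenuated impulses from the source and its images."""
    pos, orders = image_sources(room, src, max_order)
    dist = np.linalg.norm(pos - np.asarray(rcv, dtype=float), axis=1)
    delays = np.round(dist / c * fs).astype(int)
    h = np.zeros(delays.max() + 1)
    # np.add.at accumulates correctly when several images land on the same sample.
    np.add.at(h, delays, beta ** orders / (4 * np.pi * dist))
    return h


if __name__ == "__main__":
    h = room_impulse_response((4.0, 5.0, 3.0), (1.0, 2.0, 1.5), (3.0, 3.0, 1.5), max_order=3)
    print(len(h), int(np.count_nonzero(h)))
```

The loop over images is repeated for every receiver, which is the O(MN) cost that the multipole-based method is designed to avoid.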

Complexity

• N sources and M evaluation points: MN operations
• Can this complexity be improved? Yes, via application of fast multipole methods
• Can achieve O(M+N) complexity
• Order of magnitude speedup

Results

Obtain identical results to the Allen Berkley model

About 5 to 10 times faster for 6 orders

More orders and more sources, relative speed-up increases

The Rendering Pipeline


• Transformations that must be performed on the sound to achieve virtual audio (VA)
• Low latency is needed for realism
• However, the filters are very long: ~1 s for the room response
• Some motions will cause an immediate change in the sound, but others will cause a change later
• Need to apply a decompositional approach again: composition of filters
  -- Short ones for head-related scattering
  -- Long ones for the room response
• As the user's head pose changes, the short filters change
• As the user moves, the long filters change
• Need to break up the room response filter

Breaking up the Filter

• Convolution is linear
• Early reflections are more important and time separated
  -- Important for determining range
• Later reflections are a continuum
  -- Important for "spaciousness," "envelopment," "warmth," etc.
• Create the early-reflections filter on the fly
  -- Reflections of up to 5th or 6th order (depending on CPU resources)
  -- These are convolved with their HRTF: stick the appropriate HRIR at the arrival location
• The tail of the room impulse response is approximated depending on room size
  -- This part is pre-computed and mixed with the source

Sequential creation of the room impulse response

-- Start with the pre-computed tail of the IR (reflections of 4th order and up)
-- Quickly compute the reflections of order 0-3 for the current geometry
-- (The reflection of order 0 is just the direct arrival)
-- Stick them onto this generic tail
-- Parts that are perceptually important are updated in real time
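Because convolution is linear, the early part and the tail can be filtered separately and summed. The sketch below assumes the room response is simply the early part followed by the tail, so the tail contribution starts len(early) samples late; the function name and the random test signals are hypothetical.

```python
import numpy as np


def split_convolve(x, early, tail):
    """Filter x with [early | tail] by convolving each piece separately and adding."""
    n = len(x) + len(early) + len(tail) - 1
    y = np.zeros(n)
    # Early reflections: short filter, recomputed as the geometry changes.
    y[:len(x) + len(early) - 1] += np.convolve(x, early)
    # Tail: long pre-computed filter, offset by the length of the early part.
    y[len(early):] += np.convolve(x, tail)
    return y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(256)
    early = rng.standard_normal(32)
    tail = rng.standard_normal(512) * np.exp(-np.arange(512) / 100.0)  # decaying stand-in tail
    full = np.convolve(x, np.concatenate([early, tail]))
    print(np.allclose(split_convolve(x, early, tail), full))
```

With this split, only the short early filter has to be rebuilt when the listener moves, while the tail convolution can reuse the pre-computed filter.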

Practical system

Demonstrated at several conferences and to many visitors to UMD

Can use personalized HRTFs, fixed HRTFs, or simple customization via database selection

Simple room model (Allen & Berkley)

Head tracking via a Polhemus tracker

Informal Reports

• “Perfect” localization and externalization for a person’s own HRTF
• Other people approximately localize
• When presented with a visual cue, localization is good
  -- “That gray box is making the sound”
  -- If the source is placed on the computer speaker, the illusion is very convincing
• Other reactions
  -- “I used to hate headphones as they isolate you from the outside world. This system doesn’t take you away from it”
• Also some games … a shoot ‘em up game with audio and video cues
• Application with multiple “radios” in a virtual room
  -- BBC, CNN, other web audio streams. “Walk” to the stream of choice

Collaborators

• Dick Duda, Ralph Algazi (CIPIC, UC Davis)
• Dmitry Zotkin (programming & systems)
• Nail Gumerov (numerical & analytical techniques, FMM)
• Shihab Shamma (neural modeling)
• Larry Davis, Kexue Liu, Ankur Mohan (computer vision)
• Zhiyun Li, Elena Grassi, Dmitry Zotkin (array capture)
• Vikas Raykar, B. Yegnanarayana (feature-based HRTFs)
• Ken Grant, Barbara Shinn-Cunningham (psycho-acoustics)

Auditory User Interfaces

• Manipulate the spatial location and other perceptual attributes of synthesized sound
  -- Pitch, timbre, intensity
  -- Range, elevation, azimuth
  -- Ambience
• Map information along the rendered sound
• Funded by another ITR award

[Figure: system block diagram. Blocks: HRTF acquisition; rendering system (position, pitch, timbre, ambience, intensity); neural model; evaluation and tests (baseline testing, PPD testing); mapping from data to percepts; subjects.]

[Figure (Center for Auditory and Acoustic Research): cochlear analysis of a sound (a harmonic series). Panels: basilar membrane vibrations, hair cells along the tonotopic axis (characteristic frequency axis, CF), auditory-nerve fibers and auditory-nerve responses, lateral inhibition, estimated stimulus spectrum. Axes: time (msec) and CF (Hz, 250 to 4000).]

Decouple HRTFs and Recordings

Place microphones at a remote location (e.g. concert hall)

Replay spatialized audio at a remote location

Must play it for many users

Use HRTFs at the client side

[Figure: recording and playback.]

Beamforming-based Rendering

• Similar to image-based rendering in graphics
• Do not solve the figure/ground or object location problem
• Assume that the HRTF set is known (measured). Any set is discrete: a grid of points G on the sphere
• For every point P in the grid G:
  -- Compute the output signal Q of the beamformer aimed at point P (which amounts to listening in the direction of P; Q is simply the signal coming from the direction of P)
  -- Play the computed signal Q filtered with the HRTF for P (which amounts to rendering Q from the direction of P)

[Figure: microphone array outside, followed by beamforming + HRTFs + signal processing.]
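The loop over grid points can be sketched as below. The beamformer here is an assumed far-field delay-and-sum design with delays rounded to whole samples, which is only one possible choice and is not stated in the slides; the HRIR array layout (directions x 2 ears x taps) and all names are likewise illustrative.

```python
import numpy as np


def delay_and_sum(signals, mic_pos, direction, fs, c=340.0):
    """Steer an assumed far-field delay-and-sum beamformer toward a unit vector."""
    # A plane wave from `direction` reaches microphone m earlier by (mic_pos[m] . direction) / c.
    advance = mic_pos @ direction / c
    shifts = np.round((advance - advance.min()) * fs).astype(int)
    n = signals.shape[1]
    out = np.zeros(n)
    for sig, s in zip(signals, shifts):
        out[s:] += sig[:n - s]  # delay the early channels so all channels line up
    return out / len(signals)


def render_binaural(signals, mic_pos, grid, hrirs, fs):
    """For each grid direction P: beamform toward P, filter with P's HRIR pair, and sum."""
    n = signals.shape[1] + hrirs.shape[2] - 1
    out = np.zeros((2, n))
    for direction, (h_left, h_right) in zip(grid, hrirs):
        q = delay_and_sum(signals, mic_pos, direction, fs)  # "listen" toward P
        out[0] += np.convolve(q, h_left)                    # render Q from P, left ear
        out[1] += np.convolve(q, h_right)                   # render Q from P, right ear
    return out


if __name__ == "__main__":
    fs = 1000.0
    mic_pos = np.array([[0.0, 0.0, 0.0], [0.34, 0.0, 0.0], [0.68, 0.0, 0.0]])
    signals = np.zeros((3, 20))
    signals[0, 10] = signals[1, 9] = signals[2, 8] = 1.0  # impulse arriving from +x
    grid = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
    hrirs = np.zeros((2, 2, 4))
    hrirs[0, 0, 0] = hrirs[1, 1, 0] = 1.0  # toy HRIRs: +x heard on the left, -x on the right
    print(render_binaural(signals, mic_pos, grid, hrirs, fs).shape)
```

Because the HRIR filtering happens after beamforming, the same recorded array signals can be rendered with a different HRTF set for each listener at the client side.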
