Engineering Perceptually Valid Spatial Audio
Ramani Duraiswami
Perceptual Interfaces and Reality Laboratory
Institute for Advanced Computer Studies
University of Maryland, College Park
Work supported by NSF, ONR/DARPA and UMIACS
http://www.umiacs.umd.edu/users/ramani [email protected]
Create Virtual Environments
Stereoscopic rendering can place objects (people and things) in their proper spatial location
But what about sound?
Virtual audio must
  Convincingly place sound sources at their proper three-dimensional locations
  Simulate the environment: room size, scattering properties
  Change dynamically with the pose and motion of the listener
Key => the rendered sound must include cues that
  Encode the acoustical location of a source
  Encode the size of the environment
Perceptual system
The human perceptual system is a sophisticated sensing, measuring and computing system
Designed by evolution
Performs real-time measurements and decisions
Our goal is to fool this system into believing that it is perceiving an object that is not there
Applications
User interfaces for
  Mobile devices (“eyes-free”)
  Visually impaired people
High-bandwidth interfaces for high-dimensional data
  UAV operators, air-traffic controllers, pilots
Virtual reality
  Entertainment and games
  Training and simulation
  Remote exploration
Augmented reality
  Assisted navigation
  Mobile information look-up
Design of sound systems
  Headphones, speakers
Perception of Sound Location
Measure attributes of the received sound at the two ears
Solve the inverse problem of finding the source location based on these measurements
Humans effortlessly do this in near real-time
What cues are available?
Cues based on comparing the sound received at the two ears: “Interaural Cues”
  Interaural Level Differences (ILD)
    For distant sources, level differences are small
  Interaural Time Differences (ITD)
    With a sound speed of 340 m/s and a distance between the ears of 0.17 m, maximum ITD = 500 µs
    The human auditory system is able to resolve time differences of 20 µs
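The maximum ITD quoted above follows from dividing the ear separation by the sound speed. The sketch below reproduces that arithmetic. The azimuth-dependent formula is an assumption that is not in the slides: a far-field source and a straight-line path difference of d·sin(azimuth), ignoring diffraction around the head. The function names are illustrative.

```python
import math

SPEED_OF_SOUND = 340.0   # m/s, value quoted in the slide
EAR_SEPARATION = 0.17    # m, value quoted in the slide


def max_itd_seconds(ear_separation=EAR_SEPARATION, c=SPEED_OF_SOUND):
    """Largest interaural time difference: source on the interaural axis."""
    return ear_separation / c


def itd_seconds(azimuth_deg, ear_separation=EAR_SEPARATION, c=SPEED_OF_SOUND):
    """Far-field ITD for a source at the given azimuth (0 = straight ahead).

    Assumed model: straight-line path difference d * sin(azimuth).
    """
    return ear_separation * math.sin(math.radians(azimuth_deg)) / c


if __name__ == "__main__":
    print(f"max ITD = {max_itd_seconds() * 1e6:.0f} microseconds")
    for az in (0, 30, 60, 90):
        print(az, "deg ->", round(itd_seconds(az) * 1e6, 1), "microseconds")
```

Under this simplified model the ITD grows with the sine of the azimuth, so equal steps in angle produce smaller changes in delay as the source moves toward the side.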
Interaural cues are not sufficient
Time delay: the locus of $|\mathbf{x}-\mathbf{x}_L| - |\mathbf{x}-\mathbf{x}_R| = c\,\Delta t$ is a hyperboloid of revolution
All points on this “cone-of-confusion” give rise to the same delays
For distant sources, level differences are also small
Other mechanisms necessary
  Scattering of sound
    Off our bodies
    Off the environment
  Purposive motion
[Figure: head with left ear and right ear, and a sound source]
Sound and Human Spaces
wavelengths are comparable to our rooms, bodies, and features
Not an accident but evolutionary selection!
[Figure: wavelength (m) versus frequency (Hz) on logarithmic axes, 10^2 to 10^4 Hz and 10^-2 to 10^1 m, marking pinna dimensions, head dimensions, shoulder dimensions, workspace, rooms/offices, large rooms, and the speech range]
Sound wavelengths comparable to human dimensions and dimensions of spaces we live in.
λf = c, where a is the size of the object
When λ >> a, the wave is unaffected by the object
When λ ~ a, the behavior of the scattered wave is complex and diffraction effects are important
When λ << a, the wave behaves like a ray
Sound sources are coherent as opposed to light, and so interference effects are important
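As a rough illustration of the three regimes, the sketch below computes the wavelength from λf = c and classifies a frequency/size pair. The factor-of-ten threshold separating "much larger" from "comparable" is an assumption chosen only for illustration, as are the function names; ka is taken as 2πa/λ.

```python
import math

SPEED_OF_SOUND = 340.0  # m/s


def wavelength(frequency_hz, c=SPEED_OF_SOUND):
    """Wavelength in metres from lambda * f = c."""
    return c / frequency_hz


def ka(frequency_hz, size_m, c=SPEED_OF_SOUND):
    """Dimensionless size parameter k*a with k = 2*pi/lambda."""
    return 2.0 * math.pi * frequency_hz * size_m / c


def regime(frequency_hz, size_m, ratio=10.0):
    """Classify scattering behavior; `ratio` is an assumed threshold."""
    lam = wavelength(frequency_hz)
    if lam > ratio * size_m:
        return "unaffected"      # lambda >> a
    if lam < size_m / ratio:
        return "ray-like"        # lambda << a
    return "diffraction"         # lambda ~ a


if __name__ == "__main__":
    # 0.04 m is used here as an ear-sized obstacle, 10 m as a large room.
    for f, a in [(100, 0.04), (5000, 0.04), (20000, 10.0)]:
        print(f, "Hz,", a, "m ->", regime(f, a), "ka =", round(ka(f, a), 2))
```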
Scattering Cues
[Figure: scattering at ka = 1, ka = 5, and ka = 10, with plots of amplitude versus frequency]
Characteristics of interactions
Interactions change received sound waves
Can encode the path of the sound, and thus location
Interactions with bodies and ears
  Bodies ~ 50 cm, heads ~ 25 cm, ears ~ 4 cm
  Not much multiple scattering
  Smaller sizes => higher frequencies
  Fewer sharp edges but open cavities
  Closer to receivers: ears
Interactions with surroundings
  Rooms ~ 2 m to 10 m
  More multiple scattering
  Larger sizes => lower frequencies
  Some sharp edges
  Relatively distant from ears
Guiding principle
Axiom: To create the virtual scene, it is sufficient to recreate sound pressure levels at the eardrum
  Or a sufficiently fine approximation to it …
Of course, with more precise knowledge of the cues the brain actually uses, it might be possible to simplify the recreation
  Subject of ongoing work of many neuroscientists, including our collaborators.
Scattering characterization:
Linear systems can be characterized by their impulse response (IR)
  Knowing the IR, we can compute the response to a general source by convolution
The IR is the response to an impulsive source at a particular location
  Scattering off a person: Head Related Impulse Response (HRIR)
  Room scattering: Room Impulse Response (RIR)
The response differs according to source and receiver locations
  Thus it encodes the source location
The HRTF and RTF are the Fourier transforms of the corresponding impulse responses
  Convolution is cheaper in the Fourier domain (it becomes a multiplication)
Thus to create virtual audio we need
  Head-related impulse responses
  Room impulse responses
  Head tracking
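The equivalence between time-domain convolution and frequency-domain multiplication can be checked numerically. This is a minimal sketch using NumPy. The source signal and the decaying-noise "HRIR" are synthetic stand-ins, not measured data, and the function names are illustrative.

```python
import numpy as np


def render_direct(source, hrir):
    """Filter the source with the impulse response by direct convolution."""
    return np.convolve(source, hrir)


def render_fft(source, hrir):
    """Same result via the Fourier domain: multiply spectra, then invert."""
    n = len(source) + len(hrir) - 1          # length of the linear convolution
    nfft = 1 << (n - 1).bit_length()         # next power of two >= n
    spectrum = np.fft.rfft(source, nfft) * np.fft.rfft(hrir, nfft)
    return np.fft.irfft(spectrum, nfft)[:n]


# Synthetic stand-ins (assumptions): white-noise source, decaying-noise HRIR.
rng = np.random.default_rng(0)
source = rng.standard_normal(1000)
hrir_left = rng.standard_normal(200) * np.exp(-np.arange(200) / 30.0)

direct = render_direct(source, hrir_left)
fast = render_fft(source, hrir_left)
```

Zero-padding to at least the full output length is what makes the FFT product a linear rather than circular convolution.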
Head Related Transfer Function
Scattering causes selective amplification or attenuation at certain frequencies, depending on the source location
  Ears act as directional acoustic probes
  Effects can be of the order of tens of dB
Encoded in a Head Related Transfer Function (HRTF)
  Ratio of the Fourier transform of the sound pressure level at the eardrum to that which would have been obtained at the head center without the listener
HRTFs are very individual
  Humans have different sizes and shapes
  Ear shapes are very individual as well
    Before fingerprints, Alphonse Bertillon used a system of identification of criminals that included 11 measurements of the ear
    Even today ear shots are part of mugshots and INS photographs
    DARPA research in human identification
If ear shapes and body sizes are different
  Properties of the scattered wave are different
  HRTFs will be very individual
Need individual HRTFs for creating virtual audio
HRTF potpourri
HRTF dependence on range is neglected; the HRTF is usually parameterized as a function of azimuth, elevation and frequency
  Valid for distances > 1.3 m
Experiments show that:
  With personalized HRTFs, localization is close to natural limits (3-5 degrees)
  In an anechoic room people cannot distinguish between external speakers and open-ear headphones
  You can learn a new set of HRTFs in weeks if your ears are modified (e.g., by placing molds, Hofman, 2001)
  You do not forget the original HRTF set: when molds are removed, localization is still perfect
  Some smoothing still allows good perception
  Features beyond 15 kHz seem unimportant
HRTF measurement
Sound is presented using a speaker and measured using in-ear microphones
Source placed at several azimuths and elevations
Considerable experimental ingenuity required
Speakers are moved and locations sampled
  e.g., speakers slide along a hoop for five different sets, and the hoop moves along 25 elevations, for 50 x 25 measurements
Takes about 2 hours
Subject must be still
  Given feedback to keep steady
Hoop is usually > 1 m away (no range data)
Robotic apparatus to reduce interferences and speed up (still needs 2 hours)
CIPIC HRTF Database
Algazi et al (WASPAA 2001)
45 HRTFs + ear dimensions, available on the web
Demonstrate significant person-to-person variability of both HRTFs and ear shapes
Allow study of HRTFs
[Figure: contralateral and ipsilateral ear HRTF slices for the 45 degree azimuth and varying elevation for a human; left ear and right ear panels, axes elevation and frequency]
  Torso reflections are the wide arches in low frequencies
  Pinna notches are the dark streaks in the high frequency range
However, there are no models yet that let us go directly from geometry to response
HRTFs can be computed
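Wave equation:
$$\frac{\partial^2 p'}{\partial t^2} = c^2 \nabla^2 p' = c^2 \left( \frac{\partial^2 p'}{\partial x^2} + \frac{\partial^2 p'}{\partial y^2} + \frac{\partial^2 p'}{\partial z^2} \right)$$
Fourier Transform from Time to Frequency Domain:
$$P(x,y,z,\omega) = \int_{-\infty}^{\infty} p'(x,y,z,t)\, e^{i\omega t}\, dt$$
Helmholtz equation:
$$\nabla^2 P + k^2 P = 0$$
Boundary conditions:
  Sound-hard boundaries: $\dfrac{\partial P}{\partial n} = 0$
  Sound-soft boundaries: $P = 0$
  Impedance conditions: $\dfrac{\partial P}{\partial n} + i\sigma P = g$
  Sommerfeld radiation condition: $\lim_{r \to \infty} r \left( \dfrac{\partial P}{\partial r} - ikP \right) = 0$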
Idea for rapidly obtaining individual HRTFs
Discretize the equation using surface meshes of individuals
Obtain these meshes via computer vision
Basis for an NSF ITR award in 2000
Boundary Integral Formulations:
Discretization
Prior Work: Computation of HRTFs
Katz (1998, 2002), Kahana & Nelson (2001)
  Has been demonstrated to be viable
  Mesh obtained via laser scanning
    Required much processing of scans to obtain good meshes
  Relied on commercial BEM acoustics codes
    Developed for auto industry noise calculations
  Both scanning and computations are very slow
Our project goals
  Speed up computations via the FMM
  Speed up the process of obtaining meshes via computer vision
Fast Multipole Method Solution
Wave equation solution: panel size should be small relative to the wavelength
For N points the direct algorithm scales as O(N^3) time and O(N^2) memory
The FMM is an algorithm that, combined with iterative methods, can reduce the cost to O(kN log N) time and O(N) memory
Basic idea is to speed up dense matrix-vector products
  Uses the addition theorem for solutions of the Helmholtz equation
  Given a matrix-vector product, write it in factored form
FMM factorization
Operation reduction trick
We evaluate the p^2 coefficients c_lm in O(Np^2) operations
Then we evaluate the matrix-vector product at the N points y_i in O(Np^2) operations
Direct product takes O(N^2) operations
If N ≈ 10^5 and p = 50: 40-fold speed-up
Larger problems: the speed-up is larger
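The operation-reduction trick can be illustrated on a toy kernel. The sketch below is not an FMM: it has no hierarchical space partitioning and no translation operators, and it uses a single index with p terms rather than the p^2 coefficients c_lm of the slides. It only shows the "compute coefficients once, then evaluate everywhere" factorization, using the assumed kernel exp(x·y) on [0, 1] and its truncated power series. All names are illustrative.

```python
import numpy as np
from math import factorial


def direct_product(y, x, u):
    """O(N^2): form the dense kernel matrix exp(y_i * x_j) and multiply."""
    return np.exp(np.outer(y, x)) @ u


def factored_product(y, x, u, p=20):
    """O(N p): exp(y*x) ~ sum_{m<p} y^m x^m / m!  (truncated power series).

    Step 1: coefficients c_m = sum_j u_j x_j^m / m!   (depends on sources only)
    Step 2: v_i = sum_m c_m y_i^m                     (depends on targets only)
    """
    m = np.arange(p)
    inv_fact = np.array([1.0 / factorial(k) for k in range(p)])
    c = ((x[:, None] ** m[None, :]).T @ u) * inv_fact   # shape (p,)
    return (y[:, None] ** m[None, :]) @ c                # shape (len(y),)


rng = np.random.default_rng(0)
n_points = 500
x = rng.random(n_points)            # source points in [0, 1]
y = rng.random(n_points)            # evaluation points in [0, 1]
u = rng.standard_normal(n_points)   # source strengths

v_direct = direct_product(y, x, u)
v_fast = factored_product(y, x, u)
```

The dense product touches every source-target pair, while the factored version never forms the N x N matrix; the truncation length p plays the role of the accuracy parameter.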
Fast solution of the wave equation
However many technical problems remain …
  Data structures, translation operators, stability, iterative methods
Conventional FMM algorithms for the Helmholtz equation scale as O(Np^5), or are inaccurate and scale as O(Np^3)
  p is a parameter that depends on the desired accuracy and wavenumber; p ≈ 10-100
Recently, we developed accurate O(Np^2) algorithms
  Speedups by factors of 100 or more over the conventional FMM
  Could have major impact on solutions of problems in computational electromagnetics and acoustics
Also developed general data structures and black-box implementations
Gumerov & Duraiswami (2001-2004)
Our Other FMM work
FMM for sums of high-dimensional Gaussians
  Useful for kernel density estimation (statistical technique to estimate probability densities)
  Applied to computer vision and computational learning theory problems
FMM for non-uniform Fourier transforms
  Multi-rate filtering
  FFTs for real non-uniform processes
  Medical imaging
Implicit function interpolation
  Graphics and visualization
Problem Decomposition
As people move, head and shoulder locations change. The HRTF changes as well.
It is impractical to solve the wave equation for every configuration, or to recreate meshes for each configuration dynamically.
Possible to simplify the problem
  Recently, in joint work with Duda, Algazi and Gumerov, we have developed a decompositional approach
  Simple spherical/ellipsoidal models for the head/torso can be computed on-line
  Exact numerical solution models for the pinna scattering are computed off-line
  Solution to the full problem is synthesized from these
Decomposition
Analytical solution to simplified head and torso model (Gumerov & Duraiswami, ICASSP’01/’02; Algazi et al 2002, JASA)
  Solution can be evaluated online as the geometry changes
Exact offline computation of the pinna transfer function
Online synthesis of the two solutions to be attempted
  Simple addition now (Algazi et al, WASPAA 2002)
Scattering from multiple spheres
Multiple Scattering from 100 spheres
Applications to Laplace, Helmholtz, Stokes, Maxwell problems
  Nanotechnology simulation
  Fluid/particle mechanics
[Figure: scattering from multiple spheres at ka = 1.6, ka = 2.8, and ka = 4.8]
FMM also used here for visualization
How good are the spherical HRTFs?
We compared measured HRTFs for a mannequin and for a bocce ball + bowling ball model with computed HRTFs.
More extensive tests in Algazi et al , JASA, Nov. 2002
Validate strategy
Decomposition: Computer Vision
Simplifies the computer vision problem to
  Some gross estimation of head and torso sizes
  Tracking of head pose and person position
  Determining accurate meshes of the pinna
Determining pinna structure has proven hard
  Use an accurate prior mesh model of a person’s ear
  Deform this mesh so that it produces the same images as those of a particular person’s ear
Next step: Compute Pinna Transfer Functions
Pinna mould image. Pinna mesh
Diverted by a breakthrough!
A recent breakthrough
Measurements in seconds
Get range HRTFs as well in this time frame
Much more compact representation of HRTFs
Good, scientifically valid interpolation
May be the proper scientific setting to answer
  How many measurement points?
  Where should they be placed?
Patent pending, available to license from UMD Office of Technology Commercialization
Reciprocity principle
[Figure: two panels showing a source and a receiver with their positions exchanged]
Reciprocity Principle:
If sound waves are excited at a point A in a space filled with air that is partly bounded by stationary bodies and partly unbounded, the pressure at any other point B coincides, in magnitude and phase, with the pressure that would be observed at A if the sound source were placed at B
Pressure scattering off an object is a well-studied area
  Helmholtz equation
  Linear system, so decompose the pressure into incident and scattered parts
Pressure outside a compact scatterer can be written as a series in h_l Y_lm
  h_l are Hankel functions, Y_lm are spherical harmonics
  Complete basis for radiating solutions of the wave equation
  Interpolating with these satisfies physics!
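For reference, a standard way to write the decomposition and the series referred to here is given below. The notation is an assumption and does not come from the slides: $P_{\mathrm{in}}$ and $P_{\mathrm{scat}}$ for the incident and scattered fields, $\alpha_{lm}$ for the expansion coefficients, and $(r, \theta, \varphi)$ for spherical coordinates centered on the scatterer.

```latex
P = P_{\mathrm{in}} + P_{\mathrm{scat}},
\qquad
P_{\mathrm{scat}}(r,\theta,\varphi)
  = \sum_{l=0}^{\infty} \sum_{m=-l}^{l}
    \alpha_{lm}\, h_l(kr)\, Y_{lm}(\theta,\varphi)
```

Truncating the outer sum at $l = p-1$ leaves $\sum_{l=0}^{p-1}(2l+1) = p^2$ coefficients, which is one way to see where a count of $p^2$ coefficients per expansion comes from.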
2nd key idea
Measurement?
  We want to measure HRTFs over the entire hearing range, 20 Hz - 20 kHz
Good low-frequency measurements are problematic
  Large loudspeakers are necessary to emit energy at low frequencies
  Anechoic chambers are not so anechoic at low frequencies
Thus current speakers are already too small!
  Measured HRTFs below 500 Hz are not very good
  Subject of active research, including in the current project (see Algazi et al, JASA, Nov 2002)
Low-frequency HRTFs are well represented by such models
  See Algazi et al, AES 2002
  Perceptually they may even sound better!
So all we need to do is measure things at higher frequencies
Approach
  Turned out headphone drivers
  Array of tiny microphones
  Send out a highpass signal and measure the received signal
  Use analytical anthropometric representation for low frequencies and compose
  Extrapolate range
Amplifiers and Data Acquisition
Custom design: amplifiers, band pass filters, preamplifiers
Fabricated using surface-mount components into two 32-channel boxes
Comparison
Apply idea to existing HRTFs
Existing HRTF measurements on a spherical grid can be viewed as measurements by microphones at the speaker locations of a source at the ear location.
Can we use this idea to interpolate existing HRTFs?
  Yes! Also serves to validate
  Take HRTF measurements
  Write them in the form of the multipole expansion
  Fit the measured data f(r_i, θ_i, φ_i)
Validation!
Regularized fitting
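The slides do not show the fitting formula. One common choice, assumed here, is Tikhonov-regularized least squares: given a matrix whose columns are basis functions sampled at the measurement points, solve for the coefficients while penalizing their norm. The sketch below uses a random complex matrix as a stand-in for the sampled basis, so it demonstrates only the linear algebra, not an actual HRTF fit. The names `regularized_fit` and `eps` are illustrative.

```python
import numpy as np


def regularized_fit(basis, measured, eps=1e-6):
    """Minimize ||basis @ alpha - measured||^2 + eps * ||alpha||^2.

    Solved through the regularized normal equations
    (B^H B + eps I) alpha = B^H f.
    """
    gram = basis.conj().T @ basis
    rhs = basis.conj().T @ measured
    return np.linalg.solve(gram + eps * np.eye(gram.shape[0]), rhs)


# Synthetic stand-in (assumption): 60 measurement points, 9 basis functions.
rng = np.random.default_rng(0)
basis = rng.standard_normal((60, 9)) + 1j * rng.standard_normal((60, 9))
alpha_true = rng.standard_normal(9) + 1j * rng.standard_normal(9)
measured = basis @ alpha_true

alpha_fit = regularized_fit(basis, measured, eps=1e-10)
```

A larger `eps` trades fidelity to the data for smaller, better-conditioned coefficients, which matters when the measurement grid leaves some basis functions poorly constrained.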
Range Interpolation
The expansions are valid in a region outside the scatterer
So the expansion is valid at different ranges in its region of validity
The HRTF can then be extrapolated
[Figure: geometry sketch with labels Y, x*, R*, and O]
Range (Kemar)
Room Impulse Response
Environmental reflections
Fundamental component of all sound
Humans expect some reverberation
Use comparisons between sound and its reflection to disambiguate locations
Room Reflection Locations
[Figure: room with a source and a listener, surrounded by first-order virtual sources and a 2nd order virtual source]
Obtaining the Room Impulse Response
Room Impulse Response depends upon the location of the source and the receiver
  Since the two ears are spatially separated they have different RIRs
Primary contributions are reflections and diffractions
  Early reflections are distinct and spatially separated
Including the room impulse response appears very important for the perception of range
  Also for “spaciousness”, “ambience”
Can also be measured, computed, or approximated
Approximate Room Impulse Responses
Simple classical method is the image method (Allen & Berkley, 1979)
  Images of the source in the walls
  Images of images … 6, 6^2, 6^3, 6^4, …
  Boundary conditions of absorption accounted for approximately
Even this simple model is expensive to compute
  For M receiver locations and N real/image sources, the method requires O(MN) computations
We developed an efficient method for this computation using the multipole method
  Requires O(M+N) time
  Described in Duraiswami et al, Proc. WASPAA 2001
Recently we have developed algorithms for general polyhedral rooms with general boundary conditions
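The image construction can be sketched for a rectangular room with walls at 0 and L along each axis. This is a minimal illustration under stated assumptions: perfectly reflecting walls (no absorption coefficients), the direct O(MN)-style evaluation rather than the multipole acceleration, and illustrative function names. Along one axis the images sit at (1 - 2q)·s + 2nL for q in {0, 1} and integer n, and each one corresponds to |2n - q| wall reflections.

```python
import itertools
import math

SPEED_OF_SOUND = 340.0  # m/s


def axis_images(src, length, max_order):
    """Images of a 1-D source between walls at 0 and `length`.

    Returns (position, number_of_reflections) pairs.
    """
    images = []
    for n in range(-max_order, max_order + 1):
        for q in (0, 1):
            order = abs(2 * n - q)
            if order <= max_order:
                images.append(((1 - 2 * q) * src + 2 * n * length, order))
    return images


def image_sources(source, room, max_order):
    """All 3-D image sources with total reflection order <= max_order."""
    per_axis = [axis_images(s, length, max_order)
                for s, length in zip(source, room)]
    result = []
    for combo in itertools.product(*per_axis):
        order = sum(o for _, o in combo)
        if order <= max_order:
            result.append((tuple(p for p, _ in combo), order))
    return result


def arrival_delays(source, listener, room, max_order, c=SPEED_OF_SOUND):
    """Sorted (delay_in_seconds, reflection_order) for every image."""
    return sorted((math.dist(pos, listener) / c, order)
                  for pos, order in image_sources(source, room, max_order))


room = (4.0, 5.0, 3.0)          # assumed room dimensions in metres
source = (1.0, 2.0, 1.5)
listener = (3.0, 1.0, 1.2)
delays = arrival_delays(source, listener, room, max_order=2)
```

The earliest entry is the direct path (order 0); a simple impulse response can then be built by placing a scaled impulse at each delay.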
Complexity
N sources
M evaluation points
MN operations
Can this complexity be improved?
  Yes, via application of fast multipole methods
  Can achieve O(M+N) complexity
  Order of magnitude speedup
Results
Obtain identical results to the Allen Berkley model
About 5 to 10 times faster for 6 orders
More orders and more sources, relative speed-up increases
The Rendering Pipeline
Transformations that must be performed on the sound to achieve virtual audio (VA)
  Low latency for realism
  However, filters are very long
    ~1 s for the room response
Some motions will change the sound immediately, but others will cause a change later.
Need to apply a decompositional approach again
  Composition of filters
    Short ones for head-related scattering
    Long ones for the room response
  As the user’s head pose changes, the short filters change
  As the user moves, the long filters change
Need to break up the room response filter
Breaking up the Filter
Convolution is linear
Early reflections are more important and time separated
  Important for determining range
Later reflections are a continuum
  Important for “spaciousness,” “envelopment,” “warmth,” etc.
Create the early reflections filter on the fly
  Reflections of up to 5th or 6th order (depending on CPU resources)
  These are convolved with their HRTF
  Stick the appropriate HRIR at the arrival location
Tail of the room impulse response is approximated depending on room size
  This part is pre-computed and mixed with the source
Sequential creation of the room impulse response
Start with the pre-computed tail of the IR (reflections of 4th order and up)
Quickly compute the reflections of order 0-3 for the current geometry
  (Reflection of order 0 is just the direct arrival)
Stick them onto this generic tail
Parts that are perceptually important are updated in real time
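Because convolution is linear, the early part and the tail of the room response can be applied separately and summed. The sketch below illustrates that split with synthetic data. The tap positions, gains, decay constant, and sample rate are assumptions chosen only for illustration, and the per-reflection HRIR filtering is omitted for brevity.

```python
import numpy as np


def early_filter(taps, length):
    """Sparse filter with one impulse per early reflection (index, gain)."""
    h = np.zeros(length)
    for index, gain in taps:
        h[index] += gain
    return h


rng = np.random.default_rng(1)
fs = 8000                  # assumed sample rate in Hz
length = fs                # about 1 s of room response
tail_start = 400           # assumed sample index where the generic tail begins

# Pre-computed generic tail: decaying noise, zero before tail_start.
tail = np.zeros(length)
t = np.arange(length - tail_start)
tail[tail_start:] = 0.05 * rng.standard_normal(t.size) * np.exp(-t / 1500.0)

# Early reflections for the current geometry (order 0 is the direct arrival).
taps = [(0, 1.0), (90, 0.6), (150, 0.5), (310, 0.3)]
early = early_filter(taps, length)

source = rng.standard_normal(2000)
wet_tail = np.convolve(source, tail)     # can be computed ahead of time
wet_early = np.convolve(source, early)   # recomputed when the geometry changes
output = wet_early + wet_tail
```

Only `wet_early` has to be refreshed when the listener or source moves, which is the point of keeping the perceptually important part short.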
Practical system
Demonstrated at several conferences and to many visitors to UMD
Can use personalized HRTFs, fixed HRTFs, or simple customization via database selection
Simple room model (Allen & Berkley)
Head tracking via a Polhemus
Informal Reports
“Perfect” localization and externalization for a person’s own HRTF
Other people approximately localize
  When presented with a visual cue localization is good
    “That gray box is making the sound”
  If the source is placed on the computer speaker, the illusion is very convincing
Other reactions
  “I used to hate headphones as they isolate you from the outside world. This system doesn’t take you away from it”
Also some games …
  Shoot ‘em up game with audio-video cues
  Application with multiple “radios” in a virtual room
    BBC, CNN, other web audio streams. “Walk” to the stream of choice
Collaborators
Dick Duda, Ralph Algazi (CIPIC, UC Davis)
Dmitry Zotkin (programming & systems)
Nail Gumerov (numerical & analytical techniques, FMM)
Shihab Shamma (neural modeling)
Larry Davis, Kexue Liu, Ankur Mohan (computer vision)
Zhiyun Li, Elena Grassi, Dmitry Zotkin (array capture)
Vikas Raykar, B. Yegnanarayana (feature-based HRTFs)
Ken Grant, Barbara Shinn-Cunningham (psycho-acoustics)
Auditory User Interfaces
Manipulate synthesized sound spatial location and other perceptual attributes
  Pitch, timbre, intensity
  Range, elevation, azimuth
  Ambience
Map information along rendered sound
Funded by another ITR award
[Figure: block diagram linking DATA, MAPPING, the rendering system (position, pitch, timbre, ambience, intensity), HRTF acquisition, a neural model, percepts, SUBJECTS, and evaluation and tests (baseline testing, PPD testing)]
Center for Auditory and Acoustic Research
[Figure: cochlear analysis and auditory-nerve responses. Panels show the sound (harmonic series), basilar membrane vibrations, hair cells along the tonotopic characteristic frequency (CF) axis, auditory-nerve fibers, lateral inhibition, and the estimated stimulus spectrum; axes are time (msec) and CF (Hz)]
Decouple HRTFs and Recordings
Place microphones at a remote location (e.g. concert hall)
Replay spatialized audio at a remote location
  Must play it for many users
  Use HRTFs at the client side
[Figure: RECORDING and PLAYBACK stages]
Beamforming-based Rendering
Similar to image-based rendering in graphics
  Do not solve the figure/ground or object location problem
Assume that the HRTF set is known (measured)
  Any set is discrete: a grid of points G on the sphere
For every point P in the grid G:
  Compute an output signal Q of the beamformer aimed at point P (which amounts to listening in the direction of P; Q is simply the signal coming from the direction of P)
  Play the computed signal Q filtered with the HRTF for P (which amounts to rendering Q from the direction of P)
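The loop over grid points can be sketched as follows. The slides do not say which beamformer is used. A delay-and-sum beamformer with integer sample delays is assumed here as the simplest choice, and the HRTF for each direction is applied as a time-domain HRIR pair. The microphone signals, steering delays, and HRIRs are synthetic stand-ins, and all names are illustrative.

```python
import numpy as np


def delay_and_sum(mic_signals, delays):
    """Advance each channel by its integer sample delay, then average."""
    n_mics, n = mic_signals.shape
    out = np.zeros(n)
    for sig, d in zip(mic_signals, delays):
        out[:n - d] += sig[d:]
    return out / n_mics


def render(mic_signals, steering_delays, hrirs_left, hrirs_right):
    """For each grid point P: Q = beam aimed at P, then add Q filtered by P's HRIRs."""
    n = mic_signals.shape[1]
    taps = hrirs_left.shape[1]
    left = np.zeros(n + taps - 1)
    right = np.zeros(n + taps - 1)
    for delays, hl, hr in zip(steering_delays, hrirs_left, hrirs_right):
        q = delay_and_sum(mic_signals, delays)
        left += np.convolve(q, hl)
        right += np.convolve(q, hr)
    return left, right


# Synthetic stand-ins (assumptions): one plane-wave source hitting 4 microphones.
rng = np.random.default_rng(2)
s = rng.standard_normal(256)
true_delays = [0, 3, 6, 9]
mics = np.zeros((4, 300))
for m, d in enumerate(true_delays):
    mics[m, d:d + 256] = s

steering_delays = np.array([[0, 3, 6, 9], [0, 0, 0, 0]])   # two grid points
hrirs_left = rng.standard_normal((2, 16))
hrirs_right = rng.standard_normal((2, 16))

q_aimed = delay_and_sum(mics, true_delays)
left, right = render(mics, steering_delays, hrirs_left, hrirs_right)
```

When the steering delays match the arrival delays, the channels add coherently and the beam output reproduces the source; for other directions the channels are misaligned and partially cancel.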
[Figure: microphone array outside; BEAMFORMING + HRTFs + signal processing]