
Research & Development

White Paper

WHP 273

October 2013

Musical Movements – Gesture Based Audio Interfaces

A. Churnside, C. Pike and M. Leonard

BRITISH BROADCASTING CORPORATION

BBC Research & Development

White Paper WHP 273

Musical Movements – Gesture Based Audio Interfaces

Anthony Churnside, Chris Pike, Max Leonard

Abstract

Recent developments have led to the availability of consumer devices capable of recognising certain human movements and gestures. This paper is a study of novel gesture-based audio interfaces. The authors present two prototypes for interacting with audio/visual experiences. The first allows a user to ‘conduct’ a recording of an orchestral performance, controlling the tempo and dynamics. The paper describes the audio and visual capture of the orchestra and the design and construction of the audio-visual playback system. An analysis of this prototype, based on testing and feedback from a number of users, is also provided. The second prototype uses the same gesture-tracking approach to control a three-dimensional audio panner. This audio panner is tested and feedback from a number of professional engineers is analysed.

This work was presented at the 131st Audio Engineering Society Convention in New York in October 2011.

Additional key words: audio, ambisonics, interaction, panning, orchestra, production

© BBC 2013. All rights reserved.

White Papers are distributed freely on request. Authorisation of the Chief Scientist or General Manager is required for publication.

© BBC 2013. Except as provided below, no part of this document may be reproduced in any material form (including photocopying or storing it in any medium by electronic means) without the prior written permission of BBC Research & Development except in accordance with the provisions of the (UK) Copyright, Designs and Patents Act 1988.

The BBC grants permission to individuals and organisations to make copies of the entire document (including this copyright notice) for their own internal use. No copies of this document may be published, distributed or made available to third parties whether by paper, electronic or other means without the BBC's prior written permission. Where necessary, third parties should be directed to the relevant page on BBC's website at http://www.bbc.co.uk/rd/pubs/whp for a copy of this document.

1 Introduction

Consumer devices that provide depth imagery and skeletal tracking data have recently become available. This means that, for the first time, hands-free gesture-based interaction and control can be used without a large time and cost impact. This paper examines two prototypes which explore how this might benefit interaction with spatial audio, an area of work at BBC Research & Development. The two prototypes are similar in design but look at differing points of application. The first (BBC Philharmonic Maestro) investigates a tool for user/audience interaction, allowing users to explore a musical performance in an interactive and immersive way. The second (3D Panner) looks at how this might be used in a production environment as a panner for sounds in three-dimensional (3D) audio systems.

2 Affordable Gesture Control

Machine vision-based gestural control is currently a popular area both in research and, more recently, in consumer technology. A fundamental part of this technology is robust interactive human body tracking, which includes pose estimation (reviewed in [1]) and body tracking algorithms (for example [2]).

The availability of real-time depth cameras has allowed advancements in the field (for example [3]); however, until recently such technology was not available for general use. Therefore production of any system incorporating human body tracking and gestural control would first incur a significant research and development effort and cost.

Now there are low-price consumer devices available [4, 5] providing depth imagery, human body tracking, and gesture recognition, all in real-time. These devices use a depth-mapping technique based on projected infra-red patterns [6]. Details of the pose detection algorithm for [4] are given in [7]. Software development tools are available for these devices [8, 9] to provide programmers with access to this data, allowing widespread application of body tracking and gesture recognition.

Today there is a wealth of applications of this technology, with many video demonstrations available on the internet, and this year sees the first IEEE Workshop on Consumer Depth Cameras for Computer Vision. This paper describes two applications that use real-time body tracking data to allow gestural control of audio, for use in spatial audio applications.

3 BBC Philharmonic Maestro

Relatively few people have had the opportunity to stand in front of a full orchestra during a live performance. Fewer still have experienced what it is like to conduct their own orchestra. The BBC Philharmonic Maestro is an installation, aimed at the general public (non-conductors), which allows them to conduct a performance of the William Tell Overture. The system was designed to provide a game-like experience, but one with a good level of immersion.

3.1 Related Work

There have been numerous attempts to use gestural control to create musical meaning in the past. Some devices [10] used low frequency radio transmission to calculate distance, or required the user to wear technology containing a large number of sensors [11, 12]. Some of these systems attempted to detect emotional intent, in addition to tempo and dynamics [13]. Handheld wireless game controllers have also been used as an orchestral conducting input device [14]. Systems which have smaller, less cumbersome input devices have also been explored [15], including the use of infrared light emitting diodes (IRLEDs) [16]. A number of more commercial systems have also been developed. TikTakt [17] is an iOS game that uses accelerometer data from the iOS device to detect the tempo from the rate at which the user is waving the device. Chamber Orchestra Director Pro [18] allows the player to cue sections of the orchestra, but no gesture-based input is used in this application. A limitation of many of these interfaces is that they require the user to conduct while using or wearing a physical device containing sensors such as infrared detectors, accelerometers, or IRLEDs. These devices can often be cumbersome, and the need for a separate controller-like device in a museum or installation is not ideal.

3.2 The BBC Philharmonic Maestro System

The intention with the BBC Philharmonic Maestro was to avoid some of the less desirable characteristics of the aforementioned systems. The general aim was to create a natural conducting experience without the need for an external controller. This meant delivering highly immersive audio and video, each from the point of view of the conductor, whilst designing a user interface that didn't rely on a physical input device, instead detecting and recognising natural and intuitive movements without the need for an ‘electronic baton’ or similar device.

3.3 Capture

This section describes the capture of The BBC Philharmonic Orchestra's performance. The piece of music chosen for this experiment was Rossini's William Tell Overture, selected partly because it is very well known, and partly because it maintains a clear metronomic tempo throughout.

3.3.1 Audio

With immersion a key aim for the audio reproduction, there were a number of audio formats from which to choose. Two-channel stereo is perhaps the most common method of recording an orchestral performance, but stereo is a one-dimensional (width) audio playback technology and there are alternative audio formats that can offer greater immersion. 5.1 surround sound offers two dimensions (width and depth) and, as an established format, has recognised recording techniques; however this format lacks height. Both stereo and 5.1 also require specific, inflexible speaker placement [19], which can be non-conducive to public installations. First order ambisonics [20] was chosen as the audio capture format. Essentially, ambisonics is an extension of the Blumlein Pair [21] stereo recording technique. Instead of just capturing a one-dimensional left-right difference signal, difference signals from three perpendicular figure-of-eight microphones, all positioned at the same point in space, are derived. When combined with an omnidirectional microphone these four signals are known as B-format, and represent the three-dimensional sound field. There are a number of other discrete-channel 3D sound systems which could have been chosen (3D7.1, 10.2, 22.2 etc.) but ambisonics was selected for the following reasons:


Figure 1: High-resolution camera with fish-eye lens

• Ambisonics offers a speaker-independent solution. The choice as to the number and placement of speakers can be made as needed, and changed depending on the testing/installation environment.

• Ambisonics offers 3D audio immersion, providing audio from all directions around the listener.

• Ambisonics is based on spherical decomposition of the soundfield at a single point in space. This is ideal for the conductor application as the aim is to reproduce the soundfield as it was at the place where the original conductor stood.
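
As a rough illustration of the B-format representation described above, the sketch below encodes a monophonic signal arriving from a given direction into the four first-order B-format signals (W, X, Y, Z). This is an illustrative assumption based on the classical B-format convention (omnidirectional W scaled by 1/√2, three figure-of-eight components X, Y, Z); the paper itself captures B-format directly with a Soundfield microphone rather than encoding it.

```python
import numpy as np

def encode_bformat(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal into first-order B-format (W, X, Y, Z).

    Illustrative convention (not taken from the paper): W is the
    omnidirectional component scaled by 1/sqrt(2); X, Y, Z are the
    figure-of-eight components pointing front, left and up.
    """
    az = np.radians(azimuth_deg)    # 0 = front, positive = towards the left
    el = np.radians(elevation_deg)  # 0 = horizontal, positive = upwards
    w = mono / np.sqrt(2.0)
    x = mono * np.cos(az) * np.cos(el)
    y = mono * np.sin(az) * np.cos(el)
    z = mono * np.sin(el)
    return np.stack([w, x, y, z])

# Example: a short 1 kHz tone placed 30 degrees to the left, 10 degrees up.
t = np.arange(0, 0.01, 1.0 / 48000.0)
bformat = encode_bformat(np.sin(2 * np.pi * 1000 * t), 30.0, 10.0)
print(bformat.shape)  # (4, 480) -> W, X, Y, Z channels
```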

The ambisonic recording of the orchestra was made by placing a Soundfield microphone [22] directly behind the conductor. Known limitations of ambisonics, such as the restricted 'sweet spot' area, are less of a drawback for this project because, being designed for a single listener, the listening area for the intended application is very small. In addition to the Soundfield microphone, the left and right of the orchestra were recorded separately using monophonic microphones.

3.3.2 Video

In order to capture the whole orchestra from the conductor's position a camera with a fish-eye lens was used. In addition, a very high resolution recording was needed to provide the desired immersion, with the intention of playing back the video on a very large screen. The camera captured the video with a spatial resolution of 2048 by 1152 pixels. The camera and lens are shown in figure 1. One of the key problems to overcome was the positioning of the camera. The ideal position would be where the conductor's head is located. Thankfully the conductor was very accommodating and conducted the performance from behind the camera stand, ensuring he didn't knock the camera stand or get in shot during the recording of the performance. Figure 2 shows the location of the camera relative to the conductor. Figure 3 shows the fish-eye view of the orchestra.


Figure 2: The conductor working around the camera

Figure 3: A screen shot of the user’s view of the Orchestra


Figure 4: Initialisation phase, required for skeletal tracking

3.4 Implementation

The design of the system was split into the control interface and the audio and video processing, so that either could be modified individually. The two applications communicate using a messaging format over the Open Sound Control (OSC) protocol [23]. The interface was designed so that the interaction between the user and orchestra is as natural as possible. Because of the skeletal tracking, an initialisation phase is needed, requiring the user to raise their arms so that the skeletal tracking software can recognise the shape of a human skeleton (figure 4). After the initialisation is complete the orchestra begins and the user is in control of the performance. The x and y coordinates of the user's left and right hands are tracked; changes in direction are associated with tempo taps, and from these a value for the beats per minute is calculated, as shown in figure 5. The vertical position of the left and right hand controls the dynamic (volume) of the left and right side of the orchestra respectively. This data is passed to the audio and video processing application, which adjusts the playback rate of the video and audio, using spectral domain frequency shifting to maintain constant pitch while varying the tempo. This application also provides a light graphical user interface (GUI), part of which is shown in figure 4.
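
The paper does not give the tracking-to-tempo mapping in detail; the following is a minimal sketch, under stated assumptions, of the kind of mapping described above: a change in a hand's direction of vertical movement is treated as a tempo 'tap', the interval between taps gives the beats per minute, and each hand's height is mapped to the volume of one side of the orchestra. The class name, the use of the vertical axis for tap detection, and the debounce threshold are all assumptions made for illustration.

```python
import time

class ConductorTracker:
    """Sketch of the mapping described above (not the authors' code)."""

    def __init__(self, min_tap_interval=0.25):
        self.min_tap_interval = min_tap_interval  # ignore implausibly fast taps (seconds)
        self.prev_y = None        # previous vertical hand position
        self.prev_dy = 0.0        # previous vertical movement, used for sign changes
        self.last_tap_time = None
        self.bpm = 120.0

    def update_tempo(self, hand_y):
        """Feed successive vertical hand positions; returns the current BPM estimate."""
        now = time.monotonic()
        if self.prev_y is None:
            self.prev_y = hand_y
            return self.bpm
        dy = hand_y - self.prev_y
        self.prev_y = hand_y
        # A change of sign in the vertical movement marks a beat ("tap").
        if dy * self.prev_dy < 0:
            if self.last_tap_time is not None:
                interval = now - self.last_tap_time
                if interval > self.min_tap_interval:
                    self.bpm = 60.0 / interval
            self.last_tap_time = now
        if dy != 0.0:
            self.prev_dy = dy
        return self.bpm

    @staticmethod
    def hand_height_to_volume(hand_y, y_min=0.0, y_max=1.0):
        """Map one hand's height to a 0..1 volume for that side of the orchestra."""
        return min(max((hand_y - y_min) / (y_max - y_min), 0.0), 1.0)
```

In the real installation, values such as these would be sent to the audio/video processing application over OSC, as described above.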

3.5 User Testing

In addition to ad-hoc testing undertaken during development, the installation was included in a Manchester International Festival event titled ‘Music Boxes’ [24]. Around 75 shipping containers were arranged in an oval shape and musical installations for children were placed inside each container. During the course of the festival over 1000 children visited the event, many of them playing with the BBC Philharmonic Maestro. The following section summarises observations and feedback from the children visiting and the staff manning the installation.


Figure 5: Skeletal tracking used to derive left and right volumes and beats per minute

3.5.1 Feedback and Results

The users' understanding of what to do varied from subject to subject; some didn't know what a conductor was, others did. Those who were aware of conductors had a much clearer idea of how to interact with the system. Initially many of the subjects were intimidated by the environment and it took them a while to become comfortable being surrounded by the technology (a large screen and six speakers). Once feelings of intimidation were overcome the users clearly enjoyed controlling the system. The calibration phase of the system struggled to detect the human shape when faced with certain users; girls wearing dresses and children in baggy coats were particularly problematic. Feedback from the subjects indicated the experience felt real, but because so few of the children had real experience of an orchestra, judging how much it felt like being there was a challenge for them. Many subjects were familiar with computer games and interaction with computer-generated images. Those subjects said they enjoyed the ability to interact with a linear video recording because it was novel.

3.5.2 Conclusions

The novelty of the system clearly lay with the ability to conduct an orchestra without the need to hold a controller, but in its current state it is essentially a fun and educational tool. The potential exists to develop the system to recognise more specific gestures, e.g. standard conductor's movements, to create something that behaved more like a real orchestra. However, professional conducting styles vary so wildly that the development of a universal virtual orchestra aimed at professional conductors would be a significant challenge.


Figure 6: BBC Philharmonic Maestro user testing

4 Pan Control in 3D Audio

When generating artificial sound fields, control is a problem: there is a lack of appropriate interfaces for accurate and intuitive manipulation of sound sources in 3D space. Established control interfaces, such as faders, rotary encoders, and touch-screens, only offer one- or two-dimensional control.

This section describes the application of recent vision-based gestural control interfaces to placing sounds in 3D scenes. A pan controller has been developed, allowing a user to control the direction from which sound sources are emitted using arm movements. The system was specifically designed as a production tool for 3D audio material in a broadcast environment.

4.1 Related Work

Methods for reproduction of 3D sound fields in high resolution have been extensively researched [25, 26, 27, 28], as have methods for capturing [29, 30] and generating [31, 32, 33] such sound fields. There are a number of existing software tools for controlling spatialisation of sound sources. IRCAM SPAT [34] is a sophisticated software plugin for spatialisation that covers many room characteristics as well as sound source characteristics; however it doesn't allow 3D sound field production. Zirkonium [35] is a software tool for panning sound sources on a 3D speaker array; it uses a VBAP-based [31] approach. [33] introduces object-based spatial audio authoring with a hierarchical approach for wave field synthesis systems. In [36] an object-based approach is investigated for ambisonics production. It can incorporate mono- and multi-channel signals, as well as sound source and field representations defined in terms of spherical harmonics. A sophisticated system is presented in [37] allowing control of panning and room effects in 3D, based on the 22.2 channel system [38]. These applications allow 3D sound field rendering but they do not incorporate any novel interfaces for control of 3D panning. Conventional input devices such as a mouse, keyboard, fader or joystick are expected, and two-dimensional displays of control settings are used.

A review of work to control spatialisation with gesture is given in [39]. It contains much research on mapping frameworks and input for polymorphic control interfaces such as motion capture. Several of the discussed applications use sensor gloves for 3D gestural input data to control sound source position; [40] presents a framework for gestural control in hierarchical object-based audio representations. Researchers at NHK have demonstrated an Integrated Surround Sound Panning System [41], using handheld wireless games console controllers to allow 3D positioning of audio objects by gesture.

In these works a large number of spatialisation parameters are discussed; yet a holistic mapping approach from motion tracking data to these parameters has not been attempted in a single application. It would likely create a complex control interface which would take a lot of training to use efficiently. Instead applications have focused on control of a single parameter or small set of parameters, such as sound position or room dimensions.

Recent consumer motion tracking devices have so far yielded some interesting musical control applications, for example [42]. However, this technology has not yet been applied to audio spatialisation. There may be benefits to using vision-based motion tracking over other devices such as sensor gloves, which can inhibit other normal activities in an operational environment.

4.2 3D Panning System

One of the most fundamental aspects of spatialisation is panning, control of the direction from which sounds radiate. A system has been developed to allow control of panning in 3D space with arm gestures.

In stereophonic systems panning is conventionally controlled with a rotary encoder or analogue potentiometer. Many modern digital sound desks with 5.1 channel capabilities offer a joystick and graphical interface to control the pan direction in the horizontal plane. In 3D spatial audio, source position is often stated in terms of spherical co-ordinates, with direction being defined by azimuth and elevation angles. There are therefore two dimensions of control when panning in 3D, determining the sound's position on the surface of a sphere. In this application, distance control is not considered.
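
For illustration, a pan direction expressed as azimuth and elevation can be converted to a Cartesian unit vector, for example to place a marker on the spherical mesh of the graphical interface described below. The axis convention used in this sketch is an assumption, not taken from the paper.

```python
import math

def direction_to_unit_vector(azimuth_deg, elevation_deg):
    """Convert a pan direction (azimuth, elevation in degrees) to a Cartesian
    unit vector. Assumed convention: x forward, y left, z up; azimuth measured
    anticlockwise from the front, elevation upwards from the horizontal plane.
    Distance is fixed (unit sphere), since the panner does not control distance."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(az) * math.cos(el),
            math.sin(az) * math.cos(el),
            math.sin(el))

# Example: 90 degrees azimuth, 0 elevation -> directly to the listener's left.
print(direction_to_unit_vector(90.0, 0.0))  # approximately (0.0, 1.0, 0.0)
```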

The design of this system separates the control interface from the audio processing so that either can be modified individually if required. The two applications communicate using a messaging format over OSC.

4.2.1 Pan Control Software

The pan control software presents a representation of the sound scene using 3D graphics, shown in figure 7. A simple room is displayed with a spherical mesh representing the range of positions where the sounds can be placed. Each sound is represented by a small sphere. In order to improve the perception of depth, which is mainly given by geometric perspective, spheres change colour according to depth, from red to blue. The viewpoint is also elevated slightly to aid depth perception. Another (green) sphere represents the current position indicated by the user's arm when not actively panning a sound; this is linked to a smaller black sphere (representing the listening position) by a dotted line. The display operates like a mirror rather than a window, so objects at the front of the sound scene appear larger on the screen.

Figure 7: Graphical user interface for gesture-based pan control

The gesture control input is simple. Skeleton joint positions are returned by the device drivers in cartesian coordinates. The orientation of the user's forearm (elbow to wrist) is converted to an azimuth ($\theta$) and elevation ($\phi$) by the following formulae:

\theta = \tan^{-1}\left(\frac{y_w - y_e}{x_w - x_e}\right)    (1)

\phi = \tan^{-1}\left(\frac{z_w - z_e}{\sqrt{(x_w - x_e)^2 + (y_w - y_e)^2}}\right)    (2)

Subscripts w and e represent the cartesian co-ordinates of the wrist and elbow joints respectively.
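
A direct transcription of equations (1) and (2) into code is sketched below; the joint coordinates would come from the skeletal tracking drivers. atan2 is used rather than a plain arctangent so the full azimuth range is resolved without quadrant ambiguity; the function name and argument layout are illustrative.

```python
import math

def forearm_to_pan_direction(wrist, elbow):
    """Azimuth and elevation (radians) of the forearm direction, from wrist and
    elbow joint positions given as (x, y, z) tuples in the tracker's cartesian
    coordinates, following equations (1) and (2)."""
    dx = wrist[0] - elbow[0]
    dy = wrist[1] - elbow[1]
    dz = wrist[2] - elbow[2]
    azimuth = math.atan2(dy, dx)                    # equation (1)
    elevation = math.atan2(dz, math.hypot(dx, dy))  # equation (2)
    return azimuth, elevation

# Example: wrist above and in front of the elbow -> 0 degrees azimuth, 45 up.
az, el = forearm_to_pan_direction(wrist=(0.5, 0.0, 0.5), elbow=(0.0, 0.0, 0.0))
print(math.degrees(az), math.degrees(el))  # 0.0 45.0
```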

Only one arm is used; the active arm can be selected in the software. An array of buttons on a USB MIDI controller is used to select the currently active object, which is to be panned. The active object is then “picked up” when the panning arm position is close to its position; this is in order to reduce the chance of accidental changes by the operator. An object can be deactivated by pressing its button a second time. When an object is active, the user can simply point in the direction from which they want the sound to come. The control software sends an OSC message when the position of any sound is changed, containing the new position in spherical coordinates and an ID number, which could be a channel number or an object ID. In the current system the sound software reads these messages and adjusts the panning in real-time. Due to the modular design of the system, this data could also be stored for control automation of sound panning or even applied to an entirely different application.
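
The paper specifies only that an OSC message carrying the sound's identifier and its new spherical position is sent whenever a pan changes; it does not name the OSC implementation, address pattern or port. A minimal sketch using the python-osc package, with those details assumed, might look like this:

```python
from pythonosc.udp_client import SimpleUDPClient

# Assumed host, port and OSC address pattern; the paper does not specify them.
client = SimpleUDPClient("127.0.0.1", 9000)

def send_pan(object_id, azimuth_deg, elevation_deg):
    """Tell the audio software the new direction of one sound object."""
    client.send_message("/panner/position", [object_id, azimuth_deg, elevation_deg])

# Example: move object 3 to 45 degrees left of centre, 20 degrees up.
send_pan(3, 45.0, 20.0)
```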

4.2.2 3D Sound System

Higher-order ambisonics is a sound reproduction method based on a spherical harmonic decomposition of the target sound field, under free field conditions. This harmonic series is truncated to achieve an efficient representation, at the cost of spatial resolution and the size of the accurate listening area [25]. The sound system used for this investigation encodes the panned sounds using third-order ambisonics, which generates a 16-channel representation of the sound field. First-order ambisonics, previously described in section 3.3.1, uses a 4-channel representation and as a result has a lower angular resolution and a smaller area where accurate localisation is possible.

The benefits of ambisonics in a broadcast context are significant. Encoding is abstracted from the reproduction system, allowing a finished programme to be rendered to a number of standard broadcast formats automatically, or even to arbitrary speaker configurations [43]. It also allows for transformations of the entire sound field during production.

After encoding, these signals are decoded to an optimally distributed 16-speaker array. The speaker positions were taken from [44]. This speaker layout is quasi-orthonormal, making accurate decoding of the ambisonics signals as simple as possible. The orthonormality error for a particular loudspeaker layout can be calculated to assess the validity of simple decoding methods [29]. The decoding used in-phase weighting of ambisonic orders [45] to reduce the energy from speakers in the opposite direction to the sound source. This decreases the likelihood of localisation error for off-centre listeners. In a broadcast production environment users are likely to need to move within the listening area to do their work.
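
In-phase weighting scales each ambisonic order by a gain before decoding, tapering the higher orders so that speakers opposite the source radiate little energy. The sketch below computes one common formulation of the 3D in-phase per-order gains; the exact formula is an assumption drawn from the general ambisonics literature, not a transcription of the decoder used in this system.

```python
from math import factorial

def in_phase_weights(max_order):
    """Per-order gains (orders 0..max_order) for an assumed 3D in-phase decode.
    For a first-order decode this gives weights 1 and 1/3; higher orders are
    attenuated progressively more strongly."""
    norm = factorial(max_order) * factorial(max_order + 1)
    return [norm / (factorial(max_order + m + 1) * factorial(max_order - m))
            for m in range(max_order + 1)]

print(in_phase_weights(1))  # [1.0, 0.333...]
print(in_phase_weights(3))  # gains applied to orders 0..3 of the 16-channel signal
```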

The audio software receives pan messages from the control software, and uses these to control the encoding of monophonic input signals in real-time.

An alternative version of the system was created using headphones, rendering the sound field using techniques described in [46].

4.3 Testing

The system was tested with three highly experienced BBC sound supervisors, who have worked across a wide range of programme types and production scenarios encountered in broadcasting, including 5.1 production. The objective of the test was to assess whether this method of controlling 3D audio production offers any advantages over the use of conventional control interfaces such as faders and rotary encoders. A major difference between the two approaches is that the gesture-based interface combines azimuth and elevation parameters into one movement, whereas two separate controllers are required using conventional interfaces.

Participants were instructed how to use the gesture-based panning tool (with its graphical interface) and also a system that provided rotary controls for azimuth and elevation of each channel. Once comfortable with each system, they were given two small sets of audio to experiment with. One set contained five stems from a contemporary music production. The other contained six channels from a radio drama production (including dialogue channels and ambient and point source effects channels).

The participants had volume faders for each audio channel in addition to the pan controls. They were asked to experiment with spatialisation in each system for each audio set, until they were satisfied with the generated sound scene.

4.3.1 Results

Subjects were asked to give their preferences between the gesture-based and rotary control systems based on the following aspects:

• understanding of interface function


• accuracy of control

• creativity enabled

• understanding of 3D sound scene

It was noted by all of the subjects prior to the test that, due to their past experience, they expected to find the conventional rotary controller interface more intuitive and appropriate for the task. Feedback indicated that the three test subjects had differing opinions of the gesture-based 3D panning system. All of the subjects found the gesture-based interface fun, but not all of them thought it was valid for production environments. One subject indicated a strong preference for rotary controllers on all criteria; another indicated a strong preference for the gesture-based controller. The third subject thought that rotary controllers were easier to understand functionally and allowed more accurate control of panning, but that the gesture-based control allowed more creative use of the sound field and, in combination with the GUI, gave a clearer understanding of the 3D sound scene.

4.4 Discussion

Due to the broad scope of the test, there was a large variation in responses according to personal preference. However, the feedback indicated that the different opinions were in part based on the production environment considered.

Gesture control was deemed inappropriate for location work, where staff are under significant time pressures and often work in confined spaces. Productions of this kind include live music (classical and popular) and sport, for which panning is largely static. Operators want to establish the spatial scene and then leave it fixed, to concentrate on critical balancing work. One comment stated that, in location sound work, a system of hardware controllers such as joystick and fader would work better for control of 3D panning. Emphasis was placed on compact movements and ensuring clear intent of control.

Post-production scenarios generally allow more space and time. Most significantly, programmes produced in this way, drama in particular, commonly involve more dynamic use of the sound scene. The gesture-based control was considered useful in this context, allowing intuitive spatialisation and creative dynamic control.

Many developments could be made to improve the system. One important improvement for post-production applications would be to add the capability for parameter automation. This could be most easily achieved by implementing the system as a plug-in for a digital audio workstation that has automation functionality, also allowing it to fit into existing production workflows. The control software would need development, especially to handle the many ephemeral audio objects encountered in some productions.

Since the motion capture system currently only uses one depth camera, it fails to detect arm positions when they are occluded by the torso. Multiple depth camera devices could be combined to achieve more accurate, full-3D motion tracking.

The control capabilities can clearly be expanded. Some obvious control parameters for spatialisation are distance, size and directivity. Each of these controls would need an appropriate mapping from motion capture data, and a multimodal system would likely be needed to allow control of different parameters in different states. Control of many parameters simultaneously with a combination of gestures would likely become confusing and difficult to do accurately.

The audio system can currently only present simple point sources at a fixed distance from the listener. The ambisonics system should be expanded to incorporate modelling of complex sources [47], and distance modelling should incorporate near field compensation [48].

5 Conclusions

Gestural control of audio material has been shown to be useful in some instances. It clearly makes for a fun experience, partly due to the novelty of such interaction for many people. The two applications described use simple mappings from skeletal tracking data to audio system parameters. Whilst these have had good feedback from target users, both applications would benefit from more advanced multimodal gestural control interfaces, allowing more control of the audio systems. This is required for sophisticated applications in professional environments, such as an orchestral conductor training system or a complete spatial audio mixing system.

It is clear that current gesture-based interfaces are not appropriate in all situations, particularly high-pressure live broadcast production. As the available consumer devices and software development tools improve and more experimentation with audio interaction occurs, it will become clearer which scenarios are appropriately controlled by natural gesture and how gesture should be mapped to control parameters.

References

[1] R. Poppe, "Vision-based human motion analysis: An overview," Computer Vision and Image Understanding, vol. 108, (2007).
[2] L. Sigal, S. Bhatia, S. Roth, M. Black, and M. Isard, "Tracking loose-limbed people," in Proc. IEEE CVPR, (2004).
[3] C. Plagemann, V. Ganapathi, D. Koller, and S. Thrun, "Real-time identification and localization of body parts from depth images," in Proc. ICRA, (2010).
[4] Microsoft Corp., Redmond WA. Kinect for Xbox 360.
[5] ASUS Xtion Pro. http://www.asus.com/Multimedia/Motion_Sensor/Xtion_PRO
[6] Freedman et al., "Depth Mapping Using Projected Patterns," US Patent US 2010/0118123 A1, filed 2nd April 2008.
[7] Shotton et al., "Real-Time Human Pose Recognition in Parts from Single Depth Images," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, (2011).
[8] Kinect for Windows SDK. http://research.microsoft.com/en-us/um/redmond/projects/kinectsdk/
[9] OpenNI Middleware - Software for natural interaction interfaces. http://www.openni.org/
[10] M. Mathews, "The Radio Baton and Conductor Program," Computer Music Journal, vol. 15, no. 4, (1991).
[11] Nakra, "Inside the Conductor's Jacket: Analysis, Interpretation and Musical Synthesis of Expressive Gesture," PhD thesis, Media Laboratory, MIT, (2000).
[12] L. Savioja, "Interacting with virtual orchestra," NORDUnet2000 Conference, Helsinki, Finland, September 2000.
[13] Ilmonen and Takala, "Detecting Emotional Content from the Motion of an Orchestra Conductor," in The 6th International Workshop on Gesture in Human-Computer Interaction and Simulation, Vannes, France, May 2005.
[14] Nakra et al., "The UBS Virtual Maestro: an Interactive Conducting System," in Proc. NIME, (2009).
[15] Bruegge et al., "Pinocchio: conducting a virtual symphony orchestra," in Proceedings of the International Conference on Advances in Computer Entertainment Technology, New York, USA, (2007).
[16] Borchers et al., "Personal orchestra: a real-time audio/video system for interactive conducting," Multimedia Systems, vol. 9, no. 5, pp. 458-465.
[17] "TikTakt," http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewSoftware?id=395881884&mt=8
[18] "Chamber Orchestra Director Pro.," http://www.yoyogames.com/games/20739-chamber-orchestra-director-pro-2007-winter-concert-edition
[19] "Multichannel stereophonic sound system with and without accompanying picture," ITU-R BS.775.
[20] M.A. Gerzon, "Periphony: With-height sound reproduction," JAES, vol. 21, no. 1, pp. 2-10, (1973).
[21] Blumlein, "Improvements in and relating to sound-transmission, sound-recording and sound-reproducing systems," GB Patent 394325, 1933.
[22] Soundfield ST350 Portable Microphone System. http://www.soundfield.com/products/st350.php
[23] Wright et al., "OpenSound Control: State of the Art 2003," in Proc. NIME, (2003).
[24] "Music Boxes," http://mif.co.uk/event/music-boxes/
[25] J. Daniel, R. Nicol, S. Moreau, "Further Investigations of High-Order Ambisonics and Wavefield Synthesis for Holophonic Sound Imaging," in AES 114th Convention, (2003).
[26] D. de Vries, E.W. Start, V.G. Valstar, "The Wave Field Synthesis concept applied to sound reinforcement: Restrictions and solutions," in AES 96th Convention, (1994).
[27] F. M. Fazi, P. A. Nelson, J. E. N. Christensen, and J. Seo, "Surround system based on three-dimensional sound field reconstruction," in AES 125th Convention, (2008).
[28] A. Farina et al., "Ambiophonic Principles for the Recording and Reproduction of Surround Sound for Music," in AES 19th International Conference, (2001).
[29] S. Moreau, J. Daniel, S. Bertet, "3D sound field recording with High Order Ambisonics - objective measurements and validation of a 4th order spherical microphone," in AES 120th Convention, (2006).
[30] E. Bourdillat, D. de Vries, E. Hulsebos, "Improved microphone array configurations for auralization of sound fields by Wave Field Synthesis," in AES 110th Convention, (2001).
[31] V. Pulkki, "Virtual sound source positioning using vector base amplitude panning," JAES, vol. 45, no. 6, pp. 456-466, (1997).
[32] M. Neukom, "Ambisonic Panning," in AES 123rd Convention, (2007).
[33] S. Brix et al., "Authoring Systems for Wave Field Synthesis Content Production," in AES 115th Convention, (2003).
[34] Flux IRCAM SPAT spatialisation plug-in. http://www.fluxhome.com/products/plug_ins/ircam_spat
[35] C. Ramakrishnan, J. Goßmann, L. Brümmer, "The ZKM Klangdom," in Proc. NIME, (2006).
[36] F. Melchior et al., "Spatial Audio Authoring for Ambisonics Reproduction," in Proc. Ambisonics Symposium, (2009).
[37] W. Woszczyk, B. Leonard, D. Ko, "Space Builder: An Impulse Response-Based Tool for Immersive 22.2 Channel Ambiance Design," in Proc. AES 40th International Conference, (2010).
[38] K. Hamasaki, K. Hiyama, R. Okumura, "The 22.2 Multichannel Sound System and Its Application," in AES 118th Convention, (2005).
[39] M. T. Marshall, J. Malloch, M. M. Wanderley et al., "Gesture Control of Sound Spatialization for Live Musical Performance," in Gesture-Based Human-Computer Interaction and Simulation, pp. 227-238, Springer, (2009).
[40] Schacher, "Gesture Control of Sounds in 3D Space," in Proc. NIME, (2007).
[41] K. Hamasaki, S. Komiyama, K. Hiyama, H. Okubo, "5.1 and 22.2 Multichannel Sound Productions Using an Integrated Surround Sound Panning System," in Proc. NAB Broadcast Engineering Conference, (2005).
[42] J. M. Comajuncosas et al., "Nuvolet: 3D Gesture-Driven Collaborative Audio Mosaicing," in Proc. NIME, (2011).
[43] J. Boehm, "Decoding for 3D," in AES 130th Convention, (2011).
[44] N. J. A. Sloane et al., "Spherical Codes," http://www.research.att.com/~njas/packings/
[45] D. Malham, "Experience With Large Area 3-D Ambisonic Sound Systems," in Proc. of the Institute of Acoustics, (1992).
[46] J. Daniel, J.B. Rault, J.D. Polack, "Ambisonics Encoding of Other Audio Formats for Multiple Listening Conditions," in AES 105th Convention, (1998).
[47] D. Menzies, M. Al-Akaidi, "Ambisonics Synthesis of Complex Sources," JAES, vol. 55, no. 10, pp. 864-876, (2007).
[48] J. Daniel, "Spatial Sound Encoding Including Near Field Effect: Introducing Distance Coding Filters and a Viable, New Ambisonics Format," in AES 23rd International Conference, (2003).