
Using Audio for Player Interaction in HTML5 Games

Jefferson Torres de Freitas∗ Ernesto Trajano de Lima† Artur de Oliveira da Rocha Franco‡

José Gilvan Rodrigues Maia§

Federal University of Ceará, Virtual University Institute, Brazil

ABSTRACT

Advances in computing hardware and software throughout the last decades have drastically reshaped the entertainment industry. Within this context, a prominent development was the emergence of interactive computer-driven products. The introduction of specialized game controllers established a paradigm in which the player exploits her hand skills for fast and flexible interaction with games. Sophisticated interaction techniques supporting body gestures and voice commands were introduced later. However, most successful technologies in this field are typically bound to specific hardware. With the advent of HTML5 and recent JavaScript performance enhancements, most modern web browsers already provide refined programmable components for audio processing that support sound analysis and synthesis in real time. This paper presents an investigation into how audio capture and processing can be used as an input device for user interaction in web games. An API supporting programmable game interaction driven by audio was designed and implemented for HTML5 using JavaScript. A prototype game was developed using this API for evaluation purposes.

Keywords: Audio, Web games, Interaction.

1 INTRODUCTION

Entertainment has become a central business in recent decades. Numerous technological advances brought forth new interaction possibilities that emerged as a new kind of product and changed the entertainment industry radically: digital games. As gaming computers and consoles became more and more popular, user input apparatus like joypads allowed people to use both hands for relatively comfortable and fast interaction. Moreover, these devices proved effective at adapting to different games' specific needs through clever programming techniques.

Other interaction devices were developed using new hardware and efficient data analysis algorithms for real-time, reliable recognition of landmark blobs that stand out from the background in infrared images thanks to specially crafted materials and controlled lighting. Recent developments accounted for an even more sophisticated and more "natural" interaction by identifying the human body and its parts, making it possible for the user to interact by expressing a myriad of inputs coming from her eye gaze, hand gestures and voice commands, for example. Early karaoke machines, for instance, used simple audio processing techniques [18], while the turn of the twenty-first century brought devices such as the Kinect, Wiimote and Oculus Rift, which have important applications such as digital heritage [21].

∗e-mail: [email protected]
†e-mail: [email protected]
‡e-mail: [email protected]
§e-mail: [email protected]

However, such interactive forms are bound to specific and usually expensive hardware. Moreover, most computer vision and image sensing techniques resort to infrared illumination, which may damage the user's retina due to prolonged exposure or when the device operates beyond maximum permissible limits due to malfunctioning.

Despite those drawbacks, using a subset of such interactive technologies is becoming feasible in recent computers as available processing power increases and domain-specific languages evolve.

Modern web browsers already have improved, standardised audio engines that allow for real-time sound analysis and synthesis. In this paper, we investigate the usage of programmable HTML5 audio interfaces for real-time user interaction within the context of web games, by means of signal processing techniques applied to continuously captured audio.

These are the main contributions of this paper:

• We investigate how real-time audio processing can provide user inputs for web games.

• We present an API specifically designed to support building such games.

• We evaluate the effectiveness of this approach by carrying out experiments with a prototype web game, developed using the proposed API, that requires well-timed inputs in real time.

• This prototype game runs on both desktop and Android platforms, which corroborates that cross-platform web games can benefit from player inputs provided via audio processing.

This work is organized in the following manner. Section 2 presents a preliminary discussion about audio, games and the web. Audio processing definitions and the underlying mathematical models are presented in Section 3, as well as the APIs available within the context of HTML5 development. Similar investigations are discussed throughout Section 4. The proposed API and its design and implementation details are covered in Section 5. Experimental evaluation and a discussion of the results are the matter of Sections 6 and 7, respectively. Finally, conclusions about this work and future research directions are pointed out in Section 8.

2 GAMES, AUDIO AND THE WEB

Audio has proved fundamental not only for digital games but also for many other digital media. This perceptual stimulus is an immersive agent that helps break the boundaries between users and characters [1] [14] [21].

2.1 Audio in Games

The first popular games in which audio was the main form of interaction were karaoke machines [7]. The processing techniques were quite primitive in the early products, taking into account only the energy of the signal when evaluating the singer's performance; in some cases, no processing was available at all, since scores were assigned randomly in some systems [18].


In 2000, Nintendo released the Voice Recognition Unit (VRU) as an accessory for the Nintendo 64 console. The VRU was capable of understanding human voice in a region-dependent fashion, and its calibration was set to cope better with children's high-pitched voices, because children were the target audience of "Hey You, Pikachu!", the mainstream game supporting the technology.

Other interaction forms such as the Wiimote®, the Kinect® and the Oculus Rift® acquired substantial prominence, despite none of them being permanently established as a de facto standard for interaction in games.

Games have used sound to indicate a number of different things. Here is a brief overview of how audio is used in electronic games:

• Simon was a hugely popular toy in the late 70's that used colored buttons corresponding to tones to challenge the player's memory. Basically, the player must memorize and reproduce a sequence that grows at each iteration [4].

• Super Mario World by Nintendo (1990) played the background music faster when the protagonist was about to face death due to a timeout.

• Silent Hill, released in 1999, used broken, mute piano keys in an intriguing situation to challenge the player's ability to solve puzzles.

• Modern gaming consoles support voice command recognition. However, it is usually applied to interacting with menus rather than with the games themselves.

• More recently, A Blind Legend 1 went a level beyond. It is considered by many to be the first "audio-only action/adventure mobile video game", since it does not resort to a visual representation. A huge sound design and programming effort was necessary to bring correct feedback to the player, who interacts using the standard gesture recognition available on mobile platforms.

On the other hand, other products assume that audio must be used not only as a stimulus, but also as a form of user input. There are games on the market where the player interacts through audio, such as SingStar® 2 and Karaoke Revolution [18]. Such games were inspired by the karaoke genre, which had an analog origin whose creation is attributed to Daisuke Inoue [7].

2.2 Audio Processing for the Web

Nevertheless, despite the remarkable technological advances in computing and the advent of alternative, highly sophisticated input devices based on cameras and microphones, audio has been explored only to a small extent as a source of user input.

Moreover, recent additions to the web software stack provide new opportunities for interactive hypermedia, specially with the advent of the HTML5 standard [6] that has been incorporated into modern browsers. As the HTML5 standard emerged, so did new opportunities for the development of robust and efficient APIs that support audio processing.

This powerful, widely adopted standard includes interactions with different media, protocols and programming languages. An HTML5 page can process video streams and audio captured directly from devices possibly available on the user's hardware [10]. Moreover, WebGL represents a significant boost in both graphics performance and visual programming flexibility with the introduction of programmable visual effects via shaders, which are coded in a high-level language and interact with the web page by means of scripts.

1 http://www.ablindlegend.com/
2 https://www.singstar.com/

It is also worth noting that since WebGL closely resembles the compact and efficient GL ES API widely incorporated into mobile devices such as smartphones, tablets and smart TVs, HTML5-compliant web pages can display interactive graphics that benefit from hardware acceleration on those devices. On the other hand, HTML5 also features the Canvas API [5] for programming two-dimensional graphics based on immediate-mode drawing. Both WebGL and the HTML5 Canvas are epitomes of the flexibility and performance necessary for developing minimally convincing web games.
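
For illustration, a minimal immediate-mode Canvas call looks like the sketch below. This snippet is not from the paper; the canvas element is assumed to exist on the page.

var canvas = document.querySelector( 'canvas' );   // assumed to exist in the page
var ctx2d = canvas.getContext( '2d' );
ctx2d.fillStyle = 'green';
ctx2d.fillRect( 10, 10, 50, 50 );   // drawn immediately, not retained as a scene object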

This possibility, combined with the evolution of JavaScript (JS) technology, enabled the mass production of content and also of domain-specific application programming interfaces (APIs), such as the Web Audio API 3. Modern browsers make extensive use of JavaScript for page logic control, which is becoming a popular alternative since such implementations do not require plug-ins or other external programs. JS controls the many components available for rendering graphics and sounds, as well as the representation of the web page itself.

Some APIs, such as the Web Speech API, have complex built-in algorithms, such as the voice recognition available in the Chrome browser. Such multimedia resources became common even in other Web standards, as in aural CSS 4, which is now obsolete. However, this initiative has led to the proposition of a new standard called CSS Speech 5, which was still under development at the time this paper was written.

This paper focuses on audio decomposition methods applicable to in-game interaction that benefit from the audio processing APIs supported by browsers. As far as we could search, there are no specific APIs for using real-time audio as input for web games. This can be explained by considering that implementations of HTML5 are somewhat recent. From the perspective of developers aiming to offer innovative products, it is essential to research the possibilities that can be explored in order to enable this new kind of interaction in games. However, using these APIs requires standard-compliant implementations that are not yet widely available.

3 THEORY

Our world is filled with sounds that most living beings perceive and analyze in order to act in an environment. However, the term sound is more generic than audio, since sound can be produced by any source. Audio processing is a term widely employed by technicians and researchers in the signal processing field, since audio refers to a special category of sounds originating from a recording, transmission or electronic device.

Despite the numerous efforts made over the years in this field, some challenges have remained since they were first faced by researchers and enthusiast practitioners, specially considering the general case in which the application's environment is uncontrolled [19] [17]. Examples are the extraction of sounds of interest and the suppression of noise that hinders the effectiveness of audio in a specific situation, such as trying to extract the human voice from audio recorded at a beach [9].

Therefore, it is necessary to bear in mind that obtaining a digital audio processing implementation requires some theoretical background regarding the nature of this task.

3.1 Physics

Sound may be defined as the biological perception of countless mechanical waves that propagate from sound sources through a medium until they finally arrive at the auditory apparatus. This comprises an extremely complex process even under simplifying assumptions such as an ideal, uniform physical medium with constant, medium-dependent propagation speed [8].

3 https://www.w3.org/TR/webaudio/
4 http://www.w3schools.com/cssref/css_ref_aural.asp
5 https://www.w3.org/TR/css3speech/

Consequently, how the wave actually spreads through space is an intricate physical process that varies in time and depends on both the shape and the physical properties of virtually every single object in the propagation environment.

Effective analysis of this phenomenon requires knowledge of signal processing concepts, which can be divided into two categories according to the nature of the signal: analog signals and digital signals [12]. This work considers only discrete digital signals, which are, in fact, a time series composed of samples represented by a sequence of values, each of them representing a point in time. Moreover, samples are considered to be equally spaced in time, i.e., the sampling rate is constant.
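
As a minimal illustration of such a uniformly sampled time series (not part of the paper's code), one second of a 440 Hz sine tone at a constant sampling rate can be generated as follows.

var sampleRate = 48000;                       // samples per second (constant rate)
var signal = new Float32Array( sampleRate );  // one second of audio
for( var t = 0; t < signal.length; t++ )
    signal[ t ] = Math.sin( 2 * Math.PI * 440 * t / sampleRate );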

This means that audio has an intrinsically one-dimensional nature, i.e., an ideal processing mechanism must be endowed with the ability to deftly identify and separate the constituent features that make up the time series as a whole.

The latter statement explains the difficulty in obtaining reliable computational methods for digital audio processing in poorly controlled environments, making this a highly challenging task when compared to its biological counterpart. Hence, it becomes clear that resorting to an alternative representation may better suit automatic processing by computer algorithms.

3.2 Mathematical Models

Representing audio as a time series presents many drawbacks that can be overcome by adopting a suitable mathematical model. Fourier series represent signals as a sum of infinitely many terms, each of them being a periodic function with a given frequency, expressed in terms of sines and cosines. This representation is of utmost importance for audio processing, since the signal is observed as a decomposition over a wide range of frequencies [15]. Moreover, pure tones are modelled as periodic functions [8].

The Discrete Fourier Transform (DFT) is a simple procedure to determine the frequency contents of a set of samples, but it is inefficient and costly, being an algorithm of asymptotic complexity O(n²), where n is the number of samples. Despite its polynomial complexity, the required processing time grows very quickly as the number of sample points increases, bearing in mind that usually tens of thousands of samples per second are used to represent audio [14] [15].

This limitation is often avoided by adopting the Fast Fourier Transform (FFT), based on the technique proposed by Cooley and Tukey [2]. The FFT follows directly from the definition of the DFT, but it becomes highly efficient in relation to its traditional counterpart thanks to a series of optimizations that are applicable when the number of samples is a power of 2 [12]. The FFT algorithm usually has asymptotic complexity O(n log n), with significantly better performance than the DFT, and it also yields greater precision in many cases.
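
To make the O(n²) cost concrete, here is an illustrative naive DFT of a real-valued signal; this is a sketch for exposition, not code from the proposed API. The nested loops perform n multiply-adds for each of the n frequency bins.

function naiveDFT( samples ) {
    var n = samples.length;
    var magnitudes = new Array( n );
    for( var k = 0; k < n; k++ ) {          // one output bin per frequency
        var re = 0, im = 0;
        for( var t = 0; t < n; t++ ) {      // n terms per bin: O(n^2) overall
            var angle = -2 * Math.PI * k * t / n;
            re += samples[ t ] * Math.cos( angle );
            im += samples[ t ] * Math.sin( angle );
        }
        magnitudes[ k ] = Math.sqrt( re * re + im * im );
    }
    return magnitudes;
}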

Moreover, real-world signals contain undesired sounds, also known as noise. This spurious information needs to be suppressed or diminished, which is usually accomplished by filtering the signal. It is also important to consider that uncontrolled sound power levels may injure the user's hearing apparatus. Taking the hearing threshold as a reference, sounds under 50 dB are not considered harmful to hearing, while a normal conversation equates to 60 dB. Sounds between 80 dB and 100 dB, in their turn, are considered troublesome and stressful, thus being harmful to health. Sound intensities equal to or greater than 120 dB, known as the pain threshold, cause irreversible damage to the inner ear [8].

4 RELATED WORK

4.1 Audio as Input for Games

Tsai and Lee [18], in their experiments to automate the evaluation of singing performance within karaoke systems, take into account characteristics such as pitch, volume (sound intensity) and rhythm of the notes sung, as well as the duration of each note, in order to optimize these systems. Starting from those characteristics, a comparison is made between singer recordings and MIDI files in order to evaluate sound harmony. In this case, harmony is defined by the sequence of notes sung in the correct tone along with an adequate duration of each note.

Lima et al. [11] investigated using computer vision and voice command recognition for interacting with games in 3D environments. They tackle voice recognition from a machine learning approach using the time-consuming baseline learning techniques K-Nearest Neighbors [3] and Support Vector Machines [20]. However, the evaluation presented in their paper did not provide enough evidence that an effective recognition was obtained.

Silva and Pozzer [16] applied voice commands to control interactive story generation by analyzing sentences emitted by users. Sentences are represented as binary trees for syntactical analysis, which correctly evaluates about 80% to 90% of the users' requisitions using both conventional and Kinect microphones for audio capture. This result can be classified as satisfactory even though the system does not recognize some words.

Antas et al. [13] propose an approach similar to the latter, but considering typical point-and-click interaction for handicapped players. These authors employed bootstrapping to improve recognition performance, achieving an error rate below 5%.

There Came an Echo 6 is a real-time strategy game focused on narrative, released in February 2015. In this game, the player uses her voice to issue commands for controlling military units on the battlefield. The speech recognition technology employed in this game requires certain conditions to ensure acceptable functioning, including an acoustic model suited to the user's accent. Smooth and paused diction is necessary, since the game requires real-time interaction without allowing interruptions in combat situations. Some usage conditions seem incompatible with the technology for unclear reasons, such as speaking too close to the microphone or using advanced capture configurations.

4.2 Audio APIs for the Web

The Web Audio API 7 is proposed for both loading audio files via AJAX and handling frequency ranges. This allows for more advanced operations, such as sound synthesis, signal analysis, data conversion, signal filtering, sound spatialization, etc. This API also uses HTML5 audiovisual elements to communicate with multimedia devices for real-time reproduction. The API is composed of 33 programming interfaces, 2 of which are obsolete, each one with a specific functionality; they are connected to define routing graphs for the general rendering of sound. The most outstanding interfaces are depicted in Figure 1.
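
As a minimal sketch of such a routing graph (standard Web Audio API calls, not code from the paper), an oscillator source can be connected through a gain node to the destination:

var ctx = new AudioContext();
var osc = ctx.createOscillator();    // source node
var gain = ctx.createGain();         // processing node
osc.frequency.value = 440;           // a 440 Hz tone
gain.gain.value = 0.5;               // halve the volume
osc.connect( gain );                 // source -> gain
gain.connect( ctx.destination );     // gain -> speakers
osc.start();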

Timbre.js 8 is a library that provides functional audio processing and synthesis for web applications based on modern JavaScript technologies, such as jQuery and Node.js. By "functional" it is meant that the API uses functional programming concepts to define processing elements and sound synthesis. The library is composed of "Timbre Objects", referred to as "T-Objects", whose structure is the same as in the Web Audio API. According to the author, this architecture was proposed to address the next generation of audio processing technologies for the web.
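
For instance, assuming the T-Object constructor documented for Timbre.js, synthesizing and playing a sine tone is a one-liner; this sketch follows the library's documented style rather than code from the paper.

T("sin", { freq: 880, mul: 0.5 }).play();   // an 880 Hz sine at half volume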

6 http://www.playiridium.com/games
7 https://www.w3.org/TR/webaudio/
8 http://mohayonao.github.io/timbre.js/


Figure 1: Programmable audio interfaces from the Web Audio API.

The most outstanding feature of Timbre.js is providing means to synthesize sounds with relative simplicity. Moreover, using this API only requires adding a JS library to the page, which can be done with just one line of code. Given these features, Timbre.js also has considerable value as a teaching tool. However, the programming interface available for audio processing is unintuitive despite being quite powerful, since it is necessary to master a rather intricate functional notation while, on the other hand, the debugging support is quite limited.

The Web Speech API 9 enables the incorporation of voice recognition into web pages. It allows developers to use speech recognition as input, generating textual outputs. The programming model is not very effective, because the programmer must select the words of interest from a list reported by the API through a callback mechanism. In this way, the API fails to abstract this problem away from the developer.
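
A minimal sketch of that callback mechanism, using the webkit-prefixed recognition interface available in Chrome (not code from the paper):

var recognition = new webkitSpeechRecognition();
recognition.lang = 'en-US';
recognition.onresult = function( event ) {
    // the programmer must inspect the reported alternatives
    var best = event.results[ 0 ][ 0 ];
    console.log( best.transcript, best.confidence );
};
recognition.start();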

5 PROPOSED FRAMEWORK

This section presents the API model as applied in modern web browsers, and the code structure responsible for processing the data obtained through the reused Web Audio interfaces. The API uses JavaScript, the language in which we developed the modules for configuration, requisition and control of multimedia devices, data analysis, statistical processing for game-state decision making, resource liberation for audio context control, environment noise calibration, and audio loading.

5.1 Architectural Design

The API proposed in this paper is structured in modules. JavaScript was used to develop the API because the language is accepted by virtually all modern browsers that follow HTML5. Also, recent advances in JavaScript engines have conferred better execution-time performance, making it possible to carry out more computationally intense tasks, such as processing audio signals.

A flowchart was created of the interfaces used from the Web Audio specification present in the HTML5 standard (Figure 2).

5.1.1 Configuration Module

The basic settings for the API are encapsulated by this module, as illustrated in Listing 1.

var _setup = function _setup( stream ) {
    _mediaSource = _audioCtx.createMediaStreamSource( stream );
    _analyser.fftSize = 1024;
    _analyser.smoothingTimeConstant = 0;
    _fftArray = new Uint8Array( _analyser.frequencyBinCount );
    _mediaSource.connect( _analyser );
}

Listing 1: Configuration script for setting most low-level processing options.

9 https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html

Figure 2: Proposed architecture and its modules for: (left, in green) configuration, multimedia device requisition and liberation; (middle, in yellow) analysis and statistics; (middle bottom, in red) calibration; (top right, in gray) audio file loading; and (bottom right, in light blue) utility functions. These modules are built on top of Web Audio API interfaces.

The mediaSource interface is generated and receives a stream as a parameter. The stream is assigned at the moment permission to use the multimedia device is granted. The parameters fftSize and smoothingTimeConstant define the FFT window size and the smoothing over time of the processed samples, respectively. It is important to mention that, among the various window types used with the Fourier transform, Web Audio only uses the Blackman window. Windows are useful to minimize the effects of a phenomenon known as spectral leakage. Spectral leakage commonly occurs in truncated waves, which generate discontinuities in the signal. These discontinuities appear among the FFT amplitudes as high-frequency components that are not present in the original signal. There are several different window types with specific characteristics that may be applied to a signal. The Blackman window is in widespread use 10.
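
For reference, the classic Blackman window follows the closed form below; this helper is illustrative and is not part of the proposed API, since Web Audio applies the window internally.

// w[n] = 0.42 - 0.5*cos(2*PI*n/(N-1)) + 0.08*cos(4*PI*n/(N-1))
function blackman( n, N ) {
    return 0.42 - 0.5  * Math.cos( 2 * Math.PI * n / ( N - 1 ) )
                + 0.08 * Math.cos( 4 * Math.PI * n / ( N - 1 ) );
}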

The fftArray object is the structure in which the audio samples are stored; only half of the bins are considered, on account of the mirroring (symmetry) property of the signal's spectrum. At the end, a connection is made with the analyser interface, so that all the captured information is processed.

5.1.2 Multimedia Device Requisition Module

The request for the user to allow use of a multimedia device is performed by the liveInput() method; in this case, access to the microphone is requested. If access is granted, the configuration module is called as a success callback, as shown in Listing 2. Otherwise, an error message is displayed. Additionally, this module is configured to operate with the major browsers at the time of this work, such as Firefox, Chrome and Opera.

10 http://www.ni.com/whitepaper/4844/en/


this.liveInput = function liveInput() {
    var constrains = window.constrains = { audio: true, video: false };
    navigator.mediaDevices.getUserMedia( constrains )
        .then( function _onSuccess( stream ) {
            _setup( stream );
        } )
        .catch( function _onError( error ) {
            console.log( error.name + ': ' + error.message );
        } );
}

Listing 2: The API gently asks for user confirmation in order to behave correctly.

5.1.3 Liberation Module

The Liberation Module is responsible for releasing the audio context and its interfaces from the web browser. This module is necessary to ensure the good functioning of the API by promptly releasing resources that become idle, which helps to avoid harmful interference with other multimedia applications that may be competing for the same resources. This approach is important to provide compatibility with mobile devices, and it is depicted in Listing 3.

this.free = function free() {
    if( navigator.getUserMedia )
        navigator.getUserMedia = undefined;
    if( _audioCtx != undefined ) {
        _analyser.disconnect();
        _mediaSource.disconnect();
        _audioCtx = undefined;
    }
}

Listing 3: Audio context liberation avoids undesirable erratic states after interaction.

5.1.4 Real-time Analysis and Processing Module

A Fast Fourier Transform is applied in this module to convert the samples from the time domain to the frequency domain. After that, the fftArray structure is filled and returned with the newly processed frequencies. This is shown in Listing 4.

this.realTimeAnalysis = function realTimeAnalysis() {
    _analyser.getByteFrequencyData( _fftArray );
    return _fftArray;
}

Listing 4: Overview of the analysis and processing module. A callback function provides programmer-defined code with FFT data.

5.1.5 Statistical Module

This module is responsible for calculating values representative of the samples in the frequency domain. Based on the real-time processing, it calculates the overall average of the samples. Moreover, this module is built on top of the utils.js script, which provides prototype properties and methods inherited by other instances, such as the mean method described in Listing 5.

Uint8Array.prototype.mean = function mean() {
    var result = 0;
    for( var i = 0; i < this.length; i++ )
        result += this[ i ];
    return ( result / this.length );
}

this.getFrequencyAverage = function getFrequencyAverage() {
    var array = this.realTimeAnalysis();
    return array.mean();
}

Listing 5: Statistics are used for processing incoming signals in order to detect salient features such as loud sounds or distinctive frequencies.

The mean method is responsible for computing the average of the samples stored in an array of unsigned 8-bit integers. The getFrequencyAverage() method invokes the real-time analysis and processing module and returns the average of the signal. This average is taken as the loudness of the signal in decibels.

5.1.6 Calibration Module

The Calibration Module also depends on the timer.js script, which is responsible for creating a countdown timer defined in seconds. Basically, with the use of the other modules, the overall average of the ambient noise, denoted by the parameter noiseAverage, is calculated, and the obtained result is measured in decibels. This result is then stored in the parameter ambientNoise. This is depicted by the code snippet in Listing 6.

this.calibrate = function calibrate() {
    // avoids already calibrated state
    if( this.isCalibrate )
        return;
    // starts a 5s countdown limiting the setup
    if( !_timer.isStarted ) {
        _timer.start();
        _timer.counter = setInterval( countdown, 1000 );
        _noiseAverage = 0;
    }
    if( _timer.timeRemaining > 0 )
        _noiseAverage += this.getFrequencyAverage();
    else if( !this.isCalibrate ) {
        // restarts if needed
        _timer = new Timer( 5 );
        var time = _timer.countDown * FPS;
        // computes the noise over time
        this.ambientNoise = _noiseAverage / time;
        // (...)
    }
}

Listing 6: The calibration module's logic uses timeouts to consider a given time period, which is usually set to 5 seconds.

Depending on the result, a new calibration can be done as the user sees fit. The parameter isCalibrate indicates whether the calibration was done successfully. It is important to notice that the estimated average is equal to the sound intensity.

5.1.7 Audio File Loading Module

This module is responsible for loading the time series representing audio clips. This process is performed in the following manner, given an asynchronous xhr instance representing a standard HTTP request that accesses the underlying data. Several events are implemented, including the main event, called onload(), so that audio file data contained in an ArrayBuffer is decoded asynchronously; these data are loaded from the response attribute defined by the responseType. The decoded AudioBuffer is then re-sampled to the AudioContext's sampling rate, which, by default, is set to 48kHz. Finally, the result is assigned, via callback, to the buffer of a sound source. This process is shown in Listing 7. It is important to note that very large files can cause sluggishness, and even loading errors.

this.loadSound = function loadSound( url ) {
    _xhr = new XMLHttpRequest();
    //(...)
    // Complete callback
    _xhr.onload = function onload() {
        var audio = _xhr.response;
        // Decode the data
        _audioCtx.decodeAudioData( audio, function( decodedAudio ) {
            _bufferSource.buffer = decodedAudio;
        });
    }
    //(...)
}

Listing 7: Audio loading module implementation.

5.1.8 Audio Recording Module

This module is particularly useful for testing and experimentation purposes. For example, it is possible to record audio from a user interaction in order to reproduce any detected processing problems and address them more effectively by inserting markers on the timeline. This supports the underlying developments necessary to evolve the API and to fix eventual bugs. The code listing is omitted for presentation purposes.

5.2 Measuring Noise

Noise measurement is a simple procedure that aims to estimate the intensity of background sounds emitted in the environment. This is performed according to audio information captured from the microphone during a given time interval. Once samples are captured within a window for that time period, the module then computes the mean loudness as the noise level perceived in that environment.

5.3 Generating Events: Peak Detection

Peak Detection is an event-generating process. It checks for amplitude peaks in a frequency-domain signal. This event makes it possible to take several decisions, since programmers can check for a wide range of discrepant amplitudes at any frequencies in the signal. This definition is used in the interaction proposed by this work, which is described later. It is noteworthy that, in addition to Peak Detection, other types of detection exist, for example Beat Detection, which takes into account the rhythmic perception of beats.
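
A hedged sketch of the idea follows; the paper's actual procedure is depicted in Figure 3, and the helper below, its name and its threshold parameter are illustrative. An event can be generated when some frequency bin stands out from the calibrated ambient noise.

function detectPeak( fftArray, ambientNoise, threshold ) {
    for( var i = 0; i < fftArray.length; i++ ) {
        if( fftArray[ i ] - ambientNoise > threshold )
            return i;   // index of the first peaking frequency bin
    }
    return -1;          // no peak detected
}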

5.4 Initialization

The API can be initialized in only one line of code by creating an AudioController instance, as shown in Listing 8.

The target audio context is created when the API initializes. Alternatively, an existing context can be reused.

Figure 3: Peak detection procedure.

var API = new AudioController();
// ... or ...
var API = new AudioController( audioContext );

Listing 8: API initialization requires a new AudioController object.

In addition, the API can be configured by changing parameters such as the FFT size and the minimum and maximum decibels for feature detection. Otherwise, the default values of 1024, −90 and 0, respectively, are assigned to these parameters.

5.5 Integrating the API into a Game

Integration between the proposed API and games is also a fairly simple process: after the API is initialized, the analysis and real-time processing modules provide a vector filled with samples in the frequency domain. With this structure, programmers can define events for their applications. In the game prototype presented in this paper, events are generated based on the Peak Detection approach, using the average amplitude of the samples as a loudness estimate. The event is fired whenever the sound intensity produced by the player is above a certain threshold.
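
As an illustration of this integration pattern (names such as update, game.fireEvent and JUMP_THRESHOLD are assumptions, not part of the paper's API), a game loop can poll the analysis module each frame and fire an event when the loudness estimate crosses the threshold:

function update() {
    var spectrum = API.realTimeAnalysis();           // Uint8Array of frequency bins
    var loudness = spectrum.mean();                  // average amplitude (utils.js)
    if( loudness - API.ambientNoise > JUMP_THRESHOLD )
        game.fireEvent( 'jump' );                    // assumed game-side hook
    requestAnimationFrame( update );
}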

6 EVALUATION

Next, we describe an "endless runner" game prototype, with some screenshots and relevant details. Such a game is characterized by rapid, synchronized actions that steer a character undergoing constant movement through a scenario. This genre was chosen because it demands from players both attention and a considerable level of control over the mechanism used for interaction.

Finally, we present the results concerning our implementation and its actual behavior during tests. It is noteworthy that the calibration and game prototype experiments are important not only to validate the research line of this work, but also to foster future work in the area and the development of applications of this nature.

6.1 Developing a Prototype Game

The application, in general terms, consists of two phases. The first comes down to audio calibration, which acquires the average noise in the environment. This process starts as soon as permission to use the microphone is granted in the browser, and it finishes after five seconds of continuous measurements. In this final step, the resulting noise state of the environment is obtained, defined by the parameter ambientNoise. After that, depending on the average noise intensity, the user is offered the option to perform a recalibration. Figure 4 illustrates the noise states.

Figure 4: Screenshots of calibration states.

The noise states are defined as Excellent, Great, Caution and Critical, classified according to the value of ambientNoise over the decibel intervals 0 to 5, 6 to 10, 11 to 15, and 16 onwards, respectively. Recalibration is not required when the state is Excellent, but it is mandatory if the state is Critical. Recalibration is optional in the other states.
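
A hypothetical helper mirroring this classification (illustrative only; the prototype's actual code is not shown in the paper):

function noiseState( ambientNoise ) {
    if( ambientNoise <= 5 )  return 'Excellent';  // recalibration not required
    if( ambientNoise <= 10 ) return 'Great';      // recalibration optional
    if( ambientNoise <= 15 ) return 'Caution';    // recalibration optional
    return 'Critical';                            // recalibration mandatory
}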

After calibration, the application enters its second phase: the game, in the infinite runner style. At this stage, interaction is done via audio input; the value of the parameter ambientNoise is considered when computing the parameter freq_points, which records the sound intensity of the captured signal. This parameter is defined by the average over the samples obtained at run time, as in the following code snippet.

if( this.GAME_START && _timer.timeRemaining <= 0 ) {
    this.freq_points = this.audio.getFrequencyAverage();
    var total_freq = this.freq_points - this.audio.ambientNoise;
    //(...)
}

Listing 9: The calibration step conveniently hides most of the game's initialization from users.

If the total_freq parameter exceeds a threshold of 20 dB, an event is triggered that makes the character jump.

Moreover, the character has an inventory of size one and a Finite State Machine (FSM) that manages its lifecycle. The states are: Normal, Weak and Invincible. The damage the character takes upon collision with an enemy is related to its current state: in Normal mode one unit of the life parameter is discounted, in Invincible mode there is no discount, and in Weak mode twice the normal damage is deducted.
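
A hedged sketch of this damage rule (the state names come from the paper; the function and field names are assumptions):

function applyDamage( character ) {
    switch( character.state ) {
        case 'Normal':     character.life -= 1; break;  // one unit of damage
        case 'Weak':       character.life -= 2; break;  // double damage
        case 'Invincible':                      break;  // no damage
    }
}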

The character's state varies according to the item obtained throughout the game; the item is stored in the inventory until a timer parameter equal to 500 (processing cycles) reaches zero, i.e., similar to the concept of a cooldown. Only then is the inventory slot released to store another item. The points parameter is increased by one whenever there is no collision with an enemy. In turn, one hundred points are scored when a coin is acquired. Figures 5 and 6 show the states of the character and the existing items, respectively.

Figure 5: FSM managing the player's avatar status.

Figure 6: Sprites currently used as placeholders in the prototype game. All characters are property of their respective owners. These sprites and images are free.

Initial experiments and development adopted a PC with an Intel i7-4500U CPU (1.8 GHz) and 16GB RAM, running a 64-bit operating system. Audio was captured using the device's built-in microphone. A capture of the actual game running on a notebook is shown in Figure 7. The prototype game was also executed on a modest Android cell phone, as shown in Figure 8. This device has a 1.3GHz Dual Core processor and 457MB RAM, with Android 4.4 KitKat.

Figure 7: Prototype game executing on the PC.

6.2 Experiment I: Calibration

Calibration is an essential practice in applications that have audio as their main agent. Calibration is useful for optimization, since it enables partial or total suppression of the noise in the signal, depending on how it is implemented.


Figure 8: Prototype game running on a modest Android cell phone with a 1.3GHz Dual Core processor and 457MB RAM. Execution did not require any modifications to the game's resources or code except adapting to the screen dimensions.

At first, the prototype did not use calibration and displayed many problems, such as intermittent jumps and, sometimes, no jump events being detected at all. Calibration significantly increased the accuracy of triggering the jump event, and it offers feedback to the user about how fit the environment is for audio processing applications.

6.3 Experiment II: Testing the Game

The game was run on both desktop and Android platforms in an ordinary room with no soundproofing. In fact, there were two air conditioners operating in the room, plus the occasional people walking near the testing site.

A group of 5 players was invited to assess how well they could perform the simple actions in the game. Moreover, they were given total freedom to explore the interaction. At first, two players were screaming to make the character jump. Soon, some users started resorting to different alternatives to produce beats, such as snapping their fingers and hitting the table.

It must be observed that the game required users to focus on strict timing. After a short adaptation period of less than a minute, players could jump either to avoid obstacles or to collect potions. Despite occasional loud sounds caused by external sources, the interaction was considered suitable by the focus group evaluating the product. In fact, this kind of interaction calls for headphones for a better experience, because audio feedback is important for greater immersion.

7 DISCUSSION

The objective of this work, a study on real-time audio processing and its application to gaming interaction on the web, has been achieved. We have also shown that such technology fits modern mobile devices such as smartphones and tablets, as game execution did not require anything but an HTML5-compliant web browser.

Due to the technical limitations usually displayed by more sophisticated processing methods, our main focus in this paper was applying baseline processing techniques in order to achieve real-time performance. Moreover, if the sampling window is set to wider values, there may be two reasons for sluggish frame rates: (a) the window captures events shifted in time, because the samples represent a large time period, and (b) processing requires more computing time. We observed that a sampling window about 1024 samples wide usually provides both performance and reliable interaction. Nevertheless, interactive methods based on complex pattern recognition may resort to larger sampling windows in order to cope with words and sentences spoken in different rhythms.

Therefore, we exerted effort in carefully choosing a game genre that suits our evaluation purposes. We adopted a simple game design requiring fast-paced real-time interaction instead of voice command recognition [11] [16] [13]. Player interaction was considered suitable for real-time interaction in actual games: we corroborate through experimentation that players can easily adapt to this interaction method in no time.

For practical reasons, audio output should preferably be used when the player is wearing a headset. Moreover, the player has to switch between using her voice and other ways to produce amplitude peaks, such as impacts, in order to avoid fatigue.

8 CONCLUSION

As shown by the practical experimentation we carried out, the Web Audio API for browsers presented good efficiency even on mobile platforms. Our prototype game developed using the HTML5 Canvas shows that it is possible to build cross-platform applications and games for the web with processed audio as input.

It is important to observe that more sophisticated mechanisms based on the recognition of complex patterns can experience several operating restrictions. Therefore, we opted for a baseline analysis using sound intensity and the manipulation of the respective signal in the frequency domain. Using our own API specially designed for this task, peak detection was used to interact in real time in a typical runner game: players managed to jump over obstacles, collect items, and also time their jumps to avoid undesired potions.

Other applications for digital games were discussed throughout the development of this work, which would require a multidisciplinary development team and an improvement of the techniques used to date.

There are several fronts of future development made possible by this study. Here are some examples of research directions we believe stand out for further investigation:

• A proper study of the effective use of audio inputs in different game genres and mechanics. This would enable building guidelines for development teams and their game designers on using audio inputs;

• API extensions and adaptations for other platforms and programming languages;

• Proposition, development and evaluation of other ways to extract useful information for user interaction from audio. This possibly includes the proposal or adaptation of more sophisticated techniques for recognizing speech, timbres and specific sounds recorded by the user;

• Computer-aided voice health care;

• Some audio applications, such as soundtracks and special effects, sensitize the spectator at specific times during the application, for example by triggering fear or euphoria. Audio processing could thus be used as a parameter to direct the automatic production of game content.

9 ACKNOWLEDGEMENTS

This work is supported by the TEJO-UFC research group and was partially developed at the Bimo-UFC lab. We would also like to thank João Ramos Filho for the artwork and his invaluable help, as well as the anonymous reviewers for their considerations.

REFERENCES

[1] K. Collins. Game Sound: An Introduction to the History, Theory, and Practice of Video Game Music and Sound Design. MIT Press, 2008.

[2] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comput., 19:297–301, 1965.


[3] T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.

[4] O. Edwards. Simonized: In 1978 a new electronic toy ushered in the era of computer games. Smithsonian Magazine, 2006.

[5] S. Fulton and J. Fulton. HTML5 Canvas. O'Reilly Media, Incorporated, 2013.

[6] I. Hickson, R. Berjon, S. Faulkner, T. Leithead, E. D. Navara, E. O'Connor, and S. Pfeiffer. HTML5: A vocabulary and associated APIs for HTML and XHTML. 2014.

[7] S. Hosokawa and T. Mitsui. Karaoke Around the World: Global Technology, Local Singing. Routledge, 2005.

[8] D. M. Howard and J. Angus. Acoustics and Psychoacoustics. Taylor & Francis, 2009.

[9] N. Kirch and N. Zhu. A discourse on the effectiveness of digital filters at removing noise from audio. The Journal of the Acoustical Society of America, 139(4):2225–2225, 2016.

[10] A. Kostiainen, I. Oksanen, and D. Hazaël-Massieux. HTML Media Capture. 2014.

[11] E. E. S. Lima, C. Ruby, C. T. Pozzer, and C. N. Silva. Visão computacional e reconhecimento de comandos de voz aplicados na interação com jogos e ambientes 3D. In Proceedings of SBGames 2010, volume 1, pages 350–353. SBC, Nov. 2010.

[12] R. G. Lyons. Understanding Digital Signal Processing. Pearson Education, 2010.

[13] R. A. M. Antas, G. Souto, and R. Valentim. Interface adaptável para jogos digitais: Jogando com a voz. In Proceedings of SBGames 2015, volume 1, pages 248–251. SBC, Nov. 2015.

[14] M. K. Mandal. Multimedia Signals and Systems, volume 716. Springer Science & Business Media, 2012.

[15] K. Pohlmann. Principles of Digital Audio. McGraw-Hill Professional, 2010.

[16] L. J. S. Silva and C. T. Pozzer. Controle de geração de histórias interativas através de comandos de voz. In Proceedings of SBGames 2014, volume 1, pages 1038–1041. SBC, Nov. 2014.

[17] A. Spanias. Advances in speech and audio processing and coding. In 2015 6th International Conference on Information, Intelligence, Systems and Applications (IISA), pages 1–2. IEEE, 2015.

[18] W.-H. Tsai and H.-C. Lee. Automatic evaluation of karaoke singing based on pitch, volume, and rhythm features. IEEE Transactions on Audio, Speech, and Language Processing, 20(4):1233–1243, 2012.

[19] V. Välimäki and J. D. Reiss. All about audio equalization: Solutions and frontiers. Applied Sciences, 6(5):129, 2016.

[20] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

[21] S. Webel, M. Olbrich, T. Franke, and J. Keil. Immersive experience of current and ancient reconstructed cultural attractions. Digital Heritage International Congress (DigitalHeritage), 1:395–398, 2013.
