
    Swiping with Luminophonics

Shern Shiou Tan, Tomas Henrique Bode Maul
School of Computer Science

    The University of Nottingham Malaysia Campus

    Jalan Broga, 43500 Semenyih

    Selangor Darul Ehsan, Malaysia

Neil Russel Mennie, Peter Mitchell
School of Psychology

    The University of Nottingham Malaysia Campus

    Jalan Broga, 43500 Semenyih

    Selangor Darul Ehsan, Malaysia

Abstract: Luminophonics is a system that aims to maximize cross-modality conversion of information, specifically from the visual to the auditory modality, with the motivation of developing better assistive technology for the visually impaired through image sonification techniques. The project aims to research and develop generic and highly configurable components concerned with different image processing techniques, attention mechanisms, orchestration approaches and psychological constraints. The swiping method introduced in this paper combines several techniques in order to explicitly convert the colour, size and position of objects. Preliminary tests suggest that the approach is valid and deserves further investigation.

Index Terms: Image Processing, Computer Vision, Auditory Display, Image Sonification

    I. INTRODUCTION

In 2003, the World Health Organization reported that there were about 314 million people worldwide who were visually impaired, 45 million of them completely blind [6]. This fact constitutes the core motivation behind Luminophonics, a solution that aims to help the blind regain their ability to interpret the visual world through visual-to-auditory sensory substitution (or image sonification). Because this is not a new area of research, we propose to address current limitations, mainly in the sense of maximizing the information transfer from one modality (i.e. visual) to the other (i.e. auditory). Probably the earliest successful sensory substitution solution for the blind is Braille [4].

Image sonification has been a research area for several years now. Some of the older solutions include vOICe [5] and SonART [1], whereas one of the most recent solutions is SeeColor [2]. Although these and related solutions have been shown to be successful to some degree, they tend to exhibit some or all of the following weaknesses: information loss, cacophony (uninterpretability), lack of configurability and usability, and steep learning curves. Luminophonics aims to find solutions to these weaknesses, because they hinder the application of image sonification in real-world situations.

The Luminophonics project covers experimental methods for testing different mappings between visual and auditory properties. It is also concerned with studying aspects of human perception (e.g. parallel processing limits) in order to address the conversion-maximization issue in a realistic manner. The main low-level difference between the work reported in this paper and other methods is the use of top-down swiping in conjunction with an emphasis on colour conversion (an emphasis shared with SeeColor). The rationale for the swiping method is to increase the amount of information that is converted per unit of time without sacrificing interpretability (i.e. without creating cacophony).

II. LUMINOPHONICS & CROSS-MODALITY CONVERSIONS

The core question of cross-modality conversion is determining how much information is preserved after the conversion. Apart from representation issues (e.g. dimensionality, statistical and compositional structure, and so on) and the determination of what is relevant or not, cross-modality conversions also suffer from noise, missing data and related issues.

The basic concept of Luminophonics is to develop a highly configurable and customizable platform that supports sandboxing and provides generic building blocks for the audio and vision components. Through several experiments involving different visual-to-auditory mappings, we hope to discover the most effective set of methods, yielding the best information preservation across modalities. The definition of good information preservation includes not only a rich and correct pairing between visual and auditory properties, but also ease of use, ease of learning, informational relevance, and pleasantness (i.e. aesthetic value).

The generic platform (Figure 1) consists of several modules, each dedicated to its own task, from image segmentation to sound synthesis. As Figure 1 shows, an input image is first processed by the Image Segmentation module and the Colour Heuristic Model. Both modules then send their outputs to the Decision Module before the Sound Synthesizer produces the result of the conversion.
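
To make the module chain concrete, the following is a minimal Python sketch of the process flow in Figure 1. All names (Blob, segment_image, classify_colours, decide, synthesize, sonify) are hypothetical placeholders for the four blocks described above, with trivial stub bodies; they are not the project's actual code.

    from dataclasses import dataclass
    from typing import List

    import numpy as np


    @dataclass
    class Blob:
        x: int
        y: int
        w: int
        h: int
        colour: str = "Gray"


    def segment_image(img: np.ndarray) -> List[Blob]:
        # Stand-in for the Image Segmentation module (contour-tracing labelling).
        return [Blob(0, 0, img.shape[1], img.shape[0])]


    def classify_colours(img: np.ndarray, blobs: List[Blob]) -> List[Blob]:
        # Stand-in for the Colour Heuristic Model (Section V).
        return blobs


    def decide(blobs: List[Blob]) -> List[Blob]:
        # Stand-in for the Decision Module: order blobs top to bottom for swiping.
        return sorted(blobs, key=lambda b: b.y)


    def synthesize(plan: List[Blob], sr: int = 44100) -> np.ndarray:
        # Stand-in for the Sound Synthesizer: a one-second silent stereo buffer.
        return np.zeros((sr, 2), dtype=np.float32)


    def sonify(image: np.ndarray) -> np.ndarray:
        blobs = segment_image(image)
        blobs = classify_colours(image, blobs)
        return synthesize(decide(blobs))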

By having a generic platform, different methods can be rapidly developed for proving or disproving hypotheses regarding information preservation, perceptual limits, usability, and so on. In order to conduct preliminary comparisons between methods, a standard measure of information conversion will be proposed along with statistical results from user experiments.



    Fig. 1. Generic Platform Process Flow

    III. SWIPING METHOD

Part of the inspiration for the swiping method is derived from human visual attention. Even though the human eye provides the brain with a wide field of view, visual perception is mostly limited to a focused subset of the total field. Similarly to human visual attention, which continuously and dynamically repositions and reconfigures itself in order to process relevant information, the swiping method dynamically repositions a horizontal attention band in order to gradually transfer information for sequential processing.

In the swiping method, an image is split into left and right halves for image segmentation purposes. From the segmentation results, blobs are categorized and stored from top to bottom in a buffer. From the buffer, the blobs are processed according to their properties to form a two-channel audio output. Blobs are converted into sound, band by band, in a top-down sequence. Each intra-band and inter-band delay is configurable, so that each user can optimize the recognition of images according to their individual (perceptual) differences.
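
As an illustration of this banded, top-down ordering, here is a small Python sketch that turns a list of blobs into a playback schedule. The blob dictionaries, the band height and the default delay values are assumptions made for the example; the paper only states that the intra-band and inter-band delays are user-configurable.

    def swipe_schedule(blobs, image_height, band_height=40,
                       intra_delay=0.05, inter_delay=0.25):
        """Return (onset_seconds, blob) pairs, produced band by band, top down."""
        schedule, t = [], 0.0
        for top in range(0, image_height, band_height):
            band = [b for b in blobs if top <= b["y"] < top + band_height]
            for blob in sorted(band, key=lambda b: b["x"]):
                schedule.append((t, blob))
                t += intra_delay      # configurable gap between blobs in a band
            t += inter_delay          # configurable pause before the next band
        return schedule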

The main advantage of generating a soundscape (auditory mapping) by sequentially swiping bands from top to bottom is that Y-axis information is represented clearly through time delays. The use of time delays helps to improve human sound localization along the Y-axis. One might say that the loss of spatial representational freedom (resolution and concurrency), and the resulting difficulties in object localization, constitute the central problems or limitations of visual-to-auditory sensory substitution. In an image, the data comes in two dimensions, and stereoscopy enables a third dimension (i.e. depth information). Although swiping has been used before (both vertically and horizontally) in order to address this spatial issue, to the best of our knowledge this is the first time that it has been combined with intermediate-level image processing and an emphasis on colour processing.

    IV. IMAGE SEGMENTATION

To obtain image blobs and their properties, Luminophonics applies an image segmentation technique which works in linear time and simultaneously labels connected components and their contours [3]. This technique not only provides fast, real-time segmentation for simple images but also generates blob properties such as size (width and height) and location (pixel coordinates).

Fig. 2. Process of Segmentation (panels: Input, Binary, K-means, Output)

In this method, the input images are converted into binary images for the purpose of image segmentation using a contour-tracing technique. To simplify the image further, the K-means algorithm is used to partition the binary image into multiple segments. After 5 iterations of the K-means algorithm (refer to Figure 2), noise and spurious detections are greatly reduced, which increases the accuracy of the final segmentation. In order to preserve the colour information for later conversion, the input images are kept in pairs with their binary counterparts in a memory buffer. The original colour images are stored for colour extraction, while the binary images are used for image segmentation.
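
For concreteness, the following is a rough Python/OpenCV sketch of this stage under several stated assumptions: OpenCV's connectedComponentsWithStats is used as a stand-in for the contour-tracing labelling algorithm of [3], K-means is run for the 5 iterations mentioned above, and the choice of K = 2, Otsu thresholding and 8-connectivity are illustrative guesses rather than details given in the paper.

    import cv2
    import numpy as np


    def segment(bgr_img, k=2):
        """Quantize with K-means, binarize, then label blobs with their stats."""
        pixels = bgr_img.reshape(-1, 3).astype(np.float32)
        criteria = (cv2.TERM_CRITERIA_MAX_ITER, 5, 1.0)        # 5 iterations
        _, labels, centers = cv2.kmeans(pixels, k, None, criteria,
                                        1, cv2.KMEANS_RANDOM_CENTERS)
        quantized = centers[labels.flatten()].reshape(bgr_img.shape).astype(np.uint8)
        gray = cv2.cvtColor(quantized, cv2.COLOR_BGR2GRAY)
        _, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        n, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
        blobs = []
        for i in range(1, n):                                  # label 0 is background
            x, y, w, h, area = stats[i]
            blobs.append({"x": x, "y": y, "w": w, "h": h, "area": area})
        return binary, blobs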

Two regions-of-interest (ROIs), namely a left region and a right region, are extracted from the binary image by splitting it into two parts of equal size. Splitting the image not only speeds up the process of image segmentation but also caters to the two-channel stereo audio aspect of the solution. Two different audio outputs are produced, one for each ear. Assuming that earphones are being used, the left ear can only hear the objects that exist in the left ROI, while the right ear can only hear the sounds of objects in the right ROI. If an object exists on both sides, its corresponding audio properties can be heard in both ears. This helps the user to determine the position of objects along the X-axis.

From the image segmentation stage, a set of blobs and their properties is produced in a linear format. Referring to Figure 2, the final output is shown in the bottom right corner, with the red boxes illustrating the blobs to be classified. After this, the Heuristic Colour Model classifies the blobs based on their colours. The blobs are subsequently arranged in a buffer in a swiping-compatible format before they are finally converted into audio waves.

    V. HEURISTIC COLOUR MODEL

CCD (charge-coupled device) and CMOS (complementary metal-oxide-semiconductor) camera sensors both vary in sensitivity and colour management models. The subjective interpretation of colour also varies between individuals. Hence, a standard heuristic colour model is needed for each individual camera to deduce standard colour properties based on user settings. By assessing how users perceive colour, a colour model (based on flexible thresholds) can be created for particular camera/user pairs.

The Heuristic Colour Model (HCM) is based strictly on the HSL colour model, which consists of a double symmetrical cone with a true black point at the bottom and white at the other end. H represents hue, S represents saturation and L represents lightness. To determine the mapping of a pixel value, HCM first tests the pixel's saturation and, based on this, decides whether to test its hue or its lightness.

    Fig. 3. Heuristic Colour Model Decision Chart

HCM is a simple and intuitive method for determining whether a particular pixel should be treated as a colour pixel or as a grayscale pixel. For a colour pixel, the saturation must lie between the upper and lower thresholds of the saturation scale. For a grayscale pixel, the saturation is either above the upper threshold or below the lower threshold. Refer to Figure 3 for a diagram of HCM's decision logic.
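
A hedged Python sketch of this decision logic follows. The threshold values (0.15 and 0.95), the hue bin edges and the lightness cut-offs for Black/White/Gray are illustrative assumptions; the paper states only that the thresholds are flexible and tuned per camera/user pair.

    import colorsys

    # Upper hue bound (degrees) for each chromatic category; assumed values.
    HUE_BINS = [(15, "Red"), (45, "Orange"), (75, "Yellow"), (165, "Green"),
                (255, "Blue"), (285, "Indigo"), (330, "Violet"), (360, "Red")]


    def classify_pixel(r, g, b, s_low=0.15, s_high=0.95):
        """Map an 8-bit RGB pixel to one of the 10 HCM colour categories."""
        h, l, s = colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)
        if s_low <= s <= s_high:          # chromatic pixel: decide by hue
            hue_deg = h * 360.0
            for upper, name in HUE_BINS:
                if hue_deg < upper:
                    return name
        if l < 0.2:                       # achromatic pixel: decide by lightness
            return "Black"
        if l > 0.8:
            return "White"
        return "Gray"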

VI. VISUAL TO AUDIO MAPPING

The primary visual properties being investigated in this early phase of the Luminophonics project are object colour, size and location. These properties are converted and synthesized dynamically into audio waves by manipulating different audio properties. It is crucial to understand and exploit the full range of both the audio and visual domains in order to produce a good visual-to-audio mapping [7]. An intuitive mapping of visual to auditory properties not only helps users learn how to use the technology more rapidly and effectively, but also maximizes the information preservation across modalities.

    A. Colour

Blob pixels, as captured by standard imaging sensors, are by default Red-Green-Blue (RGB) values. These values are converted into the Hue-Saturation-Lightness (HSL) colour space. Blobs are then categorized into different colours by the Heuristic Colour Model, which operates on the mean of the blob's pixels. In the first Luminophonics prototype reported here, blobs are categorized into 10 colours, namely: Red, Orange, Yellow, Green, Blue, Indigo, Violet, Black, White and Gray. In a manner analogous to the SeeColor approach, the 10 different colours are then mapped to 10 different timbres (or musical instruments).

Besides categorizing blobs into distinct colours, the mean HSL value of a blob provides further auditory variation through its lightness value, which affects the frequency (or pitch) of each timbre (or instrument).

1) Timbre: Each colour is mapped to a different and distinctive timbre. In order to satisfy the requirements of timbre distinctiveness, usability (e.g. ease of learning) and aesthetic satisfaction, as with the SeeColor approach, we have chosen different musical instruments to represent the different classes of timbre.

Table I depicts the particular colour-to-timbre mapping currently being used. This mapping is an obvious target for configurability, seeing that users are likely to have their own aesthetic preferences.

2) Frequency: Although the 10 colours are matched with 10 different musical instruments, the lightness value of a blob can additionally be encoded in the auditory signal by modulating the frequency (pitch) of the musical instrument, thus further expanding the range of sounds that can be experienced by the user. This encoding allows users to differentiate two blobs that share the same colour but have different lightness values.
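
As a sketch of these two rules, the dictionary below restates the colour-to-instrument mapping of Table I, and the small helper maps lightness to a pitch. The direction of the mapping (lighter means higher pitch) and the 220 Hz base and two-octave range are assumptions made for illustration, not values given in the paper.

    COLOUR_TO_INSTRUMENT = {
        "Red": "Saxophone", "Orange": "Cello", "Yellow": "Harmonica",
        "Green": "Piano", "Blue": "Horn", "Indigo": "Guitar",
        "Violet": "Trumpet", "White": "Xylophone", "Gray": "Flute",
        "Black": "Violin",
    }


    def lightness_to_frequency(lightness, base_hz=220.0, octaves=2.0):
        """Map lightness in [0, 1] to a pitch in Hz (assumed: lighter = higher)."""
        return base_hz * 2.0 ** (lightness * octaves)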

    B. Location

The 2D coordinates of each blob are extracted by the contour segmentation technique in x and y pixel units.


TABLE I
COLOUR MAPPING TABLE

Colour    Instrument
Red       Saxophone
Orange    Cello
Yellow    Harmonica
Green     Piano
Blue      Horn
Indigo    Guitar
Violet    Trumpet
White     Xylophone
Gray      Flute
Black     Violin

1) X-axis: The x-coordinate of a blob is encoded through stereophonic sound. The x-axis is represented by three regions, i.e. left, right and both sides. Before audio conversion, Luminophonics determines in which region each blob resides, and audio synthesis is based on that region. For example, if a blob falls entirely in the left region, sound is synthesized only in the left audio channel. The converse applies to blobs falling entirely in the right region. If a blob falls across both regions, then sound is synthesized in both audio channels.
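
This placement rule can be sketched in a few lines of Python. The blob dictionary fields are the ones produced by the earlier segmentation sketch and are assumptions for the example.

    def channel_gains(blob, image_width):
        """Return (left, right) gains for a blob's bounding box."""
        mid = image_width // 2
        left_edge, right_edge = blob["x"], blob["x"] + blob["w"]
        if right_edge <= mid:
            return 1.0, 0.0       # entirely in the left ROI: left ear only
        if left_edge >= mid:
            return 0.0, 1.0       # entirely in the right ROI: right ear only
        return 1.0, 1.0           # spans the split: heard in both ears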

2) Y-axis: The y-coordinate of a blob is implicitly mapped through the time delay of each band; indeed, one of the main purposes of creating the swiping method is to preserve Y-axis information. Without a surround sound system or a highly trained ear, the y-coordinate of an object is very hard to represent. In the swiping method, the y-coordinate of a blob is associated with a time delay in an intuitive manner: the further down a blob is in the image, the longer it takes for its corresponding sound to be synthesized. Thus, users can use this temporal information to create a mental image of the relative positioning of blobs along the y-axis.
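
In code, the rule is simply that a blob's onset grows with the index of the band it falls into; a minimal sketch follows, reusing the same hypothetical band height and inter-band delay as the swiping sketch in Section III.

    def onset_delay(y, band_height=40, inter_delay=0.25):
        """Blobs lower in the image fall into later bands and so start later."""
        return (y // band_height) * inter_delay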

    C. Size

In the current implementation of Luminophonics, the size (or area) of a blob is encoded in terms of the volume of the instrument being played. In other words, the volume of a particular instrument is directly proportional to the size of the blob generating it. This visual-to-auditory association, apart from allowing users to infer the size of objects, also allows them to infer their horizontal skewness. If a blob is skewed to the left, the area of the part of the blob lying in the left region will be larger than the area lying in the right region; hence, the volume of that blob's audio in the left channel will be higher than the volume of the same audio in the right channel.
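
The sketch below ties the three mappings together for a single blob: area sets loudness, the left/right split of the bounding box sets the channel balance (so a left-skewed blob sounds louder on the left), and the vertical position sets the onset. The sine tone, the bounding-box approximation of the per-side area, and all constants are assumptions standing in for the prototype's actual synthesizer.

    import numpy as np


    def render_blob(out, blob, image_w, image_h, freq,
                    sr=44100, duration=0.4, band_height=40, inter_delay=0.25):
        """Mix one blob's tone into a stereo buffer `out` of shape (samples, 2)."""
        mid = image_w // 2
        volume = blob["area"] / float(image_w * image_h)             # size -> volume
        left_w = max(0, min(blob["x"] + blob["w"], mid) - blob["x"])
        gain_l = volume * left_w / blob["w"]                         # horizontal skew ->
        gain_r = volume * (blob["w"] - left_w) / blob["w"]           # channel imbalance
        start = int((blob["y"] // band_height) * inter_delay * sr)   # y -> time delay
        if start >= out.shape[0]:
            return out
        t = np.arange(int(duration * sr)) / sr
        tone = np.sin(2 * np.pi * freq * t).astype(np.float32)
        end = min(start + tone.size, out.shape[0])
        out[start:end, 0] += gain_l * tone[: end - start]
        out[start:end, 1] += gain_r * tone[: end - start]
        return out


    # Example: a 3-second buffer and one 60x40 blob in the upper-left area.
    stereo = np.zeros((44100 * 3, 2), dtype=np.float32)
    render_blob(stereo, {"x": 10, "y": 90, "w": 60, "h": 40, "area": 2400},
                image_w=320, image_h=240, freq=440.0)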

VII. TRAINING & PRELIMINARY TESTS

Several basic preliminary tests were conducted, not only to validate the software developed but also to verify the validity of our sensory-substitution approach. Due to the preliminary nature of these tests, a single subject, with adequate hearing and a background in music, was used. Before the actual testing phase, the subject was asked to go through a training phase, where he learnt to recognize soundscapes generated by the prototype.

    A. Training

The training was limited to the basic mappings (or features) implemented by the prototype, namely colour, location and size. The participant was trained to recognize the three basic features and was expected to recognize combinations of these in order to achieve the objective of each experiment. The duration of each training session depended on the satisfaction of the participant, who was allowed to repeat sessions any number of times until he was satisfied with his ability to recognize each basic feature.

The first training session was dedicated to recognizing the colour of the input image through its corresponding sound (i.e. timbre). The participant was also trained to differentiate shades of colour through the resulting variations in pitch.

To train the participant to recognize the location of an object, he was taught to relate time delays to vertical position and stereo placement to horizontal position. Four blobs of the same colour were drawn on four different input images in four different quadrants: Top-Left, Top-Right, Bottom-Left and Bottom-Right. The participant was required to listen to the resulting soundscape and interpret the position of the blobs.

Training to recognize blob size was comparatively easy, because of the simple relationship between object size and sound volume. Blobs of different sizes were drawn onto different input images. The participant was asked to listen closely to the different sound volumes and to recognize blob sizes based on these differences.

    B. Experiments

Six different experiments were conducted with the participant after the training phase. The difficulty and complexity of the experiments were low, with a maximum of two feature combinations per test. Due to the preliminary nature of the testing, the sample images consisted of simple synthetic objects. Figure 4 shows some of the test samples used in the experiments.

1) Colour Test: The colour test was the simplest experiment of all: the participant was required to differentiate between the 10 possible blob colours. One object of one colour was shown each time, and the participant was required to identify its colour through the resulting sound. The process was repeated 20 times to roughly gauge the accuracy of the participant's colour identification.


    Fig. 4. Test Samples


2) Object Test: The objective of this experiment was to test whether our prototype produced adequate auditory features for recognizing four different object classes. The four objects were displayed randomly in different sequences, 5 times each (each object class consisted of five variations). In total, the participant needed to recognize the objects in 20 trials.

The four object classes (i.e. bee, house, stickman and tree) were pre-selected partly based on their distinctive features. The bee was selected due to its combination of black and yellow features. Images of the house category were drawn specifically from a square and a triangle of different colours; the participant was expected to recognize the object through the triangle/square arrangement regardless of the specific colours. Stickman images were drawn in only one colour, with a round head at the top and a body made out of lines. Tree images were drawn with simplified green foliage and a brown trunk.

3) Shade Test: As mentioned earlier, different colour shades produce different pitches for the same musical instrument. In the shade test, two blobs of the same colour but with different shades were drawn side by side, and the participant was asked to pick the darker blob based on pitch. This experiment tested the participant's ability to discriminate horizontal locations and shades of colour.

4) Find Location: In this experiment, one variant from each of the four object classes (mentioned in the object test) was selected and redrawn in the four different image quadrants. The participant was asked to locate the position of a specific object in one of the four quadrants. 24 different combinations of images were shown randomly to the participant.

5) Find Object: This experiment is closely related to the previous one. In this case, the participant was given a specific quadrant and asked to identify the object located in it. 24 different combinations of images were shown randomly to the participant.

6) Counting Test: The final experiment was the most complex of the six reported here; the participant (and the prototype) were tested on all three visual features (i.e. location, size and colour). In Figure 4, the bottom-left image, labelled Rounds, is one of the test images, in which the user needs to count the number of round objects in 3 different colours. Different shapes, of different sizes and in different colours, were drawn on each image. Every image contained a different number of blobs placed at random locations. The participant was required to count the number of blobs in the image. The process was repeated 10 times.

    VIII. RESULTS

    Fig. 5. Preliminary Experiments Accuracy

The bar chart in Figure 5 shows the recognition rates of the participant for the different experimental conditions. Given that for most conditions the chance level rests at 25% (for the colour and shade tests the chance levels are 10% and 50% respectively), these results suggest that the prototype is indeed functioning as expected.

The participant obtained the lowest accuracy in the experiment that required him to identify an object within a given quadrant. This result might be partly explained by the phenomenon of cacophony, which was more pronounced in this condition. In contrast, and perhaps ironically, the participant obtained the highest accuracy in the experiment that required him to find the location of a given object. The fact that the subject limited his focus to one quadrant at a time and compared its soundscape to the one stored in his memory might explain the higher accuracy rate.

In the counting test, the participant obtained a high accuracy rate, with 90% correct answers. This suggests that swiping from top to bottom is a good approach for helping users create mental maps of blob locations and numbers. The participant was capable of drawing the locations of blobs on a graph while counting them one by one.

IX. CONCLUSION AND FUTURE ENHANCEMENTS

The swiping method aims to improve current visual-to-auditory sensory substitution solutions by maximizing information conversion whilst maintaining learnability and interpretability. The visual properties of colour, size and location are explicitly encoded and converted into auditory signals. Due to the attentional (banding) and temporal (swiping) aspects of the solution, information about shape and texture can also be deduced from the soundscape. Through training and/or repeated utilization, users should become capable of interpreting increasing amounts of information with a reduced sense of cacophony, and should exhibit faster recognition and learning rates.

Having said this, a significant amount of work remains to be done. One of the most immediate future tasks involves the image processing stage, whereby we hope to generate simplified visual descriptions (through modified segmentation algorithms) that cohesively extract the most relevant aspects of a visual scene. Future work will also involve the completion of a highly configurable prototype which, apart from allowing users to fine-tune the system to their tastes and requirements, will allow us to conduct extensive experimentation in order to conclusively answer the question of how to maximize cross-modal information conversion. This effort will also involve an investigation of human perceptual capabilities and limitations. Quantitative conversion measures also need to be developed in order to facilitate preliminary comparisons between approaches. Different variations of the solution are expected to be developed for different contexts, e.g. navigation (e.g. walking in a shopping mall) vs. human-computer interaction (e.g. trying to interpret a graph). Due to the importance of depth information, particularly in the context of navigation, future versions of the approach should incorporate a stereo camera. The auditory encoding of depth should be done in a manner that is distinctive and does not interfere with the mappings already provided for colour, position and size.

In conclusion, preliminary results indicate that the image processing and attentional-dynamics approach adopted by Luminophonics is valid; its aims of further maximizing the quantity, rate, learnability and interpretability of the information converted are therefore within reach, and consequently its ultimate goal of providing effective technology to assist the visually impaired in their real-life interactions with the environment is also attainable.

    REFERENCES

[1] Ben-Tal, O., Berger, J., Cook, B., Daniels, M., Scavone, G., and Cook, P. SonART: The sonification application research. In Proceedings of the 2002 International Conference on Auditory Display, page 151, Kyoto, Japan, Jul 2002.
[2] Bologna, G., Deville, B., Pun, T., and Vinckenbosch, M. Transforming 3D coloured pixels into musical instrument notes for vision substitution applications. EURASIP Journal on Image and Video Processing, 2007.
[3] Chang, F., Chen, C.-J., and Lu, C.-J. A linear-time component-labeling algorithm using contour tracing technique. Computer Vision and Image Understanding, 93(2):206-220, February 2004.
[4] Grant, A. C., Thiagarajah, M. C., and Sathian, K. Tactile perception in blind Braille readers: A psychophysical study of acuity and hyperacuity using gratings and dot patterns. Perception & Psychophysics, pages 301-312, 2000.
[5] Meijer, P. An experimental system for auditory image representations. IEEE Transactions on Biomedical Engineering, 39(2):112-121, 1992.
[6] World Health Organization. Up to 45 million blind people globally - and growing. Retrieved from the World Health Organization official website: http://www.who.int/mediacentre/news/releases/2003/pr73/en/, Oct 2003.
[7] Yeo, W. S. and Berger, J. Application of raster scanning method to image sonification, sound visualization, sound analysis and synthesis. In Proceedings of the 9th International Conference on Digital Audio Effects (DAFx-06), 2006.