Swiping with Luminophonics
Shern Shiou Tan, Tomas Henrique Bode Maul
School of Computer Science
The University of Nottingham Malaysia Campus
Jalan Broga, 43500 Semenyih
Selangor Darul Ehsan, Malaysia
Neil Russel Mennie, Peter Mitchell
School of Psychology
The University of Nottingham Malaysia Campus
Jalan Broga, 43500 Semenyih
Selangor Darul Ehsan, Malaysia
Abstract: Luminophonics is a system that aims to maximize cross-modality
conversion of information, specifically from the visual to the auditory
modality, with the motivation to develop better assistive technology for the
visually impaired by using image sonification techniques. The project aims
to research and develop generic and highly configurable components concerned
with different image processing techniques, attention mechanisms,
orchestration approaches and psychological constraints. The swiping method
introduced in this paper combines several techniques in order to explicitly
convert the colour, size and position of objects. Preliminary tests suggest
that the approach is valid and deserves further investigation.
Index Terms: Image Processing, Computer Vision, Auditory Display, Image Sonification
I. INTRODUCTION
In 2003, the World Health Organization reported that
there were about 314 million visually impaired people
worldwide, 45 million of whom were completely
blind [6]. This fact constitutes the core motivation behind
Luminophonics, which is a solution that aims to help the
blind regain their ability to interpret the visual world through
visual to auditory sensory substitution (or image sonification).
Because this is not a new area of research, we propose to
address current limitations, mainly in the sense of maximizing
the information transfer from one modality (i.e. visual) to the
other (i.e. auditory). Probably the earliest successful sensory
substitution solution for the blind is Braille [4].
Image sonification has been a research area for several
years now. Some of the older solutions include vOICe [5]
and SonART [1], whereas one of the most recent is
SeeColor [2]. Although these and related solutions have
proven successful to some degree, they tend to exhibit some
or all of the following weaknesses: information loss, cacophony
(uninterpretability), lack of configurability and usability, and
steep learning curves. Luminophonics aims
to find solutions to these weaknesses, because they hinder the
application of image sonification in real world situations.
The Luminophonics project covers experimental methods
for testing different mappings between visual and auditory
properties. It is also concerned with studying aspects of
human perception (e.g. parallel processing limits) in order
to address the conversion maximization issue in a realistic
manner. The main low-level difference between the work
reported in this paper and other methods is the usage of
top-down swiping in conjunction with an emphasis on colour
conversion (an emphasis that is shared with SeeColor). The
rationale for the swiping method is to increase the amount
of information that is converted per unit of time, without
sacrificing interpretability (i.e. without creating cacophony).
II. LUMINOPHONICS & CROSS-MODALITY CONVERSIONS
The core question of cross-modality conversion is how much
information is preserved after the conversion. Apart from
representation issues (e.g. dimensionality, statistical and
compositional structure, and so on) and the determination of
what is relevant or not, cross-modality conversions also suffer
from noise, missing data and related issues.
The basic concept of Luminophonics is to develop a
highly configurable and customizable platform that supports
sandboxing and provides generic audio and vision components.
Through several experiments, involving different visual to
auditory mappings, we hope to discover the most effective set
of methods, yielding the best information preservation across
modalities. The definition of good information preservation
includes not only a rich and correct pairing between visual
and auditory properties, but also ease of use, ease of learning,
informational relevance, and pleasantness (i.e. aesthetic value).
The generic platform (Figure 1) consists of several modules,
each dedicated to its own task, from image segmentation to
sound synthesis. Figure 1 shows that an input image first
passes through the Image Segmentation module and then the
Colour Heuristic Model. Both modules then send their outputs
to the Decision Module before the Sound Synthesizer produces
the result of the conversion.
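To make the data flow of Figure 1 concrete, the following minimal Python
sketch wires the four modules together. The module boundaries follow
Figure 1, but the function names, the Blob structure and the ordering rule
in the Decision Module are illustrative assumptions, not the actual
implementation.

    # Illustrative sketch of the Figure 1 pipeline; names and types are assumptions.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Blob:
        x: int            # bounding box, in pixels
        y: int
        width: int
        height: int
        colour: str = ""  # label assigned by the Heuristic Colour Model
        lightness: float = 0.0

    def segment_image(image) -> List[Blob]:
        """Image Segmentation module (Section IV): blobs with size and position."""
        raise NotImplementedError

    def classify_colours(image, blobs: List[Blob]) -> List[Blob]:
        """Colour Heuristic Model module (Section V): attach colour labels."""
        raise NotImplementedError

    def decide(blobs: List[Blob]) -> List[Blob]:
        """Decision Module: order blobs top to bottom for the swipe (Section III)."""
        return sorted(blobs, key=lambda b: b.y)

    def synthesize(blobs: List[Blob]) -> bytes:
        """Sound Synthesizer: render the two-channel soundscape (Section VI)."""
        raise NotImplementedError

    def sonify(image) -> bytes:
        return synthesize(decide(classify_colours(image, segment_image(image))))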
By having a generic platform, different methods can be
rapidly developed for proving or disproving hypotheses regard-
ing information preservation, perceptual limits, usability, and
so on. In order to conduct preliminary comparisons between
methods, a standard measure of information conversion will be
proposed, along with statistical results from user experiments.
Fig. 1. Generic Platform Process Flow
III. SWIPING METHOD
Part of the inspiration for the swiping method is derived
from human visual attention. Even though the human eye
provides the brain with a wide field of view, visual perception
is mostly limited to a focused subset of the total field.
Similarly to human visual attention, which continuously and
dynamically repositions and reconfigures itself in order to
process relevant information, the swiping method dynamically
repositions a horizontal attention band in order to gradually
transfer information for sequential processing.
In the swiping method, an image is split into left and
right halves for image segmentation purposes. From the
segmentation results, blobs are categorized and stored from
top to bottom in a buffer. From the buffer, the blobs are
processed according to their properties to form a two-channel
audio output. Blobs are converted into sound, band by band,
in a top-down sequence. Each intra-band and inter-band delay
is configurable so that each user can optimize
the recognition of images according to their individual
(perceptual) differences.
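As a concrete illustration of this band-by-band scheduling, the sketch below
assigns each blob an onset time from its vertical position. The band count,
the delay values and the field names are assumptions introduced for the
example; in the actual system the delays are user-configurable parameters.

    # Hedged sketch of the top-down swipe scheduler; parameter values are
    # illustrative, not the system's defaults.
    def swipe_schedule(blobs, image_height, n_bands=8,
                       intra_band_delay=0.05, inter_band_delay=0.25):
        """Assign each blob an onset time (in seconds) from the band it falls in.

        blobs: iterable of dicts with at least a 'y' key (top row of the blob).
        Returns a list of (blob, onset) pairs ordered top to bottom.
        """
        band_height = image_height / float(n_bands)
        schedule = []
        for band in range(n_bands):
            top, bottom = band * band_height, (band + 1) * band_height
            in_band = sorted((b for b in blobs if top <= b['y'] < bottom),
                             key=lambda b: b['y'])
            base = band * inter_band_delay
            for i, blob in enumerate(in_band):
                schedule.append((blob, base + i * intra_band_delay))
        return schedule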
The main advantage of generating a soundscape (auditory
mapping) by sequentially swiping bands from top to bottom is
to clearly represent Y-axis information based on time delays.
The usage of time delays helps to improve human sound
localization along the Y-axis. One might say that the loss of
spatial representational freedom (resolution and concurrency),
and the resulting difficulties in object localization, constitute
the central problems or limitations of visual to auditory
sensory substitution. In an image, the data comes in 2
dimensions and stereoscopy enables a third dimension (i.e.
depth information). Although swiping has been used before
(both vertically and horizontally) in order to address this
spatial issue, to the best of our knowledge this is the first
time that it has been combined with intermediate level image
processing and an emphasis on colour processing.
IV. IMAGE SEGMENTATION
To obtain image blobs and their properties, Luminophonics
applies an image segmentation technique which works in
linear time, and simultaneously labels connected components
and their contours [3]. This technique not only provides fast,
real-time segmentation for simple images but also generates
blob properties such as size (in width and height) and
location (in pixel coordinates).
Fig. 2. Process of Segmentation (panels: Input, Binary, K-means, Output)
In this method, the input images are converted into binary
images for the purpose of image segmentation using a contour
tracing technique. To simplify the image further, the K-means
algorithm is used to divide the binary image into multiple
segments. After five iterations of the K-means algorithm (refer
to Figure 2), noise and false detections are greatly reduced,
which increases the accuracy of the final segmentation. In
order to preserve the colour information for later conversion,
the input images are kept in pairs with their binary counterparts
in a memory buffer. The original colour images are stored for
colour extraction while the binary images are used for image
segmentation.
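A rough OpenCV-based approximation of this stage is sketched below. The
paper's segmentation uses the linear-time contour-tracing labeling of Chang
et al. [3]; for illustration we substitute cv2.connectedComponentsWithStats
for that step, run cv2.kmeans for five iterations as the simplification
stage, and use Otsu thresholding to obtain the binary image. The cluster
count and thresholds are assumptions.

    # Approximate sketch of the segmentation stage using OpenCV. The labeling
    # algorithm of [3] is stood in for by connectedComponentsWithStats; k and
    # the thresholding choice are assumptions for illustration.
    import cv2
    import numpy as np

    def segment(image_bgr, k=4, kmeans_iters=5):
        # K-means colour quantization to simplify the image (five iterations).
        pixels = image_bgr.reshape(-1, 3).astype(np.float32)
        criteria = (cv2.TERM_CRITERIA_MAX_ITER, kmeans_iters, 1.0)
        _, labels, centers = cv2.kmeans(pixels, k, None, criteria, 1,
                                        cv2.KMEANS_PP_CENTERS)
        quantized = centers[labels.flatten()].reshape(image_bgr.shape)
        quantized = quantized.astype(np.uint8)

        # Binarize the simplified image for blob extraction.
        gray = cv2.cvtColor(quantized, cv2.COLOR_BGR2GRAY)
        _, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)

        # Connected-component labeling yields blob size and location directly.
        n, _, stats, _ = cv2.connectedComponentsWithStats(binary)
        blobs = []
        for i in range(1, n):  # label 0 is the background
            x, y, w, h, area = stats[i]
            blobs.append({'x': int(x), 'y': int(y),
                          'w': int(w), 'h': int(h), 'area': int(area)})
        return quantized, binary, blobs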
Two regions-of-interest (ROIs), namely a left region and a
right region, are extracted from the binary image by splitting
the image into two parts of equal size. Splitting the image
not only speeds up the process of image segmentation but also
caters to the two-channel stereo audio aspect of the solution.
A different audio output is produced for each ear. Assuming
that earphones are being used, the left ear can only hear the
objects that exist in the left ROI while the right ear can only
hear the sounds of objects in the right ROI. If an object exists
on both sides, its corresponding audio properties can be heard
in both ears. This helps the user to determine the X-axis
position of objects.
From the image segmentation stage, a set of blobs with
their properties are arranged in a linear format. Referring to
Figure 2, the final output is shown at the bottom right corner,
with the red boxes illustrating the blobs to be classified. After
this, the Heuristic Colour Model classifies the blobs based
on their colours. The blobs are subsequently arranged in a
buffer in a swiping-compatible format before they are finally
converted into audio waves.
V. HEURISTIC COLOUR MODEL
CCD (charge coupled device) and CMOS (complementary
metal oxide semiconductor) camera sensors both vary in
sensitivity and colour management models. The subjective
interpretation of colour also varies between individuals.
Hence, a standard heuristic colour model is needed for each
individual camera to deduce standard colour properties based
on user settings. By assessing how users perceive colour, a
colour model (based on flexible thresholds) can be created for
particular camera/user pairs.
The Heuristic Colour Model (HCM) is strictly based on
the HSL colour model, which consists of a double symmetric
cone with a true black point at the bottom and white at the
other end. H represents hue, S represents saturation and L
represents lightness. To determine the mapping of a pixel
value, the HCM first tests the pixel's saturation and, based
on this, decides whether to test its hue or lightness.
Fig. 3. Heuristic Colour Model Decision Chart
The HCM is a simple and intuitive method for determining
whether a particular pixel should be classified as a colour
or a grayscale pixel. For a colour pixel, its saturation must
lie between the upper and lower thresholds of the saturation
scale. For a grayscale pixel, its saturation is either above the
upper threshold or below the lower threshold. Refer to Figure 3
for a diagram of the HCM's decision logic.
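The decision logic of Figure 3 can be summarized in a few lines of code. In
this sketch the saturation thresholds, lightness cut-offs and hue bins are
placeholder values; in the real system they are calibrated per camera/user
pair, as described above.

    # Sketch of the HCM decision logic (Figure 3); all numeric thresholds and
    # hue bins below are placeholder assumptions, to be calibrated per user.
    def classify_hsl(h, s, l, s_low=0.15, s_high=0.95):
        """h in degrees [0, 360); s and l in [0, 1]. Returns a colour label."""
        if s < s_low or s > s_high:
            # Outside the saturation band: treat the pixel as grayscale and
            # decide on lightness alone.
            if l < 0.2:
                return 'Black'
            if l > 0.8:
                return 'White'
            return 'Gray'
        # Inside the saturation band: decide on hue, using coarse hue bins.
        for upper, name in [(15, 'Red'), (45, 'Orange'), (70, 'Yellow'),
                            (160, 'Green'), (250, 'Blue'), (275, 'Indigo'),
                            (330, 'Violet'), (360, 'Red')]:
            if h < upper:
                return name
        return 'Red'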
VI. VISUAL TO AUDIO MAPPING
The primary visual properties being investigated in this
early phase of the Luminophonics project consist of object
colour, size and location. These properties are converted and
synthesized dynamically into audio waves by manipulating
different audio properties. It is crucial to understand and make
full use of the available properties in both the audio and visual
domains in order to produce a good visual to audio mapping [7].
An intuitive mapping of visual to auditory properties not only
helps users learn how to use the technology more rapidly and
effectively but also maximizes the information preservation
across modalities.
A. Colour
Blob pixels, which are captured by standard imaging
sensors, are by default Red, Green, Blue (RGB) values. These
values are converted into the Hue Saturation Lightness (HSL)
colour space. Blobs are then categorized into different colours
through the Heuristic Colour Model by computing the mean
of the blob pixels. In the first Luminophonics prototype
reported here, blobs are categorized into 10 colours, namely:
Red, Orange, Yellow, Green, Blue, Indigo, Violet, Black, White
and Gray. In a manner analogous to the SeeColor approach,
the 10 different colours are then mapped to 10 different
timbres (or musical instruments).
Besides categorizing blobs into distinct colours, the mean
HSL value of a blob provides further auditory variations
through the lightness value, which affects the frequency (or
pitch) of each timbre (or instrument).
1) Timbre: Each colour is mapped to a different and
distinctive timbre. In order to satisfy the requirements of
timbre distinctiveness, usability (e.g. ease of learning), and
aesthetic satisfaction, as with the SeeColor approach, we have
chosen different musical instruments to represent different
classes of timbre.
Table I depicts the particular colour to timbre mapping
currently being used. This mapping is an obvious target for
configurability, seeing that users are likely to have their own
aesthetic preferences.
2) Frequency: Though the 10 colours are matched to
10 different musical instruments, the lightness value of a
blob can also be encoded in the auditory signal by affecting
the frequency (pitch) of the musical instrument, thus further
expanding the range of sounds that can be experienced
by the user. This encoding allows users to differentiate two
blobs with the same colour but with different lightness values.
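Putting the two rules together, the sketch below maps a blob's colour label
to an instrument following Table I, and its mean lightness to a pitch. The
frequency range and the linear, darker-is-lower mapping are our own
assumptions; the paper only states that lightness affects pitch.

    # Colour-to-instrument mapping from Table I; the pitch rule is an assumption.
    COLOUR_TO_INSTRUMENT = {
        'Red': 'Saxophone', 'Orange': 'Cello', 'Yellow': 'Harmonica',
        'Green': 'Piano', 'Blue': 'Horn', 'Indigo': 'Guitar',
        'Violet': 'Trumpet', 'White': 'Xylophone', 'Gray': 'Flute',
        'Black': 'Violin',
    }

    def blob_note(colour, lightness, f_min=220.0, f_max=880.0):
        """Return (instrument, frequency in Hz) for a blob.

        lightness is the blob's mean L value in [0, 1]; darker blobs are
        mapped to lower frequencies (an illustrative choice).
        """
        return COLOUR_TO_INSTRUMENT[colour], f_min + lightness * (f_max - f_min)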
B. Location
The 2D coordinates of each blob are extracted from the
contour segmentation technique in x and y pixel units.
TABLE I
COLOUR MAPPING TABLE

  Colour    Instrument
  Red       Saxophone
  Orange    Cello
  Yellow    Harmonica
  Green     Piano
  Blue      Horn
  Indigo    Guitar
  Violet    Trumpet
  White     Xylophone
  Gray      Flute
  Black     Violin
1) X-axis: The x-coordinate of a blob is encoded
through stereophonic sound. The x-axis is represented by
three regions, i.e. left, right and both sides. Before audio
conversion, Luminophonics determines in which region each
blob resides. Audio synthesis is based on which region a blob
is located in. For example, if a blob falls entirely in the
left region, sound will be synthesized only in the left audio
channel. The converse applies to blobs falling entirely in the
right region. If a blob falls across both regions then sound is
synthesized in both audio channels.
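A minimal version of this region test is shown below; the boundary is the
equal left/right split described in Section IV, and the function name is
illustrative.

    # Sketch of the X-axis region test; the midline split follows Section IV.
    def x_region(blob_x, blob_w, image_width):
        """Return 'left', 'right' or 'both' for a blob's horizontal extent."""
        mid = image_width / 2.0
        if blob_x + blob_w <= mid:
            return 'left'
        if blob_x >= mid:
            return 'right'
        return 'both'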
2) Y-axis: The y-coordinate of a blob is implicitly mapped
by the time delay of each band. One of the main purposes
of creating the swiping method is to preserve the Y-axis
information. Without a surround sound system or a highly
trained ear, the y-coordinate of an object is very hard to
represent. In the swiping method, the y-coordinate of a blob
is associated with time delay in an intuitive manner. The
further down a blob is in the image, the longer it takes for
its corresponding sound to be synthesized. Thus, users can
use this temporal information to create a mental image of
the relative positioning of blobs along the y-axis.
C. Size
In the current implementation of Luminophonics, the size
(or area) of a blob is encoded in terms of the volume of the
instrument being played. In other words, the volume of a
particular instrument is directly proportional to the size of the
blob generating it. This visual to auditory association, apart
from allowing users to infer the size of objects, also allows
them to infer their horizontal skewness. If a blob is skewed to
the left, the area of that part of the blob in the left region will
be larger than the area in the right region. Hence, the volume
of the particular audio in the left channel will be higher than
the volume of the same audio in the right channel.
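The following sketch combines the size-to-volume rule with the per-channel
split just described, approximating each blob by its bounding box. The
normalization of the overall gain by image area is an assumption; the
essential point is that the left and right volumes are proportional to the
blob area falling in each half.

    # Sketch of the size-to-volume mapping with left/right skewness; the gain
    # normalization by image area is an illustrative assumption.
    def channel_volumes(blob_x, blob_w, blob_area, image_width, image_height):
        """Return (left_volume, right_volume) in [0, 1] for a blob."""
        mid = image_width / 2.0
        # Fraction of the blob's bounding box lying in the left half.
        left_frac = max(0.0, min(blob_x + blob_w, mid) - blob_x) / float(blob_w)
        overall = blob_area / float(image_width * image_height)
        return overall * left_frac, overall * (1.0 - left_frac)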
VII. TRAINING & PRELIMINARY TESTS
Several basic preliminary tests were conducted, not only to
validate the software developed but also to verify the validity
of our sensory-substitution approach. Due to the preliminary
nature of these tests a single subject, with adequate hearing
and a background in music, was used. Before the actual
testing phase, the subject was asked to go through a training
phase, where he learnt to recognize soundscapes generated by
the prototype.
A. Training
The training was limited to the basic mappings (or features)
implemented by the prototype, namely colour, location and
size. The participant was trained to recognize the three basic
features and was expected to recognize combinations of these,
in order to achieve the objective of each experiment.
The duration of each training session depended on the
satisfaction of the participant. The participant was allowed to
repeat sessions any number of times until he was satisfied
with his ability to recognize each basic feature.
The first training session was to recognize the colour of
the input image through its corresponding sound (i.e. timbre).
The participant was also trained to differentiate shades of
colour through the resulting variations in pitch.
To train the participant to recognize the location of an
object, the participant was taught to relate time delays to
vertical position and stereo placement to horizontal position.
Four blobs of the same colour were drawn on four different
input images in four different quadrants: Top-Left, Top-Right,
Bottom-Left and Bottom-Right. The participant was required
to listen to the resulting soundscape and interpret the position
of blobs.
Training to recognize blob size was comparatively easy,
because of the simple relationship between object size and
sound volume. Blobs of different sizes were drawn onto
different input images. The participant was asked to listen
closely to the different sound volumes and to recognize blob
sizes based on these differences.
B. Experiments
Six different experiments were conducted on the participant
after the training phase. The difficulty and complexity levels
of the experiments were low, with a maximum of two feature
combinations per test. Due to the preliminary nature of the
testing, the sample images consisted of simple synthetic
objects. Figure 4 shows some of the test samples used in
conducting the experiments.
1) Colour Test: The colour test was the simplest
experiment of all, where the participant was required to
differentiate between the 10 different possible blob colours.
One object with one colour was shown each time and
the participant was required to identify the colour through
the resulting sound. The process was repeated 20 times
to roughly gauge the accuracy of the participant's colour
identification.

Fig. 4. Test Samples
2) Object Test: The objective of this experiment was
to test whether our prototype produced adequate auditory
features for recognizing four different object classes. The four
object classes were displayed in random order, five times each
(each class consisted of five variations). In total, the participant
needed to recognize objects in 20 trials.
The four object classes (i.e. bee, house, stickman and tree)
were pre-selected partly based on their distinctive features.
The bee was selected due to its combination of black and
yellow features. Images of the house category were drawn
specifically from a square and a triangle of different colours.
The participant was expected to recognize the object through
the triangle/square arrangement despite their specific colours.
Stickman images were drawn in only one colour, with a round
head at the top and a body made out of lines. Tree images
were drawn with simplified green foliage and a brown trunk.
3) Shade Test: As mentioned earlier, different colour
shades produce different pitches for the same musical
instrument. In the shade test, two blobs of the same colour
but with different shades were drawn side by side, and the
participant was asked to pick the darker blob based on pitch.
In this experiment, the participant's ability to discriminate horizontal locations
and shades of colour was tested.
4) Find Location: In this experiment, one variant from the
four object classes (mentioned in the object test) was selected
and redrawn in four different image quadrants. The participant
was asked to locate the position of a specific object in one of
the four quadrants. 24 different combinations of images were
shown randomly to the participant.
5) Find Object: This experiment is closely related to the
previous one. In this case, the participant was given a specific
quadrant and asked to identify the object located in it. 24
different combinations of images were shown randomly to the
participant.
6) Counting Test: The final experiment was the most
complex of the six reported here, where the participant (and
the prototype) were tested on all three visual features (i.e.
location, size and colour). In Figure 4, the bottom left image,
with the label Rounds, is one of the test images, in which the
user needs to count the number of round objects in 3 different
colours. Shapes of different sizes and colours were drawn
on each image. Each image had a different number of blobs
located at random positions. The participant was required to
count the number of blobs in the image. The process was
repeated 10 times.
VIII. RESULTS
Fig. 5. Preliminary Experiments Accuracy
The bar chart in Figure 5 shows the recognition rates of the
participant for the different experimental conditions. Based on
the fact that for most conditions the chance level rests at 25%
(for the colour and shade tests chance levels are 10% and
50% respectively), these results suggest that the prototype is
indeed functioning as expected.
The participant obtained the lowest accuracy for the
experiment that required him to identify an object within
a given quadrant. This result might be partly explained by
the phenomenon of cacophony which was more pronounced
in this condition. In contrast, and maybe ironically, the
participant obtained the highest accuracy in the experiment
that required him to find the location of a given object. The
fact that the subject limited his focus to one quadrant at a time
and compared its soundscape to the one stored in his memory
might explain the higher accuracy rate.
In the counting test, the participant obtained a high
accuracy rate with 90% correct answers. This suggests that
swiping from top to bottom is a good approach for users
to create mental maps of blob locations and numbers. The
participant was capable of drawing the location of blobs on a
graph, while counting them one by one.
IX. CONCLUSION AND FUTURE ENHANCEMENTS
The swiping method aims to improve current visual
to auditory sensory substitution solutions by maximizing
information conversion whilst maintaining learnability and
interpretability. The visual properties of colour, size and
location are explicitly encoded and converted into auditory
signals. Due to the attentional (banding) and temporal
(swiping) aspects of the solution, information about shape
and texture can be deduced from the soundscape. Through
training and/or repeated utilization, users should be capable of
interpreting increasing amounts of information with a reduced
sense of cacophony, and should exhibit faster recognition and
learning rates.
Having said this, a significant amount of work remains to
be done. One of the most immediate future tasks involves
the image processing stage, whereby we hope to generate
simplified visual descriptions (through modified segmentation
algorithms) that aim to cohesively extract the most relevant
aspects of a visual scene. Future work will also involve
the completion of a highly configurable prototype, which
apart from allowing users to fine-tune the system to their
tastes and requirements, will allow us to conduct extensive
experimentation in order to conclusively answer the question
of how to maximize cross-modal information conversions. This
effort will also involve an investigation of human perceptual
capabilities and limitations. Quantitative conversion measures
also need to be developed in order to facilitate preliminary
comparisons between approaches. Different variations of the
solution are expected to be developed for different contexts,
e.g.: navigation (e.g. walking in a shopping mall) vs. human
computer interaction (e.g. trying to interpret a graph). Due
to the importance of depth information, particularly in the
context of navigation, future versions of the approach should
incorporate a stereo camera. The auditory encoding of depth
should be done in a manner that is distinctive and does
not interfere with the mapping already provided for colour,
position and size.
In conclusion, preliminary results indicate that the image
processing and attentional dynamics approach adopted by
Luminophonics is valid, and thus that its aims to further
maximize the quantity, rate, learnability and interpretability of
the information converted are within reach. Consequently, its
ultimate goal of providing effective technology to assist the
visually impaired in their real-life interactions with the
environment also appears attainable.
REFERENCES

[1] Ben-Tal, O., Berger, J., Cook, B., Daniels, M., Scavone, G., and Cook, P. SonART: the sonification application research. In Proceedings of the 2002 International Conference on Auditory Display, page 151, Kyoto, Japan, July 2002.
[2] Bologna, G., Deville, B., Pun, T., and Vinckenbosch, M. Transforming 3D coloured pixels into musical instrument notes for vision substitution applications. EURASIP Journal on Image and Video Processing, 2007.
[3] Chang, F., Chen, C. J., and Lu, C. J. A linear-time component-labeling algorithm using contour tracing technique. Computer Vision and Image Understanding, 93(2):206–220, February 2004.
[4] Grant, A. C., Thiagarajah, M. C., and Sathian, K. Tactile perception in blind Braille readers: a psychophysical study of acuity and hyperacuity using gratings and dot patterns. Perception & Psychophysics, pages 301–312, 2000.
[5] Meijer, P. An experimental system for auditory image representations. IEEE Transactions on Biomedical Engineering, 39(2):112–121, 1992.
[6] World Health Organization. Up to 45 million blind people globally - and growing. Retrieved from the World Health Organization official website: http://www.who.int/mediacentre/news/releases/2003/pr73/en/, October 2003.
[7] Yeo, W. S. and Berger, J. Application of raster scanning method to image sonification, sound visualization, sound analysis and synthesis. In Proceedings of the 9th International Conference on Digital Audio Effects (DAFx-06), 2006.