
    Swiping with Luminophonics

Shern Shiou Tan, Tomas Henrique Bode Maul
School of Computer Science

    The University of Nottingham Malaysia Campus

    Jalan Broga, 43500 Semenyih

    Selangor Darul Ehsan, Malaysia

Neil Russel Mennie, Peter Mitchell
School of Psychology

    The University of Nottingham Malaysia Campus

    Jalan Broga, 43500 Semenyih

    Selangor Darul Ehsan, Malaysia

Abstract: Luminophonics is a system that aims to maximize cross-modality conversion of information, specifically from the visual to the auditory modality, with the motivation of developing better assistive technology for the visually impaired through image sonification techniques. The project aims to research and develop generic and highly configurable components concerned with different image processing techniques, attention mechanisms, orchestration approaches and psychological constraints. The swiping method introduced in this paper combines several techniques in order to explicitly convert the colour, size and position of objects. Preliminary tests suggest that the approach is valid and deserves further investigation.

Index Terms: Image Processing, Computer Vision, Auditory Display, Image Sonification

    I. INTRODUCTION

In 2003, the World Health Organization reported that there were about 314 million people worldwide who were visually impaired, 45 million of them completely blind [6]. This fact constitutes the core motivation behind Luminophonics, a solution that aims to help the blind regain their ability to interpret the visual world through visual-to-auditory sensory substitution (or image sonification). Because this is not a new area of research, we propose to address current limitations, mainly in the sense of maximizing the information transfer from one modality (i.e. visual) to the other (i.e. auditory). Probably the earliest successful sensory substitution solution for the blind is Braille [4].

Image sonification has been a research area for several years now. Some of the older solutions include vOICe [5] and SonART [1], whereas one of the most recent solutions is SeeColor [2]. Although these and related solutions have been shown to be successful to some degree, they tend to exhibit some or all of the following weaknesses: information loss, cacophony (uninterpretability), lack of configurability and usability, and steep learning curves. Luminophonics aims to find solutions to these weaknesses, because they hinder the application of image sonification in real-world situations.

The Luminophonics project covers experimental methods for testing different mappings between visual and auditory properties. It is also concerned with studying aspects of human perception (e.g. parallel processing limits) in order to address the conversion-maximization issue in a realistic manner. The main low-level difference between the work reported in this paper and other methods is the use of top-down swiping in conjunction with an emphasis on colour conversion (an emphasis shared with SeeColor). The rationale for the swiping method is to increase the amount of information that is converted per unit of time without sacrificing interpretability (i.e. without creating cacophony).

II. LUMINOPHONICS & CROSS-MODALITY CONVERSIONS

The core question of cross-modality conversion is determining how much information is preserved after the conversion. Apart from representation issues (e.g. dimensionality, statistical and compositional structure, and so on) and the determination of what is relevant or not, cross-modality conversions also suffer from noise, missing data and related issues.

The basic concept of Luminophonics is to develop a highly configurable and customizable platform that supports sandboxing and provides generic building blocks for the audio and vision components. Through several experiments involving different visual-to-auditory mappings, we hope to discover the most effective set of methods, yielding the best information preservation across modalities. The definition of good information preservation includes not only a rich and correct pairing between visual and auditory properties, but also ease of use, ease of learning, informational relevance, and pleasantness (i.e. aesthetic value).

The generic platform (Figure 1) consists of several modules, each dedicated to its own task, from image segmentation to sound synthesis. As Figure 1 shows, an input image is first processed by the Image Segmentation module and the Colour Heuristic Model. Both modules then send their outputs to the Decision Module before the Sound Synthesizer produces the result of the conversion.
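
To make the module chain concrete, the following is a minimal Python sketch of the process flow in Figure 1. All names (Blob, segment_image, classify_colours, decide, synthesize, sonify) are hypothetical placeholders for the four blocks described above, with trivial stub bodies; they are not the project's actual code.

    from dataclasses import dataclass
    from typing import List

    import numpy as np


    @dataclass
    class Blob:
        x: int
        y: int
        w: int
        h: int
        colour: str = "Gray"


    def segment_image(img: np.ndarray) -> List[Blob]:
        # Stand-in for the Image Segmentation module (contour-tracing labelling).
        return [Blob(0, 0, img.shape[1], img.shape[0])]


    def classify_colours(img: np.ndarray, blobs: List[Blob]) -> List[Blob]:
        # Stand-in for the Colour Heuristic Model (Section V).
        return blobs


    def decide(blobs: List[Blob]) -> List[Blob]:
        # Stand-in for the Decision Module: order blobs top to bottom for swiping.
        return sorted(blobs, key=lambda b: b.y)


    def synthesize(plan: List[Blob], sr: int = 44100) -> np.ndarray:
        # Stand-in for the Sound Synthesizer: a one-second silent stereo buffer.
        return np.zeros((sr, 2), dtype=np.float32)


    def sonify(image: np.ndarray) -> np.ndarray:
        blobs = segment_image(image)
        blobs = classify_colours(image, blobs)
        return synthesize(decide(blobs))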

By having a generic platform, different methods can be rapidly developed for proving or disproving hypotheses regarding information preservation, perceptual limits, usability, and so on. In order to conduct preliminary comparisons between methods, a standard measure of information conversion will be proposed along with statistical results from user experiments.



    Fig. 1. Generic Platform Process Flow

    III. SWIPING METHOD

Part of the inspiration for the swiping method is derived from human visual attention. Even though the human eye provides the brain with a wide field of view, visual perception is mostly limited to a focused subset of the total field. Similarly to human visual attention, which continuously and dynamically repositions and reconfigures itself in order to process relevant information, the swiping method dynamically repositions a horizontal attention band in order to gradually transfer information for sequential processing.

In the swiping method, an image is split into left and right halves for image segmentation purposes. From the segmentation results, blobs are categorized and stored from top to bottom in a buffer. From the buffer, the blobs are processed according to their properties to form a two-channel audio output. Blobs are converted into sound, band by band, in a top-down sequence. Each intra-band and inter-band delay is configurable, so that each user can optimize the recognition of images according to their individual (perceptual) differences.
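
As an illustration of this banded, top-down ordering, here is a small Python sketch that turns a list of blobs into a playback schedule. The blob dictionaries, the band height and the default delay values are assumptions made for the example; the paper only states that the intra-band and inter-band delays are user-configurable.

    def swipe_schedule(blobs, image_height, band_height=40,
                       intra_delay=0.05, inter_delay=0.25):
        """Return (onset_seconds, blob) pairs, produced band by band, top down."""
        schedule, t = [], 0.0
        for top in range(0, image_height, band_height):
            band = [b for b in blobs if top <= b["y"] < top + band_height]
            for blob in sorted(band, key=lambda b: b["x"]):
                schedule.append((t, blob))
                t += intra_delay      # configurable gap between blobs in a band
            t += inter_delay          # configurable pause before the next band
        return schedule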

The main advantage of generating a soundscape (auditory mapping) by sequentially swiping bands from top to bottom is that Y-axis information is represented clearly through time delays. The use of time delays helps to improve human sound localization along the Y-axis. One might say that the loss of spatial representational freedom (resolution and concurrency), and the resulting difficulties in object localization, constitute the central problems or limitations of visual-to-auditory sensory substitution. In an image, the data comes in two dimensions, and stereoscopy enables a third dimension (i.e. depth information). Although swiping has been used before (both vertically and horizontally) in order to address this spatial issue, to the best of our knowledge this is the first time that it has been combined with intermediate-level image processing and an emphasis on colour processing.

    IV. IMAGE SEGMENTATION

To obtain image blobs and their properties, Luminophonics applies an image segmentation technique which works in linear time and simultaneously labels connected components and their contours [3]. This technique not only provides fast, real-time segmentation for simple images but also generates blob properties such as size (width and height) and location (pixel coordinates).

Fig. 2. Process of Segmentation (panels: Input, Binary, K-means, Output)

In this method, the input images are converted into binary images for the purpose of image segmentation using a contour-tracing technique. To simplify the image further, the K-means algorithm is used to partition the binary image into multiple segments. After 5 iterations of the K-means algorithm (refer to Figure 2), noise and spurious detections are greatly reduced, which increases the accuracy of the final segmentation. In order to preserve the colour information for later conversion, the input images are kept in pairs with their binary counterparts in a memory buffer. The original colour images are stored for colour extraction, while the binary images are used for image segmentation.
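
For concreteness, the following is a rough Python/OpenCV sketch of this stage under several stated assumptions: OpenCV's connectedComponentsWithStats is used as a stand-in for the contour-tracing labelling algorithm of [3], K-means is run for the 5 iterations mentioned above, and the choice of K = 2, Otsu thresholding and 8-connectivity are illustrative guesses rather than details given in the paper.

    import cv2
    import numpy as np


    def segment(bgr_img, k=2):
        """Quantize with K-means, binarize, then label blobs with their stats."""
        pixels = bgr_img.reshape(-1, 3).astype(np.float32)
        criteria = (cv2.TERM_CRITERIA_MAX_ITER, 5, 1.0)        # 5 iterations
        _, labels, centers = cv2.kmeans(pixels, k, None, criteria,
                                        1, cv2.KMEANS_RANDOM_CENTERS)
        quantized = centers[labels.flatten()].reshape(bgr_img.shape).astype(np.uint8)
        gray = cv2.cvtColor(quantized, cv2.COLOR_BGR2GRAY)
        _, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        n, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
        blobs = []
        for i in range(1, n):                                  # label 0 is background
            x, y, w, h, area = stats[i]
            blobs.append({"x": x, "y": y, "w": w, "h": h, "area": area})
        return binary, blobs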

Two regions-of-interest (ROIs), namely a left region and a right region, are extracted from the binary image by splitting it into two parts of equal size. Splitting the image not only speeds up the process of image segmentation but also caters to the two-channel stereo audio aspect of the solution. Two different audio outputs are produced, one for each ear. Assuming that earphones are being used, the left ear can only hear the objects that exist in the left ROI, while the right ear can only hear the sounds of objects in the right ROI. If an object exists on both sides, its corresponding audio properties can be heard in both ears. This helps the user to determine the position of objects along the X-axis.

From the image segmentation stage, a set of blobs and their properties is produced in a linear format. Referring to Figure 2, the final output is shown in the bottom right corner, with the red boxes illustrating the blobs to be classified. After this, the Heuristic Colour Model classifies the blobs based on their colours. The blobs are subsequently arranged in a buffer in a swiping-compatible format before they are finally converted into audio waves.

    V. HEURISTIC COLOUR MODEL

CCD (charge-coupled device) and CMOS (complementary metal-oxide-semiconductor) camera sensors both vary in sensitivity and colour management models. The subjective interpretation of colour also varies between individuals. Hence, a standard heuristic colour model is needed for each individual camera to deduce standard colour properties based on user settings. By assessing how users perceive colour, a colour model (based on flexible thresholds) can be created for particular camera/user pairs.

The Heuristic Colour Model (HCM) is based strictly on the HSL colour model, which consists of a double symmetrical cone with a true black point at the bottom and white at the other end. H represents hue, S represents saturation and L represents lightness. To determine the mapping of a pixel value, HCM first tests the pixel's saturation and, based on this, decides whether to test its hue or its lightness.

    Fig. 3. Heuristic Colour Model Decision Chart

HCM is a simple and intuitive method for determining whether a particular pixel should be treated as a colour pixel or as a grayscale pixel. For a colour pixel, the saturation must lie between the upper and lower thresholds of the saturation scale. For a grayscale pixel, the saturation is either above the upper threshold or below the lower threshold. Refer to Figure 3 for a diagram of HCM's decision logic.
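
A hedged Python sketch of this decision logic follows. The threshold values (0.15 and 0.95), the hue bin edges and the lightness cut-offs for Black/White/Gray are illustrative assumptions; the paper states only that the thresholds are flexible and tuned per camera/user pair.

    import colorsys

    # Upper hue bound (degrees) for each chromatic category; assumed values.
    HUE_BINS = [(15, "Red"), (45, "Orange"), (75, "Yellow"), (165, "Green"),
                (255, "Blue"), (285, "Indigo"), (330, "Violet"), (360, "Red")]


    def classify_pixel(r, g, b, s_low=0.15, s_high=0.95):
        """Map an 8-bit RGB pixel to one of the 10 HCM colour categories."""
        h, l, s = colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)
        if s_low <= s <= s_high:          # chromatic pixel: decide by hue
            hue_deg = h * 360.0
            for upper, name in HUE_BINS:
                if hue_deg < upper:
                    return name
        if l < 0.2:                       # achromatic pixel: decide by lightness
            return "Black"
        if l > 0.8:
            return "White"
        return "Gray"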

VI. VISUAL TO AUDIO MAPPING

The primary visual properties being investigated in this early phase of the Luminophonics project are object colour, size and location. These properties are converted and synthesized dynamically into audio waves by manipulating different audio properties. It is crucial to understand and exploit the full range of both the audio and visual domains in order to produce a good visual-to-audio mapping [7]. An intuitive mapping of visual to auditory properties not only helps users learn how to use the technology more rapidly and effectively, but also maximizes the information preservation across modalities.

    A. Colour

Blob pixels, as captured by standard imaging sensors, are by default Red-Green-Blue (RGB) values. These values are converted into the Hue-Saturation-Lightness (HSL) colour space. Blobs are then categorized into different colours by the Heuristic Colour Model, which operates on the mean of the blob's pixels. In the first Luminophonics prototype reported here, blobs are categorized into 10 colours, namely: Red, Orange, Yellow, Green, Blue, Indigo, Violet, Black, White and Gray. In a manner analogous to the SeeColor approach, the 10 different colours are then mapped to 10 different timbres (or musical instruments).

Besides categorizing blobs into distinct colours, the mean HSL value of a blob provides further auditory variation through its lightness value, which affects the frequency (or pitch) of each timbre (or instrument).

1) Timbre: Each colour is mapped to a different and distinctive timbre. In order to satisfy the requirements of timbre distinctiveness, usability (e.g. ease of learning) and aesthetic satisfaction, as with the SeeColor approach, we have chosen different musical instruments to represent the different classes of timbre.

Table I depicts the particular colour-to-timbre mapping currently being used. This mapping is an obvious target for configurability, seeing that users are likely to have their own aesthetic preferences.

2) Frequency: Although the 10 colours are matched with 10 different musical instruments, the lightness value of a blob can additionally be encoded in the auditory signal by modulating the frequency (pitch) of the musical instrument, thus further expanding the range of sounds that can be experienced by the user. This encoding allows users to differentiate two blobs that share the same colour but have different lightness values.
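
As a sketch of these two rules, the dictionary below restates the colour-to-instrument mapping of Table I, and the small helper maps lightness to a pitch. The direction of the mapping (lighter means higher pitch) and the 220 Hz base and two-octave range are assumptions made for illustration, not values given in the paper.

    COLOUR_TO_INSTRUMENT = {
        "Red": "Saxophone", "Orange": "Cello", "Yellow": "Harmonica",
        "Green": "Piano", "Blue": "Horn", "Indigo": "Guitar",
        "Violet": "Trumpet", "White": "Xylophone", "Gray": "Flute",
        "Black": "Violin",
    }


    def lightness_to_frequency(lightness, base_hz=220.0, octaves=2.0):
        """Map lightness in [0, 1] to a pitch in Hz (assumed: lighter = higher)."""
        return base_hz * 2.0 ** (lightness * octaves)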

    B. Location

The 2D coordinates of each blob are extracted by the contour segmentation technique in x and y pixel units.


TABLE I
COLOUR MAPPING TABLE

Colour    Instrument
Red       Saxophone
Orange    Cello
Yellow    Harmonica
Green     Piano
Blue      Horn
Indigo    Guitar
Violet    Trumpet
White     Xylophone
Gray      Flute
Black     Violin

1) X-axis: The x-coordinate of a blob is encoded through stereophonic sound. The x-axis is represented by three regions, i.e. left, right and both sides. Before audio conversion, Luminophonics determines in which region each blob resides, and audio synthesis is based on that region. For example, if a blob falls entirely in the left region, sound is synthesized only in the left audio channel. The converse applies to blobs falling entirely in the right region. If a blob falls across both regions, then sound is synthesized in both audio channels.
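
This placement rule can be sketched in a few lines of Python. The blob dictionary fields are the ones produced by the earlier segmentation sketch and are assumptions for the example.

    def channel_gains(blob, image_width):
        """Return (left, right) gains for a blob's bounding box."""
        mid = image_width // 2
        left_edge, right_edge = blob["x"], blob["x"] + blob["w"]
        if right_edge <= mid:
            return 1.0, 0.0       # entirely in the left ROI: left ear only
        if left_edge >= mid:
            return 0.0, 1.0       # entirely in the right ROI: right ear only
        return 1.0, 1.0           # spans the split: heard in both ears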

2) Y-axis: The y-coordinate of a blob is implicitly mapped through the time delay of each band; indeed, one of the main purposes of creating the swiping method is to preserve Y-axis information. Without a surround sound system or a highly trained ear, the y-coordinate of an object is very hard to represent. In the swiping method, the y-coordinate of a blob is associated with a time delay in an intuitive manner: the further down a blob is in the image, the longer it takes for its corresponding sound to be synthesized. Thus, users can use this temporal information to create a mental image of the relative positioning of blobs along the y-axis.
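
In code, the rule is simply that a blob's onset grows with the index of the band it falls into; a minimal sketch follows, reusing the same hypothetical band height and inter-band delay as the swiping sketch in Section III.

    def onset_delay(y, band_height=40, inter_delay=0.25):
        """Blobs lower in the image fall into later bands and so start later."""
        return (y // band_height) * inter_delay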

    C. Size

In the current implementation of Luminophonics, the size (or area) of a blob is encoded in terms of the volume of the instrument being played. In other words, the volume of a particular instrument is directly proportional to the size of the blob generating it. This visual-to-auditory association, apart from allowing users to infer the size of objects, also allows them to infer their horizontal skewness. If a blob is skewed to the left, the area of the part of the blob lying in the left region will be larger than the area lying in the right region; hence, the volume of that blob's audio in the left channel will be higher than the volume of the same audio in the right channel.
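
The sketch below ties the three mappings together for a single blob: area sets loudness, the left/right split of the bounding box sets the channel balance (so a left-skewed blob sounds louder on the left), and the vertical position sets the onset. The sine tone, the bounding-box approximation of the per-side area, and all constants are assumptions standing in for the prototype's actual synthesizer.

    import numpy as np


    def render_blob(out, blob, image_w, image_h, freq,
                    sr=44100, duration=0.4, band_height=40, inter_delay=0.25):
        """Mix one blob's tone into a stereo buffer `out` of shape (samples, 2)."""
        mid = image_w // 2
        volume = blob["area"] / float(image_w * image_h)             # size -> volume
        left_w = max(0, min(blob["x"] + blob["w"], mid) - blob["x"])
        gain_l = volume * left_w / blob["w"]                         # horizontal skew ->
        gain_r = volume * (blob["w"] - left_w) / blob["w"]           # channel imbalance
        start = int((blob["y"] // band_height) * inter_delay * sr)   # y -> time delay
        if start >= out.shape[0]:
            return out
        t = np.arange(int(duration * sr)) / sr
        tone = np.sin(2 * np.pi * freq * t).astype(np.float32)
        end = min(start + tone.size, out.shape[0])
        out[start:end, 0] += gain_l * tone[: end - start]
        out[start:end, 1] += gain_r * tone[: end - start]
        return out


    # Example: a 3-second buffer and one 60x40 blob in the upper-left area.
    stereo = np.zeros((44100 * 3, 2), dtype=np.float32)
    render_blob(stereo, {"x": 10, "y": 90, "w": 60, "h": 40, "area": 2400},
                image_w=320, image_h=240, freq=440.0)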

VII. TRAINING & PRELIMINARY TESTS

Several basic preliminary tests were conducted, not only to validate the software developed but also to verify the validity of our sensory-substitution approach. Due to the preliminary nature of these tests, a single subject, with adequate hearing and a background in music, was used. Before the actual testing phase, the subject was asked to go through a training phase, where he learnt to recognize soundscapes generated by the prototype.

    A. Training

The training was limited to the basic mappings (or features) implemented by the prototype, namely colour, location and size. The participant was trained to recognize the three basic features and was expected to recognize combinations of these in order to achieve the objective of each experiment. The duration of each training session depended on the satisfaction of the participant, who was allowed to repeat sessions any number of times until he was satisfied with his ability to recognize each basic feature.

The first training session was dedicated to recognizing the colour of the input image through its corresponding sound (i.e. timbre). The participant was also trained to differentiate shades of colour through the resulting variations in pitch.

To train the participant to recognize the location of an object, he was taught to relate time delays to vertical position and stereo placement to horizontal position. Four blobs of the same colour were drawn on four different input images in four different quadrants: Top-Left, Top-Right, Bottom-Left and Bottom-Right. The participant was required to listen to the resulting soundscape and interpret the position of the blobs.

Training to recognize blob size was comparatively easy, because of the simple relationship between object size and sound volume. Blobs of different sizes were drawn onto different input images. The participant was asked to listen closely to the different sound volumes and to recognize blob sizes based on these differences.

    B. Experiments

Six different experiments were conducted with the participant after the training phase. The difficulty and complexity of the experiments were low, with a maximum of two feature combinations per test. Due to the preliminary nature of the testing, the sample images consisted of simple synthetic objects. Figure 4 shows some of the test samples used in the experiments.

1) Colour Test: The colour test was the simplest experiment of all: the participant was required to differentiate between the 10 possible blob colours. One object of one colour was shown each time, and the participant was required to identify its colour through the resulting sound. The process was repeated 20 times to roughly gauge the accuracy of the participant's colour identification.


    Fig. 4. Test Samples


2) Object Test: The objective of this experiment was to test whether our prototype produced adequate auditory features for recognizing four different object classes. The four objects were displayed randomly in different sequences, 5 times each (each object class consisted of five variations). In total, the participant needed to recognize the objects in 20 trials.

The four object classes (i.e. bee, house, stickman and tree) were pre-selected partly based on their distinctive features. The bee was selected due to its combination of black and yellow features. Images of the house category were drawn specifically from a square and a triangle of different colours; the participant was expected to recognize the object through the triangle/square arrangement regardless of the specific colours. Stickman images were drawn in only one colour, with a round head at the top and a body made out of lines. Tree images were drawn with simplified green foliage and a brown trunk.

3) Shade Test: As mentioned earlier, different colour shades produce different pitches for the same musical instrument. In the shade test, two blobs of the same colour but with different shades were drawn side by side, and the participant was asked to pick the darker blob based on pitch. This experiment tested the participant's ability to discriminate horizontal locations and shades of colour.

4) Find Location: In this experiment, one variant from each of the four object classes (mentioned in the object test) was selected and redrawn in the four different image quadrants. The participant was asked to locate the position of a specific object in one of the four quadrants. 24 different combinations of images were shown randomly to the participant.

5) Find Object: This experiment is closely related to the previous one. In this case, the participant was given a specific quadrant and asked to identify the object located in it. 24 different combinations of images were shown randomly to the participant.

6) Counting Test: The final experiment was the most complex of the six reported here; the participant (and the prototype) were tested on all three visual features (i.e. location, size and colour). In Figure 4, the bottom-left image, labelled Rounds, is one of the test images, in which the user needs to count the number of round objects in 3 different colours. Different shapes, of different sizes and in different colours, were drawn on each image. Every image contained a different number of blobs placed at random locations. The participant was required to count the number of blobs in the image. The process was repeated 10 times.

    VIII. RESULTS

    Fig. 5. Preliminary Experiments Accuracy

The bar chart in Figure 5 shows the recognition rates of the participant for the different experimental conditions. Given that for most conditions the chance level rests at 25% (for the colour and shade tests the chance levels are 10% and 50% respectively), these results suggest that the prototype is indeed functioning as expected.

The participant obtained the lowest accuracy in the experiment that required him to identify an object within a given quadrant. This result might be partly explained by the phenomenon of cacophony, which was more pronounced in this condition. In contrast, and perhaps ironically, the participant obtained the highest accuracy in the experiment that required him to find the location of a given object. The fact that the subject limited his focus to one quadrant at a time and compared its soundscape to the one stored in his memory might explain the higher accuracy rate.

In the counting test, the participant obtained a high accuracy rate, with 90% correct answers. This suggests that swiping from top to bottom is a good approach for helping users create mental maps of blob locations and numbers. The participant was capable of drawing the locations of blobs on a graph while counting them one by one.

IX. CONCLUSION AND FUTURE ENHANCEMENTS

The swiping method aims to improve current visual-to-auditory sensory substitution solutions by maximizing information conversion whilst maintaining learnability and interpretability. The visual properties of colour, size and location are explicitly encoded and converted into auditory signals. Due to the attentional (banding) and temporal (swiping) aspects of the solution, information about shape and texture can also be deduced from the soundscape. Through training and/or repeated utilization, users should become capable of interpreting increasing amounts of information with a reduced sense of cacophony, and should exhibit faster recognition and learning rates.

Having said this, a significant amount of work remains to be done. One of the most immediate future tasks involves the image processing stage, whereby we hope to generate simplified visual descriptions (through modified segmentation algorithms) that cohesively extract the most relevant aspects of a visual scene. Future work will also involve the completion of a highly configurable prototype which, apart from allowing users to fine-tune the system to their tastes and requirements, will allow us to conduct extensive experimentation in order to conclusively answer the question of how to maximize cross-modal information conversion. This effort will also involve an investigation of human perceptual capabilities and limitations. Quantitative conversion measures also need to be developed in order to facilitate preliminary comparisons between approaches. Different variations of the solution are expected to be developed for different contexts, e.g. navigation (e.g. walking in a shopping mall) vs. human-computer interaction (e.g. trying to interpret a graph). Due to the importance of depth information, particularly in the context of navigation, future versions of the approach should incorporate a stereo camera. The auditory encoding of depth should be done in a manner that is distinctive and does not interfere with the mappings already provided for colour, position and size.

In conclusion, preliminary results indicate that the image processing and attentional-dynamics approach adopted by Luminophonics is valid; its aims of further maximizing the quantity, rate, learnability and interpretability of the information converted are therefore within reach, and consequently its ultimate goal of providing effective technology to assist the visually impaired in their real-life interactions with the environment is also attainable.

    REFERENCES

[1] Ben-Tal, O., Berger, J., Cook, B., Daniels, M., Scavone, G., and Cook, P. SonART: The sonification application research. In Proceedings of the 2002 International Conference on Auditory Display, page 151, Kyoto, Japan, Jul 2002.
[2] Bologna, G., Deville, B., Pun, T., and Vinckenbosch, M. Transforming 3D coloured pixels into musical instrument notes for vision substitution applications. EURASIP Journal on Image and Video Processing, 2007.
[3] Chang, F., Chen, C.-J., and Lu, C.-J. A linear-time component-labeling algorithm using contour tracing technique. Computer Vision and Image Understanding, 93(2):206-220, February 2004.
[4] Grant, A. C., Thiagarajah, M. C., and Sathian, K. Tactile perception in blind Braille readers: A psychophysical study of acuity and hyperacuity using gratings and dot patterns. Perception & Psychophysics, pages 301-312, 2000.
[5] Meijer, P. An experimental system for auditory image representations. IEEE Transactions on Biomedical Engineering, 39(2):112-121, 1992.
[6] World Health Organization. Up to 45 million blind people globally - and growing. Retrieved from the World Health Organization official website: http://www.who.int/mediacentre/news/releases/2003/pr73/en/, Oct 2003.
[7] Yeo, W. S. and Berger, J. Application of raster scanning method to image sonification, sound visualization, sound analysis and synthesis. In Proceedings of the 9th International Conference on Digital Audio Effects (DAFx-06), 2006.