IMAGE PROCESSING
ON A MOBILE PLATFORM
A thesis submitted to the University of Manchester
for the degree of Master of Science
in the Faculty of Engineering and Physical Sciences
2009
By
Samantha Patricia Bail
School of Computer Science
Contents
Abstract
Declaration
Copyright
Acknowledgements
1 Introduction
1.1 Description of the Project
1.2 Motivation
1.3 Main Objectives
1.4 Scope
1.5 Dissertation Overview
2 Project Background and Literature Review
2.1 Overview
2.2 Mobile Platforms
2.3 Mobile Phones as Assistive Devices
2.4 Image Processing and Object Detection
2.5 Related Work
2.6 Analysis of Methods for Object Detection
2.7 Factor Graph Belief Propagation
2.8 Chapter Summary
3 Application Design
3.1 Overview
3.2 Requirements Analysis
3.3 Software Architecture
3.4 Image Processing Methods and Algorithms
3.5 Training Images
3.6 Issues Affecting the System Performance
3.7 Chapter Summary
4 System Implementation
4.1 Overview
4.2 Implementation Tools
4.3 Image Capturing
4.4 Phase One: Feature Extraction
4.5 Phase Two: Object Recognition
4.6 Result Output
4.7 Optimisation for Symbian S60 devices
4.8 Chapter Summary
5 Testing
5.1 Overview
5.2 Description of the Testing Procedures
5.3 System Performance Evaluation
5.4 Chapter Summary
6 System Evaluation
6.1 Overview
6.2 Analysis of the Research Methodology
6.3 Review of the Project Plan
6.4 Improvements
6.5 Chapter Summary
7 Conclusion and Future Work
7.1 Project Summary
7.2 Future Work
Bibliography
A Listings
List of Figures
1.1 Two exit signs according to BS 5499-4
2.1 Worldwide smartphone sales to end users 2008
2.2 Example of a factor graph
3.1 Class diagram showing the organisation of the application classes
3.2 State diagram for the emergency exit sign recognition software
3.3 Sobel kernels used for horizontal and vertical derivatives
3.4 Four examples of emergency exit signs captured with a phone camera
4.1 Individual steps of edge detection
4.2 Two examples of binary sign templates
Abstract
Emergency exit signs are an indispensable part of any safety precautions for
public buildings. In case of an emergency, they indicate safe escape routes and
emergency doors, using an internationally recognizable sign: a green and white
sign with icons showing a running person, a door, an arrow pointing in the
direction of the escape route and the word Exit (or other words describing an
emergency exit), in different combinations. These signs can be easily detected
and interpreted by sighted people, but are unsuitable for visually impaired persons
who cannot rely on visual indicators.
This project deals with the issues of recognizing emergency exit signs with a
mobile device. It describes the development of a piece of software that runs on
a Symbian OS smartphone and can be used to detect emergency exit signs using
the phone’s camera. In case of a detection, the device indicates this through an
acoustic signal and, if an arrow is present on the sign, the software specifies the
direction through text output.
In order to achieve fast processing times, the study also deals with the low
computing power of smartphones. The chosen approach is based on belief prop-
agation on factor graphs, a method drawn from statistics, which is used in com-
bination with other image processing tasks such as template matching. While
the success of an efficient implementation depends strongly on the observance of
necessary optimisations in both the choice of algorithms and coding practice, the
general feasibility of image processing on the chosen mobile platform is demon-
strated by this project.
Declaration
No portion of the work referred to in this thesis has been
submitted in support of an application for another degree
or qualification of this or any other university or other
institute of learning.
Copyright
i. Copyright in text of this dissertation rests with the Author. Copies (by any
process) either in full, or of extracts, may be made only in accordance
with instructions given by the Author. Details may be obtained from the
appropriate Graduate Office. This page must form part of any such copies
made. Further copies (by any process) of copies made in accordance with
such instructions may not be made without the permission (in writing) of
the Author.
ii. The ownership of any intellectual property rights which may be described
in this thesis is vested in the University of Manchester, subject to any
prior agreement to the contrary, and may not be made available for use by
third parties without the written permission of the University, which will
prescribe the terms and conditions of any such agreement.
iii. Further information on the conditions under which disclosures and exploita-
tion may take place is available from the Head of the School of Computer
Science.
Acknowledgements
I would like to thank my supervisor Dr Tim Morris for his support and helpful
guidance throughout all stages of the project, as well as Dr David Rydeheard who
would always provide me with good advice whenever I came across any difficulties
on the course. Many thanks to Marcus Groeber for his advice regarding Symbian,
and to Volodymyr Ivanchenko for providing the Crosswatch application for testing
purposes.
Thanks to my family and especially my mother and grandfather who supported
me during my never-ending studies (I’m off to the next round). My
thanks go out to Simon for his incredible patience, as well as his family for all
their help. Thanks to all my friends in the UK and in Germany, especially to Dr
B., and to my housemates for their motivational talks.
I would also like to mention all the students who spent so many days (and
nights) in the MSc lab and provided me with advice and chats.
Danke.
Chapter 1
Introduction
1.1 Description of the Project
Visual signs provide a means of orientation for sighted people within unfamiliar
locations such as offices, hospitals and other public buildings. Particularly in
emergency situations, emergency exit signs point the way to important escape
routes, thus making them a legal requirement for buildings of a certain size.
However, for people with visual impairments, these vital resources cannot be
utilized as a guidance aid. Using a mobile tool to detect these emergency signs
and output the necessary information in acoustic form can make them accessible
to people who cannot rely on their eyesight to recognize visual objects. This can
be helpful in unknown or complex buildings, when the escape routes cannot be
memorized and there is no other person immediately available that could provide
guidance to find the right escape route.
This project will carry out research into the feasibility of such a guidance
system, analyse different methods to achieve the task and describe a way of implementing
the system on a mobile platform. Upon completion of the work, we
will have gained insights into an efficient implementation of computationally demanding
procedures such as computer vision algorithms on mobile devices with
low processing power. In addition, the software will be a demonstration of how
modern technology such as the smartphone platform with its wide scope of pos-
sible applications can be used to assist blind and visually impaired people.
1.2 Motivation
There are over two million people in the UK living with significant sight loss,
of whom over 300,000 are officially registered as blind or partially sighted
[RNI]. Numerous tools and techniques are available to blind people to help them
complete everyday tasks more safely and with greater independence. Such assis-
tance can come in the form of guide dogs and white canes (for navigating around
unfamiliar obstacles in public spaces) but also in lesser known forms, such as
digital water level sensors that sound an alarm when a vessel is full.
The use of modern information technology has become increasingly popular in
the past few years, with companies providing mobile talking book players, braille
output devices for mobile phones and text-to-speech software for computers.
To sighted people, many everyday tasks such as locating exit signs in public
places are hardly thought about; it is something that is done almost subcon-
sciously. However, for a blind or partially sighted person, not being able to
identify the quickest and safest way out of a building can have serious, poten-
tially dangerous consequences. It is this particular problem that will form the
core of this study.
While mainly based in the discipline of computer vision, this project has two
important aspects: First, adapting modern technology in order to provide as-
sistance to visually impaired people, without the need to produce specifically
designed devices for them, which is connected to the notion of accessibility. Sec-
ondly, the implementation of a computationally demanding task such as image
processing on a platform with restricted computing power. This fact makes it
necessary to move away from some of the traditionally used methods that prove
too computationally demanding, and explore novel approaches, simplified versions
of algorithms and approximations that can be used to achieve a lightweight im-
plementation.
We acknowledge that the idea of using computer vision for visually impaired
people is not groundbreaking; however, it is still rarely seen on mobile platforms.
We hope to give an insight into the different possibilities that modern mobile
systems offer, and provide the basis for further research in this area.
1.3 Main Objectives
The aim of this application is to provide visually impaired people with a method
of recognizing emergency exits1 independently, using an “out-of-the-box” mobile
phone with a built-in camera.
The ideal process when using the application would include the following
steps:
• The user opens the application on the phone, if possible via a shortcut
• The user pans the phone from side to side
• If an emergency exit sign is detected, the application outputs an acoustic
signal (a “beep”)
• If the sign contains an arrow, the application outputs the direction of the
arrow (e.g. “Arrow points to the right”)
• The user knows the location of the sign and where to proceed from there
(e.g. at the next door)
It is obvious that these signals can only function as a pointer to indicate
the approximate direction of an emergency exit. Parameters like the location
of the sign (above a door, on the wall etc.) in the room or the exact distance
from the camera would make the application more useful, but are difficult to
determine. However, a rough acoustic description of the direction is already a
step ahead of signs that are virtually useless for visually impaired people. It can
help with decisions such as which way to proceed along a corridor in order to
reach the nearest exit. By also describing
the arrow on the sign (if present), the application can prevent the user from
walking away from the exit, which makes this an important part of the application’s output.
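The intended feedback can be summarised in a few lines of code. The following is a minimal sketch in standard C++ rather than the Symbian C++ used for the implementation. The names DetectionResult, feedbackFor and directionText are hypothetical and chosen for illustration only, the acoustic signal is represented by the placeholder string "beep", and only the phrase "Arrow points to the right" is taken from the list above; the other phrases are assumed by analogy.

```cpp
#include <string>
#include <vector>

// Hypothetical types for illustration; not taken from the actual implementation.
// Diagonal arrow directions are omitted to keep the sketch short.
enum class ArrowDirection { Left, Right, Up, Down };

struct DetectionResult {
    bool signFound;        // true if an emergency exit sign was detected
    bool hasArrow;         // true if the detected sign contains an arrow
    ArrowDirection arrow;  // only meaningful when hasArrow is true
};

// Text that would be handed to the screen reader for a given arrow.
std::string directionText(ArrowDirection d) {
    switch (d) {
        case ArrowDirection::Left:  return "Arrow points to the left";
        case ArrowDirection::Right: return "Arrow points to the right";
        case ArrowDirection::Up:    return "Arrow points up";
        case ArrowDirection::Down:  return "Arrow points down";
    }
    return "Arrow direction unknown";
}

// Ordered list of outputs for one processed camera frame:
// nothing if no sign was found, otherwise a beep, followed by the
// arrow description if the sign contains an arrow.
std::vector<std::string> feedbackFor(const DetectionResult& r) {
    std::vector<std::string> out;
    if (!r.signFound) {
        return out;
    }
    out.push_back("beep");  // placeholder for the acoustic signal
    if (r.hasArrow) {
        out.push_back(directionText(r.arrow));
    }
    return out;
}
```

Keeping the decision about what to output separate from the detection itself means that the same logic could be reused if the form of the output changed, for example from text to recorded speech.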
1.4 Scope
In order to achieve the previously mentioned objectives, the scope of the project
has to be clearly defined.
1 We have picked emergency exit signs for this task as an example for the technology used in the project, as they are easily recognisable and standardised. However, we would like to point out that the methods discussed in this study could be applied to any other type of sign that is based on a common standard.
First, it has to be specified what exactly should be recognized by the ap-
plication. The basic design of emergency exit signs is similar for most countries
despite there being no mandatory international standard for emergency exit signs.
Most signs include a stylised symbol of a running person (sometimes in front of
a rectangle that represents a door), an arrow and the words “Exit”, “Emergency
exit”, “Fire Exit” or similar, in various combinations, but always green2 on white
background (or vice versa). Depending on the surrounding lighting conditions,
the signs can be either lit from the inside or externally. These differences do
not cause any problem for people who can see the signs and interpret them as
“similar” on the basis of their typical colour and contents. However, when trying
to apply automated recognition strategies to different types of signs, these are
likely to fail.
This is why a decision was made to constrain the set of signs that should be
recognized to emergency exit signs that were designed according to the British
Standard BS 5499-4 [BS500], which are widely used in most public buildings in
the UK. Signs of this type are composed of up to three different parts:
• A running figure (running to the left or to the right)
• An arrow pointing in the direction of the escape route
• The word “Exit” or “Fire exit”
Even with this constraint, we are still confronted with a problem caused by
the often low quality of built-in mobile phone cameras. When trying to capture
sample pictures of exit signs that were illuminated internally, the light intensity
of the sign leads to a large white spot on the image. This overexposure makes
recognising any sign impossible. Since it cannot be expected that mobile phone
cameras have the means of automatically adjusting the exposure time to correct
the flaws, this type of internally illuminated exit signs simply has to be removed
from the set of recognizable images.
This reduces the task to recognizing emergency signs that were designed ac-
cording to BS 5499-4, and which are not internally illuminated. Two examples
of these signs are shown in figure 1.1.
The signs always consist of the same three parts; however, their layout differs
depending on the location of the exit. Signs that point at a location to the right
2 In the case of BS 5499-4 exit signs, the shade of green is Pantone® 3405 CVC
Figure 1.1: Two exit signs according to BS 5499-4
(i.e. right, up, down, top right, bottom right) have the arrow on the right hand
side, with the running person facing the right. Accordingly, all signs with an
arrow pointing to the left, top left or bottom left, place the arrow on the left
hand side, with the running icon also facing the left.
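This layout rule is simple enough to be written down directly. The sketch below is an illustration in standard C++ under the assumption that the eight arrow directions named above are the only ones that occur; the type and function names (ArrowDirection, Side, SignLayout, layoutFor) are hypothetical.

```cpp
// Hypothetical names for illustration; these types are not prescribed by BS 5499-4.
enum class ArrowDirection {
    Left, TopLeft, BottomLeft,               // arrow drawn on the left-hand side
    Right, TopRight, BottomRight, Up, Down   // arrow drawn on the right-hand side
};

enum class Side { Left, Right };

struct SignLayout {
    Side arrowSide;     // where the arrow is placed on the sign
    Side figureFacing;  // which way the running figure faces
};

// Apply the layout rule: the arrow position and the direction of the running
// figure always agree, and both depend only on the direction of the arrow.
SignLayout layoutFor(ArrowDirection d) {
    switch (d) {
        case ArrowDirection::Left:
        case ArrowDirection::TopLeft:
        case ArrowDirection::BottomLeft:
            return SignLayout{Side::Left, Side::Left};
        default:  // Right, TopRight, BottomRight, Up and Down
            return SignLayout{Side::Right, Side::Right};
    }
}
```

Read in the opposite direction, the rule suggests that locating the arrow on the left or right of a sign already narrows the possible arrow directions to three or five candidates respectively.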
1.5 Dissertation Overview
The structure of this dissertation is roughly based on the chronological develop-
ment of the project:
• In chapter 2 we will discuss the usage of mobile phones as tools for visually
impaired users. We will then give an overview of the domain of computer
vision and its sub areas that are relevant for the given task, such as image
processing and object detection. This is followed by an extensive review of
related works of similar nature, i.e. image processing applications on mobile
platforms, which our project will be based on.
• Chapter 3 will evaluate different mobile platforms with respect to their
suitability for the task, and their efficiency in carrying out computationally
demanding processes like object detection. After deciding on the platform,
the available tools and methods will be reviewed, which will form the basis
of the actual implementation of the software. We will then describe the
general application design and outline implementation details, such as the
image processing algorithms that will be used.
• In chapter 4, details of the implementation on the mobile platform will be
explained, along with a discussion of the methods necessary for optimising
system performance. Given the rather unfamiliar mobile platform and pro-
gramming language Symbian C++, we will also include code snippets to
describe the most important modules of the system and highlight significant
details.
• This is followed by a description of the testing procedures and an evaluation
of the implemented system with respect to the test results in chapters 5 and
6 respectively. In addition, the chosen approach is analysed in the context
of projects with a similar background, where some of the advantages and
disadvantages will be discussed. In this chapter we will also review the
project plan with respect to the project flow.
• In the final chapter, we will summarise the project with respect to the tasks
performed during the course of this study and the findings discussed in the
different chapters. The work will be concluded by an overview of possible
future developments and applications based on the work performed in the
course of this project.
Chapter 2
Project Background and
Literature Review
2.1 Overview
This chapter will discuss the project background with regard to the status of the
chosen platform and the foundations of the research area it is based on. This
will be followed by an extensive literature review that discusses and analyses the
research that was carried out in similar projects, and their approaches to the
problem of image processing on mobile devices. We will then look into details
of the most suitable methods for the given task and draw a conclusion regarding
the chosen approach for our project.
2.2 Mobile Platforms
2.2.1 Hardware — Suitable Devices
Nowadays, the vast number of mobile phones that are available to the public
offer a multitude of designs and functionalities. For this project, certain requirements
for processing capacity and user interface have to be met, which narrows
the choice of phones down to a certain type. The term “smartphone” is gen-
erally used for a mobile phone that combines standard phone functions (phone
calls, text messaging) with those of a PDA, such as internet access, e-mail tasks,
multimedia players and office applications [Yua05]. The most popular and established
smartphones to date are the BlackBerry line (RIM), the iPhone (Apple)
and several Nokia devices (such as the N-series) – and the market is ever-growing.
The average processing power of smartphones seems appropriate for the computationally
demanding task of image processing, as has already been demonstrated
by several applications (see section 2). This is why we decided to choose a smartphone
platform for this project rather than developing a Java application for a
standard mobile phone. It can also be assumed that visually impaired users prefer
to use devices with text-to-speech software, for which smartphones provide the
most sophisticated platform.
With respect to the hardware and user interface, several requirements have to
be met for this task. The most obvious feature that is needed for image processing
is an integrated camera which is suitable for capturing images in a sufficient
quality and resolution, such as 320x240 or 640x480 pixels [KT07, ICS08]. Since
most mobile phones and smartphones come equipped with a camera that has a
resolution of at least 1 megapixel (up to 8 megapixels), this criterion will be easily
met by most available devices.
Another important issue is the user interface, that is, the accessibility by vi-
sually impaired users. As previously mentioned, it can be assumed that users
access the device through text-to-speech software that reads out the screen content
and describes the phone menus. Interaction is then carried out using the
phone’s buttons, which have to be located by touch. This requirement rules out devices
that are operated with touchscreens, as they provide no tactile feedback to the
user1.
As a conclusion, the most suitable device for the given task is a smartphone
that is able to run third party software, comes equipped with a camera and has
tactile buttons. These criteria have to be considered when choosing the platform
for the image processing software.
2.2.2 Operating Systems and Platforms
Smartphones are currently distributed with a wide range of operating systems,
such as Windows Mobile, the BlackBerry OS, Symbian OS, Palm OS and Linux-based
systems. All systems offer different capabilities for installing and running
1 Nokia announced support for tactile feedback on touchscreens with the latest Symbian OS version 9.4 in late 2008. However, this will not be discussed here, as it cannot be considered commercially relevant yet.
Operating System             Sales in Thousands   Market Share
Symbian                                72,933.5         52.4 %
RIM (BlackBerry)                       23,149.0         16.6 %
Microsoft Windows Mobile               16,498.1         11.8 %
Mac OS X (iPhone)                      11,417.5          8.2 %
Linux                                  11,262.9          8.1 %
Palm OS                                 2,507.2          1.8 %
Other                                   1,519.7          1.1 %
Total in 2008                         139,287.9        100.0 %
Figure 2.1: Worldwide smartphone sales to end users 2008
third-party software, and most manufacturers provide APIs for various program-
ming languages such as C, C++, Java and Python.
With a market share of roughly 50% of the smartphone market, the Symbian
operating system currently is the leading smartphone platform, as shown in figure
2.1 [Gar09]. Symbian OS is widely supported by Nokia2, which encourages developers
to implement applications for the operating system by providing the necessary
APIs and tools. This, and the wide range of available Symbian handsets, makes it
a suitable system to reach as many users as possible. In particular, the majority
of accessible Symbian devices runs the S60 version of this operating system.
Based on previous works that used the Symbian platform to develop image
processing applications, it can be assumed that devices with sufficient processing
power for this task are available. While Symbian also supports Java and Python,
Symbian C++, a C++ dialect, is labelled the fastest and most efficient programming
language on this system. [ICS08] even states that for the task of recognising
zebra crossings with a mobile phone, “Real-time performance [. . . ] is made pos-
sible by coding in Symbian C++”. These points lead to the decision to develop
the software for recognising emergency exit signs on the Symbian platform, using
Symbian C++.
The device used for this application is a Nokia N95 smartphone running the
9.2 version of Symbian OS. Both Symbian and Nokia offer extensive and up-to-date
online resources with detailed information on Symbian C++ programming
[Sym09, For]. The online forums offered on both websites in particular act as
a helpful source, along with sample code, parts of which (e.g. tutorials for the
camera API) were used as a starting point for this project.
2 Symbian Software Limited was in fact acquired by Nokia in 2008.
2.3 Mobile Phones as Assistive Devices
Mobile phones, and smartphones in particular, provide a platform for a wide
range of applications for visually impaired users. Interaction with the phone
is made possible by a screen reader that outputs the displayed text via text-
to-speech, and therefore allows access to text-based information such as phone
menus, internet content or text messages. Many smartphone applications were
developed specifically for use by visually impaired people. These include OCR3
applications that make use of a built-in camera, navigation software and audio
players for talking books in the DAISY4 format. Due to the existence of a screen
reader on the device, the applications do not need to implement their own text-
to-speech solutions, but only ensure accessibility by the screen reader.
The benefits of smartphones as assistive devices lie in the convenience of using
an out-of-the-box platform with additional software, rather than a hardware-
based implementation that was designed for a single purpose. This includes lower
costs of an “all-in-one” tool, as opposed to multiple devices, the relatively small
size of modern phones and the comfort of not having to carry several devices.
2.4 Image Processing and Object Detection
In order to produce a correctly and efficiently working piece of software, it is
important to analyse the basic requirements for the given task with respect to
the underlying principles of computer vision. This discipline, part of the domain
of artificial intelligence (AI), aims at “emulat[ing] human vision” with the aid
of computers [GW02]. The general description of this term is the detection and
recognition of objects on the basis of an input image, which leads to a decision on
the contents and nature of the object and, eventually, a reaction of the system.
However, there is no clear definition of the boundaries of this discipline and
those of the subareas it comprises. In the literature, it is often implied that,
while being an extensive research subject itself, image processing is a subarea of
computer vision. It deals with the processing and analysis of an image in order
to manipulate the image, which yields another image (for example, by applying
a masking algorithm to detect edges in an image) or to obtain information about
3 Optical Character Recognition
4 Digital Accessible Information SYstem – Digital talking book format, based on XML applications, that was specifically developed for visually impaired users.
the image (such as a histogram of the image’s tonal values).
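The two kinds of operation mentioned above can be illustrated with a short sketch in standard C++. It assumes an 8-bit greyscale image stored in row-major order; the GreyImage type and the function names are hypothetical. The edge detector uses the standard 3x3 Sobel masks and approximates the gradient magnitude by |Gx| + |Gy|, which is one common simplification and not necessarily the variant used in this project.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical container for an 8-bit greyscale image in row-major order.
struct GreyImage {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;  // size must equal width * height
};

// Information about an image: count how often each tonal value occurs.
std::array<int, 256> histogram(const GreyImage& img) {
    std::array<int, 256> bins{};  // all bins start at zero
    for (std::uint8_t p : img.pixels) {
        ++bins[p];
    }
    return bins;
}

// Image-to-image operation: apply the horizontal and vertical Sobel masks
// and store |Gx| + |Gy|, clamped to 255. Border pixels are left at zero.
GreyImage sobelEdges(const GreyImage& img) {
    assert(img.pixels.size() ==
           static_cast<std::size_t>(img.width) * static_cast<std::size_t>(img.height));
    static const int kx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    static const int ky[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};

    GreyImage out{img.width, img.height,
                  std::vector<std::uint8_t>(img.pixels.size(), 0)};
    for (int y = 1; y + 1 < img.height; ++y) {
        for (int x = 1; x + 1 < img.width; ++x) {
            int gx = 0;
            int gy = 0;
            for (int j = -1; j <= 1; ++j) {
                for (int i = -1; i <= 1; ++i) {
                    const int p = img.pixels[(y + j) * img.width + (x + i)];
                    gx += kx[j + 1][i + 1] * p;
                    gy += ky[j + 1][i + 1] * p;
                }
            }
            const int mag = std::abs(gx) + std::abs(gy);
            out.pixels[y * img.width + x] =
                static_cast<std::uint8_t>(mag > 255 ? 255 : mag);
        }
    }
    return out;
}
```

Both functions use integer arithmetic only, which is a natural choice on devices where floating-point operations may be comparatively expensive.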
The area of object detection is of particular importance for the task of rec-
ognizing emergency exit signs. This term describes the process of finding and
identifying an object of interest in a given image; in this case, a rectangular plate
in a room or corridor. In order to recognize the object as an emergency exit sign
(as opposed to other signs), and the direction it points at, the object has to be
classified. This process, again, utilizes methods from AI (and neural networks in
particular), such as prior training of the system (on a set of positive and negative
images), probabilistic methods and statistics.
Because computer vision has been studied for several decades, a large number
of different algorithms (and implementations) exist for its various
tasks. Given the limited computing capacity of mobile phones,
the suitability of different algorithms for efficient implementation on this platform
has to be analysed. In the next section, we will look into projects that deal with
similar problems and review the different methods, which will then provide a
basis for our implementation.
2.5 Related Work: Image Processing on Mobile
Platforms
2.5.1 Server-Client Systems
It is only recently that researchers started studying the issue of image processing
on mobile devices. Due to the restricted computing capacity on mobile phones
and PDAs, there are different approaches to dealing with this issue, of which the
most significant will be discussed in this section.
One of the early solutions is the use of a server-client based system. The
user captures an image with the mobile device, which is then sent to a server
that carries out the actual processing work. After processing, the reply is sent
back to the user via the mobile phone network. Various commercial providers use
this method for mobile marketing in high-profile campaigns, such as those presented
in [Koo]. The major advantage is that the task can be carried out with any
kind of mobile device that has a camera, no matter what computing capacity it
has. However, this system requires a phone network connection to be available,
which can be difficult in certain areas or inside buildings. [RR06] uses a similar
server-client solution to recognize street name plates and use them as links to
further information that is available on-line. The system runs on a PDA with a
touchscreen, which allows the user to manually highlight the area that contains
the street sign (“area of interest”), and therefore simplifies the issue of object detection
in the image. Several feature extraction algorithms (SIFT, Black/White,
Wavelet, HSV) were analysed for this task, with SIFT proving to be the most
effective algorithm that is invariant to perspective bias and varying lighting conditions.
Due to the server-client structure, however, there were no investigations
into whether SIFT is also suitable on devices with low computational resources.
In 2009, Nokia launched their “Point & Find” software, which is the first
system that does not restrict the use to a certain type of objects [Kob09]. After
capturing a picture of the object, such as a film poster, the image (and, if avail-
able, GPS information on the user’s location) is sent to a server that searches a
database. Additional content and information on the object is then sent back to
the handset. Nokia aims at extending the use of “Point & Find” to a wide range
of commercial applications, such as barcodes and museum exhibits.
2.5.2 On-device Image Processing
A different approach, which carries out the image processing on the device
itself, is the use of two-dimensional barcodes or “QR Codes”.
With this method, the user captures an image of the barcode (e.g. on magazines,
posters), which is processed immediately and turned into a URL that leads to
further information on a website which is accessed via the phone’s browser. This
established method has spread widely over the last few years (various examples
can be found on [Mob]) and many phones already come equipped with a barcode
reader [Nok].
A similar project (“PhoneGuide”) uses image processing on mobile devices to
recognize exhibits in a museum [BB08]. A wide range of different algorithms have
been examined for this purpose, such as pattern-matching, discriminate regions
and SIFT, which all proved too inefficient on computationally restricted devices.
The chosen approach, a linear separation strategy implemented with an artificial
neural network, achieved the most correct and efficient object recognition.
Several sets of normalized features (such as colour and structural features) were
tested for object recognition, with colour features yielding the highest recognition
rate. However, due to the differences of various mobile phone cameras, colour calibration
is necessary if the camera used for training the algorithm differs from
the user’s phone. The application, implemented on a Symbian S60 smartphone
in Symbian C++, achieved a recognition rate of 90% in tests, with processing
times of less than 1 second.
All previously mentioned systems assume that the user points the camera
directly at (or even manually marks) the object that is to be captured, with
blurring, varying lighting conditions, scaling and perspective bias being the major
issues that need to be addressed. A basic aspect of our project, however, is
the software’s suitability for visually impaired people. In this case, not only
the nature of the captured object is important, but detecting whether there is
any object in the picture at all is even more critical. [PTAE09] emphasizes the
importance of object detection as a first step to recognizing text on street name
plates. The system uses a boosting algorithm (AdaBoost) and Haar features for
object detection. In order to reduce the number of false positives, the system
makes use of the textural information on street name signs (as opposed to windows
and building facades that caused the false positives). The text on the signs is
then recognized using a direct matching technique. Given the limited set of street
signs that are to be recognized, this image matching approach is considered more
efficient than character recognition. Although the system is intended for use on
a mobile phone, the testing was only carried out on a desktop PC, which does
not allow any conclusions regarding its efficiency on the target device.
[GdGH+06] focuses on the efficiency of a system for recognizing buildings (e.g.
for use as a tourist guide) with mobile devices, making use of a local invariant
regions algorithm. Several approaches for object recognition using global or local
features are analysed, with global features such as colour distribution proving to
be insufficient and not robust to occlusion or different viewpoints. Algorithms
such as SIFT that utilize local features are more robust to these problems, but
are found to be inefficient when carried out on a mobile device. In order to
reduce computation time, the image data is compressed using principal component
analysis. The similarity of the captured image to a building in the application’s
database is then determined using a voting scheme. Tests were carried out on a
Sony Ericsson K700i and Nokia 6630, with both phones only supporting Java ap-
plications, and achieved recognition times of less than 5 seconds for one building.
An application that is built explicitly for visually impaired users is described in
[ICS08]. The system is implemented in Symbian C++ on a Nokia N95 phone and
detects zebra crossings in real time (3 frames per second), using the phone’s video
capturing mode. The user points the camera in the estimated direction and the
application outputs an acoustic notifier if a zebra crossing is detected. The system
is based on a feature extraction in the first stage and figure-ground segmentation
using a graphical model, the factor graph, in the second stage. Figure-ground
segmentation5 describes the process of grouping pixels into object (figure) and
background (ground) pixels depending on their compatibility as a group of figure
or ground pixels respectively. Since mobile devices do not have floating point
units (FPU), all floating point operations are carried out on a software-emulated
FPU, which has a great impact on the processing speed. In order to avoid
floating-point calculations, the phone implementation uses a simplified version of
factor graph belief propagation to perform statistical inference on the factor
graph, as well as static arrays instead of dynamic lists. This application is the
only known approach to date that aims at processing images on a mobile platform
in real time and is therefore particularly interesting for our project in terms of
efficiency.
The Symbian platform is also used to develop a mobile colour recognition
software as described in [KT07]. Due to it being the “native language” of the
Symbian operating system and providing “very low level access to devices and
other services”, the C++ programming language is considered suitable for this
task. The system was tested on a Nokia N93 smartphone running the Symbian
S60 3rd edition operating system, and yielded a minimal processing time of 4.4
seconds after reducing the sample rate of the test image. Since this system is
only colour based, it strongly depends on the lighting conditions and camera
parameters, and is therefore very likely to produce incorrect results.
Another very recent development (Summer 2009) is the use of mobile phones
for augmented reality applications. These systems use a smartphone
camera to capture real-time images, process each image and output information
based on the image. The Swedish company TAT [TAT09] announced their “Aug-
mented ID” system that matches people’s faces with their profile in the database
of the social network, using a 3D facial recognition method. It then displays per-
sonal information (such as Facebook or Twitter profiles) as hovering icons around
the person’s face.
5 A term originating in the early 20th century “Gestalt” psychology dealing with human visual perception. The theory makes statements about how the visual system groups individual elements into objects, based on cues such as proximity and similarity.
2.6 Analysis of Methods for Object Detection
In order to decide which method for object detection on a mobile platform seems
suitable for our task, we have to look into the details of the approaches that were
proposed in the previous section. As we have no prior experience with image
processing on platforms with restricted computing power, the decision will be
based on the findings and conclusions drawn in previous research.
While the SIFT algorithm (scale-invariant feature transform) is considered
“superior” due to its invariance to image transformations (scaling, translation,
rotation) [RR06], it is also labelled too inefficient on mobile platforms [FZB+05].
By using the modified i-SIFT (informative SIFT), the application’s runtime can
be reduced, while yielding high recognition rates [GdGH+06]. However, this
algorithm is still too computationally demanding and time-consuming
for mobile devices, which is why it can be ruled out as unsuitable for
the given task.
Another approach mentioned in similar applications is the use of a boosting
algorithm such as AdaBoost (adaptive boosting). Boosting, which evolved from
the domain of machine learning, is based on the combination of “weak” (i.e. only
slightly better than random guessing) learning algorithms in order to produce
one “strong” learning algorithm through training [FS99]. During the training,
AdaBoost runs the weak algorithm repeatedly (e.g. for 100 rounds) on a set of
input values that are initially weighted equally. If a value is incorrectly classified,
its weight is increased, which marks it as a “hard” example that the algorithm has
to concentrate on. Each round of training yields one weak hypothesis; these are
combined into the final hypothesis, which yields a very low error rate.
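The training loop described above can be sketched as follows. This is a minimal illustration of the common two-class formulation of AdaBoost, not one of the implementations referred to in this section: the weak learner (a threshold test, `Stump`), the names `trainAdaBoost` and `classify`, and the clamping constant are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Weak learner (assumed for this sketch): a threshold test on one scalar feature.
// Predicts +1 if polarity * (x - threshold) >= 0, otherwise -1.
struct Stump {
    double threshold;
    int polarity;  // +1 or -1
    double alpha;  // weight of this weak hypothesis in the final vote
};

int stumpPredict(const Stump& s, double x) {
    return (s.polarity * (x - s.threshold) >= 0.0) ? 1 : -1;
}

// Trains up to `rounds` weak hypotheses on samples x with labels y in {-1, +1}.
std::vector<Stump> trainAdaBoost(const std::vector<double>& x,
                                 const std::vector<int>& y, int rounds) {
    assert(!x.empty() && x.size() == y.size());
    const std::size_t n = x.size();
    std::vector<double> w(n, 1.0 / static_cast<double>(n));  // equal initial weights
    std::vector<Stump> ensemble;

    for (int t = 0; t < rounds; ++t) {
        // Pick the stump with the lowest weighted error under the current weights.
        Stump best{0.0, 1, 0.0};
        double bestErr = std::numeric_limits<double>::max();
        for (std::size_t k = 0; k < n; ++k) {
            for (int polarity : {1, -1}) {
                const Stump candidate{x[k], polarity, 0.0};
                double err = 0.0;
                for (std::size_t i = 0; i < n; ++i) {
                    if (stumpPredict(candidate, x[i]) != y[i]) err += w[i];
                }
                if (err < bestErr) { bestErr = err; best = candidate; }
            }
        }
        if (bestErr >= 0.5) break;  // no better than random guessing: stop

        const double err = std::max(bestErr, 1e-10);  // avoid division by zero
        best.alpha = 0.5 * std::log((1.0 - err) / err);
        ensemble.push_back(best);

        // Raise the weights of misclassified ("hard") samples, then renormalise.
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            w[i] *= std::exp(-best.alpha * y[i] * stumpPredict(best, x[i]));
            sum += w[i];
        }
        for (double& wi : w) wi /= sum;
    }
    return ensemble;
}

// Final hypothesis: sign of the alpha-weighted vote of all weak hypotheses.
int classify(const std::vector<Stump>& ensemble, double x) {
    double vote = 0.0;
    for (const Stump& s : ensemble) vote += s.alpha * stumpPredict(s, x);
    return vote >= 0.0 ? 1 : -1;
}
```

Only `classify` would have to run on the handset; `trainAdaBoost` can run on an external machine, which is the property that makes boosting attractive for a device with little processing power.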
[PTAE09] describes the use of Haar-like features as weak classifiers for general
object detection. Haar-like features are image features that are represented as
adjoining black and white rectangles, the value of each feature being the difference
between the summed pixel grey level values within the rectangles. By using this method for
classification instead of single pixel values, the classification process can be sped
up significantly.
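As an illustration of such a feature, the following sketch computes a two-rectangle Haar-like feature by direct summation. The function names, the row-major image layout and the sign convention (left rectangle minus right rectangle) are assumptions for this example and are not taken from [PTAE09].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum of grey levels inside the rectangle [x0, x0+w) x [y0, y0+h)
// of a row-major greyscale image with the given width.
long rectSum(const std::vector<int>& grey, int width,
             int x0, int y0, int w, int h) {
    long sum = 0;
    for (int y = y0; y < y0 + h; ++y) {
        for (int x = x0; x < x0 + w; ++x) {
            sum += grey[static_cast<std::size_t>(y) * width + x];
        }
    }
    return sum;
}

// Two-rectangle feature over a window of size w x h:
// grey-level sum of the left half minus that of the right half.
long haarTwoRect(const std::vector<int>& grey, int width, int height,
                 int x0, int y0, int w, int h) {
    assert(w % 2 == 0 && x0 >= 0 && y0 >= 0);
    assert(x0 + w <= width && y0 + h <= height);
    const int half = w / 2;
    return rectSum(grey, width, x0, y0, half, h) -
           rectSum(grey, width, x0 + half, y0, half, h);
}
```

A large absolute value indicates a vertical intensity step in the middle of the window; a weak classifier can then be a simple threshold on this value.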
The advantage of AdaBoost is that the initial training can be carried out on
an external device, which makes it independent from the mobile phone’s
processing power. Implementations of different versions of this algorithm (namely
AdaBoost.M1, and a more complex version, AdaBoost.M2) in various programming
languages are available online and will be analysed for their portability onto
the Symbian platform. However, as there are no test results for the efficiency of
AdaBoost implementations on mobile platforms (see 2.5), the suitability of this
approach for our project has yet to be determined.
The most promising method that achieved high performance rates on a mobile
platform without the use of boosting is described in [ICS08]. The algorithm
utilizes (max-product) factor graph belief propagation, a method drawn from the
area of machine learning, for figure-ground segmentation in order to infer the
state (figure or ground) of each segment extracted from the image. The belief
propagation algorithm has only low complexity, as the required time “grows only
linearly with the number of nodes in the [graph]” [YFW03] — a clear advantage
for the implementation on a mobile device with a weak processing capacity.
Due to its convincing performance and the small set of training images nec-
essary for recognising objects, we decided to implement the system based on a
factor graph belief propagation method. The simplified max-product version of
this algorithm, as proposed in [SC07], in particular is expected to allow an ef-
ficient implementation. The next section will give a more detailed explanation
of factor graph belief propagation with respect to the task of image processing.
This is completed by a description of the steps necessary for implementing the
algorithm, which will be outlined in section 4.4.
2.7 Factor Graph Belief Propagation
2.7.1 Factor Graphs
Factor graphs are graphical models for the factorisation of global functions as
a product of local functions, which represent the mathematical relation “is an
argument of” between variables and the local functions. Factorisation of a func-
tion is the process of decomposing a global function g(x1, . . . , xn) into smaller
parts, its factors, that have a subset of {x1, . . . xn} as their arguments. The prod-
uct of these factors (or local functions) then again forms the original function.
Generally, the factorisation of a function g is defined as in [KFL01, p. 499]:
$$g(x_1, x_2, \ldots, x_n) = \prod_{j \in J} f_j(X_j) \qquad (2.1)$$
This process can be visualised by a bipartite6 graph that consists of
• Variable nodes xi, the set of all variable nodes being X = {x1, . . . xn}.
• Factor nodes fj, representing local functions that determine probabilities.
• Undirected edges that represent the relationship fj(Xj). An edge between
variable node xi and factor node fj exists if xi is an argument of fj, i.e. if
xi is an element of Xj, where Xj is a subset of {x1, . . . , xn}.
Figure 2.2 (based on [SC07, p. 4]) shows an example of a factor graph with
four variable nodes w, x, y, z and three factors f, g, h. The factor graph in this
figure shows the joint distribution P (w, x, y, z) = f(w, x, y)g(x, y, z)h(y, z), which
is represented by the edges between the variable nodes and factor nodes.
Figure 2.2: Example of a factor graph with variable nodes w, x, y, z (circles), and factor nodes f, g, h (squares).
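The bipartite structure of this example can be written down as a small data structure. The sketch below is only one possible representation; the type `FactorGraph` and its member names are hypothetical and do not describe the classes of the application.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Bipartite graph: variable nodes, factor nodes, and for each factor j
// the list of its argument variables X_j (these lists are the edges).
struct FactorGraph {
    std::vector<std::string> variables;
    std::vector<std::string> factors;
    std::vector<std::vector<std::size_t>> arguments;  // arguments[j] = X_j

    std::size_t addVariable(const std::string& name) {
        variables.push_back(name);
        return variables.size() - 1;
    }

    std::size_t addFactor(const std::string& name,
                          const std::vector<std::size_t>& args) {
        for (std::size_t v : args) assert(v < variables.size());
        factors.push_back(name);
        arguments.push_back(args);
        return factors.size() - 1;
    }

    // n(x): indices of all factors that have the given variable as an argument.
    std::vector<std::size_t> factorNeighbours(std::size_t variable) const {
        std::vector<std::size_t> result;
        for (std::size_t j = 0; j < arguments.size(); ++j) {
            for (std::size_t v : arguments[j]) {
                if (v == variable) result.push_back(j);
            }
        }
        return result;
    }
};

// The graph of figure 2.2: P(w, x, y, z) = f(w, x, y) g(x, y, z) h(y, z).
FactorGraph buildExampleGraph() {
    FactorGraph graph;
    const std::size_t w = graph.addVariable("w");
    const std::size_t x = graph.addVariable("x");
    const std::size_t y = graph.addVariable("y");
    const std::size_t z = graph.addVariable("z");
    graph.addFactor("f", {w, x, y});
    graph.addFactor("g", {x, y, z});
    graph.addFactor("h", {y, z});
    return graph;
}
```

In this representation the variable y is connected to all three factors, whereas w is only connected to f, which mirrors the edges drawn in the figure.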
With respect to the task of image processing, the nodes in a factor graph
correspond to segments of the image which have to be classified into figure (seg-
ments that fulfil the criteria for being part of the object) or ground (i.e. not part
of the object, background). This makes the variable nodes in the graph binary,
as they can have one of two states assigned: xi = 1 (figure) or xi = 0 (ground).
The decision (or “evidence”) whether a segment is more likely to belong to figure
or ground is based on cues that describe the relationship between neighbouring
segments. These cues can be of any arity (such as unary, binary, ternary and so
on) to take into account any number of segments. The evidence, in turn, is based on
the statistical differences between figure segments and ground segments, which
are learned from empirical data. Using the evidence from all cues, a factor graph
6 Bipartite describes the fact that the nodes can be divided into two sets, with edges only running between nodes from different sets. Here, the two sets are variables and factors.
then represents the joint distribution of each node’s state based on this evidence
[ICS08].
Based on a description given in the aforementioned source, the relationship
between variable nodes and n-ary cues in a factor graph will now be explained in
detail. The objective of this section is to clarify how we can infer the assignment
of each segment xi, that is, how to determine the global state assignments (con-
figuration) X = {x1, . . . , xn} of all segments extracted from the image. Based on
training data, it can be estimated that a certain number of segments is likely to
be in figure state, independently from other segments. This is an a priori belief
(an i.i.d.7 prior) on X, which is defined as:
$$P(X) = \prod_{i=1}^{n} P_i(x_i) \qquad (2.2)$$
In detail, we know that $P_i(x_i = 0) = p_0$ and $P_i(x_i = 1) = 1 - p_0$. This means
that, without considering any relationships between two or more segments, it is
already known that each segment has probability $p_0$ of being in state 0 and
$1 - p_0$ of being in state 1. The probability $p_0$ (ranging
from 0 to 1) is determined through training data.
A binary cue Cij describes the relationship between two neighbouring seg-
ments i and j. The relationship between this binary cue and the states of the
two segments it relates is defined as the conditional distribution P (Cij | xi, xj).
Again, this distribution is learned from training data. It can be decomposed into
two distributions $P_{on}$ and $P_{off}$, which describe the likelihood of the segments
belonging to figure (on) or ground (off):
$$P_{on} = P(C_{ij} \mid x_i x_j = 1) \qquad (2.3)$$
is the distribution of the cue for both segments in figure state ($x_i x_j = 1$), and
accordingly
$$P_{off} = P(C_{ij} \mid x_i x_j = 0) \qquad (2.4)$$
is the distribution if the product $x_i x_j = 0$, i.e. at least one of the segments is 0.
Evidence whether the pair of segments belongs to figure or ground is then given
by the difference between the two distributions, expressed as the log-likelihood
ratio $\log\left(P_{on}(C_{ij}) / P_{off}(C_{ij})\right)$.
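A minimal sketch of how this evidence could be evaluated is given below. It assumes that the two learned distributions are stored as histograms over quantised cue values; this storage format and the name `cueEvidence` are assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// pOn[b] and pOff[b] hold the learned probability that the cue value falls
// into histogram bin b for figure pairs (on) and for all other pairs (off).
// Returns log(Pon / Poff) for the bin of the observed cue value.
double cueEvidence(const std::vector<double>& pOn,
                   const std::vector<double>& pOff, std::size_t bin) {
    assert(pOn.size() == pOff.size() && bin < pOn.size());
    assert(pOn[bin] > 0.0 && pOff[bin] > 0.0);  // empty bins would need smoothing
    return std::log(pOn[bin] / pOff[bin]);
}
```

A positive value supports the hypothesis that both segments belong to the figure, a negative value speaks against it. On a device without an FPU these logarithms could be precomputed during training and stored as integers, although the sketch does not do this.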
7 Independent and identically-distributed. The states of all variables are independent from those of other variables, and each variable has the same probability distribution.
Generally, the set C of all cues Cij can then be related to the set of variable
nodes X through the posterior distribution8 P (X | C) which is proportional
to the product of the aforementioned a priori belief from equation 2.2 and the
distribution for binary cues:
$$P(X \mid C) \propto P(X) \prod_{(ij)} P(C_{ij} \mid x_i, x_j) \qquad (2.5)$$
Using the two equations 2.3 and 2.4, the product over (ij) can be rewritten
as:
$$\prod_{(ij)} P(C_{ij} \mid x_i, x_j) = \prod_{(ij)} P_{off}(C_{ij}) \prod_{i,j:\, x_i x_j = 1} \frac{P_{on}(C_{ij})}{P_{off}(C_{ij})} \qquad (2.6)$$
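The step from equations 2.3 and 2.4 to equation 2.6 can be made explicit. Since the product $x_i x_j$ only takes the values 0 and 1, each factor of the product over (ij) can be written with $x_i x_j$ as an exponent:

$$P(C_{ij} \mid x_i, x_j) = P_{off}(C_{ij}) \left( \frac{P_{on}(C_{ij})}{P_{off}(C_{ij})} \right)^{x_i x_j}$$

For $x_i x_j = 0$ the bracket equals 1 and only $P_{off}(C_{ij})$ remains; for $x_i x_j = 1$ the expression reduces to $P_{on}(C_{ij})$. Taking the product over all pairs (ij) then yields equation 2.6.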
The product $\prod_{i,j:\, x_i x_j = 1}$ is restricted to $x_i x_j = 1$, which means that only pairs of
segments that are both in figure state are taken into account. In this equation
the product over $P_{off}$ is independent of X, which is why it can be removed from
the posterior probability when combining equations 2.5 and 2.6:
$$R(X \mid C) = P(X) \prod_{i,j:\, x_i x_j = 1} \frac{P_{on}(C_{ij})}{P_{off}(C_{ij})} \qquad (2.7)$$
which is equivalent to
$$\log R(X \mid C) = \sum_{i} \log P_i(x_i) + \sum_{ij} x_i x_j \log \frac{P_{on}(C_{ij})}{P_{off}(C_{ij})} \qquad (2.8)$$
Maximizing this expression leads to an estimate for the maximum a posteriori9,
or MAP. By using belief propagation on the factor graph, this MAP can be
determined in an efficient way, which will be described in the ensuing section.
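To make the objective concrete, the following sketch evaluates equation 2.8 for a given configuration and finds the MAP by exhaustive search. The exhaustive search is only feasible for a handful of segments and is shown as a reference for what belief propagation approximates, not as the method used in this project. `PairCue`, `logPosterior` and `bruteForceMap` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One binary cue between segments i and j with its evidence
// logRatio = log(Pon(Cij) / Poff(Cij)).
struct PairCue {
    std::size_t i, j;
    double logRatio;
};

// Equation 2.8: log R(X | C) for a configuration x of binary states,
// with prior P_i(x_i = 0) = p0 and P_i(x_i = 1) = 1 - p0.
double logPosterior(const std::vector<int>& x, double p0,
                    const std::vector<PairCue>& cues) {
    double value = 0.0;
    for (int xi : x) value += std::log(xi == 0 ? p0 : 1.0 - p0);
    for (const PairCue& c : cues) value += x[c.i] * x[c.j] * c.logRatio;
    return value;
}

// Reference MAP estimate: tries all 2^n configurations (small n only).
std::vector<int> bruteForceMap(std::size_t n, double p0,
                               const std::vector<PairCue>& cues) {
    assert(n > 0 && n < 20);
    std::vector<int> best(n, 0);
    double bestValue = logPosterior(best, p0, cues);
    for (unsigned long mask = 1; mask < (1UL << n); ++mask) {
        std::vector<int> x(n);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] = static_cast<int>((mask >> i) & 1UL);
        }
        const double value = logPosterior(x, p0, cues);
        if (value > bestValue) { bestValue = value; best = x; }
    }
    return best;
}
```

With three segments, p0 = 0.9, a strongly positive cue between segments 0 and 1 (log ratio 5) and a negative cue between segments 1 and 2 (log ratio -1), the search selects the configuration (1, 1, 0): the first two segments are assigned to the figure although the prior favours the ground state.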
Since the method uses more than just one cue, we have to add one term
for each cue to the previous equation. For binary cues, this means that for each
additional cue a term of the form $\sum_{ij} x_i x_j \log \frac{P_{on}(C_{ij})}{P_{off}(C_{ij})}$
is added to equation 2.8, i.e.
the distributions for all cues are multiplied in order to determine the most likely
global assignment of all variables in X. After defining how the distributions for
each cue are computed, we will describe the process of constructing the factors for
8 The empirically determined probability, which “summarizes the current state of knowledge about all the uncertain quantities” [Gel02].
9 The particular value of X that maximizes the posterior.
each cue, which will be used in the factor graph. This process is carried out step
by step, beginning with binary cues: for each pair of variable nodes (neighbouring
segments in the image) we determine whether they satisfy the cue and, if so, mark
them as candidate factors. These are then used to determine the candidate factors
for 3-ary cues, which are in turn used to determine the factors for the 4-ary
cues.
2.7.2 Belief Propagation on Factor Graphs
After this detailed explanation of how a factor graph is constructed, we will
now illustrate how belief propagation is used to infer the likelihood of a node
in the factor graph to be in a certain state. Belief propagation is a version
of the sum-product algorithm, an algorithm used for message passing on factor
graphs, that calculates the marginal probability (“belief”) for each node. Two
types of messages are used in factor graph belief propagation: Messages sent
from variables to factors, and those sent from factor nodes to variables, with
both types of messages being functions of the variable that is associated with
the edge along which the message is passed. We can explain the basic principle
of message passing through the sum-product algorithm with the following two
equations: The messages sent from variable nodes to factors are given by
$$m_{x \to f}(x) \leftarrow \prod_{h \in n(x) \setminus \{f\}} m_{h \to x}(x) \qquad (2.9)$$
Here, n(x) is the set of all factor neighbours of x in the graph. Equation 2.9
expresses that the message sent by a variable node to a factor f is the product of
all messages it has received from the other factor nodes h, i.e. the variable node
simply forwards the messages10. The factors here correspond to the local functions
defined for the factor graph, i.e. the probabilities P for every cue based on its
parameters.
The messages sent by factor nodes are defined by the product of the factor
itself with all messages sent from the variable nodes it is connected to, which is
then summarised:
$$m_{f \to x}(x) \leftarrow \sum_{\sim\{x\}} \left( f(X) \prod_{y \in n(f) \setminus \{x\}} m_{y \to f}(y) \right) \qquad (2.10)$$
10 The general approach to describing this process is to treat the graph as a tree and define the message as the product of all messages received from child nodes. When implementing the algorithm, all nodes are treated as child and parent nodes to the nodes they are connected to.
X = n(f) is the set of all arguments of f and $\sum_{\sim\{x\}}$ denotes the sum over
all variables except x. The messages are updated until they converge. The
marginal distribution for a node x is then computed as the product of all messages
that are sent to x.
In order to allow an efficient implementation on a platform with low
computational power, the max-product version of belief propagation is used to estimate
the maximum a posteriori. By implementing this version in the log domain (taking
the logarithm of all equations), where all calculations are reduced to addition
and subtraction, efficient computation of the belief is made possible.
The message updating equations in this max-product version are defined by
the following two equations (note the sum and maximum here instead of product
and sum as in 2.9 and 2.10)
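$$m_{x \to f}(x) \leftarrow \sum_{h \in n(x) \setminus \{f\}} m_{h \to x}(x) \qquad (2.11)$$
$$m_{f \to x}(x) \leftarrow \max_{\sim\{x\}} \left( f(X) + \sum_{y \in n(f) \setminus \{x\}} m_{y \to f}(y) \right) \qquad (2.12)$$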
Finally, the belief function for each node in the graph is calculated as:
$$b(x) = \sum_{f \in n(x)} m_{f \to x}(x) \qquad (2.13)$$
In this framework, each factor $f(x_1, \ldots, x_m)$ in the factor graph is non-zero
only if all of its arguments $\{x_1, \ldots, x_m\}$ are 1. As suggested by [SC07], a
non-negativity requirement is introduced in order to reduce the computational
complexity of the method: $K_f = f(x_1 = 1, \ldots, x_m = 1) \geq f(x_1 = 0, \ldots, x_m = 0) = 0$,
that is, all factors have to be greater than or equal to zero.
Adding this to equation 2.13, the belief for each node is then computed as
$b_x(x = 1) = \sum_{f \in n(x)} K_f$ and $b_x(x = 0) = 0$, which then leads to the final equation
for the beliefs of all nodes:
$$B_x = \sum_{f \in n(x)} K_f \qquad (2.14)$$
Finally, with respect to the implementation of this algorithm, the notion of
scheduled message passing has to be explained. It is assumed that the sending and
receiving of messages is organised by a schedule (such as a timer) that specifies the
way messages are passed. This schedule can be synchronous (flooding schedule),
which means that all messages are updated at the same time, or asynchronous
(serial), where only one message is updated at a time. Usually, several runs
(sweeps) of the non-simplified message passing algorithm, in each of which all
messages are sent once, have to be performed before it converges.
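The generic log-domain updates of equations 2.11 to 2.13, run under a flooding schedule, can be sketched as follows. This is an illustration under several assumptions and not the simplified scheme of [SC07] or the implementation of [ICS08]: all variables are binary, each factor stores the logarithm of its values in a table, a variable appears at most once per factor, and the names `Factor`, `Msg` and `maxSumBeliefs` are hypothetical. Floating point values are used for readability; a phone implementation would use integers.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Msg = std::array<double, 2>;  // one log-domain value per state (0 or 1)

struct Factor {
    std::vector<std::size_t> vars;  // argument variables of the factor
    std::vector<double> logTable;   // log f(X); bit k of the index = state of vars[k]
};

// Runs `sweeps` synchronous (flooding) updates of equations 2.11 and 2.12
// and returns the beliefs of equation 2.13 for every variable.
std::vector<Msg> maxSumBeliefs(std::size_t numVars,
                               const std::vector<Factor>& factors, int sweeps) {
    const Msg zero{{0.0, 0.0}};
    // toFactor[j][k]: message vars[k] -> factor j; toVar[j][k]: factor j -> vars[k].
    std::vector<std::vector<Msg>> toFactor(factors.size()), toVar(factors.size());
    for (std::size_t j = 0; j < factors.size(); ++j) {
        assert(factors[j].logTable.size() ==
               (std::size_t{1} << factors[j].vars.size()));
        toFactor[j].assign(factors[j].vars.size(), zero);
        toVar[j].assign(factors[j].vars.size(), zero);
    }

    for (int s = 0; s < sweeps; ++s) {
        // Flooding schedule: all new messages are computed from the old ones.
        std::vector<std::vector<Msg>> newToFactor = toFactor, newToVar = toVar;
        for (std::size_t j = 0; j < factors.size(); ++j) {
            const Factor& f = factors[j];
            for (std::size_t k = 0; k < f.vars.size(); ++k) {
                // Equation 2.11: sum of the messages from all other factors h.
                Msg out = zero;
                for (std::size_t h = 0; h < factors.size(); ++h) {
                    if (h == j) continue;
                    for (std::size_t l = 0; l < factors[h].vars.size(); ++l) {
                        if (factors[h].vars[l] != f.vars[k]) continue;
                        out[0] += toVar[h][l][0];
                        out[1] += toVar[h][l][1];
                    }
                }
                newToFactor[j][k] = out;

                // Equation 2.12: maximise over the states of all other arguments.
                const double minusInf = -std::numeric_limits<double>::infinity();
                Msg best{{minusInf, minusInf}};
                for (std::size_t idx = 0; idx < f.logTable.size(); ++idx) {
                    double value = f.logTable[idx];
                    for (std::size_t l = 0; l < f.vars.size(); ++l) {
                        if (l != k) value += toFactor[j][l][(idx >> l) & 1U];
                    }
                    const std::size_t state = (idx >> k) & 1U;
                    best[state] = std::max(best[state], value);
                }
                newToVar[j][k] = best;
            }
        }
        toFactor.swap(newToFactor);
        toVar.swap(newToVar);
    }

    // Equation 2.13: belief = sum of all incoming factor-to-variable messages.
    std::vector<Msg> belief(numVars, zero);
    for (std::size_t j = 0; j < factors.size(); ++j) {
        for (std::size_t k = 0; k < factors[j].vars.size(); ++k) {
            assert(factors[j].vars[k] < numVars);
            belief[factors[j].vars[k]][0] += toVar[j][k][0];
            belief[factors[j].vars[k]][1] += toVar[j][k][1];
        }
    }
    return belief;
}
```

A node is assigned to the figure if its belief for state 1 exceeds its belief for state 0. On a tree-shaped graph the messages stop changing after a number of sweeps equal to the longest path in the graph; on graphs with cycles convergence is not guaranteed, which is one reason why the choice of schedule matters.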
2.8 Chapter Summary
This chapter discussed the background and foundations regarding the given task
of image processing on a mobile platform and described the notion of assistive
technology for visually impaired people. Different types of smartphones were
analysed, which led to the decision to develop the recognition system for Symbian
OS, using its native programming language Symbian C++. As shown in this
chapter, there exists a wide range of applications that deal with image processing
on mobile phones, with different strategies for both the application structure
(server-client, stand-alone) and the algorithms used for the processing task. After
explaining factor graph belief propagation, we will now take a closer look at the
application design and discuss how these methods will be integrated into the
recognition system.
Chapter 3
Application Design
3.1 Overview
In this chapter, we will outline and discuss the preliminary considerations that
have to be made before implementing the application. In the first section, the
functional and non-functional requirements for the recognition system will be
listed, which is then followed by a detailed description of the software
architecture. This includes an explanation of the different parts a Symbian OS application
comprises, as well as an overview of the program’s structural and behavioural
organisation using UML diagrams. Finally, we will highlight details of the
software such as the algorithms that will be used and give a short description of
images necessary for training the system. The main objective of the chapter is to
provide the reader with a clear idea of all the tasks that will be carried out from
a high-level perspective. Detailed descriptions of the actual implementation will
then be discussed in the ensuing chapter.
3.2 Requirements Analysis
In order to describe the necessary functionalities of the application and assist the
evaluation process, the requirements that have to be met by the software and its
user interface will be defined in this section. They are organised into two groups:
The first part lists functional requirements which describe the behaviour of the
system, i.e. what the application does. The second group are non-functional
requirements that describe how these tasks will be performed by the application.
3.2.1 Functional Requirements
• The program detects BS 5499-4 emergency exit signs that are not lit up
internally (must-have)
• A detected object is indicated by a sound (must-have)
• Interactions with the software are confirmed to the user through text output
(must-have)
• The capturing process begins automatically when starting up the software
(should-have)
• Capturing an image takes one click; the capture is repeated automatically
while the user is panning the phone, as long as no object is detected
(should-have1)
• If present, the direction of the arrow (left, right, up, down) on the sign is
output as text (could-have)
• If present, any text on the sign (such as “Fire Exit” or “Exit”) is read out
by the system
• The software outputs information on the distance of the sign from the user,
based on the camera lens specifications and the size of the detected sign
(could-have)
3.2.2 Non-Functional Requirements
• The execution time for one image lies in a time frame that is acceptable for
the user, e.g. less than 2 seconds (must-have)
• The software works in various lighting conditions, including outside a
well-lit environment (must-have)
• The application works correctly and does not show any unexpected be-
haviour or lead to system errors, such as program crashes (must-have)
1 Making the application as comfortable to use as possible is clearly an important objective, which would make this item a definite “must”. However, if the automated capturing proves to be too computationally demanding, we can abandon this feature without compromising the initial idea behind the project.
• The interface is accessible by screen reader software, i.e. all menu items
and outputs can be read out to the user (must-have)
• The software does not have any complicated menus or graphical elements
(must-have)
• Starting the recognition process must require as few steps as possible and
should ideally begin automatically on program start-up (should-have)
• The software is able to perform the recognition process automatically in
real-time, i.e. several frames per second (could-have)
3.3 Software Architecture
In this section we will outline the basic program structure using descriptions
in both written form and UML diagrams. The software will be organised into
several modules that deal with the different program tasks. In more detail, the
basic structure for any Symbian v9.2 application comprises five classes2 that
are necessary to start up the program and draw the screen:
Main The first object called by the OS when starting up an application, creates
a new application object and runs it
Application Creates a new document object and returns a pointer to it
Document Creates the application user interface (AppUi) object
AppUi The application user interface handles all interactions such as pushing
a key or selecting a menu item. It creates an AppView object (or multiple
views) which is used for screen access. Here, it also creates a new Main-
Controller object which coordinates the image capturing and processing
AppView The application view draws the screen to make information visible to
the user
To this skeleton, we add more classes for the image processing task:
2 The naming conventions for Symbian require that all class names start with Cxx, xx being the application name. In this case, all classes have the prefix CMSP.
MainController Coordinates all image capturing and processing operations and
returns the results to be displayed and read out by the UI, which passes
them on to the AppView object
ImgCaptureEngine Fetches an image by accessing Symbian’s camera API and
returns it to the controller.
ImgProcessor Takes the image provided by the previous module and performs
pre-processing operations on it, then determines the presence of a sign and
returns the results (sign present: yes / no, arrow direction) to the controller
CFactorGraph Constructs the factor graph object based on the image segments
and the cues defined in the next section
CFactor Class for factor nodes in the graph
CBeliefPropapation Performs belief propagation on the factor graph
The structure shows that the modules can be designed to interact over clearly
defined interfaces, which is crucial for the development process, as it will simplify
separate implementation of individual parts and composition at a later stage.
This will also help to optimise the performance for each module and to carry out
precise testing. The complete organisation of the image processing system, i.e.
the Symbian skeleton and the classes created for the processing task, is shown in
figure 3.1. It has to be noted that, due to the complexity of the classes, only the
most important member data and methods are displayed in the class diagram.
The ImgCaptureEngine class in particular shows an important feature of Sym-
bian OS applications. The class is derived from CActive, which makes it an
“Active Object”, a framework that allows for asynchronous programming. This
construct comprises the Active Object, in the form of a class derived from
CActive, and an Active Scheduler, which is provided by the Symbian application
architecture. Using an Active Object makes it possible to manage asynchronous
functions, i.e. functions that return immediately after being called, without
waiting for the requested task to complete. This is particularly useful for the
task of capturing continuous frames from a camera: The system issues a request
to capture a frame from the camera, which is done asynchronously while other
tasks can be performed. Once the frame has been captured, the camera object
issues a callback to its observer, which then initiates processing of the image.
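The request and callback pattern described above can be illustrated with a small analogue in standard C++. The sketch deliberately does not use Symbian's CActive or Active Scheduler API; `SimpleScheduler` and `CaptureEngine` are hypothetical stand-ins that only show the control flow of issuing a request, returning immediately and being notified later.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>

// Stand-in for a scheduler: keeps completed requests until they are dispatched.
class SimpleScheduler {
public:
    void post(std::function<void()> completion) {
        pending_.push(std::move(completion));
    }

    // Dispatches every completion that is pending at the time of the call.
    void runPending() {
        std::size_t count = pending_.size();
        while (count-- > 0) {
            std::function<void()> completion = std::move(pending_.front());
            pending_.pop();
            completion();
        }
    }

private:
    std::queue<std::function<void()>> pending_;
};

// Stand-in for the capture engine: requestFrame() returns immediately,
// the observer is only called back once the scheduler dispatches the request.
class CaptureEngine {
public:
    CaptureEngine(SimpleScheduler& scheduler, std::function<void(int)> observer)
        : scheduler_(scheduler), observer_(std::move(observer)) {
        assert(static_cast<bool>(observer_));
    }

    void requestFrame() {
        const int frameId = nextFrameId_++;
        scheduler_.post([this, frameId] { observer_(frameId); });
    }

private:
    SimpleScheduler& scheduler_;
    std::function<void(int)> observer_;
    int nextFrameId_ = 0;
};
```

In the application, the observer would start the image processing for the delivered frame and could issue the next request once processing has finished, so that capturing and processing alternate without blocking the user interface.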
Figure 3.1: Class diagram showing the organisation of the application classes
Figure 3.2 shows a state diagram of the basic processes that are executed when
running the software, which will give a more detailed insight into the coordination
of the system’s components and functionalities. The application can be closed
from every state by using the “Exit” menu option, or simply pressing the “hang
up” key on the phone.
Figure 3.2: State diagram for the emergency exit sign recognition software
3.4 Image Processing Methods and Algorithms
An obvious approach to the task of recognising emergency exit signs would be to
base the object detection solely on the image’s colour values. All escape route
signs according to BS 5499-4 show white icons on a plain green background, which
could make it easy to search for a rectangular green sign in the image. [FZB+05]
achieved the highest and most efficient recognition rates by analysing the object’s
colour. However, due to the varying (and previously unknown) lighting conditions
and different phone camera properties, this method is not expected to yield
adequate results for our project on its own. Used in addition to other methods,
analysing colour features could improve the recognition rate, which will be
examined during the course of the project.
3.4.1 Edge Detection
The first phase of the processing will consist of different pre-processing tasks,
such as converting a colour image into greyscale, using an edge detection method
to extract edges from the image and thresholding the results to produce a binary
edge map. An edge map provides information about the estimated location of
edges (i.e. region boundaries or contours) in an image, which are defined as
changes in the image intensity. A drastic change in the intensity indicates a
clear edge (for example, the border of a black object on a white background),
whereas a more gradual change hints at a more blurred or softer edge. We decided
to use the Sobel filter for edge extraction, which can be easily and efficiently
implemented using integer operations (multiplication and addition). Other edge
detection methods such as the Laplacian were considered for this task, but deemed
unsuitable due to their high sensitivity to noise. The Sobel filter uses the two 3x3
kernels shown in figure 3.3 for computing approximations of the horizontal and
vertical derivatives at each pixel.
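$$d_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} \qquad d_y = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$$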
Figure 3.3: Sobel kernels used for horizontal and vertical derivatives
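A minimal integer-only sketch of this step is shown below. It applies the two kernels of figure 3.3 to a greyscale image and thresholds the result into a binary edge map. Approximating the gradient magnitude by |dx| + |dy|, leaving the one-pixel border empty, and the function name `sobelEdgeMap` are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Returns a binary edge map (1 = edge) for a row-major greyscale image.
// Border pixels are left at 0 because the 3x3 kernels do not fit there.
std::vector<unsigned char> sobelEdgeMap(const std::vector<int>& grey,
                                        int width, int height, int threshold) {
    assert(width > 0 && height > 0);
    assert(grey.size() == static_cast<std::size_t>(width) * height);
    static const int kDx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    static const int kDy[3][3] = {{1, 2, 1}, {0, 0, 0}, {-1, -2, -1}};

    std::vector<unsigned char> edges(grey.size(), 0);
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            int dx = 0, dy = 0;
            for (int r = -1; r <= 1; ++r) {
                for (int c = -1; c <= 1; ++c) {
                    const int p =
                        grey[static_cast<std::size_t>(y + r) * width + (x + c)];
                    dx += kDx[r + 1][c + 1] * p;
                    dy += kDy[r + 1][c + 1] * p;
                }
            }
            // Integer approximation of the gradient magnitude (assumption).
            const int magnitude = std::abs(dx) + std::abs(dy);
            edges[static_cast<std::size_t>(y) * width + x] =
                magnitude >= threshold ? 1 : 0;
        }
    }
    return edges;
}
```

For a vertical step from grey level 0 to 100, the horizontal derivative at the step is 400 and the vertical derivative is 0, so any threshold up to 400 marks the two columns on either side of the step as edge pixels. This two-pixel response is what the thinning step is meant to reduce.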
In order to reduce the edges that are extracted by the Sobel operator to
thinner, 1-pixel-lines, non-maximum3 suppression is performed on the image. The
idea of this operation is to reduce the visible pixels to the ones that are local
maxima in their neighbourhood which is given by their gradient direction (the
normal to the edge direction). A pixel can be considered a local maximum only if
its intensity is greater than the intensities of its neighbouring pixels along the
gradient direction. This thinning operation is an important step in an
3The term is used in di!erent variations such as non-maximal and non-maxima suppression.
3.4. IMAGE PROCESSING METHODS AND ALGORITHMS 38
edge detector in order to ensure locality of the edge, which means that the edge
is detected exactly at its location.
3.4.2 Extracting Straight Line Segments
The next step consists of detecting any rectangular object in the image that has
the approximate dimensions of an emergency exit sign, regardless of the actual
content. The Hough transform has been considered for this process, given its
suitability for finding straight lines in an image. However, it was not possible to
find a simplified or approximated version of the algorithm that could be
implemented efficiently without floating point operations. Relying on floating
point arithmetic would slow down the processing, which is clearly not desirable
for the project.
First, we need to extract straight line segments, which are then used as a
starting point for detecting a rectangular structure in the image. This is achieved
using a greedy bottom-up grouping procedure as suggested in [SC07]. The image
is checked for vertical and horizontal edges separately. For the detection of hori-
zontal lines, the method groups edge pixels that are already connected and form
an approximately horizontal line into smaller segments. Small gaps between these
segments are then filled if the segments are neighbours (within a certain region)
and have roughly the same orientation. Those segments that are shorter than
20 pixels are then removed from the set, which eventually contains all horizontal
straight line segments, represented by their start- and end points.
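The gap filling and length filtering described above can be sketched as follows. This is only an illustration under simplifying assumptions, not the procedure from [SC07]: the names Segment and mergeAndFilter are hypothetical, the input is assumed to be already sorted from left to right, and a small vertical offset between neighbouring end points stands in for the test on "roughly the same orientation".

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// A roughly horizontal segment, stored left to right.
struct Segment {
    int x0, y0;  // start point
    int x1, y1;  // end point
};

// Hypothetical helper: joins neighbouring segments separated by a small gap,
// then discards everything shorter than minLength pixels.
// `sorted` is assumed to be ordered from left to right.
std::vector<Segment> mergeAndFilter(const std::vector<Segment>& sorted,
                                    int maxGap, int maxDy, int minLength) {
    assert(maxGap >= 0 && maxDy >= 0 && minLength >= 0);
    std::vector<Segment> merged;
    for (const Segment& s : sorted) {
        if (!merged.empty()) {
            Segment& last = merged.back();
            const int gap = s.x0 - last.x1;
            // A small vertical offset stands in for "roughly the same orientation".
            if (gap >= 0 && gap <= maxGap && std::abs(s.y0 - last.y1) <= maxDy) {
                last.x1 = s.x1;  // extend the previous segment across the gap
                last.y1 = s.y1;
                continue;
            }
        }
        merged.push_back(s);
    }
    std::vector<Segment> result;
    for (const Segment& s : merged) {
        if (s.x1 - s.x0 >= minLength) result.push_back(s);  // drop short segments
    }
    return result;
}
```

With minLength set to 20, this corresponds to the removal of segments shorter than 20 pixels; the gap and offset limits would have to be tuned on real images.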
3.4.3 Detecting Rectangular Shapes
Suggestion for a Simplified Method
Once the straight line segments have been detected, the system needs to determine
whether there is a rectangle (i.e. a quadrilateral shape that has roughly parallel
opposing sides) present in the frame. It can be argued that a straightforward way
of carrying out this task would be to simply check for overlapping (or nearly over-
lapping) start- and end points of horizontal and vertical segments. The following
conditions have to be satisfied for the shape to qualify as a candidate rectangle:
• Take horizontal segment A with start point S_A = (x_sA, y_sA) and end point
E_A = (x_eA, y_eA)
• The coordinates of S_A are within the neighbourhood (minimal difference
in x- and y-direction) of the start point S_B of a vertical segment B with a
length shorter than A
• The coordinates of E_A are within the neighbourhood of the start point S_D
of a vertical segment D shorter than A and roughly the same length as B
• B and D have opposite polarity, i.e. the gradient direction of B is the
inverse of D’s gradient direction
• B’s orientation is orthogonal to A’s (roughly orthogonal that is, within only
a few degrees)
• D’s orientation is (roughly) orthogonal to A’s
• B and D are roughly parallel, i.e. the difference between their gradient
orientations is minimal
• The coordinates of E_B are within the neighbourhood of the start point S_C
of a horizontal segment C
• The length of this segment C is similar to the length of A
• The coordinates of E_D are within the neighbourhood of the end point E_C
of the same horizontal segment C
• C’s orientation is roughly orthogonal to B’s and D’s, and roughly parallel
to A’s
• A and C have opposite polarity, i.e. the gradient direction of A is the
inverse of C’s gradient direction
Please note that, in order to simplify the construction, start- and end points
of segments are classified in a left-to-right (for horizontal segments) and
top-to-bottom (for vertical segments) manner respectively, regardless of their
polarity. With respect to the perspective bias that affects the length of the
segments (which, ideally, would be pairwise of equal length), it can be assumed
that exit signs are approximately at eye level (or slightly above), which means
that the perspective distortion is expected to be minimal. As for the length of
the segments: on average, an emergency exit sign that contains all three icons
(the word “Exit”, a running person and an arrow) has a width to height ratio of
2.8, i.e. the horizontal segments are nearly three times as long as the vertical
ones. Assuming that the camera is close enough for the sign to fill out the full
image width of 320 pixels (which is very unlikely), this means that the vertical
segments are at most about 114 pixels long (320 / 2.8 ≈ 114).
It also needs to be mentioned that some of these conditions can be omitted
as they are the consequence of other conditions. For example, if two vertical
segments of the same length begin at the start- and end points of a horizontal
segment, then the horizontal segment at the bottom of the sign that connects
their end points must be approximately the same length as the top segment
(again, assuming that the perspective bias is minimal).
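A reduced form of this check, restricted to the corner and length conditions, might look like the following sketch. The names Point, Seg, nearPoint and isCandidateRectangle are hypothetical, the two tolerances are assumed parameters, and the polarity and orientation conditions from the list are deliberately left out to keep the example short.

```cpp
#include <cassert>
#include <cstdlib>

struct Point { int x, y; };
// Start and end point, ordered left-to-right (horizontal segments)
// or top-to-bottom (vertical segments).
struct Seg { Point s, e; };

static bool nearPoint(Point p, Point q, int tol) {
    return std::abs(p.x - q.x) <= tol && std::abs(p.y - q.y) <= tol;
}

// A = top and C = bottom (horizontal); B = left and D = right (vertical).
// Only corner coincidence and segment lengths are tested here.
bool isCandidateRectangle(const Seg& A, const Seg& B, const Seg& C,
                          const Seg& D, int tol, int lenTol) {
    assert(tol >= 0 && lenTol >= 0);
    const int lenA = A.e.x - A.s.x;
    const int lenC = C.e.x - C.s.x;
    const int lenB = B.e.y - B.s.y;
    const int lenD = D.e.y - D.s.y;
    const bool corners = nearPoint(A.s, B.s, tol)    // top left
                      && nearPoint(A.e, D.s, tol)    // top right
                      && nearPoint(B.e, C.s, tol)    // bottom left
                      && nearPoint(D.e, C.e, tol);   // bottom right
    const bool lengths = lenB < lenA && lenD < lenA
                      && std::abs(lenB - lenD) <= lenTol
                      && std::abs(lenA - lenC) <= lenTol;
    return corners && lengths;
}
```

Only integer comparisons are used, which fits the constraint of avoiding floating point operations on the handset.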
Factor Graph Belief Propagation
While the simplified method described in the previous paragraph seems easy to
implement, it is the objective of this study to review more sophisticated ap-
proaches to the task of object recognition that are based on inferences rather
than basic image processing on a pixel level. This is why we will now describe
a solution based on the factor graph belief propagation method as explained in
the previous chapter. This will allow us to analyse the segment groups with re-
spect to multiple cues and perform rapid inference on them in order to determine
whether a segment is part of a rectangle or not.
Using the straight line segments extracted in the edge detection step, the
factor graph is constructed with each line segment being a node variable in the
graph. Based on the specification of factor graphs, the cues that the factor graph
uses for this task will now be described in detail. It has to be noted that due
to the characteristics of a rectangular sign, both horizontal and vertical straight
line segments have to be analysed. In order to simplify this, the horizontal and
vertical segments will be first checked individually, then the candidates will be
combined to look for matching 4-tuples (that is, one vertical and one horizontal
pair). For all cues, the distributions Pon and Poff (as explained in section 2.7.1)
are determined based on training images.
Unitary cues make a statement about single segments, regardless of their
relationship with other segments. We will only use one unitary cue:
• Segment length: On average, the horizontal straight line segments at the
top and bottom of the sign are long compared to other straight lines in the
image, whereas the vertical lines are relatively short (as previously men-
tioned).
The binary cues describe a relationship between two neighbouring segments
(two nodes) with opposite polarity:
• Parallelism: The absolute difference between the orientations of two
neighbouring segments is minimal.
• Proximity: For horizontal pairs, the distance between the two segments is
usually relatively small (approximately one third of the segment length),
whereas the distance between vertical segments is the inverse of this, i.e.
roughly three times the length of the segments.
• Overlapping: The difference between start- and end points of horizontal /
vertical pairs is within a certain limit.
The arity-4 cues take into account 4 straight line segments, that is, one hori-
zontal pair and one vertical pair:
• Orientation: The average orientation of the horizontal pair is orthogonal to
the orientation of the vertical pair.
• Corner points: The differences between the coordinates of start- and end
points of the four segments are minimal.
• Width to height ratio: As previously mentioned, the ratio of horizontal
segment length to vertical segment length should be in the region of 2.8.
After defining the cues that will be used to describe the relationships between
line segments, the factor graph has to be constructed. Each cue corresponds to
a factor (a local function) of a global function that describes the likelihood of
a segment being part of a rectangular shape. The first two steps are carried out
individually for horizontal and vertical lines, with candidate factors being
generated for every segment or segment pair that meets the requirements.
Beginning with the unitary cue, only segments of sufficient length are considered
candidates for the next step, and a unitary factor is constructed. Based on the
binary cues for parallelism, proximity and overlapping, all pairs of segments
(with opposite polarity) that satisfy the criteria for those cues are then
selected as candidates. This is followed by combining vertical and horizontal
candidate pairs to check them for the arity-4 cues.
The final decision whether a node variable is in figure state (xi = 1) is then
based only on its belief Bx. If Bx is sufficiently large, the node will be assigned
figure state, if not, it will be set to ground (xi = 0). The simplified version
explained in the previous chapter suggests that no message updates are necessary
for an approximated result, which means that the belief propagation will already
converge after one run.
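To illustrate a decision without message updates, the following sketch computes a belief for a single node directly from its cues. It rests on an assumption that is not taken from the thesis: the belief for figure state is taken to be proportional to the product of the Pon values of all cues attached to the node, the belief for ground state to the product of the Poff values, and the two are normalised to sum to one. The names CueLikelihood, figureBelief and isFigure are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Likelihood of one observed cue value under figure (pOn) and ground (pOff).
struct CueLikelihood {
    double pOn;
    double pOff;
};

// Assumed one-pass belief: b(1) = prod(pOn) / (prod(pOn) + prod(pOff)),
// evaluated in the log domain to avoid underflow for many cues.
double figureBelief(const std::vector<CueLikelihood>& cues) {
    double logRatio = 0.0;
    for (const CueLikelihood& c : cues) {
        assert(c.pOn > 0.0 && c.pOff > 0.0);
        logRatio += std::log(c.pOn) - std::log(c.pOff);
    }
    return 1.0 / (1.0 + std::exp(-logRatio));
}

// The node is assigned figure state if its belief exceeds the threshold.
bool isFigure(const std::vector<CueLikelihood>& cues, double threshold) {
    return figureBelief(cues) > threshold;
}
```

The sketch uses floating point for readability; on the handset, the logarithms of Pon and Poff could presumably be precomputed as scaled integers and stored in lookup tables, so that the decision reduces to an integer sum and a comparison.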
3.4.4 Analysis of the Sign’s Content
If the decision is made that a rectangle is indeed present in the image, a sound is
output in order to notify the user of this status. An image is then captured with
a higher resolution and several pre-processing tasks are carried out to prepare
the final analysis of the sign’s characteristics. As we know the coordinates of
the start and end points of the four rectangle sides, it is possible to enlarge the
section containing the rectangle. In order to calculate the coordinates for the
larger image (640x480 pixels), the coordinates obtained from the smaller image
(320x240 pixels) simply have to be doubled.
Assuming that the rectangle is not immediately surrounded by any clutter, the
perspective distortion and rotation are minimal, and that the background colour
is different from the green sign (which is necessary to achieve a high contrast so
that the signs can be clearly seen inside buildings), simply clipping the image to
its bounding box with sides parallel to the image borders would be sufficient to
isolate the sign from its surroundings. However, in order to achieve an accurate
result for the ratio of sign background and icon pixels, the image needs to be
freed from any perspective bias, rotation and background clutter. This can be
achieved by projecting the presumably skewed and rotated quadrilateral shape
onto a rectangle with parallel sides. This projective transformation then maps
the points from the distorted image to the corresponding points in the rectangle.
Using the four corner points of the distorted image and the target rectangle as
reference points, a matrix for the transformation can be constructed as described
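in [Blo]. In this case, this projective transformation is defined as
\[
\begin{bmatrix} u & v & w \end{bmatrix} =
\begin{bmatrix} x & y & 1 \end{bmatrix}
\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & 1 \end{bmatrix}
\tag{3.1}
\]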
Here, the matrix A is a non-singular (invertible) homogeneous transformation
matrix with eight degrees of freedom. The coordinates (x', y') of the mapped
point are given by
\[
x' = \frac{u}{w} = \frac{a_{11}x + a_{21}y + a_{31}}{a_{13}x + a_{23}y + 1}, \qquad
y' = \frac{v}{w} = \frac{a_{12}x + a_{22}y + a_{32}}{a_{13}x + a_{23}y + 1}
\tag{3.2}
\]
This leads to a linear system with eight unknown coefficients a_{11} ... a_{32}
that is solved using the point pair coordinates in order to determine the
transformation matrix.
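One possible way of setting up and solving this system is sketched below. Multiplying each of the two fractions by its denominator gives two linear equations per point pair, so four pairs yield eight equations in the eight unknowns. The sketch is not the code from [Blo]: the names Pt, Homography, solveHomography and mapPoint are hypothetical, Gauss-Jordan elimination with partial pivoting is merely one suitable solver, and double precision is used for clarity rather than the integer arithmetic needed on the phone.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <utility>

struct Pt { double x, y; };
// Coefficients in the order a11, a21, a31, a12, a22, a32, a13, a23.
using Homography = std::array<double, 8>;

// Solves for the coefficients that map src[i] onto dst[i], i = 0..3.
// Returns false if the point configuration is degenerate.
bool solveHomography(const std::array<Pt, 4>& src,
                     const std::array<Pt, 4>& dst, Homography& h) {
    double m[8][9] = {};  // augmented 8x8 system
    for (int i = 0; i < 4; ++i) {
        const double x = src[i].x, y = src[i].y;
        const double u = dst[i].x, v = dst[i].y;
        // a11*x + a21*y + a31 - a13*x*u - a23*y*u = u
        const double r0[9] = {x, y, 1, 0, 0, 0, -x * u, -y * u, u};
        // a12*x + a22*y + a32 - a13*x*v - a23*y*v = v
        const double r1[9] = {0, 0, 0, x, y, 1, -x * v, -y * v, v};
        for (int c = 0; c < 9; ++c) {
            m[2 * i][c] = r0[c];
            m[2 * i + 1][c] = r1[c];
        }
    }
    for (int col = 0; col < 8; ++col) {
        int pivot = col;  // partial pivoting: largest entry in this column
        for (int r = col + 1; r < 8; ++r)
            if (std::fabs(m[r][col]) > std::fabs(m[pivot][col])) pivot = r;
        if (std::fabs(m[pivot][col]) < 1e-12) return false;
        for (int c = 0; c < 9; ++c) std::swap(m[col][c], m[pivot][c]);
        for (int r = 0; r < 8; ++r) {
            if (r == col) continue;
            const double f = m[r][col] / m[col][col];
            for (int c = col; c < 9; ++c) m[r][c] -= f * m[col][c];
        }
    }
    for (int i = 0; i < 8; ++i) h[i] = m[i][8] / m[i][i];
    return true;
}

// Applies the mapping of equation (3.2) to a single point.
Pt mapPoint(const Homography& h, Pt p) {
    const double w = h[6] * p.x + h[7] * p.y + 1.0;
    assert(w != 0.0);
    return {(h[0] * p.x + h[1] * p.y + h[2]) / w,
            (h[3] * p.x + h[4] * p.y + h[5]) / w};
}
```

Calling solveHomography with the four corners of the target rectangle as src and the corners of the detected quadrilateral as dst yields the inverse mapping, which allows every pixel of the output rectangle to be filled by looking up its source pixel.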
The next step of the detection phase is then based on the histogram of the
rectangular shape that was extracted from the large image. Knowing that there
are only two colours present in the image, its histogram can now be examined. If
the ratio of green (i.e. dark grey in the greyscaled image) and white pixels
corresponds to the usual ratio found in emergency exit signs, the rectangle is
marked as a candidate for a sign and the final outcome is determined by a matching
procedure. This quick check reduces the overall computational costs by discarding
all rectangular shapes that are highly unlikely to be emergency exit signs. Oth-
erwise, the more expensive template matching procedure would be performed in
vain.
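Such a quick check could be sketched as follows. The names PixelCounts, countPixels and ratioPlausible are hypothetical, and the grey value band for the background as well as the accepted percentage range are assumed parameters that would have to be determined from sample signs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct PixelCounts {
    int background;  // pixels inside the grey band of the sign background
    int icon;        // brighter pixels, assumed to belong to the white icons
};

// Counts background and icon pixels in the rectified greyscale sign image.
// Pixels darker than bgLow are ignored.
PixelCounts countPixels(const std::vector<std::uint8_t>& grey,
                        std::uint8_t bgLow, std::uint8_t bgHigh) {
    assert(bgLow <= bgHigh);
    PixelCounts counts = {0, 0};
    for (std::uint8_t value : grey) {
        if (value >= bgLow && value <= bgHigh) {
            ++counts.background;
        } else if (value > bgHigh) {
            ++counts.icon;
        }
    }
    return counts;
}

// Accepts the rectangle if the background share (in percent) lies in the
// expected range; only integer arithmetic is used.
bool ratioPlausible(PixelCounts counts, int minPercent, int maxPercent) {
    const int total = counts.background + counts.icon;
    if (total == 0) return false;
    const int bgPercent = 100 * counts.background / total;
    return bgPercent >= minPercent && bgPercent <= maxPercent;
}
```

Only rectangles that pass this test would then be handed to the more expensive matching step.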
For a number of reasons, it was decided to reduce the step of confirming the
presence of an emergency exit sign and recognising the direction of any arrow to a
simple pixel matching method. Firstly, and most importantly, we are dealing with
standardised signs that differ only in the direction of the arrow and, accordingly,
the orientation of the “running” icon. Arrows with directions from -90° to 90° are
placed on the right, whereas arrows pointing to the top left, left and bottom left
are located on the left hand side of the sign (as explained in section 1.4).
Secondly, the location of the sign is known, defined by its corner points. Thirdly,
the image has already been freed from perspective and rotational bias in the
previous step. And finally, the number of different arrow directions is reasonably
low (eight: the four main directions plus the diagonals), which means that in the
worst case the entire sign has to be checked only eight times4. While the operation
is not expected to be the most efficient method, it is straightforward to implement
and does not cause any complex overhead.
3.5 Training Images
The number of training images necessary for training the system varies heavily
depending on the algorithm that is used for classification. AdaBoost performs well
using a large set of training images ([PTAE09] mentions up to 10,000 negative and
500 positive samples from an existing database). The number of training images
that are necessary for factor graph belief propagation is relatively low: [ICS08]
uses 25 positive and negative images each, which still yields high recognition
rates and seems more feasible. In order to determine the distributions for the
different cues, training images are captured with a mobile phone camera and
then labelled manually. It is particularly important to pay attention to images
that could cause false positives due to their similarity to emergency exit signs,
were taken under difficult lighting conditions, or contain perspective bias or
partly occluded signs. Examples of images taken with a mobile phone camera are
shown in figure 3.4. Starting with the top left image, these pictures show some of
the most common problems with camera phones. The images also point out some
variations of the standardised emergency exit signs that will not be recognised by
the system, such as the text “Fire Exit” on the sign instead of “Exit” as specified
by BS 5499-4, which we will be using as sample templates:
• The angle between camera and exit sign is wide, lights cause reflections on
the sign.
• Blurring due to fast camera movement. Here, the text on the sign is “FIRE
EXIT” in capital letters, which is expected to cause problems when applying
a template matching strategy.
• The distance between the camera and the sign is very large, the sign appears
small and blurred.
• The sign is placed next to lit signs, which also causes reflections and
overexposure. In this picture, the sign is also made up of two plates (arrow on the
left, icon and text on the right) and will not be recognized by our method.
4 It has to be noted that there are no clear definitions for the actual de
Figure 3.4: Four examples of emergency exit signs captured with a phone camera
3.6 Issues Affecting the System Performance
Some of the challenges that mobile image processing software for visually im-
paired users has to deal with are characterised in [DLQ+06]. The problems are
specified for a text recognition system based on a client-server architecture (see
above), but can also be applied to the issue of recognizing emergency exit signs.
The application has to process images that
• are blurred
• contain text that is very small (or in this case, small exit signs)
• have low contrast
• were taken under poor lighting conditions
These issues have to be considered when designing the image processing appli-
cation, as well as producing the sets of training and test images. Since there is
no way of improving the camera quality, these errors can only be mitigated by
choosing image processing methods that do not rely too heavily on flawless image
quality.
There are also a number of critical issues that have to be dealt with when
implementing the software, regarding both the implementation process and the
actual problem of object recognition. First, the existing resources of computer
vision libraries on Symbian OS are relatively small compared to other platforms
such as Windows PCs. This makes it necessary to implement a large amount of
functionalities from scratch or port them to Symbian C++, which is an error-
prone procedure that slows down the development process. This risk could be
reduced by using as many existing building blocks as possible and keeping to the
principles of good coding practice for Symbian C++, as defined on the “Forum
Nokia” website5. Secondly, as previously mentioned, the low processing power of
smartphones and a software-emulated floating point unit require careful memory
management and choice of data types. In order to deal with this problem, floating
point operations have to be avoided where possible in favour of integer operations.
3.7 Chapter Summary
This chapter outlined the main aspects of the software development process. We
specified the requirements for the application and gave an overview of the program
design, which is based on the typical Symbian application structure. It was
decided to organise the application into several modules with different
functionalities that interact over clearly defined interfaces. We then gave an
overview of the methods and algorithms that will be used for the image processing
module of the software and specified details of the factor graph belief
propagation, along with a proposal for a simplified version of the detection stage.
The chapter was concluded by a discussion of the quality and quantity of test
images, as well as an overview of typical problems that will have to be dealt with
in the implementation phase.
5 http://www.forum.nokia.com
Chapter 4
System Implementation
4.1 Overview
The main objective of this chapter is to give an insight into the implementation
phase of the project, based on the methods discussed in the previous chapter.
First, we will give an overview of the implementation tools that were used,
followed by a description of the different stages of the application development.
This will include explanations of the implemented algorithms, along with short
code listings of the most significant program segments where deemed necessary
for understanding. The chapter is concluded by an explanation of the
characteristics of Symbian OS with respect to methods for optimising the
application performance on this platform.
4.2 Implementation Tools
Symbian provides software development kits for the different OS versions, with
S60 3rd Edition (Symbian OS v9.1) and S60 3rd Edition FP 1 (Symbian OS
v9.2) being the ones supported by the largest number of devices (mainly Nokia
and Samsung) [Sym09]. The SDK comes with all the necessary C++ APIs,
example programs and a phone emulator (which is of no use for this project, as
the camera on the handset cannot be simulated by the emulator using a built-in
laptop camera). In order to assist the application development process, Symbian
recommends using an IDE such as Carbide.c++. This free software is based on
the Eclipse IDE and offers tools for debugging, on-device debugging and GUI
construction. For this project, the software was run on a Mac OS X system
using a virtual machine (VirtualBox) with Windows XP Professional as a guest
OS. Compilation of the application code (using the GCC-E compiler) produces
a Symbian installation file (.sis) that can be installed on any suitable Symbian
device. In order for the file to be accepted by the device, it has to undergo
a signing process. This is achieved using command line tools provided by the
SDK to generate a key and a certificate, which are used to sign the .sis file
after compiling it. The signed application1 is then installed by either directly
connecting the phone to a PC via USB and initiating the installation process
from the development system, or transferring the .sis file to the handset (e.g.
sending it via Bluetooth) and then installing it. The IDE also offers a mode for
on-device debugging when the device is connected to the host computer via USB,
which proved to be useful for debugging purposes.
4.3 Image Capturing
The phone’s camera is accessed using the camera API to capture images over the
phone’s viewfinder. The images are transferred directly to a bitmap without any
further processing. The advantage of this method over capturing a still image is
the speed of the operation. In this viewfinder mode, the N95 camera produces
images with a size of 320x240 pixels in 32-bit colour mode, which are then used for
carrying out further pre-processing steps. In order to capture a higher resolution
picture of 640x480 pixels, the camera viewfinder needs to be stopped when the
capture button is pressed. The camera settings are then changed to a higher
resolution format, the image is captured and displayed on the screen. Figure 3.2
shows a state machine diagram of the camera module in interaction with the other
system components. In tests with the Nokia phone, the autofocus, which is run
automatically by the operating system’s controls, proved good enough to produce
pictures of sufficient quality with little blurring, which makes adjusting the
camera focus by hand unnecessary. This can be considered a very helpful feature of
the built-in camera API, given that the system is designed for blind users who will
not be able to adjust any camera settings to improve the image quality.
1 It has to be mentioned that any self-generated key and certificate pair is only valid for a certain period of time, usually one year. After that, the .sis file is rejected by the phone and has to be signed again with a newly generated key and certificate.
4.4 Phase One: Feature Extraction
Tests were carried out with a Symbian OS computer vision library in C++,
developed by Nokia (NokiaCV2), that provides an implementation of various
image processing tasks. Due to very slow performance (3 seconds per frame
for greyscale conversion and convolution with a Sobel filter), this approach was
deemed rather unsuitable for real-time processing. Therefore it was decided to
implement all processing steps using only the bitmap interface that is offered by
Symbian and simplifies drawing the captured images to the phone screen. While
accessing the individual pixels over the interface’s GetPixel() method offers a
convenient way of manipulating the colour values, this proved too slow for efficient
implementation of complex image processing algorithms in several loops over the
image. All pixels were therefore accessed through a pointer to the bitmap’s first
pixel, using the bitmap interface’s DataAddress() method.
4.4.1 Greyscale Conversion
In a first pre-processing step, feature extraction was performed on the input
image. The first stage included converting the bitmap delivered from the camera
from colour into greyscale mode3. As previously mentioned, the input image on
the test device is a 32-bit RGB + alpha channel bitmap4. The conversion function
takes the input bitmap and simply draws it onto a new bitmap that was created
in greyscale mode.
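The conversion described above is delegated to the bitmap API. For illustration only, an explicit per-pixel conversion with integer arithmetic could look like the following sketch. It is not the method used in the project: the function names toGrey and convertToGrey are hypothetical, each pixel is assumed to be a 32-bit word laid out as 0x00RRGGBB (i.e. B, G, R in memory on a little-endian device), and the weights 77, 150 and 29 are a common integer approximation of the luminance weights 0.299, 0.587 and 0.114 scaled by 256.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Converts one 32-bit pixel (assumed layout 0x00RRGGBB) to an 8-bit grey value.
// The weights sum to 256, so the shift by 8 replaces a division.
inline std::uint8_t toGrey(std::uint32_t pixel) {
    const std::uint32_t r = (pixel >> 16) & 0xFF;
    const std::uint32_t g = (pixel >> 8) & 0xFF;
    const std::uint32_t b = pixel & 0xFF;
    return static_cast<std::uint8_t>((77 * r + 150 * g + 29 * b) >> 8);
}

// Converts `count` pixels from a colour buffer into a greyscale buffer.
void convertToGrey(const std::uint32_t* src, std::uint8_t* dst, std::size_t count) {
    assert(src != nullptr && dst != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = toGrey(src[i]);
    }
}
```

Whether such a loop would outperform the bitmap drawing call on the handset is not known and would have to be measured.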
2 http://research.nokia.com/research/projects/nokiacv/
3 Due to the mode being named “EGray256”, the US English spelling was used throughout the source code for consistency reasons.
4 In fact, the colour mode delivered by the N95 is “EColor16MU” which is built up as BGR.
4.4.2 Sobel Operator
The Sobel operator for edge detection is implemented using simple integer
multiplication and addition to convolve the image pixels with the horizontal and
vertical kernel. The gradient magnitude is then approximated by the sum of the
derivatives’ absolute values, Abs(dx) + Abs(dy), rather than the hypotenuse, in
order to avoid calculating the square root, which would have an impact on the
system performance. In the case of an implementation for Symbian OS, attention has
to be paid to the correct usage of the integer datatypes that are offered by the
platform (several different signed and unsigned integers with various lengths,
such as TUint8 for unsigned 8-bit integers), and their explicit conversion when
assigning values. Even without any prior (possibly time-consuming) smoothing, this
operation produced results that were suitable for further processing. Listing A.1
shows how the Sobel operator is applied to the image, with subsequent normalisation
of the resulting gradient values to the range 0..255.
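Independently of Listing A.1, the principle can be sketched in portable C++ as follows. The function name sobelMagnitude is hypothetical, border pixels are simply left at zero, and the normalisation scales by the largest magnitude found in the image, which is an assumption and may differ from the normalisation used in the project.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Applies both Sobel kernels with integer arithmetic and returns the
// approximated gradient magnitude |dx| + |dy|, scaled to 0..255.
std::vector<std::uint8_t> sobelMagnitude(const std::vector<std::uint8_t>& grey,
                                         int width, int height) {
    assert(width >= 3 && height >= 3);
    assert(grey.size() == static_cast<std::size_t>(width) * height);
    std::vector<int> mag(grey.size(), 0);
    int maxMag = 0;
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            const std::uint8_t* p = grey.data() + y * width + x;
            // Horizontal derivative: right column minus left column.
            const int dx = -p[-width - 1] + p[-width + 1]
                           - 2 * p[-1] + 2 * p[1]
                           - p[width - 1] + p[width + 1];
            // Vertical derivative: top row minus bottom row.
            const int dy = p[-width - 1] + 2 * p[-width] + p[-width + 1]
                           - p[width - 1] - 2 * p[width] - p[width + 1];
            const int m = std::abs(dx) + std::abs(dy);  // no square root needed
            mag[y * width + x] = m;
            maxMag = std::max(maxMag, m);
        }
    }
    std::vector<std::uint8_t> out(grey.size(), 0);
    if (maxMag > 0) {
        for (std::size_t i = 0; i < mag.size(); ++i) {
            out[i] = static_cast<std::uint8_t>(mag[i] * 255 / maxMag);
        }
    }
    return out;
}
```

The intermediate values are kept in a wider signed type and only converted to 8 bits after scaling, which mirrors the remark about explicit conversions between integer types.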
4.4.3 Non-Maximum Suppression
In the system’s implementation, the non-maximum suppression is performed by
determining the gradient direction of each pixel and comparing its gradient
magnitude to that of the two neighbouring pixels in the positive and negative
gradient direction (normal to the edge direction). The gradient direction is
defined as θ = arctan(dy/dx); however, this expensive operation was not suitable
for an efficient implementation as it already slowed down the performance to
1 frame per second. Therefore, and since we operate in a discrete domain, the
gradient orientation is classified into one of the eight main directions, which we
“hard-code”. There are several ways of carrying out this classification without
directly computing the gradient orientation: one method is based on the signs of
the horizontal and vertical derivatives, which classifies the pixel into one of
the directions 1 to 8: direction 1 covers 0° to 45°, 2 ranges from 45° to 90° and
so on. The gradient magnitude is then compared to the linearly interpolated
gradient values of the closest pixel pairs in the discrete grid (in negative and
positive gradient direction).
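A classification into eight 45° sectors without any trigonometric call can be sketched as follows. The signs of dx and dy select the quadrant; to split each quadrant into two sectors, the sketch additionally compares the absolute values of the two derivatives, which is an assumption about how the classification is completed. The function name gradientSector is hypothetical, angles are measured in the usual mathematical sense from dx and dy as given, and the interpolation step is not shown.

```cpp
#include <cassert>

// Returns the sector 1..8 of the gradient (dx, dy), where sector k covers
// the half-open range [(k-1)*45°, k*45°). Returns 0 for a zero gradient.
int gradientSector(int dx, int dy) {
    if (dx == 0 && dy == 0) return 0;
    if (dx > 0 && dy >= 0) return (dy < dx) ? 1 : 2;    // 0° to 90°
    if (dx <= 0 && dy > 0) return (-dx < dy) ? 3 : 4;   // 90° to 180°
    if (dx < 0 && dy <= 0) return (-dy < -dx) ? 5 : 6;  // 180° to 270°
    assert(dx >= 0 && dy < 0);
    return (dx < -dy) ? 7 : 8;                          // 270° to 360°
}
```

Since opposite sectors describe the same line through the pixel, sectors k and k + 4 select the same pair of neighbours for the comparison.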
A suitable threshold was determined through testing, with results varying
depending on the light conditions and the distance from the object, as it had
been expected. This non-maximum suppression leads to a binary edge map with
thin edges. The results of the individual edge detection steps as shown in figure
4.1 show a comparison between a simple thresholded image and the image after
applying non-maximum suppression, which clearly demonstrates the importance
of this operation.
4.4.4 Straight Line Extraction
In order to extract straight line segments from the image, a greedy grouping
procedure is applied to the edge pixels. The method (here explained for horizontal
segments) scans every row and proceeds as follows: If the current pixel is an edge
pixel (i.e. not zero), check its neighbouring pixels at 0°, 45° and -45°. If one
of these is also an edge pixel, set the current pixel as starting point for a
stroke. Then proceed to check this edge pixel for its horizontal neighbours and
continue until an edge pixel is met that does not have any neighbours to the right.
This last pixel is then set as end point of the stroke and the algorithm continues
to process the starting line (the row in which the stroke began). If the starting
pixel does not have any edge pixel neighbours, discard it in the target image. To
connect the shorter collinear strokes to longer segments, a small (5 pixels) window
is moved over the end points in order to detect start points that are located
within 45° (positive or negative) of the end point. Given the start- and end
coordinates, we can also infer all the information needed for the factor graph,
namely the length of the segment, its position and its orientation (slope). This
information is then stored in an array of TPoint objects created to represent line
segments, with consecutive elements being regarded as neighbours, i.e. candidate
pairs for the binary cues used in the factor graph.
Figure 4.1: Individual steps of edge detection: (a) Original image (top left), (b) Sobel filtered (top right), (c) Sobel and threshold (bottom left), (d) Sobel, non-maximum suppression and threshold (bottom right)
4.4.5 Factor Graph and Belief Propagation
After having extracted straight line segments from the image, the factor graph
is built based on those image features. This section will explain the way to
implement a factor graph and represent the cues listed in the previous chapter.
It has to be noted however, that the factor graph belief propagation has not been
fully implemented and that the steps described in this section will only act as
a pointer to the actual finalised implementation. The implementation is largely
based on the libDAI library and Intel’s Probabilistic Network Library (PNL), two
open source libraries for inference on graphical models in C++, which provide a
good starting point for porting the algorithms to the Symbian platform [Moo, Int].
To build up the factor graph, some helper classes are needed that provide a
data structure for the different pieces of information stored within the graph. A
class for the individual factors called CFactor holds a set of variables (the
arguments of the function) and a reference to a probability vector as data members,
along with methods to manipulate this data. The probability vector describes
the value of a factor depending on all possible variable assignments. With respect
to an efficient implementation, both this vector and the set of variables are
constructed using Symbian’s RArray template class, a wrapper for accessing arrays
of structures and objects5.
Corresponding to the definition of factor graphs in the previous chapter, the
factor graph class CFactorGraph has an array of variables (that are either one or
zero) and an array of CFactor objects as data members. Edges in the graph are
represented by an array of factor neighbours for each variable, and an array of
neighbours that are variables for each factor node6, in order to distinguish between
the different types of edges. An edge is then added by including the variable and
factor involved in the respective set in the graph and adding entries to both of
the neighbour lists. Accordingly, in order to remove an edge, the entries from the
neighbouring arrays are deleted (which is also done if the respective variable or
factor nodes are removed from the graph). In this context of image processing,
each edge between a factor and a variable corresponds to a cue used to determine
the state of a segment (the variable that the factor is connected to).
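The structure described in the last two paragraphs can be sketched in portable C++ as follows. The sketch uses std::vector where the Symbian implementation would use RArray, and the names Factor and FactorGraph as well as their member functions are hypothetical stand-ins for CFactor and CFactorGraph; they are not taken from libDAI, PNL or the project code. Removing edges is omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// A factor: the indices of its argument variables plus one value per joint
// assignment of these binary variables (2^n entries for n variables).
struct Factor {
    std::vector<std::size_t> variables;  // variable neighbours of this factor
    std::vector<double> table;           // the "probability vector"
};

// Bipartite graph: edges only connect a variable with a factor.
class FactorGraph {
public:
    // Adds a binary variable (e.g. one line segment) and returns its index.
    std::size_t addVariable() {
        factorsOfVariable_.emplace_back();
        return factorsOfVariable_.size() - 1;
    }

    // Adds a factor over the given variables and registers the edges in the
    // neighbour lists on both sides. Returns the index of the new factor.
    std::size_t addFactor(const std::vector<std::size_t>& vars,
                          std::vector<double> table) {
        assert(table.size() == (static_cast<std::size_t>(1) << vars.size()));
        const std::size_t f = factors_.size();
        for (std::size_t v : vars) {
            assert(v < factorsOfVariable_.size());
            factorsOfVariable_[v].push_back(f);
        }
        factors_.push_back(Factor{vars, std::move(table)});
        return f;
    }

    const std::vector<std::size_t>& factorsOf(std::size_t variable) const {
        return factorsOfVariable_[variable];
    }
    const Factor& factor(std::size_t f) const { return factors_[f]; }
    std::size_t variableCount() const { return factorsOfVariable_.size(); }

private:
    std::vector<std::vector<std::size_t> > factorsOfVariable_;  // factor neighbours
    std::vector<Factor> factors_;
};
```

A unitary cue then becomes a factor over one variable with two table entries, a binary cue a factor over two variables with four entries, and an arity-4 cue a factor with sixteen entries.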
5 In this section the terms “list” and “set” are only used for legibility reasons and do not suggest the use of the respective data structures.
6 As we are dealing with a bipartite graph, it is ensured that there exist only edges between two different types of nodes.
In order to compute the MAP, the belief propagation is now performed on
the factor graph. A BP class is created for this task which holds objects
for the messages passed between the graph nodes as member variables, along
with methods for creating and updating messages. When creating a BP object,
it is initialised with the factor graph that the operation is carried out on. The
segments’ beliefs are then computed based on the algorithms explained in the
second chapter. By using the simplified version of the factor graph belief
propagation algorithm, no message updates are necessary in order to determine the
belief for the segments in the image. The segments that are regarded as belonging
to a suitable quadrilateral shape are then saved in an array of four TPoint objects
that mark the start- and end points of the segments in clockwise direction (top,
right, bottom, left). If several quadrilateral shapes are detected in the image,
the one with the highest belief is analysed first with the warping and template
matching procedures described in the next section and, if the output is negative,
the steps are repeated for the other detected shapes.
4.5 Phase Two: Object Recognition
4.5.1 Final Content Analysis
Once we have obtained the coordinates of the rectangle’s corner points we can
proceed with the analysis of the rectangle’s content. First, the captured image
is drawn to a bitmap in greyscale mode, again using Symbian’s bitmap API. This is
followed by the planar homography that projects the found quadrilateral shape
with perspective bias onto a rectangle, using a transformation matrix that is
constructed on the basis of the four corner points of the distorted and target
rectangle respectively. The implementation of this projective transformation is
based on the code described in [Blo], which has been ported from C to Symbian
C++. The transformation matrix that is determined in the first stage is then
used to compute the corresponding pixel in the quadrilateral for each pixel in
the target rectangle. When implementing this method, particular attention has
to be paid to optimising the multiple divisions that occur when computing the
transformation matrix and the final output by using fixed point arithmetic with
integers rather than Symbian’s TReal class for float values and standard division.
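The following sketch illustrates the two stages described above in standard C++. It is not the ported code from [Blo]: it assumes the common closed-form construction that maps the unit square onto a quadrilateral, and all type and function names, as well as the choice of 20 fractional bits, are hypothetical. The matrix is computed once in floating point, while the per-pixel lookup uses integers and a single division per coordinate.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

struct Point { double x, y; };
struct Homography { double a, b, c, d, e, f, g, h; };

// Maps the unit square (0,0),(1,0),(1,1),(0,1) onto the quadrilateral q[0..3]
// (same corner order). Assumes a non-degenerate, convex quadrilateral.
Homography squareToQuad(const std::array<Point, 4>& q) {
    const double sx = q[0].x - q[1].x + q[2].x - q[3].x;
    const double sy = q[0].y - q[1].y + q[2].y - q[3].y;
    double g = 0.0, h = 0.0;  // stays zero in the purely affine case
    if (sx != 0.0 || sy != 0.0) {
        const double dx1 = q[1].x - q[2].x, dx2 = q[3].x - q[2].x;
        const double dy1 = q[1].y - q[2].y, dy2 = q[3].y - q[2].y;
        const double det = dx1 * dy2 - dx2 * dy1;
        g = (sx * dy2 - dx2 * sy) / det;
        h = (dx1 * sy - sx * dy1) / det;
    }
    return { q[1].x - q[0].x + g * q[1].x, q[3].x - q[0].x + h * q[3].x, q[0].x,
             q[1].y - q[0].y + g * q[1].y, q[3].y - q[0].y + h * q[3].y, q[0].y,
             g, h };
}

// Fixed-point copy of the matrix, pre-scaled so that it can be applied
// directly to integer pixel coordinates of the target rectangle.
constexpr int kFracBits = 20;  // assumed precision
struct FixedHomography { std::int64_t a, b, c, d, e, f, g, h, one; };

// width and height of the target rectangle must both be greater than 1.
FixedHomography toFixed(const Homography& m, int width, int height) {
    const double scale = static_cast<double>(std::int64_t{1} << kFracBits);
    const double du = 1.0 / (width - 1), dv = 1.0 / (height - 1);
    auto fx = [scale](double v) {
        return static_cast<std::int64_t>(std::llround(v * scale));
    };
    return { fx(m.a * du), fx(m.b * dv), fx(m.c),
             fx(m.d * du), fx(m.e * dv), fx(m.f),
             fx(m.g * du), fx(m.h * dv), fx(1.0) };
}

struct Pixel { int x, y; };

// Returns the source pixel in the quadrilateral for a target pixel (col, row).
// Assumes a positive denominator and non-negative source coordinates.
Pixel sourcePixel(const FixedHomography& m, int col, int row) {
    const std::int64_t w = m.g * col + m.h * row + m.one;
    const std::int64_t x = m.a * col + m.b * row + m.c;
    const std::int64_t y = m.d * col + m.e * row + m.f;
    return { static_cast<int>((x + w / 2) / w),   // rounded integer division
             static_cast<int>((y + w / 2) / w) };
}
```

Iterating over the target rectangle and fetching sourcePixel for every (col, row) fills the rectified image without gaps, which is why the mapping runs from target to source rather than the other way round.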
For the next step (determining the ratio between “green” background pixels
and white icon pixels) a very simple histogram analysis procedure was
implemented: the clipped image’s greyscale values are compared to a lower and an
upper boundary value, both chosen through testing to lie near the expected
background colour (the average greyscale value of the background green is
approximately 100), in order to separate the green background from the white icons.
The ratio of the number of pixels within the boundaries (i.e. “green” pixels) to
the number of pixels above the upper boundary (white pixels) has to be close to
1.1, a value that has been determined through testing. If this quick check
produces a positive result, i.e. the found rectangle is a candidate for an
emergency exit sign, the template matching is performed.
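A minimal sketch of this ratio check is given below. Only the target ratio of 1.1 and the background level of roughly 100 come from the text; the concrete boundary values (80 and 120), the tolerance and all names are assumptions chosen for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct RatioCheckParams {
    std::uint8_t lower = 80;    // assumed lower boundary around the green level
    std::uint8_t upper = 120;   // assumed upper boundary around the green level
    double targetRatio = 1.1;   // green-to-white ratio stated in the text
    double tolerance = 0.2;     // assumed meaning of "close to"
};

// Returns true if the greyscale pixels of the rectified rectangle show the
// expected proportion of background ("green") to icon (white) pixels.
bool looksLikeExitSign(const std::vector<std::uint8_t>& grey,
                       const RatioCheckParams& p = RatioCheckParams()) {
    std::size_t green = 0, white = 0;
    for (std::uint8_t value : grey) {
        if (value > p.upper) {
            ++white;                 // above the upper boundary: white icon
        } else if (value >= p.lower) {
            ++green;                 // within the boundaries: background
        }
    }
    if (white == 0) return false;    // no icon pixels at all, ratio undefined
    const double ratio = static_cast<double>(green) / static_cast<double>(white);
    return std::fabs(ratio - p.targetRatio) <= p.tolerance;
}
```

Because the check is a single pass over the pixels, it is cheap enough to run before the more expensive template matching.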
For this purpose, eight exit sign templates are created as binary images, with
the same dimensions as the target rectangle used in the projecting step. The
pixels from the template and the thresholded image are then pairwise compared
and their differences are summed. If the sum (i.e. the difference between the two
images) is minimal, it is assumed that the sign matches the respective template.
The arrow direction is then saved as one of eight directions and output by the
system. Figure 4.2 shows that even templates with the same layout (text, icon
and arrow from left to right) clearly differ in the relatively large white section
that defines the tip of the arrow, which is how they can be distinguished.
Figure 4.2: Two examples of binary sign templates
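The pixel-wise comparison can be sketched as follows. This is not the code from Listing A.2: images are represented here as plain vectors of 0/1 values, and the acceptance threshold maxDifference as well as all names are assumptions made for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

using BinaryImage = std::vector<std::uint8_t>;  // one 0/1 value per pixel

// Number of pixels in which the two binary images differ.
std::size_t pixelDifference(const BinaryImage& a, const BinaryImage& b) {
    std::size_t diff = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        diff += (a[i] != b[i]) ? 1 : 0;
    }
    return diff;
}

// Returns the index of the template with the smallest difference, or -1 if
// even the best template differs in more than maxDifference pixels.
int bestTemplate(const BinaryImage& candidate,
                 const std::vector<BinaryImage>& templates,
                 std::size_t maxDifference) {
    int best = -1;
    std::size_t bestDiff = std::numeric_limits<std::size_t>::max();
    for (std::size_t t = 0; t < templates.size(); ++t) {
        if (templates[t].size() != candidate.size()) continue;  // size mismatch
        const std::size_t d = pixelDifference(candidate, templates[t]);
        if (d < bestDiff) {
            bestDiff = d;
            best = static_cast<int>(t);
        }
    }
    return (best >= 0 && bestDiff <= maxDifference) ? best : -1;
}
```

The returned index can then be mapped to one of the eight arrow directions.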
The templates are created as bitmaps and then referenced in the project’s
MMP file that includes project specific instructions for the compiler, such as li-
braries that need to be included. The bitmap is then integrated into a Symbian
multi bitmap file (.mbm) during compilation and can be loaded using its auto-
matically generated file name or its enumeration index in the source code. Listing
A.2 demonstrates how the candidate rectangle is matched with each of the
eight templates in the .mbm file.
4.6 Result Output
The results of the object recognition procedures carried out in the previous stages
need to be output in a way that is suitable for visually impaired users. As
previously mentioned, the signal tone in the first stage (finding a rectangular
shape) is a simple “bleep” sound which is produced using the Symbian library for
system sounds like warning and error messages. Since the sound will be repeated
for each frame in which a rectangle is recognised, this is the least obtrusive way
to notify the user of the presence of a potential sign. Once the image is captured
and processed in the second stage, the result is output as text, using Symbian’s
“CAknInformationNote” pop-ups that display text for several seconds. The text
is then picked up by the screen reader that is installed on the device and read
out through the screen reader software’s text-to-speech synthesis.
The output informs the user whether an exit sign could be detected in the
image or not, and gives the arrow direction if an arrow is present:
• “No exit sign found.”
• “Emergency exit sign found. No arrow detected.”
• “Emergency exit sign found. Arrow direction: Top right.”
These messages can be displayed again by pressing any key on the keypad, which
adds to the usability of the application.
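How the recognition result might be turned into the note text is sketched below in standard C++. The three message formats are taken from the list above; the enumeration, the function name and the wording of the direction names other than “Top right” are assumptions. On the device, the resulting string would be handed to the information note rather than returned.

```cpp
#include <string>

// Hypothetical result type: no arrow, or one of eight arrow directions.
enum class Arrow {
    None, Top, TopRight, Right, BottomRight, Bottom, BottomLeft, Left, TopLeft
};

// Builds the text that is displayed and then read out by the screen reader.
std::string resultMessage(bool signFound, Arrow arrow) {
    static const char* const kNames[] = {
        "", "Top", "Top right", "Right", "Bottom right",
        "Bottom", "Bottom left", "Left", "Top left"
    };
    if (!signFound) {
        return "No exit sign found.";
    }
    if (arrow == Arrow::None) {
        return "Emergency exit sign found. No arrow detected.";
    }
    return std::string("Emergency exit sign found. Arrow direction: ") +
           kNames[static_cast<int>(arrow)] + ".";
}
```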
4.7 Optimisation for Symbian S60 devices
For applications that update the screen at short intervals, as it is done here to
display the processed image, it is recommended to bypass the window server that
scales and clips the image before it is drawn on the screen. Symbian provides
direct screen access over its CDirectScreenAccess interface. While this would not
be important in a final application that does not display the camera image, it
could speed up and give a more accurate impression of the system performance
in the development stage, where the screen output is necessary for debugging
purposes.
Since several copies of the processed image are being held in memory as simple
arrays, it is important to increase the application’s heap size before allocating
memory for the image data. The SDK offers a way of setting a minimum and
maximum size, for which a check is performed before starting the application. In
order to allow for sufficient memory, the maximum heap size was changed from
the default 1MB to 4MB. This solved the problem of application crashes caused
by accessing pointers that were initialised to NULL due to a lack of memory.
As suggested in [ICS08], the workload during runtime can be reduced by using
static arrays rather than dynamically allocated memory. Because
the images used in the application always have the same size (320x240 pixels
and 640x480 pixels respectively for the images captured from the camera), a
sufficiently large array can already be constructed at compile time.
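As a rough illustration of this idea in standard C++, the buffers below are sized at compile time from the two image resolutions mentioned above. The buffer and function names are hypothetical, and one byte per pixel (greyscale) is an assumption.

```cpp
#include <cstddef>
#include <cstdint>

// Image dimensions taken from the text; the constant names are assumptions.
constexpr std::size_t kSmallWidth = 320, kSmallHeight = 240;
constexpr std::size_t kLargeWidth = 640, kLargeHeight = 480;

// Statically allocated greyscale buffers (one byte per pixel). Their size is
// fixed at compile time, so no allocation can fail while frames are processed.
static std::uint8_t gSmallGrey[kSmallWidth * kSmallHeight];
static std::uint8_t gLargeGrey[kLargeWidth * kLargeHeight];

// Row-major access to a pixel of the smaller buffer.
inline std::uint8_t& smallPixel(std::size_t x, std::size_t y) {
    return gSmallGrey[y * kSmallWidth + x];
}
```

The trade-off is that the memory stays reserved for the lifetime of the application, even while no image is being processed.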
4.8 Chapter Summary
This chapter discussed the application development process and the implementa-
tion of complex processing tasks on the Symbian OS platform. The development
was carried out using the Carbide.c++ IDE provided by Symbian. The appli-
cation makes use of the platform’s camera API to capture continuous and still
images which are passed on to the processing module. The image processing is
then performed in two stages, with greyscaling and edge extraction being applied
to the image in a first step, followed by the object detection core. After com-
pletion of the implementation stage, the program will be tested and evaluated,
which will now be described in the following two chapters.
Chapter 5
Testing
5.1 Overview
In this chapter, we will outline the testing and evaluation procedures that are
being performed throughout the development process and when examining the
final version of the application. This also includes a discussion of the expected
and desired test results, which define the criteria for success of the project. The
chapter is concluded by an overview of the testing results with respect to speed
and recognition rates of the application in different testing environments.
5.2 Description of the Testing Procedures
5.2.1 Ad-hoc Testing
Testing took place throughout the different stages of the application development
in order to ensure adequate performance and recognition rates when implementing
new functionalities. The testing includes checks for both the performance of the
image processing module and the robustness of the software.
Regarding the correctness of the software, we carry out informal tests for each
completed unit and module integration step. This will deal with code coverage
criteria in particular, in order to ensure that all statements in the code have been
executed and tested for validity, and do not contain any bugs or errors. Ad-hoc
tests are performed conveniently on the Symbian SDK emulator, while the more
critical tests are carried out on the actual handset.
It has to be noted that accessing a camera (such as the development system’s
built-in laptop camera) is not possible when using the emulator. This makes it
necessary to test on static images captured with a phone camera, which means
that only the processing results but not the performance of the system can be
determined. Due to the movement of the camera, the results are also expected
to be less accurate on the live system. An attempt was made to compensate for
these circumstances by manually adding noise and motion blur to the static testing images
that were used with the emulator. By carrying out testing on the handset it is
also ensured that the program can be installed and runs on the intended device.
On the completion of every module, the prototype is tested for compliance
with the requirements defined in section 3.2. If necessary, this triggers a revision
of the code and the modification or addition of functionalities.
5.2.2 Final Testing
In terms of object recognition performance, we aim at a relatively high rate of
true positives and a low number of false positives, in combination with a short
processing time. Since imperfect results are more acceptable for users than long
latency [BB08], we focus mainly on the efficiency of the application, i.e. fast
execution of all program functions. If the desired results for the tests are not
achieved, the code is reviewed in order to improve the performance. Early tests
are carried out on a large set of positive and negative images in order to determine
the performance of the chosen object recognition methods, with the final testing
being performed “live” in buildings on a smaller test set. It is also desirable to
have the software tested for usability (ease of use) and evaluated by users who
are unfamiliar with the system — however, due to the restricted functionality of
the application, this is not of highest priority.
Live tests with users were not carried out because the system was not in a
finalised state at this stage. However, due to the minimal interaction necessary
for running the system, user tests are only expected to provide feedback regarding
the performance of the system rather than the actual usability. This is why user
tests were not considered absolutely necessary when evaluating the system in its
current state.
5.2.3 Desired Results
Due to the complexity of the Symbian platform and the limited number of
applications that can be directly compared to our project, figures regarding expected
results can only be estimated based on related works.
A quick experiment with a stopwatch shows that a comfortable time to pan
a phone from left to right, i.e. approximately 180 degrees, is about 10 seconds.
This figure can act as a guideline for determining how many images have to be
captured and processed within 10 seconds in order to cover the whole area of 180
degrees around the user. With an average angle of view of roughly 40 degrees for
standard phone cameras (defined by the focal length of the lens), we can infer
that the system needs to be able to capture at least 5 images in those 10 seconds
(i.e. 2 seconds per image) to provide comfortable use of the system.
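Written out, and assuming that adjacent frames do not need to overlap, the estimate above amounts to:

```latex
\[
N_{\text{images}} = \left\lceil \frac{180^{\circ}}{40^{\circ}} \right\rceil
                  = \lceil 4.5 \rceil = 5,
\qquad
t_{\text{image}} = \frac{10\,\text{s}}{5} = 2\,\text{s}
\]
```

If overlapping frames were required, the number of images would rise and the time available per image would fall accordingly.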
The projects mentioned in section 2.5 in fact confirm the experiment and give
a rough estimate for suitable recognition rates:
• The processing time for a single frame is less than 2 seconds. Automated
processing in real-time (as demonstrated in [ICS08]) is highly desirable, but
depends strongly on the implementation and will therefore only be regarded
as an “additional feature” for this project.
• The recognition rate (true positive) lies at approximately 75% of all test
images.
• The number of false positives has to be treated with particular care, as it is
unacceptable to send the user in the wrong direction. Therefore, no more
than 1% of the test images should be incorrectly classified as a sign.
These are only basic requirements that give an overview of the most significant
aspects of the final system. However, the focus of the project lies primarily on
the structured realisation of the proposed system design. It is deemed obvious
that an “ideal”, i.e. highly efficient and correct system can only be achieved
through profound knowledge of the platform, along with multiple code revisions
and sufficient time for exploring different approaches to one problem.
5.3 System Performance Evaluation
While the overall execution speed of the recognition application is a major aspect
in evaluating the system, the first phase of the recognition that detects rectan-
gular (quadrilateral, that is) shapes is considered particularly critical, as it is
performed in real-time, while the user is panning the phone. The first step, Sobel
filtering and non-maximum suppression, achieved roughly 3 frames per second,
when tested on the Nokia N95. This can be considered as real-time performance
and lies within the expected time frame. The straight line extraction did not
yield optimal results in live-tests on the device due to noise, motion blur and
varying lighting conditions; therefore the execution speed was not measured. In
tests on the SDK’s emulator, the straight line detection suffered from wide gaps
between shorter segments that could not be closed. The edge detection algorithm
clearly needs optimising in order to deal with those issues, which can be achieved
through extensive tests to adjust the chosen thresholds.
As the factor graph belief propagation algorithm was not implemented far enough
to allow statements about its performance, we can only give rough estimates
based on the sources that served as the foundations for this method. It is
expected to run close to real time, albeit more slowly than stated
in [ICS08] due to the higher number of arity-4 cues used in the factor graph. The
ensuing warping procedure, which involves a large number of divisions, will slow
down the system performance if not optimised for fixed point calculations, which
is crucial for this operation. This procedure could be omitted by restricting the
number of recognisable signs to those that lie within a certain angle from the
camera and thus do not suffer from significant distortion. However, due to the
minor difference between the templates (for example, arrows pointing to the top
and bottom only differ in the tip of the arrow), as well as blurring and noise in
the image, it proved difficult to choose a suitable threshold to decide between the
outcomes without causing too many false positives or false negatives when
matching images that had not been warped. Finally, the template matching for eight
different templates is carried out with sufficiently high performance (under one
second in tests on the phone), as was expected for a small number of binary tem-
plates. In this stage, the worst case is that all eight templates need to be checked,
while in the best case the first template already matches the input image, which
reduces the processing time.
Regarding the overall outcome it can be stated that the performance of the
system was not as efficient as expected, which is believed to be caused by the
non-optimised and rather straightforward implementation of the proposed methods.
However, these optimisations are considered only a matter of following standard
procedures, which does not affect the general feasibility of implementing the
chosen method on the Symbian platform.
5.4 Chapter Summary
In this chapter we outlined the testing procedures that were carried out
throughout the development process to ensure the quality and performance of the
produced application. Both the efficiency and effectiveness of the object recognition
module were evaluated, along with standard software testing procedures that
were carried out in order to check for the correctness of the application. Based
on the system performance it was concluded that optimising the application with
respect to Symbian guidelines is key for achieving an efficient implementation.
Chapter 6
System Evaluation
6.1 Overview
After specifying the application design and outlining the implementation process
in previous chapters, we will now review and discuss the overall project
development. Firstly, the chosen approach to the given task of recognising exit signs
will be critically analysed, along with the system design. This is followed by an
evaluation of the project in relation to existing work, where the advantages and
disadvantages of the chosen method will be discussed. The chapter is concluded
by a critical review of the project schedule and possible improvements that can
be made to the system that was developed.
6.2 Analysis of the Research Methodology
It may be argued that the chosen methods for the feature extraction and object
detection stages were not the most suitable for this task regarding efficiency and
ease of implementation. However, due to the vast number of different and varied
methodologies and algorithms in this field, it can only be stated that the
preceding research was carried out carefully, which led to the decision to choose
an approach based on the work most similar to the given task that indicated
successful results. Another positive aspect of this method is the small amount
of training data that is needed for determining the cues that the factor graph is
built on. This is a great advantage over methods such as Adaboost that need
large amounts of labelled training images, sometimes up to several thousands, in
order to produce good results. While the decision to base the application core on
statistical inference using a method from machine learning is an interesting and
still rarely used approach to image processing on mobile platforms, the complex-
ity of it proved to be a drawback of this approach. Especially with respect to the
use on a restricted platform such as Symbian OS, finding a suitable way of
efficiently implementing the construction of a factor graph and inference using the
belief propagation algorithm was not feasible given the time constraints. With
respect to the specified task, the project can therefore be regarded as unsuccess-
ful. However, in order to compensate for this issue, alternative approaches to the
problem were explored, that would simplify the implementation and could still
achieve acceptable results.
With respect to Symbian OS smartphones as the chosen platform for this
application it can be said that there are hardly any alternatives for developing
a complex application such as the emergency exit sign recognition system. Due
to the large range of available handsets, along with screen reader applications for
visually impaired users and the extensive resources for developers, the system can
be seen as superior with respect to the suitability for this task.
One of the advantages of the developed system design is its modularity. By
restricting the first step of the recognition phase to rectangular objects, the
program can be extended to recognise other standardised signs by using different
templates in the second stage. As the final decision for the content of the sign
is delayed until several checks give a clear indication for the result, the recog-
nition rate and especially the number of false positives can be optimised. The
system does not rely on manually highlighting any regions of interest in the im-
age or markers that help locating the signs and is therefore deemed more flexible
and user friendly than previous approaches to object detection on mobile devices
using touchscreens. The recognition software is designed to be accessible by a
screen reader, which is very likely to be already installed on a blind user’s smart-
phone. As the program only outputs information as text over Symbian’s built-in
notification interface, it is guaranteed to work with any type of screen reader or
display magnifying software. This is a great advantage over systems that show
information over graphically oriented interfaces. In addition, by steering away
from including speech output into the system, it can also be easily extended to
provide more information to the user, without having to produce new sound files.
This is also an advantage when considering developing a multilingual version of
the software. The fact that the voice output of the recognition system is the
same as the general text-to-speech voice used on the phone can also add to the
user’s acceptance of the application. Finally, both the installation files (.sis) and
use of system memory during program execution are kept relatively lightweight
when favouring written text output over speech output that is included in the
application.
6.3 Review of the Project Plan
The project schedule was designed before the start of the implementation stage
of the project and was structured into three main stages. It allowed for a rela-
tively long phase (one month) of getting familiar with the chosen platform and
its programming language, Symbian C++. This stage was followed by the imple-
mentation of the program core, which was supposed to take another month. The
final stage (one month) would be application testing, evaluation and writing up
of the insights gained during the course of the project.
While this plan seemed adequate given the complex platform and the difficult
task of optimising the system for real-time performance, it did not leave
much room to deal with problems caused by implementation errors. This fact
was worsened by the different error handling procedures on the Symbian emulator
and the actual device, along with flaws of the IDE and SDK tools such as
non-transparent caching mechanisms, undocumented emulator crashes and de-
bugging facilities. While the first stage was completed in a shorter period of time
than scheduled, the main implementation stage suffered from the aforementioned
problems. This led to delays which made it necessary to cut down the task to a
simplified version of the system, as well as reduce the time scheduled for testing and
evaluation.
The conclusion that can be drawn for future projects is to arrange enough
time for troubleshooting when dealing with unknown platforms. When working
under strict time constraints, stepping back to the research phase to develop an
alternative route is too time consuming to keep up with the project plan. Despite
the incomplete implementation, the research work carried out for this project and
the application design based on this research are still regarded as an adequate and
convincing approach for solving a complex problem on a platform with limited
resources.
6.4 Improvements
As mentioned previously, the original task of implementing a recognition system
based on a statistical method like factor graph belief propagation had to be
reduced to a simplified solution to the recognition problem. Thus, the most
obvious improvement would be to include an implementation of the factor graph
belief propagation method outlined in 3.4.3 into the finalised software. Based on
evaluation results from [ICS08], this is expected to improve both the processing
speed (as the template matching phase is abandoned), as well as the recognition
rate by relying on multiple cues, therefore removing a number of false positives
from the findings.
In order to improve the recognition rate for exit signs that differ slightly from
the ones shown in 1.1, the templates for the final stage could be split up into
their three components. For example, a variation of the signs shows the words
“FIRE EXIT” in capital letters, which would not exactly match the template.
In order to deal with this seemingly minimal difference, the matching algorithm
could simply look for the “running person” icon in the centre of the sign (only
two templates) and then perform matching for the three (icon facing the left) or
five (icon facing the right) arrow templates. This method would simply ignore
the presence of text in the sign, but the uniqueness of the icon and arrows is
expected to already guarantee correct results, and it would speed up the matching
performance by reducing the template size and number.
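The proposed two-stage matching could look roughly like the sketch below. It has not been implemented in the project; it assumes that the icon and arrow regions have already been cropped from the rectified sign, and the types, names and the threshold maxDiff are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using BinaryImage = std::vector<std::uint8_t>;  // one 0/1 value per pixel

// One icon template together with the arrow templates valid for that facing.
struct IconVariant {
    BinaryImage icon;
    std::vector<BinaryImage> arrows;
};

struct SignMatch { int icon; int arrow; };  // -1 means "no match"

std::size_t countDifferences(const BinaryImage& a, const BinaryImage& b) {
    std::size_t diff = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        diff += (a[i] != b[i]) ? 1 : 0;
    }
    return diff;
}

// Index of the closest candidate that differs in at most maxDiff pixels.
int closest(const BinaryImage& region, const std::vector<BinaryImage>& candidates,
            std::size_t maxDiff) {
    int best = -1;
    std::size_t bestDiff = maxDiff + 1;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        if (candidates[i].size() != region.size()) continue;
        const std::size_t d = countDifferences(region, candidates[i]);
        if (d < bestDiff) {
            bestDiff = d;
            best = static_cast<int>(i);
        }
    }
    return best;
}

// Stage 1 identifies the icon facing; stage 2 only tries the arrow templates
// that belong to that facing. The text region of the sign is ignored.
SignMatch matchByComponents(const BinaryImage& iconRegion,
                            const BinaryImage& arrowRegion,
                            const std::vector<IconVariant>& variants,
                            std::size_t maxDiff) {
    std::vector<BinaryImage> icons;
    for (const IconVariant& v : variants) icons.push_back(v.icon);
    const int icon = closest(iconRegion, icons, maxDiff);
    if (icon < 0) return {-1, -1};
    const std::vector<BinaryImage>& arrows =
        variants[static_cast<std::size_t>(icon)].arrows;
    return {icon, closest(arrowRegion, arrows, maxDiff)};
}
```

With two icon variants holding three and five arrow templates respectively, at most seven small comparisons are needed instead of eight full-size ones.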
With respect to the implementation on the Symbian OS platform, it can be
stated that the produced code still needs to be optimised in some areas. In partic-
ular, the memory management can be improved by paying more attention to the
careful use of system memory, as well as using simplified or approximated algo-
rithms. As the image processing phase does not return to capture the next frame
until the processing is completely finished, it would have also been useful to im-
plement the CMSPImgProcessor class as an Active Object (see section 3.3). This
would have allowed both the image capturing and the processing to be performed
asynchronously, so that the next image is already fetched and prepared for processing while the
previous frame is still being analysed. It is obvious, however, that the purpose of
this study was mainly to carry out research into the topic and demonstrate a pos-
sible implementation, rather than producing a highly optimised piece of software
for a rather unknown platform. This also explains why this report does not dis-
cuss the interaction of the recognition software with other phone functions such
as incoming calls, text messages or other applications running in the background.
Of course, those features and events have to be considered when implementing
applications for mobile platforms outside an academic environment.
6.5 Chapter Summary
This chapter discussed the success of the study by evaluating it with respect to
the chosen solution to the problem of image processing on mobile platforms. This
included a review of the chosen approach which led to the conclusion that the
research methods used for the application were appropriate for the given task.
This was followed by a critical analysis of the project plan and suggestions for
improvements that could have been made to the system and overall development
process had time constraints not applied.
Chapter 7
Conclusion and Future Work
7.1 Project Summary
This report has depicted the process of developing an image recognition system
on a mobile platform which assists visually impaired users in finding emergency
exit signs. In the introduction we gave a description of the motivation behind the
study which is to make use of mobile phone technology as assistive devices for
visually impaired persons, and to carry out research into the feasibility of image
processing on mobile platforms. The system’s main objectives were given as a
sample flow of events when the application is used by a blind person to detect an
emergency exit sign.
The first task was to decide on which smartphone platform the software was to
be developed. Different platforms were discussed with respect to their processing
power and ease of developing applications, and it was decided that the software
was to be developed for Symbian OS smartphone models using its “native” pro-
gramming language Symbian C++. The Symbian S60 platform in particular was
deemed most appropriate due to its popularity and wide use on some powerful
devices such as the Nokia N95.
We then gave an extensive review of related work that discussed the different
approaches to the problem of image processing on devices with restricted
computing power. The different methods can be grouped
into server-client structures on one hand, where the captured image is sent to a
server for processing, and on-device processing on the other hand, out of which
the studies using factor graph belief propagation seemed the most successful and
efficient. Due to long file transfer times and possible lack of network connection,
the server-client approach was deemed unsuitable for the given task.
In the ensuing chapter a high-level description of the system architecture was
given in order to provide the reader with an overview of the most important
points in the development process. The application itself has been organised into
modules, each of them with a different function, that are able to interact over
clearly defined interfaces. The software’s structure and behaviour were described
using both text and appropriate UML diagrams. In this chapter, we also proposed
a simplified version of the rectangle detection method, as well as a description of
the more sophisticated belief propagation.
The software implementation was completed using the Carbide.c++ IDE,
provided by Symbian. It uses the camera API to capture both still and continuous
images that are then processed. In the first step, the image was converted to
greyscale and an edge extraction filter was applied, which produced an edge map.
The actual object detection was then carried out using factor graph belief prop-
agation, a message passing algorithm on a graphical model that computes the
belief of an image segment as the likelihood of it being part of the “figure” (as
opposed to the background). The final decision whether an emergency exit sign
was present in the image was then based on the (greyscale) histogram of the
thresholded sign and a template matching procedure.
Testing was carried out throughout the whole development process, as well
as after completing the implementation phase. It was essential to test both the
quality of the application (identifying the exit signs in various situations and from
various angles) and the performance of the processing module: Given the limited
processing power of mobile phones, can the image processing and identification
be run quickly enough? The necessary testing procedures were explained in the
respective chapter, along with an outline of the available test results.
Finally, a review of the application design and development process, along
with suggestions for possible improvements, was given in the previous chapter.
The evaluation of the project was important to demonstrate the understanding of
the topic and the ability to critically analyse the work carried out for this study.
This dissertation discussed and combined methods taken from a number of
different research disciplines, such as signal processing, statistics and software
development for mobile platforms. This makes it a valuable piece of work that,
while providing an extensive review of the different areas and their application
for image processing tasks, may also function as a starting point for further
exploration of the aforementioned topics. While not all of the main objectives were
achieved, the significant amount of research carried out for this study, as well as
the clearly laid out methodologies, the structured system design and the
exploration of different approaches to the problem demonstrate the general feasibility
of the task based on the chosen solution. As the use of factor graph belief propaga-
tion for image processing tasks on mobile platforms is yet to be comprehensively
investigated, it is strongly encouraged to carry out further research based on the
conclusions drawn from this work.
7.2 Future Work
In order to make the system’s output even more useful and accurate, the text on
the emergency exit sign could be analysed in addition to the other features that
have been discussed in this study. This could be achieved using the OCR1 API
provided by the Symbian operating system. After detecting the section of the
sign that contains the text, the methods offered by the API take the bitmap and
information about the text region (bounding box, background colour) and return
the recognised text. The text can then be output through the screen reader’s
text-to-speech engine to provide the user with more information about the sign content. While
the API has not been tested for this project, it is expected to deliver relatively
good results, considering it was designed with the aim of recognising very small
text such as addresses found on business cards.
Eventually, it would be an appropriate next step to research the feasibility of
utilising factor graph belief propagation for all stages of the recognition phase,
i.e. for grouping pixels in the straight line extraction phase, detecting rectangular
structures and analysing the icons and arrows on the sign. This methodology
promises a very efficient implementation of detection procedures on computationally
weak mobile platforms. The success of this method depends almost exclusively
on the choice of suitable cues, which have to be carefully considered given
the complex structure of the different icons found on a sign. With respect to future
work, it would be particularly interesting to implement a general framework
for factor graph belief propagation in Symbian C++ in order to provide a basis
for further exploration of real-time image processing on this platform.
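As a rough illustration of what the core of such a framework might look like, the following sketch runs sum-product message passing over binary variables connected by pairwise factors, using a simple flooding schedule. It is written in standard C++ rather than Symbian C++, and the names Dist, PairFactor and RunBeliefPropagation are hypothetical; none of them are taken from the code written for this project. The restriction to binary variables and pairwise factors is also an assumption made to keep the example short.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Unnormalised distribution over the two states {0, 1} of a binary variable.
using Dist = std::array<double, 2>;

// A factor connecting variables a and b; psi[xa][xb] is its compatibility value.
struct PairFactor {
    std::size_t a;
    std::size_t b;
    double psi[2][2];
};

static Dist Normalise(Dist d)
{
    const double z = d[0] + d[1];
    assert(z > 0.0);
    return {d[0] / z, d[1] / z};
}

// Runs sum-product for a fixed number of iterations and returns the
// normalised belief of every variable. unary[v] holds the local evidence.
std::vector<Dist> RunBeliefPropagation(const std::vector<Dist>& unary,
                                       const std::vector<PairFactor>& factors,
                                       int iterations)
{
    const std::size_t F = factors.size();
    // toA[f] / toB[f]: message sent through factor f to its variable a / b.
    std::vector<Dist> toA(F, Dist{1.0, 1.0}), toB(F, Dist{1.0, 1.0});

    // Product of the evidence at v and all incoming messages except the
    // one that arrives through factor 'skip'.
    auto cavity = [&](std::size_t v, std::size_t skip) {
        Dist p = unary[v];
        for (std::size_t f = 0; f < F; ++f) {
            if (f == skip) continue;
            if (factors[f].a == v) { p[0] *= toA[f][0]; p[1] *= toA[f][1]; }
            if (factors[f].b == v) { p[0] *= toB[f][0]; p[1] *= toB[f][1]; }
        }
        return p;
    };

    for (int it = 0; it < iterations; ++it) {
        std::vector<Dist> newA(F), newB(F);
        for (std::size_t f = 0; f < F; ++f) {
            const PairFactor& pf = factors[f];
            const Dist fromA = cavity(pf.a, f);
            const Dist fromB = cavity(pf.b, f);
            Dist mA{0.0, 0.0}, mB{0.0, 0.0};
            // Sum out the sending variable for each state of the receiver.
            for (int xa = 0; xa < 2; ++xa) {
                for (int xb = 0; xb < 2; ++xb) {
                    mB[xb] += pf.psi[xa][xb] * fromA[xa];
                    mA[xa] += pf.psi[xa][xb] * fromB[xb];
                }
            }
            newA[f] = Normalise(mA);
            newB[f] = Normalise(mB);
        }
        toA = newA;
        toB = newB;
    }

    std::vector<Dist> beliefs(unary.size());
    for (std::size_t v = 0; v < unary.size(); ++v) {
        beliefs[v] = Normalise(cavity(v, F));  // F matches no factor: use all messages
    }
    return beliefs;
}
```

On a graph without cycles the beliefs returned are the exact marginals after enough iterations; on graphs with cycles they are only approximations, and convergence is not guaranteed.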
1 Optical Character Recognition
By combining efficient object recognition through factor graph BP and OCR,
it would be possible to develop the system even further to recognise various types
of signs that combine icons and text using on-device processing. This is a highly
interesting application of the methodologies described in this project that could
even serve as a replacement for the client-server architectures currently
in use to carry out computationally heavy image processing tasks.
As the popularity of mobile platforms, and camera smartphones in particular,
is expected to grow even further in the future, it is desirable to continue exploring
their use not only for commercial software, but also for assisting people with
physical disabilities.
Bibliography
[BB08] Erich Bruns and Oliver Bimber. Adaptive training of video sets for
image recognition on mobile phones. Journal of Personal and
Ubiquitous Computing, 13:165–178, 2008.
[Blo] Dan Bloomberg. Leptonica. http://www.leptonica.com/affine.html.
Accessed: 12/07/2009.
[BS500] BS 5499-4:2000. Safety signs, including fire safety signs. The
British Standards Institution, 2000.
[DLQ+06] Tudor Dumitras, Matthew Lee, Pablo Quinones, Asim Smailagic,
Dan Siewiorek, and Priya Narasimhan. Eye of the Beholder:
Phone-Based Text-Recognition for the Visually-Impaired. In IEEE
International Symposium on Wearable Computers, pages 145–146,
2006.
[For] Forum Nokia. http://www.forum.nokia.com. Accessed:
10/04/2009.
[FS99] Yoav Freund and Robert E. Schapire. A Short Introduction to
Boosting. Journal of Japanese Society for Artificial Intelligence,
14:771–780, 1999.
[FZB+05] Paul Föckler, Thomas Zeidler, Benjamin Brombach, Erich Bruns,
and Oliver Bimber. PhoneGuide: Museum Guidance Supported by
On-Device Object Recognition on Mobile Phones. In International
Conference on Mobile and Ubiquitous Computing, pages 3–10, 2005.
[Gar09] Gartner Newsroom.
http://www.gartner.com/it/page.jsp?id=910112, 2009. Accessed:
11/08/2009.
[GdGH+06] N. J. C. Groeneweg, B. de Groot, A. H. R. Halma, B. R. Quiroga,
M. Tromp, and F. C. A. Groen. A Fast Offline Building
Recognition Application on a Mobile Telephone. In Advanced
Concepts for Intelligent Vision Systems, volume 4179 of Lecture
Notes in Computer Science, pages 1122–1132. Springer Berlin /
Heidelberg, 2006.
[Gel02] Andrew Gelman. Posterior Distribution. Encyclopedia of
Environmetrics, 3:1627–1628, 2002.
[GW02] Rafael C. Gonzalez and Richard E. Woods. Digital Image
Processing. Prentice Hall, 2nd edition, 2002.
[ICS08] Volodymyr Ivanchenko, James Coughlan, and Huiying Shen.
Detecting and locating crosswalks using a camera phone. In IEEE
Computer Society Conference on Computer Vision and Pattern
Recognition Workshops, 2008.
[Int] Intel. Probabilistic Network Library.
http://sourceforge.net/projects/openpnl. Accessed: 05/06/2009.
[KFL01] Frank Kschischang, Brendan J. Frey, and Hans-Andrea Loeliger.
Factor Graphs and the Sum-Product Algorithm. IEEE
Transactions on Information Theory, 47:498–519, 2001.
[Kob09] Nicole Kobie. Nokia’s ’Point & Find’ uses camera phone for search.
http://www.itpro.co.uk/610402/nokias-point-find-uses-camera-phone-for-search, 2009.
Accessed: 03/04/2009.
[Koo] Kooaba. http://www.kooaba.com/mobile-marketing/cases.
Accessed: 12/02/2009.
[KT07] Surendra M. Kumar and Timothy Jwoyen Tsai. CAT — Camera
Phone Color Appearance Tool. Stanford University, 2007.
[Mob] All About Mobile Life Blog.
http://mobile.kaywa.com/qr-code-data-matrix. Accessed:
12/02/2009.
[Moo] Joris Mooij. libDAI — A free/open source C++ library for
Discrete Approximate Inference methods.
http://www.kyb.mpg.de/bs/people/jorism/libDAI. Accessed:
05/06/2009.
[Nok] Nokia Mobile Codes. http://mobilecodes.nokia.com/scan.htm.
Accessed: 12/02/2009.
[PTAE09] Sobhan Naderi Parizi, Alireza Tavakoli Targhi, Omid Aghazadeh,
and Jan-Olof Eklundh. Reading Street Signs Using a Generic
Structured Object Detection and Signature Recognition Approach.
In International Conference on Vision Application, 2009.
[RNI] RNIB. Statistics — numbers of people with sight problems by age
group in the UK.
http://www.rnib.org.uk/xpedio/groups/public/documents/PublicWebsite/public_researchstats.hcsp.
Accessed: 11/05/2009.
[RR06] Christof Roduner and Michael Rohs. Practical Issues in Physical
Sign Recognition with Mobile Devices. ETH Zurich, 2006.
[SC07] Huiying Shen and James Coughlan. Grouping Using Factor
Graphs: An Approach for Finding Text with a Camera Phone. In
Graph-Based Representations in Pattern Recognition, 2007.
[Sym09] Symbian Developer Network. http://developer.symbian.com, 2009.
Accessed: 10/04/2009.
[TAT09] TAT — The Astonishing Tribe.
http://www.tat.se/site/showroom/latest_design.html, 2009.
Accessed: 11/08/2009.
[YFW03] Jonathan S Yedidia, William T Freeman, and Yair Weiss.
Understanding Belief Propagation and its Generalizations, 2003.
[Yua05] Michael Juntao Yuan. What Is a Smartphone.
http://www.oreillynet.com/pub/a/wireless/2005/08/23/whatissmartphone.html, 2005.
Accessed: 08/03/2009.
Appendix A
Listings
Sobel Operator
TInt CMSPImgProcessor::DetectEdges()
{
    TInt w = iSize.iWidth; TInt h = iSize.iHeight;
    TInt imgSize = w*h;
    [...]
    for(i=0; i<imgSize; i++)
    {
        // if we're at the first column - first pixel of a row
        if (i==(y+1)*w) { y++; x=0; }
        else { x++; }
        // initialise arrays with 0
        grad[i] = 0; xGrad[i] = 0; yGrad[i] = 0;
        // if we're not in the first/last column or row (image boundaries)
        if(x>0 && x<w-1 && y>0 && y<h-1)
        {
            // apply Sobel filter
            xGrad[i] = gImg[i+w+1] + gImg[i-w+1] + (2*gImg[i+1])
                     - gImg[i-w-1] - gImg[i+w-1] - (gImg[i-1]*2);
            yGrad[i] = gImg[i-w-1] + gImg[i-w+1] + (gImg[i-w]*2)
                     - gImg[i+w-1] - gImg[i+w+1] - (2*gImg[i+w]);
            grad[i] = Abs(xGrad[i]) + Abs(yGrad[i]);
            max = Max(grad[i], max);
        } // end if
    } // end for

    // normalise values to range 0..255
    // max is initialised with 1 to avoid division by zero
    TReal32 norm = 255.0/max;
    TReal32 g = 0.0;
    for(i=0; i<imgSize; i++)
    {
        edges[i] = 0; // initialise with 0 and only change if necessary
        if (grad[i] > 0)
        {
            g = static_cast<TReal32>(grad[i]);
            edges[i] = static_cast<TUint8>(g * norm);
        }
    }
}
Listing A.1: Sobel operator and normalisation
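Because Listing A.1 omits its variable declarations and depends on Symbian types, the same operation is sketched below in portable standard C++ for readers who want to run it outside the Symbian environment. The free function DetectEdges, its parameters and the use of std::vector are assumptions made for this illustration and are not part of the project code; the filter coefficients and the normalisation follow the listing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Applies the Sobel operator to a row-major 8-bit greyscale image of size
// w x h and returns an edge map scaled to the range 0..255.
// Pixels on the image boundary are left at 0.
std::vector<std::uint8_t> DetectEdges(const std::vector<std::uint8_t>& gImg,
                                      int w, int h)
{
    assert(w > 0 && h > 0);
    assert(gImg.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));

    std::vector<int> grad(gImg.size(), 0);
    int maxGrad = 1;  // initialised with 1 to avoid division by zero

    for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
            const int i = y * w + x;
            // horizontal and vertical Sobel responses
            const int xGrad = gImg[i + w + 1] + gImg[i - w + 1] + 2 * gImg[i + 1]
                            - gImg[i - w - 1] - gImg[i + w - 1] - 2 * gImg[i - 1];
            const int yGrad = gImg[i - w - 1] + gImg[i - w + 1] + 2 * gImg[i - w]
                            - gImg[i + w - 1] - gImg[i + w + 1] - 2 * gImg[i + w];
            // gradient magnitude approximated by the sum of absolute values
            grad[i] = std::abs(xGrad) + std::abs(yGrad);
            maxGrad = std::max(maxGrad, grad[i]);
        }
    }

    // normalise values to range 0..255
    std::vector<std::uint8_t> edges(gImg.size(), 0);
    const float norm = 255.0f / static_cast<float>(maxGrad);
    for (std::size_t i = 0; i < grad.size(); ++i) {
        edges[i] = static_cast<std::uint8_t>(static_cast<float>(grad[i]) * norm);
    }
    return edges;
}
```

Iterating over x and y directly removes the need to derive the column and row from the linear index, which is the purpose of the first conditional in the listing.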
Template Matching
TInt CMSPImgProcessor::MatchTemplate(CFbsBitmap* srcBitmap, TInt th)
{
    TInt direction = -1;
    // load the bitmap from an .mbm file
    _LIT(KMBMFileName, "z:\\resource\\apps\\Templates.mbm");
    // create a new bitmap for the templates and push
    // on the cleanup stack
    CFbsBitmap* atemplate = new (ELeave) CFbsBitmap();
    CleanupStack::PushL(atemplate);
    TInt imgSize = srcBitmap->SizeInPixels().iHeight
                 * srcBitmap->SizeInPixels().iWidth;
    // lock the global heap
    srcBitmap->LockHeap(ETrue);
    TUint8* src = (TUint8*) srcBitmap->DataAddress();
    for(TInt i = 0; i<8; i++)
    {
        TInt sum = 0;
        // load the template
        User::LeaveIfError(atemplate->Load(KMBMFileName, i));
        TUint8* temp = (TUint8*) atemplate->DataAddress();
        for (TInt j = 0; j<imgSize; j++)
        {
            // accumulate the absolute difference between source
            // and template at pixel j
            sum += Abs(temp[j]-src[j]);
        }
        // if the difference for one template is less
        // than the threshold
        if (sum < th) { direction = i; break; }
    }
    srcBitmap->UnlockHeap(ETrue);
    CleanupStack::PopAndDestroy(atemplate);
    return direction;
}
Listing A.2: Template matching to determine the arrow direction
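The matching step in Listing A.2 is a sum of absolute differences (SAD) between the source image and each template. A portable sketch in standard C++ is given below; the Image alias, the function signature and the early exit once the threshold is reached are assumptions added for illustration and do not appear in the project code. As in the listing, the index of the first template whose difference falls below the threshold is returned, or -1 if none matches.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

using Image = std::vector<std::uint8_t>;  // 8-bit greyscale pixels, row-major

// Returns the index of the first template whose sum of absolute
// differences to src is below th, or -1 if no template matches.
int MatchTemplate(const Image& src, const std::vector<Image>& templates, long th)
{
    for (std::size_t t = 0; t < templates.size(); ++t) {
        const Image& tmpl = templates[t];
        assert(tmpl.size() == src.size());  // templates must match the source size

        long sum = 0;
        for (std::size_t j = 0; j < src.size(); ++j) {
            sum += std::abs(static_cast<int>(tmpl[j]) - static_cast<int>(src[j]));
            // the sum can only grow, so stop once the threshold is reached
            if (sum >= th) break;
        }
        if (sum < th) return static_cast<int>(t);
    }
    return -1;
}
```

The early exit does not change the result, since the accumulated difference never decreases; it only avoids scanning the remaining pixels of a template that has already been rejected.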