
IMAGE PROCESSING ON A MOBILE PLATFORM

A thesis submitted to the University of Manchester

for the degree of Master of Science

in the Faculty of Engineering and Physical Sciences

2009

By

Samantha Patricia Bail

School of Computer Science

Contents

Abstract
Declaration
Copyright
Acknowledgements

1 Introduction
1.1 Description of the Project
1.2 Motivation
1.3 Main Objectives
1.4 Scope
1.5 Dissertation Overview

2 Project Background and Literature Review
2.1 Overview
2.2 Mobile Platforms
2.3 Mobile Phones as Assistive Devices
2.4 Image Processing and Object Detection
2.5 Related Work
2.6 Analysis of Methods for Object Detection
2.7 Factor Graph Belief Propagation
2.8 Chapter Summary

3 Application Design
3.1 Overview
3.2 Requirements Analysis
3.3 Software Architecture
3.4 Image Processing Methods and Algorithms
3.5 Training Images
3.6 Issues Affecting the System Performance
3.7 Chapter Summary

4 System Implementation
4.1 Overview
4.2 Implementation Tools
4.3 Image Capturing
4.4 Phase One: Feature Extraction
4.5 Phase Two: Object Recognition
4.6 Result Output
4.7 Optimisation for Symbian S60 devices
4.8 Chapter Summary

5 Testing
5.1 Overview
5.2 Description of the Testing Procedures
5.3 System Performance Evaluation
5.4 Chapter Summary

6 System Evaluation
6.1 Overview
6.2 Analysis of the Research Methodology
6.3 Review of the Project Plan
6.4 Improvements
6.5 Chapter Summary

7 Conclusion and Future Work
7.1 Project Summary
7.2 Future Work

Bibliography

A Listings

List of Figures

1.1 Two exit signs according to BS 5499-4
2.1 Worldwide smartphone sales to end users 2008
2.2 Example of a factor graph
3.1 Class diagram showing the organisation of the application classes
3.2 State diagram for the emergency exit sign recognition software
3.3 Sobel kernels used for horizontal and vertical derivatives
3.4 Four examples of emergency exit signs captured with a phone camera
4.1 Individual steps of edge detection
4.2 Two examples of binary sign templates

Abstract

Emergency exit signs are an indispensable part of the safety precautions for public buildings. In case of an emergency, they indicate safe escape routes and emergency doors, using an internationally recognizable sign: a green and white sign with icons showing a running person, a door, an arrow pointing in the direction of the escape route and the word Exit (or other words describing an emergency exit), in different combinations. These signs can be easily detected and interpreted by sighted people, but are unsuitable for visually impaired persons who cannot rely on visual indicators.

This project deals with the issues of recognizing emergency exit signs with a mobile device. It describes the development of a piece of software that runs on a Symbian OS smartphone and can be used to detect emergency exit signs using the phone’s camera. When a sign is detected, the device indicates this through an acoustic signal and, if an arrow is present on the sign, the software specifies the direction through text output.

In order to achieve fast processing times, the study also deals with the low computing power of smartphones. The chosen approach is based on belief propagation on factor graphs, a method drawn from statistics, which is used in combination with other image processing tasks such as template matching. While the success of an efficient implementation depends strongly on the observance of necessary optimisations in both the choice of algorithms and coding practice, the general feasibility of image processing on the chosen mobile platform is demonstrated by this project.


Declaration

No portion of the work referred to in this thesis has been

submitted in support of an application for another degree

or qualification of this or any other university or other

institute of learning.


Copyright

i. Copyright in text of this dissertation rests with the Author. Copies (by any process) either in full, or of extracts, may be made only in accordance with instructions given by the Author. Details may be obtained from the appropriate Graduate Office. This page must form part of any such copies made. Further copies (by any process) of copies made in accordance with such instructions may not be made without the permission (in writing) of the Author.

ii. The ownership of any intellectual property rights which may be described in this thesis is vested in the University of Manchester, subject to any prior agreement to the contrary, and may not be made available for use by third parties without the written permission of the University, which will prescribe the terms and conditions of any such agreement.

iii. Further information on the conditions under which disclosures and exploitation may take place is available from the Head of the School of Computer Science.


Acknowledgements

I would like to thank my supervisor Dr Tim Morris for his support and helpful guidance throughout all stages of the project, as well as Dr David Rydeheard, who would always provide me with good advice whenever I came across any difficulties on the course. Many thanks to Marcus Groeber for his advice regarding Symbian, and to Volodymyr Ivanchenko for providing the Crosswatch application for testing purposes.

Thanks to my family and especially my mother and grandfather who supported me during my never-ending studies (I’m off to the next round). My thanks go out to Simon for his incredible patience, as well as his family for all their help. Thanks to all my friends in the UK and in Germany, especially to Dr B., and to my housemates for their motivational talks.

I would also like to mention all the students who spent so many days (and nights) in the MSc lab and provided me with advice and chats.

Danke.


Chapter 1

Introduction

1.1 Description of the Project

Visual signs provide a means of orientation for sighted people within unfamiliar locations such as offices, hospitals and other public buildings. Particularly in emergency situations, emergency exit signs point the way to important escape routes, which is why they are a legal requirement for buildings of a certain size. However, people with visual impairments cannot use these vital resources as a guidance aid. Using a mobile tool to detect these emergency signs and output the necessary information in acoustic form can make them accessible to people who cannot rely on their eyesight to recognize visual objects. This can be helpful in unknown or complex buildings, when the escape routes cannot be memorized and there is no other person immediately available who could provide guidance to find the right escape route.

This project will carry out research into the feasibility of such a guidance system, analyse different methods to achieve the task and describe a way of implementing the system on a mobile platform. Upon completion of the work, we will have gained insights into an efficient implementation of computationally demanding procedures such as computer vision algorithms on mobile devices with low processing power. In addition, the software will be a demonstration of how modern technology such as the smartphone platform, with its wide scope of possible applications, can be used to assist blind and visually impaired people.


1.2 Motivation

There are over two million people in the UK living with significant sight loss, of whom over 300,000 are officially registered as blind or partially sighted [RNI]. Numerous tools and techniques are available to blind people to help them complete everyday tasks more safely and with greater independence. Such assistance can come in the form of guide dogs and white canes (for navigating around unfamiliar obstacles in public spaces) but also in lesser-known forms, for example digital water level sensors that sound an alarm when a vessel is full. The use of modern information technology has become increasingly popular in the past few years, with companies providing mobile talking book players, braille output devices for mobile phones and text-to-speech software for computers.

To sighted people, many everyday tasks such as locating exit signs in public places are hardly thought about; they are done almost subconsciously. However, for a blind or partially sighted person, not being able to identify the quickest and safest way out of a building can have serious, potentially dangerous consequences. It is this particular problem that will form the core of this study.

While mainly based in the discipline of computer vision, this project has two important aspects. The first is adapting modern technology in order to provide assistance to visually impaired people without the need to produce specifically designed devices for them, which is connected to the notion of accessibility. The second is the implementation of a computationally demanding task such as image processing on a platform with restricted computing power. This restriction makes it necessary to move away from some of the traditionally used methods that prove too computationally demanding, and to explore novel approaches, simplified versions of algorithms and approximations that can be used to achieve a lightweight implementation.

We acknowledge that the idea of using computer vision for visually impaired people is not groundbreaking; however, it is still rarely seen on mobile platforms. We hope to give an insight into the different possibilities that modern mobile systems offer, and provide the basis for further research in this area.


1.3 Main Objectives

The aim of this application is to provide visually impaired people with a method of recognizing emergency exits¹ independently, using an “out-of-the-box” mobile phone with a built-in camera.

The ideal process when using the application would include the following steps:

• The user opens the application on the phone, if possible via a shortcut

• The user pans the phone from side to side

• If an emergency exit sign is detected, the application outputs an acoustic signal (a “beep”)

• If the sign contains an arrow, the application outputs the direction of the arrow (e.g. “Arrow points to the right”)

• The user knows the location of the sign and where to proceed from there (e.g. at the next door)
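As an illustration only, the feedback part of the steps above can be sketched in a few lines of Python. This is not the Symbian C++ implementation described in later chapters; the names `Detection` and `feedback_for` are hypothetical, and the detection result is assumed to be supplied by some earlier image processing stage.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Detection:
    """Hypothetical result of analysing one camera frame."""
    sign_found: bool
    # Direction of the arrow as text (e.g. "right"); None if the sign has no arrow.
    arrow_direction: Optional[str] = None


def feedback_for(detection: Detection) -> List[str]:
    """Return the outputs the application would produce for one frame."""
    if not detection.sign_found:
        # No sign in view: stay silent while the user keeps panning.
        return []
    outputs = ["beep"]
    if detection.arrow_direction is not None:
        outputs.append(f"Arrow points to the {detection.arrow_direction}")
    return outputs
```

A frame without a sign produces no output, a sign without an arrow produces only the acoustic signal, and a sign with an arrow produces the signal followed by the direction text.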

These signals can only function as a pointer that indicates the approximate direction of an emergency exit. Parameters like the location of the sign in the room (above a door, on the wall etc.) or the exact distance from the camera would make the application more useful, but are difficult to determine. However, a rough acoustic description of the direction is already one step ahead of signs that are virtually useless for visually impaired people. It can help with decisions such as which direction to take when standing in the middle of a corridor in order to reach the nearest exit. Describing the arrow on the sign (if present) also helps to prevent the user from walking away from the exit, which makes this an important part of the application’s output.

1.4 Scope

In order to achieve the previously mentioned objectives, the scope of the project

has to be clearly defined.

¹ We have picked emergency exit signs for this task as an example for the technology used in the project, as they are easily recognisable and standardised. However, we would like to point out that the methods discussed in this study could be applied to any other type of sign that is based on a common standard.

First, it has to be specified what exactly should be recognized by the application. The basic design of emergency exit signs is similar for most countries despite there being no mandatory international standard for emergency exit signs. Most signs include a stylised symbol of a running person (sometimes in front of a rectangle that represents a door), an arrow and the words “Exit”, “Emergency exit”, “Fire Exit” or similar, in various combinations, but always green² on white background (or vice versa). Depending on the surrounding lighting conditions, the signs can be lit either from the inside or externally. These differences do not cause any problem for people who can see the signs and interpret them as “similar” on the basis of their typical colour and contents. However, automated recognition strategies are likely to fail when applied to such different types of signs.

This is why a decision was made to constrain the set of signs that should be recognized to emergency exit signs that were designed according to the British Standard BS 5499-4 [BS500], which are widely used in most public buildings in the UK. Signs of this type are composed of up to three different parts:

• A running figure (running to the left or to the right)

• An arrow pointing in the direction of the escape route

• The word “Exit” or “Fire exit”

Even with this constraint, we are still confronted with a problem caused by

the often low quality of built-in mobile phone cameras. When trying to capture

sample pictures of exit signs that were illuminated internally, the light intensity

of the sign leads to a large white spot on the image. This overexposure makes

recognising any sign impossible. Since it cannot be expected that mobile phone

cameras have the means of automatically adjusting the exposure time to correct

the flaws, this type of internally illuminated exit signs simply has to be removed

from the set of recognizable images.

This reduces the task to recognizing emergency signs that were designed according to BS 5499-4, and which are not internally illuminated. Two examples of these signs are shown in figure 1.1.

Figure 1.1: Two exit signs according to BS 5499-4

² In the case of BS 5499-4 exit signs, the shade of green is Pantone® 3405 CVC.

The signs are always built from the same set of parts; however, their layout differs depending on the location of the exit. Signs that point at a location to the right (i.e. right, up, down, top right, bottom right) have the arrow on the right-hand side, with the running person facing right. Accordingly, all signs with an arrow pointing to the left, top left or bottom left place the arrow on the left-hand side, with the running icon also facing left.
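The layout rule described above can be written down as a small lookup. The following Python sketch is purely illustrative; the function name `sign_layout` and the direction strings are assumptions made for this example and are not taken from the standard or from the implementation.

```python
# Arrow directions grouped by the side of the sign on which the arrow is placed,
# following the layout rule described in the text.
RIGHT_SIDE = {"right", "up", "down", "top right", "bottom right"}
LEFT_SIDE = {"left", "top left", "bottom left"}


def sign_layout(arrow_direction: str) -> dict:
    """Return where the arrow sits and which way the running figure faces."""
    if arrow_direction in RIGHT_SIDE:
        side = "right"
    elif arrow_direction in LEFT_SIDE:
        side = "left"
    else:
        raise ValueError(f"unknown arrow direction: {arrow_direction!r}")
    # The running figure always faces the side on which the arrow is placed.
    return {"arrow_side": side, "figure_faces": side}
```

For example, `sign_layout("up")` places the arrow on the right with the figure facing right, while `sign_layout("top left")` places both on the left. A recognition system could use this coupling as a consistency check between the detected arrow and the detected figure.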

1.5 Dissertation Overview

The structure of this dissertation is roughly based on the chronological development of the project:

• In chapter 2 we will discuss the usage of mobile phones as tools for visually impaired users. We will then give an overview of the domain of computer vision and its sub-areas that are relevant for the given task, such as image processing and object detection. This is followed by an extensive review of related works of similar nature, i.e. image processing applications on mobile platforms, which our project will be based on.

• Chapter 3 will evaluate different mobile platforms with respect to their suitability for the task, and their efficiency in carrying out computationally demanding processes like object detection. After deciding on the platform, the available tools and methods will be reviewed, which will form the basis of the actual implementation of the software. We will then describe the general application design and outline implementation details, such as the image processing algorithms that will be used.

• In chapter 4, details of the implementation on the mobile platform will be explained, along with a discussion of the methods necessary for optimising system performance. Given the rather unfamiliar mobile platform and programming language Symbian C++, we will also include code snippets to describe the most important modules of the system and highlight significant details.

• This is followed by a description of the testing procedures and an evaluation of the implemented system with respect to the test results in chapters 5 and 6 respectively. In addition, the chosen approach is analysed in the context of projects with a similar background, where some of the advantages and disadvantages will be discussed. In chapter 6 we will also review the project plan with respect to the project flow.

• In the final chapter, we will summarise the project with respect to the tasks performed during the course of this study and the findings discussed in the different chapters. The work will be concluded by an overview of possible future developments and applications based on the work performed in the course of this project.

Chapter 2

Project Background and Literature Review

2.1 Overview

This chapter will discuss the project background with regard to the status of the

chosen platform and the foundations of the research area it is based on. This

will be followed by an extensive literature review that discusses and analyses the

research that was carried out in similar projects, and their approaches to the

problem of image processing on mobile devices. We will then look into details

of the most suitable methods for the given task and draw a conclusion regarding

the chosen approach for our project.

2.2 Mobile Platforms

2.2.1 Hardware — Suitable Devices

The vast number of mobile phones now available to the public offers a multitude of designs and functionalities. For this project, certain requirements for processing capacity and user interface have to be met, which narrows the choice of phones down to a certain type. The term “smartphone” is generally used for a mobile phone that combines standard phone functions (phone calls, text messaging) with those of a PDA, such as internet access, e-mail tasks, multimedia players and office applications [Yua05]. The most popular and established smartphones to date are the Blackberry line (RIM), the iPhone (Apple) and several Nokia devices (such as the N-series), and the market continues to grow.

The average processing power of smartphones seems appropriate for the computationally demanding task of image processing, as has already been demonstrated by several applications (see section 2.5). This is why we decided to choose a smartphone platform for this project rather than developing a Java application for a standard mobile phone. It can also be assumed that visually impaired users prefer to use devices with text-to-speech software, for which smartphones provide the most sophisticated platform.

With respect to the hardware and user interface, several requirements have to be met for this task. The most obvious feature that is needed for image processing is an integrated camera which is suitable for capturing images in a sufficient quality and resolution, such as 320x240 or 640x480 pixels [KT07, ICS08]. Since most mobile phones and smartphones come equipped with a camera that has a resolution of at least 1 megapixel (up to 8 megapixels), this criterion will be easily met by most available devices.

Another important issue is the user interface, that is, its accessibility to visually impaired users. As previously mentioned, it can be assumed that users access the device through text-to-speech software that reads out the screen content and describes the phone menus. Interaction is then carried out using the phone’s buttons, which have to be located by touch. This requirement rules out devices that are operated with touchscreens, as they provide no tactile feedback to the user¹.

¹ Nokia announced support for tactile feedback on touchscreens with the latest Symbian OS version 9.4 in late 2008. However, this will not be discussed here, as it cannot be considered commercially relevant yet.

In conclusion, the most suitable device for the given task is a smartphone that is able to run third-party software, comes equipped with a camera and has tactile buttons. These criteria have to be considered when choosing the platform for the image processing software.

2.2.2 Operating Systems and Platforms

Smartphones are currently distributed with a wide range of operating systems, such as Windows Mobile, the BlackBerry OS, Symbian OS, Palm OS and Linux-based systems. All systems offer different capabilities for installing and running third-party software, and most manufacturers provide APIs for various programming languages such as C, C++, Java and Python.

Operating System            Sales in Thousands   Market Share
Symbian                               72,933.5         52.4 %
RIM (BlackBerry)                      23,149.0         16.6 %
Microsoft Windows Mobile              16,498.1         11.8 %
Mac OS X (iPhone)                     11,417.5          8.2 %
Linux                                 11,262.9          8.1 %
Palm OS                                2,507.2          1.8 %
Other                                  1,519.7          1.1 %
Total in 2008                        139,287.9        100.0 %

Figure 2.1: Worldwide smartphone sales to end users 2008

With a market share of roughly 50% of the smartphone market, the Symbian operating system is currently the leading smartphone platform, as shown in figure 2.1 [Gar09]. Nokia² widely supports Symbian OS and encourages developers to implement applications for the operating system by providing the necessary APIs and tools. This, and the wide range of available Symbian handsets, makes it a suitable system to reach as many users as possible. In particular, the majority of accessible Symbian devices run the S60 version of this operating system.
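The market share figures can be reproduced from the sales numbers reported in figure 2.1. The short Python check below is only a worked calculation on those published values; the variable names are chosen for this example.

```python
# Smartphone sales to end users in 2008, in thousands of units (figure 2.1).
sales_thousands = {
    "Symbian": 72933.5,
    "RIM (BlackBerry)": 23149.0,
    "Microsoft Windows Mobile": 16498.1,
    "Mac OS X (iPhone)": 11417.5,
    "Linux": 11262.9,
    "Palm OS": 2507.2,
    "Other": 1519.7,
}

# Total sales across all operating systems.
total = sum(sales_thousands.values())

# Market share of each operating system in percent, rounded to one decimal place.
shares = {
    name: round(100 * units / total, 1)
    for name, units in sales_thousands.items()
}
```

The individual rows add up to the reported total of 139,287.9 thousand units, and Symbian's share of 72,933.5 / 139,287.9 comes out at 52.4 %, which is the basis for the "roughly 50%" stated above.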

Based on previous works that used the Symbian platform to develop image processing applications, it can be assumed that devices with sufficient processing power for this task are available. While Symbian also supports Java and Python, Symbian C++, a C++ dialect, is described as the fastest and most efficient programming language on this system. [ICS08] even states that for the task of recognising zebra crossings with a mobile phone, “Real-time performance [. . . ] is made possible by coding in Symbian C++”. These points led to the decision to develop the software for recognising emergency exit signs on the Symbian platform, using Symbian C++.

The device used for this application is a Nokia N95 smartphone running version 9.2 of Symbian OS. Both Symbian and Nokia offer extensive and up-to-date online resources with detailed information on Symbian C++ programming [Sym09, For]. The online forums offered on both websites in particular act as a helpful source, along with sample code, parts of which (e.g. tutorials for the camera API) were used as a starting point for this project.

² Symbian Software Limited was in fact acquired by Nokia in 2008.

2.3 Mobile Phones as Assistive Devices

Mobile phones, and smartphones in particular, provide a platform for a wide range of applications for visually impaired users. Interaction with the phone is made possible by a screen reader that outputs the displayed text via text-to-speech, and therefore allows access to text-based information such as phone menus, internet content or text messages. Many smartphone applications were developed specifically for use by visually impaired people. These include OCR³ applications that make use of a built-in camera, navigation software and audio players for talking books in the DAISY⁴ format. Due to the existence of a screen reader on the device, the applications do not need to implement their own text-to-speech solutions, but only ensure accessibility by the screen reader.

The benefits of smartphones as assistive devices lie in the convenience of using

an out-of-the-box platform with additional software, rather than a hardware-

based implementation that was designed for a single purpose. This includes lower

costs of an “all-in-one” tool, as opposed to multiple devices, the relatively small

size of modern phones and the comfort of not having to carry several devices.

2.4 Image Processing and Object Detection

In order to produce a correctly and efficiently working piece of software, it is important to analyse the basic requirements for the given task with respect to the underlying principles of computer vision. This discipline, part of the domain of artificial intelligence (AI), aims at “emulat[ing] human vision” with the aid of computers [GW02]. The general description of this term is the detection and recognition of objects on the basis of an input image, which leads to a decision on the contents and nature of the object and, eventually, a reaction of the system.

³ Optical Character Recognition

⁴ Digital Accessible Information SYstem: a digital talking book format, based on XML applications, that was specifically developed for visually impaired users.

However, there is no clear definition of the boundaries of this discipline and those of the subareas it comprises. In the literature, it is often implied that, while being an extensive research subject itself, image processing is a subarea of computer vision. It deals with the processing and analysis of an image, either to manipulate the image, which yields another image (for example, by applying a masking algorithm to detect edges in an image), or to obtain information about the image (such as a histogram of the image’s tonal values).
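The two kinds of operation mentioned above can be illustrated with a minimal Python sketch, assuming a greyscale image stored as a list of rows with integer values from 0 to 255. The 3x3 masks are the standard Sobel kernels; summing the absolute horizontal and vertical responses is an assumed simplification of the gradient magnitude, and the function names are hypothetical rather than taken from the implementation.

```python
# Standard 3x3 Sobel masks for the horizontal and vertical derivatives.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def apply_mask(image, mask):
    """Apply a 3x3 mask to every interior pixel; border pixels are left at 0."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            out[y][x] = sum(
                mask[j][i] * image[y + j - 1][x + i - 1]
                for j in range(3)
                for i in range(3)
            )
    return out


def edge_strength(image):
    """Image-to-image operation: approximate gradient magnitude |gx| + |gy|."""
    gx = apply_mask(image, SOBEL_X)
    gy = apply_mask(image, SOBEL_Y)
    return [[abs(a) + abs(b) for a, b in zip(row_x, row_y)]
            for row_x, row_y in zip(gx, gy)]


def histogram(image, levels=256):
    """Image-to-information operation: count how often each tonal value occurs."""
    counts = [0] * levels
    for row in image:
        for value in row:
            counts[value] += 1
    return counts
```

Applied to an image whose left half is black and whose right half is white, `edge_strength` responds only along the vertical boundary, while `histogram` reports two equally sized peaks at 0 and 255. The first result is again an image; the second is information about the image.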

The area of object detection is of particular importance for the task of recognizing emergency exit signs. This term describes the process of finding and identifying an object of interest in a given image; in this case, a rectangular plate in a room or corridor. In order to recognize the object as an emergency exit sign (as opposed to other signs), and the direction it points at, the object has to be classified. This process, again, utilizes methods from AI (and neural networks in particular), such as prior training of the system (on a set of positive and negative images), probabilistic methods and statistics.

As computer vision has been studied for several decades, a large number of different algorithms (and implementations) exist for its various tasks. Given the limited computing capacity of mobile phones, the suitability of different algorithms for efficient implementation on this platform has to be analysed. In the next section, we will look into projects that deal with similar problems and review the different methods, which will then provide a basis for our implementation.

2.5 Related Work: Image Processing on Mobile Platforms

2.5.1 Server-Client Systems

It is only recently that researchers started studying the issue of image processing on mobile devices. Due to the restricted computing capacity on mobile phones and PDAs, there are different approaches to dealing with this issue, of which the most significant will be discussed in this section.

One of the early solutions is the use of a server-client based system. The user captures an image with the mobile device, which is then sent to a server that carries out the actual processing work. After processing, the reply is sent back to the user via the mobile phone network. Various commercial providers use this method for mobile marketing in high-profile campaigns, such as those presented in [Koo]. The major advantage is that the task can be carried out with any kind of mobile device that has a camera, no matter what computing capacity it has. However, this system requires a phone network connection to be available, which can be difficult in certain areas or inside buildings. [RR06] uses a similar server-client solution to recognize street name plates and use them as links to further information that is available online. The system runs on a PDA with a touchscreen, which allows the user to manually highlight the area that contains the street sign (“area of interest”), and therefore simplifies the issue of object detection in the image. Several feature extraction algorithms (SIFT, Black/White, Wavelet, HSV) were analysed for this task, with SIFT proving to be the most effective algorithm that is invariant to perspective bias and varying lighting conditions. Due to the server-client structure, however, there were no investigations into whether SIFT is also suitable on devices with low computational resources.

In 2009, Nokia launched their “Point & Find” software, which is the first system that does not restrict the use to a certain type of object [Kob09]. After capturing a picture of the object, such as a film poster, the image (and, if available, GPS information on the user’s location) is sent to a server that searches a database. Additional content and information on the object is then sent back to the handset. Nokia aims at extending the use of “Point & Find” to a wide range of commercial applications, such as barcodes and museum exhibits.

2.5.2 On-device Image Processing

A different approach from those previously mentioned, which carries out the image processing on the device itself, is the use of 2-dimensional barcodes or “QR Codes”. With this method, the user captures an image of the barcode (e.g. on magazines, posters), which is processed immediately and turned into a URL that leads to further information on a website, which is accessed via the phone’s browser. This established method has spread widely over the last few years (various examples can be found in [Mob]) and many phones already come equipped with a barcode reader [Nok].

A similar project (“PhoneGuide”) uses image processing on mobile devices to

recognize exhibits in a museum [BB08]. A wide range of different algorithms have

been examined for this purpose, such as pattern-matching, discriminate regions

and SIFT, which all proved too inefficient on computationally restricted devices.

The chosen approach, a linear separation strategy implemented with an artificial

neural network, achieved the most correct and efficient object recognition.

Several sets of normalized features (such as colour and structural features) were

tested for object recognition, with colour features yielding the highest recognition


rate. However, due to the differences between various mobile phone cameras, colour

calibration is necessary if the camera used for training the algorithm differs from

the user’s phone. The application, implemented on a Symbian S60 smartphone

in Symbian C++, achieved a recognition rate of 90% in tests, with processing

times of less than 1 second.

All previously mentioned systems assume that the user points the camera

directly at (or even manually marks) the object that is to be captured, with

blurring, varying lighting conditions, scaling and perspective bias being the major

issues that need to be addressed. A basic aspect of our project, however, is

the software’s suitability for visually impaired people. In this case, not only

the nature of the captured object is important, but detecting whether there is

any object in the picture at all is even more critical. [PTAE09] emphasizes the

importance of object detection as a first step to recognizing text on street name

plates. The system uses a boosting algorithm (AdaBoost) and Haar features for

object detection. In order to correct the number of false positives, the system

makes use of the textural information on street name signs (as opposed to windows

and building facades that caused the false positives). The text on the signs is

then recognized using a direct matching technique. Given the limited set of street

signs that are to be recognized, this image matching approach is considered more

efficient than character recognition. Although the system is intended for use on

a mobile phone, the testing was only carried out on a desktop PC, which does

not allow any statements regarding efficiency.

[GdGH+06] focuses on the efficiency of a system for recognizing buildings (e.g.

for use as a tourist guide) with mobile devices, making use of a local invariant

regions algorithm. Several approaches for object recognition using global or local

features are analysed, with global features such as colour distribution proving to

be insufficient and not robust to occlusion or different viewpoints. Algorithms

such as SIFT that utilize local features are more robust to these problems, but

are found to be inefficient when carried out on a mobile device. In order to

reduce computation time, the image data is compressed using principal component

analysis. The similarity of the captured image to a building in the application’s

database is then determined using a voting scheme. Tests were carried out on a

Sony Ericsson K700i and Nokia 6630, with both phones only supporting Java ap-

plications, and achieved recognition times of less than 5 seconds for one building.

An application that is built explicitly for visually impaired users is described in


[ICS08]. The system is implemented in Symbian C++ on a Nokia N95 phone and

detects zebra crossings in real time (3 frames per second), using the phone’s video

capturing mode. The user points the camera in the estimated direction and the

application outputs an acoustic notifier if a zebra crossing is detected. The system

is based on a feature extraction in the first stage and figure-ground segmentation

using a graphical model, the factor graph, in the second stage. Figure-ground

segmentation⁵ describes the process of grouping pixels into object (figure) and

background (ground) pixels depending on their compatibility as a group of figure

or ground pixels respectively. Since mobile devices do not have floating point

units (FPU), all floating point operations are carried out on a software-emulated

FPU, which has great impact on the processing speed. In order to avoid floating-point

calculations, the phone implementation uses a simplified version of factor graph

belief propagation to perform statistical inference on the factor graph, as well

as static arrays instead of dynamic lists. This application is the only known

approach to date that aims at processing images on a mobile platform in real

time and is therefore particularly interesting for our project in terms of efficiency.

The Symbian platform is also used to develop a mobile colour recognition

software as described in [KT07]. Due to it being the “native language” of the

Symbian operating system and providing “very low level access to devices and

other services”, the C++ programming language is considered suitable for this

task. The system was tested on a Nokia N93 smartphone running the Symbian

S60 3rd edition operating system, and yielded a minimal processing time of 4.4

seconds after reducing the sample rate of the test image. Since this system is

only colour based, it strongly depends on the lighting conditions and camera

parameters, and is therefore very likely to produce incorrect results.

Another very recent development (Summer 2009) is the use of mobile phones

for augmented reality applications. The systems make use of a smartphone

camera to capture real-time images, process the image and output information

based on the image. The Swedish company TAT [TAT09] announced their “Aug-

mented ID” system that matches people’s faces with their profile in the database

of the social network, using a 3D facial recognition method. It then displays per-

sonal information (such as Facebook or Twitter profiles) as hovering icons around

the person’s face.

⁵ A term originating in early 20th century “Gestalt” psychology, which deals with human visual perception. The theory makes statements about how the visual system groups individual elements into objects, based on cues such as proximity and similarity.


2.6 Analysis of Methods for Object Detection

In order to decide which method for object detection on a mobile platform seems

suitable for our task, we have to look into the details of the approaches that were

proposed in the previous section. Due to no prior knowledge of image processing

on platforms with restricted computing power, the decision will be based on the

findings and conclusions drawn in previous research.

While the SIFT algorithm (scale-invariant feature transform) is considered

“superior” due to its invariance to image transformations (scaling, translation,

rotation) [RR06], it is also labelled too inefficient on mobile platforms [FZB+05].

By using the modified i-SIFT (informative SIFT), the application’s runtime can

be reduced, while yielding high recognition rates [GdGH+06]. However, the com-

putationally demanding execution of this algorithm is still too time-consuming

for mobile devices, which is why it can be generally ruled out as unsuitable for

the given task.

Another approach mentioned in similar applications is the use of a boosting

algorithm such as AdaBoost (adaptive boosting). Boosting, which evolved from

the domain of machine learning, is based on the combination of “weak” (i.e. only

slightly better than random guessing) learning algorithms in order to produce

one “strong” learning algorithm through training [FS99]. During the training,

AdaBoost performs the weak algorithm repeatedly (e.g. 100 rounds) on a set of

input values that are initially weighted equally. If a value is incorrectly classified,

its weight is increased, which marks it as a “hard” example that the algorithm has

to concentrate on. This training leads to one weak hypothesis per round; these

are combined into the final hypothesis, which yields a very low error rate.
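The re-weighting step described above can be sketched in C++ as follows. This is only an illustration of one boosting round in the form given in [FS99]; the function name adaboostRound and the representation of the weak hypothesis as a vector of "classified correctly" flags are assumptions made for this example, and the weighted error is assumed to lie strictly between 0 and 0.5.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One boosting round (illustrative sketch, not taken from the cited systems).
// weights: current example weights, assumed to sum to 1.
// correct: whether the weak hypothesis classified each example correctly.
// Returns the weight alpha of this round's weak hypothesis and updates the
// example weights in place so that misclassified examples gain weight.
double adaboostRound(std::vector<double>& weights, const std::vector<bool>& correct) {
    double eps = 0.0; // weighted error of the weak hypothesis
    for (std::size_t i = 0; i < weights.size(); ++i) {
        if (!correct[i]) eps += weights[i];
    }
    // Assumption: 0 < eps < 0.5, i.e. slightly better than random guessing.
    double alpha = 0.5 * std::log((1.0 - eps) / eps);

    double z = 0.0; // normalisation constant
    for (std::size_t i = 0; i < weights.size(); ++i) {
        weights[i] *= std::exp(correct[i] ? -alpha : alpha);
        z += weights[i];
    }
    for (std::size_t i = 0; i < weights.size(); ++i) {
        weights[i] /= z;
    }
    return alpha;
}
```

After such a round the misclassified examples together carry half of the total weight, which forces the next weak hypothesis to concentrate on them.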

[PTAE09] describes the use of Haar-like features as weak classifiers for general

object detection. Haar-like features are image features that are represented as

adjoining black and white rectangles, the value of each feature being the difference

between the sums of the pixel grey level values within the rectangles. By using this method for

classification instead of single pixel values, the classification process can be sped

up significantly.
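To illustrate why rectangle features are cheap to evaluate, the following C++ sketch computes a two-rectangle Haar-like feature with the help of a summed-area table (integral image). The integral image is the standard technique for this purpose but is not described in [PTAE09] as summarised here, so it should be read as an assumption; the names IntegralImage, rectSum and haarTwoRectHorizontal are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Summed-area table with one extra zero row and column, so that the entry at
// (x, y) holds the sum of all grey values in [0, x) x [0, y).
struct IntegralImage {
    int width;
    int height;
    std::vector<long> sat;

    IntegralImage(const std::vector<unsigned char>& grey, int w, int h)
        : width(w), height(h), sat(static_cast<std::size_t>(w + 1) * (h + 1), 0) {
        for (int y = 0; y < h; ++y) {
            long rowSum = 0;
            for (int x = 0; x < w; ++x) {
                rowSum += grey[static_cast<std::size_t>(y) * w + x];
                sat[idx(x + 1, y + 1)] = sat[idx(x + 1, y)] + rowSum;
            }
        }
    }

    std::size_t idx(int x, int y) const {
        return static_cast<std::size_t>(y) * (width + 1) + x;
    }

    // Sum of grey values in [x, x + w) x [y, y + h), using four table lookups.
    long rectSum(int x, int y, int w, int h) const {
        return sat[idx(x + w, y + h)] - sat[idx(x, y + h)]
             - sat[idx(x + w, y)] + sat[idx(x, y)];
    }
};

// Two-rectangle feature: sum of the right half minus sum of the left half.
long haarTwoRectHorizontal(const IntegralImage& ii, int x, int y, int w, int h) {
    int half = w / 2;
    long left = ii.rectSum(x, y, half, h);
    long right = ii.rectSum(x + half, y, half, h);
    return right - left;
}
```

Once the table has been built, every rectangle sum costs four lookups regardless of the rectangle size, and only integer arithmetic is involved.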

The advantage of AdaBoost is that the initial training can be carried out on

an external device, which makes it independent from the mobile phone’s processing

power. Implementations of different versions of this algorithm (namely

AdaBoost.M1, and a more complex version, AdaBoost.M2) in various programming

languages are available online and will be analysed for their portability onto


the Symbian platform. However, as there are no test results for the efficiency of

AdaBoost implementations on mobile platforms (see 2.5), the suitability of this

approach for our project has yet to be determined.

The most promising method that achieved high performance rates on a mobile

platform without the use of boosting is described in [ICS08]. The algorithm

utilizes (max-product) factor graph belief propagation, a method drawn from the

area of machine learning, for figure-ground segmentation in order to infer the

state (figure or ground) of each segment extracted from the image. The belief

propagation algorithm has only low complexity, as the required time “grows only

linearly with the number of nodes in the [graph]” [YFW03] — a clear advantage

for the implementation on a mobile device with a weak processing capacity.

Due to its convincing performance and the small set of training images nec-

essary for recognising objects, we decided to implement the system based on a

factor graph belief propagation method. The simplified max-product version of

this algorithm, as proposed in [SC07], in particular is expected to allow an ef-

ficient implementation. The next section will give a more detailed explanation

of factor graph belief propagation with respect to the task of image processing.

This is completed by a description of the steps necessary for implementing the

algorithm, which will be outlined in section 4.4.

2.7 Factor Graph Belief Propagation

2.7.1 Factor Graphs

Factor graphs are graphical models for the factorisation of global functions as

a product of local functions, which represent the mathematical relation “is an

argument of” between variables and the local functions. Factorisation of a func-

tion is the process of decomposing a global function g(x1, . . . , xn) into smaller

parts, its factors, that have a subset of {x1, . . . xn} as their arguments. The prod-

uct of these factors (or local functions) then again forms the original function.

Generally, the factorisation of a function g is defined as in [KFL01, p. 499]:

$$g(x_1, x_2, \ldots, x_n) = \prod_{j \in J} f_j(X_j) \tag{2.1}$$


This process can be visualised by a bipartite⁶ graph that consists of

• Variable nodes xi, the set of all variable nodes being X = {x1, . . . xn}.

• Factor nodes fj, representing local functions that determine probabilities.

• Undirected edges that represent the relationship fj(Xj). An edge between

variable node xi and factor node fj exists if xi is an argument of fj, i.e. if xi

is an element of Xj, where Xj is a subset of {x1, . . . , xn}.

Figure 2.2 (based on [SC07, p. 4]) shows an example of a factor graph with

four variable nodes w, x, y, z and three factors f, g, h. The factor graph in this

figure shows the joint distribution P (w, x, y, z) = f(w, x, y)g(x, y, z)h(y, z), which

is represented by the edges between the variable nodes and factor nodes.

Figure 2.2: Example of a factor graph with variable nodes w, x, y, z (circles), and factor nodes f, g, h (squares).
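As an illustration, the bipartite structure of the example in Figure 2.2 can be stored as a list of argument indices per factor. The following C++ sketch is one possible representation, not the data structure used in [SC07] or [ICS08]; the names FactorGraph, factorsOf and makeExampleGraph, and the numbering w = 0, x = 1, y = 2, z = 3 and f = 0, g = 1, h = 2, are assumptions made for this example. It uses std::vector for brevity, whereas an on-device implementation might prefer static arrays.

```cpp
#include <cstddef>
#include <vector>

// A factor graph stored as "which variables is each factor a function of".
struct FactorGraph {
    int numVariables;
    std::vector<std::vector<int> > factorArgs; // factorArgs[j] lists the arguments X_j

    // n(x): indices of all factors that have the given variable as an argument.
    std::vector<int> factorsOf(int variable) const {
        std::vector<int> result;
        for (std::size_t j = 0; j < factorArgs.size(); ++j) {
            for (std::size_t k = 0; k < factorArgs[j].size(); ++k) {
                if (factorArgs[j][k] == variable) {
                    result.push_back(static_cast<int>(j));
                }
            }
        }
        return result;
    }
};

// The graph for P(w, x, y, z) = f(w, x, y) g(x, y, z) h(y, z).
FactorGraph makeExampleGraph() {
    FactorGraph graph;
    graph.numVariables = 4;
    graph.factorArgs = { {0, 1, 2}, {1, 2, 3}, {2, 3} };
    return graph;
}
```

Each entry in factorArgs corresponds to the edges of one square node in the figure; the variable y, for instance, is connected to all three factors.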

With respect to the task of image processing, the nodes in a factor graph

correspond to segments of the image which have to be classified into figure (seg-

ments that fulfil the criteria for being part of the object) or ground (i.e. not part

of the object, background). This makes the variable nodes in the graph binary,

as they can have one of two states assigned: xi = 1 (figure) or xi = 0 (ground).

The decision (or “evidence”) whether a segment is more likely to belong to figure

or ground is based on cues that describe the relationship between neighbouring

segments. These cues can be of any arity (such as unary, binary, ternary and so

on) to take into account any number of segments. The evidence again is based on

the statistical differences between figure segments and ground segments, which

are learned from empirical data. Using the evidence from all cues, a factor graph

⁶ Bipartite describes the fact that the nodes can be divided into two sets, with edges only running between nodes from different sets. Here, the two sets are variables and factors.


then represents the joint distribution of each node’s state based on this evidence

[ICS08].

Based on a description given in the aforementioned source, the relationship

between variable nodes and n-ary cues in a factor graph will now be explained in

detail. The objective of this section is to clarify how we can infer the assignment

of each segment xi, that is, how to determine the global state assignments (con-

figuration) X = {x1, . . . , xn} of all segments extracted from the image. Based on

training data, it can be estimated that a certain number of segments is likely to

be in figure state, independently from other segments — this is an a priori belief

or i.i.d.⁷ prior on X, which is defined as:

$$P(X) = \prod_{i=1}^{n} P_i(x_i) \tag{2.2}$$

In detail, we know that Pi(xi = 0) = p0 and Pi(xi = 1) = 1 − p0. This means

that, without considering any relationships between two or more segments, it is

already known that each segment has a probability of p0 of being in state 0 and

1 − p0 of being in state 1. The probability p0 (ranging

from 0 to 1) is determined through training data.

A binary cue Cij describes the relationship between two neighbouring seg-

ments i and j. The relationship between this binary cue and the states of the

two segments it relates is defined as the conditional distribution P (Cij | xi, xj).

Again, this distribution is learned from training data. It can be decomposed into

two distributions Pon and Poff , which describe the likelihood of the segments

belonging to figure (on) or ground (off):

$$P_{on} = P(C_{ij} \mid x_i x_j = 1) \tag{2.3}$$

is the distribution of the cue for both segments in figure state (xixj = 1), and

accordingly

$$P_{off} = P(C_{ij} \mid x_i x_j = 0) \tag{2.4}$$

is the distribution if the product xixj = 0, i.e. at least one of the segments is in ground state.

Evidence whether the pair of segments belongs to figure or ground is then given

by the difference between the two distributions as log(Pon(Cij)/Poff(Cij)).

⁷ Independent and identically distributed. The states of all variables are independent from those of other variables, and each variable has the same probability distribution.


Generally, the set C of all cues Cij can then be related to the set of variable

nodes X through the posterior distribution⁸ P (X | C) which is proportional

to the product of the aforementioned a priori belief from equation 2.2 and the

distribution for binary cues:

$$P(X \mid C) \propto P(X) \prod_{(ij)} P(C_{ij} \mid x_i, x_j) \tag{2.5}$$

Using the two equations 2.3 and 2.4, the product over (ij) can be rewritten

as:

$$\prod_{(ij)} P(C_{ij} \mid x_i, x_j) = \prod_{(ij)} P_{off}(C_{ij}) \prod_{i,j:\, x_i x_j = 1} \frac{P_{on}(C_{ij})}{P_{off}(C_{ij})} \tag{2.6}$$

The second product, $\prod_{i,j:\, x_i x_j = 1}$, is restricted to xixj = 1, which means that only pairs of

segments that are both in figure state are taken into account. In this equation

the product over Poff is independent of X, which is why it can be removed from

the posterior probability when combining equations 2.5 and 2.6:

$$R(X \mid C) = P(X) \prod_{i,j:\, x_i x_j = 1} \frac{P_{on}(C_{ij})}{P_{off}(C_{ij})} \tag{2.7}$$

which is equivalent to

$$\log R(X \mid C) = \sum_{i} \log P_i(x_i) + \sum_{ij} x_i x_j \log \frac{P_{on}(C_{ij})}{P_{off}(C_{ij})} \tag{2.8}$$

Maximizing this expression leads to an estimate for the maximum a posteriori⁹,

or MAP. By using belief propagation on the factor graph, this MAP can be

determined in an efficient way, which will be described in the ensuing section.
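To make equation 2.8 concrete, the following C++ sketch evaluates log R(X | C) for a given configuration and finds the maximising configuration by trying all 2^n assignments. The exhaustive search is used here purely for illustration on a handful of segments; it is exponential in n and is not the method used in the project, which relies on belief propagation. The names PairCue, logR and mapByExhaustiveSearch are hypothetical, and the log-ratios are assumed to have been computed from training data beforehand.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One binary cue between segments i and j, with logRatio = log(Pon(Cij) / Poff(Cij)).
struct PairCue {
    int i;
    int j;
    double logRatio;
};

// Equation 2.8 for one configuration. Bit k of config is the state of segment k
// (1 = figure, 0 = ground); p0 is the prior probability of the ground state.
double logR(unsigned config, int n, double p0, const std::vector<PairCue>& cues) {
    double total = 0.0;
    for (int k = 0; k < n; ++k) {
        bool figure = ((config >> k) & 1u) != 0;
        total += std::log(figure ? 1.0 - p0 : p0);
    }
    for (std::size_t c = 0; c < cues.size(); ++c) {
        bool bothFigure = ((config >> cues[c].i) & 1u) && ((config >> cues[c].j) & 1u);
        if (bothFigure) total += cues[c].logRatio; // the term is weighted by xi * xj
    }
    return total;
}

// Brute-force MAP estimate; only feasible for very small n (n < 32 assumed).
unsigned mapByExhaustiveSearch(int n, double p0, const std::vector<PairCue>& cues) {
    unsigned best = 0;
    double bestScore = logR(0u, n, p0, cues);
    for (unsigned config = 1; config < (1u << n); ++config) {
        double score = logR(config, n, p0, cues);
        if (score > bestScore) {
            bestScore = score;
            best = config;
        }
    }
    return best;
}
```

With three segments, p0 = 0.8 and a strongly supporting cue between segments 0 and 1, the search selects the configuration in which exactly these two segments are in figure state, because the cue evidence outweighs the prior preference for ground.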

Since the method uses more than just one cue, we have to add one term

for each cue to the previous equation. For binary cues, this means that for each

additional cue a term of the form $\sum_{ij} x_i x_j \log \frac{P_{on}(C_{ij})}{P_{off}(C_{ij})}$ is added to equation 2.8, i.e.

the distributions for all cues are multiplied in order to determine the most likely

global assignment of all variables in X. After defining how the distributions for

each cue are computed, we will describe the process of constructing the factors for

⁸ The empirically determined probability, which “summarizes the current state of knowledge about all the uncertain quantities” [Gel02].

⁹ The particular value of X that maximizes the posterior.


each cue, which will be used in the factor graph. This process is carried out step-

by-step beginning with binary cues: For each pair of variable nodes (neighbouring

segments in the image) we determine whether they satisfy the cue and mark them

as candidate factors, which will then be used to determine the candidate factors

for 3-ary cues, which are in turn used to determine the factors for the arity-4

cues.

2.7.2 Belief Propagation on Factor Graphs

After this detailed explanation of how a factor graph is constructed, we will

now illustrate how belief propagation is used to infer the likelihood of a node

in the factor graph to be in a certain state. Belief propagation is a version

of the sum-product algorithm, an algorithm used for message passing on factor

graphs, that calculates the marginal probability (“belief”) for each node. Two

types of messages are used in factor graph belief propagation: Messages sent

from variables to factors, and those sent from factor nodes to variables, with

both types of messages being functions of the variable that is associated with

the edge along which the message is passed. We can explain the basic principle

of message passing through the sum-product algorithm with the following two

equations: The messages sent from variable nodes to factors are given by

$$m_{x \to f}(x) \leftarrow \prod_{h \in n(x) \setminus \{f\}} m_{h \to x}(x) \tag{2.9}$$

Here, n(x) is the set of all factor neighbours of x in the graph. Equation 2.9

expresses that the message sent by a variable node is the product of all messages it

has received from the other factor nodes h, i.e. the variable node simply forwards

the messages¹⁰. The factors here correspond to the local functions defined for the

factor graph, i.e. the probabilities P for every cue based on its parameters.

The messages sent by factor nodes are defined by the product of the factor

itself with all messages sent from the variable nodes it is connected to, which is

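then summed over all variables other than x:

$$m_{f \to x}(x) \leftarrow \sum_{\sim\{x\}} \Big( f(X) \prod_{y \in n(f) \setminus \{x\}} m_{y \to f}(y) \Big) \tag{2.10}$$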

¹⁰ The general approach to describing this process is to treat the graph as a tree and define the message as the product of all messages received from child nodes. When implementing the algorithm, all nodes are treated as child and parent nodes to the nodes they are connected to.


X = n(f) is the set of all arguments of f, and the subscript ∼{x} denotes summation over

all variables except x. The messages are updated until they converge, then the

marginal distribution for a node x is computed as the product of all messages

that are sent to x.

In order to allow an efficient implementation on a platform with low computational

power, the max-product version of belief propagation is used to estimate

the maximum a posteriori. By implementing this version in the log domain (taking

the logarithm of all equations) where all calculations are reduced to addition

and subtraction, efficient computation of the belief is made possible.
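The benefit of the log domain on a device without a floating point unit can be sketched as follows. The example assumes that probabilities are converted to scaled integer logarithms in an offline step (for instance on a desktop PC during training), so that the device itself only needs integer addition. The scale factor of 1024 and the names kLogScale, toFixedLog and addLog are assumptions for this illustration and are not taken from [SC07].

```cpp
#include <cmath>
#include <cstdint>

// Number of integer steps per unit of natural logarithm (assumed value).
const std::int32_t kLogScale = 1 << 10;

// Offline step: convert a probability in (0, 1] to a scaled integer log value.
std::int32_t toFixedLog(double probability) {
    return static_cast<std::int32_t>(std::lround(std::log(probability) * kLogScale));
}

// On-device step: a product of probabilities becomes a sum of integers.
std::int32_t addLog(std::int32_t a, std::int32_t b) {
    return a + b;
}
```

The result of adding two such values agrees with the scaled logarithm of the product up to rounding, which is the price paid for avoiding software-emulated floating point operations.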

The message updating equations in this max-product version are defined by

the following two equations (note the sum and maximum here instead of product

and sum as in 2.9 and 2.10)

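$$m_{x \to f}(x) \leftarrow \sum_{h \in n(x) \setminus \{f\}} m_{h \to x}(x) \tag{2.11}$$

$$m_{f \to x}(x) \leftarrow \max_{\sim\{x\}} \Big( f(X) + \sum_{y \in n(f) \setminus \{x\}} m_{y \to f}(y) \Big) \tag{2.12}$$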

Eventually, the belief function for each node in the graph is calculated as:

$$b(x) = \sum_{f \in n(x)} m_{f \to x}(x) \tag{2.13}$$
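The following C++ sketch illustrates equations 2.12 and 2.13 for binary variables in the log domain, using integers only. It assumes a factor that takes the value kf when all of its arguments are 1 and the value 0 otherwise, with kf being non-negative; under this assumption the maximisation in equation 2.12 reduces to two sums, because a non-negative factor can never make the all-figure assignment worse. This closed form is a derivation made for this example rather than the implementation of [SC07], and the names Message, factorToVariable and belief are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A log-domain message about a binary variable: one value per state.
struct Message {
    int ground; // value for state 0
    int figure; // value for state 1
};

// Factor-to-variable update (equation 2.12) for a factor that equals kf >= 0
// when all of its arguments are 1 and 0 otherwise. `incoming` holds the
// messages from the other variables connected to the factor.
Message factorToVariable(int kf, const std::vector<Message>& incoming) {
    int sumBest = 0;   // sum over y of max(m_y(0), m_y(1))
    int sumFigure = 0; // sum over y of m_y(1)
    for (std::size_t k = 0; k < incoming.size(); ++k) {
        sumBest += std::max(incoming[k].ground, incoming[k].figure);
        sumFigure += incoming[k].figure;
    }
    Message out;
    // x = 0: the factor contributes 0 whatever the other variables do.
    out.ground = sumBest;
    // x = 1: either all others are figure (factor adds kf) or the factor adds 0.
    out.figure = std::max(kf + sumFigure, sumBest);
    return out;
}

// Belief of a variable (equation 2.13): sum of all incoming factor messages.
Message belief(const std::vector<Message>& fromFactors) {
    Message b = {0, 0};
    for (std::size_t k = 0; k < fromFactors.size(); ++k) {
        b.ground += fromFactors[k].ground;
        b.figure += fromFactors[k].figure;
    }
    return b;
}
```

A variable would then be assigned to figure if its belief for state 1 exceeds its belief for state 0; any additional decision threshold would have to be chosen from training data.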

In this framework, each factor f(x1, . . . , xm) in the factor graph is only non-zero

if all of its parameters {x1, . . . , xm} are 1. As suggested by [SC07], a non-negativity

requirement is introduced in order to reduce the computational complexity

of the method: Kf = f(x1 = 1, . . . , xm = 1) ≥ f(x1 = 0, . . . , xm = 0) = 0,

that is, all factors have to be greater than or equal to zero.

Adding this to equation 2.13, the belief for each node is then computed as

bx(x = 1) = $\sum_{f \in n(x)} K_f$ and bx(x = 0) = 0, which then leads to the final equation

for the beliefs of all nodes:

$$B_x = \sum_{f \in n(x)} K_f \tag{2.14}$$
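Equation 2.14 amounts to a single pass over all factors, which the following C++ sketch illustrates. It assumes that the constants Kf are available as non-negative scaled integers; the names Factor and computeBeliefs are hypothetical, and std::vector is used for brevity although static arrays would serve the same purpose.

```cpp
#include <cstddef>
#include <vector>

// A factor with the indices of its argument variables and its constant K_f >= 0.
struct Factor {
    std::vector<int> args;
    int k;
};

// Equation 2.14: the belief of each variable is the sum of K_f over all
// factors f that the variable is an argument of.
std::vector<int> computeBeliefs(int numVariables, const std::vector<Factor>& factors) {
    std::vector<int> beliefs(numVariables, 0);
    for (std::size_t j = 0; j < factors.size(); ++j) {
        for (std::size_t a = 0; a < factors[j].args.size(); ++a) {
            beliefs[factors[j].args[a]] += factors[j].k;
        }
    }
    return beliefs;
}
```

The cost of this pass grows linearly with the total number of edges in the graph, and only integer additions are required.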

Finally, with respect to the implementation of this algorithm, the notion of

scheduled message passing has to be explained. It is assumed that the sending and

receiving of messages is organised by a schedule (such as a timer) that specifies the


way messages are passed. This schedule can be synchronous (flooding schedule),

which means that all messages are updated at the same time, or asynchronous

(serial), where only one message is updated at a time. Usually, several runs

(sweeps) of the non-simplified message passing algorithm, each of which sends

all messages once, have to be performed before it converges.

2.8 Chapter Summary

This chapter discussed the background and foundations regarding the given task

of image processing on a mobile platform and described the notion of assistive

technology for visually impaired people. Different types of smartphones were

analysed, which led to the decision to develop the recognition system for Symbian

OS, using its native programming language Symbian C++. As shown in this

chapter, there exists a wide range of applications that deal with image processing

on mobile phones, with different strategies for both the application structure

(server-client, stand-alone) and the algorithms used for the processing task. After

explaining factor graph belief propagation, we will now take a closer look at the

application design and discuss how these methods will be integrated into the

recognition system.

Chapter 3

Application Design

3.1 Overview

In this chapter, we will outline and discuss the preliminary considerations that

have to be made before implementing the application. In the first section, the

functional and non-functional requirements for the recognition system will be

listed, which is then followed by a detailed description of the software architecture.

This includes an explanation of the different parts a Symbian OS application

comprises, as well as an overview of the program’s structural and behavioural

organisation using UML diagrams. Finally, we will highlight details of the soft-

ware such as the algorithms that will be used and give a short description of

images necessary for training the system. The main objective of the chapter is to

provide the reader with a clear idea of all the tasks that will be carried out from

a high-level perspective. Detailed descriptions of the actual implementation will

then be discussed in the ensuing chapter.

3.2 Requirements Analysis

In order to describe the necessary functionalities of the application and assist the

evaluation process, the requirements that have to be met by the software and its

user interface will be defined in this section. They are organised into two groups:

The first part lists functional requirements which describe the behaviour of the

system, i.e. what the application does. The second group are non-functional

requirements that describe how these tasks will be performed by the application.


3.2.1 Functional Requirements

• The program detects BS 5499-4 emergency exit signs that are not lit up

internally (must-have)

• A detected object is indicated by a sound (must-have)

• Interactions with the software are confirmed to the user through text output

(must-have)

• The capturing process begins automatically when starting up the software

(should-have)

• Capturing the image and repeating the process takes one click and is re-

peated automatically if no object is detected, while the user is panning the

phone (should-have¹)

• If present, the direction of the arrow (left, right, up, down) on the sign is

output as text (could-have)

• If present, any text on the sign (such as “Fire Exit” or “Exit”) is read out

by the system

• The software outputs information on the distance of the sign from the user,

based on the camera lens specifications and the size of the detected sign

(could-have)

3.2.2 Non-Functional Requirements

• The execution time for one image lies in a time frame that is acceptable for

the user, e.g. less than 2 seconds (must-have)

• The software works in various lighting conditions, outside a well lit envi-

ronment (must-have)

• The application works correctly and does not show any unexpected be-

haviour or lead to system errors, such as program crashes (must-have)

¹ Making the application as comfortable to use as possible is clearly an important objective, which would make this item a definite “must”. However, if the automated capturing proves to be too computationally demanding, we can abandon this feature without compromising the initial idea behind the project.


• The interface is accessible by screen reader software, i.e. all menu items

and outputs can be read out to the user (must-have)

• The software does not have any complicated menus or graphical elements

(must-have)

• Starting the recognition process must require as few steps as possible, ideally

beginning automatically on program start-up (should-have)

• The software is able to perform the recognition process automatically in

real-time, i.e. several frames per second (could-have)

3.3 Software Architecture

In this section we will outline the basic program structure using descriptions

in both written form and UML diagrams. The software will be organised into

several modules that deal with the different program tasks. In more detail, the

basic structure for any Symbian v9.2 application comprises five classes² that

are necessary to start up the program and draw the screen:

Main The first object called by the OS when starting up an application, creates

a new application object and runs it

Application Creates a new document object and returns a pointer to it

Document Creates the application user interface (AppUi) object

AppUi The application user interface handles all interactions such as pushing

a key or selecting a menu item. It creates an AppView object (or multiple

views) which is used for screen access. Here, it also creates a new Main-

Controller object which coordinates the image capturing and processing

AppView The application view draws the screen to make information visible to

the user

To this skeleton, we add more classes for the image processing task:

² The naming conventions for Symbian require that all class names start with Cxx, xx being the application name. In this case, all classes have the prefix CMSP.


MainController Coordinates all image capturing and processing operations and

returns the results to be displayed and read out by the UI, which passes

them on to the AppView object

ImgCaptureEngine Fetches an image by accessing Symbian’s camera API and

returns it to the controller.

ImgProcessor Takes the image provided by the previous module and performs

pre-processing operations on it, then determines the presence of a sign and

returns the results (sign present: yes / no, arrow direction) to the controller

CFactorGraph Constructs the factor graph object based on the image segments

and the cues defined in the next section

CFactor Class for factor nodes in the graph

CBeliefPropapation Performs belief propagation on the factor graph

The structure shows that the modules can be designed to interact over clearly

defined interfaces, which is crucial for the development process, as it will simplify

separate implementation of individual parts and composition at a later stage.

This will also help to optimise the performance for each module and to carry out

precise testing. The complete organisation of the image processing system, i.e.

the Symbian skeleton and the classes created for the processing task, is shown in

figure 3.1. It has to be noted that, due to the complexity of the classes, only the

most important member data and methods are displayed in the class diagram.

The ImgCaptureEngine class in particular shows an important feature of Sym-

bian OS applications. The class is derived from CActive, which makes it an

“Active Object”, a framework that allows for asynchronous programming. This

construct consists of the Active Object in the form of a class derived from

CActive, and an Active Scheduler which is provided by the Symbian application

architecture. Using an Active Object makes it possible to manage asynchronous

functions, which means that a function returns immediately after calling it with-

out waiting for further tasks to be executed. This is particularly useful for the

task of capturing continuous frames from a camera: The system issues a request

to capture a frame from the camera, which is done asynchronously while other

tasks can be performed. Once the frame has been captured, the camera object

issues a callback to its observer, which then initiates processing of the image.


Figure 3.1: Class diagram showing the organisation of the application classes

Figure 3.2 shows a state diagram of the basic processes that are executed when

running the software, which will give a more detailed insight into the coordination

of the system’s components and functionalities. The application can be closed

from every state by using the “Exit” menu option, or simply pressing the “hang

up” key on the phone.


Figure 3.2: State diagram for the emergency exit sign recognition software

3.4 Image Processing Methods and Algorithms

An obvious approach to the task of recognising emergency exit signs would be to

base the object detection solely on the image’s colour values. All escape route

signs according to BS 5499-4 show white icons on a plain green background, which

could make it easy to search for a rectangular green sign in the image. [FZB+05]

achieved the highest and most efficient recognition rates by analysing the object’s

colour. However, due to the varying (and previously unknown) lighting conditions

and different phone camera properties, this method is not expected to yield

adequate results for our project. In addition to other methods, analysing colour

features could improve the recognition rate, which will be examined during the

course of the project.

3.4.1 Edge Detection

The first phase of the processing will consist of different pre-processing tasks,

such as converting a colour image into greyscale, using an edge detection method

to extract edges from the image and thresholding the results to produce a binary

edge map. An edge map provides information about the estimated location of

edges (i.e. region boundaries or contours) in an image, which are defined as

changes in the image intensity. A drastic change in the intensity indicates a

clear edge (for example, the border of a black object on a white background),

whereas a more gradual change hints at a more blurred or softer edge. We decided

to use the Sobel filter for edge extraction, which can be easily and efficiently

implemented using integer operations (multiplication and addition). Other edge

detection methods such as the Laplacian were considered for this task, but deemed

unsuitable due to the high sensitivity to noise. The Sobel filter uses the two 3x3

kernels shown in figure 3.3 for computing approximations of the horizontal and

vertical derivatives of each pixel.

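$$d_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix} \qquad d_y = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix}$$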

Figure 3.3: Sobel kernels used for horizontal and vertical derivatives
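The pre-processing steps described above (applying the two kernels and thresholding the result) can be sketched in C++ using integer operations only. The sketch assumes an 8-bit greyscale image stored row by row, approximates the gradient magnitude by |dx| + |dy| to avoid a square root, and leaves the one-pixel border at zero; these choices, the threshold value and the names sobelMagnitude and thresholdEdges are assumptions for this illustration.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Approximate gradient magnitude |dx| + |dy| for every interior pixel.
std::vector<int> sobelMagnitude(const std::vector<unsigned char>& grey, int w, int h) {
    std::vector<int> mag(static_cast<std::size_t>(w) * h, 0);
    for (int y = 1; y + 1 < h; ++y) {
        for (int x = 1; x + 1 < w; ++x) {
            // p(ox, oy) reads the neighbour at horizontal offset ox, vertical offset oy.
            auto p = [&](int ox, int oy) {
                return static_cast<int>(grey[static_cast<std::size_t>(y + oy) * w + (x + ox)]);
            };
            int dx = -p(-1, -1) + p(1, -1)
                     - 2 * p(-1, 0) + 2 * p(1, 0)
                     - p(-1, 1) + p(1, 1);
            int dy = p(-1, -1) + 2 * p(0, -1) + p(1, -1)
                     - p(-1, 1) - 2 * p(0, 1) - p(1, 1);
            mag[static_cast<std::size_t>(y) * w + x] = std::abs(dx) + std::abs(dy);
        }
    }
    return mag;
}

// Binary edge map: 1 where the magnitude reaches the threshold, 0 elsewhere.
std::vector<unsigned char> thresholdEdges(const std::vector<int>& mag, int threshold) {
    std::vector<unsigned char> edges(mag.size(), 0);
    for (std::size_t i = 0; i < mag.size(); ++i) {
        if (mag[i] >= threshold) edges[i] = 1;
    }
    return edges;
}
```

For an image with a vertical step from grey value 0 to 100, the pixels on both sides of the step receive the magnitude 400, which shows why a subsequent thinning step is needed.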

In order to reduce the edges that are extracted by the Sobel operator to

thinner, 1-pixel-wide lines, non-maximum suppression³ is performed on the image. The

idea of this operation is to reduce the visible pixels to the ones that are local

maxima in their neighbourhood which is given by their gradient direction (the

normal to the edge direction). Only if the intensity of a pixel is greater than

the intensities of its neighbouring pixels along the gradient direction can it be

considered a local maximum. This thinning operation is an important step in an

³ The term is used in different variations such as non-maximal and non-maxima suppression.


edge detector in order to ensure locality of the edge, which means that the edge

is detected exactly at its location.
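The thinning step described above can be sketched as follows. This is a minimal illustration rather than the project's implementation: the function name `SuppressNonMaxima` is hypothetical, the gradient magnitude is approximated by |dx| + |dy|, the derivatives are assumed to point in the direction of increasing x and y, and the gradient direction is only coarsely quantised instead of interpolated.

```cpp
#include <cstdlib>
#include <vector>

// Keeps a pixel only if its gradient magnitude is a strict local maximum
// along the (coarsely quantised) gradient direction. dx and dy are row-major
// derivative images of size width * height; border pixels are set to 0.
// Assumption: dx and dy are derivatives towards increasing x and y.
std::vector<int> SuppressNonMaxima(const std::vector<int>& dx,
                                   const std::vector<int>& dy,
                                   int width, int height)
{
    std::vector<int> out(dx.size(), 0);
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            const int i = y * width + x;
            const int gx = dx[i], gy = dy[i];
            const int ax = std::abs(gx), ay = std::abs(gy);
            const int mag = ax + ay;
            if (mag == 0) continue;
            // Choose the neighbour offset closest to the gradient direction.
            int ox = 0, oy = 0;
            if (2 * ay <= ax)      { ox = 1; }   // mostly horizontal gradient
            else if (2 * ax <= ay) { oy = 1; }   // mostly vertical gradient
            else { ox = 1; oy = ((gx > 0) == (gy > 0)) ? 1 : -1; }  // diagonal
            const int a = (y + oy) * width + (x + ox);
            const int b = (y - oy) * width + (x - ox);
            const int magA = std::abs(dx[a]) + std::abs(dy[a]);
            const int magB = std::abs(dx[b]) + std::abs(dy[b]);
            if (mag > magA && mag > magB) out[i] = mag;
        }
    }
    return out;
}
```

For a blurred vertical edge whose horizontal derivative rises and falls across three columns, only the centre column survives, which is the 1-pixel-wide line the text asks for.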

3.4.2 Extracting Straight Line Segments

The next step consists of detecting any rectangular object in the image that has

the approximate dimensions of an emergency exit sign, regardless of the actual

content. The Hough transform has been considered for this process, given its

suitability for finding straight lines in an image. However, it was not possible to

find a simplified or approximated version of the algorithm that could be used for

efficient implementation without floating point operations. This would cause a

slowdown of the processing speed, which is clearly not desirable for the project.

First, we need to extract straight line segments, which are then used as a

starting point for detecting a rectangular structure in the image. This is achieved

using a greedy bottom-up grouping procedure as suggested in [SC07]. The image

is checked for vertical and horizontal edges separately. For the detection of hori-

zontal lines, the method groups edge pixels that are already connected and form

an approximately horizontal line into smaller segments. Small gaps between these

segments are then filled if the segments are neighbours (within a certain region)

and have roughly the same orientation. Those segments that are shorter than

20 pixels are then removed from the set, which eventually contains all horizontal

straight line segments, represented by their start- and end points.
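The gap filling and length filtering for horizontal segments might look like the sketch below. It is an illustration under simplifying assumptions, not the procedure of [SC07]: the type `HSegment`, the function `MergeAndFilter` and the tolerances `maxGap` and `maxOffset` are hypothetical, and "roughly the same orientation" is reduced to a small vertical offset between the end of one segment and the start of the next.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct HSegment { int x0, y0, x1, y1; };  // start (left) and end (right) point

// Joins neighbouring, roughly collinear horizontal segments and removes
// those shorter than minLength. maxGap and maxOffset are hypothetical
// tolerances (in pixels) for the gap in x and the offset in y.
std::vector<HSegment> MergeAndFilter(std::vector<HSegment> segs,
                                     int maxGap, int maxOffset, int minLength)
{
    std::sort(segs.begin(), segs.end(),
              [](const HSegment& a, const HSegment& b) { return a.x0 < b.x0; });
    std::vector<HSegment> merged;
    std::vector<bool> used(segs.size(), false);
    for (std::size_t i = 0; i < segs.size(); ++i) {
        if (used[i]) continue;
        HSegment cur = segs[i];
        for (std::size_t j = i + 1; j < segs.size(); ++j) {
            if (used[j]) continue;
            const int gap = segs[j].x0 - cur.x1;
            const int offset = std::abs(segs[j].y0 - cur.y1);
            if (gap >= 0 && gap <= maxGap && offset <= maxOffset) {
                cur.x1 = segs[j].x1;   // extend the current segment to the right
                cur.y1 = segs[j].y1;
                used[j] = true;
            }
        }
        if (cur.x1 - cur.x0 >= minLength) merged.push_back(cur);
    }
    return merged;
}
```

With `minLength` set to 20, this reproduces the filtering threshold mentioned in the text; the two tolerances would have to be tuned on real edge maps.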

3.4.3 Detecting Rectangular Shapes

Suggestion for a Simplified Method

Once the straight line segments have been detected, the system needs to determine

whether there is a rectangle (i.e. a quadrilateral shape that has roughly parallel

opposing sides) present in the frame. It can be argued that a straightforward way

of carrying out this task would be to simply check for overlapping (or nearly over-

lapping) start- and end points of horizontal and vertical segments. The following

conditions have to be satisfied for the shape to qualify as a candidate rectangle:

• Take horizontal segment A with start point SA = (xsA, ysA) and end point

EA = (xeA, yeA)


• The coordinates of SA are within the neighbourhood (minimal difference

in x- and y-direction) of the start point SB of a vertical segment B with a

length shorter than A

• The coordinates of EA are within the neighbourhood of the start point SD

of a vertical segment D shorter than A and roughly the same length as B

• B and D have opposite polarity, i.e. the gradient direction of B is the

inverse of D’s gradient direction

• B’s orientation is orthogonal to A’s (roughly orthogonal that is, within only

a few degrees)

• D’s orientation is (roughly) orthogonal to A’s

• B and D are roughly parallel, i.e. the difference between their gradient

orientations is minimal

• The coordinates of EB are within the neighbourhood of the start point SC

of a horizontal segment C

• The length of this segment C is similar to the length of A

• The coordinates of ED are within the neighbourhood of the end point EC

of the same horizontal segment C

• C’s orientation is roughly orthogonal to B’s and D’s, and roughly parallel

to A’s

• A and C have opposite polarity, i.e. the gradient direction of A is the

inverse of C’s gradient direction
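The corner and length conditions in the list above can be checked with a few integer comparisons. The following is a sketch only: the types `Point` and `Side`, the function names and the single tolerance `tol` are hypothetical, and the polarity and orientation tests are left out for brevity.

```cpp
#include <cstdlib>

struct Point { int x, y; };
struct Side { Point start, end; };  // left-to-right or top-to-bottom

// True if two points differ by at most tol pixels in both x and y.
bool Near(const Point& p, const Point& q, int tol)
{
    return std::abs(p.x - q.x) <= tol && std::abs(p.y - q.y) <= tol;
}

int LengthX(const Side& s) { return s.end.x - s.start.x; }
int LengthY(const Side& s) { return s.end.y - s.start.y; }

// A = top, C = bottom (horizontal); B = left, D = right (vertical).
// Checks the corner and length conditions only; polarity and orientation
// tests are omitted in this sketch.
bool IsCandidateRectangle(const Side& A, const Side& B,
                          const Side& C, const Side& D, int tol)
{
    const bool corners = Near(A.start, B.start, tol) && Near(A.end, D.start, tol)
                      && Near(B.end, C.start, tol) && Near(D.end, C.end, tol);
    const bool lengths = LengthY(B) < LengthX(A) && LengthY(D) < LengthX(A)
                      && std::abs(LengthY(B) - LengthY(D)) <= tol
                      && std::abs(LengthX(A) - LengthX(C)) <= tol;
    return corners && lengths;
}
```

Using one tolerance for both the corner distance and the length difference keeps the sketch short; separate thresholds may well be more appropriate in practice.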

Please note that, in order to simplify the construction, start- and end points

of segments are classified in a left-to-right (for horizontal segments) and top-

to-bottom (vertical segments) manner respectively, regardless of their polarity.

With respect to perspective bias that affects the length of the segments (which,

ideally, would be pairwise of equal length) it can be assumed that exit signs are

approximately at eye level (or slightly above), which means that the perspective

distortion is expected to be minimal. As for the length of the segments: On

average, an emergency exit sign that contains all three icons (the

word “Exit”, a running person and an arrow) has a width to height ratio of 2.8,

i.e. the horizontal segments are nearly three times as long as the vertical ones.

Assuming that the camera is close enough for the sign to fill out the full image

width of 320 pixels (which is very unlikely), this means that the vertical segments

are at most 118 pixels long.

It also needs to be mentioned that some of these conditions can be omitted

as they are the consequence of other conditions. For example, if two vertical

segments of the same length begin at the start- and end points of a horizontal

segment, then the horizontal segment at the bottom of the sign that connects

their end points must be approximately the same length as the top segment

(again, assuming that the perspective bias is minimal).

Factor Graph Belief Propagation

While the simplified method described in the previous paragraph seems easy to

implement, it is the objective of this study to review more sophisticated ap-

proaches to the task of object recognition that are based on inferences rather

than basic image processing on a pixel level. This is why we will now describe

a solution based on the factor graph belief propagation method as explained in

the previous chapter. This will allow us to analyse the segment groups with re-

spect to multiple cues and perform rapid inference on them in order to determine

whether a segment is part of a rectangle or not.

Using the straight line segments extracted in the edge detection step, the

factor graph is constructed with each line segment being a node variable in the

graph. Based on the specification of factor graphs, the cues that the factor graph

uses for this task will now be described in detail. It has to be noted that due

to the characteristics of a rectangular sign, both horizontal and vertical straight

line segments have to be analysed. In order to simplify this, the horizontal and

vertical segments will be first checked individually, then the candidates will be

combined to look for matching 4-tuples (that is, one vertical and one horizontal

pair). For all cues, the distributions Pon and Poff (as explained in section 2.7.1)

are determined based on training images.

Unitary cues make a statement about single segments, regardless of their

relationship with other segments. We will only use one unitary cue:

• Segment length: On average, the horizontal straight line segments at the

top and bottom of the sign are long compared to other straight lines in the


image, whereas the vertical lines are relatively short (as previously men-

tioned).

The binary cues describe a relationship between two neighbouring segments

(two nodes) with opposite polarity:

• Parallelism: The absolute difference between the orientations of

two neighbouring segments is minimal.

• Proximity: For horizontal pairs, the distance between the two segments is

usually relatively small (approximately one third of the segment length),

whereas the distance between vertical segments is the inverse of this, i.e.

roughly three times the length of the segments.

• Overlapping: The difference between start- and end points of horizontal /

vertical pairs is within a certain limit.

The arity-4 cues take into account 4 straight line segments, that is, one hori-

zontal pair and one vertical pair:

• Orientation: The average orientation of the horizontal pair is orthogonal to

the orientation of the vertical pair.

• Corner points: The differences between the coordinates of start- and end

points of the four segments are minimal.

• Width to height ratio: As previously mentioned, the ratio of horizontal

segment length to vertical segment length should be in the region of 2.8.

After defining the cues that will be used to describe the relationships between

line segments, the factor graph has to be constructed. Each cue corresponds to

the factor (a local function) of a global function that describes the likelihood of

a segment to be part of a rectangular shape. The first two steps are carried out

individually for horizontal and vertical lines, with candidate factors being gener-

ated for every segment or segment pair that meets the requirements. Beginning

with the unitary cue, only segments of sufficient length are considered candidates

for the next step, and a unitary factor is constructed. Determined by the binary

cues for parallelism, proximity and overlapping, all pairs of segments (with op-

posite polarity) that satisfy the criteria for those cues are selected as candidates.


This is followed by combining vertical and horizontal candidate pairs to check

them for the arity-4 cues.

The final decision whether a node variable is in figure state (xi = 1) is then

based only on its belief Bx. If Bx is sufficiently large, the node will be assigned

figure state; if not, it will be set to ground (xi = 0). The simplified version

explained in the previous chapter suggests that no message updates are necessary

for an approximated result, which means that the belief propagation will already

converge after one run.
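One possible reading of this single-pass decision is sketched below. It rests on assumptions that are not taken from the text: each cue is assumed to contribute a log-likelihood ratio log(Pon/Poff), pre-scaled to an integer in a lookup table built from the training images, and the belief is taken to be the plain sum of these contributions. The names `LogRatio`, `LocalBelief` and `AssignState` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Log-likelihood ratio log(Pon/Poff) of one cue, pre-scaled to an integer
// (hypothetical fixed-point value taken from a lookup table built offline).
typedef int LogRatio;

// Sums the evidence of all cues attached to one segment. Without message
// updates, this local sum is used directly as the (log-domain) belief.
int LocalBelief(const std::vector<LogRatio>& cueEvidence)
{
    int belief = 0;
    for (std::size_t i = 0; i < cueEvidence.size(); ++i) belief += cueEvidence[i];
    return belief;
}

// Returns 1 (figure) if the belief exceeds the threshold, otherwise 0 (ground).
int AssignState(const std::vector<LogRatio>& cueEvidence, int threshold)
{
    return LocalBelief(cueEvidence) > threshold ? 1 : 0;
}
```

Working in the log domain turns the product of factor values into a sum of integers, which fits the aim of avoiding floating point operations on the handset.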

3.4.4 Analysis of the Sign’s Content

If the decision is made that a rectangle is indeed present in the image, a sound is

output in order to notify the user of this status. An image is then captured with

a higher resolution and several pre-processing tasks are carried out to prepare

the final analysis of the sign’s characteristics. As we know the coordinates of

the start and end points of the four rectangle sides, it is possible to enlarge the

section containing the rectangle. In order to calculate the coordinates for the

larger image (640x480 pixels), the coordinates obtained from the smaller image

(320x240 pixels) simply have to be doubled.

Assuming that the rectangle is not immediately surrounded by any clutter, the

perspective distortion and rotation are minimal, and that the background colour

is different from the green sign (which is necessary to achieve a high contrast so

that the signs can be clearly seen inside buildings), simply clipping the image to

its bounding box with sides parallel to the image borders would be sufficient to

isolate the sign from its surroundings. However, in order to achieve an accurate

result for the ratio of sign background and icon pixels, the image needs to be

freed from any perspective bias, rotation and background clutter. This can be

achieved by projecting the presumably skewed and rotated quadrilateral shape

onto a rectangle with parallel sides. This projective transformation then maps

the points from the distorted image to the corresponding points in the rectangle.

Using the four corner points of the distorted image and the target rectangle as

reference points, a matrix for the transformation can be constructed as described


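in [Blo]. In this case, this projective transformation is defined as

$$
\begin{pmatrix} u & v & w \end{pmatrix} =
\begin{pmatrix} x & y & 1 \end{pmatrix}
\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & 1 \end{pmatrix}
\qquad (3.1)
$$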

Here, the matrix A is a non-singular (invertible) homogeneous transformation

matrix with eight degrees of freedom. The coordinates (x', y') of the mapped

point are given by

$$
x' = \frac{u}{w} = \frac{a_{11}x + a_{21}y + a_{31}}{a_{13}x + a_{23}y + 1}, \qquad
y' = \frac{v}{w} = \frac{a_{12}x + a_{22}y + a_{32}}{a_{13}x + a_{23}y + 1}
\qquad (3.2)
$$

This leads to a linear system with eight unknown coefficients a11 . . . a32 that is

solved using the point pair coordinates in order to determine the transformation

matrix.
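To illustrate how the eight coefficients follow from four point pairs, the sketch below builds the linear system from equation (3.2) and solves it by Gaussian elimination with partial pivoting. It is not the code from [Blo]: the names `Pt`, `SolveHomography` and `MapPoint` are hypothetical, and double precision is used for readability, whereas an implementation for the handset would replace it with integer or fixed point arithmetic.

```cpp
#include <cmath>
#include <utility>

struct Pt { double x, y; };

// Solves for a11, a21, a31, a12, a22, a32, a13, a23 (in this order) such that
// each src[i] maps to dst[i] under equation (3.2). Returns false if the
// system is (numerically) singular.
bool SolveHomography(const Pt src[4], const Pt dst[4], double a[8])
{
    double m[8][9] = {};
    for (int i = 0; i < 4; ++i) {
        const double x = src[i].x, y = src[i].y;
        const double u = dst[i].x, v = dst[i].y;
        // a11*x + a21*y + a31 - a13*x*u - a23*y*u = u
        const double r0[9] = { x, y, 1, 0, 0, 0, -x * u, -y * u, u };
        // a12*x + a22*y + a32 - a13*x*v - a23*y*v = v
        const double r1[9] = { 0, 0, 0, x, y, 1, -x * v, -y * v, v };
        for (int c = 0; c < 9; ++c) { m[2 * i][c] = r0[c]; m[2 * i + 1][c] = r1[c]; }
    }
    // Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    for (int col = 0; col < 8; ++col) {
        int pivot = col;
        for (int r = col + 1; r < 8; ++r)
            if (std::fabs(m[r][col]) > std::fabs(m[pivot][col])) pivot = r;
        if (std::fabs(m[pivot][col]) < 1e-12) return false;
        for (int c = 0; c < 9; ++c) std::swap(m[col][c], m[pivot][c]);
        for (int r = 0; r < 8; ++r) {
            if (r == col) continue;
            const double f = m[r][col] / m[col][col];
            for (int c = col; c < 9; ++c) m[r][c] -= f * m[col][c];
        }
    }
    for (int i = 0; i < 8; ++i) a[i] = m[i][8] / m[i][i];
    return true;
}

// Applies equation (3.2) to a single point.
Pt MapPoint(const double a[8], Pt p)
{
    const double w = a[6] * p.x + a[7] * p.y + 1.0;
    return { (a[0] * p.x + a[1] * p.y + a[2]) / w,
             (a[3] * p.x + a[4] * p.y + a[5]) / w };
}
```

Passing the corners of the target rectangle as `src` and the corners of the detected quadrilateral as `dst` yields the mapping from each output pixel back to its source pixel, which avoids holes in the rectified image.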

The next step of the detection phase is then based on the histogram of the

rectangular shape that was extracted from the large image. Knowing that there

are only two colours present in the image, its histogram can now be examined. If

the ratio of green (i.e. dark grey in the greyscaled image) and white pixels corre-

sponds to the usual ratio found in emergency exit signs, the rectangle is marked

as candidate for a sign and the final outcome is determined by a matching pro-

cedure. This quick check reduces the overall computational costs by discarding

all rectangular shapes that are highly unlikely to be emergency exit signs. Oth-

erwise, the more expensive template matching procedure would be performed in

vain.
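The histogram check could be as simple as counting dark pixels in the rectified greyscale patch. In the sketch below, the function name `HasSignLikeHistogram`, the threshold separating dark from light pixels and the accepted percentage range are all hypothetical; suitable values would have to be measured on real signs.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if the share of dark (sign background) pixels in a greyscale
// patch lies within [minPercent, maxPercent]. The threshold separating dark
// from light and the percentage bounds are hypothetical values.
bool HasSignLikeHistogram(const std::vector<std::uint8_t>& grey,
                          std::uint8_t threshold, int minPercent, int maxPercent)
{
    if (grey.empty()) return false;
    long dark = 0;
    for (std::size_t i = 0; i < grey.size(); ++i)
        if (grey[i] < threshold) ++dark;
    const long percent = dark * 100 / static_cast<long>(grey.size());
    return percent >= minPercent && percent <= maxPercent;
}
```

A single pass over the patch with integer arithmetic is cheap compared with matching eight templates, which is the saving the paragraph above describes.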

For a number of reasons, it was decided to reduce the step of confirming the

presence of an emergency exit sign and recognising the direction of any arrow to a

simple pixel matching method: Firstly, and most importantly, we are dealing with

standardised signs that differ only in the direction of the arrow and, accordingly,

the orientation of the “running” icon. Arrows for the directions from -90° to 90° are placed

on the right, while the directions top left, left and bottom left are located on the left

hand side of the sign (as explained in section 1.4). Secondly, the location of

the sign is known, defined by its corner points. Thirdly, the image has already

been freed from perspective and rotational bias in the previous step. And finally,

the number of different arrow directions is reasonably low (eight: the four main

directions plus the diagonals), which means that in the worst case the entire sign


has to be checked only eight times⁴. While the operation is not expected to be

the most efficient method, it is straightforward to implement and does not

cause any complex overhead.
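A simple pixel matching step of this kind might be sketched as follows. The representation (one byte per pixel, 0 or 1), the names `CountMatches` and `BestTemplate` and the acceptance threshold `minMatches` are assumptions made for illustration; the rectified sign and the templates are assumed to have the same size.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<std::uint8_t> BinaryImage;  // one byte per pixel, 0 or 1

// Counts the pixels on which two equally sized binary images agree.
int CountMatches(const BinaryImage& a, const BinaryImage& b)
{
    int matches = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        if (a[i] == b[i]) ++matches;
    return matches;
}

// Returns the index of the best matching template, or -1 if no template
// reaches minMatches (a hypothetical acceptance threshold).
int BestTemplate(const BinaryImage& sign,
                 const std::vector<BinaryImage>& templates, int minMatches)
{
    int best = -1, bestScore = minMatches - 1;
    for (std::size_t t = 0; t < templates.size(); ++t) {
        const int score = CountMatches(sign, templates[t]);
        if (score > bestScore) { bestScore = score; best = static_cast<int>(t); }
    }
    return best;
}
```

With eight templates, one per arrow direction, the returned index identifies the direction, and a result of -1 rejects the rectangle as a sign.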

3.5 Training Images

The number of training images necessary for training the system varies heavily

depending on the algorithm that is used for classification. AdaBoost performs well

using a large set of training images ([PTAE09] mentions up to 10,000 negative and

500 positive samples from an existing database). The number of training images

that are necessary for factor graph belief propagation is relatively low: [ICS08]

uses 25 positive and negative images each, which still yields high recognition

rates and seems more feasible. In order to determine the distributions for the

different cues, training images are captured with a mobile phone camera and

then labelled manually. It is particularly important to pay attention to images

that could cause false positives due to their similarity to emergency exit signs,

were taken under difficult lighting conditions, contain perspective bias or show partly

occluded signs. Examples of images taken with a mobile phone camera are shown

in figure 3.4. Starting with the top left image, these pictures show some of the most

common problems with camera phones. The images also point out some variations of

the standardised emergency exit signs that will not be recognised by the system,

such as the text “Fire Exit” on the sign instead of “Exit” as specified by BS

5499-4, the signs of which we will be using as sample templates:

• The angle between camera and exit sign is wide, lights cause reflections on

the sign.

• Blurring due to fast camera movement. Here, the text on the sign is “FIRE

EXIT” in capital letters, which is expected to cause problems when applying

a template matching strategy.

• The distance between the camera and the sign is very far, the sign appears

small and blurred.

• The sign is placed next to lit signs, which also causes reflections and overex-

posure. In this picture, the sign is also made up of two plates (arrow on the

4It has to be noted that there are no clear definitions for the actual de


left, icon and text on the right) and will not be recognized by our method.

Figure 3.4: Four examples of emergency exit signs captured with a phone camera

3.6 Issues Affecting the System Performance

Some of the challenges that mobile image processing software for visually im-

paired users has to deal with are characterised in [DLQ+06]. The problems are

specified for a text recognition system based on a client-server architecture (see

above), but can also be applied to the issue of recognizing emergency exit signs.

The application has to process images that

• are blurred

• contain text that is very small (or in this case, small exit signs)

• have low contrast

• were taken under poor lighting conditions


These issues have to be considered when designing the image processing appli-

cation, as well as producing the sets of training and test images. Since there is

no way of improving the camera quality, these errors can only be mitigated by

choosing image processing methods that do not rely too heavily on flawless image

quality.

There are also a number of critical issues that have to be dealt with when

implementing the software, regarding both the implementation process and the

actual problem of object recognition. First, the existing resources of computer

vision libraries on Symbian OS are relatively small compared to other platforms

such as Windows PCs. This makes it necessary to implement a large amount of

functionality from scratch or port it to Symbian C++, which is an error-prone

procedure that slows down the development process. This risk could be

reduced by using as many existing building blocks as possible and keeping to the

principles of good coding practice for Symbian C++, as defined on the “Forum

Nokia” website⁵. Secondly, as previously mentioned, the low processing power of

smartphones and a software-emulated floating point unit require careful memory

management and choice of data types. In order to deal with this problem, floating

point operations have to be avoided where possible in favour of integer operations.
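One common way of replacing floating point operations is fixed point arithmetic, where a fractional value is stored in an integer with an implied scale factor. The sketch below uses a Q16.16 format in standard C++; the type and function names are hypothetical and the format is an assumption, chosen only to show the principle.

```cpp
#include <cstdint>

typedef std::int32_t Fixed;   // Q16.16: 16 integer bits, 16 fraction bits
const int kFracBits = 16;

// Converts an integer to fixed point and back (truncating towards zero).
inline Fixed FromInt(int v) { return static_cast<Fixed>(v) * (1 << kFracBits); }
inline int ToInt(Fixed f) { return static_cast<int>(f / (1 << kFracBits)); }

// The intermediate product needs 64 bits before it is scaled back.
inline Fixed Mul(Fixed a, Fixed b)
{
    return static_cast<Fixed>((static_cast<std::int64_t>(a) * b) / (1 << kFracBits));
}

// The dividend is pre-scaled so that the quotient keeps its fraction bits.
inline Fixed Div(Fixed a, Fixed b)
{
    return static_cast<Fixed>((static_cast<std::int64_t>(a) * (1 << kFracBits)) / b);
}
```

For example, `Div(FromInt(7), FromInt(2))` yields the fixed point representation of 3.5 using only integer instructions. The price is a limited value range and a resolution of 1/65536, so the format has to be chosen to fit the quantities involved.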

3.7 Chapter Summary

This chapter outlined the main aspects of the software development process. We

specified the requirements for the application and gave an overview of the pro-

gram design which is based on the typical Symbian application structure. It was

decided to organise the application into several modules with different functionalities

that interact over clearly defined interfaces. We then gave an overview

of the methods and algorithms that will be used for the image processing module

of the software and specified details of the factor graph belief propagation,

along with a proposal for a simplified version of the detection stage. The chap-

ter was concluded by a discussion of the quality and quantity of test images, as

well as an overview of typical problems that will have to be dealt with in the

implementation phase.

⁵ http://www.forum.nokia.com

Chapter 4

System Implementation

4.1 Overview

The main objective of this chapter is to give an insight into the implementation

phase of the project, based on the methods discussed in the previous chapter.

First, we will give an overview of the implementation tools that were used,

followed by a description of the different stages of the application development.

This will include explanations of the implemented algorithms, along with short

code listings of the most significant program segments where deemed necessary

for understanding. The chapter is concluded by an explanation of the charac-

teristics of Symbian OS with respect to methods for optimising the application

performance on this platform.

4.2 Implementation Tools

Symbian provides software development kits for the different OS versions, with

S60 3rd Edition (Symbian OS v9.1) and S60 3rd Edition FP 1 (Symbian OS

v9.2) being the ones supported by the largest number of devices (mainly Nokia

and Samsung) [Sym09]. The SDK comes with all the necessary C++ APIs,

example programs and a phone emulator (which is of no use for this project, as

the camera on the handset cannot be simulated by the emulator using a built-in

laptop camera). In order to assist the application development process, Symbian

recommends using an IDE such as Carbide.c++. This free software is based on

the Eclipse IDE and offers tools for debugging, on-device debugging and GUI

construction. For this project, the software was run on a Mac OS X system


using a virtual machine (VirtualBox) with Windows XP Professional as a guest

OS. Compilation of the application code (using the GCC-E compiler) produces

a Symbian installation file (.sis) that can be installed on any suitable Symbian

device. In order for the file to be accepted by the device, it has to undergo

a signing process. This is achieved using command line tools provided by the

SDK to generate a key and a certificate, which are used to sign the .sis file

after compiling it. The signed application¹ is then installed by either directly

connecting the phone to a PC via USB and initiating the installation process

from the development system, or transferring the .sis file to the handset (e.g.

sending it via Bluetooth) and then installing it. The IDE also offers a mode for

on-device debugging when the device is connected to the host computer via USB,

which proved to be useful for debugging purposes.

4.3 Image Capturing

The phone’s camera is accessed using the camera API to capture images over the

phone’s viewfinder. The images are transferred directly to a bitmap without any

further processing. The advantage of this method over capturing a still image is the

speed of the operation. In this viewfinder mode, the N95 camera produces im-

ages with a size of 320x240 pixels in 32-bit colour mode, which are then used for

carrying out further pre-processing steps. In order to capture a higher resolution

picture of 640x480 pixels, the camera viewfinder needs to be stopped when the

capture button is pressed. The camera settings are then changed to a higher

resolution, the image is captured and displayed on the screen. Figure 3.2 shows a

state machine diagram of the camera module in interaction with the other system

components. In tests with the Nokia phone, the autofocus which is run automat-

ically by the operating system’s controls proved good enough to produce pictures

of sufficient quality with little blurring, which makes adjusting the camera focus

by hand unnecessary. This can be considered a very helpful feature of the built-in

camera API, given the system is designed for blind users who will not be able to

adjust any camera settings to improve the image quality.

¹ It has to be mentioned that any self-generated key and certificate pair is only valid for a certain period of time, usually one year. After that, the .sis file is rejected by the phone and has to be signed again with a newly generated key and certificate.


4.4 Phase One: Feature Extraction

Tests were carried out with a Symbian OS computer vision library in C++,

developed by Nokia (NokiaCV)², that provides an implementation of various

image processing tasks. Due to very slow performance (3 seconds per frame

for greyscale conversion and convolution with a Sobel filter), this approach was

deemed rather unsuitable for real-time processing. Therefore it was decided to

implement all processing steps using only the bitmap interface that is offered by

Symbian and simplifies drawing the captured images to the phone screen. While

accessing the individual pixels over the interface’s GetPixel() method offers a

convenient way of manipulating the colour values, this proved too slow for efficient

implementation of complex image processing algorithms in several loops over the

image. All pixels were therefore accessed through a pointer to the bitmap’s first

pixel, using the bitmap interface’s DataAddress() method.

4.4.1 Greyscale Conversion

In a first pre-processing step, feature extraction was performed on the input

image. The first stage included converting the bitmap delivered from the camera

from colour into greyscale mode³. As previously mentioned, the input image on

the test device is a 32-bit RGB + alpha channel bitmap⁴. The conversion function

takes the input bitmap and simply draws it onto a new bitmap that was created

in greyscale mode.
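Where the conversion cannot be delegated to the bitmap API, it can also be done directly on the pixel buffer with integer weights. The sketch below is an alternative shown for illustration, not the method used in the application: the function name `ToGreyscale` is hypothetical, the byte order per pixel (blue, green, red, unused) is an assumption about the buffer layout, and the weights (77, 150, 29)/256 are a common integer approximation of the usual luminance formula.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts a 32-bit pixel buffer to 8-bit greyscale using integer weights
// (77, 150, 29)/256, approximating 0.299 R + 0.587 G + 0.114 B.
// Assumed byte order per pixel: blue, green, red, unused.
std::vector<std::uint8_t> ToGreyscale(const std::uint8_t* bgrx, std::size_t pixelCount)
{
    std::vector<std::uint8_t> grey(pixelCount);
    for (std::size_t i = 0; i < pixelCount; ++i) {
        const unsigned b = bgrx[4 * i], g = bgrx[4 * i + 1], r = bgrx[4 * i + 2];
        grey[i] = static_cast<std::uint8_t>((77 * r + 150 * g + 29 * b) >> 8);
    }
    return grey;
}
```

Because the three weights sum to 256, a white pixel maps to 255 and the final division reduces to a right shift by eight bits.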

4.4.2 Sobel Operator

The Sobel operator for edge detection is implemented using simple integer mul-

tiplication and addition to convolve the image pixels with the horizontal and

vertical kernel. The gradient magnitude is then calculated using the sum of the

derivatives’ absolute values Abs(dx) + Abs(dy) as an approximation, rather than

the hypotenuse, in order to avoid calculating the square root which would have an

impact on the system performance. In the case of an implementation for Symbian

OS, attention has to be paid to the correct usage of the integer datatypes that are

² http://research.nokia.com/research/projects/nokiacv/

³ Due to the mode being named “EGray256”, the US English spelling was used throughout the source code for consistency reasons.

⁴ In fact, the colour mode delivered by the N95 is “EColor16MU” which is built up as BGR.


offered by the platform (several different signed and unsigned integer types of various

lengths, such as TUint8 for unsigned 8-bit integers), and their explicit conver-

sion when assigning values. Even without any prior (possibly time-consuming)

smoothing, this operation produced results that were suitable for further processing.

Listing A.1 shows how the Sobel operator is applied to the image, with

subsequent normalisation of the resulting gradient values to the range 0..255.
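For readers without access to the appendix, the following standalone sketch shows the same idea in standard C++. It is not Listing A.1: the function name `SobelMagnitude` is hypothetical, the image is assumed to be a row-major 8-bit greyscale buffer, and the normalisation is a fixed division by eight, which is one simple way of mapping the result to 0..255.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// Applies the two Sobel kernels of figure 3.3 to a row-major 8-bit greyscale
// image and returns |dx| + |dy| per pixel, scaled to 0..255. Border pixels
// are left at 0.
std::vector<std::uint8_t> SobelMagnitude(const std::vector<std::uint8_t>& img,
                                         int width, int height)
{
    std::vector<std::uint8_t> out(img.size(), 0);
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            const std::uint8_t* p = &img[y * width + x];
            const int tl = p[-width - 1], t = p[-width], tr = p[-width + 1];
            const int l = p[-1], r = p[1];
            const int bl = p[width - 1], b = p[width], br = p[width + 1];
            const int dx = (tr + 2 * r + br) - (tl + 2 * l + bl);
            const int dy = (tl + 2 * t + tr) - (bl + 2 * b + br);
            // |dx| + |dy| cannot exceed 2040 = 8 * 255, so a shift by three
            // bits maps the sum to 0..255.
            out[y * width + x] =
                static_cast<std::uint8_t>((std::abs(dx) + std::abs(dy)) >> 3);
        }
    }
    return out;
}
```

The loop uses only additions, subtractions and shifts, since the kernel weights of 1 and 2 do not require a general multiplication.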

4.4.3 Non-Maximum Suppression

In the system’s implementation, the non-maximum suppression is performed by

determining the gradient direction of each pixel and comparing it to the two

neighbouring pixels in the positive and negative gradient direction (normal to the

edge direction). The gradient direction is defined as θ = arctan(dy/dx);

however, this expensive operation was not suitable for an efficient implementation

as it already slowed down the performance to 1 frame per second. Therefore, and

since we operate in a discrete domain, the gradient orientation has to be classified

into one of the eight main directions which we “hard-code”. There are several

ways of carrying out this classification without directly computing the gradient

orientation: One method is based on the signs of the horizontal and vertical

derivatives, which classifies the pixel into one of the directions 1 to 8: Direction

1 covers 0° to 45°, 2 ranges from 45° to 90° and so on. The gradient magnitude

is then compared to the linearly interpolated gradient values of the pixel pairs (in

negative and positive gradient direction) in the discrete grid that are closest.
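A classification of this kind can be written with sign tests and a single comparison of absolute values, as in the sketch below. The function name `GradientSector` and the numbering of the eight 45-degree sectors from 1 to 8 are assumptions made for illustration; the angle is taken in the usual mathematical sense from dx and dy, and the interpolation step is not shown.

```cpp
#include <cstdlib>

// Classifies the gradient (dx, dy) into one of eight 45-degree sectors,
// numbered 1 (0 to 45 degrees) to 8 (315 to 360 degrees), using only sign
// tests and one magnitude comparison. Returns 0 for a zero gradient.
int GradientSector(int dx, int dy)
{
    if (dx == 0 && dy == 0) return 0;
    const bool steep = std::abs(dy) > std::abs(dx);  // closer to the y axis
    if (dx > 0 && dy >= 0) return steep ? 2 : 1;     // first quadrant
    if (dx <= 0 && dy > 0) return steep ? 3 : 4;     // second quadrant
    if (dx < 0 && dy <= 0) return steep ? 6 : 5;     // third quadrant
    return steep ? 7 : 8;                            // fourth quadrant
}
```

The signs select the quadrant and the comparison of |dy| with |dx| selects the half of the quadrant, so neither a division nor an arctangent is needed.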

A suitable threshold was determined through testing, with results varying

depending on the light conditions and the distance from the object, as had

been expected. This non-maximum suppression leads to a binary edge map with

thin edges. The results of the individual edge detection steps are shown in figure

4.1, including a comparison between a simple thresholded image and the image after

applying non-maximum suppression, which clearly demonstrates the importance

of this operation.

4.4.4 Straight Line Extraction

In order to extract straight line segments from the image, a greedy grouping

procedure is applied to the edge pixels. The method (here explained for horizontal

segments) scans every row and proceeds as follows: If the current pixel is an edge


Figure 4.1: Individual steps of edge detection: (a) Original image (top left), (b) Sobel filtered (top right), (c) Sobel and threshold (bottom left), (d) Sobel, non-maximum suppression and threshold (bottom right)

pixel (i.e. not zero), check its neighbouring pixels at 0°, 45° and -45°. If one

of these is also an edge pixel, set the current pixel as starting point for a stroke.

Then proceed to check this edge pixel for its horizontal neighbours and continue

until an edge pixel is met that does not have any neighbours to the right. This last

pixel is then set as end point of the stroke and the algorithm continues to process

the starting line. If the starting pixel does not have any edge pixel neighbours,

discard it in the target image. To connect the shorter collinear strokes to longer

segments, a small (5 pixels) window is moved over the end points in order to

detect start points that are located within 45° (positive or negative) of the end

point. Given the start- and end coordinates, we can also infer all the information

needed for the factor graph, namely the length of the segment, its position and its

orientation (slope). This information is then stored in an array of TPoint objects

created to represent line segments, with consecutive elements being regarded as

neighbours, i.e. candidate pairs for the binary cues used in the factor graph.
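The stroke tracing for horizontal segments can be sketched as follows. This is a simplified illustration rather than the implemented procedure: the names `Stroke` and `TraceHorizontalStrokes` are hypothetical, a separate `visited` map is used instead of modifying the target image, the straight neighbour is preferred over the diagonal ones, and the subsequent joining of collinear strokes is not included.

```cpp
#include <cstdint>
#include <vector>

struct Stroke { int x0, y0, x1, y1; };  // start and end point of a stroke

// Traces strokes from left to right in a binary edge map (non-zero = edge).
// From each unvisited edge pixel, the trace follows the right-hand neighbour
// at 0, +45 or -45 degrees until none is left.
std::vector<Stroke> TraceHorizontalStrokes(const std::vector<std::uint8_t>& edges,
                                           int width, int height)
{
    std::vector<std::uint8_t> visited(edges.size(), 0);
    std::vector<Stroke> strokes;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            if (!edges[y * width + x] || visited[y * width + x]) continue;
            int cx = x, cy = y;
            visited[cy * width + cx] = 1;
            bool advanced = true;
            while (advanced && cx + 1 < width) {
                advanced = false;
                const int dyOrder[3] = { 0, -1, 1 };  // prefer the straight neighbour
                for (int k = 0; k < 3; ++k) {
                    const int ny = cy + dyOrder[k];
                    if (ny < 0 || ny >= height) continue;
                    const int idx = ny * width + cx + 1;
                    if (edges[idx] && !visited[idx]) {
                        visited[idx] = 1;
                        cx += 1;
                        cy = ny;
                        advanced = true;
                        break;
                    }
                }
            }
            // Edge pixels without any neighbour to the right are discarded.
            if (cx > x) strokes.push_back(Stroke{ x, y, cx, cy });
        }
    }
    return strokes;
}
```

From the start and end point of each stroke, the length and slope needed for the cues can be derived with integer arithmetic.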


4.4.5 Factor Graph and Belief Propagation

After having extracted straight line segments from the image, the factor graph

is built based on those image features. This section will explain the way to

implement a factor graph and represent the cues listed in the previous chapter.

It has to be noted however, that the factor graph belief propagation has not been

fully implemented and that the steps described in this section will only act as

a pointer to the actual finalised implementation. The implementation is largely

based on the libDAI library and Intel’s Probabilistic Network Library (PNL), two

open source libraries for inference on graphical models in C++, which provide a

good starting point for porting the algorithms to the Symbian platform [Moo, Int].

To build up the factor graph, some helper classes are needed that provide a

data structure for the different pieces of information stored within the graph. A

class for the individual factors called CFactor holds a set of variables (the argu-

ments of the function) and a reference to a probability vector as data members,

along with methods to manipulate this data. The probability vector describes

the value of a factor depending on all possible variable assignments. With respect

to an efficient implementation, both this vector and the set of variables are constructed

using Symbian’s RArray template class, a wrapper for accessing arrays

of structures and objects⁵.

Corresponding to the definition of factor graphs in the previous chapter, the

factor graph class CFactorGraph has an array of variables (that are either one or

zero) and an array of CFactor objects as data members. Edges in the graph are

represented by an array of factor neighbours for each variable, and an array of

neighbours that are variables for each factor node⁶, in order to distinguish between

the different types of edges. An edge is then added by including the variable and

factor involved in the respective set in the graph and adding entries to both of

the neighbour lists. Accordingly, in order to remove an edge, the entries from the

neighbouring arrays are deleted (which is also done if the respective variable or

factor nodes are removed from the graph). In this context of image processing,

each edge between a factor and a variable corresponds to a cue used to determine

the state of a segment (the variable that the factor is connected to).
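The neighbour list representation described above can be illustrated with a small class. This is a sketch in standard C++ and not the CFactorGraph class itself: `std::vector` stands in for Symbian's RArray, the class and method names are hypothetical, and the factors' probability vectors are left out so that only the bipartite edge bookkeeping is shown.

```cpp
#include <cstddef>
#include <vector>

// Bipartite factor graph: variables (segments) on one side, factors (cues)
// on the other. std::vector stands in for Symbian's RArray in this sketch.
class FactorGraph {
public:
    std::size_t AddVariable()
    {
        varNeighbours_.push_back(std::vector<std::size_t>());
        return varNeighbours_.size() - 1;
    }
    std::size_t AddFactor()
    {
        facNeighbours_.push_back(std::vector<std::size_t>());
        return facNeighbours_.size() - 1;
    }

    // Adds an edge by recording each node in the other's neighbour list.
    void AddEdge(std::size_t var, std::size_t fac)
    {
        varNeighbours_.at(var).push_back(fac);
        facNeighbours_.at(fac).push_back(var);
    }

    // Removes an edge by deleting the entries from both neighbour lists.
    void RemoveEdge(std::size_t var, std::size_t fac)
    {
        Erase(varNeighbours_.at(var), fac);
        Erase(facNeighbours_.at(fac), var);
    }

    const std::vector<std::size_t>& FactorsOf(std::size_t var) const
    { return varNeighbours_.at(var); }
    const std::vector<std::size_t>& VariablesOf(std::size_t fac) const
    { return facNeighbours_.at(fac); }

private:
    static void Erase(std::vector<std::size_t>& list, std::size_t value)
    {
        for (std::size_t i = 0; i < list.size(); ++i)
            if (list[i] == value) { list.erase(list.begin() + i); return; }
    }
    std::vector<std::vector<std::size_t> > varNeighbours_;
    std::vector<std::vector<std::size_t> > facNeighbours_;
};
```

Keeping two separate neighbour tables makes it impossible to connect two variables or two factors directly, which preserves the bipartite property by construction.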

In order to compute the MAP, the belief propagation is now performed on

⁵ In this section the terms “list” and “set” are only used for legibility reasons and do not suggest the use of the respective data structures.

⁶ As we are dealing with a bipartite graph, it is ensured that there exist only edges between two different types of nodes.


the factor graph. A BP class is generated for this task which holds objects

for the messages passed between the graph nodes as member variables, along

with methods for creating and updating messages. When creating a BP object,

it is initialised with the factor graph that the operation is carried out on. The

beliefs for the segments are then computed based on the algorithms

explained in the second chapter. By using the simplified version of the factor

graph belief propagation algorithm, no message updates are necessary in order

to determine the belief for the segments in the image. The segments that are

regarded as belonging to a suitable quadrilateral shape are then saved in an

array of four TPoint objects that mark the start and end points of the segments

in clockwise direction (top, right, bottom, left). If several quadrilateral shapes

are detected in the image, the one with the highest belief is first analysed with

the warping and template matching procedures described in the next section and,

if the output is negative, the steps are repeated for the other detected shapes.
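The order in which several detected shapes are examined can be sketched as below. The types Point and Candidate and the callback isSign are hypothetical; isSign stands in for the warping and template matching of the second phase, which are not reproduced here.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

struct Point { int x; int y; };

// A detected quadrilateral: four corners in clockwise order plus its belief.
struct Candidate {
    Point corners[4];
    double belief;
};

// Tries the candidates in order of decreasing belief and returns the index
// (into the original vector) of the first one accepted by 'isSign', or -1
// if none of them is accepted.
int firstAcceptedCandidate(const std::vector<Candidate>& candidates,
                           const std::function<bool(const Candidate&)>& isSign) {
    std::vector<int> order(candidates.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);

    // Sort the indices rather than the candidates, so nothing is copied.
    std::stable_sort(order.begin(), order.end(), [&candidates](int a, int b) {
        return candidates[a].belief > candidates[b].belief;
    });

    for (std::size_t i = 0; i < order.size(); ++i) {
        if (isSign(candidates[order[i]])) return order[i];
    }
    return -1;
}
```

Because the expensive second phase runs only until the first acceptance, the most plausible shape is usually the only one that has to be warped and matched.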

4.5 Phase Two: Object Recognition

4.5.1 Final Content Analysis

Once we have obtained the coordinates of the rectangle’s corner points we can

proceed with the analysis of the rectangle’s content. First, the captured image

is drawn to a bitmap in greyscale mode, again using Symbian’s bitmap API. This is

followed by the planar homography that projects the found quadrilateral shape

with perspective bias onto a rectangle, using a transformation matrix that is

constructed on the basis of the four corner points of the distorted and target

rectangle respectively. The implementation of this projective transformation is

based on the code described in [Blo], which has been ported from C to Symbian

C++. The transformation matrix that is determined in the first stage is then

used to compute the corresponding pixel in the quadrilateral for each pixel in

the target rectangle. When implementing this method, particular attention has

to be paid to optimising the multiple divisions that occur when computing the

transformation matrix and the final output by using fixed point arithmetic with

integers rather than Symbian’s TReal class for float values and standard division.
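The two stages of the projective transformation can be illustrated with the following sketch. It uses the standard construction that maps the unit square onto a quadrilateral and is not necessarily identical to the code in [Blo]; the names PointD, Homography, squareToQuad and sourceFor are hypothetical, and floating point is used for readability instead of the fixed point arithmetic recommended above.

```cpp
#include <cmath>

struct PointD { double x; double y; };

// Coefficients of the projective map
//   x = (a*u + b*v + c) / (g*u + h*v + 1)
//   y = (d*u + e*v + f) / (g*u + h*v + 1)
struct Homography { double a, b, c, d, e, f, g, h; };

// Stage one: build the map that sends the unit square (0,0),(1,0),(1,1),(0,1)
// onto the quadrilateral q[0..3], given in the same order.
// Returns false if the quadrilateral is degenerate.
bool squareToQuad(const PointD q[4], Homography& m) {
    const double dx1 = q[1].x - q[2].x, dy1 = q[1].y - q[2].y;
    const double dx2 = q[3].x - q[2].x, dy2 = q[3].y - q[2].y;
    const double sx = q[0].x - q[1].x + q[2].x - q[3].x;
    const double sy = q[0].y - q[1].y + q[2].y - q[3].y;
    const double det = dx1 * dy2 - dx2 * dy1;
    if (std::fabs(det) < 1e-12) return false;

    m.g = (sx * dy2 - dx2 * sy) / det;
    m.h = (dx1 * sy - sx * dy1) / det;
    m.a = q[1].x - q[0].x + m.g * q[1].x;
    m.b = q[3].x - q[0].x + m.h * q[3].x;
    m.c = q[0].x;
    m.d = q[1].y - q[0].y + m.g * q[1].y;
    m.e = q[3].y - q[0].y + m.h * q[3].y;
    m.f = q[0].y;
    return true;
}

// Stage two: for target pixel (col, row) of a width x height rectangle
// (width and height greater than 1), return the corresponding position
// in the captured image.
PointD sourceFor(const Homography& m, int col, int row, int width, int height) {
    const double u = static_cast<double>(col) / (width - 1);
    const double v = static_cast<double>(row) / (height - 1);
    const double w = m.g * u + m.h * v + 1.0;
    PointD p;
    p.x = (m.a * u + m.b * v + m.c) / w;
    p.y = (m.d * u + m.e * v + m.f) / w;
    return p;
}
```

Iterating over the target rectangle and looking up the source pixel, rather than the other way round, ensures that every target pixel receives a value and no holes appear in the warped image.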

For the next step (determining the ratio between “green” background pixels

and white icon pixels) a very simple histogram analysis procedure was imple-

mented: The clipped image’s greyscale values are compared to a lower and upper

boundary value chosen through testing (the average greyscale value of the back-

ground green is approximately 100) near the expected background colour in order

to separate the green background from the white icons. The ratio of pixels that

are within the boundaries (i.e. “green” pixels) to the number of pixels that are

above the upper boundary (white pixels) has to be close to 1.1, which has been

determined through testing. If this quick check produces a positive result, i.e. the

found rectangle is a candidate for an emergency exit sign, the template matching

is performed.
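The ratio check can be sketched as follows. Only the approximate background grey value of 100 and the target ratio of 1.1 are taken from the text; the boundary values 80 and 120, the tolerance of 0.2 and all names are assumptions chosen for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical parameters: the boundaries and the tolerance are placeholders,
// not the values determined through testing in the project.
struct RatioCheck {
    std::uint8_t lower;
    std::uint8_t upper;
    double targetRatio;
    double tolerance;
    RatioCheck() : lower(80), upper(120), targetRatio(1.1), tolerance(0.2) {}
};

// Returns true if (#pixels within [lower, upper]) / (#pixels above upper)
// lies within 'tolerance' of 'targetRatio'.
bool looksLikeExitSign(const std::vector<std::uint8_t>& grey,
                       const RatioCheck& cfg = RatioCheck()) {
    std::size_t green = 0;  // pixels near the expected background colour
    std::size_t white = 0;  // pixels brighter than the upper boundary
    for (std::size_t i = 0; i < grey.size(); ++i) {
        if (grey[i] > cfg.upper) ++white;
        else if (grey[i] >= cfg.lower) ++green;
    }
    if (white == 0) return false;  // no icon pixels at all, and avoids dividing by zero
    const double ratio = static_cast<double>(green) / static_cast<double>(white);
    return std::fabs(ratio - cfg.targetRatio) <= cfg.tolerance;
}
```

The check needs a single pass over the clipped image, which is why it is suitable as a cheap filter before the template matching.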

For this purpose, eight exit sign templates are created as binary images, with

the same dimensions as the target rectangle used in the projecting step. The

pixels from the template and the thresholded image are then compared pairwise

and their differences are summed. If the sum (i.e. the difference between the two

images) is minimal, it is assumed that the sign matches the respective template.

The arrow direction is then saved as one of eight directions and output by the

system. Figure 4.2 shows that even templates with the same layout (text, icon

and arrow from left to right) clearly differ in the relatively large white section

that defines the tip of the arrow, which is how they can be distinguished.

Figure 4.2: Two examples of binary sign templates
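The pairwise comparison described above can be sketched as below. This is not Listing A.2: the image type, the function names and the rejection threshold maxDifference are hypothetical, and the templates are passed in as plain arrays instead of being loaded from an .mbm file.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

typedef std::vector<std::uint8_t> BinaryImage;  // one byte per pixel, 0 or 1

// Number of pixels in which the two binary images differ.
std::size_t difference(const BinaryImage& a, const BinaryImage& b) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        if (a[i] != b[i]) ++count;
    }
    return count;
}

// Returns the index of the template with the smallest difference, or -1 if
// even the best template differs in more than 'maxDifference' pixels.
int bestTemplate(const BinaryImage& candidate,
                 const std::vector<BinaryImage>& templates,
                 std::size_t maxDifference) {
    int best = -1;
    std::size_t bestDiff = std::numeric_limits<std::size_t>::max();
    for (std::size_t t = 0; t < templates.size(); ++t) {
        const std::size_t d = difference(candidate, templates[t]);
        if (d < bestDiff) {
            bestDiff = d;
            best = static_cast<int>(t);
        }
    }
    return (best >= 0 && bestDiff <= maxDifference) ? best : -1;
}
```

With eight templates the returned index can be mapped directly to one of the eight arrow directions.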

The templates are created as bitmaps and then referenced in the project’s

MMP file that includes project specific instructions for the compiler, such as li-

braries that need to be included. The bitmap is then integrated into a Symbian

multi bitmap file (.mbm) during compilation and can be loaded using its auto-

matically generated file name or its enumeration index in the source code. Listing

A.2 demonstrates how the candidate rectangle is matched with each of the

eight templates in the .mbm file.

4.6 Result Output

The result of the object recognition procedures carried out in the previous stages

needs to be output in a way that is suitable for visually impaired users. As

previously mentioned, the signal tone in the first stage (finding a rectangular

shape) is a simple “bleep” sound which is produced using the Symbian library for

system sounds like warning and error messages. Since the sound will be repeated

for each frame in which a rectangle is recognised, this is the least obtrusive way

to notify the user of the presence of a potential sign. Once the image is captured

and processed in the second stage, the result is output as text, using Symbian’s

“CAknInformationNote” pop-ups that display text for several seconds. The text

is then picked up by the screen reader that is installed on the device and read

out through the screen reader software’s text-to-speech synthesis.

The output informs the user whether an exit sign could be detected in the

image or not, and gives the arrow direction if an arrow is present:

• “No exit sign found.”

• “Emergency exit sign found. No arrow detected.”

• “Emergency exit sign found. Arrow direction: Top right.”

These messages can be displayed again by pressing any key on the keypad, which

adds to the usability of the application.
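How the three message forms could be derived from a recognition result is sketched below. The types Arrow and Result and both functions are hypothetical, and only the wording “Top right” is taken from the text; the other direction names are assumed by analogy. On the device, the resulting string would be passed to the information note instead of being returned.

```cpp
#include <string>

// Hypothetical result type with the eight arrow directions.
enum class Arrow { None, Top, TopRight, Right, BottomRight, Bottom, BottomLeft, Left, TopLeft };

struct Result {
    bool signFound;
    Arrow arrow;
};

// Direction names other than "Top right" are assumptions.
std::string arrowName(Arrow arrow) {
    switch (arrow) {
        case Arrow::Top:         return "Top";
        case Arrow::TopRight:    return "Top right";
        case Arrow::Right:       return "Right";
        case Arrow::BottomRight: return "Bottom right";
        case Arrow::Bottom:      return "Bottom";
        case Arrow::BottomLeft:  return "Bottom left";
        case Arrow::Left:        return "Left";
        case Arrow::TopLeft:     return "Top left";
        default:                 return "";
    }
}

// Builds the text that is handed to the screen reader.
std::string resultMessage(const Result& r) {
    if (!r.signFound) return "No exit sign found.";
    if (r.arrow == Arrow::None) return "Emergency exit sign found. No arrow detected.";
    return "Emergency exit sign found. Arrow direction: " + arrowName(r.arrow) + ".";
}
```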

4.7 Optimisation for Symbian S60 devices

For applications that update the screen at short intervals, as it is done here to

display the processed image, it is recommended to bypass the window server that

scales and clips the image before it is drawn on the screen. Symbian provides

direct screen access over its CDirectScreenAccess interface. While this would not

be important in a final application that does not display the camera image, it

could speed up and give a more accurate impression of the system performance

in the development stage, where the screen output is necessary for debugging

purposes.

Since several copies of the processed image are being held in memory as simple

arrays, it is important to increase the application’s heap size before allocating

memory for the image data. The SDK offers a way of setting a minimum and

maximum size, for which a check is performed before starting the application. In

order to allow for sufficient memory, the maximum heap size was changed from

the default 1MB to 4MB. This solved the problem of application crashes caused

by accessing pointers that were initialised to NULL due to a lack of memory.

As suggested in [ICS08], the workload during runtime can be reduced by using

static arrays rather than dynamically allocated memory. Due to the fact that

the images used in the application always have the same size (320x240 pixels

and 640x480 pixels respectively for the images captured from the camera), a

sufficiently large array can already be constructed during compile time.
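A minimal sketch of such compile-time buffers is given below. The two image sizes are taken from the text; the buffer names, the assumption of one byte per greyscale pixel, the packed RGB input format and the integer luminance weights are illustrative choices and not the project’s code.

```cpp
#include <cstddef>
#include <cstdint>

// Image sizes named in the text; one byte per greyscale pixel is assumed.
const std::size_t kPreviewPixels = 320 * 240;
const std::size_t kStillPixels = 640 * 480;

// Buffers with static storage duration: their size is fixed at compile time,
// so no heap allocation takes place while frames are being processed.
static std::uint8_t gPreviewGrey[kPreviewPixels];
static std::uint8_t gStillGrey[kStillPixels];

static_assert(sizeof(gStillGrey) == 4 * sizeof(gPreviewGrey),
              "the still image has four times as many pixels as the preview");

// Converts a packed RGB frame (3 bytes per pixel) into the static preview
// buffer using integer weights (77 + 150 + 29 = 256).
// Returns false if the frame does not fit into the buffer.
bool toPreviewGrey(const std::uint8_t* rgb, std::size_t pixelCount) {
    if (pixelCount > kPreviewPixels) return false;
    for (std::size_t i = 0; i < pixelCount; ++i) {
        const unsigned r = rgb[3 * i];
        const unsigned g = rgb[3 * i + 1];
        const unsigned b = rgb[3 * i + 2];
        gPreviewGrey[i] = static_cast<std::uint8_t>((77 * r + 150 * g + 29 * b) >> 8);
    }
    return true;
}
```

The size check replaces the NULL-pointer check that would be needed after a dynamic allocation, since the buffer itself can never fail to exist.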

4.8 Chapter Summary

This chapter discussed the application development process and the implementa-

tion of complex processing tasks on the Symbian OS platform. The development

was carried out using the Carbide.c++ IDE provided by Symbian. The appli-

cation makes use of the platform’s camera API to capture continuous and still

images which are passed on to the processing module. The image processing is

then performed in two stages, with greyscaling and edge extraction being applied

to the image in a first step, followed by the object detection core. After com-

pletion of the implementation stage, the program will be tested and evaluated,

which will now be described in the following two chapters.

Chapter 5

Testing

5.1 Overview

In this chapter, we will outline the testing and evaluation procedures that are

being performed throughout the development process and when examining the

final version of the application. This also includes a discussion of the expected

and desired test results, which define the criteria for success of the project. The

chapter is concluded by an overview of the testing results with respect to speed

and recognition rates of the application in different testing environments.

5.2 Description of the Testing Procedures

5.2.1 Ad-hoc Testing

Testing took place throughout the different stages of the application development

in order to ensure adequate performance and recognition rates when implementing

new functionalities. The testing includes checks for both the performance of the

image processing module and the robustness of the software.

Regarding the correctness of the software, we carry out informal tests for each

completed unit and module integration step. This will deal with code coverage

criteria in particular, in order to ensure that all statements in the code have been

executed and tested for validity, and do not contain any bugs or errors. Ad-hoc

tests are performed conveniently on the Symbian SDK emulator, while the more

critical tests are carried out on the actual handset.

It has to be noted that accessing a camera (such as the development system’s

built-in laptop camera) is not possible when using the emulator. This makes it

necessary to test on static images captured with a phone camera, which means

that only the processing results but not the performance of the system can be

determined. Due to the movement of the camera, the results are also expected

to be less accurate on the live system. An attempt was made to compensate for these

circumstances by manually adding noise and motion blur to the static testing images

that were used with the emulator. By carrying out testing on the handset it is

also ensured that the program can be installed and runs on the intended device.

On the completion of every module, the prototype is tested for compliance

with the requirements defined in section 3.2. If necessary, this triggers a revision

of the code and the modification or addition of functionalities.

5.2.2 Final Testing

In terms of object recognition performance, we aim at a relatively high rate of

true positives and a low number of false positives, in combination with a short

processing time. Since imperfect results are more acceptable for users than long

latency [BB08], we focus mainly on the efficiency of the application, i.e. fast

execution of all program functions. If the desired results for the tests are not

achieved, the code is reviewed in order to improve the performance. Early tests

are carried out on a large set of positive and negative images in order to determine

the performance of the chosen object recognition methods, with the final testing

being performed “live” in buildings on a smaller test set. It is also desirable to

have the software tested for usability (ease of use) and evaluated by users who

are unfamiliar with the system — however, due to the restricted functionality of

the application, this is not of highest priority.

Live tests with users were not carried out due to the system not being in the

finalised state in this stage. However, due to the minimal interaction necessary

for running the system, user tests are only expected to provide feedback regarding

the performance of the system rather than the actual usability. This is why user

tests were not considered absolutely necessary when evaluating the system in its

current state.

5.2.3 Desired Results

Due to the complexity of the Symbian platform and the limited number of

applications that can be directly compared to our project, figures regarding expected

results can only be estimated based on related works.

A quick experiment with a stopwatch shows that a comfortable time to pan

a phone from left to right, i.e. approximately 180 degrees, is about 10 seconds.

This figure can act as a guideline for determining how many images have to be

captured and processed within 10 seconds in order to cover the whole area of 180

degrees around the user. With an average angle of view of roughly 40 degrees for

standard phone cameras (defined by the focal length of the lens), we can infer

that the system needs to be able to capture at least 5 images in those 10 seconds

(i.e. 2 seconds per image) to provide comfortable use of the system.
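The estimate can be reproduced with a few lines of arithmetic. The function names are hypothetical, and the calculation assumes that consecutive frames do not overlap: 180 degrees divided by 40 degrees gives 4.5, which is rounded up to 5 frames, leaving 10 / 5 = 2 seconds per frame.

```cpp
#include <cmath>

// Number of frames needed to cover 'sweepDegrees' with a camera whose
// angle of view is 'viewDegrees', assuming the frames do not overlap.
int framesNeeded(double sweepDegrees, double viewDegrees) {
    return static_cast<int>(std::ceil(sweepDegrees / viewDegrees));
}

// Time available for capturing and processing one frame if the whole
// sweep takes 'sweepSeconds'.
double secondsPerFrame(double sweepDegrees, double viewDegrees, double sweepSeconds) {
    return sweepSeconds / framesNeeded(sweepDegrees, viewDegrees);
}
```

If overlapping frames were required, for example to avoid cutting a sign in half at a frame border, the time budget per frame would shrink accordingly.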

The projects mentioned in section 2.5 in fact confirm the experiment and give

a rough estimate for suitable recognition rates:

• The processing time for a single frame is less than 2 seconds. Automated

processing in real-time (as demonstrated in [ICS08]) is highly desirable, but

depends strongly on the implementation and will therefore only be regarded

as an “additional feature” for this project.

• The recognition rate (true positive) lies at approximately 75% of all test

images.

• The amount of false positives has to be treated with particular care, as it is

unacceptable to send the user in the wrong direction. Therefore, no more

than 1% of the test images should be incorrectly classified as a sign.
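The three criteria above can be expressed as a simple check over the counts collected in a test run. The structure TestRun and its field names are hypothetical. The sketch reads the recognition rate as the share of sign images that are recognised, and the false positive limit as a share of all test images; this is one possible interpretation of the wording above.

```cpp
// Hypothetical summary of one test run.
struct TestRun {
    int signImages;              // test images that contain an exit sign
    int truePositives;           // sign images that were recognised
    int otherImages;             // test images without an exit sign
    int falsePositives;          // images wrongly classified as a sign
    double slowestFrameSeconds;  // longest processing time for a single frame
};

// Returns true if the run meets the desired results: at least 75% of the
// signs recognised, at most 1% of all images falsely classified as a sign,
// and less than 2 seconds of processing time per frame.
bool meetsTargets(const TestRun& run) {
    const int total = run.signImages + run.otherImages;
    if (run.signImages <= 0 || total <= 0) return false;
    const double recognitionRate = static_cast<double>(run.truePositives) / run.signImages;
    const double falsePositiveShare = static_cast<double>(run.falsePositives) / total;
    return recognitionRate >= 0.75
        && falsePositiveShare <= 0.01
        && run.slowestFrameSeconds < 2.0;
}
```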

These are only basic requirements that give an overview of the most significant

aspects of the final system. However, the focus of the project lies primarily on

the structured realisation of the proposed system design. It is deemed obvious

that an “ideal”, i.e. highly efficient and correct system can only be achieved

through profound knowledge of the platform, along with multiple code revisions

and sufficient time for exploring different approaches to one problem.

5.3 System Performance Evaluation

While the overall execution speed of the recognition application is a major aspect

in evaluating the system, the first phase of the recognition that detects rectan-

gular (quadrilateral, that is) shapes is considered particularly critical, as it is

performed in real-time, while the user is panning the phone. The first step, Sobel

filtering and non-maximum suppression, achieved roughly 3 frames per second,

when tested on the Nokia N95. This can be considered as real-time performance

and lies within the expected time frame. The straight line extraction did not

yield optimal results in live-tests on the device due to noise, motion blur and

varying lighting conditions; therefore the execution speed was not measured. In

tests on the SDK’s emulator, the straight line detection suffered from wide gaps

between shorter segments that could not be closed. The edge detection algorithm

clearly needs optimising in order to deal with those issues, which can be achieved

through extensive tests to adjust the chosen thresholds.

As the factor graph belief propagation algorithm was not implemented far

enough to allow statements about its performance, we can only give rough

estimates based on the sources that served as the foundations for this method. It is

expected to perform close to real time, although more slowly than stated

in [ICS08] due to the higher number of arity-4 cues used in the factor graph. The

ensuing warping procedure which involves a large number of divisions will slow

down the system performance if not optimised for fixed point calculations, which

is crucial for this operation. This procedure could be omitted by restricting the

number of recognisable signs to those that lie within a certain angle from the

camera and thus do not suffer from significant distortion. However, due to the

minor difference between the templates (for example, arrows pointing to the top

and bottom only differ in the tip of the arrow), as well as blurring and noise in

the image, it proved difficult to choose a suitable threshold to decide between the

outcomes without causing too many false positives or false negatives when

matching images that had not been warped. Finally, the template matching for eight

different templates is carried out with sufficiently high performance (under one

second in tests on the phone), as was expected for a small number of binary tem-

plates. In this stage, the worst case is that all eight templates need to be checked,

while in the best case the first template already matches the input image, which

reduces the processing time.
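The fixed point optimisation mentioned for the warping procedure can be sketched as follows. The 16.16 format, the type name Fixed and the helper functions are assumptions; the sketch also assumes that 64-bit integer intermediates are available and that the values stay within the range of the format (integer parts below 32768).

```cpp
#include <cstdint>

// 16.16 fixed point format: the upper 16 bits hold the integer part,
// the lower 16 bits the fraction.
typedef std::int32_t Fixed;
const int kFractionBits = 16;
const std::int32_t kOne = 1 << kFractionBits;

inline Fixed toFixed(int value) { return static_cast<Fixed>(value) * kOne; }

// Converts back to an integer, truncating towards zero.
inline int toInt(Fixed value) { return static_cast<int>(value / kOne); }

// The product of two 16.16 numbers has 32 fraction bits, so it is computed
// in 64 bits and scaled back.
inline Fixed fixedMul(Fixed a, Fixed b) {
    return static_cast<Fixed>((static_cast<std::int64_t>(a) * b) / kOne);
}

// The dividend is scaled up before dividing so that the fraction is kept.
inline Fixed fixedDiv(Fixed a, Fixed b) {
    return static_cast<Fixed>((static_cast<std::int64_t>(a) * kOne) / b);
}

// Projects homogeneous coordinates (numX, numY, w) onto the image plane with
// a single division: the reciprocal of w is computed once and reused.
inline void project(Fixed numX, Fixed numY, Fixed w, int& x, int& y) {
    const Fixed inverse = fixedDiv(kOne, w);
    x = toInt(fixedMul(numX, inverse));
    y = toInt(fixedMul(numY, inverse));
}
```

Replacing the two divisions per pixel with one division and two multiplications trades a small loss of precision for speed, which is acceptable here because the result is rounded to a pixel position anyway.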

Regarding the overall outcome it can be stated that the performance of the


system was not as efficient as expected, which is believed to be caused by the

non-optimised and rather straightforward implementation of the proposed methods.

However, these optimisations are considered only a matter of following standard

procedures, which does not affect the general feasibility of implementing the

chosen method on the Symbian platform.

5.4 Chapter Summary

In this chapter we outlined the testing procedures that were carried out throughout

the development process to ensure the quality and performance of the produced

application. Both the efficiency and effectiveness of the object recognition

module were evaluated, along with standard software testing procedures that

were carried out in order to check for the correctness of the application. Based

on the system performance it was concluded that optimising the application with

respect to Symbian guidelines is key for achieving an efficient implementation.

Chapter 6

System Evaluation

6.1 Overview

After specifying the application design and outlining the implementation process

in previous chapters, we will now review and discuss the overall project

development. Firstly, the chosen method for the given task of recognising exit signs

will be critically analysed, along with the system design. This is followed by an

evaluation of the project in relation to existing work, where the advantages and

disadvantages of the chosen method will be discussed. The chapter is concluded

by a critical review of the project schedule and possible improvements that can

be made to the system that was developed.

6.2 Analysis of the Research Methodology

It may be argued that the chosen methods for the feature extraction and object

detection stages were not the most suitable for this task regarding efficiency and

ease of implementation. However, due to the vast number of different and varied

methodologies and algorithms in this field, it can only be stated that the

preceding research was carried out carefully, which led to the decision to choose

an approach based on the work most similar to the given task that indicated

successful results. Another positive aspect of this method is the small amount

of training data that is needed for determining the cues that the factor graph is

built on. This is a great advantage over methods such as AdaBoost that need

large amounts of labelled training images, sometimes up to several thousands, in

order to produce good results. While the decision to base the application core on

statistical inference using a method from machine learning is an interesting and

still rarely used approach to image processing on mobile platforms, the complex-

ity of it proved to be a drawback of this approach. Especially with respect to the

use on a restricted platform such as Symbian OS, finding a suitable way of

efficiently implementing the construction of a factor graph and inference using the

belief propagation algorithm was not feasible given the time constraints. With

respect to the specified task, the project can therefore be regarded as unsuccess-

ful. However, in order to compensate for this issue, alternative approaches to the

problem were explored, that would simplify the implementation and could still

achieve acceptable results.

With respect to Symbian OS smartphones as the chosen platform for this

application it can be said that there are hardly any alternatives for developing

a complex application such as the emergency exit sign recognition system. Due

to the large range of available handsets, along with screen reader applications for

visually impaired users and the extensive resources for developers, the system can

be seen as superior with respect to the suitability for this task.

One of the advantages of the developed system design is its modularity. By

restricting the first step of the recognition phase to rectangular objects, the

program can be extended to recognise other standardised signs by using different

templates in the second stage. As the final decision for the content of the sign

is delayed until several checks give a clear indication for the result, the recog-

nition rate and especially the number of false positives can be optimised. The

system does not rely on manually highlighting any regions of interest in the im-

age or markers that help locating the signs and is therefore deemed more flexible

and user friendly than previous approaches to object detection on mobile devices

using touchscreens. The recognition software is designed to be accessible by a

screen reader, which is very likely to be already installed on a blind user’s smart-

phone. As the program only outputs information as text over Symbian’s built-in

notification interface, it is guaranteed to work with any type of screen reader or

display magnifying software. This is a great advantage over systems that show

information over graphically oriented interfaces. In addition, by steering away

from including speech output into the system, it can also be easily extended to

provide more information to the user, without having to produce new sound files.

This is also an advantage when considering developing a multilingual version of

the software. The fact that the voice output of the recognition system is the

same as the general text-to-speech voice used on the phone can also add to the

user’s acceptance of the application. Finally, both the installation files (.sis) and

use of system memory during program execution are kept relatively lightweight

when favouring written text output over speech output that is included in the

application.

6.3 Review of the Project Plan

The project schedule was designed before the start of the implementation stage

of the project and was structured into three main stages. It allowed for a rela-

tively long phase (one month) of getting familiar with the chosen platform and

its programming language, Symbian C++. This stage was followed by the imple-

mentation of the program core, which was supposed to take another month. The

final stage (one month) would be application testing, evaluation and writing up

of the insights gained during the course of the project.

While this plan seemed adequate given the complex platform and the difficult

task of optimising the system for real-time performance, it did not leave

much room to deal with problems caused by implementation errors. This fact

was worsened by the different error handling procedures on the Symbian emulator

and the actual device, along with flaws of the IDE and SDK tools such as

non-transparent caching mechanisms, undocumented emulator crashes and de-

bugging facilities. While the first stage was completed in a shorter period of time

than scheduled, the main implementation stage suffered from the aforementioned

problems. This led to delays which made it necessary to cut down the task to a

simplified version of the system, as well as reduce the time scheduled for testing and

evaluation.

The conclusion that can be drawn for future projects is to arrange enough

time for troubleshooting when dealing with unknown platforms. When working

under strict time constraints, stepping back to the research phase to develop an

alternative route is too time consuming to keep up with the project plan. Despite

the incomplete implementation, the research work carried out for this project and

the application design based on this research are still regarded as an adequate and

convincing approach for solving a complex problem on a platform with limited

resources.

6.4 Improvements

As mentioned previously, the original task of implementing a recognition system

based on a statistical method like factor graph belief propagation had to be

reduced to a simplified solution to the recognition problem. Thus, the most

obvious improvement would be to include an implementation of the factor graph

belief propagation method outlined in 3.4.3 into the finalised software. Based on

evaluation results from [ICS08], this is expected to improve both the processing

speed (as the template matching phase is abandoned), as well as the recognition

rate by relying on multiple cues, therefore removing a number of false positives

from the findings.

In order to improve the recognition rate for exit signs that differ slightly from

the ones shown in 1.1, the templates for the final stage could be split up into

their three components. For example, a variation of the signs shows the words

“FIRE EXIT” in capital letters, which would not exactly match the template.

In order to deal with this seemingly minimal difference, the matching algorithm

could simply look for the “running person” icon in the centre of the sign (only

two templates) and then perform matching for the three (icon facing the left) or

five (icon facing the right) arrow templates. This method would simply ignore

the presence of text in the sign, but the uniqueness of the icon and arrows is

expected to already guarantee correct results, and it would speed up the matching

performance by reducing the template size and number.

With respect to the implementation on the Symbian OS platform, it can be

stated that the produced code still needs to be optimised in some areas. In partic-

ular, the memory management can be improved by paying more attention to the

careful use of system memory, as well as using simplified or approximated algo-

rithms. As the image processing phase does not return to capture the next frame

until the processing is completely finished, it would have also been useful to im-

plement the CMSPImgProcessor class as an Active Object (see section 3.3). This

would have allowed the image capturing and the processing to run asynchronously

so that the next image is already fetched and prepared for processing while the

previous frame is still being analysed. It is obvious, however, that the purpose of

this study was mainly to carry out research into the topic and demonstrate a pos-

sible implementation, rather than producing a highly optimised piece of software

for a rather unknown platform. This also explains why this report does not dis-

cuss the interaction of the recognition software with other phone functions such


as incoming calls, text messages or other applications running in the background.

Of course, those features and events have to be considered when implementing

applications for mobile platforms outside an academic environment.

6.5 Chapter Summary

This chapter discussed the success of the study by evaluating it with respect to

the chosen solution to the problem of image processing on mobile platforms. This

included a review of the chosen approach which led to the conclusion that the

research methods used for the application were appropriate for the given task.

This was followed by a critical analysis of the project plan and suggestions for

improvements that could have been made to the system and overall development

process had time constraints not applied.

Chapter 7

Conclusion and Future Work

7.1 Project Summary

This report has depicted the process of developing an image recognition system

on a mobile platform which assists visually impaired users in finding emergency

exit signs. In the introduction we gave a description of the motivation behind the

study which is to make use of mobile phone technology as assistive devices for

visually impaired persons, and to carry out research into the feasibility of image

processing on mobile platforms. The system’s main objectives were given as a

sample flow of events when the application is used by a blind person to detect an

emergency exit sign.

The first task was to decide on which smartphone platform the software was to

be developed. Different platforms were discussed with respect to their processing

power and ease of developing applications, and it was decided that the software

was to be developed for Symbian OS smartphone models using its “native” pro-

gramming language Symbian C++. The Symbian S60 platform in particular was

deemed most appropriate due to its popularity and wide use on some powerful

devices such as the Nokia N95.

We then gave an extensive review of related work that discussed the different

approaches to the problem of image processing on devices with restricted

computing power. The different methods can be grouped

into server-client structures on one hand, where the captured image is sent to a

server for processing, and on-device processing on the other hand, out of which

the studies using factor graph belief propagation seemed the most successful and

efficient. Due to long file transfer times and possible lack of network connection,

the server-client approach was deemed unsuitable for the given task.

In the ensuing chapter a high-level description of the system architecture was

given in order to provide the reader with an overview over the most important

points in the development process. The application itself has been organised into

modules, each of them with a different function, that are able to interact over

clearly defined interfaces. The software’s structure and behaviour were described

using both text and appropriate UML diagrams. In this chapter, we also proposed

a simplified version of the rectangle detection method, as well as a description of

the more sophisticated belief propagation.

The software implementation was completed using the Carbide.c++ IDE,

provided by Symbian. It uses the camera API to capture both still and continuous

images that are then processed. In the first step, the image was converted to

greyscale and an edge extraction filter was applied, which produced an edge map.

The actual object detection was then carried out using factor graph belief prop-

agation, a message passing algorithm on a graphical model that computes the

belief of an image segment as the likelihood of it being part of the “figure” (as

opposed to the background). The final decision whether an emergency exit sign

was present in the image was then based on the (greyscale) histogram of the

thresholded sign and a template matching procedure.

Testing was carried out throughout the whole development process, as well

as after completing the implementation phase. It was essential to test both the

quality of the application (identifying the exit signs in various situations and from

various angles) and the performance of the processing module: Given the limited

processing power of mobile phones, can the image processing and identification

be run quickly enough? The necessary testing procedures were explained in the

respective chapter, along with an outline of the available test results.

Finally, a review of the application design and development process, along

with suggestions for possible improvements were given in the previous chapter.

The evaluation of the project was important to demonstrate the understanding of

the topic and the ability to critically analyse the work carried out for this study.

This dissertation discussed and combined methods taken from a number of

different research disciplines, such as signal processing, statistics and software

development for mobile platforms. This makes it a valuable piece of work that,

while providing an extensive review of the different areas and their application

for image processing tasks, may also function as a starting point for further ex-

ploration of the aforementioned topics. While not all of the main objectives were

achieved, the significant amount of research carried out for this study, as well as

the clearly laid out methodologies, the structured system design and the exploration

of different approaches to the problem demonstrate the general feasibility

of the task based on the chosen solution. As the use of factor graph belief propagation

for image processing tasks on mobile platforms is yet to be comprehensively

investigated, further research based on the conclusions drawn from this work

is strongly encouraged.

7.2 Future Work

In order to make the system’s output even more useful and accurate, the text on

the emergency exit sign could be analysed in addition to the other features that

have been discussed in this study. This could be achieved using the OCR¹ API

provided by the Symbian operating system. After detecting the section of the

sign that contains the text, the methods offered by the API take the bitmap and

information about the text region (bounding box, background colour) and return

the recognised text. The text can then be passed to the screen reader’s text-to-speech

engine to give the user more information about the sign content. While

the API has not been tested for this project, it is expected to deliver relatively

good results, considering it was designed with the aim of recognising very small

text such as addresses found on business cards.

Eventually, it would be an appropriate next step to research the feasibility of

utilising factor graph belief propagation for all stages of the recognition phase,

i.e. for grouping pixels in the straight line extraction phase, detecting rectangu-

lar structures and analysing the icons and arrows on the sign. This methodology

promises a very efficient implementation of detection procedures on computationally

weak mobile platforms. The success of this method is almost exclusively

based on the choice of suitable cues, which have to be carefully considered given

the complex structure of the different icons found on a sign. With respect to future

work, it would be particularly interesting to implement a general framework

for factor graph belief propagation in Symbian C++ in order to provide a basis

for further exploration of real-time image processing on this platform.

¹ Optical Character Recognition

By combining efficient object recognition through factor graph BP and OCR,

it would be possible to develop the system even further to recognise various types

of signs that combine icons and text using on-device processing. This is a highly

interesting application of the methodologies described in this project that could

even serve as a replacement for the server-client architectures currently

in use to carry out computationally heavy image processing tasks.

As the popularity of mobile platforms, and camera smartphones in particular,

is expected to grow even further in the future, it is desirable to continue exploring

their use not only for commercial software, but also for assisting people with

physical disabilities.

Appendix A

Listings

Sobel Operator

TInt CMSPImgProcessor::DetectEdges()
{
    TInt w = iSize.iWidth; TInt h = iSize.iHeight;
    TInt imgSize = w*h;
    [...]
    for(i=0; i<imgSize; i++)
    {
        // if we're at the first column - first pixel of a row
        if (i==(y+1)*w) { y++; x=0; }
        else { x++; }
        // initialise arrays with 0
        grad[i] = 0; xGrad[i] = 0; yGrad[i] = 0;
        // if we're not in the first/last column or row (image boundaries)
        if(x>0 && x<w-1 && y>0 && y<h-1)
        {
            // apply Sobel filter
            xGrad[i] = gImg[i+w+1] + gImg[i-w+1] + (2*gImg[i+1])
                     - gImg[i-w-1] - gImg[i+w-1] - (gImg[i-1]*2);
            yGrad[i] = gImg[i-w-1] + gImg[i-w+1] + (gImg[i-w]*2)
                     - gImg[i+w-1] - gImg[i+w+1] - (2*gImg[i+w]);
            grad[i] = Abs(xGrad[i]) + Abs(yGrad[i]);
            max = Max(grad[i], max);
        } // end if
    } // end for

    // normalise values to range 0..255
    // max is initialised with 1 to avoid division by zero
    TReal32 norm = 255.0/max;
    TReal32 g = 0.0;
    for(i=0; i<imgSize; i++)
    {
        edges[i] = 0; // initialise with 0 and only change if necessary
        if (grad[i] > 0)
        {
            g = static_cast<TReal32>(grad[i]);
            edges[i] = static_cast<TUint8>(g * norm);
        }
    }
}

Listing A.1: Sobel operator and normalisation
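
For experimenting with the filter outside the Symbian environment, the same computation can be sketched in portable standard C++. The function name SobelEdges, the use of std::vector and the row-major 8-bit greyscale input are assumptions made for this example. The kernel arithmetic and the normalisation follow Listing A.1, but this is an illustrative sketch and not the code used in the application.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Sobel edge map of a row-major 8-bit greyscale image of size w*h.
// The result has the same size, is normalised to 0..255, and is 0 on the
// image border, where the 3x3 kernel does not fit.
std::vector<std::uint8_t> SobelEdges(const std::vector<std::uint8_t>& grey,
                                     int w, int h)
{
    assert(w > 0 && h > 0);
    assert(grey.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));

    std::vector<int> grad(grey.size(), 0);
    int maxGrad = 1; // starts at 1 to avoid division by zero

    for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
            const int i = y * w + x;
            // horizontal gradient: right column minus left column
            const int gx = grey[i + w + 1] + grey[i - w + 1] + 2 * grey[i + 1]
                         - grey[i - w - 1] - grey[i + w - 1] - 2 * grey[i - 1];
            // vertical gradient: upper row minus lower row
            const int gy = grey[i - w - 1] + grey[i - w + 1] + 2 * grey[i - w]
                         - grey[i + w - 1] - grey[i + w + 1] - 2 * grey[i + w];
            grad[i] = std::abs(gx) + std::abs(gy);
            maxGrad = std::max(maxGrad, grad[i]);
        }
    }

    // scale so that the strongest gradient maps to 255
    std::vector<std::uint8_t> edges(grey.size(), 0);
    const float norm = 255.0f / static_cast<float>(maxGrad);
    for (std::size_t i = 0; i < grad.size(); ++i) {
        edges[i] = static_cast<std::uint8_t>(static_cast<float>(grad[i]) * norm);
    }
    return edges;
}
```

Iterating over x and y directly avoids the running row and column counters of the listing. The gradient magnitude is approximated by the sum of absolute values, as in the listing, which avoids a square root per pixel.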

Template Matching

TInt CMSPImgProcessor::MatchTemplate(CFbsBitmap* srcBitmap, TInt th)
{
    TInt direction = -1;
    // load the bitmap from an .mbm file
    _LIT(KMBMFileName, "z:\\resource\\apps\\Templates.mbm");
    // create a new bitmap for the templates and push
    // on the cleanup stack
    CFbsBitmap* atemplate = new (ELeave) CFbsBitmap();
    CleanupStack::PushL(atemplate);
    TInt imgSize = srcBitmap->SizeInPixels().iHeight
                 * srcBitmap->SizeInPixels().iWidth;
    // lock the global heap
    srcBitmap->LockHeap(ETrue);
    TUint8* src = (TUint8*) srcBitmap->DataAddress();
    for(TInt i = 0; i<8; i++)
    {
        TInt sum = 0;
        // load the template
        User::LeaveIfError(atemplate->Load(KMBMFileName, i));
        TUint8* temp = (TUint8*) atemplate->DataAddress();
        for (TInt j = 0; j<imgSize; j++)
        {
            // compute difference between source and template at pixel j
            sum += Abs(temp[j]-src[j]);
        }
        // if the difference for one template is less
        // than the threshold
        if (sum < th) { direction = i; break; }
    }
    srcBitmap->UnlockHeap(ETrue);
    CleanupStack::PopAndDestroy(atemplate);
    return direction;
}

Listing A.2: Template matching to determine the arrow direction
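
The matching criterion of Listing A.2 is a sum of absolute differences (SAD) between the source image and each template. The following sketch shows that criterion in portable standard C++, without the Symbian bitmap handling. The Image alias, the in-memory template list and this free-function form of MatchTemplate are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

using Image = std::vector<std::uint8_t>; // row-major 8-bit pixels

// Returns the index of the first template whose sum of absolute pixel
// differences to `src` is below `threshold`, or -1 if no template matches.
// All templates must have the same number of pixels as `src`.
int MatchTemplate(const Image& src, const std::vector<Image>& templates,
                  long threshold)
{
    for (std::size_t t = 0; t < templates.size(); ++t) {
        assert(templates[t].size() == src.size());
        long sum = 0;
        for (std::size_t j = 0; j < src.size(); ++j) {
            // per-pixel absolute difference between template and source
            sum += std::abs(static_cast<int>(templates[t][j])
                            - static_cast<int>(src[j]));
        }
        if (sum < threshold) {
            return static_cast<int>(t);
        }
    }
    return -1;
}
```

As in the listing, the search stops at the first template below the threshold rather than at the best match. Selecting the template with the smallest sum would be an alternative design if several templates can fall below the threshold.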
