
    Text-Tracking Wearable Camera System for Visually-Impaired People

    Makoto Tanaka

Graduate School of Information Sciences, Tohoku University, Japan

mc41229n@sc.isc.tohoku.ac.jp

Hideaki Goto

Cyberscience Center, Tohoku University, Japan

hgot@isc.tohoku.ac.jp

Abstract

The inability to read text visually has a huge impact on the quality of life of visually impaired people. One of the most anticipated assistive devices is a wearable camera capable of finding text regions in natural scenes and translating the text into another representation such as sound or braille. Developing such a device requires text tracking in video sequences as well as text detection: homogeneous text regions must be grouped to avoid multiple, redundant speech syntheses or braille conversions.

We have developed a prototype system equipped with a head-mounted video camera. Text regions are extracted from the video frames using a revised DCT feature, and particle filtering is employed for fast and robust text tracking. We have tested the performance of the system on 1,000 video frames of a hallway containing eight signboards. The number of text candidate images is reduced to 0.98% of the detected regions.

1. Introduction

We human beings make the most of the text information in surrounding scenes in our daily lives. The inability to read text visually therefore has a huge impact on the quality of life of visually impaired people. Although several devices have been designed to help visually impaired people perceive objects through an alternative sense such as sound or touch, the development of text reading devices is still at an early stage. Character recognition for the visually disabled is one of the most difficult tasks, since characters have complex shapes and are very small compared with physical obstacles. One of the most anticipated devices is probably a wearable camera capable of finding text regions in natural scenes and translating the text into another representation such as sound or braille.

An alternative device is a helper robot that can recognize characters in human living spaces and read out the text information for the user. Several robots with character recognition capability have been proposed so far [2, 3, 4, 6]. Iwatsuka et al. proposed a guide dog system for blind people [2]. We presented a text capturing robot equipped with an active camera [8]; the robot can find and track multiple text regions in the surrounding scene. In these robot applications, however, camera movement is constrained by simple and steady robot motions. A wearable camera, on the other hand, can be moved freely, so a robust text tracking method is needed to develop a text capturing device.

Duplicate text strings may appear in consecutive video frames. Recognizing every text string in every image wastes time and, more importantly, the camera user would not want to repeatedly hear a synthesized voice originating from the same text. Merino and Mirmehdi presented a framework for real-time text detection and tracking and demonstrated a system [5], but its text tracking performance was not fully satisfactory and considerable improvement is still needed.

In this paper, we present a wearable camera system that can automatically find and track text regions in the surrounding scene. The system is equipped with a head-mounted video camera. Text strings are extracted using the revised DCT-based method [1], and the detected text regions are then grouped into image chains by a text tracking method based on particle filtering.

Section 2 gives an overview of the wearable camera system and the algorithms used. Section 3 describes the text tracking algorithm. Section 4 presents experimental results and a performance evaluation.

    2. Wearable Camera System

    2.1. Overview of the system

Figure 1 shows the prototype of the wearable camera system which we have constructed. The system consists of a head-mounted camera (380k-pixel color CCD), an NTSC-DV converter, and a laptop PC running Linux. The NTSC video signal is converted to a DV stream, and the video frames are captured at 352 × 224-pixel resolution.

Figure 1. Wearable Camera System.

The text capture and tracking method proposed in this paper consists of the following stages:

1. Partition each frame into 16 × 16-pixel blocks.

2. Extract text blocks using the revised DCT feature.

3. Generate text candidate regions by merging adjoining text blocks.

4. Create chains (groups) of text regions by finding the correspondences of text regions between the (n-1)th frame and the nth frame using particle filtering.

5. Filter out non-text regions in the chains.

6. Select the best text image in each chain.

    2.2. Text region detection

We have employed the revised DCT-based feature proposed in [1] for text extraction from scene images. The high-wide (HWide) frequency band of the DCT coefficient matrix is used, together with discriminant analysis-based thresholding. The blocks in the image are classified into two classes: text blocks, whose feature values are greater than the automatically found threshold, and non-text blocks.
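As a rough illustration of this stage, the sketch below computes a 2-D DCT for each 16 × 16 block, sums the absolute coefficients of a high-frequency band as the block feature, and applies discriminant-analysis (Otsu-style) thresholding to the feature values. The exact definition of the HWide band is given in [1]; the band mask chosen here and the use of NumPy/SciPy are our own assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.fft import dctn  # 2-D DCT (type II)

BLOCK = 16  # block size used throughout the paper

def block_features(gray):
    """Per-block text feature: sum of absolute high-frequency DCT coefficients.
    gray: 2-D grayscale array. The coefficient band used here is an assumed
    stand-in for the HWide band defined in [1]."""
    h = (gray.shape[0] // BLOCK) * BLOCK
    w = (gray.shape[1] // BLOCK) * BLOCK
    feats = np.empty((h // BLOCK, w // BLOCK))
    for by in range(h // BLOCK):
        for bx in range(w // BLOCK):
            blk = gray[by*BLOCK:(by+1)*BLOCK, bx*BLOCK:(bx+1)*BLOCK].astype(float)
            coeffs = dctn(blk, norm='ortho')
            coeffs[0, 0] = 0.0                                 # ignore the DC component
            feats[by, bx] = np.abs(coeffs[:, BLOCK//2:]).sum() # assumed high/wide band
    return feats

def discriminant_threshold(values, bins=64):
    """Discriminant-analysis (Otsu) thresholding: maximize between-class variance."""
    hist, edges = np.histogram(values, bins=bins)
    p = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(p)                # class-0 probability up to each bin
    m0 = np.cumsum(p * centers)      # class-0 cumulative first moment
    mt = m0[-1]                      # total mean
    with np.errstate(divide='ignore', invalid='ignore'):
        sigma_b = (mt * w0 - m0) ** 2 / (w0 * (1.0 - w0))
    sigma_b[~np.isfinite(sigma_b)] = 0.0
    return centers[np.argmax(sigma_b)]

def classify_blocks(gray):
    """Boolean block map: True where the block feature exceeds the found threshold."""
    feats = block_features(gray)
    return feats > discriminant_threshold(feats.ravel())
```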

After extracting the text blocks, connected components of the blocks are generated. The bounding box of each connected component is regarded as a text region. Text regions whose area is smaller than four blocks are discarded, because they are considered too small to contain legible text strings.
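A minimal rendering of this grouping step, assuming 4-connectivity on the block grid and block-unit bounding boxes (both our own choices), might look as follows.

```python
from collections import deque

def block_regions(text_mask, min_blocks=4):
    """Group connected text blocks and return their bounding boxes.
    text_mask: 2-D boolean array of block classifications (True = text block).
    Components smaller than min_blocks are discarded as too small to be legible.
    Returns boxes as (x0, y0, x1, y1) in block coordinates, inclusive."""
    rows, cols = text_mask.shape
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for y in range(rows):
        for x in range(cols):
            if not text_mask[y][x] or seen[y][x]:
                continue
            queue, cells = deque([(y, x)]), []
            seen[y][x] = True
            while queue:                          # breadth-first flood fill
                cy, cx = queue.popleft()
                cells.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < rows and 0 <= nx < cols \
                            and text_mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(cells) >= min_blocks:          # four-block minimum size filter
                ys, xs = zip(*cells)
                regions.append((min(xs), min(ys), max(xs), max(ys)))
    return regions
```

Applied to the block map from the previous sketch, each returned box would be scaled by the 16-pixel block size to obtain a text candidate region in image coordinates.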

    2.3. Text tracking

The text tracking process corresponds to the fourth stage; the details are given in Section 3.

A new chain is created when a text candidate region without any correspondence appears in the current frame, and a new label is assigned to the region/chain.

Figure 2. Selecting the most appropriate image.

When a chain of text candidate regions is broken in a frame, the chain is extracted and its validity is examined. If the chain is shorter than three frames, it is regarded as noise and discarded.
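The chain bookkeeping described above can be sketched as follows; the dictionary layout and helper names are hypothetical, but the rules (a new label for an uncorresponded region, discarding chains shorter than three frames) follow the text.

```python
def update_chains(active, labelled_regions, frame_no, finished, min_length=3):
    """active:  dict mapping chain label -> list of (frame_no, region)
    labelled_regions: list of (label, region) for the current frame,
                      where label is None if the region matched no chain
    finished: list collecting validated (closed) chains"""
    next_label = max(active, default=-1) + 1
    updated = set()
    for label, region in labelled_regions:
        if label is None:                     # no correspondence: create a new chain
            label, next_label = next_label, next_label + 1
        active.setdefault(label, []).append((frame_no, region))
        updated.add(label)
    for label in [l for l in active if l not in updated]:
        chain = active.pop(label)             # the chain is broken in this frame
        if len(chain) >= min_length:          # shorter chains are treated as noise
            finished.append(chain)
    return active, finished
```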

    2.4. Selecting the most appropriate text images

All the text candidate regions belonging to the chains are examined, and images that appear to be non-text are removed using the edge-count method in [7].

It is then necessary to pick the most appropriate text images for character recognition. One text image is selected per chain and fed to the next stage, which is most likely character recognition. The region with the longest horizontal extent is chosen as the most appropriate image, as shown in Figure 2. This selection scheme makes sense in many cases, since the user is expected to approach a signboard closely and as fronto-parallel as possible.
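Assuming each chain element carries its bounding box, the selection rule reduces to a one-liner (the tuple layout here is our own assumption):

```python
def best_text_image(chain):
    """Return the chain element whose region has the longest horizontal extent.
    Each element is assumed to be (frame_no, (x0, y0, x1, y1), image)."""
    return max(chain, key=lambda item: item[1][2] - item[1][0])
```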

Our current system is not equipped with a character recognition process, since we are concentrating only on text detection and tracking.

    3. Text Tracking Using Particle Filtering

    3.1. Region-based matching

We have employed particle filtering for the tracking of text candidate regions. Particle filtering has been used successfully in various object tracking applications. We have tested two matching schemes in the text tracking.

The first method is region-based matching. Particles are scattered around the predicted center point of a text region in the current frame, starting from the center of each text candidate region in the previous frame, as shown in Figure 3. Particles that fall outside all of the detected text regions have their weight set to zero. If a particle falls into a detected text region, its weight is set to the similarity between the previous and the current text regions. All particles carry the same label as their source text region.

  • 8/6/2019 TuBCT10.28 Text Tracking Nicht Gelesen

    3/4

The similarity s_{1,2} between regions 1 and 2 is defined as

    s_{1,2} = \frac{1}{d_{1,2} + \varepsilon},    (1)

where d_{1,2} is the distance defined below and \varepsilon is a small value (\varepsilon = 1) to avoid divergence.

The cumulative histogram is used as a measure of the dissimilarity between text regions, since it can represent the color distribution at small computational cost. The cumulative histogram H(z) is given as

    H(z) = \sum_{i=0}^{z} h(i),    (2)

where h(i) denotes the normal histogram and i denotes the intensity. For comparing two cumulative histograms H_{1,c}(z) and H_{2,c}(z), where c denotes one of the RGB color channels, the following city-block distance is used:

    d_{1,2} = \sum_{c} \sum_{z=0}^{255} |H_{1,c}(z) - H_{2,c}(z)|.    (3)

The weighted centers of the particles are found in the current frame. The label of a current text region is determined by finding the weighted center nearest to the center of the region.
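Equations (1) to (3) translate almost directly into code. The sketch below assumes 8-bit RGB pixels gathered from a text region into an (N, 3) NumPy array; the function names are ours.

```python
import numpy as np

EPSILON = 1.0  # the small constant of Eq. (1)

def cumulative_histograms(region_pixels):
    """Normalized cumulative histograms H_c(z) for the three RGB channels.
    region_pixels: (N, 3) uint8 array of the pixels inside a text region."""
    cum = []
    for c in range(3):
        h, _ = np.histogram(region_pixels[:, c], bins=256, range=(0, 256))
        h = h / h.sum()               # normal histogram h(i)
        cum.append(np.cumsum(h))      # H(z) = sum over i <= z of h(i), Eq. (2)
    return cum

def distance(hists1, hists2):
    """City-block distance between two sets of cumulative histograms, Eq. (3)."""
    return sum(np.abs(a - b).sum() for a, b in zip(hists1, hists2))

def similarity(hists1, hists2):
    """s = 1 / (d + epsilon), Eq. (1)."""
    return 1.0 / (distance(hists1, hists2) + EPSILON)
```

A particle that lands inside a detected region would then receive the similarity between its source region and that region as its weight, and zero weight otherwise, as described above.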

    3.2. Block-based matching

The second method is block-based matching, which is similar to the one used in our previous work [8]. The same particle filtering algorithm is applied to text blocks instead of text candidate regions. A chain label is assigned to each text block in the current frame, and the region label is then determined by voting, as shown in Figure 4: the most popular label is used.

When more than one region in the current frame corresponds to a text region in the previous frame, the distance between every pair of the current regions is measured. If the side-to-side distance is lower than a pre-determined threshold, or the regions overlap each other, they are merged. Otherwise, a new label is assigned to the region that is farther from the corresponding region in the previous frame. This merging process has been introduced to make the text tracking more robust to temporary breakups of multiple text lines.
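The label vote and the merge test might be sketched as follows; the box convention and the gap_threshold parameter name are illustrative placeholders rather than the authors' values.

```python
from collections import Counter

def vote_region_label(block_labels):
    """Choose the most popular chain label among a region's text blocks."""
    return Counter(block_labels).most_common(1)[0][0]

def side_to_side_gap(a, b):
    """Smallest axis-aligned gap between boxes (x0, y0, x1, y1); 0 if they overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return max(dx, dy)

def maybe_merge(a, b, gap_threshold):
    """Merge two current-frame regions that overlap or lie close enough together."""
    if side_to_side_gap(a, b) <= gap_threshold:
        return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
    return None    # caller assigns a new label to the farther region instead
```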

    4. Experimental Results and Discussions

    4.1. Experimental setup

Experimental images were taken in the hallway of our building, where eight signboards containing text are mounted on the walls.

Figure 3. Region-based text tracking.

Figure 4. Label selection by voting in block-based text tracking.

A sighted person wore the camera, walked along the hallway, and approached every signboard. 1,000 video frames were obtained in total. Every signboard in the scene was approached closely enough for text detection.

The variance of the particles' random walk is 13.0, and the number of particles per region (or block) is 500. Only the velocity is taken into account in the particle filtering.
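With these parameters, the prediction and weighting steps of the particle filter could be sketched as below; the constant-velocity state and rectangle layout are our assumptions, while the particle count and random-walk variance are the values quoted above.

```python
import numpy as np

N_PARTICLES = 500        # particles per region (or block)
WALK_VARIANCE = 13.0     # variance of the particles' random walk

def predict_particles(center, velocity, rng=None):
    """Scatter particles around the constant-velocity prediction of a region center."""
    rng = rng or np.random.default_rng()
    predicted = np.asarray(center, float) + np.asarray(velocity, float)
    noise = rng.normal(0.0, np.sqrt(WALK_VARIANCE), size=(N_PARTICLES, 2))
    return predicted + noise                      # (500, 2) particle positions

def weight_particles(particles, regions, similarity_to):
    """Zero weight outside every detected region; otherwise the color similarity
    (Section 3.1) between the source region and the region the particle hit."""
    weights = np.zeros(len(particles))
    for i, (x, y) in enumerate(particles):
        for box in regions:                       # box = (x0, y0, x1, y1)
            if box[0] <= x <= box[2] and box[1] <= y <= box[3]:
                weights[i] = similarity_to(box)
                break
    return weights

def weighted_center(particles, weights):
    """Weighted mean position, used to assign chain labels to current regions."""
    return (particles * weights[:, None]).sum(axis=0) / max(weights.sum(), 1e-9)
```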

    4.2. Evaluations and discussions

In total, 5,192 text candidate regions were found using the DCT feature. After text region tracking, the number of text images to be passed to the character recognition stage was significantly reduced to 51, i.e. to 0.98% of the candidates. Table 1 shows the numbers of extracted text images and the average processing times per frame. We also tested our former text tracking method [8] for comparison.

The performance of the region-based tracking is almost the same as that of the conventional method, while the particle filtering-based method with block-based matching outperforms the others. Our text tracking program (single-threaded) runs on a laptop PC with a Core2Duo 1.06 GHz processor. The processing speed without DV decoding is 10.0 fps (100 msec per frame), which is near-realtime.

Some text tracking results are shown in Figure 5. The text regions are detected and tracked well, and the text region filter based on edge counts works quite effectively; the images of the upper door edge are discarded by the filtering. Some examples of automatically selected text images are shown in Figure 6. All the signboards have been successfully captured.

Figure 5. Results of text region tracking.

Table 1. Numbers of extracted text images and average processing times per frame (msec).

                      Filtered            No filter
                      images   time       images   time
  Conventional [8]     113      56         185      56
  Region-based         111      56         176      56
  Block-based           51     100          80     100

Figure 6. Extracted text images.

Although the proposed method performs best so far, a number of duplicate or non-text images remain: the value 51 is about 6.4 times the ideal number of eight text images. Many splits of text region chains occur where multiple text lines are close to each other, so further improvements are needed. In addition, combining our method with a super-resolution technique would be an interesting topic.

    5. Conclusions

We have presented a wearable camera system that automatically finds and tracks text regions in surrounding scenes in near-realtime. The proposed text tracking method is based on particle filtering and can effectively reduce the number of text images to be recognized. In an experiment in an indoor scene, our system reduced the number of text candidate images to 0.98%.

Our future work includes making the text detection and tracking more robust to quick camera movements, and developing a filtering method to deal with blurred or corrupted frames.

    References

[1] H. Goto. Redefining the DCT-based feature for scene text detection: Analysis and comparison of spatial frequency-based features. International Journal on Document Analysis and Recognition (IJDAR), 2008. (Online First article).

[2] K. Iwatsuka, K. Yamamoto, and K. Kato. Development of a guide dog system for the blind people with character recognition ability. Proceedings 17th International Conference on Pattern Recognition, pages 683-686, 2004.

[3] D. Letourneau, F. Michaud, and J.-M. Valin. Autonomous Mobile Robot That Can Read. EURASIP Journal on Applied Signal Processing, 17:2650-2662, 2004.

[4] D. Letourneau, F. Michaud, J.-M. Valin, and C. Proulx. Textual message read by a mobile robot. Proceedings IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2724-2729, 2003.

[5] C. Merino and M. Mirmehdi. A Framework Towards Realtime Detection and Tracking of Text. Second International Workshop on Camera-Based Document Analysis and Recognition (CBDAR2007), pages 10-17, 2007.

[6] J. Samarabandu and X. Liu. An Edge-based Text Region Extraction Algorithm for Indoor Mobile Robot Navigation. International Journal of Signal Processing, 3(4):273-280, 2006.

[7] H. Shiratori, H. Goto, and H. Kobayashi. An Efficient Text Capture Method for Moving Robots Using DCT Feature and Text Tracking. Proceedings 18th International Conference on Pattern Recognition (ICPR2006), 2:1050-1053, 2006.

[8] M. Tanaka and H. Goto. Autonomous Text Capturing Robot Using Improved DCT Feature and Text Tracking. 9th International Conference on Document Analysis and Recognition (ICDAR2007), Volume II, pages 1178-1182, 2007.