
    Text-Tracking Wearable Camera System for Visually-Impaired People

    Makoto Tanaka

Graduate School of Information Sciences, Tohoku University, Japan

mc41229n@sc.isc.tohoku.ac.jp

Hideaki Goto

Cyberscience Center, Tohoku University, Japan

hgot@isc.tohoku.ac.jp

Abstract

The inability to read text visually has a huge impact on the quality of life of visually impaired people. One of the most anticipated assistive devices is a wearable camera capable of finding text regions in natural scenes and translating the text into another representation such as sound or braille. Developing such a device requires text tracking in video sequences as well as text detection: homogeneous text regions must be grouped to avoid multiple, redundant speech syntheses or braille conversions.

We have developed a prototype system equipped with a head-mounted video camera. Text regions are extracted from the video frames using a revised DCT feature, and particle filtering is employed for fast and robust text tracking. We have tested the performance of the system on 1,000 video frames of a hallway containing eight signboards. The number of text candidate images is reduced to 0.98% of the detected regions.

1. Introduction

We human beings make the most of the text information in surrounding scenes in our daily lives. The inability to read text visually therefore has a huge impact on the quality of life of visually impaired people. Although several devices have been designed to help visually impaired people perceive objects through an alternative sense such as sound or touch, the development of text reading devices is still at an early stage. Character recognition for the visually disabled is one of the most difficult tasks, since characters have complex shapes and are very small compared with physical obstacles. One of the most anticipated devices is probably a wearable camera capable of finding text regions in natural scenes and translating the text into another representation such as sound or braille.

An alternative device is a helper robot that can recognize characters in human living spaces and read out the text information for the user. Several robots with character recognition capability have been proposed so far [2, 3, 4, 6]. Iwatsuka et al. proposed a guide dog system for blind people [2]. We presented a text capturing robot equipped with an active camera [8]; the robot can find and track multiple text regions in the surrounding scene. In these robot applications, however, camera movement is constrained by simple and steady robot motions. A wearable camera, on the other hand, can be moved freely, so a robust text tracking method is needed to develop a text capturing device.

Duplicate text strings may appear in consecutive video frames. Recognizing every text string in every image wastes time and, more importantly, the camera user would not want to repeatedly hear a synthesized voice originating from the same text. Merino and Mirmehdi presented a framework for real-time text detection and tracking and demonstrated a system [5], but its text tracking performance was not fully satisfactory and considerable improvement is still needed.

In this paper, we present a wearable camera system that can automatically find and track text regions in the surrounding scene. The system is equipped with a head-mounted video camera. Text strings are extracted using the revised DCT-based method [1], and the detected text regions are then grouped into image chains by a text tracking method based on particle filtering.

Section 2 gives an overview of the wearable camera system and the algorithms used. Section 3 describes the text tracking algorithm. Section 4 presents experimental results and a performance evaluation.

    2. Wearable Camera System

    2.1. Overview of the system

Figure 1 shows the prototype of the wearable camera system which we have constructed. The system consists of a head-mounted camera (380k-pixel color CCD), an NTSC-DV converter, and a laptop PC running Linux. The NTSC video signal is converted to a DV stream, and the video frames are captured at 352 × 224-pixel resolution.

Figure 1. Wearable Camera System.

The text capture and tracking method proposed in this paper consists of the following stages:

1. Partition each frame into 16 × 16-pixel blocks.

2. Extract text blocks using the revised DCT feature.

3. Generate text candidate regions by merging adjoining text blocks.

4. Create chains (groups) of text regions by finding the correspondences of text regions between the (n-1)th frame and the nth frame using particle filtering.

5. Filter out non-text regions in the chains.

6. Select the best text image in each chain.

    2.2. Text region detection

We have employed the revised DCT-based feature proposed in [1] for text extraction from scene images. The high-wide (HWide) frequency band of the DCT coefficient matrix is used, together with discriminant analysis-based thresholding. The blocks in the image are classified into two classes: text blocks, whose feature values are greater than the automatically found threshold, and non-text blocks.
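As a rough illustration of this stage, the sketch below computes a 2-D DCT for each 16 × 16 block, sums the absolute coefficients of a high-frequency band as the block feature, and applies discriminant-analysis (Otsu-style) thresholding to the feature values. The exact definition of the HWide band is given in [1]; the band mask chosen here and the use of NumPy/SciPy are our own assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.fft import dctn  # 2-D DCT (type II)

BLOCK = 16  # block size used throughout the paper

def block_features(gray):
    """Per-block text feature: sum of absolute high-frequency DCT coefficients.
    gray: 2-D grayscale array. The coefficient band used here is an assumed
    stand-in for the HWide band defined in [1]."""
    h = (gray.shape[0] // BLOCK) * BLOCK
    w = (gray.shape[1] // BLOCK) * BLOCK
    feats = np.empty((h // BLOCK, w // BLOCK))
    for by in range(h // BLOCK):
        for bx in range(w // BLOCK):
            blk = gray[by*BLOCK:(by+1)*BLOCK, bx*BLOCK:(bx+1)*BLOCK].astype(float)
            coeffs = dctn(blk, norm='ortho')
            coeffs[0, 0] = 0.0                                 # ignore the DC component
            feats[by, bx] = np.abs(coeffs[:, BLOCK//2:]).sum() # assumed high/wide band
    return feats

def discriminant_threshold(values, bins=64):
    """Discriminant-analysis (Otsu) thresholding: maximize between-class variance."""
    hist, edges = np.histogram(values, bins=bins)
    p = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(p)                # class-0 probability up to each bin
    m0 = np.cumsum(p * centers)      # class-0 cumulative first moment
    mt = m0[-1]                      # total mean
    with np.errstate(divide='ignore', invalid='ignore'):
        sigma_b = (mt * w0 - m0) ** 2 / (w0 * (1.0 - w0))
    sigma_b[~np.isfinite(sigma_b)] = 0.0
    return centers[np.argmax(sigma_b)]

def classify_blocks(gray):
    """Boolean block map: True where the block feature exceeds the found threshold."""
    feats = block_features(gray)
    return feats > discriminant_threshold(feats.ravel())
```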

After extracting the text blocks, connected components of the blocks are generated. The bounding box of each connected component is regarded as a text region. Text regions whose area is smaller than four blocks are discarded, because they are considered too small to contain legible text strings.
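A minimal rendering of this grouping step, assuming 4-connectivity on the block grid and block-unit bounding boxes (both our own choices), might look as follows.

```python
from collections import deque

def block_regions(text_mask, min_blocks=4):
    """Group connected text blocks and return their bounding boxes.
    text_mask: 2-D boolean array of block classifications (True = text block).
    Components smaller than min_blocks are discarded as too small to be legible.
    Returns boxes as (x0, y0, x1, y1) in block coordinates, inclusive."""
    rows, cols = text_mask.shape
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for y in range(rows):
        for x in range(cols):
            if not text_mask[y][x] or seen[y][x]:
                continue
            queue, cells = deque([(y, x)]), []
            seen[y][x] = True
            while queue:                          # breadth-first flood fill
                cy, cx = queue.popleft()
                cells.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < rows and 0 <= nx < cols \
                            and text_mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(cells) >= min_blocks:          # four-block minimum size filter
                ys, xs = zip(*cells)
                regions.append((min(xs), min(ys), max(xs), max(ys)))
    return regions
```

Applied to the block map from the previous sketch, each returned box would be scaled by the 16-pixel block size to obtain a text candidate region in image coordinates.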

    2.3. Text tracking

The text tracking process corresponds to the fourth stage; the details are given in Section 3.

A new chain is created when a text candidate region without any correspondence appears in the current frame, and a new label is assigned to the region/chain.

Figure 2. Selecting the most appropriate image.

When a chain of text candidate regions is broken in a frame, the chain is extracted and its validity is examined. If the chain is shorter than three frames, it is regarded as noise and discarded.
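The chain bookkeeping described above can be sketched as follows; the dictionary layout and helper names are hypothetical, but the rules (a new label for an uncorresponded region, discarding chains shorter than three frames) follow the text.

```python
def update_chains(active, labelled_regions, frame_no, finished, min_length=3):
    """active:  dict mapping chain label -> list of (frame_no, region)
    labelled_regions: list of (label, region) for the current frame,
                      where label is None if the region matched no chain
    finished: list collecting validated (closed) chains"""
    next_label = max(active, default=-1) + 1
    updated = set()
    for label, region in labelled_regions:
        if label is None:                     # no correspondence: create a new chain
            label, next_label = next_label, next_label + 1
        active.setdefault(label, []).append((frame_no, region))
        updated.add(label)
    for label in [l for l in active if l not in updated]:
        chain = active.pop(label)             # the chain is broken in this frame
        if len(chain) >= min_length:          # shorter chains are treated as noise
            finished.append(chain)
    return active, finished
```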

    2.4. Selecting the most appropriate text images

All the text candidate regions belonging to the chains are examined, and images that appear to be non-text are removed using the edge-count method in [7].

It is then necessary to pick the most appropriate text images for character recognition. One text image is selected per chain and fed to the next stage, which is most likely character recognition. The region with the longest horizontal extent is chosen as the most appropriate image, as shown in Figure 2. This selection scheme makes sense in many cases, since the user is expected to approach a signboard closely and as fronto-parallel as possible.
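Assuming each chain element carries its bounding box, the selection rule reduces to a one-liner (the tuple layout here is our own assumption):

```python
def best_text_image(chain):
    """Return the chain element whose region has the longest horizontal extent.
    Each element is assumed to be (frame_no, (x0, y0, x1, y1), image)."""
    return max(chain, key=lambda item: item[1][2] - item[1][0])
```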

Our current system is not equipped with a character recognition process, since we are concentrating only on text detection and tracking.

    3. Text Tracking Using Particle Filtering

    3.1. Region-based matching

We have employed particle filtering for the tracking of text candidate regions. Particle filtering has been used successfully in various object tracking applications. We have tested two matching schemes in the text tracking.

The first method is region-based matching. Particles are scattered around the predicted center point of a text region in the current frame, starting from the center of each text candidate region in the previous frame, as shown in Figure 3. Particles that fall outside all of the detected text regions have their weight set to zero. If a particle falls into a detected text region, its weight is set to the similarity between the previous and the current text regions. All particles carry the same label as their source text region.

  • 8/6/2019 TuBCT10.28 Text Tracking Nicht Gelesen

    3/4

The similarity s_{1,2} between regions 1 and 2 is defined as

    s_{1,2} = \frac{1}{d_{1,2} + \varepsilon},    (1)

where d_{1,2} is the distance defined below and \varepsilon is a small value (\varepsilon = 1) to avoid divergence.

The cumulative histogram is used as a measure of the dissimilarity between text regions, since it can represent the color distribution at small computational cost. The cumulative histogram H(z) is given as

    H(z) = \sum_{i=0}^{z} h(i),    (2)

where h(i) denotes the normal histogram and i denotes the intensity. For comparing two cumulative histograms H_{1,c}(z) and H_{2,c}(z), where c denotes one of the RGB color channels, the following city-block distance is used:

    d_{1,2} = \sum_{c} \sum_{z=0}^{255} |H_{1,c}(z) - H_{2,c}(z)|.    (3)

The weighted centers of the particles are found in the current frame. The label of a current text region is determined by finding the weighted center nearest to the center of the region.
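Equations (1) to (3) translate almost directly into code. The sketch below assumes 8-bit RGB pixels gathered from a text region into an (N, 3) NumPy array; the function names are ours.

```python
import numpy as np

EPSILON = 1.0  # the small constant of Eq. (1)

def cumulative_histograms(region_pixels):
    """Normalized cumulative histograms H_c(z) for the three RGB channels.
    region_pixels: (N, 3) uint8 array of the pixels inside a text region."""
    cum = []
    for c in range(3):
        h, _ = np.histogram(region_pixels[:, c], bins=256, range=(0, 256))
        h = h / h.sum()               # normal histogram h(i)
        cum.append(np.cumsum(h))      # H(z) = sum over i <= z of h(i), Eq. (2)
    return cum

def distance(hists1, hists2):
    """City-block distance between two sets of cumulative histograms, Eq. (3)."""
    return sum(np.abs(a - b).sum() for a, b in zip(hists1, hists2))

def similarity(hists1, hists2):
    """s = 1 / (d + epsilon), Eq. (1)."""
    return 1.0 / (distance(hists1, hists2) + EPSILON)
```

A particle that lands inside a detected region would then receive the similarity between its source region and that region as its weight, and zero weight otherwise, as described above.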

    3.2. Block-based matching

The second method is block-based matching, which is similar to the one used in our previous work [8]. The same particle filtering algorithm is applied to text blocks instead of text candidate regions. A chain label is assigned to each text block in the current frame, and the region label is then determined by voting, as shown in Figure 4: the most popular label is used.

When more than one region in the current frame corresponds to a text region in the previous frame, the distance between every pair of the current regions is measured. If the side-to-side distance is lower than a pre-determined threshold, or the regions overlap each other, they are merged. Otherwise, a new label is assigned to the region that is farther from the corresponding region in the previous frame. This merging process has been introduced to make the text tracking more robust to temporary breakups of multiple text lines.
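The label vote and the merge test might be sketched as follows; the box convention and the gap_threshold parameter name are illustrative placeholders rather than the authors' values.

```python
from collections import Counter

def vote_region_label(block_labels):
    """Choose the most popular chain label among a region's text blocks."""
    return Counter(block_labels).most_common(1)[0][0]

def side_to_side_gap(a, b):
    """Smallest axis-aligned gap between boxes (x0, y0, x1, y1); 0 if they overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return max(dx, dy)

def maybe_merge(a, b, gap_threshold):
    """Merge two current-frame regions that overlap or lie close enough together."""
    if side_to_side_gap(a, b) <= gap_threshold:
        return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
    return None    # caller assigns a new label to the farther region instead
```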

    4. Experimental Results and Discussions

    4.1. Experimental setup

Experimental images were taken in the hallway of our building, where eight signboards containing text are mounted on the walls.

Figure 3. Region-based text tracking.

Figure 4. Label selection by voting in block-based text tracking.

A sighted person wore the camera, walked along the hallway, and approached every signboard. 1,000 video frames were obtained in total. Every signboard in the scene was approached closely enough for text detection.

The variance of the particles' random walk is 13.0, and the number of particles per region (or block) is 500. Only the velocity is taken into account in the particle filtering.
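With these parameters, the prediction and weighting steps of the particle filter could be sketched as below; the constant-velocity state and rectangle layout are our assumptions, while the particle count and random-walk variance are the values quoted above.

```python
import numpy as np

N_PARTICLES = 500        # particles per region (or block)
WALK_VARIANCE = 13.0     # variance of the particles' random walk

def predict_particles(center, velocity, rng=None):
    """Scatter particles around the constant-velocity prediction of a region center."""
    rng = rng or np.random.default_rng()
    predicted = np.asarray(center, float) + np.asarray(velocity, float)
    noise = rng.normal(0.0, np.sqrt(WALK_VARIANCE), size=(N_PARTICLES, 2))
    return predicted + noise                      # (500, 2) particle positions

def weight_particles(particles, regions, similarity_to):
    """Zero weight outside every detected region; otherwise the color similarity
    (Section 3.1) between the source region and the region the particle hit."""
    weights = np.zeros(len(particles))
    for i, (x, y) in enumerate(particles):
        for box in regions:                       # box = (x0, y0, x1, y1)
            if box[0] <= x <= box[2] and box[1] <= y <= box[3]:
                weights[i] = similarity_to(box)
                break
    return weights

def weighted_center(particles, weights):
    """Weighted mean position, used to assign chain labels to current regions."""
    return (particles * weights[:, None]).sum(axis=0) / max(weights.sum(), 1e-9)
```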

    4.2. Evaluations and discussions

In total, 5,192 text candidate regions were found using the DCT feature. After text region tracking, the number of text images to be passed to the character recognition stage was significantly reduced to 51, i.e. to 0.98% of the candidates. Table 1 shows the numbers of extracted text images and the average processing times per frame. We also tested our former text tracking method [8] for comparison.

The performance of the region-based tracking is almost the same as that of the conventional method, while the particle filtering-based method with block-based matching outperforms the others. Our text tracking program (single-threaded) runs on a laptop PC with a Core2Duo 1.06 GHz processor. The processing speed without DV decoding is 10.0 fps (100 msec per frame), which is near-realtime.

Some text tracking results are shown in Figure 5. The text regions are detected and tracked well, and the text region filter based on edge counts works quite effectively; the images of the upper door edge are discarded by the filtering. Some examples of automatically selected text images are shown in Figure 6. All the signboards have been successfully captured.

Figure 5. Results of text region tracking.

Table 1. Numbers of extracted text images and average processing times per frame (msec).

                      Filtered            No filter
                      images   time       images   time
  Conventional [8]     113      56         185      56
  Region-based         111      56         176      56
  Block-based           51     100          80     100

Figure 6. Extracted text images.

Although the proposed method performs best so far, a number of duplicate or non-text images remain: the value 51 is about 6.4 times the ideal number of eight text images. Many splits of text region chains occur where multiple text lines are close to each other, so further improvements are needed. In addition, combining our method with a super-resolution technique would be an interesting topic.

    5. Conclusions

We have presented a wearable camera system that automatically finds and tracks text regions in surrounding scenes in near-realtime. The proposed text tracking method is based on particle filtering and can effectively reduce the number of text images to be recognized. In an experiment in an indoor scene, our system reduced the number of text candidate images to 0.98%.

Our future work includes making the text detection and tracking more robust to quick camera movements, and developing a filtering method to deal with blurred or corrupted frames.

    References

[1] H. Goto. Redefining the DCT-based feature for scene text detection: Analysis and comparison of spatial frequency-based features. International Journal on Document Analysis and Recognition (IJDAR), 2008. (Online First article).

[2] K. Iwatsuka, K. Yamamoto, and K. Kato. Development of a guide dog system for the blind people with character recognition ability. Proceedings 17th International Conference on Pattern Recognition, pages 683-686, 2004.

[3] D. Letourneau, F. Michaud, and J.-M. Valin. Autonomous Mobile Robot That Can Read. EURASIP Journal on Applied Signal Processing, 17:2650-2662, 2004.

[4] D. Letourneau, F. Michaud, J.-M. Valin, and C. Proulx. Textual message read by a mobile robot. Proceedings IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2724-2729, 2003.

[5] C. Merino and M. Mirmehdi. A Framework Towards Realtime Detection and Tracking of Text. Second International Workshop on Camera-Based Document Analysis and Recognition (CBDAR2007), pages 10-17, 2007.

[6] J. Samarabandu and X. Liu. An Edge-based Text Region Extraction Algorithm for Indoor Mobile Robot Navigation. International Journal of Signal Processing, 3(4):273-280, 2006.

[7] H. Shiratori, H. Goto, and H. Kobayashi. An Efficient Text Capture Method for Moving Robots Using DCT Feature and Text Tracking. Proceedings 18th International Conference on Pattern Recognition (ICPR2006), 2:1050-1053, 2006.

[8] M. Tanaka and H. Goto. Autonomous Text Capturing Robot Using Improved DCT Feature and Text Tracking. 9th International Conference on Document Analysis and Recognition (ICDAR2007), Volume II, pages 1178-1182, 2007.