Localizing people in crosswalks with a moving handheld camera: proof of concept

Marc Lalonde, Claude Chapdelaine and Samuel Foucher
Vision and Imaging Team, CRIM, 405 Ogilvy Ave. #101, Montréal (Qc), Canada, H3N 1M3
Email: {marc.lalonde, claude.chapdelaine, samuel.foucher}@crim.ca

ABSTRACT

Although the tracking of people and objects in uncontrolled environments has been widely addressed in the literature, accurately localizing a subject with respect to a reference ground plane remains a major issue. This study describes an early prototype for the tracking and localization of pedestrians with a handheld camera. One application envisioned here is to analyze the trajectories of blind people crossing long crosswalks while following different audio signals as a guide. This kind of study is generally conducted manually, with an observer following a subject and logging his/her current position at regular time intervals with respect to a white grid painted on the ground. This study aims at automating that manual logging activity: with a marker attached to the subject's foot, a video of the crossing is recorded by a person following the subject, and a semi-automatic tool analyzes the video and estimates the trajectory of the marker with respect to the painted markings. Challenges include robustness to variations in lighting conditions (shadows, etc.), occlusions, and changes in camera viewpoint. Results are promising when compared to GNSS measurements.

Keywords: Tracking, color segmentation, moving camera.

1. INTRODUCTION

The present work is a small contribution to a larger project that aims at measuring the trajectory of blind subjects who are asked to cross a street intersection with the aid of various cuckoo-chirp type audio signals. Some signals may be better designed and more efficient at helping a blind person align himself/herself with the crosswalk. One key component of the project is to measure the person's deviation with respect to the center of the crosswalk. Although competing techniques are generally efficient and readily available (e.g. GPS), they are not quite appropriate in this context because of their bulkiness and limited spatial resolution (a resolution of 15 cm or better is required). The approach being explored here is based on the automatic analysis of the video footage taken during the experiments. However, vision-based tracking and localization of a person in an uncontrolled setting such as a city street is always a challenge. Moreover, for logistical as well as optical reasons, positioning a single fixed camera in such a way that the whole crossing event is recorded at good resolution is impossible: the street selected for the experiment is a six-lane boulevard with a median, and the total walking distance from one sidewalk to the other is about 30 meters. The strategy under study is therefore to use a moving, handheld camera to record the subject's displacement and then to track and localize his/her feet with respect to markings painted on the ground for spatial referencing. Such an approach is very convenient in terms of data acquisition (an observer merely needs to walk behind the subject with a camcorder, video acquisition and management is easy, resolution is always good, etc.), but it poses many challenges for the subsequent analysis phase: the subject must be tracked through a video shot that is unstable because of the means of acquisition, the markings themselves must be tracked for proper spatial referencing, and tracking must remain robust despite shadows, varying lighting conditions, occlusions caused by observers accompanying the subject for safety reasons, etc. Figure 1 shows a model of the crosswalk with the pair of 'vertical' white lines delimiting the corridor as well as additional lines painted on the road surface: an orange line in the center of the crosswalk, blue lines on each side that delimit the test area (for safety considerations), and 'horizontal' lines that can be counted to monitor positioning during the crossing. The distance between lines is known a priori.

Figure 1 - Grid made of the two standard 'vertical' white lines with additional lines painted on the street for positioning: an orange line in the middle of the crosswalk, two blue lines on each side and 12 'horizontal' lines. d(A,B)=120cm; d(B,D)=120cm; d(D,E)=120cm; d(A,C)=190cm.

2. PREVIOUS WORK

The work most closely related to what is presented in this paper is that of [1]. Their system consists of a wearable camera attached to a visually impaired person as well as an image analysis module that acts as an automatic navigation system. The system assists the person by detecting and tracking a white line found in the pedestrian area along roads and streets. A particle filter provides the tracking capability, where particles are rectangles in a video frame (particle likelihood is related to the number of white pixels inside the rectangle); a classifier makes sure that the tracked object is indeed a line. However, only a single line is tracked, which simplifies scene modeling, and no spatial referencing is done. Another body of literature that can provide hints toward a solution is related to automatic lane tracking for vehicles (e.g. [2][3][4]). In this context, good solutions are proposed with the help of geometric models of road markings as well as camera calibration. Camera calibration allows for the use of inverse perspective mapping, which creates a bird's eye view of the road suitable for lane detection and tracking. Many surveyed papers rely on particle filtering for lane tracking, where the particles are attached to each lane [4] or to parameters of the road model [3] and propagated using dynamic models that are in some cases based on motion information from the vehicle itself. This paper focuses on a problem that raises similar issues (robust line tracking) but in a different context:

- Camera calibration is impossible: the distance between the subject and the camera varies; the distance between the camera and the ground also varies depending on the height of the person holding it; the camera may pan, tilt, rotate slightly, or shift somewhat abruptly as a result of the walking motion of the holder, etc.
- Camera motion is of course unknown.
- Frame-to-frame registration (as an attempt to model the transition between tracking system states) is less reliable because of the non-rigid bodies (and their shadows) that may take up a sizeable portion of the image.
- Many lines must be tracked, so scene modeling should be approached differently.

3. THE PROPOSED APPROACH

3.1 Methodology

The objective is to perform spatial positioning of the blind subject as s/he crosses the street. The positioning is in fact relative to an origin that is arbitrarily set at the intersection between the orange line and the bottom horizontal white line. This relative positioning is possible if as many spatial cues as possible are detected and tracked in the video shot and then properly mapped to the physical markings on the street. Such a mapping is unambiguous for the 'vertical' lines, but in the case of the horizontal white lines, line counting is necessary in order to determine which portion of the crosswalk is being observed by the moving camera. Each video frame is analyzed to detect situations where the subject's feet are found inside a parallelogram made of a pair of vertical and a pair of horizontal lines, in which case the mapping information can be used to deduce his/her relative position.
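As a simple illustration of the per-frame test (the actual tool is implemented in Matlab; the Python sketch below and its names are assumptions, not the authors' code), the tracked foot point can be checked against the convex quadrilateral formed by the tracked line intersections; only then is the mapping of Section 3.5 applied.

```python
# Minimal sketch: is point p inside a convex quadrilateral whose four corners
# (image coordinates of tracked line intersections) are given in consistent order?
import numpy as np

def point_in_quad(p, corners):
    """corners: four (x, y) points in consistent (e.g. counter-clockwise) order."""
    p = np.asarray(p, dtype=float)
    c = np.asarray(corners, dtype=float)
    signs = []
    for i in range(4):
        a, b = c[i], c[(i + 1) % 4]
        # Sign of the cross product tells on which side of edge a->b the point lies.
        cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
        signs.append(cross >= 0)
    # Inside a convex quad iff the point is on the same side of every edge.
    return all(signs) or not any(signs)
```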

3.2 Current implementation

In its current implementation as a Matlab application, the proposed tool is designed to be interactive, allowing the user to initialize line detection and possibly make corrections or supply new information as the video analysis progresses. For example, at frame 0 the user may manually initialize the left vertical white line and the orange line because they are visible in the camera field of view, but the right vertical white line may only appear at frame 200, in which case the user may intervene to initialize this particular line so that the system can make use of it for spatial referencing. Once lines are initialized, they are tracked automatically.

3.3 Line tracking

1) Line modeling / initialization: Each line is modeled by a pair of points p1 and p2 supplied by the user. For the orange and blue lines, the average color along the segment between the points is also extracted by K-means clustering and stored in order to exploit their color properties. Clustering is done in the CbCr subspace after RGB->YCbCr conversion.

2) Tracking: Tracking then proceeds by moving p1 and p2 and measuring the fit of the corresponding line with respect to the image content. Points are moved in a systematic fashion inside a small neighborhood (see Figure 2 for an illustration). The procedure can be seen as an adjustment of the position of the line model so that a fit criterion is maximized at each frame. If the fit is inadequate (the likelihood value is too low), tracking for this particular line halts. Lines are tracked independently.

Figure 2 - Line fitting: exploration around the current points defining the line model; the points corresponding to the best fit define the new line model.

3) Likelihood function: The likelihood function that measures the line fit is made of two contributions, geometric and radiometric (an illustrative code sketch is given at the end of this section).

The geometric part enforces the likelihood of a line candidate lying at the center of a line marking in the image. First, similar to [3], a binary edge image is computed and transformed into a distance map, where each pixel value represents the shortest distance to an edge pixel. The distance map is then thresholded to retain the locally maximum distance values that are within an expected range corresponding to the typical half width of a line marking. The thresholded map is finally filtered to provide a smooth likelihood function. Figure 3 shows the series of steps.

The radiometric part takes advantage of the knowledge about the appearance of the line being tracked. For the orange and blue lines, the closer the color of the line candidate to the expected average color, the higher the likelihood. As for the pixels making up the white lines, their grayscale intensity is an obvious cue from which a likelihood value may be derived.

Figure 3 - Series of steps leading to the creation of the geometric part of the likelihood function: original image, edge detection, distance map, thresholded distance map, final likelihood map.

4) Counting horizontal lines: Horizontal white lines are handled differently because they constantly enter and leave the camera field of view. The strategy is to track them in pairs so that two of them are always available for positioning the pedestrian (Section 3.5). Let us label the line closest to the camera as line A and its companion as line B. Line A is initialized by the user and tracked in the same manner as the vertical lines, while line B, being parallel to A and above it in terms of image coordinates, is easy to locate. As line A leaves the field of view, line B becomes line A and a new companion is located. Of course, after every label switch a line counter is incremented so that the system knows which area of the crosswalk is being analyzed.
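The following Python/OpenCV sketch (the actual implementation is in Matlab) illustrates one possible realization of steps 2 and 3: a geometric likelihood map built from an edge image and a distance transform, and an exhaustive search over small perturbations of the two endpoints of a line model. All function names, thresholds and neighborhood sizes are assumptions rather than values from the paper; for the orange and blue lines, a radiometric (color-distance) term would be added to the same score.

```python
# Sketch only: geometric likelihood map and endpoint-perturbation line fitting.
import numpy as np
import cv2

def geometric_likelihood(gray, half_width_px=6, sigma=3.0):
    """Likelihood that a pixel lies near the centerline of a painted marking."""
    edges = cv2.Canny(gray, 50, 150)
    # Distance to the nearest edge pixel (edge pixels must be zero for distanceTransform).
    dist = cv2.distanceTransform(255 - edges, cv2.DIST_L2, 3)
    # Keep distances compatible with the expected half width of a line marking.
    mask = (dist > 0.5 * half_width_px) & (dist < 1.5 * half_width_px)
    lik = np.where(mask, dist, 0).astype(np.float32)
    # Smooth to obtain a well-behaved likelihood surface.
    return cv2.GaussianBlur(lik, (0, 0), sigma)

def line_fit_score(lik_map, p1, p2, n_samples=50):
    """Average likelihood sampled along the segment p1-p2 (points given as (x, y))."""
    xs = np.linspace(p1[0], p2[0], n_samples).round().astype(int)
    ys = np.linspace(p1[1], p2[1], n_samples).round().astype(int)
    h, w = lik_map.shape
    xs, ys = np.clip(xs, 0, w - 1), np.clip(ys, 0, h - 1)
    return float(lik_map[ys, xs].mean())

def track_line(lik_map, p1, p2, radius=4, min_score=1.0):
    """Exhaustive search in a small neighborhood around both endpoints."""
    best, best_pts = -np.inf, None
    offsets = [(dx, dy) for dx in range(-radius, radius + 1)
                         for dy in range(-radius, radius + 1)]
    for d1 in offsets:
        for d2 in offsets:
            q1 = (p1[0] + d1[0], p1[1] + d1[1])
            q2 = (p2[0] + d2[0], p2[1] + d2[1])
            s = line_fit_score(lik_map, q1, q2)
            if s > best:
                best, best_pts = s, (q1, q2)
    return best_pts if best >= min_score else None   # None: tracking halts for this line
```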

3.4 Tracking the pedestrian

Accurate spatial positioning of the subject with respect to the crosswalk requires detection and tracking of his or her feet throughout the video shot. The easiest and most robust approach involves the use of a distinctive marker affixed to the right foot of the subject. The reason for attaching the marker to the foot is to reduce uncertainty about its 3D location (and hence its 2D projection), since the foot is kept close to the ground during a normal walk. A piece of yellow tape seemed to be adequate (see Figure 4 for an example). Tracking of the yellow sticker is based on the selection of the optimal combination of YCbCr color channels that best discriminates between the sticker and the background (asphalt, clothing, road markings, etc.), as described in [5]. The selection is guided by the maximization of the separability between the probability distributions of sticker and background colors: the nonlinear log likelihood ratio maps object/background distributions into positive values for colors distinctive of the object and negative values for colors associated with the background. The end result is a series of weight images (one for each combination of channels), ranked by a discriminability measure, that can be thresholded to extract the blob corresponding to the sticker. The highest-ranking weight image carries enough information to be used alone. Tracking the sticker then boils down to frame-to-frame blob matching (based on blob moments for increased robustness). Figure 4 shows some examples where the blob associated with the sticker is clearly visible and easy to segment.
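A minimal NumPy sketch of the feature-selection idea of [5] follows (the actual tool is in Matlab). It assumes that binary masks of marker and background pixels are available, e.g. from the user's initialization; the bin count, epsilon and function names are illustrative assumptions, not values from the paper.

```python
# Sketch: log-likelihood-ratio weight image and the variance-ratio ranking of [5].
import numpy as np

def llr_weight_image(feature, obj_mask, bg_mask, n_bins=32, eps=1e-3):
    """feature: 2-D array of a (possibly combined) color channel scaled to [0, 255]."""
    bins = np.linspace(0.0, 255.0, n_bins + 1)
    p_obj, _ = np.histogram(feature[obj_mask], bins=bins)
    p_bg, _ = np.histogram(feature[bg_mask], bins=bins)
    p_obj = p_obj / max(p_obj.sum(), 1)
    p_bg = p_bg / max(p_bg.sum(), 1)
    llr = np.log((p_obj + eps) / (p_bg + eps))        # per-bin log likelihood ratio
    idx = np.clip(np.digitize(feature, bins) - 1, 0, n_bins - 1)
    return llr[idx], llr, p_obj, p_bg                 # weight image + per-bin quantities

def variance_ratio(llr, p_obj, p_bg):
    """Discriminability score from [5]: high when marker and background separate well."""
    def var(p):
        m = (p * llr).sum()
        return (p * (llr - m) ** 2).sum()
    return var(0.5 * (p_obj + p_bg)) / (var(p_obj) + var(p_bg) + 1e-12)

# The highest-ranking weight image can then be thresholded (e.g. weight > 0) and the
# largest connected blob taken as the yellow sticker for frame-to-frame matching.
```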

3.5 Positioning the pedestrian

Once the image coordinates of the centroid P of the yellow patch are found, a verification is made as to whether or not P lies inside a parallelogram defined by some of the line markings being tracked. If it does, then, keeping in mind that line tracking provides knowledge about the real-world location of the line intersections, the relative location of the point P inside the parallelogram can be reprojected by bilinear interpolation [6] to its real-world counterpart.
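As an illustration of this mapping step (not the authors' Matlab code), the sketch below inverts the bilinear map of an arbitrary quadrilateral, in the spirit of [6], to obtain normalized cell coordinates (u, v), and then applies them to the known real-world corners of the same grid cell. The corner ordering, iteration count and example dimensions are assumptions.

```python
# Sketch: inverse bilinear mapping from image coordinates to real-world coordinates.
import numpy as np

def inverse_bilinear(p, q00, q10, q11, q01, n_iter=20):
    """Solve p = (1-u)(1-v) q00 + u(1-v) q10 + u v q11 + (1-u) v q01 for (u, v)."""
    p, q00, q10, q11, q01 = map(np.asarray, (p, q00, q10, q11, q01))
    u, v = 0.5, 0.5                                    # start at the cell center
    for _ in range(n_iter):                            # Newton iterations
        f = ((1-u)*(1-v)*q00 + u*(1-v)*q10 + u*v*q11 + (1-u)*v*q01) - p
        du = (1-v)*(q10 - q00) + v*(q11 - q01)         # df/du
        dv = (1-u)*(q01 - q00) + u*(q11 - q10)         # df/dv
        step = np.linalg.solve(np.column_stack([du, dv]), f)
        u, v = u - step[0], v - step[1]
    return u, v

def image_to_world(p_img, cell_img, cell_world):
    """cell_img / cell_world: corners ordered (q00, q10, q11, q01)."""
    u, v = inverse_bilinear(p_img, *cell_img)
    if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
        return None                                    # foot not inside this grid cell
    w00, w10, w11, w01 = map(np.asarray, cell_world)
    return (1-u)*(1-v)*w00 + u*(1-v)*w10 + u*v*w11 + (1-u)*v*w01

# Hypothetical example: a cell 1.9 m wide and 1.2 m deep with its first corner at the origin.
# cell_world = [(0.0, 0.0), (1.9, 0.0), (1.9, 1.2), (0.0, 1.2)]   # meters
```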

Figure 4 - Two frames along with their two highest-ranking weight images, column-wise. The yellow marker is clearly visible in the weight images.

4. RESULTS AND DISCUSSION

Figure 5 depicts the graphical user interface of the Matlab application. The parallelogram formed by the four vertical and horizontal lines is clearly visible, and spatial referencing of the feet can be carried out, provided that the user has previously initialized the yellow marker detector/tracker. At any time during the analysis, the user can reinitialize the trackers.

In order to test the application and validate the whole process, we analyzed a video sequence lasting about 27 seconds, which is the average time taken by the blind subjects to cross the street. The images were captured with a consumer handheld camcorder (frame size: 720x480 pixels) held by an observer about 2-3 meters behind the subject. In addition to the video acquisition, the subject's position was recorded using a field computer equipped with a GNSS receiver: another observer was asked to hold the GNSS antenna above the participant during the crossing while the field computer automatically recorded his position every second. The GNSS data are affected by localization errors because of 1) limitations in the spatial accuracy of the capturing device, between 10 cm and 40 cm, and 2) the inability of the observer to always hold the antenna perfectly upright above the participant during the walking phase. Figure 6, left, illustrates the spatial sampling done by the GNSS receiver mapped onto a graphic representation of the street crosswalk, while the right image represents the same crosswalk sampled by the proposed system. The vision-based data appear to agree better with what can be observed in the video sequence than the GNSS data do, but 1) this should be confirmed through a posteriori video annotation and sequence matching, and 2) many more sequences should be analyzed (planned for the upcoming phase). Note the missing data in the first and last seconds of the sequence; this is essentially due to the inability of the system to detect the first and last horizontal white lines.

The spatial inaccuracy has two sources:

- The walking motion, when the foot leaves the ground. This type of error is much less critical because the inaccuracy affects the localization coordinate that is parallel to the walking direction, whereas the initial accuracy specification concerns the deviation with respect to the midline of the street crosswalk. Nonetheless, the dynamics of the estimated 3D location of the marker could give enough information to pinpoint the moments during which the foot is in contact with the ground.
- Errors in line tracking. Inaccurate line tracking obviously causes errors in the localization of line intersections and thus in the mapping to real-world coordinates. A potentially better strategy, currently being explored, is a 'global' approach that propagates perturbations to the modeled states of all lines at once, since all lines undergo the same transformation from frame to frame. More concretely, we plan to explore camera pose tracking as a tool to predict line positions. Recent work in camera pose estimation ([7][8][9][10][11]) brought interesting ideas, in particular [11], which proposed a 3D edge tracker based on particle filtering, where particles represent transformation matrices (rotations and translations). This is in line with our future direction as far as line tracking is concerned.

Figure 5 - Matlab user interface of the application; tracking results for a 27-second sequence. Bottom left: positioning data provided by the GNSS receiver. Bottom right: positioning data provided by the application. Coordinate units are meters.

5. CONCLUSION

This paper reports on a proof of concept for a semi-automatic system that localizes (blind) pedestrians in crosswalks by video analysis. Tracking a yellow marker attached to the subject's foot, together with tracking of the line markings on the ground, yields a good estimate of the subject's position relative to the markings. Results are comparable to those obtained from GNSS data acquisition, although more video sequences are needed to better assess performance in more diverse conditions.

ACKNOWLEDGMENT

This work has been supported in part by the MESRST of the Gouvernement du Québec, and in part by the École d'orthophonie et d'audiologie of the University of Montreal. We want to express our special gratitude to Tony Leroux, Agathe Ratelle, Carole Zabihaylo and André-Anne Mailhot for integrating us into their testing team. We are also very grateful to the participants in the study, who so generously gave their time to make this work possible.

REFERENCES

[1] S. Takahashi and J. Ohya, "Tracking white road line by particle filter from the video sequence acquired by the camera attached to a walking human body," vol. 8295, 2012, pp. 82950V–82950V-9.
[2] S. Sehestedt, S. Kodagoda, A. Alempijevic, and G. Dissanayake, "Robust lane detection in urban environments," in IROS, 2007, pp. 123–128.
[3] R. Jiang, R. Klette, T. Vaudrey, and S. Wang, "New lane model and distance transform for lane detection and tracking," in CAIP, 2009, pp. 1044–1052.
[4] C. Guo, S. Mita, and D. A. McAllester, "Lane detection and tracking in challenging environments based on a weighted graph and integrated cues," in IROS, 2010, pp. 5543–5550.
[5] R. T. Collins, Y. Liu, and M. Leordeanu, "Online selection of discriminative tracking features," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1631–1643, 2005.
[6] Interpolation using an arbitrary quadrilateral. [Online]. Available: http://www.particleincell.com/blog/2012/quad-interpolation/
[7] C. R. del Blanco, N. N. García, L. Salgado, and F. Jaureguizar, "Object tracking from unstabilized platforms by particle filtering with embedded camera ego motion," in AVSS, 2009, pp. 400–405.
[8] F. Herranz, K. Muthukrishnan, and K. Langendoen, "Camera pose estimation using particle filters," in Indoor Positioning and Indoor Navigation (IPIN), 2011 International Conference on, Sept. 2011, pp. 1–8.
[9] M. Pupilli and A. Calway, "Real-time camera tracking using a particle filter," in BMVC, 2005.
[10] J. Yang, D. Schonfeld, and M. A. Mohamed, "Robust video stabilization based on particle filter tracking of projected camera motion," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 7, pp. 945–954, 2009.
[11] G. Klein and D. W. Murray, "Full-3D edge tracking with a particle filter," in BMVC, 2006, pp. 1119–1128.