Localizing people in crosswalks with a moving handheld camera: proof of concept
Marc Lalonde, Claude Chapdelaine and Samuel Foucher
Vision and Imaging Team, CRIM, 405 Ogilvy Ave. #101 Montréal (Qc), Canada, H3N 1M3
Email: {marc.lalonde, claude.chapdelaine, samuel.foucher}@crim.ca
ABSTRACT
Although the tracking of people or objects in uncontrolled environments has received considerable attention in the literature, the accurate
localization of a subject with respect to a reference ground plane remains a major issue. This study describes an early
prototype for the tracking and localization of pedestrians with a handheld camera. One application envisioned here is to
analyze the trajectories of blind people going across long crosswalks when following different audio signals as a guide.
This kind of study is generally conducted manually with an observer following a subject and logging his/her current
position at regular time intervals with respect to a white grid painted on the ground. This study aims at automating the
manual logging activity: with a marker attached to the subject’s foot, a video of the crossing is recorded by a person
following the subject, and a semi-automatic tool analyzes the video and estimates the trajectory of the marker with
respect to the painted markings. Challenges include robustness to variations in lighting conditions (shadows, etc.),
occlusions, and changes in camera viewpoint. Results are promising when compared to GNSS measurements.
Keywords: Tracking, color segmentation, moving camera.
1. INTRODUCTION
The present work is a small contribution to a larger project that aims at measuring the trajectory of blind subjects who
are asked to cross a street intersection with the aid of various cuckoo-chirp type signals. Some signals may be better
designed and more efficient in guiding a blind person to align himself/herself with the crosswalk. One key component of
the project is to measure the person's deviation with respect to the center of the crosswalk. Although competing
techniques are generally efficient and readily available (e.g. GPS), they are not appropriate in this context due to
their bulkiness and limited spatial resolution (a resolution of at most 15 cm is required). One approach being explored is based on
the automatic analysis of the video footage taken during the experiments. However, vision-based tracking and
localization of a person in an uncontrolled setting like a city street is always a challenge. Moreover, due to logistical as
well as optical reasons, positioning a single fixed camera in such a way that the whole crossing event is recorded at good
resolution is impossible: the street selected for the experiment is a six-lane boulevard with a median, and the total
walking distance to get from one sidewalk to the other is about 30 meters. So the strategy under analysis is to use a
moving, handheld camera to record the subject's displacement and then to track and localize his/her feet with respect to
markings painted on the ground for spatial referencing. Such an approach is very convenient in terms of data acquisition
(an observer merely needs to walk behind the subject with a camcorder, video acquisition and management is easy,
resolution is always good, etc.) but it poses many challenges for the subsequent analysis phase: the subject must be
tracked through a video shot that is unstable due to the means of acquisition, the markings themselves must be tracked
for proper spatial referencing, and finally, tracking must be robust despite shadows, varying lighting conditions,
occlusions caused by observers accompanying the subject for security reasons, etc. Figure 1 shows a model of the
crosswalk with the pair of 'vertical' white lines delimiting the corridor as well as additional lines painted on the road
surface: orange line in the center of the crosswalk, blue lines on each side that delimit the test area (for safety
considerations), as well as 'horizontal' lines that can be counted to monitor positioning during the crossing. Distance
between lines is known a priori.
Figure 1 - Grid made of the two standard 'vertical' white lines with additional lines painted on the street for positioning: an
orange line in the middle of the crosswalk, two blue lines on each side and 12 'horizontal' lines. d(A,B) = 120 cm;
d(B,D) = 120 cm; d(D,E) = 120 cm; d(A,C) = 190 cm.
2. PREVIOUS WORK
The work most closely related to what is presented in this paper is that of [1]. Their system consists of a wearable camera
attached to a visually impaired person as well as an image analysis module that acts as an automatic navigation system.
The system assists the person by detecting and tracking a white line found in the pedestrian area along roads and streets.
A particle filter provides the tracking capability, where particles are rectangles in a video frame (particle likelihood is
related to the number of white pixels inside the rectangle); a classifier makes sure that the tracked object is indeed a line.
However, a single line is tracked, which simplifies scene modeling, and no spatial referencing is done. Another body of
literature that can provide hints toward a solution is related to automatic lane tracking for vehicles (e.g. [2][3][4]). In this
context, good solutions are proposed with the help of geometric models of road markings as well as camera calibration.
Camera calibration allows for the use of inverse perspective mapping that creates a bird's eye view of the road suitable
for lane detection and tracking. Many surveyed papers rely on particle filtering for lane tracking, where the particles are
attached to each lane [4] or to parameters of the road model [3] and propagated using dynamic models that are in some
cases based on motion information from the vehicle itself. This paper focuses on a problem that raises similar issues
(robust line tracking) but in a different context:
- Camera calibration is impossible: the distance between the subject and the camera varies; the distance between the camera
and the ground also varies depending on the height of the person holding it; the camera may pan, tilt, rotate slightly,
or shift somewhat abruptly as a result of the walking action of the holder, etc.
- Camera motion is of course unknown.
- Frame-to-frame registration (as an attempt to model the transition between tracking system states) is less reliable
because of the non-rigid bodies (and their shadows) that may take up a sizeable portion of the image.
- Many lines must be tracked simultaneously, so scene modeling should be approached differently.
3. THE PROPOSED APPROACH
3.1 Methodology
The objective is to perform spatial positioning of the blind subject as s/he crosses the street. In fact, the positioning is
relative to some origin that is arbitrarily set to be the intersection between the orange line and the bottom horizontal
white line. This relative positioning is possible if a maximum of spatial cues are detected and tracked in the video shot,
and then properly mapped to the physical markings on the street. Such mapping is non-ambiguous for 'vertical' lines but
in the case of horizontal white lines, line counting is necessary in order to figure out which portion of the crosswalk is
being observed by the moving camera. Each video frame is analyzed to detect situations where the subject's feet are
found in a parallelogram made of pairs of vertical and horizontal lines, in which case the mapping information can be
used to deduce his/her relative position.
3.2 Current implementation
In its current implementation as a Matlab application, the proposed tool is designed to be interactive, allowing the user to
initialize line detection and make corrections or supply new information as video analysis progresses. For
example, at frame 0 the user may manually initialize the left vertical white line and the orange line because they are
visible in the camera field of view, but the right vertical white line may appear at frame 200, in which case the user may
intervene to initialize this particular line so that the system can make use of it for spatial referencing. Once lines are
initialized, they are tracked automatically.
3.3 Line tracking
1) Line modeling / initialization: Each line is modeled by a pair of points p1 and p2 supplied by the user.
For the orange and blue lines, the average color along the line between the points is also extracted by
K-means clustering and stored in order to exploit their color properties. Clustering is done in the CbCr
subspace after RGB->YCbCr conversion.
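For illustration, the color-model extraction of step 1 could be sketched as follows. This is a minimal NumPy sketch, not the actual Matlab implementation: the function names, the BT.601 full-range conversion coefficients, and the deterministic two-cluster initialization are our assumptions.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Convert an (N, 3) array of RGB values in [0, 255] to YCbCr (BT.601, full range)."""
    r, g, b = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=1)

def dominant_cbcr(rgb_samples, iters=20):
    """Two-cluster K-means in the CbCr subspace; returns the centroid of the
    largest cluster as the line's color model (luma is discarded)."""
    cbcr = rgb_to_ycbcr(np.asarray(rgb_samples, dtype=float))[:, 1:]
    # deterministic initialization: the two samples with extreme Cr values
    centers = cbcr[[cbcr[:, 1].argmin(), cbcr[:, 1].argmax()]].copy()
    for _ in range(iters):
        d = np.linalg.norm(cbcr[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(2):
            if np.any(labels == j):
                centers[j] = cbcr[labels == j].mean(axis=0)
    return centers[np.bincount(labels, minlength=2).argmax()]
```

Working in CbCr only makes the stored model largely insensitive to brightness changes (shadows), which is the point of discarding the luma channel.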
2) Tracking: Tracking then proceeds by moving p1 and p2 and measuring the fit of the corresponding line
with respect to the image content. Points are moved in a systematic fashion inside a small
neighborhood (see Figure 2 for an illustration). The procedure can be seen as a kind of adjustment of
the position of the line model so that a fit criterion is maximized at each frame. In case the fit is
inadequate (likelihood value is too low), tracking for this particular line halts. Lines are tracked
independently.
Figure 2 - Line fitting: exploration around the current points defining the line model; points corresponding to the
best fit define the new line model.
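The systematic neighborhood exploration of step 2 can be sketched as below; the likelihood function is abstracted as a callable, and the function names, search radius, and halting threshold are illustrative assumptions rather than the paper's actual parameters.

```python
import itertools
import numpy as np

def refine_line(p1, p2, line_likelihood, radius=2):
    """Move the two endpoints of the line model inside a (2*radius+1)^2
    neighborhood each and keep the pair maximizing the fit criterion."""
    offsets = [np.array(o) for o in
               itertools.product(range(-radius, radius + 1), repeat=2)]
    best_score, best = -np.inf, (np.asarray(p1), np.asarray(p2))
    for d1 in offsets:
        for d2 in offsets:
            q1, q2 = p1 + d1, p2 + d2
            s = line_likelihood(q1, q2)
            if s > best_score:
                best_score, best = s, (q1, q2)
    return best, best_score

def track_line(p1, p2, line_likelihood, min_score=0.5):
    """One tracking step for one line; returns None when the fit is too
    poor, i.e. tracking for this line halts."""
    (q1, q2), score = refine_line(np.asarray(p1), np.asarray(p2), line_likelihood)
    return (q1, q2) if score >= min_score else None
```

Since each line is tracked independently, `track_line` would simply be called once per initialized line on every frame.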
3) Likelihood function: The likelihood function that measures the line fit is made of two contributions,
geometric and radiometric.
The geometric part enforces the likelihood of a line candidate lying at the center of a line marking in the image.
First, similar to [3], a binary edge image is computed and transformed into a distance map, where each pixel
value represents the shortest distance to an edge pixel. The distance map is then thresholded to retain the locally
maximum distance values that are within an expected range corresponding to the typical half width of a line
marking. The thresholded map is finally filtered to provide a smooth likelihood function. Figure 3 shows the
series of steps.
The radiometric part takes advantage of the knowledge about the appearance of the line being tracked. For
orange and blue lines, the closer the color of the line candidate to the expected average color, the higher the
likelihood. As for the pixels making up the white lines, their grayscale level of intensity is an obvious cue from
which a likelihood value may be derived.
Figure 3 - Series of steps leading to the creation of the geometric part of the likelihood function: original image,
edge detection, distance map, thresholded distance map, final likelihood map.
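The geometric term can be sketched as follows (NumPy only). A brute-force distance map stands in for a proper distance transform, and a simple band threshold around the expected half width approximates the local-maximum selection described above; parameter values are illustrative.

```python
import numpy as np

def geometric_likelihood(edges, half_width=3.0, tol=1.5, sigma=1.0):
    """Geometric part of the line-fit likelihood, following the steps in the
    text: edge map -> distance map -> threshold around the expected half
    width of a marking -> smoothing. Assumes a non-empty boolean edge map;
    the O(pixels * edges) distance map is only practical for small images."""
    h, w = edges.shape
    ey, ex = np.nonzero(edges)
    yy, xx = np.mgrid[0:h, 0:w]
    # distance from every pixel to its nearest edge pixel
    d = np.sqrt((yy[..., None] - ey) ** 2 + (xx[..., None] - ex) ** 2).min(axis=2)
    # keep pixels whose distance matches the expected half width of a marking
    band = (np.abs(d - half_width) <= tol).astype(float)
    # separable Gaussian smoothing yields a smooth likelihood surface
    r = int(3 * sigma)
    k = np.exp(-np.arange(-r, r + 1) ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    sm = np.apply_along_axis(lambda m: np.convolve(m, k, mode='same'), 0, band)
    sm = np.apply_along_axis(lambda m: np.convolve(m, k, mode='same'), 1, sm)
    return sm
```

By construction the surface peaks along the centerline of a marking of the expected width, which is exactly where a well-fitted line model should lie.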
4) Counting horizontal lines: Horizontal white lines are handled differently because they constantly enter and
leave the camera field of view. The strategy is to track them in pairs so that two of them are always
available for positioning the pedestrian (Section 3.5). Let us label the line closest to the camera as line A,
and its companion as line B; Line A is initialized by the user and tracked in the same manner as the vertical
lines while line B, being parallel to A and above in terms of image coordinates, is easy to locate. As line A
leaves the field of view, line B becomes line A and its companion is located. Of course, after every label
switch a line counter is incremented so that the system knows which area of the crosswalk is being
analyzed.
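The label-switching bookkeeping of step 4 amounts to a small state machine. The sketch below abstracts the line geometry and tracking away entirely; the class and method names are ours.

```python
class HorizontalLineCounter:
    """Bookkeeping for the tracked pair of horizontal lines (labels A and B)
    and the counter identifying the crosswalk area under analysis."""

    def __init__(self, line_a, line_b, count=0):
        self.line_a = line_a   # line closest to the camera
        self.line_b = line_b   # its companion, above A in image coordinates
        self.count = count     # index of the crosswalk area being analyzed

    def update(self, a_visible, locate_companion):
        """Call once per frame. When line A leaves the field of view,
        B becomes the new A, a new companion B is located (easy, since it
        is parallel to A), and the counter is incremented."""
        if not a_visible:
            self.line_a = self.line_b
            self.line_b = locate_companion(self.line_a)
            self.count += 1
        return self.count
```

The counter is what lets the system map the currently visible pair of lines back to a specific cell of the painted grid of Figure 1.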
3.4 Tracking the pedestrian
Accurate spatial positioning of the subject with respect to the crosswalk requires detection and tracking of his or her feet
throughout the video shot. The easiest and most robust approach for performing tracking involves the use of a distinctive
marker affixed to the right foot of the subject. The reason for attaching the marker to the foot is to reduce uncertainty
over its 3D location (and hence its 2D projection) since it is kept close to the ground during a normal walk. A piece of
yellow tape seemed to be adequate (see Figure 4 for an example). Tracking of the yellow sticker is based on the selection
of the optimal combination of YCbCr color channels that best discriminates between the sticker and background
(asphalt, clothing, road markings, etc.), as described in [5]. The selection is guided by the maximization of the
separability between the probability distributions of sticker and background colors: The nonlinear log likelihood ratio
maps object/background distributions into positive values for colors distinctive to the object and negative for colors
associated with the background. The end result is a series of weight images (one for each combination of channels),
ranked by discriminability measure, that can be thresholded to extract the blob corresponding to the sticker. The highest
ranking weight image carries enough information to be used alone. Tracking the sticker then boils down to frame-to-
frame blob matching (based on blob moments for increased robustness). Figure 4 shows some examples where the blob
associated to the sticker is clearly visible and easy to segment.
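The core of the feature-selection idea of [5] is the log likelihood ratio of the object and background color distributions. A single-channel sketch is given below (in the paper a combination of YCbCr channels is selected by a separability measure); the function name, bin count, and smoothing constant are our assumptions.

```python
import numpy as np

def loglik_weight_image(channel, obj_mask, bins=32, eps=1e-4):
    """Weight image from the log likelihood ratio of object vs background
    color distributions: positive where a color is distinctive to the
    object (sticker), negative where it is typical of the background."""
    edges = np.linspace(0.0, 255.0, bins + 1)
    h_obj, _ = np.histogram(channel[obj_mask], bins=edges)
    h_bg, _ = np.histogram(channel[~obj_mask], bins=edges)
    # eps-smoothed normalized histograms avoid log(0) on empty bins
    p_obj = (h_obj + eps) / (h_obj.sum() + eps * bins)
    p_bg = (h_bg + eps) / (h_bg.sum() + eps * bins)
    ratio = np.log(p_obj / p_bg)
    # back-project the per-bin ratio onto every pixel of the frame
    idx = np.clip(np.digitize(channel, edges) - 1, 0, bins - 1)
    return ratio[idx]
```

Thresholding the resulting weight image at zero keeps exactly the pixels whose color is more likely under the sticker model than under the background model, which is what makes the blob easy to segment.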
3.5 Positioning the pedestrian
Once the image coordinates of the centroid P of the yellow patch are found, a check is made as to whether P
lies in a parallelogram defined by some of the line markings being tracked. In the positive case, keeping in
mind that line tracking provides knowledge about the real-world location of the line intersections, the relative location of
the point P inside the parallelogram can be reprojected by bilinear interpolation [6] to its real-world counterpart.
Figure 4 - Two frames along with their two highest ranking weight images, column-wise. The yellow marker is
clearly visible in the weight images.
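Under an exact parallelogram assumption the bilinear mapping of [6] reduces to an affine one, which keeps a sketch short; a perspective-distorted quadrilateral would require the full inverse bilinear solution of [6]. The corner ordering and function name below are our conventions.

```python
import numpy as np

def map_to_world(p, img_quad, world_quad):
    """Map an image point p inside a parallelogram to its real-world
    counterpart. Each quad is given as three corners: the origin corner
    and its two adjacent corners (the fourth corner is implied)."""
    o, a, b = (np.asarray(img_quad[i], dtype=float) for i in range(3))
    # local coordinates (u, v) of p in the basis (a - o, b - o)
    M = np.column_stack([a - o, b - o])
    u, v = np.linalg.solve(M, np.asarray(p, dtype=float) - o)
    wo, wa, wb = (np.asarray(world_quad[i], dtype=float) for i in range(3))
    return wo + u * (wa - wo) + v * (wb - wo)
```

As a usage example with the cell dimensions of Figure 1 (120 cm between horizontal lines, 190 cm between verticals), a point at the center of a 10x20-pixel image cell maps to (0.6 m, 0.95 m) in the cell's local frame: `map_to_world((5.0, 10.0), [(0, 0), (10, 0), (0, 20)], [(0, 0), (1.2, 0), (0, 1.9)])`.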
4. RESULTS AND DISCUSSION
Figure 5 depicts the graphical user interface of the Matlab application. The parallelogram formed by the four vertical and
horizontal lines is clearly visible and spatial referencing of the feet can be carried out, provided that the user has
previously initialized the yellow marker detector/tracker. At any time during analysis the user can reinitialize the
trackers.
In order to test the application and validate the whole process, we analyzed a video sequence lasting about 27 seconds,
which is the average time taken by the blind subjects to cross the street. The images were captured using a consumer
handheld camcorder (frame size: 720x480 pixels) held by an observer about 2-3 meters behind the subject. In addition to
video acquisition, the subject's position was recorded using a field computer equipped with a GNSS receiver: another
observer was asked to hold the GNSS antenna above the participant during his displacement while the field computer
automatically recorded his position at every second. The GNSS data are tainted with localization errors because of 1)
limitations in the spatial accuracy of the capturing device, between 10 cm and 40 cm, and 2) the inability of the observer
holding the tilted antenna to always maintain it perfectly above the participant during the walking phase. Figure 6, left,
illustrates the spatial sampling done by the GNSS receiver and mapped onto a graphic representation of the street
crosswalk, while the right image represents the same crosswalk sampled by the proposed system. There appears to be
greater agreement between the vision-based data and what can be observed in the video sequence than with the GNSS
data, but confirming this requires 1) a posteriori video annotation and sequence matching, and 2) the analysis of many
more sequences (planned in the upcoming phase). It can be observed that there are missing data in the first and last seconds of
the sequence; this is essentially due to the inability of the system to detect the first and last horizontal white lines.
The spatial inaccuracy has two sources:
- The walking motion, when the foot leaves the ground. This type of error is much less critical because the
inaccuracy affects the localization coordinate that is parallel to the walking direction, whereas the initial accuracy
specification relates to the deviation with respect to the midline of the street crosswalk. Nonetheless, the
dynamics of the estimated 3D location of the marker could give enough information to pinpoint the
moments during which the foot is in contact with the ground.
- Errors in line tracking. Inaccurate line tracking obviously causes errors in the localization of line intersections
and thus in the mapping to real-world coordinates. A potentially better strategy being explored calls
for a 'global' attempt to propagate perturbations to the modeled line states, since all lines undergo the same
transformation from frame to frame. More precisely, we plan to explore camera pose tracking as a tool to predict
line positions. Recent work in camera pose estimation ([7][8][9][10][11]) has brought interesting ideas, in particular
[11], who proposed a 3D edge tracker based on particle filtering, where particles represent transformation
matrices (rotations and translations). This is in line with our future direction as far as line tracking is concerned.
Figure 5 - Matlab user interface of the application.
Figure 6 - Tracking results for a 27-second sequence. Left: positioning data provided by the GNSS receiver.
Right: positioning data provided by the application. Coordinate units are meters.
5. CONCLUSION
This paper reports on a proof of concept about a semi-automatic system that can localize (blind) pedestrians in
crosswalks by video analysis. Tracking of a yellow marker attached to a subject's foot as well as tracking of the line
markings on the ground allow for a good estimate of the subject's relative position with respect to the markings. Results
are comparable to those obtained from GNSS data acquisition although more video sequences are needed to better assess
performance in more diverse conditions.
ACKNOWLEDGMENT
This work has been supported in part by the MESRST of the Gouvernement du Québec, and in part by the École
d'orthophonie et d'audiologie of the Université de Montréal. We want to express our special gratitude to Tony Leroux,
Agathe Ratelle, Carole Zabihaylo and André-Anne Mailhot for integrating us into their testing team. We are also very
grateful to the participants in the study who so generously gave their time to make this work possible.
REFERENCES
[1] S. Takahashi and J. Ohya, “Tracking white road line by particle filter from the video sequence acquired by the
camera attached to a walking human body,” in Proc. SPIE, vol. 8295, 2012, p. 82950V.
[2] S. Sehestedt, S. Kodagoda, A. Alempijevic, and G. Dissanayake, “Robust lane detection in urban environments,” in
IROS, 2007, pp. 123–128.
[3] R. Jiang, R. Klette, T. Vaudrey, and S. Wang, “New lane model and distance transform for lane detection and
tracking,” in CAIP, 2009, pp. 1044–1052.
[4] C. Guo, S. Mita, and D. A. McAllester, “Lane detection and tracking in challenging environments based on a
weighted graph and integrated cues,” in IROS, 2010, pp. 5543–5550.
[5] R. T. Collins, Y. Liu, and M. Leordeanu, “Online selection of discriminative tracking features,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 27, no. 10, pp. 1631–1643, 2005.
[6] Interpolation using an arbitrary quadrilateral. [Online]. Available: http://www.particleincell.com/blog/2012/quad-
interpolation/
[7] C. R. del Blanco, N. N. García, L. Salgado, and F. Jaureguizar, “Object tracking from unstabilized platforms by
particle filtering with embedded camera ego motion,” in AVSS, 2009, pp. 400–405.
[8] F. Herranz, K. Muthukrishnan, and K. Langendoen, “Camera pose estimation using particle filters,” in Indoor
Positioning and Indoor Navigation (IPIN), 2011 International Conference on, Sept 2011, pp. 1–8.
[9] M. Pupilli and A. Calway, “Real-time camera tracking using a particle filter,” in BMVC, 2005.
[10] J. Yang, D. Schonfeld, and M. A. Mohamed, “Robust video stabilization based on particle filter tracking of
projected camera motion,” IEEE Trans. Circuits Syst. Video Techn., vol. 19, no. 7, pp. 945–954, 2009.
[11] G. Klein and D. W. Murray, “Full-3D edge tracking with a particle filter,” in BMVC, 2006, pp. 1119–1128.