Multi-View and Multi-Plane Data Fusion for Effective Pedestrian Detection in Intelligent Visual Surveillance
Jie Ren1, Ming Xu2, 3, Jeremy S. Smith3, Shi Cheng4
1 College of Electronics and Information, Xi'an Polytechnic University, Xi'an, 710048, China
2 Department of Electrical and Electronic Engineering, Xi'an Jiaotong-Liverpool University, Suzhou, 215123, China
3 Department of Electrical Engineering and Electronics, University of Liverpool, Liverpool, L69 3BX, UK
4 Division of Computer Science, University of Nottingham Ningbo, Ningbo, 315100, China
Abstract: For the robust detection of pedestrians in intelligent video surveillance, an approach to
multi-view and multi-plane data fusion is proposed. Through the estimated homography, foreground
regions are projected from multiple camera views to a reference view. To identify false-positive
detections caused by foreground intersections of non-corresponding objects, the homographic
transformations for a set of parallel planes, which are from the head plane to the ground, are
employed. Multiple features including occupancy information and colour cues are derived from such
planes for joint decision-making. Experimental results on real-world sequences demonstrate the good performance of the proposed approach in pedestrian detection for intelligent visual surveillance.
Keywords: Detection; Video surveillance; Homography; Information fusion.
1 Introduction
Automatic pedestrian detection is a fundamental computer vision task for visual surveillance,
driverless vehicles, intelligent transportation and even smart cities. Detecting multiple pedestrians is
particularly challenging due to the frequent occlusion between moving objects. A reasonable solution is to use multiple cameras to resolve occlusions, as an object occluded in one camera view may be
visible in other camera views. Furthermore, the multiple camera views can extend the overall field of
view (FOV) and allow large spatial coverage for the continuous tracking of the objects.
In multi-view data fusion, the camera views need to be associated. A useful assumption here is
that in all the camera views the objects of interest are on one common plane, which is valid for most
scenarios of intelligent visual surveillance systems. A homography, a geometric transformation that defines a pixelwise mapping between two camera views according to a common plane, can be used as an efficient means of associating multiple camera views. Using a homography transformation
for a specific plane, foreground regions detected from each of the multiple camera views can be
projected to a reference view. The intersection regions of the foreground projections indicate the
locations of moving objects on that plane. This method can achieve good results in detection and is
robust in coping with occlusion.
The remainder of this paper is organized as follows. In Section 2 the related work in this area is
reviewed. In Section 3 the techniques used to estimate the homographies for the ground plane and a
set of parallel head planes are described. In Section 4, foreground detection using Gaussian mixture
model (GMM) based background subtraction is discussed. In Section 5, a detailed fusion scheme for
improved pedestrian detection is presented, including the fusion of multi-view foreground regions and
multi-plane feature fusion using the occupancy information and colour cues. Finally, the experimental
results are discussed in Section 6, followed by the conclusions drawn in Section 7.
2 Related Work
Multi-view fusion in a visual surveillance system can not only provide a larger overall FOV but
also reduce occlusions in the overlapping FOV. One of the key issues for multi-view visual
surveillance is how to utilize the information from the multiple cameras for the purpose of detection
and tracking. The more information extracted from the multiple cameras that can be used simultaneously, the more robust and accurate the system becomes. Depending on the degree of
information fusion, the existing multi-camera surveillance systems can be categorized into low-degree
information fusion, intermediate-degree information fusion or high-degree information fusion.
Low-degree fusion in visual surveillance systems uses multiple cameras only to extend the limited
FOV of a single camera [1-4]. This is known as the camera handoff method, which starts tracking an
object within a single camera view and switches to the next camera when the tracked object goes
beyond the FOV of the current camera. For the camera handoff method, existing research differs in three respects: when the handoff process is triggered, which camera is selected as the next optimal camera to track and monitor the object of interest, and how to establish the correspondence of an object between cameras.
In the approaches of low-degree fusion, the detection and tracking are carried out separately in
each camera view, and only one camera is actively working at a particular time stamp. Therefore, such a system fails to detect and track objects during dynamic occlusions, which is one of the inherent problems of single-camera visual surveillance. As there is very limited information exchange between the
cameras, this camera switching approach is classed as low-degree information fusion. In Cai and
Aggarwal’s work [1], switching between different camera views for more effective single-camera
tracking is proposed and the system predicts when the current camera will no longer have a good
view. The features extracted from each upper human body are used to build the object correspondence
between cameras. The next camera should not only provide the best view but also require the least switching to continually track the moving object in its view.
In intermediate-level information fusion, dynamic occlusions are usually resolved through the
integration of information from additional cameras, as occlusions might not occur simultaneously in
all the camera views. With the established correspondence of the objects between cameras, it is
possible to not only track objects as they move from one camera view to the other, but also align the
trajectories in multiple views for improved tracking. Khan and Shah [5] extended their work proposed
in [2] by creating associations across views for the better localization of each object. The trajectories
from each camera are fused in a reference view if that camera's field of view is visible in the reference
view. In [6], the authors align the multiple views with the viewpoint of the camera which can give a
clear view of the scene and fuse the trajectories from multiple views in that view. In [7], the
multiview process uses Kalman filter trackers to model the object position and velocity, to which the
multiple measurements from a single-view stage are associated.
As features are extracted from the individual camera views before fusion, the problems that arise
in the detection and tracking with a single camera will affect the final fusion result. Extracted features
in the individual camera views can be integrated in a reference view to obtain a global estimate. The
extracted features include bounding boxes, centroids, principal axes and classification results. In [8], a
motion model and an appearance model of each detected moving object are built. Then the moving
object is tracked using a joint probability data association filter in a single camera view. The bounding
boxes of the moving objects are projected to a reference view according to the ground-plane
homography to correct falsely detected bounding boxes and handle occlusions in the reference view.
Hu et al. [9] projected the principal axis of each foreground object from the individual camera views
to a top view according to the homography mapping for the ground plane. The foot point of each
object in the top view is determined by the intersection of the principal axes projected from two
camera views. The tracking is based on the foot-point locations in the top view. Du and Piater [10]
used particle filters to track targets in individual camera views and then projected the principal axes of
the targets onto the ground plane. After tracking the intersections of the principal axes using the
particle filters on the ground plane, the tracking results were warped back into each camera view to
improve the tracking in the individual camera views. In [11], the centroid of each tracked target is
used to calculate an epipolar line in 3D space, and then the global tracking is carried out on the basis
of the intersection of two epipolar lines each from a different camera view.
In high-degree information fusion, the individual cameras no longer extract features but provide
foreground bitmap information to the fusion centre. The objects are detected as the intersections of
these foreground bitmaps projected from multiple views. In [12], homography mapping is used to
combine foreground likelihood images from different views to resolve occlusions and localise moving
objects on the ground plane. In [13] and [14], the midpoints of the matched foreground segments in a
pair of camera views are back-projected to yield an intersection point in 3D space. Such points are
then projected onto the ground plane to generate a probability distribution map for object locations.
Berclaz, Fleuret and Fua [15] divided the ground plane into grids and calculated an occupancy map in
the ground plane. The probability that each sub-image corresponds to the average size of a person in
each camera view is warped from that camera view to the top view by using the ground-plane
homography independently. The ground plane is later extended to a set of planes parallel to, but at
some height off, the ground plane to reduce false positives and missed detections [16]. In [17], a
similar procedure is followed but the set of parallel planes are at the height of people's heads. This
method is able to handle highly crowded scenes because the feet of a pedestrian are more likely to be
occluded in a crowd than the head. Their work achieves good results in moderately crowded scenes.
These methods fully utilize the visual cues from multiple cameras and have high-level information
fusion.
The method proposed in this paper is used to detect pedestrians via multi-view and multi-plane
data fusion, which utilizes head-plane homographies, occupancy maps and statistical colour matching
to identify pedestrians and phantoms. In [15] an occupancy map is extracted on the ground plane only.
However, foreground projections on the ground plane are vulnerable to missed detections of the feet and legs in the individual camera views. In [16] homographies for the ground plane and a set of planes
parallel to, but at some height off, the ground plane are used to reduce false positives and missed
detections. However, if the feet and legs of a pedestrian are lost in the detection in any individual
camera views, the joint foreground likelihood of the pedestrian becomes very low and may still be lost
in the multiview detection. In [17] a head-plane homographic transformation, which is used to align
the intensity correlation between each pixel in one camera view and its corresponding pixel in the
other camera view, is used to detect the head patch of each pedestrian. However, the pixelwise
intensity correlation in [17] is sensitive to homography estimation errors and textured patterns in
foregrounds. Instead, in the proposed method, a statistical approach based on colours is used, which is
more robust than the pixelwise intensity correlation.
Figure 1 shows a flowchart of the proposed pedestrian detection algorithm based on the
occupancy map and colours. To identify the false-positive detections, which occur due to the
foreground intersections of non-corresponding objects in a reference camera view, the occupancy
information and colour matching are combined in the head plane detection. Table 1 explains the
notation used in the equations of this paper.
3 Homography Estimation
A planar homography is a 3×3 transformation matrix relating a pair of images of the same plane captured by different cameras. Usually one image is chosen as the reference image, in which data association or fusion is carried out. The reference image is often captured by a real camera, but sometimes it is a synthetic top view or an aerial top view image. Let $(u_c, v_c)$ be a point on that plane in an image captured by a real camera $c$ and $(u_t, v_t)$ be the corresponding point in a top view image. $p^c = [u_c, v_c, 1]^T$ and $p^t = [u_t, v_t, 1]^T$ are the homogeneous coordinates of those two points. They are associated by the homography matrix $H^{c,t}$ as follows:

$$p^t \cong H^{c,t} p^c \qquad (1)$$
$$H^{c,t} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{bmatrix} \qquad (2)$$

where $\cong$ denotes that the homography is defined up to an unknown scale factor. The superscript $c$ in $p^c$ and $H^{c,t}$ indicates the camera index ($c = a, b$). Since the virtual top view is selected as the reference view in this paper, the superscript $t$ in $p^t$ and $H^{c,t}$ indicates the top view.
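As an illustration of Eq. (1), the following minimal sketch (Python with NumPy; the function name is ours, not from the paper) maps a pixel from camera view c to the top view and normalises the homogeneous result:

```python
import numpy as np

def warp_point(H_ct, u_c, v_c):
    """Map a pixel from camera view c to the top view t with a 3x3 homography (Eq. 1).

    H_ct is the homography H^{c,t}; the result is normalised so that the
    third homogeneous coordinate equals one.
    """
    p_c = np.array([u_c, v_c, 1.0])   # homogeneous coordinates p^c
    p_t = H_ct @ p_c                  # p^t ~ H^{c,t} p^c, defined up to scale
    return p_t[:2] / p_t[2]           # (u_t, v_t)
```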
3.1 Estimation of the Ground-Plane Homography
By matching corresponding feature points in an image pair, the homography between two camera
views can be estimated. The most commonly used feature is corresponding points, though other
features such as lines or ellipses may be used [18][19]. These features are selected and matched
manually or automatically in 2D images [11]. In this paper a set of manually selected corresponding
points was used to calibrate the cameras.
As the homography transformation is a special case of the projective transformation, the parameters calculated in the camera calibration can be used to determine the homography matrix for the ground plane. Let $p^w = [X_w, Y_w, Z_w, 1]^T$ be a point in 3D space and $p^c$ be the corresponding point in an image captured by a real camera. The relationship that maps $p^w$ to $p^c$ can be written using a $3 \times 4$ projection matrix $M$ as follows:

$$p^c = M p^w = [m_1, m_2, m_3, m_4]\, p^w \qquad (3)$$

where $m_1$, $m_2$, $m_3$ and $m_4$ are $3 \times 1$ vectors. The matrix $M$ is composed of the intrinsic and extrinsic parameters of the underlying camera, which can be obtained from the camera calibration.
By assuming that points $p^w$ and $p^c$ are on the ground plane, the point $p^w$ in 3D space can be denoted as $p_0^w = [X_w, Y_w, 0, 1]^T$, where $Z_w = 0$ and the subscript 0 denotes the ground plane. Assuming there is a virtual camera which generates a virtual top view, $p_0^w$ in the top view is denoted as $p_0^t = [u_t, v_t, 1]^T$, where $u_t = X_w$ and $v_t = Y_w$. Therefore, the ground-plane homography $H_0^{t,c}$ can be simplified as a degraded $M$ matrix [20]:

$$H_0^{t,c} = (H_0^{c,t})^{-1} = [m_1, m_2, m_4] \qquad (4)$$
$$p_0^c = H_0^{t,c}\, p_0^t \qquad (5)$$
$$p_0^t = H_0^{c,t}\, p_0^c \qquad (6)$$
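A minimal sketch of Eq. (4), assuming the calibrated $3 \times 4$ projection matrix is available as a NumPy array `M`; the helper name is ours:

```python
import numpy as np

def ground_plane_homography(M):
    """Ground-plane homography from the 3x4 projection matrix M (Eq. 4).

    Dropping the third column of M (which multiplies Z_w = 0 on the ground
    plane) gives H_0^{t,c} = [m1 m2 m4]; its inverse is H_0^{c,t}.
    """
    H0_tc = M[:, [0, 1, 3]]        # columns m1, m2, m4
    H0_ct = np.linalg.inv(H0_tc)   # camera-to-top-view direction
    return H0_tc, H0_ct
```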
3.2 Estimation of Multi-Plane Homographies
Homography mapping is not limited to the homography for the ground plane; it can be extended to a set of planes parallel to the ground plane at some height. If the camera is calibrated, the multi-plane homographies can be calculated from the parameters determined in the calibration process
directly. After the homography matrix for the ground plane is estimated, the homography matrix for a
plane parallel to and off the ground plane can be recovered from the perspective properties such as the
vanishing point in the vertical direction [16] or the cross-ratio of four collinear points [17]. The
homography matrix can also be approximated by using interpolated points between the head and feet
of each pedestrian [21].
For a plane parallel to the ground and at a height of $h$, let $p_h^w = [X_w, Y_w, Z_w, 1]^T$ be a point on that plane in 3D space, where $Z_w = h$. Assuming there is a virtual camera which generates a virtual top view, $p_h^w$ in the top view is denoted as $p_h^t = [u_t, v_t, 1]^T$, where $u_t = X_w$ and $v_t = Y_w$. According to Eq. (3), the corresponding point for $p_h^w$ in camera view $c$ is $p_h^c = m_1 X_w + m_2 Y_w + m_3 h + m_4$. Since each element in the third column $m_3$ multiplies the value $h$ and then adds its corresponding element in the last column $m_4$, the projection from point $p_h^t$ in the top view to point $p_h^c$ in camera view $c$ according to the homography for plane $h$ becomes:

$$p_h^c = H_h^{t,c}\, p_h^t = [m_1, m_2, m_4 + h m_3]\, p_h^t \qquad (7)$$

The homography for plane $h$ can be represented as a combination of the homography for the ground plane and the third column of the projection matrix $M$ multiplied by the given height $h$:

$$H_h^{t,c} = H_0^{t,c} + [\,\mathbf{0} \mid h m_3\,] \qquad (8)$$

where $\mathbf{0}$ is a $3 \times 2$ zero matrix [20]. The relationship between $H_0^{t,c}$ and $H_h^{t,c}$ is illustrated in Figure 2. Points $p_0^w$ and $p_h^w$ in 3D space map to the same point in the top view.
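The head-plane homography of Eq. (8) then follows directly from the same projection matrix. A hedged sketch, reusing the assumptions of the previous snippet:

```python
import numpy as np

def plane_homography(M, h):
    """Homography H_h^{t,c} for the plane Z_w = h (Eq. 8).

    The ground-plane homography [m1 m2 m4] is augmented by adding h*m3
    (the scaled third column of M) to its last column.
    """
    H_h = M[:, [0, 1, 3]].astype(float).copy()   # H_0^{t,c} = [m1 m2 m4]
    H_h[:, 2] += h * M[:, 2]                     # last column becomes m4 + h*m3
    return H_h
```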
4 Foreground Detection
Foreground detection, which separates moving objects from the background in each frame, is an essential process in visual surveillance systems. For a fixed camera, foreground detection
suffers mainly from changes of lighting conditions, moving cast shadows, switching backgrounds, etc.
Although there are three main categories of foreground detection algorithms, i.e. optical flow,
temporal differencing, and background subtraction, we choose background subtraction because it
copes better with the surveillance background. The other two can be very sensitive to the
environmental lighting conditions and the irregular motion patterns of pedestrians.
Background subtraction requires i) calculating a background image, ii) subtracting the background image from each new frame, and iii) thresholding the subtraction result. Since the
foreground pixels are identified according to the pixelwise difference between the new frame and the
background image, this method is highly dependent on a good background model, which should not
be sensitive to illumination variations, shadows and waving vegetation. The Gaussian Mixture Model
(GMM) [22] is the most widely used method to cope with changing background elements (e.g.,
waving trees). According to the assumption that a background pixel is more stable than a foreground
pixel in pixel value, the value of a background pixel is modelled by a mixture of Gaussian
distributions. The sum of the probability density functions weighted by the corresponding priors
represents the probability that a pixel is observed at a particular intensity or colour.
After detection of the foreground pixels in each camera view, these pixels are grouped into
foreground regions by applying connected component analysis, and then morphological operations
and a size filter are applied to fill holes and remove small regions. Once the foreground regions have
been identified in a camera view, each foreground region needs to be projected to a reference view
according to the homography for a certain plane. As a pixelwise homographic transformation is time
consuming, each foreground region is represented by the polygon approximation of its contour [23].
The Douglas-Peucker (DP) method [24] has been used for the polygon approximation. Instead of
applying the inverse homography to each pixel in the reference view, the vertices of the polygon of
each detected foreground region are projected to the reference view through homography mapping.
Let $F_i^a$ be the $i$-th foreground region detected in camera $a$. The contour of $F_i^a$ is represented by an ordered set of $N$ points on the contour curve. To make the representation of the contour point set more compact, the original contour is approximated by a polygon using the Douglas-Peucker (DP) algorithm. The contour of $F_i^a$ is thus approximated by a set of vertices $V_i^a$, which will be used in the next section for foreground projection and fusion in the top view.
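A compact sketch of this detection stage, using OpenCV's MOG2 background subtractor as one GMM implementation; the thresholds MIN_AREA and DP_EPSILON are illustrative values, not those used in the paper:

```python
import cv2

MIN_AREA = 200      # size-filter threshold in pixels (illustrative)
DP_EPSILON = 2.0    # Douglas-Peucker tolerance in pixels (illustrative)

bg_model = cv2.createBackgroundSubtractorMOG2(detectShadows=True)

def foreground_polygons(frame):
    """Detect foreground regions with a GMM background model and return the
    Douglas-Peucker polygon (vertices V_i) of each sufficiently large region."""
    fg = bg_model.apply(frame)
    _, fg = cv2.threshold(fg, 200, 255, cv2.THRESH_BINARY)        # drop shadow labels
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    fg = cv2.morphologyEx(fg, cv2.MORPH_CLOSE, kernel)             # fill small holes
    contours, _ = cv2.findContours(fg, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    polygons = []
    for c in contours:
        if cv2.contourArea(c) >= MIN_AREA:                         # size filter
            polygons.append(cv2.approxPolyDP(c, DP_EPSILON, True)) # polygon vertices
    return polygons
```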
5. Multi-View and Multi-Plane Data Fusion for Pedestrian Detection
When the foreground regions for the same object are warped from multiple views to the top view,
they will intersect at a location where the object touches the ground. Although the ground plane is the
most commonly used plane in homography mapping, the foreground projections of the same object from the multiple camera views may fail to intersect in the reference view.
Since pedestrians’ feet and legs are small and thin, they are frequently missed in detection.
Homography estimation errors are another reason for missed intersections. In addition, when a
pedestrian is striding and hence has the two legs separated, the two legs may cause additional phantom
intersections.
If the foreground projections from individual camera views to the top view are based on the
homography for a plane off the ground, the object detection will be more robust. In this paper, the
homographies for planes around head height are used to detect pedestrians and to estimate the height of each pedestrian. The
foreground regions detected in each camera view are warped into the top view according to the
homographies for a set of parallel planes at different heights. The intersection regions in the top view
indicate all the possible regions that contain moving objects. To identify false-positive detections,
which occur due to the foreground intersections of non-corresponding objects in the top view, the
occupancy information and colour cues are analysed.
5.1 Region Based Foreground Fusion
Each foreground region is represented by the polygon approximation of its contour. After the contour of $F_i^a$ has been approximated by the vertices $V_i^a$, instead of applying a pixelwise homographic transformation to the foreground images, only the vertices of the foreground polygons need to be projected onto the reference image. The foreground regions are then rebuilt by filling the internal area of each polygon with a fixed value [21].
Using the homography estimation method described in Section 3.1, the ground-plane homography can be determined. The vertices $V_i^a$ of the $i$-th foreground polygon from camera view $a$ are projected to the top view $t$ according to the ground-plane homography $H_0^{a,t}$. Let $V_{i,0}^{a,t}$ be the set of projected vertices in the top view $t$:

$$V_{i,0}^{a,t} = H_0^{a,t}(V_i^a) \qquad (9)$$

Since the vertices in $V_i^a$ form an ordered list, connecting each projected vertex with its neighbour sequentially generates a new polygon in the top view $t$. This new polygon approximates the contour of the projected foreground region $F_{i,0}^{a,t}$, which is the projection of $F_i^a$ from camera view $a$ to the top view $t$ according to the ground-plane homography. Then $F_{i,0}^{a,t}$ is rebuilt by filling the internal area of the projected foreground polygon with a fixed value. Thus, $F_{i,0}^{a,t}$ can be described as:

$$F_{i,0}^{a,t} = H_0^{a,t}(F_i^a) \qquad (10)$$
Figure 3 shows a schematic diagram of the homographic projection based on the ground plane and
a plane parallel to the ground plane. Plane h is an imaginary plane parallel to the ground plane and at a
height of h. The warped foreground region of an object in the top view is observed as the intersection
of the ground plane and the cone swept out by the silhouette of that object.
Given foreground region $F_i^a$ from camera view $a$ and $F_j^b$ from camera view $b$, $F_{i,h}^{a,t}$ and $F_{j,h}^{b,t}$ are the projected foreground regions from the two camera views to the top view according to the homographies $H_h^{a,t}$ and $H_h^{b,t}$ for a plane at a height of $h$. If the two projected foreground regions intersect in the top view, these two regions in the top view and their original foreground regions in both camera views are defined as a pair of projected foreground regions and a pair of foreground regions, respectively. The intersection region of the projected foreground regions $F_{i,h}^{a,t}$ and $F_{j,h}^{b,t}$ is determined as:

$$R_{i,j,h}^t = F_{i,h}^{a,t} \cap F_{j,h}^{b,t} = \left( H_h^{a,t}(F_i^a) \right) \cap \left( H_h^{b,t}(F_j^b) \right) \qquad (11)$$
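Eqs. (9)-(11) can be realised by warping only the polygon vertices and rasterising the result, for example as in the following sketch (OpenCV assumed; the top-view size and helper names are illustrative):

```python
import cv2
import numpy as np

TOP_VIEW_SHAPE = (1000, 840)   # (rows, cols) of the virtual top view; illustrative

def project_region(polygon, H_at):
    """Warp the polygon vertices into the top view (Eq. 9) and rebuild the
    projected foreground region as a filled mask (Eq. 10)."""
    pts = polygon.reshape(-1, 1, 2).astype(np.float32)
    warped = cv2.perspectiveTransform(pts, H_at)          # projected vertices V^{a,t}
    mask = np.zeros(TOP_VIEW_SHAPE, np.uint8)
    cv2.fillPoly(mask, [warped.reshape(-1, 2).astype(np.int32)], 255)
    return mask

def intersection_region(poly_a, H_at, poly_b, H_bt):
    """Intersection of two projected foreground regions in the top view (Eq. 11)."""
    return cv2.bitwise_and(project_region(poly_a, H_at),
                           project_region(poly_b, H_bt))
```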
5.2 Phantoms
When the foreground regions in the individual camera views are projected into the top view
according to the homography for the ground plane or a plane parallel to the ground plane and at some
height, the foreground projections may intersect. If the intersecting foreground regions correspond to
the same object in different camera views, the intersection region reports the location where the object
touches the plane used in the homography projection. If the intersection regions are caused by non-
corresponding foreground regions from different camera views, they are false-positive detections or
phantoms. This is an important problem in multi-camera moving object detection using foreground
homography mapping.
Fig. 4 is a schematic diagram which illustrates how non-corresponding foreground regions
intersect and give rise to a false-positive detection in homography mapping. In Fig. 4(a), the
foreground regions of two objects are projected from two camera views into the top view by using the
ground plane homography. The foreground projections intersect in three regions on the ground plane.
The white intersection regions are the locations of the two objects, whilst the black region is a
phantom. If homography mapping for a plane higher than the ground plane is used, the projected
foreground regions move closer to the cameras. This can be partly explained in Fig.3, in which the
projected foreground region on plane h is closer to the camera than that on the ground plane. When
such projected foreground regions on the plane off the ground intersect in the top view, additional
phantoms may be generated, as shown in Fig.4(b) in which the grey region is an additional phantom.
5.3 Occupancy Maps
If the intersection region $R_{i,j,h}^t$ is due to the same object in different camera views, it indicates the location where the object is intersected by plane $h$. When plane $h$ is at different heights and parallel to the ground plane, the intersection region $R_{i,j,h}^t$ varies in its size and shape, which approximates the widths of the corresponding body parts at different heights. Fig. 5 shows an example of the foreground intersection regions generated by the same pedestrian according to the homographies for planes at different heights. There are two pedestrians in each of the two camera views. When the foreground polygons of the same pedestrian in both camera views are projected into the top view, the intersection of the projected foreground polygons shows the location of the pedestrian intersected by that plane. Fig. 5(a) shows the intersection region according to the homography for the ground plane. Such an intersection region is then warped back into a single camera view as shown in Fig. 5(b), where the back-warped intersection region in camera view $b$ is represented with a red colour. Figs. 5(c)-5(h) show the results at heights of 0.5 m, 1.0 m and 1.5 m, respectively.
By assuming that the pedestrians are standing upright, the intersection regions $R_{i,j,0}^t$ and $R_{i,j,h}^t$ are roughly at the same position in the top view, as illustrated in Figs. 5(a) and 5(c). Let $p_{i,j,h}^t = [u, v]^T$ be a pixel in $R_{i,j,h}^t$ in the top view. If $R_{i,j,h}^t$ is occupied by a real object, when the pixel is warped back to camera view $c$ ($c = a, b$) according to the homography for a plane at the pedestrian's height, $H_h^{t,c} = (H_h^{c,t})^{-1}$, the warped back pixel $p_{i,j,h}^c$ is located around the head area of the corresponding foreground region in camera view $c$.
When the pixel is warped back to camera view $c$ according to the ground-plane homography $H_0^{t,c} = (H_0^{c,t})^{-1}$, the warped back pixel $p_{i,j,0}^c$ is located around the foot area of the corresponding foreground region in camera view $c$. Pixels on the line segment which connects those two warped back pixels are usually located inside the corresponding foreground region in camera view $c$. However, if $R_{i,j,h}^t$ is a phantom region, the warped back pixels $p_{i,j,h}^c$ and $p_{i,j,0}^c$ are not located around the head area or foot area of their corresponding foreground region in camera view $c$, and some pixels on the line segment which connects those two warped back pixels are located outside the corresponding foreground region. Therefore, the percentage of pixels on that line segment that fall inside the corresponding foreground region indicates the likelihood that the original pixel $p_{i,j,h}^t$ in the top view belongs to a real object. The top-view occupancy likelihood derived from camera view $c$ is denoted as $L_{i,j,h}^{t,c}(u,v)$. In reality, the legs of a pedestrian may be separated during walking. Therefore, a margin of 5 pixels around the line segment is used in our algorithm. If a pixel on that line segment is not inside the corresponding foreground region, the ±5 pixels beside that pixel in the horizontal direction are scanned. If any of these 10 pixels is inside the foreground region, that pixel will still be recognized as a foreground pixel.
Each pixel in the intersection region is assigned its joint occupancy likelihood from all camera views:

$$L_{i,j,h}^t(u,v) = \prod_{c} L_{i,j,h}^{t,c}(u,v) \qquad (12)$$

The joint occupancy likelihoods of all the intersection regions in the top view constitute an occupancy map for a specific plane [25].
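A sketch of the per-pixel occupancy likelihood described above, assuming the foreground mask of camera view c and the two top-view-to-camera homographies are available; for simplicity the ±5-pixel margin is applied to every sampled point rather than only to points that miss the foreground:

```python
import cv2
import numpy as np

MARGIN = 5   # horizontal search margin around the head-foot line, in pixels

def occupancy_likelihood(u, v, H0_tc, Hh_tc, fg_mask):
    """Occupancy likelihood L^{t,c}(u, v) of one top-view pixel, derived from a
    single camera view: the fraction of samples on the segment between the
    warped-back head and foot points that land on the foreground mask."""
    p_t = np.array([[[u, v]]], np.float32)
    foot = cv2.perspectiveTransform(p_t, H0_tc)[0, 0]   # warped back with H_0^{t,c}
    head = cv2.perspectiveTransform(p_t, Hh_tc)[0, 0]   # warped back with H_h^{t,c}
    n = int(np.max(np.abs(head - foot))) + 1
    rows, cols = fg_mask.shape
    hits = 0
    for x, y in zip(np.linspace(head[0], foot[0], n), np.linspace(head[1], foot[1], n)):
        yi = int(np.clip(y, 0, rows - 1))
        xs = np.clip(np.arange(int(x) - MARGIN, int(x) + MARGIN + 1), 0, cols - 1)
        if fg_mask[yi, xs].any():   # margin scan in the horizontal direction
            hits += 1
    return hits / n
```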
5.4 Colour Matching of Intersection Regions
Since colour is a strong cue to differentiate objects, the colours of the warped back intersection
regions in individual camera views are utilized to identify whether two foreground projections from
different camera views are due to the same object. Pixelwise colour matching is sensitive to
homography estimation errors. It also assumes that the foreground regions of the same object have
consistent colour patterns in different camera views. In reality, because multiple cameras are often
placed at different orientations, the same object may appear slightly different in the colour patterns in
different camera views. Therefore, pixelwise colour matching cannot achieve good results in these
situations and a statistical approach based on colour cues is proposed.
The colour statistical approach is not based on the colour of each intersection region in the top
view, because the warping operation may change the statistical properties of a colour patch. This
approach is based on the original colour patches in the two camera views, which correspond to the
same foreground intersection region in the top view. A pair of such colour patches often have different
sizes. Initially each intersection region in the top view needs to be warped back to the individual
camera views. Given a head-plane intersection region $R_{i,j,h}^t$ in the top view, the image patch in camera view $a$, which is warped back from the top view using the homography for the head plane $h$, is as follows:

$$R_{i,j,h}^a = H_h^{t,a}(R_{i,j,h}^t) \qquad (13)$$

The colour model of each warped back patch is built by using the colours of all the pixels in that patch. A Gaussian mixture model (GMM) is used to handle the multiple colours in the warped back region. Suppose $x$ is the $d$-dimensional colour vector of a pixel in the colour patch and $K$ is the number of Gaussian distributions; the corresponding GMM can be denoted as:

$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k) \qquad (14)$$

$$\mathcal{N}(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}}\, e^{-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)} \qquad (15)$$
The K-means algorithm and the Expectation-Maximization (EM) algorithm are used to determine the parameters of the probability density functions in the Gaussian mixture model. For the warped back patch $R_{i,j,h}^a$, its colour is modelled by $K$ Gaussian distributions $\mathcal{N}(\pi_{i,n}^a, \mu_{i,n}^a, \Sigma_{i,n}^a)$, $n \in [1, K]$, where $\pi_{i,n}^a$, $\mu_{i,n}^a$ and $\Sigma_{i,n}^a$ denote the weight, mean and covariance of the $n$-th Gaussian distribution, respectively. The $K$ Gaussians are in descending order according to the magnitudes of the weights ($\pi_{i,1}^a$ is the greatest weight).

For the warped back patch $R_{i,j,h}^b$ in camera view $b$, its colour is modelled by $K$ Gaussian distributions $\mathcal{N}(\pi_{j,m}^b, \mu_{j,m}^b, \Sigma_{j,m}^b)$, where $m \in [1, K]$. The colour similarity of such a pair of warped back patches, which corresponds to the same intersection region in the top view, is measured by using the Mahalanobis distance between the two colour models. Since the Gaussian distributions in each GMM are ranked in descending order, the first distribution is always the dominant distribution in the GMM. The first $r$ distributions, which have an overall weight over a given threshold $T_r$, are defined as significant distributions. That is:

$$\sum_{m=1}^{r-1} \pi_{j,m}^b < T_r \quad \text{and} \quad \sum_{m=1}^{r} \pi_{j,m}^b \geq T_r \qquad (16)$$
Then, the Mahalanobis distance is used to calculate the similarity of two Gaussian distributions. Unlike the Euclidean distance, which only measures the distance between the means of two Gaussian distributions, the Mahalanobis distance uses the covariances of the two Gaussian distributions to normalize the squared distance between the means. The Mahalanobis distance between the dominant distribution $\mathcal{N}(\pi_{i,1}^a, \mu_{i,1}^a, \Sigma_{i,1}^a)$ for patch $R_{i,j,h}^a$ in camera view $a$ and each of the significant distributions for the corresponding patch in camera view $b$ is defined as follows:

$$s_{i,j,m}^a = (\mu_{i,1}^a - \mu_{j,m}^b)^T (\Sigma_{i,1}^a + \Sigma_{j,m}^b)^{-1} (\mu_{i,1}^a - \mu_{j,m}^b), \quad m \in [1, r] \qquad (17)$$

The Mahalanobis distance between the dominant distribution for patch $R_{i,j,h}^a$ in camera view $a$ and the corresponding patch in camera view $b$ is:

$$s_{i,j}^a = \min_{m \in [1, r]} \{ s_{i,j,m}^a \} \qquad (18)$$

Using the same process, the Mahalanobis distance between the dominant distribution for patch $R_{i,j,h}^b$ in camera view $b$ and the corresponding patch in camera view $a$ is $s_{i,j}^b$. Then the colour distance between the pair of colour patches is defined as:

$$s_{i,j}^{a,b} = \min \{ s_{i,j}^a, s_{i,j}^b \} \qquad (19)$$
The smaller the colour distance $s_{i,j}^{a,b}$, the more likely it is that the two foreground regions are from the same object.
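The colour-matching step of Eqs. (14)-(19) can be sketched with scikit-learn's GaussianMixture (which, like the paper, uses K-means initialisation followed by EM); the number of components K and the cumulative-weight threshold T_R are illustrative, as the paper does not report their values:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

K = 3      # number of Gaussian components per colour patch (illustrative)
T_R = 0.6  # cumulative-weight threshold T_r for significant distributions (illustrative)

def fit_colour_gmm(patch_pixels):
    """Fit a K-component GMM to the colour vectors (e.g. hue and saturation) of a
    warped-back patch and return its components sorted by descending weight."""
    gmm = GaussianMixture(n_components=K, covariance_type='full').fit(patch_pixels)
    order = np.argsort(gmm.weights_)[::-1]
    return gmm.weights_[order], gmm.means_[order], gmm.covariances_[order]

def one_way_distance(model_a, model_b):
    """Mahalanobis distance (Eqs. 17-18) between the dominant component of
    model_a and the significant components of model_b (Eq. 16)."""
    _, mu_a, cov_a = model_a
    w_b, mu_b, cov_b = model_b
    r = int(np.searchsorted(np.cumsum(w_b), T_R)) + 1     # smallest r satisfying Eq. (16)
    dists = []
    for m in range(r):
        d = mu_a[0] - mu_b[m]
        dists.append(d @ np.linalg.inv(cov_a[0] + cov_b[m]) @ d)
    return min(dists)

def colour_distance(model_a, model_b):
    """Symmetric colour distance s^{a,b} between a pair of patches (Eq. 19)."""
    return min(one_way_distance(model_a, model_b), one_way_distance(model_b, model_a))
```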
5.5 Head-Plane Foreground Intersections
It is assumed that pedestrians are standing upright, while their heights may vary from one person to another. In this paper, homographies for planes parallel to the ground plane and around the pedestrians' average height are used to detect pedestrians. Considering the variety of pedestrians' heights, the height range of the multiple planes is from 1.5 to 1.8 m, which covers the heights of most people. If the variation in pedestrians' heights is larger, the height range can be further extended, which means more homographic planes will be used. Let $h \in [1.5, 1.8]$ m and let $R_{i,j}^t$ represent a set of foreground intersection regions at different heights but at the same location in the top view. An intersection region is recognized as the top of a person's head if it is at the highest height in such a set, its occupancy likelihood is higher than a threshold $T_L$, its colour distance between the two camera views is lower than a threshold $T_S$, and its area is over a threshold $T_A$. The highest height at which that intersection region appears indicates the height of the person, and the location of that intersection region indicates the location of the person.
The details of the phantom pruning algorithm are described in Algorithm 1.

Algorithm 1:
1: for each imaginary plane parallel to the ground plane and at a height of h do
2:   project each foreground region in both camera views to the top view t according to $H_h^{c,t}$ (c = a, b)
3:   for each intersection region $R_{i,j,h}^t$ of foreground projections do
4:     for each pixel $p_{i,j,h}^t$ at (u, v) do
5:       warp the pixel back to camera view c (c = a, b) according to $H_0^{t,c}$ and $H_h^{t,c}$, to calculate the occupancy likelihood $L_{i,j,h}^{t,c}(u,v)$
6:       calculate the joint likelihood $L_{i,j,h}^t(u,v)$
7:     end for
8:     calculate the joint likelihoods for all intersection regions, which leads to an occupancy map
9:   end for
10: end for
11: for each set of intersection regions $R_{i,j}^t$ which are at the same location but at different heights do
12:   choose $R_{i,j,h_{max}}^t$ that satisfies $A_{i,j,h_{max}}^t > T_A$ and $L_{i,j,h_{max}}^t > T_L$
13:   warp $R_{i,j,h_{max}}^t$ back to camera views a and b
14:   build GMM models for $R_{i,j,h_{max}}^a$ and $R_{i,j,h_{max}}^b$
15:   calculate the colour distance $s_{i,j}^{a,b}$
16:   if $s_{i,j}^{a,b} < T_S$ then
17:     $R_{i,j,h_{max}}^t$ is the top of a head
18:   end if
19: end for
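A hedged sketch of the final decision in lines 11-18 of Algorithm 1, assuming a hypothetical per-location record of the candidate intersection regions and reusing the colour_distance helper from the Section 5.4 sketch; the threshold values follow Section 6:

```python
T_L, T_S, T_A = 0.6, 1000.0, 10   # thresholds reported in the experiments

def detect_head(regions):
    """Return (height, region) of the head-top detection, or None for a phantom.

    `regions` is a list of dicts {'h', 'area', 'likelihood', 'gmm_a', 'gmm_b'}
    for the same top-view location at different plane heights (Algorithm 1)."""
    candidates = [r for r in regions if r['area'] > T_A and r['likelihood'] > T_L]
    if not candidates:
        return None
    best = max(candidates, key=lambda r: r['h'])              # highest plane h_max
    if colour_distance(best['gmm_a'], best['gmm_b']) < T_S:   # colour matching test
        return best['h'], best
    return None
```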
6 Experimental Results
The algorithm has been tested on video sequences captured on the authors' campus. Two cameras were placed with small viewing angles and significantly overlapping fields of view. People walked around within a 4.0 m × 2.4 m region to ensure some degree of occlusion. There are 2790 frames captured in each camera view with a resolution of 640×480 pixels and a frame rate of 15 fps. 2155 frames that contain two or three pedestrians were used in the experiments (the first 660 frames contain no pedestrians or only one pedestrian). In this experiment, a virtual top view image was selected as the reference image with a resolution of 840×1000 pixels. The test was run on a single PC with an Intel Core i7 CPU running at 2.9 GHz.
Fig.6 shows the foreground detection using background subtraction and GMMs for frame 1200.
For the two original camera views shown in Fig. 6(a) and Fig. 6(b), the results of foreground detection
are shown in Fig. 6(c) and Fig. 6(d), respectively. There are three pedestrians which are labelled 1 to 3
in camera view a and labelled a to c in camera view b.
Each foreground polygon in a camera view was warped to the top view according to the
homographies for a set of parallel planes. In this experiment, seven planes which are parallel to and
between 1.5-1.8 meters above the ground plane were used. These planes are around the head level of
the pedestrians. Fig.7 shows the overlaid foreground projections from the two camera views using the
homographies for the planes at heights of 1.5m, 1.6m, 1.7m, and 1.8m, respectively. Each foreground
projection in the top view is given the same label as the corresponding foreground region in the
individual camera views. The foreground projections intersect at 8 regions for the height of 1.5m.
Fig.8 shows the foreground intersection regions of homography mapping at a height of 1.5m. Each
intersection region is given a label to indicate the corresponding foreground regions in the two camera
views. For example, region 1b is the intersection of foreground region 1 in camera view a and
foreground region b in camera view b.
Then, each pixel in the intersection regions is warped back to the individual camera views
according to the homographies for the ground plane and a plane at a height of 1.5 meters. In Fig.9, a
pixel in the intersection region is warped back and overlaid on the foreground regions in the individual
camera views. The blue dot and the red dot in each camera view indicate the warped back pixels
according to the homographies for the two planes. The yellow line indicates all the pixels on the line
segment defined by the pair of warped back pixels. The occupancy likelihood of the original pixel in a
single camera view is calculated. The occupancy likelihoods for all the pixels lead to an occupancy
map in the top view.
The occupancy map indicates the likelihood that each intersection region is occupied by a real
object. Table 2 shows the average occupancy likelihood for each intersection region. For intersection
region 1b in the top view, when each pixel in that region is warped back to camera view a, the average
occupancy likelihood in camera view a is 0.853. Since the average occupancy likelihood of the same
intersection region in camera view b is 0.973, the joint occupancy likelihood is 0.830. The intersection
regions which have the joint likelihoods higher than 0.6 are in bold. To visualize the results, in Fig.10,
each intersection region in the top view is filled with its average occupancy likelihood, in which the
darkest intensity corresponds to the highest likelihood.
Since colour matching is carried out in the individual camera views, the intersection regions in the top view are warped back into the individual camera views according to the inverse homography for the plane at a height of 1.5 meters. Fig. 11 shows the warped back patches overlaid onto the
original camera views. The regions with blue boxes are the warped back patches. Each warped back
patch is given the same label as the corresponding intersection region in the top view.
Table 3 shows the colour matching results for the same frame. The hue and saturation components
in the HSI colour space are used in the colour matching. Since the colour distances of unmatched
intersection regions are usually much greater than 1000, those for matched intersection regions are in
bold. Intersection regions 2, 4 and 6 in Table 3 are identified as matched regions. Table 4 shows the
results using seven planes at a series of heights ranging from 1.50m to 1.80m, incrementing by 0.05m.
Intersection regions 2, 4 and 6 are identified as real objects because their occupancy likelihoods are higher than the threshold $T_L$, their colour distances are lower than the threshold $T_S$, and their areas are over the threshold $T_A$. The thresholds can be decided empirically. In this experiment, $T_L = 0.6$, $T_S = 1000$ and $T_A = 10$. Their corresponding heights are estimated as 1.55 m, 1.75 m and 1.75 m, respectively. These are the highest heights at which an intersection region can be detected.
To evaluate the performance, the proposed algorithm has been tested over 142 frames, sampled every 15 frames from the 2155 frames of the video sequence. Tables 5 and 6 show the performance evaluation results of the pedestrian detection. In Table 5, the classification results are compared with the ground truth. 786 intersection regions in the 142 frames are classified into two categories: object regions and phantom regions. Of the 360 ground-truth object regions, 342 were correctly identified. Of the 426 ground-truth phantom regions, 10 were misclassified as object regions.
The false negative rate and the false positive rate of the classification are shown in Table 6. For each category, let $GT$ be the ground-truth number of that category. The false negatives (missed detections), $FN$, are the intersection regions which belong to that category but are misclassified as the other category. The false positives ($FP$), or false alarms, are the intersection regions which belong to the other category but are misclassified as that category. The false negative rate ($R_{FN}$) is the ratio between the number of false negatives and the number of ground truths, and the false positive rate ($R_{FP}$) is the ratio between the number of false positives and the number of ground truths:

$$R_{FN} = FN / GT, \qquad R_{FP} = FP / GT \qquad (20)$$
For object regions, the false negative rate is 5.0% and the false positive rate is 2.78%. The false
negative rate and the false positive rate for the phantom regions are 2.35% and 4.22% respectively.
The processing time of the proposed algorithm has been measured over the 142 frames by using two cameras and seven parallel planes. All the processing tasks were carried out on a single PC. The overall processing time for the 142 frames is 65.98 seconds, corresponding to 0.46 seconds per frame. The time spent on foreground detection using the GMMs is 31.22 seconds, corresponding to 0.22 seconds per frame and accounting for 46% of the overall processing time. The time spent on foreground projection and overlay is 5.44 seconds, corresponding to 0.037 seconds per frame and accounting for 8.2% of the overall processing time. The time spent on intersection region classification is 29.32 seconds, corresponding to 0.20 seconds per frame and accounting for 44% of the overall processing time.
7. Conclusions
In this paper, we have focused on pedestrian detection in visual surveillance applications. By using
multi-view data fusion, we can not only enlarge the FOV but also achieve more accurate and reliable
detection and tracking. Through homography transformations, multi-view foreground regions are
projected to a reference view, where occlusions can be minimized by making use of cross-view
information. In addition, a stack of virtual top views is generated according to the homographies for multiple planes at different heights, into which the foreground regions detected in each camera view are warped. As a result, the multi-plane data fusion is an efficient solution for removing false-positive detections by using both the occupancy information and colour cues. The proposed approach works well in differentiating pedestrians from false-positive detections. The limitations of the proposed method are that people are assumed to stand on a common plane and that it can only detect pedestrians distributed at up to a medium density.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (NSFC) under Grant
60975082 and the Scientific Research Program funded by Shaanxi Provincial Education Department,
P. R. China, under Grant 15JK1310.
References
[1] Q. Cai and J. K. Aggarwal, “Automatic tracking of human motion in indoor scenes across multiple synchronized video
streams,” in Computer Vision, 1998. Sixth International Conference on. IEEE, 1998, pp. 356–362.
[2] O. Javed, S. Khan, Z. Rasheed, and M. Shah, “Camera handoff: tracking in multiple uncalibrated stationary cameras,” in
Human Motion, 2000. Proceedings. Workshop on. IEEE, 2000, pp. 113–118.
[3] V. Kettnaker and R. Zabih, “Bayesian multi-camera surveillance,” in Computer Vision and Pattern Recognition, 1999.
IEEE Computer Society Conference on., vol. 2. IEEE, 1999.
[4] M. Quaritsch, M. Kreuzthaler, B. Rinner, H. Bischof, and B. Strobl, “Autonomous multicamera tracking on embedded
smart cameras,” EURASIP Journal on Embedded Systems, vol. 2007, no. 1, pp. 35–35, 2007.
[5] S. Khan and M. Shah, “Consistent labeling of tracked objects in multiple cameras with overlapping fields of view,”
Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 25, no. 10, pp. 1355–1360, 2003.
[6] G. P. Stein, "Tracking from multiple view points: Self-calibration of space and time," in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 1999.
[7] M. Xu, J. Orwell, L. Lowey, and D. Thirde, “Architecture and algorithms for tracking football players with multiple
cameras,” IEE Proceedings-Vision, Image and Signal Processing, vol. 152, no. 2, pp. 232–241, 2005.
[8] J. Kang, I. Cohen, and G. Medioni, "Continuous tracking within and across camera streams," in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2003, pp. I-267-I-272 vol. 1.
[9] W. Hu, M. Hu, X. Zhou, T. Tan, J. Lou, and S. Maybank, “Principal axis based correspondence between multiple
cameras for people tracking,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 28, no. 4, pp. 663–671,
2006.
[10] W. Du and J. Piater, “Multi-camera people tracking by collaborative particle filters and principal axis-based
integration,” in Computer Vision–ACCV 2007. Springer, 2007, pp. 365–374.
[11] J. Black and T. Ellis, “Multi camera image tracking,” Image and Vision Computing, vol. 24, no. 11, pp. 1256–1267,
2006.
[12] S. M. Khan, P. Yan, and M. Shah, “A homographic framework for the fusion of multi-view silhouettes,” in Computer
Vision, 2007. ICCV 2007. IEEE 11th International Conference on. IEEE, 2007, pp. 1–8.
[13] A. Mittal and L. Davis, “Unified multi-camera detection and tracking using region-matching,” in Multi-Object
Tracking, 2001. Proceedings. 2001 IEEE Workshop on. IEEE, 2001, pp. 3–10.
[14] A. Mittal and L. S. Davis, "M2tracker: A multi-view approach to segmenting and tracking people in a cluttered scene
using region-based stereo," in Proceedings of the European Conference on Computer Vision, 2002, pp. 18-33.
[15] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, “Multiple object tracking using k-shortest paths optimization,” Pattern
Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 9, pp. 1806–1819, 2011.
[16] S. M. Khan and M. Shah, “Tracking multiple occluding people by localizing on multiple scene planes,” Pattern
Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 3, pp. 505–519, 2009.
[17] R. Eshel and Y. Moses, “Tracking in a dense crowd using multiple cameras,” International journal of computer vision,
vol. 88, no. 1, pp.129–143, 2010.
[18] J. Wright, A. Wagner, S. Rao, and Y. Ma, "Homography from coplanar ellipses with application to forensic blood
splatter reconstruction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 1250-1257.
[19] H. Zeng, X. Deng, and Z. Hu, "A new normalized method on line-based homography estimation," Pattern Recognition
Letters, vol. 29, pp. 1236-1244, 2008.
[20] A. Criminisi, I. Reid, and A. Zisserman, “Single view metrology,” International Journal of Computer Vision, vol. 40,
no. 2, pp. 123–148, 2000.
[21] M. Xu, J. Ren, D. Chen, J. S. Smith, Z. Liu and T. Jia, “Robust object detection with real-time fusion of multiview
foreground silhouettes," Optical Engineering, 51(4), 047202, 2012.
[22] C. Stauffer and W. E. L. Grimson, “Adaptive background mixture models for real-time tracking,” in Computer Vision
and Pattern Recognition, 1999. IEEE Computer Society Conference on., vol. 2. IEEE, 1999.
[23] M. Xu, J. Ren, D. Chen, J. Smith, and G. Wang, “Real-time detection via homography mapping of foreground polygons
from multiple cameras,” in Image Processing (ICIP), 2011 18th IEEE International Conference on. IEEE, 2011, pp. 3593–
3596.
[24] D. H. Douglas and T. K. Peucker, “Algorithms for the reduction of the number of points required to represent a
digitized line or its caricature,” Cartographica: The International Journal for Geographic Information and Geovisualization,
vol. 10, no. 2, pp. 112–122, 1973.
[25] J. Ren, M. Xu and J. S. Smith, "Multi-view pedestrian detection using occupancy color matching," in Proc. 1st IEEE Conf. Multimedia Big Data, April 2015.
Figure 1. Flowchart of the proposed pedestrian detection algorithm.
Figure 2. The relationship between $H_0^{t,c}$ and $H_h^{t,c}$.
Figure 3. A schematic diagram of the foreground projection using homographies for the ground plane and the parallel plane at a height of h.
Figure 4. A schematic diagram of the homography mapping for (a) the ground-plane and (b) a plane above the ground.
Figure 5. An example of the projected foreground intersections due to the same object, using the homographies for a set of planes at different heights: (a) and (b) are the intersection region in the top view and the warped back region in camera view b for the ground plane; (c) and (d) are for the plane at a height of 0.5 m; (e) and (f) are for the plane at a height of 1.0 m; (g) and (h) are for the plane at a height of 1.5 m.
Figure 6. The foreground detection: (a)(b) the original images, and (c)(d) the detected foreground regions.
Figure 7. Fusion of the projected foregrounds in the top view using homographies for a set of parallel planes: (a) the ground plane; (b)-(d) at heights of 1.5 m, 1.6 m, 1.7 m and 1.8 m, respectively.
Figure 8. Intersection regions using the homography for a plane at a height of 1.5m.
Figure 9. The warped back head-foot pixels overlaid on the individual camera views.
Figure 10. The visualized results of the occupancy likelihood map: (a) the likelihoods derived from camera view a, (b) the likelihoods derived from camera view b, and (c) the joint likelihoods. A darker intensity corresponds to a higher likelihood.
Figure 11. The warped back patches overlaid on the original camera views for the plane at a height of 1.5 m.
Table 1. The notation used in this paper.
$p^c$: The homogeneous coordinates of the pixel at image coordinate (u, v) in camera view c.
$H_h^{c,t} = (H_h^{t,c})^{-1}$: Homography transformation matrix, from camera view c to the top view t, for the plane at a height of h (h = 0 is the ground plane). Interchanging the order of the superscripts denotes the inverse homography matrix.
$F_i^a$, $V_i^a$: The i-th foreground region and the corresponding foreground polygon in camera view a.
$F_{i,h}^{a,t}$: Projection of foreground region $F_i^a$ from camera view a to the top view t according to the homography $H_h^{a,t}$.
$R_{i,j,h}^t$: The intersection region of the projected foreground regions $F_{i,h}^{a,t}$ and $F_{j,h}^{b,t}$ in the top view.
$p_{i,j,h}^t$: A pixel at image coordinate (u, v) in $R_{i,j,h}^t$ in the top view.
$L_{i,j,h}^{t,c}(u,v)$: The occupancy likelihood of $p_{i,j,h}^t$, derived from camera view c.
$s_{i,j}^{a,b}$: The colour distance of $R_{i,j}^t$ between both camera views.
Table 2. The average occupancy likelihood for each intersection region.

Index  Region  Camera a  Camera b  Joint likelihood
1      1b      0.853     0.973     0.830
2      1c      0.864     0.947     0.818
3      2a      0.984     0.469     0.462
4      2b      0.926     0.993     0.920
5      2c      0.738     0.458     0.338
6      3a      0.946     0.949     0.899
7      3b      0.465     0.993     0.462
8      3c      0.395     0.704     0.278

Table 3. Colour matching results.

Index  Region  Colour distance
1      1b      6674.54
2      1c      0.041
3      2a      2014.17
4      2b      102.93
5      2c      2103.47
6      3a      88.33
7      3b      6734.66
8      3c      25930.10

Table 4. The average occupancy likelihoods for intersection regions by using seven parallel planes.

Index  Region  1.5m   1.55m  1.6m   1.65m  1.7m   1.75m  1.8m
1      1b      0.830  -      -      -      -      -      -
2      1c      0.818  0.781  -      -      -      -      -
3      2a      0.462  0.488  0.525  0.637  -      -      -
4      2b      0.920  0.901  0.905  0.898  0.888  0.901  -
5      2c      0.338  0.499  0.445  0.486  0.514  0.573  -
6      3a      0.899  0.996  0.979  0.956  0.946  0.964  -
7      3b      0.462  0.472  0.498  0.520  -      -      -
8      3c      0.278  0.294  0.333  0.373  -      -      -
Table 5. Classification results with occupancy maps and colour matching.

                         Classification Results                       Number of the
                         Object Regions (Ob)   Phantom Regions (Ph)   Ground Truth
Ground truth   Ob        342                   18                     360
               Ph        10                    416                    426
Total                    352                   434                    786
Table 6. Classification errors after fusion of occupancy maps and colour matching.

                  False Negative Rate (%)   False Positive Rate (%)
Object regions    5.00                      2.78
Phantom regions   2.35                      4.22