Multi-View and Multi-Plane Data Fusion for Effective Pedestrian Detection in Intelligent Visual Surveillance
Jie Ren1, Ming Xu2, 3, Jeremy S. Smith3, Shi Cheng4
1 College of Electronics and Information, Xi'an Polytechnic University, Xi'an, 710048, China
2 Department of Electrical and Electronic Engineering, Xi'an Jiaotong-Liverpool University, Suzhou, 215123, China
3 Department of Electrical Engineering and Electronics, University of Liverpool, Liverpool, L69 3BX, UK
4 Division of Computer Science, University of Nottingham Ningbo, Ningbo, 315100, China
Abstract: For the robust detection of pedestrians in intelligent video surveillance, an approach to
multi-view and multi-plane data fusion is proposed. Through the estimated homography, foreground
regions are projected from multiple camera views to a reference view. To identify false-positive
detections caused by foreground intersections of non-corresponding objects, the homographic
transformations for a set of parallel planes, which are from the head plane to the ground, are
employed. Multiple features including occupancy information and colour cues are derived from such
planes for joint decision-making. Experimental results on real-world sequences demonstrate the good performance of the proposed approach in pedestrian detection for intelligent visual surveillance.
Keywords: Detection; Video surveillance; Homography; Information fusion.
1 Introduction
Automatic pedestrian detection is a fundamental computer vision task for visual surveillance,
driverless vehicles, intelligent transportation and even smart cities. Detecting multiple pedestrians is
particularly challenging due to the frequent occlusion between moving objects. A reasonable solution is to use multiple cameras to resolve occlusions, as an object occluded in one camera view may be
visible in other camera views. Furthermore, the multiple camera views can extend the overall field of
view (FOV) and allow large spatial coverage for the continuous tracking of the objects.
In multi-view data fusion, the camera views need to be associated. A useful assumption here is
that in all the camera views the objects of interest are on one common plane, which is valid for most
scenarios of intelligent visual surveillance systems. A homography, a geometric transformation that defines a pixelwise mapping between two camera views according to a common plane, can be used as an efficient means of associating multiple camera views. Using a homography transformation
for a specific plane, foreground regions detected from each of the multiple camera views can be
projected to a reference view. The intersection regions of the foreground projections indicate the
locations of moving objects on that plane. This method can achieve good results in detection and is
robust in coping with occlusion.
The remainder of this paper is organized as follows. In Section 2 the related work in this area is
reviewed. In Section 3 the techniques used to estimate the homographies for the ground plane and a
set of parallel head planes are described. In Section 4, foreground detection using Gaussian mixture
model (GMM) based background subtraction is discussed. In Section 5, a detailed fusion scheme for
improved pedestrian detection is presented, including the fusion of multi-view foreground regions and
multi-plane feature fusion using the occupancy information and colour cues. Finally, the experimental
results are discussed in Section 6, followed by the conclusions drawn in Section 7.
2 Related Work
Multi-view fusion in a visual surveillance system can not only provide a larger overall FOV but
also reduce occlusions in the overlapping FOV. One of the key issues for multi-view visual
surveillance is how to utilize the information from the multiple cameras for the purpose of detection
and tracking. The more information extracted from the multiple cameras that can be used simultaneously, the more robust and accurate the system becomes. Depending on the degree of
information fusion, the existing multi-camera surveillance systems can be categorized into low-degree
information fusion, intermediate-degree information fusion or high-degree information fusion.
Low-degree fusion in visual surveillance systems uses multiple cameras only to extend the limited
FOV of a single camera [1-4]. This is known as the camera handoff method, which starts tracking an
object within a single camera view and switches to the next camera when the tracked object goes
beyond the FOV of the current camera. For the camera handoff method, existing research differs in three respects: when the handoff process is triggered, which camera is selected as the next optimal camera to track and monitor the object of interest, and how to establish the correspondence of an object between cameras.
In the approaches of low-degree fusion, the detection and tracking are carried out separately in
each camera view, and only one camera is actively working at a particular time stamp. Therefore, such a system fails to detect and track objects during dynamic occlusions, which is one of the inherent problems of single-camera visual surveillance. As there is very limited information exchange between the
cameras, this camera switching approach is classed as low-degree information fusion. In Cai and
Aggarwal’s work [1], switching between different camera views for more effective single-camera
tracking is proposed and the system predicts when the current camera will no longer have a good
view. The features extracted from each upper human body are used to build the object correspondence
between cameras. The next camera should not only provide the best view but also require the least switching to continually track the moving object in its view.
In intermediate-level information fusion, dynamic occlusions are usually resolved through the
integration of information from additional cameras, as occlusions might not occur simultaneously in
all the camera views. With the established correspondence of the objects between cameras, it is
possible to not only track objects as they move from one camera view to the other, but also align the
trajectories in multiple views for improved tracking. Khan and Shah [5] extended their work proposed
in [2] by creating associations across views for the better localization of each object. The trajectories
from each camera are fused in a reference view if that camera's field of view is visible in the reference
view. In [6], the authors align the multiple views with the viewpoint of the camera which can give a
clear view of the scene and fuse the trajectories from multiple views in that view. In [7], the
multiview process uses Kalman filter trackers to model the object position and velocity, to which the
multiple measurements from a single-view stage are associated.
As features are extracted from the individual camera views before fusion, the problems that arise
in the detection and tracking with a single camera will affect the final fusion result. Extracted features
in the individual camera views can be integrated in a reference view to obtain a global estimate. The
extracted features include bounding boxes, centroids, principal axes and classification results. In [8], a
motion model and an appearance model of each detected moving object are built. Then the moving
object is tracked using a joint probability data association filter in a single camera view. The bounding
boxes of the moving objects are projected to a reference view according to the ground-plane
homography to correct falsely detected bounding boxes and handle occlusions in the reference view.
Hu et al. [9] projected the principal axis of each foreground object from the individual camera views
to a top view according to the homography mapping for the ground plane. The foot point of each
object in the top view is determined by the intersection of the principal axes projected from two
camera views. The tracking is based on the foot-point locations in the top view. Du and Piater [10]
used particle filters to track targets in individual camera views and then projected the principal axes of
the targets onto the ground plane. After tracking the intersections of the principal axes using the
particle filters on the ground plane, the tracking results were warped back into each camera view to
improve the tracking in the individual camera views. In [11], the centroid of each tracked target is
used to calculate an epipolar line in 3D space, and then the global tracking is carried out on the basis
of the intersection of two epipolar lines each from a different camera view.
In high-degree information fusion, the individual cameras no longer extract features but provide
foreground bitmap information to the fusion centre. The objects are detected as the intersections of
these foreground bitmaps projected from multiple views. In [12], homography mapping is used to
combine foreground likelihood images from different views to resolve occlusions and localise moving
objects on the ground plane. In [13] and [14], the midpoints of the matched foreground segments in a
pair of camera views are back-projected to yield an intersection point in 3D space. Such points are
then projected onto the ground plane to generate a probability distribution map for object locations.
Berclaz, Fleuret and Fua [15] divided the ground plane into grids and calculated an occupancy map in
the ground plane. The probability that each sub-image corresponds to the average size of a person in
each camera view is warped from that camera view to the top view by using the ground-plane
homography independently. The ground plane is later extended to a set of planes parallel to, but at
some height off, the ground plane to reduce false positives and missed detections [16]. In [17], a
similar procedure is followed but the set of parallel planes are at the height of people's heads. This
method is able to handle highly crowded scenes because the feet of a pedestrian are more likely to be
occluded in a crowd than the head. Their work achieves good results in moderately crowded scenes.
These methods fully utilize the visual cues from multiple cameras and have high-level information
fusion.
The method proposed in this paper is used to detect pedestrians via multi-view and multi-plane
data fusion, which utilizes head-plane homographies, occupancy maps and statistical colour matching
to identify pedestrians and phantoms. In [15] an occupancy map is extracted on the ground plane only.
However, foreground projections on the ground plane are vulnerable to missed detections of the feet and legs in the individual camera views. In [16] homographies for the ground plane and a set of planes
parallel to, but at some height off, the ground plane are used to reduce false positives and missed
detections. However, if the feet and legs of a pedestrian are lost in the detection in any individual
camera views, the joint foreground likelihood of the pedestrian becomes very low and may still be lost
in the multiview detection. In [17] a head-plane homographic transformation, which is used to align
the intensity correlation between each pixel in one camera view and its corresponding pixel in the
other camera view, is used to detect the head patch of each pedestrian. However, the pixelwise
intensity correlation in [17] is sensitive to homography estimation errors and textured patterns in
foregrounds. Instead, in the proposed method, a statistical approach based on colours is used, which is
more robust than the pixelwise intensity correlation.
Figure 1 shows a flowchart of the proposed pedestrian detection algorithm based on the
occupancy map and colours. To identify the false-positive detections, which occur due to the
foreground intersections of non-corresponding objects in a reference camera view, the occupancy
information and colour matching are combined in the head plane detection. Table 1 explains the
notation used in the equations of this paper.
3 Homography Estimation
A planar homography is a 3×3 transformation matrix relating a pair of images of the same plane captured by different cameras. Usually one image is chosen as the reference image, in which data association or fusion is carried out. The reference image is often captured by a real camera, but sometimes it is a synthetic top view or an aerial top view image. Let $(u_c, v_c)$ be a point on that plane in an image captured by a real camera $c$ and $(u_t, v_t)$ be the corresponding point in a top view image. $p^c = [u_c, v_c, 1]^T$ and $p^t = [u_t, v_t, 1]^T$ are the homogeneous coordinates of those two points. They are associated by the homography matrix $H^{c,t}$ as follows:

$$p^t \cong H^{c,t} p^c \qquad (1)$$
$$H^{c,t} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{bmatrix} \qquad (2)$$

where $\cong$ denotes that the homography is defined up to an unknown scale factor. The superscript $c$ in $p^c$ and $H^{c,t}$ indicates the camera index ($c = a, b$). Since the virtual top view is selected as the reference view in this paper, the superscript $t$ in $p^t$ and $H^{c,t}$ indicates the top view.
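As an illustration of Eq. (1), the following minimal sketch (Python with NumPy; the function name is ours, not from the paper) maps a pixel from camera view c to the top view and normalises the homogeneous result:

```python
import numpy as np

def warp_point(H_ct, u_c, v_c):
    """Map a pixel from camera view c to the top view t with a 3x3 homography (Eq. 1).

    H_ct is the homography H^{c,t}; the result is normalised so that the
    third homogeneous coordinate equals one.
    """
    p_c = np.array([u_c, v_c, 1.0])   # homogeneous coordinates p^c
    p_t = H_ct @ p_c                  # p^t ~ H^{c,t} p^c, defined up to scale
    return p_t[:2] / p_t[2]           # (u_t, v_t)
```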
3.1 Estimation of the Ground-Plane Homography
By matching corresponding feature points in an image pair, the homography between two camera
views can be estimated. The most commonly used feature is corresponding points, though other
features such as lines or ellipses may be used [18][19]. These features are selected and matched
manually or automatically in 2D images [11]. In this paper a set of manually selected corresponding
points was used to calibrate the cameras.
As the homography transformation is a special case of the projective transformation, the parameters calculated in the camera calibration can be used to determine the homography matrix for the ground plane. Let $p^w = [X_w, Y_w, Z_w, 1]^T$ be a point in 3D space and $p^c$ be the corresponding point in an image captured by a real camera. The relationship that maps $p^w$ to $p^c$ can be written using a $3 \times 4$ projection matrix $M$ as follows:

$$p^c = M p^w = [m_1, m_2, m_3, m_4]\, p^w \qquad (3)$$

where $m_1$, $m_2$, $m_3$ and $m_4$ are $3 \times 1$ vectors. The matrix $M$ is composed of the intrinsic and extrinsic parameters of the underlying camera, which can be obtained from the camera calibration.
By assuming that points $p^w$ and $p^c$ are on the ground plane, the point $p^w$ in 3D space can be denoted as $p_0^w = [X_w, Y_w, 0, 1]^T$, where $Z_w = 0$ and the subscript 0 denotes the ground plane. Assuming there is a virtual camera which generates a virtual top view, $p_0^w$ in the top view is denoted as $p_0^t = [u_t, v_t, 1]^T$, where $u_t = X_w$ and $v_t = Y_w$. Therefore, the ground-plane homography $H_0^{t,c}$ can be simplified as a degraded $M$ matrix [20]:

$$H_0^{t,c} = (H_0^{c,t})^{-1} = [m_1, m_2, m_4] \qquad (4)$$
$$p_0^c = H_0^{t,c}\, p_0^t \qquad (5)$$
$$p_0^t = H_0^{c,t}\, p_0^c \qquad (6)$$
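A minimal sketch of Eq. (4), assuming the calibrated $3 \times 4$ projection matrix is available as a NumPy array `M`; the helper name is ours:

```python
import numpy as np

def ground_plane_homography(M):
    """Ground-plane homography from the 3x4 projection matrix M (Eq. 4).

    Dropping the third column of M (which multiplies Z_w = 0 on the ground
    plane) gives H_0^{t,c} = [m1 m2 m4]; its inverse is H_0^{c,t}.
    """
    H0_tc = M[:, [0, 1, 3]]        # columns m1, m2, m4
    H0_ct = np.linalg.inv(H0_tc)   # camera-to-top-view direction
    return H0_tc, H0_ct
```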
3.2 Estimation of Multi-Plane Homographies
Homography mapping is not limited to the homography for the ground plane; it can be extended to a set of planes parallel to the ground plane at some height. If the camera is calibrated, the multi-plane homographies can be calculated from the parameters determined in the calibration process
directly. After the homography matrix for the ground plane is estimated, the homography matrix for a
plane parallel to and off the ground plane can be recovered from the perspective properties such as the
vanishing point in the vertical direction [16] or the cross-ratio of four collinear points [17]. The
homography matrix can also be approximated by using interpolated points between the head and feet
of each pedestrian [21].
For a plane parallel to the ground and at a height of $h$, let $p_h^w = [X_w, Y_w, Z_w, 1]^T$ be a point on that plane in 3D space, where $Z_w = h$. Assuming there is a virtual camera which generates a virtual top view, $p_h^w$ in the top view is denoted as $p_h^t = [u_t, v_t, 1]^T$, where $u_t = X_w$ and $v_t = Y_w$. According to Eq. (3), the corresponding point for $p_h^w$ in camera view $c$ is $p_h^c = m_1 X_w + m_2 Y_w + m_3 h + m_4$. Since each element in the third column $m_3$ multiplies the value $h$ and then adds its corresponding element in the last column $m_4$, the projection from point $p_h^t$ in the top view to point $p_h^c$ in camera view $c$ according to the homography for plane $h$ becomes:

$$p_h^c = H_h^{t,c}\, p_h^t = [m_1, m_2, m_4 + h m_3]\, p_h^t \qquad (7)$$

The homography for plane $h$ can be represented as a combination of the homography for the ground plane and the third column of the projection matrix $M$ multiplied by the given height $h$:

$$H_h^{t,c} = H_0^{t,c} + [\,\mathbf{0} \mid h m_3\,] \qquad (8)$$

where $\mathbf{0}$ is a $3 \times 2$ zero matrix [20]. The relationship between $H_0^{t,c}$ and $H_h^{t,c}$ is illustrated in Figure 2. Points $p_0^w$ and $p_h^w$ in 3D space map to the same point in the top view.
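The head-plane homography of Eq. (8) then follows directly from the same projection matrix. A hedged sketch, reusing the assumptions of the previous snippet:

```python
import numpy as np

def plane_homography(M, h):
    """Homography H_h^{t,c} for the plane Z_w = h (Eq. 8).

    The ground-plane homography [m1 m2 m4] is augmented by adding h*m3
    (the scaled third column of M) to its last column.
    """
    H_h = M[:, [0, 1, 3]].astype(float).copy()   # H_0^{t,c} = [m1 m2 m4]
    H_h[:, 2] += h * M[:, 2]                     # last column becomes m4 + h*m3
    return H_h
```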
4 Foreground Detection
Foreground detection, which separates moving objects from the background in each frame, is an essential process in visual surveillance systems. For a fixed camera, foreground detection
suffers mainly from changes of lighting conditions, moving cast shadows, switching backgrounds, etc.
Although there are three main categories of foreground detection algorithms, i.e. optical flow,
temporal differencing, and background subtraction, we choose background subtraction because it
copes better with the surveillance background. The other two can be very sensitive to the
environmental lighting conditions and the irregular motion patterns of pedestrians.
Background subtraction requires i) calculating a background image, ii) subtracting the background image from each new frame, and iii) thresholding the subtraction result. Since the
foreground pixels are identified according to the pixelwise difference between the new frame and the
background image, this method is highly dependent on a good background model, which should not
be sensitive to illumination variations, shadows and waving vegetation. The Gaussian Mixture Model
(GMM) [22] is the most widely used method to cope with changing background elements (e.g.,
waving trees). According to the assumption that a background pixel is more stable than a foreground
pixel in pixel value, the value of a background pixel is modelled by a mixture of Gaussian
distributions. The sum of the probability density functions weighted by the corresponding priors
represents the probability that a pixel is observed at a particular intensity or colour.
After detection of the foreground pixels in each camera view, these pixels are grouped into
foreground regions by applying connected component analysis, and then morphological operations
and a size filter are applied to fill holes and remove small regions. Once the foreground regions have
been identified in a camera view, each foreground region needs to be projected to a reference view
according to the homography for a certain plane. As a pixelwise homographic transformation is time
consuming, each foreground region is represented by the polygon approximation of its contour [23].
The Douglas-Peucker (DP) method [24] has been used for the polygon approximation. Instead of
applying the inverse homography to each pixel in the reference view, the vertices of the polygon of
each detected foreground region are projected to the reference view through homography mapping.
Let $F_i^a$ be the $i$-th foreground region detected in camera $a$. The contour of $F_i^a$ is represented by an ordered set of $N$ points on the contour curve. To make the representation of the contour point set more compact, the original contour is approximated by a polygon using the Douglas-Peucker (DP) algorithm. The contour of $F_i^a$ is thus approximated by a set of vertices $V_i^a$, which will be used in the next section for foreground projection and fusion in the top view.
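A compact sketch of this detection stage, using OpenCV's MOG2 background subtractor as one GMM implementation; the thresholds MIN_AREA and DP_EPSILON are illustrative values, not those used in the paper:

```python
import cv2

MIN_AREA = 200      # size-filter threshold in pixels (illustrative)
DP_EPSILON = 2.0    # Douglas-Peucker tolerance in pixels (illustrative)

bg_model = cv2.createBackgroundSubtractorMOG2(detectShadows=True)

def foreground_polygons(frame):
    """Detect foreground regions with a GMM background model and return the
    Douglas-Peucker polygon (vertices V_i) of each sufficiently large region."""
    fg = bg_model.apply(frame)
    _, fg = cv2.threshold(fg, 200, 255, cv2.THRESH_BINARY)        # drop shadow labels
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    fg = cv2.morphologyEx(fg, cv2.MORPH_CLOSE, kernel)             # fill small holes
    contours, _ = cv2.findContours(fg, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    polygons = []
    for c in contours:
        if cv2.contourArea(c) >= MIN_AREA:                         # size filter
            polygons.append(cv2.approxPolyDP(c, DP_EPSILON, True)) # polygon vertices
    return polygons
```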
5. Multi-View and Multi-Plane Data Fusion for Pedestrian Detection
When the foreground regions for the same object are warped from multiple views to the top view,
they will intersect at a location where the object touches the ground. Although the ground plane is the
most commonly used plane in homography mapping, the foreground projections of the same object from the multiple camera views may fail to intersect in the reference view.
Since pedestrians’ feet and legs are small and thin, they are frequently missed in detection.
Homography estimation errors are another reason for missed intersections. In addition, when a
pedestrian is striding and hence has the two legs separated, the two legs may cause additional phantom
intersections.
If the foreground projections from individual camera views to the top view are based on the
homography for a plane off the ground, the object detection will be more robust. In this paper, the
homographies for planes around head height are used to detect pedestrians and to estimate the height of each pedestrian. The
foreground regions detected in each camera view are warped into the top view according to the
homographies for a set of parallel planes at different heights. The intersection regions in the top view
indicate all the possible regions that contain moving objects. To identify false-positive detections,
which occur due to the foreground intersections of non-corresponding objects in the top view, the
occupancy information and colour cues are analysed.
5.1 Region Based Foreground Fusion
Each foreground region is represented by the polygon approximation of its contour. After the contour of $F_i^a$ has been approximated by the vertices $V_i^a$, instead of applying a pixelwise homographic transformation to the foreground images, only the vertices of the foreground polygons need to be projected onto the reference image. The foreground regions are then rebuilt by filling the internal area of each polygon with a fixed value [21].
Using the homography estimation method described in Section 3.1, the ground-plane homography can be determined. The vertices $V_i^a$ of the $i$-th foreground polygon from camera view $a$ are projected to the top view $t$ according to the ground-plane homography $H_0^{a,t}$. Let $V_{i,0}^{a,t}$ be the set of projected vertices in the top view $t$:

$$V_{i,0}^{a,t} = H_0^{a,t}(V_i^a) \qquad (9)$$

Since the vertices in $V_i^a$ form an ordered list, connecting each projected vertex with its neighbour sequentially generates a new polygon in the top view $t$. This new polygon approximates the contour of the projected foreground region $F_{i,0}^{a,t}$, which is the projection of $F_i^a$ from camera view $a$ to the top view $t$ according to the ground-plane homography. Then $F_{i,0}^{a,t}$ is rebuilt by filling the internal area of the projected foreground polygon with a fixed value. Thus, $F_{i,0}^{a,t}$ can be described as:

$$F_{i,0}^{a,t} = H_0^{a,t}(F_i^a) \qquad (10)$$
Figure 3 shows a schematic diagram of the homographic projection based on the ground plane and
a plane parallel to the ground plane. Plane h is an imaginary plane parallel to the ground plane and at a
height of h. The warped foreground region of an object in the top view is observed as the intersection
of the ground plane and the cone swept out by the silhouette of that object.
Given foreground region $F_i^a$ from camera view $a$ and $F_j^b$ from camera view $b$, $F_{i,h}^{a,t}$ and $F_{j,h}^{b,t}$ are the projected foreground regions from the two camera views to the top view according to the homographies $H_h^{a,t}$ and $H_h^{b,t}$ for a plane at a height of $h$. If the two projected foreground regions intersect in the top view, these two regions in the top view and their original foreground regions in both camera views are defined as a pair of projected foreground regions and a pair of foreground regions, respectively. The intersection region of the projected foreground regions $F_{i,h}^{a,t}$ and $F_{j,h}^{b,t}$ is determined as:

$$R_{i,j,h}^t = F_{i,h}^{a,t} \cap F_{j,h}^{b,t} = \left( H_h^{a,t}(F_i^a) \right) \cap \left( H_h^{b,t}(F_j^b) \right) \qquad (11)$$
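Eqs. (9)-(11) can be realised by warping only the polygon vertices and rasterising the result, for example as in the following sketch (OpenCV assumed; the top-view size and helper names are illustrative):

```python
import cv2
import numpy as np

TOP_VIEW_SHAPE = (1000, 840)   # (rows, cols) of the virtual top view; illustrative

def project_region(polygon, H_at):
    """Warp the polygon vertices into the top view (Eq. 9) and rebuild the
    projected foreground region as a filled mask (Eq. 10)."""
    pts = polygon.reshape(-1, 1, 2).astype(np.float32)
    warped = cv2.perspectiveTransform(pts, H_at)          # projected vertices V^{a,t}
    mask = np.zeros(TOP_VIEW_SHAPE, np.uint8)
    cv2.fillPoly(mask, [warped.reshape(-1, 2).astype(np.int32)], 255)
    return mask

def intersection_region(poly_a, H_at, poly_b, H_bt):
    """Intersection of two projected foreground regions in the top view (Eq. 11)."""
    return cv2.bitwise_and(project_region(poly_a, H_at),
                           project_region(poly_b, H_bt))
```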
5.2 Phantoms
When the foreground regions in the individual camera views are projected into the top view
according to the homography for the ground plane or a plane parallel to the ground plane and at some
height, the foreground projections may intersect. If the intersecting foreground regions correspond to
the same object in different camera views, the intersection region reports the location where the object
touches the plane used in the homography projection. If the intersection regions are caused by non-
corresponding foreground regions from different camera views, they are false-positive detections or
phantoms. This is an important problem in multi-camera moving object detection using foreground
homography mapping.
Fig. 4 is a schematic diagram which illustrates how non-corresponding foreground regions
intersect and give rise to a false-positive detection in homography mapping. In Fig. 4(a), the
foreground regions of two objects are projected from two camera views into the top view by using the
ground plane homography. The foreground projections intersect in three regions on the ground plane.
The white intersection regions are the locations of the two objects, whilst the black region is a
phantom. If homography mapping for a plane higher than the ground plane is used, the projected
foreground regions move closer to the cameras. This can be partly explained in Fig.3, in which the
projected foreground region on plane h is closer to the camera than that on the ground plane. When
such projected foreground regions on the plane off the ground intersect in the top view, additional
phantoms may be generated, as shown in Fig.4(b) in which the grey region is an additional phantom.
5.3 Occupancy Maps
If the intersection region $R_{i,j,h}^t$ is due to the same object in different camera views, it indicates the location where the object is intersected by plane $h$. When plane $h$ is at different heights and parallel to the ground plane, the intersection region $R_{i,j,h}^t$ varies in its size and shape, which approximates the widths of the corresponding body parts at different heights. Fig. 5 shows an example of the foreground intersection regions generated by the same pedestrian according to the homographies for planes at different heights. There are two pedestrians in each of the two camera views. When the foreground polygons of the same pedestrian in both camera views are projected into the top view, the intersection of the projected foreground polygons shows the location of the pedestrian intersected by that plane. Fig. 5(a) shows the intersection region according to the homography for the ground plane. Such an intersection region is then warped back into a single camera view as shown in Fig. 5(b), where the back-warped intersection region in camera view $b$ is represented with a red colour. Figs. 5(c)-5(h) show the results at heights of 0.5 m, 1.0 m and 1.5 m, respectively.
By assuming that the pedestrians are standing upright, the intersection regions $R_{i,j,0}^t$ and $R_{i,j,h}^t$ are roughly at the same position in the top view, as illustrated in Figs. 5(a) and 5(c). Let $p_{i,j,h}^t = [u, v]^T$ be a pixel in $R_{i,j,h}^t$ in the top view. If $R_{i,j,h}^t$ is occupied by a real object, when the pixel is warped back to camera view $c$ ($c = a, b$) according to the homography for a plane at the pedestrian's height, $H_h^{t,c} = (H_h^{c,t})^{-1}$, the warped back pixel $p_{i,j,h}^c$ is located around the head area of the corresponding foreground region in camera view $c$.
When the pixel is warped back to camera view $c$ according to the ground-plane homography $H_0^{t,c} = (H_0^{c,t})^{-1}$, the warped back pixel $p_{i,j,0}^c$ is located around the foot area of the corresponding foreground region in camera view $c$. Pixels on the line segment which connects those two warped back pixels are usually located inside the corresponding foreground region in camera view $c$. However, if $R_{i,j,h}^t$ is a phantom region, the warped back pixels $p_{i,j,h}^c$ and $p_{i,j,0}^c$ are not located around the head area or foot area of their corresponding foreground region in camera view $c$, and some pixels on the line segment which connects those two warped back pixels are located outside the corresponding foreground region. Therefore, the percentage of pixels on that line segment that fall inside the corresponding foreground region indicates the likelihood that the original pixel $p_{i,j,h}^t$ in the top view belongs to a real object. The top-view occupancy likelihood derived from camera view $c$ is denoted as $L_{i,j,h}^{t,c}(u,v)$. In reality, the legs of a pedestrian may be separated during walking. Therefore, a margin of 5 pixels around the line segment is used in our algorithm. If a pixel on that line segment is not inside the corresponding foreground region, the ±5 pixels beside that pixel in the horizontal direction are scanned. If any of these 10 pixels is inside the foreground region, that pixel will still be recognized as a foreground pixel.
Each pixel in the intersection region is assigned its joint occupancy likelihood from all camera views:

$$L_{i,j,h}^t(u,v) = \prod_{c} L_{i,j,h}^{t,c}(u,v) \qquad (12)$$

The joint occupancy likelihoods of all the intersection regions in the top view constitute an occupancy map for a specific plane [25].
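A sketch of the per-pixel occupancy likelihood described above, assuming the foreground mask of camera view c and the two top-view-to-camera homographies are available; for simplicity the ±5-pixel margin is applied to every sampled point rather than only to points that miss the foreground:

```python
import cv2
import numpy as np

MARGIN = 5   # horizontal search margin around the head-foot line, in pixels

def occupancy_likelihood(u, v, H0_tc, Hh_tc, fg_mask):
    """Occupancy likelihood L^{t,c}(u, v) of one top-view pixel, derived from a
    single camera view: the fraction of samples on the segment between the
    warped-back head and foot points that land on the foreground mask."""
    p_t = np.array([[[u, v]]], np.float32)
    foot = cv2.perspectiveTransform(p_t, H0_tc)[0, 0]   # warped back with H_0^{t,c}
    head = cv2.perspectiveTransform(p_t, Hh_tc)[0, 0]   # warped back with H_h^{t,c}
    n = int(np.max(np.abs(head - foot))) + 1
    rows, cols = fg_mask.shape
    hits = 0
    for x, y in zip(np.linspace(head[0], foot[0], n), np.linspace(head[1], foot[1], n)):
        yi = int(np.clip(y, 0, rows - 1))
        xs = np.clip(np.arange(int(x) - MARGIN, int(x) + MARGIN + 1), 0, cols - 1)
        if fg_mask[yi, xs].any():   # margin scan in the horizontal direction
            hits += 1
    return hits / n
```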
5.4 Colour Matching of Intersection Regions
Since colour is a strong cue to differentiate objects, the colours of the warped back intersection
regions in individual camera views are utilized to identify whether two foreground projections from
different camera views are due to the same object. Pixelwise colour matching is sensitive to
homography estimation errors. It also assumes that the foreground regions of the same object have
consistent colour patterns in different camera views. In reality, because multiple cameras are often
placed at different orientations, the same object may appear slightly different in the colour patterns in
different camera views. Therefore, pixelwise colour matching cannot achieve good results in these
situations and a statistical approach based on colour cues is proposed.
The colour statistical approach is not based on the colour of each intersection region in the top
view, because the warping operation may change the statistical properties of a colour patch. This
approach is based on the original colour patches in the two camera views, which correspond to the
same foreground intersection region in the top view. A pair of such colour patches often have different
sizes. Initially each intersection region in the top view needs to be warped back to the individual
camera views. Given a head-plane intersection region $R_{i,j,h}^t$ in the top view, the image patch in camera view $a$, which is warped back from the top view using the homography for the head plane $h$, is as follows:

$$R_{i,j,h}^a = H_h^{t,a}(R_{i,j,h}^t) \qquad (13)$$

The colour model of each warped back patch is built by using the colours of all the pixels in that patch. A Gaussian mixture model (GMM) is used to handle the multiple colours in the warped back region. Suppose $x$ is the $d$-dimensional colour vector of a pixel in the colour patch and $K$ is the number of Gaussian distributions; the corresponding GMM can be denoted as:

$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k) \qquad (14)$$

$$\mathcal{N}(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}}\, e^{-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)} \qquad (15)$$
The K-means algorithm and the Expectation-Maximization (EM) algorithm are used to determine the parameters of the probability density functions in the Gaussian mixture model. For the warped back patch $R_{i,j,h}^a$, its colour is modelled by $K$ Gaussian distributions $\mathcal{N}(\pi_{i,n}^a, \mu_{i,n}^a, \Sigma_{i,n}^a)$, $n \in [1, K]$, where $\pi_{i,n}^a$, $\mu_{i,n}^a$ and $\Sigma_{i,n}^a$ denote the weight, mean and covariance of the $n$-th Gaussian distribution, respectively. The $K$ Gaussians are in descending order according to the magnitudes of the weights ($\pi_{i,1}^a$ is the greatest weight).

For the warped back patch $R_{i,j,h}^b$ in camera view $b$, its colour is modelled by $K$ Gaussian distributions $\mathcal{N}(\pi_{j,m}^b, \mu_{j,m}^b, \Sigma_{j,m}^b)$, where $m \in [1, K]$. The colour similarity of such a pair of warped back patches, which corresponds to the same intersection region in the top view, is measured by using the Mahalanobis distance between the two colour models. Since the Gaussian distributions in each GMM are ranked in descending order, the first distribution is always the dominant distribution in the GMM. The first $r$ distributions, which have an overall weight over a given threshold $T_r$, are defined as significant distributions. That is:

$$\sum_{m=1}^{r-1} \pi_{j,m}^b < T_r \quad \text{and} \quad \sum_{m=1}^{r} \pi_{j,m}^b \geq T_r \qquad (16)$$
Then, the Mahalanobis distance is used to calculate the similarity of two Gaussian distributions. Unlike the Euclidean distance, which only measures the distance between the means of two Gaussian distributions, the Mahalanobis distance uses the covariances of the two Gaussian distributions to normalize the squared distance between the means. The Mahalanobis distance between the dominant distribution $\mathcal{N}(\pi_{i,1}^a, \mu_{i,1}^a, \Sigma_{i,1}^a)$ for patch $R_{i,j,h}^a$ in camera view $a$ and each of the significant distributions for the corresponding patch in camera view $b$ is defined as follows:

$$s_{i,j,m}^a = (\mu_{i,1}^a - \mu_{j,m}^b)^T (\Sigma_{i,1}^a + \Sigma_{j,m}^b)^{-1} (\mu_{i,1}^a - \mu_{j,m}^b), \quad m \in [1, r] \qquad (17)$$

The Mahalanobis distance between the dominant distribution for patch $R_{i,j,h}^a$ in camera view $a$ and the corresponding patch in camera view $b$ is:

$$s_{i,j}^a = \min_{m \in [1, r]} \{ s_{i,j,m}^a \} \qquad (18)$$

Using the same process, the Mahalanobis distance between the dominant distribution for patch $R_{i,j,h}^b$ in camera view $b$ and the corresponding patch in camera view $a$ is $s_{i,j}^b$. Then the colour distance between the pair of colour patches is defined as:

$$s_{i,j}^{a,b} = \min \{ s_{i,j}^a, s_{i,j}^b \} \qquad (19)$$
The smaller the colour distance $s_{i,j}^{a,b}$, the more likely it is that the two foreground regions are from the same object.
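The colour-matching step of Eqs. (14)-(19) can be sketched with scikit-learn's GaussianMixture (which, like the paper, uses K-means initialisation followed by EM); the number of components K and the cumulative-weight threshold T_R are illustrative, as the paper does not report their values:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

K = 3      # number of Gaussian components per colour patch (illustrative)
T_R = 0.6  # cumulative-weight threshold T_r for significant distributions (illustrative)

def fit_colour_gmm(patch_pixels):
    """Fit a K-component GMM to the colour vectors (e.g. hue and saturation) of a
    warped-back patch and return its components sorted by descending weight."""
    gmm = GaussianMixture(n_components=K, covariance_type='full').fit(patch_pixels)
    order = np.argsort(gmm.weights_)[::-1]
    return gmm.weights_[order], gmm.means_[order], gmm.covariances_[order]

def one_way_distance(model_a, model_b):
    """Mahalanobis distance (Eqs. 17-18) between the dominant component of
    model_a and the significant components of model_b (Eq. 16)."""
    _, mu_a, cov_a = model_a
    w_b, mu_b, cov_b = model_b
    r = int(np.searchsorted(np.cumsum(w_b), T_R)) + 1     # smallest r satisfying Eq. (16)
    dists = []
    for m in range(r):
        d = mu_a[0] - mu_b[m]
        dists.append(d @ np.linalg.inv(cov_a[0] + cov_b[m]) @ d)
    return min(dists)

def colour_distance(model_a, model_b):
    """Symmetric colour distance s^{a,b} between a pair of patches (Eq. 19)."""
    return min(one_way_distance(model_a, model_b), one_way_distance(model_b, model_a))
```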
5.5 Head-Plane Foreground Intersections
It is assumed that pedestrians are standing upright, while their heights may vary from one person to another. In this paper, homographies for planes parallel to the ground plane and around the pedestrians' average height are used to detect pedestrians. Considering the variety of pedestrians' heights, the height range of the multiple planes is from 1.5 to 1.8 m, which covers the heights of most people. If the variation in pedestrians' heights is larger, the height range can be further extended, which means more homographic planes will be used. Let $h \in [1.5, 1.8]$ m and let $R_{i,j}^t$ represent a set of foreground intersection regions at different heights but at the same location in the top view. An intersection region is recognized as the top of a person's head if it is at the highest height in such a set, its occupancy likelihood is higher than a threshold $T_L$, its colour distance between the two camera views is lower than a threshold $T_S$, and its area is over a threshold $T_A$. The highest height at which that intersection region appears indicates the height of the person, and the location of that intersection region indicates the location of the person.
The details of the phantom pruning algorithm are described in Algorithm 1.

Algorithm 1:
1: for each imaginary plane parallel to the ground plane and at a height of h do
2:   project each foreground region in both camera views to the top view t according to $H_h^{c,t}$ (c = a, b)
3:   for each intersection region $R_{i,j,h}^t$ of foreground projections do
4:     for each pixel $p_{i,j,h}^t$ at (u, v) do
5:       warp the pixel back to camera view c (c = a, b) according to $H_0^{t,c}$ and $H_h^{t,c}$, to calculate the occupancy likelihood $L_{i,j,h}^{t,c}(u,v)$
6:       calculate the joint likelihood $L_{i,j,h}^t(u,v)$
7:     end for
8:     calculate the joint likelihoods for all intersection regions, which leads to an occupancy map
9:   end for
10: end for
11: for each set of intersection regions $R_{i,j}^t$ which are at the same location but at different heights do
12:   choose $R_{i,j,h_{max}}^t$ that satisfies $A_{i,j,h_{max}}^t > T_A$ and $L_{i,j,h_{max}}^t > T_L$
13:   warp $R_{i,j,h_{max}}^t$ back to camera views a and b
14:   build GMM models for $R_{i,j,h_{max}}^a$ and $R_{i,j,h_{max}}^b$
15:   calculate the colour distance $s_{i,j}^{a,b}$
16:   if $s_{i,j}^{a,b} < T_S$ then
17:     $R_{i,j,h_{max}}^t$ is the top of a head
18:   end if
19: end for
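A hedged sketch of the final decision in lines 11-18 of Algorithm 1, assuming a hypothetical per-location record of the candidate intersection regions and reusing the colour_distance helper from the Section 5.4 sketch; the threshold values follow Section 6:

```python
T_L, T_S, T_A = 0.6, 1000.0, 10   # thresholds reported in the experiments

def detect_head(regions):
    """Return (height, region) of the head-top detection, or None for a phantom.

    `regions` is a list of dicts {'h', 'area', 'likelihood', 'gmm_a', 'gmm_b'}
    for the same top-view location at different plane heights (Algorithm 1)."""
    candidates = [r for r in regions if r['area'] > T_A and r['likelihood'] > T_L]
    if not candidates:
        return None
    best = max(candidates, key=lambda r: r['h'])              # highest plane h_max
    if colour_distance(best['gmm_a'], best['gmm_b']) < T_S:   # colour matching test
        return best['h'], best
    return None
```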
6 Experimental Results
The algorithm has been tested on video sequences captured on the authors' campus. Two cameras were placed with small viewing angles and significantly overlapping fields of view. People walked around within a 4.0 m × 2.4 m region to ensure some degree of occlusion. There are 2790 frames captured in each camera view with a resolution of 640×480 pixels and a frame rate of 15 fps. 2155 frames that contain two or three pedestrians were used in the experiments (the first 660 frames contain no pedestrians or only one pedestrian). In this experiment, a virtual top view image was selected as the reference image with a resolution of 840×1000 pixels. The test was run on a single PC with an Intel Core i7 CPU running at 2.9 GHz.
Fig.6 shows the foreground detection using background subtraction and GMMs for frame 1200.
For the two original camera views shown in Fig. 6(a) and Fig. 6(b), the results of foreground detection
are shown in Fig. 6(c) and Fig. 6(d), respectively. There are three pedestrians which are labelled 1 to 3
in camera view a and labelled a to c in camera view b.
Each foreground polygon in a camera view was warped to the top view according to the
homographies for a set of parallel planes. In this experiment, seven planes which are parallel to and
between 1.5-1.8 meters above the ground plane were used. These planes are around the head level of
the pedestrians. Fig.7 shows the overlaid foreground projections from the two camera views using the
homographies for the planes at heights of 1.5m, 1.6m, 1.7m, and 1.8m, respectively. Each foreground
projection in the top view is given the same label as the corresponding foreground region in the
individual camera views. The foreground projections intersect at 8 regions for the height of 1.5m.
Fig.8 shows the foreground intersection regions of homography mapping at a height of 1.5m. Each
intersection region is given a label to indicate the corresponding foreground regions in the two camera
views. For example, region 1b is the intersection of foreground region 1 in camera view a and
foreground region b in camera view b.
Then, each pixel in the intersection regions is warped back to the individual camera views
according to the homographies for the ground plane and a plane at a height of 1.5 meters. In Fig.9, a
pixel in the intersection region is warped back and overlaid on the foreground regions in the individual
camera views. The blue dot and the red dot in each camera view indicate the warped back pixels
according to the homographies for the two planes. The yellow line indicates all the pixels on the line
segment defined by the pair of warped back pixels. The occupancy likelihood of the original pixel in a
single camera view is calculated. The occupancy likelihoods for all the pixels lead to an occupancy
map in the top view.
The occupancy map indicates the likelihood that each intersection region is occupied by a real
object. Table 2 shows the average occupancy likelihood for each intersection region. For intersection
region 1b in the top view, when each pixel in that region is warped back to camera view a, the average
occupancy likelihood in camera view a is 0.853. Since the average occupancy likelihood of the same
intersection region in camera view b is 0.973, the joint occupancy likelihood is 0.830. The intersection
regions which have the joint likelihoods higher than 0.6 are in bold. To visualize the results, in Fig.10,
each intersection region in the top view is filled with its average occupancy likelihood, in which the
darkest intensity corresponds to the highest likelihood.
Since colour matching is carried out in the individual camera views, the intersection regions in the top view are warped back into the individual camera views according to the inverse homography for the plane at a height of 1.5 meters. Fig. 11 shows the warped back patches overlaid onto the
original camera views. The regions with blue boxes are the warped back patches. Each warped back
patch is given the same label as the corresponding intersection region in the top view.
Table 3 shows the colour matching results for the same frame. The hue and saturation components
in the HSI colour space are used in the colour matching. Since the colour distances of unmatched
intersection regions are usually much greater than 1000, those for matched intersection regions are in
bold. Intersection regions 2, 4 and 6 in Table 3 are identified as matched regions. Table 4 shows the
results using seven planes at a series of heights ranging from 1.50m to 1.80m, incrementing by 0.05m.
Intersection regions 2, 4 and 6 are identified as real objects because their occupancy likelihoods are higher than the threshold $T_L$, their colour distances are lower than the threshold $T_S$, and their areas are over the threshold $T_A$. The thresholds can be decided empirically. In this experiment, $T_L = 0.6$, $T_S = 1000$ and $T_A = 10$. Their corresponding heights are estimated as 1.55 m, 1.75 m and 1.75 m, respectively. These are the highest heights at which an intersection region can be detected.
To evaluate the performance, the proposed algorithm has been tested over 142 frames, sampled every 15 frames from the 2155 frames of the video sequence. Tables 5 and 6 show the performance evaluation results of the pedestrian detection. In Table 5, the classification results are compared with the ground truth. 786 intersection regions in the 142 frames are classified into two categories: object regions and phantom regions. Of the 360 ground-truth object regions, 342 were correctly identified. Of the 426 ground-truth phantom regions, 10 were misclassified as object regions.
The false negative rate and the false positive rate of the classification are shown in Table 6. For each category, let $GT$ be the ground-truth number of that category. The false negatives (missed detections), $FN$, are the intersection regions which belong to that category but are misclassified as the other category. The false positives ($FP$), or false alarms, are the intersection regions which belong to the other category but are misclassified as that category. The false negative rate ($R_{FN}$) is the ratio between the number of false negatives and the number of ground truths, and the false positive rate ($R_{FP}$) is the ratio between the number of false positives and the number of ground truths:

$$R_{FN} = FN / GT, \qquad R_{FP} = FP / GT \qquad (20)$$
For object regions, the false negative rate is 5.0% and the false positive rate is 2.78%. The false
negative rate and the false positive rate for the phantom regions are 2.35% and 4.22% respectively.
The processing time of the proposed algorithm has been measured over the 142 frames by using two cameras and seven parallel planes. All the processing tasks were carried out on a single PC. The overall processing time for the 142 frames is 65.98 seconds, corresponding to 0.46 seconds per frame. The time spent on foreground detection using the GMMs is 31.22 seconds, corresponding to 0.22 seconds per frame and accounting for 46% of the overall processing time. The time spent on foreground projection and overlay is 5.44 seconds, corresponding to 0.037 seconds per frame and accounting for 8.2% of the overall processing time. The time spent on intersection region classification is 29.32 seconds, corresponding to 0.20 seconds per frame and accounting for 44% of the overall processing time.
7. Conclusions
In this paper, we have focused on pedestrian detection in visual surveillance applications. By using
multi-view data fusion, we can not only enlarge the FOV but also achieve more accurate and reliable
detection and tracking. Through homography transformations, multi-view foreground regions are
projected to a reference view, where occlusions can be minimized by making use of cross-view
information. In addition, a stack of virtual top views is generated according to the homographies for multiple planes at different heights, into which the foreground regions detected in each camera view are warped. As a result, the multi-plane data fusion is an efficient solution for removing false-positive detections by using both the occupancy information and colour cues. The proposed approach works well in differentiating pedestrians from false-positive detections. The limitations of the proposed method are that people are assumed to stand on a common plane and that it can only detect pedestrians distributed at up to a medium density.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (NSFC) under Grant
60975082 and the Scientific Research Program funded by Shaanxi Provincial Education Department,
P. R. China, under Grant 15JK1310.
References
[1] Q. Cai and J. K. Aggarwal, “Automatic tracking of human motion in indoor scenes across multiple synchronized video
streams,” in Computer Vision, 1998. Sixth International Conference on. IEEE, 1998, pp. 356–362.
[2] O. Javed, S. Khan, Z. Rasheed, and M. Shah, “Camera handoff: tracking in multiple uncalibrated stationary cameras,” in
Human Motion, 2000. Proceedings. Workshop on. IEEE, 2000, pp. 113–118.
[3] V. Kettnaker and R. Zabih, “Bayesian multi-camera surveillance,” in Computer Vision and Pattern Recognition, 1999.
IEEE Computer Society Conference on., vol. 2. IEEE, 1999.
[4] M. Quaritsch, M. Kreuzthaler, B. Rinner, H. Bischof, and B. Strobl, “Autonomous multicamera tracking on embedded
smart cameras,” EURASIP Journal on Embedded Systems, vol. 2007, no. 1, pp. 35–35, 2007.
[5] S. Khan and M. Shah, “Consistent labeling of tracked objects in multiple cameras with overlapping fields of view,”
Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 25, no. 10, pp. 1355–1360, 2003.
[6] G. P. Stein, "Tracking from multiple view points: Self-calibration of space and time," in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 1999.
[7] M. Xu, J. Orwell, L. Lowey, and D. Thirde, “Architecture and algorithms for tracking football players with multiple
cameras,” IEE Proceedings-Vision, Image and Signal Processing, vol. 152, no. 2, pp. 232–241, 2005.
[8] J. Kang, I. Cohen, and G. Medioni, "Continuous tracking within and across camera streams," in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2003, pp. I-267-I-272 vol. 1.
[9] W. Hu, M. Hu, X. Zhou, T. Tan, J. Lou, and S. Maybank, “Principal axis based correspondence between multiple
cameras for people tracking,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 28, no. 4, pp. 663–671,
2006.
[10] W. Du and J. Piater, “Multi-camera people tracking by collaborative particle filters and principal axis-based
integration,” in Computer Vision–ACCV 2007. Springer, 2007, pp. 365–374.
[11] J. Black and T. Ellis, “Multi camera image tracking,” Image and Vision Computing, vol. 24, no. 11, pp. 1256–1267,
2006.
[12] S. M. Khan, P. Yan, and M. Shah, “A homographic framework for the fusion of multi-view silhouettes,” in Computer
Vision, 2007. ICCV 2007. IEEE 11th International Conference on. IEEE, 2007, pp. 1–8.
[13] A. Mittal and L. Davis, “Unified multi-camera detection and tracking using region-matching,” in Multi-Object
Tracking, 2001. Proceedings. 2001 IEEE Workshop on. IEEE, 2001, pp. 3–10.
[14] A. Mittal and L. S. Davis, "M2tracker: A multi-view approach to segmenting and tracking people in a cluttered scene
using region-based stereo," in Proceedings of the European Conference on Computer Vision, 2002, pp. 18-33.
[15] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, “Multiple object tracking using k-shortest paths optimization,” Pattern
Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 9, pp. 1806–1819, 2011.
[16] S. M. Khan and M. Shah, “Tracking multiple occluding people by localizing on multiple scene planes,” Pattern
Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 3, pp. 505–519, 2009.
[17] R. Eshel and Y. Moses, “Tracking in a dense crowd using multiple cameras,” International journal of computer vision,
vol. 88, no. 1, pp.129–143, 2010.
[18] J. Wright, A. Wagner, S. Rao, and Y. Ma, "Homography from coplanar ellipses with application to forensic blood
splatter reconstruction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 1250-1257.
[19] H. Zeng, X. Deng, and Z. Hu, "A new normalized method on line-based homography estimation," Pattern Recognition
Letters, vol. 29, pp. 1236-1244, 2008.
[20] A. Criminisi, I. Reid, and A. Zisserman, “Single view metrology,” International Journal of Computer Vision, vol. 40,
no. 2, pp. 123–148, 2000.
[21] M. Xu, J. Ren, D. Chen, J. S. Smith, Z. Liu and T. Jia, “Robust object detection with real-time fusion of multiview
foreground silhouettes," Optical Engineering, 51(4), 047202, 2012.
[22] C. Stauffer and W. E. L. Grimson, “Adaptive background mixture models for real-time tracking,” in Computer Vision
and Pattern Recognition, 1999. IEEE Computer Society Conference on., vol. 2. IEEE, 1999.
[23] M. Xu, J. Ren, D. Chen, J. Smith, and G. Wang, “Real-time detection via homography mapping of foreground polygons
from multiple cameras,” in Image Processing (ICIP), 2011 18th IEEE International Conference on. IEEE, 2011, pp. 3593–
3596.
[24] D. H. Douglas and T. K. Peucker, “Algorithms for the reduction of the number of points required to represent a
digitized line or its caricature,” Cartographica: The International Journal for Geographic Information and Geovisualization,
vol. 10, no. 2, pp. 112–122, 1973.
[25] J. Ren, M. Xu and J. S. Smith, "Multi-view pedestrian detection using occupancy color matching," in Proc. 1st IEEE Conf. Multimedia Big Data, April 2015.
Figure 1. Flowchart of the proposed pedestrian detection algorithm.
Figure 2. The relationship between $H_0^{t,c}$ and $H_h^{t,c}$.
Figure 3. A schematic diagram of the foreground projection using homographies for the ground plane and the parallel plane at a height of h.
Figure 4. A schematic diagram of the homography mapping for (a) the ground-plane and (b) a plane above the ground.
Figure 5. An example of the projected foreground intersections due to the same object, using the homographies for a set of planes at different heights: (a) and (b) are the intersection region in the top view and the warped back region in camera view b for the ground plane; (c) and (d) are for the plane at a height of 0.5 m; (e) and (f) are for the plane at a height of 1.0 m; (g) and (h) are for the plane at a height of 1.5 m.
Figure 6. The foreground detection: (a)(b) the original images, and (c)(d) the detected foreground regions.
Figure 7. Fusion of the projected foregrounds in the top view using homographies for a set of parallel planes: (a) the ground plane; (b)-(d) at heights of 1.5 m, 1.6 m, 1.7 m and 1.8 m, respectively.
Figure 8. Intersection regions using the homography for a plane at a height of 1.5m.
Figure 9. The warped back head-foot pixels overlaid on the individual camera views.
Figure 10. The visualized results of the occupancy likelihood map: (a) the likelihoods derived from camera view a, (b) the likelihoods derived from camera view b, and (c) the joint likelihoods. A darker intensity corresponds to a higher likelihood.
Figure 11. The warped back patches overlaid on the original camera views for the plane at a height of 1.5 m.
Table 1. The notation used in this paper.
$p^c$: The homogeneous coordinates of the pixel at image coordinate (u, v) in camera view c.
$H_h^{c,t} = (H_h^{t,c})^{-1}$: Homography transformation matrix, from camera view c to the top view t, for the plane at a height of h (h = 0 is the ground plane). Interchanging the order of the superscripts denotes the inverse homography matrix.
$F_i^a$, $V_i^a$: The i-th foreground region and the corresponding foreground polygon in camera view a.
$F_{i,h}^{a,t}$: Projection of foreground region $F_i^a$ from camera view a to the top view t according to the homography $H_h^{a,t}$.
$R_{i,j,h}^t$: The intersection region of the projected foreground regions $F_{i,h}^{a,t}$ and $F_{j,h}^{b,t}$ in the top view.
$p_{i,j,h}^t$: A pixel at image coordinate (u, v) in $R_{i,j,h}^t$ in the top view.
$L_{i,j,h}^{t,c}(u,v)$: The occupancy likelihood of $p_{i,j,h}^t$, derived from camera view c.
$s_{i,j}^{a,b}$: The colour distance of $R_{i,j}^t$ between both camera views.
Table 2. The average occupancy likelihood for each intersection region.

Index  Region  Camera a  Camera b  Joint likelihood
1      1b      0.853     0.973     0.830
2      1c      0.864     0.947     0.818
3      2a      0.984     0.469     0.462
4      2b      0.926     0.993     0.920
5      2c      0.738     0.458     0.338
6      3a      0.946     0.949     0.899
7      3b      0.465     0.993     0.462
8      3c      0.395     0.704     0.278

Table 3. Colour matching results.

Index  Region  Colour distance
1      1b      6674.54
2      1c      0.041
3      2a      2014.17
4      2b      102.93
5      2c      2103.47
6      3a      88.33
7      3b      6734.66
8      3c      25930.10

Table 4. The average occupancy likelihoods for intersection regions by using seven parallel planes.

Index  Region  1.5m   1.55m  1.6m   1.65m  1.7m   1.75m  1.8m
1      1b      0.830  -      -      -      -      -      -
2      1c      0.818  0.781  -      -      -      -      -
3      2a      0.462  0.488  0.525  0.637  -      -      -
4      2b      0.920  0.901  0.905  0.898  0.888  0.901  -
5      2c      0.338  0.499  0.445  0.486  0.514  0.573  -
6      3a      0.899  0.996  0.979  0.956  0.946  0.964  -
7      3b      0.462  0.472  0.498  0.520  -      -      -
8      3c      0.278  0.294  0.333  0.373  -      -      -
Table 5. Classification results with occupancy maps and colour matching.

                         Classification Results                       Number of the
                         Object Regions (Ob)   Phantom Regions (Ph)   Ground Truth
Ground truth   Ob        342                   18                     360
               Ph        10                    416                    426
Total                    352                   434                    786
Table 6. Classification errors after fusion of occupancy maps and colour matching.

                  False Negative Rate (%)   False Positive Rate (%)
Object regions    5.00                      2.78
Phantom regions   2.35                      4.22