
TRAJECTORY-BASED VIDEO RETRIEVAL AND RECOGNITION

BASED ON MULTILINEAR ALGEBRA AND DISTRIBUTED

MULTI-DIMENSIONAL HIDDEN MARKOV MODELS

BY

XIANG MA

B.S. (University of Science and Technology of China) 2005

Ph.D. Preliminary Exam Thesis

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering

University of Illinois at Chicago

Chicago, Illinois

October, 2008


Copyright by

Xiang Ma

2008


To my parents and sister,

for their endless love and support.


ACKNOWLEDGMENTS

I would like to thank my advisors, Prof. Dan Schonfeld and Prof. Ashfaq Khokhar, for

their vision, direction, and constant support and encouragement during my PhD study. Their

passion for and rigorous attitude toward scientific research will benefit me greatly in my future work.

I also want to thank my PhD preliminary exam committee, Profs. Rashid Ansari, Shmuel

Fridlander and Philip Yu for their support and suggestions.

I would like to thank my many friends and colleagues at UIC with whom I have had the

pleasure of working over the years. These include Faisal Bashir, Wei Qu, Nidhal Bouaynaya,

Hunsop Hong, Chong Chen, Junlan Yang, Pan Pan, Xu Chen, Liuling Gong, Xiangqiong Shi,

Liming Wang and all other members of MCL.

I would like to thank Prof. Stefan Rueger from the Imperial College London and the Open

University for hiring me as an intern in the Knowledge Media Institute (KMi) in the United

Kingdom in the summer of 2008, where I developed an algorithm called CAMERA (CAmera

Motion Estimation and Realtime Analysis) for fast camera motion estimation. I also want to

thank all the people I met and worked with at KMi.

Finally, I would like to thank my parents, Biansheng Ma and Weidong Wu, for letting me

pursue my dream for so long and so far away from home, and my sister, Min Phoebe Ma, for

her consistent support and encouragement during my difficult times.


This work is supported in part by the National Science Foundation Grant IIS-0534438, the

University of Illinois at Chicago, the UIC Graduate School GSC Travel Award, and the UIC Student

Presenter Award (2006, 2007).


TABLE OF CONTENTS

1 INTRODUCTION
  1.1 Search, Indexing and Retrieval—Concepts and Applications
  1.2 Content-based Image and Video Retrieval
    1.2.1 Content-based Image Retrieval
      1.2.1.1 Image Querying
      1.2.1.2 Similarity or Distance Measures
    1.2.2 Content-based Video Retrieval
      1.2.2.1 Video Querying
      1.2.2.2 Motion in video: a key feature
    1.2.3 Overview of the Thesis and Contributions

2 TRAJECTORY-BASED VIDEO RETRIEVAL BASED ON MULTILINEAR ALGEBRA
  2.1 Introduction
  2.2 Related Work
  2.3 Multiple Trajectory Tensor Representation
    2.3.1 Tensor-Space Representation of Multiple Object Trajectories
    2.3.2 Global and Segmented Multiple Trajectory Tensors
      2.3.2.1 Global Multiple Trajectory Tensor
      2.3.2.2 Segmented Multiple Trajectory Tensor
  2.4 Multiple-Trajectory Indexing and Retrieval Algorithms
    2.4.1 Geometrical Multiple-Trajectory Indexing and Retrieval (GMIR) Algorithm
    2.4.2 Unfolded Multiple-Trajectory Indexing and Retrieval (UMIR) Algorithm
    2.4.3 Concentrated Multiple-Trajectory Indexing and Retrieval (CMIR) Algorithm
  2.5 Experimental Results
    2.5.1 Datasets
    2.5.2 Interacting Multiple Trajectory Selection
    2.5.3 Retrieval Results
    2.5.4 Precision and Recall
    2.5.5 Indexing and Query Time
  2.6 Discussion

3 TRAJECTORY-BASED VIDEO CLASSIFICATION AND RECOGNITION BASED ON DISTRIBUTED MULTI-DIMENSIONAL HIDDEN MARKOV MODELS
  3.1 Introduction
  3.2 Non-Causal Multi-Dimensional Hidden Markov Model
    3.2.1 Motivation and Problem Statement
    3.2.2 Non-Causal Multi-Dimensional Hidden Markov Model
  3.3 Non-Causal HMM: A Distributed Approach
  3.4 Distributed Causal HMMs: Training and Classification
    3.4.1 Expectation-Maximization (EM) Algorithm
    3.4.2 General Forward-Backward (GFB) Algorithm
      3.4.2.1 Forward and Backward Probability
    3.4.3 Viterbi Algorithm
    3.4.4 Summary of DHMM Training and Classification Algorithms
  3.5 Application I: Non-Causal HMM-Based Image Classification
  3.6 Application II: Non-Causal HMM-Based Video Classification

4 FUTURE WORKS
  4.0.1 Dynamic adding and deleting entries in multiple motion trajectory databases
  4.0.2 Dynamic matching of query and database multiple-trajectory entries with different numbers of trajectories
  4.0.3 Dynamic matching of query and database multiple-trajectory entries with diverse temporal lengths

5 CITED LITERATURE

VITA


LIST OF TABLES

I   COMPARISON OF AVERAGE QUERY TIME FOR EACH MULTIPLE TRAJECTORY QUERY INPUT (SEC.)

II  COMPARISON OF AVERAGE QUERY TIME FOR EACH MULTIPLE (THREE) TRAJECTORY QUERY INPUT (SEC.)

III AVERAGE CLASSIFICATION ERROR RATE VERSUS BLOCK-SIZE

IV  AVERAGE CLASSIFICATION ACCURACY RATE


LIST OF FIGURES

1  The architecture of a typical retrieval (search) system

2  Video clips and their corresponding multiple motion trajectories: (a) video clips 1, 2, 3; (b) their corresponding multiple motion trajectories

3  Tensor-space representation of multiple-object trajectories: (a) three sets of multiple trajectories S1, S2, S3, extracted from three video clips displayed in Fig. 1; (b) three corresponding multiple trajectory matrices M1, M2, M3; (c) three multiple trajectory matrices M′1, M′2, M′3 after resampling; (d) multiple trajectory tensor T constructed from M′1, M′2, M′3

4  Example of segmentation of a set of global multiple trajectories (2 trajectories) into segmented multiple subtrajectories: (a) a set of 2 trajectories with global length; (b) 6 sets of segmented multiple subtrajectories, segmented from (a). In both figures, solid squares with numbers indicate segmentation positions; the horizontal axis is x-location and the vertical axis is y-location within the scene

5  Block diagram of our multiple object trajectory indexing and retrieval system

6  Example video clip from the CAVIAR dataset: (a) sample frame from the video clip 'Two people meet, fight and chase each other'; (b) multiple trajectories extracted from the referred video sequence; (c) lifespans of multiple objects within the referred video sequence

7  Retrieval results for the ASL dataset using the proposed CMIR algorithm for multiple trajectory representation (2 trajectories): (a) the query; (b) the most-similar retrieval; (c) the second-most-similar retrieval; (d) the most-dissimilar retrieval; (e) the second-most-dissimilar retrieval

8  Retrieval results for the ASL dataset using the proposed CMIR algorithm for multiple trajectory representation (3 trajectories): (a) the query; (b) the most-similar retrieval; (c) the second-most-similar retrieval; (d) the most-dissimilar retrieval; (e) the second-most-dissimilar retrieval

9  Retrieval results for the CAVIAR dataset (INRIA) using the proposed CMIR algorithm for multiple trajectory representation (2 trajectories): (a) the query; (b) the most-similar retrieval; (c) the second-most-similar retrieval; (d) the most-dissimilar retrieval; (e) the second-most-dissimilar retrieval

10 Retrieval results for the CAVIAR dataset (Shopping Center in Portugal) using the proposed CMIR algorithm for multiple trajectory representation (3 trajectories): (a) the query; (b) the most-similar retrieval; (c) the second-most-similar retrieval; (d) the most-dissimilar retrieval; (e) the second-most-dissimilar retrieval

11 Comparison of average precision and recall curves for the proposed three multiple-trajectory indexing and retrieval algorithms using (a) global and (b) segmented multiple trajectory tensor representation: CMIR, GMIR, UMIR and Modified Min's method, on two datasets (ASL (1225), CAVIAR (533))

12 Comparison of average precision and recall curves for the proposed three multiple-trajectory indexing and retrieval algorithms using segmented multiple trajectory tensor representation: CMIR, GMIR, UMIR and Modified Min's method, on two datasets (ASL (1225), CAVIAR (533))

13 Comparison of average precision and recall curves for the CMIR and GMIR algorithms. Curves (a)-(g) are for the CMIR algorithm with factor from 3 to 12; curve (h) is for the GMIR algorithm

14 Classical one-dimensional hidden Markov model

15 Previous two-dimensional hidden Markov models: (a) coupled HMM (CHMM); (b) nearest-neighbor, strictly causal 2D HMM (dependence based on nearest neighbors in vertical and horizontal directions)

16 Proposed non-causal two-dimensional hidden Markov models: (a) arbitrary model with non-causality in all dimensions; (b) arbitrary model with causality along a single dimension; (c) nearest-neighbor model with causality along a single dimension

17 Distributed HMM: (a) non-causal 2D hidden Markov model; (b) distributed causal 2D hidden Markov model 1; (c) distributed causal 2D hidden Markov model 2

18 Sequential alternate updating scheme for multiple distributed HMMs

19 Training and classification: (a) state dependencies of the proposed distributed causal 2D-HMM; (b) the corresponding decomposed subset-state sequences

20 Image segmentation: (a)-(f) aerial test images

21 Hand-labeled ground truth images of man-made and natural regions: (a)-(f) truth images corresponding to the aerial test images in Fig. 7 (a)-(f). (White and gray denote man-made and natural regions, respectively)

22 Image segmentation: (a) original aerial image (see Fig. 7(f)); (b) hand-labeled ground truth image (see Fig. 8(f)); (c) classification using a strictly-causal, two-dimensional hidden Markov model; the corresponding error rate is 13.39%; (d) classification using the proposed non-causal, two-dimensional hidden Markov model; the corresponding error rate is 8.25%. (White and gray denote man-made and natural regions, respectively)

23 Video classification—samples of two classes in video dataset TWO-HANDS: (a) two-trajectory sample from class 1: 'god+boy'; (b) two-trajectory sample from class 2: 'dog+smile'

24 Video classification—samples of two classes in video dataset PEOPLE: (a) two-trajectory sample from class 1: "Two people meet and walk together"; (b) two-trajectory sample from class 2: "Two people meet, fight and run away"

25 Video classification—ROC curves for two datasets: (a) TWO-HANDS and (b) PEOPLE. (Each graph depicts the proposed non-causal 2D HMM (red), strictly-causal 2D HMM (blue), 1D HMM (green), and random classification (black))

26 Dynamic adding and deleting entries in multiple motion trajectory databases: (a) adding entries; (b) deleting entries. (In both figures, the white portion denotes entries to be kept, and the red portion denotes entries to be added or deleted)

27 Dynamic matching of query and database multiple-trajectory entries with different numbers of trajectories: (a) example multiple trajectory query with M′ = 3 trajectories; (b) multiple trajectory database with M = 6 trajectories (the red portion is the part of the database to be searched to match the query)

28 Dynamic matching of query and database multiple-trajectory entries with diverse temporal lengths: (a) example multiple trajectory query with temporal length L′; (b) multiple trajectory database with length L (the red portion is the part of the database to be searched to match the query)


LIST OF ABBREVIATIONS

UIC University of Illinois at Chicago

MCL Multimedia Communications Laboratory

PCA Principal Component Analysis

SVD Singular Value Decomposition

HOSVD Higher Order Singular Value Decomposition

HMM Hidden Markov Model


SUMMARY

Motion information is regarded as one of the most important cues for developing semantics in

video data. Yet it is extremely challenging to build the semantics of video clips, particularly when

they involve the interactive motion of multiple objects. Most of the existing research has focused

on capturing and modelling the motion of each object individually, thus losing interaction

information. Such approaches yield low precision-recall ratios and limited indexing and retrieval

performance.

We present a novel framework for compact representation of multi-object motion trajec-

tories. Three efficient multi-trajectory indexing and retrieval algorithms based on multi-linear

algebraic representations are proposed. These include: (i) geometrical multiple-trajectory in-

dexing and retrieval (GMIR), (ii) unfolded multiple-trajectory indexing and retrieval (UMIR),

and (iii) concentrated multiple-trajectory indexing and retrieval (CMIR). The proposed tensor-

based representations not only remarkably reduce the dimensionality of the indexing space but

also enable the realization of fast retrieval systems. The proposed algorithms have been im-

plemented and evaluated using real video datasets. Simulation results on both full and partial

(segmented) multiple motion trajectories with varying number of objects, trajectory lengths,

and sampling rates demonstrate that the CMIR algorithm provides superior precision-recall

metrics, and smaller query processing time compared to the other approaches.

We propose a novel solution to an arbitrary non-causal, multi-dimensional hidden Markov

model (HMM) for image and video classification. We provide a solution for the non-causal


model by splitting it into multiple causal HMMs that are analytically solvable in a fully syn-

chronous distributed computing framework, therefore referred to as distributed HMMs. We

approximate the simultaneous solution of multiple distributed HMMs on a sequential proces-

sor by an alternating updating scheme. The parameters of the distributed causal HMMs are

estimated by extending the classical one-dimensional training and classification algorithms to

multiple dimensions. The proposed extension to arbitrary causal, multi-dimensional HMMs

allows state transitions that are dependent on all causal neighbors. We thus extend three fun-

damental algorithms to multi-dimensional, causal systems, i.e. (1) Expectation-Maximization

(EM); (2) General Forward-Backward (GFB); and (3) Viterbi algorithms. Simulation results

demonstrate the superior performance, higher accuracy rate and applicability of the proposed

non-causal HMM framework to image and video classification.


CHAPTER 1

INTRODUCTION

1.1 Search, Indexing and Retrieval—Concepts and Applications

The basic objective of an indexing and retrieval (search) system, as stated in the mission of

the world’s leading search engine provider, Google [1], is “To organize the world’s information

and make it universally accessible and useful.” A typical retrieval (search) system works on the

basis of a query, through which the user specifies the information need to be searched. The

retrieval system then searches each entry of information in its database, and the best match is

retrieved and sent back to the user. The query can typically be a text, image, video or audio

document. The documents stored in the database are usually processed ahead

of time, and metadata is extracted from each document and efficiently indexed in the database.

During the retrieval procedure, the stored metadata information for each document is compared

against the query or metadata extracted from the query, and the most similar documents are

returned in a ranked order, according to some similarity measure. The user may influence

the retrieval results by providing additional feedback to the system regarding the relevance of

retrieved documents. Figure 1 shows the architecture of a typical indexing and retrieval

(search) system.
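The query-processing loop described above can be sketched in a few lines. This is a minimal illustration rather than part of the proposed system; the document fields, the feature extractor, and the Euclidean similarity measure are assumptions made for the example:

```python
import numpy as np

def extract_metadata(document):
    # Illustrative feature extractor: here, metadata is just a numeric vector.
    return np.asarray(document["features"], dtype=float)

def retrieve(query, database, top_k=3):
    # Compare the query's metadata against each stored entry and
    # return document ids ranked most-similar-first.
    q = extract_metadata(query)
    scored = sorted(
        (float(np.linalg.norm(q - extract_metadata(doc))), doc["id"])
        for doc in database
    )
    return [doc_id for _, doc_id in scored[:top_k]]

db = [{"id": "v1", "features": [0.0, 1.0]},
      {"id": "v2", "features": [5.0, 5.0]},
      {"id": "v3", "features": [0.1, 0.9]}]
print(retrieve({"features": [0.0, 1.0]}, db, top_k=2))  # -> ['v1', 'v3']
```

Relevance feedback, as mentioned above, would re-rank the scored list using the user's judgments; it is omitted here for brevity.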


Figure 1. The architecture of a typical retrieval (search) system


1.2 Content-based Image and Video Retrieval

The unique characteristics of image and video files make them more difficult to retrieve than

text; more importantly, multimedia content such as images and video cannot be directly

indexed by text or words. An efficient technique to extract metadata and create descriptions of

the image and video content is critical to a multimedia indexing and retrieval system. Finally,

an effective similarity or distance measure is needed to compare the metadata or descriptions

of images and videos against the query and return the most similar results.

1.2.1 Content-based Image Retrieval

Content-Based Image Retrieval (CBIR), also known as query by image content (QBIC)

and content-based visual information retrieval (CBVIR), is the application of computer vision

to the image retrieval problem, that is, the problem of searching for digital images in large

databases. "Content-based" means that the search will analyze the actual contents of the

image. The term 'content' refers to colors, shapes, textures, or any other information that can

be derived from the image itself. [2]

1.2.1.1 Image Querying

There are mainly three types of queries: query by category or concept, query by sketch, and

query by example. Query by concept retrieves images according to the conceptual

description associated with each image in the database. Query by sketch allows the user to draw a

sketch of an image with a graphic editing tool. Query by example allows the user to formulate

a query by providing an example image or specifying a region within an image. [3]


1.2.1.2 Similarity or Distance Measures

A content-based image retrieval system calculates visual similarities between a query image

and the images in a collection. Many similarity measures based on empirical estimates of the

distribution of features have been proposed for image retrieval in recent years. Different similarity

or distance measures will significantly affect the retrieval performance of an image retrieval system.

Commonly used measures are: the L-family distances, which include the L1 distance (also

known as the Manhattan distance) and the L2 distance (also known as the Euclidean distance);

the Earth Mover's Distance (EMD); and the Kullback-Leibler (KL) distance [3].
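As a concrete illustration, the L1, L2 and KL measures named above can be computed on feature histograms as follows (a minimal sketch; the example histograms are made up):

```python
import numpy as np

def l1_distance(a, b):
    # Manhattan (L1) distance: sum of absolute coordinate differences.
    return float(np.sum(np.abs(a - b)))

def l2_distance(a, b):
    # Euclidean (L2) distance.
    return float(np.sqrt(np.sum((a - b) ** 2)))

def kl_distance(p, q, eps=1e-12):
    # Kullback-Leibler divergence between two histograms (not symmetric;
    # eps guards against taking the log of zero).
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

h1 = np.array([4.0, 3.0, 2.0, 1.0])
h2 = np.array([1.0, 2.0, 3.0, 4.0])
print(l1_distance(h1, h2))   # -> 8.0
print(l2_distance(h1, h2))   # -> 4.472... (square root of 20)
print(kl_distance(h1, h1))   # -> 0.0
```

The choice among these measures matters: L1 and L2 treat histogram bins independently, while EMD and KL compare the feature distributions as a whole, which is why they can rank retrievals differently on the same features.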

1.2.2 Content-based Video Retrieval

In many ways, video retrieval fundamentally builds on image retrieval. Each frame in a

video can be considered an image by itself, so all aspects of an image retrieval system can be

applied to a video retrieval system. Due to the temporal dynamics and the relatively huge amount

of image data within a video, the video retrieval task is usually much harder than image retrieval.

Nevertheless, video contains much more information than images (e.g. audio, human

speech, motion), which offers more opportunities and sources for indexing and retrieval.

1.2.2.1 Video Querying

A video query contains much richer information than an image query; it may contain text,

images, video clips and audio clips.

1.2.2.2 Motion in video: a key feature

Motion in video can be categorized into two types: motion of the camera and motion of objects.

Camera motion includes pan, tilt, rotation, zoom in and zoom out. Object motion can

be any type of movement. Both motion types characterize the motion content of a video, and can

be used as a key feature to index and retrieve videos, especially those containing plenty of motion

activity, e.g. sports and vehicle videos.

1.2.3 Overview of the Thesis and Contributions

The main contributions of my PhD work include:

1. A novel indexing and retrieval framework for compact and integrated representation

of multiple interacting motion trajectories for video retrieval. This framework provides rich

indexing and retrieval mechanisms for multiple motion trajectories without requiring additional

higher-level semantic analysis. The proposed approach can be used for indexing and retrieval

of multiple motion trajectories as well as segmented (partial) multiple motion trajectories. The

optimal joint segmentation of multiple motion trajectories is determined by posing this task as

the solution of a hypothesis testing problem and deriving the maximum likelihood solution. The

key contribution is the formation of three new multi-linear algebraic structures for the compact

and unified representation of multiple object trajectories in a reduced-dimension space, and

three novel algorithms for the indexing of multiple object trajectories corresponding to each of

the algebraic structures used for multiple interacting object trajectory representations.

2. A novel solution to an arbitrary non-causal, multi-dimensional hidden Markov model

(HMM) for image and video classification. We solve the non-causal model by splitting

it into multiple causal HMMs that are analytically solvable in a fully synchronous distributed

computing framework, and therefore refer to them as distributed HMMs.


CHAPTER 2

TRAJECTORY-BASED VIDEO RETRIEVAL BASED ON

MULTILINEAR ALGEBRA

2.1 Introduction

Motion trajectory indexing and retrieval techniques have been successfully used for event

analysis, activity recognition and video surveillance. However, the issues related to indexing

and retrieval of multiple interacting object trajectories are extremely challenging and have not

yet been thoroughly addressed in the existing research work. The primary challenge is to find

a robust representation of a high-dimensional multiple object motion trajectory space that can

capture (explicitly or implicitly) the motion interaction information. The second challenge is

to develop an effective indexing system for such a high-dimensional space. The third challenge

is to find an effective retrieval mechanism which can minimize the system response time for

online queries.

We are particularly interested in multiple trajectories that co-exist for a certain period of

time in the video sequence and refer to them as “interacting trajectories.” Thus, multiple

trajectories are considered as “interacting trajectories” provided that they appear for a certain

number of consecutive images in the video sequence (say, for instance, a duration of at least

4 seconds, e.g. 120 frames). Note that there is no requirement that interacting objects meet

any criteria, only that they reside in the same image sequence for a certain period of time.

Therefore, the time index of the trajectories is used to determine if the objects interact.
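This time-index test can be sketched directly. The trajectory records and field names below are illustrative, and the 120-frame threshold follows the 4-second example above:

```python
def are_interacting(traj_a, traj_b, min_overlap=120):
    # Two trajectories "interact" if they co-exist in the same video
    # sequence for at least min_overlap consecutive frames.
    start = max(traj_a["first_frame"], traj_b["first_frame"])
    end = min(traj_a["last_frame"], traj_b["last_frame"])
    return end - start + 1 >= min_overlap

a = {"first_frame": 0,   "last_frame": 300}
b = {"first_frame": 200, "last_frame": 500}   # co-exists with a for 101 frames
c = {"first_frame": 100, "last_frame": 400}   # co-exists with a for 201 frames
print(are_interacting(a, b))  # -> False
print(are_interacting(a, c))  # -> True
```

Note that only the overlap of the frame indices is tested; the spatial paths of the objects play no role in deciding whether trajectories interact.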

In this chapter, we present a novel indexing and retrieval framework for compact and inte-

grated representation of multiple interacting motion trajectories. This framework provides rich

indexing and retrieval mechanisms for multiple motion trajectories without requiring additional

higher-level semantic analysis. The proposed approach can be used for indexing and retrieval

of multiple motion trajectories as well as segmented (partial) multiple motion trajectories. We

determine the optimal joint segmentation of multiple motion trajectories by posing this task as

the solution of a hypothesis testing problem and deriving the maximum likelihood solution. Our

key contribution is the formation of three new multi-linear algebraic structures for the compact

and unified representation of multiple object trajectories in a reduced-dimension space, and

three novel algorithms for the indexing of multiple object trajectories corresponding to each

of the algebraic structures used for multiple interacting object trajectory representations. We

finally develop an effective retrieval method for minimizing the retrieval time for queries and

report experimental results using multiple trajectories from public domain data sets such as

sign language data from KDI [48] and human motion data from the CAVIAR data set [49].

The rest of this chapter is organized as follows: Section II surveys the related work on motion

trajectory indexing and retrieval systems. In Section III, we present our multi-dimensional

tensor-space representation of multiple interacting object trajectories. Moreover, we present the

derivation of a joint segmentation of multiple motion trajectories based on a maximum likelihood

solution to a hypothesis testing problem. In Section IV, we introduce three new algorithms for


the indexing and retrieval of multiple object trajectories. We present experimental results and

a discussion of the proposed multiple object trajectory indexing and retrieval algorithms in

Section V. Finally, in Sections VI and VII, we present a brief summary of our results and

discussion of our future work.

2.2 Related Work

This section provides a survey of related work from the recent literature in the areas of

motion trajectory-based event analysis and applications.

Object motion has been an important feature for object representation and activity mod-

elling in video applications [6] [7] [8] [9]. An object trajectory-based system for video indexing

is proposed in [10], in which the normalized x- and y-projections of trajectory are separately

processed by wavelet transform using Haar wavelets. Chen et al. [11] segment each trajec-

tory into subtrajectories using fine-scale wavelet coefficients at high levels of decomposition.

A feature vector is then extracted from each subtrajectory and Euclidean distances between

each subtrajectory in the query trajectory and all the indexed subtrajectories are computed

to generate a list of similar trajectories in the database. In our previous work [12] [13], we

proposed a Principal Component Analysis (PCA)-based approach to object motion trajectory indexing and retrieval, which has been shown to provide a very effective method for indexing and retrieval of single object motion trajectories.

Multiple motion trajectory analysis has recently become an intensive area of research, due to its wide application in many areas, such as activity recognition and video content analysis. Shan et al. [44] proposed a video retrieval approach based on single and multiple motion trajectories.


Multiple motion trajectories extracted from a video are modelled as a sequence of symbolic pictures, and the spatial information of the symbolic pictures is then mapped to 2D strings. Two similarity measures are then proposed for matching query and database videos. This method is efficient for solving video subsequence matching problems; however, the symbolic pictures can only characterize simple scenes, which limits its application to complex scenes such as video surveillance. Mansouri et al. [45] proposed a motion-based image sequence segmentation technique based on level-set partial differential equations. In their method, differential equations are utilized to characterize the velocities of multiple motion trajectories, and motion segmentation is computed by minimizing an energy function expressed as a coupled system of level-set partial differential equations. More recently, a novel method for activity recognition based on multiple motion trajectories was proposed [46]. In this method, an HMM is used to model each movement video based on multiple motion trajectories. The x- and y-location information of the multiple motion trajectories at each frame is formulated as one feature vector, and the time sequence of feature vectors is then fed to the HMM for training. A maximum likelihood criterion is finally used for activity recognition.

2.3 Multiple Trajectory Tensor Representation

An object motion trajectory is usually represented by a two-dimensional L-tuple corresponding to the x- and y-locations of the object centroid at each instant of time, where L is the temporal duration:

r^L = {(x[t], y[t])}, t = 1, ..., L. (2.1)


Figure 2. Video clips and their corresponding multiple motion trajectories: (a) video clips 1, 2, 3; (b) their corresponding multiple motion trajectories.

For a video clip consisting of multiple moving objects, multiple object motion trajectories are usually extracted using tracking algorithms [17] [18] or motion sensor devices. Fig. 1 depicts three video clips (a) and their corresponding multiple motion trajectories (b).

Typically, for a video clip of length L with M moving objects, the multiple motion trajectories are represented as a set S of M motion trajectories:

S = {r^L_1, r^L_2, r^L_3, ..., r^L_M}. (2.2)

For a video database consisting of many video clips, each video clip has a unique corresponding set of multiple motion trajectories, which characterizes the dynamics and motion patterns of the multiple objects within that particular video clip. Thus, by indexing and retrieving


a set of multiple object trajectories, we are able to index and retrieve its corresponding video

clip in the database.

2.3.1 Tensor-Space Representation of Multiple Object Trajectories

In our approach, each set of multiple object trajectories S = {r^L_1, r^L_2, r^L_3, ..., r^L_M}, extracted from a particular video clip of length L with M objects, is modelled as a Multiple Trajectory Matrix M of size 2L × M.

We first smooth each of the noisy trajectories by applying a wavelet transform using the Daubechies wavelet DB4 and taking the coarse coefficients corresponding to the low subband of a three-level decomposition. After that, we concatenate the x- and y-location information of each object trajectory r^L_i (i = 1, 2, ..., M) into one column vector, which we refer to as the Single Trajectory Vector V_i:

V_i = (x_i[1], x_i[2], ..., x_i[L], y_i[1], y_i[2], ..., y_i[L])^T. (2.3)

We then align the single trajectory vectors as the columns of a matrix, where the number of columns equals the number of objects in the particular set of multiple object trajectories. We call this matrix the Multiple Trajectory Matrix M:

M = [V_1 | V_2 | V_3 | ... | V_M]. (2.4)
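As an illustration (in Python rather than the Matlab used in our experiments), the construction of Eqs. (2.3)-(2.4) can be sketched as follows; the function names are ours:

```python
import numpy as np

def single_trajectory_vector(x, y):
    """Concatenate the x- and y-locations of one object into a 2L column
    vector V_i (Eq. 2.3)."""
    return np.concatenate([x, y])

def multiple_trajectory_matrix(trajectories):
    """Stack M single trajectory vectors as the columns of a 2L x M
    Multiple Trajectory Matrix (Eq. 2.4); `trajectories` is a list of
    (x, y) array pairs of a common length L."""
    return np.column_stack([single_trajectory_vector(x, y)
                            for x, y in trajectories])

# Two toy trajectories of length L = 4
t1 = (np.array([0., 1., 2., 3.]), np.array([0., 0., 1., 1.]))
t2 = (np.array([3., 2., 1., 0.]), np.array([1., 1., 0., 0.]))
M = multiple_trajectory_matrix([t1, t2])   # shape (2L, M) = (8, 2)
```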

Here, each multiple trajectory matrix represents the motion dynamics of a group of objects

within one particular video clip, since the dynamics of each object is embedded within its

location information; while multiple object group behavior is embedded within the compact


matrix representation. For each set of multiple motion trajectories extracted from a particular video clip, there is one unique corresponding multiple trajectory matrix; thus this multiple trajectory matrix can be viewed as a “feature matrix” which contains the essential motion information of a group of objects within a particular video clip. Fig. 2(b) shows three multiple trajectory matrices constructed from three sets of multiple trajectories (shown in Fig. 2(a), which are the same as in Fig. 1(b)).

Please note that the order of the single trajectory vectors in the multiple trajectory matrix is very important, since different orderings of the single trajectory vectors will result in different multiple trajectory matrices. The order of multiple trajectories in a multiple trajectory matrix may even be critical for the retrieval procedure. In this presentation, for simplicity, we focus exclusively on the simplest case, where the orders of multiple trajectories in both the dataset and the query are known. However, our tensor-space representation can easily be extended to the case where the correspondence of multiple trajectories between the dataset and the query is unknown. For more discussion of multiple trajectory order matching, please refer to Section VI.

In a motion video database with many video clips, multiple trajectories are first extracted for each video, and then the multiple trajectory matrix is constructed. For compact representation, multiple trajectory matrices with the same number of columns (video clips with the same number of acting objects) are grouped together. Each group of video clips is then resampled to a median size for further processing. The median size is the median of the lengths of all video clips within the same group. Let there be N sets of M trajectories extracted from N video clips, and let the original lengths of each set of M trajectories be L1, L2, ..., LN. Suppose the desired


median length after sampling is L′. For each set of M trajectories, we resample the whole set of trajectories to the median length L′. For each trajectory within the same set, we first use the 2D Fourier transform to obtain the coefficients of the trajectory in the frequency domain; we then retain the p largest coefficients while discarding the rest; finally, we perform an L′-point inverse 2D Fourier transform to obtain the resampled trajectory with the desired length L′. Figure 3(b) depicts three sets of 2-trajectories S1, S2 and S3, extracted from three video clips of lengths L1, L2 and L3, respectively, while Figure 3(c) shows the resampled multiple trajectory sets with the same median size L′. The resampling process is a necessary step to transform different sets of trajectories with varying lengths into the same format, such that they can be assembled into a compact form (e.g. matrix or tensor form) for further efficient analysis.
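The frequency-domain resampling step can be sketched as follows. This is a minimal illustration (assuming L′ ≥ L) that treats each trajectory as the complex signal x + iy; the function name is ours:

```python
import numpy as np

def resample_trajectory(x, y, L_new, p):
    """Frequency-domain resampling of one trajectory: transform the complex
    signal x + iy, retain the p largest coefficients, and invert at L_new
    points. Assumes L_new >= len(x)."""
    z = x + 1j * y
    L = len(z)
    Z = np.fft.fft(z)
    keep = np.argsort(np.abs(Z))[-p:]        # indices of the p largest coefficients
    Z_kept = np.zeros_like(Z)
    Z_kept[keep] = Z[keep]
    # Embed the retained spectrum in an L_new-point spectrum, preserving
    # positive and negative frequencies, then rescale for the new length.
    Z_new = np.zeros(L_new, dtype=complex)
    half = L // 2
    Z_new[:half] = Z_kept[:half]
    Z_new[L_new - (L - half):] = Z_kept[half:]
    z_new = np.fft.ifft(Z_new) * (L_new / L)
    return z_new.real, z_new.imag

# A unit circle sampled at 8 points, resampled to L' = 16 points
t = np.arange(8) * 2 * np.pi / 8
xn, yn = resample_trajectory(np.cos(t), np.sin(t), L_new=16, p=4)
```

For this pure circular trajectory, the spectrum has a single dominant coefficient, so the resampled points still lie on the unit circle.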

After resampling, the multiple trajectory matrices within each group are of the same size; they are then aligned in the direction orthogonal to the plane spanned by them, forming a three-dimensional matrix, or tensor [19]. We refer to it as the Multiple Trajectory Tensor T:

T = (T_{i,j,k}), i = 1, 2, ..., 2L′; j = 1, 2, ..., M; k = 1, 2, ..., K, (2.5)

where L′ and M are as previously defined, and K is the depth of the tensor, i.e. the total number of video clips indexed so far.

Figure 3 depicts an example of constructing a multiple trajectory tensor from three sets of multiple trajectories. The reason for aligning the multiple trajectory matrices into a three-dimensional tensor is to assemble as much data as possible into a compact form and to extract intrinsic and common characteristics of the data for efficient analysis. By assembling multiple trajectory matrices into three-dimensional tensors, the multiple trajectory data spans a three-dimensional tensor-space.

Please note that our approach is built on a trajectory representation using the instantaneous x- and y-coordinates of the object centroid at each frame. While processing trajectories, it is assumed

that the frame of reference for all of the trajectories is held fixed during the generation of

trajectories. This is consistent with the fixed-camera scenario. For the moving camera case,

such as PTZ cameras or airborne surveillance, an additional step of trajectory registration will

be needed to generate trajectories which are all registered to a common frame of reference.

Our algorithms can then be used for indexing and retrieval of the registered trajectories from

multiple objects for the moving camera scenario.

2.3.2 Global and Segmented Multiple Trajectory Tensors

In this subsection, we present two types of multiple trajectory tensors based on two types

of multiple trajectory data used.

2.3.2.1 Global Multiple Trajectory Tensor

The “Global Multiple Trajectory Tensor” is constructed using sets of multiple trajectories extracted from whole video clips, as shown in Figure 2. Here, “Global” refers to multiple trajectories with their global (full) lengths.

2.3.2.2 Segmented Multiple Trajectory Tensor

The “Segmented Multiple Trajectory Tensor” is constructed using sets of multiple subtrajectories. We jointly segment multiple trajectories with global lengths into atomic “units”


which are called multiple subtrajectories. Those multiple subtrajectories represent certain joint

motion patterns of multiple objects within a certain time interval. The segmentation points that define the joint segmentation of multiple trajectories depend on the segmentation technique and in general correspond to changes in the joint motion patterns of multiple objects, such as changes in velocity (1st-order derivative) and/or acceleration (2nd-order derivative). Our segmentation technique, based on the spatial curvatures of multiple trajectories, ensures that only a joint change of motion of multiple objects is chosen as a segmentation point. The spatial curvature of a 2-D trajectory is given by:

Θ[t] = (x′[t] y′′[t] − y′[t] x′′[t]) / [(x′[t])² + (y′[t])²]^(3/2). (2.6)
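A minimal sketch of Eq. (2.6), approximating the derivatives with finite differences (the function name is ours):

```python
import numpy as np

def spatial_curvature(x, y):
    """Curvature Theta[t] of a 2-D trajectory (Eq. 2.6), with derivatives
    approximated by finite differences."""
    dx, dy = np.gradient(x), np.gradient(y)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    return (dx * ddy - dy * ddx) / (dx ** 2 + dy ** 2) ** 1.5

# Sanity check: a circle of radius R has constant curvature 1/R
t = np.linspace(0, 2 * np.pi, 2000)
theta = spatial_curvature(2.0 * np.cos(t), 2.0 * np.sin(t))   # ~0.5 everywhere
```

Because curvature is invariant to reparametrization, differentiating with respect to the sample index rather than time still recovers 1/R here.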

The value of the curvature at any point is a measure of inflection, an indication of concavity or convexity in the trajectory. For multiple trajectories, curvature data is calculated for each trajectory; for example, for M (> 1) trajectories, there are M curvatures. We segment the multiple trajectories by applying a moving-window scheme and a hypothesis testing method to their spatial curvatures. Let X and Y be two non-overlapping windows of size n, where X contains the first n M-dimensional curvature vector samples and Y contains the next n samples. Each M-dimensional curvature vector consists of the curvatures of the M trajectories at one time instant. Let Z be the window of size 2n formed by concatenating windows X and Y. We then perform a likelihood-ratio hypothesis test to determine whether the two windows X and Y contain data drawn


from the same distribution. If the curvatures of the multiple trajectories in X and Y are from different distributions, then there is a change of concavity or convexity of the multiple trajectories. Specifically, we have two hypotheses:

H_0: f_x(X; θ_x) = f_y(Y; θ_y) = f_z(Z; θ_z).
H_1: f_x(X; θ_x) ≠ f_y(Y; θ_y).

Assume that the curvature data in each window forms an i.i.d. random variable and that the data is M-dimensional jointly Gaussian. We first compute the maximum likelihood estimates of the mean and variance under each hypothesis,

L_0 = [ 1 / ( (2π)^(M/2) |Σ_3|^(1/2) ) ]^(2n) exp{ −(1/2) Σ_{i=s}^{s+2n} (k_i − µ_3)^T Σ_3^{−1} (k_i − µ_3) },

L_1 = [ 1 / ( (2π)^M |Σ_1|^(1/2) |Σ_2|^(1/2) ) ]^n exp{ −(1/2) [ Σ_{i=s}^{s+n} (k_i − µ_1)^T Σ_1^{−1} (k_i − µ_1) + Σ_{i=s+n+1}^{s+2n} (k_i − µ_2)^T Σ_2^{−1} (k_i − µ_2) ] }. (2.7)

then the likelihood ratio is defined as

λ_L = L_0 / L_1. (2.8)


We finally calculate the distance d between the distributions of X and Y. We define the distance function d between X and Y as

d(s) = −log(λ_L) = −n log( |Σ_1|^(1/2) |Σ_2|^(1/2) / |Σ_3| )
+ (1/2) [ Σ_{i=s}^{s+2n} (k_i − µ_3)^T Σ_3^{−1} (k_i − µ_3) − Σ_{i=s}^{s+n} (k_i − µ_1)^T Σ_1^{−1} (k_i − µ_1) − Σ_{i=s+n+1}^{s+2n} (k_i − µ_2)^T Σ_2^{−1} (k_i − µ_2) ], (2.9)

where µ_i, Σ_i (i = 1, 2, 3) are the mean column vectors and covariance matrices of the M-dimensional Gaussian distributions representing the data in windows X, Y and Z, respectively; s is the starting point of the samples in window X, where s = 1, 2, ..., L − 2n; and k_i is the curvature column vector consisting of the curvature samples of the M trajectories at time instant i, k_i = [Θ_1[i], ..., Θ_M[i]]^T. The distance d is large if the data in windows X and Y have different distributions. The windows are moved by m (< n) samples and the process is repeated in the same manner, forming a 1-D vector of distance values. The segmentation positions are chosen at the locations along the trajectory where the likelihood-ratio distance d attains a local maximum. To select the local maxima, the 1-D vector of distances d is partitioned into segments and the global maximum within each partition is selected to represent the segmentation location. The threshold α within each segment is chosen such that only the global maximum of d within each partition is selected to represent the segmentation location. Figure 4 displays the segmentation results for a set of multiple


trajectories. Please note that a segmentation position is chosen only when the curvatures of both trajectories exhibit a sudden change, e.g. at segmentation points 2 and 4, which ensures that in our segmentation only a joint change of motion of multiple objects is chosen as a segmentation point.
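The sliding-window likelihood-ratio computation of Eqs. (2.7)-(2.9) can be sketched as follows; the small regularizer that keeps the ML covariance estimates invertible is our implementation choice, not part of the derivation:

```python
import numpy as np

def window_stats(W):
    """ML mean vector and (biased) covariance matrix of the rows of W."""
    mu = W.mean(axis=0)
    D = W - mu
    return mu, D.T @ D / len(W)

def distance_curve(K, n, m):
    """d(s) = -log(lambda_L) of Eq. 2.9 for windows X = K[s:s+n],
    Y = K[s+n:s+2n] and Z = X concatenated with Y, with window shift m.
    K holds one M-dimensional curvature vector per row."""
    eps = 1e-8 * np.eye(K.shape[1])          # regularizer (our choice) to keep
    dist = []                                # the covariances invertible
    q = lambda W, mu, S: np.sum((W - mu) @ np.linalg.inv(S) * (W - mu))
    for s in range(0, len(K) - 2 * n + 1, m):
        X, Y, Z = K[s:s + n], K[s + n:s + 2 * n], K[s:s + 2 * n]
        (m1, S1), (m2, S2), (m3, S3) = map(window_stats, (X, Y, Z))
        S1, S2, S3 = S1 + eps, S2 + eps, S3 + eps
        dist.append(-n * np.log(np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))
                                / np.linalg.det(S3))
                    + 0.5 * (q(Z, m3, S3) - q(X, m1, S1) - q(Y, m2, S2)))
    return np.array(dist)

# Synthetic 2-D curvature data with an abrupt change at sample 40:
# d(s) peaks where X and Y exactly straddle the change (s = 30)
K = np.repeat([[0.0, 0.0], [5.0, 5.0]], 40, axis=0)
dist = distance_curve(K, n=10, m=5)
```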

Next, we show the use of tensor analysis techniques to extract optimal data-dependent bases of the reduced subspaces corresponding to the original multiple object trajectories. We then project the original multiple trajectories onto these bases to obtain an optimally compact representation of the multiple trajectories. This processing ensures that not only is the temporal motion pattern of each trajectory kept intact, but the spatio-temporal interaction between different objects is also preserved. The details of the tensor-analysis-based algorithms are explained in the following section.

2.4 Multiple-Trajectory Indexing and Retrieval Algorithms

We propose three multiple-object trajectory indexing and retrieval algorithms that differ mainly in their representation techniques: (i) the geometrical multiple-trajectory indexing and retrieval (GMIR) algorithm, (ii) the unfolded multiple-trajectory indexing and retrieval (UMIR) algorithm, and (iii) the concentrated multiple-trajectory indexing and retrieval (CMIR) algorithm. Figure 5 depicts the general block diagram of our multiple-object trajectory indexing and retrieval system.

2.4.1 Geometrical Multiple-Trajectory Indexing and Retrieval (GMIR) Algorithm

In the following we outline the Indexing and Retrieval processes of the GMIR algorithm.


Indexing

1. Generate geometric bases G_1, G_2 and G_3 from the multiple trajectory tensor T. The geometric bases represent the principal axes of variation across each coordinate axis in a 3D coordinate system. Specifically, G_1 spans the space of spatial-temporal multiple trajectory data, G_2 spans the space of object trajectory cardinality, and G_3 spans the space of sets of multiple trajectories. They can be obtained using various tensor analysis techniques, such as [20].

2. Project each multiple trajectory matrix onto the first two bases G_1 and G_2, transforming the multiple trajectory data into the low-dimensional subspace defined by those two bases:

M_Coeff = G_1^T × M_Multi-Traj × G_2. (2.10)

3. Use the coefficient matrices obtained in step 2 as indices of their corresponding multiple-

trajectory matrices in the database.

Retrieval

1. Project the multiple trajectories in the input query onto the two bases G_1 and G_2, and obtain the GMIR coefficient matrix:

M_QueryCoeff = G_1^T × M_TrajQuery × G_2. (2.11)


2. Calculate the Euclidean distance norm D_GMIR between the GMIR coefficients of the query multiple-trajectory and the GMIR coefficient matrices stored in the database, and return the ones whose distances are within a threshold:

D_GMIR = ||M_Coeff − M_QueryCoeff||_2, (2.12)

where in eqns. (2.10)-(2.12), G_i is the i-th geometric basis, G_i^T is the transpose of G_i, M_Coeff and M_QueryCoeff are the coefficient matrices of the multiple object trajectories in the multiple trajectory tensor and of the query multiple-trajectory, and M_Multi-Traj and M_TrajQuery are the multiple trajectory matrices in the multiple trajectory tensor and the query multiple object trajectory, respectively.
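One way to realize the GMIR steps is to take G_1 and G_2 as truncated singular bases of the mode-1 and mode-2 unfoldings of T (an HOSVD-style choice; the bases of [20] may be computed differently, and the truncation ranks here are free parameters of our sketch):

```python
import numpy as np

def mode_unfold(T, mode):
    """Unfold a 3-way tensor along `mode` into a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def geometric_bases(T, d1, d2):
    """Truncated mode-1 / mode-2 singular bases of the trajectory tensor."""
    G1 = np.linalg.svd(mode_unfold(T, 0), full_matrices=False)[0][:, :d1]
    G2 = np.linalg.svd(mode_unfold(T, 1), full_matrices=False)[0][:, :d2]
    return G1, G2

def gmir_index(M, G1, G2):
    """M_Coeff = G1^T x M x G2 (Eqs. 2.10 / 2.11)."""
    return G1.T @ M @ G2

def gmir_distance(C_db, C_query):
    """Frobenius distance between coefficient matrices (Eq. 2.12)."""
    return np.linalg.norm(C_db - C_query)

rng = np.random.default_rng(0)
T = rng.standard_normal((16, 2, 5))         # 2L' = 16, M = 2, K = 5 clips
G1, G2 = geometric_bases(T, d1=4, d2=2)
index = [gmir_index(T[:, :, k], G1, G2) for k in range(5)]
query = gmir_index(T[:, :, 3], G1, G2)      # query identical to clip 3
best = min(range(5), key=lambda k: gmir_distance(index[k], query))  # -> 3
```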

2.4.2 Unfolded Multiple-Trajectory Indexing and Retrieval (UMIR) Algorithm

The procedure of generating data-dependent bases of a multi-dimensional matrix, or tensor, in the unfolded multiple-trajectory indexing and retrieval (UMIR) algorithm can be viewed as a recursive “unfolding” of the tensor. By viewing the slices of the tensor, which are matrices, as vectors, the tensor itself can be viewed as a matrix; we first apply the SVD to that matrix, and then apply the SVD to the slice-matrices, as shown below:

T = σ_1 ×_1 T_{1...(N−1)} ×_2 U_N. (2.13)

T_{1...(N−1)} = σ_2 ×_1 T_{1...(N−2)} ×_2 U_{N−1}. (2.14)

...


T_{12} = σ_{N−1} ×_1 U_1 ×_2 U_2, (2.15)

where ×_n denotes the mode-n product [21]. The indexing and retrieval processes of the UMIR algorithm are summarized below:

Indexing

1. Generate unfolded bases U_1, U_2 and U_3 from the multi-dimensional data by recursive “unfolding” of the tensor T:

T = σ_1 ×_1 T_{12} ×_2 U_3. (2.16)

T_{12} = σ_2 ×_1 U_1 ×_2 U_2. (2.17)

2. Project each multiple trajectory matrix onto the two bases U_1 and U_2, transforming the multiple trajectory data into the low-dimensional subspace spanned by those two bases:

M_Coeff = U_1^T × M_Multi-Traj × U_2. (2.18)

3. Use the coefficient matrices obtained in step 2 as indices of their corresponding multiple-

trajectory matrices in the database.

Retrieval

1. Project the query multiple trajectories onto the two bases U_1 and U_2, and obtain the UMIR coefficient matrix:

M_QueryCoeff = U_1^T × M_TrajQuery × U_2. (2.19)


2. Calculate the Euclidean distance norm D_UMIR between the UMIR coefficients of the query multiple-trajectory and those of each multiple-trajectory input indexed in the database, and return the ones whose distances are within a threshold:

D_UMIR = ||M_Coeff − M_QueryCoeff||_2, (2.20)

where in eqns. (2.13)-(2.20), T_{1...N} is a matrix indexed by 1 to N, U_i are the unfolded bases, and U_i^T is the transpose of U_i (i = 1, 2, 3); M_Coeff and M_QueryCoeff are the coefficient matrices; and M_Multi-Traj and M_TrajQuery are the multiple-trajectory matrix and the query multiple-trajectory matrix, respectively.

2.4.3 Concentrated Multiple-Trajectory Indexing and Retrieval (CMIR) Algorithm

The Indexing and Retrieval processes used in the CMIR algorithm are outlined below:

Indexing

1. Generate concentrated bases C_1, C_2 and C_3 from the multiple trajectory tensor T by minimizing the following sum-of-squares loss function:

min_{C_1,C_2,C_3} Σ_{i,j,k} || T_{ijk} − Σ_{r=1}^{R} c¹_{ir} c²_{jr} c³_{kr} ||². (2.21)


where c¹_{ir}, c²_{jr} and c³_{kr} are the (i, r)-th, (j, r)-th and (k, r)-th entries of C_1, C_2 and C_3, respectively. The data-dependent bases can be obtained by solving the above equation. One method for extracting these bases is the Parallel Factors Decomposition (PARAFAC) [22].

2. Project each multiple trajectory matrix onto the two bases C_1 and C_2, transforming the multiple trajectory data into the low-dimensional subspace spanned by those two bases:

M_Coeff = C_1^T × M_Multi-Traj × C_2. (2.22)

3. Use the coefficient matrices obtained in step 2 as indices of their corresponding multiple-

trajectory matrices in the database.

Retrieval

1. Project the query multiple trajectories onto the two bases C_1 and C_2, and obtain the CMIR coefficient matrix:

M_QueryCoeff = C_1^T × M_TrajQuery × C_2. (2.23)

2. Calculate the Euclidean distance norm D_CMIR between the CMIR coefficients of the query multiple-trajectory and those of each multiple-trajectory input indexed in the database, and return the ones whose distances are within a threshold:

D_CMIR = ||M_Coeff − M_QueryCoeff||_2, (2.24)


where in eqns. (2.21)-(2.24), C_i^T is the transpose of C_i (i = 1, 2, 3), M_Multi-Traj and M_TrajQuery are the trajectory matrices of the multiple-trajectory in the database and of the query multiple-trajectory, and M_Coeff and M_QueryCoeff are their corresponding coefficient matrices.
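A minimal alternating-least-squares sketch of the PARAFAC solution to Eq. (2.21); this is an illustrative implementation, not the exact procedure of [22]:

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x R) and B (J x R) -> (IJ x R)."""
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

def parafac(T, R, iters=200, seed=0):
    """Rank-R CP decomposition of a 3-way tensor by alternating least
    squares, minimizing the sum-of-squares loss of Eq. 2.21; returns the
    factor matrices C1, C2, C3."""
    rng = np.random.default_rng(seed)
    C = [rng.standard_normal((s, R)) for s in T.shape]
    for _ in range(iters):
        for m in range(3):
            a, b = [C[i] for i in range(3) if i != m]
            kr = khatri_rao(a, b)            # matches the C-order unfolding above
            C[m] = unfold(T, m) @ kr @ np.linalg.pinv(kr.T @ kr)
    return C

# Recover an exact rank-2 tensor built from known factors
rng = np.random.default_rng(1)
A, B, Cf = (rng.standard_normal((s, 2)) for s in (6, 3, 4))
T = np.einsum('ir,jr,kr->ijk', A, B, Cf)
C1, C2, C3 = parafac(T, R=2)
T_hat = np.einsum('ir,jr,kr->ijk', C1, C2, C3)
```

On an exactly low-rank tensor the reconstruction T_hat converges to T; the projection step of Eq. (2.22) then reuses the recovered C_1 and C_2.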

2.5 Experimental Results

We evaluate the performance of the proposed multiple trajectory indexing and retrieval algorithms on two datasets: the ASL and CAVIAR data sets. All of the algorithms were implemented in Matlab, without code optimization, on an Intel P-IV 2.4 GHz desktop with 512 MB of RAM.

2.5.1 Datasets

(1) The Australian Sign Language (ASL) Dataset:

We construct our ASL dataset by combining multiple hand movement trajectory data from the Australian Sign Language (ASL) dataset [48]. We first extract 50 different sets of hand movement trajectories, each of which contains 20 single trajectories; we then combine each group of two or three trajectories, resulting in 1225 multiple trajectories (with different numbers of trajectories), which are re-sampled to a uniform length of 512 sample points.

(2) The Context Aware Vision using Image-based Active Recognition (CAVIAR) Dataset:

One of the most popular data sets used for video surveillance tests is the Context Aware Vision using Image-based Active Recognition (CAVIAR) dataset [49]. This dataset provides the coordinates of the tracked objects as ground-truth data, which can be used to determine multiple-object trajectories. The CAVIAR dataset itself was generated from two sub-datasets: (1) clips from INRIA, and (2) clips from a shopping center in Portugal. From this dataset,


we select 509 two-trajectory video clips, 16 three-trajectory video clips and 8 four-trajectory video clips, for a total of 533 multiple-trajectory video clips for testing. Figure 6(a)-(b) shows an example of multiple trajectories in the CAVIAR data set.

2.5.2 Interacting Multiple Trajectory Selection

In video sequences with multiple objects, not all of the objects interact. For example, in the video clip “Two people meet, fight and chase each other” (clip ‘FIGHT and CHASE’ in the CAVIAR INRIA dataset), depicted in Figure 6(a)-(b), the objects with IDs 1 and 2 interact with each other, while the object with ID 0 appears only at the beginning of the video without interacting with the other objects. Similarly, the objects with IDs 3 and 4 barely have any activity. This can be seen more clearly in Figure 6(c), where the lifespans of all the objects in the video are displayed. The horizontal axis of this figure represents the frame number within the video, while the vertical axis shows the status of each object. Each object has two status conditions: it either appears in the video or it does not. Here, status 0 indicates that an object does not appear in the video, while a nonzero status indicates that the object appears in the video. Objects 1 and 2 appear and disappear at almost the same time and thus share the same lifespan, while their lifespans do not overlap with that of object 0. Obviously, objects 1 and 2 have possible interacting activity, while they have no interaction with object 0.

Based on this observation, we propose a novel interacting multiple trajectory selection scheme based on lifespans. All objects sharing some portion of their lifespans are marked as interacting candidates, while combinations whose lifespans do not overlap obviously have no interaction and can be automatically ruled out. This procedure is of great


practical importance, since it can be used to further improve the response of the multiple object

trajectory indexing and retrieval system.
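The lifespan-based candidate selection can be sketched as follows (a toy illustration; the lifespan frame values are hypothetical):

```python
def interacting_candidates(lifespans):
    """Mark object pairs whose lifespans overlap as interaction candidates;
    `lifespans` maps object id -> (first_frame, last_frame)."""
    ids = sorted(lifespans)
    pairs = []
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            s1, e1 = lifespans[ids[a]]
            s2, e2 = lifespans[ids[b]]
            if min(e1, e2) >= max(s1, s2):   # intervals share at least one frame
                pairs.append((ids[a], ids[b]))
    return pairs

# Toy lifespans echoing the FIGHT-and-CHASE example: objects 1 and 2
# co-occur, object 0 appears only at the start of the clip
spans = {0: (0, 100), 1: (150, 900), 2: (160, 910)}
candidates = interacting_candidates(spans)   # [(1, 2)]
```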

2.5.3 Retrieval Results

We first provide retrieval results for multiple trajectories in order to qualitatively illustrate the performance of our proposed multiple trajectory indexing and retrieval algorithms. Figure 7 depicts the results of a query consisting of two interacting trajectories posed to the retrieval system based on the CMIR algorithm. The query is shown in Figure 7(a). The most-similar and second-most-similar retrieval results are shown in Figures 7(b) and 7(c), respectively. As can be seen, the retrieved trajectories are indeed very similar to the query trajectories. Meanwhile, the most-dissimilar and second-most-dissimilar retrieval results are quite different from our query, as illustrated in Figures 7(d) and 7(e), respectively.

To further illustrate the power of our algorithms, we select a query consisting of three interacting trajectories, shown in Figure 8(a). The retrieval results of our system are depicted in Figures 8(b)-(e). This experiment shows that our algorithms are capable of indexing and retrieving any number (> 2) of multiple trajectories.

Figures 9 and 10 depict our query and retrieval results for the real-world CAVIAR data set using our tensor-based algorithms for multiple trajectory representation.

The experiment shown in Figure 9 demonstrates the retrieval results for two trajectories. The query is represented in Figure 9(a). The most-similar and second-most-similar retrieval results are illustrated in Figures 9(b) and 9(c), respectively. In this case, the query


depicts a video clip of “two people who meet, fight and chase each other”. The most-similar retrieved result, shown in Figure 9(b), is a video clip in which “two people meet, walk together and split apart”, whereas the second-most-similar retrieved result, portrayed in Figure 9(c), is a video clip in which “two people meet, fight and run away”. We can see that the retrieved multiple trajectories are visually quite similar to the query.

Figure 10 shows another retrieval result on CAVIAR data for the three-trajectory case. The trajectory data are selected from the clips of the shopping center in Portugal (2nd subset of CAVIAR). In this scenario, the query depicts a video sequence of “3 persons walking in the corridor”. The most-similar retrieved result, shown in Figure 10(b), is a video clip in which “another 3 persons walk in the corridor”, whereas the second-most-similar retrieved result, portrayed in Figure 10(c), is a video clip in which “3 people walk together along the corridor”. In contrast, the most-dissimilar and second-most-dissimilar retrievals are depicted in Figures 10(d) and (e), respectively. Both retrieved dissimilar results differ visually from the query to a large degree.

2.5.4 Precision and Recall

For a quantitative assessment of the performance of our algorithms, we compute the conventional Precision-Recall curves. The conventional definitions of the precision Pp and recall Pr metrics are given by:

Pp = |Xi ∈ T| / |N| (2.25)

and

Pr = |Xi ∈ T| / |T| , (2.26)


where the |X| operator denotes the cardinality of the set X; |Xi ∈ T| denotes the number of trajectory classes that were correctly retrieved; and |N| and |T| are the total number of trajectory classes retrieved and the total number of trajectory classes in the target set in the database, respectively.
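These definitions translate directly into code. The following is a minimal Python sketch (the label sets and their representation are hypothetical, introduced only for illustration):

```python
def precision_recall(retrieved, target):
    """Precision P_p and recall P_r as in Eqs. (2.25)-(2.26).

    retrieved: trajectory classes returned for a query (the set N).
    target:    relevant trajectory classes in the database (the set T).
    """
    correct = sum(1 for x in retrieved if x in target)   # |X_i in T|
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(target) if target else 0.0
    return precision, recall
```

For example, retrieving 4 classes of which 2 are among 3 relevant ones gives Pp = 0.5 and Pr = 2/3.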

Figures 11 and 12 show the Precision and Recall curves for indexing and retrieval of both global (full) and segmented (partial) multiple trajectories, respectively. Figure 11 represents the average Precision and Recall curves of the three proposed algorithms applied to 1235 full (global) trajectory pairs. Figure 12 depicts the average Precision and Recall results of the three algorithms applied to 1225 partial (segmented) multiple trajectories. In both cases, the CMIR algorithm performs much better than GMIR and UMIR. For comparison, we implemented a multiple motion trajectory indexing and retrieval algorithm based on the presentation of Min and Kasturi [46], modified for a fair comparison in an indexing and retrieval setting; the modified method is referred to as Modified Min's method. The indexing process generates HMM parameters for each set of multiple trajectories in the database, and these HMM parameters are treated as features for each set. In the retrieval procedure, the similarity between the query multiple trajectories and the dataset multiple trajectories is measured by the maximum likelihood based on HMM fitting, and retrieved results are selected by looking up a list of maximum likelihoods. As shown in Figure 11, the average precision-recall curves of the proposed CMIR and GMIR algorithms are much better than those of Modified Min's method.
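The likelihood-based ranking used by this baseline can be sketched as follows. This is a generic discrete-observation 1D HMM scored with the scaled forward algorithm, not the exact formulation of [46]; all model parameters below are hypothetical:

```python
import numpy as np

def hmm_log_likelihood(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under a 1D HMM,
    computed with the scaled forward algorithm.

    obs: sequence of observation symbol indices
    pi : (Q,)   initial state distribution
    A  : (Q, Q) transition matrix, A[i, j] = Pr(state j | state i)
    B  : (Q, V) emission matrix,   B[q, v] = Pr(symbol v | state q)
    """
    alpha = pi * B[:, obs[0]]
    log_lik = np.log(alpha.sum())
    alpha /= alpha.sum()                    # rescale to avoid underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]       # forward recursion
        log_lik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return log_lik

def rank_by_likelihood(query_obs, models):
    """Rank database entries (each an HMM (pi, A, B)) by the likelihood
    they assign to the query, most likely first."""
    scores = [hmm_log_likelihood(query_obs, *m) for m in models]
    return np.argsort(scores)[::-1]
```

Retrieval then reduces to looking up the top entries of the returned ranking, mirroring the "list of maximum likelihoods" described above.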


The superiority of the proposed CMIR algorithm is further analyzed in Figure 13. In this figure, the red curve represents the precision-recall curve of our GMIR algorithm, and the other curves denote the precision-recall curves of our CMIR algorithm with different numbers of components (i.e., factors R). We can see that the performance of the CMIR algorithm improves as the number of components increases. In particular, when the number of components R reaches 9 (i.e., R = 9), the performance of the CMIR algorithm is nearly the same as that of the GMIR algorithm. When R = 12, its performance is far superior to that of GMIR. The improvement in the performance of the CMIR algorithm with the increase in the number of components is due to the higher accuracy of the tensor representation when using a larger number of factors. Specifically, CMIR decomposes tensors into a sum of rank-1 tensors; thus the accuracy of the tensor representation increases as the number of rank-1 tensors increases and, consequently, the retrieval performance improves. In the next subsections, we will see that even when the number of components used is small and the performance of GMIR is superior to that of CMIR, the CMIR algorithm may still be preferable since it requires a much smaller number of coefficients for trajectory representation; hence, the storage space requirements of CMIR are dramatically reduced in comparison to GMIR.
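The rank-1 accuracy argument can be illustrated numerically. The sketch below builds a synthetic third-order tensor as a sum of K known rank-1 terms and measures the reconstruction error of its partial sums; in CMIR the factors would instead be estimated by the decomposition itself, and the tensor sizes here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5                                # true rank of the synthetic tensor
a = rng.standard_normal((K, 8))      # mode-1 factors
b = rng.standard_normal((K, 6))      # mode-2 factors
c = rng.standard_normal((K, 4))      # mode-3 factors

# T is an exact sum of K rank-1 tensors  a_r (outer) b_r (outer) c_r
T = np.einsum('ri,rj,rk->ijk', a, b, c)

# Approximation error when keeping only the first R rank-1 terms;
# with all K terms retained the representation is exact.
errors = [np.linalg.norm(T - np.einsum('ri,rj,rk->ijk', a[:R], b[:R], c[:R]))
          for R in range(1, K + 1)]
```

The error typically shrinks as R grows and vanishes once all rank-1 terms are kept, which is the mechanism behind CMIR's accuracy/storage trade-off.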

2.5.5 Indexing and Query Time

Indexing and query time serve as important criteria in evaluating the efficiency of an indexing and retrieval system. A retrieval system can benefit from a short indexing time, which allows for faster off-line processing. The average query time required for the three tensor-based methods proposed for retrieval of multiple trajectories using global and segmented trajectory representation is depicted in Table I. The query time includes: (1) query multiple trajectory generation and coefficient extraction; and (2) matching of the query with the database contents. We observe that the indexing time of the proposed CMIR algorithm is comparable to that of Modified Min's method, while GMIR and UMIR require less indexing time. More importantly, the CMIR algorithm dramatically reduces the query time compared with the other techniques.

TABLE I
COMPARISON OF AVERAGE QUERY TIME FOR EACH MULTIPLE TRAJECTORY QUERY INPUT (SEC.)

Algorithms            Global tensor data   Partial tensor data
GMIR                  19.159               10.543
UMIR                  19.415               19.737
Modified Min's [46]   13.064               22.488
CMIR                  0.133                0.172

2.6 Discussion

The practical utility of our approach to multiple interacting motion trajectories must address two important problems: (a) "multiple trajectory number matching" and (b) "multiple trajectory order matching". The multiple trajectory number matching problem arises when the number of trajectories in a database element and the number of trajectories in the query are not equal. For instance, we need to determine the proper retrieval results when a database element has 3 motion trajectories and the query only has 2 motion trajectories, or vice-versa. Similarly, the multiple trajectory order matching problem must be resolved when the correspondence between the multiple trajectories in the database elements and the query is unknown.

We shall now provide a brief discussion of how to extend our proposed algorithms to the general case, where the number of trajectories in the tensor database and in the query is arbitrary and the correspondence between them is unknown. We outline four possible scenarios in increasing order of complexity:

(S1): The number of trajectories in the query is the same as the number of trajectories in

the database element.

(S2): The number of trajectories in the query is greater than the number of trajectories in

the database element.

(S3): The number of trajectories in the query is smaller than the number of trajectories in

the database element.

(S4): The number of trajectories in the query and the number of trajectories in the database element are both arbitrary.

Here, we address the multiple trajectory number and order matching problems by adopting

one or both of the following schemes for each scenario:

(a) Database Processing: Permute and match subsets of the database trajectories, and retain the query trajectories. In this approach, the database will contain all possible numbers of trajectories in all possible orders. Thus, for any query with fewer trajectories than in the database element and with unknown object correspondence, there will be a match in the database. Although this approach could be implemented offline, the indexing and retrieval times as well as the storage requirements could be extremely large; i.e., given a database element with N trajectories, we must now transform this element into C(N, n) · n! different database elements, each of which has n trajectories, for every n = 1, 2, . . . , N, where C(N, n) denotes N choose n.

(b) Query Processing: Permute and match subsets of the query trajectories, and retain the database trajectories. In this scheme, a query with multiple motion trajectories is used to generate a family of queries containing all possible numbers of trajectories in all possible orders. Then, for any query with more trajectories than in the database element and with unknown object correspondence, the family of query trajectories is utilized to match each entry in the dataset. The advantage of this approach is that it does not affect the indexing time and storage requirements. Nonetheless, the online processing required will degrade the retrieval times. Following the discussion above, a query with K trajectories will be mapped to a query trajectory family consisting of C(K, k) · k! different queries with k trajectories, for every k = 1, 2, . . . , K. Overall, however, online processing by permutation and matching of subsets of the query trajectories is generally preferable to the demands imposed by the corresponding operations on the database.
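This query-family expansion is straightforward to generate; a sketch using placeholder trajectory identifiers:

```python
from itertools import permutations

def query_family(trajectories):
    """All ordered, non-empty subsets of the query trajectories.

    For K trajectories this yields, over all subset sizes k,
    C(K, k) * k! = P(K, k) candidate queries per size k.
    """
    K = len(trajectories)
    for k in range(1, K + 1):
        for perm in permutations(trajectories, k):
            yield perm

# A 3-trajectory query expands to P(3,1) + P(3,2) + P(3,3) = 3 + 6 + 6 = 15 candidates.
family = list(query_family(['t1', 't2', 't3']))
```

Each candidate in the family is then matched against the database entry exactly as a fixed-order query would be.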

We experimented with both the database processing and query processing schemes applied to scenario S1, where the number of trajectories in both the query and the database element is the same but the orders of the trajectories in both are unknown. The corresponding query times are depicted in Table II. For both schemes, the retrieval times increase for UMIR, CMIR, GMIR and Modified Min's method. However, we can see that the retrieval time of CMIR is still less than 1 second, which is far better than the other methods. This experiment demonstrates the robustness of our indexing and retrieval system to multiple trajectory order matching.

TABLE II
COMPARISON OF AVERAGE QUERY TIME FOR EACH MULTIPLE (THREE) TRAJECTORY QUERY INPUT (SEC.)

Algorithms            Fixed order   Unknown order matching        Unknown order matching
                      matching      using database processing     using query processing
GMIR                  19.046        110.483                       114.027
UMIR                  19.563        112.209                       116.823
Modified Min's [46]   13.196        74.118                        78.364
CMIR                  0.128         0.784                         0.802


Figure 3. Tensor-Space Representation of Multiple-Object Trajectories: (a) Three sets of multiple trajectories S1, S2, S3, extracted from the three video clips displayed in Fig. 1. (b) Three corresponding Multiple Trajectory Matrices M1, M2, M3. (c) Three Multiple Trajectory Matrices M′1, M′2, M′3 after resampling. (d) Multiple Trajectory Tensor T constructed from M′1, M′2, M′3.


Figure 4. Example of segmentation of a set of global multiple trajectories (2 trajectories) into segmented multiple subtrajectories. (a) A set of 2 trajectories with global length. (b) 6 sets of segmented multiple subtrajectories, segmented from (a). In both figures, solid squares with numbers indicate segmentation positions; the horizontal axis is x-location and the vertical axis is y-location within the scene.

Figure 5. Block diagram of our multiple object trajectory indexing and retrieval system.


Figure 6. Example video clip from the CAVIAR dataset: (a) Sample frame from the video clip 'Two people meet, fight and chase each other'. (b) Multiple trajectories extracted from the referred video sequence. (c) Lifespans of multiple objects within the referred video sequence.


Figure 7. Retrieval results for the ASL dataset using the proposed CMIR algorithm for multiple trajectory representation (2 trajectories): (a) the query; (b) the most-similar retrieval; (c) the second-most-similar retrieval; (d) the most-dissimilar retrieval; and (e) the second-most-dissimilar retrieval.


Figure 8. Retrieval results for the ASL dataset using the proposed CMIR algorithm for multiple trajectory representation (3 trajectories): (a) the query; (b) the most-similar retrieval; (c) the second-most-similar retrieval; (d) the most-dissimilar retrieval; and (e) the second-most-dissimilar retrieval.


Figure 9. Retrieval results for the CAVIAR dataset (INRIA) using the proposed CMIR algorithm for multiple trajectory representation (2 trajectories): (a) the query; (b) the most-similar retrieval; (c) the second-most-similar retrieval; (d) the most-dissimilar retrieval; and (e) the second-most-dissimilar retrieval.


Figure 10. Retrieval results for the CAVIAR dataset (Shopping Center in Portugal) using the proposed CMIR algorithm for multiple trajectory representation (3 trajectories): (a) the query; (b) the most-similar retrieval; (c) the second-most-similar retrieval; (d) the most-dissimilar retrieval; and (e) the second-most-dissimilar retrieval.


Figure 11. Comparison of average Precision and Recall curves for the three proposed multiple-trajectory indexing and retrieval algorithms (CMIR, GMIR, UMIR) and Modified Min's method, using (a) global and (b) segmented multiple trajectory tensor representation, on two datasets (ASL(1225), CAVIAR(533)).


Figure 12. Comparison of average Precision and Recall curves for the three proposed multiple-trajectory indexing and retrieval algorithms (CMIR, GMIR, UMIR) and Modified Min's method, using segmented multiple trajectory tensor representation, on two datasets (ASL(1225), CAVIAR(533)).


Figure 13. Comparison of average Precision and Recall curves for the CMIR and GMIR algorithms. Curves (a)-(g) are for the CMIR algorithm with factors from 3 to 12; curve (h) is for the GMIR algorithm.

CHAPTER 3

TRAJECTORY-BASED VIDEO CLASSIFICATION AND

RECOGNITION BASED ON DISTRIBUTED MULTI-DIMENSIONAL

HIDDEN MARKOV MODELS

3.1 Introduction

Hidden Markov Models (HMMs) have received tremendous attention in recent years due to their wide applicability in diverse areas such as speech recognition [23], gesture and motion trajectory recognition [24] [25], image classification [26] and retrieval [27], musical score following [28], genome data analysis [29], etc. Most of the previous research has focused on the classical one-dimensional HMM developed in the 1960s by Baum et al. [30] [31], where the states of the system form a single one-dimensional Markov chain, as depicted in Figure 14. However, the one-dimensional structure of this model has limited applicability to more complex data elements such as images and videos.

Early efforts to extend 1D HMMs to higher dimensions resulted in pseudo 2D HMMs [32] [33]. The model is called "pseudo 2D" in the sense that it is not a fully connected 2D HMM. The basic assumption is that there exists a set of "superstates" that are Markovian, and within each superstate there is a set of simple Markovian states. To illustrate this model for higher dimensional systems, let us consider a two-dimensional image. The transition between superstates is modelled as a first-order Markov chain and each superstate is used to represent

an entire row of the image; a simple Markov chain is then used to generate observations in the column. Thus, superstates relate to rows while simple states relate to columns of the image.

Figure 14. Classical one-dimensional hidden Markov model.

Later

efforts to represent 2D data using 1D HMMs were proposed using coupled HMMs (CHMMs) [34] [35]. In this framework, each state of the 1D HMM is used as a meta-state to represent a collection of states, as shown in Figure 15(a). For example, image representation based on CHMMs would rely on a 1D HMM where each state represents an entire column of the image. In certain applications, these models perform better than the classical 1D HMM [32]. However, the performance of pseudo 2D HMMs and CHMMs remains limited since these models capture only part of the two-dimensional hidden state information.


Figure 15. Previous two-dimensional hidden Markov models: (a) coupled HMM (CHMM); (b) nearest-neighbor, strictly causal 2D HMM (dependence based on nearest neighbors in the vertical and horizontal directions).

We believe that true 2D HMM models were presented for the first time by Devijver [36] [37]. In this model, Markov meshes, in particular second- and third-order Markov meshes, were used to characterize the state process along with hidden observations to represent images, thus yielding a true 2D HMM. Although this general 2D HMM model is powerful, analytic solutions for the training and classification algorithms needed to determine the maximum a posteriori classification were not provided. Nonetheless, suboptimal classification algorithms have been proposed for 2D HMMs, namely, the deterministic relaxation algorithm [36]. The use of suboptimal algorithms in various applications served to demonstrate the importance of the information embedded in two-dimensional models. For example, Levin and Pieraccini [38] developed an algorithm for character recognition based on a 2D HMM, while Park and Miller [39] constructed a 2D HMM-based image decoding system over noisy channels.


The first analytic solution to a true two-dimensional HMM was presented by Li et al. [26]. They proposed a strictly-causal, nearest-neighbor, two-dimensional HMM and presented its application to image classification and segmentation. In this model, the state transition probability for each node is conditioned on the states of the nearest neighboring nodes in the horizontal and vertical directions, as depicted in Figure 15(b). The main limitation of this approach is that it allows only causal state dependencies. Thus, the analytic solution to the two-dimensional model presented in [26] can only be used to capture partial information. In particular, the training and classification algorithms presented in [26] rely on the causality of the model. Hence, direct extension of these algorithms to general 2D HMMs, which can represent state dependencies from neighbors in all directions, is not possible since such a model is inherently non-causal.

In this chapter, a novel solution for a non-causal, multi-dimensional hidden Markov model

is proposed. Since an analytic solution to the non-causal model is not available, we provide

a solution by splitting the non-causal model into multiple causal HMMs that are analytically

solvable in a distributed computing framework, therefore referred to as distributed HMMs. The

distributed HMMs are processed on a sequential processor using an alternate updating scheme to

obtain an approximate solution of the non-causal model. Subsequently, we present training and

classification algorithms for the distributed, causal models by extending the work presented in

[26] to general causal, multi-dimensional systems. Specifically, a new Expectation-Maximization

(EM) algorithm for estimation of general causal models is derived, where a General Forward-

Backward (GFB) algorithm is proposed for recursive estimation of the model parameters. Also,


a conditionally independent subset-state sequence structure decomposition is presented for the multi-dimensional Viterbi algorithm.

We present two applications of the proposed non-causal HMM framework: (1) image classification and segmentation; and (2) video classification based on multiple interacting motion trajectories. For simplicity, the presentation of the proposed HMM framework and its application to image and video classification is focused on 2D and limited to dependencies from the adjacent neighbors at each node, as shown in Figure 16(c). However, the proposed non-causal 2D HMM can be easily extended to higher dimensions by following the procedure used in [43].

The remainder of this chapter is organized as follows: Section II introduces the proposed

non-causal hidden Markov model. The distributed solution and sequential updating scheme

designed to obtain an approximation of the non-causal model is proposed in Section III. The

exact analytic solution for a general causal, multi-dimensional HMM required for the distributed

model is presented in Section IV. Section V provides an application of the proposed non-causal,

2D HMM model to the problem of image classification and segmentation. The application of

the proposed model to video classification based on multiple interacting motion trajectories is

presented in Section VI. Finally, a brief summary and discussion of our results is provided in

Section VII.

3.2 Non-Causal Multi-Dimensional Hidden Markov Model

In this section, we provide the basic mathematical formulation of the proposed non-causal,

multi-dimensional Hidden Markov Model.


Figure 16. Proposed non-causal two-dimensional hidden Markov models: (a) arbitrary model with non-causality in all dimensions; (b) arbitrary model with causality along a single dimension; and (c) nearest-neighbor model with causality along a single dimension.

3.2.1 Motivation and Problem Statement

The motivation for our work comes from a basic problem in image processing: image segmentation. In traditional block-based image segmentation algorithms, feature vectors are generated for each image block, and segmentation decisions are made independently for each block based on feature information. The performance of such algorithms is limited since context information between image blocks is lost. Li et al. [26] proposed a two-dimensional hidden Markov model for image segmentation, where the state transition probability for each block is conditioned on the states of the nearest neighboring blocks in the horizontal and vertical directions, as depicted in Figure 15(b). However, the context information that a block depends on may arise from any direction and from any of its neighbors, as depicted in Figure 16(a). Thus, the existing two-dimensional model captures only partial context information. Generalizing the hidden Markov model (HMM) framework to represent state dependencies from all of a node's neighbors, in all directions, makes the model non-causal. A non-causal hidden Markov model with arbitrary directions of state transitions (or state dependencies) is capable of characterizing the intrinsic state transition structure and behavior of complex systems involving multi-dimensional system states. For example, for the analysis of multiple object motion trajectories, a non-causal model is needed that will capture the representation of motion trajectories of multiple objects while: (i) not isolating the motion trajectories to individual objects and thus losing their interactions; and (ii) avoiding costly semantic analysis that would perform classification based on heuristics rather than the inherent probabilistic model used for multiple trajectory representation. This goal can only be achieved by providing an algorithm based on a non-causal HMM.

3.2.2 Non-Causal Multi-Dimensional Hidden Markov Model

Our non-causal hidden Markov model is defined as follows:

1. There are two layers of random variable sets: the hidden states S = {S(i, j), i = 1 : M, j = 1 : N} and the corresponding observations O = {O(i, j), i = 1 : M, j = 1 : N}. The hidden state variable nodes S(i, j) lie on a two-dimensional grid where the relative positions of nodes reflect spatial dependencies between them; e.g., we say node S(i′, j′) is "before" node S(i, j) if i′ < i and j′ < j. The hidden states are not observable, but can only be observed through the observations. Each state takes values in a finite set {1, 2, ..., Q}, where Q is a finite integer.

2. For each state node S(i, j), its conditional distribution is determined by the set of neighboring nodes that "affect" it, i.e.,

Pr(S(i, j) | S(m, n) ∈ S, (m, n) ≠ (i, j)) = Pr(S(i, j) | S(m, n) ∈ N, (m, n) ≠ (i, j)), (3.1)

where N ⊂ S is the set of neighboring nodes of node S(i, j). This definition is equivalent to the well-known Markov random field (MRF). However, in our model, we use a completely different approach to solve the problem.

3. Given the hidden state S(i, j), its corresponding observation O(i, j) follows a multivariate Gaussian distribution characterized by a mean vector and a covariance matrix that are determined by the hidden state. For any state S(i, j) = q ∈ {1, 2, ..., Q}, denote the mean vector and covariance matrix as µq and Σq; the p-dimensional Gaussian distribution is then

b_q(O(i, j)) = (2π)^(−p/2) |Σq|^(−1/2) exp( −(1/2) (O(i, j) − µq)^T Σq^(−1) (O(i, j) − µq) ). (3.2)
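Equation (3.2) is the standard multivariate Gaussian density; a direct numerical sketch (the function name and inputs are illustrative):

```python
import numpy as np

def gaussian_emission(o, mu, sigma):
    """Emission density b_q(o) of Eq. (3.2): a p-variate Gaussian with
    state-dependent mean vector mu and covariance matrix sigma."""
    p = len(mu)
    d = o - mu
    norm = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(sigma))
    # Solve sigma @ x = d instead of forming the explicit inverse
    return np.exp(-0.5 * d @ np.linalg.solve(sigma, d)) / norm
```

For example, with p = 2, an identity covariance, and o = µq, the density evaluates to 1/(2π).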

To our knowledge, no solution exists for general non-causal multi-dimensional HMM models. Even in the simplest case, where the dimensionality is reduced to two, the task is extremely hard due to the non-causality.

3.3 Non-Causal HMM: A Distributed Approach

We propose a novel solution to the arbitrary non-causal multi-dimensional hidden Markov model by splitting the non-causal model into multiple causal HMMs that are analytically solvable in a distributed computing framework, and are therefore referred to as distributed HMMs. We process the distributed HMMs on a sequential processor using an alternate updating scheme to obtain an approximate solution of the non-causal model. We then derive training and classification algorithms for the distributed, causal models by extending the work presented in [26] to general causal, multi-dimensional systems. For simplicity, we focus our discussion on the solution of arbitrary non-causal, two-dimensional hidden Markov models. The proposed scheme, however, can be easily extended to higher-dimensional models.

The splitting rules are defined as follows:

1. The original non-causal model is decomposed into a set of causal models, each of which is formulated by focusing on the state dependencies between the currently analyzed node and its neighboring nodes, one at a time.

2. All state nodes that are not neighbors of the currently analyzed state node, along with their observations, are removed.

3. Each non-causal (bi-directional) dependency between state nodes is further decomposed into a pair of causal (directional) dependencies.

Please note that in the splitting procedure, the state dependency information is still intact. Furthermore, since each distributed submodel preserves the correlation between neighboring state nodes, the proposed framework is not a simple collection of uncorrelated causal models but an accurate representation of the original model.
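As an illustration of rules 1-3 for the single-causal-dimension case of Figure 17 (horizontal dimension causal, vertical dependencies bi-directional), the sketch below lists, for each grid node, the neighbors it is conditioned on in each of the two causal submodels; the node indexing and function name are hypothetical:

```python
def split_noncausal_2d(M, N):
    """Split a non-causal M x N grid model (horizontal dimension causal,
    vertical dependencies in both directions) into two causal submodels.
    Returns, per submodel, a map from each node (i, j) to the neighbors
    whose states it is conditioned on."""
    down, up = {}, {}
    for i in range(M):
        for j in range(N):
            left = [(i, j - 1)] if j > 0 else []                     # causal dimension
            down[(i, j)] = left + ([(i - 1, j)] if i > 0 else [])    # top-down submodel
            up[(i, j)] = left + ([(i + 1, j)] if i < M - 1 else [])  # bottom-up submodel
    return down, up
```

Each bi-directional vertical dependency (i, j) ↔ (i+1, j) appears once in each submodel with opposite direction, so the pair of submodels together retains the full dependency information, as stated above.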

An arbitrary two-dimensional hidden Markov model with N^2 state nodes can be represented in the form of a two-dimensional state transition diagram, as depicted in Figure 16(a). If each dimension of the model is non-causal, we can solve the model by allocating N^2 processors, one processor per node, under the condition that all N^2 processors are perfectly synchronized and deadlock among concurrent state dependencies is successfully resolved. We assert that such a solution is highly impractical and difficult to realize. As an approximate solution, we propose to split the non-causal model into N^2 distributed causal models, by focusing on the state dependencies of one node at a time while ignoring the other nodes. Similarly, for arbitrary M-dimensional hidden Markov models, we can split the non-causal model into N^M distributed causal HMMs, by focusing on the state dependencies of one node at a time while ignoring the other nodes.

Figure 17. Distributed HMMs: (a) non-causal 2D hidden Markov model; (b) distributed causal 2D hidden Markov model 1; and (c) distributed causal 2D hidden Markov model 2.


Moreover, if we assume that one dimension of the non-causal multi-dimensional model is causal, we can significantly reduce the computational complexity as well as the resources needed for the estimation of the model parameters. For example, for the two-dimensional non-causal model depicted in Fig. 17(a), if we assume that one dimension, e.g. the horizontal dimension, is causal, we only need N parallel processors instead of the N^2 needed in the fully non-causal formulation. Also, there would be only N distributed causal hidden Markov models instead of N^2. In general, an arbitrary M-dimensional non-causal hidden Markov model with P (P ≤ M − 1) causal dimensions can be solved by splitting it into N^(M−P) distributed causal hidden Markov models and processing them in parallel.
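The counting argument above can be sketched as a small helper; the function name and the example values below are illustrative, not part of the model itself:

```python
def num_distributed_models(N, M, P=0):
    """Number of distributed causal HMMs obtained by splitting an
    M-dimensional non-causal model with N nodes per dimension,
    when P of the M dimensions are already causal: N^(M - P)."""
    return N ** (M - P)

# Fully non-causal 2D model with N = 4: one causal model per node.
assert num_distributed_models(4, 2) == 16
# The same model with one causal dimension: only N models remain.
assert num_distributed_models(4, 2, P=1) == 4
```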

In our proposed distributed approach, we decentralize the non-causal hidden Markov model by focusing on one causal direction of state transition at a time, while ignoring the other directions of state transition probabilities. Figure 17 depicts one example of distributing a non-causal 2D HMM in which one dimension (the horizontal direction) is causal, while the other dimension (the vertical direction) is non-causal. We decouple the non-causal model (Fig. 17(a)) using the proposed distribution approach; since there are only two causal directions of state transition along the non-causal dimension, we obtain two distributed causal hidden Markov models, shown in Figs. 17(b) and 17(c).

The extension of our representation to non-causal models is equivalent to removing the restriction that the graph be cycle-free. This is due to the fact that whenever the conditional dependence between two nodes is reciprocal (i.e. there is a directed link between two nodes


in both directions), we inevitably have a graph with cycles. This is precisely the case for the non-causal models used in our example (see Fig. 17(a)).

It is interesting to note that cycle-free graphs are isomorphic to multi-dimensional causal

graphical models. That is, cycle-free graphs can be mapped to a multi-dimensional causal

graphical representation. Hence, causal models are equivalent to general cycle-free graphs and

explicitly determine the recursive structure of the update process in training and classification.

In order to accurately estimate the state transition probabilities of the non-causal model, all of the distributed causal two-dimensional models must be processed simultaneously in perfect synchrony. However, in reality, it is not possible for the entire system to be exactly synchronous. We approximate the simultaneous solution of multiple distributed HMMs on a sequential processor by an alternating updating scheme. One example updating scheme is depicted in Fig. 18. The numbers {1, 2, 3, 4, ...} indicate the order in which the model parameters are updated.

In the following discussions, we refer to the distributed causal hidden Markov models as

DHMMs. The presentation in this chapter will focus primarily on a special case of our model

restricted to two-dimensions, which we refer to as the 2D DHMM. Nonetheless, solutions to the

2D DHMM model can be easily extended to higher dimensions by following the procedure used

in [43]. Since the above distributed causal 2D DHMMs are the fundamental building blocks

of the general non-causal multi-dimensional hidden Markov models, we focus primarily on the

estimation of their parameters, which is discussed in detail in the following section.


Figure 18. Sequential alternate updating scheme for multiple distributed HMMs.


3.4 Distributed Causal HMMs: Training and Classification

The distributed approach to an arbitrary non-causal hidden Markov model presented in the previous section relies on a solution to multiple causal hidden Markov models. In this section, we provide an exact solution to arbitrary causal hidden Markov models by extending the classical one-dimensional training and classification algorithms to multiple dimensions.

Recall the observed feature vector set O = {o(m,n), m = 1, 2, ..., M; n = 1, 2, ..., N} and the corresponding hidden state set S = {s(m,n), m = 1, 2, ..., M; n = 1, 2, ..., N}. The model parameters are defined as a set Θ = {Π, A, B}, where Π = {π(m,n)} is the set of initial state probabilities and A = {a_{i,j,k,l}(m,n)} is the set of state transition probabilities, with

a_{i,j,k,l}(m,n) = Pr( s(m,n) = l | s(m′,n) = k, s(m,n−1) = i, s(m′,n−1) = j ),   (3.3)

and B is the set of probability density functions (PDFs) of the observed feature vectors given the corresponding states. We assume B is a set of Gaussian distributions with means µ_{m,n} and covariances Σ_{m,n}, where m, m′ = 1, ..., M; m ≠ m′; n, i, j, k, l = 1, ..., N; t = 1, ..., T. Due to space limitations, we will discuss the case M = 2 in detail.
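For concreteness, the parameter set Θ = {Π, A, B} for the M = 2 case can be held in array form as follows; this is an illustrative sketch (the array names and the values of N and d are placeholders), not the implementation used in our experiments:

```python
import numpy as np

# Hypothetical containers for Θ = {Π, A, B}, M = 2 trajectories,
# N states per chain, d-dimensional Gaussian emissions.
N, d = 4, 2
Pi    = np.full(N, 1.0 / N)               # initial state probabilities π
A     = np.full((N, N, N, N), 1.0 / N)    # a_{i,j,k,l}: next state conditioned
                                          # on three neighbouring states
mu    = np.zeros((2, N, d))               # Gaussian means µ_{m,i}
Sigma = np.tile(np.eye(d), (2, N, 1, 1))  # covariances Σ_{m,i}

# For each fixed (i, j, k), a_{i,j,k,·} is a distribution over the N
# possible next states, so it must sum to one:
assert np.allclose(A.sum(axis=3), 1.0)
```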

We will now present an extension of the training and classification algorithms for each of

the distributed multi-dimensional causal HMMs. In particular, we will provide an extension

to general multi-dimensional causal systems for the following algorithms: (a) Expectation-

Maximization (EM); (b) General Forward-Backward (GFB); and (c) Viterbi algorithm.


3.4.1 Expectation-Maximization (EM) Algorithm

We propose a new Expectation-Maximization (EM) algorithm suitable for estimating the parameters of the M distributed 2D hidden Markov models; it is analogous to, but different from, the EM algorithm for the 1D HMM. The EM algorithm was first proposed by Dempster, Laird and Rubin [41], and it has found many applications to hidden Markov models. We take the DHMM applied to two trajectories as an example to explain how the EM algorithm works for the DHMM; the parameters of the DHMM applied to M (M > 2) trajectories are estimated in the same way.

Assume that each trajectory is sampled at T consecutive time instances. We have the observation sequence O = {o(i,j), i = 1, 2; j = 1, ..., T} and hidden states S = {s(i,j), i = 1, 2; j = 1, ..., T}, where o(1,j), o(2,j) refer to the observation (feature) sets of objects 1 and 2, and s(1,j), s(2,j) refer to the state sets of objects 1 and 2. Here S refers to the union of these two state sets. Assume the number of states is N, so s(1,j), s(2,j) ∈ {1, 2, ..., N} for j = 1, 2, ..., T.

The incomplete data is O, and the complete data is (O, S). The incomplete-data likelihood function is P(O|Θ), and the complete-data likelihood function is P(O, S|Θ). We wish to maximize the complete-data likelihood function P(O, S|Θ). According to the EM algorithm, the Q function is:

Q(Θ, Θ′) = Σ_{S∈V} log(P(O, S|Θ)) P(S|O, Θ′),   (3.4)

where

P(O, S|Θ) = π_{s(2,0)} ∏_{t=1}^{T} a_{m,n,k,l} b_{s(1,t)}(o(1,t)) b_{s(2,t)}(o(2,t)).   (3.5)


Here Θ′ is the current (known) estimate of the parameters, Θ is the new (unknown) estimate of the parameters that maximizes the likelihood function, and V is the space of all possible state sequences of length T. The joint probability of observations O and states S is P(O, S|Θ). Substituting (3.5) into (3.4), we get

Q(Θ, Θ′) = Σ_{S∈V} log(π_{s(2,0)}) P(S|O, Θ′)   (A*)

 + Σ_{S∈V} Σ_{t=1}^{T} log(a_{m,n,k,l}) P(S|O, Θ′)   (B*)

 + Σ_{S∈V} Σ_{t=1}^{T} log(b_{s(2,t)}(o(2,t))) P(S|O, Θ′)   (C*)

 + Σ_{S∈V} Σ_{t=1}^{T} log(b_{s(1,t)}(o(1,t))) P(S|O, Θ′).   (D*)   (3.6)

In the above equations, a_{m,n,k,l} is the transition probability from states s(2,t−1), s(1,t−1), s(1,t) to state s(2,t), when s(2,t−1) is in state m, s(1,t−1) is in state n, s(1,t) is in state k and s(2,t) is in state l, with m, n, k, l ∈ {1, 2, ..., N}; b_{s(m,t)}(o(m,t)) is the probability of observation o(m,t) from trajectory m given the corresponding state s(m,t), m = 1, 2. We assume the observations follow a d-dimensional Gaussian distribution when the corresponding state is i (i ∈ {1, 2, ..., N}), i.e.

b_{m,i}(o(m,t)) = 1 / ( (2π)^{d/2} |Σ_{m,i}|^{1/2} ) exp( −(1/2) (o(m,t) − µ_{m,i})^T Σ_{m,i}^{−1} (o(m,t) − µ_{m,i}) ),   (3.7)


where µ_{m,i} is the d-dimensional mean vector, Σ_{m,i} is the d × d covariance matrix, and d is the dimensionality of the observation (feature) vector.
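Eq. (3.7) can be evaluated directly; the sketch below (function name and sanity check are illustrative) computes the d-dimensional Gaussian density:

```python
import numpy as np

def gaussian_emission(o, mu, Sigma):
    """b_{m,i}(o): the d-dimensional Gaussian density of eq. (3.7).
    o and mu are length-d vectors; Sigma is a d x d covariance matrix."""
    d = len(o)
    diff = o - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    expo = -0.5 * diff @ np.linalg.solve(Sigma, diff)
    return float(np.exp(expo) / norm)

# Sanity check against the 1D standard normal at its mean: 1/sqrt(2*pi).
val = gaussian_emission(np.array([0.0]), np.array([0.0]), np.eye(1))
assert abs(val - 1.0 / np.sqrt(2 * np.pi)) < 1e-12
```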

To maximize Q(Θ, Θ′), we maximize each of (A*), (B*), (C*) and (D*). By maximizing (A*), (B*), (C*) and (D*), we obtain the iterative updating formulas for the parameters of our distributed 2D HMM.

Define F^{(p)}_{m,n,k,l}(i,j) as the probability that the state corresponding to observation o(i−1,j) is m, the state corresponding to observation o(i−1,j−1) is n, the state corresponding to observation o(i,j−1) is k, and the state corresponding to observation o(i,j) is l, given the observations and model parameters,

F^{(p)}_{m,n,k,l}(i,j) = P( m = s(i−1,j), n = s(i−1,j−1), k = s(i,j−1), l = s(i,j) | O, Θ^{(p)} ),   (3.8)

and define G^{(p)}_m(i,j) as the probability that the state corresponding to observation o(i,j) is m, i.e.

G^{(p)}_m(i,j) = P( s(i,j) = m | O, Θ^{(p)} ).   (3.10)

We can then obtain the iterative updating formulas for the parameters of the proposed model:

π^{(p+1)}_m = G^{(p)}_m(1,1).   (3.11)


a^{(p+1)}_{m,n,k,l} = [ Σ_{i=1}^{I} Σ_{j=1}^{J} F^{(p)}_{m,n,k,l}(i,j) ] / [ Σ_{l=1}^{N} Σ_{i=1}^{I} Σ_{j=1}^{J} F^{(p)}_{m,n,k,l}(i,j) ].   (3.12)

µ^{(p+1)}_m = [ Σ_{i=1}^{I} Σ_{j=1}^{J} G^{(p)}_m(i,j) o(i,j) ] / [ Σ_{i=1}^{I} Σ_{j=1}^{J} G^{(p)}_m(i,j) ].   (3.13)

Σ^{(p+1)}_m = [ Σ_{i=1}^{I} Σ_{j=1}^{J} G^{(p)}_m(i,j) (o(i,j) − µ^{(p+1)}_m)(o(i,j) − µ^{(p+1)}_m)^T ] / [ Σ_{i=1}^{I} Σ_{j=1}^{J} G^{(p)}_m(i,j) ].   (3.14)

In eqns. (3.11)-(3.14), p is the iteration step number. F^{(p)}_{m,n,k,l}(i,j) and G^{(p)}_m(i,j) are unknown in the above formulas; next, we propose a General Forward-Backward (GFB) algorithm for their estimation.
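Assuming the posteriors F and G have already been estimated, the M-step updates (3.11)-(3.14) amount to normalized sums. The following vectorized sketch makes this concrete; the array shapes and names are illustrative assumptions, not the authors' code:

```python
import numpy as np

def em_update(F, G, O):
    """One M-step of eqs. (3.11)-(3.14).  Assumed shapes:
      F: (N, N, N, N, I, J)  joint state posterior of eq. (3.8)
      G: (N, I, J)           single-state posterior of eq. (3.10)
      O: (I, J, d)           observed feature vectors"""
    pi = G[:, 0, 0]                                   # eq. (3.11)
    num = F.sum(axis=(4, 5))                          # sum over positions
    a = num / num.sum(axis=3, keepdims=True)          # eq. (3.12)
    w = G.sum(axis=(1, 2))                            # per-state mass
    mu = np.einsum('mij,ijd->md', G, O) / w[:, None]  # eq. (3.13)
    diff = O[None] - mu[:, None, None, :]             # (N, I, J, d)
    Sigma = np.einsum('mij,mije,mijf->mef', G, diff, diff) / w[:, None, None]
    return pi, a, mu, Sigma                           # Sigma: eq. (3.14)
```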

3.4.2 General Forward-Backward (GFB) Algorithm

The Forward-Backward algorithm was first proposed by Baum et al. [42] for the 1D hidden Markov model and later modified by Li et al. [26]. Here, we generalize the Forward-Backward algorithm so that it can be applied to any HMM; we call the proposed algorithm the General Forward-Backward (GFB) algorithm. The GFB algorithm can be applied to any HMM whose state sequence satisfies the following property: the probability of the all-state sequence S can be decomposed as a product of probabilities of conditionally independent subset-state sequences U_0, U_1, ..., i.e., P(S) = P(U_0) P(U_1|U_0) ... P(U_i|U_{i−1}) ..., where U_0, U_1, ..., U_i, ... are subsets of the all-state sequence in the HMM system; we call them subset-state sequences. Define the observation sequence corresponding to each subset-state sequence U_i as O_i. Subset-state sequences for our model are shown in Fig. 19(b). The new structure enables us to use the General Forward-Backward (GFB) algorithm to estimate the model parameters.


Figure 19. Training and classification: (a) state dependencies of the proposed distributed causal 2D-HMM; and (b) the corresponding decomposed subset-state sequences.

3.4.2.1 Forward and Backward Probability

Define the forward probability α_{U_u}(u), u = 1, 2, ..., as the probability of observing the observation sequences O_v (v ≤ u) corresponding to the subset-state sequences U_v (v ≤ u) and having the state sequence for the u-th product component in the decomposition formula be U_u, given the model parameters Θ, i.e. α_{U_u}(u) = P{S(u) = U_u, O_v, v ≤ u | Θ}; and define the backward probability β_{U_u}(u), u = 1, 2, ..., as the probability of observing the observation sequences O_v (v > u) corresponding to the subset-state sequences U_v (v > u), given the state sequence for the u-th product component U_u and the model parameters Θ, i.e. β_{U_u}(u) = P(O_v, v > u | S(u) = U_u, Θ). The recursive updating formulas for the


forward and backward probabilities can be obtained as

α_{U_u}(u) = [ Σ_{U_{u−1}} α_{U_{u−1}}(u−1) P{U_u | U_{u−1}, Θ} ] P{O_u | U_u, Θ},   (3.15)

β_{U_u}(u) = Σ_{U_{u+1}} P(U_{u+1} | U_u, Θ) P(O_{u+1} | U_{u+1}, Θ) β_{U_{u+1}}(u+1).   (3.16)

Then, the estimation formulas for F_{m,n,k,l}(i,j) and G_m(i,j) are:

G_m(i,j) = α_{U_u}(u) β_{U_u}(u) / [ Σ_{u: U_u(i,j)=m} α_{U_u}(u) β_{U_u}(u) ],   (3.17)

F_{m,n,k,l}(i,j) = α_{U_{u−1}}(u−1) P(U_u | U_{u−1}, Θ) P(O_u | U_u, Θ) β_{U_u}(u) / [ Σ_u Σ_{u−1} α_{U_{u−1}}(u−1) P(U_u | U_{u−1}, Θ) P(O_u | U_u, Θ) β_{U_u}(u) ].   (3.18)
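The recursions (3.15)-(3.16) have the same shape as the classical 1D forward-backward pass, with subset-state configurations playing the role of states. A sketch under the simplifying assumption that each U_u ranges over K joint configurations (variable names are illustrative):

```python
import numpy as np

def gfb(trans, emis, init):
    """General Forward-Backward over subset-state sequences, sketched for
    the case where each U_u takes one of K joint configurations.
    trans[u] is the K x K matrix P(U_{u+1}|U_u); emis[u] is the length-K
    vector P(O_u|U_u); eqs. (3.15)-(3.16)."""
    T, K = len(emis), len(init)
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = init * emis[0]
    for u in range(1, T):                       # forward pass, eq. (3.15)
        alpha[u] = (alpha[u - 1] @ trans[u - 1]) * emis[u]
    for u in range(T - 2, -1, -1):              # backward pass, eq. (3.16)
        beta[u] = trans[u] @ (emis[u + 1] * beta[u + 1])
    return alpha, beta
```

As a sanity property, the product α_{U_u}(u) β_{U_u}(u), summed over configurations, equals the data likelihood at every position u.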

3.4.3 Viterbi Algorithm

For classification, we employ a two-dimensional Viterbi algorithm [43] to search for the best combination of states with maximum a posteriori probability and map each block to a class. This process is equivalent to searching for the state of each block using an extension of the variable-state Viterbi algorithm presented in [26], based on the new structure in Fig. 19(b). If we search over all combinations of states, then, supposing the number of states in each subset-state sequence U_u is w(u), the number of possible state sequences at every position is M^{w(u)}, which is computationally infeasible. To reduce the computational complexity, we use only the N state sequences with the highest likelihoods out of the M^{w(u)} possible sequences.
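The pruning idea can be sketched as a beam search that retains only the top-scoring partial state sequences at each position; this is a simplified 1D illustration of the pruning step, not the full 2D variable-state search:

```python
def beam_viterbi(init, trans, emis, beam):
    """Viterbi with pruning: at each position keep only the `beam`
    highest-likelihood partial state sequences instead of all
    combinations.  init[k], trans[k][l], emis[u][k] are probabilities
    over K joint states (illustrative inputs)."""
    frontier = {(k,): init[k] * emis[0][k] for k in range(len(init))}
    frontier = dict(sorted(frontier.items(), key=lambda kv: -kv[1])[:beam])
    for u in range(1, len(emis)):
        nxt = {}
        for path, p in frontier.items():
            for l in range(len(init)):
                q = p * trans[path[-1]][l] * emis[u][l]
                cand = path + (l,)
                if q > nxt.get(cand, 0.0):
                    nxt[cand] = q
        # prune: keep only the `beam` best partial sequences
        frontier = dict(sorted(nxt.items(), key=lambda kv: -kv[1])[:beam])
    return max(frontier, key=frontier.get)

# A small example: the pruned search still recovers the best path here.
init = [0.5, 0.5]
trans = [[0.6, 0.4], [0.4, 0.6]]
emis = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
assert beam_viterbi(init, trans, emis, beam=2) == (0, 0, 1)
```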


3.4.4 Summary of DHMM Training and Classification Algorithms

-Training:

1. Assign initial values to {π_m, a_{m,n,k,l}, µ_m, Σ_m}.

2. Update the forward and backward probabilities according to eqns. (3.15) and (3.16) using the proposed GFB algorithm; calculate the old log P(O|Θ0).

3. Update F^{(p)}_{m,n,k,l}(i,j) and G^{(p)}_m(i,j) according to eqns. (3.17) and (3.18).

4. Update π_m, a_{m,n,k,l}, µ_m and Σ_m according to eqns. (3.11)-(3.14) using the proposed EM algorithm.

5. Calculate the new log P(O|Θ); stop if log P(O|Θ) − log P(O|Θ0) is below a preset threshold, otherwise go back to step 2.

-Classification: Use a two-dimensional Viterbi algorithm to search for the best combination of states with maximum a posteriori (MAP) probability.
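The training steps above can be summarized as the following loop; `e_step`, `m_step` and `log_lik` are hypothetical callables standing in for the GFB estimation of F and G, the EM updates, and the computation of log P(O|Θ):

```python
def train_dhmm(observations, theta, e_step, m_step, log_lik, tol=1e-4):
    """Skeleton of the DHMM training loop: alternate GFB-based E-steps
    and EM M-steps until the log-likelihood gain falls below a preset
    threshold (a sketch; the callables are placeholders)."""
    old = log_lik(observations, theta)
    while True:
        F, G = e_step(observations, theta)      # steps 2-3
        theta = m_step(F, G, observations)      # step 4
        new = log_lik(observations, theta)      # step 5
        if new - old < tol:                     # convergence test
            return theta
        old = new
```

Usage with toy callables: a quadratic log-likelihood and an M-step that moves the parameter halfway toward its optimum converges to that optimum.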

3.5 Application I: Non-Causal HMM-Based Image Classification

In this section, we provide experimental results of the proposed non-causal HMM model applied to the problems of image classification and segmentation.

In our experiment, we test our model on the segmentation of 6 aerial images of the San Francisco Bay area provided by TRW (formerly ESL, Inc.), as depicted in Fig. 20.

The test images are divided into non-overlapping blocks, and feature vectors for each block

are extracted. The feature vector consists of nine features, of which six are intra-block features,


Figure 20. Image segmentation: (a)-(f) aerial test images.


Figure 21. Hand-labeled ground truth images of man-made and natural regions: (a)-(f) truth images corresponding to the aerial test images in Fig. 20(a)-(f). (White and gray denote man-made and natural regions, respectively.)


Figure 22. Image segmentation: (a) original aerial image (see Fig. 20(f)); (b) hand-labeled ground truth image (see Fig. 21(f)); (c) classification using a strictly-causal two-dimensional hidden Markov model—the corresponding error rate is 13.39%; and (d) classification using the proposed non-causal two-dimensional hidden Markov model—the corresponding error rate is 8.25%. (White and gray denote man-made and natural regions, respectively.)

as defined in [26], and three are inter-block features defined as the differences of the average intensity of block (i, j) with those of its vertical, diagonal and horizontal neighbors. Let the average intensity of block (i, j) be I(i, j); then the three features are f7 = I(i, j) − I(i−1, j), f8 = I(i, j) − I(i−1, j−1), and f9 = I(i, j) − I(i, j−1).
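The inter-block features can be computed as below; a minimal sketch, assuming a precomputed per-block average-intensity array `I_avg` (an illustrative name) and taking f9 as the horizontal difference:

```python
import numpy as np

def interblock_features(I_avg, i, j):
    """The three inter-block features f7-f9: differences of the average
    intensity of block (i, j) with its vertical, diagonal and horizontal
    neighbours.  I_avg[i, j] holds per-block average intensities."""
    f7 = I_avg[i, j] - I_avg[i - 1, j]        # vertical neighbour
    f8 = I_avg[i, j] - I_avg[i - 1, j - 1]    # diagonal neighbour
    f9 = I_avg[i, j] - I_avg[i, j - 1]        # horizontal neighbour
    return f7, f8, f9

I_avg = np.array([[10.0, 12.0], [11.0, 15.0]])
assert interblock_features(I_avg, 1, 1) == (3.0, 5.0, 4.0)
```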

We wish to segment each of the test images into two regions: a man-made region and a natural region. This can also be seen as a classification problem, where each image block is to be classified into one of two classes: man-made or natural. For testing the model, six-fold cross-validation is used. For each test, one image is used as the test image, and the other five serve as training images. We first train our model using the training images, and estimate the model parameters based on the training feature vectors and their corresponding truth set of classes.


We then perform image classification for a test image using the trained model. Feature vectors are generated for each block in the test image in the same way as in training. Models with different numbers of states and different block sizes are evaluated. We found that the model with 6 states for the natural class and 8 states for the man-made class yields the best result. The classification results using different HMM models for one of the images are shown in Figs. 22(c) and 22(d). The proposed 2D HMM model reduces the segmentation error rate significantly. This is due to the fact that our model incorporates more context information from neighboring blocks, so the classes of image blocks are estimated more accurately. This can also be seen in Table III, where the average classification error rates over all 6 test images are shown. We can see that, for different image block sizes, the proposed model demonstrates superior performance over the existing models.
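The six-fold cross-validation protocol described above is simply leave-one-image-out over the six aerial images; a trivial sketch:

```python
def six_fold_splits(n_images=6):
    """Leave-one-out protocol: each image serves once as the test image
    while the remaining five are used for training."""
    for test in range(n_images):
        train = [k for k in range(n_images) if k != test]
        yield train, test

splits = list(six_fold_splits())
assert len(splits) == 6 and all(len(tr) == 5 for tr, _ in splits)
```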

TABLE III

AVERAGE CLASSIFICATION ERROR RATE VERSUS BLOCK SIZE

Algorithm              Blocksize=4   Blocksize=8   Blocksize=16
Causal 2D HMM [26]     0.1874        0.2420        0.2921
Non-Causal 2D HMM      0.1536        0.2041        0.2596


3.6 Application II: Non-Causal HMM-Based Video Classification

In this section, we report experimental results of the proposed non-causal hidden Markov model applied to the problem of classifying videos that contain moving objects.

The hidden Markov model (HMM) is a very powerful tool for modeling the temporal dynamics of processes, and it has been successfully applied to trajectory-based video classification and recognition. Bashir et al. presented a novel classification algorithm for object motion trajectories based on the 1D HMM. They segmented a single trajectory into atomic segments called subtrajectories based on the curvature of the trajectory; the subtrajectories are then represented by their principal component analysis (PCA) coefficients, and the temporal relationships of the subtrajectories are represented by fitting a 1D HMM. However, simple combinations of 1D HMMs cannot be used to characterize multiple trajectories, since 1D models fail to convey the interaction information of multiple interacting objects within a video.

To match our proposed non-causal multi-dimensional HMM (Fig. 17), the states of each trajectory are modelled as a one-dimensional Markov chain, shown as one row in the two-dimensional grid. The ‘X’ direction shows the temporal dynamics of each trajectory; one can regard ‘X’ as time ‘t’ or frame number ‘n’. The “interactions” between objects are modelled as dependencies of the state variables of one process on the states of the others. Thus, at each fixed time ‘t’, the state of one trajectory depends on its own previous state at time ‘t−1’ and on the states of the other trajectories at time ‘t−1’. The intuition behind our work is that the HMM is a very powerful tool for modeling the temporal dynamics of each process (trajectory); each process (trajectory) has its own dynamics, while it may influence or be influenced by others. In our proposed model, “influence”


or “interaction” among processes (trajectories) is modelled as dependencies of state variables among processes (trajectories). Thus, for an N-frame video containing M interacting trajectories, the states of the system form a two-dimensional grid of size M by N. Since different videos contain different numbers of objects and are of diverse lengths, the intrinsic two-dimensional state dependency structure uniquely characterizes each video. Even for videos of the same length and with the same number of objects, the activities of the objects will be dissimilar, so that the corresponding two-dimensional models will be different. Thus we classify videos according to the activities of objects, based on the objects’ motion trajectories. In this way, we transform the problem of video classification into multiple object trajectory classification.

For simplicity, we tested our model only on videos of the same length consisting of two objects. We test the classification performance of the proposed model-based classifier, a causal 2D HMM-based classifier and a traditional 1D HMM-based classifier on two datasets: (A) dataset TWO-HANDS, which contains video clips of two moving hands performing sign language; and (B) dataset PEOPLE, which contains video clips of people moving with interactions.

The results are reported in terms of three criteria:

1. The average Receiver Operating Characteristic (ROC) curve.

The ROC curve captures the trade-off between the false positive rate and the true positive rate as the threshold on the likelihood at the output of the classifier is varied. The resulting ROC curves are shown in Figs. 25(a) and 25(b). As a baseline case, the performance of a uniformly distributed random classifier is also presented in the ROC curve.


Figure 23. Video classification—samples of two classes in video dataset TWO-HANDS: (a) two-trajectory sample from class 1: ’god+boy’; (b) two-trajectory sample from class 2: ’dog+smile’.

Figure 24. Video classification—samples of two classes in video dataset PEOPLE: (a) two-trajectory sample from class 1: “Two people meet and walk together”; (b) two-trajectory sample from class 2: “Two people meet, fight and run away”.


Figure 25. Video classification—ROC curves for two datasets: (a) TWO-HANDS and (b) PEOPLE. (Each graph depicts the proposed non-causal 2D HMM (red), strictly-causal 2D HMM (blue), 1D HMM (green), and random classification (black).)

2. The Area Under the Curve (AUC).

The AUC is a convenient way of comparing classifiers; it varies from 0.5 (random classifier) to 1.0 (ideal classifier). The AUCs for the two datasets are shown in Figs. 25(a) and 25(b), respectively.

3. Classification Accuracy.

The classification accuracy is defined as:

P_Accuracy = 1 − |F| / |S|,   (3.19)

where |F| represents the cardinality of the set of false positives, and |S| represents the cardinality of the whole dataset.
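Eq. (3.19) is straightforward to compute; a one-line sketch:

```python
def classification_accuracy(false_positives, dataset):
    """Eq. (3.19): accuracy = 1 - |F| / |S|, where F is the set of
    false positives and S the whole dataset."""
    return 1.0 - len(false_positives) / len(dataset)

# 3 misclassified samples out of 40:
assert abs(classification_accuracy(range(3), range(40)) - 0.925) < 1e-12
```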


We first test on the dataset TWO-HANDS. The trajectories are taken from the Australian Sign Language (ASL) dataset, which can be obtained from the University of California at Irvine Knowledge Discovery in Databases (KDD) archive (http://kdd.ics.uci.edu). This dataset contains 2565 three-dimensional (x-, y- and z-) single-trajectory recordings from sensors mounted on a right-hand glove while subjects sign different words of Australian Sign Language. We select 10 sign words and combine each pair of different sign words, e.g. the sign words ’god’ and ’boy’, to form a new class, ’god+boy’, which contains two-hand 2D trajectory (x- and y-) samples. In this way, we construct 45 classes of two-hand trajectories, each of which contains 30 samples, for a total of 1350 two-hand trajectory samples forming our TWO-HANDS dataset. Each two-hand trajectory sample can be viewed as taken from a video in which two hands are moving simultaneously. Since large amounts of video data containing multiple trajectories are generally difficult to obtain, this construction gives us a convenient way of generating a large dataset. Figure 23 shows samples of two-hand trajectories from 2 classes in the video dataset TWO-HANDS. We use 50% of the samples as training data and the rest as testing data. The average ROC curve is shown in Fig. 25(a). We can see that, at a low false positive rate, the non-causal HMM-based classifier achieves a true positive rate almost twice as high as that of the causal HMM-based classifier, and almost three times as high as that of the 1D HMM. This is due to the fact that our non-causal HMM uses context information from 5 neighbors of the analyzed state, while the causal HMM uses context information from only 2 neighbors, and the 1D HMM from only 1 neighbor. The superior performance of our proposed model can be further seen from Table IV, which shows the average classification accuracy rates of


the three models. Our non-causal HMM-based classifier achieves a high classification accuracy of 91.25%, followed by the strictly-causal 2D HMM-based classifier, which is 8% lower than ours, and the 1D HMM-based classifier, which is 15% lower than ours. This experiment demonstrates the robustness and superior performance of our non-causal HMM-based classifier under large-sample testing.
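The dataset construction above is a simple pairing of sign words: choosing 2 of the 10 selected words gives C(10,2) = 45 classes. A sketch (the word list beyond the four words named in the text is illustrative):

```python
from itertools import combinations

# Pairing each two of the 10 selected ASL sign words yields C(10, 2) = 45
# two-hand classes; with 30 samples per class this gives the 1350-sample
# TWO-HANDS dataset described above.
words = ['god', 'boy', 'dog', 'smile', 'cat', 'run', 'eat', 'go', 'yes', 'no']
classes = ['+'.join(pair) for pair in combinations(words, 2)]
assert len(classes) == 45
assert len(classes) * 30 == 1350
assert 'god+boy' in classes and 'dog+smile' in classes
```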

We then test on the dataset PEOPLE, which contains 9 classes of video clips in which 2 people interact with each other. This dataset is a subset of the Context Aware Vision using Image-based Active Recognition (CAVIAR) dataset1. The CAVIAR dataset itself was generated from two sub-datasets: (1) clips from INRIA, and (2) clips from a shopping center in Portugal. The first set of clips was filmed with a wide-angle camera lens in the entrance lobby of the INRIA Labs at Grenoble, France; the resolution is half-resolution PAL standard (384 x 288 pixels, 25 frames per second), compressed using MPEG2. The second set of data also used a wide-angle lens, along and across the hallway of a shopping center in Lisbon, Portugal. For each sequence, there are two time-synchronized videos, one with the view across and the other along the hallway; the resolution is again half-resolution PAL standard (384 x 288 pixels, 25 frames per second), compressed using MPEG2. Figure 24 shows samples from 2 classes in the video dataset PEOPLE. The average ROC curve for this dataset is shown in Fig. 25(b). Here we can see that, when the false positive rate is low, our non-causal HMM-based classifier achieves a high true positive rate, twice as high as those of the causal HMM-based classifier and the 1D HMM classifier.

1The CAVIAR dataset is from the EC-funded CAVIAR project, IST 2001 37540 (http://homepages.inf.ed.ac.uk/rbf/CAVIAR/).


In Table IV, the average classification accuracy of our non-causal HMM-based classifier reaches 92.04%, which is 8% higher than the strictly-causal 2D HMM-based classifier and 12% higher than the 1D HMM-based classifier. This experiment shows the applicability of the non-causal HMM in video classification applications.

TABLE IV

AVERAGE CLASSIFICATION ACCURACY RATE

Classifier/Dataset            TWO-HANDS (1350)   PEOPLE (36)
1D HMM                        0.7654             0.8097
Strictly-causal 2D HMM [26]   0.8319             0.8420
Non-causal 2D HMM             0.9125             0.9204


CHAPTER 4

FUTURE WORK

In our previous work, we presented a novel indexing and retrieval framework for the compact and integrated representation of multiple interacting motion trajectories. This framework provides rich indexing and retrieval mechanisms for multiple motion trajectories without requiring additional higher-level semantic analysis. The proposed approach can be used for indexing and retrieval of multiple motion trajectories as well as segmented (partial) multiple motion trajectories. We determined the optimal joint segmentation of multiple motion trajectories by posing this task as the solution of a hypothesis testing problem and deriving the maximum likelihood solution. Our key contributions were the formation of three new multi-linear algebraic structures for the compact and unified representation of multiple object trajectories in a reduced-dimension space, and three novel algorithms for the indexing of multiple object trajectories, one for each of the algebraic structures used for multiple interacting object trajectory representation. We finally developed an effective retrieval method that minimizes the retrieval time for queries, and reported experimental results using multiple trajectories from public domain data sets such as the sign language data from KDI [48] and the human motion data from the CAVIAR data set [49].

However, this work did not consider dynamic updates and partial queries. The practical issues related to indexing and retrieval of multiple motion trajectories are extremely challenging and have not yet been thoroughly addressed in the existing research. Specifically, the practical utility of a robust indexing and retrieval system for multiple interacting motion trajectories depends on addressing three fundamental problems: (i) dynamic adding and deleting of entries in multiple motion trajectory databases; (ii) dynamic matching of query and database multiple-trajectory entries with different numbers of trajectories; and (iii) dynamic matching of query and database multiple-trajectory entries with diverse temporal lengths.

4.0.1 Dynamic adding and deleting entries in multiple motion trajectory databases

Dynamic updating of databases requires frequent addition and deletion of one or more entries: an adding operation is needed when new entries arrive, while a deleting operation is needed when existing entries are no longer required. Crucially, performing an adding or deleting operation should not force re-computation of the entire database structure; instead, the new structure should be computed efficiently from the existing one. A robust indexing and retrieval system must therefore include a dynamic scheme for updating its database structures when one or more entries are added or deleted.
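To make the requirement concrete, the following sketch shows how new entries can be folded into an existing thin SVD of a trajectory database matrix without recomputing it from scratch, in the spirit of Brand's incremental SVD [53]. This is an illustrative NumPy sketch (the function name and the one-column-per-entry layout are assumptions, not the exact structures proposed above); deleting entries requires an analogous downdating step.

```python
import numpy as np

def svd_add_entries(U, s, Vt, C):
    """Update the thin SVD U @ diag(s) @ Vt of a database matrix when
    new entry columns C are appended, reusing the existing factors
    instead of recomputing the SVD of the whole database."""
    k, c = s.size, C.shape[1]
    L = U.T @ C                # projection of C onto the current subspace
    H = C - U @ L              # residual of C outside the current subspace
    J, K = np.linalg.qr(H)     # orthonormal basis for the residual
    # Small (k+c) x (k+c) middle matrix that absorbs the new columns
    Q = np.block([[np.diag(s),       L],
                  [np.zeros((c, k)), K]])
    Uq, s_new, Vqt = np.linalg.svd(Q)
    # Rotate the enlarged bases by the factors of the small SVD
    U_new = np.hstack([U, J]) @ Uq
    n = Vt.shape[1]
    W = np.block([[Vt.T,             np.zeros((n, c))],
                  [np.zeros((c, k)), np.eye(c)]])
    V_new = W @ Vqt.T
    return U_new, s_new, V_new.T
```

In practice the updated factors would be truncated back to a fixed rank after each operation, which is what keeps the update cheap relative to a full recomputation.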

4.0.2 Dynamic matching of query and database multiple-trajectory entries with

different numbers of trajectories

The trajectory-number matching problem arises when the number of trajectories in a database entry and the number of trajectories in the query are not equal. Suppose the query contains M′ trajectories and each database entry contains M trajectories.

Figure 26. Dynamic adding and deleting entries in multiple motion trajectory databases: (a) Adding entries in a multiple trajectory database; (b) Deleting entries in a multiple trajectory database. (In both figures, the white portion denotes entries to be kept, and the red portion denotes entries to be added or deleted.)

If M′ > M, the matching problem can be efficiently solved by pre-selecting M trajectories out of the M′ query trajectories as a new query and searching the M-trajectory database for the best match. If M′ < M, the trajectory database structure must be able to adaptively extract an M′-trajectory sub-database from the M-trajectory database, so that the search can be done efficiently within the sub-database.
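A brute-force baseline for the M′ > M case can be sketched as follows. The function name is hypothetical and the sketch assumes equal-length (L, 2) trajectory arrays; for simplicity it pairs the selected query trajectories with database trajectories in order, whereas a full solution would also search over pairings (e.g., via the Hungarian algorithm).

```python
from itertools import combinations
import numpy as np

def best_subset_match(query, database_entry):
    """Select the M query trajectories (out of M' > M) that best match
    a database entry of M trajectories, by exhaustive subset search.
    Each trajectory is an (L, 2) array of (x, y) samples; the matching
    cost is the mean Euclidean distance between paired trajectories."""
    M = len(database_entry)
    best_cost, best_subset = np.inf, None
    for subset in combinations(range(len(query)), M):
        cost = np.mean([np.linalg.norm(query[i] - database_entry[j])
                        for j, i in enumerate(subset)])
        if cost < best_cost:
            best_cost, best_subset = cost, subset
    return best_subset, best_cost
```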

4.0.3 Dynamic matching of query and database multiple-trajectory entries with

diverse temporal lengths

This problem arises when the temporal length of the multiple-trajectory query differs from the temporal length of the multiple trajectories in the database. Figure 28 shows an


Figure 27. Dynamic matching of query and database multiple-trajectory entries with different numbers of trajectories: (a) Example multiple trajectory query with M′ = 3 trajectories. (b) Multiple trajectory database with M = 6 trajectories (red portion is the part of the database to be searched to match the query).

example of such matching, where the multiple trajectory query has temporal length L′ while the database consists of multiple trajectories of length L. One might consider resampling the query trajectories to the same length as the database entries, but this fails for exact partial queries, i.e., queries that are exact segments of the full trajectories stored in the database. If such queries are resampled, they can no longer find any match in the database. Thus a robust indexing and retrieval system must have


a dynamic scheme for generating, from the database of full-temporal-length trajectories, a sub-database consisting of trajectories with partial temporal lengths.
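One simple way to realize such a scheme is to enumerate all temporal segments of the required length from the full-length database trajectories, so that an exact partial query can be matched directly instead of being resampled. A minimal sketch (hypothetical function; trajectories as (L, 2) NumPy arrays):

```python
import numpy as np

def partial_subdatabase(database, L_query):
    """Build a sub-database of all temporal segments of length L_query
    from the full-length database trajectories, tagging each segment
    with the entry index and temporal offset it came from."""
    sub = []
    for idx, traj in enumerate(database):
        for start in range(traj.shape[0] - L_query + 1):
            sub.append((idx, start, traj[start:start + L_query]))
    return sub
```

An exact partial query then finds an identical segment in the sub-database, which resampling the query would have destroyed.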

Figure 28. Dynamic matching of query and database multiple-trajectory entries with diverse temporal lengths: (a) Example multiple trajectory query with temporal length L′. (b) Multiple trajectory database with length L (red portion is the part of the database to be searched to match the query).

Another open topic is that, in our previous work, when matching a query against database entries, we assume that the starting point of the comparison is known; for example, when matching a partial trajectory against a full trajectory, we assume we know where to start the search. In practice, given an unknown partial query, one does not know where to begin the comparison. This problem is a potential direction for future research. Furthermore, since all of the above work is built on simple motion events, how to represent, retrieve, and recognize complex motion events composed of multiple simple motion events remains an open problem.
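As a baseline for the unknown-start-point problem, and not a method developed in this thesis, a partial query can be slid along a full database trajectory and the offset with the smallest distance taken as the estimated start point (hypothetical function; trajectories as (L, 2) NumPy arrays):

```python
import numpy as np

def locate_partial_query(full_traj, partial_query):
    """Estimate where a partial query starts inside a full trajectory
    by scoring every temporal offset with a Euclidean distance."""
    Lq = partial_query.shape[0]
    costs = [np.linalg.norm(full_traj[t:t + Lq] - partial_query)
             for t in range(full_traj.shape[0] - Lq + 1)]
    t_best = int(np.argmin(costs))
    return t_best, float(costs[t_best])
```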


CHAPTER 5

CITED LITERATURE

1. The Google Company (URL: http://www.google.com).

2. Content-based image retrieval, Wikipedia (URL: http://en.wikipedia.org/wiki/CBIR).

3. A. Hauptmann, ”Image/Video Retrieval”, Carnegie Mellon University.

4. The University of California at Irvine Knowledge Discovery in Databases (KDD) archive

(URL: http://kdd.ics.uci.edu).

5. The Context Aware Vision using Image-based Active Recognition (CAVIAR) dataset (URL:

http://homepages.inf.ed.ac.uk/rbf/CAVIAR/).

6. S. F. Chang, W. Chen, H. J. Meng, H. Sundaram and D. Zhong, “A Fully Automated

Content-Based Video Search Engine Supporting Spatiotemporal Queries”, IEEE

Transactions on Circuits and Systems For Video Technology, Vol.8, No.5, 1998.

7. N. AbouGhazaleh, Y. E. Gamal, “Compressed Video Indexing Based on Object’s Motion”,

International Conference on Visual Communication and Image Processing, VCIP’00,

Australia, June 2000.


8. B. Katz, J. Lin, C. Stauffer and E. Grimson, “Answering questions about moving objects

in surveillance videos”, in proceedings of 2003 AAAI Spring Symposium on New

Directions in Question Answering, 2003.

9. N. Rea, R. Dahyot and A. Kokaram, “Semantic Event Detection in Sports Through Motion

Understanding”, in proceedings of Third International Conference on Image and

Video Retrieval (CIVR), 2004.

10. E. Sahouria, A. Zakhor, “A Trajectory Based Video Indexing System For Street Surveil-

lance”, IEEE Int. Conf. on Image Processing, ICIP, 1999.

11. W. Chen and S. F. Chang, “Motion Trajectory Matching of Video Objects”, IS&T/ SPIE,

pp. 544-553, 2000.

12. F. I. Bashir, A. A. Khokhar and D. Schonfeld, “Segmented trajectory based indexing and

retrieval of video data”, in Proceedings of IEEE International Conference on Image

Processing, pp.623-626, 2003.

13. F. I. Bashir, A. A. Khokhar and D. Schonfeld, “Real-Time Motion Trajectory-Based In-

dexing and Retrieval of Video Sequences”, IEEE Transactions on Multimedia, vol.9,

No.1, pp.58-65, 2007.

14. M. K. Shan and S. Y. Lee, “Content-based video retrieval via motion trajectories”, in

Proceedings of SPIE, Electronic Imaging and Multimedia Systems II, Vol. 3561, pp.


52-61, 1998.

15. A. R. Mansouri, A. Mitiche and R. E. Feghali, “Spatio-temporal motion segmentation

via level set partial differential equations”, in Proceedings of 5th IEEE Southwest

Symposium on Image Analysis and Interpretation (SSIAI’02), pp. 243-247, 2002.

16. J. Min and R. Kasturi, “Activity recognition based on multiple motion trajectories”, in

Proceedings of 17th International Conference on Pattern Recognition (ICPR’04),

Vol. 4, pp. 199-202, 2004.

17. T. Zhao and R. Nevatia, “Tracking multiple humans in crowded environment”, IEEE International Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 406-413, 2004.

18. C. Chang, R. Ansari and A. Khokhar, “Multiple Object Tracking with Kernel Particle Filter”, IEEE International Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 566-573, 2005.

19. X. Ma, F. Bashir, A. Khokhar and D. Schonfeld, “Tensor-based Multiple Object Trajectory Indexing and Retrieval”, in proceedings of IEEE International Conference on Multimedia and Expo (ICME), 2006.

20. L. Lathauwer, B. D. Moor and J. Vandewalle, “A multilinear singular value decomposition”,

SIAM Journal on Matrix Analysis and Applications (SIMAX), pp. 1253-1278, 2000.


21. L. D. Lathauwer and B. D. Moor, “From Matrix to Tensor: Multilinear Algebra and Signal Processing”, in proceedings of 4th IMA International Conference on Mathematics in Signal Processing, 1996.

22. R. A. Harshman, “Foundations of the PARAFAC procedure: Model and Conditions for

an “explanatory” multi-mode factor analysis”, UCLA Working Papers in Phonetics,

pp.1-84, 1970.

23. L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, issue 2, pp. 257-286, 1989.

24. T. Starner and A. Pentland, “Real-Time American Sign Language Recognition From Video

Using Hidden Markov Models,” Technical Report 375, MIT Media Lab, Perceptual

Computing Group, 1995.

25. F. I. Bashir, A. A. Khokhar and D. Schonfeld, “Object Trajectory-Based Activity Classifi-

cation and Recognition Using Hidden Markov Models”, IEEE Transactions on Image

Processing, vol. 16, issue 7, pp. 1912-1919, 2007.

26. J. Li, A. Najmi, and R. M. Gray, “Image classification by a two-dimensional hidden Markov model,” IEEE Transactions on Signal Processing, vol. 48, pp. 517-533, 2000.

27. H. C. Lin, L. L. Wang, and S. N. Yang, “Color image retrieval based on hidden Markov

models”, IEEE Transactions on Image Processing, vol. 6, issue 2, pp. 332-339, 1997.


28. C. Raphael, “Automatic Segmentation of Acoustic Musical Signals Using Hidden Markov

Models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21,

no. 4, pp. 360-370, Apr., 1999.

29. A.V. Lukashin and M. Borodovsky, “GeneMark.hmm: new solutions for gene finding,”

Nucleic Acids Research, vol. 26, issue 4, pp. 1107-1115, 1998.

30. L. E. Baum and T. Petrie, “Statistical inference for probabilistic functions of finite state

Markov chains,” Annals of Mathematical Statistics, 37(6): 1554-1563, Dec. 1966.

31. L. E. Baum and J. A. Eagon, “An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology,” Bulletin of the American Mathematical Society, vol. 73, pp. 360-363, 1967.

32. S. S. Kuo and O. E. Agazzi, “Machine vision for keyword spotting using pseudo 2D hidden

Markov models,” in Proceedings of International Conference on Acoustic, Speech and

Signal Processing, vol. 5, pp. 81-84, 1993.

33. C. C. Yen and S. S. Kuo, “Degraded documents recognition using pseudo 2D hidden Markov

models in Gray-scale images,” in Proceedings of SPIE, vol. 2277, pp. 180-191, 1994.

34. M. Brand,“Coupled hidden Markov models for modeling interacting processes”, Technical

Report 405, MIT Media Lab, Perceptual Computing Group, 1997.


35. M. Brand, N. Oliver, and A. Pentland, “Coupled hidden Markov models for complex action

recognition”, IEEE Computer Society Conference on Computer Vision and Pattern

Recognition (CVPR’97), pp. 994, 1997.

36. P. A. Devijver, “Probabilistic labeling in a hidden second order Markov mesh,” Pattern Recognition in Practice II, pp. 113-123, 1985.

37. P. A. Devijver, “Modeling of digital images using hidden Markov mesh random fields,” Signal Processing IV: Theories and Applications (Proc. EUSIPCO-88), pp. 23-28, 1988.

38. E. Levin and R. Pieraccini, “Dynamic planar warping for optical character recognition,” in Proceedings of International Conference on Acoustic, Speech and Signal Processing, vol. 3, pp. 149-152, Mar. 1992.

39. M. Park and D. J. Miller, “Image decoding over noisy channels using minimum mean-squared estimation and a Markov mesh,” in Proceedings of International Conference on Image Processing, vol. 3, pp. 594-597, 1997.

40. J. Owens and A. Hunter, “Application of the self-organising map to trajectory classification,” Third IEEE International Workshop on Visual Surveillance, pp. 77-83, 2000.

41. A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B, 39(1), pp.


1-38, 1977.

42. L. E. Baum, T. Petrie, G. Soules, and N. Weiss, “A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains,” The Annals of Mathematical Statistics, 41(1), pp. 164-171, 1970.

43. D. Schonfeld and N. Bouaynaya, “A new method for multidimensional optimization and its

application in image and video processing,” IEEE Signal Processing Letters, 13, pp.

485-488, 2006.

44. M. K. Shan and S. Y. Lee, “Content-based video retrieval via motion trajectories”, in

Proceedings of SPIE, Electronic Imaging and Multimedia Systems II, Vol. 3561, pp.

52-61, 1998.

45. A. R. Mansouri, A. Mitiche and R. E. Feghali, “Spatio-temporal motion segmentation

via level set partial differential equations”, in Proceedings of 5th IEEE Southwest

Symposium on Image Analysis and Interpretation (SSIAI’02), pp. 243-247, 2002.

46. J. Min and R. Kasturi, “Activity recognition based on multiple motion trajectories”, in

Proceedings of 17th International Conference on Pattern Recognition (ICPR’04),

Vol. 4, pp. 199-202, 2004.

47. X. Ma, F. Bashir, A. A. Khokhar and D. Schonfeld, “Event Analysis Based on Multiple

Interactive Motion Trajectories”, IEEE Trans. on Circuits and Systems for Video


Technology, to appear.

48. The University of California at Irvine Knowledge Discovery in Databases (KDD) archive

(URL: http://kdd.ics.uci.edu).

49. The Context Aware Vision using Image-based Active Recognition (CAVIAR) dataset (URL:

http://homepages.inf.ed.ac.uk/rbf/CAVIAR/).

50. L. Lathauwer, B. D. Moor and J. Vandewalle, “A multilinear singular value decomposition”,

SIAM Journal on Matrix Analysis and Applications (SIMAX), pp. 1253-1278, 2000.

51. A. Levy and M. Lindenbaum, “Sequential Karhunen-Loeve Basis Extraction and its Ap-

plication to Images”, IEEE Transactions on Image Processing, Vol. 9, No. 8, pp.

1371-1374, 2000.

52. P. Hall, D. Marshall and R. Martin, “Merging and Splitting Eigenspace Models”, IEEE

Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 9, pp. 1042-

1049, 2000.

53. M. Brand, “Incremental singular value decomposition of uncertain data with missing values”, Proceedings of the 2002 European Conference on Computer Vision (ECCV’02), Copenhagen, Denmark, 2002.


VITA

Xiang Ma

909 S Aberdeen St, Apt 2F Chicago, IL 60607 (312) 371-8833 [email protected]

http://www.ece.uic.edu/∼mxiang

EDUCATION

Ph.D. in Electrical and Computer Engineering

University of Illinois at Chicago (UIC), Chicago, IL, May 2009 (expected)

GPA: 3.71/4.0

Bachelor of Science in Electrical Engineering

University of Science and Technology of China (USTC), Hefei, China, July 2005

GPA: 3.69/4.0

EXPERIENCES

Research Assistant 08/2005- Present

Multimedia Communications Lab, University of Illinois at Chicago, Chicago, IL, USA

• Project: Multiple projects on image/video processing, content understanding,

indexing, retrieval and recognition

Developed and implemented various algorithms for image/video processing and computer

vision. Details can be found at personal website.


Summer Research Intern 06/2008-08/2008

Multimedia and Information Systems Lab, Knowledge Media Institute (KMi), Milton Keynes,

United Kingdom

• Project: Fast camera motion estimation and detection based on motion vec-

tors extracted from MPEG videos.

Developed and implemented a novel system on Linux platform to detect camera motion

in MPEG videos.

Full-time Research Consultant 06/2008-08/2008

The Open University, Milton Keynes, United Kingdom

• Project: Multimedia Visual Search Engine, Image/Video Indexing, Retrieval

and Annotation, TRECVID participation

Provided technical consultancy to around 90 researchers in the area of multimedia information retrieval and image/video processing; conducted technical meetings and discussions.

Summer Intern 06/2004-08/2004

Bio-Medical Engineering Lab, USTC, Hefei, China

• Project: Design and implementation of RSAPred, an on-line web service plat-

form for statistical analysis of protein data.


Designed and developed the on-line web service system, including the web GUI and the

server programs.

AWARDS, ACTIVITY AND HONORS

• UIC Chancellor’s Student Leadership and Service Award 08’, University of Illinois at Chicago,

04/2008

• Vice President, Chinese Students and Scholars Association, University of Illinois at Chicago,

Chicago, 2007-2009

• Graduate Student Travel Award 07’, 06’, Graduate College and Graduate Student Council,

University of Illinois at Chicago, 10/2007, 08/2006

• Best Bachelor’s Thesis Award, USTC, 07/2005

• Outstanding Student Scholarship, USTC, 2001-2004

PUBLICATIONS

Journal papers

-MA Xiang, WANG Minghui, LI Ao, XIE Dan, FENG Huan-qing, “Prediction of Protein Subcellular Locations With a Weighted Fuzzy K-NN Algorithm”, Chinese Journal of Biomedical Engineering (in Chinese), Vol. 25, No. 1, pp. 106-109, 2006.

-Xiang Ma, Faisal Bashir, Ashfaq Khokhar and Dan Schonfeld, “Event Analysis Based on

Multiple Interactive Motion Trajectories”, IEEE Trans. on Circuits and Systems for Video

Technology (T-CSVT), accepted.


-Xiang Ma, Dan Schonfeld and Ashfaq Khokhar , “Video Event Classification and Image

Segmentation Based on Non-Causal Multi-Dimensional Hidden Markov Models”, IEEE Trans.

on Image Processing (T-IP), in revision.

-Xiang Ma, Dan Schonfeld and Ashfaq Khokhar , “Dynamic Updating and Downdating

Matrix SVD and Tensor HOSVD for Adaptive Indexing and Retrieval of Motion Trajectories”,

IEEE Trans. on Image Processing (T-IP), in submission.

Conference papers

-Xiang Ma, Faisal Bashir, Ashfaq Khokhar and Dan Schonfeld, “Tensor-Based Multiple Object Trajectory Indexing and Retrieval”, IEEE International Conference on Multimedia and Expo (ICME 06’), Toronto, Ontario, Canada, 2006.

-Xiang Ma, Ashfaq Khokhar and Dan Schonfeld, “A General Two-Dimensional Hidden Markov Model and Its Application to Image Classification”, IEEE International Conference on Image Processing (ICIP 07’), San Antonio, Texas, 2007.

-Xiang Ma, Dan Schonfeld and Ashfaq Khokhar, “Distributed Multidimensional Hidden Markov Model: Theory and Application in Multiple-Object Trajectory Classification and Recognition”, SPIE Conference on Multimedia Content Access: Algorithms and Systems, San Jose, California, 2008.

-Xiang Ma, Dan Schonfeld and Ashfaq Khokhar, “Image Segmentation and Classification Based on a 2D Distributed Hidden Markov Model”, SPIE Conference on Visual Communications and Image Processing (VCIP 08’), San Jose, California, 2008.


-Xiang Ma, Dan Schonfeld and Ashfaq Khokhar, “Distributed Multi-dimensional Hidden Markov Models for Image and Trajectory-Based Video Classification”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 08’), Las Vegas, Nevada, 2008.

-Xiang Ma, Dan Schonfeld and Ashfaq Khokhar, “Dynamic Updating and Downdating Matrix SVD and Tensor HOSVD for Adaptive Indexing and Retrieval of Motion Trajectories”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 09’), submitted.