Extracting features from spatio-temporal volumes (STVs) for activity recognition

  • Extracting features from spatio-temporal volumes (STVs) for activity recognition. Dheeraj Singaraju. Reading group: 06/29/06.

  • Motivation for dealing with STVs: Optical flow based methods are able to capture only first-order motion. Methods that use HMMs deal with single-point trajectories that carry only motion information and no spatial information.

    We aim at a direct scheme for event detection and classification that does not require feature tracking, segmentation or computation of optical flow. We want to detect points in the space-time volume that have significant local variation in both space and time.

  • Approaches that we shall discuss: "On Space-Time Interest Points", Ivan Laptev. Local image features provide compact and abstract representations of images, e.g. corners. The idea is to extend the concept of a spatial corner detector to a spatio-temporal corner detector.

    "Actions as Objects: A Novel Action Representation", Alper Yilmaz and Mubarak Shah. Concepts of differential geometry are used to extract features from the STV based on local variations in the curvatures of points on the volume; the curvatures show invariance to rotation and translation.

  • Detecting interest points in space: An image $f$ can be modeled by its linear scale-space representation $L(x, y; \sigma^2) = g(x, y; \sigma^2) * f(x, y)$, where $g$ is a Gaussian kernel of variance $\sigma^2$.

    To look for interest points, one analyzes the second-moment matrix $\mu = g(\cdot; \sigma_i^2) * \begin{pmatrix} L_x^2 & L_x L_y \\ L_x L_y & L_y^2 \end{pmatrix}$, i.e. the familiar Harris matrix of Gaussian-smoothed products of image derivatives.

  • Detecting interest points in space (contd.): We want to choose corners in the image since they have significant spatial variation. We therefore detect positive maxima of the Harris corner function $H = \det(\mu) - k\,\operatorname{trace}^2(\mu)$ (a minimal sketch is given below).

    How do we detect interest points in space-time?
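
Before moving to space-time, here is a minimal sketch of the spatial detector described above, assuming only NumPy/SciPy; the parameter values (sigma, k) are illustrative choices rather than the paper's settings.

```python
import numpy as np
from scipy import ndimage

def harris_response(image, sigma=1.0, integration_sigma=2.0, k=0.04):
    """Positive maxima of H = det(mu) - k * trace(mu)^2 mark corner candidates."""
    L = ndimage.gaussian_filter(image.astype(float), sigma)
    Lx = ndimage.sobel(L, axis=1)
    Ly = ndimage.sobel(L, axis=0)
    # Entries of the second-moment matrix, averaged over a Gaussian window.
    Sxx = ndimage.gaussian_filter(Lx * Lx, integration_sigma)
    Syy = ndimage.gaussian_filter(Ly * Ly, integration_sigma)
    Sxy = ndimage.gaussian_filter(Lx * Ly, integration_sigma)
    return (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2
```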

  • Results of detecting interest points in space: Detecting interest points in space alone also gives interest points in the stationary background. We want to find interest points that carry information in the spatial as well as the temporal domain.

  • Detecting interest points in space-time: A spatio-temporal image sequence $f(x, y, t)$ can be modeled by its linear scale-space representation $L(x, y, t; \sigma^2, \tau^2) = g(x, y, t; \sigma^2, \tau^2) * f(x, y, t)$.

    Note that there are separate scale parameters for the spatial and the temporal domain, i.e. $\sigma^2$ and $\tau^2$ respectively.

  • Detecting interest points in space-time (contd.): To look for interest points, one analyzes the $3 \times 3$ second-moment matrix $\mu = g(\cdot; \sigma_i^2, \tau_i^2) * \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{pmatrix}$.

    We therefore look for the positive maxima of the extended spatio-temporal corner function $H = \det(\mu) - k\,\operatorname{trace}^3(\mu)$.
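
A minimal sketch of this extended corner function, assuming the video is a NumPy array of shape (frames, height, width); the scale values, the integration factor s and k = 0.005 are illustrative defaults rather than the paper's exact settings.

```python
import numpy as np
from scipy import ndimage

def spacetime_harris(video, sigma=2.0, tau=2.0, s=2.0, k=0.005):
    """H = det(mu) - k * trace(mu)^3 over the 3x3 space-time second-moment matrix."""
    L = ndimage.gaussian_filter(video.astype(float), sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)                      # derivatives along t, y, x
    w = (s * tau, s * sigma, s * sigma)              # integration scales tied to the local scales
    smooth = lambda a: ndimage.gaussian_filter(a, w)
    xx, yy, tt = smooth(Lx * Lx), smooth(Ly * Ly), smooth(Lt * Lt)
    xy, xt, yt = smooth(Lx * Ly), smooth(Lx * Lt), smooth(Ly * Lt)
    det = xx * (yy * tt - yt ** 2) - xy * (xy * tt - yt * xt) + xt * (xy * yt - yy * xt)
    return det - k * (xx + yy + tt) ** 3
```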

  • Results of detecting interest points in the STV: Consider a synthetic sequence of a ball moving towards a wall and colliding with it.

    An interest point is detected at the collision point

  • Results of detecting interest points in the STV: Consider a synthetic sequence of two balls moving towards each other.

    Different interest points are detected at different spatial and temporal scales (the figure shows detections at a coarser scale).

  • Effects of scales on interest point detection: Long temporal events are detected for large values of $\tau$, while short events are detected for small values of $\tau$. Spatially large events are detected for large values of $\sigma$, while spatially small events are detected for small values of $\sigma$.

  • Scale selection in space-time: We consider a prototype event modeled by a spatio-temporal Gaussian blob $f(x, y, t) = g(x, y, t; \sigma_0^2, \tau_0^2)$.

    The scale-space representation of $f$ is hence given by $L(\cdot; \sigma^2, \tau^2) = g(\cdot; \sigma^2, \tau^2) * f = g(\cdot; \sigma^2 + \sigma_0^2, \tau^2 + \tau_0^2)$, since convolving two Gaussians adds their variances.

  • Scale selection in space-time (contd.): We want to find a differential operator that assumes simultaneous extrema over the spatial and temporal scales that are characteristic of this Gaussian prototype event.

    To recover the spatio-temporal extent of $f$, we consider second-order derivatives of $L$ normalized by the scales as $L_{xx,norm} = \sigma^{2a}\tau^{2b} L_{xx}$, $L_{yy,norm} = \sigma^{2a}\tau^{2b} L_{yy}$ and $L_{tt,norm} = \sigma^{2c}\tau^{2d} L_{tt}$.

    By requiring that the above normalized 2nd-order derivatives assume maxima at the scales $\sigma^2 = \sigma_0^2$ and $\tau^2 = \tau_0^2$, we get $a = 1$, $b = 1/4$, $c = 1/2$ and $d = 3/4$.

  • Scale selection in space-time (contd.): We therefore define a normalized spatio-temporal Laplace operator as $\nabla^2_{norm} L = L_{xx,norm} + L_{yy,norm} + L_{tt,norm} = \sigma^2 \tau^{1/2}(L_{xx} + L_{yy}) + \sigma\tau^{3/2} L_{tt}$.

    The corresponding plots show that the zero crossings correspond to the maxima detected at $\sigma^2 = \sigma_0^2$ and $\tau^2 = \tau_0^2$.
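
A small numerical check of this scale-selection argument, assuming the normalized Laplacian form given above; the blob parameters and the scale grid are arbitrary choices for illustration.

```python
import numpy as np
from scipy import ndimage

def normalized_laplacian(blob, sigma, tau):
    """sigma^2 tau^(1/2) (Lxx + Lyy) + sigma tau^(3/2) Ltt, evaluated at the blob centre."""
    L = ndimage.gaussian_filter(blob, sigma=(tau, sigma, sigma))
    c = tuple(n // 2 for n in blob.shape)
    Ltt, Lyy, Lxx = [np.gradient(np.gradient(L, axis=a), axis=a)[c] for a in (0, 1, 2)]
    return sigma ** 2 * tau ** 0.5 * (Lxx + Lyy) + sigma * tau ** 1.5 * Ltt

# Prototype event: a spatio-temporal Gaussian blob with sigma0 = 4, tau0 = 6.
t, y, x = np.mgrid[-32:33, -32:33, -32:33]
blob = np.exp(-(x ** 2 + y ** 2) / (2 * 4.0 ** 2) - t ** 2 / (2 * 6.0 ** 2))

responses = {(s, u): abs(normalized_laplacian(blob, s, u))
             for s in (2.0, 4.0, 8.0) for u in (3.0, 6.0, 12.0)}
print(max(responses, key=responses.get))   # expected to be near (4.0, 6.0)
```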

  • Scale-adapted space-time interest points: So far we have found events that are local extrema in the space-time volume for a particular choice of spatial and temporal scales.

    We would like to detect interest points that are extrema over the space-time volume as well as extrema of the scale-normalized Laplace operator over the scales.

    The reason for doing so is that different events would in general have different spatial and temporal extents

  • Algorithm for detecting interest points
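
A rough sketch of what such a scale-adapted detection loop could look like, reusing the spacetime_harris() sketch above; the scale grid, the threshold and the stopping rule are assumptions rather than the algorithm's exact settings.

```python
import numpy as np
from scipy import ndimage

def local_maxima(volume, threshold):
    """Indices of local maxima of a 3D response volume above a threshold."""
    peaks = volume == ndimage.maximum_filter(volume, size=3)
    return np.argwhere(peaks & (volume > threshold))

def detect_scale_adapted(video, sigmas=(1.0, 2.0, 4.0), taus=(1.0, 2.0, 4.0), threshold=1e-9):
    points = []
    for sigma in sigmas:
        for tau in taus:
            H = spacetime_harris(video, sigma=sigma, tau=tau)
            for (t, y, x) in local_maxima(H, threshold):
                # In the full algorithm the scales of each point would now be
                # re-estimated by maximizing the scale-normalized Laplacian, the
                # position re-detected at the new scales, and so on until convergence.
                points.append((t, y, x, sigma, tau))
    return points
```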

  • Results on a previously used synthetic example: Note that all the extrema are detected irrespective of their spatial and temporal extents. DOUBT: Why are these points not detected as interest points?

  • Results of the algorithm on real sequences: Note that events of all spatial and temporal extents are captured.

    The size of the circle shows the spatial extent of the event

  • Results of interest point detection: Note that the regularity and extent of the spatio-temporal interest points are actually representative of the true events in time.

  • Classification of events: Every interest point is described by its local spatio-temporal neighborhood, and we compare the neighborhoods of events in order to classify them.

    The neighborhood of an interest point is characterized by event descriptors built from scale-normalized derivative responses evaluated at the detected scales (one possible form is sketched below).

    This normalization guarantees the invariance of the derivative response to image scaling
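
A hedged sketch of one possible descriptor of this kind: a local jet of spatio-temporal Gaussian derivatives, each scaled by powers of the detected scales. The exact derivative orders and normalization used in the paper may differ.

```python
import numpy as np
from scipy import ndimage

def jet_descriptor(video, t, y, x, sigma, tau, max_order=2):
    """Collect derivatives d^m_x d^n_y d^p_t L at (t, y, x), each scaled by sigma^(m+n) tau^p."""
    descriptor = []
    for m in range(max_order + 1):
        for n in range(max_order + 1 - m):
            for p in range(max_order + 1 - m - n):
                if m + n + p == 0:
                    continue                      # skip the zeroth-order term
                D = ndimage.gaussian_filter(video.astype(float),
                                            sigma=(tau, sigma, sigma),
                                            order=(p, n, m))
                descriptor.append(sigma ** (m + n) * tau ** p * D[t, y, x])
    return np.array(descriptor)
```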

  • Classification of events (contd.): To compare two events, we compute the Mahalanobis distance between their descriptors as $d^2(j_1, j_2) = (j_1 - j_2)^\top \Sigma^{-1} (j_1 - j_2)$, where $\Sigma$ is a covariance matrix estimated from training data.

    To detect similar events in the given data, we apply k-means clustering to the event descriptors and thus detect groups of interest points with similar spatio-temporal neighbourhoods

    Once the cluster centers are estimated from the training data, we evaluate the distance of a new event from the cluster centers. If the distance from all the centers is above a threshold, we declare it a background event.
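
A minimal sketch of this classification step, assuming descriptors stacked row-wise in a NumPy array; scikit-learn's KMeans stands in for the clustering and works in Euclidean space (the descriptors could be whitened with the covariance first to match the Mahalanobis metric), and the threshold value is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_event_clusters(train_descriptors, n_clusters=4):
    cov_inv = np.linalg.pinv(np.cov(train_descriptors, rowvar=False))   # Mahalanobis metric
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(train_descriptors)
    return kmeans.cluster_centers_, cov_inv

def classify_event(descriptor, centers, cov_inv, background_threshold=5.0):
    """Assign to the nearest cluster in Mahalanobis distance, or to the background class."""
    diffs = centers - descriptor
    d2 = np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs)   # squared Mahalanobis distances
    best = int(np.argmin(d2))
    return best if d2[best] < background_threshold ** 2 else -1   # -1 = background event
```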

  • Results of classification

  • Recognizing gaits: From the spatio-temporal volume we extract the following features: the positions of the interest points, the corresponding spatial and temporal scales, and the class of each interest point.

    We introduce a state for the model, determined by a vector X whose variables are:

    the position of the person in the image, his/her size, the frequency of the gait, the phase of the gait at the current moment, and the temporal variations of these quantities.
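
A hypothetical sketch of these data features and of the model state, just to make the bookkeeping concrete; the field names are illustrative, not the paper's notation.

```python
from dataclasses import dataclass

@dataclass
class Feature:           # one detected space-time interest point
    x: float             # image position
    y: float
    t: float             # time of occurrence
    sigma: float         # spatial scale
    tau: float           # temporal scale
    label: int           # class of the interest point (cluster index)

@dataclass
class ModelState:        # state vector X of the walking model
    x: float             # position of the person in the image
    y: float
    size: float          # his/her size
    frequency: float     # frequency of the gait
    phase: float         # phase of the gait at the current moment
    vx: float = 0.0      # temporal variations (velocities) of the above
    vy: float = 0.0
```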

  • Recognizing gaits (contd.)

    We then have the following model for walking

    Such a model helps handle translations as well as uniform rescaling in the image and the temporal domain

  • Recognizing gaits (contd.): Given a model state X, a current time $t_0$, a time window of length $\Delta t$, and the set of data features detected within that recent window, the match between the model and the data is defined by a weighted sum of distances h between the model features and the data features.

    Each model feature is paired with the data feature that minimizes the distance h, and the weights are given by an exponential function with a fixed variance.

  • Recognizing gaits (contd.): To find the best match between the model and the data, we search for the model state X that minimizes this matching cost.
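
A hedged sketch of that search, reusing the Feature fields from the sketch above; the distance h, the exponential temporal weighting and the candidate enumeration are assumptions about the general scheme, not the paper's exact formulation.

```python
import math

def feature_distance(fm, fd):
    # Illustrative distance h between a model feature and a data feature.
    return math.hypot(fm.x - fd.x, fm.y - fd.y) + abs(fm.t - fd.t)

def match_cost(model_features, data_features, t0, var):
    """Weighted sum, over model features, of the distance to the closest data feature."""
    cost = 0.0
    for fm in model_features:
        h_min = min(feature_distance(fm, fd) for fd in data_features)
        cost += math.exp(-(t0 - fm.t) ** 2 / (2 * var)) * h_min
    return cost

def best_model_state(candidates, data_features, t0, var):
    """candidates: iterable of (state X, features predicted by X) pairs; returns the best X."""
    return min(candidates, key=lambda c: match_cost(c[1], data_features, t0, var))[0]
```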

  • Summary of the approach: An interest point detector is developed that finds local image features showing high variation of the image values in space and in time. The spatio-temporal extents of the detected events can be estimated using a normalized Laplacian operator. The neighborhoods of the events are described using scale-invariant spatio-temporal descriptors. Different actions are then compared by checking for matches between the event descriptors.

  • Actions as objects: action sketches. This method analyzes the spatio-temporal volume by using differential geometric surface properties such as peaks, pits, valleys and ridges.

    The authors claim that these are important action descriptors as they capture both spatial and temporal properties

    These descriptors are related to the convex and concave parts of the object contours and/or to the maxima in the spatio-temporal curvature of a trajectory, and are hence view invariant.

  • STV: a collection of contours. In this approach the spatio-temporal volume is a hollow object whose boundary is defined by the contours of the person in every image frame. It is assumed that the STV can be considered a manifold, which allows us to treat small neighborhoods around a point as nearly flat.

    Since the STV is really the time evolution of a contour, we can define a 2D parametric representation by considering arc length s of the contour and time t.

  • STV: a collection of contours (contd.). (Figure: curves on the STV with t varying and s fixed, and with s varying and t fixed.) The STV is a continuous representation in the normalized time scale, and it does not require any time warping for matching two sequences of different lengths.
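
A minimal sketch of how such a parametric surface S(s, t) could be assembled from per-frame contours, assuming each contour is an ordered (N, 2) array of boundary points; the resampling scheme is an illustrative choice.

```python
import numpy as np

def resample_contour(contour, n_samples=100):
    """Resample a closed contour to n_samples points uniformly spaced in arc length s."""
    closed = np.vstack([contour, contour[:1]])
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    arclen = np.concatenate([[0.0], np.cumsum(seg)])
    s_new = np.linspace(0.0, arclen[-1], n_samples, endpoint=False)
    x = np.interp(s_new, arclen, closed[:, 0])
    y = np.interp(s_new, arclen, closed[:, 1])
    return np.stack([x, y], axis=1)

def build_stv(contours, n_samples=100):
    """Stack the resampled contours over (normalized) time: returns S(s, t) as a (T, n_samples, 2) array."""
    return np.stack([resample_contour(c, n_samples) for c in contours])
```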

  • Action descriptors: We want to compute action descriptors that correspond to changes in the direction, speed and shape of parts of the contour.

    Changes in these quantities are reflected on the surface of the STV and can be computed using differential geometry by identifying different landmarks.

    These landmarks can be classified on the basis of the local curvatures at points on the STV.

  • Action descriptors (contd.): Differential geometry gives us the Gaussian curvature K and the mean curvature H, which can be evaluated at points on the manifold of the STV. These curvatures are invariant to rigid transformations such as translation and rotation.

    Local extrema of these curvatures can therefore be used to identify interest points for describing actions

  • Action descriptors (contd.): The different surface types are characterized by the signs of the associated curvatures K and H, as in the classification sketched below.
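
A hedged sketch of computing K and H on the STV surface S(s, t) = (x(s, t), y(s, t), t) via the first and second fundamental forms, together with a sign-based surface labelling; the sign conventions depend on the choice of surface normal, and the labelling is the usual K/H classification rather than a table taken from the paper.

```python
import numpy as np

def surface_curvatures(stv):
    """stv: (T, S, 2) array from build_stv(); returns Gaussian K and mean H, each of shape (T, S)."""
    T, S, _ = stv.shape
    t_grid = np.broadcast_to(np.arange(T, dtype=float)[:, None], (T, S))
    X = np.stack([stv[..., 0], stv[..., 1], t_grid], axis=-1)   # embed the surface in (x, y, t)
    Xt, Xs = np.gradient(X, axis=(0, 1))                        # partial derivatives wrt t and s
    n = np.cross(Xs, Xt)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-12      # unit surface normal
    Xtt, Xts = np.gradient(Xt, axis=(0, 1))
    _, Xss = np.gradient(Xs, axis=(0, 1))
    E, F, G = (Xs * Xs).sum(-1), (Xs * Xt).sum(-1), (Xt * Xt).sum(-1)   # first fundamental form
    e, f, g = (Xss * n).sum(-1), (Xts * n).sum(-1), (Xtt * n).sum(-1)   # second fundamental form
    denom = E * G - F ** 2 + 1e-12
    K = (e * g - f ** 2) / denom
    H = (e * G - 2 * f * F + g * E) / (2 * denom)
    return K, H

def surface_type(K, H, eps=1e-6):
    """Peak / pit / ridge / valley / saddle labels from the signs of K and H at one point."""
    if K > eps:
        return "peak" if H < 0 else "pit"
    if K < -eps:
        return "saddle ridge" if H < 0 else "saddle valley"
    return "flat" if abs(H) <= eps else ("ridge" if H < 0 else "valley")
```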

  • Analysis of action descriptors: We consider three types of contours: concave contours, convex contours and straight contours.

    These contours generate typical landmarks in the spatio-temporal volume. Straight contour: ridge, valley or flat surface. Convex contour: peak, ridge or saddle ridge. Concave contour: pit, valley or saddle valley. (Figure: shapes generated from straight contours.)

  • STVs corresponding to hand motion: The STV generated by a hand staying stable. Such a motion (or lack of it) creates a ridge.

  • STVs corresponding to hand motion: The STV created by a hand that first moves downwards and then upwards. Note that a saddle ridge is created at the point where the motion changes.

  • Properties of the event descriptors: The landmarks discussed so far are essentially produced by stable motion or by changes in stable motion. The stability of motion ensures that the STV is smooth enough that one can consider valid local planar neighborhoods at its points.

    Some of the landmarks are related to the curvature of the point trajectories and body contours as follows

  • View invariance of event descriptors: Since the landmarks are associated with extrema of local curvatures, even when the view changes the transformed landmarks remain extrema in the new STV.

    DOUBT: Not very confident about the derivation of the above

    Due to this view invariance, comparing two STVs is equivalent to checking whether there is a valid fundamental matrix relating the sets of event descriptors in the two given action volumes. (The paper derives a formula relating the curvatures of corresponding points in two different views.)

  • Comparing two actions: We check whether a linear system of the form $A\mathbf{f} = 0$, built from the epipolar constraints between corresponding event descriptors, is satisfied by the descriptors of both actions.

    This boils down to checking if the last singular value of A is 0. From a set of possible matches between the input action sketch and the known action sketches, we select the action with the minimum matching score
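
A hedged sketch of this matching score: stack the epipolar constraints from corresponding action-sketch points into A f = 0 and use the smallest singular value of A as the score (it would be 0 if a fundamental matrix fits exactly); the lack of coordinate normalization and the exact scoring rule are simplifying assumptions.

```python
import numpy as np

def matching_score(pts1, pts2):
    """pts1, pts2: (N, 2) arrays of corresponding image points, N >= 8 for a meaningful check."""
    x1, y1 = pts1[:, 0], pts1[:, 1]
    x2, y2 = pts2[:, 0], pts2[:, 1]
    ones = np.ones(len(pts1))
    A = np.stack([x2 * x1, x2 * y1, x2, y2 * x1, y2 * y1, y2, x1, y1, ones], axis=1)
    return np.linalg.svd(A, compute_uv=False)[-1]     # last (smallest) singular value of A

def recognize(input_pts, known_actions):
    """known_actions: dict mapping action name -> (N, 2) array of corresponding sketch points."""
    return min(known_actions, key=lambda name: matching_score(input_pts, known_actions[name]))
```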

  • Summary of the approach: Using concepts of differential geometry, extract interest points (action sketches) that carry local spatio-temporal information by virtue of being local extrema of the curvatures in space-time.

    These event descriptors are associated with uniform motion or stable changes in uniform motion

    Since the action sketches are view invariant, comparing 2 actions is equivalent to checking if there is a valid Fundamental Matrix relating the positions of the action sketches for the individual actions.