Real-time Action Recognition by Spatiotemporal Semantic and Structural Forest

Tsz-Ho Yu, Tae-Kyun Kim and Roberto Cipolla

Machine Intelligence Laboratory, Engineering Department, University of Cambridge

Introduction and Motivations

• A novel real-time solution for action recognition that utilises local appearance and structural information.

High run-time performances

Local appearance + structural information

Short response time

Real-time feature extraction and classification

Continuous / frame-by-frame recognition

Pyramidal spatiotemporal relationship match (PSRM)

Main features / major contributions:

Main objective: efficiency

A short demo

Please visit: “http://www.youtube.com/watch?v=eD5b8d7hV6E” on the Internet for the full demo video.


Related Work

• Many current methods focus on: [Schuldt et al. ICPR2004, Niebles et al. BMVC06, Ryoo and Aggarwal ICCV09, Willems BMVC09, Riemenschneider et al. BMVC09]

• Some achieve high accuracies, but take a long time to recognise

• How can we improve efficiency?

• Can we improve codebook learning and feature matching?

“Bag of words” model

Sophisticated spatiotemporal features

Learned classifier

K-means codebook

Accuracy

Action representation model (Feature design)

Related Work

• Vector quantisation by random forest [Moosmann et al. NIPS06]

• For image segmentation [Shotton et al. CVPR08]

• Can we apply it in video analysis?

• Pyramid match kernel [Grauman and Darrell ICCV05]

• Image recognition [Grauman and Darrell ICCV05], scene classification [Lazebnik et al. CVPR06], etc.

• Spatiotemporal relationship match [Ryoo and Aggarwal ICCV09]

S. Lazebnik, C. Schmid and J. Ponce. “Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories” CVPR 2006

K. Grauman and T. Darrell. “The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features” ICCV 2005

F. Moosmann, B. Triggs and F. Jurie. “Fast discriminative visual codebooks using randomized clustering forests” NIPS 2006

J. Shotton, M. Johnson and R. Cipolla. “Semantic texton forests for image categorization and segmentation” CVPR 2008

M. S. Ryoo and J. K. Aggarwal. “Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities” ICCV 2009

Grauman and Darrell ICCV05

Moosmann et al. NIPS2006

Ryoo and Aggarwal ICCV09

Our Contributions

• Our contribution is three-fold:

Efficient codebook learning

High run-time performance

Local appearance + structural information

SRM → PSRM: pyramidal spatiotemporal relationship match

1. V-FAST corner detector

2. Random forest classifiers

3. Continuous action recognition

Spatiotemporal Texton Forest

Image segmentation (2D) → Action recognition (3D)

Typical Approaches

Feature Encoding

Feature Matching

K-means Clustering

Slow for Large Codebook

The “Bag of Words” (BOW) Model

Lacks Structural Information

Quantisation Error

Our Method

Semantic Texton Forest

Efficient

PSRM

Structural Information

Hierarchical Matching

Robust

Comparison with existing approaches

Overview

Pipeline: Feature detection (V-FAST corner) → Feature extraction (spatio-temporal cuboids, spatiotemporal semantic texton forest) → Feature matching (PSRM) → Classification (k-means forest, BOST random forest classifier) → Results

Feature detection


V-FAST: Spatiotemporal Feature Detection

• A novel spatiotemporal interest point detector

• Inspired by FAST [Rosten and Drummond ECCV2006]

• A cascade of three FAST detectors.

• Considers three orthogonal Bresenham circles

• Features:

• Very fast!

E. Rosten and T. Drummond. “Machine learning for high-speed corner detection” ECCV 2006
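
The cascade of FAST-style tests on three orthogonal Bresenham circles can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' formulation: the radius-3, 16-pixel circle is the one used by FAST, while the threshold `t`, the arc length `n`, the axis order `video[f, y, x]`, and the rule that a keypoint needs saliency on the XY plane and on at least one of the XT or YT planes are all assumptions made for the example.

```python
import numpy as np

# 16 offsets of a radius-3 Bresenham circle, listed in circular order (as in FAST).
CIRCLE = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2), (1, -3),
          (0, -3), (-1, -3), (-2, -2), (-3, -1), (-3, 0), (-3, 1), (-2, 2), (-1, 3)]


def has_arc(flags, n):
    """True if the circular boolean sequence contains at least n contiguous True values."""
    flags = list(flags)
    if n > len(flags):
        return False
    run = 0
    for f in flags + flags:  # walk the ring twice so runs can wrap around
        run = run + 1 if f else 0
        if run >= n:
            return True
    return False


def segment_test(ring, centre, t, n):
    """FAST segment test: n contiguous ring pixels all brighter or all darker than the centre."""
    ring = np.asarray(ring, dtype=float)
    return has_arc(ring > centre + t, n) or has_arc(ring < centre - t, n)


def vfast_is_keypoint(video, x, y, f, t=20.0, n=9):
    """Cascade of three segment tests on the XY, XT and YT planes (assumed combination rule)."""
    c = float(video[f, y, x])
    xy = [video[f, y + b, x + a] for a, b in CIRCLE]
    if not segment_test(xy, c, t, n):      # cheap early exit: no spatial saliency
        return False
    xt = [video[f + b, y, x + a] for a, b in CIRCLE]
    yt = [video[f + b, y + a, x] for a, b in CIRCLE]
    return segment_test(xt, c, t, n) or segment_test(yt, c, t, n)
```

With these assumptions, a bright voxel that appears in a single frame is reported as a keypoint, while a bright pixel that stays constant over time is rejected because neither temporal plane passes the segment test.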

Feature extraction


Building a codebook using STF

• Extract small video cuboids at detected keypoints

• Visual codebook using STF:

• Efficient visual codebook

• One feature → multiple codewords.

• Quantisation and partial matching

Random forest based codebook

• Work on pixels directly

• Hierarchical splits

“Textonises” patches recursively
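
The random-forest codebook described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the split function (difference of two voxel intensities compared with a threshold), the randomly drawn split parameters, and the names `RandomTree` and `textonise` are hypothetical. In a trained forest the split parameters would be learned, for example with the information-gain criterion mentioned in the extra slides.

```python
import numpy as np


class RandomTree:
    """Complete binary tree stored in heap order: children of node i are 2i+1 and 2i+2."""

    def __init__(self, cuboid_size, depth, rng):
        n_internal = 2 ** depth - 1
        self.depth = depth
        # Assumed split test: v[p1] - v[p2] > thresh, applied to raw voxel values.
        self.p1 = rng.integers(0, cuboid_size, n_internal)
        self.p2 = rng.integers(0, cuboid_size, n_internal)
        self.thresh = rng.uniform(-50.0, 50.0, n_internal)

    def path(self, cuboid):
        """Return the node ids visited from the root to a leaf for one video cuboid."""
        v = np.asarray(cuboid, dtype=float).ravel()
        node, visited = 0, [0]
        for _ in range(self.depth):
            go_right = v[self.p1[node]] - v[self.p2[node]] > self.thresh[node]
            node = 2 * node + 1 + int(go_right)
            visited.append(node)
        return visited


def textonise(forest, cuboid):
    """One codeword (tree index, leaf id) per tree, so one feature maps to several codewords."""
    return [(i, tree.path(cuboid)[-1]) for i, tree in enumerate(forest)]
```

The full root-to-leaf path is kept because the hierarchy of nodes is what a tree-based pyramid match can later merge, from leaves up to their parents.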

Feature matching

Pyramidal Spatiotemporal Relationship Match (PSRM)

PSRM: a multi-codeword, multi-resolution SRM

• Old method: SRM [Ryoo and Aggarwal ICCV09]

• PSRM: a multi-codebook, multi-resolution version.

Natural combination: local appearance + action structure

Evaluate each pair of codewords using a set of association rules.

A set of “rules” (in different colours) is designed to describe the spatiotemporal structure of features.
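
Evaluating every pair of codewords against an association rule amounts to filling a codeword-by-codeword relationship histogram. The sketch below assumes a single hypothetical rule, `rule_near_before` (feature A occurs shortly before feature B and close to it in space), with made-up distance limits; the rules actually used in PSRM are not listed on the slides.

```python
import numpy as np
from itertools import permutations


def rule_near_before(a, b, d_space=10.0, d_time=5):
    """Hypothetical rule: a = (x, y, t) happens shortly before b and is spatially close to it."""
    (xa, ya, ta), (xb, yb, tb) = a, b
    return bool(0 < tb - ta <= d_time and np.hypot(xa - xb, ya - yb) <= d_space)


def relationship_histogram(points, words, n_words, rule):
    """Count ordered codeword pairs (w_i, w_j) whose feature positions satisfy the rule."""
    hist = np.zeros((n_words, n_words), dtype=int)
    for i, j in permutations(range(len(points)), 2):
        if rule(points[i], points[j]):
            hist[words[i], words[j]] += 1
    return hist


# Three features given as (x, y, t), with their codeword indices.
points = [(0, 0, 0), (1, 1, 2), (100, 100, 3)]
words = [0, 1, 1]
hist = relationship_histogram(points, words, n_words=2, rule=rule_near_before)
```

One such histogram would be built per rule and per tree, which gives the multiple structural relationship histograms that the matching step compares.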

TREE N

Pyramidal Spatiotemporal Relationship Match (PSRM)


• Applied to each of the “association rules”

• Applied to each tree in the STF

• We apply the pyramid match semantically rather than spatially

• Assumption: neighbouring codewords are similar

• Merging the adjacent nodes, instead of merging adjacent spatial bins

Pyramid match kernel:

Typical pyramid match kernel

Our Pyramid Match Kernel

Adjacent bins are merged

Children are merged to parents

Multiple Structural Relationship Histograms

Pyramid Match Kernel (PMK)
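
Merging children into parents, instead of merging adjacent spatial bins, can be sketched with a pyramid match over the leaves of a single tree. The example assumes a complete binary tree whose leaves are listed left to right (so siblings are adjacent), a plain one-dimensional histogram over leaves, and the standard Grauman and Darrell weighting in which matches first found at level l are weighted by 1/2^l. The weights used in PSRM itself are not given on the slides.

```python
import numpy as np


def semantic_pmk(h1, h2):
    """Pyramid match kernel over tree leaves; coarser levels are built by summing siblings."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    score, previous, weight = 0.0, 0.0, 1.0
    while True:
        intersection = np.minimum(h1, h2).sum()
        score += weight * (intersection - previous)  # only matches new at this level
        previous = intersection
        if h1.size == 1:
            return score
        # Merge each pair of sibling nodes into their parent.
        h1 = h1.reshape(-1, 2).sum(axis=1)
        h2 = h2.reshape(-1, 2).sum(axis=1)
        weight /= 2.0
```

Two features that fall into sibling leaves match one level up with weight 1/2, while features that only share the root match with the smallest weight, which reflects the assumption that neighbouring codewords are similar.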


Figure: Continuous action recognition. Panels: “Typical Methods” and “Our Approach”, each drawn as a timeline of “Features” and “Classification” stages.

Classification


Combined Classification

• PSRM and BOST (bag of spatiotemporal textons) are classified independently:

• PSRM: k-means forest

M. Muja and D. G. Lowe. “Fast approximate nearest neighbors with automatic algorithm configuration” VISAPP 2009

K-means tree figure courtesy of David Aldavert Miró: http://www.cvc.uab.cat/~aldavert/plor/

Originally used for nearest-neighbour (NN) approximation

Use PSRM as the matching kernel

Combined with the BOST model for final results

Data points are clustered using k-means at the root

For each cluster, perform another k-means recursively

At each terminal cluster, a posterior probability distribution is assigned
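
The recursive clustering described above can be sketched as a small hierarchical k-means classifier. Note the substitution: this example clusters plain feature vectors with Euclidean k-means, whereas the slides use PSRM as the matching kernel. The function names, the branching factor `k`, and the stopping rule are assumptions made for the illustration.

```python
import numpy as np


def nearest(X, centres):
    """Index of the closest centre (squared Euclidean distance) for each row of X."""
    d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def kmeans(X, k, rng, iters=10):
    """Plain Lloyd's k-means; returns the centres and the final assignment."""
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = nearest(X, centres)
        for c in range(k):
            if np.any(labels == c):
                centres[c] = X[labels == c].mean(axis=0)
    return centres, nearest(X, centres)


def build_tree(X, y, n_classes, rng, k=2, min_size=2):
    """Cluster at the root, recurse into each cluster, store a class posterior at each leaf."""
    posterior = np.bincount(y, minlength=n_classes) / len(y)
    if len(X) <= max(k, min_size) or posterior.max() == 1.0:
        return {"posterior": posterior}
    centres, labels = kmeans(X, k, rng)
    if len(np.unique(labels)) < 2:  # clustering failed to split the data
        return {"posterior": posterior}
    children = [
        build_tree(X[labels == c], y[labels == c], n_classes, rng, k, min_size)
        if np.any(labels == c) else {"posterior": posterior}
        for c in range(k)
    ]
    return {"centres": centres, "children": children}


def predict(tree, x):
    """Descend to a terminal cluster and return its posterior distribution."""
    while "children" in tree:
        tree = tree["children"][int(nearest(x[None, :], tree["centres"])[0])]
    return tree["posterior"]


def forest_predict(forest, x):
    """Average the posteriors of several independently built trees."""
    return np.mean([predict(t, x) for t in forest], axis=0)


X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
rng = np.random.default_rng(0)
forest = [build_tree(X, y, 2, rng) for _ in range(3)]
```

Querying the forest with a point near either blob returns a posterior that favours that blob's class.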

Experiments

• Short video sequences (50 frames ~ 2 seconds) are extracted from the input video.

• Subsequences are sampled every 5 frames in the experiments and every frame in the laptop demo (so the demo performs frame-by-frame recognition).

• Two datasets are used for performance evaluation:

KTH dataset

• The standard benchmark

• Six classes, with viewpoint changes, illumination changes, zoom, etc.

UT interaction dataset (for the ICPR 2010 contest on Semantic Description of Human Activities)

• Human interactions, 6 classes of actions, cluttered background

Hardware specifications

• Intel Core i7 920 (for accuracy and speed tests)

• Core 2 Duo P9400 (for laptop demo)
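
The subsequence sampling described at the start of this slide (50-frame subsequences, taken every 5 frames in the experiments and every frame in the demo) can be sketched as below. The function name and the choice to drop incomplete windows at the end of the video are assumptions.

```python
def subsequence_starts(n_frames, length=50, step=5):
    """Start indices of all complete subsequences of `length` frames, taken every `step` frames."""
    return list(range(0, n_frames - length + 1, step))


experiment_windows = subsequence_starts(60, length=50, step=5)
demo_windows = subsequence_starts(52, length=50, step=1)
```

With a step of 1, a new subsequence ends at every incoming frame, which is what makes the demo setting a frame-by-frame recogniser.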

Experiments: Results (KTH dataset)

Method                          Accuracy (%)
Mined features (ICCV2009)       96.7
CCA (CVPR2007)                  95.33
Neighbourhood (CVPR2010)        94.53
Info. Max. (CVPR2008)           94.15
Shape-motion tree (ICCV2009)    93.43
Vocabulary Forest (CVPR2008)    93.17
Point clouds (CVPR2009)         93.17
our method (sequence)           95.67
our method (snippets)           93.55

Comparison with recent state-of-the-art

• Comparable to most state-of-the-art methods.

• About 3% less accurate than the top performer

• Is it a sensible trade-off?

• Useful for many more practical applications. (surveillance, robotics, etc.)

snippet: subsequence level recognition

sequence: majority voting of subsequence labels

Leave-one-out cross-validation
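
Sequence-level recognition by majority voting over snippet labels can be sketched in a few lines. The function name and the tie-breaking behaviour (the label seen first wins a tie) are assumptions of this example.

```python
from collections import Counter


def sequence_label(snippet_labels):
    """Return the most frequent snippet label; ties go to the label encountered first."""
    return Counter(snippet_labels).most_common(1)[0][0]


label = sequence_label(["walking", "running", "walking", "jogging", "walking"])
```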

Experiments: Results

• Results: UT interaction dataset

• Run time performance

PSRM and BOST gave low accuracies when applied separately.

Performance improved by ~20% simply by combining the class labels!

< 25 fps, but enough for most real-time applications

Can be further optimised (e.g. GPU, multi-core processing)

Demo video

• Frame-level recognition

• Potential improvement:

• Delay (~1s) in recognition results (depends on the subsequence length)

• Please visit: “http://www.youtube.com/watch?v=eD5b8d7hV6E” on the Internet for the full demo video.



Conclusions

A novel action recognition system

Main strength: run time performance

• k-means codebook → spatiotemporal semantic forest

• Histogram → PSRM

• Traditional classifiers (e.g. SVM) → k-means forest classifier / random forest

A re-design of the traditional “bag of words” model

THE END

THANK YOU VERY MUCH

Extra slide

• Formulation of V-FAST

Extra slide

• Formulation of STF

• Split function model:

• Split criteria --- Information gain:
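
A minimal sketch of an information-gain split criterion, assuming Shannon entropy over class labels and a binary split; the exact formulation used for the STF may differ. `labels` holds the class of each training cuboid reaching a node and `go_left` is the outcome of a candidate split function; both names are hypothetical.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def information_gain(labels, go_left):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    labels = np.asarray(labels)
    go_left = np.asarray(go_left, dtype=bool)
    left, right = labels[go_left], labels[~go_left]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # a split that sends everything one way gains nothing
    n = len(labels)
    return entropy(labels) - len(left) / n * entropy(left) - len(right) / n * entropy(right)
```

During training, several candidate splits would typically be drawn at random for each node and the one with the highest gain kept.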

Extra slide

• Formulation of STF

Extra slide

• Formulation of PSRM

• Step 1 Feature matching:

• Step 2 Semantic PMK over histogram

Extra slide

• Formulation of Classifier training

• Optimising the clusters of feature which maximise the PMK with the mean.

Extra slide

• Experiment parameters

Extra slide

• Confusion matrix:

Extra slide

Kernel k-means forest

Random forest

PSRM BOST

Action recognition results (class labels)

Weighted combination
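
The weighted combination of the two classifiers can be sketched as a convex combination of their class posteriors. The weight `w_psrm` and the function name are assumptions; the slides do not state how the weights are chosen.

```python
import numpy as np


def combine_posteriors(p_psrm, p_bost, w_psrm=0.5):
    """Blend the PSRM and BOST class posteriors and return (predicted class, blended posterior)."""
    p_psrm = np.asarray(p_psrm, dtype=float)
    p_bost = np.asarray(p_bost, dtype=float)
    combined = w_psrm * p_psrm + (1.0 - w_psrm) * p_bost
    return int(np.argmax(combined)), combined


label, combined = combine_posteriors([0.6, 0.4], [0.1, 0.9])
```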
