Real-time Action Recognition by Spatiotemporal Semantic and Structural Forest
Tsz-Ho Yu, Tae-Kyun Kim and Roberto Cipolla
Machine Intelligence Laboratory, Engineering Department, University of Cambridge
Introduction and Motivations
• A novel real-time solution for action recognition that utilises local-appearance and structural information.
Main objective: efficiency
Main features / major contributions:
• High run-time performance
• Local appearance + structural information
• Short response time
• Real-time feature extraction and classification
• Continuous / frame-by-frame recognition
• Pyramidal spatiotemporal relationship match (PSRM)
A short demo
Please visit: “http://www.youtube.com/watch?v=eD5b8d7hV6E” on the Internet for the full demo video.
Related Work
• Many current methods focus on accuracy and on the action representation model (feature design) [Schuldt et al. ICPR2004, Niebles et al. BMVC06, Ryoo and Aggarwal ICCV09, Willems BMVC09, Riemenschneider et al. BMVC09]
• Typical ingredients: “bag of words” model, sophisticated spatiotemporal features, K-means codebook, learned classifier
• Some achieve high accuracies, but take a long time to recognise
• How can we improve efficiency?
• Can we improve codebook learning and feature matching?
Related Work
• Vector quantisation by random forest [Moosmann et al. NIPS06]
• For image segmentation [Shotton et al. CVPR08]
• Can we apply it in video analysis?
• Pyramid match kernel [Grauman and Darrell ICCV05]
• Image recognition [Grauman and Darrell ICCV05], scene classification [Lazebnik et al. CVPR06], etc.
• Spatiotemporal relationship match [Ryoo and Aggarwal ICCV09]
S. Lazebnik, C. Schmid and J. Ponce. “Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories” CVPR 2006
K. Grauman and T. Darrell. “The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features” ICCV 2005
F. Moosmann, B. Triggs and F. Jurie. “Fast discriminative visual codebooks using randomized clustering forests” NIPS 2006
J. Shotton, M. Johnson and R. Cipolla. “Semantic texton forests for image categorization and segmentation” CVPR 2008
M. S. Ryoo and J. K. Aggarwal. “Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities” ICCV 2009
[Figures from: Grauman and Darrell ICCV05; Moosmann et al. NIPS2006; Ryoo and Aggarwal ICCV09]
Our Contributions
• Our contribution is three-fold:
• Efficient codebook learning: Spatiotemporal Texton Forest. Image segmentation (2D) → action recognition (3D)
• High run-time performance:
1. V-FAST corner detector
2. Random forest classifiers
3. Continuous action recognition
• Local appearance + structural information: SRM → PSRM (pyramidal spatiotemporal relationship match)
Comparison with existing approaches
Typical approaches:
• Feature encoding: K-means clustering. Slow for large codebooks.
• Feature matching: the “bag of words” (BOW) model. Lacks structural information; quantisation error.
Our method:
• Feature encoding: semantic texton forest. Efficient.
• Feature matching: PSRM. Structural information; hierarchical matching; robust.
Overview
• Feature detection: V-FAST corner
• Feature extraction: spatio-temporal cuboids, spatiotemporal semantic texton forest
• Feature matching: PSRM, BOST
• Classification: k-means forest, random forest classifier
• Results
Feature detection
V-FAST: Spatiotemporal Feature Detection
• A novel spatiotemporal interest point detector
• Inspired by FAST [Rosten and Drummond ECCV2006]
• A cascade of three FAST detectors.
• Considers three orthogonal Bresenham circles
• Features:
• Very fast!
E. Rosten and T. Drummond. “Machine learning for high-speed corner detection” ECCV 2006
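Sketch (not from the original slides): one possible reading of "a cascade of three FAST detectors" on three orthogonal Bresenham circles. It runs a FAST-style segment test in the XY, XT and YT planes around a voxel and rejects the point as soon as one plane fails. The radius-3 circle, the threshold `thresh`, the arc length `arc`, the requirement that all three planes pass, and the name `is_vfast_corner` are assumptions made for illustration; the authors' exact formulation is not given in the slide text.

```python
from typing import List, Sequence, Tuple

Video = List[List[List[int]]]  # grey levels, indexed as video[t][y][x]

# 16 offsets of a radius-3 Bresenham circle in circular order (as in FAST).
CIRCLE: List[Tuple[int, int]] = [
    (0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
    (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3),
]


def segment_test(centre: int, ring: Sequence[int], thresh: int, arc: int) -> bool:
    """True if `arc` contiguous ring pixels are all brighter or all darker."""
    for sign in (1, -1):
        flags = [sign * (p - centre) > thresh for p in ring]
        run = 0
        for flag in flags + flags:  # doubled list handles wrap-around
            run = run + 1 if flag else 0
            if run >= arc:
                return True
    return False


def is_vfast_corner(video: Video, t: int, y: int, x: int,
                    thresh: int = 20, arc: int = 9) -> bool:
    """Assumed cascade: the point must pass the test in all three planes."""
    centre = video[t][y][x]

    def ring(plane: str) -> List[int]:
        if plane == "xy":
            return [video[t][y + b][x + a] for a, b in CIRCLE]
        if plane == "xt":
            return [video[t + b][y][x + a] for a, b in CIRCLE]
        return [video[t + b][y + a][x] for a, b in CIRCLE]  # "yt"

    # all() over a generator stops at the first failing plane.
    return all(segment_test(centre, ring(p), thresh, arc)
               for p in ("xy", "xt", "yt"))
```

With these assumptions, a voxel that is bright in a single frame passes, while a bright point that stays still over time is rejected by the XT test, which is the intended difference from a purely spatial corner detector.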
Feature extraction
Building a codebook using STF
• Extract small video cuboids at detected keypoints
• Visual codebook using STF:
• Efficient visual codebook
• One feature → multiple codewords
• Quantisation and partial matching
Random forest based codebook:
• Works on pixels directly
• Hierarchical splits
• “Textonises” patches recursively
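Sketch (not from the original slides): how a forest can act as a codebook, so that one cuboid maps to one codeword per tree. Each node compares two voxels of the flattened cuboid directly, and the leaf reached in each tree is the codeword. The splits here are random and untrained, purely for illustration; the threshold range and the helper names `make_random_tree`, `tree_path` and `codewords` are assumptions, and training the splits is not shown.

```python
import random
from typing import List, Tuple

Split = Tuple[int, int, float]  # (voxel index i, voxel index j, threshold)


def make_random_tree(n_voxels: int, depth: int, rng: random.Random) -> List[Split]:
    """Complete binary tree stored heap-style: node k has children 2k+1, 2k+2."""
    n_internal = 2 ** depth - 1
    return [(rng.randrange(n_voxels), rng.randrange(n_voxels), rng.uniform(-30, 30))
            for _ in range(n_internal)]


def tree_path(tree: List[Split], cuboid: List[float]) -> List[int]:
    """Node indices visited from the root down to a leaf (hierarchical splits)."""
    node, path = 0, [0]
    while node < len(tree):  # indices below len(tree) are internal nodes
        i, j, thresh = tree[node]
        node = 2 * node + (1 if cuboid[i] - cuboid[j] > thresh else 2)
        path.append(node)
    return path


def codewords(forest: List[List[Split]], cuboid: List[float]) -> List[int]:
    """One leaf index per tree: a single feature yields multiple codewords."""
    return [tree_path(tree, cuboid)[-1] for tree in forest]
```

The full path returned by `tree_path` is what makes partial matching possible: two cuboids that end in different leaves may still share ancestors.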
Feature matching
Pyramidal Spatiotemporal Relationship Match (PSRM)
• Old method: SRM [Ryoo and Aggarwal ICCV09]
• PSRM: a multi-codebook, multi-resolution version of SRM
• Natural combination: local appearance + action structure
• Evaluate each pair of codewords using a set of association rules.
• A set of “rules” (in different colours) is designed to describe the spatiotemporal structure of features.
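Sketch (not from the original slides): building one relationship histogram per association rule by testing every ordered pair of features. The slide text does not list the rules, so `before` and `near_xy` below are hypothetical stand-ins, as are the 10-pixel radius and the `Feature` type.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, List, NamedTuple, Tuple


class Feature(NamedTuple):
    x: float
    y: float
    t: float
    codeword: int


Rule = Callable[[Feature, Feature], bool]

# Hypothetical rules, for illustration only.
RULES: Dict[str, Rule] = {
    "before": lambda a, b: a.t < b.t,
    "near_xy": lambda a, b: (a.x - b.x) ** 2 + (a.y - b.y) ** 2 <= 10.0 ** 2,
}


def relationship_histograms(
    features: List[Feature],
) -> Dict[str, Counter]:
    """Count, for each rule, the ordered codeword pairs that satisfy it."""
    hists: Dict[str, Counter] = {name: Counter() for name in RULES}
    for a, b in combinations(features, 2):
        for first, second in ((a, b), (b, a)):
            for name, rule in RULES.items():
                if rule(first, second):
                    hists[name][(first.codeword, second.codeword)] += 1
    return hists
```

Each histogram is keyed by a pair of codewords, so it records which local appearances co-occur and in what spatiotemporal arrangement.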
Pyramidal Spatiotemporal Relationship Match (PSRM)
• Multiple structural relationship histograms → pyramid match kernel (PMK)
• Applied to each “association rule”
• Applied to each tree in the STF
Pyramid match kernel:
• We apply it semantically, not spatially
• Assumption: neighbouring codewords are similar
• Adjacent nodes are merged, instead of adjacent spatial bins
• Typical pyramid match kernel: adjacent bins are merged
• Our pyramid match kernel: children are merged to parents
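Sketch (not from the original slides): a pyramid match in which histogram bins are tree nodes and coarsening means merging children into their parent. The weighting of new matches by 1/2^level follows the usual pyramid match kernel form; whether the authors use exactly these weights is an assumption. For brevity the histograms are keyed by single node indices of a heap-indexed complete binary tree rather than by codeword pairs.

```python
from collections import Counter
from typing import Dict


def merge_to_parents(hist: Dict[int, int]) -> Dict[int, int]:
    """Coarsen by one level: node k is merged into its parent (k - 1) // 2."""
    parents: Counter = Counter()
    for node, count in hist.items():
        parents[(node - 1) // 2] += count
    return dict(parents)


def intersection(h1: Dict[int, int], h2: Dict[int, int]) -> int:
    """Histogram intersection: number of matches at the current resolution."""
    return sum(min(count, h2.get(node, 0)) for node, count in h1.items())


def semantic_pmk(h1: Dict[int, int], h2: Dict[int, int], depth: int) -> float:
    """Leaf histograms in, similarity out; leaves sit at tree level `depth`."""
    score, previous = 0.0, 0
    for level in range(depth + 1):
        matched = intersection(h1, h2)
        # Only matches that are new at this level count, with a smaller weight.
        score += (matched - previous) / (2 ** level)
        previous = matched
        if level < depth:
            h1, h2 = merge_to_parents(h1), merge_to_parents(h2)
    return score
```

Features that fall in sibling leaves still contribute, at half weight, which is the "neighbouring codewords are similar" assumption in action.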
Continuous action recognition
[Figure: timelines of “Features” and “Classification” stages, comparing typical methods with our approach]
Classification
Combined Classification
• PSRM and BOST (bag of spatiotemporal textons) are classified independently:
• PSRM: k-means forest
• Originally used for NN approximation
• Uses PSRM as the matching kernel
• Combined with the BOST model for final results
• BOST: random forest classifier
K-means forest:
• Data points are clustered using k-means at the root
• For each cluster, another k-means is performed recursively
• At each terminal cluster, a posterior probability distribution is assigned
M. Muja and D. G. Lowe. “Fast approximate nearest neighbors with automatic algorithm configuration” VISAPP 2009
K-means tree figure courtesy of David Aldavert Miró: http://www.cvc.uab.cat/~aldavert/plor/
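Sketch (not from the original slides): a single hierarchical k-means tree with a class posterior at each terminal cluster. To keep it short it uses squared Euclidean distance on plain vectors, which is a swapped-in simplification: the slides use PSRM as the matching kernel. The stopping rule (`min_size`, pure clusters) and all function names are assumptions.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]
Sample = Tuple[Point, str]  # (feature vector, action label)


def dist2(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p: Point, centres: List[Point]) -> int:
    return min(range(len(centres)), key=lambda c: dist2(p, centres[c]))


def kmeans(points: List[Point], k: int, rng: random.Random,
           iters: int = 10) -> List[Point]:
    centres: List[Point] = rng.sample(points, k)
    for _ in range(iters):
        groups: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centres)].append(p)
        # Keep the old centre when a cluster ends up empty.
        centres = [[sum(col) / len(g) for col in zip(*g)] if g else centres[i]
                   for i, g in enumerate(groups)]
    return centres


def leaf(data: List[Sample]) -> dict:
    counts = Counter(label for _, label in data)
    return {"posterior": {lab: n / len(data) for lab, n in counts.items()}}


def build(data: List[Sample], k: int, rng: random.Random,
          min_size: int = 4) -> dict:
    """Cluster at the root, then recurse into each cluster."""
    if len(data) < max(min_size, k) or len({lab for _, lab in data}) == 1:
        return leaf(data)
    centres = kmeans([p for p, _ in data], k, rng)
    groups: List[List[Sample]] = [[] for _ in range(k)]
    for p, lab in data:
        groups[nearest(p, centres)].append((p, lab))
    if max(len(g) for g in groups) == len(data):  # split made no progress
        return leaf(data)
    kept = [(c, g) for c, g in zip(centres, groups) if g]
    return {"centres": [c for c, _ in kept],
            "children": [build(g, k, rng, min_size) for _, g in kept]}


def classify(tree: dict, p: Point) -> Dict[str, float]:
    """Descend to the nearest centre at each level; return the leaf posterior."""
    while "posterior" not in tree:
        tree = tree["children"][nearest(p, tree["centres"])]
    return tree["posterior"]
```

A forest would build several such trees from different random initialisations and average their posteriors.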
Experiments
• Short video sequences (50 frames, about 2 seconds) are extracted from the input video.
• Subsequences are sampled every 5 frames in the experiments and every frame in the laptop demo (so the demo is frame-by-frame recognition).
• Two datasets are used for performance evaluation:
• KTH dataset: the standard benchmark. Six classes, with viewpoint changes, illumination changes, zoom, etc.
• UT interaction dataset (for the ICPR 2010 contest on Semantic Description of Human Activities): human interactions, 6 classes of actions, cluttered background.
Hardware specifications:
• Intel Core i7 920 (for accuracy and speed tests)
• Core 2 Duo P9400 (for laptop demo)
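Sketch (not from the original slides): the subsequence sampling described above, as frame index ranges. The function name and the half-open `(start, end)` convention are assumptions.

```python
from typing import Iterator, Tuple


def subsequence_windows(n_frames: int, length: int = 50,
                        step: int = 5) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) frame ranges, end exclusive, every `step` frames."""
    for start in range(0, n_frames - length + 1, step):
        yield start, start + length
```

For example, `list(subsequence_windows(60))` gives `[(0, 50), (5, 55), (10, 60)]`; passing `step=1` corresponds to the frame-by-frame setting of the laptop demo.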
Experiments: Results (KTH dataset)
Accuracy (%) on the KTH dataset:
Mined features (ICCV2009): 96.7
CCA (CVPR2007): 95.33
Neighbourhood (CVPR2010): 94.53
Info. Max. (CVPR2008): 94.15
Shape-motion tree (ICCV2009): 93.43
Vocabulary Forest (CVPR2008): 93.17
Point clouds (CVPR2009): 93.17
Our method (sequence): 95.67
Our method (snippets): 93.55
Comparison with recent state-of-the-art
• Comparable to most state-of-the-art methods.
• Accuracy is around 3% lower than the top performer
• Is it a sensible trade-off?
• Useful for many more practical applications (surveillance, robotics, etc.).
snippet: subsequence-level recognition
sequence: majority voting of subsequence labels
Evaluation: leave-one-out cross-validation
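Sketch (not from the original slides): turning snippet-level labels into one sequence-level label by majority voting. The function name is an assumption, and ties go to the label seen first, which is an arbitrary choice.

```python
from collections import Counter
from typing import List


def sequence_label(snippet_labels: List[str]) -> str:
    """Return the most frequent snippet label for the whole sequence."""
    if not snippet_labels:
        raise ValueError("need at least one snippet label")
    return Counter(snippet_labels).most_common(1)[0][0]
```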
Experiments: Results
• Results: UT interaction dataset
• PSRM and BOST gave low accuracies when applied separately.
• Accuracy improved by ~20% simply by combining the class labels!
• Run-time performance
• < 25 fps, but enough for most real-time applications
• Can be further optimised (e.g. GPU, multi-core processing)
Demo video
• Frame-level recognition
• Potential improvement:
• Delay (~1s) in recognition results (Depends on the subsequence length )
• Please visit: “http://www.youtube.com/watch?v=eD5b8d7hV6E” on the Internet for the full demo video.
Conclusions
A novel action recognition system
Main strength: run time performance
• k-means codebook → spatiotemporal semantic forest
• Histogram → PSRM
• Traditional classifiers (e.g. SVM) → k-means forest classifier / random forest
A re-design of the traditional “bag of words” model
THE END
THANK YOU VERY MUCH
Extra slide
• Formulation of V-FAST
Extra slide
• Formulation of STF
• Split function model:
• Split criteria --- Information gain:
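Sketch (not from the original slides): the slide's formula did not survive extraction, so this shows only the textbook information-gain criterion for a binary split, computed from class labels. Whether the authors use exactly this form is an assumption.

```python
import math
from collections import Counter
from typing import List


def entropy(labels: List[str]) -> float:
    """Shannon entropy (bits) of the class distribution in `labels`."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent: List[str], left: List[str],
                     right: List[str]) -> float:
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    return (entropy(parent)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))
```

A split that separates two balanced classes perfectly scores 1 bit; a split that leaves both children as mixed as the parent scores 0.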
Extra slide
• Formulation of STF
Extra slide
• Formulation of PSRM
• Step 1 Feature matching:
• Step 2 Semantic PMK over histogram
Extra slide
• Formulation of Classifier training
• Optimise the clusters of features so as to maximise the PMK with the cluster mean.
Extra slide
• Experiment parameters
Extra slide
• Confusion matrix:
Extra slide
• PSRM → kernel k-means forest
• BOST → random forest
• Weighted combination → action recognition results (class labels)
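Sketch (not from the original slides): a weighted combination of the class posteriors from the two classifiers. The linear form, the weight `w_psrm` and the function name are assumptions; the slides do not give the weights.

```python
from typing import Dict, Tuple


def combine(p_psrm: Dict[str, float], p_bost: Dict[str, float],
            w_psrm: float = 0.5) -> Tuple[str, Dict[str, float]]:
    """Blend two posteriors and return (best class, blended scores)."""
    classes = sorted(set(p_psrm) | set(p_bost))
    scores = {c: w_psrm * p_psrm.get(c, 0.0) + (1.0 - w_psrm) * p_bost.get(c, 0.0)
              for c in classes}
    return max(classes, key=lambda c: scores[c]), scores
```

With equal weights, a confident vote from one classifier can outweigh a weak vote from the other, which is one way the combined label can beat either classifier alone.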