
Efficient Activity Detection with Max-Subgraph Search
Chao-Yeh Chen and Kristen Grauman, University of Texas at Austin




Outline
- Introduction
- Approach
  - Define weighted nodes
  - Link nodes
  - Search for the maximum-weight subgraph
- Experimental Results
- Conclusion

Introduction
Existing methods tend to separate activity detection into two distinct stages:
1. Generate space-time candidate regions of interest from the test video.
2. Score each candidate according to how well it matches a given activity model (often a classifier).

How to detect human activity in continuous video? Status quo approaches:

- Sliding window: slide a fixed window over the frames and classify the features inside it.
- Cuboid detection: search over fixed space-time cubes.
- Person-centered track: focus on the activity around a tracked person or object.

Introduction
We pose activity detection as a maximum-weight connected subgraph problem over a learned space-time graph constructed on the test sequence.

Detection requires both classification and localization: build a space-time video graph, weight its nodes by the activity classifier score, and return the maximum-weight subgraph as the detection result.

Approach

Activity detection: find the subvolume S* of the video sequence Q that maximizes the score:

S* = argmax_{S ⊆ Q} f(S)

Classifier training for feature weights
Learn a linear SVM from training data; the scoring function has the form:

f(S) = Σ_{j=1}^{K} w_j h_j(S) + b

Let h_j(S) denote the j-th bin count of the BoF histogram h(S); the j-th visual word is associated with a learned weight w_j, for j = 1, …, K, where K is the dimension of histogram h and b is the SVM bias. The weights come from the training examples x_i (indexed by i): for a linear SVM with coefficients α_i and labels y_i, w_j = Σ_i α_i y_i h_j(x_i).
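As a minimal sketch of this scoring function (function names hypothetical, not from the paper), f(S) can be computed directly from the visual-word assignments of the features inside a candidate subvolume:

```python
import numpy as np

def score_subvolume(word_ids, w, b):
    """Linear SVM response f(S) = sum_j w_j * h_j(S) + b.

    word_ids: visual-word index of each feature point inside S
    w: learned per-word weights (length K)
    b: SVM bias
    """
    h = np.bincount(word_ids, minlength=len(w))  # BoF histogram h(S)
    return float(np.dot(w, h) + b)

def score_per_feature(word_ids, w, b):
    """Equivalent view: each feature contributes the weight of its word."""
    return float(sum(w[j] for j in word_ids) + b)
```

The second form is what makes the subgraph search tractable: the score of any subvolume decomposes into per-feature (and hence per-node) contributions.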

Classifier training for feature weights
Thus the classifier response for a candidate subvolume S sums, over the features inside S, the SVM weight of each feature's visual word: every occurrence of the j-th word contributes w_j.

Bag-of-Features (BoF)

First, extract local features from the video.

Cluster the local features (e.g., with k-means). Each cluster defines a visual word (code word), represented by its cluster center, and the set of all visual words forms the visual vocabulary.


The visual vocabulary is a codebook of code words. Each feature extracted from an image is quantized to its nearest visual word, and counting word occurrences yields a histogram: the BoF representation. An SVM trained on BoF features then serves as the classification model.

Localized Space-Time Features
Low-level descriptors: we use HoG and HoF computed in local space-time cubes [14, 10]. These descriptors capture the appearance and motion in the video.
High-level descriptors: semantic descriptors built from object and person detections.
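The vocabulary-and-histogram pipeline above can be sketched as follows (a toy illustration: the tiny k-means and 2-D descriptors stand in for real HoG/HoF features and a proper clustering library):

```python
import numpy as np

def build_vocabulary(descriptors, K, iters=10, seed=0):
    """Tiny k-means: cluster local descriptors into K visual words."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), K, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest center (visual word)
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for k in range(K):
            pts = descriptors[labels == k]
            if len(pts):
                centers[k] = pts.mean(axis=0)  # update word center
    return centers

def bof_histogram(descriptors, centers):
    """Quantize descriptors to their nearest word and count occurrences."""
    d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    words = d.argmin(axis=1)
    return np.bincount(words, minlength=len(centers))
```

In practice one would use an off-the-shelf k-means implementation; the point is only that every video feature becomes a word index, and a (sub)volume becomes a word-count histogram.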

Low-level descriptors: Histogram of Oriented Gradients (HOG)
Step 1: Compute the gradient at each pixel.
Step 2: Divide the image into cells (e.g., 6x6 pixels) and build a histogram of gradient orientations (e.g., 4 or 8 orientation bins) per cell, weighting each pixel's vote by its gradient magnitude.
Step 3: Group neighboring cells into overlapping blocks (e.g., 2x2 cells) and normalize the cell histograms within each block.
Step 4: Concatenate the block histograms into the final feature vector.
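A simplified version of these steps (a sketch, not the exact descriptor used in the paper; cell size and bin count are illustrative, and block-level normalization is applied globally for brevity):

```python
import numpy as np

def hog(image, cell=6, bins=8):
    """Simplified HOG: per-cell orientation histograms (Steps 1-2),
    with a single global normalization standing in for Steps 3-4."""
    gy, gx = np.gradient(image.astype(float))        # Step 1: gradients
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation
    H, W = image.shape
    feats = []
    for i in range(0, H - cell + 1, cell):           # Step 2: per-cell histograms
        for j in range(0, W - cell + 1, cell):
            m = mag[i:i+cell, j:j+cell].ravel()
            a = ang[i:i+cell, j:j+cell].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0, np.pi), weights=m)
            feats.append(hist)
    v = np.concatenate(feats)
    return v / (np.linalg.norm(v) + 1e-9)            # normalize (Steps 3-4, simplified)
```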

High-level descriptors
1. Detect objects and people in the frames using object/pose detectors. Cluster poses into N (~10) person types.
2. For each detection, build a semantic descriptor based on its spatial/temporal neighbors.
3. Quantize the semantic descriptor into words via a random forest.

Define weighted nodes
Divide the space-time volume into frame-level or space-time nodes. Compute the weight of each node from the features inside it.

Link nodes
Two different link strategies:
1. Neighbors only, for frame-level nodes (T-Subgraph) or space-time nodes (ST-Subgraph).
2. First two neighbors, for frame-level nodes (T-Jump-Subgraph).
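As an illustrative sketch (structure hypothetical, not the paper's implementation), the two linking strategies over a temporal chain of frame-level nodes can be written as adjacency lists:

```python
def link_nodes(n, jump=False):
    """Adjacency list over n frame-level nodes.

    jump=False: link immediate neighbors only (T-Subgraph).
    jump=True : also link second neighbors (T-Jump-Subgraph),
                letting a detection skip over one noisy frame.
    """
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for step in ((1, 2) if jump else (1,)):
            if i + step < n:
                adj[i].append(i + step)
                adj[i + step].append(i)
    return adj
```

Denser links enlarge the space of connected subgraphs the search can return, trading extra search cost for robustness.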

Search for the maximum-weight subgraph
Transform the max-weight subgraph problem into a prize-collecting Steiner tree problem, and solve it efficiently with the branch-and-cut method from [15].
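For intuition, in the simplest setting (a T-Subgraph over a temporal chain, with neighbor-only links) a maximum-weight connected subgraph is just the maximum-sum run of consecutive frame-level node weights, computable in linear time with Kadane's algorithm; it is the general graph case that needs the prize-collecting Steiner tree machinery. A sketch of the chain case:

```python
def best_temporal_subgraph(node_weights):
    """Max-weight connected subgraph on a chain = max-sum contiguous run.

    Returns (best_score, start, end), end inclusive (Kadane's algorithm).
    """
    best, best_range = float("-inf"), (0, 0)
    cur, cur_start = 0.0, 0
    for i, w in enumerate(node_weights):
        if cur <= 0:
            cur, cur_start = w, i   # start a new candidate interval
        else:
            cur += w                # extend the current interval
        if cur > best:
            best, best_range = cur, (cur_start, i)
    return best, *best_range
```

With per-node weights from the SVM word weights, the detected interval is the set of frames whose summed classifier response is maximal.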

[15] An algorithmic framework for the exact solution of the prize-collecting Steiner tree problem. Math. Prog., 2006.

Experimental Results
Datasets

Baselines
- T-Sliding
- ST-Cube-Sliding
- ST-Cube-Subvolume [29]

[29] J. Yuan, Z. Liu, and Y. Wu. Discriminative subvolume search for efficient action detection. In CVPR, 2009.

UCF Sports data

Hollywood data

MSR dataset

Example of ST-Subgraph

Top row: yellow boxes show the space-time detection from our ST-Subgraph; note that the detected region changes over time. Red boxes: ground-truth annotation. Bottom row: green boxes show the top four detections from ST-Cube-Subvolume; the top three are trapped in a local maximum caused by the small motion of the human.

Overview of all methods on the three datasets

Speed and accuracy trade-off
A wider search scope boosts accuracy, though it costs more computation. The flexibility of the proposed method allows the best speed/accuracy balance.

High-level vs. low-level descriptors

Conclusion
- Compared to sliding window search, the method significantly reduces computation time.
- The flexible node structure offers more robust detection against noisy backgrounds.
- The high-level descriptor shows promise for complex activities by incorporating semantic relationships between humans and objects in video.