Spatiotemporal Graphs for Object Segmentation, Human Pose Estimation and Action Detection in Videos

Mubarak Shah

Center for Research in Computer Vision

University of Central Florida

Spatiotemporal Graphs (STG)

• Video-based problems

• Nodes and edges

• Spatiotemporal

• Type I

• Type II


Type I Spatiotemporal Graph (STG)

• Nodes represent entities in single frames

[Figure: Type I STG with nodes in Frame 1, Frame 2, Frame 3, …]

Nodes can be: object proposals, pixels, super-pixels, object locations, …

Edges can be: color similarities, distances, shape similarities, …

Type II Spatiotemporal Graph (STG)

• Nodes represent entities in multiple frames

Nodes can be: object tracklets, super-voxels, …

Edges can be: appearance similarities, motion models, overlaps, …
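The two graph types differ only in how many frames a node covers. A minimal Python sketch of that distinction follows; the class and method names (`Node`, `SpatiotemporalGraph`, `graph_type`) are hypothetical and not taken from the talk, and treating a graph as Type II as soon as any node spans several frames is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """An STG node; `frames` lists the frame indices the entity covers."""
    node_id: int
    frames: tuple


@dataclass
class SpatiotemporalGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: dict = field(default_factory=dict)   # (id_a, id_b) -> weight

    def add_node(self, node_id, frames):
        self.nodes[node_id] = Node(node_id, tuple(frames))

    def add_edge(self, a, b, weight):
        # The weight can be any similarity (color, overlap, motion, ...).
        self.edges[(a, b)] = weight

    def graph_type(self):
        """'Type I' if every node lives in a single frame, else 'Type II'."""
        single = all(len(n.frames) == 1 for n in self.nodes.values())
        return "Type I" if single else "Type II"


g = SpatiotemporalGraph()
g.add_node(0, [1])        # e.g. an object proposal in frame 1
g.add_node(1, [2])        # e.g. an object proposal in frame 2
g.add_edge(0, 1, 0.8)
print(g.graph_type())
```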

Examples of Spatiotemporal Graphs

Video Object Segmentation (VOS)

[Figure: original video and its object segmentation]

Spatiotemporal Graph (STG): Video Object Segmentation

[Figure: STG over Frame i-1, Frame i, Frame i+1, …]

Video Object Co-Segmentation (VOCS)

STG – Video Object Co-Segmentation

[Figure: tracklets from Video 1 and Video 2]

Human Pose Estimation in Videos (HPEV)

STG – Human Pose Estimation in Videos

[Figure: body part nodes per frame: head top, head bottom, shoulder, …, hand]

Human Action Detection (HAD)

Spatiotemporal Context Graphs for Training Videos

[Figure: training videos for action c (e.g., Diving), Video 1, …, Video n, with context graphs G1(V1, E1), …, Gn(Vn, En)]

Outline

• Video Object Segmentation (VOS)

• Video Object Co-Segmentation (VOCS)

• Human Pose Estimation in Videos (HPEV)

• Human Action Detection (HAD)

Dong Zhang, Omar Javed, and Mubarak Shah, “Video object segmentation through spatially accurate and temporally dense extraction of primary object regions”, CVPR, 2013

• Applications
  • Object Recognition
  • Activity Recognition
  • Surveillance

• Challenges
  • Camera movements
  • Varieties of objects
  • Deformable objects

Framework

Input Video → Object Proposal Generation → Spatiotemporal Graph for Object Selection → GMMs and MRF based Optimization → Object Segmentation

Object Proposal Generation

• Object proposal methods [1,2]

[1] Ian Endres and Derek Hoiem, “Category Independent Object Proposals”, ECCV, 2010

[2] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari, “What is an object?”, CVPR, 2010

[Figure: ranked object proposals for SegTrack (monkeydog) and SegTrack (parachute); axes: frame index and proposal rank (1, 2, 3, 4, …, 100)]

Sample a lot of proposals! Select the right ones!

Ranked object proposals expansion

Multiple proposals

Spatiotemporal Graph for Object Selection

[Figure: an object proposal, drawn with a beginning node and an ending node]

Unary edge: represents object-ness

Unary Edge

$S_{unary} = M(r) + A(r)$

$A(r)$: appearance score (objectness)

$M(r)$: average Frobenius norm of the optical flow gradient

$\|U_x\|_F = \left\| \begin{bmatrix} u_x & u_y \\ v_x & v_y \end{bmatrix} \right\|_F = \sqrt{u_x^2 + u_y^2 + v_x^2 + v_y^2}$

Unary Edge: Score

[Figure panels: original video frame; optical flow; object region (proposal); optical flow gradient; boundary region; optical flow gradient around boundary]

Unary Edge: Motion Score
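The motion score can be sketched in a few lines of NumPy. This is a minimal illustration, assuming dense flow fields `u` and `v` are already available and that the caller passes a boolean `mask` selecting the pixels to average over (for example a band around the proposal boundary); the function names are hypothetical, and finite differences via `np.gradient` are an assumption about how the flow gradient is approximated.

```python
import numpy as np


def flow_gradient_norm(u, v):
    """Per-pixel Frobenius norm of the 2x2 optical flow gradient."""
    # np.gradient returns derivatives along rows (y) first, then columns (x).
    u_y, u_x = np.gradient(u)
    v_y, v_x = np.gradient(v)
    return np.sqrt(u_x**2 + u_y**2 + v_x**2 + v_y**2)


def motion_score(u, v, mask):
    """M(r): average gradient norm over the pixels selected by `mask`."""
    return float(flow_gradient_norm(u, v)[mask].mean())


def unary_score(u, v, mask, appearance):
    """S_unary = M(r) + A(r); `appearance` is A(r), supplied by the caller."""
    return motion_score(u, v, mask) + appearance


# Toy flow: u grows linearly in x, v grows linearly in y.
ys, xs = np.mgrid[0:5, 0:6]
u = xs.astype(float)
v = 2.0 * ys
mask = np.ones((5, 6), dtype=bool)
print(motion_score(u, v, mask))
```

For this toy flow the gradient is constant (u_x = 1, v_y = 2), so every pixel has norm sqrt(5).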

Binary edge

[Figure: binary edges between proposals in Frame i, Frame i+1, Frame i+2, …]

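Binary Edges

$S_{binary} = \lambda \cdot S_{overlap}(r_m, r_n) \cdot S_{color}(r_m, r_n)$

$S_{color}(r_m, r_n) = \mathrm{hist}(r_m) \cdot \mathrm{hist}(r_n)^T$

$S_{overlap}(r_m, r_n) = \frac{|r_m \cap \mathrm{warp}_{mn}(r_n)|}{|r_m \cup \mathrm{warp}_{mn}(r_n)|}$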

Binary Edge Score
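A small sketch of the binary edge score for two proposals given as boolean masks. It assumes the second mask has already been warped into the first frame (the warp itself is not implemented here) and that the color histograms are row vectors normalized by the caller; all function names are hypothetical.

```python
import numpy as np


def overlap_score(mask_m, warped_mask_n):
    """Intersection over union of r_m and the warped r_n."""
    inter = np.logical_and(mask_m, warped_mask_n).sum()
    union = np.logical_or(mask_m, warped_mask_n).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def color_score(hist_m, hist_n):
    """Dot product of the two color histograms."""
    return float(np.dot(hist_m, hist_n))


def binary_score(mask_m, warped_mask_n, hist_m, hist_n, lam=1.0):
    """S_binary = lambda * S_overlap * S_color."""
    return lam * overlap_score(mask_m, warped_mask_n) * color_score(hist_m, hist_n)


# Two 2x2 regions shifted by one column: intersection 2 px, union 6 px.
a = np.zeros((4, 4), dtype=bool)
b = np.zeros((4, 4), dtype=bool)
a[0:2, 0:2] = True
b[0:2, 1:3] = True
h = np.array([0.5, 0.5])
print(binary_score(a, b, h, h))
```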

[Figure: graph over Frame i-1, Frame i, Frame i+1, …]

Goal: Find only one object proposal from each frame, such that all of them have high object-ness and high similarity across frames.

Find the highest-weighted path in the DAG.

This is the longest-path problem on a DAG, which has a dynamic programming solution.
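The dynamic program can be sketched as follows. This is a simplified illustration, assuming edges only connect consecutive frames and that the path score is the sum of unary and binary scores along the path; the function name and the input layout (`unary[t][i]`, `binary[t][i][j]`) are hypothetical.

```python
def best_proposal_path(unary, binary):
    """Select one proposal per frame with the highest total score.

    unary[t][i]     : unary score of proposal i in frame t
    binary[t][i][j] : score of the edge from proposal i in frame t
                      to proposal j in frame t + 1
    Returns (best score, list with one proposal index per frame).
    """
    best = [list(unary[0])]          # best[t][j]: best score of a path ending at (t, j)
    back = []                        # back[t-1][j]: predecessor of (t, j) on that path
    for t in range(1, len(unary)):
        row, ptr = [], []
        for j, u in enumerate(unary[t]):
            cands = [best[t - 1][i] + binary[t - 1][i][j]
                     for i in range(len(unary[t - 1]))]
            i_star = max(range(len(cands)), key=cands.__getitem__)
            row.append(cands[i_star] + u)
            ptr.append(i_star)
        best.append(row)
        back.append(ptr)

    # Backtrack from the best node in the last frame.
    j = max(range(len(best[-1])), key=best[-1].__getitem__)
    score, path = best[-1][j], [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return score, path


unary = [[1.0, 2.0], [1.0, 1.0], [3.0, 0.5]]
binary = [[[0.0, 5.0], [0.0, 0.0]],
          [[0.0, 0.0], [1.0, 0.0]]]
print(best_proposal_path(unary, binary))  # (11.0, [0, 1, 0])
```

The cost is linear in the number of frames and quadratic in the number of proposals per frame, which is what makes sampling many proposals affordable.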

Final Spatiotemporal Graph

Results

Qualitative Results – “Girl”

Original video Ground truth

Selected object proposals Segmentation results

Region within the red boundary is the object region

Qualitative Results – “Parachute”

Original video Ground truth

Selected object proposals Segmentation results

Qualitative Results – “Birdfall”

Original video Ground truth Segmentation results

Qualitative Results – “Cheetah”

Qualitative Results – “Monkeydog”

SegTrack: Quantitative Results*

Video       Ours   [14]   [13]   [20]   [6]
Use GTs?    N      N      N      Y      Y
Birdfall    155    189    288    252    454
Cheetah     633    806    905    1142   1217
Girl        1488   1698   1785   1304   1755
Monkeydog   365    472    521    533    683
Parachute   220    221    201    235    502
Avg.        452    542    592    594    791

* Average per-frame pixel error rate. The smaller, the better.

Summary

• STG selects the moving object (one proposal per frame)

• STG selection is followed by pixel-level segmentation

• Performance improved by ~20%


How about multiple videos?

Dong Zhang, Omar Javed, and Mubarak Shah, “Video object co-segmentation by regulated maximum weight cliques”, ECCV, 2014

• Applications
  • Automatic Annotation
  • Unsupervised object detection & recognition
  • Re-Identification

[Figure: training image, annotation, testing image]

• Challenges
  • Appearance variation
  • Multiple object classes
  • High complexity

Framework

Input Videos → Object Proposal Tracklets Generation → Regulated Maximum Weight Cliques for Tracklets → MRF based Optimization → Object Co-Segmentation

Object Proposal Tracklets Generation

[Figure: object proposals in each frame are tracked backward and forward to form tracklets (e.g., Frame 31 track 1, Frame 31 track 2, Frame 61 track 2, …), for all proposals, in all frames]

$S_{simi}(x_m, x_n) = S_{app}(x_m, x_n) \cdot S_{loc}(x_m, x_n) \cdot S_{shape}(x_m, x_n)$

Regulated Maximum Weight Cliques for Tracklets

[Figure: tracklets from Video 1 and Video 2; Clique 1: all chickens; Clique 2: all turtles]

Each tracklet is a node.

Node weight: $W(X) = \sum_{i=1}^{f} S_{object}(x_i)$

Find Regulated Maximum Weight Cliques by our modified Bron-Kerbosch Algorithm
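For orientation, here is the standard (unmodified) Bron-Kerbosch enumeration, used to pick the maximal clique with the largest total node weight. It is not the modified, regulated variant named on the slide, whose details are not given here. The sketch assumes non-negative node weights, so that a maximum weight clique is always among the maximal cliques; the function name and the toy graph are hypothetical.

```python
def max_weight_clique(adj, weight):
    """adj: dict node -> set of neighbours (undirected graph).
    weight: dict node -> non-negative node weight.
    Returns (clique as a set, total weight)."""
    best = {"w": float("-inf"), "clique": set()}

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already processed nodes.
        if not p and not x:                      # r is a maximal clique
            w = sum(weight[v] for v in r)
            if w > best["w"]:
                best["w"], best["clique"] = w, set(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return best["clique"], best["w"]


# Toy tracklet graph: a-b-c form a triangle, c-d and d-e are single edges.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e"}, "e": {"d"}}
weight = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 5.0, "e": 5.0}
print(max_weight_clique(adj, weight))
```

In the toy graph the heaviest clique is {d, e} with weight 10, even though the triangle has more nodes.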

Results

Chicken & Turtle (red: first object; green: second object)

[Figure: original videos and co-segmentation results]

Elephant & Giraffe

[Figure: original videos and co-segmentation results]

Lion & Zebra

[Figure: original videos and co-segmentation results]

Quantitative Results: MOViCS Dataset

Video Set          Ours1   Ours2   VCS[4]   ICS[13]
Chicken&turtle     0.860   0.860   0.65     0.08
Zebra&lion         0.588   0.636   0.48     0.23
Giraffe&elephant   0.528   0.639   0.52     0.07
Tiger              0.336   0.336   0.30     0.30
Overall            0.578   0.617   0.49     0.17

Ours1: same parameters for all video sets. Ours2: different parameters for each video set.

Numbers are intersection-over-union scores; the larger, the better.

Summary

• Type I STG for object segmentation

• Type II STG for object co-segmentation

• Results improved more than 20%


What is the most important object?

Human!

Dong Zhang and Mubarak Shah, “Human Pose Estimation in Videos”, ICCV, 2015

Dong Zhang and Mubarak Shah, “A Framework for Human Pose Estimation in Videos” (submitted), PAMI, 2016

An Example for Human Segmentation

Coarse segmentation

Pose Estimation

• Applications
  • Action recognition
  • HCI
  • Surveillance

• Challenges
  • Huge appearance variation
  • Multiple people
  • Consistent estimation

Framework

Input Videos → Body Part Hypotheses Generation → Body Part Tracking → Pose Hypotheses Generation → Tree-based Pose Estimation

[Figure: body-part graph over Frame f, Frame f+1, Frame f+2, …; nodes are body parts, connected by intra-frame and inter-frame edges]

Yellow edges: commonly used intra-frame edges

Blue edges: symmetric intra-frame edges

Red edges: inter-frame edges

Intra-frame Simple Cycles

Inter-frame Simple Cycles

Too Many Simple Cycles!

NP Hard!!!

Idea 1: Abstraction

Abstract Body Parts Relational Graph Real Body Parts Relational Graph

Remove intra-frame simple cycles

Idea 2: Association

Pose Relational Graph (Tracklet Graph)

Remove the inter-frame simple cycles

Pipeline: N-Best Hypotheses → Real Body Part Hypotheses → Abstract Body Part Hypotheses → Abstract Body Part Tracklets → Tree-based Pose Estimation

1. N-Best Hypotheses: generate many full body pose hypotheses for each video frame

2. Real Body Part Hypotheses: generate real body part hypotheses for the frames

3. Abstract Body Part Hypotheses: combine symmetric parts (Real Body Parts Relational Graph → Abstract Body Parts Relational Graph)

4. Abstract Body Part Tracklets: build the Tracklet Hypotheses Graph and get the best tracklets for each part

5. Tree-based Pose Estimation: build the Pose Hypotheses Graph and select the best poses
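Once the simple cycles are removed, inference over a tree of body parts can be done exactly by max-sum dynamic programming. The sketch below is a generic tree max-sum routine, not the scoring used in the talk; the function name, the input layout (`children`, `unary`, `pairwise`) and the toy scores are all hypothetical.

```python
def tree_max_sum(children, unary, pairwise, root=0):
    """Pick one hypothesis per part so the total score is maximal.

    children[p]          : list of child parts of part p (a tree)
    unary[p][k]          : score of hypothesis k for part p
    pairwise[(p, c)][k][l]: compatibility of hypothesis k of p with l of child c
    Returns (best score, dict part -> chosen hypothesis index).
    """
    msg, arg = {}, {}

    def up(p):
        # Leaves-to-root pass: msg[p][k] is the best score of p's subtree
        # given that p takes hypothesis k.
        total = list(unary[p])
        for c in children.get(p, []):
            up(c)
            for k in range(len(total)):
                scores = [pairwise[(p, c)][k][l] + msg[c][l]
                          for l in range(len(unary[c]))]
                l_star = max(range(len(scores)), key=scores.__getitem__)
                arg[(c, k)] = l_star
                total[k] += scores[l_star]
        msg[p] = total

    def down(p, k, out):
        # Root-to-leaves pass: read off the stored argmax choices.
        out[p] = k
        for c in children.get(p, []):
            down(c, arg[(c, k)], out)

    up(root)
    k_root = max(range(len(msg[root])), key=msg[root].__getitem__)
    choice = {}
    down(root, k_root, choice)
    return msg[root][k_root], choice


# Toy tree: part 0 (e.g. torso) with two children, two hypotheses per part.
children = {0: [1, 2]}
unary = {0: [0.0, 1.0], 1: [2.0, 0.0], 2: [0.0, 3.0]}
pairwise = {(0, 1): [[0.0, 0.0], [0.0, 5.0]],
            (0, 2): [[1.0, 0.0], [0.0, 0.0]]}
print(tree_max_sum(children, unary, pairwise))  # (9.0, {0: 1, 1: 1, 2: 1})
```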

Qualitative Results

Outdoor Dataset (video: warmup)

Ours N-Best

Outdoor Dataset (video: bounce)

Ours N-Best

Outdoor Dataset (videos: walk2, kick)

N-Best

N-Best Dataset (video: baseball)

Ours N-Best

N-Best Dataset (video: walkstraight)

Ours N-Best

HumanEva Dataset (video: Jog)

Ours N-Best

HumanEva Dataset (video: Walking)

Ours N-Best

Quantitative Results

PCP: Probability of a Correct Pose, a precision metric; the larger, the better.
KLE: Keypoint Localization Error, an error metric; the smaller, the better.

Outdoor Dataset

Metric   Method               Head   Torso   U.L.   L.L.   U.A.   L.A.   Average
PCP      Ours                 0.99   1.00    1.00   0.97   0.91   0.66   0.92
PCP      Ramakrishna et al.   0.99   0.86    0.95   0.96   0.86   0.52   0.86
PCP      Park et al.          0.99   0.83    0.92   0.86   0.79   0.52   0.82
KLE      Ours                 0.19   0.22    0.35   0.37   0.41   0.61   0.36
KLE      Ramakrishna et al.   0.39   0.58    0.48   0.48   0.88   1.42   0.71
KLE      Park et al.          0.44   0.58    0.55   0.69   1.03   1.65   0.82

HumanEva I Dataset

Metric   Method               Head   Torso   U.L.   L.L.   U.A.   L.A.   Average
PCP      Ours                 1.00   1.00    1.00   0.94   0.93   0.67   0.92
PCP      Ramakrishna et al.   0.99   1.00    0.99   0.98   0.99   0.53   0.91
PCP      Park et al.          0.97   0.97    0.97   0.90   0.83   0.48   0.85
KLE      Ours                 0.16   0.42    0.13   0.15   0.20   0.24   0.22
KLE      Ramakrishna et al.   0.27   0.48    0.13   0.22   1.14   1.07   0.55
KLE      Park et al.          0.23   0.52    0.24   0.35   1.10   1.18   0.60

N-Best Dataset

Metric   Method               Head   Torso   U.L.   L.L.   U.A.   L.A.   Average
PCP      Ours                 1.00   1.00    0.92   0.94   0.93   0.65   0.91
PCP      Ramakrishna et al.   1.00   0.69    0.91   0.89   0.85   0.42   0.80
PCP      Park et al.          1.00   0.61    0.86   0.84   0.66   0.41   0.73
KLE      Ours                 0.15   0.17    0.24   0.37   0.30   0.60   0.31
KLE      Ramakrishna et al.   0.53   0.88    0.67   1.01   1.70   2.68   1.25
KLE      Park et al.          0.54   0.74    0.80   1.39   2.39   4.08   1.66

Summary

• HPEV can be well formulated into STGs

• STGs can be employed in multiple stages of HPEV

• Improved results

Action Localization in Videos through Context Walk

Khurram Soomro, Haroon Idrees and Mubarak Shah ICCV-2015

Action Recognition

Diving Lifting

Swing Bench Walking

Action Localization

1. Action Recognition

2. Action Detection
   a. Trimmed Videos
      i. Spatio-Temporal
   b. Untrimmed Videos
      i. Temporal
      ii. Spatio-Temporal

Diving

Lifting

Swing Bench

Challenges: Action Localization

• Cluttered Background

• Multiple Actors/Actions

• Untrimmed Videos

Basketball Dunk

Salsa Spin

Hand Waving/Clapping/Boxing

Applications of Action Localization

• Video Search

• Action Retrieval

• Multimedia Event Recounting

• Video Understanding

Existing Solutions to Action Localization

• 1) Learn an action detector

• 2) Exhaustively search in testing videos

• The sliding-window approach is IMPRACTICAL and WASTEFUL!
  • Videos:
    • Untrimmed (longer duration)
    • High resolution

• Action Localization in Videos through Context Walk
  • An efficient approach for action localization
  • Use of context relations that exist in videos: action-scene and intra-action
  • Action contours instead of bounding boxes

Motivation → Context Graph → Context Walk → CRF → Results

• Context Relations
  • Learn spatio-temporal relations between all the supervoxels and those within the action (actor bounding box)
  • Arrows represent three-dimensional displacement vectors capturing:
    • Action-scene relations
    • Intra-action relations

• Context Graph
  • Given supervoxels in the nth training video
  • Construct a directed graph Gn(Vn, En) for the video
    • Vn = supervoxel nodes
    • En = spatio-temporal relations
  • Edges emanate from all the nodes (supervoxels) to the nodes (supervoxels) contained within the actor bounding box

[Figure: directed graph showing action-scene relations and intra-action relations]

• Context Walk
  • Given a testing video:
    1. Construct an undirected graph G(V, E)
       • Edges exist between spatio-temporal neighbors
    2. Randomly select an initial node
    3. Find the nearest-neighbor supervoxel from the training data
    4. Project displacement vectors onto testing supervoxels
    5. Select the next node with maximum probability; repeat (steps 3-5)
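A minimal NumPy sketch of the walk over test supervoxels is given below. Several choices are assumptions made for illustration: the neighbor graph from the first step is omitted, the initial node is passed in as `start` (the caller can draw it at random), projected displacements vote through a Gaussian kernel of width `sigma`, and the walk never revisits a node. The function name and all parameter names are hypothetical.

```python
import numpy as np


def context_walk(centroids, features, train_features, train_displacements,
                 steps=5, sigma=1.0, start=0):
    """centroids: (N, 3) test supervoxel centres (x, y, t).
    features: (N, D) test supervoxel descriptors.
    train_features: (M, D) training supervoxel descriptors.
    train_displacements: list of M arrays, each (K_m, 3), pointing from a
        training supervoxel to supervoxels inside the actor bounding box.
    Returns (normalized confidence per test supervoxel, visited nodes)."""
    centroids = np.asarray(centroids, dtype=float)
    features = np.asarray(features, dtype=float)
    train_features = np.asarray(train_features, dtype=float)
    confidence = np.zeros(len(centroids))
    visited, current = [start], start

    for _ in range(steps):
        # Nearest training supervoxel in feature space.
        dists = np.linalg.norm(train_features - features[current], axis=1)
        nn = int(np.argmin(dists))
        # Project its displacement vectors from the current centroid.
        disp = np.asarray(train_displacements[nn], dtype=float).reshape(-1, 3)
        targets = centroids[current] + disp                       # (K, 3)
        # Each projected location votes for nearby test supervoxels
        # (Gaussian kernel: an assumption of this sketch).
        d2 = ((centroids[:, None, :] - targets[None, :, :]) ** 2).sum(axis=2)
        confidence += np.exp(-d2 / (2.0 * sigma**2)).sum(axis=1)
        # Move to the most confident supervoxel that was not visited yet.
        masked = confidence.copy()
        masked[visited] = -np.inf
        if not np.isfinite(masked).any():
            break
        current = int(np.argmax(masked))
        visited.append(current)

    total = confidence.sum()
    return (confidence / total if total > 0 else confidence), visited


# Toy scene: supervoxels 1 and 2 belong to the action, 0 and 3 are background.
centroids = [[0, 0, 0], [5, 0, 0], [5, 1, 0], [10, 0, 0]]
features = [[0.0], [1.0], [1.0], [2.0]]
train_features = [[0.0], [1.0], [2.0]]
train_displacements = [[[5, 0, 0]], [[0, 0, 0], [0, 1, 0]], [[-5, 0, 0]]]
conf, visited = context_walk(centroids, features, train_features,
                             train_displacements, steps=3, sigma=0.5)
print(visited, np.argsort(conf)[-2:])
```

Starting from a background supervoxel, the learned action-scene displacement sends the walk to the action region, where intra-action displacements keep the confidence concentrated.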

Proposed Framework for Context Walk

(a) Segment video into supervoxels (SVs)

(b) Construct spatio-temporal graph G(V, E) using all SVs

(c) Search NNs using SV features, then project displacement vectors

(d) Update SVs' conditional distribution using all NNs

(e) Select SV with highest confidence

(f) Repeat for T steps

(g) Segment action proposals through CRF + SVM classification

[Figure labels: Training Video Nc; SV (v), SV features ( ); Context Walk; CRF + SVM]

• UCF Sports Dataset

[Figure: annotated actor bounding box and action localization contour]

• Sub-JHMDB Dataset

[Figure: annotated actor bounding box and action localization contour]

• THUMOS’13 Dataset

• Quantitative Results (UCF Sports)

• Quantitative Results (sub-JHMDB)

• Quantitative Results (THUMOS’13)

Summary

• Efficient and Effective approach for Action Localization

• Learn Contextual Relations in the form of relative locations between different video regions

• Use Context Walk to select supervoxel at each step and predict the Action Location


Conclusion

• Generic Object Segmentation in Videos
  • Single video (CVPR-2013)
  • Multiple videos (ECCV-2014)

• Human Pose Estimation in Videos (ICCV-2015)

• Human Action Detection in Videos (ICCV-2015)

YouTube Presentations

https://www.youtube.com/user/UCFCRCV

Thank You
