Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
End-to-end Learning of Action Detection from Frame Glimpse in
Videos
Ezgi Pekşen SoysalHacettepe University
BIL722 - Advanced Topics in Computer Vision
Olga Russakovsky: The human side of computer vision
Input Output
t = 0 t = TRunningTalking
Task: what is the person doing?
Olga Russakovsky: The human side of computer vision
Input Output
Task: what is the person doing?
InterpretabilityEfficiencyAccuracy
t = 0 t = TRunningTalking
Olga Russakovsky: The human side of computer vision
Efficient video processing
Talkingt = 0 t = T
Running
Olga Russakovsky: The human side of computer vision
Efficient video processing
Talkingt = 0 t = T
Running
Olga Russakovsky: The human side of computer vision
Efficient video processing
[N. I. Badler. “Temporal Scene Analysis…” 1975]
“Knowing the output or the final state… there is no need to explicitly store many previous states”
Talkingt = 0 t = T
Running
Olga Russakovsky: The human side of computer vision
Efficient video processing
Dominant paradigm: sliding windows
t = 0 t = T
……
Used in all THUMOS challenge action detection entries
[OneVerSch 2014] [WanQiaTan 2014] KarSeiBim 2014] [YuaPeiNiMouKas 2015]
“Knowing the output or the final state… there is no need to explicitly store many previous states”
[N. I. Badler. “Temporal Scene Analysis…” 1975]
Talkingt = 0 t = T
Running
Olga Russakovsky: The human side of computer vision
Efficient video processing
“Knowing the output or the final state… there is no need to explicitly store many previous states”
“Time may be represented in several ways… The intervals between ‘pulses’ need not be equal.”
[N. I. Badler. “Temporal Scene Analysis…” 1975]
Talkingt = 0 t = T
Running
Olga Russakovsky: The human side of computer vision
t = 0 t = T
[YeuRusMorFei CVPR’16]
Our model for efficient action detection
Frame model
Input: A frame
Output
Olga Russakovsky: The human side of computer vision
t = 0 t = T
Our model for efficient action detection
Frame model
Output: Detection instance [start, end] Next frame to glimpse
Input: A frame
[ ]Output
[YeuRusMorFei CVPR’16]
Olga Russakovsky: The human side of computer vision
t = 0 t = T
Output: Detection instance [start, end] Next frame to glimpse
Our model for efficient action detection
Frame model
[ ][ ]OutputOutput
[YeuRusMorFei CVPR’16]
Olga Russakovsky: The human side of computer vision
t = 0 t = T
OutputOutput:
Detection instance [start, end] Next frame to glimpse
Our model for efficient action detection
Frame model
[ ][ ] [ ]
…OutputOutput
[YeuRusMorFei CVPR’16]
Olga Russakovsky: The human side of computer vision
t = 0 t = T
Output
Our model for efficient action detection
[ ]
Convolutional neural network (frame information)
Recurrent neural network (time information)
[ ] [ ]Output:
Detection instance [start, end] Next frame to glimpse…
OutputOutput
[YeuRusMorFei CVPR’16]
Olga Russakovsky: The human side of computer vision
t = 0 t = T
Output
Optional output: Detection instance [start, end]
Output: Next frame to glimpse
Our model for efficient action detection
…
� �
Convolutional neural network (frame information)
Recurrent neural network (time information)
OutputOutput
[ ]
[YeuRusMorFei CVPR’16]
Olga Russakovsky: The human side of computer vision
Training the detection instance outputPositive video Negative video
t = 0 t = T[ ] [ ] t = 0 t = TTraining data
[YeuRusMorFei CVPR’16]
Olga Russakovsky: The human side of computer vision
Training the detection instance outputPositive video Negative video
t = 0 t = T[ ] [ ] t = 0 t = TTraining data
Aside:• effective video annotation
[YeuRusJinAndMorFei UnderReview] [LiuRusDenBerFei ImageNetChallenge ’15]
• weakly supervised detection[YeuRamRusMorFei InPreparation]
[YeuRusMorFei CVPR’16]
Olga Russakovsky: The human side of computer vision
Training the detection instance output
Training data
Positive video Negative video
t = 0 t = T[ ] [ ] t = 0 t = T
d1 d2
Detections
t = 0 t = T[ ] [ ] t = 0 t = T
g1
[ ]d3
[ ]d4
g2
[YeuRusMorFei CVPR’16]
Olga Russakovsky: The human side of computer vision
Training the detection instance output
Training data
Positive video Negative video
t = 0 t = T[ ] [ ] t = 0 t = T
d1 d2
Detections
t = 0 t = T[ ] [ ] t = 0 t = T
g1
[ ]d3
[ ]d4
Reward for detection
g2
[YeuRusMorFei CVPR’16]
Olga Russakovsky: The human side of computer vision
Training the detection instance output
Training data
Positive video Negative video
t = 0 t = T[ ] [ ] t = 0 t = T
d1 d2
Detections
t = 0 t = T[ ] [ ] t = 0 t = T
g1
[ ]d3
[ ]d4
Reward for detection
g2
y3 = 2y2 = 1y1 = 1 y4 = 0
cross-entropyclassification loss
[YeuRusMorFei CVPR’16]
Olga Russakovsky: The human side of computer vision
Training the detection instance output
Training data
Positive video Negative video
t = 0 t = T[ ] [ ] t = 0 t = T
d1 d2
Detections
t = 0 t = T[ ] [ ] t = 0 t = T
g1
[ ]d3
[ ]d4
g2
Reward for detection
L2 distancelocalization loss
y3 = 2y2 = 1y1 = 1 y4 = 0
cross-entropyclassification loss
[YeuRusMorFei CVPR’16]
Olga Russakovsky: The human side of computer vision
Training the non-differentiable outputsTraining data
t = 0 t = T[ ] [ ]Detections
t = 0 t = T[ ] [ ][ ]
[YeuRusMorFei CVPR’16]
Olga Russakovsky: The human side of computer vision
Training the non-differentiable outputsTraining data
t = 0 t = T[ ] [ ]d1 d2
Detections
t = 0 t = T[ ] [ ]
Model’s action sequence a Frame 1 Frame 8 Frame 6
�
go to frame 8 go to frame 6
(1) whether to predict a detection
(2) where to look next
[ ]
Frame 15
d3
go to frame 15
[YeuRusMorFei CVPR’16]
Olga Russakovsky: The human side of computer vision
Training the non-differentiable outputsTraining data
t = 0 t = T[ ] [ ]Detections
t = 0 t = T[ ] [ ]
Frame 1 Frame 8 Frame 6
[ ]
Frame 15Model’s action sequence a
go to frame 8 go to frame 6(2) where to look next
go to frame 15
d1 d2 � (1) whether to predict a detectiond3
Train an policy for actions (1) and (2) using REINFORCE [Williams 1992]
[YeuRusMorFei CVPR’16]
Olga Russakovsky: The human side of computer vision
Training the non-differentiable outputsTraining data
t = 0 t = TDetections
t = 0 t = T[ ] [ ]
Frame 1 Frame 8 Frame 6
Reward for an action sequence :
Frame 15Model’s action sequence a
[ ] [ ]bad bad
go to frame 8 go to frame 6(2) where to look next
go to frame 15
[ ]good
d1 d2 � (1) whether to predict a detectiond3
Train an policy for actions (1) and (2) using REINFORCE [Williams 1992]
[YeuRusMorFei CVPR’16]
Olga Russakovsky: The human side of computer vision
Training the non-differentiable outputsTraining data
t = 0 t = TDetections
t = 0 t = T[ ] [ ]
Frame 1 Frame 8 Frame 6
Reward for an action sequence :
Frame 15
Objective:Gradient:
Monte-Carlo approximation:
Model’s action sequence a
[ ] [ ]bad bad
go to frame 8 go to frame 6(2) where to look next
go to frame 15
[ ]good
d1 d2 � (1) whether to predict a detectiond3
Train an policy for actions (1) and (2) using REINFORCE [Williams 1992]
[YeuRusMorFei CVPR’16]
Olga Russakovsky: The human side of computer vision
Accuracy
Interpretability
[YeuRusMorFei CVPR’16]
Efficiency
Olga Russakovsky: The human side of computer vision
✓
[YeuRusMorFei CVPR’16]
Accuracy
DatasetDetection AP at IOU 0.5
State-of-the-art Our result
THUMOS 2014 14.4 17.1ActivityNet sports 33.2 36.7ActivityNet work 31.1 39.9
Interpretability
Efficiency
Olga Russakovsky: The human side of computer vision
✓
[YeuRusMorFei CVPR’16]
Accuracy
DatasetDetection AP at IOU 0.5
State-of-the-art Our result
THUMOS 2014 14.4 17.1ActivityNet sports 33.2 36.7ActivityNet work 31.1 39.9
Interpretability
Efficiency✓Glimpse only 2% of video frames
Olga Russakovsky: The human side of computer vision
✓
[YeuRusMorFei CVPR’16]
Accuracy
DatasetDetection AP at IOU 0.5
State-of-the-art Our result
THUMOS 2014 14.4 17.1ActivityNet sports 33.2 36.7ActivityNet work 31.1 39.9
Interpretability
Efficiency✓Glimpse only 2% of video frames
Samping Detection AP at IOU 0.5
Uniform 9.3
Our glimpses 17.1
Olga Russakovsky: The human side of computer vision
✓
[YeuRusMorFei CVPR’16]
Accuracy
DatasetDetection AP at IOU 0.5
State-of-the-art Our result
THUMOS 2014 14.4 17.1ActivityNet sports 33.2 36.7ActivityNet work 31.1 39.9
Efficiency✓Glimpse only 2% of video frames
Samping Detection AP at IOU 0.5
Uniform 9.3
Our glimpses 17.1
Interpretability✓Javelin throw[ ]
Javelin throw[ ]Ground truth
Detections
Glimpses
Frames
Olga Russakovsky: The human side of computer vision
…
Algorithm Test data
Car, Person Building
AI vision system
Training data
cars
dogs
people
Reinforcement learning for human action detection [YeuRusMorFei CVPR’16]
Open-world probabilistic human-in-the-loop model [RusLiFei CVPR’15]
The human cost of developing computer vision expertise [RusDenDuKraSatEtal IJCV’15] [DenRusKraBerBerFei CHI’14]
Olga Russakovsky: The human side of computer vision
Data is critical Beyond classification
Resource allocation Broader context
Takeaways
Table Chair Bowl Dog Cat …
+ + - - - -
+ - + - + -
+ + - - - -
- - - + - -
Olga Russakovsky: The human side of computer vision
Open-world recognition
Challenging objects
Long-tail distributions
Designing open-world evaluation frameworks
Olga Russakovsky: The human side of computer vision
Large-scale video analysis
Formulating the right temporal questions
Multi-task modelsPlaying tennis
Visual fluents
Olga Russakovsky: The human side of computer vision
Collaborative visual systems
Understanding human intention
Knowledge acquisition
Effective teaching
Olga Russakovsky: The human side of computer vision
Stanford Artificial Intelligence Laboratory’sOutreach Summer (SAILORS) program
24 high school girls, 2 weeks Rigorous AI curriculum emphasizing humanistic applications
http://sailors.stanford.edu
AI will change the world. Who will change AI?
[VacWuChaRusSomFei SIGCSE’16]
Olga Russakovsky: The human side of computer vision
Acknowledgements
Jia DengAlexander Berg
Fei-Fei Li Sean Ma
Jonathan KrauseZhiheng Huang Andrej Karpathy Aditya Khosla
Michael Bernstein
Jia Li Yuanqing Lin
Ellen Klingbeil
Vittorio FerrariAmy Bearman Blake Carpenter
Davide Modolo
Jenny Jin
Misha Andriluka
Hao SuSanjeev Satheesh
Andrew Ng
Kai Yu
Greg Mori
Serena YeungSasha VezhnevetsDeva Ramanan
Abhinav Gupta
Vignesh Ramanathan
Olga Russakovsky: The human side of computer vision
…
Algorithm Test data
Car, Person Building
AI vision system
Training data
cars
dogs
people
http://cs.cmu.edu/~orussako [email protected]
Questions?
[RusDenHuaBerFei ICCV’13] [KliCarRusNg ICRA’10]Strongly supervised:[RusNg CVPR’10] [RusFei ECCVW’10] [RusGupRam UnderReview] [ModVezRusFei CVPR’15] [YeuRusJinAndMorFei UnderReview] Weakly supervised:[RusLinYuFei ECCV’12] [BeaRusFerFei UnderReview]
[RusLiFei CVPR’15] [YeuRusMorFei CVPR’16][RusDenSuKraSatEtal IJCV’15][DenRusKraBerBerFei CHI’14]