End-to-end Learning of Action Detection from Frame Glimpse in …aykut/classes/spring2016/bil722/... · End-to-end Learning of Action Detection from Frame Glimpse in Videos Ezgi Pekşen

End-to-end Learning of Action Detection from Frame Glimpse in

Videos

Ezgi Pekşen SoysalHacettepe University

BIL722 - Advanced Topics in Computer Vision

Olga Russakovsky: The human side of computer vision

Input Output

t = 0 t = TRunningTalking

Task: what is the person doing?


Input Output

Task: what is the person doing?

InterpretabilityEfficiencyAccuracy

t = 0 t = TRunningTalking


Efficient video processing

t = 0 t = T



Talkingt = 0 t = T

Running



Talkingt = 0 t = T

Running



[N. I. Badler. “Temporal Scene Analysis…” 1975]

“Knowing the output or the final state… there is no need to explicitly store many previous states”

Talkingt = 0 t = T

Running



Dominant paradigm: sliding windows

t = 0 t = T

……

Used in all THUMOS challenge action detection entries

[OneVerSch 2014] [WanQiaTan 2014] KarSeiBim 2014] [YuaPeiNiMouKas 2015]



Talkingt = 0 t = T

Running




“Time may be represented in several ways… The intervals between ‘pulses’ need not be equal.”


Talkingt = 0 t = T

Running


t = 0 t = T

[YeuRusMorFei CVPR’16]

Our model for efficient action detection

Frame model

Input: A frame

Output


t = 0 t = T


Frame model

Output: Detection instance [start, end] Next frame to glimpse

Input: A frame

[ ]Output



t = 0 t = T

Output: Detection instance [start, end] Next frame to glimpse


Frame model

[ ][ ]OutputOutput



t = 0 t = T

OutputOutput:

Detection instance [start, end] Next frame to glimpse


Frame model

[ ][ ] [ ]

…OutputOutput



t = 0 t = T

Output


[ ]

Convolutional neural network (frame information)

Recurrent neural network (time information)

[ ] [ ]Output:

Detection instance [start, end] Next frame to glimpse…

OutputOutput



t = 0 t = T

Output

Optional output: Detection instance [start, end]

Output: Next frame to glimpse


…

� �

Convolutional neural network (frame information)

Recurrent neural network (time information)

OutputOutput

[ ]



Training the detection instance outputPositive video Negative video

t = 0 t = T[ ] [ ] t = 0 t = TTraining data



Training the detection instance outputPositive video Negative video

t = 0 t = T[ ] [ ] t = 0 t = TTraining data

Aside:• effective video annotation

[YeuRusJinAndMorFei UnderReview] [LiuRusDenBerFei ImageNetChallenge ’15]

• weakly supervised detection[YeuRamRusMorFei InPreparation]



Training the detection instance output

Training data

Positive video Negative video

t = 0 t = T[ ] [ ] t = 0 t = T

d1 d2

Detections

t = 0 t = T[ ] [ ] t = 0 t = T

g1

[ ]d3

[ ]d4

g2




Training data


t = 0 t = T[ ] [ ] t = 0 t = T

d1 d2

Detections

t = 0 t = T[ ] [ ] t = 0 t = T

g1

[ ]d3

[ ]d4

Reward for detection

g2




Training data


t = 0 t = T[ ] [ ] t = 0 t = T

d1 d2

Detections

t = 0 t = T[ ] [ ] t = 0 t = T

g1

[ ]d3

[ ]d4


g2

y3 = 2y2 = 1y1 = 1 y4 = 0

cross-entropyclassification loss




Training data


t = 0 t = T[ ] [ ] t = 0 t = T

d1 d2

Detections

t = 0 t = T[ ] [ ] t = 0 t = T

g1

[ ]d3

[ ]d4

g2


L2 distancelocalization loss

y3 = 2y2 = 1y1 = 1 y4 = 0

cross-entropyclassification loss



Training the non-differentiable outputsTraining data

t = 0 t = T[ ] [ ]Detections

t = 0 t = T[ ] [ ][ ]




t = 0 t = T[ ] [ ]d1 d2

Detections

t = 0 t = T[ ] [ ]

Model’s action sequence a Frame 1 Frame 8 Frame 6

�

go to frame 8 go to frame 6

(1) whether to predict a detection

(2) where to look next

[ ]

Frame 15

d3

go to frame 15




t = 0 t = T[ ] [ ]Detections

t = 0 t = T[ ] [ ]

Frame 1 Frame 8 Frame 6

[ ]

Frame 15Model’s action sequence a

go to frame 8 go to frame 6(2) where to look next

go to frame 15

d1 d2 � (1) whether to predict a detectiond3

Train an policy for actions (1) and (2) using REINFORCE [Williams 1992]




t = 0 t = TDetections

t = 0 t = T[ ] [ ]


Reward for an action sequence :

Frame 15Model’s action sequence a

[ ] [ ]bad bad


go to frame 15

[ ]good






t = 0 t = TDetections

t = 0 t = T[ ] [ ]


Reward for an action sequence :

Frame 15

Objective:Gradient:

Monte-Carlo approximation:

Model’s action sequence a

[ ] [ ]bad bad


go to frame 15

[ ]good





Accuracy

Interpretability


Efficiency


✓


Accuracy

DatasetDetection AP at IOU 0.5

State-of-the-art Our result

THUMOS 2014 14.4 17.1ActivityNet sports 33.2 36.7ActivityNet work 31.1 39.9

Interpretability

Efficiency


✓


Accuracy




Interpretability

Efficiency✓Glimpse only 2% of video frames


✓


Accuracy




Interpretability


Samping Detection AP at IOU 0.5

Uniform 9.3

Our glimpses 17.1


✓


Accuracy





Samping Detection AP at IOU 0.5

Uniform 9.3

Our glimpses 17.1

Interpretability✓Javelin throw[ ]

Javelin throw[ ]Ground truth

Detections

Glimpses

Frames


…

Algorithm Test data

Car, Person Building

AI vision system

Training data

cars

dogs

people

Reinforcement learning for human action detection [YeuRusMorFei CVPR’16]

Open-world probabilistic human-in-the-loop model [RusLiFei CVPR’15]

The human cost of developing computer vision expertise [RusDenDuKraSatEtal IJCV’15] [DenRusKraBerBerFei CHI’14]


Data is critical Beyond classification

Resource allocation Broader context

Takeaways

Table Chair Bowl Dog Cat …

+ + - - - -

+ - + - + -

+ + - - - -

- - - + - -


What next?


Open-world recognition

Challenging objects

Long-tail distributions

Designing open-world evaluation frameworks


Large-scale video analysis

Formulating the right temporal questions

Multi-task modelsPlaying tennis

Visual fluents


Collaborative visual systems

Understanding human intention

Knowledge acquisition

Effective teaching


AI will change the world. Who will change AI?


Stanford Artificial Intelligence Laboratory’sOutreach Summer (SAILORS) program

24 high school girls, 2 weeks Rigorous AI curriculum emphasizing humanistic applications

http://sailors.stanford.edu

AI will change the world. Who will change AI?

[VacWuChaRusSomFei SIGCSE’16]


Acknowledgements

Jia DengAlexander Berg

Fei-Fei Li Sean Ma

Jonathan KrauseZhiheng Huang Andrej Karpathy Aditya Khosla

Michael Bernstein

Jia Li Yuanqing Lin

Ellen Klingbeil

Vittorio FerrariAmy Bearman Blake Carpenter

Davide Modolo

Jenny Jin

Misha Andriluka

Hao SuSanjeev Satheesh

Andrew Ng

Kai Yu

Greg Mori

Serena YeungSasha VezhnevetsDeva Ramanan

Abhinav Gupta

Vignesh Ramanathan


…

Algorithm Test data

Car, Person Building

AI vision system

Training data

cars

dogs

people

http://cs.cmu.edu/~orussako [email protected]

Questions?

[RusDenHuaBerFei ICCV’13] [KliCarRusNg ICRA’10]Strongly supervised:[RusNg CVPR’10] [RusFei ECCVW’10] [RusGupRam UnderReview] [ModVezRusFei CVPR’15] [YeuRusJinAndMorFei UnderReview] Weakly supervised:[RusLinYuFei ECCV’12] [BeaRusFerFei UnderReview]

[RusLiFei CVPR’15] [YeuRusMorFei CVPR’16][RusDenSuKraSatEtal IJCV’15][DenRusKraBerBerFei CHI’14]

Documents

End-to-end Learning of Action Detection from Frame Glimpse in …aykut/classes/spring2016/bil722/... · End-to-end Learning of Action Detection from Frame Glimpse in Videos Ezgi Pekşen