Ali Al-Raziqi- Unsupervised Framework for Interactions Modeling between Multiple Objects

Unsupervised Framework for Interactions Modeling between Multiple Objects

Ali Al-Raziqi, Joachim Denzler

Computer Vision Group, Department of Mathematics and Computer Science

Friedrich Schiller University of Jena, Germany

Ali.Al-Raziqi,[email protected]

http://www.inf-cv.uni-jena.de/

March 4, 2016


Outline

1 Introduction

2 Interaction Modeling

3 Experiments and Results

4 Conclusion












Introduction

Activity recognition

Activity datasets: [Gorelick,2007, Ryoo,2010, Blunsden,Scott,et al. 2010]






Motivation

Our goal is to build an unsupervised system that extracts the interactions between objects in a video sequence.

Current systems for modeling object interactions mostly rely on supervised learning methods.












Interactions Samples (InGroup and Fight)












Tracking

All cavies are tracked with a tracking-by-detection method: cavies are first detected in each frame.
These detections are then associated across successive frames using the two-stage graph tracking approach of [Jiang, Xiaoyan, et al., 2012].
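To make the association step concrete, here is a minimal sketch of linking detections between two consecutive frames. It is not the two-stage graph tracker of [Jiang, Xiaoyan, et al., 2012]; it swaps in a much simpler greedy matcher based on bounding-box overlap (IoU). The function names, the `(x1, y1, x2, y2)` box format and the `min_iou` threshold are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(prev_boxes, curr_boxes, min_iou=0.3):
    """Greedily match boxes of two consecutive frames by descending IoU.

    Returns a sorted list of (prev_index, curr_index) pairs. Boxes whose
    best overlap is below min_iou (an assumed threshold) stay unmatched.
    """
    candidates = sorted(
        ((iou(p, c), i, j)
         for i, p in enumerate(prev_boxes)
         for j, c in enumerate(curr_boxes)),
        reverse=True,
    )
    used_prev, used_curr, matches = set(), set(), []
    for score, i, j in candidates:
        if score < min_iou:
            break  # all remaining candidates overlap even less
        if i in used_prev or j in used_curr:
            continue  # each box takes part in at most one match
        matches.append((i, j))
        used_prev.add(i)
        used_curr.add(j)
    return sorted(matches)
```

A graph-based tracker such as the one cited above optimizes the association over many frames at once, which a frame-to-frame greedy matcher cannot do.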






Optical Flow

The output of the tracking algorithm is represented as bounding boxes (BBs).
Optical flow inside the BB regions is computed using the TV-L1 algorithm [Zach, Christopher, et al., 2007].
One flow word is w = (x_i, y_i, u_i, v_i), with the flow direction quantized into eight directions.
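The quantization of flow vectors into words can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides only state that the flow is quantized into eight directions, so the spatial grid (`cell`) and the magnitude threshold (`min_magnitude`) are assumptions, as is the function name.

```python
import numpy as np


def quantize_flow_words(x, y, u, v, min_magnitude=0.5, cell=10):
    """Turn per-pixel flow vectors into discrete flow words.

    x, y: pixel coordinates; u, v: flow components at those pixels.
    Returns an integer array with one row (cell_x, cell_y, direction)
    per kept pixel, where direction is in 0..7.
    """
    x, y, u, v = (np.asarray(a, dtype=float) for a in (x, y, u, v))
    # Assumption: near-static pixels carry no useful motion and are dropped.
    keep = np.hypot(u, v) >= min_magnitude
    angle = np.arctan2(v[keep], u[keep])  # radians in [-pi, pi]
    # Eight 45-degree bins centred on 0, 45, 90, ... degrees.
    direction = np.round(angle / (np.pi / 4)).astype(int) % 8
    # Assumption: positions are coarsened to a grid of cell x cell pixels.
    cell_x = (x[keep] // cell).astype(int)
    cell_y = (y[keep] // cell).astype(int)
    return np.stack([cell_x, cell_y, direction], axis=1)
```

Each distinct row can then serve as one entry of the dictionary of flow words.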







[Figure: pipeline diagram extended with the Clips and Bag-of-Words stages; each clip is shown as a table of flow words and their counts.]

Clips

The videos are divided into clips of equal size.
Each clip is represented by its flow words.
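As a rough sketch of this step, the snippet below groups per-frame flow words into fixed-length clips and counts the words in each clip. The function name, the `clip_len` parameter and the input layout (one list of hashable words per frame) are assumptions for illustration.

```python
from collections import Counter


def clip_bags(frame_words, clip_len):
    """Group per-frame flow words into clips and count the words per clip.

    frame_words: list with one entry per frame, each a list of hashable
    flow words. Returns one Counter (word -> count) per clip. The last
    clip may be shorter when the frame count is not a multiple of clip_len.
    """
    bags = []
    for start in range(0, len(frame_words), clip_len):
        bag = Counter()
        for words in frame_words[start:start + clip_len]:
            bag.update(words)
        bags.append(bag)
    return bags
```

Each resulting counter plays the role of one document in the topic model, with flow words as its vocabulary.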







[Figure: full pipeline diagram with the stages Sequence, Tracking, Flow Words Extraction, Dictionary, Clips, Bag-of-Words, HDP Model and Interactions.]






Why topic models?

Assumption

Suppose you have a huge number of documents.
You want to know what's going on.
You can't read them all (e.g. every New York Times article from the 90's).
Topic models offer a way to get a corpus-level view of major themes.
They are unsupervised.

Some slides are taken from Jordan Boyd-Graber with permission.












Conceptual Approach

From an input corpus and number of topics K → words to topics

Corpus (example headlines):

Forget the Bootleg, Just Download the Movie Legally
Multiplex Heralded As Linchpin To Growth
The Shape of Cinema, Transformed At the Click of a Mouse
A Peaceful Crew Puts Muppets Where Its Mouth Is
Stock Trades: A Better Deal For Investors Isn't Simple
The three big Internet portals begin to distinguish among themselves as shopping malls
Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens






Resulting topics (shown on the slide as TOPIC 1, TOPIC 2 and TOPIC 3), each a group of words:

computer, technology, system, service, site, phone, internet, machine
play, film, movie, theater, production, star, director, stage
sell, sale, store, product, business, advertising, market, consumer






Generative Model

Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's iTunes music store and other online services ...












Hierarchical Dirichlet Process (HDP)

HDP was originally designed for clustering words in documents based on word co-occurrences, not on distances in feature space [Teh, Yee Whye,2006].
The number of clusters is deduced automatically from the data and the hyper-parameters.






HDP

[Figure: plate diagram with variables θ_d, z_n, w_n and β_k, hyper-parameters α and λ, and plates M, N and K.]

Generative process:
For each topic k ∈ {1, ..., ∞}, draw a multinomial distribution β_k from a Dirichlet distribution.
For each document d ∈ {1, ..., M}, draw a multinomial distribution θ_d from a Dirichlet distribution with parameter α.
For each word position n ∈ {1, ..., N}, select a hidden topic z_n from the multinomial distribution parameterized by θ_d.
Choose the observed word w_n from the distribution β_{z_n}.
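The generative steps can be simulated in a few lines. The sketch below is a simplification under stated assumptions: it truncates the number of topics to a fixed `n_topics`, whereas an HDP lets the number of topics grow with the data, and it assumes that λ from the diagram is the parameter of the Dirichlet prior over the topic-word distributions. All function and parameter names are hypothetical.

```python
import numpy as np


def generate_corpus(n_docs, n_words, n_topics, vocab_size,
                    alpha=1.0, lam=0.5, seed=0):
    """Sample documents from a finite-topic version of the generative steps.

    Returns (beta, docs): beta has shape (n_topics, vocab_size) and docs is
    a list of integer arrays holding word ids.
    """
    rng = np.random.default_rng(seed)
    # Step 1 (truncated): one word distribution beta_k per topic.
    beta = rng.dirichlet(np.full(vocab_size, lam), size=n_topics)
    docs = []
    for _ in range(n_docs):
        # Step 2: topic proportions theta_d for this document.
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # Step 3: a hidden topic z_n for every word position.
        z = rng.choice(n_topics, size=n_words, p=theta)
        # Step 4: the observed word w_n drawn from beta_{z_n}.
        w = np.array([rng.choice(vocab_size, p=beta[k]) for k in z])
        docs.append(w)
    return beta, docs
```

In the setting of these slides, a document corresponds to a clip and a word to a quantized flow word; inference runs this process in reverse to recover the topics from the observed words.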












Experiments and Results

We performed several experiments on the Cavy dataset and the benchmark dataset Behave [Blunsden,Scott,et al. 2010].
As the Cavy dataset does not contain ground truth, we marked the semantically meaningful interactions in the scene.
Then, similar to the procedures in [Kuettel,2010, Krishna,2014], the output of our system is manually mapped to the ground truth labels and the accuracy is calculated.
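The evaluation procedure can be illustrated with a small sketch: every discovered cluster (topic) is assigned a ground-truth label by hand, and accuracy is the fraction of clips whose mapped label matches the annotation. The function name, the data layout and the fallback label for unmapped clusters are assumptions made for illustration.

```python
def mapped_accuracy(cluster_ids, true_labels, cluster_to_label, unmapped="NoInt"):
    """Accuracy after mapping each discovered cluster to a ground-truth label.

    cluster_ids: cluster index assigned to each clip by the model.
    true_labels: annotated interaction label of each clip.
    cluster_to_label: manual mapping {cluster index: label}.
    unmapped: label used for clusters missing from the mapping (assumed).
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have equal length")
    if not true_labels:
        raise ValueError("at least one clip is required")
    predicted = [cluster_to_label.get(c, unmapped) for c in cluster_ids]
    correct = sum(p == t for p, t in zip(predicted, true_labels))
    return correct / len(true_labels)
```

For example, with clusters `[0, 0, 1, 2]`, labels `["Split", "Split", "Fight", "Approach"]` and the mapping `{0: "Split", 1: "InGroup"}`, two of the four clips are counted as correct.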






Behave Dataset

The Behave dataset consists of four video sequences with 76,800 frames in total.
It was recorded at 25 frames per second with a resolution of 640 × 480 pixels.
The number of objects involved in an interaction ranges from 2 to 5.
Tracking ground truth is available, but not for the whole dataset.






Comparison

Interaction recognition comparison with [Kim,2014] and [Munch,2012] (accuracy in %)

Category       Ours     [Kim,2014]   [Munch,2012]
Approach       68.42    83.33        60.00
Split          66.42    100.00       70.00
WalkTogether   75.00    91.66        45.00
InGroup        53.73    100.00       90.00
Average        65.95    93.74        66.25






Comparison

Interaction recognition comparison with [Kim,2014] and [Yin,2012] (accuracy in %)

Category       Ours     [Kim,2014]   [Yin,2012]
Split          66.42    100.00       93.10
WalkTogether   75.00    91.66        92.10
InGroup        53.73    100.00       94.30
Fight          80.00    83.33        95.10
Average        65.95    93.74        93.65






Cavy Dataset

Sequences are recorded from different views, with changing illumination and in different periods.
It contains 16 sequences with a resolution of 640 × 480 pixels, recorded at 7.5 frames per second (fps), with approximately 3 million frames in total (272 GB).
It contains five dominant interactions, performed several times by two or three cavies.






Interaction   Description
Approach      One object approaches another object or objects
InGroup       Several objects are close to each other and show little motion
Fight         Objects fight each other
Split         Object(s) split from one another
Follow        Object(s) follow another







Confusion Matrix

           Approach   Split   InGroup   Follow   Fight   NoInt
Approach   0.51       0.03    0.05      0.00     0.00    0.41
Split      0.01       0.28    0.03      0.00     0.01    0.67
InGroup    0.03       0.01    0.40      0.00     0.02    0.54
Follow     0.00       0.25    0.13      0.50     0.00    0.13
Fight      0.02       0.00    0.10      0.00     0.35    0.53
NoInt      0.06       0.01    0.14      0.01     0.05    0.73

# 6175 3738 48392
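Each row of the matrix sums to roughly one, which suggests that the entries are per-class rates rather than raw counts. The sketch below shows one way such a matrix can be built, assuming rows correspond to ground-truth labels and columns to predicted labels; that orientation, like the function name, is an assumption and is not stated on the slide.

```python
import numpy as np


def row_normalized_confusion(true_labels, predicted_labels, classes):
    """Build a confusion matrix and normalize every row to sum to one.

    Rows are ground-truth classes and columns are predicted classes
    (assumed orientation). Returns (counts, rates); a class without any
    ground-truth sample keeps an all-zero row in rates.
    """
    index = {c: i for i, c in enumerate(classes)}
    counts = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t], index[p]] += 1
    totals = counts.sum(axis=1, keepdims=True)
    rates = np.divide(counts, totals,
                      out=np.zeros(counts.shape, dtype=float),
                      where=totals > 0)
    return counts, rates
```

The diagonal of `rates` then gives the fraction of each class that was recognized correctly, and the NoInt column shows how often an interaction was missed altogether.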






Analysis

Different factors affect the results:
Errors raised by the detector (split, false, missing and merged objects).
Optical flow for fixed (non-moving) objects.

[Figure: examples of detector errors: Split, False, Missing, Merge.]






Conclusion

Our proposed approach combines the unsupervised clustering capabilities of the HDP with spatio-temporal features.
Furthermore, the Cavy dataset is introduced in this work.
The experiments have been performed on the Cavy dataset and the Behave dataset.
Our approach achieved an accuracy of up to 65.95% on the Behave dataset and up to 45% on the Cavy dataset.






Improvement

Robust detector and tracker.
Appearance-based features (SIFT, HOG and CNN).
Trajectory-based features (velocity, distance).






Thank you for your attention!






The Cavy dataset and annotated interactions are available at http://www.inf-cv.uni-jena.de/interaction_recognition.html


[Figure: Effects of hyper-parameter η on the number of extracted interactions. x-axis: hyper-parameter η (0.1 to 2); y-axis: number of interactions (10 to 40).]



[Figure: Effects of hyper-parameter η on the accuracy. x-axis: hyper-parameter η (0 to 2); y-axis: accuracy (0.5 to 0.7).]




References

Jiang, Xiaoyan and Rodner, Erik and Denzler, Joachim. Multi-person tracking-by-detection based on calibrated multi-camera systems. Computer Vision and Graphics.

Zach, Christopher and Pock, Thomas and Bischof, Horst. A duality based approach for realtime TV-L1 optical flow. Pattern Recognition.

Blunsden, Scott and Fisher, RB. The BEHAVE video dataset: ground truthed video for multi-person behavior classification. British Machine Vision Association.

Kim, Young-Ji and Cho, Nam-Gyu and Lee, Seong-Whan. Group Activity Recognition with Group Interaction Zone. ICPR.

Munch, David and Michaelsen, Eckart and Arens, Michael. Supporting fuzzy metric temporal logic based situation recognition by mean shift clustering. Advances in Artificial Intelligence.

Yin, Yafeng and Yang, Guang and Xu, Jin and Man, Hong. Small group human activity recognition. ICIP.

Kuettel, Daniel and Breitenstein, Michael D and Van Gool, Luc and Ferrari, Vittorio. What's going on? Discovering spatio-temporal dependencies in dynamic scenes. CVPR.

Mahesh Krishna and Joachim Denzler. A Combination of Generative and Discriminative Models for Fast Unsupervised Activity Recognition from Traffic Scene Videos. Proceedings of the IEEE (WACV).

Teh, Yee Whye and Jordan, Michael I and Beal, Matthew J and Blei, David M. Hierarchical Dirichlet processes. Journal of the American Statistical Association.

Lena Gorelick and Moshe Blank and Eli Shechtman and Michal Irani and Ronen Basri. Actions as Space-Time Shapes. Transactions on Pattern Analysis and Machine Intelligence.

Ryoo, M. S. and Aggarwal, J. K. UT-Interaction Dataset, ICPR contest on Semantic Description of Human Activities (SDHA). ICPR.

