Elements of behaviour recognition with ego-centric visual sensors. Application to AD patients assessment



Elements of behaviour recognition with ego-centric visual sensors. Application to AD patients assessment
Jenny Benois-Pineau, LaBRI – Université de Bordeaux – CNRS UMR 5800 / University Bordeaux 1
H. Boujut, V. Buso, L. Letoupin, R. Mégret, S. Karaman, V. Dovgalecs, Ivan Gonzalez-Diaz (University Bordeaux 1)
Y. Gaestel, J.-F. Dartigues (INSERM)

Summary
1. Introduction and motivation
2. Egocentric sensors: wearable videos
3. Fusion of multiple cues: early fusion framework
4. Fusion of multiple cues: late fusion and intermediate fusion
5. Model of activities as Active Objects & Context
   5.1. 3D localization from wearable camera
   5.2. Object recognition with visual saliency
6. Conclusion and perspectives

1. Introduction and motivation
Recognition of Instrumental Activities of Daily Living (IADL) of patients suffering from Alzheimer's disease.
Decline in IADL is correlated with future dementia.
IADL analysis:
- Surveys of the patient and relatives: subjective answers
- Observation of IADL with video cameras worn by the patient at home: objective observation of the evolution of the disease
Adjustment of the therapy for each patient: IMMED ANR, Dem@care IP FP7 EC.

2. Egocentric sensors: wearable video
Video acquisition setup:
- Wide-angle camera worn on the shoulder
- Non-intrusive and easy-to-use device
- IADL capture: from 40 minutes up to 2.5 hours
- Natural integration into the home-visit protocol carried out by paramedical assistants

Other egocentric sensors: Loxie ear-worn camera, eyeglasses worn with an eye-tracker, iPhone.
Wearable videos: 4 examples of activities recorded with this camera: making the bed, washing dishes, sweeping, hoovering (IMMED dataset).

Wearable camera analysis
Monitoring of Instrumental Activities of Daily Living (IADLs):
- Meal-related activities (preparation, meal, cleaning)
- Instrumental activities (telephone, ...)
For directed activities in @Lab or undirected activities in @Home and @NursingHome.
Advantages of the wearable camera:
- Provides a close-up view on IADLs: objects, location, actions
- Complements fixed sensors with more detailed information
- Follows the person in multiple places of interest (sink, table, telephone) without fully equipping the environment

[Example frame annotations: in front of the stove, pan, cooking, stool]
Wearable video datasets and scenarios
- IMMED dataset @Home (UB1, CHU Bordeaux): 12 volunteers and 42 patients
- ADL dataset @Home (UC Irvine): 20 healthy volunteers

H. Pirsiavash, D. Ramanan. "Detecting Activities of Daily Living in First-person Camera Views" CVPR 2012.

Unconstrained scenario.

Dem@care dataset @Lab (CHUN): 3 healthy, 5 MCI, 3 AD, 2 mixed dementia participants.

First pilot ongoing. Constrained scenario.

3. Fusion of multiple cues for activity recognition: early fusion framework (the IMMED framework)

Temporal partitioning (1)
Pre-processing: a preliminary step towards activity recognition.
Objectives:
- Reduce the gap between the amount of data (frames) and the target number of detections (activities)
- Associate one observation to one viewpoint
Principle:
- Use the global motion, i.e. ego-motion, to segment the video in terms of viewpoints
- One key-frame per segment: its temporal center
- Rough indexes for navigation throughout this long sequence shot
- Automatic video summary of each new video footage

Complete affine model of global motion with parameters (a1, a2, a3, a4, a5, a6), i.e. the per-pixel displacement dx(x, y) = a1 + a2·x + a3·y, dy(x, y) = a4 + a5·x + a6·y.

[Krämer, Benois-Pineau, Domenger] "Camera Motion Detection in the Rough Indexing Paradigm", TRECVID 2005.
Temporal partitioning (2)
Principle:
- Trajectories of corners are predicted from the global motion model
- A segment ends when at least 3 corner trajectories have reached out-of-bound positions

The threshold is defined as a percentage p of the image width w, with p = 0.2–0.25.
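A minimal sketch of this corner-trajectory segmentation, assuming per-frame affine parameters (a1..a6) have already been estimated; function and variable names are illustrative, not the original IMMED implementation.

```python
import numpy as np

def partition_segments(affine_params, w, h, p=0.2, min_out=3):
    """Split a video into viewpoint segments.

    affine_params: list of (a1..a6) global-motion parameters, one per frame.
    A segment ends when at least `min_out` of the 4 tracked corners have
    drifted farther than p * w from their initial position (out of bounds).
    """
    corners0 = np.array([[0, 0], [w - 1, 0], [0, h - 1], [w - 1, h - 1]], float)
    threshold = p * w
    segments, start, corners = [], 0, corners0.copy()
    for i, (a1, a2, a3, a4, a5, a6) in enumerate(affine_params):
        # displace each corner with the affine ego-motion model
        dx = a1 + a2 * corners[:, 0] + a3 * corners[:, 1]
        dy = a4 + a5 * corners[:, 0] + a6 * corners[:, 1]
        corners += np.stack([dx, dy], axis=1)
        drift = np.linalg.norm(corners - corners0, axis=1)
        if np.sum(drift > threshold) >= min_out:
            segments.append((start, i))        # close the current segment
            start, corners = i + 1, corners0.copy()
    segments.append((start, len(affine_params) - 1))
    return segments  # the key-frame of each segment is its temporal center
```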

Temporal partitioning (3), (4): video summary. 332 key-frames out of 17,772 initial frames; video summary rendered at 6 fps.

Color: MPEG-7 Color Layout Descriptor (CLD), 6 coefficients for luminance and 3 for each chrominance channel. For a segment: the CLD of the key-frame, x(CLD) ∈ R^12.

Localization: a feature vector adaptable to the individual home environment, with Nhome locations; x(Loc) ∈ R^Nhome. Localization is estimated for each frame; for a segment, the mean vector over the frames within the segment is used.
Audio: x(Audio), probabilistic features (SMN).
V. Dovgalecs, R. Mégret, H. Wannous, Y. Berthoumieu. "Semi-Supervised Learning for Location Recognition from Wearable Video", CBMI 2010, France.

Description space: fusion of features (1)

Description space (2). Audio: J. Pinquier, IRIT.
Htpe: log-scale histogram of the translation parameters' energy. It characterizes the global motion strength and aims to distinguish activities with strong versus low motion. Ne = 5 bins, sh = 0.2. Feature vectors x(Htpe, a1) and x(Htpe, a4) ∈ R^5.

Histograms are averaged over all frames within the segment

Description space (3). Motion: example values

Segment                  x(Htpe, a1)                      x(Htpe, a4)
Low-motion segment       0.87  0.03  0.02  0.00  0.08     0.93  0.01  0.01  0.00  0.05
Strong-motion segment    0.05  0.00  0.01  0.11  0.83     0.00  0.00  0.00  0.06  0.94
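A minimal sketch of how such a log-scale energy histogram could be computed from the per-frame translation parameters; the bin edges and normalization are assumptions based on the Ne = 5, sh = 0.2 settings above, not the exact IMMED implementation.

```python
import numpy as np

def htpe_histogram(a_values, n_bins=5, sh=0.2):
    """Log-scale histogram of translation-parameter energy.

    a_values: per-frame translation parameter (a1 or a4) of the affine model.
    Energies are binned on a logarithmic scale; `sh` controls the first bin edge.
    Returns an L1-normalized histogram of dimension n_bins.
    """
    energy = np.asarray(a_values, float) ** 2
    # assumed log-scale bin edges: [0, sh, sh*10, sh*100, ..., inf]
    edges = [0.0] + [sh * (10.0 ** k) for k in range(n_bins - 1)] + [np.inf]
    hist, _ = np.histogram(energy, bins=edges)
    return hist / max(hist.sum(), 1)

# Averaging the per-frame (one-hot) histograms over a segment is equivalent to
# computing this histogram directly over the segment's parameter values.
```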


Hc: cut histogram. The i-th bin of the histogram contains the number of temporal segmentation cuts within the last 2^i frames.

Example: Hc[1]=0, Hc[2]=0, Hc[3]=1, Hc[4]=1, Hc[5]=2, Hc[6]=7. The histogram is averaged over all frames within the segment. It characterizes the motion history, i.e. the strength of motion even outside the current segment. 2^6 = 64 frames ≈ 2 s, 2^8 = 256 frames ≈ 8.5 s; x(Hc) ∈ R^6 or R^8.
Residual motion on a 4×4 grid: x(RM) ∈ R^16.
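A small sketch of this cut histogram for one frame, assuming the list of cut positions (frame indices of segment boundaries) is available; the naming is illustrative.

```python
import numpy as np

def cut_histogram(frame_idx, cut_frames, n_bins=6):
    """Hc for one frame: bin i (1-based) counts the segmentation cuts in the last 2**i frames."""
    cuts = np.asarray(cut_frames)
    hc = np.zeros(n_bins, int)
    for i in range(n_bins):
        window = 2 ** (i + 1)
        hc[i] = np.sum((cuts <= frame_idx) & (cuts > frame_idx - window))
    return hc

# The per-segment feature x(Hc) is the average of this histogram over the segment's frames.
```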

Description space(4). Motion

Room recognition from wearable camera
Classification-based recognition:
- Multi-class classifiers, trained on a bootstrap video
- Temporal post-processing of the estimates (temporal accumulation)
- 4 global architectures considered (based on SVM classifiers)

Visual features considered:
- Bag-of-Visual-Words (BoVW)
- Spatial Pyramid Histograms (SPH)
- Composed Receptive Field Histograms (CRFH)

UB1 – wearable camera video analysis. Pipeline: features → SVM → temporal accumulation; classifiers compared: single feature, early fusion, late fusion, co-training with late fusion.
A semi-supervised multi-modal framework based on co-training and late fusion: co-training exploits unannotated data by transferring reliable information from one feature pipeline to the other, and time information is used to consolidate the estimates.

Global architecture: multiple features, late fusion, co-training. Combining co-training, temporal accumulation and late fusion produces the best results by leveraging non-annotated data.
Vladislavs Dovgalecs, Rémi Mégret, Yannick Berthoumieu. "Multiple Feature Fusion Based on Co-Training Approach and Time Regularization for Place Classification in Wearable Video". Advances in Multimedia, 2013, Article ID 175064.

Illustration of the correct detection rate on the IDOL2 database: time-aware co-training vs. late-fusion baseline vs. co-training.
IMMED dataset @Home (subset used for the experiments): 14 sessions at patients' homes; each session is divided into training/testing.

6 classes: bathroom, bedroom, kitchen, living-room, outside, other.
Training: on average 5 min per session (bootstrap); the patient is asked to present the various rooms of the apartment; used for training the location model.
Testing: on average 25 min per session; guided activities.

Average accuracy for room recognition on the IMMED dataset (RRWC performance on IMMED @Home):

Algorithm                      Feature space   Without temp. acc.   With temp. acc.
SVM                            BOVW            0.49                 0.52
SVM                            CRFH            0.48                 0.53
SVM                            SPH3            0.47                 0.49
Early fusion                   BOVW+SPH3       0.48                 0.50
Early fusion                   BOVW+CRFH       0.50                 0.54
Early fusion                   CRFH+SPH3       0.48                 0.51
Late fusion                    BOVW+SPH3       0.51                 0.56
Late fusion                    BOVW+CRFH       0.51                 0.56
Late fusion                    CRFH+SPH3       0.50                 0.54
Co-training with late fusion   BOVW+SPH3       0.50                 0.53
Co-training with late fusion   BOVW+CRFH       0.54                 0.58
Co-training with late fusion   CRFH+SPH3       0.54                 0.57

Discussion: the best performances are obtained with late-fusion approaches and temporal accumulation. Global accuracy is in the 50–60% range; this is a difficult dataset because the applicative constraints of IMMED lead to low-quality training data for location.

Dem@Care dataset @Lab: training (22 videos, total time 6h42min), testing (22 videos, total time 6h20min), average duration of one video: 18 min. 5 classes: phone_place, tea_place, tv_place, table_place, medication_place.

RRWC performance on Dem@Care @Lab

A larger, homogeneous training set yields 94% accuracy (late fusion).

Details on one test video: video 20120717a_gopro_split002; parameters: CRFH+SPH2, C = 1000; normalized score for the best class.
Discussion of the results:
- Feature fusion (late fusion) works best for place classification
- Semi-supervised training slightly improves performance on the IMMED dataset
- But performance is highly dependent on good training data: a sufficient amount of data and homogeneous content
Open issues: more selective criteria for recognition/rejection to improve the reliability of positive decisions.

Early fusion of all features: dynamic, static and audio features are combined into a single feature vector and fed to an HMM classifier that outputs activities.
Feature vector fusion (early fusion): CLD x(CLD) ∈ R^12; motion x(Htpe) ∈ R^10, x(Hc) ∈ R^6 or R^8, x(RM) ∈ R^16;

Localization: Nhome between 5 and 10, x(Loc) ∈ R^Nhome; audio: x(Audio) ∈ R^7. A sketch of this concatenation is given below.
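A minimal sketch of the early-fusion step, assuming the per-segment descriptors have already been computed; the dictionary keys and dimensions follow the description space above (config_max = 60 dimensions when all descriptors are used).

```python
import numpy as np

def early_fusion(segment_features, keys=("Audio", "Loc", "RM", "Htpe", "Hc", "CLD")):
    """Concatenate the selected per-segment descriptors into one observation vector.

    segment_features: dict mapping descriptor name -> 1-D numpy array, e.g.
    {"CLD": (12,), "Htpe": (10,), "Hc": (8,), "RM": (16,), "Loc": (7,), "Audio": (7,)}.
    Any non-empty subset of `keys` gives one of the 2**6 - 1 = 63 possible combinations.
    """
    return np.concatenate([np.ravel(segment_features[k]) for k in keys])
```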

Description space (5), (6). Possible combinations of descriptors: 2^6 − 1 = 63.

Descriptor    Audio   Loc   RM   Htpe   Hc   CLD   config_min   config_max
Dimensions    7       7     16   10     8    12    7            60

A two-level hierarchical HMM:
- Higher level: transitions between activities. Example activities: washing the dishes, hoovering, making coffee, making tea...
- Bottom level: activity description. Each activity is an HMM with 3/5/7/8 states; the observation model is a GMM; a prior probability is associated with each activity.

Model of the content: activities recognition


Bottom-level HMM: start/end non-emitting states; an observation x is produced only by the emitting states qi. Transition probabilities and GMM parameters are learnt with the Baum-Welch algorithm. The number of states is fixed a priori. HMM initialization: strong self-loop probability a_ii, weak exit probability a_i,end.
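A minimal sketch of how such a bottom-level activity HMM could be set up with hmmlearn (which does not model non-emitting start/end states or the hierarchical level, so this only approximates the per-activity model); the numbers of states and mixtures follow the slide, everything else is an assumption.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def make_activity_hmm(n_states=5, n_mix=2, loop_prob=0.9):
    """One HMM per activity, GMM observations, strong self-loop initialization."""
    model = GMMHMM(n_components=n_states, n_mix=n_mix, covariance_type="diag",
                   init_params="mcw",   # let fit() initialize only the GMM parameters
                   params="stmcw")      # Baum-Welch re-estimates everything
    # strong loop probability a_ii, weak probability of leaving the state
    transmat = np.full((n_states, n_states), (1.0 - loop_prob) / (n_states - 1))
    np.fill_diagonal(transmat, loop_prob)
    model.transmat_ = transmat
    model.startprob_ = np.r_[1.0, np.zeros(n_states - 1)]  # start in the first state
    return model

# model.fit(X, lengths) runs Baum-Welch on the concatenated segment observations;
# at decoding time, the activity whose model gives the highest likelihood
# (weighted by the activity prior) is selected.
```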

Activity recognition

Video corpus:

Corpus   Healthy volunteers / patients          Number of videos   Duration
IMMED    12 healthy volunteers                  15                 7h16
IMMED    42 patients                            46                 17h04
TOTAL    12 healthy volunteers + 42 patients    61                 24h20

Evaluation protocol:
precision = TP / (TP + FP)
recall = TP / (TP + FN)
accuracy = (TP + TN) / (TP + FP + TN + FN)
F-score = 2 / (1/precision + 1/recall)

Leave-one-out cross-validation scheme (one video left out); results are averaged. Training is performed over a sub-sampling of smoothed (10 frames) data. The label of a segment is derived by a majority vote over the frame results.

Recognition of activities: 5 videos. Which descriptors?
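For reference, a small helper implementing these four measures (straightforward; the naming is assumed).

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, accuracy and F-score from the confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = 2 / (1 / precision + 1 / recall)  # harmonic mean of precision and recall
    return precision, recall, accuracy, f_score
```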

3-state lower-level HMMs; a 1-state HMM for the "None" class.
Activities (16): PlantSpraying, RemoveDishes, WipeDishes, MedsManagement, HandCleaning, BrushingTeeth, WashingDishes, Sweeping, MakingCoffee, MakingSnack, PickingUpDust, PutSweepBack, HairBrushing, Serving, Phone and TV.
Result: 5 times better than chance.
Results using segments as observations.

Comparison with a GMM baseline (23 activities on 26 videos). The variance is high, which reflects the difficulty of the task. The mean values are better than in the 5-sequence experiment, owing to the larger training set.
23 activities: food manual preparation, displacement free, hoovering, sweeping, cleaning, making a bed, using dustpan, throwing into the dustbin, cleaning dishes by hand, body hygiene, hygiene beauty, getting dressed, gardening, reading, watching TV, working on computer, making coffee, cooking, using washing machine, using microwave, taking medicines, using phone, making home visit.

Hoovering

Conclusion on early fusion:
- Early fusion requires testing numerous combinations of features.
- The best results were achieved with the complete description space.
- For specific activities, the optimal combinations of descriptors vary and correspond to a common-sense approach.

S. Karaman, J. Benois-Pineau, V. Dovgalecs, R. Mégret, J. Pinquier, R. André-Obrecht, Y. Gaestel, J.-F. Dartigues. "Hierarchical Hidden Markov Model in detecting activities of daily living in wearable videos for studies of dementia", Multimedia Tools and Applications, Springer, 2012, pp. 1-29, DOI 10.1007/s11042-012-117-x (article in press).

4. Fusion of multiple cues: intermediate fusion and late fusion

Intermediate fusion

The different modalities (dynamic, static, audio) are treated separately. Each modality is represented by a stream, that is, a set of measures along time. Each state of the HMM models the observations of each stream separately with a Gaussian mixture, giving K streams of observations (see the sketch below).
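A minimal numpy sketch of the stream-HMM observation model implied here: the per-state emission likelihood is the product of per-stream GMM likelihoods (shown in the log domain, with optional stream weights). This illustrates the principle, not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def stream_log_likelihood(obs_streams, state_gmms, stream_weights=None):
    """Log-likelihood of one observation for one HMM state under K streams.

    obs_streams: list of K 1-D arrays (dynamic, static, audio features).
    state_gmms:  list of K GMMs, each given as a list of (weight, mean, cov) tuples.
    """
    K = len(obs_streams)
    w = stream_weights if stream_weights is not None else np.ones(K)
    total = 0.0
    for k in range(K):
        comps = [np.log(wk) + multivariate_normal.logpdf(obs_streams[k], m, c)
                 for wk, m, c in state_gmms[k]]
        total += w[k] * np.logaddexp.reduce(comps)   # log of the stream's GMM density
    return total
```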

Late fusion framework: dynamic, static and audio features are each fed to their own HMM classifier, and the classifier scores are combined for the final decision.
The late fusion approach uses a HHMM classifier for each subspace of the complete description space and merges the results of these preliminary classifiers for decision making. In our work, a classifier is trained for each modality (audio, dynamic, static). Each modality k is assigned an expertise trust score that is specific to each activity l. This score is a normalized performance computed from the performance measure perf_lk obtained by modality k for activity l:
e_lk = perf_lk / Σ_k' perf_lk'
A sketch of this weighting is given below.
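A small sketch of this late-fusion weighting, assuming per-modality classifier scores and performance measures are available; the names are illustrative.

```python
import numpy as np

def late_fusion_scores(modality_scores, modality_perf):
    """Combine per-modality classifier scores with activity-specific trust weights.

    modality_scores: array (K modalities, L activities) of classifier scores.
    modality_perf:   array (K, L) of perf_lk, e.g. the per-activity F-score of modality k.
    Returns the fused score per activity; the decision is its argmax.
    """
    scores = np.asarray(modality_scores, float)
    perf = np.asarray(modality_perf, float)
    e = perf / perf.sum(axis=0, keepdims=True)   # e_lk = perf_lk / sum_k perf_lk
    return (e * scores).sum(axis=0)

# activity = np.argmax(late_fusion_scores(scores, perf))
```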

Experimental video corpus in the 3-fusion experiment: 37 videos recorded by 34 persons (healthy volunteers and patients), for a total of 14 hours of content.
Results of the 3-fusion experiment (1)

Results of the 3-fusion experiment (2)

Conclusion on the 3-fusion experiment: overall, the experiments have shown that intermediate fusion consistently provides better results than the other fusion approaches on such complex data, supporting its use and extension in future work.

5. Model of activities as Active Objects & Context
Pipeline: wearable camera video → localization and object recognition → instrumental activity recognition → higher-level reasoning.

5.1. 3D localization from wearable camera
Problem statement:
- How to generate precise position events from wearable video?
- How to improve the precision of the 3D estimation?


Two-step positioning:
- Train: reconstruct 3D models of places from bootstrap frames; partial models of the environment focused on places of interest; Structure-from-Motion (SfM) techniques.
- Test: match test images to the 3D models; extraction of visual features; robust 2D-3D matching.

General architecture.
Objectness [1][2]:

Measure to quantify how likely it is for an image window to contain an object.

[1] Alexe, B., Deselaers, T. and Ferrari, V. "What is an object?", CVPR 2010.
[2] Alexe, B., Deselaers, T. and Ferrari, V. "Measuring the objectness of image windows", PAMI 2012.

3D model finalization

- Densification for visualization
- Global geolocalization of the map: manual, using natural markers on the scene
- Definition of specific 3D places of interest (coffee machine, sink), directly on the 3D map
3D positioning: using a Perspective-n-Point (PnP) approach; match 2D points in the frame with 3D points in the model and apply PnP techniques (a sketch follows below).

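A minimal sketch of this 2D-3D pose estimation with OpenCV's RANSAC-based PnP solver; the matching step that pairs image keypoints with 3D model points is assumed to have been done already, and the reprojection threshold is a placeholder.

```python
import numpy as np
import cv2

def camera_pose_from_matches(pts3d, pts2d, K, dist_coeffs=None):
    """Estimate the camera pose from matched 3D model points and 2D image points.

    pts3d: (N, 3) points of the place's SfM model; pts2d: (N, 2) matched image points;
    K: 3x3 camera intrinsics. Returns the rotation, translation and camera center.
    """
    dist = np.zeros(5) if dist_coeffs is None else dist_coeffs
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float32), pts2d.astype(np.float32), K, dist,
        reprojectionError=4.0, flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)      # rotation matrix from the rotation vector
    cam_center = -R.T @ tvec        # camera position in the model coordinate frame
    return R, tvec, cam_center, inliers
```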

Exploitation: zone detection

Object-camera distance over time (green: coffee machine, blue: sink). The distance to objects of interest defines zones (near / instrumental / far) and events (zone entrance/exit), producing discrete events suitable for higher-level analysis; see the sketch below.
Experiments: 2 test sequences of 68,000 and 63,000 frames; 6 classes of interest (desk, sink, printer, shelf, library, couch); repeated events; one bootstrap video passing through each place; 6 reference 3D models; reconstruction requires fewer than 100 frames per model.
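A tiny sketch of turning the object-camera distance signal into zone entrance events; the zone thresholds are placeholders, not values from the slides.

```python
def zone_events(distances, near=0.8, instrumental=2.0):
    """Convert a per-frame distance signal into zone entry events.

    Zones: 'near' (< near), 'instrumental' (< instrumental), otherwise 'far'.
    Returns a list of (frame_index, 'enter:<zone>') events; entering a zone
    implicitly exits the previous one.
    """
    def zone(d):
        return "near" if d < near else "instrumental" if d < instrumental else "far"

    events, current = [], None
    for i, d in enumerate(distances):
        z = zone(d)
        if z != current:
            events.append((i, f"enter:{z}"))
            current = z
    return events
```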

[Figure: sample frame and 3D point cloud for each of the six classes: desk, sink, printer, shelf, library, couch]

Recognition performances (confusion matrix axes: cuisine, range, salle, photocop., biblio, pc): P = 0.9074, R = 0.5816.
Classes: 1: pc, 2: biblio, 3: photocop., 4: salle, 5: range_bureau, 6: cuisine, 0: transition.
Localization rate: the first column and the first row correspond to the rejection (transition) class.

conf_matrix =
15057  10472    252   2703   2982    635   3283
  464   3063      0      0      0      0      0
  423      0   1470    985      0      0      0
  415      0      6   5287      0      0      0
   75      0      0      0   2692      0      0
  121      0      0      0      0    590      0
  535      0      0      0      0      0  16537

P = 0.8813, R = 0.6771.
Recognition performance with majority filtering over 100 frames.

Interest of the 3D model for indexing:
- 2D-3D matching: matching between test frames and the 3D models
- 2D-2D matching: direct matching between test frames and reference frames
(-) Overhead in building the 3D model
(+) Much better precision/recall than purely 2D analysis
(+) Matching against a consolidated model instead of a large set of images
Hazem Wannous, Vladislavs Dovgalecs, Rémi Mégret, Mohamed Daoudi. "Place Recognition via 3D Modeling for Personal Activity Lifelog Using Wearable Camera". MMM 2012: 244-254.

5.2. Object recognition with saliency
Many objects may be present in the camera field.

How to consider the object of interest?

Our proposal: By using visual saliency


IMMED DB. Our approach: modeling visual attention.
Several approaches exist: bottom-up or top-down; overt or covert attention; spatial or spatio-temporal; scanpath or pixel-based saliency.

Features: intensity, color and orientation (Feature Integration Theory [1]); HSI or L*a*b* color space; relative motion [2].

Plenty of models exist in the literature: in their 2012 survey, A. Borji and L. Itti [3] inventoried 48 significant visual attention methods.
[1] Anne M. Treisman & Garry Gelade. A feature-integration theory of attention. Cognitive Psychology, vol. 12, no. 1, pages 97-136, January 1980.
[2] Scott J. Daly. Engineering Observations from Spatiovelocity and Spatiotemporal Visual Models. In IS&T/SPIE Conference on Human Vision and Electronic Imaging III, volume 3299, pages 180-191, 1998.
[3] Ali Borji & Laurent Itti. State-of-the-art in Visual Attention Modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.

Itti's model

The most widely used model

Designed for still images

Does not consider the temporal dimension of videos.
[1] Itti, L.; Koch, C.; Niebur, E. "A model of saliency-based visual attention for rapid scene analysis", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, Nov 1998.

Spatio-temporal saliency modeling
Most spatio-temporal bottom-up methods work in the same way [1], [2]:
- Extraction of the spatial saliency map (static pathway)
- Extraction of the temporal saliency map (dynamic pathway)
- Fusion of the spatial and temporal saliency maps (fusion)
[1] Olivier Le Meur, Patrick Le Callet & Dominique Barba. Predicting visual fixations on video based on low-level visual features. Vision Research, vol. 47, no. 19, pages 2483-2498, Sep 2007.
[2] Sophie Marat, Tien Ho Phuoc, Lionel Granjon, Nathalie Guyader, Denis Pellerin & Anne Guérin-Dugué. Modelling spatio-temporal saliency to predict gaze direction for short videos. International Journal of Computer Vision, vol. 82, no. 3, pages 231-243, 2009.

Spatial saliency model
Based on the sum of 7 color contrast descriptors in the HSI domain [1][2]:
- Saturation contrast
- Intensity contrast
- Hue contrast
- Opposite color contrast
- Warm and cold color contrast
- Dominance of warm colors
- Dominance of brightness and hue
The 7 descriptors are computed for each pixel of a frame I using the 8-connected neighborhood. The spatial saliency map is computed as the per-pixel sum of these descriptors.

Finally, the map is normalized between 0 and 1 according to its maximum value.

[1] M.Z. Aziz & B. Mertsching. Fast and Robust Generation of Feature Maps for Region-Based Visual Attention. IEEE Transactions on Image Processing, vol. 17, no. 5, pages 633-644, May 2008.
[2] Olivier Brouard, Vincent Ricordel & Dominique Barba. Cartes de Saillance Spatio-Temporelle basées Contrastes de Couleur et Mouvement Relatif. In Compression et représentation des signaux audiovisuels, CORESA 2009, Toulouse, France, March 2009.

Temporal saliency model
The temporal saliency map is extracted in 4 steps [Daly 98][Brouard et al. 09][Marat et al. 09]:
1. The optical flow is computed for each pixel of frame i.
2. The motion is accumulated and the global motion is estimated.
3. The residual motion is computed by subtracting the estimated global (camera) motion from the optical flow.

Finally, the temporal saliency map is computed by filtering the amount of residual motion in the frame.

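A minimal sketch of the residual-motion idea using OpenCV dense optical flow and a least-squares global affine fit; the filtering step and all parameters are placeholders, not the authors' exact model.

```python
import cv2
import numpy as np

def temporal_saliency(prev_gray, gray):
    """Residual motion = dense optical flow minus the fitted global (camera) motion."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gray.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # least-squares fit of a 6-parameter affine global motion to the flow field
    A = np.stack([np.ones_like(xs), xs, ys], axis=-1).reshape(-1, 3).astype(np.float64)
    coef_x, _, _, _ = np.linalg.lstsq(A, flow[..., 0].reshape(-1), rcond=None)
    coef_y, _, _, _ = np.linalg.lstsq(A, flow[..., 1].reshape(-1), rcond=None)
    global_flow = np.stack([(A @ coef_x).reshape(h, w),
                            (A @ coef_y).reshape(h, w)], axis=-1)
    residual = np.linalg.norm(flow - global_flow, axis=-1)
    saliency = cv2.GaussianBlur(residual, (0, 0), 5)   # simple filtering placeholder
    return saliency / (saliency.max() + 1e-9)          # normalized to [0, 1]
```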

Saliency model improvement: spatio-temporal saliency models were designed for edited videos.

They are not well suited to unedited egocentric video streams. Our proposal: add a geometric saliency cue that takes camera motion anticipation into account.

1. H. Boujut, J. Benois-Pineau, and R. Megret. Fusion of multiple visual cues for visual saliency extraction from wearable camera settings with strong motion. In A. Fusiello, V. Murino, and R. Cucchiara, editors, Computer Vision – ECCV 2012, IFCV WS.
Geometric saliency model
A centered 2D Gaussian has already been applied in the literature [1] (center bias, Buswell, 1935 [2]) and is suitable for edited videos. Our proposal: train the center position as a function of camera motion, i.e. move the 2D Gaussian center according to the camera center motion, computed from the global motion estimation.

Considers the anticipation phenomenon [Land et al.].

Geometric saliency map.
[1] Tilke Judd, Krista A. Ehinger, Frédo Durand & Antonio Torralba. Learning to predict where humans look. In ICCV, pages 2106-2113. IEEE, 2009.
[2] Michael Dorr, et al. Variability of eye movements when viewing dynamic natural scenes. Journal of Vision (2010), 10(10):28, 1-17.
Geometric saliency model: the saliency peak is never located on the visible part of the shoulder.

Most of the saliency peaks are located in the upper two thirds of the frame.

So the 2D Gaussian center is set accordingly, in the upper part of the frame, and shifted according to the estimated camera motion.
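A small sketch of such a geometric saliency cue: a 2D Gaussian whose center starts in the upper part of the frame and is shifted by the estimated camera translation (anticipation). The center position and spread are placeholders, not the trained values from the paper.

```python
import numpy as np

def geometric_saliency(h, w, cam_dx=0.0, cam_dy=0.0, sigma_frac=0.25):
    """2D Gaussian geometric saliency map, shifted by the camera motion (dx, dy) in pixels."""
    cx = w / 2.0 + cam_dx            # horizontal center follows the camera motion
    cy = h / 3.0 + cam_dy            # placeholder: center in the upper third of the frame
    ys, xs = np.mgrid[0:h, 0:w]
    sigma = sigma_frac * min(h, w)
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    return g / g.max()               # normalized to [0, 1]
```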

Saliency peaks on frames from all videos of the eye-tracker experiment.
Saliency fusion

[Figure: frame, spatio-temporal-geometric saliency map, subjective saliency map]
Subjective saliency: D. S. Wooding's method, 2002 (tested over 5000 participants).

Eye fixations from the eye-tracker

Subjective saliency map: 2D Gaussians centered at the fixation points (fovea area = 2° spread).

How do people watch videos from a wearable camera? Psycho-visual experiment:

Gaze measured with an eye-tracker (Cambridge Research Systems Ltd. HS VET, 250 Hz).

31 HD video sequences from the IMMED database, duration 13'30''. 25 subjects (5 discarded); 6,562,500 gaze positions recorded. We noticed that subjects anticipate the camera motion.

Evaluation on the IMMED DB. Metric: Normalized Scanpath Saliency (NSS) correlation method.

Comparison of:
- Baseline spatio-temporal saliency
- Spatio-temporal-geometric saliency without camera motion
- Spatio-temporal-geometric saliency with camera motion
Results: up to 50% better than spatio-temporal saliency; up to 40% better than spatio-temporal-geometric saliency without camera motion.
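For reference, a small sketch of the NSS metric as commonly defined (the saliency map is z-scored, then averaged at the fixation locations); this follows the standard definition rather than any implementation detail from the talk.

```python
import numpy as np

def nss(saliency_map, fixations):
    """Normalized Scanpath Saliency: mean z-scored saliency at the fixation points.

    saliency_map: 2-D array; fixations: iterable of (row, col) gaze positions.
    """
    s = np.asarray(saliency_map, float)
    z = (s - s.mean()) / (s.std() + 1e-9)
    return float(np.mean([z[r, c] for r, c in fixations]))
```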

H. Boujut, J. Benois-Pineau, R. Megret. Fusion of Multiple Visual Cues for Visual Saliency Extraction from Wearable Camera Settings with Strong Motion. ECCV Workshops (3) 2012: 436-445.
Evaluation on the GTEA DB (IADL dataset): 8 videos, total duration 24'43''; eye-tracking measures for actors and observers; actors (8 subjects), observers (31 subjects, 15 subjects have seen each video); SD-resolution video at 15 fps recorded with eye-tracker glasses.


A. Fathi, Y. Li, and J. Rehg. Learning to recognize daily actions using gaze. In A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, editors, Computer Vision – ECCV 2012.
Evaluation on the GTEA DB: center bias (the camera is mounted on eyeglasses, so head movement compensates); the correlation is database-dependent.

Objective vs. subjective (viewer) saliency.
Proposed processing pipeline for object recognition:

Mask computation → local patch detection & description → BoW computation with a visual vocabulary → image matching / supervised classifier → image retrieval / object recognition. A spatially constrained approach using saliency methods (a sketch follows below).
GTEA dataset (constrained scenario): GTEA [1] is an ego-centric video dataset containing 7 types of daily activities performed by 14 different subjects. The camera is mounted on a cap worn by the subject. Scenes show 15 object categories of interest. Data are split into a training set (294 frames) and a test set (300 frames).
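A minimal sketch of a saliency-constrained BoW description: keypoints are kept according to the saliency mask before building the visual-word histogram. The detector choice, vocabulary size and thresholding rule are assumptions for illustration.

```python
import cv2
import numpy as np

def saliency_bow(image_gray, saliency, vocabulary, sal_thresh=0.5):
    """Bag-of-Words histogram restricted to salient regions.

    vocabulary: (K, 128) array of visual-word centers (e.g. k-means on SIFT descriptors).
    saliency:   map in [0, 1] with the same size as the image.
    """
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(image_gray, None)
    hist = np.zeros(len(vocabulary))
    if descriptors is None:
        return hist
    for kp, desc in zip(keypoints, descriptors):
        x, y = int(round(kp.pt[0])), int(round(kp.pt[1]))
        if saliency[y, x] < sal_thresh:        # discard patches outside the salient region
            continue
        word = np.argmin(np.linalg.norm(vocabulary - desc, axis=1))
        hist[word] += 1
    return hist / max(hist.sum(), 1)           # L1-normalized BoW vector
```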


Categories in the GTEA dataset with the number of positives in the train/test sets.
[1] Alireza Fathi, Yin Li, James M. Rehg. Learning to recognize daily actions using gaze, ECCV 2012.
Assessment of visual saliency in object recognition. We tested various parameters of the model:
- Various approaches for local region sampling: sparse detectors (SIFT, SURF) and dense detectors (grids at different granularities)
- Different options for the spatial constraints: without masks (global BoW), ideal manually annotated masks, saliency masks (geometric, spatial, fusion schemes)
In two tasks:
- Object retrieval (image matching): mAP ~ 0.45
- Object recognition (learning): mAP ~ 0.91

Sparse vs. dense detection

Detector + descriptor    Avg. number of points
Sparse SURF + SURF       137.9
Dense + SURF             1210

Experimental setup: GTEA database, 15 categories, SURF descriptor, object recognition task, SVM with χ² kernel, using ideal masks.

Dense detection greatly improves the performance at the expense of an increase in computation time (more points to be processed).
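A minimal sketch of the classification setup mentioned above (SVM with a χ² kernel over BoW histograms), using scikit-learn's chi2_kernel with a precomputed Gram matrix; the parameter values are placeholders.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def train_chi2_svm(X_train, y_train, gamma=1.0, C=1000.0):
    """Train a multi-class SVM with a chi-squared kernel on (non-negative) BoW histograms."""
    K_train = chi2_kernel(X_train, gamma=gamma)      # Gram matrix between training vectors
    clf = SVC(kernel="precomputed", C=C)
    clf.fit(K_train, y_train)
    return clf

def predict_chi2_svm(clf, X_test, X_train, gamma=1.0):
    K_test = chi2_kernel(X_test, X_train, gamma=gamma)  # rows: test, cols: train
    return clf.predict(K_test)
```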

Object recognition with saliency maps

The best masks: ideal, geometric, squared-with-geometric.
Object recognition on the ADL dataset (unconstrained scenario)

Approach: temporal pyramid matching. Features: the probability vector of objects O and the probability vector of places (2D approach) P; early fusion by concatenation, F = O + P (see the sketch below).
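A short sketch of a temporal pyramid over per-frame probability vectors with early fusion by concatenation; the number of pyramid levels and the mean pooling are assumptions.

```python
import numpy as np

def temporal_pyramid(frames_probs, levels=2):
    """Mean-pool per-frame probability vectors over a temporal pyramid.

    frames_probs: (T, D) array. Level l splits the video into 2**l equal chunks;
    the pooled chunk vectors of all levels are concatenated.
    """
    parts = []
    for l in range(levels + 1):
        for chunk in np.array_split(np.asarray(frames_probs), 2 ** l):
            parts.append(chunk.mean(axis=0))
    return np.concatenate(parts)

def fuse(objects_probs, places_probs, levels=2):
    """Early fusion F = [O, P]: concatenate the object and place pyramids."""
    return np.concatenate([temporal_pyramid(objects_probs, levels),
                           temporal_pyramid(places_probs, levels)])
```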

Results of activity recognition (ADL)

6. Conclusion and perspectives (1)
- The correlation of spatio-temporal saliency is database-dependent for egocentric videos
- A very good motion estimation is needed (now RANSAC-based)

- Saliency maps are good ROIs for object recognition in the task of interpreting wearable video content
- The model of activities as a combination of objects and location is promising

Conclusion and perspectives (2)
- Integration of 3D localization
- Tests on Dem@care constrained and unconstrained datasets (@Lab and @Home)
- Learning of manipulated objects vs. active objects


Metrics (averaged)   Early fusion   Interm. fusion   Late fusion (F-score trust)   Late fusion (Prec. trust)   Late fusion (Recall trust)
Accuracy             0.207          0.442            0.215                         0.188                       0.210
Precision            0.174          0.267            0.106                         0.097                       0.131
Recall               0.284          0.288            0.171                         0.155                       0.221
F-score              0.180          0.253            0.109                         0.092                       0.139

Activity recognition accuracy for our approach, computed at frame and segment level, respectively:

Approach               Avg. frame accuracy (%)   Avg. segment accuracy (%)
Our objects            24.2                      40.5
Our places             19.7                      11.1
Our objects + places   26.9                      41.3
Pirsiavash et al.      23                        36.9