
Scene Understanding: perception, multi-sensor fusion, spatio-temporal reasoning and activity recognition. François Brémond, PULSAR project-team, INRIA



  • Slide 1
  • 1 Scene Understanding: perception, multi-sensor fusion, spatio-temporal reasoning and activity recognition. François Brémond, PULSAR project-team, INRIA Sophia Antipolis, FRANCE. [email protected] http://www-sop.inria.fr/pulsar/ Key words: artificial intelligence, knowledge-based systems, cognitive vision, human behavior representation, scenario recognition.
  • Slide 2
  • 2 Video Understanding: Performance Evaluation (V. Valentin, R. Ma). ETISEO: French initiative for algorithm validation and knowledge acquisition: http://www-sop.inria.fr/orion/ETISEO/ Approach based on 3 critical evaluation concepts: (1) selection of test video sequences: follow a specified characterization of problems; study one problem at a time, at several levels of difficulty; collect long sequences for significance. (2) Ground-truth definition: up to the event level; give clear and precise instructions to the annotator (e.g., annotate both the visible and occluded parts of objects). (3) Metric definition: a set of metrics for each video processing task; performance indicators: sensitivity and precision.
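The two performance indicators can be computed directly from counts of matched detections; a minimal sketch (the function names are ours, not ETISEO's):

```python
def precision(tp, fp):
    """Fraction of detected items that match the ground truth."""
    return tp / (tp + fp) if tp + fp else 0.0

def sensitivity(tp, fn):
    """Fraction of ground-truth items that were detected (recall)."""
    return tp / (tp + fn) if tp + fn else 0.0
```

For example, 40 true positives with 5 misses and no false alarms gives precision 1.0 and sensitivity 40/45 ≈ 0.888.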
  • Slide 3
  • 3 Evaluation: current approach (A.T. Nghiem). ETISEO limitations: selection of video sequences according to difficulty levels is subjective; generalization of evaluation results is subjective; one video sequence may contain several video processing problems at many difficulty levels. Approach: treat each video processing problem separately; define a measure to compute difficulty levels of input data (e.g. video sequences); select video sequences containing only the current problem at various difficulty levels; for each algorithm, determine the highest difficulty level at which the algorithm still has acceptable performance. Approach validation: applied to two problems: detecting weakly contrasted objects, and detecting objects mixed with shadows.
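The last step, finding the highest difficulty level at which an algorithm still performs acceptably, can be sketched as follows (the 0.8 threshold and the score layout are illustrative assumptions, not values from the evaluation):

```python
def highest_acceptable_level(scores_by_level, threshold=0.8):
    """Return the highest difficulty level at which performance is
    still acceptable, scanning levels in increasing order and
    stopping at the first failure. Returns None if even the easiest
    level fails. The 0.8 threshold is purely illustrative."""
    best = None
    for level in sorted(scores_by_level):
        if scores_by_level[level] >= threshold:
            best = level
        else:
            break
    return best
```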
  • Slide 4
  • 4 Evaluation: conclusion. A new evaluation approach to generalise evaluation results, implemented for two problems. Limitation: it only detects the upper bound of an algorithm's capacity; the difference between the upper bound and the real performance may be significant if the test video sequence contains several video processing problems, or if the same set of parameters must be tuned differently to adapt to several concurrent problems. Ongoing evaluation campaigns: PETS at ECCV 2008, TRECVid (NIST) with the i-LIDS videos. Benchmarking databases: http://homepages.inf.ed.ac.uk/cgi/rbf/CVONLINE/entries.pl?TAG363 http://www.hitech-projects.com/euprojects/cantata/datasets_cantata/dataset.html
  • Slide 5
  • 5 Video Understanding: Program Supervision
  • Slide 6
  • 6 Supervised Video Understanding: Proposed Approach. Goal: easy creation of reliable supervised video understanding systems. Approach: use of a supervised video understanding platform, a reusable software tool composed of three separate components (program library, control, knowledge base); formalize a priori knowledge of video processing programs; make the control of video processing programs explicit. Issues: video processing programs which can be supervised; a friendly formalism to represent knowledge of programs; a general control engine to implement different control strategies; a learning tool to adapt system parameters to the environment.
  • Slide 7
  • 7 Control Application Domain Expert Video Processing Expert Application domain knowledge base Scene environment knowledge base Video processing program knowledge base Learning Evaluation Particular System Evaluation Video Processing Program Library Proposed Approach
  • Slide 8
  • 8 Supervised Video Understanding Platform: Operator Formalism. Use of an operator formalism [Clément and Thonnat, 93] to represent knowledge of video processing programs, composed of frames and production rules. Frames: declarative knowledge. Operators: abstract models of video processing programs; a primitive operator models a particular program, a composite operator a particular combination of programs. Production rules: inferential knowledge; choice and optional criteria; initialization criteria; assessment criteria; adjustment and repair criteria.
  • Slide 9
  • 9 Program Supervision: Knowledge and Reasoning. Primitive operator: functionality, characteristics, input data, parameters, output data, preconditions, postconditions, effects, calling syntax; rule bases: parameter initialization rules, parameter adjustment rules, result evaluation rules, repair rules. Composite operator: functionality, characteristics, input data, parameters, output data, preconditions, postconditions, effects, decomposition into suboperators (sequential, parallel, alternative), data flow; rule bases: parameter initialization rules, parameter adjustment rules, choice rules, result evaluation rules, repair rules.
  • Slide 10
  • 10 Video Understanding: Learning Parameters (B. Georis). Objective: a learning tool to automatically tune algorithm parameters from experimental data; used for learning the segmentation parameters with respect to the illumination conditions. Method: identify a set of parameters of a task (18 segmentation thresholds) depending on an environment characteristic (the image intensity histogram); study the variability of the characteristic (histogram clustering -> 5 clusters); determine optimal parameters for each cluster (optimization of the 18 segmentation thresholds).
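The histogram-clustering step can be sketched with a toy k-means; the slides do not name the clustering algorithm, so k-means and all details below are our assumption:

```python
def cluster_histograms(histograms, k=5, iters=20):
    """Toy k-means over normalized image-intensity histograms
    (pure Python). In the approach above, each resulting cluster
    would then receive its own optimized set of 18 segmentation
    thresholds."""
    def normalize(h):
        s = float(sum(h))
        return [v / s for v in h]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    points = [normalize(h) for h in histograms]
    centers = [list(p) for p in points[:k]]      # first k points as seeds
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(col) / len(members)
                              for col in zip(*members)]
    return labels
```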
  • Slide 11
  • 11 Video Understanding: Learning Parameters Camera View
  • Slide 12
  • 12 Learning Parameters: Clustering the Image Histograms. [Figure: image histograms stacked along Y, with pixel intensity [0-255] on X and number of pixels [%] on Z; an X-Z slice represents one image histogram, and optimal intensities i_opt1 ... i_opt5 mark the 5 clusters.]
  • Slide 13
  • 13 Video Understanding: Knowledge Discovery (E. Corvee, J.L. Patino-Vilchis). CARETAKER: an FP6 IST European initiative to provide an efficient tool for the management of large multimedia collections, with applications to surveillance and safety issues, urban/environment planning, resource optimization, and disabled/elderly person monitoring. Currently being validated on large underground video recordings (Torino, Roma). [Diagram: multiple audio/video sensors -> audio/video acquisition and encoding (raw data) -> generic event recognition (primitives, events and metadata) -> knowledge discovery, from raw data through simple events to complex events.]
  • Slide 14
  • 14 Event detection examples
  • Slide 15
  • 15 Data Flow: object/event detection -> information modelling. Object detection: id, type, 2D info, 3D info. Event detection: id, type (inside_zone, stays_inside_zone), involved mobile object, involved contextual object. Results are stored in three tables: the mobile object table, the event table and the contextual object table.
  • Slide 16
  • 16 Table Contents. Mobile objects: people characterised by their trajectory, their shape and the significant events in which they are involved. Contextual objects: find interactions between mobile objects and contextual objects (interaction type, time). Events: model the normal activities in the metro station (event type, involved objects, time).
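The three tables can be mirrored by simple record types; a hypothetical sketch in which every field name is ours, chosen to match the slide's vocabulary:

```python
from dataclasses import dataclass, field

@dataclass
class MobileObject:
    obj_id: int
    obj_type: str                                   # e.g. "Person"
    trajectory: list = field(default_factory=list)  # (t, x, y) samples
    events: list = field(default_factory=list)      # significant event ids

@dataclass
class ContextualObject:
    obj_id: int
    obj_type: str                                   # e.g. "Vending_Machine"

@dataclass
class Event:
    event_id: int
    event_type: str        # e.g. "inside_zone", "stays_inside_zone"
    mobile_id: int         # involved mobile object
    contextual_id: int     # involved contextual object
    start: float           # event interval, in seconds
    end: float
```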
  • Slide 17
  • 17 Knowledge Discovery: trajectory clustering. Objective: clustering of trajectories into k groups to match people's activities. Feature set: entry and exit points of an object; direction, speed, duration. Clustering techniques: agglomerative hierarchical clustering, K-means, Self-Organizing (Kohonen) Maps. Evaluation of each cluster set is based on ground truth.
  • Slide 18
  • 18 Trajectory Clustering Methods: feature vector. [Figure: a trajectory with entry point (x_entry, y_entry) and key points m_1, m_2, ..., m_k, ..., m_K in (x, y) coordinates.] Parameter tuning: which features?
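Turning a trajectory into such a feature vector can be sketched as follows; the straight-line direction and mean speed are our simplification of the slide's feature set:

```python
import math

def trajectory_features(points):
    """Flatten a trajectory, given as (t, x, y) samples, into a
    feature vector: entry point, exit point, overall direction,
    mean speed and duration."""
    (t0, x0, y0), (t1, x1, y1) = points[0], points[-1]
    duration = t1 - t0
    dist = math.hypot(x1 - x0, y1 - y0)
    direction = math.atan2(y1 - y0, x1 - x0)   # radians
    speed = dist / duration if duration else 0.0
    return [x0, y0, x1, y1, direction, speed, duration]
```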
  • Slide 19
  • 19 Trajectory Clustering: agglomerative clustering. Parameter tuning: which distance function?
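Agglomerative clustering with a pluggable distance, the open tuning question on this slide, can be sketched naively; single linkage is our choice for the sketch:

```python
def agglomerative(items, distance, k):
    """Naive single-linkage agglomerative clustering: repeatedly
    merge the two closest clusters until only k remain. `distance`
    is the pluggable trajectory distance the slide asks about."""
    clusters = [[item] for item in items]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters
```

Swapping the `distance` argument (Euclidean on feature vectors, trajectory-shape distances, ...) changes the clustering without touching the algorithm.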
  • Slide 20
  • 20 Results on the Torino subway (45 min), 2052 trajectories
  • Slide 21
  • 21 Trajectory Analysis: SOM, K-means and agglomerative clustering produce groups with mixed overlap.
  • Slide 22
  • 22 Trajectory: semantic characterisation. SOM CL14 / K-means CL12 / agglomerative CL21: consistency of clusters between algorithms; semantic meaning: walking towards the vending machines.
  • Slide 23
  • 23 Trajectory Analysis: intraclass & interclass variance. The SOM algorithm has the lowest intraclass variance and the highest interclass separation. Parameter tuning: which clustering technique?
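The comparison criterion can be made concrete; a minimal sketch for clusters of 1-D feature values (the 1-D restriction is ours, for brevity):

```python
def intra_inter_variance(clusters):
    """Within-cluster (intraclass) and between-cluster (interclass)
    variance for clusters of 1-D feature values; a good clustering
    has low intraclass and high interclass variance, the criterion
    under which SOM wins on this data."""
    means = [sum(c) / len(c) for c in clusters]
    n = sum(len(c) for c in clusters)
    overall = sum(sum(c) for c in clusters) / n
    intra = sum((x - m) ** 2
                for c, m in zip(clusters, means) for x in c) / n
    inter = sum(len(c) * (m - overall) ** 2
                for c, m in zip(clusters, means)) / n
    return intra, inter
```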
  • Slide 24
  • 24 Trajectory Analysis. Video features are modeled in three different tables with topological and temporal relations, for quantitative and semantic description. Trajectory clustering gives information about frequent entry/exit zones, occupation density and behavior characterization. Meaningful trajectory clusters are validated by their consistency across different algorithms.
  • Slide 25
  • 25 Mobile Objects
  • Slide 26
  • 26 Mobile Object Analysis: building statistics on objects. There is an increase in the number of people after 6:45.
  • Slide 27
  • 27 Contextual Object Analysis: vending machines 1 and 2. With the increase in the number of people, there is an increase in the use of the vending machines.
  • Slide 28
  • 28 Contextual Object Analysis: gates. [Figure: camera view of gates 1-9.]
  • Slide 29
  • 29 Analysis: use of the gates. Gates 7 to 9 (right side) are the most used; gates 1 to 3 (left side) are the least used.
  • Slide 30
  • 30 Results : Trajectory Clustering
  • Slide 31
  • 31 Knowledge Discovery: achievements. Semantic knowledge extracted by off-line long-term analysis of on-line interactions between moving objects and contextual objects: 70% of people come from the north entrance; most people spend 10 sec in the hall; 64% of people go directly to the gates without stopping at the ticket machine; at rush hours people are 40% quicker to buy a ticket. Issues: at which level(s) should clustering techniques be designed: low level (image features), middle level (trajectories, shapes) or high level (primitive events)? To learn what: visual concepts, scenario models? Uncertainty (noise/outliers/rare events): what are the activities of interest? Parameter tuning (e.g. distance, clustering technique) and performance evaluation (criteria, ground truth).
  • Slide 32
  • 32 Video Understanding: Learning Scenario Models (A. Toshev), or Frequent Composite Event Discovery in Videos. [Figure: an event time series.]
  • Slide 33
  • 33 Learning Scenarios: Motivation. Why unsupervised model learning in video understanding? Models are complex, containing many events; there is a large variety of models; different models need different parameters. The learning of models should therefore be automated. Example: video surveillance of a parking lot.
  • Slide 34
  • 34 Learning Scenarios: Problem Definition. Input: a set of primitive events from the vision module, e.g. object-inside-zone(Vehicle, Entrance) [5,16]. Output: frequent event patterns, where a pattern is a set of events: object-inside-zone(Vehicle, Road) [0, 35]; object-inside-zone(Vehicle, Parking_Road) [36, 47]; object-inside-zone(Vehicle, Parking_Places) [62, 374]; object-inside-zone(Person, Road) [314, 344]. Goals: automatic data-driven modeling of composite events; reoccurring patterns of primitive events correspond to frequent activities; find classes with large size and similar patterns.
  • Slide 35
  • 35 Learning Scenarios: A PRIORI Method. Approach: an iterative method from data mining for efficient frequent-pattern discovery in large datasets. A PRIORI: sub-patterns of frequent patterns are also frequent (Agrawal & Srikant, 1995); at the i-th step, consider only i-patterns whose (i-1)-sub-patterns are frequent, so the search space is pruned. A PRIORI property for activities represented as classes: size(C_{m-1}) >= size(C_m), where C_m is a class containing patterns of length m and C_{m-1} is a sub-activity of C_m.
  • Slide 36
  • 36 Learning Scenarios: A PRIORI Method. Merge two i-patterns with (i-1) primitive events in common to form an (i+1)-pattern.
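One candidate-generation step of this method can be sketched as follows; representing patterns as unordered sets of event labels is a simplification of ours (the real method also keeps temporal information):

```python
from itertools import combinations

def apriori_candidates(frequent):
    """Merge two frequent i-patterns sharing i-1 primitive events
    into an (i+1)-candidate, then keep it only if every one of its
    i-sub-patterns is frequent (the A PRIORI property). Patterns
    are frozensets of event labels, all of the same size i."""
    freq = set(frequent)
    candidates = set()
    for a, b in combinations(freq, 2):
        merged = a | b
        if len(merged) == len(a) + 1:              # share i-1 events
            if all((merged - {e}) in freq for e in merged):
                candidates.add(merged)
    return candidates
```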
  • Slide 37
  • 37 Learning Scenarios: Similarity Measure. Two types of similarity measure between event patterns: similarity between event attributes, and similarity between pattern structures. A generic similarity measure should use generic properties when possible (for easy usage in different domains), but should also incorporate domain-dependent properties (for relevance to the concrete application).
  • Slide 38
  • 38 Learning Scenarios: Attribute Similarity. Attributes: the corresponding events in two patterns should have similar (same) attributes (duration, names, object types, ...). Comparison is made between corresponding events (same type, same color); for numeric attributes: G(x,y)= attr(p_i, p_j) is the average of all event attribute similarities.
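The formula for G(x, y) is truncated in this transcript, so the sketch below assumes a Gaussian kernel purely for illustration; the averaging of attribute similarities follows the slide:

```python
import math

def numeric_similarity(x, y, sigma=1.0):
    """Similarity of two numeric attribute values in [0, 1]. The
    slide's G(x, y) is truncated, so a Gaussian kernel is an
    assumption made here for illustration."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def pattern_similarity(events_a, events_b, sigma=1.0):
    """attr(p_i, p_j): average attribute similarity over
    corresponding events of two patterns; each event is a dict of
    numeric attributes with matching keys."""
    sims = [numeric_similarity(ea[key], eb[key], sigma)
            for ea, eb in zip(events_a, events_b) for key in ea]
    return sum(sims) / len(sims)
```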
  • Slide 39
  • 39 Learning Scenarios: Evaluation. Test data: video surveillance of a parking lot; 4 hours of recordings from 2 days, split into 2 test sets; each test set contains approx. 100 primitive events. Results: in both test sets the following event pattern was recognized: object-inside-zone(Vehicle, Road); object-inside-zone(Vehicle, Parking_Road); object-inside-zone(Vehicle, Parking_Places); object-inside-zone(Person, Parking_Road).
  • Slide 42
  • 42 Learning Scenarios: Evaluation. The recognized event pattern corresponds to a parking maneuver: Maneuver Parking!
  • Slide 43
  • 43 Learning Scenarios: Conclusion & Future Work. Conclusion: application of a data mining approach; handling of uncertainty without losing computational effectiveness; general framework: only a similarity measure and a primitive event library must be specified. Future work: other similarity measures; handling of different aspects of uncertainty; qualification of the learned patterns (does frequent equal interesting?); different applications, with different event libraries or features.
  • Slide 44
  • 44 HealthCare Monitoring (N. Zouba). GERHOME (CSTB, INRIA, CHU Nice): the ageing population. http://gerhome.cstb.fr/ Approach: multi-sensor analysis based on sensors embedded in the home environment; detect in real time any alarming situation; identify a person's profile (his/her usual behaviors) from the global trends of life parameters, and then detect any deviation from this profile.
  • Slide 45
  • 45 Monitoring of Activities of Daily Living for the Elderly. Goal: increase independence and quality of life: enable the elderly to live longer in their preferred environment; reduce costs for public health systems; relieve family members and caregivers. Approach: detecting alarming situations (e.g. falls); detecting changes in behavior (missing activities, disorder, interruptions, repetitions, inactivity); calculating the degree of frailty of elderly people. Example of normal activity: meal preparation (in the kitchen, 11h-12h); eating (in the dining room, 12h-12h30); resting, TV watching (in the living room, 13h-16h).
  • Slide 46
  • 46 Gerhome laboratory. GERHOME (Gerontology at Home): homecare laboratory. http://www-sop.inria.fr/orion/personnel/Francois.Bremond/topicsText/gerhomeProject.html Experimental site at CSTB (Centre Scientifique et Technique du Bâtiment) in Sophia Antipolis: http://gerhome.cstb.fr Partners: INRIA, CSTB, CHU-Nice, Philips-NXP, CG06.
  • Slide 47
  • 47 Gerhome laboratory: position of the sensors. Video cameras installed in the kitchen and in the living-room to detect and track the person in the apartment. Contact sensors mounted on many devices to determine the person's interactions with them. Presence sensors installed in front of the sink and the cooking stove to detect the presence of people nearby.
  • Slide 48
  • 48 Sensors installed in the Gerhome laboratory. [Photos: in the kitchen; video camera in the living-room; pressure sensor underneath the legs of the armchair; contact sensor on the window; contact sensor on the cupboard door.]
  • Slide 49
  • 49 Event modelling. We have modelled a set of activities using an event recognition language developed in our team. Example for the Meal preparation event:
        CompositeEvent(Prepare_meal_1,    // detected by a video camera combined with contact sensors
          PhysicalObjects((p: Person), (Microwave: Equipment), (Fridge: Equipment), (Kitchen: Zone))
          Components(
            (p_inz: PrimitiveState Inside_zone(p, Kitchen))        // detected by video camera
            (open_fg: PrimitiveEvent Open_Fridge(Fridge))          // detected by contact sensor
            (close_fg: PrimitiveEvent Close_Fridge(Fridge))        // detected by contact sensor
            (open_mw: PrimitiveEvent Open_Microwave(Microwave))    // detected by contact sensor
            (close_mw: PrimitiveEvent Close_Microwave(Microwave))) // detected by contact sensor
          Constraints((open_fg during p_inz)
                      (open_mw before_meet open_fg)
                      (open_fg Duration >= 10)
                      (open_mw Duration >= 5))
          Action(AText(Person prepares meal) AType(NOT URGENT)))
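The temporal constraints above (during, before_meet) can be checked on simple (start, end) intervals; a minimal sketch with our own interval encoding, not the actual recognition engine:

```python
def during(inner, outer):
    """Interval inner = (start, end) occurs within interval outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def before_meet(a, b):
    """Interval a ends before (or exactly when) interval b starts."""
    return a[1] <= b[0]

def prepare_meal(p_inz, open_fg, open_mw):
    """Toy check of the Prepare_meal_1 constraints on (start, end)
    intervals in seconds: the fridge is opened while the person is
    in the kitchen, the microwave opening precedes the fridge
    opening, and both opening events last long enough."""
    return (during(open_fg, p_inz)
            and before_meet(open_mw, open_fg)
            and open_fg[1] - open_fg[0] >= 10
            and open_mw[1] - open_mw[0] >= 5)
```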
  • Slide 50
  • 50 Multi-sensor monitoring: results and evaluation. We have validated and visualized the recognized events with a 3D visualization tool. We have studied and tested a range of activities in the Gerhome laboratory, such as using the microwave, using the fridge, and preparing a meal.
        Activity            #Videos  #Events  TP  FN  FP  Precision  Sensitivity
        In the kitchen        10       45     40   5   0    1          0.888
        In the living-room    10       35     40   0   5    0.888      1
        Open microwave         8       15     15   0   0    1          1
        Open fridge            8       24     24   0   0    1          1
        Open cupboard          8       30     30   0   0    1          1
        Preparing meal 1       8        3      3   0   0    1          1
  • Slide 51
  • 51 Recognition of the "Prepare meal" event: visualization of a recognized event in the Gerhome laboratory. The person is recognized with the posture "standing with one arm up", located in the kitchen and using the microwave.
  • Slide 52
  • 52 Recognition of the "Resting in living-room" event: visualization of a recognized event in the Gerhome laboratory. The person is recognized with the posture "sitting in the armchair" and located in the living-room.
  • Slide 53
  • 53 End-users. There are several end-users in homecare. Doctors (gerontologists): frailty measurement (depression, ...); alarm detection (falls, gas, dementia, ...). Caregivers and nursing homes: cost reduction (no false alarms, reduced employee involvement); employee protection. Persons with special needs, including young children, disabled and elderly people: feeling safe at home; autonomy (at night, lighting up the way to the bathroom); improving life (smart mirror; summary of the user's day, week or month in terms of walking distance, TV, water consumption). Family members and relatives: elderly safety and protection; social connectivity.
  • Slide 54
  • 54 Social problems and solutions.
        Problem: privacy, confidentiality and ethics — video (and other data) recording, processing and transmission. Solution: no video recording or transmission, only textual alarms.
        Problem: acceptability for the elderly. Solution: user empowerment.
        Problem: usability. Solution: easy ergonomic interface (no keyboard, large screen), friendly usage of the system.
        Problem: cost effectiveness. Solution: the right service for the right price, a large variety of solutions.
        Problem: legal issues, no certification. Solution: robustness, benchmarking, on-site evaluation.
        Problem: installation, maintenance, training, interoperability with other home devices. Solution: adaptability, X-Box integration, wireless, standards (OSGi, ...).
        Problem: research financing? France (no money, lobbies), Europe (delay), US, Asia.
  • Slide 55
  • 55 Conclusion. A global framework for building video understanding systems. Hypotheses: mostly fixed cameras; a 3D model of the empty scene; predefined behavior models. Results: real-time video understanding systems for individuals, groups of people, vehicles, crowds or animals; knowledge structured within the different abstraction levels (i.e. processing worlds); a formal description of the empty scene; structures for algorithm parameters; structures for object detection rules, tracking rules, fusion rules, ...; an operational language for event recognition (more than 60 states and events) and a video event ontology; tools for knowledge management; metrics and tools for performance evaluation and learning; parsers and formats for data exchange.
  • Slide 56
  • 56 Conclusion: perspectives. Object and video event detection: finer human shape description (gesture models); video analysis robustness (reliability computation). Knowledge acquisition: design of learning techniques to complement a priori knowledge (visual concept learning, scenario model learning). System reusability: use of program supervision techniques (dynamic configuration of programs and parameters). Scaling issue: managing large networks of heterogeneous sensors (cameras, microphones, optical cells, radars, ...).