Linking Video Analysis to Annotation Technologies
Presentation for the BMVA’s
Computer Vision Methods for Ambient Intelligence
31st May 2006
Dimitrios Makris (Kingston University)
&
Bogdan Vrusias (University of Surrey)
REVEAL project
• EPSRC-funded project, initiated in 2004
– Academic partners
• Kingston University, University of Surrey
– Industrial partners/observers
• SIRA Ltd, Ipsotek Ltd, CrowdDynamics Ltd, Overview Ltd
– End-Users
• PITO, PSDB (Home Office), Surrey Police
• Aim: “to promote those key technologies which will enable automated extraction of evidence from CCTV archives”
See No Evil, Hear No Evil, Speak No Evil
• See Evil
– Computer Vision
– Input: Video Streams
• Hear Evil
– Natural Language Processing
– Input: Annotations of Video Streams
• Speak Evil
– Link together
– Output: Automatic Video Annotations
Scope of REVEAL
Challenges
• Development of Visual Evidence Thesaurus
– Automatic Extraction of Surveillance Ontology
• Extracting Visual Semantics
– Motion Detection & Tracking
– Geometric and Colour Constancy
– Object Classification, Behaviour Analysis, Semantic Landscape
• Analysing Crowds
• Development of Surveillance Meta-Data Model
• Multimodal Data Fusion
– Fusion of Visual Semantics and Annotations
• Video Summarisation
Video Analysis Overview
• Motion Analysis
– Motion Detection
– Motion Tracking
– Crowd Analysis
• Automatic Camera Calibration
– Colour
– Geometric
• Visual Semantics Extraction
– Object Classification
– Behaviour Analysis
– Semantic Landscape
Motion Analysis (1/2)
• Motion Detection
– Novel technique for handling rapid light variations, based on correlating changes in YUV (Renno et al., VS2006)
• Motion Tracking
– Blob-based Kalman filter for tackling partial occlusion (Xu & Ellis, BMVC2002)
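The idea behind blob-based Kalman tracking through partial occlusion can be sketched with a textbook constant-velocity filter per image coordinate: the filter keeps predicting the blob centroid while no measurement is available. This is a generic sketch, not the Xu & Ellis implementation; all noise parameters and the trajectory below are illustrative.

```python
class Kalman1D:
    """Constant-velocity Kalman filter for one coordinate of a blob
    centroid. State: [position, velocity]; measurement: position."""

    def __init__(self, x0, q=0.01, r=1.0):
        self.x, self.v = float(x0), 0.0               # state estimate
        self.pxx, self.pxv, self.pvv = 1.0, 0.0, 1.0  # covariance entries
        self.q, self.r = q, r                         # process / measurement noise

    def predict(self, dt=1.0):
        # x' = x + v*dt ; P' = F P F^T + Q, written out component-wise
        self.x += self.v * dt
        self.pxx += dt * (2 * self.pxv + dt * self.pvv) + self.q
        self.pxv += dt * self.pvv
        self.pvv += self.q
        return self.x

    def update(self, z):
        # Standard measurement update with H = [1, 0]
        s = self.pxx + self.r                  # innovation covariance
        kx, kv = self.pxx / s, self.pxv / s    # Kalman gains
        y = z - self.x                         # innovation
        self.x += kx * y
        self.v += kv * y
        pxx, pxv = self.pxx, self.pxv
        self.pxx = (1 - kx) * pxx              # P = (I - K H) P
        self.pxv = (1 - kx) * pxv
        self.pvv -= kv * pxv

# A blob moving right at 2 px/frame, with a 5-frame 'occlusion'
# during which the filter coasts on its prediction:
f = Kalman1D(x0=0.0)
for t in range(1, 31):
    f.predict()
    if not 10 <= t < 15:       # no measurement while occluded
        f.update(2.0 * t)
```

With noiseless measurements the velocity estimate converges to 2 px/frame and the predicted positions stay on the target through the occlusion gap.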
Motion Analysis (2/2)
Example
Crowd Analysis (1/3)
• Problem: Detect and Track Individuals in Crowded situations
Original Frame Foreground Mask
Crowd Analysis (2/3)
• Combine edges of original image with edges of foreground mask
Original Frame Edges Foreground Mask Edges
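The edge-combination step above can be sketched as a per-pixel AND of two binary edge maps: only edges present both in the original frame and in the foreground mask survive, which suppresses background texture edges. A minimal sketch assuming same-sized binary maps (1 = edge pixel); the example maps are made up.

```python
def combine_edges(image_edges, mask_edges):
    """Per-pixel AND of two binary edge maps (lists of rows of 0/1)."""
    return [[a & b for a, b in zip(ri, rm)]
            for ri, rm in zip(image_edges, mask_edges)]

frame = [[1, 1, 0],
         [0, 1, 1]]
mask  = [[1, 0, 0],
         [0, 1, 0]]
print(combine_edges(frame, mask))   # [[1, 0, 0], [0, 1, 0]]
```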
Crowd Analysis (3/3)
• Fit a head-shoulder (Omega) model
Head Candidates in the scene
Head Candidates on the boundaries of the foreground
Head Candidates within the foreground
Automatic Geometric Calibration (1/3)
[Chart: estimated height (pixels) vs. image row position (pixels) for Person, Vehicle and Large Vehicle targets, with the horizon row marked]
• Pedestrian height model
– Estimate a linear pedestrian height model from observations (Renno et al., ICIP2002)
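The linear pedestrian height model says a person's apparent height h (in pixels) grows linearly with the image row v of their feet, so it can be fitted by ordinary least squares over tracked-pedestrian observations; the row where h reaches zero gives the horizon. A sketch with made-up observations, not the paper's data:

```python
def fit_line(vs, hs):
    """Ordinary least squares fit for h = a*v + b."""
    n = len(vs)
    mv = sum(vs) / n
    mh = sum(hs) / n
    a = sum((v - mv) * (h - mh) for v, h in zip(vs, hs)) / \
        sum((v - mv) ** 2 for v in vs)
    return a, mh - a * mv

rows    = [120, 160, 200, 240, 280]   # hypothetical foot-row positions
heights = [22,  42,  62,  82,  102]   # hypothetical pixel heights
a, b = fit_line(rows, heights)
horizon = -b / a                      # row where apparent height vanishes
```

For these synthetic observations the fit is h = 0.5·v − 38, putting the horizon at row 76.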
Automatic Geometric Calibration (2/3)
• Ground Plane Estimation
– Use the linear pedestrian height model to estimate the ground plane (Renno et al., BMVC2002)
Automatic Geometric Calibration (3/3)
Occlusion Edges
Depth Map
• Scene Depth Map
– Use moving objects' estimated depths to determine the scene depth map (Renno et al., BMVC2004)
Automatic Colour Calibration (1/3)
• Variation of colour responses is significant!
• A real-time colour constancy algorithm is required
Automatic Colour Calibration (2/3)
• Grey World and Gamut Mapping algorithms were tested.
• An automatic method selects the reference frame.
• Gamut Mapping performs better, but Grey World can operate in real time.
(Renno et al, VS-PETS2005)
Automatic Colour Calibration (3/3)
• Real Time Colour Constancy
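The Grey World algorithm mentioned above assumes the average reflectance in a scene is achromatic, so each colour channel is rescaled until the channel means are equal. A generic sketch of the algorithm (the project's real-time implementation is not published in this deck; the pixel values are made up):

```python
def grey_world(pixels):
    """Grey World colour constancy over a list of (r, g, b) tuples:
    scale each channel so all three channel means equal their average."""
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    grey = sum(means) / 3.0
    gains = [grey / m for m in means]
    return [tuple(min(255.0, p[c] * gains[c]) for c in range(3))
            for p in pixels]

pix = [(100, 50, 25), (200, 100, 50)]   # hypothetical reddish-cast pixels
balanced = grey_world(pix)
```

After correction the red, green and blue means of `balanced` are identical, which is exactly the Grey World assumption enforced.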
Visual Semantics (Makris et al., ECOVISION 2004)
• Targets
– Pedestrians
– Cars
– Large vehicles
• Actions
– move, stop, enter/exit, accelerate, turn left/right
• Static features
– road/corridor, door/gate, ATM, desk, bus stop
Visual Semantics
• Object Classification (ongoing work)
• Behaviour Analysis (ongoing work)
• Semantic Landscape
– Label the static scene by observing activity (Makris & Ellis, AVSS 2003)
Reverse Engineering
Entry/Exit Zones
Detected by an EM-based algorithm
Detected Routes
Segmentation of Routes into Paths & Junctions
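The entry/exit zones above are detected by an EM-based algorithm; the generic form is EM fitting of a Gaussian mixture over the image positions where trajectories start and end. A self-contained sketch using isotropic components (the published algorithm may differ in model and initialisation; the endpoint data below is made up):

```python
import math

def em_gmm(points, k, iters=50):
    """EM for a k-component isotropic 2-D Gaussian mixture.
    Means are initialised on points spread evenly through the list."""
    step = max(1, len(points) // k)
    mus = [points[i * step] for i in range(k)]
    sigmas = [30.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for (x, y) in points:
            ps = [w * math.exp(-((x - mx) ** 2 + (y - my) ** 2)
                               / (2 * s * s)) / (2 * math.pi * s * s)
                  for (mx, my), s, w in zip(mus, sigmas, weights)]
            tot = sum(ps) or 1e-300
            resp.append([p / tot for p in ps])
        # M-step: re-estimate weights, means and spreads
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(points)
            mus[j] = (sum(r[j] * p[0] for r, p in zip(resp, points)) / nj,
                      sum(r[j] * p[1] for r, p in zip(resp, points)) / nj)
            var = sum(r[j] * ((p[0] - mus[j][0]) ** 2 +
                              (p[1] - mus[j][1]) ** 2)
                      for r, p in zip(resp, points)) / (2 * nj)
            sigmas[j] = max(math.sqrt(var), 1.0)
    return mus, sigmas, weights

# Hypothetical trajectory endpoints clustered near two doorways:
ends = [(50 + (i % 5) * 4, 50 + (i // 5) * 4) for i in range(20)] + \
       [(250 + (i % 5) * 4, 200 + (i // 5) * 4) for i in range(20)]
zones, spreads, props = em_gmm(ends, k=2)
```

Each fitted mean marks a candidate entry/exit zone and its sigma gives the zone's spatial extent.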
Possible extensions
• Use target labels
– paths: traffic road or pavements
– pedestrian crossing: junction of
• pedestrian route
• vehicle route
• More complicated rules
– bus stop
• pedestrians stop
• vehicle stops
• pedestrians merge with vehicle
Data Hierarchy in Video Analysis
Pixels → Blobs → Trajectories → Actor labels / Scene labels / Action labels → Textual Summary
Natural Language Processing (Surrey)
Hypothesis: Experts are using a common language/keywords to describe crime scene/video evidence.
• Visual Evidence Thesaurus
– Data acquisition (workshops)
– Data analysis
– Automatic ontology extraction
Video Annotation Workshops
• 2 different workshops were organised and run to prove the hypothesis and construct a domain thesaurus.
– Different experts
• Police Forces (Surrey, West Yorkshire)
• Forensic Services (London, Birmingham)
• Private video evidence expert (Luton)
– Several data collection tasks
• Purpose: Gather knowledge and feedback from experts in order to understand the way videos are observed and perceived.
• Task: Validate the hypothesis and extract common keywords used and the description pattern.
[Slide images: selected samples from the Workshop]
Video Annotation Workshops
• Workshop Feedback:
– Strong interest in the project.– Willing to help (within the legal limits).
• Workshop Outputs:
– Initial descriptions from experts, for analysis.– Useful feedback and comments.
Video Evidence Thesaurus
• 2 or 3 people walking around what looks like between 2 buildings.
• 2 people fighting,………
• 5 people, 1 person walks away to the bottom of screen.
• 2 people walk towards each other,……….
• Person walking across holding piece of paper on heart. Person walking away looking over his left shoulder.
• 2 people passing out each other in corridor, appear to wave hands, having some kind of interaction.
• Same video clip.
• Description from 3 different people.
• Pattern (Identify, Elaborate, Location)
Analysis
• Description from different people, for same video clips.
• Pattern (Identify <I>, Elaborate <E>, Location <L>)
• Grammar:
• <Description>: <I><E|L><L|E|{Φ}><Description|{Φ}>
• <I>: <Single|Group>
• <Single>: <{Person}|{Male}|{Female}|….>
• <Group>: <2|3|…..|n><{People}|Single>
Video Evidence Thesaurus
Thesaurus Construction Methodology
• We adopt a text-driven and bottom-up method: starting from a collection of texts in a specialist domain, together with a representative general language corpus
• Use a five-step algorithm for identifying discourse patterns with more or less unique meanings, without any overt access to an external knowledge base
I. Select training corpora: a CCTV-related corpus and a general language corpus;
II. Extract key words;
III. Extract key collocates;
IV. Extract local grammar using collocation and relevance feedback;
V. Assert the grammar as a finite state automaton.
Thesaurus Construction Methodology
• Once the single terms, especially the weird terms, are identified, we find candidate compound terms by computing collocation statistics between the single terms and other open-class words in the entire corpus.
Development of Visual Evidence Thesaurus
• Collocates of the weird term EARPRINT + collocation statistics
Development of Visual Evidence Thesaurus
• Collocates of the weird term EARPRINT IDENTIFICATION
Development of Visual Evidence Thesaurus
An inheritance hierarchy of EARPRINT collocates exported to the knowledge representation system PROTEGE:
Development of Visual Evidence Thesaurus
• A multiple inheritance hierarchy of EARPRINT collocates now exported to a knowledge representation workbench PROTEGE:
Rubbish!
Development of Visual Evidence Thesaurus
Experiments and Evaluation
• I. Select training corpora
• Training corpora
– The British National Corpus, comprising 100 million tokens distributed over 4,124 texts (Aston and Burnard 1998);
– Crime Alerts Corpus (FBI Crime Alerts, Wanted by the Royal Canadian Mounted Police (RCMP) & Polizei Bayern, journal/conference papers), comprising 109 articles and 214,437 words.
Experiments and Evaluation
• II. Extract key words
– The frequencies of individual words in the Crime Alerts Corpus were computed using System Quirk;
Experiments and Evaluation

Top-ranked words, with cumulative number of tokens in brackets:

Ranks 1-10
– Crime Alerts Corpus (NCAC = 214,437): the, of, and, in, to, a, by, for, county, total [39,724 tokens, 18.88%]
– British National Corpus (NBNC = 100 million): the, of, and, a, in, to, for, is, as, that [22.3 M, 22.3%]
Ranks 11-20
– CAC: percent, is, state, city, with, law, that, s, population, enforcement [11,910 tokens, 5.66%]
– BNC: was, I, on, with, as, be, he, you, at, by [6.51 M, 6.5%]
Ranks 21-30
– CAC: offenses, crime, or, are, rate, on, township, area, as, agencies [8,923 tokens, 4.24%]
– BNC: are, this, have, but, not, from, had, his, they, or [4.23 M, 4.2%]
Ranks 31-40
– CAC: be, was, counties, this, per, from, were, an, number, continued [7,126 tokens, 3.39%]
– BNC: which, an, she, where, here, we, one, there, all, been [3.05 M, 3.1%]
Ranks 41-50
– CAC: data, inhabitants, reporting, university, estimated, theft, cities, metropolitan, at, other [6,043 tokens, 2.87%]
– BNC: their, if, has, will, so, would, no, what, can, when [2.35 M, 2.4%]
Experiments and Evaluation
Weirdness of sample tokens, Crime Alerts Corpus (NCAC = 214,437) vs. BNC (NBNC = 100,000,000):

Token    Rank(CAC)  fCAC  fCAC/NCAC (a)  Rank(BNC)  fBNC  fBNC/NBNC (b)  Weirdness (a/b)
crime       22       960     0.448%        1512      7155    0.007%          62.63
theft       46       618     0.288%        5031      1727    0.002%         167.05
murder      57       492     0.229%        1811      5935    0.006%          38.70
crimes      91       301     0.140%        4901      1789    0.002%          78.54
scars      181       145     0.068%       14217       379    0.0004%        178.60
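The weirdness column above is a term's relative frequency in the specialist corpus divided by its relative frequency in the general corpus; the table values can be reproduced to within rounding:

```python
# Corpus sizes and term frequencies from the table above.
N_CAC = 214_437        # Crime Alerts Corpus size (tokens)
N_BNC = 100_000_000    # British National Corpus size (tokens)

def weirdness(f_cac, f_bnc):
    """Relative frequency in the specialist corpus over relative
    frequency in the general corpus; higher = more domain-specific."""
    return (f_cac / N_CAC) / (f_bnc / N_BNC)

print(round(weirdness(960, 7155), 1))   # 'crime'  (table: 62.63)
print(round(weirdness(145, 379), 1))    # 'scars'  (table: 178.60)
```

Small differences from the printed table come from the rounded percentages shown in columns (a) and (b).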
Experiments and Evaluation
• III. Extract key collocates
            f      Left  Right  Total  z-score
scars     65,763
acne          70    42     0     42     8.16
boxcar        43    32     5     37     7.13
rolling       26    16    11     27     5.06
deep          40    20     3     23     4.24
facial        34    22     0     22     4.03
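The slides do not say which collocation z-score variant System Quirk computes; one standard formulation (in the style of Berry-Rogghe) compares the observed co-occurrence count inside a ±span window with what chance predicts. A hedged sketch with made-up counts, not the deck's numbers:

```python
import math

def collocation_z(K, f_node, f_coll, N, span=4):
    """One common collocation z-score (an assumption, not necessarily
    System Quirk's variant). K: observed co-occurrences of the collocate
    within +/-span words of the node; f_node, f_coll: corpus
    frequencies; N: corpus size in tokens."""
    p = f_coll / (N - f_node)        # chance probability of the collocate
    E = p * f_node * 2 * span        # expected co-occurrences in the window
    return (K - E) / math.sqrt(E * (1 - p))

# A collocate seen far more often than chance scores high; a rare
# co-occurrence scores near zero (hypothetical counts):
z_high = collocation_z(K=42, f_node=145, f_coll=70, N=214_437)
z_low  = collocation_z(K=1,  f_node=145, f_coll=70, N=214_437)
```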
Experiments and Evaluation
• IV. Extract local grammar using collocation and relevance feedback
Pattern             f   Collocate  Left  Right  z-score
facial acne scars   21  pitted       9     1     2.37
deep boxcar scars   16  had          3     3     2.12
has scars on         8  his          1     5     3.95
has scars on         8  left         1     4     3.05
Experiments and Evaluation
• V. Assert the grammar as a finite state automaton
– The (re-)collocation patterns can then be asserted as a finite state automaton for each of the movement verbs and spatial preposition metaphors
An experiment:
• 16 videos in the CAVIAR data set were shown to 4 different surveillance experts;
• The experts were asked to describe the videos in their own words, in English surveillance speak;
• Experts describe videos in a succinct manner, using the terminology of their domain and framing the description in a 'local grammar';
• The interviews were transcribed, and sentences and phrases were marked up using a basic ontology: Action, Location, Result, Miscellaneous.
Describing Videos
One of our experts described the frame on the left as
Man in blue t-shirt, centre of scene, facing camera, raises white card high above his head. Second man wearing a dark top with white stripes down the sleeve enters scene from above. Meets a third individual with a dark top and pale trousers and an altercation occurs in the centre of the open space. Individuals meet briefly and leave scene in opposite directions. The original person with the white sleeves -- white stripes on the sleeves leaves scene below camera. Second person in the altercation leaves scene by the red chairs. That was an assault, I’d say.
Describing Videos
One of our experts described the frame on the left as
Event 1: Man in blue t-shirt, centre of scene, facing camera, raises white card high above his head.

Event 1 (marked up): Miscellaneous: Man in blue t-shirt; Location: centre of scene; Action: facing camera; Result: raises white card high above his head.
Describing Videos
One of our experts described the frame on the left as
Event 1:
M  Man in blue t-shirt,
L  centre of scene,
A  facing camera,
R  raises white card high above his head.
Event 2:
A  Second man wearing a dark top with white stripes down the sleeve enters scene
L  from above.
A  Meets a third individual with a dark top and pale trousers and an altercation occurs
L  in the centre of the open space.
A  Individuals meet briefly
R  and leave scene in opposite directions.
Describing Videos
Inter-indexer variability: Triplets at the start of event descriptions
Triplet  Expert 1 %  Expert 2 %  Expert 3 %  Expert 4 %   Avg    Std Dev
ALA          36          31          33          23       30.75    5.6
LAL          23          21          15          17       19       3.6
ALM          22          21          11           3       14.25    8.9
ALR          11           0          11          17        9.75    7.1
MAL           5           3           7          17        8       6.2
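The Avg and Std Dev columns appear to use the sample standard deviation; the ALA row can be reproduced with the standard library (checked for the ALA row only):

```python
import statistics

ala = [36, 31, 33, 23]        # ALA triplet percentages, Experts 1-4
avg = sum(ala) / len(ala)
sd = statistics.stdev(ala)    # sample standard deviation (n - 1)
print(avg, round(sd, 1))      # 30.75 5.6
```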
Describing Videos
Inter-indexer variability:
The most frequent triplet that ends with an R is ALR. ALR, in turn, was found in frequently occurring patterns like:
ALALRALALALR
ALALALALRAALALR
Describing Videos
Inter-indexer variability: the following local grammar was 'discovered' from the corpus of marked transcripts for all four descriptions:
L?M?((A+L)M?)+L?A*M?R
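The discovered local grammar is already a regular expression over the tags A (Action), L (Location), M (Miscellaneous) and R (Result), so it can be checked mechanically; the tag sequences below (other than "AL") are patterns of the kind discussed on the preceding slides.

```python
import re

# The 'discovered' local grammar used directly as a regular expression.
LOCAL_GRAMMAR = re.compile(r"L?M?((A+L)M?)+L?A*M?R")

# Every well-formed event description must end in a Result (R);
# "AL" lacks one and is rejected.
for seq in ["ALR", "MALR", "ALALR", "ALALALALR", "AALALR", "AL"]:
    print(seq, bool(LOCAL_GRAMMAR.fullmatch(seq)))
```

Note that a sequence opening M-then-L (as in the Event 1 mark-up, M L A R) falls outside this grammar, which only allows an optional L before the optional M.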
Describing Videos
Most frequently used verbs:

Actions                 Expert 1 %  Expert 2 %  Expert 3 %  Expert 4 %  Mean
Verb of motion              71          70          51          50      60.5
Verb of action              20          17          37          41      28.7
Verb of stasis               7           7           7           5       6.5
Verb of action + prep.       2           1           0           0       0.7
Describing Videos
Location Type                   Exemplars
Real world location             Seating area; Stairs; Walkway; Shop entrance
Relative spatial location       To the left; Left to right; From the right; Area from left
Location relative to portrayal  Right hand side of screen; Top left field of view of camera; Away from us; Towards the camera
Describing Videos
What about the labelling in CAVIAR? Actions are described through verbal nouns – three videos (Fight_Chase, Browse, LeftBagBehindChair)
Situation   Type                   FC    BR3   LBBC
Moving      Verbal noun of motion  923   799   1295
Inactive    Adjective of stasis    618    70    228
Fighting    Verbal noun of action  280     0      0
Joining     Verbal noun of action   52     0      0
Split up    Verbal noun of action  200     0      0
Browsing    Verbal noun of motion    0   135      0
Describing Videos
What about the labelling in CAVIAR? Actions are described through verbal nouns – three videos (Fight_Chase, Browse, LeftBagBehindChair)
Context     Type                   FC    BR3   LBBC
Walking     Verbal noun of motion  522   156    897
Fighting    Verbal noun of action  532     0      0
Immobile    Adjective of stasis   1019   200    626
Browsing    Verbal noun of motion    0   648      0
Describing Videos
What about the labelling in CAVIAR? Actions are described through verbal nouns (three videos: Fight_Chase, Browse, LeftBagBehindChair). How do our experts compare with the CAVIAR labelling?

Acts & Results  Expert Mean  CAVIAR Context  CAVIAR Situation
Motion             59.25          48               69
Action             29.75          12               12
Stasis              6.25          40               20
Describing Videos
Surveillance Meta-Data Model (both)
• How can we create a model to describe CCTV video?
• What is an activity?
• Activity: interaction between actors and scene objects
Multimodal Data Fusion
• Can we use machine learning methods to link vision and text?
• Can that link be created in an unsupervised fashion?
[Diagram: frame images from a video sequence linked to text descriptions]
Multimodal Data Fusion
Video Summarisation
• Automatic annotation / labelling
• Video content categorisation
• Retrieval
Video Summarisation
[Diagram: VIDEO PROCESSING → XML → FUSION → SUMMARY]
Summary/Conclusions
• Development of Visual Evidence Thesaurus
– Automatic Extraction of Surveillance Ontology
• Extracting Visual Semantics
– Motion Detection & Tracking
– Geometric and Colour Constancy
– Object Classification, Behaviour Analysis, Semantic Landscape
• Analysing Crowds
• Development of Surveillance Meta-Data Model
• Multimodal Data Fusion
– Fusion of Visual Semantics and Annotations
• Video Summarisation
Summary/Conclusions
• Computer Vision to extract the Visual Semantics (See Evil)
• Natural Language Processing to identify the Surveillance Ontology (Hear Evil)
• Linking of the two technologies in order to construct a Video Summarisation System (Speak Evil)