Linking Video Analysis to Annotation Technologies
Presentation for the BMVA’s
Computer Vision Methods for Ambient Intelligence
31st May 2006
Dimitrios Makris (Kingston University)
&
Bogdan Vrusias (University of Surrey)
REVEAL project
• EPSRC-funded project, initiated in 2004
– Academic partners
• Kingston University, University of Surrey
– Industrial partners/observers
• SIRA Ltd, Ipsotek Ltd, CrowdDynamics Ltd, Overview Ltd
– End-Users
• PITO, PSDB (Home Office), Surrey Police
• Aim: “to promote those key technologies which will enable automated extraction of evidence from CCTV archives”
See No Evil, Hear No Evil, Speak No Evil
• See Evil
– Computer Vision
– Input: Video Streams
• Hear Evil
– Natural Language Processing
– Input: Annotations of Video Streams
• Speak Evil
– Link together
– Output: Automatic Video Annotations
Scope of REVEAL
Challenges
• Development of Visual Evidence Thesaurus
– Automatic Extraction of Surveillance Ontology
• Extracting Visual Semantics
– Motion Detection & Tracking
– Geometric and Colour Constancy
– Object Classification, Behaviour Analysis, Semantic Landscape
• Analysing Crowds
• Development of Surveillance Meta-Data Model
• Multimodal Data Fusion
– Fusion of Visual Semantics and Annotations
• Video Summarisation
Video Analysis Overview
• Motion Analysis
– Motion Detection
– Motion Tracking
– Crowd Analysis
• Automatic Camera Calibration
– Colour
– Geometric
• Visual Semantics Extraction
– Object Classification
– Behaviour Analysis
– Semantic Landscape
Motion Analysis (1/2)
• Motion Detection
– Novel technique for handling rapid light variations, based on correlating changes in YUV (Renno et al., VS2006)
• Motion Tracking
– Blob-based Kalman filter for tackling partial occlusion (Xu & Ellis, BMVC2002)
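The idea behind blob-based Kalman tracking through partial occlusion can be sketched with a textbook constant-velocity filter per image coordinate: the filter keeps predicting the blob centroid while no measurement is available. This is a generic sketch, not the Xu & Ellis implementation; all noise parameters and the trajectory below are illustrative.

```python
class Kalman1D:
    """Constant-velocity Kalman filter for one coordinate of a blob
    centroid. State: [position, velocity]; measurement: position."""

    def __init__(self, x0, q=0.01, r=1.0):
        self.x, self.v = float(x0), 0.0               # state estimate
        self.pxx, self.pxv, self.pvv = 1.0, 0.0, 1.0  # covariance entries
        self.q, self.r = q, r                         # process / measurement noise

    def predict(self, dt=1.0):
        # x' = x + v*dt ; P' = F P F^T + Q, written out component-wise
        self.x += self.v * dt
        self.pxx += dt * (2 * self.pxv + dt * self.pvv) + self.q
        self.pxv += dt * self.pvv
        self.pvv += self.q
        return self.x

    def update(self, z):
        # Standard measurement update with H = [1, 0]
        s = self.pxx + self.r                  # innovation covariance
        kx, kv = self.pxx / s, self.pxv / s    # Kalman gains
        y = z - self.x                         # innovation
        self.x += kx * y
        self.v += kv * y
        pxx, pxv = self.pxx, self.pxv
        self.pxx = (1 - kx) * pxx              # P = (I - K H) P
        self.pxv = (1 - kx) * pxv
        self.pvv -= kv * pxv

# A blob moving right at 2 px/frame, with a 5-frame 'occlusion'
# during which the filter coasts on its prediction:
f = Kalman1D(x0=0.0)
for t in range(1, 31):
    f.predict()
    if not 10 <= t < 15:       # no measurement while occluded
        f.update(2.0 * t)
```

With noiseless measurements the velocity estimate converges to 2 px/frame and the predicted positions stay on the target through the occlusion gap.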
Motion Analysis (2/2)
Example
Crowd Analysis (1/3)
• Problem: Detect and Track Individuals in Crowded situations
Original Frame Foreground Mask
Crowd Analysis (2/3)
• Combine edges of original image with edges of foreground mask
Original Frame Edges Foreground Mask Edges
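The edge-combination step above can be sketched as a per-pixel AND of two binary edge maps: only edges present both in the original frame and in the foreground mask survive, which suppresses background texture edges. A minimal sketch assuming same-sized binary maps (1 = edge pixel); the example maps are made up.

```python
def combine_edges(image_edges, mask_edges):
    """Per-pixel AND of two binary edge maps (lists of rows of 0/1)."""
    return [[a & b for a, b in zip(ri, rm)]
            for ri, rm in zip(image_edges, mask_edges)]

frame = [[1, 1, 0],
         [0, 1, 1]]
mask  = [[1, 0, 0],
         [0, 1, 0]]
print(combine_edges(frame, mask))   # [[1, 0, 0], [0, 1, 0]]
```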
Crowd Analysis (3/3)
• Fit a head-shoulder (Omega) model
Head Candidates in the scene
Head Candidates on the boundaries of the foreground
Head Candidates within the foreground
Automatic Geometric Calibration (1/3)
[Chart: estimated height (pixels) vs. image row position (pixels) for Person, Vehicle and Large Vehicle targets, with the horizon row marked]
• Pedestrian height model
– Estimate a linear pedestrian height model from observations (Renno et al., ICIP2002)
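The linear pedestrian height model says a person's apparent height h (in pixels) grows linearly with the image row v of their feet, so it can be fitted by ordinary least squares over tracked-pedestrian observations; the row where h reaches zero gives the horizon. A sketch with made-up observations, not the paper's data:

```python
def fit_line(vs, hs):
    """Ordinary least squares fit for h = a*v + b."""
    n = len(vs)
    mv = sum(vs) / n
    mh = sum(hs) / n
    a = sum((v - mv) * (h - mh) for v, h in zip(vs, hs)) / \
        sum((v - mv) ** 2 for v in vs)
    return a, mh - a * mv

rows    = [120, 160, 200, 240, 280]   # hypothetical foot-row positions
heights = [22,  42,  62,  82,  102]   # hypothetical pixel heights
a, b = fit_line(rows, heights)
horizon = -b / a                      # row where apparent height vanishes
```

For these synthetic observations the fit is h = 0.5·v − 38, putting the horizon at row 76.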
Automatic Geometric Calibration (2/3)
• Ground Plane Estimation
– Use the linear pedestrian height model to estimate the ground plane (Renno et al., BMVC2002)
Automatic Geometric Calibration (3/3)
Occlusion Edges
Depth Map
• Scene Depth Map
– Use moving objects' estimated depths to determine the scene depth map (Renno et al., BMVC2004)
Automatic Colour Calibration (1/3)
• Variation of colour responses is significant!
• A real-time colour constancy algorithm is required
Automatic Colour Calibration (2/3)
• Grey World and Gamut Mapping algorithms were tested.
• An automatic method selects the reference frame.
• Gamut Mapping performs better, but Grey World can operate in real time.
(Renno et al, VS-PETS2005)
Automatic Colour Calibration (3/3)
• Real Time Colour Constancy
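The Grey World algorithm mentioned above assumes the average reflectance in a scene is achromatic, so each colour channel is rescaled until the channel means are equal. A generic sketch of the algorithm (the project's real-time implementation is not published in this deck; the pixel values are made up):

```python
def grey_world(pixels):
    """Grey World colour constancy over a list of (r, g, b) tuples:
    scale each channel so all three channel means equal their average."""
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    grey = sum(means) / 3.0
    gains = [grey / m for m in means]
    return [tuple(min(255.0, p[c] * gains[c]) for c in range(3))
            for p in pixels]

pix = [(100, 50, 25), (200, 100, 50)]   # hypothetical reddish-cast pixels
balanced = grey_world(pix)
```

After correction the red, green and blue means of `balanced` are identical, which is exactly the Grey World assumption enforced.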
Visual Semantics (Makris et al., ECOVISION 2004)
• Targets
– Pedestrians
– Cars
– Large vehicles
• Actions
– move, stop, enter/exit, accelerate, turn left/right
• Static features
– road/corridor, door/gate, ATM, desk, bus stop
Visual Semantics
• Object Classification (ongoing work)
• Behaviour Analysis (ongoing work)
• Semantic Landscape
– Label the static scene by observing activity (Makris & Ellis, AVSS 2003)
Reverse Engineering
Entry/Exit Zones
Detected by an EM-based algorithm
Detected Routes
Segmentation of Routes into Paths & Junctions
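The entry/exit zones above are detected by an EM-based algorithm; the generic form is EM fitting of a Gaussian mixture over the image positions where trajectories start and end. A self-contained sketch using isotropic components (the published algorithm may differ in model and initialisation; the endpoint data below is made up):

```python
import math

def em_gmm(points, k, iters=50):
    """EM for a k-component isotropic 2-D Gaussian mixture.
    Means are initialised on points spread evenly through the list."""
    step = max(1, len(points) // k)
    mus = [points[i * step] for i in range(k)]
    sigmas = [30.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for (x, y) in points:
            ps = [w * math.exp(-((x - mx) ** 2 + (y - my) ** 2)
                               / (2 * s * s)) / (2 * math.pi * s * s)
                  for (mx, my), s, w in zip(mus, sigmas, weights)]
            tot = sum(ps) or 1e-300
            resp.append([p / tot for p in ps])
        # M-step: re-estimate weights, means and spreads
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(points)
            mus[j] = (sum(r[j] * p[0] for r, p in zip(resp, points)) / nj,
                      sum(r[j] * p[1] for r, p in zip(resp, points)) / nj)
            var = sum(r[j] * ((p[0] - mus[j][0]) ** 2 +
                              (p[1] - mus[j][1]) ** 2)
                      for r, p in zip(resp, points)) / (2 * nj)
            sigmas[j] = max(math.sqrt(var), 1.0)
    return mus, sigmas, weights

# Hypothetical trajectory endpoints clustered near two doorways:
ends = [(50 + (i % 5) * 4, 50 + (i // 5) * 4) for i in range(20)] + \
       [(250 + (i % 5) * 4, 200 + (i // 5) * 4) for i in range(20)]
zones, spreads, props = em_gmm(ends, k=2)
```

Each fitted mean marks a candidate entry/exit zone and its sigma gives the zone's spatial extent.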
Possible extensions
• Use target labels
– paths: traffic road or pavements
– pedestrian crossing: junction of
• pedestrian route
• vehicle route
• More complicated rules
– bus stop
• pedestrians stop
• vehicle stops
• pedestrians merge with vehicle
Data Hierarchy in Video Analysis
Pixels → Blobs → Trajectories → Actor labels / Scene labels / Action labels → Textual Summary
Natural Language Processing (Surrey)
Hypothesis: Experts are using a common language/keywords to describe crime scene/video evidence.
• Visual Evidence Thesaurus
– Data acquisition (workshops)
– Data analysis
– Automatic ontology extraction
Video Annotation Workshops
• 2 different workshops were organised and run to prove the hypothesis and construct a domain thesaurus.
– Different experts
• Police Forces (Surrey, West Yorkshire)
• Forensic Services (London, Birmingham)
• Private video evidence expert (Luton)
– Several data collection tasks
• Purpose: Gather knowledge and feedback from experts in order to understand the way videos are observed and perceived.
• Task: Validate the hypothesis and extract common keywords used and the description pattern.
[Slide images: selected samples from the Workshop]
Video Annotation Workshops
• Workshop Feedback:
– Strong interest in the project.– Willing to help (within the legal limits).
• Workshop Outputs:
– Initial descriptions from experts, for analysis.– Useful feedback and comments.
Video Evidence Thesaurus
• 2 or 3 people walking around what looks like between 2 buildings.
• 2 people fighting,………
• 5 people, 1 person walks away to the bottom of screen.
• 2 people walk towards each other,……….
• Person walking across holding piece of paper on heart. Person walking away looking over his left shoulder.
• 2 people passing out each other in corridor, appear to wave hands, having some kind of interaction.
• Same video clip.
• Description from 3 different people.
• Pattern (Identify, Elaborate, Location)
Analysis
• Description from different people, for same video clips.
• Pattern (Identify <I>, Elaborate <E>, Location <L>)
• Grammar:
• <Description>: <I><E|L><L|E|{Φ}><Description|{Φ}>
• <I>: <Single|Group>
• <Single>: <{Person}|{Male}|{Female}|….>
• <Group>: <2|3|…..|n><{People}|Single>
Video Evidence Thesaurus
Thesaurus Construction Methodology
• We adopt a text-driven and bottom-up method: starting from a collection of texts in a specialist domain, together with a representative general language corpus
• Use a five-step algorithm for identifying discourse patterns with more or less unique meanings, without any overt access to an external knowledge base
I. Select training corpora: a CCTV-related corpus and a general language corpus;
II. Extract key words;
III. Extract key collocates;
IV. Extract local grammar using collocation and relevance feedback;
V. Assert the grammar as a finite state automaton.
Thesaurus Construction Methodology
• Once the single terms, especially the weird terms, are identified, we find candidate compound terms by computing collocation statistics between the single terms and other open-class words in the entire corpus.
Development of Visual Evidence Thesaurus
• Collocates of the weird term EARPRINT + collocation statistics
Development of Visual Evidence Thesaurus
• Collocates of the weird term EARPRINT IDENTIFICATION
Development of Visual Evidence Thesaurus
An inheritance hierarchy of EARPRINT collocates exported to the knowledge representation system PROTEGE:
Development of Visual Evidence Thesaurus
• A multiple inheritance hierarchy of EARPRINT collocates now exported to a knowledge representation workbench PROTEGE:
Rubbish!
Development of Visual Evidence Thesaurus
Experiments and Evaluation
• I. Select training corpora
• Training corpora
– The British National Corpus, comprising 100 million tokens distributed over 4,124 texts (Aston and Burnard 1998);
– Crime Alerts Corpus (FBI Crime Alerts, Wanted by the Royal Canadian Mounted Police (RCMP) & Polizei Bayern, journal/conference papers), comprising 109 articles and 214,437 words.
Experiments and Evaluation
• II. Extract key words
– The frequencies of individual words in the Crime Alerts Corpus were computed using System Quirk;
Experiments and Evaluation

Top-ranked words, with cumulative number of tokens in brackets:

Ranks 1-10
– Crime Alerts Corpus (NCAC = 214,437): the, of, and, in, to, a, by, for, county, total [39,724 tokens, 18.88%]
– British National Corpus (NBNC = 100 million): the, of, and, a, in, to, for, is, as, that [22.3 M, 22.3%]
Ranks 11-20
– CAC: percent, is, state, city, with, law, that, s, population, enforcement [11,910 tokens, 5.66%]
– BNC: was, I, on, with, as, be, he, you, at, by [6.51 M, 6.5%]
Ranks 21-30
– CAC: offenses, crime, or, are, rate, on, township, area, as, agencies [8,923 tokens, 4.24%]
– BNC: are, this, have, but, not, from, had, his, they, or [4.23 M, 4.2%]
Ranks 31-40
– CAC: be, was, counties, this, per, from, were, an, number, continued [7,126 tokens, 3.39%]
– BNC: which, an, she, where, here, we, one, there, all, been [3.05 M, 3.1%]
Ranks 41-50
– CAC: data, inhabitants, reporting, university, estimated, theft, cities, metropolitan, at, other [6,043 tokens, 2.87%]
– BNC: their, if, has, will, so, would, no, what, can, when [2.35 M, 2.4%]
Experiments and Evaluation
Weirdness of sample tokens, Crime Alerts Corpus (NCAC = 214,437) vs. BNC (NBNC = 100,000,000):

Token    Rank(CAC)  fCAC  fCAC/NCAC (a)  Rank(BNC)  fBNC  fBNC/NBNC (b)  Weirdness (a/b)
crime       22       960     0.448%        1512      7155    0.007%          62.63
theft       46       618     0.288%        5031      1727    0.002%         167.05
murder      57       492     0.229%        1811      5935    0.006%          38.70
crimes      91       301     0.140%        4901      1789    0.002%          78.54
scars      181       145     0.068%       14217       379    0.0004%        178.60
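The weirdness column above is a term's relative frequency in the specialist corpus divided by its relative frequency in the general corpus; the table values can be reproduced to within rounding:

```python
# Corpus sizes and term frequencies from the table above.
N_CAC = 214_437        # Crime Alerts Corpus size (tokens)
N_BNC = 100_000_000    # British National Corpus size (tokens)

def weirdness(f_cac, f_bnc):
    """Relative frequency in the specialist corpus over relative
    frequency in the general corpus; higher = more domain-specific."""
    return (f_cac / N_CAC) / (f_bnc / N_BNC)

print(round(weirdness(960, 7155), 1))   # 'crime'  (table: 62.63)
print(round(weirdness(145, 379), 1))    # 'scars'  (table: 178.60)
```

Small differences from the printed table come from the rounded percentages shown in columns (a) and (b).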
Experiments and Evaluation
• III. Extract key collocates
            f      Left  Right  Total  z-score
scars     65,763
acne          70    42     0     42     8.16
boxcar        43    32     5     37     7.13
rolling       26    16    11     27     5.06
deep          40    20     3     23     4.24
facial        34    22     0     22     4.03
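The slides do not say which collocation z-score variant System Quirk computes; one standard formulation (in the style of Berry-Rogghe) compares the observed co-occurrence count inside a ±span window with what chance predicts. A hedged sketch with made-up counts, not the deck's numbers:

```python
import math

def collocation_z(K, f_node, f_coll, N, span=4):
    """One common collocation z-score (an assumption, not necessarily
    System Quirk's variant). K: observed co-occurrences of the collocate
    within +/-span words of the node; f_node, f_coll: corpus
    frequencies; N: corpus size in tokens."""
    p = f_coll / (N - f_node)        # chance probability of the collocate
    E = p * f_node * 2 * span        # expected co-occurrences in the window
    return (K - E) / math.sqrt(E * (1 - p))

# A collocate seen far more often than chance scores high; a rare
# co-occurrence scores near zero (hypothetical counts):
z_high = collocation_z(K=42, f_node=145, f_coll=70, N=214_437)
z_low  = collocation_z(K=1,  f_node=145, f_coll=70, N=214_437)
```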
Experiments and Evaluation
• IV. Extract local grammar using collocation and relevance feedback
Pattern             f   Collocate  Left  Right  z-score
facial acne scars   21  pitted       9     1     2.37
deep boxcar scars   16  had          3     3     2.12
has scars on         8  his          1     5     3.95
has scars on         8  left         1     4     3.05
Experiments and Evaluation
• V. Assert the grammar as a finite state automaton
– The (re-)collocation patterns can then be asserted as a finite state automaton for each of the movement verbs and spatial preposition metaphors
An experiment:
• 16 videos in the CAVIAR data set were shown to 4 different surveillance experts;
• The experts were asked to describe the videos in their own words, in English surveillance speak;
• Experts describe videos in a succinct manner, using the terminology of their domain and framing the description in a 'local grammar';
• The interviews were transcribed, and sentences and phrases were marked up using a basic ontology: Action, Location, Result, Miscellaneous.
Describing Videos
One of our experts described the frame on the left as
Man in blue t-shirt, centre of scene, facing camera, raises white card high above his head. Second man wearing a dark top with white stripes down the sleeve enters scene from above. Meets a third individual with a dark top and pale trousers and an altercation occurs in the centre of the open space. Individuals meet briefly and leave scene in opposite directions. The original person with the white sleeves -- white stripes on the sleeves leaves scene below camera. Second person in the altercation leaves scene by the red chairs. That was an assault, I’d say.
Describing Videos
One of our experts described the frame on the left as
Event 1: Man in blue t-shirt, centre of scene, facing camera, raises white card high above his head.

Event 1 (marked up): Miscellaneous: Man in blue t-shirt; Location: centre of scene; Action: facing camera; Result: raises white card high above his head.
Describing Videos
One of our experts described the frame on the left as
Event 1:
M  Man in blue t-shirt,
L  centre of scene,
A  facing camera,
R  raises white card high above his head.
Event 2:
A  Second man wearing a dark top with white stripes down the sleeve enters scene
L  from above.
A  Meets a third individual with a dark top and pale trousers and an altercation occurs
L  in the centre of the open space.
A  Individuals meet briefly
R  and leave scene in opposite directions.
Describing Videos
Inter-indexer variability: Triplets at the start of event descriptions
Triplet  Expert 1 %  Expert 2 %  Expert 3 %  Expert 4 %   Avg    Std Dev
ALA          36          31          33          23       30.75    5.6
LAL          23          21          15          17       19       3.6
ALM          22          21          11           3       14.25    8.9
ALR          11           0          11          17        9.75    7.1
MAL           5           3           7          17        8       6.2
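The Avg and Std Dev columns appear to use the sample standard deviation; the ALA row can be reproduced with the standard library (checked for the ALA row only):

```python
import statistics

ala = [36, 31, 33, 23]        # ALA triplet percentages, Experts 1-4
avg = sum(ala) / len(ala)
sd = statistics.stdev(ala)    # sample standard deviation (n - 1)
print(avg, round(sd, 1))      # 30.75 5.6
```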
Describing Videos
Inter-indexer variability:
The most frequent triplet that ends with an R is ALR. ALR, in turn, was found in frequently occurring patterns like:
ALALRALALALR
ALALALALRAALALR
Describing Videos
Inter-indexer variability: the following local grammar was 'discovered' from the corpus of marked transcripts for all four descriptions:
L?M?((A+L)M?)+L?A*M?R
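The discovered local grammar is already a regular expression over the tags A (Action), L (Location), M (Miscellaneous) and R (Result), so it can be checked mechanically; the tag sequences below (other than "AL") are patterns of the kind discussed on the preceding slides.

```python
import re

# The 'discovered' local grammar used directly as a regular expression.
LOCAL_GRAMMAR = re.compile(r"L?M?((A+L)M?)+L?A*M?R")

# Every well-formed event description must end in a Result (R);
# "AL" lacks one and is rejected.
for seq in ["ALR", "MALR", "ALALR", "ALALALALR", "AALALR", "AL"]:
    print(seq, bool(LOCAL_GRAMMAR.fullmatch(seq)))
```

Note that a sequence opening M-then-L (as in the Event 1 mark-up, M L A R) falls outside this grammar, which only allows an optional L before the optional M.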
Describing Videos
Most frequently used verbs:

Actions                 Expert 1 %  Expert 2 %  Expert 3 %  Expert 4 %  Mean
Verb of motion              71          70          51          50      60.5
Verb of action              20          17          37          41      28.7
Verb of stasis               7           7           7           5       6.5
Verb of action + prep.       2           1           0           0       0.7
Describing Videos
Location Type                   Exemplars
Real world location             Seating area; Stairs; Walkway; Shop entrance
Relative spatial location       To the left; Left to right; From the right; Area from left
Location relative to portrayal  Right hand side of screen; Top left field of view of camera; Away from us; Towards the camera
Describing Videos
What about the labelling in CAVIAR? Actions are described through verbal nouns – three videos (Fight_Chase, Browse, LeftBagBehindChair)
Situation   Type                   FC    BR3   LBBC
Moving      Verbal noun of motion  923   799   1295
Inactive    Adjective of stasis    618    70    228
Fighting    Verbal noun of action  280     0      0
Joining     Verbal noun of action   52     0      0
Split up    Verbal noun of action  200     0      0
Browsing    Verbal noun of motion    0   135      0
Describing Videos
What about the labelling in CAVIAR? Actions are described through verbal nouns – three videos (Fight_Chase, Browse, LeftBagBehindChair)
Context     Type                   FC    BR3   LBBC
Walking     Verbal noun of motion  522   156    897
Fighting    Verbal noun of action  532     0      0
Immobile    Adjective of stasis   1019   200    626
Browsing    Verbal noun of motion    0   648      0
Describing Videos
What about the labelling in CAVIAR? Actions are described through verbal nouns (three videos: Fight_Chase, Browse, LeftBagBehindChair). How do our experts compare with the CAVIAR labelling?

Acts & Results  Expert Mean  CAVIAR Context  CAVIAR Situation
Motion             59.25          48               69
Action             29.75          12               12
Stasis              6.25          40               20
Describing Videos
Surveillance Meta-Data Model (both)
• How can we create a model to describe CCTV video?
• What is an activity?
• Activity: interaction between actors and scene objects
Multimodal Data Fusion
• Can we use machine learning methods to link vision and text?
• Can that link be created in an unsupervised fashion?
[Diagram: frame images from a video sequence linked to text descriptions]
Multimodal Data Fusion
Video Summarisation
• Automatic annotation / labelling
• Video content categorisation
• Retrieval
Video Summarisation
[Diagram: VIDEO PROCESSING → XML → FUSION → SUMMARY]
Summary/Conclusions
• Development of Visual Evidence Thesaurus
– Automatic Extraction of Surveillance Ontology
• Extracting Visual Semantics
– Motion Detection & Tracking
– Geometric and Colour Constancy
– Object Classification, Behaviour Analysis, Semantic Landscape
• Analysing Crowds
• Development of Surveillance Meta-Data Model
• Multimodal Data Fusion
– Fusion of Visual Semantics and Annotations
• Video Summarisation
Summary/Conclusions
• Computer Vision to extract the Visual Semantics (See Evil)
• Natural Language Processing to identify the Surveillance Ontology (Hear Evil)
• Linking of the two technologies in order to construct a Video Summarisation System (Speak Evil)