Agenda
• Introduction
• Feature set extension
• Video features processing
• Video features integration
• Preliminary results
• Conclusions
Meeting Structuring (1)
• Goal: recognise events which involve one or more communicative modalities:
• Monologue / Dialogue / Note taking / Presentation / Presentation at the whiteboard
• Working environment: the “IDIAP framework”:
  • 69 five-minute meetings with 4 participants
  • 30 transcribed meetings
  • Scripted meeting structure
Meeting Structuring (2)
• 3 audio-derived feature families: Speaker Turns, Prosodic Features, Lexical Features
[Diagram: mic. array → beam-forming → Speaker Turns; lapel mic. → rate of speech, pitch, energy → Prosody; lapel mic. → ASR → transcription → M/D discrimination → Lexical Features]
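The slides do not detail the prosody branch further; as a rough sketch (not the original IDIAP front-end), the pitch and energy measurements could be computed with librosa as follows. The wav path, sample rate and hop size are assumptions:

```python
# Minimal sketch of the prosody branch above: pitch and short-term
# energy from a lapel-microphone recording. Rate of speech and the
# ASR-based lexical features come from separate tools.
import librosa
import numpy as np

def prosodic_features(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    # Fundamental frequency via the YIN estimator, ~10 ms hops.
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr, hop_length=160)
    # Short-term energy (RMS) over the same hop size.
    energy = librosa.feature.rms(y=y, hop_length=160)[0]
    n = min(len(f0), len(energy))
    return np.stack([f0[:n], energy[:n]], axis=1)  # (frames, 2)
```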
Meeting Structuring (3)
• Dynamic Bayesian Network based models (using GMTK, Bilmes et al.)
• Multi-stream processing (parallel stream processing)
• “Counter structure” (state duration modelling)
[Diagram: two-stream DBN unrolled over time, with per-stream state nodes S_t^1, S_t^2, observations Y_t^1, Y_t^2, a shared action node A_t, and counter-structure nodes C_t and E_t]
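As a hedged illustration of the counter idea (a toy HMM analogue, not the actual GMTK DBN of the diagram), each action state can be expanded into (action, counter) sub-states so that state durations are modelled explicitly rather than memorylessly; `max_count` and `p_stay` below are illustrative parameters:

```python
# Toy "counter structure": each meeting action is expanded into
# (action, counter) sub-states so that self-transitions advance a
# saturating duration counter. Assumes n_actions > 1.
import numpy as np

def counter_transition(n_actions, max_count, p_stay=0.9):
    """Transition matrix over (action, counter) pairs."""
    n = n_actions * max_count
    T = np.zeros((n, n))
    for a in range(n_actions):
        for c in range(max_count):
            s = a * max_count + c
            # Staying in the same action increments the counter.
            T[s, a * max_count + min(c + 1, max_count - 1)] += p_stay
            # Leaving: jump to counter 0 of any other action.
            for b in range(n_actions):
                if b != a:
                    T[s, b * max_count] += (1 - p_stay) / (n_actions - 1)
    return T  # each row sums to 1
```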
              Corr  Sub  Del  Ins  AER
W/o counter   91.7  4.5  3.8  2.6  10.9
With counter  92.9  5.1  1.9  1.9   9.0
• 3 feature families:
  • Prosodic features (S1)
  • Speaker Turns (S2)
  • Lexical features (S3)
• Leave-one-out cross-validation over 30 annotated meetings
[Chart: Sub / Ins / Del error counts and AER against the total number of actions]
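The Corr/Sub/Del/Ins/AER figures above follow the usual WER-style scoring. A minimal sketch of how such numbers can be computed from a reference and a recognised action sequence (assuming a non-empty reference):

```python
# WER-style edit-distance alignment between the reference and the
# recognised sequence of meeting actions, tracking Sub/Del/Ins counts.
def action_error_rate(ref, hyp):
    n, m = len(ref), len(hyp)
    # d[i][j] = (cost, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    d = [[None] * (m + 1) for _ in range(n + 1)]
    d[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        d[i][0] = (i, 0, i, 0)          # all deletions
    for j in range(1, m + 1):
        d[0][j] = (j, 0, 0, j)          # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            diag, up, left = d[i-1][j-1], d[i-1][j], d[i][j-1]
            d[i][j] = min(
                (diag[0] + sub, diag[1] + sub, diag[2], diag[3]),
                (up[0] + 1, up[1], up[2] + 1, up[3]),      # deletion
                (left[0] + 1, left[1], left[2], left[3] + 1),  # insertion
            )
    _, s, dl, ins = d[n][m]
    corr = 100.0 * (n - s - dl) / n
    aer = 100.0 * (s + dl + ins) / n
    return corr, aer  # (Corr %, AER %)
```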
Feature set extension (1)
Multi-party meetings are multi-modal communicative processes.
Our features cover only two modalities: audio (prosodic features & speaker turns) and lexical content (lexical monologue/dialogue discriminator).
Exploiting video content is the next step!
Approach: extract low-level video features and leave their interpretation to high-level specialised models.
Feature set extension (2)
The three most confused symbols are also the three meeting actions which most heavily involve body/hand movements.
• Goal: improve the recognition of “Note taking”, “Presentation” and “Whiteboard”
Feature set extension (3)
We need motion features for the hands and head/torso regions.
• Constraints:
  – The system must be simple
  – Robust against “environmental” changes (lighting, backgrounds, …)
  – Open to further extensions / modifications
• Initial assumptions:
  – Meeting video content is quite “static”
  – Participants occupy only a few spatial regions and tend to stay there
  – The meeting room configuration (camera positions, seats, furniture, …) is fixed
Video feature extraction (1)
• Motion analysis is performed using Kanade–Lucas–Tomasi (KLT) feature tracking…
…and by partitioning the resulting trajectories according to their relative position in the scene.
Four spatial regions for each scene: Head 1/2, Hands 1/2.
KLT (1)
Assumption: the brightness of every point of a (slowly) moving or static object does not change between images taken at nearby time instants:
$$I(\mathbf{x} + d\mathbf{x},\, t + dt) = I(\mathbf{x}, t)$$
Taylor series, approximated to the 1st derivative:
$$I(\mathbf{x} + d\mathbf{x},\, t + dt) \approx I(\mathbf{x}, t) + \nabla I^{T} d\mathbf{x} + \frac{\partial I}{\partial t}\, dt$$
Combining the two gives the optical flow constraint equation:
$$\nabla I^{T}\, \frac{d\mathbf{x}}{dt} = -\frac{\partial I(\mathbf{x},t)}{\partial t}$$
where $\partial I(\mathbf{x},t)/\partial t$ represents how fast the intensity changes with time, $d\mathbf{x}/dt = (dx/dt,\; dy/dt)^{T}$ is the moving object speed, and $\nabla I$ is the brightness gradient.
This is one equation in two unknowns, hence it admits more than one solution.
KLT (2)
• Minimize the weighted least-squares error over a window $W$ of neighbour points of $\mathbf{x}$, assumed to share the same constant velocity:
$$\epsilon = \sum_{\mathbf{x} \in W} w(\mathbf{x})^{2} \left( \nabla I(\mathbf{x},t)^{T}\, \frac{d\mathbf{x}}{dt} + \frac{\partial I(\mathbf{x},t)}{\partial t} \right)^{2}$$
• In two dimensions the system has the form:
$$\sum_{(x,y) \in W} W(x,y)\; A \begin{pmatrix} dx/dt \\ dy/dt \end{pmatrix} = -\sum_{(x,y) \in W} W(x,y) \begin{pmatrix} \frac{\partial I}{\partial x}\frac{\partial I}{\partial t} \\ \frac{\partial I}{\partial y}\frac{\partial I}{\partial t} \end{pmatrix}, \qquad A = \begin{pmatrix} \left(\frac{\partial I}{\partial x}\right)^{2} & \frac{\partial I}{\partial x}\frac{\partial I}{\partial y} \\ \frac{\partial I}{\partial x}\frac{\partial I}{\partial y} & \left(\frac{\partial I}{\partial y}\right)^{2} \end{pmatrix}$$
• If $\det\left(\sum_{(x,y) \in W} W A\right) \neq 0$, the solution is:
$$\begin{pmatrix} dx/dt \\ dy/dt \end{pmatrix} = \left( \sum_{(x,y) \in W} W A \right)^{-1} \left( -\sum_{(x,y) \in W} W \begin{pmatrix} \frac{\partial I}{\partial x}\frac{\partial I}{\partial t} \\ \frac{\partial I}{\partial y}\frac{\partial I}{\partial t} \end{pmatrix} \right)$$
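A minimal numpy transcription of this solution for a single feature, assuming two grayscale frames as float arrays (`half = 3` gives the 7×7 window used here):

```python
# Estimate the velocity of one feature from spatial/temporal gradients
# in a small window W centred on (x, y) -- a direct transcription of
# the least-squares solution above, not a full tracker.
import numpy as np

def lk_velocity(prev, curr, x, y, half=3):
    Ix = np.gradient(prev, axis=1)[y-half:y+half+1, x-half:x+half+1].ravel()
    Iy = np.gradient(prev, axis=0)[y-half:y+half+1, x-half:x+half+1].ravel()
    It = (curr - prev)[y-half:y+half+1, x-half:x+half+1].ravel()
    A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = -np.array([np.sum(Ix * It), np.sum(Iy * It)])
    if np.linalg.det(A) == 0:       # ill-conditioned window: no solution
        return None
    return np.linalg.solve(A, b)    # (dx/dt, dy/dt) in pixels/frame
```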
KLT (3)
A good feature is:
1. one that can be tracked well… (Tomasi et al.): if $\lambda_1, \lambda_2$ are the eigenvalues of $A$, the system is well-conditioned if
$$\min(\lambda_1, \lambda_2) > Th$$
i.e. the eigenvalues are large (high texture content) but in the same range;
2. …and even better if it is part of a human body: pixels with a higher probability of being skin are preferred,
$$P(SKIN) > Th'$$
We decided to track n = 100 features; $W$ is a square (7×7) window.
KLT (4)
KLT feature tracking consists of 3 steps:
1. Select n good features: $\min_i(\lambda_i) > Th$ and $P(SKIN) > Th'$
2. Track the selected n features by solving
$$\begin{pmatrix} dx/dt \\ dy/dt \end{pmatrix} = \left( \sum_{(x,y) \in W} W A \right)^{-1} \left( -\sum_{(x,y) \in W} W\, \nabla I\, \frac{\partial I}{\partial t} \right)$$
3. Replace lost features
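The slides use their own tracker; a comparable 3-step loop can be sketched with OpenCV's KLT implementation. The skin term is approximated here by a (hypothetical) selection mask, since OpenCV's corner selector has no built-in skin weighting:

```python
# Sketch of the 3-step KLT loop with OpenCV: select, track, replace.
import cv2
import numpy as np

N_FEATURES = 100  # as in the slides

def track(video_path, skin_mask=None):
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Step 1: select n good features (min-eigenvalue criterion).
    pts = cv2.goodFeaturesToTrack(prev, maxCorners=N_FEATURES,
                                  qualityLevel=0.01, minDistance=7,
                                  mask=skin_mask)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Step 2: track the selected features frame-to-frame.
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
        pts = nxt[status.ravel() == 1].reshape(-1, 1, 2)
        # Step 3: replace lost features.
        if len(pts) < N_FEATURES:
            extra = cv2.goodFeaturesToTrack(gray,
                                            maxCorners=N_FEATURES - len(pts),
                                            qualityLevel=0.01, minDistance=7,
                                            mask=skin_mask)
            if extra is not None:
                pts = np.vstack([pts, extra])
        yield pts
        prev = gray
```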
Skin modelling
Color-based approach in the (Cr, Cb) chromatic subspace:
$$Y = 0.299\,R + 0.587\,G + 0.114\,B$$
$$Cr = V = 0.713\,(R - Y)$$
$$Cb = U = 0.564\,(B - Y)$$
Initial experiments were made using a single Gaussian; the current model is a 3-component Gaussian mixture. Skin samples were taken from unused meetings.
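A minimal sketch of such a skin model, assuming labelled skin pixels from held-out meetings are available as an (N, 3) RGB array; the log-likelihood threshold is illustrative:

```python
# 3-component GMM over the (Cr, Cb) chromatic subspace, trained on
# skin pixels sampled from unused meetings.
import numpy as np
from sklearn.mixture import GaussianMixture

def rgb_to_crcb(rgb):
    """Conversion matching the formulas above."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = 0.713 * (r - y)
    cb = 0.564 * (b - y)
    return np.stack([cr, cb], axis=-1)

def fit_skin_model(skin_pixels):
    """skin_pixels: (N, 3) RGB samples of labelled skin."""
    gmm = GaussianMixture(n_components=3, covariance_type='full')
    gmm.fit(rgb_to_crcb(skin_pixels))
    return gmm

def skin_mask(gmm, frame_rgb, log_thresh=-10.0):
    """Boolean mask of pixels whose P(SKIN) exceeds the threshold."""
    crcb = rgb_to_crcb(frame_rgb.astype(float)).reshape(-1, 2)
    return (gmm.score_samples(crcb) > log_thresh).reshape(frame_rgb.shape[:2])
```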
Video feature extraction (2)
Structure of the implemented system:
[Diagram: video → KLT feature tracker, assisted by skin detection driven by the skin model → trajectory structure; 100 features / frame, 100 trajectories / frame]
Video feature extraction (3)
Trajectory classification:
• Evaluate the average motion and remove long, quasi-static trajectories
• Define 4 partitions (regions): 2 × heads (H1, H2), 2 × hands (Ha1, Ha2)
• Define 2 additional fixed regions (L, R): 4 regions + 2 regions
[Diagram: scene partitioned into head regions H1/H2, hand regions Ha1/Ha2, and lateral regions L/R]
Video feature extraction (5)
For each scene, 4 motion vectors, one per region, are estimated (soon to be enhanced with 2 more regions/vectors, L and R, in order to detect whether someone is entering or leaving the scene). Taking motion vectors averaged over many trajectories helps reduce noise (see the sketch after this list).
Open issues:
• Loss of tracking for fast-moving objects
• P(SKIN) is taken into account only at feature selection, not during the tracking itself
• Assumption of a fixed scene structure
• Delayed/offline processing
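Putting slides (3) and (5) together, a minimal sketch of the trajectory post-processing, with hypothetical fixed region rectangles standing in for the real room layout:

```python
# Drop long quasi-static trajectories, assign the rest to the
# head/hand regions, and average their per-frame displacements into
# one motion vector per region. Region rectangles are assumed fixed
# (static room) and are illustrative values.
import numpy as np

REGIONS = {'H1': (50, 0, 200, 120),   'H2': (440, 0, 590, 120),
           'Ha1': (50, 120, 300, 300), 'Ha2': (340, 120, 590, 300)}

def region_motion(trajectories, min_motion=0.5):
    """trajectories: list of (T, 2) arrays of point positions per frame."""
    vecs = {name: [] for name in REGIONS}
    for traj in trajectories:
        steps = np.diff(traj, axis=0)
        if np.linalg.norm(steps, axis=1).mean() < min_motion:
            continue                  # long, quasi-static trajectory
        x, y = traj[-1]               # classify by current position
        for name, (x0, y0, x1, y1) in REGIONS.items():
            if x0 <= x < x1 and y0 <= y < y1:
                vecs[name].append(steps.mean(axis=0))
                break
    # Averaging over many trajectories suppresses per-feature noise.
    return {name: (np.mean(v, axis=0) if v else np.zeros(2))
            for name, v in vecs.items()}
```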
Integration
Goal: extend the multi-stream model with a new video stream.
[Diagram: four-stream DBN unrolled over time, with per-stream states S_t^1…S_t^4 and observations Y_t^1…Y_t^4 for the prosodic-feature, speaker-turn, lexical-feature and video-feature streams, a shared action node A_t, and counter-structure nodes C_t and E_t]
It is possible that the extended model will be intractable due to the increased state space. In this case:
• State-space reduction through a multi-time-scale approach will be attempted
• Early integration of Speaker Turns + Lexical Features will be investigated
Preliminary results
Before proceeding with the proposed integration we need to:
• compare video performance against the other feature families
• validate the extracted video features

Single-stream accuracy:

             Speaker  Prosodic  Lexical   Video
             Turns    Features  Features  Features
Accuracy %   85.9     69.9      52.6      48.1

Video features alone perform quite poorly, but they seem to be helpful when evaluated together with Speaker Turns:

(A) (Speaker Turns) + (Prosody + Lexical Features)
(B) (Speaker Turns) + (Video Features)

                      Corr  Sub  Del  Ins  AER
(A) Two-stream model  87.8  4.5  7.7  3.2  15.4
(B) Two-stream model  90.4  3.2  6.4  4.5  14.1
Summary
– Extraction of video features through:
  • a skin-detector-enhanced KLT feature tracker
  • segmentation of trajectories into 4/6 spatial regions
  (a simple and fast approach, but with some open problems)
– Validation of motion vectors as a video feature
– Integration into the existing framework (work in progress)