
Exploiting video information for Meeting Structuring


Agenda

• Introduction

• Feature set extension

• Video features processing

• Video features integration

• Preliminary results

• Conclusions

Meeting Structuring (1)

• Goal: recognise events which involve one or more communicative modalities:

• Monologue / Dialogue / Note taking / Presentation / Presentation at the whiteboard

• Working environment: the “IDIAP framework”:

• 69 five-minute-long meetings with 4 participants

• 30 transcribed meetings

• Scripted meeting structure

Meeting Structuring (2)

• 3 audio-derived feature families: Speaker Turns, Prosodic Features, Lexical Features

[Feature extraction pipeline: microphone array → beam-forming → speaker turns; lapel microphones → rate of speech, pitch baseline, energy → prosodic features; lapel microphones → ASR → transcription → monologue/dialogue (M/D) discrimination → lexical features]
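As an illustration of the prosodic branch of this pipeline, the sketch below extracts frame-level energy and a pitch baseline from a lapel-microphone recording with librosa; the file name, sampling rate and parameter values are assumptions, and rate-of-speech estimation is not shown.

```python
import numpy as np
import librosa

# Hypothetical lapel-microphone recording.
y, sr = librosa.load("lapel_mic.wav", sr=16000)

# Frame-level energy (RMS).
energy = librosa.feature.rms(y=y, hop_length=160)[0]

# Frame-level pitch (F0) via the pYIN tracker; NaN on unvoiced frames.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=60, fmax=400, sr=sr, hop_length=160)

# A simple "pitch baseline": median F0 over voiced frames.
pitch_baseline = np.nanmedian(f0)
```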

Meeting Structuring (3)

• Dynamic Bayesian Network based models (using GMTK, Bilmes et al.)
• Multi-stream processing (parallel stream processing)
• “Counter structure” (state duration modelling)

[DBN structure: two parallel feature streams with hidden sub-states S_t^1, S_t^2 and observations Y_t^1, Y_t^2, a shared meeting-action node A_t, and the counter-structure nodes C_t, E_t used for state duration modelling]
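To make the role of the counter structure concrete, here is a minimal generative sketch (our own Python illustration, not the GMTK model itself) of how an explicit duration counter lets action lengths follow an arbitrary distribution rather than the geometric one implied by a plain self-loop; the action names follow the slides, all numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = ["monologue", "dialogue", "note taking", "presentation", "whiteboard"]

def sample_action_sequence(n_frames, p_end_given_counter):
    """Generate a frame-level sequence of meeting actions.

    p_end_given_counter(c) is the probability that the current action
    ends after having lasted c frames -- the role played by the
    counter (C) and end (E) nodes in the DBN.
    """
    seq, action, counter = [], rng.choice(ACTIONS), 1
    for _ in range(n_frames):
        seq.append(action)
        if rng.random() < p_end_given_counter(counter):
            action, counter = rng.choice(ACTIONS), 1   # switch action, reset counter
        else:
            counter += 1                               # stay in the action
    return seq

# Example duration model: an action almost never ends before 50 frames,
# then terminates with probability 0.05 per frame (made-up numbers).
sequence = sample_action_sequence(1000, lambda c: 0.001 if c < 50 else 0.05)
```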

                  Corr   Sub   Del   Ins   AER
Without counter   91.7   4.5   3.8   2.6   10.9
With counter      92.9   5.1   1.9   1.9    9.0

• 3 feature families:
  • Prosodic features (S1)
  • Speaker Turns (S2)
  • Lexical features (S3)

• Leave-one-out cross-validation over 30 annotated meetings

AER = 100 × (Sub + Ins + Del) / (total number of actions)
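Assuming AER is defined analogously to the word error rate, a minimal helper might look like this (the function name is ours):

```python
def action_error_rate(n_sub, n_ins, n_del, n_ref_actions):
    """AER (%) = 100 * (substitutions + insertions + deletions) / reference actions."""
    return 100.0 * (n_sub + n_ins + n_del) / n_ref_actions

# With the percentage figures of the table above (already normalised by the
# number of actions), e.g. the "with counter" row: 5.1 + 1.9 + 1.9 ≈ 9.0.
```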

Feature set extension (1)

Multi-party meetings are multi-modal communicative processes.

Our features cover only two modalities: audio (prosodic features & speaker turns) and lexical content (lexical monologue/dialogue discriminator).

Exploiting video content is the next step!

Approach: extract low-level video features and leave their interpretation to high-level specialised models

The three most confused symbols

Feature set extension (2)

Three meeting actions which highly involve body/hands movements

Goal: improve the recognition of “Note taking”, “Presentation” and “Whiteboard”

Feature set extension (3)

We need motion features for hands/head-torso regions

• Constraints:
  – The system must be simple
  – Robust to “environmental” changes (lighting, backgrounds, …)
  – Open to further extensions / modifications

• Initial assumptions:
  – Meeting video content is quite “static”
  – Participants occupy only a few spatial regions and tend to stay there
  – The meeting room configuration (camera positions, seats, furniture, …) is fixed

Video feature extraction (1)

• Motion analysis is performed using Kanade-Lucas-Tomasi (KLT) feature tracking… and partitioning the resulting trajectories according to their relative position within the scene

• Four spatial regions for each scene: Head 1 / 2, Hands 1 / 2

KLT (1)

Assumption: the brightness of every point of a (slowly) moving or static object does not change between images taken at nearby time instants:

$$I(\mathbf{x} + d\mathbf{x},\, t + dt) = I(\mathbf{x}, t)$$

Taylor series, approximated to the 1st derivative:

$$I(\mathbf{x} + d\mathbf{x},\, t + dt) \approx I(\mathbf{x}, t) + \nabla I^{T} d\mathbf{x} + \frac{\partial I}{\partial t}\, dt$$

Optical flow constraint equation:

$$\nabla I^{T} \frac{d\mathbf{x}}{dt} = -\frac{\partial I}{\partial t}$$

where $\partial I(\mathbf{x},t)/\partial t$ represents how fast the intensity is changing with time, $\nabla I$ is the brightness gradient, and $d\mathbf{x}/dt = (dx/dt,\, dy/dt)$ are the moving object speeds.

This is one equation in two unknowns ($d\mathbf{x}/dt \in \mathbb{R}^2$); hence more than one solution.

KLT (2)

• Minimizing the weighted least-squares error:

$$\varepsilon = \sum_{\mathbf{x} \in \mathcal{W}} w(\mathbf{x}) \left[ \nabla I(\mathbf{x},t)^{T} \frac{d\mathbf{x}}{dt} + \frac{\partial I(\mathbf{x},t)}{\partial t} \right]^{2}$$

where $\mathcal{W}$ is a window of neighbour points of $\mathbf{x}$, assumed to share the same constant velocity.

• In two dimensions the system has the form:

$$\sum_{(x,y) \in \mathcal{W}} W(x,y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} \begin{bmatrix} dx/dt \\ dy/dt \end{bmatrix} = -\sum_{(x,y) \in \mathcal{W}} W(x,y) \begin{bmatrix} I_x I_t \\ I_y I_t \end{bmatrix}$$

• If $\det\!\left(\sum_{(x,y) \in \mathcal{W}} W\, A\right) \neq 0$ the solution is:

$$\begin{bmatrix} dx/dt \\ dy/dt \end{bmatrix} = \left( \sum_{(x,y) \in \mathcal{W}} W\, A \right)^{-1} \left( -\sum_{(x,y) \in \mathcal{W}} W \begin{bmatrix} I_x I_t \\ I_y I_t \end{bmatrix} \right), \qquad A = \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$$
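A minimal NumPy sketch of this closed-form solution for a single window; the variable names are ours, and Ix, Iy, It would come from image derivatives computed over the (7×7) neighbourhood:

```python
import numpy as np

def lk_window_velocity(Ix, Iy, It, w, eig_threshold=1e-3):
    """Solve the 2x2 weighted least-squares system for one window.

    Ix, Iy, It : image derivatives over the window (e.g. 7x7 arrays)
    w          : per-pixel weights over the same window
    Returns (dx/dt, dy/dt), or None if the system is ill-conditioned.
    """
    # Left-hand side: sum_W W * A, with A = [[Ix^2, IxIy], [IxIy, Iy^2]]
    G = np.array([[np.sum(w * Ix * Ix), np.sum(w * Ix * Iy)],
                  [np.sum(w * Ix * Iy), np.sum(w * Iy * Iy)]])
    # Right-hand side: -sum_W W * [Ix*It, Iy*It]
    b = -np.array([np.sum(w * Ix * It), np.sum(w * Iy * It)])
    # Good-feature check: both eigenvalues must be large enough.
    lam = np.linalg.eigvalsh(G)
    if lam.min() < eig_threshold:
        return None
    return np.linalg.solve(G, b)
```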

KLT (3)

A good feature is:

1. one that can be tracked well… (Tomasi et al.): if $\lambda_1, \lambda_2$ are the eigenvalues of $\sum_{(x,y) \in \mathcal{W}} W\, A$, the system is well-conditioned if
$$\min(\lambda_1, \lambda_2) > Th$$
(large eigenvalues, but in the same range: high texture content)

2. … and even better if it is part of a human body:
$$P(\mathrm{SKIN}) > Th$$
(pixels with a higher probability of being skin are preferred)

$\mathcal{W}$ is a square (7×7) window. We decided to track n = 100 features.

KLT (4)

KLT feature tracking consists of 3 steps:

1. Select n good features: $\min_i(\lambda_i) > Th$ and $P(\mathrm{SKIN}) > Th$
2. Track the selected n features (using the weighted least-squares solution above)
3. Replace lost features
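In a modern implementation these three steps map naturally onto OpenCV's Shi-Tomasi detector and pyramidal Lucas-Kanade tracker; the sketch below is only an assumed reconstruction of the scheme (the skin-mask usage and all parameter values are illustrative, not taken from the slides).

```python
import cv2
import numpy as np

N_FEATURES = 100  # as in the slides

def select_features(gray, skin_mask):
    """Step 1: Shi-Tomasi 'good features to track' (min-eigenvalue
    criterion), restricted to skin-coloured pixels via the mask."""
    return cv2.goodFeaturesToTrack(gray, maxCorners=N_FEATURES,
                                   qualityLevel=0.01, minDistance=5,
                                   mask=skin_mask, blockSize=7)

def track_features(prev_gray, gray, pts):
    """Step 2: pyramidal Lucas-Kanade tracking with a 7x7 window."""
    new_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, pts, None, winSize=(7, 7), maxLevel=2)
    return new_pts, status

def replace_lost(gray, skin_mask, pts, status):
    """Step 3: keep successfully tracked points and top up to N_FEATURES."""
    kept = pts[status.ravel() == 1]
    missing = N_FEATURES - len(kept)
    if missing > 0:
        extra = cv2.goodFeaturesToTrack(gray, maxCorners=missing,
                                        qualityLevel=0.01, minDistance=5,
                                        mask=skin_mask, blockSize=7)
        if extra is not None:
            kept = np.vstack([kept, extra])
    return kept
```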

Skin modelling

Colour-based approach in the (Cr, Cb) chromatic subspace:

$$Y = 0.299\,R + 0.587\,G + 0.114\,B$$
$$Cr = 0.713\,(R - Y) \quad (\approx V)$$
$$Cb = 0.564\,(B - Y) \quad (\approx U)$$

• Initial experiments made using a single Gaussian
• Now: a 3-component Gaussian Mixture Model
• Skin samples taken from unused meetings
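A possible implementation of this skin model with scikit-learn, reusing the conversion above; the variable names and the thresholding strategy are our own assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def rgb_to_crcb(rgb):
    """(Cr, Cb) chromatic components using the conversion above (RGB in [0, 255])."""
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    Y = 0.299 * R + 0.587 * G + 0.114 * B
    return np.stack([0.713 * (R - Y), 0.564 * (B - Y)], axis=-1)

def fit_skin_model(skin_pixels_rgb, n_components=3):
    """skin_pixels_rgb: (n, 3) array of hand-labelled skin pixels
    taken from meetings not used in the experiments."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(rgb_to_crcb(skin_pixels_rgb.astype(float)))
    return gmm

def skin_mask(gmm, frame_rgb, log_lik_threshold=-10.0):
    """Per-pixel skin decision: threshold the GMM log-likelihood in (Cr, Cb)."""
    crcb = rgb_to_crcb(frame_rgb.astype(float)).reshape(-1, 2)
    log_lik = gmm.score_samples(crcb).reshape(frame_rgb.shape[:2])
    return (log_lik > log_lik_threshold).astype(np.uint8) * 255
```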

Video feature extraction (2)

Structure of the implemented system:

Video → KLT feature tracker (100 features / frame, guided by skin detection with the skin model) → Trajectory structure (100 trajectories / frame)

Video feature extraction (3)

Trajectory classification:

Trajectory structure → evaluate average motion → remove long and quite static trajectories → classify trajectories into regions:

• 4 partitions (regions): 2 × heads (H1, H2), 2 × hands (Ha1, Ha2)
• 2 additional fixed regions: L, R
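A rough sketch of this classification step; the region boxes, the motion threshold and the trajectory representation are illustrative assumptions:

```python
import numpy as np

# Purely illustrative region boxes (x0, y0, x1, y1) for a 640x480 view:
# two head regions and two hand regions per camera.
REGIONS = {
    "head1":  (0,   0,   320, 240),
    "head2":  (320, 0,   640, 240),
    "hands1": (0,   240, 320, 480),
    "hands2": (320, 240, 640, 480),
}

def region_motion_vectors(trajectories, regions=REGIONS,
                          max_len=50, min_motion=0.5):
    """trajectories: list of (T, 2) arrays of a feature's positions.

    Long, almost static trajectories are discarded; each remaining
    trajectory is assigned to a region by its mean position, and one
    average motion vector per region is returned."""
    sums = {name: np.zeros(2) for name in regions}
    counts = {name: 0 for name in regions}
    for traj in trajectories:
        steps = np.diff(traj, axis=0)              # frame-to-frame motion
        if len(steps) == 0:
            continue
        mean_step = np.linalg.norm(steps, axis=1).mean()
        if len(traj) > max_len and mean_step < min_motion:
            continue                               # long and quite static: drop
        cx, cy = traj.mean(axis=0)
        for name, (x0, y0, x1, y1) in regions.items():
            if x0 <= cx < x1 and y0 <= cy < y1:
                sums[name] += steps.mean(axis=0)
                counts[name] += 1
                break
    return {name: sums[name] / counts[name] if counts[name] else np.zeros(2)
            for name in regions}
```

Averaging the motion over all trajectories falling in a region yields one motion vector per region, which is the feature actually passed on to the models.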

Video feature extraction (4)

[Four example images, 1–4]

Video feature extraction (5)

Open issues:
• Loss of tracking for fast-moving objects
• Accounting for P(SKIN) during the tracking, not only at feature selection
• Assumption of a fixed scene structure
• Delayed / offline processing

For each scene, 4 motion vectors are estimated, one for each region (H1, H2, Ha1, Ha2); these will soon be enhanced with 2 more regions/vectors (L and R) in order to detect whether someone is entering or leaving the scene.

Taking motion vectors averaged over many trajectories helps reduce noise.

Integration

Goal: extend the multi-stream model with a new video stream

[Extended DBN structure: the existing prosodic-feature, speaker-turn and lexical-feature streams (hidden sub-states S_t^1..3 with observations Y_t^1..3) plus a new video-feature stream (S_t^4, Y_t^4), all coupled through the shared action nodes A_t and the counter-structure nodes C_t, E_t]

It is possible that the extended model will be intractable due to the increased state space.

In this case:

• State space reduction through a multi-time-scale approach will be attempted

• Early integration of Speaker Turns + Lexical features will be investigated
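For reference, “early integration” here simply means merging two feature streams into one observation vector before modelling; a minimal sketch with hypothetical array names and dimensions:

```python
import numpy as np

# One feature vector per frame for each stream (placeholder data).
speaker_turn_feats = np.zeros((1000, 4))   # T x d1
lexical_feats = np.zeros((1000, 2))        # T x d2

# Early integration: frame-wise concatenation, so a single stream
# (and a single hidden state chain) models both feature families.
early_stream = np.concatenate([speaker_turn_feats, lexical_feats], axis=1)
```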

Video features alone have quite poor performance, but they seem to be helpful when evaluated together with Speaker Turns.

Preliminary results

Before proceeding with the proposed integration we need to:

• compare video performance against the other feature families
• validate the extracted video features

(A) (Speaker Turns) + (Prosody + Lexical Features)
(B) (Speaker Turns) + (Video Features)

                       Corr   Sub   Del   Ins   AER
(A) Two-stream model   87.8   4.5   7.7   3.2   15.4
(B) Two-stream model   90.4   3.2   6.4   4.5   14.1

             Speaker Turns   Prosodic Features   Lexical Features   Video Features
Accuracy %        85.9              69.9               52.6              48.1

Summary

– Extraction of video features through:
  • A skin-detector enhanced KLT feature tracker
  • Segmentation of trajectories into 4/6 spatial regions
  (A simple and fast approach, but with some open problems)

– Validation of motion vectors as a video feature
– Integration into the existing framework (work in progress)