Multimodal Person Discovery
in Broadcast TV
Johann Poignant / Hervé Bredin / Claude Barras
2015
[email protected] herve.niderb.fr @hbredin
Outline
• Motivations
• Definition of the task
• Datasets
• Baseline & Metadata
• Evaluation protocol
• Organization
• Results
• Conclusion
2
Motivations
3
"the usual suspects"
• Huge TV archives are useless if not searchable
• People ❤ people
• Need for person-based indexes
4
REPERE
• Evaluation campaigns in 2012, 2013 and 2014
  Three French consortia funded by ANR
• Multimodal person recognition in TV documents
  "who speaks when?" and "who appears when?"
• Led to significant progress in both supervised and unsupervised multimodal person recognition
5
From REPERE to Person Discovery
• Speaking faces
  Focus on "person of interest"
• Unsupervised approaches
  People may not be "famous" at indexing time
• Evidence
  Archivist/journalist use case
6
Definition of the task
7
Input
8
Broadcast TV pre-segmented into shots.
Speaking face
9
Tag each shot with the names of people both speaking and appearing at the same time
Person discovery
• Prior biometric models are not allowed.
• Person names must be discovered automatically in text overlay or speech utterances.
unsupervised approaches only
10
Evidence
11
Associate each name with a unique shot proving that the person actually holds this name
Evidence (cont.)
12
an image evidence is a shot during which a person is visible and their name is written on screen
an audio evidence is a shot during which a person is visible and their name is pronounced at least once during a [shot start time - 5s, shot end time + 5s] neighborhood
shot #3 for B
shot #1 for A
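A minimal sketch of the ±5s audio-evidence rule above, assuming hypothetical inputs (shot boundaries in seconds and a list of (name, time) utterances); person visibility is checked elsewhere:

```python
# Hypothetical check for the audio-evidence temporal rule: the name must be
# pronounced within [shot start - 5s, shot end + 5s]. Visibility of the
# person in the shot is assumed to be verified separately.
def is_audio_evidence(shot_start, shot_end, name, pronounced, margin=5.0):
    return any(
        n == name and shot_start - margin <= t <= shot_end + margin
        for n, t in pronounced
    )

# A name uttered 3 s before the shot starts still counts as evidence.
assert is_audio_evidence(10.0, 15.0, "B", [("B", 7.0)])
```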
Datasets
13
Datasets
14
DEV | REPERE
• 137 hours
• two French TV channels, eight different show types
• dense audio annotations: speaker diarization, speech transcription
• sparse video annotations: face detection & recognition, optical character recognition

TEST | INA
• 106 hours (172 videos)
• only one French TV channel, only one show type (news)
• no prior annotation
• a posteriori collaborative annotation
Datasets: TEST | INA
15
106 hours (172 videos)
only one French TV channel, only one show type (news)
no prior annotation
a posteriori collaborative annotation
http://dataset.ina.fr
Evaluation protocol
16
Information retrieval task
• Queries formatted as firstname_lastname
e.g. francois_hollande "return all shots where François Hollande is speaking and visible at the same time"
• Approximate search among submitted names
  e.g. francois_holande
• Select shots tagged with the most similar name
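As a hedged illustration of this approximate search, the sketch below picks the submitted name closest to a query using a normalized Levenshtein similarity; the campaign's exact Levenshtein ratio may be defined slightly differently, and all names here are examples.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; one simple normalization among several."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

# The misspelled hypothesis still matches the query.
hypotheses = ["francois_holande", "nicolas_sarkozy"]
query = "francois_hollande"
best = max(hypotheses, key=lambda n: ratio(query, n))
print(best)  # francois_holande
```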
17
Evidence-weighted MAP
18
To ensure that participants followed this strict "no biometric supervision" constraint, each hypothesized name n ∈ N had to be backed up by a carefully selected and unique shot proving that the person actually holds this name n: we call this an evidence and denote it by E : N → S. In real-world conditions, this evidence would help a human annotator double-check the automatically generated index, even for people they did not know beforehand.
Two types of evidence were allowed: an image evidence is a shot during which a person is visible and their name is written on screen; an audio evidence is a shot during which a person is visible and their name is pronounced at least once during a [shot start time − 5s, shot end time + 5s] neighborhood. For instance, in Figure 1, shot #1 is an image evidence for Mr A (because his name and his face are visible simultaneously on screen) while shot #3 is an audio evidence for Mrs B (because her name is pronounced less than 5 seconds before or after her face is visible on screen).
3. DATASETS
The REPERE corpus – distributed by ELDA – served as development set. It is composed of various TV shows (around news, politics and people) from two French TV channels, for a total of 137 hours. A subset of 50 hours is manually annotated. Audio annotations are dense and provide speech transcripts and identity-labeled speech turns. Video annotations are sparse (one image every 10 seconds) and provide overlaid text transcripts and identity-labeled face segmentation. Both speech and overlaid text transcripts are tagged with named entities. The test set – distributed by INA – contains 106 hours of video, corresponding to 172 editions of the evening broadcast news "Le 20 heures" of French public channel "France 2", from January 1st 2007 to June 30th 2007.
As the test set came completely free of any annotation, it was annotated a posteriori based on participants' submissions. In the following, task ground truths are denoted by function L : S → P(N), which maps each shot s to the set of names of every speaking face it contains, and function 𝓔 : S → P(N), which maps each shot s to the set of person names for which it actually is an evidence.
4. BASELINE AND METADATA
This task targeted researchers from several communities including multimedia, computer vision, speech and natural language processing. Though the task was multimodal by design and necessitated expertise in various domains, the technological barrier to entry was lowered by the provision of a baseline system described in Figure 2 and available as open-source software (http://github.com/MediaEvalPersonDiscovery). For instance, a researcher from the speech processing community could focus their research efforts on improving speaker diarization and automatic speech transcription, while still being able to rely on the provided face detection and tracking results to participate in the task.
The audio stream was segmented into speech turns, while faces were detected and tracked in the visual stream. Speech turns (resp. face tracks) were then compared and clustered based on MFCC and the Bayesian Information Criterion [10] (resp. HOG [11] and Logistic Discriminant Metric Learning [17] on facial landmarks [31]). The approach proposed in [27] was also used to compute a probabilistic mapping between co-occurring faces and speech turns. Written (resp. pronounced) person names were automatically extracted from the visual stream (resp. the audio stream) using the open source LOOV Optical Character Recognition system [24] (resp. Automatic Speech Recognition [21, 12]) followed by Named Entity detection (NE). The fusion module was a two-step algorithm: propagation of written names onto speaker clusters [26], followed by propagation of speaker names onto co-occurring speaking faces.

Figure 2: Multimodal baseline pipeline. Output of greyed-out modules is provided to the participants.
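To make the two-step fusion concrete, here is a hedged sketch of the propagation idea only: each speaker cluster takes its most frequently co-occurring written name, and that name is then passed on to co-occurring speaking faces. The data structures and function names are illustrative assumptions, not the actual baseline code.

```python
from collections import Counter

def propagate_names(written_names, speaking_faces):
    # Step 1: name each speaker cluster after the written name that
    # co-occurs with it most often (hypothetical input format:
    # {cluster: [names written on screen while the cluster speaks]}).
    cluster_name = {
        cluster: Counter(names).most_common(1)[0][0]
        for cluster, names in written_names.items() if names
    }
    # Step 2: propagate speaker names onto co-occurring speaking faces
    # ({shot: speaker cluster of the speaking face in that shot}).
    return {
        shot: cluster_name[cluster]
        for shot, cluster in speaking_faces.items() if cluster in cluster_name
    }

written = {"spk1": ["a_dupont", "a_dupont"], "spk2": ["b_martin"]}
faces = {"shot_1": "spk1", "shot_3": "spk2", "shot_4": "spk7"}
print(propagate_names(written, faces))
# {'shot_1': 'a_dupont', 'shot_3': 'b_martin'}  (shot_4 stays unnamed)
```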
5. EVALUATION METRIC
This information retrieval task was evaluated using a variant of Mean Average Precision (MAP) that took the quality of evidences into account. For each query q ∈ Q ⊂ N (firstname_lastname), the hypothesized person name n_q with the highest Levenshtein ratio ρ to the query q is selected (ρ : N × N → [0, 1]), allowing approximate name transcription:

    n_q = argmax_{n ∈ N} ρ(q, n)   and   ρ_q = ρ(q, n_q)

Average precision AP(q) is then computed classically based on relevant and returned shots, the latter sorted by confidence:

    relevant(q) = {s ∈ S | q ∈ L(s)}
    returned(q) = {s ∈ S | n_q ∈ L(s)}

A proposed evidence is correct if the name n_q is close enough to the query q and if shot E(n_q) actually is an evidence for q:

    C(q) = 1 if ρ_q > 0.95 and q ∈ 𝓔(E(n_q)), and 0 otherwise

To ensure participants provide correct evidences for every hypothesized name n ∈ N, standard MAP is altered into EwMAP (Evidence-weighted Mean Average Precision), the official metric for the task:

    EwMAP = (1/|Q|) · Σ_{q ∈ Q} C(q) · AP(q)
Acknowledgment. This work was supported by the French National Agency for Research under grant ANR-12-CHRI-0006-01. The open source CAMOMILE collaborative annotation platform (http://github.com/camomile-project) was used extensively throughout the progress of the task: from the run submission script to the automated leaderboard, including a posteriori collaborative annotation of the test corpus. We thank ELDA and INA for supporting the task by distributing development and test datasets.
    MAP = (1/|Q|) · Σ_{q ∈ Q} AP(q)

C(q) measures the correctness of the provided evidence for query q
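Putting the definitions together, a minimal sketch of the scoring follows. It assumes the caller supplies, per query, the returned shots sorted by decreasing confidence, the relevant shot set, and a correct(q) function implementing C(q); none of these names come from the official scoring tool.

```python
def average_precision(returned, relevant):
    """Classic AP: mean of precision@k over the ranks of relevant shots."""
    hits, precisions = 0, []
    for k, shot in enumerate(returned, 1):
        if shot in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def ewmap(queries, returned, relevant, correct):
    """EwMAP = (1/|Q|) * sum over q of C(q) * AP(q); with correct(q) == 1
    for every query, this reduces to standard MAP."""
    return sum(correct(q) * average_precision(returned[q], relevant[q])
               for q in queries) / len(queries)
```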
Baseline & Metadata
19
the "multi" in multimediaTask necessitates expertise in various domains:• multimedia• computer vision• speech processing• natural language processing
20
The technological barrier to entry is lowered by the provision of a baseline system.
github.com/MediaevalPersonDiscoveryTask
Baseline fusion
21
22
[matrix: baseline modules (face tracking, face clustering, speaker diarization, optical character recognition, automatic speech recognition, speaking face detection, fusion) vs. participant submissions]
Was the baseline useful?
• team relied on the baseline module
• team developed their own module
• team tried both
9 participants
23
baseline
[same matrix of baseline modules vs. participant submissions, annotated: focus on monomodal components vs. focus on multimodal fusion]
Organization
24
Schedule
01.05  development set release
01.06  test set release
01.07  "out-domain" submission deadline
01.07 — 08.07  leaderboard updated every 6 hours
08.07  "in-domain" submission deadline
09.07 — 28.07  collaborative annotation
28.07  test set annotation release
28.07 — 28.08  adjudication
25
Leaderboard
26
computed on a secret subset of the test set
updated every 6 hours
private leaderboard
participants know how they rank
participants know their own score but not the scores of others
Collaborative annotation
27
Thanks!
28
around 50 hours in total
Thanks!
29
half of the test set ✔
CAMOMILE: an open source collaborative annotation platform (github.com/camomile-project)
[architecture diagram: a central server exchanging JSON with multiple clients; bring your own client]
data model: corpus / medium / layer / annotation
a medium is a multimedia document; an annotation is a medium fragment with attached metadata
REST API
GET  /corpus/:idcorpus            obtain information about a corpus
POST /corpus/:idcorpus/medium     add a medium to a corpus
PUT  /layer/:idlayer/annotation   update an annotation
DEL  /annotation/:idannotation    delete an annotation
and more...
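For illustration, here is a minimal sketch of calling these routes with Python's requests library. The server address, identifiers, payload fields and authentication scheme are assumptions; only the four routes come from the list above.

```python
import requests

SERVER = "http://localhost:3000"  # assumed server address
session = requests.Session()      # authentication details omitted

# obtain information about a corpus (corpus id is illustrative)
corpus = session.get(f"{SERVER}/corpus/5728d3e1").json()

# add a medium to a corpus (payload fields are assumptions)
session.post(f"{SERVER}/corpus/5728d3e1/medium",
             json={"name": "le_20_heures_2007-01-01"})

# update an annotation: a medium fragment plus attached metadata
session.put(f"{SERVER}/layer/a1b2c3/annotation",
            json={"fragment": {"start": 12.3, "end": 45.6},
                  "data": "francois_hollande"})

# delete an annotation
session.delete(f"{SERVER}/annotation/9f8e7d")
```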
permissions, annotation history, annotation queue, user authentication
provide an architecture for effective creation and sharing of annotations of multimedia, multimodal and multilingual data
a corpus is a homogeneous collection of multimedia documents (audio, video, image, text)
a layer is a homogeneous collection of annotations; attached metadata can be a categorical value, a numerical value, free text, or raw content
enable novel collaborative and interactive annotation tools
be generic enough to support as many use cases as possible
online documentation: http://camomile-project.github.io/camomile-server
Results
31
Stats
• 14 (+ organizers) registered participants received dev and test data
• 8 (+ organizers) participants submitted 70 runs
• 7 (+ organizers) submitted a working note paper
• 7 (+ organizers) attend the workshop
32
33
out-domain results
28k shots, 2070 queries (1642 vs. 428)
EwMAP (%) for PRIMARY runs
34
out-domain results
28k shots, 2070 queries (1642 vs. 428)
EwMAP (%) for PRIMARY runs
in-domain results
35
EwMAP (%) for PRIMARY runs
in-domain results
36
EwMAP (%) for PRIMARY runs
37
out- vs. in-domain
# people that no other team discovered
38
[per-team counts, with callouts "speech transcripts?" and "anchors"]
EwMAP (%) for PRIMARY runs
Conclusion
• Had a great (and exhausting) time organizing this task
• The winning submission DID NOT make any use of the face modality.
• (Almost) nobody used ASR
  Most people are introduced by overlaid names in the test set
• No cross-show approaches
• Poster session later today at 16:00
• Technical retreat tomorrow morning at 9:15
39
Next (oral)
• Mateusz Budnik / LIG at MediaEval 2015 Multimodal Person Discovery in Broadcast TV Task
• Meriem Bendris / PERCOLATTE: a multimodal person discovery system in TV broadcast for the MediaEval 2015 evaluation campaign
• Rosalia Barros / GTM-UVigo Systems for Person Discovery Task at MediaEval 2015
40
Next (poster)
• Claude Barras / Multimodal Person Discovery in Broadcast TV at MediaEval 2015
• Johann Poignant / LIMSI at MediaEval 2015: Person Discovery in Broadcast TV Task
• Paula Lopez Otero / GTM-UVigo Systems for Person Discovery Task at MediaEval 2015
• Javier Hernando / UPC System for the 2015 MediaEval Multimodal Person Discovery in Broadcast TV task
• Guillaume Gravier / SSIG and IRISA at Multimodal Person Discovery
• Nam Le / EUMSSI team at the MediaEval Person Discovery Challenge
41