Multimodal Person Discovery
in Broadcast TV
Johann Poignant / Hervé Bredin / Claude Barras
2015
[email protected] herve.niderb.fr @hbredin
Outline
• Motivations
• Definition of the task
• Datasets
• Baseline & Metadata
• Evaluation protocol
• Organization
• Results
• Conclusion
2
Motivations
3
"the usual suspects"
• Huge TV archives are useless if not searchable
• People ❤ people
• Need for person-based indexes
4
REPERE
• Evaluation campaigns in 2012, 2013 and 2014
  Three French consortia funded by ANR
• Multimodal person recognition in TV documents
  "who speaks when?" and "who appears when?"
• Led to significant progress in both supervised and unsupervised multimodal person recognition
5
From REPERE to Person Discovery
• Speaking faces
  Focus on "person of interest"
• Unsupervised approaches
  People may not be "famous" at indexing time
• Evidence
  Archivist/journalist use case
6
Definition of the task
7
Input
8
Broadcast TV pre-segmented into shots.
Speaking face
9
Tag each shot with the names of people both speaking and appearing at the same time
Person discovery
• Prior biometric models are not allowed.
• Person names must be discovered automatically in text overlay or speech utterances.
unsupervised approaches only
10
Evidence
11
Associate each name with a unique shot proving that the person actually holds this name
Evidence (cont.)
12
an image evidence is a shot during which a person is visible and their name is written on screen
an audio evidence is a shot during which a person is visible and their name is pronounced at least once during a [shot start time - 5s, shot end time + 5s] neighborhood
shot #3 for B
shot #1 for A
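A minimal sketch of the ±5s audio-evidence rule above, assuming hypothetical inputs (shot boundaries in seconds and a list of (name, time) utterances); person visibility is checked elsewhere:

```python
# Hypothetical check for the audio-evidence temporal rule: the name must be
# pronounced within [shot start - 5s, shot end + 5s]. Visibility of the
# person in the shot is assumed to be verified separately.
def is_audio_evidence(shot_start, shot_end, name, pronounced, margin=5.0):
    return any(
        n == name and shot_start - margin <= t <= shot_end + margin
        for n, t in pronounced
    )

# A name uttered 3 s before the shot starts still counts as evidence.
assert is_audio_evidence(10.0, 15.0, "B", [("B", 7.0)])
```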
Datasets
13
Datasets
14
DEV | REPERE
• 137 hours
• two French TV channels, eight different show types
• dense audio annotations: speaker diarization, speech transcription
• sparse video annotations: face detection & recognition, optical character recognition

TEST | INA
• 106 hours (172 videos)
• only one French TV channel, only one show type (news)
• no prior annotation
• a posteriori collaborative annotation
Datasets: TEST | INA
15
106 hours (172 videos)
only one French TV channel, only one show type (news)
no prior annotation
a posteriori collaborative annotation
http://dataset.ina.fr
Evaluation protocol
16
Information retrieval task
• Queries formatted as firstname_lastname
e.g. francois_hollande "return all shots where François Hollande is speaking and visible at the same time"
• Approximate search among submitted names
  e.g. francois_holande
• Select shots tagged with the most similar name
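As a hedged illustration of this approximate search, the sketch below picks the submitted name closest to a query using a normalized Levenshtein similarity; the campaign's exact Levenshtein ratio may be defined slightly differently, and all names here are examples.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; one simple normalization among several."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

# The misspelled hypothesis still matches the query.
hypotheses = ["francois_holande", "nicolas_sarkozy"]
query = "francois_hollande"
best = max(hypotheses, key=lambda n: ratio(query, n))
print(best)  # francois_holande
```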
17
Evidence-weighted MAP
18
To ensure that participants followed this strict "no biometric supervision" constraint, each hypothesized name n ∈ N had to be backed up by a carefully selected and unique shot proving that the person actually holds this name n: we call this an evidence and denote it by E : N → S. In real-world conditions, this evidence would help a human annotator double-check the automatically generated index, even for people they did not know beforehand.
Two types of evidence were allowed: an image evidence is a shot during which a person is visible and their name is written on screen; an audio evidence is a shot during which a person is visible and their name is pronounced at least once during a [shot start time − 5s, shot end time + 5s] neighborhood. For instance, in Figure 1, shot #1 is an image evidence for Mr A (because his name and his face are visible simultaneously on screen) while shot #3 is an audio evidence for Mrs B (because her name is pronounced less than 5 seconds before or after her face is visible on screen).
3. DATASETS
The REPERE corpus – distributed by ELDA – served as development set. It is composed of various TV shows (around news, politics and people) from two French TV channels, for a total of 137 hours. A subset of 50 hours is manually annotated. Audio annotations are dense and provide speech transcripts and identity-labeled speech turns. Video annotations are sparse (one image every 10 seconds) and provide overlaid text transcripts and identity-labeled face segmentation. Both speech and overlaid text transcripts are tagged with named entities. The test set – distributed by INA – contains 106 hours of video, corresponding to 172 editions of the evening broadcast news "Le 20 heures" of French public channel "France 2", from January 1st 2007 to June 30th 2007.
As the test set came completely free of any annotation, it was annotated a posteriori based on participants' submissions. In the following, task ground truths are denoted by function L : S → P(N), which maps each shot s to the set of names of every speaking face it contains, and function 𝓔 : S → P(N), which maps each shot s to the set of person names for which it actually is an evidence.
4. BASELINE AND METADATA
This task targeted researchers from several communities including multimedia, computer vision, speech and natural language processing. Though the task was multimodal by design and necessitated expertise in various domains, the technological barrier to entry was lowered by the provision of a baseline system described in Figure 2 and available as open-source software (http://github.com/MediaEvalPersonDiscovery). For instance, a researcher from the speech processing community could focus their research efforts on improving speaker diarization and automatic speech transcription, while still being able to rely on the provided face detection and tracking results to participate in the task.
The audio stream was segmented into speech turns, while faces were detected and tracked in the visual stream. Speech turns (resp. face tracks) were then compared and clustered based on MFCC and the Bayesian Information Criterion [10] (resp. HOG [11] and Logistic Discriminant Metric Learning [17] on facial landmarks [31]). The approach proposed in [27] was also used to compute a probabilistic mapping between co-occurring faces and speech turns. Written (resp. pronounced) person names were automatically extracted from the visual stream (resp. the audio stream) using the open source LOOV Optical Character Recognition system [24] (resp. Automatic Speech Recognition [21, 12]) followed by Named Entity detection (NE). The fusion module was a two-step algorithm: propagation of written names onto speaker clusters [26], followed by propagation of speaker names onto co-occurring speaking faces.

Figure 2: Multimodal baseline pipeline. Output of greyed-out modules is provided to the participants.
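To make the two-step fusion concrete, here is a hedged sketch of the propagation idea only: each speaker cluster takes its most frequently co-occurring written name, and that name is then passed on to co-occurring speaking faces. The data structures and function names are illustrative assumptions, not the actual baseline code.

```python
from collections import Counter

def propagate_names(written_names, speaking_faces):
    # Step 1: name each speaker cluster after the written name that
    # co-occurs with it most often (hypothetical input format:
    # {cluster: [names written on screen while the cluster speaks]}).
    cluster_name = {
        cluster: Counter(names).most_common(1)[0][0]
        for cluster, names in written_names.items() if names
    }
    # Step 2: propagate speaker names onto co-occurring speaking faces
    # ({shot: speaker cluster of the speaking face in that shot}).
    return {
        shot: cluster_name[cluster]
        for shot, cluster in speaking_faces.items() if cluster in cluster_name
    }

written = {"spk1": ["a_dupont", "a_dupont"], "spk2": ["b_martin"]}
faces = {"shot_1": "spk1", "shot_3": "spk2", "shot_4": "spk7"}
print(propagate_names(written, faces))
# {'shot_1': 'a_dupont', 'shot_3': 'b_martin'}  (shot_4 stays unnamed)
```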
5. EVALUATION METRIC
This information retrieval task was evaluated using a variant of Mean Average Precision (MAP) that took the quality of evidences into account. For each query q ∈ Q ⊂ N (firstname_lastname), the hypothesized person name n_q with the highest Levenshtein ratio ρ to the query q is selected (ρ : N × N → [0, 1]), allowing approximate name transcription:

    n_q = argmax_{n ∈ N} ρ(q, n)   and   ρ_q = ρ(q, n_q)

Average precision AP(q) is then computed classically based on relevant and returned shots, the latter sorted by confidence:

    relevant(q) = {s ∈ S | q ∈ L(s)}
    returned(q) = {s ∈ S | n_q ∈ L(s)}

A proposed evidence is correct if the name n_q is close enough to the query q and if shot E(n_q) actually is an evidence for q:

    C(q) = 1 if ρ_q > 0.95 and q ∈ 𝓔(E(n_q)), and 0 otherwise

To ensure participants provide correct evidences for every hypothesized name n ∈ N, standard MAP is altered into EwMAP (Evidence-weighted Mean Average Precision), the official metric for the task:

    EwMAP = (1/|Q|) · Σ_{q ∈ Q} C(q) · AP(q)
Acknowledgment. This work was supported by the French National Agency for Research under grant ANR-12-CHRI-0006-01. The open source CAMOMILE collaborative annotation platform (http://github.com/camomile-project) was used extensively throughout the progress of the task: from the run submission script to the automated leaderboard, including a posteriori collaborative annotation of the test corpus. We thank ELDA and INA for supporting the task by distributing development and test datasets.
    MAP = (1/|Q|) · Σ_{q ∈ Q} AP(q)

C(q) measures the correctness of the provided evidence for query q
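Putting the definitions together, a minimal sketch of the scoring follows. It assumes the caller supplies, per query, the returned shots sorted by decreasing confidence, the relevant shot set, and a correct(q) function implementing C(q); none of these names come from the official scoring tool.

```python
def average_precision(returned, relevant):
    """Classic AP: mean of precision@k over the ranks of relevant shots."""
    hits, precisions = 0, []
    for k, shot in enumerate(returned, 1):
        if shot in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def ewmap(queries, returned, relevant, correct):
    """EwMAP = (1/|Q|) * sum over q of C(q) * AP(q); with correct(q) == 1
    for every query, this reduces to standard MAP."""
    return sum(correct(q) * average_precision(returned[q], relevant[q])
               for q in queries) / len(queries)
```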
Baseline & Metadata
19
the "multi" in multimediaTask necessitates expertise in various domains:• multimedia• computer vision• speech processing• natural language processing
20
The technological barrier to entry is lowered by the provision of a baseline system.
github.com/MediaevalPersonDiscoveryTask
Baseline fusion
21
22
[matrix: baseline modules (face tracking, face clustering, speaker diarization, optical character recognition, automatic speech recognition, speaking face detection, fusion) vs. participant submissions]
Was the baseline useful?
• team relied on the baseline module
• team developed their own module
• team tried both
9 participants
23
baseline
[same matrix of baseline modules vs. participant submissions, annotated: focus on monomodal components vs. focus on multimodal fusion]
Organization
24
Schedule
01.05  development set release
01.06  test set release
01.07  "out-domain" submission deadline
01.07 — 08.07  leaderboard updated every 6 hours
08.07  "in-domain" submission deadline
09.07 — 28.07  collaborative annotation
28.07  test set annotation release
28.07 — 28.08  adjudication
25
Leaderboard
26
computed on a secret subset of the test set
updated every 6 hours
private leaderboard
participants know how they rank
participants know their own score but not the scores of others
Collaborative annotation
27
Thanks!
28
around 50 hours in total
Thanks!
29
half of the test set ✔
CAMOMILE: an open source collaborative annotation platform (github.com/camomile-project)
[architecture diagram: a central server exchanging JSON with multiple clients; bring your own client]
data model: corpus / medium / layer / annotation
a medium is a multimedia document; an annotation is a medium fragment with attached metadata
REST API
GET  /corpus/:idcorpus            obtain information about a corpus
POST /corpus/:idcorpus/medium     add a medium to a corpus
PUT  /layer/:idlayer/annotation   update an annotation
DEL  /annotation/:idannotation    delete an annotation
and more...
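For illustration, here is a minimal sketch of calling these routes with Python's requests library. The server address, identifiers, payload fields and authentication scheme are assumptions; only the four routes come from the list above.

```python
import requests

SERVER = "http://localhost:3000"  # assumed server address
session = requests.Session()      # authentication details omitted

# obtain information about a corpus (corpus id is illustrative)
corpus = session.get(f"{SERVER}/corpus/5728d3e1").json()

# add a medium to a corpus (payload fields are assumptions)
session.post(f"{SERVER}/corpus/5728d3e1/medium",
             json={"name": "le_20_heures_2007-01-01"})

# update an annotation: a medium fragment plus attached metadata
session.put(f"{SERVER}/layer/a1b2c3/annotation",
            json={"fragment": {"start": 12.3, "end": 45.6},
                  "data": "francois_hollande"})

# delete an annotation
session.delete(f"{SERVER}/annotation/9f8e7d")
```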
permissions, annotation history, annotation queue, user authentication
provide an architecture for effective creation and sharing of annotations of multimedia, multimodal and multilingual data
a corpus is a homogeneous collection of multimedia documents (audio, video, image, text)
a layer is a homogeneous collection of annotations; attached metadata can be a categorical value, a numerical value, free text, or raw content
enable novel collaborative and interactive annotation tools
be generic enough to support as many use cases as possible
online documentation: http://camomile-project.github.io/camomile-server
Results
31
Stats
• 14 (+ organizers) registered participants received dev and test data
• 8 (+ organizers) participants submitted 70 runs
• 7 (+ organizers) submitted a working note paper
• 7 (+ organizers) attend the workshop
32
33
out-domain results
28k shots, 2070 queries (1642 vs. 428)
EwMAP (%) for PRIMARY runs
34
out-domain results
28k shots, 2070 queries (1642 vs. 428)
EwMAP (%) for PRIMARY runs
in-domain results
35
EwMAP (%) for PRIMARY runs
in-domain results
36
EwMAP (%) for PRIMARY runs
37
out- vs. in-domain
# people that no other team discovered
38
[per-team counts, with callouts "speech transcripts?" and "anchors"]
EwMAP (%) for PRIMARY runs
Conclusion
• Had a great (and exhausting) time organizing this task
• The winning submission DID NOT make any use of the face modality.
• (Almost) nobody used ASR
  Most people are introduced by overlaid names in the test set
• No cross-show approaches
• Poster session later today at 16:00
• Technical retreat tomorrow morning at 9:15
39
Next (oral)
• Mateusz Budnik / LIG at MediaEval 2015 Multimodal Person Discovery in Broadcast TV Task
• Meriem Bendris / PERCOLATTE: a multimodal person discovery system in TV broadcast for the MediaEval 2015 evaluation campaign
• Rosalia Barros / GTM-UVigo Systems for Person Discovery Task at MediaEval 2015
40
Next (poster)
• Claude Barras / Multimodal Person Discovery in Broadcast TV at MediaEval 2015
• Johann Poignant / LIMSI at MediaEval 2015: Person Discovery in Broadcast TV Task
• Paula Lopez Otero / GTM-UVigo Systems for Person Discovery Task at MediaEval 2015
• Javier Hernando / UPC System for the 2015 MediaEval Multimodal Person Discovery in Broadcast TV task
• Guillaume Gravier / SSIG and IRISA at Multimodal Person Discovery
• Nam Le / EUMSSI team at the MediaEval Person Discovery Challenge
41