
WWW 2014, Seoul, April 8th

SNOW 2014 Data Challenge

Symeon Papadopoulos (CERTH)
David Corney (RGU)
Luca Aiello (Yahoo! Labs)


Overview of Challenge

• Goal: Detection of newsworthy topics in a large and noisy set of tweets

• Topic: a news story represented by a headline + tags + representative tweets + representative images (optional); see the sketch below

• Newsworthy: a topic that ends up being covered by at least some major online news sources

• Topics are detected per timeslot (small, equally-sized time intervals)

• A maximum number of topics is allowed per timeslot
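For concreteness, a topic in the sense above can be sketched as a small record. This is illustrative only; the field names are ours, not the challenge's actual submission format:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Topic:
    """One detected topic for a timeslot (illustrative field names)."""
    headline: str                 # news-style headline for the story
    tags: List[str]               # descriptive tags/keywords
    tweets: List[str]             # IDs or texts of representative tweets
    images: List[str] = field(default_factory=list)  # optional representative images
```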


Challenge Activity Log

• Challenge definition (Dec 2013)
• Challenge toolkit and registration (Jan 20, 2014)
• Development dataset collection (Feb 3, 2014)
• Rehearsal dataset collection (Feb 17, 2014)
• Test dataset collection (Feb 25, 2014)
• Results submission (Mar 4, 2014)
• Paper submission (Mar 9, 2014)
• Results evaluation (Mar 5-18, 2014)
• Workshop (Apr 7, 2014)


Some statistics

• Registered participants: 25
  – India: 4, Belgium: 3, Germany: 3, UK: 3, Greece: 3, Ireland: 2, USA: 2, France: 2, Italy: 1, Spain: 1, Russia: 1
• Participants that signed the Challenge agreement: 19
• Participants that submitted results: 11
• Participants that submitted papers: 9


Evaluation Protocol

• Defined several evaluation criteria:
  – Newsworthiness Precision/Recall, F-score
  – Readability, scale [1-5]
  – Coherence, scale [1-5]
  – Diversity, scale [1-5]
• List of reference topics
• Set up precise evaluation guidelines
• Blind evaluation (i.e. the evaluator is not aware of which method a topic comes from) based on a Web UI
• Participants submitted topics for 96 timeslots, but manual evaluation happened for 5 sample timeslots
• Result validation and analysis


Teams key


Key  Team
A    UKON
B    IBCN
C    ITI
D    math-dyn
E    Insight
F    FUB-TORV
G    PILOTS
H    RGU
I    UoGMIR
J    EURECOM
K    SNOWBITS

References to the submitted papers will be included in the overview paper in the workshop proceedings.


Results – Reference topic recall


Team  Recall  Rank
A     0.44    5
B     0.58    4
C     0.32    7
D     0.63    2
E     0.66    1
F     0.39    6
G     0.24    8
H     0.60    3
I     0.17    10
J     0.24    8
K     0.14    11

Recall was computed with respect to 59 reference topics. These were partitioned into three groups (20, 20, 19), and each of the three evaluators manually matched the participants' topics against the reference topics assigned to them.

Eval. Pair Correlation

Eval. 1 – Eval. 2 0.894913

Eval. 1 – Eval. 3 0.930247

Eval. 2 – Eval. 3 0.811976
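The agreement figures in this and the following evaluator tables are pairwise correlations between two evaluators' scores. A minimal sketch of how such a figure can be computed, assuming plain Pearson correlation (the slides do not state the exact variant); the sample scores are made up:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up per-team scores from two evaluators, for illustration only.
eval1 = [0.44, 0.58, 0.32, 0.63, 0.66]
eval2 = [0.40, 0.61, 0.35, 0.60, 0.70]
print(f"Eval. 1 - Eval. 2: {pearson(eval1, eval2):.6f}")
```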


Results – Pooled topic recall (1/2)

• Each evaluator independently evaluated the topics of each participant as newsworthy or not

• Selected all topics that were marked as newsworthy by at least two evaluators

• Manually extracted the unique topics (70 in total, partially overlapping with the reference topic list)

• Manually matched the correct topics of each participant to the list of newsworthy topics

• Computed precision, recall and F-score (a computation sketch follows below)
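A minimal sketch of the final step, using the counting scheme implied by the table on the next slide: precision over all topics a team submitted, recall over the pool of 70 unique newsworthy topics. Variable names are ours:

```python
def pooled_scores(matched, unique, submitted, pool_size=70):
    """Precision/recall/F-score against the pooled newsworthy topics.

    matched:   submitted topics that matched some pooled topic (may count duplicates)
    unique:    distinct pooled topics the team found
    submitted: total number of topics the team submitted
    """
    precision = matched / submitted
    recall = unique / pool_size
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Team A from the next slide: 13 matched, 13 unique, 27 submitted.
p, r, f = pooled_scores(13, 13, 27)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")  # P=0.481 R=0.186 F=0.268
```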


Results – Pooled topic recall (2/2)


Team  Matched  Unique  Total  Prec   Rec    F-score  Rank
A     13       13      27     0.481  0.186  0.268    6
B     12       12      23     0.522  0.171  0.258    7
C     22       15      50     0.440  0.214  0.288    4
D     18       14      39     0.462  0.200  0.279    5
E     28       25      50     0.560  0.357  0.436    1
F     4        2       15     0.267  0.029  0.052    10
G     4        4       10     0.400  0.057  0.099    9
H     19       17      49     0.388  0.243  0.299    3
I     36       15      45     0.800  0.214  0.338    2
J     1        1       8      0.125  0.014  0.027    11
K     8        7       10     0.800  0.100  0.178    8


Results – Readability

Team  Readability  Rank
A     4.29         9
B     4.92         2
C     4.49         7
D     4.59         6
E     4.74         4
F     4.18         10
G     4.93         1
H     4.71         5
I     4.80         3
J     3.38         11
K     4.32         8

Eval. Pair Correlation

Eval. 1 – Eval. 2 0.902124

Eval. 1 – Eval. 3 0.357733

Eval. 2 – Eval. 3 0.278632


Results – Coherence

Team  Coherence  Rank
A     4.40       6
B     4.08       9
C     4.68       5
D     4.91       2
E     4.97       1
F     4.78       4
G     4.83       3
H     4.22       8
I     3.95       10
J     3.75       11
K     4.36       7

Eval. Pair Correlation

Eval. 1 – Eval. 2 0.549512

Eval. 1 – Eval. 3 0.730684

Eval. 2 – Eval. 3 0.684426


Results – Diversity

Team  Diversity  Rank
A     2.12       7
B     2.36       4
C     2.31       6
D     2.11       8
E     2.11       8
F     2.00       10
G     1.92       11
H     3.27       2
I     2.36       4
J     2.50       3
K     3.47       1

Eval. Pair Correlation

Eval. 1 – Eval. 2 0.873365

Eval. 1 – Eval. 3 0.890415

Eval. 2 – Eval. 3 0.905915


Results – Image Relevance

Team  Precision (%)  Rank
A     54.19          3
B     31.75          5
C     58.09          2
D     52.04          4
E     27.39          6
F     0              8
G     0              8
H     58.82          1
I     0              8
J     0              8
K     18.45          7

Eval. Pair Correlation

Eval. 1 – Eval. 2 0.944946

Eval. 1 – Eval. 3 0.919469

Eval. 2 – Eval. 3 0.79596


Results – Aggregate (1/2)

• For each criterion Ci, we computed the score of each team relative to the best team for that criterion:

  Ci*(team) = Ci(team) / max_j Ci(team_j)

• We then aggregated over the different normalized scores:

  Ctot = 0.25·(Cref × Cpool) + 0.25·Cread + 0.25·Ccoh + 0.25·Cdiv

where Cref is the normalized recall on reference topics, Cpool the normalized F-score on the pooled topics, and Cread, Ccoh and Cdiv the normalized readability, coherence and diversity scores (a sketch of the computation follows below).
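A minimal sketch of this aggregation; the helper names are ours, not from the challenge toolkit. Feeding it the raw scores from the preceding slides reproduces the published totals, e.g. 0.892 for team E:

```python
def normalize(raw):
    """Scale each team's raw score by the best score for that criterion."""
    best = max(raw.values())
    return {team: score / best for team, score in raw.items()}

def aggregate(ref, pool, read, coh, div):
    """Combine normalized criterion scores with the 0.25 weights from the slide."""
    n = [normalize(c) for c in (ref, pool, read, coh, div)]
    return {
        team: 0.25 * n[0][team] * n[1][team]  # topic term: ref recall x pooled F-score
              + 0.25 * n[2][team]             # readability
              + 0.25 * n[3][team]             # coherence
              + 0.25 * n[4][team]             # diversity
        for team in ref
    }

# Raw scores for teams E, G and K from the preceding slides. These three teams
# happen to hold the per-criterion maxima, so this mini-example reproduces the
# published totals (E: 0.892, G: 0.652, K: 0.710).
totals = aggregate(
    ref={"E": 0.66, "G": 0.24, "K": 0.14},
    pool={"E": 0.436, "G": 0.099, "K": 0.178},
    read={"E": 4.74, "G": 4.93, "K": 4.32},
    coh={"E": 4.97, "G": 4.83, "K": 4.36},
    div={"E": 2.11, "G": 1.92, "K": 3.47},
)
print({team: round(score, 3) for team, score in totals.items()})
```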


Results – Aggregate (2/2)

Team  Aggregate score  Rank
A     0.694            7
B     0.755            4
C     0.710            5
D     0.785            3
E     0.892            1
F     0.614            10
G     0.652            9
H     0.842            2
I     0.662            8
J     0.546            11
K     0.70987          6

We tried several alternative aggregation schemes; the top three teams were the same!


Program

15:20-15:30: Carlos Martin-Dancausa and Ayse Goker: Real-time topic detection with bursty n-grams.

16:00-16:20: Gopi Chand Nutakki, Olfa Nasraoui, Behnoush Abdollahi, Mahsa Badami, Wenlong Sun: Distributed LDA based topic modelling and topic agglomeration in a latent space.

16:20-16:40: Steven van Canneyt, Matthias Feys, Steven Schockaert, Thomas Demeester, Chris Develder, Bart Dhoedt: Detecting newsworthy topics in Twitter.

16:40-17:00: Georgiana Ifrim, Bichen Shi, Igor Brigadir: Event detection in Twitter using aggressive filtering and hierarchical tweet clustering.

17:00-17:20: Gerard Burnside, Dimitrios Milioris, Philippe Jacquet: One day in Twitter: Topic detection via joint complexity.

17:20-17:30: Georgios Petkos, Symeon Papadopoulos, Yiannis Kompatsiaris: Two-level message clustering for topic detection in Twitter.

17:30-17:40: Winners’ announcement!


Limitations – Lessons Learned

• Did not take into account time
  – however, methods that produce a newsworthy topic earlier should be rewarded
• Did not take into account image relevance
  – since we considered it an optional field
• Coherence and diversity had extreme values in numerous cases
  – e.g. when a single relevant tweet was provided as representative
• Evaluation turned out to be a very complex task!
• Assessing only five timeslots (out of the 96) is definitely a compromise: (a) consider use of more evaluators/AMT, (b) consider simpler evaluation tasks


Plan

• Release evaluation resources
  – list of reference topics
  – list of pooled newsworthy topics
  – evaluation scores
• Papers
  – SNOW Data Challenge paper
  – resubmission of participants' papers in CEUR style
  – submission to CEUR-ws.org
• Open-source implementations?
• Further plans?


Thank you!
