
ReSEED: Social Event dEtection Dataset

Timo Reuter
Universität Bielefeld, CITEC
Bielefeld, Germany
[email protected] bielefeld.de

Symeon Papadopoulos
CERTH-ITI
Thermi, Greece
[email protected]

Vasilios Mezaris
CERTH-ITI
Thermi, Greece
[email protected]

Philipp Cimiano
Universität Bielefeld, CITEC
Bielefeld, Germany
[email protected] bielefeld.de

ABSTRACT
Digital cameras are now very popular, and almost every mobile phone has a built-in camera. Social events play a prominent role in people’s lives. People therefore take pictures of the events they take part in, and more and more of them upload these pictures to well-known online photo community sites like Flickr. The number of pictures uploaded to these sites is still growing, and there is great interest in automating the process of event clustering so that every incoming (picture) document can be assigned to the corresponding event without the need for human interaction. These social events are defined as events that are planned by people, attended by people, and for which the social multimedia are also captured by people. There is an urgent need to develop algorithms that are capable of grouping media by the social events they depict or are related to. In order to train, test, and evaluate such algorithms and frameworks, we present a dataset that consists of about 430,000 photos from Flickr together with the underlying ground truth consisting of about 21,000 social events. All the photos are accompanied by their textual metadata. The ground truth for the event groupings has been derived from event calendars on the Web that have been created collaboratively by people. The dataset has been used in the Social Event Detection (SED) task that was part of the MediaEval Benchmark for Multimedia Evaluation 2013. This task required participants to discover social events and organize the related media items in event-specific clusters within a collection of Web multimedia documents. In this paper we describe how the dataset was collected and how the ground truth was created, together with a proposed evaluation methodology and a brief description of the corresponding task challenge as applied in the context of the Social Event Detection task.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
MMSys ’14, March 19 - 21 2014, Singapore, Singapore
Copyright is held by the owner/authors. Publication rights licensed to ACM.
ACM 978-1-4503-2705-3/14/03 ...$15.00.
http://dx.doi.org/10.1145/2557642.2563674

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Experimentation

Keywords
Event Detection, Dataset, Content Analysis, Clustering, Classification

1. INTRODUCTION
As social media applications proliferate, an ever-increasing amount of web and multimedia content is being created on the Web. More and more people are using digital cameras to take pictures of important and interesting happenings in their lives, which they regularly upload to social networks on the Web. In fact, the number of media items uploaded to those sites is still increasing every year. A lot of this content is related to social events. According to our definition, a social event is an event that is organized and attended by people and illustrated by social media content created by people.

Finding digital content related to a certain social event is very challenging for users. It requires searching large volumes of data, possibly across different sources and sites. Algorithms that support humans in this task are therefore urgently needed. An important task thus consists in developing algorithms that can detect event-related media and group them by the events they illustrate or are related to. Such a grouping would provide the basis for aggregation and search applications that foster easier discovery, browsing, and querying of social events.

In order to create, test, and validate developments in the field of social event detection [7], it is necessary to have a non-toy, real-world dataset that supports the development and comparison of different approaches to the task. In this paper we present such a dataset. It has been used in the MediaEval 2013 Social Event Detection task and is thus already established in the community.

In comparison to the datasets from earlier MediaEval SED tasks [4] and TrecVID MED [3] for videos, the dataset presented in this paper features real events and not only event types. Every picture is assigned to a single and unique event (like the baseball match between the San Francisco Giants and the New York Mets in April 2008). In addition, the number of pictures has also increased in comparison to the SED dataset.

Proc. 5th ACM Multimedia Systems Conference (MMSys’14), Singapore, March 2014.

2. APPLICATIONS OF THE DATASET
The dataset has been created to support researchers in the field of social event detection with a freely available and comparable dataset. While the dataset has already been used in the Social Event Detection task (see Section 2.1), it is also useful for other research questions in the area of event detection and searching.

2.1 MediaEval 2013
The dataset was already successfully used in the 2013 edition of the Social Event Detection (SED) task at MediaEval. There were 11 teams with about 65 people altogether who participated in the task. The task description for the challenge was as follows: “Produce a complete clustering of the image dataset according to events.”

The task consisted in the automatic induction of event-related clusters for all images in the dataset, determining the number of events in the dataset automatically. The challenge was thus a completely data-driven task involving the analysis of a large-scale dataset, requiring the production of a complete clustering of the image dataset according to events (see Figure 1). The task might be regarded as a kind of supervised clustering task [6, 5] in which a set of training events (groupings of images) was provided. It might also be regarded as an unsupervised clustering task in which a classical algorithm like k-means can be used. The event clusters in the training set were disjoint, i.e. participants were told to assume that one image belongs to exactly one event.

[Figure 1: Clustering of image documents into event clusters. Image documents are assigned to Event 1, Event 2, ..., Event n.]

There was a required run for the challenge which involved using only the metadata. The use of additional data for this run (e.g. visual information from the images) was forbidden. For the other runs, additional data could be used (including the images). It was allowed to use generic external resources like Wikipedia, WordNet, or visual concept detectors trained on other data. However, it was not allowed at all to use external data that was directly related to the individual images included in the dataset, such as machine tags¹.

Participants were allowed to submit up to five runs per task, where each run contained a different set of results. These could be produced by either a different approach or a variant of the same approach. Each run was evaluated separately.

¹ A special triple tag to define extra semantic information for interpretation by computer systems

2.2 Further applications
A subchallenge of the MediaEval 2013 SED task consisted in classifying images into a set of eight event types (concert, conference, exhibition, fashion, protest, sports, theatre & dance, other) or classifying them as not depicting any event (non-event). This task thus represents a standard classification task that can be addressed using standard supervised classification approaches, exploiting both visual features and metadata. More specifically, the eight event types were pre-defined, and methods were expected to assign each picture to one of these event types. The dataset for this subchallenge was derived from Instagram and is not part of the ReSEED dataset as described here.

Overall, the datasets associated with the MediaEval 2013 SED challenge can also be used for other applications. It is, for instance, possible to use the dataset as a basis and enrich the data with other types of social media items coming from Twitter or other social network sites. Further, it can be used in tasks related to semantic search / information retrieval of pictures and/or social media events.

3. DATASET
The dataset consists of pictures from Flickr. These are assigned to individual social events. The events include sport events, protest marches, debates, expositions, festivals, concerts, and more. All pictures in the dataset, together with their associated metadata, were downloaded from Flickr using their official API². Furthermore, they are all published under a Creative Commons license allowing for their free distribution. In the following section, we describe the dataset in more detail and also provide some figures. We then proceed to give information on how we obtained the data. The creation process of the event clusters is described in Section 3.3. We end with an exact description of the dataset format in Section 3.4.

3.1 Dataset Statistics
The dataset as a whole contains 437,370 pictures from the Flickr photo community site. For the dataset we only considered pictures available under a Creative Commons license with an upload time between January 2006 and December 2012, yielding a dataset of 437,370 pictures assigned to 21,169 events in total. The events are heterogeneous with respect to type and length, e.g. they include festivals which last for several days as well as protest marches of only a few hours' duration. The dataset includes the pictures themselves together with metadata about each picture. The data itself has not been post-processed in any fashion. As it is a real-world dataset, there are some features like upload time and uploader information that are available for every picture, but there are also features that are available for only a subset of the images. In particular, we fetch the following metadata and information about the pictures directly from Flickr:

• Unique ID for the picture. This is an integer value.

• Information about the person who uploaded the picture to Flickr

• URL where the picture can be downloaded from

• Timestamp when the picture was taken

• Timestamp when the user uploaded the picture to Flickr

• Geographic location (specified by longitude and latitude)

• All tags assigned to the photo

• Title of the photo as chosen by the uploader

• A description of the photo as chosen by the uploader

• The number of people who have viewed the picture from the date of upload until 2013-03-30

• Type of Creative Commons license (as indicated)

² http://www.flickr.com/services/api/

Table 1: Availability of features
Feature                   Availability
Upload time               100.0%
Capture time               98.3%
Geographic Information     45.9%
Tags                       95.6%
Title                      97.6%
Description                37.8%
Uploader Information      100.0%

Table 2: Distribution per year
Year    Share of pictures
2006     4.6%
2007    21.0%
2008    25.7%
2009    21.4%
2010    13.5%
2011     8.8%
2012     5.0%

Although some pictures include EXIF data from the camera, this data is not used directly in the creation process of the metadata. Nevertheless, the EXIF information is used by Flickr to create the metadata. If the capture time is unknown, Flickr uses the upload time for both timestamps; to ensure good quality, we removed the capture time information if it equalled the upload time. While time, geographic, and tag information is usually of good quality, we discovered that the title often contains the filename; this is problematic because the filename is assigned by the camera (e.g. DSCFxxxx, IMGxxxx, etc.). It is also worth mentioning that the description field is used by some uploaders to advertise themselves and not to give a description of the picture shown. Exact statistics for each metadata feature are given in Table 1. We also report the relative number of pictures per year in Table 2, as the distribution over time is not constant.
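The capture-time rule described above can be illustrated with a minimal sketch. The record layout (the keys `capture_time`, `upload_time`, and `title`) and the filename pattern are assumptions made for this example and are not the format of the published dataset; the sketch only flags camera-style titles and does not remove them, since the text does not state that titles were altered.

```python
import re

# Assumed pattern for camera-generated titles such as "DSCF1234" or "IMG_0042.JPG".
CAMERA_TITLE = re.compile(r"^(DSCF|DSC|IMG)[_-]?\d+(\.\w+)?$", re.IGNORECASE)


def clean_record(record: dict) -> dict:
    """Return a copy of a metadata record with unreliable fields marked.

    The keys used here are hypothetical names chosen for this sketch.
    """
    cleaned = dict(record)
    # Flickr falls back to the upload time when the capture time is unknown,
    # so an identical pair of timestamps is treated as "capture time missing".
    if cleaned.get("capture_time") == cleaned.get("upload_time"):
        cleaned["capture_time"] = None
    # Flag titles that look like a filename assigned by the camera.
    title = cleaned.get("title")
    cleaned["title_is_filename"] = bool(title and CAMERA_TITLE.match(title.strip()))
    return cleaned
```

A clustering approach could then ignore the capture time and the title whenever these fields are marked as unreliable.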

As already mentioned above, the pictures are licensed under a Creative Commons license. We only included pictures in the dataset which allow for free distribution, remixing, and tweaking. The following subtypes of Creative Commons allow the use of the image along the lines mentioned above as long as the owner is credited for the photograph: Attribution (CC BY), Attribution Share Alike (CC BY-SA), Attribution Non-Commercial (CC BY-NC), and Attribution Non-Commercial Share Alike (CC BY-NC-SA). The two licenses Attribution No Derivatives (CC BY-ND) and Attribution Non-Commercial No Derivatives (CC BY-NC-ND) were not used, so that scientists can also change pictures for research purposes. Exact figures of the distribution of the used licenses are given in Table 3.

Table 3: Use of license
License         Share of pictures
CC BY           20.1%
CC BY-SA        15.4%
CC BY-NC        17.2%
CC BY-NC-SA     47.3%

The distribution of pictures per event is not uniform (see Figure 2). The size of the events varies a lot. While 3,598 events include only a single picture and 1,799 events include 2 pictures, there is a small number of events which include over 1,000 pictures. This is very challenging, as not only does the natural number of clusters have to be determined, but the cluster sizes are also unknown beforehand.

[Figure 2: Distribution of pictures per event. x-axis: pictures in event; y-axis: number of events.]
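A distribution of this kind can be derived from any picture-to-event assignment. The following minimal sketch assumes the ground truth is available as a mapping from picture ID to event ID; this mapping and the function name are illustrative and do not reflect the dataset's file format.

```python
from collections import Counter


def event_size_histogram(picture_to_event: dict) -> Counter:
    """Map each event size (pictures per event) to the number of events of that size."""
    # First count how many pictures each event contains ...
    pictures_per_event = Counter(picture_to_event.values())
    # ... then count how many events share each size.
    return Counter(pictures_per_event.values())
```

Applied to the full ground truth, the entry for size 1 would correspond to the number of single-picture events reported above.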

The 437,370 pictures were uploaded by 4,926 people, corresponding to roughly 89 pictures uploaded per person. Here we observe that 2,418 users uploaded pictures for only one event, which is not surprising.

The total number of tags assigned to the photographs is 3,444,612; there are 91,219 unique tags. About 32.7% of all tags occur only once in the whole dataset. This is not surprising, as the number of single-picture events is also high, which suggests a high correlation between the two.

3.2 Collection of the Dataset
For the collection of the dataset we relied on the official API from Flickr. The retrieval of the photos and their accompanying metadata was done in four steps:

1. The metadata for the photos was fetched using the flickr.photos.search function of the Flickr API (see Section 3.2.1).

2. All available information about the uploaders of the photos was fetched using the flickr.people.getInfo function (see Section 3.2.2).

3. The photos themselves were downloaded via HTTP requests using the addresses fetched in step 1 (see Section 3.2.3).

4. Additional event information was fetched from Upcoming and last.fm (see Section 3.2.4).

Flickr offers a key feature that enabled us to create the gold standard. Every user of the Flickr service can assign tags to previously uploaded pictures. Tags are keywords that describe the picture using important and relevant words. Usually, these tags contain human-readable information. In 2007, Flickr introduced a special type of tag to be consumed by machines. These special tags are called machine tags or triple tags. The main difference from normal human-readable tags is that machine tags use a fixed schema consisting of a namespace, a predicate, and a value. A machine tag is denoted in the following form:

namespace:predicate=value.
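To make the schema concrete, the following minimal sketch splits a tag string into its three parts. The function and type names are chosen for illustration, and the event ID in the usage note is a made-up value.

```python
from typing import NamedTuple, Optional


class MachineTag(NamedTuple):
    namespace: str
    predicate: str
    value: str


def parse_machine_tag(tag: str) -> Optional[MachineTag]:
    """Split 'namespace:predicate=value' into its parts; return None for ordinary tags."""
    head, equals, value = tag.partition("=")
    if not equals:
        return None
    namespace, colon, predicate = head.partition(":")
    if not colon or not namespace or not predicate or not value:
        return None
    return MachineTag(namespace, predicate, value)
```

For example, `parse_machine_tag("upcoming:event=12345")` yields the namespace `upcoming`, the predicate `event`, and the value `12345`, while an ordinary tag such as `concert` yields `None`.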

3.2.1 Fetching Metadata
In the first step we exploited an API function allowing us to search for photos meeting certain criteria: flickr.photos.search. The result of this API call is a set of textual metadata about the photos including the address where the photo itself can be downloaded. Our aim was to download all photos uploaded between January 2006 and December 2012 which fulfill the following criteria: a) there is a special tag assigned to the photo which is either in the namespace of upcoming:event= or lastfm:event= and b) the license of the picture is one of the following: CC BY, CC BY-SA, CC BY-NC, and CC BY-NC-SA.

Flickr’s API returns a maximum of 500 results per call. Even though there is no official limit on the number of photo records retrieved using the flickr.photos.search function, after a certain number of photos has been retrieved, the API repeatedly returns the last 500 photos as duplicates. In order to circumvent this behavior, we call the API function to retrieve the pictures for one single day at a time. The final API call thus uses three constraints: a) the machine tag is in one of the following namespaces: upcoming:event= and lastfm:event=, b) the picture has been taken on the specified day, c) the picture is under a redistributable Creative Commons license. The exact approach for the whole process is shown in Algorithm 1.

The result of Algorithm 1 is data for about 450,000 pictures. This data is used as a basis for the next steps.
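The day-by-day loop of Algorithm 1 can be sketched as follows. To keep the example independent of any concrete client library, the two callables `count_pages` and `fetch_page` are hypothetical stand-ins for the two uses of the search call in Algorithm 1; their names and signatures are assumptions of this sketch and are not part of the Flickr API. Each picture record is assumed to be a dictionary with an `id` key.

```python
from datetime import date, timedelta
from typing import Callable, Dict, List

NAMESPACES = ("upcoming:event=", "lastfm:event=")


def collect_pictures(
    count_pages: Callable[[date, str], int],
    fetch_page: Callable[[date, str, int], List[dict]],
    first_day: date,
    last_day: date,
) -> Dict[str, dict]:
    """Query one day at a time and keep each picture only once."""
    store: Dict[str, dict] = {}  # stands in for the local database
    day = first_day
    while day <= last_day:
        for namespace in NAMESPACES:
            pages = count_pages(day, namespace)
            for page in range(1, pages + 1):
                for picture in fetch_page(day, namespace, page):
                    # Store new pictures; duplicates returned by the API are discarded.
                    store.setdefault(picture["id"], picture)
        day += timedelta(days=1)
    return store
```

Restricting each query to a single day keeps every result set small, which is the property the text relies on to avoid the duplicated pages.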

3.2.2 Fetching Uploader Information
As step 2 we use the API again to fetch information about the uploaders. This is necessary to credit the original author as requested by the license. We employ the following function of the API to fetch the information: flickr.people.getInfo.

We request the uploader information for each picture entry in our local database using the unique user ID provided by Flickr. As a result we get the following information about the people:

• Unique ID on Flickr

• Username chosen by the user

• The real name (if the user has provided that name)

• A link to the user profile on Flickr

• The location where the user is situated (if provided by the user)

• A self-description of the user

At the end of this step, we have stored the uploader information together with the metadata for the 450,000 pictures.

Algorithm 1: Retrieval algorithm
Input: Two namespaces: upcoming:event= and lastfm:event=
Output: Retrieved pictures matching query

Day ← ’2006-01-01’;
while Day ≤ ’2012-12-31’ do
    foreach Namespace do
        PageNumber ← photoSearch(Day, Namespace);
        foreach PageNumber do
            CurrentPicture ← photoSearch(Day, Namespace, PageNumber);
            foreach CurrentPicture do
                if CurrentPicture is new then
                    Store to local database;
                else
                    Discard picture;
                end
            end
        end
    end
    Day ← Day + 1;
end

3.2.3 Fetching the Picture Files
Using the addresses from the local database, we downloaded all pictures over an HTTP connection. As about 15,000 of the pictures were not available for download for several reasons (such as unavailability of the Flickr server, removal by the uploader or copyright holder, or changed privacy settings), the process eventually led to a total of 437,370 pictures downloaded.
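A download step that tolerates unavailable pictures might look like the following minimal sketch. The `fetch` callable is a hypothetical stand-in for an HTTP client and is an assumption of this example; failures are expected to surface as `OSError`, which in Python also covers `urllib.error.URLError`.

```python
from typing import Callable, Dict, Tuple


def download_all(
    urls: Dict[str, str],
    fetch: Callable[[str], bytes],
) -> Tuple[Dict[str, bytes], Dict[str, str]]:
    """Download every picture; record failures instead of aborting the run."""
    downloaded: Dict[str, bytes] = {}
    failed: Dict[str, str] = {}
    for picture_id, url in urls.items():
        try:
            downloaded[picture_id] = fetch(url)
        except OSError as exc:
            # Picture removed, made private, or server unavailable.
            failed[picture_id] = str(exc)
    return downloaded, failed
```

Keeping the failed IDs separately makes it possible to report how many pictures could not be retrieved, as done in the text above.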

3.2.4 Fetching event information from Upcoming and last.fm

We employed the APIs of Upcoming and last.fm³ in order to fetch more information about the events.

We used the Event.getInfo function from last.fm’s API and the now-defunct equivalent from Upcoming to fetch the information shown in Table 4.

In summary, we obtained detailed information for 5,236 out of 5,384 events from Upcoming and 15,043 out of 16,334 events from last.fm.

3.3 Creation of the Gold Standard
As creating the gold standard for such a large number of pictures by hand is very time-consuming and expensive, we exploit existing, manually created and readily available social events defined collaboratively by a community of users. In fact, there are websites that host social event calendars. A social event calendar is defined as a repository of social events which can be searched and browsed by events. In such a social event calendar, humans create an entry for a distinct event. These services are often managed professionally, and only social events that have been validated by the community are added. The advantage of this is that the events extracted in this way have a rather low noise level. It is notable that these sites provide more information about the social event than the event name. Usually the entries contain information about the event type and also a more detailed description of the event itself; this includes the date and time when the event takes place, the location, a detailed description, and a lot more (a full list of available information is provided in Table 4).

Table 4: Available information from two social event calendars, last.fm and Upcoming
Information            last.fm  Upcoming
Unique ID              X X
Event title            X X
Event tags             X
Event description      X
Start time             X X
End time               X
Number of attendees    X
Venue name             X X
Venue address          X X
URL to venue           X X
Exact geolocation      X X

³ http://www.lastfm.de/api

In this paper, we focus on two services which provide a social event calendar: 1) Upcoming and 2) last.fm⁴. The first service, Upcoming, was a public website run by Yahoo Inc. which had provided event information since 2003. The site was unfortunately taken down on April 30, 2013. The second service, last.fm, has provided a calendar since 2007. While last.fm focuses on events related to music, Upcoming also provided entries for all other varieties of events like sport events, conferences, protests, etc. The information on the events available from both services is comparable. In recent years fewer people have been using these services, which explains why the number of pictures for recent years is lower (see Table 2).

An example of a social event presented on last.fm is shown in Figure 3.

Figure 3: Example from last.fm

Both services provide an API which enabled us to easily access and download the information about the events. The API usually exposes more information than what is shown in Figure 3. For us, the most interesting information is the unique event identifier (Event-ID) for each event; it is a simple integer value, but it is the key that uniquely identifies an event and can thus serve as the basis for creating a gold standard (as originally proposed by Becker et al. [1]).

The introduction of machine tags in Flickr enabled users to tag their pictures with structured information. Flickr automatically recognizes a tag as a machine tag if the user enters it using the special triple tag schema namespace:predicate=value. These machine tags can be used in several ways. In particular, they can be used to link data, e.g. pictures, across sites. This is where the unique event identifiers from Upcoming and last.fm come into play. The users of both sites attended these events and took pictures with their cameras, uploading these pictures to Flickr thereafter. Both sites began to ask their users to add a well-formed machine tag to these pictures so that other users could use this unique ID when uploading their pictures to some service. The machine tags for these pictures have the following form: upcoming:event=#eventid or lastfm:event=#eventid. For us, this provides important information about the pictures: a) we know that a picture with such a tag belongs to a social event in the database of the social event calendar provider, and b) we obtain the corresponding unique #eventid for this event. Therefore, we can assume that all pictures marked with the same ID in the machine tag belong to the same event [6] (see also Becker et al. [1] for an earlier approach that exploits the same principle and inspired ours).

4 http://www.last.fm

With that information, we extract the machine tags for each picture and discard all machine tags that are not in the namespaces we are looking for. We then extract the value of the triple tag for the following machine tags: upcoming:event=value and lastfm:event=value. These values are stored together with the Flickr picture ID. We end up with three types of relations for the pictures: a) having only a relation to the Upcoming event calendar, b) having only a relation to the last.fm event calendar, or c) having a relation to both services.
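The extraction step described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used to build the dataset: the function names, the input layout (a dictionary from picture ID to a list of raw tags) and the sample values are assumptions made for this example, as is the restriction to purely numeric event IDs.

```python
import re
from collections import defaultdict

# Matches only the two namespaces of interest; event IDs are assumed to be integers.
EVENT_TAG = re.compile(r"^(upcoming|lastfm):event=(\d+)$")

def extract_event_relations(picture_tags):
    """Map each picture ID to {namespace: event ID} for its event machine tags."""
    relations = defaultdict(dict)
    for picture_id, tags in picture_tags.items():
        for tag in tags:
            match = EVENT_TAG.match(tag.strip().lower())
            if match:  # every other tag, machine tag or not, is discarded
                namespace, event_id = match.groups()
                relations[picture_id][namespace] = int(event_id)
    return dict(relations)

def relation_type(relation):
    """Classify a picture as related to 'upcoming', 'lastfm', or 'both'."""
    return "both" if len(relation) == 2 else next(iter(relation))

# Hypothetical sample data, not taken from the dataset.
sample = {
    101: ["concert", "upcoming:event=4711"],
    102: ["lastfm:event=815", "geo:lat=52.03"],
    103: ["upcoming:event=4711", "lastfm:event=815"],
    104: ["holiday"],
}
relations = extract_event_relations(sample)
types = {picture_id: relation_type(rel) for picture_id, rel in relations.items()}
```

Pictures without any event machine tag (picture 104 in the sample) simply do not appear in the result, which mirrors the fact that only tagged pictures can be assigned to an event in the ground truth.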

Finally, we merge each event listed on both services into one event, so that the number of events in the dataset is 21,169.
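The text does not state how an event listed on both services is recognized. One plausible approach, shown here purely as an assumption, is to treat a picture that carries both an Upcoming and a last.fm event tag as evidence that the two event IDs denote the same event, and to merge IDs with a union-find structure. The function name and the sample data are hypothetical.

```python
def merge_events(relations):
    """Assign one merged event ID per picture, linking IDs that share a picture."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        parent[find(a)] = find(b)

    # A node is a (service, event ID) pair; a picture with two tags links two nodes.
    for relation in relations.values():
        nodes = list(relation.items())
        find(nodes[0])
        for other in nodes[1:]:
            union(nodes[0], other)

    # Number the merged events consecutively in first-seen order.
    merged_ids = {}
    picture_to_event = {}
    for picture_id, relation in relations.items():
        root = find(next(iter(relation.items())))
        picture_to_event[picture_id] = merged_ids.setdefault(root, len(merged_ids) + 1)
    return picture_to_event

# Hypothetical sample: picture 103 links Upcoming event 4711 to last.fm event 815.
sample_relations = {
    101: {"upcoming": 4711},
    102: {"lastfm": 815},
    103: {"upcoming": 4711, "lastfm": 815},
    105: {"lastfm": 900},
}
picture_to_event = merge_events(sample_relations)
```

In this sample, pictures 101, 102 and 103 end up in the same merged event, while picture 105 forms an event of its own.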

3.4 Dataset Format

The dataset is made available in the following formats:

• Compressed archive of pictures in JPEG format with a maximum resolution of 1024 px on the longer side and 768 px on the shorter side.

• CSV files for the textual metadata of the pictures, as this format facilitates direct usage and allows an easy import into diverse database systems.

• CSV files for the textual metadata of the events from Upcoming and last.fm.

The files come with a ReadMe file that introduces each file and describes the database schema and its fields in detail. An overview of the picture metadata is shown in Figure 4.

The dataset is available for download from http://greententacle.techfak.uni-bielefeld.de/reseed/ or http://umass/.

Figure 5 shows the database schema for the metadata of the Upcoming events.

4. EVALUATION PROPOSAL

To support evaluation and comparability, we split the dataset into a training and an evaluation part. We also propose certain evaluation measures.


ReSEED_PICTURES
  flickr_picture_id  BIGINT(20)    NOT NULL
  url                VARCHAR(255)  NOT NULL
  username           VARCHAR(100)  NOT NULL
  datetaken          DATETIME      NOT NULL
  dateupload         DATETIME      NOT NULL
  title              TEXT          NULL
  description        TEXT          NULL
  latitude           FLOAT         NULL
  longitude          FLOAT         NULL
  views              INT(11)       NOT NULL

ReSEED_PICTURES_TAGS
  flickr_picture_id  BIGINT(20)    NOT NULL
  tag                VARCHAR(255)  NOT NULL

ReSEED_PICTURES_EVENTS
  flickr_picture_id  BIGINT(20)    NOT NULL
  event_id           INT(11)       NOT NULL

(table name not given)
  upcoming_event_id  INT(11)       NOT NULL
  lastfm_event_id    INT(11)       NOT NULL

Figure 4: Database schema for dataset (picture metadata)
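As a hedged illustration of how the picture metadata schema might be used, the following sketch creates the three named tables of Figure 4 in an in-memory SQLite database and counts pictures per event. The column names, types and NOT NULL flags follow the figure; the primary key, the sample rows and the example.org URLs are assumptions made for this example. In practice the rows would be imported from the CSV files described in the ReadMe.

```python
import sqlite3

# Columns follow Figure 4; declaring flickr_picture_id as primary key is an assumption.
SCHEMA = """
CREATE TABLE ReSEED_PICTURES (
    flickr_picture_id BIGINT(20) NOT NULL PRIMARY KEY,
    url VARCHAR(255) NOT NULL,
    username VARCHAR(100) NOT NULL,
    datetaken DATETIME NOT NULL,
    dateupload DATETIME NOT NULL,
    title TEXT,
    description TEXT,
    latitude FLOAT,
    longitude FLOAT,
    views INT(11) NOT NULL
);
CREATE TABLE ReSEED_PICTURES_TAGS (
    flickr_picture_id BIGINT(20) NOT NULL,
    tag VARCHAR(255) NOT NULL
);
CREATE TABLE ReSEED_PICTURES_EVENTS (
    flickr_picture_id BIGINT(20) NOT NULL,
    event_id INT(11) NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Hypothetical rows for illustration only.
pictures = [
    (101, "http://example.org/101.jpg", "alice", "2008-05-01 20:15:00",
     "2008-05-02 09:00:00", "Gig", None, 52.03, 8.53, 12),
    (102, "http://example.org/102.jpg", "bob", "2008-05-01 21:00:00",
     "2008-05-03 10:30:00", None, None, None, None, 3),
    (103, "http://example.org/103.jpg", "alice", "2009-07-10 18:00:00",
     "2009-07-11 08:00:00", "Match", "Final", None, None, 40),
]
conn.executemany("INSERT INTO ReSEED_PICTURES VALUES (?,?,?,?,?,?,?,?,?,?)", pictures)
conn.executemany("INSERT INTO ReSEED_PICTURES_EVENTS VALUES (?,?)",
                 [(101, 1), (102, 1), (103, 2)])

# Number of pictures per ground-truth event.
counts = conn.execute(
    "SELECT e.event_id, COUNT(*) FROM ReSEED_PICTURES p "
    "JOIN ReSEED_PICTURES_EVENTS e ON e.flickr_picture_id = p.flickr_picture_id "
    "GROUP BY e.event_id ORDER BY e.event_id"
).fetchall()
```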

ReSEED_EVENTS
  upcoming_event_id  INT(11)       NOT NULL
  title              VARCHAR(255)  NOT NULL
  tags               VARCHAR(255)  NULL
  description        VARCHAR(255)  NULL
  venue_id           INT(11)       NULL
  venue_name         VARCHAR(255)  NULL
  venue_street       VARCHAR(255)  NULL
  venue_city         VARCHAR(128)  NULL
  venue_country      VARCHAR(100)  NULL
  venue_longitude    FLOAT         NULL
  venue_latitude     FLOAT         NULL
  venue_url          VARCHAR(255)  NULL
  startdate          DATETIME      NOT NULL
  enddate            DATETIME      NOT NULL

Figure 5: Database schema for dataset (event metadata from Upcoming)

4.1 Training and Test Dataset Split

The dataset is split into two parts: 70% of the dataset constitutes the training set, and the remaining 30% is intended for evaluation. The split was made so that no event appears in both parts.

4.2 Evaluation measurements

We evaluated the submissions to the SED challenge by comparing the results to the ground truth information that has been compiled from the event calendars.

The results of event-related media item detection were evaluated using three evaluation measures which return values in the range [0, 1] (higher values indicate better agreement with the gold standard):

• F1-score is proposed as the main evaluation measure. In SED2013, the micro-averaged version was used to calculate the harmonic mean of Precision and Recall (see also [6]). It measures the appropriateness of the event clusters.

• Normalized Mutual Information (NMI) to compute the overlap between clusters.

• Divergence from a Random Baseline, which indicates how much the results diverge from a random baseline as described in De Vries et al. [2]. This is used as a sanity check.
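To make the first two measures concrete, the following sketch computes a per-item F1-score and NMI for two clusterings given as dictionaries from item ID to cluster label. It is not the official evaluation script: the per-item (BCubed-style) averaging of Precision and Recall and the normalization of the mutual information by the arithmetic mean of the two entropies are assumptions, and the official script may define both differently. The divergence from a random baseline [2] is not covered here.

```python
import math
from collections import Counter

def per_item_f1(gold, predicted):
    """Precision, recall and F1, with precision and recall averaged over items."""
    gold_sizes = Counter(gold.values())
    pred_sizes = Counter(predicted[i] for i in gold)
    overlap = Counter((gold[i], predicted[i]) for i in gold)
    precision = recall = 0.0
    for item in gold:
        shared = overlap[(gold[item], predicted[item])]
        precision += shared / pred_sizes[predicted[item]]
        recall += shared / gold_sizes[gold[item]]
    precision, recall = precision / len(gold), recall / len(gold)
    return precision, recall, 2 * precision * recall / (precision + recall)

def nmi(gold, predicted):
    """Mutual information normalized by the mean of the two cluster entropies."""
    n = len(gold)
    gold_sizes = Counter(gold.values())
    pred_sizes = Counter(predicted[i] for i in gold)
    joint = Counter((gold[i], predicted[i]) for i in gold)
    mutual = sum(
        count / n * math.log(count * n / (gold_sizes[g] * pred_sizes[p]))
        for (g, p), count in joint.items()
    )

    def entropy(sizes):
        return -sum(c / n * math.log(c / n) for c in sizes.values())

    denominator = (entropy(gold_sizes) + entropy(pred_sizes)) / 2
    return mutual / denominator if denominator > 0 else 1.0

# Hypothetical example: four pictures, two ground-truth events.
gold = {1: "a", 2: "a", 3: "b", 4: "b"}
perfect = {1: "x", 2: "x", 3: "y", 4: "y"}
lumped = {1: "x", 2: "x", 3: "x", 4: "x"}
```

A clustering that reproduces the ground truth up to renaming of the labels (`perfect`) reaches the maximum on both measures, whereas putting all pictures into a single cluster (`lumped`) keeps recall high but lowers precision and carries no mutual information.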

We provide an evaluation script together with the dataset so that new algorithms can easily be compared to the official results of the Social Event Detection challenge [7].

5. CONCLUSION

In this paper we presented a dataset for research in the area of social event identification. It contains user-contributed images and metadata under a Creative Commons license. It is suitable for clustering and classification tasks in that area. Both the scale and the complexity of the dataset make it challenging and representative of real-world problems. The dataset is freely available, and we see it as an important contribution to the advancement of the field, supporting the development, evaluation and systematic comparison of different social event detection approaches. As it has been used in the MediaEval Social Event Detection task in 2013, it is also well established in the community and makes it easier for researchers to compare their results to those of other approaches.

Acknowledgments

The work was supported by the Deutsche Forschungsgemeinschaft (DFG) – Excellence Cluster 277 (Cognitive Interaction Technology) and by the European Commission under contracts FP7-287911 LinkedTV, FP7-318101 MediaMixer, FP7-287975 SocialSensor and FP7-249008 CHORUS+.

6. REFERENCES

[1] H. Becker, M. Naaman, and L. Gravano. Learning similarity metrics for event identification in social media. In WSDM, pages 291–300, 2010.

[2] C. de Vries, S. Geva, and A. Trotman. Document clustering evaluation: Divergence from a random baseline. 2012.

[3] P. Over, G. Awad, J. Fiscus, B. Antonishek, M. Michel, A. F. Smeaton, W. Kraaij, G. Quenot, et al. An overview of the goals, tasks, data, evaluation mechanisms and metrics. In TRECVID 2012 - TREC Video Retrieval Evaluation Online, 2012.

[4] S. Papadopoulos, E. Schinas, V. Mezaris, R. Troncy, and I. Kompatsiaris. Social event detection at MediaEval 2012: Challenges, dataset and evaluation. In Proceedings of MediaEval 2012 Workshop, 2012.

[5] G. Petkos, S. Papadopoulos, and Y. Kompatsiaris. Social event detection using multimodal clustering and integrating supervisory signals. In Proceedings of the 2nd ACM Intern. Conf. on Multimedia Retrieval, page 23. ACM, 2012.

[6] T. Reuter and P. Cimiano. Event-based classification of social media streams. In Proceedings of the 2nd ACM Intern. Conf. on Multimedia Retrieval, page 22. ACM, 2012.

[7] T. Reuter, S. Papadopoulos, G. Petkos, V. Mezaris, Y. Kompatsiaris, P. Cimiano, C. de Vries, and S. Geva. Social event detection at MediaEval 2013: Challenges, datasets, and evaluation. In Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, Barcelona, Spain, October 18-19, 2013.
