
Event Panning in a Stream of Big Data

Hugo Hromic, Marcel Karnstedt, Mengjiao Wang, Alice Hogan, Václav Belák, Conor Hayes
DERI, National University of Ireland Galway

{first.last}@deri.org

Abstract

In this paper, we present a hands-on experience report from designing and building an architecture for preprocessing & delivering real-time social-media messages in the context of a large international sporting event. In contrast to the standard topic-centred approach, we apply social community analytics to filter, segregate and rank an incoming stream of Twitter messages for display on a mobile device. The objective is to provide the user with a "breaking news" summary of the main sources, events and messages discovered. The architecture can be generally deployed in any context where (mobile) information consumers need to keep track of the latest news & trends and the corresponding sources in a stream of Big Data. We describe the complete infrastructure and the fresh stance we took for the analytics, the lessons we learned while developing it, as well as the main challenges and open issues we identified.

1 Introduction

Although large-scale data analytics as well as stream processing and mining are well-established research domains, they have recently received increased attention with the advent of the Big Data movement. While some of this attention might be due to mere hype, the characteristics of what is commonly understood as Big Data analytics pose a wide range of interesting and partly novel challenges. People often focus on issues of data management and processing when considering Big Data. However, there are just as many untackled challenges with respect to the knowledge discovery and data mining at the heart of the actual analytics. This is particularly true in the context of Big Data streams, for which many of the usually considered approaches, such as the MapReduce paradigm, are not directly adoptable.

In this paper, we report on our own experience from a "hands-on" exercise in discovering knowledge from one of the largest publicly available Big Data streams of our time, Twitter. We collected these experiences while creating a mobile application for the final stop of the Volvo Ocean Race 2012 in Galway, Ireland. For Galway, a city of ca. 75,000 citizens, this was a major event with more than 900,000 visitors during one week. We asked ourselves: How can we leverage the experience of the visitors so that they get the most from the event and their time in Galway? Twitter was quickly identified as probably the best, fastest and most multifaceted source of information for this purpose. But what is needed to extract, analyse, prepare and present the most relevant information from this huge social-media stream for such a particular and – compared to the overall size of the Twittersphere – rather small and local event? This work summarises our architecture, spanning from a raw Big Data stream over large-scale and streaming analytics to the end-user mobile application. While we focus on the core analytics and the novel approach we took therein, we briefly cover all main aspects of the whole framework. As such, we also touch on the areas of information retrieval, Web search, personalisation and ubiquitous access on mobile devices.

The widespread use of social media on internet-enabled mobile devices has created new challenges for research. As users report their experiences from the world, information consumers are faced with an increasing number of sources to read in real time. Furthermore, the small screen size of mobile devices limits the amount of information that can be shown and interacted with. In short, social media users on mobile devices are faced with increasing amounts of information and less space to consume it. Thus, finding, analysing and presenting information related to one particular event that is of high but still mostly local relevance is comparable to searching for the needle in the haystack. The intuitive approach that comes to mind is to mine the stream for particular topics of interest and to try to pick those that are of high relevance. However, this is prone to missing particular information, as large dominating topics will often overshadow the rather small ones. Moreover, it creates a mix of content that comes from many different and often unrelated groups of users, which might obscure particular facets of the groups' messages.

For these reasons, we decided to take a fresh stance on this problem. Our approach is based on several observations in the literature [Kwak et al., 2010; Glasgow et al., 2012; Cha et al., 2010] that characterise Twitter as a novel mix between news media and social network. As such, we hypothesise that news and topics of particular interest will be amplified by groups of users that reveal close and frequent communication relations on Twitter. Thus, we argue that by mining for closely connected groups and analysing their topics, we actually search for the pin cushion, which is easier (but still challenging) to spot, rather than for the single needle. To use another metaphor, we compare this task to panning for gold, and propose to:

(i) pan only in the best part of the Big Data stream by applying a set of complex filters that benefit from knowledge discovery and machine learning,

(ii) use network analysis to achieve successful screening to identify the truly valuable nuggets of information,


Figure 1: Screens of the mobile app: (a) Main Screen, (b) Tweets Screen, (c) Group Screen

(iii) apply social ranking to bring the most valuable information to the highest attention.

The contribution of the work at hand is a summary report from this hands-on experience. We discuss the major issues we encountered, the expected and unexpected challenges we had to face, and the data mining approaches we integrated as well as those that we identified as potentially beneficial for further improvement. We expect that this experience report can help others in designing & developing applications based on advanced Big Data stream analytics.

2 Application Overview

We developed our application as part of the official mobile application for the Volvo Ocean Race 2012 in Galway, Ireland, which was completely created by DERI, NUI Galway.¹ After more than three months of hard work, several major and minor changes to our approach and some sleepless nights, we were happy to finish Tweet Cliques², a small but smart app that shows what's up and hot around the event on Twitter. A recent blog post³ nicely summarises a main issue of Big Data, whose truth we could confirm nearly every day: it's not just the analytics. All surrounding tasks, such as filtering, pre- and post-processing, are of equal importance. In this section, we give a complete overview of all the resulting crucial ingredients and the lessons we learned, before we focus on the actual analytical backbone in Section 3.

Figure 1 shows our final product, the mobile application. On the main screen, we show the groups we found and the most recently observed tweets from the most central group members. On the second screen, we show the most recently observed tweet from each group member, ranked by the importance of the users. Finally, the third screen shows the users and the computed "Clique Score" (used for ranking, see Section 3.3) as well as the number of outgoing and incoming links of the users in the group.

2.1 Architecture

The main architecture is shown in Figure 2. It comprises several interconnected components, from the live Twitter stream up to the final end-user mobile application.

¹ http://www.deri.ie/about/press/coverage/details/?uid=265&ref=214
² Information and download: http://uimr.deri.ie/tweet-cliques
³ http://wp.sigmod.org/?p=430

The core idea is to process tweets that pass the filter continuously, while the analytics that extract community structures and topics run in batch mode. Incoming tweets are, however, always classified into the corresponding groups and delivered in real time. The architecture is fully modular, with a clear separation of the different processing logics involved.

The system considers three kinds of data flows for processing:

• Streaming flow: data is processed continuously and endlessly, using a distributed publisher/subscriber model. The communication channel is always active and data processing is fully event-driven. This flow is able to deliver events to multiple components at the same time.

• System-triggered flow: data is processed at regular intervals, controlled by known system timers (i.e., window slides). Data transport is only active when requested, and a fixed amount of data is passed between components in a point-to-point fashion.

• On-demand flow: very similar to the above flow, with the key difference that it is triggered by external events, such as users starting new sessions. This flow is also point-to-point, active upon request, and data flows in fixed amounts.

Two different databases provide the storage backbone for the application:

• Graph database: This store is used for the real-time creation of relationship networks (mentions, retweets, replies and hashtag referrals). Time-stamped nodes and edges are inserted into the database on the fly after the corresponding entity extraction has been carried out.

• Relational database: This store keeps the content of the relevant tweets and the ranked community structures (including users and labels). Content is inserted as soon as tweets arrive, whereas community data is inserted at regular intervals (system-triggered) by the analytics component.

We describe the two main components, the Stream Filter and the Analytics, in the following sections. The Tweets Classifier works on the data provided by the Analytics component and assigns tweets to groups. The communication between the Mobile App and the backend is based on a Web service, which allows the application to run on a wide range of different mobile platforms.


Figure 2: Architecture overview

2.2 Stream Filtering

The Twitter public stream APIs⁴ provide real-time access to Twitter data by following specific users and topics. Due to these technical constraints, and due to concerns about scalability, we knew from the beginning that we had to restrict ourselves to a particular part of the global Twitter stream. However, we quickly learned that filtering is not only crucial for these reasons. Rather, the trade-off between too much noise in the data (i.e., too relaxed filters) and sparsity of input (i.e., too strict filters) has a crucial impact on the results gained from the actual analytics. Consequently, the filtering component grew day by day into a complex system of different techniques (see Figure 3), which still bears a lot of potential for improvement.

Figure 3: Overview of the filtering component

We make use of two parameters that the stream APIs support to retrieve tweets from the public stream (a minimal connection sketch follows the list):

1. Follow: a list of user IDs

2. Track: a list of keywords
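As an illustration, a minimal sketch of such a filtered stream connection is given below. It is not our production filter: the v1.1 statuses/filter endpoint reflects the API available at the time, the OAuth credentials are placeholders, and the handle() consumer stands in for the Filter Core of Figure 3.

```python
import json

import requests
from requests_oauthlib import OAuth1

STREAM_URL = "https://stream.twitter.com/1.1/statuses/filter.json"

def filtered_stream(auth, follow_ids, track_terms):
    """Yield decoded tweets matching the 'follow' and 'track' filter parameters."""
    params = {
        "follow": ",".join(str(uid) for uid in follow_ids),  # seed + extended users
        "track": ",".join(track_terms),                      # seed + extended keywords/hashtags
    }
    with requests.post(STREAM_URL, auth=auth, data=params, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # the stream sends keep-alive newlines
                yield json.loads(line)

# Usage (credentials are placeholders):
#   auth = OAuth1("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
#   for tweet in filtered_stream(auth, follow_ids=[12345], track_terms=["#vor", "volvo ocean race"]):
#       handle(tweet)  # hypothetical downstream consumer (the Filter Core in Figure 3)
```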

In the following, we describe the two main building blocks of the filtering component: a set of filter inputs that define what we actually fetch from the stream (and filter out again, e.g., spam) and an extension logic that aims at dynamically maintaining these filter inputs.

⁴ https://dev.twitter.com/docs/api/1/post/statuses/filter

Seeds As a starting point, we manually chose a number of seed users and keywords that we identified as highly relevant to the Volvo Ocean Race and which we followed permanently. If we want to filter particularly for hashtags (commonly understood as topic labels on Twitter), we precede a keyword by the '#' sign. This is important for some hashtags like '#vor', which would return a large amount of unrelated tweets if used directly as a keyword. Further, one filter keyword can actually contain several words, which returns all tweets containing each of the words at any position in its text.

In our current implementation, the addition of keywords, hashtags and users to the seed lists is mainly a manual and subjective process. Seed users were initially chosen by analysing Twitter lists relevant to the event. We picked users from these lists and their followers, i.e., a 2-hop user collection. Then we cleaned the resulting list manually. However, soon after having the first set of filters in place, we recognised that the overlap between the set of users actually using the seed hashtags and the set of users from the Twitter list approach is rather small, which might be due to a rather unpopular and ad-hoc usage of Twitter lists. Thus, we decided to manually check the descriptions and latest tweets of all seen users in order to determine whether they are relevant or whether they are mostly doing advertisement. Additional seed users are also found by manually checking the users that another seed user is following. This method was, for instance, used to find the list of Galway bars, as there was no corresponding Twitter list in place. Other social media, e.g., boards.ie⁵, can also be used to find good seed users.

The seed terms are found by looking at historical hashtags for this type of event, searching Twitter for any currently used hashtags, and also pre-emptively guessing seed keywords and seed hashtags that may be used in the future. Before a term is added to the seed list as a keyword or as a hashtag-only filter, its search results are manually checked in Twitter to determine the type of results it is likely to generate. This helps to estimate the level of noise that adding the term will generate.

Extension Logic Hashtags in Twitter are very heterogeneous, as they may be freely chosen by users and can refer to particular topics in many different ways. As such, it is impossible to cover all related users and keywords in the a priori created lists. Therefore, we try to identify other relevant keywords and users by analysing the tweets we retrieve based on the currently active filters. Then, we extend the seed lists, but only for a certain amount of time, after which we re-evaluate the extension.

⁵ http://www.boards.ie/vbulletin/showthread.php?t=2056640325


In this way, extended keywords and users are included in the filters only for as long as our extension logic considers them relevant. We extend periodically based on the following approaches:

1. Hashtag co-occurrence: We find hashtags that co-occurred with seed hashtags and extend by those that co-occur more frequently than a predefined threshold (in fact, a mix of an absolute and a relative threshold). We restrict this co-occurrence to the seed hashtags only in order to avoid a drift towards unrelated topics.

2. Hashtag-user co-occurrence: We extend the user list by users who frequently use seed hashtags.

3. We follow users that are identified as central in their community by the analytics component (see Section 3.3). We do this in order to always retrieve the most recent tweets from these important users.

For example, tweets about a concert of the band "Thin Lizzy" may contain both #volvooceanrace and #thinlizzy, but also #tlizzy, #lizzy, etc. – it is impossible to cover all possibilities. Moreover, these hashtags are relevant for our app only in a limited time window. This temporary importance is covered by the extension based on hashtag co-occurrence.
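A simplified sketch of this co-occurrence-based extension is shown below; the threshold values, the per-window tweet representation, and the absence of the user co-occurrence and expiry handling are simplifications of our actual Extender.

```python
from collections import Counter

def extend_hashtags(tweets, seed_hashtags, min_count=5, min_ratio=0.05):
    """Suggest non-seed hashtags that co-occur with seed hashtags often enough.

    `tweets` is an iterable of per-tweet hashtag lists from the current window;
    `min_count` (absolute) and `min_ratio` (relative) are illustrative thresholds.
    """
    seeds = {h.lower() for h in seed_hashtags}
    cooc = Counter()        # (seed hashtag, candidate hashtag) co-occurrence counts
    seed_usage = Counter()  # how often each seed hashtag occurs at all

    for tags in tweets:
        tags = {t.lower() for t in tags}
        for s in tags & seeds:
            seed_usage[s] += 1
            for cand in tags - seeds:
                cooc[(s, cand)] += 1

    extensions = set()
    for (s, cand), n in cooc.items():
        # mix of an absolute and a relative threshold, as described in item 1 above
        if n >= min_count and n / seed_usage[s] >= min_ratio:
            extensions.add(cand)
    return extensions

# e.g. extend_hashtags(window_tweets, {"#volvooceanrace"}) might yield {"#thinlizzy", "#tlizzy"}
```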

This type of co-occurrence checking is related to detecting frequent itemsets in the stream of incoming tweets. We experimented with a corresponding implementation provided by the stream data management system AnduIN [Klan et al., 2011]. However, we found that hashtags actually occur more frequently alone or in pairs than in larger sets. We are nevertheless planning to integrate this stream-mining method more closely into our extension logic.

Trending Topics The trending topic facility on Twitter caters for those interested in the most frequently occurring hashtags and keywords overall on Twitter. Users can query individual hashtags and get a listing of all quoting posts. However, this does not solve the problem of filtering, ranking and selecting these posts for presentation on a mobile device. Nevertheless, we can benefit from the list of up-to-date trending topics for maintaining the seed lists, but also to feed the other filter inputs, as we describe next.

Blacklist It is unavoidable to retrieve a lot of unrelated and in fact unwanted tweets from the public stream, such as porn-related tweets. In order to filter these out again, we built a blacklist containing words and phrases that we can safely correlate with such noise. Any tweet containing an entry from the blacklist is discarded. Further, we calculate a spam score S for each retrieved tweet, inspired by [Cha et al., 2010]. We consider the number of hashtags h and the number of trending topics t appearing in a tweet and also the account age a of the user (number of days since the account was created):

S = ((h + 1) · (t + 1)) / log(a + 1)

The intuition behind this calculation is that most spam messages try to attract attention by referring to many of the currently hot topics in Twitter at once. Furthermore, many spam accounts have a short lifetime, as they get blocked as soon as they are identified. This straightforward spam classification can obviously benefit from tailor-made machine learning techniques. An assessment of appropriate methods is planned for future work.
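The spam score translates directly into code; the guard for accounts created on the same day (a = 0) is an assumption, as this corner case is not discussed above.

```python
import math

def spam_score(num_hashtags, num_trending, account_age_days):
    """Spam score S = (h + 1) * (t + 1) / log(a + 1) from the Blacklist step.

    The max(..., 1) guard for accounts created on the same day (a = 0) is an
    assumption; the paper does not specify how that corner case is handled.
    """
    denom = math.log(max(account_age_days, 1) + 1)
    return (num_hashtags + 1) * (num_trending + 1) / denom

# A fresh account stuffing trending hashtags scores high:
#   spam_score(num_hashtags=8, num_trending=5, account_age_days=2)   -> ~49.1
# an old account with a single hashtag scores low:
#   spam_score(num_hashtags=1, num_trending=0, account_age_days=900) -> ~0.29
```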

Greylist The greylist is used for common terms that may be relevant, i.e., that we do not want to block in general, but which also bear the potential to cause a drift towards unrelated topics. Consider the concert example from above. It happened that in such a case we automatically extended our filters to listen to the hashtag #top5bands. But this is such a widely used hashtag on Twitter that the number of retrieved tweets, and consequently also the noise, burst. To filter out this kind of unrelated tweets, we built a greylist containing hashtags which we call global hashtags: hashtags that bear the potential to create a drift from the "local" event to a "global" trend involving all or large parts of the overall Twittersphere. Any tweet containing an entry from the greylist together with an entry from the seed list (e.g., "#thinlizzy #top5bands #galway") will enter the system – if nothing from the seed list is included, it will be discarded.
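Taken together, the blacklist and greylist rules amount to a simple admission test per tweet, sketched below with naive tokenisation; the actual Filter Core applies further checks such as the spam score.

```python
def admit_tweet(text, seed_terms, blacklist, greylist):
    """Decide whether a retrieved tweet enters the system.

    Blacklist hits (words or phrases) discard the tweet outright; a greylisted
    "global" hashtag is only tolerated if a seed term is also present.
    """
    lower = text.lower()
    if any(phrase in lower for phrase in blacklist):
        return False
    tokens = set(lower.split())  # naive tokenisation, for illustration only
    if tokens & greylist and not tokens & seed_terms:
        return False
    return True

# admit_tweet("#thinlizzy #top5bands #galway", {"#galway"}, set(), {"#top5bands"})  -> True
# admit_tweet("#top5bands best bands ever",    {"#galway"}, set(), {"#top5bands"})  -> False
```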

Our current approach finds such global hashtags by manually checking the hashtags correlated with trending topics in Twitter and on other Twitter trend websites.⁶ Further, the Twitter API can be used to query for hashtags representing currently trending topics. However, we refrained from adding these without any manual cross-check in order to avoid the aforementioned sparsity of input data.

One obvious machine learning technique that will likely help to automate this process is stream-based burst detection. The intuition is that, due to their nature, following global hashtags will result in a burst of corresponding tweets. There are several applicable burst detection methods available, although we are not aware of any method particularly designed for Twitter. For future work, we plan to experiment with the method proposed in [Karnstedt et al., 2009], which is already available in the aforementioned stream system AnduIN [Klan et al., 2011]. However, for direct application, we will have to slightly modify the currently available implementation. For the time being, we based the greylist on a kind of naive burst detection that makes use of the Twitter Search API. For each hashtag that we identify as potentially relevant, i.e., that we decide to extend by for the next window, we query the API for all related tweets from the last 24 hours. A hashtag that returns more tweets than a pre-defined threshold is regarded as a global hashtag. This approach bears the obvious problem of having to set yet another threshold. At the moment, the threshold is chosen based on historical data and regular monitoring is required. Further, neither does Twitter guarantee that all corresponding tweets are returned by the API call, nor are all global hashtags always "bursty".

Another idea is to use statistical process control (SPC) methods for calculating the threshold that indicates outliers or special causes – which are then likely candidates for global hashtags. This would require a fairly stable process to begin with. This and other statistical methods for burst detection are currently on our agenda.
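A sketch of the greylist decision combining the naive 24-hour count check with an SPC-style control limit is given below; the fixed threshold, the 3-sigma limit and the minimum history length are illustrative values rather than our deployed configuration.

```python
from statistics import mean, stdev

def is_global_hashtag(last_24h_count, history, fixed_threshold=2000):
    """Flag a candidate hashtag as 'global' before it is used to extend the filters.

    `last_24h_count` is the number of tweets the Search API returned for the
    hashtag over the past 24 hours; `history` holds such counts for hashtags
    previously judged to be local. Both the fixed threshold and the
    mean + 3*sigma control limit (the SPC variant) are illustrative values.
    """
    if last_24h_count > fixed_threshold:  # naive burst check
        return True
    if len(history) >= 10:                # the SPC-style limit needs a stable baseline
        limit = mean(history) + 3 * stdev(history)
        return last_24h_count > limit
    return False
```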

3 The Analytical Backbone

The analytic goal of the application was to detect news and emerging topic trends by finding groups of actors and analysing the topics of these groups. The analytical backbone therefore comprises community identification, topical labelling of the communities and, finally, multiple ranking strategies.

⁶ http://trendsmap.com/, http://whatthetrend.com/


Figure 4: Communities with a minimal size of 5 in different types of networks: (a) only retweet edges, (b) all types of edges


3.1 Community Analysis

In order to identify groups of actors, it is necessary to establish a notion of similarity or relatedness between the actors. An intuitive idea is to analyse the follow network of Twitter. But we are more interested in the groups that exist in situ, so we need a more dynamic notion of relatedness. For this, different modes of interaction amongst the Twitter users yield three dynamic social networks, which can be understood as three different modes of communication happening on Twitter [Glasgow et al., 2012]:

• RETWEET: quoting other users

• MENTION: talking about other users

• REPLY: talking to other users

In all the networks, the nodes are the users and the directed weighted edges represent one of the following types of interactions. First, a user i can be connected to j if i has retweeted j. Analogously, i can be linked to j when i has mentioned j in her tweets. Finally, the REPLY network consists of users linked by the who-replied-to-whom relation. The weights represent the number of interactions we have observed, e.g., w_ij in the RETWEET network is the number of tweets of j retweeted by i.

Apart from the three social networks, Twitter users can also be associated with the hashtags they referred to in their tweets. This can be conveniently formalised as a bipartite network with n + h nodes, where n is the number of users and h the number of hashtags. This network contains m directed links, each connecting a user i to a hashtag j with weight w_ij if i has used the hashtag j w_ij times. Let us call this bipartite network REFERS TO. We use the hashtags to determine the topics of the tweets.

In our use case all the networks were generally very noisy and sparse (n ≈ m), which hindered the community detection step, since many of the nodes were isolates. Further, while many community detection methods have been developed in recent years [Fortunato, 2010], only few methods have been proposed to mine communities from multiplex (multi-relational) networks [Lin et al., 2009; Mucha et al., 2010]. These methods generally require a definition of the relations between the individual networks (e.g., how the act of retweeting relates to mentioning, etc.), which is generally not obvious. This fact, along with the sparsity and noisiness of the individual social networks, led us to the decision to initially employ a simple approach: the three social networks were merged into one by adding up the weights of the links from the RETWEET, MENTION and REPLY networks. The resulting network represents a general relation INTERACTS WITH.
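Assuming the three layers are held as weighted directed graphs (e.g., networkx DiGraphs), the merger into INTERACTS WITH is a simple edge-wise sum of weights, as sketched below.

```python
import networkx as nx

def merge_networks(retweet, mention, reply):
    """Build INTERACTS WITH by summing the edge weights of the three layers."""
    interacts_with = nx.DiGraph()
    for layer in (retweet, mention, reply):
        for u, v, data in layer.edges(data=True):
            w = data.get("weight", 1)
            if interacts_with.has_edge(u, v):
                interacts_with[u][v]["weight"] += w
            else:
                interacts_with.add_edge(u, v, weight=w)
    return interacts_with
```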

We further assume that the INTERACTS WITH relation correlates with the topical interests of the users. The merger, however, does not fully tackle the problem of noisiness, despite the fact that it amplifies the signal. Therefore, we employed the community detection method OSLOM [Lancichinetti et al., 2011], which identifies only statistically significant communities based on a randomised version of the graph. That way we effectively filtered out the noisy relations and isolates. As a result, only ≈ 7% of the nodes in INTERACTS WITH were found to be members of some significant community. This number was generally much lower for communities detected in one of the three elementary social networks.

We illustrate this in Figure 4. As an example of the three elementary networks, we show the RETWEET network from a certain point in time in Figure 4a. Of the three discussed communication modes, the RETWEET relation is commonly the most frequently used one on Twitter [Kwak et al., 2010]. The figure clearly shows the many isolates and very small disconnected components at the periphery. The large hair ball on the right is a community around the Olympics 2012 – which somehow managed to bypass our filters at that time. The second large component, which consists of a few smaller connected parts, is centred around the Volvo Ocean Race. In this network, OSLOM found only five communities. Figure 4b shows the merged network for the same time. It is about twice as large (4512 vs. 2860 nodes, 5802 vs. 2717 edges), but shows similar characteristics. The Olympics hair ball and the Volvo Ocean Race cluster are still obvious. The network now reveals connections between the two (which might indicate how the Olympics topic ended up in our data) as well as the appearance of a third larger component.


Interestingly, the component on the bottom left is centred around Galway users tweeting about the race, while the new one above it is centred around the accounts of the participating race teams – and both are connected by a small group around the official Twitter account of the race.

In the merged network, OSLOM identified 25 communities. One might think that this is actually a small fraction given the clustered appearance of the network. However, often many users refer to a few very central ones (like the official Olympics account), but interact rather sparsely among each other. We believe that this is not an indication of the kind of "news spreading community" that we are actually looking for. In fact, we found that most of the 25 communities are as expected and intended, such as one around DERI and NUI Galway spreading the news about the Volvo Ocean Race app, communities around the teams giving race updates, etc. However, we consider applying different community detection algorithms to assess the impact of finding more (and thus noisier) communities.

Another feature of OSLOM is that it yields an overlapping community structure, i.e., a node can be a member of multiple communities. This is an important factor, because the communities are assumed to be centred around shared topics, and indeed a user may be interested in multiple distinct topics, which results in her membership in multiple communities. But, to our surprise, this case occurred very seldom. Twitter communities seem to be rather disjoint, although often well connected.

In order to reveal changes in the community structure, we segmented the streamed INTERACTS WITH network by a sliding time window 36 hours wide and overlapping by 35 hours, i.e., the window was moved each time by one hour. We observed that any narrower window gives too little signal. Further, as the majority of relevant Twitter users were likely to be geographically collocated near the location of the event, there was a regular gap in their activity during the night hours. Therefore, the 36-hour window effectively represents approximately 24 hours of user activity.
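A sketch of this window segmentation is shown below; the representation of interactions as a list of time-stamped, weighted edge tuples is an assumption for illustration.

```python
from datetime import datetime, timedelta

import networkx as nx

def network_slices(edges, start, end,
                   width=timedelta(hours=36), step=timedelta(hours=1)):
    """Yield (window_start, graph) pairs for the sliding-window segmentation.

    `edges` is assumed to be a list of (user_i, user_j, weight, timestamp)
    tuples taken from the INTERACTS WITH relation.
    """
    t = start
    while t + width <= end:
        g = nx.DiGraph()
        for u, v, w, ts in edges:
            if t <= ts < t + width:
                prev = g[u][v]["weight"] if g.has_edge(u, v) else 0
                g.add_edge(u, v, weight=prev + w)
        yield t, g
        t += step
```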

An independent set of communities C_t = {C_1, . . . , C_k} was then obtained by running OSLOM on each network time slice t. In order to maintain an intuitive user experience in the client application, it was necessary to track the same communities from one time slice to the next. We adopted the approach proposed by [Greene et al., 2010], which formulates the tracking process as a problem of maximal bipartite matching between the sets of clusters from time slices t_i and t_{i+1}. All communities from both slices are represented as nodes in the bipartite graph, whose weighted links represent the similarities between each possible pair of clusters from the two subsequent time slices. We used the Jaccard coefficient to compute this similarity s_C:

s_C = |C_1 ∩ C_2| / |C_1 ∪ C_2|,   with C_1 ∈ C_{t_i} and C_2 ∈ C_{t_{i+1}}

Finally, we did not match clusters whose similarity was below a threshold θ_c = 0.2.
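The matching step can be sketched as follows. Note that this greedy variant is a simplification of the maximal bipartite matching of [Greene et al., 2010] that we actually adopted; only the Jaccard similarity and the threshold θ_c correspond directly to the description above.

```python
def jaccard(c1, c2):
    """Jaccard coefficient between two communities given as sets of user IDs."""
    return len(c1 & c2) / len(c1 | c2)

def track_communities(prev_slice, next_slice, theta_c=0.2):
    """Greedily match communities of slice t_i to communities of slice t_{i+1}.

    `prev_slice` and `next_slice` are lists of user-ID sets. Pairs below the
    threshold theta_c are never matched; unmatched communities are treated as
    newly appeared or dissolved.
    """
    candidates = sorted(
        ((jaccard(c1, c2), i, j)
         for i, c1 in enumerate(prev_slice)
         for j, c2 in enumerate(next_slice)),
        reverse=True,
    )
    matches, used_prev, used_next = {}, set(), set()
    for sim, i, j in candidates:
        if sim < theta_c:
            break  # all remaining pairs fall below the matching threshold
        if i not in used_prev and j not in used_next:
            matches[i] = j
            used_prev.add(i)
            used_next.add(j)
    return matches
```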

3.2 Topic Labelling

After the community structure was obtained, it was necessary to label each community by the topic it was centred around. Recall that a weight w_ij in the REFERS TO network represents the number of times a user i has used a hashtag j. The joint frequency f_Cj of a hashtag j in a community C = {u_1, . . . , u_l} with l users can therefore easily be obtained as f_Cj = Σ_{i∈C} w_ij. Note that even though all the tags of the community members are associated with the community, we expect the modes of the distribution of the community's hashtags to be located at the hashtags representing similar and coherent topics. In order to select only the relevant themes for the community and filter out noise, we used a frequency threshold θ_f and labelled a community C only by topics with f_Cj ≥ θ_f. We observed that choosing θ_f = 2, i.e., filtering out just the hashtags that were used only once by a member of a community, already significantly improved the quality of the labels. This also confirmed our assumption about the topical coherence of the mined communities.
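The labelling step translates directly into code; the per-user hashtag counters standing in for the REFERS TO weights are an assumed data layout.

```python
from collections import Counter

def label_community(community, refers_to, theta_f=2):
    """Label a community by its sufficiently frequent hashtags.

    `community` is a set of user IDs; `refers_to[user]` is assumed to be a
    Counter mapping hashtags to the usage counts w_ij of the REFERS TO network.
    """
    joint = Counter()
    for user in community:
        joint.update(refers_to.get(user, {}))
    # keep only hashtags with joint frequency f_Cj >= theta_f (theta_f = 2 drops one-off tags)
    return {tag: freq for tag, freq in joint.items() if freq >= theta_f}
```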

While hashtag usage is a natural and intuitive approach for labelling communities, many users do not use hashtags in their tweets. Hence, we believe the labelling process can be improved by more advanced text processing techniques like keyword/key-phrase extraction, named entity recognition, etc. We further discuss this in Section 4.

3.3 Ranking is Vital

As mentioned in Section 2.2, there is an unavoidable trade-off between having too much noise in the system and having not enough interesting and relevant information. As such, we cannot avoid occasionally finding communities and topics that are not perfectly (or rather, not obviously) related to the event. The only way to overcome this and still provide a satisfying user experience is an appropriate ranking. With a good ranking, the crucial and relevant information will always be the first the user sees – and rather unrelated information is shown at the bottom of the groups list.

Thus, after the communities were identified and labelled, multiple ranking strategies were used. First, as already described, the importance of a topic (hashtag) j within a community C is measured by its joint frequency f_Cj. Second, the communities were ranked according to their correspondence with the global focus of the application. Third, the members of each community were ranked in order to present the most relevant tweets in each community.

The global focus of the application is determined by the set of p seed hashtags S = {h_1, . . . , h_p}, which are used to obtain the tweets relevant to the event (see Section 2.2). Our first approach was to measure this correspondence by the Jaccard coefficient between the seed set and the set of community hashtags. This works well, but can have a drawback in certain situations. If a community has many hashtags assigned, which cover a significant fraction of the seed hashtags, it can still be ranked lower than a community with only a few hashtags that cover only a small part of the seed hashtags. Thus, in our current version, we remove this dependence on the number of hashtags in a community and use only the number of overlapping hashtags:

r_C = |S ∩ H|,   with H = {j : f_Cj ≥ θ_f}

We are still analysing which of the two rankings is better suited in which situations. As a matter of fact, both provide a significant improvement of the results forwarded to the mobile client. If two communities have the same rank r_C1 = r_C2, then the one with more hashtags is ranked higher. The intuition behind this is that a community with more hashtags is more specific and should thus be listed first. If the number of hashtags is also equal, then the communities are finally ranked by their size, i.e., the bigger communities are given priority.
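The resulting ranking, including both tie-breakers, can be expressed as a single sort key; the (members, labels) representation of a community is an assumption for illustration.

```python
def rank_communities(communities, seed_hashtags):
    """Order communities for the groups list in the app.

    `communities` is assumed to be a list of (members, labels) pairs, where
    `labels` is the set of hashtags that survived the theta_f filter. Primary
    key: overlap with the seed set (r_C); ties are broken by the number of
    hashtags and then by community size.
    """
    seeds = {h.lower() for h in seed_hashtags}

    def key(entry):
        members, labels = entry
        overlap = len(seeds & {h.lower() for h in labels})  # r_C = |S ∩ H|
        return (overlap, len(labels), len(members))

    return sorted(communities, key=key, reverse=True)
```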

In order to rank the users within a community, we first induced a subgraph of the INTERACTS WITH network by including only the members of the community and the links between them.


The nodes of this subgraph were then ranked by PageRank. We used PageRank in particular because it has frequently been used as a heuristic for capturing the influence and authority of actors in social networks [Sun and Tang, 2011]. Finally, tweets in the mobile application are ordered by decreasing PageRank value of their authors. For future versions we plan to give the user the option to switch to a purely time-based ordering, as provided by standard Twitter applications. We further consider experimenting with other social ranking mechanisms, such as taking the actual user, their preferences and their personal networks into account [Gou et al., 2010; Haynes and Perisic, 2009]. One promising approach is to apply a tailored notion of PageRank like the FolkRank proposed in [Hotho et al., 2006].

Our hypothesis is that the most central users in the observed networks are the most important ones – and the ones that most frequently initiate the most important news. As such, we also decided to follow these users until the next community analysis is performed (currently in 1-hour intervals). The top 33% or top-3 users (whichever is fewer) are therefore signalled to the filtering component, which takes care of the corresponding extension (see Section 2.2). Further, the mobile application shows the last received tweet from the two highest ranked users of each community on its main screen (see Figure 1). It is thus important that these tweets are indeed the most recent ones from these users.
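A sketch of the within-community ranking and of the "users to follow" feedback is given below, assuming the INTERACTS WITH network is available as a weighted networkx DiGraph.

```python
import math

import networkx as nx

def rank_members(interacts_with, community):
    """Rank community members by PageRank on the induced INTERACTS WITH subgraph."""
    subgraph = interacts_with.subgraph(community)
    scores = nx.pagerank(subgraph, weight="weight")
    return sorted(scores, key=scores.get, reverse=True)

def users_to_follow(ranked_members):
    """Select the users fed back to the filter: top 33% or top 3, whichever is fewer."""
    k = min(3, math.ceil(0.33 * len(ranked_members)))
    return ranked_members[:k]
```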

4 Lessons Learned and Future Work

Our general conclusion is positive, as we can say "It works!". We continuously monitored the application and the determined groups and topics before and during the Volvo Ocean Race in Galway. Indeed, the application displayed mostly relevant groups with interesting and relevant topics. These ranged from small to large topics and clearly separated certain groups of interest, such as those around the race teams or certain events – which, as a byproduct, also separated groups of particular languages. Still, as discussed before, it is unavoidable that certain topics enter the application that are not obviously related to the event. To our relief, the applied ranking method always positioned them at the lower end of the groups list. Still, we plan to run experiments with different, less restrictive community detection methods to assess the quality of the currently identified groups – which are often rather small.

There are many important lessons we have learned and a wide range of identified challenges and research topics for future work. They arose from the mere focus on developing a real-time analytics application with a user-friendly interface, and also from issues we observed regarding the applied concepts and general design choices. One of the most crucial challenges lies in the actual analytics. Our batch-based approach can only work because we handle a rather limited part of the whole Twitter stream (due to the filtering). This might not always be the case, nor always appropriate. Moreover, any method based on a priori defined fixed windows is usually prone to providing sub-optimal results. As such, we plan to intensively investigate methods for real-time community detection and analysis. Our main focus will be on how to achieve a community monitoring that first supports the identification of change points, which signal the need for re-running the network clustering. Later, we will assess opportunities to do the whole community detection itself in a streaming fashion.

Figure 5: "Topic latency": timely detection, late removal

The general idea of moving from a topic-centred approach to a group-centred one works out. In first tests based on pure topic clustering in the analysed time slices, we found that, as expected, large topics "swallow" smaller ones – in our approach, the listed groups and topics range from small to large. Nevertheless, we experienced a sort of "topic latency" problem, as shown in Figures 5 and 6 for two events during the race. Figure 5 shows the number of tweets over time for a very popular band and indicates the time span during which we observed a corresponding topic in our app. As one can see, a large number of tweets results in an accurate early detection – but also in a belated removal of the topic. In contrast, an event with a rather low number of tweets (Figure 6) can result in a belated topic display but a timely removal. We plan to tackle this issue as follows:


Figure 6: “Topic latency”: late detection, timely removal

1. Decouple the topic detection from the actual group detection. While we have to use a larger window for finding a community signal in the networks, this is not necessary for the topic labelling of a community. We plan to mine these topic labels at shorter intervals for the found communities, which will also reflect topic changes in groups more accurately.

2. We identified the need to integrate a more sophisticated method for trend and event detection. This should be combined with a method for detecting bursts, which we already plan to incorporate into the filtering component (see Section 2.2). In order to be able to do this in (near) real time, we consider adopting more stream-mining approaches and will investigate cloud-based computing frameworks [Vakali et al., 2012].

3. Include NLP techniques to extend the topic labelling from considering only hashtags to general keywords. First tests in this direction were promising. This will, however, require a higher weighting of hashtags, as they are generally understood as topic labels by the Twitter community.

During the development of the app, we were surprised to see how crucial the quality of the filtering component actually is.


Similar to the above, we identified a strong need for advanced burst and trend detection methods, as well as for other suitable knowledge discovery and machine learning techniques. In order to make the application run really smoothly, we have to achieve an almost 100% automated maintenance of the filters. In fact, feeding the filters is currently the one component that requires the most manual input and benefits the least from automated analyses. Nevertheless, this manual and laborious process helped us to clearly identify the main challenges and problems we will have to face when aiming for an automated approach:

• Our results are very reliant on Twitter users' behaviour, e.g., retweeting, usage of Twitter lists, accurate user descriptions and the usage/non-usage of hashtags. This makes it very challenging to remove the manual element from finding suitable users and terms. Which data mining techniques are suited to help here? Can they be applied out of the box or do they have to be adapted to the particular use case?

• The decision-making for including/excluding users and terms is very subjective. For instance, should we include all users on a relevant Twitter list or should we filter them manually? Should we exclude users who are mostly advertising products? Maybe some of these products are interesting to the users of the app. Should we put all global hashtags on the greylist, all topics trending on Twitter? This might be problematic when considering larger events, such as the Olympics, which bear a global character in themselves.

The same holds for the importance of post-processing, i.e., what to show in the mobile client and how. Should we display retweeted messages just because they are the most recent, or rather only the original tweets? What minimal group size actually makes sense? Should tweets appear in multiple groups if the author belongs to more than one? These questions are also related to the crucial aspect of appropriate ranking techniques. This primarily concerns the ranking of groups and thus topics. However, it also refers to the ranking of users, tweets and keywords to label topics. Along the same lines, we believe that the application provides a perfect ground for research on personalisation. We envision including personal preferences, personal social networks, own tweets and topics, etc. This will strongly benefit from the aforementioned social ranking techniques, but also from research in the area of user interfaces and HCI.

Last but not least, the probably expected statement about the many unexpected issues: never underestimate Murphy's law! Although the app was running smoothly for days before the actual event, at the time of most importance we ran into issues with temporary outages of our Twitter connection, hardware failures, etc. This can in part be accommodated by unit tests, much more intense than the norm for any research prototype. Similarly, one tends to underestimate the effort required to build such a complete architecture and the wide range of issues, from tiny to crucial, one has to face during design and development. As researchers, we are not primarily used to creating working applications within a given time – for us this was a very good lesson about all the glitches and itches around that.

Acknowledgements

This work was jointly supported by the European Union (EU) under grant no. 257859 (ROBUST integrating project) and Science Foundation Ireland (SFI) under grant no. SFI/08/CE/I1380 (LION-2).

References

[Cha et al., 2010] Meeyoung Cha, Hamed Haddadi, Fabricio Benevenuto, and Krishna P. Gummadi. Measuring User Influence in Twitter: The Million Follower Fallacy. In ICWSM, 2010.

[Fortunato, 2010] Santo Fortunato. Community detection in graphs. Physics Reports, 486:75–174, 2010.

[Glasgow et al., 2012] Kimberly Glasgow, Alison Ebaugh, and Clayton Fink. #londonsburning: Integrating geographic, topical and social information during crisis. In ICWSM, 2012.

[Gou et al., 2010] Liang Gou, Xiaolong (Luke) Zhang, Hung-Hsuan Chen, Jung-Hyun Kim, and C. Lee Giles. Social network document ranking. In JCDL, pages 313–322, 2010.

[Greene et al., 2010] D. Greene, D. Doyle, and P. Cunningham. Tracking the evolution of communities in dynamic social networks. In ASONAM, pages 176–183, 2010.

[Haynes and Perisic, 2009] Jonathan Haynes and Igor Perisic. Mapping search relevance to social networks. In Workshop on Social Network Mining and Analysis (SNA-KDD), pages 2:1–2:7, 2009.

[Hotho et al., 2006] Andreas Hotho, Robert Jäschke, Christoph Schmitz, and Gerd Stumme. Information retrieval in folksonomies: search and ranking. In ESWC, pages 411–426, 2006.

[Karnstedt et al., 2009] M. Karnstedt, D. Klan, Chr. Politz, K.-U. Sattler, and C. Franke. Adaptive burst detection in a stream engine. In SAC, pages 1511–1515, 2009.

[Klan et al., 2011] D. Klan, M. Karnstedt, K. Hose, L. Ribe-Baumann, and K. Sattler. Stream Engines Meet Wireless Sensor Networks: Cost-Based Planning and Processing of Complex Queries in AnduIN. Distributed and Parallel Databases, 29(1):151–183, January 2011.

[Kwak et al., 2010] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twitter, a social network or a news media? In WWW, pages 591–600, 2010.

[Lancichinetti et al., 2011] A. Lancichinetti, F. Radicchi, J. J. Ramasco, and S. Fortunato. Finding statistically significant communities in networks. PLoS ONE, 6(4):e18961, 2011.

[Lin et al., 2009] Y. R. Lin, J. Sun, P. Castro, R. Konuru, H. Sundaram, and A. Kelliher. MetaFac: community discovery via relational hypergraph factorization. In SIGKDD, pages 527–536, 2009.

[Mucha et al., 2010] P. J. Mucha, T. Richardson, K. Macon, M. A. Porter, and J. P. Onnela. Community structure in time-dependent, multiscale, and multiplex networks. Science, 328(5980):876, 2010.

[Sun and Tang, 2011] J. Sun and J. Tang. A survey of models and algorithms for social influence analysis. In Social Network Data Analytics, pages 177–214. Springer, 2011.

[Vakali et al., 2012] Athena Vakali, Maria Giatsoglou, and Stefanos Antaris. Social networking trends and dynamics detection via a cloud-based framework design. In Workshop on Mining Social Network Dynamics (MSND – WWW Companion), pages 1213–1220, 2012.