
Association and temporality between news and tweets

by

Vânia Nogueira Moutinho

Master Dissertation in Data Analytics

Supervised by

Professor João Paulo Cordeiro

Professor Pavel Bernard Brazdil

2018

Acknowledgements

Firstly, I would like to thank Professors Pavel Brazdil and João Cordeiro for their

support, guidance and wisdom throughout this investigation.

I would also like to acknowledge Pedro Saleiro for his guidance and generosity

at the start of this project, André Lima for his kind advice at critical moments, and

Natália Silva and Filomena Anselmo for their companionship during the ups and

downs of these past three years.

Lastly, I would like to thank my parents for making this journey possible, and

Rafael Correia for his unending support and constant belief in my success.


Resumo

With the advent of social media, the boundaries between journalism and social networks are blurring. User-generated content (UGC) is increasing, and journalists devote a significant part of their day to announcing, spreading and monitoring news, as well as validating information, on platforms such as Facebook and Twitter. Several studies have tried to understand the role of social networks as news sources. However, the relationship and interconnections between this type of platform and the news media have not yet been studied in detail.

In this investigation, we studied a series of stories published in news articles, and their sharing and discussion on a social network, over a period of six months. Specifically, a sample of articles from generalist Portuguese news sources published in the first semester of 2016 was clustered using a hybrid algorithm. The resulting groups of stories were then associated with tweets from Portuguese users by means of a similarity measure.

For a subset of the clusters obtained, we carried out a temporal analysis of these groups of stories, examining the evolution of the two types of documents (articles and tweets) and the moment of their creation. We concluded that, for some groups of stories, namely Brexit and the European Football Championship, the publication of news articles intensifies on key dates (event-oriented), whereas debate and discussion on social networks are more evenly spread over the months leading up to those events.

Keywords: text clustering, tweets, news


Abstract

With the advent of social media, the boundaries between mainstream journalism and social networks are becoming blurred. User-generated content is increasing, and journalists spend considerable time on platforms such as Facebook and Twitter to announce, spread and monitor news and to crowd-check information. Many studies have looked at social networks as news sources, but the relationship and interconnections between this type of platform and the news media have not yet been thoroughly investigated.

In this work, we studied a series of news stories and their sharing and commenting on a social network over a period of six months. Specifically, a sample of articles from generalist Portuguese news sources published in the first semester of 2016 was subjected to hybrid text clustering. The groups of stories obtained were then associated with tweets from Portuguese users by means of a similarity measure.

Focusing on a set of clusters, we performed a temporal analysis of these groups of stories by examining the evolution of the two types of documents (articles and tweets) and the timing of their generation. We concluded that for some stories, namely Brexit and the European Football Cup, the publishing of news articles intensifies on key dates (event-oriented), while the discussion on social media is more balanced throughout the months leading up to those events.

Keywords: text clustering, twitter, news


Contents

Acknowledgements
Resumo
Abstract
1 Introduction
1.1 Motivation and objectives
1.2 Details and contribution
1.3 Organization
2 Related Work
2.1 Twitter
2.1.1 Facts and conventions
2.1.2 User intention
2.1.3 Twitter in Portugal
2.1.4 Twitter and news
2.2 Text mining
2.2.1 Definition and applications
2.2.2 The case of unstructured text data
2.2.3 Document, features and the representation model
2.2.4 Pre-processing
2.2.5 Mining tweets
2.3 Clustering
2.3.1 Introduction
2.3.2 Text clustering
3 Methodology
3.1 Motivation
3.2 Four main stages of the method
3.3 Data
3.4 News clustering
3.4.1 Sample
3.4.2 Pre-Processing
3.4.3 Clustering
3.5 Assignment of tweets to clusters
3.5.1 Sample
3.5.2 Pre-processing
3.5.3 Assignment to clusters
3.5.4 Evaluation
3.6 Analysis
3.6.1 Timing of events on the news and on social media
4 Temporal Analysis of News and Tweets
4.1 News clustering
4.1.1 Selection of articles
4.1.2 Preprocessing and representation model
4.1.3 Parametrizing the method
4.1.4 Clustering results
4.2 Assignment of tweets to clusters
4.2.1 Sample construction of tweets
4.2.2 Pre-processing of tweets
4.2.3 Assignment to clusters
4.3 Temporal analysis
4.3.1 Evolution of articles and tweets
4.3.2 Time-wise differences
5 Conclusion
5.1 Results
5.2 Limitations and Future Work
Bibliography
Appendices
A Example of input tweet
B Cluster keywords
C Assignment of tweets with at least two features in common with cluster centroids - Evaluation results
D Temporality between news and tweets - considering the complete dataset

List of Tables

4.1 Number of news articles per press source available
4.2 Frequent expressions on news articles
4.3 News articles pre-processing transformation examples
4.4 Number of elements per cluster and class homogeneity and separation
4.5 Tweets pre-processing transformation examples
4.6 Per-class evaluation of tweets assignment to clusters
4.7 Global evaluation of tweets assignment to clusters
C.1 Global evaluation of tweets assignment to clusters - considering tweets with at least two terms in common with cluster centroids
C.2 Per-class evaluation of tweets assignment to clusters - considering tweets with at least two terms in common with cluster centroids

List of Figures

3.1 Four main stages of the method
3.2 Data setup
4.1 Number of news articles in the dataset per month of 2016
4.2 News articles length - boxplot
4.3 Cluster size and mean length of articles
4.4 Explained inertia for k up to 500
4.5 Aggregation indices for k up to 500
4.6 Keywords per cluster - 4 examples
4.7 Number of available tweets per month of 2016
4.8 Number of tweets and articles per clusters
4.9 Evolution of the number of elements
4.10 Days difference between articles and the median tweet
A.1 Example of tweet in JSON - part 1
A.2 Example of tweet in JSON - part 2
B.1 Top 10 cluster keywords - part 1
B.2 Top 10 cluster keywords - part 2
D.1 Days difference between articles and the median tweet - considering the complete dataset of tweets and articles

Chapter 1

Introduction

This chapter presents the theme of the dissertation, describing the motivation behind it, the main goals, the proposed methodology and the main contributions.

1.1 Motivation and objectives

News media can have a powerful influence on people's perception of reality, at least in some domains, like politics (McCombs and Shaw, 1972) and foreign affairs (Wanta et al., 2004). A story reported in the mass media reaches thousands of people who generally have no other direct contact with the subject of the story, and it is reasonable to say that news media are an important source of information for most people (McCombs and Shaw, 1972).

However, the advent of social media is gradually changing the way information is disseminated and, moreover, possibly shifting the roles of news makers and news recipients. According to Kaplan and Haenlein (2010), social media is the set of internet-based applications where user-generated content (UGC) is created, modified and exchanged in a collaborative and participatory way. Following their categorization, examples of social media include: Facebook and LinkedIn, which can be classified as social networking sites; Wikipedia, a collaborative project; YouTube and Pinterest, as content communities; blogs; and virtual social or game worlds, like Second Life. In this setting, information is no longer the property of an elite set of sources; it is rather the result of a framework where every person with an internet connection can add, transform, update, diffuse, filter and share pieces of content.

The impact on journalism, in particular, is already noticeable. Although we still look to mainstream news to tell truthful from unreliable information, we are also increasingly interested in the content posted or shared by friends or other entities we follow on social networks (Newman, 2009). A 2015 study of the impact of social media on the activities of public relations professionals and journalists reports that journalists spend 1.7 hours a day using social media, with Facebook and Twitter being the first and second most used platforms, respectively. In addition to expected activities such as relationship management and responding to comments, these professionals use social media to announce, spread and monitor news and to check information (crowd checking), with 40% of them considering social media a reliable source (DVJ Insights and ING The Netherlands, 2015).

The expectation is that the relationship between news and social media will continue to grow (DVJ Insights and ING The Netherlands, 2015).

Considering these facts, can we still tell where and when a news story begins, and who authored it? As the borders between journalism in the traditional sense and social media become blurred, what is the impact of one over the other? Does this evolution mean that the popularity of stories on social media is reflected in, and/or reflects, the attention they receive in the press? In other words, to what extent are we still relying on news from the press to know what is happening and, conversely, how much is the social media focus impacting what the press reports?

This is a relationship worth analysing at this point in time. In particular, this empirical investigation explores news articles and social media messages from Portugal, a country where the use of social media is still increasing among both the population and enterprises (Lusa, 2016).

The main goal of this dissertation is to study stories that are reported in the news and commented on or diffused throughout social networks. The focus is on identifying stories that were published over a six-month period, together with the social media posts about them. Then, for a selected set, the differences in timing between news and social media are analysed. This analysis includes examining the evolution of the number of articles and tweets about the same stories and identifying which groups of stories show signs of having been first published in the news and then diffused or commented on a social network, and which reached the height of social media discussion before the height of news article publishing.

1.2 Details and contribution

As described in the first section, the press plays an important role in people's daily conversations and opinions (McCombs and Shaw, 1972; Wanta et al., 2004). Also, trends on social media have an increasing influence on what journalists report (DVJ Insights and ING The Netherlands, 2015; Newman, 2009). To explore this relationship, we looked at the news media coverage during a certain period of time and identified groups of stories using clustering techniques. Then we searched for those stories on a social network, using a measure of similarity to those groups of stories. We evaluated the results using 'semi-labelled' social media posts, that is, posts that shared news articles. Focusing on the groups with the best performance, we studied the evolution of the number of articles and posts and the timing differences. The differences found were meant to shed some light on the following questions. Stories that are brought to public attention by mainstream journalism and then generate a buzz on social media may indicate a prevalent function of that profession. Stories that begin on social media and are then picked up by the news can indicate either a reinforcement or a reversal of this mechanism. Moreover, are there stories that only occasionally come about or is there a continuous debate through time on either or both platforms?

It is not our objective to categorize every story in these terms. Indeed, we recognize that these are subjective considerations and that a full understanding of the interconnections between news and social media, and of the underlying aspects in society, requires more thorough research. Notwithstanding, we believe the approach and focus proposed for this investigation will provide initial grounds for it.

Furthermore, the outcome of this research should provide good insight into the main themes discussed, formally and informally, in terms of both chronology and intensity, thus allowing a country's narrative to take form.

1.3 Organization

The remainder of this report is organized as follows. Chapter 2 provides the literature review, Chapter 3 describes the methodology employed, and Chapter 4 presents the results obtained from grouping news and tweets and discusses the temporal relationship between news and tweets. Chapter 5 concludes.

Chapter 2

Related Work

This chapter provides an overview of the most relevant and related work. It begins

by describing the social network Twitter and noting certain aspects researchers have

found important when working with tweets. Then, attention is given to text mining

and clustering techniques.

2.1 Twitter

2.1.1 Facts and conventions

Twitter is a social network launched in October 2006. By the end of the first semester of 2016, it had more than 313 million monthly active users, of which 82% used the mobile app (Twitter, 2016). By the end of the first semester of 2018, the number of monthly active users was 335 million (Statista, 2018). Users are people or organizations who create an account so they can send or receive messages and follow other users. Those messages are known as tweets. The relationship between users is directed rather than reciprocal: a user chooses to follow another, but this does not mean that the other user will follow them back. Moreover, a user decides whom to follow, but has no power over who follows them.

Tweets are messages posted by a user and are limited to a maximum length of 140 characters¹. This means that users must be concise in their writing, and the little thought each message requires encourages adoption of this form of blogging. Indeed, as Java et al. (2007) state in one of the earliest works on Twitter, this feature makes it a microblogging platform on which communication is faster and easier. On the other hand, it also follows that the use of abbreviations is fairly common (Sankaranarayanan et al., 2009), adding a layer of difficulty to text mining tasks.

Following the classification of Kwak et al. (2010), tweets can be singletons, replies, mentions or retweets. A reply or a mention uses the convention '@user' to indicate that someone is being addressed, whereas a retweet is a form of forwarding another user's message and is usually preceded by 'RT'. Kwak et al. (2010) consider retweets a very powerful feature for information filtering and diffusion. A singleton is a tweet with no reply or mentions (Kwak et al., 2010).
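The four tweet types can be told apart with simple pattern matching. The sketch below is only an illustration: the function name classify_tweet is hypothetical, and the rules (a leading 'RT' marks a retweet, a leading '@user' marks a reply, an '@user' anywhere else marks a mention) are simplifying assumptions rather than the exact criteria of Kwak et al. (2010).

```python
import re

MENTION = re.compile(r"@\w+")  # the '@user' convention


def classify_tweet(text: str) -> str:
    """Heuristically label a tweet as retweet, reply, mention or singleton."""
    stripped = text.strip()
    # Assumption: a retweet is marked by a leading 'RT' token.
    if re.match(r"RT\b", stripped):
        return "retweet"
    # Assumption: a reply starts by addressing a user.
    if MENTION.match(stripped):
        return "reply"
    # Any other '@user' occurrence is treated as a mention.
    if MENTION.search(stripped):
        return "mention"
    # No reply and no mention: a singleton.
    return "singleton"


for example in ["RT @ana: Portugal wins!", "@ana thanks", "Great game by @ana today", "Good morning Porto"]:
    print(classify_tweet(example), "<-", example)
```

Real tweet metadata usually carries explicit reply and retweet fields, which would be more reliable than text patterns where available.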

There is another interesting feature associated with tweets which, according to Sankaranarayanan et al. (2009), has been successfully used in clustering tasks, namely the hashtag. A hashtag is a word or expression that begins with the hash symbol (#), and it is generally used to indicate the topic of the tweet. A query on a particular hashtag returns all the tweets containing it. Considering that hashtags are chosen by individual users, it is surprising to see how few hashtags are associated with a single news issue (Sankaranarayanan et al., 2009).
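As a minimal sketch of how hashtags can be extracted and used to query a collection of tweets, consider the following. The helper names extract_hashtags and filter_by_hashtag and the sample tweets are hypothetical, and matching is made case-insensitive as an assumption.

```python
import re

HASHTAG = re.compile(r"#(\w+)")  # a hash symbol followed by word characters


def extract_hashtags(text):
    """Return the hashtags of a tweet, lower-cased and without the '#'."""
    return [tag.lower() for tag in HASHTAG.findall(text)]


def filter_by_hashtag(tweets, hashtag):
    """Return the tweets that contain the given hashtag."""
    wanted = hashtag.lstrip("#").lower()
    return [t for t in tweets if wanted in extract_hashtags(t)]


tweets = [
    "Voting day #Brexit",
    "#Euro2016 final tonight!",
    "What now? #brexit #EU",
]
print(filter_by_hashtag(tweets, "#Brexit"))
```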

2.1.2 User intention

Studies have shown Twitter users are of three kinds: information sharers, information seekers or friends/acquaintances (Java et al., 2007; Krishnamurthy et al., 2008).

The follower/followee network constructed by Kwak et al. (2010) using more than 41.7 million user profiles collected in July 2009 revealed the average path length to be 4.12, which the authors consider to be relatively short. They conclude that Twitter is not only a social networking platform but also an information seeking facilitator.

¹ At the end of 2017, the 140-character limit on tweets was extended to 280 characters (Rosen, 2017).

Furthermore, tweet creation roughly follows Pareto's law, as fewer than 10% of Twitter users produce more than 90% of all tweets (Sankaranarayanan et al., 2009).
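To make this kind of concentration concrete, the following minimal sketch computes the share of tweets produced by the most active fraction of users. The function name top_share and the toy author list are hypothetical and are not taken from the cited study.

```python
import math
from collections import Counter


def top_share(authors, fraction=0.10):
    """Share of all tweets written by the most active `fraction` of users.

    `authors` holds one entry per tweet: the identifier of its author.
    """
    if not authors:
        return 0.0
    counts = sorted(Counter(authors).values(), reverse=True)
    n_top = max(1, math.ceil(fraction * len(counts)))  # at least one user
    return sum(counts[:n_top]) / len(authors)


# Toy data: one very active user ('a') and nine occasional users.
authors = ["a"] * 91 + list("bcdefghij")
print(top_share(authors))  # share of tweets written by the top 10% of users
```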

2.1.3 Twitter in Portugal

In 2015, 54.8% of Portuguese people used social networks, according to a study by Marktest, cited by Lusa (2016). By 2017 the penetration rate had increased to 59.1% (Marketeer, 2017). The main activities on social media are sending and receiving messages, watching videos, chatting, and reading and sharing news (Lusa, 2016). Twitter has a penetration rate of 23.6% among the population and of 41.9% among enterprises (Lusa, 2016).

To the best of our knowledge, there is one attempt in the literature at empirically characterizing Twitter in Portugal: Brogueira et al. (2016). This work uses geolocated tweets collected from the Twitter streaming API from mid-September 2014 to mid-September 2015. The main findings are the following: the distribution of users per district is very similar to the population distribution; regional differences in Twitter usage throughout the year reflect the usual vacation destinations; tweets containing URLs and retweets represent a very small portion of the geolocated tweets (less than 3%), while mentions and replies add up to almost 35%, indicating that these users mostly chat; the top hashtags were related to either football (soccer) or television entertainment shows (Brogueira et al., 2016).

Although these results apply to geolocated tweets only, they are relevant aspects of the Portuguese Twitter community that are taken into account during the empirical part of the dissertation.

2.1.4 Twitter and news

The literature concerning Twitter and news is often directed at using tweets as a single news source. Indeed, some studies regard Twitter as a substitute for (rather than a complementary platform to) traditional news sources (e.g. Sankaranarayanan et al., 2009; Zhao et al., 2011; Phuvipadawat and Murata, 2010). The main reason for this is perhaps the realization that some news breaks first on Twitter. For example, Hu et al. (2012) have shown that the capture and death of Osama Bin Laden was made public on Twitter at least 20 minutes sooner than on major U.S. television channels. The authors argue that this may happen due to the role of a particular set of influential users, namely journalists and politicians, whose credibility provokes an immediate reaction on social networks (Hu et al., 2012).

Sankaranarayanan et al. (2009) built a tool called TwitterStand with the goal of collecting and diffusing breaking news more quickly than conventional news media. This system performs online clustering on filtered tweets from a set of manually selected seeders, that is, users who usually post news. In addition, it performs periodic checks to avoid fragmentation and ensure minimal duplication of clusters, i.e., topics. Also, it takes advantage of information in the content of the tweet and/or the user's profile to associate topics with geographic locations. The authors believe that if tweets belonging to a certain cluster mostly come from one location or a set of close locations, then the topic of that cluster is likely to pertain to that geographical area (Sankaranarayanan et al., 2009).

Zhao et al. (2011) used a corpus of news articles from the newspaper The New York Times (NYT) and tweets from Edinburgh, gathered from November 11, 2009 to February 1, 2010, to investigate how similar the topics on Twitter and in a traditional news source are. Their results showed some differences regarding the most frequent categories and types of topics: Twitter users tweet the most about family and life, a category not covered by the NYT; arts is a topic similarly frequent on both Twitter and the NYT; world is much more frequent in the NYT; lastly, while long-standing topics have an equally strong presence, the same does not happen for entity-oriented and event-oriented topics, with Twitter favouring the former and the NYT the latter (Zhao et al., 2011). Regarding long-lasting topics, there is evidence that their prevalence is not due to an increasing number of users tweeting about them, but to a set of important users who discuss them over time (Kwak et al., 2010).

The above findings highlight relevant aspects of the similarities between Twitter and conventional news sources. In particular, they emphasize the importance of a certain type of user in fostering the role of social networks as a news medium. Still, while the reputation and popularity of users is significant for the level of certainty in the network regarding new information (Hu et al., 2012), it is the communication structure built upon follower/followee relationships that renders Twitter such a fast information diffusion network. In fact, this propagation may in some cases not depend entirely on the first user's network: Kwak et al. (2010) have found that if a message is retweeted, it quickly reaches an average of 1000 users, regardless of the first user's number of followers. This is what the authors call 'the emergence of collective intelligence', in the sense that individuals decide what information is good enough to spread and, once that decision is made, it almost instantly reaches a massive audience (Kwak et al., 2010).

2.2 Text mining

2.2.1 Definition and applications

Text mining is the process of retrieving useful and meaningful information from text.

As in data mining, it generally does so by identifying relevant patterns.

Unstructured information is undoubtedly abundant: a survey of data management professionals conducted in 2006 revealed an average of 31% unstructured plus 22% semi-structured data over the entire organization (Russom, 2007). In this context, text mining has grown considerably. It is applied in a variety of fields, such as analysis of patents, discovery of protein interactions, categorization of news stories, spam filtering and identification of industry trends for corporate finance, to name a few (Hotho et al., 2005; Feldman and Sanger, 2006).

2.2.2 The case of unstructured text data

Even though the goal of text mining is conceptually similar to that of data mining in general, the unstructured format of the data demands more effort in the pre-processing stage. In data mining, data are often extracted from databases where the information is organized in records; text data, by contrast, lack such structure and are therefore not ready to be used by common data mining algorithms.

It should be noted that text data can have some form of structure. Some documents are customarily written following a specific format. For instance, a news article usually has the following elements: headline, byline (author and date), lead paragraph, body and conclusion. This type of text data can be classified as weakly structured data. Other documents that are constructed in a format to facilitate transmission (e.g. email, JSON) are considered semi-structured. However, the fact remains that even these types of structures are not adequate to feed into a data mining algorithm as is. The infinite possibilities of news headlines or email recipients (the shortest components in the examples) cannot per se allow for comparison and pattern recognition. It is necessary to apply a representation model to transform text data into an input that can be processed by machine learning algorithms.

Additionally, handling textual information requires an understanding of natural language, which is why many text mining techniques borrow knowledge from other fields of study, particularly information retrieval (obtaining documents that answer a certain query), information extraction (extracting specific information from documents) and computational linguistics (Feldman and Sanger, 2006). So, text mining lies at the intersection of techniques from these related research areas and the methods and algorithms of data mining (Hotho et al., 2005).

2.2.3 Document, features and the representation model

Feldman and Sanger (2006) define a document as a unit of discrete textual data within a collection that usually, but not necessarily, correlates with some real-world document, such as a business report, legal memorandum, e-mail, research paper, manuscript, article, press release, or news story. In text mining, a document collection is also known as a corpus.

Each document must be represented in a structured format. That transformation implies identifying the features that best characterise the document and ease the operations of text mining algorithms. According to Feldman and Sanger (2006), features are usually characters, words, terms or concepts. The difference between words and terms is that the latter can include multi-word expressions such as 'Mother Teresa' and 'European Union'. Concepts, in turn, represent a number of terms with similar or related meaning and may include words not present in the original documents.

The bag-of-words representation of documents uses these features without taking into account the order in which they appear in each document. A document is represented by its set of features, and when each of these features has a value for that document, the document is represented by a feature vector. This is also known as the vector space model.
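A minimal sketch of the bag-of-words idea follows, assuming whitespace-separated words as features and raw counts as values. The toy documents and the helper names build_vocabulary and to_vector are hypothetical.

```python
from collections import Counter

# Toy corpus, already lower-cased and stripped of punctuation.
docs = [
    "brexit vote divides europe",
    "europe football cup final",
    "final brexit vote",
]


def build_vocabulary(docs):
    """Sorted list of the distinct words found in the corpus."""
    return sorted({word for doc in docs for word in doc.split()})


def to_vector(doc, vocabulary, binary=False):
    """Feature vector of a document: one value per vocabulary word."""
    counts = Counter(doc.split())
    if binary:
        return [1 if counts[word] else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]


vocabulary = build_vocabulary(docs)
matrix = [to_vector(doc, vocabulary) for doc in docs]  # one row per document
print(vocabulary)
print(matrix)
```

Because word order is ignored, 'final brexit vote' and 'vote brexit final' map to the same vector.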

Therefore, once the features have been produced, each document can be represented by a feature vector, and a corpus by a document-term matrix. The values of the vector or matrix depend on the weight attributed to the feature in the document and/or corpus. This weight may be binary, i.e., equal to one if the feature is present in the document and zero otherwise, or else dependent on the frequency of the term in the document and/or corpus. In this area, the most widely used representation format is the TF-IDF weighting, in which the weight of the term t in the document d is given by

$$\text{TF-IDF}(t, d) = \text{TermFreq}(t, d) \cdot \log\left(\frac{N}{\text{DocFreq}(t)}\right)$$

where TermFreq(t, d) is the absolute frequency of t in d, N is the number of documents in the collection and DocFreq(t) is the number of documents containing t. Note that this weighting scheme penalizes terms that are too frequent in the corpus through the IDF component (the second factor).
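The weighting can be sketched in a few lines of Python. This is an illustration under stated assumptions: DocFreq(t) is taken to be the number of documents that contain t (so that N/DocFreq(t) is at least 1), the natural logarithm is used because the base is not specified, and the function name tf_idf and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dictionary per document.

    `corpus` is a list of documents, each given as a list of tokens.
    """
    n_docs = len(corpus)
    # DocFreq(t): number of documents in which t occurs at least once.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        term_freq = Counter(doc)  # TermFreq(t, d): absolute frequency
        weights.append({
            term: tf * math.log(n_docs / doc_freq[term])
            for term, tf in term_freq.items()
        })
    return weights


corpus = [["brexit", "vote", "brexit"], ["football", "vote"], ["vote", "europe"]]
weights = tf_idf(corpus)
print(weights[0])
```

In this toy corpus 'vote' occurs in every document, so its weight is zero everywhere, whereas 'brexit' is both frequent in the first document and rare in the corpus and therefore receives the highest weight.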

The document-term matrix is hence the structured representation of the corpus of

documents, allowing comparisons between documents and enabling pattern search.

2.2.4 Pre-processing

Textual data has two major issues: high feature dimensionality and feature sparseness.

For any given document collection, the number of possible features is normally very high. Consequently, the data becomes very sparse. This may be costly for two main reasons: (1) the increased space hinders computational performance; and (2) as the number of variables becomes larger, possibly larger than the number of observations, the performance of some algorithms tends to degrade. Thus, it is convenient to keep only those features deemed informative and most representative of the document collection.

Some common pre-processing tasks are:

• Tokenization. A token is a meaningful constituent of a text stream (e.g. a chapter, section, paragraph, sentence, word) (Feldman and Sanger, 2006). After the removal of punctuation marks, the goal is usually to split each document into words.

• Parts-of-Speech (POS) Tagging. The tags article, noun, verb, adjective, preposition, number and proper noun, among others, enrich the textual data by marking the syntactic value of each word.

• Filtering. Removing words that add no meaning to the document because they appear in almost every document (e.g. prepositions, conjunctions and articles, also known as stop words, i.e., the most common words in a language) is one way to reduce feature dimensionality; there are other, more sophisticated feature selection techniques based on, for example, the document frequency of the word/term or, in the case of classification problems, the information gain of the word/term.

• Stemming. A stem is the basic form of a word: its root, or its root with some suffix or prefix. The most widely used stemmer for the English language is the Porter Stemmer (Porter, 1980).

• Lemmatization. A lemma is the dictionary form of a word (e.g. the lemma of `are' is `be').
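The tasks above can be sketched as a small pipeline. The snippet below is only an illustration in Python: the stop word list is a tiny assumed sample, and `naive_stem` is a crude hypothetical stand-in for a real stemmer such as Porter's, not an implementation of it.

```python
import re

# Tiny illustrative stop word list (assumed; real lists are much longer).
STOP_WORDS = {"the", "a", "of", "in", "and", "is", "are"}

def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Tokenization: lower-case, keep alphabetic runs (drops punctuation and digits)
    tokens = re.findall(r"[a-z]+", text.lower())
    # Filtering: remove stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Stemming: reduce each remaining token to a rough stem
    return [naive_stem(t) for t in tokens]

result = preprocess("The ministers are discussing the budgets of 2016.")
print(result)  # ['minister', 'discuss', 'budget']
```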

2.2.5 Mining tweets

A very important aspect to consider is that short segments of text, such as tweets, exacerbate the problem of sparseness in the dataset, and strategies relying on exact word matching may be inadequate (Aggarwal and Zhai, 2012).

For that reason, some authors recommend the use of extra information (Genc et al., 2011; Sriram et al., 2010; Aggarwal and Zhai, 2012). The approach of Genc et al. (2011) was to map each tweet to the closest Wikipedia page and take the distance between these pages as the distance between two tweets. Their argument is that both tweets and Wikipedia pages are constructed by humans and that the categorization of Wikipedia pages, in particular, echoes how our brains link semantic structures.

Sriram et al. (2010), on the other hand, used the standard bag-of-words construct plus eight additional features to categorize tweets and confirmed the valuable contribution of extra information. These added features were, namely, the author, the presence of abbreviations or slang, opinioned words, any currency or percentage symbols, emphasis on words, mentions at the start and mentions within the tweet.

2.3 Clustering

2.3.1 Introduction

Aggarwal and Zhai (2012) define the clustering problem as that of finding groups of similar objects in the data, with the objective of obtaining high inner-group similarity and high inter-group dissimilarity. There are many categories of clustering algorithms (Gama et al., 2015), some of which are presented below.

Connectivity-based clustering algorithms, also known as hierarchical, consider that neighbouring objects are more similar, while more distant objects should appear in separate clusters. Thus, objects are linked based on their distances, and a dendrogram of the resulting hierarchy can be drawn.

The approach to link the objects can be either agglomerative, where all objects

are initially separated in isolated clusters and successively grouped based on smallest

distances; or divisive, where one big cluster containing all objects is initially formed

and successively partitioned into smaller groups based on greatest distances.

The linkage method (or aggregation index) determines how to assess the distances between objects and/or groups: the complete linkage considers the distance between the furthest elements of each group; the single linkage, the distance between the nearest elements of each group; the average linkage, the average dissimilarity between the elements of each group; the centroid linkage, the distance between centroids; and Ward's linkage, the increase in within-class inertia (dispersion) when two groups are merged.

Since the final result is a hierarchy, to get the final clusters of objects it is necessary to apply a cut-off criterion to the tree-like structure.

In centroid-based clustering algorithms, the distances are computed to a cluster centre, either a specific data point (medoid) or a central vector, and clusters are improved in an iterative process. The number of clusters needs to be set at the start. Initial centres can be randomly picked or user-chosen, and the idea is to allocate each of the data points to the closest centre, recalculate cluster centres and repeat, until no further improvements are possible.

Albeit efficient, these types of algorithms can converge to a local optimum, and so it is important to execute them more than once with different seeds and evaluate the results. The most widely known and still used algorithm of this category is k-means (MacQueen, 1967; Jain, 2010).

Other categories of algorithms include density-based clustering algorithms, where clusters correspond to regions with high densities of objects, separated by regions with low densities of objects (e.g. DBSCAN (Ester et al., 1996)), and distribution-based clustering algorithms, where clusters are formed based on the probability of their members following the same distribution (e.g. clustering using the Expectation Maximization algorithm (Dempster et al., 1977)).

For distance-based algorithms, a measure of (dis)similarity is needed, such as the Euclidean, Manhattan, Chebyshev or Mahalanobis distances. In the case of text data, the most widely used is the cosine similarity, which, unlike the Euclidean distance, for example, does not take into account the magnitude of the vector representing the document, but only its direction. This is particularly important when working with collections of documents of variable length, in which the weight of a particular term may be larger not because it appears relatively more often in a document than in any other, but simply because that document is longer.

According to Gama et al. (2015), following the clustering analysis there is an evaluation stage, where an objective assessment is made regarding the significance of the resulting clusters and whether the number of clusters is adequate. However, the authors recognize that it is still an open area, and make the following remarks:

• There is no universal clustering algorithm capable of capturing every possible underlying data structure;

• If the goals of two clustering algorithms are different, it may not make sense to compare their results;

• The knowledge of the data and its domain is paramount for determining not only what data transformations are necessary but also the most adequate similarity measures and the properties inherent to each clustering algorithm.

Nevertheless, there are three types of criteria to evaluate the quality of a clus-

tering (Gama et al., 2015):

1. Relative criteria, focusing on measuring which algorithm fits best to the data or on the most adequate number of clusters for a given algorithm. For example, the intra-cluster variance measures how compact the clusters are, while the connectivity assesses the degree to which neighbouring objects are positioned in the same cluster.

2. Internal criteria, focused on determining to what degree the partitioning represents the inherent data structure. For example, the Gap statistic compares the total intra-cluster variation with the expected value under a null reference distribution (i.e., a distribution with no obvious clustering).

3. External criteria, with the goal of evaluating how well the clustering obtained confirms a given pre-specified hypothesis. For example, the Jaccard index determines the probability of two objects of the same cluster also being clustered together in a different clustering scheme.
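To make the external criterion concrete, the following sketch computes a pair-counting Jaccard index between two partitions of the same objects. It assumes the common pair-based definition (pairs grouped together in both partitions, divided by pairs grouped together in at least one); the function name and the toy label vectors are hypothetical.

```python
from itertools import combinations

def pair_jaccard(labels_a, labels_b):
    """Pair-counting Jaccard index between two partitions of the same objects."""
    both = only_a = only_b = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1      # pair together in both partitions
        elif same_a:
            only_a += 1    # together only in the first partition
        elif same_b:
            only_b += 1    # together only in the second partition
    total = both + only_a + only_b
    return both / total if total else 1.0

obtained = [0, 0, 1, 1, 2]    # clustering under evaluation (toy labels)
reference = [0, 0, 1, 2, 2]   # pre-specified reference partition (toy labels)
score = pair_jaccard(obtained, reference)
```

Here one pair is grouped together in both partitions and two pairs in only one of them, so the index is 1/3.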

By interpreting the resulting classes, one can also validate the clustering results,

especially with the support of a domain expert. The interpretation may include

some form of labelling, computation of mean values or visual representations of the

clusters.

Regarding the challenges in data clustering, Jain (2010) notes the importance of building benchmark data from different domains to test and evaluate clustering algorithms. Also, as no clustering algorithm clearly outperforms all others in every domain, it is suggested that algorithms should be designed and used according to the application needs. Finally, the rise of semi-supervised methods is encouraged, as the user's domain expertise on pair-wise must-link and/or cannot-link constraints is still very important for good quality clusters.

2.3.2 Text clustering

The additional challenges of clustering text, rather than quantitative or even categorical data, are the following (Aggarwal and Zhai, 2012):

• High dimensionality and sparseness: on the one hand, the range of possibilities for words present in a document (i.e. the glossary of the document collection) is extremely high; on the other hand, each document often has a relatively low number of them.

• Word correlations: even though the lexicon is large, there are generally many

words relating to a single concept.

• Variable document length: the representation of such items should take into

account a normalization process.

To face these challenges, a series of pre-processing tasks are usually employed before

the clustering process, some of which were detailed in section 2.2.4.

In addition to the feature selection techniques already discussed, as Aggarwal and Zhai (2012) emphasize, dimensionality reduction can also be achieved through feature transformation methods, an example of which is Latent Semantic Indexing (Deerwester et al., 1990). This technique represents the documents in a new (smaller) feature space where the final features are a linear combination of the original features, thus eliminating noisy dimensions from the data (synonymy and polysemy) and enhancing its semantic value. This is particularly valuable in the context of text clustering.

Clustering techniques based on distances use similarity measures to evaluate how close or far apart the objects are. Huang (2008) performed an experiment on seven different datasets, of which four comprised newsgroup posts or newspaper articles. Her results showed that the cosine similarity, Pearson's correlation coefficient and the averaged Kullback-Leibler Divergence clearly outperformed the Euclidean distance in all datasets in terms of entropy and purity. However, this experiment was conducted using the k-means algorithm, which belongs to the partitioning family of clustering algorithms, and therefore conclusions should be drawn only for this type of clustering.

A compromise between the robustness of hierarchical clustering methods and the efficiency of partitioning clustering methods involves the use of a hybrid approach. In Cutting et al. (1992), such an approach is used in order to provide an efficient interactive experience in document browsing. Specifically, the authors discuss two techniques, buckshot and fractionation, to find the initial centres to feed to the partitioning clustering algorithm. The former consists of taking a sample of $\sqrt{k \cdot N}$ documents and performing hierarchical clustering to find k centres that are more robust than randomly chosen ones. The latter implies dividing the corpus into N/m buckets of size m > k, applying hierarchical clustering in each of them, and then using the cluster centres to reapply the clustering routine iteratively until k clusters have been obtained. Both techniques provide better seeds for a more computationally efficient algorithm, such as k-means, to begin with when clustering the complete and larger document collection. In this study, the buckshot technique was applied.

There are other types of text clustering techniques, such as probabilistic cluster-

ing, of which topic modelling is an example (Aggarwal and Zhai, 2012). In this case,

each document and each term of the lexicon has a probability of belonging to one

of k topics. This also makes topic modelling a clustering technique that determines

document clusters and word clusters at the same time, following the notion that

good clusters of words are indicative of good clusters of documents.

Chapter 3

Methodology

3.1 Motivation

Journalism and social media have become more intricately interconnected. Tradi-

tionally, people resort to mainstream media to know what is happening in the world.

However, this dynamic has been changing in recent years, at least for some news

topics, due to the fact that a great proportion of the world population has access to

platforms which broadcast real time events to an equally worldwide audience. Con-

sequently, the beginning of the process of news generation and dissemination has in

some cases changed from journalists to the general public. It has been recognised

in various studies (Newman, 2009; DVJ Insights and ING The Netherlands, 2015)

that journalists spend much of their time scouting social media for interesting topics to write about, treating these platforms as reliable sources.

It is therefore relevant to study how events or news come about on these two types of platforms (news articles and social posts), particularly whether and how they similarly arise, disseminate, gain strength and die. The goal of this dissertation is to characterize the news or events published by news sources and/or commented on and shared on social media during a period of six months, focusing on the timing of their generation and the intensity with which they are mentioned on each platform, through the use of text mining techniques. The next section describes the empirical steps taken to achieve this goal and the following sections provide more details regarding each of them.

3.2 Four main stages of the method

The empirical process used in this study can be observed in Figure 3.1 and is sum-

marized below.

1. Data gathering: obtaining the data. On the news side, on-line news articles

were used; on the social media side, tweets were used (see section 3.3).

2. News clustering: forming groups of similar news (see section 3.4).

3. Assignment of tweets to news clusters: allocating tweets about the same

stories to the groups of news obtained (see section 3.5).

4. Analysing the resulting groups of news articles and tweets: tempo-

rality between news and tweets (see section 3.6).

Figure 3.1: Four main stages of the method

3.3 Data

Both tweets and on-line news articles were provided by the POPmine platform developed in SAPO Labs, a partnership between an internet services and products provider and an academic institution, namely the University of Porto. This platform gathers on a continuing basis tweets of approximately 100 thousand Portuguese users and news articles of over 40 Portuguese press sources (Saleiro et al., 2015).

The data were collected during the year of 2016. In total, there are more than 600

thousand news articles and almost 38 million tweets, distributed somewhat linearly

across the year.

For computational reasons, the data time frame used in this study was reduced from one year to one semester. Both types of documents (news articles and tweets) were received in text files in the JSON format¹. The first task was to read and import the relevant components into a database, in order to ease data cleaning, transformation, access and analysis. The ETL² process was particularly important in the case of tweets, since each tweet can have multiple components recorded in a nested structure, which would be extremely difficult to read and use as is (see example in Appendix A). Figure 3.2 shows this setup process, as well as the tools utilized for information extraction, transformation, loading and storage/management.

Figure 3.2: Data setup

¹ JSON stands for JavaScript Object Notation. It is a file format for transmitting data that uses attribute-value pairs and array data structures. See json.org for more information.

² ETL: extract, transform, load.

The following components were gathered and imported into the database:

• News articles: article id (integer), title (string), body (string), date of publi-

cation (timestamp), source (string), url (string).

• Tweets: tweet id (integer), text (string), date of posting (timestamp), user (string), URLs shared (string).

The next stages were implemented in R (R Core Team, 2017) and rely on the information stored in this database, accessed through the RODBC package (Ripley and Lapsley, 2017).

3.4 News clustering

The same story or event can be published in many different articles and shared or commented on in many tweets. It is therefore first necessary to identify such stories. The news articles were chosen as a base for that identification in preference to tweets, as tweets are particularly prone to being about personal life rather than news topics. Tweets are also more difficult to mine due to the shortness of the text, as seen in the previous chapter.

To obtain stories or groups of similar stories of the first semester of 2016, clustering techniques were applied to a sample of on-line news articles.

3.4.1 Sample

The sample construction considered the importance of having, on the one hand, news

articles that were shared on social media, and, on the other hand, news articles that

were not shared on social media.

It is possible to know if a news article was shared on Twitter by looking up its URL in the tweets information collected. The shared articles are necessary so that, in the following stage, the assignment of tweets to the clusters can be evaluated. We assume that if a tweet contains a link to a news article then that tweet is about the same story as that news article. However, if only these news articles were included in the sample, it would be extremely likely that, when analysing the final clusters comprised of both news articles and tweets, the following conclusion would be drawn: the story was first brought up by the press, because if a link was shared, the tweet necessarily came afterwards.

By also including news articles that were not shared on Twitter, we allow news articles published later to enter the cluster. Additionally, if there are clusters of news articles with no similar tweets assigned to them, it is possible to identify stories that are not as relevant on social media, which is useful to gain insights into the least talked about topics in Portuguese society.

3.4.2 Pre-Processing

Standard pre-processing techniques were applied to the news articles, including the

ones listed below. We do not further discuss these techniques, as they were described

in Chapter 2.

• Tokenization

• Lower case conversion

• Punctuation and numbers removal

• Portuguese stopwords removal

• Repeated expressions removal

• Stemming

These tasks were performed in R (R Core Team, 2017) using the following packages:

tm (Feinerer and Hornik, 2017) and SnowballC (Bouchet-Valat, 2014). Examples of

articles pre-processing are given in Table 4.3, in Chapter 4.

Tagging

Additionally, as a feature selection technique, terms were tagged according to their syntactic value (parts-of-speech) and any term not classified as a verb, noun or proper noun was discarded. The recognition of named entities in the text of news articles was also included. The reason behind this step is that the purpose of clustering is to find groups of stories or events and it is expected that the inclusion of named entities, such as personalities, names of events and locations, as features will help the representation and consequent pattern recognition in that respect.

For parts-of-speech tagging, the resources used were the openNLP and the NLP packages (Hornik, 2016, 2017). For named entity recognition, we used the PAMPO package (Rocha, 2016). The PAMPO method (Rocha et al., 2016) for named entity extraction was built for the Portuguese language and is based on two algorithms: the first generates candidates by gathering common named entity terms, such as capitalized words and personal titles; the second performs a candidate selection based on parts-of-speech tagging. Performance results on a Portuguese news corpus were a recall of 0.91, a precision of 0.959 and an F1 score of 0.932.

Representation

The news corpus was then represented in a document-term matrix (dtm), with

normalized TF-IDF weights, according to the following formula:

$$tf_{i,j} \cdot idf_i = \frac{tf_{i,j}}{\sum_k n_{k,j}} \cdot \log_2 \frac{|D|}{|\{d \mid t_i \in d\}|}$$

where $tf_{i,j}$ is the absolute frequency of term $t_i$ in document $d_j$, $\sum_k n_{k,j}$ is the sum of the absolute frequencies of all terms k in document $d_j$, $|D|$ is the total number of documents in the collection, and $|\{d \mid t_i \in d\}|$ is the number of documents in the collection where $t_i$ appears.

Dimensionality reduction

For dimensionality reduction purposes, the level of sparseness of the dtm was set to 0.98, which means that if a feature was not present in at least 2% of the documents, it was discarded. The 0.98 level guarantees that no documents were left with only zero entries, while significantly reducing the number of features.
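The sparseness threshold can be illustrated with a short Python sketch. It follows the rule as stated in the text (keep a term only if it occurs in at least a minimum fraction of documents) and is not the implementation of the R packages used in the study; the function name, the `min_doc_fraction` parameter and the toy matrix are assumptions. A sparseness level of 0.98 corresponds to `min_doc_fraction=0.02`.

```python
from collections import Counter

def drop_rare_terms(docs, min_doc_fraction=0.02):
    """Discard terms present in fewer than `min_doc_fraction` of the documents.

    docs: list of {term: weight} dicts holding only non-zero entries (assumed format).
    """
    n_docs = len(docs)
    # Iterating over a dict yields each of its terms once, so this counts documents
    doc_freq = Counter(term for doc in docs for term in doc)
    keep = {t for t, df in doc_freq.items() if df / n_docs >= min_doc_fraction}
    return [{t: w for t, w in doc.items() if t in keep} for doc in docs]

docs = [
    {"governo": 0.2, "greve": 0.5},
    {"governo": 0.1, "futebol": 0.7},
    {"governo": 0.3, "greve": 0.4},
    {"governo": 0.2},
]
# A high threshold is used here only because the toy corpus has four documents.
reduced = drop_rare_terms(docs, min_doc_fraction=0.5)
```

In this toy case "futebol" appears in one document out of four and is dropped, while "greve" (two out of four) is kept. Checking afterwards that no document is left empty mirrors the guarantee mentioned above.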

The final document-term matrix was built using the tm and RWeka packages (Feinerer and Hornik, 2017; Witten and Frank, 2005; Hornik et al., 2009).

3.4.3 Clustering

The clustering experiment conducted included hierarchical and k-means clustering.

Hierarchical clustering

The hierarchical clustering algorithm used is of an agglomerative nature, which

means each observation (news article) is isolated at the start in a cluster of its

own; then, clusters are iteratively joined according to greatest similarities, until

there is one single cluster. So, at each iteration, there is a need to compute a

similarity (or dissimilarity) value between each of the current clusters. In this case,

the dissimilarity measure used was the Euclidean distance, which is given by:

$$d(v_1, v_2) = \sqrt{\sum_{f=1}^{p} (v_{1f} - v_{2f})^2}$$

where $v_1$ and $v_2$ are the feature vectors representing each element to compare and p is the total number of features. The values of the feature vectors are the TF-IDF values subject to a normalization at the document level, which means that the length of the document will not influence its distance to a document of different (larger or smaller) length.

Regarding the linkage method, i.e., how the dissimilarity between two clusters is determined, Ward's index was chosen (Ward Jr, 1963). This means that the dissimilarity is computed as the increase in inertia when the two clusters are joined.

K-means clustering

The k-means algorithm (MacQueen, 1967) starts by selecting a set of k initial centres from the observations (in this case, news articles) and assigning each of the remaining observations to its closest centre, immediately updating the centre of the chosen cluster to its mean point. Once every remaining observation is allocated to a cluster centre, the solution is optimized by repeating the assignment operation for each of the observations, until convergence is achieved, that is, until there is no change in the cluster centres. The selection of the initial centres can be random or user-defined.
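The iterative refinement can be sketched in a few lines of Python. Note that this is the batch variant (all observations are reassigned before the centres are recomputed), swapped in for brevity; it is not the incremental update described above, nor the R implementation used in this study. Function names and toy data are hypothetical.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean_point(points):
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def nearest_centre(point, centres):
    return min(range(len(centres)), key=lambda h: squared_distance(point, centres[h]))

def k_means(points, centres, max_iter=100):
    """Batch k-means: assign all points, recompute centres, repeat until stable."""
    centres = [list(c) for c in centres]
    for _ in range(max_iter):
        clusters = [[] for _ in centres]
        for p in points:
            clusters[nearest_centre(p, centres)].append(p)
        # An empty cluster keeps its previous centre.
        new_centres = [mean_point(c) if c else centres[h]
                       for h, c in enumerate(clusters)]
        if new_centres == centres:  # convergence: no change in the centres
            break
        centres = new_centres
    labels = [nearest_centre(p, centres) for p in points]
    return centres, labels

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
# User-defined initial centres: the first and third observations
centres, labels = k_means(points, centres=[points[0], points[2]])
```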

For this type of algorithm, the number of clusters (k) needs to be set at the start.

To determine the number of clusters, two methods were used: the representation of

the aggregation indices of the hierarchical clustering and the representation of the

explained inertia. The latter is the between-class dispersion (B), measured as the

sum of squared distances of the cluster centres (centroids) to the centre of gravity

g, divided by the total dispersion of the data (T ), measured as the sum of squared

distances of every observation to the centre of gravity:

$$B = \frac{1}{n} \sum_{h=1}^{k} n_h \, d^2(g_h, g) \qquad T = \frac{1}{n} \sum_{i=1}^{n} d^2(I_i, g)$$

The explained inertia (B/T) naturally increases with the number of clusters, and the goal is to find the value of k beyond which the marginal increase starts to diminish (the elbow method) (Bholowalia and Kumar, 2014). The same rationale applies for the aggregation indices, in reverse.
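As a hedged illustration of the explained inertia, the Python sketch below computes B/T for a given partition, with $n_h$ taken as the size of cluster h and $g_h$ as its centroid. The helper names and the toy data are assumptions made for the example.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def centroid(points):
    return [sum(coords) / len(points) for coords in zip(*points)]

def explained_inertia(points, labels):
    """Ratio B/T of between-class dispersion to total dispersion."""
    n = len(points)
    g = centroid(points)  # centre of gravity of all observations
    total = sum(squared_distance(p, g) for p in points) / n
    between = 0.0
    for h in set(labels):
        members = [p for p, label in zip(points, labels) if label == h]
        between += len(members) * squared_distance(centroid(members), g)
    between /= n
    return between / total

points = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
ratio_k2 = explained_inertia(points, [0, 0, 1, 1])  # two clusters
ratio_k4 = explained_inertia(points, [0, 1, 2, 3])  # one cluster per point
```

With two clusters the ratio is 0.8; with one cluster per observation it reaches 1, which is why the elbow of the curve, rather than its maximum, is used to choose k.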

Hybrid clustering

The final clustering method chosen, whose results are presented in the next chapter, was of a hybrid nature, following in the footsteps of Cutting et al. (1992), namely the buckshot method: choose the initial centres for k-means clustering by performing hierarchical clustering on a sample of $\sqrt{k \cdot N}$ news articles and computing the resulting k centroids. This way, the described methodology can still be applied to a larger sample of news articles without losing either the robustness of hierarchical clustering or the efficiency of k-means clustering.
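A minimal Python sketch of the seed selection step is given below, under several assumptions: the sample size is rounded up, the agglomeration uses a naive pairwise search with Ward's merge cost, and the function names are hypothetical. It only illustrates the idea and is not the R code used in this study; the returned centroids would then be passed as initial centres to k-means.

```python
import math
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def centroid(points):
    return [sum(coords) / len(points) for coords in zip(*points)]

def ward_cost(a, b):
    """Increase in within-cluster inertia caused by merging clusters a and b."""
    return (len(a) * len(b) / (len(a) + len(b))
            * squared_distance(centroid(a), centroid(b)))

def buckshot_seeds(points, k, seed=0):
    """Pick k initial centres by agglomerating a random sample of about sqrt(k * N) points."""
    rng = random.Random(seed)
    sample_size = min(len(points),
                      max(k, math.ceil(math.sqrt(k * len(points)))))
    clusters = [[p] for p in rng.sample(points, sample_size)]
    while len(clusters) > k:
        # Naive search for the cheapest pair to merge (acceptable for a small sample)
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
          [20.0, 0.0], [20.0, 1.0], [21.0, 0.0]]
seeds = buckshot_seeds(points, k=3)  # agglomerates a sample of 6 of the 9 points
```

Because the sample is random, the seeds depend on the `seed` argument; with an unlucky draw a group may be missing from the sample, which is one reason to inspect the resulting clusters.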

The tasks of hierarchical clustering and k-means clustering were implemented

using the factoextra (Kassambara and Mundt, 2017) and ClustGeo (Chavent et al.,

2017) packages.

Cluster labelling

Each cluster was labelled with keywords selected from the dictionary of features produced at the end of the pre-processing stage. For each cluster centroid, the terms with the highest TF-IDF values were considered to be the most representative of that cluster. The number of keywords per cluster depended on the cluster size and was set in the following manner: (i) determine the minimum number of keywords in a cluster of size one; (ii) increase the number of keywords using a logarithmic growth, in order to capture the lexicon variety in larger clusters but keep the characterization limited to a relatively small set of keywords.

Let |Ck| be the number of documents in cluster k and m the minimum number

of keywords in a cluster of size one. The number of keywords of cluster k is:

$$W_k = \log_2(|C_k| + 1) \cdot m$$
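The growth of the number of keywords with cluster size can be checked with a short Python sketch. The formula is the one above; rounding the result to an integer and the helper `top_keywords` are assumptions made for illustration, since the text does not state how fractional values were handled.

```python
import math

def keyword_count(cluster_size, m):
    """W_k = log2(|C_k| + 1) * m, for a cluster with `cluster_size` documents."""
    return math.log2(cluster_size + 1) * m

def top_keywords(centroid_weights, cluster_size, m=3):
    """Select the highest-weighted centroid terms as cluster keywords."""
    n_keywords = round(keyword_count(cluster_size, m))  # rounding rule is an assumption
    ranked = sorted(centroid_weights, key=centroid_weights.get, reverse=True)
    return ranked[:n_keywords]

sizes = [1, 7, 1023]
counts = [keyword_count(size, m=3) for size in sizes]
example = top_keywords({"a": 0.1, "b": 0.5, "c": 0.3, "d": 0.05},
                       cluster_size=1, m=2)
```

With m = 3, a cluster of one document gets 3 keywords, a cluster of 7 gets 9 and a cluster of 1023 gets 30, so the label size grows slowly even for very large clusters.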

Cluster keywords were visually represented using the wordcloud package (Fel-

lows, 2014).

3.5 Assignment of tweets to clusters

Once news articles clusters are formed, it is possible to add tweets about the same

stories. To determine if a given tweet should be assigned to a cluster, we used a

measure of similarity between that tweet and the cluster centroids. Next, we describe

the methodology for this assignment.

3.5.1 Sample

All tweets with links to a clustered news article were included in the sample, thus allowing a later evaluation of the assignment method utilized. The sample size was then doubled by including tweets with no link to a clustered news article. Since there were over 19 million tweets to choose from, a tweet with no news article URL was only eligible for selection if it contained at least two terms from the set of keywords representing the clusters.

The reason to include this second set of tweets is straightforward: if only tweets with links to news articles were used, every story would be found to be first talked about by the press, since those tweets can only exist after the shared URL exists and, if the URL exists, an article has been published.

3.5.2 Pre-processing

Similarly to the news articles, every tweet was subject to the pre-processing tech-

niques listed in section 3.4.2.

Representation

The corpus of tweets was transformed into a document-term matrix (dtm), using the dictionary of features from the news articles dtm. The dtm also included the cluster centroids as documents, so that the TF-IDF weights later used to compute the similarity between each tweet and centroid were not based solely on the tweets corpus.

3.5.3 Assignment to clusters

Each tweet was assigned to the cluster whose centroid was closest based on the cosine similarity. This similarity measure is commonly used in text mining (Huang, 2008), particularly because it is not influenced by the size of the documents.

Let v1 and v2 be two non-zero vectors, which in this case represent a given tweet

and a certain centroid. The cosine similarity between these vectors is:

$$\cos(\theta) = \frac{v_1 \cdot v_2}{\|v_1\| \cdot \|v_2\|}$$

If it is equal to zero, the vectors are orthogonal, which means that there is no similarity between the two documents. If it is equal to one, the angle between the two vectors is zero and hence the similarity between the two documents is maximal.
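The assignment rule can be sketched as follows in Python. The vectors are toy values over a hypothetical three-term dictionary, and `rank_clusters` is an illustrative helper that orders the centroids by decreasing cosine similarity, the first being the assigned cluster.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_clusters(tweet_vector, centroids):
    """Return cluster indices ordered from most to least similar centroid."""
    sims = [cosine_similarity(tweet_vector, c) for c in centroids]
    return sorted(range(len(centroids)), key=lambda h: sims[h], reverse=True)

centroids = [[0.9, 0.1, 0.0],   # toy centroid of cluster 0
             [0.0, 0.2, 0.8]]   # toy centroid of cluster 1
tweet = [0.0, 0.0, 3.0]         # tweet using only the third term
ranking = rank_clusters(tweet, centroids)
assigned = ranking[0]
```

The tweet shares no terms with the first centroid (similarity zero) and is therefore assigned to cluster 1, regardless of the magnitude of its vector.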

3.5.4 Evaluation

Because all the data used in this study are unlabelled, results cannot be directly evaluated. The methodology described so far has addressed this issue by including tweets with links to clustered news articles. It can be assumed that if a tweet shares a specific news article, it should belong to the same cluster. So, we consider that class as the real class of the tweet and compare it with the results of the assignment based on similarity to cluster centroids.

Accuracy, precision, recall and F1 measures were used to evaluate these results. Accuracy is the proportion of correctly assigned tweets (true positives) among all observations evaluated. Precision evaluates the positive predictions, that is, the percentage of true positives among the positive predictions of a given class. On multi-class problems, the macro precision can be computed as the average of per-class precisions. Recall assesses the percentage of true positives among the actual members of a given class. The F1 score is the harmonic mean of precision and recall, and is used when false positives and false negatives are considered equally important.
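These measures can be illustrated with a small Python sketch on toy labels. Computing F1 from the macro-averaged precision and recall is one of several conventions and is an assumption here, as is the function name `evaluate`.

```python
def evaluate(true_labels, predicted_labels):
    """Return accuracy, macro precision, macro recall and F1 for a multi-class assignment."""
    pairs = list(zip(true_labels, predicted_labels))
    classes = sorted(set(true_labels) | set(predicted_labels))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)  # true positives of class c
        predicted = sum(p == c for _, p in pairs)      # tweets assigned to class c
        actual = sum(t == c for t, _ in pairs)         # tweets truly in class c
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    macro_p = sum(precisions) / len(classes)
    macro_r = sum(recalls) / len(classes)
    # F1 as the harmonic mean of macro precision and macro recall (assumed convention)
    f1 = 2 * macro_p * macro_r / (macro_p + macro_r) if macro_p + macro_r else 0.0
    return accuracy, macro_p, macro_r, f1

true_clusters = [0, 0, 1, 1]       # cluster of the linked news article
assigned_clusters = [0, 1, 1, 1]   # cluster of the closest centroid
accuracy, macro_p, macro_r, f1 = evaluate(true_clusters, assigned_clusters)
```

In this example three of the four tweets are correctly assigned (accuracy 0.75); class 0 has precision 1 and recall 0.5, class 1 has precision 2/3 and recall 1.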

In addition, we borrowed the concept of precision at n from the information retrieval field. In this context, performance measures consider as true positives the relevant documents among those retrieved from a query. When considering the topmost query results only instead of all the query results, the measure is called precision at n (P@n), where n is the cut-off rank. This is particularly used for web search engines, where the performance of the first results is more important than the overall performance (Schütze et al., 2008).

In the context of this study, we made the following adaptation: each observation

has n predictions, based on the ranked closest news centroids. If the true class of a

tweet is in the n-topmost predictions, it is considered as a true positive. The P@n

is therefore the percentage of observations whose true class is present in the top n

predictions.
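The adapted measure can be sketched in Python as follows; the ranked predictions are toy values standing in for the clusters ordered by similarity to each tweet, and the function name is hypothetical.

```python
def precision_at_n(true_classes, ranked_predictions, n):
    """Share of observations whose true class is among their top-n predictions."""
    hits = sum(true in ranked[:n]
               for true, ranked in zip(true_classes, ranked_predictions))
    return hits / len(true_classes)

true_classes = [2, 0, 1]                 # true cluster of each tweet (toy values)
ranked_predictions = [[2, 1, 0],         # clusters ranked by closeness, per tweet
                      [1, 0, 2],
                      [0, 2, 1]]
p_at_1 = precision_at_n(true_classes, ranked_predictions, 1)
p_at_2 = precision_at_n(true_classes, ranked_predictions, 2)
p_at_3 = precision_at_n(true_classes, ranked_predictions, 3)
```

By construction P@n cannot decrease as n grows, and P@1 coincides with the accuracy of the single closest-centroid assignment.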

3.6 Analysis

Once the groups of similar articles and tweets have been formed, it is possible to study how the underlying events or stories develop over time, and in particular if the role of the press as a story breaker is still indisputable or, on the contrary, if there are some stories that break first on social media. This section discusses how this analysis was conducted.

3.6.1 Timing of events on the news and on social media

Graphical representation

Firstly, the date and time of publication of each document was retrieved from the

database. This allows the representation of the cluster documents on a timeline,

signalling the evolution of the number of articles and tweets in that cluster.

Temporality between news and tweets

In order to get a picture of the temporality between articles and tweets, a second analysis was made. For each news article in a given cluster, the time difference to the median tweet of that cluster was computed. So, each cluster has a number of lag values characterizing it, and those values represent the timing difference between articles and the moment the social network is fully engaged in the discussion of that story. If there is a significant proportion of articles with positive lag values, then it is a strong indicator that, for that particular group, the press had a more important role in the beginning of the discussion; if, on the contrary, there is a significant proportion of articles with negative lag values towards the median tweet, then it is a sign that the discussion is likely to have started on social media.

As in each cluster there is a set of linked articles and tweets, this analysis can be biased towards the first hypothesis, as for these specific documents, every article comes sooner than its linked tweet (a news article URL can only be shared by a tweet if the article has been published). Hence, we further excluded these articles and tweets from this analysis.
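The lag computation can be sketched in Python as below. The sign convention (median tweet time minus article time, so that a positive lag means the article preceded the median tweet) is inferred from the interpretation given above and should be read as an assumption, as should the use of the lower median for an even number of tweets. The timestamps are invented for the example.

```python
from datetime import datetime
from statistics import median_low

def article_lags_hours(article_times, tweet_times):
    """Lag of each article relative to the cluster's median tweet, in hours.

    Positive values: the article was published before the median tweet (assumed convention).
    """
    median_tweet = median_low(tweet_times)  # middle tweet in chronological order
    return [(median_tweet - t).total_seconds() / 3600 for t in article_times]

tweets = [datetime(2016, 3, 1, 10, 0),
          datetime(2016, 3, 1, 12, 0),
          datetime(2016, 3, 1, 14, 0)]
articles = [datetime(2016, 3, 1, 9, 0),
            datetime(2016, 3, 1, 13, 0)]
lags = article_lags_hours(articles, tweets)
share_positive = sum(lag > 0 for lag in lags) / len(lags)
```

In this toy cluster the median tweet is at 12:00, so the first article has a lag of +3 hours and the second of -1 hour, giving a share of positive lags of 0.5.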

Remarks on the analysis made

This analysis was conducted on a subset of clusters, the selection of which considered

both the cluster size and the per-class precision obtained from the previous stage of

tweets assignment.

Additionally, as this is an exploratory analysis, directed at gaining a general

picture of the temporality of news and tweets in Portugal, we emphasize that con-

clusions are meant to be interpreted carefully, as there is no attempt at evaluating

them in this study.

Chapter 4

Temporal Analysis of News and Tweets

The aim of this chapter is to present the empirical results obtained following the

methodology described in the previous chapter. It begins by describing the process

of clustering news articles. Then it presents the results of the assignment of tweets

to those clusters. Finally, the final clusters of news articles and tweets are analysed.

4.1 News clustering

Usually there are many news articles related to the same event or story, either due

to the existence of many news sources that publish it or because it evolves through

time. The goal of clustering the on-line news articles is to segment the published

stories so that articles of the same story or similar stories are grouped together. This

will allow the study of when a story or group of similar stories is brought up by the

press and, later, compare this to what happens on the social media Twitter.

4.1.1 Selection of articles

The on-line news articles provided by the POPmine platform were collected during

2016 and amounted to over 600 thousand items. Figure 4.1 presents the frequency

of the available news articles for this investigation over that year. The monthly

average is 51.7 thousand news articles.

Figure 4.1: Number of news articles in the dataset per month of 2016

For computational reasons, only a sample of news articles was used to test the proposed methodology. Our sample included news articles from the first semester of 2016, which is still a reasonably long period in which to study the timing of stories in the news and social media, and which reduces the size by approximately 50%.

First attempts at news clustering revealed the prevalence of sports-related content, which can be confirmed in Table 4.1: 58% of the articles are from generalist news sources, 28% from sports news sources, 8% from economics news sources, 3% from technology news sources and 3% from other types of news sources. In order to obtain a wider range of topics in the groups of articles formed, only articles from the manually selected set of generalist news sources, highlighted in bold in Table 4.1, were used. This further reduced the sample size by about 42%, to about 174 thousand news articles.

Table 4.1: Number of news articles per press source available

Furthermore, although the representation model used a normalized TF-IDF weighting scheme, first attempts also revealed that longer news articles tended to be grouped together. Indeed, a careful examination of the length of news articles revealed that it could vary from as little as one word up to 4731 words. The boxplot in Figure 4.2 shows the presence of outliers in terms of document length, measured as the total number of characters (extreme outlier: 5011 characters; moderate outlier: 3349 characters). The scatterplot in Figure 4.3 shows clustering results (k=100) on a sample of three thousand articles: longer articles tend to be clustered together, whereas shorter ones tend to be separated.

This could be explained by the fact that longer articles have a wider range of vocabulary and therefore similarities between them are easier to find, whereas for shorter articles it is the opposite and hence they appear in separate clusters.

Figure 4.2: News articles length - boxplot

Figure 4.3: Cluster size and mean length of articles

For these reasons, we used another criterion for selecting news articles: to keep those with a length between 100 and 3349 characters. The lower bound of 100 characters is used to exclude rather short and uninformative articles, such as the following examples: `Dados são relativos à zona euro e à União Europeia em geral.' (`Data refer to the euro zone and the European Union in general.'); `Veja na íntegra o debate entre os três candidatos presidenciais, transmitido na SIC Notícias.' (`Watch the full debate between the three presidential candidates, broadcast on SIC Notícias.'). It also prevents some articles that may not have been fully or correctly collected (for example, only the subtitle was registered) from entering the sample.
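The length criterion can be sketched as follows. The text does not state which rule produced the moderate and extreme outlier thresholds, so the sketch assumes the usual boxplot (Tukey) fences, that is, the third quartile plus 1.5 and 3 times the interquartile range. The function names and the toy lengths are hypothetical; only the default bounds of 100 and 3349 characters come from the text.

```python
from statistics import quantiles


def tukey_upper_fences(lengths):
    """Return the (moderate, extreme) upper outlier fences of a boxplot."""
    q1, _, q3 = quantiles(lengths, n=4)
    iqr = q3 - q1
    return q3 + 1.5 * iqr, q3 + 3.0 * iqr


def filter_by_length(texts, lower=100, upper=3349):
    """Keep texts whose length in characters lies within [lower, upper]."""
    return [t for t in texts if lower <= len(t) <= upper]


# Toy document lengths in characters, for illustration only.
lengths = [50, 120, 300, 450, 600, 800, 1000, 1200, 9000]
moderate, extreme = tukey_upper_fences(lengths)
kept = filter_by_length(["x" * n for n in lengths], lower=100, upper=moderate)
```

Here the 50-character and the 9000-character documents are dropped, mirroring the exclusion of very short and very long articles.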

The final criterion for selecting the articles was to include both articles that were shared on Twitter and articles that were not. The reasons for this are explained further on. Hence, 50% of the final sample consists of on-line news articles whose URL was shared in at least one tweet and 50% of on-line news articles whose URL was not found in any of the tweets.

With the above criteria, there were 3037 on-line news articles with a link to at least one tweet. The other 50% was randomly assembled. The final sample therefore includes 6074 news articles.

4.1.2 Preprocessing and representation model

Standard preprocessing techniques were applied, as outlined earlier in Chapter 3

(section 3.4.2).

Each document was converted to lower case and stripped of punctuation, numeric

characters and Portuguese stopwords. Words were stemmed using the Portuguese

Snowball stemmer (Bouchet-Valat, 2014).
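The cleaning steps listed above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the stopword list is a tiny hypothetical sample, and stemming is deliberately left out, since it would require a Snowball implementation.

```python
import re

# Tiny illustrative stopword list; a real run would use a full Portuguese list.
STOPWORDS = {"a", "o", "as", "os", "de", "da", "do", "e", "em",
             "um", "uma", "que", "está", "na", "no"}


def preprocess(text, stopwords=STOPWORDS):
    """Lower-case, strip punctuation and digits, and drop stopwords."""
    lowered = text.lower()
    # Replace anything that is not a letter or whitespace with a space.
    cleaned = re.sub(r"[^\w\s]|[\d_]", " ", lowered)
    return [token for token in cleaned.split() if token not in stopwords]


tokens = preprocess("A polícia está a investigar um óbito, em 2016.")
```

The resulting tokens would then be passed to the stemmer before building the document-term matrix.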

The first analysis of the most relevant terms obtained from the document-term matrix after these preprocessing tasks revealed the need to remove some expressions that frequently appeared in the body of several news articles, as exemplified in Table 4.2. These types of expressions were removed from the articles directly at the database level, before carrying out the selection of articles for our study.

For feature selection purposes, part-of-speech (POS) tagging was also used, so that only nouns, verbs and proper nouns were kept. By observing the most relevant terms with and without POS filtering, we concluded that this step also helped to improve the quality of the labels produced per cluster.

Another improvement to the representation of news articles was the identification of named entities and their inclusion as features. The application of PAMPO (Rocha et al., 2016) identified approximately 29 thousand named entities in our articles sample.

Table 4.2: Frequent expressions on news articles

| Expression | Comments |
|---|---|
| Siga o CM no Facebook | Equivalent to: Follow CM on Facebook. |
| Os nossos termos e condições de privacidade foram alterados. Este website utiliza cookies que asseguram funcionalidades para uma melhor navegação. Ao continuar a navegar, está a concordar com a utilização de cookies e com os novos termos de utilização. | Warning about terms and conditions and cookies. |
| Partilhar o artigo [título] Imprimir o artigo [título] Enviar por email o artigo [título] Aumentar a fonte do artigo [título] Diminuir a fonte do artigo [título] Ouvir o artigo [título] | Options for web users to share, print, send, listen to the news article and increase/decrease its font size (icons legends). |
| Completam-se agora 100 anos sobre o início da beligerância portuguesa. Uma data assinalada pela RTP com a publicação online dos seus mais significativos materiais de arquivo sobre o tema. | Advertisement to another content of the news source (present in over 100 articles). |

Table 4.3: News articles pre-processing transformation examples

| Before | After |
|---|---|
| Segundo site TMZ, Prince morreu na sua residência em Presley Park. A polícia está a investigar um óbito ocorrido na sua residência, mas não confirmou que se tratasse da morte do próprio artista. O cantor norte-americano, Prince Rogers Nelson de seu nome, terá sucumbido a uma gripe que originara o seu internamento de urgência na passada sexta feita. | sit tmz prince morr resident presley park políc investig óbit ocorr resident confirm trat mort artist cantor prince rogers nelson nom sucumb grip origin intern urgênc pass sext feit |
| A ministra da Administração Interna justificou esta sexta-feira que, "por uma questão de proporcionalidade", optou por aplicar ao militar da GNR, que matou um jovem numa perseguição após um assalto, uma sanção menos gravosa do que a proposta pela IGAI. | ministra da administração interna justific sext feir questã proporcional optou aplic milit gnr mat perseguiã assalt sanção propost decisã tom propost igai |
| Entre os detidos encontram-se familiares do extremista malaio Mohamed Jedi, que combate na Síria nas fileiras do Estado Islâmico. | det encontr malai mohamed jedi combat síria fileir estado islâmico |

Table 4.3 presents examples of the transformations. Underlined terms correspond to named entities. After these transformations, the corpus was structured in a document-term matrix. Since the number of terms generated, including named entities, surpassed 44 thousand, the feature set was reduced by setting the maximum sparseness of the matrix to 98%. This left only 952 terms.
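One common reading of a 98% sparseness level is that a term is dropped when it is absent from 98% or more of the documents. The sketch below assumes that reading, which the text does not spell out; the function name and the toy documents are hypothetical, and the toy threshold of 0.5 is chosen only to make the effect visible on four documents.

```python
def remove_sparse_terms(documents, max_sparsity=0.98):
    """Drop terms whose sparsity (share of documents without the term)
    is not below max_sparsity. Each document is a dict term -> weight."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for term, weight in doc.items():
            if weight:
                doc_freq[term] = doc_freq.get(term, 0) + 1
    kept = {term for term, df in doc_freq.items() if 1 - df / n_docs < max_sparsity}
    reduced = [{t: w for t, w in doc.items() if t in kept} for doc in documents]
    return reduced, sorted(kept)


# Toy document-term data, for illustration only.
docs = [
    {"govern": 1, "brexit": 2},
    {"govern": 1, "futebol": 1},
    {"govern": 3, "brexit": 1},
    {"govern": 1},
]
reduced, vocabulary = remove_sparse_terms(docs, max_sparsity=0.5)
```

With the threshold at 0.5, only the term present in more than half of the toy documents survives.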

4.1.3 Parametrizing the method

The clustering method chosen to group the on-line news articles was of a hybrid nature, combining an efficient algorithm, k-means, with a robust setup of hierarchical clustering. The number of clusters (parameter k) that k-means requires was decided on the basis of a subjective evaluation of the number of different stories that could occur in a six-month period. We also considered the explained inertia for different k values.

Determining k

The explained inertia for hierarchical clustering, given by the between-class dispersion over the total dispersion of the dataset, is shown in Figure 4.4. The elbow of the line uniting the dots is not clearly visible, but the largest growth in the explained inertia happens for k up to 50; beyond that point, further partitioning no longer improves quality (measured as class separation) at the same rate. A similar conclusion can be drawn from the representation of the aggregation indices of the hierarchical clustering, presented in Figure 4.5. In this representation, the elbow is more clearly visible, for k values between 25 and 50.
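The explained inertia defined above (between-class dispersion over total dispersion) can be computed as in the following minimal Python sketch. The function names and the toy points are hypothetical, and dispersion is assumed to be measured with squared Euclidean distances.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def explained_inertia(points, labels):
    """Between-class dispersion divided by the total dispersion."""
    overall = centroid(points)
    total = sum(sqdist(p, overall) for p in points)
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    between = sum(len(g) * sqdist(centroid(g), overall) for g in groups.values())
    return between / total


# Toy data: two tight groups far apart.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
ratio = explained_inertia(points, [0, 0, 1, 1])
```

The ratio is 0 when all points share one cluster and 1 when every point is its own cluster, which is why the curve in Figure 4.4 can only flatten as k grows.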

Figure 4.4: Explained inertia for k up to 500

Figure 4.5: Aggregation indices for k up to 500

As a benchmark, we retrieved a list of events that occurred in the first semester of 2016 from a well-known news source website, SIC Notícias. This information was published as part of the `Year in Review' at the end of 2016¹. The cited news source identified a total of 119 different events from January to June worth including in the year review. Hence, it is reasonable to assume that a large proportion of clusters of news articles characterizes different stories published in that time frame.

¹ https://sicnoticias.sapo.pt/especiais/revista-do-ano-2016/2016-12-27-O-ano-em-revista

Consequently, we opted for the larger end of the spectrum and set k equal to 50.

4.1.4 Clustering results

As described in Chapter 3, hierarchical clustering was performed on a sample of $\sqrt{k \cdot N} = \sqrt{50 \cdot 6074} \approx 551$ news articles, with a cut-off at 50 clusters. Then, the centroids of each cluster were fed into the k-means algorithm, so that the starting points would have a higher quality than a random selection of 50 articles from the sample. The clustering results are summarized in Table 4.4.
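The two-stage procedure can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the implementation used here: the agglomeration uses a naive centroid linkage as a stand-in for whichever linkage was actually applied, all function names are hypothetical, and the toy points are two-dimensional instead of TF-IDF vectors.

```python
import math
import random


def sample_size(k, n):
    """Size of the subsample handed to hierarchical clustering: sqrt(k * N)."""
    return round(math.sqrt(k * n))


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def agglomerate(points, k):
    """Naive centroid-linkage agglomeration down to k clusters; returns centroids."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        cents = [centroid(c) for c in clusters]
        pairs = ((i, j) for i in range(len(cents)) for j in range(i + 1, len(cents)))
        i, j = min(pairs, key=lambda ij: sqdist(cents[ij[0]], cents[ij[1]]))
        clusters[i] = clusters[i] + clusters.pop(j)  # j > i, so index i stays valid
    return [centroid(c) for c in clusters]


def nearest(point, centroids):
    return min(range(len(centroids)), key=lambda c: sqdist(point, centroids[c]))


def kmeans(points, seeds, iterations=20):
    """Lloyd iterations started from the given seed centroids."""
    centroids = [list(s) for s in seeds]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        centroids = [centroid(g) if g else centroids[i] for i, g in enumerate(groups)]
    return [nearest(p, centroids) for p in points], centroids


def hybrid_cluster(points, k, seed=0):
    """Hierarchical clustering on a sqrt(k*N) sample, then seeded k-means on all points."""
    rng = random.Random(seed)
    m = min(len(points), max(k, sample_size(k, len(points))))
    seeds = agglomerate(rng.sample(points, m), k)
    return kmeans(points, seeds)


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels, centroids = hybrid_cluster(points, k=2)
```

For k = 50 and N = 6074, `sample_size` reproduces the 551 articles mentioned above; the quadratic cost of the hierarchical step is thereby confined to a small sample.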

Table 4.4: Number of elements per cluster and class homogeneity and separation

Cluster size and dispersion

The resulting clusters of news articles have variable size. Indeed, some clusters, like cluster 5 - Miscellaneous, cluster 7 - Elections, cluster 14 - Politics and cluster 19 - International security, are rather large, together accounting for over 50% of the documents. Their within-class dispersion values reflect the diversity of the articles in them, and further partitioning would probably continue to divide these clusters into smaller ones. On the other hand, there are also very isolated clusters, with 1 or 2 elements only (e.g. cluster 28 - Palmira buildings and cluster 39 - Space transport).

Cluster labelling

The groups of news articles were labelled using the most significant terms as keywords. The number of keywords varies according to the cluster size, so that larger clusters have a larger set of keywords to represent them. Keywords are ordered according to their average TF-IDF values within the cluster. As an example, Figure 4.6 shows clusters of (a) articles related to the Brexit referendum, (b) the first news about António Guterres' run for United Nations Secretary-General, (c) the European Football Cup held in France and (d) the Brussels terrorist attack.
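The ordering of keywords by average TF-IDF within a cluster can be illustrated as follows. The function name, the toy TF-IDF rows and the fixed number of keywords are hypothetical; the rule that links the number of keywords to the cluster size is not reproduced.

```python
def top_keywords(tfidf_rows, labels, cluster, n_keywords):
    """Rank the terms of one cluster by their average TF-IDF weight.

    tfidf_rows: one dict term -> weight per document.
    labels: cluster label of each document.
    """
    members = [row for row, label in zip(tfidf_rows, labels) if label == cluster]
    totals = {}
    for row in members:
        for term, weight in row.items():
            totals[term] = totals.get(term, 0.0) + weight
    averages = {term: total / len(members) for term, total in totals.items()}
    # Sort by decreasing average weight, breaking ties alphabetically.
    return sorted(averages, key=lambda t: (-averages[t], t))[:n_keywords]


# Toy TF-IDF rows, for illustration only.
rows = [
    {"brexit": 0.8, "referend": 0.4},
    {"brexit": 0.6, "uniã": 0.5},
    {"futebol": 0.9},
]
keywords = top_keywords(rows, [44, 44, 21], cluster=44, n_keywords=2)
```

Terms missing from a document count as zero in the average, so a term must be both heavy and widespread within the cluster to rank high.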

The list of keywords of every cluster can be found in Appendix B. The names

of the clusters were given after an examination of the list of keywords and, in case

of very small clusters, the articles themselves.

We highlight the importance of named entity identification in this study. As the wordclouds in Figure 4.6 and the list of keywords in Appendix B show, terms such as Reino Unido, União Europeia (United Kingdom, European Union; cluster 44), Nações Unidas (United Nations; cluster 36) and Banco de Portugal (Bank of Portugal; cluster 2) are often very indicative of the type of news stories present in that group.

The clusters highlighted in Table 4.4, three of whose wordclouds were already presented, are the ones discussed later on. The reasons for this selection are provided in section 4.2.3 of this chapter.

• Politics (n.14)

• Air transport incidents (includes Brussels attack) (n.18)

• Football - Euro 2016 (n.21)

• Brexit (n.44)

Figure 4.6: Keywords per cluster - 4 examples

4.2 Assignment of tweets to clusters

In order to analyse when a certain story or group of similar stories appears in social networks in comparison to its publication in the press, social media posts must be attributed to a particular group of similar stories. In this case, we used a collection of tweets posted during the same period as the on-line news articles and assigned them to the appropriate clusters of articles formed in the previous stage.

4.2.1 Sample construction of tweets

The available tweets from Portuguese users were collected during 2016. The total number of documents surpassed 38 million, with a monthly average of 3.1 million tweets. The relevant information of approximately 19.4 million tweets from the first semester of 2016 was collected and stored in the database, as described in Chapter 3 (section 3.3).

Figure 4.7: Number of available tweets per month of 2016

However, not all of these tweets were significant to the present study because, as seen in Chapter 2, family and life is a significant topic among Twitter users. Also, tweets are very short segments of text: in 2016, the limit was 140 characters. Extremely short tweets (for example, fewer than 20 characters) increase the difficulty of text mining tasks.

The first type of tweets to be included in the sample were tweets that contained the URL of a clustered news article, as this provides evidence that the tweet itself is about the same story or event referred to in the news article. This strategy also allows the evaluation of the proposed assignment method. In total, 5664 tweets met this criterion.

Then, similarly to what was done with the news articles, a sample of the same size (5664) was selected from the available 19.4 million tweets. The selection process was both random and oriented: first, 250 thousand tweets with more than 20 characters were randomly chosen; then, only those containing at least one keyword from the clusters were kept (approximately 50%); finally, a random sub-sample of these was selected. This selection process was therefore a compromise between the processing resources available and the identification of promising tweets.
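The three-step selection described above can be sketched as follows. The function name and the toy tweets are hypothetical, and keyword matching is assumed to be done on lower-cased, whitespace-separated tokens, which the text does not specify.

```python
import random


def sample_candidate_tweets(tweets, keywords, pool_size, final_size, min_chars=20, seed=0):
    """Random pool of sufficiently long tweets, keyword filter, random sub-sample."""
    rng = random.Random(seed)
    long_enough = [t for t in tweets if len(t) > min_chars]
    pool = rng.sample(long_enough, min(pool_size, len(long_enough)))
    matching = [t for t in pool if keywords & set(t.lower().split())]
    return rng.sample(matching, min(final_size, len(matching)))


# Toy tweets, for illustration only.
tweets = [
    "brexit vote is today in the uk",
    "short brexit",
    "nothing relevant in this tweet here",
    "the brexit debate goes on and on",
]
selected = sample_candidate_tweets(tweets, {"brexit"}, pool_size=10, final_size=5)
```

In the setting described above, `pool_size` would be 250 thousand and `final_size` 5664.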

4.2.2 Pre-processing of tweets

Tweets were subject to a series of pre-processing techniques, including lower case conversion, removal of punctuation, numbers and stopwords, stemming and named entity recognition. Additionally, URLs had to be removed from the tweet text, as they are not informative. Similarly to what was done for frequent expressions in news articles (see section 4.1.2), URLs were removed directly from the database, by identifying sub-strings starting with http. Table 4.5 presents some examples of these transformations. Underlined terms are named entities.
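The URL removal was done at the database level; an equivalent step in Python, given here only as a hypothetical illustration, removes every whitespace-delimited sub-string that starts with http.

```python
import re

# Any run of non-whitespace characters beginning with "http".
URL_PATTERN = re.compile(r"http\S*")


def strip_urls(text):
    """Remove sub-strings starting with http and normalise whitespace."""
    return " ".join(URL_PATTERN.sub(" ", text).split())


cleaned = strip_urls("RT @RTPNoticias: Coreia do Norte testa bomba https://t.co/oCX2sWMmhW")
```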

4.2.3 Assignment to clusters

Table 4.5: Tweets pre-processing transformation examples

| Before | After |
|---|---|
| RT @RTPNoticias: Coreia do Norte testa bomba de hidrogénio e desperta receios mundiais https://t.co/oCX2sWMmhW | rt rtpnoticias coreia do norte test bomb hidrogéni despert recei mundi |
| Marcelo não precisa do apoio dos antigos presidentes da republica. Não precisa porque não tem! | marcelo precis apoi antig president republ precis porqu |
| [Noticias ao Minuto] Renato Sanches é comparado a Ronaldo e ganha mais interessados | noticias minuto renato sanches é compar ronaldo ganh interess |
| Tenho 18 episódios para ver e o que vou fazer??? vou começar a ver Criminal Minds: Beyond Borders ?????? | episódi ver vou faz vou comec ver criminal minds beyond borders |

Tweets were assigned to the clusters of news articles using a similarity measure between each tweet and the cluster centroids. In this work we used the cosine similarity. The similarity values were computed using feature vectors with normalised TF-IDF values. For each tweet, the five closest cluster centroids were identified, and the cluster that was most similar to the tweet in question was chosen.
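The assignment step can be sketched as below: rank the centroids by cosine similarity to the tweet vector, keep the top five, and assign the tweet to the first. The function names, cluster vectors and weights are hypothetical, and sparse vectors are represented as dictionaries for brevity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict term -> weight)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_clusters(tweet_vector, centroids, top_n=5):
    """Cluster ids ordered by decreasing similarity to the tweet."""
    ranked = sorted(centroids, key=lambda c: (-cosine(tweet_vector, centroids[c]), c))
    return ranked[:top_n]


def assign(tweet_vector, centroids):
    """Cluster whose centroid is most similar to the tweet."""
    return rank_clusters(tweet_vector, centroids, top_n=1)[0]


# Toy centroids keyed by cluster number, for illustration only.
centroids = {
    44: {"brexit": 0.9, "referend": 0.4},
    21: {"futebol": 0.8, "euro": 0.6},
    14: {"govern": 1.0},
}
tweet = {"brexit": 1.0, "euro": 0.2}
ranking = rank_clusters(tweet, centroids)
```

Keeping the whole ranking, rather than only the best match, is what makes the precision@n evaluation possible.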

The distribution of tweets and news articles per cluster is shown in Figure 4.8.

The sample of tweets is larger than the sample of articles, and the mean ratio is 1.8

tweets per article. For the clusters under observation, the ratio is larger than the

mean ratio (from 2.1 for cluster Brexit, to 3.9 for cluster Air Transport incidents),

with the exception of cluster Politics (0.8). This is some evidence that the chosen

clusters have a place in the social network discussion.

Evaluation

Using the tweets with a link to a news article, we have evaluated the results of the

assignment based on cosine similarity. Tables 4.6 and 4.7 present the performance

values.

Global accuracy is 12.7%, while macro-average precision is 13.8%. Due to the

fact that the classes are very unbalanced, we also present the weighted macro-average

precision: 26.0%. The weighted macro-average recall is 12.7% and the weighted

macro-average F1 score is 9.9%.
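The global measures can be illustrated with the following minimal Python sketch. The function name and the toy labels are hypothetical, and the macro average is assumed to run over every class that appears among either the true or the predicted labels. Note that recall weighted by class support always equals accuracy, which is consistent with both being reported as 12.7%.

```python
def evaluate(true_labels, predicted_labels):
    """Return accuracy, macro-average precision and support-weighted precision."""
    pairs = list(zip(true_labels, predicted_labels))
    classes = sorted(set(true_labels) | set(predicted_labels))
    n = len(pairs)
    accuracy = sum(t == p for t, p in pairs) / n
    precisions, supports = {}, {}
    for c in classes:
        predicted_c = sum(p == c for _, p in pairs)
        correct_c = sum(t == c and p == c for t, p in pairs)
        precisions[c] = correct_c / predicted_c if predicted_c else 0.0
        supports[c] = sum(t == c for t, _ in pairs)
    macro = sum(precisions.values()) / len(classes)
    weighted = sum(precisions[c] * supports[c] for c in classes) / n
    return accuracy, macro, weighted


# Toy labels: three tweets truly in cluster 1, one in cluster 2.
accuracy, macro_precision, weighted_precision = evaluate([1, 1, 1, 2], [1, 1, 2, 2])
```

With unbalanced classes the weighted figure is pulled towards the precision of the large classes, which is why it can differ markedly from the macro average.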

Figure 4.8: Number of tweets and articles per clusters

Table 4.6: Per-class evaluation of tweets assignment to clusters

Table 4.7: Global evaluation of tweets assignment to clusters

The assignments to clusters 3 - Public Prosecution, 5 - Miscellaneous and 44

- Brexit have the highest precision values. However, F1 values, which also take

into account the recall, are higher for clusters 18 - Air transport incidents (includes

Brussels attack), 21 - Football - Euro 2016 and 44 - Brexit. These performance

values point to the clusters that are probably the most reliable for the temporality

analysis of the next section. Indeed, this was the main reason for the selection of

clusters on which to focus the analysis. Another criterion was to select one of the

largest clusters (14 - Politics).

In addition, we present the values for precision@n, for n up to five. This measure reflects how well the proposed method performs considering its n topmost predictions, as opposed to the first prediction only. In this case, precision@5 is 43.5%, which means that, for 43.5% of the assigned tweets with a link to a news article, the correct cluster was in the top five predictions.
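Following the definition given above, precision@n can be computed as the share of tweets whose correct cluster appears among the n topmost predictions. The function name and the toy rankings below are hypothetical.

```python
def precision_at_n(true_labels, ranked_predictions, n):
    """Share of items whose true label is among the first n ranked predictions."""
    hits = sum(t in ranked[:n] for t, ranked in zip(true_labels, ranked_predictions))
    return hits / len(true_labels)


# Toy example: true clusters and the ranked predictions for four tweets.
true_clusters = [44, 21, 14, 18]
rankings = [[44, 2, 3], [5, 21, 7], [1, 2, 3], [9, 8, 18]]
scores = [precision_at_n(true_clusters, rankings, n) for n in (1, 2, 3)]
```

By construction the measure cannot decrease as n grows, and precision@1 coincides with accuracy.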

Moreover, we carried out another experiment, demanding at least two features in common with the cluster centroids in the tweets assignment stage, to improve the assignments. Evaluation results improved slightly but not significantly (accuracy: 14.2%; macro-average precision: 15.9%; weighted macro-average precision: 25.5%; weighted macro-average F1: 11.2%); see Appendix C.

4.3 Temporal analysis

The main goal of the dissertation was to identify similar stories or events and analyse whether they unfold differently on the news and on social media. Given the rise of user generated content and the current trend of journalists scouting social media for crowd checking and news monitoring (DVJ Insights and ING The Netherlands, 2015), the hypothesis is that these two environments are interconnected.

The work developed thus far aimed to group the two types of documents (on-line news articles and tweets) through the use of text mining and clustering, among other techniques. The final clusters are now used for the temporal analysis, and in particular for the study of the temporal relationship between news and tweets.

4.3.1 Evolution of articles and tweets

As a result of the previous stages, we have identified similar news articles and tweets and characterized each group with a number of keywords. In this section we present the timeline of each of these document types per cluster, in an attempt to gain insights into the evolution of news generation and sharing/commenting on Twitter in Portugal.

To that end, the date and time of publication or posting of these elements were

retrieved from the database.

For the following analyses, we focused on tweets that do not have any link to a

news article. Similarly, we only included articles that were not shared on Twitter.

This prevented a possible bias towards the hypothesis that, for a given cluster, the

social discussion on Twitter was after its publication by the press.

Figure 4.9 presents the temporal evolution of tweets and articles for the four

clusters under observation. We recall that Football - Euro 2016, Brexit and Air

transport incidents (includes Brussels attack) were the clusters with the best F1

scores on the evaluation of tweets assignment, and that Politics is the largest cluster,

albeit with a lower performance evaluation (see Table 4.6).

Figure 4.9: Evolution of the number of elements

It is possible to observe that clusters Football - Euro 2016 and Brexit have peaks in the number of articles and tweets at the expected moment: the European Football Cup started on the 10th of June (week 24) and the Brexit Referendum took place on the 23rd of June (week 26). Naturally, both of these events were subject to discussion in the previous months, as the national team prepared for the competition and debates concerning Brexit and its consequences intensified. Tweets assigned to Football - Euro 2016 always surpassed the number of published articles on a weekly basis, with a global ratio of six tweets to one news article. This highlights the importance of this event (Euro 2016) and topic (football) in the Portuguese discussions on Twitter. These time series show a correlation of 0.84, which may indicate that the intensity of football coverage in the news and of football discussion on Twitter vary together. A smoother trend of tweets surpassing the number of articles is noticed for Brexit, with the exception of week 26, when the referendum occurred, where the number of assigned tweets is approximately 50% lower than the number of clustered news articles.
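The weekly series and their correlation can be illustrated with the sketch below. The correlation is assumed to be Pearson's coefficient, and the week index used here (day of the year divided into blocks of seven) is a hypothetical convention that need not match the week numbering of Figure 4.9; the function names and dates are likewise illustrative.

```python
from collections import Counter
from datetime import datetime
from math import sqrt


def week_of_year(moment):
    """Week index assuming week 1 covers the first seven days of the year."""
    return (moment.timetuple().tm_yday - 1) // 7 + 1


def weekly_counts(moments):
    return Counter(week_of_year(m) for m in moments)


def aligned_series(counts_a, counts_b):
    """Two count series over the union of weeks, with zeros for missing weeks."""
    weeks = sorted(set(counts_a) | set(counts_b))
    return [counts_a.get(w, 0) for w in weeks], [counts_b.get(w, 0) for w in weeks]


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy timestamps for one cluster.
articles = [datetime(2016, 1, 1), datetime(2016, 1, 8), datetime(2016, 1, 9)]
tweets = [datetime(2016, 1, d) for d in (2, 3, 8, 9, 10, 11)]
article_series, tweet_series = aligned_series(weekly_counts(articles), weekly_counts(tweets))
correlation = pearson(article_series, tweet_series)
```

A high correlation only says that the two series rise and fall together; it is compatible with very different volumes, such as the six-to-one ratio reported above.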

Air transport incidents (includes Brussels attack) is the smallest cluster under observation, albeit having scored the highest F1 value. It shows a small rise at week 13, which was when the bombings in Brussels Airport and Maalbeek metro station happened. We remark, nevertheless, that if we included shared articles and tweets with a link to them in this analysis, the rise at week 13 would be significantly larger (18 tweets and 10 articles versus an average of 0.2 and 1.6 in the weeks prior to this event).

Politics, the largest cluster under observation, shows a rather smooth evolution in the number of elements², especially on the tweets side. It shows that, for a study aimed at analysing the timing of generation of these documents, further partitioning of this cluster could be necessary in order to identify patterns in sub-stories that are possibly different from the one this large cluster reveals. This is the cluster under observation with the lowest ratio of assigned tweets to articles, which may also be a sign that, for this topic, the keywords generated at the articles level may not be sufficiently discriminative at the tweets level. The evaluation measures did, indeed, reveal a large proportion of false negatives for this class (see Table 4.6). Another possible line of interpretation is that Portuguese Twitter users do not, in fact, talk as much about this topic as its importance to the press would suggest.

4.3.2 Time-wise differences

This section explores the timing of news articles versus the timing of tweets about the same story or group of similar stories. One way of analysing the temporality between news and tweets is to consider the time difference of every article in a cluster to its `median tweet'. The median tweet is the tweet with the median time in the corresponding cluster. This brings out how the press publication timings compare to the moment the public discussion is at its highest.

² Note that the falls in the number of elements are due to the partitioning of the data in weeks, which are cut into two when a new month begins in the middle of a certain week of the year.

The time difference is computed as a lag variable that, if positive, indicates that the article is older than the median tweet, and the opposite if negative. If the distribution of this variable is concentrated on positive values (shifted to the right), the stories of that cluster tend to be published first by the press; if it is concentrated on negative values (shifted to the left), the social media discussion happens sooner than the news.

Figure 4.10 shows the representation of the distribution of the above-mentioned

lag variable, for the clusters under observation.

Figure 4.10: Days difference between articles and the median tweet

It can be observed that the majority of the articles belonging to the cluster Football - Euro 2016 were published after the median date of the tweets assigned to this cluster (a negative days difference). Portuguese Twitter users seem, therefore, to have anticipated the discussion of the national team's participation in the competition in comparison to what happened on the news. A similar conclusion can be drawn for Brexit: the height of the discussion among Portuguese users about the UK staying in or leaving the European Union happened on Twitter about two months before it happened in the press.

These observations lead to the following conclusion: while the news on Football - Euro 2016 and Brexit were more event-oriented, with peaks of articles at specific points or short periods of time (e.g. the start of the football competition on the 10th of June; the referendum date announcement in February and the referendum itself on the 23rd of June), conversations on social media may have happened more evenly during the period under study.

The Politics cluster does not present any specific pattern. The Air transport incidents (includes Brussels attack) cluster shows some signs that the press published articles about the incidents first: in total, there were nine articles published before the median tweet (positive days difference) and five articles published after the median tweet (negative days difference).
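Counts such as the nine articles before and five after the median tweet can be obtained from the lag values by inspecting their sign, as in the sketch below. The function name and the individual lag values are hypothetical; only the nine-to-five split is taken from the text.

```python
def summarise_lags(lags):
    """Count articles published before (positive lag) and after (negative lag)
    the median tweet, and the share published before."""
    before = sum(1 for lag in lags if lag > 0)
    after = sum(1 for lag in lags if lag < 0)
    total = len(lags)
    return {
        "before": before,
        "after": after,
        "share_before": before / total if total else 0.0,
    }


# Hypothetical lag values in days: nine positive and five negative.
lags = [12.0, 9.5, 7.0, 5.0, 4.0, 3.0, 2.0, 1.5, 0.5, -1.0, -2.0, -6.0, -20.0, -35.0]
summary = summarise_lags(lags)
```

With so few articles, a share of about 64% published before the median tweet is suggestive rather than conclusive.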

We also did this temporality analysis for the whole dataset, i.e., including shared

articles and tweets with links to them. Conclusions for the selected clusters do not

change (see Appendix D).

It is important, however, to emphasize that conclusions drawn from the previ-

ous analyses should be interpreted carefully, as they depend on the quality of the

generated clusters, which could only be partially evaluated.

Chapter 5

Conclusion

5.1 Results

This work aimed at providing some initial insights into the themes that appear both

in news sources and social networks in Portugal. In particular, the goal was to study

the temporal relationship between news articles and tweets. The strategy was to

investigate what were the main stories published and how they behaved in terms of

the number of articles and social media posts during a reasonably long period.

To that end, a sample of on-line news articles from generalist news sources was subject to text clustering techniques, and 50 groups of similar stories were identified. These were subsequently labelled with a set of keywords. The groups included stories on football-related events and teams, terrorist attacks and other international incidents, elections, accidents, investigations, political parties, ministerial actions, economic/financial reports and weather warnings, among others.

Then, to associate tweets with the groups of news articles, we used a sample of tweets and assigned them to each of the clusters using a similarity measure. This assignment was evaluated using tweets with a news article URL. Performance results were not as promising as expected (12.7% accuracy), suggesting the need for improvements in the proposed method. The fact that tweets contain very few words adds to the difficulty of this task. Also, as the number of classes is large, we cannot expect a very high accuracy value.

Nonetheless, for some clusters the evaluation was considerably above average, which allowed the study of the temporality between news and tweets. The three clusters selected on the basis of per-class performance were, coincidentally, event-oriented stories: Brexit, Football - Euro 2016 and Air transport incidents. The final cluster under focus was one of the largest clusters formed (Politics).

The analyses made on the evolution of the number of elements and the temporality between tweets and news for those four clusters led to the following main conclusions. The national football team was a rather constant subject of discussion on social media in the first semester of 2016, culminating in the final month with its participation in the European Cup. However, the same did not necessarily happen on the news side, where most articles were published towards the end. A similar pattern was noticed for Brexit. This indicates that, for some stories, the press is more event-oriented, contrasting with the more permanent focus of Twitter users. The analysis of incidents at airports, which included the Brussels bombings, revealed that the press had a more prominent role in the news diffusion, with comments on social media arising afterwards. We hypothesize that this might not have been the case had those incidents occurred in Portugal. Finally, this type of analysis requires a certain level of partitioning of stories, so that timing patterns are more easily identifiable, which was not achieved for the Politics cluster.

5.2 Limitations and Future Work

This investigation presented some challenges. The fact that we were working with unlabelled data (both articles and tweets) prevented a more robust evaluation of the proposed method, even though we were able to partially assess the performance of the assignment of tweets to clusters of articles. Additionally, the available tweets (more than 19 million) can be classified as big data, which calls for more efficient retrieval techniques, directed at identifying those tweets related to news stories.

A possible avenue of investigation is to improve the representation of news articles

in order to capture the desired stories or events. Some of the cluster keywords

revealed that the list of stopwords could be enhanced. Also, in line with the current

literature, the use of must-link and/or cannot-link constraints based on previous

knowledge (for example, a list of Portuguese events from Wikipedia), making

the unsupervised task a semi-supervised task, may help to improve cluster quality.

At the tweet level, we suggest two possible strategies aimed at better representation

and, consequently, better association with news articles. The first is the use of

ontologies to compute semantic distances between articles and tweets. The second

is to expand each tweet with synonyms, obtained for example through word embeddings

(Mikolov et al., 2013), that could be more easily matched with the lexicon

used in the articles.
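The second strategy can be sketched as follows. The example assumes that word vectors are already available; the three-dimensional vectors, the vocabulary, the function names and the similarity threshold are all invented for illustration, whereas real embeddings would be trained on a large corpus and have many more dimensions.

```python
import math

# Toy vectors standing in for trained word embeddings (invented values).
EMBEDDINGS = {
    "airport": (0.9, 0.1, 0.0),
    "terminal": (0.85, 0.15, 0.05),
    "flight": (0.7, 0.3, 0.1),
    "goal": (0.0, 0.9, 0.2),
    "football": (0.1, 0.95, 0.1),
}


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_tweet(tokens, embeddings, top_n=1, min_sim=0.9):
    """Append to a tweet the nearest neighbours of each known token."""
    expanded = list(tokens)
    for tok in tokens:
        if tok not in embeddings:
            continue  # out-of-vocabulary tokens are left as they are
        neighbours = sorted(
            ((cosine(embeddings[tok], vec), word)
             for word, vec in embeddings.items() if word != tok),
            reverse=True,
        )
        for sim, word in neighbours[:top_n]:
            if sim >= min_sim and word not in expanded:
                expanded.append(word)
    return expanded


print(expand_tweet(["delay", "airport"], EMBEDDINGS))
```

The expanded token list would then be matched against the cluster centroids in place of the original tweet, at the cost of some added noise when the neighbours are not true synonyms.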

A different strategy that could yield better results would be to use the tweets

that link to news articles to train (and test) a classification model that would be

more adequate for short segments of text.
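A minimal sketch of this idea is given below, assuming that each tweet containing a link inherits the cluster of the linked article as its label. The choice of a multinomial Naive Bayes model with add-one smoothing, the function names and the toy data are assumptions made for illustration; the text above does not prescribe a particular classifier.

```python
import math
from collections import Counter, defaultdict


def train_nb(labelled_tweets):
    """Collect counts from (tokens, cluster_label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labelled_tweets:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the most probable cluster label for a tokenised tweet."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set: tweets with links, labelled with the
# cluster of the article they point to.
train = [
    (["brexit", "referendum", "uk"], "brexit"),
    (["brexit", "vote", "eu"], "brexit"),
    (["portugal", "euro", "final"], "euro2016"),
    (["ronaldo", "goal", "euro"], "euro2016"),
]
model = train_nb(train)
print(predict_nb(model, ["eu", "referendum", "result"]))
```

Tweets without links could then be labelled with `predict_nb`, and a held-out share of the linked tweets could serve as a test set.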

Bibliography

Aggarwal, C. C. and Zhai, C. (2012). A survey of text clustering algorithms. In

Mining Text Data, chapter 4, pages 77–128. Springer US.

Bholowalia, P. and Kumar, A. (2014). EBK-means: A clustering technique based

on elbow method and k-means in WSN. International Journal of Computer

Applications, 105(9):17–24.

Bouchet-Valat, M. (2014). SnowballC: Snowball stemmers based on the C libstemmer

UTF-8 library.

Brogueira, G., Batista, F., and Carvalho, J. P. (2016). Using geolocated tweets for

characterization of Twitter in Portugal and the Portuguese administrative regions.

Social Network Analysis and Mining, 6(1).

Chavent, M., Kuentz, V., Labenne, A., and Saracco, J. (2017). ClustGeo: Hierar-

chical Clustering with Spatial Constraints.

Cutting, D. R., Karger, D. R., Pedersen, J. O., and Tukey, J. W. (1992). Scat-

ter/Gather: A Cluster-based Approach to Browsing Large Document Collections.

In SIGIR 92, volume 51, pages 318–329.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman,

R. A. (1990). Indexing by latent semantic analysis. Journal of The American

Society For Information Science, 41(6):391–407.

Dempster, A., Laird, N., and Rubin, D. B. (1977). Maximum Likelihood from

Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society

Series B Methodological, 39(1):1–38.

DVJ Insights and ING The Netherlands (2015). Impact of social media on news

(#SMING15). Technical report.

Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A Density-Based Algorithm

for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of

the Second International Conference on Knowledge Discovery and Data Mining

(KDD-96), volume 34, pages 226–231.

Feinerer, I. and Hornik, K. (2017). tm: Text Mining Package.

Feldman, R. and Sanger, J. (2006). The Text Mining Handbook.

Fellows, I. (2014). wordcloud: Word Clouds.

Gama, J., Carvalho, A. P. d. L., Faceli, K., Lorena, A. C., and Oliveira, M. (2015).

Extração de Conhecimento de Dados - Data Mining. Edições Sílabo, Lda., 2nd

edition.

Genc, Y., Sakamoto, Y., and Nickerson, J. V. (2011). Discovering context: Classi-

fying tweets through a semantic transform based on wikipedia. In Lecture Notes

in Computer Science (including subseries Lecture Notes in Artificial Intelligence

and Lecture Notes in Bioinformatics), volume 6780 LNAI, pages 484–492.

Hornik, K. (2016). openNLP: Apache OpenNLP Tools Interface.

Hornik, K. (2017). NLP: Natural Language Processing Infrastructure.

Hornik, K., Buchta, C., and Zeileis, A. (2009). Open-Source Machine Learning: R

Meets Weka. Computational Statistics, 24(2):225–232.

Hotho, A., Nürnberger, A., and Paaß, G. (2005). A Brief Survey of Text Mining.

LDV Forum - GLDV Journal for Computational Linguistics and Language

Technology, 20:19–62.

Hu, M., Liu, S., Wei, F., Wu, Y., Stasko, J., and Ma, K.-L. (2012). Breaking news

on twitter. In Proceedings of the 2012 ACM annual conference on Human Factors

in Computing Systems - CHI '12, pages 275–279.

Huang, A. (2008). Similarity measures for text document clustering. In Proceedings

of the Sixth New Zealand Computer Science Research Student Conference, number

April, pages 49–56.

Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition

Letters, 31(8):651–666.

Java, A., Song, X., Finin, T., and Tseng, B. (2007). Why we twitter: Understanding

Microblogging Usage and Communities. In International Conference on Knowledge

Discovery and Data Mining, pages 56–65.

Kaplan, A. M. and Haenlein, M. (2010). Users of the world, unite! The challenges

and opportunities of Social Media. Business Horizons, 53(1):59–68.

Kassambara, A. and Mundt, F. (2017). factoextra: Extract and Visualize the Results

of Multivariate Data Analyses.

Krishnamurthy, B., Gill, P., and Arlitt, M. (2008). A few chirps about twitter.

In Proceedings of the 1st Workshop on Online Social Networks (WOSN), pages

19–24.

Kwak, H., Lee, C., Park, H., and Moon, S. (2010). What is Twitter, a Social Network

or a News Media? The International World Wide Web Conference Committee

(IW3C2), pages 1–10.

Lusa (2016). Uso das redes sociais em Portugal triplicou em sete anos.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate

observations. Proceedings of the Fifth Berkeley Symposium on Mathematical

Statistics and Probability, 1(14):281–297.

Marketeer (2017). Qual é a rede social mais utilizada em Portugal?

McCombs, M. E. and Shaw, D. L. (1972). The agenda-setting function of mass media.

The Public Opinion Quarterly, 36(2):176–187.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed

representations of words and phrases and their compositionality. In Advances

in neural information processing systems, pages 3111–3119.

Newman, N. (2009). The rise of social media and its impact on mainstream jour-

nalism.

Phuvipadawat, S. and Murata, T. (2010). Breaking news detection and tracking

in Twitter. In Proceedings - 2010 IEEE/WIC/ACM International Conference on

Web Intelligence and Intelligent Agent Technology - Workshops, WI-IAT 2010,

pages 120–123.

Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3):130–137.

R Core Team (2017). R: A Language and Environment for Statistical Computing.

R Foundation for Statistical Computing, Vienna, Austria.

Ripley, B. and Lapsley, M. (2017). RODBC: ODBC Database Access.

Rocha, C. (2016). PAMPO: PAMPO - Extract Named Entities from texts.

Rocha, C., Jorge, A., Sionara, R., Brito, P., Pimenta, C., and Rezende, S. (2016).

PAMPO: using pattern matching and pos-tagging for effective Named Entities

recognition in Portuguese. arXiv preprint arXiv:1612.09535, pages 1–17.

Rosen, A. (2017). Tweeting Made Easier.

Russom, P. (2007). BI Search and Text Analytics: New Additions to the BI Tech-

nology Stack. Technical report, The Data Warehousing Institute.

Saleiro, P., Amir, S., Silva, M., and Soares, C. (2015). POPmine: Tracking Political

Opinion on the Web. In 2015 IEEE International Conference on Computer and

Information Technology; Ubiquitous Computing and Communications; Depend-

able, Autonomic and Secure Computing; Pervasive Intelligence and Computing

(CIT/IUCC/DASC/PICOM), pages 1521–1526.

Sankaranarayanan, J., Samet, H., Teitler, B. E., Lieberman, M. D., and Sperling, J.

(2009). TwitterStand: News in Tweets. In Proceedings of the 17th ACM SIGSPA-

TIAL International Conference on Advances in Geographic Information Systems

- GIS '09, pages 42–51.

Schütze, H., Manning, C. D., and Raghavan, P. (2008). Evaluation in information

retrieval, volume 39. Cambridge University Press.

Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H., and Demirbas, M. (2010).

Short Text Classification in Twitter to Improve Information Filtering. In Proceedings

of the 33rd International ACM SIGIR Conference on Research and Development

in Information Retrieval SE - SIGIR '10, pages 841–842.

Statista (2018). Number of monthly active Twitter users worldwide from 1st quarter

2010 to 2nd quarter 2018 (in millions).

Twitter (2016). Twitter Usage - Company Facts.

https://about.twitter.com/company. Last accessed on Dec 12, 2016.

Wanta, W., Golan, G., and Lee, C. (2004). Agenda Setting and International News:

Media Influence on Public Perceptions of Foreign Nations. Journalism and Mass

Communication Quarterly, 82(1):364–377.

Ward Jr, J. H. (1963). Hierarchical grouping to optimize an objective function.

Journal of the American Statistical Association, 58(301):236–244.

Witten, I. H. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools

and Techniques. Morgan Kaufmann, San Francisco, 2nd edition.

Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E.-p., Yan, H., and Li, X. (2011).

Comparing Twitter and Traditional Media using Topic Models. In Proceedings of

the 33rd European conference on Advances in information retrieval (ECIR'11),

pages 338–349.

Appendix A

Example of input tweet

Figure A.1: Example of tweet in JSON - part 1

Figure A.2: Example of tweet in JSON - part 2

Appendix B

Cluster keywords

Figure B.1: Top 10 cluster keywords - part 1

Figure B.2: Top 10 cluster keywords - part 2

Appendix C

Assignment of tweets with at least two features in common with cluster centroids - Evaluation results

Table C.1: Global evaluation of tweets assignment to clusters - considering tweets with at least two terms in common with cluster centroids

Table C.2: Per-class evaluation of tweets assignment to clusters - considering tweets with at least two terms in common with cluster centroids

Appendix D

Temporality between news and tweets - considering the complete dataset

Figure D.1: Days difference between articles and the median tweet - considering the complete dataset of tweets and articles
