Association and temporality between news
and tweets
by
Vânia Nogueira Moutinho
Master Dissertation in Data Analytics
Supervised by
Professor João Paulo Cordeiro
Professor Pavel Bernard Brazdil
2018
Acknowledgements
Firstly, I would like to thank Professors Pavel Brazdil and João Cordeiro for their
support, guidance and wisdom throughout this investigation.
I would also like to acknowledge Pedro Saleiro for his guidance and generosity
at the start of this project, André Lima for his kind advice at critical moments, and
Natália Silva and Filomena Anselmo for their companionship during the ups and
downs of these past three years.
Lastly, I would like to thank my parents for making this journey possible, and
Rafael Correia for his unending support and constant belief in my success.
Resumo
Com o advento dos social media, as fronteiras entre o jornalismo e as redes sociais
estão a esbater-se. Existe um aumento dos conteúdos gerados pelos utilizadores
(UGC), dedicando os jornalistas uma parte significativa do seu dia a anunciar,
difundir e monitorar notícias, assim como a validar informações, em plataformas
como o Facebook e o Twitter. Vários estudos tentaram perceber o papel das redes
sociais enquanto fontes de notícias. Contudo, a relação e as interligações entre este
tipo de plataforma e os meios de comunicação social ainda não foram detalhadamente
estudadas.
Nesta investigação, estudámos uma série de notícias publicadas em artigos
jornalísticos e a sua partilha e discussão numa rede social durante um período de
seis meses. Especificamente, uma amostra de artigos de fontes portuguesas
generalistas de notícias publicados no primeiro semestre de 2016 foi submetida a
agrupamento, utilizando um algoritmo híbrido. Os grupos de notícias gerados foram
posteriormente associados a tweets de utilizadores portugueses, usando uma medida
de similaridade.
Para um subconjunto dos clusters obtidos, realizámos uma análise temporal sobre
estes grupos de notícias, examinando a evolução dos dois tipos de documentos
(artigos e tweets) e o momento da sua criação. Foi possível concluir que, para alguns
grupos de notícias, nomeadamente o Brexit e o Campeonato Europeu de Futebol,
a publicação de artigos jornalísticos ganha intensidade em datas-chave (orientada
para eventos), enquanto o debate e a discussão nas redes sociais são mais
equilibrados ao longo dos meses que antecedem esses eventos.
Palavras-Chave: agrupamento de texto, tweets, notícias
Abstract
With the advent of social media, the boundaries between mainstream journalism and
social networks are becoming blurred. User generated content is increasing, and
journalists dedicate considerable time to platforms such as Facebook and Twitter
to announce, spread and monitor news and to crowd-check information. Many
studies have looked at social networks as news sources, but the relationship and
interconnections between this type of platform and the news media have still not
been thoroughly investigated.
In this work, we have studied a series of news stories and their sharing and
commenting on a social network during a period of six months. Specifically, a sample
of articles from generalist Portuguese news sources published in the first semester
of 2016 was subjected to hybrid text clustering. The groups of stories obtained were
then associated with tweets of Portuguese users with the use of a similarity measure.
Focusing on a set of clusters, we performed a temporal analysis on these groups
of stories by examining the evolution of the two types of documents (articles and
tweets) and the timing of their generation. We concluded that for some stories,
namely Brexit and the European Football Cup, the publishing of news articles
intensifies on key dates (event-oriented), while the discussion on social media is more
balanced throughout the months leading up to those events.
Keywords: text clustering, Twitter, news
Contents
Acknowledgements i
Resumo ii
Abstract iii
1 Introduction 1
1.1 Motivation and objectives . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Details and contribution . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Related Work 5
2.1 Twitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Facts and conventions . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 User intention . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 Twitter in Portugal . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.4 Twitter and news . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Text mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Definition and applications . . . . . . . . . . . . . . . . . . . . 9
2.2.2 The case of unstructured text data . . . . . . . . . . . . . . . 10
2.2.3 Document, features and the representation model . . . . . . . 11
2.2.4 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.5 Mining tweets . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Text clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3 Methodology 20
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Four main stages of the method . . . . . . . . . . . . . . . . . . . . . 21
3.3 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4 News clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4.1 Sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4.2 Pre-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4.3 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.5 Assignment of tweets to clusters . . . . . . . . . . . . . . . . . . . . . 29
3.5.1 Sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5.2 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5.3 Assignment to clusters . . . . . . . . . . . . . . . . . . . . . . 30
3.5.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.6.1 Timing of events on the news and on social media . . . . . . . 32
4 Temporal Analysis of News and Tweets 34
4.1 News clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1.1 Selection of articles . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.2 Preprocessing and representation model . . . . . . . . . . . . 38
4.1.3 Parametrizing the method . . . . . . . . . . . . . . . . . . . . 40
4.1.4 Clustering results . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Assignment of tweets to clusters . . . . . . . . . . . . . . . . . . . . . 45
4.2.1 Sample construction of tweets . . . . . . . . . . . . . . . . . . 45
4.2.2 Pre-processing of tweets . . . . . . . . . . . . . . . . . . . . . 46
4.2.3 Assignment to clusters . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Temporal analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3.1 Evolution of articles and tweets . . . . . . . . . . . . . . . . . 51
4.3.2 Time-wise differences . . . . . . . . . . . . . . . . . . . . . . 53
5 Conclusion 56
5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . 57
Bibliography 59
Appendices 65
A Example of input tweet 65
B Cluster keywords 67
C Assignment of tweets with at least two features in common with
cluster centroids - Evaluation results 69
D Temporality between news and tweets - considering the complete
dataset 71
List of Tables
4.1 Number of news articles per press source available . . . . . . . . . . . 36
4.2 Frequent expressions on news articles . . . . . . . . . . . . . . . . . . 39
4.3 News articles pre-processing transformation examples . . . . . . . . . 39
4.4 Number of elements per cluster and class homogeneity and separation 42
4.5 Tweets pre-processing transformation examples . . . . . . . . . . . . 47
4.6 Per-class evaluation of tweets assignment to clusters . . . . . . . . . . 49
4.7 Global evaluation of tweets assignment to clusters . . . . . . . . . . . 50
C.1 Global evaluation of tweets assignment to clusters - considering tweets
with at least two terms in common with cluster centroids . . . . . . . 69
C.2 Per-class evaluation of tweets assignment to clusters - considering
tweets with at least two terms in common with cluster centroids . . . 70
List of Figures
3.1 Four main stages of the method . . . . . . . . . . . . . . . . . . . . . 21
3.2 Data setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1 Number of news articles in the dataset per month of 2016 . . . . . . . 35
4.2 News articles length - boxplot . . . . . . . . . . . . . . . . . . . . 37
4.3 Cluster size and mean length of articles . . . . . . . . . . . . . . . . . 37
4.4 Explained inertia for k up to 500 . . . . . . . . . . . . . . . . . . . . 41
4.5 Aggregation indices for k up to 500 . . . . . . . . . . . . . . . . . . . 41
4.6 Keywords per cluster - 4 examples . . . . . . . . . . . . . . . . . . 44
4.7 Number of available tweets per month of 2016 . . . . . . . . . . . . . 45
4.8 Number of tweets and articles per clusters . . . . . . . . . . . . . . . 48
4.9 Evolution of the number of elements . . . . . . . . . . . . . . . . . . 52
4.10 Days difference between articles and the median tweet . . . . . . . . . 54
A.1 Example of tweet in JSON - part 1 . . . . . . . . . . . . . . . . . . 65
A.2 Example of tweet in JSON - part 2 . . . . . . . . . . . . . . . . . . 66
B.1 Top 10 cluster keywords - part 1 . . . . . . . . . . . . . . . . . . . 67
B.2 Top 10 cluster keywords - part 2 . . . . . . . . . . . . . . . . . . . 68
D.1 Days difference between articles and the median tweet - considering
the complete dataset of tweets and articles . . . . . . . . . . . . . . 71
Chapter 1
Introduction
This chapter presents the theme of the dissertation, describing the motivation
behind it, the main goals to achieve and the proposed methodology, as well as the
main contributions.
1.1 Motivation and objectives
News media can have a powerful influence on people's perception of reality, at least in
some domains, like politics (McCombs and Shaw, 1972) and foreign affairs (Wanta
et al., 2004). A story reported in mass media reaches thousands of people who
generally have no other direct contact with the subject of the story, and it is
reasonable to say that news media is an important source of information for most
people (McCombs and Shaw, 1972).
However, the advent of social media is gradually changing the way information
is disseminated and, moreover, possibly shifting the roles of news makers and news
recipients. According to Kaplan and Haenlein (2010), social media is the set of
applications based on the internet where user generated content (UGC) is created,
modified and exchanged in a collaborative and participatory way. Following their
categorization, examples of social media include Facebook and LinkedIn, which can
be classified as social networking sites; Wikipedia, a collaborative project; YouTube
and Pinterest, as content communities; blogs; and virtual social or game worlds, like
Second Life. In this setting, information is no longer the property of an elite set of
sources; it is rather the result of a framework in which every person with an internet
connection can add, transform, update, diffuse, filter and share pieces of content.
The impact on journalism, in particular, is already noticeable. Although we
still look to mainstream news to discern truthful from unreliable information, we
are also more and more interested in the content posted or shared by friends or
other entities we follow on social networks (Newman, 2009). A study of the impact
of social media on the activities of public relations professionals and journalists
in 2015 reports that 1.7 hours of a journalist's day are spent using social media,
with Facebook and Twitter being the first and second most used platforms. In
addition to expected activities such as relationship management and responding to
comments, these professionals use social media to announce, spread and monitor
news and to check information with the crowd (crowd checking), with 40% of them
considering social media a reliable source (DVJ Insights and ING The Netherlands, 2015).
The expectations are that the relationship between news and social media will
continue to grow (DVJ Insights and ING The Netherlands, 2015).
Considering these facts, can we still tell where and when a news story begins
and its authorship? As the borders between journalism in the traditional sense and
social media become blurred, what is the impact of one over the other? Does this
evolution mean that the popularity of stories in social media is reflected in and/or
reflects the attention of the news press? In other words, to what extent are we
still relying on news from the press to know what is happening and, conversely, how
much is the social media focus impacting what the press reports?
This is a relationship worth analysing at this point in time. In particular, this
empirical investigation explores news articles and social media messages from
Portugal, a country where the use of social media is still increasing among both the
population and enterprises (Lusa, 2016).
The main goal of this dissertation is to study stories that are reported in the
news and commented on or diffused throughout social networks. The focus is to
identify stories that were published over a six-month period, together with the social
media posts about them. Then, for a selected set, we analyse the differences in
timing between news and social media. This analysis includes examining the
evolution in the number of articles and tweets about the same stories and identifying
which groups of stories show signs of having been first published in the news and
then diffused or commented on in a social network, and which reached the height
of social media discussion before the height of news article publishing.
1.2 Details and contribution
As described in the first section, the press plays an important role in people's
daily conversations and opinions (McCombs and Shaw (1972); Wanta et al. (2004)).
Also, trends on social media have an increasing influence on what journalists report
(DVJ Insights and ING The Netherlands (2015); Newman (2009)). To explore this
relationship, we looked at the news media coverage during a certain period of time
and identified groups of stories with the use of clustering techniques. Then we
searched for those stories on a social network, using a similarity measure against
those groups of stories. We evaluated the results using `semi-labelled' social media
posts, that is, posts that shared news articles. Focusing on the groups with the
best performance, we studied the evolution of the number of articles and posts
and the timing differences between them. Any differences found were meant to shed
some light on the following questions. Stories that are brought to the public's
attention by mainstream journalism and then generate a buzz on social media may
indicate a prevalent function of that profession. Stories that begin on social media
and are then picked up by the news can indicate either a reinforcement of or a
turnaround in this mechanism.
Moreover, are there stories that only occasionally come about or is there a continuous
debate through time on either or both platforms?
It is not our objective to categorize every story in these terms. Indeed, we
recognize these are subjective considerations, and a full understanding of the
interconnections between news and social media, and of the underlying aspects in
society, requires more thorough research. Notwithstanding, we believe the approach
and focus proposed for this investigation will provide initial grounds for it.
Furthermore, the outcome of this research should provide good insight into the
main themes discussed, formally and informally, both chronologically and in terms
of strength, thus allowing a country's narrative to take form.
1.3 Organization
The remainder of this report is organized as follows. Chapter 2 provides the
literature review, Chapter 3 describes the methodology employed, and Chapter 4
presents the results obtained from grouping news and tweets and discusses the
temporal relationship between them. Chapter 5 concludes.
Chapter 2
Related Work
This chapter provides an overview of the most relevant and related work. It begins
by describing the social network Twitter and noting certain aspects researchers have
found important when working with tweets. Then, attention is given to text mining
and clustering techniques.
2.1 Twitter
2.1.1 Facts and conventions
Twitter is a social network launched in October 2006. By the end of the first
semester of 2016, it held more than 313 million monthly active users, of which 82%
used the mobile app (Twitter, 2016). By the end of the first semester of 2018, the
number of monthly active users was 335 million (Statista, 2018). Users are people
or organizations who create an account so they can send or receive messages and
follow other users. Those messages are known as tweets. The relationship between
users is not necessarily reciprocal: a user chooses to follow another, but that does
not mean the other user will follow him/her back. Moreover, a user decides who
he/she wants to follow, but has no power over who follows him/her.
Tweets are messages posted by a user and have the singular property of a maximum
length of 140 characters (a limit extended to 280 characters at the end of 2017
(Rosen, 2017)). This means that users must be concise in their writing; the little
effort required also makes adherence to this form of blogging higher. Indeed, as
Java et al. (2007) state in one of the earliest works on Twitter, this feature classifies
it as a microblogging platform that makes communication faster and easier. On the
other hand, it also follows that the use of abbreviations is fairly common
(Sankaranarayanan et al., 2009), adding a layer of difficulty to text mining tasks.
Following the classification of Kwak et al. (2010), tweets can be singletons, replies,
mentions or retweets. A reply or a mention uses the convention `@user' to indicate
that someone is being addressed, whereas a retweet is a form of forwarding another
user's message and is usually preceded by `RT'. Kwak et al. (2010) consider retweets
a very powerful feature for information filtering and diffusion. A singleton is a tweet
with no reply or mentions (Kwak et al., 2010).
There is another interesting feature associated with tweets which, according to
Sankaranarayanan et al. (2009), has been successfully utilized in clustering tasks,
namely the hashtag. A hashtag is a word or expression that begins with the hash
symbol (#), and it is generally used to indicate the topic of the tweet. A query on
a particular hashtag returns all the tweets containing it. Considering that hashtags
are set at the user level, it is surprising to see how few hashtags are associated with
a single news issue (Sankaranarayanan et al., 2009).
2.1.2 User intention
Studies have shown Twitter users are of three kinds: information sharers,
information seekers, or friends/acquaintances (Java et al. (2007); Krishnamurthy
et al. (2008)).
The follower/followee network constructed by Kwak et al. (2010) using more than
41.7 million user profiles collected in July 2009 revealed the average path length to be
4.12, which the authors consider to be relatively short. They conclude that Twitter
is not only a social networking platform but also an information seeking facilitator.
Furthermore, tweet creation roughly follows Pareto's law, as less than 10% of
Twitter users produce more than 90% of all tweets (Sankaranarayanan et al., 2009).
2.1.3 Twitter in Portugal
In 2015, 54.8% of Portuguese people used social networks, according to a study
by Marktest, cited by Lusa (2016). By 2017 the penetration rate had increased to
59.1% (Marketeer, 2017). The main activities on social media are sending/receiving
messages, watching videos, chatting, and reading and sharing news (Lusa, 2016).
Twitter has a penetration rate of 23.6% among the population and of 41.9% among
enterprises (Lusa, 2016).
To the best of our knowledge, there is one attempt in the literature focused on
empirically characterizing Twitter in Portugal: Brogueira et al. (2016). This work
uses geolocated tweets collected from the Twitter streaming API from mid-September
2014 to mid-September 2015. The main findings are the following: the distribution
of users per district is very similar to the population distribution; regional
differences in Twitter usage throughout the year reflect the usual vacation
destinations; tweets containing URLs and retweets represent a very small portion of
the geolocated tweets (less than 3%), while mentions and replies add up to almost
35%, indicating these users mostly chat; and the top hashtags were related either to
football (soccer) or to television entertainment shows (Brogueira et al., 2016).
Although these results apply to geolocated tweets only, they are relevant aspects
of the Portuguese Twitter community that are taken into account during the
empirical part of the dissertation.
2.1.4 Twitter and news
The literature concerning Twitter and news is often directed at using tweets as a
news source in their own right. Indeed, there are some studies whose approach is to
regard Twitter as a substitute for (rather than a platform complementary to)
traditional news sources (e.g. Sankaranarayanan et al. (2009), Zhao et al. (2011),
Phuvipadawat and Murata (2010)). The main reason for this is perhaps the
realization that some news breaks first on Twitter. For example, Hu et al. (2012)
have shown that the capture and death of Osama Bin Laden was made public on
Twitter at least 20 minutes sooner than on major U.S. television channels. The
authors argue that this may happen due to the role of a particular set of influential
users, namely journalists and politicians, whose credibility provokes an immediate
reaction on social networks (Hu et al., 2012).
Sankaranarayanan et al. (2009) built a tool called TwitterStand with the goal of
collecting and diffusing breaking news quicker than conventional news media. This
system performs online clustering on filtered tweets from a set of manually selected
seeders - users that usually post news. In addition, it performs periodic checks to
avoid fragmentation and ensure minimal duplication of clusters, i.e., topics. It also
takes advantage of information in the content of the tweet and/or the user's profile
to associate topics with geographic locations. The authors believe that if the tweets
belonging to a certain cluster mostly come from one location or a set of close
locations, then the topic of that cluster is likely to pertain to that geographical area
(Sankaranarayanan et al., 2009).
Zhao et al. (2011) used a corpus of news articles from The New York Times (NYT)
and tweets from Edinburgh, gathered from November 11, 2009 to February 1, 2010,
to investigate how similar the topics on Twitter and in a traditional news source
are. Their results showed some differences regarding the most frequent categories
and types of topics: Twitter users tweet the most about family and life, a category
not covered by the NYT; arts is a topic similarly frequent on Twitter and in the
NYT; world is much more frequent in the NYT; lastly, while long-standing topics
have an equally strong presence, the same does not happen for entity-oriented and
event-oriented topics, with Twitter favouring the former and the NYT the latter
(Zhao et al., 2011). Regarding long-lasting topics, there is evidence that their
prevalence is not due to an increasing number of users tweeting about them, but to
a set of important users who discuss them over time (Kwak et al., 2010).
The above findings bring out relevant aspects of the similarities between Twitter
and conventional news sources. In particular, they emphasize the importance of a
certain type of user in social networks that fosters its role as a news medium. Still,
while the reputation and popularity of users is significant for the level of certainty
in the network regarding new information (Hu et al., 2012), it is the communication
structure built upon follower/followee relationships that renders Twitter such a
fast information diffusion network. In fact, this propagation may in some cases not
depend entirely on the first user's network: Kwak et al. (2010) have found that if
a message is retweeted, it quickly reaches an average of 1000 users, regardless of
the first user's number of followers. This is what the authors call `the emergence
of collective intelligence', in the sense that individuals decide what information is
good enough to spread and, once that decision is made, it almost instantly reaches
a massive audience (Kwak et al., 2010).
2.2 Text mining
2.2.1 De�nition and applications
Text mining is the process of retrieving useful and meaningful information from text.
As in data mining, it generally does so by identifying relevant patterns.
In a world where unstructured information is undoubtedly abundant - a survey
of data management professionals in 2006 revealed an average of 31% unstructured
plus 22% semi-structured data across the entire organization (Russom, 2007) - text
mining has grown considerably. It is applied in a variety of fields, such as the
analysis of patents, discovery of protein interactions, categorization of news stories,
spam filtering, and identifying industry trends for corporate finance, to name a few
(Hotho et al., 2005; Feldman and Sanger, 2006).
2.2.2 The case of unstructured text data
Even though the goal of text mining is conceptually similar to that of data mining
in general, the unstructured format of the data implies a greater effort in the
pre-processing stage. Unlike data mining, where data is often extracted from
databases in which the information is organized in records, the lack of structure in
text data means that the data are not ready to be used by common data mining
algorithms.
It should be noted that text data can have some form of structure. Some
documents are notoriously written following a specific format. For instance, a news
article usually has the following elements: headline, byline (author and date), lead
paragraph, body and conclusion. This type of text data can be classified as weakly
structured data. Other documents that are constructed in a format that facilitates
transmission (e.g. email, JSON) are considered semi-structured. However, the fact
remains that even these types of structure are not adequate to feed into a data
mining algorithm as is. The infinite possibilities of news headlines or email
recipients - the shortest components in the examples - per se do not allow for
comparison and pattern recognition. It is necessary to apply a representation model
to transform text data into an input that can be processed by machine learning
algorithms.
Additionally, handling textual information requires an understanding of natural
language, which is why many text mining techniques borrow knowledge from other
fields of study, particularly information retrieval (obtaining documents that answer
a certain query), information extraction (extracting specific information from
documents) and computational linguistics (Feldman and Sanger, 2006). Text mining
thus lies at the intersection of techniques from these related research areas and the
methods and algorithms of data mining (Hotho et al., 2005).
2.2.3 Document, features and the representation model
Feldman and Sanger (2006) define a document as a unit of discrete textual data
within a collection that usually, but not necessarily, correlates with some real-world
document, such as a business report, legal memorandum, e-mail, research paper,
manuscript, article, press release, or news story. In text mining, a document
collection is also known as a corpus.
Each document must be represented in a structured format. That transformation
implies identifying the features that best characterise the document and ease the
operations of text mining algorithms. According to Feldman and Sanger (2006),
features are usually characters, words, terms or concepts. The difference between
words and terms is that the latter can include multi-word expressions such as
`Mother Theresa' and `European Union'. Concepts, in turn, represent a number of
terms with familiar or related meaning and may include words not present in the
original documents.
The bag-of-words representation of documents uses these features without taking
into account the order in which they appear in each document. A document is
represented by its set of features, and when each of these features has a value for
that document, the document is represented as a feature vector. This is also known
as the vector space model.
Therefore, once the features have been produced, each document can be represented
by a feature vector, and a corpus by a document-term matrix. The values of the
vector or matrix depend on the weight attributed to the feature in the document
and/or corpus. This weight may be binary, i.e., equal to one if the feature is present
in the document and zero otherwise, or else dependent on the frequency of the term
in the document and/or corpus. In this area, the most widely used representation
format is the TF-IDF weighting, in which the weight of term t in document d is
given by

TF-IDF(t, d) = TermFreq(t, d) · log(N / DocFreq(t))

where TermFreq(t, d) is the absolute frequency of t in d, N is the number of
documents in the collection and DocFreq(t) is the number of documents containing
t. Note that this weighting scheme penalizes terms that are too frequent in the
corpus through the IDF component (the second factor).
The document-term matrix is hence the structured representation of the corpus of
documents, allowing comparisons between documents and enabling pattern search.
2.2.4 Pre-processing
Textual data has two major issues: high feature dimensionality and feature
sparseness.
For any given document collection, the number of possible features is normally
very high. Consequently, the data become very sparse. This may be costly for
two main reasons: (1) the increased space hinders computational performance; and
(2) as the number of variables becomes larger (possibly larger than the number of
observations), the performance of some algorithms tends to degrade. Thus, it is
convenient to keep only those features deemed informative and most representative
of the document collection.
Some common pre-processing tasks are:
• Tokenization. A token is a meaningful constituent of a text stream (e.g.
a chapter, section, paragraph, sentence, word) (Feldman and Sanger, 2006).
After punctuation marks removal, the goal is usually to split each document
in words.
• Parts-of-Speech (POS) Tagging. The tags article, noun, verb, adjective,
12
preposition, number and proper noun, among others, enrich the textual data
by marking their syntactic value.
• Filtering. Removing words that add no meaning to the document because
they appear in any document (e.g. prepositions, conjunctions, articles � also
known as stop words, i.e., the most common words in a language) is one way
to reduce feature dimensionality; but there are other and more sophisticated
feature selection techniques, based on, for example, the document frequency of
the word/term or, in the case of classi�cation problems, the information gain
of the word/term.
• Stemming. A stem is the basic form of a word: its root, or its root with
some suffix or prefix. The most widely used stemmer for the English language
is the Porter Stemmer (Porter, 1980).
• Lemmatization. A lemma is the dictionary form of a word (e.g. the lemma
of `are' is `be').
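These tasks can be chained into a small pipeline. The sketch below is Python (the dissertation's own implementation used R's tm and SnowballC packages); the stopword list and the crude suffix stripper are hypothetical stand-ins for real resources such as the Porter stemmer:

```python
import re

# Hypothetical English stopword list (real lists are much longer).
STOPWORDS = {"the", "a", "of", "in", "is"}

def crude_stem(token):
    # Naive suffix stripping: a toy stand-in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    text = text.lower()                      # lower-case conversion
    text = re.sub(r"[^a-z\s]", " ", text)    # drop punctuation and numbers
    tokens = text.split()                    # tokenization into words
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]

tokens = preprocess("The clustering of 600 tweets is ongoing!")
```

The toy stemmer is deliberately simplistic (it produces stems such as "ongo"); it only illustrates where stemming sits in the pipeline.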
2.2.5 Mining tweets
A very important aspect to consider is that short segments of text, such as tweets,
exacerbate the problem of sparseness in the dataset, and strategies relying on exact
word matching may be inadequate (Aggarwal and Zhai, 2012).
For that reason, some authors recommend the use of extra information (Genc
et al., 2011; Sriram et al., 2010; Aggarwal and Zhai, 2012). The approach of
Genc et al. (2011) was to map each tweet to the closest Wikipedia page and take
the distance between these pages as the distance between two tweets. Their
argument is that both tweets and Wikipedia pages are constructed by humans and
that the categorization of Wikipedia pages, in particular, echoes how our brains
link semantic structures.
Sriram et al. (2010), on the other hand, used the standard bag-of-words
construct plus eight additional features to categorize tweets and confirmed the
valuable contribution of extra information. These added features were the author,
the presence of abbreviations or slang, opinionated words, any currency or percentage
symbols, emphasis on words, mentions at the start and mentions within the tweet.
2.3 Clustering
2.3.1 Introduction
Aggarwal and Zhai (2012) define the clustering problem as that of finding groups
of similar objects in the data, with the objective of obtaining high inner-group
similarity and high inter-group dissimilarity. There are many categories of clustering
algorithms (Gama et al., 2015), some of which are presented below.
Connectivity-based clustering algorithms, also known as hierarchical, consider
that neighbouring objects are more similar, while more distant objects
should appear in separate clusters. Thus, objects are linked based on their
distances, and a dendrogram of the resulting hierarchy can be drawn.
The approach to linking the objects can be either agglomerative, where all objects
start in isolated clusters and are successively merged based on smallest
distances; or divisive, where one big cluster containing all objects is initially formed
and successively partitioned into smaller groups based on greatest distances.
The linkage method (or aggregation index) determines how to assess the distances
between objects and/or groups: complete linkage considers the distance
between the furthest elements of each group; single linkage, the distance
between the nearest elements of each group; average linkage, the average
dissimilarity between the elements of each group; centroid linkage, the distance
between centroids; and Ward's linkage, the increase in inertia (between-
and within-class dispersion) when two groups are merged.
Since the final result is a hierarchy, obtaining the final clusters requires
applying a cut-off criterion to the tree-like structure.
In centroid-based clustering algorithms, the distances are computed to a
cluster centre (a specific data point, called a medoid, or a central vector), and
clusters are improved in an iterative process. The number of clusters needs to be
set at the start. Initial centres can be randomly picked or user-chosen, and the
idea is to allocate each data point to the closest centre, recalculate the cluster
centres and repeat, until no further improvements are possible.
Albeit efficient, these algorithms can converge to a local optimum, so it is
important to execute them more than once with different seeds and evaluate
the results. The most widely known, and still widely used, algorithm of this
category is k-means (MacQueen, 1967; Jain, 2010).
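The restart strategy just mentioned can be sketched as follows: run k-means several times from different random seeds and keep the solution with the lowest inertia. This is a minimal 1-D toy, not the document-clustering setup used later in the dissertation:

```python
import random

# Minimal 1-D k-means with several random restarts to mitigate convergence
# to a poor local optimum (toy data; real inputs would be TF-IDF vectors).
random.seed(1)
data = [1.0, 1.1, 0.9, 8.0, 8.2, 7.8]
k = 2

def kmeans_once(points, k):
    centres = random.sample(points, k)          # random initial centres
    for _ in range(20):
        groups = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda i: abs(x - centres[i]))
            groups[nearest].append(x)
        # Recompute each centre as the mean of its assigned points.
        centres = [sum(g) / len(g) if g else centres[i]
                   for i, g in enumerate(groups)]
    inertia = sum(min((x - c) ** 2 for c in centres) for x in points)
    return centres, inertia

# Keep the restart with the lowest within-cluster inertia.
best_centres, best_inertia = min((kmeans_once(data, k) for _ in range(5)),
                                 key=lambda result: result[1])
```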
Other categories of algorithms include density-based clustering algorithms,
where clusters correspond to regions with high densities of objects, separated by
regions with low densities (e.g. DBSCAN (Ester et al., 1996)), and
distribution-based clustering algorithms, where clusters are formed based on
the probability of their members following the same distribution (e.g. clustering
using the Expectation-Maximization algorithm (Dempster et al., 1977)).
For distance-based algorithms, a measure of (dis)similarity is needed, such as
the Euclidean, Manhattan, Chebyshev or Mahalanobis distances. For text
data, the most widely used is the cosine similarity, which, unlike the Euclidean
distance, takes into account only the direction of the vector representing the
document, not its magnitude. This is particularly important when
working with collections of documents of variable length, in which the weight of
a particular term may be bigger not because it appears relatively more often in a
document than in any other, but simply because that document is longer.
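A small sketch makes the difference concrete: a document and the same document repeated twice point in the same direction (cosine similarity 1) yet are far apart in Euclidean terms. The term-count vectors below are hypothetical:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

short = [1.0, 2.0, 0.0]   # term counts of a short document
long_ = [2.0, 4.0, 0.0]   # the same text repeated twice

cos_sim = cosine(short, long_)    # direction only: identical documents
eucl = euclidean(short, long_)    # magnitude matters: length is penalized
```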
According to Gama et al. (2015), the clustering analysis is followed by an
evaluation stage, where an objective assessment is made regarding the significance
of the resulting clusters and whether the number of clusters is adequate. However,
the authors recognize that this is still an open area, and make the following remarks:
• There is no universal clustering algorithm capable of capturing every possible
underlying data structure;
• If the goals of two clustering algorithms are different, it may not make sense
to compare their results;
• Knowledge of the data and its domain is paramount not only for determining
what data transformations are necessary but also for choosing the most
adequate similarity measures and understanding the properties inherent to each
clustering algorithm.
Nevertheless, there are three types of criteria to evaluate the quality of a clus-
tering (Gama et al., 2015):
1. Relative criteria, focusing on measuring which algorithm best fits the data
or on the most adequate number of clusters for a given algorithm. For example,
the intra-cluster variance measures how compact the clusters are, while the
connectivity assesses the degree to which neighbouring objects are positioned
in the same cluster.
2. Internal criteria, focused on determining to what degree the partitioning
represents the inherent data structure. For example, the Gap statistic compares
the total intra-cluster variation with the expected value under a null reference
distribution (i.e., a distribution with no obvious clustering).
3. External criteria, with the goal of evaluating how well the clustering obtained
confirms a given pre-specified hypothesis. For example, the Jaccard index
determines the probability of two objects of the same cluster also being clustered
together in a different clustering scheme.
By interpreting the resulting classes, one can also validate the clustering results,
especially with the support of a domain expert. The interpretation may include
some form of labelling, computation of mean values or visual representations of the
clusters.
Regarding the challenges in data clustering, Jain (2010) notes the importance
of building benchmark data from different domains to test and evaluate clustering
algorithms. Also, as no single clustering algorithm clearly outperforms all others
across every domain, it is suggested that algorithms should be designed
and used according to the application needs. Finally, the rise of semi-supervised
methods is encouraged, as the user's domain expertise, encoded as pair-wise
must-link and/or cannot-link constraints, is still very important for obtaining
good quality clusters.
2.3.2 Text clustering
The additional challenges of clustering text, rather than quantitative or even
categorical data, are the following (Aggarwal and Zhai, 2012):
• High dimensionality and sparseness: on the one hand, the range of possible
words in a document (i.e. the glossary of the document collection)
is extremely large; on the other hand, each document contains a relatively low
number of them.
• Word correlations: even though the lexicon is large, there are generally many
words relating to a single concept.
• Variable document length: the representation of such items should take into
account a normalization process.
To face these challenges, a series of pre-processing tasks are usually employed before
the clustering process, some of which were detailed in section 2.2.4.
In addition to the feature selection techniques already discussed, as Aggarwal
and Zhai (2012) emphasize, dimensionality reduction can also be achieved through
feature transformation methods, an example of which is Latent Semantic
Indexing (Deerwester et al., 1990). This technique represents the documents in a
new (smaller) feature space where the final features are linear combinations of the
original features, thus eliminating noisy dimensions from the data (synonymy and
polysemy) and enhancing its semantic value. This is particularly valuable in the
context of text clustering.
Clustering techniques based on distances use similarity measures to evaluate
how close or far apart the objects are. Huang (2008) performed an experiment on
seven different datasets, of which four comprised newsgroup posts or newspaper
articles. Her results showed that the cosine similarity, Pearson's correlation
coefficient and the averaged Kullback-Leibler divergence clearly outperformed the
Euclidean distance in all datasets in terms of entropy and purity. However, this
experiment was conducted using the k-means algorithm, which belongs to the
partitioning family of clustering algorithms, and therefore conclusions should be
drawn only for this type of clustering.
A compromise between the robustness of hierarchical clustering methods and the
efficiency of partitioning clustering methods involves the use of a hybrid approach.
In Cutting et al. (1992), such an approach is used in order to provide an efficient
interactive experience in document browsing. Specifically, the authors discuss two
techniques, buckshot and fractionation, to find the initial centres to feed to the
partitioning clustering algorithm. The former consists of taking a sample of
$\sqrt{k \cdot N}$ documents and performing hierarchical clustering to find k
centres that are more robust than if randomly chosen. The latter implies dividing
the corpus into N/m buckets of size m > k, applying hierarchical clustering to each
of them, and then using the cluster centres to reapply the clustering routine
iteratively until k clusters have been obtained. Both techniques provide better
seeds for a more computationally efficient algorithm, like k-means, to begin with
when clustering the complete and larger document collection. In this study, the
buckshot technique was applied.
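A minimal sketch of the buckshot idea on hypothetical 2-D points (the dissertation applies it to TF-IDF document vectors, in R): hierarchically cluster a sample of size roughly √(k·N) down to k groups and use the group centroids as k-means seeds:

```python
import math
import random

# Hypothetical 2-D data in k well-separated bands.
random.seed(0)
N, k = 100, 4
data = [(random.random() + 2 * (i % k), random.random()) for i in range(N)]

sample_size = int(math.sqrt(k * N))      # buckshot sample: ~ sqrt(k*N) points
sample = random.sample(data, sample_size)

def centroid(pts):
    return (sum(p[0] for p in pts) / len(pts),
            sum(p[1] for p in pts) / len(pts))

# Naive agglomerative clustering (centroid linkage) down to k clusters.
clusters = [[p] for p in sample]
while len(clusters) > k:
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            ci, cj = centroid(clusters[i]), centroid(clusters[j])
            d = (ci[0] - cj[0]) ** 2 + (ci[1] - cj[1]) ** 2
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    clusters[i] += clusters.pop(j)       # merge the two closest clusters

# Robust initial centres to seed k-means on the full collection.
seeds = [centroid(c) for c in clusters]
```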
There are other types of text clustering techniques, such as probabilistic
clustering, of which topic modelling is an example (Aggarwal and Zhai, 2012).
In this case, each document and each term of the lexicon has a probability of
belonging to one of k topics. This also makes topic modelling a clustering
technique that determines document clusters and word clusters at the same time,
following the notion that good clusters of words are indicative of good clusters
of documents.
Chapter 3
Methodology
3.1 Motivation
Journalism and social media have become more intricately interconnected.
Traditionally, people resort to mainstream media to know what is happening in
the world. However, this dynamic has been changing in recent years, at least for
some news topics, because a great proportion of the world population now has
access to platforms which broadcast real-time events to an equally worldwide
audience. Consequently, the process of news generation and dissemination has in
some cases shifted from journalists to the general public. It has been recognised
in various studies (Newman, 2009; DVJ Insights and ING The Netherlands, 2015)
that journalists spend much of their time scouting social media for interesting
topics to write about, relying on these platforms as reliable sources.
It is therefore relevant to study how events or news come about on these two
types of platform (news articles and social posts), particularly whether and how
they similarly arise, disseminate, gain strength and die. The goal of this
dissertation is to characterize the news or events published by news sources and/or
commented on and shared on social media during a period of six months, focusing
on the timing of their generation and the intensity with which they are mentioned
on each platform, through the use of text mining techniques. The next section
describes the empirical steps taken to achieve this goal, and the following sections
provide more details regarding each of them.
3.2 Four main stages of the method
The empirical process used in this study can be observed in Figure 3.1 and is
summarized below.
1. Data gathering: obtaining the data. On the news side, on-line news articles
were used; on the social media side, tweets were used (see section 3.3).
2. News clustering: forming groups of similar news (see section 3.4).
3. Assignment of tweets to news clusters: allocating tweets about the same
stories to the groups of news obtained (see section 3.5).
4. Analysing the resulting groups of news articles and tweets: temporality
between news and tweets (see section 3.6).
Figure 3.1: Four main stages of the method
3.3 Data
Both tweets and on-line news articles were provided by the POPmine platform
developed at SAPO Labs, a partnership between an internet services and products
provider and an academic institution, namely the University of Porto. This
platform continuously gathers tweets of approximately 100 thousand Portuguese
users and news articles from over 40 Portuguese press sources (Saleiro et al., 2015).
The data were collected during the year of 2016. In total, there are more than 600
thousand news articles and almost 38 million tweets, distributed fairly uniformly
across the year.
For computational reasons, the data time frame used in this study was reduced
from one year to one semester. Both types of documents (news articles and tweets)
were received in text files in the JSON format¹. The first task was to read and
import the relevant components into a database, in order to ease data cleaning,
transformation, access and analysis. The ETL² process was particularly important
in the case of tweets, since each tweet can have multiple components recorded in a
nested structure, which would be extremely difficult to read and use as is (see
example in Appendix A). Figure 3.2 shows this setup process, as well as the tools
utilized for information extraction, transformation, loading and storage/management.
Figure 3.2: Data setup
¹ JSON stands for JavaScript Object Notation. It is a file format used to easily transmit data, using attribute-value pairs and array data structures. See json.org for more information.
² ETL: extract, transform, load.
The following components were gathered and imported into the database:
• News articles: article id (integer), title (string), body (string), date of
publication (timestamp), source (string), url (string).
• Tweets: tweet id (integer), text (string), date of posting (timestamp), user
(string), URLs shared (string).
The next stages were implemented in R (R Core Team, 2017) and rely on the
information stored in this database, accessed through the RODBC package (Ripley
and Lapsley, 2017).
3.4 News clustering
The same story or event can be published in many different articles and shared
or commented on in many tweets. It is therefore first necessary to identify such
stories. The news articles were chosen as the base for that identification in
preference to tweets, as the latter are particularly prone to being about personal
life rather than news topics. They are also more difficult to mine due to the
shortness of the text, as seen in the previous chapter.
To obtain stories or groups of similar stories from the first semester of 2016,
clustering techniques were applied to a sample of on-line news articles.
3.4.1 Sample
The sample construction considered the importance of having, on the one hand, news
articles that were shared on social media, and, on the other hand, news articles that
were not shared on social media.
It is possible to know if a news article was shared on Twitter by looking up
its URL in the tweets information collected. The former are necessary so that, in
the following stage, the assignment of tweets to the clusters can be evaluated. We
assume that if a tweet contains a link to a news article, then that tweet is about
the same story as that news article. However, if only these news articles were
included in the sample, it would be extremely likely that, when analysing the final
clusters comprised of both news articles and tweets, the following conclusion would
be drawn: the story was first brought up by the press, because if a link was shared,
the tweet necessarily came after the article.
By also including news articles that were not shared on Twitter, we allow news
articles published later to enter the cluster. Additionally, if there are clusters of
news articles with no similar tweets assigned to them, it is possible to identify
stories that are not as relevant on social media, which is useful for gaining insights
into the least talked about topics in Portuguese society.
3.4.2 Pre-Processing
Standard pre-processing techniques were applied to the news articles, including the
ones listed below. We do not further discuss these techniques, as they were described
in Chapter 2.
• Tokenization
• Lower case conversion
• Punctuation and numbers removal
• Portuguese stopwords removal
• Repeated expressions removal
• Stemming
These tasks were performed in R (R Core Team, 2017) using the tm (Feinerer and
Hornik, 2017) and SnowballC (Bouchet-Valat, 2014) packages. Examples of article
pre-processing are given in Table 4.3, in Chapter 4.
Tagging
Additionally, as a feature selection technique, terms were tagged according to
their syntactic value (parts-of-speech) and any term not classified as a verb, noun
or proper noun was discarded. The recognition of named entities in the text of
news articles was also included. The reason behind this step is that the purpose of
clustering is to find groups of stories or events, and it is expected that including
named entities, such as personalities, names of events and locations, as features
will help the representation and the consequent pattern recognition.
For parts-of-speech tagging, the resources used were the openNLP and NLP
packages (Hornik, 2016, 2017). For named entity recognition, we used the PAMPO
package (Rocha, 2016). The PAMPO method (Rocha et al., 2016) for named entity
extraction was built for the Portuguese language and is based on two algorithms:
the first generates candidates by gathering common named entity terms, such
as capitalized words and personal titles; the second performs candidate selection
based on parts-of-speech tagging. Performance results on a Portuguese news corpus
were a recall of 0.91, a precision of 0.959 and an F1 score of 0.932.
Representation
The news corpus was then represented in a document-term matrix (dtm), with
normalized TF-IDF weights, according to the following formula:
$$ tf_{i,j} \cdot idf_i = \frac{tf_{i,j}}{\sum_k n_{k,j}} \cdot \log_2 \frac{|D|}{|\{d \mid t_i \in d\}|} $$

where $tf_{i,j}$ is the absolute frequency of term $t_i$ in document $d_j$, $\sum_k n_{k,j}$ is the sum
of absolute frequencies of all terms $k$ in document $d_j$, $|D|$ is the total number of
documents in the collection, and $|\{d \mid t_i \in d\}|$ is the number of documents in the
collection where $t_i$ appears.
Dimensionality reduction
For dimensionality reduction purposes, the maximum level of sparseness of the
dtm was set to 0.98, which means that any feature not present in at least 2% of
the documents was discarded. The 0.98 level guarantees no documents were left
with only zero entries, while significantly reducing the number of features.
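The pruning rule can be sketched as follows. The 0.98 threshold only bites on a large corpus; the toy below uses a stricter threshold of 0.5 so the effect is visible on four documents (hypothetical counts):

```python
# Sparsity-based term pruning: drop terms present in too few documents.
# Rows are documents, columns are terms (hypothetical toy counts).
dtm = [[1, 0, 2],
       [0, 0, 1],
       [3, 0, 1],
       [0, 1, 4]]

max_sparsity = 0.5               # the dissertation used 0.98 (>= 2% of docs)
min_doc_frac = 1 - max_sparsity
n_docs = len(dtm)

# Keep a column only if its document frequency meets the minimum fraction.
keep = [j for j in range(len(dtm[0]))
        if sum(1 for row in dtm if row[j] > 0) / n_docs >= min_doc_frac]
pruned = [[row[j] for j in keep] for row in dtm]
```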
The final document-term matrix was built using the tm and RWeka packages
(Feinerer and Hornik, 2017; Witten and Frank, 2005; Hornik et al., 2009).
3.4.3 Clustering
The clustering experiment conducted included hierarchical and k-means clustering.
Hierarchical clustering
The hierarchical clustering algorithm used is of an agglomerative nature, which
means each observation (news article) starts isolated in a cluster of its own; then,
clusters are iteratively joined according to greatest similarities, until there is one
single cluster. So, at each iteration, there is a need to compute a similarity (or
dissimilarity) value between each pair of the current clusters. In this case, the
dissimilarity measure used was the Euclidean distance, which is given by:

$$ d(v_1, v_2) = \sqrt{\sum_{f=1}^{p} (v_{1f} - v_{2f})^2} $$
where $v_1$ and $v_2$ are the feature vectors representing the two elements to compare and
$p$ is the total number of features. The values of the feature vectors are the TF-IDF
values subject to a normalization at the document level, which means that the length
of a document will not influence its distance to a document of different (larger or
smaller) length.
Regarding the linkage method, i.e., how the dissimilarity between two clusters
is determined, Ward's index was chosen (Ward Jr, 1963). This means that the
dissimilarity is computed as the increase in inertia when the two clusters are joined.
K-means clustering
The k-means algorithm (MacQueen, 1967) starts by selecting a set of k initial
centres from the observations (in this case, news articles) and assigning each of the
remaining observations to its closest centre, immediately updating the centre of
the chosen cluster to its mean point. Once every remaining observation is allocated
to a cluster centre, the solution is optimized by repeating the assignment operation
for each of the observations, until convergence is achieved, that is, until there is no
change in the cluster centres. The selection of the initial centres can be random or
user-defined.
For this type of algorithm, the number of clusters (k) needs to be set at the start.
To determine the number of clusters, two methods were used: the representation of
the aggregation indices of the hierarchical clustering and the representation of the
explained inertia. The latter is the between-class dispersion (B), measured as the
sum of squared distances of the cluster centres (centroids) to the centre of gravity
g, divided by the total dispersion of the data (T), measured as the sum of squared
distances of every observation to the centre of gravity:

$$ B = \frac{1}{n} \sum_{h=1}^{k} n_h \, d^2(g_h, g) \qquad T = \frac{1}{n} \sum_{i=1}^{n} d^2(I_i, g) $$
The explained inertia (B/T) naturally increases with the number of clusters, and
the goal is to find the value of k at which the marginal gain starts to decrease
(the elbow method) (Bholowalia and Kumar, 2014). The same rationale applies to
the aggregation indices, in reverse.
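For a fixed partition, the explained inertia B/T can be computed directly (1-D toy data, hypothetical); in the elbow method this quantity would be evaluated for a range of values of k:

```python
# Explained inertia B/T for a given partition of 1-D toy data.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
clusters = [data[:3], data[3:]]          # two hand-picked groups

g = sum(data) / len(data)                # centre of gravity of all points
n = len(data)

# Between-class dispersion: squared distance of each centroid to g,
# weighted by cluster size.
B = sum(len(c) * (sum(c) / len(c) - g) ** 2 for c in clusters) / n
# Total dispersion: squared distance of every observation to g.
T = sum((x - g) ** 2 for x in data) / n

explained = B / T    # fraction of total dispersion explained by the partition
```

For this well-separated toy partition the explained inertia is close to 1; a poor partition of the same data would yield a much lower value.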
Hybrid clustering
The final clustering method chosen, whose results are presented in the next chapter,
was of a hybrid nature, following in the footsteps of Cutting et al. (1992), namely
the buckshot method: choose the initial centres for k-means clustering by
performing hierarchical clustering on a sample of $\sqrt{k \cdot N}$ news articles
and computing the resulting k centroids. This way, the described methodology can
still be applied to a larger sample of news articles without losing either the
robustness of hierarchical clustering or the efficiency of k-means clustering.
The tasks of hierarchical clustering and k-means clustering were implemented
using the factoextra (Kassambara and Mundt, 2017) and ClustGeo (Chavent et al.,
2017) packages.
Cluster labelling
Each cluster was labelled with keywords selected from the dictionary of features
produced at the end of the pre-processing stage. For each cluster centroid, the
terms with the highest TF-IDF values were considered to be the most representative
of that cluster. The number of keywords per cluster depended on the cluster size
and was set in the following manner: (i) determine the minimum number of keywords
for a cluster of size one; (ii) increase the number of keywords logarithmically, in
order to capture the lexicon variety in larger clusters while keeping the
characterization limited to a relatively small set of keywords.
Let $|C_k|$ be the number of documents in cluster k and m the minimum number
of keywords for a cluster of size one. The number of keywords of cluster k is:

$$ W_k = \log_2(|C_k| + 1) \cdot m $$
Cluster keywords were visually represented using the wordcloud package
(Fellows, 2014).
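The keyword-count rule can be sketched as follows (m = 3 is a hypothetical choice for the minimum number of keywords; the dissertation does not state the value used):

```python
import math

def n_keywords(cluster_size, m=3):
    # Logarithmic growth: W_k = log2(|C_k| + 1) * m, rounded to an integer.
    return round(math.log2(cluster_size + 1) * m)

sizes = [1, 7, 63]                       # hypothetical cluster sizes
keywords_per_cluster = [n_keywords(s) for s in sizes]
```

A 63-document cluster gets only six times as many keywords as a singleton, keeping labels compact even for large clusters.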
3.5 Assignment of tweets to clusters
Once the news article clusters are formed, it is possible to add tweets about the
same stories. To determine if a given tweet should be assigned to a cluster, we used
a measure of similarity between that tweet and the cluster centroids. Next, we
describe the methodology for this assignment.
3.5.1 Sample
All tweets with links to a clustered news article were included in the sample, thus
allowing a later evaluation of the assignment method utilized. The sample size was
then doubled by including tweets with no link to a clustered news article. Since
there were over 19 million tweets to choose from, we required that, for a tweet with
no news article URL to be selected, it contain at least two terms from the set of
keywords representing the clusters.
The reason for including this second set of tweets is straightforward: if only
tweets with links to news articles were used, every story would be found to be first
talked about by the press, since those tweets can only exist after the shared URL
exists, and if the URL exists, an article has been published.
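The selection rule for tweets without article links can be sketched as a simple filter (hypothetical keyword set and tweets; real matching would operate on pre-processed terms):

```python
# Keep only tweets containing at least two cluster keywords.
keywords = {"costa", "governo", "benfica", "euro"}   # hypothetical keywords

tweets = ["o governo de costa anunciou novas medidas",
          "bom dia a todos",
          "benfica vence no euro sub-21"]

selected = [t for t in tweets
            if sum(1 for w in t.split() if w in keywords) >= 2]
```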
3.5.2 Pre-processing
Similarly to the news articles, every tweet was subjected to the pre-processing
techniques listed in section 3.4.2.
Representation
The corpus of tweets was transformed into a document-term matrix (dtm), using
the dictionary of features from the news articles dtm. The dtm also included the
cluster centroids as documents, so that the TF-IDF weights later used to compute
the similarity between each tweet and centroid were not based solely on the tweets
corpus.
3.5.3 Assignment to clusters
Each tweet was assigned to the cluster whose centroid was closest based on the
cosine similarity. This similarity measure is commonly used in text mining (Huang,
2008), particularly because it is not influenced by the size of the documents.
Let $v_1$ and $v_2$ be two non-zero vectors, which in this case represent a given tweet
and a certain centroid. The cosine similarity between these vectors is:

$$ \cos(\theta) = \frac{v_1 \cdot v_2}{\|v_1\| \cdot \|v_2\|} $$

If equal to zero, the vectors point in orthogonal directions, which means that the
similarity between the two documents is non-existent. If equal to one, the angle
between the two vectors is zero and hence the similarity between the two documents
is maximal.
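The assignment step can be sketched as follows (hypothetical TF-IDF vectors for two centroids and one tweet; the real dtm has far more dimensions):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hypothetical TF-IDF vectors: two cluster centroids and one tweet.
centroids = {"politics": [0.9, 0.1, 0.0],
             "sports":   [0.0, 0.2, 0.8]}
tweet = [0.1, 0.1, 0.7]

# Assign the tweet to the cluster with the most similar centroid.
assigned = max(centroids, key=lambda c: cosine(tweet, centroids[c]))
```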
3.5.4 Evaluation
Because all the data used in this study are unlabelled, results cannot be directly
evaluated. The methodology described so far has addressed this issue by including
tweets with links to clustered news articles. It can be assumed that if a tweet shares
a specific news article, it should belong to the same cluster. So, we consider that
cluster as the real class of the tweet and compare it with the results of the
assignment based on similarity to cluster centroids.
Accuracy, precision, recall and F1 measures were used to evaluate these results.
Accuracy is the percentage of correctly assigned tweets (true positives) over the
total number of observations evaluated. Precision evaluates the positive predictions,
that is, the percentage of true positives among the positive predictions of a given
class. In multi-class problems, the macro precision can be computed as the average
of the per-class precisions. Recall assesses the percentage of true positives among
the actual positives of a class. The F1 score is the harmonic mean of precision and
recall, and is used when neither false positives nor false negatives are more
important.
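As an illustration of these measures on a toy multi-class example (hypothetical labels):

```python
# Accuracy and macro precision on hypothetical true/predicted class labels.
true = ["a", "a", "b", "b", "c", "c"]
pred = ["a", "b", "b", "b", "c", "a"]

classes = sorted(set(true))

def precision(cls):
    # Of the observations predicted as cls, how many truly are cls?
    predicted = [t for t, p in zip(true, pred) if p == cls]
    if not predicted:
        return 0.0
    return sum(1 for t in predicted if t == cls) / len(predicted)

macro_precision = sum(precision(c) for c in classes) / len(classes)
accuracy = sum(1 for t, p in zip(true, pred) if t == p) / len(true)
```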
In addition, we borrowed the concept of precision at n from the information
retrieval field. In that context, performance measures consider as true positives the
relevant documents among those retrieved by a query. When considering only the
topmost query results instead of all of them, the measure is called precision at n
(P@n), where n is the cut-off rank. This is particularly used for web search
engines, where the performance of the first results is more important than the
overall performance (Schütze et al., 2008).
In the context of this study, we made the following adaptation: each observation
has n predictions, based on the n closest news cluster centroids. If the true class of
a tweet is among the n topmost predictions, it is considered a true positive. The
P@n is therefore the percentage of observations whose true class is present in the
top n predictions.
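The adapted P@n can be sketched as follows (hypothetical ranked predictions and labels):

```python
# Per-tweet predictions: clusters ordered by decreasing centroid similarity.
ranked_preds = [["c1", "c3", "c2"],
                ["c2", "c1", "c3"],
                ["c3", "c2", "c1"]]
true_labels = ["c3", "c2", "c1"]

def precision_at_n(preds, truth, n):
    # A hit: the true cluster appears among the n top-ranked predictions.
    hits = sum(1 for p, t in zip(preds, truth) if t in p[:n])
    return hits / len(truth)

p_at_1 = precision_at_n(ranked_preds, true_labels, 1)
p_at_2 = precision_at_n(ranked_preds, true_labels, 2)
p_at_3 = precision_at_n(ranked_preds, true_labels, 3)
```

P@n is non-decreasing in n; at n equal to the number of clusters it reaches 1 by construction.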
3.6 Analysis
Once the groups of similar articles and tweets have been formed, it is possible to
study how the underlying events or stories develop over time, and in particular
whether the role of the press as a story breaker is still indisputable or, on the
contrary, whether some stories break first on social media. This section discusses
how this analysis was conducted.
3.6.1 Timing of events on the news and on social media
Graphical representation
Firstly, the date and time of publication of each document were retrieved from the
database. This allows the cluster documents to be represented on a timeline,
showing the evolution of the number of articles and tweets in that cluster.
Temporality between news and tweets
In order to get a picture of the temporality between articles and tweets, a second
analysis was made. For each news article in a given cluster, the time difference to
that cluster's median tweet was computed. So, each cluster is characterized by a
set of lag values, which represent the timing difference between the articles and the
moment the social network is fully engaged in the discussion of that story. If
there is a significant proportion of articles with positive lag values, it is a strong
indicator that, for that particular group, the press had the more important role in
starting the discussion; if, on the contrary, there is a significant proportion of
articles with negative lag values towards the median tweet, it is a sign that the
discussion is likely to have started on social media.
As each cluster contains a set of linked articles and tweets, this analysis can
be biased towards the first hypothesis, since for these specific documents every
article comes before its linked tweet (a news article URL can only be shared by a
tweet after the article has been published). Hence, we further excluded these linked
articles and tweets from this analysis.
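The lag computation can be sketched as follows (hypothetical timestamps; positive lag means the article preceded the cluster's median tweet):

```python
from datetime import datetime, timezone
from statistics import median

# Hypothetical posting times for one cluster.
tweets = [datetime(2016, 3, 1, 10, tzinfo=timezone.utc),
          datetime(2016, 3, 1, 12, tzinfo=timezone.utc),
          datetime(2016, 3, 1, 18, tzinfo=timezone.utc)]
articles = [datetime(2016, 3, 1, 9, tzinfo=timezone.utc),
            datetime(2016, 3, 1, 15, tzinfo=timezone.utc)]

# Median is taken over numeric timestamps (seconds since the epoch).
median_ts = median(t.timestamp() for t in tweets)

# Positive lag: the article was published before the median tweet.
lags_hours = [(median_ts - a.timestamp()) / 3600 for a in articles]
```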
Remarks on the analysis made
This analysis was conducted on a subset of clusters, selected considering both the
cluster size and the per-class precision obtained from the previous stage of tweet
assignment.
Additionally, as this is an exploratory analysis, directed at gaining a general
picture of the temporality of news and tweets in Portugal, we emphasize that
conclusions should be interpreted carefully, as there is no attempt at evaluating
them in this study.
Chapter 4
Temporal Analysis of News and
Tweets
The aim of this chapter is to present the empirical results obtained following the
methodology described in the previous chapter. It begins by describing the process
of clustering news articles. Then it presents the results of the assignment of tweets
to those clusters. Finally, the resulting clusters of news articles and tweets are
analysed.
4.1 News clustering
Usually there are many news articles related to the same event or story, either due
to the existence of many news sources that publish it or because it evolves through
time. The goal of clustering the on-line news articles is to segment the published
stories so that articles of the same story or similar stories are grouped together.
This will allow studying when a story or group of similar stories is brought up by
the press and, later, comparing this to what happens on Twitter.
4.1.1 Selection of articles
The on-line news articles provided by the POPmine platform were collected during
2016 and amounted to over 600 thousand items. Figure 4.1 presents the frequency
of the available news articles for this investigation over that year. The monthly
average is 51.7 thousand news articles.
Figure 4.1: Number of news articles in the dataset per month of 2016
For computational reasons, only a sample of news articles was used to test the proposed methodology. Our sample included news articles from the first semester of 2016, which is still a reasonably long period in which to study the timing of stories in the news and social media, and reduces the dataset size by approximately 50%.
First attempts at news clustering revealed the prevalence of sports-related content, which can be confirmed in Table 4.1: 58% of the articles are from generalist news sources, 28% from sports news sources, 8% from economics news sources, 3% from technology news sources and 3% from other types of news sources. In order to obtain a wider range of topics in the groups of articles formed, only articles from the manually selected set of generalist news sources highlighted in bold in Table 4.1 were used. This further reduced the sample size by about 42%, to about 174 thousand news articles.
Furthermore, although the representation model used a normalized TF-IDF weighting scheme, first attempts also revealed that longer news articles tended to be
Table 4.1: Number of news articles per press source available
grouped together. Indeed, a careful examination of the length of news articles revealed that it could vary from as little as one word up to 4731 words. The boxplot in Figure 4.2 shows the presence of outliers in terms of document length measured as the total number of characters (extreme outlier: 5011 characters; moderate outlier: 3349 characters). The scatterplot in Figure 4.3 shows clustering results (k=100) on a sample of three thousand articles: longer articles tend to be clustered together, whereas shorter ones easily tend to be separated.
This could be explained by the fact that longer articles have a wider range of vocabulary and therefore similarities between them are easier to find, whereas for
Figure 4.2: News article length (boxplot)
Figure 4.3: Cluster size and mean length of articles
shorter articles it is the opposite and hence they appear in separate clusters.
For these reasons, we have used another criterion for selecting news articles: to
keep those with length between 100 and 3349 characters. The lower bound of 100
characters is used to exclude rather short and uninformative articles, such as the
following examples: `Dados são relativos à zona euro e à União Europeia em geral.' ;
`Veja na íntegra o debate entre os três candidatos presidenciais, transmitido na SIC
Notícias.'. It also prevents some articles that may not have been fully or correctly
collected (for example, only the subtitle was registered) from entering the sample.
The final criterion for selecting the articles was to include both articles that were shared on Twitter and articles that were not. The reasons for this are explained further on. Hence, 50% of the final sample is comprised of on-line news articles whose URL was shared in at least one tweet and 50% of on-line news articles whose URL was not found in any of the tweets.
With the above criteria, there were 3037 on-line news articles with a link to at least one tweet. The other 50% was randomly assembled. The final sample therefore includes 6074 news articles.
4.1.2 Preprocessing and representation model
Standard preprocessing techniques were applied, as outlined earlier in Chapter 3
(section 3.4.2).
Each document was converted to lower case and stripped of punctuation, numeric
characters and Portuguese stopwords. Words were stemmed using the Portuguese
Snowball stemmer (Bouchet-Valat, 2014).
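As an illustration, the steps above can be sketched in Python (the actual implementation used R's tm and SnowballC packages); the stopword list here is a deliberately tiny, hypothetical subset of the full Portuguese list, and the final Snowball stemming step is indicated only by a comment:

```python
import re
import string

# Illustrative subset of Portuguese stopwords (the full list was used in practice).
STOPWORDS = {"a", "o", "e", "de", "do", "da", "em", "que", "não", "na", "no", "um", "uma"}

def preprocess(text: str) -> list[str]:
    text = text.lower()                                              # lower case
    text = re.sub(rf"[{re.escape(string.punctuation)}]", " ", text)  # strip punctuation
    text = re.sub(r"\d+", " ", text)                                 # strip numeric characters
    tokens = [t for t in text.split() if t not in STOPWORDS]         # remove stopwords
    return tokens  # Portuguese Snowball stemming would be applied to these tokens

print(preprocess("A polícia está a investigar um óbito, ocorrido em 2016."))
```

Applying the stemmer to the resulting tokens would yield stems such as those shown in Table 4.3.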
The first analysis of the most relevant terms obtained from the document-term matrix after these preprocessing tasks revealed the need to remove some expressions that frequently appeared in the body of several news articles, as exemplified in Table 4.2. These types of expressions were removed from the articles directly at the
database level, before carrying out the selection of articles for our study.
For feature selection purposes, part-of-speech (POS) tagging was also used, so that only nouns, verbs and proper nouns were kept. By observing the most relevant terms with and without POS filtering, we concluded that this step also helped to
improve the quality of labels produced per cluster.
Another improvement to the representation of news articles was the identification of named entities and their inclusion as features. The application of PAMPO (Rocha et al., 2016) identified approximately 29 thousand named entities in our article sample.

Table 4.2: Frequent expressions in news articles

Expression: Siga o CM no Facebook
Comment: Equivalent to: Follow CM on Facebook.

Expression: Os nossos termos e condições de privacidade foram alterados. Este website utiliza cookies que asseguram funcionalidades para uma melhor navegação. Ao continuar a navegar, está a concordar com a utilização de cookies e com os novos termos de utilização.
Comment: Warning about terms and conditions and cookies.

Expression: Partilhar o artigo [título] Imprimir o artigo [título] Enviar por email o artigo [título] Aumentar a fonte do artigo [título] Diminuir a fonte do artigo [título] Ouvir o artigo [título]
Comment: Options for web users to share, print, send or listen to the news article and increase/decrease its font size (icon legends).

Expression: Completam-se agora 100 anos sobre o início da beligerância portuguesa. Uma data assinalada pela RTP com a publicação online dos seus mais significativos materiais de arquivo sobre o tema.
Comment: Advertisement for other content of the news source (present in over 100 articles).

Table 4.3: News article pre-processing transformation examples

Before: Segundo site TMZ, Prince morreu na sua residência em Presley Park. A polícia está a investigar um óbito ocorrido na sua residência, mas não confirmou que se tratasse da morte do próprio artista. O cantor norte-americano, Prince Rogers Nelson de seu nome, terá sucumbido a uma gripe que originara o seu internamento de urgência na passada sexta feita.
After: sit tmz prince morr resident presley park políc investig óbit ocorr resident confirm trat mort artist cantor prince rogers nelson nom sucumb grip origin intern urgênc pass sext feit

Before: A ministra da Administração Interna justificou esta sexta-feira que, "por uma questão de proporcionalidade", optou por aplicar ao militar da GNR, que matou um jovem numa perseguição após um assalto, uma sanção menos gravosa do que a proposta pela IGAI.
After: ministra da administração interna justific sext feir questã proporcional optou aplic milit gnr mat perseguiã assalt sanção propost decisã tom propost igai

Before: Entre os detidos encontram-se familiares do extremista malaio Mohamed Jedi, que combate na Síria nas fileiras do Estado Islâmico.
After: det encontr malai mohamed jedi combat síria fileir estado islâmico
Table 4.3 presents examples of the transformations. Underlined terms correspond to named entities. After these transformations, the corpus was structured in a document-term matrix. Since the number of terms generated, including named entities, surpassed 44 thousand, a feature-reduction step was applied: the maximum sparsity level of the matrix was set to 98%, i.e., terms absent from more than 98% of the documents were removed. This resulted in only 952 terms.
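A rough equivalent of this sparsity-based pruning can be sketched in Python with scikit-learn (the thesis used the R tm package); a 98% maximum sparsity corresponds to keeping terms that occur in at least 2% of the documents, expressed via min_df:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in documents (pre-processed stems would be used in practice).
docs = [
    "brexit referendo reino unido",
    "brexit união europeia",
    "guterres nações unidas",
]

# min_df=0.02 keeps terms present in at least 2% of documents,
# mirroring a 98% maximum-sparsity threshold on the document-term matrix.
vectorizer = TfidfVectorizer(min_df=0.02, norm="l2")
X = vectorizer.fit_transform(docs)
print(X.shape)
```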
4.1.3 Parametrizing the method
The clustering method chosen to group the on-line news articles was of a hybrid nature, combining an efficient algorithm, k-means, with a robust setup of hierarchical clustering. The number of clusters (parameter k) that k-means requires was decided on the basis of a subjective evaluation of the number of different stories that could occur in a six-month period. We also considered the representation of the explained inertia for different k values.
Determining k
The explained inertia for hierarchical clustering, given by the between-class disper-
sion over the total dispersion of the dataset, is shown in Figure 4.4. The elbow of the
line uniting the dots is not clearly visible, but the largest growth in the explained
inertia happens for k up to 50, above which any further partitioning does not gain
marginal increments in quality (measured as class separation) at the same rate as
up to that point. A similar conclusion can be drawn from the representation of the
aggregation indices of the hierarchical clustering, presented in Figure 4.5. In this
representation, the elbow is more clearly visible, for k values between 25 and 50.
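The explained-inertia curve can be reproduced in outline as follows; this is a sketch on synthetic data standing in for the TF-IDF matrix, not the thesis code:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # toy stand-in for the document-term matrix

Z = linkage(X, method="ward")
total = ((X - X.mean(axis=0)) ** 2).sum()  # total dispersion of the dataset

explained = {}
for k in (2, 10, 50):
    labels = fcluster(Z, t=k, criterion="maxclust")
    within = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
                 for c in np.unique(labels))
    # explained inertia = between-class dispersion over total dispersion
    explained[k] = 1 - within / total

print({k: round(v, 3) for k, v in explained.items()})
```

Because the hierarchical partitions are nested, the explained inertia is non-decreasing in k; the elbow is where its growth slows.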
As a benchmark, we retrieved a list of events that occurred in the first semester of 2016 from a well-known news source website, SIC Notícias. This information was published as part of the 'Year in Review' at the end of 2016.¹
¹ https://sicnoticias.sapo.pt/especiais/revista-do-ano-2016/2016-12-27-O-ano-em-revista
Figure 4.4: Explained inertia for k up to 500
Figure 4.5: Aggregation indices for k up to 500
identified a total of 119 different events from January to June worth including in the year review. Hence, it is reasonable to assume that a large proportion of clusters of news articles characterizes different stories published in that time frame.
Consequently, we opted for the larger end of the spectrum and set k equal to 50.
4.1.4 Clustering results
As described in Chapter 3, hierarchical clustering was performed on a sample of √(k · N) = √(50 · 6074) ≈ 551 news articles, with a cut-off at 50 clusters. Then, the
centroids for each cluster were fed into the k-means algorithm, so that the starting
points would have a higher quality than a random selection of 50 articles from the
sample. The clustering results are summarized in Table 4.4.
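The hybrid procedure (hierarchical clustering on a √(k · N) sample, whose centroids seed k-means) can be sketched as follows, using synthetic vectors in place of the actual TF-IDF matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(607, 8))  # toy stand-in for the TF-IDF document vectors
k = 50

# 1. Hierarchical (Ward) clustering on a sample of sqrt(k * N) documents.
n_sample = int(round(np.sqrt(k * X.shape[0])))
sample = X[rng.choice(X.shape[0], size=n_sample, replace=False)]
labels = fcluster(linkage(sample, method="ward"), t=k, criterion="maxclust")

# 2. Use the hierarchical cluster centroids as high-quality seeds for k-means,
#    instead of a random selection of starting points.
seeds = np.vstack([sample[labels == c].mean(axis=0) for c in np.unique(labels)])
km = KMeans(n_clusters=seeds.shape[0], init=seeds, n_init=1).fit(X)

print(km.cluster_centers_.shape)
```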
Table 4.4: Number of elements per cluster and class homogeneity and separation
Cluster size and dispersion
The resulting clusters of news articles have variable size. Indeed, some clusters, like
cluster 5 - Miscellaneous, cluster 7 - Elections, cluster 14 - Politics and cluster 19 -
International security, are rather large, accounting for over 50% of the documents.
Their within-class dispersion values reflect the diversity of the articles in them, and
further partitioning would probably continue to divide these clusters into smaller
ones. On the other hand, there are also very isolated clusters, with 1 or 2 elements
only (e.g. cluster 28 - Palmira buildings and cluster 39 - Space transport).
Cluster labelling
The groups of news articles were labelled using the most signi�cant terms as key-
words. The number of keywords varies according to the cluster size, so that larger
clusters had a larger set of keywords to represent them. Keywords are ordered
according to their average TF-IDF values within the cluster. As an example, Figure 4.6 shows clusters of (a) articles related to the Brexit referendum, (b) the first news about António Guterres' run for United Nations Secretary-General, (c) the European Football Cup held in France and (d) the Brussels terrorist attack.
The list of keywords of every cluster can be found in Appendix B. The names
of the clusters were given after an examination of the list of keywords and, in case
of very small clusters, the articles themselves.
We highlight the importance of named-entity identification in this study. As the wordclouds in Figure 4.6 and the list of keywords in Appendix B show, terms such as Reino Unido, União Europeia (United Kingdom, European Union; cluster 44), Nações Unidas (United Nations; cluster 36) and Banco de Portugal (Bank of Portugal; cluster 2) are often very indicative of the type of news stories present in that group.
The highlighted clusters in Table 4.4, the wordclouds of three of which were already presented, are the clusters discussed later on. The reasons for this selection will be provided in section 4.2.3 of this chapter.
• Politics (n.14)
• Air transport incidents (includes Brussels attack) (n.18)
• Football - Euro 2016 (n.21)
• Brexit (n.44)
Figure 4.6: Keywords per cluster (4 examples)
4.2 Assignment of tweets to clusters
In order to analyse when a certain story or group of similar stories appears on social networks in comparison to its press coverage, there is a need to attribute social media posts to a particular group of similar stories. In this case, we used a collection of
tweets posted during the same period as the on-line news articles and assigned them
to the appropriate clusters of articles formed in the previous stage.
4.2.1 Sample construction of tweets
The available tweets from Portuguese users were collected during 2016. The total
number of documents surpassed 38 million, with a monthly average of 3.1 million
tweets. The relevant information of approximately 19.4 million tweets from the first
semester of 2016 was collected and stored in the database, as described in Chapter
3 (section 3.3).
Figure 4.7: Number of available tweets per month of 2016
However, not all of these tweets were relevant to the present study because, as seen in Chapter 2, family and everyday life is a significant topic among Twitter users. Also, tweets are very short segments of text: in 2016, the limit was 140 characters. Extremely short tweets (for example, fewer than 20 characters) increase the difficulty
of text mining tasks.
The first type of tweets to be included in the sample were tweets that contained the URL of a clustered news article, as this provides evidence that the tweet itself is about the same story or event referred to in the news article. This strategy also allows the evaluation of the proposed assignment method. In total, 5664 tweets obeyed this criterion.
Then, similarly to what was done with the news articles, a sample of the same size (5664) was selected from the available 19.4 million tweets. The selection process was both random and oriented: first, 250 thousand tweets with more than 20 characters were randomly chosen; then, only those containing at least one keyword from the clusters were kept (approximately 50%); finally, a random sub-sample of these was selected. This selection process was therefore a compromise between the available processing resources and the identification of promising tweets.
4.2.2 Pre-processing of tweets
Tweets were subject to a series of pre-processing techniques, including lower case
conversion, removal of punctuation, numbers and stopwords, stemming and named
entity recognition. Additionally, there was the need to remove any URL from the
tweet text, as these are not informative. Similarly to what was done for frequent
expressions in news articles (see section 4.1.2), URLs were removed directly from the database, by identifying sub-strings beginning with http. Table 4.5 presents some
examples of these transformations. Underlined terms are named entities.
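The URL-removal step, identifying sub-strings starting with http, can be sketched with a regular expression (in the thesis it was applied at the database level):

```python
import re

def strip_urls(text: str) -> str:
    # Remove any sub-string starting with "http" up to the next whitespace.
    return re.sub(r"http\S+", "", text).strip()

print(strip_urls("RT @RTPNoticias: Coreia do Norte testa bomba https://t.co/oCX2sWMmhW"))
```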
4.2.3 Assignment to clusters
Tweets were assigned to the clusters of news articles using a similarity measure
between each tweet and the cluster centroids. In this work we have used the co-
sine similarity. The similarity values were computed using feature vectors with
Table 4.5: Tweet pre-processing transformation examples

Before: RT @RTPNoticias: Coreia do Norte testa bomba de hidrogénio e desperta receios mundiais https://t.co/oCX2sWMmhW
After: rt rtpnoticias coreia do norte test bomb hidrogéni despert recei mundi

Before: Marcelo não precisa do apoio dos antigos presidentes da republica. Não precisa porque não tem!
After: marcelo precis apoi antig president republ precis porqu

Before: [Noticias ao Minuto] Renato Sanches é comparado a Ronaldo e ganha mais interessados
After: noticias minuto renato sanches é compar ronaldo ganh interess

Before: Tenho 18 episódios para ver e o que vou fazer??? vou começar a ver Criminal Minds: Beyond Borders ??????
After: episódi ver vou faz vou comec ver criminal minds beyond borders
normalized TF-IDF values. For each tweet, the five closest cluster centroids were identified, and the cluster most similar to the tweet in question was chosen.
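The assignment step can be sketched as follows; the centroid and tweet vectors here are toy values, not the actual TF-IDF vectors:

```python
import numpy as np

def assign_tweet(tweet_vec: np.ndarray, centroids: np.ndarray, top_n: int = 5):
    """Return the indices of the top_n centroids closest by cosine similarity."""
    sims = centroids @ tweet_vec / (
        np.linalg.norm(centroids, axis=1) * np.linalg.norm(tweet_vec) + 1e-12
    )
    return np.argsort(sims)[::-1][:top_n]  # first entry is the assigned cluster

centroids = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.5, 0.5, 0.0]])
tweet = np.array([0.9, 0.1, 0.0])
print(assign_tweet(tweet, centroids))  # ranked cluster indices: [0 2 1]
```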
The distribution of tweets and news articles per clusters is shown in Figure 4.8.
The sample of tweets is larger than the sample of articles, with a mean ratio of 1.8 tweets per article. For the clusters under observation, the ratio is larger than the mean (from 2.1 for the Brexit cluster to 3.9 for the Air transport incidents cluster), with the exception of the Politics cluster (0.8). This is some evidence that the chosen clusters have a place in the social network discussion.
Evaluation
Using the tweets with a link to a news article, we have evaluated the results of the
assignment based on cosine similarity. Tables 4.6 and 4.7 present the performance
values.
Global accuracy is 12.7%, while macro-average precision is 13.8%. Due to the
fact that the classes are very unbalanced, we also present the weighted macro-average
precision: 26.0%. The weighted macro-average recall is 12.7% and the weighted
macro-average F1 score is 9.9%.
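These evaluation measures can be computed with scikit-learn; the labels below are hypothetical, chosen only to show how macro and weighted averages differ on unbalanced classes:

```python
from sklearn.metrics import accuracy_score, precision_score, f1_score

# Hypothetical true vs. predicted cluster labels for URL-linked tweets.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 2, 0, 2]

acc = accuracy_score(y_true, y_pred)
macro_p = precision_score(y_true, y_pred, average="macro", zero_division=0)
weighted_p = precision_score(y_true, y_pred, average="weighted", zero_division=0)  # support-weighted
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
print(acc, macro_p, weighted_p, weighted_f1)
```

With unbalanced classes, the weighted average lets larger classes dominate, which is why it can differ noticeably from the plain macro average.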
Figure 4.8: Number of tweets and articles per clusters
Table 4.6: Per-class evaluation of tweets assignment to clusters
Table 4.7: Global evaluation of tweets assignment to clusters
The assignments to clusters 3 - Public Prosecution, 5 - Miscellaneous and 44
- Brexit have the highest precision values. However, F1 values, which also take
into account the recall, are higher for clusters 18 - Air transport incidents (includes
Brussels attack), 21 - Football - Euro 2016 and 44 - Brexit. These performance
values point to the clusters that are probably the most reliable for the temporality
analysis of the next section. Indeed, this was the main reason for the selection of
clusters on which to focus the analysis. Another criterion was to select one of the
largest clusters (14 - Politics).
In addition, we present the values for precision@n, for n up to five. This measure reflects how well the proposed method performs considering its n topmost predictions, as opposed to the first prediction only. In this case, precision@5 is 43.5%, which means that, for 43.5% of the assigned tweets with a link to a news article, the correct cluster was among the top five predictions.
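A minimal sketch of precision@n, assuming each item carries a ranked list of predicted clusters (toy values, not the thesis results):

```python
def precision_at_n(true_labels, top_predictions, n: int) -> float:
    """Fraction of items whose true cluster is among the n top-ranked predictions."""
    hits = sum(t in preds[:n] for t, preds in zip(true_labels, top_predictions))
    return hits / len(true_labels)

true_labels = [3, 7, 1, 4]
top_predictions = [[3, 2, 5], [2, 7, 9], [8, 6, 0], [4, 1, 3]]
print(precision_at_n(true_labels, top_predictions, 1))  # 0.5
print(precision_at_n(true_labels, top_predictions, 3))  # 0.75
```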
Moreover, we have carried out another experiment, demanding at least two features in common with the cluster centroids in the tweet assignment stage, in order to improve the assignments. Evaluation results improved slightly but not significantly (accuracy: 14.2%; macro-average precision: 15.9%; weighted macro-average precision: 25.5%; weighted macro-average F1: 11.2%); see Appendix C.
4.3 Temporal analysis
The main goal of the dissertation was to identify similar stories or events and analyse how differently (or not) they come about in the news and on social media. Given the rise of user-generated content and the current trend of journalists scouting social media for crowd checking and news monitoring (DVJ Insights and ING The Netherlands, 2015), the hypothesis is that these two environments are interconnected.
The work developed thus far aimed to group the two types of documents, on-line news articles and tweets, through the use of text mining and clustering, among other techniques. The final clusters are now used for the temporal analysis, and in particular for the temporal relationship between news and tweets.
4.3.1 Evolution of articles and tweets
As a result of the previous stages, we have identified similar news articles and tweets and characterized each group with a number of keywords. In this section we present the timeline of each of these document types per cluster, in an attempt to gain insights into the evolution of news generation and sharing/commenting on Twitter in Portugal.
To that end, the date and time of publication or posting of these elements were
retrieved from the database.
For the following analyses, we focused on tweets that do not have any link to a news article. Similarly, we only included articles that were not shared on Twitter. This prevented a possible bias towards the hypothesis that, for a given cluster, the social discussion on Twitter came after its publication by the press.
Figure 4.9 presents the temporal evolution of tweets and articles for the four
clusters under observation. We recall that Football - Euro 2016, Brexit and Air
transport incidents (includes Brussels attack) were the clusters with the best F1
scores on the evaluation of tweets assignment, and that Politics is the largest cluster,
albeit with a lower performance evaluation (see Table 4.6).
Figure 4.9: Evolution of the number of elements
It is possible to observe that clusters Football - Euro 2016 and Brexit have
peaks in the number of articles and tweets at the expected moment: the European
Football Cup started on the 10th of June (week 24) and the Brexit Referendum
took place on the 23rd of June (week 26). Naturally, both of these events were
subject to discussion in the previous months, as the national team prepared for the competition and debates concerning Brexit and its consequences intensified. Tweets assigned to Football - Euro 2016 always surpassed the number of published articles on a weekly basis, with a global ratio of six tweets to one news article. This highlights
the importance of this event (Euro 2016) and topic (football) on the Portuguese
discussions on Twitter. These time series show a correlation of 0.84, which may
indicate that football is referred to with the same intensity in the news and on
Twitter. A smoother trend of tweets surpassing the number of articles is noticed
for Brexit, with the exception of week 26, when the referendum occurred, where
the number of assigned tweets is approximately 50% lower than the clustered news
articles.
Air transport incidents (includes Brussels attack) is the smallest cluster under observation, albeit having scored the highest F1 value. It shows a small rise at week 13, which was when the bombings at Brussels Airport and Maalbeek metro station happened. We remark, nevertheless, that if we included shared articles and the tweets linking to them in this analysis, the rise at week 13 would be significantly larger (18 tweets and 10 articles versus an average of 0.2 and 1.6 in the weeks prior to this event).
Politics, the largest cluster of articles and tweets, shows a rather smooth evolution in the number of elements², especially on the tweets side. It shows that for the kind of study aimed at analysing the timing of generation of these documents, further partitioning of this cluster could be necessary in order to identify patterns in sub-stories possibly different from the one this large cluster reveals. This is the cluster under observation with the lowest ratio of assigned tweets to articles, which may also be a sign that for this topic, the keywords generated at the article level may not be sufficiently discriminative at the tweet level. The evaluation measures did, indeed, reveal a large proportion of false negatives for this class (see Table 4.6). Another possible line of interpretation is that Portuguese Twitter users do not, in fact, talk as much about this topic when compared to its importance to the press.
4.3.2 Time-wise di�erences
This section explores the timing of news articles versus the timing of tweets about
the same story or group of similar stories. One way of analysing the temporality
between news and tweets is to consider the time difference of every article in a
cluster to its `median tweet'. The median tweet is the tweet with median time
² Note that the falls in the number of elements are due to the partitioning of the data into weeks, which are cut in two when a new month begins in the middle of a certain week of the year.
in the corresponding cluster. This brings out how the press publication timings
compare to the moment the public discussion is at its highest.
The time difference is computed as a lag variable which, if positive, indicates that the article is older than the median tweet, and the opposite if negative. If the distribution of this variable is skewed to the right, the stories of that cluster have a tendency to be published first by the press; if skewed to the left, the social media discussion happens sooner than the news.
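The lag computation can be sketched as follows, with hypothetical timestamps; the median tweet is simply the tweet with the median posting time in the cluster:

```python
from datetime import datetime

# Hypothetical publication/posting timestamps for one cluster.
tweet_times = [datetime(2016, 6, 20), datetime(2016, 6, 23), datetime(2016, 6, 25)]
article_times = [datetime(2016, 6, 21), datetime(2016, 6, 24)]

# The "median tweet" is the tweet with the median posting time
# (upper median for even-sized lists in this sketch).
median_tweet = sorted(tweet_times)[len(tweet_times) // 2]

# Positive lag: the article is older than (precedes) the median tweet.
lags = [(median_tweet - t).days for t in article_times]
print(lags)  # [2, -1]
```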
Figure 4.10 shows the representation of the distribution of the above-mentioned
lag variable, for the clusters under observation.
Figure 4.10: Days difference between articles and the median tweet
It can be observed that the majority of the articles belonging to the cluster
Football - Euro 2016 were published after the median date of the tweets assigned to
this cluster (a negative days di�erence). Portuguese Twitter users seem, therefore,
to anticipate the discussion of the national team's participation in the competition
in comparison to what happened on the news. A similar conclusion can be drawn
for Brexit: the height of discussion among Portuguese users of the UK staying in or leaving the European Union happened on Twitter about two months before it happened in the press.
These observations lead to the following conclusion: while the news on Football - Euro 2016 and Brexit were more event-oriented, with peaks of articles at specific points or short periods of time (e.g. the start of the football competition on the 10th of June; the referendum date announcement in February and the referendum itself on the 23rd of June), conversations on social media may have happened more evenly during the period under study.
The Politics cluster does not present any specific pattern. The Air transport incidents (includes Brussels attack) cluster shows some signs that the press published articles about the incidents first: in total there were nine articles published before the median tweet (positive days difference) and five articles published after the median tweet (negative days difference).
We also did this temporality analysis for the whole dataset, i.e., including shared
articles and tweets with links to them. Conclusions for the selected clusters do not
change (see Appendix D).
It is important, however, to emphasize that conclusions drawn from the previ-
ous analyses should be interpreted carefully, as they depend on the quality of the
generated clusters, which could only be partially evaluated.
Chapter 5
Conclusion
5.1 Results
This work aimed at providing some initial insights into the themes that appear both
in news sources and social networks in Portugal. In particular, the goal was to study
the temporal relationship between news articles and tweets. The strategy was to
investigate what were the main stories published and how they behaved in terms of
the number of articles and social media posts during a reasonably long period.
To that end, a sample of on-line news articles from generalist news sources was subject to text clustering techniques, and 50 groups of similar stories were identified. These were subsequently labelled with a set of keywords. The groups included stories on football-related events and teams, terrorist attacks and other international incidents, elections, accidents, investigations, political parties, ministerial actions, economic/financial reports and weather warnings, among others.
Then, to associate tweets with the groups of news articles, we used a sample of tweets and assigned each of them to one of the clusters using a similarity measure.
This assignment was evaluated using tweets containing a news article URL. Performance results were not as promising as expected (12.7% accuracy), suggesting the need for improvements in the proposed method. The fact that tweets contain very few words adds to the difficulty of this task. Also, as the number of classes is large, we cannot expect a very high accuracy value.
Nonetheless, for some clusters the evaluation was considerably above average, which allowed the study of the temporality between news and tweets. The three clusters selected on the basis of per-class performance were, coincidentally, event-oriented stories: Brexit, Football - Euro 2016 and Air transport incidents. The final cluster under focus was one of the largest clusters formed (Politics).
The analyses of the evolution of the number of elements and of the temporality between tweets and news for those four clusters led to the following main conclusions. The football national team was a rather constant subject of discussion on social media in the first semester of 2016, culminating in the final month with its participation in the European Cup. However, the same did not necessarily happen on the news side, where the most frequent articles were published towards the end. A similar pattern was noticed for Brexit. This indicates that, for some stories, the press is more event-oriented, contrasting with the more permanent focus of Twitter users. The analysis of incidents at airports, which included the Brussels bombings, revealed that the press had a more prominent role in the news diffusion, with comments on social media arising afterwards. We hypothesize that it might not have been the case had those incidents occurred in Portugal. Finally, this type of analysis requires a certain level of partitioning of stories, so that timing patterns are more easily identifiable, which was not achieved for the Politics cluster.
5.2 Limitations and Future Work
This investigation presented some challenges. The fact that we were working with
unlabelled data (both articles and tweets) prevented a more robust evaluation of
the proposed method, even though we were able to partially assess the performance
of the assignment of tweets to clusters of articles. Additionally, the available tweets
(more than 19 million) can be classified as big data, which calls for more efficient retrieval techniques, directed at identifying those tweets related to news stories.
A possible avenue of investigation is to improve the representation of news articles
in order to capture the desired stories or events. Some of the cluster keywords
revealed that the list of stopwords could be enhanced. Also, in line with the current literature, the use of must-link and/or cannot-link constraints based on prior knowledge (for example, a list of Portuguese events from Wikipedia), turning the unsupervised task into a semi-supervised one, may help to improve cluster quality.
At the tweet level, we suggest two possible strategies aimed at better representation and, consequently, association with news articles. The first is the use of ontologies to compute semantic distances between articles and tweets. The second is to expand the tweet with synonyms, for example through word embeddings (Mikolov et al., 2013), that could be more easily matched with the lexicon present at the article level.
A different strategy that could yield better results would be to use the tweets with links to news articles to train (and test) a classification model that would be more adequate for short segments of text.
Bibliography
Aggarwal, C. C. and Zhai, C. (2012). A survey of text clustering algorithms. In
Mining Text Data, chapter 4, pages 77–128. Springer US.
Bholowalia, P. and Kumar, A. (2014). EBK-means: A clustering technique based
on elbow method and k-means in WSN. International Journal of Computer Ap-
plications, 105(9):17–24.
Bouchet-Valat, M. (2014). SnowballC: Snowball stemmers based on the C libstemmer
UTF-8 library.
Brogueira, G., Batista, F., and Carvalho, J. P. (2016). Using geolocated tweets for
characterization of Twitter in Portugal and the Portuguese administrative regions.
Social Network Analysis and Mining, 6(1).
Chavent, M., Kuentz, V., Labenne, A., and Saracco, J. (2017). ClustGeo: Hierar-
chical Clustering with Spatial Constraints.
Cutting, D. R., Karger, D. R., Pedersen, J. O., and Tukey, J. W. (1992). Scat-
ter/Gather: A Cluster-based Approach to Browsing Large Document Collections.
In SIGIR '92, volume 51, pages 318–329.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman,
R. A. (1990). Indexing by latent semantic analysis. Journal of the American
Society for Information Science, 41(6):391–407.
Dempster, A., Laird, N., and Rubin, D. B. (1977). Maximum Likelihood from
Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society,
Series B (Methodological), 39(1):1–38.
DVJ Insights and ING The Netherlands (2015). Impact of social media on news
(#SMING15). Technical report.
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A Density-Based Algorithm
for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of
the Second International Conference on Knowledge Discovery and Data Mining
(KDD-96), volume 34, pages 226–231.
Feinerer, I. and Hornik, K. (2017). tm: Text Mining Package.
Feldman, R. and Sanger, J. (2006). The Text Mining Handbook.
Fellows, I. (2014). wordcloud: Word Clouds.
Gama, J., Carvalho, A. P. d. L., Faceli, K., Lorena, A. C., and Oliveira, M. (2015).
Extração de Conhecimento de Dados - Data Mining. Edições Sílabo, Lda., 2nd
edition.
Genc, Y., Sakamoto, Y., and Nickerson, J. V. (2011). Discovering context: Classi-
fying tweets through a semantic transform based on Wikipedia. In Lecture Notes
in Computer Science (including subseries Lecture Notes in Artificial Intelligence
and Lecture Notes in Bioinformatics), volume 6780 LNAI, pages 484–492.
Hornik, K. (2016). openNLP: Apache OpenNLP Tools Interface.
Hornik, K. (2017). NLP: Natural Language Processing Infrastructure.
Hornik, K., Buchta, C., and Zeileis, A. (2009). Open-Source Machine Learning: R
Meets Weka. Computational Statistics, 24(2):225–232.
Hotho, A., Nürnberger, A., and Paaß, G. (2005). A Brief Survey of Text Min-
ing. LDV Forum - GLDV Journal for Computational Linguistics and Language
Technology, 20:19–62.
Hu, M., Liu, S., Wei, F., Wu, Y., Stasko, J., and Ma, K.-L. (2012). Breaking news
on Twitter. In Proceedings of the 2012 ACM Annual Conference on Human Factors
in Computing Systems - CHI '12, pages 275–279.
Huang, A. (2008). Similarity measures for text document clustering. In Proceedings
of the Sixth New Zealand Computer Science Research Student Conference,
pages 49–56.
Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition
Letters, 31(8):651–666.
Java, A., Song, X., Finin, T., and Tseng, B. (2007). Why we twitter: Understanding
Microblogging Usage and Communities. In International Conference on Knowl-
edge Discovery and Data Mining, pages 56–65.
Kaplan, A. M. and Haenlein, M. (2010). Users of the world, unite! The challenges
and opportunities of Social Media. Business Horizons, 53(1):59–68.
Kassambara, A. and Mundt, F. (2017). factoextra: Extract and Visualize the Results
of Multivariate Data Analyses.
Krishnamurthy, B., Gill, P., and Arlitt, M. (2008). A few chirps about twitter.
In Proceedings of the 1st Workshop on Online Social Networks (WOSN), pages
19–24.
Kwak, H., Lee, C., Park, H., and Moon, S. (2010). What is Twitter, a Social Network
or a News Media? The International World Wide Web Conference Committee
(IW3C2), pages 1–10.
Lusa (2016). Uso das redes sociais em Portugal triplicou em sete anos.
MacQueen, J. (1967). Some methods for classification and analysis of multivari-
ate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical
Statistics and Probability, 1(14):281–297.
Marketeer (2017). Qual é a rede social mais utilizada em Portugal?
McCombs, M. E. and Shaw, D. L. (1972). The agenda-setting function of mass media.
The Public Opinion Quarterly, 36(2):176–187.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Dis-
tributed representations of words and phrases and their compositionality. In Ad-
vances in Neural Information Processing Systems, pages 3111–3119.
Newman, N. (2009). The rise of social media and its impact on mainstream jour-
nalism.
Phuvipadawat, S. and Murata, T. (2010). Breaking news detection and tracking
in Twitter. In Proceedings - 2010 IEEE/WIC/ACM International Conference on
Web Intelligence and Intelligent Agent Technology - Workshops, WI-IAT 2010,
pages 120�123.
Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3):130–137.
R Core Team (2017). R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria.
Ripley, B. and Lapsley, M. (2017). RODBC: ODBC Database Access.
Rocha, C. (2016). PAMPO: Extract Named Entities from texts.
Rocha, C., Jorge, A., Sionara, R., Brito, P., Pimenta, C., and Rezende, S. (2016).
PAMPO: using pattern matching and pos-tagging for effective Named Entities
recognition in Portuguese. arXiv preprint arXiv:1612.09535, pages 1–17.
Rosen, A. (2017). Tweeting Made Easier.
Russom, P. (2007). BI Search and Text Analytics: New Additions to the BI Tech-
nology Stack. Technical report, The Data Warehousing Institute.
Saleiro, P., Amir, S., Silva, M., and Soares, C. (2015). POPmine: Tracking Political
Opinion on the Web. In 2015 IEEE International Conference on Computer and
Information Technology; Ubiquitous Computing and Communications; Depend-
able, Autonomic and Secure Computing; Pervasive Intelligence and Computing
(CIT/IUCC/DASC/PICOM), pages 1521–1526.
Sankaranarayanan, J., Samet, H., Teitler, B. E., Lieberman, M. D., and Sperling, J.
(2009). TwitterStand: News in Tweets. In Proceedings of the 17th ACM SIGSPA-
TIAL International Conference on Advances in Geographic Information Systems
- GIS '09, pages 42–51.
Schütze, H., Manning, C. D., and Raghavan, P. (2008). Evaluation in information
retrieval, volume 39. Cambridge University Press.
Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H., and Demirbas, M. (2010).
Short Text Classification in Twitter to Improve Information Filtering. In Proceed-
ings of the 33rd International ACM SIGIR Conference on Research and Develop-
ment in Information Retrieval - SIGIR '10, pages 841–842.
Statista (2018). Number of monthly active Twitter users worldwide from 1st quarter
2010 to 2nd quarter 2018 (in millions).
Twitter (2016). Twitter Usage - Company Facts.
https://about.twitter.com/company. Last accessed on Dec 12, 2016.
Wanta, W., Golan, G., and Lee, C. (2004). Agenda Setting and International News:
Media Influence on Public Perceptions of Foreign Nations. Journalism and Mass
Communication Quarterly, 82(1):364–377.
Ward Jr, J. H. (1963). Hierarchical grouping to optimize an objective function.
Journal of the American Statistical Association, 58(301):236–244.
Witten, I. H. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools
and Techniques. Morgan Kaufmann, San Francisco, 2nd edition.
Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., and Li, X. (2011).
Comparing Twitter and Traditional Media using Topic Models. In Proceedings of
the 33rd European Conference on Advances in Information Retrieval (ECIR'11),
pages 338–349.
Appendix A
Example of input tweet
Figure A.1: Example of tweet in JSON – part 1
Figure A.2: Example of tweet in JSON – part 2
Appendix B
Cluster keywords
Figure B.1: Top 10 cluster keywords – part 1
Figure B.2: Top 10 cluster keywords – part 2
Appendix C
Assignment of tweets with at least
two features in common with cluster
centroids - Evaluation results
Table C.1: Global evaluation of tweet assignment to clusters - considering tweets with at least two terms in common with cluster centroids
Table C.2: Per-class evaluation of tweet assignment to clusters - considering tweets with at least two terms in common with cluster centroids
Appendix D
Temporality between news and
tweets - considering the complete
dataset
Figure D.1: Difference in days between articles and the median tweet - considering the complete dataset of tweets and articles