
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2017

Analysing Credibility of Twitter Users Using the PageRank Algorithm

ALICE HEAVEY

ELIN KARAGÖZ

KTH SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Bachelor's Thesis in Computer Science
Date: June 1, 2017
Supervisor: Alexander Kozlov
Examiner: Örjan Ekeberg
Swedish title: Analys av trovärdighet hos Twitteranvändare med hjälp av PageRank-algoritmen
School of Computer Science and Communication


Abstract

In a time when information and opinions are to a large extent shared via social media, it is important to find a way to determine how credible the content is. The purpose of this study is to investigate whether PageRank-based algorithms can be used to determine how credible a Twitter user is, based on how much the user's posts are retweeted by other users. Two different algorithms based on PageRank have rated the credibility of Twitter users in a network. This ranking has been compared with a manual credibility check of the users to determine how close to reality the credibility distribution from the algorithms is. The results show that the algorithms can be said to perform better than random, but they still assign inaccurate credibility scores to many users. The simplicity of the algorithms is an advantage compared to other methods used in previous research. The conclusion is that the algorithms in their current state are not suitable for determining the credibility of Twitter users.


Summary (Sammanfattning)

At a time when information and opinions are largely shared via social media, it is important to find a way to determine how credible this content is. The purpose of this study is to investigate whether algorithms based on the PageRank algorithm can determine how credible a Twitter user is, based on how much the user's posts have been shared by other users. Two different algorithms based on PageRank have ranked the credibility of Twitter users in a network. This ranking has then been compared with a manual credibility assignment of the users to determine how close to reality the algorithms' credibility distribution is. The results show that the algorithms can be considered to perform better than chance, but that they nevertheless assign an incorrect credibility to many users in the network. The trivial nature of the algorithms gives them an advantage over algorithms used in previous studies. The conclusion is that the algorithms in their current form are not suited to establishing the credibility of Twitter users.

Contents

1 Introduction
    1.1 Problem Definition
    1.2 Scope and Constraints
    1.3 Outline
2 Background
    2.1 Twitter
        2.1.1 Users
        2.1.2 Tweet
        2.1.3 Retweet
        2.1.4 Hashtag
    2.2 PageRank
    2.3 Previous Studies
        2.3.1 Credibility on Twitter
        2.3.2 Evaluating Credibility on Twitter
3 Method
    3.1 Literature Study
    3.2 Selection of Data
    3.3 Data Collection and Storage
        3.3.1 Twitter APIs
        3.3.2 Python
        3.3.3 Tweepy
        3.3.4 Neo4J
        3.3.5 Database Schema
    3.4 Heuristic
        3.4.1 PageRankReset
        3.4.2 PageRankKeep
        3.4.3 LogRank
    3.5 Manual Credibility Check
        3.5.1 User Types
        3.5.2 Verification Score
        3.5.3 Follower Count Score
        3.5.4 Spam
    3.6 Manual Control of Results
    3.7 Estimating Credibility Distribution in User Population
4 Results
    4.1 PageRankReset
        4.1.1 Manual Credibility Check of PageRankReset
    4.2 PageRankKeep
        4.2.1 Manual Credibility Check of PageRankKeep
    4.3 Estimated Credibility Distribution in Population
    4.4 Observations
        4.4.1 Spam
        4.4.2 Credible Users with Low LogRank
        4.4.3 Circular Endorsement
5 Discussion
6 Conclusion
    6.1 Future Research
Bibliography
A Tables

Chapter 1

Introduction

Today an increasing part of society consumes news through social media outlets such as Twitter and Facebook. As a result, concerns are raised about the credibility of the news stories shared in these non-classical environments. In a media feed where news stories from established media houses are interspersed with those from lesser-known news websites and personally authored posts from individual users, it becomes ever more difficult to discern what is true and what is not.

The problem of fake news stories spreading on social media sites has become ever more pronounced over the past year, and discussions about the responsibility of these sites have arisen. The sheer amount of new content posted every day prohibits manual assessment of credibility. One way to tackle these problems could be to develop an algorithm for automatically analysing the credibility of content published and shared online. If such an algorithm is found and performs well, it could be used to automatically label online content as credible or non-credible, which could help prevent fake news stories from spreading. However, a badly performing algorithm could, if implemented, cause real damage to the trust of the users, as it could mislabel credible online content as non-credible, and vice versa.

One way to tackle this issue is to use a version of the PageRank algorithm, which Google uses to rank the websites tracked by its search engine based on links pointing to them from other websites[4]. By applying the same concepts to Twitter users and retweets instead of websites and links, credibility could be distributed over a network of users based on how they interact with each other.

PageRank has been used in areas other than website ranking. For example, it has been used to predict the use of public spaces[13], and it is used in Twitter's follower recommendation service[6]. Its widespread use, together with its simplicity, makes it interesting to evaluate the algorithm's performance in credibility analysis, which is why it has been chosen for this study.

1.1 Problem Definition

This study will investigate whether the PageRank algorithm can be used to successfully determine the credibility of Twitter users. This will be done by formulating and testing a heuristic for ranking the credibility of Twitter users based on retweet data.


1.2 Scope and Constraints

The focus of the study is to investigate the possibility of ranking the credibility of Twitter users based on their activity on Twitter. Every attempt is based on the PageRank algorithm.

A user is considered credible by the algorithms if the user is retweeted by users with high credibility. To determine the success of the heuristic, samples of rated users will be examined manually, based on a number of criteria.

The dataset is constrained to users associated with a set of hashtags defined in the method chapter of this report.

1.3 Outline

Chapter 2 introduces necessary terminology, concepts and general background regarding the subject. Chapter 3 explains the implementations used to extract data from Twitter and the algorithms used to interpret the data. Chapter 4 presents the results of the analysis of Tweets and Twitter users, and these results are discussed in chapter 5. Chapter 6 presents the conclusion of the study together with possible areas for future research.

Chapter 2

Background

The aim of the background is to introduce the notions, concepts and techniques which are used in this study. Section 2.1 presents the properties of the Twitter network, section 2.2 explains the PageRank algorithm, and section 2.3 presents previous studies on automated credibility analysis on Twitter.

2.1 Twitter

Twitter is an online social networking community founded in 2006, where users interact by posting messages known as "Tweets", restricted to 140 characters. While membership is required to post a Tweet, non-members can read Tweets posted by members of Twitter. As of June 2016, Twitter had 313 million monthly active users and 1 billion unique monthly visits to sites with embedded Tweets[12]. On an average day 500 million Tweets are posted[15], making Twitter an important source of information.

2.1.1 Users

A user is someone who creates content and interacts with other users on Twitter. A user interacts with other users by posting Tweets, or by retweeting, sharing and liking other users' Tweets. Users can follow each other[11], and Tweets posted by followed users appear in the user's Twitter feed.

2.1.2 Tweet

Tweets constitute the Twitter content and are posted by Twitter users. Each Tweet is composed of up to 140 characters, and can also contain photos, videos and links. Other users can retweet, quote, share and like the Tweet.

2.1.3 Retweet

A retweet is a repost or a forward of a Tweet. If a user retweets a Tweet it will appear in the Twitter feeds of the user's followers.


2.1.4 Hashtag

Tweets can contain hashtags, which can be used to index keywords or subjects for a Tweet[10]. A hashtag is a word preceded by a hash sign (#), and hashtags are used as a way to categorise posts. The four hashtags relevant to this study are described below.

#svpol

Hashtag used mainly by politicians and politically involved Twitter users to comment on current events in Swedish politics and society in general[16].

#svtagenda

Hashtag collecting comments about the live television program Agenda. Agenda focuses on the most important events in Sweden and the world[2].

#svtopinion

Hashtag collecting comments about the live television debate program Opinion live and debate articles from SVT[3].

#opinionlive

Hashtag collecting comments about the live television debate program Opinion live[1].

2.2 PageRank

PageRank is an algorithm most famous for being used by Google to rank websites. The algorithm represents the Internet as a directed graph, where nodes represent websites and directed edges represent hyperlinks from one website to another. The rank of a certain website is determined by the quality of the links pointing to it, which in turn is determined by the rank of the website each link comes from. The algorithm is run iteratively, repeatedly calculating and updating the rank of the sites, until a stable ranking of all the websites in the graph is achieved[4]. For a more thorough description and pseudocode of the simplest version of the algorithm, see section 3.4.1.

2.3 Previous Studies

In this section three different approaches for analysing credibility in Twitter events are presented. An event is a collection of Tweets on the same subject, such as a current news event. The approach presented in 2.3.1 calculates event credibility based on different features extracted from associated users and Tweets, and from the event itself. Having been cited over 800 times, this article can be considered to have laid down the fundamentals for credibility analysis on Twitter. In 2.3.2 two methods, BasicCA and EventOptCA, are presented. They both take their cue from the method in 2.3.1, but refine the credibility scores through a PageRank-inspired process. Together these two studies are considered to give an overview of the field, as well as of the state of the art, of credibility analysis of Twitter users.


2.3.1 Credibility on Twitter

This study was conducted on around a million Tweets collected over a two-month period.

The study examines discussion topics on Twitter by studying bursts of activity on the same topic. Relevant features are extracted from each labelled topic, and these are used to build a classifier that attempts to automatically determine if a topic corresponds to newsworthy information, and to automatically assess its level of credibility. The classification takes the text content, the network of the user and propagation (retweets and previous Tweets) into account when assigning credibility. After this, each item is assessed on its level of credibility through surveys answered by a group of human judges. The accuracy of the methods is around 70-80%[5].

2.3.2 Evaluating Credibility on Twitter

This study was conducted on two datasets, each containing millions of Tweets.

BasicCA

BasicCA constructs a network of nodes consisting of users, Tweets and events. The credibility of the different nodes is initialised using an extended version of the method in 2.3.1. The credibility scores are then propagated iteratively through the network, using a PageRank-inspired approach in which the credibility of a node is directly affected by the credibility of its neighbours. This iterative process continues until the difference between successive iterations falls below a threshold value. The accuracy of the BasicCA method is about 76%[14].

EventOptCA

EventOptCA is a slightly enhanced version of the BasicCA method. The difference lies in the iterative process during which the credibility scores are propagated through the network of nodes. In each iteration, a separate graph of events is created. This graph is used to separately calculate new credibility scores for the events in the network, using quadratic optimisation. This approach helps similar-seeming events to get similar scores. The accuracy of the EventOptCA method is about 86%[14].

Chapter 3

Method

In this chapter the method of this study is presented. Section 3.1 presents the literature study, sections 3.2 and 3.3 explain the selection and collection of data, and section 3.4 presents the heuristic and the two approaches to the PageRank algorithm used in this study. Sections 3.5 and 3.6 explain the Manual Credibility Check and its application. Finally, section 3.7 presents the estimation of the actual credibility in the network.

The method used in this study for estimating the credibility of Twitter users takes its cue from the study presented in 2.3.2, by using the PageRank algorithm for calculating credibility over a network of users. The approach in this study is a vastly simplified version of the one used in that study.

3.1 Literature Study

A literature study was conducted in order to gain knowledge about necessary aspects of the field. Relevant academic articles and papers about Twitter credibility analysis were used, as well as the official Twitter documentation when extracting data from Twitter. The articles used were Evaluating Event Credibility on Twitter[14] and Information Credibility on Twitter[5]. An overview of these articles can be found in the Background chapter under section 2.3. The literature study was used to obtain a benchmark for how well an algorithm performs and how its results compare to earlier analyses.

3.2 Selection of Data

Due to the vast amount of Tweets posted every day, limitations have to be set on which Tweets and users to include in this study. In order to get a useful data set, it is desirable to constrain the collected Tweets and their users in such a way that the network they form contains a sufficient amount of user interactions in the form of retweets. This can be done by constraining the collection to users who have posted Tweets containing a certain set of hashtags, as hashtags are used to gather Tweets and users around a certain topic, making them form what can be regarded as a kind of community. In this study such a community is assumed to show, on average, a higher number of user interactions than the Twitter network as a whole.


For this study, Tweets containing any of the following four hashtags have been selected for the data set: #svpol, #svtagenda, #svtopinion, and #opinionlive. These hashtags were chosen because they are all associated with actively discussed and highly polarising topics. A set of Tweets based around them is therefore assumed to provide a wide diversity in credibility.

In the cases where a collected Tweet is a retweet of, or quotes, another Tweet, this other Tweet and its author are also added to the dataset. The same goes for user mentions: if a user is mentioned in a Tweet, that user becomes a part of the dataset.

3.3 Data Collection and Storage

The data is collected from the Twitter APIs, using Python and Tweepy, and stored in a Neo4j database. The data consists of Tweets posted between March 21st and April 4th 2017 and containing any of the following four hashtags: #svpol, #svtagenda, #svtopinion, and #opinionlive. The gathered collection consists of 7648 Tweets, 4271 users, and 6351 retweets. Every Tweet and retweet in the collection belongs to a user in the collection.

3.3.1 Twitter APIs

The Twitter APIs provide programmatic access to Twitter data. Tweets up to a week old can be accessed through the Search API, which allows querying for Tweets containing certain words. In this study those words have been the four above-mentioned hashtags. The Twitter Search API has a rate limit, constraining the number of requested Tweets over a 15-minute window to 5000[9].

3.3.2 Python

Python is a high-level programming language supporting modules, packages and libraries. The version used in this study is 2.7.10[7].

3.3.3 Tweepy

Tweepy is a Python library for accessing the Twitter APIs. The version used in this study is 3.3.0[17].
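As an illustration of how the collection step could look with Tweepy 3.x, the sketch below builds a single search query from the four hashtags and pages through the results. The function names, the credential parameters and the `limit` argument are assumptions made for this example and are not taken from the code used in the study. Tweepy is imported inside the function so that the query helper can be used on its own.

```python
HASHTAGS = ["#svpol", "#svtagenda", "#svtopinion", "#opinionlive"]


def build_query(hashtags):
    """Join hashtags into one search query that matches any of them."""
    return " OR ".join(hashtags)


def collect_tweets(consumer_key, consumer_secret, access_token,
                   access_token_secret, limit=100):
    """Fetch up to `limit` Tweets containing any of the hashtags.

    Hypothetical helper: assumes Tweepy 3.x, where the Search API is
    exposed as `API.search`.
    """
    import tweepy  # imported lazily; only needed for the network call

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    # wait_on_rate_limit makes Tweepy sleep until the rate-limit window resets.
    api = tweepy.API(auth, wait_on_rate_limit=True)
    cursor = tweepy.Cursor(api.search, q=build_query(HASHTAGS))
    return list(cursor.items(limit))
```

Letting the library wait out the rate limit keeps the sketch short; a longer-running collector would probably also want to persist Tweets as they arrive.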

3.3.4 Neo4J

The graph database tool Neo4J is used to store data and to manage the relationships between the collected datasets[8].

3.3.5 Database Schema

Nodes in the database are in the form of Users and Tweets, and relationships in the form of Tweets and Retweets, as seen in figure 3.1.


Figure 3.1: Relationships Between Users and Tweets. A TWEETS edge from a user to a Tweet indicates that the user is the original author of the Tweet, whereas a RE-TWEETS edge indicates that a user has retweeted the Tweet.

3.4 Heuristic

A heuristic is formulated based on the assumption that a user's credibility depends on the credibility of the users who retweet them. This heuristic is evaluated using two versions of the PageRank algorithm, as described in the subsections below.

A graph of users is constructed from the collected data set of users and Tweets, as seen in figure 3.2. In the new user graph, an edge from user X to another user Y indicates that user X has retweeted user Y. The weight of the edge describes the proportion of X's retweets that go to user Y.

Figure 3.2: Construction of a user graph from the collected dataset.
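The edge weights of the user graph can be sketched as follows. The function name `build_retweet_ratio` and the input format, a list of (retweeter, original author) pairs, are assumptions made for this example; the study itself reads the relationships from the Neo4j database.

```python
from collections import Counter, defaultdict


def build_retweet_ratio(retweets):
    """Return {retweeter: {author: weight}} from (retweeter, author) pairs.

    The weight of the edge X -> Y is the share of X's retweets that go to Y.
    """
    counts = defaultdict(Counter)
    for retweeter, author in retweets:
        counts[retweeter][author] += 1

    ratio = {}
    for retweeter, authors in counts.items():
        total = sum(authors.values())
        ratio[retweeter] = {a: float(c) / total for a, c in authors.items()}
    return ratio


# Hypothetical example: X retweets Y three times and Z once,
# so the edge X -> Y gets weight 0.75 and X -> Z gets weight 0.25.
example = build_retweet_ratio([("X", "Y"), ("X", "Y"), ("X", "Y"), ("X", "Z")])
```

By construction, the outgoing weights of every user who has retweeted at least once sum to 1.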

The PageRank algorithm repeatedly calculates and updates the rank of each user in the graph, until the ranking stabilises, that is, until the changes in ranks between iterations are negligibly small.

The difference between the two algorithms presented below is that the first one, PageRankReset, overwrites the old ranks with the newly calculated ones. In the second, PageRankKeep, the newly calculated ranks are added to the previously calculated ranks.


3.4.1 PageRankReset

This is a direct implementation of the simplest version of the PageRank algorithm. The list retweet_ratio is initialised to contain the edge weights of each directed edge in the graph. That is, retweet_ratio[i,j] contains the proportion of user i's retweets that go to user j. Each user's rank is initialised to 1/n, where n is the number of users in the graph. The users' ranks are stored in pageRank.

The algorithm iteratively computes new ranks for all users in the graph, until the pageRank list stabilises between iterations. In each iteration, the values in pageRank are moved to pageRank_old, and pageRank is reset to 0 for each user. The new rank for user i is computed as the sum of pageRank_old[k] * retweet_ratio[k,i] over all users k ≠ i in the graph. This means that the new rank for user i is based entirely on the previous ranks of the other users that have retweeted user i.

After a new pageRank has been computed for each user, the pageRank list is normalised, so that the sum of all pageRank values is equal to 1.

Pseudocode

PageRankReset
retweet_ratio[i, j] = proportion of user i's retweets that go to user j
pageRank[i] = 1/n, i = 1..n
while pageRank not stabilised do
    pageRank_old = pageRank
    pageRank = [0, 0, ..., 0]
    for i = 1 ... n do
        for k = 1 ... n, where k != i do
            pageRank[i] += pageRank_old[k] * retweet_ratio[k,i]
        end
    end
    pageRank = normalise(pageRank)
end
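A minimal Python sketch of the PageRankReset pseudocode is given below. The stopping rule is not specified in the pseudocode, so the tolerance `tol` and the iteration cap `max_iter` are assumptions of this sketch, as is the representation of retweet_ratio as a nested list.

```python
def normalise(ranks):
    """Scale ranks so they sum to 1 (left unchanged if they are all zero)."""
    total = sum(ranks)
    return [r / total for r in ranks] if total > 0 else list(ranks)


def page_rank_reset(retweet_ratio, tol=1e-9, max_iter=100):
    """Iterate the rank update, overwriting the old ranks each round.

    retweet_ratio[k][i] is the proportion of user k's retweets that go to
    user i. `tol` and `max_iter` are assumptions of this sketch.
    """
    n = len(retweet_ratio)
    page_rank = [1.0 / n] * n
    for _ in range(max_iter):
        old = page_rank
        # New rank of user i: contributions from every other user k.
        page_rank = [
            sum(old[k] * retweet_ratio[k][i] for k in range(n) if k != i)
            for i in range(n)
        ]
        page_rank = normalise(page_rank)
        if max(abs(a - b) for a, b in zip(page_rank, old)) < tol:
            break
    return page_rank


# Hypothetical three-user network: user 0 retweets only user 1, user 1
# splits its retweets between users 0 and 2, user 2 retweets only user 0.
ranks = page_rank_reset([[0.0, 1.0, 0.0],
                         [0.5, 0.0, 0.5],
                         [1.0, 0.0, 0.0]])
# ranks converges towards [0.4, 0.4, 0.2]
```

Without the iteration cap the loop would not be guaranteed to stop, since the ranks can keep alternating on some graphs instead of settling.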

3.4.2 PageRankKeep

The list retweet_ratio is initialised to contain the edge weights of each directed edge in the graph. That is, retweet_ratio[i,j] contains the proportion of user i's retweets that go to user j. Each user's rank is initialised to 1/n, where n is the number of users in the graph. The users' ranks are stored in pageRank.

The algorithm iteratively computes new ranks for all users in the graph, until the pageRank list stabilises between iterations. In each iteration, the values in pageRank are copied to pageRank_old, while pageRank itself is left unchanged. The new rank for user i is computed by adding the sum of pageRank_old[k] * retweet_ratio[k,i], over all users k ≠ i in the graph, to user i's existing rank. This means that the new rank for user i is based on both i's own previous rank and the previous ranks of the other users that have retweeted user i.

After a new pageRank has been computed for each user, the pageRank list is normalised, so that the sum of all pageRank values is equal to 1.


Pseudocode

PageRankKeep
retweet_ratio[i, j] = proportion of user i's retweets that go to user j
pageRank[i] = 1/n, i = 1..n
while pageRank not stabilised do
    pageRank_old = pageRank
    for i = 1 ... n do
        for k = 1 ... n, where k != i do
            pageRank[i] += pageRank_old[k] * retweet_ratio[k,i]
        end
    end
    pageRank = normalise(pageRank)
end
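The PageRankKeep pseudocode can be sketched in Python in the same way. As the pseudocode does not define when the ranks count as stabilised, the tolerance `tol` and the iteration cap `max_iter` are assumptions of this sketch, as is the nested-list form of retweet_ratio.

```python
def page_rank_keep(retweet_ratio, tol=1e-9, max_iter=100):
    """Iterate the rank update, adding new contributions to the old rank.

    retweet_ratio[k][i] is the proportion of user k's retweets that go to
    user i. `tol` and `max_iter` are assumptions of this sketch.
    """
    n = len(retweet_ratio)
    page_rank = [1.0 / n] * n
    for _ in range(max_iter):
        old = page_rank
        # Keep the previous rank and add what the other users contribute.
        page_rank = [
            old[i] + sum(old[k] * retweet_ratio[k][i]
                         for k in range(n) if k != i)
            for i in range(n)
        ]
        total = sum(page_rank)
        page_rank = [r / total for r in page_rank]
        if max(abs(a - b) for a, b in zip(page_rank, old)) < tol:
            break
    return page_rank


# Hypothetical three-user network: nobody retweets user 0, while users 1
# and 2 retweet each other. User 0's rank is halved in every iteration
# instead of dropping to zero at once.
ranks = page_rank_keep([[0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0],
                        [0.0, 1.0, 0.0]])
# ranks converges towards [0.0, 0.5, 0.5]
```

In this example a user who is never retweeted loses rank gradually, whereas a pure overwrite would set that user's rank to zero after a single iteration.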

3.4.3 LogRank

In each iteration, both PageRank algorithms assign a PageRank to every user in the network. In this study, a credibility score called LogRank is calculated for each user based on the PageRank calculated by the algorithm:

$$\mathrm{LogRank}(user_i) = \log_2 \frac{\mathrm{PageRank}(user_i)}{\mathrm{AvgPageRank}} \tag{3.1}$$

This was used to get a smaller range of values to work with, which allows for clearer presentation in graphs and tables.
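Equation 3.1 can be sketched in Python 3 as shown below. The function name `log_rank` is an assumption of this example, and so is the choice to map a PageRank of zero to negative infinity, which mirrors the -inf row used in the result tables. The tables list whole-number LogRank values, but how the values were rounded is not stated, so the sketch does no rounding.

```python
import math


def log_rank(page_rank):
    """Return log2(PageRank / average PageRank) for every user.

    A PageRank of zero is mapped to -inf (an assumption of this sketch).
    """
    avg = sum(page_rank) / len(page_rank)
    return [
        math.log2(r / avg) if r > 0 else float("-inf")
        for r in page_rank
    ]


# Hypothetical ranks for four users; the average PageRank is 0.25.
scores = log_rank([0.5, 0.25, 0.25, 0.0])
# scores == [1.0, 0.0, 0.0, -inf]
```

A LogRank of 0 thus corresponds to exactly the average PageRank, and each additional unit corresponds to a doubling of the PageRank.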

3.5 Manual Credibility Check

A manual user credibility check is developed according to a number of selected criteria that, in combination with each other, are assumed to be crucial in deciding the credibility of a user, or at least to give an indication of it. The criteria differ between accounts associated with individuals and accounts associated with organisations. This check is made in order to evaluate the performance of the algorithm, and it is constructed to be easy to apply manually while maintaining as high objectivity and correctness as possible.

The manual credibility check gives a user a score from 1 to 3, based on the average of a verification score and a follower count score. This score is not to be confused with the LogRank; it is solely connected to the manual credibility check.

3.5.1 User Types

Each user is placed in the category whose profile best fits the user.

Person

The user account is associated with a private or public person. This person could have links to one or more organisations, but the account is intended for personal use.


Organisation

The user account is associated with an organisation, news agency, authority or political party.

3.5.2 Verification Score

Person

For personal accounts, the verification scores can be seen in table 3.1.

Table 3.1: Verification Score for Person User Accounts

Score  Description
3      Account belongs to a politician, celebrity, or business leader
2      Person behind account can be identified through name and profile picture
1      Person behind account cannot be identified

Organisation

For accounts belonging to organisations, the verification scores can be seen in table 3.2.

Table 3.2: Verification Score for Organisation User Accounts

Score  Description
3      The organisation has a legally responsible person
1      A legally responsible person cannot be identified

If the user is the Twitter account of a news agency, two questions are asked: is there a publisher, and does the editorial office have an official address? This information indicates that the news agency is serious and can take responsibility for its content.

Assumptions

Due to accountability, a user account connected to a physical person is assumed to have more credible content than one that has no such connection.

3.5.3 Follower Count Score

The more followers a user has, the more credible the user is considered to be. The scoring is assigned as seen in table 3.3.

Table 3.3: Follower Count Score for all User Accounts

Score  Description
3      The user account has over 5000 followers
2      The user account has between 1000 and 4999 followers
1      The user account has between 0 and 999 followers


Assumptions

The approval of a large group of users boosts the credibility of a user, leading to the assumption that a user with a large number of followers is more credible than a user with fewer followers.

3.5.4 Spam

If a user posts a large number of similar Tweets within short periods of time, or on average uses more than three hashtags per Tweet, the user is considered to be producing spam posts. Tweets are considered similar if they all contain a small set of the same hashtags, phrases and links, or if their content is repetitive. Posts are considered repetitive if they contain the same content over and over again with little or no alteration. Users with this behaviour are considered non-credible. All spam users receive a manual credibility score of 1, regardless of their number of followers.

Assumptions

If a Twitter account contains spam posts, this is considered bad for the user's credibility, since such posts or retweets are generated to gain followers or attention and do not contain actual content.
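The scoring rules of the manual credibility check can be summarised in a short sketch. The function names are assumptions made for this example, and the verification score is taken as an input because it is assigned by a human reviewer according to tables 3.1 and 3.2. Table 3.3 does not say how exactly 5000 followers should be scored; the sketch assumes a score of 3 in that case.

```python
def follower_count_score(followers):
    """Follower count score following table 3.3.

    Assumption: exactly 5000 followers is scored 3, since the table
    leaves that value unassigned.
    """
    if followers >= 5000:
        return 3
    if followers >= 1000:
        return 2
    return 1


def manual_credibility(verification_score, followers, is_spam=False):
    """Average of the two scores; spam accounts always receive 1."""
    if is_spam:
        return 1.0
    return (verification_score + follower_count_score(followers)) / 2.0


# Hypothetical accounts:
identified_person = manual_credibility(2, 500)           # (2 + 1) / 2 = 1.5
politician = manual_credibility(3, 12000)                # (3 + 3) / 2 = 3.0
spammer = manual_credibility(3, 12000, is_spam=True)     # always 1.0
```

Because both component scores are whole numbers between 1 and 3, the resulting manual score is always one of 1, 1.5, 2, 2.5 or 3.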

3.6 Manual Control of Results

All users with an assigned LogRank of 3 or higher are evaluated using the manual credibility check. Users with a LogRank of 2 or less are not evaluated, as there are too many users in the network to check everyone manually within a reasonable time. The line was drawn at this value both because it gave a manageable number of users to check manually, and because users with a LogRank of 2 or less were not considered to stand out enough, as many other users scored higher.

3.7 Estimating Credibility Distribution in User Population

In order to evaluate the efficiency of the algorithms used in this study, the credibility distribution across the entire population of 4271 users is estimated. The distribution is estimated by applying the manual credibility check to a randomly chosen sample of 354 users. The results can then be used to estimate the distribution across the population, assuming a multinomial distribution.
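One way such an estimate could be computed is sketched below. The text does not state how uncertainty was handled, so the normal-approximation margin of error, the finite population correction and the function name `estimate_distribution` are all assumptions of this example, and the counts used are toy values rather than data from the study.

```python
import math


def estimate_distribution(sample_counts, population_size, z=1.96):
    """Estimate category shares with normal-approximation margins of error.

    sample_counts maps a manual credibility score to the number of sampled
    users with that score. z=1.96 corresponds to a 95% confidence level.
    """
    n = sum(sample_counts.values())
    # Finite population correction: the sample is drawn without
    # replacement from a population of known size.
    fpc = math.sqrt((population_size - n) / (population_size - 1.0))
    result = {}
    for score, count in sample_counts.items():
        p = count / float(n)
        margin = z * math.sqrt(p * (1 - p) / n) * fpc
        result[score] = (p, margin)
    return result


# Toy values only: 100 sampled users from a population of 1000.
estimate = estimate_distribution({3: 30, 2: 50, 1: 20}, 1000)
share_of_3, margin_of_3 = estimate[3]
```

Each entry pairs the estimated share of a score in the population with the half-width of its approximate confidence interval.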

Chapter 4

Results

In this chapter the results from both approaches to the PageRank algorithm are presented, as well as the estimated credibility distribution in the entire population of users.

4.1 PageRankReset

The PageRankReset algorithm terminates after four iterations on the collected data set. After the first iteration, 397 of the 4271 users in the network receive a positive LogRank, as shown in table 4.1. Out of these, 114 users receive a LogRank of three or higher. In the succeeding iterations the number of users with a positive LogRank decreases steadily, until only two are left.

Table 4.1: LogRank distribution after each iteration of the PageRankReset algorithm. Rows show LogRank, and columns show iterations. Empty cells are marked with -.

LogRank        1      2      3      4
11             -      -      -      1
10             -      -      1      -
9              -      1      3      1
8              1      3      2      -
7              2      4      -      -
6              5      7      -      -
5             17     12      -      -
4             27     17      -      -
3             55      -      -      -
2            113      -      -      -
1            177      -      -      -
-inf        3874   4227   4265   4269
Increased    397     44      6      2
Decreased   3874   4227   4265   4269

4.1.1 Manual Credibility Check of PageRankReset

After the first iteration, the manual credibility check considers two of the three users rated highest by the algorithm to be spam accounts. The rest of the accounts that the algorithm considers to have high credibility have a large spread of credibility scores after the manual check. At this point the credibility scores of the algorithm do not correlate with the credibility scores of the manual check, as seen in table 4.2 a).

After the second iteration, only one of the users rated highest by the algorithm is considered to have a low credibility score by the manual check (rated below 2). The rest of the users have fairly high credibility according to the manual check. This is shown in table 4.2 b).

After the third iteration there are only six users with credibility left, and they all have a high credibility score from the algorithm, as seen in table 4.2 c). Of these six users, four are considered very credible after the manual check.

After the fourth iteration the algorithm stabilises and there are only two users left. They are considered to be credible by both the algorithm and the manual credibility check. This is shown in table 4.2 d).

In the following iterations the two remaining users take turns endorsing each other with credibility, and the algorithm is terminated.
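The turn-taking between the last two users can be reproduced with a small sketch of a single PageRankReset update. The function name `reset_step`, the two-user network and the starting ranks are hypothetical and chosen only to make the effect visible; they are not the values observed in the study.

```python
def reset_step(ranks, retweet_ratio):
    """One PageRankReset update followed by normalisation."""
    n = len(ranks)
    new = [
        sum(ranks[k] * retweet_ratio[k][i] for k in range(n) if k != i)
        for i in range(n)
    ]
    total = sum(new)
    return [r / total for r in new]


# Hypothetical two-user network: each user retweets only the other.
ratio = [[0.0, 1.0],
         [1.0, 0.0]]
ranks = [0.75, 0.25]
history = [ranks]
for _ in range(3):
    ranks = reset_step(ranks, ratio)
    history.append(ranks)
# history alternates between [0.75, 0.25] and [0.25, 0.75]
```

Since each user's new rank is exactly the other user's old rank, the two ranks swap in every iteration and the pair keeps all of the credibility between them.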

Figure 4.1 shows how the manual rating scores are distributed among the Twitter users considered credible by the PageRankReset algorithm.

Table 4.2: Manual credibility score for each iteration of the PageRankReset algorithm. Thecolumn Rating contains the manual credibility scores.

a) Iteration 1Rating Count Percent3 33 30,8%2,5 19 17,8%2 25 23,4%1,5 19 17,8%1 7 6,1%Sum 114

b) Iteration 2Rating Count Percent3 44 25,0%2,5 11 25,0%2 9 20,5%1,5 11 25,0%1 2 4,5%Sum 44

c) Iteration 3
Rating  Count  Percent
3       2      33.3%
2.5     2      33.3%
2       1      16.7%
1.5     1      16.7%
1       0      0.0%
Sum     6

d) Iteration 4
Rating  Count  Percent
3       2      100%
2.5     0      0.0%
2       0      0.0%
1.5     0      0.0%
1       0      0.0%
Sum     2


Figure 4.1: Distribution of manual credibility score among users assigned a LogRank of three or higher, for each iteration of the PageRankReset algorithm.

After the first iteration, 27.6% of the users that are considered credible by the algorithm are considered to have low credibility by the manual credibility check. As the total number of users drops for each iteration, the number of users who are considered non-credible decreases. After the fourth iteration, all of the users considered credible by the algorithm are also considered credible by the manual check. This is shown in figure 4.1.

4.2 PageRankKeep

After 17 iterations of the PageRankKeep algorithm, there are only two users left with positive ranking. The rest of the users receive a lower ranking for each iteration while the two positive users remain the only ones who are considered credible. This is when the algorithm terminates. Since the credibility score assigned after the first iteration remains through the following iterations, users who would lose their credibility fast in the PageRankReset algorithm remain credible longer in the PageRankKeep algorithm.

Table 4.3 shows the distribution in LogRank after every other iteration. As seen in the table, up until the seventh iteration the number of users with increased (non-negative) LogRank remains stable, and thereafter it decreases steadily until the 17th iteration, in which only two of them remain.


Table 4.3: LogRank distribution after every other iteration of the PageRankKeep algorithm. Rows show LogRank, and columns show iterations. Only odd iterations are shown in the table due to the small changes between each iteration.

LogRank 1 3 5 7 9 11 13 15 17
11 1 1 1 1
10 1 1 1 1 1
9 1
8 1 2
7 2 1 2 1
6 2 1 5 6 3 1
5 3 10 12 8 9
4 4 25 23 25 10 7 1
3 32 29 40 29 23 7
2 46 93 81 52 39 17 6
1 135 66 63 102 76 28 7 1
0 175 171 171 171 63 39 11 3
-1 3874 3874 171 62 23 7 1
-2 3874 63 40 8 2
-3 3874 171 73 17 6
-4 3874 63 30 6
-5 171 33 13
-6 3874 62 20
< -6 3874 4108 4221
Increased 397 397 397 397 226 101 27 6 2
Decreased 3874 3874 3874 3874 4045 4170 4244 4265 4269

4.2.1 Manual Credibility Check of PageRankKeep

Figure 4.2 shows the manual credibility check of users predicted to be credible by the PageRankKeep algorithm. During the first five iterations the number of users considered credible by the PageRankKeep algorithm increases steadily. In relative terms, the number of users with score 1.5 increases the most, by 183% between iterations 1 and 5. In absolute terms, the numbers of users with scores 1.5 and 3 each increase by eleven between the same iterations. After the fifth iteration the number of users who are considered credible decreases, while the number of users who are considered non-credible is steady through the rest of the iterations until iteration 15, when the only users left are the ones considered to be credible. For more exact numbers, see table A.1.


Figure 4.2: The distribution of manual credibility scores after each iteration, among users who have scored a LogRank of three or higher from the PageRankKeep algorithm.

4.3 Estimated Credibility Distribution in Population

Results from the manual credibility check on the randomly selected samples chosen from the entire population of collected Twitter accounts are shown in table 4.4. The column s.e. contains the standard errors of the sample at a 95% confidence level, assuming a multinomial probability distribution.

Table 4.4: Manual credibility score in sample of population

Rating  Count  Percent  s.e.
3       33     9.3%     1.5%
2.5     16     4.5%     1.1%
2       55     15.5%    1.9%
1.5     122    34.5%    2.5%
1       128    36.2%    2.6%
Sum     354
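The s.e. column of table 4.4 can be reproduced from the counts alone. The sketch below assumes the usual standard error of a single multinomial category proportion, sqrt(p(1 - p)/n), which agrees with the tabulated values after rounding. The function and variable names are illustrative and are not taken from the code used in the study.

```python
import math

# Counts per manual credibility rating, taken from table 4.4.
sample_counts = {3.0: 33, 2.5: 16, 2.0: 55, 1.5: 122, 1.0: 128}


def proportion_standard_errors(counts):
    """Return {rating: (proportion, standard error)} for multinomial counts."""
    n = sum(counts.values())
    result = {}
    for rating, count in counts.items():
        p = count / n
        # Assumed formula: standard error of one multinomial category proportion.
        result[rating] = (p, math.sqrt(p * (1 - p) / n))
    return result


if __name__ == "__main__":
    for rating, (p, se) in proportion_standard_errors(sample_counts).items():
        print(f"{rating}: {p:.1%} (s.e. {se:.1%})")
```

Each category is treated on its own here, so the sketch says nothing about the covariance between categories.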

Figure 4.3 shows the estimated number of users in the entire population for each manual credibility score. According to the estimation, a majority of circa 70% of the users in the population have a low or very low manual credibility score. Around 15% of users have a medium score, and the remaining 15% of users have a high or very high manual credibility score. For exact numbers, see table A.2.


Figure 4.3: Estimated distribution of manual credibility score in the entire population. The error bars mark the confidence intervals.

4.4 Observations

4.4.1 Spam

Four users were assigned high LogRank scores by the PageRank algorithms, but were considered to be posting spam in the manual check. In the PageRankReset algorithm two of these users appeared with a LogRank of 3 or higher in the first iteration. None of them remained in succeeding iterations. In the PageRankKeep algorithm these four users maintained a LogRank of 3 or higher until the 10th iteration. In the 13th iteration all of them had a LogRank under 3.

Spam User 1

The user has posted 884 out of the 7648 Tweets in the dataset, which is almost 12% of all Tweets. On average each Tweet contains 6.25 hashtags. The user has been retweeted a total of 325 times. The user is considered a spam user due to repetitive content in the Tweets posted, as well as the excessive use of hashtags.

Spam User 2

The user has posted one Tweet in the dataset. The Tweet contains nine hashtags, and has been retweeted by 140 users, 139 of which have not posted a single Tweet in the dataset. The user is considered a spam user due to an excessive use of hashtags.

Spam User 3

The user has posted 77 Tweets in the dataset, and has been retweeted a total of 416 times. On average each Tweet contains 1.09 hashtags. The user is considered a spam user due to repetitive content in the Tweets posted.

Spam User 4

The user has posted 8 Tweets in the dataset, and has been retweeted a total of 121 times. On average each Tweet contains 0.625 hashtags. The user is considered a spam user due to repetitive content in the Tweets posted.

4.4.2 Credible Users with Low LogRank

According to the estimation of credibility distribution in the entire population of users in 4.3, there should be around 600 credible or very credible users in the population. The PageRankReset algorithm identifies 52 of these in the first iteration, which corresponds to circa 8.7% of these 600 users. For the PageRankKeep algorithm at most 46, or 7.7% of these users, are identified. This implies that over 90% of the credible or very credible users are not identified by the algorithms.
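The percentages above are simple ratios against the rounded estimate of 600 users. In the sketch below, the counts 52 and 46 are taken from the text; reading them as the sums of the rating 3 and rating 2.5 rows in table 4.2 a) and in iteration 5 of table A.1 is an inference, not something the text states. All names are illustrative.

```python
# Rounded estimate of users with manual rating 2.5 or 3 in the population (section 4.3).
ESTIMATED_CREDIBLE_USERS = 600

# Users with manual rating 2.5 or 3 among those given a LogRank of 3 or higher.
# The row sums are an assumed reading of the tables; the totals 52 and 46 are from the text.
identified = {
    "PageRankReset": 33 + 19,  # table 4.2 a), iteration 1
    "PageRankKeep": 26 + 20,   # table A.1, iteration 5 (largest such sum)
}


def identified_share(found, total=ESTIMATED_CREDIBLE_USERS):
    """Fraction of the estimated credible users that an algorithm identified."""
    return found / total


if __name__ == "__main__":
    for name, found in identified.items():
        share = identified_share(found)
        print(f"{name}: {found} users, {share:.1%} found, {1 - share:.1%} missed")
```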

Below follow examples of why credible users were not identified by the PageRank algorithms.

Quoting Tweets

A user not participating in discussions around the four hashtags may become a part of the dataset through Tweets that quote them. One example is NASA, which became a part of the dataset through Spam User 1, who quoted one of NASA’s Tweets and, among five hashtags, included #svpol, #svtopinion, and #svtagenda in the quoting Tweet. The quoted Tweet is the only Tweet posted by NASA in the dataset, and it lacks relevance to any of the four hashtags around which this study is conducted. As a result, the collected dataset does not contain any retweets of the NASA Tweet, causing the PageRank algorithms to assign NASA a low LogRank.

User Mentions

There have been occurrences of Swedish and international celebrities becoming a part of the dataset due to being mentioned in a Tweet containing any of the four hashtags around which this study is conducted. These users have not themselves posted any Tweets in the dataset, and hence do not have any retweeted Tweets in the dataset. This causes the PageRank algorithms to assign them a low LogRank.

Followers

A low number of followers will lead to a low number of retweets. Since the algorithms assign credibility based on the number of retweets from other users, some accounts that actually are credible will be assigned a low LogRank by the algorithms. An example of this is the Twitter accounts of some Swedish authorities, which are considered to be credible but have between 1000 and 4999 followers.

4.4.3 Circular Endorsement

In both versions of the PageRank algorithm, the two users remaining in the final iteration were the same. Both user accounts belong to a Swedish news service and they only retweet each other. In the PageRank algorithm, this circular endorsement makes sure they maintain their credibility, not "leaking" it to other users.

Chapter 5

Discussion

For both PageRankReset and PageRankKeep, the group of users ranked credible by the algorithm contains a considerably higher proportion of users with a manual credibility rating of 2.5 or 3 than a random sample of the same size would. In this regard the algorithms perform better than random. However, the result of the study omits many users who could in fact be considered credible. The reason for this could be that the only way for the algorithm to determine whether a user is credible is to check how many times the user has been retweeted by other users. If a user who exists in the network is not retweeted, the user will be perceived as non-credible. In reality, the lack of retweets does not mean that the user does not distribute credible material; it may, for example, be because the user did not participate fully, or at all, in the discussions for which the hashtag was intended. The user may have ended up in the network by mistake. One example of this is the case of NASA, where the Twitter account of NASA is quoted by another user who uses a hashtag in the quoting Tweet, as explained in section 4.4.2. The user with the original post, in this case NASA, then becomes part of the network without actively participating and is therefore not directly retweeted, leading to an unfair mistrust of the user. One way to avoid such situations could be to restrict the network solely to users who use the investigated hashtags in their own posts, thus omitting users who are quoted and who do not have any association with the network. However, since the users in the network are collected based on specific hashtags, users who are considered irrelevant in the context can in that way be interpreted as having low credibility.
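To make the comparison with a random sample concrete, the share of users rated 2.5 or 3 can be computed from the tabulated counts. The sketch below uses the first iteration of PageRankKeep from table A.1 and the random sample from table 4.4. It is an illustration of the arithmetic with hypothetical names, not a statistical test and not part of the study's code.

```python
# Manual rating counts from table 4.4 (random sample of the population).
random_sample = {3.0: 33, 2.5: 16, 2.0: 55, 1.5: 122, 1.0: 128}

# Manual rating counts from table A.1, iteration 1 (users with LogRank >= 3
# from PageRankKeep).
keep_iteration_1 = {3.0: 15, 2.5: 12, 2.0: 7, 1.5: 6, 1.0: 4}


def high_rating_share(counts):
    """Share of users whose manual credibility rating is 2.5 or 3."""
    return (counts[3.0] + counts[2.5]) / sum(counts.values())


if __name__ == "__main__":
    print(f"random sample:    {high_rating_share(random_sample):.1%}")
    print(f"PageRankKeep (1): {high_rating_share(keep_iteration_1):.1%}")
```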

In PageRankReset, endorsements from users who are not being retweeted, and thus not considered to be credible, do not generate credibility for the users they are retweeting. This prevents non-credible users from receiving undeserved credibility. Generally, users who are considered to be spam users receive a large number of retweets from other users who themselves have few retweets. The algorithm identifies these retweeting users as low credibility users, making their endorsements useless. However, in some cases a user who is retweeted by many non-credible users (for example users with a small number of followers) might actually distribute credible content. With this algorithm, that user will be dismissed.

In order for a user to maintain credibility through the iterations, the users retweeting them must have maintained their credibility in the previous iteration. In the first iteration, retweets from any user generate a credibility score. In the second iteration, only retweets from users who received retweets in the first iteration generate credibility. In the third iteration, only retweets from users who received retweets in the second iteration generate credibility, and so on. However, it is possible to circumvent this property when a group of users enters a circular endorsement and never retweets anyone outside the group. This group maintains its credibility through each iteration since its members form an infinite chain of retweets among themselves. This behaviour was observed in the collected dataset, as described in section 4.4.3. The algorithm is based on the assumption that a user cannot endorse itself, but a group whose members endorse each other keeps its credibility within the group, which by extension is the same thing as endorsing yourself.
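The iteration rule and the effect of circular endorsement can be illustrated on a toy network. The sketch below is a simplified, set-based version of the rule described above: it only tracks which users still hold any credibility and does not compute the numeric scores of PageRankReset. The function name, the data layout and the toy network are hypothetical and are not taken from the study.

```python
def credible_sets_per_iteration(retweets, max_iterations=20):
    """Apply a simplified, set-based version of the reset rule.

    retweets: iterable of (retweeter, author) pairs.
    Iteration 1: every author retweeted by someone else is credible.
    Iteration k: an author is credible only if retweeted by a user who
    was credible after iteration k - 1.
    Returns one set of credible users per iteration, stopping when the
    set no longer changes.
    """
    # Assumption from the text: a user cannot endorse itself.
    edges = [(r, a) for r, a in retweets if r != a]
    credible = {author for _, author in edges}
    history = [credible]
    for _ in range(max_iterations - 1):
        updated = {author for retweeter, author in edges if retweeter in credible}
        if updated == credible:
            break
        history.append(updated)
        credible = updated
    return history


# Hypothetical toy network: a chain a -> b -> c -> d and two accounts
# (news1, news2) that only retweet each other.
toy_retweets = [
    ("a", "b"), ("b", "c"), ("c", "d"),
    ("news1", "news2"), ("news2", "news1"),
    ("x", "news1"),
]

if __name__ == "__main__":
    for i, users in enumerate(credible_sets_per_iteration(toy_retweets), start=1):
        print(i, sorted(users))
```

In this toy network the chain loses one user per iteration, while the mutually retweeting pair is never removed.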

In the case of PageRankKeep it is possible for a user to rise in rank by simply getting endorsed by other users without credibility, since the credibility score is maintained through every iteration. This makes it possible for spam users to rise in rank. An example of this is Spam User 2, who receives a high credibility score in the first iteration in PageRankKeep but only has one Tweet in the network, and that Tweet is retweeted many times. This is described in section 4.4.1.

When conducting the manual credibility check it might be hard for the person estimating the score to avoid being biased by the actual content of the account being evaluated. One’s own opinions must not interfere with the manual check. In order to achieve objectivity when manually estimating the credibility, strict guidelines have been used during the evaluation. However, since the check is coarse-grained it will inevitably misjudge some of the users. For example, a user who might not be credible can be assigned a higher credibility score than they deserve, and a user who is considered to be credible might be assigned a lower credibility score than they should. A good example of the latter is the Twitter accounts of Swedish authorities, which in reality are very credible but have a low number of followers, which lowers their credibility score according to the criteria followed when assigning the score.

Considering the time and resources available when conducting this study, it was not possible to execute a more elaborate investigation in this case. Limitations in both resources and time are also the reason why the study had to be restricted regarding the collection of data. If more time had been at hand, Tweets could have been collected over a longer time period than 15 days, giving a more accurate representation of the behaviour in the network, as this would have resulted in a higher number of registered user interactions. This would have given the algorithm more data to work on, likely rendering a different result. Due to this, it is hard to draw any definitive conclusions regarding the accuracy of the methods evaluated in this study.

Previous studies have found successful methods to determine credibility. Information Credibility on Twitter [5] uses semantic analysis, interpreting the actual content of the posts in social media feeds, together with meta data, achieving good results. Evaluating Event Credibility on Twitter [14] combines this method with a PageRank inspired algorithm, achieving slightly better results. Using these kinds of methods has been considered too complicated and time consuming for this study. Since methods of interpreting semantics and sentiments in Twitter posts could not be used, it was not possible to reach the same levels of accuracy in credibility determination as the previous studies reached. On the other hand, both of these studies were conducted on much larger datasets than the study presented in this paper (millions of Tweets, as opposed to thousands). This makes it hard to compare results in an accurate way.

If the implementations used in this study were to be used on social media, for example Twitter, in their current state, they could have a negative impact regarding credibility. The somewhat doubtful credibility rating could contribute to the spread of fake news, suppress real news and lower the credibility of Twitter in general as a source of information.

Chapter 6

Conclusion

The results show that the PageRank algorithm, although it could be considered to perform better than random, is not effective when evaluating user credibility. On the limited dataset that was used, several non-credible users received a high rank, while many credible users received a lower rank. This would likely change if a larger dataset were to be used. In both versions of the PageRank algorithm tested in this study, the two users receiving the highest rank when the algorithms terminate are both considered to be of very high credibility. However, all the other users in the network are assigned a low credibility score, including several users who should be considered credible. The conclusion is that the PageRank algorithm is most likely not suitable to successfully determine the credibility of Twitter users.

6.1 Future Research

The credibility of information shared on the internet is a relatively unexplored topic scientifically, with much left to research. This thesis has focused on determining the credibility of Twitter users and thereby determining whether the Tweets from these users are credible. There are several more parameters that can be taken into account when distributing credibility in a network of Tweets.

An initial suggestion for future research on this subject would be to test the algorithm on a larger dataset of users, collected over a longer period of time. Furthermore, the Twitter APIs offer a vast amount of meta data associated with both users and Tweets, and this data could be incorporated into an evaluation algorithm in order to achieve better results.


Bibliography

[1] Sveriges Television AB. Opinion live, 2017. URL https://www.svtplay.se/opinion-live. Last visited 2017-04-11.

[2] Sveriges Television AB. Svt agenda, 2017. URL http://www.svt.se/agenda. Last visited 2017-04-11.

[3] Sveriges Television AB. Svt opinion, 2017. URL https://www.svt.se/opinion. Last visited 2017-04-11.

[4] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Seventh International World-Wide Web Conference (WWW 1998), 1998. URL http://ilpubs.stanford.edu:8090/361/1/1998-8.pdf.

[5] C. Castillo, M. Mendoza, and B. Poblete. Information credibility on twitter. WWW, pages 675–684, 2012.

[6] P. Gupta et al. Wtf: The who-to-follow system at twitter. WWW, 2013.

[7] Python Software Foundation. What is python? executive summary, 2017. URL https://www.python.org/doc/essays/blurb. Last visited 2017-04-20.

[8] Neo Technology Inc. Neo4j, 2017. URL http://www.neo4j.com. Last visited 2017-04-11.

[9] Twitter Inc. The search api, 2017. URL https://dev.twitter.com/rest/public/search. Last visited 2017-04-24.

[10] Twitter Inc. Using hashtags on twitter, March 2017. URL https://support.twitter.com/articles/49309. Last visited 2017-03-27.

[11] Twitter Inc. Users – twitter developers, 2017. URL https://dev.twitter.com/overview/api/users. Last visited 2017-03-27.

[12] Twitter Inc. Twitter usage, February 2017. URL https://twitter.com. Last visited 2017-03-27.

[13] B. Jiang. Ranking spaces for predicting human movement in an urban environment. International Journal of Geographical Information Science, 23 (7):823–837, 2006. doi: 10.1080/13658810802022822.

[14] M. Gupta, P. Zhao, and J. Han. Evaluating event credibility on twitter. In Proceedings of the 2012 SIAM International Conference on Data Mining (SDM), pages 153–164, 2012.


[15] Internet Live Stats. Twitter usage statistics, 2017. URL http://www.internetlivestats.com/twitter-statistics. Last visited 2017-05-09.

[16] Svpol. Twitter hashtag intended for discussions of swedish politics, 2017.

[17] Tweepy. Tweepy, 2017. URL http://www.tweepy.org. Last visited 2017-04-11.


Appendix A

Tables

Table A.1: The distribution of manual credibility scores after each iteration, among users who have scored a LogRank of three or higher from the PageRankKeep algorithm. Rows show manual credibility score and columns show iterations.

        Iteration
RATING   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
3       15  17  20  25  26  22  22  19  12  10  10   8   4   2   2
2.5     12  17  19  19  20  20  20  18  16  12  13  10   6   3   1
2        7   8  10  12  12  10  10   9   6   4   4   4   0   0   0
1.5      6   6  11  17  17  14  14  11  10   5   7   6   4   2   0
1        4   6   6   7   7   6   6   5   4   4   4   4   2   1   0
Sum     44  54  66  80  82  72  72  62  48  35  38  32  16   8   3

Table A.2: The estimated number of users in the entire population for each manual credibility score. The column s.e. contains the standard errors, and columns CI Lower and CI Upper contain the lower and upper limits of the confidence intervals at a 95% confidence level.

RATING  Count  s.e.  CI Lower  CI Upper
3       398    66    269       528
2.5     193    47    101       285
2       664    82    502       825
1.5     1472   108   1260      1683
1       1544   109   1331      1758
Sum     4271
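The population estimates in table A.2 can be approximated by scaling the sample proportions from table 4.4 to the 4271 collected users. The sketch below assumes a normal approximation with z = 1.96 and no finite population correction. These assumptions reproduce the tabulated values to within rounding, but they are not stated in the text, and the names are illustrative.

```python
import math

POPULATION_SIZE = 4271  # all collected Twitter users (sum row of table A.2)
Z_95 = 1.96  # assumed normal quantile for a two-sided 95% interval

# Counts per manual credibility rating in the random sample (table 4.4).
sample_counts = {3.0: 33, 2.5: 16, 2.0: 55, 1.5: 122, 1.0: 128}


def estimate_population(counts, population_size=POPULATION_SIZE, z=Z_95):
    """Scale sample proportions to the population with a normal-approximation CI.

    Returns {rating: (estimate, standard error, lower limit, upper limit)},
    all expressed as numbers of users.
    """
    n = sum(counts.values())
    estimates = {}
    for rating, count in counts.items():
        p = count / n
        se_users = math.sqrt(p * (1 - p) / n) * population_size
        estimate = p * population_size
        estimates[rating] = (estimate, se_users, estimate - z * se_users, estimate + z * se_users)
    return estimates


if __name__ == "__main__":
    for rating, (est, se, low, high) in estimate_population(sample_counts).items():
        print(f"{rating}: {est:.0f} users, s.e. {se:.0f}, CI [{low:.0f}, {high:.0f}]")
```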


www.kth.se
