POLITECNICO DI MILANO
School of Industrial and Information Engineering
Master of Science in Computer Engineering
Event-based User Profiling in Social
Media Using Data Mining
Approaches
Supervisor:
Dr. Marco Brambilla
Authors:
Behnam Rahdari (Student ID: 10480057)
Tahereh Arabghalizi (Student ID: 10481546)
Academic Year 2016/17
Abstract
Social Networks have undergone dramatic growth and influenced everyone's life in recent years. People share everything from daily life stories to the latest local and global news and events using social media. This rich and continuous flow of user-generated content has received significant attention from many organizations and researchers, and is increasingly becoming a primary source for social and marketing research, among other fields. Accordingly, a great number of works have been conducted to extract valuable information from different platforms. However, there are no specific studies that focus on categorizing social media users based on the texts they share about a specific event. Given that identifying online users with a common interest in a particular event can help event organizers attract more visitors to similar future events, this thesis concentrates on examining the similarity between such users from the perspective of their published textual content. In this work, different approaches have been proposed and various experiments have been carried out to support an explanation of this notion. We take a systematic approach to accomplish this objective by applying topic modeling techniques, using statistical and data mining algorithms, combined with information visualization.
Sommario
In recent years, Social Networks have seen exponential growth and have influenced the lives of us all. People share everything through social media, from daily life stories to the latest local and global news. The rich and continuous flow of user-generated content has received significant attention from various organizations and researchers, and is increasingly becoming a primary source for social and marketing research, to name just a few fields. Accordingly, a great number of works have been conducted to extract useful information from different platforms. Despite this, there are no specific studies that focus on categorizing users based on the event-related texts they share on social media. That said, the identification of users with a common interest in a specific event can help event organizers attract more visitors to similar events in the future. This thesis focuses on examining the similarities between such users based on the textual content they have published on social media. To this end, different methodologies are proposed, and several experiments have been carried out to support an explanation of this phenomenon. We proceeded with a systematic approach to achieve this objective, applying topic modeling techniques, using statistical and data mining algorithms, and finally combining these through information visualization techniques.
Contents
1 Introduction
1.1 Context
1.2 Problem Statement
1.3 Proposed Solution
1.4 Structure of the thesis
2 Background
2.1 Relevant Concepts
2.1.1 Knowledge Discovery and Data Mining
2.1.2 Text Similarity
2.1.3 Information Retrieval Models
2.1.4 Topic Modeling
2.1.5 Dimensionality Reduction
2.1.6 Clustering Process
2.2 Relevant Technologies
2.2.1 Language Identification and Translation
2.2.2 Gender Detection
2.2.3 Twitter API
2.2.4 Instagram API
2.2.5 Text Normalization Library
2.2.6 Cloud Computing
3 Related Work
3.1 Clustering of People in Social Network Based on Textual Similarity
3.2 Clustering a Customer Base Using Twitter Data
3.3 Clustering Users Based on Interests
3.4 Crowdsourcing Search for Topic Experts in Microblogs
3.5 Using Internal Validity Measures to Compare Clustering Algorithms
4 Event-based User Profiling in Social Media
4.1 Main Idea
4.1.1 Twitter Users
4.1.2 Instagram Users
4.2 Motivation
4.2.1 Why social media as data source?
4.2.2 Why Twitter and Instagram?
4.3 Approach
4.3.1 Data Extraction
4.3.2 Data Preprocessing
4.3.3 Data Loading
4.3.4 Data Analysis
5 Experiments and Discussion
5.1 The Floating Piers Datasets
5.2 Reports of Analysis
5.2.1 Content Specific Results
5.2.2 User Specific Results
6 Conclusions
6.1 Summary
6.2 Critical Discussion
6.3 Possible Future Works
Bibliography
List of Figures
Figure 2.1: Knowledge Discovery Process
Figure 2.2: Information Retrieval Models
Figure 2.3: Graphical model representation of LDA
Figure 2.4: Steps of clustering process
Figure 2.5: How machine translation works at Yandex
Figure 2.6: Twitter REST API Design
Figure 2.7: Cloud Computing
Figure 2.8: Windows Azure Platform
Figure 3.1: Graph after spectral k-means clustering for real dataset
Figure 3.2: Graph after spectral k-means clustering for dummy dataset
Figure 3.3: Percentage of followers for a set of chosen influencers
Figure 3.4: Silhouette coefficient as a function of number of clusters
Figure 3.5: Representative clusters in R²
Figure 3.6: Spectral clustering solutions selected by various measures
Figure 4.1: Number of social media users from 2010 to 2020 (in billions)
Figure 4.2: Percentage of adult users who use different social networks
Figure 4.3: Percentage of adult users who use at least one social media, by age
Figure 4.4: Comparison between four major photo-sharing networks
Figure 4.5: Architecture design
Figure 4.6: Identify the number of topics for LDA
Figure 4.7: Elbow method representation
Figure 4.8: k-nearest neighbor distances to determine eps in DBSCAN
Figure 5.1: The Floating Piers (Project for Lake Iseo, Italy)
Figure 5.2: Twitter total retweets vs. favorites
Figure 5.3: Most frequent words (a) and hashtags (b) in tweets
Figure 5.4: Number of tweets for top 10 locations using the field "Place"
Figure 5.5: Instagram total likes vs. comments
Figure 5.6: Most frequent words (a) and hashtags (b) in Instagram posts
Figure 5.7: Distribution of Instagram posts in the world
Figure 5.8: Density of Instagram posts – Italy
Figure 5.9: Density of Instagram posts – Brescia
Figure 5.10: Density of Instagram posts – Sulzano
Figure 5.11: Tweets vs Instagram posts timeline
Figure 5.12: Dendrogram representation of Twitter users
Figure 5.13: The percentage of user engagement in each cluster
Figure 5.14: 2D representation of cluster objects
Figure 5.15: Word-cloud representation of first cluster based on bio
Figure 5.16: Word-cloud representation of second cluster based on bio
Figure 5.17: Word-cloud representation of third cluster based on bio
Figure 5.18: Hashtag word-cloud for Travel Lovers, Art Lovers and Tech Lovers
Figure 5.19: Tweet text word-cloud for Travel Lovers, Art Lovers and Tech Lovers
Figure 5.20: List slug word-cloud for Travel Lovers, Art Lovers and Tech Lovers
Figure 5.21: Percentage of users whose number of followers lie in each category
Figure 5.22: Percentage of users whose number of followings lie in each category
Figure 5.23: Percentage of users whose number of favorites lie in each category
Figure 5.24: Percentage of users whose number of tweets lie in each category
Figure 5.25: Summary statistics of numbers of followers in each cluster
Figure 5.26: Summary statistics of numbers of followings in each cluster
Figure 5.27: Summary statistics of numbers of favorites in each cluster
Figure 5.28: Summary statistics of numbers of tweets in each cluster
Figure 5.29: Language timeline per cluster
Figure 5.30: Gender timeline per cluster
Figure 5.31: Number of Users - Tweets timeline per cluster
Figure 5.32: Tweet – User ratio timeline per cluster
Figure 5.33: Twitter top 20 active users
Figure 5.34: Instagram top 20 active users
Figure 5.35: Twitter top 20 active contributors
Figure 5.36: Instagram top 20 active contributors
Figure 5.37: Twitter top 10 influencers using FtF ratio
Figure 5.38: Twitter top 10 influencers using UTW ratio
Figure 5.39: Twitter top 10 influencers using Influence Ratio
Figure 5.40: Twitter top 10 influencers per cluster using Influence Ratio
Figure 5.41: Number of followers of top 10 influencers in each cluster
Figure 5.42: Number of followings of top 10 influencers in each cluster
Figure 5.43: Number of tweets of top 10 influencers in each cluster
Figure 5.44: Number of favorites of top 10 influencers in each cluster
List of Tables
Table 3.1: The most common topics of expertise as identified from Lists
Table 3.2: Top 5 results by Cognos and Twitter WTF for query "music"
Table 3.3: Average relative SI, CH and DB score over data set
Table 4.1: Twitter extracted features
Table 4.2: Instagram extracted features
Table 4.3: Topic probabilities by user
Table 4.4: Top terms of each extracted topic by LDA
Table 4.5: Formulas for Silhouette, Dunn and Entropy indices
Table 5.1: Evaluation results of cluster validation indices
Chapter 1
1 Introduction
1.1 Context
Social Networks have undergone dramatic growth in recent years. Such networks
provide a powerful reflection of the structure and dynamics of the society of the 21st
century and the interaction of the Internet generation with both technology and other
people (Sfetcu 2017). Social media has a great influence on our daily lives. People
share their opinions, stories, news, and broadcast events using social media.
Monitoring and analyzing this rich and continuous flow of user-generated content can
yield unprecedentedly valuable information, enabling users and organizations to
acquire actionable knowledge. Due to the immediacy and rapidity of social media,
news events are often reported and spread on Twitter, Instagram or Facebook ahead of
traditional news media.
With the rapid growth of social media, Twitter has become one of the most widely
adopted platforms for people to post short and instant messages. Because of such wide
adoption of Twitter, events like breaking news and release of popular videos can
easily capture people’s attention and spread rapidly on Twitter. Therefore, the
popularity and importance of an event can be approximately gauged by the volume of
tweets covering the event. Moreover, the relevant tweets also reflect the public’s
opinions and reactions to events. It is therefore very important to identify and analyze
the events on Twitter (Diao 2015).
Another very popular social network platform is Instagram: 300 million people use the app to share photos every day. Users can also insert a caption for a photo they share, mention other users, and use hashtags. As on Twitter, users can follow the accounts they are interested in and share their posts publicly or privately according to their preference. Considering this, Instagram is one of the best channels through which people can share their experiences (especially those about events) via pictures as well as textual content such as hashtags. Hashtags have become a uniform
way to categorize content on many social media platforms, especially Instagram.
Hashtags allow Instagrammers to discover content to view and accounts to follow.
Research from Track Maven found that posts with over 11 hashtags tend to get more
engagement.
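Since hashtags are central to the analyses that follow, it may help to see how they can be pulled out of a post caption. The sketch below is only an illustration (the caption text and tag names are invented for the example):

```python
import re

def extract_hashtags(text):
    # A hashtag is '#' followed by word characters (letters, digits, underscores).
    return [tag.lower() for tag in re.findall(r"#(\w+)", text)]

caption = "Walking on water at #TheFloatingPiers #LakeIseo #art"
print(extract_hashtags(caption))  # ['thefloatingpiers', 'lakeiseo', 'art']
```

Lower-casing the tags makes variants such as #Art and #ART count as the same category, which matters when hashtags are later used as textual features.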
1.2 Problem Statement
In social networking websites or applications, people generally use unstructured or semi-structured language for communication. In everyday conversation, people do not care about spelling or the accurate grammatical construction of a sentence, which may lead to different types of ambiguity: lexical, syntactic, and semantic. Therefore, extracting logical patterns with accurate information from such unstructured text is a critical task. Text mining, a knowledge discovery technique that provides computational intelligence, can be a solution to this problem (Rizwana Irfan 2015). Social networks such as Twitter are rich in text, enabling users to create various textual content in the form of comments, posts and messages. Applying text mining techniques to social networking websites can reveal significant results related to person-to-person interaction behaviors. Moreover, text mining approaches such as clustering can be used for finding the general opinion about any specific subject, identifying human thinking patterns, and identifying groups in large-scale systems.
Despite the large amount of research that has been conducted on extracting information from particular social networks, there are no specific studies that address differently structured social networks to explore the profiles and activities of users based on the texts they share about an event. In this thesis, it is proposed that there may be some similarities in terms of interest and activity between social media users who engage in different actions, such as posting, liking and replying to a text or media item about an event. This may give us ideas for improving the current event and also for identifying potential users with the same interests for similar future events.
1.3 Proposed Solution
The first step toward the objective of this study is to decide which social media platforms should be considered. Since the availability of public posts is the main criterion for our choice among the many platforms, Twitter and Instagram, which provide a great number of publicly available posts, are used for the following analysis.
The second step is to collect the required data, including tweets, Instagram posts and the users involved, during a specific time interval. Then textual features, namely biographies, hashtags, tweet/post texts and the Twitter lists of which a user is a member, are cleaned, preprocessed, translated to English and stored in CSV files.
After the transformation phase, we define some steps to perform the analysis at different levels. The first phase of the analysis is to explore the main topics in the collected data using topic modeling. Then we perform different analyses at other levels; for example, three clustering algorithms, namely K-means, Hierarchical and DBSCAN, are applied separately to the outputs of the topic modeling process. Having all the outcomes of the cluster analysis, we evaluate the results employing cluster validity measures such as Silhouette, Dunn and Entropy. Based on the evaluation outcome, we perform further analyses and investigations to probe the categories of the users and their activities during the event. Finally, we model the outcomes of all levels of analysis in order to obtain a proper visualization of the results.
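The clustering stage of this pipeline relies on library implementations; the following self-contained sketch only illustrates the idea behind K-means on toy two-topic probability vectors. The data values and the deterministic initialization are simplifications invented for the example, not the thesis's actual configuration:

```python
import math

def kmeans(points, k, iters=20):
    # Deterministic initialization: take the first k points as centroids
    # (real implementations use smarter seeding, e.g. k-means++).
    centroids = [list(p) for p in points[:k]]
    clusters = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, clusters

# Toy per-user topic probabilities over two topics (as a topic model might output).
users = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15),
         (0.1, 0.9), (0.2, 0.8), (0.15, 0.85)]
centroids, clusters = kmeans(users, k=2)
```

In the actual analysis, a step like this would be followed by validity measures such as the Silhouette index to judge the quality of the resulting partition.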
1.4 Structure of the thesis
The thesis is organized as follows:
A general overview of the relevant concepts and technologies used in this thesis project is given in chapter 2.
Chapter 3 is dedicated to the scientific works that address similar issues; we discuss the associated publications and position our own strategy with respect to them.
In chapter 4, we first describe the main idea of this project and the motivations behind it. All the details of our proposed approach are then explained in the following sections of that chapter.
Chapter 5 is devoted to describing our dataset and the outcomes of the analysis that was performed. It is divided into sections corresponding to each level of analysis conducted on the data from the social media we used.
Finally, in chapter 6 we review the study with a short summary of what has been done and a discussion of our results. In addition, there are some suggestions for future work.
Chapter 2
2 Background
2.1 Relevant Concepts
In this section we discuss the concepts that are relevant to our work.
2.1.1 Knowledge Discovery and Data Mining
Knowledge Discovery in Databases (KDD) is the process of identifying valid, novel, useful, and understandable patterns in large datasets. Data Mining (DM) is the mathematical core of the KDD process, involving the inference algorithms that explore the data, develop mathematical models and discover significant patterns (implicit or explicit), which are the essence of useful knowledge. The knowledge discovery process (Figure 2.1) is iterative and interactive, consisting of the steps below. Note that the process is iterative at each step, meaning that moving back to adjust previous steps may be required (Oded Maimon 2010).
Figure 2.1: Knowledge Discovery Process
2.1.1.1 Data Selection
This phase includes finding out what data is available, obtaining additional necessary data, and then integrating all the data for the knowledge discovery into one dataset, including the attributes that will be considered for the process. This step is very important because Data Mining learns and discovers from the available data, which is the evidence base for constructing the models. If some important attributes are missing, the entire study may fail. In this respect, the more attributes that are considered, the better. On the other hand, collecting, organizing and operating complex data repositories is expensive, so there is a tradeoff against the opportunity to best understand the phenomena. This tradeoff is one aspect where the interactive and iterative nature of KDD comes into play: the process starts with the best available dataset and later expands, observing the effect in terms of knowledge discovery and modeling.
2.1.1.2 Data Pre-processing
The operations performed in preprocessing can be reduced to two main families of techniques: Detection Techniques (DT), which detect imperfections in data sets, and Transforming Techniques (TT), which are oriented toward obtaining more manageable data sets. DT includes outlier detection, missing data detection, influential observation detection, normality assessment, linearity assessment, and independence assessment. On the other hand, TT includes outlier treatment, missing data imputation, dimensionality reduction or data projection techniques, techniques for deriving new attributes, filtering and resampling. Additionally, the statistical technique of data cleaning and visualization techniques also play an important role in the pre-processing of data (José Luis Díaz 2010).
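As a small illustration of one technique from each family, the sketch below pairs a detection technique (z-score outlier flagging) with a transforming technique (mean imputation of missing values). The numbers are invented toy data, not values from the thesis dataset:

```python
def mean_impute(values):
    # Transforming technique: replace missing entries (None) with the
    # mean of the observed values.
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def zscore_outliers(values, threshold=1.5):
    # Detection technique: flag values more than `threshold` standard
    # deviations away from the mean.
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) > threshold * std]

print(mean_impute([120, 130, None, 110]))  # [120, 130, 120.0, 110]
print(zscore_outliers([1, 1, 1, 1, 100]))  # [100]
```

In practice the threshold, and whether a flagged value is removed or treated, are analysis-specific choices.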
2.1.1.3 Data Transformation
In this step, better data for the data mining phase are generated and prepared. Methods here include dimension reduction (such as feature selection and extraction, and record sampling) and attribute transformation (such as discretization of numerical attributes and functional transformation). This step is often crucial for the success of the entire KDD project, but it is usually very project-specific. However, even if we do not use the right transformation at the beginning, we may obtain a surprising effect that hints at the transformation needed (in the next iteration). Thus the KDD process reflects upon itself and leads to an understanding of the transformation needed. The main techniques of data transformation include (Äyrämö 2007):
Smoothing (binning, clustering, regression, etc.)
Aggregation (use of summary operations, e.g. averaging, on data)
Generalization (primitive data objects can be replaced by higher-level concepts)
Normalization (min-max scaling, z-score)
Feature construction from the existing attributes (PCA¹, MDS²)
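Two of the transformations listed above can be sketched in a few lines; this is a simplified, self-contained illustration on invented numbers, not the exact implementation used later in the thesis:

```python
def min_max_scale(values):
    # Normalization: map values linearly onto the [0, 1] interval.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, n_bins):
    # Smoothing by binning: discretize into n_bins equal-width intervals,
    # returning the 0-indexed bin of each value.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

print(min_max_scale([10, 20, 30]))        # [0.0, 0.5, 1.0]
print(equal_width_bins([1, 2, 3, 4], 2))  # [0, 0, 1, 1]
```

Min-max scaling is useful before distance-based algorithms such as K-means, so that attributes on large scales (e.g. follower counts) do not dominate the distance computation.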
2.1.1.4 Data Mining
The two high-level primary goals of data mining in practice tend to be prediction and description. Prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest, while description focuses on finding human-interpretable patterns describing the data. The goals of prediction and description can be achieved using a variety of data-mining methods, including (Usama Fayyad 1996):
Classification is learning a function that maps (classifies) a data item into one of several predefined classes.
Regression is learning a function that maps a data item to a real-valued prediction variable.
Clustering is a common descriptive task where one seeks to identify a finite set of categories or clusters to describe the data. The categories can be mutually exclusive and exhaustive, or consist of a richer representation such as hierarchical or overlapping categories. More details about clustering algorithms and their validation techniques are given in section 2.1.6.
Summarization involves methods for finding a compact description for a subset of data. A simple example would be tabulating the mean and standard deviation for all fields. Summarization techniques are often applied to interactive exploratory data analysis and automated report generation.
Dependency modeling consists of finding a model that describes significant dependencies between variables. Dependency models exist at two levels: (1) the structural level of the model specifies (often in graphic form) which variables are locally dependent on each other, and (2) the quantitative level specifies the strengths of the dependencies using some numeric scale.
Change and deviation detection focuses on discovering the most significant changes in the data from previously measured or normative values.
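For instance, the quantitative level of a dependency model is often expressed with a correlation coefficient. The following sketch computes the Pearson correlation on invented toy data (the variable pairing is only an example, not a result from the thesis):

```python
def pearson(x, y):
    # Pearson correlation: strength of the linear dependency between
    # two variables, on a scale from -1 to 1.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Toy example: number of tweets vs. number of favorites per user.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # ≈ 1.0
```

A value near 1 or -1 indicates a strong linear dependency; a value near 0 indicates none.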
2.1.1.5 Interpretation and Evaluation of Patterns
This phase involves the evaluation and possibly the interpretation of the patterns in order to decide what qualifies as knowledge (Gonzalo Mariscal 2010). This step focuses on the comprehensibility and usefulness of the induced model.
¹ Principal Component Analysis
² Multi-Dimensional Scaling
2.1.1.6 Knowledge Representation
This is the last step of the knowledge discovery process, where visualization and knowledge representation techniques, such as logical formulas, decision trees and neural networks, are used to present the mined knowledge to users.
2.1.2 Text Similarity
Text similarity measures play an important role in text-related research and applications such as topic detection, information retrieval, document clustering, text classification, etc. Finding the similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities.
Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or if one is a type of the other (Wael H. Gomaa 2013).
Lexical similarity is introduced through string-based similarity measures which
operate on string sequences and character composition. Some of these measures are
mentioned as follows:
Manhattan Distance computes the distance that would be traveled to get from
one data point to the other if a grid-like path is followed. The Block distance
between two items is the sum of the differences of their corresponding
components (Wael H. Gomaa 2013).
Cosine Similarity is a measure of similarity between two vectors of an inner
product space that measures the cosine of the angle between them (Wael H.
Gomaa 2013).
Euclidean distance is the ordinary straight-line distance between two points. Euclidean distance is widely used in clustering problems, including text clustering. It satisfies the properties of a true metric and is the default distance measure used with the K-means algorithm (Rugved Deshpande 2014).
Jaccard similarity measures similarity as the intersection divided by the union
of the objects. For text document, it compares the sum weight of shared terms
to the sum weight of terms that are present in either of the two documents but
are not the shared terms (Rugved Deshpande 2014).
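As a concrete sketch, the four measures above can be implemented over bag-of-words count vectors in a few lines of Python (the helper names and toy sentences are ours, not from any specific library):

```python
import math
from collections import Counter

def manhattan(a, b):
    """Block (Manhattan) distance: sum of absolute component differences."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)

def euclidean(a, b):
    """Ordinary straight-line distance between the two count vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))

def cosine(a, b):
    """Cosine of the angle between the two count vectors."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    """Size of the intersection of the term sets divided by the union."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

d1 = Counter("the cat sat on the mat".split())
d2 = Counter("the cat lay on the rug".split())
```

On these toy sentences, cosine and Jaccard return similarities in [0, 1], while Manhattan and Euclidean are distances (0 means identical).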
Semantic similarity is introduced through Corpus-Based and Knowledge-Based
algorithms. Corpus-Based similarity is a semantic similarity measure that determines
the similarity between words according to information gained from large corpora. The
most famous corpus-based similarity measures are:
Hyperspace Analogue to Language (HAL) considers context only as the words
that immediately surround a given word. HAL computes an NxN matrix,
where N is the number of words in its lexicon, using a 10-word reading frame
that moves incrementally through a corpus of text (Wikipedia 2016).
Latent Semantic Analysis (LSA) is the most popular technique of Corpus-
Based similarity. LSA assumes that words that are close in meaning will occur
in similar pieces of text. A matrix containing word counts per paragraph (rows
represent unique words and columns represent each paragraph) is constructed
from a large piece of text, and a mathematical technique called singular
value decomposition (SVD) is used to reduce the number of columns while
preserving the similarity structure among rows. Words are then compared by
taking the cosine of the angle between the two vectors formed by any two
rows (Wael H. Gomaa 2013).
Explicit Semantic Analysis (ESA) is a vectorial representation of text that uses
a document corpus as a knowledge base. Specifically, in ESA, a word is
represented as a column vector in the tf–idf matrix of the text corpus and a
document is represented as the centroid of the vectors representing its words.
Typically, it represents the meaning of texts in a high-dimensional space of
concepts derived from Wikipedia (Wikipedia 2016).
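The LSA recipe described above can be sketched with NumPy in a few lines (the tiny word-by-paragraph count matrix is invented for illustration):

```python
import numpy as np

# Rows = words, columns = "paragraphs" (toy counts, invented for illustration)
words = ["car", "truck", "flower", "tulip"]
X = np.array([[2, 3, 0],   # car
              [1, 2, 0],   # truck
              [0, 0, 3],   # flower
              [0, 1, 2]])  # tulip

# SVD, then keep only k singular values: fewer columns, similarity preserved
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
W = U[:, :k] * s[:k]      # each row is a word in the reduced space

def cos(u, v):
    """Cosine of the angle between two row vectors of the reduced matrix."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Words used in similar contexts (car/truck) get a higher cosine than
# words from different contexts (car/flower)
sim_car_truck = cos(W[0], W[1])
sim_car_flower = cos(W[0], W[2])
```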
Knowledge-Based similarity is a semantic similarity measure based on
identifying the degree of similarity between words using information derived from
semantic networks. WordNet is the most popular semantic network in the area of
measuring Knowledge-Based similarity between words. WordNet is a large
lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into
sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are
interlinked by means of conceptual-semantic and lexical relations (Wael H. Gomaa
2013).
2.1.3 Information Retrieval Models

Information retrieval (IR) is finding material (usually documents) of an unstructured
nature (usually text) that satisfies an information need from within large collections
(usually stored on computers) (Christopher D. Manning 2008). For effectively
retrieving relevant documents by IR strategies, the documents are typically
transformed into a suitable representation. Each retrieval strategy incorporates a
specific model for its document representation purposes. Figure 2.2 illustrates the
relationship of some common models. Three of the most well-known models are
explained in more detail below (Wikipedia 2016).
Figure 2.2: Information Retrieval Models
Standard Boolean model: The Boolean model is a simple retrieval model
based on Boolean algebra, where an index term’s significance is represented by
binary weights wi,j ∈ {0,1}. Queries are also defined as Boolean expressions
over index terms. The similarity between document dj and query q is binary:
sim(dj, q) = 1 if dj satisfies the Boolean expression q, and 0 otherwise.
Vector space model: in this model, documents and queries are represented as
vectors: dj = (w1,j, w2,j, …, wt,j), q = (w1,q, w2,q, …, wt,q).
Each dimension corresponds to a separate term. If a term occurs in the
document, its value in the vector is non-zero. In the classic vector space
model the term-specific weights in the document vectors are products of local
and global parameters. The model is known as the term frequency-inverse
document frequency (tf-idf) model, where the weight wt,d is defined as
wt,d = tft,d · idft, where tft,d is the frequency of term t in document d and idft is its
inverse document frequency. Using the cosine of the angle between the two vectors,
the similarity between document dj and query q can be calculated as
sim(dj, q) = (dj · q) / (|dj| |q|).
Probabilistic model: this model estimates the probability that a document dj
is relevant to a query q, assuming that this probability of relevance depends
on the query and document representations.
Furthermore, it assumes that there is a portion of all documents that is
preferred by the user as the answer set for query q. Such an ideal answer set
is called R and should maximize the overall probability of relevance to that
user. The prediction is that documents in this set R are relevant to the query,
while documents not present in the set are non-relevant.
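The tf-idf weighting and cosine ranking of the vector space model can be sketched as follows (toy documents; the natural-log idf used here is one common variant, and real systems differ in the details):

```python
import math
from collections import Counter

docs = ["apple banana apple", "banana cherry", "cherry apple banana"]
query = "apple cherry"

N = len(docs)
tf = [Counter(d.split()) for d in docs]                # term frequencies per doc
df = Counter(t for counts in tf for t in counts)       # document frequency
idf = {t: math.log(N / df[t]) for t in df}             # inverse document frequency

def vec(counts):
    """tf-idf vector: w_{t,d} = tf_{t,d} * idf_t."""
    return {t: counts[t] * idf.get(t, 0.0) for t in counts}

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

qv = vec(Counter(query.split()))
scores = [cosine(vec(counts), qv) for counts in tf]    # rank docs against query
```

The third document contains both query terms, so it receives the highest cosine score; "banana", which occurs in every document, gets idf 0 and thus no weight.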
2.1.4 Topic Modeling

Topic models are probabilistic latent variable models of documents that exploit the
correlations among the words and latent semantic themes. A document is seen as a
mixture of topics. This intuitive explanation of how documents can be generated is
modeled as a stochastic process which is then “reversed” by machine learning
techniques that return estimates of the latent variables. With these estimates it is
possible to perform information retrieval or text mining tasks on a document corpus
(Ponweiser 2012).
The most prominent topic model is latent Dirichlet allocation (LDA) which is a three-
level hierarchical Bayesian model, in which each item of a collection is modeled as a
finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an
infinite mixture over an underlying set of topic probabilities. In the context of text
modeling, the topic probabilities provide an explicit representation of a document
(Recognition 2015). The graphical model of LDA is shown in Figure 2.3. The boxes
are “plates” representing replicates. The outer plate represents documents, while the
inner plate represents the repeated choice of topics and words within a document:
Figure 2.3: Graphical model representation of LDA
The LDA model assumes the following generative process for a document w = (w1, . .
. , wN ) of a corpus D containing N words from a vocabulary consisting of V different
terms, wi ∈ {1,… , V } for all i = 1, . . . , N. The generative model consists of the
following three steps (Bettina Grun 2011):
Step 1: The term distribution β is determined for each topic by β ∼ Dirichlet(δ).
Step 2: The proportions θ of the topic distribution for the document w are determined
by θ ∼ Dirichlet(α).
Step 3: For each of the N words wi:
(a) Choose a topic zi ∼ Multinomial(θ).
(b) Choose a word wi from a multinomial probability distribution conditioned
on the topic zi : p(wi |zi , β).
β is the term distribution of topics and contains the probability of a word
occurring in a given topic.
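The three generative steps can be simulated directly with NumPy (the toy sizes and hyperparameter values are illustrative; this samples from the model rather than fitting it to data):

```python
import numpy as np

rng = np.random.default_rng(7)
V, K, N = 8, 3, 50          # vocabulary size, number of topics, words per document
delta, alpha = 0.5, 1.0     # Dirichlet hyperparameters (illustrative values)

# Step 1: per-topic term distributions  beta_k ~ Dirichlet(delta)
beta = rng.dirichlet(np.full(V, delta), size=K)        # shape (K, V), rows sum to 1

# Step 2: topic proportions for this document  theta ~ Dirichlet(alpha)
theta = rng.dirichlet(np.full(K, alpha))               # shape (K,), sums to 1

# Step 3: for each word, draw a topic z_i ~ Multinomial(theta),
# then a word w_i from that topic's term distribution p(w_i | z_i, beta)
z = rng.choice(K, size=N, p=theta)
w = np.array([rng.choice(V, p=beta[zi]) for zi in z])
```

Inference in LDA "reverses" exactly this process: given only the words w, it estimates beta and theta.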
2.1.5 Dimensionality Reduction

Dimensionality reduction or dimension reduction is the process of reducing the
number of random variables under consideration, via obtaining a set of principal
variables. It can be divided into feature selection and feature extraction.
Feature selection is the process of selecting a subset of relevant features (variables,
predictors) for use in model construction. Feature selection techniques are used for
four reasons:
simplification of models to make them easier to interpret by researchers/users
shorter training times
avoidance of the curse of dimensionality
enhanced generalization by reducing overfitting (formally, reduction of variance)
The central premise when using a feature selection technique is that the data contains
many features that are either redundant or irrelevant, and can thus be removed without
incurring much loss of information.
Feature extraction transforms the data in the high-dimensional space to a space of
fewer dimensions. The data transformation may be linear, as in Principal Component
Analysis (PCA), but many nonlinear dimensionality reduction techniques also exist.
The main linear technique for dimensionality reduction, Principal Component
Analysis (PCA), performs a linear mapping of the data to a lower-dimensional space
in such a way that the variance of the data in the low-dimensional representation is
maximized. In other words, it uses an orthogonal transformation to convert a set of
observations of possibly correlated variables into a set of values of linearly
uncorrelated variables called principal components. The number of principal
components is at most the smaller of the number of original variables and the
number of observations. This transformation is defined in such a way that the first
principal component has the largest possible variance (that is, accounts for as much of
the variability in the data as possible), and each succeeding component in turn has the
highest variance possible under the constraint that it is orthogonal to the preceding
components. The resulting vectors are an uncorrelated orthogonal basis set. PCA is
sensitive to the relative scaling of the original variables (Wikipedia 2016).
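A minimal PCA sketch in NumPy following the description above — center the data, eigendecompose the covariance matrix, and project onto the leading components (the toy data is invented):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 observations of 3 correlated variables (rank-2 by construction)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.5, 0.1],
                                          [0.0, 1.0, 0.3]])

Xc = X - X.mean(axis=0)                 # centering matters: PCA is scale-sensitive
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # sort components by variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
scores = Xc @ eigvecs[:, :k]            # projection onto the first k components
```

By construction the projected components are linearly uncorrelated, and the first component carries the largest variance.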
2.1.6 Clustering Process

As mentioned before, clustering is one of the most useful tasks in the data mining process
for discovering groups and identifying interesting distributions and patterns in the
underlying data. The main concern in clustering process is to reveal the organization
of patterns into “sensible” groups, which allow us to discover similarities and
differences, as well as to derive useful conclusions about them. The basic steps to
develop clustering process are presented in Figure 2.4 and can be summarized as
follows (Maria Halkidi 2001):
Figure 2.4: Steps of clustering process
Feature selection: The goal is to select properly the features on which
clustering is to be performed so as to encode as much information as possible
concerning the task of our interest. Thus, preprocessing of data may be
necessary prior to their utilization in clustering task.
Clustering algorithm: This step refers to the choice of an algorithm which
results in the definition of a good clustering scheme for a data set. Clustering
algorithms can be broadly classified into the following types:
o Partitional clustering attempts to directly decompose the data set into a
set of disjoint clusters. In this category, K-Means is a commonly used
algorithm.
o Hierarchical clustering proceeds successively by either merging
smaller clusters into larger ones, or by splitting larger clusters. The
result of the algorithm is a tree of clusters, called dendrogram, which
shows how the clusters are related. By cutting the dendrogram at a
desired level, a clustering of the data items into disjoint groups is
obtained.
o Density-based clustering: The key idea of this type of clustering is to
group neighboring objects of a data set into clusters based on density
conditions. A widely known algorithm of this category is DBSCAN.
o Grid-based clustering is mainly proposed for spatial data mining. Their
main characteristic is that they quantize the space into a finite number
of cells and then they do all operations on the quantized space.
o Fuzzy clustering uses fuzzy techniques to cluster data and allows an
object to belong to more than one cluster. This type of algorithm leads
to clustering schemes that are compatible with everyday life experience,
as they handle the uncertainty of real data. The most important fuzzy
clustering algorithm is Fuzzy C-Means.
o Crisp clustering considers non-overlapping partitions, meaning that a
data point either belongs to a cluster or not. Most clustering
algorithms produce crisp clusters and can thus be categorized as crisp
clustering.
o Kohonen net clustering, which is based on the concepts of neural
networks.
Validation of the results: The procedure of evaluating the results of a
clustering algorithm is known under the term cluster validity. In general terms,
there are three approaches to investigate cluster validity:
o External Criteria evaluate the clustering result with respect to an
externally provided structure, such as known class labels. Rand index,
Jaccard coefficient, Entropy and Purity can be mentioned as external
measures, to name a few.
o Internal Criteria evaluate the result with respect to information
intrinsic to the data alone. Silhouette index, Davies-Bouldin index
(DB), Calinski-Harabasz index (CH) and Dunn index are the most
famous measures in this category (Eréndira Rendón 2011).
o Relative Criteria evaluate quality of a partition by comparing it to
other clustering schemes, resulting by the same algorithm but with
different parameter values.
Interpretation of the results: In many cases, the experts in the application area
have to integrate the clustering results with other experimental evidence and
analysis in order to draw the right conclusion.
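To make the pipeline concrete, the following toy run clusters well-separated points with a minimal k-means and validates the result with a hand-rolled silhouette index (illustrative only; the initialization is naive, and real studies would use library implementations):

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Minimal Lloyd's k-means; evenly spaced samples as a naive
    deterministic init (k-means++ is better in practice)."""
    centers = X[:: max(1, len(X) // k)][:k].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):                 # keep old center if cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def silhouette(X, labels):
    """Mean silhouette index: s_i = (b_i - a_i) / max(a_i, b_i), where a_i is
    the mean intra-cluster distance and b_i the mean distance to the
    nearest other cluster."""
    D = np.linalg.norm(X[:, None] - X[None], axis=2)
    n, scores = len(X), []
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, own].mean()
        b = min(D[i, labels == c].mean() for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated toy groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
labels = kmeans(X, 2)
score = silhouette(X, labels)   # close to 1 for a clean partition
```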
2.2 Relevant Technologies

In this section all the relevant technologies used in this thesis study are discussed.
2.2.1 Language Identification and Translation

Language identification (LID) refers to the process of determining the natural
language in which a given text is written (Pienaar 2010). In this thesis, LID is used as
part of the preprocessing phase, which aims to make the textual content uniform by
first detecting the language and then translating it into a single language (en-us).
The Yandex.Translate Application Programming Interface (API) is an easy-to-use
automatic translation service provided by the Russian Internet company Yandex. As a
statistical machine translation system, it is based on statistics derived from web
sources (Hees 2015). Yandex.Translate offers synchronized translation for 91
languages, predictive typing, a dictionary with transcription, pronunciation and usage
examples, and many other features.
Figure 2.5: How machine translation works at Yandex
Yandex.Translate has an automated dictionary that sets it apart from the limited
number of similar existing services. The technology, developed by a Yandex team of
linguists and programmers, combines current statistical machine translation
approaches with traditional linguistic tools. The translation model constructs a graph
containing all the possible ways to translate a sentence. The language model selects
the best translation in terms of the optimal word combinations in natural language. The
translation model learns from extensive bilingual parallel corpora. The language
model is built from large single-language corpora, and contains all the language's
most frequent n-word combinations. N may be from 1 to 7 (usually 5).
Yandex uses the BLEU metric to automatically evaluate the quality of machine
translation; it determines the percentage of n-grams (n ≤ 4) that match between the
machine translation and the standard translation of a sentence. Translations are
usually manually rated for two factors, Adequacy and Fluency, using a 5-point scale
(Yandex 2017).
2.2.2 Gender Detection

There are two concepts of gender: the biological gender and the socially constructed
gender. A text written by Gayle Rubin in 1975 discusses gender as a sex/gender
system, in which the social gender is described as enhancing the idea of a biological
gender, which in itself creates gender (Ottosson 2012). Determining the gender of users
by analyzing their behavior in social media is very popular these days, but since this is
a complex and time-consuming task, another method is used here: detecting a
person's gender from their full name.
Gender detection based only on the full name can be imprecise in some cases
due to problems such as the cultural origin of names, the coverage of the names
database, support for various languages, etc. Using the NamSor API to determine the
gender of users was a suitable solution to cope with these problems.
NamSor software classifies names accurately by gender, country of origin, or
ethnicity. The gender API comes with useful features (NamSor Applied Onomastics
2017), which are described in the following.
Accuracy: NamSor recognizes the likely cultural origin and gender at the
same time, for higher precision and recall.
Global coverage: NamSor covers all languages, alphabets, countries and regions.
They constantly improve the precision, working with linguists, anthropologists
and historians.
Ease of use: names can be parsed and classified online, using a simple web
application that processes up to 100,000 names within a few minutes. Power
users, statisticians and data scientists can benefit from the NamSor open source
extension for RapidMiner, a leading predictive analytics tool.
Integration: NamSor API can be securely integrated with a range of
applications, from geographical information systems (such as ESRI), to CRM
and campaign management.
2.2.3 Twitter API

The Twitter Platform provides developers with a variety of tools and APIs to
connect websites or applications with the worldwide conversation happening on
Twitter (Twitter Developer Documentation 2017). Among all these possibilities, the
REST APIs are widely used for extracting data from Twitter for processing and analysis.
The REST APIs provide programmatic access to read and write Twitter data: author
a new Tweet, read author profile and follower data, and more. The REST API
identifies Twitter applications and users using OAuth; responses are available in
JSON (Twitter Developer Documentation 2017).
Figure 2.6: Twitter Rest API Design
Representational state transfer (REST) or RESTful Web services are one way of
providing interoperability between computer systems on the Internet. REST-
compliant Web services allow requesting systems to access and manipulate textual
representations of Web resources using a uniform and predefined set of stateless
operations. In a RESTful Web service, requests made to a resource's URI will elicit a
response that may be in XML, HTML, JSON or some other defined format. The
response may confirm that some alteration has been made to the stored resource, and
it may provide hypertext links to other related resources or collections of resources.
By making use of a stateless protocol and standard operations, REST systems aim for
fast performance, reliability, and the ability to grow, by re-using components that can
be managed and updated without affecting the system as a whole, even while it is
running (Wikipedia 2016).
2.2.4 Instagram API

The Instagram API Platform can be used to build non-automated, authentic, high-
quality apps and services that help individuals share their own content with third-party
apps, and help brands and advertisers understand and manage their audience and media
rights. The Instagram API can be accessed from any platform using its REST endpoints
(Instagram 2017).
2.2.5 Text Normalization Library

The multilingual text normalization library is designed mainly for short texts (tweets,
Facebook posts, etc.). It removes special characters, emojis, common words, mentions
and URLs, stems the remaining words, and finally returns the clean list of important
words in the sentence. It can be used for normalizing data collected from Twitter or
other social media in order to analyze the text with any data mining tool.
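The described behavior can be approximated by a small, self-contained normalizer; the regular expressions, stop-word list and naive suffix stemming below are illustrative assumptions, not the actual library code:

```python
import re

STOPWORDS = frozenset({"the", "a", "an", "and", "is", "to", "this"})  # toy list

def normalize(text):
    """Toy short-text normalizer: lowercase, strip URLs and mentions,
    drop non-letter characters (including emoji), remove stop words,
    and apply a crude plural-stripping "stemmer"."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove mentions
    tokens = re.findall(r"[a-z]+", text)        # drops emoji, '#', punctuation
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [t[:-1] if t.endswith("s") else t for t in tokens]  # naive stemming
```

For example, `normalize("Check this out @user http://t.co/x #cool cats!!")` keeps the hashtag word but drops the mention, the URL and the stop word.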
2.2.6 Cloud Computing

Cloud computing is a type of Internet-based computing that provides shared
computer processing resources and data to computers and other devices on demand.
Microsoft Azure is a cloud computing service created by Microsoft for building,
deploying, and managing applications and services through a global network of
Microsoft-managed data centers. It provides software as a service, platform as a
service and infrastructure as a service and supports many different programming
languages, tools and frameworks, including both Microsoft-specific and third-party
software and systems. Microsoft lists over 600 Azure services, some of which are
Compute, Mobile Services, Storage Services, Data Management, etc. (Wikipedia
2016). An Azure Virtual Machine is used for computing-intensive tasks in this thesis
study.
Figure 2.7: Cloud Computing
Figure 2.8: Windows Azure Platform
Chapter 3
3 Related Work
This chapter introduces different works that cope with similar issues through
discussing the associated publications.
Past works have found that data scraped from social media is a meaningful reflection
of the human behind the account. Therefore, in recent years, a wide variety of
research with regard to the social media analysis and user clustering/classification
based on different features has been conducted. There are studies that specifically
address clustering of people in social networks based on textual and non-textual
features (Kuldeep Singh 2016) (Friedemann 2015), clustering users based on their
interests (Recognition 2015), leveraging Twitter Lists for topic modeling (Saptarshi
Ghosh 2012), as well as using internal validity measures to compare clustering
algorithms (Toon Van Craenendonck 2015). Furthermore, there are several works that
address user profiling on social networks, such as (Alessandro Bozzon 2013), where the
authors propose a method to select experts within the population of social networks
according to information about the social activities of their users, and works like
(Narumol Prangnawarat 2015) and (Slava Kisilevich 2010) that focus on event
analysis in social media. The former analyzes the resulting heterogeneous network
and uses it to cluster posts by different topics and events, while the latter performs
analysis and comparison of temporal events, rankings of sightseeing places in a city,
and studies the mobility of people using geotagged photos. Although all these works have
delivered new solutions to the social media analysis field, they have investigated the
problems of profiling users or events in only one social network platform, or they have
employed only one machine learning algorithm to analyze the data. The big difference
between our work and the mentioned ones is that our study not only addresses both
textual and non-textual user features but also utilizes three different clustering
models applied to Twitter and Instagram data. More details regarding the works above
that made major contributions to our thesis study follow.
3.1 Clustering of People in Social Network
Based on Textual Similarity

This study (Kuldeep Singh 2016) concentrates on the textual similarity between various
people in a social network. Textual similarity is a sub-field of data mining which
gives information about the words that are frequently used within a group of people.
On the basis of the common words used in social networks, they have formulated a
metric. The data has been extracted from social networking sites and then it is
processed for generating the metrics. Simple k-means and spectral k-means
algorithms have been compared for finding textual similarity. They have also used
WordNet to group words together based on their meanings. Since the basic mode of
communication between users on Twitter is the tweet, the authors extracted the tweets
of the users and performed analysis on them.
The approach of this paper consists of three main steps:
1. Data Pre-processing: Since the extracted tweets are in a very rough form,
pre-processing is needed and is done in five steps:
Data extraction: the Tweepy package for the Python language is used to
access the Twitter API. Tweepy supports accessing Twitter via Basic
Authentication and the newer OAuth method.
Stop words removal: Stop words have been removed from experimental
dataset using Lucene. Lucene is an open-source Java full-text search
library.
Stemming of the text: Stemming is the process of reducing inflected
words to their stem. Stemming is also implemented using Lucene; the
PorterStemFilter class of Lucene is used for stemming the
words.
Lexical analysis: Lexical analysis is done using a large lexical database of
English called WordNet. Nouns, verbs, adjectives and adverbs are grouped
into sets of cognitive synonyms (synsets). Synsets are interlinked by
means of conceptual-semantic and lexical relations.
Calculation of strength matrix: Strength between two users must be
directly proportional to the number of common word between them
(STRENGTH∝COMMON WORDS) but it should be inversely
proportional to the total number of words used by both the users
(STRENGTH ∝ (1/total words used)). Strength should also decrease with
the difference between the occurrences of a word: if one person uses a word
very frequently and the other does not, this should decrease their textual
similarity (STRENGTH ∝ (1/difference of word used by two persons)). At
the end, if we add up this strength due to each word used by a user and
another user, it gives us a measure of strength of textual similarity between
those two persons.
2. Clustering: Simple k-means is based on compactness, so it generally gives
accurate approximate results for general numerical datasets.
Spectral clustering is used to map the original data into a vector space spanned
by a few eigenvectors and then apply the k-means algorithm in that space. The
assumption here is that although the data samples are high dimensional, they
lie in a low-dimensional subspace of the original space. Spectral clustering
based on the un-normalized graph Laplacian has been used here.
3. Evaluation: The results of the simple k-means and spectral algorithms on
dummy and real datasets have been evaluated. The results show that both
algorithms give almost similar outputs. The dummy dataset consists of 40 nodes
and has two clusters. The real dataset has 77 nodes, numbered from 1 to 77. The
results of simple k-means and spectral k-means are almost equal. Spectral
clustering gives relatively quick results for sparse datasets with many elements,
but its computation cost for large datasets is very high.
Figure 3.1: Graph after spectral k-means clustering for real dataset
Figure 3.2: Graph after spectral k-means clustering for dummy dataset
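One hedged way to turn the three strength proportionalities from the pre-processing step into code (the paper's exact formula is not reproduced above, so this combination is purely illustrative):

```python
from collections import Counter

def strength(words_a, words_b):
    """Illustrative strength metric: grows with the number of common words,
    shrinks with the total number of words used by both users and with the
    difference in how often each user uses a shared word."""
    ca, cb = Counter(words_a), Counter(words_b)
    total = len(words_a) + len(words_b)
    s = 0.0
    for word in set(ca) & set(cb):                     # common words only
        s += 1.0 / (total * (1 + abs(ca[word] - cb[word])))
    return s
```

As expected, two users with identical word lists score higher than users sharing only some words, and users with no common words score 0.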
The main idea of using textual similarity for clustering the users as well as the
preprocessing steps e.g. data extraction and stop words removal is used as a guide in
this thesis study.
3.2 Clustering a Customer Base Using
Twitter Data
This paper (Friedemann 2015) focuses on a method to cluster customers of a company
(Nike) using social media data from Twitter. The motivation of this research is based
on the idea that clustering a company’s customers allows marketing teams to tailor
advertising messages for specific groups of people with similar interests. Such
analytics can fundamentally change business operations.
The steps of the approach are explained below:
1. First the tweets are harvested from Twitter using Tweepy package and stored
into a local SQLite database. Because Twitter’s API rate limit constrains data
gathering, only a subset of 10,000 users from Nike’s total 5.6 million followers
is considered. For each user, the data set includes statuses posted, number of
followers, number of followings, and language. In addition, it is recorded whether
each user follows one or more of a hand-selected list of popular Twitter
accounts (influencers). Since all selected features are numerical except for the
language of the user, the language is converted to a tuple of float values by mapping
the language acronym to the latitude and longitude coordinates of the largest city
in the country with the most people who speak this language.
Figure 3.3: Percentage of followers for a set of chosen influencers.
2. The data is transformed into a lower-dimensional feature space using Principal
Component Analysis (PCA). Users’ following relationships towards influencers
are represented as a binary matrix with a 1 in the (i,j) position if user i follows
influencer j. Using PCA, the dimension of influence matrix is reduced from 12
to 8.
3. In order to efficiently segment the data samples into acceptable clusters,
selected features are passed into the k-means algorithm instead of slower
alternatives such as hierarchical clustering.
4. The optimal number of clusters is determined using the silhouette coefficient.
Figure 3.4: Silhouette coefficient as a function of number of clusters
5. The clustering performance, which is a metric of clustering quality related to
the intra-cluster variation and inversely proportional to the inter-cluster
distance, is computed; low values of q correspond to better clustering
performance.
6. The clustering output is visualized in R². Figure 3.5 shows one such
visualization. The depicted clusters have the same ratio of average intra-cluster
variation to average inter-cluster distance as the clustering output. This
suggests that the studied data set can be cleanly clustered in the discussed
dimensionality space.
Figure 3.5: Representative clusters in R²
7. In order to label the clusters, a randomly-selected subset of samples from each
of the k=5 clusters is examined. Furthermore, a human-subject experiment was
performed to validate the meaningfulness of the selected clusters. The
empirical results show that the human and the labeling algorithm were in
agreement approximately 80% of the time.
The PCA algorithm, which transforms data into a lower-dimensional feature space,
and the silhouette coefficient, which is employed to determine an appropriate
number of clusters, are used as standard approaches in this thesis study.
3.3 Clustering Users Based on Interests

In this paper (Recognition 2015), the authors investigate the problem of clustering users in
Twitter based on their interests. The motivation of doing this study is the significance
of solving the mentioned problem in many different fields, such as user
recommendation, personalized services, viral marketing, etc. The main notion of this
research is that some Twitter users’ features are potentially useful in determining
interests of an individual user or his/her common interests with other users.
To address the mentioned problem, the approach of this paper is organized as follows:
1. Data Extraction: Twitter’s Developer API is used to collect user data. 45772
English users, who have posted at least 100 tweets and have more than 20
friends, are extracted. Besides, different features that are closely correlated
with users’ interests, including both textual content (tweet text, URLs and
hashtags) and social structure (following and retweeting relationships), are
leveraged. The findings of this study show that URLs, hashtags and retweets are
very widely used at the user level, and prove that it is necessary to take these
features into account when computing user similarity.
2. User Similarity: to get the final user similarity, the similarity of all selected
features should be computed first:
Text Similarity: All the tweets published by an individual user are
aggregated into a big document. With the purpose of identifying the
topics that users are interested in based on their tweets, Latent Dirichlet
Allocation (LDA), which is an unsupervised machine learning method,
is applied. Then, the text similarity between two users ui and uj can be
calculated using the formula presented in the paper.
URL Similarity: All the URLs embedded in tweets corresponding to a
user are aggregated into a document. Then, similarly to the previous
section, URL similarity is calculated.
Hashtag Similarity: hashtag similarity is measured based on the
number of their common hashtags and the importance of these
hashtags.
Following Similarity: A twitterer follows a friend because she/he is
interested in the topics the friend publishes, and the friend follows back
because she/he finds they share similar topic interest. Intuitively, if two
users have many common friends and followers, they are quite similar.
This paper presents a formula which computes following similarity
based on the total number of users’ followers, followings, common
friends and common followers.
Retweeting Similarity: If two users frequently retweet the same person, they may have similar interests. Additionally, whether the two users retweet each other is an even stronger indicator of similar interests. Taking these two factors into consideration, retweeting similarity is defined in this study.
The final similarity between users ui and uj is then calculated as a weighted aggregation of the individual feature similarities. In order to assess the effectiveness of their approach and determine the parameters of the user similarity formula, the authors propose an evaluation metric, the average number of mutual following links per user per cluster (FPUPC), which is also used to set the aggregation parameters γfeature of the individual features.
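The exact aggregation formula and the tuned γfeature values appear as figures in the original paper. Purely as an illustration, and assuming a linear weighted combination (the feature names, example scores and weights below are all hypothetical), the aggregation could be sketched as:

```python
# Hypothetical sketch of combining per-feature similarities into a final user
# similarity as a weighted sum. The gamma weights stand in for the aggregation
# parameters that the paper tunes via the FPUPC metric; they are NOT the
# paper's actual values.

def combined_similarity(feature_sims, gammas):
    """feature_sims and gammas map feature name -> value; weights sum to 1."""
    assert abs(sum(gammas.values()) - 1.0) < 1e-9
    return sum(gammas[f] * feature_sims[f] for f in feature_sims)

# Illustrative per-feature similarity scores for a pair of users
sims = {"text": 0.62, "url": 0.10, "hashtag": 0.35,
        "following": 0.20, "retweeting": 0.05}
# Illustrative aggregation weights
gammas = {"text": 0.4, "url": 0.1, "hashtag": 0.2,
          "following": 0.2, "retweeting": 0.1}
score = combined_similarity(sims, gammas)
```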
3. K-means Clustering: k-means is applied to cluster users because it is not only effective but also very fast. Moreover, experimental results show that the best performance is achieved when the number of clusters is around 400.
The idea of using the LDA model to identify the topics of a large text document is adopted as one of the most important methods in our thesis work.
3.4 Crowdsourcing Search for Topic Experts in Microblogs

This study (Saptarshi Ghosh 2012) highlights Lists as a potentially valuable source of
information for future content or expert search systems in Twitter. In this paper,
the authors present Cognos, a system for finding topic experts in Twitter. Unlike traditional approaches, which identify topical experts based either on the information provided by the user or on analysis of network characteristics, Cognos exploits the Lists feature, an entirely different approach.
The proposed methodology consists of three fundamental parts:
1. Crawl the Lists containing the 54 million Twitter users present in a complete snapshot of Twitter taken in August 2009, then consider only users who were listed at least 10 and at most 2,000 times. Overall, a total of 88,471,234 Lists were gathered for the resulting 1.3 million users.
2. Extract frequently occurring topics (words) from List meta-data (names and
descriptions) and associate these topics with the listed users. This strategy
includes the following steps:
Separate List names into individual words
Apply case-folding, stemming and stop words removal
Group words that are very similar to each other based on edit-distance
among words
Consider only unigrams and bigrams as topics
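The steps above can be sketched as follows; this is not the authors' code, and the tiny stop-word list, the edit-distance cutoff and the use of `difflib` as a proxy for the paper's edit-distance grouping are all assumptions:

```python
# Illustrative sketch of the List-metadata pipeline described above: split
# List names into words, case-fold, drop stop words, merge near-duplicate
# words (a stand-in for edit-distance grouping) and emit unigram/bigram topics.
import difflib
from collections import Counter

STOP_WORDS = {"the", "of", "and", "my", "a"}  # tiny stand-in stop list

def extract_topics(list_names, similarity_cutoff=0.85):
    words = []
    for name in list_names:
        for w in name.replace("-", " ").lower().split():
            if w not in STOP_WORDS:
                words.append(w)
    # Map each word to a canonical near-duplicate already seen, if any
    canonical = {}
    for w in words:
        match = difflib.get_close_matches(w, list(canonical), n=1,
                                          cutoff=similarity_cutoff)
        canonical[w] = match[0] if match else w
    unigrams = [canonical[w] for w in words]
    bigrams = [" ".join(pair) for pair in zip(unigrams, unigrams[1:])]
    return Counter(unigrams) + Counter(bigrams)

topics = extract_topics(["Tech News", "tech-news", "Music Lovers", "musics"])
```

Here "musics" is folded into "music" by the near-duplicate grouping, so both unigram and bigram counts reflect the merged vocabulary.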
Table 3.1: The most common topics of expertise as identified from Lists
3. Given a query, a topical similarity score is calculated between the topic vector
for a user and the given query vector, using an algorithm which computes the
cover density ranking between the vectors.
This paper concludes that Cognos provides better search results in cases where the bio or the tweets posted by a user do not contain information about the user’s topic of expertise. Even though Cognos is built using only the Lists feature, it can compete with the commercial who-to-follow system (WTF) deployed by Twitter itself. As Table 3.2 indicates, top Cognos results mostly contain personal accounts while top Twitter WTF results mostly contain organization/business accounts.
Table 3.2: Top 5 results by Cognos and Twitter WTF for query “music”
Our thesis work benefits from this paper’s key idea of utilizing the Twitter Lists feature for identifying topic experts, and employs List slugs to find topics and user clusters in Twitter.
3.5 Using Internal Validity Measures to
Compare Clustering Algorithms
The research and experiments of this paper (Toon Van Craenendonck 2015) rely on
using four internal validity measures and six clustering algorithms. The reason behind
this approach is the existence of many different clustering algorithms which may all
produce very different partitions of the same data set. Even a single clustering
algorithm can yield wildly different results depending on the chosen parameters.
Therefore, the authors investigate whether the outlined measures allow for a
comparison between algorithms or not.
Internal validity measures rely only on properties intrinsic to the data set. This research uses the following internal measures:
Silhouette Index (SI): This score of a clustering is in [-1, 1], and should be
maximized.
Davies-Bouldin (DB): This score of a clustering is in [0, + ∞] and should be
minimized.
Calinski-Harabasz (CH): This score of a clustering is in [0, + ∞] and should be
maximized.
Density-Based Cluster Validation (DBCV): This score of a clustering is in [-1, 1] and should be maximized. DBCV can be useful for data sets with well separated structure. However, the results become less informative when the data becomes noisier or transitions between clusters become more diffuse. Thus, due to the noisy nature of the data set used in this research work, the authors put this measure aside.
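The first three measures are available in scikit-learn (assumed installed here); DBCV has no scikit-learn implementation and is omitted, as in the paper. A minimal sketch on synthetic data:

```python
# Sketch: computing SI, DB and CH for a k-means clustering of toy blob data
# with scikit-learn. The data and k are illustrative, not the paper's setup.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

si = silhouette_score(X, labels)          # in [-1, 1], higher is better
db = davies_bouldin_score(X, labels)      # in [0, +inf), lower is better
ch = calinski_harabasz_score(X, labels)   # in [0, +inf), higher is better
```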
As illustrated in Figure 3.6, the first three measures have a strong bias towards spherical clusterings, while DBCV can handle clusters with different densities and shapes.
Figure 3.6: Spectral clustering solutions selected by various measures
The clustering algorithms used in this experiment are k-means, spectral, DBSCAN, Ward, mean shift and EM. The parameter ranges for each algorithm were chosen to be wide enough to ensure that they contain values leading to a good solution. After applying all algorithms to a data set and computing the first three validity measures, the outcomes are summarized in Table 3.3.
The above results indicate that all measures exhibit some undesired properties, e.g. sensitivity to noise points, a preference for highly imbalanced solutions, or a bias towards spherical clusterings. Closer inspection shows that to produce clusters that score well on the silhouette and Calinski-Harabasz measures, we can simply use k-means. To score well on the Davies-Bouldin and DBCV measures, we can use DBSCAN or mean shift, but this is mainly due to the previously mentioned undesired properties.
Table 3.3: Average relative SI, CH and DB score over data set
The idea of using internal validity measures such as Silhouette Index to compare
clustering algorithms is used as a guide in the thesis study.
Chapter 4
4 Event-based User Profiling in
Social Media
This chapter discusses how to use data mining approaches to profile users in social media, namely Twitter and Instagram. The aim is to go deep into the analysis approach and elaborate each step in detail.
4.1 Main Idea

The primary objective of this study is to analyze the social data collected about a specific event and use its outcomes, such as users’ types and behavior, to improve the quality of that event and engage potential users who are more likely to be interested in participating in similar events. To achieve this goal, the entities contained in the tweets/posts as well as their related users have been taken into consideration. Users’ textual content such as biographies, hashtags, tweet/post texts and list descriptions is specifically proposed to be used in the clustering approaches. Moreover, dealing with tweets/posts in different languages was another challenge to overcome. Other properties are also extracted to be employed in the knowledge representation part. Taking all these ideas into account, the details of this research were identified, in terms of the structure to be examined and the aspects to be considered.
4.1.1 Twitter Users

Twitter is a social networking and microblogging service, enabling registered users to read and post short messages, so-called tweets. As of the fourth quarter of 2016, the microblogging service averaged 319 million monthly active users (Statista 2016). In this thesis, the extracted Twitter users are divided into two groups:

Masters: users who tweeted about the event in a specific time span.

Contributors: users who retweeted, favorited or replied to one or more tweets posted by a master user in a specific time span.
Intuitively, if two users (one master and one contributor) engage with the same tweet, they may have similar interests, and such users are considered our target users in this thesis study. That is why both types have to be taken into account for clustering purposes.
4.1.2 Instagram Users

As of December 2016, the mainly mobile photo-sharing network had reached 600 million monthly active users, up from 500 million in June 2016 (Statista 2016). For the same reason mentioned in Section 4.1.1, the extracted Instagram users are also divided into two categories:

Masters: users who posted about the event in a specific time interval.

Contributors: users who liked or commented on one or more Instagram media posted by a master user in a specific time interval.
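The master/contributor split can be sketched as a simple classification; the field names, timestamps and helper below are illustrative, not the thesis's actual pipeline:

```python
# Hypothetical sketch of the master/contributor split defined above: a user
# who posted about the event in the time window is a master; a user who only
# engaged (liked/commented on Instagram; retweeted/favorited/replied on
# Twitter) with a master's content in the window is a contributor.

def classify_users(posts, engagements, start, end):
    """posts: (username, timestamp); engagements: (username, author, timestamp)."""
    masters = {u for u, t in posts if start <= t <= end}
    contributors = {u for u, author, t in engagements
                    if author in masters and start <= t <= end
                    and u not in masters}
    return masters, contributors

masters, contributors = classify_users(
    posts=[("alice", 5), ("bob", 30)],          # bob posts outside the window
    engagements=[("carol", "alice", 6),         # carol engages with a master
                 ("dave", "eve", 7)],           # eve is not a master
    start=0, end=20)
```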
4.2 Motivation

As mentioned in Chapter 3, there are many studies concentrating on the analysis of different aspects of social media platforms. However, there is no work focusing on analyzing Twitter and Instagram users based on their interest in a particular event. This is one of the main motivations for choosing this topic: to inspect users’ activities during an event, compare the results of the two social media platforms and provide a solution for predicting future potential users.
4.2.1 Why social media as data source?

The popularity of social media sites and the ease with which their data can be accessed mean these platforms are increasingly becoming primary sources for every kind of research. Current academic and industry interest in social media has been driven by the rapidly broadening user base for social media technologies, which is of course related to the continuing spread of internet use itself. The rise in social media use has been rapid: in 2011, approximately 60% of internet users were also social media users, up from just 17% in 2007. Much of this change has been driven by the emergence of a small number of “mass appeal” social media websites, of which Twitter and Facebook are the obvious examples. These sites are characterized by their ease of use, their generic nature (i.e. they eschew focus on a particular subject or area of interest) and their wide penetration, meaning that significant portions of the population have created an account (Pensions 2014). The following figures
demonstrate the worldwide growth in using social media (Pew Research Center
2017).
Figure 4.1: Number of social media users from 2010 to 2020 (in billions)
Figure 4.2: Percentage of adult users who use different social networks
Figure 4.3: Percentage of adult users who use at least one social media, by age
In general, when compared to traditional surveys, social media data offer considerable
advantages in terms of how quickly results are delivered, the scale at which results
can be brought in, and (potentially) how cheaply they can be obtained. They also offer
the possibility to access sub-groups within the population in a way that sample
surveying has struggled with. The major difficulty lies in making accurate
generalizations from social media data to some overall population of interest as those
using social media do not constitute a representative sample of the public as a whole
and do not come with perfect demographic data attached. Nevertheless, knowing what
the public is thinking about is a crucial precursor to knowing what their opinion is of
any given topic. It is also an area where social media has the potential to offer real
added value (Pensions 2014).
4.2.2 Why Twitter and Instagram?

Out of all the different social media platforms, Twitter is of particular interest to researchers, as it provides them with arguably the most open access to its data: a real-time stream of tweets, either as a 1% sample or as a dataset matching criteria specified by the user. Other companies such as Google or Facebook do not provide similar access to their data (Rob Procter 2015). There are at least six reasons why researchers prefer to use Twitter as their source of data (Ahmed 2015):
1. Twitter is a popular platform in terms of the media attention it receives, and it therefore attracts more research due to its cultural status.

2. Twitter makes it easier to find and follow conversations (i.e., via both its search feature and tweets appearing in Google search results).

3. Twitter’s hashtag norms make gathering, sorting, and expanding searches easier when collecting data.
4. Twitter data is easy to retrieve, as major incidents, news stories and events on Twitter tend to be centered on a hashtag.
5. The Twitter API is more open and accessible compared to other social media
platforms, which makes Twitter more favorable to developers creating tools to
access data. This consequently increases the availability of tools to
researchers.
6. Many researchers themselves are using Twitter and because of their favorable
personal experiences, they feel more comfortable with researching a familiar
platform.
A picture may be worth a thousand words, but those words are not worth much if no one is listening. This is why it is important to choose the right photo-sharing network for this thesis. Among all the current photo-sharing platforms,
Instagram is a vibrant social platform, where users interact with photos by liking or
commenting on them. One of the app’s most powerful features is its tagging
mechanism, called a “hashtag,” which surfaces your photo to the right subgroup of
Instagram’s more than 500 million users. With over 90% of users falling under the
age of 30, Instagram is the best platform for promoting your event to a younger crowd
(Luna 2016). Figure 4.4 depicts the main differences between four major networks for
photo sharing: Instagram, Pinterest, Tumblr, and Flickr (Sorokina 2014).
Figure 4.4: Comparison between four major photo-sharing networks
4.3 Approach

Our proposed approach in this study consists of several consecutive phases. The architecture design, including the main steps that prepare the collected data for analysis in the subsequent phases, is shown and explained as follows:
Figure 4.5: Architecture design
4.3.1 Data Extraction

In this phase raw data is extracted through the Instagram and Twitter APIs. Since our approach aims to make the process of storing, analyzing and visualizing data more efficient, a MySQL database is used because of its scalability, flexibility, high performance and high availability in dealing with the data collected from the mentioned APIs. Table 4.1 and Table 4.2 list the extracted features obtained from the Twitter and Instagram objects.
Table 4.1: Twitter extracted features
Tweet
Id: The string representation of the unique identifier for this
tweet
Username: The user who posted this tweet.
Text: The actual UTF-8 text of the status update.
Date: Date and time when this tweet was created.
Retweets: Number of times this tweet has been retweeted.
Favorites: Indicates approximately how many times this tweet
has been liked by Twitter users.
Mentions: the users who are mentioned in this tweet.
Hashtags: Represents hashtags which have been parsed out of
this tweet text.
Geo: Represents the geographic location of this tweet as
reported by the user or client application.
Place: Indicates that the tweet is associated (but not necessarily
originating from) a place.
User
Id: The string representation of the unique identifier for this
user.
Username: the unique name of this user.
Full name: The name of this user, as they’ve defined it. Not
necessarily a person’s name.
Tweets: the user’s most recent (20) tweets.
Follower count: The number of followers this user currently
has.
Following count: The number of users this user is following.
Status count: The number of tweets issued by this user.
Listed count: The number of public lists that this user is a
member of.
Favorite count: The number of tweets this user has favorited in
the account’s lifetime.
Bio: the user-defined UTF-8 string describing their account.
Hashtags: All the hashtags included in this user’s most recent
(20) tweets.
Mentions: All the users who are mentioned in this user’s most
recent (20) tweets.
Location: The user-defined location for their profile.
Language: The user’s self-declared user interface language.
Time zone: A string describing the Time Zone this user
declares themselves within.
Join date: The UTC datetime that the user account was created
on Twitter.
Is Verified: when true, indicates that the user has a verified
account.
Is Protected: When true, indicates that this user has chosen to
protect their Tweets.
List
Id: The numerical id of the list.
User ID: The ID of the user who is member of this list.
Name: The screen name of this list.
Slug: The short name of this list.
Description: The description of this list.
Member count: The number of members of this list.
Table 4.2: Instagram extracted features
Media
Id: The unique identifier for this media
Username: The user who posted this media.
Caption: The media caption text.
Date: Date and time when this media was created.
URL: The URL of the photo uploaded in this media.
Like count: Indicates how many times this media has
been liked by Instagram users.
Comment count: Indicates how many times this media has
been commented by Instagram users.
Likers: The first 10 users who liked this media.
Commenters: The first 10 users who commented this media.
Mentions: the users who are mentioned in this media.
Hashtags: Represents hashtags which have been parsed out of
this media text.
Geo: Represents the geographic location of this media.
User
Id: The string representation of the unique identifier for this
user.
Username: the unique name of this user.
Full name: The name of this user, as they’ve defined it. Not
necessarily a person’s name.
Media: the user’s most recent (20) media.
Media count: The number of media this user posted.
Follower count: The number of followers this user currently
has.
Following count: The number of users this user is following.
Average like count: This user’s average number of likes.
Average comment count: This user’s average number of
comments.
Bio: the user-defined string describing their account.
Hashtags: All the hashtags included in this user’s most recent
(20) media.
Mentions: All the users who are mentioned in this user’s most
recent (20) media.
Is Verified: when true, indicates that this user has a verified
account.
Is Private: when true, indicates that this user has a private
profile.
Is Business: when true, indicates that this user is a business.
4.3.2 Data Preprocessing

Since the gathered raw data is incomplete and inconsistent, we need to apply preprocessing techniques to prepare an appropriate dataset for the subsequent analysis and experiments. The related techniques are applied in particular to the fields “Text”, “Bio” and “Tweets” in the Twitter dataset and the fields “Caption”, “Bio” and “Media” in the Instagram dataset; as a consequence, new fields “Text Norm”, “Bio Norm”, “Tweets Norm” and “Caption Norm”, “Bio Norm”, “Media Norm” are appended to the Twitter and Instagram datasets respectively. It is noteworthy that hashtags are excluded from the data preprocessing phase, because each hashtag refers to a specific content and should not be transformed. The preprocessing process consists of three main steps.
4.3.2.1 Text Normalization

The textual properties extracted in the data extraction phase include a great deal of non-standard characters, punctuation, symbols, white spaces, stop words, etc. that must be removed to make the data clean and standard. Furthermore, it is essential to reduce inflected or derived words to their stem, base or root form. This process, called stemming, is applied to the textual features at the end of this stage. The Text Normalization Java Library is used for this purpose.
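The thesis performs this step with a Java library; as an illustration only, the same pipeline can be sketched with the Python standard library. The stop-word list and the naive suffix stripping are deliberate simplifications standing in for a real stemmer:

```python
# Minimal sketch of the normalization steps described above: lower-casing,
# URL and punctuation removal, stop-word filtering, and a crude suffix
# stripper as a stand-in for proper stemming.
import re

STOP_WORDS = {"a", "an", "the", "is", "are", "and", "of", "to", "in"}
SUFFIXES = ("ing", "ers", "er", "ed", "s")  # crude stand-in for stemming

def naive_stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def normalize(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # drop punctuation/symbols
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(naive_stem(t) for t in tokens)

norm = normalize("Runners are RUNNING to the park!! http://t.co/x")
```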
4.3.2.2 Language Identification and Translation

Unsurprisingly, Twitter and Instagram users do not always tweet or post in English, and since the event monitored in our study was organized and held in Italy, texts in different languages are not unexpected. With the aim of making the data more coherent and unambiguous, detecting the language of each text and translating it into English seems absolutely necessary. The Yandex API first identifies the language of a text and then translates the stems into English.
4.3.2.3 Gender Detection

The Twitter and Instagram APIs do not provide users’ gender in their objects. Since gender is required in the subsequent analysis and visualization phases of our thesis work, the Namsor API is proposed and employed in this step. After each user’s gender is detected, a new field with the same name is added to the Twitter and Instagram datasets to be used in the following steps.
4.3.3 Data Loading

In this phase we store the data into the end target, which is a CSV file. CSV is a file format for data storage which looks like a text file: the information is organized with one record on each line, and the fields are separated by commas. Besides all the obvious benefits of flat data formats like CSV, the simplicity of importing and working with this format in the R programming language and environment during our data analysis phase is one of the main reasons why CSV files are proposed as data storage.
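The record-per-line, comma-separated layout can be sketched as follows; the column names are illustrative, not the thesis's actual schema:

```python
# Minimal stdlib sketch of the storage format described above: one record
# per line, fields separated by commas, with a header row.
import csv
import io

rows = [{"username": "alice", "bio_norm": "data mining student", "gender": "female"},
        {"username": "bob", "bio_norm": "music lover", "gender": "male"}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["username", "bio_norm", "gender"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```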
4.3.4 Data Analysis

This is the phase where the preprocessed and loaded datasets (CSV files) are used for data analysis. In this thesis work, all analysis, statistics, evaluations and result representations are done in R. R is an extremely flexible statistics programming language and environment that is open source and freely available for all mainstream operating systems. The flexibility of R is arguably unmatched by any other statistics program, as its object-oriented programming language allows for the creation of functions that perform customized procedures and/or the automation of commonly performed tasks. Perhaps R’s biggest hindrance is also its biggest asset, and that is its general and flexible approach to statistical inference. With R, if you know what you want, you can almost always get it. But you have to ask for it. Using R requires a more thoughtful approach to data analysis than some other programs do (Ken Kelley 2008).
We step through the main sub-phases of the data analysis in detail below.
4.3.4.1 Topic Extraction

As mentioned in Chapter 2, topic models can help us organize and gain insights into large collections of unstructured text. Topic models allow the probabilistic modeling of term frequency occurrences in documents. The fitted model can be used to estimate the similarity between documents, as well as between a set of specified keywords, using an additional layer of latent variables referred to as topics. The R package “topicmodels” provides basic infrastructure for fitting topic models based on data structures from the text mining package tm. The package includes interfaces to two algorithms for fitting topic models: the variational expectation-maximization algorithm and an algorithm using Gibbs sampling (Bettina Grun 2011). The following two steps have to be carried out to complete the topic modeling approach:
Pre-processing: The input data for topic models is a document-term matrix. The rows in this matrix correspond to the documents and the columns to the terms. The entry m_ij indicates how often the j-th term occurred in the i-th document. The number of rows is equal to the size of the corpus and the number of columns to the size of the vocabulary. We consider users (documents) as rows and all bio stems (terms) as columns of this matrix. In this step, first a corpus containing users’ biographies is created. Here the focus is only on Twitter users, but the same process can be applied to Instagram users as well.
The data pre-processing step involves selecting a suitable vocabulary, which corresponds to the columns of the document-term matrix. The mapping from a document to its term frequency vector involves tokenizing the document and then processing the tokens, for example by converting them to lower case, removing punctuation characters, removing numbers, stemming, removing stop words and omitting terms shorter than a certain minimum length. In addition, the final document-term matrix can be reduced by selecting only the terms which occur in a minimum number of documents, or those terms with the highest term frequency-inverse document frequency (tf-idf) scores. Therefore, in our case, terms whose length is less than three or more than 15 characters and terms whose frequency is less than 50 are considered unimportant and are omitted.
In order to create the document-term matrices for the users’ other textual properties, namely hashtags, their first twenty tweets and the slugs of the lists they are members of (only for Twitter), the above procedure is also employed for each property separately.
Model Selection: To discover the abstract “topics” that occur in the collection of documents containing users’ textual features, we need to apply a topic model such as Latent Dirichlet Allocation (LDA), which benefits from the Gibbs sampling algorithm. For fitting the LDA model to a given document-term matrix, the number of topics needs to be fixed a priori. Because the number of topics is in general not known, models with several different numbers of topics are fitted and the optimal number is determined in a data-driven way. In this thesis study, CaoJuan2009 (minimization) and Deveaud2014 (maximization) are the two metrics used to identify the number of topics for LDA. A simple approach to analyzing these metrics is to find their extrema. Figure 4.6 indicates that the extremum (the number of topics) is 6.
Figure 4.6: Identify the number of topics for LDA
Additionally, estimation using Gibbs sampling requires the specification of values for the parameters of the prior distributions. Gibbs sampling works by performing a random walk in a way that reflects the characteristics of a desired distribution. Because the starting point of the walk is chosen at random, it is necessary to discard the first few steps of the walk (as these do not correctly reflect the properties of the distribution). This is referred to as the “burn-in” period. We set the burn-in parameter to 4000. Following the burn-in period, we perform 2000 iterations, keeping every 500th iteration for further use (thin parameter). The reason we do this is to avoid correlations between samples. We use 5 different starting points (nstart = 5), that is, five independent runs. Each starting point requires a seed integer (this also ensures reproducibility), so we have provided 5 random integers in the seed list. Finally we set best to TRUE (actually a default setting), which instructs the algorithm to return the results of the run with the highest posterior probability. Having set all the required parameters, the LDA function is applied.
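The actual R call appears as a figure in the thesis. As an analogous sketch only, scikit-learn's LatentDirichletAllocation (which uses variational Bayes rather than Gibbs sampling) can fit topics to a document-term matrix; the toy documents and the reduced topic count are assumptions, with random_state playing the role of the seed list:

```python
# Analogous sketch of LDA fitting (the thesis uses R's "topicmodels" with
# Gibbs sampling; scikit-learn uses variational Bayes instead). The output
# rows are per-document topic probability distributions, the counterpart of
# the thesis's topicProbabilities matrix.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["music concert live band", "football match league goal",
        "music band tour", "league goal season football"]
dtm = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2,   # thesis uses k = 6
                                max_iter=50, random_state=0)
topic_probabilities = lda.fit_transform(dtm)      # rows: docs, cols: topics
```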
The LDA algorithm returns an object (LDAout) that contains a wealth of information. Of particular interest to us are the top terms of each topic and the probabilities associated with each extracted topic for each user (document), which we call topicProbabilities. Table 4.3 shows the first twenty rows of this matrix.
Table 4.3: Topic probabilities by user
In general, if a user (document) has multiple topics with comparable probabilities, it
simply means that the user (document) speaks to all those topics in proportions
indicated by the probabilities.
Dimension Reduction: As mentioned earlier, the output of the LDA function contains information such as the top terms of each obtained topic. As can be seen in Table 4.4, the extracted topics are possibly correlated. Therefore, it is suggested to employ Principal Component Analysis (PCA) to convert them into a set of linearly uncorrelated components. This transformation of the data to a lower-dimensional feature space not only reduces the time and storage required but also makes the data visualization easier and more interpretable when reduced to a low dimension such as 2D or 3D.
Table 4.4: Top terms of each extracted topic by LDA
The result of applying PCA on topicProbabilities is illustrated below.
We choose to keep the principal components that together capture at least 95% of the total variance, so that the data is compressed without losing much information. To reach this threshold, we have to pick the first three principal components. Then, in the clustering phase, we will run the clustering algorithms exploiting only the chosen components.
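The 95%-variance selection can be sketched with scikit-learn (assumed available), where a fractional n_components keeps exactly enough components to reach the threshold; the Dirichlet-distributed toy matrix stands in for the real topicProbabilities:

```python
# Sketch of the variance-based component selection: PCA with
# n_components=0.95 keeps the smallest number of components whose
# cumulative explained variance reaches 95%.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy stand-in for topicProbabilities: 200 users, 6 topic probabilities each
topic_probabilities = rng.dirichlet(alpha=[1.0] * 6, size=200)

pca = PCA(n_components=0.95)
reduced = pca.fit_transform(topic_probabilities)
kept = pca.n_components_                       # components needed for 95%
explained = pca.explained_variance_ratio_.sum()
```

Because each row of the toy matrix sums to 1, the centered data has rank 5, so all five informative components are needed here; on the real topicProbabilities the thesis finds three suffice.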
4.3.4.2 Cluster Analysis

As mentioned earlier in this chapter, the main notion of this thesis work is to collect social media data related to a specific event, categorize the users who talk about that event, and consequently analyze their activities and behavior. In other words, we need to uncover hidden structure, such as similarity groups, from “unlabeled” data. To reach this goal, Cluster Analysis or Clustering, the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters), has to be applied to our data (topicProbabilities with the three selected components).
The notion of a "cluster" cannot be precisely defined, which is one of the reasons why
there are so many clustering algorithms. There is a common denominator: a group of
data objects. However, different researchers employ different cluster models, and for
each of these cluster models again different algorithms can be given. The notion of a
cluster, as found by different algorithms, varies significantly in its properties
(Wikipedia 2016). For this reason, three different algorithms from three different
cluster models are put into practice: k-means, hierarchical and DBSCAN.
The k-means algorithm is one of the most popular clustering algorithms. Its goal is to find the best division of n entities into k groups, so that the total distance between each group’s members and its corresponding centroid, the representative of the group, is minimized. Formally, the goal is to partition the n entities into k sets Si, i = 1, 2, ..., k, in order to minimize the within-cluster sum of squares (WSS), defined as

WSS = \sum_{i=1}^{k} \sum_{x_j \in S_i} \| x_j - \mu_i \|^2

where the term \| x_j - \mu_i \|^2 provides the distance between an entity point x_j and the cluster’s centroid \mu_i.
The most common algorithm, described below, uses an iterative refinement approach,
following these steps:
1. Define the initial groups' centroids. This step can be done using different
strategies. A very common one is to assign random values for the centroids of
all groups. Another approach is to use the values of K different entities as
being the centroids.
2. Assign each entity to the cluster that has the closest centroid. In order to find
the cluster with the most similar centroid, the algorithm must calculate the
distance between all the entities and each centroid.
3. Recalculate the values of the centroids: each centroid’s fields are updated as the averages of the corresponding attribute values of the entities belonging to the cluster.

4. Repeat steps 2 and 3 iteratively until entities can no longer change groups.
The k-means is a greedy, computationally efficient technique, being the most popular
representative-based clustering algorithm.
One decision that has to be made before applying k-means clustering is to determine
the number of clusters. There is an obvious trade-off between the number of clusters
and the internal cohesion of them. If there are few clusters, the internal cohesion tends
to be small. Otherwise, a large number of clusters make them very close, so that there
is little difference between adjacent groups. The optimal choice of k (number of
clusters) will strike a balance between maximum compression of the data using a
single cluster, and maximum accuracy by assigning each data point to its own cluster.
If an appropriate value of k is not apparent from prior knowledge of the properties of
the data set, it must be chosen somehow. There are several categories of methods for
making this decision. One method to validate the number of clusters is the elbow
method (Wikipedia 2016). The idea of the elbow method is to run k-means clustering
on the dataset for a range of values of k (say, k from 1 to 15 in our case), and for each
value of k calculate the total within-cluster sum of square (WSS).
Then, plot a line chart of the WSS for each value of k. If the line chart looks like an
arm, then the "elbow" on the arm is the value of k that is the best. In this case, k=3 is
the value that the Elbow method has selected (see Figure 4.7).
Figure 4.7: Elbow method representation
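The WSS curve and the elbow rule can be made concrete. In the sketch below (Python, for illustration only; the thesis computes WSS in R), the elbow is taken as the point farthest from the straight line joining the two ends of the curve, which is one common heuristic among several — the text above does not commit to a specific rule beyond visual inspection:

```python
import math

def wss(points, centroids, assignment):
    """Total within-cluster sum of squares for one clustering result."""
    return sum(math.dist(p, centroids[a]) ** 2
               for p, a in zip(points, assignment))

def elbow_k(ks, wss_values):
    """Pick the 'elbow': the k whose (k, WSS) point lies farthest from
    the straight line joining the curve's first and last points."""
    (x1, y1), (x2, y2) = (ks[0], wss_values[0]), (ks[-1], wss_values[-1])
    def dist_to_line(x, y):
        # Perpendicular distance from (x, y) to the endpoint line.
        num = abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1)
        return num / math.hypot(y2 - y1, x2 - x1)
    return max(zip(ks, wss_values), key=lambda p: dist_to_line(*p))[0]
```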
When the number of clusters is specified, we perform a k-means clustering with three
initial cluster centers. The algorithm of Hartigan and Wong is used by default. It
generally does a better job than the alternatives, namely those of MacQueen, Lloyd and
Forgy3, but trying several random starts (nstart > 1) is often recommended; here we
set nstart to 10.
The results obtained with the k-means method are presented and discussed in chapter 5.
Hierarchical Algorithm: this is a method of cluster analysis that seeks to build
a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two
types:
Agglomerative: This is a "bottom up" approach: each observation starts in its own
cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a "top down" approach: all observations start in one cluster, and
splits are performed recursively as one moves down the hierarchy.
3 “Lloyd” and “Forgy” are alternative names for one algorithm.
In general, the merges and splits are determined in a greedy manner. The results of
hierarchical clustering are usually presented in a Dendrogram4. In order to decide
which clusters should be combined (for agglomerative), or where a cluster should be
split (for divisive), a measure of dissimilarity between sets of observations is required.
In most methods of hierarchical clustering, this is achieved by use of an
appropriate metric (a measure of distance between pairs of observations), and a
linkage criterion which specifies the dissimilarity of sets as a function of the pairwise
distances of observations in the sets (Wikipedia 2016).
Euclidean distance, the most commonly used dissimilarity metric, is computed using
the dist function in R. The hclust function then performs a hierarchical cluster
analysis on the resulting set of dissimilarities. Initially, each object is
assigned to its own cluster and then the algorithm proceeds iteratively, at each stage
joining the two most similar clusters, continuing until there is just a single cluster
(Agglomerative).
A number of different linkage criteria are provided. Ward's minimum variance
method aims at finding compact, spherical clusters. The complete linkage method
finds similar clusters. The single linkage method (which is closely related to the
minimal spanning tree) adopts a ‘friends of friends’ clustering strategy. The other
methods can be regarded as aiming for clusters with characteristics somewhere
between the single and complete link methods (Statistical Data Analysis n.d.). It is not
surprising that both single and complete algorithms often produce undesirable
clusters. Single-link clustering often suffers from chaining, that is, we only need a
single pair of points to be close to merge two clusters. Therefore, clusters can be too
spread out and not compact enough. Complete-link clustering often suffers from
crowding, that is, a point can be closer to points in other clusters than to points in its
own cluster. Therefore, the clusters are compact, but not far enough apart. In this
thesis study, complete linkage is preferred because it is less sensitive to noise
and outliers, and it yields a tree with a more interpretable structure.
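The agglomerative procedure with the two linkage criteria contrasted above can be sketched directly (a Python illustration with hypothetical data; the thesis performs this step with hclust in R):

```python
import math

def agglomerative(points, k, linkage="complete"):
    """Bottom-up clustering: start with singleton clusters and repeatedly
    merge the two closest ones until k clusters remain.  With complete
    linkage the distance between clusters is their LARGEST pairwise point
    distance; with single linkage it is the smallest (prone to chaining)."""
    agg = max if linkage == "complete" else min
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = agg(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters
```

Recording the merge heights instead of discarding them would yield exactly the dendrogram discussed above.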
Unlike k-means algorithm, hierarchical algorithm does not require the optimal
number of clusters at the beginning. In this clustering algorithm clusters are defined
by cutting branches off the dendrogram. To determine the cutting section, various
methods can be used; a common statistical convention is to cut the dendrogram at the
height where the difference between successive merges is largest.
DBSCAN Algorithm: Density-Based Spatial Clustering of Applications with
Noise (DBSCAN) is a data clustering algorithm which groups together points that are
closely packed together (points with many nearby neighbors), marking as outliers
4 It is a tree diagram frequently used to illustrate the arrangement of the clusters produced
by hierarchical clustering.
points that lie alone in low-density regions (whose nearest neighbors are too far away)
(Wikipedia 2016). DBSCAN has several advantages which makes it a desirable
clustering algorithm for this analysis part of the thesis. Some of which are:
DBSCAN does not require one to specify the number of clusters in the data a
priori.
DBSCAN can find arbitrarily shaped clusters.
DBSCAN has a notion of noise, and is robust to outliers.
DBSCAN requires two parameters: ε (eps) and the minimum number of points
required to form a dense region (MinPts). It starts with an arbitrary starting point that
has not been visited. This point's ε-neighborhood is retrieved, and if it contains
sufficiently many points, a cluster is started. Otherwise, the point is labeled as
noise. If a point is found to be a dense part of a cluster, its ε-neighborhood is also part
of that cluster. Hence, all points that are found within the ε-neighborhood are added,
as is their own ε-neighborhood when they are also dense. This process continues until
the density-connected cluster is completely found. Then, a new unvisited point is
retrieved and processed, leading to the discovery of a further cluster or noise
(Wikipedia 2016).
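The expansion procedure just described can be sketched as follows (a Python illustration assuming distinct points; the thesis itself relies on the dbscan R package):

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise.
    A point is 'dense' (a core point) when its eps-neighborhood contains
    at least min_pts points, itself included."""
    labels = {p: None for p in points}

    def neighbors(p):
        return [q for q in points if math.dist(p, q) <= eps]

    cluster = -1
    for p in points:
        if labels[p] is not None:        # already visited
            continue
        seeds = neighbors(p)
        if len(seeds) < min_pts:         # not dense: tentatively noise
            labels[p] = -1
            continue
        cluster += 1                     # start a new cluster at p
        labels[p] = cluster
        queue = [q for q in seeds if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:          # noise reachable from a core point
                labels[q] = cluster      # becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighborhood = neighbors(q)
            if len(q_neighborhood) >= min_pts:  # q is dense too: expand from it
                queue.extend(q_neighborhood)
    return labels
```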
To determine the optimal eps value, a method is proposed that consists of computing
the k-nearest neighbor distances for the set of points. The idea is to calculate the
average of the distances from every point to its k nearest neighbors. The value of k will
be specified by the user and corresponds to MinPts. Next, these k-distances are
plotted in an ascending order. The aim is to determine the “knee”, which corresponds
to the optimal eps parameter. A knee corresponds to a threshold where a sharp change
occurs along the k-distance curve. The function kNNdistplot() [in dbscan package]
can be used to draw the k-distance plot.
As it can be seen in Figure 4.8 the optimal eps value is around a distance of 0.15.
Figure 4.8: k-nearest neighbor distances to determine eps in DBSCAN
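The k-distance curve that kNNdistplot() draws boils down to a simple computation, sketched here in Python for illustration (the thesis uses the dbscan R package for this step):

```python
import math

def k_distances(points, k):
    """For every point, the average distance to its k nearest neighbors,
    returned in ascending order.  Plotting this curve and reading off the
    sharp 'knee' suggests an eps value for DBSCAN."""
    averages = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points if q != p)
        averages.append(sum(dists[:k]) / k)
    return sorted(averages)
```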
Function dbscan::dbscan() computes DBSCAN and provides an object of class
‘dbscan’ as a result.
4.3.4.3 Cluster Validity

In cluster analysis, the important question is how to evaluate the “goodness” of the
resulting clusters. To answer this question, we first have to know why we need to
evaluate clusters. There are several reasons, some of which are mentioned below:
To avoid finding patterns in noise
To compare clustering algorithms
To compare two sets of clusters
To compare two clusters
As outlined in section 2.1.6, numerical measures applied to judge various aspects of
cluster validity are classified into three types: internal, external and relative. In
this thesis study, the Silhouette Coefficient and Dunn’s Index from the internal
criteria and Entropy from the external criteria are selected in order to evaluate and
compare the different aspects of the clustering results. The formulas for these
indices are given in Table 4.5.
Table 4.5: Formulas for Silhouette, Dunn and Entropy indices

Name | Formula
Silhouette Coefficient (SC) | s(i) = (b(i) − a(i)) / max{a(i), b(i)}
Dunn’s Index | D = (min. distance between observations in different clusters) / (max. intra-cluster distance)
Entropy | E = − Σj pj log(pj)
The cluster.stats() function in the fpc package computes a number of distance-based
statistics which can be used for cluster validation, comparison between clusterings and
deciding on the number of clusters: cluster sizes, cluster diameters, average
distances within and between clusters, cluster separation, average silhouette widths,
the Calinski-Harabasz index, Hubert's gamma coefficient, the Dunn index,
entropy, and two indices that assess the similarity of two clusterings, namely the
corrected Rand index and Meila's VI.
The values of the selected indices obtained by applying this function to the different
clustering results are examined in the next chapter.
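Of the selected indices, the silhouette width is simple enough to compute directly. The sketch below (Python, for illustration; the thesis obtains these values from cluster.stats() in R) follows the standard definition, with the silhouette of a singleton cluster set to 0 by convention:

```python
import math

def silhouette(points, labels):
    """Mean silhouette width: for each point, a = mean distance to the other
    members of its own cluster, b = smallest mean distance to any other
    cluster; s = (b - a) / max(a, b).  Values near 1 indicate compact,
    well-separated clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)

    def mean_dist(p, group):
        return sum(math.dist(p, q) for q in group) / len(group)

    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q != p]
        if not own:                      # singleton cluster: s = 0
            scores.append(0.0)
            continue
        a = mean_dist(p, own)
        b = min(mean_dist(p, clusters[m]) for m in clusters if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```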
Chapter 5
5 Experiments and Discussion
In this chapter, all experiments and investigations that have been done for profiling
users engaged in talking about a particular event are explained one by one. The
process and the aspects considered are demonstrated, and the results are presented
in various forms such as charts, graphs and tables, and then discussed.
5.1 The Floating Piers Datasets

For sixteen days – June 18 through July 3, 2016 – Italy’s Lake Iseo was reimagined.
100,000 square meters of shimmering yellow fabric, carried by a modular floating
dock system of 220,000 high-density polyethylene cubes, undulated with the
movement of the waves as The Floating Piers rose just above the surface of the water.
Visitors were able to experience the work of art by walking on it from Sulzano to
Monte Isola and to the island of San Paolo, which was framed by The Floating Piers
(Claude 2016).
Figure 5.1: The Floating Piers (Project for Lake Iseo, Italy)
The dataset consists of two collections, one for tweets and Twitter users and one for
Instagram posts and users that are extracted through querying Twitter and Instagram
APIs from June 10 (one week before the event) to July 30 (four weeks after the
event). However, most of the analysis focuses on the data collected from the Twitter
APIs, while the Instagram data is used for comparison.
As discussed in the previous chapter, the data was imported into CSV files. The Twitter
dataset contains 14,062 tweets and 23,916 users (7,724 masters and 16,197
contributors) and the Instagram dataset contains 30,256 posts and 94,666 users
(16,681 masters and 77,985 contributors).
5.2 Reports of Analysis

This section presents the outcome of the analytics process that was applied to the
Floating Piers datasets on CSV files. It is divided into sub sections that present the
results in different stages of the analysis. Each one corresponds to the outcomes from
various methodologies that were applied and discussed in the previous chapters.
5.2.1 Content Specific Results

This section is dedicated to the results that are directly related to tweets or Instagram
posts collected from social media. These results can give us general insights on how
much attention the event received from Twitter and Instagram during the specific time
period.
5.2.1.1 Twitter

Result 1: Engagement in social media is defined as the total number of times a user
interacted with a tweet, including retweets, replies, follows, favorites, etc. We can get
deeper into engagements by noticing what specific types of engagements took place.
For instance, were they retweets or favorites? Retweets can be a sign of value.
Someone found a tweet valuable enough to share with their audience. Favorites can be
a sign of appreciation. A tweet resonated with someone else, and they wanted to give
a virtual high-five. Both metrics count as engagement. The diagram below shows
the total numbers of retweets and favorites that tweets about the Floating Piers
event gained between June 10 and July 30.
Figure 5.2: Twitter total retweets vs. favorites
As this line graph represents, the numbers of favorites and retweets start to
dramatically increase at the opening date (June 18) and have several fluctuations in
the following days until the closing date (July 3). After the event ends the number of
favorites and retweets gradually decrease to almost zero. The figure also shows that
the number of favorites consistently outweighs the number of retweets during this time
span, indicating that people were more inclined to like a post related to this event
than to share it with their followers.
Result 2: To discover the words (stems) and hashtags that occur frequently in
tweets, we use a word-cloud: a textual representation in which the displayed terms
are the words or hashtags used to address the event, sized in proportion to their
frequency in our Twitter data (see Figure 5.3). Evidently the most frequent words are “lake”,
“walk”, “float”, “water”, “christo”, “pier” and “iseo”, which represent the name of the
event, its artist’s name and the place where it was held. In addition, the most frequent
hashtags include “#christo”, “#floatingpiers”, “#iseo”, “#lagodiiseo” and “#iseolake”
which convey the same mentioned concepts about the event. Simply by viewing this
presentation, one can guess about the main topics that were discussed mostly in the
Floating Piers event, even in case he does not have any information about the event in
advance.
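The counts behind such a word-cloud are plain term frequencies. As an illustration (a Python sketch; the stop-word list and sample texts are hypothetical, and the thesis performs stemming and cleaning steps not repeated here):

```python
import re
from collections import Counter

def top_terms(texts, n=5, stopwords=frozenset({"the", "a", "at", "on"})):
    """Rank words and hashtags across a set of posts by frequency --
    the counts that set the font size of each term in a word-cloud."""
    counts = Counter(
        w
        for text in texts
        for w in re.findall(r"#?\w+", text.lower())  # keep '#' on hashtags
        if w not in stopwords
    )
    return counts.most_common(n)
```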
Result 3: Tweet geolocation feature allows us to find tweets that have been sent from
a specific location. This is useful as it makes tweets more contextual and helps event
holders to find leads relevant to the event and location. However, the proportion of
tweets that are attached to a location is still very low because a reasonable percentage
of the population has concerns about publicly exposing their exact location.
Figure 5.3: Most frequent words (a) and hashtags (b) in tweets
Considering our tweet dataset, we realized that a large majority of geo values in
tweets are null (over 98 percent) which confirms the above point. Therefore, it is
suggested to use the field “Place” instead of “Geo” because place is not an exact
location but an area or neighborhood. Nevertheless, only 16 percent of tweets were
geotagged by users and analyzed for the result shown in Figure 5.4. Hence, to have a
better understanding of tweet locations, we categorized tweets by their place and then
illustrated the number of tweets from the top 10 locations. As shown, the largest
number of tweets (over 800) came from “Sulzano”, the main venue of the event, while
the second and third highest counts originated from “Monte Iseo” and “Milan”. This
suggests that tweets were either posted from the location of the event or by people
who live or work in big cities nearby.
This could be a good hint to be considered by the relevant organizers for future
events.
Figure 5.4: Number of tweets for top 10 locations using the field “Place”
5.2.1.2 Instagram

Result 1: To have a clear intuition of the level of user engagement in Instagram, the
volume of likes and comments received by uploaded posts are depicted in Figure 5.5.
Unlike Twitter, the number of likes and comments in Instagram reached a peak at the
closing date of the event (July 3) not at the opening date. Another difference that can
be seen is that the number of likes in Instagram is near 250,000 which considerably
surpass likes count in Twitter. However, as demonstrated below, Instagram users are
more interested in liking the posts rather than commenting, that is why the number of
comments is much less than likes count and remains on a constant rate during the time
interval.
Figure 5.5: Instagram total likes vs. comments
Result 2: The most repeated words (stems) and hashtags that people used in their
posts in Instagram are presented in two word-clouds in Figure 5.6. As with
Twitter, the most frequent words and hashtags are clearly those particularly
relevant to the event, such as its name, its artist and its location.
Result 3: Having the longitude and latitude of each Instagram post (unlike Twitter
users, most Instagram users specify the location of their posts), we are able to show
the density of media posted in different locations using the get_map and ggmap
functions in R. The results are displayed in below maps including world, Italy,
Brescia and Sulzano. As one can see the density of posts has a direct relationship with
Figure 5.6: Most frequent words (a) and hashtags (b) in Instagram posts
their locality which means most Instagram media have been posted near the main
venue of the event or by the people who live or work near this place.
Figure 5.7: Distribution of Instagram posts in the world
Figure 5.8: Density of Instagram posts – Italy
Figure 5.9: Density of Instagram posts – Brescia
Figure 5.10: Density of Instagram posts – Sulzano
Result 4: The last result in this section addresses the comparison of the total number
of tweets and Instagram posts within a timeline.
As shown in Figure 5.11, people (both Twitter and Instagram users) started talking
about the Floating Piers before the opening day. It is obvious that people tended to
tweet more than post on Instagram during the first week of the event, but as time
passed people posted more Instagram media than tweets. Considering this, one could
conclude that Twitter users have a tendency to tweet about the news at the moment
when an event starts whereas Instagram users usually share their experiences when an
event ends.
As demonstrated, the number of tweets grew significantly on the day the event
began (near 2,000 tweets) and decreased quickly one or two days later. On the
other hand, the number of Instagram posts steadily fluctuated until the closing day
when it reached a peak with over 5,000 posts and had a rapid decline on the following
days. Not surprisingly, both tweeting and posting rates settled to a constant level a
few days after the event ended.
Figure 5.11: Tweets vs Instagram posts timeline
5.2.2 User Specific Results

In this section, we discuss the obtained results relevant to social media users such as
users’ categories and their activities during the mentioned time span. Given that we
focus on text analysis in this part of the thesis and since Instagram is technically a
photo-sharing social network, it is not surprising that the outcomes acquired from Instagram
do not seem desirable in this study. For this reason, we only concentrate on Twitter
users in this section. However, some distinguished differences between Twitter and
Instagram results are highlighted in each section.
5.2.2.1 Clustering Results Evaluation

As indicated in the previous chapter, three different clustering algorithms are chosen to be
applied on collected data from users’ textual properties specifically bio, hashtag,
status text and list slug. Therefore, each algorithm is performed on each collection
separately and then, to better understand the obtained results and find the best
clustering, three cluster validity measures are employed. Table 5.1 reports the
values of these measures for each algorithm and each feature.
Table 5.1: Evaluation results of cluster validation indices
Silhouette width and Dunn index combine measures of compactness and separation of
the clusters. Recall that the values of silhouette width range from -1 (poorly clustered
observations) to 1 (well clustered observations). The Dunn index is the ratio between
the smallest distances between observations not in the same cluster to the largest
intra-cluster distance. It takes values between 0 and infinity and should also be
maximized. Thus, algorithms that produce clusters with a high Dunn index and high
Silhouette width are more desirable. On the other hand, entropy is a metric that is a
measure of the amount of disorder in a vector. So, smaller values of entropy indicate
less disorder in a clustering, which means a better clustering.
According to the above facts and the table’s output, hierarchical clustering (three
clusters) can be considered the best algorithm, producing better results than the
other two. Furthermore, among the four examined textual properties, Bio yields the
most acceptable values. Consequently, as the table suggests, from now on we focus
only on hierarchical clustering performed on the users’ bio data.
5.2.2.2 Interpretations of Clusters

Result 1: The hierarchical algorithm returns a dendrogram, which is illustrated in Figure
5.12. To have a better insight, three clusters are drawn in different colors. Each leaf in
this tree is an indicator of a Twitter user engaged in the Floating Piers event through
tweeting, retweeting or favoriting a post.
Figure 5.12: Dendrogram representation of Twitter users
Result 2: The pie chart in Figure 5.13 shows the proportions of users in each cluster.
It can be seen that among these three clusters, nearly 60 percent of users lie in the
first cluster (green slice), over 35 percent in the second (blue slice) and the rest
(about 5 percent) in the third cluster (red slice).
Figure 5.13: The percentage of user engagement in each cluster
Result 3: As outlined in chapter 4, we used a topicProbabilities matrix with three
topics in the clustering analysis. This leads to a three-dimensional space in which
every point demonstrates the similarity between the textual properties of a user and
each topic. Because representing points in 3D may cause confusion, we plot them in
two dimensions covering two of the extracted topics (topic 1 and topic 2). Figure
5.14 shows the distribution of cluster
objects in a 2D representation.
Figure 5.14: 2D representation of cluster objects
In order to obtain the objects (users) of each cluster after clustering, we use the vector
returned by the cutree function as an index into our original data matrix.
Result 4: Having all the user objects in each cluster, we are able to label the obtained
clusters or in other words to identify the categories of users. To depict a weighted list
of the words that are used in users’ bio, hashtags, tweets texts and lists in each cluster,
we employ word-cloud which is a visual representation of text data. Below are visual
demonstrations of the users’ bio word-clouds (per cluster).
Figure 5.15: Word-cloud representation of first cluster based on bio
Figure 5.16: Word-cloud representation of second cluster based on bio
Figure 5.17: Word-cloud representation of third cluster based on bio
It can be seen that the most frequent words in each cluster convey specific meanings.
People in the first cluster mostly mention “Travel” when introducing themselves in
their Twitter bio, people in the second cluster appear to be “Art” lovers, and people
in the third cluster present themselves as “Technology” fans. Henceforth, we call the
users in the first, second and third clusters Travel Lovers, Art Lovers and Tech Lovers
respectively. Moreover, word-clouds of the other properties, namely hashtag, tweet
text and list slug, also confirm the validity of each cluster’s label (see below figures).
Figure 5.18: Hashtag word-cloud for Travel Lovers (a), Art Lovers (b) and Tech Lovers (c)
5.2.2.3 Comparison of Clusters

In this section, we investigate the differences between clusters’ numerical properties
namely the number of users’ followers, followings, favorites and tweets as well as a
few selected features and activities that users had within a timeline.
Result 1: One way to compare users in different clusters is to observe and evaluate
their numerical features. To paint a clearer picture of the users’ differences, we
categorize the values of these properties into four major groups:
less than 100 (low)
between 100 and 1000 (medium)
between 1000 and 10000 (high)
more than 10000 (very high)
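This bucketing can be expressed as a small helper (a Python sketch for illustration; treating the boundaries 100, 1000 and 10000 as belonging to the upper band is an assumption, since the text does not specify where the endpoints fall):

```python
def engagement_band(count):
    """Map a numerical profile property (followers, followings, favorites
    or tweets) to one of the four bands used in the comparison."""
    if count < 100:
        return "low"
    if count < 1000:
        return "medium"
    if count < 10000:
        return "high"
    return "very high"
```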
Figure 5.19: Tweet text word-cloud for Travel Lovers (a), Art Lovers (b) and Tech Lovers (c)
Figure 5.20: List slug word-cloud for Travel Lovers (a), Art Lovers (b) and Tech Lovers (c)
The above categories can be used for all numerical properties: the number of
followers, followings, favorites and tweets (see below bar charts).
As shown in Figure 5.21, the percentage of users with fewer than 100 followers is
highest in the Travel lovers cluster, whereas Art lovers most often have between 100
and 1000 followers and Tech lovers tend to exceed 1000 followers. In all three
clusters, however, the 100-to-1000-followers band contains the largest share of users.
Figure 5.21: Percentage of users whose number of followers lie in each category
Figure 5.22 illustrates that nearly 40 percent of Tech lovers have more than 1000
followings while only about 30 percent of Art lovers and Travel lovers fall into the
high and very high categories. Moreover, most users in the three clusters have 100 to 1000
followings (almost 60 percent) while the percentage of users with more than 10000
followings is minimum.
Figure 5.22: Percentage of users whose number of followings lie in each category
As the chart below shows, over 61 percent of Travel lovers favorited more than 1000
tweets posted by others, compared to approximately 55 percent of Art lovers and
Tech lovers.
Figure 5.23: Percentage of users whose number of favorites lie in each category
Looking at the figure below, Travel lovers have the highest percentage of users with
fewer than 100 tweets, while nearly 70 percent of Tech lovers posted more than 1000
tweets (28 percent of them more than 10000).
Figure 5.24: Percentage of users whose number of tweets lie in each category
Result 2: A box plot is a type of graphical display that can be used to summarize a set
of data based on the five number summary of this data. The summary statistics used to
create a box plot are the median of the data, the lower and upper quartiles (25% and
75%) and the minimum and maximum values. The box plot is an effective way to
investigate the distribution of a set of data.
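The five summary statistics behind a box plot are easy to compute directly. A minimal Python sketch, for illustration only (the quartile convention below, linear interpolation between order statistics, is an assumption, as several conventions exist and the thesis's plotting tool may use another):

```python
def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile and maximum --
    the five statistics a box plot displays."""
    xs = sorted(values)

    def quantile(q):
        # Linear interpolation between the two nearest order statistics.
        pos = q * (len(xs) - 1)
        lo, frac = int(pos), pos - int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + frac * (xs[hi] - xs[lo])

    return xs[0], quantile(0.25), quantile(0.5), quantile(0.75), xs[-1]
```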
As one can see in the box plots below, Tech lovers have the highest medians for the
numbers of followers and followings among the three clusters, and their medians for
the numbers of favorites and tweets also exceed those of the other groups.
Figure 5.25: Summary statistics of numbers of followers in each cluster
Figure 5.26: Summary statistics of numbers of followings in each cluster
Figure 5.27: Summary statistics of numbers of favorites in each cluster
Figure 5.28: Summary statistics of numbers of tweets in each cluster
Result 3: There are some other features like language and gender which help to
compare users in three clusters. As Figure 5.29 shows Italian is the most common
language of all users in all three clusters while second and third places belong to
English and other languages (French, Dutch, etc.). As one can see, the language
timelines follow the flow of tweets in all three clusters and peak on the opening day
of the event.
Figure 5.29: Language timeline per cluster
Figure 5.30 indicates that the number of males who got involved in the Floating Piers
outweighs the number of females. In addition, since Travel lovers form the largest
cluster, both male and female counts are highest in this category.
Figure 5.30: Gender timeline per cluster
The next diagram displays the number of users and their posted tweets over time for
each cluster separately. To avoid misinterpretation, we calculate a ratio that divides
the number of tweets by the number of users who posted them. The results are
represented in Figure 5.32. It is evident that the value of this ratio (in all three
clusters) stays between 1 and 2 during the event period, which shows that each user
posted fewer than three tweets per day during this time. Nevertheless, this ratio
fluctuated in the weeks after the closing day and even reached 5 in the Travel lovers
cluster.
Figure 5.31: Number of Users - Tweets timeline per cluster
Figure 5.32: Tweet – User ratio timeline per cluster
5.2.2.4 Active Users and Influencers

In this section we concentrate on the users whose online roles or engagements in
this event are more effective than the others. Indeed, each business has their own
unique audience identity, but that segmentation might not pan across each social
media network successfully. Instead, it takes better brand alignment, thought-out
social conversations and meaningful connections with the core group of loyalists.
Therefore, it truly pays to have your message reach the right people at the right time.
As previously mentioned one business goal in this specific event or generally in any
kind of event is to keep track of interested people and use this information as a
guideline for attracting new people with the similar interest in future events. To
achieve this goal, we use “active users” and “influencers” concepts in this thesis
study.
The definition of what “active” means depends on the service or website. Since all
master users extracted from the Floating Piers data posted at least one tweet or media,
they can all be considered active master users, but we are mostly interested in the
users whom this event kept highly engaged.
Result 1: Figure 5.33 and Figure 5.34 show this event’s top 20 active users for
Twitter and Instagram respectively. As can be seen, the most active Twitter user
posted more than 80 tweets about this event, while this number reaches 200 on
Instagram.
Figure 5.33: Twitter top 20 active users
Figure 5.34: Instagram top 20 active users
Result 2: In addition, active contributors are the other group of users whom we
intend to identify. The top 20 Twitter active contributors are listed in Figure 5.35.
The y-axis shows the total number of their retweets and favorites.
Figure 5.35: Twitter top 20 active contributors
Below is the visual representation of the list of top 20 active contributors (likers and
commenters) in Instagram.
Figure 5.36: Instagram top 20 active contributors
Result 3: Influencers are experts whose ideas and actions shape the opinion of like-
minded people. Influencers are not just celebrities who have millions of followers.
They can be people who are influential because they have expertise in a topic. When
these experts talk about products and services, people listen. In fact a recent study
found that 49% of Twitter users said they rely on recommendations from influencers
(Little Bird 2017). Typically, influencers have a large following on social media and
high engagement through retweets, comments, etc. By identifying and building
relationships with influencers we gain many benefits: getting our content shared,
forming partnerships, generating business and much more.
Twitter users form a Social Network. If depicted in a graph, they would be
represented by nodes. The edges that connect these nodes are the relations of
“Follower-Following”, introduced by Twitter. Obviously, some users are more
influential than others. The methodology for calculating the importance and influence
that a user has in an Online Social Network (OSN) is presented here. That
measurement should not depend merely on the number of “Followers” of a user, even
if that number is big enough and the user’s tweets are received by a large number of
other users (followers). If the number of “Following” is larger, the user
could be characterized as a “passive” one. Such users are regarded as those
who are keener on viewing or being informed through tweets rather than composing
new ones. Therefore, a suitable factor is the ratio of “Followers to Following” (FtF
ratio) (Gerasimos Razis 2014).
FtF ratio = log10(#Followers / #Followings + 1)
The FtF ratio is placed inside a base-10 logarithm to dampen outlier values.
Moreover, 1 is added to the ratio so as to avoid the metric being equal to 0 in cases
where the number of “Followers” equals the number of “Following”. Using this ratio
we identify the top 10 influencers on Twitter (see Figure 5.37).
Figure 5.37: Twitter top 10 influencers using FtF ratio
5 Blue ticks are indicators of Twitter verified users
According to the above chart, the majority of influencers are companies such as news
agencies, whose follower counts vastly outweigh their following counts. Besides, a
glance over their profiles reveals that their tweet creation rate (TCR) is usually high
regardless of the topic they tweet about. So, these influencers cannot have a great
impact on other users or spread the word about the Floating Piers event on Twitter.
That is why the FtF ratio does not seem sufficient in our study.
Another important factor, proposed in this thesis study, is the User Tweet Weight (UTW), defined as the ratio of a user's activity (the sum of retweets and favorites their tweets received) to the user's number of tweets during the event.

UTW ratio = log10(∑(#Favorites + #Retweets) / #Tweets)

Figure 5.38 shows the top 10 Twitter influencers by UTW ratio.
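A minimal sketch of the UTW ratio in Python, assuming per-tweet favorite and retweet counts are available (the function and argument names are illustrative):

```python
import math

def utw_ratio(favorites, retweets):
    """User Tweet Weight: total engagement (favorites + retweets)
    per tweet during the event, on a base-10 log scale.

    favorites[i] and retweets[i] are the counts for the user's i-th
    tweet. Assumes the user posted at least one tweet and received
    some engagement, so the argument of the logarithm is positive.
    """
    engagement = sum(f + r for f, r in zip(favorites, retweets))
    return math.log10(engagement / len(favorites))
```

For example, a user whose two tweets gathered 30 favorites and 10 retweets in total scores log10(40 / 2) = log10(20) ≈ 1.3.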
Figure 5.38: Twitter top 10 influencers using UTW ratio
As one can see in this chart, most influencers are real figures or celebrities whose tweets about the event attracted the attention of a great many people. A look at their profiles shows that although their TCR is often low compared to the previous influencers', their tweets attract more users, which translates into more influence on the part of the network relevant to the event.
Now consider the case where two users have nearly the same FtF ratio: obviously, the user with the higher UTW ratio has more impact on the network. In our methodology, the final Influence Ratio is therefore computed by combining the two ratios through multiplication.
Influence Ratio = FtF ratio * UTW ratio
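Putting the two metrics together, ranking users by the combined score can be sketched as follows (the user records and figures below are made up for illustration, not drawn from the event dataset):

```python
import math

def influence_ratio(followers, following, favorites, retweets, n_tweets):
    """Combined score: FtF ratio multiplied by UTW ratio."""
    ftf = math.log10(followers / following + 1)
    utw = math.log10((favorites + retweets) / n_tweets)
    return ftf * utw

# Hypothetical users: (name, followers, following, favorites, retweets, tweets)
users = [
    ("news_agency", 500_000, 50, 2_000, 1_000, 400),   # high FtF, low UTW
    ("celebrity", 200_000, 300, 90_000, 60_000, 20),   # moderate FtF, high UTW
]
ranked = sorted(users, key=lambda u: influence_ratio(*u[1:]), reverse=True)
```

On these made-up figures the celebrity outranks the news agency: its lower FtF ratio is more than compensated by the engagement its few tweets attracted, which mirrors the behavior described in the text.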
The bar chart below illustrates the Twitter top 10 influencers using the final Influence Ratio.
Figure 5.39: Twitter top 10 influencers using Influence Ratio
Taking both ratios together yields a much more meaningful list of influencers. As can be seen, all the real figures are famous people who not only have a great many followers but whose tweets also drew plenty of users to this event. The same applies to the news agencies identified as influencers in the figure above. This confirms the suitability of our suggested metric for finding the most influential people in the Floating Piers event.
This metric is also helpful for finding the top influencers of each cluster obtained in the previous section (see Figure 5.40).
Figure 5.40: Twitter top 10 influencers per cluster using Influence Ratio
Result 4: Figure 5.41 shows the number of followers of the top 10 influencers in each cluster. Given the previous result (the top 10 influencers in each cluster), it is expected that the follower counts in the first two clusters (Travel Lovers and Art Lovers) have the highest values, since their influence ratios exceed the Tech Lovers' ratios. As shown, even the tenth influencer in the first two groups has more followers (about 200,000) than the first influencer in the Tech Lovers group (nearly 120,000).
Figure 5.41: Number of followers of top 10 influencers in each cluster
As the figure below shows, the top 10 influencers in the Travel Lovers and Art Lovers clusters have fewer followings than those in the Tech Lovers cluster. This confirms that the number of followings plays a less important role in the influence ratio than the number of followers.
Figure 5.42: Number of followings of top 10 influencers in each cluster
Figure 5.43 indicates the total number of tweets posted by the top 10 influencers in each cluster. According to these graphs, Travel Lovers' influencers posted more tweets than Art Lovers' influencers, who in turn posted more than the Tech Lovers' influencers. Since the influencers in each cluster can be considered representatives of the whole group, we can conclude that Travel Lovers had the greatest effect on the event in the network. Not surprisingly, this result can also be derived from the influence ratio outcomes.
Figure 5.43: Number of tweets of top 10 influencers in each cluster
To close this section, we display the total number of tweets that the influencers in each cluster marked as favorites. Evidently, the Tech Lovers liked more tweets than the other groups, although this result does not convey any meaning regarding their influence ratio.
Figure 5.44: Number of favorites of top 10 influencers in each cluster
Chapter 6
6 Conclusions
6.1 Summary

The main focus of this research work was to categorize the social media users involved in a specific event, along with analyzing the dynamics and different aspects of the event itself. Using two platforms (Twitter and Instagram) and considering most of the valuable properties of each tweet/post and user gives us confidence in the outcomes of this study, although this comes at the expense of more data extraction and data analysis.
In this thesis we employed different methods and approaches to provide the most adequate results in each step of the analysis, and the final outcomes of our research endorse this claim. We proposed an approach that helps event organizers decide what categories of users they are dealing with and how those users can be reached through different social media networks. A better understanding of the characteristics of the users who are most likely to be interested in similar events in the future will help organizers allocate resources to the right places and target the right groups of people in advertising.
The analyses used in this study, which can be applied to any other event, can be divided into two main categories: statistical analysis of the geometry of the event, and clustering of the users into two or more textually separated groups, together with a representative review of each group and a comparison of their properties.
We used data from "The Floating Piers" event as a case study in this research to show how the proposed approach works with real-life datasets. After characterizing the event, we categorized users based on their interests into three main groups and then described and compared the behavior and properties of each category.
6.2 Critical Discussion

This study concentrates on analyzing online users engaged in an event and finding a meaningful relationship between the members of each group, one that not only relates them internally but also introduces a clear boundary between clusters. Obviously, the final results may differ based on the quality and quantity of the initial data, but we showed that with a dataset of reasonable size we can examine the similarities and differences between users and obtain the best6 clusters. In addition to the carefully developed and evaluated approach, several other aspects need to be taken into consideration to obtain better final results. Language detection and translation of all textual properties are not easy tasks, but they guarantee the equal treatment of each entity and full coverage of the dataset (the common practice in most research is to ignore all entities with non-English textual features). Besides, adding a critical feature like gender, which is not provided by the original platforms (Twitter and Instagram), gives us a better understanding of the users in different clusters and elucidates the distinctive behaviors of disparate social media platforms.
6.3 Possible Future Works

Collecting as much data as possible about an event and the users involved in it can help event organizers schedule future events more properly and gain a more comprehensive perspective on logistics and advertising. The current study could be extended by predicting future users who might be interested in similar kinds of events, by analyzing their features and activities and comparing them with the users of the current event. Furthermore, considering other social media platforms such as Facebook, Google Plus, Flickr, Foursquare, etc. might result in a clearer and wider picture of the characteristics and geometry of the users and the event. Last but not least, deploying other techniques like semantic analysis, image processing and network analysis can also help improve the accuracy and coverage of the results and open a new window onto a better understanding of the event.
6 Based on cluster validity measurement
Bibliography
Ahmed, Wasim. Using Twitter as a data source: An overview of current social media
research tools. The London School of Economics and Political Science. July 2015.
http://blogs.lse.ac.uk/impactofsocialsciences/2015/07/10/social-media-research-tools-
overview/.
Alessandro Bozzon, Marco Brambilla, Stefano Ceri, Matteo Silvestri, Giuliano Vesci.
"Choosing the Right Crowd: Expert Finding in Social Networks." EDBT '13
Proceedings of the 16th International Conference on Extending Database
Technology. Genoa, 2013. 637-648.
Äyrämö, Sami. KDD Process Steps, Lecture 4. Finland: University of Jyväskylä, 2007.
Bettina Grun, Kurt Hornik. "topicmodels: An R Package for Fitting Topic Models." Journal
of Statistical Software, 2011.
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schutze. Introduction to Information
Retrieval. Cambridge University Press, 2008.
Christo and Jeanne-Claude. The Floating Piers. 2016. http://www.thefloatingpiers.com/the-project.
Diao, Qiming. "Event Identification and Analysis on Twitter." Singapore Management
University, 2015.
Eréndira Rendón, Itzel Abundez, Alejandra Arizmendi, Elvia M. Quiroz. "Internal versus External cluster validation indexes." International Journal of Computers and Communications 5, no. 1 (2011): 8.
Friedemann, Vanessa. "Clustering a Customer Base Using Twitter Data." 2015.
Gerasimos Razis, Ioannis Anagnostopoulos. "InfluenceTracker: Rating the impact of a
Twitter account." 2014.
Gonzalo Mariscal, Oscar Marbán, Covadonga Fernández. "A survey of data mining and
knowledge discovery process models and methodologies." The Knowledge
Engineering Review, 2010: 31.
Hees, Maarten van. "Web-based automatic translation: the Yandex.Translate API." Leiden
Institute of Advanced Computer Science (LIACS). Leiden, 2015.
Instagram. 2017. https://www.instagram.com/developer/.
José Luis Díaz, Manuel Herrera, Joaquín Izquierdo, Rafael Pérez-García. "The tasks of pre- and post-processing in Data Mining applied to a real world problem." International Congress on Environmental Modelling and Software. Ottawa, 2010.
Kabacoff, Robert I. 2017. http://www.statmethods.net/stats/regression.html.
Ken Kelley, Keke Lai, Po-Ju Wu. "Using R for data analysis: A best practice for research." In
Best Practices in Quantitative Methods, 38. 2008.
Kuldeep Singh, Harish Kumar Shakya, Bhaskar Biswas. "Clustering of people in social network based on textual similarity." ELSEVIER, 2016.
Little Bird. 2017. http://www.getlittlebird.com/.
Luna, Elizabeth de. Eventbrite. September 22, 2016. https://www.eventbrite.com/blog/flickr-
instagram-event-photos-ds00/.
Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis. "On Clustering Validation
Techniques." Journal of Intelligent Information Systems, 2001: 39.
NamSor Applied Onomastics. 2017. http://www.namsor.com/.
Narumol Prangnawarat, Ioana Hulpuş, Conor Hayes. "Event Analysis in Social Media Using
Clustering of Heterogeneous Information Networks." Proceedings of the Twenty-
Eighth International Florida Artificial Intelligence Research Society Conference.
2015.
Oded Maimon, Lior Rokach. Data Mining and Knowledge Discovery Handbook. Springer
New York Dordrecht Heidelberg London, 2010.
Ottosson, Therese. The representation of gender roles in the media. Trollhättan: University
West, 2012.
Department for Work and Pensions. "The Use of Social Media for Research and Analysis: A Feasibility Study." Government Social Research, no. 13 (2014): 62.
Pew Research Center. January 12, 2017. http://www.pewinternet.org/fact-sheet/.
Pienaar, Wikus. "Spelling Checker-based Language Identification for the Eleven Official
South African Languages." First Annual Symposium of the Pattern Recognition
Association of South Africa. Stellenbosch, 2010. 213–216.
Ponweiser, Martin. "Latent Dirichlet Allocation in R." Institute for Statistics and
Mathematics, 2012.
National Laboratory of Pattern Recognition. "Clustering Users in Twitter Based on Interests." 2015.
Representational state transfer. 2017.
https://en.wikipedia.org/wiki/Representational_state_transfer.
Rizwana Irfan, Christine K. King, Daniel Grages, Hongxiang Li. "A Survey on Text Mining
in Social Networks." The Knowledge Engineering Review, 2015: 15.
Rob Procter, Alex Voss, Ilia Lvov. "Audience research and social media data." Participations
12, no. 1 (2015): 24.
Rugved Deshpande, Ketan Vaze, Suratsingh Rathod, Tushar Jarhad. "Comparative Study of
Document Similarity Algorithms and Clustering Algorithms for Sentiment Analysis."
International Journal of Emerging Trends & Technology in Computer Science 3, no.
5 (2014): 4.
Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto, Niloy Ganguly, Krishna P. Gummadi.
"Cognos: Crowdsourcing Search for Topic Experts in Microblogs." Proceedings of
the 35th International ACM SIGIR Conference on Research and Development in
Information Retrieval. 2012.
Sfetcu, Nicolae. Web 2.0/Social media/Social networks. 2017.
Slava Kisilevich, Milos Krstajic, Daniel Keim, Natalia Andrienko, Gennady Andrienko.
"Event-based analysis of people’s activities and behavior using Flickr." 14th
International Conference on Information Visualisation. London, 2010.
Sorensen, Lene Tolstrup. "User managed trust in social networking: comparing Facebook, MySpace and LinkedIn." In Proceedings of 1st International Conference on Wireless
Communication, Vehicular Technology, Information Theory and Aerospace &
Electronic System Technology. 2009.
Sorokina, Olsy. Hootsuite. December 4, 2014. https://blog.hootsuite.com/photo-sharing-
platforms-for-business/.
Statista. 2016. https://www.statista.com/statistics/.
Statistical Data Analysis. n.d. https://stat.ethz.ch/R-manual/R-
devel/library/stats/html/hclust.html.
Toon Van Craenendonck, Hendrik Blockeel. "Using Internal Validity Measures to Compare
Clustering Algorithms." 2015.
TrackMaven. 2016. https://trackmaven.com/.
Twitter Developer Documentation. 2017. https://dev.twitter.com/docs.
Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. "From Data Mining to
Knowledge Discovery in Databases." AI Magazine, 1996: 18.
Villiers, Francois de. Constructing Topic-based Twitter Lists. 2013.
Wael H. Gomaa, Aly A. Fahmy. "A Survey of Text Similarity Approaches." International
Journal of Computer Applications 68, no. 13 (2013): 6.
Wikipedia. 2016. https://en.wikipedia.org/wiki/.
Yandex. Yandex Translate API documentation. 2017.
https://tech.yandex.com/translate/doc/intro/concepts/how-works-machine-translation-
docpage/.