POLITECNICO DI MILANO
School of Industrial and Information Engineering
Master of Science in Computer Engineering
Event-based User Profiling in Social
Media Using Data Mining
Approaches
Supervisor:
Dr. Marco Brambilla
Authors:
Behnam Rahdari (Student ID: 10480057)
Tahereh Arabghalizi (Student ID: 10481546)
Academic Year 2016/17
Abstract
Social Networks have undergone dramatic growth and influenced everyone's life in recent years. People share everything from daily life stories to the latest local and global news and events using social media. This rich and continuous flow of user-generated content has received significant attention from many organizations and researchers, and is increasingly becoming a primary source for social and marketing research, among other fields. Accordingly, a great number of works have been conducted to extract valuable information from different platforms. However, there are no specific studies that focus on categorizing social media users based on the texts they share about a specific event. Given that identifying online users with a common interest in a particular event can help event organizers attract more visitors to similar future events, this thesis concentrates on examining the similarity between such users from the perspective of their published textual content. In this work, different approaches have been proposed and various experiments have been carried out to support an explanation of this notion. We take a systematic approach to accomplish this objective by applying topic modeling techniques, using statistical and data mining algorithms, combined with information visualization.
Sommario
In recent years, Social Networks have seen exponential growth and have influenced the lives of us all. People share everything through social media, from daily life stories to the latest local and global news. The rich and continuous flow of user-generated content has received significant attention from various organizations and researchers, and is increasingly becoming a primary source for social and marketing research, to name just a few fields. Accordingly, a great number of works have been conducted to extract useful information from different platforms. Despite this, there are no specific studies that focus on categorizing users based on the event-related texts they share on social media. That said, the identification of users with a common interest in a specific event can help event organizers attract more visitors to similar events in the future. This thesis focuses on examining the similarities between such users based on the textual content they have published on social media. To this end, different methodologies are proposed, and several experiments have been carried out to support an explanation of this phenomenon. We proceeded with a systematic approach to achieve this objective, applying topic modeling techniques, using statistical and data mining algorithms, and finally combining these through information visualization techniques.
Contents
1 Introduction
1.1 Context
1.2 Problem Statement
1.3 Proposed Solution
1.4 Structure of the thesis
2 Background
2.1 Relevant Concepts
2.1.1 Knowledge Discovery and Data Mining
2.1.2 Text Similarity
2.1.3 Information Retrieval Models
2.1.4 Topic Modeling
2.1.5 Dimensionality Reduction
2.1.6 Clustering Process
2.2 Relevant Technologies
2.2.1 Language Identification and Translation
2.2.2 Gender Detection
2.2.3 Twitter API
2.2.4 Instagram API
2.2.5 Text Normalization Library
2.2.6 Cloud Computing
3 Related Work
3.1 Clustering of People in Social Network Based on Textual Similarity
3.2 Clustering a Customer Base Using Twitter Data
3.3 Clustering Users Based on Interests
3.4 Crowdsourcing Search for Topic Experts in Microblogs
3.5 Using Internal Validity Measures to Compare Clustering Algorithms
4 Event-based User Profiling in Social Media
4.1 Main Idea
4.1.1 Twitter Users
4.1.2 Instagram Users
4.2 Motivation
4.2.1 Why social media as data source?
4.2.2 Why Twitter and Instagram?
4.3 Approach
4.3.1 Data Extraction
4.3.2 Data Preprocessing
4.3.3 Data Loading
4.3.4 Data Analysis
5 Experiments and Discussion
5.1 The Floating Piers Datasets
5.2 Reports of Analysis
5.2.1 Content Specific Results
5.2.2 User Specific Results
6 Conclusions
6.1 Summary
6.2 Critical Discussion
6.3 Possible Future Works
Bibliography
List of Figures
Figure 2.1: Knowledge Discovery Process
Figure 2.2: Information Retrieval Models
Figure 2.3: Graphical model representation of LDA
Figure 2.4: Steps of clustering process
Figure 2.5: How machine translation works at Yandex
Figure 2.6: Twitter REST API Design
Figure 2.7: Cloud Computing
Figure 2.8: Windows Azure Platform
Figure 3.1: Graph after spectral k-means clustering for real dataset
Figure 3.2: Graph after spectral k-means clustering for dummy dataset
Figure 3.3: Percentage of followers for a set of chosen influencers
Figure 3.4: Silhouette coefficient as a function of number of clusters
Figure 3.5: Representative clusters in R²
Figure 3.6: Spectral clustering solutions selected by various measures
Figure 4.1: Number of social media users from 2010 to 2020 (in billions)
Figure 4.2: Percentage of adult users who use different social networks
Figure 4.3: Percentage of adult users who use at least one social media, by age
Figure 4.4: Comparison between four major photo-sharing networks
Figure 4.5: Architecture design
Figure 4.6: Identify the number of topics for LDA
Figure 4.7: Elbow method representation
Figure 4.8: k-nearest neighbor distances to determine eps in DBSCAN
Figure 5.1: The Floating Piers (Project for Lake Iseo, Italy)
Figure 5.2: Twitter total retweets vs. favorites
Figure 5.3: Most frequent words (a) and hashtags (b) in tweets
Figure 5.4: Number of tweets for top 10 locations using the field "Place"
Figure 5.5: Instagram total likes vs. comments
Figure 5.6: Most frequent words (a) and hashtags (b) in Instagram posts
Figure 5.7: Distribution of Instagram posts in the world
Figure 5.8: Density of Instagram posts – Italy
Figure 5.9: Density of Instagram posts – Brescia
Figure 5.10: Density of Instagram posts – Sulzano
Figure 5.11: Tweets vs Instagram posts timeline
Figure 5.12: Dendrogram representation of Twitter users
Figure 5.13: The percentage of user engagement in each cluster
Figure 5.14: 2D representation of cluster objects
Figure 5.15: Word-cloud representation of first cluster based on bio
Figure 5.16: Word-cloud representation of second cluster based on bio
Figure 5.17: Word-cloud representation of third cluster based on bio
Figure 5.18: Hashtag word-cloud for Travel Lovers, Art Lovers and Tech Lovers
Figure 5.19: Tweet text word-cloud for Travel Lovers, Art Lovers and Tech Lovers
Figure 5.20: List slug word-cloud for Travel Lovers, Art Lovers and Tech Lovers
Figure 5.21: Percentage of users whose number of followers lie in each category
Figure 5.22: Percentage of users whose number of followings lie in each category
Figure 5.23: Percentage of users whose number of favorites lie in each category
Figure 5.24: Percentage of users whose number of tweets lie in each category
Figure 5.25: Summary statistics of numbers of followers in each cluster
Figure 5.26: Summary statistics of numbers of followings in each cluster
Figure 5.27: Summary statistics of numbers of favorites in each cluster
Figure 5.28: Summary statistics of numbers of tweets in each cluster
Figure 5.29: Language timeline per cluster
Figure 5.30: Gender timeline per cluster
Figure 5.31: Number of Users - Tweets timeline per cluster
Figure 5.32: Tweet – User ratio timeline per cluster
Figure 5.33: Twitter top 20 active users
Figure 5.34: Instagram top 20 active users
Figure 5.35: Twitter top 20 active contributors
Figure 5.36: Instagram top 20 active contributors
Figure 5.37: Twitter top 10 influencers using FtF ratio
Figure 5.38: Twitter top 10 influencers using UTW ratio
Figure 5.39: Twitter top 10 influencers using Influence Ratio
Figure 5.40: Twitter top 10 influencers per cluster using Influence Ratio
Figure 5.41: Number of followers of top 10 influencers in each cluster
Figure 5.42: Number of followings of top 10 influencers in each cluster
Figure 5.43: Number of tweets of top 10 influencers in each cluster
Figure 5.44: Number of favorites of top 10 influencers in each cluster
List of Tables
Table 3.1: The most common topics of expertise as identified from Lists
Table 3.2: Top 5 results by Cognos and Twitter WTF for query "music"
Table 3.3: Average relative SI, CH and DB score over data set
Table 4.1: Twitter extracted features
Table 4.2: Instagram extracted features
Table 4.3: Topic probabilities by user
Table 4.4: Top terms of each extracted topic by LDA
Table 4.5: Formulas for Silhouette, Dunn and Entropy indices
Table 5.1: Evaluation results of cluster validation indices
Chapter 1
1 Introduction
1.1 Context
Social Networks have undergone dramatic growth in recent years. Such networks
provide a powerful reflection of the structure and dynamics of the society of the 21st
century and the interaction of the Internet generation with both technology and other
people (Sfetcu 2017). Social media has a great influence on our daily lives. People
share their opinions, stories, news, and broadcast events using social media.
Monitoring and analyzing this rich and continuous flow of user-generated content can
yield unprecedentedly valuable information, enabling users and organizations to
acquire actionable knowledge. Due to the immediacy and rapidity of social media,
news events are often reported and spread on Twitter, Instagram or Facebook ahead of
traditional news media.
With the rapid growth of social media, Twitter has become one of the most widely
adopted platforms for people to post short and instant messages. Because of such wide
adoption of Twitter, events like breaking news and release of popular videos can
easily capture people’s attention and spread rapidly on Twitter. Therefore, the
popularity and importance of an event can be approximately gauged by the volume of
tweets covering the event. Moreover, the relevant tweets also reflect the public’s
opinions and reactions to events. It is therefore very important to identify and analyze
the events on Twitter (Diao 2015).
Another very popular social network platform is Instagram: 300 million people use the app to share photos every day. Users can also insert a caption for a photo they share, mention other users, and use hashtags. As on Twitter, users can follow the accounts they are interested in and share their posts publicly or privately according to their preference. Considering this, Instagram is one of the best channels through which people can share their experiences (especially those about events) via pictures as well as textual content such as hashtags. Hashtags have become a uniform
way to categorize content on many social media platforms, especially Instagram.
Hashtags allow Instagrammers to discover content to view and accounts to follow.
Research from Track Maven found that posts with over 11 hashtags tend to get more
engagement.
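Since hashtags are central to the analyses that follow, it may help to see how they can be pulled out of a post caption. The sketch below is only an illustration (the caption text and tag names are invented for the example):

```python
import re

def extract_hashtags(text):
    # A hashtag is '#' followed by word characters (letters, digits, underscores).
    return [tag.lower() for tag in re.findall(r"#(\w+)", text)]

caption = "Walking on water at #TheFloatingPiers #LakeIseo #art"
print(extract_hashtags(caption))  # ['thefloatingpiers', 'lakeiseo', 'art']
```

Lower-casing the tags makes variants such as #Art and #ART count as the same category, which matters when hashtags are later used as textual features.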
1.2 Problem Statement
In social networking websites or applications, people generally use unstructured or semi-structured language for communication. In everyday conversation, people do not care about spelling or the accurate grammatical construction of a sentence, which may lead to different types of ambiguity: lexical, syntactic, and semantic. Therefore, extracting logical patterns with accurate information from such unstructured text is a critical task. Text mining, a knowledge discovery technique that provides computational intelligence, can be a solution to this problem (Rizwana Irfan 2015). Social networks such as Twitter are rich in text, enabling users to create various textual content in the form of comments, posts and messages. Applying text mining techniques to social networking websites can reveal significant results related to person-to-person interaction behaviors. Moreover, text mining approaches such as clustering can be used for finding the general opinion about any specific subject, identifying human thinking patterns, and identifying groups in large-scale systems.
Despite the large amount of research that has been conducted on extracting information from particular social networks, there are no specific studies that address differently structured social networks to explore the profiles and activities of users based on the texts they share about an event. In this thesis, it is proposed that there may be some similarities in terms of interest and activity between social media users who engage in different actions, such as posting, liking and replying to a text or media item about an event. This may give us ideas for improving the current event and also for identifying potential users with the same interests for similar future events.
1.3 Proposed Solution
The first step toward the objective of this study is to decide which social media platforms should be considered. Since the availability of public posts is the main criterion for our choice among the many platforms, Twitter and Instagram, which provide a great number of publicly available posts, are used for the following analysis.
The second step is to collect the required data, including tweets, Instagram posts and the users involved, during a specific time interval. Then textual features, namely biographies, hashtags, tweet/post texts and the Twitter lists of which a user is a member, are cleaned, preprocessed, translated to English and stored in CSV files.
After the transformation phase, we define some steps to perform the analysis at different levels. The first phase of the analysis is to explore the main topics in the collected data using topic modeling. Then we perform different analyses at other levels; for example, three clustering algorithms, namely K-means, Hierarchical and DBSCAN, are applied separately to the outputs of the topic modeling process. Having all the outcomes of the cluster analysis, we evaluate the results employing cluster validity measures such as Silhouette, Dunn and Entropy. Based on the evaluation outcome, we perform further analyses and investigations to probe the categories of the users and their activities during the event. Finally, we model the outcomes of all levels of analysis in order to obtain a proper visualization of the results.
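The clustering stage of this pipeline relies on library implementations; the following self-contained sketch only illustrates the idea behind K-means on toy two-topic probability vectors. The data values and the deterministic initialization are simplifications invented for the example, not the thesis's actual configuration:

```python
import math

def kmeans(points, k, iters=20):
    # Deterministic initialization: take the first k points as centroids
    # (real implementations use smarter seeding, e.g. k-means++).
    centroids = [list(p) for p in points[:k]]
    clusters = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, clusters

# Toy per-user topic probabilities over two topics (as a topic model might output).
users = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15),
         (0.1, 0.9), (0.2, 0.8), (0.15, 0.85)]
centroids, clusters = kmeans(users, k=2)
```

In the actual analysis, a step like this would be followed by validity measures such as the Silhouette index to judge the quality of the resulting partition.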
1.4 Structure of the thesis
The thesis is organized as follows:
A general overview of the relevant concepts and technologies used in this thesis project is given in chapter 2.
Chapter 3 is dedicated to the scientific works that address similar issues; we discuss the associated publications and position our own strategy with respect to them.
In chapter 4, we first describe the main idea of this project and the motivations behind it. All the details of our proposed approach are then explained in the following sections of that chapter.
Chapter 5 is devoted to describing our dataset and the outcomes of the analysis that was performed. It is divided into sections corresponding to each level of analysis conducted on the data from the social media we used.
Finally, in chapter 6 we review the study with a short summary of what has been done and a discussion of our results. In addition, there are some suggestions for future work.
Chapter 2
2 Background
2.1 Relevant Concepts
In this section we discuss the concepts that are relevant to our work.
2.1.1 Knowledge Discovery and Data Mining
Knowledge Discovery in Databases (KDD) is the process of identifying valid, novel, useful, and understandable patterns in large datasets. Data Mining (DM) is the mathematical core of the KDD process, involving the inference algorithms that explore the data, develop mathematical models and discover significant patterns (implicit or explicit), which are the essence of useful knowledge. The knowledge discovery process (Figure 2.1) is iterative and interactive, consisting of the steps below. Note that the process is iterative at each step, meaning that moving back to adjust previous steps may be required (Oded Maimon 2010).
Figure 2.1: Knowledge Discovery Process
2.1.1.1 Data Selection
This phase includes finding out what data is available, obtaining additional necessary data, and then integrating all the data for the knowledge discovery into one dataset, including the attributes that will be considered for the process. This step is very important because Data Mining learns and discovers from the available data, which is the evidence base for constructing the models. If some important attributes are missing, the entire study may fail. In this respect, the more attributes that are considered, the better. On the other hand, collecting, organizing and operating complex data repositories is expensive, so there is a tradeoff against the opportunity to best understand the phenomena. This tradeoff is one aspect where the interactive and iterative nature of KDD comes into play: the process starts with the best available dataset and later expands, observing the effect in terms of knowledge discovery and modeling.
2.1.1.2 Data Pre-processing
The operations performed in preprocessing can be reduced to two main families of techniques: Detection Techniques (DT), which detect imperfections in data sets, and Transforming Techniques (TT), which are oriented toward obtaining more manageable data sets. DT includes outlier detection, missing data detection, influential observation detection, normality assessment, linearity assessment, and independence assessment. On the other hand, TT includes outlier treatment, missing data imputation, dimensionality reduction or data projection techniques, techniques for deriving new attributes, filtering and resampling. Additionally, the statistical technique of data cleaning and visualization techniques also play an important role in the pre-processing of data (José Luis Díaz 2010).
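As a small illustration of one technique from each family, the sketch below pairs a detection technique (z-score outlier flagging) with a transforming technique (mean imputation of missing values). The numbers are invented toy data, not values from the thesis dataset:

```python
def mean_impute(values):
    # Transforming technique: replace missing entries (None) with the
    # mean of the observed values.
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def zscore_outliers(values, threshold=1.5):
    # Detection technique: flag values more than `threshold` standard
    # deviations away from the mean.
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) > threshold * std]

print(mean_impute([120, 130, None, 110]))  # [120, 130, 120.0, 110]
print(zscore_outliers([1, 1, 1, 1, 100]))  # [100]
```

In practice the threshold, and whether a flagged value is removed or treated, are analysis-specific choices.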
2.1.1.3 Data Transformation
In this step, better data for the data mining phase are generated and prepared. Methods here include dimension reduction (such as feature selection and extraction, and record sampling) and attribute transformation (such as discretization of numerical attributes and functional transformation). This step is often crucial for the success of the entire KDD project, but it is usually very project-specific. However, even if we do not use the right transformation at the beginning, we may obtain a surprising effect that hints at the transformation needed (in the next iteration). Thus the KDD process reflects upon itself and leads to an understanding of the transformation needed. The main techniques of data transformation include (Äyrämö 2007):
Smoothing (binning, clustering, regression, etc.)
Aggregation (use of summary operations, e.g. averaging, on data)
Generalization (primitive data objects can be replaced by higher-level concepts)
Normalization (min-max scaling, z-score)
Feature construction from the existing attributes (PCA¹, MDS²)
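Two of the transformations listed above can be sketched in a few lines; this is a simplified, self-contained illustration on invented numbers, not the exact implementation used later in the thesis:

```python
def min_max_scale(values):
    # Normalization: map values linearly onto the [0, 1] interval.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, n_bins):
    # Smoothing by binning: discretize into n_bins equal-width intervals,
    # returning the 0-indexed bin of each value.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

print(min_max_scale([10, 20, 30]))        # [0.0, 0.5, 1.0]
print(equal_width_bins([1, 2, 3, 4], 2))  # [0, 0, 1, 1]
```

Min-max scaling is useful before distance-based algorithms such as K-means, so that attributes on large scales (e.g. follower counts) do not dominate the distance computation.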
2.1.1.4 Data Mining
The two high-level primary goals of data mining in practice tend to be prediction and description. Prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest, while description focuses on finding human-interpretable patterns describing the data. The goals of prediction and description can be achieved using a variety of data-mining methods, including (Usama Fayyad 1996):
Classification is learning a function that maps (classifies) a data item into one of several predefined classes.
Regression is learning a function that maps a data item to a real-valued prediction variable.
Clustering is a common descriptive task where one seeks to identify a finite set of categories or clusters to describe the data. The categories can be mutually exclusive and exhaustive, or consist of a richer representation such as hierarchical or overlapping categories. More details about clustering algorithms and their validation techniques are given in section 2.1.6.
Summarization involves methods for finding a compact description for a subset of data. A simple example would be tabulating the mean and standard deviation for all fields. Summarization techniques are often applied to interactive exploratory data analysis and automated report generation.
Dependency modeling consists of finding a model that describes significant dependencies between variables. Dependency models exist at two levels: (1) the structural level of the model specifies (often in graphic form) which variables are locally dependent on each other, and (2) the quantitative level specifies the strengths of the dependencies using some numeric scale.
Change and deviation detection focuses on discovering the most significant changes in the data from previously measured or normative values.
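For instance, the quantitative level of a dependency model is often expressed with a correlation coefficient. The following sketch computes the Pearson correlation on invented toy data (the variable pairing is only an example, not a result from the thesis):

```python
def pearson(x, y):
    # Pearson correlation: strength of the linear dependency between
    # two variables, on a scale from -1 to 1.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Toy example: number of tweets vs. number of favorites per user.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # ≈ 1.0
```

A value near 1 or -1 indicates a strong linear dependency; a value near 0 indicates none.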
2.1.1.5 Interpretation and Evaluation of Patterns
This phase involves the evaluation and possibly the interpretation of the patterns in order to decide what qualifies as knowledge (Gonzalo Mariscal 2010). This step focuses on the comprehensibility and usefulness of the induced model.
¹ Principal Component Analysis
² Multi-Dimensional Scaling
2.1.1.6 Knowledge Representation
This is the last step of the knowledge discovery process, where visualization and knowledge representation techniques, such as logical formulas, decision trees and neural networks, are used to present the mined knowledge to users.
2.1.2 Text Similarity
Text similarity measures play an important role in text-related research and applications such as topic detection, information retrieval, document clustering, text classification, etc. Finding the similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities.
Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or if one is a type of the other (Wael H. Gomaa 2013).
Lexical similarity is introduced through string-based similarity measures which
operate on string sequences and character composition. Some of these measures are
mentioned as follows:
Manhattan Distance computes the distance that would be traveled to get from
one data point to the other if a grid-like path is followed. The Block distance
between two items is the sum of the differences of their corresponding
components (Wael H. Gomaa 2013).
Cosine Similarity is a measure of similarity between two vectors of an inner
product space that measures the cosine of the angle between them (Wael H.
Gomaa 2013).
Euclidean distance is the ordinary straight-line distance between two points. Euclidean distance is widely used in clustering problems, including text clustering. It satisfies the properties of a true metric and is the default distance measure used with the K-means algorithm (Rugved Deshpande 2014).
Jaccard similarity measures similarity as the intersection divided by the union
of the objects. For text document, it compares the sum weight of shared terms
to the sum weight of terms that are present in either of the two documents but
are not the shared terms (Rugved Deshpande 2014).
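As a concrete sketch, the four measures above can be implemented over bag-of-words count vectors in a few lines of Python (the helper names and toy sentences are ours, not from any specific library):

```python
import math
from collections import Counter

def manhattan(a, b):
    """Block (Manhattan) distance: sum of absolute component differences."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)

def euclidean(a, b):
    """Ordinary straight-line distance between the two count vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))

def cosine(a, b):
    """Cosine of the angle between the two count vectors."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    """Size of the intersection of the term sets divided by the union."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

d1 = Counter("the cat sat on the mat".split())
d2 = Counter("the cat lay on the rug".split())
```

On these toy sentences, cosine and Jaccard return similarities in [0, 1], while Manhattan and Euclidean are distances (0 means identical).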
Semantic similarity is introduced through Corpus-Based and Knowledge-Based
algorithms. Corpus-Based similarity is a semantic similarity measure that determines
the similarity between words according to information gained from large corpora. The
most famous corpus-based similarity measures are:
Hyperspace Analogue to Language (HAL) considers context only as the words
that immediately surround a given word. HAL computes an NxN matrix,
where N is the number of words in its lexicon, using a 10-word reading frame
that moves incrementally through a corpus of text (Wikipedia 2016).
Latent Semantic Analysis (LSA) is the most popular technique of Corpus-
Based similarity. LSA assumes that words that are close in meaning will occur
in similar pieces of text. A matrix containing word counts per paragraph (rows
represent unique words and columns represent each paragraph) is constructed
from a large piece of text, and a mathematical technique called singular
value decomposition (SVD) is used to reduce the number of columns while
preserving the similarity structure among rows. Words are then compared by
taking the cosine of the angle between the two vectors formed by any two
rows (Wael H. Gomaa 2013).
Explicit Semantic Analysis (ESA) is a vectorial representation of text that uses
a document corpus as a knowledge base. Specifically, in ESA, a word is
represented as a column vector in the tf–idf matrix of the text corpus and a
document is represented as the centroid of the vectors representing its words.
Typically, it represents the meaning of texts in a high-dimensional space of
concepts derived from Wikipedia (Wikipedia 2016).
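The LSA recipe described above can be sketched with NumPy in a few lines (the tiny word-by-paragraph count matrix is invented for illustration):

```python
import numpy as np

# Rows = words, columns = "paragraphs" (toy counts, invented for illustration)
words = ["car", "truck", "flower", "tulip"]
X = np.array([[2, 3, 0],   # car
              [1, 2, 0],   # truck
              [0, 0, 3],   # flower
              [0, 1, 2]])  # tulip

# SVD, then keep only k singular values: fewer columns, similarity preserved
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
W = U[:, :k] * s[:k]      # each row is a word in the reduced space

def cos(u, v):
    """Cosine of the angle between two row vectors of the reduced matrix."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Words used in similar contexts (car/truck) get a higher cosine than
# words from different contexts (car/flower)
sim_car_truck = cos(W[0], W[1])
sim_car_flower = cos(W[0], W[2])
```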
Knowledge-Based similarity is a semantic similarity measure based on
identifying the degree of similarity between words using information derived from
semantic networks. WordNet is the most popular semantic network in the area of
measuring Knowledge-Based similarity between words. WordNet is a large
lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into
sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are
interlinked by means of conceptual-semantic and lexical relations (Wael H. Gomaa
2013).
2.1.3 Information Retrieval Models

Information retrieval (IR) is finding material (usually documents) of an unstructured
nature (usually text) that satisfies an information need from within large collections
(usually stored on computers) (Christopher D. Manning 2008). For effectively
retrieving relevant documents by IR strategies, the documents are typically
transformed into a suitable representation. Each retrieval strategy incorporates a
specific model for its document representation purposes. Figure 2.2 illustrates the
relationship of some common models. Three of the most well-known models are
explained in more detail below (Wikipedia 2016).
Figure 2.2: Information Retrieval Models
Standard Boolean model: The Boolean model is a simple retrieval model
based on Boolean algebra, where an index term’s significance is represented by
binary weights wi,j ∈ {0,1}. Queries are also defined as Boolean expressions
over index terms. The similarity between document dj and query q is binary:
sim(dj, q) = 1 if dj satisfies the Boolean expression q, and 0 otherwise.
Vector space model: in this model, documents and queries are represented as
vectors: dj = (w1,j, w2,j, …, wt,j), q = (w1,q, w2,q, …, wt,q).
Each dimension corresponds to a separate term. If a term occurs in the
document, its value in the vector is non-zero. In the classic vector space
model the term-specific weights in the document vectors are products of local
and global parameters. The model is known as the term frequency-inverse
document frequency (tf-idf) model, where the weight wt,d is defined as
wt,d = tft,d · idft, where tft,d is the frequency of term t in document d and idft is its
inverse document frequency. Using the cosine of the angle between the two vectors,
the similarity between document dj and query q can be calculated as
sim(dj, q) = (dj · q) / (|dj| |q|).
Probabilistic model: this model estimates the probability that a document dj
is relevant to a query q, assuming that this probability of relevance depends
on the query and document representations.
Furthermore, it assumes that there is a portion of all documents that is
preferred by the user as the answer set for query q. Such an ideal answer set
is called R and should maximize the overall probability of relevance to that
user. The prediction is that documents in this set R are relevant to the query,
while documents not present in the set are non-relevant.
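The tf-idf weighting and cosine ranking of the vector space model can be sketched as follows (toy documents; the natural-log idf used here is one common variant, and real systems differ in the details):

```python
import math
from collections import Counter

docs = ["apple banana apple", "banana cherry", "cherry apple banana"]
query = "apple cherry"

N = len(docs)
tf = [Counter(d.split()) for d in docs]                # term frequencies per doc
df = Counter(t for counts in tf for t in counts)       # document frequency
idf = {t: math.log(N / df[t]) for t in df}             # inverse document frequency

def vec(counts):
    """tf-idf vector: w_{t,d} = tf_{t,d} * idf_t."""
    return {t: counts[t] * idf.get(t, 0.0) for t in counts}

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

qv = vec(Counter(query.split()))
scores = [cosine(vec(counts), qv) for counts in tf]    # rank docs against query
```

The third document contains both query terms, so it receives the highest cosine score; "banana", which occurs in every document, gets idf 0 and thus no weight.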
2.1.4 Topic Modeling

Topic models are probabilistic latent variable models of documents that exploit the
correlations among the words and latent semantic themes. A document is seen as a
mixture of topics. This intuitive explanation of how documents can be generated is
modeled as a stochastic process which is then “reversed” by machine learning
techniques that return estimates of the latent variables. With these estimates it is
possible to perform information retrieval or text mining tasks on a document corpus
(Ponweiser 2012).
The most prominent topic model is latent Dirichlet allocation (LDA) which is a three-
level hierarchical Bayesian model, in which each item of a collection is modeled as a
finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an
infinite mixture over an underlying set of topic probabilities. In the context of text
modeling, the topic probabilities provide an explicit representation of a document
(Recognition 2015). The graphical model of LDA is shown in Figure 2.3. The boxes
are “plates” representing replicates. The outer plate represents documents, while the
inner plate represents the repeated choice of topics and words within a document:
Figure 2.3: Graphical model representation of LDA
The LDA model assumes the following generative process for a document w = (w1, . .
. , wN ) of a corpus D containing N words from a vocabulary consisting of V different
terms, wi ∈ {1,… , V } for all i = 1, . . . , N. The generative model consists of the
following three steps (Bettina Grun 2011):
Step 1: The term distribution β is determined for each topic by β ∼ Dirichlet(δ).
Step 2: The proportions θ of the topic distribution for the document w are determined
by θ ∼ Dirichlet(α).
Step 3: For each of the N words wi:
(a) Choose a topic zi ∼ Multinomial(θ).
(b) Choose a word wi from a multinomial probability distribution conditioned
on the topic zi : p(wi |zi , β).
β is the term distribution of topics and contains the probability of a word
occurring in a given topic.
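The three generative steps can be simulated directly with NumPy (the toy sizes and hyperparameter values are illustrative; this samples from the model rather than fitting it to data):

```python
import numpy as np

rng = np.random.default_rng(7)
V, K, N = 8, 3, 50          # vocabulary size, number of topics, words per document
delta, alpha = 0.5, 1.0     # Dirichlet hyperparameters (illustrative values)

# Step 1: per-topic term distributions  beta_k ~ Dirichlet(delta)
beta = rng.dirichlet(np.full(V, delta), size=K)        # shape (K, V), rows sum to 1

# Step 2: topic proportions for this document  theta ~ Dirichlet(alpha)
theta = rng.dirichlet(np.full(K, alpha))               # shape (K,), sums to 1

# Step 3: for each word, draw a topic z_i ~ Multinomial(theta),
# then a word w_i from that topic's term distribution p(w_i | z_i, beta)
z = rng.choice(K, size=N, p=theta)
w = np.array([rng.choice(V, p=beta[zi]) for zi in z])
```

Inference in LDA "reverses" exactly this process: given only the words w, it estimates beta and theta.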
2.1.5 Dimensionality Reduction

Dimensionality reduction or dimension reduction is the process of reducing the
number of random variables under consideration, via obtaining a set of principal
variables. It can be divided into feature selection and feature extraction.
Feature selection is the process of selecting a subset of relevant features (variables,
predictors) for use in model construction. Feature selection techniques are used for
four reasons:
simplification of models to make them easier to interpret by researchers/users
shorter training times
avoidance of the curse of dimensionality
enhanced generalization by reducing overfitting (formally, reduction of variance)
The central premise when using a feature selection technique is that the data contains
many features that are either redundant or irrelevant, and can thus be removed without
incurring much loss of information.
Feature extraction transforms the data in the high-dimensional space to a space of
fewer dimensions. The data transformation may be linear, as in Principal Component
Analysis (PCA), but many nonlinear dimensionality reduction techniques also exist.
The main linear technique for dimensionality reduction, Principal Component
Analysis (PCA), performs a linear mapping of the data to a lower-dimensional space
in such a way that the variance of the data in the low-dimensional representation is
maximized. In other words, it uses an orthogonal transformation to convert a set of
observations of possibly correlated variables into a set of values of linearly
uncorrelated variables called principal components. The number of principal
components is at most the smaller of the number of original variables and the
number of observations. This transformation is defined in such a way that the first
principal component has the largest possible variance (that is, accounts for as much of
the variability in the data as possible), and each succeeding component in turn has the
highest variance possible under the constraint that it is orthogonal to the preceding
components. The resulting vectors are an uncorrelated orthogonal basis set. PCA is
sensitive to the relative scaling of the original variables (Wikipedia 2016).
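A minimal PCA sketch in NumPy following the description above — center the data, eigendecompose the covariance matrix, and project onto the leading components (the toy data is invented):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 observations of 3 correlated variables (rank-2 by construction)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.5, 0.1],
                                          [0.0, 1.0, 0.3]])

Xc = X - X.mean(axis=0)                 # centering matters: PCA is scale-sensitive
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # sort components by variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
scores = Xc @ eigvecs[:, :k]            # projection onto the first k components
```

By construction the projected components are linearly uncorrelated, and the first component carries the largest variance.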
2.1.6 Clustering Process

As mentioned before, clustering is one of the most useful tasks in the data mining process
for discovering groups and identifying interesting distributions and patterns in the
underlying data. The main concern in clustering process is to reveal the organization
of patterns into “sensible” groups, which allow us to discover similarities and
differences, as well as to derive useful conclusions about them. The basic steps to
develop clustering process are presented in Figure 2.4 and can be summarized as
follows (Maria Halkidi 2001):
Figure 2.4: Steps of clustering process
Feature selection: The goal is to select properly the features on which
clustering is to be performed so as to encode as much information as possible
concerning the task of our interest. Thus, preprocessing of data may be
necessary prior to their utilization in clustering task.
Clustering algorithm: This step refers to the choice of an algorithm which
results in the definition of a good clustering scheme for a data set. Clustering
algorithms can be broadly classified into the following types:
o Partitional clustering attempts to directly decompose the data set into a
set of disjoint clusters. In this category, K-Means is a commonly used
algorithm.
o Hierarchical clustering proceeds successively by either merging
smaller clusters into larger ones, or by splitting larger clusters. The
result of the algorithm is a tree of clusters, called dendrogram, which
shows how the clusters are related. By cutting the dendrogram at a
desired level, a clustering of the data items into disjoint groups is
obtained.
o Density-based clustering: The key idea of this type of clustering is to
group neighboring objects of a data set into clusters based on density
conditions. A widely known algorithm of this category is DBSCAN.
o Grid-based clustering is mainly proposed for spatial data mining. Their
main characteristic is that they quantize the space into a finite number
of cells and then they do all operations on the quantized space.
o Fuzzy clustering uses fuzzy techniques to cluster data and allows an
object to belong to more than one cluster. This type of algorithm leads
to clustering schemes that are compatible with everyday life experience,
as they handle the uncertainty of real data. The most important fuzzy
clustering algorithm is Fuzzy C-Means.
o Crisp clustering considers non-overlapping partitions, meaning that a
data point either belongs to a cluster or not. Most clustering
algorithms produce crisp clusters and can thus be categorized as crisp
clustering.
o Kohonen net clustering, which is based on the concepts of neural
networks.
Validation of the results: The procedure of evaluating the results of a
clustering algorithm is known under the term cluster validity. In general terms,
there are three approaches to investigate cluster validity:
o External Criteria evaluate the clustering result with respect to an
externally provided structure, such as known class labels. Rand index,
Jaccard coefficient, Entropy and Purity can be mentioned as external
measures, to name a few.
o Internal Criteria evaluate the result with respect to information
intrinsic to the data alone. Silhouette index, Davies-Bouldin index
(DB), Calinski-Harabasz index (CH) and Dunn index are the most
famous measures in this category (Eréndira Rendón 2011).
o Relative Criteria evaluate quality of a partition by comparing it to
other clustering schemes, resulting by the same algorithm but with
different parameter values.
Interpretation of the results: In many cases, the experts in the application area
have to integrate the clustering results with other experimental evidence and
analysis in order to draw the right conclusion.
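To make the pipeline concrete, the following toy run clusters well-separated points with a minimal k-means and validates the result with a hand-rolled silhouette index (illustrative only; the initialization is naive, and real studies would use library implementations):

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Minimal Lloyd's k-means; evenly spaced samples as a naive
    deterministic init (k-means++ is better in practice)."""
    centers = X[:: max(1, len(X) // k)][:k].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):                 # keep old center if cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def silhouette(X, labels):
    """Mean silhouette index: s_i = (b_i - a_i) / max(a_i, b_i), where a_i is
    the mean intra-cluster distance and b_i the mean distance to the
    nearest other cluster."""
    D = np.linalg.norm(X[:, None] - X[None], axis=2)
    n, scores = len(X), []
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, own].mean()
        b = min(D[i, labels == c].mean() for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated toy groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
labels = kmeans(X, 2)
score = silhouette(X, labels)   # close to 1 for a clean partition
```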
2.2 Relevant Technologies

In this section all the relevant technologies used in this thesis study are discussed.
2.2.1 Language Identification and Translation

Language identification (LID) refers to the process of determining the natural
language in which a given text is written (Pienaar 2010). In this thesis, LID is used as
part of the preprocessing phase, which aims to make the textual content uniform by
first detecting the language and then translating it into a single language (en-us).
The Yandex.Translate Application Programming Interface (API) is an easy-to-use
automatic translation service provided by the Russian Internet company Yandex. As a
statistical machine translation system, it is based on statistics derived from web
sources (Hees 2015). Yandex.Translate offers synchronized translation for 91
languages, predictive typing, a dictionary with transcription, pronunciation and usage
examples, and many other features.
Figure 2.5: How machine translation works at Yandex
Yandex.Translate has an automated dictionary that sets it apart from the limited
number of similar existing services. The technology, developed by a Yandex team of
linguists and programmers, combines current statistical machine translation
approaches with traditional linguistic tools. The translation model constructs a graph
containing all the possible ways to translate a sentence. The language model selects
the best translation in terms of the optimal word combinations in natural language. The
translation model learns from extensive bilingual parallel corpora. The language
model is built from large single-language corpora, and contains all the language's
most frequent n-word combinations. N may be from 1 to 7 (usually 5).
Yandex uses the BLEU metric to automatically evaluate the quality of machine
translation; it determines the percentage of n-grams (n ≤ 4) that match between the
machine translation and the standard translation of a sentence. Translations are
usually manually rated for two factors, Adequacy and Fluency, using a 5-point scale
(Yandex 2017).
2.2.2 Gender Detection

There are two concepts of gender: the biological gender and the socially constructed
gender. A text written by Gayle Rubin in 1975 discusses gender as a sex/gender
system, in which the social gender is described as enhancing the idea of a biological
gender, which in itself creates gender (Ottosson 2012). Determining the gender of users
by analyzing their behavior in social media is very popular these days, but since this is
a complex and time-consuming task, another method is used here: detecting a
person's gender from their full name.
Gender detection based only on the full name can be imprecise in some cases
due to problems such as the cultural origin of names, the coverage of the names
database, support for various languages, etc. Using the NamSor API to determine the
gender of users was a suitable solution to cope with these problems.
NamSor software classifies names accurately by gender, country of origin, or
ethnicity. The gender API comes with useful features (NamSor Applied Onomastics
2017), which are described in the following.
Accuracy: NamSor recognizes the likely cultural origin and gender at the
same time, for higher precision and recall.
Global coverage: NamSor covers all languages, alphabets, countries and regions.
They constantly improve the precision, working with linguists, anthropologists
and historians.
Ease of use: names can be parsed and classified online, using a simple web
application that processes up to 100,000 names within a few minutes. Power
users, statisticians and data scientists can benefit from the NamSor open source
extension for RapidMiner, a leading predictive analytics tool.
Integration: NamSor API can be securely integrated with a range of
applications, from geographical information systems (such as ESRI), to CRM
and campaign management.
2.2.3 Twitter API

The Twitter Platform provides developers with a variety of tools and APIs to
connect websites or applications with the worldwide conversation happening on
Twitter (Twitter Developer Documentation 2017). Among all these possibilities, the
REST APIs are widely used for extracting data from Twitter for processing and analysis.
The REST APIs provide programmatic access to read and write Twitter data: author
a new Tweet, read author profile and follower data, and more. The REST API
identifies Twitter applications and users using OAuth; responses are available in
JSON (Twitter Developer Documentation 2017).
Figure 2.6: Twitter Rest API Design
Representational state transfer (REST) or RESTful Web services are one way of
providing interoperability between computer systems on the Internet. REST-
compliant Web services allow requesting systems to access and manipulate textual
representations of Web resources using a uniform and predefined set of stateless
operations. In a RESTful Web service, requests made to a resource's URI will elicit a
response that may be in XML, HTML, JSON or some other defined format. The
response may confirm that some alteration has been made to the stored resource, and
it may provide hypertext links to other related resources or collections of resources.
By making use of a stateless protocol and standard operations, REST systems aim for
fast performance, reliability, and the ability to grow, by re-using components that can
be managed and updated without affecting the system as a whole, even while it is
running (Wikipedia 2016).
2.2.4 Instagram API

The Instagram API Platform can be used to build non-automated, authentic, high-
quality apps and services that help individuals share their own content with third-party
apps, and help brands and advertisers understand and manage their audience and media
rights. The Instagram API can be accessed from any platform using its REST endpoints
(Instagram 2017).
2.2.5 Text Normalization Library

The multilingual text normalization library is designed mainly for short texts (tweets,
Facebook posts, etc.). It removes special characters, emojis, common words, mentions
and URLs, stems the remaining words, and finally returns the clean list of important
words in the sentence. It can be used for normalizing data collected from Twitter or
other social media in order to analyze the text with any data mining tool.
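The described behavior can be approximated by a small, self-contained normalizer; the regular expressions, stop-word list and naive suffix stemming below are illustrative assumptions, not the actual library code:

```python
import re

STOPWORDS = frozenset({"the", "a", "an", "and", "is", "to", "this"})  # toy list

def normalize(text):
    """Toy short-text normalizer: lowercase, strip URLs and mentions,
    drop non-letter characters (including emoji), remove stop words,
    and apply a crude plural-stripping "stemmer"."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove mentions
    tokens = re.findall(r"[a-z]+", text)        # drops emoji, '#', punctuation
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [t[:-1] if t.endswith("s") else t for t in tokens]  # naive stemming
```

For example, `normalize("Check this out @user http://t.co/x #cool cats!!")` keeps the hashtag word but drops the mention, the URL and the stop word.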
2.2.6 Cloud Computing

Cloud computing is a type of Internet-based computing that provides shared
computer processing resources and data to computers and other devices on demand.
Microsoft Azure is a cloud computing service created by Microsoft for building,
deploying, and managing applications and services through a global network of
Microsoft-managed data centers. It provides software as a service, platform as a
service and infrastructure as a service and supports many different programming
languages, tools and frameworks, including both Microsoft-specific and third-party
software and systems. Microsoft lists over 600 Azure services, some of which are
Compute, Mobile Services, Storage Services, Data Management, etc. (Wikipedia
2016). An Azure Virtual Machine is used for computing-intensive tasks in this thesis
study.
Figure 2.7: Cloud Computing
Figure 2.8: Windows Azure Platform
Chapter 3
3 Related Work
This chapter introduces different works that cope with similar issues through
discussing the associated publications.
Past works have found that data scraped from social media is a meaningful reflection
of the human behind the account. Therefore, in recent years, a wide variety of
research with regard to the social media analysis and user clustering/classification
based on different features has been conducted. There are studies that specifically
address clustering of people in social networks based on textual and non-textual
features (Kuldeep Singh 2016) (Friedemann 2015), clustering users based on their
interests (Recognition 2015), leveraging Twitter Lists for topic modeling (Saptarshi
Ghosh 2012), as well as using internal validity measures to compare clustering
algorithms (Toon Van Craenendonck 2015). Furthermore, there are several works that
address user profiling on social networks, such as (Alessandro Bozzon 2013), where the
authors propose a method to select experts within the population of social networks
according to information about the social activities of their users, and works like
(Narumol Prangnawarat 2015) and (Slava Kisilevich 2010) that focus on event
analysis in social media. The former analyzes the resulting heterogeneous network
and uses it to cluster posts by different topics and events, while the latter performs
analysis and comparison of temporal events, rankings of sightseeing places in a city,
and studies the mobility of people using geotagged photos. Although all these works have
delivered new solutions to the social media analysis field, they have investigated the
problems of profiling users or events in only one social network platform, or they have
employed only one machine learning algorithm to analyze the data. The big difference
between our work and the mentioned ones is that our study not only addresses both
textual and non-textual user features but also utilizes three different clustering
models applied to Twitter and Instagram data. More details regarding the works above
that made major contributions to our thesis study follow.
3.1 Clustering of People in Social Network
Based on Textual Similarity

This study (Kuldeep Singh 2016) concentrates on the textual similarity between various
people in a social network. Textual similarity is a sub-field of data mining which
gives information about the words that are frequently used within a group of people.
On the basis of the common words used in social networks, they have formulated a
metric. The data has been extracted from social networking sites and then it is
processed for generating the metrics. Simple k-means and spectral k-means
algorithms have been compared for finding textual similarity. They have also used
WordNet to group words together based on their meanings. Since the basic mode of
communication between users on Twitter is the tweet, the authors extracted the tweets
of the users and performed analysis on them.
The approach of this paper consists of three main steps:
1. Data Pre-processing: Since the extracted tweets are in a very rough form,
pre-processing is needed and is done in five steps:
Data extraction: the Tweepy package for the Python language is used to
access the Twitter API. Tweepy supports accessing Twitter via Basic
Authentication and the newer OAuth method.
Stop words removal: Stop words have been removed from experimental
dataset using Lucene. Lucene is an open-source Java full-text search
library.
Stemming of the text: Stemming is the process of reducing inflected
words to their stem. Stemming is also implemented using Lucene; the
PorterStemFilter class of Lucene is used for stemming the
words.
Lexical analysis: Lexical analysis is done using a large lexical database of
English called WordNet. Nouns, verbs, adjectives and adverbs are grouped
into sets of cognitive synonyms (synsets). Synsets are interlinked by
means of conceptual-semantic and lexical relations.
Calculation of strength matrix: Strength between two users must be
directly proportional to the number of common word between them
(STRENGTH∝COMMON WORDS) but it should be inversely
proportional to the total number of words used by both the users
(STRENGTH ∝ (1/total words used)). Strength should also decrease with
the difference between the occurrences of a word: if one person uses a word
very frequently and the other does not, this should decrease their textual
similarity (STRENGTH ∝ (1/difference of word used by two persons)). At
the end, if we add up this strength due to each word used by a user and
another user, it gives us a measure of strength of textual similarity between
those two persons.
2. Clustering: Simple k-means is based on compactness, so it generally gives
accurate approximate results for general numerical datasets.
Spectral clustering is used to map the original data into a vector space spanned
by a few eigenvectors and then apply the k-means algorithm in that space. The
assumption here is that although the data samples are high dimensional, they
lie in a low-dimensional subspace of the original space. Spectral clustering
based on the un-normalized graph Laplacian has been used here.
3. Evaluation: The results of the simple k-means and spectral algorithms on
dummy and real datasets have been evaluated. The results show that both
algorithms give almost similar outputs. The dummy dataset consists of 40 nodes
and has two clusters. The real dataset has 77 nodes, numbered from 1 to 77. The
results of simple k-means and spectral k-means are almost equal. Spectral
clustering gives relatively quick results for sparse datasets with many elements,
but its computation cost for large datasets is very high.
Figure 3.1: Graph after spectral k-means clustering for real dataset
Figure 3.2: Graph after spectral k-means clustering for dummy dataset
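One hedged way to turn the three strength proportionalities from the pre-processing step into code (the paper's exact formula is not reproduced above, so this combination is purely illustrative):

```python
from collections import Counter

def strength(words_a, words_b):
    """Illustrative strength metric: grows with the number of common words,
    shrinks with the total number of words used by both users and with the
    difference in how often each user uses a shared word."""
    ca, cb = Counter(words_a), Counter(words_b)
    total = len(words_a) + len(words_b)
    s = 0.0
    for word in set(ca) & set(cb):                     # common words only
        s += 1.0 / (total * (1 + abs(ca[word] - cb[word])))
    return s
```

As expected, two users with identical word lists score higher than users sharing only some words, and users with no common words score 0.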
The main idea of using textual similarity for clustering the users as well as the
preprocessing steps e.g. data extraction and stop words removal is used as a guide in
this thesis study.
3.2 Clustering a Customer Base Using
Twitter Data
This paper (Friedemann 2015) focuses on a method to cluster customers of a company
(Nike) using social media data from Twitter. The motivation of this research is based
on the idea that clustering a company’s customers allows marketing teams to tailor
advertising messages for specific groups of people with similar interests. Such
analytics can fundamentally change business operations.
The steps of the approach are explained below:
1. First the tweets are harvested from Twitter using Tweepy package and stored
into a local SQLite database. Because Twitter’s API rate limit constrains data
gathering, only a subset of 10,000 users from Nike’s total 5.6 million followers
is considered. For each user, the data set includes statuses posted, number of
followers, number of followings, and language. In addition, it is recorded whether
each user follows one or more of a hand-selected list of popular Twitter
accounts (influencers). Since all selected features are numerical except for the
language of the user, the language is converted to a tuple of float values by mapping
the language acronym to the latitude and longitude coordinates of the largest city
in the country with the most people who speak this language.
Figure 3.3: Percentage of followers for a set of chosen influencers.
2. The data is transformed into a lower-dimensional feature space using Principal
Component Analysis (PCA). Users’ following relationships towards influencers
are represented as a binary matrix with a 1 in the (i,j) position if user i follows
influencer j. Using PCA, the dimension of influence matrix is reduced from 12
to 8.
3. In order to efficiently segment the data samples into acceptable clusters,
selected features are passed into the k-means algorithm instead of slower
alternatives such as hierarchical clustering.
4. The optimal number of clusters is determined using the silhouette coefficient.
Figure 3.4: Silhouette coefficient as a function of number of clusters
5. The clustering performance, which is a metric of clustering quality related to
the intra-cluster variation and inversely proportional to the inter-cluster
distance, is computed; low values of q correspond to better clustering
performance.
6. The clustering output is visualized in R². Figure 3.5 shows one such
visualization. The depicted clusters have the same ratio of average intra-cluster
variation to average inter-cluster distance as the clustering output. This
suggests that the studied data set can be cleanly clustered in the discussed
dimensionality space.
Figure 3.5: Representative clusters in R²
7. In order to label the clusters, a randomly-selected subset of samples from each
of the k=5 clusters is examined. Furthermore, a human-subject experiment was
performed to validate the meaningfulness of the selected clusters. The
empirical results show that the human and the labeling algorithm were in
agreement approximately 80% of the time.
The PCA algorithm, which transforms data into a lower-dimensional feature space,
and the silhouette coefficient, which is employed to determine an appropriate
number of clusters, are used as standard approaches in this thesis study.
3.3 Clustering Users Based on Interests

In this paper (Recognition 2015), the authors investigate the problem of clustering users in
Twitter based on their interests. The motivation of doing this study is the significance
of solving the mentioned problem in many different fields, such as user
recommendation, personalized services, viral marketing, etc. The main notion of this
research is that some Twitter users’ features are potentially useful in determining
interests of an individual user or his/her common interests with other users.
To address the mentioned problem, the approach of this paper is organized as follows:
1. Data Extraction: Twitter’s Developer API is used to collect user data. 45772
English users, who have posted at least 100 tweets and have more than 20
friends, are extracted. Besides, different features that are closely correlated
with users’ interests, including both textual content (tweet text, URLs and
hashtags) and social structure (following and retweeting relationships), are
leveraged. The findings of this study show that URLs, hashtags and retweets are
very widely used at the user level, and prove that it is necessary to take these
features into account when computing user similarity.
2. User Similarity: to get the final user similarity, the similarity of all selected
features should be computed first:
Text Similarity: All the tweets published by an individual user are
aggregated into a big document. With the purpose of identifying the
topics that users are interested in based on their tweets, Latent Dirichlet
Allocation (LDA), which is an unsupervised machine learning method,
is applied. Then, the text similarity between two users ui and uj can be
calculated using the formula presented in the paper.
URL Similarity: All the URLs embedded in tweets corresponding to a
user are aggregated into a document. Then, similarly to the previous
section, URL similarity is calculated.
Hashtag Similarity: hashtag similarity is measured based on the
number of their common hashtags and the importance of these
hashtags.
Following Similarity: A twitterer follows a friend because she/he is
interested in the topics the friend publishes, and the friend follows back
because she/he finds they share similar topic interest. Intuitively, if two
users have many common friends and followers, they are quite similar.
This paper presents a formula which computes following similarity
based on the total number of users’ followers, followings, common
friends and common followers.
Retweeting Similarity: If two users frequently retweet the same person, they may have similar interests. Additionally, whether the two users retweet each other is an even stronger indicator of similar interests. Taking these two factors into consideration, retweeting similarity is defined in this study.
The final similarity between users ui and uj is then calculated as a weighted aggregation of the individual feature similarities. In order to assess the effectiveness of their approach and determine the parameters of the user similarity formula, the authors propose an evaluation metric, the average number of mutual following links per user per cluster (FPUPC), which is also used to set the aggregation parameters γfeature of the individual features.
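The exact aggregation formula and the tuned γfeature values appear as figures in the original paper. Purely as an illustration, and assuming a linear weighted combination (the feature names, example scores and weights below are all hypothetical), the aggregation could be sketched as:

```python
# Hypothetical sketch of combining per-feature similarities into a final user
# similarity as a weighted sum. The gamma weights stand in for the aggregation
# parameters that the paper tunes via the FPUPC metric; they are NOT the
# paper's actual values.

def combined_similarity(feature_sims, gammas):
    """feature_sims and gammas map feature name -> value; weights sum to 1."""
    assert abs(sum(gammas.values()) - 1.0) < 1e-9
    return sum(gammas[f] * feature_sims[f] for f in feature_sims)

# Illustrative per-feature similarity scores for a pair of users
sims = {"text": 0.62, "url": 0.10, "hashtag": 0.35,
        "following": 0.20, "retweeting": 0.05}
# Illustrative aggregation weights
gammas = {"text": 0.4, "url": 0.1, "hashtag": 0.2,
          "following": 0.2, "retweeting": 0.1}
score = combined_similarity(sims, gammas)
```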
3. K-means Clustering: k-means is applied to cluster users because it is not only effective but also very fast. Moreover, experimental results show that the best performance is achieved when the number of clusters is around 400.
The idea of using the LDA model to identify the topics of a large text document is adopted as one of the most important methods in our thesis work.
3.4 Crowdsourcing Search for Topic Experts in Microblogs

This study (Saptarshi Ghosh 2012) highlights Lists as a potentially valuable source of
information for future content or expert search systems in Twitter. In this paper,
the authors present Cognos, a system for finding topic experts in Twitter. Unlike traditional approaches, which identify topical experts based either on the information provided by the user or on analysis of network characteristics, Cognos exploits the Lists feature, an entirely different approach.
The proposed methodology consists of three fundamental parts:
1. Crawl the Lists containing the 54 million Twitter users present in a complete snapshot of Twitter taken in August 2009, then consider only users who were listed at least 10 and at most 2,000 times. Overall, a total of 88,471,234 Lists were gathered for the resulting 1.3 million users.
2. Extract frequently occurring topics (words) from List meta-data (names and
descriptions) and associate these topics with the listed users. This strategy
includes the following steps:
Separate List names into individual words
Apply case-folding, stemming and stop words removal
Group words that are very similar to each other based on edit-distance
among words
Consider only unigrams and bigrams as topics
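The steps above can be sketched as follows; this is not the authors' code, and the tiny stop-word list, the edit-distance cutoff and the use of `difflib` as a proxy for the paper's edit-distance grouping are all assumptions:

```python
# Illustrative sketch of the List-metadata pipeline described above: split
# List names into words, case-fold, drop stop words, merge near-duplicate
# words (a stand-in for edit-distance grouping) and emit unigram/bigram topics.
import difflib
from collections import Counter

STOP_WORDS = {"the", "of", "and", "my", "a"}  # tiny stand-in stop list

def extract_topics(list_names, similarity_cutoff=0.85):
    words = []
    for name in list_names:
        for w in name.replace("-", " ").lower().split():
            if w not in STOP_WORDS:
                words.append(w)
    # Map each word to a canonical near-duplicate already seen, if any
    canonical = {}
    for w in words:
        match = difflib.get_close_matches(w, list(canonical), n=1,
                                          cutoff=similarity_cutoff)
        canonical[w] = match[0] if match else w
    unigrams = [canonical[w] for w in words]
    bigrams = [" ".join(pair) for pair in zip(unigrams, unigrams[1:])]
    return Counter(unigrams) + Counter(bigrams)

topics = extract_topics(["Tech News", "tech-news", "Music Lovers", "musics"])
```

Here "musics" is folded into "music" by the near-duplicate grouping, so both unigram and bigram counts reflect the merged vocabulary.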
Table 3.1: The most common topics of expertise as identified from Lists
3. Given a query, a topical similarity score is calculated between the topic vector
for a user and the given query vector, using an algorithm which computes the
cover density ranking between the vectors.
This paper concludes that Cognos provides better search results in cases where the bio or the tweets posted by a user do not contain information about the user’s topic of expertise. Even though Cognos is built using only the Lists feature, it can compete with the commercial who-to-follow system (WTF) deployed by Twitter itself. As Table 3.2 indicates, top Cognos results mostly contain personal accounts while top Twitter WTF results mostly contain organization/business accounts.
Table 3.2: Top 5 results by Cognos and Twitter WTF for query “music”
Our thesis work benefits from this paper’s key idea of utilizing the Twitter Lists feature for identifying topic experts, and employs List slugs to find topics and user clusters in Twitter.
3.5 Using Internal Validity Measures to
Compare Clustering Algorithms
The research and experiments of this paper (Toon Van Craenendonck 2015) rely on
using four internal validity measures and six clustering algorithms. The reason behind
this approach is the existence of many different clustering algorithms which may all
produce very different partitions of the same data set. Even a single clustering
algorithm can yield wildly different results depending on the chosen parameters.
Therefore, the authors investigate whether the outlined measures allow for a
comparison between algorithms or not.
Internal validity measures rely only on properties intrinsic to the data set. This research uses the following internal measures:
Silhouette Index (SI): This score of a clustering is in [-1, 1], and should be
maximized.
Davies-Bouldin (DB): This score of a clustering is in [0, + ∞] and should be
minimized.
Calinski-Harabasz (CH): This score of a clustering is in [0, + ∞] and should be
maximized.
Density-Based Cluster Validation (DBCV): This score of a clustering is in [-1, 1] and should be maximized. DBCV can be useful for data sets with well separated structure. However, the results become less informative when the data becomes noisier or transitions between clusters become more diffuse. Thus, due to the noisy nature of the data set used in this research work, the authors put this measure aside.
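The first three measures are available in scikit-learn (assumed installed here); DBCV has no scikit-learn implementation and is omitted, as in the paper. A minimal sketch on synthetic data:

```python
# Sketch: computing SI, DB and CH for a k-means clustering of toy blob data
# with scikit-learn. The data and k are illustrative, not the paper's setup.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

si = silhouette_score(X, labels)          # in [-1, 1], higher is better
db = davies_bouldin_score(X, labels)      # in [0, +inf), lower is better
ch = calinski_harabasz_score(X, labels)   # in [0, +inf), higher is better
```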
As illustrated in Figure 3.6, the first three measures have a strong bias towards spherical clusterings, while DBCV can handle clusters with different densities and shapes.
Figure 3.6: Spectral clustering solutions selected by various measures
The clustering algorithms used in this experiment are k-means, spectral, DBSCAN, Ward, mean shift and EM. The parameter ranges for each algorithm were chosen to be wide enough to ensure that they contain values leading to a good solution. After applying all algorithms to a data set and computing the first three validity measures, the outcomes are summarized in Table 3.3.
The above results indicate that all measures exhibit some undesired properties, e.g. sensitivity to noise points, a preference for highly imbalanced solutions, or a bias towards spherical clusterings. Closer inspection shows that to produce clusters that score well on the silhouette and Calinski-Harabasz measures, we can simply use k-means. To score well on the Davies-Bouldin and DBCV measures, we can use DBSCAN or mean shift, but this is mainly due to the previously mentioned undesired properties.
Table 3.3: Average relative SI, CH and DB score over data set
The idea of using internal validity measures such as Silhouette Index to compare
clustering algorithms is used as a guide in the thesis study.
Chapter 4
4 Event-based User Profiling in
Social Media
This chapter discusses how to use data mining approaches to profile users in social media, namely Twitter and Instagram. The aim is to go deep into the analysis approach and elaborate each step in detail.
4.1 Main Idea

The primary objective of this study is to analyze the social data collected about a specific event and use its outcomes, such as users’ types and behavior, to improve the quality of that event and engage potential users who are more likely to be interested in participating in similar events. To achieve this goal, the entities contained in the tweets/posts as well as their related users have been taken into consideration. Users’ textual content such as biographies, hashtags, tweet/post texts and list descriptions is specifically proposed to be used in the clustering approaches. Moreover, dealing with tweets/posts in different languages was another challenge to overcome. Other properties are also extracted to be employed in the knowledge representation part. Taking all these ideas into account, the details of this research were identified, in terms of the structure to be examined and the aspects to be considered.
4.1.1 Twitter Users

Twitter is a social networking and microblogging service, enabling registered users to read and post short messages, so-called tweets. As of the fourth quarter of 2016, the microblogging service averaged 319 million monthly active users (Statista 2016). In this thesis, the extracted Twitter users are divided into two groups:

Masters: users who tweeted about the event in a specific time span.

Contributors: users who retweeted, favorited or replied to one or more tweets posted by a master user in a specific time span.
Intuitively, if two users (one master and one contributor) engage with the same tweet, they may have similar interests, and such users are considered our target users in this thesis study. That is why both types have to be taken into account for clustering purposes.
4.1.2 Instagram Users

As of December 2016, the mainly mobile photo-sharing network had reached 600 million monthly active users, up from 500 million in June 2016 (Statista 2016). For the same reason mentioned in Section 4.1.1, the extracted Instagram users are also divided into two categories:

Masters: users who posted about the event in a specific time interval.

Contributors: users who liked or commented on one or more Instagram media posted by a master user in a specific time interval.
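The master/contributor split can be sketched as a simple classification; the field names, timestamps and helper below are illustrative, not the thesis's actual pipeline:

```python
# Hypothetical sketch of the master/contributor split defined above: a user
# who posted about the event in the time window is a master; a user who only
# engaged (liked/commented on Instagram; retweeted/favorited/replied on
# Twitter) with a master's content in the window is a contributor.

def classify_users(posts, engagements, start, end):
    """posts: (username, timestamp); engagements: (username, author, timestamp)."""
    masters = {u for u, t in posts if start <= t <= end}
    contributors = {u for u, author, t in engagements
                    if author in masters and start <= t <= end
                    and u not in masters}
    return masters, contributors

masters, contributors = classify_users(
    posts=[("alice", 5), ("bob", 30)],          # bob posts outside the window
    engagements=[("carol", "alice", 6),         # carol engages with a master
                 ("dave", "eve", 7)],           # eve is not a master
    start=0, end=20)
```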
4.2 Motivation

As mentioned in Chapter 3, there are many studies concentrating on the analysis of different aspects of social media platforms. However, there is no work focusing on analyzing Twitter and Instagram users based on their interest in a particular event. This is one of the main motivations for choosing this topic: to inspect users’ activities during an event, compare the results of the two social media platforms and provide a solution for predicting future potential users.
4.2.1 Why social media as data source?

The popularity of social media sites and the ease with which their data can be accessed mean these platforms are increasingly becoming primary sources for every kind of research. Current academic and industry interest in social media has been driven by the rapidly broadening user base for social media technologies, which is of course related to the continuing spread of internet use itself. The rise in social media use has been rapid: in 2011, approximately 60% of internet users were also social media users, up from just 17% in 2007. Much of this change has been driven by the emergence of a small number of “mass appeal” social media websites, of which Twitter and Facebook are the obvious examples. These sites are characterized by their ease of use, their generic nature (i.e. they eschew focus on a particular subject or area of interest) and their wide penetration, meaning that significant portions of the population have created an account (Pensions 2014). The following figures
demonstrate the worldwide growth in using social media (Pew Research Center
2017).
Figure 4.1: Number of social media users from 2010 to 2020 (in billions)
Figure 4.2: Percentage of adult users who use different social networks
Figure 4.3: Percentage of adult users who use at least one social media, by age
In general, when compared to traditional surveys, social media data offer considerable
advantages in terms of how quickly results are delivered, the scale at which results
can be brought in, and (potentially) how cheaply they can be obtained. They also offer
the possibility to access sub-groups within the population in a way that sample
surveying has struggled with. The major difficulty lies in making accurate
generalizations from social media data to some overall population of interest as those
using social media do not constitute a representative sample of the public as a whole
and do not come with perfect demographic data attached. Nevertheless, knowing what
the public is thinking about is a crucial precursor to knowing what their opinion is of
any given topic. It is also an area where social media has the potential to offer real
added value (Pensions 2014).
4.2.2 Why Twitter and Instagram?

Out of all the different social media platforms, Twitter is of particular interest to researchers, as it provides them with arguably the most open access to its data: a real-time stream of tweets, either as a 1% sample or as a dataset matching criteria specified by the user. Other companies such as Google or Facebook do not provide similar access to their data (Rob Procter 2015). There are at least six reasons why researchers prefer to use Twitter as their source of data (Ahmed 2015):
1. Twitter is a popular platform in terms of the media attention it receives, and it therefore attracts more research due to its cultural status.

2. Twitter makes it easier to find and follow conversations (i.e., via both its search feature and tweets appearing in Google search results).

3. Twitter’s hashtag norms make gathering, sorting, and expanding searches easier when collecting data.
4. Twitter data is easy to retrieve, as major incidents, news stories and events on Twitter tend to be centered on a hashtag.
5. The Twitter API is more open and accessible compared to other social media
platforms, which makes Twitter more favorable to developers creating tools to
access data. This consequently increases the availability of tools to
researchers.
6. Many researchers themselves are using Twitter and because of their favorable
personal experiences, they feel more comfortable with researching a familiar
platform.
A picture may be worth a thousand words, but those words are not worth much if no one is listening. This is why it is important to choose the right photo-sharing network for this thesis. Among all the current photo-sharing platforms,
Instagram is a vibrant social platform, where users interact with photos by liking or
commenting on them. One of the app’s most powerful features is its tagging
mechanism, called a “hashtag,” which surfaces your photo to the right subgroup of
Instagram’s more than 500 million users. With over 90% of users falling under the
age of 30, Instagram is the best platform for promoting your event to a younger crowd
(Luna 2016). Figure 4.4 depicts the main differences between four major networks for
photo sharing: Instagram, Pinterest, Tumblr, and Flickr (Sorokina 2014).
Figure 4.4: Comparison between four major photo-sharing networks
4.3 Approach

Our proposed approach in this study consists of several consecutive phases. The architecture design, including the main steps that prepare the collected data for analysis in the subsequent phases, is shown and explained as follows:
Figure 4.5: Architecture design
4.3.1 Data Extraction

In this phase raw data is extracted through the Instagram and Twitter APIs. Since our approach aims to make the process of storing, analyzing and visualizing data more efficient, a MySQL database is used because of its scalability, flexibility, high performance and high availability in dealing with the data collected from the mentioned APIs. Table 4.1 and Table 4.2 list the extracted features obtained from the Twitter and Instagram objects.
Table 4.1: Twitter extracted features
Tweet
Id: The string representation of the unique identifier for this
tweet
Username: The user who posted this tweet.
Text: The actual UTF-8 text of the status update.
Date: Date and time when this tweet was created.
Retweets: Number of times this tweet has been retweeted.
Favorites: Indicates approximately how many times this tweet
has been liked by Twitter users.
Mentions: the users who are mentioned in this tweet.
Hashtags: Represents hashtags which have been parsed out of
this tweet text.
Geo: Represents the geographic location of this tweet as
reported by the user or client application.
Place: Indicates that the tweet is associated (but not necessarily
originating from) a place.
User
Id: The string representation of the unique identifier for this
user.
Username: the unique name of this user.
Full name: The name of this user, as they’ve defined it. Not
necessarily a person’s name.
Tweets: the user’s most recent (20) tweets.
Follower count: The number of followers this user currently
has.
Following count: The number of users this user is following.
Status count: The number of tweets issued by this user.
Listed count: The number of public lists that this user is a
member of.
Favorite count: The number of tweets this user has favorited in
the account’s lifetime.
Bio: the user-defined UTF-8 string describing their account.
Hashtags: All the hashtags included in this user’s most recent
(20) tweets.
Mentions: All the users who are mentioned in this user’s most
recent (20) tweets.
Location: The user-defined location for their profile.
Language: The user’s self-declared user interface language.
Time zone: A string describing the Time Zone this user
declares themselves within.
Join date: The UTC datetime that the user account was created
on Twitter.
Is Verified: when true, indicates that the user has a verified
account.
Is Protected: When true, indicates that this user has chosen to
protect their Tweets.
List
Id: The numerical id of the list.
User ID: The ID of the user who is member of this list.
Name: The screen name of this list.
Slug: The short name of this list.
Description: The description of this list.
Member count: The number of members of this list.
Table 4.2: Instagram extracted features
Media
Id: The unique identifier for this media
Username: The user who posted this media.
Caption: The media caption text.
Date: Date and time when this media was created.
URL: The URL of the photo uploaded in this media.
Like count: Indicates how many times this media has
been liked by Instagram users.
Comment count: Indicates how many times this media has
been commented by Instagram users.
Likers: The first 10 users who liked this media.
Commenters: The first 10 users who commented this media.
Mentions: the users who are mentioned in this media.
Hashtags: Represents hashtags which have been parsed out of
this media text.
Geo: Represents the geographic location of this media.
User
Id: The string representation of the unique identifier for this
user.
Username: the unique name of this user.
Full name: The name of this user, as they’ve defined it. Not
necessarily a person’s name.
Media: the user’s most recent (20) media.
Media count: The number of media this user posted.
Follower count: The number of followers this user currently
has.
Following count: The number of users this user is following.
Average like count: This user’s average number of likes.
Average comment count: This user’s average number of
comments.
Bio: the user-defined string describing their account.
Hashtags: All the hashtags included in this user’s most recent
(20) media.
Mentions: All the users who are mentioned in this user’s most
recent (20) media.
Is Verified: when true, indicates that this user has a verified
account.
Is Private: when true, indicates that this user has a private
profile.
Is Business: when true, indicates that this user is a business.
4.3.2 Data Preprocessing

Since the gathered raw data is incomplete and inconsistent, we need to apply preprocessing techniques to prepare an appropriate dataset for the subsequent analysis and experiments. The related techniques are applied in particular to the fields “Text”, “Bio” and “Tweets” in the Twitter dataset and the fields “Caption”, “Bio” and “Media” in the Instagram dataset; as a consequence, new fields “Text Norm”, “Bio Norm”, “Tweets Norm” and “Caption Norm”, “Bio Norm”, “Media Norm” are appended to the Twitter and Instagram datasets respectively. It is noteworthy that hashtags are excluded from the data preprocessing phase, because each hashtag refers to a specific content and should not be transformed. The preprocessing process consists of three main steps.
4.3.2.1 Text Normalization

The textual properties extracted in the data extraction phase include a great deal of non-standard characters, punctuation, symbols, white spaces, stop words, etc. that must be removed to make the data clean and standard. Furthermore, it is essential to reduce inflected or derived words to their stem, base or root form. This process, called stemming, is applied to the textual features at the end of this stage. The Text Normalization Java Library is used for this purpose.
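The thesis performs this step with a Java library; as an illustration only, the same pipeline can be sketched with the Python standard library. The stop-word list and the naive suffix stripping are deliberate simplifications standing in for a real stemmer:

```python
# Minimal sketch of the normalization steps described above: lower-casing,
# URL and punctuation removal, stop-word filtering, and a crude suffix
# stripper as a stand-in for proper stemming.
import re

STOP_WORDS = {"a", "an", "the", "is", "are", "and", "of", "to", "in"}
SUFFIXES = ("ing", "ers", "er", "ed", "s")  # crude stand-in for stemming

def naive_stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def normalize(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # drop punctuation/symbols
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(naive_stem(t) for t in tokens)

norm = normalize("Runners are RUNNING to the park!! http://t.co/x")
```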
4.3.2.2 Language Identification and Translation

Unsurprisingly, Twitter and Instagram users do not always tweet or post in English, and since the event monitored in our study was organized and held in Italy, texts in different languages are not unexpected. With the aim of making the data more coherent and unambiguous, detecting the language of each text and translating it into English seems absolutely necessary. The Yandex API first identifies the language of a text and then translates the stems into English.
4.3.2.3 Gender Detection

The Twitter and Instagram APIs do not provide users’ gender in their objects. Since gender is required in the subsequent analysis and visualization phases of our thesis work, the Namsor API is proposed and employed in this step. After each user’s gender is detected, a new field with the same name is added to the Twitter and Instagram datasets to be used in the following steps.
4.3.3 Data Loading

In this phase we store the data into the end target, which is a CSV file. CSV is a file format for data storage which looks like a text file: the information is organized with one record on each line, and the fields are separated by commas. Besides all the obvious benefits of flat data formats like CSV, the simplicity of importing and working with this format in the R programming language and environment during our data analysis phase is one of the main reasons why CSV files are proposed as data storage.
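The record-per-line, comma-separated layout can be sketched as follows; the column names are illustrative, not the thesis's actual schema:

```python
# Minimal stdlib sketch of the storage format described above: one record
# per line, fields separated by commas, with a header row.
import csv
import io

rows = [{"username": "alice", "bio_norm": "data mining student", "gender": "female"},
        {"username": "bob", "bio_norm": "music lover", "gender": "male"}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["username", "bio_norm", "gender"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```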
4.3.4 Data Analysis

This is the phase where the preprocessed and loaded datasets (CSV files) are used for data analysis. In this thesis work, all analysis, statistics, evaluations and result representations are done in R. R is an extremely flexible statistics programming language and environment that is open source and freely available for all mainstream operating systems. The flexibility of R is arguably unmatched by any other statistics program, as its object-oriented programming language allows for the creation of functions that perform customized procedures and/or the automation of commonly performed tasks. Perhaps R’s biggest hindrance is also its biggest asset, and that is its general and flexible approach to statistical inference. With R, if you know what you want, you can almost always get it. But you have to ask for it. Using R requires a more thoughtful approach to data analysis than some other programs do (Ken Kelley 2008).
We step through the main sub-phases of the data analysis in detail below.
4.3.4.1 Topic Extraction

As mentioned in Chapter 2, topic models can help us organize and gain insights into large collections of unstructured text. Topic models allow the probabilistic modeling of term frequency occurrences in documents. The fitted model can be used to estimate the similarity between documents, as well as between a set of specified keywords, using an additional layer of latent variables referred to as topics. The R package “topicmodels” provides basic infrastructure for fitting topic models based on data structures from the text mining package tm. The package includes interfaces to two algorithms for fitting topic models: the variational expectation-maximization algorithm and an algorithm using Gibbs sampling (Bettina Grun 2011). The following two steps have to be carried out to complete the topic modeling approach:
Pre-processing: The input data for topic models is a document-term matrix. The rows in this matrix correspond to the documents and the columns to the terms. The entry m_ij indicates how often the j-th term occurred in the i-th document. The number of rows is equal to the size of the corpus and the number of columns to the size of the vocabulary. We consider users (documents) as rows and all bio stems (terms) as columns of this matrix. In this step, first a corpus containing users’ biographies is created. Here the focus is only on Twitter users, but the same process can be applied to Instagram users as well.
The data pre-processing step involves selecting a suitable vocabulary, which corresponds to the columns of the document-term matrix. The mapping from a document to its term frequency vector involves tokenizing the document and then processing the tokens, for example by converting them to lower case, removing punctuation characters, removing numbers, stemming, removing stop words and omitting terms shorter than a certain minimum length. In addition, the final document-term matrix can be reduced by selecting only the terms which occur in a minimum number of documents, or those terms with the highest term frequency-inverse document frequency (tf-idf) scores. Therefore, in our case, terms whose length is less than three or more than 15 characters and terms whose frequency is less than 50 are considered unimportant and are omitted.
In order to create the document-term matrices for the users’ other textual properties, namely hashtags, their first twenty tweets and the slugs of the lists they are members of (only for Twitter), the above procedure is also employed for each property separately.
Model Selection: To discover the abstract “topics” that occur in the collection of documents containing users’ textual features, we need to apply a topic model such as Latent Dirichlet Allocation (LDA), which benefits from the Gibbs sampling algorithm. For fitting the LDA model to a given document-term matrix, the number of topics needs to be fixed a priori. Because the number of topics is in general not known, models with several different numbers of topics are fitted and the optimal number is determined in a data-driven way. In this thesis study, CaoJuan2009 (minimization) and Deveaud2014 (maximization) are the two metrics used to identify the number of topics for LDA. A simple approach to analyzing these metrics is to find their extrema. Figure 4.6 indicates that the extremum (the number of topics) is 6.
Figure 4.6: Identify the number of topics for LDA
Additionally, estimation using Gibbs sampling requires the specification of values for the parameters of the prior distributions. Gibbs sampling works by performing a random walk in a way that reflects the characteristics of a desired distribution. Because the starting point of the walk is chosen at random, it is necessary to discard the first few steps of the walk (as these do not correctly reflect the properties of the distribution). This is referred to as the “burn-in” period. We set the burn-in parameter to 4000. Following the burn-in period, we perform 2000 iterations, keeping every 500th iteration for further use (thin parameter). The reason we do this is to avoid correlations between samples. We use 5 different starting points (nstart = 5), that is, five independent runs. Each starting point requires a seed integer (this also ensures reproducibility), so we have provided 5 random integers in the seed list. Finally we set best to TRUE (actually a default setting), which instructs the algorithm to return the results of the run with the highest posterior probability. Having set all the required parameters, the LDA function is applied.
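The actual R call appears as a figure in the thesis. As an analogous sketch only, scikit-learn's LatentDirichletAllocation (which uses variational Bayes rather than Gibbs sampling) can fit topics to a document-term matrix; the toy documents and the reduced topic count are assumptions, with random_state playing the role of the seed list:

```python
# Analogous sketch of LDA fitting (the thesis uses R's "topicmodels" with
# Gibbs sampling; scikit-learn uses variational Bayes instead). The output
# rows are per-document topic probability distributions, the counterpart of
# the thesis's topicProbabilities matrix.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["music concert live band", "football match league goal",
        "music band tour", "league goal season football"]
dtm = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2,   # thesis uses k = 6
                                max_iter=50, random_state=0)
topic_probabilities = lda.fit_transform(dtm)      # rows: docs, cols: topics
```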
The LDA algorithm returns an object (LDAout) that contains a wealth of information. Of particular interest to us are the top terms of each topic and the probabilities associated with each extracted topic for each user (document), which we call topicProbabilities. Table 4.3 shows the first twenty rows of this matrix.
Table 4.3: Topic probabilities by user
In general, if a user (document) has multiple topics with comparable probabilities, it
simply means that the user (document) speaks to all those topics in proportions
indicated by the probabilities.
Dimension Reduction: As mentioned earlier, the output of the LDA function contains information such as the top terms of each obtained topic. As can be seen in Table 4.4, the extracted topics are possibly correlated. Therefore, it is suggested to employ Principal Component Analysis (PCA) to convert them into a set of linearly uncorrelated components. This transformation of the data to a lower-dimensional feature space not only reduces the time and storage required but also makes the data visualization easier and more interpretable when reduced to a low dimension such as 2D or 3D.
Table 4.4: Top terms of each extracted topic by LDA
The result of applying PCA on topicProbabilities is illustrated below.
We choose to keep the principal components that together capture at least 95% of the total variance, so that the data is compressed without losing much information. To reach this threshold, we have to pick the first three principal components. Then, in the clustering phase, we will run the clustering algorithms exploiting only the chosen components.
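The 95%-variance selection can be sketched with scikit-learn (assumed available), where a fractional n_components keeps exactly enough components to reach the threshold; the Dirichlet-distributed toy matrix stands in for the real topicProbabilities:

```python
# Sketch of the variance-based component selection: PCA with
# n_components=0.95 keeps the smallest number of components whose
# cumulative explained variance reaches 95%.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy stand-in for topicProbabilities: 200 users, 6 topic probabilities each
topic_probabilities = rng.dirichlet(alpha=[1.0] * 6, size=200)

pca = PCA(n_components=0.95)
reduced = pca.fit_transform(topic_probabilities)
kept = pca.n_components_                       # components needed for 95%
explained = pca.explained_variance_ratio_.sum()
```

Because each row of the toy matrix sums to 1, the centered data has rank 5, so all five informative components are needed here; on the real topicProbabilities the thesis finds three suffice.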
4.3.4.2 Cluster Analysis

As mentioned earlier in this chapter, the main notion of this thesis work is to collect social media data related to a specific event, categorize the users who talk about that event, and consequently analyze their activities and behavior. In other words, we need to uncover hidden structure, such as similarity groups, from “unlabeled” data. To reach this goal, Cluster Analysis or Clustering, the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters), has to be applied to our data (topicProbabilities with the three selected components).
The notion of a "cluster" cannot be precisely defined, which is one of the reasons why
there are so many clustering algorithms. There is a common denominator: a group of
data objects. However, different researchers employ different cluster models, and for
each of these cluster models again different algorithms can be given. The notion of a
cluster, as found by different algorithms, varies significantly in its properties
(Wikipedia 2016). For this reason, three different algorithms from three different
cluster models are put into practice: k-means, hierarchical and DBSCAN.
The k-means algorithm is one of the most popular clustering algorithms. Its goal is to find the best division of n entities into k groups, so that the total distance between each group’s members and its corresponding centroid, the representative of the group, is minimized. Formally, the goal is to partition the n entities into k sets Si, i = 1, 2, ..., k, in order to minimize the within-cluster sum of squares (WSS), defined as

WSS = \sum_{i=1}^{k} \sum_{x_j \in S_i} \| x_j - \mu_i \|^2

where the term \| x_j - \mu_i \|^2 provides the distance between an entity point x_j and the cluster’s centroid \mu_i.
The most common algorithm, described below, uses an iterative refinement approach,
following these steps:
1. Define the initial groups' centroids. This step can be done using different
strategies. A very common one is to assign random values for the centroids of
all groups. Another approach is to use the values of K different entities as
being the centroids.
2. Assign each entity to the cluster that has the closest centroid. In order to find
the cluster with the most similar centroid, the algorithm must calculate the
distance between all the entities and each centroid.
3. Recalculate the values of the centroids: each centroid’s fields are updated as the averages of the corresponding attribute values of the entities belonging to the cluster.

4. Repeat steps 2 and 3 iteratively until entities can no longer change groups.
The k-means is a greedy, computationally efficient technique, being the most popular
representative-based clustering algorithm.
One decision that has to be made before applying k-means clustering is to determine
the number of clusters. There is an obvious trade-off between the number of clusters
and the internal cohesion of them. If there are few clusters, the internal cohesion tends
to be small. Otherwise, a large number of clusters make them very close, so that there
is little difference between adjacent groups. The optimal choice of k (number of
clusters) will strike a balance between maximum compression of the data using a
single cluster, and maximum accuracy by assigning each data point to its own cluster.
If an appropriate value of k is not apparent from prior knowledge of the properties of
the data set, it must be chosen somehow. There are several categories of methods for
making this decision. One method to validate the number of clusters is the elbow
method (Wikipedia 2016). The idea of the elbow method is to run k-means clustering
on the dataset for a range of values of k (say, k from 1 to 15 in our case), and for each
value of k calculate the total within-cluster sum of square (WSS).
Then, plot a line chart of the WSS for each value of k. If the line chart looks like an
arm, then the "elbow" on the arm is the value of k that is the best. In this case, k=3 is
the value that the Elbow method has selected (see Figure 4.7).
Figure 4.7: Elbow method representation
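The WSS curve and the elbow rule can be made concrete. In the sketch below (Python, for illustration only; the thesis computes WSS in R), the elbow is taken as the point farthest from the straight line joining the two ends of the curve, which is one common heuristic among several — the text above does not commit to a specific rule beyond visual inspection:

```python
import math

def wss(points, centroids, assignment):
    """Total within-cluster sum of squares for one clustering result."""
    return sum(math.dist(p, centroids[a]) ** 2
               for p, a in zip(points, assignment))

def elbow_k(ks, wss_values):
    """Pick the 'elbow': the k whose (k, WSS) point lies farthest from
    the straight line joining the curve's first and last points."""
    (x1, y1), (x2, y2) = (ks[0], wss_values[0]), (ks[-1], wss_values[-1])
    def dist_to_line(x, y):
        # Perpendicular distance from (x, y) to the endpoint line.
        num = abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1)
        return num / math.hypot(y2 - y1, x2 - x1)
    return max(zip(ks, wss_values), key=lambda p: dist_to_line(*p))[0]
```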
When the number of clusters is specified, we perform a k-means clustering with three
initial cluster centers. The algorithm of Hartigan and Wong is used by default. It
generally does a better job than the alternatives, namely those of MacQueen, Lloyd and
Forgy3, but trying several random starts (nstart > 1) is often recommended; here we
set nstart to 10.
The results obtained with the k-means method are presented and discussed in chapter 5.
Hierarchical Algorithm: this is a method of cluster analysis that seeks to build
a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two
types:
Agglomerative: This is a "bottom up" approach: each observation starts in its own
cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a "top down" approach: all observations start in one cluster, and
splits are performed recursively as one moves down the hierarchy.
3 “Lloyd” and “Forgy” are alternative names for one algorithm.
In general, the merges and splits are determined in a greedy manner. The results of
hierarchical clustering are usually presented in a Dendrogram4. In order to decide
which clusters should be combined (for agglomerative), or where a cluster should be
split (for divisive), a measure of dissimilarity between sets of observations is required.
In most methods of hierarchical clustering, this is achieved by use of an
appropriate metric (a measure of distance between pairs of observations), and a
linkage criterion which specifies the dissimilarity of sets as a function of the pairwise
distances of observations in the sets (Wikipedia 2016).
Euclidean distance, the most commonly used dissimilarity metric, is computed using
the dist function in R. The hclust function then performs a hierarchical cluster
analysis on the resulting set of dissimilarities. Initially, each object is
assigned to its own cluster and then the algorithm proceeds iteratively, at each stage
joining the two most similar clusters, continuing until there is just a single cluster
(Agglomerative).
A number of different linkage criteria are provided. Ward's minimum variance
method aims at finding compact, spherical clusters. The complete linkage method
finds similar clusters. The single linkage method (which is closely related to the
minimal spanning tree) adopts a ‘friends of friends’ clustering strategy. The other
methods can be regarded as aiming for clusters with characteristics somewhere
between the single and complete link methods (Statistical Data Analysis n.d.). It is not
surprising that both single and complete algorithms often produce undesirable
clusters. Single-link clustering often suffers from chaining, that is, we only need a
single pair of points to be close to merge two clusters. Therefore, clusters can be too
spread out and not compact enough. Complete-link clustering often suffers from
crowding, that is, a point can be closer to points in other clusters than to points in its
own cluster. Therefore, the clusters are compact, but not far enough apart. In this
thesis study, complete linkage is preferred because it is less sensitive to noise
and outliers, and it yields a tree with a more interpretable structure.
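The agglomerative procedure with the two linkage criteria contrasted above can be sketched directly (a Python illustration with hypothetical data; the thesis performs this step with hclust in R):

```python
import math

def agglomerative(points, k, linkage="complete"):
    """Bottom-up clustering: start with singleton clusters and repeatedly
    merge the two closest ones until k clusters remain.  With complete
    linkage the distance between clusters is their LARGEST pairwise point
    distance; with single linkage it is the smallest (prone to chaining)."""
    agg = max if linkage == "complete" else min
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = agg(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters
```

Recording the merge heights instead of discarding them would yield exactly the dendrogram discussed above.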
Unlike k-means algorithm, hierarchical algorithm does not require the optimal
number of clusters at the beginning. In this clustering algorithm clusters are defined
by cutting branches off the dendrogram. To determine the cutting section, various
methods can be used; a common statistical convention is to cut the dendrogram at the
height where the difference between successive merges is largest.
DBSCAN Algorithm: Density-Based Spatial Clustering of Applications with
Noise (DBSCAN) is a data clustering algorithm which groups together points that are
closely packed together (points with many nearby neighbors), marking as outliers
4 It is a tree diagram frequently used to illustrate the arrangement of the clusters produced
by hierarchical clustering.
points that lie alone in low-density regions (whose nearest neighbors are too far away)
(Wikipedia 2016). DBSCAN has several advantages which makes it a desirable
clustering algorithm for this analysis part of the thesis. Some of which are:
DBSCAN does not require one to specify the number of clusters in the data a
priori.
DBSCAN can find arbitrarily shaped clusters.
DBSCAN has a notion of noise, and is robust to outliers.
DBSCAN requires two parameters: ε (eps) and the minimum number of points
required to form a dense region (MinPts). It starts with an arbitrary starting point that
has not been visited. This point's ε-neighborhood is retrieved, and if it contains
sufficiently many points, a cluster is started. Otherwise, the point is labeled as
noise. If a point is found to be a dense part of a cluster, its ε-neighborhood is also part
of that cluster. Hence, all points that are found within the ε-neighborhood are added,
as is their own ε-neighborhood when they are also dense. This process continues until
the density-connected cluster is completely found. Then, a new unvisited point is
retrieved and processed, leading to the discovery of a further cluster or noise
(Wikipedia 2016).
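The expansion procedure just described can be sketched as follows (a Python illustration assuming distinct points; the thesis itself relies on the dbscan R package):

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise.
    A point is 'dense' (a core point) when its eps-neighborhood contains
    at least min_pts points, itself included."""
    labels = {p: None for p in points}

    def neighbors(p):
        return [q for q in points if math.dist(p, q) <= eps]

    cluster = -1
    for p in points:
        if labels[p] is not None:        # already visited
            continue
        seeds = neighbors(p)
        if len(seeds) < min_pts:         # not dense: tentatively noise
            labels[p] = -1
            continue
        cluster += 1                     # start a new cluster at p
        labels[p] = cluster
        queue = [q for q in seeds if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:          # noise reachable from a core point
                labels[q] = cluster      # becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighborhood = neighbors(q)
            if len(q_neighborhood) >= min_pts:  # q is dense too: expand from it
                queue.extend(q_neighborhood)
    return labels
```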
To determine the optimal eps value, a method is proposed that consists of computing
the k-nearest neighbor distances for the set of points. The idea is to calculate the
average of the distances from every point to its k nearest neighbors. The value of k will
be specified by the user and corresponds to MinPts. Next, these k-distances are
plotted in an ascending order. The aim is to determine the “knee”, which corresponds
to the optimal eps parameter. A knee corresponds to a threshold where a sharp change
occurs along the k-distance curve. The function kNNdistplot() [in dbscan package]
can be used to draw the k-distance plot.
As it can be seen in Figure 4.8 the optimal eps value is around a distance of 0.15.
Figure 4.8: k-nearest neighbor distances to determine eps in DBSCAN
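The k-distance curve that kNNdistplot() draws boils down to a simple computation, sketched here in Python for illustration (the thesis uses the dbscan R package for this step):

```python
import math

def k_distances(points, k):
    """For every point, the average distance to its k nearest neighbors,
    returned in ascending order.  Plotting this curve and reading off the
    sharp 'knee' suggests an eps value for DBSCAN."""
    averages = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points if q != p)
        averages.append(sum(dists[:k]) / k)
    return sorted(averages)
```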
Function dbscan::dbscan() computes DBSCAN and provides an object of class
‘dbscan’ as a result.
4.3.4.3 Cluster Validity

In cluster analysis, the important question is how to evaluate the “goodness” of the
resulting clusters. To answer this question, we first have to know why we need to
evaluate clusters. There are several reasons, some of which are mentioned below:
To avoid finding patterns in noise
To compare clustering algorithms
To compare two sets of clusters
To compare two clusters
As outlined in section 2.1.6, numerical measures applied to judge various aspects of
cluster validity are classified into three types: internal, external and relative. In
this thesis study, the Silhouette Coefficient and Dunn’s Index from the internal
criteria and Entropy from the external criteria are selected in order to evaluate and
compare the different aspects of the clustering results. The formulas for these
indices are given in Table 4.5.
Table 4.5: Formulas for Silhouette, Dunn and Entropy indices

Name | Formula
Silhouette Coefficient (SC) | s(i) = (b(i) − a(i)) / max{a(i), b(i)}
Dunn’s Index | D = (min. distance between observations in different clusters) / (max. intra-cluster distance)
Entropy | E = − Σj pj log(pj)
The cluster.stats() function in the fpc package computes a number of distance-based
statistics which can be used for cluster validation, comparison between clusterings and
deciding on the number of clusters: cluster sizes, cluster diameters, average
distances within and between clusters, cluster separation, average silhouette widths,
the Calinski-Harabasz index, Hubert's gamma coefficient, the Dunn index,
entropy, and two indices that assess the similarity of two clusterings, namely the
corrected Rand index and Meila's VI.
The values of the selected indices obtained by applying this function to the different
clustering results are examined in the next chapter.
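Of the selected indices, the silhouette width is simple enough to compute directly. The sketch below (Python, for illustration; the thesis obtains these values from cluster.stats() in R) follows the standard definition, with the silhouette of a singleton cluster set to 0 by convention:

```python
import math

def silhouette(points, labels):
    """Mean silhouette width: for each point, a = mean distance to the other
    members of its own cluster, b = smallest mean distance to any other
    cluster; s = (b - a) / max(a, b).  Values near 1 indicate compact,
    well-separated clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)

    def mean_dist(p, group):
        return sum(math.dist(p, q) for q in group) / len(group)

    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q != p]
        if not own:                      # singleton cluster: s = 0
            scores.append(0.0)
            continue
        a = mean_dist(p, own)
        b = min(mean_dist(p, clusters[m]) for m in clusters if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```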
Chapter 5
5 Experiments and Discussion
In this chapter, all experiments and investigations that have been done for profiling
users engaged in talking about a particular event are explained one by one. The
process and the aspects considered are demonstrated, and the results are presented
in various forms such as charts, graphs and tables, and then discussed.
5.1 The Floating Piers Datasets

For sixteen days – June 18 through July 3, 2016 – Italy’s Lake Iseo was reimagined.
100,000 square meters of shimmering yellow fabric, carried by a modular floating
dock system of 220,000 high-density polyethylene cubes, undulated with the
movement of the waves as The Floating Piers rose just above the surface of the water.
Visitors were able to experience the work of art by walking on it from Sulzano to
Monte Isola and to the island of San Paolo, which was framed by The Floating Piers
(Claude 2016).
Figure 5.1: The Floating Piers (Project for Lake Iseo, Italy)
The dataset consists of two collections, one for tweets and Twitter users and one for
Instagram posts and users that are extracted through querying Twitter and Instagram
APIs from June 10 (one week before the event) to July 30 (four weeks after the
event). However, most of the analysis focuses on the data collected from the Twitter
APIs, while the Instagram data is used for comparison.
As discussed in the previous chapter, the data was imported into CSV files. The Twitter
dataset contains 14,062 tweets and 23,916 users (7,724 masters and 16,197
contributors) and the Instagram dataset contains 30,256 posts and 94,666 users
(16,681 masters and 77,985 contributors).
5.2 Reports of Analysis

This section presents the outcome of the analytics process that was applied to the
Floating Piers datasets on CSV files. It is divided into sub sections that present the
results in different stages of the analysis. Each one corresponds to the outcomes from
various methodologies that were applied and discussed in the previous chapters.
5.2.1 Content Specific Results

This section is dedicated to the results that are directly related to tweets or Instagram
posts collected from social media. These results can give us general insights on how
much attention the event received from Twitter and Instagram during the specific time
period.
5.2.1.1 Twitter

Result 1: Engagement in social media is defined as the total number of times a user
interacted with a tweet, including retweets, replies, follows, favorites, etc. We can get
deeper into engagements by noticing what specific types of engagements took place.
For instance, were they retweets or favorites? Retweets can be a sign of value.
Someone found a tweet valuable enough to share with their audience. Favorites can be
a sign of appreciation. A tweet resonated with someone else, and they wanted to give
a virtual high-five. Both metrics count as engagement. The diagram below shows
the total numbers of retweets and favorites that tweets about the Floating Piers
event gained between June 10 and July 30.
Figure 5.2: Twitter total retweets vs. favorites
As this line graph represents, the numbers of favorites and retweets start to
dramatically increase at the opening date (June 18) and have several fluctuations in
the following days until the closing date (July 3). After the event ends the number of
favorites and retweets gradually decrease to almost zero. The figure also shows that
the number of favorites consistently outweighs the number of retweets during this time
span, indicating that people were more inclined to like a post related to this event
than to share it with their followers.
Result 2: To discover the words (stems) and hashtags that occur frequently in
tweets, we use a word-cloud: a textual representation in which the displayed terms
are the words or hashtags used to address the event, sized in proportion to their
frequency in our Twitter data (see Figure 5.3). Evidently the most frequent words are “lake”,
“walk”, “float”, “water”, “christo”, “pier” and “iseo”, which represent the name of the
event, its artist’s name and the place where it was held. In addition, the most frequent
hashtags include “#christo”, “#floatingpiers”, “#iseo”, “#lagodiiseo” and “#iseolake”
which convey the same mentioned concepts about the event. Simply by viewing this
presentation, one can guess about the main topics that were discussed mostly in the
Floating Piers event, even in case he does not have any information about the event in
advance.
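The counts behind such a word-cloud are plain term frequencies. As an illustration (a Python sketch; the stop-word list and sample texts are hypothetical, and the thesis performs stemming and cleaning steps not repeated here):

```python
import re
from collections import Counter

def top_terms(texts, n=5, stopwords=frozenset({"the", "a", "at", "on"})):
    """Rank words and hashtags across a set of posts by frequency --
    the counts that set the font size of each term in a word-cloud."""
    counts = Counter(
        w
        for text in texts
        for w in re.findall(r"#?\w+", text.lower())  # keep '#' on hashtags
        if w not in stopwords
    )
    return counts.most_common(n)
```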
Result 3: Tweet geolocation feature allows us to find tweets that have been sent from
a specific location. This is useful as it makes tweets more contextual and helps event
holders to find leads relevant to the event and location. However, the proportion of
tweets that are attached to a location is still very low because a reasonable percentage
of the population has concerns about publicly exposing their exact location.
Figure 5.3: Most frequent words (a) and hashtags (b) in tweets
Considering our tweet dataset, we realized that a large majority of geo values in
tweets are null (over 98 percent) which confirms the above point. Therefore, it is
suggested to use the field “Place” instead of “Geo” because place is not an exact
location but an area or neighborhood. Nevertheless, only 16 percent of tweets were
geotagged by users and analyzed for the result shown in Figure 5.4. Hence, to have a
better understanding of tweet locations, we categorized tweets by their place and then
illustrated the number of tweets from the top 10 locations. As shown, the largest
number of tweets (over 800) came from “Sulzano”, the main venue of the event, while
the second and third highest counts originated from “Monte Iseo” and “Milan”. This
suggests that tweets were either posted from the location of the event or by people
who live or work in big cities nearby.
This could be a good hint to be considered by the relevant organizers for future
events.
Figure 5.4: Number of tweets for top 10 locations using the field “Place”
5.2.1.2 Instagram

Result 1: To have a clear intuition of the level of user engagement in Instagram, the
volume of likes and comments received by uploaded posts are depicted in Figure 5.5.
Unlike Twitter, the number of likes and comments in Instagram reached a peak at the
closing date of the event (July 3) not at the opening date. Another difference that can
be seen is that the number of likes in Instagram is near 250,000 which considerably
surpass likes count in Twitter. However, as demonstrated below, Instagram users are
more interested in liking the posts rather than commenting, that is why the number of
comments is much less than likes count and remains on a constant rate during the time
interval.
Figure 5.5: Instagram total likes vs. comments
Result 2: The most repeated words (stems) and hashtags that people used in their
posts in Instagram are presented in two word-clouds in Figure 5.6. As with
Twitter, the most frequent words and hashtags are clearly those particularly
relevant to the event, such as its name, its artist and its location.
Result 3: Having the longitude and latitude of each Instagram post (unlike Twitter
users, most Instagram users specify the location of their posts), we are able to show
the density of media posted in different locations using the get_map and ggmap
functions in R. The results are displayed in below maps including world, Italy,
Brescia and Sulzano. As one can see the density of posts has a direct relationship with
Figure 5.6: Most frequent words (a) and hashtags (b) in Instagram posts
their locality which means most Instagram media have been posted near the main
venue of the event or by the people who live or work near this place.
Figure 5.7: Distribution of Instagram posts in the world
Figure 5.8: Density of Instagram posts – Italy
Figure 5.9: Density of Instagram posts – Brescia
Figure 5.10: Density of Instagram posts – Sulzano
Result 4: The last result in this section addresses the comparison of the total number
of tweets and Instagram posts within a timeline.
As shown in Figure 5.11, people (both Twitter and Instagram users) started talking
about the Floating Piers before the opening day. It is obvious that people tended to
tweet more than post on Instagram during the first week of the event, but as time
passed people posted more Instagram media than tweets. Considering this, one could
conclude that Twitter users have a tendency to tweet about the news at the moment
when an event starts whereas Instagram users usually share their experiences when an
event ends.
As demonstrated, the number of tweets grew significantly on the day the event
began (near 2,000 tweets) and decreased quickly one or two days later. On the
other hand, the number of Instagram posts steadily fluctuated until the closing day
when it reached a peak with over 5,000 posts and had a rapid decline on the following
days. Not surprisingly, both tweeting and posting rates settled to a constant level a
few days after the event ended.
Figure 5.11: Tweets vs Instagram posts timeline
5.2.2 User Specific Results

In this section, we discuss the obtained results relevant to social media users such as
users’ categories and their activities during the mentioned time span. Given that we
focus on text analysis in this part of the thesis and since Instagram is technically a
photo-sharing social network, it is not surprising that the outcomes acquired from Instagram
do not seem desirable in this study. For this reason, we only concentrate on Twitter
users in this section. However, some distinguished differences between Twitter and
Instagram results are highlighted in each section.
5.2.2.1 Clustering Results Evaluation

As indicated in the previous chapter, three different clustering algorithms are chosen to be
applied on collected data from users’ textual properties specifically bio, hashtag,
status text and list slug. Therefore, each algorithm is performed on each collection
separately and then, to better understand the obtained results and find the best
clustering, three cluster validity measures are employed. Table 5.1 reports the
values of these measures for each algorithm and each feature.
Table 5.1: Evaluation results of cluster validation indices
Silhouette width and Dunn index combine measures of compactness and separation of
the clusters. Recall that the values of silhouette width range from -1 (poorly clustered
observations) to 1 (well clustered observations). The Dunn index is the ratio between
the smallest distances between observations not in the same cluster to the largest
intra-cluster distance. It takes values between 0 and infinity and should also be
maximized. Thus, algorithms that produce clusters with a high Dunn index and high
Silhouette width are more desirable. On the other hand, entropy is a metric that is a
measure of the amount of disorder in a vector. So, smaller values of entropy indicate
less disorder in a clustering, which means a better clustering.
According to the above facts and the table’s output, hierarchical clustering (three
clusters) can be considered the best algorithm, producing better results than the
other two. Furthermore, among the four examined textual properties, Bio yields the
most acceptable values. Consequently, as the table suggests, from now on we focus
only on hierarchical clustering performed on the users’ bio data.
5.2.2.2 Interpretations of Clusters

Result 1: The hierarchical algorithm returns a dendrogram, which is illustrated in Figure
5.12. To have a better insight, three clusters are drawn in different colors. Each leaf in
this tree is an indicator of a Twitter user engaged in the Floating Piers event through
tweeting, retweeting or favoriting a post.
Figure 5.12: Dendrogram representation of Twitter users
Result 2: The pie chart in Figure 5.13 shows the proportions of users in each cluster.
It can be seen that among these three clusters, nearly 60 percent of users lie in the
first cluster (green slice), over 35 percent in the second (blue slice) and the rest
(about 5 percent) in the third cluster (red slice).
Figure 5.13: The percentage of user engagement in each cluster
Result 3: As outlined in chapter 4, we used a topicProbabilities matrix with three
topics in the clustering analysis. This leads to a three-dimensional space in which
every point demonstrates the similarity between the textual properties of a user and
each topic. Because representing points in 3D may cause confusion, we plot them in
two dimensions covering two of the extracted topics (topic 1 and topic 2). Figure
5.14 shows the distribution of cluster
objects in a 2D representation.
Figure 5.14: 2D representation of cluster objects
In order to obtain the objects (users) of each cluster after clustering, we use the vector
returned by the cutree function as an index into our original data matrix.
Result 4: Having all the user objects in each cluster, we are able to label the obtained
clusters or in other words to identify the categories of users. To depict a weighted list
of the words that are used in users’ bio, hashtags, tweets texts and lists in each cluster,
we employ word-cloud which is a visual representation of text data. Below are visual
demonstrations of the users’ bio word-clouds (per cluster).
Figure 5.15: Word-cloud representation of first cluster based on bio
Figure 5.16: Word-cloud representation of second cluster based on bio
Figure 5.17: Word-cloud representation of third cluster based on bio
It can be seen that the most frequent words in each cluster convey specific meanings.
People in the first cluster mostly mention “Travel” when introducing themselves in
their Twitter bio, people in the second cluster appear to be “Art” lovers, and people
in the third cluster present themselves as “Technology” fans. Henceforth, we call the
users in the first, second and third clusters Travel Lovers, Art Lovers and Tech Lovers
respectively. Moreover, word-clouds of the other properties, namely hashtag, tweet
text and list slug, also confirm the validity of each cluster’s label (see below figures).
Figure 5.18: Hashtag word-cloud for Travel Lovers (a), Art Lovers (b) and Tech Lovers (c)
5.2.2.3 Comparison of Clusters

In this section, we investigate the differences between clusters’ numerical properties
namely the number of users’ followers, followings, favorites and tweets as well as a
few selected features and activities that users had within a timeline.
Result 1: One way to compare users in different clusters is to observe and evaluate
their numerical features. To paint a clearer picture of the users’ differences, we
categorize the values of these properties into four major groups:
less than 100 (low)
between 100 and 1000 (medium)
between 1000 and 10000 (high)
more than 10000 (very high)
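This bucketing can be expressed as a small helper (a Python sketch for illustration; treating the boundaries 100, 1000 and 10000 as belonging to the upper band is an assumption, since the text does not specify where the endpoints fall):

```python
def engagement_band(count):
    """Map a numerical profile property (followers, followings, favorites
    or tweets) to one of the four bands used in the comparison."""
    if count < 100:
        return "low"
    if count < 1000:
        return "medium"
    if count < 10000:
        return "high"
    return "very high"
```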
Figure 5.19: Tweet text word-cloud for Travel Lovers (a), Art Lovers (b) and Tech Lovers (c)
Figure 5.20: List slug word-cloud for Travel Lovers (a), Art Lovers (b) and Tech Lovers (c)
The above categories can be used for all numerical properties: the number of
followers, followings, favorites and tweets (see below bar charts).
As shown in Figure 5.21, the percentage of users with fewer than 100 followers is
highest in the Travel lovers cluster, whereas Art lovers most often have between 100
and 1000 followers and Tech lovers tend to exceed 1000 followers. In all three
clusters, however, the 100-to-1000-followers band contains the largest share of users.
Figure 5.21: Percentage of users whose number of followers lie in each category
Figure 5.22 illustrates that nearly 40 percent of Tech lovers have more than 1000
followings while only about 30 percent of Art lovers and Travel lovers fall into the
high and very high categories. Moreover, most users in the three clusters have 100 to 1000
followings (almost 60 percent) while the percentage of users with more than 10000
followings is minimum.
Figure 5.22: Percentage of users whose number of followings lie in each category
As the chart below shows, over 61 percent of Travel lovers favorited more than 1000
tweets posted by others, compared to approximately 55 percent of Art lovers and
Tech lovers.
Figure 5.23: Percentage of users whose number of favorites lie in each category
Looking at the figure below, Travel lovers have the highest percentage of users with
fewer than 100 tweets, while nearly 70 percent of Tech lovers posted more than 1000
tweets (28 percent of them more than 10000).
Figure 5.24: Percentage of users whose number of tweets lie in each category
Result 2: A box plot is a type of graphical display that can be used to summarize a set
of data based on the five number summary of this data. The summary statistics used to
create a box plot are the median of the data, the lower and upper quartiles (25% and
75%) and the minimum and maximum values. The box plot is an effective way to
investigate the distribution of a set of data.
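The five summary statistics behind a box plot are easy to compute directly. A minimal Python sketch, for illustration only (the quartile convention below, linear interpolation between order statistics, is an assumption, as several conventions exist and the thesis's plotting tool may use another):

```python
def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile and maximum --
    the five statistics a box plot displays."""
    xs = sorted(values)

    def quantile(q):
        # Linear interpolation between the two nearest order statistics.
        pos = q * (len(xs) - 1)
        lo, frac = int(pos), pos - int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + frac * (xs[hi] - xs[lo])

    return xs[0], quantile(0.25), quantile(0.5), quantile(0.75), xs[-1]
```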
As one can see in the box plots below, Tech lovers have the highest medians for the
numbers of followers and followings among the three clusters, and their medians for
the numbers of favorites and tweets also exceed those of the other groups.
Figure 5.25: Summary statistics of numbers of followers in each cluster
Figure 5.26: Summary statistics of numbers of followings in each cluster
Figure 5.27: Summary statistics of numbers of favorites in each cluster
Figure 5.28: Summary statistics of numbers of tweets in each cluster
Result 3: There are some other features like language and gender which help to
compare users in three clusters. As Figure 5.29 shows Italian is the most common
language of all users in all three clusters while second and third places belong to
English and other languages (French, Dutch, etc.). As one can see, the language
timelines follow the flow of tweets in all three clusters and peak on the opening day
of the event.
Figure 5.29: Language timeline per cluster
Figure 5.30 indicates that the number of males who got involved in the Floating Piers
outweighs the number of females. In addition, since Travel lovers form the largest
cluster, both male and female counts are highest in this category.
Figure 5.30: Gender timeline per cluster
The next diagram displays the number of users and their posted tweets over time for
each cluster separately. To avoid misinterpretation, we calculate a ratio that divides
the number of tweets by the number of users who posted them. The results are
represented in Figure 5.32. It is evident that the value of this ratio (in all three
clusters) stays between 1 and 2 during the event period, which shows that each user
posted fewer than three tweets per day during this time. Nevertheless, this ratio
fluctuated in the weeks after the closing day and even reached 5 in the Travel lovers
cluster.
Figure 5.31: Number of Users - Tweets timeline per cluster
Figure 5.32: Tweet – User ratio timeline per cluster
5.2.2.4 Active Users and Influencers

In this section we concentrate on the users whose online roles or engagements in
this event are more effective than the others. Indeed, each business has their own
unique audience identity, but that segmentation might not pan across each social
media network successfully. Instead, it takes better brand alignment, thought-out
social conversations and meaningful connections with the core group of loyalists.
Therefore, it truly pays to have your message reach the right people at the right time.
As previously mentioned one business goal in this specific event or generally in any
kind of event is to keep track of interested people and use this information as a
guideline for attracting new people with the similar interest in future events. To
achieve this goal, we use “active users” and “influencers” concepts in this thesis
study.
The definition of what “active” means depends on the service or website. Since all
master users extracted from the Floating Piers data posted at least one tweet or media,
they can all be considered active master users, but we are mostly interested in the
users whom this event kept highly engaged.
Result 1: Figure 5.33 and Figure 5.34 show this event’s top 20 active users for
Twitter and Instagram respectively. As can be seen, the most active Twitter user
posted more than 80 tweets about this event, while this number reaches 200 on
Instagram.
Figure 5.33: Twitter top 20 active users
Figure 5.34: Instagram top 20 active users
Result 2: In addition, active contributors are the other group of users whom we
intend to identify. The top 20 Twitter active contributors are listed in Figure 5.35.
The y-axis shows the total number of their retweets and favorites.
Figure 5.35: Twitter top 20 active contributors
Below is the visual representation of the list of top 20 active contributors (likers and
commenters) in Instagram.
Figure 5.36: Instagram top 20 active contributors
Result 3: Influencers are experts whose ideas and actions shape the opinion of like-
minded people. Influencers are not just celebrities who have millions of followers.
They can be people who are influential because they have expertise in a topic. When
these experts talk about products and services, people listen. In fact a recent study
found that 49% of Twitter users said they rely on recommendations from influencers
(Little Bird 2017). Typically, influencers have a large following on social media and
high engagement through retweets, comments, etc. By identifying and building
relationships with influencers we gain many benefits: getting our content shared,
forming partnerships, generating business and much more.
Twitter users form a Social Network. If depicted in a graph, they would be
represented by nodes. The edges that connect these nodes are the relations of
“Follower-Following”, introduced by Twitter. Obviously, some users are more
influential than others. The methodology for calculating the importance and influence
that a user has in an Online Social Network (OSN) is presented here. That
measurement should not depend merely on the number of “Followers” of a user, even
if that number is big enough and the user’s tweets are received by a large number of
other users (followers). If the number of “Following” is larger, the user
could be characterized as a “passive” one. Such users are regarded as those
who are keener on viewing or being informed through tweets rather than composing
new ones. Therefore, a suitable factor is the ratio of “Followers to Following” (FtF
ratio) (Gerasimos Razis 2014).
FtF ratio = log10(#Followers / #Followings + 1)
The FtF ratio is placed inside a base-10 logarithm to dampen outlier values.
Moreover, 1 is added to the ratio so as to avoid the metric being equal to 0 in cases
where the number of “Followers” equals the number of “Following”. Using this ratio
we identify the top 10 influencers on Twitter (see Figure 5.37).
Figure 5.37: Twitter top 10 influencers using FtF ratio
5 Blue ticks are indicators of Twitter verified users
According to the above chart, the majority of influencers are companies such as news
agencies, whose follower counts vastly outweigh their following counts. Besides, a
glance over their profiles reveals that their tweet creation rate (TCR) is usually high
regardless of the topic they tweet about. So, these influencers cannot have a great
impact on other users or spread the word about the Floating Piers event on Twitter.
That is why the FtF ratio does not seem sufficient in our study.
Another important factor, proposed in this thesis study, is the User Tweet Weight (UTW), defined as the ratio of a user's activity (the sum of retweets and favorites their tweets received) to the user's number of tweets during the event.

UTW ratio = log10(∑(#Favorites + #Retweets) / #Tweets)

Figure 5.38 shows the top 10 Twitter influencers by UTW ratio.
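A minimal sketch of the UTW ratio in Python, assuming per-tweet favorite and retweet counts are available (the function and argument names are illustrative):

```python
import math

def utw_ratio(favorites, retweets):
    """User Tweet Weight: total engagement (favorites + retweets)
    per tweet during the event, on a base-10 log scale.

    favorites[i] and retweets[i] are the counts for the user's i-th
    tweet. Assumes the user posted at least one tweet and received
    some engagement, so the argument of the logarithm is positive.
    """
    engagement = sum(f + r for f, r in zip(favorites, retweets))
    return math.log10(engagement / len(favorites))
```

For example, a user whose two tweets gathered 30 favorites and 10 retweets in total scores log10(40 / 2) = log10(20) ≈ 1.3.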
Figure 5.38: Twitter top 10 influencers using UTW ratio
As one can see in this chart, most influencers are real figures or celebrities whose tweets about the event attracted the attention of a great many people. A look at their profiles shows that although their TCR is often low compared to the previous influencers', their tweets attract more users, which translates into more influence on the part of the network relevant to the event.
Now consider the case where two users have nearly the same FtF ratio: obviously, the user with the higher UTW ratio has more impact on the network. In our methodology, the final Influence Ratio is therefore computed by combining the two ratios through multiplication.
Influence Ratio = FtF ratio * UTW ratio
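Putting the two metrics together, ranking users by the combined score can be sketched as follows (the user records and figures below are made up for illustration, not drawn from the event dataset):

```python
import math

def influence_ratio(followers, following, favorites, retweets, n_tweets):
    """Combined score: FtF ratio multiplied by UTW ratio."""
    ftf = math.log10(followers / following + 1)
    utw = math.log10((favorites + retweets) / n_tweets)
    return ftf * utw

# Hypothetical users: (name, followers, following, favorites, retweets, tweets)
users = [
    ("news_agency", 500_000, 50, 2_000, 1_000, 400),   # high FtF, low UTW
    ("celebrity", 200_000, 300, 90_000, 60_000, 20),   # moderate FtF, high UTW
]
ranked = sorted(users, key=lambda u: influence_ratio(*u[1:]), reverse=True)
```

On these made-up figures the celebrity outranks the news agency: its lower FtF ratio is more than compensated by the engagement its few tweets attracted, which mirrors the behavior described in the text.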
The bar chart below illustrates the Twitter top 10 influencers using the final Influence Ratio.
Figure 5.39: Twitter top 10 influencers using Influence Ratio
Taking both ratios together yields a much more meaningful list of influencers. As can be seen, all the real figures are famous people who not only have a great many followers but whose tweets also drew plenty of users to this event. The same applies to the news agencies identified as influencers in the figure above. This confirms the suitability of our suggested metric for finding the most influential people in the Floating Piers event.
This metric is also helpful for finding the top influencers of each cluster obtained in the previous section (see Figure 5.40).
Figure 5.40: Twitter top 10 influencers per cluster using Influence Ratio
Result 4: Figure 5.41 shows the number of followers of the top 10 influencers in each cluster. Given the previous result (the top 10 influencers in each cluster), it is expected that the follower counts in the first two clusters (Travel Lovers and Art Lovers) have the highest values, since their influence ratios exceed the Tech Lovers' ratios. As shown, even the tenth influencer in the first two groups has more followers (about 200,000) than the first influencer in the Tech Lovers group (nearly 120,000).
Figure 5.41: Number of followers of top 10 influencers in each cluster
As the figure below shows, the top 10 influencers in the Travel Lovers and Art Lovers clusters have fewer followings than those in the Tech Lovers cluster. This confirms that the number of followings plays a less important role in the influence ratio than the number of followers.
Figure 5.42: Number of followings of top 10 influencers in each cluster
Figure 5.43 indicates the total number of tweets posted by the top 10 influencers in each cluster. According to these graphs, Travel Lovers' influencers posted more tweets than Art Lovers' influencers, who in turn posted more than the Tech Lovers' influencers. Since the influencers in each cluster can be considered representatives of the whole group, we can conclude that Travel Lovers had the greatest effect on the event in the network. Not surprisingly, this result can also be derived from the influence ratio outcomes.
Figure 5.43: Number of tweets of top 10 influencers in each cluster
To close this section, we display the total number of tweets that the influencers in each cluster marked as favorites. Evidently, the Tech Lovers liked more tweets than the other groups, although this result does not convey any meaning regarding their influence ratio.
Figure 5.44: Number of favorites of top 10 influencers in each cluster
Chapter 6
6 Conclusions
6.1 Summary

The main focus of this research work was to categorize the social media users involved in a specific event, along with analyzing the dynamics and different aspects of the event itself. Using two platforms (Twitter and Instagram) and considering most of the valuable properties of each tweet/post and user gives us confidence in the outcomes of this study, although this comes at the expense of more data extraction and data analysis.
In this thesis we employed different methods and approaches to provide the most adequate results in each step of the analysis, and the final outcomes of our research endorse this claim. We proposed an approach that helps event organizers decide what categories of users they are dealing with and how those users can be reached through different social media networks. A better understanding of the characteristics of the users who are most likely to be interested in similar events in the future will help organizers allocate resources to the right places and target the right groups of people in advertising.
The analyses used in this study, which can be applied to any other event, can be divided into two main categories: statistical analysis of the geometry of the event, and clustering of the users into two or more textually separated groups, together with a representative review of each group and a comparison of their properties.
We used data from "The Floating Piers" event as a case study in this research to show how the proposed approach works with real-life datasets. After characterizing the event, we categorized users based on their interests into three main groups and then described and compared the behavior and properties of each category.
6.2 Critical Discussion

This study concentrates on analyzing online users engaged in an event and finding a meaningful relationship between the members of each group, one that not only relates them internally but also introduces a clear boundary between clusters. Obviously, the final results may differ based on the quality and quantity of the initial data, but we showed that with a dataset of reasonable size we can examine the similarities and differences between users and obtain the best6 clusters. In addition to the carefully developed and evaluated approach, several other aspects need to be taken into consideration to obtain better final results. Language detection and translation of all textual properties are not easy tasks, but they guarantee the equal treatment of each entity and full coverage of the dataset (the common practice in most research is to ignore all entities with non-English textual features). Besides, adding a critical feature like gender, which is not provided by the original platforms (Twitter and Instagram), gives us a better understanding of the users in different clusters and elucidates the distinctive behaviors of disparate social media platforms.
6.3 Possible Future Works

Collecting as much data as possible about an event and the users involved in it can help event organizers schedule future events more properly and gain a more comprehensive perspective on logistics and advertising. The current study could be extended by predicting future users who might be interested in similar kinds of events, by analyzing their features and activities and comparing them with the users of the current event. Furthermore, considering other social media platforms such as Facebook, Google Plus, Flickr, Foursquare, etc. might result in a clearer and wider picture of the characteristics and geometry of the users and the event. Last but not least, deploying other techniques like semantic analysis, image processing and network analysis can also help improve the accuracy and coverage of the results and open a new window onto a better understanding of the event.
6 Based on cluster validity measurement
Bibliography
Ahmed, Wasim. Using Twitter as a data source: An overview of current social media
research tools. The London School of Economics and Political Science. July 2015.
http://blogs.lse.ac.uk/impactofsocialsciences/2015/07/10/social-media-research-tools-
overview/.
Alessandro Bozzon, Marco Brambilla, Stefano Ceri, Matteo Silvestri, Giuliano Vesci.
"Choosing the Right Crowd: Expert Finding in Social Networks." EDBT '13
Proceedings of the 16th International Conference on Extending Database
Technology. Genoa, 2013. 637-648.
Äyrämö, Sami. KDD Process Steps, Lecture 4. Finland: University of Jyväskylä, 2007.
Bettina Grun, Kurt Hornik. "topicmodels: An R Package for Fitting Topic Models." Journal
of Statistical Software, 2011.
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schutze. Introduction to Information
Retrieval. Cambridge University Press, 2008.
Christo and Jeanne-Claude. The Floating Piers. 2016. http://www.thefloatingpiers.com/the-project.
Diao, Qiming. "Event Identification and Analysis on Twitter." Singapore Management
University, 2015.
Eréndira Rendón, Itzel Abundez, Alejandra Arizmendi, Elvia M. Quiroz. "Internal versus External cluster validation indexes." International Journal of Computers and Communications 5, no. 1 (2011): 8.
Friedemann, Vanessa. "Clustering a Customer Base Using Twitter Data." 2015.
Gerasimos Razis, Ioannis Anagnostopoulos. "InfluenceTracker: Rating the impact of a
Twitter account." 2014.
Gonzalo Mariscal, Oscar Marbán, Covadonga Fernández. "A survey of data mining and
knowledge discovery process models and methodologies." The Knowledge
Engineering Review, 2010: 31.
Hees, Maarten van. "Web-based automatic translation: the Yandex.Translate API." Leiden
Institute of Advanced Computer Science (LIACS). Leiden, 2015.
Instagram. 2017. https://www.instagram.com/developer/.
José Luis Díaz, Manuel Herrera, Joaquín Izquierdo, Rafael Pérez-García. "The tasks of pre- and post-processing in Data Mining applied to a real world problem." International Congress on Environmental Modelling and Software. Ottawa, 2010.
Kabacoff, Robert I. 2017. http://www.statmethods.net/stats/regression.html.
Ken Kelley, Keke Lai, Po-Ju Wu. "Using R for data analysis: A best practice for research." In
Best Practices in Quantitative Methods, 38. 2008.
Kuldeep Singh, Harish Kumar Shakya, Bhaskar Biswas. "Clustering of people in social network based on textual similarity." ELSEVIER, 2016.
Little Bird. 2017. http://www.getlittlebird.com/.
Luna, Elizabeth de. Eventbrite. September 22, 2016. https://www.eventbrite.com/blog/flickr-
instagram-event-photos-ds00/.
Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis. "On Clustering Validation
Techniques." Journal of Intelligent Information Systems, 2001: 39.
NamSor Applied Onomastics. 2017. http://www.namsor.com/.
Narumol Prangnawarat, Ioana Hulpuş, Conor Hayes. "Event Analysis in Social Media Using
Clustering of Heterogeneous Information Networks." Proceedings of the Twenty-
Eighth International Florida Artificial Intelligence Research Society Conference.
2015.
Oded Maimon, Lior Rokach. Data Mining and Knowledge Discovery Handbook. Springer
New York Dordrecht Heidelberg London, 2010.
Ottosson, Therese. The representation of gender roles in the media. Trollhättan: University
West, 2012.
Department for Work and Pensions. "The Use of Social Media for Research and Analysis: A Feasibility Study." Government Social Research, no. 13 (2014): 62.
Pew Research Center. January 12, 2017. http://www.pewinternet.org/fact-sheet/.
Pienaar, Wikus. "Spelling Checker-based Language Identification for the Eleven Official
South African Languages." First Annual Symposium of the Pattern Recognition
Association of South Africa. Stellenbosch, 2010. 213–216.
Ponweiser, Martin. "Latent Dirichlet Allocation in R." Institute for Statistics and
Mathematics, 2012.
National Laboratory of Pattern Recognition. "Clustering Users in Twitter Based on Interests." 2015.
Representational state transfer. 2017.
https://en.wikipedia.org/wiki/Representational_state_transfer.
Rizwana Irfan, Christine K. King, Daniel Grages, Hongxiang Li. "A Survey on Text Mining
in Social Networks." The Knowledge Engineering Review, 2015: 15.
Rob Procter, Alex Voss, Ilia Lvov. "Audience research and social media data." Participations
12, no. 1 (2015): 24.
Rugved Deshpande, Ketan Vaze, Suratsingh Rathod, Tushar Jarhad. "Comparative Study of
Document Similarity Algorithms and Clustering Algorithms for Sentiment Analysis."
International Journal of Emerging Trends & Technology in Computer Science 3, no.
5 (2014): 4.
Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto, Niloy Ganguly, Krishna P. Gummadi.
"Cognos: Crowdsourcing Search for Topic Experts in Microblogs." Proceedings of
the 35th International ACM SIGIR Conference on Research and Development in
Information Retrieval. 2012.
Sfetcu, Nicolae. Web 2.0/Social media/Social networks. 2017.
Slava Kisilevich, Milos Krstajic, Daniel Keim, Natalia Andrienko, Gennady Andrienko.
"Event-based analysis of people’s activities and behavior using Flickr." 14th
International Conference on Information Visualisation. London, 2010.
Sorensen, Lene Tolstrup. "User managed trust in social networking: comparing Facebook, MySpace and LinkedIn." In Proceedings of 1st International Conference on Wireless
Communication, Vehicular Technology, Information Theory and Aerospace &
Electronic System Technology. 2009.
Sorokina, Olsy. Hootsuite. December 4, 2014. https://blog.hootsuite.com/photo-sharing-
platforms-for-business/.
Statista. 2016. https://www.statista.com/statistics/.
Statistical Data Analysis. n.d. https://stat.ethz.ch/R-manual/R-
devel/library/stats/html/hclust.html.
Toon Van Craenendonck, Hendrik Blockeel. "Using Internal Validity Measures to Compare
Clustering Algorithms." 2015.
TrackMaven. 2016. https://trackmaven.com/.
Twitter Developer Documentation. 2017. https://dev.twitter.com/docs.
Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. "From Data Mining to
Knowledge Discovery in Databases." AI Magazine, 1996: 18.
Villiers, Francois de. Constructing Topic-based Twitter Lists. 2013.
Wael H. Gomaa, Aly A. Fahmy. "A Survey of Text Similarity Approaches." International
Journal of Computer Applications 68, no. 13 (2013): 6.
Wikipedia. 2016. https://en.wikipedia.org/wiki/.
Yandex. Yandex Translate API documentation. 2017.
https://tech.yandex.com/translate/doc/intro/concepts/how-works-machine-translation-
docpage/.