Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Identifying Twitter Spam by UtilizingRandom Forests
Humza HaiderDivision of Science and Mathematics
University of Minnesota Morris
2017-04-15
1
Introduction
I Top social media platforms
I 500 million tweets per dayI Attracts spammers and malicious usersI Twitter spam: Any unsolicited, repeated actions that negatively
impact other usersI How can we identify spammers?
I Manual classificationI URL blacklistingI Machine learning classification
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
1
Introduction
I Top social media platformsI 500 million tweets per day
I Attracts spammers and malicious usersI Twitter spam: Any unsolicited, repeated actions that negatively
impact other usersI How can we identify spammers?
I Manual classificationI URL blacklistingI Machine learning classification
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
1
Introduction
I Top social media platformsI 500 million tweets per dayI Attracts spammers and malicious users
I Twitter spam: Any unsolicited, repeated actions that negativelyimpact other users
I How can we identify spammers?I Manual classificationI URL blacklistingI Machine learning classification
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
1
Introduction
I Top social media platformsI 500 million tweets per dayI Attracts spammers and malicious usersI Twitter spam: Any unsolicited, repeated actions that negatively
impact other users
I How can we identify spammers?I Manual classificationI URL blacklistingI Machine learning classification
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
1
Introduction
I Top social media platformsI 500 million tweets per dayI Attracts spammers and malicious usersI Twitter spam: Any unsolicited, repeated actions that negatively
impact other usersI How can we identify spammers?
I Manual classificationI URL blacklistingI Machine learning classification
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
1
Introduction
I Top social media platformsI 500 million tweets per dayI Attracts spammers and malicious usersI Twitter spam: Any unsolicited, repeated actions that negatively
impact other usersI How can we identify spammers?
I Manual classification
I URL blacklistingI Machine learning classification
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
1
Introduction
I Top social media platformsI 500 million tweets per dayI Attracts spammers and malicious usersI Twitter spam: Any unsolicited, repeated actions that negatively
impact other usersI How can we identify spammers?
I Manual classificationI URL blacklisting
I Machine learning classification
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
1
Introduction
I Top social media platformsI 500 million tweets per dayI Attracts spammers and malicious usersI Twitter spam: Any unsolicited, repeated actions that negatively
impact other usersI How can we identify spammers?
I Manual classificationI URL blacklistingI Machine learning classification
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
2
Outline
BackgroundDecision TreesRandom ForestsModel Evaluation
MethodsTweet and User Content FeaturesGeo-Tagged FeaturesTime Features
Results
Conclusion
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
3
BackgroundDecision Trees
I Decision Trees
I Machine learning technique for classificationI Classifies an observation based on features available in a dataset
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
3
BackgroundDecision Trees
I Decision TreesI Machine learning technique for classification
I Classifies an observation based on features available in a dataset
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
3
BackgroundDecision Trees
I Decision TreesI Machine learning technique for classificationI Classifies an observation based on features available in a dataset
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
4
BackgroundDecision Trees
URL Account Age Reported ClassNo Old Yes Not SpamNo Old Yes Not SpamNo Old No Not SpamNo New No Not SpamYes New Yes SpamNo New Yes SpamNo Old Yes SpamYes New No Not Spam
......
......
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
5
BackgroundDecision Trees
URL Account Age Reported ClassNo Old Yes Not SpamNo Old Yes Not SpamNo Old No Not SpamNo New No Not SpamYes New Yes SpamNo New Yes SpamNo Old Yes SpamYes New No Not Spam
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
6
BackgroundDecision Trees
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
7
BackgroundDecision Trees
URL Account Age Reported ClassNo Old Yes Not SpamNo Old Yes Not SpamNo Old No Not SpamNo New No Not SpamYes New Yes SpamNo New Yes SpamNo Old Yes SpamYes New No Not Spam
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
8
BackgroundDecision Trees
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
9
BackgroundDecision Trees
URL Account Age Reported ClassNo Old Yes Not SpamNo Old Yes Not SpamNo Old No Not SpamNo New No Not Spam
Yes New Yes SpamNo New Yes SpamNo Old Yes SpamYes New No Not Spam
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
10
BackgroundDecision Trees
URL Account Age Reported ClassNo Old Yes Not SpamNo Old Yes Not SpamNo Old No Not SpamNo New No Not Spam
Yes New Yes SpamNo New Yes SpamNo Old Yes SpamYes New No Not Spam
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
11
BackgroundDecision Trees
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
12
BackgroundDecision Trees
URL Account Age Reported ClassNo Old Yes Not SpamNo Old Yes Not SpamNo Old No Not SpamNo New No Not Spam
Yes New Yes SpamNo New Yes SpamNo Old Yes SpamYes New No Not Spam
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
13
BackgroundDecision Trees
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
14
BackgroundDecision Trees
URL Account Age Reported ClassYes Old Yes TBD
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
15
BackgroundDecision Trees
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
15
BackgroundDecision Trees
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
15
BackgroundDecision Trees
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
16
BackgroundDecision Trees
I How are splits decided?
I EntropyI Information Gain
I Trees seem pretty neat! Why do I need a whole forest?I Disagreement in decisions between different trees
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
16
BackgroundDecision Trees
I How are splits decided?I Entropy
I Information GainI Trees seem pretty neat! Why do I need a whole forest?
I Disagreement in decisions between different trees
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
16
BackgroundDecision Trees
I How are splits decided?I EntropyI Information Gain
I Trees seem pretty neat! Why do I need a whole forest?I Disagreement in decisions between different trees
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
16
BackgroundDecision Trees
I How are splits decided?I EntropyI Information Gain
I Trees seem pretty neat! Why do I need a whole forest?
I Disagreement in decisions between different trees
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
16
BackgroundDecision Trees
I How are splits decided?I EntropyI Information Gain
I Trees seem pretty neat! Why do I need a whole forest?I Disagreement in decisions between different trees
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
17
BackgroundDecision Trees
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
17
BackgroundDecision Trees
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
17
BackgroundDecision Trees
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
17
BackgroundDecision Trees
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
18
BackgroundRandom Forests
I How do we handle disagreement?
I Train many trees on samples of the data (Bagging)I Don’t let trees access all the features (Feature Bagging)
I After we make a bunch of trees, how do we combine them?I Majority vote
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
18
BackgroundRandom Forests
I How do we handle disagreement?I Train many trees on samples of the data (Bagging)
I Don’t let trees access all the features (Feature Bagging)I After we make a bunch of trees, how do we combine them?
I Majority vote
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
18
BackgroundRandom Forests
I How do we handle disagreement?I Train many trees on samples of the data (Bagging)I Don’t let trees access all the features (Feature Bagging)
I After we make a bunch of trees, how do we combine them?I Majority vote
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
18
BackgroundRandom Forests
I How do we handle disagreement?I Train many trees on samples of the data (Bagging)I Don’t let trees access all the features (Feature Bagging)
I After we make a bunch of trees, how do we combine them?
I Majority vote
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
18
BackgroundRandom Forests
I How do we handle disagreement?I Train many trees on samples of the data (Bagging)I Don’t let trees access all the features (Feature Bagging)
I After we make a bunch of trees, how do we combine them?I Majority vote
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
19
BackgroundRandom Forests
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
Source: https://i.ytimg.com/vi/ajTc5y3OqSQ/hqdefault.jpg
20
BackgroundModel Evaluation
I How do we evaluate a random forest’s performance?
True PositiveSpam
Spam
False Positive
Not Spam
False NegativeNot Spam True Negative
Pre
dict
ion
Truth
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
20
BackgroundModel Evaluation
I How do we evaluate a random forest’s performance?
True PositiveSpam
Spam
False Positive
Not Spam
False NegativeNot Spam True Negative
Pre
dict
ion
Truth
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
21
BackgroundModel Evaluation
Accuracy
I “How many tweets were correctly identified?”
True PositiveSpam
Spam
False Positive
Not Spam
False NegativeNot Spam True Negative
Pre
dict
ion
Truth
Accuracy
Accuracy =TP + TN
TP + TN + FP + FN
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
21
BackgroundModel Evaluation
Accuracy
I “How many tweets were correctly identified?”
True PositiveSpam
Spam
False Positive
Not Spam
False NegativeNot Spam True Negative
Pre
dict
ion
Truth
Accuracy
Accuracy =TP + TN
TP + TN + FP + FN
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
22
BackgroundModel Evaluation
Precision (p)
I “How good is our spam prediction?"
True PositiveSpam
Spam
False Positive
Not Spam
False NegativeNot Spam True Negative
Pre
dict
ion
Truth
Precision
Precision =TP
TP + FP
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
22
BackgroundModel Evaluation
Precision (p)
I “How good is our spam prediction?"
True PositiveSpam
Spam
False Positive
Not Spam
False NegativeNot Spam True Negative
Pre
dict
ion
Truth
Precision
Precision =TP
TP + FP
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
23
BackgroundModel Evaluation
Recall (r )
I “How much spam was identified?"
True PositiveSpam
Spam
False Positive
Not Spam
False NegativeNot Spam True Negative
Pre
dict
ion
Truth
Recall
Recall =TP
TP + FN
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
23
BackgroundModel Evaluation
Recall (r )
I “How much spam was identified?"
True PositiveSpam
Spam
False Positive
Not Spam
False NegativeNot Spam True Negative
Pre
dict
ion
Truth
Recall
Recall =TP
TP + FN
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
24
BackgroundModel Evaluation
I F-measure (F )I Harmonic mean of Precision and RecallI Equally weights both Precision and Recall
F-Measure
F-measure = 2 · Precision · RecallPrecision + Recall
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
25
Outline
BackgroundDecision TreesRandom ForestsModel Evaluation
MethodsTweet and User Content FeaturesGeo-Tagged FeaturesTime Features
Results
Conclusion
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
26
MethodsTweet and User Content Features
I Chen et al. identify tweets, as opposed to users
I Utilized 12 features directly accessible from a tweetI 6 user featuresI 6 tweet features
I User FeaturesI Age in days of accountI Number of followers, followeesI Number of tweets
I Tweet FeaturesI Number of hashtags (#)I Number of mentionsI Number of URLs
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
26
MethodsTweet and User Content Features
I Chen et al. identify tweets, as opposed to usersI Utilized 12 features directly accessible from a tweet
I 6 user featuresI 6 tweet features
I User FeaturesI Age in days of accountI Number of followers, followeesI Number of tweets
I Tweet FeaturesI Number of hashtags (#)I Number of mentionsI Number of URLs
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
26
MethodsTweet and User Content Features
I Chen et al. identify tweets, as opposed to usersI Utilized 12 features directly accessible from a tweet
I 6 user features
I 6 tweet featuresI User Features
I Age in days of accountI Number of followers, followeesI Number of tweets
I Tweet FeaturesI Number of hashtags (#)I Number of mentionsI Number of URLs
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
26
MethodsTweet and User Content Features
I Chen et al. identify tweets, as opposed to usersI Utilized 12 features directly accessible from a tweet
I 6 user featuresI 6 tweet features
I User FeaturesI Age in days of accountI Number of followers, followeesI Number of tweets
I Tweet FeaturesI Number of hashtags (#)I Number of mentionsI Number of URLs
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
26
MethodsTweet and User Content Features
I Chen et al. identify tweets, as opposed to usersI Utilized 12 features directly accessible from a tweet
I 6 user featuresI 6 tweet features
I User FeaturesI Age in days of accountI Number of followers, followeesI Number of tweets
I Tweet FeaturesI Number of hashtags (#)I Number of mentionsI Number of URLs
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
26
MethodsTweet and User Content Features
I Chen et al. identify tweets, as opposed to usersI Utilized 12 features directly accessible from a tweet
I 6 user featuresI 6 tweet features
I User FeaturesI Age in days of accountI Number of followers, followeesI Number of tweets
I Tweet FeaturesI Number of hashtags (#)I Number of mentionsI Number of URLs
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
27
MethodsTweet and User Content Features
I Two sets of testing data.
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
28
Outline
BackgroundDecision TreesRandom ForestsModel Evaluation
MethodsTweet and User Content FeaturesGeo-Tagged FeaturesTime Features
Results
Conclusion
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
29
MethodsGeo-Tagged Features
I So, what is a geo-tagged tweet?
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
29
MethodsGeo-Tagged Features
I So, what is a geo-tagged tweet?
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
30
MethodsGeo-Tagged Features
I Guo and Chen identify non-personal usersI SpammersI BotsI Business accounts
I Features:I Tweeting Speed
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
30
MethodsGeo-Tagged Features
I Guo and Chen identify non-personal usersI SpammersI BotsI Business accounts
I Features:
I Tweeting Speed
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
30
MethodsGeo-Tagged Features
I Guo and Chen identify non-personal usersI SpammersI BotsI Business accounts
I Features:I Tweeting Speed
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
31
MethodsGeo-Tagged Features
I 19.6 Miles
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
31
MethodsGeo-Tagged Features
I 19.6 MilesI From Clontarf at 7:53 PM
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
31
MethodsGeo-Tagged Features
I 19.6 MilesI From Clontarf at 7:53 PMI From Morris at 8:00 PM
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
31
MethodsGeo-Tagged Features
I 19.6 MilesI From Clontarf at 7:53 PMI From Morris at 8:00 PMI Tweeting speed = 19.6 miles
7 minutes = 2.8 miles per minute (168 MPH)
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
32
MethodsGeo-Tagged Features
I Features:
I Max SpeedI Mean SpeedI Max Distance (connected to Max Speed)I Mean number of times a user exceeds 90 MPH per month
I County based features
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
32
MethodsGeo-Tagged Features
I Features:I Max Speed
I Mean SpeedI Max Distance (connected to Max Speed)I Mean number of times a user exceeds 90 MPH per month
I County based features
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
32
MethodsGeo-Tagged Features
I Features:I Max SpeedI Mean Speed
I Max Distance (connected to Max Speed)I Mean number of times a user exceeds 90 MPH per month
I County based features
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
32
MethodsGeo-Tagged Features
I Features:I Max SpeedI Mean SpeedI Max Distance (connected to Max Speed)
I Mean number of times a user exceeds 90 MPH per monthI County based features
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
32
MethodsGeo-Tagged Features
I Features:I Max SpeedI Mean SpeedI Max Distance (connected to Max Speed)I Mean number of times a user exceeds 90 MPH per month
I County based features
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
32
MethodsGeo-Tagged Features
I Features:I Max SpeedI Mean SpeedI Max Distance (connected to Max Speed)I Mean number of times a user exceeds 90 MPH per month
I County based features
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
33
I Number of times a user crosses county borders per monthI Mean number of counties a user has been to per month
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
33
I Number of times a user crosses county borders per month
I Mean number of counties a user has been to per month
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
33
I Number of times a user crosses county borders per monthI Mean number of counties a user has been to per month
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
34
Outline
BackgroundDecision TreesRandom ForestsModel Evaluation
MethodsTweet and User Content FeaturesGeo-Tagged FeaturesTime Features
Results
Conclusion
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
35
MethodsTime Features
I Washha et al. classify spammers on a user level
I Motivated to use time since altering time dependent features is achallenge.
I Features that spammers can easily manipulate:I Number of URLsI Number of HashtagsI Including Geo-tags
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
35
MethodsTime Features
I Washha et al. classify spammers on a user levelI Motivated to use time since altering time dependent features is a
challenge.
I Features that spammers can easily manipulate:I Number of URLsI Number of HashtagsI Including Geo-tags
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
35
MethodsTime Features
I Washha et al. classify spammers on a user levelI Motivated to use time since altering time dependent features is a
challenge.I Features that spammers can easily manipulate:
I Number of URLsI Number of HashtagsI Including Geo-tags
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
35
MethodsTime Features
I Washha et al. classify spammers on a user levelI Motivated to use time since altering time dependent features is a
challenge.I Features that spammers can easily manipulate:
I Number of URLs
I Number of HashtagsI Including Geo-tags
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
35
MethodsTime Features
I Washha et al. classify spammers on a user levelI Motivated to use time since altering time dependent features is a
challenge.I Features that spammers can easily manipulate:
I Number of URLsI Number of Hashtags
I Including Geo-tags
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
35
MethodsTime Features
I Washha et al. classify spammers on a user levelI Motivated to use time since altering time dependent features is a
challenge.I Features that spammers can easily manipulate:
I Number of URLsI Number of HashtagsI Including Geo-tags
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
36
MethodsTime Features
FeaturesI Differences in Account Age
I Spammers have multiple accountsI Likely to be made at the same timeI FollowersI FolloweesI Bi-directional relationships
I Time weighted correlations:I URLsI MentionsI Hashtags
I Tweet similarity weighted by time
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
36
MethodsTime Features
FeaturesI Differences in Account Age
I Spammers have multiple accounts
I Likely to be made at the same timeI FollowersI FolloweesI Bi-directional relationships
I Time weighted correlations:I URLsI MentionsI Hashtags
I Tweet similarity weighted by time
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
36
MethodsTime Features
FeaturesI Differences in Account Age
I Spammers have multiple accountsI Likely to be made at the same time
I FollowersI FolloweesI Bi-directional relationships
I Time weighted correlations:I URLsI MentionsI Hashtags
I Tweet similarity weighted by time
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
36
MethodsTime Features
FeaturesI Differences in Account Age
I Spammers have multiple accountsI Likely to be made at the same timeI Followers
I FolloweesI Bi-directional relationships
I Time weighted correlations:I URLsI MentionsI Hashtags
I Tweet similarity weighted by time
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
36
MethodsTime Features
FeaturesI Differences in Account Age
I Spammers have multiple accountsI Likely to be made at the same timeI FollowersI Followees
I Bi-directional relationshipsI Time weighted correlations:
I URLsI MentionsI Hashtags
I Tweet similarity weighted by time
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
36
MethodsTime Features
FeaturesI Differences in Account Age
I Spammers have multiple accountsI Likely to be made at the same timeI FollowersI FolloweesI Bi-directional relationships
I Time weighted correlations:I URLsI MentionsI Hashtags
I Tweet similarity weighted by time
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
36
MethodsTime Features
FeaturesI Differences in Account Age
I Spammers have multiple accountsI Likely to be made at the same timeI FollowersI FolloweesI Bi-directional relationships
I Time weighted correlations:
I URLsI MentionsI Hashtags
I Tweet similarity weighted by time
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
36
MethodsTime Features
FeaturesI Differences in Account Age
I Spammers have multiple accountsI Likely to be made at the same timeI FollowersI FolloweesI Bi-directional relationships
I Time weighted correlations:I URLs
I MentionsI Hashtags
I Tweet similarity weighted by time
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
36
MethodsTime Features
FeaturesI Differences in Account Age
I Spammers have multiple accountsI Likely to be made at the same timeI FollowersI FolloweesI Bi-directional relationships
I Time weighted correlations:I URLsI Mentions
I HashtagsI Tweet similarity weighted by time
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
36
MethodsTime Features
FeaturesI Differences in Account Age
I Spammers have multiple accountsI Likely to be made at the same timeI FollowersI FolloweesI Bi-directional relationships
I Time weighted correlations:I URLsI MentionsI Hashtags
I Tweet similarity weighted by time
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
36
MethodsTime Features
FeaturesI Differences in Account Age
I Spammers have multiple accountsI Likely to be made at the same timeI FollowersI FolloweesI Bi-directional relationships
I Time weighted correlations:I URLsI MentionsI Hashtags
I Tweet similarity weighted by time
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
37
Outline
BackgroundDecision TreesRandom ForestsModel Evaluation
MethodsTweet and User Content FeaturesGeo-Tagged FeaturesTime Features
Results
Conclusion
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
38
Results
I Accuracy:True PositiveSpam
Spam
False Positive
Not Spam
False NegativeNot Spam True Negative
Pre
dict
ion
Truth
TP+TNTP+FP+TN+FN "How many tweets were
correctly identified?"
I Precision (p):True PositiveSpam
Spam
False Positive
Not Spam
False NegativeNot Spam True Negative
Pre
dict
ion
Truth
TPTP+FP "How good is our spam
prediction?"
I Recall (r ):True PositiveSpam
Spam
False Positive
Not Spam
False NegativeNot Spam True Negative
Pre
dict
ion
Truth
TPTP+FN "How much spam was
identified?"I F-measure (F ): 2 · precision·recall
precision+recall
Model Results of the Three Studies
Study %Spam p r F Accuracy
User/Tweet Features: I 50.0% 0.929 0.943 0.936 0.936User/Tweet Features: II 5.0% 0.929 0.407 0.566 0.978Geo-tagged Features 21.4% 0.959 0.959 0.958 0.959
Time Features 46.9% 0.932 0.931 0.931 0.931
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
39
Conclusion
I Classification via random forestI Recall (r ) may drop when test set contains a low proportion of
spamI Future work: Apply this finding to geo-tagged tweets and time
featuresI Future spam classification by Twitter: Random forests?
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
40
Acknowledgments
I Peter DolanI For acting as my advisor for this research project and his continued
friendship for the past few yearsI Elena Machkasova
I For the invaluable advice and insightful comments throughout theentire senior seminar course
I Jacob OpdahlI For the exceptional revisions and comments he made as my alumni
review
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
41
References I
Fabricio Benevenuto, Gabriel Magno, Tiago Rodrigues, andVirgilio Almeida, Detecting spammers on twitter, Collaboration,electronic messaging, anti-abuse and spam conference (CEAS),vol. 6, 2010, p. 12.
Leo Breiman, Random forests, Machine learning 45 (2001),no. 1, 5–32.
Chao Chen, Jun Zhang, Xiao Chen, Yang Xiang, and WanleiZhou, 6 million spam tweets: A large ground truth for timelytwitter spam detection, 2015 IEEE International Conference onCommunications (ICC), IEEE, 2015, pp. 7065–7070.
Diansheng Guo and Chao Chen, Detecting non-personal andspam users on geo-tagged twitter network, Transactions in GIS18 (2014), no. 3, 370–384.
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
42
References II
Miroslav Kubat, An introduction to machine learning, 1st ed.,Springer Publishing Company, Incorporated, 2015.
Nikita Spirin and Jiawei Han, Survey on web spam detection:Principles and algorithms, SIGKDD Explor. Newsl. 13 (2012),no. 2, 50–64.
Claude Elwood Shannon, A mathematical theory ofcommunication, ACM SIGMOBILE Mobile Computing andCommunications Review 5 (2001), no. 1, 3–55.
Igor Santos, Igor Miñambres-Marcos, Carlos Laorden, PatxiGalán-García, Aitor Santamaría-Ibirika, and Pablo GarcíaBringas, Twitter content-based spam filtering, pp. 449–458,Springer International Publishing, Cham, 2014.
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests
43
References III
Kurt Thomas, Chris Grier, Dawn Song, and Vern Paxson,Suspended accounts in retrospect: An analysis of twitter spam,Proceedings of the 2011 ACM SIGCOMM Conference onInternet Measurement Conference (New York, NY, USA), IMC’11, ACM, 2011, pp. 243–258.
Mahdi Washha, Aziz Qaroush, and Florence Sedes, Leveragingtime for spammers detection on twitter, Proceedings of the 8thInternational Conference on Management of Digital EcoSystems(New York, NY, USA), MEDES, ACM, 2016, pp. 109–116.
Humza Haider | Identifying Twitter Spam by Utilizing Random Forests