WarningBird: Detecting Suspicious URLs in Twitter Stream

Sangho Lee and Jong Kim
Pohang University of Science and Technology

January 18, 2012

Threat

- Post URLs to attract traffic to website
- Can deliver various payloads
  - Spam
  - Phishing
  - Download malicious software

Twitter

- Online micro-blogging service
- Large (about 100 million accounts)
- URL shortener services
- Tweets broadcast to legitimate users
- Good vector for attackers to attract traffic
  - Many potential targets
  - URL shorteners are common and mask the actual website
  - Many users view tweets based on content and not authorship

Existing Detection Approaches and Limitations

1. Detect accounts based on account information
  - E.g., ratio of Tweets with URLs to Tweets without URLs
  - Easily fabricated by attacker
2. Detect accounts based on social graph
  - E.g., connectivity measures for each node
  - Hard to obtain and analyze large amounts of Twitter data
3. Crawl URLs to classify them
  - E.g., detect malicious URLs based on HTML content
  - Redirection chains used by attackers
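
To illustrate why the account-based ratio in approach 1 is easy to fabricate, here is a minimal Python sketch. The function name `url_tweet_ratio`, the substring test for URLs, and the sample tweets are all assumptions made for illustration; they are not part of any detector described in these slides.

```python
from typing import List


def url_tweet_ratio(tweets: List[str]) -> float:
    """Ratio of tweets containing a URL to tweets without one (hypothetical feature)."""
    with_url = sum(("http://" in t or "https://" in t) for t in tweets)
    without_url = len(tweets) - with_url
    return with_url / without_url if without_url else float("inf")


# Made-up spam account: 9 tweets with a URL, 1 without.
spam = ["buy now http://spam.example/a"] * 9 + ["hello"]

# The same account after the attacker pads it with 90 URL-free tweets.
padded = spam + ["nice weather today"] * 90

ratio_before = url_tweet_ratio(spam)    # 9 / 1
ratio_after = url_tweet_ratio(padded)   # 9 / 91
```

Padding the account with harmless tweets pushes the ratio from 9.0 to below 0.1 while the account sends exactly the same spam, which is the sense in which such features are easily fabricated.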

Redirection Chains

- Redirect chains start by resolving the shortened URL
- Several hops of attacker-owned URLs redirect the user
- The attacker dynamically chooses which page a user ultimately visits
  - Crawlers go to a legitimate URL
  - Legitimate users go to the malicious URL
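
The conditional redirection described above can be sketched with a toy in-memory model. Everything here is an assumption for illustration: the `REDIRECTS` table, the `follow` helper, the visitor labels, and all URLs are made-up placeholders, and no network access is involved.

```python
from typing import Callable, Dict, List, Union

# A hop is either a fixed next URL or a function of the visitor type.
Hop = Union[str, Callable[[str], str]]

# Hypothetical redirect table; every URL is a placeholder.
REDIRECTS: Dict[str, Hop] = {
    "http://short.example/abc": "http://hop1.example/r",
    "http://hop1.example/r": "http://entry.example/go",
    # The attacker-controlled server picks the landing page per visitor.
    "http://entry.example/go": lambda visitor: (
        "http://benign.example/"
        if visitor == "crawler"
        else "http://malicious.example/"
    ),
}


def follow(url: str, visitor: str, max_hops: int = 10) -> List[str]:
    """Return the URLs visited, starting with `url`, for the given visitor."""
    chain = [url]
    while chain[-1] in REDIRECTS and len(chain) <= max_hops:
        hop = REDIRECTS[chain[-1]]
        chain.append(hop(visitor) if callable(hop) else hop)
    return chain


crawler_chain = follow("http://short.example/abc", "crawler")
user_chain = follow("http://short.example/abc", "browser")

# Positions at which both visitors see the same URL.
shared = [a for a, b in zip(crawler_chain, user_chain) if a == b]
```

In this toy model the crawler lands on the benign page and the browser on the malicious one, yet the first three URLs of both chains are identical. That shared portion is the part of the chain a crawler can still observe.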

Problem

- Given a URL posted on Twitter, determine whether a legitimate user would ultimately be directed to a malicious URL by visiting the URL on Twitter
- Assumptions:
  - Cannot use features easily fabricated by attacker
  - No access to large Twitter graph
  - Have access to part of redirect chain available to crawlers
  - Redirect chains cannot be fabricated
- Solution overview:
  - Create classifier
  - Rely on redirect chain for features
  - Validate accuracy/performance with Twitter data

WarningBird

- Input: tweets
- Output: suspicious URLs
- Live website shows recent suspicious URLs

Data Collection

- Use Twitter Streaming API to collect Tweets
- Keep only Tweets with URLs
- Crawl and store URL chain of each URL
- Queue many Tweets to be analyzed together
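
The filtering and queueing steps above can be sketched as follows. This is a minimal sketch under stated assumptions: tweets are modeled as plain dicts with a `text` field, the regular expression `URL_RE`, the helper names, and the batch size are hypothetical, and the Streaming API call and the crawling step are left out.

```python
import re
from typing import Dict, Iterable, Iterator, List

# Simplified URL pattern, assumed for illustration only.
URL_RE = re.compile(r"https?://\S+")


def tweets_with_urls(tweets: Iterable[Dict]) -> Iterator[Dict]:
    """Keep only tweets whose text contains at least one URL."""
    for tweet in tweets:
        urls = URL_RE.findall(tweet["text"])
        if urls:
            yield {**tweet, "urls": urls}


def batches(items: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Group items into lists of `size` so they can be analyzed together."""
    batch: List[Dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


# Made-up sample standing in for a stream of tweets.
sample = [
    {"id": 1, "text": "check this http://short.example/abc"},
    {"id": 2, "text": "no link here"},
    {"id": 3, "text": "again http://short.example/xyz now"},
]

queued = list(batches(tweets_with_urls(sample), size=2))
```

Here tweet 2 is dropped because it has no URL, and tweets 1 and 3 end up in one batch of two.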

Feature Extraction

- Group domains that share an IP address (e.g., xyz.com = 20.30.40.50 = abc.com)
- Find entry point URLs
- 11 features based on URL chains and Tweet context
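
Domain grouping as in the xyz.com / abc.com example can be sketched like this. The sketch assumes the domain-to-IP resolutions are already available in a dict (`RESOLVED`), picks the alphabetically first domain as the group representative, and adds a third made-up domain; these choices and the function name `group_domains` are illustrative assumptions rather than the authors' implementation.

```python
from collections import defaultdict
from typing import Dict, List

# Assumed pre-resolved table; the first two rows mirror the slide's example.
RESOLVED: Dict[str, str] = {
    "xyz.com": "20.30.40.50",
    "abc.com": "20.30.40.50",
    "other.example": "203.0.113.7",  # hypothetical unrelated domain
}


def group_domains(resolved: Dict[str, str]) -> Dict[str, str]:
    """Map each domain to one representative domain per shared IP address."""
    by_ip: Dict[str, List[str]] = defaultdict(list)
    for domain, ip in resolved.items():
        by_ip[ip].append(domain)
    return {
        domain: sorted(members)[0]  # representative: alphabetically first
        for members in by_ip.values()
        for domain in members
    }


groups = group_domains(RESOLVED)
```

After grouping, URLs on xyz.com and abc.com are treated as belonging to the same server, so redirect chains that pass through either domain can be compared against each other.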

Features

Classifier

- Features are all normalized between zero and one
- Logistic regression was experimentally found to be the best classifier
- Ground truth from Twitter account status for supervised learning
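
A minimal pure-Python sketch of the two steps named above, min-max normalization to the range zero to one followed by logistic regression, is given below. The two feature columns, the four training rows, the labels, the learning rate, and the epoch count are all made up for illustration; the slides do not give the real feature values or training settings, and plain batch gradient descent is used here only because it needs no external library.

```python
import math
from typing import List, Tuple


def min_max_normalize(rows: List[List[float]]) -> List[List[float]]:
    """Scale every feature column to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w: List[float], b: float, x: List[float]) -> float:
    """Probability that feature vector x belongs to the positive class."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def train_logistic(X: List[List[float]], y: List[int],
                   lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical raw features per URL; labels: 1 = suspicious, 0 = benign.
raw = [[2, 1], [3, 1], [8, 40], [10, 55]]
labels = [0, 0, 1, 1]

X = min_max_normalize(raw)
w, b = train_logistic(X, labels)
probs = [predict(w, b, x) for x in X]
```

On this toy data the two benign rows score below 0.5 and the two suspicious rows above it. In the setting described on the slide, the labels would come from Twitter account status rather than being written by hand.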

Experimentation

- Real Twitter data from Twitter Streaming API
- Authors' own commodity hardware
- Performed experiments on Twitter data to investigate
  - Accuracy
  - Performance
  - Delay in detection

Accuracy Results

- 60 days of training data: 183k benign and 42k malicious URLs
- 30 days of test data: 71k benign and 6.7k malicious URLs
- Achieved 3.67% FPR and 3.21% FNR
- Of 71k benign, about 2.6k marked malicious
- Of 6.7k malicious, about 200 not discovered
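
The last two bullets follow from applying the reported rates to the test-set sizes. The short check below uses only the rounded figures from this slide, so its results are approximate; the variable names are illustrative.

```python
# Rounded test-set sizes and error rates as stated on the slide.
benign_test = 71_000
malicious_test = 6_700
fpr = 0.0367  # false positive rate
fnr = 0.0321  # false negative rate

# Benign URLs wrongly flagged as malicious: about 2,606, i.e. roughly 2.6k.
false_positives = benign_test * fpr

# Malicious URLs that were missed: about 215, i.e. roughly 200.
false_negatives = malicious_test * fnr
```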

Performance Results

- Running time of various components
  - 24 ms to crawl redirections (100 concurrent crawls)
  - 2 ms domain grouping
  - 1.6 ms feature extraction
  - 0.5 ms classification
- Process 100,000 URLs in one hour
- Can distribute redirection crawling to improve this
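
A rough consistency check of the throughput figure is sketched below. It assumes the component times are averages per URL and that the stages run one after another on a single machine; the slide does not state either point, so treat the result as a back-of-the-envelope estimate.

```python
# Component times from the slide, assumed to be per-URL averages in milliseconds.
stage_ms = {
    "crawl redirections": 24.0,
    "domain grouping": 2.0,
    "feature extraction": 1.6,
    "classification": 0.5,
}

per_url_ms = sum(stage_ms.values())      # about 28.1 ms per URL
urls_per_hour = 3_600_000 / per_url_ms   # about 128,000 URLs per hour

# Share of the per-URL time spent crawling redirections: about 85%.
crawl_share = stage_ms["crawl redirections"] / per_url_ms
```

Under these assumptions the pipeline handles somewhat more than the 100,000 URLs per hour quoted on the slide, and crawling dominates the per-URL cost, which is why distributing the crawl is the natural way to scale.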

Delay Results

- WarningBird can detect faster than Twitter
- Only shows results for those accounts suspended by Twitter within a day

Conclusion

- Found an important feature that others have ignored
- Attacker must either spend more for more redirection servers or risk being caught