
WarningBird: Detecting Suspicious URLs in Twitter Stream

Sangho Lee and Jong Kim
Pohang University of Science and Technology

January 18, 2012

Threat

- Post URLs to attract traffic to a website
- Can deliver various payloads
  - Spam
  - Phishing
  - Download malicious software

Twitter

- Online micro-blogging service
  - Large (about 100 million accounts)
  - URL shortener services
  - Tweets broadcast to legitimate users
- Good vector for attackers to attract traffic
  - Many potential targets
  - URL shorteners are common and mask the actual website
  - Many users view tweets based on content, not authorship


Existing Detection Approaches and Limitations

1. Detect accounts based on account information
  - E.g., ratio of tweets with URLs to tweets without URLs
  - Limitation: easily fabricated by attackers
2. Detect accounts based on the social graph
  - E.g., connectivity measures for each node
  - Limitation: hard to obtain and analyze large amounts of Twitter data
3. Crawl URLs to classify them
  - E.g., detect malicious URLs based on HTML content
  - Limitation: attackers use redirection chains




Redirection Chains

- Redirect chains start by resolving the shortened URL
- Several hops through attacker-owned URLs redirect the user
- The attacker dynamically chooses which page a visitor ultimately reaches
  - Crawlers go to a legitimate URL
  - Legitimate users go to the malicious URL

Problem

- Given a URL posted on Twitter, determine whether a legitimate user would ultimately be directed to a malicious URL by visiting the URL on Twitter
- Assumptions:
  - Cannot use features easily fabricated by the attacker
  - No access to the large Twitter graph
  - Have access to the part of the redirect chain available to crawlers
  - Redirect chains cannot be fabricated
- Solution overview:
  - Create a classifier
  - Rely on the redirect chain for features
  - Validate accuracy and performance with Twitter data





WarningBird

- Input: tweets
- Output: suspicious URLs
- Live website shows recent suspicious URLs

Data Collection

- Use the Twitter Streaming API to collect tweets
- Keep only tweets with URLs
- Crawl and store the URL chain of each URL
- Queue many tweets to be analyzed together
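The collection steps above can be sketched as follows. This is a minimal illustration, not the authors' crawler: the `resolve` callback, the `FAKE_REDIRECTS` table, and the tweet dictionaries are hypothetical stand-ins for real HTTP responses and Streaming API output.

```python
from typing import Callable, Dict, List, Optional


def crawl_redirect_chain(
    start_url: str,
    resolve: Callable[[str], Optional[str]],
    max_hops: int = 20,
) -> List[str]:
    """Follow redirects from start_url and return every URL visited, in order."""
    chain = [start_url]
    seen = {start_url}
    current = start_url
    for _ in range(max_hops):
        next_url = resolve(current)
        # Stop at the landing page (no further redirect) or on a redirect loop.
        if next_url is None or next_url in seen:
            break
        chain.append(next_url)
        seen.add(next_url)
        current = next_url
    return chain


# Hypothetical redirect table standing in for real HTTP responses.
FAKE_REDIRECTS: Dict[str, str] = {
    "http://short.example/abc": "http://hop1.example/r",
    "http://hop1.example/r": "http://hop2.example/r",
    "http://hop2.example/r": "http://landing.example/",
}

# Hypothetical tweets; a real collector would read these from the stream.
tweets = [
    {"text": "check this out", "urls": ["http://short.example/abc"]},
    {"text": "no link here", "urls": []},
]

# Keep only tweets with URLs, then store the redirect chain of each URL.
chains = {
    url: crawl_redirect_chain(url, FAKE_REDIRECTS.get)
    for tweet in tweets
    if tweet["urls"]
    for url in tweet["urls"]
}
```

Passing the resolver in as a function keeps the chain logic separate from networking, so a real HTTP client could be swapped in without changing `crawl_redirect_chain`.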

Feature Extraction

- Group domains that resolve to the same IP address (e.g., xyz.com = 20.30.40.50 = abc.com)
- Find entry point URLs
- 11 features based on URL chains and tweet context
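A rough sketch of the first two steps, under stated assumptions: domains are grouped when they map to the same IP address, and an entry point is taken to be the grouped URL that appears in the most chains. That entry point definition, the helper names, and the sample data are illustrative assumptions, not details given on the slide.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple
from urllib.parse import urlsplit


def domain_groups(domain_to_ip: Dict[str, str]) -> Dict[str, str]:
    """Map every domain to one representative domain per IP address."""
    by_ip = defaultdict(list)
    for domain, ip in domain_to_ip.items():
        by_ip[ip].append(domain)
    # Use the alphabetically first domain as the group's representative.
    return {d: min(domains) for domains in by_ip.values() for d in domains}


def normalize(url: str, groups: Dict[str, str]) -> str:
    """Replace the URL's host with its group representative."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return groups.get(host, host) + parts.path


def find_entry_point(chains: List[List[str]], groups: Dict[str, str]) -> Tuple[str, int]:
    """Return the grouped URL shared by the most chains (assumed definition)."""
    counts: Counter = Counter()
    for chain in chains:
        # Count each grouped URL at most once per chain.
        counts.update({normalize(u, groups) for u in chain})
    return counts.most_common(1)[0]


# Hypothetical DNS results: xyz.com and abc.com share one IP address.
dns = {"xyz.com": "20.30.40.50", "abc.com": "20.30.40.50", "t.example": "10.0.0.1"}
sample_chains = [
    ["http://t.example/1", "http://xyz.com/go", "http://land1.example/"],
    ["http://u.example/2", "http://abc.com/go", "http://land2.example/"],
]
entry_point = find_entry_point(sample_chains, domain_groups(dns))
```

Without grouping, the two chains would share no URL; after grouping, both pass through the same entry point.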

Features

Classifier

- Features are all normalized between zero and one
- Logistic regression classification experimentally found to be the best
- Ground truth from Twitter account status for supervised learning
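The classifier steps above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the two feature columns, the toy rows, the learning rate, and the epoch count are all assumptions chosen for illustration, and labels are assumed to be 1 for URLs from suspended accounts and 0 otherwise.

```python
import math
from typing import List, Sequence


def min_max_scale(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Rescale each feature column to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w: Sequence[float], row: Sequence[float]) -> float:
    """Probability that a row is malicious; w[0] is the bias term."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], row)))


def train_logistic(X, y, lr: float = 0.5, epochs: int = 2000) -> List[float]:
    """Fit weights (bias first) by batch gradient descent on the log loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, label in zip(X, y):
            err = predict_proba(w, row) - label
            grad[0] += err
            for j, xi in enumerate(row):
                grad[j + 1] += err * xi
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w


# Hypothetical raw features: [redirect chain length, accounts sharing the entry point].
raw = [[2, 1], [3, 2], [8, 40], [9, 55]]
labels = [0, 0, 1, 1]  # assumed: 1 = account later suspended, 0 = account still active

X = min_max_scale(raw)
weights = train_logistic(X, labels)
scores = [predict_proba(weights, row) for row in X]
```

Scaling every feature to the same range keeps any single feature from dominating the weighted sum merely because of its units.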

Experimentation

- Real Twitter data from the Twitter Streaming API
- Run on the authors' own commodity hardware
- Experiments on Twitter data investigated:
  - Accuracy
  - Performance
  - Delay in detection

Accuracy Results

- 60 days of training data: 183k benign and 42k malicious URLs
- 30 days of test data: 71k benign and 6.7k malicious URLs
- Achieved 3.67% FPR and 3.21% FNR
  - Of 71k benign URLs, 2.6k were marked malicious
  - Of 6.7k malicious URLs, 200 were not discovered
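As a quick consistency check, the reported rates can be applied to the test set sizes. The totals on the slide are rounded, so this sketch only shows that the rates and counts agree approximately; it does not recover exact figures.

```python
# Rounded test set sizes and error rates as reported above.
benign_test = 71_000
malicious_test = 6_700
fpr = 0.0367  # false positive rate: benign URLs marked malicious
fnr = 0.0321  # false negative rate: malicious URLs not discovered

expected_false_positives = fpr * benign_test
expected_false_negatives = fnr * malicious_test

print(f"benign marked malicious: about {expected_false_positives:.0f}")
print(f"malicious not discovered: about {expected_false_negatives:.0f}")
```

The products come to roughly 2.6k and a little over 200, in line with the counts listed above.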

Performance Results

- Running time of the components
  - 24 ms to crawl redirections (100 concurrent crawls)
  - 2 ms for domain grouping
  - 1.6 ms for feature extraction
  - 0.5 ms for classification
- Can process 100,000 URLs in one hour
- Redirection crawling can be distributed to improve this
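A back-of-the-envelope check of the hourly figure, assuming the component times are per-URL averages and that the stages run one after another for each URL (both are assumptions; the slide does not say so explicitly):

```python
# Per-URL component times in milliseconds, as listed above.
stage_ms = {
    "crawl_redirections": 24.0,
    "domain_grouping": 2.0,
    "feature_extraction": 1.6,
    "classification": 0.5,
}

total_ms_per_url = sum(stage_ms.values())
urls_per_hour = 3_600_000 / total_ms_per_url  # milliseconds in one hour

print(f"{total_ms_per_url:.1f} ms per URL, about {urls_per_hour:,.0f} URLs per hour")
```

Under these assumptions the total is about 28 ms per URL, which is consistent with processing on the order of 100,000 URLs per hour. Crawling accounts for most of that time, which is why distributing it is the obvious way to raise throughput.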

Delay Results

- WarningBird can detect faster than Twitter
- Results are shown only for accounts that Twitter suspended within a day

Conclusion

- Found an important feature that others have ignored
- Attackers must either spend more on additional redirection servers or risk being caught