WarningBird: Detecting Suspicious URLs in Twitter Stream
Sangho Lee and Jong Kim
Pohang University of Science and Technology
January 18, 2012
Threat
- Post URLs to attract traffic to a website
- Can deliver various payloads
  - Spam
  - Phishing
  - Download malicious software
Twitter
- Online micro-blogging service
  - Large (about 100 million accounts)
  - URL shortener services
  - Tweets broadcast to legitimate users
- Good vector for attackers to attract traffic
  - Many potential targets
  - URL shorteners are common and mask the actual website
  - Many users view tweets based on content and not authorship
Existing Detection Approaches and Limitations
1. Detect accounts based on account information
  - E.g., ratio of tweets with URLs to tweets without URLs
  - Limitation: easily fabricated by the attacker
2. Detect accounts based on social graph
  - E.g., connectivity measures for each node
  - Limitation: hard to obtain and analyze large amounts of Twitter data
3. Crawl URLs to classify them
  - E.g., detect malicious URLs based on HTML content
  - Limitation: attackers use redirection chains to evade crawlers
Redirection Chains
- Redirect chains start by resolving the shortened URL
- Several hops of attacker-owned URLs redirect the user
- The chain dynamically chooses which page a visitor ultimately reaches
  - Crawlers go to a legitimate URL
  - Legitimate users go to the malicious URL
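The cloaking behaviour described above can be illustrated with a toy model. This is a minimal sketch, not the attackers' actual infrastructure: every URL, the REDIRECTS table, and the resolve_chain helper are hypothetical, and the crawler check is reduced to a boolean flag.

```python
# Toy model of a conditional redirection chain (all URLs are hypothetical).
# Each hop maps to a function that picks the next URL based on the visitor.
REDIRECTS = {
    "http://short.example/abc": lambda is_crawler: "http://hop1.example/r",
    "http://hop1.example/r": lambda is_crawler: "http://hop2.example/r",
    # The last attacker-owned hop decides where the visitor ends up.
    "http://hop2.example/r": lambda is_crawler: (
        "http://benign.example/" if is_crawler else "http://malicious.example/"
    ),
}


def resolve_chain(start_url, is_crawler, max_hops=10):
    """Follow redirects from start_url and return the list of visited URLs."""
    chain = [start_url]
    while chain[-1] in REDIRECTS and len(chain) <= max_hops:
        chain.append(REDIRECTS[chain[-1]](is_crawler))
    return chain


crawler_chain = resolve_chain("http://short.example/abc", is_crawler=True)
user_chain = resolve_chain("http://short.example/abc", is_crawler=False)

# Hops that both visitors see at the same position in the chain.
shared_prefix = [a for a, b in zip(crawler_chain, user_chain) if a == b]
print(crawler_chain[-1])
print(user_chain[-1])
print(len(shared_prefix))
```

In this model the crawler and the user see the same first three hops and differ only in the landing page, which is why a classifier based on landing-page content alone can be misled while the shared part of the chain remains observable.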
Problem
- Given a URL posted on Twitter, determine whether a legitimate user would ultimately be directed to a malicious URL by visiting the URL on Twitter
- Assumptions:
  - Cannot use features easily fabricated by attacker
  - No access to large Twitter graph
  - Have access to part of redirect chain available to crawlers
  - Redirect chains cannot be fabricated
- Solution Overview:
  - Create classifier
  - Rely on redirect chain for features
  - Validate accuracy/performance with Twitter data
WarningBird
- Input: tweets
- Output: suspicious URLs
- Live website shows recent suspicious URLs
Data Collection
- Use Twitter Streaming API to collect Tweets
- Keep only Tweets with URLs
- Crawl and store the redirect chain of each URL
- Queue many Tweets to be analyzed together
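The filtering and queueing steps above can be sketched as follows. This is an illustration under assumptions, not the authors' collector: the tweet dictionaries, the URL regular expression, the collect helper, and WINDOW_SIZE are all hypothetical, and the crawling step is left out so the example stays offline.

```python
import re
from collections import deque

URL_PATTERN = re.compile(r"https?://\S+")
WINDOW_SIZE = 3  # hypothetical; the slides do not state the queue size


def extract_urls(tweet_text):
    """Return all http(s) URLs found in the tweet text."""
    return URL_PATTERN.findall(tweet_text)


def collect(tweets, window_size=WINDOW_SIZE):
    """Yield batches of (tweet, urls) pairs, dropping tweets without URLs."""
    queue = deque()
    for tweet in tweets:
        urls = extract_urls(tweet["text"])
        if not urls:
            continue  # keep only tweets with URLs
        queue.append((tweet, urls))
        if len(queue) == window_size:
            yield list(queue)  # hand a full window to the analysis stage
            queue.clear()


sample = [
    {"user": "a", "text": "hello world"},
    {"user": "b", "text": "look http://short.example/1"},
    {"user": "c", "text": "free stuff http://short.example/2 now"},
    {"user": "d", "text": "no link here"},
    {"user": "e", "text": "http://short.example/3"},
]
batches = list(collect(sample))
print(len(batches))
```

Batching matters because the later features compare redirect chains across many tweets, which is not possible when each tweet is analyzed in isolation.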
Feature Extraction
- Group domains that resolve to the same IP address, e.g. xyz.com = 20.30.40.50 = abc.com
- Find entry point URLs
- 11 features based on URL chains and Tweet context
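A rough sketch of domain grouping and entry point selection is given here. The slides do not define an entry point, so this example assumes it is the hop that appears in the largest number of redirect chains within a batch. The DOMAIN_TO_IP table, the helper names, and all URLs other than the xyz.com and abc.com example are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical DNS results; a real system would resolve these at crawl time.
DOMAIN_TO_IP = {
    "xyz.com": "20.30.40.50",
    "abc.com": "20.30.40.50",
    "other.example": "60.70.80.90",
}


def group_key(url):
    """Map a URL to its domain group: the IP its host resolves to, if known."""
    host = urlsplit(url).hostname
    return DOMAIN_TO_IP.get(host, host)


def normalize_chain(chain):
    """Replace each URL's host by its group key, keeping the path."""
    return [(group_key(u), urlsplit(u).path) for u in chain]


def entry_points(chains):
    """For each chain, pick the hop shared by the most chains (assumed definition)."""
    normalized = [normalize_chain(c) for c in chains]
    counts = Counter()
    for chain in normalized:
        counts.update(set(chain))  # count each hop once per chain
    return [max(chain, key=lambda hop: counts[hop]) for chain in normalized]


chains = [
    ["http://short.example/1", "http://xyz.com/r", "http://land1.example/"],
    ["http://short.example/2", "http://abc.com/r", "http://land2.example/"],
    ["http://short.example/3", "http://other.example/x"],
]
eps = entry_points(chains)
print(eps[0] == eps[1])
```

After grouping, the first two chains share a hop even though they use different domain names, so both are assigned the same entry point.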
Features
Classifier
- Features are all normalized between zero and one
- Logistic regression classification experimentally found to be the best
- Ground truth from Twitter account status for supervised learning
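The normalization and classification steps can be sketched in plain Python. This is a minimal illustration, assuming min-max scaling and unregularized batch gradient descent; the slides do not say which normalization or solver was used. The two feature columns and the labels below are made up for the example.

```python
import math


def min_max_normalize(rows):
    """Scale every feature column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and a bias with batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            grad_w = [g + error * v for g, v in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (suspicious) when the predicted probability is at least 0.5."""
    return int(sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5)


# Hypothetical raw features for four URLs; label 1 marks a suspended account.
raw = [[1, 2], [2, 1], [8, 40], [9, 35]]
labels = [0, 0, 1, 1]
features = min_max_normalize(raw)
weights, bias = train_logistic_regression(features, labels)
preds = [predict(weights, bias, x) for x in features]
print(preds)
```

Scaling to the same range keeps a feature with large raw values from dominating the learned weights.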
Experimentation
- Real Twitter data from Twitter Streaming API
- Run on the authors' own commodity hardware
- Performed experiments on Twitter data to investigate
  - Accuracy
  - Performance
  - Delay in detection
Accuracy Results
- 60 days of training data: 183k benign and 42k malicious URLs
- 30 days of test data: 71k benign and 6.7k malicious URLs
- Achieved 3.67% false positive rate (FPR) and 3.21% false negative rate (FNR)
- Of 71k benign URLs, about 2.6k were marked malicious
- Of 6.7k malicious URLs, about 200 were not detected
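As a quick consistency check, the error counts follow from the rates and the test set sizes. The inputs below are the rounded figures from the slide, so the results are approximate.

```python
benign_test = 71_000      # benign URLs in the test set (rounded on the slide)
malicious_test = 6_700    # malicious URLs in the test set (rounded on the slide)
fpr = 0.0367              # false positive rate
fnr = 0.0321              # false negative rate

false_positives = fpr * benign_test      # benign URLs flagged as malicious
false_negatives = fnr * malicious_test   # malicious URLs that were missed

print(round(false_positives))
print(round(false_negatives))
```

This gives roughly 2,606 false positives and 215 false negatives, in line with the slide's rounded figures of 2.6k and about 200.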
Performance Results
- Running time of various components
  - 24 ms to crawl redirections (100 concurrent crawls)
  - 2 ms domain grouping
  - 1.6 ms feature extraction
  - 0.5 ms classification
- Process 100,000 URLs in one hour
- Can distribute redirection crawling to improve this
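The throughput claim can be checked with simple arithmetic. This sketch assumes the listed times are average costs per URL and that the stages run one after another; the slides do not state either point explicitly.

```python
# Per-URL stage times in milliseconds, taken from the slide.
stage_ms = {
    "crawl": 24.0,
    "domain_grouping": 2.0,
    "feature_extraction": 1.6,
    "classification": 0.5,
}

per_url_ms = sum(stage_ms.values())          # total time per URL
urls_per_hour = 3_600_000 / per_url_ms       # 3,600,000 ms in one hour
crawl_share = stage_ms["crawl"] / per_url_ms # fraction of time spent crawling

print(round(per_url_ms, 1))
print(int(urls_per_hour))
print(round(crawl_share, 3))
```

Under these assumptions each URL costs about 28.1 ms, which allows roughly 128,000 URLs per hour and is consistent with the stated 100,000. Crawling accounts for about 85% of the time, which is why distributing it is the suggested improvement.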
Delay Results
- WarningBird can flag suspicious URLs before Twitter suspends the accounts that posted them
- Results are shown only for accounts that Twitter suspended within a day
Conclusion
- Found an important feature that others have ignored: the redirect chain
- Attacker must either spend more on additional redirection servers or risk being caught