Searching the Searchers with SearchAudit
JOHN P. JOHN, FANG YU, YINGLIAN XIE, MARTÍN ABADI, ARVIND KRISHNAMURTHY
PRESENTATION BY SAM KLOCK


Motivation

We can find this via a Google search

Motivation (cont’d)

Search engines open opportunities for attackers
- Construct clever queries
- Find vulnerable sites
- Plant malware; spam (e.g., MyDoom)
- Do so stealthily and cheaply

Mitigation strategy: identify malicious queries
- May be able to deny results to user
- Identify attackers (probably bots)
- Interpret strategy, then anticipate and prevent

The question: how to do so

Proposed Approach

SearchAudit: framework for identifying malicious queries

Input:
- Seed set of known malicious queries
- Search logs

Output:
- Large set of suspicious queries
- Regular expressions matching queries

[Figure: a seed set and search logs go into SearchAudit, which outputs an expanded set and regular expressions]

Seed set (examples):
inurl:gotoURL.asp?url=
filetype:asp inurl:"shopdisplayproducts.asp"
ext:pl inurl:cgi intitle:"FormMail *" -"*Referrer" -"* Denied" -sourceforge -error -cvs -input
filetype:cgi inurl:tseekdir.cgi
...

Regular expressions (examples):
"/includes/joomla\.php" site:\.[a-zA-Z]{2,3}
"/includes/class_item\.php" site:[^?=#+@;&:]{2,4}
"php-nuke" site:[^?=#+@;&:]{2,4}
"modules\.php\?op=modload" site:\.[a-zA-Z0-9]{2,6}
...

Proposed Approach (cont’d)

Needed to implement:
- Seed set: milw0rm.com
- Search logs: Microsoft Research (Bing)
- Way to expand seed set into more queries
- Way to infer regular expressions

Intended benefits:
- Harvesting lots of information
  - Three months: ~1.2 TB of logs
- Interpret relationship between queries and attacks
- Use queries to find potential victims
- Stop attacks

SearchAudit

Query identification

Query analysis

Query Identification: Expansion

Basic idea: bootstrap on seed set
- Search logs for exact matches to seed queries
- Record IPs of hosts making seed queries
- Add other queries from those IPs to set
  - Intuition: make one malicious query, will probably make more
  - Account for DHCP

[Figure labels: seed queries, log search, IP addresses, queries made by IPs, queries made on same day]
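
The expansion step above can be sketched in a few lines of Python. This is a minimal illustration, not the DryadLINQ implementation: the `LogEntry` record, its field names, and the sample log are assumptions made for the example. Restricting the expansion to the same (IP, day) pair is one simple way to account for DHCP reassignment, in line with the "queries made on same day" label.

```python
from typing import Iterable, NamedTuple, Sequence, Set, Tuple


class LogEntry(NamedTuple):
    """One search-log record (hypothetical schema for this sketch)."""
    day: str    # e.g. "2009-02-01"
    ip: str
    query: str


def expand_seed_queries(
    log: Sequence[LogEntry], seeds: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (expanded query set, suspicious IPs).

    An (ip, day) pair is suspicious if that IP issued an exact seed
    query on that day; every query from a suspicious pair is added.
    """
    seed_set = set(seeds)
    # Pass 1: find hosts (per day) that issued at least one seed query.
    suspicious_hosts = {(e.ip, e.day) for e in log if e.query in seed_set}
    # Pass 2: collect everything those hosts asked on the same day.
    expanded = {e.query for e in log if (e.ip, e.day) in suspicious_hosts}
    ips = {ip for ip, _ in suspicious_hosts}
    return expanded, ips


log = [
    LogEntry("2009-02-01", "198.51.100.7", "inurl:gotoURL.asp?url="),
    LogEntry("2009-02-01", "198.51.100.7", "filetype:cgi inurl:tseekdir.cgi"),
    LogEntry("2009-02-02", "198.51.100.7", "weather seattle"),
    LogEntry("2009-02-01", "203.0.113.9", "cheap flights"),
]
queries, ips = expand_seed_queries(log, ["inurl:gotoURL.asp?url="])
```

In this sample, the second query from 198.51.100.7 is pulled in because it shares a day with a seed match, while the next-day query from the same address is left out.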

Query Identification: Regular Expressions

Goals:
- Account for variation in queries
- Take advantage of scripting

See paper for generation algorithm

Compute score for generated expressions
- Lower score: more specific
- Goal: discard overly general expressions (score > 0.6)

Consolidate to avoid overlap

Avoid proxies, public NAT for performance

Loopback for more queries
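
The filtering step can be illustrated with a short Python sketch. It assumes the scores have already been computed (the scoring and generation algorithms are in the paper and are not reproduced here); the function names and the example scores are hypothetical. Expressions scoring above 0.6 are discarded and the rest are run over candidate queries.

```python
import re
from typing import Iterable, List, Pattern, Tuple

MAX_SCORE = 0.6  # threshold from the slide: discard expressions with score > 0.6


def keep_specific(
    scored: Iterable[Tuple[str, float]], max_score: float = MAX_SCORE
) -> List[Pattern[str]]:
    """Compile only the expressions whose score is at or below the threshold."""
    return [re.compile(expr) for expr, score in scored if score <= max_score]


def matching_queries(patterns: List[Pattern[str]], queries: Iterable[str]) -> List[str]:
    """Return the queries matched by at least one retained expression."""
    return [q for q in queries if any(p.search(q) for p in patterns)]


scored = [
    # Expression taken from the examples; the score values are made up.
    (r'"/includes/joomla\.php" site:\.[a-zA-Z]{2,3}', 0.2),
    (r".*", 0.95),  # overly general, so it is dropped
]
patterns = keep_specific(scored)
hits = matching_queries(
    patterns,
    ['"/includes/joomla.php" site:.com', "weather seattle"],
)
```

Feeding the matched queries back in as a new seed set would correspond to the loopback step.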

Query Identification: Results

Data from Bing and milw0rm
- 500 queries
- Logs for Feb. 2009, Dec. 2009, Jan. 2010
  - ~2 billion views per month

System implemented on Dryad/DryadLINQ

Initial observations:
- Using specificity scores < 0.6 seems to be effective
  - Based on cookie heuristic
- Proxy elimination does not limit results

Query Identification: Results (cont’d)

Query expansion:
- 122 of 500 queries matched in logs: 174 unique IPs
- Expanded to 800 unique queries, 264 IPs

Regular expressions matched 3,560 queries, 1,001 IPs

Incomplete seeds
- Tried with subsets of original set
- Coverage still good

Query Identification: Results (cont’d)

Loopback:
- Multiple loopbacks got more results
- One iteration is good enough

Overall statistics
- Tens of thousands of IPs each month
- Hundreds of thousands of unique queries each month
- Dec. 09: set of unusual attacker IPs caused a spike

Query Identification: Verification

Want to show queries are malicious
- Sometimes easy: 73% of queries associated with security/hacker sites

What about others?
- No ground truth exists
- So: look for bot-like features
  - Individual level (one IP)
  - Group level (multiple IPs)

Individual bots
- New cookie
- Whether a link was clicked

Groups of bots
- Data often fixed by botnets
  - User agent string
  - Metadata for requests
- Tendencies dictated by scripts
  - Pages viewed per query
  - Time between queries
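
A rough sketch of how such features might be computed from log records follows. The `Request` record, its field names, and the two feature functions are assumptions for illustration; they are not the paper's definitions. The first function covers the individual-level signals (new cookie, link clicked), the second one group-level signal (how many IPs share one user agent string).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Request:
    """One logged search request (hypothetical schema for this sketch)."""
    ip: str
    new_cookie: bool   # request arrived without a previously seen cookie
    clicked: bool      # user clicked at least one result
    user_agent: str


def individual_features(requests: Sequence[Request]) -> Dict[str, float]:
    """Fraction of requests with a new cookie and with a result click."""
    if not requests:
        raise ValueError("need at least one request")
    n = len(requests)
    return {
        "new_cookie_rate": sum(r.new_cookie for r in requests) / n,
        "click_rate": sum(r.clicked for r in requests) / n,
    }


def dominant_user_agent_share(requests: Sequence[Request]) -> float:
    """Fraction of distinct IPs that use the most common user agent."""
    if not requests:
        raise ValueError("need at least one request")
    ua_by_ip = {r.ip: r.user_agent for r in requests}  # last seen UA per IP
    top_count = Counter(ua_by_ip.values()).most_common(1)[0][1]
    return top_count / len(ua_by_ip)


requests = [
    Request("198.51.100.7", True, False, "scriptbot/1.0"),
    Request("198.51.100.7", True, False, "scriptbot/1.0"),
    Request("198.51.100.8", True, False, "scriptbot/1.0"),
    Request("203.0.113.9", False, True, "Mozilla/5.0"),
]
features = individual_features(requests)
ua_share = dominant_user_agent_share(requests)
```

A high new-cookie rate with a low click rate, or many IPs sharing identical request data, would be the bot-like pattern the slide describes.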

Query Identification: Verification (cont’d)

Substantial variation between host behavior for normal queries and suspicious queries

Observations on Stage One

Regular expressions can become obsolete
- Just need fresh logs and a new seed to get new ones

Attacker awareness of technique yields adaptation
- Example: mix in normal user queries
  - Goal: trick SearchAudit into identifying as proxy
  - Hard to do: needs to be appropriate to time and place
  - Anyway: proxy elimination is optimization only
- Injecting randomness also possible, but makes querying less productive
- Could obviate cookie heuristic, but it is replaceable

All attackers need to be careful to succeed

Query Analysis


42,000 IPs gave suspicious queries globally
- U.S., Russia, China contribute almost 50%
- 10% of IPs gave 90% of queries

Found 200 regular expressions

Reveal three kinds of attack-related queries:
- Vulnerable web sites
- Forum spamming
- Phishing on Windows Live Messenger

Queries for Vulnerable Websites

Queries look for exploitable server vulnerabilities
- GET variables embedded in URL (for SQL injection)
- Server software with known vulnerabilities (e.g., status pages)

SearchAudit as a defense:
- Pull suspicious queries for vulnerabilities
- Run queries; gather results
- Inspect results for vulnerabilities
- Notify sites of vulnerabilities

inurl:index.php?content=X

http://www.example.com/index.php?content=X'%20OR%20'1'%20OR%20'1=1'

Queries for Vulnerable Websites (cont’d)

With identified queries:
- Sampled 5,000 queries
- Obtained 80,490 URLs from 39,475 sites

Compared to malware/phishing lists:
- 3-4% on anti-phishing lists
- 1.5% on anti-malware lists

SQL injection vulnerability:
- Add a single-quote to variable in URL
- Look for SQL error
- 12% of examined URLs showed an error
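
The single-quote check can be sketched as two small helpers: one rewrites the URL, the other inspects a response body. This is an illustrative sketch only. The helper names and the list of error markers are assumptions (the markers are common MySQL and SQL Server error phrases, not a list taken from the paper), and the code deliberately makes no network request.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed, non-exhaustive signatures of database error messages.
SQL_ERROR_MARKERS = (
    "you have an error in your sql syntax",  # MySQL
    "unclosed quotation mark",               # SQL Server
)


def add_quote_to_param(url: str, param: str) -> str:
    """Return the URL with a single quote appended to one GET variable."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    probed = [(k, v + "'" if k == param else v) for k, v in pairs]
    # urlencode percent-encodes the quote as %27.
    return urlunsplit(parts._replace(query=urlencode(probed)))


def looks_like_sql_error(body: str) -> bool:
    """True if the response body contains a known SQL error phrase."""
    lowered = body.lower()
    return any(marker in lowered for marker in SQL_ERROR_MARKERS)


probe_url = add_quote_to_param("http://www.example.com/index.php?content=X", "content")
```

Here `probe_url` is `http://www.example.com/index.php?content=X%27`; a page that answers such a request with a database error message is a candidate for the 12% figure above.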

Queries for Forum Spamming

Query motivation:
- Find scriptable forums
- Good for spam, PageRank

Found 46 applicable regular expressions

Most IPs show transient behavior: probably bots
- All regular expression groups show at least one group similarity feature

IPs got less aggressive over time: more stealthy

Queries for Forum Spamming (cont’d)

Validation
- Project Honey Pot
  - Dynamically generate e-mail address for each visiting IP
  - E-mail received: must be spam
- 12% of all IPs listed (vs. 0.5% for normal IPs)

Applications
- Use queries to find and clean targeted pages
- Deny results to malicious queries
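
The per-visitor address idea used for validation can be illustrated as follows. This is a hedged sketch of the general technique, not Project Honey Pot's actual implementation: the class, the keyed-hash token scheme, and the `trap.example` domain are all assumptions for the example.

```python
import hashlib
import hmac
from typing import Dict, Optional


class HoneypotAddresses:
    """Hands out one e-mail address per visiting IP and maps mail back to it."""

    def __init__(self, secret: bytes, domain: str = "trap.example") -> None:
        self._secret = secret
        self._domain = domain  # hypothetical mail domain
        self._ip_by_address: Dict[str, str] = {}

    def address_for(self, ip: str) -> str:
        """Return the address to embed in the page served to this IP."""
        # Keyed hash so addresses cannot be guessed or forged from the IP.
        token = hmac.new(self._secret, ip.encode(), hashlib.sha256).hexdigest()[:16]
        address = f"{token}@{self._domain}"
        self._ip_by_address[address] = ip
        return address

    def harvester_ip(self, recipient: str) -> Optional[str]:
        """Given the recipient of received mail, return the IP that saw it."""
        return self._ip_by_address.get(recipient.lower())


pot = HoneypotAddresses(secret=b"demo-secret")
addr_a = pot.address_for("198.51.100.7")
addr_b = pot.address_for("203.0.113.9")
```

Because each address is shown to exactly one visitor and published nowhere else, any mail it receives can be attributed to that visitor as a harvester.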

Phishing via Windows Live Messenger

Queries triggered by normal users
- Victim receives message from a contact
- Follow link for party photos
- Taken to fake WLM login
- After giving credentials, redirected to Bing search for “party”

Bing search to avoid costs of hosting

Phishing via WLM (cont’d)

Detect via query referral field (source page)
- Found two regular expressions for referrals
- Both expressions: victim username embedded in URL

Over 180 phishing domains for 12 IPs detected

Compromised accounts show different login behaviors

Conclusion

Presented framework for finding suspicious queries
- Input: search logs, small set of seed queries
- Output: regular expressions, millions of suspicious queries

Analyzed suspicious queries
- Identified possible attacks
- Suggested means of prevention

Generally: attempted to demonstrate relationship between suspicious queries and the possibility of attack
