©2013 LinkedIn Corporation. All Rights Reserved.
Data Science vs. The Bad Guys
Using data to defend LinkedIn against fraud and abuse
David Freeman
Head of Security Data Science at LinkedIn
Strata+Hadoop World
San Jose, CA
20 Feb 2015
World’s largest professional network
But not everyone follows the rules!
What do they try to do?
• Spam Messages
• Spam Content
• Fake Companies
• Fraud Ads
• Fake Jobs
• Social Engineering
• Social Action Spam (e.g. likes, follows)
• Payment Fraud
• Malware
• Malicious URLs
• Scraping
How we stop them — process
1. Stop the bleeding!
2. Heuristic rules.
3. Machine learning.
Hypothetical Example: lots of fake accounts from one IP address
• Block the IP.
• Limit signup rate from any IP.
• Model trained on historical data, incorporating
  – Signups/IP/hour
  – Signups/IP/day
  – # good accounts on IP
  – # bad accounts on IP
  – other features
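The second step in the example, limiting the signup rate from any IP, could be sketched as a sliding-window counter. This is only an illustration, not LinkedIn's implementation: the class name `SignupRateLimiter`, the limit, and the one-hour window are all assumptions chosen for the example.

```python
from collections import defaultdict, deque


class SignupRateLimiter:
    """Hypothetical heuristic rule: at most `limit` signups per IP per window."""

    def __init__(self, limit=5, window_seconds=3600):
        self.limit = limit
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # ip -> timestamps of accepted signups

    def allow(self, ip, now):
        """Return True and record the signup if `ip` is under the limit at `now`."""
        events = self._events[ip]
        # Drop timestamps that have fallen out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.limit:
            return False
        events.append(now)
        return True


# Assumed limit of 2 signups per hour; timestamps are seconds.
limiter = SignupRateLimiter(limit=2, window_seconds=3600)
decisions = [limiter.allow("203.0.113.7", t) for t in (0, 10, 20, 3700)]
print(decisions)
```

A rule like this is quick to deploy, but its threshold is fixed by hand, which is what the third step (a model trained on historical data) is meant to improve on.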
How we stop them — Infrastructure
[Diagram labels: Online, Offline, request, scoring, abuse DB, accept, reject, scheduled scoring jobs]
Case studies:
• Registration
• Fake accounts
• Account takeover
If they can’t get in, then they can’t do damage!
Answer: Asset Reputation Systems
We have 347 million members’ worth of data on
• Names
• Email addresses
• IP addresses
• ISPs
• Browsers
• etc.
We can assign a reputation score to each asset based on the level of abuse we’ve seen in the past.
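One simple way to turn past abuse into a reputation score is a smoothed fraction of abusive accounts per asset. The sketch below is an assumption for illustration only: the function name, the prior parameters, and the sample counts are hypothetical, and the slides do not say how the real scores are computed.

```python
def reputation_score(bad_count, good_count, prior_bad_rate=0.05, prior_weight=20.0):
    """Smoothed estimate of the fraction of abusive accounts seen on an asset.

    Returns a value in [0, 1]; higher means worse reputation. Assets with
    little history are pulled toward `prior_bad_rate`, an assumed site-wide
    abuse rate.
    """
    total = bad_count + good_count
    return (bad_count + prior_weight * prior_bad_rate) / (total + prior_weight)


# Hypothetical per-IP counts of (bad accounts, good accounts).
ip_history = {
    "198.51.100.1": (180, 20),  # mostly abusive
    "198.51.100.2": (0, 500),   # mostly legitimate
    "198.51.100.3": (1, 0),     # almost no history
}
scores = {ip: reputation_score(bad, good) for ip, (bad, good) in ip_history.items()}
for ip, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{ip}: {score:.3f}")
```

The smoothing keeps a single bad account on an otherwise unseen IP from producing an extreme score. The same function could be applied to email domains, ISPs, or browsers.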
Reputation Scoring
Instantaneous
• Calculated online from recent data
• Catches new bad activity
• Minimal feature set
  sample feature: rate of signups from IP in last hour
Historical
• Calculated offline from long-term data
• Catches recurring bad activity
• Extensive feature set
  sample feature: % of accounts using IP labeled abusive
Scoring Registration Attempts
• Machine-learned model combines reputation features (offline + online) to produce a registration score.
[Figure: score scale with ticks 0, 0.5, 1]
• How do we choose the thresholds?
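To make the idea of a registration score concrete, here is a minimal sketch that combines one online and two offline reputation features with a logistic function and maps the result to an action. The slides do not say which model is used. The weights, feature names, thresholds, and the intermediate "challenge" action are all hypothetical.

```python
import math

# Hypothetical weights; in practice these would be learned from labeled data.
WEIGHTS = {
    "ip_signups_last_hour": 0.08,          # online (instantaneous) feature
    "ip_abusive_fraction": 4.0,            # offline (historical) feature
    "email_domain_abusive_fraction": 3.0,  # offline (historical) feature
}
BIAS = -4.0


def registration_score(features):
    """Combine reputation features into a score in (0, 1); higher = more suspicious."""
    z = BIAS + sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def decide(score, challenge_at=0.5, reject_at=0.9):
    """Map a score to an action using two assumed thresholds."""
    if score >= reject_at:
        return "reject"
    if score >= challenge_at:
        return "challenge"
    return "accept"


clean = {"ip_signups_last_hour": 1, "ip_abusive_fraction": 0.01,
         "email_domain_abusive_fraction": 0.02}
suspicious = {"ip_signups_last_hour": 40, "ip_abusive_fraction": 0.9,
              "email_domain_abusive_fraction": 0.8}
for name, feats in (("clean", clean), ("suspicious", suspicious)):
    s = registration_score(feats)
    print(name, round(s, 3), decide(s))
```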
Precision/Recall Tradeoffs
• Once system is online, it’s hard to distinguish false positives from true positives.
• User has no recourse — be conservative!
• Bad guys who slip through will be caught sooner or later in other models.
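Being conservative can be read as choosing the threshold from labeled data collected before the system goes online, so that precision stays above a target. The sketch below assumes such a labeled sample exists; the function name and the sample data are hypothetical.

```python
def threshold_for_precision(scores, labels, target_precision):
    """Return the lowest threshold whose precision on the sample meets the target.

    `scores` are model outputs (higher = more suspicious); `labels` are 1 for
    known-bad and 0 for known-good. Everything scoring at or above the
    threshold would be blocked. Returns None if no threshold qualifies.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best = None
    true_pos = 0
    for i, (score, label) in enumerate(ranked, start=1):
        true_pos += label
        # Only evaluate at the last item of a run of tied scores.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        if true_pos / i >= target_precision:
            best = score
    return best


# Hypothetical labeled sample.
sample_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
sample_labels = [1, 1, 1, 0, 1, 0, 0]
print(threshold_for_precision(sample_scores, sample_labels, 1.0))
print(threshold_for_precision(sample_scores, sample_labels, 0.8))
```

A stricter precision target pushes the threshold up and lets more bad registrations through, which is acceptable here because later models get another chance to catch them.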
Fake Accounts Offline
Offline models can use many more features:
• Invitations
• Connection graph
• Profile content
• Messages sent/received
• Pattern of pages viewed
• Reported by other members
• etc.
Fake Accounts — Online and Offline
[Diagram labels: abuse DB, Fake account models (Heuristic/ML), replication]
Online/Offline Tradeoffs
Online
• Instant action
• Data collected from many sources
• Computationally limited
• Slow to build and iterate
Offline
• Action delayed hours to days
• Data all in one place (HDFS)
• Lots of computational resources
• Fast to build and iterate
Fake Account Defense in Action
[Figure: cumulative number of accounts (y-axis) over time (x-axis). Series: Blocked at Registration; Fake Accounts Caught; Fakes Caught Within 48h of Creation]
Precision/Recall again…
Fake account models have to be very precise.
How can we stop bad activity without making good members unhappy?
Member Reputation
Estimate the probability that a given member is real.
Stop abuse before it happens!
Member reputation infrastructure
[Diagram labels: abuse DB, Fake account models (Heuristic/ML), Member reputation model (ML), reputation DB, replication]
Attackers are smart
What do you do when your fake accounts get blocked?
Use real accounts instead!
Weak passwords
Attack:
Defense:
Pitfalls:
Credential dumps
Attack:
Defense:
Pitfalls:
Brute force attacks
Attack:
Defense:
Pitfalls:
Personal Attacks
Attack:
Defense:
Pitfalls:
Password defense
We must assume the attacker already has the password!
Data Science to the Rescue!
• Are you in a city we’ve seen you in before?
• Are you using a computer we’ve seen you use before?
• Have we seen abuse from this IP address?
• etc.
• For user u and data X, estimate Pr[attack | u,X], i.e., the likelihood that the person logging in is not actually you.
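The questions above translate directly into features of a login attempt. The sketch below shows one possible encoding of the data X for a user u. The `MemberHistory` record, the feature names, and the IP abuse lookup table are all hypothetical and not taken from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class MemberHistory:
    """Hypothetical record of where and how a member has logged in before."""
    cities: set = field(default_factory=set)
    devices: set = field(default_factory=set)


def login_features(history, city, device, ip, ip_abuse_rate):
    """Build the data X for one login attempt from the questions above.

    `ip_abuse_rate` is an assumed lookup table: IP -> fraction of past
    activity from that IP that was labeled abusive.
    """
    return {
        "city_seen_before": city in history.cities,
        "device_seen_before": device in history.devices,
        "ip_abuse_rate": ip_abuse_rate.get(ip, 0.0),
    }


history = MemberHistory(cities={"San Jose"}, devices={"laptop-chrome"})
ip_abuse_rate = {"192.0.2.50": 0.7}

usual = login_features(history, "San Jose", "laptop-chrome", "192.0.2.10", ip_abuse_rate)
unusual = login_features(history, "Oslo", "unknown-device", "192.0.2.50", ip_abuse_rate)
print(usual)
print(unusual)
```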
Estimating likelihood of attack
Heuristic:
[Figure labels: BAD, Not so bad]
Estimating likelihood of attack
Machine Learning:
Pr[attack | u,X] = Pr[attack | X] · (Pr[X] / Pr[X | u]) · (Pr[u | attack] / Pr[u])
• Pr[attack | X]: Asset Reputation
• Pr[X] / Pr[X | u]: Member and Site History
• Pr[u | attack] / Pr[u]: Member Reputation
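The factorization on this slide can be recovered from Bayes' rule. The derivation below assumes that u and X are conditionally independent given an attack, which the slide does not state explicitly. It is included only to make the three factors easier to check.

```latex
\begin{align*}
\Pr[\text{attack} \mid u, X]
  &= \frac{\Pr[u, X \mid \text{attack}] \, \Pr[\text{attack}]}{\Pr[u, X]} \\
  &= \frac{\Pr[u \mid \text{attack}] \, \Pr[X \mid \text{attack}] \, \Pr[\text{attack}]}
          {\Pr[X \mid u] \, \Pr[u]}
     && \text{(assumed conditional independence)} \\
  &= \Pr[\text{attack} \mid X] \cdot \frac{\Pr[X]}{\Pr[X \mid u]}
     \cdot \frac{\Pr[u \mid \text{attack}]}{\Pr[u]}
     && \text{since } \Pr[X \mid \text{attack}] \, \Pr[\text{attack}]
        = \Pr[\text{attack} \mid X] \, \Pr[X]
\end{align*}
```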
Scoring Login Attempts
• Use machine-learned model + heuristic rules to compute a login score.
[Figure: score scale with ticks 0, 0.5, 1]
• Thresholds determined by precision/recall tradeoffs (e.g. aim for x% false positives)
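Aiming for x% false positives can be done by reading the threshold off the score distribution of logins known to be legitimate. The sketch below is an assumption about how that might look; the function name and the sample scores are hypothetical.

```python
import math


def threshold_for_false_positive_rate(legit_scores, max_fpr):
    """Return a cutoff taken from the observed scores such that at most
    `max_fpr` of the known-legitimate logins score strictly above it.

    Logins scoring above the returned threshold would be challenged or blocked.
    """
    if not legit_scores:
        raise ValueError("need at least one legitimate score")
    ranked = sorted(legit_scores)
    # Number of legitimate logins we are willing to flag, capped so the
    # index below stays inside the list.
    allowed = min(math.floor(max_fpr * len(ranked)), len(ranked) - 1)
    return ranked[len(ranked) - allowed - 1]


# Hypothetical login scores for sessions known to be legitimate.
legit = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.15, 0.20, 0.40, 0.90]
threshold = threshold_for_false_positive_rate(legit, max_fpr=0.10)
flagged = [s for s in legit if s > threshold]
print(threshold, flagged)
```

Rounding the allowed count down keeps the realized false positive rate at or below the target, which matches the earlier advice to be careful about bothering good members.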
Take-aways
• Stop bad guys at the entry points.
• Be careful about bothering good members.
• Securing registration is hard: not much data.
• Securing login is hard: passwords suck.
• Run models offline to catch what you missed online.
Questions? [email protected]
(p.s. We’re hiring)