©2013 LinkedIn Corporation. All Rights Reserved. 1 Data Science vs. The Bad Guys Using data to defend LinkedIn against fraud and abuse David Freeman Head of Security Data Science at LinkedIn Strata+Hadoop World San Jose, CA 20 Feb 2015

Data Science vs. the Bad Guys: Defending LinkedIn from Fraud and Abuse


Data Science vs. The Bad Guys Using data to defend LinkedIn against fraud and abuse

David Freeman Head of Security Data Science at LinkedIn

Strata+Hadoop World

San Jose, CA 20 Feb 2015


World’s largest professional network

But not everyone follows the rules!


Why?


What do they try to do?

• Spam Messages
• Spam Content
• Fake Companies
• Fraud Ads
• Fake Jobs
• Social Engineering
• Social Action Spam (e.g. likes, follows)
• Payment Fraud
• Malware
• Malicious URLs
• Scraping


How do they do it?


How do we stop them?


How we stop them — process

1. Stop the bleeding!

2. Heuristic rules.

3. Machine learning.


Hypothetical Example: lots of fake accounts from one IP address

• Block the IP.
• Limit signup rate from any IP.
• Model trained on historical data, incorporating
  – Signups/IP/hour
  – Signups/IP/day
  – # good accounts on IP
  – # bad accounts on IP
  – other features
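The second step of the example (a heuristic limit on signup rate per IP) could be sketched roughly as below. This is a minimal illustration, not the system described in the talk: the class name `SignupRateLimiter`, the limit of 5 signups, and the one-hour window are all assumptions made for the example.

```python
from collections import defaultdict, deque


class SignupRateLimiter:
    """Heuristic rule: cap signups per IP within a sliding time window."""

    def __init__(self, max_signups=5, window_seconds=3600):
        # Both limits are illustrative assumptions, not values from the talk.
        self.max_signups = max_signups
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # ip -> timestamps of accepted signups

    def allow(self, ip, now):
        """Return True and record the signup if `ip` is under its limit."""
        events = self._events[ip]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_signups:
            return False
        events.append(now)
        return True

    def signups_in_window(self, ip, now):
        """Signups per IP per window: a count a model could use as a feature."""
        return sum(1 for t in self._events[ip] if now - t < self.window_seconds)
```

The same per-IP counter that drives the rule can later be fed to a trained model as the Signups/IP/hour feature, which is one way the heuristic step can lead into the machine learning step.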


How we stop them — Infrastructure

[Diagram labels: Online, Offline, request, scoring, abuse DB, accept, reject, scheduled scoring jobs]

Case studies:

• Registration
• Fake accounts
• Account takeover

If they can’t get in, then they can’t do damage!


How can we tell if you’re real?


Answer: Asset Reputation Systems

We have 347 million members’ worth of data on

• Names
• Email addresses
• IP addresses
• ISPs
• Browsers
• etc.

We can assign a reputation score to each asset based on the level of abuse we’ve seen in the past.
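One simple way to turn past abuse counts into a per-asset score is a smoothed abuse rate, sketched below. The talk does not say how its scores are computed; the function name `reputation_score`, the prior abuse rate of 5%, and the prior weight of 20 are assumptions chosen only to make the example concrete.

```python
def reputation_score(bad_count, good_count, prior_bad_rate=0.05, prior_weight=20.0):
    """Estimate an asset's abuse rate, shrunk toward a site-wide prior.

    Returns a value in [0, 1]; higher means a worse reputation. The prior
    keeps assets with very little history from getting extreme scores.
    All default values are illustrative assumptions.
    """
    total = bad_count + good_count
    return (bad_count + prior_bad_rate * prior_weight) / (total + prior_weight)
```

With these defaults, an IP address with no history scores at the prior, one bad account moves it only slightly, and a long record of good accounts pulls the score toward zero.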


Reputation Scoring

Instantaneous
• Calculated online from recent data
• Catches new bad activity
• Minimal feature set
  sample feature: rate of signups from IP in last hour

Historical
• Calculated offline from long-term data
• Catches recurring bad activity
• Extensive feature set
  sample feature: % of accounts using IP labeled abusive


Scoring Registration Attempts

• Machine-learned model combines reputation features (offline + online) to produce a registration score.

[Figure: score axis from 0 to 1, with 0.5 marked]

• How do we choose the thresholds?
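As a rough illustration of combining online and offline reputation features into one score, the sketch below uses a logistic model with hand-set weights. Everything here is hypothetical: the feature names, the weights, the bias, the two thresholds, and the intermediate "challenge" outcome are assumptions for the example, and a real model would learn its parameters from labeled data.

```python
import math

# Hypothetical feature weights; a real model would learn these from labeled data.
WEIGHTS = {
    "ip_signups_last_hour": 0.4,           # online (instantaneous) feature
    "ip_abusive_fraction": 5.0,            # offline (historical) feature
    "email_domain_abusive_fraction": 3.0,  # offline (historical) feature
}
BIAS = -4.0


def registration_score(features):
    """Logistic combination of reputation features; returns a score in (0, 1)."""
    z = BIAS + sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def decide(score, challenge_at=0.5, reject_at=0.9):
    """Map a score to an action using two illustrative thresholds."""
    if score >= reject_at:
        return "reject"
    if score >= challenge_at:
        return "challenge"
    return "accept"
```

Where the two cut points sit on the 0 to 1 axis is the threshold question the slide raises.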


Precision/Recall Tradeoffs

• Once system is online, it’s hard to distinguish false positives from true positives.

• User has no recourse: be conservative!

• Bad guys who slip through will be caught sooner or later in other models.
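Being conservative can be made concrete by picking the threshold from labeled historical data so that precision stays above a target, and accepting whatever recall results. The sketch below is one way to do that; the function name `threshold_for_precision` and the 0.99 default target are assumptions, not figures from the talk.

```python
def threshold_for_precision(scores, labels, target_precision=0.99):
    """Return the lowest score threshold whose precision meets the target.

    `labels` holds 1 for known-bad and 0 for known-good examples. Blocking
    every example with score >= threshold yields the reported precision and
    recall. Returns (threshold, precision, recall), or None if no threshold
    reaches the target.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_bad = sum(labels)
    best = None
    true_pos = 0
    for i, (score, label) in enumerate(ranked):
        true_pos += label
        # Only evaluate at the last example of a run of tied scores.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        precision = true_pos / (i + 1)
        if precision >= target_precision:
            recall = true_pos / total_bad if total_bad else 0.0
            best = (score, precision, recall)
    return best
```

Choosing the lowest qualifying threshold maximizes recall subject to the precision constraint; anything missed at this stage is left for later models to catch.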


Fake Accounts Offline

Offline models can use many more features:

• Invitations
• Connection graph
• Profile content
• Messages sent/received
• Pattern of pages viewed
• Reported by other members
• etc.


Fake Accounts — Online and Offline

[Diagram labels: abuse DB, fake account models (heuristic/ML), replication]

Online/Offline Tradeoffs

Online
• Instant action
• Data collected from many sources
• Computationally limited
• Slow to build and iterate

Offline
• Action delayed hours to days
• Data all in one place (HDFS)
• Lots of computational resources
• Fast to build and iterate


Fake Account Defense in Action

[Figure: cumulative number of accounts over time, with series "Blocked at Registration", "Fake Accounts Caught", and "Fakes Caught Within 48h of Creation"]

Precision/Recall again…

Fake account models have to be very precise.

How can we stop bad activity without making good members unhappy?


Member Reputation

Estimate the probability that a given member is real.

Stop abuse before it happens!


Member reputation infrastructure

[Diagram labels: abuse DB, fake account models (heuristic/ML), member reputation model (ML), reputation DB, replication]

Attackers are smart

What do you do when your fake accounts get blocked?

Use real accounts instead!

Many ways to get into an account


Weak passwords

Attack:

Defense:

Pitfalls:

Credential dumps

Attack:

Defense:

Pitfalls:

Brute force attacks

Attack:

Defense:

Pitfalls:

Phishing

Attack:

Defense:

Pitfalls:

Personal Attacks

Attack:

Defense:

Pitfalls:

Password defense

We must assume the attacker already has the password!


Data Science to the Rescue!

• Are you in a city we’ve seen you in before?
• Are you using a computer we’ve seen you use before?
• Have we seen abuse from this IP address?
• etc.

• For user u and data X, estimate Pr[attack | u,X], i.e., the likelihood that the person logging in is not actually you.


Estimating likelihood of attack

Heuristic:

[Figure labels: BAD; Not so bad]

Estimating likelihood of attack

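Machine Learning:

\[
\Pr[\text{attack} \mid u, X] = \Pr[\text{attack} \mid X] \cdot \frac{\Pr[X]}{\Pr[X \mid u]} \cdot \frac{\Pr[u \mid \text{attack}]}{\Pr[u]}
\]

• Pr[attack | X]: Asset Reputation
• Pr[X] / Pr[X | u]: Member and Site History
• Pr[u | attack] / Pr[u]: Member Reputation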
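One way to arrive at a three-factor product of this shape is Bayes' rule together with an independence assumption. The slide does not state its derivation, so the assumption that u and X are conditionally independent given an attack is supplied here only to show how the terms can separate.

```latex
\begin{align*}
\Pr[\text{attack} \mid u, X]
  &= \frac{\Pr[u, X \mid \text{attack}]\,\Pr[\text{attack}]}{\Pr[u, X]}
  && \text{(Bayes' rule)} \\
  &= \frac{\Pr[u \mid \text{attack}]\,\Pr[X \mid \text{attack}]\,\Pr[\text{attack}]}
          {\Pr[X \mid u]\,\Pr[u]}
  && \text{(assumed: $u$ and $X$ independent given attack)} \\
  &= \Pr[\text{attack} \mid X] \cdot \frac{\Pr[X]}{\Pr[X \mid u]}
     \cdot \frac{\Pr[u \mid \text{attack}]}{\Pr[u]}
  && \text{(since } \Pr[X \mid \text{attack}]\,\Pr[\text{attack}]
     = \Pr[\text{attack} \mid X]\,\Pr[X] \text{)}
\end{align*}
```

Read this way, each factor can be estimated from a different data source: abuse history of the asset, the member's own login history against site-wide traffic, and how likely the member is to be targeted.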

Scoring Login Attempts

• Use machine-learned model + heuristic rules to compute a login score.

[Figure: score axis from 0 to 1, with 0.5 marked]

• Thresholds determined by precision/recall tradeoffs (e.g. aim for x% false positives)
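A login score that mixes a probabilistic estimate with a heuristic rule might look like the sketch below. It multiplies the three factors of Pr[attack | u,X] and lets a blocklist rule override the result. The function name, the `ip_blocklisted` flag, and the clamping step are assumptions made for illustration; the talk does not describe how its model and rules are combined.

```python
def login_attack_score(p_attack_given_x, p_x, p_x_given_u,
                       p_u_given_attack, p_u, ip_blocklisted=False):
    """Combine the three factors of Pr[attack | u, X] into a score in [0, 1].

    All probability inputs are assumed to be smoothed, nonzero estimates.
    """
    # Hypothetical heuristic rule: a blocklisted IP overrides the model.
    if ip_blocklisted:
        return 1.0
    asset_reputation = p_attack_given_x
    history_ratio = p_x / p_x_given_u      # member and site history
    member_ratio = p_u_given_attack / p_u  # member reputation
    # Estimated factors need not multiply to a valid probability, so clamp.
    return min(1.0, max(0.0, asset_reputation * history_ratio * member_ratio))
```

A login from data X that the member uses often has a large Pr[X | u], which shrinks the middle ratio and the score; data that is rare for this member but common site-wide does the opposite.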

Take-aways

• Stop bad guys at the entry points.

• Be careful about bothering good members.

• Securing registration is hard: not much data.

• Securing login is hard: passwords suck.

• Run models offline to catch what you missed online.

Questions? [email protected]

(p.s. We’re hiring)