Automatic Detection of Web Trackers

Vasia Kalavri
Apache Flink PMC, PhD student @KTH
[email protected], @vkalavri


Telefonica Research, Barcelona

Computer Networks, Multimedia, Online Social Networks, Security & Privacy, Recommender Systems, HCI & Mobile Computing, Distributed Systems…


Ads

Recommendations

Browsing the Web


Tracking

[Diagram labels: Tracker, Tracker, Ad Server; cookie exchange, profiling, display relevant ads]


The study's authors defined "creepiness" by the feeling consumers get when they sense an ad is too personal because it uses data the consumer did not agree to provide, such as online-search and browsing history. Consumers are even more creeped out by this because they don't know how and where that information will be used.


Linking Tracker Information

Site           IP        Site ID        Tracker IDs
amazon.com     1.1.1.1   ID-A = “aaa”   ID-X = “xxx”
imdb.com       2.2.2.2   ID-B = “bbb”   ID-Y = “yyy”
facebook.com   3.3.3.3   ID-C = “ccc”   ID-X = “xxx”, ID-Y = “yyy”

Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, Edward W. Felten: Cookies That Give You Away: The Surveillance Implications of Web Tracking. WWW 2015: 289-299
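The linking idea on this slide can be sketched in a few lines of Python. This is an illustration only, not code from the talk or from the cited paper: the `observations` tuples follow one reading of the slide (tracker X embedded on amazon.com and facebook.com, tracker Y on imdb.com and facebook.com), and `link_visits` is a hypothetical helper that groups visits connected through a shared tracker ID.

```python
from collections import defaultdict

# (site, user IP, tracker cookie ID). The values mirror the slide's example;
# which tracker sits on which site is an assumption about the figure.
observations = [
    ("amazon.com", "1.1.1.1", "ID-X=xxx"),
    ("imdb.com", "2.2.2.2", "ID-Y=yyy"),
    ("facebook.com", "3.3.3.3", "ID-X=xxx"),
    ("facebook.com", "3.3.3.3", "ID-Y=yyy"),
]

def link_visits(observations):
    """Group visits that are transitively connected by a shared tracker ID."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Each observation ties a visit node to a tracker-ID node.
    for site, ip, tracker_id in observations:
        union(("visit", site, ip), ("id", tracker_id))

    groups = defaultdict(set)
    for node in list(parent):
        if node[0] == "visit":
            groups[find(node)].add(node[1:])
    return [sorted(group) for group in groups.values()]

linked = link_visits(observations)
```

With these observations, the visit to facebook.com carries both tracker IDs, so all three visits end up in one group even though each came from a different IP address.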


Can’t we block them?

[Diagram labels: proxy, Tracker, Tracker, Ad Server, Legitimate site]


Available Solutions

AdBlock, DoNotTrack, EasyPrivacy: crowd-sourced “black lists” of tracker URLs

● not frequently updated
● unclear who blacklists URLs, or based on what criteria
● miss “hidden” trackers or dual-role nodes
● blocking requires manual matching against the list
● can you buy your way into the whitelist?
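To make the list-matching step concrete, here is a minimal Python sketch of checking a host against a blacklist by domain suffix. It is an assumption-laden simplification: real filter lists use richer rule syntax, and the `blacklist` entries and the `is_blacklisted` helper are hypothetical.

```python
def is_blacklisted(host, blacklist):
    """Return True if the host or any of its parent domains is in the blacklist."""
    labels = host.lower().rstrip(".").split(".")
    # Check "a.b.c", then "b.c", then "c".
    return any(".".join(labels[i:]) in blacklist for i in range(len(labels)))

# Hypothetical list entries, for illustration only.
blacklist = {"google-analytics.com", "scorecardresearch.com"}
```

A host that is not on the list, or that also serves legitimate content, passes this check unchanged, which is the weakness the bullets above point at.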


Our Goal

Exploit fundamental properties necessary for tracker operation

Use existing data to build a tracker classifier

● structural attributes: connections, network positions

● operational aspects: data volume exchange, communication patterns


Can we detect Trackers automatically?

● Are Trackers similar? How?
  ○ network structure
  ○ data received/sent
  ○ response times
  ○ latency
● Are Trackers different from normal sites? How?
● Are Trackers mainly connected to other Trackers?


The Road to our Goal

● algorithms
● tuning
● features
● combinations of algorithms and features and parameters...


The Dataset

172.134.23.3 http://www.buzzfeed.com/sheridanwatson/happy-birthday-eva-you -lucky- gal#.gnJbE8EDDK 3 45 20150203:17080345 34 200 GET www.buzzfeed.com/ HTTP/1.1 Host: www.google-analytics.com User-Agent: Mozilla/5.0 (Windows; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729) Accept: text/html,application/xhtml+xml, application/xml;q=0.9,*/*;q=0.8 Keep-Alive: 300 Connection: keep-alive 234561 34 0 0 ES

#records: ~80m
#users: ~3k
#URLs: ~2m
#Trackers: ~4k


Basic Dataset Analysis

DataSet API

● How many requests to Trackers?

● Do Tracker requests have larger latency than other requests?

● How many Trackers
  ○ per user?
  ○ per request?
  ○ per website?

● Do popular websites embed more Trackers than others?

● Do same-topic websites share Trackers?

● Do different users visiting the same website end up on different Trackers?

● Do Trackers send or receive more or fewer bytes than other hosts?

● Do they have more or fewer connections on average?
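A few of these questions can be answered with simple aggregations. The sketch below uses plain Python rather than the Flink DataSet API used in the talk, and it assumes a simplified record layout `(user, referer, host, latency_ms)` and a `known_trackers` set; both are hypothetical stand-ins for the real log schema and tracker list.

```python
from collections import defaultdict

# Hypothetical simplified records: (user, referer, host, latency_ms).
records = [
    ("u1", "buzzfeed.com", "google-analytics.com", 40),
    ("u1", "buzzfeed.com", "cdn.buzzfeed.com", 20),
    ("u2", "buzzfeed.com", "b.scorecardresearch.com", 60),
    ("u2", "github.com", "github.com", 10),
]
known_trackers = {"google-analytics.com", "b.scorecardresearch.com"}

def basic_stats(records, trackers):
    """Count tracker requests and compare them with other requests."""
    tracker_requests = 0
    latency = {True: [], False: []}
    trackers_per_user = defaultdict(set)
    trackers_per_site = defaultdict(set)
    for user, referer, host, latency_ms in records:
        is_tracker = host in trackers
        latency[is_tracker].append(latency_ms)
        if is_tracker:
            tracker_requests += 1
            trackers_per_user[user].add(host)
            trackers_per_site[referer].add(host)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return {
        "tracker_requests": tracker_requests,
        "mean_tracker_latency": mean(latency[True]),
        "mean_other_latency": mean(latency[False]),
        "trackers_per_user": {u: len(h) for u, h in trackers_per_user.items()},
        "trackers_per_site": {s: len(h) for s, h in trackers_per_site.items()},
    }

stats = basic_stats(records, known_trackers)
```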


Main Idea

Model the data as a referer → host bipartite graph and exploit the graph structure to identify Trackers

[Figure: bipartite graph. Referers (URLs explicitly visited by the user): facebook.com, youtube.com. Hosts (embedded URLs): google-analytics.com, b.scorecardresearch.com]
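Building the referer → host graph from request pairs could look like the following Python sketch. The edge list is hypothetical (the slide does not say which referer embeds which host), and `build_bipartite_graph` is an illustrative helper, not the talk's Flink code.

```python
from collections import defaultdict

def build_bipartite_graph(requests):
    """Map each referer to the hosts it triggers requests to, with request counts."""
    graph = defaultdict(lambda: defaultdict(int))
    for referer, host in requests:
        graph[referer][host] += 1
    return {referer: dict(hosts) for referer, hosts in graph.items()}

# Hypothetical (referer, host) pairs extracted from the logs.
requests = [
    ("facebook.com", "google-analytics.com"),
    ("youtube.com", "google-analytics.com"),
    ("youtube.com", "b.scorecardresearch.com"),
    ("youtube.com", "b.scorecardresearch.com"),
]
graph = build_bipartite_graph(requests)
```

Keeping the request count per edge leaves room for the later question of how to weigh the input graph.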


Attempt #1: Relevance Search

Iterative, random walk-like algorithm for bipartite graphs

Given an input source node, assign a “relevance score” to other nodes, based on how similar their network position is

Sun, J., Qu, H., Chakrabarti, D., & Faloutsos, C. (2005). Relevance search and anomaly detection in bipartite graphs. ACM SIGKDD Explorations Newsletter, 7(2), 48-55.


Relevance Search Algorithm

[Figure labels: source; google-analytics, b.scorecardresearch, xzy/logo_small.jpg; scores 0.9 and 0.1]

In each iteration, a vertex:

- sends a score to out-neighbors
- sums up received scores and updates value
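The per-iteration behaviour described above can be sketched as a random walk with restart. This is a plain-Python approximation under stated assumptions, not the exact formulation of the cited paper nor the Gelly implementation: the `restart` probability of 0.15, the fixed iteration count, and the small example graph are all hypothetical, and the graph is assumed to have no isolated vertices.

```python
def relevance_search(neighbors, source, restart=0.15, iterations=50):
    """Random walk with restart from `source` over an undirected graph.

    neighbors: dict mapping each vertex to a list of adjacent vertices.
    Returns a dict of relevance scores relative to the source.
    """
    scores = {v: 0.0 for v in neighbors}
    scores[source] = 1.0
    for _ in range(iterations):
        received = {v: 0.0 for v in neighbors}
        for v, adjacent in neighbors.items():
            if not adjacent:
                continue
            share = scores[v] / len(adjacent)  # send an equal share to each neighbor
            for u in adjacent:
                received[u] += share
        # Sum up received scores, then jump back to the source with probability `restart`.
        scores = {v: (1 - restart) * received[v] for v in neighbors}
        scores[source] += restart
    return scores

# Hypothetical bipartite graph: referers r1..r3 on one side, hosts on the other.
neighbors = {
    "ga": ["r1", "r2", "r3"], "sc": ["r1", "r2"], "logo": ["r3"],
    "r1": ["ga", "sc"], "r2": ["ga", "sc"], "r3": ["ga", "logo"],
}
scores = relevance_search(neighbors, source="ga")
```

In this toy graph, `sc` shares two referers with the source `ga` while `logo` shares only one, so `sc` ends up with the higher relevance score.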


Relevance Search Implementation

Gelly API

● single-source relevance search
  ○ similar to PageRank
  ○ easily mapped to vertex-centric iterations
● multi-source relevance search
  ○ each vertex keeps a vector of scores
  ○ compute top-k relevant nodes per source
  ○ merge the top-k lists
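The last two steps of the multi-source variant can be illustrated in plain Python (not Gelly). The slide does not say how the top-k lists are merged, so keeping each node's best score across sources is an assumption, as are the helper names and the example scores.

```python
import heapq

def top_k(scores, source, k):
    """Return the k highest-scoring nodes for one source, excluding the source itself."""
    candidates = ((score, node) for node, score in scores.items() if node != source)
    return [node for score, node in heapq.nlargest(k, candidates)]

def merge_top_k(scores_by_source, k):
    """Union the per-source top-k lists, ranking nodes by their best score (assumed rule)."""
    merged = {}
    for source, scores in scores_by_source.items():
        for node in top_k(scores, source, k):
            merged[node] = max(merged.get(node, 0.0), scores[node])
    return sorted(merged, key=lambda node: (-merged[node], node))

# Hypothetical relevance scores, one score vector per source node.
scores_by_source = {
    "ga": {"ga": 0.4, "sc": 0.3, "ads": 0.2, "logo": 0.1},
    "sc": {"sc": 0.5, "ga": 0.25, "ads": 0.2, "logo": 0.05},
}
candidates = merge_top_k(scores_by_source, k=2)
```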


Data Pipeline

Bipartite graph creation → Multi-source Relevance Search → top-k relevant nodes → Classification

www.google-analytics.com: T
www.bscored-research.com: T
www.facebook.com: NT
www.github.com: NT
cdn.cxense.com: NT
...


Relevance Search Tuning

● How many and which sources to give as input?
● How to define convergence?
● Does initialization matter?
● How to weigh the input graph?
● How to define the relevance score threshold?


Relevance Search Problems

● Easy to find the few very similar and the few very different pages

● Popular trackers are similar to other popular trackers, but not to not-so-popular ones

● We might keep re-discovering what we already know


Relevance Search doesn’t seem to completely solve the problem… Where do we go now?


Attempt #2...N-1: Combining Relevance Search with other algorithms

Several Clustering algorithms

k-nn Classification

Random Forest


Data Pipeline(s)

Bipartite graph creation → Multi-source Relevance Search → top-k relevant nodes → [feature extraction] → Classification: [your clustering, classification, etc. algorithm here] → [evaluation]

www.google-analytics.com: T
www.bscored-research.com: T
www.facebook.com: NT
www.github.com: NT
cdn.cxense.com: NT
...


The Projection Graph

[Figure: the referer-hosts graph (referers r1 to r7, hosts h1 to h8) next to the hosts-projection graph derived from it, whose edges between hosts are labeled with referers. Legend: referer, non-tracker host (NT), tracker host (T), unlabeled host (?)]
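A hosts-projection of a bipartite graph connects two hosts whenever at least one referer links to both. The Python sketch below shows that construction; the `referer_hosts` example is hypothetical and does not reproduce the edges of the figure.

```python
from collections import defaultdict
from itertools import combinations

def project_hosts(referer_hosts):
    """Connect two hosts if at least one referer links to both.

    Returns {(host_a, host_b): set of shared referers} with host_a < host_b.
    """
    edges = defaultdict(set)
    for referer, hosts in referer_hosts.items():
        for a, b in combinations(sorted(set(hosts)), 2):
            edges[(a, b)].add(referer)
    return dict(edges)

# Hypothetical referer -> hosts adjacency.
referer_hosts = {
    "r1": ["h1", "h2"],
    "r2": ["h2", "h3"],
    "r3": ["h2", "h3", "h4"],
}
projection = project_hosts(referer_hosts)
```

Each projected edge keeps the set of referers that induced it, which mirrors the referer labels on the edges of the projection graph.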


Attempt #N: Community Detection on the Projection Graph

The Projection Graph captures implicit connections between trackers, through other sites

Do Trackers form communities in the Projection Graph?
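The slides do not name the community detection algorithm, so the sketch below swaps in a simple seeded label propagation as a stand-in: unlabeled hosts repeatedly adopt the most common label among their already-labeled neighbors. The example graph, the seed labels, and the tie-breaking rule are all assumptions.

```python
from collections import Counter

def propagate_labels(neighbors, labels, rounds=10):
    """Give each unlabeled host the most common label among its labeled neighbors.

    neighbors: dict host -> set of adjacent hosts in the projection graph.
    labels: dict host -> "T" or "NT" for hosts whose label is already known.
    """
    labels = dict(labels)
    for _ in range(rounds):
        updates = {}
        for host in sorted(neighbors):
            if host in labels:
                continue
            votes = Counter(labels[n] for n in neighbors[host] if n in labels)
            if votes:
                # Ties are broken alphabetically ("NT" before "T"): an arbitrary choice.
                updates[host] = min(votes, key=lambda label: (-votes[label], label))
        if not updates:
            break
        labels.update(updates)
    return labels

# Hypothetical projection graph with two components and three seed labels.
neighbors = {
    "h1": {"h2"}, "h2": {"h1"},
    "h3": {"h4", "h5"}, "h4": {"h3", "h5"},
    "h5": {"h3", "h4", "h6"}, "h6": {"h5"},
}
seeds = {"h1": "NT", "h3": "T", "h4": "T"}
result = propagate_labels(neighbors, seeds)
```

If trackers do cluster together in the projection graph, a host surrounded by known trackers is labeled T after a round or two, and hosts further out follow in later rounds.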


Basic Analysis of the Projection Graph

DataSet & Gelly APIs

● Do Trackers have unusually high degrees?

● Do they form connected components?

● Are they mainly connected to other Trackers?
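All three questions reduce to basic graph statistics. The following plain-Python sketch (not the DataSet or Gelly code from the talk) computes degrees, the fraction of each host's neighbors that are trackers, and connected components; the graph and the tracker set are hypothetical.

```python
def analyze(neighbors, trackers):
    """Return per-host degree, tracker-neighbor ratio, and connected components."""
    degree = {host: len(adjacent) for host, adjacent in neighbors.items()}
    tracker_ratio = {
        host: sum(n in trackers for n in adjacent) / len(adjacent)
        for host, adjacent in neighbors.items()
        if adjacent
    }
    components, seen = [], set()
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            host = stack.pop()
            if host in component:
                continue
            component.add(host)
            stack.extend(neighbors[host] - component)
        seen |= component
        components.append(component)
    return degree, tracker_ratio, components

# Hypothetical projection graph (host -> set of adjacent hosts) and tracker labels.
neighbors = {
    "h1": {"h2"}, "h2": {"h1"},
    "h3": {"h4", "h5"}, "h4": {"h3", "h5"},
    "h5": {"h3", "h4", "h6"}, "h6": {"h5"},
}
degree, tracker_ratio, components = analyze(neighbors, trackers={"h3", "h4", "h5"})
```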


Visualization


Final Data Pipeline

raw logs → cleaned logs

1: logs pre-processing
2: bipartite graph creation
3: largest connected component extraction
4: hosts-projection graph creation
5: community detection
6: results

google-analytics.com: T
bscored-research.com: T
facebook.com: NT
github.com: NT
cdn.cxense.com: NT
...

[API labels in the figure: DataSet API, Gelly, DataSet API]

Very high accuracy and very low FPR :-)
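For reference, accuracy and false positive rate (FPR) are computed from the confusion counts as (TP + TN) / total and FP / (FP + TN). The Python sketch below shows the arithmetic on made-up labels; the host names and predictions are hypothetical and are not results from the talk.

```python
def accuracy_and_fpr(predicted, actual, positive="T"):
    """Compute accuracy and false positive rate for tracker (T) predictions."""
    tp = fp = tn = fn = 0
    for host, truth in actual.items():
        guess = predicted[host]
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1  # non-tracker wrongly flagged as tracker
        elif truth == positive:
            fn += 1  # tracker that was missed
        else:
            tn += 1
    accuracy = (tp + tn) / len(actual)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr

# Made-up ground truth and predictions for five hypothetical hosts.
actual = {"h1": "T", "h2": "T", "h3": "NT", "h4": "NT", "h5": "NT"}
predicted = {"h1": "T", "h2": "NT", "h3": "NT", "h4": "NT", "h5": "T"}
accuracy, fpr = accuracy_and_fpr(predicted, actual)
```

A low FPR matters here because a false positive means blocking a legitimate site.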


Lessons Learned

Start simple

Choose features incrementally

Visualize your data

Re-evaluate your models

Try different data representations

Use a flexible system


Optimizing the Pipeline

Flink Optimizer to the rescue :-)