Connecting Users across Social Media Sites: A Behavioral-Modeling Approach. Reza Zafarani and Huan Liu, Data Mining and Machine Learning Laboratory (DMML), Arizona State University. KDD 2013 – Chicago, Illinois.
Connecting Users across Social Media Sites: A Behavioral-Modeling Approach
REZA ZAFARANI AND HUAN LIU
DATA MINING AND MACHINE LEARNING LABORATORY (DMML)
ARIZONA STATE UNIVERSITY
KDD 2013 – CHICAGO, ILLINOIS
How hard can it be to identify
an individual across sites?
Privacy Experts Claim Advertisers
Know a lot about People
Can they stop showing you the
same repetitive ads across sites?
More information about individuals
Many social media sites
Partial Information
Complementary Information
Better User Profiles
Example: Huan Liu's profiles on two sites (e.g., Google+)
Profile | Age | Location | Education
1 | N/A | Tempe, AZ | USC
2 | N/A | USA | USC (1985-89)
Can we connect individuals across sites?
Connectivity is not available
Consistency in Information Availability
Can we verify that the information provided across sites belongs to the same individual?
MOdeling Behavior for Identifying Users across Sites
Human behavior generates Information redundancy
Information shared across sites
provides a behavioral fingerprint
MOBIUS
- Behavioral Modeling
- Minimum Information
Identification Function
Minimum information available on ALL sites:
Usernames
Candidate Username (john.smith)
Prior Usernames ({jsmith, john.s})
[Diagram: Behaviors 1..n generate Information Redundancy, which is captured via Feature Sets 1..n. The feature sets, together with Data, feed a Learning Framework that produces the Identification Function.]
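The flow from behaviors to feature sets to a learned identification function can be sketched roughly as follows. This is only an illustration: the two feature functions, the `FeatureFn` type, and `build_feature_vector` are hypothetical names chosen here, not the authors' implementation, and the learning step itself is left out.

```python
from typing import Callable, List, Sequence

# Hypothetical type: a feature function maps (candidate, prior usernames)
# to a list of numeric features capturing one behavior.
FeatureFn = Callable[[str, Sequence[str]], List[float]]

def first_char_match(candidate: str, priors: Sequence[str]) -> List[float]:
    """Toy feature: fraction of prior usernames sharing the candidate's first character."""
    hits = sum(1 for p in priors if p[:1] == candidate[:1])
    return [hits / len(priors)]

def length_gap(candidate: str, priors: Sequence[str]) -> List[float]:
    """Toy feature: distance between candidate length and mean prior length."""
    mean_len = sum(len(p) for p in priors) / len(priors)
    return [abs(len(candidate) - mean_len)]

def build_feature_vector(candidate: str,
                         priors: Sequence[str],
                         feature_fns: Sequence[FeatureFn]) -> List[float]:
    """Concatenate the feature sets produced for each modeled behavior."""
    vector: List[float] = []
    for fn in feature_fns:
        vector.extend(fn(candidate, priors))
    return vector

# One labeled instance would pair this vector with a 0/1 label
# (same individual or not) before being passed to any classifier.
vec = build_feature_vector("john.smith", ["jsmith", "john.s"],
                           [first_char_match, length_gap])
```

New behaviors can be added by appending another feature function to the list, which mirrors the "integration of additional behaviors" idea in the slides.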
Behaviors
- Human Limitation
  - Time & Memory Limitation
  - Knowledge Limitation
- Exogenous Factors
  - Typing Patterns
  - Language Patterns
- Endogenous Factors
  - Personal Attributes & Traits
  - Habits
Time and Memory Limitation
- Using Same Usernames
- Username Length Likelihood
59% of individuals use the same username
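A minimal sketch of how the two time-and-memory features could be computed from a candidate username and a set of prior usernames. The function names and the particular length statistics are assumptions made for illustration; the slides do not specify them.

```python
from statistics import mean
from typing import Dict, Sequence

def same_username_count(candidate: str, priors: Sequence[str]) -> int:
    """Number of prior usernames identical to the candidate (username reuse)."""
    return sum(1 for p in priors if p == candidate)

def length_features(candidate: str, priors: Sequence[str]) -> Dict[str, object]:
    """Compare the candidate's length with the lengths seen in prior usernames."""
    lengths = [len(p) for p in priors]
    return {
        "candidate_length": len(candidate),
        "min_prior_length": min(lengths),
        "max_prior_length": max(lengths),
        "mean_prior_length": mean(lengths),
        # Assumed heuristic: a length inside the prior range is more likely.
        "length_in_prior_range": min(lengths) <= len(candidate) <= max(lengths),
    }
```

For example, `length_features("john.smith", ["jsmith", "john.s"])` reports a candidate length of 10 against prior lengths that are all 6.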
Knowledge Limitation
- Limited Vocabulary
- Limited Alphabet
Identifying individuals by their vocabulary size
Alphabet size is correlated with language: शमंत कुमार -> Shamanth Kumar
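The limited-alphabet idea can be illustrated with a small sketch that compares the characters in a candidate username with those seen in prior usernames. The feature names are hypothetical, and treating each distinct character as one "letter" of the alphabet is an assumption made here.

```python
from typing import Dict, Sequence, Set

def alphabet(usernames: Sequence[str]) -> Set[str]:
    """Set of distinct characters used across a collection of usernames."""
    return set("".join(usernames))

def alphabet_features(candidate: str, priors: Sequence[str]) -> Dict[str, int]:
    """Alphabet sizes plus the number of candidate characters never seen before."""
    prior_alphabet = alphabet(priors)
    candidate_alphabet = set(candidate)
    return {
        "prior_alphabet_size": len(prior_alphabet),
        "candidate_alphabet_size": len(candidate_alphabet),
        "unseen_characters": len(candidate_alphabet - prior_alphabet),
    }
```

A candidate written in a different script from the prior usernames would show up as a large `unseen_characters` value.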
Typing Patterns
QWERTY Keyboard Variants: AZERTY, QWERTZ
DVORAK Keyboard
Keyboard type impacts your usernames
QWER1234 AOEUISNTH
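One way to turn keyboard layout into a feature is to measure how much of a username falls on a given keyboard row. The sketch below assumes that representation; the constant and function names are illustrative and only three rows are defined.

```python
from typing import Set

# Key rows for two layouts (letters only, lowercase).
QWERTY_TOP_ROW: Set[str] = set("qwertyuiop")
QWERTY_HOME_ROW: Set[str] = set("asdfghjkl")
DVORAK_HOME_ROW: Set[str] = set("aoeuidhtns")

def row_fraction(username: str, row: Set[str]) -> float:
    """Fraction of the username's characters that lie on the given keyboard row."""
    chars = username.lower()
    if not chars:
        return 0.0
    return sum(1 for c in chars if c in row) / len(chars)
```

With these definitions, every character of AOEUISNTH lies on the DVORAK home row, while half of QWER1234 lies on the QWERTY top row (the rest are digits).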
Habits - old habits die hard
- Modifying Previous Usernames: adding prefixes/suffixes, abbreviating, swapping or adding/removing characters
- Creating Similar Usernames: e.g., Nametag and Gateman
- Username Observation Likelihood: usernames come from a language model
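The three habit-related ideas can each be approximated with a standard technique. The sketch below uses Levenshtein edit distance for modified usernames, a letter-distribution comparison for similar usernames, and an add-one-smoothed character bigram model for observation likelihood. These specific choices and all function names are assumptions for illustration, not necessarily the measures used by the authors.

```python
import math
from collections import Counter
from typing import Sequence

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def same_letter_distribution(a: str, b: str) -> bool:
    """True when two usernames use exactly the same letters with the same counts."""
    return Counter(a.lower()) == Counter(b.lower())

def bigram_log_likelihood(candidate: str, priors: Sequence[str]) -> float:
    """Log-likelihood of the candidate under a character bigram model
    estimated from the prior usernames (add-one smoothing)."""
    pair_counts: Counter = Counter()
    context_counts: Counter = Counter()
    vocab = set("^$")  # start and end markers
    for name in priors:
        padded = "^" + name + "$"
        vocab.update(padded)
        for x, y in zip(padded, padded[1:]):
            pair_counts[(x, y)] += 1
            context_counts[x] += 1
    padded = "^" + candidate + "$"
    vocab.update(padded)
    v = len(vocab)
    return sum(math.log((pair_counts[(x, y)] + 1) / (context_counts[x] + v))
               for x, y in zip(padded, padded[1:]))
```

Under this sketch, a candidate that reuses character sequences from prior usernames scores a higher likelihood than one built from unseen characters.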
Experiment Setup
Data:
200,000 instances (50% class balance)
414 Features
Previous Methods:
1) Zafarani and Liu, 2009
2) Perito et al., 2011
Baselines:
3) Exact Username Match
4) Substring Match
5) Patterns in Letters
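The first two baselines are simple enough to sketch directly. The slides do not define them precisely, so the case-insensitive comparison and the two-directional substring test below are assumptions.

```python
from typing import Sequence

def exact_username_match(candidate: str, priors: Sequence[str]) -> bool:
    """Baseline: predict 'same individual' if the candidate equals any prior username."""
    c = candidate.lower()
    return any(c == p.lower() for p in priors)

def substring_match(candidate: str, priors: Sequence[str]) -> bool:
    """Baseline: predict 'same individual' if the candidate contains a prior
    username or is contained in one. Empty strings are ignored."""
    c = candidate.lower()
    if not c:
        return False
    return any(p and (c in p.lower() or p.lower() in c) for p in priors)
```

For the running example, `substring_match("john.smith", ["jsmith", "john.s"])` is true because "john.s" is a prefix of the candidate, while the exact match baseline returns false.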
MOBIUS Performance
Method | Performance (%)
Exact Username Match | 77
Substring Matching | 63.12
Patterns in Letters | 49.25
Zafarani and Liu | 66
Perito et al. | 77.59
Naïve Bayes | 91.38
Choice of Learning Algorithm
Algorithm | Performance (%)
Naïve Bayes | 91.38
J48 | 90.87
Random Forest | 93.59
L2-reg L2-Loss SVM | 93.7
L1-reg L2-Loss SVM | 93.71
L2-reg Logistic Regression | 93.77
L1-reg Logistic Regression | 93.8
Diminishing Returns for Adding More Usernames
Conclusions + Future Work
Conclusions:
- A methodology for connecting individuals across sites
- A behavioral modeling approach
- Uses minimum information across sites
- Allows for integration of additional behaviors when required
- Human behavior results in information redundancy
- Information shared across sites acts as a behavioral fingerprint
Future work:
- Incorporating features indigenous to specific sites
- Discover applications of connecting users across sites