Large-Scale Cost-sensitive Online Social Network Profile Linkage
Background & Motivation
Footprints in different social networks.
User identification in social analysis.
Privacy & security
Commercial & government applications
Outline
Problem definition
Related work
Approach
Experiment
Conclusion & future work
Problem Definition
Terminology
Identity: Person
Profile/User: Your footprint on social media
Profile Linkage: Link your footprints together
Input & Output
Input: profiles of one site as QUERY and profiles of the other site as TARGET.
Output: all pairs of profiles classified as matched.
Characteristics of profile
Name (semi vs. structured)
{“given name”: “haochen”, “family name”: “zhang”}
name: zhang haochen
Semi-structured schema
Incompleteness & missing attributes
Privacy policy
Virtual identification
Free text description
Bio, About me, Tags
Multilingualism
Multilingualism
Top 5 languages in the Facebook dataset
English
Portuguese
Spanish
Chinese
French
Most frequent tokens in different languages
chris, john, michael
chen, wang, lee
carlos, garcia, daniel
sergey, olga, alexander
About 70% of users are in English.
7.2% of users register under different locales.
Transliteration
昊辰 => Haochen
Feature Acquisition
Network communication costs too much time.
Usage limit of the web service.
1000 invocations per day for Google Maps API
Computational complexity compared with string similarity.
Image processing algorithm.
Overview of approach
Classification of Potential Links
Feature representation
Supervised learning
Cost-sensitive Feature Acquisition
Pruning with Canopy
Parameter tuning
Canopy construction
Entity-based Representation of Profiles
Mapping
Tokenization
Entity extraction
Canopy: design
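As a rough illustration of how canopy-based pruning generally works, the sketch below groups profiles into overlapping canopies using a cheap distance and two thresholds. It is the generic canopy technique, not necessarily the design used in this work: the Jaccard distance over token sets, the `loose` and `tight` threshold values, and all function names are assumptions made for the example.

```python
import random


def jaccard_distance(a, b):
    """Cheap distance: 1 - Jaccard similarity between two token sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def build_canopies(profiles, loose, tight, seed=0):
    """Group profile ids into overlapping canopies.

    profiles: dict mapping profile id -> set of tokens.
    loose, tight: distance thresholds with tight <= loose (assumed values).
    """
    if tight > loose:
        raise ValueError("tight must not exceed loose")
    rng = random.Random(seed)
    remaining = set(profiles)
    canopies = []
    while remaining:
        center = rng.choice(sorted(remaining))
        canopy = {center}
        for pid in list(remaining):
            d = jaccard_distance(profiles[center], profiles[pid])
            if d <= loose:
                canopy.add(pid)          # close enough to share a canopy
            if d <= tight:
                remaining.discard(pid)   # too close to start a canopy of its own
        remaining.discard(center)
        canopies.append(canopy)
    return canopies


# Hypothetical toy profiles: two footprints of one person and one stranger.
profiles = {
    "t1": {"haochen", "zhang"},
    "l1": {"zhang", "haochen", "nus"},
    "t2": {"john", "smith"},
}
canopies = build_canopies(profiles, loose=0.6, tight=0.2)
```

Only pairs that fall into a common canopy would then be passed on to the more expensive classification step, which is where the pruning comes from.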
Canopy: efficiency
Local Features
Username
Jaro-Winkler similarity
Language
Jaccard similarity
Description, URL
Cosine similarity with TF×IDF
Popularity
Defined as the number of friends of a user.
Adopt the following metric
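The three string-based local features listed above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the Winkler prefix weight of 0.1 with a 4-character prefix cap is the conventional default, the `log(n / df) + 1` IDF smoothing is one common choice, and all function names are made up for the example. The popularity metric is not covered here.

```python
import math
from collections import Counter


def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix (for usernames)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def jaccard(a, b):
    """Jaccard similarity between two sets (for language sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def build_idf(docs):
    """IDF table from tokenized documents; smoothing is an assumption."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log(n / c) + 1.0 for t, c in df.items()}


def tfidf_cosine(tokens_a, tokens_b, idf):
    """Cosine similarity of TF x IDF vectors (for description or URL tokens)."""
    va = {t: c * idf.get(t, 0.0) for t, c in Counter(tokens_a).items()}
    vb = {t: c * idf.get(t, 0.0) for t, c in Counter(tokens_b).items()}
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Jaro-Winkler suits short strings such as usernames because it rewards a shared prefix, while the TF×IDF weighting keeps very common description tokens from dominating the cosine score.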
External Features
Geographic Location
Values are diverse with different types.
Google Maps API:
string-represented location => geographic information
Spherical distance between two locations as the feature
Avatar
χ2 dissimilarity of the avatar’s gray-scale histogram.
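Once a location string has been resolved to coordinates and an avatar has been reduced to gray-scale pixel values, both external features come down to short computations. The sketch below assumes the haversine formula for the spherical distance, a mean Earth radius of 6371 km, a 16-bin histogram, and the symmetric form of the χ2 distance; the slides do not state which variants were used, and the function names are illustrative. The geocoding call itself is left out.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def spherical_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def gray_histogram(pixels, bins=16):
    """Normalized histogram of gray-scale pixel values in the range 0..255."""
    hist = [0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * bins


def chi2_dissimilarity(h1, h2):
    """Symmetric chi-square distance between two histograms of equal length."""
    return 0.5 * sum((a - b) ** 2 / (a + b)
                     for a, b in zip(h1, h2) if a + b > 0)
```

With normalized histograms this distance ranges from 0 for identical avatars to 1 for histograms with no overlapping bins.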
Classification: learning
Probabilistic model derived from naïve Bayes
Independent feature assumption
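For reference, the independence assumption is what lets a naïve Bayes style model factor the match odds into one term per feature. The notation below is introduced for illustration only: M denotes "the two profiles match", f_i is the value of the i-th feature, and reading S_n as the accumulated log-odds after n features is an assumption rather than a definition taken from the slides.

```latex
\frac{P(M \mid f_1, \dots, f_n)}{P(\bar{M} \mid f_1, \dots, f_n)}
  = \frac{P(M)}{P(\bar{M})}
    \prod_{i=1}^{n} \frac{P(f_i \mid M)}{P(f_i \mid \bar{M})}
\qquad\Longrightarrow\qquad
S_n = S_0 + \sum_{i=1}^{n} \log \frac{P(f_i \mid M)}{P(f_i \mid \bar{M})},
\quad S_0 = \log \frac{P(M)}{P(\bar{M})}
```

In this form each newly acquired feature only adds one term to the running score, so evidence can be accumulated one feature at a time.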
Classification: learning
Iterative inference
Terminate if S_n is discriminative.
Set up a threshold for each feature by choosing its error rate on the training set; the threshold determines whether S_n is discriminative.
Order of the features
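The idea of stopping as soon as the score is discriminative can be sketched as a loop that acquires features in a fixed order and exits early. This is a simplified illustration, not the slides' exact procedure: it uses a single pair of thresholds (`lower`, `upper`) instead of per-feature thresholds derived from training error rates, it treats each feature as a callable returning a log-likelihood-ratio contribution, and every name and number in it is hypothetical.

```python
def iterative_inference(initial_score, features, lower, upper):
    """Acquire features in order until the running score is decisive.

    features: list of (name, acquire) pairs, cheapest first; acquire() returns
              the feature's log-likelihood-ratio contribution (assumed form).
    lower, upper: assumed decision thresholds with lower < 0 < upper.
    Returns (is_match, final_score, names_of_acquired_features).
    """
    score = initial_score
    acquired = []
    for name, acquire in features:
        if score <= lower or score >= upper:
            break                      # already discriminative: skip the rest
        score += acquire()             # pay the acquisition cost only now
        acquired.append(name)
    return score > 0.0, score, acquired


# Hypothetical example: a strong username match makes geocoding unnecessary.
calls = []


def username_llr():
    calls.append("username")
    return 4.0


def location_llr():
    calls.append("location")           # stands in for a costly web-service call
    return 1.0


is_match, score, acquired = iterative_inference(
    0.0, [("username", username_llr), ("location", location_llr)],
    lower=-3.0, upper=3.0)
```

Because the expensive callable is never invoked once the score crosses a threshold, the order of the features directly controls how many external acquisitions are spent.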
Classification: learning
Initial value
Estimate by the prior that two profiles sharing rarer tokens are more likely to be matched.
Use this estimate as the initial value.
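The intuition behind this prior can be illustrated with an IDF-style score: shared tokens that occur in few profiles contribute more than shared tokens that occur in many. The formula below is a hypothetical stand-in chosen for illustration, not the estimate used in the slides, and the function names and toy data are assumptions.

```python
import math
from collections import Counter


def document_frequencies(profiles):
    """Count in how many profiles each token appears."""
    df = Counter()
    for tokens in profiles:
        df.update(set(tokens))
    return df


def rarity_prior(tokens_a, tokens_b, df, n_profiles):
    """Sum of log(N / df) over shared tokens: rarer shared tokens score higher."""
    shared = set(tokens_a) & set(tokens_b)
    return sum(math.log(n_profiles / df[t]) for t in shared if df[t] > 0)


# Hypothetical toy corpus of tokenized profile names.
corpus = [
    ["john", "smith"], ["john", "lee"], ["john", "wang"],
    ["haochen", "zhang"], ["haochen", "zhang"],
]
df = document_frequencies(corpus)
rare_pair = rarity_prior(corpus[3], corpus[4], df, len(corpus))
common_pair = rarity_prior(corpus[0], corpus[1], df, len(corpus))
```

Here the pair sharing "haochen" and "zhang" starts from a higher score than the pair sharing only "john", which is the behaviour the prior is meant to capture.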
Dataset of experiment
Data source
152,294 Twitter users
154,379 LinkedIn users
Ground truth: 9,750 identities
4,779 identities with both accounts.
3,339 identities with only a Twitter account.
1,632 identities with only a LinkedIn account.
Experiment: Performance on overall linkage
I-Acc (Identity Accuracy)
correctly identified identities / all identities in ground truth
Better than the naïve learning method, owing to the adopted prior.
Performance differs across learning methods.
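The I-Acc ratio defined above can be computed as in the sketch below. What counts as a "correctly identified" identity is an assumption here: an identity with both accounts is correct when exactly that pair is predicted, and an identity with a single account is correct when that account is left unlinked. Predicted links are also assumed to be one-to-one, and the data layout is hypothetical.

```python
def identity_accuracy(truth, predicted):
    """I-Acc = correctly identified identities / all identities in ground truth.

    truth: dict identity -> (twitter_id or None, linkedin_id or None).
    predicted: set of (twitter_id, linkedin_id) links, assumed one-to-one.
    """
    linked_tw = {tw: li for tw, li in predicted}
    linked_li = {li: tw for tw, li in predicted}
    correct = 0
    for tw, li in truth.values():
        if tw is not None and li is not None:
            ok = linked_tw.get(tw) == li       # the right pair was linked
        elif tw is not None:
            ok = tw not in linked_tw           # Twitter-only: must stay unlinked
        else:
            ok = li not in linked_li           # LinkedIn-only: must stay unlinked
        correct += ok
    return correct / len(truth) if truth else 0.0


# Hypothetical toy ground truth and predictions.
truth = {
    "p1": ("t1", "l1"),
    "p2": ("t2", None),
    "p3": (None, "l3"),
    "p4": ("t4", "l4"),
}
predicted = {("t1", "l1"), ("t2", "l3")}
i_acc = identity_accuracy(truth, predicted)
```

Under this reading a single wrong link can cost two identities at once, as the spurious ("t2", "l3") link does in the toy data.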
Experiment: Cost-sensitive feature acquisition
5% improvement of F1 by taking 148,743 external feature acquisitions.
Different order of external features.
Rank by cost
Rank by distinguishability
Three sections divided by two inflection points.
Discussion: dataset construction
Dataset construction
Connections
Cannot correctly reflect the web-scale scenario.
Name is too significant.
People search
Difficult to construct the ground truth.
Solution?
Discussion: people search task
Query in LinkedIn by Twitter user’s name
Average 10 results for each query

             Pre     Rec     F1
Human        0.643   0.900   0.750
NB_Local     0.369   0.441   0.402
NB_All       0.418   0.493   0.453
C4.5_Local   0.594   0.240   0.342
C4.5_All     0.609   0.380   0.468
CSPL_Local   0.543   0.658   0.595
CSPL_All     0.578   0.713   0.638
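The F1 column is the harmonic mean of precision and recall, so it can be recomputed from the other two columns as a consistency check. The short sketch below does this for the reported rows; the helper name and the dictionary layout are illustrative, and small deviations are expected because the published precision and recall are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as given in the table above.
rows = {
    "Human": (0.643, 0.900, 0.750),
    "NB_Local": (0.369, 0.441, 0.402),
    "NB_All": (0.418, 0.493, 0.453),
    "C4.5_Local": (0.594, 0.240, 0.342),
    "C4.5_All": (0.609, 0.380, 0.468),
    "CSPL_Local": (0.543, 0.658, 0.595),
    "CSPL_All": (0.578, 0.713, 0.638),
}

# Largest gap between the recomputed and the reported F1 over all rows.
max_deviation = max(abs(f1_score(p, r) - f1) for p, r, f1 in rows.values())
```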
Discussion: feature dependency
Compare features independently.
2 people in Tsinghua with the same name Li Peng
2 people in NUS with the same name Li Peng
Construct a different IDF table for names in each locale.
Not general
Not significantly effective
Conclusion
We proposed a supervised probabilistic model to solve the identity linkage problem effectively.
The prior that users sharing rarer tokens are more likely to be matched improves the performance of the approach.
Iterative inference is able to reduce unnecessary feature acquisitions.
Thank you