Large-Scale Cost-sensitive Online Social Network Profile Linkage

Background & Motivation
  Footprints in different social networks.
  User identification in social analysis.
  Privacy & security
  Commercial & government applications

Outline
  Problem definition
  Related work
  Approach
  Experiment
  Conclusion & future work

Problem Definition

Terminology
  Identity: a person
  Profile/User: your footprint on social media
  Profile linkage: linking your footprints together

Input & Output
  Input: profiles from one site as QUERY and profiles from the other site as TARGET.
  Output: all profile pairs classified as matched.

Characteristics of profile

Name (semi-structured vs. structured)
  Structured: {"given name": "haochen", "family name": "zhang"}
  Semi-structured: name: zhang haochen
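
The two name representations above can be brought to a common form before comparison. The following is a minimal sketch, assuming that lowercased whitespace tokens are an adequate common form; the function name `name_tokens` is hypothetical, and the field names are taken from the example on this slide.

```python
def name_tokens(profile: dict) -> frozenset:
    """Collect lowercase name tokens from either name representation."""
    parts = []
    # Field names follow the slide's example; real sites may use other keys.
    for key in ("given name", "family name", "name"):
        value = profile.get(key)
        if value:
            parts.extend(value.lower().split())
    return frozenset(parts)


structured = {"given name": "haochen", "family name": "zhang"}
semi_structured = {"name": "zhang haochen"}

# Both representations reduce to the same token set.
print(name_tokens(structured) == name_tokens(semi_structured))
```

Using an unordered token set sidesteps the question of whether the family name comes first, at the price of losing word order.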

Semi-structured schema

Incompleteness & missing attributes
  Privacy policy
  Virtual identification

Free text description
  Bio, About me, Tags

Multilingualism

Multilingualism

Top 5 languages in the Facebook dataset
  English
  Portuguese
  Spanish
  Chinese
  French

Most frequent tokens in different languages
  chris, john, michael
  chen, wang, lee
  carlos, garcia, daniel
  sergey, olga, alexander

About 70% of users use English.
7.2% of users register under different locales.
Transliteration
  昊辰 => Haochen

Feature Acquisition
  Network communication costs too much time.
  Web services impose usage limits.
    1000 invocations per day for Google Maps API
  Computational complexity is high compared with string similarity.
    Image processing algorithm.
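
One way to respect such a usage limit is to wrap each expensive feature in a cache with a call budget. This is only an illustrative sketch, not the authors' implementation: the class `BudgetedFeature`, the stand-in `fake_geocode`, and the choice to return `None` once the budget is spent are all assumptions, and resetting the counter each day is left out.

```python
class BudgetedFeature:
    """Wrap an expensive feature function with a cache and a call budget."""

    def __init__(self, fetch, daily_limit=1000):
        self.fetch = fetch              # the costly call, e.g. a web service
        self.daily_limit = daily_limit  # 1000 mirrors the limit quoted above
        self.calls = 0
        self.cache = {}

    def get(self, key):
        if key in self.cache:           # repeated keys cost nothing
            return self.cache[key]
        if self.calls >= self.daily_limit:
            return None                 # budget exhausted: feature unavailable
        self.calls += 1
        self.cache[key] = self.fetch(key)
        return self.cache[key]


def fake_geocode(location):
    """Hypothetical stand-in for a remote geocoding call."""
    return location.strip().lower()


geocode = BudgetedFeature(fake_geocode, daily_limit=2)
print(geocode.get("Beijing"), geocode.get("Beijing"), geocode.calls)
```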

Overview of approach

Classification of Potential Links
  Feature representation
  Supervised learning
  Cost-sensitive feature acquisition

Pruning with Canopy
  Parameter tuning
  Canopy construction

Entity-based Representation of Profiles
  Mapping
  Tokenization
  Entity extraction

Canopy: design

Canopy: efficiency
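
The slides do not preserve the canopy design itself, so the following is a sketch of generic canopy clustering rather than the authors' construction. It assumes Jaccard similarity on token sets as the cheap measure, and the thresholds `loose` and `tight` are illustrative values. In this simplified variant each canopy is drawn only from profiles that have not yet been removed.

```python
def jaccard(a, b):
    """Cheap set-overlap similarity used to form canopies."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_canopies(token_sets, loose=0.2, tight=0.6):
    """Group profile ids into canopies; only pairs inside a canopy are compared later."""
    remaining = list(token_sets)
    canopies = []
    while remaining:
        center = remaining[0]
        sims = {pid: jaccard(token_sets[center], token_sets[pid]) for pid in remaining}
        # Everything loosely similar to the center joins the canopy.
        canopy = [pid for pid in remaining if pid == center or sims[pid] >= loose]
        canopies.append(canopy)
        # Tightly similar profiles (and the center) can no longer start a canopy.
        removed = {pid for pid in canopy if pid == center or sims[pid] >= tight}
        remaining = [pid for pid in remaining if pid not in removed]
    return canopies


profiles = {
    "tw:1": {"haochen", "zhang"},
    "li:1": {"zhang", "haochen", "tsinghua"},
    "tw:2": {"carlos", "garcia"},
    "li:2": {"garcia", "carlos"},
}
print(build_canopies(profiles))
```

Only query and target profiles that share a canopy need to be passed on to the classifier, which is where the pruning gain comes from.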

Local Features

Username
  Jaro-Winkler similarity

Language
  Jaccard similarity

Description, URL
  Cosine similarity with TF×IDF

Popularity
  Defined as the number of friends a user has.
  Adopt the following metric
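
Two of the local features above can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: whitespace tokenization and the smoothed IDF formula `log(N / df) + 1` are choices made here, not taken from the slides, and the helper names are hypothetical.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two collections, e.g. the languages of two profiles."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_idf(documents):
    """Smoothed inverse document frequency over a list of token lists."""
    n = len(documents)
    df = Counter(token for doc in documents for token in set(doc))
    return {token: math.log(n / count) + 1.0 for token, count in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF×IDF vector; tokens unseen in the IDF table get weight 0."""
    return {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokens).items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = ["data mining researcher".split(), "data scientist".split(), "chef and baker".split()]
idf = build_idf(docs)
print(jaccard(["en", "zh"], ["en"]))
print(cosine(tfidf_vector(docs[0], idf), tfidf_vector(docs[1], idf)))
```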

External Features

Geographic Location
  Values are diverse and come in different types.
  Google Maps API:
    string-represented location => geographic information
  Spherical distance between two locations as the feature
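
Once both locations have been resolved to coordinates, the spherical distance can be computed with the haversine formula. This sketch assumes the geocoding step has already returned latitude and longitude in degrees, and it uses a mean Earth radius of 6371.0088 km; the slides do not say which formula or radius was used.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, an assumed constant


def spherical_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Approximate coordinates of Beijing and Singapore.
print(round(spherical_distance_km(39.9, 116.4, 1.35, 103.8)))
```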

Avatar
  χ2 dissimilarity of the avatars' gray-scale histograms.
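
A common form of the χ2 dissimilarity between two histograms is half the sum of squared bin differences divided by bin sums. The slides do not state which variant was used, so the sketch below assumes this symmetric form on normalized histograms and takes the gray-scale histograms as plain lists of bin counts that have already been extracted from the images.

```python
def normalize(hist):
    """Scale bin counts so they sum to 1 (all zeros stay all zeros)."""
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * len(hist)


def chi2_dissimilarity(hist_a, hist_b):
    """Symmetric chi-squared dissimilarity; 0 for identical, 1 for disjoint histograms."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    p, q = normalize(hist_a), normalize(hist_b)
    # Bins that are empty in both histograms contribute nothing.
    return 0.5 * sum((x - y) ** 2 / (x + y) for x, y in zip(p, q) if x + y > 0)


print(chi2_dissimilarity([4, 4, 0, 0], [2, 2, 2, 2]))
```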

Classification: learning

Probabilistic model derived from naïve Bayes
  Independent feature assumption

Classification: learning

Iterative inference
  Terminate if S_n is discriminative.
  Set a threshold from each feature's error rate on the training set to determine whether S_n is discriminative.
  Order of the features
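
The exact definition of S_n is not preserved in these slides. The sketch below assumes a naïve Bayes log-odds score that accumulates one log-likelihood ratio per feature and stops as soon as the score leaves the band between two thresholds. The single pair of thresholds, the likelihood values, and all function names are illustrative assumptions; the slide describes thresholds derived per feature from training error rates.

```python
import math


def iterative_inference(initial_score, features, lower, upper):
    """Acquire features in order until the running score is decisive.

    features: list of (name, acquire, log_ratio), where acquire() fetches the
    feature value and log_ratio(value) returns log P(value|match) / P(value|non-match).
    """
    score = initial_score
    acquired = []
    for name, acquire, log_ratio in features:
        if score <= lower or score >= upper:
            break                       # already discriminative: skip the rest
        acquired.append(name)
        score += log_ratio(acquire())   # independence assumption: ratios add up
    return score, acquired


def username_log_ratio(similarity):
    # Hypothetical likelihoods for a high username similarity.
    return math.log(0.8 / 0.1) if similarity > 0.9 else math.log(0.2 / 0.9)


def location_log_ratio(distance_km):
    # Hypothetical likelihoods for two nearby locations.
    return math.log(0.7 / 0.2) if distance_km < 50 else math.log(0.3 / 0.8)


external_calls = []


def acquire_location():
    external_calls.append("location")   # stands in for a costly web-service call
    return 12.0


features = [
    ("username", lambda: 0.95, username_log_ratio),   # cheap local feature first
    ("location", acquire_location, location_log_ratio),
]
score, acquired = iterative_inference(0.0, features, lower=-2.0, upper=2.0)
print(acquired, external_calls)
```

In this run the username alone pushes the score past the upper threshold, so the external location feature is never requested.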

Classification: learning

Initial value
  Estimate by the prior that two profiles sharing rarer tokens are more likely to be matched.
  The resulting estimate serves as the initial value.
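
The formula for the initial value is not preserved in these slides. One plausible reading of the rarer-tokens prior is an IDF-weighted overlap, sketched below; summing the IDF of shared tokens is an assumption made here, and `token_idf` and `rarity_prior` are hypothetical names.

```python
import math
from collections import Counter


def token_idf(profiles):
    """IDF of each token over a collection of profile token sets."""
    n = len(profiles)
    df = Counter(token for tokens in profiles for token in set(tokens))
    return {token: math.log(n / count) for token, count in df.items()}


def rarity_prior(tokens_a, tokens_b, idf):
    """Score a pair by the summed IDF of the tokens the two profiles share."""
    shared = set(tokens_a) & set(tokens_b)
    return sum(idf.get(token, 0.0) for token in shared)


corpus = [{"li", "peng"}, {"li", "wei"}, {"li", "na"}, {"haochen", "zhang"}]
idf = token_idf(corpus)

# Sharing the rare token "haochen" yields a higher prior than sharing the common "li".
print(rarity_prior({"haochen", "zhang"}, {"haochen"}, idf))
print(rarity_prior({"li", "peng"}, {"li", "wei"}, idf))
```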

Dataset of experiment

Data source
  152,294 Twitter users
  154,379 LinkedIn users

Ground truth: 9,750 identities
  4,779 identities with both accounts.
  3,339 identities with only a Twitter account.
  1,632 identities with only a LinkedIn account.

Experiment: Performance on overall linkage

I-Acc (Identity Accuracy)
  correctly identified identities / all identities in ground truth

Better than the naïve learning method because of the adopted prior.
Performance differs across learning methods.

Experiment: Cost-sensitive feature acquisition

5% improvement in F1 with 148,743 external feature acquisitions.
Different orders of external features
  Rank by cost
  Rank by distinguishability
Three sections divided by two inflection points.

Discussion: dataset construction

Dataset construction
  Connections
    Cannot correctly reflect the web-scale setting.
    Name is too significant.
  People search
    Difficult to construct the ground truth.
  Solution?

Discussion: people search task

Query LinkedIn by the Twitter user's name.
An average of 10 results for each query.

Method      Pre    Rec    F1
Human       0.643  0.900  0.750
NB_Local    0.369  0.441  0.402
NB_All      0.418  0.493  0.453
C4.5_Local  0.594  0.240  0.342
C4.5_All    0.609  0.380  0.468
CSPL_Local  0.543  0.658  0.595
CSPL_All    0.578  0.713  0.638
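
The F1 column is the harmonic mean of precision and recall. A short sketch for recomputing it from the table, using the Human and CSPL_All rows as input; the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


rows = {"Human": (0.643, 0.900), "CSPL_All": (0.578, 0.713)}
for method, (pre, rec) in rows.items():
    print(method, round(f1_score(pre, rec), 3))
```

Because the table reports rounded precision and recall, a recomputed F1 can differ from the reported value in the last digit for some rows.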

Discussion: feature dependency

Features are compared independently.
  2 people at Tsinghua with the same name, Li Peng
  2 people at NUS with the same name, Li Peng
Construct a different IDF table for names in each locale.
  Not general
  Not significantly effective

Conclusion
  We proposed a supervised probabilistic approach that solves the identity linkage problem effectively.
  The prior that users sharing rarer tokens are more likely to be matched improves the performance of the approach.
  Iterative inference is able to reduce unnecessary feature acquisitions.

Thank you
