Large-Scale Cost-sensitive Online Social Network Profile Linkage

Background & Motivation

Footprints in different social networks.
User identification in social analysis.
Privacy & security.
Commercial & government applications.

Outline

Problem definition
Related work
Approach
Experiment
Conclusion & future work

Problem Definition

Terminology
  Identity: a person
  Profile/User: your footprint on social media
  Profile Linkage: linking your footprints together
Input & Output
  Input: profiles of one site as QUERY and profiles of the other site as TARGET.
  Output: all pairs of profiles classified as matched.

Characteristics of profile

Name (semi-structured vs. structured)
  {“given name”: “haochen”, “family name”: “zhang”}
  name: zhang haochen
Semi-structured schema
Incompleteness & missing attributes
  Privacy policy
  Virtual identification
Free text description
  Bio, About me, Tags
Multilingualism
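The two name forms above can be reduced to a common token set before any comparison. The sketch below is only an illustration: the function name `name_tokens` and the free-form key `"name"` are assumptions, and the structured keys are taken from the slide example.

```python
def name_tokens(profile):
    """Return the set of lowercase name tokens found in a profile.

    Assumption: structured names use the keys "given name" and
    "family name" (as in the slide example); free-form names are
    stored under the key "name". Missing attributes are skipped.
    """
    parts = []
    for key in ("given name", "family name", "name"):
        value = profile.get(key)
        if value:  # attribute may be absent or empty
            parts.extend(value.lower().split())
    return set(parts)


structured = {"given name": "haochen", "family name": "zhang"}
free_form = {"name": "zhang haochen"}
```

Both example profiles yield the same token set, so token order and schema differences no longer matter for later similarity measures.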

Multilingualism

Top 5 languages in the Facebook dataset
  English, Portuguese, Spanish, Chinese, French
Most frequent tokens in different languages
  chris, john, michael
  chen, wang, lee
  carlos, garcia, daniel
  sergey, olga, alexander
About 70% of users are in English.
7.2% of users register under different locales.
Transliteration
  昊辰 => Haochen

Feature Acquisition

Network communication costs too much time.
Usage limits of the web service.
  1000 invocations per day for the Google Maps API
High computational complexity compared to string similarity.
  Image processing algorithms.
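One way to live within such costs is to cache results and count calls against a budget. This is a minimal sketch, not the authors' implementation: the class `BudgetedLookup`, its `fetch` callable, and the choice to return `None` when the budget runs out are all assumptions, and the daily reset of the counter is left out.

```python
class BudgetedLookup:
    """Wrap an expensive lookup with a cache and a call budget."""

    def __init__(self, fetch, daily_limit=1000):
        self.fetch = fetch              # callable doing the real (costly) request
        self.daily_limit = daily_limit  # e.g. the 1000-per-day limit on the slide
        self.calls_today = 0
        self.cache = {}

    def get(self, key):
        if key in self.cache:           # cached values cost nothing
            return self.cache[key]
        if self.calls_today >= self.daily_limit:
            return None                 # budget exhausted: treat feature as missing
        self.calls_today += 1
        value = self.fetch(key)
        self.cache[key] = value
        return value
```

A caller would pass its own geocoding or image-fetching function as `fetch`; repeated location strings are then resolved once only.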

Overview of approach

Classification of Potential Links
  Feature representation
  Supervised learning
  Cost-sensitive feature acquisition
Pruning with Canopy
  Parameter tuning
  Canopy construction
Entity-based Representation of Profiles
  Mapping
  Tokenization
  Entity extraction

Canopy: design
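The design figure for this slide did not survive, so the following is only a sketch of the common canopy clustering scheme, not necessarily the authors' design. It assumes Jaccard similarity over name tokens as the cheap measure and two hypothetical thresholds, `loose` and `tight`.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_canopies(token_sets, loose=0.2, tight=0.6):
    """Group profile ids into overlapping canopies.

    token_sets: dict mapping profile id -> set of tokens.
    A profile joins the current canopy when its similarity to the
    centre is at least `loose`; it stops being a candidate for later
    canopies only when the similarity is at least `tight`.
    """
    remaining = list(token_sets)        # candidate centres, in input order
    canopies = []
    while remaining:
        centre = remaining[0]
        canopy = {centre}
        still_remaining = []
        for other in remaining[1:]:
            sim = jaccard(token_sets[centre], token_sets[other])
            if sim >= loose:
                canopy.add(other)
            if sim < tight:
                still_remaining.append(other)  # may seed or join later canopies
        canopies.append(canopy)
        remaining = still_remaining
    return canopies
```

Only pairs that share a canopy would then be passed to the classifier, which is what prunes the candidate space.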

Canopy: efficiency

Local Features

Username
  Jaro-Winkler similarity
Language
  Jaccard similarity
Description, URL
  Cosine similarity with TF×IDF
Popularity
  Defined as the number of friends of a user.
  Adopt the following metric
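The three string measures named above can be sketched in plain Python. These are textbook formulations written for illustration; the prefix scale of 0.1, the 4-character prefix cap, and the externally supplied `idf` table are assumptions, since the slides do not give parameters. The popularity metric is not reproduced because its formula is missing from the slide.

```python
import math
from collections import Counter


def jaro(s, t):
    """Jaro similarity between two strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    s_matched = [c for c, hit in zip(s, s_hit) if hit]
    t_matched = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_matched, t_matched)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, scale=0.1):
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scale * (1 - base)


def jaccard(a, b):
    """Jaccard similarity of two sets, e.g. sets of language codes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tfidf_cosine(doc_a, doc_b, idf):
    """Cosine similarity of two token lists under TF×IDF weighting."""
    va = {tok: tf * idf.get(tok, 0.0) for tok, tf in Counter(doc_a).items()}
    vb = {tok: tf * idf.get(tok, 0.0) for tok, tf in Counter(doc_b).items()}
    dot = sum(w * vb.get(tok, 0.0) for tok, w in va.items())
    norm = (math.sqrt(sum(w * w for w in va.values()))
            * math.sqrt(sum(w * w for w in vb.values())))
    return dot / norm if norm else 0.0
```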

External Features

Geographic Location
  Values are diverse, with different types.
  Google Maps API: string-represented location => geographic information
  Spherical distance between two locations as the feature
Avatar
  χ2 dissimilarity of the avatars' gray-scale histograms.
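Both external features reduce to short numeric routines once the inputs are available. The sketch below assumes coordinates have already been obtained from a geocoder and histograms have already been computed from the avatars. It uses the haversine formula for the spherical distance and one common symmetric form of the χ2 distance; the slides do not say which variants were used.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def spherical_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def chi2_dissimilarity(hist_a, hist_b):
    """Symmetric chi-squared distance between two equal-length histograms."""
    total = 0.0
    for a, b in zip(hist_a, hist_b):
        if a + b > 0:  # skip bins that are empty in both histograms
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total
```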

Classification: learning

Probabilistic model derived from naïve Bayes
  Independent feature assumption
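The slide's own formula is not recoverable. For orientation, the standard naïve Bayes log-odds under the independence assumption is shown below; the symbols M (matched), U (unmatched) and f_i (the i-th feature value of a profile pair) are assumptions introduced here.

```latex
\log\frac{P(M \mid f_1,\dots,f_n)}{P(U \mid f_1,\dots,f_n)}
  = \log\frac{P(M)}{P(U)}
  + \sum_{i=1}^{n} \log\frac{P(f_i \mid M)}{P(f_i \mid U)}
```

Because each feature contributes one additive term, the score can be updated one feature at a time.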

Classification: learning

Iterative inference
  Terminate if S_n is discriminative.
  Set the threshold from each feature's error rate on the training set to determine whether S_n is discriminative.
Order of the features
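Iterative inference with early termination can be sketched as a loop that acquires one feature at a time and stops once the running score is decisive. This is an illustration under stated assumptions: each feature is modelled as a zero-argument callable returning its additive contribution, and "discriminative" is modelled as the score leaving a per-step (lower, upper) band. The slides do not give the exact stopping rule.

```python
def iterative_inference(initial_score, features, thresholds):
    """Accumulate feature contributions until the score is decisive.

    features:   ordered list of zero-argument callables; calling one
                "acquires" the feature and returns its contribution.
    thresholds: list of (lower, upper) pairs, one per feature step.
    Returns (final_score, number_of_features_acquired).
    """
    score = initial_score
    acquired = 0
    for acquire, (lower, upper) in zip(features, thresholds):
        score += acquire()      # the cost is paid only when this runs
        acquired += 1
        if score <= lower or score >= upper:
            break               # decisive: skip the remaining features
    return score, acquired
```

Placing cheap local features first and costly external ones last means the expensive calls are made only for pairs that remain ambiguous.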

Classification: learning

Initial value
  Estimated from the prior that two profiles sharing rarer tokens are more likely to be matched.

as the initial value
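The formula that turns token rarity into the initial value is missing from the slide. The sketch below only illustrates the underlying idea with a standard IDF weighting: the function names and the use of a plain sum over shared tokens are assumptions, not the authors' estimator.

```python
import math
from collections import Counter


def build_idf(token_sets):
    """Inverse document frequency of each token over a profile collection."""
    n = len(token_sets)
    df = Counter(tok for tokens in token_sets for tok in set(tokens))
    return {tok: math.log(n / count) for tok, count in df.items()}


def shared_rarity(tokens_a, tokens_b, idf):
    """Sum of IDF weights over the tokens two profiles have in common."""
    return sum(idf.get(tok, 0.0) for tok in set(tokens_a) & set(tokens_b))
```

Under this weighting, two profiles sharing an uncommon token such as "haochen" score higher than two profiles sharing a frequent one such as "john".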

Dataset of experiment

Data source
  152,294 Twitter users
  154,379 LinkedIn users
Ground truth: 9,750 identities
  4,779 identities with both accounts.
  3,339 identities with only a Twitter account.
  1,632 identities with only a LinkedIn account.
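A quick arithmetic check of these figures, using only the numbers on the slide, confirms that the three ground-truth groups add up to the stated total. It also shows the size of the unpruned comparison space, which is what motivates pruning.

```python
twitter_users = 152_294
linkedin_users = 154_379
both, twitter_only, linkedin_only = 4_779, 3_339, 1_632

# The three ground-truth groups should account for every identity.
total_identities = both + twitter_only + linkedin_only

# Comparing every Twitter profile with every LinkedIn profile.
all_pairs = twitter_users * linkedin_users

print(total_identities)
print(f"{all_pairs:.2e} candidate pairs without pruning")
```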

Experiment: Performance on overall linkage

I-Acc (Identity Accuracy)
  correctly identified identities / all identities in ground truth
Better than the naïve learning method, owing to the adopted prior.
Performance differs across learning methods.

Experiment: Cost-sensitive feature acquisition

5% improvement in F1 from 148,743 external feature acquisitions.
Different orders of external features.
  Rank by cost
  Rank by distinguishability
Three sections divided by two inflection points.

Discussion: dataset construction

Dataset construction
  Connections
    Cannot correctly reflect the web-scale setting.
    Name is too significant.
  People search
    Difficult to construct the ground truth.
Solution?

Discussion: people search task

Query LinkedIn by the Twitter user's name
Average of 10 results for each query

              Pre     Rec     F1
Human         0.643   0.900   0.750
NB_Local      0.369   0.441   0.402
NB_All        0.418   0.493   0.453
C4.5_Local    0.594   0.240   0.342
C4.5_All      0.609   0.380   0.468
CSPL_Local    0.543   0.658   0.595
CSPL_All      0.578   0.713   0.638
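The F1 column can be cross-checked against the precision and recall columns with the usual harmonic-mean definition. The snippet below uses only the values from the table; small gaps are expected because the published precision and recall are themselves rounded to three decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the table
rows = {
    "Human":      (0.643, 0.900, 0.750),
    "NB_Local":   (0.369, 0.441, 0.402),
    "NB_All":     (0.418, 0.493, 0.453),
    "C4.5_Local": (0.594, 0.240, 0.342),
    "C4.5_All":   (0.609, 0.380, 0.468),
    "CSPL_Local": (0.543, 0.658, 0.595),
    "CSPL_All":   (0.578, 0.713, 0.638),
}

# Largest difference between recomputed and reported F1.
max_gap = max(abs(f1_score(p, r) - f1) for p, r, f1 in rows.values())
```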

Discussion: feature dependency

Features are compared independently.
  2 people at Tsinghua with the same name, Li Peng
  2 people at NUS with the same name, Li Peng
Construct a different IDF table for names in each locale.
  Not general
  Not significantly effective

Conclusion

We proposed a supervised probabilistic approach to solve the identity linkage problem effectively.
The prior that users sharing rarer tokens are more likely to be matched improves the performance of the approach.
Iterative inference is able to reduce unnecessary feature acquisitions.

Thank you