Finding Topic-sensitive Influential Twitterers Presenter 吴伟涛 TwitterRank:

TwitterRank: Finding Topic-sensitive Influential Twitterers

Presenter: 吴伟涛

Outline

1. Introduction

2. Dataset

3. Topic modeling and Homophily in Twitter

4. TwitterRank

5. Experiment and results

6. Conclusions

Introduction

Motivation: The number of followers is currently the main metric used to identify influential twitterers, but a twitterer’s influence may vary across topics.

Solution: Identify influential twitterers by taking both the topical similarity between users and the link structure into account.

Introduction

Two contributions of this paper:

1. It is the first to report homophily in Twitter.

2. It introduces TwitterRank to measure the topic-sensitive influence of twitterers.

Twitter Dataset

1. Obtained a set of the top-1000 Singapore-based twitterers. Denote this set as S; |S| = 996.

2. Crawled all the followers and friends of each s ∈ S and stored them in the set S’.

3. Let S’’ = S ∪ S’ and S* = {s | s ∈ S’’ and s is from Singapore}; |S*| = 6748. For each s ∈ S*, crawled all the tweets she had published so far. Denote this set of tweets as T; |T| = 1,021,039.
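
The three steps above amount to a few set operations. The following is only a minimal sketch: the dictionaries `followers`, `friends`, `location` and `tweets`, the function name, and the toy user names are all hypothetical stand-ins for the crawled data, not part of the original work.

```python
def build_dataset(seed_users, followers, friends, location, tweets):
    """Assemble S, S', S* and T following the three steps above."""
    S = set(seed_users)
    # S': everyone who follows, or is followed by, a seed user.
    S_prime = set()
    for s in S:
        S_prime |= followers.get(s, set()) | friends.get(s, set())
    S_all = S | S_prime  # S'' = S ∪ S'
    # S*: keep only the Singapore-based users of S''.
    S_star = {s for s in S_all if location.get(s) == "Singapore"}
    # T: all tweets published by members of S*.
    T = [tweet for s in S_star for tweet in tweets.get(s, [])]
    return S, S_prime, S_star, T


# Toy data (hypothetical users, invented for illustration).
followers = {"a": {"b", "c"}}
friends = {"a": {"d"}}
location = {"a": "Singapore", "b": "Singapore", "c": "Tokyo", "d": "Singapore"}
tweets = {"a": ["t1", "t2"], "b": ["t3"], "d": []}
S, S_prime, S_star, T = build_dataset(["a"], followers, friends, location, tweets)
```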

Tweet Distribution

[Figure: tweet distribution plots]

Reciprocity in Following Relationships

72.4% of the twitterers follow more than 80% of their followers.

80.5% of the twitterers have 80% of their friends follow them back.

Is this casual following, or homophily?
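
Per-user reciprocity ratios of the kind quoted above could be computed roughly as follows. This is a sketch on a toy graph: `friends_of` (user to the set of users she follows) and the function name are hypothetical, and the percentages on the slide come from the real crawl, not from this code.

```python
def follow_back_fractions(friends_of):
    """friends_of[u] is the set of users that u follows (u's friends)."""
    # Invert the graph to obtain each user's followers.
    followers_of = {}
    for u, fs in friends_of.items():
        for f in fs:
            followers_of.setdefault(f, set()).add(u)

    followers_followed = {}  # share of u's followers that u also follows
    friends_following = {}   # share of u's friends that follow u back
    for u, fr in friends_of.items():
        fol = followers_of.get(u, set())
        mutual = fol & fr
        if fol:
            followers_followed[u] = len(mutual) / len(fol)
        if fr:
            friends_following[u] = len(mutual) / len(fr)
    return followers_followed, friends_following


# Toy graph: a follows b and c, b follows a, c follows nobody.
friends_of = {"a": {"b", "c"}, "b": {"a"}, "c": set()}
followers_followed, friends_following = follow_back_fractions(friends_of)
```

Counting the users whose ratio exceeds 0.8 and dividing by the number of users would give figures comparable to the two percentages above.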

Homophily in Twitter

Q1: Are twitterers with “following” relationships more similar than those without, according to the topics they are interested in?

Q2: Are twitterers with reciprocal “following” relationships more similar than those without, according to the topics they are interested in?

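Topic modeling

Define a distance: dist(i, j)

Compute the mean distances: $\mu_{follow}$, $\mu_{nofollow}$, $\mu_{sym}$, $\mu_{asym}$

Test: $\mu_{follow}$ vs. $\mu_{nofollow}$?

Test: $\mu_{sym}$ vs. $\mu_{asym}$?

Conclusion: homophily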

Topic Modeling

Goal:

Automatically identify the topics that twitterers are interested in, based on the tweets they published.

The Latent Dirichlet Allocation (LDA) model is applied.

Topic Modeling

LDA-based generative process for generating a document:

1. For each word in the document, pick a topic from the document’s distribution over topics.

2. Sample a word from the distribution over words associated with the chosen topic.

3. Repeat the process for all the words in the document.
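
The generative process above can be sketched in a few lines. This is an illustration only: the symmetric Dirichlet priors, the sizes `T` and `V`, the seed, and the helper names are assumptions made for the example, and the sketch covers generation only, not the inference step used to fit LDA.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(theta, phi, n_words, rng):
    """theta: the document's topic distribution; phi[z]: word distribution of topic z."""
    words = []
    for _ in range(n_words):
        # Step 1: pick a topic from the document's distribution over topics.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # Step 2: sample a word id from the chosen topic's word distribution.
        w = rng.choices(range(len(phi[z])), weights=phi[z])[0]
        words.append(w)
    # Step 3 is the loop itself: one (topic, word) draw per word.
    return words


rng = random.Random(0)
T, V = 3, 8  # number of topics and vocabulary size (toy values)
theta = sample_dirichlet([1.0] * T, rng)
phi = [sample_dirichlet([1.0] * V, rng) for _ in range(T)]
doc = generate_document(theta, phi, n_words=20, rng=rng)
```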

Topic Modeling Results

1. DT, a D × T matrix

D: the number of users

T: the number of topics

$DT_{ij}$: the number of times a word in user $s_i$’s tweets has been assigned to topic $t_j$.

Topic Modeling

We first row-normalize the DT matrix as DT’ such that $\|DT'_{i\cdot}\|_1 = 1$ for each row $DT'_{i\cdot}$. Each row of DT’ is thus the probability distribution of twitterer $s_i$’s interest over the T topics, i.e. each element $DT'_{ij}$ captures the probability that twitterer $s_i$ is interested in topic $t_j$.
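
Row normalization is a one-pass operation over the count matrix. The sketch below uses a toy 2 × 3 matrix; the uniform fallback for an all-zero row is an assumption of this example, not something stated on the slide.

```python
def row_normalize(DT):
    """Scale each row of the count matrix DT so that it sums to 1."""
    normalized = []
    for row in DT:
        total = sum(row)
        if total == 0:
            # Assumption: a user with no assigned words gets a uniform row.
            normalized.append([1.0 / len(row)] * len(row))
        else:
            normalized.append([count / total for count in row])
    return normalized


# Toy counts: 2 users, 3 topics.
DT = [[8, 2, 0], [1, 1, 2]]
DT_prime = row_normalize(DT)
```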

Topic Difference

Definition 1: the topical difference between two twitterers $s_i$ and $s_j$ can be calculated as:

$$dist(i, j) = \sqrt{2 \cdot D_{JS}(i, j)}$$

$D_{JS}(i, j)$ is the Jensen-Shannon divergence between the two probability distributions $DT'_{i\cdot}$ and $DT'_{j\cdot}$, which is defined as:

$$D_{JS}(i, j) = \frac{1}{2}\left(D_{KL}(DT'_{i\cdot} \,\|\, M) + D_{KL}(DT'_{j\cdot} \,\|\, M)\right)$$

Topic Difference

M is the average of the two probability distributions, i.e.

$$M = \frac{1}{2}\left(DT'_{i\cdot} + DT'_{j\cdot}\right)$$

$D_{KL}$ is the Kullback-Leibler divergence, which defines the divergence from distribution Q to P as:

$$D_{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$
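
Putting the definitions together, the topical difference can be computed as sketched below. The natural logarithm and the function names are assumptions of this example (the slides do not state the base of the logarithm), and the square root follows the reading dist(i, j) = sqrt(2 · D_JS(i, j)) of Definition 1.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q); terms with P(i) = 0 contribute 0 by convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # M: average distribution
    return 0.5 * (kl_divergence(p, m) + kl_divergence(q, m))


def topical_distance(p, q):
    """dist = sqrt(2 * D_JS), clamped at 0 to absorb rounding error."""
    return math.sqrt(max(0.0, 2 * js_divergence(p, q)))


# Two rows of a row-normalized DT' matrix (toy values).
u = [0.8, 0.2, 0.0]
v = [0.25, 0.25, 0.5]
d_uv = topical_distance(u, v)
```

Because M is strictly positive wherever either input is, the logarithm is always defined, and the distance is symmetric in its two arguments.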

Hypothesis Testing

* Note that this part of the work (hypothesis testing, as well as topic distillation) is applied to the set of twitterers who have published more than 10 tweets in total. We denote this set as $S^*_u$, and $|S^*_u| = 4050$.

Hypothesis Testing (I)

Formalize Q1 as a two-sample t-test:

$\mu_{follow}$: the mean topical difference of the pairs of users with a “following” relationship.

$\mu_{nofollow}$: the mean topical difference of those without.

$$H_0: \mu_{follow} = \mu_{nofollow}$$

$$H_1: \mu_{follow} < \mu_{nofollow}$$
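
A two-sample test of this shape can be sketched as follows. The slides only say "two-sample t-test"; Welch's unequal-variance variant is an assumption made here, and the two samples are invented numbers for illustration, not data from the study.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's two-sample t-test."""
    na, nb = len(sample_a), len(sample_b)
    va = statistics.variance(sample_a)  # unbiased sample variances
    vb = statistics.variance(sample_b)
    se2 = va / na + vb / nb             # squared standard error of the mean difference
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical topical differences, invented for illustration only.
dist_follow = [0.41, 0.38, 0.45, 0.36, 0.40, 0.39]
dist_nofollow = [0.62, 0.58, 0.66, 0.60, 0.64, 0.59]
t_stat, dof = welch_t(dist_follow, dist_nofollow)
```

A strongly negative t statistic supports the one-sided alternative that the first mean is smaller; the p-value would then be read from a t distribution with `dof` degrees of freedom. The same function applies unchanged to a sym/asym comparison.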

Hypothesis Testing (I)

Result:

The null hypothesis $H_0$ is rejected at significance level $\alpha = 0.01$.

Hypothesis Testing (II)

Formalize Q2 as a two-sample t-test:

$\mu_{sym}$: the mean topical difference of the pairs of users with a reciprocal “following” relationship.

$\mu_{asym}$: the mean topical difference of the pairs of users with only a one-directional relationship.

$$H_0: \mu_{sym} = \mu_{asym}$$

$$H_1: \mu_{sym} < \mu_{asym}$$

Hypothesis Testing (II)

Result:

The null hypothesis $H_0$ is rejected at significance level $\alpha = 0.01$.

Implication

The homophily phenomenon does exist:

- The answer to Q1 is yes.

- The answer to Q2 is also yes.

- There are twitterers who are serious about following others.

Topic-specific TwitterRank

A topic-specific random walk model is applied to calculate each user’s influence score.

The transition matrix for topic t is denoted as $P_t$. The transition probability of the random surfer from follower $s_i$ to friend $s_j$ is:

$$P_t(i, j) = \frac{|T_j|}{\sum_{a:\, s_i \text{ follows } s_a} |T_a|} \cdot sim_t(i, j)$$

$$sim_t(i, j) = 1 - |DT'_{it} - DT'_{jt}|$$

where $|T_j|$ is the number of tweets published by $s_j$.
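
The transition probabilities can be assembled directly from the follow graph, the tweet counts and the row-normalized topic matrix. This is a sketch on three toy users; the argument names and the choice to leave a row at zero when a user has no friends (or only friends without tweets) are assumptions of the example.

```python
def transition_matrix(friends_of, tweet_counts, DT_prime, t):
    """P[i][j]: probability of moving from follower i to friend j for topic t."""
    n = len(tweet_counts)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Denominator: tweets published by all of i's friends.
        total = sum(tweet_counts[a] for a in friends_of[i])
        if total == 0:
            continue  # assumption: leave the row at zero
        for j in friends_of[i]:
            # sim_t(i, j) = 1 - |DT'_it - DT'_jt|
            sim = 1.0 - abs(DT_prime[i][t] - DT_prime[j][t])
            P[i][j] = tweet_counts[j] / total * sim
    return P


# Toy data: user 0 follows users 1 and 2, user 1 follows user 0.
friends_of = [{1, 2}, {0}, set()]
tweet_counts = [10, 30, 10]
DT_prime = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
P0 = transition_matrix(friends_of, tweet_counts, DT_prime, t=0)
```

Note that a row of this matrix sums to at most 1, not exactly 1, because each entry is scaled by the similarity term.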

Topic-specific TwitterRank

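Topic-specific teleportation:

$$E_t = DT''_{\cdot t}$$

where DT’’ is the column-normalized DT matrix.

The influence scores of twitterers are calculated iteratively:

$$TR_t = \gamma P_t \times TR_t + (1 - \gamma) E_t$$

Aggregation of topic-specific TwitterRank:

$$TR = \sum_t r_t \cdot TR_t$$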
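
The iteration can be sketched as a plain fixed-point loop. Several things here are assumptions of the example rather than statements from the slides: the value `gamma=0.85`, the fixed iteration count, the toy matrix and teleportation vector, and in particular the reading of P_t × TR_t as a sum over followers (i.e. applying the transpose of P_t, so that score flows from a follower to the friends she follows).

```python
def twitter_rank(P, E, gamma=0.85, iterations=100):
    """Iterate TR <- gamma * (P^T TR) + (1 - gamma) * E for one topic."""
    n = len(E)
    tr = list(E)  # start from the teleportation vector
    for _ in range(iterations):
        tr = [
            gamma * sum(P[i][j] * tr[i] for i in range(n)) + (1 - gamma) * E[j]
            for j in range(n)
        ]
    return tr


def aggregate(tr_by_topic, weights):
    """TR = sum over topics t of r_t * TR_t."""
    n = len(tr_by_topic[0])
    return [sum(r * tr[k] for r, tr in zip(weights, tr_by_topic)) for k in range(n)]


# Toy topic-specific transition matrix and teleportation vector (3 users).
P = [[0.0, 0.6, 0.125], [0.8, 0.0, 0.0], [0.0, 0.0, 0.0]]
E = [0.5, 0.3, 0.2]
scores = twitter_rank(P, E)
overall = aggregate([scores, scores], [0.5, 0.5])
```

Since every row of P sums to at most 1 and gamma is below 1, the update is a contraction and the loop settles on a fixed point regardless of the starting vector.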

Comparison with other Algorithms

TwitterRank is compared to:

- In-degree

- PageRank

- Topic-sensitive PageRank

The comparison is made in a recommendation scenario.

Recommendation task

[Figure: the recommendation task, with labels s0, sf, St and L]

Evaluation

Assume A is a ranked list recommended by any of the algorithms, and let $A(s_i)$ be the rank of $s_i$ in A. The quality of the recommendation, Q(A), is measured as $Q(A) = |\{s_i \mid s_i \in S_t,\ A(s_i) < A(s_f)\}|$. The lower the value of Q(A), the higher the quality of the corresponding algorithm.
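
The quality measure counts how many candidates in St an algorithm ranks ahead of sf. A minimal sketch, with hypothetical user ids and a hypothetical function name:

```python
def recommendation_quality(ranked, s_f, S_t):
    """Q(A): number of users in S_t that are ranked ahead of s_f in list A."""
    rank = {user: position for position, user in enumerate(ranked)}
    return sum(1 for s in S_t if rank[s] < rank[s_f])


# Toy ranking produced by some algorithm (hypothetical user ids).
A = ["u3", "u1", "sf", "u2"]
S_t = {"u1", "u2", "u3"}
q = recommendation_quality(A, "sf", S_t)
```

Here `q` is 2, since `u3` and `u1` are placed ahead of `sf`; a value of 0 would mean the algorithm ranked `sf` above every other candidate.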

Criteria to generate the set L

The number of followers that $s_f$ has.

The number of tweets that $s_f$ has published.

The topical difference between $s_0$ and $s_f$.

Whether the relationship between $s_0$ and $s_f$ is reciprocal.

Experiment Results

All algorithms perform better in Ldf than in Ldh:

- There are twitterers who “follow” because of the topical similarity between them and their friends. This supports the homophily phenomenon.

TR is outperformed in Lfh, Ltl and Ldh:

- InD performs the best in Lfh. This is because twitterers’ “following” behaviors have already been biased toward those with more followers.

Experiment Results

- TR performs the worst in Ltl, because LDA-based topic distillation needs more content to achieve reasonable accuracy.

- TR outperforms all the other algorithms except InD in Ldh. There still exist some twitterers who do not “follow” based on topical similarity, although homophily is observed.

Conclusion and future work

Homophily does exist:

- Not all users “follow” at random.

Future work:

- Make the algorithm more robust to manipulation, e.g. purposely publishing a large number of tweets.

- Classify different categories of users by studying their following behaviors more closely.

- Incremental topic distillation / event detection.

Thank you