SIMULTANEOUSLY MODELING SEMANTICS AND STRUCTURE OF THREADED DISCUSSIONS: A SPARSE CODING APPROACH AND ITS APPLICATIONS

Chen LIN *, Jiang-Ming YANG +, Rui CAI +, Xin-jing WANG +, Wei WANG *, Lei ZHANG +

* Fudan University
+ Microsoft Research Asia

OUTLINE

Motivation
Challenges
Model
Application
    Reply reconstruction
    Junk post detection
    Expert finding
Experiments
Conclusion

THREADED DISCUSSIONS

Mailing lists
Chat rooms
IMs
Web forums

[Figure: thread tree; labels: root, reply]

IMPORTANT DATA SOURCES

MINING SEMANTICS & STRUCTURE

Junk identification
Expert search
Measure post quality

CHALLENGE

Semantics & Structure

SEMANTICS & STRUCTURE

Semantics: topics
Structure: who replies to whom

CHALLENGE

Junk Post

JUNK POST

CHALLENGE

Post Quality

POST QUALITY

valuable post

MODEL

Purpose: simultaneously modeling
    Semantics
    Structures

Methodology
    Intuitive
    Matrix based
    Sparse coding

[Figure: thread tree; labels: root, reply]

INTUITION

A THREAD HAS SEVERAL TOPICS

SEMANTIC REPRESENTATION OF THREAD

Project posts to topic space

Minimize: [formula over D, X, Θ not preserved in the text]

D: rows word1, word2, word3, …, wordV; columns post1, post2, …, postL
X: rows word1, word2, word3, …, wordV; columns topic1, …, topicT
Θ: rows topic1, …, topicT; columns post1, post2, …, postL
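
The objective on this slide did not survive as text. The sketch below assumes the loss is the squared reconstruction error of D by the product XΘ, which fits the matrix shapes on the slide (words × posts, words × topics, topics × posts) but is an assumption rather than a formula taken from the deck. The helper names and toy numbers are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def reconstruction_error(D, X, Theta):
    """Squared Frobenius norm of D - X*Theta (assumed loss, not from the slide)."""
    approx = matmul(X, Theta)
    return sum((D[i][j] - approx[i][j]) ** 2
               for i in range(len(D))
               for j in range(len(D[0])))


# Toy sizes (hypothetical): V = 3 words, T = 2 topics, L = 2 posts.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # words x topics
Theta = [[2.0, 0.0], [0.0, 3.0]]          # topics x posts
D = matmul(X, Theta)                      # words x posts, exactly reconstructible

print(reconstruction_error(D, X, Theta))  # 0.0 for this constructed example
```

Each column of Θ is then the topic-space representation of one post.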

A POST IS RELATED TO PREVIOUS POSTS

Minimize: [formula not preserved in the text]

Θ: rows topic1, …, topicT; columns post1, post2, …, postL
b: approximate each post as a linear combination of previous posts
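
A minimal sketch of the idea of approximating a post by a linear combination of previous posts, assuming a squared-error residual in topic space. The slide does not preserve the exact term, so the function name, the residual form and the toy vectors are illustrative assumptions.

```python
def structure_residual(posts, i, b):
    """Squared error when post i is approximated by earlier posts.

    posts: list of per-post topic vectors (the columns of Theta).
    b:     one weight per earlier post, so len(b) must equal i.
    """
    if len(b) != i:
        raise ValueError("need exactly one weight per earlier post")
    n_topics = len(posts[i])
    approx = [sum(b[k] * posts[k][t] for k in range(i)) for t in range(n_topics)]
    return sum((posts[i][t] - approx[t]) ** 2 for t in range(n_topics))


posts = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(structure_residual(posts, 2, [0.5, 0.5]))  # 0.0: post 2 is an even mix of posts 0 and 1
print(structure_residual(posts, 2, [1.0, 0.0]))  # 0.5: copying post 0 alone fits worse
```

Under this reading, a large weight in b suggests which earlier post the current post is responding to.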

A POST IS RELATED TO A FEW TOPICS

[Figure: example topic labels: government, cobol]

SPARSE SEMANTICS OF POST

Minimize: [formula over D, X, Θ not preserved in the text]

D: rows word1, word2, word3, …, wordV; columns post1, post2, …, postL
X: rows word1, word2, word3, …, wordV; columns topic1, …, topicT
Θ: rows topic1, …, topicT; columns post1, post2, …, postL

A POST IS RELATED TO A FEW POSTS

Minimize: [formula not preserved in the text]

Θ: rows topic1, …, topicT; columns post1, post2, …, postL
b (sparse): approximate each post as a linear combination of previous posts
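
The slide does not say how sparsity is enforced. A common device in sparse coding is an L1 penalty, whose proximal step is soft thresholding. The sketch below shows that generic operator only; it is an assumption about how "a few posts" could be obtained, not the solver used in this work, and the weights and threshold are made up.

```python
def soft_threshold(values, lam):
    """Shrink each entry toward zero by lam; entries within lam of zero become 0."""
    out = []
    for v in values:
        magnitude = max(abs(v) - lam, 0.0)
        out.append(magnitude if v >= 0 else -magnitude)
    return out


b = [0.9, -0.05, 0.2, 0.02]        # hypothetical dense reply weights
sparse_b = soft_threshold(b, 0.1)  # lam = 0.1 is an arbitrary choice
print(sum(1 for w in sparse_b if w != 0))  # 2 weights survive
```

Small weights are zeroed, so each post ends up linked to only a few earlier posts.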

OPTIMIZE THEM TOGETHER

Model semantics
Model structure
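
One way to picture optimizing the two parts together is a single objective that adds a semantic term, a structure term and two sparsity penalties. The sketch below is an assumed form built from the preceding slides: the weights `lam1`, `lam2`, `lam3`, the exclusion of the root post from the structure term and the toy matrices are all hypothetical, and the deck's actual formula is not preserved.

```python
def joint_objective(D, X, Theta, B, lam1, lam2, lam3):
    """Assumed combined loss: semantics + structure + L1 sparsity terms.

    D: words x posts, X: words x topics, Theta: topics x posts.
    B[j] holds one weight per post earlier than post j (B[0] is empty).
    """
    V, L, T = len(D), len(D[0]), len(Theta)
    # Semantic term: squared error of D approximated by X*Theta.
    semantic = sum(
        (D[v][j] - sum(X[v][t] * Theta[t][j] for t in range(T))) ** 2
        for v in range(V) for j in range(L))
    # Structure term: each non-root post vs. a weighted sum of earlier posts.
    structure = sum(
        (Theta[t][j] - sum(B[j][k] * Theta[t][k] for k in range(j))) ** 2
        for j in range(1, L) for t in range(T))
    # Sparsity terms: few topics per post, few linked posts per post.
    sparse_topics = sum(abs(Theta[t][j]) for t in range(T) for j in range(L))
    sparse_links = sum(abs(w) for row in B for w in row)
    return semantic + lam1 * structure + lam2 * sparse_topics + lam3 * sparse_links


X = [[1.0, 0.0], [0.0, 1.0]]
Theta = [[1.0, 1.0], [0.0, 0.0]]
D = [[1.0, 1.0], [0.0, 0.0]]
B = [[], [1.0]]  # post 1 is explained entirely by post 0
print(joint_objective(D, X, Theta, B, lam1=1.0, lam2=0.5, lam3=0.25))  # 1.25
```

In this toy case both fit terms are zero, so the value comes only from the sparsity penalties (0.5 × 2 + 0.25 × 1).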

APPLICATIONS

Reply reconstruction
    Capability of recognizing structure
Junk identification
    Capability of capturing semantics
Expert finding
    Capability of measuring post quality

REPLY RECONSTRUCTION

Document similarity
Topic similarity
Structure similarity
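
A minimal sketch of reply reconstruction driven by a single similarity score: each post is attached to the most similar earlier post. It uses only cosine similarity between topic vectors; how the three similarities named on the slide are combined is not preserved, so this is a simplification under that assumption, with hypothetical names and data.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_parents(post_vectors):
    """Return, for each post, the index of its predicted parent (None for the root)."""
    parents = [None]
    for i in range(1, len(post_vectors)):
        # Only earlier posts can be replied to.
        best = max(range(i), key=lambda k: cosine(post_vectors[i], post_vectors[k]))
        parents.append(best)
    return parents


topic_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(predict_parents(topic_vectors))  # [None, 0, 1, 2]
```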

DATA SET

                    Slashdot    Apple discussion
No. threads         1154        4488
No. posts           203210      80008
Avg. thread len.    176.09      17.84
Avg. words/post     73.53       78.36
Avg. posts/user     15.32       4.69

BASELINES

NP: Reply to Nearest Post
RR: Reply to Root
DS: Document Similarity
LDA: Latent Dirichlet Allocation
    Project documents to topic space
SWB: Special Words Topic Model with Background distribution
    Project documents to topic and junk topic space
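
The two simplest baselines, NP and RR, can be sketched directly from their descriptions. The scoring function below, the fraction of non-root posts whose parent is predicted correctly, is an assumed metric for illustration (the deck does not name the measure it reports), and the example thread is made up.

```python
def nearest_post_baseline(n_posts):
    """NP: every post replies to the post immediately before it."""
    return [None] + [i - 1 for i in range(1, n_posts)]


def reply_to_root_baseline(n_posts):
    """RR: every post replies to the root (post 0)."""
    return [None] + [0] * (n_posts - 1)


def parent_accuracy(predicted, actual):
    """Fraction of non-root posts whose predicted parent matches the actual one."""
    pairs = list(zip(predicted[1:], actual[1:]))
    return sum(p == a for p, a in pairs) / len(pairs) if pairs else 0.0


actual = [None, 0, 0, 1, 3]  # hypothetical thread: parent index of each post
print(parent_accuracy(nearest_post_baseline(5), actual))   # 0.5
print(parent_accuracy(reply_to_root_baseline(5), actual))  # 0.5
```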

EVALUATION

          Slashdot                  Apple
Method    All Posts   Good Posts    All Posts   Good Posts
NP        0.021       0.012         0.289       0.239
RR        0.183       0.319         0.269       0.474
DS        0.463       0.643         0.409       0.628
LDA       0.465       0.644         0.410       0.648
SWB       0.463       0.644         0.410       0.641
SMSS      0.524       0.737         0.517       0.772

JUNK IDENTIFICATION

D: rows word1, word2, word3, …, wordV; columns post1, post2, …, postL
X: rows word1, word2, word3, …, wordV; columns topic1, …, topicT, topicbg
Θ: rows topic1, …, topicT, topicbg; columns post1, post2, …, postL

Probability of junk: [formula not preserved in the text]
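
The formula for the probability of junk is not preserved here. One plausible reading, stated as an assumption, is the share of a post's topic weight that falls on the background topic (topicbg). The function below sketches that reading only; the name and the convention that the background topic is the last entry are hypothetical.

```python
def junk_probability(theta_post):
    """Share of a post's total topic weight carried by the background topic.

    theta_post: topic weights for one post, with the background topic LAST
    (a convention assumed for this sketch).
    """
    total = sum(abs(w) for w in theta_post)
    return abs(theta_post[-1]) / total if total else 0.0


print(junk_probability([0.0, 0.5, 1.5]))  # 0.75: mostly background, likely junk
print(junk_probability([3.0, 1.0, 0.0]))  # 0.0: no background weight
```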

DATA SET

Slashdot
Apple discussion

BASELINES

DF
SVM: Classify posts as junk posts & non-junk posts
SWB: Special Words Topic Model with Background distribution
    Project documents to topic and junk topic space

EVALUATION

Method    Precision   Recall   F-measure
SWB       0.48        0.22     0.30
SVM       0.37        0.24     0.20
DF        0.34        0.40     0.36
SMSS      0.38        0.45     0.41
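
For reference, the balanced F-measure (F1) is the harmonic mean of precision and recall. The sketch assumes that this is the variant reported; the deck does not say so. Values recomputed from the rounded precision and recall need not match the reported column exactly.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f_measure(0.48, 0.22), 2))  # 0.3
print(round(f_measure(0.38, 0.45), 2))  # 0.41
```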

EXPERT FINDING

Methods
    HITS
    PageRank
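
A minimal sketch of HITS run on a reply graph, assuming an edge points from the user who replies to the user being replied to. That edge direction, the iteration count and the toy graph are assumptions for illustration; how the reply graph is built or weighted in this work is not shown on the slide.

```python
import math


def hits(edges, n_users, iterations=50):
    """Power-iteration HITS on a reply graph.

    edges: (replier, replied_to) pairs of user indices.
    Returns (hub, authority) score lists, each L2-normalized.
    """
    hub = [1.0] * n_users
    auth = [1.0] * n_users
    for _ in range(iterations):
        # Authority: sum of hub scores of the users who replied to you.
        auth = [0.0] * n_users
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = math.sqrt(sum(a * a for a in auth)) or 1.0
        auth = [a / norm for a in auth]
        # Hub: sum of authority scores of the users you replied to.
        hub = [0.0] * n_users
        for src, dst in edges:
            hub[src] += auth[dst]
        norm = math.sqrt(sum(h * h for h in hub)) or 1.0
        hub = [h / norm for h in hub]
    return hub, auth


edges = [(1, 0), (2, 0), (3, 0), (3, 2)]
hub, auth = hits(edges, n_users=4)
print(max(range(4), key=lambda u: auth[u]))  # 0: the most-replied-to user
```

Users with high authority scores are the candidates for experts under this reading.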

BASELINES

LM: Formal Models for Expert Finding in Enterprise Corpora. SIGIR ’06
    Achieves stable performance in expert finding task using a language model
PageRank: Benchmark nodal ranking method
HITS: Find hub nodes and authority nodes
EABIF: Personalized Recommendation Driven by Information Flow. SIGIR ’06
    Find most influential node

EVALUATION

Bayesian estimate

Method            MRR     MAP     P@10
LM                0.821   0.698   0.800
EABIF(ori.)       0.674   0.362   0.243
EABIF(rec.)       0.742   0.318   0.281
PageRank(ori.)    0.675   0.377   0.263
PageRank(rec.)    0.743   0.321   0.266
HITS(ori.)        0.906   0.832   0.900
HITS(rec.)        0.938   0.822   0.906
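
The three measures in the table are standard ranking metrics. The sketch below gives their usual textbook definitions; the query data is made up, and nothing here reproduces the evaluation setup behind the reported numbers.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant (P@k)."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_over_queries(metric, runs):
    """Average a per-query metric over (ranked, relevant) pairs: MRR or MAP."""
    return sum(metric(ranked, relevant) for ranked, relevant in runs) / len(runs)


runs = [(["a", "b", "c", "d"], {"b", "d"}), (["x", "y"], {"x"})]
print(mean_over_queries(reciprocal_rank, runs))    # 0.75 (MRR)
print(mean_over_queries(average_precision, runs))  # 0.75 (MAP)
print(precision_at_k(runs[0][0], runs[0][1], 2))   # 0.5 (P@2)
```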

DISCUSSION

Parameters vs. model complexity
    Linear regression
    SMSS model

Though the number of parameters is increased, the projection space is shrunk by the prior knowledge.

[Figure labels: prior knowledge]

CONCLUSION

Purpose
    Mine the semantics
    Mine the structure
Highlight
    Simultaneously model the semantics and the structure
Applications are designed to evaluate the model
    Reply reconstruction
    Junk identification
    Expert finding