
A Classification-based Approach to Question Routing in Community Question Answering

Tom Chao Zhou (1), Michael R. Lyu (1), Irwin King (1,2)

(1) The Chinese University of Hong Kong, (2) AT&T Labs Research

{czhou,lyu,king}@[email protected]

Workshop on Community Question Answering on the Web, in Conjunction with World Wide Web 2012

April 17, 2012


Introduction

Problem Definition and Feature

Experiments

Related Work

Conclusions and Future Work

Community-based Question Answering

• Knowledge dissemination, information seeking
• Natural language questions
• Explicit, self-contained answers

How CQA Works

CQA users → Submit Question → Get Answers?
– yes: Answer Selection, Question Resolved
– no: Question Not Resolved

• The number of posted questions grows fast.
• Can users get their questions resolved within a reasonable period?

Whether Questions Get Resolved

• Randomly sample 140 questions from each category in Yahoo! Answers
• 26 top-level categories
• In total 3,640 questions
• Track the status of each question

Percentage of Questions Resolved

1       2       3       4       5       6       7       8       9
11.95%  19.95%  24.75%  26.48%  27.31%  51.32%  61.92%  63.41%  64.45%


How about we carefully select a set of CQA users who may be interested in the question?

Question Routing

• Definition
– Routing open questions to suitable answerers who may be interested in the question
– Interested in the question: Yes
– Not interested in the question: No

Question Routing

• Benefits
– Asker’s perspective
  • Reduces the time lag between when a question is posted and when it is answered
– Answerer’s perspective
  • Answerers are more enthusiastic about answering questions that interest them
– CQA service’s perspective
  • Leverages users’ enthusiasm for answering, which improves the CQA service and boosts users’ stickiness and loyalty to the system


Problem Definition

Question Routing Problem

Given a question and a user in CQA, determine whether the user will contribute his/her knowledge to answer the question
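One way to read this definition is as binary classification over (question, user) pairs. The sketch below only illustrates that framing under assumptions: the feature values, the weights, and the names `RoutingExample` and `predict_will_answer` are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RoutingExample:
    """One (question, user) pair described by a numeric feature vector."""
    features: List[float]


def predict_will_answer(example: RoutingExample, weights: List[float], bias: float) -> bool:
    """Linear decision rule: predict 'will answer' when w . x + b > 0."""
    if len(example.features) != len(weights):
        raise ValueError("feature and weight vectors must have the same length")
    score = sum(w * x for w, x in zip(weights, example.features)) + bias
    return score > 0.0


# Hypothetical weights; in practice they would be learned from labeled pairs.
weights = [0.5, -1.0]
bias = 0.1
print(predict_will_answer(RoutingExample([1.0, 0.2]), weights, bias))
```

A positive prediction would mean the question is routed to that user; a negative one means it is not.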

Feature Investigation

• Local Features
– Only local information about the question, the user’s history, and the question-user relationship is needed
• Global Features
– Take into account the global information of the CQA service
– Use the category as the global information
– Questions in the same category discuss similar topics
– Incorporating global information has a smoothing effect

Feature Investigation

# of features     Question   User History   Question-User Relationship
Local Features    3          10             7
Global Features   3          2              1

Feature Investigation Summary

Local Features

• Question (3 features)
– Question Length
  • Agichtein et al. 2008 found question length to be an important feature for measuring question quality
  1. Title length
  2. Detail length
– Question Type
  3. 5W1H type
  – Why, what, where, who and how
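The three local question features could be extracted roughly as follows. This is a sketch under assumptions: lengths are counted in whitespace-separated words (the slides do not say whether words or characters are used), the type is the first question word found in the title, "when" is included because 5W1H conventionally covers it, and the function names are hypothetical.

```python
import re
from typing import Dict

# 5W1H question words; "when" is an assumption, it is not listed on the slide.
FIVE_W_ONE_H = ("why", "what", "where", "when", "who", "how")


def question_type(title: str) -> str:
    """Return the first 5W1H word found in the title, or 'other'."""
    for token in re.findall(r"[a-z]+", title.lower()):
        if token in FIVE_W_ONE_H:
            return token
    return "other"


def question_local_features(title: str, detail: str) -> Dict[str, object]:
    """Title length and detail length (in words) plus the 5W1H type."""
    return {
        "title_length": len(title.split()),
        "detail_length": len(detail.split()),
        "question_type": question_type(title),
    }


features = question_local_features(
    "How do I reset my router?", "It stopped working last night."
)
print(features)
```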

Local Features

• User History (10 features)
– Users’ history would have implications for users’ interests and behaviors
– Profile, question and answering behaviors
  1. Member since
  2. Percentage of best answers
  3. Total points
  4. Number of answers
  5. Number of best answers
  6. Number of asked questions
  7. Number of resolved questions
  8. Number of stars received
  9. Answer/question ratio
  10. Best answer/question ratio

Local Features

• Question-User Relationship (7 features)
– Capture the relationship between a question and a user
– Features adapted from the existing CQA service
  1. Top contributor
– Features that measure how interested the user is in the category the given question belongs to
  2. Ratio of answered questions in the category
  3. Ratio of best-answered questions in the category
  4. Ratio of asked questions in the category
  5. Ratio of starred questions in the category
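The slides do not spell out how the category ratios are computed. A plausible reading, assumed here, is the fraction of a user's questions of one kind (answered, best-answered, asked, or starred) that fall in the given question's category. The function name `category_ratio` is hypothetical.

```python
from collections import Counter
from typing import Iterable


def category_ratio(categories: Iterable[str], target_category: str) -> float:
    """Fraction of a user's questions (of one kind) that fall in target_category."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # the user has no activity of this kind
    return counts[target_category] / total


# Categories of the questions one user has answered (made-up data).
answered = ["Pets", "Pets", "Travel", "Health"]
print(category_ratio(answered, "Pets"))
```

Calling the same function on the categories of the user's best-answered, asked, and starred questions would give the other three ratios under this reading.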

Local Features

• Question-User Relationship (7 features, continued)
– Features describing the similarity between the question’s language model and the user’s language model
  6. KL-divergence between the given question and the user’s answered questions
  7. KL-divergence between the given question and the user’s background language model (answered, asked, and starred questions)
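A KL-divergence feature of this kind can be sketched with unigram language models. The details below are assumptions, not the authors' stated setup: whitespace tokenization, add-one smoothing over the joint vocabulary, and the direction D(question || user). All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenization (an assumption)."""
    return text.lower().split()


def unigram_model(tokens: Iterable[str], vocabulary: Set[str], alpha: float = 1.0) -> Dict[str, float]:
    """Additively smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """D_KL(p || q) = sum over w of p(w) * log(p(w) / q(w))."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def question_user_kl(question: str, user_questions: List[str]) -> float:
    """KL-divergence between a question's model and a user's question history."""
    q_tokens = tokenize(question)
    u_tokens = [t for text in user_questions for t in tokenize(text)]
    vocabulary = set(q_tokens) | set(u_tokens)
    p = unigram_model(q_tokens, vocabulary)
    q = unigram_model(u_tokens, vocabulary)
    return kl_divergence(p, q)


close = question_user_kl("fix my bike", ["fix my bike tire"])
far = question_user_kl("fix my bike", ["best pasta recipe"])
print(close < far)
```

Passing only the user's answered questions corresponds to feature 6; passing answered, asked, and starred questions together corresponds to feature 7. A smaller value means the question is closer to what the user has engaged with.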

Global Features

• Question (3 features)
– Category-level features that smooth each question
  1. Average title length
  2. Average detail length
– Whether the question is representative of its category
  3. KL-divergence between the given question and the questions in the category the given question belongs to
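The category-level averages could be computed as in the sketch below. It assumes lengths are measured in whitespace-separated words and that each question is available as a (category, title, detail) tuple; both are illustrative choices, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def category_average_lengths(
    questions: List[Tuple[str, str, str]]
) -> Dict[str, Tuple[float, float]]:
    """Map category -> (average title length, average detail length) in words.

    Each question is a (category, title, detail) tuple.
    """
    # Per category: [title words, detail words, question count]
    sums = defaultdict(lambda: [0, 0, 0])
    for category, title, detail in questions:
        entry = sums[category]
        entry[0] += len(title.split())
        entry[1] += len(detail.split())
        entry[2] += 1
    return {c: (t / n, d / n) for c, (t, d, n) in sums.items()}


# Made-up questions for illustration.
questions = [
    ("Pets", "My cat sneezes", "Any ideas"),
    ("Pets", "Dog food", "What brand is best for puppies"),
]
print(category_average_lengths(questions))
```

Every question in a category then shares the same two values, which is what lets them act as a category-level smoothing signal.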

Global Features

• User History (2 features)
– Capture the uniqueness of a user
• Question-User Relationship (1 feature)
– The more similar the language model of a user’s answered questions is to that of the questions in a category, the more likely the user is to answer questions from that category
  • KL-divergence between the user’s answered questions and the questions in the category the given question belongs to


Experiments

• Classification Algorithm
– Support vector machines (SVM) with linear kernel
• Metrics
– Precision, recall, F1 for positive class
– Accuracy for both classes
• Dataset
– Crawled from 3,500 users’ “Answers”, “Questions”, and “Starred Questions” pages on Yahoo! Answers
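The four metrics follow the standard definitions. A minimal sketch, assuming labels are encoded as 1 (the user answers) and 0 (the user does not); the function names and the toy labels are hypothetical.

```python
from typing import Dict, Sequence


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, F1 for the positive class; accuracy over both classes."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
        "accuracy": accuracy,
    }


# Toy labels for illustration only.
print(binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0]))
```

As a consistency check, `f1_from(0.5314, 0.3896)` gives about 0.4496, which matches the F1 reported for the Question feature group in the local-feature results.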

Effect of Local Features

                             Precision   Recall   F1       Accuracy
Question                     0.5314      0.3896   0.4496   0.5157
User History                 0.8278      0.4682   0.5981   0.6805
Question-User Relationship   0.5824      0.935    0.7178   0.6267

• Question-User Relationship achieves the best F1 and recall
– Captures the user’s performance and interests in the category of the given question
– Captures the semantic relatedness of the given question and the user
• User History achieves the best precision
– Some users are quite active in the system
– These highly active users account for only a small percentage of all users

Effect of Local Features

                                    Precision   Recall   F1       Accuracy
Q + QU Relationship                 0.5974      0.9134   0.7223   0.6435
U + QU Relationship                 0.7362      0.8275   0.7792   0.7619
Q + U + QU Relationship             0.7418      0.8253   0.7814   0.7655
Top 10 features in Local features   0.6964      0.8095   0.7487   0.7241

• The combination of all local features achieves the best F1
• Results of employing the top 10 features are also encouraging

Effect of Local Features

• Two most important local features
– KL-divergence between the given question and the questions answered by the user
  • Captures most accurately the semantic relatedness between the given question and the knowledge of the user
– KL-divergence between the given question and the questions answered, asked, and starred by the user
  • Also considers the user’s interests by incorporating other factors

Effect of Local and Global Features

                 Precision   Recall   F1       Accuracy
Local            0.7418      0.8253   0.7814   0.7655
Global           0.5779      0.8713   0.6949   0.6109
Local + Global   0.7279      0.8499   0.7842   0.7689

• Combining local and global features keeps the best elements of both, and consequently achieves the best F1 score

Effect of Local and Global Features

• Three most important features
– KL-divergence between the given question and the questions answered by the user
– KL-divergence between the given question and the questions answered, asked, and starred by the user
– KL-divergence between the given question and the questions from the same category
• If a question is quite typical of its category, it has a higher chance of being answered by users; this may also partially explain why CQA services usually have well-structured categories


Related Work

• Question Routing
– Zhou et al. 2009, expertise-based question routing
– Li and King 2010, language model based framework for combining expertise estimation and availability estimation
– Li et al. 2011, category-sensitive language model
• Link analysis and Expert Finding
– Jurczyk and Agichtein, 2007
– Zhang, Ackerman and Adamic, 2007
– Apply PageRank and HITS in social media


Conclusions

• Formulate question routing as a classification task
• Derive a variety of local and global features
• Analyze the contributions from different sources
• Thorough experimental study

Future Work

• Semi-supervised approach
• Incorporate social aspects into the model


Thanks Q&A
