Improving Health Question Classification by Word Location Weights

Rey-Long Liu

Dept. of Medical Informatics

Tzu Chi University

Taiwan

Outline

• Background

• Problem definition

• The proposed approach: WLW

• Empirical evaluation

• Conclusion

Background

Categories of Health Questions

Classification of Health Questions

• Why health questions?
  – Health questions provide both reliable and readable health information

• Why classification of health questions?
  – Given a health question q, retrieve related questions (and their answers)

Problem Definition

Goal & Motivation

• Goal
  – Target: Chinese Health Questions (CHQs)
  – Contribution: Developing a technique, WLW (Word Location Weight), that estimates the location weights of words in a CHQ based on their locations

• Motivation
  – Location weights can be used by classifiers (e.g., SVM) to improve the classification
    • Classifying in-space CHQs (cause, diagnosis, process)
    • Filtering out-space CHQs (may be whatever)

Basic Idea

• Those words that are more related to the category of a CHQ tend to appear at the beginning and end of the CHQ

• Examples:
  – 如何 (how to) 克服 (deal with) 緊張 (nervous) 的情緒 (mood)? Category: process
  – 嬰兒 (infant) 體溫 (body temperature) 太低 (too low) 怎麼辦 (what to do)? Category: process

Related Work

• Recognition of question types (e.g., when, where)
  – Weakness: Question types are not the same as the intended categories of CHQs

• Classification by parsing
  – Weakness I: Parsing Chinese is still challenging
  – Weakness II: CHQs are NOT always well-formed

• Classification by pattern matching
  – Weakness: Difficult to construct the string patterns

The Proposed Approach: WLW

Main Challenges

(1) Defining the two weights of a location p in a CHQ q

Main Challenges (cont.)

(2) Encoding the location weights of a word w into two features for the underlying classifier

Interesting Behaviors of WLW

• A word w in a question q has two features
  – Fvalue_front and Fvalue_rear
  – Applicable to different categories and languages (e.g., English)

• When w is far from the front and the rear
  – Both features reduce to the term frequency (TF) of w
  – WLW reduces to the traditional feature-encoding approach (using TF as the features)
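
The behavior described above can be illustrated with a minimal sketch. The slides do not give the actual weighting formula, so the linear decay, the `window` parameter, and the function name `location_features` are all assumptions made for illustration. The sketch only reproduces the stated property: a word near the front or the rear gets a boosted front or rear feature, and a word far from both ends gets two features equal to its term frequency.

```python
from collections import defaultdict


def location_features(words, window=3):
    """Return {word: (front_feature, rear_feature)} for one segmented question.

    Hypothetical weighting (not the formula from the slides): an occurrence
    whose distance d from the front (or rear) is below `window` contributes
    1 + (window - d) / window; otherwise it contributes exactly 1, so the
    feature falls back to plain term frequency.
    """
    n = len(words)
    front = defaultdict(float)
    rear = defaultdict(float)
    for p, w in enumerate(words):
        dist_front = p              # 0 for the first word
        dist_rear = n - 1 - p       # 0 for the last word
        front[w] += 1.0 + max(0.0, (window - dist_front) / window)
        rear[w] += 1.0 + max(0.0, (window - dist_rear) / window)
    return {w: (front[w], rear[w]) for w in front}


# Segmentation follows the first example question in the Basic Idea slide.
question = ["如何", "克服", "緊張", "的情緒"]
features = location_features(question)
```

Each word then contributes two values to the classifier's input vector instead of the single TF value used in the traditional encoding.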

Empirical Evaluation

Experimental Design

• CHQs were downloaded from a health information provider
  – 864 in-space CHQs
    • cause (category 1): 313
    • diagnosis (category 2): 92
    • process (category 3): 459
  – 100 out-space CHQs
    • whatever (general description)

• Five-fold cross validation
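
As a rough illustration of five-fold cross validation, the sketch below splits item indices into five folds and yields each fold once as the test set. The slides do not say whether the data were shuffled or stratified by category, so the seeded shuffle and the helper name `k_fold_indices` are assumptions.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)         # assumed: random, unstratified split
    folds = [idx[i::k] for i in range(k)]    # k folds of nearly equal size
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


# 864 is the number of in-space CHQs reported in the experimental design.
splits = list(k_fold_indices(864, k=5))
```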

Underlying Classifiers

• Underlying classifier
  – The Support Vector Machine (SVM) classifier

Results: Classification of In-Space CHQs

• Evaluation criteria
  – Micro-averaged F1 (MicroF1)
  – Macro-averaged F1 (MacroF1)
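
The two criteria can be computed as in the sketch below. It assumes each question has exactly one gold and one predicted category, which the slides do not state; the function names are illustrative. MicroF1 pools the true-positive, false-positive, and false-negative counts over all categories before computing F1, while MacroF1 averages the per-category F1 values, so a small category such as diagnosis weighs as much as a large one.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred, categories):
    """Return (MicroF1, MacroF1) for single-label predictions (an assumption)."""
    counts = {c: [0, 0, 0] for c in categories}   # [tp, fp, fn] per category
    for g, p in zip(gold, pred):
        if g == p:
            if g in counts:
                counts[g][0] += 1
        else:
            if p in counts:
                counts[p][1] += 1                 # p wrongly accepted the item
            if g in counts:
                counts[g][2] += 1                 # g missed its own item
    macro = sum(f1(*counts[c]) for c in categories) / len(categories)
    tp, fp, fn = (sum(counts[c][i] for c in categories) for i in range(3))
    return f1(tp, fp, fn), macro


gold = ["cause", "cause", "process", "diagnosis"]
pred = ["cause", "process", "process", "cause"]
micro, macro = micro_macro_f1(gold, pred, ["cause", "diagnosis", "process"])
```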

SVM+WLW is significantly better than SVM

Results: Filtering of Out-Space CHQs

• Evaluation criteria
  – Filtering ratio (FR) = (# out-space CHQs successfully rejected by all categories) / (# out-space CHQs)
  – Average number of misclassifications (AM) = (# misclassifications for the out-space CHQs) / (# out-space CHQs)
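
A small sketch of the two filtering criteria follows. It assumes that every category makes its own accept/reject decision for each out-space CHQ and that each acceptance counts as one misclassification; this is one reading of the definitions above, and the function name is illustrative.

```python
def filtering_metrics(accepted_by):
    """Return (FR, AM) for a list of out-space CHQs.

    `accepted_by[i]` is the set of categories that (wrongly) accepted
    out-space question i; an empty set means every category rejected it.
    """
    total = len(accepted_by)
    if total == 0:
        raise ValueError("need at least one out-space question")
    rejected_by_all = sum(1 for cats in accepted_by if not cats)
    misclassifications = sum(len(cats) for cats in accepted_by)
    return rejected_by_all / total, misclassifications / total


# Four hypothetical out-space questions: two rejected everywhere,
# one accepted by a single category, one accepted by two categories.
fr, am = filtering_metrics([set(), {"cause"}, {"cause", "process"}, set()])
```

A higher FR and a lower AM both indicate better filtering, which is the direction of improvement reported for SVM+WLW.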

SVM+WLW achieves higher FR and lower AM

Conclusion

• Healthcare consumers often read health information on the Internet

• Health questions are a valuable resource for healthcare consumers
  – Providing both reliable and readable health information

• Classification of health questions is the basis for the retrieval of related questions
  – cause, diagnosis, process, whatever

• WLW can help SVM to improve the classification of CHQs
