Contextual Concept Language Model for Answering Biomedical Questions

Qi Sun1 Jinguo Yao2 Junyu Niu3 Department of Computer Science and Engineering, Fudan University,

No. 220, Handan Road, Shanghai, China [email protected]

Abstract

In this paper, we utilize the MeSH vocabulary to capture the concepts of each word appearing in questions and documents, and we propose two new methods, the contextual concept smoothing language model (CCSLM) and the contextual concept language model (CCLM), to find answer sentences in the biomedical literature for questions posed by biomedical experts. The concepts employed in the models, instead of keywords, guarantee high recall, and the contexts of each underlying answer sentence boost the precision of the answers to each question. We evaluate both methods on the data collection of TREC Genomics Track 2006. The results indicate that our methods are much better than a straightforward keyword-based method. Compared with the results of Genomics Track 2006, our methods achieve about 10% higher MAP than the mean level of Genomics Track 2006.

Keywords: biomedical; contextual concept; language model

1. Introduction

The volume of published biomedical research is growing at an exponential rate. With such explosive growth, it is extremely challenging to keep up to date with all of the new discoveries and theories, even within one's own field of biomedical research. However, these new discoveries and theories are usually essential to understanding the underlying causes of various diseases and the functions of genes, so it is very important to analyze this biological data.

In our research, we focus on finding answer sentences or passages in the biomedical literature for questions posed by biomedical experts. One solution to this problem may be keyword-based biomedical information retrieval: we can generate keywords from the questions posed by biomedical experts and then use information retrieval methods to retrieve the related passages or documents. In the field of biomedicine, however, a biological entity, such as a gene or protein, may be expressed by different words, phrases, or abbreviations that share the same concept in a certain context [1]. Therefore, the results of this straightforward method will be very poor.

Two new methods, the contextual concept smoothing language model (CCSLM) and the contextual concept language model (CCLM), are proposed for this task in this paper. Both of them utilize concepts to represent questions and sentences, and the latter explicitly incorporates the context in which a sentence dwells. The experimental results show that both new methods are better than the original methods and reach a level above the mean on two of the three evaluation measures. The second method, the contextual concept language model, performs much better than the first and wins by a large margin on all three evaluation measures.

All 28 questions used in our experiment are based on the four Generic Topic Types (GTTs) given by TREC Genomics Track 2006 [4]. As an illustration, the following is an example:

GTT: Find articles describing the role of a gene involved in a given disease.

Question Pattern: What is the role of gene in disease?

Example: What is the role of DRD4 in alcoholism?

The remainder of the paper is organized as follows.

Section 2 describes related work. Section 3 formally presents our new methods, CCSLM and CCLM, for answering biomedical questions. Section 4 evaluates the methods and analyzes the experimental results. Section 5 concludes the paper with our future work.

2. Related Work

Our work is mainly related to two areas of research: information retrieval and text mining. Language models have achieved great success in the field of information retrieval since being proposed in 1998 [14]. Some sophisticated retrieval models have attempted to go beyond simple term matching, for example, query expansion using lexical-semantic relations [8] and word sense disambiguation [9].


The common drawback of these methods is the complexity of their implementation. The relevance-based language model is another approach that makes use of the meaning of words to improve retrieval performance [10]; it estimates the relevance probability of words in the relevant class. Context-sensitive semantic smoothing incorporated synonym and sense information into the language models [11]; it decomposed a document or a query into a set of small topic signatures and estimated a translation model for each topic signature. In [7], the authors presented a conceptual retrieval method that simulated user behavior in searching for information.

Biomedical information mining aims to help biomedical scientists discover the knowledge contained in the biomedical literature and make more efficient use of existing research. It differs from biomedical information retrieval, which focuses on larger units of text such as documents [3]; biomedical information mining processes text at a smaller granularity. It plays an important role in turning bio-data into bio-information [3][2]. Many text mining techniques are used to help researchers find meaningful patterns. In Genomics Track 2006, many groups [5][6][15] utilized concepts or expanded queries to represent a question.

3. The Proposed Methods

Unlike the open domain, the biomedical field has an abundance of external resources, which makes it possible to map words to concepts and improve the accuracy of mined answer sentences. The MeSH vocabulary, compiled by the National Library of Medicine, is an authoritative list of vocabulary terms used for the subject analysis of biomedical literature; it imposes uniformity and consistency on the indexing of biomedical literature. We make use of the MeSH vocabulary to capture the concepts of each word appearing in questions and documents. Different from [12], which assigns a unified number to each word or phrase having the same concept, we attempt to model concept unification from a statistical point of view. For example, when the keywords PrnP and mad cow disease appear in a user's question, the user most likely wants to find the relation between the gene, PrnP, and the disease, mad cow disease. If we know all the other words or phrases that represent the same concepts as PrnP and mad cow disease, we can find the relevant answers to the question in the biomedical literature.
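As an illustration only (not part of the authors' system), the following Python sketch shows how surface forms in a question could be mapped to concept identifiers; the tiny hand-made synonym table, the concept identifiers, and the function name are hypothetical stand-ins for entries that would normally come from MeSH.

# A minimal sketch of mapping surface forms to MeSH-style concept identifiers.
# The synonym table below is a hypothetical, hand-made stand-in for MeSH entries.
SYNONYM_TO_CONCEPT = {
    "prnp": "C:PRNP_GENE",
    "prion protein gene": "C:PRNP_GENE",
    "mad cow disease": "C:BSE_DISEASE",
    "bovine spongiform encephalopathy": "C:BSE_DISEASE",
    "bse": "C:BSE_DISEASE",
}

def concepts_in_text(text: str) -> set:
    """Return the set of concept identifiers whose surface forms occur in the text."""
    lowered = text.lower()
    return {cid for form, cid in SYNONYM_TO_CONCEPT.items() if form in lowered}

if __name__ == "__main__":
    question = "What is the role of PrnP in mad cow disease?"
    print(concepts_in_text(question))  # e.g. {'C:PRNP_GENE', 'C:BSE_DISEASE'}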

Figure 1 gives an architectural overview of our biomedical question answering system.

Figure 1. Biomedical question answering system

The orange line indicates the process of CCSLM, which returns a rough answer set for the questions. The red line shows the CCLM approach, which combines the results of document retrieval with those obtained from the upper process and produces a final answer set for each question.

3.1. CCSLM

Motivated by the work in [7], we measure the similarity of each sentence to a question according to the concept words within it, by estimating the relevance probability of each sentence for a given question instead of using the heuristic method employed in that work. We call this method the simple contextual concept smoothing language model; it does not incorporate the context explicitly.

First, a relevant concept set, built manually according to the MeSH vocabulary, is used to replace a question Q:

Q = {c_1, c_2, c_3, ..., c_n}    (1)

where c_i is a concept appearing in the given question and n is the number of concepts in the question. We estimate the probability of a sentence being the answer to a given question Q with the following framework:

P(A | S, Q) = Π_{i=1..n} p(c_i | S)    (2)

where P(A | S, Q) is the probability of the sentence S being an answer A to the question Q, and c_i is one of the concepts in question Q.

Maximum likelihood estimation is used to calculate the value of p(c_i | S) in formula (2). We employ the linear interpolation approach of language modeling to address the data sparseness problem often encountered in text processing:

p(c_i | S_{j,k}) = (1 - λ) · SCF_i / len(S_{j,k}) + λ · p(c_i | D_k)    (3)

where S_{j,k} is the j-th sentence of document k, SCF_i is the frequency of the words having the same concept c_i in the sentence S_{j,k}, and len(S_{j,k}) is the length of the sentence, measured by the number of terms it contains.


We smooth the estimate of p(c_i | S_{j,k}) with p(c_i | D_k), i.e., document context smoothing, instead of the collection smoothing p(c_i | C) that is mainly used in the language model smoothing methods of information retrieval [13]. We do not employ collection smoothing because p(c_i | C) may bring in more noisy information than the containing document does. Specifically, the document smoothing component in formula (3) is calculated as follows:

p(c_i | D_k) = DCF_i / len(D_k)    (4)

where DCF_i is the frequency of the words having the same concept c_i in the document D_k, and len(D_k) is the document length, i.e., the number of terms contained in D_k.
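A minimal Python sketch of formulas (3) and (4), assuming the concept frequencies SCF_i and DCF_i and the term lengths have already been computed by the caller; the function names are ours, not the authors'.

# Smoothed concept probability of formulas (3) and (4). scf_i / dcf_i are the
# frequencies of words carrying concept c_i in the sentence and in its containing
# document; sent_len / doc_len are term counts. All inputs are assumed precomputed.

def p_concept_given_doc(dcf_i: int, doc_len: int) -> float:
    """Formula (4): maximum likelihood estimate of p(c_i | D_k)."""
    return dcf_i / doc_len if doc_len > 0 else 0.0

def p_concept_given_sentence(scf_i: int, sent_len: int,
                             dcf_i: int, doc_len: int, lam: float) -> float:
    """Formula (3): Jelinek-Mercer interpolation of the sentence estimate
    with the document-context estimate, weighted by lambda."""
    sent_part = scf_i / sent_len if sent_len > 0 else 0.0
    return (1.0 - lam) * sent_part + lam * p_concept_given_doc(dcf_i, doc_len)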

Then, we rank sentences using the following formula:

P(A | S_{j,k}, Q) = Π_{i=1..n} [ (1 - λ) · SCF_i / len(S_{j,k}) + λ · p(c_i | D_k) ]

             rank= Σ_{i=1..n} log[ (1 - λ) · SCF_i / len(S_{j,k}) + λ · DCF_i / len(D_k) ]    (5)

The sentence score function is thus a sentence language model based on the Jelinek-Mercer smoothing method [13]; the rank= step indicates rank equivalence, since taking the logarithm does not change the ordering of sentences.
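The following Python sketch illustrates the rank-equivalent CCSLM score of formula (5) under the same assumptions (precomputed concept frequencies and term lengths); the default λ value and the handling of unseen concepts are illustrative choices of ours.

import math

# CCSLM ranking score of formula (5). For each question concept c_i, scf[i] and
# dcf[i] hold the concept frequencies in the candidate sentence and its document;
# sent_len and doc_len are term counts. lam is the Jelinek-Mercer weight.

def ccslm_score(scf, dcf, sent_len, doc_len, lam=0.5):
    """Sum of log smoothed concept probabilities (rank-equivalent form of (5))."""
    score = 0.0
    for scf_i, dcf_i in zip(scf, dcf):
        p = (1.0 - lam) * scf_i / sent_len + lam * dcf_i / doc_len
        score += math.log(p) if p > 0 else float("-inf")  # unseen concept: worst score
    return score

# Example: a sentence of 20 terms in a 300-term document, two question concepts.
print(ccslm_score(scf=[2, 1], dcf=[7, 4], sent_len=20, doc_len=300, lam=0.5))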

3.2. CCLM

Formulas (3) and (5) estimate the probability that a sentence is relevant to a question only through the presence of concept words in it. The simple contextual concept smoothing language model does not take the document context as a major component; it only incorporates document contextual concept smoothing when calculating the score of a sentence. We improve the method by explicitly adding the context as follows:

P(A | S_{j,k}, Q) = Π_{i=1..n} p(c_i | S_{j,k}) · p(c_i | D_k)    (6)

In the above formula, p(c_i | D_k) is the document context component. It is based on the assumption that the concepts mentioned in a document are relevant to its subject, which may boost the score of the sentences contained in it.

For example, if the abbreviations IDE and AD appear in the sentences of a document discussing the relation between insulin degrading enzyme and Alzheimer's disease, it is likely that those sentences can answer a question about the role of IDE in Alzheimer's disease.

Although we still adopt the Jelinek-Mercer smoothing formula in our experiment, any smoothing method employed in language models for information retrieval can be used to estimate the value of p(c_i | D_k). Formula (6) can then be decomposed as follows:

P(A | S_{j,k}, Q) = Π_{i=1..n} p(c_i | S_{j,k}) · p(c_i | D_k)

             rank= Σ_{i=1..n} [ log p(c_i | S_{j,k}) + log p(c_i | D_k) ]    (7)

where p(c_i | S_{j,k}) can be obtained from equation (3), and p(c_i | D_k) is estimated as follows:

p(c_i | D_k) = (1 - λ1) · DCF_i / len(D_k) + λ1 · p(c_i | C)    (8)

where p(c_i | C) is estimated using maximum likelihood estimation. Formula (7) explicitly adds the document context as a major component; we call this method the contextual concept language model.
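A minimal Python sketch of the CCLM score of formula (7), with the document component smoothed against the collection as in formula (8); all frequencies, lengths, and collection probabilities are assumed to be precomputed, and the λ defaults are illustrative only.

import math

# CCLM score of formula (7), with the document component smoothed against the
# whole collection as in formula (8). Frequencies, lengths, and the collection
# probabilities p(c_i | C) are assumed precomputed.

def cclm_score(scf, dcf, p_c_collection, sent_len, doc_len, lam=0.5, lam1=0.3):
    """Rank-equivalent form of (7): sum of log p(c_i|S_{j,k}) + log p(c_i|D_k)."""
    score = 0.0
    for scf_i, dcf_i, p_ci_C in zip(scf, dcf, p_c_collection):
        p_doc = (1.0 - lam1) * dcf_i / doc_len + lam1 * p_ci_C    # formula (8)
        p_sent = (1.0 - lam) * scf_i / sent_len + lam * p_doc     # formula (3)
        if p_sent <= 0 or p_doc <= 0:
            return float("-inf")                                  # concept never observed
        score += math.log(p_sent) + math.log(p_doc)
    return score

# Example with two question concepts.
print(cclm_score(scf=[2, 0], dcf=[7, 4], p_c_collection=[1e-4, 5e-5],
                 sent_len=20, doc_len=300))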

4. Evaluation

4.1 Experimental Setting

We evaluate our two methods on the data collection from TREC Genomics Track 2006. The full collection consists of 162,259 documents (about 12.3 GB when uncompressed) [4].

We conducted our experiments in two stages. First, we scanned and parsed each document into sentences and recorded each sentence's beginning position in the document and its length, based on character-level counting. While scanning a document, we calculated the value of p(c_i | D_k) for each concept c_i appearing in a given question, according to formula (8). In this stage we then scored each sentence with the contextual concept smoothing language model, according to formula (5), for each given λ.
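For illustration, the following Python sketch shows one way the first stage could split a document into sentences while recording character offsets and lengths; the naive regex splitter is an assumption, not the parser actually used in the experiments.

import re

# Split a document into sentences, recording each sentence's beginning character
# position and character length. The period/question/exclamation splitter is a
# simplification used only for this sketch.

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentences_with_offsets(doc_text: str):
    """Yield (start_offset, length, sentence_text) for each sentence."""
    start = 0
    for match in SENTENCE_END.finditer(doc_text):
        sent = doc_text[start:match.start()]
        yield start, len(sent), sent
        start = match.end()
    if start < len(doc_text):                      # trailing sentence
        sent = doc_text[start:]
        yield start, len(sent), sent

if __name__ == "__main__":
    doc = "IDE degrades insulin. IDE has also been linked to AD."
    for off, length, s in sentences_with_offsets(doc):
        print(off, length, s)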

We can rewrite formula (7) as follows:

P(A | S_{j,k}, Q) rank= Σ_{i=1..n} [ log p(c_i | S_{j,k}) + log p(c_i | D_k) ]

                      = Σ_{i=1..n} log p(c_i | S_{j,k}) + Σ_{i=1..n} log p(c_i | D_k)    (9)

According to (9), we can calculate the value of (7) by combining the score obtained in the first stage for each sentence with the score returned by the document retrieval system.


We can use any language model smoothing method to calculate the latter term of (9); in our experiment, only the Jelinek-Mercer method was used, as given in equation (8). We set λ1 to 0.3 in the experiment, and its influence on performance will be further inspected in a future study.
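The following Python sketch illustrates the combination implied by formula (9): the final score of a candidate sentence is its stage-one score plus the log-domain document score from the retrieval system. The score dictionaries are hypothetical stand-ins for the precomputed values.

# Combine stage-one sentence scores with document retrieval scores per formula (9).
# Both dictionaries below are hypothetical stand-ins for precomputed scores.

sentence_scores = {          # stage one: sum_i log p(c_i | S_{j,k})
    ("doc42", 3): -7.9,
    ("doc42", 9): -9.4,
    ("doc77", 1): -8.1,
}
document_scores = {          # document retrieval: sum_i log p(c_i | D_k)
    "doc42": -12.5,
    "doc77": -15.0,
}

combined = {
    (doc_id, sent_idx): s_score + document_scores[doc_id]
    for (doc_id, sent_idx), s_score in sentence_scores.items()
}

# Rank candidate sentences by the combined score, best first.
for key, score in sorted(combined.items(), key=lambda kv: kv[1], reverse=True):
    print(key, round(score, 2))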

4.2 Results

Following Genomics Track 2006, three performance measures were used to evaluate our methods: passage-level MAP, aspect-level MAP, and document-level MAP, all derived from mean average precision (MAP), which is commonly used to evaluate information retrieval.

For passage-level MAP, the precision of each relevant retrieved passage was computed as the number of its characters overlapping with the gold standard passages, provided by the organizers, divided by the total number of characters included in all submitted passages for the topic up to that point.

Aspect-level MAP was measured by the average precision over the aspects of each topic. To compute the average precision of each topic for a run, the ranked passages were mapped to two kinds of values: the aspect(s) of the gold standard passages overlapping with the submitted passage, or the value "not relevant". The precision for the retrieval of each aspect was computed as the fraction of relevant passages among the retrieved passages up to the current passage under consideration.

Document-level MAP is the same as the MAP commonly used to measure document retrieval: average precision is measured at each point of correct (relevant) recall for a topic, and document MAP is the mean of the average precisions over all topics.
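As a reference, the following Python sketch computes average precision for a topic and MAP over topics in the standard way described above; the example run data are made up.

# Document-level MAP: average precision at each relevant rank, averaged per topic,
# then averaged over topics. The example runs are invented for illustration.

def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision for a single topic."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, relevant_ids) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two made-up topics.
runs = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),
    (["d9", "d5", "d4"], {"d9"}),
]
print(mean_average_precision(runs))  # 0.75 for this made-up example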

We study the behavior of both methods against the setting of λ appearing in formulas (5) and (7). The setting of λ1 used in formula (7) for document retrieval is not inspected and is fixed at 0.3 in our experiment. Figures 2, 3, and 4 plot the three evaluation measures against λ ranging from 0 to 1.0 in steps of 0.1.

Figure 2. Passage MAP

Figure 3. Aspect MAP

Figure 4. Document MAP

The best results of our new methods are given in Table 1, compared with the mean values from Genomics Track 2006 on the three evaluation measures.

Table 1. Results of CCSLM and CCLM compared with the mean of Genomics Track 2006

Method       Passage MAP   Aspect MAP   Document MAP
CCSLM        0.0655        0.1290       0.2969
CCLM         0.1066        0.1801       0.3894
Mean@TREC    0.0392        0.1643       0.2887

Table 1 shows that both of our methods are better than the mean of Genomics Track 2006 with regard to Passage MAP and Document MAP, and CCLM beats the mean of Genomics Track 2006 by a large margin on all three evaluation measures.

Two reasons support the success of our methods: first, the concepts adopted in the methods, instead of the keywords of the questions, guarantee a high recall; second, the contexts of each underlying answer sentence boost the precision of the answers to each question. Different from the groups mentioned in Section 2, we make use of contextual concepts at two different granularities: one method is based on contextual concept smoothing, and the other explicitly incorporates the contextual concept into the language model.

5. Conclusion and Future Work


We have proposed two contextual concept methods within the language model framework for finding answer sentences or passages in the biomedical literature for questions posed by experts, and we have shown their effectiveness through experimental results.

We plan to further improve the performance of our approaches by adjusting the parameters used in the document retrieval model of formula (9). We will also employ other smoothing methods in CCLM to score documents, i.e., the contexts of sentences, and inspect their influence on performance. Automatically extracting concepts from the biomedical literature is also part of our future work.

6. References

[1] M.J. Schuemie, J.A. Kors, and B. Mons, "Word sense disambiguation in the biomedical domain: an overview", Journal of Computational Biology, 12(5):554-565, 2005.
[2] W. Perrizo, "The role of data mining in turning bio-data into bio-information", Bioinformation, Vol. 1, Biomedical Informatics Publishing Group, 2007, pp. 351-355.
[3] A.M. Cohen and W.R. Hersh, "A survey of current work in biomedical text mining", Briefings in Bioinformatics, vol. 6, no. 1, March 2005, pp. 57-71.
[4] W.R. Hersh, A.M. Cohen, and P. Roberts, "TREC 2006 Genomics Track overview", in Proceedings of the Fifteenth Text REtrieval Conference (TREC 2006), 2006.
[5] J. Urbain, N. Goharian, and O. Frieder, "IIT TREC-2006: Genomics Track", in Proceedings of the Fifteenth Text REtrieval Conference (TREC 2006), 2006.
[6] D. Trieschnigg, W. Kraaij, and M. Schuemie, "Concept based document retrieval for genomics literature", in Proceedings of the Fifteenth Text REtrieval Conference (TREC 2006), 2006.
[7] J. Lin and D. Demner-Fushman, "The role of knowledge in conceptual retrieval: a study in the domain of clinical medicine", in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006), Seattle, Washington, 2006, pp. 99-106.
[8] E.M. Voorhees, "Query expansion using lexical-semantic relations", in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 1994, pp. 61-69.
[9] S.B. Kim, H.C. Seo, and H.C. Rim, "Information retrieval using word senses: root sense tagging approach", in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, United Kingdom, 2004, pp. 258-265.

[10] V. Lavrenko and W.B. Croft, "Relevance-based language models", in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001, pp. 120-127.
[11] X. Zhou, X. Hu, X. Zhang, and X. Lin, "Context-sensitive semantic smoothing for the language modeling approach to genomic IR", in Proceedings of the 29th Annual International ACM SIGIR Conference (SIGIR 2006), Seattle, WA, USA, August 6-11, 2006, pp. 170-178.
[12] X. Zhou, X. Hu, X. Zhang, and X. Lin, "Concept-based indexing to improve language modeling approach to genomic IR", in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006, pp. 170-177.
[13] C. Zhai and J. Lafferty, "A study of smoothing methods for language models applied to ad hoc information retrieval", in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001, pp. 334-342.
[14] J.M. Ponte and W.B. Croft, "A language modeling approach to information retrieval", in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998, pp. 275-281.
[15] W. Zhou and C.T. Yu, "A concept-based framework for passage retrieval in genomics", in Proceedings of the Fifteenth Text REtrieval Conference (TREC 2006), 2006.
