4
Minireview A text-mining perspective on the requirements for electronically annotated abstracts Florian Leitner, Alfonso Valencia * Structural Computational Biology Group, Spanish National Cancer Research Centre (CNIO), Madrid, Spain Available online 6 March 2008 Abstract We propose that the combination of human expertise and automatic text-mining systems can be used to create a first generation of electronically annotated information (EAI) that can be added to journal abstracts and that is directly related to the information in the corresponding text. The first experiments have concentrated on the annotation of gene/protein names and those of organisms, as these are the best resolved problems. A second generation of systems could then attempt to address the problems of annotating protein interactions and protein/gene functions, a more difficult task for text-mining systems. EAI will permit easier categorization of this information, it will help in the evaluation of papers for their curation in databases, and it will be invaluable for maintaining the links between the information in databases and the facts described in text. Additionally, it will contribute to the efforts towards completing database informa- tion and creating collections of annotated text that can be used to train new generations of text-mining systems. The recent introduction of the first meta-server for the annotation of biolog- ical text, with the possibility of collecting annotations from avail- able text-mining systems, adds credibility to the technical feasibility of this proposal. Ó 2008 Published by Elsevier B.V. on behalf of the Federation of European Biochemical Societies. Keywords: Information extraction; Article annotation; Text mining; Journal annotation pipeline; Review; Perspectives; Electronically annotated information 1. Introduction The continuous growth in the number of research articles and the corresponding data stored in online repositories re- quire better connections to be established between scientific articles, annotations and data [1]. Coherent annotation will en- hance the possibility of locating and comparing articles, while database links will permit the unambiguous recovery of infor- mation [2]. Depending on an articleÕs content and type, such annotations may include the names of genes or proteins dis- cussed in the article and the database identifiers where the se- quence data are stored. These annotations may also contain indicators of concept ontology for diseases, molecular func- tions, cellular locations or experimental methods [3]. They may even cover structured annotations of figures and tables [4]. Although it would be desirable for these annotations to be exclusively made automatically through natural language understanding (NLU), the intrinsic difficulty of natural lan- guage makes automation very difficult [5] and currently, only semi-automated approaches can realistically be considered in practice [6]. This review presents the state of the art in informa- tion extraction (IE) systems that could be used to annotate bio- logical text and to build annotated abstracts (for current reviews on IE tools, see [7–9]). The potential requirements for such annotation process will be examined, highlighting possible approaches for the annotation of articles and the cre- ation of summaries (electronically annotated summaries). 2. Current state of the art and opinions Automated extraction of classified content (IE) is an impor- tant area of biological text mining [10]. The status of current IE systems is currently being evaluated in the light of the Bio- Creative challenge. BioCreative (critical assessment for infor- mation extraction in biology) systematically assesses the systems currently available with the help of experts in special- ized databases [11,12]. The evaluation includes methods for the ranking of documents according to their biological relevance (document classification task), the identification of biological entities (i.e., protein and gene names, known as named entity recognition, NER), as well as the more complex detection of relationships between entities (i.e., protein interactions, protein and biological functions). The entity detection task can be divided into the identifica- tion of gene and protein name mentions in the text on the one hand, and into the assignment of unique database identi- fiers to the gene and protein names on the other (a process known as normalization). The BioCreative results show that the best available NER systems are able to correctly detect al- most 90% of the names (with 88% precision and 86% recall for gene mention detection, and 83% precision and 79% recall for the normalization task) [13,14]. However, it can be assumed that the identification of other types of entities, such as cell types, chemicals, diseases, and others will present additional difficulties. The results on the more complex second task of associating genes/proteins to gene ontology terms analyzed in BioCreative I [15], or of detecting protein interactions analyzed in BioCre- ative II [16], were far less positive. These results clearly demonstrated the need of intensive human intervention to com- plement the automated results, especially when thinking about Abbreviations: BCMS, BioCreative MetaServer; EAI, electronically annotated information; IE, information extraction; NER, named entity recognition; NLP, natural language processing; NLU, natural language understanding * Corresponding author. E-mail address: [email protected] (A. Valencia). 0014-5793/$34.00 Ó 2008 Published by Elsevier B.V. on behalf of the Federation of European Biochemical Societies. doi:10.1016/j.febslet.2008.02.072 FEBS Letters 582 (2008) 1178–1181

A text-mining perspective on the requirements for electronically annotated abstracts

Embed Size (px)

Citation preview

Page 1: A text-mining perspective on the requirements for electronically annotated abstracts

Aael

*

E

FEBS Letters 582 (2008) 1178–1181

Minireview

A text-mining perspective on the requirements for electronicallyannotated abstracts

Florian Leitner, Alfonso Valencia*

Structural Computational Biology Group, Spanish National Cancer Research Centre (CNIO), Madrid, Spain

Available online 6 March 2008

Abstract We propose that the combination of human expertiseand automatic text-mining systems can be used to create a firstgeneration of electronically annotated information (EAI) thatcan be added to journal abstracts and that is directly related tothe information in the corresponding text. The first experimentshave concentrated on the annotation of gene/protein names andthose of organisms, as these are the best resolved problems. Asecond generation of systems could then attempt to address theproblems of annotating protein interactions and protein/genefunctions, a more difficult task for text-mining systems. EAI willpermit easier categorization of this information, it will help in theevaluation of papers for their curation in databases, and it will beinvaluable for maintaining the links between the information indatabases and the facts described in text. Additionally, it willcontribute to the efforts towards completing database informa-tion and creating collections of annotated text that can be usedto train new generations of text-mining systems. The recentintroduction of the first meta-server for the annotation of biolog-ical text, with the possibility of collecting annotations from avail-able text-mining systems, adds credibility to the technicalfeasibility of this proposal.� 2008 Published by Elsevier B.V. on behalf of the Federation ofEuropean Biochemical Societies.

Keywords: Information extraction; Article annotation; Textmining; Journal annotation pipeline; Review; Perspectives;Electronically annotated information

1. Introduction

The continuous growth in the number of research articles

and the corresponding data stored in online repositories re-

quire better connections to be established between scientific

articles, annotations and data [1]. Coherent annotation will en-

hance the possibility of locating and comparing articles, while

database links will permit the unambiguous recovery of infor-

mation [2]. Depending on an article�s content and type, such

annotations may include the names of genes or proteins dis-

cussed in the article and the database identifiers where the se-

quence data are stored. These annotations may also contain

indicators of concept ontology for diseases, molecular func-

bbreviations: BCMS, BioCreative MetaServer; EAI, electronicallynnotated information; IE, information extraction; NER, namedntity recognition; NLP, natural language processing; NLU, naturaanguage understanding

Corresponding author.-mail address: [email protected] (A. Valencia).

0014-5793/$34.00 � 2008 Published by Elsevier B.V. on behalf of t

doi:10.1016/j.febslet.2008.02.072

l

he Feder

tions, cellular locations or experimental methods [3]. They may

even cover structured annotations of figures and tables [4].

Although it would be desirable for these annotations to be

exclusively made automatically through natural language

understanding (NLU), the intrinsic difficulty of natural lan-

guage makes automation very difficult [5] and currently, only

semi-automated approaches can realistically be considered in

practice [6]. This review presents the state of the art in informa-

tion extraction (IE) systems that could be used to annotate bio-

logical text and to build annotated abstracts (for current

reviews on IE tools, see [7–9]). The potential requirements

for such annotation process will be examined, highlighting

possible approaches for the annotation of articles and the cre-

ation of summaries (electronically annotated summaries).

2. Current state of the art and opinions

Automated extraction of classified content (IE) is an impor-

tant area of biological text mining [10]. The status of current

IE systems is currently being evaluated in the light of the Bio-

Creative challenge. BioCreative (critical assessment for infor-

mation extraction in biology) systematically assesses the

systems currently available with the help of experts in special-

ized databases [11,12]. The evaluation includes methods for the

ranking of documents according to their biological relevance

(document classification task), the identification of biological

entities (i.e., protein and gene names, known as named entity

recognition, NER), as well as the more complex detection of

relationships between entities (i.e., protein interactions, protein

and biological functions).

The entity detection task can be divided into the identifica-

tion of gene and protein name mentions in the text on the

one hand, and into the assignment of unique database identi-

fiers to the gene and protein names on the other (a process

known as normalization). The BioCreative results show that

the best available NER systems are able to correctly detect al-

most 90% of the names (with 88% precision and 86% recall for

gene mention detection, and 83% precision and 79% recall for

the normalization task) [13,14]. However, it can be assumed

that the identification of other types of entities, such as cell

types, chemicals, diseases, and others will present additional

difficulties.

The results on the more complex second task of associating

genes/proteins to gene ontology terms analyzed in BioCreative

I [15], or of detecting protein interactions analyzed in BioCre-

ative II [16], were far less positive. These results clearly

demonstrated the need of intensive human intervention to com-

plement the automated results, especially when thinking about

ation of European Biochemical Societies.

Page 2: A text-mining perspective on the requirements for electronically annotated abstracts

F. Leitner, A. Valencia / FEBS Letters 582 (2008) 1178–1181 1179

real-world applications. Generally, relationship extraction

requires linking entities with their associated concepts, for

example, linking a gene name to a disease while annotating

the mutations associated, or coupling two interacting proteins

with a specific type of interaction [17,18]. Part of the difficulty

in this process is related to the initial complexity of the recogni-

tion and normalization of names. This problem includes both

the identification of the protein names and the corresponding

species, and the additional problem of the semantic identifica-

tion of the relationships [19]. It is worth mentioning that even

human experts often disagree about the detection and classifi-

cation of entity relationships, as revealed through a variety of

inter-annotator agreement exercises [20]. Indeed, it is not

uncommon for database annotators to request additional

information from the authors after having studied a publica-

tion.

BioCreative has also shown that applying natural language

processing (NLP) techniques to full-text articles is in general

much more difficult than processing abstracts alone (as com-

monly done in text-mining publications). For example, a prob-

lem in full-text articles may lie in the significance of the

information in different parts of the text, which becomes a rel-

evant and additional burden.

In summary, while the first two IE objectives (document

classification and NER) are problems for which current IE sys-

tems already provide sufficiently robust online solutions, the

extraction of relationships is still a task for which IE tools re-

quire additional development [21].

BioCreative and other related experiments have also consis-

tently identified the demand for creating sufficiently large col-

lections of annotated articles (termed corpora) that can be

used to train and test text-mining systems [22]. Indeed, all

the IE and most of the NLP methods are based on learning

features from previously annotated text, even though an inter-

esting number of methods based on training with semi-anno-

tated text have been proposed. Pioneering efforts to create

such corpora were started by the GENIA initiative [23], but

they also include the BioCreative collections as well as others

[24]. Nevertheless, insufficiently large collections are currently

a key limitation that undermines the development of sophisti-

cated text-mining strategies. This makes biology different from

other fields where these collections are readily available.

A major issue in biological text mining and clearly a key-

stone for future developments is the need to generate and

use consistent annotations. These must follow common stan-

dards in well-structured representations, and very importantly

they must be linked to the corresponding text sources. Interest-

ingly, and for completely other reasons discussed elsewhere

[25], the traditional separation between database records and

journal entries is vanishing. The sum of these developments

is the generation of an environment in which the publication

of papers is becoming more directly related with the deposition

of the basic information in databases. Hence, the need for the

annotation of the corresponding text with basic links to the

databases, and to combine automatic annotation with human

expert (and if possible, author made) annotations. Possible sce-

narios that have been discussed include (see [26–28]):

– Allowing authors to freely decide on the annotations. In this

case, it will difficult to maintain consistency that would lar-

gely restrict the possible application of automatic IE and

database-related tools.

– Letting authors choose concepts from collections of con-

trolled vocabularies (e.g., gene ontology (GO) terms). This

generates the obvious difficulty of obtaining consistent

annotations from the authors not necessarily aware of

how to use the ontology. Moreover, it should be born in

mind that training a GO annotator requires several months

of work in close collaboration with experts.

– Semi-automated systems that would pre-filter possibly rele-

vant terms from collections of controlled vocabulary and offer

the results to human experts for validation. This type of sys-

tem can bridge the need between annotation, consistency

and annotator expertise.

An important additional issue is the technical feasibility of

annotating text and the accessibility of the systems/software to

achieve this. To date, there is no single stable/open system for

the full annotation of papers, and most of the current publica-

tions refer to complex systems combining sets of text-mining

tools developed internally by various research groups. To ad-

dress this situation, BioCreative II has developed a meta-server

able to collect and unify results from distributed systems work-

ing in the classification and annotation of MEDLINE abstracts

[29]. The BioCreative MetaServer (BCMS) is an online platform

using web services to collate annotations, including gene/protein

names and normalizations, and document classifications of pro-

tein interactions and taxa. In the current implementation, results

from 12 different systems are included in the meta-server. While

a complete article annotation system would have to go beyond

this functionality, BCMS can be viewed as a first demonstration

of the viability of such a pipeline and/or as an initial prototype.

3. Towards electronically annotated articles

To design a system for semi-automated article annotation

the potential scenarios in which they will be used must be ta-

ken into account. The most prominent situations, that are in-

deed associated one with the other, are

1. The journals that wish to provide their users with access to

their articles through a structured and hierarchical interface,

sub-categorizing the publications based on the annotations.

2. Database annotators that wish to easily retrieve articles rel-

evant to their data repository, the annotations associated

greatly reducing time consuming tasks such as normalizing

gene and protein names, mapping controlled vocabulary,

and maintaining existing records.

3. Biological text mining could concentrate on the more chal-

lenging tasks of uncovering complex and implicit informa-

tion, using the curated annotations as their starting point.

4. Most importantly, providing researchers with direct unam-

biguous access to raw data sources and the capacity to re-

trieve relevant articles with high precision and recall. For

example, to find specific methods and techniques, or to mon-

itor current progress in their field of interest. The annotations,

which can be regarded as summaries of an article, will permit

the rapid assessment of publication quality and relevance.

4. Proposal for the requirements of an interactive electronic

annotation system

Taking into consideration the possible scenarios of usage

and the basic categories of annotation types described above,

our proposal for such a system would be (see Fig. 1)

Page 3: A text-mining perspective on the requirements for electronically annotated abstracts

Fig. 1. Journal Annotation Pipeline. The flowchart has four main areas (f.l.t.r): article submission, electronic annotation, manual review, and datastorage. Annotation generation refers to Steps I and III in the text (NER, relationship extraction); Annotation review refers to Steps II and IV. Assyntax and semantics of the text can greatly influence NLP, additional rounds of Annotation generation and review can occur when the authordecides to change [parts of] the text to improve automatic annotation results. Reviewers and the publisher decide if the EAI is complete.

1180 F. Leitner, A. Valencia / FEBS Letters 582 (2008) 1178–1181

� Step I: During the manuscript submission process auto-

matic, web-based IE systems would examine the text, iden-

tifying and highlighting where genes and proteins are

mentioned, and normalizing these mentions (suggesting

links to core database identifiers along with the confidence

associated with such associations).

� Step II: The authors will then review the NER and docu-

ment classification results, choosing the correct matches or

providing new more accurate ones. To facilitate this pro-

cess, the verification of results must be straightforward

(i.e., presenting the author with definitions of the terms

or the main DB record content). It is particularly impor-

tant to note that authors will still have to verify that no

important mentions and/or normalizations have been

missed by the IE systems. It is also interesting to consider

that the process of interaction with the automatic system

will ultimately influence the way in which papers are writ-

ten.

� Step III: In a second round, the systems will use the curated

entity annotations to mine for relations between them, add-

ing as much metadata as possible (e.g., mutations, interac-

tion surfaces, and special genomic backgrounds). This two

step approach would eliminate the problem of detecting

the correct entities prior to the detection of the interactions

between entities. This will also make it easy for the authors

to fill in the missing blanks by using templates and the for-

merly curated entities.

� Step IV: An abstract will finally be submitted to the journal

including an additional section with the ‘‘electronically

annotated information (EAI)’’ and the corresponding data-

base identifiers (i.e. protein or DNA sequence databases

identifiers) automatically associated. The EAI can then be

included in MEDLINE to facilitate its future use, and used

as labels in the XML/HTML/PDF versions of the text to

facilitate the training of text-mining systems. Finally, the

EAI can be deposited within databases (e.g. interaction dat-

abases) together with the explicit mention of the paper and

authors.

The performance of text-mining systems is usually measured

by its precision (the percentage of correct predictions) and re-

call (the percentage of recognized entities). Although the most

desirable system will have excellent scores in both measures,

certain environments make one more preferable than the

other. Fine tuning of the IE systems with regards these two

measures will be important for the evolution of the prototype.

Perhaps, in the pilot phase it would be reasonable to concen-

trate on precision rather than recall to ensure user satisfaction.

However, in the long run full curation of names and relations

must be obtained. To further ensure the high quality standards

necessary, it will be important to count on the help of referees

and editors to check the consistency and extension of the anno-

tations.

5. Possible benefits

The benefits of such semi-automated system are obvious for

several communities, including biologists, database annotators

and text miners. However, before embracing the task of anno-

tating papers, it is necessary to establish a realistic picture of

what is technically possible. Since the performance of auto-

matic systems does not exceed that of human experts, IE sys-

tems are unlikely to surpass inter-annotator agreement and

certainly, both will never be perfect. To enhance the annota-

tion process over time, adaptive systems based on relevant

feedback from users [30], and taking former annotations into

consideration (‘‘collective intelligence’’ [31]), could provide

interesting alternative solutions.

Despite these problems, the advantages of semi-automated

annotations of journal articles are manifold:

Page 4: A text-mining perspective on the requirements for electronically annotated abstracts

F. Leitner, A. Valencia / FEBS Letters 582 (2008) 1178–1181 1181

� The annotators (authors) will not require extensive training

in annotation.

� Database and publication data will be easier to retrieve and

crosslink.

� Journals and their readers can better classify articles (mak-

ing journals with annotations more attractive over those

not providing such a service).

� It will allow databases to vastly extend their records be-

cause obtaining and extracting the relevant information will

be greatly accelerated.

� New perspectives in text mining will arise given that the AI-

hard disambiguation problem will largely be reduced, while

providing new collections of annotated text that can be

used to train systems.

� Biologists will obtain organized information with the basic

and fundamental facts highlighted, get improved and more

populated databases, and ultimately better text-mining sys-

tems could be used.

Although no currently existing text-mining application can

resolve all the issues at hand, the recent BCMS prototype for

creating annotations to PubMed/MEDLINE abstracts may

serve as a working model. The results from the second BioCre-

ative challenge have also shown that combined qualifications

from all the IE systems usually outperform any single system

[32,33]. This might prove to be a promising breeding ground

and good experience to build journal article annotation systems.

6. Conclusions

We propose a structure that is a minimal solution for the

annotation of biological papers that can be implemented

now, and that can be gradually extended over time as expertise

and feedback grows. Current IE systems perform sufficiently

well for the mentions of protein/gene names, and on normali-

zation detection, as well as in document classification. To-

gether, this may be sufficient functionality for a first

prototype and constitutes a feasible goal.

The direct next step would be to provide functional and

interaction annotations, based on the results of human-re-

viewed entity annotations. The experience of BioCreative indi-

cates the human intervention is necessary and points to the

feasibility of integrating different tools and methods to im-

prove performance and reliability.

The semi-automated annotation pipeline can be enhanced

over time by adding new text-mining methods, and by learning

from the feedback and collective intelligence techniques, possi-

bly trained with the annotated text produced during the initial

experiments.

Acknowledgements: The authors would like to thank Martin Krallingerfor insightful discussions. This work was supported by the ENFIN

NoE (LSHG-CT-2005-518254) and the Spanish National Bioinformat-ics Institute (www.inab.org), a platform of Genoma Espana (www.gen-es.org).

References

[1] van Mulligen, E. et al. (2004) Medinfo 11, 94–98.[2] Valencia, A. (2002) EMBO Rep. 3, 396–400.[3] Spasic, I., Ananiadou, S., McNaught, J. and Kumar, A. (2005)

Brief Bioinform. 6, 239–251.[4] Shatkay, H., Chen, N. and Blostein, D. (2006) Bioinformatics 22,

e446–e453.[5] Engels, R. and Bremdal, B. (2000) On-To-Knowledge deliverable

D-5. CognIT.[6] Alex, B. et al. (2008) in: Pacific Symposium on Biocomputing

Pacific Symposium on Biocomputing, pp. 556–567.[7] Erhardt, R.A.-A., Schneider, R. and Blaschke, C. (2006) Drug

Discov. Today 11, 315–325.[8] Krallinger, M., Erhardt, A. and Valencia, A. (2005) Drug Discov.

Today 10, 439–445.[9] Zweigenbaum, P., Demner-Fushman, D., Yu, H. and Cohen,

K.B. (2007) Brief Bioinform. 8, 358–375.[10] Blaschke, C., Hirschman, L. and Valencia, A. (2002) Brief

Bioinform. 3, 154–165.[11] Hirschman, L., Yeh, A., Blaschke, C. and Valencia, A. (2005)

BMC Bioinform. 6 (Suppl. 1), S1.[12] Krallinger, M., Morgan, A., Smith, L., Leitner, F., Tanabe, L.,

Wilbur, J., Hirschman, L. and Valencia, A. (2008) Genome Biol.BioCreative II Suppl..

[13] Morgan, A.A. et al. (2008) Genome Biol. BioCreative II Suppl..[14] Smith, L. et al. (2008) Genome Biol. BioCreative II Suppl..[15] Blaschke, C., Leon, E.A., Krallinger, M. and Valencia, A. (2005)

BMC Bioinform. 6 (Suppl. 1), S16.[16] Krallinger, M., Leitner, F., Rodriguez-Penagos, C. and Valencia,

A. (2008) Genome Biol. BioCreative II Suppl..[17] Jensen, L.J., Saric, J. and Bork, P. (2006) Nat. Rev. Genet. 7, 119–

129.[18] Orchard, S. et al. (2007) Nat. Biotechnol. 25, 894–898.[19] Krauthammer, M. and Nenadic, G. (2004) J. Biomed. Inform. 37,

512–526.[20] Rebholz-Schuhmann, D., Kirsch, H. and Couto, F. (2005) PLoS

Biol. 3, e65.[21] Jose, H., Vadivukarasi, T. and Devakumar, J. (2007) EURASIP

J. Bioinform. Syst. Biol., 2007.[22] Hersh, W. (2005) Brief Bioinform. 6, 344–356.[23] Kim, J.D., Ohta, T., Tateisi, Y. and Tsujii, J. (2003) Bioinfor-

matics 19 (Suppl. 1), i180–i182.[24] Blaschke, C., Yeh, A., Camon, E., Colosimo, M., Apweiler, R.,

Hirschman, L. and Valencia, A. (2005) Bioinformatics 21, 4199–4200.

[25] Bourne, P. (2005) PLoS Comput. Biol. 1, 179–181.[26] Gerstein, M., Seringhaus, M. and Fields, S. (2007) Nature.[27] Hahn, U., Wermter, J., Blasczyk, R. and Horn, P.A. (2007)

Nature.[28] Seringhaus, M.R. and Gerstein, M.B. (2007) BMC Bioinform. 8, 17.[29] Leitner, F. et al. (2008) Genome Biol. BioCreative II Suppl..[30] Salton, G. and Buckley, C. (1998) J. Am. Soc. Inform. Sci. 41,

288–297.[31] Smith, J.B. (1994) CRC.[32] Breiman, L. (1996) Mach. Learn. 24, 123–140.[33] Wilbur, J., Smith, L. and Tanabe, L. (2007) In: Proceedings of the

Second BioCreative Challenge Evaluation Workshop, pp. 7–16.