Chapter 1. Introduction

The Exponential Growth of Biomedical Research Data

The current capabilities of our biomedical research enterprise, exemplified by the

completion of the Human Genome Project, enable researchers to quickly and routinely

survey the contents of entire molecular and cellular systems. This capability is generating

a revolution in biomedical research in various profound ways. One significant change is

the availability of staggering amounts of genomic and functional genomic data gathered

at a whole genome or whole cell scale. As a result of such tremendous technological

breakthroughs, the challenge for biomedical research is being shifted from experimental

data generation to the organization, curation and interpretation of these data (Lander ES

et al, 2001; Meldrum D et al, 2000).

Biomedical research literature can be considered to be a knowledgebase that

comprises the most complete status of our research enterprise. Reflecting the geometric

growth of available experimental data, the publication rate in biomedicine is also

increasing exponentially. There are currently more than 17 million biomedical articles

already represented in the National Library of Medicine’s biomedical literature database

MEDLINE, including more than 3 million articles published within the last 5 years alone and

2,000 per day in 2006 (Hunter L et al, 2006; MEDLINE). Keeping abreast of this large

and ever-expanding body of information is increasingly daunting for researchers who need

to track and utilize what is relevant to their interests, especially for new investigators. For

example, neuroblastoma is a common pediatric tumor but is considered

to be quite rare overall, with approximately 600 new cases diagnosed in the US each

year. However, there are almost 25,000 research articles describing neuroblastoma,

making it virtually impossible for a new investigator to systematically assess historical

research on this topic.

Furthermore, researchers increasingly need to engage with

research fields outside their core competence. The commonly used PubMed system,

which provides a convenient query interface for MEDLINE, provides keyword search

and some concept mapping for researchers to narrow down the information they are

looking for (PubMed). However, it lacks the precision (positive predictive

value), recall (sensitivity), granularity, and relevance ranking capabilities that many

typical but complex research queries require. One of the most common demands that

general-purpose systems such as PubMed fail to satisfy is the ability to extract and

compile specific knowledge or facts out of literature records. For example, there is no

provision in PubMed-like systems to determine which genes have been studied thus far in

relation to a certain type of malignancy, other than to read through the set of articles

identified by PubMed using keywords defining the concepts “gene” and “cancer” (or the

type of cancer of interest), and then identifying the particular genes one article at a time.

As the literature grows exponentially, this process will become not only more time

consuming but also less reliable at retrieving the right articles. Consequently, the gap

between what is recognized and what is currently known is widening (Wren JD et al,

2004). Biomedical text mining techniques can help researchers meet this challenge by

developing automated systems to extract the relevant information out of the text and

organize it into a structured knowledgebase.

Data Integration Opportunities in Cancer Research

The general challenge of biomedical literature knowledge extraction is

compounded in cancer research, where there is an acute need to more systematically identify

linkages between genomic data and malignant phenotypes. Characterization of the

molecular aberrations responsible for the onset and progression of malignancy is a major

goal for cancer researchers, and genomic components of the aberrations, ranging from

base pair variance to chromosome deletion, are crucial determinants in this regard.

Despite the existence of some locus-, mutation- and disease-specific resources, there is

currently no central cancer knowledge database in the public domain integrating genomic

findings with phenotypic observations of tumors (Cairns J et al, 2000; Freimer N et al,

2003). While high-throughput screening efforts increasingly allow researchers to identify

genome-wide mutational profiles for specific tumors, this information is largely diffusely

distributed and is mostly catalogued in a semi-structured manner throughout the

biomedical literature. Such decentralization is holding back the efforts towards making

rapid and comprehensive inferences of the genomic basis of malignancy onset and

progression in a manner that incorporates cumulative knowledge. Ideally, researchers and

clinicians would likely benefit from a comprehensive cancer knowledgebase that

consolidates experimental work (genome-level investigation), clinical observations

(descriptions of phenotype) and patient outcome (efficacy of treatment). Because the

biomedical literature represents a large proportion of this information, which is both

critically reviewed and eventually objective in its presentation of cancer research

information, means for more adequately extracting, normalizing and relating such diverse

collections of information in literature are crucial to solving this data integration problem

in cancer research.

Named Entity Recognition

Text mining technology has been increasingly

applied in biomedical research to help meet the above-mentioned challenges.

There have been significant efforts from both computational linguists and

bioinformaticists within the past 5 years to develop automated biomedical text mining

(BTM) systems (Jensen LJ et al, 2006). BTM tasks include named entity recognition

(NER), information extraction (IE), document retrieval (DR), and literature-based

discovery (LBD). NER, which serves as the basis for most other BTM undertakings, is

the process of identifying mentions of biomedical entities (objects, such as genes and

diseases) in the text. Named entity recognition can at first appear deceptively straightforward,

but it has emerged as a challenging and considerable task in BTM research. NER

begins with the classification and definition of biomedical entities, which can easily

consume a tremendous amount of effort because biomedical entities are complex and

lack standard nomenclature.

The process of identifying references to biomedical objects in text is usually split

into two steps: the identification of mentions of specific entity instances in text, such as

“the p53 gene” or “acute lymphoblastic leukemia”; and the assignment of these mentions

to a standard referent (normalization), such as classifying “the p53 gene” as a mention of

the official gene symbol “TP53”, or “ALL” as “acute lymphoblastic leukemia”. Many

biomedical entities either lack controlled vocabularies that can act as sufficient

nomenclature standards, or the instances in text are not expressed with the standards due

to historical reasons. Therefore, normalization is absolutely necessary for equating entity

values as appropriate, or placing values into a hierarchical or ontological framework (e.g.,

“ALL” as a form of “leukemia”). Much BTM research to date has focused upon molecular

entities that tend to be more discretely definable, such as genes and protein-protein

interactions, than phenotypic entities, which are harder to classify semantically

(BioCreAtIvE; McDonald R et al, 2005; Settles BA 2005; Zhou G et al, 2005).

NER methods include both rule-based and machine-learning approaches. Rule-

based approaches use sets of “rules”, alone or in combination, that pre-state signature

grammatical and especially character and word-based patterns within a string of text

being considered, and then return Boolean values as an output. For example, a rule to

identify a gene name could be “This word is a gene if it contains the consecutive letters

‘KIAA’, all of which are capitalized”. There can be some allowance for lexical

variations, such as capitalization, stemming, or punctuation, and some or all rules might

compare the text being considered to a term list, such as a pre-compiled list of known

tumor types. However, this approach cannot rely on the completeness of

the dictionary-type list in terms of either depth (the completeness of the set of unique entity

identifiers) or breadth (the completeness of the synonyms for each unique identifier),

because for most biomedical entities the term lists are always changing and never

complete. For complexly formulated text, rule-based approaches typically require

considerable thought and exquisite biological knowledge. Advantages of this approach

are relatively high precision without the requirement for generating extensive training

material. However, disadvantages include high false negative rates, a performance

plateau that is increasingly difficult to overcome, and, for complex and heterogeneous

text, a tendency to generate low recall. Most first-generation systems and many domain-

focused current systems utilize rule-based approaches; when coupled with a term list, this

approach accomplishes both steps of the overall NER task at one time. However, rule-

based systems have enjoyed only modest success for biomedical applications, likely

because their performances have plateaued below rates acceptable for wide use by

researchers, or their application domains have been overly narrow (Hanisch D et al,

2005; Fundel K et al, 2005; Chang JT et al, 2004; Finkel J et al, 2005).
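
As a minimal sketch of the rule-based idea described above, the following Python combines the “KIAA” pattern rule with a term-list lookup. The function names and the tiny term list are illustrative assumptions and do not reproduce the rules of any published system.

```python
# Hypothetical, deliberately tiny term list; real lists are never complete.
TUMOR_TERMS = {"neuroblastoma", "acute lymphoblastic leukemia"}


def is_gene_by_rule(token: str) -> bool:
    """Pattern rule from the text: the token contains the capitalized
    consecutive letters 'KIAA'. Returns a Boolean, as rules typically do."""
    return "KIAA" in token  # case-sensitive containment check


def find_tumor_mentions(text: str) -> list[str]:
    """Term-list rule: case-insensitive lookup of known tumor types."""
    lowered = text.lower()
    return sorted(term for term in TUMOR_TERMS if term in lowered)
```

Because the lookup only finds terms that are already on the list, any synonym missing from the list is silently skipped, which is one way such systems end up with low recall.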

Given the limitations of rule-based systems, a number of machine-learning

algorithms have been applied to improve the first step of the NER task. Generally, these

algorithms consider and then define sets of features within and surrounding entity

mentions that co-associate with the mentions. These can include orthographic features of

the text (e.g., suffixes, particular sequential combinations of characters or words,

capitalization patterns, etc.) and domain-specific features (e.g., term lists). For example,

the suffix “-ase” usually indicates a protein name, and the noun phrase immediately

preceding the word “gene” is often a gene name. Machine-learning approaches have

several advantages: at their purest, they require no domain knowledge; they can consider

thousands or millions of features simultaneously; they can provide confidence scores for

predictions; and they can consider the entire feature space simultaneously. However, the

success of machine-learning approaches is dependent upon two critical and costly factors.

First, ML systems depend on the availability, quality, and representativeness of a set of

manually generated training material from which to “learn” features, a process that

requires considerable effort and does not generalize effectively. Second, the most

effective systems incorporate biological knowledge—either in the form of domain-

specific rules or definition of features that are domain-specific (such as specialized

lexicons)—that are likewise costly to implement (McDonald R et al, 2004; Collier N et al,

2000; Tanabe L et al, 2002).
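
The kinds of features mentioned above can be sketched as a small feature-extraction function. This is an illustrative assumption about how such features might be encoded; the feature names are hypothetical, and a real system would pass many more features to a sequence model.

```python
def token_features(tokens: list[str], i: int) -> dict[str, object]:
    """Orthographic and contextual features for tokens[i]."""
    word = tokens[i]
    next_word = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),
        # The suffix "-ase" often signals a protein name.
        "ends_in_ase": word.lower().endswith("ase"),
        "has_digit": any(ch.isdigit() for ch in word),
        "is_all_caps": word.isupper(),
        # A token immediately preceding "gene" is often a gene name.
        "next_is_gene": next_word.lower() == "gene",
    }
```

For the token “p53” in “the p53 gene”, the function reports a digit and a following “gene”, two weak signals that a learner could weight together with many others.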

It is critical that humans first establish gold-standard examples from which machines

can learn. To reduce annotation ambiguity and disagreement, it is

crucial to define the target biomedical entities explicitly. Currently, most NER

systems adopt some version of pre-established conceptual definitions, which annotators

may apply with very different standards. We have instead invested considerable

effort in an iterative annotation process to develop literature-based definitions that draw

both the conceptual and textual boundaries.

Step 2 (normalization) is syntactically easier, since the identification of

textual boundaries is not necessary. However, it poses significant semantic challenges,

because non-unique synonyms have to be disambiguated to determine the intended referent.

In addition, a comprehensive thesaurus-like dictionary is necessary in order to match the

raw entity mentions to their unique identifiers. Classification techniques, rule-based

systems, and pattern-matching algorithms have been utilized to solve this issue, and some

approaches also take the contextual information to disambiguate the synonyms (Chen L

et al, 2005).
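
A dictionary-based version of the normalization step can be sketched as follows. The synonym table is a hypothetical toy example, and “XYZ1” with its two identifiers is invented purely to show an ambiguous synonym.

```python
# Hypothetical synonym table: lowercased surface form -> candidate identifiers.
SYNONYMS = {
    "p53": ["TP53"],
    "tp53": ["TP53"],
    "all": ["acute lymphoblastic leukemia"],
    "acute lymphoblastic leukemia": ["acute lymphoblastic leukemia"],
    "xyz1": ["GENE_A", "GENE_B"],  # made-up ambiguous symbol
}


def normalize(mention: str) -> list[str]:
    """Map a raw mention to its candidate standard referents."""
    # Strip simple framing words such as "the ... gene" before lookup.
    key = mention.lower().removeprefix("the ").removesuffix(" gene").strip()
    return SYNONYMS.get(key, [])
```

When the returned list has more than one candidate, a lookup alone cannot decide, and contextual information of the kind cited above would be needed to choose among them.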

Information Extraction

Ideally, BTM systems extract and synthesize “facts” out of the literature that

combine entity mentions with relationships between and among the mentions established

in the literature. This work depends on NER results; that is, the relationships between the

entities can only be extracted once the individual entities have been identified. Although

biomedically oriented research in this area is not as advanced as NER, BTM researchers

have recently been increasing their efforts on these challenges.

A straightforward but powerful approach is co-occurrence. This approach

identifies the relationships between the involved biomedical entities based on their co-

occurrence in the articles, or by considering how close mentions are to each other within

a document. The assumption taken by the co-occurrence method is that if two (or more)

entity instances are co-mentioned in one single text record (or defined subset, such as a

sentence or a paragraph), these instances have some type of underlying biological

relationship. As it is possible that entity instances can coincidentally co-occur, systems

commonly use some parameters to rank the relationships, such as the frequency and

location of their co-occurrence. If two entity instances are repeatedly co-mentioned

in close proximity, it is likely that they are related. This approach tends to

perform with better recall but at the expense of precision because it has no intelligent

means for distinguishing specific from general relationships. For example, if the

information to be extracted is the causal relationship between gene A and disease

diagnostic labels, this approach will recognize relationships of any kind between gene A

and relevant diseases, including but not limited to direct or causal relationships. In order

to improve precision, some co-occurrence-based IE systems include additional

approaches, such as combining with a customized text-categorization system to

preferentially identify relevant articles or sentences. Co-occurrence-based IE systems are

usually used as exploratory tools making inferential calls since they can identify both

direct and indirect relationships between entity instances (Jenssen TK et al, 2001; Alako

BT et al, 2005).
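
The co-occurrence approach can be sketched with sentence-level counting. The entity names below are placeholders, and naive substring matching stands in for a real NER step; both are assumptions made only for illustration.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(sentences, genes, diseases):
    """Count how often each (gene, disease) pair shares a sentence."""
    counts = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        # Naive substring matching; a real system would use NER output.
        found_genes = [g for g in genes if g.lower() in lowered]
        found_diseases = [d for d in diseases if d.lower() in lowered]
        for pair in product(found_genes, found_diseases):
            counts[pair] += 1
    return counts
```

Calling `counts.most_common()` on the result ranks pairs by frequency, which is the simplest of the ranking parameters mentioned above. The counts say nothing about what kind of relationship links a pair, which is why precision suffers.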

Another approach is to take advantage of natural language processing (NLP)

methodology that combines syntactic and semantic analysis of text. In this approach,

individual tokens in text are often first identified and then assigned part-of-speech labels,

a process that has been automated with high accuracy. Then a nested tree-like

structure (either top-down or bottom-up) is developed in order to determine the

relationships between noun phrases or beyond, such as subject and object. After a

NER process is applied for assigning semantic labels to specific words and phrases, either

rule-based or machine-learning based processes can be used to extract relationships

between entity mentions. Although the syntactic parsing and the semantic labeling have

been carried out as separate steps by most NLP-based IE systems, results indicate that

better performance can be obtained by integrating the two steps, due in part to the often

complex relationships of biomedical entity mentions. This NLP-based approach can

achieve better precision, but lower recall, largely because of increased challenges in

identifying relationships across sentences. These approaches are also labor-intensive,

since they require either sophisticated expert-defined extraction rules or manually annotated training

corpora (Rzhetsky A et al, 2004; Daraselia N et al, 2004; Yakushiji A et al,

2001).

Although some research has addressed n-ary relationships among a

set of biomedical entities, most IE systems currently classify binary relationships between

same-type entities. These systems most commonly focus on entities and relationships that

are easier to define, such as protein-protein/gene-protein interactions, protein

phosphorylation, other specific relations between genomic entities such as cellular

localizations of proteins, or interactions between proteins and chemicals. Few IE

systems have yet been designed for relating phenotypic attributes, such as gene-disease

relationships (Temkin et al, 2003; McDonald R et al, 2005).

High-performance systems that can extract many types of relationships and also

distinguish among relationships beyond the sentence level are not yet achievable. This is

due largely to three contributing factors. First, biomedical text is complex and highly

variable in its structure and presentation. Second, many complicating factors need to be

considered, including co-reference (e.g., the use of pronouns), ambiguity in intent, and

variability in formulation. Finally, systems need to incorporate various approaches

simultaneously (e.g., tokenizers, POS taggers, NER systems, parsers, disambiguators),

each of which contributes some measure of error, and these errors compound to significantly degrade

the final output (Ding J et al, 2002).
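
The effect of chaining components can be illustrated with a short calculation. It rests on two simplifying assumptions that are not from the source: the stages err independently, and the final output is correct only when every stage is correct. The accuracy values are hypothetical.

```python
import math


def pipeline_accuracy(stage_accuracies):
    """Overall accuracy of a chain of stages, assuming independent errors
    and that one wrong stage makes the final output wrong."""
    return math.prod(stage_accuracies)


# Five hypothetical stages (tokenizer, POS tagger, NER, parser,
# disambiguator), each assumed to be 95% accurate on its own.
overall = pipeline_accuracy([0.95] * 5)
```

Under these assumptions, five stages at 95% each leave roughly 77% of outputs correct, so even strong individual components can yield a noticeably weaker whole.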

Document Retrieval

DR systems typically identify and rank documents pertaining to a certain topic

from a large collection of text. Topics of interest might be derived from user-supplied

search terms or from pre-selecting specified types of documents. Most DR systems

feature keyword search capabilities; advanced keyword searching allows users to input a

combination of search terms and/or to perform advanced functions, such as including

logical operations or imposing limits on terms. Systems then commonly retrieve

documents containing or excluding certain terms that match the search criteria. This

method often retrieves irrelevant articles, and relevance-ranking functions are often

absent or primitive. More sophisticated DR systems go beyond this by applying distance

metrics, such as a vector-space model. With this model, every document is represented as

a vector, which is determined by measuring text-based features and/or document

metadata, such as a list of frequency-based weighted terms identified in each document.

The query vector, which is determined by the relative importance of each query term, is

then compared to document vectors to relevance-rank the documents. Comparing

document vectors with one another can also yield a measure of document similarity. PubMed is a

well-known DR system that is highly adapted for use as a query interface for MEDLINE.

PubMed uses both keyword searching and a vector model (Glenisson P et al, 2003).
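
A vector-space ranking of the kind described above can be sketched in a few lines of standard-library Python. The TF-IDF weighting and cosine similarity used here are common textbook choices and are an assumption; this is not a description of how PubMed itself is implemented.

```python
import math
from collections import Counter


def build_idf(tokenized_docs):
    """Inverse document frequency for every term in the collection."""
    n = len(tokenized_docs)
    df = Counter(t for tokens in tokenized_docs for t in set(tokens))
    return {t: math.log(n / df[t]) + 1.0 for t in df}


def vectorize(tokens, idf):
    """Frequency-based weighted term vector (term -> weight)."""
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] for t in tf if t in idf}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return document indices ordered from most to least similar."""
    tokenized = [d.lower().split() for d in docs]
    idf = build_idf(tokenized)
    q = vectorize(query.lower().split(), idf)
    scores = [(cosine(q, vectorize(toks, idf)), i) for i, toks in enumerate(tokenized)]
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]
```

The same `cosine` function applied to two document vectors gives the document-to-document similarity mentioned in the text.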

Advanced DR systems integrate NER or other NLP methods in order to more

accurately assess document content and identify documents that contain certain

biomedical entity mentions. FABLE, MedMiner and Textpresso are examples of systems

that make retrieval decisions by extracting and considering knowledge from gene/protein

mentions in the documents (FABLE; Tanabe L et al, 1999; Muller HM et al, 2004).

Literature-Based Discovery

An ultimate goal of BTM is to assist with literature-based discovery. LBD can be

defined as a process that discovers testable novel hypotheses by inferring implicit

knowledge in biomedical literature. An early and often-cited example of LBD came from a

researcher’s recognition of facts from two unrelated bodies of biomedical text: one describing

Raynaud’s disease, in which patients suffer from vasoconstriction, high blood viscosity

and platelet aggregability, and the other describing fish oil, indicating that besides its capability of

causing vasodilation, its active ingredient can also lower blood viscosity and platelet

aggregation. This connection was formed completely through extensive reading of the

literature, and later the relationship was proved experimentally. The model used in this

seminal example was very simple: if A leads to B, and B leads to C, then it is plausible

that A could lead to C. Based on this closed discovery process (to connect two previously

known relations), this researcher subsequently discovered a novel association between

migraine and magnesium deficiency (also proved experimentally) as well as additional

successes (Swanson DR 1986; Swanson DR 1988; Swanson DR 1990).
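
The A, B, C model above can be sketched as a search for linking terms between two concepts. The relation table simply restates the fish oil and Raynaud’s disease example from the text; the data structure and function name are illustrative assumptions.

```python
# Relations restated from the example in the text: A -> B and B -> C.
RELATIONS = {
    "fish oil": {"blood viscosity", "platelet aggregation"},
    "blood viscosity": {"Raynaud's disease"},
    "platelet aggregation": {"Raynaud's disease"},
}


def linking_terms(relations, a, c):
    """Return every B such that A is related to B and B is related to C."""
    return {b for b in relations.get(a, set()) if c in relations.get(b, set())}
```

A non-empty result suggests that the A to C connection is plausible and worth testing; it does not establish the connection, which in the cited example required experimental confirmation.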

More challenging LBDs might arise from an open discovery process, which

attempts to derive relationships between two entities of interest through implicit

relationships in literature. For example, the process of identifying candidate genes for a

certain disease is an open discovery process. One example of this process would be to

first identify gene mentions co-occurring in the literature (gene set A) with mentions of a

disease of interest, next identifying co-occurring gene mentions (gene set B) with known

disease genes, and then consider the overlap between the two sets of gene mentions as

candidate genes for the disease. This approach makes two assumptions: gene

set B is functionally related to known disease genes, and gene set A has some

relationship with the disease. One potential problem for this approach is that there are many

types of direct and indirect relationships identified in such a process, including the high

likelihood that a substantial number of false positives are generated. NLP-based IE can

certainly help narrow down the relationship types, but further research is needed to

improve the performance of such models. More fundamentally, the literature inevitably

contains conflicting and inaccurate statements, which an automated

algorithm cannot adjudicate (Weeber M et al, 2005).
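
The candidate-gene procedure described above reduces to a set intersection. In this sketch the gene names are placeholders, and removing the already known disease genes from the result is an added assumption rather than a step stated in the text.

```python
def candidate_genes(gene_set_a, gene_set_b, known_disease_genes=()):
    """Overlap of genes co-occurring with the disease (set A) and genes
    co-occurring with known disease genes (set B).

    Excluding known disease genes is an assumption made for this sketch.
    """
    return (set(gene_set_a) & set(gene_set_b)) - set(known_disease_genes)
```

Every gene in the overlap inherits whatever noise the two co-occurrence steps introduced, which is why the false positives discussed above are expected.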

It is likely that more reliable inference of novel hypotheses and research

directions from the literature will be achieved by integrating BTM results with other data

types, including curated data sets and experimental data. Expert curation and

experimental evidence provide verification, filtering, and relevance ranking capabilities

based on information derived from real biological relationships between entities. For

example, researchers have made novel discoveries by transferring text-mined

relationships of a protein to its orthologous proteins based on sequence-similarity

searches. The integration effort of BTM results with functional genomic data such as

microarray data has helped researchers rank significant genes as well as develop novel

hypotheses based on both experimental data and prior knowledge in a large-scale,

automated fashion (Yandell MD et al, 2002; Raychaudhuri S et al, 2002; Glenisson

P et al, 2004).

Significance

Along with the rapid expansion of experimental data, the exponential growth of

the biomedical research literature makes it increasingly difficult for researchers to track and

utilize the information relevant to their interests, especially in domains outside their

core competence. Automated text mining systems can process the unstructured

information in the literature into a structured, queryable knowledgebase. This dissertation

research has developed well-performing automated entity extractors based on refined

manual annotation with iteratively defined, literature-based entity definitions in the domain of genomic

variation of malignancy. A co-occurrence-based information extraction process was

applied and integrated with microarray expression data in the pursuit of determining

candidate genes for neuroblastoma research. Both functional pathway analysis and RT-PCR

experiments validated the contribution of text mining. This thesis demonstrates that, in

addition to supporting systematic curation of textual information, biomedical text mining also

has inferential capability, especially when combined with experimental data.

Introduction to the Thesis

Using the genomics of malignancy as a test bed, this thesis has touched upon

every aspect of BTM outlined above. Work regarding the BTM process developed and

employed will be discussed in detail in Chapter 2 and Chapter 3. This thesis has also

established important work regarding information extraction in this domain, which has

been applied to research regarding the pediatric tumor neuroblastoma (Chapter 3 and

Chapter 4). Integration of BTM-extracted information with expression array analytical

results to discover candidate genes for neuroblastoma research will be discussed in detail

in Chapter 4.

Chapter 2. Defining Biomedical Entities for Named Entity Recognition

Yang Jin, Mark A. Mandel, Peter S. White

Abstract

The performance of machine-learning based named entity recognition is highly

dependent upon the quality of the training data, which is commonly generated by manual

annotation of biomedical text representative of the target domain. The development of

robust definitions of biomedical entities of interest is crucial for highly accurate

recognition but is often neglected by text-mining applications. While the conceptual and

syntactic complexities of biomedical entities often generate ambiguities in assigning text

mentions to particular entity classes, entity definitions should exhibit semantic

and textual boundaries that are as distinct as possible. We have created a highly generalizable

process for developing entity definitions specifying both conceptual limits and detailed

textual ranges for target biomedical entities. This process utilizes representative text and

manual annotators to initially define and iteratively refine definitions. The process was

tested within the knowledge domain of genomic variation of malignancy. This work

describes in detail the different types of challenges faced and the corresponding solutions

devised during the definition process. The resulting entity definitions were used to

annotate a training corpus for the development of automated entity extraction algorithms

and for use by the research community. We conclude that manual annotation consistency

is useful for the success of later biomedical text mining tasks, and that explicit, boundary-

defined entity definitions can assist with achieving this goal.

1. Introduction

Automated information extraction techniques can assist in the acquisition,

management and curation of data. A necessary first step is the ability to automatically

recognize biomedical entities in text, a task also known as named entity recognition (NER).

Development of named entity extractors for biomedical literature has progressed rapidly

in recent years. For example, a number of machine-learning algorithms currently exist for

identifying gene name instances in text (Collier N et al, 2000; Tanabe L et al, 2002;

GENIA; Hanisch D et al, 2005). However, a major shortcoming of many approaches is

that they often minimize efforts to define biomedical entities in an explicit fashion.

Rather, the tendency is often to ignore this step by adapting or refining existing semantic

standards as the target entities’ conceptual definitions, leaving interpretive details to

manual annotators. Additionally, existing standards often provide little or none of the

semantic depth required to establish concept boundaries with enough rigidity to provide

highly accurate extraction. This tends to create outstanding consistency problems in later

steps when training automated extractors and utilizing the extracted entity mentions for

particular applications, because non-literature based conceptual definitions often generate

significant annotation ambiguity problems due to the semantic as well as syntactic

complexities of biomedical entities in the literature. As a result, automated systems

derived from such definitions tend to perform more poorly. For biologists in particular, high

precision is a necessary prerequisite for widespread acceptance of automated tools, in

order to establish a level of reliability acceptable to users.

Strongly believing in the importance of establishing well-defined, literature-based

entity definitions with clear boundaries specially designed for biomedical NER practice,

the Biomedical Information Extraction Group at the University of Pennsylvania (Penn

BioIE) has developed an iterative annotation process designed to establish a set of

“precise” entity definitions. These definitions are meant to clarify the conceptual

boundaries both semantically and syntactically, while also striking a balance between the

requirements of researchers, annotators, and computational scientists. This paper will

first describe the annotation process developed by the Penn BioIE group, and then

introduce the necessities and challenges of defining biomedical entities with specific

examples in the literature.

2. Overview of manual annotation process and entity classification

Figure 2-1. The processes of developing entity definitions and extractors

Figure 2-1 demonstrates the iterative process developed for establishing and

refining entity definitions, first through manual annotations and then in developing

extractors based on the manually annotated training data. The process begins with the

creation of an initial definition that establishes the general concept and scope of an entity

class, which is supplied by one or a group of domain experts. Commonly, existing

standards and resources are explored and, if deemed suitable, adopted as nuclei for the

process. Subsequently, the domain experts adjudicate definition

discrepancies. Manual annotators are then trained with the initial versions of the entity

definitions, from which they manually annotate the selected training corpora. Invariably,

as the annotators encounter the wide diversity of semantic representations of specific

concepts, a need for iterative refinement of the entity definitions emerges. Often, the text

encountered requires major revisions or even restructuring of definitions to accommodate

such heterogeneity. Accordingly, definitions are continually refined during the analysis of

annotated texts and annotation disambiguation. The Penn BioIE group established

frequent communication forums where the emerging definitions and identified exceptions

were fully discussed among annotators and researchers. Communication modalities

included weekly face-to-face meetings, email lists, and live chat. After annotation was

completed, entity extractors were developed by implementation of machine-learning

algorithms utilizing probability models (we used Conditional Random Fields); the

manually annotated texts were utilized as both training and testing data for these

algorithms. Comparison of the annotations produced by the automatic extractors and

human annotators allows for evaluation of the extractor performance.
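
The comparison between extractor output and human annotation is commonly summarized with precision, recall, and their harmonic mean (F1). The sketch below assumes exact-match scoring over (start, end, label) spans, which is one common convention and not necessarily the scoring scheme used in this work; the span values are hypothetical.

```python
def precision_recall_f1(gold: set, predicted: set):
    """Exact-match scores for predicted spans against gold-standard spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical spans: (start offset, end offset, entity class).
gold = {(0, 3, "Gene"), (10, 23, "Malignancy")}
predicted = {(0, 3, "Gene"), (30, 34, "Gene")}
scores = precision_recall_f1(gold, predicted)
```

In this toy case one of two predictions is correct and one of two gold mentions is found, so precision, recall, and F1 all equal 0.5.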

The target knowledge domain we chose was “Genomic Variation of Malignancy”,

conceptualized as a relationship among three entity classes: Gene, Variation and

Malignancy. As shown in Figure 2-2, the Gene and Variation entities comprise genomic

components of cancer while the Malignancy entity covers phenotypic aspects of

malignancy, including malignancy diagnostic labels and a number of malignancy

phenotypic attributes.



Figure 2-2. Entity classification scheme for the domain of genomic variation of malignancy

A total of 1442 MEDLINE abstracts were selected for exploration and annotation

in this study, one subset of which contained many different malignancy types to establish

breadth, and a second subset of which mentioned only one major malignancy

(neuroblastoma) to establish depth. As diagrammed in Figure 2-1, the manual annotation

process was first applied to the corpus with an electronic annotation tool, WordFreak

(http://sourceforge.net/projects/wordfreak). After the entity definitions were refined and

stabilized, the manually annotated data were then used to develop entity and attribute

extractors (McDonald RT et al, 2004; Jin Y et al, 2006). These automated extractors

performed with state-of-the-art accuracy, in part due to the careful design and

management of our annotation process. In the following paragraphs, we discuss the

challenges we encountered during the manual annotation process, and why we

believe that consistent entity definitions are critical for the success of later steps in

biomedical text mining.

3. The challenges of defining biomedical entities

Although we began this task believing we had clear ideas of what information

each entity should cover, it quickly proved challenging to develop detailed working

definitions. Our a priori notion of an adequate entity definition was one that

establishes distinct and defensible boundaries both conceptually and textually, thereby

providing guidance to the annotators both semantically and syntactically. Solid entity

definitions are an essential foundation for the subsequent steps of developing machine-

learning algorithms and utilizing the extracted information for specific applications. First,

the performance of entity extractors is highly dependent not only on the selection of the

underlying algorithms, but also on the quality of the training data, which are entirely

based on the entity definitions. If the annotators cannot identify specific entity mentions

consistently on the basis of the definitions, it is hard to imagine that automated extractors

can replicate this task reliably. More importantly, without clear definitions, researchers

will certainly run into problems when trying to utilize the extracted mentions, since it will

be difficult to know the precise boundaries of the gathered information.

As mentioned earlier, we initially defined three major entities in the knowledge

domain of genomic variation of malignancy, based on existing ontological categories and

concepts. However, we quickly found that ontology-based definitions often do not

precisely reflect what has been conceptualized throughout the biomedical texts

contributed by researchers worldwide. For example, the NCI Thesaurus defines a gene as:

“A functional unit of heredity which occupies a specific position (locus) on a particular

chromosome, is capable of reproducing itself exactly at each cell division, and directs the

formation of a protein or other product.” If annotators used this definition to identify

gene mentions in the text, they could quickly be confused by questions such as

whether promoters should be included, how gene family names should be treated, and

how pronoun referents to genes should be handled. Thus, we found the need to invoke text-based

working entity definitions, which were most effectively determined as annotators

proceeded with the entity recognition task in the training corpus. Every new mention of

an entity and every new context for a mention provided a test for the pre-developed entity

definition. If a definition could not explicitly lead the annotators to a “correct”, or at least

consistent decision in each case, the problematic mention required further examination,

interpretation, and possibly, refinement of the definition. Through such an iterative

process, we were able to develop fine-tuned entity definitions that provided distinct

boundaries both for semantic scope and contextual range.

The challenges that we encountered in refining our definitions can be grouped

into four categories: conceptual, syntactic, syntactic/semantic ambiguity, and inter-

annotator agreement. In the following paragraphs we will illustrate these types and give

examples of our devised solutions and their limits.

3.1 Conceptual definition challenges

As discussed earlier, an entity definition has to clarify both conceptual and textual

boundaries. Initial versions of our definitions were completely conceptual, based on our

understanding of biomedical categories. Surprisingly, more than half of the annotators’

difficulties with definitions fell into this category during the annotation process, and most

of them were reasonable, as illustrated in the following paragraphs by the four

most common challenges in this category. This reflects the semantic complexity and

diversity of biomedical entities, which often cannot be easily defined without some

ambiguity.

3.1.1 Sub-classification of entities

Based on the classification scheme stated above, our target knowledge domain

was initially divided into three major conceptual classes: gene, genomic variation, and

malignancy. However, this broad conceptual classification was far from sufficient for the

generation of highly accurate extractors. For example, according to the conceptual

definition, the malignancy concept covers all phenotypic information of cancer, including

a tumor’s diagnostic type, the tumor’s anatomical location and cellular composition, and

its differentiation status. Each of these types of information is presented in a variable

and often bewildering array of syntactic and contextual patterns, which increases entropy

and thus erodes the ability of machine-learning approaches to classify mentions. If

instead we further classify the mentions into sub-categories such as those described

above and annotate them as such, entropy is reduced and extractor performance can be

expected to improve. However, a major disadvantage of this approach is that

sub-categorization introduces considerable additional annotation effort. Thus, the annotation

process requires first the establishment of a level of entity granularity that balances the

cost of manual annotation with the application value of the extracted data.
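
One way to make the informal notion of entropy used above concrete is to compare the Shannon entropy of surface patterns within a broad class against the weighted average entropy within its sub-categories. The sketch below assumes that each mention has been assigned a coarse surface pattern; the pattern names and the toy counts are hypothetical and are not measurements from our corpus.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def entropy(items: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def within_subclass_entropy(pairs: List[Tuple[str, str]]) -> float:
    """Average entropy of patterns inside each sub-category, weighted by size."""
    by_subclass = {}
    for subclass, pattern in pairs:
        by_subclass.setdefault(subclass, []).append(pattern)
    n = len(pairs)
    return sum(len(p) / n * entropy(p) for p in by_subclass.values())


# Hypothetical (sub-category, surface pattern) pairs for malignancy mentions.
mentions = [
    ("type", "noun phrase"), ("type", "noun phrase"), ("type", "acronym"),
    ("site", "noun"), ("site", "noun"), ("site", "adjective"),
    ("differentiation", "adjective"), ("differentiation", "adjective"),
]

broad_class = entropy(pattern for _, pattern in mentions)
sub_classes = within_subclass_entropy(mentions)
```

On this toy data the within-sub-category value is lower than the broad-class value, which is the effect described above; how large the reduction is in practice depends on the corpus.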

There are countless ways to further divide entities into their underlying

components. For our purpose, we decided to let the level of granularity be generated by

the annotation process. By beginning with broad classes and subdividing them as needed,

we considered that we would eventually approach an optimal balance between effort and

effectiveness. We considered it to be critical to determine how the text strings represented

subcategories in the real world of biomedical literature. Therefore we divided our

annotation efforts into two stages: data gathering and data classification, as demonstrated

in Figure 2-3 with a genomic variation entity example.



Figure 2-3. The text-based two-stage entity sub-classification process

In the example illustrated by Figure 2-3, annotation of our initial concept of

“Genomic Variation” proceeded through a preliminary stage, which we named “Data

Gathering”, before the concept was divided into sub-categories. In this stage, all textual

mentions falling within or partially within our initial concept definition were annotated

regardless of syntax. When sufficient information was gathered, sub-categories were

defined based on their semantic and syntactic representations. In addition, by proceeding

with this exercise, the annotators became familiar with the concepts, definitions, and

emerging challenges of the tasks. By employing this method, the sub-classification

scheme began to approximate how the concepts were actually presented in the text.

3.1.2 Levels of specificity

Textual entity mentions referring to the same semantic types can range from very

general to quite specific, and not all levels of detail may be appropriate for a particular

project. A gene mention may refer to a specific gene instance in a single cell of a sample,

or to the wild type or a specific variation of the gene; or it may refer to gene families,

superfamilies and generalized classes, which represent classes of genes. For instance,

“MAPK10” or “mitogen-activated protein kinase 10” is a family member of “MAPK”,

which itself belongs to a higher-level family “protein kinase”. We made the decision to

include all levels of information for the gene entity except for the most general level such

as “gene”. That is, in the above example, all three levels of gene mentions are legitimate

and should be annotated as such.

The decision was based on several considerations. First of all, gene class

information is valuable information to extract in later steps; although we do not know

which specific gene it refers to, it does help us narrow down to a class of genes. Second,

if we only include the mentions describing genes at the instance level (the level that can

lead to a specific genomic element), we have to draw a line between gene classes and

instances. Because textual mentions for gene classes and instances are sometimes

interchangeable (researchers tend to use gene class names referring to gene instance

names and vice versa), it will be quite difficult for the automated extractors to distinguish

between the two. Finally, we exclude gene mentions at the most general level, which

carry no information content or application value to extract. In other words, all

information-containing levels of mentions are included.
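
The specificity rule can be sketched as a small lookup. The hierarchy below encodes only the MAPK10 example from the text; the PARENT table, MOST_GENERAL set and function names are hypothetical illustrations and were not part of our annotation tooling.

```python
from typing import List

# Each term maps to the next, more general level (example from the text).
PARENT = {
    "MAPK10": "MAPK",
    "mitogen-activated protein kinase 10": "MAPK",
    "MAPK": "protein kinase",
    "protein kinase": "gene",
}

# Terms at the most general level carry no information and are not tagged.
MOST_GENERAL = {"gene"}


def lineage(term: str) -> List[str]:
    """Return the term followed by its increasingly general parents."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def is_taggable(term: str) -> bool:
    """A known term is tagged unless it sits at the most general level."""
    known = set(PARENT) | set(PARENT.values())
    return term in known and term not in MOST_GENERAL


taggable_levels = [t for t in lineage("MAPK10") if is_taggable(t)]
```

Here the instance, family and superfamily levels are all kept, and only the bare term “gene” is dropped.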

3.1.3 Conceptual overlaps between entities

An ideal entity classification scheme should result in independent information

categories without any conceptual overlaps. Unfortunately, the subjective and adaptive

nature of biological objects makes this ideal difficult to achieve, especially

when defining two different but related entities. Even a basic concept such as “organism”

is difficult to define when considering entities such as viruses and viroids, self-replicating

machines with attributes necessary but not necessarily sufficient to qualify as life forms.

Because our gene and genomic variation concepts both fall within the genomic domain

and are closely associated, we were very careful to make a clear distinction. Eventually,

our gene entity evolved to encompass solely the names of genes and their downstream

products (i.e., RNAs and proteins), while the genomic variation entity covered specific

descriptions of genomic element variations.

Although our definitions of gene and genomic variation managed to eventually

establish a reasonable boundary between them, for other entities, we found it sometimes

impossible to avoid the conceptual overlapping problem. We encountered such problems

when trying to make a clear division between the entity classes symptom and disease. The

symptom entity was designed to capture subjective or objective evidence of disease, such

as headache, diarrhea or hyperglycemia, while the disease entity captured specific

pathological processes with a characteristic set of symptoms, such as Long QT Syndrome

or lung cancer. As in most cases, the distinction seems clear to domain experts until

considerable scrutiny is applied, as it appears to be simple common sense that these

concepts represent two distinct and non-overlapping sets of information. However, when

presented with the broad contextual variation in use and, often, semantic intent, it actually

becomes quite difficult to draw a clear boundary between the two. We quickly found that

many terms can be considered as both symptoms and diseases, depending both upon

intent and the level of domain knowledge available. For example, “arrhythmia” itself is a

disease entity mention, representing a pathological process, but it is usually used as a

diagnostic sign (symptom) of a disease, such as Long QT Syndrome. We certainly did not

want two entity types to overlap heavily with each other, since that would make

the classification unnecessary. That was not the case for the symptom and disease entity

types, whose overlapping mentions accounted for less than about 10% overall. Most

conceptually overlapping mentions cannot be put into either category without reading the

text. We left it to the annotators to determine the authors’ intent based on the context, and

over time they became quite good at minimizing disagreement.

3.1.4 Domain-specific clarification

As biological entities tend to be conceptually subjective, we often found it to be

quite challenging and labor-intensive to establish consistent conceptual boundaries. The

process of defining the gene entity is a good example to illustrate this challenge. Initially,

we considered the task of defining a “gene” to be a straightforward task, as this concept is

considered by biologists to be a rather discrete object. The HUGO Gene Nomenclature

Committee (HGNC), the nomenclature body tasked with establishing official names for

human genes, defines a gene as “a DNA segment that contributes to phenotype/function.

In the absence of demonstrated function a gene may be characterized by sequence,

transcription or homology”. Building on this, our gene entity was initially defined as the

nominal reference to a gene or its downstream product in biomedical text. However, as

annotations moved forward, annotators raised more and more questions, forcing us to

make difficult determinations on the boundaries as illustrated below.

An example of biological complexity is the many ways that a gene can contribute

to phenotype. Typically, genes functionally impact biological processes through their

downstream products, proteins. However, there are DNA segments on the genome which

are able to affect phenotype by regulating how genes are expressed in particular

biological contexts. Promoter and enhancer regions, which are distinct segments of DNA

(often far) removed from the DNA segment that directly contributes to an RNA and/or

protein product, are examples. These elements control whether and when the gene

itself is expressed. Although biologists disagree whether promoters should be considered

as genes or components of particular genes, annotators are required to make a decision on

the gene entity boundary limits. In this case, we considered our application domain to be

the most important determinant, as the main focus of our gene entity was to capture those

“traditional genes” that could be directly and consistently associated with a protein. Thus,

we limited our scope of genes to include only what we considered to be biologically

functional DNA segments which are translated into protein products.

There are many more cases that required further clarification of the gene entity

conceptual definition, such as how to deal with segments and multiplexes of

genes/RNAs/proteins. We realized that consistency was more valuable than trying to

establish universal truth, as we considered consistency to be the key to developing

well-performing automated extractors and increasing the application value of extracted

mentions.

3.2 Syntactic definition challenges

Even with precise conceptual definitions, we found that guidelines needed to be made

regarding the textual boundaries of the entity mentions. Although many of these were

syntactic nuances, they were not trivial sources of annotator disagreement. In

order to make consistent automated extractors, we determined that detailed annotation

guidelines were required to make manual annotations consistent between different

annotators. We designed our guidelines to be practical and based on actual contexts,

specifying to the annotators exactly what to do under any uncertain circumstances that we

had encountered.

3.2.1 Associating a text string to an entity mention

There are many different ways to associate a text string with an entity mention in

biomedical literature. In order to harvest consistent training data to develop

high-performing automated extractors, we needed to define a series of rules specifying how to

select text strings in the literature as legitimate entity mentions. We allowed entity

references to include more than one word, including punctuation, but not to cross

sentence boundaries.

Although the majority of the entity mentions were nouns, not all of them were.

For some entity mentions such as variation type, other part-of-speech forms were not

uncommon. For example, for genomic variation types that would likely be normalized as

the forms “insertion”, “deletion”, or “translocation”, those variation type mentions were

usually expressed as verbs: “inserted”, “deleted”, or “translocated”. Moreover,

malignancy attribute mentions were nearly always adjectives, such as “well-

differentiated”, “hereditary”, and “malignant”.

All modifiers in a noun phrase mention were included as part of

the mention, not only because modifiers can provide very useful information to be

extracted, but also because some modifiers are indispensable parts of the standard terms. We

observed that this decision made it easier for both manual annotators and machine-

learning extractors to operate since it was difficult to define boundaries on what

modifiers to include in noun phrases. However, modifiers were not included for other

part-of-speech phrases, in order not to complicate the issue. For example, in a noun

phrase malignancy type mention “malignant squamous cell carcinoma”, both “malignant”

and “squamous cell” are the modifiers of “carcinoma”, and both provide very useful

information. “Squamous cell carcinoma” is also a commonly employed name of a type of

cancer. In our experience, it was difficult for annotators and impossible for

automatic extractors to draw consistent boundaries around which modifiers should be

included as part of the legitimate mentions.

Lastly, we found it necessary to make entity-specific rules for some biological

entities. For example, the gene entity mentions commonly appeared in the text as “The

mycn gene…”, necessitating a decision as to whether the article “The” and the noun

“gene” should be included as part of the entity mention. We reasoned that the decision

should depend on how the extracted information was to be further processed and utilized.

Accordingly, we decided to include neither word, since all the extracted gene mentions

were to be subsequently mapped and normalized to official gene symbols.
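
The boundary rule for gene mentions can be illustrated with a short sketch that reduces a phrase such as “The mycn gene” to the span that would be tagged. The function name and regular expressions are hypothetical; in our process this decision was made by human annotators following the guideline, not by code.

```python
import re

# Leading article "the" and trailing noun "gene" are excluded from the mention.
_LEADING_ARTICLE = re.compile(r"^the\s+", re.IGNORECASE)
_TRAILING_GENE = re.compile(r"\s+gene$", re.IGNORECASE)


def gene_mention_core(phrase: str) -> str:
    """Strip the article and the generic noun, keeping only the gene name."""
    core = _LEADING_ARTICLE.sub("", phrase.strip())
    return _TRAILING_GENE.sub("", core)


core = gene_mention_core("The mycn gene")
```

Keeping only the name simplifies the later step of mapping each mention to an official gene symbol.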

3.2.2 Co-reference issue

Often a single entity is referred to in different ways in the same text, a situation

known as co-reference. Besides its standardized form, an entity instance can also be

referred to by aliases, acronyms, descriptions or pronoun references. For example, the

mycn gene has at least 10 aliases in the literature, including “n-myc”, “oded”, and “v-myc

avian myelocytomatosis viral related oncogene, neuroblastoma derived”. Moreover,

researchers commonly engineer their own acronyms as self-convenient but non-standard

and often unique aliases. Co-reference is generally recognized as a challenging task for

entity recognition and information extraction. To deal with this issue in manual

annotation, we have classified this problem into the following four categories and made

corresponding decisions for each of them.

A. Extended form vs. acronym

Regular expression: ___ ___ ___ (___)

Examples:

…mitogen-activated protein kinase (MAPK)… -- gene entity mention

…squamous cell carcinoma (SCC)… -- malignancy type entity mention

Our decision: Tag both the extended form and abbreviated form of the entity mention.

For the above examples, “MAPK” is co-referential with “mitogen-activated protein

kinase”, and “SCC” is co-referential with “squamous cell carcinoma”. Both extended

forms and acronyms would be tagged as corresponding entity instances in our system.

Our rationale: Both forms are interchangeable descriptions of entity mentions, and they

should be treated equally.
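
The pattern for category A can be approximated programmatically. The sketch below is a simple heuristic, assuming that the acronym is formed from the initial letters of the words immediately preceding the parentheses; the function name is hypothetical, and the heuristic would miss acronyms built in other ways.

```python
import re
from typing import List, Tuple

_PAREN = re.compile(r"\(([A-Z][A-Za-z0-9]{1,9})\)")   # e.g. "(MAPK)"
_WORD = re.compile(r"[A-Za-z0-9]+")                    # hyphens split words


def find_acronym_pairs(text: str) -> List[Tuple[str, str]]:
    """Return (extended form, acronym) pairs matching 'long form (ACRONYM)'."""
    pairs = []
    for match in _PAREN.finditer(text):
        acronym = match.group(1)
        before = text[:match.start()]
        words = list(_WORD.finditer(before))
        candidate = words[-len(acronym):]
        if len(candidate) < len(acronym):
            continue  # not enough preceding words to spell the acronym
        initials = "".join(w.group()[0] for w in candidate)
        if initials.lower() == acronym.lower():
            long_form = before[candidate[0].start():candidate[-1].end()]
            pairs.append((long_form, acronym))
    return pairs


sentence = ("Activation of mitogen-activated protein kinase (MAPK) was seen "
            "in squamous cell carcinoma (SCC).")
pairs = find_acronym_pairs(sentence)
```

Under our guideline, both members of each pair would be tagged as separate mentions of the same entity class.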

B. Alias description

Regular expression: …Y…X… or …Y (X)…

Examples:

TrkA (NTRK1)…

The N-myc gene, or MYCN…

Our decision: NTRK1 and MYCN are official name designations of the TrkA and N-myc

genes, and here they are being co-referenced accordingly. We decided to tag all different

expression forms of the entity instances, including standard/official nomenclatures,

aliases or descriptions. Like acronyms and their extended forms, these various names are

also tagged individually: in the first example, we tagged “TrkA” and “NTRK1”

separately and without the parentheses, not the combined string “TrkA (NTRK1)”.

Our rationale: Researchers often use unofficial nomenclatures for entity mentions, so we

cannot annotate only the standard descriptions. However, they should be normalized later.

C. General vs. specific

Regular expression: X, a (the) Y…

Examples:

C-Kit, a tyrosine kinase which plays an important role, …

K-Ras is an oncogene. The Ras gene…

Our decision: In the examples above, the gene family name “Ras” and the superfamily

name “tyrosine kinase” are used to co-refer to the gene family instances “K-Ras” and “C-

Kit”. In such situations, our annotation guideline treated the general terms and more

specific terms completely independently, regardless of the co-referential relationship

between them. That is, depending on the conceptual definition, if the term was a

legitimate mention, it was tagged as an entity mention no matter what levels of specificity

it had. For those examples, since the gene entity definition included both gene instances

and family names, all four terms were tagged as gene entity mentions. We did not,

however, tag “oncogene”, nor did we extend the tag on “Ras” to include the following

word “gene”. These words, at the highest level of generality, convey no taggable

information.

Our rationale: Based on our decision on tagging all information-containing levels of

mentions and specifically for the examples listed, all gene instances, gene families and

superfamilies are determined legitimate mentions.

D. Pronoun reference

Regular expression: …X…PRONOUN (It, This, etc.)…

Examples:

K-Ras is an oncogene. It is mutated in…

Five point mutations were found in the MYC gene, and they were next to each

other.

Our decision: In the two examples, “It” is co-referential to “K-Ras”, and “they” is co-

referential to “point mutations”. We generally did not annotate pronouns, although they

may refer to legitimate entity mentions.

Our rationale: Pronoun co-reference is a challenging problem in text mining research,

which involves cross-sentence, whole-record-level relation extraction. Without deeper

parsing of the text, there is no value in extracting the pronoun itself.

3.2.3 Structural overlap between entity mentions

Entities can overlap not only conceptually, but also literally, with their textual

mentions in the literature. Annotation guidelines were developed for the following

situations:

A. Entity within entity – tag within tag

This refers to the situation in which one entity mention is completely included in the

textual range of another. As the two intertwined entity mentions could belong to either

the same or different entities, we divided this category of problem into two

sub-categories. If the two mentions belonged to the same entity, only the subsuming entity

mention was tagged. For example, in “mitogen-activated protein kinase kinase kinase”,

there exist 7 distinct gene entity mentions: mitogen-activated protein; mitogen-activated

protein kinase; mitogen-activated protein kinase kinase; mitogen-activated protein kinase

kinase kinase; and three mentions of “kinase”. While this type of a situation was a source

of confusion among new annotators, we considered it both unnecessary and costly to tag

all possible mention permutations. As the mention with the largest range was always the

one being discussed, only the outermost mention was tagged as a gene

mention. In fact, this situation led to the adoption of a more generalized guiding

principle, where the annotation should reflect the author intent whenever possible

(although exceptions were encountered, such as poorly written abstracts where the intent

from the context occasionally and obviously differed from the actual word or phrase

used).

If two completely overlapping mentions instead belonged to different entity types,

we annotated both. These mentions were usually related, and they both often provided

valuable information. Some entities, such as malignancy attributes, often appeared as part

of another entity mention. For instance, “colon cancer” is a malignancy type mention, and

“colon” is a malignancy site mention. “Hirschsprung disease 1” is another example:

“Hirschsprung disease” is a disease mention, while the whole phrase is a gene mention.
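
The two rules for nested mentions can be sketched as a filter over candidate spans: a span is dropped only when a longer span of the same entity type contains it. The Span type, label names and function are hypothetical illustrations of the guideline rather than part of our annotation software.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str


def drop_nested_same_type(spans: List[Span]) -> List[Span]:
    """Keep a span unless a strictly larger span with the same label contains it."""
    kept = []
    for s in spans:
        nested = any(
            o.label == s.label
            and o.start <= s.start and s.end <= o.end
            and (o.start, o.end) != (s.start, s.end)
            for o in spans
        )
        if not nested:
            kept.append(s)
    return kept


# "mitogen-activated protein kinase kinase kinase": seven candidate gene spans.
gene_text = "mitogen-activated protein kinase kinase kinase"
gene_spans = [
    Span(0, 25, "gene"), Span(0, 32, "gene"), Span(0, 39, "gene"),
    Span(0, 46, "gene"), Span(26, 32, "gene"), Span(33, 39, "gene"),
    Span(40, 46, "gene"),
]
outermost = drop_nested_same_type(gene_spans)

# "colon cancer": nested spans of different types are both kept.
mixed = drop_nested_same_type(
    [Span(0, 12, "malignancy_type"), Span(0, 5, "malignancy_site")]
)
```

For the kinase example only the full 46-character span survives, while both spans in “colon cancer” are retained because they belong to different entity types.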

B. Entity co-identity – double tagging

This category represents the situation in which two entity mentions share the exact

same text. We annotated the same text twice with the two corresponding labels under

such circumstances. For example, in the phrase “deletion of the K-ras gene”, “K-ras” was

tagged as both a gene entity mention and a variation-location mention.

C. Discontinuous mentions – chaining

Sometimes mentions of several entities of the same type shared a common

substring. When written together in the text, the common part occurred only once, for the

first or last mention, and the other mentions were represented only by their differing parts.

For example, in the text “H-, K-, and N-ras…”, there are really three gene mentions:

“H-ras”, “K-ras” and “N-ras”, but a limitation of our annotation software prevented tagging

of discontinuous mentions as one parent mention (in the example above, only “N-ras”

could be tagged). For the other two discontinuous mentions, we developed a chaining

procedure through which annotators were able to link the component parts (“H-” and

“K-” with “ras”) by inserting comments into the annotation in a standard format.

Chaining was strictly limited to within one sentence in order not to complicate issues

for subsequent syntactic parsing of sentences. Employing the same logic, entity mentions

were not allowed to cross sentence boundaries.
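
The result that chaining is meant to recover can be illustrated with a short sketch that expands a coordinated expression such as “H-, K-, and N-ras” into its full mentions. The regular expression and function name are hypothetical, and the sketch covers only the case in which the shared part follows the last prefix; it does not reproduce the comment format used in our annotation files.

```python
import re
from typing import List

# One or more "X-," prefixes, an optional conjunction, then "Y-shared".
_CHAIN = re.compile(r"((?:\b\w+-,\s*)+)(?:(?:and|or)\s+)?(\w+)-(\w+)")


def expand_chained_mentions(text: str) -> List[str]:
    """Rebuild every full mention from a chained expression in the text."""
    mentions = []
    for match in _CHAIN.finditer(text):
        shared = match.group(3)
        prefixes = re.findall(r"(\w+)-,", match.group(1)) + [match.group(2)]
        mentions.extend(f"{prefix}-{shared}" for prefix in prefixes)
    return mentions


chained = expand_chained_mentions("Mutations in H-, K-, and N-ras were found.")
```

Because the pattern cannot match across a sentence-final period, the sketch respects the one-sentence limit described above.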

3.3 Syntactical vs. Semantic – ambiguity challenges

We considered ambiguity in mentions to be the most common and difficult

challenge in our annotation experience, as it truly reflects the limitation of human-

invented texts in fully communicating author intent. In biomedical text, we found it not

uncommon that an identical text string could represent completely different concepts, and

the frequency of ambiguity appeared to be much higher than for non-biological text. In

the following paragraphs, we will use mainly gene entity examples to illustrate the

elusive nature of this problem.

We found ambiguity to occur both within and outside gene entities. Genes have a

tradition of being independently named, with poor adherence to or awareness of

standards. People tended to make up new acronyms for gene names, with the result

that there are more gene names than combinations of letters and numbers available for

short-character symbols/aliases. Thus, many aliases are similar just by

chance. Since each gene has multiple non-unique aliases alongside one unique gene symbol,

there exists a very serious internal ambiguity problem among the aliases. Based on our

calculation, for human genes alone, as many as 3% of genes share

aliases, and the number is higher if other species are included. Also, many species,

especially mouse and human, traditionally use the same names for their genes (Chen L et al,

2005). For example, p90 is the common alias shared by the distinct gene symbols CANX

and TFRC. By protein naming convention, p90 refers to a protein with a

molecular weight of 90 kDa. Therefore, it is not surprising that there are two proteins with the

same name.
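
Alias ambiguity of this kind can be detected by inverting a symbol-to-alias table. The sketch below uses only the symbols and aliases mentioned in this chapter; the table layout and function name are hypothetical, and a real calculation would draw on a complete gene nomenclature resource.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Official symbol -> aliases (examples taken from this chapter only).
GENE_ALIASES: Dict[str, List[str]] = {
    "CANX": ["p90"],
    "TFRC": ["p90"],
    "MYCN": ["n-myc"],
    "NTRK1": ["TrkA"],
}


def ambiguous_aliases(table: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each alias shared by more than one symbol to those symbols."""
    index = defaultdict(set)
    for symbol, aliases in table.items():
        for alias in aliases:
            index[alias.lower()].add(symbol)   # compare case-insensitively
    return {alias: symbols for alias, symbols in index.items() if len(symbols) > 1}


shared = ambiguous_aliases(GENE_ALIASES)
```

Case-insensitive comparison is an assumption made here for simplicity; it increases the number of collisions found.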

When such gene mentions appear in the literature, (often quite distant) context is the

only way to clarify which gene is under discussion, although sometimes it offers no

assistance. Another type of within-gene-entity ambiguity that we recognized was the

frequent apparent inability to distinguish a gene from its downstream products, based

purely on the text string of the mention. Although initially, our gene entity was designed

to capture only the nomenclatures of functional genomic elements, we soon discovered

that researchers were frequently using the same referents to represent a gene and also its

RNA and protein products in the literature. Without looking at the context, a gene

mention “mycn” had an almost equal probability of referring to a gene or its downstream

product, and both the gene and its mRNA were referred to as being “expressed” to create

an mRNA or a protein product, respectively. In addition, authors tended to obscure

the conceptual boundaries between a gene and its downstream products. For example,

while a given protein X performs biological functions, we found it common that the

corresponding gene X was being described as performing this action. It became apparent

that while researchers were personally clear regarding distinctions, their descriptions did

not adequately convey these distinctions. In fact, in several cases, we found it impossible

to determine whether certain gene mentions referred to a gene or its RNA or protein

products even when considering the entire article. This overwhelming ambiguity problem

finally prompted us to reach the decision to include genes’ downstream products when

annotating gene entity mentions. In the end, we created a single entity class, gene, but also

included labels for partially subdividing its mentions, allowing for the fact that mentions

cannot always be cleanly divided into the three sub-classes. If it was not clear in the text whether

a mention referred to a gene or a protein, the mention was annotated as “gene.generic”, as

opposed to “gene.gene/RNA” or “gene.protein”.

Besides the challenges mentioned above, it was common to encounter gene entity

mentions that could easily be confused with objects belonging to other entity types. This

is because genes have been named with a wide variety of methods, from the use of lay

language to the invention of specialized and often clever acronyms. For example, “Cat”

is an official gene symbol for the gene catalase, while it could also be used to refer to a

kind of animal. “NB” is the acronym of a well-known pediatric cancer neuroblastoma,

but it is also an official name of a gene locus putatively located on chromosome 1p36.

This cross-entity ambiguity problem was also commonly seen for other entity classes,

such as variation type. As an example, “insertion” and “deletion” are well-defined

variation type mentions, but they are also frequently used to denote biological or clinical

actions. Regardless of the types of the ambiguity problems, the task for our manual

annotators was to make their best calls to identify the intended reference of the text

strings and annotate them as such. Sometimes annotators needed to take the entire abstract

or, rarely, the entire article, into consideration in order to determine what particular

mentions truly represented. Depending on the nature of the biomedical entities and how

representative the training data was, the subsequent automatic extractors were able to

disambiguate problematic text strings to a certain degree by taking local contextual features

into account.

3.4 Annotator perceptions

Even if perfect entity definitions and annotation guidelines could somehow be

created, there would still be variations among human annotators in understanding and

applying them during the annotation process, and we certainly encountered lively

discussion regarding some topics. Usually, manual annotation is done by different

annotators in order to get more files done within a shorter period of time, but the

downside is that it introduces more inconsistencies between annotators. Even with only

one annotator, there will be variability in application of guidelines.

We took two approaches to deal with this problem. First, annotators were told to

discuss anything unclear, and we promoted frequent discussion to determine a consistent

path. Second, a dual, sequential-pass manual annotation process was developed and

applied to better adjudicate different annotators’ work and produce training data as

consistent as possible. During this process, every document was annotated de novo by

one annotator and then checked by a second, more experienced and consistent

annotator, who was charged with identifying and revising any first-pass annotations

considered to be incorrect. Edited items were then subject to

review by the group, and senior annotators used this editing process as an opportunity for

educating less experienced annotators if repeated error patterns were identified.

3.5 Publication-based errors

Typographical and grammatical errors, though infrequent, are inevitable, and

some of them were observed in entity mentions during our process. Because of

copyright considerations, we were not authorized to change the text in such cases;

instead, we skipped tagging the affected mentions and added comments.

4. Application

As a result of the generation and application of these carefully refined entity

definitions and annotation guidelines, 1442 MEDLINE abstracts were manually

annotated. Of these, 1157 files have been made publicly available (release 0.9, BioIE web

site). Since the release, the data has been widely used by the biomedical text mining

community for a variety of purposes, including entity recognition and normalization, and

the usage is likely to increase (Cohen KB et al, 2005).

Because of the consistency of the training data across the corpus, the developed

entity and attribute extractors perform with high precision and recall rates. Table 2-1

indicates the performance of three entity extractors built with this data (McDonald RT et

al, 2004; Jin Y et al, 2006).

Entity            Precision   Recall   F-measure
Gene              0.864       0.787    0.824
Variation Type    0.8556      0.7990   0.8263
Location          0.8695      0.7722   0.8180
State-Initial     0.8430      0.8286   0.8357
State-Sub         0.8035      0.7809   0.7920
Overall           0.8541      0.7870   0.8192
Malignancy type   0.8456      0.8218   0.8335

Table 2-1: Entity extractor performance on evaluation data

5. Conclusion

Manual annotation is an indispensable step to create training data for developing

machine-learning automated extractors. In order to generate extractors that perform with

accuracies high enough to be acceptable to the biomedical research community,

consistently annotated training data is a prerequisite. Although we did not formally prove

it, our experience has been that investment in developing literature-based entity

definitions and annotation guidelines yields far better extracted information with distinct

conceptual boundaries, which in turn increases the opportunity for practical application.

We have concluded that, rather than trying to construct unifying definitions that maximize

acceptance and minimize contention amongst domain experts, a consistent and

generally arguable definition was preferable when making decisions to specify entity

boundaries and magnitudes. More important for us was to consider how the extracted

information will be used, and once determined, how to maintain consistency throughout

the training corpus.

References

Chen L, Liu H, Friedman C: Gene name ambiguity of eukaryotic nomenclatures.

Bioinformatics, 21: 248-256. (2005).

Cohen KB, Fox L, Ogren PV, Hunter L: Corpus design for biomedical natural language

processing. Proceedings of the ACL-ISMB workshop on linking biological literature,

ontologies and databases, pp. 38-45. Association for Computational Linguistics. (2005).

Collier N, Nobata C, Tsujii J: Extracting the names of genes and gene products with a

hidden Markov model. In Proceedings of the 18th International Conference on

Computational Linguistics, Saarbrücken, Germany. (2000).

GENIA: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ (2004).

Hanisch D, Fundel K, Mevissen HT, Zimmer R, Fluck J: ProMiner: rule-based protein

and gene entity recognition. BMC Bioinformatics. 6: S14. (2005).

Jin Y, McDonald RT, Lerman K, Mandel MA, Carroll S, Liberman MY, Pereira FC,

Winters RS, White PS: Automated recognition of malignancy mentions in biomedical

literature. BMC Bioinformatics, 7: 492. (2006).

http://compbio.uchsc.edu/ccp/corpora/design.shtml


McDonald RT, Winters RS, Mandel M, Jin Y, White PS, Pereira F: An entity tagger for

recognizing acquired genomic variations in cancer literature. Bioinformatics 22(20):

3249-3251. (2004).

Penn BioIE: http://bioie.ldc.upenn.edu/index.jsp

Tanabe L, Wilbur W: Tagging gene and protein names in biomedical text,

Bioinformatics, 18:1124-1132. (2002).

Chapter 3. Automated Recognition of Malignancy Mentions in Biomedical Literature

Yang Jin, Ryan T. McDonald,

Kevin Lerman, Mark A. Mandel, Steven Carroll,

Mark Y. Liberman, Fernando C. N. Pereira,

R. Scott Winters, Peter S. White

Published: BMC Bioinformatics, 7:492, 2006

Abstract

Background: The rapid proliferation of biomedical text makes it increasingly

difficult for researchers to identify, synthesize, and utilize developed knowledge in their

fields of interest. Automated information extraction procedures can assist in the

acquisition and management of this knowledge. Previous efforts in biomedical text

mining have focused primarily upon named entity recognition of well-defined molecular

objects such as genes, but less work has been performed to identify disease-related

objects and concepts. Furthermore, promise has been tempered by an inability to

efficiently scale approaches in ways that minimize manual efforts and still perform with

high accuracy. Here, we have applied a machine-learning approach previously successful

for identifying molecular entities to a disease concept to determine if the underlying

probabilistic model effectively generalizes to unrelated concepts with minimal manual

intervention for model retraining.

Results: We developed a named entity recognizer (MTag), an entity tagger for

recognizing clinical descriptions of malignancy presented in text. The application uses

the machine-learning technique Conditional Random Fields with additional domain-

specific features. MTag was tested with 1,010 training and 432 evaluation documents

pertaining to cancer genomics. Overall, our experiments resulted in 0.85 precision, 0.83

recall, and 0.84 F-measure on the evaluation set. Compared with a baseline system using

string matching of text with a neoplasm term list, MTag performed with a much higher

recall rate (92.1% vs. 42.1% recall) and demonstrated the ability to learn new patterns.

Application of MTag to all MEDLINE abstracts yielded the identification of 580,002

unique and 9,153,340 overall mentions of malignancy. Significantly, addition of an

extensive lexicon of malignancy mentions as a feature set for extraction had minimal

impact in performance.

Conclusions: Together, these results suggest that the identification of disparate

biomedical entity classes in free text may be achievable with high accuracy and only

moderate additional effort for each new application domain.

Background

The biomedical literature collectively represents the acknowledged historical

perception of biological and medical concepts, including findings pertaining to disease-

related research. However, the rapid proliferation of this information makes it

increasingly difficult for researchers and clinicians to peruse, query, and synthesize it for

biomedical knowledge gain. Automated information extraction methods, which have

recently been increasingly concentrated upon biomedical text, can assist in the acquisition

and management of this data. Although text mining applications have been successful in

other domains and show promise for biomedical information extraction, issues of

scalability impose significant impediments to broad use in biomedicine. Particular

challenges for text mining include the requirement for highly specified extractors in order

to generate accuracies sufficient for users; considerable effort by highly trained computer

scientists with substantial input by biomedical domain experts to develop extractors; and

a significant body of manually annotated text—with comparable effort in generating

annotated corpora—for training machine-learning extractors. In addition, the high

number and wide diversity of biomedical entity types, along with the high complexity of

biomedical literature, make auto-annotation of multiple biomedical entity classes a

difficult and labor-intensive task.

Most biomedical text mining efforts to date have focused upon molecular object

(entity) classes, especially the identification of gene and protein names. Automated

extractors for these tasks have improved considerably in the last few years [1-13]. We

recently extended this focus to include genomic variations [14]. Although there have

been efforts to apply automated entity recognition to the identification of phenotypic and

disease objects [15-17], these systems are broadly focused and often do not perform as

well as those utilizing more recently-evolved machine-learning techniques for such tasks

as gene/protein name recognition. Recently, Skounakis and colleagues have applied a

machine-learning algorithm to extract gene-disorder relations [18], while van Driel and

co-workers have made attempts to extract phenotypic attributes from Online Mendelian

Inheritance in Man [19]. However, more extensive work on medical entity class

recognition is necessary because it is an important prerequisite for utilizing text

information to link molecular and phenotypic observations, thus improving the

association between laboratory research and clinical applications described in the

literature.

In the current work, we explore scalability issues relating to entity extractor

generality and development time, and also determine the feasibility of efficiently

capturing disease descriptions. We first describe an algorithm for automatically

recognizing a specific disease entity class: malignant disease labels. This algorithm,

MTag, is based upon the probability model Conditional Random Fields (CRFs) that has

been shown to perform with state-of-the-art accuracy for entity extraction tasks [5, 14].

CRF extractors consider a large number of syntactic and semantic features of text

surrounding each putative mention [20, 21]. MTag was trained and evaluated on

MEDLINE abstracts and compared with a baseline vocabulary matching method. An

MTag output format that provides HTML-visualized markup of malignant mentions was

developed. Finally, we applied MTag to the entire collection of MEDLINE abstracts to

generate an annotated corpus and an extensive vocabulary of malignancy mentions.

Results

MTag performance

Manually annotated text from a corpus of 1,442 MEDLINE abstracts was used to

train and evaluate MTag. Abstracts were derived from a random sampling of two

domains: articles pertaining to the pediatric tumor neuroblastoma and articles describing

genomic alterations in a wide variety of malignancies. Two separate training experiments

were performed, either with or without the inclusion of malignancy-specific features,

which were the addition of a lexicon of malignancy mentions and a list of indicative

suffixes. In each case, MTag was trained with the same randomly selected 1,010 training

documents and then evaluated with a separate set of 432 documents pertaining to cancer

genomics. The extractor took approximately 6 hours to train on a 733 MHz PowerPC G4

with 1 GB SDRAM. Once trained, MTag can annotate a new abstract in a matter of

seconds.

For evaluation purposes, manual annotations were treated as gold-standard files

(assuming 100% annotation accuracy). We first evaluated the MTag model with all

biological feature sets included. Our experiments resulted in 0.846 precision, 0.831 recall,

and 0.838 F-measure on the evaluation set. Additionally, the two subset corpora

(neuroblastoma-specific and genome-specific) were tested separately. As expected, the

extractor performed with higher accuracy with the more narrowly defined corpus

(neuroblastoma) than with the corpus more representative for various malignancies

(genome-specific). The neuroblastoma corpus performed with 0.88 precision, 0.87 recall,

and 0.88 F-measure, while the genome-specific corpus performed with 0.77 precision,

0.69 recall, and 0.73 F-measure. These results likely reflect the increased challenge of

identifying mentions of malignancy in a document set demonstrating a more diverse

collection of mentions.

To determine the impact of the biological feature sets we included to provide domain

specificity, we excluded these feature sets to create a generic MTag. This extractor was

then trained and evaluated using the identical set of files used to train the biological

MTag version. Somewhat surprisingly, the extractor performed with similar accuracy

with the generic model, resulting in 0.851 precision, 0.818 recall, and 0.834 F-measure

on the evaluation set. These results suggested that at least for this class of entities, the

extractor performs the task of identifying malignancy mentions efficiently without the

use of a specialized lexicon.

Extraction versus string matching

We next determined performance of MTag relative to a baseline system that could be

easily employed. For the baseline system, the NCI neoplasm ontology, a term list of

5,555 malignancies, was used as a lexicon to identify malignancy mentions [22]. Lexicon

terms were individually queried against text by case-insensitive exact string matching. A

subset of 39 abstracts randomly selected from the testing set, which together contained

202 malignancy mentions, was used to compare the automated extractor and baseline

results. MTag identified 190 of the 202 mentions correctly (94.1%), while the NCI list

identified only 85 mentions (42.1%), all of which were also identified by the extractor.

We also determined the performance of string matching that instead used the set of

malignancy mentions identified in the manually curated training set annotations (1,010

documents) as a matching lexicon. This system identified 79 of 202 mentions (39.1%).

Combining the manually-derived lexicon with the NCI lexicon yielded 124 of 202

matches (61.4%).
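
The string-matching baseline described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function name `match_lexicon`, the toy lexicon, and the sample sentence are hypothetical, and matching on word boundaries with a preference for the longest term is an assumption (the text specifies only case-insensitive exact string matching).

```python
import re


def match_lexicon(text, lexicon):
    """Return (start, end, matched_text) for each case-insensitive lexicon match.

    Assumption: terms match on word boundaries and longer terms win over
    shorter overlapping ones; the source only states that matching was
    case-insensitive and exact.
    """
    # Longest terms first so "acute myeloid leukemia" beats "leukemia".
    terms = sorted({t.lower() for t in lexicon}, key=len, reverse=True)
    if not terms:
        return []
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


# Hypothetical toy data, for illustration only.
toy_lexicon = ["leukemia", "acute myeloid leukemia", "neuroblastoma"]
sentence = "Acute Myeloid Leukemia and leukaemia were compared with neuroblastoma."
hits = match_lexicon(sentence, toy_lexicon)
```

In this toy example the spelling variant “leukaemia” is not matched, which illustrates the kind of recall gap that exact string matching showed relative to MTag.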

A closer analysis of the 68 malignancy mentions missed by the string matching with

combined lists but positively identified by MTag determined two general subclasses of

additional malignant mentions. The majority of MTag-unique mentions were lexical or

modified variations of malignancies present either in the training data or in the NCI

lexicon, such as minor variations in spelling and form (e.g., “leukaemia” versus

“leukemia”), and acronyms (e.g., “AML” in place of “acute myeloid leukemia”). More

importantly, a substantial minority of mentions identified only by MTag were instances

of the extractor determining new mentions of malignancies that were, in many cases,

neither obvious nor represented in readily available lexicons. For example, “temporal

lobe benign capillary haemangioblastoma” and “parietal lobe ganglioglioma” are neither

in the NCI list nor in the training set per se, nor approximated as such by a lexical variant. This

suggests that MTag contributes a significant learning component.

Application to MEDLINE

MTag was then used to extract mentions of malignancy from all MEDLINE

abstracts through 2005. Extraction took 1,642 CPU-hours (68.4 CPU-days; 2.44 days on

our 28-CPU cluster) to process 15,433,668 documents. A total of 9,153,340 redundant

mentions and 580,002 unique mentions (ignoring case) were identified. Interestingly, the

ratio of unique new mentions identified relative to the number of abstracts analyzed was

relatively uniform, ranging from a rate of 0.183 new mentions per abstract for the first

0.1% of documents to a rate of 0.038 new mentions per abstract for the last 1% of

documents. This indicated that a substantial rate of new mentions was being maintained

throughout the extraction process.
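
The throughput figures quoted above can be checked with a short calculation. The sketch below uses only the numbers reported in the text; the variable names are illustrative, and the wall-clock estimate assumes the load was spread evenly across all 28 CPUs.

```python
cpu_hours = 1642          # total CPU time reported for the MEDLINE run
n_cpus = 28               # size of the cluster
documents = 15_433_668    # abstracts processed
total_mentions = 9_153_340

cpu_days = cpu_hours / 24                     # total CPU time in days
wall_days = cpu_days / n_cpus                 # assumes perfectly even load
sec_per_doc = cpu_hours * 3600 / documents    # average CPU seconds per abstract
mentions_per_doc = total_mentions / documents

print(f"{cpu_days:.1f} CPU-days, {wall_days:.2f} days on {n_cpus} CPUs")
print(f"{sec_per_doc:.2f} CPU-seconds and {mentions_per_doc:.2f} mentions per abstract")
```

The result agrees with the reported 68.4 CPU-days and 2.44 days of cluster time, and corresponds to well under one CPU-second per abstract on average.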

The 25 mentions found in the greatest number of abstracts by MTag are listed in

Table 1. Six of these phrases (pulmonary, fibroblasts, neoplastic, neoplasm

metastasis, extramural, and abdominal) did not match our definition of malignancy. Of

these, only “extramural” is not frequently associated with malignancy descriptions and is

likely the result of containing character n-grams that are generally indicative of

malignancy mentions. The remaining five phrases are likely the result of the extractor

failing to properly define mention boundaries in certain cases (e.g., tagging “neoplasm”

rather than “brain neoplasm”), or alternatively, shared use of an otherwise indicative

character string (e.g., “opl” in “brain neoplasm” and “neoplastic”) between a true positive

and a false positive.

For comparison, we also determined the corresponding number of articles identified

both by keyword searching of PubMed and by exact string matching of MEDLINE for

each of the 19 most common true malignancy types (Table 1). Overall, MTag’s

comparative recall was 1.076 versus PubMed keyword searching and 0.814 versus string

matching. As PubMed keyword searching uses concept mapping to relate keywords to

related concepts, thus providing query expansion, the document retrieval totals derived

from this approach do not strictly compare to MTag’s approach. Furthermore, the exact

string totals would be inflated relative to the MTag totals, as for example the phrase

“myeloid leukemia” would be counted both for this category and for a category

“leukemia” with exact string matching, but would only be counted for the former phrase

by MTag. To adjust for these discrepancies, for MTag document totals listed in Table 1,

we included documents that were tagged with malignancy mentions that were both strict

syntactic parents and biological children of the phrase used. For example, we included

articles identified by MTag with the phrase “small-cell lung cancer” within the total for

the phrase “lung cancer”.

Comparison of these totals between MTag articles and PubMed keyword searching

revealed that MTag provided high recall for most malignancies. Interestingly, there are

three malignancy mention instances (“carcinoma”, “sarcoma”, “melanoma”) that have

more MTag-identified articles than for PubMed keyword searches. This suggests that a

more formalized normalization of MTag-derived mentions might assist both with

efficiency and recall if employed in concert with the manual annotation procedure

currently employed by MEDLINE. Furthermore, MTag’s document recall compared

quite favorably to exact string matching. Only two of the 25 malignancy mentions

yielded fewer than 60% as many articles via MTag as via PubMed exact string matching

(“bone neoplasms” and “lung cancer”). In these two cases, the concept-mapping PubMed

search identifies the articles with a broader range beyond the search terms. For example,

a PubMed search for the term “lung cancer” identifies articles describing “lung

neoplasms”, while for “bone neoplasms”, articles focusing on related concepts such as

“osteoma” and “sphenoid meningioma” are identified by PubMed. Generally, MTag

recall would be expected to improve further after a subsequent normalization process that

maps equivalent phrases to a standard referent.

To assess document-level precision, we randomly selected 100 abstracts identified by

MTag each for the malignancies “breast cancer” and “adenocarcinoma”. Manual

evaluation of these abstracts showed that all of the articles were directly describing the

respective malignancies. Finally, we evaluated both the 250 most frequently mentioned

malignancies as well as a random set of 250 extracted malignancy mentions from the all-

MEDLINE-extracted set. For the frequently occurring mentions, 72.06% were considered

to be true malignancies; this set corresponds to 0.043% of all malignancy mentions. For

the random set, 78.93% were true malignancies. This suggests that such extracted

mention sets might serve as a first-pass exhaustive lexicon of malignancy mentions.

Comparison of the entire set of unique mentions with the NCI neoplasm list showed that

1,902 of the 5,555 NCI terms (34.2%) were represented in the extracted literature.

Software

MTag is platform independent, written in Java, and requires Java 1.4.2 or higher to

run. The software is freely available under the GNU General Public License at

http://bioie.ldc.upenn.edu/index.jsp?page=soft_tools_MalignancyTaggers.html. MTag

has been engineered to directly accept files downloaded from PubMed and formatted in

MEDLINE format as input. MTag provides output options of text or HTML file versions

of the extractor results. The text file repeats the input file with recognized malignancy

mentions appended at the end of the file. The HTML file provides markup of the original

abstract with color-highlighted malignancy mentions, as shown in Figure 1.

Discussion

We have adapted an entity extraction approach that has been shown to be successful

for recognition of molecular biological entities and have shown that it also performs with

high accuracy for disease labels. It is evident that an F-measure of 0.83 is not sufficient as

a stand-alone approach for curation tasks, such as the de novo population of databases.

However, such an approach provides highly enriched material for manual curators to

utilize further. As was determined by our comparisons with lexical string matching and

PubMed-based approaches, our extraction method demonstrated substantial improvement

and efficiency over commonly employed methods for document retrieval. Furthermore,

MTag appeared to be accurately predicting malignancy mentions by learning and

exploiting syntactic patterns encountered in the training corpus.

Analysis of mis-annotations would likely suggest additional features and/or heuristics

that could boost performance considerably. For example, anatomical and histological

descriptions were frequent among MTag false positive mentions. Incorporation of

lexicons for these entity types as negative features within the MTag model would likely

increase precision. Our training set also does not include a substantial number of

documents that do not contain mentions of malignancy; recent unpublished work from

our group suggests that inclusion of such documents significantly impacts extractor

performance in a positive manner.

Unlike the first iteration of our CRF model [14], the MTag application required only

modest computational effort (several weeks vs. several months) of retraining and

customization time (see Methods). To our surprise, the addition of biological features,

including an extensive lexicon for malignancy mentions, provided very little boost to the

recall rate. This provides evidence that our general CRF model is flexible, broadly

applicable, and if these results hold true for additional entity types, might lessen the need

for creating highly specified extractors. In addition, the need for extensive domain-

specific lexicons, which do not readily exist for many disease attributes, might be

obviated. If so, one approach to comprehensive text mining of biomedical literature might

be to employ a series of modular extractors, each of which is quickly generated and then

trained for a particular entity or relation class. Conversely, it is important to note that the

entity class of malignancy possesses a relatively discrete conceptualization relative to

certain other phenotypic and disease concepts. Further adaptation of our extractor model

for more variably described entity types, such as morphological and developmental

descriptions of neoplasms, is underway. However, the finding that biological feature

addition provided minimal gain in accuracy suggests that further improvements may

require more than merely identifying and adding additional domain-specific

features. Significantly, challenges in rapid generation of annotations for extractor

training, as well as procedures for efficient and accurate entity normalization, still

remain.

When combined with expert evaluation of output, extractors can assist with

vocabulary building for targeted entity classes. To demonstrate feasibility, we extracted

mentions of malignancy for all pre-2006 MEDLINE abstracts. Our results indicate that

MTag can generate such a vocabulary readily and with moderate computational resources

and expertise. With manual intervention, this list could be linked to the underlying

literature records and also integrated with other ontological and database resources, such

as the Gene Ontology, UMLS, caBIG, or tumor-specific databases [23-25]. Since

normalization of disease-descriptive term lists requires considerable specialized

expertise, the role of an extractor in this setting more appropriately serves as an

information harvester. However, this role is important, as such supervised lists are often

not readily available, due in part to the variability with which phenotypic and disease

concepts can be described, and in part to the lack of nomenclature standards in many

cases.

Finally, to our knowledge, MTag is one of the first directed efforts to automatically

extract entity mentions in a disease-oriented domain with high accuracy. Therefore,

applications such as MTag could contribute to the extraction and integration of

unstructured, medically-oriented information, such as physician notes and physician-

dictated letters to patients and practitioners. Future work will include determining how

well similar extractors perform for identifying mentions of malignant attributes with

greater (e.g. tumor histology) and lesser (e.g. tumor clinical stage) semantic and syntactic

heterogeneity.

Conclusions

MTag can automatically identify and extract mentions of malignancy with high

accuracy from biomedical text. Generation of MTag required only moderate

computational expertise, development time, and domain knowledge. MTag substantially

outperformed information retrieval methods using specialized lexicons. MTag also

demonstrated the ability to assist with the generation of a literature-based vocabulary for

all neoplasm mentions, which is of benefit for data integration procedures requiring

normalization of malignancy mentions. Parallel iteration of the core algorithm used for

MTag could provide a means for more systematic annotation of unstructured text,

involving the identification of many entity types and application to phenotypic and

medical classes of information.

Methods

Task definition

Our task was to develop an automated method that would accurately identify and

extract strings of text corresponding to a clinician’s or researcher’s reference to cancer

(malignancy). Our definition of the extent of the label “malignancy” was generally the

full noun phrase encompassing a mention of a cancer subtype, such that “neuroblastoma”,

“localized neuroblastoma”, and “primary extracranial neuroblastoma” were considered to

be distinct mentions of malignancy. Directly adjacent prepositional phrases, such as

“cancer <of the lung>”, were not allowed, as these constructions often denoted ambiguity

as to exact type. Within these confines, the task included identification of all variable

descriptions of particular malignancies, such as the forms “squamous cell carcinoma”

(histological observation) or “lung cancer” (anatomical location), both of which are

underspecified forms of “lung squamous cell carcinoma”. Our formal definition of the

semantic type “malignancy” can be found at the Penn BioIE website [26].

Corpora

In order to train and test the extractor with both depth and breadth of entity mention,

we combined two corpora for testing. The first corpus concentrated upon a specific

malignancy (neuroblastoma) and consisted of 1,000 randomly selected abstracts

identified by querying PubMed with the query terms “neuroblastoma” and “gene”. The

second corpus consisted of 600 abstracts previously selected as likely containing gene

mutation instances for genes commonly mutated in a wide variety of malignancies. These

sets were combined to create a single corpus of 1,442 abstracts, after eliminating 158

abstracts that appeared to be non-topical, had no abstract body, or were not written in

English. This set was manually annotated for tokenization, part-of-speech assignments,

and malignancy named entity recognition, the latter in strict adherence to our pre-

established entity class definition [27, 28]. Sequential dual pass annotations were

performed on all documents by experienced annotators with biomedical knowledge, and

discrepancies were resolved through forum discussions. A total of 7,303 malignancy

mentions were identified in the document set. These annotations are available in corpus

release v0.9 from our BioIE website [29].

Algorithm

Based on the manually annotated data, an automatic malignancy mention extractor

(MTag) was developed using the probability model Conditional Random Fields (CRFs)

[20]. We have previously demonstrated that this model yields state-of-the-art accuracy

for recognition of molecular named entity classes [5, 14]. CRFs model the conditional

probability of a tag sequence given an observation sequence. Let O denote an

observation sequence (a sequence of tokens in the text) and t the corresponding tag

sequence, in which each tag labels the corresponding token with either Malignancy

(meaning that the token is part of a malignancy mention) or Other. CRFs are log-linear

models based on a set of feature functions, $f_i(t_j, t_{j-1}, O)$, which map predicates on

observation/tag-transition pairs to binary values. As shown in the formula below, the

function value is 1.0 when the tag sequence is Malignancy; otherwise (o.w.) it is 0. A

particular advantage of this model is that it allows the effects of many potentially

informative features to be simultaneously weighed. Consider, for example, the following

feature:

This feature represents the probability of whether the token “cancer” is tagged with label

Malignancy given the presence of “lung” as the previous token. Features such as this

would likely receive a high weight, as they represent informative associations between

observation predicates and their corresponding labels.
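
To make the “lung”/“cancer” feature concrete, the following sketch writes it as a Python function and shows how a weighted feature enters a log-linear model. It is an illustration under stated assumptions rather than MTag's implementation: the function names, the single hypothetical weight of 2.0, and the brute-force normalization over all tag sequences are ours (practical CRF toolkits use dynamic programming), and whether the original feature also conditioned on the previous tag is not stated in the text.

```python
import math
from itertools import product

TAGS = ("Malignancy", "Other")


def f_lung_cancer(t_j, t_prev, tokens, j):
    """Binary feature: 1.0 if token j is "cancer", token j-1 is "lung",
    and the current tag is Malignancy; otherwise 0.0.
    t_prev is accepted for the general signature but unused here (assumption)."""
    if (j > 0 and tokens[j - 1].lower() == "lung"
            and tokens[j].lower() == "cancer" and t_j == "Malignancy"):
        return 1.0
    return 0.0


def sequence_score(tags, tokens, features, weights):
    """Unnormalized log-linear score: weighted feature values summed over positions."""
    total = 0.0
    for j in range(len(tokens)):
        t_prev = tags[j - 1] if j > 0 else None
        for f, w in zip(features, weights):
            total += w * f(tags[j], t_prev, tokens, j)
    return total


def conditional_probability(tags, tokens, features, weights):
    """P(t | O), normalized by enumerating every tag sequence (tiny inputs only)."""
    z = sum(
        math.exp(sequence_score(candidate, tokens, features, weights))
        for candidate in product(TAGS, repeat=len(tokens))
    )
    return math.exp(sequence_score(tags, tokens, features, weights)) / z


tokens = ["lung", "cancer"]
weights = [2.0]  # hypothetical; real weights are learned from training data
p = conditional_probability(("Other", "Malignancy"), tokens, [f_lung_cancer], weights)
```

With a positive weight, tag sequences that label “cancer” as Malignancy after “lung” receive more probability mass than those that do not, which is the sense in which an informative feature is rewarded.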

Our CRF algorithm considers many textual features when it makes decisions on

classifying whether a word comprises all or part of a malignancy mention. Word-based

features included whether a word has been identified as being a malignancy mention by

manual annotation of text used as training material. The frequency of each string of 2, 3,

or 4 adjacent characters (character n-grams) within each word of the training text was

calculated, and the differential frequency of each n-gram within words manually tagged

as being malignancy mentions, relative to the frequency of these strings in the

overall text, was considered as a series of features. Orthographic features included the

usage and distribution of punctuation, alternative spellings, and case usage. Domain-

specific features comprised a lexicon of 5,555 malignancies and a regular expression for

tokens containing the suffix –oma. In total, MTag incorporated 80,294 unique features.

All observation predicates, either with or without the biological predicates, were then

applied over all labels, applying a token window of (-1, 1) to create the final set of

features. The MALLET toolkit [30] was used as the implementation of CRFs to build our

model.
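
To make the feature types above more concrete, the following Python sketch extracts character n-grams of length 2 to 4, checks for an -oma suffix, and collects simple word predicates over a (-1, 1) token window. It is an illustration only: the function names and the regular expression are assumptions, not the expressions used in MTag, and the sketch omits the lexicon lookup and the differential n-gram frequency calculation.

```python
import re

# Assumed pattern for tokens ending in "oma"; it will also match
# non-malignancy words such as "stoma", so it is only one weak cue.
OMA_SUFFIX = re.compile(r"\w+oma$", re.IGNORECASE)


def char_ngrams(word, sizes=(2, 3, 4)):
    """Return all substrings of 2, 3, or 4 adjacent characters in a word."""
    word = word.lower()
    return [word[i:i + n] for n in sizes for i in range(len(word) - n + 1)]


def token_features(tokens, j):
    """Collect observation predicates for token j over a (-1, 1) window."""
    feats = set()
    for offset in (-1, 0, 1):
        k = j + offset
        if 0 <= k < len(tokens):
            tok = tokens[k]
            feats.add(f"word[{offset}]={tok.lower()}")
            if OMA_SUFFIX.search(tok):
                feats.add(f"oma_suffix[{offset}]")
    # Character n-grams are taken from the current token only.
    feats.update(f"ngram={g}" for g in char_ngrams(tokens[j]))
    return feats


example = token_features(["malignant", "melanoma", "cells"], 1)
```

Each string in the returned set would correspond to one observation predicate that is then paired with the candidate labels.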

Evaluation

The evaluation set of 432 abstracts comprised 2,031 sentences containing mentions

of malignancy and 3,752 sentences without mentions, as determined by manual

assessment of entity content. The predicted malignancy mention was considered correctly

identified if, and only if, the predicted and manually labeled tags were exactly the same

in content and both boundary determinations. The performance of MTag was calculated

according to the following metrics: Precision (number of entities predicted correctly

divided by the total number of entities predicted), Recall (number of entities predicted

correctly divided by the total number of entities identified manually), and F-measure

[(2*Precision*Recall)/(Precision+Recall)].
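
The three metrics can be computed as in the sketch below, which applies the exact-match criterion described above. The tuple layout (sentence identifier, start, end, text) and the example mentions are assumptions made for illustration.

```python
def evaluate(predicted, gold):
    """Exact-match precision, recall, and F-measure.

    Each mention is a (sentence_id, start, end, text) tuple; a prediction
    counts as correct only if an identical tuple exists in the gold set.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical mentions: one exact match and one miss on each side.
gold = [(1, 3, 5, "lung cancer"), (2, 0, 1, "neuroblastoma")]
pred = [(1, 3, 5, "lung cancer"), (2, 4, 5, "tumor")]
scores = evaluate(pred, gold)
```

Because boundaries must agree exactly, a prediction of "cancer" for a gold mention of "lung cancer" would count as both a false positive and a false negative under this scheme.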

List of Abbreviations Used

CRF, conditional random field

Authors’ contributions

YJ implemented the algorithm to develop MTag and drafted the manuscript. RM

developed the core algorithm and assisted in the implementation. KL developed the

software interface. MM supervised the manual annotation for extractor training and

testing. SC assisted with the tagging of MEDLINE and analysis of the results. ML

oversaw the linguistic aspects of the project. FP developed the theoretical underpinnings

of the algorithm and oversaw the computational aspects of the project. RW participated in

algorithm design and the manual annotation procedure. PW oversaw the biological

aspects of the project, provided overall direction, and finalized the manuscript. All

authors read and approved the final manuscript.

Acknowledgements

The authors thank members of the University of Pennsylvania Biomedical

Information Extraction Group; Kevin Murphy for annotations, discussions and technical

assistance; the National Library of Medicine for access to MEDLINE; and Richard

Wooster for corpus provision. This work was supported in part by NSF grant ITR

0205448 (to ML), a pilot project grant from the Penn Genomics Institute (to PW), and the

David Lawrence Altschuler Endowed Chair in Genomics and Computational Biology (to

PW).

References

1. Collier N, Takeuchi K: Comparison of character-level and part of speech

features for name recognition in biomedical texts. J Biomed Inform 2004,

37:423-435.

2. Finkel J, Dingare S, Manning CD, Nissim M, Alex B, Grover C: Exploring the

boundaries: gene and protein identification in biomedical text. BMC

Bioinformatics 2005, 6 Suppl 1:S5.

3. Hakenberg J, Bickel S, Plake C, Brefeld U, Zahn H, Faulstich L, Leser U,

Scheffer T: Systematic feature evaluation for gene name recognition. BMC

Bioinformatics 2005, 6 Suppl 1:S9.

4. Kinoshita S, Cohen KB, Ogren PV, Hunter L: BioCreAtIvE Task1A: entity

identification with a stochastic tagger. BMC Bioinformatics 2005, 6 Suppl

1:S4.

5. McDonald R, Pereira F: Identifying gene and protein mentions in text using

conditional random fields. BMC Bioinformatics 2005, 6 Suppl 1:S6.

6. Mitsumori T, Fation S, Murata M, Doi K, Doi H: Gene/protein name

recognition based on support vector machine using dictionary as features.

BMC Bioinformatics 2005, 6 Suppl 1:S8.

7. Tamames J: Text Detective: a rule-based system for gene annotation in

biomedical texts. BMC Bioinformatics 2005, 6 Suppl 1:S10.

8. Tanabe L, Wilbur WJ: Tagging gene and protein names in biomedical text.

Bioinformatics 2002, 18:1124-1132.

9. Tanabe L, Xie N, Thom LH, Matten W, Wilbur WJ: GENETAG: a tagged

corpus for gene/protein named entity recognition. BMC Bioinformatics 2005, 6

Suppl 1:S3.

10. Temkin JM, Gilder MR: Extraction of protein interaction information from

unstructured text using a context-free grammar. Bioinformatics 2003,

19:2046-2053.

11. Torii M, Kamboj S, Vijay-Shanker K: Using name-internal and contextual

features to classify biological terms. J Biomed Inform 2004, 37:498-511.

12. Yeh A, Morgan A, Colosimo M, Hirschman L: BioCreAtIvE Task 1A: gene

mention finding evaluation. BMC Bioinformatics 2005, 6 Suppl 1:S2.

13. Zhou G, Shen D, Zhang J, Su J, Tan S: Recognition of protein/gene names from

text using an ensemble of classifiers. BMC Bioinformatics 2005, 6 Suppl 1:S7.

14. McDonald RT, Winters RS, Mandel M, Jin Y, White PS, Pereira F: An entity

tagger for recognizing acquired genomic variations in cancer literature.

Bioinformatics 2004, 20:3249-3251.

15. Chen L, Friedman C: Extracting phenotypic information from the literature

via natural language processing. Medinfo 2004, 11:758-762.

16. Friedman C, Hripcsak G, DuMouchel W, Johnson SB, Clayton PD: Natural

language processing in an operational clinical information system. Natural

Language Engineering 1995, 1:1-28.

17. Hahn U, Romacker M, Schulz S: MEDSYNDIKATE--a natural language

system for the extraction of medical information from findings reports. Int J

Med Inform 2002, 67:63-74.

18. Skounakis M, Craven M, Ray S: Hierarchical Hidden Markov Models for

information extraction. Proceedings of the 18th International Joint Conference

on Artificial Intelligence: 2003; Acapulco, Mexico; 2003.

19. van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA: A text-

mining analysis of the human phenome. Eur J Hum Genet 2006, 14:535-542.

20. Lafferty J, McCallum A, Pereira F: Conditional Random Fields: Probabilistic

Models for Segmenting and Labeling Sequence Data. Proceedings of ICML-

01: 2001; 2001: 282-289.

21. McCallum A: Efficiently Inducing Features of Conditional Random Fields.

UAI '03, Proceedings of the 19th Conference in Uncertainty in Artificial

Intelligence: 2003: Morgan Kaufmann; 2003: 403-410.

22. Malignancy type definitions

[http://bioie.ldc.upenn.edu/mamandel/annotators/onco/definitions.html]

23. The Gene Ontology (GO) project in 2006. Nucleic Acids Res 2006, 34:D322-

326.

24. Bodenreider O: The Unified Medical Language System (UMLS): integrating

biomedical terminology. Nucleic Acids Res 2004, 32:D267-270.

25. Kakazu KK, Cheung LW, Lynne W: The Cancer Biomedical Informatics Grid

(caBIG): pioneering an expansive network of information and tools for

collaborative cancer research. Hawaii Med J 2004, 63:273-275.

26. Kulick S, Bies A, Liberman M, Mandel M, McDonald R, Palmer M, Schein A,

Ungar L, Winters S, White P: Integrated annotation for biomedical

information extraction. Proc of BioLink 2004 2004.

27. Kulick S, Liberman M, Palmer M, Schein A: Shallow semantic annotation of

biomedical corpora for information extraction. Proc ISMB 2003.

28. Penn BioIE corpus release v0.9 [http://bioie.ldc.upenn.edu]

29. MALLET: A Machine Learning for Language Toolkit

[http://mallet.cs.umass.edu/]

30. Bruder E, Passera O, Harms D, Leuschner I, Ladanyi M, Argani P, Eble JN,

Struckmann K, Schraml P, Moch H: Morphologic and molecular

characterization of renal cell carcinoma in children and young adults. Am J

Surg Pathol 2004, 28:1117-1132.

MTag-identified mention   Evaluation       MTag articles   PubMed keyword articles   MEDLINE exact matches
carcinoma                 True Positive    861214          466958                    891996
breast neoplasms          True Positive    129096          133592                    137445
adenocarcinoma            True Positive    166302          208117                    183654
lung neoplasms            True Positive    104176          110378                    111869
pulmonary                 False Positive
breast cancer             True Positive    91446           147286                    128381
lymphoma                  True Positive    182764          158674                    226407
liver neoplasms           True Positive    69513           84529                     84712
fibroblasts               False Positive
skin neoplasms            True Positive    62282           66072                     66105
neoplastic                False Positive
neoplasm metastasis       False Positive
brain neoplasms           True Positive    58729           84636                     63586
stomach neoplasms         True Positive    50019           52566                     55208
prostatic neoplasms       True Positive    48042           49110                     50312
leukemia                  True Positive    163011          190798                    368980
colonic neoplasms         True Positive    41327           47402                     42841
cervical neoplasms        True Positive    40998           41424                     41717
sarcoma                   True Positive    142665          110920                    242654
bone neoplasms            True Positive    33568           73429                     35091
melanoma                  True Positive    79519           61134                     126681
pancreatic neoplasms      True Positive    31598           33775                     33291
extramural                False Positive
lung cancer               True Positive    53601           118679                    66071
abdominal                 False Positive

Table 3-1. Top 25 MTag-identified mentions and their corresponding PubMed keyword and MEDLINE exact string matching search results.

Figure 3-1. Example of the HTML output of MTag for an annotated abstract [31]. Malignancy type mentions identified by MTag are shown in bold, italicized, and blue text.

Chapter 4. A Text Mining Approach for Identifying Genes Implicated in Neuroblastoma Tumorigenesis

Yang Jin, Jane Minturn, Garrett M. Brodeur, Peter S. White

Abstract

The pediatric tumor neuroblastoma can be classified into two subtypes that

commonly exhibit distinctly different clinical outcomes, and which appear to correlate

with the differential activation of either the NTRK1 or NTRK2 neurotrophin signaling

pathways. Previously, we generated neuroblastoma cell lines that constitutively express

either the receptor tyrosine kinase NTRK1 or NTRK2 in an otherwise identical

background. Microarray expression profiling of the cell line models after introduction of

either NTRK1 ligand (NGF) or NTRK2 ligand (BDNF) gave rise to 751 genes

differentially expressed between the two cell lines. We developed a method to re-

prioritize the differentially expressed gene list by extracting and integrating information

regarding genes differentially mentioned in biomedical text articles between NTRK1 and

NTRK2, using a highly specific entity recognition and normalization process. This process identified

twenty-two genes differentially expressed and also differentially mentioned in the

literature. The 22 genes were compared to the larger set of differentially expressed genes

to determine the ability of each group’s genes to be enriched for protein pathways

considered to be critical for neuroblast development. Results demonstrated that text mining

alone or when integrated with the microarray data was capable of further enriching the

genes from the differentially expressed gene set. Expression levels for 11 of the 22 genes

were verified by real-time expression analysis. One of the eleven genes, EFNB3, validated

the biological utility of the text mining process, while another, TYRO3, suggested

inferential power of the process. We conclude that biomedical text mining can help

interpret high throughput data analysis by integrating previously known information.

Introduction

Neuroblastoma is the most common pediatric extracranial solid tumor, accounting

for approximately 9% of all childhood cancers. Neuroblastoma is derived from primitive

cells of the developing sympathetic nervous system. Progression of the disease is

markedly variable, ranging from spontaneous regression of metastatic disease in a small

minority of infants to metastatic disease that grows relentlessly, despite even the most

intensive multimodality therapy, in many children over one year of age (Brodeur GM

2003). Based both upon these observations and a number of tumor classification studies

using a wide range of biological and clinical factors, the presence of at least two

biological subtypes with distinct clinical outcomes has been proposed. Previous studies

have suggested that expression of the neurotrophin receptor NTRK1 (TrkA) is strongly

correlated with favorable outcomes, while expression of NTRK2 (TrkB) conversely

indicates an unfavorable outcome (Nakagawara A et al, 1992; 1993; 1994; Suzuki T et al,

1993; Kogner P et al, 1993; Borrello MG et al, 1993). The high binding-affinity ligands

for NTRK1 and NTRK2 receptors are nerve growth factor (NGF) and brain-derived

neurotrophic factor (BDNF) respectively. The NTRK1 and NTRK2 ligands, receptors,

and, to the extent they are known, the downstream signal transduction pathways are

highly similar in structure and composition. However, it has been well-established that

the NGF/NTRK1 signaling pathway mediates cellular differentiation and/or programmed

cell death in vitro, while the BDNF/NTRK2 pathway enhances neuroblastoma cell

survival (Eggert A et al, 2000; 2002; Ho et al, 2002). It is evident that these two signaling

pathways must activate certain non-overlapping effector molecules and downstream

targets, but the molecules that account for the distinct biological behaviors have not yet

been elucidated. Therefore, further characterization of the differential molecular

responders activated by the two similar neurotrophin signaling pathways might lead us to

understand the mechanisms responsible for different phenotypic behaviors of the two

neuroblastoma subtypes, as well as identifying possible clinical intervention targets.

Array-based gene expression analysis is a recent, commonly employed, and

increasingly effective strategy for identifying differentially active transcripts in a

systematic fashion. However, array methods are well known to suffer from limited

positive predictive value, due in part to the large number of genes being surveyed, and in

part to limitations in the correlation between gene expression and biological activity.

Although single-gene transcript surveillance systems such as real time PCR (RT-PCR)

are more reliable ways to identify differentially expressed genes, as well as to validate

array-based findings, employing these more sensitive techniques to identify more

promising candidates is cost- and effort-prohibitive for most laboratories. Instead,

researchers typically first undertake a high-throughput array-based screen and then select

a small subset of the most differentially expressed genes for validation and further study.

However, this process requires researchers to make subjective decisions that often rely on

their own knowledge rather than more objective methods that consider additional

knowledge sources regarding genes of interest for prioritization.

Biomedical literature is the most complete and updated reservoir for discovered

biomedical knowledge. While this knowledge source is immediately attractive, from an

information content standpoint, for discovery tasks such as the identification of genes

implicated in human diseases, the unstructured nature of biomedical text hinders

systematic approaches to using this information for prioritization tasks. However,

biomedical text mining (BTM) techniques developed by us and others have recently

demonstrated success in extracting target information out of text (Jin Y et al, 2006;

McDonald RT et al, 2004; Rzhetsky A et al, 2004; Hanisch D et al, 2005; BioCreAtIvE).

Effective use of such techniques could provide a large and structured data set of extracted

information that would allow more comprehensive synthesis of published biomedical

knowledge than current, ad hoc methods used by most researchers for literature

awareness. However, BTM techniques are costly to implement and typically yield results

that are inadequately sensitive if applied generally; thus, these systems have been slow to

gain acceptance among biomedical researchers.

In contrast, we and others have had considerable success constructing BTM

applications that are limited in scope but are highly tuned to a particular practical task.

With a previously developed named entity recognition (NER) system, we were able to

identify human gene mentions in literature with high accuracy rates, normalize these to

standard referents, and apply this system to the entire body of MEDLINE documents. In

the current study, we applied this system to help address a particular biomedical research

challenge, the identification of candidate genes associated with a particular differential

signaling paradigm. Our NER system was used to identify MEDLINE articles

differentially “expressing” NTRK1 or NTRK2 relative to each other, and then to identify

other genes co-mentioned in these articles. The BTM results were then combined with

microarray expression analysis results generated in an in vitro expression system where

either NTRK1 or NTRK2 was induced. The combined analysis provided a means to re-

calculate relevance of genes that showed evidence of differential expression in both the

experimental and computational systems. Finally, we experimentally validated and

characterized the plausibility of predicted candidates.

Materials and Methods

Microarray expression profiling

Full-length NTRK1 and NTRK2 were cloned into the retroviral expression vector

pLNCX and transfected into the Trk-null human neuroblastoma cell line SH-SY5Y as

previously described (Eggert A et al, 2000). The NTRK1 and NTRK2 over-expressing

cell lines were serum-starved overnight and treated with NGF or BDNF, respectively, at

37°C for treatment times from 0 to 12 hours. Total RNA was prepared using the RNeasy

Mini kit (Qiagen Inc., Valencia, CA) from NTRK1 and NTRK2-expressing cells exposed

either to 100 ng/ml of NGF or 20 ng/ml of BDNF at time points 0, 1.5, 4, or 12 hrs of

treatment. Microarray experiments were performed with strict adherence to the

manufacturer’s instructions (Affymetrix; Santa Clara, CA). Purified biotin-labeled cRNA

was fragmented, heated to 99°C for 5 min, and then hybridized at 45°C for 16 hours to

HG-U133A arrays. Each data point was sampled with 3 technical and 1 biological

duplicates. Expression intensity value signals corresponding to relative gene expression

were calculated by the Affymetrix MAS v5.0 software package. Intensity values were

then normalized (per gene) to the median of each gene’s expression across the entire

experiment to account for chip-to-chip variation and to facilitate comparisons, using the

RMA express software package (UC Berkeley, CA).
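
The per-gene normalization step can be illustrated with the short Python sketch below. It assumes that "normalized to the median" means dividing each intensity by the gene's median across all arrays; the software used in the study may instead work on log-transformed values, so this is only one plausible reading. The function name and the example values are hypothetical.

```python
from statistics import median


def normalize_per_gene(intensities):
    """Divide each gene's intensities by that gene's median across all arrays.

    intensities maps a gene symbol to a list of intensity values, one per array.
    """
    normalized = {}
    for gene, values in intensities.items():
        m = median(values)
        # Leave genes with a zero median unscaled to avoid division by zero.
        normalized[gene] = [v / m for v in values] if m else list(values)
    return normalized


# Hypothetical intensities for one gene across three arrays.
scaled = normalize_per_gene({"EFNB3": [100.0, 200.0, 400.0]})
```

After this scaling, a value of 1.0 marks the gene's typical level in the experiment, which makes profiles of genes with very different absolute intensities directly comparable.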

Statistical analysis of differential gene expression

Normalized gene expression values were imported to the microarray data analysis

toolkit Multiple Experiment Viewer (MEV) v4.0 (TIGR, Rockville, MD). Paired

significance analysis of microarrays (SAM) was used to calculate differentially expressed

genes between NTRK1 and NTRK2-expressing cell lines. One hundred permutations

were used for multiple testing corrections during the process, and the false discovery rate

was kept at zero.

Text mining analysis

The gene mentions of all pre-2006 MEDLINE abstracts were extracted with a

previously developed named entity recognition (NER) process that uses the machine-

learning technique conditional random fields to build a statistically based entity

recognition model (Jin Y et al, 2006). A previously established rule-based normalization

process was then applied to the extracted gene mentions, which paired human gene

mentions with their corresponding official HGNC gene symbols to serve as standard

referents (Fang H et al, 2006). All genes co-mentioned in a MEDLINE abstract with

NTRK1 or NTRK2 were selected and co-occurrence frequencies were calculated. Genes

were considered to be differentially expressed in the literature if their co-occurrence

frequencies differed at least 5-fold between NTRK1 and NTRK2.
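
A minimal Python sketch of this selection rule is given below. It assumes that the co-occurrence frequency is the raw number of abstracts in which a gene is co-mentioned with NTRK1 or NTRK2, and that a gene with no co-mentions for one receptor qualifies as soon as it has any for the other; both points are assumptions, as are the function name and the placeholder gene symbols.

```python
def preferential_association(co_counts, fold=5.0):
    """Classify genes by relative co-occurrence with NTRK1 versus NTRK2.

    co_counts maps a gene symbol to a pair
    (abstracts co-mentioning NTRK1, abstracts co-mentioning NTRK2).
    Genes below the fold threshold in both directions are omitted.
    """
    result = {}
    for gene, (n1, n2) in co_counts.items():
        if n1 > 0 and n1 >= fold * n2:
            result[gene] = "NTRK1"
        elif n2 > 0 and n2 >= fold * n1:
            result[gene] = "NTRK2"
    return result


# Placeholder symbols and counts, for illustration only.
counts = {"GENE_A": (10, 2), "GENE_B": (3, 2), "GENE_C": (1, 7)}
associated = preferential_association(counts)
```

Lowering the fold argument admits more genes, which is the trade-off examined later when the cut-off is relaxed from 5 to 2.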

Statistical pathway analysis

Functional pathway analysis was performed through the Ingenuity pathway

analysis toolkit (Ingenuity, Redwood City, CA). Neuroblastoma related pathways were

pre-selected and the numbers of pathway-associated genes were determined for different

gene groups. Direct comparisons between groups were made by applying the

hypergeometric statistical test in order to determine the enrichment values of

neuroblastoma-relevant genes for the gene group integrating text mining results. The

Bonferroni step–down correction was used to calculate the multiple-test corrected P-

values for the statistical comparisons.
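
The enrichment test and the multiple-testing correction can be sketched as follows, using only the Python standard library. This is not the Ingenuity implementation: the function names are hypothetical, the test is written as a one-sided upper-tail probability, and the sketch assumes that the "Bonferroni step-down correction" refers to the Holm procedure.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes without replacement from N genes,
    K of which belong to the pathway of interest."""
    upper = min(K, n)
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total


def holm_step_down(p_values):
    """Holm (Bonferroni step-down) adjusted P-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


# Counts taken from Table 4-1: 12 of the 22 Group D genes and 1,979 of the
# 10,459 Group A genes are assigned to the cell death (CD) pathway.
p_cd = hypergeom_sf(12, 10459, 1979, 22)
```

In practice the six pathway P-values for a gene group would be computed with hypergeom_sf and then passed together to holm_step_down.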

RT-PCR validation

NTRK1 and NTRK2-expressing cell lines and total RNA extractions were

prepared as described above. Extracted RNAs were reverse transcribed and amplified into

cDNAs using the TaqMan high-capacity archive kit (Applied Biosystems, Foster City,

CA). Primers and probes for each of 11 selected genes, as well as all other assay reagents

were obtained with TaqMan Gene Expression Assay kit (Applied Biosystems, Foster

City, CA). The TaqMan relative quantification procedure with TaqMan 7500 instrument

was applied to determine the amount of each cDNA, with the housekeeping gene

GAPDH as endogenous control. Each data point had 3 technical replicates.
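
For readers unfamiliar with relative quantification, the sketch below shows the widely used comparative Ct (2^-ddCt) calculation with an endogenous control. The text does not state which calculation was applied, so the use of this method, the assumption of equal amplification efficiency, the function name, and the example Ct values are all assumptions.

```python
def relative_quantity(ct_target, ct_reference, ct_target_cal, ct_reference_cal):
    """Relative expression by the comparative Ct (2^-ddCt) method.

    ct_target, ct_reference: Ct of the gene of interest and of the
    endogenous control (here GAPDH) in the sample.
    ct_target_cal, ct_reference_cal: the same two values in the calibrator
    sample (for example, the 0 hr time point).
    """
    delta_sample = ct_target - ct_reference
    delta_calibrator = ct_target_cal - ct_reference_cal
    return 2.0 ** -(delta_sample - delta_calibrator)


# Hypothetical Ct values: the target crosses threshold one cycle earlier
# in the treated sample than in the calibrator, with GAPDH unchanged.
fold_change = relative_quantity(24.0, 18.0, 25.0, 18.0)
```

A result above 1 indicates higher expression than in the calibrator, and a result below 1 indicates lower expression.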

Results and Discussion

Microarray-based differential gene expression analysis

In order to screen the differential responders for NGF/NTRK1 and BDNF/NTRK2

pathways, NTRK1 and NTRK2 expressing NB cell lines were made and expression

profiles were obtained by microarray experiment after NGF or BDNF exposures

respectively. Using the parameters specified in the Methods section, statistical analysis

identified that across different time points, 751 known genes on the microarray chips

were differentially expressed between NTRK1 and NTRK2-expressing cell lines after

NGF or BDNF exposure. Specifically, 468 genes were found to be differentially over-

expressed in NTRK1 expressing cell lines relative to NTRK2-expressing cell lines, while

283 genes were observed with opposite expression behaviors (Figure 4-1). The 468 genes

(gene set 1) and 283 genes (gene set 2) are listed in the attached appendix A.

Integration of text mining analysis

To prioritize the array-determined differentially expressed genes based on their

functional relevance to the NTRK1 and NTRK2 pathways, we applied our previously developed

gene mention extractor and rule-based normalizer to acquire all gene symbols

co-mentioned with either NTRK1 or NTRK2. Among them, 514 genes were

preferentially associated with NTRK1 (co-occurring at least five times more often with NTRK1

than with NTRK2), and 157 genes with NTRK2 (Figure 4-1). The 514 genes (gene set 3) and the 157

genes (gene set 4) are listed in Appendix A. We identified a total of 22 genes that were

differentially expressed in the same manner by both the expression array and BTM

methods. Of these, eighteen were overexpressed with NTRK1 on the chip and

preferentially associated with NTRK1 in text, and four were overexpressed with NTRK2 on the

chip and preferentially associated with NTRK2 in text (Figure 4-1). We selected the eight most

overexpressed of the 18 NTRK1-associated genes, along with three of the four

NTRK2-associated genes, for experimental validation. We chose a cut-off

of 5 to limit the overlapping genes to a manageable number of

higher-ranked genes for the subsequent RT-PCR experiment. If the cut-off is

lowered to 2, the numbers of genes preferentially associated with NTRK1 or

NTRK2 increase to 632 and 182, respectively, and the number of overlapping genes

increases to 31.
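
The integration step amounts to intersecting the array-derived and literature-derived gene sets within each direction, as in the Python sketch below. The function name and the placeholder symbols GENE_X, GENE_Y, and GENE_Z are hypothetical; the remaining symbols are taken from Table 4-4.

```python
def concordant_genes(array_ntrk1, array_ntrk2, text_ntrk1, text_ntrk2):
    """Genes whose direction agrees between the array and literature analyses."""
    return {
        "NTRK1": set(array_ntrk1) & set(text_ntrk1),
        "NTRK2": set(array_ntrk2) & set(text_ntrk2),
    }


# Small illustrative inputs; the real sets contain 468, 283, 514, and 157 genes.
overlap = concordant_genes(
    array_ntrk1={"EFNB3", "TYRO3", "GENE_X"},
    array_ntrk2={"VSNL1", "GENE_Y"},
    text_ntrk1={"EFNB3", "TYRO3", "GENE_Z"},
    text_ntrk2={"VSNL1", "CAMK4"},
)
```

Requiring agreement in direction, rather than membership in any two lists, is what restricts the result to genes supported consistently by both sources.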

Figure 4-1. Differentially expressed genes on chips and preferentially associated genes in literature

Functional pathway analysis

In order to explore the potential relevance of the derived gene lists to

neuroblastoma, we determined whether these sets were preferentially enriched for

biological pathways that were known to be critical for tumorigenesis and tumor

progression. The following four gene list groups were involved in this comparison:

Group A: The overall gene set: all 10,459 genes represented on the expression

array chip

Group B: Out of Group A, the set of 751 genes differentially expressed

(biologically) in neuroblastoma cell lines constitutively expressing NTRK1 or NTRK2

and induced with corresponding ligand.

Group C: Out of Group A, the 550 genes that were differentially represented in

the literature between NTRK1 and NTRK2

Group D: 22 genes were consistently differentially expressed, either for NTRK1

or NTRK2, by both techniques

Functional pathways assigned to each gene in the above groups were identified

with the Ingenuity pathway analysis toolkit. We concentrated on six specific pathways

considered to be highly relevant to neurotrophic factor signaling in neuroblasts: cell

death, cell growth and proliferation, cell-to-cell signaling and interaction, cell

morphology, nervous system development and function, and cellular assembly and

organization. For each functional group, the number and the proportion of genes assigned

to each of those six pathways were calculated (Table 4-1).

Pathway   Group A (N=10,459)   Group B (N=751)   Group C (N=550)   Group D (N=22)
CD        1979, 18.9%          153, 20.4%        309, 56.2%        12, 54.5%
CGP       2251, 21.5%          154, 20.5%        304, 55.3%        3, 13.6%
CCSI      1492, 14.3%          57, 9.98%         186, 33.8%        7, 31.8%
CM        1068, 10.2%          85, 11.3%         219, 39.8%        7, 31.8%
NSDF      897, 8.58%           108, 19.6%        148, 26.9%        9, 40.9%
CAO       755, 7.22%           103, 13.7%        115, 20.9%        11, 50%

Table 4-1. The number and proportion of genes in each gene group associated with selected pathways. CD, cell death; CGP, cell growth and proliferation; CCSI, cell-to-cell signaling and interaction; CM, cell morphology; NSDF, nervous system development and function; CAO, cellular assembly and organization.

As shown in Table 4-1, when compared to the overall set of genes that were

surveyed for expression levels (Group A), the subset of 751 genes identified as being

significantly differentially expressed by expression array analysis alone (Group B) was

slightly or moderately enriched for four pathways (CD, CM, NSDF, and CAO) and was

actually reduced in the other two pathways (CGP and CCSI). Conversely, the set of genes

differentially mentioned in text (Group C) was highly enriched for all six relevant

pathways relative to the overall set and the expression array-alone set. Correspondingly,

the set of genes differentially expressed in both the microarray and text mining

experiments (Group D) was highly enriched for five of the six pathways; the CGP

pathway did not show enrichment. To illustrate the genes that Ingenuity assigned to the

selected pathways, all the genes in the Group C subsets are listed in Appendix B.

Pathway   Group B   Group C   Group D
CD        0.152     0.0166    <0.001
CGP       0.746     0.0216    0.728
CCSI      0.999     0.0227    0.009
CM        0.146     0.0109    0.001
NSDF      <0.001    <0.001    <0.001
CAO       <0.001    <0.001    <0.001

Table 4-2. Significance testing for six relevant protein pathways. Shown are P-values calculated in comparisons between Groups B, C, or D relative to Group A for each of the six pathways. Pathway abbreviations are listed in Table 4-1.

In order to calculate statistical significance of the six selected pathway gene

enrichments for the three subset groups, compared to the overall gene Group A, a

hypergeometric test was applied and the corresponding P-values were calculated (Table

4-2). The results show that both the text-mining Group C (all 6 pathways) and the

combined analysis Group D (5 out of 6 pathways) gene sets were enriched from the

overall set for selected pathways with statistical significance. Interestingly, the expression

array Group B gene set was only enriched for the NSDF and CAO pathways. To

determine whether the combined analysis Group D gene subset was further enriched from

the expression array Group B gene set, Group B was used as a reference set to directly

determine whether Group D showed significant enrichment (Table 4-3).

Pathway   Group D
CD        <0.001
CGP       0.727
CCSI      0.00940
CM        0.0124
NSDF      <0.001
CAO       0.0117

Table 4-3. Significance testing for six relevant protein pathways. Shown are P-values calculated in a comparison between Group D relative to Group B for each of the six pathways. Pathway abbreviations are listed in Table 4-1. The Bonferroni step-down correction was applied to account for multiple testing.

Table 4-3 shows that the P-values for 5 out of 6 pathways are significant,

demonstrating the relevant gene enrichment capability of the integrated analysis method

compared to expression array analysis alone. This experiment suggests that at least in this

experimental paradigm, our text mining process is capable of enriching gene sets for

genes that are members of functional pathways critical for tumorigenic and tumor

progression processes in neuroblastoma.

RT-PCR Experimental Validation

To determine the authenticity of genes identified by both text mining and

expression analysis, we selected 11 genes for further validation of expression using

RT-PCR. As in the expression array experiments, gene expression levels were

measured at four time points in cell lines stably expressing transfected NTRK1 or

NTRK2, after applying the corresponding neurotrophic factors to the media. Generally,

the RT-PCR results confirmed and more precisely defined the expression level

differences observed between the NTRK1 and NTRK2 expressing cell lines by the

microarray analysis. Specifically, expression level differences were concordant for 10 of

11 genes (Table 4-4). The gene GNAS was the lone outlier; GNAS was identified as

preferentially over-expressed in NTRK2-induced cell lines relative to NTRK1-induced

lines by RT-PCR, but the opposite was true both in the expression array and text mining

experiments.

Gene      Microarray   Literature   RT-PCR
TBC1D8    NTRK2*       NTRK2        NTRK2
VSNL1     NTRK2        NTRK2        NTRK2
CAMK4     NTRK2        NTRK2        NTRK2
RPS6KA1   NTRK1        NTRK1        NTRK1
EFNB3     NTRK1        NTRK1        NTRK1
B3GAT1    NTRK1        NTRK1        NTRK1
GNAS      NTRK1        NTRK1        NTRK2
NEFH      NTRK1        NTRK1        NTRK1
NEFL      NTRK1        NTRK1        NTRK1
INA       NTRK1        NTRK1        NTRK1
TYRO3     NTRK1        NTRK1        NTRK1

Table 4-4. Differential behavior of 11 highly differentially expressed genes, as determined by three independent approaches.

* The designation NTRK2 indicates that the overall expression level of this gene is higher in NTRK2-expressing, BDNF-induced cell lines than in NTRK1-expressing, NGF-induced cell lines for the “Microarray” and “RT-PCR” columns. For the “Literature” column, it indicates that this gene is preferentially associated with NTRK2 rather than NTRK1 in biomedical text. The converse holds for the NTRK1 designation.

The objective of this study was to identify immediate-to-early response genes

expressed differentially between the two NTRK signaling pathways that might explain

the different growth behaviors of NTRK1- and NTRK2-expressing cell lines. Thus, we

characterized the RT-PCR-based expression differences more closely. One gene that

exhibited a striking and rapid expression induction was EFNB3. As demonstrated in

Figure 4-2, RT-PCR data shows that the expression level of EFNB3 was substantially up-

regulated in NTRK1-expressing cell lines, with a two-fold increase in expression

observed from 0 to 4 hours after NGF application. Subsequently, by 12 hours expression

had decreased to the original level. Conversely, in the NTRK2-expressing cell line, the

activation of signaling by BDNF had little effect on the expression level of EFNB3 in

these cells.

Figure 4-2. EFNB3 RT-PCR gene expression patterns in NTRK1- (TrkA, blue) and NTRK2- (TrkB, pink) expressing cell lines. Error bars are not shown. Variation for each data point was less than ±5‰.

EFNB3 (ephrin-B3) belongs to a family of ligands that bind to Eph family receptor

tyrosine kinases and has been implicated in axon guidance and other patterning processes

during vertebrate nervous system development (Bergemann AD et al, 1998). Remarkably,

previous studies have demonstrated that EFNB3 exhibits growth-suppressive activity

against neuroblastoma cells in vitro. Along with NTRK1, EFNB3 has been identified as a

gene whose expression is preferentially and significantly associated with low tumor stage

and favorable clinical outcomes in neuroblastoma primary tumors (Tang XX et al, 1999,

2000, 2004). The RT-PCR experiment shown in Figure 4-2 revealed the different responses

of EFNB3 expression after the activation of NTRK1 and NTRK2 signaling pathways.

The up-regulation of EFNB3 mRNA in the NTRK1-expressing cell line indicates that

NGF/NTRK1 signaling directly or indirectly activates the expression of EFNB3, while

BDNF/NTRK2 signaling has no substantial effect in this time range.

[Figure 4-3 plot omitted: TYRO3 expression (y-axis ticks 0 to 1.4) for the TrkA and TrkB series]

Figure 4-3. TYRO3 RT-PCR gene expression patterns in NTRK1-expressing (blue) and NTRK2-expressing (pink) cell lines. Error bars are not shown. Variation for each data point was less than ±5‰.

Another gene with sizable differential expression was TYRO3. As seen in Figure 4-3, TYRO3 expression was up-regulated by 20% in response to NGF-NTRK1 signal transduction but remained unchanged under BDNF-NTRK2 signaling from 0 to 1.5 hours after neurotrophin application. After 1.5 hours, TYRO3 expression decreased in both cell lines, but the expression differential between the two cell lines continued to increase, reaching 50% by 12 hours. TYRO3 is a transmembrane receptor tyrosine kinase that is activated by the ligand GAS6. The exact biological function of this signaling pathway is yet to be determined. However, prior studies indicate that GAS6 promotes human fetal oligodendrocyte survival and maturation by receptor activation and downstream signaling via the PI3-kinase/Akt pathway, in the absence of cell proliferation (Shankar SL et al, 2003). Additional evidence suggests that GAS6 may contribute to cell adhesion, immune responsiveness, and osteoclastic bone resorption through the MAPK signaling pathway (Crosier KE et al, 1997; Heiring C et al, 2004).

Additionally, both the light and heavy neurofilament polypeptides (NEFL and NEFH) were up-regulated in NTRK1-expressing cell lines but down-regulated in the NTRK2-expressing cell line early after neurotrophin application (0 to 1.5 hr). These expression changes might be expected to lead to changes in the cytoskeleton associated with the differential cellular growth and differentiation status of the two cell lines. Indeed, addition of NGF induces neurite outgrowth in many neuroblastoma cell lines, and neurite outgrowth has been shown to be positively correlated with neurofilament expression in neuroblastoma (Linnala A et al, 1998). Finally, because of time constraints, we performed only three technical replicates for each data point in the RT-PCR validation. Ideally, biological replicates with independently extracted RNAs from different batches of transfected cell lines should be analyzed in order to minimize the possibility of errors.

Researchers are confronted with a constant acceleration in the accumulation of biomedical knowledge, captured both in structured, readily generated forms such as whole genome expression profiles and in unstructured information exemplified by the biomedical literature. As such, researchers are increasingly in need of novel means to capture, manage, and productively synthesize this information for specific biomedical applications. Systematic data mining approaches such as the text mining tools illustrated in this study can assist with ranking tasks using previously discovered but disparate facts. This study was designed to integrate literature-based knowledge with the analysis of high-throughput array data. Our results suggest that an unbiased text mining-based method is capable not only of enriching for genes relevant to a particular biological process but also of providing a relevance ranking that may be significant for identifying plausible candidate genes involved in differential processes.

The EFNB3 gene co-occurred with NTRK1 in the literature in five articles but did not co-occur with NTRK2 at all. According to our hypothesis, this differential association in biomedical text can be a strong indication that EFNB3 might play a specific role in differential signaling between NTRK1 and NTRK2. In this case, the EFNB3 results can be taken as a validation of the precision of the methods employed, but they are an expected result, both in terms of the literature reference and in terms of our verification of published expression correlations between NTRK1 and neuroblastoma. However, the previously published reports did not examine NTRK2 expression. Thus, our approach provided an example of literature-based discovery by generating a higher relevance ranking for EFNB3 as a differential signaling candidate than the expression array data alone indicated. More experimentation is required to determine a potential role for EFNB3 in neuroblast differentiation. The fact that only one co-occurring paper showed an indirect association of TYRO3 with NTRK1 indicates a lack of previous investigation of TYRO3 in normal and malignant neuroblast development or in neurotrophin signaling pathways. However, the possible roles of TYRO3 in cell proliferation and survival, as well as its differential response to NTRK1 signaling demonstrated by RT-PCR, make further studies worthwhile. To put the power of text mining into perspective, among the 1576 genes that co-occurred with NTRK1 and the 3882 articles describing NTRK1, it would not be easy to identify EFNB3 manually (5 co-occurring papers), and it would be even harder to identify TYRO3 (only 1 co-occurring paper).
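
The differential co-occurrence comparison described above can be sketched as follows. This is a minimal illustration, assuming each article has already been reduced to a set of normalized gene symbols. The function names and the toy corpus are hypothetical (the toy counts simply mirror the five EFNB3 and one TYRO3 co-occurring papers mentioned in the text), and the sketch does not reproduce the pipeline used in this study.

```python
from collections import Counter


def cooccurrence_counts(articles, anchor):
    """Count, per gene, the articles in which it appears together with `anchor`."""
    counts = Counter()
    for genes in articles:
        if anchor in genes:
            counts.update(g for g in genes if g != anchor)
    return counts


def differential_cooccurrence(articles, anchor_a, anchor_b):
    """Rank genes by (articles shared with anchor_a) minus (articles shared with anchor_b)."""
    with_a = cooccurrence_counts(articles, anchor_a)
    with_b = cooccurrence_counts(articles, anchor_b)
    genes = (set(with_a) | set(with_b)) - {anchor_a, anchor_b}
    rows = [(g, with_a[g], with_b[g]) for g in genes]
    # Largest preference for anchor_a first; ties broken alphabetically.
    return sorted(rows, key=lambda r: (-(r[1] - r[2]), r[0]))


# Toy corpus (hypothetical): each article is a set of gene symbols.
toy_articles = (
    [{"NTRK1", "EFNB3"} for _ in range(5)]
    + [{"NTRK1", "TYRO3"}]
    + [{"NTRK2", "BDNF"} for _ in range(2)]
    + [{"NTRK1", "NTRK2", "BDNF"}]
)

ranked = differential_cooccurrence(toy_articles, "NTRK1", "NTRK2")
for gene, n_with_ntrk1, n_with_ntrk2 in ranked:
    print(gene, n_with_ntrk1, n_with_ntrk2)
```

Sorting on the raw count difference is only one possible choice. With a real corpus, a normalized score would probably be preferable, because NTRK1 and NTRK2 are described in very different numbers of articles.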

Because the text mining processes employed in this study are highly task-specific and perform with high accuracy, we demonstrated that even a relatively straightforward text mining application, when combined with molecular data analyses, appears to make better predictions. This process is easily scaled to many genes, so that many gene interactions could be surveyed simultaneously for larger data sets or combinations of data sets. Thus, with little additional effort, one could use the literature to "pre-annotate" all gene probes so that they could be sorted by literature findings with ease. If additional entity classes are added, the capabilities multiply geometrically. For example, we can create an information matrix integrating genes with malignancy attribute classes. The gene-clinical stage relation would then identify the gene sets associated with early and late stages, in addition to the gene-gene associations.
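
One way such an information matrix might be organized is sketched below. The sketch assumes that every article has already been annotated with the gene mentions and malignancy attribute mentions it contains. The function names, the placeholder gene symbols (GENE_A and so on), and the attribute labels are hypothetical and carry no biological meaning.

```python
from collections import defaultdict


def build_information_matrix(annotated_articles):
    """Map gene -> attribute -> number of articles that mention both."""
    matrix = defaultdict(lambda: defaultdict(int))
    for genes, attributes in annotated_articles:
        # Use sets so that repeated mentions within one article count once.
        for gene in set(genes):
            for attr in set(attributes):
                matrix[gene][attr] += 1
    return matrix


def genes_for_attribute(matrix, attr, min_articles=1):
    """List genes co-mentioned with `attr` in at least `min_articles` articles."""
    return sorted(g for g, row in matrix.items() if row.get(attr, 0) >= min_articles)


# Toy annotations (hypothetical): (gene mentions, attribute mentions) per article.
toy_annotations = [
    (["GENE_A", "GENE_B"], ["early stage"]),
    (["GENE_A"], ["early stage"]),
    (["GENE_C", "GENE_D"], ["late stage"]),
    (["GENE_C"], ["late stage", "early stage"]),
]

matrix = build_information_matrix(toy_annotations)
early_genes = genes_for_attribute(matrix, "early stage", min_articles=2)
late_genes = genes_for_attribute(matrix, "late stage")
print(early_genes, late_genes)
```

Each additional entity class (for example, genomic variation or malignancy type) would add another axis of the same kind, which is the sense in which the number of queryable relations grows quickly as classes are added.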

Co-occurrence-based information extraction can be further improved in a variety of ways, such as by using proximity-based measures. Generally, article-level co-occurrence can achieve high recall but lacks the ability to distinguish different types of relations or to adequately rank such associations by relevance. For example, when we extracted all genes co-occurring with NTRK1, genes related directly and indirectly to NTRK1 were extracted without distinction. As NLP-based information extraction methods continue to advance, it is likely that deeper computational understanding of the syntactic and semantic representations of text will lead to more successful and precise biomedical applications. Recent work in identifying and extracting entity relations shows promise in this regard (Jenssen TK et al, 2001; Rzhetsky A et al, 2004).
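
As a rough illustration of a proximity-based measure, the sketch below distinguishes sentence-level from article-level co-occurrence. It assumes plain-text abstracts and exact symbol matching. The naive sentence splitter, the function names, and the placeholder symbols GENE_X, GENE_Y and GENE_Z are hypothetical simplifications, not the extractors developed in this work.

```python
import re


def sentence_split(text):
    """Naive splitter on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def mentions(chunk, symbol):
    """True if `symbol` occurs in `chunk` as a whole token."""
    return re.search(r"\b" + re.escape(symbol) + r"\b", chunk) is not None


def cooccurrence_level(text, gene_a, gene_b):
    """Return 'sentence', 'article', or None: the tightest level at which both occur."""
    for sentence in sentence_split(text):
        if mentions(sentence, gene_a) and mentions(sentence, gene_b):
            return "sentence"
    if mentions(text, gene_a) and mentions(text, gene_b):
        return "article"
    return None


# Toy abstract (hypothetical text with placeholder gene symbols).
toy_abstract = (
    "NTRK1 signaling was studied. "
    "Expression of GENE_X increased with NTRK1. "
    "GENE_Y was also measured."
)

level_x = cooccurrence_level(toy_abstract, "NTRK1", "GENE_X")
level_y = cooccurrence_level(toy_abstract, "NTRK1", "GENE_Y")
level_z = cooccurrence_level(toy_abstract, "NTRK1", "GENE_Z")
print(level_x, level_y, level_z)
```

Weighting sentence-level pairs above article-level pairs is one simple way to favor direct associations over indirect ones, at some cost in recall.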

References

Bergemann AD et al: Ephrin-B3, a ligand for the receptor EphB3, expressed at the

midline of the developing neural tube. Oncogene 16(4):471-80. (1998).

BioCreAtIvE: Critical Assessment of Information Extraction systems in Biology.

http://biocreative.sourceforge.net/index.html

Borrello MG et al: TRK and RET protooncogene expression in human neuroblastoma

specimens: high-frequency of TRK expression in non-advanced stages. Intl. J. Cancer.

54: 540-545. (1993).

Brodeur GM: Neuroblastoma: biological insights into a clinical enigma. Nature Rev.

Cancer 3: 203-216. (2003).

Crosier KE et al: New insights into the control of cell growth; the role of the Axl family.

Pathology 29: 131-135. (1997).

Eggert A et al: Expression of the neurotrophin receptor TrkA down-regulates expression

and function of angiogenic stimulators in SH-SY5Y neuroblastoma cells. Cancer Res. 62:

1802-1808. (2002).

Eggert A et al: Molecular dissection of TrkA signal transduction pathways mediating

differentiation in human neuroblastoma cells. Oncogene 19: 2043-2051. (2000).

Fang H et al: Human Gene Name Normalization Using Text Matching with

Automatically Extracted Synonym Dictionaries. BioNLP (2006).

Hanisch D et al: Rule-based protein and gene entity recognition. BMC Bioinformatics, 6,

S14 (2005).

Heiring C et al: Ligand recognition and homophilic interactions in Tyro3. J. Bio. Chem.

279(8): 6952-6958. (2004).

Ho R et al: Resistance to chemotherapy mediated by TrkB in neuroblastomas. Cancer

Res. 62: 6462-6466. (2002).

Jenssen TK et al: A literature network of human genes for high-throughput analysis of

gene expression. Nature Genet. 28: 21-28. (2001).

Jin Y et al: Automated recognition of malignancy mentions in biomedical literature.

BMC Bioinformatics 7: 492. (2006).

Kogner P et al: Coexpression of messenger RNA for TRK protooncogene and low

affinity nerve growth factor receptor in neuroblastoma with favorable prognosis. Cancer

Res. 53: 2044-2050. (1993).

Linnala A et al: Neuronal differentiation in SH-SY5Y human neuroblastoma cells

induces synthesis and secretion of tenascin and upregulation of a integrin receptors. J.

Neurosci. Res. 49: 53-63. (1998).

McDonald RT et al: An entity tagger for recognizing acquired genomic variations in

cancer literature. Bioinformatics 22(20): 3249-3251. (2004).

Nakagawara A et al: Inverse relationship between trk expression and N-myc

amplification in human neuroblastomas. Cancer Res. 52: 1364-1368. (1992).

Nakagawara A et al: Association between high levels of expression of the Trk gene and

favorable outcome in human neuroblastomas. N. Engl. J. Med. 328: 847-854. (1993).

Nakagawara A et al: Expression and function of TRK-B and BDNF in human

neuroblastomas. Mol. Cell. Biol. 14: 759-767. (1994).

Rzhetsky A et al: GeneWays: a system for extracting, analyzing, visualizing, and

integrating molecular pathway data. J. Biomed. Inform. 37: 43-53. (2004).

Shankar SL et al: The growth arrest-specific gene product Gas6 promotes the survival of

human oligodendrocytes via a phosphatidylinositol 3-kinase-dependent pathway. J

Neurosci. 23(10):4208-18. (2003).

Suzuki T et al: Lack of high-affinity nerve growth factor receptors in aggressive

neuroblastomas. J. Natl. Cancer Inst. 85: 377-384. (1993).

Tang XX et al: High level expression of EPHB6, EFNB2, and EFNB3 is associated with

low tumor stage and high TrkA expression in human neuroblastomas. Clin. Cancer Res.

5: 1491-1496. (1999).

Tang XX et al: Implications of EPHB6, EFNB2, and EFNB3 expressions in human

neuroblastoma. PNAS 97(20): 10936-10941. (2000).

Tang XX et al: Favorable neuroblastoma genes and molecular therapeutics of

neuroblastoma. Clin. Cancer Res. 10: 5837-5844. (2004).

Chapter 5. General Conclusions and Future Directions

The increasing demand for transforming unstructured biomedical research

literature into a form amenable to computational analysis provides both opportunities and

challenges for biomedical text mining. This dissertation started with a basic aspect of

BTM research, the definition of target biomedical entities. The complexity and criticality

of this endeavor have been underappreciated by the text mining community, which has

largely approached this problem from a computational linguistics perspective. Through

an extensive and iterative process, literature-based definitions were developed as they

emerged from a consensus-building process by annotators and domain experts. In

addition to the semantic challenges caused by the conceptual complexity of biomedical

entities, syntactical challenges were also dealt with by establishing specific annotation

guidelines in order to define distinct textual boundaries for each entity class. Using this

process, entity classes for genes, RNAs, and proteins; genomic variations; types of

malignancy; and phenotypic and clinical attributes of malignancy were carefully

established with distinct boundaries semantically and syntactically. Training data

generated through manual annotation in select corpora with those refined definitions

allowed the development of automated NER extractors, based on machine learning

algorithms, with accuracy rates satisfactory for specialized application by biomedical

researchers. Entity mentions were then extracted from pre-2006 MEDLINE abstracts and

normalized to unique identifiers through a rule-based computational procedure. Finally,

this thesis focused on BTM’s discovery capabilities by integrating text mining results

with high throughput data analysis to prioritize genes involved in differential cell

developmental signaling in neuroblastoma. Protein pathway analysis showed that the

addition of literature-based information was able to effectively re-prioritize functionally

relevant genes identified by microarray expression analysis. Experimental validation of

these results demonstrated that these re-prioritized genes were verifiable candidates

worthy of additional experimental characterization. This text mining-integrated method

provides researchers with a systematic and objective way to analyze experimental data and

to better hypothesize targets for subsequent research based upon previously discovered

and published knowledge.

With the steadily accelerating pace of biotechnological development and knowledge accumulation, there is an increasing need for well-performing BTM systems for a variety of purposes, including information extraction, document retrieval, and literature-based discovery. As end users struggling to manage and synthesize an overwhelming amount of research information, biologists would be prudent to collaborate closely with computer scientists on every front, including the adaptation of BTM research to assist with solving biomedical problems. This dissertation has focused upon investigations that attempt to build BTM systems with biological input infused throughout the process. Accordingly, as an essential building block of many BTM tasks, the development of our named entity recognition system incorporated biological perspectives, which has been instrumental for the success of biomedical applications built upon this process, such as our gene-centric information retrieval system FABLE (FABLE).

The performance of entity extractors developed by our approach depends heavily on the quality and quantity of training data. We have spent a substantial amount of time creating manually annotated corpora in order to develop high-performance extractors. However, further research should be conducted to determine the scope and size of training data that make the process most cost-effective. Normalization algorithms that incorporate disambiguation schemes are also desirable for improving entity recognition performance, since it is difficult for a purely rule-based approach to resolve ambiguous matches between mentions and unique identifiers. Effective disambiguation approaches would likely need to survey distant contextual information in order to determine the correct match (Chen L et al, 2005).
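
A context-based disambiguation step of the kind suggested here could, under simplifying assumptions, look like the sketch below. It assumes that each candidate identifier for an ambiguous mention comes with a small set of characteristic context terms. The identifiers (HYPO:1, HYPO:2), the term sets, and the overlap score are hypothetical illustrations, not the normalization procedure used in this thesis or the method of Chen L et al.

```python
import re


def disambiguate(document_text, candidates):
    """Pick the candidate identifier whose context terms best overlap the document.

    `candidates` maps identifier -> set of lowercase context terms.
    Returns None when nothing overlaps or when the best score is tied,
    so that unresolved cases are flagged rather than guessed.
    """
    words = set(re.findall(r"[a-z0-9]+", document_text.lower()))
    scores = {cid: len(terms & words) for cid, terms in candidates.items()}
    best = max(scores.values(), default=0)
    winners = [cid for cid, score in scores.items() if score == best]
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]


# Hypothetical candidates for one ambiguous gene symbol.
toy_candidates = {
    "HYPO:1": {"kinase", "neurotrophin", "receptor"},
    "HYPO:2": {"enzyme", "peroxide", "liver"},
}

resolved = disambiguate(
    "The receptor binds neurotrophin and activates a downstream kinase cascade.",
    toy_candidates,
)
unresolved = disambiguate("No informative context is present.", toy_candidates)
print(resolved, unresolved)
```

Returning None for ties and for zero overlap keeps ambiguous mentions out of the normalized output, which trades recall for precision.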

Deeper parsing of the entity relations is another natural extension of this thesis

research. With the incorporation of linguistic analysis that includes deeper syntactic and

semantic processing (such as the parse tree and semantic role labeling systems developed

at Penn), entity relationships could be further mined with more precision and granularity.

For example, extraction of specific causal relationships between genes and malignancy

types from biomedical literature would be an important advance in application.

As the BTM tasks described above mature, it will become possible to construct a structured and queryable cancer knowledgebase that integrates the most complete and up-to-date genomic, phenotypic, and clinical information from the published biomedical record. Interpreting experimental data against such a knowledgebase should lead to more reliable and more frequent literature-based discovery and hypothesis generation.

Appendices

A. Genes that were differentially expressed on the array chips (gene sets 1 and 2) and genes that were preferentially associated in the literature (gene sets 3 and 4)

Gene Set 1 Gene Set 2 Gene Set 3 Gene Set 4

ABCA3 AASS ABCA4 AAABCG1 ABLIM1 ABCB6 ABCA3

ABHD11 ABT1 ABCC1 ACVR1BABLIM3 ACCN2 ABL1 ACVR2AACHE ADCY9 ACACA ADD1ACOT7 AFF4 ACCN3 ADORA1

ACTL6B AGA ADAM17 ADRA1BACTN1 ALDH18A1 ADCY1 APLP1

ADAM12 AMMECR1 ADRA2B ATN1ADAM23 ANGPT1 ADRM1 ATXN3AEBP1 ANGPTL2 AGTRL1 BHLHB4

AES ANTXR1 AHSA1 BMPR2AGPAT7 AQP3 AK1 BRD8AGRIN ARL5A AKT1 C1QL1

ALK ASCC3 ALB CA2ALMS1 ATP2A2 ALDH3A2 CAMK4

AMT ATP5F1 ALPK2 CBLANK1 ATP5G3 AMIGO2 CCND2AP3M2 BAG1 ANPEP CD160

APBA2BP BAK1 ANXA2 CD63APC2 BAZ1A ANXA5 CDCA5APEH BCL11A AP3B1 CDK5R1APLP1 BDH2 APAF1 CDKN1CASF1B BID APC CETN1ASMTL BTN2A1 APOE CMD1BASPHD1 BTN2A2 AQP1 CORTASRGL1 BZW2 AR CREMATP13A2 C12orf11 ARHGAP24 CRHATP1A3 C12orf5 ARHGAP5 CRPATP8B3 C14orf156 ARHGEF7 CTF1B3GAT1 C14orf166 ARTN CYP19A1B3GAT3 C20orf121 ASCL1 DBH

B4GALNT1 C21orf91 ASGR2 DCNB4GALT5 C2orf25 ATF2 DDIT3BAHCC1 C4orf9 ATF7IP DEFA1

BAI1 C5orf13 ATP7A DLG3BAI2 C6orf120 AVP DLG4

BAZ2A C8orf41 AXL E2F1BCORL1 C9orf82 B3GAT1 EFNA5

BDH1 CALD1 B4GALT1 EGR2BEX1 CAMK4 BAD EMX2

BRSK2 CBFA2T2 BBS2 EPHA3BSN CCNC BCAR1 EPHA4

C14orf79 CCNJ BCL2 ERGC18orf10 CD164 BCR ERP29C1orf159 CD99 BDKRB1 ETV1C1orf164 CDK6 BMP2 EXOSC1C1orf21 CHGB BMP6 F2RC1orf66 CHM BRAF F3C1QL1 CHRNA3 BSG FMR1

C20orf103 CLIC4 C15orf15 FSHRC20orf12 CNBP C18orf10 GAD1C20orf149 CNN3 C21orf33 GCLCC20orf195 COL4A1 C2orf28 GFPT1C20orf20 COPS2 C2orf3 GLTSCR2C20orf46 CPSF6 C3 GNAI3C22orf9 CRABP2 C7 GPSM2C2orf17 CSRP2 CAD GRIA4C2orf24 CTNNA1 CAMK1 GRIN2AC3orf18 CTR9 CAPN2 GRIN2BC3orf32 CTSC CASP2 GRIN3AC6orf134 CYFIP1 CASP8 GRK1

C7 DAP CASP9 GRM1C7orf43 DDEF1 CAV1 GRM6

CACNA1B DDIT4 CAV3 GSRCACNB1 DDOST CCDC6 GUK1CACNG2 DDX17 CCL14 HDAC2CACNG4 DDX3X CCL2 HFECALML4 DECR1 CCL4 HMGB2

CAMK2N1 DIMT1L CCL5 HRBCAMKV DKK1 CCND1 HSF1CAMTA1 DLL3 CCND3 HSPA4LCAPN5 DMD CCRK HSPA8

CARD10 DNAJB6 CCT4 HSPH1CBR3 DPYSL3 CD1A HTR1A

CCBL1 DUSP1 CD38 ID2CCDC92 DUSP22 CD40 IER5CCNA1 ECHDC1 CD40LG IL6STCCNE1 EEF1A1 CD44 IRAK2CD81 EFNA4 CD68 ITM2C

CDC2L6 EFNB2 CD79A ITPR1CDC42 EIF2S1 CD80 ITPR2

CDC42BPB EIF2S3 CD86 KCNJ3CDKN2D EIF3S1 CDC2 KCNJ5

CDO1 EIF3S6 CDC25C KCNJ6

CELSR2 EIF5 CDH1 KIF3ACENPM ELAVL1 CDK4 LATS1CENTD3 ELK3 CDKN1B LMO4

CHD5 ETF1 CEACAM4 MAOACKAP1 ETFB CEBPZ MAP2K2CLCN6 ETFDH CEL MAPRE1CLSTN2 FAM111A CENPJ MC4RCNNM1 FAM3C CHKA MPL

CNTNAP2 FAM98A CHL1 MPOCOL8A2 FAT CHN1 MYH6CPT1A FKBP14 CHRM2 MYO1A

CRABP1 FKBP1A CHUK NBNCRMP1 FLNC CILD2 NCK2

CRTAC1 FLRT1 CNP NEUROD1CSPG3 FLRT3 CNR1 NFATC4CSPG5 FN1 CNR2 NNATCTBS FSTL1 COL11A2 NP

CTNNA2 FYCO1 CP NPATCXCR4 FZD2 CRK NR1D1CYLN2 GALNT10 CRKL NR3C2CYP1B1 GATA4 CRS NSFDAAM1 GGCX CSF1R NTRK1DDX25 GHITM CSF2 NYX

DENND2A GLUL CSF2RA ODC1DEPDC5 GNL3 CSF3 OTCDGKD GPM6B CSK OTX1DHPS GPR125 CTBS PABPC1DLG4 GSPT1 CTNNA1 PABPN1

DNAJB5 H2AFY CTSB PAX3DNM1 HEATR1 CXCL1 PDIA4DPP6 HEBP2 CXCL12 PHOX2ADPYD HEMK2 CYLN2 POMCDRAP1 HERPUD1 DCC POU3F1DRD2 HOMER3 DDR1 PPARADSTN HOXC10 DDX41 PPM1LDUSP8 HSP90B1 DECR1 PSDDUT IGF1R DHDDS PSEN1

DYNLT3 IL13RA1 DLX2 PSEN2EDG4 IPO7 DNAH8 PSMD8EDG7 ISL1 DNAJA2 RAB40B

EFHD2 ITPA DNM2 RABEP2EFNB3 JAM2 DOK1 RARBEGFL7 JMJD1C DR1 RGS4EGFL9 JMJD2C DRG2 RTCD1

ELAVL4 KCNJ8 DUSP1 SDEML2 KDELC1 ECEL1 SHANK2ENO2 KIAA0020 EDG1 SI

EPB41L1 KIAA0247 EDG2 SLC12A5

EPB41L4B KLHL20 EDG5 SLC1A3EPB49 LANCL2 EDN3 SLC30A7EPOR LGR5 EEF1A1 SLC6A1

EPS8L1 LZTS1 EEF1A2 SLC6A3ETNK2 MAGEA1 EFNA2 SLC6A4

F12 MAGEA5 EFNB2 SRIFAAH MAGOH EFNB3 ST3GAL6

FAM105A MAN1A2 EFS SULTFAM65A MAX EGF SYN1FBXO2 MBD2 EGFR SYT4FEZ1 MCCC2 EIF2C2 TBC1D8FEZ2 MEIS1 ELA2 TERTFGD1 METAP2 ELAVL3 TIMP2

FKBP1B METT10D ELK3 TPH1FKBP4 MFAP4 ENPP1 TRPC3

FLII MINA EPHA1 TSC1FLOT1 MNAT1 EPHB1 TSC2FNBP1 MOBK1B EPHB6 TWIST1FNDC4 MPHOSPH10 EPO TXNRD2

FOXRED2 MPZL1 ERBB2 VAMP2FRS3 MRPL42 ERBB3 VCAM1FRY MRPL44 ERBB4 VSNL1FUT1 MTMR1 EREG WARS

G6PC3 MTO1 ERVK5 XYLT1GABRB3 MYO1E ESR2

GALE MYO5C ETV5GAP43 MYO6 ETV6

GAS2L1 NAT1 EVI1GDAP1L1 NBN EWSR1

GDF1 NCOA4 F2GDI1 NDUFA4 F7

GFRA3 NEBL FANCBGNAO1 NECAP2 FBN1GNAS NEDD4 FBS1GNAZ NOC3L FCER1AGNB5 NOLA1 FDFT1GNG3 NOTCH2 FDPS

GPR153 NUDT4 FESGPR19 PDE4B FGF3

GPRASP1 PDGFRL FGF4GRIK5 PDIA3 FGFR4GRIN1 PDLIM3 FHGRK6 PELI1 FKBP1A

GTPBP2 PEX3 FKBP4GTSE1 PHACTR2 FKBPL

GUCA1A PHLDA1 FLT1H2AFX PIK3R3 FLT4HDAC6 PKP2 FN1

HECTD3 PLA2G12A FOLH1HMBS PLAGL1 FOSL1

HMG20B PLEKHC1 FOXM1HPCAL4 PLS3 FOXO1A

HPS6 PLSCR1 FOXO3AHRASLS3 PNRC2 FRS2

HRH3 PON2 FSCN1HTR1E POPDC2 FUSHTRA2 PPA1 FUT3HUWE1 PPP2R1B FZD3

HYI PPP4R2 GAB1IBRDC3 PRKD3 GABRA1ICAM2 PTPRD GAKIFT122 PTTG1IP GAS6IGSF4 QKI GCK

IL27RA RAB27A GEMINA RABEP1 GGH

IQCK RAMP1 GGT1IQSEC1 RAP2C GH1IQSEC3 RB1 GH2ITSN1 RBM3 GIFJAG2 RBPMS GIPC2JPH3 RBPSUH GIPC3JUND RCN1 GJA1

KCNA3 RETSAT GJA8KCNB1 RIT1 GNASKCNC1 RND3 GOLGA5KCNH6 RNF13 GPIKCNK12 RNF130 GPR88KCNK3 RPL23 GPTKCNQ1 RPS21 GRHL3KCNQ2 RPS9 GRIA1

KIAA0649 RSL1D1 GRIK1KIAA1539 RSU1 GRLF1

KIF13B RWDD1 GSTP1KIF1A RYBP GSTZ1KIF3C RYK GTF2BKIF5C SCLY GTF3AKLF11 SCYE1 HCCSKNS2 SDC4 HCK

L1CAM SERBP1 HDLAGE3 SERPINF1 HDAC1LIG1 SERTAD2 HES1

LIN7B SF3B1 HGFLPHN1 SH3BGRL HK2LRP8 SHOX2 HLA-E

LRRFIP2 SKP2 HM13LRRN5 SLC31A2 HPSE

LSM14B SLC33A1 HRASLY6E SLC39A14 HSN2

MADD SLC39A8 IARSMAGED1 SMAD5 ICAM1

MAP7 SMARCC1 IER3MAPK11 SNAP23 IFI44MAPK12 SNX13 IFNA1

MAPK8IP2 SP110 IFNA17MAPT SSBP1 IFNB1

MARK4 STAT5B IKBKBMAST1 STEAP1 IL13RA2

ME3 SUCLG2 IL17FMLH3 SYNCRIP IL1A

MMP15 SYPL1 IL2MMP24 TBC1D8 IL3MPP2 TCEB3 IL4RMPP3 TCF7L1 IL6

MRPL2 TCF7L2 INAMSH6 TES IRAK1MSN TFB2M IRAK3

MTMR2 TFDP2 IRF1MTSS1 TGFBR2 IRS1MYD88 TGIF ISL1MYO1D TGIF2 ITGB1MYOZ3 TH1L IVMYT1L TJAP1 JAK2NAGA TLE4 JAK3

NCAM1 TMCO1 JUNBNCOA6 TMEM109 KCND2NDE1 TMEM33 KDR

NEDD4L TMEM39A KITNEFH TMEM43 KITLGNEFL TOMM20 KLF7

NELL1 TOP1 KLK3NFASC TOR1AIP1 KNG1

NLGN4X TP53 KRASNMNAT2 TRAM1 LARGE

NMU TRIM5 LATNOS1AP TRMU LBX1NPDC1 TROVE2 LCS1NRCAM TSPAN12 LGALS1NRGN TSPAN13 LGALS3NRXN1 TSPAN6 LGI1NRXN2 TXNDC1 LOXNTRK1 VIM LRP1

NUDCD3 VPS54 LRPAP1NUP210 VSNL1 LRRC21OAS3 WDR73 LTA

OBSL1 WDR77 LTFODF2 YIPF6 MAG

OGDHL ZFAND5 MAGED1OGG1 ZMPSTE24 MAGED2

OLFM1 ZNF238 MAKOSBPL2 ZZZ3 MAP2K1OXCT1 MAP3K1PACRG MAP3K11

PAFAH1B3 MAPK10PAK3 MAPK8PAK4 MAPK9PAOX MARCKSPAQR4 MAS1

PARD6A MBTPS1PARP6 MDKPARVA METPAX5 MGAT3

PCTK1 MIAPCYT2 MIB1PDE2A MICAPDE9A MKI67PDLIM7 MKKS

PER3 MLLT7PEX14 MMEPFKL MMP2PFKP MMP3

PGBD5 MMP9PGLS MNG1PHF1 MOS

PHTF1 MRGPRFPIK3CD MSN

PIM1 MUSKPITPNM1 MYBPIWIL1 MYCPKN1 MYLKPKP4 MYO1E

PLXNA2 MYOD1PNKP NANS

PNMA2 NBL1PNMT NCOA1

PNPLA4 NCOA4PORCN NDNPPAP2C NEDD9PPEF1 NEFHPPM1G NEFLPRKD2 NEU1PRNP NF1

PRPSAP1 NFKB1

PSD NFKBIAPTGER3 NFKBIBPTOV1 NFKBIL1PTPRN NGFRPTPRN2 NMBPXMP2 NME1

R3HDM2 NOLC1RAB15 NPC1RAB3A NPY1RRAB3B NR1I2RAB6B NRGN

RABAC1 NRKRAD23A NT5ERAD51L3 NTRK2

RAGE NUMBRAI2 OCMRALY OED

RAMP2 OSMRAP1GAP P2RX1RASGRP2 P2RX3

RGL2 PAF1RGS11 PAHRIMS2 PCNARIT2 PDGFBRND2 PDGFRL

ROGDI PDIA3RPH3A PDK1RPP25 PFN1RPRC1 PGM2

RPS6KA1 PGRRTN1 PHBRTN2 PHB2

RUFY3 PIM1RUSC1 PITX2RUSC2 PKD1

SAC3D1 PKLRSAMD14 PLA2G1BSAP130 PLAU

SCAMP5 PLEKSCG5 PLG

SCN2A2 PLXNB1SCN3A PLXNB2SCN3B PNNSCRN1 PPARGSDC2 PPBP

SEC61A2 PPP1R13L

SEPT5 PPP1R1B

SERPINA5 PRKCA

SETD3 PRKCB1SEZ6L PRKCESH2B3 PRKCZ

SHANK2 PRKD1SHB PRKG1SHC2 PRLSIX3 PRLRSIX6 PROZ

SLC18A3 PRPHSLC22A17 PRRXL1SLC25A1 PSAP

SLC2A4RG PSMA5SLC43A3 PSMB6SLC4A3 PSPNSLC8A2 PTBP1

SMARCC2 PTCH2SMARCD3 PTEN

SMPD3 PTGDRSMTN PTGER1

SNAP25 PTGFRSNAP91 PTHSNAPC2 PTK2

SNCB PTNSNX27 PTPN1SOD1 PTPN13

SORBS1 PTPN6SOX13 PTPRBSPA17 PTPRCSPAG4 PTPRFSPAG6 PTPROSPRY2 PTX3

SPTAN1 PXNSRD5A1 PYCARDSRPK2 PZPSTMN2 RAB7STMN4 RAC1STUB1 RAF1STX1A RAP1ASTX2 RAPGEF1

STXBP1 RAPGEF5STXBP5L RASA1SULT4A1 RASSF1

SYN1 RB1SYNGR3 RBL1

SYP RDXSYT17 RELSYT5 RELATAZ RGS19

TBX3 RHOATCEA2 RHOT2

TCEAL2 RIPK2TCF25 RNGTTTEAD4 ROCK1TFR2 ROR1THRA ROR2THY1 RP21TLE2 RPE

TM2D3 RPS6KA1TMCC1 RPS6KB1

TMEM121 RUNX1TMEM153 RUNX2TMEM22 S11TMEM24 SCN10ATMEM28 SCN11ATMOD1 SCN9A

TNIK SELETNNI3 SELLTREX1 SEMA3FTRIM62 SEMA4DTRIP10 SEMA5ATTC9 SGK

TUBB2B SHBTUBB2C SIT1TUBB4 SLC22A4TULP4 SLC6A2TYRO3 SLC7A1

UBB SLCO6A1UBE2C SMPD1UIMC1 SMPD2UNC119 SNRPGUNC13A SOAT1

USP4 SORT1VAMP1 SP1VAMP2 SPAG1VAT1 SPHK1

VEGFB SPRWBP2 SPTLC1

WDR62 SRA1YBX2 STAT1YPEL1 STAT3

YWHAB STAT5AZBTB22 STAT5BZMAT4 STATHZNF274 STSZUBR1 SYCP3

T

TACR1TBP

TBXA2RTEK

TFAP2ATFDP3

TFGTFPTTFRC

TGTGFATGFB2TIE1

TKTL1TLX1TLX3

TMEM37TNC

TNFRSF10CTNFRSF25TNFSF12

TNS1TP53TPBGTPM1TPM3TPOTPR

TPSAB1TRAF6

TRITRIM33TRK1

TRPV1TSHRTTN

TYMSTYR

TYRO3UGCGVAV1VWFWT1YES1

ZBTB25

B. Ingenuity determined pathway relevant genes for Group C (preferentially associated genes)

CD CGP CCSI CM NSDF CAO

ABCC1 ABL1 ABL1 ABL1 ABL1 ABL1

ABL1 ACACA ADAM17 ADAM17 ADAM17 ADAM17

ACACA ACVR1B AKT1 ADRA1B ADORA1 ADRA1B

ACVR1B ADRA1B ALB AKT1 AKT1 AKT1

ADORA1 AKT1 AMIGO2 ALB ALDH3A2 APC

AKT1 ALB ANXA2 ANPEP AMIGO2 APOE

ALB ANXA2 ANXA5 ANXA2 APAF1 ARHGAP24

AMIGO2 APC AP3B1 ANXA5 APLP1 ARHGEF7

ANPEP APOE APC APC APOE ARTN

APAF1 AR APOE APOE ARTN AVP

APC ARHGAP5 AR AR ASCL1 AXL

APOE ARHGAP24 ARHGAP5 ARHGAP5 ATN1 BCL2

AR ARTN AVP ARHGEF7 ATXN3 BMP2

ARTN ASCL1 AXL AVP BCL2 CAMK4

ATF2 ATF2 B4GALT1 AXL BMP2 CAV1

ATN1 AVP BCL2 B3GAT1 BRAF CD44

ATXN3 AXL BDKRB1 BCL2 CAMK4 CDC2

AXL BCL2 BMP2 BMP2 CDC2 CDH1

BAD BDKRB1 BSG BRAF CDK5R1 CDK5R1

BCL2 BMP2 CASP8 CAPN2 CDKN1B CDKN1B

BMP2 BMP6 CAV1 CASP9 CDKN1C CENPJ

BMP6 BMPR2 CBL CAV1 CHL1 CHL1

BRAF BRAF CCL2 CAV3 CHRM2 CHN1

BSG BSG CCL5 CBL CNR1 CNP

C7 CAPN2 CCND1 CCDC6 CRH CNR1

CAMK4 CASP2 CCND2 CCL2 CRK CRK

CASP2 CASP8 CCND3 CCL5 CASP2 CRKL

CASP8 CASP9 CD38 CCND1 CASP9 CSF2

CASP9 CAV1 CD40 CCND2 CCND2 CXCL12

CAV1 CAV3 CD44 CCND3 CSF3 DCC

CBL CBL CD63 CD40 CTF1 E2F1

CCDC6 CCDC6 CD86 CD44 CTNNA1 EDG1

CCL5 CCL2 CD1A CD40LG CXCL12 EDG2

CCND1 CCL5 CD40LG CDC2 DCC EDG5

CCND2 CCND1 CDH1 CDH1 DLG4 EFNB2

CCND3 CCND2 CDK5R1 CDK5R1 DLX2 EFNB3

CCRK CCND3 CHL1 CDK4 E2F1 EGF

CD38 CCRK CNR1 CDKN1B EDG1 EGFR

CD40 CD38 CRH CDKN1C EDG2 EPHA4

CD44 CD40 CRKL CHL1 EFNA5 ERBB2

CD86 CD44 CRP CHN1 EFNB2 ERBB3

CD160 CD63 CSF2 CNR1 EFNB3 ERBB4

CD40LG CD86 CSF3 CREM EGF F2

CDC2 CD160 CSF1R CRH EGFR F2R

CDC25C CD40LG CSK CRK EGR2 F7

CDH1 CDC2 CTNNA1 CRKL ELAVL3 FGF3

CDK4 CDC25C CXCL12 CSF1R EMX2 FGF4

CDK5R1 CDH1 DCC CSF2 ERBB2 FGFR4

CDKN1B CDK4 DCN CSF3 ERBB3 FKBP4

CDKN1C CDK5R1 DDR1 CSF2RA ERBB4 FN1

CHKA CDKN1B E2F1 CSK ESR2 FSCN1

CHL1 CDKN1C EDG1 CTNNA1 F2 GAB1

CNP CHKA EDN3 CXCL12 FGF3 GJA1

CNR1 CNP EFNA2 CYP19A1 FGF4 HCK

CNR2 CREM EFNA5 DCC FGFR4 HD

CREM CRH EFNB2 DCN FKBP4 HGF

CRH CRK EGF E2F1 FMR1 HRAS

CRK CRP EGFR EDG2 FN1 ICAM1

CRP CSF2 EGR2 EDG5 FOXO1A IL1A

CSF2 CSF3 ELA2 EFNA5 GAB1 IL2

CSF3 CSF1R EPHA3 EFNB2 GAD1 IL6

CSF1R CSF2RA EPO EFNB3 GJA1 INA

CSF2RA CSK ERBB2 EGF GRIK1 ITGB1

CSK CTF1 ERBB3 EGFR GRLF1 KITLG

CTF1 CTNNA1 ERBB4 EPHA4 GSTP1 KNG1

CTNNA1 CTSB F2 EPO HD KRAS

CTSB CXCL12 F3 ERBB2 HES1 LATS1

CXCL12 CYP19A1 F7 ERBB3 HGF MAG

CYP19A1 DBH F2R ERBB4 HMGB2 MAOA

DBH DCC FES ESR2 HRAS MAP3K1

DCC DCN FGF4 ETV6 IL3 MAPK8

DCN DDIT3 FKBP1A EVI1 IL6 MAPRE1

DDIT3 DDR1 FLT1 EWSR1 IL1A MARCKS

DDR1 DNAJA2 FN1 F2 ITGB1 MET

DDX41 DUSP1 FUT3 F7 KITLG MMP2

DLX2 E2F1 GNAS F2R KLF7 MSN

DUSP1 EDG1 HCK FBN1 KNG1 NDN

E2F1 EDG2 HD FES KRAS NEFH

ECEL1 EDG5 HES1 FGF3 LBX1 NEFL

EDG1 EDN3 HGF FGF4 LGI1 NFATC4

EDG2 EFNB2 HPSE FGFR4 LRPAP1 NGFR

EDG5 EFNB3 HRAS FKBP4 LTA NTRK1

EEF1A1 EFS HTR1A FLT1 MAG NTRK2

EEF1A2 EGF ICAM1 FMR1 MAOA P2RX1

EFNB2 EGFR IFNB1 FN1 MAP2K1 PDIA3

EGF EGR2 IKBKB FOSL1 MAPK8 PFN1

EGFR ELA2 IL2 FOXM1 MAPK9 PLAU

EGR2 EMX2 IL3 FOXO1A MAPK10 PLXNB1

ELA2 EPHB6 IL6 FOXO3A MARCKS PLXNB2

EPHB6 EPO IL1A FZD3 MDK PRKCE

EPO ERBB2 IL4R GAB1 MET PRPH

ERBB2 ERBB3 IRS1 GEM MMP2 PSAP

ERBB3 ERBB4 ITGB1 GJA1 MOS PSD

ERBB4 EREG KITLG GPI MSN PTK2

ERG ERG KLK3 GRLF1 MUSK PTPRF

ESR2 ESR2 KNG1 GRM1 NBN RAC1

ETV6 ETV6 LGALS1 HCK NDN RASSF1

EVI1 EVI1 LGALS3 HD NEFL RDX

EWSR1 EWSR1 LOX HES1 NEUROD1 RHOA

F2 F2 LRPAP1 HGF NFATC4 SELE

F3 F2R LTA HRAS NFKBIA SEMA3F

F7 FBN1 MAG ICAM1 NGFR SEMA4D

F2R FBRS MAP2K1 ID2 NPC1 SEMA5A

FBN1 FES MAP3K1 IKBKB NR1D1 STAT3

FCER1A FGF3 MAPK8 IL2 NRGN SYN1

FDFT1 FGF4 MAS1 IL3 NTRK1 TNC

FGF3 FGFR4 MDK IL6 NTRK2 TNS1

FGF4 FKBP1A MET IL1A NUMB TP53

FKBP1A FKBP4 MME IL4R PAX3 TPM1

FLT1 FLT1 MMP2 IRAK1 PDIA3 TPR

FN1 FN1 MMP9 IRF1 POMC TTN

FOSL1 FOSL1 MPO IRS1 PRKG1 VCAM1

FOXM1 FOXM1 MSN ITGB1 PRPH VWF

FOXO1A FOXO1A MYB JUNB PSAP WT1

FOXO3A FOXO3A NEDD9 KDR PSEN1

FSHR FSCN1 NEUROD1 KITLG PSPN

FUS FUS NFKB1 KNG1 PTK2

FZD3 FZD3 NFKBIB KRAS PTN

GAB1 GAB1 NTRK1 LGALS1 PTPRF

GCLC GAD1 NUMB LGALS3 RAC1

GFPT1 GAK PDGFB LOX RAF1

GJA1 GEM PDIA3 LRPAP1 RB1

GNAS GH2 PGR MAG RBL1

GPI GJA1 PHB MAP2K1 RDX

GRIN2A GPI PITX2 MAP2K2 REL

GRM1 GSTP1 PKD1 MAP3K1 RELA

GSR GTF2B PLAU MAP3K11 RHOA

GSTP1 HCK PLG MAPK8 RUNX1

GSTZ1 HDAC1 PLXNB1 MAPK9 SEMA3F

HCK HDAC2 PNN MAPK10 SEMA5A

HD HES1 POMC MARCKS SLC1A3

HDAC1 HGF PPARG MAS1 SMPD1

HDAC2 HRAS PPBP MET SPHK1

HES1 HTR1A PRKCE MGAT3 STAT3

HFE ICAM1 PRKG1 MLLT7 SYN1

HGF ID2 PRL MMP2 TFAP2A

HK2 IER3 PRLR MMP9 TGFA

HRAS IFNB1 PSEN1 MOS TGFB2

HSPA8 IKBKB PSEN2 MPL TIMP2

ICAM1 IL2 PTGER1 MSN TLX1

ID2 IL3 PTH MYB TLX3

IER3 IL6 PTK2 MYO1A TNC

IFNB1 IL13RA2 PTN MYOD1 TP53

IKBKB IL1A PTPN6 NBN TRPV1

IL2 IL4R PTPRB NDN TWIST1

IL3 IRF1 PTPRC NEDD9 VCAM1

IL6 IRS1 PTPRF NEUROD1 WT1

IL1A ISL1 RAC1 NFATC4

IL4R ITGB1 RAF1 NFKB1

IRAK1 JUNB RAPGEF1 NFKBIB

IRF1 KDR RASSF1 NGFR

IRS1 KIF3A REL NPC1

ISL1 KITLG RELA NTRK1

ITGB1 KLK3 RHOA NTRK2

ITPR1 KNG1 RPS6KB1 NUMB

ITPR2 KRAS SELE ODC1

JUNB LATS1 SELL PAX3

KCNJ6 LBX1 SEMA4D PDGFB

KDR LGALS1 SEMA5A PDIA3

KIF3A LGALS3 SIT1 PFN1

KITLG LGI1 SLC6A2 PGR

KLF7 LOX SLC6A3 PIM1

KNG1 LRPAP1 SLC6A4 PITX2

KRAS LTA STAT1 PLAU

LATS1 MAG STAT3 PPARG

LGALS1 MAGED1 STAT5A PPP1R13L

LGALS3 MAGED2 STAT5B PRKCE

LRPAP1 MAP2K1 STATH PRKCZ

LTA MAP2K2 TEK PRKG1

MAGED1 MAP3K1 TERT PRL

MAOA MAP3K11 TFRC PRPH

MAP2K1 MAPK8 TGFA PSAP

MAP2K2 MAPK9 TGFB2 PTGFR

MAP3K1 MAPK10 TIMP2 PTK2

MAP3K11 MAPRE1 TNC PTN

MAPK8 MAS1 TP53 PTPN6

MAPK9 MDK TPH1 PTPRF

MAPK10 MET TRAF6 RAC1

MDK MGAT3 TSC1 RAF1

MET MICA TSC2 RAPGEF1

MGAT3 MLLT7 TWIST1 RARB

MLLT7 MMP2 TYR RASSF1

MME MMP3 TYRO3 RB1

MMP2 MMP9 VCAM1 RBL1

MMP3 MPL VSNL1 RDX

MMP9 MPO VWF REL

MPL MYB RELA

MPO MYH6 RGS4

MSN MYOD1 RHOA

MUSK NBN RPS6KA1

MYB NCOA1 RPS6KB1

MYLK NCOA4 RUNX1

MYOD1 NDN SELL

NBN NEU1 SEMA3F

NCOA1 NEUROD1 SEMA4D

NCOA4 NFATC4 SEMA5A

NDN NFKB1 SMPD1

NEDD9 NFKBIA STAT3

NEFL NFKBIB TACR1

NEUROD1 NGFR TBP

NFATC4 NP TBXA2R

NFKB1 NPY1R TEK

NFKBIA NR1D1 TERT

NFKBIB NT5E TFG

NGFR NTRK1 TGFA

NP NTRK2 TGFB2

NPC1 NUMB TNC

NR1D1 ODC1 TNS1

NRGN PABPN1 TP53

NTRK1 PAX3 TPM1

NTRK2 PCNA TPR

NUMB PDGFB TSC1

ODC1 PDIA3 TSC2

PAX3 PFN1 TWIST1

PCNA PGR TYRO3

PDGFB PHB VCAM1

PDIA3 PIM1 VWF

PGR PITX2 YES1

PHB PKD1

PHOX2A PLA2G1B

PIM1 PLAU

PKD1 PLG

PLAU PNN

PLG POMC

POMC PPARA

POU3F1 PPARG

PPARA PPBP

PPARG PRKCB1

PPP1R13L PRKCE

PRKCB1 PRKCZ

PRKCE PRKD1

PRKCZ PRKG1

PRKD1 PRL

PRL PRLR

PRLR PSAP

PRPH PSEN1

PSAP PTGER1

PSEN1 PTH

PSEN2 PTK2

PTGER1 PTN

PTH PTPN6

PTK2 PTPN13

PTN PTPRC

PTPN6 PTPRF

PTPN13 PTPRO

PTPRC PTX3

PTPRF RAC1

PTPRO RAF1

PYCARD RAPGEF1

RAC1 RARB

RAF1 RASSF1

RAPGEF1 RB1

RARB RBL1

RASSF1 REL

RB1 RELA

RBL1 RGS4

RDX RHOA

REL RIPK2

RELA RPS6KA1

RGS4 RPS6KB1

RHOA RUNX1

RHOT2 RUNX2

RIPK2 SELE

ROR2 SEMA3F

RPS6KA1 SEMA4D

RPS6KB1 SGK

RUNX1 SHB

RUNX2 SLC1A3

SEMA3F SLC6A4

SGK SMPD2

SLC1A3 SP1

SLC6A3 SPHK1

SMPD1 STAT1

SMPD2 STAT3

SOAT1 STAT5A

SP1 STAT5B

SPHK1 TBC1D8

STAT1 TBP

STAT3 TEK

STAT5A TERT

STAT5B TFAP2A

TACR1 TFG

TBP TFRC

TERT TG

TFAP2A TGFA

TFPT TGFB2

TFRC TIMP2

TGFA TLX1

TGFB2 TLX3

TIE1 TNC

TIMP2 TP53

TNC TPM1

TNFRSF10C TPO

TP53 TPR

TPM1 TRAF6

TRAF6 TSC1

TSC1 TSC2

TSC2 TSHR

TWIST1 TYMS

TYMS TYR

TYR TYRO3

TYRO3 UGCG

UGCG WARS

VAMP2 WT1

VCAM1

WT1

YES1

Bibliography

Alako BT, Veldhoven A, Baal S, Jelier R, Verhoeven S, Rullmann T, Polman J, Jenster

G: CoPub Mapper: mining MEDLINE based on search term co-publication. BMC

Bioinformatics. 6:51. (2005).

BioCreAtIvE: Critical Assessment of Information Extraction systems in Biology.

http://biocreative.sourceforge.net/index.html

Cairns J: The interface between molecular biology and cancer research. Mutat Res, 462:

423-428. (2000).

Chang JT, Schutze H, Altman RB: GAPSCORE: finding gene and protein names one

word at a time. Bioinformatics, 20: 216-225 (2004).

Chen L, Friedman C: Extracting phenotypic information from the literature via natural

language processing. Medinfo, 11(Pt 2):758-762. (2004).

Chen L, Liu H, Friedman C: Gene name ambiguity of eukaryotic nomenclatures.

Bioinformatics, 21: 248-256. (2005).

Cohen KB, Fox L, Ogren PV, Hunter L: Corpus design for biomedical natural language

processing. Proceedings of the ACL-ISMB workshop on linking biological literature,

ontologies and databases, pp. 38-45. Association for Computational Linguistics. (2005).

Collier N, Nobata C, Tsujii J: Extracting the names of genes and gene products

with a hidden Markov model. In Proceedings of the 18th International Conference on

Computational Linguistics (COLING’2000), Saarbrucken, Germany. (2000).

Collier N, Takeuchi K: Comparison of character-level and part of speech features for

name recognition in biomedical texts. J Biomed. Inform. 37(6):423-435. (2004).

Daraselia N, Yuryev A, Egorov S, Novichkova S, Nikitin A, Mazo I: Extracting human

protein interactions from MEDLINE using a full-sentence parser. Bioinformatics, 20:

604-611. (2004).

DiGiacomo RA, Kremer JM, Shah DM: Fish-oil dietary supplementation in patients with

Raynaud's phenomenon: a double-blind, controlled, prospective study. Am. J. Med.

86:158-164. (1989).

Ding J, Berleant D, Nettleton D, Wurtele E: Mining Medline: abstracts, sentences, or

phrases? Pac. Symp. Biocomput. 7: 326-337. (2002).

FABLE: http://fable.chop.edu/

Finkel J, Dingare S, Manning CD, Nissim M, Alex B, Grover C: Exploring the

boundaries: gene and protein identification in biomedical text. BMC Bioinformatics, 6

Suppl 1:S5. (2005).

Friedman C, Hripcsak G, DuMouchel W, Johnson SB, Clayton PD: Natural language

processing in an operational clinical information system. Natural Language Engineering,

1:1-28. (1995).

Freimer N, Sabatti C: The human phenome project. Nature Genet, 34: 15-21. (2003).

Fundel K, Guttler D, Zimmer R, Apostolakis J: A simple approach for protein name

identification: prospects and limits. BMC Bioinformatics, 6, S15 (2005).

The Gene Ontology (GO) project in 2006. Nucleic Acids Res 2006, 34(Database

issue):D322-326.

GENIA: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ (2004).

Glenisson P, Anta P, Mathys J, Moreau Y, De Moor B: Evaluation of the vector space

representation in text-based gene clustering. Pac. Symp. Biocomput. 8: 391-402. (2003).

Glenisson P, Coessens B, Van Vooren S, Mathys J, Moreau Y, De Moor B: TXTGate:

profiling gene groups with text-based information. Genome Biol. 5: R43. (2004).

Hahn U, Romacker M, Schulz S: MEDSYNDIKATE--a natural language system for the

extraction of medical information from findings reports. Int J Med Inform 2002,

67(1-3):63-74.

Hakenberg J, Bickel S, Plake C, Brefeld U, Zahn H, Faulstich L, Leser U, Scheffer T:

Systematic feature evaluation for gene name recognition. BMC Bioinformatics, 6 Suppl

1:S9. (2005).

Hanisch D, Fundel K, Mevissen HT, Zimmer R, Fluck J: ProMiner: rule-based protein and gene

entity recognition. BMC Bioinformatics, 6, S14. (2005).

Hunter L, Cohen KB: Biomedical language processing: what’s beyond PubMed? Mol.

Cell 21:589-594. (2006).

Jensen LJ, Saric J, Bork P: Literature mining for the biologist: from information retrieval

to biological discovery. Nat. Rev. Genet. 7: 119-129. (2006).

Jenssen TK, Laegreid A, Komorowski J, Hovig E: A literature network of human genes for

high-throughput analysis of gene expression. Nature Genet. 28: 21-28. (2001).

Jin Y, McDonald RT, Lerman K, Mandel MA, Carroll S, Liberman MY, Pereira FC,

Winters RS, White PS: Automated recognition of malignancy mentions in

biomedical literature. BMC Bioinformatics, 7: 492. (2006).

Kulick S, Bies A, Liberman M, Mandel M, McDonald R, Palmer M, Schein A, Ungar L,

Winters S, White P: Integrated annotation for biomedical information extraction. Proc of

BioLink (2004).

Lafferty J, McCallum A, Pereira F: Conditional Random Fields: Probabilistic Models for

Segmenting and Labeling Sequence Data. In: Proceedings of ICML-01: 282-289. (2001).

Lander ES, Linton LM, Birren B, Nusbaum C, et al: Initial sequencing and analysis of the

human genome. Nature, 409: 860-921, (2001).

Malignancy type definitions:

[http://bioie.ldc.upenn.edu/mamandel/annotators/onco/definitions.html]

McDonald RT, Winters RS, Mandel, Jin Y, White PS and Pereira F. An entity tagger for

recognizing acquired genomic variations in cancer literature. Bioinformatics 22(20):

3249-3251. (2004).

McDonald RT, Pereira FN: Identifying gene and protein mentions in text using

conditional random fields. BMC Bioinformatics, 6 Suppl 1:S6. (2005).

McDonald RT, Pereira F, Kulick S, Winters RS, Jin Y, White P: Simple Algorithms for

Complex Relation Extraction with Applications to Biomedical IE. 43rd Annual Meeting

of the Association for Computational Linguistics, (2005).

MEDLINE: http://www.nlm.nih.gov/databases/databases_medline.html

Meldrum D: Automation for genomics, part two: sequencers, microarrays, and future

trends. Genome Res, 10:1081-1092, (2000).

Mitsumori T, Fation S, Murata M, Doi K, Doi H: Gene/protein name recognition based

on support vector machine using dictionary as features. BMC Bioinformatics. 6 Suppl

1:S8. (2005).

Muller HM, Kenny EE, Sternberg PW: Textpresso: an ontology-based information

retrieval and extraction system for biological literature. PLoS Biol. 2: e309. (2004).

Novichkova S, Egorov S, Daraselia N: MedScan, a natural language processing

engine for MEDLINE abstracts. Bioinformatics, 19:1699-1706. (2003).

Penn BioIE corpus release v0.9 [http://bioie.ldc.upenn.edu]

PubMED: http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed

Raychaudhuri S, Schutze H, Altman RB: Using text analysis to identify functionally

coherent gene groups. Genome Res. 12: 1582-1590. (2002).

Rzhetsky A, Iossifov I, Koike T, Krauthammer M, Kra P, Morris M, Yu H, Duboue PA,

Weng W, Wilbur JW, Hatzivassiloglou V, Friedman C: GeneWays: a system for

extracting, analyzing, visualizing, and integrating molecular pathway data. J. Biomed.

Inform. 37: 43-53. (2004).

Settles B: ABNER: an open source tool for automatically tagging genes, proteins and other entity

names in text. Bioinformatics, 21: 3191-3192 (2005).

Swanson DR: Fish oil, Raynaud's syndrome, and undiscovered public knowledge.

Perspect. Biol. Med., 30:7-18. (1986).

Swanson DR: Migraine and magnesium: eleven neglected connections. Perspect. Biol.

Med. 31: 526-557. (1988).

Swanson DR: Somatomedin C and arginine: implicit connections between mutually

isolated literatures. Perspect. Biol. Med. 33: 157-186. (1990).

Tamames J: Text Detective: a rule-based system for gene annotation in biomedical texts.

BMC Bioinformatics, 6 Suppl 1:S10. (2005).

Tanabe L, Scherf U, Smith LH, Lee JK, Hunter L, Weinstein JN: MedMiner: an internet

text-mining tool for biomedical information, with application to gene expression

profiling. BioTech. 27: 1210-1217. (1999).

Tanabe L, Wilbur W: Tagging gene and protein names in biomedical text.

Bioinformatics, 18:1124-1132. (2002).

Tanabe L, Xie N, Thom LH, Matten W, Wilbur WJ: GENETAG: a tagged corpus for

gene/protein named entity recognition. BMC Bioinformatics 6 Suppl 1:S3. (2005).

Temkin JM, Gilder MR: Extraction of protein interaction information from unstructured

text using a context-free grammar. Bioinformatics 19:2046-2053. (2003).

Torii M, Kamboj S, Vijay-Shanker K: Using name-internal and contextual features to

classify biological terms. J Biomed Inform 37(6):498-511. (2004).

van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA: A text-mining

analysis of the human phenome. Eur J Hum Genet 14(5):535-542. (2006).

Weeber M, Kors JA, Mons B: Online tools to support literature-based discovery in the

life sciences. Brief. Bioinform. 6: 277-286. (2005).

Wren JD, Bekeredjian R, Stewart JA, Shohet RV, Garner HR: Knowledge

discovery by automated identification and ranking of implicit relationships.

Bioinformatics, 20:389-398. (2004).

Yakushiji A, Tateisi Y, Miyao Y, Tsujii J: Event extraction from biomedical papers using

a full parser. Pac. Symp. Biocomput. 6: 408-419. (2001).

Yandell MD, Majoros WH: Genomics and natural language processing. Nat. Rev. Genet.,

3:601-610. (2002).

Zhou G, Shen D, Zhang J, Su J, Tan S: Recognition of protein/gene names from text

using an ensemble of classifiers. BMC Bioinformatics 6 Suppl 1:S7. (2005).
