
ACL 2014

BioNLP 2014
Workshop on Biomedical Natural Language Processing

Proceedings of the Workshop

June 26-27, 2014
Baltimore, Maryland, USA


©2014 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: [email protected]

ISBN 978-1-941643-18-1


Introduction

The first day of the BioNLP 2014 workshop continues following the course set by the first ACL workshop on Natural Language Processing in the Biomedical Domain, held in 2002: BioNLP 2014 provides a venue for exploring challenges and techniques in processing biomedical language and brings together researchers from computational linguistics and biomedical informatics. The submissions to the first day of the 2014 workshop, organized by SIGBioMed, were, as in past years, very strong and continued to demonstrate the considerable breadth of research in biomedical language processing. The 2014 workshop has accepted 12 full and short papers for oral presentations and 7 posters. The first day of the workshop features a keynote that expands the scope of BioNLP beyond its already remarkable breadth:

Keynote: BioNLP as the Pioneering field of linking text, knowledge and data

Professor Jun’ichi Tsujii, Principal Researcher at Microsoft Research Asia (MSRA), Chair of Text Mining and Scientific Director of the National Centre for Text Mining (NaCTeM) at the University of Manchester, UK

The second day of the workshop features a paper submitted to the special track on NLP approaches for assessment of clinical conditions. Kathleen C. Fraser presents the featured talk on using statistical parsing to detect agrammatic aphasia. The track organizers, Thamar Solorio and Yang Liu, serve as discussants.

The second day further features an exciting panel that brings together organizers of several shared tasks in biomedical information retrieval and natural language processing. The panel introduces the workshop participants to the long-standing and relatively new community-wide challenges in biomedical and clinical language processing. It also provides an opportunity to discuss the future of the shared tasks in this domain.

Panel: Life cycles of BioCreative, BioNLP-ST, i2b2, TREC Medical tracks, and ShARe/CLEF/SemEval

Lynette Hirschman & John Wilbur, Sophia Ananiadou, Ellen Voorhees, Ozlem Uzuner, Danielle Mowery & Sumithra Velupillai & Sameer Pradhan

The second day of the BioNLP 2014 workshop concludes with two tutorials on the fundamental resources widely used in the biomedical domain.

Tutorial 1: UMLS in biomedical text processing

Olivier Bodenreider, Branch Chief, Cognitive Science Branch, LHNCBC, NLM, NIH

Tutorial 2: Using MetaMap

Alan R. Aronson, Senior Researcher, Cognitive Science Branch, LHNCBC, NLM, NIH

Acknowledgments

As always, we are profoundly grateful to the authors who chose BioNLP as the venue for presenting their innovative research. The authors’ willingness to share their work through BioNLP consistently makes the workshop noteworthy among the increasing number of available venues. We are equally indebted to the program committee members (listed elsewhere in this volume) who produced at least three thorough reviews per paper on a tight review schedule and with an admirable level of insight.


Organizers:

Kevin Bretonnel Cohen, University of Colorado School of Medicine
Dina Demner-Fushman, US National Library of Medicine
Sophia Ananiadou, University of Manchester and National Centre for Text Mining, UK
John Pestian, University of Cincinnati, Cincinnati Children’s Hospital Medical Center
Jun’ichi Tsujii, Microsoft Research Asia and National Centre for Text Mining, UK

Program Committee:

Emilia Apostolova, DePaul University, USA
Eiji Aramaki, University of Tokyo, Japan
Alan Aronson, US National Library of Medicine
Sabine Bergler, Concordia University, Canada
Olivier Bodenreider, US National Library of Medicine
Kevin Cohen, University of Colorado, USA
Nigel Collier, National Institute of Informatics, Japan
Dina Demner-Fushman, US National Library of Medicine
Marcelo Fiszman, US National Library of Medicine
Filip Ginter, University of Turku, Finland
Graciela Gonzalez, Arizona State University, USA
Antonio Jimeno Yepes, NICTA, Australia
Halil Kilicoglu, US National Library of Medicine
Jin-Dong Kim, University of Tokyo, Japan
Robert Leaman, US National Library of Medicine
Yang Liu, The University of Texas at Dallas, USA
Zhiyong Lu, US National Library of Medicine
Makoto Miwa, National Centre for Text Mining, UK
Aurelie Neveol, LIMSI, France
Naoaki Okazaki, Tohoku University, Japan
Jong Park, KAIST, South Korea
Rashmi Prasad, University of Wisconsin-Milwaukee, USA
Sampo Pyysalo, National Centre for Text Mining, UK
Bastien Rance, Georges Pompidou European Hospital, France
Thomas Rindflesch, US National Library of Medicine
Kirk Roberts, US National Library of Medicine
Andrey Rzhetsky, University of Chicago, USA
Matthew Simpson, US National Library of Medicine
Thamar Solorio, The University of Alabama at Birmingham, USA
Yoshimasa Tsuruoka, University of Tokyo, Japan
Karin Verspoor, NICTA, Australia
W. John Wilbur, US National Library of Medicine

Invited Speaker:

Jun’ichi Tsujii, Microsoft Research Asia and National Centre for Text Mining, UK

Panelists:

Sophia Ananiadou, University of Manchester and National Centre for Text Mining, UK
Lynette Hirschman, The MITRE Corporation, USA


Danielle Mowery, University of Pittsburgh, USA
Sameer Pradhan, Harvard Medical School, USA
Ozlem Uzuner, State University of New York, Albany, USA
Sumithra Velupillai, Stockholm University, Sweden
Ellen Voorhees, National Institute of Standards and Technology, USA
W. John Wilbur, US National Library of Medicine


Table of Contents

Natural Language Processing Methods for Enhancing Geographic Metadata for Phylogeography of Zoonotic Viruses
Tasnia Tahsin, Robert Rivera, Rachel Beard, Rob Lauder, Davy Weissenbacher, Matthew Scotch, Garrick Wallstrom and Graciela Gonzalez . . . . . 1

Temporal Expression Recognition for Cell Cycle Phase Concepts in Biomedical Literature
Negacy Hailu, Natalya Panteleyeva and Kevin Cohen . . . . . 10

Classifying Negative Findings in Biomedical Publications
Bei Yu and Daniele Fanelli . . . . . 19

Automated Disease Normalization with Low Rank Approximations
Robert Leaman and Zhiyong Lu . . . . . 24

Decomposing Consumer Health Questions
Kirk Roberts, Halil Kilicoglu, Marcelo Fiszman and Dina Demner-Fushman . . . . . 29

Detecting Health Related Discussions in Everyday Telephone Conversations for Studying Medical Events in the Lives of Older Adults
Golnar Sheikhshab, Izhak Shafran and Jeffrey Kaye . . . . . 38

Coreference Resolution for Structured Drug Product Labels
Halil Kilicoglu and Dina Demner-Fushman . . . . . 45

Generating Patient Problem Lists from the ShARe Corpus using SNOMED CT/SNOMED CT CORE Problem List
Danielle Mowery, Mindy Ross, Sumithra Velupillai, Stephane Meystre, Janyce Wiebe and Wendy Chapman . . . . . 54

A System for Predicting ICD-10-PCS Codes from Electronic Health Records
Michael Subotin and Anthony Davis . . . . . 59

Structuring Operative Notes using Active Learning
Kirk Roberts, Sanda Harabagiu and Michael Skinner . . . . . 68

Chunking Clinical Text Containing Non-Canonical Language
Aleksandar Savkov, John Carroll and Jackie Cassell . . . . . 77

Decision Style in a Clinical Reasoning Corpus
Limor Hochberg, Cecilia Ovesdotter Alm, Esa M. Rantanen, Caroline M. DeLong and Anne Haake . . . . . 83

Temporal Expressions in Swedish Medical Text – A Pilot Study
Sumithra Velupillai . . . . . 88

A repository of semantic types in the MIMIC II database clinical notes
Richard Osborne, Alan Aronson and Kevin Cohen . . . . . 93

Extracting drug indications and adverse drug reactions from Spanish health social media
Isabel Segura-Bedmar, Santiago de la Peña González and Paloma Martínez . . . . . 98


Symptom extraction issue
Laure Martin, Delphine Battistelli and Thierry Charnois . . . . . 107

Seeking Informativeness in Literature Based Discovery
Judita Preiss . . . . . 112

Towards Gene Recognition from Rare and Ambiguous Abbreviations using a Filtering Approach
Matthias Hartung, Roman Klinger, Matthias Zwick and Philipp Cimiano . . . . . 118

FFTM: A Fuzzy Feature Transformation Method for Medical Documents
Amir Karami and Aryya Gangopadhyay . . . . . 128

Using statistical parsing to detect agrammatic aphasia
Kathleen C. Fraser, Graeme Hirst, Jed A. Meltzer, Jennifer E. Mack and Cynthia K. Thompson . . . . . 134


Conference Program

Thursday, June 26, 2014

9:00–9:10 Opening remarks

Session 1: Processing biomedical publications

9:10–9:30 Natural Language Processing Methods for Enhancing Geographic Metadata for Phylogeography of Zoonotic Viruses
Tasnia Tahsin, Robert Rivera, Rachel Beard, Rob Lauder, Davy Weissenbacher, Matthew Scotch, Garrick Wallstrom and Graciela Gonzalez

9:30–9:50 Temporal Expression Recognition for Cell Cycle Phase Concepts in Biomedical Literature
Negacy Hailu, Natalya Panteleyeva and Kevin Cohen

9:50–10:10 Classifying Negative Findings in Biomedical Publications
Bei Yu and Daniele Fanelli

10:10–10:30 Automated Disease Normalization with Low Rank Approximations
Robert Leaman and Zhiyong Lu

10:30–11:00 Coffee Break

Keynote by Junichi Tsujii

11:00–11:50 BioNLP as the Pioneering field of linking text, knowledge and data

Session 2: Processing consumer language

11:50–12:10 Decomposing Consumer Health Questions
Kirk Roberts, Halil Kilicoglu, Marcelo Fiszman and Dina Demner-Fushman

12:10–12:30 Detecting Health Related Discussions in Everyday Telephone Conversations for Studying Medical Events in the Lives of Older Adults
Golnar Sheikhshab, Izhak Shafran and Jeffrey Kaye

12:30–14:00 Lunch


Thursday, June 26, 2014 (continued)

Session 3: Processing clinical text and gray literature

14:00–14:20 Coreference Resolution for Structured Drug Product Labels
Halil Kilicoglu and Dina Demner-Fushman

14:20–14:40 Generating Patient Problem Lists from the ShARe Corpus using SNOMED CT/SNOMED CT CORE Problem List
Danielle Mowery, Mindy Ross, Sumithra Velupillai, Stephane Meystre, Janyce Wiebe and Wendy Chapman

14:40–15:00 A System for Predicting ICD-10-PCS Codes from Electronic Health Records
Michael Subotin and Anthony Davis

15:00–15:20 Structuring Operative Notes using Active Learning
Kirk Roberts, Sanda Harabagiu and Michael Skinner

15:30–16:00 Afternoon Break

16:00–16:20 Chunking Clinical Text Containing Non-Canonical Language
Aleksandar Savkov, John Carroll and Jackie Cassell

16:20–16:40 Decision Style in a Clinical Reasoning Corpus
Limor Hochberg, Cecilia Ovesdotter Alm, Esa M. Rantanen, Caroline M. DeLong and Anne Haake

(16:40–17:30) Poster session

Temporal Expressions in Swedish Medical Text – A Pilot Study
Sumithra Velupillai

A repository of semantic types in the MIMIC II database clinical notes
Richard Osborne, Alan Aronson and Kevin Cohen

Extracting drug indications and adverse drug reactions from Spanish health social media
Isabel Segura-Bedmar, Santiago de la Peña González and Paloma Martínez

Symptom extraction issue
Laure Martin, Delphine Battistelli and Thierry Charnois


Thursday, June 26, 2014 (continued)

Seeking Informativeness in Literature Based Discovery
Judita Preiss

Towards Gene Recognition from Rare and Ambiguous Abbreviations using a Filtering Approach
Matthias Hartung, Roman Klinger, Matthias Zwick and Philipp Cimiano

FFTM: A Fuzzy Feature Transformation Method for Medical Documents
Amir Karami and Aryya Gangopadhyay

Friday, June 27, 2014

Session 1: NLP approaches for assessment of clinical conditions

9:00–9:40 Using statistical parsing to detect agrammatic aphasia
Kathleen C. Fraser, Graeme Hirst, Jed A. Meltzer, Jennifer E. Mack and Cynthia K. Thompson

Panel: Life cycles of BioCreative, BioNLP-ST, i2b2, TREC Medical tracks, and ShARe/CLEF/SemEval

9:40–10:05 BioCreative by Lynette Hirschman and John Wilbur

10:05–10:30 BioNLP-ST by Sophia Ananiadou and Junichi Tsujii

10:30–11:00 Coffee Break

11:00–11:25 TREC Medical tracks by Ellen Voorhees

11:25–11:50 i2b2 by Ozlem Uzuner

11:50–12:10 ShARe/CLEF/SemEval by Danielle Mowery, Sumithra Velupillai and Sameer Pradhan

12:10–12:30 Discussion

12:30–14:00 Lunch


Friday, June 27, 2014 (continued)

Tutorials

14:00–15:30 UMLS in biomedical text processing by Olivier Bodenreider

15:30–16:00 Afternoon Break

16:00–17:30 Using MetaMap by Alan R. Aronson


Proceedings of the 2014 Workshop on Biomedical Natural Language Processing (BioNLP 2014), pages 1–9, Baltimore, Maryland, USA, June 26-27, 2014. ©2014 Association for Computational Linguistics

Natural Language Processing Methods for Enhancing Geographic Metadata for Phylogeography of Zoonotic Viruses

Tasnia Tahsin, Robert Rivera, Rachel Beard, Rob Lauder, Davy Weissenbacher, Matthew Scotch, Garrick Wallstrom and Graciela Gonzalez

Department of Biomedical Informatics, Arizona State University
13212 E Shea Blvd, Scottsdale, AZ 85259
[email protected]

Abstract

Zoonotic viruses, viruses that are transmittable between animals and humans, represent emerging or re-emerging pathogens that pose significant public health threats throughout the world. It is therefore crucial to advance current surveillance mechanisms for these viruses through outlets such as phylogeography. Phylogeographic techniques may be applied to trace the origins and geographical distribution of these viruses using sequence and location data, which are often obtained from publicly available databases such as GenBank. Despite the abundance of zoonotic viral sequence data in GenBank records, phylogeographic analysis of these viruses is greatly limited by the lack of adequate geographic metadata. Although more detailed information may often be found in the related articles referenced in these records, manual extraction of this information presents a severe bottleneck. In this work, we propose an automated system for extracting this information using Natural Language Processing (NLP) methods. In order to validate the need for such a system, we first determine the percentage of GenBank records with “insufficient” geographic metadata for seven well-studied zoonotic viruses. We then evaluate four different named entity recognition (NER) systems which may help in the automatic extraction of information from related articles that can be used to improve the GenBank geographic metadata. This includes a novel dictionary-based location tagging system that we introduce in this paper.


1 Introduction

Zoonotic viruses, viruses that are transmittable between animals and humans, have become increasingly prevalent in the last century, leading to the rise and re-emergence of a variety of diseases (Krauss, 2003). In order to enhance currently available surveillance systems for these viruses, a better understanding of their origins and transmission patterns is required. This need has led to a greater amount of research in the field of phylogeography, the study of geographical lineages of species (Avise, 2000). Population health agencies frequently apply phylogeographic techniques to trace the evolutionary changes within viral lineages that affect their diffusion and transmission among animal and human hosts (Ciccozzi et al., 2013; Gray and Salemi, 2012; Weidmann et al., 2013). Prediction of virus migration routes enhances the chances of isolating the viral strain for vaccine production. In addition, if the source of the strain is identified, intervention methods may be applied to block the virus at the source and limit outbreaks in other areas.

Phylogeographic analysis depends on the utilization of both the sequence data and the location of collection of specific viral sequences. Researchers often use publicly available databases such as GenBank for retrieving this information. For instance, Wallace and Fitch (2008) used data from GenBank records to study the migration of the H5N1 virus in various animal hosts over Europe, Asia and Africa, and were able to identify the Guangdong province in China as the source of the outbreak. However, the extent of phylogeographic modeling is highly dependent on the specificity of available geospatial information, and the lack of geographic data more specific than the state or province level may limit phylogeographic analysis and distort results. In the previous example, Wallace and Fitch (2008) had to use town-level information to identify the source of the H5N1 outbreak; without specific location data, they would not have been able to identify the Guangdong province as the source. Unfortunately, while there is an abundance of sequence data in GenBank records, many of them lack sufficient geographic metadata that would enable specific identification of the isolate’s location of collection. A prior study conducted by Scotch et al. (2011) showed that the geographic information of 80% of the GenBank records associated with single or double stranded RNA viruses within tetrapod hosts is less specific than 1st level administrative boundaries (ADM1) such as state or province.

Though many of the records lack specific geographic metadata, more detailed information is often available within the journal articles referenced in them. However, manual extraction of this information is time-consuming and cumbersome and presents a severe bottleneck on phylogeographic analysis. In this work, we investigate the potential of NLP techniques to enhance the geographic data available for phylogeographic studies of zoonotic viruses using NER systems. In addition to geographic metadata and sequence information, GenBank records also contain several other forms of metadata such as host, collection date and gene for each isolate. Journal articles that are referenced in these records often mention the location of isolation for the viral sample in conjunction with related metadata (Figure 1 provides an example of such a case). Therefore, by allowing identification of location mentions along with mentions of related GenBank metadata in these articles, we believe that NER systems may help to accurately link each GenBank record to its corresponding location of isolation and distinguish it from other location mentions.

Previously, Scotch et al. (2011) evaluated the performance of BANNER (Leaman and Gonzalez, 2008) and the Stanford NER tool (Finkel et al., 2005) for automated identification of gene and location mentions respectively, in 10 full-text PubMed articles, each related to a specific GenBank record. They were both found to achieve f-scores of less than 0.45, thereby establishing the need for NER systems with better performance and/or a larger test corpus (Scotch et al., 2011). In this study, we start by evaluating the state of geographic insufficiency for zoonotic viruses in GenBank records using a new automated approach. Next, we further expand upon the work done by Scotch et al. (2011) by building our own dictionary-based location-tagging system and evaluating its performance on a larger corpus corresponding to over 8,500 GenBank records for zoonotic viruses. In addition, we also evaluate the performance of three other state-of-the-art NER tools for tagging gene, date and species mentions in this corpus. We believe that identification of these entities will be useful for the future development of a system for extracting the location of collection of viral isolates from articles related to their respective GenBank records.


Figure 1 (panels: GenBank Record, Related PubMed Article). Example of how the date, gene, and strain metadata within a GenBank record may be used to differentiate between two potential locations in a related article

2 Methods

The process undertaken to complete this study can be divided into three distinct stages: selection of the zoonotic viruses and extraction of relevant GenBank data related to each virus, computation of “sufficiency” statistics on the extracted data, and development/evaluation of NER systems for tagging location, gene, date and species mentions in full-text PubMed Central articles. A detailed description of each phase is given below.

2.1 Virus Selection and GenBank Data Extraction

The domain of this study has been limited to zoonotic viruses that are most consistently documented and tracked by public health, agriculture and wildlife state departments within the United States. These viruses include influenza, rabies, hantavirus, western equine encephalitis (WEE), eastern equine encephalitis (EEE), St. Louis encephalitis (SLE), and West Nile virus (WNV). The Entrez Programming Utilities (E-Utilities) were used to download the following fields from 59,595 GenBank records associated with these viruses: GenBank Accession ID, PubMed Central ID, Strain name, Collection date and Country. These records were the result of a query performed to retrieve all accession numbers related to the selected viruses which had at least one reference to a PubMed Central article. The results from the query were retrieved on August 22nd, 2013.
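As a rough illustration of this retrieval step, the sketch below uses Biopython's Entrez interface; the paper does not name a client library, and the query term and record count shown here are placeholders rather than the authors' actual query.

```python
from Bio import Entrez, SeqIO

Entrez.email = "your.name@example.org"  # NCBI requires a contact address; placeholder

# Illustrative query only; the authors' exact query (all seven viruses, restricted
# to records referencing at least one PubMed Central article) is not given verbatim.
term = "Influenza A virus[Organism]"

search = Entrez.read(Entrez.esearch(db="nucleotide", term=term, retmax=100))
handle = Entrez.efetch(db="nucleotide", id=",".join(search["IdList"]),
                       rettype="gb", retmode="text")

for rec in SeqIO.parse(handle, "genbank"):
    source = next((f for f in rec.features if f.type == "source"), None)
    if source is None:
        continue
    quals = source.qualifiers
    # Accession, strain, collection date and country, as in the fields listed above.
    print(rec.id,
          quals.get("strain", ["-"])[0],
          quals.get("collection_date", ["-"])[0],
          quals.get("country", ["-"])[0])
```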

2.2 Sufficiency Analysis

Database Integration: The data extracted from GenBank was used to compute the percentage of GenBank records that had insufficient geographic information for each of the selected viruses. In order to perform this computation, we used data from the ISO 3166-1 alpha-2 table [1] and the GeoNames database [2]. The ISO 3166-1 alpha-2 is the International Standard for representing country names using two-letter codes. The GeoNames database contains a variety of geospatial data for over 10 million locations on earth, including the ISO 3166-1 alpha-2 code for the country of each location and a feature code that can be used to determine the administrative level of each location. To allow for efficient querying, we downloaded the main GeoNames table and the ISO alpha-2 country codes table from their respective websites and stored them in a local SQL database. Prior to adding the ISO data to the database, some commonly used country names and their corresponding country codes were added to the table since it only included a single title for each country. For example, the ISO table included the country name “United States” but not alternate names such as “USA”, “United States of America”, or “US”. Using the created database in conjunction with a parser written in Java, we were able to retrieve most of the geographic information present within the records and classify each of them as sufficient or insufficient.

[1] Iso.org. [Internet]. Genève. c2013. Available from http://www.iso.org/iso/home/standards/country_codes.htm
[2] Geonames.org. [Internet]. Egypt. c2013. [updated 2013 Apr 30] Available from http://www.geonames.org/EG/administrative-division-egypt.html

Figure 2. Sufficiency Criteria

Sufficiency Criteria: For the purpose of this project, we considered any geographical boundary more specific than ADM1 to be “sufficient”. Based on this criterion, a feature code in GeoNames was categorized as sufficient only if it was absent from the following list of feature codes: ADM1, ADM1H, ADMD, ADMDH, PCL, PCLD, PCLF, PCLH, PCLI and PCLS. Evaluation of the geographical sufficiency of a GenBank record was dependent upon whether the record included a country name. A GenBank record with a country mention was called sufficient if the geographic information extracted from that record included another place mention whose feature code fell within the class of sufficient feature codes and whose ISO country code matched that of the retrieved country. For instance, a GenBank record with the geographic metadata “Orange County, United States” will be called sufficient since the place “Orange County” has a sufficient feature code of “ADM2” and a country code of “US” which matches the country code of the retrieved country, “United States”. Place mentions with matching country codes often had several different feature codes in GeoNames. Such places were only called sufficient if all feature codes corresponding to the given pair of place name and country code were classified as sufficient. In cases where the GenBank record had no country mention, the record was called sufficient only if all matching GeoNames entries for any of the places mentioned in it had sufficient feature codes. The sufficiency criteria were designed to ensure that a geographic location is only called sufficient if its administrative level was found to be more specific than ADM1 without any form of ambiguity. Figure 3 illustrates the pathways of geographical sufficiency for GenBank records in a diagram.
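A minimal sketch of this decision logic, assuming GeoNames matches have already been retrieved for each place mention; the function names, the data layout, and the stricter reading of the no-country case are ours.

```python
# Feature codes at or above the ADM1 / country level; everything else is "sufficient".
INSUFFICIENT_CODES = {"ADM1", "ADM1H", "ADMD", "ADMDH",
                      "PCL", "PCLD", "PCLF", "PCLH", "PCLI", "PCLS"}

def code_is_sufficient(feature_code):
    return feature_code not in INSUFFICIENT_CODES

def record_is_sufficient(place_entries, country_code=None):
    """place_entries maps each place mention to its GeoNames matches as
    (iso_country_code, feature_code) pairs; country_code is the record's
    ISO 3166-1 alpha-2 country, if one was identified."""
    if country_code is not None:
        # Sufficient if some place mention matches the country and *all* of its
        # feature codes under that country are sufficient.
        for entries in place_entries.values():
            codes = [fc for cc, fc in entries if cc == country_code]
            if codes and all(code_is_sufficient(fc) for fc in codes):
                return True
        return False
    # No country mention: every GeoNames match for every place must be sufficient
    # (stricter reading of the criterion described above).
    all_codes = [fc for entries in place_entries.values() for _, fc in entries]
    return bool(all_codes) and all(code_is_sufficient(fc) for fc in all_codes)
```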

Sufficiency Computation: In order to obtain the geographic information for each GenBank record, we used a Java parser which automatically extracted data from the “country” field of each record. Since the “country” field typically contained multiple place mentions divided by a set of delimiters consisting of comma, colon and hyphen, we first split this field using these delimiters. We then checked each string obtained through this process against the ISO country code table to determine whether it was a potential country name for the record’s location. If the query returned no results, then the locally stored GeoNames table was searched and, for each match found, the corresponding ISO country code and feature code were extracted. Figure 4 shows a diagram of this process.

Figure 3. Sufficiency Calculation Example

In cases where no sufficient location data was found from the “country” field of a GenBank record, the Java parser searched through its “strain” field. This was done because some viral strains such as influenza include their location of origin integrated into their names. For example, the influenza strain “A/duck/Alberta/35/76” indicates that the geographic origin of the strain is Alberta. The different sections of a strain field are separated by either forward slash, parenthesis, comma, colon, hyphen or underscore, and so we used a set of delimiters consisting of these characters to split this field. Each string thus retrieved was queried as before on the ISO country code table and the GeoNames table. GeoNames often returned matches for strings like ‘raccoons’ and ‘chicken’ which were actually meant to be names of host species within the “strain” field, and so a list of some of the most frequently seen host name mentions in these records was manually created and filtered out before querying GeoNames.

Figure 4. Example of annotation including all four entities

Some of the place mentions contained very specific location information which resulted in GeoNames not finding a match for them. A list was created for strings like ‘north’, ‘south-east’, ‘governorate’ etc. which, when removed from a place mention, may produce a match. In cases of potential place mentions which contained any one of these strings and for which GeoNames returned no matching result, a second query was performed after removal of the string.
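The field-splitting and retry logic just described can be summarized in a short sketch; the delimiter sets come from the text above, while lookup_geonames, the host-name list and the strippable-term list are illustrative placeholders for the locally stored tables.

```python
import re

COUNTRY_DELIMS = r"[,:\-]"       # delimiters observed in the "country" field
STRAIN_DELIMS = r"[/(),:\-_]"    # delimiters observed in the "strain" field

# Partial, illustrative lists; the paper's full lists were compiled manually.
HOST_WORDS = {"duck", "chicken", "raccoons", "swine"}
STRIP_WORDS = {"north", "south", "east", "west", "south-east", "governorate"}

def candidate_places(field_value, delims):
    """Split a GenBank field into candidate place strings."""
    return [tok.strip() for tok in re.split(delims, field_value) if tok.strip()]

def geo_matches(place, lookup_geonames):
    """Query GeoNames for a candidate place, filtering host names and retrying
    once with directional/administrative terms removed."""
    if place.lower() in HOST_WORDS:
        return []
    matches = lookup_geonames(place)        # placeholder for the local SQL query
    if matches:
        return matches
    kept = [w for w in place.lower().split() if w not in STRIP_WORDS]
    if kept and len(kept) < len(place.split()):
        return lookup_geonames(" ".join(kept))
    return []
```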

Evaluation of Sufficiency Computation: We manually annotated 10% of all influenza records in GenBank which reference at least one PubMed Central article as sufficient or insufficient based on our sufficiency criteria (5731 records). We then ran our program on these records and compared system results with annotated results.

2.3 Development/Evaluation of NER systems

Creation of Gold Standard Corpus: We created a gold standard corpus consisting of twenty-seven manually-annotated full-text PubMed Central articles in order to evaluate the performance of NER systems for tagging location, gene, species and date mentions in text. The articles corresponded to over 8,500 GenBank records and were randomly sampled using the subset of extracted GenBank records which contained a link to PubMed Central articles and had insufficient geographic metadata.

Three annotators tagged the following four entities in each article using the freely available annotation tool, BRAT (Stenetorp et al., 2012): gene names, locations, dates and species. Figure 4 provides an example of the manual annotation in BRAT. We annotated all mentions of each entity type, not only those relevant to zoonotic viruses, in order to evaluate system performance. A total of over 19,000 entities were annotated within this corpus. The number of tokens annotated was about 24,000. A set of annotation guidelines was created for this process (available upon request). Before creating the guidelines, each annotator individually annotated six common articles and compared and discussed their results to devise a reasonable set of rules for annotating each entity. After discussion, the annotators re-annotated the common articles based on the guidelines and divided the remaining articles amongst themselves. The inter-annotator agreement was calculated for each pair of annotators. The annotated corpus will be made available at diego.asu.edu/downloads.

Development of Automated Location Tagger: We developed a dictionary-based NER system using the GeoNames database for automated identification of location mentions in text. The dictionary used by this system, which we will hereby refer to as GeoNamer, was created by retrieving distinct place names from the GeoNames table and filtering out commonly used words from the retrieved set. Words filtered out include stop words such as ‘are’ and ‘the’, generic place names such as ‘cave’ and ‘hill’, numbers like ‘one’ and ‘two’, domain specific words such as ‘biology’ and ‘DNA’, most commonly used surnames like ‘Garcia’, commonly used animal names such as ‘chicken’ and ‘fox’ and other miscellaneous words such as ‘central’. This was a crucial step since the GeoNames database contains a wide array of commonly used English words which may cause a large volume of false positives if not removed. The final dictionary consists of 5,396,503 entries. In order to recognize place mentions in a given set of text files, GeoNamer first builds a Lucene index on the contents of the files. It then constructs a phrase query for every entry in the GeoNames dictionary and runs each query on the Lucene index. The document id, query text, start offset and end offset for every match found is written to an output file. We chose this approach because of its simplicity and efficiency.
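The sketch below illustrates the same matching idea in plain Python rather than Lucene: every dictionary phrase is matched exactly against the tokenized text, and the document id with start and end character offsets is reported for each hit. It is a simplified stand-in for the index-plus-phrase-query design described above, and the dictionary shown is a toy sample, not the 5.4-million-entry GeoNamer dictionary.

```python
import re

def find_place_mentions(doc_id, text, dictionary, max_len=4):
    """Return (doc_id, matched_phrase, start_offset, end_offset) tuples for every
    dictionary phrase (lower-cased, up to max_len tokens) found in the text."""
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+(?:[-']\w+)*", text)]
    hits = []
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            if i + n > len(tokens):
                break
            phrase = " ".join(tok for tok, _, _ in tokens[i:i + n])
            if phrase.lower() in dictionary:
                hits.append((doc_id, phrase, tokens[i][1], tokens[i + n - 1][2]))
    return hits

# Toy dictionary after stop word / common word filtering
geo_dict = {"guangdong", "alberta", "orange county", "united states"}
print(find_place_mentions("PMC000001",
                          "Samples were collected in Orange County, United States.",
                          geo_dict))
```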

Evaluation of NER Systems: Four different NER systems for identifying species, gene, date and location mentions in text were evaluated using the created gold standard. The evaluated systems include LINNAEUS (Gerner et al., 2010), BANNER, Stanford SUTime (Chang and Manning, 2012) and GeoNamer. LINNAEUS, BANNER and Stanford SUTime are widely-used, state-of-the-art open source NER systems for recognition of species, gene and temporal expressions respectively. GeoNamer is the system we developed in this work for the purpose of tagging locations, as described earlier.

3 Results

3.1 Sufficiency Analysis

The system for classifying records as sufficient or insufficient was found to have an accuracy of 72% as compared to manual annotation. 98% of the errors were due to insufficient records being called sufficient. The results of the sufficiency analysis are given in Table 1. 64% of all GenBank records extracted for this project contained insufficient geographic information. Amongst the seven studied viruses, WEE had the highest and EEE had the lowest percentage of insufficient records.

Virus Type    Number of Entries    % Insufficient
WEE           67                   90
Rabies        4450                 85
WNV           1084                 79
SLE           141                  74
Hanta         1745                 66
Influenza     51734                62
EEE           374                  51
All           59595                64

Table 1. Percentage of GenBank records with insufficient geographic information for each zoonotic virus studied in this project

3.2 Gold Standard Corpus

The results for the comparison of the annotations performed by our three annotators on 6 common papers can be found in Table 2. We used the F-score between each pair of annotators as a measure of inter-rater agreement and had over 90% agreement with overlap matching and over 86% agreement with exact matching in all cases. The final gold standard corpus contained approximately 19,000 entities corresponding to approximately 24,000 tokens.
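A minimal sketch of the two matching regimes behind these numbers; spans are (start, end) character offsets, one annotator's spans are treated as the reference set, and the greedy pairing is our own simplification.

```python
def span_f_score(gold, predicted, overlap=False):
    """F1 between two lists of (start, end) spans: exact match by default,
    or any character overlap when overlap=True (each gold span used once)."""
    def match(a, b):
        return a == b if not overlap else (a[0] < b[1] and b[0] < a[1])

    remaining = list(gold)
    tp = 0
    for p in predicted:
        for g in remaining:
            if match(p, g):
                tp += 1
                remaining.remove(g)
                break
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: one exact match, one partially overlapping match, one spurious span
gold = [(10, 20), (30, 42)]
pred = [(10, 20), (31, 40), (50, 55)]
print(span_f_score(gold, pred), span_f_score(gold, pred, overlap=True))
```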

Entity      F-score (A,B)      F-score (A,C)      F-score (B,C)
            Exact; Overlap     Exact; Overlap     Exact; Overlap
Date        .975; .978         .979; .987         .962; .973
Gene        .914; .926         .913; .932         .911; .954
Location    .945; .961         .907; .931         .914; .935
Species     .909; .956         .874; .940         .915; .959
Virus       .952; .958         .947; .966         .947; .955
Mean        .939; .956         .924; .951         .930; .955

Table 2. Pairwise inter-annotator agreement (F-scores, exact and overlap matching) on the 6 commonly annotated papers

3.3 Performance Analysis of NER Systems

The performance metrics for the NER systems at tagging the desired entities in the test set are listed in Table 3. The highest performance was achieved by Stanford SUTime for date tagging. Tagging of genes had the lowest performance.

System            Precision          Recall             F-score
                  Exact; Overlap     Exact; Overlap     Exact; Overlap
BANNER            0.070; 0.239       0.114; 0.395       0.087; 0.297
GeoNamer          0.452; 0.626       0.658; 0.783       0.536; 0.696
LINNAEUS          0.853; 0.962       0.563; 0.658       0.678; 0.781
Stanford SUTime   0.800; 0.853       0.681; 0.727       0.736; 0.785

Table 3. Performance statistics of the NER systems


4 Discussion

Based on our analysis, at least half of the GenBank records for each of the studied zoonotic viruses lack sufficient geographic information, and the proportion of insufficient records can be as high as 90%. Our automated system for classifying records as insufficient or sufficient was found to have an accuracy of 72%, with 98% of the errors being a result of insufficient records being called sufficient. Therefore, our computed estimate of insufficiency is very likely to be an underestimation of the actual problem. The virus with the highest level of sufficiency, EEE, had a large number of records with county level information in the “country” field. However, the insufficient records for this virus typically contained no place mention, not even at the country level. A key reason for our calculated percentage of sufficient GenBank records being higher for these seven viruses than what has been previously computed by Scotch et al. (2011) was the inclusion of the “strain” field. The “strain” field often contained specific location information which, when combined with place mentions present within the “country” field, made the record geographically sufficient. The virus for which the inclusion of the “strain” field had the greatest impact on boosting the sufficiency percentage was influenza. Most of the GenBank records associated with this virus had structured “strain” fields from which the parser could easily separate place mentions using GeoNames.

Although the sufficiency classifications produced by our system were correct most of the time, there were a few cases where a record got incorrectly labeled as insufficient even when it contained detailed geographic information. This typically happened because GeoNames failed to return matching results for these places. For instance, the country field “India: Majiara, WB” was not found to be sufficient even though Majiara is a city in India, because GeoNames has no entry for it. In some cases the lack of a matching result was due to spelling variations of the place name. For instance, the country field “Indonesia: Yogjakarta” was called insufficient since “Yogjakarta” is spelled as “Yogyakarta” in GeoNames. Sometimes the database simply did not contain the exact string present in the GenBank record. For instance, it does not have any entry for the place “south Kalimantan” but it contains the place name “kalimantan”. The number of sufficient records which were called insufficient by our system due to inexact matching was greatly mitigated by removing strings such as “south” from the place mention, as described in the “Methods” section.

Most of the NER systems performed significantly better with overlap measures than with exact-match measures. This is because our annotation guidelines typically involved tagging the longest possible match for each entity, and the automated systems frequently missed portions of each annotation. Stanford SUTime had the best overlap f-measure of 0.785, closely followed by LINNAEUS with an overlap f-measure of 0.781. Although Stanford SUTime was fairly effective at finding date mentions in text, it tagged all four-digit numbers such as “1012” and “2339” as years, leading to a number of false positives. The poor recall of LINNAEUS was mostly due to the fact that the dictionary used by LINNAEUS tagged only species mentions in text, while we tagged genus and family mentions as well. It also missed a lot of commonly used animal names such as monkey, bat, badger and wolf. GeoNamer was the third best performer with the highest recall but second lowest precision. This is because the GeoNames dictionary contains an extensively large list of location names, many of which are commonly used words such as “central”. Even though we filtered out a vast majority of these words, it still produced false positives such as “wizard”. However, its performance was considerably better than that of the Stanford location tagger used by Scotch et al. (2011), which was found to have a recall, precision and f-score of 0.26, 0.81 and 0.39 respectively. The improved performance was achieved because of the higher recall of our system. The GeoNames dictionary provides extensive coverage of location mentions throughout the world, and the Stanford NER system, which is a CRF classifier trained on a different dataset, was not able to recognize many of the place mentions present in full-text PMC articles related to GenBank records.

BANNER showed the poorest performance amongst all the entity taggers evaluated in this paper. In fact, the f-score we achieved for BANNER in this study was much lower than its past f-score of 0.42 within the domain of articles related to GenBank records for viral isolates (Scotch et al., 2011). As mentioned by Scotch et al. (2011), a key reason for BANNER’s poor performance in this domain is the difference between the data set used to train the BANNER model and the annotation corpus used to test this system. The version of BANNER used in these two studies was trained on the training set for the BioCreative 2 Gene Mention task, which comprised 15,000 sentences from PubMed abstracts. These abstracts often contained the full names for gene and protein mentions, while the full-text articles we used mostly contained the abbreviated forms of gene names, which BANNER tended to miss. The articles also contained abbreviated forms of several entities such as viral strain names (e.g. H1N1) and species names (e.g. VEEV) which look similar to abbreviated gene names. Therefore, BANNER often misclassified these entities as gene mentions.

A possible reason for BANNER having a much lower performance in this study than in the previous study conducted by Scotch et al. (2011) is the presence of a large number of tables in the journal articles we selected. BANNER is a machine learning system based on conditional random fields which uses orthographic, morphological and shallow syntax features extracted from sentences to identify gene mentions in text. Such features do not help greatly for extraction from tables. Therefore, BANNER was often not able to identify the gene mentions in the tables present within our corpus, thereby producing false negatives. Moreover, it tagged several entries within a table as a single gene name, thereby producing false positives as well. This reduced both the recall and precision of BANNER.

Although this study explores the problem of insufficient geographic information in GenBank more thoroughly than past studies, the number of papers annotated as the gold standard is still limited. Thus, the performance of the taggers reported can be construed as a preliminary estimate at best. The set of taggers and their performance seem to be adequate for a large-scale application, with the exception of BANNER. However, we did not make any changes to the BANNER system (specifically, re-training) since changes to it are not possible until sufficient data is annotated for retraining.

retraining.

5 Conclusions and Future Work

It can be concluded that the majority of GenBank records for zoonotic viruses do not contain sufficient geographic information concerning their origin. In order to enable phylogeographic analysis of these viruses and thereby monitor their spread, it is essential to develop an efficient mechanism for extracting this information from published articles. Automated NER systems may help accelerate this process significantly. Our results indicate that the NER systems LINNAEUS, Stanford SUTime and GeoNamer produce satisfactory performance in this domain and thus can be used in the future for linking GenBank records with their corresponding geographic information. However, the current version of BANNER is not well-suited for this task. We will need to train BANNER specifically for this purpose before incorporating it within our system.

We are currently altering the component of our program which classifies records as sufficient or insufficient in order to reduce the number of errors due to insufficient records being called sufficient. We are also manually looking through GenBank records for zoonotic viruses with insufficient geographic metadata and linking them to the location mentions in related articles which we deem to be the most likely location of collection for the given viral isolate. The resulting annotated corpus will be used to train and evaluate an automated system for populating GenBank geographic metadata. We have already covered all GenBank records related to encephalitis viruses and close to 10% of all records related to influenza which are linked to PubMed Central articles. The annotation process has revealed that a large proportion of the information allowing linkage of GenBank records to geographic metadata is often present in tables within the articles in addition to textual sentences. Therefore, we have developed a Python parser for automatically linking GenBank records to location mentions using tables from the HTML version of the PubMed Central articles. Future work will include further expansion of this annotation corpus and the development of an integrated system for enhancing GenBank geographic metadata for phylogeographic analysis of zoonotic viruses.

Acknowledgement

Research reported in this publication was supported by the NIAID of the NIH under Award Number R56AI102559 to MS and GG. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.


References

Avise, John C. 2000. Phylogeography: the history and formation of species. Cambridge, Mass.: Harvard University Press.

Chang, Angel X., and Christopher Manning. 2012. SUTime: A library for recognizing and normalizing time expressions. In Proceedings of LREC 2012.

Ciccozzi M, et al. 2013. Epidemiological history and phylogeography of West Nile virus lineage 2. Infection, Genetics and Evolution, 17:46-50.

Finkel JR, Grenager T, Manning C. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), p. 363-370.

Gerner M, Nenadic G, and Bergman CM. 2010. LINNAEUS: A species name identification system for biomedical literature. BMC Bioinformatics, 11(85).

Gray RR, and Salemi M. 2012. Integrative molecular phylogeography in the context of infectious diseases on the human-animal interface. Parasitology-Cambridge, 139:1939-1951.

Krauss, H. 2003. Zoonoses: infectious diseases transmissible from animals to humans (3rd ed.). Washington, D.C.: ASM Press.

Leaman R and Gonzalez G. 2008. BANNER: An executable survey of advances in biomedical named entity recognition. Pacific Symposium on Biocomputing, 13:652-663.

Scotch, Matthew, et al. 2011. Enhancing phylogeography by improving geographical information from GenBank. Journal of Biomedical Informatics, 44:S44-S47.

Stenetorp P, et al. 2012. BRAT: A Web-based Tool for NLP-Assisted Text Annotation. In Proceedings of EACL 2012.

Wallace, R.G. and W.M. Fitch. 2008. Influenza A H5N1 immigration is filtered out at some international borders. PLoS One, 3(2):e1697.

Weidmann M, et al. 2013. Molecular phylogeography of tick-borne encephalitis virus in Central Europe. Journal of General Virology, 94:2129-2139.


Proceedings of the 2014 Workshop on Biomedical Natural Language Processing (BioNLP 2014), pages 10–18, Baltimore, Maryland, USA, June 26-27, 2014. ©2014 Association for Computational Linguistics

Temporal Expression Recognition for Cell Cycle Phase Concepts in Biomedical Literature

Negacy D. Hailu, Natalya Panteleyeva and K. Bretonnel Cohen
Computational Bioscience Program, University of Colorado Denver School of Medicine
[email protected], [email protected], [email protected]

Abstract

In this paper, we present a system for recognizing temporal expressions related to cell cycle phase (CCP) concepts in biomedical literature. We identified 11 classes of cell cycle related temporal expressions, for which we made extensions to TIMEX3, arranging them in an ontology derived from the Gene Ontology. We annotated 310 abstracts from PubMed. Annotation guidelines were developed, consistent with existing time-related annotation guidelines for TimeML. Two annotators participated in the annotation. We achieved an inter-annotator agreement of 0.79 for an exact span match and 0.82 for relaxed constraints. Our approach is a hybrid of machine learning to recognize temporal expressions and a rule-based approach to map them to the ontology. We trained a named entity recognizer using Conditional Random Fields (CRF) models. An off-the-shelf implementation of the linear chain CRF model was used. We obtained an F-score of 0.77 for temporal expression recognition. We achieved a 0.79 macro-averaged F-score and a 0.78 micro-averaged F-score for mapping to the ontology.

1 Introduction

Storing and processing temporal data in biomedical informatics is important, but challenging (Zhou and Hripcsak, 2007; Augusto, 2005). Biomedical data is often intrinsically associated with time. For example, data from electronic medical records are on a clinical timeline (Zhou and Hripcsak, 2007) which links all information on the progress of a patient’s status. Temporal reasoning remains a challenge for medical information systems (Combi and Shahar, 1997). Conventionally, dictionaries define time as “The continuous passage of existence in which events pass from a state of potentiality in the future, through the present, to a state of finality in the past” (Editorial Staff, undated). This traditional linear concept of temporality does not adequately capture the cyclical nature of some important biological processes, such as the cell cycle and circadian rhythms. In this paper, we describe a system for the recognition of temporal expressions related to cell cycle phases in biomedical literature. The cell cycle is a phenomenon that a cell goes through during its growth and replication. Its stages are depicted in Figure 1. We treat each phase as a distinct time component and we aim at recognizing expressions that describe them in biomedical literature, then mapping them to an ontology of cell cycle phases and transitions. Specifically, we are interested in recognizing expressions that contain one or more of the concepts shown in Table 1, where the Gene Ontology is taken as definitional of concepts related to phases of the cell cycle.

Recognition of cell cycle phase concepts from text is a non-trivial problem. Some of the ways that they can be mentioned in text, such as interphase, anaphase, and prophase, are relatively unambiguous and can be recognized and mapped to an ontology using regular expressions. However, as is often the case both in general language and in biomedical language, many of the ways in which they can be mentioned are highly ambiguous. For example, M, which stands for mitosis, is often a unit of measurement, as in . . . removal of histone HI with 0,6 M NaCl. (PMID: 6183061) M could also be an abbreviation of an author’s first name, as in . . . Suzuki S, Nakata M. (PMID: 23844291) S, which refers to S-phase or synthesis phase, could also stand for an author’s first name, as well as a protein name, as in . . . Protein S acts as a co-factor for tissue factor pathway inhibitor. (PMID: 23841464). In addition, the word synthesis is in itself ambiguous, even in the context of other mentions of cell cycle phases. In the following examples, it refers to something other than a cell cycle phase:

  • . . . histone synthesis by lymphocytes in G0 and G1. (PMID: 6849885)

  • . . . metaphase-anaphase transition, as a result of fertilization, activation or protein synthesis inhibition. (PMID: 9552372)

We treated recognition of temporal expressionsfrom literature as a named entity recognition(NER) problem. Many approaches to named en-tity recognition are based on machine learningtechniques. Nadeau and Sekine report that al-though semi-supervised learning algorithms havebeen employed in NER challenges, most systemsthat perform well are built based on supervisedlearning techniques (Nadeau and Sekine, 2007).Based on this survey report, we used ConditionalRandom Fields (CRFs) for the recognition phaseof our approach. The details of our methods aredescribed in section 4.2.

Figure 1: Schematic of the cell cycle. Outer ring:I = Interphase, M = Mitosis; inner ring: M = Mi-tosis, G1 = Gap 1, G2 = Gap 2, S = Synthesis; notin ring: G0 = Gap 0/Resting [Wikipedia].

2 Motivation

A vast collection of biomedical literature inPubMed/MEDLINE and other biomedical jour-nal repositories is estimated to grow exponen-tially (Hunter and Cohen, 2006), as shown in

Figure 2. Searching for papers specific toa researcher’s interest in any domain is diffi-cult. PubMed/MEDLINE allows search usingkeywords, but until recently did not rank results bydocument relevance. General-purpose search en-gines such as Google and Bing rank their results,but are not well-suited for search of specialized in-formation related to genes and small molecules.Building a specialized search engine exclusivelyto search biomedical literature using genes andsmall molecules as keywords could be very use-ful, for instance, for cancer researchers.

Figure 2: Publication growth rate at Med-line (Hunter and Cohen, 2006)

Our long term goal is to build a spe-cialized search engine specific to cancer re-search. The system will retrieve articles fromPubMed/MEDLINE and rank them according totheir relevance. The system will utilize gene, pro-tein, and small molecule names as keywords indocument search. We are also interested in identi-fying the phase(s) of the cell cycle during whichthe gene is expressed. After detecting the ac-tive phase(s) of a gene or gene product, the sys-tem will link relevant documents to this gene fromPubMed/MEDLINE. In this paper we present ourfirst step towards that goal, which is extraction oftemporal expressions from biomedical literature.Temporal expressions will be used to identify ac-tive phases of genes or gene products.

3 Related Work

Automatic recognition of events and temporal ex-pressions from text has attracted researchers fromareas such as computer science and linguistics.

11

Page 24: Proceedings of the 2014 Workshop on Biomedical Natural …acl2014.org/acl2014/W14-34/W14-34-2014.pdf · 2014. 6. 18. · Proceedings of the 2014 Workshop on Biomedical Natural Language

Concept ID Activities in each phase SynonymsInterphase GO:0051325 The cell readies itself for meiosis or mitosis and the

replication of its DNA occurs.karyostasis

G0 phase GO:0044838 Cells enter in response to cues from the cell’s envi-ronment.

quiescence

G1 phase GO:0051318 Gap phase -S phase GO:0051320 DNA synthesis takes place. S-phase, synthesisG2 phase GO:0051319 Gap phase -Mitosis GO:0007067 The nucleus of a eukaryotic cell divides -Prophase GO:0051324 Chromosomes condense and the two daughter cen-

trioles and their asters migrate toward the poles ofthe cell.

-

Metaphase GO:0051323 Chromosomes become aligned on the equatorialplate of the cell.

-

Anaphase GO:0051322 The chromosomes separate and migrate towards thepoles of the spindle.

-

Telophase GO:0051326 The chromosomes arrive at the poles of the cell andthe division of the cytoplasm starts.

-

Table 1: Cell cycle phase concepts. Definitions from the Gene Ontology.

The results have contributed to the developmentof diverse natural language processing applica-tions, such as information extraction, informationretrieval, question-answering systems, text sum-marization, etc. TimeML: Robust Specification ofEvent and Temporal Expressions in Text (Puste-jovsky et al., 2003) is a specification language forannotation of events and temporal expressions inhuman language. TimeML addresses specificationissues like time stamping, order of events, reason-ing about events, and time expressions.

TempEval is one of the shared challenges in-cluded in SemEval (Agirre et al., 2009) as of 2007.It aims at advancing research on processing tem-poral information. Primarily it focuses on threetasks: event extraction and classification, temporalexpression extraction and normalization, and tem-poral relation extraction (UzZaman et al., 2013).However, this ongoing work on temporal evalu-ation is based on language data collected fromthe news. In the clinical domain, (Styler IV etal., Undated; Palmer and Pustejovsky, 2012; Al-bright et al., 2013) describe the THYME annota-tion project. The scope and language of temporal-ity related to the cell cycle is different from that ofboth TempEval and the clinical domain, and sup-ports (and demands) different types of reasoning,specifically related to cyclical time.

Cyclical phenomena are ubiquitous in can-cer development and progression. The connec-

tion between the cell cycle and cancer is wellknown (Vermeulen et al., 2003; Kastan andBartek, 2004; Malumbres and Barbacid, 2009),and the fact that the cell cycle is the main targetfor cancer regulation, deregulation, and therapyis well established (Vermeulen et al., 2003; Kas-tan and Bartek, 2004; Malumbres and Barbacid,2009). Circadian rhythms, rounds of chemother-apy, remissions, and re-occurrences all have acyclic nature.Circadian rhythms have been inves-tigated in the study of cancer treatment (Sahar andSassone-Corsi, 2009; Ortiz-Tudela et al., 2013;Lengyel et al., 2009; Kelleher et al., 2014).

From the perspective of cancer research, identi-fying cell cycle concepts in the literature is crucialto being able to retrieve and explore informationrelated to cyclical biological processes like the celllife cycle. From the natural language processingperspective, the novelty of this work consists inmodeling cyclical time. To our knowledge, tempo-ral event recognition grounded in a cyclical modelof time has not been previously proposed.

4 Methodology

4.1 Materials

We built a corpus of 360 abstracts, consisting of70,570 words. The concepts are presented in Ta-ble 1. We balanced our corpus by collecting arti-cles from the PubMed/MEDLINE database usingthe concepts individually as keywords. We used

12

Page 25: Proceedings of the 2014 Workshop on Biomedical Natural …acl2014.org/acl2014/W14-34/W14-34-2014.pdf · 2014. 6. 18. · Proceedings of the 2014 Workshop on Biomedical Natural Language

the PubMed/MEDLINE1 and BioMedLib searchengines2, two keyword-based search engines builton top of MEDLINE, for this purpose. The fol-lowing keywords were used to collect the abstractsfrom PubMed and BioMedLib:

• interphase, G0, G0 phase, G1, G1 phase, syn-thesis, S phase, G2, G2 phase

• Mitosis, M phase, prophase, metaphase,anaphase, telophase

• checkpoint

The annotation guidelines addressed the follow-ing issues:

• The goal of the project: the goal of the anno-tation project was to develop a highly anno-tated corpus specific to CCP concepts, whichwill be used for automatic recognition andclassification.

• Specification of each tag: this is shown inFigure 3.

• Tool used to annotate the project: We usedKnowtator (Ogren, 2006), a text annotationtool built on top of the Protege knowledgerepresentation system.

Modeling the phenomenon was the first step inunderstanding the annotation process (Pustejovskyand Stubbs, 2012). We modeled our corpus as atriple, Model = <T, R, I>, as shown below:

• Model = <T, R, I> where T = terms, R = re-lation between the terms, and I = interaction

• T = {Named Entity, time expression, not timeexpression}• R = {Named Entity ::= TIMEXCCP | not

TIMEXCCP}• I = {TIMEXCCP = list of concepts from Ta-

ble 1 or checkpoints. Examples of check-points are G1/G2 phase, S/G2 phase, etc.

TimeML is a specification for annotating hu-man language in text (Pustejovsky et al., 2003).TIMEX3 is defined in TimeML as a tag for cap-turing dates, times, durations, and sets of dates and

1http://www.ncbi.nlm.nih.gov/pubmed/2http://bmlsearch.com/

times. In our work we extended TIMEX3. We em-ploy a single tag set called TIMEXCCP, where thenaming is intended to be consistent with existingtime-related tag sets. Figure 3 shows the attributesand functions of the tag TIMEXCCP, as well asexamples of usage.

Figure 3: Attributes and functions of the TIMEX-CCP tag.

Two annotators with training in the domain per-formed the annotation. Inter-annotator agreementwas calculated as F-measure, following (Hripcsakand Rothschild, 2005). Inter-annotator agreementwas 0.79 for an exact-span match and 0.82 for re-laxed matching. The constraints, which are valuesof the attributes, were not considered while com-puting IAA for the latter case.

The annotation effort developed through sev-eral iterations, applying the annotation devel-opment cycle introduced by Pustejovsky andStubbs (Pustejovsky and Stubbs, 2012). Thismethodology is depicted in Figure 4. It is calledthe MATTER cycle, which stands for Model, An-notation, Train, Test, Evaluation, Revise. The ad-vantage of this methodology is that it allows usto discover hidden specifications and refine themduring the MATTER cycle.

4.2 Methods

We are particularly interested in recognizing andclassifying temporal expressions in the literature.For example, in the following sentence, taken fromWikipedia, the recognition task is to recognize theblue boxes as shown below and classify them. Themapping task is to categorize the recognized tem-poral expressions into the concepts shown in Ta-ble 1.

"Microhomology-mediated endjoining (MMEJ) uses a Kuprotein and DNA-PK independentrepair mechanism, and

13

Page 26: Proceedings of the 2014 Workshop on Biomedical Natural …acl2014.org/acl2014/W14-34/W14-34-2014.pdf · 2014. 6. 18. · Proceedings of the 2014 Workshop on Biomedical Natural Language

Figure 4: The MATTER cycle (Pustejovsky andStubbs, 2012)

repair occurs during theS phase of the cell cycle,as oposed to the G0/G1 andearly S phases in NHEJ andlate S to G2 phases in HR."

... the S phase of the cell cycle, as opposedto the G0/G1 and early S phases in NHEJ and

late S to G2 phase in HR.

In this example, there are four temporal expres-sions: S, G0/G1, early S, and late S to G2. Theexpression “S” is of the type S-phase or synthesisphase according to the conceptual ontology inTable 1. The expression “G0/G1” can be classifiedas G0 and G1. Similarly, the expression “late S toG2” can be of type S and G2.

Our approach is a hybrid of machine learningand rule-based techniques. The machine learningtechnique, which we refer to as the first layer, isapplied for temporal expression recognition. Inthis layer, CRFs are trained to learn to recognizethe expressions from the list of features which isshown below.

1. Word-level features:

• Is the word in uppercase?• Is the first character of the word in up-

percase?• Words themselves are also treated as

features.• Length of the word.

2. Punctuation-related features:

• Does the word contain at least one of themost common punctuation marks?

3. Digit-related features:

• Is the word a digit?• Does the word contain a digit?

4. Does the word contain either of the follow-ing: phase, arrest, entry? These words typi-cally come before or after the cell cycle con-cepts. For example, early mitosis, G0 phase.

5. Part-of-Speech tagging: Window size of 2before and after the word.

6. Presence of concept modifiers before theword. Modifiers include: early, mid, late,early-mid.

Conditional Random Fields (CRFs) are one ofthe probabilistic graphical model sequence tag-ging techniques. They are understood as a sequen-tial version of Maximum Entropy Models (Klingerand Tomanek, 2007). One advantage of CRFs overother probabilistic models like Hidden MarkovModels and Maximum Entropy Models for com-plex systems is their support for features interact-ing with one another. The linear chain CRF repre-sentation is shown in Figure 5.

Figure 5: A linear chain Conditional Ran-dom Field representation (Klinger and Tomanek,2007).

In this representation, ~x is a vector of observa-tions, also known as features in machine learn-ing, and the yt’s are states or labels. In this lin-ear chain model, a given state is dependent on itsprevious, current, and next states. It is also in-fluenced by the observations for that state. Thisargument can be formulated as Equation 1. Ac-cordingly, state prediction will be an optimizationof Equation 1. ψc(~x, ~y) are the factor matrices of

14

Page 27: Proceedings of the 2014 Workshop on Biomedical Natural …acl2014.org/acl2014/W14-34/W14-34-2014.pdf · 2014. 6. 18. · Proceedings of the 2014 Workshop on Biomedical Natural Language

the maximal cliques read from the factor graph inFigure 5 (Klinger and Tomanek, 2007).

P (~y|~x) =1

Z(~x)

∏cεC

ψc(~x, ~y) (1)

We used the IOB format, which is the most com-mon method of representation for sequence tag-ging. In this format, I stands for the inside, O isthe outside, and B is the beginning of a temporalexpression. Table 2 shows an example of IOB la-beling for the phrase . . . late S to G2 phase in HR.

token tag. . . . . .late B TIMEXCCPS I TIMEXCCPto I TIMEXCCPG2 I TIMEXCCP

phase Oin O

HR O. O

Table 2: IOB format representation of a segmentof a sentence.

The rule-based system is keyword-based. Therules match simple cell cycle phase concepts. Forexample, the phrase early S phase is classified assynthesis, since there is S in it. The expressionG0/G1 phase is classified as a G0/G1 checkpoint.

5 Experimental setup

We split our dataset of more than 70K tokens into80% training and 20% test sets. We used 5-foldcross validation to balance the distribution of thedataset. The number of positive instances for the5 runs is shown in Figure 6. The expressionsS and synthesis are displayed separately, despitetheir identical meaning, to allow for more granu-lar evaluation of performance. The same rationaleapplies to displaying M and mitosis separately.

The ratio of the individual concepts that wehave in the 5 runs is balanced, as shown in Fig-ure 6. However, the training dataset is skewed,since there are almost 98% negative labels, withthe remaining small portion as positive labels.Among the approximately 10K test tokens, 180of them are labeled as positive TIMEXCCP, butthe others are negative, i.e. they have the labelO. A positive TIMEXCCP in this case could be

B TIMEXCCP or I TIMEXCCP—beginning orinside of a temporal expression.,

6 Results

Since the task consisted of two separate steps—temporal expression recognition, and mapping ornormalization—in this section, we report our find-ings independently. Our evaluation metrics are interms of precision P, recall R, and F-measure. Thesystem achieved precision P = 0.83, recall R =0.72 and F = 0.77 for recognizing TIMEXCCP inbiomedical literature.

The temporal expression mapper, which is arule-based system, achieved a macro-averaged P=0.90, R = 0.70, and F = 0.79 and a micro-averagedP = 0.86, R = 0.71, and F = 0.78. The system per-formance for the individual concepts is shown inFigure 7.

7 Discussion

Some of the false positive predictions were due tohuman annotation errors.

There were some conditions where the annota-tors disagreed. For example, . . . early G1 to G2phase. This examples addresses two questions thatshould be explicitly mentioned in the annotationguidelines:

• Does the modifier “early” modify only G1, orboth G1 and G2?

• Should there be an attribute for the range oftime from G1 to G2 in the annotation guide-lines?

Our system achieved good performance onboth time expression recognition and mapping ofhighly ambiguous concepts. In spite of the chal-lenges presented by ambiguity, we obtained 0.85,0.81, and 0.80 F-measures for recognizing andmapping the concepts synthesis, M, and S, respec-tively. The most informative features that con-tribute to this score are the discriminating wordsbefore and after a target token. These words are:phase, arrest, and entry. They are often presentbefore or after CCP concepts. Also, presence ofmodifiers is a good indication of CCP concepts.For example, in the phase early S phase, the mod-ifier early is one of the most informative features.However, recognition of complex phrases as inlate S to G2 phase remained a challenge.

The challenges of complex temporal expres-sions can be seen from a different perspective.

15

Page 28: Proceedings of the 2014 Workshop on Biomedical Natural …acl2014.org/acl2014/W14-34/W14-34-2014.pdf · 2014. 6. 18. · Proceedings of the 2014 Workshop on Biomedical Natural Language

Figure 6: Distribution of concepts in 5 runs.

Figure 7: Rule-based classification performance. Average score for 5 runs.

Mostly the system recognizes the individual con-cepts within a complex phrase, but not the mod-ifiers nor the words like prepositions within thecomplex phrase. In the example given previously,the system recognizes S and G2 but not the modi-fier late, nor the preposition to. These challengescould be tackled by having features that addressthe modifiers as well as words within two con-cepts.

We used a naive tokenizer that splits the textinto words based on white space. In the future,we would like to test the system with other moresophisticated tokenizers. We kept punctuationmarks in temporal expressions, for example, theforward slash in G0/G1 phase. Presence of punc-tuation marks, such as hyphen (-), forward slash(/), comma (,) and single quote (’), within a tokenis one of our features in training the machine learn-ing algorithm to recognize temporal expressions.

8 Conclusions & Future work

Cell cycle phase concepts are time expressions,and can be annotated in a fashion similar toTimeML. In this work, we annotated a corpus with

cell cycle phase information. This corpus can beused to train machine learning algorithms to pre-dict cell cycle phase concepts. The concepts wereannotated using the TIMEXCCP tag, an extensionof TIMEX3, which has the following attributes:value, modifier, set, and comments. The detailsare in Figure 3.

We have developed a temporal expression rec-ognizer and classifier based on a hybrid of ma-chine learning and rule-based techniques. We pro-pose a two-tiered architecture to solve temporalexpression recognition and mapping for CCP con-cepts. The first tier recognizes temporal expres-sions using CRFs. In the second tier, a rule-basedsystem classifies the concepts.

Some of the main future directions for thisworks are testing the system with the addition ofmore annotated data. We will focus on how wecan capture complex time expressions. This mighttake us to redefining the annotation guidelines thatwe have right now.

16

Page 29: Proceedings of the 2014 Workshop on Biomedical Natural …acl2014.org/acl2014/W14-34/W14-34-2014.pdf · 2014. 6. 18. · Proceedings of the 2014 Workshop on Biomedical Natural Language

Acknowledgments

The authors thank Richard Osborne and ScottCramer for helpful discussion of the significanceof this work from a cancer research perspective.

ReferencesEneko Agirre, Lluıs Marquez, and Richard Wicen-

towski. 2009. Computational semantic analysisof language: Semeval-2007 and beyond. LanguageResources and Evaluation, 43(2):97–104.

Daniel Albright, Arrick Lanfranchi, Anwen Fredrik-sen, William F Styler, Colin Warner, Jena D Hwang,Jinho D Choi, Dmitriy Dligach, Rodney D Nielsen,James Martin, et al. 2013. Towards comprehen-sive syntactic and semantic annotations of the clin-ical narrative. Journal of the American Medical In-formatics Association, 20(5):922–930.

Roberta Alfieri, Ivan Merelli, Ettore Mosca, and Lu-ciano Milanesi. 2007. The Cell Cycle DB: a sys-tems biology approach to cell cycle analysis. Nu-cleic Acids Research.

Juan Carlos Augusto. 2005. Temporal reasoning fordecision support in medicine. Artificial Intelligencein Medicine, 33(1):1–24.

Matteo Brucato, Leon Derczynski, Hector Llorens,Kalina Bontcheva, and Christian S. Jensen. 2013.Recognising and interpreting named temporal ex-pressions. In Galia Angelova, Kalina Bontcheva,and Ruslan Mitkov, editors, RANLP, pages 113–121.RANLP 2011 Organising Committee/ACL.

C. Combi and Y. Shahar. 1997. Temporal reasoningand temporal data maintenance in medicine: Issuesand challenges. Comput Biol Med, 27 (5).

Carlo Combi, Elpida Keravnou-Papailiou, and YuvalShahar. 2010. Temporal Information Systems inMedicine. Springer Publishing Company, Incorpo-rated, 1st edition.

Collins Editorial Staff. undated. Collins Concise En-glish Dictionary.

Nicholas Paul Gauthier, Lars Juhl Jensen, RasmusWernersson, Soren Brunak, and Thomas S. Jensen.2009. Cyclebase.org: version 2.0, an updated com-prehensive, multi-species repository of cell cycleexperiments and derived analysis results. NucleicAcids Research, 9.

Erik Hatcher, Otis Gospodnetic, and Mike McCand-less. 2nd revised edition. edition.

George Hripcsak and Adam S Rothschild. 2005.Agreement, the f-measure, and reliability in infor-mation retrieval. Journal of the American MedicalInformatics Association, 12(3):296–298.

Lawrence Hunter and K. Bretonnel Cohen. 2006.Biomedical Language Processing: PerspectiveWhat’s Beyond PubMed? Molecular Cell, 21:589–594.

Michael Kahn. December 1991. Modeling time inmedical decision-support programs. Med DecisionMaking, 11(4):249–264.

Michael B. Kastan and Jiri Bartek. 2004. Cell-cyclecheckpoints and cancer. Nature, 432:316–323.

Fergal C. Kelleher, Aparna Rao, and Anne Maguire.2014. Circadian molecular clocks and cancer. Can-cer Letters, 342:9–18.

Roman Klinger and Katrin Tomanek. 2007. Classi-cal Probabilistic Models and Conditional RandomFields. Technical Report TR07-2-013, Departmentof Computer Science, Dortmund University of Tech-nology, December.

R. Leaman and Gonzalez G. 2008. BANNER: An exe-cutable survey of advances in biomedical named en-tity recognition. Pacific Symposium on Biocomput-ing, 13:652–663.

Zsuzsanna Lengyel, Zita Battyani, Gyorgy Szekeres,Valer Csernus, and Andras D. Nagy. 2009. Cir-cadian clocks and tumor biology: what is to learnfrom human skin biopsies? Nature Reviews Cancer,9:153–166.

Marcos Malumbres and Mariano Barbacid. 2009. CellCycle, CDKs and cancer: a changing paradigm. Na-ture Reviews Cancer, 9:153–166.

Inderjeet Mani, Ben Wellner, Marc Verhagen, andJames Pustejovsky. 2007. Three approaches tolearning TLINKs in TimeML. Technical report,Computer Science Department, Brandeis University.Waltham, USA.

Christopher D. Manning, Prabhakar Raghavan, andHinrich Schutze. 2008. Introduction to InformationRetrieval. Cambridge University Press, Cambridge,UK.

Andrew Kachites McCallum. 2002. MALLET: A ma-chine learning for language toolkit.

David Nadeau and Satoshi Sekine. 2007. A surveyof named entity recognition and classification. Lin-guisticae Investigationes, 30(1):3–26, January. Pub-lisher: John Benjamins Publishing Company.

Philip V. Ogren. 2006. Knowtator: a protege plug-infor annotated corpus construction. In Proceedings ofthe 2006 Conference of the North American Chapterof the Association for Computational Linguistics onHuman Language Technology, pages 273–275, Mor-ristown, NJ, USA. Association for ComputationalLinguistics.

17

Page 30: Proceedings of the 2014 Workshop on Biomedical Natural …acl2014.org/acl2014/W14-34/W14-34-2014.pdf · 2014. 6. 18. · Proceedings of the 2014 Workshop on Biomedical Natural Language

Elisabet Ortiz-Tudela, Ida Iurisci, Jacques Beau, Ab-doulaye Karaboue, Thierry Moreau, Maria Ange-les Rol, Juan Antonio Madrid, Francis Levi, andPasquale F. Innominato. 2013. The circadianrest-activity rhythm, a potential safety pharmacol-ogy endpoint of cancer chemotherapy. InternationalJournal of Cancer.

Martha Palmer and James Pustejovsky. 2012. 2012i2b2 temporal relations challenge annotation guide-lines.

James Pustejovsky and Amber Stubbs. 2012. Nat-ural Language Annotation for Machine Learning.O’REILLY.

James Pustejovsky, Jose Castano, Robert Ingria, RoserSaur, Robert Gaizauskas, Andrea Setzer, and Gra-ham Katz. 2003. TimeML: Robust specification ofevent and temporal expressions in text. In Fifth In-ternational Workshop on Computational Semantics(IWCS-5).

Saurabh Sahar and Paolo Sassone-Corsi. 2009.Metabolism and cancer: the circadian clock connec-tion. Nature, 9:886–896.

Yuval Shahar and Carlo Combi. 1999. Editors’ fore-word: Intelligent temporal information systems inmedicine. J. Intell. Inf. Syst., 13(1-2):5–8.

Paul T. Spellman, Gavin Sherlock, Michael Q. Zhang,Vishwanath R. Iyer, Kirk Anders, Michael B.Eisen, Patrick O. Brown, David Botstein, and BruceFutcher. 1998. Comprehensive identification of cellcycle-regulated genes of the yeast Saccharomycescerevisiae by microarray hybridization. MolecularBiology of the Cell, 9.

William F Styler IV, Steven Bethard, Sean Finan,Martha Palmer, Sameer Pradhan, Piet C de Groen,Brad Erickson, Timothy Miller, Chen Lin, and Guer-gana Savova. Undated. Temporal annotation in theclinical domain.

Naushad UzZaman, Hector Llorens, Leon Derczyn-ski, James Allen, Marc Verhagen, and James Puste-jovsky. 2013. Semeval-2013 task 1: Tempeval-3:Evaluating time expressions, events, and temporalrelations. In Second Joint Conference on Lexicaland Computational Semantics (*SEM), Volume 2:Proceedings of the Seventh International Workshopon Semantic Evaluation (SemEval 2013), pages 1–9,Atlanta, Georgia, USA, June. Association for Com-putational Linguistics.

Katrien Vermeulen, Dirk R. Van Bockstaele, andZwi N. Berneman. 2003. The Cell Cycle: a reviewof regulation, deregulation and therapeutic targets incancer. Cell Proliferation, 36:131–149.

Michael L. Whitield, Gavin Sherlock, Alok J. Sal-danha, John I. Murray, Catherine A. Ball, Karen E.Alexander, John C. Matese, Charles M. Perou,

Myra M. Hurt, Patrick O. Brown, and David Bot-stein. 2002. Identication of genes periodically ex-pressed in the human cell cycle and their expressionin tumors. Molecular Biology of the Cell, 13.

Li Zhou and George Hripcsak. 2007. Temporal rea-soning with medical data - a review with emphasison medical natural language processing. Journal ofBiomedical Informatics, 40(2):183–202.

18

Page 31: Proceedings of the 2014 Workshop on Biomedical Natural …acl2014.org/acl2014/W14-34/W14-34-2014.pdf · 2014. 6. 18. · Proceedings of the 2014 Workshop on Biomedical Natural Language

Proceedings of the 2014 Workshop on Biomedical Natural Language Processing (BioNLP 2014), pages 19–23,Baltimore, Maryland USA, June 26-27 2014. c©2014 Association for Computational Linguistics

Classifying Negative Findings in Biomedical Publications

Bei Yu School of Information Studies

Syracuse University [email protected]

Daniele Fanelli School of Library and Information Science

University of Montreal [email protected]

Abstract

Publication bias refers to the phenome-non that statistically significant, “posi-tive” results are more likely to be pub-lished than non-significant, “negative” results. Currently, researchers have to manually identify negative results in a large number of publications in order to examine publication biases. This paper proposes an NLP approach for automati-cally classifying negated sentences in bi-omedical abstracts as either reporting negative findings or not. Using multino-mial naïve Bayes algorithm and bag-of-words features enriched by parts-of-speeches and constituents, we built a classifier that reached 84% accuracy based on 5-fold cross validation on a bal-anced data set.

1 Introduction

Publication bias refers to the phenomenon that statistically significant, “positive” results are more likely to be published than non-significant, “negative” results (Estabrook et al., 1991). Due to the “file-drawer” effect (Rosenthal, 1979), negative results are more likely to be “filed away” privately than to be published publicly.

Publication bias poses challenge for an accu-rate review of current research progress. It threatens the quality of meta-analyses and sys-tematic reviews that rely on published research results (e.g., the Cochrane Review). Publication bias may be further spread through citation net-work, and amplified by citation bias, a phenome-non that positive results are more likely to be cited than negative results (Greenberg, 2009).

To address the publication bias problem, some new journals were launched and dedicated to

publishing negative results, such as the Journal of Negative Results in Biomedicine, Journal of Pharmaceutical Negative Results, Journal of Negative Results in Ecology and Evolutionary Biology, and All Results Journals: Chem. Some quantitative methods like the funnel plot (Egger et al., 1997) were used to measure publication bias in publications retrieved for a certain topic.

A key step in such manual analysis is to exam-ine the article abstracts or full-texts to see wheth-er the findings are negative or not. For example, Hebert et al. (2002) examined the full text of 1,038 biomedical articles whose primary out-comes were hypothesis testing results, and found 234 (23%) negative articles. Apparently, such manual analysis approach is time consuming. An accurate, automated classifier would be ideal to actively track positive and negative publications.

This paper proposes an NLP approach for au-tomatically identifying negative results in bio-medical abstracts. Because one publication may have multiple findings, we currently focus on classifying negative findings at sentence level: for a sentence that contains the negation cues “no” and/or “not”, we predict whether the sen-tence reported negative finding or not. We con-structed a training data set using manual annota-tion and convenience samples. Two widely-used text classification algorithms, Multinomial naive Bayes (MNB) and Support Vector Machines (SVM), were compared in this study. A few text representation approaches were also compared by their effectiveness in building the classifier. The approaches include (1) bag-of-words (BOW), (2) BOW with PoS tagging and shallow parsing, and (3) local contexts of the negation cues “no” and ‘not”, including the words, PoS tags, and constituents. The best classifier was built using MNB and bag-of-words features en-riched with PoS tags and constituent markers. The best performance is 84% accuracy based on 5-fold cross validation on a balanced data set.

19

Page 32: Proceedings of the 2014 Workshop on Biomedical Natural …acl2014.org/acl2014/W14-34/W14-34-2014.pdf · 2014. 6. 18. · Proceedings of the 2014 Workshop on Biomedical Natural Language

2 Related work

The problem of identifying negative results is related to several other BioNLP problems, espe-cially on negation and scientific claim identifica-tion. The first relevant task is to identify negation signals and their scopes (e.g., Morante and Daelemans, 2008;2009; Farkas et al., 2010; Agarwal et al., 2011). Manually-annotated cor-pora like BioScope (Szarvas et al., 2008) were created to annotate negations and their scopes in biomedical abstracts in support of automated identification. This task targets a wide range of negation types, such as the presence or absence of clinical observations in narrative clinical re-ports (Chapman et al., 2001). In comparison, our task focuses on identifying negative findings on-ly. Although not all negations report negative results, negation signals are important rhetorical device for authors to make negative claims. Therefore, in this study we also examine preci-sion and recall of using negation signals as pre-dictors of negative findings. The second relevant task is to identify the strength and types of scientific claims. Light et al. (2004) developed a classifier to predict the level of speculations in sentences in biomedical abstracts. Blake (2010) proposed a “claim framework” that differentiates explicit claims, observations, correlations, comparisons, and im-plicit claims, based on the certainty of the causal relationship that was presented. Blake also found that abstracts contained only 7.84% of all scien-tific claims, indicating the need for full-text analysis. Currently, our preliminary study exam-ines abstracts only, assuming the most important findings are reported there. We also focus on coarse-grained classification of positive vs. nega-tive findings at this stage, and leave for future work the task of differentiating negative claims in finer-granularity.

3 The NLP approach

3.1 The definition of negative results

When deciding what kinds of results count as “negative”, some prior studies used “non-significant” results as an equivalent for “negative results” (e.g. Hebert et al., 2002; Fanelli, 2012). However, in practice, the definition of “negative results” is actually broader. For example, the Journal of Negative Results in Biomedicine (JNRBM), launched in 2002, was devoted to publishing “unexpected, controversial, provoca-

tive and/or negative results,” according to the journal’s website. This broader definition has its pros and cons. The added ambiguity poses chal-lenge for manual and automated identification. At the same time, the broader definition allows the inclusion of descriptive studies, such as the first JNRBM article (Hebert et al., 2002).

Interestingly, Hebert et al. (2002) defined “negative results” as “non-significant outcomes” and drew a negative conclusion that “prominent medical journals often provide insufficient in-formation to assess the validity of studies with negative results”, based on descriptive statistics, not hypothesis testing. This finding would not be counted as “negative” unless the broader defini-tion is adopted.

In our study, we utilized the JNRBM articles as a convenience sample of negative results, and thus inherit its broader definition.

3.2 The effectiveness of negation cues as predictors

The Bioscope corpus marked a number of nega-tion cues in the abstracts of research articles, such as “not”, “no”, “without”, etc. It is so far the most comprehensive negation cue collection we can find for biomedical publications. However, some challenges arise when applying these nega-tion cues to the task of identifying negative re-sults.

First, instead of focusing on negative results, the Bioscope corpus was annotated with cues expressing general negation and speculations. Consequently, some negation cues such as “un-likely” was annotated as a speculation cue, not a negation cue, although “unlikely” was used to report negative results like

“These data indicate that changes in Wnt ex-pression per se are unlikely to be the cause of the observed dysregulation of β-catenin ex-pression in DD” (PMC1564412). Therefore, the Bioscope negation cues may

not have captured all negation cues for reporting negative findings. To test this hypothesis, we used the JNRBM abstracts (N=90) as a conven-ience sample of negative results, and found that 81 abstracts (88.9%) contain at least one Bio-scope negation cue. Note that because the JNRBM abstracts consist of multiple subsections “background”, “objective”, “method”, “result”, and “conclusion”, we used the “result” and “con-clusions” subsections only to narrow down the search range.

20

Page 33: Proceedings of the 2014 Workshop on Biomedical Natural …acl2014.org/acl2014/W14-34/W14-34-2014.pdf · 2014. 6. 18. · Proceedings of the 2014 Workshop on Biomedical Natural Language

Among the 9 missed abstracts, 5 used cues not captured in Bioscope negation cues: “insuffi-cient”, “unlikely”, “setbacks”, “worsening”, and “underestimates”. However, the authors’ writing style might be affected by the fact that JNRBM is dedicated to negative results. One hypothesis is that the authors would feel less pressure to use negative tones, and thus used more variety of negation words. Hence we leave it as an open question whether the new-found negation cues and their synonyms are generalizable to other biomedical journal articles.

The rest 4 abstracts (PMC 1863432, 1865554, 1839113, and 2746800) did not report explicit negation results, indicating that sometimes ab-stracts alone are not enough to decide whether negative results were reported, although the per-centage is relatively low (4.4%). Hence, we de-cided that missing target is not a major concern for our task, and thus would classify a research finding as positive if no negation cues were found.

Second, some positive research results may be mistaken as negative just because they used ne-gation cues. For example, “without” is marked as a negation cue in Bioscope, but it can be used in many contexts that do not indicate negative re-sults, such as

“The effects are consistent with or without the presence of hypertension and other comorbidi-ties and across a range of drug classes.” (PMC2659734) To measure the percentage of false alarm, we

applied the aforementioned trivial classifier to a corpus of 600 abstracts in 4 biomedical disci-plines, which were manually annotated by Fan-elli (2012). This corpus will be referred to as “Corpus-600” hereafter. Each abstract is marked as “positive”, “negative”, “partially positive”, or “n/a”, based on hypothesis testing results. The latter two types were excluded in our study. The trivial classifier predicted an abstract as “posi-tive” if no negation cues were found. Table 1 reported the prediction results, including the pre-cision and recall in identifying negative results. This result corroborates with our previous find-ing that the inclusiveness of negation cues is not the major problem since high recalls have been observed in both experiments. However, the low precision is the major problem in that the false negative predictions are far more than the true negative predictions. Hence, weeding out the

negations that did not report negative results be-came the main purpose of this preliminary study.

Discipline #abstracts Precision Recall Psychiatry 140 .11 .92 Clinical Medicine

127 .16 .94

Neuroscience 144 .20 .95 Immunology 140 .18 .95 Total 551 .16 .94

Table 1: results of cue-based trivial classifier

3.3 Classification task definition

This preliminary study focuses on separating negations that reported negative results and those not. We limit our study to abstracts at this time. Because a paper may report multiple findings, we performed the prediction at sentence level, and leave for future work the task of aggregating sentence-level predictions to abstract-level or article-level. By this definition, we will classify each sentence as reporting negative finding or not. A sentence that includes mixed findings will be categorized as reporting negative finding.

“Not” and “no” are the most frequent negation cues in the Bioscope corpus, accounting for more than 85% of all occurrences of negation cues. In this study we also examined whether local con-text, such as the words, parts-of-speeches, and constituents surrounding the negation cues, would be useful for predicting negative findings. Considering that different negation cues may be used in different contexts to report negative find-ings, we built a classifier based on the local con-texts of “no” and “not”. Contexts for other nega-tion cues will be studied in the future.

Therefore, our goal is to extract sentences con-taining “no” or “not” from abstracts, and predict whether they report negative findings or not.

3.4 Training data

We obtained a set of “positive examples”, which are negative-finding sentences, and a set of “negative examples” that did not report nega-tive findings. The examples were obtained in the following way.

Positive examples. These are sentences that used “no” or “not” to report negative findings. We extracted all sentences that contain “no” or “not” in JNRBM abstracts, and manually marked each sentence as reporting negative findings or

21

Page 34: Proceedings of the 2014 Workshop on Biomedical Natural …acl2014.org/acl2014/W14-34/W14-34-2014.pdf · 2014. 6. 18. · Proceedings of the 2014 Workshop on Biomedical Natural Language

not. Finally we obtained 158 sentences reporting negative findings.

To increase the number of negative-finding examples and add variety to writing styles, we repeat the above annotations to all Lancet ab-stracts (“result” and “finding” subsections only) in the PubMed Central open access subset, and obtained 55 more such sentences. Now we have obtained 213 negative-finding examples in total.

Negative examples. To reduce the workload for manual labeling, we utilized the heuristic rule that a “no” or “not” does not report negative re-sult if it occurs in a positive abstract, therefore we extracted such sentences from positive ab-stracts in “Corpus-600”. These are the negative examples we will use. To balance the number of positive and negative examples, we used a total of 231 negative examples in two domains (132 in clinical medicine and 99 in neuroscience) instead of all four domains, because there are not enough positive examples.

Now the training data is ready for use.

3.5 Feature extraction

We compared three text representation methods by their effectiveness in building the classifier. The approaches are (1) BOW: simple bag-of-words, (2) E-BOW: bag-of-words enriched with PoS tagging and shallow parsing, and (3) LCE-BOW: local contexts of the negation cues “no” and ‘not”, including the words, PoS tags, and constituents. For (2) and (3), we ran the OpenNLP chunker through all sentences in the training data. For (3), we extracted the following features for each sentence: • The type of chunk (constituent) where

“no/not” is in (e.g. verb phrase “VP”);

• The types of two chunks before and after the chunk where “not” is in;

• All words or punctuations in these chunks; • The parts-of-speech of all these words.

See Table 2 below for an example of negative

finding: row 1 is the original sentence; row 2 is the chunked sentence, and row 3 is the extracted local context of the negation cue “not”. These three representations were then converted to fea-ture vectors using the “bag-of-words” representa-tion. To reduce vocabulary size, we removed words that occurred only once.

(1)

Vascular mortality did not differ signifi-cantly (0.19% vs 0.19% per year, p=0.7).

(2)

"[NP Vascular/JJ mortality/NN ] [VP did/VBD not/RB differ/VB ] [ADVP sig-nificantly/RB ] [PP (/-LRB- ] [NP 019/CD %/NN ] [PP vs/IN ] [NP 019/CD %/NN ] [PP per/IN ] [NP year/NN ] ,/, [NP p=07/NNS ] [VP )/-RRB- ] ./."

(3) “na na VP ADVP PP did not differ signif-icantly VBD RB VB RB”

Table 2: text representations

3.6 Classification result

We applied two supervised learning algorithms, multinomial naïve Bayes (MNB), and Support Vector Machines (Liblinear) to the unigram fea-ture vectors. We used the Sci-kit Learn toolkit to carry out the experiment, and compared the algo-rithms’ performance using 5-fold cross valida-tion. All algorithms were set to the default pa-rameter setting.

Representation MNB SVM Presence

vs. absence

BOW .82 .79 E-BOW .82 .79 LCE-BOW .72 .72

tf BOW .82 .79 E-BOW .84 .79 LCE-BOW .72 .72

Tfidf BOW .82 .75 E-BOW .84 .73 LCE-BOW .72 .75

Table 3: classification accuracy

Table 3 reports the classification accuracy. Be-cause the data set contains 213 positive and 231 negative examples, the majority vote baseline is .52. Both algorithms combined with any text rep-resentation methods outperformed the majority baseline significantly. Among them the best clas-sifier is a MNB classifier based on enriched bag-of-words representation and tfidf weighting. Alt-hough LCE-BOW reached as high as .75 accura-cy using SVM and tfidf weighting, it did not per-form as well as the other text representation methods, indicating that the local context with +/- 2 window did not capture all relevant indica-tors for negative findings.

Tuning the regularization parameter C in SVM did not improve the accuracy. Adding bi-

22

Page 35: Proceedings of the 2014 Workshop on Biomedical Natural …acl2014.org/acl2014/W14-34/W14-34-2014.pdf · 2014. 6. 18. · Proceedings of the 2014 Workshop on Biomedical Natural Language

grams to the feature set resulted in slightly lower accuracy.

4 Conclusion

In this study we aimed for building a classifier to predict whether a sentence containing the words “no” or “not” reported negative findings. Built with MNB algorithms and enriched bag-of-words features with tfidf weighting, the best classifier reached .84 accuracy on a balanced data set. This preliminary study shows promising re-sults for automatically identifying negative find-ings for the purpose of tracking publication bias. To reach this goal, we will have to aggregate the sentence-level predictions on individual findings to abstract- or article-level negative results. The aggregation strategy is dependent on the decision of which finding is the primary outcome when multiple findings are present. We leave this as our future work.

Reference S. Agarwal, H. Yu, and I. Kohane, I. 2011. BioNOT:

A searchable database of biomedical negated sen-tences. BMC bioinformatics, 12: 420.

C. Blake. 2010. Beyond genes, proteins, and ab-stracts: Identifying scientific claims from full-text biomedical articles. Journal of biomedical infor-matics, 43(2): 173-189.

W. W. Chapman, W. Bridewell, P. Hanbury, G. F. Cooper, and B. G. Buchanan. 2001. Evaluation of negation phrases in narrative clinical reports. Pro-ceedings of the AMIA Symposium, 105.

P. J. Easterbrook, R. Gopalan, J. A. Berlin, and D. R. Matthews. 1991. Publication bias in clinical re-search. Lancet, 337(8746): 867-872.

M. Egger, G. D. Smith, M. Schneider, and C. Minder. 1997. Bias in meta-analysis detected by a simple, graphical test. BMJ 315(7109): 629-634.

D. Fanelli. 2012. Negative results are disappearing from most disciplines and coun-tries. Scientometrics 90(3): 891-904.

R. Farkas, V. Vincze, G. Móra, J. Csirik, and G. Szar-vas. 2010. The CoNLL-2010 shared task: learning to detect hedges and their scope in natural language text. Proceedings of the Fourteenth Conference on Computational Natural Language Learning---Shared Task, 1-12.

S. A. Greenberg. 2009. How citation distortions create unfounded authority: analysis of a citation net-work. BMJ 339, b2680.

R. S. Hebert, S. M. Wright, R. S. Dittus, and T. A. Elasy. 2002. Prominent medical journals often pro-vide insufficient information to assess the validity of studies with negative results. Journal of Nega-tive Results in Biomedicine 1(1):1.

M. Light, X-Y Qiu, and P. Srinivasan. 2004. The lan-guage of bioscience: Facts, speculations, and statements in between. In Proceedings of BioLink 2004 workshop on linking biological literature, on-tologies and databases: tools for users, pp. 17-24.

R. Morante, A. Liekens, and W. Daelemans. 2008. Learning the scope of negation in biomedical texts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 715-724.

R. Morante, and W. Daelemans. 2009. A metalearn-ing approach to processing the scope of negation. Proceedings of the Thirteenth Conference on Com-putational Natural Language Learning, 21-29.

R. Rosenthal. 1979. The file drawer problem and tol-erance for null results. Psychological Bulle-tin, 86(3): 638.

G. Szarvas, V. Vincze, R. Farkas, and J. Csirik. 2008. The BioScope corpus: annotation for negation, un-certainty and their scope in biomedical texts. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing, pp. 38-45.

23

Page 36: Proceedings of the 2014 Workshop on Biomedical Natural …acl2014.org/acl2014/W14-34/W14-34-2014.pdf · 2014. 6. 18. · Proceedings of the 2014 Workshop on Biomedical Natural Language

Proceedings of the 2014 Workshop on Biomedical Natural Language Processing (BioNLP 2014), pages 24–28,Baltimore, Maryland USA, June 26-27 2014. c©2014 Association for Computational Linguistics

Automated Disease Normalization with Low Rank Approximations

Robert Leaman Zhiyong Lu

National Center for Biotechnology Information

National Library of Medicine {robert.leaman, zhiyong.lu}@nih.gov

Abstract

While machine learning methods for

named entity recognition (mention-level

detection) have become common, ma-

chine learning methods have rarely been

applied to normalization (concept-level

identification). Recent research intro-

duced a machine learning method for

normalization based on pairwise learning

to rank. This method, DNorm, uses a lin-

ear model to score the similarity between

mentions and concept names, and has

several desirable properties, including

learning term variation directly from

training data. In this manuscript we em-

ploy a dimensionality reduction tech-

nique based on low-rank matrix approx-

imation, similar to latent semantic index-

ing. We compare the performance of the

low rank method to previous work, using

disease name normalization in the NCBI

Disease Corpus as the test case, and

demonstrate increased performance as

the matrix rank increases. We further

demonstrate a significant reduction in the

number of parameters to be learned and

discuss the implications of this result in

the context of algorithm scalability.

1 Introduction

The data necessary to answer a wide variety of

biomedical research questions is locked away in

narrative text. Automating the location (named

entity recognition) and identification (normaliza-

tion) of key biomedical entities (Doğan et al.,

2009; Névéol et al., 2011) such as diseases, pro-

teins and chemicals in narrative text may reduce

curation costs, enable significantly increased

scale and ultimately accelerate biomedical dis-

covery (Wei et al., 2012a).

Named entity recognition (NER) techniques

have typically focused on machine learning

methods such as conditional random fields

(CRFs), which have provided high performance

when coupled with a rich feature approach. The

utility of NER for biomedical end users is lim-

ited, however, since many applications require

each mention to be normalized, that is, identified

within a specified controlled vocabulary.

The normalization task has been highlighted in

the BioCreative challenges (Hirschman et al.,

2005; Lu et al., 2011; Morgan et al., 2008),

where a variety of methods have been explored

for normalizing gene names, including string

matching, pattern matching, and heuristic rules.

Similar methods have been applied to disease

names (Doğan & Lu, 2012b; Kang et al., 2012;

Névéol et al., 2009) and species names (Gerner

et al., 2010; Wei et al., 2012b), and the MetaMap

program is used to locate and identify concepts

from the UMLS MetaThesaurus (Aronson, 2001;

Bodenreider, 2004).

Machine learning methods for NER have pro-

vided high performance, enhanced system adapt-

ability to new entity types, and abstracted many

details of specific rule patterns. While machine

learning methods for normalization have been

explored (Tsuruoka et al., 2007; Wermter et al.,

2009), these are far less common. This is partial-

ly due to the lack of appropriate training data,

and also partially due to the need for a general-

izable supporting framework.

Normalization is frequently decomposed into

the sub-tasks of candidate generation and disam-

biguation (Lu et al., 2011; Morgan et al., 2008).

During candidate generation, the set of concept

names is constrained to a set of possible matches

using the text of the mention. The primary diffi-

culty addressed in candidate generation is term

variation: the need to identify terms which are

semantically similar but textually distinct (e.g.

“nephropathy” and “kidney disease”). The dis-

ambiguation step then differentiates between the

different candidates to remove false positives,

typically using the context of the mention and the

article metadata.

24

Page 37: Proceedings of the 2014 Workshop on Biomedical Natural …acl2014.org/acl2014/W14-34/W14-34-2014.pdf · 2014. 6. 18. · Proceedings of the 2014 Workshop on Biomedical Natural Language

Recently, Leaman et al. (2013a) developed an

algorithm (DNorm) that directly addresses the

term variation problem with machine learning,

and used diseases – an important biomedical en-

tity – as the first case study. The algorithm learns

a similarity function between mentions and con-

cept names directly from training data using a

method based on pairwise learning to rank. The

method was shown to provide high performance

on the NCBI Disease Corpus (Doğan et al., 2014;

Doğan & Lu, 2012a), and was also applied to

clinical notes in the ShARe / CLEF eHealth task

(Suominen et al., 2013), where it achieved the

highest normalization performance out of 17 in-

ternational teams (Leaman et al., 2013b). The

normalization step does not consider context, and

therefore must be combined with a disambigua-

tion method for tasks where disambiguation is

important. However, this method provides high

performance when paired with a conditional ran-

dom field system for NER, making the combina-

tion a step towards fully adaptable mention

recognition and normalization systems.

This manuscript adapts DNorm to use a di-

mensionality reduction technique based on low

rank matrix approximation. This may provide

several benefits. First, it may increase the scala-

bility of the method, since the number of pa-

rameters used by the original technique is pro-

portional to the square of the number of unique

tokens. Second, reducing the number of parame-

ters may, in turn, improve the stability of the

method and improve its generalization due to the

induction of a latent “concept space,” similar to

latent semantic indexing (Bai et al., 2010). Final-

ly, while the rich feature approach typically used

with conditional random fields allows it to par-

tially compensate for out-of-vocabulary effects,

DNorm ignores unknown tokens. This reduces

the ability of the model to generalize, due to the

zipfian distribution of text (Manning & Schütze,

1999), and is especially problematic in text

which contains many misspellings, such as con-

sumer text. Using a richer feature space with

DNorm would not be feasible, however, unless

the parameter scalability problem is resolved.

In this article we expand the DNorm method

in a pilot study on feasibility of using low rank

approximation methods for disease name nor-

malization. To make this work comparable to the

previous work on DNorm, we again employed

the NCBI Disease Corpus (Doğan et al., 2014).

This corpus contains nearly 800 abstracts, split

into training, development, and test sets, as de-

scribed in Table 1. Each disease mention is anno-

tated for span and concept, using the MEDIC

vocabulary (Davis et al., 2012), which combines

MeSH® (Coletti & Bleich, 2001) and OMIM®

(Amberger et al., 2011). The average number of

concepts for each name in the vocabulary is 5.72.

Disease names exhibit relatively low ambiguity,

with an average number of concepts per name of

1.01.

Subset Abstracts Mentions Concepts

Training 593 5145 670

Development 100 787 176

Test 100 960 203

Table 1. Descriptive statistics for the NCBI Disease

Corpus.

2 Methods

DNorm uses the BANNER NER system

(Leaman & Gonzalez, 2008) to locate disease

mentions, and then employs a ranking method to

normalize each mention found to the disease

concepts in the lexicon (Leaman et al., 2013a).

Briefly, we define to be the set of tokens from

both the disease mentions in the training data and

the concept names in the lexicon. We stem each

token in both disease mentions and concept

names (Porter, 1980), and then convert each to

TF-IDF vectors of dimensionality | |, where the

document frequency for each token is taken to be

the number of names in the lexicon containing it

(Manning et al., 2008). All vectors are normal-

ized to unit length. We define a similarity score

between mention vector and name vector ,

( ), and each mention is normalized by

iterating through all concept names and returning

the disease concept corresponding to the one

with the highest score.

In previous work, ( ) ,

where is a weight matrix and each entry

represents the correlation between token ap-

pearing in a mention and token appearing in a

concept name from the lexicon. In this work,

however, we set to be a low-rank approxima-

tion of the form , where and

are both | | matrices, being the rank

(number of linearly independent rows), and

| | (Bai et al., 2010).

For efficiency, the low-rank scoring function

can be rewritten and evaluated as ( ) ( ) ( ) , allowing the respective

and vectors to be calculated once and then

reused. This view provides an intuitive explana-

tion of the purpose of the and matrices: to

25

Page 38: Proceedings of the 2014 Workshop on Biomedical Natural …acl2014.org/acl2014/W14-34/W14-34-2014.pdf · 2014. 6. 18. · Proceedings of the 2014 Workshop on Biomedical Natural Language

convert the sparse, high-dimensional mention and concept name vectors (m and n) into dense, low dimensional vectors (as Um and Vn). Under this interpretation, we found that performance improved if each Um and Vn vector was renormalized to unit length.
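The renormalized low-rank scoring just described can be made concrete with a short sketch. The following is an illustrative NumPy implementation, not code from the paper; the function names and the dictionary-based lexicon layout are our own assumptions, and m and n are assumed to already be stemmed, unit-length TF-IDF vectors of dimensionality |T|.

```python
import numpy as np

def project(M, x):
    """Project a sparse |T|-dimensional TF-IDF vector x into the dense
    r-dimensional space and renormalize to unit length, as described above."""
    z = M @ x                       # M is r x |T|, x is |T|-dimensional
    norm = np.linalg.norm(z)
    return z / norm if norm > 0 else z

def score(U, V, m, n):
    """Low-rank similarity: (Um) . (Vn) with renormalized projections."""
    return float(project(U, m) @ project(V, n))

def normalize_mention(U, V, m, concept_names):
    """Return the concept whose best name maximizes the score for mention m.
    concept_names maps a concept id to a list of |T|-dimensional name vectors."""
    um = project(U, m)              # computed once, then reused for every name
    best_concept, best_score = None, -np.inf
    for concept, names in concept_names.items():
        for n in names:
            s = float(um @ project(V, n))
            if s > best_score:
                best_concept, best_score = concept, s
    return best_concept
```

Because Um depends only on the mention, it is computed once and reused across all concept names, which is the efficiency argument made above.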

This model retains many useful properties of the original model, such as the ability to represent both positive and negative correlations between tokens, to represent both synonymy and polysemy, and to allow the token distributions between the mentions and the names to be different. The new model also adds one important additional property: the number of parameters is linear in the number of unique tokens, potentially enabling greater scalability.

2.1 Model Training

Given any pair of disease names where one (n+) is for c+, the correct disease concept for mention m, and the other, n-, is for c-, an incorrect concept, we would like to update the weight matrix so that score(m, n+) > score(m, n-). Following Leaman et al. (2013a), we iterate through each ⟨m, c+, c-⟩ tuple, selecting n+ and n- as the name for c+ and c-, respectively, with the highest similarity score to m, using stochastic gradient descent to make updates to W. With a dense weight matrix W, the update rule is: if score(m, n+) < score(m, n-), then W is updated as W ← W + λ(m(n+)^T − m(n-)^T), where λ is the learning rate, a parameter controlling the size of the change to W. Under the low-rank approximation, the update rules are: if score(m, n+) < score(m, n-), then U is updated as U ← U + λ V(n+ − n-)m^T, and V is updated as V ← V + λ Um(n+ − n-)^T, noting that the updates are applied simultaneously (Bai et al., 2010). Overfitting is avoided using a holdout set, using the average of the ranks of the correct concept as the performance measurement, as in previous work.

We initialize U using values chosen randomly from a normal distribution with mean 0 and standard deviation 1. We found it useful to initialize V as U, since this causes the representation for disease mentions and disease names to initially be the same.

We employed an adaptive learning rate using the schedule λ_t = λ_0 · d / (d + t), where t is the iteration, λ_0 is the initial learning rate, and d is the discount (Finkel et al., 2008). We used an initial learning rate much lower than that reported by Leaman et al. (2013a), since we found that higher values caused the training to diverge. We used a discount parameter of d = 5, so that the learning rate is equal to one half the initial rate after five iterations.
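A minimal sketch of this training step, again assuming NumPy vectors and our own function names rather than the authors' code, is given below. It implements the simultaneous updates and the discounted learning-rate schedule described above.

```python
import numpy as np

def sgd_step(U, V, m, n_pos, n_neg, lam):
    """One pairwise update: if the incorrect name outscores the correct one,
    nudge U and V so that score(m, n_pos) rises relative to score(m, n_neg).
    Both updates use the old U and V, i.e. they are applied simultaneously."""
    if (U @ m) @ (V @ n_pos) < (U @ m) @ (V @ n_neg):
        diff = n_pos - n_neg
        U_new = U + lam * np.outer(V @ diff, m)   # d(score)/dU = V n m^T
        V_new = V + lam * np.outer(U @ m, diff)   # d(score)/dV = U m n^T
        return U_new, V_new
    return U, V

def learning_rate(lam0, discount, iteration):
    """Adaptive schedule lam0 * d / (d + t); with d = 5 the rate halves
    after five iterations, matching the description above."""
    return lam0 * discount / (discount + iteration)
```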

3 Results

Our results were evaluated at the abstract level, allowing comparison to the previous work on DNorm (Leaman et al., 2013a). This evaluation considers the set of disease concepts found in the abstract, and ignores the exact location(s) where each concept was found. A true positive consists of the system returning a disease concept annotated within the NCBI Disease Corpus, and the numbers of false negatives and false positives are defined similarly. We calculated the precision, recall and F-measure as follows: precision = TP / (TP + FP), recall = TP / (TP + FN), and F-measure = (2 · precision · recall) / (precision + recall). We list the micro-averaged results in Table 2.

Rank    Precision  Recall  F-measure
50          0.648   0.671      0.659
100         0.673   0.685      0.679
250         0.697   0.697      0.697
500         0.702   0.700      0.701
(Full)      0.828   0.819      0.809
Table 2. Performance measurements for each model on the NCBI Disease Test set. Full corresponds with the full-rank matrix used in previous work.

4 Discussion

There are two primary trends to note. First, the performance of the low rank models is about 10%-15% lower than the full rank model. Second, there is a clear trend towards higher precision and recall as the rank of the matrix increases. This trend is reinforced in Figure 1, which shows the learning curve for all models. These describe the performance on the holdout set after each iteration through the training data, and are measured using the average rank of the correct concept in the holdout set, which is dominated by a small number of difficult cases.

Using the low rank approximation, the number of parameters is equal to 2r|T|. Since r is fixed and independent of |T|, the number of parameters is now linear in the number of tokens, effectively solving the parameter scalability problem. Table 3 lists the number of parameters for each of the models used in this study.
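For a rough sense of scale (these figures are inferred from Table 3 rather than stated in the text): the low-rank model stores two r × |T| matrices, i.e. 2r|T| parameters, whereas the full-rank model stores |T|^2. The full-rank count of 3.3×10^8 implies |T| ≈ 1.8×10^4 tokens, and 2 × 500 × 1.8×10^4 ≈ 1.8×10^7 matches the rank-500 entry.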


Figure 1. Learning curves showing holdout performance at each iteration through the training data.

Rank    Parameters
50      1.8×10^6
100     3.7×10^6
250     9.1×10^6
500     1.8×10^7
(Full)  3.3×10^8
Table 3. Number of model parameters for each variant, showing the low rank methods using 1 to 2 orders of magnitude fewer parameters.

There are two trade-offs for this improvement in scalability. First, there is a substantial performance reduction, though this might be mitigated somewhat in the future by using a richer feature set – a possibility enabled by the use of the low rank approximation. Second, training and inference times are significantly increased; training the largest low-rank model (r = 500) required approximately 9 days, though the full-rank model trains in under an hour.

The view that the U and V matrices convert the TF-IDF vectors to a lower dimensional space suggests that the function of U and V is to provide word embeddings or word representations – a vector space where each word vector encodes its relationships with other words. This further suggests that one way to provide higher performance may be to take advantage of unsupervised pre-training (Erhan et al., 2010). Instead of initializing U and V randomly, they could be initialized using a set of word embeddings trained on a large amount of biomedical text, such as with neural network language models (Collobert & Weston, 2008; Mikolov et al., 2013).

5 Conclusion

We performed a pilot study to determine whether a low rank approximation may increase the scalability of normalization using pairwise learning to rank. We showed that the reduction in the number of parameters is substantial: it is now linear in the number of tokens, rather than proportional to the square of the number of tokens. We further observed that the precision and recall increase as the rank of the matrices is increased. We believe that further performance increases may be possible through the use of a richer feature set, unsupervised pre-training, or other dimensionality reduction techniques including feature selection or L1 regularization (Tibshirani, 1996). We also intend to apply the method to additional entity types, using recently released corpora such as CRAFT (Bada et al., 2012).

Acknowledgments

The authors would like to thank the anonymous reviewers for their helpful suggestions. This research was supported by the NIH Intramural Research Program, National Library of Medicine.

References

Amberger, J., Bocchini, C., & Hamosh, A. (2011). A new face and new challenges for Online Mendelian Inheritance in Man (OMIM®). Hum Mutat, 32(5), 564-567.
Aronson, A. R. (2001). Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In Proceedings of the AMIA Symposium, 17-21.
Bada, M., Eckert, M., Evans, D., Garcia, K., Shipley, K., Sitnikov, D., et al. (2012). Concept annotation in the CRAFT corpus. BMC Bioinformatics, 13, 161.
Bai, B., Weston, J., Grangier, D., Collobert, R., Sadamasa, K., Qi, Y. J., et al. (2010). Learning to rank with (a lot of) word features. Inform. Retrieval, 13(3), 291-314.
Bodenreider, O. (2004). The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res, 32, D267-270.
Coletti, M. H., & Bleich, H. L. (2001). Medical subject headings used to search the biomedical literature. J Am Med Inform Assoc, 8(4), 317-323.
Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the ICML, 160-167.
Davis, A. P., Wiegers, T. C., Rosenstein, M. C., & Mattingly, C. J. (2012). MEDIC: a practical disease vocabulary used at the Comparative Toxicogenomics Database. Database, 2012, bar065.
Doğan, R. I., Leaman, R., & Lu, Z. (2014). NCBI disease corpus: A resource for disease name recognition and concept normalization. J Biomed Inform, 47, 1-10.
Doğan, R. I., & Lu, Z. (2012a). An improved corpus of disease mentions in PubMed citations. In Proceedings of the ACL 2012 Workshop on BioNLP, 91-99.
Doğan, R. I., & Lu, Z. (2012b). An Inference Method for Disease Name Normalization. In Proceedings of the AAAI 2012 Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, 8-13.
Doğan, R. I., Murray, G. C., Névéol, A., & Lu, Z. (2009). Understanding PubMed user search behavior through log analysis. Database (Oxford), 2009, bap018.
Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., & Bengio, S. (2010). Why does unsupervised pre-training help deep learning? J. Machine Learning Res., 11, 625-660.
Finkel, J. R., Kleeman, A., & Manning, C. D. (2008). Efficient, Feature-based, Conditional Random Field Parsing. In Proceedings of the 46th Annual Meeting of the ACL, 959-967.
Gerner, M., Nenadic, G., & Bergman, C. M. (2010). LINNAEUS: a species name identification system for biomedical literature. BMC Bioinformatics, 11, 85.
Hirschman, L., Colosimo, M., Morgan, A., & Yeh, A. (2005). Overview of BioCreAtIvE task 1B: normalized gene lists. BMC Bioinformatics, 6 Suppl 1, S11.
Kang, N., Singh, B., Afzal, Z., van Mulligen, E. M., & Kors, J. A. (2012). Using rule-based natural language processing to improve disease normalization in biomedical text. J. Am. Med. Inform. Assoc., 20, 876-881.
Leaman, R., Doğan, R. I., & Lu, Z. (2013a). DNorm: Disease name normalization with pairwise learning-to-rank. Bioinformatics, 29(22), 2909-2917.
Leaman, R., & Gonzalez, G. (2008). BANNER: an executable survey of advances in biomedical named entity recognition. Pac. Symp. Biocomput., 652-663.
Leaman, R., Khare, R., & Lu, Z. (2013b). NCBI at 2013 ShARe/CLEF eHealth Shared Task: Disorder Normalization in Clinical Notes with DNorm. In Working Notes of the Conference and Labs of the Evaluation Forum, Valencia, Spain.
Lu, Z., Kao, H. Y., Wei, C. H., Huang, M., Liu, J., Kuo, C. J., et al. (2011). The gene normalization task in BioCreative III. BMC Bioinformatics, 12 Suppl 8, S2.
Manning, C., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Mikolov, T., Yih, W.-t., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations. In Proceedings of the 2013 Conference of the NAACL-HLT, 746-751.
Morgan, A. A., Lu, Z., Wang, X., Cohen, A. M., Fluck, J., Ruch, P., et al. (2008). Overview of BioCreative II gene normalization. Genome Biol., 9 Suppl 2, S3.
Névéol, A., Doğan, R. I., & Lu, Z. (2011). Semi-automatic semantic annotation of PubMed queries: a study on quality, efficiency, satisfaction. J Biomed Inform, 44(2), 310-318.
Névéol, A., Kim, W., Wilbur, W. J., & Lu, Z. (2009). Exploring two biomedical text genres for disease recognition. In Proceedings of the ACL 2009 BioNLP Workshop, 144-152.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14, 130-137.
Suominen, H., Salanterä, S., Velupillai, S., Chapman, W., Savova, G., Elhadad, N., et al. (2013). Overview of the ShARe/CLEF eHealth Evaluation Lab 2013. In P. Forner, H. Müller, R. Paredes, P. Rosso & B. Stein (Eds.), Information Access Evaluation. Multilinguality, Multimodality, and Visualization (Vol. 8138, pp. 212-231). Springer Berlin Heidelberg.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B-Methodological, 58(1), 267-288.
Tsuruoka, Y., McNaught, J., Tsujii, J., & Ananiadou, S. (2007). Learning string similarity measures for gene/protein name dictionary look-up using logistic regression. Bioinformatics, 23(20), 2768-2774.
Wei, C. H., Harris, B. R., Li, D., Berardini, T. Z., Huala, E., Kao, H. Y., et al. (2012a). Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts. Database (Oxford), 2012, bas041.
Wei, C. H., Kao, H. Y., & Lu, Z. (2012b). SR4GN: a species recognition software tool for gene normalization. PLoS One, 7(6), e38460.
Wermter, J., Tomanek, K., & Hahn, U. (2009). High-performance gene name normalization with GeNo. Bioinformatics, 25(6), 815-821.


Proceedings of the 2014 Workshop on Biomedical Natural Language Processing (BioNLP 2014), pages 29–37, Baltimore, Maryland, USA, June 26-27 2014. ©2014 Association for Computational Linguistics

Decomposing Consumer Health Questions

Kirk Roberts, Halil Kilicoglu, Marcelo Fiszman, and Dina Demner-Fushman
National Library of Medicine, National Institutes of Health
Bethesda, MD
[email protected], {kilicogluh,fiszmanm,ddemner}@mail.nih.gov

Abstract

This paper presents a method for decomposing long, complex consumer health questions. Our approach largely decomposes questions using their syntactic structure, recognizing independent questions embedded in clauses, as well as coordinations and exemplifying phrases. Additionally, we identify elements specific to disease-related consumer health questions, such as the focus disease and background information. To achieve this, our approach combines rank-and-filter machine learning methods with rule-based methods. Our results demonstrate significant improvements over the heuristic methods typically employed for question decomposition that rely only on the syntactic parse tree.

1 Introduction

Natural language questions provide an intuitivemethod for consumers (non-experts) to query forhealth-related content. The most intuitive wayfor consumers to formulate written questions isthe same way they write to other humans: multi-sentence, complex questions that contain back-ground information and often more than one spe-cific question. Consider the following:

• Will Fabry disease affect a transplanted kidney?Previous to the transplant the disease was be-ing managed with an enzyme supplement. Willthis need to be continued? What cautions or ad-ditional treatments are required to manage thedisease with a transplanted kidney?

This complex question contains three questionsentences and one background sentence. The fo-cus (Fabry disease) is stated in the first questionbut is necessary for a full understanding of theother questions as well. The background sentenceis necessary to understand the second question:the anaphor this must be resolved to an enzymetreatment, and the predicate continue’s implicit ar-gument that must be re-constructed from the dis-course (i.e., continue after a kidney transplant).The final question sentence uses a coordinationto ask two separate questions (cautions and addi-tional treatments). A decomposition of this com-plex question would then result in four questions:

1. Will Fabry disease affect a transplanted kidney?
2. Will enzyme treatment for Fabry disease need to be continued after a kidney transplant?
3. What cautions are required to manage Fabry disease with a transplanted kidney?
4. What additional treatments are required to manage Fabry disease with a transplanted kidney?

Each question above could be independently an-swered by a question answering (QA) system.While previous work has discussed methods forresolving co-reference and implicit arguments inconsumer health questions (Kilicoglu et al., 2013),it does not address question decomposition.

In this work, we propose methods for auto-matically recognizing six annotation types use-ful for decomposing consumer health questions.These annotations distinguish between sentencesthat contain questions and background informa-tion. They also identify when a question sentencecan be split in multiple independent questions, and


when they contain optional or coordinated infor-mation embedded within a question.

For each of these decomposition annotations,we propose a combination of machine learning(ML) and rule based methods. The ML methodslargely take the form of a 3-step rank-and-filterapproach, where candidates are generated, rankedby an ML classifier, then the top-ranked candidateis passed through a separate ML filtering classi-fier. We evaluate each of these methods on a set of1,467 consumer health questions related to geneticand rare diseases.

2 Background

QA in the biomedical domain has been well-studied (Demner-Fushman and Lin, 2007; Cairnset al., 2011; Cao et al., 2011) as a means for re-trieving medical information. This work has typ-ically focused, however, on questions posed bymedical professionals, and the methods proposedfor question analysis generally assume a single,concise question. For example, Demner-Fushmanand Abhyankar (2012) propose a method for ex-tracting frames from queries for the purpose ofcohort retrieval. Their method assumes syntacticdependencies exist between the necessary frameelements, and is thus not well-suited to handlelong, multi-sentence questions. Similarly, Ander-sen et al. (2012) proposes a method for convertinga concise question into a structured query. How-ever, many medical questions require backgroundinformation that is difficult to encode in a singlequestion sentence. Instead, it is often more naturalto ask multiple questions over several sentences,providing background information to give contextto the questions. Yu and Cao (2008) use a MLmethod to recognize question types in professionalhealth questions. Their method can identify morethan one type per complex question. Without de-composing the full question into its sub-questions,however, the type cannot be associated with itsspecific span, or with other information specific tothe sub-question. This other information can in-clude answer types, question focus, and other an-swer constraints. By decomposing multi-sentencequestions, these question-specific attributes can beextracted, and the discourse structure of the largerquestion can be better understood.

Question decomposition has been utilized be-fore in open-domain QA approaches, but rarelyevaluated on its own. Lacatusu et al. (2006)

demonstrates how question decomposition can im-prove the performance of a multi-sentence sum-marization system. They perform what we referto as syntactic question decomposition, where thesyntactic structure of the question is used to iden-tify sub-questions that can be answered in isola-tion. A second form of question decomposition issemantic decomposition, which can semanticallybreak individual questions apart to answer themin stages. For instance, the question “When didthe third U.S. President die?” can be semanticallydecomposed “Who was the third U.S. President?”and “When did X die?”, where the answer to thefirst question is substituted into the second. Katzand Grau (2005) discusses this kind of decompo-sition using the syntactic structure, though it is notempirically validated. Hartrumpf (2008) proposesa decomposition method using only the deep se-mantic structure. Finally, Harabagiu et al. (2006)proposes a different type of question decomposi-tion based on a random walk over similar ques-tions extracted from a corpus. In our work, wefocus on syntactic question decomposition. Wedemonstrate the importance of empirical evalua-tion of question decomposition, notably the pit-falls of heuristic approaches that rely entirely onthe syntactic parse tree. Syntactic parsers trainedon Treebank are particularly poor at both analyz-ing questions (Judge et al., 2006) and coordinationboundaries (Hogan, 2007). Robust question de-composition methods, therefore, must be able toovercome many of these difficulties.

3 Consumer Health Question Decomposition

Our goal is to decompose multi-sentence, multi-faceted consumer health questions into concisequestions coupled with important contextual in-formation. To this end, we utilize a set of an-notations that identify the decomposable elementsand important contextual elements. A more de-tailed description of these annotations is providedin Roberts et al. (2014). The annotations are pub-licly available at our institution website1. Here, webriefly describe each annotation:

(1) BACKGROUND - a sentence indicating usefulcontextual information, but lacks a question.

(2) QUESTION - a sentence or clause that indi-cates an independent question.

1http://lhncbc.nlm.nih.gov/project/consumer-health-question-answering



Figure 1: Question Decomposition Architecture. Modules with solid green lines indicate machine learning classifiers. Modules with dotted green lines indicate rule-based classifiers.

(3) COORDINATION - a phrase that spans a set ofdecomposable items.

(4) EXEMPLIFICATION - a phrase that spans anoptional item.

(5) IGNORE - a sentence indicating nothing ofvalue is present.

(6) FOCUS - an NP indicating the theme of theconsumer health question.

Further explanations of each annotation are provided in Sections 4-9. To convert these annotations into separate, decomposed questions, a simple set of recursive rules is used. The rules enumerate all ways of including one conjunct from each COORDINATION as well as whether or not to include the phrase within an EXEMPLIFICATION. These rules must be applied recursively to handle overlapping annotations (e.g., a COORDINATION within an EXEMPLIFICATION). Our implementation is straight-forward and not discussed further in this paper. The BACKGROUND and FOCUS annotations do not play a direct role in this process, though they provide important contextual elements and are useful for co-reference, and are thus still considered part of the overall decomposition process.
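Although the implementation is not detailed in the paper, the recursive enumeration it alludes to can be sketched as follows. This is our own illustrative reconstruction: it assumes a hypothetical nested-segment representation of a question (plain strings, COORDINATION alternatives, and optional EXEMPLIFICATION spans) rather than the annotation format actually used.

```python
from itertools import product

def expand(segments):
    """Enumerate decomposed questions from a nested segment structure.

    A segment is either a plain string, ("COORD", [list of segment-lists]),
    which keeps exactly one conjunct, or ("EXEMP", segment-list), which is
    optional. The nesting makes overlapping annotations, e.g. a COORDINATION
    inside an EXEMPLIFICATION, a natural recursion."""
    options_per_segment = []
    for seg in segments:
        if isinstance(seg, str):
            options_per_segment.append([seg])
        elif seg[0] == "COORD":
            alts = []
            for conjunct in seg[1]:
                alts.extend(expand(conjunct))      # one conjunct per variant
            options_per_segment.append(alts)
        elif seg[0] == "EXEMP":
            alts = [""]                            # drop the optional phrase...
            alts.extend(" " + s for s in expand(seg[1]))   # ...or keep a variant
            options_per_segment.append(alts)
    return ["".join(parts).replace("  ", " ").strip()
            for parts in product(*options_per_segment)]

# Example: "How can I learn more about treatments and clinical trials?"
question = ["How can I learn more about ",
            ("COORD", [["treatments"], ["clinical trials"]]),
            "?"]
print(expand(question))
# -> ['How can I learn more about treatments?',
#     'How can I learn more about clinical trials?']
```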

It should also be noted that some questions aresyntactically decomposable, but doing so alterstheir original meaning. Consider the followingtwo question sentences:

• Can this disease be cured or can we only treatthe symptoms?• Are males or females worse affected?

While the first example contains two "Can..." questions and the second example contains the coordination "males or females", both questions are providing a choice between two alternatives and decomposing them would alter the semantic nature of the original question. In these cases, we do not consider the questions to be decomposable.

Data: We use a set of consumer health questions collected from the Genetic and Rare Diseases Information Center (GARD), which maintains a website (http://rarediseases.info.nih.gov/gard) with publicly available consumer-submitted questions and professionally-authored answers about genetic and rare diseases. We collected 1,467 consumer health questions, consisting of 4,115 sentences, 1,713 BACKGROUND sentences, 37 IGNORE sentences, 2,465 QUESTIONs, 367 COORDINATIONs, 53 EXEMPLIFICATIONs, and 1,513 FOCUS annotations. Questions with more than one FOCUS are generally concerned with the relation between diseases. Further information about the corpus and the annotation process can be found in Roberts et al. (2014).

System Architecture: The architecture of our question decomposition method is illustrated in


Figure 1. To avoid confusion, in the rest of this paper we refer to a complex consumer health question simply as a request. Requests are sent to the independent FOCUS recognition module (Section 4), and then proceed through a pipeline that includes the classification of sentences (Section 5), the identification of separate QUESTIONs within a question sentence (Section 6), the recognition of COORDINATIONs (Section 7) and EXEMPLIFICATIONs (Section 8), and the sub-classification of BACKGROUND sentences (Section 9).

Experimental Setup: The remainder of this paper describes the individual modules in Figure 1. For simplicity, we show results on the GARD data for each task in its corresponding section. In all cases, experiments are conducted using a 5-fold cross-validation on the GARD data. The cross-validation folds are organized at the request level so that no two items from the same request will be split between the training and testing data.

4 Identifying the Focal Disease

The FOCUS is the condition that disease-centeredquestions are centered around. Many other dis-eases may be mentioned, but the FOCUS is the dis-ease of central concern. This is similar to the as-sumption made about a central disease in Medlineabstracts (Demner-Fushman and Lin, 2007). Of-ten the FOCUS is stated in the first sentence (typ-ically a BACKGROUND) of the request while thequestions are near the end. The questions can-not generally be answered outside the context ofthe FOCUS, however, so its identification is a crit-ical part of decomposition. As shown in Figure 1,we use a 3-step process: (1) a high-recall methodidentifies potential FOCUS diseases in the data, (2)a support vector machine (SVM) ranks the FO-CUS candidates, and (3) the highest-ranking can-didate’s boundary is modified with a set of rules tobetter match our annotation standard.

To identify candidates for the FOCUS, we use alexicon constructed from UMLS (Lindberg et al.,1993). UMLS includes very generic terms, such asdisease and cancer, that are too general to exactlymatch a FOCUS in our data. We allow these termsto be candidates so as to not miss any FOCUS thatdoesn’t exactly match an entry in UMLS. Whensuch a general term is selected as the top-rankedFOCUS, the rules described below are capable ofexpanding the term to the full disease name.

E/R  P  R  F1
1st UMLS Disorder  E  19.6  19.0  19.3
                   R  28.2  27.4  27.8
SVM                E  56.4  54.7  55.6
                   R  89.2  86.5  87.9
SVM + Rules        E  74.8  72.5  73.6
                   R  89.5  86.8  88.1
Table 1: FOCUS recognition results. E = exact match; R = relaxed match.

To rank candidates, we utilize an SVM (Fan et al., 2008) with a small number of feature types:

• Unigrams. Identifies generic words such as disease and syndrome that indicate good FOCUS candidates, while also recognizing noisy UMLS terms that are often false positives.
• UMLS semantic group (McCray et al., 2001).
• UMLS semantic type.
• Sentence Offset. The FOCUS is typically in the first sentence, and is far more likely to be at the beginning of the request than the end.
• Lexicon Offset. The FOCUS is typically the first disease mentioned.

During training, the SVM considers any candidate that overlaps the gold FOCUS to be correct. This enables our approach to train on FOCUS examples that do not perfectly align with a UMLS concept. At test time, all candidates are classified, ranked by the classifier's confidence, and the top-ranked candidate is considered the FOCUS.

As mentioned above, there are differences between how a FOCUS is annotated in our data and how it is represented in the UMLS. We therefore use a series of heuristics to alter the boundary to a more usable FOCUS after it is chosen by the SVM. The rules are applied iteratively to widen the FOCUS boundary until it cannot be expanded any further. If a generic disease word is the only token in the FOCUS, we add the token to the left. Conversely, if the token on the right is a generic disease word, it is added as well. If the word to the left is capitalized, it is safe to assume it is part of the disease's name and so it is added as well. Finally, several rules recognize the various ways in which a disease sub-type might be specified (e.g., Behcet's syndrome vascular type, type 2 diabetes, Charcot-Marie-Tooth disease type 2C).
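A compact sketch of the candidate-ranking step, using scikit-learn as a stand-in for the LIBLINEAR-based SVM (Fan et al., 2008) and our own hypothetical candidate fields, might look like this; the boundary-expansion rules above would then be applied to the returned span.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Hypothetical candidate: a dict with "tokens", "semantic_group",
# "semantic_type", "sentence_index", and "lexicon_index" fields.

def focus_features(cand):
    """Feature map mirroring the feature types listed above."""
    feats = {"group=" + cand["semantic_group"]: 1.0,
             "type=" + cand["semantic_type"]: 1.0,
             "sentence_offset": float(cand["sentence_index"]),
             "lexicon_offset": float(cand["lexicon_index"])}
    for tok in cand["tokens"]:
        feats["unigram=" + tok.lower()] = 1.0
    return feats

def train_ranker(candidates, labels):
    """labels: 1 if the candidate overlaps the gold FOCUS, else 0."""
    vec = DictVectorizer()
    X = vec.fit_transform([focus_features(c) for c in candidates])
    return vec, LinearSVC().fit(X, labels)

def pick_focus(vec, clf, request_candidates):
    """Rank all candidates in a request and return the top-scoring one."""
    X = vec.transform([focus_features(c) for c in request_candidates])
    scores = clf.decision_function(X)
    return request_candidates[int(scores.argmax())]
```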

We evaluate FOCUS recognition with both anexact match, where the gold and automatic FOCUS

boundaries must line up perfectly, and a relaxedmatch, which only requires a partial overlap. As abaseline, we compare our results against a fullyrule-based system where the first UMLS Disor-der term in the request is considered the FOCUS.


We also evaluate the effectiveness of our bound-ary altering rules by measuring performance with-out these rules. The results are shown in Table 1.The baseline method shows significant problemsin precision and recall. It is not able to ignorenoisy UMLS terms (e.g., aim is both a gene anda treatment). The SVM improves upon the rule-based method by over 50 points in F1 for relaxedmatching. Adding the boundary fixing rules haslittle effect on relaxed matching, but greatly im-proves exact matching: precision and recall areimproved by 18.4 and 17.8 points, respectively.

5 Classifying Sentences

Before precise question boundaries can be rec-ognized, we first identify sentences that con-tain QUESTIONs, as distinguished from BACK-GROUND and IGNORE sentences. It should benoted that many of the question sentences in ourdata are not typical wh-word questions. About20% of the questions in our data end in a period.For instance:• Please tell me more about this condition.• I was wondering if you could let me know where

I can find more information on this topic.• I would like to get in contact with other families

that have this illness.We consider a sentence to be a question if it con-tains any information request, explicit or implicit.

After sentence splitting, we classify each sentence using a multi-class SVM with three feature types:

• Unigrams with parts-of-speech (POS). Reduces unigram ambiguities, such as what-WP (a pronoun, indicative of a question) versus what-WDT (a determiner, not indicative).
• Bigrams.
• Parse tree tags. All Treebank tags from the syntactic parse tree. Captures syntactic question clues such as the phrase tags SQ (question sentence) and WHNP (wh-word noun phrase).

The SVM classifier performs at 97.8%. For comparison, an SVM with only unigram features performs at 97.2%. While the unigram model does a good job classifying sentences, suggesting this is a very easy task, the improved feature set reduces the number of errors by 20%.
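A sketch of the feature extraction for this classifier is shown below. It uses NLTK's tokenizer and POS tagger (which require the punkt and averaged_perceptron_tagger resources) purely for illustration; the parse-tree tags come from a constituency parse (Figure 1 lists the Stanford Parser), so they are passed in precomputed here. Names are our own.

```python
import nltk

def sentence_features(sentence, parse_tags=()):
    """Unigram+POS, bigram, and parse-tree-tag features for one sentence.
    parse_tags would be the Treebank tags (e.g. "SQ", "WHNP") read off a
    constituency parse; they are supplied precomputed to keep this sketch
    self-contained."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))    # e.g. ("what", "WP")
    feats = {}
    for word, tag in tagged:
        feats["uni=%s-%s" % (word.lower(), tag)] = 1.0     # what-WP vs. what-WDT
    for (w1, _), (w2, _) in zip(tagged, tagged[1:]):
        feats["bi=%s_%s" % (w1.lower(), w2.lower())] = 1.0
    for tag in parse_tags:
        feats["parse=" + tag] = 1.0
    return feats
```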

6 Identifying Questions

QUESTION recognition is the task of identifyingwhen a conjunction like and joins two independentquestions into a single sentence:

• [What causes the condition]QUESTION [and whattreatment is available?]QUESTION

• [What is this disease]QUESTION [and what stepscan I take to protect my daughter?]QUESTION

We consider the identification of separate QUES-TIONs within a single sentence to be a differ-ent task from COORDINATION recognition, whichfinds phrases whose conjuncts can be treated in-dependently. Linguistically, these tasks are quitesimilar, but the distinction lies in whether theright-conjunct syntactically depends on anythingto its left. For instance:

• I would like to learn [more about this conditionand what the prognosis is for a baby born withit]COORDINATION.

Here, the right-conjunct starts with a questionstem (what), but is not a complete, grammaticalquestion on its own. Alternatively, this could bere-formed into two separate QUESTIONs:

• [I would like to learn more about thiscondition,]QUESTION [and what is the prognosisis for a baby born with it.]QUESTION

We make this distinction because the QUESTION

recognition task requires one fewer step since theboundaries extend to the entire sentence, prevent-ing error propagation from an input module. Fur-ther, the features that differentiate our QUESTION

and COORDINATION annotations are different.The two-step process for recognizing QUES-

TIONs includes: (1) a high-recall candidate gener-ator, and (2) an SVM to eliminate candidates thatare not separate QUESTIONs. The candidates forQUESTION recognition are simply all the ways asentence can be split by the conjunctions and, or,as well as, and the forward slash (“/”). In our data,this candidate generation process has a recall of98.6, as a few examples were missed where candi-dates were not separated by one of the above con-junctions.

To filter candidates, we use an SVM with three feature types:

• The conjunction separating the QUESTIONs.
• Unigrams in the left-conjunct. Identifies when the left-conjunct is not a QUESTION, or when a question is part of a COORDINATION.
• The right-conjunct's parse tree tag. Recognizes when the right-conjunct is an independent clause that may safely be split.
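The candidate generator for this two-step process is straightforward to sketch; the version below (our own illustration, with hypothetical names) simply proposes every split of a question sentence at one of the four conjunctions, and each candidate would then be passed to the SVM filter with the three features listed above.

```python
CONJUNCTIONS = [" and ", " or ", " as well as ", "/"]

def question_split_candidates(sentence):
    """All ways of splitting a question sentence at one of the four
    conjunctions; each candidate is (left_conjunct, conjunction, right_conjunct)."""
    candidates = []
    for conj in CONJUNCTIONS:
        start = 0
        while True:
            i = sentence.find(conj, start)
            if i == -1:
                break
            candidates.append((sentence[:i], conj.strip(),
                               sentence[i + len(conj):]))
            start = i + len(conj)
    return candidates

# question_split_candidates("What causes the condition and what treatment is available?")
# -> [("What causes the condition", "and", "what treatment is available?")]
```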


                              P     R    F1
QUESTION split recognition
Baseline                   24.7  82.4  38.0
SVM                        67.7  64.7  66.2
Overall QUESTION recognition
Baseline                   87.3  92.8  90.0
SVM                        97.7  97.4  97.5
Table 2: QUESTION recognition results.

For evaluation, we measure both the F1 scorefor correct candidates, and the overall F1 for allQUESTION annotations (i.e., all QUESTION sen-tences). We also evaluate a baseline method thatutilizes the parse tree to recognize separate QUES-TIONs by splitting sentences where a conjunctionseparates independent clauses. The results areshown in Table 2. The baseline method has goodrecall for recognizing where a sentence should besplit into multiple QUESTIONs, but it lacks preci-sion. This is largely because it is unable to differ-entiate clausal COORDINATIONs such as the aboveexample, as well as when the left-conjunct is notactually a separate question. For instance:

• Our grandson was diagnosed recently with thisdisease and I am wondering if you could sendme information on it.

The SVM-based method can overcome this prob-lem by looking at the words in the left-conjunct.Both methods, however, fail to recognize whentwo independent question clauses are asking thesame question but providing alternative answers:

• Will this condition be with him throughout hislife, or is it possible that it will clear up?

While there are methods for handling this issuefor COORDINATION recognition, addressed be-low, recognizing non-splittable QUESTIONs re-quires far deeper semantic understanding whichwe leave to future work.

7 Identifying Coordinations

COORDINATION recognition is the task of identifying when a conjunction joins phrases within a QUESTION that can, in turn, be separate questions:

• How can I learn more about [treatments and clinical trials]COORDINATION?
• Are [muscle twitching, muscle cramps, and muscle pain]COORDINATION effects of having silicosis?

Unlike QUESTION recognition, the boundaries ofa COORDINATION need to be determined as wellas whether the conjuncts can semantically be split

into separate questions. We thus use a three-stepprocess for recognizing COORDINATIONs: (1) ahigh-recall candidate generator, (2) an SVM torank all the candidates for a given conjunction, and(3) an SVM to filter out top-ranked candidates.

Candidate generation begins with the identification of valid conjunctions within a QUESTION annotation. We use the same four conjunctions as in QUESTION recognition: and, or, as well as, and the forward slash. For each of these, all possible left and right boundaries are generated, so in a QUESTION with 4 tokens on either side of the conjunction, there would be 16 candidates. Additionally, two adjectives separated by a comma and immediately followed by a noun are considered a candidate (e.g., "a [safe, permanent]COORDINATION treatment"). In our data, this candidate generation process has a recall of 98.9, as a few instances exist in which a conjunction is not used, such as:

• I am looking for any information you have about heavy metal toxicity, [treatment, outcomes]EXEMPLIFICATION+COORDINATION.
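The boundary enumeration for conjunction-based candidates can be sketched as below (our own illustration over a token list); with 4 tokens on either side of the conjunction it produces the 16 candidates mentioned above. The comma-separated-adjective pattern and the SVM ranking and filtering steps are omitted here.

```python
def coordination_candidates(tokens, conj_index):
    """All (left_conjunct, right_conjunct) boundary pairs around the
    conjunction at position conj_index in the token list."""
    candidates = []
    for left_start in range(0, conj_index):                      # every left boundary
        for right_end in range(conj_index + 2, len(tokens) + 1): # every right boundary
            candidates.append((tokens[left_start:conj_index],
                               tokens[conj_index + 1:right_end]))
    return candidates
```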

To rank candidates, we use an SVM with thefollowing feature types:

• If the left-conjunct is congruent with the high-est node in the syntactic parse tree whose right-most leaf is also the right-most token in the left-conjunct. Essentially, this is equivalent to say-ing whether or not the syntactic parser agreeswith the left-conjunct’s boundary.• The equivalent heuristic for the right-conjunct.• If a noun is in both, just the left conjunct, just

the right conjunct, or neither conjunct.• The Levenshtein distance between the POS tag

sequences for the left- and right-conjuncts.

The first two features encode the information arule-based method would use if it relied entirelyon the syntactic parse tree. The remaining featureshelp the classifier overcome cases where the parsermay be wrong.

At training time, all candidates for a given con-junction are generated and only the candidate thatmatches the gold COORDINATION is considereda positive example. Additionally, we annotatedthe boundaries for negative COORDINATIONs (i.e.,syntactic coordinations that do not fit our annota-tion standard). There were 203 such instances inthe GARD data. These are considered gold CO-ORDINATIONs for boundary ranking only.

To filter the top-ranked candidates, we use anSVM with several feature types:


E/R  P  R  F1
Baseline       E  28.1  36.5  31.8
               R  62.9  75.8  68.7
Rank + Filter  E  38.2  34.8  36.4
               R  78.5  69.0  73.5
Table 3: COORDINATION recognition results. E = exact match; R = relaxed match.

• The conjunction.• Unigrams in the left-conjunct.• POS of the first word in both conjuncts. CO-

ORDINATIONs often have the same first POS inboth conjuncts.• The word immediately before the candidate.

E.g., between is a good negative indicator.• Unigrams in the question but not the candidate.• If the candidate takes up almost the entire ques-

tion (all but 3 tokens). Typically, COORDINA-TIONs are much smaller than the full question.• If more than one conjunction is in the candidate.• If a word in the left-conjunct has an antonym

in the right conjunct. Antonyms are recognizedvia WordNet (Fellbaum, 1998).

At training time, the positive examples are drawnfrom the annotated COORDINATIONs, while thenegative examples are drawn from the 203 non-gold annotations mentioned above.

In addition to evaluating this method, weevaluate a baseline method that relies entirelyon the syntactic parse to identify COORDINA-TION boundaries without filtering. The resultsare shown in Table 3. The rank-and-filter ap-proach shows significant gains over the rule-basedmethod in precision and F1. As can be seen inthe difference between exact and relaxed match-ing, most of the loss for both the baseline and MLmethods come in boundary detection. Most meth-ods overly rely upon the syntactic parser, whichperforms poorly both on questions and coordina-tions. The ML method, though, is sometimes ableto overcome this problem.

8 Identifying Exemplifications

EXEMPLIFICATION recognition is the task of iden-tifying when a phrase provides an optional, exem-plifying example with a more specific type of in-formation than that asked by the rest of the ques-tion. For instance, the following contains both anEXEMPLIFICATION and a COORDINATION:

• Is there anything out there that can help him [such as [medications or alternative therapies]COORDINATION]EXEMPLIFICATION?

We could consider this to denote 3 questions:

• Is there anything out there that can help him?
• Is there anything out there that can help him such as medications?
• Is there anything out there that can help him such as alternative therapies?

In the latter two questions, we consider the phrasesuch as to now denote a mandatory constraint onthe answer to each question, whereas in the origi-nal question it would be considered optional.

EXEMPLIFICATION recognition is similar toCOORDINATION recognition, and its three-stepprocess is thus similar as well: (1) a high-recallcandidate generator, (2) an SVM to rank all thecandidates for a given trigger phrase, and (3) a setof rules to filter out top-ranked candidates.

Candidate generation begins with the identifica-tion of valid trigger words and phrases. These in-clude: especially, including, particularly, specifi-cally, and such as. For each of these, all possibleright boundaries are generated, thus EXEMPLIFI-CATIONs have far fewer candidates than COORDI-NATIONs. Additionally, all phrases within paren-theses are added as EXEMPLIFICATIONs. In ourdata, this candidate generation process has a recallof 98.1, missing instances without a trigger (seethe example also missed by COORDINATION can-didate generation in Section 7).

To rank candidates, we use an SVM with thefollowing feature types:

• If the right-conjunct is the highest parse nodeas defined in the COORDINATION boundary fea-ture.• If a dependency relation crosses from the right-

conjunct to any word outside the candidate.• POS of the word after the candidate.

As with COORDINATIONs, we annotated bound-aries for negative EXEMPLIFICATIONs matchingthe trigger words and used them as positive exam-ples for boundary ranking.

To filter the top-ranked candidates, we use twosimple rules. First, EXEMPLIFICATIONs withinparentheses are filtered if they are acronyms oracronym expansions. Second, cases such as thebelow example are removed by looking at thewords before the candidate:

• I am particularly interested in learning moreabout genetic testing for the syndrome.

In addition to evaluating this method, we eval-uate a baseline method that relies entirely on the


E/R  P  R  F1
Baseline       E  28.9  62.3  39.5
               R  39.5  84.9  53.9
Rank + Filter  E  60.8  58.5  59.6
               R  80.4  77.4  78.8
Table 4: EXEMPLIFICATION recognition results. E = exact match; R = relaxed match.

syntactic parser to identify EXEMPLIFICATION

boundaries and performs no filtering. The re-sults are shown in Table 4. The rank-and-filterapproach shows significant gains over the rule-based method in precision and F1, more than dou-bling precision for both exact and relaxed match-ing. There is still a drop in performance when go-ing from relaxed to exact matching, again largelydue to the reliance on the syntactic parser.

9 Classifying Background Information

BACKGROUND sentences contain contextual in-formation, such as whether or not a patient hasbeen diagnosed with the focal disease or whatsymptoms they are experiencing. This informa-tion was annotated at the sentence level, partly be-cause of annotation convenience, but also becausephrase boundaries are not always clear for medicalconcepts (Hahn et al., 2012; Forbush et al., 2013).

A difficult factor in this task, and especially onthe GARD dataset, is that consumers are not al-ways asking about a disease for themselves. In-stead, often they ask on behalf of another individ-ual, often a family member. The BACKGROUND

types are thus annotated based on the person ofinterest, who we refer to as the patient (in the lin-guistic sense). For instance, if a mother has a dis-ease but is asking about her son (e.g., asking aboutthe probability of her son inheriting the disease),that sentence would be a FAMILY HISTORY, asopposed to a DIAGNOSIS sentence.

The GARD corpus is annotated with eight BACKGROUND types:

• COMORBIDITY

• DIAGNOSIS

• FAMILY HISTORY

• ISF (information search failure)

• LIFESTYLE

• SYMPTOM

• TEST

• TREATMENT

ISF sentences indicate previous attempts to find the requested information have failed, and are a good signal to the QA system to enable more in-depth search strategies. LIFESTYLE sentences describe the patient's life habits (e.g., smoking, exercise). Currently, the automatic identification of BACKGROUND types has not been a major focus of our effort as no handling exists for it within our QA system. We report a baseline method and results here to provide some insight into the difficulty of the task.

Type             P     R    F1  # Anns
COMORBIDITY    0.0   0.0   0.0      23
DIAGNOSIS     80.8  80.3  80.5     690
FAMILY HISTORY 67.4  38.4  48.9    151
ISF           75.0  65.9  70.1      41
LIFESTYLE      0.0   0.0   0.0      13
SYMPTOM       76.6  48.1  59.1     320
TEST          37.5   4.9   8.7      61
TREATMENT     87.3  35.0  50.0     137
Overall: Micro-F1: 67.3  Macro-F1: 39.7
Table 5: BACKGROUND results.

BACKGROUND types are a multi-labeling prob-lem, so we use eight binary classifiers, one foreach type. Each classifier utilizes only unigramand bigram features. The results for the mod-els are shown in Table 5. COMORBIDITY andLIFESTYLE are too rare in the data (23 and 13instances, respectively) for the classifier to iden-tify. DIAGNOSIS questions are identified fairlywell because this is the most common type (690instances) and because of the constrained vocabu-lary for expressing a diagnosis. The performanceof the rest of the types is largely proportional tothe number of instances in the data, though ISFperforms quite well given only 41 instances.
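A minimal sketch of this multi-label setup, using scikit-learn's CountVectorizer and LinearSVC as stand-ins for the unigram and bigram features and the eight binary SVMs described above, could look like the following; all names are our own, and it assumes each type occurs at least once in the training data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

BACKGROUND_TYPES = ["COMORBIDITY", "DIAGNOSIS", "FAMILY_HISTORY", "ISF",
                    "LIFESTYLE", "SYMPTOM", "TEST", "TREATMENT"]

def train_background_classifiers(sentences, labels):
    """One independent binary SVM per BACKGROUND type (multi-labeling),
    over unigram and bigram counts; labels is a list of label sets."""
    vec = CountVectorizer(ngram_range=(1, 2))
    X = vec.fit_transform(sentences)
    classifiers = {}
    for t in BACKGROUND_TYPES:
        y = [1 if t in label_set else 0 for label_set in labels]
        classifiers[t] = LinearSVC().fit(X, y)
    return vec, classifiers

def predict_background_types(vec, classifiers, sentence):
    """Return the set of BACKGROUND types predicted for one sentence."""
    X = vec.transform([sentence])
    return {t for t, clf in classifiers.items() if clf.predict(X)[0] == 1}
```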

10 Conclusion

We have presented a method for decomposingconsumer health questions by recognizing six an-notation types. Some of these types are generalenough to use in open-domain question decom-position (BACKGROUND, IGNORE, QUESTION,COORDINATION, EXEMPLIFICATION), while oth-ers are targeted specifically at consumer healthquestions (FOCUS and the BACKGROUND sub-types). We demonstrate that ML methods canimprove upon heuristic methods relying on thesyntactic parse tree, though parse errors are of-ten difficult to overcome. Since significant im-provements in performance would likely requiremajor advances in open-domain syntactic parsing,we instead envision further integration of the keytasks in consumer health question analysis: (1) in-tegration of co-reference and implicit argument in-formation, (2) improved identification of BACK-GROUND types, and (3) identification of discourserelations within questions to further leverage ques-tion decomposition.


Acknowledgements

This work was supported by the intramural research program at the U.S. National Library of Medicine, National Institutes of Health. We would additionally like to thank Stephanie M. Morrison and Janine Lewis for their help accessing the GARD data.

References

Ulrich Andersen, Anna Braasch, Lina Henriksen,

Csaba Huszka, Anders Johannsen, Lars Kayser,Bente Maegaard, Ole Norgaard, Stefan Schulz, andJurgen Wedekind. 2012. Creation and use of Lan-guage Resources in a Question-Answering eHealthSystem. In Proceedings of the Eighth InternationalConference on Language Resources and Evaluation,pages 2536–2542.

Brian L. Cairns, Rodney D. Nielsen, James J. Masanz,James H. Martin, Martha S. Palmer, Wayne H. Ward,and Guergana K. Savova. 2011. The MiPACQ Clin-ical Question Answering System. In Proceedings ofthe AMIA Annual Symposium, pages 171–180.

YongGang Cao, Feifan Liu, Pippa Simpson, LamontAntieau, Andrew Bennett, James J. Cimino, JohnEly, and Hong Yu. 2011. AskHERMES: An on-line question answering system for complex clini-cal questions. Journal of Biomedical Informatics,44:277–288.

Dina Demner-Fushman and Swapna Abhyankar. 2012.Syntactic-Semantic Frames for Clinical CohortIdentification Queries. In Data Integration in theLife Sciences, volume 7348 of Lecture Notes inComputer Science, pages 100–112.

Dina Demner-Fushman and Jimmy Lin. 2007. An-swering Clinical Questions with Knowledge-Basedand Statistical Techniques. Computational Linguis-tics, 33(1).

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR:A Library for Large Linear Classification. Journalof Machine Learning Research, 9:1871–1874.

Christiane Fellbaum. 1998. WordNet: An ElectronicLexical Database. MIT Press.

Tyler B. Forbush, Adi V. Gundlapalli, Miland N.Palmer, Shuying Shen, Brett R. South, Guy Divita,Marjorie Carter, Andrew Redd, Jorie M. Butler, andMatthew Samore. 2013. “Sitting on Pins and Nee-dles”: Characterization of Symptom Descriptions inClinical Notes. In AMIA Summit on Clinical Re-search Informatics, pages 67–71.

Udo Hahn, Elena Beisswanger, Ekaterina Buyko, ErikFaessler, Jenny Traumuller, Susann Schroder, andKerstin Hornbostel. 2012. Iterative Refinementand Quality Checking of Annotation Guidelines –

How to Deal Effectively with Semantically SloppyNamed Entity Types, such as Pathological Phenom-ena. In Proceedings of the Eighth InternationalConference on Language Resources and Evaluation,pages 3881–3885.

Sanda Harabagiu, Finley Lacatusu, and Andrew Hickl.2006. Answer Complex Questions with RandomWalk Models. In Proceedings of the 29th AnnualACM SIGIR Conference on Research and Develop-ment in Information Retrieval, pages 220–227.

Sven Hartrumpf. 2008. Semantic Decompositionfor Question Answering. In Proceedings on the18th European Conference on Artificial Intelligence,pages 313–317.

Dierdre Hogan. 2007. Coordinate Noun Phrase Dis-ambiguation in a Generative Parsing Model. In Pro-ceedings of the 45th Annual Meeting of the Associa-tion for Computational Linguistics, pages 680–687.

John Judge, Aoife Cahill, and Josef van Genabith.2006. QuestionBank: Creating a Corpus of Parse-Annotated Questions. In Proceedings of the 21st In-ternational Conference on Computational Linguis-tics and 44th Annual Meeting of the Association forComputational Linguistics, pages 497–504.

Yarden Katz and Bernardo C. Grau. 2005. Repre-senting Qualitative Spatial Information in OWL-DL.Proceedings of OWL: Experiences and Directions.

Halil Kilicoglu, Marcelo Fiszman, and Dina Demner-Fushman. 2013. Interpreting Consumer HealthQuestions: The Role of Anaphora and Ellipsis. InProceedings of the 2013 BioNLP Workshop, pages54–62.

Finley Lacatusu, Andrew Hickl, and Sanda Harabagiu.2006. Impact of Question Decomposition on theQuality of Answer Summaries. In Proceedings ofLREC, pages 1147–1152.

Donald A.B. Lindberg, Betsy L. Humphreys, andAlexa T. McCray. 1993. The Unified Medical Lan-guage System. Methods of Information in Medicine,32(4):281–291.

Alexa T McCray, Anita Burgun, and Olivier Boden-reider. 2001. Aggregating UMLS Semantic Typesfor Reducing Conceptual Complexity. In Studiesin Health Technology and Informatics (MEDINFO),volume 84(1), pages 216–220.

Kirk Roberts, Kate Masterton, Marcelo Fiszman, HalilKilicoglu, and Dina Demner-Fushman. 2014. An-notating Question Decomposition on Complex Med-ical Questions. In Proceedings of LREC.

Hong Yu and YongGang Cao. 2008. AutomaticallyExtracting Information Needs from Ad Hoc Clini-cal Questions. In Proceedings of the AMIA AnnualSymposium.


Proceedings of the 2014 Workshop on Biomedical Natural Language Processing (BioNLP 2014), pages 38–44, Baltimore, Maryland, USA, June 26-27 2014. ©2014 Association for Computational Linguistics

Detecting Health Related Discussions in Everyday Telephone Conversations for Studying Medical Events in the Lives of Older Adults

Golnar Sheikhshab, Izhak Shafran, Jeffrey Kaye
Oregon Health & Science University
sheikhsh,shafrani,[email protected]

Abstract

We apply semi-supervised topic modeling techniques to detect health-related discussions in everyday telephone conversations, which has applications in large-scale epidemiological studies and for clinical interventions for older adults. The privacy requirements associated with utilizing everyday telephone conversations preclude manual annotations; hence, we explore semi-supervised methods in this task. We adopt a semi-supervised version of Latent Dirichlet Allocation (LDA) to guide the learning process. Within this framework, we investigate a strategy to discard irrelevant words in the topic distribution and demonstrate that this strategy improves the average F-score on the in-domain task and an out-of-domain task (Fisher corpus). Our results show that the increase in discussion of health related conversations is statistically associated with actual medical events obtained through weekly self-reports.

1 Introduction

There has been considerable interest in under-standing, promoting, and monitoring healthylifestyles among older adults while minimizing thefrequency of clinical visits. Longitudinal studieson large cohorts are necessary, for example, to un-derstand the association between social networks,depression, dementia, and general health. In thiscontext, detecting discussions of health are impor-tant as indicators of under-reported health eventsin daily lives as well as for studying healthy so-cial support networks. The detection of medicalevents such as higher levels of pain or discom-fort may also be useful in providing timely clin-ical intervention for managing chronic illness and

thus promoting healthy independent living amongolder adults.

Motivated by this larger goal, we develop andinvestigate techniques for identifying conversa-tions containing any health related discussion. Weare interested in detecting discussions about med-ication with doctors, as well as conversations withothers, where among all different topics being dis-cussed, subjects may also be complaining aboutpain or changes in health status.

The privacy concerns of recording and analyz-ing everyday telephone conversation prevents usfrom manually transcribing and annotating con-versations. So, we automatically transcribe theconversations using an automatic speech recog-nition system and look-up the telephone numbercorresponding to each conversation as a heuristicmeans of deriving labels. This technique is suit-able for labeling a small subset of the conversa-tions that are only sufficient for developing semi-supervised algorithms and for evaluating the meth-ods for analysis.

Before delving into our approach, we discussa few relevant and related studies in Section 2and describe our unique naturalistic corpus in Sec-tion 3. Given the restrictive nature of our labeledin-domain data set, we are interested in a clas-sifier that generalizes to the unlabeled data. Weevaluate the generalizability of the classifiers us-ing an out-of-domain corpus. We adopt a semi-supervised topic modeling approach to addressour task, and develop an iterative feature selec-tion method to improve our classifier, as describedin Section 4. We evaluate the efficacy of our ap-proach empirically, on the in-domain as well as anout-of-domain corpus, and report results in Sec-tion 5.

2 Related Work

The task of identifying conversations where healthis mentioned differs from many other tasks in topic


modeling because in this task we are interested inone particular topic. A similar study is the work ofPrier and colleagues (Prier et al., 2011). They usea set of predefined seed words as queries to gathertweets related to tobacco or marijuana usage, andthen use LDA to discover related subtopics. Thus,their method is sensitive to the seed words chosen.

One way to reduce the sensitivity to the manually specified seed words is to expand the set using WordNet. Researchers have investigated this approach in sentiment analysis (Kim and Hovy, 2004; Yu and Hatzivassiloglou, 2003). However, when expanding the seed word set using WordNet, we need to be careful to avoid antonyms and words that have a high degree of linkage with many words in the vocabulary. Furthermore, we cannot apply such an approach to languages with poor resources, where manually curated knowledge is unavailable. The other drawback of this approach is that we cannot use characteristics of the end task, in our case health-related conversation retrieval, to select the words. As an alternative, Han and colleagues developed an interactive system where users selected the most relevant words from a set proposed by an automated system (Han et al., 2009).

Another idea for expanding the seed words is to use statistical information. Among statistical methods, the simplest approach is to compute pairwise co-occurrence with the seed words. Li and Yamanishi ranked the words co-occurring with the seed words according to information-theoretic costs, and used the highest ranked words as the expanded set (Li and Yamanishi, 2003). This idea can be more effective when the co-occurrence is computed over subsets instead, as in Hisamitsu and Niwa's work (Hisamitsu and Niwa, 2001). However, it is computationally expensive to search over subsets of words. Depending on the language and task, heuristics might be applicable. An example of this kind of approach is Zagibalov and Carroll's work on sentiment analysis in Chinese (Zagibalov and Carroll, 2008).

Alternatively, we can treat the task of identifying words associated with seed words as a clustering problem, with the intuition that the seed words are in the same cluster. An effective strategy to cluster words into topics is Latent Dirichlet Allocation (LDA) (Blei et al., 2003). However, LDA is an unsupervised algorithm and the clustered topics are not guaranteed to include the topic of interest. Seeded LDA, a variant of LDA, attempts to address this problem by incorporating the seed words as priors over the topics (Jagarlamudi et al., 2012). However, the estimation procedure is more complicated. Alternatively, in Topic LDA (TLDA), a clever extension to LDA, Andrzejewski and Zhu address this problem by fixing the membership of the words to valid topics (Andrzejewski and Zhu, 2009). When the focus is on detecting just one topic, as in our task, we can expand the seed words more selectively using the small set of labeled data, and that is the approach adopted in this paper.

3 Data

One interesting aspect of our study is the uniqueness of our corpus, which is both naturalistic and exhaustive. We recorded about 41,000 land-line everyday telephone conversations from 56 volunteers, 65 years or older, over a period of approximately 6 to 12 months. Since these everyday telephone conversations are private conversations, and might include private information such as names, telephone numbers, or banking information, we assured the subjects that no one would listen to the recorded conversations. Thus, we could not manually transcribe the conversations; instead, we used an Automatic Speech Recognition (ASR) system that we describe here.

Automatic Speech Recognition System Conversations in our corpus were automatically transcribed using an ASR system, which is structured after IBM's conversational telephony system (Soltau et al., 2005). The acoustic models were trained on about 2000 hours of telephone speech from the Switchboard and Fisher corpora (Godfrey et al., 1992). The system has a vocabulary of 47K words and uses a trigram language model with about 10M n-grams, estimated from a mix of transcripts and web-harvested data. Decoding is performed in three stages using speaker-independent models, vocal-tract normalized models, and speaker-adapted models. The three sets of models are similar in complexity, with 4000 clustered pentaphone states and 150K Gaussians with diagonal covariances. Our system does not include discriminative training and performs at a word error rate of about 24% on NIST RT Dev04, which is comparable to state-of-the-art performance for such systems. We are unable to measure the performance of this recognizer on our corpus due to the stringent privacy requirements mentioned earlier. Since both corpora are conversational telephone speech and the training data contains a large number of conversations (2000 hours), we expect the performance of our recognizer to be relatively close to the results on the NIST benchmark.

Heuristically labeling a small subset of conversations For training and evaluation purposes, we need a labeled set of conversations; that is, a set of conversations where we know whether or not they contain health-related discussions. Since the privacy concerns do not allow for manually labeling the conversations, we used the reverse look-up service at www.whitepages.com. We sent the phone number corresponding to each conversation (when available) to this website to obtain information about the other end of the conversation. Based on the information we got back from this website, we labeled a small subset of the conversations which fell into unambiguous business categories. For example, we labeled the calls to "hospital" and "pharmacy" as health-related, and those to "car repair" and "real estate" as non-health-related.
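For illustration, a minimal sketch of this labeling heuristic is given below; the lookup_category function stands in for the reverse look-up query, and the category sets list only the four categories mentioned above.

HEALTH_CATEGORIES = {"hospital", "pharmacy"}
NON_HEALTH_CATEGORIES = {"car repair", "real estate"}

def label_conversation(phone_number, lookup_category):
    """Return 'health', 'non-health', or None (conversation stays unlabeled)."""
    category = lookup_category(phone_number)  # stand-in for the reverse look-up query
    if category in HEALTH_CATEGORIES:
        return "health"
    if category in NON_HEALTH_CATEGORIES:
        return "non-health"
    return None  # ambiguous or unavailable information: leave unlabeled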

The limitations of the labeled set The labeled set we obtained is small and restricted in the type of conversations. Since phone numbers are not available for many of the conversations we recorded, and also because www.whitepages.com does not return unambiguous information for many of the available phone numbers, we managed to label only 681 conversations – 275 health-related and 406 non-health-related. This labeled set has another limitation: it contains conversations to business numbers only. In reality, however, we are interested in the much larger set of conversations with friends, relatives, and other members of the subjects' social support networks. Thus, the generalizability of the classifier we train is very important.

Fisher Corpus To explicitly test the generalizability of our classifier, we use a second evaluation set from the Fisher corpus (Cieri et al., 2004). The Fisher corpus contains telephone conversations with pre-assigned topics. There are 40 topics and only one of them, illness, is health-related. We identified 338 conversations on illness, and sampled 702 conversations from the other 39 non-health topics. Since we do not train on the Fisher corpus, we call applying our method to the Fisher corpus the out-of-domain task, as opposed to the in-domain task, which is to apply our method to the everyday telephone conversations.

Extra information on subjects' health In the everyday telephone conversations corpus, we also have access to the subjects' weekly self-reports on their medical status during the week, indicating medical events such as an injury or going to the emergency room. We will use these pieces of information to relate the health-related conversations to actual medical events in the subjects' lives.

4 Method

4.1 Overview

As we explained in Section 3, we can label a small set of conversations in the everyday telephone conversations corpus as health-related vs. non-health-related. Using this labeled set, we can train a support vector machine (SVM) to classify the conversations. In the absence of feature selection, the conversations are represented by a vector of tf-idf scores for every word in the vocabulary, where tf-idf is a score measuring the importance of a word in one document of a corpus. As we see in Section 5, such a classifier does not generalize to the out-of-domain Fisher task (i.e., when we test the classifier on the Fisher data set, we do not get good precision and recall). Generalizability is important in our case, especially because the data we use for training is limited in size and in the nature of the conversations.
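For concreteness, the following is a minimal sketch of this full-vocabulary baseline using scikit-learn, whose SVC class wraps libSVM; the toy texts and labels are illustrative stand-ins for the ASR transcripts and the heuristic labels.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-ins for the ASR transcripts and their heuristic labels.
train_texts = ["i saw the doctor about my pain",
               "the mechanic fixed the car yesterday",
               "the pharmacy refilled my medicine",
               "we talked about selling the house"]
train_labels = ["health", "non-health", "health", "non-health"]

baseline = make_pipeline(
    TfidfVectorizer(),   # tf-idf score for every word in the vocabulary
    SVC(kernel="rbf"),   # SVM classifier (scikit-learn's SVC wraps libSVM)
)
baseline.fit(train_texts, train_labels)
print(baseline.predict(["my doctor changed my medicine"]))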

One way to improve generalization is to perform feature selection. That is, instead of using tf-idf scores for the whole vocabulary, we would like to rely only on features relevant to detecting the health topic. We propose a new way of selecting features for retrieving documents containing information about a specific topic when only a limited set of labeled documents is available. The idea is to pick a few words highly related to the topic of interest as seed words and to use TLDA (Andrzejewski and Zhu, 2009) to force those seed words into one (for example, the first) topic. In our task, the topic of interest is health. So, we choose doctor, medicine, and pain – often used while discussing health – as our seed words. Topics in LDA-based methods such as TLDA are usually represented using the n most probable words, where n is an arbitrary number. So, the first candidate sets for expanding our seed words are the sets of the 50 most probable words in the topic of health in different runs of TLDA. As our experiments reveal, these candidate sets contain many words that are unrelated to health. To solve this problem, we use the small labeled set of conversations to filter out the unrelated words.

Figure 1 shows the proposed iterative algorithm. The algorithm starts by initializing the seed words to doctor, medicine, and pain. Then, in each iteration, TLDA performs semi-supervised topic modeling and returns the 50 most probable candidate words in the health topic. We select a subset of these candidate words which, if added to the seed words, would maximize the average of precision and recall on the training set for a simple classifier. This simple classifier marks a conversation as health-related if, and only if, it contains at least one of the seed words. The algorithm terminates when the subset selection is unable to add a new word contributing to the average of precision and recall. The tf-idf vector for the expanded set represents the conversations in the classification process.

It is worth mentioning that we train TLDA using all 41,000 unlabeled conversations and set the number of topics, K, to 20.
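The subset-selection step can be made concrete with a small sketch. The snippet below is a simple greedy approximation of that selection, assuming each labeled conversation is available as a set of word tokens and that the TLDA step has already produced the list of candidate words; the function and variable names are illustrative, not taken from the paper.

def keyword_classifier_score(seed_words, docs, labels):
    """Average of precision and recall for the classifier that marks a
    conversation as health-related iff it contains at least one seed word."""
    tp = fp = fn = 0
    for words, label in zip(docs, labels):
        predicted_health = bool(seed_words & words)
        if predicted_health and label == "health":
            tp += 1
        elif predicted_health:
            fp += 1
        elif label == "health":
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (precision + recall) / 2.0

def expand_seed_words(seed_words, candidates, docs, labels):
    """Greedily add candidate words while the objective keeps improving."""
    seeds = set(seed_words)
    best = keyword_classifier_score(seeds, docs, labels)
    improved = True
    while improved:
        improved = False
        for word in candidates:
            if word in seeds:
                continue
            score = keyword_classifier_score(seeds | {word}, docs, labels)
            if score > best:
                best, seeds = score, seeds | {word}
                improved = True
    return seeds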

5 Experiments

In all of our experiments, we trained SVM classifiers, with different features, to detect the conversations on health, using the popular libSVM (Chang and Lin, 2011) implementation. We chose the parameters of the SVM using a 30-fold cross-validated (CV) grid search over the training data. We also used a 4-fold cross-validation over the labeled set of conversations to maximize the use of the relatively small labeled set. That is, we trained the feature selection algorithm on 3 folds and tested the resulting SVM on the fourth. For the in-domain task, we always report the average performance across the folds.
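A sketch of such a cross-validated grid search is shown below; the parameter grid and the toy feature matrix are illustrative, since the actual values searched are not reported in the paper.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy feature matrix standing in for the tf-idf vectors of labeled conversations.
rng = np.random.default_rng(0)
X = rng.random((120, 5))
y = np.array([0, 1] * 60)

# Illustrative parameter grid; the paper does not list the values it searched.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=30)  # 30-fold CV grid search
search.fit(X, y)
print(search.best_params_)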

Table 1 shows the results of our experiments using different input features. We report recall, precision, and F-measure on the in-domain and out-of-domain (Fisher) tasks, as well as the average F-measure of the two. The justification for considering the average F-measure is that we want our algorithm to work well on both the in-domain corpus and the Fisher corpus, since we need to make sure that our classifier is generalizable (i.e., it works well on Fisher) and that it works well on the private and natural telephone conversations (i.e., the ones similar to the in-domain corpus).

Figure 1: Expanding the set of seed words: in each iteration, the current seed words are forced into the topic of health to guide TLDA towards finding more health-related words. The candidate set consists of the 50 most probable words of the topic of health in TLDA. We investigate the gain of adding each word of the candidate set to the seed words by temporarily adding it to the seed words and looking at the average of precision and recall on the training set for a classifier that classifies a conversation as health-related if and only if it contains at least one of the seed words. We select the words that maximize this objective and add them to the seed words until no other word contributes to the average of precision and recall.

When using the full vocabulary, the in-domain performance (the performance on the everyday telephone conversations data) is relatively good, with 75.1% recall and 83.5% precision. But the out-of-domain recall (the recall on the Fisher data set) is considerably lower, at 2.8%. Ideally, we want a classifier that performs well in both domains. Rows 2 to 5 can be seen as steps towards such a classifier.

The second row shows the performance at the other extreme end of feature selection: the features include the manually chosen words doctor, medicine, and pain only. While this leads to very good out-of-domain performance, the in-domain recall drops considerably. We trained TLDA 30 times, and selected the 50 most probable words in the health topic in each run. The third row in Table 1 shows the average performance of the SVM when using the tf-idf of these sets of words as the feature vector on the in-domain and out-of-domain tasks. Using the 50 most probable words in the health topic significantly improves the average F-score (71%) across both tasks over using the full vocabulary (42.3%), but it is clear that this is only due to the improvement in the out-of-domain task.


Feature Words                                             Recall             Precision          F-measure
                                                          In-Domain  Fisher  In-Domain  Fisher  In-Domain  Fisher  Average
Full vocabulary (no feature selection)                    75.2       2.8     83.5       91.1    79.1       5.4     42.3
Initial words (doctor, medicine, pain)                    45.1       69.2    94.8       94.5    61.1       79.9    70.5
50 most probable words in health (average over 30 runs)   58.4       57.4    86.3       97.5    69.7       72.3    71.0
Words selected by our method (average over 30 runs)       56.1       66.5    91.0       95.5    69.4       78.4    73.9
Union of all selected words (across 30 runs)              67.7       69.4    87.8       95.1    76.5       80.2    78.3

Table 1: Performance of SVM classifiers using different feature selection methods. The in-domain task involves the everyday telephone conversations corpus. We call the Fisher corpus out-of-domain because no example from this corpus was used in training.

Table 2 shows one set of the 50 most probable words in the health topic, the result of one run of TLDA. Evidently, these words contain many irrelevant words. This is the motivation for our iterative algorithm.

Next, we evaluate the performance of our iterative algorithm. The fourth row in Table 1 shows the average performance of the SVM using the expanded seed words that our algorithm suggested in 30 runs. Our algorithm improves the average F-score by 3% compared to the standard TLDA. This is due to a 5% improvement in the out-of-domain task as opposed to a 0.3% performance decrease in the in-domain task.

Since our algorithm has a probabilistic topic modeling component (i.e., TLDA), different runs lead to different sets of expanded seed words. We extract the union of all the words chosen over the 30 runs and evaluate the performance of the SVM using this union set. This further improves the performance of our method, achieving the best average F-score of 78.3%, which is an 85% improvement over using the SVM with the full vocabulary. It is important to notice that the in-domain performance is still lower than the full-vocabulary baseline, by less than 3%, while the out-of-domain performance is the best obtained. Once again, we are more interested in the average F-measure because we need our algorithm to generalize well (work well on the out-of-domain corpus) and to work well on natural private conversations (the conversations similar to the in-domain corpus).

Our last experiment tests the statistical association between health-related discussions in everyday telephone conversations and actual medical events in older adults.

pain, medicine, appointment, medical, doctors, emergency, prescription, contact, medication, dial, insurance, pharmacy, schedule, moment, reached, questions, services, surgery, telephone, record, appointments, options, address, patient, advice, quality, tuesday, position, answered, records, wednesday, therapy, healthy, correct, department, ensure, numbers, act, doctor, personal, test, senior, nurse, plan, kaiser

Table 2: The 50 most probable words in the topic of health returned by one run of TLDA. The words in bold are the hand-picked ones.

As mentioned in Section 3, we have access to weekly self-reports on medical events for the subjects in the everyday telephone conversations corpus. We used our best classifier, the SVM with the union of expanded seed words, to classify all the conversations in our corpus into health-containing and health-free conversations. We then mark each conversation as temporally near a medical event if a reported medical event occurred within a 3-week time window. We chose a 3-week window to allow for one report before and after the event.

Table 3 shows the number of conversations in different categories. At first glance, it might seem that the number of false positives or false negatives is quite large, but we should note that being near a medical event is not the ground truth here. We just want to see whether there is any association between the occurrence of health-related conversations and the occurrence of actual medical events in the lives of our subjects. We can see that 90.9% of the conversations are classified as health-related, but this percentage is slightly different for conversations near medical events (91.5%) vs. the other conversations (89.1%). This slight difference is significant according to a χ2 test of independence (χ2(df = 1, N = 47288) = 61.17, p < 0.001).

                        Classified as
near a medical event    health-related    non-health-related
yes                     1348              11067
no                      2964              31909

Table 3: Number of telephone conversations in different categories. Each conversation is considered near a medical event if and only if there is at least one self-report in a window of 3 weeks around its date. Being near a medical event does not reveal the true nature of the conversation and thus is not the ground truth; so, there are no false positives, true positives, etc. in this table.
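The reported χ2 statistic can be approximately reproduced from the counts in Table 3; a minimal sketch using SciPy (without Yates' continuity correction) follows.

from scipy.stats import chi2_contingency

# Counts from Table 3: rows are near a medical event (yes/no), columns are
# classified as health-related / non-health-related.
table = [[1348, 11067],
         [2964, 31909]]

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(chi2, dof, p)  # roughly chi2 = 61, dof = 1, p < 0.001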

6 Conclusions

In this paper, we investigated the problem of identifying conversations with any mention of health. The private nature of our everyday telephone conversations corpus poses constraints on manual transcription and annotation. Looking up phone numbers associated with business calls, we labeled a small set of conversations where the other end was a business clearly related or unrelated to the health industry. However, the labeled set is not large enough for training a robust classifier. We developed a semi-supervised iterative method for selecting features, where we learn a distribution of words on the health topic using TLDA, and subsequently filter irrelevant words iteratively. We demonstrate that our method generalizes well and improves the average F-score on in-domain and out-of-domain tasks over two baselines: using the full vocabulary without feature selection, or feature selection using TLDA alone. In our task, the generalization of the classifier is important since we are interested in detecting not only conversations on health with businesses (the annotated examples) but also with others in the subjects' social networks. Using our classifier, we find a significant statistical association between the occurrence of conversations about health and the occurrence of self-reported medical events.

Acknowledgments

This research was supported in part by NIH Grants 1K25AG033723 and P30 AG008017, as well as by NSF Grants 1027834 and 0964102. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the NIH. We thank Nicole Larimer for help in collecting the data, Maider Lehr for testing the data collection devices and Katherine Wild for early discussions on this project. We are grateful to Brian Kingsbury and his colleagues for providing us access to IBM's attila software tools.

References

David Andrzejewski and Xiaojin Zhu. 2009. Latent dirichlet allocation with topic-in-set knowledge. In Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing, SemiSupLearn '09, pages 43–48.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

C.-C. Chang and C.-J. Lin. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2.

Christopher Cieri, David Miller, and Kevin Walker. 2004. The Fisher corpus: a resource for the next generations of speech-to-text. In LREC, volume 4, pages 69–71.

John J. Godfrey, Edward C. Holliman, and Jane McDaniel. 1992. Switchboard: Telephone speech corpus for research and development. In Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on, volume 1, pages 517–520. IEEE.

Hong-qi Han, Dong-Hua Zhu, and Xue-feng Wang. 2009. Semi-supervised text classification from unlabeled documents using class associated words. In Computers & Industrial Engineering, 2009. CIE 2009. International Conference on, pages 1255–1260. IEEE.

Toru Hisamitsu and Yoshiki Niwa. 2001. Topic-word selection based on combinatorial probability. In NLPRS, volume 1, page 289.

Jagadeesh Jagarlamudi, Hal Daume III, and Raghavendra Udupa. 2012. Incorporating lexical priors into topic models. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 204–213. Association for Computational Linguistics.


Soo-Min Kim and Eduard Hovy. 2004. Determining the sentiment of opinions. In Proceedings of the 20th International Conference on Computational Linguistics, page 1367. Association for Computational Linguistics.

Hang Li and Kenji Yamanishi. 2003. Topic analysis using a finite mixture model. Information Processing & Management, 39(4):521–541.

Kyle W. Prier, Matthew S. Smith, Christophe Giraud-Carrier, and Carl L. Hanson. 2011. Identifying health-related topics on twitter: an exploration of tobacco-related tweets as a test topic. In Proceedings of the 4th International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction, pages 18–25.

Hagen Soltau, Brian Kingsbury, Lidia Mangu, Daniel Povey, George Saon, and Geoffrey Zweig. 2005. The IBM 2004 conversational telephony system for rich transcription. In Proc. ICASSP, volume 1, pages 205–208.

Hong Yu and Vasileios Hatzivassiloglou. 2003. Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 129–136. Association for Computational Linguistics.

Taras Zagibalov and John Carroll. 2008. Automatic seed word selection for unsupervised sentiment classification of chinese text. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, pages 1073–1080. Association for Computational Linguistics.



Coreference Resolution for Structured Drug Product Labels

Halil Kilicoglu and Dina Demner-Fushman
National Library of Medicine
National Institutes of Health
Bethesda, MD, 20894
{kilicogluh,ddemner}@mail.nih.gov

Abstract

FDA drug package inserts provide comprehensive and authoritative information about drugs. The DailyMed database is a repository of structured product labels extracted from these package inserts. Most salient information about drugs remains in the free text portions of these labels. Extracting information from these portions can improve the safety and quality of drug prescription. In this paper, we present a study that focuses on resolution of coreferential information in drug labels contained in DailyMed. We generalized and expanded an existing rule-based coreference resolution module for this purpose. Enhancements include resolution of set/instance anaphora, recognition of appositive constructions, and wider use of UMLS semantic knowledge. We obtained an improvement of 40% over the baseline in unweighted average F1-measure using the B-CUBED, MUC, and CEAF metrics. The results underscore the importance of set/instance anaphora and appositive constructions in this type of text and point out the shortcomings in coreference annotation in the dataset.

1 Introduction

Almost half of the US population uses at least one prescription drug and over 75% of physician office visits involve drug therapy1. Knowing how these drugs will affect the patient is very important, particularly to the over 20% of patients that are on three or more prescription drugs1. FDA drug package inserts (drug labels or Structured Product Labels (SPLs)) provide curated information about prescription drugs and many over-the-counter drugs. The drug labels for most drugs are publicly available in XML format through DailyMed2. Some information in these labels, such as the drug identifiers and ingredients, could be easily extracted from the structured fields of the XML documents. However, the salient content about indications, side effects and drug-drug interactions, among others, is buried in the free text of the corresponding sections of the labels. Extracting this information with natural language processing techniques can facilitate automatic, timely updates to the databases that support Electronic Health Records in alerting physicians to potential drug interactions, recommended doses, and contraindications.

1 Centers for Disease Control and Prevention: FASTSTATS - Therapeutic Drug Use: http://www.cdc.gov/nchs/fastats/drugs.htm
2 DailyMed: http://dailymed.nlm.nih.gov/dailymed/about.cfm

Natural language processing methods are increasingly used to support various clinical and biomedical applications (Demner-Fushman et al., 2009). Extraction of drug information is playing a prominent role in these applications and research. In addition to earlier research in extraction of medications and relations involving medications from clinical text and the biomedical literature (Rindflesch et al., 2000; Cimino et al., 2007), in the third i2b2 shared task (Uzuner et al., 2010), 23 organizations explored extraction of medications, their dosages, routes of administration, frequencies, durations, and reasons for administration from clinical text. The best performing systems used rule-based and machine learning techniques to achieve over 0.8 F-measure for extraction of medication names; however, the remaining information was harder to extract. Researchers have also tackled extraction of drug-drug interactions (Herrero-Zazo et al., 2013), side effects (Xu and Wang, 2014), and indications (Fung et al., 2013) from various biomedical resources.

As for many other information extraction tasks, extracting drug information is often made more difficult by coreference. Coreference is defined as the relation between linguistic expressions that refer to the same entity (Zheng et al., 2011). Coreference resolution is a fundamental task in NLP and can benefit many downstream applications, such as relation extraction, summarization, and question answering. The difficulty of the task is due to the fact that various levels of linguistic information (lexical, syntactic, semantic, and discourse contextual features) generally play a role.

Coreference occurs frequently in all types of biomedical text, including the drug package inserts. Consider the example below:

(1) Since amiodarone is a substrate for CYP3A and CYP2C8, drugs/substances that inhibit these isoenzymes may decrease the metabolism . . . .

In this example, the expression these isoenzymes refers to CYP3A and CYP2C8. Resolving this coreference instance would allow us to capture the following drug interactions mentioned in the sentence: inhibitors of CYP3A POTENTIATE amiodarone and inhibitors of CYP2C8 POTENTIATE amiodarone.

In this paper, we present a study that focuses on identification of coreference links in drug labels, with the view that these relations will facilitate the downstream task of drug interaction recognition. The rule-based system presented is an extension of the previous work reported in Kilicoglu et al. (2013). The main focus of the dataset, based on SPLs, is drug interaction information. Coreference is only annotated when it is relevant to extracting such information. In addition to evaluating the system against a baseline, we also manually assessed the system output for precision. Furthermore, we also evaluated the system on a similarly drug-focused corpus annotated for anaphora (DrugNerAR) (Segura-Bedmar et al., 2010). Our results demonstrate that set/instance anaphora resolution and appositive recognition can play a significant role in this type of text and highlight some of the major areas of difficulty and potential enhancements.

2 Related Work

We discuss two areas of research related to this study in this section: processing of drug labels and coreference resolution, focusing on biomedical text. Drug labels, despite their availability and the wealth of information contained within them, remain underutilized. One of the reasons might be the complexity of the text in the labels: in a review of publicly available text sources that could be used to augment a repository of drug indications and adverse effects (ADEs), Smith et al. (2011) concluded that many indication and adverse drug event relationships in the drug labels are too complex to be captured in the existing databases of interactions and ADEs. Despite the complexity, the labels were used to extract indications for drugs in several studies. Elkin et al. (2011) automatically extracted indications, mapped them to SNOMED-CT and then automatically derived rules in the form ("Drug" HasIndication "SNOMED CT"). Fung et al. (2013) used MetaMap (Aronson and Lang, 2010) to extract indications and map them to the UMLS (Lindberg et al., 1993), and then manually validated the quality of the mappings. Oprea et al. (2011) used information extracted from the adverse reactions sections of 988 drugs for computer-aided drug repurposing. Duke et al. (2011) have developed a rule-based system that extracted 534,125 ADEs from 5602 SPLs. Zhu et al. (2013) extracted disease terms from five SPL sections (indication, contraindication, ADE, precaution, and warning) and combined the extracted terms with the drug and disease relationships in NDF-RT to disambiguate the PharmGKB drug and disease associations. A hybrid NLP system, AutoMCExtractor, uses conditional random fields and post-processing rules to extract medical conditions from SPLs and build triplets in the form of ([drug name]-[medical condition]-[LOINC section header]) (Li et al., 2013).

Coreference resolution in the biomedical domain was addressed in the 2011 i2b2/VA shared task (Uzuner et al., 2012) and the 2011 BioNLP Shared Task (Kim et al., 2012); however, these community-wide evaluations did not change much the observation in the 2011 review by Zheng et al. (2011) that only a handful of systems were developed for handling anaphora and coreference in clinical text and biomedical publications. Since this comprehensive article was published, Yoshikawa et al. (2011) proposed two coreference resolution models, based on support vector machines and a joint Markov logic network, to aid the task of biological event extraction. Similarly, Miwa et al. (2012) and Kilicoglu and Bergler (2012) extended their biological event extraction pipelines using rule-based coreference systems that rely on syntactic information and predicate argument structures. Nguyen et al. (2012) evaluated the contribution of discourse preference, number agreement, and domain-specific semantic information in capturing pronominal and nominal anaphora referring to proteins. An effort similar to ours is that of Segura-Bedmar et al. (2010), who resolve anaphora to support drug-drug interaction extraction. They created a corpus of 49 interaction sections extracted from the DrugBank database, having on average 40 sentences and 716 tokens. They then manually annotated pronominal and nominal anaphora, and developed a rule-based approach that achieves 0.76 F1-measure in anaphora resolution.

3 Methods

3.1 The dataset

We used a dataset extracted from FDA drug package labels by our collaborators at FDA who are interested in extracting interactions between cardiovascular drugs. The dataset consists of 159 drug labels, with an average of 105 sentences and 1787 tokens per label. It is annotated for three entity types (Drug, Drug Class, and Substance) and four drug interaction types (Caution, Decrease, Increase, and Specific). 377 instances of coreference were annotated. Two annotators separately annotated the labels and one of the authors performed the adjudication. The relatively low number of coreference instances is due to the fact that coreference was annotated only when it was relevant to the drug interaction recognition task. This parsimonious approach to annotation presents difficulty in automatically evaluating the system; to mitigate this, we present an assessment of the precision of our end-to-end coreference system as well. We split the dataset into training and test sets by random sampling. The training data consists of 79 documents and the test set has 80 documents. We used the training data for analysis and as the basis of our enhancements.

3.2 The system

The work described in this paper extends and refines earlier work, described in Kilicoglu et al. (2013), which focused on disease anaphora and ellipsis in the context of consumer health questions. We briefly recap that system here. The system begins by mapping named entities to UMLS Metathesaurus concepts (CUIs). Next, it identifies anaphoric expressions in text, which include personal (e.g., it, they) and demonstrative pronouns (e.g., this, those), as well as sortal anaphora (definite (e.g., with the) and demonstrative (e.g., with that) noun phrases). The candidate antecedents are then recognized using syntactic (person, gender and number agreement, head word matching) and semantic (hypernym and UMLS semantic type matching) constraints. Finally, the co-referent is selected as the focus of the question, which is taken to be the first disease mention in the question.

The coreference resolution pipeline used in the current work, while enhanced significantly, follows the same basic sequence. The relatively simple approach of the earlier work is generally sufficient for consumer health questions; however, we found it insufficient when it comes to drug labels. Aside from the obvious point that the approach was limited to diseases, there are other stylistic differences that have an impact on coreference resolution. In contrast to the informal and casual style of consumer health questions, drug labels are curated and provide complex indication and ADE information in a formal style, more akin to the biomedical literature. Our analysis of the training data highlighted several facts regarding coreference in drug labels: (1) set/instance anaphora (including those involving distributive anaphora such as both, each, either) instances are prominent, (2) demonstrative pronominal anaphora is non-existent, in contrast to consumer health questions, (3) focus-based salience scoring is simplistic for longer texts. We describe the system enhancements below.

3.2.1 Generalizing from diseases to drugs and beyond

We generalized from resolution of disease coreference only to resolution of coreference involving other entity types. For this purpose, we parameterized the semantic groups and the hypernym lists associated with each semantic group. We generalized the system in the sense that new semantic types and hypernyms can be easily defined and used by the system. In addition to the Disorder semantic group and the Disorder hypernym list defined in earlier work, we used Drug, Intervention, Population, Procedure, Anatomy, and Gene/Protein semantic groups and hypernym lists. The semantic group classification largely mimics the coarse-grained UMLS semantic groups (McCray et al., 2001). For example, the UMLS semantic types Pharmacologic Substance and Clinical Drug are aggregated into both the Drug and Intervention semantic groups, while Therapeutic or Preventive Procedure is assigned to the Procedure group only. Drug hypernyms, such as medication, drug, and agent, were derived from the training data.

3.2.2 Set/instance anaphora

Set/instance anaphora instances are prevalent in drug labels. In our dataset, 19% of all annotated anaphoric expressions indicate set/instance anaphora (co-referring with 29% of antecedent terms). An example was provided earlier (Example 1). While recognizing anaphoric expressions that indicate set/instance anaphora is not necessarily difficult (i.e., recognizing these isoenzymes in the example), linking them to their antecedents can be difficult, since it generally involves correctly identifying syntactic coordination, a challenging syntactic parsing task (Ogren, 2010). Our identification of these structures relies on collapsed Stanford dependency output (de Marneffe et al., 2006) and uses syntactic and semantic constraints. We examine all the dependency relations extracted from a sentence and only consider those with the type conj_* (e.g., conj_and, conj_or). For increased accuracy, we then check the tokens involved in the dependency (conjuncts) and ensure that there is a coordinating conjunction (e.g., and, or, , (comma), & (ampersand)) between them. Once such a conjunction is identified, we then examine the semantic compatibility of the conjuncts. In the case of entities, compatibility is checked at the semantic group level. In the current work, we also began recognizing distributive anaphora, such as either and each, as anaphoric expressions. When the recognized anaphoric expression is plural (as in they, these agents or either drug), we allow the coordinated structures previously identified in this fashion as candidate antecedents. The current work does not address a more complex kind of set/instance anaphora, in which the instances are not syntactically coordinated, as in Example (2), where such agents refers to thiazide diuretics in the preceding sentence, as well as to potassium-sparing diuretics and potassium supplements.

(2) . . . can attenuate potassium loss caused by thiazide diuretics. Potassium-sparing diuretics . . . or potassium supplements can increase . . . . if concomitant use of such agents is indicated . . .
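As an illustration, the coordination check just described can be sketched over generic collapsed-dependency triples; the token list, the dependency triples, and the semantic_group function are assumed to come from the parsing and entity-mapping steps, and the names are ours rather than the system's.

COORD_CONJUNCTIONS = {"and", "or", ",", "&"}

def coordinated_antecedents(tokens, dependencies, semantic_group):
    """Return index pairs of semantically compatible conjuncts.

    dependencies: iterable of (governor index, relation, dependent index)."""
    pairs = []
    for gov, relation, dep in dependencies:
        if not relation.startswith("conj_"):
            continue
        lo, hi = sorted((gov, dep))
        # require an explicit coordinating conjunction between the conjuncts
        between = {t.lower() for t in tokens[lo + 1:hi]}
        if not (between & COORD_CONJUNCTIONS):
            continue
        # require compatibility at the semantic group level (e.g., both Drug)
        if semantic_group(tokens[gov]) == semantic_group(tokens[dep]):
            pairs.append((gov, dep))
    return pairs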

3.2.3 Appositive constructions

Coreference involving appositive constructions3 is annotated in some corpora, including the BioNLP shared task coreference dataset (Kim et al., 2012) and the DrugNerAR corpus (Segura-Bedmar et al., 2010). An example is given below, in which the indefinite noun phrase a drug and the drug lovastatin are appositives.

(3) PLETAL does not, however, appear to cause increased blood levels of drugs metabolized by CYP3A4, as it had no effect on lovastatin, a drug with metabolism very sensitive to CYP3A4 inhibition.

In our dataset, coreference involving appositive constructions was generally left unannotated. However, it was consistently the case that when one of the items in the construction was annotated as the antecedent for an anaphoric expression, the other item in the construction was also annotated as such. Therefore, we identified appositive constructions in text to aid the antecedent selection task. We used dependency relations for this task, as well. Identifying appositives is relatively straightforward using syntactic dependency relations. We adapted the following rule from Kilicoglu and Bergler (2012):

APPOS(Antecedent,Anaphor) ∨ APPOS(Anaphor,Antecedent) ⇒ COREF(Anaphor,Antecedent)

where APPOS ∈ {appos, abbrev, prep_including, prep_such_as}. In our case, this rule becomes

(APPOS(Antecedent1,Antecedent2) ∨ APPOS(Antecedent2,Antecedent1)) ∧ COREF(Anaphor,Antecedent1) ⇒ COREF(Anaphor,Antecedent2)

which essentially states that a candidate is taken as an antecedent only if its appositive has been recognized as an antecedent. Additionally, semantic compatibility between the items is required.

This allows us to identify their and Class Ia antiarrhythmic drugs as co-referents in the following example, due to the fact that the exemplification indicated by the appositive construction between Class Ia antiarrhythmic drugs and disopyramide is recognized, the latter having previously been identified as an antecedent for their.

3 We use the term "appositive" to cover exemplifications as well.


(4) Class Ia antiarrhythmic drugs, such as disopyramide, quinidine and procainamide and other Class III drugs (e.g., amiodarone) are not recommended . . . because of their potential to prolong refractoriness.

3.2.4 Relative pronouns

Similar to appositive constructions, relative pronouns are annotated as anaphoric expressions in some corpora (the same as those for appositives), but not in our dataset. In the example below, the relative pronoun which refers to potassium-containing salt substitutes.

(5) . . . the concomitant use of potassium-sparing diuretics, potassium supplements, and/or potassium-containing salt substitutes, which should be used cautiously. . .

Since we aim for generality and this type of anaphora can be important for downstream applications, we implemented a rule, again taken from Kilicoglu and Bergler (2012), which simply states that the antecedent of a relative pronominal anaphor is the noun phrase head it modifies:

rel(X,Anaphor) ∧ rcmod(Antecedent,X) ⇒ COREF(Anaphor,Antecedent)

where rel indicates a relative dependency, and rcmod a relative clause modifier dependency. We extended this in the current work to include the following rules:

(6) (a) LEFT(Antecedent,Anaphor) ∧ NO_INT_WORD(Antecedent,Anaphor) ⇒ COREF(Anaphor,Antecedent)

    (b) LEFT(Antecedent,Anaphor) ∧ rcmod(Antecedent,X) ∧ LEFT(Anaphor,X) ⇒ COREF(Anaphor,Antecedent)

where LEFT indicates that the first argument is to the left of the second and NO_INT_WORD indicates that the arguments have no intervening words between them.

3.3 Drug ingredient/brand name synonymy

A specific, non-anaphoric type of coreference, between a drug ingredient name and the drug's brand name, is commonly annotated in our dataset. An example is provided below, where COREG CR is the brand name for carvedilol.

(7) The concomitant administration of amiodarone or other CYP2C9 inhibitors such as fluconazole with COREG CR may enhance the β-blocking properties of carvedilol . . . .

To identify this type of coreference, we use semantic information from the UMLS Metathesaurus. We stipulate that, to qualify as co-referents, both terms under consideration should map to the same UMLS concept (i.e., that they are considered synonyms). If the terms are within the same sentence, we further require that they be appositives.
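A minimal sketch of this synonymy test is given below; term_to_cuis is a hypothetical mapping from surface terms to UMLS concept identifiers (CUIs), and the identifiers shown are placeholders rather than real CUIs.

def share_umls_concept(term1, term2, term_to_cuis):
    """True if the two terms map to at least one common UMLS concept (CUI)."""
    cuis1 = term_to_cuis.get(term1.lower(), set())
    cuis2 = term_to_cuis.get(term2.lower(), set())
    return bool(cuis1 & cuis2)

# Placeholder entries; a real system would build this mapping from the UMLS.
term_to_cuis = {"coreg cr": {"C0000001"}, "carvedilol": {"C0000001"}}
print(share_umls_concept("COREG CR", "carvedilol", term_to_cuis))  # True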

3.3.1 Demonstrative pronouns

Anaphoric expressions of the demonstrative pronoun type generally have discourse-deictic use; in other words, they often refer to events or propositions described in prior discourse, or even to full sentences or paragraphs, rather than to concrete objects or entities (Webber, 1988). This fact was implicitly exploited in consumer health questions, since the coreference resolution focused on diseases only, which are essentially processes. However, in drug labels, discourse-deictic use of demonstratives is much more overt. Consider the sentence below, where the demonstrative This refers to the event of increasing the exposure to lovastatin.

(8) Co-administration of lovastatin and SAMSCA increases the exposure to lovastatin and . . . . This is not a clinically relevant change.

To handle such cases, we blocked entity antecedents (such as drugs) for demonstrative pronouns and only allowed predicates (verbs, nominalizations) as candidate antecedents.

3.3.2 Pleonastic it

We recognized pleonastic instances of the pronoun it to disqualify them as anaphoric expressions (for instance, it in It may be necessary to . . . ). Generally, lexical patterns involving sequences of tokens are used to recognize such instances (e.g., Segura-Bedmar et al. (2010)). We used a simple dependency-based rule that mimics these patterns, given below.

nsubj*(X,it) ∧ DEP(X,Y) ⇒ PLEONASTIC(it)

where nsubj* refers to nsubj or nsubjpass dependencies and DEP is any dependency such that DEP ∉ {infmod, ccomp, xcomp}.

3.3.3 Discourse-based constraints

Previously, we did not impose limits on how far the co-referents could be from each other, since the entire discourse was generally short and the salient antecedent (often the topic of the question) appeared early in the discourse. This is often not the case in drug labels, especially because often intricate interactions between the drug of interest and other medications are discussed. Therefore, we limit the discourse window from which candidate antecedents are identified. Generally, the search space for the antecedents is limited to the current sentence as well as the two preceding sentences (Segura-Bedmar et al., 2010; Nguyen et al., 2012). In our dataset, we found that 98% of antecedents occurred within this discourse window and, thus, we use the same search space. We make an exception for the cases in which the anaphoric expression appears in the first sentence of a paragraph and no compatible antecedent is found in the same sentence. In this case, the search space is expanded to the entire preceding paragraph.

We also extended the system to include different types of salience scoring methods. For drug labels, we use the linear distance between the co-referents (in terms of surface elements) as the salience score; the lower this score, the better a candidate the antecedent is. Additionally, we implemented syntactic tree distance between the co-referents as a potential salience measure, even though this type of salience scoring did not have an effect on our results on drug labels.

Finally, we block candidate antecedents that are in a direct syntactic dependency with the anaphoric expression, except when the anaphor is reflexive (e.g., itself).
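A schematic sketch of this antecedent search (sentence window plus linear-distance salience) is shown below; the mention dictionaries and the compatible predicate are assumed stand-ins for the system's mention representation and its syntactic/semantic constraints, and the paragraph-initial exception is omitted for brevity.

def select_antecedent(anaphor, mentions, compatible, window=2):
    """Pick the closest compatible mention within the sentence window.

    anaphor and mentions are dicts with 'sent' (sentence index) and
    'token' (document-level token offset)."""
    candidates = [
        m for m in mentions
        if anaphor["sent"] - window <= m["sent"] <= anaphor["sent"]
        and m["token"] < anaphor["token"]      # antecedent must precede the anaphor
        and compatible(anaphor, m)             # syntactic and semantic constraints
    ]
    if not candidates:
        return None
    # lower linear distance = higher salience
    return min(candidates, key=lambda m: anaphor["token"] - m["token"])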

3.4 Evaluation

To evaluate our approach, we used a baseline similar to that reported in Segura-Bedmar et al. (2010), which consists of selecting the closest preceding nominal phrase for the anaphoric expressions annotated in their corpus. These expressions include pronominal (personal, relative, demonstrative, etc.) and nominal (definite, possessive, etc.) anaphora. We compared our system to this baseline using the unweighted average of F1-measure over the B-CUBED (Bagga and Baldwin, 1998), MUC (Vilain et al., 1995), and CEAF (Luo, 2005) metrics, the standard evaluation metrics for coreference resolution. We used the scripts provided by the i2b2 shared task organizers for this purpose. Since coreference annotation was parsimonious in our dataset, we also manually examined a subset of the coreference relations extracted by the system for precision. Additionally, we tested our system on the DrugNerAR corpus (Segura-Bedmar et al., 2010), which similarly focuses on drug interactions. We compared our results to theirs, using as evaluation metrics precision, recall, and F1-measure, the metrics that were used in their evaluation.
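For reference, the following is a minimal sketch of one of these metrics, B-CUBED, for the simple case in which the key and the response contain the same set of mentions; the official i2b2 scripts handle additional details (e.g., mentions present in only one side).

def b_cubed(key_chains, response_chains):
    """B-CUBED precision, recall, and F1 over chains of mention identifiers."""
    key = {m: frozenset(chain) for chain in key_chains for m in chain}
    response = {m: frozenset(chain) for chain in response_chains for m in chain}
    mentions = key.keys() & response.keys()
    precision = sum(len(key[m] & response[m]) / len(response[m]) for m in mentions)
    recall = sum(len(key[m] & response[m]) / len(key[m]) for m in mentions)
    n = len(mentions)
    p, r = precision / n, recall / n
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy example: two gold chains vs. a response that merges them.
print(b_cubed([{"a", "b"}, {"c", "d"}], [{"a", "b", "c", "d"}]))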

4 Results and Discussion

With the drug label dataset, we obtained the best results without the relative pronominal anaphora resolution and drug ingredient/brand name synonymy strategies (OPTIMAL) and with linear distance as the salience measure. In this setting, using gold entity annotations, we recognized 318 coreference chains, 54 of which were annotated in the corpus. The baseline identified 1415 coreference chains, only 10 of which were annotated. The improvement provided by the system over the baseline is clear; however, the low precision/recall/F1-measure, given in Table 1, should be taken with caution due to the sparse coreference annotation in the dataset. To get a better sense of how well our system performs, we also performed end-to-end coreference resolution and manually assessed a subset of the system output (22 randomly selected drug labels with 249 coreference instances). Of these 249, 181 were deemed correct, yielding a precision of 0.73. The baseline method extracted 1439 instances, 56 of which were deemed correct, yielding a precision of 0.04. The precision of our method is more in line with what has been reported in the literature (Segura-Bedmar et al., 2010; Nguyen et al., 2012). For the i2b2-style evaluation using the unweighted average F1-measure over the B-CUBED, MUC, and CEAF metrics, we considered both exact and partial mention overlap. These results, provided in Table 1, also indicate that the system provides a clear improvement over the baseline.

Metric                               Baseline   OPTIMAL
With gold entity annotations
Unweighted F1 Partial                0.55       0.77
Unweighted F1 Exact                  0.66       0.78
Precision                            0.01       0.17
Recall                               0.04       0.26
F1-measure                           0.01       0.21
End-to-end coreference resolution
Precision                            0.04       0.73

Table 1: Evaluation results on drug labels


We also assessed the effect of the various resolution strategies on the results. These results are presented in Table 2.

Strategy           F1-measure
OPTIMAL            0.21
OPTIMAL - SIA      0.21
OPTIMAL - APPOS    0.15
OPTIMAL + DIBS     0.16 (0.39 recall)

Table 2: Effect of coreference strategies

Disregarding set/instance anaphora resolution (SIA) does not appear to affect the results by much; however, this is mostly due to the fact that the "instance" mentions are generally exemplifications of a particular drug class which also appears in text. In the absence of set/instance anaphora resolution, the system often defaults to these drug class mentions, which were annotated more often than not, unlike the "instance" mentions. Take the following example:

(9) Use of ZESTRIL with potassium-sparing diuretics (e.g., spironolactone, eplerenone, triamterene or amiloride) . . . may lead to significant increases . . . if concomitant use of these agents . . .

Without set/instance anaphora resolution, the system links these agents to potassium-sparing diuretics, an annotated relation. With set/instance anaphora resolution, the same expression is linked to the individual drug names (spironolactone, etc.) as well as the drug class, creating a number of false positives, which, in effect, offsets the improvement provided by this strategy.

On the other hand, recognizing appositive constructions (APPOS) appears to have a larger impact; however, it should be noted that this is mostly because it helps us expand the antecedent mention list in the case of set/instance anaphora. For instance, in Example (9), this strategy allows us to establish the link between the anaphor and the drug class (diuretics), since the drug class and the individual drug name (spironolactone) are identified earlier as appositives. We can conclude that, in general, set/instance anaphora resolution benefits from recognition of appositive constructions.

Recognizing drug ingredient/brand name synonymy (DIBS) improved the recall but hurt the precision significantly, the overall effect being negative. Since this non-anaphoric type of coreference is strictly semantic in nature and resources from which this type of semantic information can be derived already exist (the UMLS, among others), it is perhaps not of utmost importance that a coreference resolution system recognize such coreference.

We additionally processed the DrugNerAR corpus with our system. The optimal setting for this corpus was disregarding the drug ingredient/brand name synonymy but using relative pronoun anaphora resolution, based on the discussion in Segura-Bedmar et al. (2010). Somewhat to our surprise, our system did not fare well on this corpus. We extracted 524 chains, 327 of which (out of 669) were annotated in the corpus, yielding a precision of 0.71, recall of 0.56, and F1-measure of 0.63. This is about 20% lower than their reported results. When we used their baseline method (explained earlier), we obtained similarly lower scores (precision of 0.18, recall of 0.45, F1-measure of 0.26, about 40% lower than their reported results). In light of this apparent discrepancy, which clearly warrants further investigation, it is perhaps more sensible to focus on "improvement over baseline" (reported as 73% in their paper and 140% in our case).

We analyzed some of the annotations more closely to get a better sense of the shortcomings of the system. The majority of errors were due to using linear distance as the salience score. For instance, in the following example, they is linked to ACE inhibitors due to proximity, whereas the true antecedent is these reactions (itself an anaphor, presumably linked to another antecedent). It could be possible to recover this link using principles of Centering Theory (Grosz et al., 1995), which suggest that subjects are more central than objects and adjuncts in an utterance. Following this principle, the subject (these reactions) would be preferred to ACE inhibitors as the antecedent.

(10) In the same patients, these reactions were avoided when ACE inhibitors were temporarily withheld, but they reappeared upon inadvertent rechallenge.

Semantic (but not syntactic) coordination sometimes leads to number disagreement between the anaphor and a true antecedent, as shown in Example (11), leading to false negatives. In this example, such diuretics refers to both ALDACTONE and a second diuretic; however, we are unable to identify the link between them, and the number disagreement between the anaphor and either of the antecedents blocks a potential coreference relation between these items.

(11) If, after five days, an adequate diuretic response to ALDACTONE has not occurred, a second diuretic that acts more proximally in the renal tubule may be added to the regimen. Because of the additive effect of ALDACTONE when administered concurrently with such diuretics . . .

5 Conclusion

We presented a coreference resolution system enhanced based on insights from a dataset of FDA drug package inserts. Sparse coreference annotation in the dataset presented difficulties in evaluating the results; however, based on various evaluation strategies, the performance improvement due to the enhancements seems evident. Our results show that recognizing coordination and appositive constructions is particularly useful and that non-anaphoric cases of coreference can be identified using synonymy in semantic resources, such as the UMLS. However, whether this is a task for a coreference resolution system or a concept normalization system is debatable. We experimented with using hierarchical domain knowledge in the UMLS (for example, the knowledge that lisinopril ISA angiotensin converting enzyme inhibitor) to resolve some cases of sortal anaphora. Even though we did not see an improvement due to using this type of information on our dataset, further work is needed to assess its usefulness. While the enhancements were evaluated on drug labels only, they are not specific to this type of text. Their portability to different text types is limited only by the accuracy of the underlying tools, such as parsers, for the text type of interest and the availability of domain knowledge in the form of relevant semantic types, groups, and hypernyms for the entity types under consideration. The results also indicate that a more rigorous application of syntactic constraints in the spirit of Centering Theory (Grosz et al., 1995) could be beneficial. Event (or clausal) anaphora and anaphora indicating discourse deixis, while rarely annotated in our dataset, appear to occur fairly often in biomedical text. These types of anaphora are known to be particularly challenging, and we plan to investigate them in future research, as well.

Acknowledgments

This work was supported by the intramural research program at the U.S. National Library of Medicine, National Institutes of Health.

References

Alan R. Aronson and Francois-Michel Lang. 2010. An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association (JAMIA), 17(3):229–236.

Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference, pages 563–566.

James J. Cimino, Tiffani J. Bright, and Jianhua Li. 2007. Medication reconciliation using natural language processing and controlled terminologies. In Klaus A. Kuhn, James R. Warren, and Tze-Yun Leong, editors, MedInfo, volume 129 of Studies in Health Technology and Informatics, pages 679–683. IOS Press.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of the 5th International Conference on Language Resources and Evaluation, pages 449–454.

Dina Demner-Fushman, Wendy W. Chapman, and Clem J. McDonald. 2009. What can natural language processing do for clinical decision support? Journal of Biomedical Informatics, 5(42):760–762.

Jon Duke, Jeff Friedlin, and Patrick Ryan. 2011. A quantitative analysis of adverse events and "overwarning" in drug labeling. Archives of Internal Medicine, 10(171):944–946.

Peter L. Elkin, John S. Carter, Manasi Nabar, Mark Tuttle, Michael Lincoln, and Steven H. Brown. 2011. Drug knowledge expressed as computable semantic triples. Studies in Health Technology and Informatics, (166):38–47.

Kin Wah Fung, Chiang S. Jao, and Dina Demner-Fushman. 2013. Extracting drug indication information from structured product labels using natural language processing. JAMIA, 20(3):482–488.

Barbara J. Grosz, Scott Weinstein, and Aravind K. Joshi. 1995. Centering: a framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):203–225.



María Herrero-Zazo, Isabel Segura-Bedmar, Paloma Martínez, and Thierry Declerck. 2013. The DDI corpus: An annotated corpus with pharmacological substances and drug-drug interactions. Journal of Biomedical Informatics, 46(5):914–920.

Halil Kilicoglu and Sabine Bergler. 2012. Biological Event Composition. BMC Bioinformatics, 13(Suppl 11):S7.

Halil Kilicoglu, Marcelo Fiszman, and Dina Demner-Fushman. 2013. Interpreting consumer health questions: The role of anaphora and ellipsis. In Proceedings of the 2013 Workshop on Biomedical Natural Language Processing, pages 54–62.

Jin-Dong Kim, Ngan Nguyen, Yue Wang, Jun'ichi Tsujii, Toshihisa Takagi, and Akinori Yonezawa. 2012. The Genia Event and Protein Coreference tasks of the BioNLP Shared Task 2011. BMC Bioinformatics, 13(Suppl 11):S1.

Qi Li, Louise Deleger, Todd Lingren, Haijun Zhai, Megan Kaiser, Laura Stoutenborough, Anil G. Jegga, Kevin B. Cohen, and Imre Solti. 2013. Mining FDA drug labels for medical conditions. BMC Medical Informatics and Decision Making, 13(1):53.

Donald A. B. Lindberg, Betsy L. Humphreys, and Alexa T. McCray. 1993. The Unified Medical Language System. Methods of Information in Medicine, 32:281–291.

Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In Proceedings of HLT/EMNLP, pages 25–32.

Alexa T. McCray, Anita Burgun, and Olivier Bodenreider. 2001. Aggregating UMLS semantic types for reducing conceptual complexity. Proceedings of Medinfo, 10(pt 1):216–20.

Makoto Miwa, Paul Thompson, and Sophia Ananiadou. 2012. Boosting automatic event extraction from the literature using domain adaptation and coreference resolution. Bioinformatics, 28(13):1759–1765.

Ngan L. T. Nguyen, Jin-Dong Kim, Makoto Miwa, Takuya Matsuzaki, and Junichi Tsujii. 2012. Improving protein coreference resolution by simple semantic classification. BMC Bioinformatics, 13:304.

Philip V. Ogren. 2010. Improving Syntactic Coordination Resolution using Language Modeling. In NAACL (Student Research Workshop), pages 1–6. The Association for Computational Linguistics.

T.I. Oprea, S.K. Nielsen, O. Ursu, J.J. Yang, O. Taboureau, S.L. Mathias, L. Kouskoumvekaki, L.A. Sklar, and C.G. Bologa. 2011. Associating Drugs, Targets and Clinical Outcomes into an Integrated Network Affords a New Platform for Computer-Aided Drug Repurposing. Molecular Informatics, 2-3(30):100–111.

Thomas C. Rindflesch, Lorrie Tanabe, John N. Weinstein, and Lawrence Hunter. 2000. EDGAR: Extraction of drugs, genes, and relations from the biomedical literature. In Proceedings of Pacific Symposium on Biocomputing, pages 514–525.

Isabel Segura-Bedmar, Mario Crespo, Cesar de Pablo-Sanchez, and Paloma Martínez. 2010. Resolving anaphoras for the extraction of drug-drug interactions in pharmacological documents. BMC Bioinformatics, 11(Suppl 2):S1.

J.C. Smith, J.C. Denny, Q. Chen, H. Nian, A. Spickard 3rd, S.T. Rosenbloom, and R. A. Miller. 2011. Lessons learned from developing a drug evidence base to support pharmacovigilance. Applied Clinical Informatics, 4(4):596–617.

Ozlem Uzuner, Imre Solti, and Eithon Cadag. 2010. Extracting medication information from clinical text. JAMIA, 17(5):514–518.

Ozlem Uzuner, Andrea Bodnari, Shuying Shen, Tyler Forbush, John Pestian, and Brett R. South. 2012. Evaluating the state of the art in coreference resolution for electronic medical records. JAMIA, 19(5):786–791.

Marc B. Vilain, John D. Burger, John S. Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In MUC, pages 45–52.

Bonnie L. Webber. 1988. Discourse Deixis: Reference to Discourse Segments. In ACL, pages 113–122.

Rong Xu and QuanQiu Wang. 2014. Large-scale combining signals from both biomedical literature and the FDA Adverse Event Reporting System (FAERS) to improve post-marketing drug safety signal detection. BMC Bioinformatics, 15:17.

Katsumasa Yoshikawa, Sebastian Riedel, Tsutomu Hirao, Masayuki Asahara, and Yuji Matsumoto. 2011. Coreference Based Event-Argument Relation Extraction on Biomedical Text. Journal of Biomedical Semantics, 2(Suppl 5):S6.

Jiaping Zheng, Wendy W. Chapman, Rebecca S. Crowley, and Guergana K. Savova. 2011. Coreference resolution: A review of general methodologies and applications in the clinical domain. Journal of Biomedical Informatics, 44(6):1113–1122.

Qian Zhu, Robert R. Freimuth, Jyotishman Pathak, Matthew J. Durski, and Christopher G. Chute. 2013. Disambiguation of PharmGKB drug-disease relations with NDF-RT and SPL. Journal of Biomedical Informatics, 46(4):690–696.




Generating Patient Problem Lists from the ShARe Corpus using SNOMED CT/SNOMED CT CORE Problem List

Danielle Mowery and Janyce Wiebe, University of Pittsburgh, Pittsburgh, PA
Mindy Ross, University of California San Diego, La Jolla, CA
Sumithra Velupillai, Stockholm University, Stockholm, Sweden
Stephane Meystre and Wendy W. Chapman, University of Utah, Salt Lake City, UT

Abstract

An up-to-date problem list is useful for assessing a patient's current clinical status. Natural language processing can help maintain an accurate problem list. For instance, a patient problem list from a clinical document can be derived from individual problem mentions within the clinical document once these mentions are mapped to a standard vocabulary. In order to develop and evaluate accurate document-level inference engines for this task, a patient problem list could be generated using a standard vocabulary. Adequate coverage by standard vocabularies is important for supporting a clear representation of the patient problem concepts described in the texts and for interoperability between clinical systems within and outside the care facilities. In this pilot study, we report the reliability of domain expert generation of a patient problem list from a variety of clinical texts and evaluate the coverage of annotated patient problems against SNOMED CT and the SNOMED Clinical Observation Recording and Encoding (CORE) Problem List. Across report types, we learned that patient problems can be annotated with agreement ranging from 77.1% to 89.6% F1-score and mapped to the CORE with moderate coverage ranging from 45% to 67% of patient problems.

1 Introduction

In the late 1960's, Lawrence Weed published about the importance of problem-oriented medical records and the utilization of a problem list to facilitate care providers' clinical reasoning by reducing the cognitive burden of separating current, active problems from past, inactive problems in the patient health record (Weed, 1970). Although electronic health records (EHR) can help achieve better documentation of problem-specific information, in most cases the problem list is manually created and updated by care providers. Thus, the problem list can be out-of-date, containing resolved problems or missing new problems. Providing care providers with problem list update suggestions generated from clinical documents can improve the completeness and timeliness of the problem list (Meystre and Haug, 2008).

In recent years, national incentive and standard programs have endorsed the use of problem lists in the EHR for tracking patient diagnoses over time. For example, as part of the Electronic Health Record Incentive Program, the Center for Medicare and Medicaid Services defined demonstration of Meaningful Use of adopted health information technology in the Core Measure 3 objective as "maintaining an up-to-date problem list of current and active diagnoses in addition to historical diagnoses relevant to the patient's care" (Center for Medicare and Medicaid Services, 2013). More recently, the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) has become the standard vocabulary for representing and documenting patient problems within the clinical record. Since 2008, this vocabulary has been iteratively refined four times each year to produce a subset of generalizable clinical problems called the SNOMED CT CORE Problem List. This CORE list represents the most frequent problem terms and concepts across eight major healthcare institutions in the United States and is designed to support interoperability between regional healthcare institutions (National Library of Medicine, 2009).

In practice, several methodologies are applied to generate a patient problem list from clinical text. Problem lists can be generated from coded diagnoses such as International Statistical Classification of Disease (ICD-9) codes or from concept labels such as Unified Medical Language System concept unique identifiers (UMLS CUIs).



For example, Meystre and Haug (2005) defined 80 of the most frequent problem concepts from coded diagnoses for cardiac patients. This list was generated by a physician and later validated by two physicians independently. Coverage of coded patient problems was evaluated against the ICD-9-CM vocabulary. Solti et al. (2008) extended the work of Meystre and Haug (2005) by not limiting the types of patient problems to any list or vocabulary when generating the patient problem list. They observed 154 unique problem concepts in their reference standard. Although both studies demonstrate valid methods for developing a patient problem list reference standard, neither study leverages a standard vocabulary designed specifically for generating problem lists.

The goals of this study are to 1) determine how reliably two domain experts can generate a patient problem list leveraging SNOMED CT from a variety of clinical texts and 2) assess the coverage of annotated patient problems from this corpus against the CORE Problem List.

2 Methods

In this IRB-approved study, we obtained the Shared Annotated Resource (ShARe) corpus originally generated from the Beth Israel Deaconess Medical Center (Elhadad et al., under review) and stored in the Multiparameter Intelligent Monitoring in Intensive Care, version 2.5 (MIMIC II) database (Saeed et al., 2002). This corpus consists of discharge summaries (DS), radiology (RAD), electrocardiogram (ECG), and echocardiogram (ECHO) reports from the Intensive Care Unit (ICU). The ShARe corpus was selected because it 1) contains a variety of clinical text sources, 2) links to additional patient structured data that can be leveraged for further system development and evaluation, and 3) has encoded individual problem mentions with semantic annotations within each clinical document that can be leveraged to develop and test document-level inference engines. We elected to study ICU patients because they represent a sensitive cohort that requires up-to-date summaries of their clinical status for providing timely and effective care.

2.1 Annotation Study

For this annotation study, two annotators - a physician and a nurse - were provided independent training to annotate clinically relevant problems, e.g., signs, symptoms, diseases, and disorders, at the document level for 20 reports. The annotators were given feedback based on errors over two iterations. For each patient problem in the remaining set, the physician was instructed to review the full text, span the problem mention, and map the problem to a CUI from SNOMED-CT using the extensible Human Oracle Suite of Tools (eHOST) annotation tool (South et al., 2012). If a CUI did not exist in the vocabulary for the problem, the physician was instructed to assign a "CUI-less" label. Finally, the physician assigned one of five possible status labels - Active, Inactive, Resolved, Proposed, and Other - based on our previous study (Mowery et al., 2013) to the mention, representing its last status change at the conclusion of the care encounter. Patient problems were not annotated as Negated since patient problem concepts are assumed absent at a document level (Meystre and Haug, 2005). If the patient was healthy, the physician assigned "Healthy - no problems" to the text. To reduce the cognitive burden of annotation and create a more robust reference standard, these annotations were then provided to a nurse for review. The nurse was instructed to add missing, modify existing, or delete spurious patient problems based on the guidelines.

We assessed how reliably annotators agreed with each other's patient problem lists using inter-annotator agreement (IAA) at the document level. We evaluated IAA in two ways: 1) by problem CUI and 2) by problem CUI and status. Since the number of problems not annotated (i.e., true negatives (TN)) is very large, we calculated F1-score as a surrogate for kappa (Hripcsak and Rothschild, 2005). F1-score is the harmonic mean of recall and precision, calculated from true positive, false positive, and false negative annotations, which were defined as follows:

true positive (TP) = the physician and nurse problem annotation was assigned the same CUI (and status)

false positive (FP) = the physician problem annotation (and status) did not exist among the nurse problem annotations



false negative (FN) = the nurse problem annotation (and status) did not exist among the physician problem annotations

\[ \text{Recall} = \frac{TP}{TP + FN} \qquad (1) \]

\[ \text{Precision} = \frac{TP}{TP + FP} \qquad (2) \]

\[ \text{F1-score} = \frac{2\,(\text{Recall} \times \text{Precision})}{\text{Recall} + \text{Precision}} \qquad (3) \]
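A minimal sketch of how these document-level agreement metrics could be computed, assuming each annotator's output is represented as a mapping from document identifiers to sets of (CUI, status) pairs; the function name and data structures are illustrative, not the authors' code.

# Minimal sketch: document-level IAA between two annotators.
# physician, nurse: dict mapping doc_id -> set of (cui, status) tuples (assumed format).

def iaa_f1(physician, nurse, use_status=False):
    tp = fp = fn = 0
    for doc_id in physician.keys() | nurse.keys():
        a = physician.get(doc_id, set())
        b = nurse.get(doc_id, set())
        if not use_status:                      # compare on CUI only
            a = {cui for cui, _ in a}
            b = {cui for cui, _ in b}
        tp += len(a & b)                        # same annotation from both
        fp += len(a - b)                        # physician-only annotations
        fn += len(b - a)                        # nurse-only annotations
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * recall * precision / (recall + precision)
          if recall + precision else 0.0)
    return recall, precision, f1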

We sampled 50% of the corpus and determined the most common errors. These errors, with examples, were programmatically adjudicated with the following solutions:

Spurious problems: procedures
solution: exclude non-problems via guidelines

Problem specificity: CUI specificity differences
solution: select most general CUIs

Conflicting status: negated vs. resolved
solution: select second reviewer's status

CUI/CUI-less: C0031039 vs. CUI-less
solution: select CUI since clinically useful

We split the dataset into about two-thirds training and one-third test for each report type. The remaining data analysis was performed on the training set.

2.2 Coverage Study

We characterized the composition of the reference standard patient problem lists against two standard vocabularies: SNOMED-CT and the SNOMED-CT CORE Problem List. We evaluated the coverage of patient problems against the SNOMED CT CORE Problem List since the list was developed to support encoding clinical observations such as findings, diseases, and disorders for generating patient summaries like problem lists. We evaluated the coverage of patient problems from the corpus against the SNOMED-CT January 2012 Release, which leverages UMLS version 2011AB. We assessed recall (Eq. 1), defining a TP as a patient problem CUI occurring in the vocabulary and a FN as a patient problem CUI not occurring in the vocabulary.
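Since coverage here reduces to recall with membership in the vocabulary as the criterion, a short illustrative sketch follows; the variable names and data structures are assumptions, not part of the original study.

# Sketch: coverage of annotated problem CUIs against a vocabulary
# (e.g., SNOMED CT or the CORE Problem List), computed as recall.

def coverage(problem_cuis, vocabulary_cuis):
    """problem_cuis: list of annotated CUIs (may repeat across documents);
    vocabulary_cuis: set of CUIs present in the target vocabulary."""
    tp = sum(1 for cui in problem_cuis if cui in vocabulary_cuis)
    fn = len(problem_cuis) - tp
    return tp / (tp + fn) if problem_cuis else 0.0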

3 Results

We report the results of our annotation study on the full set and the vocabulary coverage study on the training set.

3.1 Annotation Study

The full dataset is comprised of 298 clinical documents - 136 (45.6%) DS, 54 (18.1%) ECHO, 54 (18.1%) RAD, and 54 (18.1%) ECG. Seventy-four percent (221) of the corpus was annotated by both annotators. Table 1 shows agreement overall and by report type, matching problem CUI and problem CUI with status. Inter-annotator agreement for problem with status was slightly lower for all report types, with the largest agreement drop for DS at 15% (11.6 points).

Report Type   CUI    CUI + Status
DS            77.1   65.5
ECHO          83.9   82.8
RAD           84.7   82.8
ECG           89.6   84.8

Table 1: Document-level IAA by report type for problem (CUI) and problem with status (CUI + status)

We report the most common errors by frequency in Table 2. By report type, the most common error for ECHO, RAD, and ECG was CUI/CUI-less, and for DS it was Spurious Problems.

Errors   DS          ECHO       RAD        ECG
SP       423 (42%)   26 (23%)   30 (35%)   8 (18%)
PS       139 (14%)   31 (27%)   8 (9%)     0 (0%)
CS       318 (32%)   9 (8%)     8 (9%)     14 (32%)
CC       110 (11%)   34 (30%)   37 (44%)   22 (50%)
Other    6 (<1%)     14 (13%)   2 (2%)     0 (0%)

Table 2: Error types by frequency - Spurious Problems (SP), Problem Specificity (PS), Conflicting Status (CS), CUI/CUI-less (CC)

3.2 Coverage Study

In the training set, there were 203 clinical documents - 93 DS, 37 ECHO, 38 RAD, and 35 ECG. The average number of problems was 22±10 for DS, 10±4 for ECHO, 6±2 for RAD, and 4±1 for ECG. There are 5843 total current problems in the SNOMED-CT CORE Problem List. We observed a range of unique SNOMED-CT problem concept frequencies by report type: 776 DS, 63 ECHO, 113 RAD, and 36 ECG.



The prevalence of covered problem concepts by CORE is 461 (59%) DS, 36 (57%) ECHO, 71 (63%) RAD, and 16 (44%) ECG. In Table 3, we report coverage of patient problems for each vocabulary. No reports were annotated as "Healthy - no problems". All reports have SNOMED CT coverage of problem mentions above 80%. After mapping problem mentions to CORE, we observed coverage drops of 24 to 36 points for all report types.

Report Type   Patient Problems   Annotated with SNOMED CT   Mapped to CORE
DS            2000               1813 (91%)                 1335 (67%)
ECHO          349                300 (86%)                  173 (50%)
RAD           190                156 (82%)                  110 (58%)
ECG           95                 77 (81%)                   43 (45%)

Table 3: Patient problem coverage by SNOMED-CT and SNOMED-CT CORE

4 Discussion

In this feasibility study, we evaluated how reliably two domain experts can generate a patient problem list and assessed the coverage of annotated patient problems against two standard clinical vocabularies.

4.1 Annotation Study

Overall, we demonstrated that problems can be reliably annotated with moderate to high agreement between domain experts (Table 1). For DS, agreement scores were lowest and dropped most when considering the problem status in the match criteria. The most prevalent disagreement for DS was Spurious problems (Table 2). Spurious problems included additional events (e.g., C2939181: Motor vehicle accident), procedures (e.g., C0199470: Mechanical ventilation), and modes of administration (e.g., C0041281: Tube feeding of patient) that were outside our patient problem list inclusion criteria. Some pertinent findings were also missed. These findings are not surprising given that, on average, more problems occur in DS and DS documents are much longer than other document types. Indeed, annotators are more likely to miss a problem as the number of patient problems increases.

Also, status differences can be attributed to multiple status change descriptions using expressions of time (e.g., "cough improved then") and modality (e.g., "rule out pneumonia"), which are harder to track and interpret over a longer document. The most prevalent disagreements for all other document types were CUI/CUI-less, in which identifying a CUI representative of a clinical observation proved more difficult. An example of an Other disagreement was a sidedness mismatch or a redundant patient problem annotation. For example, C0344911: Left ventricular dilatation vs. C0344893: Right ventricular dilatation, or C0032285: Pneumonia recorded twice.

4.2 Coverage Study

We observed that DS and RAD reports have higher counts and coverage of unique patient problem concepts. We suspect this might be because other document types like ECG reports are more likely to contain laboratory observations, which may be less prevalent findings in CORE. Across document types, coverage of patient problems in the corpus by SNOMED CT was high, ranging from 81% to 91% (Table 3). However, coverage of patient problems by CORE dropped to moderate levels, ranging from 45% to 67%. This suggests that the CORE Problem List is more restrictive and may not be as useful for capturing patient problems from these document types. Similarly moderate problem coverage with a more restrictive concept list was also reported by Meystre and Haug (2005).

5 Limitations

Our study has limitations. We did not apply a traditional adjudication review between domain experts. In addition, we selected the ShARe corpus from an ICU database, in which vocabulary coverage of patient problems could be very different from that of other domains and specialties.

6 Conclusion

Based on this feasibility study, we conclude that we can generate a reliable patient problem list reference standard for the ShARe corpus and that SNOMED CT provides better coverage of patient problems than the CORE Problem List. In future work, we plan to evaluate, for each ShARe report type, how well these patient problem lists can be derived and visualized from the individual disease/disorder problem mentions, leveraging temporality and modality attributes using natural language processing and machine learning approaches.



Acknowledgments

This work was partially funded by NLM (5T15LM007059 and 1R01LM010964), ShARe (R01GM090187), the Swedish Research Council (350-2012-6658), and the Swedish Fulbright Commission.

References

Center for Medicare and Medicaid Services. 2013. EHR Incentive Programs - Maintain Problem List. http://www.cms.gov/Regulations-and-Guidance/Legislation/EHRIncentivePrograms/downloads/3_Maintain_Problem_ListEP.pdf.

Noemie Elhadad, Wendy Chapman, Tim O'Gorman, Martha Palmer, and Guergana Savova. under review. The ShARe Schema for the Syntactic and Semantic Annotation of Clinical Texts.

George Hripcsak and Adam S. Rothschild. 2005. Agreement, the F-measure, and Reliability in Information Retrieval. J Am Med Inform Assoc, 12(3):296–298.

Stephane Meystre and Peter Haug. 2005. Automation of a Problem List using Natural Language Processing. BMC Medical Informatics and Decision Making, 5(30).

Stephane M. Meystre and Peter J. Haug. 2008. Randomized Controlled Trial of an Automated Problem List with Improved Sensitivity. International Journal of Medical Informatics, 77:602–12.

Danielle L. Mowery, Pamela W. Jordan, Janyce M. Wiebe, Henk Harkema, John Dowling, and Wendy W. Chapman. 2013. Semantic Annotation of Clinical Events for Generating a Problem List. In AMIA Annu Symp Proc, pages 1032–1041.

National Library of Medicine. 2009. The CORE Problem List Subset of SNOMED-CT. Unified Medical Language System 2011. http://www.nlm.nih.gov/research/umls/SNOMED-CT/core_subset.html.

Mohammed Saeed, C. Lieu, G. Raber, and Roger G. Mark. 2002. MIMIC II: a massive temporal ICU patient database to support research in intelligent patient monitoring. Comput Cardiol, 29.

Imre Solti, Barry Aaronson, Grant Fletcher, Magdolna Solti, John H. Gennari, Melissa Cooper, and Thomas Payne. 2008. Building an Automated Problem List based on Natural Language Processing: Lessons Learned in the Early Phase of Development. In AMIA Annu Symp Proc, pages 687–691.

Brett R. South, Shuying Shen, Jianwei Leng, Tyler B. Forbush, Scott L. DuVall, and Wendy W. Chapman. 2012. A prototype tool set to support machine-assisted annotation. In Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, BioNLP '12, pages 130–139. Association for Computational Linguistics.

Lawrence Weed. 1970. Medical Records, Medical Education and Patient Care: The Problem-Oriented Record as a Basic Tool. Medical Publishers: Press of Case Western Reserve University, Cleveland: Year Book.




A System for Predicting ICD-10-PCS Codes from Electronic Health Records

Michael Subotin, 3M Health Information Systems, Silver Spring, MD
Anthony R. Davis, 3M Health Information Systems, Silver Spring, MD

Abstract

Medical coding is a process of classifying health records according to standard code sets representing procedures and diagnoses. It is an integral part of health care in the U.S., and the high costs it incurs have prompted adoption of natural language processing techniques for automatic generation of these codes from the clinical narrative contained in electronic health records. The need for effective auto-coding methods becomes even greater with the impending adoption of ICD-10, a code inventory of greater complexity than the currently used code sets. This paper presents a system that predicts ICD-10 procedure codes from the clinical narrative using several levels of abstraction. First, partial hierarchical classification is used to identify potentially relevant concepts and codes. Then, for each of these concepts we estimate the confidence that it appears in a procedure code for that document. Finally, confidence values for the candidate codes are estimated using features derived from concept confidence scores. The concept models can be trained on data with ICD-9 codes to supplement sparse ICD-10 training resources. Evaluation on held-out data shows promising results.

1 Introduction

In many countries, reimbursement rules for health care services stipulate that the patient encounter must be assigned codes representing diagnoses that were made for and procedures that were performed on the patient. These codes may be assigned by general health care personnel or by specially trained medical coders. The billing codes used in the U.S. include International Statistical Classification of Diseases and Related Health Problems (ICD) codes, whose version 9 is currently in use and whose version 10 was scheduled for adoption in October 2014¹, as well as Current Procedural Terminology (CPT) codes. The same codes are also used for research, internal bookkeeping, and other purposes.

Assigning codes to clinical documentation often requires extensive technical training and involves substantial labor costs. This, together with the increasing prominence of electronic health records (EHRs), has prompted development and adoption of NLP algorithms that support the coding workflow by automatically inferring appropriate codes from the clinical narrative and other information contained in the EHR (Chute et al., 1994; Heinze et al., 2001; Resnik et al., 2006; Pakhomov et al., 2006; Benson, 2006). The need for effective auto-coding methods becomes especially acute with the introduction of ICD-10 and the associated increase of training and labor costs for manual coding.

The novelty and complexity of ICD-10 present unprecedented challenges for developers of rule-based auto-coding software. Thus, while ICD-9 contains 3882 codes for procedures, the number of codes defined by the ICD-10 Procedure Coding System (PCS) is greater than 70,000. Furthermore, the organization of ICD-10-PCS is fundamentally different from ICD-9, which means that the investment of time and money that had gone into writing auto-coding rules for ICD-9 procedure codes cannot be easily leveraged in the transition to ICD-10.

In turn, statistical auto-coding methods are constrained by the scarcity of available training data with manually assigned ICD-10 codes. While this problem will be attenuated over the years as ICD-10-coded data are accumulated, the health care industry needs effective technology for ICD-10 computer-assisted coding in advance of the implementation deadline.

¹ The deadline was delayed by at least a year while this paper was in review.



Thus, for developers of statistical auto-coding algorithms, two desiderata come to the fore: these algorithms should take advantage of all available training data, including documents supplied only with ICD-9 codes, and they should possess high capacity for statistical generalization in order to maximize the benefits of training material with ICD-10 codes.

The auto-coding system described here seeks to meet both these requirements. Rather than predicting codes directly from the clinical narrative, a set of classifiers is first applied to identify coding-related concepts that appear in the EHR. We use General Equivalence Mappings (GEMs) between ICD-9 and ICD-10 codes (CMS, 2014) to train these models not only on data with human-assigned ICD-10 codes, but also on ICD-9-coded data. We then use the predicted concepts to derive features for a model that estimates the probability of ICD-10 codes. Besides the intermediate abstraction to concepts, the code confidence model itself is also designed so as to counteract sparsity of the training data. Rather than train a separate classifier for each code, we use a single model whose features can generalize beyond individual codes. Partial hierarchical classification is used for greater run-time efficiency. To our knowledge, this is the first research publication describing an auto-coding system for ICD-10-PCS. It is currently deployed, in tandem with other auto-coding modules, to support computer-assisted coding in the 3M™ 360 Encompass™ System.

The rest of the paper is organized as follows. Section 2 reviews the overall organization of ICD-10-PCS. Section 4.1 outlines the run-time processing flow of the system to show how its components fit together. Section 4.2 describes the concept confidence models, including the hierarchical classification components. Section 4.3 discusses how data with manually assigned ICD-9 codes is used to train some of the concept confidence models. Section 4.4 describes the code confidence model. Finally, Section 5 reports experimental results.

2 ICD-10 Procedure Coding System

ICD-10-PCS is a set of codes for medical procedures, developed by 3M Health Information Systems under contract to the Center for Medicare and Medicaid Services of the U.S. government. ICD-10-PCS has been designed systematically; each code consists of seven characters, and the character in each of these positions signifies one particular aspect of the code. The first character designates the "section" of ICD-10-PCS: 0 for Medical and Surgical, 1 for Obstetrics, 2 for Placement, and so on. Within each section, the seven components, or axes of classification, are intended to have a consistent meaning; for example, in the Medical and Surgical section, the second character designates the body system involved, the third the root operation, and so on (see Table 1 for a list). All procedures in this section are thus classified along these axes. For instance, in a code such as 0DBJ3ZZ, the D in the second position indicates that the body system involved is the gastrointestinal system, B in the third position always indicates that the root operation is an excision of a body part, the J in the fourth position indicates that the appendix is the body part involved, and the 3 in the fifth position indicates that the approach is percutaneous. The value Z in the last two axes means that neither a device nor a qualifier is specified.

Character   Meaning
1st         Section
2nd         Body System
3rd         Root Operation
4th         Body Part
5th         Approach
6th         Device
7th         Qualifier

Table 1: Character Specification of the Medical and Surgical Section of ICD-10-PCS
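To make the positional semantics concrete, here is a small illustrative sketch (not part of the original paper) that maps the seven characters of a Medical and Surgical section code to the axis names of Table 1, using the 0DBJ3ZZ example discussed above; the function and constant names are our own.

# Illustration only: reading the seven axes of a Medical and Surgical
# section ICD-10-PCS code positionally (axis names follow Table 1).

PCS_AXES = ["section", "body_system", "root_operation",
            "body_part", "approach", "device", "qualifier"]

def split_pcs_code(code):
    """Map each of the seven characters of a PCS code to its axis name."""
    assert len(code) == 7, "ICD-10-PCS codes always have seven characters"
    return dict(zip(PCS_AXES, code))

print(split_pcs_code("0DBJ3ZZ"))
# {'section': '0', 'body_system': 'D', 'root_operation': 'B',
#  'body_part': 'J', 'approach': '3', 'device': 'Z', 'qualifier': 'Z'}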

Several consequences of the compositional structure of ICD-10-PCS are especially relevant for statistical auto-coding methods.

On the one hand, it defines over 70,000 codes, many of which are logically possible but very rare in practice. Thus, attempts to predict the codes as unitary entities are bound to suffer from data sparsity problems even with a large training corpus. Furthermore, some of the axis values are formulated in ways that are different from how the corresponding concepts would normally be expressed in a clinical narrative. For example, ICD-10-PCS uses multiple axes (root operation, body part, and, in a sense, the first two axes as well) to encode what many traditional procedure terms (such as those ending in -tomy and -plasty) express by a single word, while the device axis uses generic categories where a clinical narrative would refer only to specific brand names.



This drastically limits how much can be accomplished by matching code descriptions, or indexes derived from them, against the text of EHRs.

On the other hand, the systematic conceptual structure of PCS codes and of the code set as a whole can be exploited to compensate for data sparsity and idiosyncrasies of axis definitions by introducing abstraction into the model.

3 Related work

There exists a large literature on automatic classification of clinical text (Stanfill et al., 2010). A sizeable portion of it is devoted to detecting categories corresponding to billing codes, but most of these studies are limited to one or a handful of categories. This is in part because the use of patient records is subject to strict regulation. Thus, the corpus used for most auto-coding research to date consists of about two thousand documents annotated with 45 ICD-9 codes (Pestian et al., 2007). It was used in a shared task at the 2007 BioNLP workshop and gave rise to papers studying a variety of rule-based and statistical methods, which are too numerous to list here.

We limit our attention to a smaller set of research publications describing identification of an entire set of billing codes, or a significant portion thereof, which better reflects the role of auto-coding in real-life applications. Mayo Clinic was among the earliest adopters of auto-coding (Chute et al., 1994), where it was deployed to assign codes from a customized and greatly expanded version of ICD-8, consisting of almost 30K diagnostic codes. A recently reported version of their system (Pakhomov et al., 2006) leverages a combination of example-based techniques and Naïve Bayes classification over a database of over 20M EHRs. The phrases representing the diagnoses have to be itemized as a list beforehand. In another pioneering study, Larkey & Croft (1995) investigated k-Nearest Neighbor, Naïve Bayes, and relevance feedback on a set of 12K discharge summaries, predicting ICD-9 codes. Heinze et al. (2000) and Ribeiro-Neto et al. (2001) describe systems centered on symbolic computation. Jiang et al. (2006) discuss confidence assessment for ICD-9 and CPT codes, performed separately from code generation.

Medori & Fairon (2010) combine information extraction with a Naïve Bayes classifier, working with a corpus of about 20K discharge summaries in French. In a recent paper, Perotte et al. (2014) study standard and hierarchical classification using support vector machines on a corpus of about 20K EHRs with ICD-9 codes.

We are not aware of any previous publications on auto-coding for ICD-10-PCS, and the results of these studies cannot be directly compared with those reported below due to the unique nature of this code set. Our original contributions also include explicit modeling of concepts and the capability to assign previously unobserved codes within a machine learning framework.

4 Methods

4.1 Run-time processing flow

We first describe the basic run-time processing flow of the system, shown in Figure 1.

Figure 1: Run-time processing flow

In a naïve approach, one could generate all codes from the ICD-10-PCS inventory for each EHR² and estimate their probability in turn, but this would be too computationally expensive. Instead, the hypothesis space is restricted by two-level hierarchical classification with beam search.

² We use the term EHR generically in this paper. The system can be applied at the level of individual clinical documents or entire patient encounters, whichever is appropriate for the given application.



First, a set of classifiers estimates the confidence of all PCS sections (one-character prefixes of the codes), one per section. The sections whose confidence exceeds a threshold are used to generate candidate body systems (two-character code prefixes), whose confidence is estimated by another set of classifiers. Then, body systems whose confidence exceeds a threshold are used to generate a set of candidate codes and the set of concepts expressed by these codes. The probability of observing each of the candidate concepts in the EHR is estimated by a separate classifier. Finally, these concept confidence scores are used to derive features for a model that estimates the probability of observing each of the candidate codes, and the highest-scoring codes are chosen according to a thresholding decision rule.
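The two-level candidate generation step can be summarized with the following sketch, written under our own assumptions about the classifier interfaces and code inventory structure (none of the names below come from the paper); it only illustrates the beam-style pruning before concept and code confidence models are applied.

# Sketch of the run-time candidate generation (assumed interfaces):
# section_clf and body_system_clf return dicts of prefix -> probability;
# code_inventory maps 2-character prefixes to lists of full PCS codes.

def candidate_codes(ehr_tokens, section_clf, body_system_clf, code_inventory,
                    section_threshold=0.1, body_system_threshold=0.1):
    # Layer 1: keep sections (1-character prefixes) above the beam threshold.
    sections = [s for s, p in section_clf(ehr_tokens).items()
                if p >= section_threshold]
    # Layer 2: score 2-character prefixes only within surviving sections.
    body_systems = [bs for bs, p in body_system_clf(ehr_tokens, sections).items()
                    if p >= body_system_threshold]
    # Expand surviving prefixes into candidate codes for downstream scoring.
    candidates = []
    for prefix in body_systems:
        candidates.extend(code_inventory.get(prefix, []))
    return candidates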

The choice of two hierarchical layers is partially determined by the amount of training data with ICD-10 codes available for this study, since many three-character code prefixes are too infrequent to train reliable classifiers. Given more training data, additional hierarchical classification layers could be used, which would trade a higher risk of recall errors against greater processing speed. The same trade-off can be negotiated by adjusting the beam search threshold.

4.2 Concept confidence models

Estimation of concept confidence – including the confidence of code prefixes in the two hierarchical classification layers – is performed by a set of classifiers, one per concept, which are trained on EHRs supplied with ICD-10 and ICD-9 procedure codes.

The basis for training the concept models is provided by a mapping between codes and concepts expressed by the codes. For example, the code 0GB24ZZ (Excision of Left Adrenal Gland, Percutaneous Endoscopic Approach) expresses, among other concepts, the concept adrenal gland and the more specific concept left adrenal gland. It also expresses the concept of adrenalectomy (surgical removal of one or both of the adrenal glands), which corresponds to the regular expression 0G[BT][234]..Z over ICD-10-PCS codes. We used the code-to-concept mapping described in Mills (2013), supplemented by some additional categories that do not correspond to traditional clinical concepts. For example, our set of concepts included entries for the categories of no device and no qualifier, which are widely used in ICD-10-PCS. We also added entries that specified the device axis or the qualifier axis together with the first three axes, where they were absent in the original concept map, reasoning that the language used to express the choice of the device or qualifier can be specific to particular procedures and body parts.

For data with ICD-10-PCS codes, the logic used to generate training instances is straightforward. Whenever a manually assigned code expresses a given concept, a positive training instance for the corresponding classifier is generated. Negative training instances are sub-sampled from the concepts generated by the hierarchical classification layers for that EHR. As can be seen from this logic, the precise question that the concept models seek to answer is as follows: given that this particular concept has been generated by the upstream hierarchical layers, how likely is it that it will be expressed by one of the ICD-10 procedure codes assigned to that EHR?
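A minimal sketch of this instance-generation logic follows, assuming the code-to-concept mapping is stored as regular expressions over PCS codes (as in the adrenalectomy example above); the function name, negative sub-sampling rate, and data structures are our own assumptions rather than the authors' implementation.

import random
import re

# Sketch: generating training instances for the per-concept classifiers from an
# ICD-10-coded EHR. concept_map holds regular expressions over PCS codes,
# e.g. "adrenalectomy" -> r"0G[BT][234]..Z".

def concept_training_instances(assigned_codes, candidate_concepts, concept_map,
                               neg_rate=0.1, rng=random.Random(0)):
    instances = []  # (concept, label) pairs for one EHR
    for concept in candidate_concepts:
        pattern = re.compile(concept_map[concept] + r"$")
        if any(pattern.match(code) for code in assigned_codes):
            instances.append((concept, 1))       # concept expressed by a gold code
        elif rng.random() < neg_rate:
            instances.append((concept, 0))       # sub-sampled negative instance
    return instances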

In estimating concept confidence we do not attempt to localize where in the clinical narrative the given concept is expressed. Our baseline feature set is simply a bag of tokens. We also experimented with other feature types, including frequency-based weighting schemes for token feature values and features based on string matches of Unified Medical Language System (UMLS) concept dictionaries. For the concepts of left and right we define an additional feature type, indicating whether the token left or right appears more frequently in the EHR. While still rudimentary, this feature type is more apt to infer laterality than a bag of tokens.
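A short sketch of this baseline feature extraction, under our own naming conventions (the feature-name strings are illustrative, not from the paper): a binary bag of tokens over the whole EHR plus a single indicator of whether "left" or "right" is the more frequent token.

from collections import Counter

# Sketch of the baseline concept-model features: bag of tokens plus a
# laterality indicator used for the left/right concepts.

def concept_features(ehr_tokens):
    counts = Counter(t.lower() for t in ehr_tokens)
    features = {"tok=" + tok: 1.0 for tok in counts}     # binary bag of tokens
    if counts["left"] != counts["right"]:
        dominant = "left" if counts["left"] > counts["right"] else "right"
        features["laterality=" + dominant] = 1.0          # which side dominates
    return features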

A number of statistical methods can be used to estimate concept confidence. We use the Mallet (McCallum, 2002) implementation of ℓ1-regularized logistic regression, which has shown good performance for NLP tasks in terms of accuracy as well as scalability at training and run-time (Gao et al., 2007).

4.3 Training on ICD-9 data

In training concept confidence models on data with ICD-9 codes we make use of the General Equivalence Mappings (GEMs), a publicly available resource establishing relationships between ICD-9 and ICD-10 codes (CMS, 2014).



Most correspondences between ICD-9 and ICD-10 procedure codes are one-to-many, although other mapping patterns are also found. Furthermore, a code in one set can correspond to a combination of codes from the other set. For example, the ICD-9 code for combined heart-lung transplantation maps to a set of pairs of ICD-10 codes, the first code in the pair representing one of three possible types of heart transplantation, and the other representing one of three possible types of bilateral lung transplantation.

A complete description of the rules underlying GEMs and our logic for processing them is beyond the scope of this paper, and we limit our discussion to the principles underlying our approach. We first distribute a unit probability mass over the ICD-10 codes or code combinations mapped to each ICD-9 code, using logic that reflects the structure of GEMs and distributing probability mass uniformly among comparable alternatives. From these probabilities we compute a cumulative probability mass for each concept appearing in the ICD-10 codes. For example, if an ICD-9 code maps to four ICD-10 codes over which we distribute a uniform probability distribution, and a given concept appears in two of them, we assign the probability of 0.5 to that concept. For a given EHR, we assign to each concept the highest probability it receives from any of the codes observed for the EHR. Finally, we use the resulting concept probabilities to weight positive training instances. Negative instances still have unit weights, since they correspond to concepts that can be unequivocally ruled out based on the GEMs.
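The following sketch illustrates this weighting scheme for the simple case of one-to-many mappings, under our own assumptions about how the GEM and code-to-concept mapping are stored (it omits the handling of code combinations and other GEM patterns mentioned above).

# Sketch of GEM-based concept weighting for an ICD-9-coded EHR:
# each ICD-9 code spreads a unit mass uniformly over its mapped ICD-10 codes;
# a concept's probability is the summed mass of mapped codes expressing it;
# the instance weight is the maximum over the EHR's observed ICD-9 codes.

def concept_weights(icd9_codes, gem, expresses):
    """gem: dict ICD-9 code -> list of mapped ICD-10 codes;
    expresses: dict ICD-10 code -> set of concepts it expresses."""
    weights = {}
    for icd9 in icd9_codes:
        mapped = gem.get(icd9, [])
        if not mapped:
            continue
        mass = 1.0 / len(mapped)                  # uniform over alternatives
        per_concept = {}
        for icd10 in mapped:
            for concept in expresses.get(icd10, ()):
                per_concept[concept] = per_concept.get(concept, 0.0) + mass
        for concept, p in per_concept.items():
            weights[concept] = max(weights.get(concept, 0.0), p)
    return weights  # used as weights for positive training instances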

4.4 Code confidence model

The code confidence model produces a confidence score for candidate codes generated by the hierarchical classification layers, using features derived from the output of the concept confidence models described above. The code confidence model is trained on data with ICD-10 codes. Whenever a candidate code matches a code assigned by human annotators, a positive training instance is generated. Otherwise, a negative instance is generated, with sub-sampling. We report experiments using logistic regression with ℓ1 and ℓ2 regularization (Gao et al., 2007).

The definition of features used in the model requires careful attention, because it is in the form of the feature space that the proposed model differs from a standard one-vs-all approach. To elucidate the contrast, we may start with a form of the feature space that would correspond to one-vs-all classification. This can be achieved by specifying the identity of a particular code in all feature names. Then, the objective function for logistic regression would decompose into independent learning subproblems, one for each code, producing a collection of one-vs-all classifiers. There are clear drawbacks to this approach. If all parameters are restricted to a specific code, the training data would be fragmented along the same lines. Thus, even if features derived from concepts may seem to enable generalization, in reality they would in each case be estimated only from training instances corresponding to a single code, causing unnecessary data sparsity.

This shortcoming can be overcome in logistic regression simply by introducing generalized features, without changing the rest of the model (Subotin, 2011). Thus, in deriving features from scores of concept confidence models we include only those concepts which are expressed by the given code, but we do not specify the identity of the code in the feature names. In this way the weights for these features are estimated at once from training instances for all codes in which these concepts appear. We combine these generalized features with the code-bound features described earlier. The latter should help us learn more specific predictors for particular procedures, when such predictors exist in the feature space.

While the scores of concept confidence models provide the basis for the feature space of the code confidence model, there are multiple ways in which features can be derived from these scores. The simplest way is to take the concept identity (optionally specified by code identity) as the feature name and the confidence score as the feature value. We supplement these features with features based on score quantization. That is, we threshold each concept confidence score at several points and define binary features indicating whether the score exceeds each of the thresholds. For both these feature types, we generate separate features for predictions of concept models trained on ICD-9 data and concept models trained on ICD-10 data, in order to allow the code confidence model to learn how useful predictions of concept confidence models are, depending on the type of their training data.
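A compact sketch of this feature derivation follows; the feature-name strings and threshold values are our own illustrative choices, and for brevity the sketch does not separate ICD-9-trained from ICD-10-trained concept model scores as the paper does.

# Sketch: features for the code confidence model derived from concept scores.
# For each concept expressed by the candidate code we emit a generalized
# real-valued feature, a code-bound copy, and quantized indicator features.

def code_features(candidate_code, code_concepts, concept_scores,
                  thresholds=(0.25, 0.5, 0.75)):
    """code_concepts: dict code -> set of concepts it expresses;
    concept_scores: dict concept -> confidence from the concept models."""
    features = {}
    for concept in code_concepts[candidate_code]:
        score = concept_scores.get(concept, 0.0)
        features["concept=%s" % concept] = score                     # generalized
        features["code=%s|concept=%s" % (candidate_code, concept)] = score
        for t in thresholds:                                          # quantization
            if score >= t:
                features["concept=%s>=%.2f" % (concept, t)] = 1.0
    return features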

Both the concept confidence models and the code confidence model can be trained on data with ICD-10 codes. We are thus faced with the question of how best to use this limited resource. The simplest approach would be to train both types of models on all available training data, but there is a concern that predictions of the concept models on their own training data would not reflect their out-of-sample performance, and this would mislead the code confidence model into relying on them too much. An alternative approach, often called stacked generalization (Wolpert, 1992), would be to generate training data for the code confidence model by running concept confidence models on out-of-sample data. We compare the performance of these approaches below.

5 Evaluation

5.1 Methodology

We evaluated the proposed model using a corpus of 28,536 EHRs (individual clinical records), compiled to represent a wide variety of clinical contexts and supplied with ICD-10-PCS codes by trained medical coders. The corpus was annotated under the auspices of 3M Health Information Systems for the express purpose of developing auto-coding technology for ICD-10. There was a total of 51,082 PCS codes and 5,650 unique PCS codes in the corpus, only 76 of which appeared in more than 100 EHRs, and 2,609 of which appeared just once. Multiple coders worked on some of the documents, but they were allowed to collaborate, producing what was effectively a single set of codes for each EHR. We held out about a thousand EHRs each for development testing and evaluation, using the rest for training. The same corpus, as well as 175,798 outpatient surgery EHRs with ICD-9 procedure codes submitted for billing by a health provider, were also used to train hierarchical and concept confidence models.

We evaluated auto-coding performance by a modified version of mean reciprocal rank (MRR). MRR is a common evaluation metric for systems with ranked outputs. For a set of Q correct outputs with ranks rank_i among all outputs, standard MRR is computed as:

\[ \mathrm{MRR} = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{\mathrm{rank}_i} \]

For example, an MRR value of 0.25 means that the correct answer has rank 4 on average. This metric is designed for tasks where only one of the outputs can be correct. When applied directly to tasks where more than one output can be correct, MRR unfairly penalizes cases with multiple correct outputs, increasing the rank of some correct outputs on account of other, higher-ranked outputs that are also correct. We modify MRR for our task by ignoring correct outputs in the rank computations. In other words, the rank of a correct output is computed as the number of higher-ranked incorrect outputs, plus one. This metric has the advantage of summarizing the accuracy of an auto-coder without reference to a particular choice of threshold, which may be determined by business rules or research considerations, as would be the case for precision and recall.
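The modified rank computation can be illustrated with the following sketch (our own function name and data structures, assuming the system's output for one EHR is a list of codes sorted by confidence).

# Sketch: modified reciprocal ranks for one EHR. The rank of a correct code is
# one plus the number of higher-ranked *incorrect* codes, so correct codes do
# not penalize each other.

def modified_reciprocal_ranks(ranked_codes, gold_codes):
    """ranked_codes: system output, highest-scoring first;
    gold_codes: set of human-assigned codes for the same EHR."""
    reciprocal_ranks = []
    incorrect_above = 0
    for code in ranked_codes:
        if code in gold_codes:
            reciprocal_ranks.append(1.0 / (incorrect_above + 1))
        else:
            incorrect_above += 1
    return reciprocal_ranks  # averaging over all correct outputs gives the MRR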

One advantage of regularized logistic regression is that the value of 1 is often a near-optimal setting for the regularization trade-off parameter. This can save considerable computation time that would be required for tuning this parameter for each experimental condition. We have previously observed that the value of 1 consistently produced near-optimal results for the ℓ1 regularizer in concept confidence models and for the ℓ2 regularizer in the code confidence models, and we have used this setting for all the experiments reported here. For the code confidence model with ℓ1-regularized logistic regression we saw a slight improvement with weaker regularization, and we report the best result we obtained for this model below.

5.2 Results

The results are shown in Table 2. The top MRR score of 0.572 corresponds to a micro-averaged F-score of 0.485 (0.490 precision, 0.480 recall) when the threshold is chosen to obtain approximately equal values for recall and precision³. The best result was obtained when:

• the concept models used bag-of-tokens features (with the additional laterality features described in Section 4.2);

• both concept models trained on ICD-9 data and those trained on ICD-10 data were used;

• the code confidence model was trained on data with predictions of concept models trained on all of the ICD-10 data (i.e., no data splitting for stacked generalization was used);

• the code confidence model used all of the feature types described in Section 4.4;

• the code confidence model used logistic regression with ℓ2 regularization.

³ To put these numbers into perspective, note that the average accuracy of trained medical coders for ICD-10 has been estimated to be 63% (HIMSS/WEDI, 2013).

We examine the impact of all these choices on system performance in turn.

Model                                  MRR
All data, all features, ℓ2 reg.        0.572
Concept model training:
  Trained on ICD-10 only               0.558
  Trained on ICD-9 only                0.341
Code model features:
  One-vs-all                           0.519
  No code-bound features               0.553
  No quantization features             0.560
Stacked generalization:
  half & half data split               0.501
  5-fold cross-validation              0.539
Code model algorithm:
  ℓ1 regularization                    0.528

Table 2: Evaluation results. Each row after the first corresponds to varying one aspect of the model shown in the first row. See Section 5.3 for details of the experimental conditions.

5.3 Discussion

Despite its apparently primitive nature, the bag-of-tokens feature space for the concept confidence models has turned out to provide a remarkably strong baseline. Our experiments with frequency-based weighting schemes for the feature values and with features derived from text matches from the UMLS concept dictionaries did not yield substantial improvements in the results. Thus, the use of UMLS-based features, obtained using Apache ConceptMapper, yielded a relative improvement of 0.6% (i.e., 0.003 in absolute terms), but at the cost of nearly doubling run-time processing time. Nonetheless, we remain optimistic that more sophisticated features can benefit the performance of the concept models while maintaining their scalability.

As can be seen from the table, both concept models trained on ICD-9 data and those trained on ICD-10 data contributed to the overall effectiveness of the system. However, the contribution of the latter is markedly stronger. This suggests that further research is needed in finding the best ways of exploiting ICD-9-coded data for ICD-10 auto-coding. Given that data with ICD-9 codes is likely to be more readily available than ICD-10 training data in the foreseeable future, this line of investigation holds potential for significant gains in auto-coding performance.

For the choice of features used in the code confidence model, the most prominent contribution is made by the features that generalize beyond specific codes, as discussed in Section 4.4. Adding these features yields a 10% relative improvement over the set of features equivalent to a one-vs-all model. In fact, using the generalized features alone (see the row marked "no code-bound features" in Table 2) gives a score only 0.02 lower than the best result. As would be expected, generalized features are particularly important for codes with limited training data. Thus, if we restrict our attention to codes with fewer than 25 training instances (which account for 95% of the unique codes in our ICD-10 training data), we find that generalized features yielded a 25% relative improvement over the one-vs-all model (0.247 to 0.309). In contrast, for codes with over 100 training instances (which account for 1% of the unique codes, but 36% of the total code volume in our corpus) the relative improvement from generalized features is less than 4% (0.843 to 0.876). These numbers afford two further observations. First, the model can be improved dramatically by adding a few dozen EHRs per code to the training corpus. Secondly, there is still much room for research in mitigating the effects of data sparsity and improving prediction accuracy for less common codes. Elsewhere in Table 2 we see that quantization-based features contribute a modest predictive value.

Perhaps the most surprising result of the series came from investigating the options for using the available ICD-10 training data, which act as training material both for the concept confidence models and the code confidence model. The danger of training both types of models on the same corpus is intuitively apparent. If the training instances for the code model are generated by concept models whose training data included the same EHRs, the accuracy of these concept predictions may not reflect out-of-sample performance of the concept models, causing the code model to rely on them excessively.

The simplest implementation of Wolpert's stacked generalization proposal, which is intended to guard against this risk, is to use one part of the corpus to train one predictive layer and to use its predictions on another part of the corpus to train the other layer. The result in Table 2 (see the row marked "half & half data split") shows that the resulting increase in sparsity of the training data for both models leads to a major degradation of the system's performance, even though at run-time concept models trained on all available data are used. We also investigated a cross-validation version of stacked generalization designed to mitigate this fragmentation of training data. We trained a separate set of concept models on the training portion of each cross-validation fold and ran them on the held-out portion. The training set for the code confidence model was then obtained by combining these held-out portions. At run-time, concept models trained on all of the available data were used. However, as intuitively compelling as the arguments motivating this procedure may be, the results were not competitive with the baseline approach of using all available training data for all the models.
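The cross-validation variant of stacked generalization described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the helpers train_concept_models and concept_predict are hypothetical stand-ins for the concept confidence models, and scikit-learn's KFold is used only for convenience.

    from sklearn.model_selection import KFold

    def build_code_model_training_set(ehrs, train_concept_models, concept_predict, n_folds=5):
        """Stacked generalization with cross-validation (after Wolpert, 1992): the concept
        predictions used as code-model features always come from concept models that did
        not see the corresponding EHR during training."""
        code_training_rows = []
        kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
        for train_idx, held_out_idx in kf.split(ehrs):
            # Train concept models only on this fold's training portion ...
            fold_models = train_concept_models([ehrs[i] for i in train_idx])
            # ... and generate code-model training instances from the held-out portion.
            for i in held_out_idx:
                concept_scores = concept_predict(fold_models, ehrs[i])
                code_training_rows.append((ehrs[i], concept_scores))
        return code_training_rows

    # At run time, concept models trained on all available data are used instead.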

Finally, we found that an ℓ2 regularizer performed clearly better than an ℓ1 regularizer for the code confidence model, even though we set the ℓ2 trade-off constant to 1 and tuned the ℓ1 trade-off constant on the development test set. This is in contrast to the concept confidence models, where we observed slightly better results with ℓ1 regularization than with ℓ2 regularization.
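For readers who want to reproduce this kind of comparison, the ℓ1 vs. ℓ2 contrast can be sketched with any maximum-entropy-style classifier. The snippet below uses scikit-learn's LogisticRegression purely as a stand-in (the paper does not specify this toolkit), with the ℓ2 strength fixed and the ℓ1 strength tuned on development data; all constants are illustrative assumptions.

    from sklearn.linear_model import LogisticRegression

    # L2 variant: trade-off constant fixed. Note that in scikit-learn, C is the INVERSE of
    # the regularization strength, so "constant = 1" corresponds only loosely to C=1.0.
    l2_model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)

    def tune_l1(X_train, y_train, X_dev, y_dev, candidate_C=(0.01, 0.1, 1.0, 10.0)):
        """L1 variant: tune the trade-off constant on a development set (grid is illustrative)."""
        best_model, best_score = None, -1.0
        for C in candidate_C:
            model = LogisticRegression(penalty="l1", solver="liblinear", C=C, max_iter=1000)
            model.fit(X_train, y_train)
            score = model.score(X_dev, y_dev)
            if score > best_score:
                best_model, best_score = model, score
        return best_model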

6 Conclusion

We have described a system for predicting ICD-10-PCS codes from the clinical narrative contained in EHRs. The proposed approach seeks to mitigate the sparsity of training data with manually assigned ICD-10-PCS codes in three ways: through an intermediate abstraction to clinical concepts, through the use of data with ICD-9 codes to train concept confidence models, and through the use of a code confidence model whose parameters can generalize beyond individual codes. Our experiments show promising results and point out directions for further research.

Acknowledgments

We would like to thank Ron Mills for providing the crosswalk between ICD-10-PCS codes and clinical concepts; Guoli Wang, Michael Nossal, Kavita Ganesan, Joel Bradley, Edward Johnson, Lyle Schofield, Michael Connor, Jean Stoner and Roxana Safari for helpful discussions relating to this work; and the anonymous reviewers for their constructive criticism.

References

Sean Benson. 2006. Computer-assisted Coding Software Improves Documentation, Coding, Compliance, and Revenue. Perspectives in Health Information Management, CAC Proceedings, Fall 2006.
Centers for Medicare & Medicaid Services. 2014. General Equivalence Mappings. Documentation for Technical Users. Electronically published at cms.gov.
Chute CG, Yang Y, Buntrock J. 1994. An evaluation of computer assisted clinical classification algorithms. Proc Annu Symp Comput Appl Med Care, 1994:162–6.
Jianfeng Gao, Galen Andrew, Mark Johnson, and Kristina Toutanova. 2007. A Comparative Study of Parameter Estimation Methods for Statistical Natural Language Processing. ACL 2007.
Daniel T. Heinze, Mark L. Morsch, Ronald E. Sheffer, Jr., Michelle A. Jimmink, Mark A. Jennings, William C. Morris, and Amy E. W. Morsch. 2000. LifeCode™ – A Natural Language Processing System for Medical Coding and Data Mining. AAAI Proceedings.
Daniel T. Heinze, Mark Morsch, Ronald Sheffer, Michelle Jimmink, Mark Jennings, William Morris, and Amy Morsch. 2001. LifeCode: A Deployed Application for Automated Medical Coding. AI Magazine, Vol 22, No 2.
HIMSS/WEDI. 2013. ICD-10 National Pilot Program Outcomes Report. Electronically published at himss.org.
Yuankai Jiang, Michael Nossal, and Philip Resnik. 2006. How Does the System Know It's Right? Automated Confidence Assessment for Compliant Coding. Perspectives in Health Information Management, Computer Assisted Coding Conference Proceedings, Fall 2006.
Leah Larkey and W. Bruce Croft. 1995. Automatic Assignment of ICD9 Codes To Discharge Summaries. Technical report, Center for Intelligent Information Retrieval at University of Massachusetts.


Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu
Julia Medori and Cedrick Fairon. 2010. Machine Learning and Features Selection for Semi-automatic ICD-9-CM Encoding. Proceedings of the NAACL HLT 2010 Second Louhi Workshop on Text and Data Mining of Health Documents, 2010:84–89.
Ronald E. Mills. 2013. Methods using multi-dimensional representations of medical codes. US Patent Application US20130006653.
S.V. Pakhomov, J.D. Buntrock, and C.G. Chute. 2006. Automating the assignment of diagnosis codes to patient encounters using example-based and machine learning techniques. J Am Med Inform Assoc, 13(5):516–25.
Adler Perotte, Rimma Pivovarov, Karthik Natarajan, Nicole Weiskopf, Frank Wood, and Noemie Elhadad. 2014. Diagnosis code assignment: models and evaluation metrics. J Am Med Inform Assoc, 21(2):231–7.
Pestian JP, Brew C, Matykiewicz P, Hovermale DJ, Johnson N, Bretonnel Cohen K, and Duch W. 2007. A shared task involving multi-label classification of clinical free text. Proceedings ACL: BioNLP, 2007:97–104.
Philip Resnik, Michael Niv, Michael Nossal, Gregory Schnitzer, Jean Stoner, Andrew Kapit, and Richard Toren. 2006. Using intrinsic and extrinsic metrics to evaluate accuracy and facilitation in computer-assisted coding. Perspectives in Health Information Management, Computer Assisted Coding Conference Proceedings, Fall 2006.
Berthier Ribeiro-Neto, Alberto H.F. Laender, and Luciano R.S. de Lima. 2001. An experimental study in automatically categorizing medical documents. Journal of the American Society for Information Science and Technology, 52(5):391–401.
Mary H. Stanfill, Margaret Williams, Susan H. Fenton, Robert A. Jenders, and William R. Hersh. 2010. A systematic literature review of automated clinical coding and classification systems. J Am Med Inform Assoc, 17(6):646–651.
Michael Subotin. 2011. An exponential translation model for target language morphology. ACL 2011.
David H. Wolpert. 1992. Stacked Generalization. Neural Networks, 5:241–259.


Proceedings of the 2014 Workshop on Biomedical Natural Language Processing (BioNLP 2014), pages 68–76, Baltimore, Maryland USA, June 26-27 2014. ©2014 Association for Computational Linguistics

Structuring Operative Notes using Active Learning

Kirk Roberts∗
National Library of Medicine
National Institutes of Health
Bethesda, MD
[email protected]

Sanda M. Harabagiu
Human Language Technology Research Institute
University of Texas at Dallas
Richardson, TX 75080
[email protected]

Michael A. Skinner
University of Texas Southwestern Medical Center
Children's Medical Center of Dallas
Dallas, TX 75235
[email protected]

Abstract

We present an active learning method for placing the event mentions in an operative note into a pre-specified event structure. Event mentions are first classified into action, peripheral action, observation, and report events. The actions are further classified into their appropriate location within the event structure. We examine how utilizing active learning significantly reduces the time needed to completely annotate a corpus of 2,820 appendectomy notes.

1 Introduction

Operative reports are written or dictated after every surgical procedure. They describe the course of the operation as well as any abnormal findings in the surgical process. Template-based and structured methods exist for recording the operative note (DeOrio, 2002), and in many cases have been shown to increase the completeness of surgical information (Park et al., 2010; Gur et al., 2011; Donahoe et al., 2012). The use of natural language, however, is still preferred for its expressive power. This unstructured information is typically the only vehicle for conveying important details of the procedure, including the surgical instruments, incision techniques, and laparoscopic methods employed.

The ability to represent and extract the information found within operative notes would enable powerful post-hoc reasoning methods about surgical procedures. First, the completeness problem may be alleviated by indicating gaps in the surgical narrative. Second, deep semantic similarity methods could be used to discover comparable operations across surgeons and institutions. Third, given information on the typical course and findings of a procedure, abnormal aspects of an operation could be identified and investigated. Finally, other secondary use applications would be enabled to study the most effective instruments and techniques across large amounts of surgical data.

∗ Most of this work was performed while KR was at the University of Texas at Dallas.

In this paper, we present an initial method for aligning the event mentions within an operative note to the overall event structure for a procedure. A surgeon with experience in a particular procedure first describes the overall event structure. A supervised method enhanced by active learning is then employed to rapidly build an information extraction model to classify event mentions into the event structure. This active learning paradigm allows for rapid prototyping while also taking advantage of the sub-language characteristics of operative notes and the common structure of operative notes reporting the same type of procedure. A further goal of this method is to aid in the evaluation of unsupervised techniques that can automatically discover the event structure solely from the narratives. This would enable all the objectives outlined above for leveraging the unstructured information within operative notes.

This paper presents a first attempt at this active learning paradigm for structuring appendectomy reports. We intentionally chose a well-understood and relatively simple procedure to ensure a straight-forward, largely linear event structure where a large amount of data would be easily available. Section 3 describes a generic framework for surgical event structures and the particular structure chosen for appendectomies. Section 4 details the data used in this study. Section 5 describes the active learning experiment for filling in this event structure for operative notes. Section 6 reports the results of this experiment. Section 7 analyzes the method and proposes avenues for further research. First, however, we outline the small amount of previous work in natural language processing on operative notes.

2 Previous Work

An early tool for processing operative notes was proposed by Lamiell et al. (1993). They develop an auditing tool to help enforce completeness in operative notes. A syntactic parser converts sentences in an operative note into a graph structure that can be queried to ensure the necessary surgical elements are present in the narrative. For appendectomies, they could determine whether answers were specified for questions such as "What was the appendix abnormality?" and "Was cautery or drains used?". Unlike what we propose, they did not attempt to understand the narrative structure of the operative note, only to ensure that a small number of important elements were present. Unfortunately, they only tested their rule-based system on four notes, so it is difficult to evaluate the robustness and generalizability of their method.

More recently, Wang et al. (2014) proposed a machine learning (ML) method to extract patient-specific values from operative notes written in Chinese. They specifically extract tumor-related information from patients with hepatic carcinoma, such as the size/location of the tumor, and whether the tumor boundary is clear. In many ways this is similar in purpose to Lamiell et al. (1993) in the sense that there are operation-specific attributes to extract. However, while the auditing function primarily requires knowing whether particular items were stated, their method extracts the particular values for these items. Furthermore, they employ an ML-based conditional random field (CRF) trained and tested on 114 operative notes. The primary difference between the purpose of these two methods and the purpose of our method lies in the attempt to model all the events that characterize a surgery. Both the work of Lamiell et al. (1993) and Wang et al. (2014) can be used for completeness testing, and Wang et al. (2014) can be used to find similar patients. The lack of understanding of the event structure, however, prevents these methods from identifying similar surgical methods or unexpected surgical techniques, or from accomplishing many other secondary use objectives.

In a vein more similar to our own approach, Wang et al. (2012) study actions (a subset of event mentions) within an operative note. They note that various lexico-syntactic constructions can be used to specify an action (e.g., incised, the incision was carried, made an incision). Like our approach, they observed that sentences can be categorized into actions, perceptions/reports, and other (though we make this distinction at the event mention level). They adapted the Stanford Parser (Klein and Manning, 2003) with the Specialist Lexicon (Browne et al., 1993), similar to Huang et al. (2005). They do not, however, propose any automatic system for recognizing and categorizing actions. Instead, they concentrate on evaluating existing resources. They find that many resources, such as UMLS (Lindberg et al., 1993) and FrameNet (Baker et al., 1998), have poor coverage of surgical actions, while Specialist and WordNet (Fellbaum, 1998) have good coverage.

A notable limitation of their work is that they only studied actions at the sentence level, looking at the main verb of the independent clause. We have found in our study that multiple actions can occur within a sentence, and we thus study actions at the event mention level. Wang et al. (2012) noted this shortcoming and provide the following illustrative examples:

• The patient was taken to the operating room where general anesthesia was administered.

• After the successful induction of spinal anesthesia, she was placed supine on the operating table.

The second event mention in the first sentence (administered) and the first event mention in the second sentence (induction) are ignored in Wang et al. (2012)'s study. Despite the fact that they are stated in dependent clauses, these mentions may be more semantically important to the narrative than the mentions in the independent clauses. This is because a grammatical relation does not necessarily imply event prominence. In a further study, Wang et al. (2013) work toward the creation of an automatic extraction system by annotating PropBank (Palmer et al., 2005) style predicate-argument structures on thirty common surgical actions.

3 Event Structures in Operative Notes

Since operations are considered to be one of the riskier forms of clinical treatment, surgeons follow strict procedures that are highly structured and require significant training and oversight. Thus, a surgeon's description of a particular operation should be highly similar to a different description of the same type of operation, even if written by a different surgeon at a different hospital. For instance, the two examples below were written by two different surgeons to describe the event of controlling the blood supply to the appendix:

• The 35 mm vascular Endo stapler device was fired across the mesoappendix...

• The meso appendix was divided with electrocautery...

In these two examples, the surgeons use different lexical forms (fired vs. divided), syntactic forms (mesoappendix to the right or left of the EVENT), different semantic predicate-argument structures (INSTRUMENT-EVENT-ANATOMICALOBJECT vs. ANATOMICALOBJECT-EVENT-METHOD), and even different surgical techniques (stapling or cautery). Still, these examples describe the same step in the operation and thus can be mapped to the same location in the event structure.

In order to recognize the event structure in operative notes, we start by specifying an event structure for a particular operation (e.g., mastectomy, appendectomy, heart transplant) and create a ground-truth structure based on expert knowledge. Our goal is then to normalize the event mentions within an operative note to the specific surgical actions in the event structure. While the lexical, syntactic, and predicate-argument structures vary greatly across the surgeons in our data, many event descriptions are highly consistent within notes written by the same surgeon. This is especially true of events with little linguistic variability, typically largely procedural but necessary events that are not the focus of the surgeon's description of the operation. An example of low variability is the event of placing the patient on the operating table, as opposed to the event of manipulating the appendix to prepare it for removal. Additionally, while there is considerable lexical variation in how an event is mentioned, the terminology for event mentions is fairly limited, resulting in reasonable similarity between surgeons (e.g., the verbal description used for the dividing of the mesoappendix is typically one of the following mentions: fire, staple, divide, separate, remove).

3.1 Event Structure Representation

Operative notes contain event mentions of many different event classes. Some classes correspond to actions performed by the surgeon, while others describe findings, provide reasonings, or discuss interactions with patients or assistants. These distinctions are necessary for recognizing the event structure of an operation, in which we are primarily concerned with surgical actions. We consider the following event types:

• ACTION: the primary types of events in an operation. These typically involve physical actions taken by the surgeon (e.g., creating/closing an incision, dividing tissue) or procedural events (e.g., anesthesia, transfer to recovery). With limited exceptions, ACTIONs occur in a strict order and the ith ACTION can be interpreted as enabling the (i+1)th ACTION.

• P ACTION: peripheral actions that are optional, do not occur at a specific place in the chain of ACTIONs, and are not considered integral to the event structure. Examples include stopping unexpected bleeding and removing benign cysts unconnected with the operation.

• OBSERVATION: an event that denotes the act of observing a given state. OBSERVATIONs may lead to ACTIONs (e.g., the appendix is perforated and therefore needs to be removed) or P ACTIONs (e.g., a cyst is found). They may also be elaborations that provide more details about the surgical method being used.

• REPORT: an event that denotes a verbal interaction between the surgeon and a patient, guardian, or assistant (such as obtaining consent for an operation).

The primary class of events that we are interested in here are ACTIONs. Abstractly, one can view a type of operation as a directed graph with specified start and end states. The nodes denote the events, while the edges denote enablements. An instance of an operation can then be represented as some path between the start and end nodes.

Figure 1: Graphical representation of a surgical procedure with ACTIONs A, B, C, D, E, and F, OBSERVATION O, and P ACTION G. (a) strict surgical graph (only actions), (b) surgical graph with an observation invoking an action, (c) surgical graph with an observation invoking a peripheral action.

In its simplest form, a surgical graph is composed entirely of ACTION nodes (see Figure 1(a)). It is possible to add expected OBSERVATIONs that might trigger a different ACTION path (Figure 1(b)). Finally, P ACTIONs can be represented as optional nodes in the surgical graph, which may or may not be triggered by OBSERVATIONs (Figure 1(c)). This graphical model is simply a conceptual aid to help design the action types. The model currently plays no role in the automatic classification. For the remainder of this section we focus on a relatively limited surgical procedure that can be interpreted as a linear chain of ACTIONs.

3.2 Appendectomy Representation

Acute appendicitis is a common condition requiring surgical management, and is typically treated by removing the appendix, either laparoscopically or by using an open technique. Appendectomies are the most commonly performed urgent surgical procedure in the United States. The procedure is relatively straight-forward, and the steps of the procedure exhibit little variation between different surgeons. The third author (MS), a surgeon with more than 20 years of experience in pediatric surgery, provided the following primary ACTIONs:

• APP01: transfer patient to operating room
• APP02: place patient on table
• APP03: anesthesia
• APP04: prep
• APP05: drape
• APP06: umbilical incision
• APP07: insert camera/telescope
• APP08: insert other working ports
• APP09: identify appendix
• APP10: dissect appendix away from other structures
• APP11: divide blood supply
• APP12: divide appendix from cecum
• APP13: place appendix in a bag
• APP14: remove bag from body
• APP15: close incisions
• APP16: wake up patient
• APP17: transfer patient to post-anesthesia care unit

In the laparoscopic setting, each of these actions is a necessary part of the operation, and most should be recorded in the operative note. Additionally, any number of P ACTION, OBSERVATION, and REPORT events may be interspersed.

4 Data

In accordance with generally accepted medical practice and to comply with requirements of The Joint Commission, a detailed report of any surgical procedure is placed in the medical record within 24 hours of the procedure. These notes include the preoperative diagnosis, the post-operative diagnosis, the procedure name, names of surgeon(s) and assistants, anesthetic method, operative findings, complications (if any), estimated blood loss, and a detailed report of the conduct of the procedure. To ensure accuracy and completeness, such notes are typically dictated and transcribed shortly after the procedure by the operating surgeon or one of the assistants.

To obtain the procedure notes for this study, The Children's Medical Center (CMC) of Dallas electronic medical record (EMR) was queried for operative notes whose procedure contained the word "appendectomy" (CPT codes 44970, 44950, 44960) for a preoperative diagnosis of "acute appendicitis" (ICD9 codes 541, 540.0, 540.1). At the time of record acquisition, the CMC EMR had been in operation for about 3 years, and 2,820 notes were obtained, having been completed by 12 pediatric surgeons. In this set, there were 2,757 laparoscopic appendectomies and 63 open procedures. The records were then processed automatically to remove any identifying information such as names, hospital record numbers, and dates. For the purposes of this investigation, only the surgeon's name and the detailed procedure note were collected for further study. Owing to the complete anonymity of the records, the study received an exemption from the University of Texas Southwestern Medical Center and CMC Institutional Review Boards. Table 1 contains statistics about the distribution of notes by surgeon in our dataset.

Surgeon      Notes   Events   Words
surgeon1         8      291     2,305
surgeon2       311   16,379   134,748
surgeon3       143    6,897    57,797
surgeon4       400    8,940    62,644
surgeon5       391   15,246   114,684
surgeon6       307    9,880    77,982
surgeon7       397   10,908    74,458
surgeon8        34    2,401    20,391
surgeon9         2      100       973
surgeon10      355    9,987    89,085
surgeon11      380   14,211   135,215
surgeon12       92    2,417    19,364
Total        2,820   97,657   789,646

Table 1: Overview of corpus by surgeon.

5 Active Learning Framework

Active learning is becoming a more and more popular framework for natural language annotation in the biomedical domain (Hahn et al., 2012; Figueroa et al., 2012; Chen et al., 2013a; Chen et al., 2013b). In an active learning setting, instead of performing manual annotation separately from automatic system development, an existing ML classifier is employed to help choose which examples to annotate. Thus, human annotators can focus on examples that would prove difficult for a classifier, which can dramatically reduce overall annotation time. However, active learning is not without pitfalls, notably sampling bias (Dasgupta and Hsu, 2008), re-usability (Tomanek et al., 2007), and class imbalance (Tomanek and Hahn, 2009). In our work, the purpose of utilizing an active learning framework is to produce a fully-annotated corpus of labeled event mentions in as small a period of time as possible. To some extent, the goal of full annotation alleviates some of the active learning issues discussed above (re-usability and class imbalance), but sampling bias could still lead to significantly longer annotation time.

Our goal is to (1) classify event mentions into one of the four classes introduced in Section 3.1 (event type annotation), and (2) further classify actions into their appropriate location in the event structure (on this data, appendectomy type annotation). While most active learning methods are used with the intention of only manually labeling a subset of the data, our goal is to annotate every event mention so that we may ultimately evaluate unsupervised techniques on this data. Our active learning experiment thus proceeds in two parallel tracks: (i) a traditional active learning process where the highest-utility unlabeled event mentions are classified by a human annotator, and (ii) a batch annotation process where extremely similar, "easy" examples are annotated in large groups. Due to small intra-surgeon language variation, and relatively small inter-surgeon variation due to the limited terminology, this second process allows us to annotate large numbers of unlabeled examples at a time. The batch labeling largely annotates unlabeled examples that would not be selected by the primary active learning module because they are too similar to the already-labeled examples. After a sufficient amount of time is spent in traditional active learning, the batch labeling is used to annotate until the batches produced are insufficiently similar and/or wrong classifications are made. After a sufficient number of annotations are made with the active learning method, the choice of when to use the active learning or batch annotation method is left to the discretion of the annotator. This back-and-forth is then repeated iteratively until all the examples are annotated.

For both the active learning and batch labeling processes, we use a multi-class support vector machine (SVM) with a simple set of features:

F1. Event mention's lexical form (e.g., identified)
F2. Event mention's lemma (identify)
F3. Previous words (3-the, 2-appendix, 1-was)
F4. Next words (1-and, 2-found, 3-to, 4-be, 5-ruptured)
F5. Whether the event is a gerund (false)

Features F3 and F4 were constrained to only return words within the sentence.
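To make the feature set concrete, the following is a minimal sketch of how F1–F5 might be extracted for a single event mention. The tokenizer, lemmatizer, gerund test via the VBG part-of-speech tag, and the exact window size are our own assumptions, not details taken from the paper.

    def event_features(tokens, lemmas, pos_tags, i, window=5):
        """Extract F1-F5 for the event mention at token index i within one sentence."""
        feats = {
            "F1_form": tokens[i],                # F1: lexical form, e.g. "identified"
            "F2_lemma": lemmas[i],               # F2: lemma, e.g. "identify"
            "F5_gerund": pos_tags[i] == "VBG",   # F5: whether the mention is a gerund
        }
        # F3: previous words, indexed by distance from the mention (within the sentence only)
        for d in range(1, window + 1):
            if i - d >= 0:
                feats["F3_prev_%d" % d] = tokens[i - d]
        # F4: next words, indexed by distance from the mention (within the sentence only)
        for d in range(1, window + 1):
            if i + d < len(tokens):
                feats["F4_next_%d" % d] = tokens[i + d]
        return feats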

To sample event mentions for the active learner, we combine several sampling techniques to ensure a diversity of samples to label. This meta-sampler chooses from 4 different samplers with differing probability p:

1. UNIFORM: Choose (uniformly) an unlabeled instance (p = 0.1). Formally, let L be the set of manually labeled instances. Then, the probability of selecting an event ei is:

   PU(ei) ∝ δ(ei ∉ L)

   where δ(x) is the delta function that returns 1 if the condition x is true, and 0 otherwise. Thus, an unlabeled event has an equal probability of being selected as every other unlabeled event.

2. JACCARD: Choose an unlabeled instance biased toward those whose word context is least similar to the labeled instances using Jaccard similarity (p = 0.2). This sampler promotes diversity to help prevent sampling bias. Let Wi be the words in ei's sentence. Then the probability of selecting an event with the JACCARD sampler is:

   PJ(ei) ∝ δ(ei ∉ L) · min over ej ∈ L of (1 − |Wi ∩ Wj| / |Wi ∪ Wj|)^α

   Here, α is a parameter to give more weight to dissimilar sentences (we set α = 2).

3. CLASSIFIER: Choose an unlabeled instance biased toward those the SVM assigned low confidence values (p = 0.65). Formally, let fc(ei) be the confidence assigned by the classifier to event ei. Then, the probability of selecting an event with the CLASSIFIER sampler is:

   PC(ei) ∝ δ(ei ∉ L) (1 − fc(ei))

   The SVM we use provides confidence values largely in the range (-1, 1), but for some very confident examples this value can be larger. We therefore constrain the raw confidence value fr(ei) and place it within the range [0, 1] to obtain the modified confidence fc(ei) above:

   fc(ei) = (max(min(fr(ei), 1), −1) + 1) / 2

   In this way, fc(ei) is guaranteed to be within [0, 1] and can thus be interpreted as a probability.

4. MISCLASSIFIED: Choose (uniformly) a labeled instance that the SVM misclassifies during cross-validation (p = 0.05). Let f(ei) be the classifier's guess and L(ei) be the manual label for event ei. Then the probability of selecting an event is:

   PM(ei) ∝ δ(ei ∈ L) δ(f(ei) ≠ L(ei))

Event Type     Precision  Recall  F1
ACTION              0.79    0.90  0.84
NOT EVENT           0.75    0.82  0.79
OBSERVATION         0.71    0.57  0.63
P ACTION            0.66    0.40  0.50
REPORT              1.00    0.58  0.73

Active Learning Accuracy: 76.4%
Batch Annotation Accuracy: 99.5%

Table 2: Classification results for event types. Except when specified, results are for data annotated using the active learning method, while the batch annotation results include all data.

The first annotation was made using the UNIFORM sampler. For every new annotation, the meta-sampler chooses one of the above sampling methods according to the above p values, and that sampler selects an example to annotate. Each selected sample is first assigned an event type. If it is assigned as an ACTION, the annotator further assigns its appropriate action type. The CLASSIFIER and MISCLASSIFIED samplers alternate between the event type and action type classifiers. These four samplers were chosen to balance the traditional active learning approach (CLASSIFIER), while trying to prevent classifier bias (UNIFORM and JACCARD) and allowing mis-labeled data to be corrected (MISCLASSIFIED). An evaluation of the utility of the individual samplers is beyond the scope of this work.
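The meta-sampling procedure can be summarized in code. The sketch below is our own illustration under stated assumptions, not the authors' implementation: jaccard_dissim and raw_confidence are hypothetical stand-ins for the quantities defined above, and the sampler weights are the p values from the list.

    import random

    # Sampler mixture probabilities p from the list above.
    SAMPLER_WEIGHTS = {"UNIFORM": 0.10, "JACCARD": 0.20, "CLASSIFIER": 0.65, "MISCLASSIFIED": 0.05}

    def clip_confidence(raw_score):
        # fc(ei) = (max(min(fr(ei), 1), -1) + 1) / 2, mapping raw SVM scores into [0, 1].
        return (max(min(raw_score, 1.0), -1.0) + 1.0) / 2.0

    def choose_next_example(unlabeled, misclassified, jaccard_dissim, raw_confidence):
        """Pick the next event mention to annotate.
        unlabeled / misclassified are lists of candidate events; jaccard_dissim(e) should
        return min over labeled ej of (1 - Jaccard(Wi, Wj)), and raw_confidence(e) the raw
        SVM confidence -- both are placeholders for components described in the text."""
        sampler = random.choices(list(SAMPLER_WEIGHTS), weights=list(SAMPLER_WEIGHTS.values()))[0]
        if sampler == "UNIFORM":
            return random.choice(unlabeled)
        if sampler == "JACCARD":
            weights = [jaccard_dissim(e) ** 2 for e in unlabeled]  # alpha = 2
            return random.choices(unlabeled, weights=weights)[0]
        if sampler == "CLASSIFIER":
            weights = [1.0 - clip_confidence(raw_confidence(e)) for e in unlabeled]
            return random.choices(unlabeled, weights=weights)[0]
        return random.choice(misclassified)  # MISCLASSIFIED: labeled items the SVM gets wrong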

6 Results

For event type annotation, two annotators single-annotated 1,014 events with one of five event types (ACTION, P ACTION, OBSERVATION, REPORT, and NOT EVENT). The classifier's accuracy on this data was 75.9% (see Table 2 for a breakdown by event type). However, the examples were chosen because they were very different from the current labeled set, and thus we would expect them to be more difficult than a random sampling. When one includes the examples annotated using batch labeling, the overall accuracy is 99.5%.

For action type annotation, the same two annotators labeled 626 ACTIONs with one of the 17 action types (APP01–APP17). The classifier's accuracy on this data was again a relatively low 72.2% (see Table 3 for a breakdown by action type). However, again, these examples were expected to be difficult for the classifier. When one includes the examples annotated using batch labeling, the overall accuracy is 99.4%.


Action Type  Precision  Recall  F1
APP01             0.91    0.77  0.83
APP02             1.00    0.67  0.80
APP03             1.00    0.67  0.80
APP04             0.95    0.95  0.95
APP05             1.00    1.00  1.00
APP06             0.79    0.72  0.76
APP07             0.58    0.58  0.58
APP08             0.65    0.75  0.70
APP09             0.82    0.93  0.87
APP10             0.63    0.73  0.68
APP11             0.50    0.50  0.50
APP12             0.61    0.56  0.58
APP13             0.94    0.94  0.94
APP14             0.71    0.73  0.72
APP15             0.84    0.79  0.82
APP16             0.93    0.81  0.87
APP17             0.84    0.89  0.86

Active Learning Accuracy: 71.4%
Batch Annotation Accuracy: 99.4%

Table 3: Classification results for action types.

7 Discussion

The total time allotted for annotation was approximately 12 hours, split between two annotators (the first author and a computer science graduate student). Prior to annotation, both annotators were given a detailed description of an appendectomy, including a video of a procedure to help associate the actual surgical actions with the narrative description. After annotation, 1,042 event types were annotated using the active learning method, 90,335 event types were annotated using the batch method, and 6,279 remained un-annotated. Similarly, 658 action types were annotated using the active learning method, 35,799 action types were annotated using the batch method, and 21,151 remained un-annotated. A greater proportion of actions remained un-annotated due to the lower classifier confidence associated with the task. Event and action types were annotated in unison, but we estimate that during the active learning process it took about 25 seconds to annotate each event (both the event type and the action type if classified as an ACTION). The batch process enabled the annotation of an average of 3 event mentions per second.

This rapid annotation was made possible by the repetitive nature of operative notes, especially within an individual surgeon's notes. For example, the following statements were repeated over 100 times in our corpus:

• General anesthesia was induced.
• A Foley catheter was placed under sterile conditions.
• The appendix was identified and seemed to be acutely inflamed.

The first example was used by an individual surgeon in 95% of his/her notes, and only used three times by a different surgeon. In the second example, the sentence is used in 77% of the surgeon's notes while only used once by another surgeon. The phrase "Foley catheter was placed", however, was used 133 times by other surgeons. In the context of an appendectomy, this action is unambiguous, and so only a few annotations are needed to recognize the hundreds of actual occurrences in the data. Similarly, with the third example, the phrase "the appendix was identified" was used in over 600 operative notes by 10 of the 12 surgeons. After a few manual annotations to achieve sufficient classification confidence, the batch process can identify duplicate or near-duplicate events that can be annotated at once, greatly reducing the time needed to achieve full annotation.

Unfortunately, the most predictable parts of a surgeon's language are typically the least interesting from the perspective of understanding the critical points in the narrative. As shown in the examples above, the highest levels of redundancy are found in the most routine aspects of the operation. The batch annotation, therefore, is quite biased, and the 99% accuracies it achieves cannot be expected to hold up once the data is fully annotated. Conversely, the active learning process specifically chooses examples that are different from the current labeled set and thus are more difficult to classify. Active learning is more likely to sample from the "long tail" than from the most frequent events and actions, so the performance on the chosen sample is certainly a lower bound on the performance of a completely annotated data set. If one assumes the remaining un-annotated data will be of similar difficulty to the data sampled by the active learner, one could project an overall event type accuracy of 97% and an overall action type accuracy of 89%. This furthermore assumes no improvements are made to the machine learning method based on this completed data.

Figure 2: Frequencies of action types in the active learning (AL) portion of the data set (left vertical axis) and the batch annotation (BA) portion of the data set (right vertical axis).

One way to estimate the potential bias in batch annotation is by observing the differences in the distributions of the two data sets. Figure 2 shows the total numbers of action types for both the active learning and batch annotation portions of the data. For the most part, the distributions are similar. APP08 (insert other working ports), APP10 (dissect appendix away from other structures), APP11 (divide blood supply), APP12 (divide appendix from cecum), and APP14 (remove bag from body) are the most under-represented in the batch annotation data. This confirms our hypothesis that some of the most interesting events have the greatest diversity in expression.

In Section 2 we noted that a limitation of the annotation method of Wang et al. (2012) was that a sentence could only have one action. We largely overcame this problem by associating a single surgical action with an event mention. This has one notable limitation, however, as occasionally a single event mention corresponds to more than one action. In our data, APP11 and APP12 are commonly expressed together:

• Next, the mesoappendix and appendix is stapled[APP11/APP12] and then the appendix is placed[APP13] in an endobag.

Here, a coordination ("mesoappendix and appendix") is used to associate two events (the stapling of the mesoappendix and the stapling of the appendix) with the same event mention. In the event extraction literature, this is a well-understood occurrence, as for instance TimeML (Pustejovsky et al., 2003) can represent more than one event with a single event mention. In practice, however, few automatic TimeML systems handle such phenomena. Despite this, for our purposes the annotation structure should likely be amended so that we can account for all the important actions in the operative note. This way, gaps in our event structure will correspond to actual gaps in the narrative (e.g., dividing the blood supply is a critical step in an appendectomy and therefore needs to fit within the event structure).

Finally, the data in our experiment comes from a relatively simple procedure (an appendectomy). It is unclear how well this method would generalize to more complex operations. Most likely, the difficulty will lie in actions that are highly ambiguous, such as when more than one incision is made. In this case, richer semantic information will be necessary, such as the spatial argument that indicates where a particular event occurs (Roberts et al., 2012).

8 Conclusion

With the increasing availability of electronic operative notes, there is a corresponding need for deep analysis methods that understand a note's narrative structure, enabling applications for improving patient care. In this paper, we have presented a method for recognizing how event mentions in an operative note fit into the event structure of the actual operation. We have proposed a generic framework for event structures in surgical notes with a specific event structure for appendectomy operations. We have described a corpus of 2,820 operative notes of appendectomies performed by 12 surgeons at a single institution. With the ultimate goal of fully annotating this data set, which contains almost 100,000 event mentions, we have shown how an active learning method combined with a batch annotation process can quickly annotate the majority of the corpus. The method is not without its weaknesses, however, and further annotation is likely necessary.

Beyond finishing the annotation process, our ultimate goal is to develop unsupervised methods for structuring operative notes. This would enable expanding to new surgical procedures without human intervention while also leveraging the increasing availability of this information. We have shown in this work how operative notes have linguistic characteristics that result in parallel structures. It is our goal to leverage these characteristics in developing unsupervised methods.


Acknowledgments

The authors would like to thank Sanya Peshwani for her help in annotating the data.

References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of ACL/COLING.
Allen C. Browne, Alexa T. McCray, and Suresh Srinivasan. 1993. The SPECIALIST Lexicon. Technical Report NLM-LHC-93-01, National Library of Medicine.
Yukun Chen, Hongxin Cao, Qiaozhu Mei, Kai Zheng, and Hua Xu. 2013a. Applying active learning to supervised word sense disambiguation in MEDLINE. J Am Med Inform Assoc, 20:1001–1006.
Yukun Chen, Robert Carroll, Eugenia R. McPeek Hinz, Anushi Shah, Anne E. Eyler, Joshua C. Denny, and Hua Xu. 2013b. Applying active learning to high-throughput phenotyping algorithms for electronic health records data. J Am Med Inform Assoc, 20:e253–e259.
Sanjoy Dasgupta and Daniel Hsu. 2008. Hierarchical Sampling for Active Learning. In Proceedings of the International Conference on Machine Learning.
J.K. DeOrio. 2002. Surgical templates for orthopedic operative reports. Orthopedics, 25(6):639–642.
Laura Donahoe, Sean Bennett, Walley Temple, Andrea Hilchie-Pye, Kelly Dabbs, Ethel MacIntosh, and Geoff Porter. 2012. Completeness of dictated operative reports in breast cancer – the case for synoptic reporting. J Surg Oncol, 106(1):79–83.
Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.
Rosa L. Figueroa, Qing Zeng-Treitler, Long H. Ngo, Sergey Goryachev, and Eduardo P. Wiechmann. 2012. Active learning for clinical text classification: is it better than random sampling? J Am Med Inform Assoc, 19:809–816.
I. Gur, D. Gur, and J.A. Recabaren. 2011. The computerized synoptic operative report: A novel tool in surgical residency education. Arch Surg, pages 71–74.
Udo Hahn, Elena Beisswanger, Ekaterina Buyko, and Erik Faessler. 2012. Active Learning-Based Corpus Annotation – The PATHOJEN Experience. In Proceedings of the AMIA Symposium, pages 301–310.
Yang Huang, Henry J Lowe, Dan Klein, and Russell J Cucina. 2005. Improved Identification of Noun Phrases in Clinical Radiology Reports Using a High-Performance Statistical Natural Language Parser Augmented with the UMLS Specialist Lexicon. J Am Med Inform Assoc, 12:275–285.
Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. In Proceedings of ACL, pages 423–430.
James M Lamiell, Zbigniew M Wojcik, and John Isaacks. 1993. Computer Auditing of Surgical Operative Reports Written in English. In Proc Annu Symp Comput Appl Med Care, pages 269–273.
Donald A.B. Lindberg, Betsy L. Humphreys, and Alexa T. McCray. 1993. The Unified Medical Language System. Methods of Information in Medicine, 32(4):281–291.
Martha Palmer, Paul Kingsbury, and Daniel Gildea. 2005. The Proposition Bank: An Annotated Corpus of Semantic Roles. Computational Linguistics, 31(1):71–106.
Jason Park, Venu G. Pillarisetty, Murray F. Brennan, et al. 2010. Electronic Synoptic Operative Reporting: Assessing the Reliability and Completeness of Synoptic Reports for Pancreatic Resection. J Am Coll Surgeons, 211(3):308–315.
James Pustejovsky, Jose Castano, Robert Ingria, Roser Saurí, Robert Gaizauskas, Andrea Setzer, Graham Katz, and Dragomir Radev. 2003. TimeML: Robust Specification of Event and Temporal Expressions in Text. In Proceedings of the Fifth International Workshop on Computational Semantics.
Kirk Roberts, Bryan Rink, Sanda M. Harabagiu, Richard H. Scheuermann, Seth Toomay, Travis Browning, Teresa Bosler, and Ronald Peshock. 2012. A Machine Learning Approach for Identifying Anatomical Locations of Actionable Findings in Radiology Reports. In Proceedings of the AMIA Symposium.
Katrin Tomanek and Udo Hahn. 2009. Reducing Class Imbalance during Active Learning for Named Entity Annotation. In Proceedings of KCAP.
Katrin Tomanek, Joachim Wermter, and Udo Hahn. 2007. An Approach to Text Corpus Construction which Cuts Annotation Costs and Maintains Reusability of Annotated Data. In Proceedings of EMNLP/CoNLL, pages 486–495.
Yan Wang, Serguei Pakhomov, Nora E. Burkart, James O. Ryan, and Genevieve B. Melton. 2012. A Study of Actions in Operative Notes. In Proceedings of the AMIA Symposium, pages 1431–1440.
Yan Wang, Serguei Pakhomov, and Genevieve B Melton. 2013. Predicate Argument Structure Frames for Modeling Information in Operative Notes. In Studies in Health Technology and Informatics (MEDINFO), pages 783–787.
Hui Wang, Weide Zhang, Qiang Zeng, Zuofeng Li, Kaiyan Feng, and Lei Liu. 2014. Extracting important information from Chinese Operation Notes with natural language processing methods. J Biomed Inform.


Proceedings of the 2014 Workshop on Biomedical Natural Language Processing (BioNLP 2014), pages 77–82, Baltimore, Maryland USA, June 26-27 2014. ©2014 Association for Computational Linguistics

Chunking Clinical Text Containing Non-Canonical Language

Aleksandar Savkov
Department of Informatics
University of Sussex
Brighton, UK
[email protected]

John Carroll
Department of Informatics
University of Sussex
Brighton, UK
[email protected]

Jackie Cassell
Primary Care and Public Health
Brighton and Sussex Medical School
Brighton, UK
[email protected]

Abstract

Free text notes typed by primary care physicians during patient consultations typically contain highly non-canonical language. Shallow syntactic analysis of free text notes can help to reveal valuable information for the study of disease and treatment. We present an exploratory study into chunking such text using off-the-shelf language processing tools and pre-trained statistical models. We evaluate chunking accuracy with respect to part-of-speech tagging quality, choice of chunk representation, and breadth of context features. Our results indicate that narrow context feature windows give the best results, but that chunk representation and minor differences in tagging quality do not have a significant impact on chunking accuracy.

1 Introduction

Clinical text contains rich, detailed information of great potential use to scientists and health service researchers. However, peculiarities of language use make the text difficult to process, and the presence of sensitive information makes it hard to obtain adequate quantities for developing processing systems. The short term goal of most research in the area is to achieve a reliable language processing foundation that can support more complex tasks, such as named entity recognition (NER), to a sufficiently reliable level.

Chunking is the task of identifying non-recursive phrases in text (Abney, 1991). It is a type of shallow parsing that is a less challenging task than dependency or constituency parsing. This makes it likely to give more reliable results on clinical text, since there is a very limited amount of annotated (or even raw) text of this kind available for system development. Even though chunking does not provide as much syntactic information as full parsing, it is an excellent method for identifying base noun phrases (NPs), which is a key issue in symptom and disease identification. Identifying symptoms and diseases is at the heart of harnessing the potential of clinical data for medical research purposes.

There are few resources that enable researchers to adapt general domain techniques to clinical text. Using the Harvey Corpus1 – a chunk-annotated clinical text language resource – we present an exploratory study into adapting general domain tools and models to apply to free text notes typed by UK primary care physicians.

2 Related Work

The Mayo Clinic Corpus (Pakhomov et al., 2004) is a key resource that has been widely used as a gold standard in part-of-speech (POS) tagging of clinical text. Based on that corpus and the Penn TreeBank (Marcus et al., 1993), Coden et al. (2005) present an analysis of the effects of domain data on the performance of POS tagging models, demonstrating significant improvements with models trained entirely on domain data. Savova et al. (2010) use this corpus for the development of cTAKES, Mayo Clinic's processing pipeline for clinical text.

Fan et al. (2011) show that using more diverse clinical data can lead to more accurate POS tagging. They report that models trained on clinical text datasets from two different institutions perform better on each of the datasets than models trained only on the same or the other dataset.

Fan et al. (2013) present guidelines for syntactic parsing of clinical text and a clinical Treebank annotated according to them. The guidelines are designed to help the annotators handle the non-canonical language that is typical of clinical text.

1 An article describing the corpus is currently under review.


3 Data

The Harvey Corpus is a chunk-annotated corpus consisting of pairs of manually anonymised UK primary care physician (General Practitioner, or GP) notes and associated Read codes (Bentley et al., 1996). Each Read code has a short textual gloss. The purpose of the codes is to make it easy to extract structured data from clinical records. The reason we include the codes in the corpus is that GPs often use their glosses as the beginning of their note. Two typical examples (without chunk annotation for clarity) are shown below.

(1) Birth details | | Normal deliviery Girl Weight - 3. 960kg Apgar score @ 1min - 9 Apgar score @ 5min - 9 Vit K given Paed check NAD HC - 34. 9cm Hip test performed

(2) Chest pain | | musculoskel pain last w/e, nil to find, ecg by paramedic no change, reassured, rev sos

The corpus comprises 890 pairs of Read codes and notes, each annotated by medical experts using a chunk annotation scheme that includes non-recursive noun phrases (NPs), main verb groups (MVs), and a common annotation for adjectival and adverbial phrases (APs). Example (3) below illustrates the annotation. The majority of the records (750) were double blind annotated by medical experts, after which the resulting annotation was adjudicated by a third medical expert annotator.

(3) [Chest pain]NP | | [musculoskel pain]NP [last w/e]NP, [nil]AP to [find]MV, [ecg]NP by [paramedic]NP [no change]NP, [reassured]MV, [rev]MV [sos]AP

Inter-annotator agreement was 0.86 f-score, taking one annotator to be the gold standard and the other the candidate. We calculate the f-score according to the MUC-7 (Chinchor, 1998) specification, with the standard f-score formula. The calculation is kept symmetric with regard to the choice of gold standard annotator by limiting the counting of incorrect categories to one per tag, and equating the missing and spurious categories. For example, three words annotated as one three-token chunk by annotator A and as three one-token chunks by annotator B will have one incorrect and two missing/spurious elements.

The rest of the records are a by-product of the training process. Ninety records were triple annotated by three different medical experts with the help of a computational linguist, and fifty records were double annotated by a medical expert – alone and together with a computational linguist.

It is important to note that the text in the corpus is not representative of all types of GP notes. It is focused on text that represents the dominant part of day-to-day notes, rather than standard edited text such as copies of letters to specialists and other medical practitioners.

Even though the corpus data is very rich in information, its non-canonical language means that it is very different from other clinical corpora such as the Mayo Clinic Corpus (Pakhomov et al., 2004) and poses different challenges for processing. The GP notes in the Harvey Corpus can be regarded as groups of medical 'tweets' meant to be used mainly by the author. Sentence segmentation in the classical sense of the term is often impossible, because there are no sentences. Instead there are short bursts of phrases concatenated together, often without any indication of their boundaries. The average length of a note is roughly 30 tokens including the Read code. This is in contrast to notes in other clinical text datasets, which range from 100 to 400 tokens on average (Fan et al., 2011; Pakhomov et al., 2004). As well as typical clinical text characteristics such as domain-specific acronyms, slang, and abbreviations, punctuation and casing are often misleading (if present at all), and some common classes of words (e.g. auxiliary verbs) are almost completely absent.

4 Chunking

State-of-the-art text chunking accuracy reaches an f-score of 95% (Sun et al., 2008). However, this is for standard, edited text, and relies on accurate POS tagging in a pre-processing step. In contrast, the characteristics of GP-written free text make accurate POS tagging and chunking difficult. Major problems are caused by unknown tokens and ambiguities due to omitted words or phrases.

We evaluate two standard chunking tools, YamCha (Kudo and Matsumoto, 2003) and CRF++2, selected based on their support for trainable context features.

2 http://crfpp.googlecode.com/svn/trunk/doc/index.html


Tagger        POS     YamCha IOB     YamCha BEISO   CRF++ IOB      CRF++ BEISO
ARK IRC       75.35   76.63 σ1.04    76.87 σ2.91    75.87 σ1.64    76.23 σ1.99
ARK Twitter   –       76.72 σ2.11    77.53 σ1.65    76.63 σ2.36    77.23 σ1.06
ARK Ritter    75.70   76.59 σ2.01    76.72 σ2.11    76.63 σ1.05    77.17 σ1.77
cTAKES        82.42   75.32 σ2.52    75.85 σ2.02    75.43 σ1.79    75.53 σ1.90
GENIA         80.63   71.70 σ2.27*   74.86 σ1.41    74.16 σ2.03*   74.19 σ1.72
RASP          –       74.24 σ1.84    75.10 σ1.31    75.63 σ2.33    75.76 σ2.18
Stanford      80.68   76.40 σ1.69    76.36 σ2.92    75.95 σ1.25    75.94 σ1.91
SVMTool       76.40   74.32 σ2.57    74.30 σ2.71    74.66 σ1.77    74.68 σ2.28
Wapiti        73.39   74.74 σ2.29    74.78 σ1.33    73.59 σ2.62    73.83 σ2.31
baseline      –       69.66 σ1.89*   69.76 σ1.24    67.05 σ1.15*   68.65 σ1.41

Table 1: Chunking results using YamCha and CRF++ on data automatically POS tagged using nine different models; the baseline is with no tagging. The IOB and BEISO columns compare the impact of two chunk representation strategies. The POS column indicates the part-of-speech tagging accuracy for a subset of the corpus. Asterisks indicate pairs of significantly different YamCha and CRF++ results (t-test with 0.05 p-value).

The tools were applied to the Harvey Corpus with automatically generated POS annotation. Given the small amount of data and the challenges presented above, we expected that our results would be lower than those reported by Savova et al. (2010). The aim of these experiments is to find the best performance obtainable with standard chunking tools, which we will build on in further stages of our research.

We conducted pairs of experiments, one with each chunking tool, divided into three groups: the first investigates the effects of the choice of POS tagger for training data annotation (Section 4.1); the second compares two chunk representations (Section 4.2); and the third searches for the optimal context features (Section 4.3). All feature tuning experiments were conducted on a development set and tested using 10-fold cross-validation on the rest of the data. We used 10% of the whole data for the development set and 90% of the remaining data for a training sample during development. This guarantees that the development model is trained on the same amount of data as the testing model.
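To make the split arithmetic concrete (our own illustration, assuming the full corpus contains N notes): the development set holds 0.1N notes; of the remaining 0.9N, the development-time training sample uses 90%, i.e. 0.9 × 0.9N = 0.81N; during the final 10-fold cross-validation over the same 0.9N notes, each fold also trains on 9/10 × 0.9N = 0.81N, so the development-time and test-time models are trained on the same amount of data.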

4.1 Part-of-Speech Tagging

We evaluated and compared the results yielded by the two chunkers, having applied each of seven off-the-shelf POS taggers. Of these taggers, cTAKES (Savova et al., 2010) and GENIA (Tsuruoka et al., 2005) are the only ones trained on data that resembles ours, which suggests that they should have the best chance of performing well. We also selected a number of other taggers while trying to diversify their algorithms and training data as much as possible: the POS tagger part of the Stanford NLP package (Toutanova et al., 2003) because it is one of the most successfully applied in the field; the RASP tagger (Briscoe et al., 2006) because of its British National Corpus (Clear, 1993) training data; the ARK tagger (Owoputi et al., 2013) because of the terseness of the tweet language; and the SVMTool (Gimenez and Marquez, 2004) and Wapiti (Lavergne et al., 2010) because they use SVM and CRF algorithms. Our baseline model uses no part-of-speech information.

Using the Penn TreeBank tagset (Marcus et al., 1993), we manually annotated a subset of the corpus of comparable size to the development set. Using this dataset we estimated the tagging accuracy for all models that support that tagset (omitting RASP and ARK Twitter since they use different tagsets). In this evaluation, cTAKES is the best performing model, followed closely by the Stanford POS tagger and GENIA.

The results in Table 1 show that the differences between chunking models trained on different POS annotations are small and mostly not statistically significant. However, all the results are significantly better than the baseline, apart from those based on the GENIA tagger output.

4.2 Chunk Representation

The dominant chunk representation standard inside, outside, begin (IOB), introduced by Ramshaw and Marcus (1995) and established with the


CoNLL-2000 shared task (Sang and Buchholz, 2000), takes a minimalistic approach to the representation problem in order to keep the number of labels low. Note that for chunking representations the total number of labels is the product of the chunk types and the set of representation types, plus the outside tag, meaning that for IOB with our set of three chunk types (NP, MV, AP) there are seven labels.

Alternative chunk representations, such as begin, end, inside, single, outside (BEISO)3 as used by Kudo and Matsumoto (2001), offer more fine-grained tagsets, presumably at a performance cost. That cost is unnecessary unless there is something to be gained from a more fine-grained tagset at decoding time, because the two representations are deterministically inter-convertible. For instance, an end tag could be useful for better recognising boundaries between chunks of the same type. The BEISO tagset model looks for the boundary before and after crossing it, while an IOB model only looks after. This should give only a small gain with standard edited text because the chunk type distribution is fairly well balanced and punctuation divides ambiguous cases such as lists of compound nouns. However, the Harvey Corpus is NP-heavy and contains many sequences of NP chunks that do not have any punctuation to mark their boundaries.
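Because the two tagsets are deterministically inter-convertible, the mapping can be illustrated with a short sketch. The following Python function is purely illustrative (it is not part of the tools used in this work) and assumes IOB2-style input, where every chunk starts with a B- tag:

```python
def iob_to_beiso(tags):
    """Convert IOB chunk tags (e.g. B-NP, I-NP, O) to BEISO tags,
    assuming IOB2-style input where every chunk starts with B-."""
    beiso = []
    n = len(tags)
    for i, tag in enumerate(tags):
        if tag == "O":
            beiso.append("O")
            continue
        prefix, ctype = tag.split("-", 1)
        # The chunk continues if the next tag is I- of the same type.
        nxt = tags[i + 1] if i + 1 < n else "O"
        continues = nxt == "I-" + ctype
        if prefix == "B":
            beiso.append(("B-" if continues else "S-") + ctype)
        else:  # an I- tag is either inside or at the end of a chunk
            beiso.append(("I-" if continues else "E-") + ctype)
    return beiso

# Example: two adjacent NP chunks with no punctuation between them.
print(iob_to_beiso(["B-NP", "I-NP", "B-NP", "O", "B-MV"]))
# ['B-NP', 'E-NP', 'S-NP', 'O', 'S-MV']
```

The example shows why an explicit end tag could matter in NP-heavy text: the boundary between the two adjacent NP chunks is marked directly in the BEISO output.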

We evaluated the two chunk representations in combination with each POS tagger. Table 1 shows that the differences between the results for the two representations are small and never statistically significant. We also evaluated the two chunk representations with different amounts of training data. The resulting learning curves (Figure 1) are almost identical.

4.3 Context Features

We approached the feature tuning task by first exploring the smaller feature space of YamCha and then using the trends there to constrain the features of CRF++. YamCha has three groups of features responsible for tokens, POS tags and dynamically generated (i.e. preceding) chunk tags. For all experiments we determined the best feature set by exhaustively testing all context feature combinations within a predefined range. We used the same context window for the token and tag features in order to reduce the search space. Given

3 Also sometimes abbreviated IOBSE

Feature Set                  CV            Dev
W-1-W1, T-1-T1, C-1          77.28 σ1.9    75.28
W-1-W1, T-1-T1, C-2-C-1      77.27 σ2.6    74.70
W-1-W2, T-1-T2, C-1          76.86 σ1.5    74.08
W-2-W1, T-2-T1, C-2          76.46 σ1.3    74.00
W-1-W1, T-1-T1, C-2          76.89 σ2.1    73.92
W-2-W1, T-2-T1, C-3-C-1      76.52 σ0.9    73.91
W-1-W1, T-1-T1, C-3-C-1      77.02 σ2.0    73.90
W-2-W2, T-2-T2, C-1          77.03 σ1.9    73.86
W-1-W1, T-1-T1, C-3          77.15 σ1.5    73.63
W-3-W1, T-3-T1, C-2-C-1      75.71 σ1.9    73.63

Table 2: Development set and 10-fold cross-validation results for the top ten feature sets of YamCha models trained on ARKTwitter POS annotation. Token features are represented with W, POS features with T, and dynamically generated chunk features with C. None of the cross-validation results are significantly different from each other (t-test with 0.05 p-value).

the terseness of the text we expected that wider context windows of more than three tokens would not be beneficial to the model, and therefore did not consider them. Our experiments using YamCha confirmed this hypothesis and showed a consistent trend among all experiments in favouring a window of -1 to +1 for tokens and slightly wider for chunk tags (see Table 2).

CRF++ provides a more powerful feature configuration allowing for unary and pairwise4 features of output tags. The unary features allow the construction of token or POS tag bigrams and trigrams in addition to the standard context windows. The feature tuning search space with so many parameters is enormous, which required us to use our findings from the YamCha experiments to trim it down and make it computationally feasible. First, we decreased the search window of all features by one in each direction from -3:3 to -2:2. Second, we used the top scoring POS model from the first experimental runs to constrain the features even further by selecting only the top one hundred for the rest of the models.
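To make the distinction concrete, the snippet below writes an illustrative CRF++-style feature template; it is a sketch of the kind of configuration described here, not the exact template used in these experiments. The U lines define unary (output-tag unigram) features over token and POS context windows, including one built from a token bigram, and the B line adds pairwise (output-tag bigram) features over adjacent chunk tags:

```python
# A minimal sketch of a CRF++ feature template, assuming training files
# with the columns token / POS / chunk-tag. Not the authors' actual template.
template = """\
# Unary features: token window -1..+1 (column 0)
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
# Unary features: POS window -1..+1 (column 1)
U10:%x[-1,1]
U11:%x[0,1]
U12:%x[1,1]
# Unary feature built from a token bigram (current and next token)
U20:%x[0,0]/%x[1,0]
# Pairwise (output-tag bigram) features over adjacent chunk tags
B
"""

with open("template.txt", "w") as f:
    f.write(template)
```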

We could not identify the same uniform trend in the top feature sets as we could with YamCha. Our results ranged from very small context windows to the maximum size of our search space.

4 The unary and pairwise features of output tags are referred to as unigram and bigram features of output tags on the CRF++ web page. Although this is correct, it can also be confused with unigrams and bigrams of tokens, which are expressed as unary (unigram) output tag features.



Figure 1: Chunking results for YamCha IOB and BEISO models with increasing amounts of training data.

However, we noticed that BEISO feature sets tend to be smaller than the IOB ones. We also found that the pairwise features normally improve the results.

5 Discussion and Future Work

We were surprised that the experiments did not show a clear correlation between POS tagging accuracy and chunking accuracy. On the other hand, the chunking results using POS tagged data are significantly better than the baseline, except when using the GENIA tagger output. The small differences between training sets of similar POS accuracy could be explained by the non-uniform impact of a wrong POS tag on the chunking process. Some mistakes, such as labelling a noun as a verb in the middle of an NP chunk, are almost sure to propagate and cause further chunking errors, whereas others may have minimal or no effect, for example labelling a singular noun as a proper noun. An error analysis of verb tags and noun tags (Table 3) shows that the ARK models tend to make more mistakes that keep the annotation within the same tag group compared to the GENIA model (see column pairs 1 and 3, and 2 and 4). This is a possible explanation for the lower accuracy of the chunking model trained on data tagged by GENIA.

Our experiments showed that the models using the two chunk representations did not perform significantly differently from each other. We also showed that this conclusion is likely to hold if

Model         Ngroup   Vgroup   Nouns   Verbs
ARKIRC        67.17    78.26    88.26   85.99
ARKTwitter    –        –        86.97   88.71
ARKRitter     68.57    77.29    90.64   85.02
cTAKES        83.93    62.80    93.85   69.08
GENIA         81.56    61.83    92.03   71.01
RASP          –        –        84.59   83.58
Stanford      80.30    73.42    91.89   83.09
SVMTool       69.97    70.04    90.08   80.19
Wapiti        65.64    66.66    87.84   74.87

Table 3: Detailed view of the POS model results focusing on the noun and verb tag groups. The leftmost two columns of figures show accuracies over tags in the respective groups; the rightmost two columns show the accuracies of the same groups if all tags in a group are replaced with a group tag, e.g. V for verbs5.

5 Note that these results are different from what would be yielded by a classifier trained on data subjected to the same tag substitution.

more training data were available.

There are a number of ways we could improve chunking accuracy besides increasing the amount of training data. Although our results do not show a clear trend, Fan et al. (2011) demonstrate that the domain of part-of-speech training data has a significant impact on tagging accuracy, which could potentially impact chunking results if it decreases the number of errors that propagate during chunking. An important problem in that area is dealing with present and past participles, which are almost sure to cause error propagation if mislabelled (as nouns or adjectives, respectively). Participles are more ambiguous in terse contexts lacking auxiliary verbs, which are natural disambiguation indicators. Another direction in processing that could contribute to better chunking is better token and sentence segmentation. Finally, unknown words, which may potentially have the largest impact on chunking accuracy, could be dealt with using a generic solution such as feature expansion based on distributional similarity.

References

S. Abney. 1991. Parsing by chunks. In Robert C. Berwick, Steven P. Abney, and Carol Tenny, editors, Principle-Based Parsing: Computation and Psycholinguistics, pages 257–278. Kluwer, Dordrecht.

T. Bentley, C. Price, and P. Brown. 1996. Structural and lexical features of successive versions of the Read codes. In Proceedings of the Annual Conference of The Primary Health Care Specialist Group of the British Computer Society, pages 91–103.

T. Briscoe, J. Carroll, and R. Watson. 2006. The second release of the RASP system. In Proceedings of the COLING/ACL on Interactive Presentation Sessions, COLING-ACL'06, pages 77–80, Stroudsburg, PA, USA. Association for Computational Linguistics.

N. Chinchor. 1998. Appendix B: Test scores. In Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax, VA, April.

J. Clear. 1993. The British National Corpus. In George P. Landow and Paul Delany, editors, The Digital Word, pages 163–187. MIT Press, Cambridge, MA, USA.

A. Coden, S. Pakhomov, R. Ando, P. Duffy, and C. Chute. 2005. Domain-specific language models and lexicons for tagging. Journal of Biomedical Informatics, 38:422–430.

J.-W. Fan, R. Prasad, R.M. Yabut, R.M. Loomis, D.S. Zisook, J.E. Mattison, and Y. Huang. 2011. Part-of-speech tagging for clinical text: Wall or bridge between institutions? In American Medical Informatics Association Annual Symposium, 1, pages 382–391. American Medical Informatics Association.

J.-W. Fan, E. Yang, M. Jiang, R. Prasad, R. Loomis, D. Zisook, J. Denny, H. Xu, and Y. Huang. 2013. Research and applications: Syntactic parsing of clinical text: guideline and corpus development with handling ill-formed sentences. JAMIA, 20(6):1168–1177.

J. Gimenez and L. Marquez. 2004. SVMTool: A general POS tagger generator based on support vector machines. In Proceedings of the 4th LREC, Lisbon, Portugal.

T. Kudo and Y. Matsumoto. 2001. Chunking with support vector machines. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, NAACL'01, pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.

T. Kudo and Y. Matsumoto. 2003. Fast methods for kernel-based text analysis. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 24–31, Morristown, NJ, USA. Association for Computational Linguistics.

T. Lavergne, O. Cappe, and F. Yvon. 2010. Practical very large scale CRFs. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 504–513. Association for Computational Linguistics, July.

M. Marcus, B. Santorini, and M. A. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

O. Owoputi, B. O'Connor, C. Dyer, K. Gimpel, N. Schneider, and N. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of NAACL-HLT, pages 380–390.

S. Pakhomov, A. Coden, and C. Chute. 2004. Creating a test corpus of clinical notes manually tagged for part-of-speech information. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and Its Applications, JNLPBA'04, pages 62–65, Stroudsburg, PA, USA. Association for Computational Linguistics.

L. Ramshaw and M. Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the Third Workshop on Very Large Corpora, pages 82–94.

E. Sang and S. Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, pages 13–14.

G. Savova, J. Masanz, P. Ogren, J. Zheng, S. Sohn, K. Kipper-Schuler, and C. Chute. 2010. Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 17(5):507–513.

X. Sun, L.-P. Morency, D. Okanoharay, and J. Tsujii. 2008. Modeling latent-dynamic in shallow parsing: A latent conditional model with improved inference. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING'08), Manchester, UK, August.

K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL'03, pages 173–180, Stroudsburg, PA, USA. Association for Computational Linguistics.

Y. Tsuruoka, Y. Tateishi, J.-D. Kim, T. Ohta, J. McNaught, S. Ananiadou, and J. Tsujii. 2005. Developing a robust part-of-speech tagger for biomedical text. In Proceedings of the 10th Panhellenic Conference on Advances in Informatics, PCI'05, pages 382–392, Berlin, Heidelberg. Springer-Verlag.


Proceedings of the 2014 Workshop on Biomedical Natural Language Processing (BioNLP 2014), pages 83–87, Baltimore, Maryland, USA, June 26-27 2014. ©2014 Association for Computational Linguistics

Decision Style in a Clinical Reasoning Corpus

Limor Hochberg1 Cecilia O. Alm1 Esa M. Rantanen1

Caroline M. DeLong1 Anne Haake2

1 College of Liberal Arts  2 College of Computing & Information Sciences
Rochester Institute of Technology

lxh6513|coagla|emrgsh|cmdgsh|[email protected]

Abstract

The dual process model (Evans, 2008) posits two types of decision-making, which may be ordered on a continuum from intuitive to analytical (Hammond, 1981). This work uses a dataset of narrated image-based clinical reasoning, collected from physicians as they diagnosed dermatological cases presented as images. Two annotators with training in cognitive psychology assigned each narrative a rating on a four-point decision scale, from intuitive to analytical. This work discusses the annotation study, and makes contributions for resource creation methodology and analysis in the clinical domain.

1 Introduction

Physicians make numerous diagnoses daily, and consequently clinical decision-making strategies are much discussed (e.g., Norman, 2009; Croskerry, 2003, 2009). Dual process theory proposes that decision-making may be broadly categorized as intuitive or analytical (Kahneman & Frederick, 2002; Stanovich & West, 2000). Further, scholars argue that decision-making may be ordered on a continuum, with intuitive and analytical at each pole (Hamm, 1988; Hammond, 1981).

Determining the decision strategies used by physicians is of interest because certain styles may be more appropriate for particular tasks (Hammond, 1981), and better suited for expert physicians rather than those in training (Norman, 2009). Language use can provide insight into physician decision style, as linguistic content reflects cognitive processes (Pennebaker & King, 1999).

While most clinical corpora focus on patients or conditions, physician diagnostic narratives have been successfully annotated for conceptual units (e.g., identifying medical morphology or a differential diagnosis), by Womack et al. (2013) and McCoy et al. (2012). Crowley et al. (2013) created an instructional system to detect cognitive biases in clinical decision-making, while Coderre et al. (2003) used protocol analysis on think-aloud diagnostic narratives, and found that features of intuitive reasoning implied diagnostic accuracy.

In this study, speech data were collected from physicians as they diagnosed dermatological cases presented to them as images. Physician verbalizations were annotated for decision style on a four-point scale from intuitive to analytical (Figure 1). Importantly, cognitive psychologists were brought into the loop for decision style annotation, to take advantage of their expertise in decision theory.

Figure 1: The decision-making continuum, showing the four-point rating scale. The example narratives were by two physicians for the same image (used with permission from Logical Images, Inc.), both correct in diagnosis. (I=Intuitive, BI=Both-Intuitive, BA=Both-Analytical, A=Analytical).

This work describes a thorough methodology applied in annotating a corpus of diagnostic narratives for decision style. The corpus is a unique resource – the first of its kind – for studying and modeling clinical decision style or for developing instructional systems for training clinicians to assess their reasoning processes.

This study attempts to capture empirically decision-making constructs that are much-discussed theoretically.


Figure 2: Overview of annotation methodology. Conclusions from the pilot study enhanced the main annotation study. To ensure high-quality annotation, narratives appeared in random order, and 10% (86) of narratives were duplicated and evenly distributed in the annotation data, to later assess intra-annotator reliability. Questionnaires were also interspersed at 5 equal intervals to study annotator strategy.

Thus, it responds to the need for investigating subjective natural language phenomena (Alm, 2011). The annotated corpus is a springboard for decision research in medicine, as well as other mission-critical domains in which good decisions save lives, time, and money.

Subjective computational modeling is particularly challenging because often, no real ‘ground truth’ is available. Decision style is such a fuzzy concept, lacking clear boundaries (Hampton, 1998), and its recognition develops in psychologists over time, via exposure to knowledge and practice in cognitive psychology. Interpreting fuzzy decision categories also depends on mental models which lack strong intersubjective agreement. This is the nature, and challenge, of capturing understandings that emerge organically.

This work's contributions include (1) presenting a distinct clinical resource, (2) introducing a robust method for fuzzy clinical annotation tasks, (3) analyzing the annotated data comprehensively, and (4) devising a new metric that links annotated behavior to clinicians' decision-making profiles.

2 Corpus Description

In an experimental data-collection setting, 29 physicians (18 residents, 11 attendings) narrated their diagnostic thought process while inspecting 30 clinical images of dermatological cases, for a total of 868 narratives.1 Physicians described observations, differential and final diagnoses, and confidence (out of 100%) in their final diagnosis. Later, narratives were assessed for correctness (based on final diagnoses), and image cases were evaluated for difficulty by a dermatologist.

3 Corpus Annotation of Decision Style

The corpus was annotated for decision style in a pilot study and then a main annotation study (Figure 2).2

1 Two physicians skipped 1 image during data collection.

Two annotators with graduate training in cognitive psychology independently rated each narrative on a four-point scale from intuitive to analytical (Figure 1). The two middle labels reflect the presence of both styles, with intuitive (BI) or analytical (BA) reasoning being more prominent. Since analytical reasoning involves detailed examination of alternatives, annotators were asked to avoid using length as a proxy for decision style.

After the pilot, the annotators jointly discussed disagreements with one researcher. Inter-annotator reliability, measured by linear weighted kappa (Cohen, 1968), was 0.4 before and 0.8 after resolution; the latter score may be an upper bound on agreement for clinical decision-making annotation. As both annotators reported using physician-provided confidence to judge decision style, in subsequent annotation confidence mentions were removed if they appeared after the final diagnosis (most narratives), or, if intermixed with diagnostic reasoning, replaced with dashes. Finally, silent pauses3 were coded as ellipses to aid in the human parsing of the narratives.

4 Quantitative Annotation Analysis

Table 1 shows the annotator rating distributions.4

      I     BI    BA    A
A1    89    314   340   124
A2    149   329   262   127

Table 1: The distribution of ratings across the 4-point decision scale. I=Intuitive, BI=Both-Intuitive, BA=Both-Analytical, A=Analytical; A1=Annotator 1, A2=Annotator 2; N=867.

Though Annotator 1's ratings skew slightly more analytical than Annotator 2, a Kolmogorov-Smirnov test showed no significant difference between the two distributions (p = 0.77).

2 Within a reasonable time frame, the annotations will be made publicly available as part of a corpus release.

3 Above around 0.3 seconds (see Lovgren & Doorn, 2005).

4 N = 867 after excluding a narrative that, during annotation, was deemed too brief for decision style labeling.


Factor                                                   A1 (Avg)  A1 (SD)  A2 (Avg)  A2 (SD)
Switching between decision styles                        1.0       0.0      3.6       0.9
Timing of switch between decision styles                 1.6       0.5      4.2       0.4
Silent pauses (...)                                      2.0       0.0      3.6       0.5
Filled pauses (e.g. uh, um)                              2.0       0.7      3.6       0.5
Rel. (similarity) of final & differential diagnosis      2.8       0.4      3.2       0.8
Use of logical rules and inference                       3.2       0.8      2.2       0.4
False starts (in speech)                                 3.4       0.9      2.4       0.9
Automatic vs. controlled processing                      3.4       0.5      4.0       0.0
Holistic vs. sequential processing                       3.6       0.5      4.4       0.5
No. of diagnoses in differential diagnoses               4.0       0.0      1.6       0.5
Word choice                                              4.0       0.7      2.6       0.5
Rel. (similarity) of final & first-mentioned diagnosis   4.0       0.0      4.0       0.0
Perceived attitude                                       4.0       0.7      4.0       0.0
Rel. timing of differential diagnosis in the narrative   4.2       0.8      2.8       0.8
Degree of associative (vs. linear, ordered) processing   4.2       0.4      3.8       0.4
Use of justification (e.g. X because Y)                  4.2       0.4      4.0       0.0
Perceived confidence                                     4.4       0.5      4.2       0.4

Table 3: Annotators rated each of the listed factors as to how often they were used in annotation, on a 5-point Likert scale from ‘for no narratives’ (1) to ‘for all narratives’ (5). (Some factors slightly reworded.)

           WK     %FA    %FA+1   N
A1 - A2    .43    50%    94%     867
A1 - A1    .64    67%    100%    86
A2 - A2    .43    50%    95%     86

Table 2: Inter- and intra-annotator reliability, measured by linear weighted kappa (WK), percent full agreement (%FA), and full plus within 1-point agreement (%FA+1). Intra-annotator reliability was calculated for the narratives rated twice, and inter-annotator reliability on the initial ratings.
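As a minimal sketch of how the reliability figures in Table 2 can be computed, the following Python snippet derives linear weighted kappa, full agreement, and within-one-point agreement from two parallel lists of 1–4 ratings. It uses scikit-learn's cohen_kappa_score and is illustrative rather than the authors' actual evaluation code; the rating values shown are invented:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings on the 4-point scale (1=I, 2=BI, 3=BA, 4=A).
a1 = np.array([1, 2, 3, 4, 2, 3, 3, 1])
a2 = np.array([1, 2, 4, 4, 1, 3, 2, 1])

wk = cohen_kappa_score(a1, a2, weights="linear")   # linear weighted kappa (WK)
full_agreement = np.mean(a1 == a2)                 # %FA
within_one = np.mean(np.abs(a1 - a2) <= 1)         # %FA+1

print(f"WK={wk:.2f}  %FA={full_agreement:.0%}  %FA+1={within_one:.0%}")
```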

As shown in Table 2, reliability was moderate to good (Altman, 1991), and inter-annotator agreement was well above chance (25%). Indeed, annotators were in full agreement, or agreed within one rating on the continuum, on over 90% of narratives. This pattern reveals fuzzy category boundaries but sufficient regularity so as to be measurable. This is in line with subjective natural language phenomena, and may be a consequence of imposing discrete categories on a continuum.5

Annotator 1 had better intra-annotator reliability, perhaps due to differences in annotation strategy.

5 Nonetheless, affect research has shown that scalar representations are not immune to variation issues (Alm, 2009).

5 Annotator Strategy Analysis

Five questionnaires evenly spaced among the narratives asked annotators to rate how often they used various factors in judging decision style (Table 3). Factors were chosen based on discussion with the annotators after the pilot, and referred to in descriptions of decision styles in the annotator instructions; the descriptions were based on characteristics of each style in the cognitive psychology literature (e.g., Evans, 2008). Factors with high variability (SD columns in Table 3) reveal changes in annotator strategy over time, and factors that may influence intra-annotator reliability.

Both annotators reported using the rel. (similarity) of final & first-mentioned diagnosis, as well as perceived attitude, perceived confidence, and use of justification, to rate most narratives. Types of processing were used by both sometimes; this is important since these are central to the definitions of decision style in decision-making theory.

Differences in strategies allow for the assessment of annotators' individual preferences. Annotator 1 often considered the no. of diagnoses in the differential, and rel. timing of the differential, but Annotator 2 rarely attended to them; the opposite pattern occurred with respect to switching between decision styles, and the timing of the switch.

The shared high factors reveal those consistently linked to interpreting decision style, despite


the concept's fuzzy boundaries. In contrast, the idiosyncratic high factors reveal starting points for understanding fuzzy perception, and for further calibrating inter-annotator reliability.

6 Narrative Case Study

Examining particular narratives is also instructive. Of the 86 duplicated narratives with two ratings per annotator, extreme agreement occurred for 22 cases (26%), meaning that all four ratings were exactly the same.6 Figure 3 (top) shows such a case of intuitive reasoning: a quick decision without reflection or discussion of the differential. Figure 3 (middle) shows a case of analytical reasoning: consideration of alternatives and logical inference.

Figure 3: Narratives for which annotators were in full agreement on I (top) and A (middle) ratings, vs. in extreme disagreement (bottom).

In the full data set (initial ratings), there were 50 cases (6%) of 2-point inter-annotator disagreement and one case of 3-point inter-annotator disagreement (Figure 3, bottom). This latter narrative was produced by an attending (experienced physician), 40% confident and incorrect in the final diagnosis. Annotator 1 rated it analytical, while Annotator 2 rated it intuitive. This is in line with Annotator 1's preference for analytical ratings (Table 1). Annotator 1 may have viewed this pattern of observation → conclusion as logical reasoning, characteristic of analytical reasoning. Annotator 2 may instead have interpreted the phrase it's so purple it makes me think of a vascular tumor...so i think [...] as intuitive, due to the makes me think comment, indicating associative reasoning, characteristic of intuitive thinking. This inter-annotator contrast may reflect Annotator 1's greater reported use of the factor logical rules and inference (Table 3).

6 There were no cases where all four labels differed, further emphasizing the phenomenon's underlying regularity.

7 Physician Profiles of Decision Style

Annotations were also used to characterize physicians' preferred decision style. A decision score was calculated for each physician as follows:

d_p = \frac{1}{2n} \sum_{i=1}^{n} \left( r^{A1}_{i} + r^{A2}_{i} \right) \qquad (1)

where p is a physician, r is a rating, n is the total number of images, and A1, A2 the annotators. Annotators' initial ratings were summed – from 1 for Intuitive to 4 for Analytical – for all image cases for each physician, and divided by 2 times the number of images, to normalize the score to a 4-point scale. Figure 4 shows the distribution of decision scores across residents and experienced attendings.
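A minimal sketch of Equation 1 in Python, assuming each physician's narratives carry the two annotators' initial ratings on the 1–4 scale (the example numbers are invented):

```python
def decision_score(ratings_a1, ratings_a2):
    """Equation 1: average the two annotators' 1-4 ratings over all of a
    physician's image cases, keeping the result on the 4-point scale."""
    assert len(ratings_a1) == len(ratings_a2)
    n = len(ratings_a1)
    return sum(r1 + r2 for r1, r2 in zip(ratings_a1, ratings_a2)) / (2 * n)

# Example: a physician rated mostly analytical by both annotators.
print(decision_score([3, 4, 3, 2], [4, 4, 3, 3]))  # 3.25
```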

Residents exhibit greater variability in decision style. While this might reflect that residents were the majority group, it suggests that differences in expertise are linked to decision styles; such differences hint at the potential benefits that could come from preparing clinical trainees to self-monitor their use of decision style. Interestingly, the overall distribution is skewed, with a slight preference for analytical decision-making, and especially so for attendings. This deserves future attention.

Figure 4: Decision score distribution by expertise.

8 Conclusion

This study exploited two layers of expertise: physicians produced diagnostic narratives, and trained cognitive psychologists annotated for decision style. This work also highlights the importance of understanding annotator strategy, and factors influencing annotation, when fuzzy categories are involved. Future work will examine the links between decision style, expertise, and diagnostic accuracy or difficulty.


Acknowledgements

Work supported by a CLA Faculty Dev. grant, Xerox award, and NIH award R21 LM01002901. Many thanks to annotators and reviewers.

This content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

Alm, C. O. (2009). Affect in text and speech. Saarbrucken: VDM Verlag.

Alm, C. O. (2011, June). Subjective natural language problems: Motivations, applications, characterizations, and implications. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2 (pp. 107-112). Association for Computational Linguistics.

Altman, D. (1991). Practical statistics for medical research. London: Chapman and Hall.

Coderre, S., Mandin, H., Harasym, P. H., & Fick, G. H. (2003). Diagnostic reasoning strategies and diagnostic success. Medical Education, 37(8), 695-703.

Cohen, J. (1968). Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213-220.

Crowley, R. S., Legowski, E., Medvedeva, O., Reitmeyer, K., Tseytlin, E., Castine, M., ... & Mello-Thoms, C. (2013). Automated detection of heuristics and biases among pathologists in a computer-based system. Advances in Health Sciences Education, 18(3), 343-363.

Croskerry, P. (2003). The importance of cognitive errors in diagnosis and strategies to minimize them. Academic Medicine, 78(8), 775-780.

Croskerry, P. (2009). A universal model of diagnostic reasoning. Academic Medicine, 84(8), 1022-1028.

Evans, J. (2008). Dual-processing accounts of reasoning, judgement and social cognition. Annual Review of Psychology, 59, 255-278.

Hamm, R. M. (1988). Clinical intuition and clinical analysis: Expertise and the cognitive continuum. In J. Dowie & A.S. Elstein (Eds.), Professional judgment: A reader in clinical decision making (pp. 78-105). Cambridge, England: Cambridge University Press.

Hammond, K. R. (1981). Principles of organization in intuitive and analytical cognition (Report #231). Boulder, CO: University of Colorado, Center for Research on Judgment & Policy.

Hampton, J. A. (1998). Similarity-based categorization and fuzziness of natural categories. Cognition, 65(2), 137-165.

Kahneman, D., & Frederick, S. (2002). Representativeness revisited: Attribute substitution in intuitive judgment. In T. Gilovich, D. Griffin, & D. Kahneman (Eds.), Heuristics of intuitive judgment: Extensions and applications (pp. 49-81). New York, NY: Cambridge University Press.

Lovgren, T., & Doorn, J. V. (2005). Influence of manipulation of short silent pause duration on speech fluency. In Proceedings of Disfluency in Spontaneous Speech Workshop (pp. 123-126). International Speech Communication Association.

McCoy, W., Alm, C. O., Calvelli, C., Li, R., Pelz, J. B., Shi, P., & Haake, A. (2012, July). Annotation schemes to encode domain knowledge in medical narratives. In Proceedings of the 6th Linguistic Annotation Workshop (pp. 95-103). Association for Computational Linguistics.

Norman, G. (2009). Dual processing and diagnostic errors. Advances in Health Sciences Education, 14(1), 37-49.

Pennebaker, J. W., & King, L. A. (1999). Linguistic styles: Language use as an individual difference. Journal of Personality and Social Psychology, 77(6), 1296-1312.

Stanovich, K. E., & West, R. F. (2000). Individual differences in reasoning: Implications for the rationality debate? Behavioral and Brain Sciences, 23, 645-665.

Womack, K., Alm, C. O., Calvelli, C., Pelz, J. B., Shi, P., and Haake, A. (2013, August). Using linguistic analysis to characterize conceptual units of thought in spoken medical narratives. In Proceedings of Interspeech 2013 (pp. 3722-3726). International Speech Communication Association.


Proceedings of the 2014 Workshop on Biomedical Natural Language Processing (BioNLP 2014), pages 88–92, Baltimore, Maryland, USA, June 26-27 2014. ©2014 Association for Computational Linguistics

Temporal Expressions in Swedish Medical Text – A Pilot Study

Sumithra Velupillai
Department of Computer and Systems Sciences
Stockholm University, Sweden

[email protected]

Abstract

One of the most important features of health care is to be able to follow a patient's progress over time and identify events in a temporal order. We describe initial steps in creating resources for automatic temporal reasoning of Swedish medical text. As a first step, we focus on the identification of temporal expressions by exploiting existing resources and systems available for English. We adapt the HeidelTime system and manually evaluate its performance on a small subset of Swedish intensive care unit documents. On this subset, the adapted version of HeidelTime achieves a precision of 92% and a recall of 66%. We also extract the most frequent temporal expressions from a separate, larger subset, and note that most expressions concern parts of days or specific times. We intend to further develop resources for temporal reasoning of Swedish medical text by creating a gold standard corpus also annotated with events and temporal links, in addition to temporal expressions and their normalised values.

1 Introduction

One of the most important features of health care is to be able to follow patient progress over time and identify clinically relevant events in a temporal order. In medical records, temporal information is stored with explicit timestamps, but it is also documented in free text in the clinical narratives. To meet our overall goal of building accurate and useful information extraction systems in the health care domain, our aim is to build resources for temporal reasoning in Swedish clinical text. For instance, in the example sentence MR-undersokningen av skallen igar visade att den va-sidiga forandringen i thalamus minskat i volym (“The MRI-scan of the skull yesterday showed that the left (abbreviated) side change in thalamus has decreased in volume”), a temporal reasoning system should extract the event (MRI-scan of the skull) and the temporal expression (yesterday), and be able to normalise the time expression to a specific date and classify the temporal relation.

In this pilot study we focus on the identification of temporal expressions, utilising existing resources and systems available for English. A temporal expression is defined as any mention of dates, times, durations, and frequencies, e.g. “April 2nd”, “10:50am”, “five hours ago”, and “every 2 hours”. When successfully identifying such expressions, subsequent anchoring in time is made possible.

Although English and Swedish are both Germanic languages, there are some differences that are important to take into account when adapting existing solutions developed for English to Swedish, e.g. Swedish is more inflected and makes heavier use of compounding than English.

The purpose of this study is to initiate our work on temporal reasoning for Swedish, and to evaluate existing solutions adapted to Swedish. These are our first steps towards the creation of a reference standard that can be used for evaluation of future systems.

2 Background

Temporal reasoning has been the focus of several international natural language processing (NLP) challenges in the general domain such as ACE1, TempEval-2 and 3 (Verhagen et al., 2010; UzZaman et al., 2013), and in the clinical domain through the 2012 i2b2 challenge (Sun et al., 2013). Most previous work has been performed on

1 http://www.itl.nist.gov/iad/mig/tests/ace/


English documents, but the TempEval series have also included other languages, e.g. Spanish. For temporal modelling, the TimeML (Pustejovsky et al., 2010) guidelines are widely used. The TimeML standard denotes events (EVENT), temporal expressions (TIMEX3) and temporal relations (TLINK).

For English, several systems have been developed for all or some of these subtasks, such as the TARSQI Toolkit (Verhagen et al., 2005) and SUTime (Chang and Manning, 2012). Both these tools are rule-based, and rely on regular expressions and gazetteers. The TARSQI Toolkit has also been developed for the clinical domain: Med-TTK (Reeves et al., 2013).

In other domains, and for other languages, HeidelTime (Strotgen and Gertz, 2012) and TIMEN (Llorens et al., 2012) are examples of other rule-based systems. These are also developed to be easily extendable to new domains and languages. HeidelTime ranked first in the TempEval-3 challenge on TIMEX3:s, resulting in an F1 of 77.61 for the task of correctly identifying and normalising temporal expressions.

HeidelTime was also used in several participating systems in the i2b2 challenge (Lin et al., 2013; Tang et al., 2013; Grouin et al., 2013) with success. Top results for correctly identifying and normalising temporal expressions in the clinical domain are around 66 F1 (Sun et al., 2013). The system has also been adapted for French clinical text (Hamon and Grabar, 2014).

3 Methods

The HeidelTime system was chosen for the initial development of a Swedish temporal expression identifier. Given that its architecture is designed to be easily extendible for other languages as well as domains, and after reviewing alternative existing systems, we concluded that it was suitable for this pilot study.

3.1 Data

We used medical records from an intensive care unit (ICU) from the Stockholm EPR Corpus, a clinical database from the Stockholm region in Sweden2 (Dalianis et al., 2012). Each medical record (document) contains all entries (notes) about one patient a given day. The document contains notes written by both physicians and nurses. They also contain headings (e.g. Daganteckning (“Daily note”), Andning (“Breathing”)) and timestamps for when a specific note/heading was recorded in the medical record system. These are excluded in this analysis.

2 Study approved by the Regional Ethical Review Board in Stockholm (Etikprovningsnamnden i Stockholm), permission number 2012/834-31/5.

Three subsets from this ICU dataset were used: 1) two randomly selected documents were used for analysing and identifying domain specific time expressions and regular expressions to be added in the adaptation of HeidelTime (development set), 2) a random sample of ten documents was used for manual analysis and evaluation (test set), and 3) a set of 100 documents was also extracted for the purpose of empirically studying the types of temporal expressions found in the data by the adapted system (validation set).

3.2 Adaptation of HeidelTime and Evaluation

The available resources (keywords and regular expression rules) in the HeidelTime system were initially translated automatically (Google translate3) and manually corrected. Regular expressions were modified to handle Swedish inflections and other specific traits. An initial analysis on two separate, randomly selected ICU notes (development set) was performed, as a first step in adapting for both the Swedish language and the clinical domain.

Results on the system performance were manually evaluated on the test set by one computational linguistics researcher by analysing system outputs: adding annotations when the system failed to identify a temporal expression, and correcting system output errors. A contingency table was created for calculating precision, recall and F1, the main outcome measures. Moreover, the top most frequent temporal expressions found by the system on a separate set were extracted (validation set), for illustration and analysis purposes.

4 Results

We report general statistics for the ICU corpus, results from the adaptation and evaluation of HeidelTime for Swedish (HTSwe) on the test set, and the most frequent temporal expressions found by HTSwe in a separate set of 100 ICU documents (validation set).

3 http://translate.google.se


4.1 Data: ICU corpus

General statistics for the test set used in this study are shown in Table 1. On average, each document consists of 54.6 sentences, and each sentence contains on average 8.7 tokens (including punctuation). We observe that some sentences are very short (min = 1), and there is great variability in length, as can be seen through the standard deviation.

                      #      min - max   avg ± std
Sentences/document    540    35 - 80     54.6 ± 14.1
Tokens/sentence       4749   1 - 52      8.7 ± 5.7

Table 1: General statistics for the test set (ten ICU documents) used in this study. Minimum, maximum, average and standard deviation for sentences per document and tokens (including punctuation) per sentence.

4.2 Adaptation and evaluation of HeidelTime: HTSwe

The main modifications required in the adaptation of HeidelTime to Swedish (HTSwe) involved handling definite articles and plurals, e.g. adding eftermiddag(en)?(ar)?(na)? (“afternoon”, “the afternoon”/“afternoons”/“the afternoons”). From the analysis of the small development set, some abbreviations were also added, e.g. em (“afternoon”). Regular expressions for handling typical ways dates are written in Swedish were added, e.g. “020812” and “31/12 -99” (day, month, year). In order to avoid false positives, a rule for handling measurements that could be interpreted as years (e.g. 1900 ml) was also added (a negative rule).
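The Python snippet below sketches the kinds of patterns described above; it illustrates the idea rather than reproducing HeidelTime's actual resource format, and the pattern names are invented for this example:

```python
import re

# Inflected forms of "eftermiddag" (afternoon): definite article and plurals.
AFTERNOON = re.compile(r"\beftermiddag(en)?(ar)?(na)?\b", re.IGNORECASE)

# Compact Swedish date formats such as "020812" and "31/12 -99", plus years.
DATE_YYMMDD = re.compile(r"\b\d{6}\b")
DATE_SLASH = re.compile(r"\b\d{1,2}/\d{1,2}\s*-?\d{2}\b")
YEAR = re.compile(r"\b(19|20)\d{2}\b")

# Negative rule: a year-like number followed by a unit (e.g. "1900 ml")
# is a measurement, not a temporal expression.
MEASUREMENT = re.compile(r"\b(19|20)\d{2}\s*ml\b", re.IGNORECASE)

def find_temporal(text):
    """Return date/time matches, skipping spans covered by the negative rule."""
    blocked = [m.span() for m in MEASUREMENT.finditer(text)]
    hits = []
    for pattern in (DATE_YYMMDD, DATE_SLASH, YEAR, AFTERNOON):
        for m in pattern.finditer(text):
            if not any(b[0] <= m.start() < b[1] for b in blocked):
                hits.append(m.group())
    return hits

print(find_temporal("Op 020812, eftermiddagen, urin 1900 ml"))
# ['020812', 'eftermiddagen']
```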

Results from running HTSwe on the test set are shown in Table 2. HTSwe correctly identified 105 temporal expressions, but missed 55 expressions that should have been marked, and classified 9 expressions erroneously. In total, there are 160 TIMEX3s. Overall performance was 92% precision, 66% recall and F1 = 77%.

The main errors were due to faulty regular expressions for times, e.g. 13-tiden (“around 13 PM”), and missing keywords such as dygn (“day” - a word to indicate a full day, i.e. 24 hours) and lunchtid (“around lunch”). Some missing keywords were specific for the clinical domain, e.g. efternatten (“the after/late night”, typical for shift indication). There were also some partial errors. For instance, i dag (“today”) was only included with the spelling idag in the system, thus generating a TIMEX3 output only for dag.

                  TIMEX3 (Annotator)   Other (Annotator)   Σ
TIMEX3 (HTSwe)    105                  9                   114
Other (HTSwe)     55                   4580                4635
Σ                 160                  4589                4749

Table 2: Contingency table, TIMEX3 annotations by the annotator and the adapted HeidelTime system for Swedish (HTSwe) on the test set. “Other” means all other tokens in the corpus. These results yield a precision of 92%, a recall of 66%, and F1 = 77% for HTSwe.
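As a quick check of the reported figures, the precision, recall and F1 in Table 2 follow directly from the contingency counts; the snippet below is a minimal illustration, not the evaluation code used in the study:

```python
tp, fp, fn = 105, 9, 55      # counts from Table 2

precision = tp / (tp + fp)   # 105 / 114 ≈ 0.92
recall = tp / (tp + fn)      # 105 / 160 ≈ 0.66
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.77

print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```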

On the validation set, 168 unique time expressions were found by the system, and 1,178 in total. The most frequent expressions all denote parts of days, e.g. idag (“today”), nu (“now”), and natten (“the night”), see Table 3. Specific times (mostly specific hours) were also very common. Thus, there were many translated expressions in the HeidelTime system that never occurred in the data.

TIMEX3                            N      %
idag (“today”)                    164    14%
nu (“now”)                        132    11%
natten (“the night”)              117    10%
morgonen (“the morning”)          96     8%
em (“afternoon”, abbreviated)     82     7%
kvallen (“the evening”)           74     6%
igar (“yesterday”)                49     4%
fm (“morning”, abbreviated)       34     3%
morgon (“morning”)                30     3%
natt (“night”)                    26     2%
Total                             1178   100%

Table 3: Most frequent (top ten, descending order) TIMEX3s found by HTSwe on the validation set (100 ICU documents). Total = all TIMEX3:s found by HTSwe in the entire validation set. There were 168 unique TIMEX3s in the validation set.

5 Discussion and Conclusion

We perform an initial study on automatic identification of temporal expressions in Swedish clinical


text by translating and adapting the HeidelTime system, and evaluating performance on Swedish ICU records. Results show that precision is high (92%), which is promising for our future development of a temporal reasoning system for Swedish. The main errors involve regular expressions for time and some missing keywords; these expressions will be added in our next iteration of this work. Our results, F1 = 77%, are lower than state-of-the-art systems for English clinical text, where the top-performing system in the 2012 i2b2 Challenge achieved 90% F1 for TIMEX3 spans (Sun et al., 2013). However, given the small size of this study, results are encouraging, and we have created a baseline system which can be used for further improvements.

The adaptation and translation of HeidelTime involved extending regular expressions and rules to handle Swedish inflections and specific ways of writing dates and times. Through a small, initial analysis on a development set, some further additions and modifications were made, which led to the correct identification of common TIMEX3s present in this type of document. A majority of the expressions translated from the original system were not found in the data. Hence, it is worthwhile analysing a small subset to inform the adaptation of HeidelTime.

The ICU notes are an interesting and suitable type of documentation for temporal reasoning studies, as they contain notes on the progress of patients in constant care. However, from the results it is evident that the types of TIMEX3 expressions are rather limited and mostly refer to parts of days or specific times. Moreover, as recall was lower (66%), there is clearly room for improvement. We plan to extend our study to also include other report types.

5.1 Limitations

There are several limitations in this study. The corpus is very small, and evaluated only by one annotator, which limits the conclusions that can be drawn from the analysis. For the creation of a reference standard, we plan to involve at least one clinician, in order to get validation from a domain expert, and to be able to calculate inter-annotator agreement. The size of the corpus will also be increased. We have not evaluated performance on TIMEX3 normalisation, which, of course, is crucial for an accurate temporal reasoning system.

For instance, we have not considered the category Frequency, which is essential in the clinical domain to capture e.g. medication instructions and dosages. Moreover, we have not annotated and evaluated events. This is perhaps the most important part of a temporal reasoning system. We plan to utilise existing named entity taggers developed in our group as a pre-annotation step in the creation of our reference standard. The last step involves annotating temporal links (TLINK) between events and TIMEX3:s. We believe that part-of-speech (PoS) and/or syntactic information will be a very important component in an end-to-end system for this task. We plan to tailor an existing Swedish PoS tagger, to better handle Swedish clinical text.

5.2 Conclusion

Our main finding is that it is feasible to adapt HeidelTime to the Swedish clinical domain. Moreover, we have shown that parts of days and specific times are the most frequent temporal expressions in Swedish ICU documents.

This is the first step towards building resources for temporal reasoning in Swedish. We believe these results are useful for our continued endeavour in this area. Our next step is to add further keywords and regular expressions to improve recall, and to evaluate TIMEX3 normalisation. Following that, we will annotate events and temporal links.

To our knowledge, this is the first study on temporal expression identification in Swedish clinical text. All resulting gazetteers and guidelines in our future work on temporal reasoning in Swedish will be made publicly available.

Acknowledgments

The author wishes to thank the anonymous reviewers for invaluable comments on this manuscript. Thanks also to Danielle Mowery and Dr. Wendy Chapman for all their support. This work was partially funded by Swedish Research Council (350-2012-6658) and Swedish Fulbright Commission.

References

Angel X. Chang and Christopher Manning. 2012. SUTime: A library for recognizing and normalizing time expressions. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uur Doan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, May. European Language Resources Association (ELRA).

Hercules Dalianis, Martin Hassel, Aron Henriksson, and Maria Skeppstedt. 2012. Stockholm EPR Corpus: A Clinical Database Used to Improve Health Care. In Pierre Nugues, editor, Proc. 4th SLTC, 2012, pages 17–18, Lund, October 25-26.

Cyril Grouin, Natalia Grabar, Thierry Hamon, Sophie Rosset, Xavier Tannier, and Pierre Zweigenbaum. 2013. Eventual situations for timeline extraction from clinical reports. JAMIA, 20:820–827.

Thierry Hamon and Natalia Grabar. 2014. Tuning HeidelTime for identifying time expressions in clinical texts in English and French. In Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi), pages 101–105, Gothenburg, Sweden, April. Association for Computational Linguistics.

Yu-Kai Lin, Hsinchun Chen, and Randall A. Brown. 2013. MedTime: A temporal information extraction system for clinical narratives. Journal of Biomedical Informatics, 46:20–28.

Hector Llorens, Leon Derczynski, Robert Gaizauskas, and Estela Saquete. 2012. TIMEN: An Open Temporal Expression Normalisation Resource. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uur Doan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, May. European Language Resources Association (ELRA).

James Pustejovsky, Kiyong Lee, Harry Bunt, and Laurent Romary. 2010. ISO-TimeML: An International Standard for Semantic Annotation. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association (ELRA).

Ruth M. Reeves, Ferdo R. Ong, Michael E. Matheny, Joshua C. Denny, Dominik Aronsky, Glenn T. Gobbel, Diane Montella, Theodore Speroff, and Steven H. Brown. 2013. Detecting temporal expressions in medical narratives. International Journal of Medical Informatics, 82:118–127.

Jannik Strotgen and Michael Gertz. 2012. Temporal Tagging on Different Domains: Challenges, Strategies, and Gold Standards. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3746–3753. ELRA.

Weiyi Sun, Anna Rumshisky, and Ozlem Uzuner. 2013. Evaluating temporal relations in clinical text: 2012 i2b2 Challenge. JAMIA, 20(5):806–813.

Buzhou Tang, Yonghui Wu, Min Jiang, Yukun Chen, Joshua C Denny, and Hua Xu. 2013. A hybrid system for temporal information extraction from clinical text. JAMIA, 20:828–835.

Naushad UzZaman, Hector Llorens, Leon Derczynski, James Allen, Marc Verhagen, and James Pustejovsky. 2013. SemEval-2013 Task 1: TempEval-3: Evaluating Time Expressions, Events, and Temporal Relations. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 1–9, Atlanta, Georgia, USA, June. Association for Computational Linguistics.

Marc Verhagen, Inderjeet Mani, Roser Sauri, Robert Knippen, Seok Bae Jang, Jessica Littman, Anna Rumshisky, John Phillips, and James Pustejovsky. 2005. Automating Temporal Annotation with TARSQI. In Proceedings of the ACL 2005 on Interactive Poster and Demonstration Sessions, ACLdemo '05, pages 81–84, Stroudsburg, PA, USA. Association for Computational Linguistics.

Marc Verhagen, Roser Sauri, Tommaso Caselli, and James Pustejovsky. 2010. SemEval-2010 Task 13: TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval '10, pages 57–62, Stroudsburg, PA, USA. Association for Computational Linguistics.


Proceedings of the 2014 Workshop on Biomedical Natural Language Processing (BioNLP 2014), pages 93–97, Baltimore, Maryland, USA, June 26-27 2014. ©2014 Association for Computational Linguistics

A repository of semantic types in the MIMIC II database clinical notes

Richard M. Osborne
Computational Bioscience
University of Colorado School of Medicine

[email protected]

Alan R. Aronson
National Library of Medicine

Bethesda, [email protected]

K. Bretonnel Cohen
Computational Bioscience
University of Colorado School of Medicine

[email protected]

Abstract

The MIMIC II database contains 1,237,686 clinical documents of various kinds. A common task for researchers working with this database is to run MetaMap, which uses the UMLS Metathesaurus, on those documents to identify specific semantic types of entities mentioned in them. However, this task is computationally expensive and time-consuming. Research in many groups could be accelerated if there were a community-accessible set of outputs from running MetaMap on this document collection, cached and available on the MIMIC-II website. This paper describes a repository of all MetaMap output from the MIMIC II database, publicly available, assuming compliance with usage agreements required by UMLS and MIMIC-II. Additionally, software for manipulating MetaMap output, available on SourceForge with a liberal Open Source license, is described.

1 Introduction

1.1 The MIMIC II database and its textual contents

The Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II) database is a public-access intensive care unit database that contains a broad array of information for over 33,000 patients. The data were collected over a 7-year period, beginning in 2001, from Boston's Beth Israel Deaconess Medical Center (Saeed et al., 2011; Goldberger et al., 2000).

Of particular interest are the 1,237,686 clinical documents, which are broadly classified into the following four groups: MD notes, discharge summaries, radiology reports and nursing/other. Each free-text note contains information describing such things as a given patient's health, illnesses, treatments and medications, among others.

1.2 Motivation for the resource: MetaMap runtimes

Part of the motivation for making this resource publicly available is that considerable resources must be expended to process it; if multiple groups can share the output of one processing run, the savings across the community as a whole could be quite large. To illustrate why this would be valuable from a resources perspective, we provide here some statistics on the performance of MetaMap.

Random samples of each category (10% each) were chosen and Monte Carlo simulation was performed (1,000 iterations per note) to obtain the running times presented below. The clinical notes ranged from a minimum of 0 words to a maximum of 6,684 (some of the notes were 0 bytes because the note for a particular patient and day contained no text). The mean, median and mode per document processed by MetaMap were 17, 5 and 2 seconds, respectively, with a minimum of 1 and a maximum of 216 seconds.
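To illustrate how per-note timings of this kind might be collected, the following sketch (our own illustration, not the authors' benchmarking code) times MetaMap over a random 10% sample of note files and reports summary statistics; the directory layout, the metamap13 binary name and the chosen flags are assumptions.

import glob
import random
import statistics
import subprocess
import time

# Hypothetical layout: one plain-text note per file (see Section 2.2).
notes = glob.glob("notes/*.txt")
sample = random.sample(notes, max(1, len(notes) // 10))

timings = []
for note in sample:
    start = time.time()
    # Assumed MetaMap invocation; adjust the binary name and flags to your install.
    subprocess.run(["metamap13", "--XMLf", "--silent", note],
                   check=True, capture_output=True)
    timings.append(time.time() - start)

print("mean  :", statistics.mean(timings))
print("median:", statistics.median(timings))
print("max   :", max(timings))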

Figure 1 below plots the number of words against processing time in seconds for each of the notes, sampled as mentioned above.

The majority of the processing was done on a Sun Fire X4600 M2 server with 16 cores (4 x Quad-Core AMD Opteron(tm) Processor 8356, 2.3 GHz), 128 GB memory and 12 TB of disk storage, currently running Fedora Core 17 Linux. (An Apple MacBook Pro and a Windows desktop server were also used to speed processing. The analysis of the random sample of notes was performed in its entirety on the Sun machine, thereby providing consistent results for the data in Figure 1.)


Figure 1: MetaMap Runtimes

1.3 Motivation for the resource: reproducibility

Any large-scale run of MetaMap over a huge document collection will have occasional failures, etc. The odds of any two runs having the same output are therefore slim. Moreover, there is potential variability in how documents are preprocessed for use with MetaMap. Using this repository of MetaMap outputs will ensure reproducibility of experiments and also preclude the necessity of performing the same preparatory work and MetaMap processing on the same data.

1.4 Motivation for the resource: semantic types

The creation of the MIMIC-II repository is an intermediate step in our research. We are extracting the semantic types found in each clinical note in an attempt to determine if there exists evidence of subdomains across the categories used by MIMIC-II to group the notes.

2 Materials and Methods

2.1 Materials

MetaMap is a program developed at the National Library of Medicine (NLM) that maps biomedical concepts to the UMLS Metathesaurus and reports on the corresponding semantic types.1 The program is used extensively by researchers in the field of biomedical text mining (see Aronson, 2001; Aronson and Lang, 2010).

1 Users of MetaMap must comply with the UMLS Metathesaurus license agreement (https://uts.nlm.nih.gov/license.html).

Although our focus is on the clinical notes contained in a single table, noteevents, MIMIC-II is both a relational database (PostgreSQL 9.1.9) containing 39 tables of clinical data and bedside monitor waveforms and the associated derived parameters and events stored in flat binary files (with ASCII header descriptors). For each Intensive Care Unit (ICU) patient, Saeed et al. (2011) collected a wide range of data including, inter alia, laboratory data, therapeutic intervention profiles, MD and nursing progress notes, discharge summaries, radiology reports, International Classification of Diseases, 9th Revision codes, and, for a subset of patients, high-resolution vital sign trends and waveforms. All data were scrubbed for personal information to ensure compliance with the Health Insurance Portability and Accountability Act (HIPAA). These data were then uploaded to a relational database, thereby allowing for easy access to extensive information for each patient's stay in the ICU (Saeed et al., 2011). A more detailed description of the use of the MIMIC-II database may be found in (Clifford et al., 2012).

The abbreviated schema in Table 1 below shows that each ID uniquely identifies a note along with a SubjectID, a Category and Text.2 We added the ID attribute to noteevents as a primary key because SubjectID, Category and Text are not keys. Thus, a particular patient might have many notes and many categories, but each note is uniquely identified.

Attribute    Type                    Cardinality      Sample Values
ID           integer                 unique           1, 2, 3, 4...
SubjectID    integer                 many to one ID   95, 100, 100, 99
Category     character varying(26)   many to one ID   radiology
Text         text                    many to one ID   interval placement of ICD

Table 1: Schema of the MIMIC-II noteevents table.

As mentioned above, the notes in the MIMIC-II database are categorized as MD reports, radiology reports, discharge summaries and nursing/other reports. The contents of these notes varied greatly. The MD and nursing notes tended to be short and unstructured, with a number of abbreviations and misspellings, whereas the radiology reports were longer, more structured and showed fewer errors.

The MIMIC-II version used for this research is 2.6 (April 2011; 32,536 subjects).

The distribution of reports with summary statistics is shown below in Table 2.

2 The full noteevents table has ten other attributes such as admission date, various timestamps and patient information, but these were not relevant to our research.


               MD      Discharge  Radiology  Nursing   Totals
min words      0       0          0          0         0
max words      632     6684       2760       632       6684
median words   108     963.5      174        108       135
average words  131.6   1009.2     265.5      131.6     194.7
total notes    23,270  31,877     383,701    798,838   1,237,686

Table 2: MIMIC-II clinical note summary statistics.

2.2 Methods

We used MetaMap to process the clinical notes in order to find semantic concepts, the latter of which are being used in our current research. For the work in this paper, we used MetaMap 2013 with the 2013AB database.

Before processing the notes with MetaMap, a number of preparatory steps were taken. As mentioned above, a primary key was added to the noteevents table to provide a unique id for each note. A Python script then queried the database, extracting each note and storing it in a file named according to the following convention: uniqueID_subjectID_category.txt, where uniqueID is the primary key value from the noteevents table, subjectID is the unique number assigned to each patient and category is one of the four categories mentioned above.
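A minimal sketch of such an extraction script is shown below. This is our own illustration rather than the authors' code: the psycopg2 dependency and the connection parameters are assumptions, while the table, columns and file-naming convention follow the description above.

import psycopg2

# Assumed connection settings; adjust to your local MIMIC-II installation.
conn = psycopg2.connect(dbname="mimic2", user="mimic", password="secret")
cur = conn.cursor()

# ID, SubjectID, Category and Text as described in Table 1.
cur.execute("SELECT ID, SubjectID, Category, Text FROM noteevents")
for note_id, subject_id, category, text in cur:
    filename = "{}_{}_{}.txt".format(note_id, subject_id, category)
    with open(filename, "w") as out:
        out.write(text or "")

cur.close()
conn.close()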

Each of the notes was then processed by a Bash shell script to remove blank lines and control characters. (This important step was added after a significant amount of processing had already taken place. If this is not done, a number of problems arise when running MetaMap.) Finally, all files with 0 bytes were removed. These files were present because many tuples in the noteevents table contained clinical note entries with no data.
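The authors describe this cleanup as a Bash script; an equivalent sketch in Python (our own, with the exact character ranges as assumptions) might look as follows.

import glob
import os
import re

# Strip control characters and blank lines from each extracted note,
# then delete any files that end up empty.
control_chars = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
for path in glob.glob("*.txt"):
    with open(path, "r", errors="ignore") as f:
        lines = [control_chars.sub("", line) for line in f]
    lines = [line for line in lines if line.strip()]
    with open(path, "w") as f:
        f.writelines(lines)
    if os.path.getsize(path) == 0:
        os.remove(path)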

The number of options available when running MetaMap is considerable, so we chose those that would provide a full and robust result set which would be useful to a wide range of researchers. In our first run, we limited the threshold for the Candidate Score to 1,000. However, for the repository, no threshold was set so that a full range of output is provided.3

The output is in XML in order to structure the data systematically and provide an easier and consistent way to parse the data. Although we chose XML initially, we intend to provide the same data in plain text and Prolog formats, again to provide utility to a broad range of researchers.

In order to process all files, a Bash shell script was created that called MetaMap on each note and created a corresponding XML file, named according to the same convention as that for notes but with the txt extension replaced by xml.

3 The exact MetaMap command we used was: metamap13 --XMLf --silent --blanklines 3 filename.txt

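Such a driver could be as simple as the following sketch (ours, not the authors' script); it assumes the cleaned note files sit in the current directory and reuses the command from footnote 3, with the output filename passed as a second positional argument.

import glob
import subprocess

# Run MetaMap on every cleaned note, writing one XML file per note.
for txt in glob.glob("*.txt"):
    xml = txt[:-4] + ".xml"
    subprocess.run(["metamap13", "--XMLf", "--silent", "--blanklines", "3", txt, xml],
                   check=True)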

3 Results

3.1 The repository of MetaMap output

The repository for the MetaMap output contains an XML file for each note that originally contained text in the MIMIC-II database. Each XML file contains a wealth of information about each note, and a discussion of this is beyond the scope of this paper (see http://metamap.nlm.nih.gov/Docs/MM12_XML_Info.shtml).

For our research, we are interested in the semantic types associated with phrases identified by MetaMap. Below is a section from output file 768591_19458_discharge.xml. This is a discharge summary for subject 19458 with a unique note id of 768591. The note contained the phrase "Admission Date", which MetaMap matched with a candidate score of 1000 and indicated is a temporal concept (tmco).

Ultimately, the MetaMap output files will be uploaded to the PhysioNet website and made available to the public.4 The files will be organized in a fashion similar to the original data files on the site. Namely, data are grouped by subject ids and compressed in archives with approximately 1,000 files each.

<Candidate>
  <CandidateScore>-1000</CandidateScore>
  <CandidateCUI>C1302393</CandidateCUI>
  <CandidateMatched>Admission date</CandidateMatched>
  <CandidatePreferred>Date of admission</CandidatePreferred>
  <MatchedWords Count="2">
    <MatchedWord>admission</MatchedWord>
    <MatchedWord>date</MatchedWord>
  </MatchedWords>
  <SemTypes Count="1">
    <SemType>tmco</SemType>
  </SemTypes>
</Candidate>

The original note contained 975 lines, whereas the MetaMap XML file contained 248,198. Thus it is obvious that there is a very large amount of MetaMap output that we do not consider but which may be of interest to other researchers.

4 Subject again to the data usage agreement.


3.2 A Python module for manipulating MetaMap output

In order to make information in the XML files accessible to others, we developed a Python module (parseMM_xml.py) containing a number of methods or functions that allow one to parse the XML tree and extract relevant information.

Although we will add more functionality as needed and requested, at this point the following methods are implemented:

• parseXMLtree(filename) – parses the contents of filename and returns a node representing the top of the document tree.

• getXMLsummary(XMLtree) – summarizes the data contained in the parsed XML tree. The summary contains top-level elements and their corresponding text. The output is much like that contained in typical MetaMap text output.

• getCUIs(XMLtree) – returns the MetaMap CUIs found in the XML tree along with the matching concepts.

• getNegatedConcepts(XMLtree) – returns negated concepts and their corresponding CUIs.

• getSemanticTypes(XMLtree) – returns matched concepts, their CUIs, the candidate scores and the semantic types associated with the concept.

• findAttribute(attribute) – searches the document tree for an attribute of the user's choosing. Returns the attributes with their corresponding text values.

We chose Python to create our module because of its ease of use and its multi-platform capabilities. Once Python is installed and parseMM_xml.py is placed in a directory along with the MetaMap XML file which is to be analyzed, retrieving relevant information is relatively straightforward.5

5 Under most circumstances, Python is already installed on the Mac OS X and Linux operating systems.

A stylized version of our code is presented below.

# Parse the XML tree and return semantic types.
from parseMM_xml import parseXMLtree, getSemanticTypes

xml_tree = parseXMLtree("noteid_subid_category.xml")
semTypes = getSemanticTypes(xml_tree)
print(semTypes)

A truncated listing of the output:

CandidateCUI – C0011008
CandidateMatched – Date
1 – SemType – Temporal Concept
CandidateCUI – C2348077
CandidateMatched – Date
2 – SemType – Food
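As an illustration of what such a helper might look like internally, here is a minimal sketch based on the element names visible in the excerpt above. This is our own code, not the distributed parseMM_xml.py, and the exact layout of MetaMap's XML output may differ.

import xml.etree.ElementTree as ET

def parseXMLtree(filename):
    """Parse a MetaMap XML output file and return its root element."""
    return ET.parse(filename).getroot()

def getSemanticTypes(xml_tree):
    """Collect (matched concept, CUI, score, semantic types) for every Candidate."""
    results = []
    for cand in xml_tree.iter("Candidate"):
        matched = cand.findtext("CandidateMatched")
        cui = cand.findtext("CandidateCUI")
        score = cand.findtext("CandidateScore")
        sem_types = [st.text for st in cand.iter("SemType")]
        results.append((matched, cui, score, sem_types))
    return results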

In order to fully test the robustness of our module, we will do further unit and regression testing, in addition to providing more exception handling. Ultimately, the code will be available on SourceForge, an Open Source web source code repository available at www.sourceforge.net.

Acknowledgments

We would like to thank George Moody of MIT for his help with questions concerning the MIMIC-II database.

References

Alan R. Aronson. 2001. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp. 2001:17–21.

Alan R. Aronson and François-Michel Lang. 2010. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc, 17(3):229–236.

Gari D. Clifford, Daniel J. Scott, and Mauricio Villarroel. 2012. User Guide and Documentation for the MIMIC II Database. http://mimic.physionet.org/UserGuide/UserGuide.pdf

Ary L. Goldberger, Luis A. N. Amaral, Leon Glass, Jeffrey M. Hausdorff, Plamen Ch. Ivanov, Roger G. Mark, Joseph E. Mietus, George B. Moody, Chung-Kang Peng, and H. Eugene Stanley. 2000. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation, 101(23):e215–e220.


Mohammed Saeed, Mauricio Villarroel, Andrew T. Reisner, Gari Clifford, Li-Wei Lehman, George Moody, Thomas Heldt, Tin H. Kyaw, Benjamin Moody, and Roger G. Mark. 2011. The Multiparameter intelligent monitoring in intensive care II (MIMIC-II): A public-access ICU database. Critical Care Medicine, 39(5):952–960.

MetaMap Release Notes Website. 2013. MetaMap 2013 Release Notes. http://metamap.nlm.nih.gov/Docs/MM12_XML_Info.shtml

MetaMap 2012 Output Website. 2014. MetaMap 2012 XML Output Explained. http://metamap.nlm.nih.gov/Docs/MM12_XML_Info.shtml

PhysioNet Website. 2014. PhysioNet MIMIC-II Website. http://physionet.org/mimic2/


Proceedings of the 2014 Workshop on Biomedical Natural Language Processing (BioNLP 2014), pages 98–106, Baltimore, Maryland, USA, June 26-27 2014. ©2014 Association for Computational Linguistics

Extracting drug indications and adverse drug reactions from Spanish health social media

Isabel Segura-Bedmar, Santiago de la Peña, Paloma Martínez
Computer Science Department
Carlos III University of Madrid, Spain
{isegura|spena|pmf}@inf.uc3m.es

Abstract

In this paper, we present preliminary results obtained using a system based on co-occurrence of drug-effect pairs as a first step in the study of detecting adverse drug reactions and drug indications from social media texts. To the best of our knowledge, this is the first work that extracts this kind of relationship from user messages collected from an online Spanish health forum. In addition, we also describe the automatic construction of the first Spanish database for drug indications and adverse drug reactions.

1 Introduction

The activity of Pharmacovigilance (the science devoted to the detection and prevention of any possible drug-related problem, including adverse drug effects) has gained significant importance in recent decades, due to the growing number of drug safety incidents (Bond and Raehl, 2006) as well as to their high associated costs (van Der Hooft et al., 2006).

Nowadays, the major medicine regulatory agencies such as the US Food and Drug Administration (FDA) or the European Medicines Agency (EMA) are working to create policies and practices to facilitate the reporting of adverse drug reactions (ADRs) by healthcare professionals and patients. However, several studies have shown that ADRs are under-estimated because many healthcare professionals do not have enough time to use the ADR reporting systems (Bates et al., 2003; van Der Hooft et al., 2006; McClellan, 2007). In addition, healthcare professionals tend to report only those ADRs of whose existence they have absolute certainty. Unlike reports from healthcare professionals, patient reports often provide more detailed and explicit information about ADRs (Herxheimer et al., 2010). Nevertheless, the rate of ADRs reported by patients is still very low, probably because many patients are still unaware of the existence of ADR reporting systems. In addition, patients may feel embarrassed when describing their symptoms.

In this paper, we pose the hypothesis that health-related social media can be used as a complementary data source to the ADR reporting systems. In particular, health forums contain a large number of comments describing patient experiences that would be a fertile source of data to detect unknown ADRs.

Several systems have been developed for extracting ADRs from social media (Leaman et al., 2010; Nikfarjam and Gonzalez, 2011). However, to the best of our knowledge, only one work in the literature has focused on the detection of ADRs from social media in Spanish (Segura-Bedmar et al., 2014). Indeed, it is only concerned with the detection of mentions of drugs and their effects, without dealing with the extraction of the relationships between them. In this paper, we extend this existing work in order to extract drug indications and adverse drug reactions from user comments in a Spanish health forum.

The remainder of this paper is structured as follows: the next section surveys related work on ADR detection from social media. Section 3 describes the creation of the gold-standard corpus we used for our experiments. Sections 4 and 5 respectively describe the techniques employed and their results. Lastly, some concluding remarks and future perspectives are given in Section 6.

2 Related Work

In recent years, the application of Natural Language Processing (NLP) techniques to mine drug indications and adverse drug reactions from texts has been explored with promising results, mainly in the context of drug labels (Gurulingappa et al., 2013; Li et al., 2013; Kuhn et al., 2010; Fung et al., 2013), biomedical literature (Xu and Wang, 2013), medical case reports (Gurulingappa et al., 2012) and health records (Friedman, 2009; Sohn et al., 2011). However, as will be described below, the extraction of these drug relationships from social media has received much less attention.

To date, most research on drug name recognition concerns either the biomedical literature (Segura-Bedmar et al., 2013; Krallinger et al., 2013) or clinical records (Uzuner et al., 2010), thus leaving this task unexplored for social media texts.

To our knowledge, there is no work in the literature that addresses the extraction of drug indications from social media texts. Regarding the detection of ADRs, Leaman et al. (2010) developed a system to automatically recognize adverse effects in user comments from the DailyStrength1 health-related social network. A corpus of 3,600 comments was manually annotated with a total of 1,866 drug conditions, including beneficial effects, adverse effects, indications and others. This study focused only on a set of four drugs, and thereby, drug name recognition was not addressed. The system used a dictionary-based approach to identify adverse effects and a set of keywords in order to distinguish adverse effects from the other drug conditions. The dictionary consisted of 4,201 concepts, which were collected from several resources such as the COSTART vocabulary (FDA, 1970), the SIDER database (Kuhn et al., 2010), the MedEffect database2 and a list of colloquial phrases manually collected from the comments. The system achieved a precision of 78.3% and a recall of 69.9% (an f-measure of 73.9%).

Later, Nikfarjam and Gonzalez (2011) applied association rule mining to extract frequent patterns describing opinions about drugs. The rules were generated using the Apriori tool, an implementation of the Apriori algorithm (Agrawal et al., 1994) for association rule mining. The main advantage of this approach over the dictionary-based approach is that the system is able to detect terms not included in the dictionary. The results of this study were 70.01% precision and 66.32% recall, for an f-measure of 67.96%.

Benton et al. (2011) collected a lexicon of lay medical terms from websites and databases about drugs and their adverse effects to identify drug effects. Then, the authors applied Fisher's exact test (Fisher, 1922) to find the drug-effect pairs whose co-occurrence in a corpus of user comments was unlikely to be due to chance. To evaluate the system, the authors focused only on the four most commonly used drugs to treat breast cancer. Precision and recall were calculated by comparing the adverse effects from their drug labels and the adverse effects obtained by the system. The system obtained an average precision of 77% and an average recall of 35.1% for all four drugs.

1 http://www.dailystrength.org/
2 http://www.hc-sc.gc.ca/dhp-mps/medeff/index-eng.php

To the best of our knowledge, the system described in (Segura-Bedmar et al., 2014) is the only one that has dealt with the detection of drugs and their effects from Spanish social media streams. The system used the Textalytics tool3, which follows a dictionary-based approach to identify entities in texts. The dictionary was constructed based on the following resources: CIMA4 and MedDRA5. CIMA is an online information center maintained by the Spanish Agency for Medicines and Health Products (AEMPS). CIMA provides information on all drugs authorized in Spain, though it does not include drugs approved only in Latin America. CIMA contains a total of 16,418 brand drugs and 2,228 generic drugs. Many brand drugs have very long names because they include additional information such as dosage, mode and route of administration, and laboratory, among others (for example, ESPIDIFEN 400 mg GRANULADO PARA SOLUCION ORAL SABOR ALBARICOQUE). For this reason, brand drug names were simplified before being included in the dictionary. After removing the additional information, the resulting list of brand drug names consisted of 3,662 terms. Thus, the dictionary contained a total of 5,890 drugs. As regards the effects, the authors decided to use MedDRA, a multilingual medical terminology dictionary about events associated with drugs. MedDRA is composed of a five-level hierarchy. A total of 72,072 terms from the most specific level, "Lowest Level Terms" (LLTs), were integrated into the dictionary. In addition, several gazetteers including drugs and effects were collected from websites such as Vademecum6, a Spanish online website that provides information to patients on drugs and their side effects, and

3 https://textalytics.com/
4 http://www.aemps.gob.es/cima/
5 http://www.meddra.org/
6 http://www.vademecum.es/


the ATC system7, a classification system of drugs. Thus, the dictionary and the two gazetteers contained a total of 7,593 drugs and 74,865 effects. The system yielded a precision of 87% for drugs and 85% for effects, and a recall of 80% for drugs and 56% for effects.

3 The SpanishADR corpus

Segura-Bedmar et al. (2014) created the first Spanish corpus of user comments annotated with drugs and their effects. The corpus consists of 400 comments, which were gathered from ForumClinic8, an interactive health social platform where patients exchange information about their diseases and their treatments. The texts were manually annotated by two annotators with expertise in Pharmacovigilance. All the mentions of drugs and effects were annotated, even those containing spelling or grammatical errors (for example, hemorrajia (haemorrhage)). An assessment of the inter-annotator agreement (IAA) was based on the F-measure metric, which approximates the kappa coefficient (Cohen, 1960) when the number of true negatives (TN) is very large (Hripcsak and Rothschild, 2005). This assessment revealed that while drugs showed a high IAA (0.89), their effects point to moderate agreement (0.59). This may be because drugs have specific names and there is a limited number of them, whereas their effects are expressed by patients in many different ways due to the variability and richness of natural language. The corpus is available for academic purposes9.

In this paper, we extend the Spanish corpus to incorporate the annotation of the relationships between drugs and their effects. In particular, we annotated drug indications and adverse drug reactions. These relationships were annotated at comment level rather than sentence level, because determining sentence boundaries in this kind of text can be problematic since many users often write ungrammatical sentences. Guidelines were created by two annotators (A1, A2) and a third annotator (A3) was trained on the annotation guidelines. Then, we split the corpus into three subsets, and each subset was annotated by one annotator. Finally, IAA was measured using the kappa statistic on a sample of 97 documents randomly selected. These documents were annotated by the three

7 http://www.whocc.no/atc_ddd_index/
8 http://www.forumclinic.org/
9 http://labda.inf.uc3m.es/SpanishADRCorpus

annotators and annotation differences were analysed. As Table 1 shows, the resulting corpus has 61 drug indications and 103 adverse drug reactions. The average size of a comment is 72 tokens. The average size of a text fragment describing a drug indication is 34.7 tokens, and 28.2 tokens for adverse drug reactions.

Annotation              Size
drugs                   188
effect                  545
drug indication         61
adverse drug reaction   103

Table 1: Size of the extended SpanishADR corpus.

As shown in Table 2, the IAA figures clearly suggest that the annotators have high agreement among them. We think that the IAA figures were lower with the third annotator because he did not participate in the guidelines development process and, maybe, was not trained well enough to perform the task. The main source of disagreement among the annotators could arise from considering whether a term refers to a drug effect or not. This is because some terms are too general (such as trastorno (upset), enfermedad (disease), molestia (ache)). The annotators A1 and A2, in general, ruled out all the relation instances where these general terms occur; however, they were considered and annotated by the third annotator.

      A2     A3
A1    0.8    0.69
A2    -      0.68

Table 2: Pairwise IAA for each combination of two annotators. IAA was measured using Cohen's kappa statistic.

4 Methods

In this contribution, some refinements to the system of (Segura-Bedmar et al., 2014) are proposed. The error analysis performed in (Segura-Bedmar et al., 2014) showed that most of the false positives for drug effects were mainly due to the inclusion of MedDRA terms referring to procedures and tests in the dictionary. MedDRA includes terms for diseases, signs, abnormalities, procedures and tests. Therefore, we decided not to include terms corresponding to the "Procedimientos médicos y quirúrgicos" and "Exploraciones complementarias" categories, since they do not represent drug effects. Thus, we created a new dictionary that only includes those terms from MedDRA that actually refer to drug effects. As in the system of (Segura-Bedmar et al., 2014), we applied the Textalytics tool, which follows a dictionary-based approach, to identify drugs and their effects occurring in the messages. We created a GATE10 pipeline application integrating the Textalytics module and the gazetteers collected from the Vademecum website and the ATC system proposed in (Segura-Bedmar et al., 2014).

In addition, we created an additional gazetteer in order to increase the coverage. We developed a web crawler to browse and download pages related to drugs from the MedLinePlus website11. Unlike Vademecum, which only contains information on drugs approved in Spain, MedLinePlus also includes information about drugs approved only in Latin America. Terms describing drug effects were extracted by regular expressions from these pages and then incorporated into a gazetteer. Then, the new gazetteer was also integrated into the GATE pipeline application to identify drugs and effects. Several experiments with different settings of this pipeline are described in the following section.

The main contribution of this paper is to propose an approach for detecting relationships between drugs and their effects from user comments in Spanish. The main difficulty in this task is that, although there are several English databases such as SIDER or MedEffect with information about drugs and their side effects, none of them is available for Spanish. Moreover, these resources do not include drug indications. Thus, we have automatically built the first database, SpanishDrugEffectBD, with information about drugs, their drug indications as well as their adverse drug reactions in Spanish. Our first step was to populate the database with all drugs and effects from our dictionary. Figure 1 shows the database schema. Active ingredients are saved into the Drug table, and their synonyms and brand names into the DrugSynset table. Likewise, concepts from MedDRA are saved into the Effect table and their synonyms are saved into the EffectSynset table. As shown in Figure 1, the database is also

10 http://gate.ac.uk/
11 http://www.nlm.nih.gov/medlineplus/spanish/

designed to store external ids from other databases. Thus, drugs and effects can be linked to external databases through the tables has_externalIDDrug and has_externalIDEffect, respectively.

To obtain the relationships between drugs and their effects, we developed several web crawlers in order to gather sections describing drug indications and adverse drug reactions from drug package leaflets contained in the following websites: MedLinePlus, Prospectos.Net12 and Prospectos.org13. Once these sections were downloaded, their texts were processed using the Textalytics tool to recognize drugs and their effects. As each section (describing drug indications or adverse drug effects) is linked to one drug, we decided to consider the effects contained in the section as possible relationships with this drug. The type of relationship depends on the type of section: drug indication or adverse drug reaction. Thus, for example, a pair (drug, effect) from a section describing drug indications is saved into the DrugEffect table as a drug indication relationship, while if the pair is obtained from a section describing adverse drug reactions, then it is saved as an adverse drug reaction. This database can be used to automatically identify drug indications and adverse drug reactions from texts. Table 3 shows the number of drugs, effects and their relationships stored in the database.

                        Concepts   Synonyms
drugs                   3,244      7,378
effects                 16,940     52,199
drug indications        4,877
adverse drug reactions  58,633

Table 3: Number of drugs, effects, drug indications and adverse drug effects in the SpanishDrugEffectBD database.

As regards the extraction of the relationships between drugs and their effects occurring in the corpus, first of all, texts were automatically annotated with drugs and effects using the GATE pipeline application. Then, in order to generate all possible relation instances between drugs and their effects, we considered several window sizes: 10, 20, 30, 40 and 50. Given a size n, any pair (drug, effect) co-occurring within a window of n tokens is treated as a relation instance.

12 http://www.prospectos.net/
13 http://prospectos.org/


Figure 1: The SpanishDrugEffectBD database schema

Afterwards, each relation instance is looked up in the DrugEffect table in order to determine whether it is a positive instance and, if this is the case, its type: drug indication or adverse drug reaction.
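The windowed co-occurrence step can be illustrated with the following minimal sketch (our own, not the authors' implementation); the token positions of the drug and effect mentions are assumed to come from the GATE annotations described above.

# Generate candidate (drug, effect) relation instances: any drug mention and
# effect mention whose token positions lie within an n-token window.
def candidate_pairs(drug_mentions, effect_mentions, window):
    pairs = []
    for d_pos, drug in drug_mentions:
        for e_pos, effect in effect_mentions:
            if abs(d_pos - e_pos) <= window:
                pairs.append((drug, effect))
    return pairs

# Example: token indices within one comment (illustrative values only).
drugs = [(3, "alprazolam")]
effects = [(10, "ansiedad"), (80, "dependencia")]
print(candidate_pairs(drugs, effects, window=50))  # only (alprazolam, ansiedad)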

5 Experiments

Several experiments have been performed in order to evaluate the contribution of the proposed methods and resources. Table 4 shows the results of the named entity recognition task for drugs and effects using the dictionary integrated into the Textalytics tool. The first row shows the results with the dictionary built from the CIMA and MedDRA resources, while the second one shows the results obtained using the new dictionary, in which those MedDRA terms corresponding to the "Procedimientos médicos y quirúrgicos" and "Exploraciones complementarias" categories were ruled out. As can be seen in this table, the new dictionary yields a significant improvement with respect to the original dictionary. For the effect type, precision increased by almost 40% and recall by 7%. As regards the contribution of the gazetteers, the coverage for effects improves by almost 6%, but with a significant decrease in precision of almost 21%. Regarding the detection of drugs, the use of gazetteers improves the precision slightly and achieves a significant improvement in recall of almost 35%.

The major cause of false negatives for drug effects was the use of colloquial expressions (such as 'me deja ko' (it makes me ko)) to describe an adverse effect. These phrases are not included in our dictionary. Another important cause was that the dictionary and gazetteers do not cover all the lexical variations of a same effect (for example depresion (depression), depresivo (depressive), me deprimo (I get depressed)). In addition, many false negatives were due to spelling mistakes (for example hemorrajia instead of hemorragia (haemorrhage)) and abbreviations (depre is an abbreviation for depresion (depression)).

Regarding the results for the relation extraction task, Table 5 shows the overall results obtained using a baseline system, which considers all pairs (drug, effect) occurring in messages as positive relation instances, and a second approach using the SpanishDrugEffectBD database (a relation instance is positive only if it is found in the database). In both experiments, a window size of 250 tokens was used. The database provides high precision but with a very low recall of only 15%.


Approach                         Entity  P     R     F1
Dictionary                       drugs   0.84  0.46  0.60
                                 effect  0.45  0.38  0.41
New dictionary                   drugs   0.84  0.46  0.60
                                 effect  0.84  0.45  0.59
New dictionary plus gazetteers   drugs   0.86  0.81  0.84
                                 effect  0.63  0.51  0.57

Table 4: Precision, Recall and F-measure for the named entity recognition task.

As can be seen in Table 6, when the type of the relationship is considered, the performance is even lower.

Approach              P     R     F1
Baseline              0.31  1.00  0.47
SpanishDrugEffectBD   0.83  0.15  0.25

Table 5: Overall results for the relation extraction task (window size of 250 tokens).

Relation               P     R     F1
Drug indication        0.50  0.02  0.03
Adverse drug reaction  0.65  0.11  0.18

Table 6: Results for drug indications and adverse drug reactions using only the database (window size of 50 tokens).

Figure 2 shows an example of the output of our system using the database. The system is able to detect the indication relationship between alprazolam and ansiedad (anxiety), but fails to detect the adverse drug reaction between alprazolam and dependencia (dependency). The adverse drug reaction between lamotrigina and vertigo is detected.

The co-occurrence approach provides better results than the use of the database. Table 7 shows the results for different window sizes. As was expected, small sizes provide better precision but lower recall.

6 Conclusion

In this paper we present the first corpus in which 400 user messages from a Spanish health social network have been annotated with drug indications and adverse drug reactions. In addition, we present preliminary results obtained using a very simple system based on co-occurrence of drug-effect pairs as a first step in the study of detecting

Size of window  P     R     F1
10              0.71  0.24  0.36
20              0.59  0.53  0.56
30              0.52  0.69  0.59
40              0.47  0.77  0.58
50              0.44  0.84  0.58

Table 7: Overall results for the relation extraction task using the co-occurrence approach with different window sizes.

adverse drug reactions and drug indications from social media streams. Results show that there is still much room for improvement in the identification of drugs and effects, as well as in the extraction of drug indications and adverse drug reactions.

As already mentioned in Section 2, the recognition of drugs in social media texts has hardly been tackled, since most systems focused on a given and fixed set of drugs. Moreover, little research has been conducted to extract relationships between drugs and their effects from social media. Most systems for extracting ADRs follow a dictionary-based approach. The main drawback of these systems is that they fail to recognize terms which are not included in the dictionary. In addition, the dictionary-based approach is not able to handle the large number of spelling and grammar errors in social media texts. Moreover, the detection of ADRs and drug indications has not been attempted for languages other than English. Indeed, automatic information extraction from Spanish-language social media in the field of health remains largely unexplored.

Social media texts pose additional challenges to those associated with the processing of clinical records and medical literature. These new challenges include the management of meta-information included in the text (for example as tags in tweets) (Bouillot et al., 2013), the detection of typos and unconventional spelling, word


Figure 2: An example of the output of the system using the database.

shortenings (Neunerdt et al., 2013; Moreira et al., 2013) and slang and emoticons (Balahur, 2013), among others. Another challenge that should be taken into account is that, while clinical records and medical literature can be mapped to terminological resources or biomedical ontologies, the lay terminology used by patients to describe their treatments and their effects is, in general, not collected in any terminological resource that would facilitate the automatic processing of this kind of text.

In this paper, we also describe the automatic creation of a database for drug indications and adverse drug reactions from drug package leaflets. To the best of our knowledge, this is the first such database available for Spanish. Although the use of this database did not improve the results due to its limited coverage, we think that it could be a valuable resource for future efforts. Thus, we plan to translate the database into an ontology and to populate it with more entities and relationships. As future work, we plan the following tasks:

• To create a lexicon containing idiomatic expressions used by patients to express drug effects.

• To use techniques such as lemmatization and stemming to cope with the problem of lexical variability and to resolve abbreviations.

• To integrate advanced matching methods capable of dealing with the spelling error problem.

• To increase the size of the corpus.

• To apply an SVM classification approach to extract relationships between drugs and their effects.

We hope our research will be beneficial to the AEMPS as well as to the pharmaceutical industry in the improvement of their pharmacovigilance systems. Both the corpus and the database are freely available online14 for research purposes.

14 http://labda.inf.uc3m.es/SpanishADRCorpus

Acknowledgments

This work was supported by the EU project TrendMiner [FP7-ICT287863], by the project MULTIMEDICA [TIN2010-20644-C03-01], and by the Research Network MA2VICMR [S2009/TIC-1542].

References

Rakesh Agrawal, Ramakrishnan Srikant, et al. 1994. Fast algorithms for mining association rules. In Proc. 20th Int. Conf. Very Large Data Bases, VLDB, volume 1215, pages 487–499.

Alexandra Balahur. 2013. Sentiment analysis in social media texts. WASSA 2013, page 120.

David W Bates, R Scott Evans, Harvey Murff, Peter D Stetson, Lisa Pizziferri, and George Hripcsak. 2003. Detecting adverse events using information technology. Journal of the American Medical Informatics Association, 10(2):115–128.

Adrian Benton, Lyle Ungar, Shawndra Hill, Sean Hennessy, Jun Mao, Annie Chung, Charles E Leonard, and John H Holmes. 2011. Identifying potential adverse effects using the web: A new approach to medical hypothesis generation. Journal of Biomedical Informatics, 44(6):989–996.

CA Bond and Cynthia L Raehl. 2006. Adverse drug reactions in United States hospitals. Pharmacotherapy: The Journal of Human Pharmacology and Drug Therapy, 26(5):601–608.

Flavien Bouillot, Phan Nhat Hai, Nicolas Bechet, Sandra Bringay, Dino Ienco, Stan Matwin, Pascal Poncelet, Mathieu Roche, and Maguelonne Teisseire. 2013. How to extract relevant knowledge from tweets? In Information Search, Integration and Personalization, pages 111–120. Springer.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

FDA. 1970. National adverse drug reaction directory: COSTART (coding symbols for thesaurus of adverse reaction terms). Rock-Irvine, Charles F. Sharp Jr., MD, Huntington Memorial Hospital, Stuart L. Silverman, MD, University of California, Los Angeles, West Los Angeles Veterans Affairs Medical Center, Osteoporosis Medical Center.

Ronald A Fisher. 1922. On the interpretation of chi-squared from contingency tables, and the calculation of P. Journal of the Royal Statistical Society, 85(1):87–94.

Carol Friedman. 2009. Discovering novel adverse drug events using natural language processing and mining of the electronic health record. In Artificial Intelligence in Medicine, pages 1–5. Springer.

Kin Wah Fung, Chiang S Jao, and Dina Demner-Fushman. 2013. Extracting drug indication information from structured product labels using natural language processing. Journal of the American Medical Informatics Association, 20(3):482–488.

Harsha Gurulingappa, Abdul Mateen-Rajput, Luca Toldo, et al. 2012. Extraction of potential adverse drug events from medical case reports. J Biomed Semantics, 3(1):15.

Harsha Gurulingappa, Luca Toldo, Abdul Mateen Rajput, Jan A Kors, Adel Taweel, and Yorki Tayrouz. 2013. Automatic detection of adverse events to predict drug label changes using text and data mining techniques. Pharmacoepidemiology and Drug Safety, 22(11):1189–1194.

A Herxheimer, MR Crombag, and TL Alves. 2010. Direct patient reporting of adverse drug reactions. A twelve-country survey & literature review. Health Action International (HAI) (Europe), Amsterdam.

George Hripcsak and Adam S Rothschild. 2005. Agreement, the F-measure, and reliability in information retrieval. Journal of the American Medical Informatics Association, 12(3):296–298.

Martin Krallinger, Florian Leitner, Obdulia Rabal, Miguel Vazquez, Julen Oyarzabal, and Alfonso Valencia. 2013. Overview of the chemical compound and drug name recognition (CHEMDNER) task. In BioCreative Challenge Evaluation Workshop vol. 2, page 2.

Michael Kuhn, Monica Campillos, Ivica Letunic, Lars Juhl Jensen, and Peer Bork. 2010. A side effect resource to capture phenotypic effects of drugs. Molecular Systems Biology, 6(1).

Robert Leaman, Laura Wojtulewicz, Ryan Sullivan, Annie Skariah, Jian Yang, and Graciela Gonzalez. 2010. Towards internet-age pharmacovigilance: extracting adverse drug reactions from user posts to health-related social networks. In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, pages 117–125. Association for Computational Linguistics.

Qi Li, Louise Deleger, Todd Lingren, Haijun Zhai, Megan Kaiser, Laura Stoutenborough, Anil G Jegga, Kevin Bretonnel Cohen, and Imre Solti. 2013. Mining FDA drug labels for medical conditions. BMC Medical Informatics and Decision Making, 13(1):53.

Mark McClellan. 2007. Drug safety reform at the FDA: pendulum swing or systematic improvement? New England Journal of Medicine, 356(17):1700–1702.

Silvio Moreira, Joao Filgueiras, and Bruno Martins. 2013. REACTION: A naive machine learning approach for sentiment classification. In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval 2013), page 490.

Melanie Neunerdt, Michael Reyer, and Rudolf Mathar. 2013. A POS tagger for social media texts trained on web comments. Polibits, 48:59–66.

Azadeh Nikfarjam and Graciela H Gonzalez. 2011. Pattern mining for extraction of mentions of adverse drug reactions from user comments. In AMIA Annual Symposium Proceedings, volume 2011, page 1019. American Medical Informatics Association.

Isabel Segura-Bedmar, Paloma Martínez, and María Herrero-Zazo. 2013. SemEval-2013 Task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). Proceedings of SemEval, pages 341–350.

Isabel Segura-Bedmar, Ricardo Revert, and Paloma Martínez. 2014. Detecting drugs and adverse events from Spanish social media streams. In Proceedings of the 5th International Louhi Workshop on Health Document Text Mining and Information Analysis (Louhi 2014).

Sunghwan Sohn, Jean-Pierre A Kocher, Christopher G Chute, and Guergana K Savova. 2011. Drug side effect extraction from clinical narratives of psychiatry and psychology patients. Journal of the American Medical Informatics Association, 18(Suppl 1):i144–i149.


Ozlem Uzuner, Imre Solti, and Eithon Cadag. 2010. Extracting medication information from clinical text. Journal of the American Medical Informatics Association, 17(5):514–518.

Cornelis S van Der Hooft, Miriam CJM Sturkenboom, Kees van Grootheest, Herre J Kingma, and Bruno H Ch Stricker. 2006. Adverse drug reaction-related hospitalisations. Drug Safety, 29(2):161–168.

Rong Xu and QuanQiu Wang. 2013. Large-scale extraction of accurate drug-disease treatment pairs from biomedical literature for drug repurposing. BMC Bioinformatics, 14(1):181.


Proceedings of the 2014 Workshop on Biomedical Natural Language Processing (BioNLP 2014), pages 107–111, Baltimore, Maryland, USA, June 26-27 2014. ©2014 Association for Computational Linguistics

Symptom recognition issue

Laure Martin
MoDyCo
Paris Ouest University
laure.martin.1988@gmail.com

Delphine Battistelli
MoDyCo
Paris Ouest University
del.battistelli@gmail.com

Thierry Charnois
LIPN
Paris 13 University
thierry.charnois@lipn.univ-paris13.fr

Abstract

This work focuses on sign and symptom recognition in biomedical text abstracts. First, this specific task is described from a linguistic point of view. Then a methodology combining pattern mining and language processing is proposed. In the absence of an authoritative annotated corpus, our approach has the advantage of being weakly supervised. Preliminary experimental results are discussed and reveal promising avenues.

1 Introduction

Our work is part of the Hybride1 Project, which aims to expand the Orphanet encyclopedia. Orphanet is the reference portal for information on rare diseases (RD) and orphan drugs, for all audiences. A disease is considered rare if it affects fewer than 1 person in 2,000. There are between 6,000 and 8,000 RD; 30 million people are concerned in Europe. Among its activities, Orphanet maintains an RD encyclopedia by manually monitoring scientific publications. The Hybride Project attempts to automatically acquire new RD-related knowledge from large amounts of scientific publications. The elements of knowledge about a disease are varied: onset, prevalence, signs and symptoms, transmission mode, disease causes (etiology).

In this article, we investigate the automatic recognition of signs and symptoms in abstracts from scientific articles. Although named entity recognition in the biomedical domain has been extensively studied, signs and symptoms seem to have been left aside, for there is very little work on the subject. First, the linguistic issue of our study is presented in Section 2, then the state of the art and the description of our lexical resources in Section 3. Then our corpus and general method are

1 http://hybride.loria.fr/

presented in Section 4. First experiments are introduced in Section 5. Finally, the work to come is presented in Section 6.

2 Signs and symptoms

Signs and symptoms both refer to the features of a disease, except that a symptom (or functional sign) is noticed and described by a patient, whilst a clinical sign is observed by a healthcare professional. In thesauri and medical ontologies, these two notions are generally put together in the same category. Moreover, in texts – particularly in our corpus of abstracts from scientific articles – there is no morphological or syntactic difference between sign and symptom. The difference is only semantic, so it is impossible for non-specialists in the medical field to tell the difference from the linguistic context alone. In example (1), clinical signs are in bold and symptoms are italicized.

(1) Cluster headache (CH) is a primary headache disease characterized by recurrent short-lasting attacks of excruciating unilateral periorbital pain accompanied by ipsilateral autonomic signs (lacrimation, nasal congestion, ptosis, miosis, lid edema, and eye redness).

Furthermore, the diagnosis is established by the symptoms and the clinical signs together. We did not, therefore, try to distinguish them.

Signs and symptoms take on the most varied linguistic forms, as is noticeable in the corpus (which will be described in more detail below). In its simplest form, a sign or symptom is a noun, which may be extended by complements, such as adjectives or other nouns (example 2). They also appear in other, more complex, forms, ranging from a single phrase to a whole sentence (example 3).

(2) With disease progression patients additionally develop weakness and wasting of the limb and bulbar muscles.

(3) Diagnosis is based on clinical presentation, and glycemia and lactacidemia levels, after a meal (hyperglycemia and hypolactacidemia), and after three to four hour fasting (hypoglycemia and hyperlactacidemia).

In addition to their variety, the linguistic units representing signs and symptoms present some syntactic ambiguities, particularly ambiguities concerning prepositional attachment and coordination scope. In example (2), the first occurrence of "and" is ambiguous, for we don't know if "weakness" and "wasting" should be grouped together as a single manifestation of the disease, or if "weakness" on the one hand and "wasting of the limbs and bulbar muscles" on the other hand are two separate entities, as annotated here.

In addition to these syntactic ambiguities, two annotation difficulties also arise. The first one consists in correctly delimiting the linguistic units of the signs and symptoms (example 4a). We agreed with experts in the field that, generally, pieces of information such as adjectives of intensity or anatomical localizations were not part of the units; nevertheless, this information is interesting in that it provides the linguistic context for the signs and symptoms. The second difficulty concerns elliptical constructions: where two signs can be distinguished, only one can be annotated because the two nouns have an adjective in common (example 4b).

(4) In the severe forms, paralysis (4a) concerns the neck, shoulder, and proximal muscles, followed by involvement of the muscles of the distal upper extremities, the diaphragm and respiratory muscles, which may result in respiratory compromise or arrest (4b).

Finally, the last difficulty that was met during the corpus observation is the semantic ambiguity existing between sign or symptom and disease denominations. A disease can be the clinical sign of another disease. A clinical sign may be included in a disease name, or conversely. In example (5), the clinical sign is in bold and the name of the disease is underlined.

(5) The adult form results in progressive limb-girdle myopathy beginning with the lower limbs, and affects the respiratory system.

3 State of the art

Signs and symptoms have seldom been studied for themselves in the field of biomedical information extraction. They are often included in more general categories such as "clinical concepts" (Wagholikar et al., 2013), "medical problems" (Uzuner et al., 2011) or "phenotypic information" (South et al., 2009). Moreover, most of the studies are based on clinical reports or narrative corpora – the Mayo Clinic corpus (Savova et al., 2010) or the 2010 i2b2/VA Challenge corpus (Uzuner et al., 2011) – except for the Swedish MEDLEX Corpus (Kokkinakis, 2006), which comprises teaching material, guidelines, official documents, scientific articles from medical journals, etc. Our work aims at scientific monitoring and is therefore based on a corpus of abstracts from scientific articles.

Most of the information extraction systems developed in the works previously cited use lexical resources, such as the Unified Medical Language System (UMLS) or Medical Subject Headings (MeSH) thesauri, for the named entity extraction task. The UMLS comprises over 160 controlled vocabularies such as MeSH, which is a generic medical thesaurus containing over 25,000 descriptors. However, as Albright et al. (2013) pointed out, UMLS was not originally designed for annotation, so some of the semantic types overlap. They add that "the sheer size of the UMLS schema increases the complexity of the annotation task and slows annotation, while only a small proportion of the annotation types present are used." That is why they decided to work with UMLS semantic groups instead of types, except for signs and symptoms – originally a semantic type in the Disorders semantic group – which they used independently.

In a genetic disease context, a sign or symptom may be phenotype-related. A phenotype is all the observable characteristics of a person, such as their morphology, or biochemical or physiological properties. It results from the interactions between a genotype (the expression of an organism's genes) and its environment. As many rare diseases are genetic, many signs and symptoms may be found in lists of phenotype anomalies. For that reason, we chose to use the Human Phenotype Ontology – HPO (Köhler et al., 2014) – as a lexical resource. To our knowledge, HPO has not yet been used in any study on sign and symptom extraction. Nevertheless, it should be recalled that phenotype anomalies are not always clinical signs, and signs or symptoms are not all phenotype-related. Even so, we decided to use HPO as a lexical resource because it lists 10,088 terms describing human phenotype anomalies and can be easily collected.

Just a very few studies take advantage of consid-ering the linguistic contexts of sign and symptomentities. Kokkinakis (2006), after a first annotationstep of his corpus with MeSH, states that 75% ofthe signs and symptoms co-occur with up to fiveother signs and symptoms in a sentence. This al-lowed him to develop new annotation rules. Wecan also mention the MedLEE system (Friedman,1997), which provides, for each concept, its type(e.g. “problem”), value (e.g. “pain”) and modi-fiers such as the degree (e.g. “severe”) or the bodylocation (e.g. “chest”).

As far as we are concerned, our approach isbased on the combination of NLP and pattern min-ing techniques. We will see that the linguistic con-texts mentioned above are part of the patterns au-tomatically discovered with our text mining tool.

4 Corpus and general method

As mentioned above, HPO was selected as the lexical resource for this project. With the list of phenotype anomalies as queries, we compiled a corpus of 306,606 abstracts from the MEDLINE database with the PubMed search engine. These abstracts are from articles published within the last 365 days. They consist of an ID, a title and a paragraph. Then, we applied HPO and kept only the sentences containing a unit annotated as a sign or symptom. As already pointed out, signs and symptoms are not all phenotype-related, so our pre-annotation is incomplete. Nonetheless, this first annotation is quick and cheap, and it initiates the process.

Figure 1: Iterative process of our sign and symptom extraction method

Figure 1 illustrates the successive steps of the approach. In step 1, HPO (f) is used to annotate a first corpus (a) by a single projection of HPO terms onto the texts. This annotated corpus provides a first learning corpus (b) from which patterns (c) are discovered by a text mining method (step 2; this method is detailed below). These patterns are then validated by an expert (step 3) as linguistic patterns (d). Step 4 consists in using these patterns to annotate new corpora (e) and extract new terms (here with the semantic type of sign or symptom), which are added to the resources (f). The process is finally repeated (back to step 1, with enriched lexical resources). This incremental process has the advantage of being weakly supervised and not dependent on the corpus type.
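To make step 1 concrete, the following sketch shows how a longest-match projection of lexicon terms onto tokenized sentences could produce the pre-annotated learning corpus. This is our own illustration, not the authors' code; the function names and the simplistic whitespace tokenization are assumptions.

```python
def project_terms(sentence, hpo_terms, label="SYMPTOM"):
    """Annotate a sentence by longest-match projection of HPO terms."""
    tokens = sentence.split()
    lowered = [t.lower() for t in tokens]
    term_set = {tuple(term.lower().split()) for term in hpo_terms}
    max_len = max((len(t) for t in term_set), default=0)
    annotated, i = [], 0
    while i < len(tokens):
        match = None
        for n in range(min(max_len, len(tokens) - i), 0, -1):  # longest match first
            if tuple(lowered[i:i + n]) in term_set:
                match = n
                break
        if match:
            annotated.append((" ".join(tokens[i:i + match]), label))
            i += match
        else:
            annotated.append((tokens[i], None))
            i += 1
    return annotated

# Only sentences containing at least one annotated unit are kept for mining.
hpo = ["muscle weakness", "respiratory insufficiency"]
print(project_terms("Patients showed muscle weakness and fatigue .", hpo))
```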

Sequential pattern mining was first introduced by Agrawal et al. (1995) in the data mining field. It was adapted to information extraction from texts by Bechet et al. (2012). The task consists in locating, in a set of sequences, subsequences of items whose frequency exceeds a given threshold (called the "support"). Pattern mining is performed on a base of ordered sequences of items, where each sequence corresponds to a text unit (here, the sentence). An item represents a word in this sequence, generally the inflected form or the lemma, or even the part of speech if the aim is to identify generic patterns. A number of parameters can be adapted to the application.

Contrary to classical machine learning approaches, which produce numerical models that are unintelligible to humans, data mining allows the discovery of symbolic patterns which can be interpreted by an expert. In the absence of authoritative annotated corpora for the recognition of signs and symptoms, a manual validation step of the patterns is necessary, and often a large number of patterns remains. To overcome this difficulty, Bechet et al. (2012) suggested adding constraints in order to reduce the results. Following this work, we make use of the sequential pattern extraction tool SDMC,2 which makes it possible to apply various constraints and to extract condensed representations (patterns without redundancy).

2 https://sdmc.greyc.fr/

We adapted pattern mining to our field of application. We first propose to use TreeTagger (Schmid, 1994) as a preprocessing step, in order to mark up different types of items (inflected form, lemma, part of speech). To narrow down the number of patterns returned by the tool, we introduce several constraints specific to our application: linguistic membership constraints (for example, we can choose to return only patterns containing at least one sign or symptom name), or the "gap" constraint (Dong and Pei, 2007), corresponding to possible gaps between items in the pattern. Thus, a gap of maximal value n means that at most n items (words) occur between consecutive items of the pattern in the corresponding sequences (sentences).
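As an illustration of the gap constraint, the sketch below checks whether a sequential pattern occurs in a lemma sequence with at most n items between consecutive pattern items, and counts its support over a small corpus. This is our own sketch under those assumptions, not the SDMC implementation.

```python
def matches_with_gap(pattern, sequence, max_gap):
    """True if `pattern` occurs in `sequence` with at most `max_gap` items
    between consecutive pattern items."""
    def search(p_idx, s_idx):
        if p_idx == len(pattern):
            return True
        # the first pattern item may start anywhere; later items obey the gap
        limit = len(sequence) if p_idx == 0 else min(len(sequence), s_idx + max_gap + 1)
        for j in range(s_idx, limit):
            if sequence[j] == pattern[p_idx] and search(p_idx + 1, j + 1):
                return True
        return False
    return search(0, 0)

def support(pattern, corpus, max_gap=0):
    """Number of sequences (sentences) in which the pattern occurs."""
    return sum(matches_with_gap(pattern, sent, max_gap) for sent in corpus)

corpus = [["be", "associate", "with", "SYMPTOM"],
          ["the", "disease", "be", "often", "associate", "with", "SYMPTOM"]]
print(support(["be", "associate", "with", "SYMPTOM"], corpus, max_gap=0))  # 1
print(support(["be", "associate", "with", "SYMPTOM"], corpus, max_gap=1))  # 2
```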

5 First experiment

Annotating the first MEDLINE corpus of abstracts with HPO provided us with a corpus of 10,000 annotated sentences. The 13,477 annotated units were replaced by a keyword – SYMPTOM – in order to facilitate the discovery of patterns. Then we used SDMC to mine the corpus for maximal patterns, with a minimal support of 10, a length between 3 and 50 words, and a gap constraint of g(0,0), i.e. the words are consecutive (no gap allowed). We mined lemma sequences only.

Mining produced 988 patterns, of which 326 contained the keyword SYMPTOM. Based on these patterns, several remarks can already be made:

• Several annotated signs or symptoms are regularly associated with a third term, which can be another sign or symptom: {symptom}{symptom}{and}{stress};

• HPO annotation limitations (see section 3) are made visible by some contexts: {disease}{such}{as}{symptom};

• Some contexts are particularly recurrent, such as {be}{associate}{with}{symptom} or {characterize}{by}{symptom};

• Some temporal and chronological ordering contexts are present: {@card@}{%}{follow}{by}{symptom};

• The term "patient" is quite regular ({patient}{have}{severe}{symptom}), but after the evaluation, these occurrences turned out to be more disease-related than sign or symptom-related;

• The body location proved to be another regular context: {frontotemporal}{symptom}{ftd}.

Firstly, a linguistics expert selected the patterns that he considered the most relevant. These patterns were then classified into three categories: strong if they seem to strongly imply the presence of signs and symptoms (43 patterns), moderate (309 patterns) and weak (45 patterns). Secondly, these patterns were applied to a new corpus of MEDLINE abstracts in order to annotate the sign and symptom contexts. For the moment, only strong patterns have been applied.

25 abstracts were randomly selected among all the scientific articles published within the last month and dealing with Pompe disease. These 25 articles were manually annotated for signs and symptoms by an expert and thus constituted a gold standard. Then, we compared the manual annotation with our automatically annotated contexts. If an annotated sentence includes signs or symptoms, we consider the annotation relevant. In the 25 abstracts (225 sentences), 27 contexts were extracted with our method: 23 were correct and 4 were irrelevant; 70 sentences were not annotated by the system. The results were thus 23.7 in recall and 82.2 in precision (36.8 in F-score).
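For clarity, the reported F-score is simply the balanced harmonic mean of the reported precision and recall (a standard definition, stated here for reference):

$$ F_1 = \frac{2PR}{P + R} = \frac{2 \times 82.2 \times 23.7}{82.2 + 23.7} \approx 36.8 $$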

6 Conclusions

Sign/disease ambiguity is the cause of 3 of the 4 irrelevant annotations, i.e. diseases occurred in the same linguistic contexts as signs. The sentences were thus annotated, but they contained diseases, not signs. The fourth irrelevant annotation indicates a diagnostic test; it highlights that the causes and consequences of a disease can easily be confused by non-specialists. Most of the left-out sentences contain signs or symptoms expressed by complex units, such as "Levels of creatine kinase in serum were high." (36%). 27% of these sentences are about gene mutations, which can be considered either as causes of diseases or as clinical signs. Others contain patterns which were not selected by the expert but can easily be added to improve recall.


The context annotation is only a first step towards sign and symptom extraction. So far, we have not solved the problem of unit delimitation. In order to achieve this, we have two working hypotheses. We intend to compare chunking and syntactic analysis results in defining the scope of sign and symptom lexical units. Chunking will be conducted with an NLP tool such as TreeTagger, and syntactic analysis will use a dependency parser such as the Stanford Parser (ref.). The latter should allow us to delimit some recurring syntactic structures (e.g. agents, enumerations, etc.).

We also intend to compare our results with results provided by CRFs. First, the features will be classical ones (bag of words, among others); second, we will add the contexts obtained with the text mining to the features. This should enable us to compare our method to others. Finally, we are going to develop an evaluation interface to facilitate the work of the expert. In the absence of comparable corpora, the evaluation can only be manual. Our current sample of 50 abstracts is just a start and needs to be expanded in order to strengthen the evaluation.

Acknowledgments

This research was supported by the Hybride Project ANR-11-BS02-002.

References

Rakesh Agrawal and Ramakrishnan Srikant. 1995. Mining Sequential Patterns. In Proceedings of ICDE'95.

Daniel Albright, Arrick Lanfranchi, Anwen Fredriksen, William F. Styler IV, Colin Warner, Jena D. Hwang, Jinho D. Choi, Dmitriy Dligach, Rodney D. Nielsen, James Martin, Wayne Ward, Martha Palmer and Guergana K. Savova. 2013. Towards comprehensive syntactic and semantic annotations of the clinical narrative. Journal of the American Medical Informatics Association, 20:922–930.

Nicolas Bechet, Peggy Cellier, Thierry Charnois and Bruno Cremilleux. 2012. Discovering linguistic patterns using sequence mining. In Proceedings of the 13th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2012), Springer LNCS, 1:154–165.

Guozhu Dong and Jian Pei. 2007. Sequence Data Mining. Springer.

Carol Friedman. 1997. Towards a Comprehensive Medical Language Processing System: Methods and Issues. In Proceedings of the AMIA Annual Fall Symposium, 1997:595–599.

Sebastian Köhler, Sandra C. Doelken, Christopher J. Mungall, Sebastian Bauer, Helen V. Firth, Isabelle Bailleul-Forestier, Graeme C. M. Black, Danielle L. Brown, Michael Brudno, Jennifer Campbell, David R. FitzPatrick, Janan T. Eppig, Andrew P. Jackson, Kathleen Freson, Marta Girdea, Ingo Helbig, Jane A. Hurst, Johanna Jahn, Laird G. Jackson, Anne M. Kelly, David H. Ledbetter, Sahar Mansour, Christa L. Martin, Celia Moss, Andrew Mumford, Willem H. Ouwehand, Soo-Mi Park, Erin Rooney Riggs, Richard H. Scott, Sanjay Sisodiya, Steven Van Vooren, Ronald J. Wapner, Andrew O. M. Wilkie, Caroline F. Wright, Anneke T. Vulto-van Silfhout, Nicole de Leeuw, Bert B. A. de Vries, Nicole L. Washington, Cynthia L. Smith, Monte Westerfield, Paul Schofield, Barbara J. Ruef, Georgios V. Gkoutos, Melissa Haendel, Damian Smedley, Suzanna E. Lewis and Peter N. Robinson. 2014. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Research, 42:966–974.

Dimitrios Kokkinakis. 2006. Developing Resources for Swedish Bio-Medical Text-Mining. In Proceedings of the 2nd International Symposium on Semantic Mining in Biomedicine (SMBM).

Guergana K. Savova, James J. Masanz, Philip V. Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C. Kipper-Schuler and Christopher G. Chute. 2010. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 17:507–513.

Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing.

Brett R. South, Shuying Shen, Makoto Jones, Jennifer Garvin, Matthew H. Samore, Wendy W. Chapman and Adi V. Gundlapalli. 2009. Developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease. In Summit on Translational Bioinformatics 2009.

Ozlem Uzuner, Brett R. South, Shuying Shen and Scott L. DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18:552–556.

Kavishwar B. Wagholikar, Manabu Torii, Siddhartha R. Jonnalagadda and Hongfang Liu. 2013. Pooling annotated corpora for clinical concept extraction. Journal of Biomedical Semantics, 4:3.



Proceedings of the 2014 Workshop on Biomedical Natural Language Processing (BioNLP 2014), pages 112–117, Baltimore, Maryland, USA, June 26-27 2014. ©2014 Association for Computational Linguistics

Seeking Informativeness in Literature Based Discovery

Judita Preiss
University of Sheffield, Department of Computer Science
Regent Court, 211 Portobello
Sheffield S1 4DP, United Kingdom

Abstract

The continuously increasing number of publications within the biomedical domain has fuelled the creation of literature based discovery (LBD) systems which identify unconnected pieces of knowledge appearing in separate literatures which can be combined to make new discoveries. Without filtering, the amount of hidden knowledge found is vast due to noise, making it impractical for a researcher to examine, or clinically evaluate, the potential discoveries. We present a number of filtering techniques, including two which exploit the LBD system itself rather than being based on a statistical or manual examination of document collections, and we demonstrate usefulness via replication of known discoveries.

1 Introduction and background

The number of publications in the biomedical domain has been observed to increase at a great rate, making it impossible for one person to read them all, and thus potentially leaving knowledge hidden: for example, Swanson (1986) found one publication mentioning a connection between Raynaud's Disease and blood viscosity while another pointed out the effect of fish oil on blood viscosity, but there was no publication making the connection between fish oil and Raynaud's Disease. Automated approaches to knowledge discovery often set up the problem as outlined by Swanson: A is the source term (in this case Raynaud's Disease), a possible target term, C, may be specified (fish oil), and any connections between them form the linking, B, terms. If C is not specified, all possible hidden links from A are explored and the discovery is classified as open. If both A and C terms are supplied, the discovery is closed and only the linking, B, terms are sought.

Independent of how a connection between an A term and a B term is defined (whether this is based on A and B co-occurring in the same title, in the same sentence or the same document, or some other relation), an obvious difficulty is the amount of data generated by a technique along these lines: with no filtering, a great number of connections will be made through terms such as clinical study or patient, and, if not also linked through other terms, these should be discarded. A number of approaches to term reduction have been explored.

Swanson and Smalheiser's (1999) knowledge discovery system, Arrowsmith,1 contains a growing stoplist, currently of 9,500 terms (Swanson et al., 2006), created semi-automatically.2 Such a stoplist is unlikely to be complete – the list has grown from 5,000 (Swanson and Smalheiser, 1997) to 9,500 words (Swanson et al., 2006) and is likely to keep increasing. Overfitting is potentially an issue; in this case the list generated has been criticized for being tuned to the original Raynaud–fish oil discovery (Weeber et al., 2001). A word based stoplist also does not take into account the potential ambiguity of terms: one sense may be highly frequent and uninformative, guaranteeing it an appearance in the stoplist, while another sense may be rare but highly informative.

Instead of using words directly, it is possible to employ a (much smaller) controlled vocabulary: Medical Subject Headings (MeSH), consisting of 22,500 codes, are (mostly) manually assigned to each document indexed in Medline – even though multiple MeSH codes per document are allowed, restricting to this set greatly reduces dimensionality. For example, Srinivasan (2004) uses MeSH based topic profiles to connect A to topics C via the most likely MeSH terms.

1Available at http://arrowsmith.psych.uic.edu/arrowsmith_uic/index.html

2 Note that only 365 words of this stoplist are publicly available.


Keeping entire vocabularies is possible if topics are limited; for example, Fleuren et al. (2011) extract statistics regarding gene co-occurrence and restrict their hidden knowledge generation to biological mechanisms related to those genes.

Another difficulty in using word vocabularies is the necessary identification of multiwords. Weeber et al. (2001) avoid previously tried n-gram techniques (e.g. Gordon and Lindsay (1996)) by switching knowledge discovery to UMLS Concept Unique Identifiers (CUIs). Using MetaMap (Rindflesch and Aronson, 1994) to assign CUIs to texts discards non-content words (CUIs only exist for concepts), resolves ambiguity and deals with multiwords in one step, thus reducing the number of terms considered in later stages. Weeber et al. also exploit the broad subject categories that UMLS assigns to each CUI, which allow the authors to perform domain specific filtering to reduce dimensionality. This they do on a per search basis, tuning the filtering to the replication experiments presented.

Dimensionality reduction can also be performed at the relation level. Swanson's (1997) original work deemed two terms connected if they both appeared in the title of an abstract – titles were thought to be the most informative, and descriptive, part of each article. As the number of abstracts explored during the knowledge discovery process increased, and connections were extended to whole abstracts (rather than titles only), the amount of hidden knowledge generated increased dramatically, and with it the need for term and connection filtering.

Hristovski et al. (2006) argue for filtering within the relation definition – co-occurrence does not provide any basis for a relation between two terms, no underlying semantic reason, and thus, as well as leading to many spurious links, it yields no justification for a hidden connection that is found. They extract subject-relation-object triples, with relations such as treats or affects forming their UMLS concept relations, leading to a much smaller number of (more accurate) relations from which to derive hidden knowledge.

While re-ranking the resulting hidden knowledge (placing the most 'useful' links at the top of the list) is clearly valuable, removing terms from consideration prior to identifying hidden knowledge will reduce the computational load as well as avoid noisy hidden knowledge being produced and possibly accidentally being highly ranked.

We explore a number of filtering approaches, including two novel techniques which can be integrated into any method designed using the Swanson framework, and we compare these against previously explored filtering methods. Section 2 outlines our knowledge discovery approach, Section 3 presents a number of filtering approaches, Section 4 discusses results based on replication of existing knowledge, and Section 5 draws our conclusions.

2 Knowledge discovery system

There are two main components which define an LBD system created following the Swanson framework: the terms and the relations. Based on the arguments presented in Section 1, our system employs UMLS CUIs as produced by SemRep (Rindflesch and Fiszman, 2003), a natural language processing system which identifies semantic relations in biomedical text.3

SemRep extracts relation triples from text by running a set of rules over the output of an underspecified parser. The rules, such as the mapping of treatment to TREATS, map syntactic indicators to predicates in the Semantic Network. Further restrictions are imposed regarding the permissibility of arguments, the viability of the given propositions, and other syntactic constraints, resulting in relations such as

• Epoprostenol TREATS Raynaud Phenomenon

• blood rheology DIAGNOSES Raynaud Disease

Each triple is also output with the corresponding CUIs.

All 29 non-negative relations were extracted (such as AFFECTS, ASSOCIATED WITH, INTERACTS WITH, . . . ), while negative relations (such as NEG AFFECTS, NEG ASSOCIATED WITH, NEG INTERACTS WITH, . . . ) were dropped. The extracted relations form the connections between CUIs: i.e., the set of linking CUIs B is created by following all SemRep links from the CUI A which lead to C through another SemRep relation.
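As a minimal illustration of how such linking terms can be derived from the extracted triples (our own sketch over toy data, not the actual system, which operates on the full SemRep predication database): closed discovery collects the B CUIs reached from A that also connect to C, and open discovery collects every C reachable through some B.

```python
from collections import defaultdict

# Toy predications: (subject CUI, relation, object CUI), restricted to
# non-negated relations; the names are illustrative placeholders.
predications = [
    ("fish_oil", "AFFECTS", "blood_viscosity"),
    ("blood_rheology", "DIAGNOSES", "raynaud_disease"),
    ("blood_viscosity", "ASSOCIATED_WITH", "raynaud_disease"),
    ("epoprostenol", "TREATS", "raynaud_disease"),
]

def neighbours(preds):
    """Undirected connection graph over CUIs induced by the predications."""
    adj = defaultdict(set)
    for subj, _rel, obj in preds:
        adj[subj].add(obj)
        adj[obj].add(subj)
    return adj

def closed_discovery(a, c, preds):
    """Linking CUIs B such that A-B and B-C are both asserted."""
    adj = neighbours(preds)
    return adj[a] & adj[c]

def open_discovery(a, preds):
    """Hidden C terms reachable from A via at least one linking B term."""
    adj = neighbours(preds)
    hidden = set()
    for b in adj[a]:
        hidden |= adj[b]
    return hidden - adj[a] - {a}

print(closed_discovery("fish_oil", "raynaud_disease", predications))  # {'blood_viscosity'}
print(open_discovery("fish_oil", predications))                       # {'raynaud_disease'}
```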

3 In this work, the SemRep annotated Medline data, database semmedVER24 (processed up to November 2013), run over 23,319,737 citations to yield 68,000,470 predications, was downloaded from http://skr3.nlm.nih.gov and used throughout.


3 Filtering approaches

While employing CUIs (rather than words) eliminates non-content words (thus immediately reducing noise), it does not eliminate CUIs corresponding to patient, week, statement, etc. We present, and in Section 4 evaluate (individually and in combination), four filtering approaches, of which two are, to our knowledge, completely novel.

3.1 Synonyms

While not a filtering method under the usual definition, the identification of synonym CUIs and the collapsing thereof results in a reduction of the number of CUIs being used (i.e. the technique filters out some CUIs).

A manual examination of the documents containing CUI C0034734, Raynaud Disease, revealed that some of the expected connections were missing and were linked to CUI C0034735, Raynaud Phenomenon, instead. The resulting hidden knowledge is greatly affected by the particular CUI chosen as the source term A, yet in this case the two CUIs are synonymous. The MRREL related concepts file within UMLS contains pairs of CUIs in related relationships, including the SY (source asserted synonymy) relationship,4 and CUIs C0034734 and C0034735 appear in the SY relationship in this list. Identifying concepts within the SY relationship has the following advantages:

• Merging such synonyms into classes will allow the retrieval of more hidden knowledge if multiple synonymous CUIs correspond to the start point, A (as in the case of Raynaud Disease).

• There will potentially be more hidden knowledge created if a multiclass CUI is a linking term (as A connected to C0034734 and C connected to C0034735 would not have been found to be connected if these were the only potential overlap).

• Synonymous hidden knowledge (and linking terms) will merge, reducing the amount of knowledge (and terms) to manually explore.

Merging synonyms into single CUI classes reduces the 561,155 CUIs present in UMLS to 540,440 CUI classes.5
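A sketch of this synonym collapsing (our own illustration; the CUI pairs are assumed to come from the SY rows of the MRREL file): a union-find structure merges CUIs connected by SY links into classes, and every CUI is then replaced by a canonical class representative.

```python
class UnionFind:
    """Minimal union-find for collapsing SY-related CUIs into classes."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

# sy_pairs would be read from MRREL, keeping only SY relationships.
sy_pairs = [("C0034734", "C0034735")]  # Raynaud Disease ~ Raynaud Phenomenon
uf = UnionFind()
for a, b in sy_pairs:
    uf.union(a, b)

def canonical(cui):
    """Class representative used in place of the original CUI."""
    return uf.find(cui)

assert canonical("C0034734") == canonical("C0034735")
```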

4 Due to the version of SemRep files used, UMLS 2013AA is employed throughout.

5 Note that other MRREL related relationships were explored, but completing cycles leads to multiple extremely large equivalence classes.

3.2 Semantic types

The UMLS Semantic Network consists of 133 semantic types, a kind of subject category assigned to each CUI. Many of these categories are clearly unhelpful for knowledge discovery (for example, geographic area or language), and 70 semantic types are manually selected for removal (by examining the basic information about the relation, as well as the structure of the network and the CUIs assigned to each semantic type). This removes a further 121,284 CUIs.

3.3 Discarding common linking terms

In some cases, a given CUI is clearly too general to be a useful linking term, but its semantic type contains more specific CUIs which should not be removed. Restricting semantic type filtering based on the depth within the hierarchy is also not a viable option, as UMLS is composed of different hierarchies, each with a different level of granularity, and establishing an overall threshold would likely include general terms for some while discarding crucial terms for others. Therefore, another approach is needed for these CUIs.

Along the lines of Swanson et al. (2006), a stoplist can be built to contain such terms, without over-training for a particular discovery and without the need for manual intervention: we hypothesize that any CUIs which are linking terms more often than others can effectively form a stoplist.

The creation of this stoplist can be performed iteratively:

1. Start with an empty stoplist set S.

2. Create hidden knowledge based on SemRep connections between CUIs, removing any connections to CUIs in set S (the hidden knowledge is acquired from Medline articles published between 1865 and 2000).

3. Randomly select 10,000 hidden knowledge pairs, identify their linking CUIs, and add any linking CUIs appearing in more than threshold of the pairs to S (the value of threshold needs to be empirically determined).

4. If Step 3 increased the size of S, return toStep 2.

Note that since the training set is not designed for any particular discovery, this should not result in an over-trained stoplist.
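The loop above can be sketched as follows. This is our reading of the procedure; generate_hidden_knowledge and linking_cuis stand in for the actual LBD machinery and are assumptions.

```python
import random

def build_stoplist(generate_hidden_knowledge, linking_cuis, threshold, sample_size=10000):
    """Iteratively grow a stoplist S of overly frequent linking CUIs.

    generate_hidden_knowledge(S) -> list of hidden (A, C) pairs, ignoring CUIs in S
    linking_cuis(pair)           -> set of CUIs linking that pair
    """
    stoplist = set()                                          # step 1
    while True:
        pairs = generate_hidden_knowledge(stoplist)           # step 2
        sample = random.sample(pairs, min(sample_size, len(pairs)))
        counts = {}                                           # step 3
        for pair in sample:
            for cui in linking_cuis(pair):
                counts[cui] = counts.get(cui, 0) + 1
        new = {cui for cui, c in counts.items() if c > threshold} - stoplist
        if not new:                                           # step 4: stop when S is stable
            return stoplist
        stoplist |= new
```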



3.4 Breaking high frequency connections

The creation of a stoplist will always suffer from omissions and from inclusions of CUIs that should not be filtered out in every instance. The last approach is based on a slightly different underlying idea: instead of finding frequently appearing terms, this approach bases its decisions on the number of terms a given term is connected to.

Two CUIs A and B are deemed connected if a (non-negative) SemRep relation exists which links them. If A corresponds to a term such as study or patient, it is expected to be connected to a large number of CUIs. We hypothesize that terms which are so highly connected are likely to be relatively general terms, and therefore uninformative linking terms.

This gives rise to the following filtering options:

1. Break (discard) all connections to CUI A when C(A) > threshold.

2. Discard the connection between CUIs A and B when min(C(A), C(B)) > threshold.

(Here C(A) represents the number of CUIs linked to A, and the threshold needs to be empirically determined.)

Method 1 effectively forms a stoplist of highly connected CUIs, but method 2 is different: only connections satisfying the condition are broken, while A remains under consideration. This allows filtering method 2 to leave a frequently connected term as a linking term for a rare term (unlike method 1, which would discard such a term).
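The two options can be contrasted in a few lines. This is our own sketch; `connections` is assumed to map each CUI to the set of CUIs it is linked to by a non-negated SemRep relation.

```python
def degree(connections, cui):
    """C(cui): the number of CUIs linked to cui."""
    return len(connections.get(cui, set()))

def keep_edge_method_1(connections, a, b, threshold):
    """Option 1: an edge is dropped if either endpoint is too highly connected
    (equivalent to a stoplist of CUIs with C(.) > threshold)."""
    return max(degree(connections, a), degree(connections, b)) <= threshold

def keep_edge_method_2(connections, a, b, threshold):
    """Option 2: an edge is dropped only if *both* endpoints are too highly
    connected, so a frequent CUI can still link to a rare one."""
    return min(degree(connections, a), degree(connections, b)) <= threshold

# Toy example: 'patient' is linked to many CUIs, 'rare_gene' to only one.
connections = {
    "patient": {"rare_gene", "c1", "c2", "c3"},
    "rare_gene": {"patient"},
    "c1": {"patient"}, "c2": {"patient"}, "c3": {"patient"},
}
print(keep_edge_method_1(connections, "patient", "rare_gene", threshold=3))  # False
print(keep_edge_method_2(connections, "patient", "rare_gene", threshold=3))  # True
```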

4 Results

Swanson's original discoveries (Swanson, 1986; Swanson, 1988) were verified through clinical trials, and the evaluation of LBD systems often involves replication of these discoveries (Gordon and Lindsay, 1996; Weeber et al., 2001). From the literature, we identify seven separate discoveries to replicate (presented with the labels used in Table 1):

RD: Raynaud disease and fish oil (Swanson, 1986).

Arg: Somatomedin C and arginine (Swanson, 1990).

Mg: Migraine disorders and magnesium (Hu et al., 2006).

ND: Magnesium deficiency and neurologic disease (Smalheiser and Swanson, 1994).

INN: Alzheimer's and indomethacin (Smalheiser and Swanson, 1996a).

estrogen: Alzheimer's disease and estrogen (Smalheiser and Swanson, 1996b).

Ca2+iPLA2: Schizophrenia and Calcium-Independent Phospholipase A2 (Smalheiser and Swanson, 1997).

The same subset of Medline as in each original discovery is employed for replication, and any abstracts containing a direct link between the two terms are removed (note that including these would not have affected the original discoveries, as these only used titles) – thus any connections between A and C are necessarily hidden and require at least one linking term.

The Raynaud–fish oil and migraine–magnesium connections are the most commonly replicated discoveries, while the remaining discoveries are rarely explored. For CUI based investigations, this is likely due to the difficulty of selecting a representative CUI for the sought concepts. The second concept in the Schizophrenia and Calcium-Independent Phospholipase A2 connection is particularly tricky: UMLS suggests CUI C1418624 (PLA2G6 gene) as the most likely match, followed by CUI C2830173 (Calcium-Independent Phospholipase A2) as the second most likely. However, neither CUI is found in any relations in the given date range by SemRep. Closer examination reveals that the Ca2+iPLA2 connections in the 1960-1997 Medline range involve CUI C0538273 (PLA2G6 protein, human). Not only does this highlight the difficulty of the replication task, it further motivates the need for a 'synonym' (or related concept) list.

The number of linking terms found between each pair of sought terms is presented in Table 1 (zero linking terms means the connection was not found) for a subset of the filtering results. ST represents semantic type filtering, HF the breaking of high frequency connections (a min subscript denoting the version which takes into account the connectivity of both CUIs), together with the threshold value, and LT the elimination of common linking terms, again with the relevant threshold value.

While the Raynaud–fish oil connection appears to be consistently produced by the system, Table 2 reveals the value of filtering: with no filtering, the two linking terms are pure noise and the connection should not be made.


Setting               RD   Arg   Mg   ND   INN   estrogen   Ca2+iPLA2
No filtering           2   235   78   98   370        500           7
Synonyms (Sy)          6   173   58   65   296        415          16
Sy & LT-200            3   145   48   56   265          0          16
Sy & HF-2900           6   149   56   52   243          0          14
Sy & HFmin-900         6    73   22   27    82        164           9
Sy & HFmin-400         6    25    5    8    25         65           8
Sy & ST                4   130   47   43   234        331          13
Sy & ST & LT-200       3   108   41   38   207          0          13
Sy & ST & HF-2500      4   120   47   38   205          0          13
Sy & ST & HFmin-900    6    73   22   27    82        164           9
Sy & ST & HFmin-400    4    28    6   12    30         73           6

Table 1: Number of hidden links found during replication

Employing UMLS synonyms adds genuine linking terms,6 and restricting by semantic types drops the remaining general terms. Discarding common linking terms finds antimicrobial susceptibility to be a frequently used linking term, and it is also dropped. A great advantage of the technique can be seen when connections are made through hundreds of terms – in this case, lower thresholds (and thus more aggressive filtering) can be employed to reduce the number of linking terms to the most promising set. Should these not be sufficient, the threshold can be increased to produce more linking terms; as such, the burden on the user in checking a large number of linking terms when a hidden connection is suspected can be greatly reduced, without sacrificing connections should more be needed.

Term                           NF   Sy   Sy ST   LT-200
acetylsalicylic acid            ×    X      X       X
antimicrobial susceptibility    ×    X      X       ×
blood viscosity                 ×    X      X       X
brain infarction                ×    X      X       X
patient                         X    X      ×       ×
volunteer helper                X    X      ×       ×

Table 2: Linking term analysis for RD

For example, common linking term filtering removes the term estrogen from consideration, as therapeutic estrogen is a commonly used linking term, making the estrogen–AD link impossible to find. Linking term frequencies (on a 10,000 pair sample) exceeding values from 50 to 200 (in increments of 50) were tested, resulting in the removal of between 1,902 and 227 CUIs. Therapeutic estrogen appears in all the lists. Similarly, the CUI is dropped when high frequency connections are broken using the first technique, which is based on stoplists. This highlights the value of the second high frequency connection technique, which only discards particular connections (rather than CUIs), so the therapeutic estrogen CUI remains a searchable CUI.

As shown, the system replicates most of the previously published discoveries, with its main asset being noise reduction: the number of linking terms for a suspected connection (closed discovery) can be greatly reduced to remove spurious connections, with backoffs available to yield more connections should more be required. For novel applications (i.e. open discovery), the technique greatly reduces the amount of hidden knowledge generated from a source term A. For example, the amount of hidden knowledge generated from somatomedin C drops from 82,601 CUIs when no filtering is performed to 3,005 CUIs with synonym and semantic type filtering and the breaking of connections with frequency above 200, which represents a great reduction for a user who is likely looking for a particular type of C term.

6 Note that the merging of synonyms is achieved without the need to back off to general classes (e.g. Srinivasan (2004)), which have been observed to lead to connections based on "aboutness" rather than producing genuine hidden knowledge (Beresi et al., 2008).

5 Conclusions and future work

We present and demonstrate the effectiveness of a number of filtering methods, including two novel techniques applicable to any LBD system built according to the Swanson framework – one approach based on stoplist methods, but requiring no manual intervention except for a user's selection of a threshold, and the second based on removing connections when these are deemed likely to contribute mainly noise. A great advantage of the second approach is shown to be the fact that terms are not directly discarded, as with a stoplist, and thus a fairly common term can remain a source term when required.

While the method is evaluated by replicating known discoveries, we suggest that the noise reduction performed ultimately leads to a much more user friendly LBD system, and we plan to investigate other evaluation approaches, such as timeslicing (Yetisgen-Yildiz and Pratt, 2009), as part of future work.

Acknowledgements

Judita Preiss was supported by the EPSRC grant EP/J008427/1: Language Processing for Literature Based Discovery in Medicine.

References

Ulises Cervino Beresi, Mark Baillie, and Ian Ruthven. 2008. Towards the evaluation of literature based discovery. In Proceedings of the Workshop on Novel Evaluation Methodologies (at ECIR 2008), pages 5–13.

Wilco W. M. Fleuren, Stefan Verhoeven, Raoul Frijters, Bart Heupers, Jan Polman, Rene van Schaik, Jacob de Vlieg, and Wynand Alkema. 2011. CoPub update: CoPub 5.0, a text mining system to answer biological questions. Nucleic Acids Research, 39 (Web Server issue). doi:10.1093/nar/gkr310.

Michael D. Gordon and Robert K. Lindsay. 1996. Toward discovery support systems: a replication, re-examination, and extension of Swanson's work on literature-based discovery of a connection between Raynaud's and fish oil. Journal of the American Society for Information Science, 47(2):116–128.

Dimitar Hristovski, Carol Friedman, Thomas C. Rindflesch, and Borut Peterlin. 2006. Exploiting semantic relations for literature-based discovery. In Proceedings of the 2006 AMIA Annual Symposium, pages 349–353.

Xiaohua Hu, Xiaodan Zhang, Illhoi Yoo, and Yanqing Zhang. 2006. A semantic approach for mining hidden links from complementary and non-interactive biomedical literature. In SDM.

Thomas C. Rindflesch and Alan R. Aronson. 1994. Ambiguity resolution while mapping free text to the UMLS Metathesaurus. In J. G. Ozbolt, editor, Proceedings of the Eighteenth Annual Symposium on Computer Applications in Medical Care, pages 240–244.

Thomas C. Rindflesch and Marcelo Fiszman. 2003. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. Journal of Biomedical Informatics, 36(6):462–477.

Neil R. Smalheiser and Don R. Swanson. 1994. Assessing a gap in the biomedical literature: Magnesium deficiency and neurologic disease. Neuroscience Research Communications, 15(1):1–9.

Neil R. Smalheiser and Don R. Swanson. 1996a. Indomethacin and Alzheimer's disease. Neurology, 46:583.

Neil R. Smalheiser and Don R. Swanson. 1996b. Linking estrogen to Alzheimer's disease. Neurology, 47:809–810.

Neil R. Smalheiser and Don R. Swanson. 1997. Calcium-independent phospholipase A2 and schizophrenia. Archives of General Psychiatry, 55(8):752–753.

Padmini Srinivasan. 2004. Text mining: generating hypotheses from MEDLINE. Journal of the American Society for Information Science and Technology, 55(5):396–413.

Don R. Swanson and Neil R. Smalheiser. 1997. An interactive system for finding complementary literatures: A stimulus to scientific discovery. Artificial Intelligence, 91:183–203.

Don R. Swanson and Neil R. Smalheiser. 1999. Link analysis of MEDLINE titles: Using Arrowsmith as an aid to scientific discovery. Library Trends, 48:48–59.

Don R. Swanson, Neil R. Smalheiser, and Vetle I. Torvik. 2006. Ranking indirect connections in literature-based discovery: The role of medical subject headings. Journal of the American Society for Information Science and Technology, 57(11):1427–1439.

Don R. Swanson. 1986. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine, 30:7–18.

Don R. Swanson. 1988. Migraine and magnesium: Eleven neglected connections. Perspectives in Biology and Medicine, 31(4):526–557.

Don R. Swanson. 1990. Somatomedin C and arginine: Implicit connections between mutually isolated literatures. Perspectives in Biology and Medicine, 33(2):157–186.

Marc Weeber, Rein Vos, Henny Klein, and Lolkje T. W. de Jong-van den Berg. 2001. Using concepts in literature-based discovery: Simulating Swanson's Raynaud – fish oil and migraine – magnesium discoveries. Journal of the American Society for Information Science and Technology, 52(7):548–557.

M. Yetisgen-Yildiz and W. Pratt. 2009. A new evaluation methodology for literature-based discovery. Journal of Biomedical Informatics, 42(4):633–643.


Proceedings of the 2014 Workshop on Biomedical Natural Language Processing (BioNLP 2014), pages 118–127, Baltimore, Maryland, USA, June 26-27 2014. ©2014 Association for Computational Linguistics

Towards Gene Recognition from Rare and Ambiguous Abbreviations using a Filtering Approach

Matthias Hartung∗, Roman Klinger∗, Matthias Zwick‡ and Philipp Cimiano∗

∗Semantic Computing Group
Cognitive Interaction Technology – Center of Excellence (CIT-EC)
Bielefeld University
33615 Bielefeld, Germany
{mhartung,rklinger,cimiano}@cit-ec.uni-bielefeld.de

‡Research Networking
Boehringer Ingelheim Pharma GmbH
Birkendorfer Str. 65
88397 Biberach, Germany
matthias.zwick@boehringer-ingelheim.com

Abstract

Retrieving information about highly ambiguous gene/protein homonyms is a challenge, in particular where their non-protein meanings are more frequent than their protein meaning (e. g., SAH or HF). Due to their limited coverage in common benchmarking data sets, the performance of existing gene/protein recognition tools on these problematic cases is hard to assess.

We uniformly sample a corpus of eight ambiguous gene/protein abbreviations from MEDLINE and provide manual annotations for each mention of these abbreviations.1 Based on this resource, we show that available gene recognition tools such as conditional random fields (CRF) trained on BioCreative 2 NER data or GNAT tend to underperform on this phenomenon.

We propose to extend existing gene recognition approaches by combining a CRF and a support vector machine. In a cross-entity evaluation and without taking any entity-specific information into account, our model achieves a gain of 6 points F1-measure over our best baseline, which checks for the occurrence of a long form of the abbreviation, and more than 9 points over all existing tools investigated.

1 Introduction

In pharmaceutical research, a common task is to gather all relevant information about a gene, e. g., from published articles or abstracts. The task of recognizing the mentions of genes or proteins can be understood as the classification problem of deciding whether the entity of interest denotes a gene/protein or something else. For highly ambiguous short names, this task can be particularly challenging. Consider, for instance, the gene acyl-CoA synthetase medium-chain family member 3, which has the synonyms protein SA homolog or SA hypertension-associated homolog, among others, with the abbreviations ACSM3 and SAH.2 Standard thesaurus-based search engines would retrieve results where SAH denotes the gene/protein of interest, but also occurrences in which it denotes other proteins (e. g., ATX1 antioxidant protein 1 homolog3) or entities from semantic classes other than genes/proteins (e. g., the symptom sub-arachnoid hemorrhage).

1 The annotated corpus is available for future research at http://dx.doi.org/10.4119/unibi/2673424.

For an abbreviation such as SAH, the use as denoting a symptom or another semantic class different from genes/proteins is more frequent by a factor of 70 compared to protein-denoting mentions according to our corpus analysis, such that the retrieval precision for acyl-CoA synthetase based on the occurrence of the synonym SAH is only about 0.01, which is totally unacceptable for practical applications.
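The stated precision follows roughly from that frequency ratio: if non-protein uses are about 70 times as frequent as protein-denoting mentions, retrieving every occurrence of SAH gives

$$ P \approx \frac{1}{1 + 70} \approx 0.014 \approx 0.01. $$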

In this paper, we discuss the specific challenge of recognizing such highly ambiguous abbreviations. We consider eight entities and show that common corpora for gene/protein recognition are of limited value for their investigation. The abbreviations we consider are SAH, MOX, PLS, CLU, CLI, HF, AHR and COPD (cf. Table 1). Based on a sample from MEDLINE,4 we show that these names do actually occur in biomedical text, but are underrepresented in corpora typically used for benchmarking and developing gene/protein recognition approaches.

2 http://www.ncbi.nlm.nih.gov/gene/6296
3 http://www.ncbi.nlm.nih.gov/gene/443451
4 http://www.nlm.nih.gov/pubs/factsheets/medline.html


Synonym | Other names | Other meaning | EntrezGene ID
SAH  | acyl-CoA synthetase medium-chain family member 3; ACSM3 | subarachnoid hemorrhage; S-Adenosyl-L-homocysteine hydrolase | 6296
MOX  | monooxygenase, DBH-like 1 | moxifloxacin; methylparaoxon | 26002
PLS  | POLARIS | partial least squares; primary lateral sclerosis | 3770598
CLU  | clusterin; CLI | covalent linkage unit | 1191
CLI  | clusterin; CLU | clindamycin | 1191
HF   | complement factor H; CFH | high frequency; heart failure; Hartree-Fock | 3075
AHR  | aryl hydrocarbon receptor; bHLHe76 | airway hyperreactivity | 196
COPD | archain 1; ARCN1; coatomer protein complex, subunit delta | Chronic Obstructive Pulmonary Disease | 22819; 372

Table 1: The eight synonyms for genes/proteins which are the subject of analysis in this paper and their long names, together with frequent other meanings.

We propose a machine learning-based filtering approach to detect whether a mention in question actually denotes a gene/protein or not, and show that for the eight highly ambiguous abbreviations that we consider, the performance of our approach in terms of F1 measure is higher than for a state-of-the-art tagger based on conditional random fields (CRF), a freely available dictionary-based approach, and an abbreviation resolver. We evaluate different parameters and their impact in our filtering approach and discuss the results. Note that this approach does not take any information about the specific abbreviation into account and can therefore be expected to generalize to names not considered in our corpus.

The main contributions of this paper are:

(i) We consider the problem of recognizing highly ambiguous abbreviations that frequently do not denote proteins, a task that has so far attracted only limited attention.

(ii) We show that the recognition of such ambiguous mentions is important, as their string representations are frequent in collections such as MEDLINE.

(iii) We show, however, that this set of ambiguous names is underrepresented in corpora commonly used for system design and development. Such corpora do not provide a sufficient data basis for studying the phenomenon or for training systems that appropriately handle such ambiguous abbreviations. We contribute a manually annotated corpus of 2174 occurrences of ambiguous abbreviations.

(iv) We propose a filtering method for classifying ambiguous abbreviations as denoting a protein or not. We show that this method has a positive impact on the overall performance of named entity recognition systems.

2 Related Work

The task of gene/protein recognition consists in the classification of terms as actually denoting a gene/protein or not. The task is typically tackled using either machine learning or dictionary-based approaches. Machine learning approaches rely on appropriate features describing the local context of the term to be classified and induce a model to perform the classification from training data. Conditional random fields have been shown to yield very good results on the task (Klinger et al., 2007; Leaman and Gonzalez, 2008; Kuo et al., 2007; Settles, 2005).

Dictionary-based approaches rely on an explicit dictionary of gene/protein names that are matched in text. Such systems are common in practice due to the low overhead required to adapt and maintain the system, essentially only requiring the dictionary to be extended. Examples of commercial systems are ProMiner (Fluck et al., 2007) or I2E (Bandy et al., 2009); a popular free system is made available by Hakenberg et al. (2011).

Such dictionary-based systems typically incorporate rules for filtering false positives. For instance, in ProMiner (Hanisch et al., 2003), ambiguous synonyms are only accepted based on external dictionaries and matches in the context. Abbreviations are only accepted if a long form matching all parts of the abbreviation occurs in the context (following Schwartz and Hearst (2003)). Similarly, Hakenberg et al. (2008) discuss global disambiguation on the document level, such that all mentions of a string in one abstract are uniformly accepted as denoting an entity or not.

A slightly different approach is taken by the web service GeneE5 (Schuemie et al., 2010): entering a query as a gene/protein in the search field generates a query to, e.g., PubMed6 with the goal of limiting the number of false positives.

5http://biosemantics.org/geneE


                  MEDLINE                BioCreative2              GENIA
Protein     # Tokens   % tagged     # Tokens   % of genes     # Tokens   % of genes
SAH          30019       6.1 %          2          0 %            0          –
MOX          16007      13.1 %          0          –              0          –
PLS          11918      25.9 %          0          –              0          –
CLU           1077      29.1 %          0          –              0          –
CLI           1957       4.8 %          4          0 %            0          –
HF           42563       7.9 %          8         62.5 %          4          0 %
AHR          21525      75.7 %         12         91.7 %          0          –
COPD         44125       0.6 %          6          0 %            0          –

Table 2: Coverage of ambiguous abbreviations in the MEDLINE, BioCreative2 and GENIA corpora. The percentage of tokens tagged as a gene/protein in MEDLINE (% tagged) is determined with a conditional random field in the configuration described by Klinger et al. (2007), but without dictionary-based features (to foster the usage of contextual features). The percentages of genes/proteins (% of genes) in BC2 and GENIA are based on the annotations in these corpora.


Prior to the common application of CRFs, other machine learning methods were also popular for the task of entity recognition. For instance, Mitsumori et al. (2005) and Bickel et al. (2004) use a support vector machine (SVM) with part-of-speech information and dictionary-based features, amongst others. Zhou et al. (2005) use an ensemble of different classifiers for recognition.

In contrast to this application of a classifier to solve the recognition task entirely, other approaches (including the one in this paper) aim at filtering specifically ambiguous entities from a previously defined set of challenging terms. For instance, Al-mubaid (2006) utilizes a word-based classifier and mutual information-based feature selection to obtain a highly discriminating list of terms which is applied for filtering candidates.

Similarly to our approach, Tsuruoka and Tsujii (2003) use a classifier, in their case a naïve Bayes approach, to learn which entities to filter from the candidates generated by a dictionary-based approach. They use word-based features in the context, including the candidate itself. Therefore, the approach is focused on specific entities.

Gaudan et al. (2005) use an SVM and a dictionary of long forms of abbreviations to assign them a specific meaning, taking contextual information into account. However, their machine learning approach is trained on each possible sense of an abbreviation. In contrast, our approach consists in deciding whether a term is used as a protein or not. Further, we do not train to detect specific, previously given senses.

6http://www.ncbi.nlm.nih.gov/pubmed/

Xu et al. (2007) apply text similarity measures to decide about specific meanings of mentions. They focus on the disambiguation between different entities. A corpus for word sense disambiguation is automatically built based on MeSH annotations by Jimeno-Yepes et al. (2011). Okazaki et al. (2010) build a sense inventory by automatically applying patterns to MEDLINE and use this in a logistic regression approach.

Approaches are typically evaluated on freely available resources like the BioCreative Gene Mention Task Corpus, to which we refer as BC2 (Smith et al., 2008), or the GENIA Corpus (Kim et al., 2003). When it comes to identifying particular proteins by linking the protein in question to some protein in an external database – a task we do not address in this paper – the BioCreative Gene Normalization Task Corpus is a common resource (Morgan et al., 2008).

In contrast to these previous approaches, our method is not tailored to a particular set of entities or meanings, as the training methodology abstracts from specific entities. The model, in fact, knows nothing about the abbreviations to be classified and does not use their surface form as a feature, such that it can be applied to any unseen gene/protein term. This leads to a simpler model that is applicable to a wide range of gene/protein term candidates. Our cross-entity evaluation regime clearly corroborates this.

3 Data

We focus on eight ambiguous abbreviations of gene/protein names. As shown in Table 2, these homonyms occur relatively frequently in MEDLINE but are underrepresented in the BioCreative 2 entity recognition data set and the GENIA corpus, which are both commonly used for developing and evaluating gene recognition approaches.


Protein   Pos. Inst.   Neg. Inst.   Total
SAH            5           349        354
MOX           62           221        283
PLS            1           206        207
CLU          235            30        265
CLI           11           211        222
HF             2           353        355
AHR           53            80        133
COPD           0           250        250

Table 3: Number of instances per protein in the annotated data set and their positive/negative distribution

We compiled a corpus from MEDLINE by randomly sampling 100 abstracts for each of the eight abbreviations (81 for MOX), such that each abstract contains at least one mention of the respective abbreviation. One of the authors manually annotated each mention of the eight abbreviations under consideration as being a gene/protein entity or not. These annotations were validated by another author. The two annotators disagreed in only 2% of the cases. The numbers of annotations, including their distribution over positive and negative instances, are summarized in Table 3. The corpus is made publicly available at http://dx.doi.org/10.4119/unibi/2673424 (Hartung and Zwick, 2014).

In order to alleviate the imbalance of positive and negative examples in the data, additional positive examples were gathered by manually searching PubMed.7 Here, special attention was paid to extracting only instances denoting the correct gene/protein corresponding to the full long name, as we are interested in assessing the impact of examples of particularly high quality. This process yields 69 additional instances for AHR (distributed over 11 abstracts), 7 instances (3 abstracts) for HF, 14 instances (2 abstracts) for PLS and 15 instances (7 abstracts) for SAH. For the other genes/proteins in our dataset, no additional positive instances of this kind could be retrieved using PubMed. In the following, this process is referred to as manual instance generation. This additional data is used for training only.

7http://www.ncbi.nlm.nih.gov/pubmed

4 Gene Recognition by Filtering

We frame gene/protein recognition from ambiguous abbreviations as a filtering task in which a set of candidate tokens is classified into entities and non-entities. In this paper, we assume the candidates to be generated by a simple dictionary-based approach taking into account all tokens that match the abbreviation under consideration.

4.1 Filtering Strategies

We consider the following filtering approaches:

• SVM classifies the occurring terms based on a binary support vector machine.

• CRF classifies the occurring terms based on a conditional random field (configured as described by Klinger et al. (2007)) trained on the concatenation of BC2 data and our newly generated corpus. This setting thus corresponds to state-of-the-art performance on the task.

• CRF∩SVM considers the candidate an entity if both the standard CRF and the SVM from the previous steps yield a positive prediction.

• HRCRF∩SVM is the same as the previous step, but the output of the CRF is optimized towards high recall by joining the entities recognized on the five most likely Viterbi paths.

• CRF→SVM is similar to the first setting, but the output of the CRF is taken into account as a feature in the SVM.

4.2 Features for Classification

Our classifier uses local contextual and global features. Local features focus on the immediate context of an instance, whereas global features encode abstract-level information. Throughout the following discussion, ti denotes a token at position i that corresponds to a particular abbreviation to be classified in an abstract A. Note that we blind the actual representation of the entity to be able to generalize to all genes/proteins, not being limited to the ones contained in our corpus.

4.2.1 Local Information

The feature templates context-left and context-right collect the tokens immediately surrounding an abbreviation in a window of size 6 (left) and 4 (right) in a bag-of-words-like feature generation. Additionally, the two tokens from the immediate context on each side are combined into bigrams.

The template abbreviation generates features if ti occurs in brackets. It takes into account the minimal Levenshtein distance (ld, Levenshtein (1966)) between all long forms L of the abbreviation (as retrieved from EntrezGene) and each string to the left of ti (up to a length of seven tokens, with tk:i denoting the concatenation of tokens tk, . . . , ti). The similarity value sim(ti) taken into account is therefore given by

$$ \mathrm{sim}(t_i) = \max_{l \in L;\, k \in [1:7]} \left( 1 - \frac{\mathrm{ld}(t_{k:i-1},\, l)}{\max(|t_i|, |l|)} \right), $$

where the denominator is a normalization term. The features used are generated by cumulative binning of sim(ti).
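A sketch of this feature (our own illustration, assuming whitespace tokens and a plain dynamic-programming edit distance; it mirrors the formula above, including its normalization term):

```python
def levenshtein(a, b):
    """Edit distance (ld in the formula), computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def abbreviation_similarity(tokens, i, long_forms, max_span=7):
    """max over long forms l and left spans t_{k:i-1} of 1 - ld(t_{k:i-1}, l) / max(|t_i|, |l|)."""
    best = 0.0
    for l in long_forms:
        for span in range(1, max_span + 1):
            if i - span < 0:
                break
            candidate = " ".join(tokens[i - span:i])
            norm = max(len(tokens[i]), len(l))
            best = max(best, 1 - levenshtein(candidate, l) / norm)
    return best

tokens = "acyl-CoA synthetase medium-chain family member 3 ( SAH )".split()
print(abbreviation_similarity(tokens, tokens.index("SAH"),
                              ["acyl-CoA synthetase medium-chain family member 3"]))
```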

The feature taggerlocal takes the prediction of the CRF for ti into account. Note that this feature is only used in the CRF→SVM setting.

4.2.2 Global Information

The feature template unigrams considers each word in A as a feature. There is no normalization or frequency weighting. Stopwords are ignored.8 Occurrences of the same string as ti are blinded.

The feature taggerglobal collects all tokens in A other than ti that are tagged as an entity by the CRF. In addition, the cardinality of these entities in A is taken into account by cumulative binning.

The feature long form holds if one of the long forms previously defined to correspond with the abbreviation occurs in the text (in arbitrary position).

Besides using all features, we perform a greedy search for the best feature set by wrapping the best model configuration. A detailed discussion of the feature selection process follows in Section 5.3.

4.2.3 Feature Propagation

Inspired by the "one sense per discourse" heuristic commonly adopted in word sense disambiguation (Gale et al., 1992), we apply two feature combination strategies. In the following, n denotes the number of occurrences of the abbreviation in an abstract.

In the setting propagationall, n − 1 identical linked instances are added for each occurrence. Each new instance consists of the disjunction of the feature vectors of all occurrences. Based on the intuition that the first mention of an abbreviation might carry particularly valuable information, propagationfirst introduces one additional linked instance for each occurrence, in which the feature vector is joined with that of the first occurrence.

8 Using the stopword list at http://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T43/, last accessed on March 25, 2014

Setting          P     R     F1
SVM              0.81  0.45  0.58
CRF∩SVM          0.99  0.26  0.41
HRCRF∩SVM        0.95  0.27  0.42
CRF→SVM          0.83  0.49  0.62
CRF→SVM+FS       0.97  0.74  0.84
GNAT             0.73  0.45  0.56
CRF              0.55  0.43  0.48
AcroTagger       0.92  0.63  0.75
Long form        0.98  0.65  0.78
lex              0.18  1.00  0.32

Table 4: Overall micro-averaged results over eight genes/proteins. For comparison, we show the results of a default run of GNAT (Hakenberg et al., 2011), a CRF trained on BC2 data (Klinger et al., 2007), AcroTagger (Gaudan et al., 2005), and a simple approach of accepting every token of the respective string as a gene/protein entity (lex). Feature selection is denoted with +FS.

In both settings, all original and linked instances are used for training, while during testing, original instances are classified by majority voting on their linked instances. For propagation_all, this results in classifying each occurrence identically.
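A minimal sketch of the two propagation strategies and the majority-vote decision, under the assumption that each occurrence's feature vector can be represented as a set of active binary features; the names and toy data are illustrative, not taken from the authors' system.

```python
from collections import Counter

def propagate_all(instances):
    """propagation_all: for each occurrence, add n-1 linked instances whose
    feature set is the union (disjunction) of all occurrences in the abstract."""
    union = set().union(*instances)
    return [[inst] + [union] * (len(instances) - 1) for inst in instances]

def propagate_first(instances):
    """propagation_first: add one linked instance per occurrence, joined with
    the feature set of the first mention."""
    first = instances[0]
    return [[inst, inst | first] for inst in instances]

def classify_with_voting(groups, classify):
    """At test time, each original instance is labelled by majority vote
    over its (original plus linked) instances."""
    labels = []
    for group in groups:
        votes = Counter(classify(feats) for feats in group)
        labels.append(votes.most_common(1)[0][0])
    return labels

# Toy example: three occurrences of an abbreviation in one abstract, and a
# dummy classifier that fires whenever the long form was seen anywhere.
occurrences = [{"context:receptor"}, {"long_form"}, {"context:induced"}]
dummy = lambda feats: "gene" if "long_form" in feats else "other"
print(classify_with_voting(propagate_all(occurrences), dummy))
```

With this set representation, propagation_all necessarily assigns all occurrences in an abstract the same label, matching the behaviour noted above.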

5 Experimental Evaluation

5.1 Experimental Setting

We perform a cross-entity evaluation, in which we train the support vector machine (SVM) on the abstracts of 7 genes/proteins from our corpus and test on the abstracts for the remaining entities, i.e., the model is evaluated only on tokens representing entities which have never been seen labeled during training. The CRFs are trained analogously, with the difference that the respective set used for training is augmented with the BioCreative 2 training data. The average numbers of precision, recall and F1 measure are reported.
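A sketch of how such a cross-entity (leave-one-entity-out) protocol could be organized for the SVM component only; the grouping of instances by gene/protein, the linear-kernel SVM, and the toy data are placeholder assumptions, not the authors' setup.

```python
from sklearn.svm import SVC
from sklearn.feature_extraction import DictVectorizer

def cross_entity_evaluation(data_by_entity):
    """data_by_entity maps a gene/protein symbol to (feature_dicts, labels)
    for all candidate tokens in its abstracts.  Train on the other entities,
    test on the held-out one, so test entities are never seen in training."""
    results = {}
    for held_out in data_by_entity:
        train_X, train_y = [], []
        for entity, (X, y) in data_by_entity.items():
            if entity != held_out:
                train_X.extend(X)
                train_y.extend(y)
        vec = DictVectorizer()
        clf = SVC(kernel="linear").fit(vec.fit_transform(train_X), train_y)
        test_X, test_y = data_by_entity[held_out]
        results[held_out] = clf.score(vec.transform(test_X), test_y)
    return results

# Toy call with two hypothetical entities (1 = gene/protein, 0 = other):
toy = {
    "AHR": ([{"long_form": 1}, {"ctx": 1}], [1, 0]),
    "CLU": ([{"long_form": 1}, {"ctx": 1}], [1, 0]),
}
print(cross_entity_evaluation(toy))
```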

As a baseline, we report the results of a simple lexicon-based approach assuming that all tokens denote an entity in all their occurrences (lex). In addition, the baseline of accepting an abbreviation as a gene/protein if the long form occurs in the same abstract is reported (Long form). Moreover, we compare our results with the publicly available toolkit GNAT (Hakenberg et al., 2011)9 and the CRF

9 The gene normalization functionality of GNAT is not taken into account here. We acknowledge that this comparison might be seen as slightly inappropriate, as the focus of GNAT is different.


approach described in Section 4. In addition, we take into account the AcroTagger10, which resolves abbreviations to their most likely long form; we manually map these long forms to denoting a gene/protein or not.

5.2 Results

5.2.1 Overall results

In Table 4, we summarize the results of the recognition strategies introduced in Section 4. The lexical baseline clearly shows that a simple approach without any filtering is not practical. GNAT adapts well to ambiguous short names and turns out to be a competitive baseline, achieving an average precision of 0.73. In contrast, the filtering capacity of a standard CRF is, at best, mediocre. The long form baseline is very competitive, with an F1 measure of 0.78 and close-to-perfect precision. The results of AcroTagger are similar to this long form baseline.

We observe that the SVM outperforms the CRF in terms of precision and recall (by 10 percentage points in F1). Despite not being fully satisfactory either, these results indicate that global features, which are not implemented in the CRF, are of importance. This is confirmed by the CRF∩SVM setting, where CRF and SVM are stacked: this filtering procedure achieves the best precision across all models and baselines, whereas the recall is still limited. Despite being designed for exactly this purpose, the HRCRF∩SVM combination can only marginally alleviate this problem, and only at the expense of a drop in precision.

The best trade-off between precision and recall is offered by the CRF→SVM combination. This setting is not only superior to all other variants of combining a CRF with an SVM, but also outperforms GNAT by 6 points in F1 score, while being inferior to the long form baseline. However, performing feature selection on this best model using a wrapper approach (CRF→SVM+FS) leads to the overall best result of F1 = 0.84, outperforming all other approaches and all baselines.

5.2.2 Individual results

Table 5 summarizes the performance of all filtering strategies broken down into individual entities. Best results are achieved for AHR, MOX and CLU. COPD forms a special case, as no examples of its occurrence as a gene/protein are in the data; however, the results show that the system can handle such a special distribution.

10 ftp://ftp.ebi.ac.uk/pub/software/textmining/abbreviation_resolution/, accessed April 23, 2014

SVM and CRF are mostly outperformed by a combination of both strategies (except for CLI and HF), which shows that local and global features are highly complementary in general. Complementary cases generally favor the CRF→SVM strategy, except for PLS, where stacking is more effective.

In SAH, the pure CRF model is superior to all combinations of CRF and SVM. Apparently, the global information contributed by the SVM is less effective than the local contextual features available to the CRF in these cases. In SAH and CLI, moreover, the best performance is obtained by the AcroTagger.

5.2.3 Impact of instance generation

All results reported in Tables 4 and 5 refer to configurations in which additional training instances have been created by manual instance generation. The impact of this method is analyzed in Table 6. The first column reports the performance of our models on the randomly sampled training data. In order to obtain the results in the second column, manual instance generation has been applied.

The results show that all our recognition models generally benefit from additional information that helps to overcome the skewed class distribution of the training data. Despite their relatively small quantity and uneven distribution across the gene/protein classes, including additional external instances yields a strong boost in all models. The largest difference is observed in SVM (∆F1 = +0.20) and CRF→SVM (∆F1 = +0.16). Importantly, these improvements include both precision and recall.

5.3 Feature Selection

The best feature set (cf. CRF→SVM+FS in Table 4) is determined by a greedy search using a wrapper approach on the best model configuration CRF→SVM. The results are depicted in Table 7. In each iteration, the table shows the best feature set detected in the previous iteration and the results for each individual feature when added to this set. In each step, the best individual feature is kept for the next iteration. The feature analysis starts from the long form feature as a strong baseline. The added features are, in that order, context, tagger_global, and propagation_all.
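The greedy wrapper search can be sketched as follows; the evaluation callback (e.g., the F1 of the wrapped CRF→SVM model under the cross-entity protocol) and the seed feature are assumptions for illustration rather than the authors' code.

```python
def greedy_forward_selection(all_features, evaluate, seed=("long form",)):
    """Start from a seed feature set and, in each iteration, add the single
    feature template that improves the wrapped model's score the most;
    stop when no candidate yields an improvement."""
    selected = list(seed)
    best_score = evaluate(selected)
    while True:
        candidates = [f for f in all_features if f not in selected]
        scored = [(evaluate(selected + [f]), f) for f in candidates]
        if not scored:
            break
        top_score, top_feature = max(scored)
        if top_score <= best_score:
            break
        selected.append(top_feature)
        best_score = top_score
    return selected, best_score

# Toy check with a hypothetical scoring function (no real model involved):
useful = {"long form", "context", "tagger_global", "propagation_all"}
evaluate = lambda feats: len(useful & set(feats)) - 0.5 * len(set(feats) - useful)
print(greedy_forward_selection(["context", "unigrams", "tagger_global", "abbreviation"], evaluate))
```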

Overall, feature selection yields a considerable


             AHR                 CLI                 CLU                 COPD
Setting      P     R     F1     P     R     F1     P     R     F1     P     R     F1
SVM          1.00  0.72  0.84   0.30  0.27  0.29   1.00  0.41  0.58   0.00  1.00  0.00
CRF∩SVM      1.00  0.70  0.82   0.00  0.00  0.00   1.00  0.15  0.26   1.00  1.00  1.00
HRCRF∩SVM    1.00  0.70  0.82   1.00  0.00  0.00   1.00  0.16  0.28   1.00  1.00  1.00
CRF→SVM      0.96  0.83  0.89   0.30  0.27  0.29   1.00  0.40  0.57   0.00  1.00  0.00
CRF→SVM+FS   0.93  0.98  0.95   0.50  0.09  0.15   0.99  0.84  0.91   1.00  1.00  1.00
GNAT         0.74  0.66  0.70   1.00  0.18  0.31   0.97  0.52  0.68   1.00  1.00  1.00
CRF          0.52  0.98  0.68   0.00  0.00  0.00   1.00  0.20  0.33   0.00  1.00  0.00
AcroTagger   1.00  0.60  0.75   1.00  0.82  0.90   1.00  0.00  0.00   1.00  1.00  1.00
Long form    1.00  0.96  0.98   1.00  0.09  0.17   0.99  0.80  0.88   1.00  1.00  1.00
lex          0.40  1.00  0.57   0.05  1.00  0.09   0.89  1.00  0.94   0.00  1.00  0.00

             HF                  MOX                 PLS                 SAH
Setting      P     R     F1     P     R     F1     P     R     F1     P     R     F1
SVM          0.25  1.00  0.40   0.87  0.44  0.58   0.14  1.00  0.25   0.00  0.00  0.00
CRF∩SVM      1.00  0.00  0.00   1.00  0.39  0.56   1.00  1.00  1.00   1.00  0.00  0.00
HRCRF∩SVM    1.00  0.00  0.00   1.00  0.39  0.56   0.20  1.00  0.33   1.00  0.00  0.00
CRF→SVM      0.25  1.00  0.40   0.91  0.63  0.74   0.50  1.00  0.67   1.00  0.00  0.00
CRF→SVM+FS   1.00  0.00  0.00   1.00  0.37  0.54   0.00  0.00  0.00   1.00  0.00  0.00
GNAT         1.00  0.00  0.00   0.38  0.08  0.14   0.00  0.00  0.00   0.00  0.00  0.00
CRF          0.00  0.00  0.00   0.43  0.90  0.59   0.14  1.00  0.25   1.00  0.50  0.67
AcroTagger   0.33  1.00  0.50   1.00  0.00  0.00   1.00  0.00  0.00   1.00  0.60  0.75
Long form    1.00  0.00  0.00   1.00  0.00  0.00   0.00  0.00  0.00   1.00  0.00  0.00
lex          0.01  1.00  0.02   0.22  1.00  0.36   0.00  1.00  0.01   0.01  1.00  0.03

Table 5: Results for the eight genes/proteins and for our different recognition schemes.

              randomly sampled       +instance generation
Setting       P     R     F1        ∆P     ∆R     ∆F1
SVM           0.73  0.25  0.38      +0.08  +0.20  +0.20
CRF∩SVM       1.00  0.17  0.29      -0.01  +0.09  +0.13
HRCRF∩SVM     0.97  0.18  0.30      -0.02  +0.09  +0.12
CRF→SVM       0.79  0.32  0.46      +0.05  +0.17  +0.16
CRF→SVM+FS    0.99  0.60  0.75      -0.02  +0.14  +0.09

Table 6: Impact of increasing the randomly sampled training set by adding manually curated additional positive instances (+instance generation), measured in terms of the increase in precision, recall and F1 (∆P, ∆R, ∆F1).

boost in recall, while precision remains almost constant. Surprisingly, the unigrams feature has a particularly strong negative impact on overall performance.

While the global information contributed by the CRF turns out to be very valuable, accounting for most of the improvement in recall, local tagger information is widely superseded by other features. Likewise, the abbreviation feature does not provide any added value to the model beyond what is known from the long form feature.

Comparing the different feature propagation strategies, we observe that propagation_all outperforms propagation_first.

5.4 Discussion

Our experiments show that the phenomena investigated pose a challenge to all gene recognition paradigms currently available in the literature, i.e., dictionary-based, machine-learning-based (e.g. using a CRF), and classification-based filtering.

Our results indicate that stacking different methods suffers from low recall in early steps of the workflow. Instead, a greedy approach that considers all occurrences of an abbreviation as input to a filtering approach yields the best performance. Incorporating information from a CRF as features into an SVM outperforms all baselines at very high levels of precision; however, the recall still leaves room for improvement.


Iter.  Feature Set                                           P     R     F1    ∆F1
1      long form                                             0.98  0.65  0.78
       +propagation_first                                    0.98  0.65  0.78  +0.00
       +propagation_all                                      0.98  0.65  0.78  +0.00
       +tagger_local                                         0.72  0.81  0.76  -0.02
       +tagger_global                                        0.55  0.79  0.65  -0.13
       +context                                              0.98  0.67  0.79  +0.01
       +abbreviation                                         0.98  0.65  0.78  +0.00
       +unigrams                                             0.71  0.43  0.53  -0.25
2      long form +context                                    0.98  0.67  0.79
       +propagation_first                                    0.98  0.67  0.79  +0.00
       +propagation_all                                      0.96  0.70  0.81  +0.02
       +tagger_local                                         0.98  0.70  0.82  +0.03
       +tagger_global                                        0.97  0.72  0.83  +0.04
       +abbreviation                                         0.98  0.67  0.80  +0.01
       +unigrams                                             0.77  0.39  0.52  -0.27
3      long form +context +tagger_global                     0.97  0.72  0.83
       +propagation_first                                    0.97  0.71  0.82  -0.01
       +propagation_all                                      0.97  0.74  0.84  +0.01
       +tagger_local                                         0.97  0.72  0.82  -0.01
       +abbreviation                                         0.97  0.72  0.82  -0.01
       +unigrams                                             0.77  0.44  0.56  -0.27
4      long form +context +tagger_global +propagation_all    0.97  0.74  0.84
       +tagger_local                                         0.90  0.66  0.76  -0.08
       +abbreviation                                         0.97  0.74  0.84  -0.00
       +unigrams                                             0.80  0.49  0.61  -0.23

Table 7: Greedy search for the best feature combination in CRF→SVM (incl. additional positives).

In a feature selection study, we were able to show a largely positive overall impact of features that extend the local contextual information commonly used by state-of-the-art CRF approaches. This ranges from larger context windows for collecting contextual information, through abstract-level features, to feature propagation strategies. However, feature selection is not equally effective in all individual classes (cf. Table 5).

The benefits due to feature propagation indicate that several instances of the same abbreviation in one abstract should not be considered independently of one another, although we could not verify the intuition that the first mention of an abbreviation introduces particularly valuable information for classification.

Overall, our results seem encouraging, as the machinery and the features used are in general successful in determining whether an abbreviation actually denotes a gene/protein or not. The best precision/recall balance is obtained by adding CRF information as features into the classifier.

As we have shown in the cross-entity experiment setting, the system is capable of generalizing to other, unseen entities. For a production system, we assume our workflow to be applied to specific abbreviations, such that the performance on other entities (and therefore on other corpora) is not substantially influenced.

6 Conclusions and Outlook

The work reported in this paper was motivated by the practical need for an effective filtering method for recognizing genes/proteins from highly ambiguous abbreviations. To the best of our knowledge, this is the first approach to tackle gene/protein recognition from ambiguous abbreviations in a systematic manner without being specific to the particular instances of ambiguous gene/protein homonyms considered.

The proposed method has been shown to improve recognition performance when added to an existing NER workflow. Despite being restricted to eight entities so far, our approach has been evaluated in a strict cross-entity manner, which suggests sufficient generalization power for it to be extended to other genes as well.

In future work, we plan to extend the data set to demonstrate the generalizability on a larger scale and on an independent test set. Furthermore, an inclusion of the features presented in this paper into the CRF will be evaluated. Moreover, assessing the impact of the global features that turned out beneficial in this paper on other gene/protein inventories seems an interesting path to explore. Finally, we will investigate the prospects of our approach in an actual black-box evaluation setting for information retrieval.

Acknowledgements

Roman Klinger has been funded by the "It's OWL" project ("Intelligent Technical Systems Ostwestfalen-Lippe", http://www.its-owl.de/), a leading-edge cluster of the German Ministry of Education and Research. We thank Jorg Hakenberg and Philippe Thomas for their support in performing the baseline results with GNAT. Additionally, we thank the reviewers of this paper for their very helpful comments.


References

Hisham Al-mubaid. 2006. Biomedical term disambiguation: An application to gene-protein name disambiguation. In IEEE Proceedings of ITNG'06.

Judith Bandy, David Milward, and Sarah McQuay. 2009. Mining protein-protein interactions from published literature using Linguamatics I2E. Methods Mol Biol, 563:3–13.

Steffen Bickel, Ulf Brefeld, Lukas Faulstich, Jorg Hakenberg, Ulf Leser, Conrad Plake, and Tobias Scheffer. 2004. A support vector machine classifier for gene name recognition. In Proceedings of the EMBO Workshop: A Critical Assessment of Text Mining Methods in Molecular Biology.

Juliane Fluck, Heinz Theodor Mevissen, Marius Oster, and Martin Hofmann-Apitius. 2007. ProMiner: Recognition of human gene and protein names using regularly updated dictionaries. In Proceedings of the Second BioCreative Challenge Evaluation Workshop, pages 149–151, Madrid, Spain.

William A. Gale, Kenneth W. Church, and David Yarowsky. 1992. One sense per discourse. In Proceedings of the Workshop on Speech and Natural Language, pages 233–237, Stroudsburg, PA, USA. Association for Computational Linguistics.

Sylvain Gaudan, Harald Kirsch, and Dietrich Rebholz-Schuhmann. 2005. Resolving abbreviations to their senses in Medline. Bioinformatics, 21(18):3658–3664.

Jorg Hakenberg, Conrad Plake, Robert Leaman, Michael Schroeder, and Graciela Gonzalez. 2008. Inter-species normalization of gene mentions with GNAT. Bioinformatics, 24(16):i126–i132, Aug.

Jorg Hakenberg, Martin Gerner, Maximilian Haeussler, Ills Solt, Conrad Plake, Michael Schroeder, Graciela Gonzalez, Goran Nenadic, and Casey M. Bergman. 2011. The GNAT library for local and remote gene mention normalization. Bioinformatics, 27(19):2769–2771, Oct.

Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen, and Ralf Zimmer. 2003. Playing biology's name game: identifying protein names in scientific text. Pac Symp Biocomput, pages 403–414.

Matthias Hartung and Matthias Zwick. 2014. A corpus for the development of gene/protein recognition from rare and ambiguous abbreviations. Bielefeld University. doi:10.4119/unibi/2673424.

Antonio J. Jimeno-Yepes, Bridget T. McInnes, and Alan R. Aronson. 2011. Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation. BMC Bioinformatics, 12(1):223.

J.-D. Kim, T. Ohta, Y. Tateisi, and J. Tsujii. 2003. GENIA corpus – a semantically annotated corpus for bio-textmining. Bioinformatics, 19 Suppl 1:i180–i182.

Roman Klinger, Christoph M. Friedrich, Juliane Fluck, and Martin Hofmann-Apitius. 2007. Named entity recognition with combinations of conditional random fields. In Proceedings of the Second BioCreative Challenge Evaluation Workshop, Madrid, Spain, April.

Cheng-Ju Kuo, Yu-Ming Chang, Han-Shen Huang, Kuan-Ting Lin, Bo-Hou Yang, Yu-Shi Lin, Chun-Nan Hsu, and I-Fang Chung. 2007. Rich feature set, unification of bidirectional parsing and dictionary filtering for high F-score gene mention tagging. In Proceedings of the Second BioCreative Challenge Evaluation Workshop, Madrid, Spain, April.

Robert Leaman and Graciela Gonzalez. 2008. BANNER: An executable survey of advances in biomedical named entity recognition. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany Murray, and Teri E. Klein, editors, Pacific Symposium on Biocomputing, pages 652–663. World Scientific.

Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707–710.

Tomohiro Mitsumori, Sevrani Fation, Masaki Murata, Kouichi Doi, and Hirohumi Doi. 2005. Gene/protein name recognition based on support vector machine using dictionary as features. BMC Bioinformatics, 6 Suppl 1:S8.

Alexander A. Morgan, Zhiyong Lu, Xinglong Wang, Aaron M. Cohen, Juliane Fluck, Patrick Ruch, Anna Divoli, Katrin Fundel, Robert Leaman, Jorg Hakenberg, Chengjie Sun, Heng-hui Liu, Rafael Torres, Michael Krauthammer, William W. Lau, Hongfang Liu, Chun-Nan Hsu, Martijn Schuemie, K. Bretonnel Cohen, and Lynette Hirschman. 2008. Overview of BioCreative II gene normalization. Genome Biol, 9 Suppl 2:S3.

Naoaki Okazaki, Sophia Ananiadou, and Jun'ichi Tsujii. 2010. Building a high-quality sense inventory for improved abbreviation disambiguation. Bioinformatics, 26(9):1246–1253, May.

Martijn J. Schuemie, Ning Kang, Maarten L. Hekkelman, and Jan A. Kors. 2010. GeneE: gene and protein query expansion with disambiguation. Bioinformatics, 26(1):147–148, Jan.

Ariel S. Schwartz and Marti A. Hearst. 2003. A simple algorithm for identifying abbreviation definitions in biomedical text. Pac Symp Biocomput, pages 451–462.

Burr Settles. 2005. ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics, 21(14):3191–3192, Jul.

Larry Smith, Lorraine K. Tanabe, Rie Johnson nee J. Ando, Cheng-Ju J. Kuo, I-Fang F. Chung, Chun-Nan N. Hsu, Yu-Shi S. Lin, Roman Klinger, Christoph M. Friedrich, Kuzman Ganchev, Manabu Torii, Hongfang Liu, Barry Haddow, Craig A. Struble, Richard J. Povinelli, Andreas Vlachos, William A. Baumgartner, Lawrence Hunter, Bob Carpenter, Richard Tzong-Han T. Tsai, Hong-Jie J. Dai, Feng Liu, Yifei Chen, Chengjie Sun, Sophia Katrenko, Pieter Adriaans, Christian Blaschke, Rafael Torres, Mariana Neves, Preslav Nakov, Anna Divoli, Manuel Mana Lopez, Jacinto Mata, and W. John Wilbur. 2008. Overview of BioCreative II gene mention recognition. Genome Biology, 9 Suppl 2(Suppl 2):S2.

Yoshimasa Tsuruoka and Jun'ichi Tsujii. 2003. Boosting precision and recall of dictionary-based protein name recognition. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, pages 41–48, Sapporo, Japan, July. Association for Computational Linguistics.

Hua Xu, Jung-Wei Fan, George Hripcsak, Eneida A. Mendonca, Marianthi Markatou, and Carol Friedman. 2007. Gene symbol disambiguation using knowledge-based profiles. Bioinformatics, 23(8):1015–1022.

GuoDong Zhou, Dan Shen, Jie Zhang, Jian Su, and SoonHeng Tan. 2005. Recognition of protein/gene names from text using an ensemble of classifiers. BMC Bioinformatics, 6 Suppl 1:S7.


Proceedings of the 2014 Workshop on Biomedical Natural Language Processing (BioNLP 2014), pages 128–133, Baltimore, Maryland USA, June 26-27 2014. ©2014 Association for Computational Linguistics

FFTM: A Fuzzy Feature Transformation Method for Medical Documents

Amir Karami, Aryya Gangopadhyay
Information Systems Department
University of Maryland Baltimore County
Baltimore, MD, 21250
[email protected], [email protected]

Abstract

The vast array of medical text data represents a valuable resource that can be analyzed to advance the state of the art in medicine. Currently, text mining methods are being used to analyze medical research and clinical text data. Some of the main challenges in text analysis are high dimensionality and noisy data. There is a need to develop novel feature transformation methods that help reduce the dimensionality of data and improve the performance of machine learning algorithms. In this paper we present a feature transformation method named FFTM. We illustrate the efficacy of our method using local term weighting, global term weighting, and fuzzy clustering methods and show that the quality of text analysis in medical text documents can be improved. We compare FFTM with Latent Dirichlet Allocation (LDA) using two different datasets, and statistical tests show that FFTM outperforms LDA.

1 Introduction

The exponential growth of medical text data makes it difficult to extract useful information in a structured format. Some important features of text data are sparsity and high dimensionality. This means that while there may be a large number of terms in most of the documents in a corpus, any one document may contain a small percentage of those terms (Aggarwal and Zhai, 2012). This characteristic of medical text data makes feature transformation an important step in text analysis. Feature transformation is a pre-processing step in many machine-learning methods that is used to characterize text data in terms of a different number of attributes in lower dimensions. This technique has a direct impact on the quality of text mining methods. Topic models such as LDA have been used as popular feature transformation techniques (Ramage et al., 2010). However, fuzzy clustering methods, particularly in combination with term weighting methods, have not been explored much in medical text mining.

In this research, we propose a new method called FFTM to extract features from free-text data. The rest of the paper is organized as follows. In Section 2, we review related work. Section 3 contains details about our method. Section 4 describes our experiments, performance evaluation, and discussions of our results. Finally, we present a summary, limitations, and future work in the last section.

2 Related Work

Text analysis is an important topic in medical informatics that is challenging due to highly sparse, high-dimensional data. The large dimensionality and diversity of text datasets have motivated medical researchers to use feature transformation methods. Feature transformation methods encapsulate a text corpus in smaller dimensions by merging the initial features. Topic modeling is one of the most popular feature transformation approaches. Among topic models, LDA (Blei et al., 2003) has received the most attention due to its better performance (Ghassemi et al., 2012; Lee et al., 2010).

One method that has not been fully considered in medical text mining is fuzzy clustering. Although most fuzzy clustering work in the medical literature is based on image analysis (Saha and Maulik, 2014; Cui et al., 2013; Beevi and Sathik, 2012), only a few studies have applied fuzzy clustering in medical text mining (Ben-Arieh and Gullipalli, 2012; Fenza et al., 2012). The main difference between our method and other document fuzzy clustering approaches, such as that of Singh et al. (2011), is that our method uses fuzzy clustering and word weighting as a pre-processing step for


feature transformation before implementing any classification or clustering algorithms; however, other methods use fuzzy clustering as a final step to cluster the documents. Our main contribution is to improve the quality of the input data in order to improve the output of fuzzy clustering. Among fuzzy clustering methods, Fuzzy C-means (Bezdek, 1981) is the most popular one (Bataineh et al., 2011). In this research, we propose a novel method that combines local term weighting and global term weighting with fuzzy clustering.

3 Method

In this section, we detail our Fuzzy Feature Transformation Method (FFTM) and describe its steps. We begin with a brief review of LDA.

LDA is a topic model that can extract hidden topics from a collection of documents. It assumes that each document is a mixture of topics. The outputs of LDA are the topic distributions over documents and the word distributions over topics. In this research, we use the topic distributions over documents. LDA uses term frequency for local term weighting.

Now we introduce FFTM concepts and notations. This model has three main steps: Local Term Weighting (LTW), Global Term Weighting (GTW), and Fuzzy Clustering (Algorithm 1). In this algorithm, the output of each step is the input of the next step.

Step 1: The first step is to calculate LTW. Among the different LTW methods, we use term frequency, a popular choice. The symbol f_ij denotes the number of times term i occurs in document j. We have n documents and m words. Let

\[
b(f_{ij}) = \begin{cases} 1 & f_{ij} > 0 \\ 0 & f_{ij} = 0 \end{cases} \qquad (1)
\]

\[
p_{ij} = \frac{f_{ij}}{\sum_j f_{ij}} \qquad (2)
\]

The outputs of this step are b(f_ij), f_ij, and p_ij. We use them as inputs for the second step.

Step 2: The next step is to calculate GTW. We explore four GTW methods in this paper: Entropy, Inverse Document Frequency (IDF), Probabilistic Inverse Document Frequency (ProbIDF), and Normal (Table 1).

IDF assigns higher weights to rare terms and lower weights to common terms (Papineni, 2001). ProbIDF is similar to IDF and assigns very low

Algorithm 1: FFTM algorithm
Functions: E(): Entropy; I(): IDF; PI(): ProbIDF; NO(): Normal; FC(): Fuzzy Clustering.
Input: Document-term matrix
Output: Clustering membership value (µ_ij) for all documents and clusters.
 1: Remove stop words
    Step 1: Calculate LTW
 2: for i = 1 to n do
 3:   for j = 1 to m do
 4:     Calculate f_ij, b(f_ij), p_ij
 5:   end for
 6: end for
    Step 2: Calculate GTW
 7: for i = 1 to n do
 8:   for j = 1 to m do
 9:     Execute E(p_ij, n), I(f_ij, n), PI(b(f_ij), n), NO(f_ij, n)
10:   end for
11: end for
    Step 3: Perform Fuzzy Clustering
12: Execute FC(E), FC(I), FC(PI), FC(NO)

Table 1: GTW Methods

Name      Formula
Entropy   $1 + \sum_j \frac{p_{ij} \log_2(p_{ij})}{\log_2 n}$
IDF       $\log_2 \frac{n}{\sum_j f_{ij}}$
ProbIDF   $\log_2 \frac{n - \sum_j b(f_{ij})}{\sum_j b(f_{ij})}$
Normal    $\frac{1}{\sqrt{\sum_j f_{ij}^2}}$

negative weights to terms that occur in every document (Kolda, 1998). Entropy gives higher weight to terms that occur in only a few documents (Dumais, 1992). Finally, Normal is used to correct discrepancies in document lengths and to normalize the document vectors. The outputs of this step are the inputs of the last step.
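For concreteness, the four global weights of Table 1 can be computed from a term-by-document count matrix as in the following numpy sketch; the matrix orientation and the small epsilon used to avoid division by zero are implementation assumptions, not part of the paper.

```python
import numpy as np

def global_term_weights(F):
    """F: term-by-document count matrix (m terms x n documents), f_ij = count
    of term i in document j.  Returns one global weight per term for each of
    the four GTW schemes in Table 1."""
    eps = 1e-12                                   # guards log(0) and x/0
    m, n = F.shape
    b = (F > 0).astype(float)                     # b(f_ij)
    p = F / (F.sum(axis=1, keepdims=True) + eps)  # p_ij = f_ij / sum_j f_ij
    entropy = 1.0 + (p * np.log2(p + eps)).sum(axis=1) / np.log2(n)
    idf = np.log2(n / (F.sum(axis=1) + eps))
    df = b.sum(axis=1)                            # document frequency
    prob_idf = np.log2((n - df + eps) / (df + eps))
    normal = 1.0 / np.sqrt((F ** 2).sum(axis=1) + eps)
    return {"Entropy": entropy, "IDF": idf, "ProbIDF": prob_idf, "Normal": normal}

# Toy document-term counts for three terms over four documents:
F = np.array([[2.0, 0, 1, 0],
              [1.0, 1, 1, 1],
              [0.0, 0, 5, 0]])
for name, w in global_term_weights(F).items():
    print(name, np.round(w, 3))
```

Note how the second toy term, which occurs in every document, receives a strongly negative ProbIDF weight, mirroring the behaviour described above.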

Step 3: Fuzzy clustering is a soft clustering technique that finds the degree of membership of each data point in each cluster, as opposed to assigning a data point to only one cluster. Fuzzy clustering is a synthesis between clustering and fuzzy set theory. Among fuzzy clustering methods, Fuzzy C-means (FCM) is the most popular one, and its goal is to minimize an objective


function by considering the following constraints:

\[
\min J_q(\mu, V, X) = \sum_{i=1}^{c} \sum_{j=1}^{n} (\mu_{ij})^q D_{ij}^2 \qquad (3)
\]

subject to:

\[
0 \le \mu_{ij} \le 1 \qquad (4)
\]
\[
i \in \{1, \dots, c\} \text{ and } j \in \{1, \dots, n\} \qquad (5)
\]
\[
\sum_{i=1}^{c} \mu_{ij} = 1 \qquad (6)
\]
\[
0 < \sum_{j=1}^{n} \mu_{ij} < n \qquad (7)
\]

where n = number of data points, c = number of clusters, µ_ij = membership value, q = fuzzifier (1 < q ≤ ∞), V = cluster center vector, and D_ij = d(x_j, v_i) = distance between x_j and v_i.

By optimizing Eq. 3:

\[
\mu_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{D_{ij}}{D_{kj}} \right)^{\frac{2}{q-1}}} \qquad (8)
\]

\[
v_i = \frac{\sum_{j=1}^{n} (\mu_{ij})^q x_j}{\sum_{j=1}^{n} (\mu_{ij})^q} \qquad (9)
\]

The iterations of the clustering algorithm continue until the maximum change in µ_ij becomes less than or equal to a pre-specified threshold. The computational time complexity is O(n). We use µ_ij as the degree of cluster membership for each document.
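A compact numpy sketch of the FCM updates in Eqs. (8) and (9); the random initialization, Euclidean distance, and stopping threshold are illustrative choices consistent with, but not identical to, the Matlab fcm package used in the experiments.

```python
import numpy as np

def fuzzy_c_means(X, c, q=2.0, tol=1e-5, max_iter=300, seed=0):
    """X: n x d data matrix.  Returns (mu, V), where mu is the c x n membership
    matrix from Eq. (8) and V the c x d cluster centers from Eq. (9).
    Iteration stops when the largest change in mu falls below tol."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    mu = rng.random((c, n))
    mu /= mu.sum(axis=0, keepdims=True)           # enforce sum_i mu_ij = 1
    for _ in range(max_iter):
        w = mu ** q
        V = (w @ X) / w.sum(axis=1, keepdims=True)                    # Eq. (9)
        D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        p = 2.0 / (q - 1.0)
        new_mu = 1.0 / (D ** p * (1.0 / D ** p).sum(axis=0, keepdims=True))  # Eq. (8)
        if np.abs(new_mu - mu).max() <= tol:
            mu = new_mu
            break
        mu = new_mu
    return mu, V

# Toy example: two well-separated groups of points.
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, (5, 2)),
               np.random.default_rng(2).normal(3, 0.1, (5, 2))])
mu, V = fuzzy_c_means(X, c=2)
print(np.round(mu, 2))
```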

4 Experimental Results

In this section, we evaluate FFTM against LDA using two measures: internal document clustering metrics and document classification evaluation metrics, on publicly available text datasets. We use Weka1 for classification evaluation, the MALLET2 package with its default settings for implementing LDA, the Matlab fcm package3 for implementing FCM clustering, and the CVAP Matlab package4 for clustering validation.

1 http://www.cs.waikato.ac.nz/ml/weka/
2 http://mallet.cs.umass.edu/
3 http://tinyurl.com/kl33w67
4 http://tinyurl.com/kb5bwnm

4.1 Datasets

We leverage two available datasets in this research. Our first test dataset, called Deidentified Medical Text5, is an unlabeled corpus of 2434 nursing notes with 12,877 terms after removing stop words. The second dataset6 is a labeled corpus of English scientific medical abstracts from the Springer website. It includes 41 medical journals ranging from Neurology to Radiology. In this research, we use the first 10 journals: Arthroscopy, Federal health standard sheet, The anesthetist, The surgeon, The gynecologist, The dermatologist, The internist, The neurologist, The Ophthalmology, The orthopedist, and The pathologist. In our experiments we select three subsets from the above journals: the first two journals with 4012 terms and 171 documents, the first five with 14189 terms and 1527 documents, and all ten with 23870 terms and 3764 documents, in order to track the performance of FFTM and LDA as the number of documents and labels increases.

4.2 Document Clustering

The first evaluation comparing FFTM with LDA is document clustering, using the first dataset. Internal and external validation are the two major methods for clustering validation; however, a comparison between them shows that internal validation is more precise (Rendon et al., 2011). We evaluate different numbers of features (topics) and clusters using two internal clustering validation methods, the Silhouette index and the Calinski-Harabasz index, with K-means run for 500 iterations. The Silhouette index measures how closely related the objects in a cluster are and how distinct a cluster is from the other clusters; a higher value means a better result. The Silhouette index (S) is defined as:

\[
S(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \qquad (10)
\]

where a(i) is the average dissimilarity of sample i with the other data in the same cluster and b(i) is the minimum average dissimilarity of sample i with the data that are not in the same cluster.

The Calinski-Harabasz index (CH) evaluates cluster validity based on the average between- and within-cluster sum of squares. It is defined as:

5 http://tinyurl.com/kfz2hm4
6 http://tinyurl.com/m2c8se6


[Figure 1: Clustering Validation with Silhouette Index. Three panels, (a) 50 Features, (b) 100 Features, and (c) 150 Features, plot the Silhouette index against the number of clusters (2 to 8).]

[Figure 2: Clustering Validation with Calinski-Harabasz Index. Three panels, (a) 50 Features, (b) 100 Features, and (c) 150 Features, plot the Calinski-Harabasz index against the number of clusters (2 to 8) for FFTM(Entropy), FFTM(IDF), FFTM(ProbIDF), FFTM(Normal), and LDA.]

\[
CH = \frac{\mathrm{trace}(S_B)}{\mathrm{trace}(S_W)} \cdot \frac{n_p - 1}{n_p - k} \qquad (11)
\]

where S_B is the between-cluster scatter matrix, S_W the internal scatter matrix, n_p the number of clustered samples, and k the number of clusters. A higher value indicates a better clustering. We track the performance of both FFTM and LDA using different numbers of clusters, ranging from 2 to 8, with different numbers of features (50, 100, and 150). Both the Silhouette index and the Calinski-Harabasz index show that FFTM is the best method over all ranges of features and clusters (Figures 1 and 2). The gap between FFTM and LDA does not change much across different numbers of features and clusters. LDA has the lowest performance, and Normal has the best performance among the GTW methods over the different ranges of features and clusters. According to the paired difference test, the improvement of FFTM over LDA is statistically significant with a p-value < 0.05 using the two internal clustering validation methods.
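A sketch of the validation loop behind Figures 1 and 2, using the scikit-learn implementations of K-means and the two indexes as stand-ins for the CVAP Matlab package; the feature matrix passed in (FCM memberships or LDA topic proportions) is a placeholder.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

def validate_clustering(features, cluster_range=range(2, 9), seed=0):
    """features: documents x features matrix (e.g., FCM membership values or
    LDA topic proportions).  Returns (Silhouette, Calinski-Harabasz) scores
    for each number of clusters, mirroring the setup of Figures 1 and 2."""
    scores = {}
    for k in cluster_range:
        labels = KMeans(n_clusters=k, n_init=10, max_iter=500,  # 500 iterations as in the paper
                        random_state=seed).fit_predict(features)
        scores[k] = (silhouette_score(features, labels),
                     calinski_harabasz_score(features, labels))
    return scores

# Toy run on random "membership" features for 100 documents and 50 features:
rng = np.random.default_rng(0)
print(validate_clustering(rng.random((100, 50)), cluster_range=range(2, 5)))
```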

4.3 Document Classification

The second evaluation measure is document classification, using the second dataset. We evaluate different numbers of classes and features (topics) with accuracy, F-measure, and ROC using Random Forest. Accuracy is the proportion of true results in a dataset. F-measure is another measure of classification evaluation that considers both precision and recall. ROC curves plot False Positives on the X axis vs. True Positives on the Y axis to find the trade-off between them; therefore, a curve closer to the upper left indicates better performance. We assume that more documents and classes imply more topics; therefore, we choose 100 features for two classes, 150 features for five classes, and 200 features for ten classes. In addition, we use 10-fold cross-validation as the test option.

This experiment shows that FFTM has the best performance across the different numbers of features and labels (Table 2). LDA has the lowest performance, and ProbIDF has the best average performance among the GTW methods over all ranges of features and classes. According to the paired difference test, the improvement of FFTM over LDA is statistically significant with a p-value < 0.05.


Table 2: The Second Dataset Classification Performance

Method           #Features  #Labels  Acc %  F-Measure  ROC
FFTM(Entropy)    100        2        96.49  0.959      0.982
FFTM(IDF)        100        2        98.24  0.982      0.996
FFTM(ProbIDF)    100        2        97.66  0.977      0.987
FFTM(Normal)     100        2        92.39  0.912      0.971
LDA              100        2        90.06  0.900      0.969
FFTM(Entropy)    150        5        71.84  0.694      0.874
FFTM(IDF)        150        5        70.79  0.686      0.859
FFTM(ProbIDF)    150        5        70.39  0.674      0.859
FFTM(Normal)     150        5        68.11  0.649      0.851
LDA              150        5        66.27  0.637      0.815
FFTM(Entropy)    200        10       51.06  0.501      0.828
FFTM(IDF)        200        10       51.73  0.506      0.826
FFTM(ProbIDF)    200        10       53.72  0.525      0.836
FFTM(Normal)     200        10       50.05  0.485      0.815
LDA              200        10       47.68  0.459      0.792

5 Conclusion

The explosive growth of medical text data makes text analysis a key requirement for finding patterns in datasets; however, the typically high dimensionality of such data motivates researchers to utilize dimension reduction techniques such as LDA. Although LDA has recently received attention in medical text analysis (Jimeno-Yepes et al., 2011), fuzzy clustering methods such as FCM have not been used in medical text clustering, but rather in image processing. In the current study, we propose a method called FFTM that combines LTW and GTW with fuzzy clustering, and compare its performance with that of LDA. We use different sets of data with different numbers of features, clusters, and classes. The findings of this study show that combining FCM with LTW and GTW methods can significantly improve medical document analysis. We conclude that different factors, including the number of features, clusters, and classes, can affect the outputs of machine learning algorithms. In addition, the performance of FFTM is improved by using GTW methods. The method proposed in this paper may be applied to other medical documents to improve text analysis outputs. One limitation of this paper is that we use one clustering method, one classification method, and two internal clustering validation methods for evaluation. Our future direction is to explore more machine learning algorithms and clustering validation methods for evaluation, as well as other fuzzy clustering algorithms for feature transformation. The main goal of future research is to present an efficient and effective medical topic model using fuzzy set theory.

References

Charu C. Aggarwal and ChengXiang Zhai. 2012. An introduction to text mining. In Mining Text Data, pages 1–10. Springer.

K. M. Bataineh, M. Naji, and M. Saqer. 2011. A comparison study between various fuzzy clustering algorithms. Jordan Journal of Mechanical & Industrial Engineering, 5(4).

Zulaikha Beevi and Mohamed Sathik. 2012. A robust segmentation approach for noisy medical images using fuzzy clustering with spatial probability. International Arab Journal of Information Technology (IAJIT), 9(1).

David Ben-Arieh and Deep Kumar Gullipalli. 2012. Data envelopment analysis of clinics with sparse data: Fuzzy clustering approach. Computers & Industrial Engineering, 63(1):13–21.

James C. Bezdek. 1981. Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Wenchao Cui, Yi Wang, Yangyu Fan, Yan Feng, and Tao Lei. 2013. Global and local fuzzy clustering with spatial information for medical image segmentation. In Signal and Information Processing (ChinaSIP), 2013 IEEE China Summit & International Conference on, pages 533–537. IEEE.

Susan Dumais. 1992. Enhancing performance in latent semantic indexing (LSI) retrieval.

Giuseppe Fenza, Domenico Furno, and Vincenzo Loia. 2012. Hybrid approach for context-aware service discovery in healthcare domain. Journal of Computer and System Sciences, 78(4):1232–1247.


Marzyeh Ghassemi, Tristan Naumann, Rohit Joshi, and Anna Rumshisky. 2012. Topic models for mortality modeling in intensive care units. In ICML Machine Learning for Clinical Data Analysis Workshop.

Antonio Jimeno-Yepes, Bartłomiej Wilkowski, James G. Mork, Elizabeth Van Lenten, Dina Demner-Fushman, and Alan R. Aronson. 2011. A bottom-up approach to MEDLINE indexing recommendations. In AMIA Annual Symposium Proceedings, volume 2011, page 1583. American Medical Informatics Association.

Tamara G. Kolda. 1998. Limited-memory matrix methods with applications.

Sangno Lee, Jeff Baker, Jaeki Song, and James C. Wetherbe. 2010. An empirical comparison of four text mining methods. In System Sciences (HICSS), 2010 43rd Hawaii International Conference on, pages 1–10. IEEE.

Kishore Papineni. 2001. Why inverse document frequency? In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, pages 1–8. Association for Computational Linguistics.

Daniel Ramage, Susan T. Dumais, and Daniel J. Liebling. 2010. Characterizing microblogs with topic models. In ICWSM.

Erendira Rendon, Itzel Abundez, Alejandra Arizmendi, and Elvia M. Quiroz. 2011. Internal versus external cluster validation indexes. International Journal of Computers and Communications, 5(1):27–34.

Indrajit Saha and Ujjwal Maulik. 2014. Multiobjective differential evolution-based fuzzy clustering for MR brain image segmentation. In Advanced Computational Approaches to Biomedical Engineering, pages 71–86. Springer.

Vivek Kumar Singh, Nisha Tiwari, and Shekhar Garg. 2011. Document clustering using k-means, heuristic k-means and fuzzy c-means. In Computational Intelligence and Communication Networks (CICN), 2011 International Conference on, pages 297–301. IEEE.


Proceedings of the 2014 Workshop on Biomedical Natural Language Processing (BioNLP 2014), pages 134–142, Baltimore, Maryland USA, June 26-27 2014. ©2014 Association for Computational Linguistics

Using statistical parsing to detect agrammatic aphasia

Kathleen C. Fraser1, Graeme Hirst1, Jed A. Meltzer2, Jennifer E. Mack3, and Cynthia K. Thompson3,4,5

1 Dept. of Computer Science, University of Toronto
2 Rotman Research Institute, Baycrest Centre, Toronto
3 Dept. of Communication Sciences and Disorders, Northwestern University
4 Dept. of Neurology, Northwestern University
5 Cognitive Neurology and Alzheimer's Disease Center, Northwestern University

{kfraser,gh}@cs.toronto.edu, [email protected], {jennifer-mack-0,ckthom}@northwestern.edu

Abstract

Agrammatic aphasia is a serious language impairment which can occur after a stroke or traumatic brain injury. We present an automatic method for analyzing aphasic speech using surface-level parse features and context-free grammar production rules. Examining these features individually, we show that we can uncover many of the same characteristics of agrammatic language that have been reported in studies using manual analysis. When taken together, these parse features can be used to train a classifier to accurately predict whether or not an individual has aphasia. Furthermore, we find that the parse features can lead to higher classification accuracies than traditional measures of syntactic complexity. Finally, we find that a minimal amount of pre-processing can lead to better results than using either the raw data or highly processed data.

1 Introduction

After a stroke or head injury, individuals may experience aphasia, an impairment in the ability to comprehend or produce language. The type of aphasia depends on the location of the lesion. However, even two patients with the same type of aphasia may experience different symptoms. A careful analysis of narrative speech can reveal specific patterns of impairment, and help a clinician determine whether an individual has aphasia, what type of aphasia it is, and how the symptoms are changing over time.

In this paper, we present an automatic method for the analysis of one type of aphasia, agrammatic aphasia, which is characterized by the omission of function words, the omission or substitution of morphological markers for person and number, the absence of verb inflection, and a relative increase in the number of nouns and decrease in the number of verbs (Bastiaanse and Thompson, 2012). There is often a reduction in the variety of different syntactic structures used, as well as a reduction in the complexity of those structures (Progovac, 2006). There may also be a strong tendency to use the canonical word order of a language, for example subject-verb-object in English (Progovac, 2006).

Most studies of narrative speech in agrammatic aphasia are based on manually annotated speech transcripts. This type of analysis can provide detailed and accurate information about the speech patterns that are observed. However, it is also very time consuming and requires trained transcribers and annotators. Studies are necessarily limited to a manageable size, and the level of agreement between annotators can vary.

We propose an automatic approach that uses information from statistical parsers to examine properties of narrative speech. We extract context-free grammar (CFG) production rules as well as phrase-level features from syntactic parses of the speech transcripts. We show that this approach can detect many features which have been previously reported in the aphasia literature, and that classification of agrammatic patients and controls can be achieved with high accuracy.

We also examine the effects of including speech dysfluencies in the transcripts. Dysfluencies and non-narrative words are usually removed from the transcripts as a pre-processing step, but we show that by retaining some of these items, we can actually achieve a higher classification accuracy than by using the completely clean transcripts.

Finally, we investigate whether there is any benefit to using the parse features instead of more traditional measures of syntactic complexity, such as Yngve depth or mean sentence length. We find that the parse features convey more information


about the specific syntactic structures being produced (or avoided) by the agrammatic speakers, and lead to better classification accuracies.

2 Related Work

2.1 Syntactic analysis of agrammatic narrative speech

Much of the previous work analyzing narrative speech in agrammatic aphasia has been performed manually. One widely used protocol is called Quantitative Production Analysis (QPA), developed by Saffran et al. (1989). QPA can be used to measure morphological content, such as whether determiners and verb inflections are produced in obligatory contexts, as well as structural complexity, such as the number of embedded clauses per sentence. Subsequent studies have found a number of differences between normal and agrammatic speech using QPA (Rochon et al., 2000). Another popular protocol, called the Northwestern Narrative Language Analysis (NNLA), was introduced by Thompson et al. (1995). This protocol analyzes each utterance at five different levels, and focuses in particular on the production of verbs and verb argument structure.

Perhaps more analogous to our work here, Goodglass et al. (1994) conducted a detailed examination of the syntactic constituents used by aphasic patients and controls. In that study, utterances were grouped according to how many syntactic constituents they contained. They found that agrammatic participants were more likely to produce single-constituent utterances, especially noun phrases, and less likely to produce subordinate clauses. They also found that agrammatic speakers sometimes produced two-constituent utterances consisting of only a subject and object, with no verb. This pattern was never observed in control speech.

A much smaller body of work explores the use of computational techniques to analyze agrammatism. Holmes and Singh (1996) analyzed conversational speech from aphasic speakers and controls. Their features mostly included measures of vocabulary richness and frequency counts of various parts-of-speech (e.g. nouns, verbs); however, they also measured "clause-like semantic unit rate". This feature was intended to measure the speaker's ability to cluster words together, although it is not clear what the criteria for segmenting clause-like units were or whether it was done manually or automatically. Nonetheless, it was found to be one of the most important variables for distinguishing between patients and controls.

MacWhinney et al. (2011) presented several examples of how researchers can use the AphasiaBank1 database and associated software tools to conduct automatic analyses (although the transcripts are first hand-coded for errors by experienced speech-language pathologists). Specifically with regard to syntax, they calculated several frequency counts and ratios for different parts-of-speech and bound morphemes. There was one extension beyond treating each word individually: this involved searching for pre-defined collocations such as once upon a time or happily ever after, which were found to occur more rarely in the patient transcripts than in the control transcripts.

We present an alternative, automated method of analysis. We do not attempt to fully replicate the results of the manual studies, but rather provide a complementary set of features which can indicate grammatical abnormalities. Unlike previous computational studies, we attempt to move beyond single-word analysis and examine which patterns of syntax might indicate agrammatism.

2.2 Using parse features to assess grammaticality

Syntactic complexity metrics derived from parse trees have been used by various researchers in studies of mild cognitive impairment (Roark et al., 2011), autism (Prud'hommeaux et al., 2011), and child language development (Sagae et al., 2005; Hassanali et al., 2013). Here we focus specifically on the use of CFG production rules as features.

Using the CFG production rules from statistical parsers as features was first proposed by Baayen et al. (1996), who applied the features to an authorship attribution task. More recently, similar features have been widely used in native language identification (Wong and Dras, 2011; Brooke and Hirst, 2012; Swanson and Charniak, 2012). Perhaps most relevant to the task at hand, CFG productions as well as other parse outputs have proved useful for judging the grammaticality and fluency of sentences. For example, Wong and Dras (2010) used CFG productions to classify sentences from an artificial error corpus as being either grammatical or ungrammatical.

Taking a different approach, Chae and Nenkova

1 http://talkbank.org/AphasiaBank/


                     Agrammatic (N = 24)   Control (N = 15)
Male/Female          15/9                   8/7
Age (years)          58.1 (10.6)            63.3 (6.4)
Education (years)    16.3 (2.5)             16.4 (2.4)

Table 1: Demographic information. Numbers are given in the form: mean (standard deviation).

(2009) calculated several surface features based on the output of a parser, such as the length and relative proportion of different phrase types. They used these features to distinguish between human and machine translations, and to determine which of a pair of translations was the more fluent. However, to our knowledge there has been no work using parser outputs to assess the grammaticality of speech from individuals with post-stroke aphasia.

3 Data

3.1 Participants

This was a retrospective analysis of data collected by the Aphasia and Neurolinguistics Research Laboratory at Northwestern University. All agrammatic participants had experienced a stroke at least 1 year prior to the narrative sample collection. Demographic information for the participants is given in Table 1. There is no significant (p < 0.05) difference between the patient and control groups on age or level of education.

3.2 Narrative task

To obtain a narrative sample, the participants were asked to relate the well-known fairy tale Cinderella. Each participant was first given a wordless picture book of the story to look through. The book was then removed, and the participant was asked to tell the story in his or her own words. The examiner did not interrupt or ask questions.

The narratives were recorded and later transcribed following the NNLA protocol. The data was segmented into utterances based on syntactic and prosodic cues. Filled pauses, repetitions, false starts, and revisional phrases (e.g. I mean) were all placed inside parentheses. The average length of the raw transcripts was 332 words for agrammatic participants and 387 words for controls; when the non-narrative words were excluded, the average length was 194 words for the agrammatic group and 330 for controls.

4 Methods

4.1 Parser Features

We consider two types of features: CFG production rules and phrase-level statistics. For the CFG production rules, we use the Charniak parser (Charniak, 2000), trained on Wall Street Journal data, to parse each utterance in the transcript and then extract the set of non-lexical productions. The total number of types of productions is large, many of them occurring very infrequently, so we compile a list of the 50 most frequently occurring productions in each of the two groups (agrammatic and controls) and use the combined set as the set of features. The feature values can be binary (does a particular production rule appear in the narrative or not?) or integer (how many times does a rule occur?). The CFG non-terminal symbols follow the Penn Treebank naming conventions.

For our phrase-level statistics, we use a subset of the features described by Chae and Nenkova (2009), which are related to the incidence of different phrase types. We consider three different phrase types: noun phrases, verb phrases, and prepositional phrases. These features are defined as follows:

• Phrase type proportion: Length of each phrase type (including embedded phrases), divided by total narrative length.
• Average phrase length: Total number of words in a phrase type, divided by number of phrases of that type.
• Phrase type rate: Number of phrases of a given type, divided by total narrative length.

Because we are judging the grammaticality of the entire narrative, we normalize by narrative length (rather than sentence length, as in Chae and Nenkova's study). These features are real-valued.
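A sketch of how the CFG productions and the phrase-level statistics could be extracted from bracketed parses (here read with NLTK); the toy parse and the treatment of embedded phrases are illustrative assumptions rather than the authors' exact code.

```python
from collections import Counter
from nltk.tree import Tree

def cfg_productions(tree):
    """Non-lexical CFG production rules (e.g. 'NP -> DT NN') from one parse."""
    return [str(p) for p in tree.productions() if not p.is_lexical()]

def phrase_features(trees, phrase_types=("NP", "VP", "PP")):
    """Phrase-level statistics over a whole narrative: for each phrase type,
    the proportion of words covered, the average phrase length, and the rate
    of phrases per word of the narrative."""
    total_words = sum(len(t.leaves()) for t in trees)
    feats = {}
    for label in phrase_types:
        phrases = [st for t in trees for st in t.subtrees(lambda s: s.label() == label)]
        words_in_type = sum(len(p.leaves()) for p in phrases)
        feats[f"{label}_proportion"] = words_in_type / total_words
        feats[f"{label}_avg_length"] = words_in_type / len(phrases) if phrases else 0.0
        feats[f"{label}_rate"] = len(phrases) / total_words
    return feats

# Toy narrative of one parsed utterance (hypothetical bracketed parse):
tree = Tree.fromstring("(S (NP (DT the) (NN girl)) (VP (VBD lost) (NP (DT a) (NN slipper))))")
print(Counter(cfg_productions(tree)).most_common(3))
print(phrase_features([tree]))
```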

We first perform the analysis on the transcribed data with the dysfluencies removed, labeled the "clean" dataset. This is the version of the transcript that would be used in the manual NNLA analysis. However, it is the result of human effort and expertise. To test the robustness of the system on data that has not been annotated in this way, we also use the "raw" dataset, with no dysfluencies removed (i.e. including everything inside the parentheses), and an "auto-cleaned" dataset, in which filled pauses are automatically removed from the raw transcripts. We also use a simple algorithm to remove "stutters" and false starts, by


removing non-word tokens of length one or two (e.g. C- C- Cinderella would become simply Cinderella). This provides a more realistic view of the performance of our system on real data. We also hypothesize that there may be important information to be found in the dysfluent speech segments.

4.2 Feature weighting and selection

We assume that some production rules will be more relevant to the classification than others, and so we want to weight the features accordingly. Using term frequency–inverse document frequency (tf-idf) would be one possibility; however, tf-idf weights do not take into account any class information. Supervised term weighting (STW) has been proposed by Debole and Sebastiani (2004) as an alternative to tf-idf for text classification tasks. In this weighting scheme, feature weights are assigned using the same algorithm that is used for feature selection. For example, one way to select features is to rank them by their information gain (InfoGain). In STW, the InfoGain value for each feature is also used to replace the idf term. This can be expressed as W(i,d) = df(i,d) × InfoGain(i), where W(i,d) is the weight assigned to feature i in document d, df(i,d) is the frequency of occurrence of feature i in document d, and InfoGain(i) is the information gain of feature i across all the training documents.
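A small sketch of this weighting scheme for binary presence/absence features; the information-gain computation and the toy matrix are illustrative, and GainRatio weighting would follow the same pattern with a normalized gain.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def info_gain(feature_column, labels):
    """Information gain of a binary (present/absent) feature w.r.t. the labels."""
    present = feature_column > 0
    gain = entropy(labels)
    for mask in (present, ~present):
        if mask.any():
            gain -= mask.mean() * entropy(labels[mask])
    return gain

def supervised_term_weighting(X, y):
    """STW: W(i, d) = df(i, d) * InfoGain(i).  Features with zero gain end up
    with zero weight, so they are effectively removed from the classification."""
    gains = np.array([info_gain(X[:, i], y) for i in range(X.shape[1])])
    return X * gains, gains

# Toy example: 4 documents x 3 production-rule features, binary class labels.
X = np.array([[1, 0, 2], [2, 0, 1], [0, 1, 1], [0, 2, 2]], dtype=float)
y = np.array([1, 1, 0, 0])
Xw, gains = supervised_term_weighting(X, y)
print(np.round(gains, 3))
```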

We considered two different methods of STW: weighting by InfoGain and weighting by gain ratio (GainRatio). The methods were also used for feature selection, since any feature that was assigned a weight of zero was removed from the classification. We also consider tf-idf weights and unweighted features for comparison.

4.3 Syntactic complexity metrics

To compare the performance of the parse features with more traditional syntactic complexity metrics (SC metrics), we calculate the mean length of utterance (MLU), mean length of T-unit² (MLT), mean length of clause (MLC), and parse tree height. We also calculate the mean, maximum, and total Yngve depth, which measures the proportion of left-branching to right-branching in each parse tree (Yngve, 1960). These measures are commonly used in studies of impaired language (e.g. Roark et al. (2011), Prud'hommeaux et al. (2011), Fraser et al. (2013b)). We hypothesize that the parse features will capture more information about the specific impairments seen in agrammatic aphasia; however, using the general measures of syntactic complexity may be sufficient for the classifiers to distinguish between the groups.

² A T-unit consists of a main clause and its attached dependent clauses.
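As an illustration of the least familiar of these metrics, the sketch below computes per-word Yngve depths for an NLTK-style tree, following the common formulation: each word's depth is the sum, over the nodes on its path from the root, of the number of right siblings of each node; the mean, maximum, and total are then taken over the words.

from nltk import Tree

def yngve_depths(tree):
    depths = []

    def walk(node, depth):
        if isinstance(node, str):                     # reached a word
            depths.append(depth)
            return
        n = len(node)
        for i, child in enumerate(node):
            walk(child, depth + (n - 1 - i))          # add count of right siblings
    walk(tree, 0)
    return depths

t = Tree.fromstring("(S (NP (DT the) (NN girl)) (VP (VBD ran)))")
d = yngve_depths(t)
print(sum(d) / len(d), max(d), sum(d))                # mean, max, total Yngve depth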

4.4 Classification

To test whether the features can effectively distinguish between the agrammatic group and controls, we use them to train and test a machine learning classifier. We test three different classification algorithms: naive Bayes (NB), support vector machine (SVM), and random forests (RF). We use a leave-one-out cross-validation framework, in which one transcript is held out as a test set, and the other transcripts form the training data. The feature weights are calculated on the training set and then applied to the test set (as a result, each fold of training/testing may use different features and feature weights). The SVM and RF algorithms are tuned in a nested cross-validation loop. The classifier is then tested on the held-out point. This procedure is repeated across all data points, and the average accuracy is reported.
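This evaluation loop can be sketched as follows, reusing the stw_infogain sketch from Section 4.2; the classifier and parameter grid shown here are illustrative rather than our exact settings:

from sklearn.model_selection import LeaveOneOut, GridSearchCV
from sklearn.svm import SVC

def loocv_accuracy(X, y):
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        # Fold-specific feature weights, fit on the training transcripts only.
        X_tr, X_te = stw_infogain(X[train_idx], y[train_idx], X[test_idx])
        # Nested cross-validation loop for hyperparameter tuning.
        inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)
        inner.fit(X_tr, y[train_idx])
        correct += int(inner.predict(X_te)[0] == y[test_idx][0])
    return correct / len(y)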

A baseline classifier which assigns all data to the largest class would achieve an accuracy of .62 on this classification task. For a more realistic measure of performance, we also compare our results to the baseline accuracy that can be achieved using only the length of the narrative as input.

5 Results

5.1 Features using clean transcripts

We first present the results for the clean transcripts. Although different features may be selected in each fold of the cross-validation, for simplicity we show only the feature rankings on the whole data set. Table 2 shows the top features as ranked by GainRatio. The frequencies are given to indicate the direction of the trend; they represent the average frequency per narrative for each class (agrammatic = AG and control = CT). Asterisks are used to indicate the significance of the difference between the groups.

When working with clinical data, careful examination of the features can be beneficial. By comparing features with previous findings in the literature on agrammatism, we can be confident that we are measuring real effects and not just artifacts of the parsing algorithm.


     Rule                  AG freq.   CT freq.   p
 1   PP → IN NP                10.3       24.9   ***
 2   ROOT → NP                  2.9        0.2   ***
 3   NP → DT NN POS             0.0        0.7   *
 4   NP → PRP$ JJ NN            0.5        0.7   *
 5   VP → TO VP                 4.2        7.5   *
 6   NP → NNP                   5.9        6.6
 7   VP → VB PP                 1.1        2.9   **
 8   VP → VP CC VP              1.1        3.1   **
 9   NP → DT NN NN              1.0        2.7   **
10   VP → VBD VP                0.1        0.5   *
11   WHADVP → WRB               0.5        1.4   *
12   FRAG → NP .                0.7        0.0   **
13   NP → JJ NN                 0.7        0.0   **
14   SBAR → WHNP S              1.7        3.1   *
15   NP → NP SBAR               1.6        2.5
16   S → NP VP                  7.8       16.1   **
17   NP → PRP$ JJ NNS           0.0        0.5   *
18   NP → PRP$ NN NNS           0.0        0.6   *
19   SBAR → WHADVP S            0.4        1.2   *
20   VP → VBN PP                0.4        2.0   *

Table 2: Top 20 features ranked by GainRatio using the clean transcripts. (*p < 0.05, **p < 0.005, ***p < 0.0005).

This can also potentially provide an opportunity to observe features of agrammatic speech that have not been examined in manual analyses. We examine the top-ranked features in Table 2 in some detail, especially as they relate to previous work on agrammatism. In particular, the top features suggest some of the following characteristics of agrammatic speech:

• Reduced number of prepositional phrases. This is suggested by feature 1, PP → IN NP. It is also reflected in features 7 and 20.

• Impairment in using verbs. We can see in feature 2 (ROOT → NP) that there is a greater number of utterances consisting of only a noun phrase. Feature 12 is also consistent with this pattern (FRAG → NP .). We also observe a reduced number of coordinated verb phrases (VP → VP CC VP).

• Omission of grammatical morphemes and function words. The agrammatic speakers use fewer possessives (NP → DT NN POS). Feature 9 indicates that the control participants more frequently produce compound nouns with a determiner (often the glass slipper or the fairy godmother). Feature 4 also suggests some difficulty with determiners, as the agrammatic participants produce fewer nouns modified by a possessive pronoun and an adjective. Contrast this with feature 13, which shows agrammatic speech is more likely to contain noun phrases containing just an adjective and a noun. For example, in the control narratives we are more likely to see phrases such as her godmother . . . waves her magic wand, while in the agrammatic narratives phrases like Cinderella had wicked stepmother are more common.

• Reduced number of embedded clauses and phrases. Evidence for this can be found in the reduced number of wh-adverb phrases (WHADVP → WRB), as well as features 14, 15, and 19.

                         NB    SVM   RF
Narrative length        .62    .56   .64
Binary, no weights      .87    .87   .77
Binary, tf-idf          .87    .90   .85
Binary, InfoGain        .82    .90   .74
Binary, GainRatio       .90    .82   .79
Frequency, no weights   .90    .85   .85
Frequency, tf-idf       .85    .82   .77
Frequency, InfoGain     .90    .90   .82
Frequency, GainRatio    .90    .92   .74
SC metrics, no weights  .85    .77   .82
SC metrics, InfoGain    .85    .77   .79
SC metrics, GainRatio   .85    .77   .82

Table 3: Average classification accuracy using the clean transcripts. The highest classification accuracy for each feature set is indicated with boldface.

The results of our classification experiment on the clean data are shown in Table 3. The results are similar for the binary and frequency features, with the best result of .92 achieved using an SVM classifier and frequency features with GainRatio weights. The best results using parse features (.85–.92) are the same or slightly better than the best results using SC features (.85), and both feature sets perform above baseline.

5.2 Effect of non-narrative speech

In this section we perform two additional experiments, using the raw and auto-cleaned transcripts.


     Rule                   AG freq.   CT freq.   p
 1   NP → DT NN POS              0.0        0.5   *
 2   PP → IN NP                 12.2       26.1   ***
 3   SBAR → WHADVP S             0.4        1.5   *
 4†  VP → VBD                    0.75       1.1
 5   VP → TO VP                  4.3        7.3   *
 6†  S → CC PP NP VP .           0.04       0.5   *
 7   NP → PRP$ JJ NNS            0.04       0.5   *
 8†  VP → AUX VP                 3.7        6.0
 9†  ROOT → FRAG                 4.5        0.7   **
10†  ADVP → RB                   9.8       12.3
11   NP → NNP                    4.4        6.2   *
12†  NP → DT NN                 15.0       24.1   **
13   VP → VB PP                  1.2        2.8   *
14   VP → VP CC VP               1.0        2.9   *
15   WHADVP → WRB                0.6        1.5   *
16   VP → VBN PP                 0.4        2.0   *
17†  INTJ → UH UH                3.5        0.3   *
18†  VP → VBP NP                 0.5        0.0   *
19†  NP → NNP NNP                1.5        0.5   **
20†  S → CC ADVP NP VP .         1.3        2.3

Table 4: Top 20 features ranked by GainRatio using the raw transcripts. Feature numbers marked with † indicate rules which did not appear in Table 2. (*p < 0.05, **p < 0.005, ***p < 0.0005).

We discuss the differences between the selected features in each case, and the resulting classification accuracies.

Using the raw transcripts, we find that the ranking of features is markedly different from that for the human-annotated transcripts (Table 4, feature numbers marked with †). Examining these production rules more closely, we observe some characteristics of agrammatic speech which were not detectable in the annotated transcripts:

• Increased number of dysfluencies. We observe a higher number of consecutive fillers (INTJ → UH UH) in the agrammatic data, as well as a higher number of consecutive proper nouns (NP → NNP NNP), usually two attempts at Cinderella's name. Feature 18 (VP → VBP NP) also appears to support this trend, although it is not immediately obvious. Most of the control participants tell the story in the past tense, and if they do use the present tense then the verbs are often in the third-person singular (Cinderella finds her fairy godmother). Looking at the data, we found that feature 18 can indicate a verb agreement error, as in he attend the ball. However, in almost twice as many cases it indicates use of the discourse markers I mean and you know, followed by a repaired or target noun phrase.

• Decreased connection between sentences. Feature 6 shows a canonical NP VP sentence, preceded by a coordinate conjunction and a prepositional phrase. Some examples of this from the control transcripts include And at the stroke of midnight . . . and And in the process . . . . The conjunction creates a connection from one utterance to the next, and the prepositional phrase indicates the temporal relationship between events in the story, creating a sense of cohesion. See also the similar pattern in feature 20, representing sentence beginnings such as And then . . . .

However, there are some features which were highly ranked in the clean transcripts but do not appear in Table 4. What information are we losing by using the raw data? One issue with using the raw transcripts is that the inclusion of filled pauses “splits” the counts for some features. For example, the feature FRAG → NP . is ranked 12th using the clean transcripts but does not appear in the top 20 when using the raw transcripts. When we examine the transcripts, we find that the phrases that are counted in this feature in the clean transcripts are actually split into three features in the raw transcripts: FRAG → NP ., FRAG → INTJ NP ., and FRAG → NP INTJ ..

The classification results for the raw transcripts are given in Table 5. The results are similar to those for the clean transcripts, although in this case the best accuracy (.92) is achieved in three different configurations (all using the SVM classifier). The phrase-level features outperform the traditional SC measures in only half the cases.

Using the auto-cleaned transcripts, we see some similarities with the previous cases (Table 6). However, some of the highly ranked features which disappeared when using the raw transcripts are now significant again (e.g. ROOT → NP, FRAG → NP .). There are also three remaining features which are significant and have not yet been discussed. Feature 9 shows an increased use of determiners with proper nouns (e.g. the Cinderella), a frank grammatical error.


                         NB    SVM   RF
Narrative length        .51    .62   .69
Binary, no weights      .87    .92   .82
Binary, tf-idf          .87    .92   .72
Binary, InfoGain        .85    .87   .82
Binary, GainRatio       .82    .87   .85
Frequency, no weights   .85    .90   .69
Frequency, tf-idf       .82    .92   .90
Frequency, InfoGain     .85    .74   .85
Frequency, GainRatio    .85    .74   .82
SC metrics, no weights  .74    .79   .82
SC metrics, InfoGain    .77    .85   .85
SC metrics, GainRatio   .77    .85   .87

Table 5: Average classification accuracy using the raw transcripts. The highest classification accuracy for each feature set is indicated with boldface.

Feature 20 provides another example of a sentence fragment with no verb. Finally, feature 19 represents an increased number of sentences or clauses consisting of a noun phrase followed by an adjective phrase. Looking at the transcripts, this is not generally indicative of an error, but rather use of the word okay, as in she dropped her shoe okay.

The classification results for the auto-cleaned data, shown in Table 7, show a somewhat different pattern from the previous experiments. The accuracies using the parse features are generally higher, and the best result of .97 is achieved using the binary features and the naive Bayes classifier. Interestingly, this data set also results in the lowest accuracy for the syntactic complexity metrics.

5.3 Phrase-level parse features

The classifiers in Tables 3, 5, and 7 used the phrase-level parse features as well as the CFG productions. Although these features were calculated for NPs, VPs, and PPs, the NP features were never selected by the GainRatio ranking algorithm, and did not differ significantly between groups. The significance levels of the VP and PP features are reported in Table 8. PP rate and proportion are significantly different in all three sets of transcripts, which is consistent with the high ranking of PP → IN NP in each case. VP rate and proportion are often significant, although less so. Notably, PP and VP length are both significant in the clean transcripts, but not significant in the raw transcripts and only barely significant in the auto-cleaned transcripts.

     Rule                    AG freq.   CT freq.   p
 1   PP → IN NP                  12.0       26.0   ***
 2   NP → DT NN POS               0.0        0.7   *
 3   VP → VP CC VP                0.8        2.9   **
 4†  S → CC SBAR NP VP .          0.0        0.5
 5   SBAR → WHADVP S              0.4        1.5   *
 6   NP → NNP                     5.6        6.7
 7†  VP → VBD                     0.8        1.1
 8†  S → CC PP NP VP .            0.04       0.6   *
 9†  NP → DT NNP                  0.6        0.0   **
10   VP → TO VP                   4.6        7.5   *
11†  ROOT → FRAG                  3.0        0.5   ***
12   ROOT → NP                    2.1        0.1   *
13†  VP → VBP NP                  1.7        3.6
14   NP → PRP$ JJ NNS             0.04       0.5   *
15   VP → VB PP                   1.1        2.8   **
16   VP → VBN PP                  0.4        1.9   *
17   FRAG → NP .                  0.4        0.0   *
18†  NP → NNP .                   2.1        0.1
19†  S → NP ADJP                  0.4        0.0   *
20†  FRAG → CC NP .               0.7        0.07  **

Table 6: Top 20 features ranked by GainRatio using the auto-cleaned transcripts. Feature numbers marked with † indicate rules which did not appear in Table 2. (*p < 0.05, **p < 0.005, ***p < 0.0005).

5.4 Analysis of variance

With a multi-way ANOVA, we found significant main effects of classifier (F(2,63) = 11.6, p < 0.001) and data set (F(2,63) = 11.2, p < 0.001) on accuracy. A Tukey post-hoc test revealed significant differences between SVM and RF (p < 0.001) and between NB and RF (p < 0.001), but not between SVM and NB. Similarly, we see a significant difference between the clean and auto-cleaned data (p < 0.001) and between the raw and auto-cleaned data (p < 0.001), but not between the raw and clean data. There was no significant main effect of weighting scheme or feature type (binary or frequency) on accuracy. We did not examine any possible interactions between these variables.
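For reference, an analysis of this kind can be run with statsmodels, assuming the per-configuration accuracies are collected in a data frame with columns for accuracy, classifier, dataset, weighting scheme, and feature type (the column names here are illustrative):

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def anova_and_tukey(results: pd.DataFrame):
    # Multi-way ANOVA with main effects only (no interaction terms).
    model = ols("accuracy ~ C(classifier) + C(dataset) + C(weighting) + C(feature_type)",
                data=results).fit()
    print(sm.stats.anova_lm(model, typ=2))
    # Tukey HSD post-hoc comparisons for the two significant factors.
    print(pairwise_tukeyhsd(results["accuracy"], results["classifier"]))
    print(pairwise_tukeyhsd(results["accuracy"], results["dataset"]))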

6 Discussion

6.1 Transcripts

We achieved the highest classification accuracies using the auto-cleaned transcripts. The raw transcripts, while containing more information about dysfluent events, also seemed to cause more difficulty for the parser, which mis-labelled filled pauses and false starts in some cases.


                         NB    SVM   RF
Narrative length        .51    .62   .64
Binary, no weights      .92    .95   .90
Binary, tf-idf          .92    .95   .87
Binary, InfoGain        .97    .90   .85
Binary, GainRatio       .97    .90   .95
Frequency, no weights   .90    .95   .77
Frequency, tf-idf       .87    .95   .79
Frequency, InfoGain     .92    .85   .82
Frequency, GainRatio    .92    .87   .95
SC metrics, no weights  .79    .77   .74
SC metrics, InfoGain    .79    .74   .72
SC metrics, GainRatio   .79    .74   .67

Table 7: Average classification accuracy using the auto-cleaned transcripts. The highest classification accuracy for each feature set is indicated with boldface.

                 Clean   Raw   Auto
PP rate           ***    ***   ***
PP proportion     ***    ***   **
PP length         **
VP rate           **           *
VP proportion     ***    *     *
VP length         ***          *

Table 8: Significance of the phrase-level features in each of the three data sets (*p < 0.05, **p < 0.005, ***p < 0.0005).

We also found that the insertion of filled pauses resulted in the creation of multiple features for a single underlying grammatical structure. The auto-cleaned transcripts appeared to avoid some of those problems, while still retaining information about many of the non-narrative speech productions that were removed from the clean transcripts.

Some of the features from the auto-cleaned transcripts appear to be associated with the discourse level of language, such as connectives and discourse markers. A researcher solely interested in studying the syntax of language might resist the inclusion of such features, and prefer to use only features from the human-annotated clean transcripts. However, we feel that such productions are part of the grammar of spoken language, and merit inclusion. From a practical standpoint, our findings are reassuring: data preparation that can be done automatically is much more feasible in many situations than human annotation.

6.2 Features

CFG production rules can offer a more detailed look at specific language impairments. We were able to observe a number of important characteristics of agrammatic language as reported in previous studies: fragmented speech with a higher incidence of solitary noun phrases, difficulty with determiners and possessives, reduced number of prepositional phrases and embedded clauses, and (in the raw transcripts) increased use of filled pauses and repair phrases. For this reason, we believe that they are more useful for the analysis of disordered or otherwise atypical language than traditional measures of syntactic complexity.

In some cases an in-depth analysis may not be required, and in such cases it may be tempting to simply use one of the more general syntactic complexity measures. Nevertheless, even in our simple binary classification task, we found that using the more specific features gave us a higher accuracy.

6.3 Future work

Because of the limited data, we consider these results to be preliminary. We hope to replicate this study as more data become available in the future. We also plan to examine the effect, if any, of the specific narrative task. Furthermore, we have shown that these methods are effective for the analysis of agrammatic aphasia, but there are other types of aphasia in which semantic, rather than syntactic, processing is the primary impairment. We would like to extend this work to find features which distinguish between different types of aphasia.

Although we included manually transcribed data in this study, these methods will be most useful if they are also effective on automatically recognized speech. Previous work on speech recognition for aphasic speech reported high error rates (Fraser et al., 2013a). Our finding that the auto-cleaned transcripts led to the highest classification accuracy is encouraging, but we will have to test the robustness to recognition errors and the dependence on sentence boundary annotations.

Acknowledgments

This research was supported by the Natural Sciences and Engineering Research Council of Canada and National Institutes of Health R01DC01948 and R01DC008552.


References

Harald Baayen, Hans Van Halteren, and Fiona Tweedie. 1996. Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11(3):121–132.

Roelien Bastiaanse and Cynthia K. Thompson. 2012. Perspectives on Agrammatism. Psychology Press.

Julian Brooke and Graeme Hirst. 2012. Robust, lexicalized native language identification. In Proceedings of the 24th International Conference on Computational Linguistics, pages 391–408.

Jieun Chae and Ani Nenkova. 2009. Predicting the fluency of text with shallow structural features: Case studies of machine translation and human-written text. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 139–147.

Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics, pages 132–139.

Franca Debole and Fabrizio Sebastiani. 2004. Supervised term weighting for automated text categorization. In Text Mining and its Applications, pages 81–97. Springer.

Kathleen Fraser, Frank Rudzicz, Naida Graham, and Elizabeth Rochon. 2013a. Automatic speech recognition in the diagnosis of primary progressive aphasia. In Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies, pages 47–54.

Kathleen C. Fraser, Jed A. Meltzer, Naida L. Graham, Carol Leonard, Graeme Hirst, Sandra E. Black, and Elizabeth Rochon. 2013b. Automated classification of primary progressive aphasia subtypes from narrative speech transcripts. Cortex.

Harold Goodglass, Julie Ann Christiansen, and Roberta E. Gallagher. 1994. Syntactic constructions used by agrammatic speakers: Comparison with conduction aphasics and normals. Neuropsychology, 8(4):598.

Khairun-nisa Hassanali, Yang Liu, Aquiles Iglesias, Thamar Solorio, and Christine Dollaghan. 2013. Automatic generation of the index of productive syntax for child language transcripts. Behavior Research Methods, pages 1–9.

David I. Holmes and Sameer Singh. 1996. A stylometric analysis of conversational speech of aphasic patients. Literary and Linguistic Computing, 11(3):133–140.

Brian MacWhinney, Davida Fromm, Margaret Forbes, and Audrey Holland. 2011. AphasiaBank: Methods for studying discourse. Aphasiology, 25(11):1286–1307.

Ljiljana Progovac. 2006. The Syntax of Nonsententials: Multidisciplinary Perspectives, volume 93. John Benjamins.

Emily T. Prud'hommeaux, Brian Roark, Lois M. Black, and Jan van Santen. 2011. Classification of atypical language in autism. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, CMCL '11, pages 88–96.

Brian Roark, Margaret Mitchell, John-Paul Hosom, Kristy Hollingshead, and Jeffery Kaye. 2011. Spoken language derived measures for detecting mild cognitive impairment. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2081–2090.

Elizabeth Rochon, Eleanor M. Saffran, Rita Sloan Berndt, and Myrna F. Schwartz. 2000. Quantitative analysis of aphasic sentence production: Further development and new data. Brain and Language, 72(3):193–218.

Eleanor M. Saffran, Rita Sloan Berndt, and Myrna F. Schwartz. 1989. The quantitative analysis of agrammatic production: Procedure and data. Brain and Language, 37(3):440–479.

Kenji Sagae, Alon Lavie, and Brian MacWhinney. 2005. Automatic measurement of syntactic development in child language. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 197–204.

Ben Swanson and Eugene Charniak. 2012. Native language detection with tree substitution grammars. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 193–197.

Cynthia K. Thompson, Lewis P. Shapiro, Ligang Li, and Lee Schendel. 1995. Analysis of verbs and verb-argument structure: A method for quantification of aphasic language production. Clinical Aphasiology, 23:121–140.

Sze-Meng Jojo Wong and Mark Dras. 2010. Parser features for sentence grammaticality classification. In Proceedings of the Australasian Language Technology Association Workshop, pages 67–75.

Sze-Meng Jojo Wong and Mark Dras. 2011. Exploiting parse structures for native language identification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1600–1610.

Victor Yngve. 1960. A model and hypothesis for language structure. Proceedings of the American Philosophical Society, 104:444–466.


Author Index

Aronson, Alan, 93

Battistelli, Delphine, 107
Beard, Rachel, 1

Carroll, John, 77
Cassell, Jackie, 77
Chapman, Wendy, 54
Charnois, Thierry, 107
Cimiano, Philipp, 118
Cohen, Kevin, 10, 93

Davis, Anthony, 59
de la Peña González, Santiago, 98
DeLong, Caroline M., 83
Demner-Fushman, Dina, 29, 45

Fanelli, Daniele, 19
Fiszman, Marcelo, 29
Fraser, Kathleen C., 134

Gangopadhyay, Aryya, 128
Gonzalez, Graciela, 1

Haake, Anne, 83
Hailu, Negacy, 10
Harabagiu, Sanda, 68
Hartung, Matthias, 118
Hirst, Graeme, 134
Hochberg, Limor, 83

Karami, Amir, 128
Kaye, Jeffrey, 38
Kilicoglu, Halil, 29, 45
Klinger, Roman, 118

Lauder, Rob, 1
Leaman, Robert, 24
Lu, Zhiyong, 24

Mack, Jennifer E., 134
Martin, Laure, 107
Martínez, Paloma, 98
Meltzer, Jed A., 134
Meystre, Stephane, 54
Mowery, Danielle, 54

Osborne, Richard, 93
Ovesdotter Alm, Cecilia, 83

Panteleyeva, Natalya, 10
Preiss, Judita, 112

Rantanen, Esa M., 83
Rivera, Robert, 1
Roberts, Kirk, 29, 68
Ross, Mindy, 54

Savkov, Aleksandar, 77
Scotch, Matthew, 1
Segura-Bedmar, Isabel, 98
Shafran, Izhak, 38
Sheikhshab, Golnar, 38
Skinner, Michael, 68
Subotin, Michael, 59

Tahsin, Tasnia, 1
Thompson, Cynthia K., 134

Velupillai, Sumithra, 54, 88

Wallstrom, Garrick, 1
Weissenbacher, Davy, 1
Wiebe, Janyce, 54

Yu, Bei, 19

Zwick, Matthias, 118
