Cross-narrative Temporal Ordering of Medical Eventspeople.dbmi.columbia.edu/noemie/papers/acl14.pdfclinical narratives. We present results us-ing both approaches and observe that the

Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 998–1008,Baltimore, Maryland, USA, June 23-25 2014. c©2014 Association for Computational Linguistics

Cross-narrative temporal ordering of medical events

Preethi Raghavan∗, Eric Fosler-Lussier∗, Noemie Elhadad† and Albert M. Lai∗∗The Ohio State University, Columbus, Ohio†Columbia University, New York, NY

{raghavap, fosler}@[email protected], [email protected]

Abstract

Cross-narrative temporal ordering of med-ical events is essential to the task of gen-erating a comprehensive timeline over apatient’s history. We address the prob-lem of aligning multiple medical event se-quences, corresponding to different clin-ical narratives, comparing the followingapproaches: (1) A novel weighted finitestate transducer representation of medi-cal event sequences that enables compo-sition and search for decoding, and (2)Dynamic programming with iterative pair-wise alignment of multiple sequences us-ing global and local alignment algorithms.The cross-narrative coreference and tem-poral relation weights used in both theseapproaches are learned from a corpus ofclinical narratives. We present results us-ing both approaches and observe that thefinite state transducer approach performsperforms significantly better than the dy-namic programming one by 6.8% for theproblem of multiple-sequence alignment.

1 Introduction

Discourse structure, logical flow of sentences, andcontext play a large part in ordering medical eventsbased on temporal relations within a clinical nar-rative. However, cross-narrative temporal rela-tion ordering is a challenging task as it is dif-ficult to learn temporal relations among medicalevents which are not part of the logically coherentdiscourse of a single narrative. Resolving cross-narrative temporal relationships between medicalevents is essential to the task of generating anevent timeline from across unstructured clinicalnarratives such as admission notes, radiology re-ports, history and physical reports and dischargesummaries. Such a timeline has multiple applica-tions in clinical trial recruitment (Luo et al., 2011),medical document summarization (Bramsen et al.,

2006, Reichert et al., 2010) and clinical decisionmaking (Demner-Fushman et al., 2009).

Given multiple temporally ordered medicalevent sequences generated from each clinical nar-rative in a patient record, how can we combinethe events to create a timeline across all the nar-ratives? The tendency to copy-paste text andsummarize past information in newly generatedclinical narratives leads to multiple mentions ofthe same medical event across narratives (Cohenet al., 2013). These cross-narrative coreferencesact as important anchors for reasoning with in-formation across narratives. We leverage cross-narrative coreference information along with con-fident cross-narrative temporal relation predictionsand learn to align and temporally order medicalevent sequences across longitudinal clinical nar-ratives. We model the problem as a sequencealignment task and propose solving this using twoapproaches. First, we use weighted finite statemachines to represent medical events sequences,thus enabling composition and search to obtainthe most probable combined sequence of medicalevents. As a contrast, we adapt dynamic program-ming algorithms (Needleman et al., 1970, Smithand Waterman, 1981) used to produce global andlocal alignments for aligning sequences of med-ical events across narratives. We also comparethe proposed methods with an Integer Linear Pro-gramming (ILP) based method for timeline con-struction (Do et al., 2012). The cross-narrativecoreference and temporal relation scores used inboth these approaches are learned from a corpusof patient narratives from The Ohio State Univer-sity Wexner Medical Center.

The main contribution of this paper is a generalframework that allows aligning multiple event se-quences using cascaded weighted finite state trans-ducers (WFSTs) with the help of efficient compo-sition and decoding. Moreover, we demonstratethat this method can be used for more accuratemultiple sequence alignment when compared to

998

dynamic programming or other ILP-based meth-ods proposed in literature.

2 Related Work

In the areas of summarization and text-to-text gen-eration, there has been prior work on several order-ing strategies to order pieces of information ex-tracted from different input documents (Barzilayet al., 2002, Lapata, 2003, Bollegala et al., 2010).In this paper, we focus on temporal ordering of in-formation, as discussed next.

Recent state-of-the art research has focused onthe problem of temporal relation learning withinthe same document, and in many cases within thesame sentence (Mani et al., 2006, Verhagen et al.,2009, Lapata and Lascarides, 2011). Chambersand Jurafsky (2009) describe a process to inducea partially ordered set of events related by a com-mon protagonist by using an unsupervised distri-butional method to learn relations between eventssharing coreferring arguments, followed by tem-poral classification to induce partial order. Thetask was carried out on the Timebank newswirecorpus, but was limited to an intra-document set-ting. More recently, (Do et al., 2012) proposedan ILP-based method to combine the outputs ofan event-interval and an event-event classifier fortimeline construction on the ACE 2005 corpus.However, this approach is also restricted to eventswithin documents and requires annotations forevent intervals. We empirically compare our meth-ods for timeline creation from longitudinal clinicalnarratives to such an ILP-based approach in Sec-tion 7. While a lot of this work has been done inthe news domain, there is also some recent workin rule-based algorithms (Zhou et al., 2006) andmachine learning (Roberts et al., 2008) appliedto temporal relations between medical events inclinical text. Clinical narratives are written in adistinct sub-language with domain specific termi-nology and temporal characteristics, making themmarkedly different from newswire text.

There is limited prior work in learning re-lations across documents. Ji and Grishman(2008) extended the one sense per discourse idea(Yarowsky, 1995) to multiple topically relateddocuments and propagate consistent event argu-ments across sentences and documents. Barzi-lay and McKeown (2005) propose a text-to-textgeneration technique for synthesizing common in-formation across documents using sentence fu-sion. This involves multisequence dependencytree alignment to identify phrases conveying sim-

ilar information and statistical generation to com-bine common phrases into a sentence. Along withsyntactic features, they combine knowledge fromresources like WordNet to find similar sentences.In case of clinical narratives and medical eventalignment, the objective is to identify a unique se-quence of temporally ordered medical events fromacross longitudinal clinical data.

To the best of our knowledge, there is noprior work on cross-document alignment of eventsequences. Multiple sequence alignment is aproblem that arises in a variety of domains in-cluding gene/protein alignments in bioinformat-ics (Notredame, 2002), word alignments in ma-chine translation (Kumar and Byrne, 2003), andsentence alignments for summarization (Lacatusuet al., 2004). Dynamic programming algorithmshave been popularly leveraged to produce pair-wise and global genetic alignments, where editdistance based metrics are used to compute thecost of insertions, deletions and substitutions.We use dynamic programming to compute thebest alignment, given the temporal and corefer-ence information between medical events acrossthese sequences. More importantly, we proposea cascaded WFST-based framework for cross-document temporal ordering of medical event se-quences. Composition and search operations canbe used to build a single transducer that inte-grates these components, directly mapping frominput states to desired outputs, and obtain the bestalignment (Mohri et al., 2000). In natural lan-guage processing, WFSTs have seen varied appli-cations in machine translation (Kumar and Byrne,2003), morphology (Sproat, 2006), named en-tity recognition (Krstev et al., 2011) and biolog-ical sequence alignment / generation (Whelan etal., 2010) among others. We demonstrate thatthe WFST-based approach outperforms popularlyused dynamic programming algorithms for multi-ple sequence alignment.

3 Problem Description

Medical events are temporally-associated con-cepts in clinical text that describe a medical con-dition affecting the patient’s health, or proceduresperformed on a patient. We represent medicalevents by splitting each event into a start and astop. When there is insufficient information to dis-cern the start or stop of an event, it is representedas a single concept. If only the start is known thenthe stop is set to +∞, whereas when only the stopis known , the start is set to the date of birth of the

999

e1 before e2

e1 overlaps e2

e1 during e2

e1 starts with e2

e1 finishes with e2

e1 equals e2

< e1.start e1.stop e2.start e2.stop

e1.start e1.stop e2.start e2.stop

e1.start e1.stop e2.start e2.stop

e1.start e2.start e2.stop e1.stop

e1.start e1.stop e2.stop e2.start

e1.start e2.start e2.stop e1.stop

< <

< < <

< < <

< < ~

~ < <

~ ~ <

< denotes before > denotes after ~ denotes simultaneous

Temporal Relation Event Ordering

Figure 1: Medical event start / stop representa-tion mapped to Allen’s temporal relations (Allen,1981). Temporal ordering of event starts and stopsusing {before, after, simultaenous} (shown on theright) allows us learn temporal relations betweenthe medical events (shown on the left). e1start =e2start and e1stop = e2stop, when e1 and e2 core-fer.

patient.1 Often, for chronic ailments like hyper-tension, we would only associate a start with themedical event and set the stop to +∞. The start ofhypertension may be associated with the temporalexpression history of in the narrative. This, whenconsidered along with the admission date, allowsus to relatively order hypertension with respect toother medical events. A medical event occurrencelike chest pain may be associated with a start anda stop, where the start may be determined by themention of “patient was complaining of chest painyesterday” in the narrative text. Further, the nar-rative may state that “he continued to have chestpain on admission, but currently he is chest painfree”; this may be used to infer the relative stop ofchest pain. Medical events may also be instan-taneous, for e.g., injected with antibiotic. Suchevents are represented with the start and stop asbeing the same. Temporal relations exist betweenthe start and stop of events as shown in Figure 1.Learning temporal relations before, after and si-multaneous between the medical event starts andstops corresponds to learning all of Allen’s tem-poral relations (Allen, 1981) between the medicalevents. Following our previous work (Raghavanet al., 2012c), such a representation allows us totemporally order the event starts and stops withineach clinical narrative by learning to rank them inrelative order of time. The problem definition is asfollows:

1Patient date of birth, admission/ discharge date are usu-ally available in the metadata associated with a clinical nar-rative.

dob +∞

dob

+∞

hypertensionstart admission1 chest painstart chest painstop

palpitationsstart myocardial infarctionstart

MRSAstart admission2

hypertensionstart

heart attackstart

dob +∞ cocaine usestart infectionstart

woundsstart admission3

N1

N2

N3

episodestart

Figure 2: Given temporally ordered medical eventsequences, N1, N2, N3, we address the task ofcombining events across these sequences by merg-ing or ordering them to create a single comprehen-sive timeline.

Input: Sequences of temporally ordered med-ical event starts and stops. This corresponds toN1, N2, and N3 in Figure 2. Each sequence cor-responds to a clinical narrative. The total numberof sequences correspond to the number of clinicalnarratives for a patient.

Problem: Combine medical events across thesesequences to generate a timeline i.e., a single com-prehensive sequence of medical events over allclinical narratives of the patient.

Expected Output: In the example shownin Figure 2, the output would be as follows:Timeline (N1, N2, N3)= {cocaine usestart <hypertensionstart = hypertensionstart < admis-sion1 < chest painstart ∼ palpitationsstart <chest painstop < heart attackstart = myocardialinfarctionstart < admission2 < infectionstart <MRSAstart < admission3 < woundsstart}.

The goal of multiple sequence alignment is tofind an alignment that maximizes some overallalignment score. Thus, in order to align event se-quences, we need to compute scores correspond-ing to cross-narrative medical event coreferenceresolution and cross-narrative temporal relations.

4 Cross-Narrative CoreferenceResolution and Temporal RelationLearning

The first approach to learning a temporal order-ing of medical events across all clinical narrativesis to consider all pairs of events across all narra-tives and learn to classify them as sharing one ofAllen’s temporal relations (Allen, 1981) using asingle learning model. Alternatively, a ranking ap-

1000

proach, similar to the one used to generate intra-narrative temporal ordering, can also be extendedto the cross-narrative case. However, the featuresrelated to narrative structure and relative and im-plicit temporal expressions used for temporal or-dering within a clinical narrative may not be ap-plicable across narratives. For instance, a historyand physical report may have sections like “pastmedical history”, “history of present illness”, “as-sessment and plan”, and a certain logical patternto the flow of text within and across these sec-tions. Further, temporal cues like “thereafter”,“subsequently”, follow from the context around anevent mention. The absence of such features in thecross-narrative case does not allow such a modelto generate accurate temporal relation predictions.

Thus, for use in our sequence alignment models,we learn two independent classifiers for medicalevent coreference and temporal relation learningacross narratives. We train a classifier to resolvecross-narrative coreferences by extracting seman-tic and temporal relatedness feature sets for eachpair of medical concepts. Extracting these fea-ture sets helps us train a classifier to predict med-ical event coreferences (Raghavan et al., 2012a).Another classifier is then trained to classify pairsof medical event starts and stops across narrativesas sharing temporal relations {before, after, over-laps}. The learned cross-narrative coreferencepredictions can then be used along with confi-dent temporal relation predictions to derive a jointprobability to enable cross-narrative temporal or-dering.

5 Narrative Sequence Alignment forCross-narrative Temporal Ordering

Sequence alignment algorithms have been de-veloped and popularly used in bioinformatics.However, multiple sequence alignment (MSA)has been shown to be NP complete (Wang andJiang, 1994) and various heuristic algorithms havebeen proposed to solve this problem (Notredame,2002). We propose a novel WFST-based repre-sentation that enables accurate decoding for MSAwhen compared to popularly used dynamic pro-gramming algorithms (Needleman et al., 1970,Smith and Waterman, 1981) or other state of theart methods (Do et al., 2012).

In the problem of aligning events across mul-tiple narrative sequences, we want to align tem-porally ordered medical events corresponding toclinical narratives of a patient. Unlike problemsin biological sequence alignment where the sym-

chest painstart

episodestart

chest painstop

episodestop

<

<

a = x

a b

x y

< b = y

Score = P(a simult x | a coref x) P(a coref x)

Figure 3: Score computation for aligning eventsacross temporally ordered event sequences chestpainstart = episodestart < chest painstop =episodestop, where events across the sequences oc-cur simultaneously and corefer.

chest painstart

palpitationsstart

chest painstop

palpitationsstop

<

<

a b

x y a ~ x < b < y

Score = P(a simult x | a no-coref x) × P(x before b | a no-coref x) × P(b before y | a no-coref x) P(a no-coref x)

Figure 4: Score computation for aligning eventsacross temporally ordered event sequences chestpainstart ∼ palpitationsstop < chest painstop <palpitationsstop, where some events across the se-quences occur simultaneously but do not corefer.

bols to be aligned across sequences are restrictedto a fixed set, our symbol set is not fixed or cer-tain because the symbols correspond to medicalevents in clinical narratives. Moreover, we can-not have fixed scores for symbol transformationssince our transformations correspond to corefer-ence and temporal relations between the medicalevents across sequences. The computation of thesescores is described next.

5.1 Scoring Scheme

Let us assume a, b are medical events in the firstclinical narrative and have been temporally or-dered so a < b. Similarly, x, y are medical eventsin the second clinical narrative such that x < y.There exists a match or an alignment between apair of medical events, across the sequences, in thefollowing cases:

1. If the medical events are simultaneous andcoreferring, denoted as a = x.

2. If the medical events are simultaneous andnon-coreferring, denoted as a ∼ x.

1001

hypertensionstart palpitationsstart <

a b

x y

a < x < b < y

Score = P(a before x |a no-coref x) P(a no-coref x) × P(x before b | x no-coref b) P(x no-coref b) × P(b before y | b no-coref y) P(b no-coref y)

infectionstart MRSAtart < <

Figure 5: Score computation for aligningevents across temporally ordered event se-quences hypertensionstart < palpitationsstart <infectionstart < MRSAstart, where events acrossthe sequences do not occur simultaneously and donot corefer.

3. If the a medical event from one sequenceis before a medical event from another se-quence, denoted as a < x.

4. If the a medical event from one sequence isafter a medical event from another sequence,denoted as a > x.

We now illustrate how the scores for candidatealigned sequences are computed using the learnedcross-narrative coreference and temporal probabil-ities for the following three scenarios:

• The medical events across sequences are si-multaneous and corefer as illustrated in Fig-ure 3. The joint score considers the probabil-ity of event temporal relations simultaneousconditioned on coreference.

• Some medical events across sequences are si-multaneous but do not corefer as illustrated inFigure 4. Here, the joint score considers thejoint probability of temporal relations simul-taneous or before and no-coreference.

• The medical events across sequences are notsimultaneous and do not corefer as illustratedin Figure 5. In this case, the joint score con-siders the probability of the temporal relationbefore and no coreference.

Thus, the coreference and temporal relation scorescan be leveraged for aligning sequences of medicalevents. These scores are used in both the WFST-based representation and decoding, as well as fordynamic programming.

5.2 Alignment using a Weighted Finite StateRepresentation

A weighted finite-state transducer (WFST) is anautomaton in which each transition between states

is associated with an input symbol, an output sym-bol, and a weight (Mohri et al., 2005). WFSTs canbe used to efficiently represent and combine se-quences of medical events based coreference andtemporal relation information. The WFST rep-resentation gives us the ability to talk about theglobal joint probability derived from coreferenceand temporal relation scores described in Section5.1. It allows us to build a weighted lattice of se-quences that can be searched for the most probablesequence of medical events from across all clin-ical narratives of a patient. We use unweightedFSAs to represent the input described in Section3, i.e. temporally ordered sequences of medicalevents corresponding to clinical narratives. Thiscorresponds to N1 and N2 in Figure 6.

Based on whether we want to align the se-quences purely based on coreference scores orboth coreference and temporal relation scores, thearc weights for the WFST can be determined. M c

12

is a WFST that maps input symbols from N1 tooutput symbols inN2 and is weighted by the prob-ability of coreference or no-coreference betweenmedical events across N1 and N2. The represen-tation in WFST M c+t

12 shown in Figure 7 allowsus to align N1 and N2 based on both coreferenceas well as temporal relation probabilities. TheWFST has ε transitions to accommodate insertionand deletion of medical events when combiningthe sequences. Deletions correspond to the casewhen an event in the first sequence does not mapto any event in the second sequence; similarly in-sertions correspond to the case where an event inthe second sequence does not map to any event inthe first sequence. The WFST composition opera-tion allows the outputs of one WFST to be fed tothe inputs of a second WFST or FSA. Thus, webuild our final machine by composing the threesub-machines as,

D = N1 ◦M i12 ◦N2. (1)

where i = c or i = c + t. This gives us a com-bined weighted graph by mapping the output sym-bols of the first medical event sequence to the in-put symbols of the second medical event sequence.The scores on the decoding graph are derived fromonly the coreference probabilities if i = c and bothcoreference and temporal relation probabilities ifi = c+ t.

In the medical event sequence alignment prob-lem, we want to align multiple sequences of medi-cal events that correspond to multiple clinical nar-ratives of a patient. Since we want to now combine

1002

N1

N2

M12

Figure 6: N1 and N2 are medical event sequences represented using FSAs. M c12 maps medical events

across N1 and N2 and is weighted only by the probability of coreference between events across N1 andN2.

0

cocaineuse:-/0.13hypertension:-/0.27

chestpain:-/0.23

1

-:cocaineabuse/0.1

cocaineuse:cocaineabuse/0.9

hypertension:cocaineabuse/0.4

chestpain:cocaineabuse/0.3


chestpain:-/0.23

2

-:admission/0.1

cocaineuse:admission/0.1

hypertension:admission/0.1

chestpain:admission/0.1


chestpain:-/0.23

3

-:chestpain/0.1

cocaineuse:chestpain/0.17

hypertension:chestpain/0.23

chestpain:chestpain/0.86


chestpain:-/0.23

4/4

-:myocardialinfarction/0.1

cocaineuse:myocardialinfarction/0.2

hypertension:myocardialinfarction/0.1

chestpain:myocardialinfarction/0.1


chestpain:-/0.23

Figure 7: M c+t12 is a WFST representation used for mapping medical events between N1 and N2 (from

Figure 2) and is weighted by both the coreference and temporal relation probabilities

all narrative chains belonging to the same patient,the composition cascade to build the final com-bined sequence will be as,

Df = N1◦M i12◦N2◦M i

23◦N3◦M i34...◦Nn (2)

where i = c or i = c + t and n is the numberof medical event sequences corresponding to clin-ical narratives for a patient. During compositionwe retain intermediate paths like M i

23 utilizing theability to do lazy composition (Mohri and Pereira,1998) in order to facilitate beam search throughthe multi-alignment. The best hypothesis corre-sponds to the highest scoring path which can beobtained using shortest path algorithms like Djik-stra’s algorithm. The best path corresponds to thebest alignment across all medical event sequencesbased on the joint probability of cross-narrativemedical event coreferences and temporal relationsacross the narrative sequences.

The complexity of decoding increases exponen-tially with the number of narrative sequences in

the composition, and exact decoding becomes in-feasible. One solution to this problem is to dothe alignment greedily pairwise, starting from themost recent medical event sequences, finding thebest path, and iteratively moving on to the nextsequence, and proceeding until the oldest medi-cal event sequence. The disadvantage of such amethod is that it does not take into account con-straints between medical events across multipleevent sequences and may lead to a less accuratesolution.

An alternative method is to use lazy compo-sition to perform more efficient composition asit allows practical memory usage. We also usebeam search to make for an efficient approxima-tion to the best-path computation (Mohri et al.,2005). This allows accommodating constraintsfrom across multiple sequences and generates amore accurate best path. Thus, this method gener-ates more accurate alignments when we have morethan two sequences to be aligned.

1003

For instance, instance say a, b ∈ N1, x, y ∈ N2,and m,n ∈ N3 are temporally medical event se-quences corresponding to narratives N1, N2 andN3. Based on the learned pairwise temporal rela-tions, if we have the following constraints a < x,m > x, m < a. Aligning N1 and N2 greedilypairwise may give us the best combined sequenceas a, x, b, y ∈ N12. Now in aligning N12 withN3, we won’t be able to accommodate m > x andm < a. However, performing a beam search overthe composed WFST in equation 2 allows us toaccommodate such constraints across multiple se-quences. The complexity of composing two trans-ducers is O(V1V2D1(logD2 + M2)) where eachedge from the first sequence matches every edge inthe second sequence and Vi is the number of states,Di is the maximum out-degree and Mi maximummultiplicity for the ith FST (Mohri et al., 2005).

We also use popular dynamic programming al-gorithms (Needleman et al., 1970, Smith and Wa-terman, 1981) for sequence alignment of medi-cal events across narratives and compare it to theWFST-based representation and decoding.

5.3 Pairwise Alignment using DynamicProgramming

As a contrast, we adapt two dynamic program-ming algorithms for sequence alignment: globalalignment using the Needleman Wunsch algo-rithm (NW) (Needleman et al., 1970) and localalignment using the Smith-Waterman algorithm(SW) (Smith and Waterman, 1981). NW allowsus to align all events in one sequence with allevents in another sequence. A drawback of NWis that short and highly similar sequences maybemissed because they get overweighted by the restof the sequence. NW is suitable when the two se-quences are of similar length with significant de-gree of similarity throughout. On the other hand,SW gives the longest sub-sequence pair that yieldsmaximum degree of similarity between the twooriginal sequences. It does not force all eventsin a sequence to align with another sequence.SW is useful in aligning sequences that differ inlength and have short patches of similarity. Thetime complexity of these methods for sequencesof length m and n are O(mn).

The scoring scheme described earlier is used toupdate the scoring matrix for dynamic program-ming. In order to accommodate the temporal re-lations before and after, we insert a null symbolafter every medical event in each sequence in thescoring matrix. A vertical or horizontal gap ariseswhen cases 1, 2, 3 and 4 in Section 5.1 mentioned

above are not true. If the medical events are notsimultaneous, not before or not after, the medicalevents will not align. Thus, the value of each cellin the scoring matrix is determined by computingthe maximum score at each position C(i, j) as,

max{(C(i−1, j−1)+Sij), (C(i, j−1)+w),(C(i− 1, j) + w)} (3)

where, Sij = max{P (i = j), P (i < j), P (i >j)}, and w = max{(1 − P (i = j)), (1 − P (i <j)), (1 − P (i > j))}. Here, C(i − 1, j − 1)corresponds to a match, whereas C(i, j − 1) andC(i − 1, j) correspond to a gaps in sequence oneand two.

In case of the SW algorithm, the negative scor-ing matrix cells are set to zero, thus making thepositively scoring local alignments visible. Back-tracking starts at the highest scoring matrix celland proceeds until a cell with score zero is encoun-tered, yielding the highest scoring local alignment.

The time and space complexity grows exponen-tially with the number of sequences to be alignedand finding the global optimum has been shown tobe a NP-complete problem. The time complexityof aligning N sequences of length L is O(2NLN )(Wang and Jiang, 1994). Thus, for MSA usingdynamic programming, we use a heuristic methodwhere we combine pairwise alignments iterativelystarting with the latest narrative and progressingtowards the oldest narrative.

6 Experiments and EvaluationCorpus Description. The corpus consists of adataset of clinical narratives obtained from the[redacted] medical center. The corpus has a totalof 2060 patients, and 100704 clinical narratives.We gathered a gold standard set of seven patients(80 clinical narratives overall) with manual anno-tation of all medical events mentioned in the nar-ratives, coreferences, and medical event sequenceinformation. The annotation agreement acrossannotators is high, with 89.5% agreement corre-sponding to inter-annotator Cohen’s kappa statis-tic of 0.86 (Raghavan et al., 2012b). The typesof clinical narratives included 27 discharge sum-maries, 30 history and physical reports, 15 radiol-ogy reports and 8 pathology reports. The distribu-tion of the number of medical event sequences andunique medical events across patients is shown inTable 1. The annotated dataset is used to cross-validate and train our coreference and temporal re-lation learning models and to evaluate our cross-narrative medical event timeline.

1004

p1 p2 p3 p4 p5 p6 p7No. of Narrative Sequences 5 9 20 13 8 10 15

No. of Medical events 68 90 119 82 79 72 95% Accuracy % Avg.

WFST-framework (lazy composition and beam search)[c+t] 76.1 73.2 81.2 83.5 76.4 82.5 79.7 78.9WFST-framework (Iterative pairwise)[c+t] 70.4 67.1 73.5 74.1 61.8 75.5 62.9 69.3Smith Waterman (Iterative pairwise)[c+t] 71.2 69.7 75.5 75.6 66.3 77.4 68.3 72.1

Needleman-Wunsch (Iterative pairwise)[c+t] 68.1 66.3 72.1 74.4 61.1 75.5 63.6 68.7WFST-framework (lazy composition and beam search)[c] 68.5 65.3 72.3 74.4 67.2 71.3 69.1 69.7

WFST-framework (Iterative pairwise)[c] 61.2 63.3 61.9 60.4 59.8 64.8 60.5 61.7Smith Waterman (Iterative pairwise)[c] 60.3 63.7 68.2 62.3 58.6 66.7 60.2 62.8

Needleman-Wunsch (Iterative pairwise)[c] 56.6 60.1 59.3 65.6 54.7 63.1 58.2 59.6

Table 1: The distribution of medical events across narrative sequences and sequences across patients andmultiple sequence alignment results for the WFST-based framework, and dynamic programming usingjust coreference scores [c] and using coreference as well as temporal relation scores [c+t].

Evaluation Metric. For each patient and eachmethod (WFST or dynamic programming), theoutput timeline to evaluate is the highest scoringcandidate hypothesis derived as described above.Accuracy of the timeline is calculated as the num-ber of transformations required to obtain the refer-ence sequence in the annotated gold-standard fromthe one generated by our system. Transformationsare measured in terms of the minimum edit dis-tance, insertions, deletions, and substitutions ofmedical events.

Experiments and Results. We first temporallyorder medical events within each clinical narrativeby learning to rank them in relative order of oc-curence as described in our previous work (Ragha-van et al., 2012c). The overall accuracy of rank-ing medical events using leave-one-out cross val-idation is 82.1%. The resulting medical event se-quences serve as the input to the problem of cross-narrative sequence alignment.

The cross-narrative coreference and temporalrelation pairwise classification models describedin Section 4 are trained using a Maximum en-tropy classifier. The coreference resolution per-forms with 71.5% precision and 82.3% recall. Thetemporal relation classifier performs with 60.2%precision and 76.3% recall. The learned pairwisecoreference and temporal relation probabilities arenow used to derive the score for the WFST and dy-namic programming approaches.

WFST representation and decoding. Webuild finite-state machines using the open sourceOpenFST library.2 We use a tropical semi-ringweighted using the negative log-likelihood of thecomputed scores. OpenFST provides tools thatcan search for the highest scoring sequences ac-cepted by the machine, and can sample from high-scoring sequences probabilistically, by treating the

2www.openfst.org

scores of each transition within the machine as anegative log probability. The decoding process tocompute the most likely combined medical eventsequence can be defined as searching for the bestpath in the combined graph representation (Equa-tion 2). The best path is the one that minimizesthe total weight on a path (since the arcs are neg-ative log probabilities). In searching for the bestpath, the beam size is set to 5. The accuracy ofthe WFST-based representation and beam searchacross all sequences using the coreference andtemporal relation scores to obtain the combinedaligned sequence is 78.9%.

Dynamic Programming. We use the NW andSW algorithms described in Section 5.3 to pro-duce local and global alignments respectively. Weuse the scoring scheme described in Section 5.1 toupdate the cost matrix for dynamic programmingand implement the algorithms as described in Sec-tion 5.3. The overall accuracy of sequence align-ment with both coreference and temporal relationscores using NW is 68.7% whereas SW gives anaccuracy of 72.1%. In case of aligning just twosequences, both methods yield the same results.The accuracy of cross-narrative MSA for each pa-tient, for each method, using cross validation, isshown in Table 1. Results indicate that the WFST-based method outperforms the dynamic program-ming approach for multi-sequence alignment (sta-tistical significance p<0.05). Morever, the re-sults using both coreference and temporal realtionscores for alignment outperform using only coref-erence scores for alignment using all approaches.This indicates that cross-narrative temporal rela-tions are important for accurately aligning medicalevent sequences across narratives.

7 DiscussionWe propose and evaluate different approaches tomultiple sequence alignment of medical events.

1005

Approaches to multi-alignment. We addressthe problem of aligning medical event sequencesusing a novel WFST-based framework and empiri-cally demonstrate that it outperforms pairwise pro-gressive alignment using dynamic programming.This is mainly because the WFST-based allows usto consider temporal constraints from across mul-tiple sequences when performing the alignment.

Moreover, it also outperforms the integer lin-ear programming (ILP) method for timeline con-struction proposed in (Do et al., 2012). We im-plemented the proposed method that also allowscombining the output of classifiers subject to someconstraints. We derive intervals from event startsand stops and learn two perceptron classifiers forclassifying the temporal relations between eventsand assigning events to intervals. The classifierprobabilities are then used to solve the optimiza-tion problem using the lpsolve solver.3 We alsouse intra-document coreference information to re-solve coreference before performing the global op-timization. We observe that in case of MSA, theoptimal solution using ILP is still intractable asthe number of constraints increases exponentiallywith the number of sequences. Aligning pair-wise iteratively gives us an overall average accu-racy of 68.2% similar to dynamic programming.While this is comparable to the dynamic pro-gramming performance, the WFST-based methodsignificantly outperforms this in case of multi-alignments for cross-narrative temporal ordering.

Performance and error analysis. We performmulti-alignments over medical event sequencesfor a patient, where each sequence correspondsto temporally ordered medical events in a clinicalnarrative generated using the ranking model de-scribed in (Raghavan et al., 2012c). The accuracyof intra-narrative temporal ordering is 82.1%. Theerrors in performing this intra-narrative orderingmay propagate to the cross-narrative model result-ing in reduced accuracy. This may be addressedby considering n-best temporally ordered medi-cal event sequences, generated by the ranking pro-cess, and aligning the n-best sequences using theWFST-based framework. This could be feasibleas, practically, the WFST-based method for multi-alignment takes only a few secs to align a pair ofmedical event sequences with average length 40.

The accuracy of alignments across multiplemedical event sequences is also affected by the er-ror induced by the coreference and temporal rela-tion scores. Often, insufficient temporal cues leads

3http://lpsolve.sourceforge.net/5.5/

to misclassification of events incorrectly as shar-ing the “simultaneous” temporal relation and oftenas coreferring. This induces errors in the score cal-culation and hence the alignments. Better meth-ods to address the challenging problem of cross-document temporal relation learning, perhaps withthe help of structured data from the patient record,could improve the accuracy of alignments.

There is no clear trend with respect to the num-ber of medical events and narratives for a patient(Table 1.), and the alignment accuracy. In fu-ture work, it would be interesting to examine anysuch correlation and also study the scalability ofthe WFST-based method for sequence alignmenton longer medical event sequences and a largerdataset of patients. Further, the WFST-basedmethod may be used to model multi-alignmenttasks in other speech and language problems aswell.

8 Conclusion

We propose a novel framework for aligning med-ical event sequences across clinical narrativesbased on coreference and temporal relation infor-mation using cascaded WFSTs. FSTs provide aconvenient and flexible framework to model se-quences of temporally ordered medical events andcompose them into a combined graph represen-tation. Decoding this graph allows us to jointlymaximize coreference as well as temporal relationprobabilities to derive a timeline of the most likelytemporal ordering of medical events. This ap-proach to aligning multiple sequences of medicalevents significantly outperforms other approachessuch as dynamic programming. Moreover, wedemonstrate the importance of learning tempo-ral relations for the task timeline generation fromacross multiple clinical narratives by empiricallyproving that decoding using both coreference andtemporal relation scores is far more accurate thandecoding with only coreference scores.

Acknowledgments

The project was supported by Award NumberGrant R01LM011116 from the National Libraryof Medicine. The content is solely the responsibil-ity of the authors and does not necessarily repre-sent the official views of the National Library ofMedicine or the National Institutes of Health. Theauthors would like to thank Yanzhang He for hisinput on the WFST-based model.

1006

ReferencesJames F. Allen. 1981. An interval-based representa-

tion of temporal knowledge. In IJCAI, pages 221–226.

Regina Barzilay and Kathleen R. McKeown. 2005.Sentence fusion for multidocument news sum-marization. Comput. Linguist., 31(3):297–328,September.

Regina Barzilay, Noemie Elhadad, and Kathleen McK-eown. 2002. Inferring strategies for sentence or-dering in multidocument summarization. Journal ofArtificial Intelligence Research (JAIR), 17:35–55.

Danushka Bollegala, Naoaki Okazaki, and MitsuruIshizuka. 2010. A bottom-up approach to sentenceordering for multi-document summarization. Infor-mation processing & management, 46(1):89–109.

Philip Bramsen, Pawan Deshpande, Yoong Keok Lee,and Regina Barzilay. 2006. Inducing temporalgraphs. In Proceedings of the 2006 Conference onEmpirical Methods in Natural Language Process-ing, EMNLP ’06, pages 189–198.

Nathanael Chambers and Dan Jurafsky. 2009. Unsu-pervised learning of narrative schemas and their par-ticipants. In ACL/AFNLP, pages 602–610.

Raphael Cohen, Michael Elhadad, and Noemie El-hadad. 2013. Redundancy in electronic healthrecord corpora: analysis, impact on text mining per-formance and mitigation strategies. BMC bioinfor-matics, 14(1):10.

Dina Demner-Fushman, Wendy Webber Chapman, andClement J. McDonald. 2009. What can natural lan-guage processing do for clinical decision support?Journal of Biomedical Informatics, 42(5):760–772.

Quang Xuan Do, Wei Lu, and Dan Roth. 2012. Jointinference for event timeline construction. In Pro-ceedings of the 2012 Joint Conference on EmpiricalMethods in Natural Language Processing and Com-putational Natural Language Learning, EMNLP-CoNLL ’12, pages 677–687. Association for Com-putational Linguistics.

Heng Ji and Ralph Grishman. 2008. Refining eventextraction through cross-document inference. In As-sociation for Computational Linguistics.

Cvetana Krstev, Dusko Vitas, Ivan Obradovic, andMilos Utvic. 2011. E-dictionaries and Finite-stateautomata for the recognition of named entities. InProceedings of the 9th International Workshop onFinite State Methods and Natural Language Pro-cessing, pages 48–56.

Shankar Kumar and William Byrne. 2003. A weightedfinite state transducer implementation of the align-ment template model for statistical machine trans-lation. In Proceedings of the 2003 Conferenceof the North American Chapter of the Association

for Computational Linguistics on Human LanguageTechnology - Volume 1, pages 63–70.

V Finley Lacatusu, Steven J Maiorano, and Sanda MHarabagiu. 2004. Multi-document summarizationusing multiple-sequence alignment. In LREC.

Mirella Lapata and Alex Lascarides. 2011. Learn-ing sentence-internal temporal relations. CoRR,abs/1110.1394.

Mirella Lapata. 2003. Probabilistic text structuring:Experiments with sentence ordering. In Proceed-ings of the 41st Annual Meeting on Association forComputational Linguistics-Volume 1, pages 545–552. Association for Computational Linguistics.

Zhihui Luo, Stephen B. Johnson, Albert M. Lai, andChunhua Weng. 2011. Extracting temporal con-straints from clinical research eligibility criteria us-ing conditional random fields. In Proc of AMIASymposium.

Inderjeet Mani, Marc Verhagen, Ben Wellner,Chong Min Lee, and James Pustejovsky. 2006.Machine learning of temporal relations. In ACL.

Mehryar Mohri and Fernando CN Pereira. 1998. Dy-namic compilation of weighted context-free gram-mars. In Proceedings of the 36th Annual Meet-ing of the Association for Computational Linguis-tics and 17th International Conference on Compu-tational Linguistics-Volume 2, pages 891–897. As-sociation for Computational Linguistics.

Mehryar Mohri, Fernando C. N. Pereira, and MichaelRiley. 2000. The design principles of a weightedfinite-state transducer library. Theoretical ComputerScience, 231(1):17–32.

Mehryar Mohri, Fernando Pereira, and Michael Riley.2005. Weighted automata in text and speech pro-cessing. CoRR, abs/cs/0503077.

S.B. Needleman, C.D. Wunsch, et al. 1970. A generalmethod applicable to the search for similarities inthe amino acid sequence of two proteins. Journal ofmolecular biology, 48(3):443–453.

Cedric Notredame. 2002. Recent progress in multiplesequence alignment: a survey. Pharmacogenomics,3(1):131–144.

Preethi Raghavan, Eric Fosler-Lussier, and Albert M.Lai. 2012a. Exploring semi-supervised coreferenceresolution of medical concepts using semantic andtemporal features. In North American Associationfor Computational Linguistics Annual Meeting - Hu-man Language Technologies Conference. Associa-tion for Computational Linguistics.

Preethi Raghavan, Eric Fosler-Lussier, and Albert M.Lai. 2012b. Inter-annotator reliability of medi-cal events, coreferences and temporal relations inclinical narratives by annotators with varying levelsof clinical expertise. In To appear in Proceedings

1007

of the American Medical Informatics Association.American Medical Informatics Association.

Preethi Raghavan, Eric Fosler-Lussier, and Albert M.Lai. 2012c. Learning to temporally order medicalevents in clinical text. In ACL short paper. Associa-tion for Computational Linguistics.

Daniel Reichert, David Kaufman, Benjamin Bloxham,Herbert Chase, and Noemie Elhadad. 2010. Cog-nitive analysis of the summarization of longitudinalpatient records. In AMIA Annual Symposium Pro-ceedings, volume 2010, page 667. American Medi-cal Informatics Association.

A. Roberts, R. Gaizauskas, M. Hepple, G. Demetriou,Y. Guo, and A. Setzer. 2008. Semantic Annotationof Clinical Text: The CLEF Corpus. In Proceedingsof the LREC 2008 Workshop on Building and Eval-uating Resources for Biomedical Text Mining, pages19–26.

T.F. Smith and M.S. Waterman. 1981. Identifica-tion of common molecular subsequences. Journalof molecular biology, 147(1).

Richard Sproat. 2006. A Computational Theory ofWriting Systems (Studies in Natural Language Pro-cessing). Cambridge University Press.

Marc Verhagen, Robert J. Gaizauskas, Frank Schilder,Mark Hepple, Jessica Moszkowicz, and JamesPustejovsky. 2009. The tempeval challenge: iden-tifying temporal relations in text. Language Re-sources and Evaluation, 43(2):161–179.

Lusheng Wang and Tao Jiang. 1994. On the complex-ity of multiple sequence alignment. Journal of com-putational biology, 1(4):337–348.

Christopher Whelan, Brian Roark, and Kemal Son-mez. 2010. Designing antimicrobial peptides withweighted finite-state transducers. In Proceedings ofIEEE Engineering in Medical Biology Society, page764.

David Yarowsky. 1995. Unsupervised word sense dis-ambiguation rivaling supervised methods. In Asso-ciation for Computational Linguistics, pages 189–196.

Li Zhou, Genevieve B. Melton, Simon Parsons, andGeorge Hripcsak. 2006. A temporal constraintstructure for extracting temporal information fromclinical narrative. Journal of Biomedical Informat-ics, pages 424–439.

1008

Documents

Cross-narrative Temporal Ordering of Medical Eventspeople.dbmi.columbia.edu/noemie/papers/acl14.pdfclinical narratives. We present results us-ing both approaches and observe that the