Clinical Concept Extraction: a Methodology Review · 2020. 8. 12. · clinical research and healthcare delivery (1). To achieve the goal of “meaningful use”, transforming routinely

1

Clinical Concept Extraction: a Methodology Review

Sunyang Fuac, David Chena, Huan Hea, Sijia Liua, Sungrim Moona, Kevin J Petersonbc, Feichen

Shena, Liwei Wanga, Yanshan Wanga, Andrew Wena, Yiqing Zhaoa, Sunghwan Sohna, Hongfang

Liuac

aDepartment of Health Sciences Research, Mayo Clinic, 200 First Street SW, Rochester, MN

55905 bDepartment of Information Technology, Mayo Clinic, 200 First Street SW, Rochester, MN 55905 cUniversity of Minnesota – Twin Cities, Minneapolis, MN 55455

Email Addresses: Sunyang Fu: [email protected] David Chen: [email protected] Huan He: [email protected] Sijia Liu: [email protected] Sungrim Moon: [email protected] Kevin J Peterson: [email protected] Feichen Shen: [email protected] Liwei Wang: [email protected] Yanshan Wang: [email protected] Andrew Wen: [email protected] Yiqing Zhao: [email protected] Sunghwan Sohn: [email protected] Hongfang Liu: [email protected]

Corresponding Author: Dr. Hongfang Liu, Division of Digital Health Sciences, Mayo Clinic,

200 First St SW, Rochester, MN 55905 (email: [email protected]; phone: 507-293-0057)

Word count: 7346

Structured Abstract: 163

Tables: 5

Figures: 6

2

ABSTRACT

Background

Concept extraction, a subdomain of natural language processing (NLP) with a focus on extracting

concepts of interest, has been adopted to computationally extract clinical information from text for

a wide range of applications ranging from clinical decision support to care quality improvement.

Objectives

In this literature review, we provide a methodology review of clinical concept extraction, aiming

to catalog development processes, available methods and tools, and specific considerations when

developing clinical concept extraction applications.

Methods

Based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)

guidelines, a literature search was conducted for retrieving EHR-based information extraction

articles written in English and published from January 2009 through June 2019 from Ovid

MEDLINE In-Process & Other Non-Indexed Citations, Ovid MEDLINE, Ovid EMBASE, Scopus,

Web of Science, and the ACM Digital Library.

Results

A total of 6,686 publications were retrieved. After title and abstract screening, 228 publications

were selected. The methods used for developing clinical concept extraction applications were

discussed in this review.

Keywords: concept extraction, natural language processing, information extraction, electronic

health records, machine learning, deep learning

3

Graphical abstract

4

1. Introduction

Electronic health records (EHRs) have been widely viewed to have great potential to advance

clinical research and healthcare delivery (1). To achieve the goal of “meaningful use”,

transforming routinely generated EHR data into actionable knowledge requires systematic

approaches (2). However, a significant portion of clinical information remains locked away in free

text (3). The solution to this is to replace these manual methods with autonomous computational

extraction. The field of research related to this topic, commonly referred to as information

extraction (IE), a subdomain of natural language processing (NLP), seeks to address the challenges

of computationally extracting information from free-text narratives (4, 5). Specifically, we define

the process of automatically extracting pre-defined clinical concepts from unstructured text as

clinical concept extraction, which includes concept mention detection and concept encoding.

Concept mention detection generally adopts the named entity recognition (NER) (6, 7) technology

in the general domain, which focuses on detecting concept mentions in the text. Concept encoding

aims to map the mentions to concepts in standard terminologies or those defined by downstreaming

applications (5, 8, 9). Concept extraction has been adopted to extract clinical information from text

for a wide range of applications ranging from supporting clinical decision making (3) to improving

the quality of care (10). A review done by Meystre et al. in 2008 observed an increasing utilization

of NLP in the clinical domain and a major challenge in advancing clinical NLP due to the

unavailability of a large amount of clinical text (11). A summary of EHR-based clinical concept

extraction applications can be found in Wang et al (5).

For illustrative purposes, an example of a typical clinical concept extraction task is presented in

Figure 1, where the goal is to identify patients with silent brain infarcts (SBIs) from neuroimaging

reports for stroke research. Silent brain infarcts, defined as old infarcts found through

neuroimaging exams (CT or MRI) without a history of stroke, have been considered a major

priority for new studies on stroke prevention by the American Heart Association and American

Stroke Association. However, identification of SBIs is significantly impeded since discovery of

them is an incidental finding: there are no International Classification of Diseases (ICD) codes for

SBI, and it is generally not included in a patient’s problem list or any structured EHR field.

Concept extraction is therefore necessary for the identification of SBI-related findings from

neuroimaging reports. We can infer a case of SBI based on acuity (chronic), location (left basal

5

ganglia lacunar), and the number of infarcts (one) from the sentence “A chronic lacunar infarct is

also noted”.

Figure 1. An example of using concept extraction for stroke research.

Methods for developing clinical concept extraction applications have been largely translated from

the general NLP domain (4), and can typically be stratified into rule-based approaches and

statistical approaches with four categories: rule-based, traditional machine learning (non-deep-

learning variants), deep learning, or hybrid approaches. For example, an early attempt of clinical

concept extraction, the Medical Language Processing project, was adapted from the Linguistic

String Project aiming to extract symptoms, medications, and possible side effects from medical

records (12, 13) leveraging a semantic lexicon and a large collection of rules. The rise of statistical

NLP in the 1990s (14) and recent advances in deep learning technologies (15-17) have influenced

the methods adopted for clinical concept extraction. Despite these advances, methods for clinical

concept extraction are generally buried within the methods section of the literature due to the

complexity and heterogeneity associated with EHR data and the diverse range of applications. No

singular method has proven to be globally effective.

Here, we provide a review of the methodologies behind clinical concept extraction, cataloguing

development processes, available methods and tools, and specific considerations when developing

clinical concept extraction applications. This systematic review of the concept extraction task aims

Template hierarchy for Silent Brain Infarctiona. Acuity

a. acute/subacuteb. chronicc. bothd. not specified

b. Locationa. lacunar/subcorticalb. cortical/juxtacorticalc. bothd. not specified

c. Numbera. oneb. two or morec. not specified

d. Negationa. yesb. no

Patient ID: 123123

Radiology Report ID: 321321

Documentation Date: 06/01/2020

Findings: Severe chronic microvasculardegenerative change. Periventricular deep whitematter most likely related to small vessel ischemicdisease. A chronic lacunar infarct is also noted.Otherwise the visualized brain is unremarkable.Small amount of fluid within the mastoid air cells.

6

to address the following questions: 1) what are the keys steps involved in the development of a

clinical concept extraction application; 2) what are the trends and associations of clinical concept

extraction research over different approaches; and 3) what are the unresolved barriers, challenges

and future directions.

2. Method

This review was conducted following a process compliant with the Preferred Reporting Items for

Systematic Reviews and Meta-Analyses (PRISMA) guidelines (18). A literature search was

conducted, retrieving EHR-based concept extraction articles that were written in English and

published from January 2009 through June 2019. Literature databases surveyed included Ovid

MEDLINE In-Process & Other Non-Indexed Citations, Ovid MEDLINE, Ovid EMBASE, Scopus,

Web of Science, and the ACM Digital Library. The implementations of search patterns were

consistent across the different databases. The search query was designed and implemented by an

experienced librarian (LJP) as: (“clinic” or “clinical” or “electronic health record” or “electronic

health records” or “electronic medical record” or “electronic medical records” or “electronic

patient record” or “electronic patient records” or “EHR" or “EMR” or “EPR” or “ATR”) AND

(“information extraction” OR “concept extraction” OR “named entity extraction” OR “named

entity recognition” OR “text mining” OR “natural language processing”) AND (NOT “information

retrieval”). A detailed description of the search strategies used is provided in the Appendix.

A total of 10,441 articles were retrieved from five libraries, of which 6,686 articles were found to

be unique. The articles were then filtered manually based on the title, abstract, and method sections

to keep articles with EHR-based clinical information extraction from English text. After this

screening process, 928 articles were considered for subsequent categorization. We conducted

additional manual review to keep articles with methodology description focusing on clinical

concept extraction. The final inclusion criteria for the target papers are as follows 1) using concept

extraction methods, 2) applied to EHR data in English, 3) providing a methodological contribution

via: a) presenting novel methods for clinical concept extraction, including introducing a new

model, data processing framework, NLP pipeline, etc., or b) applying existing methods to a new

domain or task. Articles without full text or methology description were excluded. Following this

7

screening process, 228 articles were selected and categorized based on the methods used. A

comprehensive full-text review of all 228 studies was performed by the study team. During the

full-text review, the following information were extracted manually: method type (e.g. rule based,

machine learning), domain (e.g. disease study area), sub-domain, data type, performance (e.g. f1-

score and AUC), and overall summary of methology. A flow chart of this article selection process

is shown in Figure 2.

Figure 2. Overview of article selection process.

3. Results

Figure 3 shows the number of articles and associated methods available in our surveyed literature

from January 2009 to June 2019. An upward trend in published clinical concept extraction research

is observed, suggesting a strong need for harvesting information and knowledge from EHRs to

support various clinical, operational, and research activities.

Total 10,441 from January 2009 to June 2019

2,437 Ovid MEDLINE(R) In-Process & Other Non-Indexed Citations and Ovid MEDLINE(R)3,137 Ovid Embase1,241 Scopus1,481 Web of Science2,145 ACM Digital Library

Total 6,686 after deduplication

Total 928 after Title, Abstract and Method ScreeningExclusion criteria: • Non-clinical data (English)• Non-information extraction• Duplication

Total 228 after Full-text ScreeningExclusion criteria: • Abstract• No methodology description

19 Deep

learning

109Rule-based

51 Hybrid

49 Traditional

machine learning

8

Figure 3. Trend view of clinical concept extraction research over different approaches.

A comprehensive mapping of each method to its associated application was illustrated in Figure 4

through a sunburst plot where the targets for concept extraction were categorized into four primary

domains (i.e., diseases, drugs, clinical workflows, and social determinants of health) based on the

focus of collected studies and clinical perspectives. The four domains were based on the focus of

collected studies (frequency), clinical perspectives (importance) and the past clinical NLP shared

tasks. We subsequently classified the four domains to more detailed sub-domains. The “Diseases”

sub-domain was classified based on the ICD-9 classification system (19) given its stability and

wide coverage in clinical practice. The sub-domain for clinical workflow optimization, drug and

social behavior-related studies were determined by clinical perspectives and the uniqueness of the

category.

Among the 228 surveyed articles, rule-based approaches were the most widely used approach for

concept extraction (48%), followed by hybrid (22%), traditional machine learning (22%), and deep

learning (8%). Despite the overall low utilization rate, adoption of deep learning techniques has

increased drastically since 2017. We also observed that the three disease sub-domains with the

highest coverage were (i) diseases of the circulatory system, neoplasms, and endocrine, (ii)

nutritional and (iii) metabolic diseases, and immunity disorders. Six disease areas have the lowest

coverage: (i) certain conditions originating in the perinatal period, (ii) complications of pregnancy,

childbirth, and the puerperium, (iii) congenital anomalies, (iv) diseases of the skin and

subcutaneous tissue, (v) diseases of the blood and blood-forming organs, and (vi) injury and

9

poisoning. The popularity of the sub-domains was contributed to by many reasons. Particularly,

we found shared-tasks can promote research activities in specific sub-domains. In addition, the

disease sub-domains that cannot be addressed by disease classification codes alone were more

likely to adopt concept extraction techniques. For example, incidental findings such as

leukoaraiosis and asymptomatic lesions were reported to adopt concept extraction when ICD codes

were not available (20, 21).

Under the clinical workflow domain, we observed that measurement, patient identification, and

quality were broadly studied. The clinical note was the most utilized data type (78%), followed by

radiology reports (13%) and pathology reports (3%). On the other hand, microbiology reports,

claims, and operative reports were the least utilized data type.

Figure 4. Overview of the method utilization.

Based on the performance of past shared tasks and evaluation on benchmarking datasets, we

summarized the state of the art performance over each domain as of June 2019. The contextual

pre-trained language models, such as Bidirectional Encoder Representations from Transformers

10

(BERT), leveraged the bidirectional training of transformers, an attention mechanism that learns

contextual relations between words, resulting in state-of-the-art results in six concept extraction

tasks. Hybrid and traditional machine learning achieved the best performance on two tasks. In

Table 1 we summarize a list of past shared tasks with the state-of-the-art methods.

Table 1. Benchmark of clinical concept extraction tasks

Domain Task Description F-

measure

Method Model

Clinical workflow

optimization

Automatic de-identification and

identification of medical risk

factors related to coronary artery

disease in the narratives of

longitudinal medical records of

diabetic patients

93.0 Deep

learning

BioBERT

(22)

i2b2 2006 1B de-identification:

Automatic de-identification of

personal health information

94.6 Deep

learning

BioBERT

(22)

Drug-related

studies

i2b2 2010/VA : Medical problem

extraction

90.3 Deep

learning

BERT-large

(23)

i2b2 2009 medication challenge:

Identification of medications,

dosages, routes, frequencies,

durations, and reasons

85.7 Hybrid CRF, SVM,

Context

Engine(24)

Disease study area ShARe/CLEFE 2013: Named

entity recognition in clinical notes

77.1 Deep

learning

BERT-base

(P+M) (25)

SemEval 2014 Task 7:

Identification and normalization

of diseases and disorders in

clinical reports

80.7 Deep

learning

BERT-large

(23)

SemEval 2014 Task 14: Named

entity recognition and template

slot filling for clinical texts

81.7 Deep

learning

BERT-large

(23)

11

Social determinants

of health

I2B2 2006: Smoking

identification challenge

90.0 Machine

learning

SVM (26)

The steps involved in the development of a clinical concept extraction application can be divided

into three key components: 1) task formulation, 2) model development, and 3) experiment and

evaluation (Figure 5). In the following subsections, we provide a detailed review of methods used

for each of them. In section 3.1, we describe how to formulate a task in two different settings.

Section 3.2 provides an overview of the approaches and summarizes the features for model

development and the specific methods for concept extraction. Section 3.3 discusses methods for

performing experiment and evaluation.

Figure 5. The development process of clinical concept extraction applications.

3.1 Task formulation

All studies were categorized into one of two experimental settings: shared-task and practice. The

shared-task setting is defined as either participation in or usage of resources produced from a

shared-task to conduct concept extraction-related research. The practice setting (e.g. General

Internal Medicine, Orthopaedic Medicine) involves direct use of the EHR for concept extraction

in real-world scenarios. Among a total of 228 articles, 63 (28%) were categorized as belonging to

a shared task setting and 165 (72%) were categorized as belonging to a practice setting. The overall

prevalence proportions for rule-based, hybrid, traditional machine learning, and deep learning

were 35%, 32%, 18%, and 15% respectively in the shared task setting and 51%, 20%, 23%, and

6% respectively in the practice setting. A comparison between the two settings revealed a

substantial difference in relative usage of statistical approaches and rule-based approaches: the

practice setting had a much higher usage of rule-based approaches whereas deep learning and other

statistical approaches had a higher prevalence of usage in the shared task setting.

12

Per the first author affiliations from 156 articles indexed in PubMed, 87 (56%) were affiliated with

an academic institution, 50 (32%) with a medical center, 9 (6%) with an industry-based research

institution, and 6 (4%) with a technology company. The remaining 3 authors (2%) were affiliated

with a healthcare company, a medical library, and a pharmaceutical company.

3.1.1 Shared-task setting - Shared-tasks have successfully engaged NLP researchers in the

advancement, adoption, and dissemination of novel NLP methods. Furthermore, because shared-

task corpora are usually made accessible with well-defined evaluation mechanisms and public

availability, they are usually regarded as standard benchmarks. Well-known tasks focusing on

clinical concept extraction include the Informatics for Integrating Biology and the Bedside (i2b2)

challenges (27-31), the Conference and Labs of the Evaluation Forum (CLEF) eHealth challenges

(32, 33), the Semantic Evaluation (SemEval) challenges (32-34), BioCreative/OHNLP (35-38),

and the National NLP Clinical Challenge (n2c2) (39).

The winning system in many past shared tasks typically used a hybrid approach, combining

supervised traditional machine learning or deep learning for concept extraction with feature

representation trained through unsupervised learning algorithms. The current state-of-the-art

models use the long short-term memory (LSTM) (40) variant of bidirectional recurrent neural

networks (BiRNN) with a subsequent conditional random field (CRF) decoding layer (15, 22, 23,

41). The convolution neural network (CNN)-based network, such as Gate-Relation Network

(GRN), also performs well in tasks requiring long-term context information (42).

Recently, contextual pre-trained language models, such as Bidirectional Encoder Representations

from Transformers (BERT), leverage bidirectional training of transformers, an attention

mechanism that learns contextual relations between words, resulting in state-of-the-art results in

the concept extraction task (15) (22) (23) (25).

3.1.2 Practice setting - Compared with the shared-task setting, the development of concept

extraction applications in practice settings is more variant, costly, and time consuming. Tasks in

practice settings can be much more specialized or ill-defined depending on the disease and use-

case. Furthermore, there is an additional need for study teams to create the fully-annotated corpora

themselves (whereas the corpora are provided in shared tasks), which can be expensive and time

consuming. However, it is a necessary step to ensure rigorous evaluation of any developed

technologies.

13

Figure 6. Data collection and corpus annotation process.

There are some variations in the process of creating gold standard corpora in practice settings due

to the heterogeneity of data caused by variations in institutional processes and clinical workflows.

Based on our review, we determined that the typical tasks involved in this process included

annotation task formulation, data collection and cohort screening, annotation guideline

development, annotation training, annotation production, and corpus adjudication (43-46) (Figure

6). Formulation of a task in a practice setting involved definition of points of interest (see Table 3

- Description), execution of a literature review, consultation with domain experts, and

identification of study stakeholders such as abstractors and annotators with specialized knowledge.

This step was then followed by definition and/or creation of a study population or cohort. For

example, a concept extraction application of geriatric syndromes was developed based on the

cohort of 18,341 people who were 1) 65 years or older, 2) received health insurance coverage

between 2011 and 2013 and 3) were enrolled in a regional Medicare Advantage Health

Maintenance Organization (47). Once the cohort definition was finalized, data was screened and

retrieved. Studies have found the usefulness of leveraging open-source informatics technologies

such as i2b2 or customized application programming interface and SQL queries for automatic

screening and retrieval (48). Subsequent to dataset definition and creation, development of a

detailed annotation guideline specifying the common conventions and standards was necessary,

ensuring that definitions created are scientifically valid and robust. Notably, this step involves

14

prototyping a baseline guideline, performing trial annotation runs, calculating inter-annotator

agreement (IAA), and consensus building.

Determining the appropriate number of annotators and the size of corpus for annotation is critical

and often challenged by the resources available. In general, the process requires at least two

annotators with (clinical) domain expertise to independently perform the annotation (49-51).

Having only one annotator in the study is not recommended since data validity and reliability

cannot be measured and ensured (51). For multi-site studies, the process requires at least two

annotators from both sides in order to help estimate inter-institutional and intra-institutional

variation (52). Based on the analysis of 18 articles with exclusive discussion of gold standard

development process, 4 (22%) studies used a single annotator, 10 (56%) studies reported of using

two annotators and one adjudicator, 3 (17%) studies reported of using multiple annotator but did

not specify the number, and 1 (6%) study with no mention of annotator. The median, minimal and

maximal number of annotated clinical documents were 251, 100 and 8,321 respectively. Most

studies choose 200 to 600 documents as the study data size. Among these, 30% to 60% were

randomly sampled for double annotation and IAA assessment. Process iteration was used to save

the annotation cost and increase effectiveness. For example, Mayer et al reported of having two

annotators to perform the initial annotation on 15 documents for training and consensus

development. During the second iteration, another 30 new documents were applied and IAA was

calculated. The process was repeated until the IAA reached to 0.85 (53).

Proper training and education were utilized to reduce practice inconsistency and ensure a shared

understanding of study design, objectives, concept definition, and informatics tools. The training

materials generated included an initial annotation guideline walkthrough; demos and instructions

on how to download, install, and use the annotation software; case studies; and practice

annotations. Annotation production is typically organized into several iterations with a significant

amount of overlap in the data annotated by each individual annotator to ensure the ability to

determine IAA. Finally, in cases where annotators disagree, conflicts were adjudicated by subject

matter experts. All the issues encountered during the gold standard creation process were

documented. We encourage interested readers to read articles by Albright et al. (49) and South et

al. (54) for more information on the standard annotation process for concept extraction.

3.2. Model Development

15

This section provides an overview of the specific approaches for model development and

summarizes features used for concept extraction. Based on our review, there are five model

development approaches which are summarized in Table 2 with relevant references.

Table 2. Comparison of model develoment approaches.

Processes Examples

A. Rule-based

(55-69)

B. Traditional machine learning

(70-78)

C. Deep learning

(79-85)

D. Terminal hybrid

(86-97)

E. Supplemental hybrid (98-105)

16

Designing computer systems to process text requires preparing the text in such a way that a

computer can understand and perform operations on it. We broadly break these features down into

categories of linguistic features, domain knowledge features, statistical features, and general

document features, as summarized in Table 3. Features such as part of speech tagging, bag-of-

words, and language models are used to give models more information on how to complete its

task. Different approaches also tend to use different features. For example, the top three frequently

used features for traditional machine learning and hybrid approaches are lexicon features (24%),

syntactic features (20%), and ontologies (13%). The most intuitive text representation for the

machine learning approach is a naïve one-hot encoding of the entire dictionary of words. Such a

data representation presents several problems. First, there are over 180,000 words in the English

dictionary, not including domain specific vocabulary (106). The vector space that the words are

mapped to is both highly dimensional and extremely sparse, leading to inefficient and expensive

computation. Second, such a representation assumes each “variable” is independent of the others,

removing any semantic and synthetic features which may be beneficial for concept extraction

tasks. For example, “cardiac” and “heart” would be represented by two independent variables, and

therefore any concept extraction system could not easily learn their semantic similarities.

Consequently, it is common to engineer data representations which encode semantic and syntactic

similarities. Many rule-based systems are dependent on explicitly created features which tend to

be more interpretable to humans. Alternatively, the latest techniques in deep learning create

features which are often abstract (e.g. word embeddings or language models) and implicitly encode

the semantic or synthetic features. Such features are very difficult to use by rule-based systems,

but can greatly improve deep learning systems by reducing the dimensionality of the search space.

17

Table 1. An overview of features used for concept extraction.

Feature Categories

Feature Types

Description Features Examples

Linguistic features

Lexicon-based feature

A dictionary or the vocabulary of a language

Bag-of-words (BoW); “periprosthetic, joint, infection” (71, 77, 107)

Orthographic features

A set of conventions for writing a language

Spelling; capitalization; hyphenation; punctuation; special characters

“igA”; “Pauci-immune vasculitis”; “BRCA1/2” (71, 108)

Morphologic feature

Structure and formation of words

Prefix; suffix “Omni” is the prefix of “Omnipaque” (91)

Syntactic feature

Syntactic patterns presented in the text

Part-of-Speech, constituency parsing, dependency parsing, sequential representation (e.g. Inside–outside–beginning (IOB), B-beginning, I-intermediate, E-end, S-single word entity and O-outside (BIESO))

“Communicating [verb] by [preposition] sinus tract [noun]” “Patient [O] has [O] an [O] acute [B] infarct [I] in [I] the [I] right [I] frontal [I] lobe [E]” (108, 109)

Semantic feature

Semantic patterns presented in the text (whether a word is semantically related to the target words)

Synonym; hyponym/hypernym (etc.), frame semantics

“Loosing” is semantically related to “subsidence”, “lucency”, and “radiolucent lines” (68, 71)

Domain knowledge features

Conceptual feature

Semantic categories and relationships of words

Classifications and taxonomies, thesauri, ontologies:

ICD9/10, NCI Thesaurus, UMLS, UMLS Semantic Network, use case-specific code systems or controlled terminologies (71)

Statistical features

Graphic feature

Node and edge for each word in a document

Graph-of-Words (GOW) Bi-directional representation: “grade” - “3”, “3” - “grade” (110)

18

Statistical corpus features

Features generated through basic statistical methods

Word length, TF-IDF, semantic similarity, distributional semantics, co-occurrence

“word1: parameter1, word2: parameter2” (70, 71)

Vector-based representation of text

One-hot encoding Word embedding Sentence encoding Paragraph encoding Document encoding

Word2vec/doc2vec, BERT, one-hot character-level encoding.

Vector representations of text (81, 82, 84)

General document features

Pattern and rule-based feature

A label for a note if certain rules satisfied

Logic (if–then) rules and expert systems

If “Metal” and “Polyethylene” then “Metal-on-Polyethylene bearing surfaces” (58, 111)

Contextual feature

Context information Negated; status; hypothetical; experienced by someone other than the patient

“Patient [experiencer] does not [negation] have history of [status] infection” (58, 59)

Document structural feature

Structural and organizational patterns presented in the text and document

Section information; indentation; semi-structured information

“Family History [section]: CVD Diabetes Hypertension” (71, 112)

19

Medical concept normalization is an essential process in order to take advantage of rich clinical

information from large and complex free-form text. Medical concept normalization aims at

mapping medical mentions to a corresponding medical concept in standardized ontologies such as

Unified Medical Language System (UMLS) or controlled vocabularies such as SNOMED CT. The

well-established clinical NLP tools such as MedLEE (113), MetaMap (114), cTAKES (115) and

MedTagger (116) can achieve normalization using diverse approaches like a dictionary-lookup

method and rule-based methods applying pattern recognition. In-depth studies explored specific

medical concept normalization. For example, DNorm proposed a machine learning approach

(pairwise learning-to-rank) to normalize mentions within a pre-defined lexicon (117). Hao et al.

used heuristic rule-based approach to extract temporal expression and normalized them using the

TIMEX standard (118). Lin et al. Proposed the strategy of contextual alias registry (CAR) and

Chronological Order of Temporal Expressions (COTE) by focusing on domain-specific contextual

alias dates and chronological order of clinical notes for temporal normalization (119). However,

these mentioned-methods are not yet sufficient to autonomously deliver a comprehensive

information of patient condition due to the complexity of clinical information and variations within

medical domains (120). In sections 3.2.1. – 3.2.4., we are going to discuss the specific methods of

applying feature extraction and normailization techniques for concept extraction.

Rule-based approach

Symbolic rule-based concept extraction methodologies use a comprehensive set of rules and

keyword-based features to identify pre-defined patterns in text (58, 121). The rule-based approach

has been adopted in many clinical applications due to their transparency and tractability, i.e.

effectiveness of implementing domain specific knowledge. One particular advantage of using rule-

based approaches is that the solution provides reliable results in a timely and low-cost manner,

given the benefit of not needing to manually annotate a large amount of training examples (4).

Based on specific tasks, the combination of rules and well-curated dictionaries can result in

promising performance. Many tasks have used rule-based matching methods to varying levels of

success (60, 122, 123). For example, in the 2014 i2b2/UTHealth de-identification challenge, the

top four performing teams (including the winning team) used rule-based methodologies (124). In

previous i2b2 challenges, the 2009 medication challenge reported 10 rule-based systems in the top

20 systems (30). In the i2b2/UTHealth Cardiac risk factors challenge, Cormack et al. demonstrated

20

that a pattern matching system can achieve competitive performance with diverse lexical resources

(61).

Ruleset development is an iterative process that requires manual effort to hand craft features based

on clinical criteria, domain knowledge and/or expert opinions. The following steps were

summarized based on the reported utilization and importance: (1) existing rule-based system

adoption, (2) assessment of existing knowledge resources, and (3) development and refinement of

features and logic rules.

First, the top performing rule-based applications often utilize existing NLP systems (frameworks)

(56, 58, 60, 62, 125). There are many different NLP systems that have been developed at different

institutions that are utilized to convert clinical narratives into structured data, including MedLEE

(113), MetaMap (114), KnowledgeMap (126), cTAKES (115), HiTEX (127), and MedTagger

(116). A comprehensive summary of NLP systems can be found in Wang et al (5). Second,

adopting existing resources such as clinical criteria, guidelines, and clinical corpora can

substantially reduce development efforts. The most common strategy is to leverage well curated

clinical dictionaries and knowledge bases. The dictionary acts as a domain or task specific

knowledge base, which can be easily modified, updated, and aggregated (128). Well-established

medical terminologies and ontologies such as Unified Medical Language System (UMLS)

Metathesaurus (129), Medical Subject Headings (MeSH) (130), and MEDLINE®, have been used

as the basis for clinical information extraction tasks as they already contain well-defined concepts

associated with multiple terms (131). This approach works best in tasks and situations where

modifier detection and the recognition of complex dependencies in the document are not necessary

(132, 133). Since the common features being exploited are morphologic, lexical and syntactic,

dictionary-based methods are highly interpretable, adoptable, and customizable. In the 2014

i2b2/UTHealth challenge, Khalifa and Meystre leveraged UMLS dictionary lookup to identify

cardiovascular risk factors and achieved an F-measure of 87.5% (60). Table 4 summarizes a list of

dictionaries and knowledge bases that were used in our surveyed papers.

Table 2. An overview of the dictionaries and knowledge bases used for clinical concept extraction.

Dictionary/Knowledge base

Description Link Examples

21

Unified Medical Language System (UMLS)

Biomedical thesaurus organized by concept and it links similar names for the same concept

https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/

(123, 134-137)

Wikipedia PHI Category Description Source

https://en.wikipedia.org/wiki/ (138)

MeSH (Medical Subject Headings)

MeSH is the controlled vocabulary based on PubMed articles indexing developed by NLM

https://www.ncbi.nlm.nih.gov/mesh

(122, 123, 139)

ICD-O-3 International Classification of Diseases for Oncology

https://www.who.int/classifications/icd/adaptations/oncology/en/

(140)

RadLex Radiology lexicon http://www.radlex.org/ (123) BodyParts3D Anatomical concepts https://dbarchive.biosciencedbc.jp

/en/bodyparts3d/download.html (123)

NCI Database Cancer related information

https://cactus.nci.nih.gov/download/nci/

(122)

Berman taxonomy

Tumor taxonomy https://www.ncbi.nlm.nih.gov/pmc/articles/PMC535937/bin/1471-2407-4-88-S1.gz

(140)

CTCAE Common Terminology Criteria for Adverse Events

https://ctep.cancer.gov/protocolDevelopment/electronic_applications/ctc.htm

(141)

22

Third, in many situations, singularly relying on dictionaries cannot fully capture all the patterns

necessary to completely capture a concept. Customized rules are created to address complex

patterns. The creation of customized rules is an iterative process involving multiple subject matter

experts. At each iteration, the rules are applied to a reference standard corpus, and its results are

evaluated. Based on the evaluation performance, domain experts review false positive mentions

with the given context to determine the reasons (e.g. family history, negated and hypothetical

sentences). The false negative mentions are usually caused by missing keywords and negation

errors, and can be addressed by refining existing rules or establishing new. This pattern was then

repeated until it reached to a reasonable performance (e.g. maximal F-measure with balanced

precision and recall). For example, Cormack et al. leveraged data-driven rule-based approach by

starting with high recall rules and refining them to increase precision while maintaining recall with

contextual patterns based on observations (61). Kelahan et al. extracted impression text and

assigned positive and negative labels to sentences based on manual rules. The final label of the

radiology report wass determined based on the frequency of sentence labels (64).

3.2.1. Traditional machine learning approach

Advances in methods have revitalized interest in statistical machine learning approaches for NLP,

for which the non-deep-learning variants are typically referred to as “traditional” machine learning.

Traditional machine learning is capable of learning patterns without explicit programming through

learning the association of input data and labeled outputs (142-145). The learning function is

inferred from the data, with the form of the function limited only by the assumptions made by the

learning algorithm. Although feature engineering can be complex, the ability to process and learn

from large document corpora greatly reduces the need to manually review documents and also has

the possibility of developing more accurate models. However, in contrast to deep learning methods

which learn from text in the sequential format in which it is stored, traditional machine learning

approaches require more human intervention in the form of feature engineering (Table 3).

The process of developing traditional machine learning models can be summarized into the

following steps: data pre-processing, feature extraction, modeling, optimization, and evaluation.

Raw text is difficult for today’s computers to understand. In particular, non-deep learning methods

were developed to learn from categorical or numerical data. Therefore, it is common to pre-process

the text into a format that is readily computable. There are a wide variety of different pre-

23

processing methods which have been proposed, however they are outside the scope of this

manuscript. Standard pre-processing techniques include sentence segmentation that splits text into

sentences, tokenization that divides a text or set of text into its individual words, stemming that

reduces a word to its word stem, POS marking up a word in a text (corpus) as corresponding to a

particular part of speech, and dependency parsing. The NLTK and the Stanford CoreNLP were the

two most popular toolkits for performing data pre-processing (73, 76, 107, 146-148).

The majority of traditional machine learning approaches leverage a bag-of-words model for the

word representation (70, 107, 109, 149-151). The bag-of-words model often tokenizes words into

a sparse, high dimensional one-hot space. Although simple, this approach introduces sparsity,

greatly increases the size of data, and also removes any sense of semantic similarity between

words. Building on top of an existing bag of words feature, Yoon et al. proposed graph-of-words,

a new text representation approach based on graph analytics which overcomes these limitations by

modeling word co-occurrence (110). Chen et al. proposed a clustering method using Latent

Dirichlet Allocation (LDA) to summarize sentences for feature representation (152). Another

approach to text representation is to automatically learn abstract, low-dimensional representation

of the words. Common approaches to this are continuous bag of words (CBOW) (79, 135) or

Word2vec (81, 83-85). Recently, advanced embedding and language representations have further

improved state-of-the-art clinical concept extraction. Word embeddings (153) capture latent

syntactic and semantic similarities, but cannot incorporate context dependent semantics present at

sentence or even more abstract levels. Peter et al. addressed this issue through training a neural

language model which was able to capture the semantic roles of words in context. They found that

the addition of the neural language model embeddings to word embeddings yielded state-of-the-

art results for named-entity recognition and chunking (41). From a more abstract approach, Akbik

et al. proposed character-level neural language modeling to capture not only the latent syntactic

and semantic similarities between words, but also capture linguistic concepts such as sentences,

subclauses, and sentiment (154). Many of these embeddings can be used in conjunction with

others. For example, Liu et al. used both token-level and character-level word embeddings as the

input layer (81). Choosing the appropriate embedding for the task can have large effects on the

end model performance (81, 135).

The labeling for concept extraction is typically more complex compared to the standard

classification or regression task. This is because entities in text are varying in length, location of

24

text, and context. Commonly reported labeling for traditional machine learning include boundary

detection-based classification and sequential labeling. Boundary detection aimed at detecting the

boundaries of the target type of information. For example, the BIO tags use B for beginning, I for

inside, and O for outside of a concept. Sequential labeling based extraction methods transforms

each sentence into a sequence of tokens with a corresponding property or label. One particular

advantage of sequential labeling is the consideration of the dependencies of the target information.

Despite that, classification-based extraction is more commonly used than sequential labeling based

extraction. From the review, 15 articles were found to utilize boundary detection-based

classification approaches and 10 articles reported utilizing a sequential learning approach.

Frequently used traditional machine learning models for clinical concept extraction include

conditional random fields (CRF) (155), the Support Vector Machine (SVM) (156), Structural

Support Vector Machines (SSVMs) (157), Logistic Regression (LR) (158), the Bayesian model,

and random forests (159). Among the aforementioned models, CRFs and the SVM are the two

most popular models for clinical concept extraction (71). CRFs can be thought of as a

generalization of LR for sequential data. SVMs use various kernels to transform data into a more

easily discriminative hyperspace. Structural Support Vector Machines is an algorithm that

combines the advantages of both CRFs and SVMs (71). When Tang et al. compared SSVMs and

CRF using the data sets from 2010 i2b2 NLP challenge, the SSVMs achieved better performance

than the CRFs using the same features (71). The summarized traditional machine learning

approaches can be found in Table 5.

Table 3. Summary of traditional machine learning (non-deep learning) approaches.

Learning Task

Tag Examples Word Representation

Model Examples Examples

Boundary detection-based classification

Word position tag (e.g. BIO, BIESO); Binary outcome tag (e.g. 0, 1)

Flat or naive representation; clustering-based word representation; distributional word representation

SVM; SSVM, Naïve Bayes; Decision Trees; AdaBoost; RandomForests; MIRA(160)

(55, 146, 161, 162)

Sequential learning

Word level tag (e.g. POS)

Sequential representation

CRF; Hidden Markov Model (HHM); Maximum Entropy

(73, 160, 163-165)

25

Markov Models (MEMMs)

3.2.2. Deep learning approach

Deep learning is a subfield of machine learning that focuses on automatic learning of features in

multiple levels of abstract representations (17, 166, 167). The algorithms are largely focused

around neural networks such as recurrent neural networks (RNN) (168-170), CNNs (42, 171-173)

and transformers (174), although there are a few other niche approaches. Deep learning has led to

revolutionary developments in many fields including computer vision (175, 176), robotics (177,

178), and NLP (17, 84, 179). In contrast to traditional machine learning paradigms, deep learning

minimizes the need to engineer explicit data representations such as bag-of-words or n-grams.

Many of the deep learning applications in concept extraction have used either variants of RNNs or

CNNs. CNNs rely on convolutional filters to capture spatial relationships in the inputs and pooling

layers to minimize computational complexity. Although these have been found to be exceptionally

useful for computer vision tasks, CNNs may have a difficult time capturing long distance

relationships that are common in text (180). RNNs are neural networks which explicitly model

connections along a sequence, making RNNs uniquely suited for tasks that require long-term

dependency to be captured (181, 182). Conventional RNNs are, however, limited in modeling

capability by the length of text (and therefore limited in the maximum distance between words)

due to problems with vanishing gradients. Variants such as LSTM (40) and gated recurrent unit

(GRU) (183) have been developed to address this issue by separating the propagation of the

gradient and control of the propagation through “gates”. While shown to be effective, these only

diminish the issue rather than completely solve it, being still limited to sequence lengths on the

order of 10s-100s of words long (181). Furthermore, training these models is computationally

intensive and difficult to parallelize as the weights need to be trained in series. Recently, the

transformer architecture has been proposed to solve many of these problems. The transformer

architecture circumvents the need to sequentially process text by processing the entire sequence at

once through a set of matrix multiplications, allowing the network to memorize what element in

the sequence is important (174). If the sequences are lengthy, the memory requirements for training

are substantial. Breaking up the sequence into smaller pieces and adding subsequent layers is

26

therefore needed to allow the model to accommodate long sequences of text without crippling

memory constraints. Thus, transformers can effectively model relationships with long word

distance and are much more computationally efficient compared to RNN variants. Models based

on this architecture such as BERT (15) and GPT (184) have yielded significant improvements for

state-of-the-art performance in many NLP tasks (25).

Meanwhile, using a deep learning model does not preclude using a traditional machine learning

model. Although deep learning models are powerful feature extractors, other models may have

specific attributes which suit particular needs of the problem. This is evident as many of the

researchers combined conditional random field (CRF) with various deep neural networks (word

embeddings input) to improve the performance on NER, such as Bi-LSTM-CRF (44%), Bi-LSTM-

Attention-CRF (22%), and CNN-CRF (22%). This is to take advantage of their relative strengths:

long distance modeling of RNNs and CRF’s ability to jointly connect output tags. Choosing the

correct algorithm that is appropriate for both the research setting and dataset is key to produce a

successful model. Interested readers can refer to a recent review by Wu et al (17) for a more in-

depth overview.

3.2.3. Hybrid approach

Hybrid approaches combine both rule- and machine learning-based approaches into one system,

potentially offering the advantages of both and minimizing their respective weaknesses. There are

two major hybrid approaches, as shown in Section 3.2 - Table 2. Depending on how traditional

machine learning approaches were leveraged, we named these two architectures as either terminal

hybrid approaches or supplemental hybrid approaches. In a terminal hybrid approach, rule-based

systems are used for feature extraction, where the outputs became features used as input for the

machine learning system, and the machine learning system is then a terminal step that selects

optimal features. We found that the terminal hybrid approaches used in 31 out of 51 studies were

in this category. For example, Wang and Akella (91) used NLP features, such as semantic,

syntactic, and sequential features, as input to a supervised traditional machine learning model to

extract disorder mentions from clinical notes. Other applications of hybrid systems include

automatic de-identification of psychiatric notes (185, 186) and detection of clinical note sections

(187). Section 3.2 - Table 3 lists the available NLP features used in the included studies.

27

In supplemental hybrid approaches, machine learning approaches are used to patch deficiencies in

extraction of entities that have poor performance when extracted by purely rule-based approaches.

In one study, such a supplemental hybrid system was incorporated with a user interface for

interactive concept extraction (188). In another example, Meystre et al. (105) leveraged a

traditional machine learning classifier to extract congestive heart failure medications as a

supplement to the rule-based system that extracted mentions and values of left ventricular ejection

fraction, amongst other concepts, for an assessment of treatment performance measures.

Using a hybrid approach may have the benefit of achieving high performance. Although traditional

machine learning systems tend to do best when the task dataset has well-balanced outcomes, many

concept extraction tasks’ datasets are highly imbalanced, therefore making learning difficult.

Farkas and Szarvas added specific “trigger words” as rules to improve their traditional machine

learning de-identification system (189). Explicitly crafting the rules for these “trigger words”

effectively created a “balanced” outcome, improving the algorithm’s ability to correctly learn

patterns. Yang and Garibaldi leveraged a dictionary-based method to supply a CRF model for

medication concept recognition. The hybrid model was ranked fifth out of the 20 participating

teams on the 2014 i2b2 challenge (101). In a study to automatically extract heart failure treatment

performance metrics, the hybrid system outperformed both rule- and traditional machine learning-

based approaches (190).

3.3. Experiment and Evaluation

Rigorously evaluating model performance is a crucial process for developing valid and reliable

concept extraction applications. Evaluation is usually performed at one of several levels: a patient

level, a document level, a concept level, or a defined episode level (with temporality). The specific

level selected with which evaluation was performed was typically determined based on the specific

task or application. For example, patient level detection may be sufficient if the task is to detect

patients with a disease or a disease phenotype. On the other hand, identification of the time of

presentation likely requires an evaluation with a finer level of granularity. Determination of an

evaluation method, as with any concept extraction task, should be made in consultation with

clinical subject matter experts and in the context of the application.

The evaluation can be performed by construction of a confusion matrix or a contingency table to

derive error ratios. Commonly used ratios measure the number of true positives (the predicted label

28

occurs in the gold-standard label set), false positives (the predicted label does not occur in the

gold-standard label set), false negatives (the label occurs in the gold-standard set but was not a

predicted label), and true negatives (the total number of occurrences that are not predicted to be a

given label minus the false positives of that label). From these measures, the standardized

evaluation metrics, including sensitivity or recall, specificity, precision, or positive predictive

value (PPV), negative predictive value (NPV), and f-measure, can then be determined based on

the error ratios. Because traditional machine learning or deep learning models also provide

probabilities, performance can be evaluated using the area under the ROC curve (AUC) and the

area under the precision-recall curve (PRAUC).

A majority of concept extraction methods are evaluated using the hold-out method, where the

model is trained on training (and possible validation) sets and evaluated on the held-out test set.

Cross validation (CV) can also be used to estimate the prediction error of a model by iteratively

training part of the data and leaving the rest for testing. However, applying CV for error estimation

on tuned models using CV may yield a biased result (191). The nested-cross-validation that uses

two independent loops of CV for parameter tuning and error estimation can reduce the bias of the

true error estimation (192).

For multiclass prediction or classification, micro-average and macro-average are two methods of

weight prioritization. The micro-average calculates the score by aggregating the individual rates

(e.g. true positive, false positive, and false negative). In the task of epilepsy phenotype extraction,

Cui et al. reported the micro-average to reflect each individual class under an imbalanced category

distribution (56). The macro-average calculates the metrics score (e.g. precision and recall) by

each class first and then averages by the number of classes. In the evaluation of CEGS N-GRID

2016 shared task, macro-average was used to equally evaluate multiclass symptom severity (193).

In addition, mean squared error was also used for evaluating multiclass prediction or classification

tasks. Another important aspect of evaluation, Cohen's kappa coefficient, is used to measure the

inter- or intra-annotator agreement. Human annotators are imperfect, and therefore various

problems may be more or less difficult to annotate. This can be due to fatigue, variation in

interpretations, or annotator skill. This measure can be important to giving context on the

performance of the model, as well as on the difficulty of the concept extraction task. An extensive

summary of evaluation methods for clinical NLP can be found in Velupillai et al. (194).

29

4. Discussion Given that methods for clinical concept extraction are generally buried within the methods section

of the literature, here, we have provided a review of the methodologies behind clinical concept

extraction applications based on a total of 228 articles. Our review aims to provide some guidance

on method selection for clinical concept extraction tasks. However, the actual method adopted for

a specific task can be impacted by five factors: data and resource availability, domain adaptation,

model interpretability, system customizability, and practical implementation. In the following, we

discuss each of them in detail.

Data and resource availability In clinical concept extraction tasks, the availability of data is a

key consideration for determining the type of methods. For example, the amount of data used to

train machine learning, specifically deep learning methods may impact the reliability and

robustness of the model. Rule-based systems have less of a demand for training data and can be

more appropriate when the data set is relatively small. Alternatively, when large amounts of

labeled data are available and the task has a significant degree of ambiguity, machine learning

approaches can be considered. From section 3.1, it is apparent that machine learning is used more

than rule-based approaches in shared-task settings, possibly due in part to the readily available

deidentified and annotated health-related data (e.g. MIMIC, i2b2 corpus). The availability of

clinical and external subject matter expertise is another consideration. For the rule-based

approaches, developing rules may require substantial manual effort such as iterative chart review

and manual feature crafting. It may be challenging to involve clinicians to help with case validation

and shared-decision making. However, large amounts of a priori knowledge about the domain and

clinical problem (e.g. clear concept definition, well-documented abstraction protocol) and

availability of clinical experts are favorable indicators for success of the rules-based approach.

Domain adaptation One way to leverage available resources is through transfer learning, an

approach to reuse a model trained over a prior task as the pre-trained model for a new task, has

been widely used in deep learning and NLP specifically (195, 196). A popular example of pre-

trained models is BERT (15), which applies a transformer model (174) over huge narratives to

generate a robust language representation model. Meanwhile, multi-concept learning plays an

important role in learning tasks jointly and provides significant insights to accelerate domain

30

adaptation. This learning strategy could also be referred as multitask learning, aiming to make use

of features from multiple datasets or different data modalities to improve the generalization

performance of models (197). In the biomedical domain specifically, some biomedical named

entity recognition (BioNER) tasks were done by leveraging the multitask learning strategy. For

example, based on 15 cross-domain biomedical datasets including the category of anatomy,

chemical, disease, gene/protein, and species, Crichton et al. utilized a convolutional layer and a

fully connected layer as a shared component and a task-specific component respectively to

complete a BioNER task (198). In another study, building upon the embedding layers, Bi-LSTM

layer and CRF layer, Wang et al. built a novel multitask model with the cross-sharing structure

and ran it over heterogeneous gene, protein, and disease datasets to assist BioNER (199).

Multimodal-based multitask learning is also a research direction to conduct prediction jointly

incorporating different collected data modalities (e.g., claims data, free text notes, images, and

genome sequences). For example, Weng et al. developed an innovative representation learning

framework to co-learn features derived from patch-level pathological images, slide-level

pathological images, pathology reports as well as case-level structured data (200). In addition,

Nagpal developed different multitask models (i.e., Multitask Multilayer Perceptron, Recurrent

Multitask Network, and Deep Multimodal Fusion Multitask Network) to jointly model different

data modalities from EHRs, including ICD codes and unstructured clinical notes (201)

Model interpretability Considering that many models will eventually be implemented for clinical

use, the evaluation of a successful model may not solely rely on performance, but also on the

interpretability, which refers to the model’s capability to explain why a certain decision is made

(202). In a practice setting, these explanations may serve as important criteria for safety and bias

evaluations, and are usually referred to as a key factor of “user trust” (203, 204). There are two

categories of interpretability: intrinsic interpretability and post-hoc interpretability (205). Intrinsic

interpretability emphasizes models with certain architectures that can be self-explained. Models

with intrinsic interpretability include rule-based approaches, decision trees, linear models, and

attention models (202). These models are widely used in the clinical domain compared to state-of-

the-art deep learning approaches, due to their explanatory capabilities, the transparency of model

components, and the interpretability of model parameters and hyperparameters (206). For

example, Mowery et al. used regular expressions along with different semantic modifiers to

31

conduct concept extraction of carotid stenosis findings. Based on the clinical definition of carotid

disease, semantic patterns were organized into laterality (e.g. right, left), severity (e.g. critical or

non-critical), and neurovascular anatomy (e.g. internal carotid artery) (112). The semantic pattern

successfully captured important findings with high interpretability. With the advancement of deep

learning models, there has been a surge of interest in interpreting black-box models through post-

hoc interpretability. This can be achieved by creating an additional model to help independently

interpret an existing model. Successful techniques such as a model-agnostic approach allow

explanations to be generated without accessing the internal model. However, the trade-off between

interpretability and model performance needs to be considered when deciding which method to

use.

System customizability Customizability measures how easily each model can be adapted when a

concept definition is changed or there is an update to clinical guidelines. The rule-based

approaches allow the model to be modified and refined based on existing implementation. For

example, Sohn et al. applied a refinement process when deploying an existing pattern matching

algorithm to a different clinical site to achieve high performance (207). Davis et al. adopted four

previously published algorithms for the identification of patients with multiple sclerosis using the

rules combined with multiple features including ICD-9 codes, text keywords, and medication lists

(125). The final updated algorithm was shared on PheKB, a publicly available phenotype

knowledgebase for building and validating phenotype algorithms (208). Xu et al. leveraged

existing algorithms from the eMERGE (electronic Medical Records and Genomics) network to

identify patients with type 2 diabetes mellitus (209). Regarding traditional machine learning and

deep learning models, it is difficult to make customizations explicitly due to their data-driven

nature. In order to accommodate the models to specific clinical problems, incorporating

knowledge-driven perspectives (e.g., sublanguage analysis and biomedical ontology) with the

models are commonly adopted to customize the models implicitly. For example, Shen et al.,

combined surgical site infection features generated by sublanguage analysis with decision tree,

random forest, and support vector machines to mine postsurgical complication patients

automatically from unstructured clinical notes (210). Casteleiro et al., combined the Word2vec

model with knowledge formalized in the cardiovascular disease ontology (CVDO) to provide a

customized solution to extract more pertinent cardiovascular disease-related terms from

32

biomedical literatures (211). The HPO2Vec+ framework provides a way to generate customized

node embeddings for the Human Phenotype Ontology (HPO) based on different selections of

knowledge repositories, in order to accelerate rare disease differential diagnosis by analyzing

patient phenotypic characterization from clinical narratives (212).

Practical implementation When the models are carefully evaluated, they will be implemented

for clinical and research use. The implementation process is highly dependent on institutional

infrastructure, system requirements, data usage agreements, and research and practice objectives.

A majority of the articles implemented the models into different standalone tools with the

advantage of flexibility, low development cost, and low maintenance effort. For example,

Fernandes et al. implemented a suicide ideation model by developing an additional platform to

host the traditional machine learning component (213). Others have implemented the model into

their institutional IT infrastructure, which includes an ETL process for document or HL7 message

retrieval, a parser that does document pre-processing, an engine to host and run rule-based or

traditional machine learning models, and a database to store extracted results (214).

Open challenges Beyond the aforementioned factors which impact the specific tasks, there are

several open challenges which impact the overall field of clinical concept extraction. Like many

AI tasks, reproducibility and system portability of proposed solutions are often in question.

However, these questions are further challenged by the stringent privacy and security requirements

posed by the regulations surrounding protected health information, thereby limiting ability to share

data or produce fruitful collaborations (215). The lack of multi-institutional data is further

exacerbated by the cost of annotating data, lack of detailed study protocols, and sometimes severe

differences in EHR data between institutions (52). As the result, it has been shown that NLP

algorithms developed in one institution for a study may not perform well when reused in the same

institution or deployed to a different institution or for different studies. For example, Wagholikar

et al. evaluated the performance of an NLP tagging system from two sites. The performance

degrades when the tagger was ported to a different hospital (216). The differences in EHR systems,

physician training, and data documentation standards are the source of significant clinical

document variability and non-optimal performance of concept extraction systems.

Potential solutions such as federated learning have some advantages in dealing with data privacy

issues (217). In a federated learning regime, several individual institutions (workers) collaborate

33

to create a single model hosted at a centralized location (server) without the need to share data.

Instead, the model is iteratively sent to the nodes to incrementally improve the results of the model.

The trained models are then iteratively aggregated in the server. In this way, neither the server, nor

other nodes have to see any data not residing at that specific node. However, this type of model

development is limited to machine learning methods. Furthermore, the centralized model is

typically developed using a centralized test set which as mentioned earlier, often does not

approximate well to the data at individualized nodes.

Limitations There are some limitations in our study. First, this study may be biased and has the

potential of missing relevant articles published due to the search strings and databases selected in

this review. Secondly, we only included articles written in English with the focus on clinical

information extraction. Articles written in other languages would also provide valuable

information. Thirdly, our review did not include methods based on non-English EHRs. Finally,

our study also suffers the inherent ambiguity associated with data element collection and

normalization due to subjectivity introduced in the review process.

Funding

This work was made possible by National Institute of Health (NIH) grant number 1U01TR002062-

01 and R21AI142702.

Competing Interests

The authors have no competing interests to declare.

Contributors

SF performed the analysis and wrote the study. All authors participated in the study design,

interpretation of the data, and contributed to manuscript editing and revisions. HH performed

scientific visualization. HL conceptualized, designed, and edited the manuscript.

Acknowledgements

We gratefully acknowledge Larry J. Prokop for implementing search strategies, and Katelyn

Cordie and Luke Carlson for editorial support.

34

35

References

1. Jones SS, Rudin RS, Perry T, Shekelle PG. Health information technology: an updated systematic review with a focus on meaningful use. Ann Intern Med. 2014;160(1):48-54. 2. Friedman CP, Wong AK, Blumenthal D. Achieving a nationwide learning health system. Science translational medicine. 2010;2(57):57cm29-57cm29. 3. Demner-Fushman D, Chapman WW, McDonald CJ. What can natural language processing do for clinical decision support?. [Review] [132 refs]. Journal of Biomedical Informatics. 2009;42(5):760-72. 4. Cowie J, Wilks Y. Information extraction. Handbook of Natural Language Processing. 2000;56:57. 5. Wang Y, Wang L, Rastegar-Mojarad M, Moon S, Shen F, Afzal N, et al. Clinical information extraction applications: A literature review. Journal of Biomedical Informatics. 2018;77:34-49. 6. Nadeau D, Sekine S. A survey of named entity recognition and classification. Lingvisticae Investigationes. 2007;30(1):3-26. 7. Marsh E, Perzanowski D, editors. MUC-7 evaluation of IE technology: Overview of results. Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29-May 1, 1998; 1998. 8. Torii M, Wagholikar K, Liu H. Using machine learning for concept extraction on clinical documents from multiple data sources. Journal of the American Medical Informatics Association. 2011;18(5):580-7. 9. Si Y, Wang J, Xu H, Roberts K. Enhancing clinical concept extraction with contextual embeddings. Journal of the American Medical Informatics Association. 2019;26(11):1297-304. 10. Harkema H, Chapman WW, Saul M, Dellon ES, Schoen RE, Mehrotra A. Developing a natural language processing application for measuring the quality of colonoscopy procedures. Journal of the American Medical Informatics Association. 2011;18(Supplement_1):i150-i6. 11. Meystre SM, Savova GK, Kipper-Schuler KC, Hurdle JF. Extracting information from textual documents in the electronic health record: a review of recent research. Yearb. 2008;17(01):128-44. 12. Sager N. Natural Language Information Processing: A Computer Grammmar of English and Its Applications: Addison-Wesley Longman Publishing Co., Inc.; 1981. 13. Sager N, Friedman C, Lyman MS. Medical language processing: computer management of narrative data: Addison-Wesley Longman Publishing Co., Inc.; 1987. 14. Manning CD, Manning CD, Schütze H. Foundations of statistical natural language processing: MIT press; 1999. 15. Devlin J, Chang M-W, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805. 2018. 16. Xiao C, Choi E, Sun J. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. Journal of the American Medical Informatics Association. 2018;25(10):1419-28. 17. Wu S, Roberts K, Datta S, Du J, Ji Z, Si Y, et al. Deep learning in clinical natural language processing: a methodical review. 2019. 18. Moher D, Liberati A, Tetzlaff J, Altman DG. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. Ann Intern Med. 2009;151(4):264-9.

36

19. Slee VNJAoim. The International classification of diseases: ninth revision (ICD-9). 1978;88(3):424-6. 20. Oliveira L, Tellis R, Qian Y, Trovato K, Mankovich G. Identification of Incidental Pulmonary Nodules in Free-text Radiology Reports: An Initial Investigation. Stud Health Technol Inform. 2015;216:1027. 21. Dutta S, Long WJ, Brown DF, Reisner AT. Automated detection using natural language processing of radiologists recommendations for additional imaging of incidental findings. Annals of Emergency Medicine. 2013;62(2):162-9. 22. Si Y, Wang J, Xu H, Roberts K. Enhancing Clinical Concept Extraction with Contextual Embedding. arXiv preprint arXiv:190208691. 2019. 23. Alsentzer E, Murphy JR, Boag W, Weng W-H, Jin D, Naumann T, et al. Publicly available clinical BERT embeddings. arXiv preprint arXiv:190403323. 2019. 24. Patrick J, Li M, editors. A cascade approach to extracting medication events. Proceedings of the Australasian Language Technology Association Workshop 2009; 2009. 25. Peng Y, Yan S, Lu Z. Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets. arXiv preprint arXiv:190605474. 2019. 26. Clark C, Good K, Jezierny L, Macpherson M, Wilson B, Chajewska U. Identifying smokers with a medical extraction system. Journal of the American Medical Informatics Association. 2008;15(1):36-9. 27. Uzuner Ö, Luo Y, Szolovits P. Evaluating the state-of-the-art in automatic de-identification. Journal of the American Medical Informatics Association. 2007;14(5):550-63. 28. Uzuner Ö, Goldstein I, Luo Y, Kohane I. Identifying patient smoking status from medical discharge records. Journal of the American Medical Informatics Association. 2008;15(1):14-24. 29. Uzuner Ö. Recognizing obesity and comorbidities in sparse data. Journal of the American Medical Informatics Association. 2009;16(4):561-70. 30. Uzuner Ö, Solti I, Cadag E. Extracting medication information from clinical text. Journal of the American Medical Informatics Association. 2010;17(5):514-8. 31. Uzuner O, Bodnari A, Shen S, Forbush T, Pestian J, South BR. Evaluating the state of the art in coreference resolution for electronic medical records. Journal of the American Medical Informatics Association. 2012;19(5):786-91. 32. Pradhan S, Elhadad N, Chapman W, Manandhar S, Savova G, editors. Semeval-2014 task 7: Analysis of clinical text. Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014); 2014. 33. Elhadad N, Pradhan S, Gorman S, Manandhar S, Chapman W, Savova G, editors. SemEval-2015 task 14: Analysis of clinical text. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015); 2015. 34. Bethard S, Savova G, Chen W-T, Derczynski L, Pustejovsky J, Verhagen M, editors. Semeval-2016 task 12: Clinical tempeval. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016); 2016. 35. Liu S, Wang Y, Liu H. Selected articles from the BioCreative/OHNLP challenge 2018. Springer; 2019. 36. Rastegar-Mojarad M, Liu S, Wang Y, Afzal N, Wang L, Shen F, et al., editors. BioCreative/OHNLP Challenge 2018. Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics; 2018: ACM.

37

37. Wang Y, Afzal N, Liu S, Rastegar-Mojarad M, Wang L, Shen F, et al. Overview of the BioCreative/OHNLP Challenge 2018 Task 2: Clinical Semantic Textual Similarity. 2018;2018. 38. Liu S, Mojarad MR, Wang Y, Wang L, Shen F, Fu S, et al. Overview of the BioCreative/OHNLP 2018 Family History Extraction Task. 39. Stubbs A, Filannino M, Soysal E, Henry S, Uzuner Ö. Cohort selection for clinical trials: n2c2 2018 shared task track 1. Journal of the American Medical Informatics Association. 2019;26(11):1163-71. 40. Hochreiter S, Schmidhuber JJNc. Long short-term memory. 1997;9(8):1735-80. 41. Peters ME, Ammar W, Bhagavatula C, Power R. Semi-supervised sequence tagging with bidirectional language models. arXiv preprint arXiv:170500108. 2017. 42. Chen H, Lin Z, Ding G, Lou J, Zhang Y, Karlsson B, editors. GRN: Gated Relation Network to Enhance Convolutional Neural Network for Named Entity Recognition. Proceedings of AAAI; 2019. 43. Developing a framework for detecting asthma endotypes from electronic health records. American Journal of Respiratory and Critical Care Medicine. 2014;Conference:American Thoracic Society International Conference, ATS 2014. San Diego, CA United States. Conference Publication: (var.pagings). 189 (no pagination). 44. Fu S, Leung LY, Wang Y, Raulli A-O, Kallmes DF, Kinsman KA, et al. Natural Language Processing for the Identification of Silent Brain Infarcts From Neuroimaging Reports. JMIR Med Inform. 2019;7(2):e12109. 45. Chase HS, Mitrani LR, Lu GG, Fulgieri DJ. Early recognition of multiple sclerosis using natural language processing of the electronic health record. BMC Med Inf Decis Mak. 2017;17(1):24. 46. Wu ST, Wi CI, Sohn S, Liu H, Juhn YJ, editors. Staggered NLP-assisted refinement for clinical annotations of chronic disease events. 10th International Conference on Language Resources and Evaluation, LREC 2016; 2016: European Language Resources Association (ELRA). 47. Chen T, Dredze M, Weiner JP, Hernandez L, Kimura J, Kharrazi HJJmi. Extraction of Geriatric Syndromes From Electronic Health Record Clinical Notes: Assessment of Statistical Natural Language Processing Methods. 2019;7(1):e13039. 48. Murphy SN, Weber G, Mendis M, Gainer V, Chueh HC, Churchill S, et al. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). 2010;17(2):124-30. 49. Albright D, Lanfranchi A, Fredriksen A, Styler IV WF, Warner C, Hwang JD, et al. Towards comprehensive syntactic and semantic annotations of the clinical narrative. 2013;20(5):922-30. 50. Fu S, Carlson LA, Peterson KJ, Wang N, Zhou X, Peng S, et al. Natural Language Processing for the Evaluation of Methodological Standards and Best Practices of EHR-based Clinical Research. AMIA Summits Transl Sci Proc. 2020;2020:171. 51. Gilbert EH, Lowenstein SR, Koziol-McLain J, Barta DC, Steiner J. Chart reviews in emergency medicine research: where are the methods? Annals of emergency medicine. 1996;27(3):305-8. 52. Fu S, Leung LY, Raulli A-O, Kallmes DF, Kinsman KA, Nelson KB, et al. Assessment of the impact of EHR heterogeneity for clinical research through a case study of silent brain infarction. BMC Med Informatics Decis Mak. 2020;20:1-12.

38

53. Mayer J, Shen S, South BR, Meystre S, Friedlin FJ, Ray WR, et al., editors. Inductive creation of an annotation schema and a reference standard for de-identification of VA electronic clinical notes. AMIA Annual Symposium Proceedings; 2009: American Medical Informatics Association. 54. South BR, Shen S, Jones M, Garvin J, Samore MH, Chapman WW, et al., editors. Developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease. BMC Bioinformatics; 2009: BioMed Central. 55. Khalifa A, Velupillai S, Meystre S, editors. UtahBMI at SemEval-2016 task 12: Extracting temporal information from clinical text. 10th International Workshop on Semantic Evaluation, SemEval 2016; 2016: Association for Computational Linguistics (ACL). 56. Cui L, Sahoo SS, Lhatoo SD, Garg G, Rai P, Bozorgi A, et al. Complex epilepsy phenotype extraction from narrative clinical discharge summaries. Journal of Biomedical Informatics. 2014;51:272-9. 57. Murtaugh MA, Gibson BS, Redd D, Zeng-Treitler Q. Regular expression-based learning to extract bodyweight values from clinical notes. Journal of biomedical informatics. 2015;54:186-90. 58. Childs LC, Enelow R, Simonsen L, Heintzelman NH, Kowalski KM, Taylor RJ. Description of a rule-based system for the i2b2 challenge in natural language processing for clinical data. Journal of the American Medical Informatics Association. 2009;16(4):571-5. 59. Nelson RE, Grosse SD, Waitzman NJ, Lin J, DuVall SL, Patterson O, et al. Using multiple sources of data for surveillance of postoperative venous thromboembolism among surgical patients treated in Department of Veterans Affairs hospitals, 2005–2010. 2015;135(4):636-42. 60. Khalifa A, Meystre S. Adapting existing natural language processing resources for cardiovascular risk factors identification in clinical notes. Journal of biomedical informatics. 2015;58:S128-S32. 61. Cormack J, Nath C, Milward D, Raja K, Jonnalagadda SR. Agile text mining for the 2014 i2b2/UTHealth Cardiac risk factors challenge. Journal of biomedical informatics. 2015;58:S120-S7. 62. Sevenster M, Van Ommering R, Qian Y. Automatically correlating clinical findings and body locations in radiology reports using MedLEE. Journal of digital imaging. 2012;25(2):240-9. 63. Yang H. Automatic extraction of medication information from medical discharge summaries. Journal of the American Medical Informatics Association. 2010;17(5):545-8. 64. Kelahan LC, Fong A, Ratwani RM, Filice RW. Call Case Dashboard: Tracking R1 Exposure to High-Acuity Cases Using Natural Language Processing. Journal of the American College of Radiology. 2016;13(8):988-91. 65. Jonnagaddala J, Liaw S-T, Ray P, Kumar M, Chang N-W, Dai H-JJJobi. Coronary artery disease risk assessment from unstructured electronic health records using text mining. 2015;58:S203-S10. 66. Deléger L, Grouin C, Zweigenbaum PJJotAMIA. Extracting medical information from narrative patient records: the case of medication-related information. 2010;17(5):555-8. 67. Mork JG, Bodenreider O, Demner-Fushman D, Dogan RI, Lang FM, Lu Z, et al. Extracting Rx information from clinical narrative. Journal of the American Medical Informatics Association. 2010;17(5):536-9.

39

68. Denny JC, Peterson JF, Choma NN, Xu H, Miller RA, Bastarache L, et al. Extracting timing and status descriptors for colonoscopy testing from electronic medical records. 2010;17(4):383-8. 69. Xu H, Jiang M, Oetjens M, Bowton EA, Ramirez AH, Jeff JM, et al. Facilitating pharmacogenetic studies using electronic health records and natural-language processing: a case study of warfarin. 2011;18(4):387-91. 70. Sarker A, Gonzalez G. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. Journal of biomedical informatics. 2015;53:196-207. 71. Tang B, Cao H, Wu Y, Jiang M, Xu H, editors. Recognizing clinical entities in hospital discharge summaries using Structural Support Vector Machines with word representation features. BMC Med Informatics Decis Mak; 2013: BioMed Central. 72. Sordo M, Rocha BH, Morales AA, Maviglia SM, Oglio ED, Fairbanks A, et al. Modeling decision support rule interactions in a clinical setting. Stud Health Technol Inform. 2013;192:908-12. 73. Jiang J, Guan Y, Zhao C, editors. WI-ENRE in CLEF eHealth Evaluation Lab 2015: Clinical Named Entity Recognition Based on CRF. CLEF (Working Notes); 2015. 74. Akkasi A, Varoglu E. Improving Biochemical Named Entity Recognition Using PSO Classifier Selection and Bayesian Combination Methods. IEEE/ACM Trans Comput Biol Bioinformatics. 2017;14(6):1327-38. 75. Henriksson A, Kvist M, Dalianis H. Detecting Protected Health Information in Heterogeneous Clinical Notes. Stud Health Technol Inform. 2017;245:393-7. 76. Urbain J. Mining heart disease risk factors in clinical text with named entity recognition and distributional semantic models. Journal of biomedical informatics. 2015;58:S143-S9. 77. Esuli A, Marcheggiani D, Sebastiani FJJobi. An enhanced CRFs-based system for information extraction from radiology reports. 2013;46(3):425-35. 78. Roberts K, Rink B, Harabagiu SM, Scheuermann RH, Toomay S, Browning T, et al., editors. A machine learning approach for identifying anatomical locations of actionable findings in radiology reports. AMIA Annual Symposium Proceedings; 2012: American Medical Informatics Association. 79. Liu Z, Yang M, Wang X, Chen Q, Tang B, Wang Z, et al. Entity recognition from clinical texts via recurrent neural network. BMC Med Inf Decis Mak. 2017;17(Suppl 2):67. 80. Li P, Huang H, editors. UTA DLNLP at SemEval-2016 task 12: Deep learning based natural language processing system for clinical information identification from clinical notes and pathology reports. 10th International Workshop on Semantic Evaluation, SemEval 2016; 2016: Association for Computational Linguistics (ACL). 81. Liu Z, Tang B, Wang X, Chen Q. De-identification of clinical notes via recurrent neural network and conditional random field. Journal of Biomedical Informatics. 2017;75S:S34-S42. 82. Wu Y, Xu J, Jiang M, Zhang Y, Xu H, editors. A study of neural word embeddings for named entity recognition in clinical text. AMIA Annual Symposium Proceedings; 2015: American Medical Informatics Association. 83. Tran T, Kavuluru R. Predicting mental conditions based on "history of present illness" in psychiatric notes with deep neural networks. Journal of Biomedical Informatics. 2017;75S:S138-S48. 84. Gehrmann S, Dernoncourt F, Li Y, Carlson ET, Wu JT, Welt J, et al. Comparing deep learning and concept extraction based methods for patient phenotyping from clinical narratives. PLoS ONE. 2018;13(2):e0192360.

40

85. Luu TM, Phan R, Davey R, Chetty G, editors. A multilevel NER framework for automatic clinical name entity recognition. 17th IEEE International Conference on Data Mining Workshops, ICDMW 2017; 2017: IEEE Computer Society. 86. Wei W-Q, Tao C, Jiang G, Chute CG, editors. A high throughput semantic concept frequency based approach for patient identification: a case study using type 2 diabetes mellitus clinical notes. AMIA annual symposium proceedings; 2010: American Medical Informatics Association. 87. Yadav K, Sarioglu E, Choi HA, Cartwright IV WB, Hinds PS, Chamberlain JM. Automated outcome classification of computed tomography imaging reports for pediatric traumatic brain injury. Academic emergency medicine. 2016;23(2):171-8. 88. Sordoa M, Topazb M, Zhongb F, Murralid M, Navathed S, Rochaa RA. Identifying patients with depression using free-text clinical documents. 2015. 89. Zheng C, Rashid N, Wu YL, Koblick R, Lin AT, Levy GD, et al. Using natural language processing and machine learning to identify gout flares from electronic clinical notes. Arthritis care & research. 2014;66(11):1740-8. 90. Leaman R, Khare R, Lu Z. NCBI at 2013 ShARe/CLEF eHealth Shared Task: disorder normalization in clinical notes with DNorm. Radiology. 2011;42(21.1):1,941. 91. Wang C, Akella R. A hybrid approach to extracting disorder mentions from clinical notes. AMIA Summits on Translational Science Proceedings. 2015;2015:183. 92. Yu S, Liao KP, Shaw SY, Gainer VS, Churchill SE, Szolovits P, et al. Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources. Journal of the American Medical Informatics Association. 2015;22(5):993-1000. 93. Patrick J, Li M. High accuracy information extraction of medication information from clinical notes: 2009 i2b2 medication extraction challenge. Journal of the American Medical Informatics Association. 2010;17(5):524-7. 94. Tang B, Wu Y, Jiang M, Chen Y, Denny JC, Xu H. A hybrid system for temporal information extraction from clinical text. Journal of the American Medical Informatics Association. 2013;20(5):828-35. 95. Agarwal A, Baechle C, Behara R, Zhu X. A Natural language processing framework for assessing hospital readmissions for patients with COPD. IEEE journal of biomedical and health informatics. 2018;22(2):588-96. 96. Karystianis G, Nevado AJ, Kim CH, Dehghan A, Keane JA, Nenadic G. Automatic mining of symptom severity from psychiatric evaluation notes. International journal of methods in psychiatric research. 2018;27(1):e1602. 97. Castro SM, Tseytlin E, Medvedeva O, Mitchell K, Visweswaran S, Bekhuis T, et al. Automated annotation and classification of BI-RADS assessment from radiology reports. Journal of biomedical informatics. 2017;69:177-87. 98. Yim W-w, Evans HL, Yetisgen M. Structuring Free-text Microbiology Culture Reports For Secondary Use. AMIA Summits on Translational Science Proceedings. 2015;2015:471. 99. Khor R, Yip W-K, Bressel M, Rose W, Duchesne G, Foroudi F. Practical implementation of an existing smoking detection pipeline and reduced support vector machine training corpus requirements. Journal of the American Medical Informatics Association. 2013;21(1):27-30. 100. Xu Y, Hong K, Tsujii J, Chang EI-C. Feature engineering combined with machine learning and rule-based methods for structured information extraction from narrative clinical discharge summaries. Journal of the American Medical Informatics Association. 2012;19(5):824-32.

41

101. Yang H, Garibaldi JM. A hybrid model for automatic identification of risk factors for heart disease. Journal of biomedical informatics. 2015;58:S171-S82. 102. Meystre SM, Thibault J, Shen S, Hurdle JF, South BR. Textractor: a hybrid system for medications and reason for their prescription extraction from clinical text documents. Journal of the American Medical Informatics Association. 2010;17(5):559-62. 103. Yang H, Spasic I, Keane JA, Nenadic G. A text mining approach to the prediction of disease status from clinical discharge summaries. Journal of the American Medical Informatics Association. 2009;16(4):596-600. 104. Wu ST, Kaggal VC, Dligach D, Masanz JJ, Chen P, Becker L, et al. A common type system for clinical natural language processing. Journal of biomedical semantics. 2013;4(1):1. 105. Meystre SM, Kim Y, Gobbel GT, Matheny ME, Redd A, Bray BE, et al. Congestive heart failure information extraction framework for automated treatment performance measures assessment. Journal of the American Medical Informatics Association. 2016;24(e1):e40-e6. 106. McCrum R, Cran W, MacNeil R, MacNeil R. The story of English: Faber & Faber London; 1986. 107. Hoogendoorn M, Szolovits P, Moons LMG, Numans ME. Utilizing uncoded consultation notes from electronic medical records for predictive modeling of colorectal cancer. Artificial Intelligence in Medicine. 2016;69:53-61. 108. Aramaki E, Miura Y, Tonoike M, Ohkuma T, Masuichi H, Waki K, et al. Extraction of adverse drug effects from clinical records. MedInfo. 2010;160:739-43. 109. Doan S, Xu H. Recognizing Medication related Entities in Hospital Discharge Summaries using Support Vector Machine. Proc. 2010;2010:259-66. 110. Yoon HJ, Roberts L, Tourassi G, editors. Automated histologic grading from free-text pathology reports using graph-of-words features and machine learning. 4th IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2017; 2017: Institute of Electrical and Electronics Engineers Inc. 111. Wyles CC, Tibbo ME, Fu S, Wang Y, Sohn S, Kremers WK, et al. Use of natural language processing algorithms to identify common data elements in operative notes for total hip arthroplasty. JBJS. 2019;101(21):1931-8. 112. Mowery DL, Chapman BE, Conway M, South BR, Madden E, Keyhani S, et al. Extracting a stroke phenotype risk factor from Veteran Health Administration clinical reports: an information content analysis. J Biomed Semantics. 2016;7(1):26. 113. Friedman C, Alderson PO, Austin JH, Cimino JJ, Johnson SB. A general natural-language text processor for clinical radiology. Journal of the American Medical Informatics Association. 1994;1(2):161-74. 114. Aronson AR, Lang F-M. An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association. 2010;17(3):229-36. 115. Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Kipper-Schuler KC, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association. 2010;17(5):507-13. 116. Liu H, Bielinski SJ, Sohn S, Murphy S, Wagholikar KB, Jonnalagadda SR, et al. An information extraction framework for cohort identification using electronic health records. AMIA Summits Transl Sci Proc. 2013;2013:149. 117. Leaman R, Khare R, Lu Z. NCBI at 2013 ShARe/CLEF eHealth Shared Task: disorder normalization in clinical notes with DNorm. Radiology. 2011;42(21.1):1-941.

42

118. Hao T, Pan X, Gu Z, Qu Y, Weng H. A pattern learning-based method for temporal expression extraction and normalization from multi-lingual heterogeneous clinical texts.[Erratum appears in BMC Med Inform Decis Mak. 2018 Apr 13;18(1):25; PMID: 29653522]. BMC Med Inf Decis Mak. 2018;18(Suppl 1):22. 119. Lin YK, Chen H, Brown RA. MedTime: a temporal information extraction system for clinical narratives. Journal of Biomedical Informatics. 2013;46(8). 120. Vetulani Z, Uszkoreit H. Human Language Technology. Challenges of the Information Society: Third Language and Technology Conference, LTC 2007, Poznan, Poland, October 5-7, 2007, Revised Selected Papers: Springer; 2009. 121. Clancey WJ. The epistemology of a rule-based expert system—a framework for explanation. Artificial intelligence. 1983;20(3):215-51. 122. Quimbaya AP, Múnera AS, Rivera RAG, Rodríguez JCD, Velandia OMM, Peña AAG, et al., editors. Named Entity Recognition over Electronic Health Records Through a Combined Dictionary-based Approach. Conference on ENTERprise Information Systems / International Conference on Project MANagement / Conference on Health and Social Care Information Systems and Technologies, CENTERIS / ProjMAN / HCist 2016; 2016: Elsevier B.V. 123. Xu Y, Hua J, Ni Z, Chen Q, Fan Y, Ananiadou S, et al. Anatomical entity recognition with a hierarchical framework augmented by external resources. PloS one. 2014;9(10):e108396. 124. Yang H, Garibaldi JM. Automatic detection of protected health information from clinic narratives. Journal of biomedical informatics. 2015;58:S30-S8. 125. Davis MF, Sriram S, Bush WS, Denny JC, Haines JL. Automated extraction of clinical traits of multiple sclerosis in electronic medical records. Journal of the American Medical Informatics Association. 2013;20(e2):e334-e40. 126. Denny JC, Irani PR, Wehbe FH, Smithers JD, Spickard III A, editors. The KnowledgeMap project: development of a concept-based medical school curriculum database. AMIA Annual Symposium Proceedings; 2003: American Medical Informatics Association. 127. Goryachev S, Sordo M, Zeng QT, editors. A suite of natural language processing tools developed for the I2B2 project. AMIA Annual Symposium Proceedings; 2006: American Medical Informatics Association. 128. Rindflesch TC, Tanabe L, Weinstein JN, Hunter L. EDGAR: extraction of drugs, genes and relations from the biomedical literature. Biocomputing 2000: World Scientific; 1999. p. 517-28. 129. Bodenreider OJNar. The unified medical language system (UMLS): integrating biomedical terminology. 2004;32(suppl_1):D267-D70. 130. Lipscomb CEJBotMLA. Medical subject headings (MeSH). 2000;88(3):265. 131. Carrell DS, Schoen RE, Leffler DA, Morris M, Rose S, Baer A, et al. Challenges in adapting existing clinical natural language processing systems to multiple, diverse health care settings. Journal of the American Medical Informatics Association. 2017;24(5):986-91. 132. Farkas R, Szarvas G, Hegedűs I, Almási A, Vincze V, Ormándi R, et al. Semi-automated construction of decision rules to predict morbidities from clinical texts. Journal of the American Medical Informatics Association. 2009;16(4):601-5. 133. Wang Y, Patrick J, editors. Cascading classifiers for named entity recognition in clinical notes. Proceedings of the workshop on biomedical information extraction; 2009: Association for Computational Linguistics.

43

134. Ebersbach M, Herms R, Eibl M, editors. Fusion methods for ICD10 code classification of death certificates in multilingual corpora. 18th Working Notes of CLEF Conference and Labs of the Evaluation Forum, CLEF 2017; 2017: CEUR-WS. 135. Pandey C, Ibrahim Z, Wu H, Iqbal E, Dobson R, editors. Improving RNN with atention and embedding for adverse drug reactions. 7th International Conference on Digital Health, DH 2017; 2017: Association for Computing Machinery. 136. Smith KL, Sarker A, Nikfarjam A, Malone D, Gonzalez-Hernandez G. Mining adverse events in twitter: Experiences of adalimumab users. Value in Health. 2017;20 (5):A51. 137. Liu YC, Ku LW, editors. CLEFeHealth 2014 normalization of information extraction challenge using multi-model method. 2014 Cross Language Evaluation Forum Conference, CLEF 2014; 2014: CEUR-WS. 138. Bui DDA, Wyatt M, Cimino JJ. The UAB Informatics Institute and 2016 CEGS N-GRID de-identification shared task challenge. Journal of Biomedical Informatics. 2017;75S:S54-S61. 139. Deng N, Du NK, Feng YN, Wang ZY, Duan HL, Liu F. Exploring the genotype-phenotype associations of colorectal cancer using vector space model. Journal of Investigative Medicine. 2017;65 (Supplement 7):A3. 140. Kasthurirathne SN, Dixon BE, Gichoya J, Xu H, Xia Y, Mamlin B, et al. Toward better public health reporting using existing off the shelf approaches: The value of medical dictionaries in automated cancer detection using plaintext medical data. Journal of Biomedical Informatics. 2017;69:160-76. 141. Dehghan A, Ala-Aldeen K, Blaum C, Liptrot T, Kennedy J, Tibble D, et al. Automated classification of radiation oesophagitis from free text clinical narratives. Lung Cancer. 2017;103 (Supplement 1):S57-S8. 142. Sebastiani F. Machine learning in automated text categorization. ACM computing surveys (CSUR). 2002;34(1):1-47. 143. Freitag D. Machine learning for information extraction in informal domains. Machine learning. 2000;39(2-3):169-202. 144. Alpaydin E. Introduction to machine learning: MIT press; 2009. 145. Hastie T, Tibshirani R, Friedman J, Franklin J. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer. 2005;27(2):83-5. 146. Zhang Y, Jiang M, Wang J, Xu H. Semantic Role Labeling of Clinical Text: Comparing Syntactic Parsers and Features. AMIA Annu Symp Proc. 2016;2016:1283-92. 147. Loper E, Bird S. NLTK: the natural language toolkit. arXiv preprint cs/0205028. 2002. 148. Manning CD, Surdeanu M, Bauer J, Finkel JR, Bethard S, McClosky D, editors. The Stanford CoreNLP natural language processing toolkit. Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations; 2014. 149. Sohn S, Larson DW, Habermann EB, Naessens JM, Alabbad JY, Liu H. Detection of clinically important colorectal surgical site infection using Bayesian network. J Surg Res. 2017;209:168-73. 150. Rochefort CM, Verma AD, Eguale T, Lee TC, Buckeridge DL. A novel method of adverse event detection can accurately identify venous thromboembolisms (VTEs) from narrative electronic health record data. Journal of the American Medical Informatics Association. 2014;22(1):155-65. 151. Gaebel J, Kolter T, Arlt F, Denecke K. Extraction Of Adverse Events From Clinical Documents To Support Decision Making Using Semantic Preprocessing. Stud Health Technol Inform. 2015;216:1030.

44

152. Chen Y, Lask TA, Mei Q, Chen Q, Moon S, Wang J, et al. An active learning-enabled annotation system for clinical named entity recognition. BMC Med Inf Decis Mak. 2017;17(Suppl 2):82. 153. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J, editors. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems; 2013. 154. Akbik A, Blythe D, Vollgraf R, editors. Contextual string embeddings for sequence labeling. Proceedings of the 27th International Conference on Computational Linguistics; 2018. 155. Lafferty J, McCallum A, Pereira FC. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001. 156. Cortes C, Vapnik VJMl. Support-vector networks. 1995;20(3):273-97. 157. Tsochantaridis I, Joachims T, Hofmann T, Altun YJJomlr. Large margin methods for structured and interdependent output variables. 2005;6(Sep):1453-84. 158. Kleinbaum DG, Dietz K, Gail M, Klein M, Klein M. Logistic regression: Springer; 2002. 159. Breiman L. Random forests. Machine learning. 2001;45(1):5-32. 160. Kim Y, Garvin JH, Heavirland J, Meystre SM, editors. Improving heart failure information extraction by domain adaptation. Medinfo; 2013. 161. Kreuzthaler M, Schulz S, editors. Detection of sentence boundaries and abbreviations in clinical narratives. BMC Med Informatics Decis Mak; 2015: BioMed Central. 162. Turner CA, Jacobs AD, Marques CK, Oates JC, Kamen DL, Anderson PE, et al. Word2Vec inversion and traditional text classifiers for phenotyping lupus. BMC Med Inf Decis Mak. 2017;17(1):126. 163. Li Q, Zhai H, Deleger L, Lingren T, Kaiser M, Stoutenborough L, et al. A sequence labeling approach to link medications and their attributes in clinical notes and clinical trial announcements for information extraction. 2013;20(5):915-21. 164. Gung J, editor Using Relations for Identification and Normalization of Disorders: Team CLEAR in the ShARe/CLEF 2013 eHealth Evaluation Lab. CLEF (Working Notes); 2013. 165. Forsyth AW, Barzilay R, Hughes KS, Lui D, Lorenz KA, Enzinger A, et al. Machine Learning Methods to Extract Documentation of Breast Cancer Symptoms From Electronic Health Records. J Pain Symptom Manage. 2018;27:27. 166. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436. 167. Chen D, Liu S, Kingsbury P, Sohn S, Storlie CB, Habermann EB, et al. Deep learning and alternative learning strategies for retrospective real-world clinical data. NPJ digital medicine. 2019;2(1):43. 168. Rumelhart DE, Hinton GE, Williams RJJn. Learning representations by back-propagating errors. 1986;323(6088):533-6. 169. Cocos A, Fiks AG, Masino AJ. Deep learning for pharmacovigilance: recurrent neural network architectures for labeling adverse drug reactions in Twitter posts. Journal of the American Medical Informatics Association. 2017;24(4):813-21. 170. Jauregi Unanue I, Zare Borzeshi E, Piccardi M. Recurrent neural networks with specialized word embeddings for health-domain named-entity recognition. Journal of Biomedical Informatics. 2017;76:102-9. 171. LeCun Y, Bottou L, Bengio Y, Haffner PJPotI. Gradient-based learning applied to document recognition. 1998;86(11):2278-324.

45

172. Tan LK, Liew YM, Lim E, McLaughlin RA. Convolutional neural network regression for short-axis left ventricle segmentation in cardiac cine MR sequences. Med Image Anal. 2017;39:78-86. 173. Rios A, Kavuluru R. Convolutional neural networks for biomedical text classification: application in indexing biomedical articles. Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics; Atlanta, Georgia: ACM; 2015. 174. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al., editors. Attention is all you need. Advances in neural information processing systems; 2017. 175. Voulodimos A, Doulamis N, Doulamis A, Protopapadakis E. Deep learning for computer vision: A brief review. Computational intelligence and neuroscience. 2018;2018. 176. Guo Y, Liu Y, Oerlemans A, Lao S, Wu S, Lew MS. Deep learning for visual understanding: A review. Neurocomputing. 2016;187:27-48. 177. Pierson HA, Gashler MS. Deep learning in robotics: a review of recent research. Advanced Robotics. 2017;31(16):821-35. 178. Garcia-Garcia A, Orts-Escolano S, Oprea S, Villena-Martinez V, Garcia-Rodriguez J. A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:170406857. 2017. 179. Kundeti SR, Vijayananda J, Mujjiga S, Kalyan M, editors. Clinical named entity recognition: Challenges and opportunities. 4th IEEE International Conference on Big Data, Big Data 2016; 2016: Institute of Electrical and Electronics Engineers Inc. 180. Zhang D, Wang DJapa. Relation classification via recurrent neural network. 2015. 181. Hochreiter SJIJoU, Fuzziness, Systems K-B. The vanishing gradient problem during learning recurrent neural nets and problem solutions. 1998;6(02):107-16. 182. Chung J, Gulcehre C, Cho K, Bengio Y, editors. Gated feedback recurrent neural networks. International conference on machine learning; 2015. 183. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. 2014. 184. Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training. URL https://s3-us-west-2 amazonaws com/openai-assets/researchcovers/languageunsupervised/language understanding paper pdf. 2018. 185. Lee H-J, Wu Y, Zhang Y, Xu J, Xu H, Roberts KJJobi. A hybrid approach to automatic de-identification of psychiatric notes. 2017;75:S19-S27. 186. Dehghan A, Kovacevic A, Karystianis G, Keane JA, Nenadic GJJobi. Learning to identify Protected Health Information by integrating knowledge-and data-driven algorithms: A case study on psychiatric evaluation notes. 2017;75:S28-S33. 187. Denny JC, Spickard III A, Johnson KB, Peterson NB, Peterson JF, Miller RAJJotAMIA. Evaluation of a method to identify and categorize section headers in clinical documents. 2009;16(6):806-15. 188. Zheng S, Lu JJ, Ghasemzadeh N, Hayek SS, Quyyumi AA, Wang FJJmi. Effective information extraction framework for heterogeneous clinical reports using online machine learning and controlled vocabularies. 2017;5(2):e12. 189. Szarvas G, Farkas R, Busa-Fekete RJJotAMIA. State-of-the-art anonymization of medical records using an iterative machine learning framework. 2007;14(5):574-80.

46

190. Meystre SM, Kim Y, Heavirland J, Williams J, Bray BE, Garvin J. Heart Failure Medications Detection and Prescription Status Classification in Clinical Narrative Documents. Stud Health Technol Inform. 2015;216:609-13. 191. Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute. 2003;95(1):14-8. 192. Varma S, Simon R. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics. 2006;7(1):91. 193. Filannino M, Stubbs A, Uzuner O. Symptom severity prediction from neuropsychiatric clinical records: Overview of 2016 CEGS N-GRID shared tasks Track 2. Journal of Biomedical Informatics. 2017;75S:S62-S70. 194. Velupillai S, Suominen H, Liakata M, Roberts A, Shah AD, Morley K, et al. Using clinical Natural Language Processing for health outcomes research: Overview and actionable suggestions for future advances. Journal of biomedical informatics. 2018;88:11-9. 195. Ruder S, Peters ME, Swayamdipta S, Wolf T, editors. Transfer learning in natural language processing. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials; 2019. 196. Mou L, Meng Z, Yan R, Li G, Xu Y, Zhang L, et al. How transferable are neural networks in nlp applications? arXiv preprint arXiv:160306111. 2016. 197. Zhang Y, Yang Q. A survey on multi-task learning. arXiv preprint arXiv:170708114. 2017. 198. Crichton G, Pyysalo S, Chiu B, Korhonen A. A neural network multi-task learning approach to biomedical named entity recognition. BMC Bioinformatics. 2017;18(1):368. 199. Wang X, Lyu J, Dong L, Xu K. Multitask learning for biomedical named entity recognition with cross-sharing structure. BMC Bioinformatics. 2019;20(1):427. 200. Weng W-H, Cai Y, Lin A, Tan F, Chen P-HC. Multimodal Multitask Representation Learning for Pathology Biobank Metadata Prediction. arXiv preprint arXiv:190907846. 2019. 201. Nagpal C. Deep Multimodal Fusion of Health Records and Notes for Multitask Clinical Event Prediction. 202. Du M, Liu N, Hu XJCotA. Techniques for interpretable machine learning. 2019;63(1):68-77. 203. Ahmad MA, Eckert C, Teredesai A, editors. Interpretable machine learning in healthcare. Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics; 2018: ACM. 204. Ribeiro MT, Singh S, Guestrin C, editors. Why should i trust you?: Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining; 2016: ACM. 205. Molnar C. Interpretable machine learning: Lulu. com; 2019. 206. Lipton ZC. The mythos of model interpretability. arXiv preprint arXiv:160603490. 2016. 207. Sohn S, Wang Y, Wi C-I, Krusemark EA, Ryu E, Ali MH, et al. Clinical documentation variations and NLP system portability: a case study in asthma birth cohorts across institutions. Journal of the American Medical Informatics Association. 2017;30:30. 208. Kirby JC, Speltz P, Rasmussen LV, Basford M, Gottesman O, Peissig PL, et al. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability. Journal of the American Medical Informatics Association. 2016;23(6):1046-52.

47

209. Xu H, Aldrich MC, Chen Q, Liu H, Peterson NB, Dai Q, et al. Validating drug repurposing signals using electronic health records: a case study of metformin associated with reduced cancer mortality. Journal of the American Medical Informatics Association. 2014;22(1):179-91. 210. Shen F, Larson DW, Naessens JM, Habermann EB, Liu H, Sohn S. Detection of surgical site infection utilizing automated feature generation in clinical notes. Journal of Healthcare Informatics Research. 2019;3(3):267-82. 211. Casteleiro MA, Demetriou G, Read W, Prieto MJF, Maroto N, Fernandez DM, et al. Deep learning meets ontologies: experiments to anchor the cardiovascular disease ontology in the biomedical literature. J Biomed Semantics. 2018;9(1):13. 212. Shen F, Peng S, Fan Y, Wen A, Liu S, Wang Y, et al. HPO2Vec+: Leveraging heterogeneous knowledge resources to enrich node embeddings for the Human Phenotype Ontology. Journal of biomedical informatics. 2019;96:103246. 213. Fernandes AC, Dutta R, Velupillai S, Sanyal J, Stewart R, Chandran DJSr. Identifying suicide ideation and suicidal attempts in a psychiatric clinical research database using natural language processing. 2018;8(1):7426. 214. Wen A, Fu S, Moon S, El Wazir M, Rosenbaum A, Kaggal VC, et al. Desiderata for delivering NLP to accelerate healthcare AI advancement and a Mayo Clinic NLP-as-a-service implementation. 2019;2(1):1-7. 215. Chapman WW, Nadkarni PM, Hirschman L, D'avolio LW, Savova GK, Uzuner O. Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. BMJ Group BMA House, Tavistock Square, London, WC1H 9JR; 2011. 216. Wagholikar K, Torii M, Jonnalagadda S, Liu H. Feasibility of pooling annotated corpora for clinical concept extraction. AMIA Summits Transl Sci Proc. 2012;2012:38. 217. Li T, Sahu AK, Talwalkar A, Smith V. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine. 2020;37(3):50-60.