Page 1: PPTX slides - PowerPoint Presentation

2010 HDWA Annual Conference

Data Warehousing – Adding Value to Healthcare

Pathology Reports Information Extraction: An

OHNLP and UMLS Powered Approach

Naveen Ashish, Research Associate Professor

September 14th, 2010

HDWA 2010, Durham, NC

Page 2: PPTX slides - PowerPoint Presentation

All Rights Reserved, Duke Medicine 2007

Agenda

• Introduce institution, research group, and project

• Outline automated information extraction problem

• Solution

– Using open frameworks

– Open ontology resources

• Current Status

• Domain experts engagement

• Conclusions

Page 3: PPTX slides - PowerPoint Presentation

University of California, Irvine Medical Center

University of California, Irvine Medical Center is a 422-bed tertiary teaching hospital with a commitment to education, research and quality patient care. UCI Medical Center is a Magnet Designated facility with a Level 1 Trauma Center, Burn Center and Level II Neonatal Care Center.

• Not-for-Profit

• # Employees

• # ER Visits

• # Admissions

Page 4: PPTX slides - PowerPoint Presentation

Data Warehouse Profile

• UCI Clinical Informatics Team

– Director

– Informatics Solutions Architect

– Principal Statistician/Advisor

– Informatics Outreach Architect (future)

– Clinical Practice Engineer

– Clinical Research Informatics Lead

– Business Intelligence Developer (2)

– Clinical Informatics Specialist

– NLP Specialist

Page 5: PPTX slides - PowerPoint Presentation

Project Team

• Supported by UCI Medical Center Clinical Informatics Department

• Collaboration between UCI Medical Center (Clinical Informatics) and Calit2/Computer Science

• Members

– Naveen Ashish (NLP and CS Researcher)

– Lisa Dahm (Director, Biomedical Informatics)

– Charles Boicey (Informatics Architect)

Page 6: PPTX slides - PowerPoint Presentation

Vision

UCI QUP

Quest

Text Reports

Analysis

Page 7: PPTX slides - PowerPoint Presentation

(UCI) Pathology Report

Page 8: PPTX slides - PowerPoint Presentation

Reports

• Pathology reports

– Free text but “semi-structured” as well

– Nuggets of information in the text

Page 9: PPTX slides - PowerPoint Presentation

What do we want to ask?

Sample (retrieval) “questions”

Surgical Pathology

Patients with a surgical pathology report containing undifferentiated lymphoepithelioma-like gastric carcinoma.

Patients with a surgical pathology report containing spindle cell carcinoma of the breast, grade 3, margin(s) positive, node(s) positive.

Discharge Note

Patients with a discharge note containing a diagnosis of cerebrovascular accident and diabetes mellitus type II, discharged in stable condition to home.

Female patients with a discharge diagnosis of Ewing sarcoma, hypertension and obesity, discharged in stable condition to home.

Page 10: PPTX slides - PowerPoint Presentation

Thus we need

• Sections and sub-sections

• Associations

• Terms

• Dimensions

In Text

FINAL DIAGNOSIS AFTER MICROSCOPY:

LUNG, LEFT LOWER LOBE, WEDGE RESECTION:

POORLY DIFFERENTIATED ADENOCARCINOMA OF PULMONARY ORIGIN

SIZE: 1.5 CM

STAPLED RESECTION MARGIN: NEGATIVE

5 NECROSIS

EXTENSIVE FIBROSIS IS NOT PRESENT

PLEASE SEE COMMENT

FINAL DIAGNOSIS AFTER MICROSCOPY:

A. DEEP TRICEPS MARGIN, EXCISION:

POSITIVE FOR SARCOMA

B. LATERAL SUPERIOR MARGIN, EXCISION:

POSITIVE FOR SARCOMA
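As a rough illustration of what pulling sections, dimensions and margin status out of such text could look like, here is a minimal Python sketch using regular expressions. The patterns, the `extract` function and the result layout are assumptions made for this example; they are not the rules used in UCI-QUP.

```python
import re

# Sample text copied from the first report fragment on the slide.
REPORT = """FINAL DIAGNOSIS AFTER MICROSCOPY:
LUNG, LEFT LOWER LOBE, WEDGE RESECTION:
POORLY DIFFERENTIATED ADENOCARCINOMA OF PULMONARY ORIGIN
SIZE: 1.5 CM
STAPLED RESECTION MARGIN: NEGATIVE
5 NECROSIS
EXTENSIVE FIBROSIS IS NOT PRESENT
PLEASE SEE COMMENT"""

# Assumed rule: a heading is an upper-case line ending in a colon with nothing after it.
HEADING = re.compile(r"^([A-Z][A-Z ,.]*):\s*$")
# Assumed rule: a dimension is "LABEL: number unit" with unit CM or MM.
DIMENSION = re.compile(r"^([A-Z ]+):\s*(\d+(?:\.\d+)?)\s*(CM|MM)\s*$")
# Assumed rule: a margin line pairs a margin name with POSITIVE or NEGATIVE.
MARGIN = re.compile(r"^(.*MARGIN):\s*(POSITIVE|NEGATIVE)\s*$")


def extract(report):
    """Classify each line of a report as heading, dimension, or margin status."""
    result = {"headings": [], "dimensions": [], "margins": []}
    for line in report.splitlines():
        line = line.strip()
        if m := HEADING.match(line):
            result["headings"].append(m.group(1))
        elif m := DIMENSION.match(line):
            result["dimensions"].append((m.group(1), float(m.group(2)), m.group(3)))
        elif m := MARGIN.match(line):
            result["margins"].append((m.group(1), m.group(2)))
    return result


if __name__ == "__main__":
    print(extract(REPORT))
```

Lines that match none of the patterns (for example "5 NECROSIS") are simply skipped here; a real pipeline would need further rules or term lookup for them.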

Page 11: PPTX slides - PowerPoint Presentation

System

[Diagram: OHNLP and the UCI-QUP application (rules, code) populate a database (warehouse), which is accessed through a GUI, Tableau, and i2b2. Stages: Unstructured, Structured, Analysis.]

Page 12: PPTX slides - PowerPoint Presentation

Related Work

Computerized Extraction of Information on the Quality of Diabetes Care from Free Text in Electronic Patient Records of General Practitioners. Jaco Voorham, Petra Denig. JAMIA 2007;14:349-354. doi:10.1197/jamia.M2128

Application of information technology: MedEx: a medication information extraction system for clinical narratives. Hua Xu, Shane P Stenner, Son Doan, Kevin B Johnson, Lemuel R Waitman, Joshua C Denny. JAMIA 2010;17:19-24. doi:10.1197/jamia.M3378

Identifying Smokers with a Medical Extraction System. Cheryl Clark, Kathleen Good, Lesley Jezierny, Melissa Macpherson, Brian Wilson, Urszula Chajewska. JAMIA 2008;15:36-39. doi:10.1197/jamia.M2442

Automated evaluation of electronic discharge notes to assess quality of care for cardiovascular diseases using Medical Language Extraction and Encoding System (MedLEE). Jung-Hsien Chiang, Jou-Wei Lin, Chen-Wei Yang. JAMIA 2010;17:245-252. doi:10.1136/jamia.2009.000182

Using Regular Expressions to Abstract Blood Pressure and Treatment Intensification Information from the Text of Physician Notes. Alexander Turchin, Nikheel S Kolatkar, Richard W Grant, Eric C Makhni, Merri L Pendergrass, Jonathan S Einbinder. JAMIA 2006;13:691-695. doi:10.1197/jamia.M2078

Natural Language Processing Framework to Assess Clinical Conditions. Henry Ware, Charles J Mullett, V Jagannathan. JAMIA 2009;16:585-589. doi:10.1197/jamia.M3091

A General Natural-language Text Processor for Clinical Radiology. Carol Friedman, Philip O Alderson, John H M Austin, James J Cimino, Stephen B Johnson. JAMIA 1994;1:161-174. doi:10.1136/jamia.1994.95236146

Improved Identification of Noun Phrases in Clinical Radiology Reports Using a High-Performance Statistical Natural Language Parser Augmented with the UMLS Specialist Lexicon. Yang Huang, Henry J Lowe, Dan Klein, Russell J Cucina. JAMIA 2005;12:275-285. doi:10.1197/jamia.M1695

Automated Encoding of Clinical Documents Based on Natural Language Processing. Carol Friedman, Lyudmila Shagina, Yves Lussier, George Hripcsak. JAMIA 2004;11:392-402. doi:10.1197/jamia.M1552

Description of a Rule-based System for the i2b2 Challenge in Natural Language Processing for Clinical Data. Lois C Childs, Robert Enelow, Lone Simonsen, Norris H Heintzelman, Kimberly M Kowalski, Robert J Taylor. JAMIA 2009;16:571-575. doi:10.1197/jamia.M3083

Automated Detection of Adverse Events Using Natural Language Processing of Discharge Summaries. Genevieve B Melton, George Hripcsak. JAMIA 2005;12:448-457. doi:10.1197/jamia.M1794

Page 13: PPTX slides - PowerPoint Presentation

Related Work

Documents

Discharge summaries, patient notes, EMR sections, pathology or radiology reports …

Processing

Identify noun phrases, section (headings), numerical values, negations

Analysis

Extract blood pressure, medications, quality of care, smoker status, adverse events, other diagnoses

Page 14: PPTX slides - PowerPoint Presentation

Systems

Columbia

Carol Friedman et al.

MedLEE

“Black art”

Systems from defense, intelligence, and similar companies

Open Software and Tools

Medical Informatics

OHNLP

Open Health Natural Language Processing

IBM, Mayo Clinic, (NCI)

General

UIMA, GATE

Variety of lexical tools, named-entity recognizers, parsers, etc.

XAR

http://zellig.cpmc.columbia.edu/medlee/

http://incubator.apache.org/uima/

http://gate.ac.uk/

http://nlp.stanford.edu/software/lex-parser.shtml

Page 15: PPTX slides - PowerPoint Presentation

Extraction Techniques

What do we employ to achieve automated extraction?

Broad paradigms

Rule driven (expert)

Machine-learning based (trained)

Combined (most recent systems)

Multiple levels

Semi-structured data extraction

Named entity extraction

POS tagging, NE identification

Ontology driven (domain terms)

“Deep” relation level extraction

Associations

Natural Language Parsing

Page 16: PPTX slides - PowerPoint Presentation

NL Parse Illustration

(ROOT

(S

(NP

(NP (NNP Tissue))

(PP (IN between)

(NP (DT the) (CD two) (JJ surgical) (NNS clips))))

(VP (VBZ contains)

(NP

(NP (NNS foci))

(PP (IN of)

(NP

(NP (JJ ductal) (NN carcinoma))

(ADJP (FW in) (FW situ)))))

(PP

(PP (IN within)

(NP (DT a) (NN papilloma)))

(, ,)

(CONJP (RB as) (RB well) (IN as))

(PP (IN within)

(NP (NNS ducts)))))

(. .)))

nsubj(contains-7, Tissue-1)

det(clips-6, the-3)

num(clips-6, two-4)

amod(clips-6, surgical-5)

prep_between(Tissue-1, clips-6)

dobj(contains-7, foci-8)

amod(carcinoma-11, ductal-10)

prep_of(foci-8, carcinoma-11)

amod(carcinoma-11, in-12)

dep(in-12, situ-13)

det(papilloma-16, a-15)

prep_within(contains-7, papilloma-16)

prep_within(contains-7, ducts-22)

conj_and(papilloma-16, ducts-22)

“Tissue between the two surgical clips contains foci of

ductal carcinoma in situ within a papilloma, as well as

within ducts.”
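To show how such typed dependencies could be turned into associations, here is a small Python sketch that parses the dependency lines from this slide into (relation, governor, dependent) triples and groups them by governor. The function names `parse` and `related` are invented for this example, and the sketch works on the printed output only; it does not call the parser itself.

```python
import re
from collections import defaultdict

# Typed dependencies copied from the slide.
DEPENDENCIES = """nsubj(contains-7, Tissue-1)
det(clips-6, the-3)
num(clips-6, two-4)
amod(clips-6, surgical-5)
prep_between(Tissue-1, clips-6)
dobj(contains-7, foci-8)
amod(carcinoma-11, ductal-10)
prep_of(foci-8, carcinoma-11)
amod(carcinoma-11, in-12)
dep(in-12, situ-13)
det(papilloma-16, a-15)
prep_within(contains-7, papilloma-16)
prep_within(contains-7, ducts-22)
conj_and(papilloma-16, ducts-22)"""

# relation(governor-index, dependent-index)
DEP = re.compile(r"^(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)$")


def parse(text):
    """Return a list of (relation, governor, dependent) triples."""
    triples = []
    for line in text.splitlines():
        m = DEP.match(line.strip())
        if m:
            rel, gov, _gov_idx, dep, _dep_idx = m.groups()
            triples.append((rel, gov, dep))
    return triples


def related(triples, governor):
    """Group the dependents of one governor word by relation name."""
    out = defaultdict(list)
    for rel, gov, dep in triples:
        if gov == governor:
            out[rel].append(dep)
    return dict(out)


if __name__ == "__main__":
    print(related(parse(DEPENDENCIES), "contains"))
```

For the verb "contains", this groups the subject (Tissue), the object (foci) and the two "within" locations (papilloma, ducts), which is the kind of association a deeper extraction step would need.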

Page 17: PPTX slides - PowerPoint Presentation

MedLEE Illustration

Page 18: PPTX slides - PowerPoint Presentation

OHNLP

• OHNLP

– Open Health Natural Language Processing Consortium

• IBM and Mayo Clinic are founding partners

• caBIG/NCI supported

– Open-source consortium promoting the use of UIMA

• Features

– Built upon Apache UIMA

• Annotators, Pipelines

– Medical domain

• MedKAT/P (IBM)

– Pathology reports extraction

• cTAKES (Mayo Clinic)

– Clinical data

Page 19: PPTX slides - PowerPoint Presentation

Rationale for OHNLP

• Based on UIMA

– Open source

– Community of developers

• OHNLP itself

– NCI

• IBM, Mayo

– MedKAT/P and cTAKES

– Two-way benefits

• Adopt

• Contribute back

Page 20: PPTX slides - PowerPoint Presentation

MedKAT Annotations

Page 21: PPTX slides - PowerPoint Presentation

“Programming” UIMA

• OHNLP based on UIMA

• UIMA composed of “Analysis Engines”

– Primitive

– Aggregate

Primitive Engine

(section headings)

Primitive Engine

(numerical)

Primitive Engine

(dict terms)
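The primitive/aggregate idea can be sketched in a few lines of Python. This is a conceptual illustration only: it does not use the UIMA API (which is Java, configured through XML descriptors), and `Annotation`, `Document`, `regex_engine` and `aggregate` are names made up for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str
    begin: int
    end: int
    text: str


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)


def regex_engine(kind, pattern):
    """Build a 'primitive engine': one function that adds one kind of annotation."""
    compiled = re.compile(pattern, re.MULTILINE)

    def engine(doc):
        for m in compiled.finditer(doc.text):
            doc.annotations.append(Annotation(kind, m.start(), m.end(), m.group(0)))

    return engine


def aggregate(*engines):
    """Build an 'aggregate engine' that runs its member engines in order."""

    def engine(doc):
        for member in engines:
            member(doc)

    return engine


# Three primitive engines mirroring the boxes on the slide (patterns are assumptions).
section_headings = regex_engine("heading", r"^[A-Z][A-Z ,]*:$")
numerical = regex_engine("number", r"\d+(?:\.\d+)?")
dict_terms = regex_engine("term", r"\b(?:ADENOCARCINOMA|SARCOMA)\b")

pipeline = aggregate(section_headings, numerical, dict_terms)

doc = Document("FINAL DIAGNOSIS AFTER MICROSCOPY:\nPOORLY DIFFERENTIATED ADENOCARCINOMA\nSIZE: 1.5 CM")
pipeline(doc)

if __name__ == "__main__":
    for a in doc.annotations:
        print(a.kind, a.begin, a.end, a.text)
```

Because an aggregate has the same call shape as a primitive, aggregates can themselves be nested inside larger aggregates, which is the property the slide's two engine types rely on.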

Page 22: PPTX slides - PowerPoint Presentation

Descriptors, Resources

Page 23: PPTX slides - PowerPoint Presentation

Analysis Engines

• Developing “UCI-QUP”

– UCI Quest UIMA Pipeline

• Analysis Engines

– Recognize sections and sub-sections

• Regular expressions

– Significant terms

• Medical terms

– Existing dictionary in MedKAT/P

• Useful, not complete

– Integrate additional terminology
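Dictionary-based spotting of medical terms can be sketched as a longest-match lookup over tokens. The mini-dictionary and the `spot_terms` function below are hypothetical and stand in for the much larger MedKAT/P dictionary and any additional terminology; this is not how MedKAT/P itself performs lookup.

```python
import re

# Hypothetical mini-dictionary: lower-cased surface form -> canonical concept name.
DICTIONARY = {
    "ductal carcinoma in situ": "Ductal Carcinoma In Situ",
    "carcinoma": "Carcinoma",
    "papilloma": "Papilloma",
}

SENTENCE = ("Tissue between the two surgical clips contains foci of ductal "
            "carcinoma in situ within a papilloma, as well as within ducts.")


def spot_terms(text, dictionary):
    """Return canonical names of dictionary terms found in text, longest match first."""
    tokens = re.findall(r"\w+", text.lower())
    max_len = max(len(key.split()) for key in dictionary)
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at token i, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in dictionary:
                found.append(dictionary[candidate])
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return found


if __name__ == "__main__":
    print(spot_terms(SENTENCE, DICTIONARY))
```

Preferring the longest match keeps "ductal carcinoma in situ" as one concept instead of reporting the more general "carcinoma" inside it.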

Page 24: PPTX slides - PowerPoint Presentation

AE Terms

• Good resources

– NCI Thesaurus

• Cancer related

• > 500,000 terms/concepts

– NCI Metathesaurus

• Several million concepts

• Developed

– Converter

• NCI Thesaurus → UIMA Dictionary Resource

– Application

– Database

• MySQL
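A converter of this kind might look roughly like the following Python sketch. Everything format-related here is an assumption for illustration: the input is a simplified tab-delimited layout (code, preferred name, pipe-separated synonyms) rather than the actual NCI Thesaurus distribution format, the codes are placeholders, and the XML element and attribute names are invented rather than the real UIMA dictionary resource schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical input rows: code <TAB> preferred name <TAB> synonyms separated by "|".
SAMPLE_TSV = (
    "X0001\tDuctal Carcinoma In Situ\tDCIS|Intraductal Carcinoma\n"
    "X0002\tPapilloma\t\n"
)


def convert(tsv_text):
    """Convert term rows into an XML dictionary tree (element names are invented)."""
    root = ET.Element("dictionary")
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        code, name, synonyms = row
        entry = ET.SubElement(root, "entry", code=code, canonical=name)
        variants = [name] + [s for s in synonyms.split("|") if s and s != name]
        for variant in variants:
            ET.SubElement(entry, "variant", text=variant)
    return root


if __name__ == "__main__":
    print(ET.tostring(convert(SAMPLE_TSV), encoding="unicode"))
```

The preferred name is written as the first variant so that a lookup against variants alone also finds the canonical form.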

Page 25: PPTX slides - PowerPoint Presentation

Architecture

Pathology Reports (Unstructured)

OHNLP

Extracted Data (Structured)

UCI Quest UIMA Pipeline

Knowledge Sources

Page 26: PPTX slides - PowerPoint Presentation

UMLS and Metathesaurus

Page 27: PPTX slides - PowerPoint Presentation

UMLS

• UMLS

– Obtained system from NLM

– Installed successfully on informatics-nlp

– Features

• Browse concepts and relationships

• Flat files

• DB import

– Being integrated into UCI-QUP
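The slide does not say which UMLS flat files were imported or how. As one plausible sketch, the Metathesaurus concept names file MRCONSO.RRF is pipe-delimited with 18 fields per row and a trailing pipe, and could be loaded into a relational table as below. SQLite is used here only to keep the example self-contained, and the two sample rows use made-up identifiers, not real UMLS content.

```python
import sqlite3

# Field order of MRCONSO.RRF in the UMLS Rich Release Format.
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]

# Made-up rows in MRCONSO layout (placeholder identifiers, for illustration only).
SAMPLE_ROWS = [
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||NCI|PT|X0001|Ductal Carcinoma In Situ|0|N||",
    "C0000001|ENG|S|L0000002|PF|S0000002|Y|A0000002||||NCI|SY|X0001|DCIS|0|N||",
]


def load_mrconso(lines, conn):
    """Create a mrconso table and insert one row per well-formed input line."""
    column_defs = ", ".join(f"{name} TEXT" for name in MRCONSO_COLUMNS)
    conn.execute(f"CREATE TABLE mrconso ({column_defs})")
    placeholders = ", ".join("?" for _ in MRCONSO_COLUMNS)
    for line in lines:
        # Rows end with a trailing "|", so drop the empty field after it.
        fields = line.rstrip("\n").split("|")[:len(MRCONSO_COLUMNS)]
        if len(fields) == len(MRCONSO_COLUMNS):
            conn.execute(f"INSERT INTO mrconso VALUES ({placeholders})", fields)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_mrconso(SAMPLE_ROWS, conn)
names = conn.execute(
    "SELECT STR FROM mrconso WHERE CUI = ? ORDER BY AUI", ("C0000001",)
).fetchall()

if __name__ == "__main__":
    print(names)
```

Once the names are in a table, every string sharing a CUI can serve as a synonym for the same concept during term lookup.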

Page 28: PPTX slides - PowerPoint Presentation

Our Contribution to OHNLP

• We indeed adopted

– Framework

– Relevant “resources”

• Contribute to overall OHNLP effort

– Specific analysis engines

• Sections and sub-sections in pathology reports

• Significant items

• Dictionary terms (UMLS integration)

• …

• Contribute as a project back to OHNLP

Page 29: PPTX slides - PowerPoint Presentation

Database Schema

Page 30: PPTX slides - PowerPoint Presentation

Demo

Page 31: PPTX slides - PowerPoint Presentation

SQL Queries

• Example (possible) queries

SELECT reportid

FROM collection

WHERE

(sectioncontent LIKE '%carcinoma%') AND (heading LIKE '%tumor%')
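A query of this shape can be tried end to end with a throwaway table. The sketch below uses Python's built-in sqlite3 module; the column names `reportid`, `heading` and `sectioncontent` come from the slide, while the column types and the three sample rows are assumptions, since the actual warehouse schema is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal layout: one row per extracted section of a report.
conn.execute(
    "CREATE TABLE collection (reportid INTEGER, heading TEXT, sectioncontent TEXT)"
)
conn.executemany(
    "INSERT INTO collection VALUES (?, ?, ?)",
    [
        (1, "TUMOR TYPE", "Invasive ductal carcinoma"),
        (2, "SPECIMEN", "Partial breast"),
        (3, "TUMOR SIZE", "1.5 cm"),
    ],
)

# Same filter as the slide's query, with the patterns passed as parameters.
rows = conn.execute(
    "SELECT reportid FROM collection "
    "WHERE sectioncontent LIKE ? AND heading LIKE ?",
    ("%carcinoma%", "%tumor%"),
).fetchall()

if __name__ == "__main__":
    print(rows)
```

Note that `LIKE` is case-insensitive for ASCII text in SQLite by default; in other databases, including MySQL, this depends on the collation in use, so upper-case headings may need explicit case handling.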

Page 32: PPTX slides - PowerPoint Presentation

Interfaces

• i2b2

• Tableau

Page 33: PPTX slides - PowerPoint Presentation

Guide For Fields

• College of American Pathologists (CAP)

– Detailed protocols

Specimen (Note A)

___ Partial breast

___ Total breast (including nipple and skin)

___ Other (specify): ____________________________

___ Not specified

Procedure (Note A)

___ Excision without wire-guided localization

___ Excision with wire-guided localization

___ Total mastectomy (including nipple and skin)

___ Other (specify): ____________________________

___ Not specified

Lymph Node Sampling (select all that apply) (Note B)

___ No lymph nodes present

___ Sentinel lymph node(s)

___ Axillary dissection (partial or complete dissection)

___ Lymph nodes present within the breast specimen (ie, intramammary lymph nodes)

___ Other lymph nodes (eg, supraclavicular or location not identified)

Specify location, if provided: _________________________

Specimen Integrity

___ Single intact specimen (margins can be evaluated)

___ Multiple designated specimens (eg, main excisions and identified margins)

___ Fragmented (margins cannot be evaluated with certainty)

___ Other (specify): __________________________________

Specimen Size (for excisions less than total mastectomy)

Page 34: PPTX slides - PowerPoint Presentation

Implications

• Multiple specific extraction and distillation techniques

• Section, sub-section segmentation

• Term spotting

• Associations

• Negation (Absence) and Assertion (Presence)

• Dimensions

• Expressions

• Full NL Parse where required
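Of the techniques listed, negation versus assertion is easy to illustrate with a deliberately crude sketch: label a term as absent if a negation cue appears anywhere in the same sentence. The cue list and the `assertion_status` function are assumptions for this example; real negation detection would limit the scope of each cue instead of treating the whole sentence as negated.

```python
import re

# Assumed cue list; a real system would use a curated, scoped set of triggers.
NEGATION = re.compile(r"\b(?:NO|NOT|NEGATIVE FOR|WITHOUT|ABSENT)\b", re.IGNORECASE)


def assertion_status(sentence, term):
    """Return 'absent', 'present', or None if the term is not in the sentence."""
    if term.lower() not in sentence.lower():
        return None
    return "absent" if NEGATION.search(sentence) else "present"


if __name__ == "__main__":
    print(assertion_status("EXTENSIVE FIBROSIS IS NOT PRESENT", "fibrosis"))
    print(assertion_status("POSITIVE FOR SARCOMA", "sarcoma"))
```

Both example sentences are taken from the report fragments shown earlier in the deck.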

Page 35: PPTX slides - PowerPoint Presentation

Current Status

• System first version

– Creation of database for data warehouse

• QUEST “compliant”

– Meta-thesaurus integration

– Retrieval

• SQL and UI

• Tableau

• i2b2

– Star schema

Page 36: PPTX slides - PowerPoint Presentation

Presentation Content Continued

• Direction

– Demonstrate value to researchers

– CTSA Investigators

• Lessons learned

– Open source frameworks very useful!

• Reuse external solutions, resources

• Our solutions can be adopted

– Approach appears scalable

– Domain expert engagement essential

Page 37: PPTX slides - PowerPoint Presentation

Content

• What went well

– UIMA and MedKAT choice

– UMLS integration

• What would you do differently

– Project is in early stage

– Technical and framework choices seem right

– Will learn more as we engage domain experts

• What will provide value to investigators?

Page 38: PPTX slides - PowerPoint Presentation

Summary

• Comprehensive approach to detailed information extraction from pathology reports

• Exploiting open source and programmable frameworks (UIMA)

• Integration of UMLS

• Contribution of pipeline

• Engagement of domain experts

Page 39: PPTX slides - PowerPoint Presentation

Presenter(s) Contact Information

• Contact information

– Naveen Ashish

[email protected]

– http://www.ics.uci.edu/~ashish