Regenstrief WIP 07012015

A Comparison Of Non-Dictionary Based Approaches To Automate Public Health Reporting Using Plain-Text Medical Data

Suranga N. Kasthurirathne

Premise

• Public health registries play a significant supporting role in public health care activities

• Reporting to public health registries is often delayed and incomplete

• Automated methods for identifying public health reportable cases can improve case reporting

Premise Contd.

• Many automated methods for identifying public health reportable illnesses already exist.

• However, these approaches depend largely on:

– Codified medical data

– Dictionary based named entity recognition (NER)

Problems

• Maintaining a complicated and constantly changing terminology is not easy

• Existing dictionaries cannot be used ‘off the shelf’

• A substantial amount of case-related data are captured as non-coded plaintext

• Accurate NER is a problem for dictionary based approaches

So Basically,

• Existing approaches are dependent on (resource heavy) codified data and dictionary based approaches

• Do we really need to go this far?

– Everything available VS. best available

Problem Statement

Compare alternative approaches for automating public health registry reporting using

– Plain text medical data

– Non-dictionary based approaches

Methods

• Obtain a convenience sample of plaintext pathology reports of suspected cancer patients from the INPC

• Using this data, evaluate the accuracy of automated cancer diagnosis based on,

– Varied feature selection approaches

– Varied feature subset sizes

– Varied decision models

The Challenges At Hand

• Feature selection: the identification of named entities that imply the presence of cancer

• Preparation of a gold standard

• The preparation of data input vectors based on selected entities

• Training and testing decision models using prepared data input vectors
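
The input vector step above can be sketched as follows. This is a minimal illustration, assuming each report has already been reduced to a list of tokens and that each vector holds one occurrence count per selected entity; the function names are hypothetical and are not taken from the study.

```python
from collections import Counter


def build_input_vector(report_tokens, selected_features):
    """Return one occurrence count per selected feature, in feature order."""
    counts = Counter(report_tokens)
    return [counts[feature] for feature in selected_features]


def build_dataset(reports, gold_labels, selected_features):
    """Pair each report's input vector with its gold standard label."""
    if len(reports) != len(gold_labels):
        raise ValueError("each report needs exactly one gold standard label")
    return [
        (build_input_vector(tokens, selected_features), label)
        for tokens, label in zip(reports, gold_labels)
    ]
```

Each (vector, label) pair would then serve as one training or testing instance for a decision model.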

Feature Selection

• Three alternate approaches

– Manual

– Informed

– Automated

Requires analysis of plaintext data

Informed approach: Data analysis

• For each unique token (x):

– No. of positive occurrences and negative occurrences

– pos count: No. of reports with token present in positive context

– neg count: No. of reports with token present in negative context

– pos rate: Avg. no. of tokens per report in positive context

– neg rate: Avg. no. of tokens per report in negative context

– ratio: pos rate / neg rate

– odds ratio: pos count / neg count

Automated Approach: Data Analysis

• Parse plaintext reports to identify tokens and context of use

• For each report, count number of occurrences per each token and context of use
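
A minimal sketch of the parsing and counting steps above. The context rule used here is a deliberately naive stand-in: any token that follows a negation cue within the same sentence is treated as negative context. The cue list and the sentence splitting are assumptions made for illustration, not the method used in the study.

```python
import re
from collections import Counter

# Hypothetical cue list; a real system would need a far richer one.
NEGATION_CUES = {"no", "not", "without", "negative"}


def count_tokens(report_text):
    """Count occurrences of each (token, context) pair in one report."""
    counts = Counter()
    # Split on sentence-like boundaries so a cue only affects its own sentence.
    for sentence in re.split(r"[.;\n]+", report_text.lower()):
        context = "pos"
        for token in re.findall(r"[a-z]+", sentence):
            if token in NEGATION_CUES:
                context = "neg"  # everything after the cue is negated
                continue         # the cue itself is not counted
            counts[(token, context)] += 1
    return counts
```

For example, in "Invasive carcinoma present. No carcinoma in margins." the token "carcinoma" is counted once in positive context and once in negative context.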

Decision Modeling

• Select classification algorithms

– SLR

– RF

– J48

– NB

– IBk (k-nearest neighbor)

• Prepare data input

– Subsets of the master feature vector form decision model input
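
One way to derive a feature subset from the master feature vector is to rank features by a score and keep the top k columns. The sketch below assumes a per-feature score is already available (for instance, from the token analysis); the scoring source and the function name are hypothetical.

```python
def top_k_subset(master_features, scores, vectors, k):
    """Keep the k highest-scoring features and the matching vector columns.

    master_features: feature names, one per column of the master vector
    scores: one numeric score per feature (higher means more useful)
    vectors: list of master feature vectors, one per report
    """
    ranked = sorted(range(len(master_features)),
                    key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # preserve the original column order
    subset_features = [master_features[i] for i in keep]
    subset_vectors = [[row[i] for i in keep] for row in vectors]
    return subset_features, subset_vectors
```

Repeating this for several values of k would give the varied feature subset sizes that are compared across decision models.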

Decision Modeling Contd.

• Training and testing

– Weka (Waikato Environment for Knowledge Analysis)

– 10 fold cross validation

– Convenient user interfaces
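
The study ran cross validation inside Weka. The Python sketch below only illustrates the general idea of 10 fold cross validation and is not Weka's implementation. The `fit` callable is a hypothetical interface: it takes training vectors and labels and returns a prediction function.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(vectors, labels, fit, k=10, seed=0):
    """Return overall accuracy across k folds.

    fit(train_vectors, train_labels) must return a predict(vector) callable.
    """
    correct = 0
    for train, test in k_fold_indices(len(labels), k, seed):
        predict = fit([vectors[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(vectors[i]) == labels[i] for i in test)
    return correct / len(labels)
```

Because every report lands in exactly one test fold, each report contributes to the accuracy estimate once and is never scored by a model that saw it during training.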

Our study approach

Findings

• Non-dictionary based decision modeling approaches identify cancer cases within plaintext reports with reasonable accuracy.

• Performance of the methodology varies across

– Feature subset sizes

– Feature selection approaches

– Classification algorithms

• Optimized sensitivity: Use an automated or informed feature selection approach with a subset size of 10 or greater to build an RF, SLR or J48 decision model

• Optimized specificity: Adopt an SLR decision model with any feature selection approach and subset size

• Optimized PPV: Adopt an SLR decision model with any feature selection approach and subset size

• Optimized accuracy: Build an RF, SLR or J48 decision model with an informed or automated feature selection approach and a subset size of 10 or greater
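
For reference, the four measures named above follow from the confusion matrix in the standard way. The sketch below assumes labels are encoded as 1 (cancer) and 0 (no cancer), which is an illustrative choice.

```python
def classification_metrics(actual, predicted):
    """Return sensitivity, specificity, PPV and accuracy for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)  # true positives
    tn = sum(a == 0 and p == 0 for a, p in pairs)  # true negatives
    fp = sum(a == 0 and p == 1 for a, p in pairs)  # false positives
    fn = sum(a == 1 and p == 0 for a, p in pairs)  # false negatives

    def safe_div(numerator, denominator):
        # A measure is undefined (None) when its denominator is zero.
        return numerator / denominator if denominator else None

    return {
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "ppv": safe_div(tp, tp + fp),
        "accuracy": safe_div(tp + tn, len(pairs)),
    }
```

Sensitivity and specificity pull in different directions as a model becomes more or less willing to flag a report, which is one reason the best configuration can differ by measure.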

Ah, But…

• The success of the manual selection approach depends on the clinical expertise of the manual reviewers

• The results are specific to the plaintext reports used for our analysis

• Since we used only generic cancer terms, we may have missed very specific cancers

Next Steps

• The gold standard challenge

• Further analysis of the automated feature selection process and its optimization

• Dictionary vs. non-dictionary based approaches: a conclusive study

People to blame

• Suranga Kasthurirathne

• Dr. Shaun Grannis

• Dr. Brian Dixon

• Dr. Huiping Xu

• Dr. Yuni Xia

• Dr. Judy Wawira

• Dr. Burke Mamlin

Questions

Data & Analytics
