A Comparison Of Non-Dictionary Based Approaches To Automate Public Health Reporting Using
Plain-Text Medical Data
Suranga N. Kasthurirathne
Premise
• Public health registries play a significant supporting role in public health care activities
• Reporting to public health registries is often delayed and incomplete
• Automated methods for identifying public health reportable cases can improve case reporting
Premise Contd.
• Many automated methods for identifying public health reportable illnesses already exist.
• However, these approaches are largely dependent on,
– Codified medical data
– Dictionary based named entity recognition (NER)
Problems
• Maintaining a complicated and constantly changing terminology is not easy
• Existing dictionaries cannot be used ‘off the shelf’
• A substantial amount of case-related data are captured as non-coded plaintext
• Accurate NER is a problem for dictionary based approaches
So Basically,
• Existing approaches are dependent on (resource heavy) codified data and dictionary based approaches
• Do we really need to go this far?
– Everything available VS. best available
Problem Statement
Compare alternative approaches for automating public health registry reporting using
– Plain text medical data
– Non-dictionary based approaches
Methods
• Obtain a convenience sample of plaintext pathology reports of suspected cancer patients from the INPC
• Using this data, evaluate the accuracy of automated cancer diagnosis based on,
– Varied feature selection approaches
– Varied feature subset sizes
– Varied decision models
The Challenges At Hand
• Feature selection: the identification of named entities that imply the presence of cancer
• Preparation of a gold standard
• The preparation of data input vectors based on selected entities
• Training and testing decision models using prepared data input vectors
Feature Selection
• Three alternate approaches
– Manual
– Informed
– Automated
• Requires analysis of plaintext data
Informed approach: Data analysis
• For each unique token (x), compute the number of positive and negative occurrences
– pos count: No. of reports with token present in positive context
– neg count: No. of reports with token present in negative context
– pos rate: Avg. no. of tokens per report in positive context
– neg rate: Avg. no. of tokens per report in negative context
– ratio: pos rate / neg rate
– odds ratio: pos count / neg count
Automated Approach: Data Analysis
• Parse plaintext reports to identify tokens and context of use
• For each report, count number of occurrences per each token and context of use
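A rough sketch of this parsing step follows. The slides do not say how context of use was determined, so the rule here is an assumption made purely for illustration: a token counts as negative if a negation cue appears earlier in the same sentence. The cue list `NEGATION_CUES` and the function name are hypothetical.

```python
import re
from collections import Counter

# Hypothetical cue list; a real system would need a more careful set.
NEGATION_CUES = {"no", "not", "without", "negative"}

def count_tokens_with_context(report_text):
    """Return a Counter keyed by (token, context) for one report."""
    counts = Counter()
    # Split into rough sentences so a cue only affects its own sentence.
    for sentence in re.split(r"[.;\n]+", report_text.lower()):
        negated = False
        for token in re.findall(r"[a-z]+", sentence):
            if token in NEGATION_CUES:
                negated = True  # everything after the cue is treated as negated
                continue
            counts[(token, "neg" if negated else "pos")] += 1
    return counts
```

Calling `count_tokens_with_context` on each report gives one count table per report, which can then be laid out as a row of a feature vector.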
Decision Modeling
• Select classification algorithms
– SLR
– RF
– J48
– NB
– IBK (K-nearest neighbor)
• Prepare data input
– Subsets of the master feature vector form decision model input
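Taking a subset of the master feature vector amounts to keeping only the columns for the chosen features. The sketch below assumes the features have already been ranked by one of the selection approaches; all names (`select_feature_subset`, `ranked_features`, and so on) are hypothetical.

```python
def select_feature_subset(master_vectors, feature_names, ranked_features, k):
    """Keep only the columns for the top-k ranked features.

    master_vectors: list of rows, one per report, aligned with feature_names.
    ranked_features: feature names ordered best-first (assumed given).
    """
    chosen = ranked_features[:k]
    index = {name: i for i, name in enumerate(feature_names)}
    columns = [index[name] for name in chosen]
    subset = [[row[c] for c in columns] for row in master_vectors]
    return chosen, subset
```

Repeating this for several values of `k` yields the varied feature subset sizes that feed each decision model.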
Decision Modeling Contd.
• Training and testing
– Weka (Waikato Environment for Knowledge Analysis)
– 10 fold cross validation
– Convenient user interfaces
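For readers unfamiliar with 10-fold cross validation, here is a plain-Python sketch of the idea. It is not Weka's implementation: the data is split into ten folds, and each fold is held out once for testing while the other nine are used for training. The `fit` and `predict` callables stand in for any classifier and are assumptions of this sketch.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # near-equal fold sizes
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def cross_validate(vectors, labels, fit, predict, k=10):
    """Return overall accuracy across k held-out folds."""
    correct = 0
    for train, test in k_fold_indices(len(vectors), k):
        model = fit([vectors[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, vectors[i]) == labels[i] for i in test)
    return correct / len(vectors)
```

Because every report is used for testing exactly once, the resulting accuracy reflects performance on data the model did not see during training.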
Our study approach
Findings
• Non-dictionary based decision modeling approaches identify cancer cases within plaintext reports with reasonable accuracy.
• Performance of the methodology varies across
– Feature subset sizes
– Feature selection approaches
– Classification algorithms
• Optimized sensitivity: Use an automated or informed feature selection approach with subset sizes of 10 or greater to build RF, SLR or J48 decision models
• Optimized specificity: Adopt a SLR decision model with any feature selection approach and subset size
• Optimized PPV: Adopt a SLR decision model using any feature selection method and subset size
• Optimized accuracy: Build a RF, SLR or J48 decision model with an informed or automated feature selection method and a subset size of ten or greater
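The four measures named above have standard definitions in terms of true and false positives and negatives. The sketch below computes them from paired gold-standard and predicted labels; the function name `diagnostic_metrics` is hypothetical, and labels are assumed to be truthy for cancer cases.

```python
def diagnostic_metrics(true_labels, predicted_labels):
    """Compute sensitivity, specificity, PPV and accuracy for binary labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives

    def ratio(numerator, denominator):
        # None marks a measure that is undefined for this sample.
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of true cases detected
        "specificity": ratio(tn, tn + fp),  # share of non-cases correctly rejected
        "ppv": ratio(tp, tp + fp),          # share of flagged reports that are true cases
        "accuracy": ratio(tp + tn, len(pairs)),
    }
```

Sensitivity and specificity pull in different directions, which is why the recommended configuration differs depending on which measure is being optimized.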
Ah, But…
• The success of the manual selection approach depends on the clinical expertise of the manual reviewers
• The results are specific to the plaintext reports used for our analysis
• Since we used only generic cancer terms, we may have missed very specific cancers
Next Steps
• The gold standard challenge
• Further analysis of the automated feature selection process and its optimization
• Dictionary vs. Non-dictionary based approaches. A conclusive study
People to blame
• Suranga Kasthurirathne
• Dr. Shaun Grannis
• Dr. Brian Dixon
• Dr. Huiping Xu
• Dr. Yuni Xia
• Dr. Judy Wawira
• Dr. Burke Mamlin
Questions