Summary
- Typical content of a generic electronic health record (EHR) system
- How data mining on health data fills knowledge gaps and assists informed clinical decision making
- How the integration of EHR and genetic data with systems biology approaches facilitates
genotype–phenotype association studies
Classification of Electronic Health Record (EHR) Data
• Administrative data
- Data that serve administrative purposes
• Ancillary clinical data
- Provided by laboratories, pharmacies, and radiology and medical imaging departments
(Genotype and sequence data are another ancillary source of potentially structured data.)
• Clinical text
- Written or dictated clinical narratives
Why Integrate Health Care Data
• Correlating Clinical Features
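- Some distinct diseases share the same symptoms or tend to occur together (co-occurrence).
- Linking common observations to important diseases can reveal new disease signs.
• Prediction from Data
- In some situations, previously discovered correlations or other known facts can be used to build a clinical decision model that gives physicians a reference for predicting a patient's condition.
• Patient Stratification
- Group patients into clusters; patients in the same group usually show similar symptoms.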
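A minimal sketch of the co-occurrence idea under "Correlating Clinical Features", assuming each patient record has already been reduced to a set of diagnosis labels. The patient IDs, the labels and the function name `cooccurrence_counts` are hypothetical and only illustrate the counting step; a real comorbidity study would also test whether a pair occurs more often than expected by chance.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: patient id -> set of diagnosis labels.
patients = {
    "p1": {"diabetes", "hypertension"},
    "p2": {"diabetes", "hypertension", "obesity"},
    "p3": {"asthma"},
    "p4": {"diabetes", "obesity"},
}

def cooccurrence_counts(records):
    """Count how many patients carry each unordered pair of diagnoses."""
    counts = Counter()
    for diagnoses in records.values():
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(diagnoses), 2):
            counts[pair] += 1
    return counts

pair_counts = cooccurrence_counts(patients)
```

Pairs with high counts are candidates for a closer look; patients who share many such pairs could also be grouped together as a crude form of stratification.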
Electronic health record content
The electronic health record (EHR) of a patient can be viewed as a repository of information
regarding his or her health status in a computer-readable form. An encounter with the
health-care system generates various types of patient-linked data.
Four ways to analyze EHR data
1) Comorbidity 2) Machine Learning 3) Patient Clustering 4) Cohort Querying
Dealing with Clinical Text
Using natural language processing (NLP)
1. Sentence boundary detection splits the text into individual sentences.
2. Tokenization splits each sentence into individual tokens (typically individual words), using spaces and punctuation as a guide,
with rules for handling special cases such as dates.
3. Normalization reduces tokens to a base form.
4. Part-of-speech tagging assigns each token a tag that identifies its grammatical category in context.
5. Shallow parsing identifies syntactic units, most importantly noun phrases (NPs), which are grammatical units built from a
noun with optional modifiers such as adjectives.
6. NPs and various lexical permutations of them are then mapped to controlled vocabularies.
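The steps above can be sketched with the Python standard library alone. This is a simplified illustration, not a production pipeline: part-of-speech tagging and noun-phrase chunking (steps 4 and 5) are skipped and replaced by a lookup of short token windows, and the vocabulary, concept IDs and lemma table are hypothetical placeholders rather than entries from a real controlled vocabulary.

```python
import re

# Hypothetical controlled vocabulary: normalized phrase -> placeholder concept ID.
VOCAB = {
    "chest pain": "CONCEPT_001",
    "diabetes mellitus": "CONCEPT_002",
}

# Hypothetical lemma table standing in for a real normalizer.
LEMMAS = {"pains": "pain", "complained": "complain"}

def split_sentences(text):
    """Step 1: naive boundary detection on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Step 2: keep ISO-style dates whole, otherwise split into word tokens."""
    return re.findall(r"\d{4}-\d{2}-\d{2}|\w+", sentence)

def normalize(token):
    """Step 3: lowercase and map to a base form when one is known."""
    token = token.lower()
    return LEMMAS.get(token, token)

def map_concepts(tokens):
    """Step 6 (simplified): look up every 1- to 3-token window in the vocabulary."""
    found = []
    for n in (3, 2, 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in VOCAB:
                found.append((phrase, VOCAB[phrase]))
    return found

def extract_concepts(text):
    results = []
    for sentence in split_sentences(text):
        tokens = [normalize(t) for t in tokenize(sentence)]
        results.extend(map_concepts(tokens))
    return results

note = "Patient seen on 2015-03-01. Complained of chest pains. History of Diabetes Mellitus."
concepts = extract_concepts(note)
```

Normalizing before the lookup is what lets "chest pains" and "Diabetes Mellitus" match their vocabulary entries despite the plural and the capitalization.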
How the System Is Actually Implemented
Take the health-care system “GEMINI” as an example
The GEMINI system consists of two components:
1. The PROFILING component extracts the data of each patient from various sources and stores them as
information in a patient profile graph.
2. The ANALYTICS component analyzes the patient profile graphs to infer implicit information
and extract relevant features for the prediction tasks.
(Whole view of GEMINI)
Input of GEMINI
- Clinical Data
The repository has multiple sources of patient data: 1) structured sources containing patients’ demographics, lab test results,
medication history, etc., 2) unstructured data sources storing free-text doctor’s notes.
- Medical Knowledge Base
GEMINI uses UMLS, a well-known medical knowledge base, to interpret unstructured doctors’ notes, i.e., to
identify medical concepts (e.g., diabetes mellitus) and relationships between concepts (e.g., HbA1c measures control of
diabetes mellitus).
How to do “Patient Profiling”
- This component uses NLP engines to extract named entities, called mentions. It then applies collective inference to
simultaneously map mentions to their semantically matched concepts in the knowledge base and to discover additional
relationships.
- To improve the accuracy of this process, the component asks doctors to verify or corroborate the mention-concept mappings
and concept relationships it identifies.
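A rough sketch of the mention-to-concept step with a doctor review queue. Note the substitution: GEMINI is described as using collective inference, whereas this sketch uses plain per-mention string similarity from `difflib` only to show the data flow. The concept IDs, the knowledge-base entries, and the names `map_mention` and `build_profile` are all hypothetical.

```python
import difflib

# Hypothetical knowledge-base excerpt: concept name -> placeholder concept ID.
KB_CONCEPTS = {"diabetes mellitus": "C_DM", "hba1c": "C_HBA1C"}

def map_mention(mention, cutoff=0.6):
    """Return (concept_id, needs_review) for one extracted mention."""
    key = mention.lower().strip()
    if key in KB_CONCEPTS:
        return KB_CONCEPTS[key], False  # exact match, accepted as is
    close = difflib.get_close_matches(key, list(KB_CONCEPTS), n=1, cutoff=cutoff)
    if close:
        return KB_CONCEPTS[close[0]], True  # fuzzy match, ask a doctor to verify
    return None, True  # no candidate found, ask a doctor to map it

def build_profile(patient_id, mentions):
    """Build a tiny patient profile graph as a list of edges plus a review queue."""
    graph = {"patient": patient_id, "edges": [], "review_queue": []}
    for mention in mentions:
        concept, needs_review = map_mention(mention)
        if concept is not None:
            # Edge format: (patient, concept, verified flag).
            graph["edges"].append((patient_id, concept, not needs_review))
        if needs_review:
            graph["review_queue"].append((mention, concept))
    return graph

profile = build_profile("patient-1", ["Diabetes Mellitus", "diabetis melitus", "xyzzy"])
```

Only uncertain mappings reach the review queue, which keeps the verification workload for doctors small.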
How to do “Healthcare Analytics”
The ANALYTICS component of GEMINI consists of three major steps:
1) Feature Selection
- All features that are contained in the patient profile graphs can be used as features for the analytics tasks. ANALYTICS can derive
implicit and also important features with expert input from the healthcare professionals.
2) Training Data Labelling
- Leverage doctors’ input to label a small number of patients with the most informative data (i.e., patient profile graphs) to derive a
training set
- The goal is a diverse set of labeled patients that covers as much of the data space as possible
- Avoid overwhelming the doctors with too much information
3) Analytics Algorithms
- Conventional analytics algorithms, such as classification, clustering and prediction, perform the various analytics tasks
- Expert rules/heuristics (e.g., majority voting) may also be applied to the analytics tasks
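As an illustration of the majority-voting heuristic mentioned above, the sketch below combines three rule-based predictors. The rules, thresholds, feature names and labels are hypothetical, chosen only to make the example runnable; they are not clinical guidance and are not taken from GEMINI.

```python
from collections import Counter

# Hypothetical expert rules; thresholds are illustrative only.
def rule_hba1c(patient):
    return "high_risk" if patient["hba1c"] >= 8.0 else "low_risk"

def rule_age(patient):
    return "high_risk" if patient["age"] >= 65 else "low_risk"

def rule_admissions(patient):
    return "high_risk" if patient["prior_admissions"] >= 2 else "low_risk"

RULES = (rule_hba1c, rule_age, rule_admissions)

def majority_vote(labels):
    """Return the label predicted by the largest number of voters."""
    return Counter(labels).most_common(1)[0][0]

def predict(patient):
    """Apply every rule to the patient and combine the votes."""
    return majority_vote([rule(patient) for rule in RULES])

patient = {"hba1c": 9.1, "age": 50, "prior_admissions": 3}
label = predict(patient)
```

Using an odd number of voters avoids ties for a two-class problem; the same `majority_vote` function could combine the outputs of trained classifiers instead of hand-written rules.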
How to Implement the “Supporting Platform”
Using “epiC”
GEMINI uses a flexible parallel processing framework (epiC) to support:
1) Distributed data storage that effectively partitions clinical data and stores it across multiple nodes.
2) Scalable NLP processing and data analytics that involve various computation models, such as the MapReduce
model for entity extraction, the Pregel model for graphical inference, deep learning for analytics, etc.
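To make the MapReduce model for entity extraction concrete, here is a single-process sketch in plain Python. It does not use epiC or any distributed runtime; the partitions, the term dictionary and the function names are hypothetical, and the matching is naive substring search (it ignores negation, for example).

```python
from collections import defaultdict

# Hypothetical entity dictionary.
TERMS = ("diabetes", "hypertension")

def map_phase(note):
    """Map: emit (entity, 1) for each dictionary term found in one note."""
    text = note.lower()
    return [(term, 1) for term in TERMS if term in text]

def shuffle(pairs):
    """Shuffle: group emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the values for each entity."""
    return {key: sum(values) for key, values in grouped.items()}

# Each inner list stands for the notes stored on one node.
partitions = [
    ["Patient has diabetes.", "Hypertension noted."],
    ["Diabetes and hypertension, both controlled."],
]

mapped = [pair for part in partitions for note in part for pair in map_phase(note)]
entity_counts = reduce_phase(shuffle(mapped))
```

Because `map_phase` looks at one note at a time, the map step can run independently on every partition, which is what makes the model easy to scale out.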
Linking to the molecular level
- Integrating genetics
- Systems biology and gene-network-based decision support
Limiting factors — key problems to overcome
- Privacy, autonomy and consent
- Interoperability across institutions, countries and continents
References
• GEMINI: An Integrative Healthcare Analytics System
• epiC: an Extensible and Scalable System for Processing Big Data
• Semantics Driven Approach for Knowledge Acquisition From EMRs
• Mining electronic health record toward better research applications and clinical care
• Using electronic health records to drive discovery in disease genomics
• Contextual Crowd Intelligence
• Opportunities for genomic clinical decision support interventions
• The role of primary care in early detection and follow-up of cancer