Text Mining - as Normal as Data Mining?
Andrew Hinton, Application Specialist
IISDV 2016, Tuesday 19th April 2016, Nice
Agenda
Introduction to text mining
The challenge
Applications of specialised normalization solutions
− Maximising Source Normalization
− EASL (Extraction and Search Language)
− Allows programmatic access to unstructured data similar to SQL over structured data.
− Numeric Normalization & Range search
− Capturing weights between 60 and 80kg whether expressed in kilograms or pounds, for patient selection from EHRs.
− Gene Mutation Normalization
− Use case where gene mutations have been linked to rare disease progression.
© 2016 Linguamatics Ltd 2
Answers to Our Questions are in Free Text
80% of information at companies is in free text
Most of the answers to our questions are there
Ever-increasing amounts of text data to examine
[Chart: PubMed Records, axis from 0 to 25 million]
− Different kinds of documents
− External literature, patents, EHRs, internal reports, blogs, presentations
− Different formats
− HTML, PDF, XML, Word, PPT, Wiki, TXT, HL7
Keyword Searching
OLED
Documents, Web Pages, Folders
All these documents contain the keyword ‘Additive’. Read ALL the documents to find the bit relevant to you
Linguamatics in Healthcare
Electronic Health Record
Enterprise Data Warehouse
Pathology, radiology, initial assessment, discharge, check up
Structured data
Clinical Risk Monitor
Patient characteristics
Patient lists
ClinicalTrials.gov
Patient characteristics
Matching clinical trials
Patient Narrative
Semantic search tags
Semantic Enrichment
Clinical case histories and/or genomic interpretation
Patient characteristics
Scientific literature
I2E Transforms Text into Actionable Insights
Turn text into structured data using sophisticated queries
Accurate results: only retrieves relevant results
Complete results: comprehensive and systematic
Analytics
To drive analytics
Enterprise Warehouse
Search vs. Text Mining
Search Engine
− Filter to find most relevant documents, then read
Text Mining
− Natural Language Processing (NLP): understand meaning
− Use of ontologies and clustered results
− Efficient review, without reading every document
Sources: News Feeds, Literature, Patents, Internal Reports, Social Media
Challenges in Unstructured Data
Different word, same meaning
cyclosporine
ciclosporin
Neoral
Sandimmune
Different expression, same meaning
Non-smoker
Does not smoke
Does not drink or smoke
Denies tobacco use
Different grammar, same meaning
5mg/kg of cyclosporine per day
5mg/kg per day of cyclosporine
cyclosporine 5mg/kg per day
Same word, different context
Diagnosed with diabetes
Family history of diabetes
No family history of diabetes
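The two simplest challenges above (different words with the same meaning, and the same word in a different context) can be illustrated with a minimal sketch. It assumes a tiny hand-written synonym table and three context rules built only from the examples on this slide; the names `SYNONYMS`, `normalize_terms` and `diabetes_context` are hypothetical, and this is not how I2E itself works.

```python
import re

# Hypothetical mini-thesaurus: the surface forms from the slide, all mapped
# to one preferred term.
SYNONYMS = {
    "cyclosporine": "cyclosporine",
    "ciclosporin": "cyclosporine",
    "neoral": "cyclosporine",
    "sandimmune": "cyclosporine",
}


def normalize_terms(text):
    """Return the preferred term for every known synonym found in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [SYNONYMS[token] for token in tokens if token in SYNONYMS]


def diabetes_context(sentence):
    """Classify who a mention of diabetes refers to, using three toy rules."""
    lowered = sentence.lower()
    if "diabetes" not in lowered:
        return None
    if "no family history" in lowered:
        return "family_history_negated"
    if "family history" in lowered:
        return "family_history"
    return "patient"
```

A real system would need full linguistic analysis rather than substring rules, which is the point the slide makes about NLP.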
NLP
Linguistic Processing Using NLP
Interprets meaning of the text
Groups words into meaningful units
Search for different forms of words
We find that p42mapk phosphorylates c-Myb on serine and threonine .
Purified recombinant p42 MAPK was found to phosphorylate Wee1 .
sentences
morphology: different forms
noun groups: match entities
verb groups: match actions
From Words to Meaning
“Among them, nimesulide, a selective COX2 inhibitor, …”
Entrez Gene ID: 5743
inhibits
Identifying entities and relations
Linguistics to establish relationships
Text Mining - as Normal as Data Mining?
CHALLENGE
How can we capture information from free text as conveniently as accessing a database?
One of the essential differences is the lack of normalization of terms and concepts in free text.
SOLUTION
NLP-based text mining provides the capability to look through unstructured text normalizing:
• Keywords to concepts
• Numerical data
• Range Search
• Gene Mutations
• Content source
BENEFIT
A set of structured facts, relationships or assertions, from different data sources that can be used for decision support
Providing tabular or visual analytics to fill data warehouses and support better patient care.
Literature
Patents
Reports
Clinical Trials
I2E: A Fully Federated Text Mining Platform
Merge into a single set of results
Content Server 1
Content Server 2
Content Server 3
Content Server 4
Federated Architecture
Normalizing Data from Different Sources
Single query
Differently structured data sources on different servers
− Journal articles (PubMed Central) on local Enterprise Server
− MEDLINE on remote cloud server
Single set of results
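The idea of one query fanned out to several servers and merged into one result set can be sketched as follows. This is an illustration only: the servers are stand-in functions over in-memory lists, the document IDs and texts are invented, and `make_server` and `federated_search` are hypothetical names, not part of any I2E API.

```python
def make_server(documents):
    """Build a stand-in search function over an in-memory list of (doc_id, text)."""
    def search(query):
        return [(doc_id, text) for doc_id, text in documents
                if query.lower() in text.lower()]
    return search


def federated_search(query, servers):
    """Send one query to every server and merge hits by document ID."""
    merged = {}
    for server_name, search in servers.items():
        for doc_id, text in search(query):
            entry = merged.setdefault(doc_id, {"text": text, "sources": []})
            entry["sources"].append(server_name)
    return merged


# Invented example content: two differently located sources with one overlap.
servers = {
    "pmc_local": make_server([
        ("PMID:1", "Cyclosporine dosing in transplant patients"),
        ("PMC:7", "Unrelated article"),
    ]),
    "medline_cloud": make_server([
        ("PMID:1", "Cyclosporine dosing in transplant patients"),
        ("PMID:2", "Cyclosporine and renal function"),
    ]),
}
results = federated_search("cyclosporine", servers)
```

Merging on a shared document ID is the design choice that turns several result lists into a single one; it only works once the sources have been normalized to a common identifier.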
EASL Example
query:
  document:
    - phrase:
        - class: {snid: nci.C1909, pt: Pharmacologic Substance}
        - treat
        - class: {snid: nlm.C04.588.180, pt: Breast Neoplasms}
output:
  outputSettings: {documentsPerAssertion: -1,
    hitsPerDocPerAssertion: 10, outputOrdering: frequency,
    resultType: standard}
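Because EASL is written in YAML, a query like the one above can in principle be produced by a script. The sketch below builds the same structure as nested Python dictionaries. The key names are copied from the slide, but the nesting is an assumption (the slide's indentation did not survive extraction), and `concept` and `build_phrase_query` are hypothetical helpers. It serializes with the standard `json` module, relying on JSON being a subset of YAML 1.2; whether I2E accepts that form has not been checked.

```python
import json


def concept(snid, preferred_term):
    """One ontology class reference, using the keys shown on the slide."""
    return {"class": {"snid": snid, "pt": preferred_term}}


def build_phrase_query(left, verb, right, hits_per_doc=10):
    """Assemble a phrase query; the nesting here is an assumption."""
    return {
        "query": {"document": [{"phrase": [left, verb, right]}]},
        "output": {
            "outputSettings": {
                "documentsPerAssertion": -1,
                "hitsPerDocPerAssertion": hits_per_doc,
                "outputOrdering": "frequency",
                "resultType": "standard",
            }
        },
    }


easl_query = build_phrase_query(
    concept("nci.C1909", "Pharmacologic Substance"),
    "treat",
    concept("nlm.C04.588.180", "Breast Neoplasms"),
)
easl_text = json.dumps(easl_query, indent=2)
```

Swapping the second concept for another disease class would give a new query without hand-editing any text.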
Benefits of EASL
Automation
− Richer language for WSAPI applications
− Can build a completely new query vs. adapting smart query parameters
− Allows on-the-fly query production
Re-use
− Save, share and compare components of queries e.g.
− Save out Alternatives
− Load complex expressions in smart query parameters
Audit
− Human readable language for documenting the text mining strategy
− Using open mark-up language (YAML)
Conversion
− Enable scripts to convert from other query languages e.g. advanced search
Different interfaces
− Enables 3rd party applications to create I2E queries
− Developers can produce innovative specialized interfaces e.g. advanced search plus terminologies
EASL: Enhancing the Value of Federated Search
Merge into a single set of results
Content Server 1
Content Server 2
Content Server 3
Content Server 4
Federated Architecture
translate2easl
Espacenet query, PubMed query
espacenet2easl, pubmed2easl
EASL keywords + index terms
EASL terminologies, linguistics …
Clinical Trials
OMIM
FDA Drug Labels
Patents
NIH Grants
MEDLINE
refine query
What Do We Want to Find?
Patients
− below 60 years old
− weight ≥ 80kg
− not having chemotherapy after 2010
− with a mutation C677T
Challenge: Variety Within the Text
Below 60 years old
− aged 58
− 35 years old
− 42-year-old
− 39 y/o
Weight ≥ 80kg
− 267 pounds
− 280 lbs
− 80.4kg
− 82 kilograms
After 2010
− January 21, 2011
− October of 2012
− 08/21/11
− 2012-05-04
Mutation C677T
− C677T
− 677C>T
− 677C/T
− 677C->T
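The four date expressions listed under "After 2010" can be brought to one representation with the standard library alone. This is a minimal sketch under stated assumptions: it covers only those four layouts, reads `08/21/11` as US month/day/year, and treats a month without a day ("October of 2012") as the first of that month. `normalize_date` and `is_after_2010` are hypothetical names.

```python
from datetime import date, datetime

# One strptime layout per expression style seen on the slide.
DATE_FORMATS = ["%B %d, %Y", "%B of %Y", "%m/%d/%y", "%Y-%m-%d"]


def normalize_date(text):
    """Return a date for the first layout that parses, or None if none does."""
    for layout in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), layout).date()
        except ValueError:
            continue
    return None


def is_after_2010(text):
    """True when the expression parses to a date later than 31 December 2010."""
    parsed = normalize_date(text)
    return parsed is not None and parsed > date(2010, 12, 31)
```

Once every expression is a `date`, the "after 2010" criterion becomes an ordinary comparison, which is what makes range search possible.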
Normalizing Gene Mutations
Different types of mutation description, including:
− positional e.g. +869(T>C)
− rsID e.g. rs100
Transform different syntax e.g.
− 1166A/C -> A1166C
− Asn to Ser substitution at codon 127 -> N127S
− +1196C/T -> C1196T
− g.655C/A>G -> C655G, A655G
− M567V/A -> M567V, M567A
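A few of the syntax transformations above can be sketched with regular expressions. The sketch assumes single-letter nucleotide or amino-acid codes and handles only three shapes: position-first forms such as `677C>T` or `+869(T>C)`, the already normalized form `C677T`, and two alternatives sharing a reference such as `M567V/A`. Worded descriptions ("Asn to Ser substitution at codon 127"), rsIDs and the `g.` prefix are left out. `normalize_mutation` is a hypothetical name, not the I2E mutation ontology.

```python
import re

# 677C>T, 677C/T, 677C->T, +1196C/T, +869(T>C)
POSITION_FIRST = re.compile(r"^\+?(\d+)\(?([ACGT])(?:/|->|>)([ACGT])\)?$")
# M567V/A: one reference residue with two alternatives
TWO_ALTERNATIVES = re.compile(r"^([A-Z])(\d+)([A-Z])/([A-Z])$")
# C677T: already in the target form
NORMALIZED = re.compile(r"^[A-Z]\d+[A-Z]$")


def normalize_mutation(text):
    """Return a list of mutations in reference-position-alternative form."""
    mention = text.strip()
    match = POSITION_FIRST.match(mention)
    if match:
        position, reference, alternative = match.groups()
        return [f"{reference}{position}{alternative}"]
    match = TWO_ALTERNATIVES.match(mention)
    if match:
        reference, position, first, second = match.groups()
        return [f"{reference}{position}{first}", f"{reference}{position}{second}"]
    if NORMALIZED.match(mention):
        return [mention]
    return []
```

With this, all four spellings of C677T from the earlier slide resolve to the same string, so a single search term retrieves them all.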
Range Search
Allows search for values within a range
− in fixed fields e.g. publication date
− within free text e.g. dosages
Can directly ask for e.g.
− patients with diabetes under 60 with BMI under 30
Can find intervals within the text and return them when searching for a number or an overlapping range
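Matching an interval in the text against a searched number or range comes down to an overlap test. The sketch below assumes intervals written as "5-10" or "5 to 10" and treats a single number as a range whose ends coincide; `find_intervals` and `overlaps` are hypothetical names.

```python
import re

# Two numbers joined by a hyphen or the word "to", e.g. "5-10" or "5 to 10".
INTERVAL = re.compile(r"(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)")


def find_intervals(text):
    """Return every (low, high) interval mentioned in the text."""
    return [(float(low), float(high)) for low, high in INTERVAL.findall(text)]


def overlaps(interval, low, high):
    """True when the interval shares at least one value with [low, high]."""
    start, end = interval
    return start <= high and low <= end


def matches_number(interval, value):
    """A single number is searched as the range [value, value]."""
    return overlaps(interval, value, value)
```

For example, a text interval of 5 to 10 is found by a search for 7 and by a search for 8 to 12, but not by a search for 11 to 20.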
Range Search with Normalization
Range Search (Age, Date)
− Patients aged < 60yrs
− Date before 2010
Normalizing:
− Report Date, Age, Weight & BMI
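As an illustration of the age criterion, the sketch below pulls ages out of the four phrasings shown earlier ("aged 58", "35 years old", "42-year-old", "39 y/o") and applies the "under 60" test. It is a regular-expression approximation that covers only those phrasings; `extract_ages` and `is_under` are hypothetical names.

```python
import re

# Group 1: "aged 58". Group 2: "35 years old", "42-year-old", "39 y/o".
AGE = re.compile(
    r"\baged\s+(\d{1,3})\b"
    r"|\b(\d{1,3})(?:\s*-?\s*years?\s*-?\s*old|\s*y/o)\b",
    re.IGNORECASE,
)


def extract_ages(text):
    """Return every age mentioned in the text as an integer."""
    return [int(m.group(1) or m.group(2)) for m in AGE.finditer(text)]


def is_under(text, limit=60):
    """True when the text mentions at least one age and all are below the limit."""
    ages = extract_ages(text)
    return bool(ages) and all(age < limit for age in ages)
```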
Normalization Benefits
Ability to compare measurements with different units e.g. kg vs. lbs
Ability to perform range search for numerics, measurements, dates
Standardized representations to link to structured data e.g. mutation databases
Better clustering of results e.g. drug lab codes
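The first two benefits can be made concrete with the weight examples from earlier in the talk. The sketch below converts kilogram and pound mentions to kilograms and then applies a numeric range, using the 60 to 80kg window from the agenda as the default. It assumes the unit directly follows the number; `weights_in_kg` and `in_range` are hypothetical names.

```python
import re

LB_TO_KG = 0.45359237  # exact definition of the international pound

WEIGHT = re.compile(r"(\d+(?:\.\d+)?)\s*(kg|kilograms?|lbs?|pounds?)\b",
                    re.IGNORECASE)


def weights_in_kg(text):
    """Return every weight in the text, converted to kilograms (1 decimal)."""
    results = []
    for value, unit in WEIGHT.findall(text):
        kilograms = float(value)
        if unit.lower().startswith(("lb", "pound")):
            kilograms *= LB_TO_KG
        results.append(round(kilograms, 1))
    return results


def in_range(kilograms, low=60.0, high=80.0):
    """True when the normalized weight lies within [low, high] kilograms."""
    return low <= kilograms <= high
```

After normalization, "150 lbs" and "68 kg" land on the same value, so one range query covers both ways of writing the weight.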
Mucopolysaccharidosis II: Hunter Syndrome
Rare X-linked recessive disorder
Deficiency of the lysosomal enzyme iduronate-2-sulfatase
Leads to progressive accumulation of glycosaminoglycans throughout the body
Signs & symptoms:
− Bone deformities with joint stiffness; Frequent respiratory infections; Cardiomyopathy; Hepatosplenomegaly; Neurocognitive impairment; Reduced lifespan
− Some symptoms partially improved with enzyme replacement therapy
Spectrum of clinical severity (mild to severe); main difference is progressive development of neurodegeneration in the severe form
TEXT ANALYTICS FOR RARE DISEASES: GENOTYPE-PHENOTYPE ASSOCIATION IN HUNTER SYNDROME
CHALLENGE
• Scarcity of knowledge of natural history of disease
• Sparse data, needs high recall across full text papers
• Mutation patterns very variable
• Structured databases lack broad phenotypic association data
SOLUTION
• Developed workflow with Linguamatics I2E
• Abstracts identified in MEDLINE using broad vocabularies
• Full text PDFs processed for text analytics
• I2E mutation ontology and bespoke severity vocabularies enabled extraction of genotype-phenotype associations
BENEFIT
• Extraction of patient mutations matched or bettered genetic databases
• Increased understanding of IDS mutational spectrum for provider diagnostics and patient awareness
• Enabled rational approach to immune response classification
In Summary
Better Normalization of
− Numbers, dates, drug codes, TNM cancer stage
− Subsequent range search
− Gene mutations
In combination with EASL, a human-readable open query language
− Maximises the ease and flexibility of asking complex questions simultaneously across different content sources
Ultimately agile NLP text mining provides
− High quality, structured, clustered & normalized results in the format you need
− Improved speed to insight for faster decision making