Upload
others
View
6
Download
0
Embed Size (px)
Citation preview
Synthesizing Knowledge by Exploiting Diverse Data Sources:
From Microblogs to Patents
David Milward
ICIC, October 2011
Click to edit Master title styleClick to edit Master title style
In the beginning...
Range Of Application Areas
Protein-Protein
Interactions
Vocabulary Development
Biomarker Discovery
Target Identification
& Prioritization
Drug Repositioning
Conference Abstract Mining
Key Opinion Leader
Identification
Competitive Intelligence
Safety/Tox
In-licensing Opportunities
Gene Profiling
Systems Biology
Mining FDA Drug
LabelsExtracting Numerical
and Experimental
Data
Mutations and Gene
Expression
Sentiment Analysis in
Social Media
Workflow Integration
Mining Electronic Medical Records
Clinical Trial Analysis
Patent Analysis
Click to edit Master title styleClick to edit Master title styleExample Data Sources
• External Data
– Social Media
– Scientific Literature
– Conference Abstracts
– Patents
– Clinical Trials
• Internal Data
– SharePoint repositories
– Excel spreadsheets
– Project reports
– Electronic Health Records
Click to edit Master title styleClick to edit Master title styleMining Social Media
What is being said about our brands? How are the messages being relayed?
Who are the key opinion leaders?
Which sites are influencing our consumers?
What are people saying about us and our competitors?
To provide actionable information from public sentiment
Click to edit Master title styleClick to edit Master title styleMicroblogs: Twitter
©Digital Surgeons 2010, All stats are based in U.S unless specified otherwise
* twitter.com -March 2011
Click to edit Master title styleClick to edit Master title style
• Find and extract patterns, not just keywords
• Capturing the 1000s of ways people say the same thing
• For example: how do people express that they are going to get a flu vaccine vs. not get one?
Natural Language Processing (NLP)
Click to edit Master title styleClick to edit Master title styleRefining Sentiment via Context: UK Elections
• Accurate prediction of UK Election Results based on sentiment trends
• What we counted as positive
• What we did not count as positive
Click to edit Master title styleClick to edit Master title styleWho will get a vaccine, and finding their influences
Getters Non Getters
Click to edit Master title styleClick to edit Master title styleCategorizing by Topic and by Statement
• Many Tweets or newsfeed statements are duplicates, but comparing character-by-character does not show this
• Here we are categorizing by topic e.g. jobs or drug, and using NLP to cluster by the messages to allow faster reviewing
Click to edit Master title styleClick to edit Master title styleMining MEDLINE
Drug Repurposing
Target ID and Prioritization
Drug-Drug Interactions
Biomarker Discovery
Click to edit Master title styleClick to edit Master title styleGene Profile
• Gathers together information relevant for a specific gene from multiple documents
Click to edit Master title styleClick to edit Master title styleRelating Genes to Disease
Visualization via Cytoscape
Click to edit Master title styleClick to edit Master title styleMining Conference Proceedings
Identify industry leaders
Gather information for internal customers
Identify new/novel areas of research
Find other companies / researchers working in the same area
To provide actionable information about competitors or activities of potential acquisition targets
Click to edit Master title styleClick to edit Master title style
• Creating an itinerary for poster viewing
14
Conference Proceedings
Click to edit Master title styleClick to edit Master title style
15
Conference Proceedings – Presenter Network
Click to edit Master title styleClick to edit Master title styleMining Patent literature
Research
IP
In-licensing
Competitive intelligence
Click to edit Master title styleClick to edit Master title stylePatent Grants Distribution by Disease (Sample)
Click to edit Master title styleClick to edit Master title stylePatent Applications and Grants: Companies vs. Diseases
0
5000
10000
15000
20000
25000
Ap
plic
atio
ns
Gra
nts
Ap
plic
atio
ns
Gra
nts
Ap
plic
atio
ns
Gra
nts
Ap
plic
atio
ns
Gra
nts
Ap
plic
atio
ns
Gra
nts
Ap
plic
atio
ns
Gra
nts
Ap
plic
atio
ns
Gra
nts
Ap
plic
atio
ns
Gra
nts
Ap
plic
atio
ns
Gra
nts
Abbott AZ Bayer BMS GSK Roche Merck Novartis Pfizer
Virus Diseases
Substance-Related Disorders
Stomatognathic Diseases
Skin and Connective Tissue Diseases
Respiratory Tract Diseases
Parasitic Diseases
Otorhinolaryngologic Diseases
Occupational Diseases
Nutritional and Metabolic Diseases
Nervous System Diseases
Neoplasms
Musculoskeletal Diseases
Mental Disorders
Male Urogenital Diseases
Immune System Diseases
Hemic and Lymphatic Diseases
Female Urogenital Diseases and Pregnancy ComplicationsEye Diseases
Endocrine System Diseases
Click to edit Master title styleClick to edit Master title styleFinding Inhibition Information in Patents
Click to edit Master title styleClick to edit Master title styleMining Clinical Trial Data
Select trial sites more effectively
Find precedents for study design
Monitor progress of competitors’ trials
Find other companies running clinical trials in the same therapeutic area
To provide actionable information about worldwide clinical development activities
Click to edit Master title styleClick to edit Master title styleClinical Trial Analysis
NCT Number Standardization of data Hits in context
Click to edit Master title styleClick to edit Master title styleClinical Trial Analysis
NCT Number Standardization of data Hits in context
Click to edit Master title styleClick to edit Master title styleWorkflow Integration + Clinical Trial Analysis
Click to edit Master title styleClick to edit Master title styleMining Electronic Health Records
Diagnosis prediction from reports
Translational medicine
Analysis of treatment outcomes
Monitoring of adherence to protocols
Click to edit Master title styleClick to edit Master title styleElectronic Health Records
• Identify patient information
• Need to consider context within the record, as well as the immediate sentence
Confidential
Click to edit Master title styleClick to edit Master title styleMining Multiple Sources
Merge information from all possible sources
Find more detailed information from a second source
Connect information across different structured and unstructured sources
Compare trends in different data sources
Click to edit Master title styleClick to edit Master title styleDelivering Diverse Data Sources
Click to edit Master title styleClick to edit Master title styleI2E OnDemand
Hosted Site
Firewall
Company 1
RelationshipsGenes/Proteins
Chemicals Diseases
MEDLINE
Company 2 Company 3
Complete and up-to-date MEDLINE index with built in terminologies managed by Linguamatics
Click to edit Master title styleClick to edit Master title styleStructured and Unstructured Data Workflows
• Successful queries converted into regular workflows for up-to-date dashboards, alerting
• Structured resources (tabular/semantic web) augmented with information from literature
Directly ask questions such as:
• Which genes from microarray experiment (or GWAS) show evidence of association with a disease
SemanticStore/DB
High throughput querying
PrioritizationMicroarray
GWAS
Documents
IndexedDocuments
Click to edit Master title styleClick to edit Master title styleConclusions
• Text Mining allows exploitation of diverse data sources e.g.
– Scientific Literature
– Patents
– Social Media
• Wide range of applications, from R&D to Marketing
• Results from each source are valuable in themselves, but can also combine
– Structured and unstructured data
– Data from different unstructured sources