International Symposium on Integrative Bioinformatics 2010
Enhancing Data Integration with Text Analysis to Find Proteins Implicated in Plant Stress Response
Keywan Hassani‐Pak
keywan.hassani‐[email protected]
Integrative Bioinformatics 2010
Outline
• Motivation
– Prior information for candidate genes
– Structured data and unstructured text
• Methods
– Text mining plugin for Ondex
– Application case
• Results
– Visualisation
– Association networks
– Filtering noise
– Validation
• Summary
Motivation
• High throughput ‘omics research can identify many candidate genes
• Interpretation of experimental results needs prior information
• Most important sources for prior information are
– Structured bioinformatics databases
– Unstructured scientific literature
• GOAL: Automated methods for the integration of prior information
[Figure: time course microarray data (Experiment1, Experiment2, …) are used to identify genes that alter expression over time, giving candidate genes (Gene1, Gene2, Gene3, …, GeneN); prior information for these genes regarding the experiment is then drawn from public data sources (DBs, literature).]
Structured Data vs. Unstructured Text
• Data integration methods
– Syntactic and semantic heterogeneity
– Literature references
• Text mining methods
– Identify facts hidden in unstructured text
– Integrate facts with database entries
http://www.nactem.ac.uk/software/kleio
http://www.uniprot.org
Integrative Text Mining
• Old: Data integration and text mining systems have been largely developed independently
• Idea: Combine structured knowledge stored in public databases with unstructured information in the literature
• New: Text mining plugin for the data integration framework Ondex
[Figure: Ondex architecture. Labels: heterogeneous data sources (UniProt, OBO, KEGG, MEDLINE) with their parsers; data transformation; Ondex Integrator; Ondex Core (Generalized Object Data Model, database layer, Lucene); mapping methods (accession, name based, BLAST); text mining; data exchange (OXL/RDF, web service); clients/tools (Ondex Frontend, Taverna, Cytoscape).]
www.ondex.org
Advanced Knowledge Base
1. Structured information
– Bioinformatics databases, ontologies
– Curated citations in structured data sources (e.g. from UniProt)
2. Unstructured information
– MEDLINE titles and abstracts are indexed and normalised (by Lucene)
– Information Retrieval strategies: exact, fuzzy, proximity
– Named Entity Recognition: concept‐based (names and synonyms)
– Score: tf‐idf weight (term frequency * inverse document frequency)
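As a rough illustration of the tf‐idf weight named above, the following minimal Python sketch scores one term in each abstract of a toy corpus. The example abstracts, the whitespace tokenisation and the function name `tf_idf` are assumptions made for illustration; Lucene's own scoring applies further normalisation that is not reproduced here.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Plain tf-idf: term count in one document times log(N / document frequency)."""
    tf = doc_tokens.count(term)
    df = sum(1 for doc in corpus if term in doc)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

# Toy corpus of three made-up "abstracts" (hypothetical data).
corpus = [
    "ethylene signalling in arabidopsis".split(),
    "abi1 modulates ethylene and aba biosynthesis".split(),
    "drought response in arabidopsis roots".split(),
]

# One score per abstract for the term "ethylene".
scores = [tf_idf("ethylene", doc, corpus) for doc in corpus]
```

A term that occurs in few abstracts gets a higher weight than one that occurs in many, which is why a rare protein name scores above a common word.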
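[Figure: text mining links concepts A and B to publications x and y through published_in relations; the co‐citations are summarised as an is_related edge between A and B in a weighted association network (example edge: IP=1.7; M=1.2; N=2).]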
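Association Scores
[Figure: weighted association network. Example edge between A and B with N=29; M=3.1; IP=22.4. N is the number of publications linking A and B; the per‐publication tf‐idf scores are 3.1 (= M), 1.7, 0.9, ...; IP = 22.4.]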
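The figure labels suggest that N counts the publications supporting an association and that M is the largest per‐publication tf‐idf score. A minimal Python sketch of that aggregation follows. The data layout and the function name `aggregate_scores` are hypothetical, and IP is left out because the slides do not state how it is computed.

```python
from collections import defaultdict

def aggregate_scores(evidence):
    """evidence: iterable of (concept_a, concept_b, publication_id, tfidf) tuples.

    Returns {(concept_a, concept_b): {"N": ..., "M": ...}} where N is the number
    of distinct publications and M is the maximum tf-idf (assumed reading of the figure).
    """
    per_pair = defaultdict(dict)
    for a, b, pub, score in evidence:
        # If a publication is listed twice for a pair, keep its best score.
        previous = per_pair[(a, b)].get(pub, 0.0)
        per_pair[(a, b)][pub] = max(previous, score)
    return {
        pair: {"N": len(pubs), "M": max(pubs.values())}
        for pair, pubs in per_pair.items()
    }

# Three of the tf-idf values shown in the figure, with made-up publication ids.
evidence = [
    ("A", "B", "pub1", 3.1),
    ("A", "B", "pub2", 1.7),
    ("A", "B", "pub3", 0.9),
]
scores = aggregate_scores(evidence)
```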
The PRESTA project
[Figure: PRESTA project overview. Labels: time course microarray data; identify genes that alter expression over time; network inference; identification of key regulatory genes; knock out experiments; overexpresser experiments; phenotypes; worldwide data resources (literature, public databases, public experiments); prior information; Ondex.]
http://www2.warwick.ac.uk/fac/sci/whri/research/presta
Application Case: Knowledge Base for Stress Response in Arabidopsis
• Publications (the corpus)
– MEDLINE: search ‘Arabidopsis thaliana’: 28653 publications
• Proteins
– UniProtKB: search ‘taxid:3702 + reviewed’: 8582 proteins
– 13502 curated citations
• Plant Stress Ontology
– 33 stresses/treatments related to PRESTA experiments
– Biotic: Bacteria, Fungus, etc.
– Abiotic: Drought, Salt, Light, Hormone, etc.
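[Figure: knowledge base overview with the concept classes Stress, Protein, Publication and Enzyme, connected by published_in and is_related relations. Counts shown on the slide: 13502 and "352445194" (digits fused in extraction).]
Network Visualisation
[Figure: network view; labelled example: X. campestris.]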
Protein‐Stress Association Network
• 3145 proteins linked to 32 stresses by 10777 relations
• On average
– each protein associated with 3.4 stresses
– each stress associated with 337 proteins
• Filtering associations based on three confidence scores IP, M and N
[Figure: association network excerpts labelled X. campestris and Ethylene.]
Metric  Min   Max
IP      0.01  347.26
M       0.01  26.86
N       1     600
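The two averages on this slide can be checked directly from the counts given: relations divided by proteins, and relations divided by stresses. A short Python check, with variable names chosen here for illustration:

```python
# Counts taken from the slide.
relations = 10777
proteins = 3145
stresses = 32

# Average number of stresses per protein and proteins per stress.
stresses_per_protein = relations / proteins
proteins_per_stress = relations / stresses

print(round(stresses_per_protein, 1), round(proteins_per_stress))
```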
How to find cut‐offs for filtering?
• Problem: Text mining results are often error‐prone
• Aim: Improve the signal‐to‐noise ratio by setting optimal cut‐offs
• Co‐citation number (N) is the simplest way to potentially reduce noise in such association networks
• Filtering by IP and M should be more selective, as both consider the frequency of terms in the corpus
• However, none of the metrics is superior overall
• Considering several metrics at the same time seems to be the method of choice to reduce noise and highlight key associations
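One way to read the last point is to require an association to pass cut‐offs on several metrics at once. The Python sketch below shows this. The three rows use M, N and IP values from the top 10 table in this talk, but the cut‐off values and the names `Association` and `passes` are placeholders chosen for illustration, not values or code from the talk.

```python
from dataclasses import dataclass

@dataclass
class Association:
    protein: str
    m: float   # maximum tf-idf score
    n: int     # co-citation number
    ip: float  # IP score as reported on the slides

def passes(a, min_m=0.0, min_n=1, min_ip=0.0):
    """True if the association meets every cut-off (all thresholds are hypothetical)."""
    return a.m >= min_m and a.n >= min_n and a.ip >= min_ip

associations = [
    Association("ACBP4", 13.51, 1, 13.51),
    Association("ABI1", 6.66, 10, 12.22),
    Association("PR-5", 5.18, 12, 5.47),
]

# Combine the three metrics; a single-metric filter would keep more rows.
kept = [a.protein for a in associations
        if passes(a, min_m=5.0, min_n=2, min_ip=6.0)]
```

With these illustrative cut‐offs, ACBP4 is dropped for its single co‐citation and PR‐5 for its low IP, even though each would pass a filter on M alone.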
Validation of Protein‐Ethylene Pairs
• Ethylene association network contained 533 proteins
• Ideally, one would read all abstracts and evaluate each association
• Comparison with Arabidopsis Hormone Database (AHD)
a. 31 curated associations: 71.0% recall
b. 166 total associations (inc. GO): 44.8% recall
[Figure: Venn diagrams a. and b. showing the overlap between text mining (TM) and AHD.]
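Recall here is the fraction of reference associations (AHD) that text mining also recovers. A minimal Python sketch follows, using made‑up identifiers rather than the actual AHD or text mining data; the function name `recall` is likewise chosen for illustration.

```python
def recall(predicted, reference):
    """Fraction of reference items that also appear among the predictions."""
    if not reference:
        raise ValueError("reference set is empty")
    return len(predicted & reference) / len(reference)

# Hypothetical identifiers, not real AHD entries.
reference = {"P1", "P2", "P3", "P4"}
predicted = {"P1", "P2", "P3", "P9", "P10"}

r = recall(predicted, reference)  # 3 of the 4 reference items are recovered
```

Predictions that are missing from the reference (P9 and P10 here) do not lower recall; they are the candidates examined in the top 10 table.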
Top 10 protein predictions
ACCESSION  NAME           PUBMED    YEAR  M      N   IP     PVAL   TRUE
AT3G05420  ACBP4          18836139  2008  13.51  1   13.51  1.00*  yes
AT1G31812  ACBP6          18836139  2008  11.57  2   17.14  0.50   yes
AT3G03190  ATGSTF6        14617075  2003  7.36   7   15.75  0.25   yes
AT4G26080  ABI1           19705149  2009  6.66   10  12.22  0.39   yes
AT3G21510  AHP1           18384742  2008  6.60   3   6.70   0.17   yes
AT1G75040  PR‐5           15988566  2005  5.18   12  5.47   0.07   yes
AT2G45820  Remorin        9159183   1997  5.04   4   6.77   0.86   no
AT3G11410  PP2CA          19705149  2009  5.00   1   5.00   1.00   yes
AT1G09570  Phytochrome A  8703080   1996  4.79   11  8.47   0.19   no
AT1G04240  IAA3           19213814  2009  4.54   3   5.14   0.67   yes
• Evaluated top 10 proteins (sorted by M score) from our analyses that are linked to ethylene but were not found in AHD.
• P‐value relates to the significance of the IP score. However, if N=1 then P=1 (*)
• Evidence text
– PMID:18836139: the interaction of ACBP4 and AtEBP may be related to AtEBP‐mediated defence possibly via ethylene and/or jasmonate signalling.
– PMID:19705149: protein phosphatase 2C ABI1 modulates biosynthesis ratio of ABA and ethylene.
Future Work
• Integrate more advanced text mining methods
• Extensive analysis and evaluation of our association metrics
• Investigate alternative association metrics
• Finding best cut‐off for optimal signal‐to‐noise ratio
• Apply method to more application cases
Summary
• Prior information needs to be extracted from structured data and unstructured text
• Developed a flexible text mining plugin for the data integration framework Ondex (open source)
• Can be linked into various bioinformatics workflows to enhance high‐throughput ‘omics research
• First report of systematically combining data integration with basic text mining
• Generated prior information for Arabidopsis proteins regarding the PRESTA experiments
Acknowledgements
ONDEX BBSRC SABR Project BB/F006039
PRESTA BBSRC SABR project BB/F005806
Catherine Canevet
Chris Rawlings
Roxane Legaie
Hugo van den Berg
Jay Moore
THANK YOU!
Contact: keywan.hassani‐[email protected]