Enhancing Data Integration with Text Analysis to Find Genes Implicated in Plant Stress Response

Enhancing Data Integration with Text Analysis to Find Proteins Implicated in Plant Stress Response

Keywan Hassani‐Pak

keywan.hassani‐[email protected]

Integrative Bioinformatics 2010

Outline

• Motivation

– Prior information for candidate genes

– Structured data and unstructured text

• Methods

– Text mining plugin for Ondex

– Application case

• Results

– Visualisation

– Association networks

– Filtering noise

– Validation

• Summary

Motivation

• High throughput ‘omics research can identify many candidate genes

• Interpretation of experimental results needs prior information

• Most important sources for prior information are

– Structured bioinformatics databases

– Unstructured scientific literature

• GOAL: Automated methods for the integration of prior information

[Figure: time course microarray data (Experiment1, Experiment2, …) are used to identify genes that alter expression over time, giving candidate genes (Gene1, Gene2, Gene3, …, GeneN). Prior information for these genes regarding the experiment is obtained from public data sources (DBs, literature).]

Structured Data vs. Unstructured Text

• Data integration methods

– Syntactic and semantic heterogeneity

– Literature references

• Text mining methods

– Identify facts hidden in unstructured text

– Integrate facts with database entries

http://www.nactem.ac.uk/software/kleio

http://www.uniprot.org

Integrative Text Mining

• Old: Data integration and text mining systems have been largely developed independently

• Idea: Combine structured knowledge stored in public databases with unstructured information in the literature

• New: Text mining plugin for the data integration framework Ondex

[Figure: Ondex architecture. Heterogeneous data sources (UniProt, OBO, KEGG, MEDLINE) are read by parsers. The Ondex core comprises a generalized object data model and a database layer. The Ondex integrator covers data transformation, mapping methods (accession, name based, BLAST) and text mining; Lucene is also shown. Data exchange (OXL/RDF, web service) connects to the clients/tools Taverna, Cytoscape and the Ondex frontend.]

www.ondex.org

Advanced Knowledge Base

1. Structured information

– Bioinformatics databases, ontologies

– Curated citations in structured data sources (e.g. from UniProt)

2. Unstructured information

– MEDLINE titles and abstracts are indexed and normalised (by Lucene)

– Information Retrieval strategies: exact, fuzzy, proximity

– Named Entity Recognition: concept‐based (names and synonyms)

– Score: tf‐idf weight (term frequency * inverse document frequency)
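
The indexing and scoring steps above can be sketched in a few lines. This is a minimal illustration, not the Ondex plugin's code: the toy abstracts, the concept dictionary and the helper names (`normalise`, `count_mentions`, `tf_idf`) are hypothetical, only the exact retrieval strategy is shown, and the weight uses the textbook form tf * log(N/df) rather than Lucene's own scoring formula.

```python
import math
import re

# Hypothetical toy corpus standing in for MEDLINE titles and abstracts.
abstracts = {
    "pmid1": "ABI1 modulates ethylene biosynthesis. ABI1 is a protein phosphatase.",
    "pmid2": "Ethylene signalling in Arabidopsis leaves.",
    "pmid3": "Drought stress and abscisic acid responses.",
}

# Hypothetical concepts: a preferred name plus synonyms (concept-based NER).
concepts = {
    "ABI1": ["ABI1", "protein phosphatase 2C ABI1"],
    "Ethylene": ["ethylene", "ethene"],
}


def normalise(text):
    """Lower-case the text and reduce it to space-separated alphanumeric tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def count_mentions(text, names):
    """Count exact whole-word matches of any name or synonym of a concept."""
    # Longest names first, so a long synonym is not counted again as a short one.
    alternatives = sorted((re.escape(normalise(n)) for n in names), key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")
    return len(pattern.findall(normalise(text)))


def tf_idf(names, corpus):
    """Return {publication id: tf * log(N / df)} for one concept."""
    counts = {pmid: count_mentions(text, names) for pmid, text in corpus.items()}
    df = sum(1 for c in counts.values() if c > 0)  # publications mentioning the concept
    if df == 0:
        return {}
    idf = math.log(len(corpus) / df)
    return {pmid: c * idf for pmid, c in counts.items() if c > 0}


scores = {name: tf_idf(names, abstracts) for name, names in concepts.items()}
```

In this toy corpus ABI1 is mentioned twice in a single abstract, so it scores higher there than ethylene, which appears once in each of two abstracts and is therefore less specific.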

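[Figure: text mining. Concepts A and B are linked to publications x and y by published_in relations; text mining adds an is_related relation between A and B, giving a weighted association network annotated with IP=1.7; M=1.2; N=2.]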

Association Scores

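[Figure: weighted association network between concepts A and B, annotated N=29; M=3.1; IP=22.4. The publications linking A and B carry tf‐idf scores (3.1, 1.7, 0.9, ...); their count is N, the highest score (3.1) is M, and IP = 22.4.]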
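
The three scores can be illustrated with a short sketch. The slides define N as the co-citation number and M as the highest per-publication score, but do not spell out how IP is computed. The code below assumes that the per-publication score is the product of the two concepts' tf-idf weights and that IP is the sum of these products (an inner product). That reading is consistent with IP equalling M whenever N=1 in the results table, but it remains an assumption, as are the function name and the toy weights.

```python
def association_scores(weights_a, weights_b):
    """Score the association of two concepts from their per-publication weights.

    weights_a / weights_b map publication ids to tf-idf weights.
    ASSUMPTION: the per-publication score is the product of both weights and
    IP is the sum of those products; the slides do not state this.
    """
    shared = weights_a.keys() & weights_b.keys()  # co-citing publications
    if not shared:
        return None  # no co-citation, so no association
    per_publication = [weights_a[p] * weights_b[p] for p in shared]
    return {
        "N": len(per_publication),  # co-citation number
        "M": max(per_publication),  # best single publication
        "IP": sum(per_publication),  # accumulated evidence
    }


# Hypothetical tf-idf weights for two concepts A and B.
weights_a = {"pmid1": 2.0, "pmid2": 1.0, "pmid4": 0.5}
weights_b = {"pmid1": 1.5, "pmid2": 0.5, "pmid3": 3.0}
result = association_scores(weights_a, weights_b)
```

With a single shared publication, M and IP coincide by construction, which is why N=1 associations carry the least independent support.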

The PRESTA project

[Figure: PRESTA project workflow. Time course microarray data are used to identify genes that alter expression over time. Network inference, supported by prior information from Ondex (worldwide data resources: literature, public databases, public experiments), leads to the identification of key regulatory genes, followed by knock out experiments, overexpresser experiments and phenotypes.]

http://www2.warwick.ac.uk/fac/sci/whri/research/presta

Application Case: Knowledge Base for Stress Response in Arabidopsis

• Publications (the corpus)

– MEDLINE: search ‘Arabidopsis thaliana’

– 28653 publications

• Proteins

– UniProtKB: search ‘taxid:3702 + reviewed’

– 8582 proteins

– 13502 curated citations

• Plant Stress Ontology

– 33 stresses/treatments related to PRESTA experiments

– Biotic: Bacteria, Fungus, etc.

– Abiotic: Drought, Salt, Light, Hormone, etc.

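[Figure: knowledge base schema with the concept classes Stress, Protein, Publication and Enzyme, linked by published_in and is_related relations; counts shown: 13502 352445194; example stress: X. campestris.]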

Network Visualisation

Protein‐Stress Association Network

• 3145 proteins linked to 32 stresses by 10777 relations

• On average

– each protein associated with 3.4 stresses

– each stress associated with 337 proteins

• Filtering associations based on three confidence scores IP, M and N
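
The two averages follow directly from the network size, since every relation joins one protein and one stress. A quick check of the arithmetic, using only the counts given above (the variable names are illustrative):

```python
# Counts reported for the protein-stress association network.
proteins = 3145
stresses = 32
relations = 10777

# Each relation contributes one stress to a protein and one protein to a stress.
stresses_per_protein = relations / proteins
proteins_per_stress = relations / stresses

summary = (round(stresses_per_protein, 1), round(proteins_per_stress))
```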

[Figure: protein‐stress association network; labelled stresses: X. campestris, Ethylene]

Metric  Min   Max
IP      0.01  347.26
M       0.01  26.86
N       1     600

How to find cut‐offs for filtering?

• Problem: Text mining results often error‐prone

• Aim: Improving signal‐to‐noise ratio by setting optimal cut‐offs

• Co‐citation number (N) is the simplest way to potentially reduce noise in such association networks

• Filtering by IP and M should be more selective as both consider frequency of terms in the corpus

• However, none of the metrics is superior overall

• Considering several metrics at the same time seems to be the method of choice to reduce noise and highlight key associations
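
Combining cut-offs on several metrics can be sketched as a simple conjunctive filter. The `Association` class, the function name and the threshold values below are hypothetical and chosen only to show the mechanics; the slides do not prescribe specific cut-offs. The example scores are copied from the ethylene results table in this deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    protein: str
    stress: str
    ip: float
    m: float
    n: int


def filter_associations(associations, min_ip=0.0, min_m=0.0, min_n=1):
    """Keep only associations that pass all three cut-offs at once."""
    return [
        a for a in associations
        if a.ip >= min_ip and a.m >= min_m and a.n >= min_n
    ]


# Scores copied from the ethylene results table in this deck.
network = [
    Association("ACBP4", "Ethylene", ip=13.51, m=13.51, n=1),
    Association("ATGSTF6", "Ethylene", ip=15.75, m=7.36, n=7),
    Association("ABI1", "Ethylene", ip=12.22, m=6.66, n=10),
    Association("PR-5", "Ethylene", ip=5.47, m=5.18, n=12),
]

# Hypothetical cut-offs: N alone versus N combined with M.
by_n_only = [a.protein for a in filter_associations(network, min_n=2)]
combined = [a.protein for a in filter_associations(network, min_m=6.0, min_n=2)]
```

Filtering on N alone drops the single-publication association, while adding a cut-off on M also removes a frequently co-cited pair whose best supporting publication scores low. Neither metric alone would have made both cuts.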

Validation of Protein‐Ethylene Pairs

• Ethylene association network contained 533 proteins

• Ideally read all abstracts and evaluate association

• Comparison with Arabidopsis Hormone Database (AHD)

a. 31 curated associations: 71.0% recall

b. 166 total associations (inc. GO): 44.8% recall

[Figure: panels a and b, each comparing the text mining (TM) results with AHD]
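
Recall here is the fraction of reference (AHD) associations that the text mining network also contains. A minimal sketch with hypothetical protein identifiers follows. For panel a, 71.0% recall on 31 curated associations is consistent with 22 recovered pairs, since 22/31 ≈ 0.710.

```python
def recall(predicted, reference):
    """Fraction of reference associations that were also predicted."""
    if not reference:
        raise ValueError("reference set must not be empty")
    return len(predicted & reference) / len(reference)


# Hypothetical protein identifiers associated with ethylene.
text_mining = {"P1", "P2", "P3", "P5", "P8"}
curated = {"P1", "P2", "P3", "P4"}

toy_recall = recall(text_mining, curated)

# Arithmetic behind the reported figure for panel a.
panel_a_percent = round(22 / 31 * 100, 1)
```

Proteins found only by text mining (P5 and P8 in the toy data) do not affect recall; they are the candidates that need separate evaluation.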

Top 10 protein predictions

ACCESSION  NAME           PUBMED    YEAR  M      N   IP     PVAL   TRUE
AT3G05420  ACBP4          18836139  2008  13.51  1   13.51  1.00*  yes
AT1G31812  ACBP6          18836139  2008  11.57  2   17.14  0.50   yes
AT3G03190  ATGSTF6        14617075  2003  7.36   7   15.75  0.25   yes
AT4G26080  ABI1           19705149  2009  6.66   10  12.22  0.39   yes
AT3G21510  AHP1           18384742  2008  6.60   3   6.70   0.17   yes
AT1G75040  PR‐5           15988566  2005  5.18   12  5.47   0.07   yes
AT2G45820  Remorin        9159183   1997  5.04   4   6.77   0.86   no
AT3G11410  PP2CA          19705149  2009  5.00   1   5.00   1.00   yes
AT1G09570  Phytochrome A  8703080   1996  4.79   11  8.47   0.19   no
AT1G04240  IAA3           19213814  2009  4.54   3   5.14   0.67   yes

• Evaluated top 10 proteins (sorted by M score) from our analyses that are linked to ethylene but were not found in AHD.

• The P‐value relates to the significance of the IP score. However, if N=1 then P=1 (*)

• Evidence text

– PMID:18836139: the interaction of ACBP4 and AtEBP may be related to AtEBP‐mediated defence possibly via ethylene and/or jasmonate signalling.

– PMID:19705149: protein phosphatase 2C ABI1 modulates biosynthesis ratio of ABA and ethylene.

Future Work

• Integrate more advanced text mining methods

• Extensive analysis and evaluation of our association metrics

• Investigate alternative association metrics

• Finding best cut‐off for optimal signal‐to‐noise ratio

• Apply method to more application cases

Summary

• Prior information needs to be extracted from structured data and unstructured text

• Developed a flexible text mining plugin for the data integration framework Ondex (open source)

• Can be linked into various bioinformatics workflows to enhance high‐throughput ‘omics research

• First report of systematically combining data integration with basic text mining

• Generated prior information for Arabidopsis proteins regarding the PRESTA experiments

Acknowledgements

ONDEX BBSRC SABR Project BB/F006039

PRESTA BBSRC SABR project BB/F005806

Catherine Canevet

Chris Rawlings

Roxane Legaie

Hugo van den Berg

Jay Moore

THANK YOU!

Contact: keywan.hassani‐[email protected]
