Bioinformatics

Combining Machine Learning and Homology-Based Approaches to Accurately Predict Subcellular Localization in Arabidopsis 1[C][W][OA]

Rakesh Kaundal, Reena Saini 2, and Patrick X. Zhao*

Bioinformatics Laboratory, Plant Biology Division, Samuel Roberts Noble Foundation, Ardmore, Oklahoma 73401

A complete map of the Arabidopsis (Arabidopsis thaliana) proteome is clearly a major goal for the plant research community in terms of determining the function and regulation of each encoded protein. Developing genome-wide prediction tools such as for localizing gene products at the subcellular level will substantially advance Arabidopsis gene annotation. To this end, we performed a comprehensive study in Arabidopsis and created an integrative support vector machine-based localization predictor called AtSubP (for Arabidopsis subcellular localization predictor) that is based on the combinatorial presence of diverse protein features, such as its amino acid composition, sequence-order effects, terminal information, Position-Specific Scoring Matrix, and similarity search-based Position-Specific Iterated-Basic Local Alignment Search Tool information. When used to predict seven subcellular compartments through a 5-fold cross-validation test, our hybrid-based best classifier achieved an overall sensitivity of 91% with high-confidence precision and Matthews correlation coefficient values of 90.9% and 0.89, respectively. Benchmarking AtSubP on two independent data sets, one from Swiss-Prot and another containing green fluorescent protein- and mass spectrometry-determined proteins, showed a significant improvement in the prediction accuracy of species-specific AtSubP over some widely used "general" tools such as TargetP, LOCtree, PA-SUB, MultiLoc, WoLF PSORT, Plant-PLoc, and our newly created All-Plant method.
Cross-comparison of AtSubP on six nontrained eukaryotic organisms (rice [Oryza sativa], soybean [Glycine max], human [Homo sapiens], yeast [Saccharomyces cerevisiae], fruit fly [Drosophila melanogaster], and worm [Caenorhabditis elegans]) revealed inferior predictions. AtSubP significantly outperformed all the prediction tools being currently used for Arabidopsis proteome annotation and, therefore, may serve as a better complement for the plant research community. A supplemental Web site that hosts all the training/testing data sets and whole proteome predictions is available at http://bioinfo3.noble.org/AtSubP/.

Subcellular proteomics has gained tremendous attention of late, owing to the role played by organelles in carrying out defined cellular processes. Several experimental efforts have been made to catalog the complete subcellular proteomes of various organisms (Michaud and Snyder, 2002; Huh et al., 2003; Taylor et al., 2003; Andersen and Mann, 2006), with the aim being to improve our understanding of defined cellular processes at the organellar and cellular levels. Although such efforts have generated valuable information, cataloging all subcellular proteomes is far from complete, as experimental methods are expensive and more time consuming. Alternatively, computational prediction systems provide fast, economic (mostly free), automatic, and reasonably accurate assignment of subcellular location to a protein, especially for high-throughput analysis of large-scale genome sequences, ultimately giving the right direction to design cost-effective wet-lab experiments.

The existing bioinformatics localization predictors in the literature can be broadly grouped into three categories: (1) amino acid composition based; (2) N-terminal sorting signals based; and (3) homology based (e.g. those based on domain or motif co-occurrence).
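As a rough illustration of the first category, the amino acid composition of a protein can be represented as a 20-dimensional vector of residue fractions. The following is a minimal sketch, not the AtSubP implementation; the function name `amino_acid_composition`, the constant `AMINO_ACIDS`, and the toy sequence are assumptions introduced here for illustration only.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (fixed feature order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> list[float]:
    """Return the fraction of each standard residue, in AMINO_ACIDS order.

    Non-standard symbols (e.g. X, B, Z) are ignored, which is an
    assumption of this sketch rather than a documented AtSubP rule.
    """
    counts = Counter(res for res in sequence.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence, used only to exercise the function.
toy_sequence = "MKTAYIAKQR"
vec = amino_acid_composition(toy_sequence)
```

A vector of this kind could then be passed, alone or together with other features, to a classifier such as a support vector machine; how AtSubP actually encodes and combines its features is not shown in this sketch.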
These methods have previously been reviewed in detail (Mott et al., 2002; Scott et al., 2004). However, in bioinformatics in general, and in subcellular localization prediction in particular, it is often debated whether predictions should be done over broad systematic groups such as all eukaryotes or all plants, or over narrower groups such as dicots, or even at the single-species level. On the one hand, species-specific features of sorting signals and amino acid composition could make the prediction better if trained on the particular species where it is going to be used; on the other hand, the smaller data set available for a single species could make the single-species predictor less accurate. How to strike the balance between these two

1 This work was supported by the Samuel Roberts Noble Foundation.

2 Present address: Centre for Biocrystallography, Institute of Bioorganic Chemistry, Polish Academy of Sciences, 61–704 Poznan, Poland.

* Corresponding author; e-mail [email protected].

The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Patrick X. Zhao ([email protected]).

[C] Some figures in this article are displayed in color online but in black and white in the print edition.

[W] The online version of this article contains Web-only data.

[OA] Open Access articles can be viewed online without a subscription.

www.plantphysiol.org/cgi/doi/10.1104/pp.110.156851

Plant Physiology®, September 2010, Vol. 154, pp. 36–54, www.plantphysiol.org © 2010 American Society of Plant Biologists
