49
Mining the hidden proteome using hundreds of public proteomics datasets Dr. Juan Antonio Vizcaíno EMBL-European Bioinformatics Institute Hinxton, Cambridge, UK

Mining the hidden proteome using hundreds of public proteomics datasets

Embed Size (px)

Citation preview

EMBL-EBI Now and in the Future

Mining the hidden proteome using hundreds of public proteomics datasetsDr. Juan Antonio Vizcano

EMBL-European Bioinformatics InstituteHinxton, Cambridge, UK

Juan A. [email protected] ConferenceGhent, 17 November 2016

1

Overview

PRIDE Archive and ProteomeXchange

Reuse of public proteomics data

PRIDE Cluster iteration 1

PRIDE Cluster iteration 2 -> Mining the hidden proteome

Juan A. [email protected] ConferenceGhent, 17 November 2016

2

What is a proteomics publication in 2016?Proteomics studies generate potentially large amounts of data and results.

Ideally, a proteomics publication needs to:Summarize the results of the studyProvide supporting information for reliability of any results reported

Information in a publication:ManuscriptSupplementary materialAssociated data submitted to a public repository

Juan A. [email protected] ConferenceGhent, 17 November 2016

PRIDE stores mass spectrometry (MS)-based proteomics data:Peptide and protein expression data (identification and quantification)Post-translational modificationsMass spectra (raw data and peak lists)Technical and biological metadataAny other related information

Full support for tandem MS approachesAny type of data can be stored.

PRIDE (PRoteomics IDEntifications) Archivehttp://www.ebi.ac.uk/pride/archiveMartens et al., Proteomics, 2005Vizcano et al., NAR, 2016

Juan A. [email protected] ConferenceGhent, 17 November 2016

4

ProteomeXchange: A Global, distributed proteomics database

PASSEL (SRM data)

PRIDE (MS/MS data)

MassIVE (MS/MS data)

Raw

ID/Q

Meta

jPOST(MS/MS data)

Mandatory raw data deposition since July 2015

Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.

http://www.proteomexchange.orgNew in 2016Vizcano et al., Nat Biotechnol, 2014Deutsch et al., NAR, 2017, in press

Juan A. [email protected] ConferenceGhent, 17 November 2016ProteomeCentralMetadata / ManuscriptRaw DataResults

Journals

Peptide Atlas Receiving repositories

PRIDE

Researchers results

Raw dataMetadata

PASSEL

Research groupsReanalysis of datasets

MassIVE

jPOST MS/MS data(as completesubmissions)

Any other workflow (mainly partial submissions)

DATASETS

SRM data

Reprocessed results

MassIVEProteomeXchange data workflow

Vizcano et al., Nat Biotechnol, 2014Deutsch et al., NAR, 2017, in press

Juan A. [email protected] ConferenceGhent, 17 November 2016

6

ProteomeCentral: Centralised portal for all PX datasetshttp://proteomecentral.proteomexchange.org/cgi/GetDataset

Juan A. [email protected] ConferenceGhent, 17 November 2016ProteomeCentralMetadata / ManuscriptRaw DataResults

Journals

Peptide Atlas Receiving repositories

PRIDE

Researchers results

Raw dataMetadata

PASSEL

Research groupsReanalysis of datasets

MassIVE

jPOST MS/MS data(as completesubmissions)

Any other workflow (mainly partial submissions)

DATASETS

SRM data

Reprocessed results

MassIVEProteomeXchange data workflow

Juan A. [email protected] ConferenceGhent, 17 November 2016

8

ProteomeCentralMetadata / ManuscriptRaw DataResults

Journals

UniProt/neXtProt

Peptide Atlas

Other DBs Receiving repositories

PRIDE

GPMDB

Researchers results

Raw dataMetadata

PASSEL

proteomicsDB

Research groupsReanalysis of datasets

MassIVE

jPOST MS/MS data(as completesubmissions)

Any other workflow (mainly partial submissions)

DATASETS

OmicsDIIntegration with other omics datasets

SRM data

Reprocessed results

MassIVEProteomeXchange data workflow

Juan A. [email protected] ConferenceGhent, 17 November 2016

9

PRIDE Archive over 5,000 datasets from over 51 countries and 2,000 groupsUSA 814 datasetsGermany 528 UK 338China 328France 222Netherlands 175Canada - 137

Data volume:Total: ~275 TB Number of all files: ~560,000PXD000320-324: ~ 4 TBPXD002319-26 ~2.4 TBPXD001471 ~1.6 TB1,973 datasets i.e. 52% of all are publicly accessible~90% of all ProteomeXchange datasets

YearSubmissionsAll submissionsCompletePRIDE Archive growthIn the last 12 months: ~165 submitted datasets per monthTop Species studied by at least 100 datasets:2,010 Homo sapiens 604 Mus musculus 191 Saccharomyces cerevisiae 140 Arabidopsis thaliana 127 Rattus norvegicus >900 reported taxa in total

Juan A. [email protected] ConferenceGhent, 17 November 2016(> 922 processed by MaxQuant)

10

Overview

PRIDE Archive and ProteomeXchange

Reuse of public proteomics data

PRIDE Cluster iteration 1

PRIDE Cluster iteration 2

Juan A. [email protected] ConferenceGhent, 17 November 2016

11

Datasets are being reused more and more.

Vaudel et al., Proteomics, 2016Data download volume for PRIDE Archive in 2015: 198 TB

Juan A. [email protected] ConferenceGhent, 17 November 2016

12

Data sharing in Proteomics

Vaudel et al., Proteomics, 2016

Juan A. [email protected] ConferenceGhent, 17 November 2016Data sharing in Proteomics

Vaudel et al., Proteomics, 2016

Juan A. [email protected] ConferenceGhent, 17 November 2016Draft Human proteome papers published in 2014

Wilhelm et al., Nature, 2014Around 60% of the data used for the analysis comes from previous experiments, most of them stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE.

They complement that data with exotic tissues.

Juan A. [email protected] ConferenceGhent, 17 November 2016Data sharing in Proteomics

Vaudel et al., Proteomics, 2016

Juan A. [email protected] ConferenceGhent, 17 November 2016Examples of repurposing datasets: proteogenomics

Data in public resources can be used for genome annotation purposes

Juan A. [email protected] ConferenceGhent, 17 November 2016Public datasets from different omics: OmicsDIhttp://www.ebi.ac.uk/Tools/omicsdi/Aims to integrate of omics datasets (proteomics, transcriptomics, metabolomics and genomics at present). PRIDE MassIVEjPOSTPASSELGPMDB

ArrayExpressExpression Atlas

MetaboLightsMetabolomics WorkbenchGNPS

EGAPerez-Riverol et al., Nat Biotechnol, in press

Juan A. [email protected] ConferenceGhent, 17 November 2016PRIDE Proteomes provide an across-dataset and quality filtered view on PRIDE Archive data. Good PSMs are assessed using the PRIDE Cluster approach, based on spectral clustering.18

OmicsDI: Portal for omics datasets

Juan A. [email protected] ConferenceGhent, 17 November 2016PRIDE Proteomes provide an across-dataset and quality filtered view on PRIDE Archive data. Good PSMs are assessed using the PRIDE Cluster approach, based on spectral clustering.19

OmicsDI: Portal for omics datasets

Juan A. [email protected] ConferenceGhent, 17 November 2016PRIDE Proteomes provide an across-dataset and quality filtered view on PRIDE Archive data. Good PSMs are assessed using the PRIDE Cluster approach, based on spectral clustering.20

Overview

PRIDE Archive and ProteomeXchange

Reuse of public proteomics data

PRIDE Cluster iteration 1

PRIDE Cluster iteration 2 -> Mining the hidden proteome

Juan A. [email protected] ConferenceGhent, 17 November 2016

21

Data sharing in Proteomics

Vaudel et al., Proteomics, 2016

Juan A. [email protected] ConferenceGhent, 17 November 2016PRIDE Cluster - Concept

NMMAACDPR NMMAACDPR PPECPDFDPPR

NMMAACDPR NMMAACDPR

NMMAACDPR Consensus spectrumPPECPDFDPPRThreshold: At least 3 spectra in a cluster and ratio >70%.Originally submitted identified spectraSpectrumclusteringProvide a QC-filtered peptide-centric view of PRIDE Archive

Juan A. [email protected] ConferenceGhent, 17 November 2016One perfect cluster in PRIDE Cluster web

880 PSMs give the same peptide ID4 species28 datasetsSame instruments

http://www.ebi.ac.uk/pride/cluster/

Juan A. [email protected] ConferenceGhent, 17 November 2016PRIDE Cluster: ImplementationClustered all public, identified spectra in PRIDEEBI compute farm, LSF20.7 M identified spectra610 CPU days, two calendar weeksValidation, calibrationFeedback into PRIDE datasetsEBI farm, LSFGriss et al., Nat. Methods, 2013

Juan A. [email protected] ConferenceGhent, 17 November 2016Overview

PRIDE Archive and ProteomeXchange

Reuse of public proteomics data

PRIDE Cluster iteration 1

PRIDE Cluster iteration 2 -> Mining the hidden proteome

Juan A. [email protected] ConferenceGhent, 17 November 2016

26

PRIDE Cluster Iteration 2: Why?PRIDE Archive has experienced a huge increase in data since 2013.We wanted to develop an algorithm that could also work with unidentified spectra.

YearSubmissionsAll submissionsCompletePRIDE Archive growth

Juan A. [email protected] ConferenceGhent, 17 November 2016

27

PRIDE Cluster: Second ImplementationClustered all public, identified spectra in PRIDEEBI compute farm, LSF20.7 M identified spectra610 CPU days, two calendar weeksValidation, calibrationFeedback into PRIDE datasetsEBI farm, LSFGriss et al., Nat. Methods, 2013

Clustered all public spectra in PRIDE by April 2015.Apache Hadoop.Starting with 256 M spectra.190 M unidentified spectra (they were filtered to 111 M for spectra that are likely to represent a peptide).66 M identified spectraResult: 28 M clusters 5 calendar days on 30 node Hadoop cluster, 340 CPU coresGriss et al., Nat. Methods, 2016

Juan A. [email protected] ConferenceGhent, 17 November 2016PRIDE Cluster - Concept

Juan A. [email protected] ConferenceGhent, 17 November 2016Examples: one perfect cluster (2)

Juan A. [email protected] ConferenceGhent, 17 November 2016Highlights of the output of the analysis1. Inconsistent spectrum clusters

2. Clusters including identified and unidentified spectra.

3. Clusters just containing unidentified spectra.

Juan A. [email protected] ConferenceGhent, 17 November 20161. Re-analysis of inconsistent clusters

NMMAACDPR NMMAACDPR IGGIGTVPVGRNMMAACDPRPPECPDFDPPRVFDEFKPLVEEPQNLIKNMMAACDPRIGGIGTVPVGR Consensus spectrum

PPECPDFDPPRVFDEFKPLVEEPQNLIK Originally submitted identified spectraSpectrumclusteringNo sequence has a proportion in the cluster >50%

Juan A. [email protected] ConferenceGhent, 17 November 20161. Re-analysis of inconsistent clustersRe-analysed 3,997 large (>100 spectra), inconsistent clusters with PepNovo, SpectraST, X!Tandem.453 clusters (11%) were identified as peptides originated from keratins, trypsin, albumin, and hemoglobin.In this case, it is likely that a contaminants DB was not used in the search

Juan A. [email protected] ConferenceGhent, 17 November 2016Validation

Juan A. [email protected] ConferenceGhent, 17 November 2016

Juan A. [email protected] ConferenceGhent, 17 November 2016

Juan A. [email protected] ConferenceGhent, 17 November 2016

Juan A. [email protected] ConferenceGhent, 17 November 20162. Inferring identifications for originally unidentified spectra

Not identifiedPPECPDFDPPRPPECPDFDPPR PPECPDFDPPR

Not identifiedNot identified

Consensus spectrum

PPECPDFDPPRNot identifiedOriginally submitted spectraSpectrumclusteringIdentifications are inferred

Juan A. [email protected] ConferenceGhent, 17 November 2016

2. Inferring identifications for originally unidentified spectra399.1 M unidentified spectra were contained in clusters with a reliable identification.These are candidate new identifications (that need to be confirmed), often missed due to search engine settingsExample: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.

Juan A. [email protected] ConferenceGhent, 17 November 20163. Consistently unidentified clusters

Not identifiedNot identifiedNot identifiedNot identified

Consensus spectrum

Not identifiedNot identifiedOriginally submitted spectraSpectrumclusteringMethod to target commonly found unidentified spectra

??

Juan A. [email protected] ConferenceGhent, 17 November 2016PRIDE Cluster

Sequence-based search enginesSpectrum clusteringIncorrectly or unidentified spectra

Juan A. [email protected] ConferenceGhent, 17 November 20163. Consistently unidentified clusters19 M clusters contain only unidentified spectra.

41,155 of these spectra have more than 100 spectra (=12 M spectra, 5%).

Most of them are likely to be derived from peptides.

They could correspond to PTMs or variant peptides.

With various methods, we found likely identifications for about 20%.

Vast amount of data mining remains to be done.

Juan A. [email protected] ConferenceGhent, 17 November 20163. Consistently unidentified clusters

Juan A. [email protected] ConferenceGhent, 17 November 20163. Consistently unidentified clusters

Juan A. [email protected] ConferenceGhent, 17 November 2016PRIDE Cluster as a Public Data Mining Resource45http://www.ebi.ac.uk/pride/cluster Spectral libraries for 16 species.All clustering results, as well as specific subsets of interest available.Source code (open source) and Java API

Juan A. [email protected] ConferenceGhent, 17 November 2016Consistently unidentified clustersWe provide the results split per species in MGF and mzML format.

Very interested in getting people trying to work in those.

Available for several species (Largest clusters at present).

Juan A. [email protected] ConferenceGhent, 17 November 2016SummaryPRIDE is now receiving ~8 datasets per working day.

A lot of possibilities open for reuse of this data.

New OmicsDI resource.

It is possible to detect spectra that are consistently unidentified across hundreds of datasets (maybe peptide variants, or peptides with PTMs not initially considered).

Juan A. [email protected] ConferenceGhent, 17 November 2016Aknowledgements: PeopleAttila CsordasTobias TernentGerhard Mayer (de.NBI)

Johannes GrissYasset Perez-RiverolManuel Bernal-LlinaresAndrew Jarnuczak

Enrique Perez

Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob

Acknowledgements: The PRIDE Team

All data submitters !!!

@pride_ebi@proteomexchange

Juan A. [email protected] ConferenceGhent, 17 November 201648

Questions?

http://www.slideshare.net/JuanAntonioVizcaino

Juan A. [email protected] ConferenceGhent, 17 November 201649