From Raw Data to MetaData Files Yasset Perez Riverol Proteomics & Bioinformatics CIGB

Standardization in Proteomics: From raw data to metadata files

DESCRIPTION

proteomics, standards formats, Proteomics Standardisation Initiative (PSI), mzIdentML, mzML, mzXML

Page 1: Standardization in Proteomics: From raw data to metadata files

From Raw Data to MetaData Files

Yasset Perez Riverol, Proteomics & Bioinformatics

CIGB  

Page 2

Common Proteomic Workflow

Mixture/Sample  

Separation Techniques (1D, 2D)

LC   MS/MS   Identification

OMSSA  

– Different providers: (annotations, software converters & viewers)

– For Raw data formats, there is also the very real problem of “aging”.

Different: –  Protocols. –  Outputs. –  Providers.

Different: –  Strategies. –  Search Engines. –  Post-Processing Analysis. –  File Outputs.

Page 3

LC-MS/MS (different instruments)

Raw data is binary! You cannot read it with Notepad, nor with anything other than the vendor's own programs and libraries.

Raw File Raw File Raw File Raw File

Peaks without processing!!!

Page 4

LC-MS/MS (“aging” problem.)

Next to the problem with proprietary raw data formats, there is also the very real problem of “aging” that comes with any binary formatted data. As time goes by, support for certain formats tends to evaporate and within the space of several years, readers can no longer be found for the format.

Martens et al., Proteomics 2005, 5, 3501–3505

Thermo XCalibur MassLynx Trapper Compass FrameWork

Page 5

Information inside Raw files

•  Raw files contain all the individual peaks as registered by the instrument detector.
•  Peaks without processing!
•  For LC-MS machines, they can store elution profiles and times for the LC part.
•  Depending on the vendor and make of the machine, other useful instrument-related information can be stored in these files as well.

Page 6

File Formats Evolution

Pure Peaks Formats

(pkl, ms2, mgf)

mzXML (2004)

mzData (2006)

mzML (2008)

Nature Biotechnology. 2004, 22 (11), 1459-1466.

mzData, http://psidev.info/index.php?q=node/80#mzdata.

Mol Cell Proteomics. 2011, 10(1),

Page 7

Pure Peak File

BEGIN IONS

PEPMASS=406.283

CHARGE=2+,3+

TITLE=Experiment_1

145.119100 8

217.142900 75

409.221455 11

438.314735 46

567.400183 24

714.447552 31

116.113400 72

91.2165000 32

405.288933 94

39.3021000 12

549.379462 21

715.466300 81

15.1098000 62

45.1358430 28

mgf (mascot generic file)

814.27 22673800 1

221.06 2529.3

223.84 220.9

226.91 1026.9

227.97 1037.9

231.06 110.6

239.05 7193.1

239.74 2513.3

240.27 363.4

240.79 1314.7

241.45 629.9

254.85 332.5

259.71 200.5

260.93 2437.7

pkl (peak list)

539.3453 2

86.1006 4.0000

112.1109 3.0000

115.0906 2.0000

120.0817 5.0000

175.0219 2.0000

225.1467 2.0000

225.7205 2.0000

228.1194 2.0000

230.1106 2.0000

234.1836 2.0000

238.6206 2.0000

240.1569 3.0000

251.1396 2.0000

254.1557 2.0000

261.1669 9.0000

261.6609 2.0000

268.1504 8.0000

dta
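The mgf listing on this slide can be read with a few lines of code, which is what makes the pure peak formats so popular. The sketch below is a minimal reader, not a full MGF implementation: the function name `parse_mgf` and the dictionary layout are illustrative choices, and the example text adds the `END IONS` line that closes a spectrum in MGF files but is not shown in the slide excerpt.

```python
def parse_mgf(text):
    """Parse MGF text into a list of spectra (dicts with 'params' and 'peaks')."""
    spectra = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            # Start collecting a new spectrum.
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as PEPMASS=406.283
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                # Peak line: m/z followed by intensity
                mz, intensity = line.split()[:2]
                current["peaks"].append((float(mz), float(intensity)))
    return spectra


# Shortened copy of the slide's example, with END IONS added.
EXAMPLE = """BEGIN IONS
PEPMASS=406.283
CHARGE=2+,3+
TITLE=Experiment_1
145.119100 8
217.142900 75
409.221455 11
END IONS
"""

spectra = parse_mgf(EXAMPLE)
print(spectra[0]["params"]["TITLE"], len(spectra[0]["peaks"]))
```

Note that nothing in such a file says which instrument, sample or processing step produced the peaks; that gap is what the XML formats on the following pages try to close.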

Page 8

mzXML

mzXML

scanList

scan

scanDescription

binaryDataArray

binaryDataArray

• • •

msLevel

PrecursorList

scan

scan

Parent FileList

MsInstrument

SeparationTechnique

spotting

dataProcessing

scanOrigin

deisotoped

centroided

deconvoluted

mzXML was the first XML-based file format developed for proteomics experiments. It was developed by the Systems Biology Group, USA.

The annotations in the file are string based: each one is stored as a (name attribute, value) pair.

It does not support chromatogram information.

It is very difficult to extend. The structure of the file does not allow new parameters or features to be defined for each element. For example, msInstrument is defined only by the name of the instrument. Also, if the spectrum has been preprocessed with some program, it is difficult to incorporate that information.

Currently there are more than 4 versions of the schema. The schema is supported by the Systems Biology Group, USA-Zurich.

Page 9

Controlled Vocabularies & Ontology Lookup Service

100173

TOF T.O.F.

time of flight

time-of-flight
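The point of this slide is that several free-text spellings should resolve to a single controlled-vocabulary entry. A minimal sketch of that idea follows; the identifier `100173` and the four spellings are taken from the slide, while the table layout and the helper names `normalise` and `resolve` are hypothetical and not part of OLS or any PSI library.

```python
# One controlled-vocabulary entry with the spellings shown on the slide.
CV_TERMS = {
    "100173": {
        "name": "time-of-flight",
        "synonyms": ["TOF", "T.O.F.", "time of flight"],
    },
}


def normalise(label):
    """Lower-case a label and drop everything except letters and digits."""
    return "".join(ch for ch in label.lower() if ch.isalnum())


def build_lookup(cv_terms):
    """Map every normalised name or synonym to its CV identifier."""
    lookup = {}
    for identifier, term in cv_terms.items():
        for label in [term["name"]] + term["synonyms"]:
            lookup[normalise(label)] = identifier
    return lookup


LOOKUP = build_lookup(CV_TERMS)


def resolve(label):
    """Return the CV identifier for a free-text label, or None if unknown."""
    return LOOKUP.get(normalise(label))


for spelling in ["TOF", "T.O.F.", "time of flight", "time-of-flight"]:
    print(spelling, "->", resolve(spelling))
```

Software that stores the identifier instead of the free text can compare annotations from different vendors without guessing which spelling was used.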

Page 10

OLS
•  Is a web-service-oriented system developed in Java.
•  It was developed and is maintained by the PRIDE Team!
•  We have the service installed on a local machine!
•  I know the library and the source code. We have a strong collaboration with the developers of the service!

Page 11

mzML

mzML

run

spectrum

spectrumDescription

binaryDataArray

binaryDataArray

• • •

precursorList

scan

spectrumList • • • spectrum

spectrum

cvList

referenceableParamGroupList

sampleList

acquisitionSettingsList

dataProcessingList

softwareList

instrumentConfigurationList

chromatogramList • • • chromatogram

chromatogram

chromatogram

binaryDataArray

binaryDataArray

Chromatograms may be encoded in mzML in a special element that contains one or more cvParams to describe the type of chromatogram, followed by two base64-encoded binary data arrays.

Meta data about the spectra plus all the spectra themselves.

The header at the top of the file encodes information about: the source of the data as well as information about the sample, instrument and software that processed the data.

CV terms are used to define the metadata and the properties of each element (software, instrument, sample, scan settings, etc.).
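The base64-encoded binary data arrays mentioned above can be decoded with the standard library alone. The sketch below assumes the usual mzML conventions (little-endian 32- or 64-bit floats, optionally zlib-compressed, with precision and compression declared through cvParams next to the array); the function name and its parameters are illustrative, and the XML parsing that would extract the encoded string is left out.

```python
import base64
import struct
import zlib


def decode_binary_array(encoded, precision=64, compressed=False):
    """Decode one base64 binary data array into a list of floats."""
    raw = base64.b64decode(encoded)
    if compressed:
        raw = zlib.decompress(raw)
    fmt = "d" if precision == 64 else "f"
    count = len(raw) // struct.calcsize("<" + fmt)
    # "<" selects little-endian byte order.
    return list(struct.unpack("<" + fmt * count, raw))


# Build a small payload the same way a writer would, then decode it again.
mz_values = [145.1191, 217.1429, 409.221455]
packed = struct.pack("<3d", *mz_values)
plain = base64.b64encode(packed).decode("ascii")
zipped = base64.b64encode(zlib.compress(packed)).decode("ascii")

print(decode_binary_array(plain))
print(decode_binary_array(zipped, compressed=True))
```

A spectrum or chromatogram carries two such arrays, for example m/z and intensity, or time and intensity, and both are decoded in the same way.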

Page 12

Comparison table

Metadata / file format     mzML  mzData  mzXML  mgf  pkl  ms2  dta
Species                    X     X       -      -    -    -    -
Tissue                     X     X       -      -    -    -    -
Instrument                 X     X       X      -    -    -    -
Experiment Description     X     -       -      -    -    -    -
References                 X     -       -      -    -    -    -
Contacts                   X     X       X      -    -    -    -
Additional                 X*    X       X      -    -    -    -
Samples                    X     X       -      -    -    -    -
Instrument Configuration   X     X       X      -    -    -    -
Data Processing            X     X       X      -    -    -    -

* FileContent / creationDate

mzML is supported by:
- Institute for Systems Biology, Seattle.
- Swiss Institute for Bioinformatics and Geneva Bioinformatics, Switzerland.
- European Bioinformatics Institute, Hinxton, UK.
- Thermo Fisher, San Jose, CA.
- Indigo Biosystems, Carmel, IN.

mzML and mzXML are compatible with:
- Mascot, X! Tandem, OMSSA.
- PeptideProphet

It is not binary! You can read it with Notepad, and also with your own libraries and code.

Page 13

ProteoWizard msConvert

API

Thermo API

Bruker API

Agilent API

Waters API

File Input Supported: – Thermo – Bruker – Agilent – Waters – pkl – mgf – dta – ms2

File Output Supported: – mzML – mzXML – mzData – pkl – mgf

Cross-platform !!!!

Page 14

Still growing…

Page 15

Identification

X!Tandem

Mascot

Database Search

Mascot

Percolator

PeptideProphet

Scaffold

X! Tandem OMSSA Phenyx

De Novo Sequence

Peaks PepNovo

Spectral Library

SpectraST NIST

Thousands of approaches! You can combine different programs, with different parameters and different workflows.

PeptideProphet

Page 16

File Formats?

.dat

.dat .dat

protXML

pepXML

AnalysisXML

AnalysisXML: v1.0 – candidate (Dec 08)

Seattle Proteome Center at the Institute for Systems Biology

Programs with excel output

OMSSA  

Programs with their own output format

Page 17

mzIdentML

Collection of use cases agreed to cover:

-  e.g. PMF, MS/MS, sequence tag, de novo, spectral library

Pep Evidence1

Ambiguity Group1

Protein Result Set

Protein Hypothesis1

Pep Evidence2

Pep Evidence1

Protein Hypothesis2

Pep Evidence2

Pep Evidence1

Ambiguity Group2

Protein Hypothesis1

… … Pep Evidence2

Multiple search engines! Protocol descriptions, database properties, search engines, parameters, modifications. Fully compatible with ontologies! Supported by Mascot!
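The nesting in the diagram on this slide (ambiguity groups containing protein hypotheses, each backed by peptide evidence) can be sketched as a small object model. The class names below follow the labels on the slide rather than the element names of the mzIdentML schema, and the accessions, peptide sequences and the `shared_peptides` helper are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class PepEvidence:
    peptide: str            # identified peptide sequence
    protein_accession: str  # protein the peptide is mapped to


@dataclass
class ProteinHypothesis:
    accession: str
    evidence: List[PepEvidence] = field(default_factory=list)


@dataclass
class AmbiguityGroup:
    group_id: str
    hypotheses: List[ProteinHypothesis] = field(default_factory=list)

    def shared_peptides(self):
        """Peptides that support more than one hypothesis in this group."""
        seen = {}
        for hypothesis in self.hypotheses:
            for peptide in {ev.peptide for ev in hypothesis.evidence}:
                seen.setdefault(peptide, set()).add(hypothesis.accession)
        return sorted(p for p, accs in seen.items() if len(accs) > 1)


# Hypothetical example: two candidate proteins share one peptide.
group = AmbiguityGroup(
    "group_1",
    [
        ProteinHypothesis(
            "PROT_A",
            [PepEvidence("PEPTIDEK", "PROT_A"), PepEvidence("SAMPLER", "PROT_A")],
        ),
        ProteinHypothesis("PROT_B", [PepEvidence("PEPTIDEK", "PROT_B")]),
    ],
)
print(group.shared_peptides())
```

Keeping the hypotheses grouped like this is what lets a report show that two proteins cannot be told apart from the identified peptides alone.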

Page 18

mzIdentML
•  Results in mzIdentML format can be exported directly from Mascot (export of version 1.1 available in version 2.3).
•  Converters are currently available for Sequest and Proteome Discoverer output (.msf and .protXML) (e.g. within ProCon: http://www.medizinisches-proteom-center.de/ProCon).
•  OMSSA and X!Tandem (http://code.google.com/p/mzidentml-parsers/)
•  The pipeline applications Scaffold (import into Scaffold PTM and export of mzIdentML available in Scaffold version 3) and TPP (results can be exported to mzIdentML via the ProteoWizard converter).
•  A beta exporter is also available for Phenyx.
•  OpenMS implements C++ code for reading and (as of release 1.9) writing mzIdentML.
•  An open-source Java API for reading and writing mzIdentML has also been developed, available from http://code.google.com/p/jmzidentml/

Page 19

Gels (nobody cares)
―  Only limited support for the storage of detailed descriptions of all stages of a gel-based proteomics workflow.
―  Information is mostly restricted to unstructured text paragraphs.

One of the reasons is the lack of widely accepted standards for representing gel data and the difficulties encountered in modelling the range of workflows employed in different settings.

Different Scenarios: OffGel electrophoresis

 

1D 2D

Page 20

GelML

GelML is basically a metadata file that contains the URI of the image file.

The structure of the schema is complex! One of the reasons is the number of different protocols.

It is not well documented, has a small community behind it, and is not widely adopted!

Page 21

Before Technical Things!
•  The number of tools based on XML standard files is growing exponentially. Why:
– Easy to read and write!
– They are standards!
– Repository support (PRIDE, PeptideAtlas).
– They hold enough information for most programs.

Page 22

APIs
•  jmzml: Library to read/write information from mzML files.
•  jmzidentml: Library to read/write information from mzIdentML files.
•  jgelml: Library to read/write information from GelML files (currently in development).
•  Developed by the PRIDE team.
•  Java libraries.
•  Still growing.
•  Open-source and free.

Page 23

ms-core-api Applications

proteolims

N-terminal Identification

Web services

ms-core-api

APIs

jmzml jmzxml jmzData jmzReader jmzidml jgelml

ms-core-api is a Java framework, a common object model to represent different file formats.

Supported now: ―  mzIdentML ―  mzML, mzData, mzXML ―  PRIDE XML, PRIDE database ―  pkl, mgf, ms2, dta ―  GelML (work in progress)

Cross-platform and well documented!

The aim of the ms-core-api library is to give our current development tools a common language of objects and classes!
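The idea of a common object model can be illustrated with a toy example: two format-specific readers return the same `Spectrum` type, so downstream code never needs to know which file format it came from. This is a Python sketch of the concept only, not the Java API of ms-core-api; every class and function name here is hypothetical. It assumes a single-spectrum pkl header of precursor m/z, intensity and charge, and a dta header of singly protonated mass (MH+) and charge.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

PROTON_MASS = 1.007276  # mass of a proton in Da


@dataclass
class Spectrum:
    """Format-independent representation shared by all readers."""
    precursor_mz: float
    charge: Optional[int]
    peaks: List[Tuple[float, float]]
    source_format: str


def _read_peaks(lines):
    """Turn 'mz intensity' lines into (mz, intensity) tuples."""
    return [(float(mz), float(inten)) for mz, inten in (ln.split()[:2] for ln in lines)]


def read_pkl(text):
    """pkl header assumed here: precursor m/z, precursor intensity, charge."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    mz, _intensity, charge = lines[0].split()[:3]
    return Spectrum(float(mz), int(charge), _read_peaks(lines[1:]), "pkl")


def read_dta(text):
    """dta header assumed here: singly protonated mass (MH+) and charge."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    mh, charge = lines[0].split()[:2]
    z = int(charge)
    mz = (float(mh) + (z - 1) * PROTON_MASS) / z  # convert MH+ to m/z
    return Spectrum(mz, z, _read_peaks(lines[1:]), "dta")


def base_peak(spectrum):
    """Works on any Spectrum, whatever file format it came from."""
    return max(spectrum.peaks, key=lambda peak: peak[1])


pkl_spectrum = read_pkl("814.27 22673800 1\n221.06 2529.3\n239.05 7193.1\n")
dta_spectrum = read_dta("539.3453 2\n86.1006 4.0000\n261.1669 9.0000\n")
for spectrum in (pkl_spectrum, dta_spectrum):
    print(spectrum.source_format, base_peak(spectrum))
```

A viewer or report written against `Spectrum` would then gain support for a new format as soon as a reader for it is added.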

Page 24

The relevance of the API concept
•  Different programs can be used to implement the main functionalities.
•  If you have APIs, then you just need to think about integration, scalability and presentation…
•  Easy to maintain, to scale and to share…
•  They are the “MAIN CORE”!

Page 25

ms-core-api: good for…?

Spectrum Viewer

Identification Report

Page 26

Think about reviewing our experiments

MetaData Report

Reviewer Panel

Page 27

Conclusion
•  mzML is the current standard for MS/MS storage.
•  mzIdentML will be the future standard in the proteomics community for peptide/protein identification storage.
•  GelML is not widely adopted in the community, but it is so far the best option for gel information storage.
•  ms-core-api supports mzML and mzIdentML, and in the near future GelML.