DESCRIPTION
proteomics, standards formats, Proteomics Standardisation Initiative (PSI), mzIdentML, mzML, mzXML
From Raw Data to MetaData Files
Yasset Perez Riverol, Proteomics & Bioinformatics
CIGB
Common Proteomic Workflow
Mixture/Sample
Separation Techniques (1D, 2D)
LC MS/MS Identification
OMSSA
– Different providers: (annotations, software converters & viewers)
– For Raw data formats, there is also the very real problem of “aging”.
Different: – Protocols. – Outputs. – Providers.
Different: – Strategies. – Search Engines. – Post-Processing Analysis. – File Outputs.
LC-MS/MS (different instruments)
Raw data is binary! You cannot read it with Notepad, nor without the vendor's programs and libraries.
Raw File Raw File Raw File Raw File
Peaks without processing!!!
LC-MS/MS (“aging” problem.)
Next to the problem with proprietary raw data formats, there is also the very real problem of “aging” that comes with any binary formatted data. As time goes by, support for certain formats tends to evaporate and within the space of several years, readers can no longer be found for the format.
Martens et al., Proteomics 2005, 5, 3501–3505
Thermo XCalibur MassLynx Trapper Compass FrameWork
Information inside Raw files
• Raw files contain all the individual peaks as registered by the instrument detector.
• Peaks without processing!!!
• For LC-MS machines, they can also store elution profiles and times for the LC part.
• Depending on the vendor and make of the machine, other useful instrument-related information can be stored in these files as well.
File Formats Evolution
Pure Peaks Formats
(pkl, ms2, mgf)
mzXML (2004)
mzData (2006)
mzML (2008)
Nature Biotechnology. 2004, 22 (11), 1459-1466.
mzData, http://psidev.info/index.php?q=node/80#mzdata.
Mol Cell Proteomics. 2011,10(1),
Pure Peak File
mgf (mascot generic file)
BEGIN IONS
PEPMASS=406.283
CHARGE=2+,3+
TITLE=Experiment_1
145.119100 8
217.142900 75
409.221455 11
438.314735 46
567.400183 24
714.447552 31
116.113400 72
91.2165000 32
405.288933 94
39.3021000 12
549.379462 21
715.466300 81
15.1098000 62
45.1358430 28
pkl (peak list)
814.27 22673800 1
221.06 2529.3
223.84 220.9
226.91 1026.9
227.97 1037.9
231.06 110.6
239.05 7193.1
239.74 2513.3
240.27 363.4
240.79 1314.7
241.45 629.9
254.85 332.5
259.71 200.5
260.93 2437.7
dta
539.3453 2
86.1006 4.0000
112.1109 3.0000
115.0906 2.0000
120.0817 5.0000
175.0219 2.0000
225.1467 2.0000
225.7205 2.0000
228.1194 2.0000
230.1106 2.0000
234.1836 2.0000
238.6206 2.0000
240.1569 3.0000
251.1396 2.0000
254.1557 2.0000
261.1669 9.0000
261.6609 2.0000
268.1504 8.0000
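Because the pure peak formats are plain text, a few lines of code are enough to read them. The following is only a minimal sketch for the mgf case, assuming each spectrum is wrapped in BEGIN IONS / END IONS, that header lines have the form KEY=value, and that peak lines hold an m/z and an intensity separated by whitespace. The function name `parse_mgf` is hypothetical, and the example input is a shortened copy of the mgf listing from the slide with an END IONS line added.

```python
from typing import Dict, List


def parse_mgf(text: str) -> List[Dict]:
    """Parse MGF text into a list of {'params': ..., 'peaks': ...} dicts."""
    spectra = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if line[0].isdigit():
                # Peak line: m/z followed by intensity.
                mz, intensity = line.split()[:2]
                current["peaks"].append((float(mz), float(intensity)))
            elif "=" in line:
                # Header line such as PEPMASS=406.283.
                key, value = line.split("=", 1)
                current["params"][key] = value
    return spectra


EXAMPLE = """BEGIN IONS
PEPMASS=406.283
CHARGE=2+,3+
TITLE=Experiment_1
145.119100 8
217.142900 75
END IONS
"""

spectra = parse_mgf(EXAMPLE)
print(spectra[0]["params"]["PEPMASS"], len(spectra[0]["peaks"]))
```

Note what the parser cannot return: there is no instrument, sample or processing metadata in the file to read, which is the gap the XML formats on the next slides address.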
mzXML
[Figure: mzXML schema. Elements shown: scanList, scan (msLevel, scanDescription, PrecursorList, binaryDataArray), Parent FileList, MsInstrument, SeparationTechnique, spotting, dataProcessing, scanOrigin; processing flags: deisotoped, centroided, deconvoluted]
mzXML was the first XML-based file format developed for proteomics experiments. It was developed at the Institute for Systems Biology, USA.
The annotations in the file are string based, i.e. they are stored as (attribute name, value) pairs.
It does not support chromatogram information.
It is very difficult to extend. The structure of the file does not allow new parameters or features to be defined for each element. For example, msInstrument is described only by the name of the instrument. Also, if the spectrum has been preprocessed with some program, it is difficult to incorporate that information.
There are currently more than 4 versions of the schema. The schema is supported by the Systems Biology groups (USA and Zurich).
Controlled Vocabularies & Ontology Lookup Service
100173
TOF T.O.F.
time of flight
time-of-flight
OLS
• It is a web-service-oriented system developed in Java.
• It was developed and is maintained by the PRIDE team!
• We have the service installed on a local machine!
• I know the library and the source code. We have a strong collaboration with the developers of the service!
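The point of a controlled vocabulary is that the free-text variants on the slide (TOF, T.O.F., time of flight, time-of-flight) all resolve to one term identifier. A minimal sketch of that idea follows, using a local dictionary as a stand-in for a real lookup; it does not call the OLS web service. The function name `to_term_id` is hypothetical, and the identifier is copied as-is from the slide, so treat it as a placeholder.

```python
# Identifier exactly as it appears on the slide; a placeholder, not a
# verified accession from any vocabulary.
SLIDE_TERM_ID = "100173"

# Free-text variants that should all resolve to the same term.
SYNONYMS = {
    "tof": SLIDE_TERM_ID,
    "t.o.f.": SLIDE_TERM_ID,
    "time of flight": SLIDE_TERM_ID,
    "time-of-flight": SLIDE_TERM_ID,
}


def to_term_id(free_text):
    """Return the term identifier for a free-text name, or None if unknown."""
    return SYNONYMS.get(free_text.strip().lower())


for label in ["TOF", "T.O.F.", "time of flight", "Time-of-Flight"]:
    print(label, "->", to_term_id(label))
```

Once every file stores the identifier instead of the free text, two tools can agree on what the analyzer was without string matching.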
mzML
[Figure: mzML schema. Header elements: cvList, referenceableParamGroupList, sampleList, instrumentConfigurationList, softwareList, dataProcessingList, acquisitionSettingsList. The run holds a spectrumList (each spectrum with spectrumDescription, precursorList, scan and two binaryDataArray elements) and a chromatogramList (each chromatogram with two binaryDataArray elements).]
Chromatograms may be encoded in mzML in a special element that contains one or more cvParams to describe the type of chromatogram, followed by two base64-encoded binary data arrays.
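To make the "two base64-encoded binary data arrays" concrete, here is a minimal round-trip sketch. It assumes 64-bit little-endian floats and optional zlib compression; in a real mzML file the precision and compression of each array are declared by cvParams on the binaryDataArray element and should be read from there. The helper names `encode_array` and `decode_array` and the numbers are illustrative only.

```python
import base64
import struct
import zlib


def encode_array(values, compress=False):
    """Pack floats as little-endian 64-bit doubles, then base64-encode."""
    raw = struct.pack("<%dd" % len(values), *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, compressed=False):
    """Reverse of encode_array: base64 text back to a list of floats."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    count = len(raw) // 8  # 8 bytes per 64-bit float
    return list(struct.unpack("<%dd" % count, raw))


# A chromatogram is two parallel arrays: time and intensity (made-up values).
times = [0.0, 0.5, 1.0]
intensities = [1200.0, 5400.0, 800.0]

encoded_time = encode_array(times)
encoded_intensity = encode_array(intensities, compress=True)

chromatogram = list(zip(decode_array(encoded_time),
                        decode_array(encoded_intensity, compressed=True)))
print(chromatogram)
```

The same decoding applies to the m/z and intensity arrays of a spectrum, since they use the same binaryDataArray element.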
Meta data about the spectra plus all the spectra themselves.
The header at the top of the file encodes information about: the source of the data as well as information about the sample, instrument and software that processed the data.
CV terms are used to define the metadata and the properties of each element (software, instrument, sample, scan settings, etc.).
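A minimal sketch of reading such CV terms with the Python standard library. The fragment is hand-written for illustration and omits the XML namespace that a real mzML file declares; the accessions are quoted from memory of the PSI-MS vocabulary and should be checked against the current CV. The helper name `cv_params` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, namespace-free fragment in the style of an mzML spectrum.
FRAGMENT = """
<spectrum id="scan=1" index="0" defaultArrayLength="2">
  <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
  <cvParam cvRef="MS" accession="MS:1000127" name="centroid spectrum" value=""/>
</spectrum>
"""


def cv_params(element):
    """Map accession -> (name, value) for the direct cvParam children."""
    return {
        cv.get("accession"): (cv.get("name"), cv.get("value"))
        for cv in element.findall("cvParam")
    }


spectrum = ET.fromstring(FRAGMENT.strip())
params = cv_params(spectrum)
ms_level = int(params["MS:1000511"][1])
print(ms_level, params["MS:1000127"][0])
```

Looking terms up by accession rather than by name is the design point: the name is for human readers, while the accession is what software should compare.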
Comparison table

Metadata / file format     mzML  mzData  mzXML  mgf  pkl  ms2  dta
Species                    X     X       -      -    -    -    -
Tissue                     X     X       -      -    -    -    -
Instrument                 X     X       X      -    -    -    -
Experiment Description     X     -       -      -    -    -    -
References                 X     -       -      -    -    -    -
Contacts                   X     X       X      -    -    -    -
Additional                 X*    X       X      -    -    -    -
Samples                    X     X       -      -    -    -    -
Instrument Configuration   X     X       X      -    -    -    -
Data Processing            X     X       X      -    -    -    -

* mzML: FileContent / creationDate
mzML is supported by:
- Institute for Systems Biology, Seattle.
- Swiss Institute for Bioinformatics and Geneva Bioinformatics, Switzerland.
- European Bioinformatics Institute, Hinxton, UK.
- Thermo Fisher, San Jose, CA.
- Indigo Biosystems, Carmel, IN.
mzML and mzXML are compatible with:
- Mascot, X!Tandem, OMSSA.
- PeptideProphet.
It is not binary! You can read it with Notepad, and also with your own libraries and code.
ProteoWizard msConvert
API
Thermo API
Bruker API
Agilent API
Waters API
File Input Supported: – Thermo – Bruker – Agilent – Waters – Pkl – mgf, – dta – ms2
File Output Supported: – mzML – mzXML – mzData – Pkl – mgf
Cross-platform !!!!
Still growing…
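A minimal sketch of driving msConvert from a script, assuming the `msconvert` executable from ProteoWizard is on the PATH. The helper name `build_msconvert_command` and the file names are hypothetical; `--mzML`, `--mzXML`, `--mgf` and `-o` are msconvert command-line options. Reading vendor raw files additionally assumes the vendor libraries are available on the machine.

```python
import subprocess

# Output formats covered by this sketch and the matching msconvert option.
FORMAT_FLAGS = {"mzML": "--mzML", "mzXML": "--mzXML", "mgf": "--mgf"}


def build_msconvert_command(input_path, output_format="mzML", output_dir="."):
    """Return the argument list for one msconvert call (does not run it)."""
    if output_format not in FORMAT_FLAGS:
        raise ValueError("unsupported output format: %s" % output_format)
    return ["msconvert", input_path, FORMAT_FLAGS[output_format], "-o", output_dir]


command = build_msconvert_command("sample.raw", "mzML", "converted")
print(" ".join(command))

# To actually run the conversion (requires msconvert to be installed):
# subprocess.run(command, check=True)
```

Building the argument list separately from running it keeps the conversion step easy to log and to repeat over a folder of raw files.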
Identification
X!Tandem
Mascot
Database Search
Mascot
Percolator
PeptideProphet
Scaffold
X!Tandem OMSSA Phenyx
De Novo Sequence
Peaks PepNovo
Spectral Library
SpectraST NIST
Thousands of approaches! You can combine different programs, with different parameters and different workflows.
PeptideProphet
File Formats?
.dat
.dat .dat
protXML
pepXML
AnalysisXML
AnalysisXML: v1.0 – candidate (Dec 08)
Seattle Proteome Center at the Institute for Systems Biology
Programs with Excel output
OMSSA
Programs with their output format
mzidentml Collection of use cases agreed to cover:
- e.g. PMF, MS/MS, sequence tag, de novo, spectral library
[Figure: a Protein Result Set contains Ambiguity Groups (Ambiguity Group1, Ambiguity Group2, …); each Ambiguity Group contains Protein Hypotheses (Protein Hypothesis1, Protein Hypothesis2, …); each Protein Hypothesis is supported by Peptide Evidence items (Pep Evidence1, Pep Evidence2, …)]
Multiple Search Engines! Protocol Descriptions, Database Properties, Search Engines, Parameters, Modifications. Fully compatible with ontologies! Supported by Mascot!
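The result hierarchy on this slide (Protein Result Set, Ambiguity Group, Protein Hypothesis, Peptide Evidence) can be sketched as a small object model. This is an illustration of the nesting only: the class and field names follow the slide labels, not the element names of the mzIdentML schema or of any library, and the sequences and accessions are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PeptideEvidence:
    peptide: str  # peptide sequence (placeholder values below)


@dataclass
class ProteinHypothesis:
    accession: str
    evidence: List[PeptideEvidence] = field(default_factory=list)


@dataclass
class AmbiguityGroup:
    name: str
    hypotheses: List[ProteinHypothesis] = field(default_factory=list)

    def shared_peptides(self):
        """Peptides that support every hypothesis in the group."""
        sets = [{e.peptide for e in h.evidence} for h in self.hypotheses]
        return set.intersection(*sets) if sets else set()


@dataclass
class ProteinResultSet:
    groups: List[AmbiguityGroup] = field(default_factory=list)


pep1 = PeptideEvidence("PEPTIDEK")
pep2 = PeptideEvidence("SAMPLER")
group1 = AmbiguityGroup("Ambiguity Group1", [
    ProteinHypothesis("PROT_A", [pep1, pep2]),
    ProteinHypothesis("PROT_B", [pep1]),
])
results = ProteinResultSet([group1])
print(sorted(group1.shared_peptides()))
```

The same peptide evidence object appears under two hypotheses, which is exactly why the hypotheses are grouped: the shared peptide alone cannot say which protein was present.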
mzidentml
• Results in mzIdentML format can be exported directly from Mascot (export of mzIdentML version 1.1 is available in Mascot version 2.3).
• Converters are currently available for Sequest and Proteome Discoverer output (.msf and .protXML), e.g. within ProCon (http://www.medizinisches-proteom-center.de/ProCon), and for OMSSA and X!Tandem (http://code.google.com/p/mzidentml-parsers/).
• The pipeline applications Scaffold (import into Scaffold PTM and export of mzIdentML available in Scaffold version 3) and TPP (results can be exported to mzIdentML via the ProteoWizard converter) also support it.
• A beta exporter is also available for Phenyx.
• OpenMS implements C++ code for reading and (as of release 1.9) writing mzIdentML.
• An open-source Java API for reading and writing mzIdentML has also been developed, available from http://code.google.com/p/jmzidentml/
Gels (nobody cares) ― Only limited support for the storage of detailed descriptions of all stages of a gel-based proteomics workflow. ― Information is mostly restricted to unstructured text paragraphs.
One of the reasons is the lack of widely accepted standards for representing gel data and the difficulties encountered modelling the range of workflows employed in different settings.
Different Scenarios: OffGel electrophoresis
1D 2D
gelml
GelML is basically a metadata file that contains the URI of the image file.
The structure of the schema is complex! One of the reasons is the number of different protocols.
It is not well documented, has a small community behind it, and is not widely adopted!
Before the Technical Things!
• The number of tools based on XML standard files is growing exponentially. Why:
– Easy to read and write!
– They are standards!
– Repository support (PRIDE, PeptideAtlas).
– They carry enough information for most programs.
APIs
• jmzml: Library to read/write information from mzml files.
• jmzidentml: Library to read/write information from mzidentml files.
• jgelml: Library to read/write information from gelml files (under development).
• Developed by the PRIDE team.
• Java Libraries.
• Still growing.
• Open-Source and Free.
ms-core-api Applications
proteolims
N-terminal Identification
Web services
ms-core-api
APIs
jmzml jmzxml jmzData jmzReader jmzidml jgelml
ms-core-api is a Java framework, a common object model to represent different file formats.
Supported now: ― mzidentml ― mzml, mzData, mzXML ― pride xml, pride database ― pkl, mgf, ms2, dta ― gelml (current work)
Cross-platform and well documented!
The aim of the ms-core-api library is to give our current development tools a common language of objects and classes!
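The idea of a common object model can be illustrated with a short sketch: two format-specific readers that return the same `Spectrum` object, so that downstream code never needs to know which file format the data came from. This is not the ms-core-api (which is a Java library); every class and function name here is hypothetical. It assumes that the first line of a pkl spectrum holds precursor m/z, intensity and charge, and that the first line of a dta file holds the singly protonated mass (M+H)+ and the charge.

```python
from dataclasses import dataclass
from typing import List, Tuple

PROTON_MASS = 1.007276  # mass of a proton in Da


@dataclass
class Spectrum:
    """Format-independent representation shared by all readers."""
    precursor_mz: float
    charge: int
    peaks: List[Tuple[float, float]]


def _parse_peaks(lines):
    """Turn 'm/z intensity' lines into (m/z, intensity) tuples."""
    return [(float(mz), float(inten)) for mz, inten in (l.split()[:2] for l in lines)]


def read_pkl(text):
    lines = [l for l in text.splitlines() if l.strip()]
    mz, _intensity, charge = lines[0].split()
    return Spectrum(float(mz), int(charge), _parse_peaks(lines[1:]))


def read_dta(text):
    lines = [l for l in text.splitlines() if l.strip()]
    mh, charge = lines[0].split()
    z = int(charge)
    # Convert (M+H)+ to the m/z observed at charge z.
    mz = (float(mh) + (z - 1) * PROTON_MASS) / z
    return Spectrum(mz, z, _parse_peaks(lines[1:]))


# First lines of the pkl and dta listings shown on the pure peak slide.
pkl_spectrum = read_pkl("814.27 22673800 1\n221.06 2529.3\n223.84 220.9\n")
dta_spectrum = read_dta("539.3453 2\n86.1006 4.0000\n112.1109 3.0000\n")

for spectrum in (pkl_spectrum, dta_spectrum):
    print(round(spectrum.precursor_mz, 4), spectrum.charge, len(spectrum.peaks))
```

A viewer or report written against `Spectrum` works unchanged when a reader for another format is added, which is the benefit the slide attributes to a shared object model.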
The relevance of the API concept
• Different programs can be used to implement the main functionalities.
• If you have APIs, then you just need to think about integration, scalability and presentation.
• Easy to maintain, to scale and to share.
• They are the “MAIN CORE”!
ms-core-api: good for…?
Spectrum Viewer
Identification Report
Think about reviewing our experiments
MetaData Report
Reviewer Panel
conclusion
• mzml is the current standard for MS/MS storage.
• mzidentml will be the future standard in the proteomics community for peptide/protein identification storage.
• gelml is not widely adopted in the community, but so far it is the best option for gel information storage.
• ms-core-api supports mzml and mzidentml, and gelml in the near future.