2008.05.29 Copyright © 2008, Hisayuki Horai, All Rights Reserved. 1 Comparison of ESI-MS Spectra in MassBank Database Hisayuki Horai 1,2 , Masanori Arita 1,2,3,4 , Takaaki Nishioka 1,2 1 IAB, Keio Univ., 2 JST-BIRD, 3 Univ. of Tokyo, 4 RIKEN PSC BMEI 2008

Comparison of ESI-MS Spectra in MassBank Database


Page 1: Comparison of ESI-MS Spectra in MassBank Database

2008.05.29 Copyright © 2008, Hisayuki Horai, All Rights Reserved.

Comparison of ESI-MS Spectra in MassBank Database

Hisayuki Horai 1,2, Masanori Arita 1,2,3,4, Takaaki Nishioka 1,2

1 IAB, Keio Univ., 2 JST-BIRD, 3 Univ. of Tokyo, 4 RIKEN PSC

BMEI 2008

Page 2: Comparison of ESI-MS Spectra in MassBank Database

Table of Contents

• Metabolomics, Mass Spectrometry & Spectral Database

• Spectral Search by Similarity
  – Vector Space Model

• Evaluation of Relevance
  – MS/MS Spectra Search of Metabolites

Page 3: Comparison of ESI-MS Spectra in MassBank Database

Metabolomics & Mass Spectrometry

• Measurement of Metabolites
  – Identification & Quantification

[Figure: mass spectrum; x-axis: m/z (mass-to-charge-number ratio), y-axis: Intensity; peaks of a Precursor Ion and its Fragment Ions.]

Identification by Similarity of Peak Pattern

Page 4: Comparison of ESI-MS Spectra in MassBank Database

MassBank: Mass Spectral Database for Identification of Metabolites

http://www.massbank.jp/

• Comprehensive Collection
  – Metabolites, Drugs, Agrichemicals, ...
  – EI-MS, ESI-MS, MS/MS, XC/MS, ...
  – Various Experimental Conditions

• Distributed Database on Internet
  – Cloud Computing Environment for Users
  – Quality Control of Data at Contributor's Site

• Open to Public
  – Public Free Access via Internet
  – Provide Software as Freeware (Server, DB System, Search Engine, ...)

Variation of Resolution, Sensitivity, Fragmentation, ...

Distributed Search

Page 5: Comparison of ESI-MS Spectra in MassBank Database

Collaboration in MassBank


Page 6: Comparison of ESI-MS Spectra in MassBank Database

Spectral Search


Most Important Function of Spectral Database

Page 7: Comparison of ESI-MS Spectra in MassBank Database

Spectral Search by Similarity Based on Vector Space Model

Search Based on Vector Space Model: Already Established Search Method
  – Information Retrieval (e.g. Google, PubMed, ...)
  – Low Resolution (Integer) m/z Spectral Search for EI-MS

• Translate Spectrum to Vector
  – Axis for m/z of Peak
  – Element of Vector: Intensity of Peak

• Similarity of 2 Spectra = Cosine of Vectors

[Figure: Query q and Target d drawn on six shared axes, Dimension = 6.]
  ・ (1) - (6): m/z
  ・ q1 - q5, d1 - d6: Intensity

Query q = (q1, 0, q3, q4, q5, 0)
Target d = (d1, d2, 0, d4, 0, d6)

For Spectrum s1 and Spectrum s2 with angle θ between their vectors (Inner Product divided by the product of the Lengths):

$$\mathrm{Score}(s_1, s_2) = \cos\theta = \frac{s_1 \cdot s_2}{|s_1|\,|s_2|}$$

For the example vectors q and d:

$$|q| = \sqrt{q_1^2 + q_3^2 + q_4^2 + q_5^2}$$

$$|d| = \sqrt{d_1^2 + d_2^2 + d_4^2 + d_6^2}$$

$$q \cdot d = q_1 d_1 + q_4 d_4$$

$$\mathrm{Score}(q, d) = \cos\theta = \frac{q \cdot d}{|q|\,|d|}$$
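The cosine score above can be sketched in a few lines of Python. This is only an illustration: the function name `cosine_score` and the intensity values are hypothetical, and both vectors are assumed to be already aligned on the same m/z axes.

```python
import math

def cosine_score(q, d):
    """Cosine of the angle between two intensity vectors on shared m/z axes."""
    if len(q) != len(d):
        raise ValueError("vectors must share the same m/z axes")
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # assumption: an empty spectrum matches nothing
    return dot / (norm_q * norm_d)

# Hypothetical intensities laid out like the six-axis example:
# the query has peaks on axes 1, 3, 4, 5; the target on axes 1, 2, 4, 6.
query = [3.0, 0.0, 4.0, 1.0, 2.0, 0.0]
target = [3.0, 2.0, 0.0, 1.0, 0.0, 5.0]
score = cosine_score(query, target)
```

Only axes 1 and 4 carry peaks in both spectra, so only those two products contribute to the inner product, while every peak contributes to the lengths.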

Page 8: Comparison of ESI-MS Spectra in MassBank Database

Weighting & Normalization

Spectral Vector: ( ..., Vi, ... )

• Relative Intensity: "Intensity must be normalized by Largest Peak."
  – Vi = Intensity / max(Intensity)

• Importance of Large Ion: "Large m/z may be specific for a compound."
  – Vi = Relative Intensity · (m/z)^n  [ n > 1 ]

• Importance of Intensity: "Small peaks should not be ignored."
  – Vi = (Relative Intensity)^m · (m/z)^n  [ n > 1, 0 < m < 1 ]

Improvement for Better Search
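A minimal sketch of the general weighting Vi = (Relative Intensity)^m · (m/z)^n follows. The function name `weighted_vector` and the default exponents m = 0.5 and n = 2 are illustrative assumptions chosen to satisfy 0 < m < 1 and n > 1; they are not values taken from the slides.

```python
def weighted_vector(peaks, m=0.5, n=2.0):
    """Weight each peak as (relative intensity)^m * (m/z)^n.

    peaks: list of (mz, intensity) pairs.
    Returns a list of (mz, weight) pairs in the same order.
    """
    max_intensity = max(intensity for _, intensity in peaks)
    if max_intensity <= 0:
        raise ValueError("spectrum has no positive intensity")
    weighted = []
    for mz, intensity in peaks:
        relative = intensity / max_intensity  # normalize by the largest peak
        weighted.append((mz, relative ** m * mz ** n))
    return weighted

# Hypothetical two-peak spectrum
example = weighted_vector([(100.0, 25.0), (200.0, 100.0)])
```

With m < 1 the small peak at m/z 100 keeps half of its normalized weight instead of a quarter, and with n > 1 the peak at the larger m/z is emphasized further.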

Page 9: Comparison of ESI-MS Spectra in MassBank Database

Spectral Search of Real Number m/z

Introduce Tolerance of m/z to Match Peaks in Different Spectra
• Peaks within Tolerance are Compiled into an Axis

[Figure: Query q and Target d with real-number m/z peaks grouped by tolerance into six axes, Dimension = 6.]
  ・ (1) - (6): m/z
  ・ q1 - q5, d1 - d6: Intensity

Query q = (q1, 0, q3, q4, q5, 0)
Target d = (d1, d2, 0, d4, 0, d6)

"Different Peak Hit Problem" and "Same Peak Hit Problem"

Page 10: Comparison of ESI-MS Spectra in MassBank Database

Different Peak Hit Problem

[Figure: one peak q1 of Query q has two peaks, d1 and d2, of Target d within tolerance. q1, d1, d2: Intensity.]

Query q = (..., q1, ...)
Target d = (..., ???, ...)

What is the Element of Vector d?

Choice of Solution:
  ・Largest, Smallest
  ・Average, Total
  ・Nearest m/z
  ...

In MassBank: Select Largest Peak

Page 11: Comparison of ESI-MS Spectra in MassBank Database

Same Peak Hit Problem

[Figure: two peaks q1 and q2 of Query q both fall within tolerance of one peak d1 of Target d. q1, q2, d1: Intensity.]

q1 and q2: 1 Axis or 2 Axes?
• If 1 Axis, What is the Element of Vector q?
• If 2 Axes, What are the 2 Elements of Vector d?

In MassBank: 2 Axes & Duplicated Use of Hit Peak in Target:
q = ( ..., q1, q2, ... )
d = ( ..., d1, d1, ... )
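The two rules stated for MassBank (largest target peak wins; one axis per query peak, with the hit target peak reused) can be sketched as below. The function name `build_vectors` is hypothetical, and the handling of target peaks that no query peak hits (each gets its own axis with a zero query element) is an assumption of this sketch, not something the slides specify.

```python
def build_vectors(query, target, tol=0.3):
    """Align two peak lists into paired vectors using an m/z tolerance.

    query, target: lists of (mz, value) pairs.
    Returns (q_vec, d_vec), two lists of equal length.
    """
    q_vec, d_vec = [], []
    used = set()  # indices of target peaks hit by at least one query peak
    for q_mz, q_val in query:
        hits = [j for j, (t_mz, _) in enumerate(target)
                if abs(t_mz - q_mz) <= tol]
        if hits:
            # Different Peak Hit Problem: select the largest target peak.
            best = max(hits, key=lambda j: target[j][1])
            used.add(best)
            d_val = target[best][1]
        else:
            d_val = 0.0
        # Same Peak Hit Problem: every query peak keeps its own axis,
        # so one target peak may be used for several query peaks.
        q_vec.append(q_val)
        d_vec.append(d_val)
    # Assumption: target peaks never hit get their own axes.
    for j, (_, t_val) in enumerate(target):
        if j not in used:
            q_vec.append(0.0)
            d_vec.append(t_val)
    return q_vec, d_vec

same_q, same_d = build_vectors([(100.0, 5.0), (100.2, 3.0)], [(100.1, 7.0)])
diff_q, diff_d = build_vectors([(150.0, 4.0)], [(149.9, 2.0), (150.2, 6.0)])
```

The resulting pair of vectors can be passed to any cosine routine. In the first call the single target peak appears twice in the target vector; in the second, the larger of the two candidate target peaks is paired with the query peak.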

Page 12: Comparison of ESI-MS Spectra in MassBank Database

Variety of Spectral Search

• Variety of Weighting & Normalization
• Variety of Real Number m/z Search
  – Tolerance
  – Solution for Different Peak Hit Problem
  – Solution for Same Peak Hit Problem
• Variety of Practical Parameters

Page 13: Comparison of ESI-MS Spectra in MassBank Database

Spectral Search in MassBank - Default Setting -

• Weighting & Normalization: sqrt(Relative Intensity) · (m/z) / 10
  where Relative Intensity = 1000 * Intensity / max(Intensity)
• Tolerance: 0.3
• Choose Effective Peaks (Ignore Noise Peaks):
  – Upper Bound of m/z: Ignore Peak when m/z ≥ 1000
  – Lower Bound of Intensity: Ignore Peak when Relative Intensity < 5
• Lower Bound of Number of Hit Peaks:
  – # Effective Peaks of Query ≥ 3: Ignore Target when # of Hit Peaks < 3
  – # Effective Peaks of Query < 3: Ignore Target unless all Effective Peaks are Hit
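The default setting can be sketched as two small helpers. The names `effective_peaks` and `enough_hits` are hypothetical. Two readings are assumed here: the maximum intensity is taken over all peaks before filtering, and the hit-count rule keeps a target only when at least three peaks hit (or, for queries with fewer than three effective peaks, when all of them hit).

```python
import math

def effective_peaks(peaks, mz_upper=1000.0, rel_int_lower=5.0):
    """Drop noise peaks and weight the rest as sqrt(rel. intensity) * m/z / 10.

    peaks: list of (mz, intensity) pairs.
    Returns a list of (mz, weight) pairs for the effective peaks.
    """
    max_intensity = max(intensity for _, intensity in peaks)
    kept = []
    for mz, intensity in peaks:
        relative = 1000.0 * intensity / max_intensity  # scale to 0..1000
        if mz >= mz_upper or relative < rel_int_lower:
            continue  # outside the effective range
        kept.append((mz, math.sqrt(relative) * mz / 10.0))
    return kept

def enough_hits(n_query_effective, n_hits, min_hits=3):
    """Decide whether a target has enough hit peaks to be kept."""
    if n_query_effective >= min_hits:
        return n_hits >= min_hits
    return n_hits == n_query_effective

# Hypothetical spectrum: the third peak is too weak, the fourth is at m/z 1000.
kept = effective_peaks([(100.0, 250.0), (200.0, 1000.0),
                        (300.0, 4.0), (1000.0, 500.0)])
```

The tolerance of 0.3 is not used here; it belongs to the peak-matching step that runs after this filtering.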

Page 14: Comparison of ESI-MS Spectra in MassBank Database

Spectral Search in MassBank

[Screenshot: MassBank spectral search interface. Panels: Queries; Selected Query; Ranked List of Search Results for Selected Query; Selected Results; 3D View. Workflow: Select & Search. Peak colors: Red = Exact Hit, Pink = Hit within Tolerance.]

Page 15: Comparison of ESI-MS Spectra in MassBank Database

Variety of Spectral Search

• Variety of Weighting & Normalization
• Variety of Real Number m/z Search
  – Tolerance
  – Solution for Different Peak Hit Problem
  – Solution for Same Peak Hit Problem
• Variety of Practical Parameters

Optimization of Search Method
• Depends on Target Set
• Based on Systematic Evaluation for Real Data

Page 16: Comparison of ESI-MS Spectra in MassBank Database

Evaluation of Relevance


Identification of Metabolites

Page 17: Comparison of ESI-MS Spectra in MassBank Database

MS/MS Spectra

• MS/MS: Fragmentation Depends on Machine & Collision Energy
  ⇒ Difficulty of MS/MS Spectral Search

• Need for Comprehensive Collection
  – Different Machines
  – Different Collision Energies

Page 18: Comparison of ESI-MS Spectra in MassBank Database

MS/MS Spectra in MassBank

• Contributor: Keio Univ.

• QqQ MS/MS Spectra
  – Low Resolution
  – 861 Metabolites
  – 4,205 Spectra (1 to 5 Spectra per Metabolite)

• QqTOF MS/MS Spectra
  – High Resolution
  – 898 Metabolites
  – 4,431 Spectra (2 to 5 Spectra per Metabolite)

Page 19: Comparison of ESI-MS Spectra in MassBank Database

Evaluation Method

• For each Machine: Leave-one-out Test
  – Test-1: QqQ Spectra
  – Test-2: QqTOF Spectra

• For both Machines: "Cross" Search
  – Test-3: Query = QqQ, Target = QqTOF
  – Test-4: Query = QqTOF, Target = QqQ

• Evaluation Index
  – Precision & Recall
  – Best Ranking of True Positive

Page 20: Comparison of ESI-MS Spectra in MassBank Database

Precision & Recall

Correct Answer = Spectrum of Same Metabolite as Query

[Figure: Venn diagram inside the set of Targets. The set "Results" and the set "Correct Answers" overlap in TP; Results only = FP; Correct Answers only = FN; outside both = TN.]

P (Positive): Results
N (Negative): not Results
T (True): Success
F (False): Mistake

Targets are Divided into 4 Groups (FN, FP, TN, TP) for a Query.

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
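As a small illustration of the two formulas, the sketch below counts TP, FP and FN from a set of returned results and a set of correct answers. The function name `precision_recall` and the spectrum identifiers are hypothetical, and returning 0.0 for an empty set is an assumed convention.

```python
def precision_recall(results, correct):
    """Precision and recall for one query.

    results: identifiers returned by the search.
    correct: identifiers of spectra of the same metabolite as the query.
    """
    results, correct = set(results), set(correct)
    tp = len(results & correct)   # returned and correct
    fp = len(results - correct)   # returned but wrong
    fn = len(correct - results)   # correct but missed
    precision = tp / (tp + fp) if results else 0.0
    recall = tp / (tp + fn) if correct else 0.0
    return precision, recall

# Hypothetical identifiers: two of four results are correct, one answer is missed.
precision, recall = precision_recall({"A", "B", "C", "D"}, {"A", "B", "E"})
```

TN does not appear in either formula, which is why both indices stay meaningful even when the target set is much larger than the set of correct answers.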

Page 21: Comparison of ESI-MS Spectra in MassBank Database

Evaluation of Score Based Search Method

• Introduce Threshold of Score: Th
  – Select Results where score ≥ Th
  – Set of Results Depends on Th

Tradeoff between Recall and Precision:
  Th↓ ⇒ Positive↑ ⇒ Recall↑ & Precision↓
  Th↑ ⇒ Positive↓ ⇒ Recall↓ & Precision↑

Plot Precision-Recall Curve when Th is Shifted between 0 and 1.

[Figure: precision-recall curve; x-axis: recall, y-axis: precision.]
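Shifting the threshold can be sketched as a simple sweep over candidate values of Th. The function name `pr_curve` and the scores are hypothetical; reporting precision as 1.0 when no result passes the threshold is an assumed convention, not something stated on the slide.

```python
def pr_curve(scored, thresholds):
    """Precision and recall at each score threshold.

    scored: list of (score, is_correct) pairs for one query.
    Returns a list of (threshold, precision, recall) tuples.
    """
    total_correct = sum(1 for _, ok in scored if ok)
    curve = []
    for th in thresholds:
        selected = [ok for score, ok in scored if score >= th]
        tp = sum(1 for ok in selected if ok)
        precision = tp / len(selected) if selected else 1.0  # assumed convention
        recall = tp / total_correct if total_correct else 0.0
        curve.append((th, precision, recall))
    return curve

# Hypothetical scored targets for one query
scored = [(0.9, True), (0.8, False), (0.6, True), (0.3, False)]
curve = pr_curve(scored, [0.5, 0.85])
```

In this example, lowering Th from 0.85 to 0.5 raises recall from 0.5 to 1.0 while precision falls from 1.0 to 2/3.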

Page 22: Comparison of ESI-MS Spectra in MassBank Database

Precision-Recall Graph


Page 23: Comparison of ESI-MS Spectra in MassBank Database

Best Rank of True Positive


Page 24: Comparison of ESI-MS Spectra in MassBank Database

Relevance of MS/MS Spectral Search

• Top 1 has High Relevance
  – Top 1 is True Positive for 30% of all Queries
  – Average of Best Rank of True Positive [BRTP] is Less than 2!

• Top 2 has High Relevance but BRTP ≥ 2 is a Rare Case
  – If Top 2 is True Positive, then Top 1 might be True Positive!
  – If Top 1 & Top 2 are the same Metabolite, then Relevance is Very High!

• Relationship between Rank & Score
  – Top 1 is True / False ⇔ Score is High / Low

• Ignoring Precision, QqTOF hits More than QqQ

• Tolerance of m/z Affects Relevance
  – Tolerance ↓ ⇒ Recall ↓
  – Tolerance ↑ ⇒ Precision ↓ (Especially for QqTOF)

Importance of Rank

Page 25: Comparison of ESI-MS Spectra in MassBank Database

Relevance of "Cross" Search

• Q⇒QTof is Better than QTof⇒Q
  – High Resolution Spectral Database is Useful for Low Resolution MS Users, too!

• Better than test-1 and test-2
  – Difference of Machine is Less Important than Difference of Collision Energy!
  – Effectiveness of MS/MS Database of Various Machines

Q⇒QTof: Query = QqQ, Target = QqTOF (test-3)
QTof⇒Q: Query = QqTOF, Target = QqQ (test-4)

Page 26: Comparison of ESI-MS Spectra in MassBank Database

Conclusions

• Search Method based on Vector Space Model
  – High Resolution (Real Number) m/z Spectra
  – Weighting & Normalizing using m/z & Intensity

• Evaluation of MS Database of Metabolites
  – Importance of High Resolution Spectra
  – Effectiveness of Collecting Various Spectra
    • by Different Machines
    • under Different Experimental Conditions

Page 27: Comparison of ESI-MS Spectra in MassBank Database

Future Plan: Metabolome Integrated Database

MassBank is an Important Part of Metabolome Integrated Database

[Figure: components of the Metabolome Integrated Database: MassBank; KNApSAcK (Species-Metabolite Relationship DB); LipidBank (Comprehensive Lipid DB); FlavonoidViewer; Metabolome Pathway DB.]

Page 28: Comparison of ESI-MS Spectra in MassBank Database

Acknowledgements

Special Thanks to Following Collaborators...

• IAB, Keio Univ.: Y. Nihei, T. Ikeda, Y. Ojima, R. Matsuzawa, T. Soga, Y. Kakazu
• Grad. Sch. Frontier Sc., Univ. of Tokyo: K. Suwa, M. Yoshimoto
• Bioinfo. & Genomics, NAIST: S. Kanaya, Y. Shimbo
• RIKEN PSC: K. Saito, F. Matsuda, A. Oikawa, M. Kusano, A. Fukushima, T. Sakurai, K. Akiyama
• Grad. Sch. Med., Univ. of Tokyo: R. Taguchi
• Dept. Sci., Nara Women: T. Takeuchi
• Nishiwaki Lab., JCL Bioassay Inc.: Z. Tozuka
• Kazusa DNA Lab.: T. Ara
• Leibniz Inst. Plant Biochem.: S. Neumann

This work is supported by BIRD-JST and Grant-in-Aid for Scientific Research on Priority Areas "Systems Genomics" from MEXT of Japan.

Page 29: Comparison of ESI-MS Spectra in MassBank Database

We would appreciate it very much if you could contribute spectra of metabolites and natural products to MassBank.

URL: http://www.massbank.jp/

E-mail: massbank @ iab.keio.ac.jp

Page 30: Comparison of ESI-MS Spectra in MassBank Database

(Supplemental)


Page 31: Comparison of ESI-MS Spectra in MassBank Database

Evaluation for Single Correct Set

Divide Correct Set into Target and Query

• k-fold Cross Validation
  – Divide Correct Set into k Subsets (S1, ..., Sk)
  – For all i, Query = Si, Target = Union of Other Subsets
  – Calculate Average & Variance of Evaluation Index
  – e.g. 2CV (k = 2), 10CV (k = 10)

• Leave-one-out
  – For every Element x of Correct Set, Query = { x }, Target = Rest of Correct Set
  – Equivalent to k-fold Cross Validation where k is the Number of Elements
  – Useful when Correct Set is Small

• Random Sampling
  – Query = k Elements Selected Randomly, Target = Rest of Correct Set
  – Repeat the Evaluation Enough Times
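The query/target splits for k-fold cross validation and leave-one-out can be sketched as follows. The function names are hypothetical, and assigning elements to subsets round-robin (rather than shuffling first) is a simplifying assumption of this sketch.

```python
def k_fold_splits(items, k):
    """Yield (query, target) pairs for k-fold cross validation."""
    items = list(items)
    if not 1 <= k <= len(items):
        raise ValueError("k must be between 1 and the number of items")
    folds = [items[i::k] for i in range(k)]  # round-robin subsets S1..Sk
    for i, query in enumerate(folds):
        target = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield query, target

def leave_one_out(items):
    """Leave-one-out is k-fold with k equal to the number of elements."""
    items = list(items)
    return k_fold_splits(items, len(items))

two_fold = list(k_fold_splits(range(6), 2))
loo = list(leave_one_out(["s1", "s2", "s3"]))
```

Each split would then be scored with the chosen evaluation index, and the average and variance taken over all splits.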

Page 32: Comparison of ESI-MS Spectra in MassBank Database

Index of Relevance

• Recall (Sensitivity) = TP / (TP + FN)
• Precision = TP / (TP + FP)
• Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Specificity = TN / (TN + FP)  (Fallout = 1 − Specificity = FP / (TN + FP))
• Generality = (TP + FN) / (TP + TN + FP + FN)

[Figure: Venn diagram inside the set of Targets. The set "Results" and the set "Correct Answers" overlap in TP; Results only = FP; Correct Answers only = FN; outside both = TN.]
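For reference, the sketch below computes these indices from the four counts using their standard information-retrieval definitions: accuracy counts both TP and TN as successes, fallout is the complement of specificity, and generality is the fraction of targets that are correct answers. The function name `relevance_indices` and the counts are hypothetical, and non-zero denominators are assumed.

```python
def relevance_indices(tp, fp, fn, tn):
    """Standard relevance indices from the four confusion counts."""
    total = tp + fp + fn + tn
    return {
        "recall": tp / (tp + fn),        # also called sensitivity
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / total,
        "specificity": tn / (tn + fp),
        "fallout": fp / (tn + fp),       # equals 1 - specificity
        "generality": (tp + fn) / total, # share of correct answers in targets
    }

# Hypothetical counts for one query over ten targets
indices = relevance_indices(tp=2, fp=2, fn=1, tn=5)
```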

Page 33: Comparison of ESI-MS Spectra in MassBank Database

Free from Tradeoff between Recall & Precision

• Area Measurement

• Break Even Point
  – Value where Precision = Recall

• Eleven-Point Average Precision
  – Average of Precisions where Recall is 0.0, 0.1, 0.2, ..., 0.9, 1.0

• Maximum F-measure = max(2×R×P / (R + P))
  – F-measure = Harmonic Mean of Recall and Precision

[Figure: precision-recall curve; x-axis: recall, y-axis: precision.]
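Two of these single-number summaries can be sketched from a list of (precision, recall) points on the curve. The function names are hypothetical. The eleven-point average below uses interpolated precision (the best precision at any recall at or above each level), which is a common convention but an assumption here, since the slide does not say how precision at a given recall level is read off.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def max_f_measure(points):
    """Largest F-measure over all (precision, recall) points."""
    return max(f_measure(p, r) for p, r in points)

def eleven_point_average(points):
    """Average interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    total = 0.0
    for i in range(11):
        level = i / 10
        candidates = [p for p, r in points if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11

# Hypothetical curve with two operating points
points = [(1.0, 0.5), (2 / 3, 1.0)]
best_f = max_f_measure(points)
avg11 = eleven_point_average(points)
```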

Page 34: Comparison of ESI-MS Spectra in MassBank Database

Evaluation Index by Ranking

Ex.: Ranked Results 1:T, 2:F, 3:F, 4:T, 5:F, 6:F

• Average Precision
  – Average of Precision at each True Positive
  – Ex.: (1/1 + 2/4) / 2 = 0.75

• Mean Reciprocal Rank (MRR)
  – Average of Inverse of Rank for each True Positive
  – Ex.: (1/1 + 1/4) / 2 = 0.625

• Discounted Cumulative Gain (DCG)
  – Cumulate 1 / log2(r + 1) for each True Positive (r: Rank)
  – Ex.: 1/log2(2) + 1/log2(5) = 1.431
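The three ranking indices, as defined on this slide, can be reproduced for the example ranking (true positives at ranks 1 and 4) with the sketch below. The function name `ranking_indices` is hypothetical, and returning zeros when there is no true positive is an assumed convention.

```python
import math

def ranking_indices(ranked):
    """Average precision, mean reciprocal rank and DCG for one ranked list.

    ranked: list of booleans; ranked[0] is the result at rank 1.
    Returns (average_precision, mrr, dcg).
    """
    tp_ranks = [rank for rank, ok in enumerate(ranked, start=1) if ok]
    if not tp_ranks:
        return 0.0, 0.0, 0.0
    n = len(tp_ranks)
    # Precision at the i-th true positive is i / (its rank).
    average_precision = sum(i / rank for i, rank in enumerate(tp_ranks, start=1)) / n
    mrr = sum(1 / rank for rank in tp_ranks) / n
    dcg = sum(1 / math.log2(rank + 1) for rank in tp_ranks)
    return average_precision, mrr, dcg

ap, mrr, dcg = ranking_indices([True, False, False, True, False, False])
```

For this ranking the function gives 0.75, 0.625 and about 1.431, matching the worked examples.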

Page 35: Comparison of ESI-MS Spectra in MassBank Database

Evaluation Index for Best True Positive

• Precision of Top Ranker
  – Analogous to Average Precision

• Inverse of Best Ranking of True Positive
  – Analogous to Mean Reciprocal Rank
  – e.g.: Top 1 to 4: False, Top 5: True ⇒ 1/5 = 0.200

Page 36: Comparison of ESI-MS Spectra in MassBank Database

Why is Cross Search Better in the High Threshold Area?

[Figure: spectra measured by QqTOF and by QqQ, placed by similarity.]

Nearest Spectrum of Different Energy in Same Machine is Farther Away than Nearest Spectrum in Different Machine

Limitation of Leave-one-out Test

Page 37: Comparison of ESI-MS Spectra in MassBank Database
