2008.05.29 Copyright © 2008, Hisayuki Horai, All Rights Reserved.
Comparison of ESI-MS Spectra in MassBank Database
Hisayuki Horai1,2, Masanori Arita1,2,3,4, Takaaki Nishioka1,2
1 IAB, Keio Univ., 2 JST-BIRD, 3 Univ. of Tokyo, 4 RIKEN PSC
BMEI 2008
Table of Contents
• Metabolomics, Mass Spectrometry & Spectral Database
• Spectral Search by Similarity
– Vector Space Model
• Evaluation of Relevance
– MS/MS Spectra Search of Metabolites
Metabolomics & Mass Spectrometry
• Measurement of Metabolites
– Identification & Quantification
[Figure: mass spectrum; x-axis m/z (mass-to-charge(number)-ratio), y-axis Intensity; peaks labeled Precursor Ion and Fragment Ion]
Identification by Similarity of Peak Pattern
MassBank: Mass Spectral Database for Identification of Metabolites
http://www.massbank.jp/
• Comprehensive Collection
– Metabolites, Drugs, Agrichemicals, ...
– EI-MS, ESI-MS, MS/MS, XC/MS, ...
– Various Experimental Conditions
• Distributed Database on Internet
– Cloud Computing Environment for Users
– Quality Control of Data at Contributor's Site
• Open to Public
– Public Free Access via Internet
– Provide Software as Freeware (Server, DB System, Search Engine, ...)
Variation of Resolution, Sensitivity, Fragmentation, ...
Distributed Search
Collaboration in MassBank
Spectral Search
Most Important Function of Spectral Database
Spectral Search by Similarity Based on Vector Space Model
Search Based on Vector Space Model: Already Established Search Method
– Information Retrieval (e.g. Google, PubMed, ...)
– Low Resolution (Integer) m/z Spectral Search for EI-MS
• Translate Spectrum to Vector
– Axis for m/z of Peak
– Element of Vector: Intensity of Peak
• Similarity of 2 Spectra = Cosine of Vectors
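[Figure: query spectrum q and target spectrum d drawn on six shared m/z axes (1) - (6), Dimension = 6; θ is the angle between the two spectral vectors]
・ (1) - (6): m/z
・ q1 - q5, d1 - d6: Intensity
Query q = (q1, 0, q3, q4, q5, 0)
Target d = (d1, d2, 0, d4, 0, d6)
For two spectra s1 and s2, Score = Inner Product / Product of Lengths:
$$\mathrm{Score}(s_1, s_2) = \cos\theta = \frac{s_1 \cdot s_2}{|s_1|\,|s_2|}$$
For the query q and target d above:
$$|q| = \sqrt{q_1^2 + q_3^2 + q_4^2 + q_5^2}$$
$$|d| = \sqrt{d_1^2 + d_2^2 + d_4^2 + d_6^2}$$
$$q \cdot d = q_1 d_1 + q_4 d_4$$
$$\mathrm{Score}(q, d) = \cos\theta = \frac{q \cdot d}{|q|\,|d|}$$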
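The cosine score can be sketched in a few lines of Python. This is a minimal illustration, not MassBank's implementation: the function name `cosine_score` and the intensity values are made up for the example, and both spectra are assumed to be already expressed on the same six m/z axes.

```python
import math

def cosine_score(q, d):
    """Cosine of the angle between two spectral vectors on the same m/z axes."""
    if len(q) != len(d):
        raise ValueError("vectors must share the same m/z axes")
    dot = sum(qi * di for qi, di in zip(q, d))      # inner product
    norm_q = math.sqrt(sum(qi * qi for qi in q))    # length of q
    norm_d = math.sqrt(sum(di * di for di in d))    # length of d
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # assumption: an empty spectrum matches nothing
    return dot / (norm_q * norm_d)

# Hypothetical intensities laid out as q = (q1, 0, q3, q4, q5, 0)
# and d = (d1, d2, 0, d4, 0, d6).
q = [3.0, 0.0, 1.0, 4.0, 2.0, 0.0]
d = [3.0, 2.0, 0.0, 4.0, 0.0, 1.0]
score = cosine_score(q, d)
print(round(score, 4))
```

Only axes (1) and (4) carry peaks in both spectra, so only those two terms contribute to the inner product.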
Weighting & Normalization
Spectral Vector: ( ..., Vi, ... )
• Relative Intensity
"Intensity must be normalized by Largest Peak."
– Vi = Intensity / max(Intensity)
• Importance of Large Ion
"Large m/z may be specific for a compound."
– Vi = Relative Intensity · (m/z)^n [ n > 1 ]
• Importance of Intensity
"Small peaks should not be ignored."
– Vi = (Relative Intensity)^m · (m/z)^n [ n > 1, 0 < m < 1 ]
Improvement for Better Search
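As a rough sketch of the weighting above, the function below computes Vi = (Relative Intensity)^m · (m/z)^n for each peak. The name `weighted_vector` and the default exponents m = 0.5 and n = 2.0 are assumptions chosen only to satisfy 0 < m < 1 and n > 1; the slide does not fix them here.

```python
def weighted_vector(peaks, m=0.5, n=2.0):
    """peaks: list of (mz, intensity). Returns list of (mz, Vi)."""
    if not peaks:
        return []
    top = max(intensity for _, intensity in peaks)
    if top <= 0:
        raise ValueError("largest peak must have positive intensity")
    weighted = []
    for mz, intensity in peaks:
        relative = intensity / top                     # normalize by largest peak
        weighted.append((mz, relative ** m * mz ** n))
    return weighted

w = weighted_vector([(100.0, 50.0), (200.0, 200.0)])
print(w)
```

With m below 1 the small peak at m/z 100 keeps half of the base peak's intensity factor instead of a quarter, which is the "small peaks should not be ignored" effect.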
Spectral Search of Real Number m/z
Introduce Tolerance of m/z to Match Peaks in Different Spectra
• Peaks within Tolerance are Compiled into an Axis
[Figure: query spectrum q and target spectrum d with real-number m/z; peaks within the tolerance are compiled into six axes (1) - (6), Dimension = 6]
・ (1) - (6): m/z
・ q1 - q5, d1 - d6: Intensity
Query q = (q1, 0, q3, q4, q5, 0)
Target d = (d1, d2, 0, d4, 0, d6)
"Different Peak Hit Problem" and "Same Peak Hit Problem"
Different Peak Hit Problem
[Figure: one query peak q1 and two target peaks d1, d2 that both fall within the tolerance of q1; q1, d1, d2: Intensity]
q = (..., q1, ...)
d = (..., ???, ...)
What is the Element of Vector d?
Choice of Solution:
・Largest, Smallest
・Average, Total
・Nearest m/z
...
MassBank Selects the Largest Peak
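The "largest peak" choice can be sketched as follows. The function name `match_target_peak`, the inclusive tolerance boundary, and the peak values are assumptions for illustration only.

```python
def match_target_peak(query_mz, target_peaks, tolerance):
    """Intensity of the largest target peak within the tolerance, or 0.0 if none hits."""
    hits = [intensity for mz, intensity in target_peaks
            if abs(mz - query_mz) <= tolerance]   # assumption: boundary is inclusive
    return max(hits, default=0.0)

# Two target peaks (d1, d2) hit the same query peak; the larger one is used.
target = [(100.1, 30.0), (100.2, 80.0), (150.0, 999.0)]
element = match_target_peak(100.0, target, 0.3)
print(element)
```

Swapping `max` for `min`, an average, a sum, or a nearest-m/z rule gives the other solutions listed on the slide.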
Same Peak Hit Problem
[Figure: two query peaks q1 and q2 that both fall within the tolerance of one target peak d1; q1, q2, d1: Intensity]
q1 and q2: 1 Axis or 2 Axes?
• If 1 Axis, What is Element of Vector q?
• If 2 Axes, What are 2 Elements of Vector d?
In MassBank: 2 Axes & Duplicated Use of Hit Peak in Target:
q = ( ..., q1, q2, ... )
d = ( ..., d1, d1, ... )
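A small sketch of the "2 axes with duplicated target peak" choice: every query peak gets its own axis, and a target peak may be reused for several axes. The function name `align_to_query` is hypothetical, and for brevity the sketch omits axes for target peaks that no query peak hits, which a full scorer would still need for the length of d.

```python
def align_to_query(query_peaks, target_peaks, tolerance):
    """One axis per query peak; the hit target peak is reused when queries share it."""
    q_vec, d_vec = [], []
    for q_mz, q_int in query_peaks:
        hits = [t_int for t_mz, t_int in target_peaks
                if abs(t_mz - q_mz) <= tolerance]
        q_vec.append(q_int)
        d_vec.append(max(hits, default=0.0))  # largest hit, 0.0 when nothing hits
    return q_vec, d_vec

# q1 and q2 both hit the single target peak d1.
q_vec, d_vec = align_to_query([(100.0, 10.0), (100.2, 20.0)], [(100.1, 50.0)], 0.3)
print(q_vec, d_vec)
```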
Variety of Spectral Search
• Variety of Weighting & Normalization
• Variety of Real Number m/z Search
– Tolerance
– Solution for Different Peak Hit Problem
– Solution for Same Peak Hit Problem
• Variety of Practical Parameters
Spectral Search in MassBank - Default Setting -
• Weighting & Normalization:
– sqrt(Relative Intensity) · m/z / 10, where Relative Intensity = 1000 * Intensity / max(Intensity)
• Tolerance: 0.3
• Choose Effective Peaks (Ignore Noise Peaks):
– Upper Bound of m/z: Ignore Peak when m/z ≥ 1000
– Lower Bound of Intensity: Ignore Peak when Relative Intensity < 5
• Lower Bound of Number of Hit Peaks:
– # Effective Peaks of Query ≥ 3: Ignore Target when # of Hit Peaks < 3
– # Effective Peaks of Query < 3: Ignore Target unless all Effective Peaks are Hit
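The default settings can be read as a small preprocessing step plus a hit-count filter. The sketch below is one interpretation, not MassBank's code: the names `effective_peaks` and `enough_hits` are hypothetical, the weight is read as sqrt(Relative Intensity) × (m/z) / 10, and the hit-count rule is read as a lower bound (a target needs at least 3 hit peaks, or all of them when the query has fewer than 3 effective peaks).

```python
import math

MAX_MZ = 1000.0            # ignore peaks with m/z >= 1000
MIN_REL_INTENSITY = 5.0    # ignore peaks with relative intensity < 5
TOLERANCE = 0.3            # m/z tolerance used when matching peaks

def effective_peaks(peaks):
    """Scale to relative intensity (largest = 1000), drop noise peaks, apply weight."""
    if not peaks:
        return []
    top = max(intensity for _, intensity in peaks)
    kept = []
    for mz, intensity in peaks:
        relative = 1000.0 * intensity / top
        if mz >= MAX_MZ or relative < MIN_REL_INTENSITY:
            continue
        kept.append((mz, math.sqrt(relative) * mz / 10.0))
    return kept

def enough_hits(n_query_effective, n_hits):
    """True when the target has enough hit peaks to be kept as a result."""
    return n_hits >= min(3, n_query_effective)

kept = effective_peaks([(100.0, 400.0), (50.0, 1.0), (1000.0, 400.0), (400.0, 100.0)])
print(kept)
```

In the example the peak at m/z 50 is dropped as noise (relative intensity 2.5) and the peak at m/z 1000 is dropped by the upper bound.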
Spectral Search in MassBank
[Screenshot: MassBank spectral search; list of Queries; Ranked List of Search Results for Selected Query; Selected Query compared with Selected Results (Red: Exact Hit, Pink: Hit within Tolerance); Select & Search; 3D View]
Variety of Spectral Search
• Variety of Weighting & Normalization
• Variety of Real Number m/z Search
– Tolerance
– Solution for Different Peak Hit Problem
– Solution for Same Peak Hit Problem
• Variety of Practical Parameters
Optimization of Search Method
• Depends on Target Set
• Based on Systematic Evaluation for Real Data
Evaluation of Relevance
Identification of Metabolites
MS/MS Spectra
• MS/MS: Fragmentation Depends on Machine & Collision Energy
⇒ Difficulty of MS/MS Spectral Search
• Needs for Comprehensive Collection
– Different Machine
– Different Collision Energy
MS/MS Spectra in MassBank
• Contributor: Keio Univ.
• QqQ MS/MS Spectra
– Low Resolution
– 861 Metabolites
– 4,205 Spectra (1-to-5 Spectra for 1 Metabolite)
• QqTOF MS/MS Spectra
– High Resolution
– 898 Metabolites
– 4,431 Spectra (2-to-5 Spectra for 1 Metabolite)
Evaluation Method
• For each Machine,
– Leave-one-out Test
– Test-1: QqQ Spectra
– Test-2: QqTOF Spectra
• For both Machines,
– Test-3: Query = QqQ, Target = QqTOF
– Test-4: Query = QqTOF, Target = QqQ
• Evaluation Index
– Precision & Recall
– Best Ranking of True Positive
"Cross" Search
Precision & Recall
Correct Answer = Spectrum of Same Metabolite as Query
[Figure: Venn diagram of Targets containing the overlapping sets Results and Correct Answers; regions TP, FP, FN, TN]
P (Positive): Results; N (Negative): not Results
T (True): Success; F (False): Mistake
Targets are Divided into 4 Groups (FN, FP, TN, TP) for a Query.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
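As a minimal sketch, precision and recall for one query can be computed from the set of returned results and the set of correct answers. The function name `precision_recall` and the spectrum identifiers are made up, and returning 0.0 for an empty denominator is an assumed convention.

```python
def precision_recall(results, correct):
    """results: spectra returned for a query; correct: spectra of the same metabolite."""
    results, correct = set(results), set(correct)
    tp = len(results & correct)   # returned and correct
    fp = len(results - correct)   # returned but wrong
    fn = len(correct - results)   # correct but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

precision, recall = precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"})
print(precision, recall)
```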
Evaluation of Score Based Search Method
• Introduce Threshold of Score: Th
– Select Results where score ≥ Th
– Set of Results Depends on Th
Tradeoff between Recall and Precision:
Th↓ ⇒ Positive↑ ⇒ Recall↑ & Precision↓
Th↑ ⇒ Positive↓ ⇒ Recall↓ & Precision↑
Precision-Recall Graph: Plot Precision-Recall Curve when Th is Shifted between 0 and 1.
[Figure: precision-recall graph; x-axis recall, y-axis precision]
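The threshold sweep behind a precision-recall curve can be sketched as below. The function name `precision_recall_curve`, the scores, and the convention that precision is 1.0 when no result passes the threshold are all assumptions for illustration.

```python
def precision_recall_curve(scored, thresholds):
    """scored: list of (score, is_correct). Returns list of (Th, precision, recall)."""
    total_correct = sum(1 for _, ok in scored if ok)
    curve = []
    for th in thresholds:
        selected = [ok for score, ok in scored if score >= th]  # results at this Th
        tp = sum(1 for ok in selected if ok)
        precision = tp / len(selected) if selected else 1.0     # assumed convention
        recall = tp / total_correct if total_correct else 0.0
        curve.append((th, precision, recall))
    return curve

scored = [(0.9, True), (0.8, False), (0.6, True), (0.3, False)]
curve = precision_recall_curve(scored, [0.0, 0.5, 0.85])
for th, p, r in curve:
    print(th, round(p, 3), round(r, 3))
```

Raising Th from 0.0 to 0.85 shrinks the result set from four spectra to one, so precision rises while recall falls.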
Best Rank of True Positive
Relevance of MS/MS Spectral Search
• Top 1 has High Relevance
– Top 1 is True Positive for 30% of all Queries
– Average of Best Rank of True Positive [BRTP] is Less than 2!
• Top 2 has High Relevance but BRTP ≥ 2 is Rare Case
– If Top 2 is True Positive, then Top 1 might be True Positive!
– If Top 1 & Top 2 are the Same Metabolite, then Relevance is Very High!
• Relationship between Rank & Score
– Top 1 is True / False ⇔ Score is High / Low
• Ignoring Precision, QqTOF hits More than QqQ
• Tolerance of m/z Affects Relevance
– Tolerance ↓ ⇒ Recall ↓
– Tolerance ↑ ⇒ Precision ↓ (Especially, for QqTOF)
Importance of Rank
Relevance of "Cross" Search
• Q⇒QTof is Better than QTof⇒Q
– High Resolution Spectral Database is Useful for Low Resolution MS Users, too!
• Better than test-1 and test-2
– Difference of Machine is Less Important than Difference of Collision Energy!
– Effectiveness of MS/MS Database of Various Machines
Q⇒QTof: Query = QqQ, Target = QqTOF (test-3)
QTof⇒Q: Query = QqTOF, Target = QqQ (test-4)
Conclusions
• Search Method based on Vector Space Model
– High Resolution (Real Number) m/z Spectra
– Weighting & Normalizing using m/z & Intensity
• Evaluation of MS Database of Metabolites
– Importance of High Resolution Spectra
– Effectiveness for Collecting Various Spectra
  • by Different Machine
  • under Different Experimental Condition
Future Plan: Metabolome Integrated Database
MassBank is an Important Part of Metabolome Integrated Database
[Figure: components of the Metabolome Integrated Database: MassBank, KNApSAcK (Species-Metabolite Relationship DB), FlavonoidViewer, LipidBank (Comprehensive Lipid DB), Metabolome Pathway DB]
Acknowledgements
Special Thanks to Following Collaborators...
• IAB, Keio Univ.: Y.Nihei, T.Ikeda, Y.Ojima, R.Matsuzawa, T.Soga, Y.Kakazu
• Grad.Sch.Frontier Sc., Univ. of Tokyo: K.Suwa, M.Yoshimoto
• Bioinfo.&Genomics, NAIST: S.Kanaya, Y.Shimbo
• RIKEN PSC: K.Saito, F.Matsuda, A.Oikawa, M.Kusano, A.Fukushima, T.Sakurai, K.Akiyama
• Grad.Sch.Med., Univ. of Tokyo: R.Taguchi
• Dept.Sci., Nara Women: T.Takeuchi
• Nishiwaki Lab., JCL Bioassay Inc.: Z.Tozuka
• Kazusa DNA Lab.: T.Ara
• Leibniz Inst. Plant Biochem.: S.Neumann
This work is supported by BIRD-JST and Grant-in-Aid for Scientific Research on Priority Areas "Systems Genomics" from MEXT of Japan.
We would appreciate it very much if you could contribute spectra of metabolites and natural products to MassBank.
URL: http://www.massbank.jp/
E-mail: massbank @ iab.keio.ac.jp
(Supplemental)
Evaluation for Single Correct Set
Divide Correct Set into Target and Query
• k-fold Cross Validation
– Divide Correct Set into k Subsets (S1, ... Sk)
– For all i, Query = Si, Target = Union of Other Subsets
– Calculate Average & Variance of Evaluation Index
– e.g. 2CV (k = 2), 10CV (k = 10)
• Leave-one-out
– For all Element x of Correct Set, Query = { x }, Target = Rest of Correct Set
– Equivalent to k-fold Cross Validation where k is Number of Elements
– Useful when Correct Set is Small
• Random Sampling
– Query = Select k Elements Randomly, Target = Rest of Correct Set
– Repeat Enough Times of Evaluation
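The query/target splits can be sketched with two short generators. The names `k_fold_splits` and `leave_one_out_splits` are hypothetical, and assigning elements to subsets by striding (rather than shuffling first) is an assumption made to keep the example deterministic.

```python
def k_fold_splits(items, k):
    """Yield (query, target) pairs; each of the k subsets is the query exactly once."""
    folds = [items[i::k] for i in range(k)]   # S1 ... Sk by striding
    for i, query in enumerate(folds):
        target = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield query, target

def leave_one_out_splits(items):
    """Leave-one-out is k-fold cross validation with k = number of elements."""
    return k_fold_splits(items, len(items))

spectra = ["s1", "s2", "s3", "s4"]
two_fold = list(k_fold_splits(spectra, 2))
loo = list(leave_one_out_splits(spectra))
print(two_fold)
print(loo[0])
```

An evaluation index is then computed for each split and averaged over all splits.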
Index of Relevance
• Recall (Sensitivity) = TP / (TP + FN)
• Precision = TP / (TP + FP)
• Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Specificity = TN / (TN + FP); Fallout = FP / (TN + FP) = 1 − Specificity
• Generality = (TP + FN) / (TP + TN + FP + FN)
[Figure: Venn diagram of Targets containing the overlapping sets Results and Correct Answers; regions TP, FP, FN, TN]
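For reference, the indices can be computed from the four counts as sketched below. The function name `relevance_indices` and the counts are made up, the formulas are the standard information-retrieval definitions (accuracy = (TP + TN) / total, specificity = TN / (TN + FP), fallout = FP / (TN + FP), generality = (TP + FN) / total), and all denominators are assumed to be nonzero.

```python
def relevance_indices(tp, fp, fn, tn):
    """Standard indices from the four groups of targets for a query."""
    total = tp + fp + fn + tn
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / total,
        "specificity": tn / (tn + fp),
        "fallout": fp / (tn + fp),
        "generality": (tp + fn) / total,
    }

idx = relevance_indices(tp=2, fp=2, fn=1, tn=15)
for name, value in idx.items():
    print(name, round(value, 3))
```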
Free from Tradeoff between Recall & Precision
• Area Measurement
• Break Even Point
– Value where Precision = Recall
• Average Precision of Eleven Point
– Average of Precisions where Recall is 0.0, 0.1, 0.2, ..., 0.9, 1.0
• Maximum F-measure = max(2×R×P / (R + P))
– F-measure = Harmonic Mean of Recall and Precision
[Figure: precision-recall graph; x-axis recall, y-axis precision]
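Two of these single-number summaries can be sketched from a list of (precision, recall) points on the curve. The function names and the sample points are hypothetical, and the eleven-point average uses the common interpolation rule (the best precision at any recall at or above each level), which is an assumption since the slide does not say how precision is read off at each recall level.

```python
def max_f_measure(points):
    """points: list of (precision, recall). Largest harmonic mean over the curve."""
    return max((2 * r * p / (r + p)) if (r + p) else 0.0 for p, r in points)

def eleven_point_average_precision(points):
    """Average interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    total = 0.0
    for i in range(11):
        level = i / 10
        candidates = [p for p, r in points if r >= level]
        total += max(candidates, default=0.0)   # 0.0 if the level is never reached
    return total / 11

points = [(1.0, 0.5), (2 / 3, 1.0), (0.5, 1.0)]
best_f = max_f_measure(points)
ap11 = eleven_point_average_precision(points)
print(round(best_f, 3), round(ap11, 3))
```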
Evaluation Index by Ranking
Ex. Ranking: 1: T, 2: F, 3: F, 4: T, 5: F, 6: F
• Average Precision
– Average of Precision for each True Positive
– Ex.: (1/1 + 2/4) / 2 = 0.75
• Mean Reciprocal Rank (MRR)
– Average of Inverse of Rank for each True Positive
– Ex.: (1/1 + 1/4) / 2 = 0.625
• Discounted Cumulative Gain (DCG)
– Cumulate 1 / log2(r + 1) for each True Positive (r: Rank)
– Ex.: 1/log2(2) + 1/log2(5) = 1.431
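The three worked examples can be checked with a short sketch. The function name `ranking_indices` is hypothetical, and the reciprocal-rank index follows the slide's definition (averaged over every true positive in one ranking) rather than the more common average over queries.

```python
import math

def ranking_indices(ranking):
    """ranking: list of booleans, True where that rank (1-based) is a true positive."""
    true_ranks = [r for r, ok in enumerate(ranking, start=1) if ok]
    n = len(true_ranks)
    if n == 0:
        return 0.0, 0.0, 0.0
    # Precision at each true positive: (number of true positives so far) / rank.
    avg_precision = sum((i + 1) / r for i, r in enumerate(true_ranks)) / n
    mean_reciprocal = sum(1 / r for r in true_ranks) / n
    dcg = sum(1 / math.log2(r + 1) for r in true_ranks)
    return avg_precision, mean_reciprocal, dcg

# Ranking from the example: true positives at ranks 1 and 4.
ap, mrr, dcg = ranking_indices([True, False, False, True, False, False])
print(ap, mrr, round(dcg, 3))
```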
Evaluation Index for Best True Positive
• Precision of Top Ranker
– Analogy of Average Precision
• Inverse of Best Ranking of True Positive
– Analogy of Mean Reciprocal Rank
– e.g.: Top 1~4: False, Top 5: True ⇒ 1/5 = 0.200
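Both indices look only at the best-ranked true positive, as in this sketch. The function names are hypothetical, and returning 0.0 when a ranking is empty or contains no true positive is an assumed convention.

```python
def top_ranker_precision(ranking):
    """1.0 if the top-ranked result is a true positive, else 0.0."""
    return 1.0 if ranking and ranking[0] else 0.0

def inverse_best_rank(ranking):
    """1 / (rank of the first true positive), or 0.0 if there is none."""
    for rank, ok in enumerate(ranking, start=1):
        if ok:
            return 1 / rank
    return 0.0

# Top 1 to 4 are false, Top 5 is true.
example = [False, False, False, False, True]
print(top_ranker_precision(example), inverse_best_rank(example))
```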
Why is Cross Search Better in High Threshold Area?
[Figure: spectra of QqTOF and QqQ]
Nearest Spectrum of Different Energy in Same Machine is Farther from the Query than Nearest Spectrum in Different Machine
Limitation of Leave-one-out Test