Predicting the immune response from peptide microarrays
Mitja Luštrek 1 (2), Peter Lorenz 2, Felix Steinbeck 2, Georg Füllen 2, Hans-Jürgen Thiesen 2
1 Department of Intelligent Systems, Jožef Stefan Institute
2 University of Rostock
Peptide
= part of protein = short sequence of amino acids
SNDIVLT
= string of letters from 20-letter alphabet (1 letter = 1 amino acid, 20 standard amino acids)
Image taken from EMBL website
Peptide arrays
[Figure: peptide array; labels: IVIg antibody mixture, peptides (15 amino acids), glass slide]
Red = epitopes (bind antibodies)
Black = non-epitopes
Peptide arrays
Red = epitopes (bind antibodies)
Black = non-epitopes
[Figure: detection on the glass slide; labels: peptide, antibody, antibody against antibody + dye]
Peptide arrays
Red = epitopes (bind antibodies)
Black = non-epitopes
Peptide Class
PGIGFPGPPGPKGDQ non-ep.
PNMVFIGGINCANGK non-ep.
DGIGGAMHKAMLMAQ non-ep.
REDNLTLDISKLKEQ non-ep.
TPLAGRGLAERASQQ non-ep.
DQVHPVDPYDLPPAG non-ep.
...
RRMISRMPIFYLMSG epitope
LPPGFKRFTCLSIPR epitope
EFSQMESYPEDYFPI epitope
...
Our task
Peptide
RRKGGLEEPQPPAEQ
SEDLENALKAVINDK
EDHVKLVNEVTEFAK
GEKIIQEFLSKVKQM
ILVSRSLKMRGQAFV
YTCQCRAGYQSTLTR
...
Peptide Class
RRKGGLEEPQPPAEQ non-ep.
SEDLENALKAVINDK non-ep.
EDHVKLVNEVTEFAK non-ep.
GEKIIQEFLSKVKQM non-ep.
ILVSRSLKMRGQAFV epitope
YTCQCRAGYQSTLTR epitope
...
Machine learning
Training set: 13,638 peptides (3,420 epitopes)
Test set: 13,640 peptides (3,421 epitopes)
Balanced until the final testing
Machine learning
Peptide Class
PGIGFPGPPGPKGDQ non-ep. / epitope
Attribute representation
Attribute 1 Attribute 2 ... Class
value 1 value 2 ... non-ep. / epitope
ML
Classifier
Probability for epitope p
Machine learning
Peptide Class
PGIGFPGPPGPKGDQ non-ep. / epitope
Attribute representation 1 ... Attribute representation 8
ML
Classifier 1 ... Classifier 8
Probabilities for epitope Class
p1 p2 p3 p4 p5 p6 p7 p8 non-ep. / epitope
ML
Meta classifier
Final probability for epitope p
ML algorithms: SVM (SMO), logistic regression, linear regression
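The last step of the pipeline above, turning the eight probabilities p1 ... p8 into one final probability, could look roughly like the sketch below. It assumes the meta classifier is a linear model that has already been fitted; the function name, the weights and the intercept are hypothetical placeholders, not values from the slides.

```python
def meta_probability(probs, weights, intercept=0.0):
    """Combine base-classifier probabilities with a linear model.

    probs     -- probabilities for "epitope" from the base classifiers
    weights   -- one coefficient per base classifier (hypothetical values)
    intercept -- constant term of the linear model (hypothetical value)
    """
    if len(probs) != len(weights):
        raise ValueError("need one weight per base classifier")
    raw = intercept + sum(w * p for w, p in zip(weights, probs))
    # A linear model can leave [0, 1], so clip to keep a valid probability.
    return min(1.0, max(0.0, raw))


# Equal weights reduce the meta classifier to a plain average of p1 ... p8.
base_probs = [0.9, 0.8, 0.7, 0.6, 0.9, 0.8, 0.7, 0.6]
print(meta_probability(base_probs, [1 / 8] * 8))
```

In practice the weights would be learned from the base classifiers' outputs on training peptides rather than set by hand.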
Attribute representation 1
Amino-acid counts
RRMISRMPIFYLMSG
Count of A C D E F G H I K L M N P Q R S T V W Y
         0 0 0 0 1 1 0 2 0 1 3 0 1 0 3 2 0 0 0 1
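A minimal sketch of how the amino-acid counts for the example peptide could be computed. The function and constant names are illustrative, not taken from the slides.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code, in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_counts(peptide):
    """Return 20 counts, one per amino acid, ordered as in AMINO_ACIDS."""
    counts = Counter(peptide)
    return [counts[aa] for aa in AMINO_ACIDS]


counts = amino_acid_counts("RRMISRMPIFYLMSG")
# Show only the amino acids that occur in the peptide.
print({aa: c for aa, c in zip(AMINO_ACIDS, counts) if c})
```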
Attribute representation 2
RRMISRMPIFYLMSG
Amino-acid count differences
Difference in counts of F–G F–I F–L F–M F–P F–R F–S F–Y G–F G–I ...
0 –1 0 –2 0 –2 –1 0 0 –1
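The count differences above can be derived from the plain counts. The sketch below assumes one attribute per ordered pair of distinct amino acids (the slide lists both F–G and G–F); the function name and the key format are illustrative.

```python
from collections import Counter
from itertools import permutations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_differences(peptide):
    """Map "X-Y" to (count of X) - (count of Y) for every ordered pair."""
    counts = Counter(peptide)
    return {
        f"{a}-{b}": counts[a] - counts[b]
        for a, b in permutations(AMINO_ACIDS, 2)
    }


diffs = count_differences("RRMISRMPIFYLMSG")
print(diffs["F-G"], diffs["F-I"], diffs["F-M"], diffs["G-I"])
```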
Attribute representation 3
Subsequence counts
RRMISRMPIFYLMSG
Count of RR RM MI ... RRM RMI MIS ... ACDE ... ACDEF ...
         1  2  1      1   1   1       0        0
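Subsequence counts can be sketched as counting every contiguous substring of a given length range. The slide shows subsequences of lengths 2 to 5; treating 5 as the upper limit is an assumption, as are the function and parameter names.

```python
from collections import Counter


def subsequence_counts(peptide, min_len=2, max_len=5):
    """Count every contiguous subsequence with min_len <= length <= max_len."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for start in range(len(peptide) - n + 1):
            counts[peptide[start:start + n]] += 1
    return counts


counts = subsequence_counts("RRMISRMPIFYLMSG")
# Subsequences that do not occur (e.g. ACDE) simply have a count of 0.
print(counts["RR"], counts["RM"], counts["RRM"], counts["ACDE"])
```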
Attribute representation 4
Amino-acid class counts
Count of tiny small large basic acidic neutral ...
3 1 11 3 0 12
l l l l t l l s l l l l l t t
RRMISRMPIFYLMSG
b b n n n b n n n n n n n n n
Attribute representation 5
Amino-acid class subsequence counts
l l l l t l l s l l l l l t t
RRMISRMPIFYLMSG
b b n n n b n n n n n n n n n
Count of ll lt tl ls sl tt ... bb bn nb nn ...
8 2 1 1 1 1 1 2 1 10
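Representations 4 and 5 can be sketched by first translating the peptide into a string of class labels and then counting single labels or adjacent label pairs. The class tables below contain only the amino acids that appear in the example peptide, read off the slide; the classes of the other amino acids are not given in the source and are left out, so other peptides would raise a KeyError. All names are illustrative.

```python
from collections import Counter

# Size classes (t = tiny, s = small, l = large), example peptide only.
SIZE_CLASS = {"R": "l", "M": "l", "I": "l", "F": "l", "Y": "l", "L": "l",
              "S": "t", "G": "t", "P": "s"}
# Charge classes (b = basic, n = neutral), example peptide only.
CHARGE_CLASS = {"R": "b", "M": "n", "I": "n", "S": "n", "P": "n",
                "F": "n", "Y": "n", "L": "n", "G": "n"}


def class_string(peptide, mapping):
    """Replace every amino acid by its class label."""
    return "".join(mapping[aa] for aa in peptide)


def ngram_counts(text, n):
    """Count contiguous runs of n labels (n=1 gives plain class counts)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


size = class_string("RRMISRMPIFYLMSG", SIZE_CLASS)
charge = class_string("RRMISRMPIFYLMSG", CHARGE_CLASS)
print(dict(ngram_counts(size, 1)), dict(ngram_counts(charge, 1)))
print(dict(ngram_counts(size, 2)), dict(ngram_counts(charge, 2)))
```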
Attribute representation 6
Amino-acid pair counts
Rationale: antibodies may bind in two places due to their two-chain structure.
RRMISRMPIFYLMSG
Count of pairs at distance (R,R) at 1 (R,M) at 2 (R,I) at 3 ... (A,C) at 1 (A,C) at 2 ...
                           1          1          2              0          0
[Figure: antibody binding the peptide in two places; distances 1, 2, 3, 3 marked]
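A minimal sketch of the pair counts: for each distance d, count how often amino acid X is followed by amino acid Y exactly d positions later. Treating pairs as ordered and limiting the distance to 3 are assumptions (the slide only shows distances up to 3); the names are illustrative.

```python
from collections import Counter


def pair_counts(peptide, max_distance=3):
    """Count (first, second, distance) triples for distances 1..max_distance."""
    counts = Counter()
    for d in range(1, max_distance + 1):
        for i in range(len(peptide) - d):
            counts[(peptide[i], peptide[i + d], d)] += 1
    return counts


counts = pair_counts("RRMISRMPIFYLMSG")
print(counts[("R", "R", 1)], counts[("R", "M", 2)], counts[("R", "I", 3)])
```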
Attribute representation 7
Amino acids at distances from the first + first amino acid
Rationale: antibodies may bind in two places; the first amino acid is the most accessible on the peptide array.
R RMISRMPIFYLMSG
Count of at distance ... R at 1 ... M at 2 ... A at 3 C at 3 ... First
                         1          1          0      0          R
[Figure: antibody binding the peptide]
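One way to read this representation is as one indicator per (amino acid, distance from the first position) plus a nominal attribute for the first amino acid itself. The sketch below follows that reading; the attribute naming is an assumption.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def first_residue_attributes(peptide):
    """Return the first amino acid and, for every distance d from it,
    a 0/1 count of each amino acid at that distance."""
    attrs = {"first": peptide[0]}
    for d in range(1, len(peptide)):
        for aa in AMINO_ACIDS:
            attrs[f"{aa} at {d}"] = int(peptide[d] == aa)
    return attrs


attrs = first_residue_attributes("RRMISRMPIFYLMSG")
print(attrs["first"], attrs["R at 1"], attrs["M at 2"], attrs["A at 3"])
```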
Attribute representation 8
Average amino-acid properties
RRMISRMPIFYLMSG
Hydrophobicity Size Polarity Flexibility Accessibility ...
0.448 0.596 0.306 0.231 0.376
Attribute representation 9 (not used)
Amino-acid counts with a difference
RRMISRMPIFYLMSG
RRMISRMPIWYLMSG
Equivalent for epitope prediction?
Count F as:
• 1 F
• 0.8 W
• 0.4 Y
• ...
Count W as:
• 1 W
• 0.7 F
• 0.3 Y
• ...
Attribute representation 9 (not used)
Amino-acid substitution matrix
      A    C    D    ...  F    W    Y
A     1
C          1
D               1
...
F                         1    0.8  0.4
W                         0.7  1    0.3
Y                                   1
Optimize with a genetic algorithm to maximize classification accuracy
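The weighted counts implied by the substitution matrix can be sketched as follows. Only the F and W rows use the example weights from the slides; filling every other row with the identity (each amino acid counted only as itself) is an assumption made to keep the example runnable, and the names are illustrative.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Row X lists how much each amino acid contributes to the count of X.
# Assumption: rows not shown on the slide are set to the identity.
SUBSTITUTION = {aa: {aa: 1.0} for aa in AMINO_ACIDS}
SUBSTITUTION["F"] = {"F": 1.0, "W": 0.8, "Y": 0.4}  # example weights from the slide
SUBSTITUTION["W"] = {"W": 1.0, "F": 0.7, "Y": 0.3}  # example weights from the slide


def weighted_counts(peptide):
    """Amino-acid counts in which similar amino acids count partially."""
    counts = Counter(peptide)
    return {
        aa: sum(weight * counts[other] for other, weight in SUBSTITUTION[aa].items())
        for aa in AMINO_ACIDS
    }


# The two peptides differ only in F versus W at one position.
with_f = weighted_counts("RRMISRMPIFYLMSG")
with_w = weighted_counts("RRMISRMPIWYLMSG")
print(with_f["F"], with_f["W"], with_w["F"], with_w["W"])
```

With these weights the two peptides get similar, though not identical, F and W counts, which is the point of the representation: a substitution between similar amino acids changes the attribute vector only slightly.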
Results – training set
Attribute representation                   AUC    Accuracy
Amino-acid counts                          0.870  80.7 %
Amino-acid count differences               0.868  80.3 %
Subsequence counts                         0.867  80.5 %
Amino-acid class counts                    0.873  81.2 %
Amino-acid class subsequence counts        0.866  80.5 %
Amino-acid pair counts                     0.865  80.6 %
Amino acids at distances from the first    0.873  81.2 %
Average amino-acid properties              0.863  80.3 %
Combined                                   0.881  83.3 %
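For readers unfamiliar with the two measures in the table, the sketch below computes them from predicted probabilities. AUC is computed as the fraction of (epitope, non-epitope) pairs in which the epitope receives the higher probability, with ties counted as half. The scores and labels are made-up toy data, and the 0.5 threshold for accuracy is an assumption.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (1 = epitope)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def accuracy(scores, labels, threshold=0.5):
    """Fraction of peptides whose thresholded prediction matches the label."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


# Toy data only, not results from the slides.
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
print(auc(toy_scores, toy_labels), accuracy(toy_scores, toy_labels))
```

AUC depends only on how the peptides are ranked, while accuracy also depends on the threshold and on the ratio of epitopes to non-epitopes.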
Results – test set
Attribute representation / dataset       AUC    Accuracy
Best single / training set (balanced)    0.873  81.2 %
Combined / training set (balanced)       0.881  83.3 %
Combined / test set (balanced)           0.883  83.7 %
Combined / test set (original)           0.884  85.9 %
EL-Manzalawy / test set (balanced)       0.868  82.0 %
EL-Manzalawy / test set (original)       0.874  83.9 %
Balanced: epitope : non-epitope = 1 : 1
Original: epitope : non-epitope = 1 : 3
State of the art: SVM + string kernel (EL-Manzalawy et al., 2008), trained and tested on our data.
Results – test set (AUC / accuracy)
Our results: balanced 0.883 / 83.7 %, original 0.884 / 85.9 %
EL-Manzalawy: balanced 0.868 / 82.0 %, original 0.874 / 83.9 %
Rules
Interpretable classifier:
• Interpretable attributes (frequencies, properties of amino acids)
• RIPPER (JRip) to induce rules
Property       Low/high   Applies to peptides
Aromaticity    High       53.8 %
If a peptide has high aromaticity, it binds antibodies. This applies to 53.8 % of peptides that bind antibodies.
(Aromaticity is the percentage of aromatic amino acids in the peptide.)
Rules
Property                  Low/high   Applies to peptides
Aromaticity               High       53.8 %
Polarity                  Low        27.7 %
Frequency of tyrosine     High       26.2 %
Hydrophobicity            Low        22.5 %
Frequency of arginine     High       19.7 %
Summary factor 2          High       16.7 %
Acidity                   Low        11.4 %
Preference for β-sheets   Low        4.3 %
Summary factor 5          High       3.0 %
Epitope propensity
Frequency in peptides with epitopes, divided by frequency in peptides without epitopes
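The propensity defined above can be sketched per amino acid. The sketch assumes "frequency" means the relative frequency of an amino acid among all positions of a peptide group; amino acids that never occur in the non-epitope group are skipped to avoid division by zero. The function names and the toy peptides are illustrative.

```python
from collections import Counter


def frequencies(peptides):
    """Relative frequency of each amino acid over all given peptides."""
    counts = Counter("".join(peptides))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def epitope_propensity(epitopes, non_epitopes):
    """Frequency among epitopes divided by frequency among non-epitopes."""
    f_ep = frequencies(epitopes)
    f_non = frequencies(non_epitopes)
    return {aa: f_ep.get(aa, 0.0) / f for aa, f in f_non.items()}


# Toy peptides only: Y is three times as frequent among the "epitopes".
propensity = epitope_propensity(["YY", "YA"], ["YA", "AA"])
print(propensity)
```

A propensity above 1 means the amino acid is over-represented in epitopes, and a value below 1 means it is under-represented.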
(Un)classifiable peptides
Simplified classifier:
• Interpretable attributes (frequencies, properties of amino acids)
• Logistic regression to train the classifier
Peptides                                  AUC    Accuracy
All                                       0.860  83.0 %
Classifiable (classified correctly)       0.999  98.8 %   Expected
Unclassifiable (classified incorrectly)   0.956  91.5 %   Strange?
(Un)classifiable – rules
Attribute: Classifiable (L/h, applies) | Unclassifiable (L/h, applies)
Aromaticity: High 74.3 % | Low 53.3 %
Polarity: Low 58.7 % | High 27.5 %
Frequency of arginine: High 31.5 % | Low 34.0 %
Frequency of tyrosine: High 20.7 % | Low 16.9 %
Summary factor 5: High 15.1 % | Low 15.2 %
Antigenicity: High 7.3 % | Low 8.7 %
Hydrophobicity: Low 4.7 % | High 6.5 %
Frequency of histidine: Low 3.9 %
Frequency of cysteine: Low 10.4 %
Preference for reverse turns: High 10.4 %
Occurrence in turns: Low 10.4 %
Frequency of alanine: High 8.7 %
All peptides: aromaticity High 53.8 %, polarity Low 27.7 %
(Un)classifiable peptides
Peptides         AUC    Accuracy
All              0.860  83.0 %
Classifiable     0.999  98.8 %
Unclassifiable   0.956  91.5 %
Strange? Not really!
Inevitable or does it mean something?
2nd degree (un)classifiable peptides
• Unclassifiable peptides only
• Simplified classifier
Peptides                                                 AUC    Accuracy
All unclassifiable                                       0.956  91.5 %
Classifiable unclassifiable (classified correctly)       0.992  97.8 %
Unclassifiable unclassifiable (classified incorrectly)   0.683  65.0 %
2nd degree (un)classifiable peptides
Peptides                        AUC    Accuracy
All unclassifiable              0.956  91.5 %
Classifiable unclassifiable     0.992  97.8 %
Unclassifiable unclassifiable   0.683  65.0 %
(Un)classifiable peptides
Peptides         AUC    Accuracy
All              0.860  83.0 %
Classifiable     0.999  98.8 %
Unclassifiable   0.956  91.5 %
Inevitable or does it mean something?
Not inevitable!
Conclusions
• Epitopes have common characteristics
  – Epitopes are parts of antigens that bind antibodies
• Epitope characteristics are not unexpected
• Two groups of epitopes:
  – around 80 % “typical” (classifiable)
  – around 20 % “atypical” (unclassifiable)
Our peptides mostly did not come from known antigens
Probably partly general and partly antibody-specific binding
Mostly general-purpose antibodies?
Mostly antigen-specific antibodies?