Segmenting G-Protein Coupled Receptors using Language Models
Betty Yee Man Cheng
Language Technologies Institute, CMU
Advisors: Judith Klein-Seetharaman, Jaime Carbonell
The Segmentation Problem
Segment a protein sequence according to its secondary structure
Related to secondary structure prediction
Often viewed as a classification problem
Best performance so far is 78%
A large portion of the problem lies with the boundary cases
Limited Domain: GPCRs
G-Protein Coupled Receptors
One of the largest superfamilies of proteins known: 2955 sequences and 1654 fragments found so far
Transmembrane proteins
Play a central role in many diseases
Only 1 protein has been crystallized
Distinguishing Characteristic of GPCRs
The order of segments is known:
N-terminus, Helix, Intracellular loop, Extracellular loop, C-terminus
Methodology: Topicality Measures
Based on “Statistical Models for Text Segmentation” by D. Beeferman, A. Berger, and J. Lafferty
Topicality measures are log-ratios of 2 different models:
Short-range model versus long-range model in topic segmentation in text
Models of different segments in proteins
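The log-ratio idea above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the two "models" are stand-in dictionaries of per-token probabilities, where Beeferman et al. use a short-range (adaptive) and a long-range (static) language model.

```python
import math

def topicality(token, short_range_model, long_range_model, eps=1e-12):
    """Log-ratio of the short-range to the long-range probability of a token.

    Values near zero mean the two models agree; large swings in this
    measure suggest a topic (or segment) boundary may be nearby.
    """
    p_short = short_range_model.get(token, eps)
    p_long = long_range_model.get(token, eps)
    return math.log(p_short / p_long)

# Toy example with made-up amino-acid probabilities
short = {"L": 0.10, "K": 0.02}
long_ = {"L": 0.05, "K": 0.04}
print(round(topicality("L", short, long_), 4))  # log(2)  ~  0.6931
print(round(topicality("K", short, long_), 4))  # log(1/2) ~ -0.6931
```

In the protein setting, each segment model plays the role of one of these language models and the measure is evaluated at every sequence position.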
Short-Range Model vs. Long-Range Model
Problem - Not Enough Data!

Family Name                     Number of Proteins
Class A                         1081
Class B                           83
Class C                           28
Class D                           11
Class E                            4
Class F                           45
Drosophila Odorant Receptors      31
Nematode Chemoreceptors            1
Ocular Albinism Proteins           2
Orphan A                          35
Orphan B                           2
Plant Mlo Receptors               10

Total of 1333 proteins
Over 90% are shorter than 750 amino acids
Average sequence length is 441 amino acids
Average segment length is 25 amino acids
3 Topicality Models in GPCRs
Previous segmentation experiments with mutual information and Yule’s measures have shown a similarity between:
All helices
All intracellular loops and the C-terminus
All extracellular loops and the N-terminus
No two helices or loops occur consecutively
3 models instead of 15, trained across all families of GPCRs
Model of a Segment
Each model is an interpolated model of 6 basic probability models:
Unigram model (20 amino acids)
Bi-gram model (20 amino acids)
Tri-gram model (20 amino acids)
3 tri-gram models on reduced alphabets of 11, 3 and 2 letters:
LVIM, FY, KR, ED, AG, ST, NQ, W, C, H, P
LVIMFYAGCW, KREDH, STNQP
LVIMFYAGCW, KREDHSTNQP
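The interpolated segment model can be sketched as a weighted sum of the component models. This is an assumption-laden illustration, not the authors' implementation: the component models are represented as simple callables, and REDUCED_11 encodes the 11-group alphabet from the slide.

```python
# Map each amino acid to its group in the 11-letter reduced alphabet
REDUCED_11 = {}
for group in ["LVIM", "FY", "KR", "ED", "AG", "ST", "NQ", "W", "C", "H", "P"]:
    for aa in group:
        REDUCED_11[aa] = group

def interpolated_prob(models, weights, context, aa):
    """P(aa | context) as a weighted sum of component model probabilities.

    `models` is a list of callables model(context, aa) -> probability;
    `weights` are the interpolation weights and must sum to 1.
    """
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * m(context, aa) for w, m in zip(models and weights, models))
```

A reduced-alphabet tri-gram model would first translate both the context and the amino acid through a mapping like REDUCED_11 before looking up its probability.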
Why Use Reduced Alphabets?
Figure 1. Snake-like diagram of the human β2 adrenergic receptor.
Interpolation Oddity
Weights were trained so that the sum of the probabilities assigned to the amino acid at each position in the training data is maximized
First attempt: all of the weight went to the tri-gram model with the smallest reduced alphabet
Reason: a smaller vocabulary size leaves the probability mass less spread out
Interpolation Oddity, Take 2
Normalize the probabilities from the reduced-alphabet models by group size
E.g. with the alphabet LVIM,FY,KR,ED,AG,ST,NQ,W,C,H,P:
P(L | ·) / 4 (LVIM has 4 members), P(F | ·) / 2 (FY has 2)
Result: all of the weight went to the tri-gram model with the normal 20 amino acid alphabet
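The normalization step can be sketched as follows. The exact scheme is an assumption on my part: a reduced-alphabet model assigns probability to a whole group of amino acids, so dividing by the group size spreads that mass uniformly over the group's members.

```python
# Hypothetical normalization of reduced-alphabet probabilities by group size
GROUPS = ["LVIM", "FY", "KR", "ED", "AG", "ST", "NQ", "W", "C", "H", "P"]
GROUP_OF = {aa: g for g in GROUPS for aa in g}

def normalized_prob(group_probs, aa):
    """P(aa) from a reduced-alphabet group probability, split uniformly."""
    group = GROUP_OF[aa]
    return group_probs[group] / len(group)

probs = {"LVIM": 0.4, "FY": 0.1}
print(normalized_prob(probs, "L"))  # 0.4 / 4 = 0.1
print(normalized_prob(probs, "F"))  # 0.1 / 2 = 0.05
```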
An Example: D3DR_RAT
A Class A dopamine receptor
Figure 3 - Graph of the Log Probability of the Amino Acid at Each Position in the D3DR_RAT Sequence from the 3 Segment Models (Extracellular, Helix, Intracellular). The 3 segment models fluctuate frequently in their performance, making it difficult to detect which model is doing best and where the boundaries should be drawn.
D3DR_RAT @ Position 0-100
Figure 4 - Enlargement of the Graph in Figure 3 for the Amino Acid Positions 0-100. The true segment boundaries are marked in dotted vertical lines. (Segments shown: N-Terminus, Helix, Intracellular, Helix.)
Running Averages & Look-Ahead
Figure 5 - Graph of Running Averages of Log Probabilities of Each Amino Acid between Positions 0 and 100 in the D3DR_RAT Sequence with Predicted and True Boundaries Marked. Running averages were computed using a window-size of 2 and boundaries were predicted using a look-ahead of 5. The predicted boundaries are indicated by dotted vertical lines at positions 38, 53, 65 and 88, while the true boundaries are indicated by dashed vertical lines at positions 32, 55, 66 and 92.
Predicted Boundaries for D3DR_RAT
Window-size 2 from current amino acid; look-ahead interval of 5 amino acids

Predicted boundaries:       38  53  65  88  107  135  150  171  188  212  374  394  413  431
Offset:                      6   2   1   4    3    9    1    1    3    3    1    3    1    3
Synthetic true boundaries:  32  55  66  92  104  126  149  172  185  209  375  397  412  434
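The smoothing and look-ahead procedure described in the slides can be sketched as below. The decision rule (declare a boundary when a different model stays best for `look_ahead` consecutive positions) is my assumption about the method, not a confirmed detail.

```python
def running_average(scores, window=2):
    """Trailing-window mean of a list of per-position scores."""
    out = []
    for i in range(len(scores)):
        lo = max(0, i - window + 1)
        out.append(sum(scores[lo:i + 1]) / (i + 1 - lo))
    return out

def predict_boundaries(model_scores, window=2, look_ahead=5):
    """Predict segment boundaries from per-model log-probabilities.

    model_scores: dict mapping model name -> list of log-probs, one per
    position. A boundary is placed where the best-scoring model changes
    and the new model remains best for the next `look_ahead` positions.
    """
    smoothed = {m: running_average(s, window) for m, s in model_scores.items()}
    n = len(next(iter(smoothed.values())))
    best = [max(smoothed, key=lambda m: smoothed[m][i]) for i in range(n)]
    boundaries, current = [], best[0]
    for i in range(1, n):
        if best[i] != current and all(b == best[i] for b in best[i:i + look_ahead]):
            boundaries.append(i)
            current = best[i]
    return boundaries
```

On a toy pair of models where one dominates the first half of a sequence and the other the second half, this rule places a single boundary at the crossover point.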
The Only Truth: OPSD_HUMAN
The only GPCR that has been crystallized so far
Predicted boundaries:  37  61  72  97  113  130  153  173  201  228  250  275  283  307
Offset:                 1   0   1   1    0    3    1    3    1    2    2    1    1    2
True boundaries:       36  61  73  98  113  133  152  176  202  230  252  276  284  309

Average offset for this protein: 1.357 amino acids
Evaluation Metrics
Accuracy:
Score 1 - perfect match
Score 0.5 - offset of 1
Score 0.25 - offset of 2
Score 0 otherwise
Offset - absolute difference between the predicted and true boundary positions
10-fold cross-validation
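The accuracy metric above is simple enough to state directly in code. This sketch also scores the first few OPSD_HUMAN boundaries from the later slide as a usage example.

```python
def boundary_score(predicted, true):
    """Accuracy of one predicted boundary: 1, 0.5, 0.25 or 0 by offset."""
    offset = abs(predicted - true)
    return {0: 1.0, 1: 0.5, 2: 0.25}.get(offset, 0.0)

# Scoring the first seven OPSD_HUMAN boundaries reported in the slides
predicted = [37, 61, 72, 97, 113, 130, 153]
true_vals = [36, 61, 73, 98, 113, 133, 152]
scores = [boundary_score(p, t) for p, t in zip(predicted, true_vals)]
```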
Results: Trained Interpolated Models

Test Set  Size   Accuracy   Offset
                            Average   E-H       H-I       I-H       H-E
A         130    0.2383     49.9698   48.4827   48.5115   51.1564   52.7103
B         130    0.2691     21.3005   22.1250   21.6981   19.5744   21.3974
C         130    0.2426     34.4385   34.9635   34.6077   33.4205   34.5308
D         130    0.2353     22.9654   23.0442   22.3865   21.8949   24.7026
E         130    0.2501     34.9154   35.6519   35.6808   33.0051   34.8231
F         130    0.2269     21.5857   22.7269   21.9135   18.9513   22.2615
G         130    0.2343     32.1989   32.1808   31.6827   31.3590   33.7513
H         130    0.2250     42.7929   43.5135   43.4462   41.2103   42.5436
I         129    0.2438     33.1179   32.0872   32.0213   33.6512   35.4212
J         129    0.2445     62.1717   62.2519   62.5039   60.8269   62.9664
Overall   1298   0.2410     35.5270   35.6851   35.4270   34.4854   36.4913
Figure 6 - Results of Our Approach using Trained Interpolation Weights. Window-size: 2. Look-ahead interval: 5.
Distribution of Offset between Predicted and Synthetic True Boundary
Removing 10% of the proteins with the worst average offset causes the average offset for the dataset to drop to 10.51.
Results: Using All Probability Models
Figure 7 - Results of Our Approach using Pre-set Model Weights in the Interpolation: 0.1 for unigram and bi-gram models, 0.2 for each of the tri-gram models. Running averages were computed over a window-size of 5 and a look-ahead interval of 4 was used.
Test Set  Size   Accuracy   Offset
                            Average   E-H       H-I       I-H       H-E
A         130    0.2309     64.2923   63.9038   63.6750   63.4359   66.4897
B         130    0.2291     33.1368   34.1462   33.5077   30.9077   33.5256
C         130    0.2352     45.0154   45.4231   45.1115   43.3744   45.9846
D         130    0.2223     31.2264   31.3096   30.9365   29.8333   32.8949
E         130    0.2137     51.0593   52.8019   52.1962   47.3000   50.9795
F         130    0.2468     27.1764   27.9519   27.7500   24.8923   27.6615
G         130    0.2169     40.4791   41.1673   40.5558   38.1846   41.7538
H         130    0.2118     57.1110   56.8558   56.4673   56.3179   59.1026
I         129    0.2193     39.3272   41.3353   40.1143   35.4600   39.4677
J         129    0.2014     83.3162   84.2655   84.4302   79.9018   83.9793
Overall   1298   0.2228     47.1923   47.8931   47.4517   44.9412   48.1631
Results: Using Only Tri-gram Models
Figure 8 - Results of Our Approach using Pre-set Model Weights in the Interpolation: 0.25 for each of the tri-gram models. Window-size of 4 and a look-ahead interval of 4.
Test Set  Size   Accuracy   Offset
                            Average   E-H       H-I       I-H       H-E
A         130    0.2234     70.7082   70.8231   70.7000   69.2359   72.0385
B         130    0.2462     33.4071   34.0231   33.4731   31.4615   34.4436
C         130    0.2359     45.6275   45.6019   45.6173   44.5077   46.7949
D         130    0.2224     31.7533   32.0019   32.0731   30.0410   32.7077
E         130    0.2271     51.1978   53.0673   52.7346   47.3308   50.5231
F         130    0.2286     29.1319   30.7000   30.1077   26.0872   28.7846
G         130    0.2363     43.0967   43.1115   42.4288   41.8923   45.1718
H         130    0.2310     64.5154   64.3077   64.0923   63.4308   66.4410
I         129    0.2251     41.9873   43.5504   43.0581   38.5297   41.9328
J         129    0.2168     86.6235   87.2209   87.6919   83.6951   87.3308
Overall   1298   0.2293     49.7825   50.4178   50.1743   47.6004   50.5953
Conclusions
Average accuracy of 0.241, corresponding to an offset of about 2 on average
But the average offsets are much higher
Missing a boundary has detrimental effects on the prediction of the remaining boundaries in the sequence, especially with small segments
The large offsets are concentrated in a small number of proteins
Future Work
Cue words: unigrams, bi-grams, tri-grams and 4-grams in a window of +/- 25 amino acids from the boundary
Long-range contact: distribution tables of how likely 2 amino acids are to be in long-range contact with each other
Evaluation: how much homology is needed between training and testing data
References
1. Doug Beeferman, Adam Berger, and John Lafferty. “Statistical Models for Text Segmentation.” Machine Learning, special issue on Natural Language Learning, C. Cardie and R. Mooney eds., 34(1-3), pp. 177-210, 1999. http://www-2.cs.cmu.edu/~lafferty/ps/ml-final.ps
2. F. Campagne, J.M. Bernassau, and B. Maigret. Viseur program (Release 2.35). Copyright 1994, 1995, 1996, Fabien Campagne, All Rights Reserved.