Segmenting G-Protein Coupled Receptors using Language Models
Betty Yee Man Cheng
Language Technologies Institute, CMU
Advisors: Judith Klein-Seetharaman, Jaime Carbonell
The Segmentation Problem
Segment a protein sequence according to its secondary structure
Related to secondary structure prediction
Often viewed as a classification problem
Best performance so far is 78%
A large portion of the problem lies with the boundary cases
Limited Domain: GPCRs
G-Protein Coupled Receptors
One of the largest superfamilies of proteins known: 2955 sequences and 1654 fragments found so far
Transmembrane proteins
Play a central role in many diseases
Only 1 protein has been crystallized
Distinguishing Characteristic of GPCRs
The order of segments is known:
N-terminus, Helix, Intracellular loop, Extracellular loop, C-terminus
Methodology: Topicality Measures
Based on “Statistical Models for Text Segmentation” by D. Beeferman, A. Berger, and J. Lafferty
Topicality measures are log-ratios of 2 different models:
Short-range model versus long-range model in topic segmentation in text
Models of different segments in proteins
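The log-ratio idea above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the two "models" are stand-in dictionaries of per-token probabilities, where Beeferman et al. use a short-range (adaptive) and a long-range (static) language model.

```python
import math

def topicality(token, short_range_model, long_range_model, eps=1e-12):
    """Log-ratio of the short-range to the long-range probability of a token.

    Values near zero mean the two models agree; large swings in this
    measure suggest a topic (or segment) boundary may be nearby.
    """
    p_short = short_range_model.get(token, eps)
    p_long = long_range_model.get(token, eps)
    return math.log(p_short / p_long)

# Toy example with made-up amino-acid probabilities
short = {"L": 0.10, "K": 0.02}
long_ = {"L": 0.05, "K": 0.04}
print(round(topicality("L", short, long_), 4))  # log(2)  ~  0.6931
print(round(topicality("K", short, long_), 4))  # log(1/2) ~ -0.6931
```

In the protein setting, each segment model plays the role of one of these language models and the measure is evaluated at every sequence position.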
Short-Range Model vs. Long-Range Model
Problem - Not Enough Data!

Family Name                     Number of Proteins
Class A                         1081
Class B                           83
Class C                           28
Class D                           11
Class E                            4
Class F                           45
Drosophila Odorant Receptors      31
Nematode Chemoreceptors            1
Ocular Albinism Proteins           2
Orphan A                          35
Orphan B                           2
Plant Mlo Receptors               10

Total of 1333 proteins
Over 90% are shorter than 750 amino acids
Average sequence length is 441 amino acids
Average segment length is 25 amino acids
3 Topicality Models in GPCRs
Previous segmentation experiments with mutual information and Yule’s measures have shown a similarity between:
All helices
All intracellular loops and the C-terminus
All extracellular loops and the N-terminus
No two helices or loops occur consecutively
3 models instead of 15, trained across all families of GPCRs
Model of a Segment
Each model is an interpolated model of 6 basic probability models:
Unigram model (20 amino acids)
Bi-gram model (20 amino acids)
Tri-gram model (20 amino acids)
3 tri-gram models on reduced alphabets of 11, 3 and 2 letters:
LVIM, FY, KR, ED, AG, ST, NQ, W, C, H, P
LVIMFYAGCW, KREDH, STNQP
LVIMFYAGCW, KREDHSTNQP
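The interpolated segment model can be sketched as a weighted sum of the component models. This is an assumption-laden illustration, not the authors' implementation: the component models are represented as simple callables, and REDUCED_11 encodes the 11-group alphabet from the slide.

```python
# Map each amino acid to its group in the 11-letter reduced alphabet
REDUCED_11 = {}
for group in ["LVIM", "FY", "KR", "ED", "AG", "ST", "NQ", "W", "C", "H", "P"]:
    for aa in group:
        REDUCED_11[aa] = group

def interpolated_prob(models, weights, context, aa):
    """P(aa | context) as a weighted sum of component model probabilities.

    `models` is a list of callables model(context, aa) -> probability;
    `weights` are the interpolation weights and must sum to 1.
    """
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * m(context, aa) for w, m in zip(models and weights, models))
```

A reduced-alphabet tri-gram model would first translate both the context and the amino acid through a mapping like REDUCED_11 before looking up its probability.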
Why Use Reduced Alphabets?
Figure 1. Snake-like diagram of the human β2 adrenergic receptor.
Interpolation Oddity
Weights were trained so that the sum of the probabilities assigned to the amino acid at each position in the training data is maximized
First attempt: all of the weight went to the tri-gram model with the smallest reduced alphabet
Reason: a smaller vocabulary size leaves the probability mass less spread out
Interpolation Oddity, Take 2
Normalize the probabilities from the reduced-alphabet models by group size
E.g. with the alphabet LVIM,FY,KR,ED,AG,ST,NQ,W,C,H,P:
P(L | ·) / 4 (LVIM has 4 members), P(F | ·) / 2 (FY has 2)
Result: all of the weight went to the tri-gram model with the normal 20 amino acid alphabet
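The normalization step can be sketched as follows. The exact scheme is an assumption on my part: a reduced-alphabet model assigns probability to a whole group of amino acids, so dividing by the group size spreads that mass uniformly over the group's members.

```python
# Hypothetical normalization of reduced-alphabet probabilities by group size
GROUPS = ["LVIM", "FY", "KR", "ED", "AG", "ST", "NQ", "W", "C", "H", "P"]
GROUP_OF = {aa: g for g in GROUPS for aa in g}

def normalized_prob(group_probs, aa):
    """P(aa) from a reduced-alphabet group probability, split uniformly."""
    group = GROUP_OF[aa]
    return group_probs[group] / len(group)

probs = {"LVIM": 0.4, "FY": 0.1}
print(normalized_prob(probs, "L"))  # 0.4 / 4 = 0.1
print(normalized_prob(probs, "F"))  # 0.1 / 2 = 0.05
```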
An Example: D3DR_RAT
A Class A dopamine receptor
Figure 3 - Graph of the Log Probability of the Amino Acid at Each Position in the D3DR_RAT Sequence from the 3 Segment Models (Extracellular, Helix, Intracellular). The 3 segment models fluctuate frequently in their performance, making it difficult to detect which model is doing best and where the boundaries should be drawn.
D3DR_RAT @ Position 0-100
Figure 4 - Enlargement of the Graph in Figure 3 for the Amino Acid Positions 0-100. The true segment boundaries are marked in dotted vertical lines. (Segments shown: N-Terminus, Helix, Intracellular, Helix.)
Running Averages & Look-Ahead
Figure 5 - Graph of Running Averages of Log Probabilities of Each Amino Acid between Positions 0 and 100 in the D3DR_RAT Sequence with Predicted and True Boundaries Marked. Running averages were computed using a window-size of 2 and boundaries were predicted using a look-ahead of 5. The predicted boundaries are indicated by dotted vertical lines at positions 38, 53, 65 and 88, while the true boundaries are indicated by dashed vertical lines at positions 32, 55, 66 and 92.
Predicted Boundaries for D3DR_RAT
Window-size 2 from current amino acid; look-ahead interval of 5 amino acids

Predicted boundaries:       38  53  65  88  107  135  150  171  188  212  374  394  413  431
Offset:                      6   2   1   4    3    9    1    1    3    3    1    3    1    3
Synthetic true boundaries:  32  55  66  92  104  126  149  172  185  209  375  397  412  434
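The smoothing and look-ahead procedure described in the slides can be sketched as below. The decision rule (declare a boundary when a different model stays best for `look_ahead` consecutive positions) is my assumption about the method, not a confirmed detail.

```python
def running_average(scores, window=2):
    """Trailing-window mean of a list of per-position scores."""
    out = []
    for i in range(len(scores)):
        lo = max(0, i - window + 1)
        out.append(sum(scores[lo:i + 1]) / (i + 1 - lo))
    return out

def predict_boundaries(model_scores, window=2, look_ahead=5):
    """Predict segment boundaries from per-model log-probabilities.

    model_scores: dict mapping model name -> list of log-probs, one per
    position. A boundary is placed where the best-scoring model changes
    and the new model remains best for the next `look_ahead` positions.
    """
    smoothed = {m: running_average(s, window) for m, s in model_scores.items()}
    n = len(next(iter(smoothed.values())))
    best = [max(smoothed, key=lambda m: smoothed[m][i]) for i in range(n)]
    boundaries, current = [], best[0]
    for i in range(1, n):
        if best[i] != current and all(b == best[i] for b in best[i:i + look_ahead]):
            boundaries.append(i)
            current = best[i]
    return boundaries
```

On a toy pair of models where one dominates the first half of a sequence and the other the second half, this rule places a single boundary at the crossover point.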
The Only Truth: OPSD_HUMAN
The only GPCR that has been crystallized so far
Predicted boundaries:  37  61  72  97  113  130  153  173  201  228  250  275  283  307
Offset:                 1   0   1   1    0    3    1    3    1    2    2    1    1    2
True boundaries:       36  61  73  98  113  133  152  176  202  230  252  276  284  309

Average offset for this protein: 1.357 amino acids
Evaluation Metrics
Accuracy:
Score 1 - perfect match
Score 0.5 - offset of 1
Score 0.25 - offset of 2
Score 0 otherwise
Offset - absolute difference between the predicted and true boundary positions
10-fold cross-validation
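The accuracy metric above is simple enough to state directly in code. This sketch also scores the first few OPSD_HUMAN boundaries from the later slide as a usage example.

```python
def boundary_score(predicted, true):
    """Accuracy of one predicted boundary: 1, 0.5, 0.25 or 0 by offset."""
    offset = abs(predicted - true)
    return {0: 1.0, 1: 0.5, 2: 0.25}.get(offset, 0.0)

# Scoring the first seven OPSD_HUMAN boundaries reported in the slides
predicted = [37, 61, 72, 97, 113, 130, 153]
true_vals = [36, 61, 73, 98, 113, 133, 152]
scores = [boundary_score(p, t) for p, t in zip(predicted, true_vals)]
```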
Results: Trained Interpolated Models

Test Set  Size   Accuracy   Offset
                            Average   E-H       H-I       I-H       H-E
A         130    0.2383     49.9698   48.4827   48.5115   51.1564   52.7103
B         130    0.2691     21.3005   22.1250   21.6981   19.5744   21.3974
C         130    0.2426     34.4385   34.9635   34.6077   33.4205   34.5308
D         130    0.2353     22.9654   23.0442   22.3865   21.8949   24.7026
E         130    0.2501     34.9154   35.6519   35.6808   33.0051   34.8231
F         130    0.2269     21.5857   22.7269   21.9135   18.9513   22.2615
G         130    0.2343     32.1989   32.1808   31.6827   31.3590   33.7513
H         130    0.2250     42.7929   43.5135   43.4462   41.2103   42.5436
I         129    0.2438     33.1179   32.0872   32.0213   33.6512   35.4212
J         129    0.2445     62.1717   62.2519   62.5039   60.8269   62.9664
Overall   1298   0.2410     35.5270   35.6851   35.4270   34.4854   36.4913
Figure 6 - Results of Our Approach using Trained Interpolation Weights. Window-size: 2. Look-ahead interval: 5.
Distribution of Offset between Predicted and Synthetic True Boundary
Removing 10% of the proteins with the worst average offset causes the average offset for the dataset to drop to 10.51.
Results: Using All Probability Models
Figure 7 - Results of Our Approach using Pre-set Model Weights in the Interpolation: 0.1 for unigram and bi-gram models, 0.2 for each of the tri-gram models. Running averages were computed over a window-size of 5 and a look-ahead interval of 4 was used.
Test Set  Size   Accuracy   Offset
                            Average   E-H       H-I       I-H       H-E
A         130    0.2309     64.2923   63.9038   63.6750   63.4359   66.4897
B         130    0.2291     33.1368   34.1462   33.5077   30.9077   33.5256
C         130    0.2352     45.0154   45.4231   45.1115   43.3744   45.9846
D         130    0.2223     31.2264   31.3096   30.9365   29.8333   32.8949
E         130    0.2137     51.0593   52.8019   52.1962   47.3000   50.9795
F         130    0.2468     27.1764   27.9519   27.7500   24.8923   27.6615
G         130    0.2169     40.4791   41.1673   40.5558   38.1846   41.7538
H         130    0.2118     57.1110   56.8558   56.4673   56.3179   59.1026
I         129    0.2193     39.3272   41.3353   40.1143   35.4600   39.4677
J         129    0.2014     83.3162   84.2655   84.4302   79.9018   83.9793
Overall   1298   0.2228     47.1923   47.8931   47.4517   44.9412   48.1631
Results: Using Only Tri-gram Models
Figure 8 - Results of Our Approach using Pre-set Model Weights in the Interpolation: 0.25 for each of the tri-gram models. Window-size of 4 and a look-ahead interval of 4.
Test Set  Size   Accuracy   Offset
                            Average   E-H       H-I       I-H       H-E
A         130    0.2234     70.7082   70.8231   70.7000   69.2359   72.0385
B         130    0.2462     33.4071   34.0231   33.4731   31.4615   34.4436
C         130    0.2359     45.6275   45.6019   45.6173   44.5077   46.7949
D         130    0.2224     31.7533   32.0019   32.0731   30.0410   32.7077
E         130    0.2271     51.1978   53.0673   52.7346   47.3308   50.5231
F         130    0.2286     29.1319   30.7000   30.1077   26.0872   28.7846
G         130    0.2363     43.0967   43.1115   42.4288   41.8923   45.1718
H         130    0.2310     64.5154   64.3077   64.0923   63.4308   66.4410
I         129    0.2251     41.9873   43.5504   43.0581   38.5297   41.9328
J         129    0.2168     86.6235   87.2209   87.6919   83.6951   87.3308
Overall   1298   0.2293     49.7825   50.4178   50.1743   47.6004   50.5953
Conclusions
Average accuracy of 0.241, corresponding to an offset of about 2 on average
But the average offsets are much higher
Missing a boundary has detrimental effects on the prediction of the remaining boundaries in the sequence, especially with small segments
The large offsets are concentrated in a small number of proteins
Future Work
Cue words: unigrams, bi-grams, tri-grams and 4-grams in a window of +/- 25 amino acids from the boundary
Long-range contact: distribution tables of how likely 2 amino acids are to be in long-range contact with each other
Evaluation: how much homology is needed between training and testing data
References
1. Doug Beeferman, Adam Berger, and John Lafferty. “Statistical Models for Text Segmentation.” Machine Learning, special issue on Natural Language Learning, C. Cardie and R. Mooney eds., 34(1-3), pp. 177-210, 1999. http://www-2.cs.cmu.edu/~lafferty/ps/ml-final.ps
2. F. Campagne, J.M. Bernassau, and B. Maigret. Viseur program (Release 2.35). Copyright 1994, 1995, 1996, Fabien Campagne, All Rights Reserved.