Assignment of PROSITE motifs to topological regions: Application to a novel database of well characterised transmembrane proteins Tim Nugent


Identification of transmembrane regions

Most Hydrophobic        Least Hydrophobic
Isoleucine      4.5     Arginine       -4.5
Valine          4.2     Lysine         -3.9
Leucine         3.8     Glutamic acid  -3.5
Phenylalanine   2.8     Asparagine     -3.5
Cysteine        2.5     Glutamine      -3.5

To generate data for a plot, the protein sequence is scanned with a moving window of 19-21 residues. At each position, the mean hydrophobic index of the amino acids within the window is calculated, and that value is plotted at the midpoint of the window.
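The scan described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the slides: the function name `hydropathy_profile` is hypothetical, the scale is assumed to be the Kyte-Doolittle hydropathy index (the values listed above match it), and the window is restricted to odd sizes so that it has an exact midpoint residue.

```python
# Kyte-Doolittle hydropathy index (assumed scale; matches the values listed above).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Return (midpoint_position, mean_hydropathy) pairs, positions 1-based."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    half = window // 2
    profile = []
    # Slide the window one residue at a time along the sequence.
    for start in range(len(scores) - window + 1):
        mean = sum(scores[start:start + window]) / window
        profile.append((start + half + 1, mean))
    return profile


if __name__ == "__main__":
    # Toy sequence: a hydrophobic stretch flanked by charged residues.
    toy = "KKRDE" + "LIVLLAVIFLLIVGALLVI" + "RKDEK"
    for position, mean in hydropathy_profile(toy, window=19):
        print(position, round(mean, 2))
```

Peaks in the resulting profile mark candidate transmembrane segments; the cut-off used to call a segment is a separate choice that the source does not specify.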

Aquaporin

The Positive Inside Rule

[Figure: average amino acid fraction (axis 0 to 0.6) of hydrophobic and positive residues in inside loops, outside loops and helices]

Hydrophobic: Val, Phe, Ile, Leu, Met. Positive: Lys, Arg, His.

Cytoplasmic loops are enriched in positively charged residues: the 'positive-inside rule' of von Heijne
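As a rough illustration of how the rule can orient a protein once its helices are known, the Python sketch below counts positive residues in alternating loops and places the richer side in the cytoplasm. The function names and the simple majority decision are assumptions made for this example; the positive set (Lys, Arg, His) follows the definition given above.

```python
POSITIVE = set("KRH")  # Lys, Arg, His, as defined on the slide


def loops_between_helices(sequence, helices):
    """Split a sequence into loops given 1-based inclusive helix (start, end) pairs."""
    loops = []
    previous_end = 0
    for start, end in helices:
        loops.append(sequence[previous_end:start - 1])
        previous_end = end
    loops.append(sequence[previous_end:])
    return loops


def positive_residues_by_side(sequence, helices):
    """Count positive residues on the N-terminal side and on the opposite side."""
    loops = loops_between_helices(sequence, helices)
    # Loops alternate sides of the membrane: even-indexed loops share the
    # side of the N-terminus, odd-indexed loops lie on the other side.
    n_side = sum(aa in POSITIVE for loop in loops[0::2] for aa in loop)
    other_side = sum(aa in POSITIVE for loop in loops[1::2] for aa in loop)
    return n_side, other_side


def guess_n_terminal(sequence, helices):
    """Return 'in' if the N-terminal side carries at least as many positive residues."""
    n_side, other_side = positive_residues_by_side(sequence, helices)
    return "in" if n_side >= other_side else "out"
```

Real predictors weigh this signal together with other evidence; the sketch only shows the counting step.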

Ammonia Channel

Cytoplasmic

Extracellular

Topology prediction

Previous methods based on the physicochemical principle of a sliding window of hydrophobicity have been replaced by machine learning approaches.

These include hidden Markov models (PHOBIUS) and neural networks (MEMSAT3), which outperform hydrophobicity-based methods because of their probabilistic formulation.

MEMSAT3, which makes use of evolutionary information via PSI-BLAST profiles to enhance the prediction, is currently the most successful method with 80% success on the Möller dataset.

The challenge now is to develop new tools to build on MEMSAT3 with the expectation that accuracy will surpass this threshold.

Assembly of a new data set.

Assignment of motifs to highlight topogenic bias.

Assembling a novel data set of transmembrane proteins

In order to study and predict features of transmembrane proteins, the use of a high quality data set containing sequences with experimentally confirmed TM regions is essential.

The data set was based on the widely used Möller test set (2001).

Additional data was collected from MPTOPO, OPM, SWISSPROT and from the literature.

Sequences were searched against the PDB with BLAST in order to identify entries for which the TM region had complete structural coverage.

This set was then homology reduced at the 40% sequence identity level.

The final data set contains 141 sequences, all with available structures, verifiable topology and N-terminal locations.

111 Alpha-helical proteins

30 Beta-barrel proteins

Column                 Example Entry
swissprot              ANASP
accession              Q8YSC4
pdb                    1XIO:A
n_terminal_location    Periplasmic
n_terminal_in_out      Outside
membrane_type          Bacterial gram-negative inner membrane
helix_count            7
topology               A.3,26;B.35,56;C.70,89;D.99,121;E.127,148;F.168,185;G.195,218
swissprot_sequence     MNLESLLHWIYVAGMTIGALHFWSLSRNPRGVPQYEYLVAMFI...
sequence_length        261
class                  Alpha-helical
datasource             OPM
description            Bacteriorhodopsin - Anabaena sp. (strain PCC 7120)
taxonomy               Bacteria; Cyanobacteria; Nostocales; Nostocaceae; Nostoc.
domain                 Prokaryotes
id                     1052

MySQL table schema
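The topology column packs all helix boundaries into one string, so downstream analysis needs a small parser. The Python sketch below assumes that each semicolon-separated item has the form label.start,end with 1-based inclusive residue numbers, and that loops alternate sides starting from the n_terminal_in_out value. This reading of the format is inferred from the example entry, and the function names are hypothetical.

```python
def parse_topology(topology):
    """Parse 'A.3,26;B.35,56' into [('A', 3, 26), ('B', 35, 56)]."""
    helices = []
    for item in topology.strip().split(";"):
        if not item:
            continue
        label, span = item.split(".", 1)
        start, end = (int(value) for value in span.split(","))
        if start > end:
            raise ValueError("helix start after end: " + item)
        helices.append((label, start, end))
    return helices


def label_regions(helices, n_terminal_in_out, sequence_length):
    """Return (region, start, end) tuples covering the whole sequence."""
    opposite = {"Inside": "Outside", "Outside": "Inside"}
    side = n_terminal_in_out
    regions = []
    position = 1
    for label, start, end in helices:
        if start > position:
            regions.append((side, position, start - 1))
        regions.append(("Helix " + label, start, end))
        side = opposite[side]  # each helix crosses the membrane once
        position = end + 1
    if position <= sequence_length:
        regions.append((side, position, sequence_length))
    return regions


if __name__ == "__main__":
    example = "A.3,26;B.35,56;C.70,89;D.99,121;E.127,148;F.168,185;G.195,218"
    for region in label_regions(parse_topology(example), "Outside", 261):
        print(region)
```

Keeping the parsed regions as simple tuples makes it straightforward to ask which region a motif match at a given residue position falls in.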

Assignment of PROSITE motifs to topological regions

We next explored the possibility that motifs from the PROSITE database could be used as constraints in subsequent topology prediction steps, by identifying a bias in their inside/outside frequency.

Extracellular

Cytoplasm

Prosite ID  Description                                                   Inside   Outside  Helix   n    Obs in  Obs out  Exp in  Exp out  χ2
PS00005     Protein kinase C phosphorylation site                         63.85%   28.46%   7.69%   260  166     74       102.53  137.47   68.59
PS00006     Casein kinase II phosphorylation site                         54.97%   37.75%   5.63%   302  166     114      119.62  160.38   31.39
PS00004     cAMP- and cGMP-dependent protein kinase phosphorylation site  75.00%   21.43%   3.57%   28   21      6        11.54   15.47    13.56
PS00009     Amidation site                                                78.57%   14.29%   0.00%   14   11      2        5.55    7.45     9.32
PS00001     N-glycosylation site                                          48.54%   33.98%   13.59%  103  50      35       36.31   48.69    9.01
PS00003     Sulfation site                                                63.89%   36.11%   0.00%   36   23      13       15.38   20.62    6.59
PS00221     MIP family signature                                          100.00%  0.00%    0.00%   4    4       0        1.71    2.29     5.36
PS00029     Leucine zipper                                                27.27%   0.00%    27.27%  11   3       0        1.28    1.72     4.02
PS50857     Cytochrome c oxidase subunit II signature                     0.00%    100.00%  0.00%   4    0       4        1.71    2.29     2.98
PS00008     N-myristoylation site                                         20.40%   32.61%   44.15%  598  122     195      135.43  181.57   2.33

Alpha-helical protein PROSITE motif assignments
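The χ2 column can be reproduced from the observed counts. The Python sketch below assumes a goodness-of-fit test with one degree of freedom, in which the expected counts split the inside-plus-outside total according to a background inside fraction. That fraction (about 0.427, back-calculated here from the Exp in and Exp out columns) and the function name are assumptions; the source does not state how the expectation was derived.

```python
def inside_outside_chi_squared(obs_in, obs_out, inside_fraction):
    """Return (exp_in, exp_out, chi2) for a two-category goodness-of-fit test."""
    if not 0.0 < inside_fraction < 1.0:
        raise ValueError("inside_fraction must lie strictly between 0 and 1")
    total = obs_in + obs_out
    exp_in = total * inside_fraction
    exp_out = total - exp_in
    chi2 = (obs_in - exp_in) ** 2 / exp_in + (obs_out - exp_out) ** 2 / exp_out
    return exp_in, exp_out, chi2


if __name__ == "__main__":
    # Background fraction taken from the PS00005 row: Exp in / (Exp in + Exp out).
    background = 102.53 / (102.53 + 137.47)
    exp_in, exp_out, chi2 = inside_outside_chi_squared(166, 74, background)
    print(round(exp_in, 2), round(exp_out, 2), round(chi2, 2))
```

With one degree of freedom, a χ2 value above 3.84 corresponds to p < 0.05, so under this reading the last two rows of the table (2.98 and 2.33) would not show a significant inside/outside bias.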

# %labels and the OUTPUT filehandle are assumed to be defined earlier in the script.
my $im = DrawTransmembrane->new(
    -title          => 'CLN3 topology prediction using MEMSAT3_6',
    -n_terminal     => 'in',
    -topology       => '37,56,100,119,129,153,200,224,280,299,353,375',
    -labels         => \%labels,
    -outside_label  => 'Lumen',
    -inside_label   => 'Cytoplasm',
    -membrane_label => 'Membrane',
);

print OUTPUT $im->png;

A BioPerl module to draw transmembrane proteins

Conclusions

Created a high quality dataset of transmembrane proteins of known topology.

Scanned the novel data set against motif and domain databases to identify signatures which were consistently located on either inside or outside loops.

In collaboration with Dr Sara Mole (MRC Laboratory for Molecular Cell Biology), I have begun an analysis of CLN3 (Batten's Disease protein) with a view to predicting the protein's topology using a combination of computational and experimental evidence.

Future work

Determine if PROSITE motif bias can be used to improve topology prediction

Amino acid topogenic propensities

Module detection

Lipid exposure

Re-entrant regions

Signal Peptides
