7. Bioinformatics

Why carry out sequence analysis?
Pairwise alignment
 - dotplot
Multiple sequence alignment
 - consensus sequence
 - profile
Similarity
Secondary structure prediction
Applications

Rational Drug Discovery (course map)

Topics shown on the map: PC session; Protein sequence analysis; Biocomputing; Primary etc structure; X-ray crystallography; Structural genomics; Homology modelling; Protein Structure; QSAR; History; Objectives; Limitations; Statistics; Steric; Electrostatics; Hydrophobic; PC sessions; Molecular modelling; Theory; Drug structure; Drug conformation; Docking; De Novo ligand design; PC sessions; 3D QSAR; CoMFA; Lead compound; Physiological; Biochemical; Chemical (prodrugs); Targeting and delivery

7.1. Sequence Analysis - why?

Identify new proteins that could be potential drug targets, especially GPCRs
 - Database query: "give me all adrenergic receptor sequences" returns 10 rat sequences and 7 human sequences - conclusion? 3 more human sequences are still to be found.

Understand the overall function of a newly identified protein
 - A protein shows some similarity to another, well understood protein - conclusion? The new protein may have a similar function.

Identify basic structural features
 - A protein contains 7 hydrophobic stretches of ~26 amino acids - conclusion? It is probably a G-protein coupled receptor.
 - A protein contains 12 hydrophobic stretches of ~26 amino acids - conclusion? It is probably a transporter.

Identify the important residues in a protein
 - All class A amine-like G-protein coupled receptors (e.g. adrenergic, serotonin (5HT), dopamine, histamine, muscarinic) contain a conserved D (aspartate, Asp) on helix 3 that is involved in binding all known drugs.
 - At some sequence positions there are key differences between similar receptors that can be exploited to design subtype-specific drugs.

Sequence alignment can be used in homology modelling
 - Build a structural model of a protein from its sequence alignment with a protein of known structure.

7.2. DNA sequences

XX
SQ   Sequence 2032 BP; 461 A; 543 C; 501 G; 527 T; 0 other;
     cagagcgcaa gctggaactg gctgaactga caggcactgc gagcccagag tagccccgga        60
     gctgagtgca ccacgcaccc ctaccacacc cacacccacc cacggccgct gaatgagtct       120
     tccaggtgct cgcttgctgc ccgcagcgcc ccgccggagg tccgctcgct gagggcggct       180
     ggtgcgccgg cagcctgtgc gctcacctgc cagcctgcgc gccatggggc agcccgggaa       240
     ccgcagcgtc tttttgctgg cgcccaacgc aagccacgcg ccggaccaaa acgtcacgct       300
     ggaacgggac gaggcctggg ttgtgggcat gggcatcctc atgtcgctta ttgtcctggc       360
     catcgtgttt ggaaacgtgc tagtcatcac agccattgcc aagtttgagc gtctccagac       420
     ggtcaccaac tacttcatca cctccctggc ctgtgctgac ctggtcatgg gcctggcagt       480
     ggtgcccttt ggggcctgcc acatcctcat gaaaatgtgg acttttggca acttctggtg       540
     tgagttttgg acttccattg acgtgttatg cgtcacggcc agcattgaga ccttgtgcgt       600
     gatcgctgtg gatcgctact tagccatcac gtcacccttc aagtatcagt gcctgctgac       660
     caagaataag gcccgggtgg tcattttgat ggtgtggatc gtgtctggcc ttacctcctt       720
     cttacccatt cagatgcact ggtaccgggc cagccacaag gaagccatca actgctatgc       780
     taaggaaacc tgctgtgact tcttcacgaa ccaaccctat gccattgcct cctccattgt       840
     gtccttctac cttcccctgg tggtcatggt cttcgtctac tccagggtgt tccaggtggc       900
     caaaaggcag ctccagaaga tcgacaaatc tgagggccgc ttccatgccc aaaacgtcag       960
     tcaagtggag caggatgggc ggagcggtct aggacaacgc aggacctcca agttctactt      1020
     gaaggaacac aaagccctca agactttagg cattatcatg ggcactttca ccctgtgctg      1080
     gctgcccttc ttcattgtca acattgtgca cgtgatcaag gataacctca tccgtaagga      1140
     aatatacatc cttctaaact ggttgggcta catcaactcc gctttcaatc cccttatcta      1200
     ctgccggagc ccagatttca ggattgcctt ccaggagctt ctctgcctgc gcaggtcttc      1260
     attgaaggcc tatgggaatg gctgctccag caacagcaat gacaggactg actacacagg      1320
     ggaacagagt ggatatcacc tgggggagga gaaagacagt gaacttctgt gtgaagaccc      1380
     cccaggcacc gaaaactttg tgaaccagca aggtactgtg cccagtgata gcattgattc      1440
     acaagggagg aattgtagta caaatgactc actgctgtaa tgccggtttt ctacttttta      1500
     agacacccct tctccccagt accctgcaac aaaacactaa acagactatt taacttgagt      1560
     ctaataaatt tagaataaag ttgtacagag atgtgcagga ggaaagatat ccttctgcct      1620
     ttttattttt tattttttta agttgtaaca aaatatattt gagtaactgt ttcttgtaca      1680
     gttcagttcc tctttgcctg gaacttgtta agtttatgtc tgaagggctt cagtctcaaa      1740
     ggacctgggg ctgctatgtt ttgatgactt ttcctgcata tctacctcat tgatcaagta      1800
     ttaggggtaa tatattgctg ctggtaattt gtatctgaag gagaccttcc ttcctgcacc      1860
     cttggactgg aagatactga gtctctcgga cctttcgctg tgaacatgga ctctcctcgc      1920
     ccctcttatt tgctcaaacg gggtgttgta ggcagggact tgaggggcag ctttggttgt      1980
     tttcctgagc aaagtctaaa gtttacagta aataaattgt ttgaccatga aa              2032
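The SQ header line of the entry summarises the base composition (A, C, G, T and "other"). As a minimal sketch of how such a summary can be produced, the function below counts bases in a sequence string and ignores the spaces and position numbers. The function name `base_composition` is an illustrative choice, not part of any database tool.

```python
from collections import Counter

def base_composition(seq):
    """Count A, C, G, T and anything else, ignoring spaces and digits."""
    bases = [ch.upper() for ch in seq if ch.isalpha()]
    counts = Counter(bases)
    result = {b: counts.get(b, 0) for b in "ACGT"}
    # Any letter that is not A, C, G or T (e.g. N) is reported as "other"
    result["other"] = len(bases) - sum(result.values())
    return result

# First sequence line of the entry, including its trailing position number
line = "cagagcgcaa gctggaactg gctgaactga caggcactgc gagcccagag tagccccgga 60"
print(base_composition(line))
```

Running the same function over all sequence lines joined together should reproduce the totals quoted in the SQ header.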

7.3. Identity and similarity

Align 2 sequences: ADGVLIIQVG & ADGVLIQVG

2 alternatives:

    ADGVLIIQVG          ADGVLIIQVG
    ||||||        or    |||||| |||
    ADGVLIQVG           ADGVLI-QVG

    Score 6             Score 9 = higher, so better alignment (the "-" is a gap)

Comparing sub-sequences of A (400 residues) and B (650 residues)

[Figure: two ways, (i) and (ii), of matching sub-sequences of A against B]

If A and B are identical in the regions that match, then alignment is straightforward, even if it is necessary to insert gaps. Generally the sub-sequences are not identical, and so we need a measure of similarity rather than identity.
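The two scores quoted for the ADGVLIIQVG example can be checked with a few lines of code. This is a minimal sketch that scores 1 for every aligned identical pair and 0 otherwise, with no gap penalty; the function name `count_identities` is illustrative.

```python
def count_identities(a, b):
    """Number of aligned positions holding the same residue; gaps never match."""
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")

ungapped = count_identities("ADGVLIIQVG", "ADGVLIQVG")
gapped = count_identities("ADGVLIIQVG", "ADGVLI-QVG")
print(ungapped, gapped)  # 6 9
```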

2.4. G-protein coupled receptors

[Figure: GPCR signalling across the plasma membrane (exterior above, cytosol below). Labels: stimulatory ligand, receptor (Gs coupled), inhibitory ligand, G protein with GDP, stimulation of AC. Second panel: rhodopsin, X-ray structure.]

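2.4 GPCR CXCR4 Chemokine N-terminal (start) sequences

[Figure: multiple sequence alignment of the N-terminal regions]

 - The alignment contains the same sequence from different organisms and different sequences from the same organism; note the different lengths.
 - Note the poor alignment at the start (roughly the first 40 positions).
 - Note the well-conserved N at position 57 (a well-known GPCR motif).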

2.4 GPCR alignment: helices 6 & 7

[Figure: multiple sequence alignment covering transmembrane helices 6 and 7]

2.4 Notes on previous alignment

 - Note the examples of different sequences from the same organism.
 - Note the well-conserved (largely green) helical regions (~185-210, 225-247).
 - Note the less well-conserved loop region (~215) between transmembrane helix 6 (TM6) and TM7.
 - Find the conserved CWXP motif and the NPXXY motif. CWLP is at position 199; NPXXY is at position 241.
 - Are the alternatives to C (position 199) and N (position 241) what you would expect from the amino acid structure (see below)? Yes, the alternatives are similar.
 - The identification of such motifs is an indication that a new sequence is a GPCR.
 - Can you see groups of sequences that are more similar to each other? If these are highly similar subtypes of the same receptor (e.g. neurokinin receptor subtype 1 (NK1R), NK2R and NK3R), it could be difficult to design a drug that binds to one and not the others.
 - Note the predominance of green hydrophobic residues in the transmembrane regions (roughly positions 198-210 (TM6) and 222-248 (TM7)) and of red/blue hydrophilic residues in the loops (~211-221 and ~249+). For the full colour code, examine the alignment itself.
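Looking for the CWXP and NPXXY motifs by eye can be backed up with a simple pattern search. The sketch below uses Python regular expressions, where "." stands for the X (any residue) in the motif. The fragment is copied from the last two rows of the sequence shown in section 7.14; the names `MOTIFS` and `find_motifs` are illustrative. Positions reported are counted along this fragment, so they will not equal the alignment column numbers (199, 241) quoted in the notes.

```python
import re

# "X" in a motif means any amino acid, written "." in a regular expression
MOTIFS = {
    "CWXP": re.compile(r"CW.P"),
    "NPXXY": re.compile(r"NP..Y"),
}

def find_motifs(seq):
    """Return {motif name: [1-based start positions]} for each motif."""
    return {
        name: [m.start() + 1 for m in pattern.finditer(seq)]
        for name, pattern in MOTIFS.items()
    }

# Last two rows of the sequence in section 7.14, joined together
fragment = "KEVTRMVIIMVIAFLICWLPYAGVAFYIFTH" + "QGSDFGPIFMTIPAFFAKTSAVYNPVIYIMM"
print(find_motifs(fragment))
```

A hit for both motifs is only a hint that a sequence may be a GPCR; short patterns can also occur by chance, so the result should be read alongside the full alignment.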

2.4. Sequence alignment and subtype-specificity

[Figure: alignment with one binding-site position highlighted]

This position is N in beta-adrenergic receptors and F in alpha-adrenergic receptors. We know from site-directed mutagenesis (SDM) and from structure that it is in the binding site.

Beta-selective ligands such as propranolol have an OH group to interact with this residue; alpha-adrenergic ligands are more hydrophobic at this point.

5HT receptors also have N at this position and so promiscuously bind propranolol.

Knowledge of sequence can therefore be used to design specificity and reduce side-effects.

3.4. What does an alignment mean?

Examples from the Homstrad database:

[Figure: superposition of 1oft and 1bip] 4-oxalocrotonate tautomerase from Pseudomonas sp and Pseudomonas putida, 60 residues, %ID = 76%. At position 6, 1oft has a Y and 1bip has an H.

[Figure: superposition of 1tig and 2ife] Translation initiation factor IF-3 from Bacillus stearothermophilus and Escherichia coli. There is a gap in the alignment because the red chain is longer.

3.4. What does an alignment mean? (continued)

[Figure: superposition of 2mbr and 1hsk] Diphospho-N-acetylenolpyruvylglucosamine reductase and UDP-N-acetylenolpyruvoylglucosamine reductase from Escherichia coli and Staphylococcus aureus.

The gap here is because the blue loop is longer than the red loop at this point.

7.6. Pairwise alignment: the Dotplot

Align the sequences using the dotplot.

Dotplot: unrelated sequences

These sequences, ASRAILFYLLLIDD and HLWDSAGGQNSTSP, are not related. There is no serious diagonal line.

There will inevitably be some dots, as there are only 20 amino acids. A single dot (1 identical residue) does not mean an alignment.

Is there a weak alignment in the following?

    ASRAILFYLLLIDD---------
    ---------HLWDSAGGQNSTSP

Probably not; even this looks as if it has arisen by chance.
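A dotplot is easy to generate by program instead of on squared paper. The following minimal sketch places one sequence along the top and the other down the side, and marks an X wherever the two residues are identical (the identity matrix used throughout this section). The function names `dotplot` and `show` are illustrative.

```python
def dotplot(seq_a, seq_b):
    """One text row per residue of seq_b; 'X' marks identical residues."""
    return [
        "".join("X" if a == b else "." for a in seq_a)
        for b in seq_b
    ]

def show(seq_a, seq_b):
    """Print the dotplot with seq_a along the top and seq_b down the side."""
    print("  " + seq_a)
    for residue, row in zip(seq_b, dotplot(seq_a, seq_b)):
        print(residue + " " + row)

# The two unrelated sequences from the slide: expect scattered dots, no long diagonal
show("ASRAILFYLLLIDD", "HLWDSAGGQNSTSP")
```

Related sequences show up as runs of X along a diagonal; isolated X marks are the chance matches described above.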

Alignments from dotplots: simple cases

[Figure: dotplot of HIWDSGGAQQSSSD against HLWDSAGGQNSTSP]

The dotplot above has been determined; note the diagonal lines. Consider whether the short diagonal regions can be extended.

The alignment is therefore

    HIWDSGGAQQSSSD
    |:|||:|:|:|:|
    HLWDSAGGQNSTSP

The %ID = 8*100/14 = 57%. This can only be worked out from the alignment; it cannot be worked out from the dotplot.

Note that in this case some of the non-identical amino acids, e.g. {I,L} and {G,A}, are very similar, hence the ":" symbol. The D and the P at the end are not at all similar, but they should not be left out of the alignment.

Dotplots - continued

Alignments do not always start in the top left-hand corner.

The alignment is therefore

    YLHIWDSGGAQQSSSDD
      |:|||:|:|:|:|
    --HLWDSAGGQNSTSP-

The %ID = 8*100/14 = 57% (based on the 2nd sequence), or 8*100/17 = 47% (based on the 1st).
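The two percentages differ only in the length used as the denominator. A minimal sketch of the arithmetic, with the illustrative function name `percent_identity`, counts identical aligned pairs once and then divides by the ungapped length of each sequence in turn.

```python
def percent_identity(a, b):
    """Identities in a gapped alignment, as % of each sequence's own length."""
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    len_a = len(a.replace("-", ""))  # residues in the first sequence
    len_b = len(b.replace("-", ""))  # residues in the second sequence
    return {
        "matches": matches,
        "vs_first": 100 * matches / len_a,
        "vs_second": 100 * matches / len_b,
    }

result = percent_identity("YLHIWDSGGAQQSSSDD", "--HLWDSAGGQNSTSP-")
print(result["matches"], round(result["vs_second"]), round(result["vs_first"]))  # 8 57 47
```

Because the answer depends on the denominator, a quoted %ID should always say which length it is based on.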

Dotplots: alignments with gaps

[Figure: dotplot of HLWDSAFFAGAQQSTS against HLWDSAGGQNSTS]

This dotplot shows two diagonal lines, giving two clear local alignments:

    HLWDSA         AGAQQSTS
    ||||||   and   ||:|:|||
    HLWDSA         AGGQNSTS

Joining these together gives

    HLWDSAFFAGAQQSTS          HLWDSAFFAGAQQSTS
    ||||||   |:|:|||    or    |||||   ||:|:|||
    HLWDSA---GGQNSTS          HLWDS---AGGQNSTS

We have to decide between them, as we can't use the A twice. I chose the 1st; you might choose the 2nd.

%ID = 11*100/16 = 69%

7.6. For you to align using a dot plot

    D4DR_HUMAN   RERKAMRVLP VVVGAFLLCW TPFFVVHITQ
    ACM1_HUMAN   KEKKAARTLS AILLAFILTW TPYNIMVLVS

Hint: you need some squared paper!

The correct answer is obvious, but you need to do the exercise so that you can check out the alternatives.

The correct answer can be found at http://tinyGRAP.uit.no/famin.html - the sequences are part of helix 6 (last checked 2001).

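7.7. Pairwise alignment: Completed Dotplot

[Figure: three completed dotplots, labelled "Identical sequences", "Highly similar" and "Different but related"]

The alignment is

    EGPRPDSSAGGSSAG
    |||:||||||
    EGPKPDSSAG

%ID = 9*100/10 = 90% (9 matches over a length of 10 residues; the shorter sequence ends before the C-terminus of the longer one)

or

    EGPRPDSSAGGSSAG
    |||:||     ||||
    EGPKPD-----SSAG

%ID = 9*100/15 = 60% (9 matches over a length of 15 residues; the alignment contains a gap)

Or is there another alternative?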

7.8. Global alignment v local alignment

Global alignment
 - The essence is to score 1 for each X on the dot plot, 0 otherwise.
 - The aim is to find the highest scoring route (from the alternatives) through the entire grid, starting from the C-terminus - essentially by joining up diagonal lines in the dotplot.
 - A gap penalty is introduced for jumping between parallel lines, as this corresponds to creating a gap.
 - The Needleman and Wunsch algorithm is the best known of this kind.

Local alignment
 - Similar to the above, but only fragments are considered.
 - Only parts of the protein may be similar.
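The global alignment idea can be sketched as a short dynamic-programming routine in the style of Needleman and Wunsch. This is an illustrative sketch, not the published implementation: it fills the grid from the N-terminus and traces back from the C-terminal corner, which gives the same best score as working in the other direction. The scores (match 1, mismatch 0) follow the slide; the gap penalty of -1 per gap position is an assumption chosen for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair(i, j):
        # Identity scoring: the same rule as an X on the dotplot
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # stay on the diagonal
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right corner to recover one best route
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

best, top, bottom = needleman_wunsch("ADGVLIIQVG", "ADGVLIQVG")
print(best)
print(top)
print(bottom)
```

For the ADGVLIIQVG example the gap can sit opposite either of the two adjacent I residues with the same score, so the routine reports one of several equally good alignments.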

7.9. Database searching

In database searching we effectively carry out lots of pairwise comparisons, but this has to be much faster than an ordinary pairwise alignment.

Fasta
 - Fasta searches for identical pairs of ~2 residues, with tricks to find the best way to join the pairs together.
 - An alignment will be produced if enough pairs are found.

Output from the program includes
 - query sequence - the one entered
 - name of database searched (e.g. SWISS-PROT)
 - program name + literature reference to be cited
 - list of hits (often ~50), incl. unique database identifier (e.g. A1AA_RAT) & ID code (e.g. P23944)
 - E-value - a low value indicates that virtually no matches with a similar score could be expected by chance. Look for a value less than 0.01, or preferably 0.001.
 - alignment

BLAST
 - The distinction is that BLAST looks for fixed-length hits and extends them if possible.
 - The resulting high-scoring segment pairs (HSPs) form the basis of the alignment.
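The first step described for Fasta, finding identical pairs of residues and seeing which of them line up, can be sketched as follows. This is a simplified illustration of the word-matching idea only, not the Fasta program itself: it indexes every 2-residue word of the query, looks up each word of the target, and counts hits per diagonal (query position minus target position). The function name `shared_words` and the parameter `k` are illustrative.

```python
from collections import defaultdict

def shared_words(query, target, k=2):
    """Return {diagonal offset: number of identical k-residue words on it}."""
    # Index every k-residue word of the query by its start position
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    # Each shared word lies on the dotplot diagonal i - j
    diagonals = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], []):
            diagonals[i - j] += 1
    return dict(diagonals)

print(shared_words("YLHIWDSGGAQQSSSDD", "HLWDSAGGQNSTSP"))
```

A diagonal that collects several word hits is a candidate for a full alignment; diagonals with a single hit are more likely to be chance matches. Indexing the query once is what makes this faster than filling a complete alignment grid for every database entry.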

7.10. 5 Amino acid groups - arrange in groups

Amino acids to be grouped:
Gly, G; Val, V; Tyr, Y; Arg, R; Asp, D; Cys, C; Ala, A; Trp, W; Lys, K; Glu, E; Met, M; Ile, I; Phe, F; Ser, S; Asn, N; Pro, P; Leu, L; His, H; Thr, T; Gln, Q

Key to the labels used in the figure:
 S   small
 +   positive
 C   cysteine
 A   aromatic
 -   negative or similar polar

[Figure: each amino acid marked with one of the labels H, S, A, C, +, -]

Other groupings are possible.

7.11. Similarity

[Figure, above left: identity matrix, as used in the dotplot]
[Figure, above right: part of the Dayhoff mutation matrix, based on observed mutations in aligned proteins]

 - W is rarer than L, and so a W-W match scores 17 whereas an L-L match scores 6.
 - F is like Y, so an F-Y pair still scores 7.
 - W and V are very different, hence a score of -6.
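The difference between identity and similarity scoring can be shown with a small lookup table. The sketch below holds only the four Dayhoff values quoted above; a real matrix has an entry for every pair of amino acids, so `PARTIAL_SCORES` is deliberately incomplete and is labelled as such. The function names are illustrative.

```python
# Only the four values quoted on the slide; NOT a complete matrix
PARTIAL_SCORES = {
    ("W", "W"): 17,
    ("L", "L"): 6,
    ("F", "Y"): 7,
    ("W", "V"): -6,
}

def identity_score(x, y):
    """Identity matrix, as used in the dotplot: 1 for a match, 0 otherwise."""
    return 1 if x == y else 0

def similarity_score(x, y, matrix=PARTIAL_SCORES):
    """Symmetric lookup; raises KeyError for pairs missing from the table."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    return matrix[(y, x)]

for pair in [("W", "W"), ("L", "L"), ("Y", "F"), ("V", "W")]:
    print(pair, identity_score(*pair), similarity_score(*pair))
```

With the identity matrix the F-Y pair scores 0 and is invisible on a dotplot, whereas the similarity matrix rewards it almost as much as an L-L match.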

7.12. Multiple sequence alignment

Two main perspectives
 - 1st: based on comparison of amino acid sequences, taking into account amino acid properties
 - 2nd: takes into account secondary or tertiary structure

Which is the best alignment below? (H denotes helix)

    HHHHHH     HHHHH
    EGPRPDSSAGGSSAGAPD
    |||:|.||||     |||
    EGPKPQSSAG-----APD

    HHHHHH     HHHHH
    EGPRPDSSAGGSSAGAPD
    |||:|.     |||||||
    EGPKPQ-----SSAGAPD

The first creates gaps in secondary structure (not so good); the second is better.

General strategy
 - Pair-wise alignment of all sequences
 - Produce a phylogenetic tree to group similar sequences (as shown on the right of the slide)
 - Similar sequences are aligned first, more distantly related ones later
 - Gaps in related sequences guide the position of gaps in others
 - The alignment may not be optimal and may need manual adjustment
 - A similarity matrix (e.g. Dayhoff PAM 250, BLOSUM 60) rather than an identity matrix is used in alignment
 - Different methods (e.g. clustal (ordinary method), T-coffee, profile methods in clustal) may give different alignments, so think carefully about an alignment

7.13. Profile methods in multiple sequence alignment

Consensus sequence
In multiple sequence alignment the consensus sequence gives the usual amino acid at a particular position. It is shown as:
 - upper case if only one amino acid is present, e.g. A at position 9
 - lower case if the majority are one amino acid, e.g. y at position 1
 - all residues present if there are equal numbers, e.g. V/L at position 6

Profile
The profile gives the percentage of each amino acid at each point. At position 1 there are 3/5 Y and 2/5 F, so the profile is 0.6Y, 0.4F.

    Position    1     2     3     4     5     6     7     8     9     10
    Consensus   y     d     g     G     A/I   V/L   v     e     A     t
    Profile     0.6Y  0.6D  0.8G  1.0G  0.4A  0.4V  0.6V  0.6E  1.0A  0.2V
                0.4F  0.4E  0.2-        0.4I  0.4L  0.4-  0.4Q        0.8T
                                        0.2-  0.2-
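The consensus and profile rules can be turned into a short calculation. The five-sequence alignment on the original slide is not reproduced in the text, so `ALIGNMENT` below is a hypothetical alignment constructed only to be consistent with the profile values quoted above. The function names are illustrative, and the lower-case rule is implemented as "single most common residue", which coincides with "majority" for these columns.

```python
from collections import Counter

# Hypothetical alignment, built to reproduce the profile quoted in the text
ALIGNMENT = [
    "YDGGAVVEAT",
    "YDGGAVVEAT",
    "YDGGILVEAT",
    "FEGGIL-QAT",
    "FE-G---QAV",
]

def profile(alignment):
    """One dict per column: symbol -> fraction of sequences (gaps included)."""
    n = len(alignment)
    return [
        {symbol: count / n for symbol, count in Counter(column).items()}
        for column in zip(*alignment)
    ]

def consensus(alignment):
    """Upper case if invariant, lower case for a single most common residue,
    and all tied residues joined by '/' otherwise."""
    result = []
    for column in zip(*alignment):
        counts = Counter(column)
        if len(counts) == 1:
            result.append(column[0].upper())
            continue
        residues = {s: c for s, c in counts.items() if s != "-"}
        top = max(residues.values())
        best = [s for s, c in residues.items() if c == top]
        result.append(best[0].lower() if len(best) == 1 else "/".join(best))
    return result

print(" ".join(consensus(ALIGNMENT)))
print(profile(ALIGNMENT)[0])
```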

7.13. Profile methods in multiple sequence alignment (continued)

The profile
Sometimes it is useful to align sequences against the profile, especially if they are very different to each other.

7.14. Secondary structure prediction: predicting transmembrane helices

    LAEPWQFSMLAAYMFLLIMLGFPINFLTLY
    VTVQHKRTPLNYILLNLAVADLFMVFGGFT
    TTLYTSLHGYFVFGPTGCNLEGFFATLGGE
    IALWSLVVLAIERYVVVCKPRFGENHAIMG
    VAFTWVMALACAAPPLVGWSRYIPNNESF
    VIYMFVVHFIIPLIVIFFCYGQLVFTTQKAE
    KEVTRMVIIMVIAFLICWLPYAGVAFYIFTH
    QGSDFGPIFMTIPAFFAKTSAVYNPVIYIMM

[Figure: the sequence drawn threading through the membrane. Labels: N (N-terminus), membrane, inside cell.]

7.14. Prediction from Hidden Markov Method

    # Sequence Length: 243
    # Sequence Number of predicted TMHs: 7
    # Sequence Exp number of AAs in TMHs: 156.33216
    # Sequence Exp number, first 60 AAs: 40.14445
    # Sequence Total prob of N-in: 0.00006
    # Sequence POSSIBLE N-term signal sequence
    Sequence  TMHMM2.0  outside     1     9
    Sequence  TMHMM2.0  TMhelix    10    29
    Sequence  TMHMM2.0  inside     30    41
    Sequence  TMHMM2.0  TMhelix    42    64
    Sequence  TMHMM2.0  outside    65    78
    Sequence  TMHMM2.0  TMhelix    79   101
    Sequence  TMHMM2.0  inside    102   120
    Sequence  TMHMM2.0  TMhelix   121   143
    Sequence  TMHMM2.0  outside   144   152
    Sequence  TMHMM2.0  TMhelix   153   175
    Sequence  TMHMM2.0  inside    176   186
    Sequence  TMHMM2.0  TMhelix   187   209
    Sequence  TMHMM2.0  outside   210   218
    Sequence  TMHMM2.0  TMhelix   219   241
    Sequence  TMHMM2.0  inside    242   243

This is a highly sophisticated prediction based on hydrophobicities, known observations etc.

From http://www.sbc.su.se/internal.html

The web is extremely important in bioinformatics.

Similar programs can predict helices, sheets, turns etc. in globular proteins.
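The hydrophobicity idea behind such predictions can be illustrated with something much simpler than a hidden Markov model. The sketch below is a sliding-window hydropathy scan using the Kyte-Doolittle scale; it is a substitute technique for illustration and does not reproduce the TMHMM method or its output. The window of 19 residues and the threshold of 1.6 are commonly quoted settings for this scale and are assumptions here, as is the function name `hydrophobic_segments`.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobic_segments(seq, window=19, threshold=1.6):
    """Return 1-based (start, end) ranges covered by consecutive windows
    whose mean hydropathy exceeds the threshold."""
    segments = []
    current = None
    for i in range(len(seq) - window + 1):
        mean = sum(KD[res] for res in seq[i:i + window]) / window
        if mean > threshold:
            if current is None:
                current = [i + 1, i + window]   # open a new segment
            else:
                current[1] = i + window         # extend the open segment
        elif current is not None:
            segments.append(tuple(current))
            current = None
    if current is not None:
        segments.append(tuple(current))
    return segments

# Artificial test sequence: a 20-residue hydrophobic stretch between polar flanks
print(hydrophobic_segments("D" * 20 + "L" * 20 + "D" * 20))
```

Because each reported range is the union of overlapping windows, it extends a few residues beyond the hydrophobic stretch itself. Counting the segments found in a real sequence gives a rough version of the "7 hydrophobic stretches" test for a GPCR.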

8.1. Drug targeting and delivery

Physical approaches: microspheres
 - Drugs are enclosed in biodegradable particles that are delivered to fine capillaries, where they get stuck - inject upstream of the target.

Biochemical approaches
 - Raise an antibody to a specific antigen, e.g. cell markers on tumour cells, then link the drug to the antibody.
 - There are still problems, as antibodies are large - it is preferable to use an antibody fragment, as it is then distributed more easily.
 - The drug must still get inside the cell, so it must be attached via a labile linkage.

8.2. The lead

Finding a lead - so a major drug development project can start
 - Serendipity
 - High throughput screening, e.g. testing compounds from a company's own database
 - Combinatorial chemistry, using libraries specifically designed using molecular modelling etc. for a given target

Properties of a lead
 - not just active in the primary screen; the screen must be validated statistically
 - has passed secondary tests to avoid false positives
 - shows promise in a cascade of tests agreed for its selection
 - must be active in vivo
 - must be patentable - not too similar to a competitor's product

Other desirable properties of a lead
 - potent enough for efficacy at a convenient dose
 - selective within a receptor class (e.g. an adrenergic ligand selective for α v β, or for subtype 1 v subtype 2)
 - selective between classes, e.g. a subtype 1 adrenergic antagonist doesn't act at 5-HT receptors
 - toxicity: good therapeutic index, not mutagenic
 - active orally; reasonable duration of activity; stable
 - need to determine whether metabolites possess activity; are there species anomalies?

QSAR can start once we have a lead.
