Biological databases for protein sequence analysis

Biological Databases for Protein Sequence Analysis
Teresa K. Attwood
School of Biological Sciences
University of Manchester, Oxford Road
Manchester M13 9PT, UK
http://www.bioinf.man.ac.uk/dbbrowser/

Overview
Introduction
    Web practical, science fact & fiction, some reality checks
Biological databases
    sequence, family, composite, etc.
Pattern recognition
    regular expressions, fingerprints, profiles, etc.
Building a search protocol
    a real example

Introduction
Single- & three-letter amino acid codes
    G  Glycine        Gly
    P  Proline        Pro
    A  Alanine        Ala
    V  Valine         Val
    L  Leucine        Leu
    I  Isoleucine     Ile
    M  Methionine     Met
    C  Cysteine       Cys
    F  Phenylalanine  Phe
    Y  Tyrosine       Tyr
    W  Tryptophan     Trp
    H  Histidine      His
    K  Lysine         Lys
    R  Arginine       Arg
    Q  Glutamine      Gln
    N  Asparagine     Asn
    E  Glutamic Acid  Glu
    D  Aspartic Acid  Asp
    S  Serine         Ser
    T  Threonine      Thr
Additional codes
    B  Asn/Asp
    Z  Gln/Glu
    X  Any amino acid

Basic definitions
Primary structure
    the linear sequence of amino acids in a protein

Secondary structure
    regions of local regularity
    i.e., α-helices, β-strands, β-sheets & β-turns

Definitions contd.
Super-secondary structure
    the packing of secondary structure elements into stable units
    e.g., β-barrels, βαβ units, Greek keys, etc.

Definitions contd.
Tertiary structure
    the overall chain fold that results from packing of secondary structure elements

Definitions contd.
Quaternary structure
    the arrangement of separate chains within a protein that has more than one subunit
    e.g., haemoglobin

Definitions contd.
Quinternary structure
    the arrangement of separate molecules, such as in protein-protein or protein-nucleic acid interactions

The practical - BioActivity
BioActivity: sequence analysis in action
    begin with a fragment of a DNA sequence
    try to find out what protein this codes for, the family to which it belongs, & whether its function & structure are known
The practical is entirely Web-based
    be mindful of traffic; don't waste time on slow links
Most important of all
    read the instructions!
The Web is constantly evolving
    please report dead links (otherwise they'll stay dead)!

Importance of sequence analysis
>900,000 sequences available in public dbs
    & millions more (including ESTs) in proprietary dbs
    these #s will snowball with completion of more genomes
    so what?
Locked up in sequences is a huge amount of structural, functional & evolutionary info
    they're a highly valuable resource
By contrast, the # of unique protein structures is ~2000
    a huge information deficit

The legacy of the genome projects
Sequence-structure deficit
[Figure: non-redundant growth of sequences during 1988-2002 & the corresponding growth in the number of structures; vertical axis ticks run from 100 to 800, horizontal axis from 1988 to 2002]

Challenges for bioinformatics
Spurred on by the seq/structure deficit, the challenges are to
    rationalise the mass of sequence data
    derive more efficient means of data storage
    design more incisive & reliable analysis tools
The imperative: to convert sequence information into biochemical & biophysical knowledge
    to decipher the structural, functional & evolutionary clues encoded in the language of biological sequences

The Holy Grail of bioinformatics...
    to be able to understand the words in a sequence sentence that form a particular protein structure

The reality of sequence analysis...
    isn't so glamorous...
    but means we can recognise words that form characteristic patterns, even if we don't know the precise syntax to build complete protein sentences

Pattern recognition & prediction
In investigating the meaning of sequences, two distinct analytical approaches have emerged
    pattern recognition is used to detect similarity between sequences & hence to infer related structures & functions
    ab initio prediction is used to deduce structure, & to infer function, directly from sequence
These methods are quite different!
    pattern recognition methods demand that some characteristic has been seen before & housed in a db
    prediction methods remove the need for template dbs, because deductions are made directly from sequence

Science fact & fiction
Sequence pattern recognition is easier to achieve, & is much more reliable, than fold recognition
    which is ~50% reliable even in expert hands
Prediction is still not possible & is unlikely to be so for decades to come (if ever)
Structural genomics will yield representative structures for many (but not all) proteins in future
    structures of new sequences will be determined by modelling
    prediction will become an academic exercise
But, to debunk a popular myth, knowing structure alone does not inherently tell us function

A reality check
What is the function of this structure?

What is the function of this sequence?

What is the function of this motif?
    the fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions
    knowing the fold & function allows us to rationalise how the structure effects its function at the molecular level

A test case for structural genomics
Structure-based assignment of the biochemical function of hypothetical protein mj0577 (Zarembinski et al., PNAS 95 1998)

Although the structure co-crystallised with ATP, the biochemical function of the protein is unknown

The Twilight Zone
Prediction methods don't work because we don't fully understand the Folding Problem
    we can't read the language sequences use to create their folds
But, with sequence analysis techniques, we can try to find similarities between new sequences & those in dbs
    whose structures & functions we hope have been elucidated
This is straightforward at high levels of identity, but below 50% it is difficult to establish relationships reliably
Analyses can be pursued with decreasing certainty towards the Twilight Zone
    ~20% identity, where results may look plausible to the eye, but are no longer statistically significant

Beyond the Twilight Zone
To penetrate deeper into the Twilight Zone is the aim of most analytical methods
    whether using single sequences, motifs, complex weighting schemes or raw amino acid frequencies
Each offers a different perspective, depending on the type of information used in the search
    none gives the right answer
It is good practice to devise an analysis protocol that uses a variety of methods
    but don't expect the impossible: no method is infallible!

Application areas of analysis tools
The scale indicates % identity between aligned sequences
Alignment of 2 random seqs can produce ~20% identity
    less than 20% does not constitute a significant alignment
    around this threshold is the Twilight Zone, where alignments may appear plausible to the eye, but can't be proved by conventional methods
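The % identity scale can be made concrete with a small sketch. The function below counts identical residues over aligned, non-gap column pairs; that choice of denominator is an assumption (several conventions exist), and the 5-point margin used to label the Twilight Zone around ~20% is purely illustrative.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of aligned (non-gap) column pairs that hold the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def identity_zone(identity: float, twilight: float = 20.0, margin: float = 5.0) -> str:
    """Rough label for a % identity value (the margin is an assumed, illustrative width)."""
    if identity < twilight - margin:
        return "below the level expected of random alignments"
    if identity <= twilight + margin:
        return "Twilight Zone"
    return "above the Twilight Zone"


if __name__ == "__main__":
    pid = percent_identity("GAVDFIALCDRYF", "GPIDFVCFCERFY")
    print(f"{pid:.1f}% identity: {identity_zone(pid)}")
```

A label of this kind says nothing about statistical or biological significance; it only places an alignment on the scale.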

Homology & analogy
The term homology is confounded & abused!
    sequences are homologous if they are related by divergence from a common ancestor
    analogy relates to the acquisition of common features from unrelated ancestors via convergent evolution
    e.g., β-barrels occur in soluble & membrane proteins; the enzymes chymotrypsin & subtilisin share groups of catalytic residues, with near-identical spatial geometries, but no other similarities
Homology is not a measure of similarity & is not quantifiable
    it is an absolute statement that sequences have a divergent rather than a convergent relationship
    the phrases "the level of homology is high" or "the sequences show 50% homology", or any like them, are strictly meaningless!
This is not just a semantic issue
    loose use muddies thinking about evolutionary relationships

A terminology muddle
The same arguments apply to 3D structures
    structures may be similar, as denoted by RMS positional deviation between compared atomic positions
    but their common evolutionary origin is a hypothesis
    the hypothesis may be correct or mistaken, but their similarity is a fact, no matter how it is interpreted
Similarity of sequence or structure is just that: similarity
Homology connotes a common evolutionary origin
Reeck, G.R., de Haen, C., Teller, D.C., Doolittle, R.F., Fitch, W.M., Dickerson, R.E., Chambon, P., McLachlan, A.D., Margoliash, E., Jukes, T.H. & Zuckerkandl, E. (1987) Homology in proteins and nucleic acids: a terminology muddle and a way out of it. Cell, 50, 667.

More challenges for sequence analysis
Much of the challenge is in getting the biology right
    this is complicated by the problem of orthology vs paralogy
Following a search, how much functional annotation can be legitimately inherited by a query?
    source of numerous annotation errors in dbs
    error propagation could lead to an error catastrophe
Further complications arise due to the modular nature of proteins
    modules are autonomous folding units (protein building blocks)
    they confer a variety of functions on a parent protein, by multiple combinations of the same module, or of different modules to form mosaics
Automatic analysis systems don't distinguish orthologues from paralogues & don't consider the modular nature of proteins

Monkeys are exploited in different Goldberg machines, where they perform different functions
    here, we could not predict a monkey sitting in that spot, even with total knowledge of the rest of the machine
Similarity searches are just like this
    identifying the presence of a module tells little of the function of the complete system
    knowing most components of a mosaic, we can't predict a missing one
    modules (monkeys) in different proteins don't always perform exactly the same function

The Midnight Zone
Notwithstanding the lessons of Goldberg machines, identifying evolutionary links between sequences is useful
    this often implies a shared function
In the genome era, prediction of function from sequence is of more immediate value than is the prediction of structure
However, between distantly-related proteins, structure is more conserved than the underlying sequences
    thus, some relationships are only apparent at the structural level
Such relationships can't be detected by even the most sensitive sequence comparison methods
    the region of identity where sequence comparisons fail completely to detect structural similarity is the Midnight Zone
    there is thus a theoretical limit to the effectiveness of sequence analysis methods

Significance
Appreciating that mathematical & biological significance are different is crucial
    it is especially important in understanding the limitations of search & alignment algorithms, pattern recognition techniques, functional site & structure prediction tools
Contrary to popular opinion, there is currently still
    no biologically-reliable automatic multiple alignment algorithm
    no infallible pattern-recognition technique
    no reliable gene, function or structure prediction algorithms

Computers don't do biology!

Biological Databases
Overview
Sequence repositories
    SWISS-PROT & TrEMBL
Composite sequence databases
    NRDB, SP+TrEMBL
Family (pattern) resources
    PROSITE, PRINTS, profiles, Pfam, Blocks, eMOTIF
Composite family databases
    InterPro

Primary sequence databases
In the early '80s, when sequence data started to accumulate, several labs saw advantages to establishing central repositories
    trouble is, many labs thought this was a good idea & made their own

    Nucleic    Protein
    EMBL       PIR
    GenBank    SWISS-PROT
    DDBJ       MIPS
               JIPID
               TrEMBL

The proliferation of dbs causes problems
    do they have the same format? Which is the most accurate? The most up-to-date? The most comprehensive? Which should we use?

SWISS-PROT
Endeavours to provide high-level annotation
    e.g., descriptions of the function of the protein, the organisation of its domains, PTMs, family & disease relationships, variants, etc.
Contains entries from >5,000 species
    the bulk of these from just a handful of model organisms
    H.sapiens, E.coli, M.musculus, D.melanogaster, S.cerevisiae, etc.
The quality of its annotations sets it apart from other dbs
Consequently, it cannot keep pace with the rate of data acquisition from the sequencing centres

TrEMBL
A computer-annotated supplement to SP
    has the SP format & contains translations of all CDSs in EMBL
It has 2 main sections
    SP-TrEMBL contains all entries that will eventually go into SP, but haven't yet been manually annotated
    REM-TrEMBL contains sequences not destined to be in SP
    Igs, fragments of

Composite sequence databases
A solution to the problem of proliferating dbs is to compile a composite
    these render searches very efficient, especially if non-redundant
Trouble is, there are now several composites, each with their own format & redundancy criteria
    the most commonly used are:

    NRDB               SP+TrEMBL
    PDB                SWISS-PROT
    SWISS-PROT         TrEMBL
    PIR
    GenPept
    GenPept updates

NRDB & SP+TrEMBL are non-identical, not non-redundant
    but which is best? Which the most comprehensive? The most up-to-date? Which should we use?

NRDB
NRDB is built locally at the NCBI
    it includes weekly updates of SP & daily updates of GenBank, so is up-to-date & comprehensive
But the simplistic manner of its construction causes problems
    multiple copies of the same protein are retained as a result of polymorphisms &/or sequencing errors
    errors corrected in SP are reintroduced when retranslated from DNA
    numerous sequences are duplicates of existing fragments
The contents of the db are thus error-prone & redundant
NRDB is the default db of the NCBI BLAST service

SP+TrEMBL
This resource is intended to be both comprehensive & minimally redundant
It contains fewer errors than NRDB, but is not truly non-redundant
    ~30% of the combined total of SP & TrEMBL is non-unique
Further reduction of error rates requires more manual intervention & better expert db management systems
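The difference between "non-identical" and "non-redundant" can be illustrated with a toy sketch. It merges sources by exact sequence string only, which is an assumed simplification of how a composite might be built, not a description of how NRDB or SP+TrEMBL are actually compiled; all identifiers and sequences are hypothetical.

```python
def merge_non_identical(*databases):
    """Merge {id: sequence} dicts, keeping one entry per exact sequence string."""
    first_id_for = {}
    for db in databases:
        for entry_id, seq in db.items():
            first_id_for.setdefault(seq, entry_id)  # the first id seen wins
    return {entry_id: seq for seq, entry_id in first_id_for.items()}


def contained_fragments(db):
    """Ids of entries whose sequence is an exact substring of a longer entry."""
    return sorted(
        entry_id
        for entry_id, seq in db.items()
        if any(seq != other and seq in other for other in db.values())
    )


if __name__ == "__main__":
    curated = {"P1": "MKTAYIAKQR"}                       # hypothetical curated entry
    translated = {
        "G1": "MKTAYIAKQR",   # exact duplicate: removed by the merge
        "G2": "MKTAYIAKQK",   # one-residue variant (polymorphism or error): retained
        "G3": "TAYIAK",       # fragment of P1: retained
    }
    merged = merge_non_identical(curated, translated)
    print(sorted(merged))               # ['G2', 'G3', 'P1']
    print(contained_fragments(merged))  # ['G3']
```

Exact-match merging removes only the verbatim copy; the variant and the fragment survive, which is why such a composite is non-identical rather than non-redundant.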

Family (pattern) databases
As well as 1° resources, there are also many family or pattern dbs derived from them
    trouble is, they use different 1° sources & different analysis methods, & all have different formats!
But it isn't all bad
    SWISS-PROT is emerging as a standard, & most pattern dbs use it as their basis

    Pattern db   Source               Stored information
    PROSITE      SWISS-PROT           Regular expressions (patterns)
    PRINTS       SWISS-PROT/TrEMBL    Aligned motifs (fingerprints)
    Pfam         SWISS-PROT/TrEMBL    Hidden Markov Models (HMMs)
    Profiles     SWISS-PROT           Weight matrices (profiles)
    Blocks       InterPro/PRINTS      Weighted motifs (blocks)
    eMOTIF       Blocks/PRINTS        Permissive regular expressions

Why create pattern databases?
Pattern dbs arise from the need to make more specific functional diagnoses than are possible simply by searching the 1° dbs
They are built on the principle that homologous sequences may be gathered together in multiple alignments, within which are regions (motifs) that show little variation
    these motifs usually reflect some vital biological role in terms of either structure or function
Motifs are exploited in different ways to build diagnostic patterns for protein families
    new sequences can be searched against dbs of such patterns to see if they can be assigned to known families
    hence they offer a fast track to the inference of function

What's in a sequence?

Methods for family analysis
Single motif methods
    Fuzzy regex (eMOTIF)
    Exact regex (PROSITE)
Multiple motif methods
    Identity matrices (PRINTS)
    Weight matrices (Blocks)
Full domain alignment methods
    Profiles (PROFILE LIBRARY)
    HMMs (Pfam)

The challenge of family analysis
    highly divergent family with single function?
    superfamily with many diverse functional families?
    must distinguish if function analysis done in silico
    a tough challenge!

Know your family

The problem with domains

PROSITE
The first pattern db
    based on the idea that a protein family can be characterised by a pattern of conserved residues within a single motif
Sequence information in motifs is reduced to consensus or regular expressions (regexs) & the seed regex is used to search SP
    results are inspected manually to achieve optimal results
Some families can't be characterised by single motifs
    here, additional regexs are created until an optimal set is achieved that captures most or all of the family
    results are then manually annotated for inclusion in the db

R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)
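A pattern written in this notation can be turned into an ordinary regular expression mechanically. The sketch below is a minimal converter for the common elements of the notation (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, < and > for sequence ends); the function name and the test sequence are hypothetical, and it is not the code used by PROSITE itself.

```python
import re

# One element: optional '<', a residue / [allowed] / {forbidden} / x,
# an optional repeat such as (3) or (2,4), and an optional '>'.
_ELEMENT = re.compile(
    r"(<?)(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unrecognised pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core in ("x", "X"):
            body = "."                        # any residue
        elif core.startswith("{"):
            body = "[^" + core[1:-1] + "]"    # forbidden residues
        else:
            body = core.upper()               # single residue or [allowed set]
        if low:
            body += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + body + ("$" if end else ""))
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)")
    print(regex)  # RY.[DT]W.[LIVM][ST]TP[LIVM]{3}
    # Hypothetical test sequence, invented for illustration only.
    print(bool(re.search(regex, "AARYADWALSTPLIVGG")))
```

Matching with the resulting regex is all-or-nothing: a sequence differing at a single constrained position is simply not reported.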

PRINTS
Most protein families are characterised by >1 motif
    it is sensible to use many/all of them to build a diagnostic signature
This is the principle of fingerprints
    these offer improved diagnostic reliability by virtue of the biological context provided by motif neighbours
Motifs are excised from alignments by hand
    residue information is augmented via iterative searches
    results are manually annotated prior to inclusion in the db

Motif context
    order
    interval

SUMMARY INFORMATION
    37 codes involving 8 elements
     0 codes involving 7 elements
     0 codes involving 6 elements
     0 codes involving 5 elements
     0 codes involving 4 elements
     1 codes involving 3 elements
     0 codes involving 2 elements

COMPOSITE FINGERPRINT INDEX
    8|  37  37  37  37  37  37  37  37
    7|   0   0   0   0   0   0   0   0
    6|   0   0   0   0   0   0   0   0
    5|   0   0   0   0   0   0   0   0
    4|   0   0   0   0   0   0   0   0
    3|   1   0   0   0   1   1   0   0
    2|   0   0   0   0   0   0   0   0
    -+--------------------------------
     |   1   2   3   4   5   6   7   8

True positives..
    PRIO_COLGU  PRIO_MACFA  PRIO_CEREL  PRIO_ODOHE
    PRIO_GORGO  PRIO_PANTR  PRIO_HUMAN  O46648
    PRIO_SHEEP  PRIO_CALJA  PRIO_BOVIN  PRP2_BOVIN
    PRIO_ATEPA  PRIO_SAISC  PRIO_PREFR  PRIO_PONPY
    O75942      PRIO_CAPHI  PRIO_CEBAP  PRIO_CAMDR
    PRIO_FELCA  PRP1_TRAST  PRIO_RABIT  PRP2_TRAST
    PRIO_PIG    PRIO_CANFA  PRIO_CRIGR  PRIO_CRIMI
    Q15216      PRIO_RAT    PRIO_CERAE  PRIO_MUSPF
    PRIO_MUSVI  PRIO_MESAU  PRIO_MOUSE  O46593
    PRIO_TRIVU

Subfamily: Codes involving 3 elements
Subfamily True positives..
    PRIO_CHICK

Profiles & Pfam
An alternative to motif-based methods exploits regions between motifs, which also contain valuable information
    the full alignment effectively becomes the discriminator
A complex scoring scheme allowing for substitutions & INDELs is used to create family-specific profiles
These profiles can be used to detect distant relationships, where only few residues are conserved
    this is the basis of the Profile library
In an extension of this approach, alignments are encoded as probabilistic models termed HMMs
    this is the basis of Pfam

Blocks & eMOTIF
There are various advantages to storing motifs in a raw form
    no information is lost, & different scoring schemes may be used to confer different diagnostic potentials on the same data
Additional dbs have arisen in this way
    Blocks uses families identified in InterPro, aligns the sequences & detects motifs automatically
    BLOCKS-format PRINTS uses motifs in PRINTS with the Blocks scoring scheme
    eMOTIF creates permissive regexs from Blocks & PRINTS
These dbs are derived fully automatically & hence offer
    no family annotation (they link back to InterPro & PRINTS)
    no further family coverage

Composite pattern databases
To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource: InterPro
    release 4.0 contains 4691 entries
    a central annotation resource, with pointers to its satellite dbs
    initial partners were PRINTS, PROSITE, profiles & Pfam
    new partners include ProDom, TIGRfam, SMART & hopefully others (e.g., Blocks, MetaFam)
    lags behind its sources
    major role in fly & human genome annotation

Pattern Recognition
Overview
Determining significance of db matches
Pattern recognition methods
    regular expression patterns & rules
    fingerprints & blocks
    profiles & HMMs
Current status of pattern dbs

Pattern recognition methods
These methods classify proteins into families
    the basis of the methods is multiple sequence alignment
They depend on developing representations of conserved elements of alignments that may be diagnostic of structure or function, whether from
    homologous sequence families
    sequences that share some structural/functional domains

Single motif methods
    Fuzzy regex (eMOTIF)
    Exact regex (PROSITE)
Multiple motif methods
    Identity matrices (PRINTS)
    Weight matrices (Blocks)
Full domain alignment methods
    Profiles (PROFILE LIBRARY)
    HMMs (Pfam)

Determining significance of database matches
When searching a db, the challenge for analysis methods is to determine if matches are related (true-positive) or unrelated (true-negative)
At a given scoring threshold, it is likely that unrelated sequences will be matched erroneously (false-positives) & some correct matches will be missed (false-negatives)
The aim is to improve the resolution between the curves
    in the overlap, it is difficult or impossible to establish if matches are significant
Different methods tackle this problem in different ways

Resolving true & false matches
[Figure: number of matches (N) plotted against score; one curve is labelled "True negative"]
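The effect of moving a scoring threshold through the overlap between the two curves can be sketched as below. The scores and the related/unrelated labels are invented for illustration; in a real search the labels are exactly what is unknown.

```python
def classify_at_threshold(scores, is_related, threshold):
    """Count true/false positives and negatives when hits are scores >= threshold."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for score, related in zip(scores, is_related):
        reported = score >= threshold
        if reported and related:
            counts["TP"] += 1   # related sequence, correctly matched
        elif reported and not related:
            counts["FP"] += 1   # unrelated sequence, matched erroneously
        elif related:
            counts["FN"] += 1   # related sequence, missed
        else:
            counts["TN"] += 1   # unrelated sequence, correctly rejected
    return counts


if __name__ == "__main__":
    # Hypothetical search results: the two groups overlap between scores 50 and 55.
    scores = [95, 80, 62, 55, 50, 41, 30, 12]
    related = [True, True, True, False, True, False, False, False]
    for threshold in (60, 52, 45):
        print(threshold, classify_at_threshold(scores, related, threshold))
```

No threshold in this toy set removes both kinds of error at once: raising it trades false positives for false negatives, which is the overlap problem in miniature.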

Regular expressions (patterns)
These are derived from single conserved regions in alignments
    they are minimal expressions, so sequence information is lost
    the more divergent the sequences used, the more fuzzy & poorly discriminating the regex becomes

    Alignment        Regex
    GAVDFIALCDRYF
    GPIDFVCFCERFY    G-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2
    GRVEFLNRCDRYY

Regexs do not tolerate similarity
    sequences either match or not, regardless of how similar they are
    matching is a binary on-off event & frequently misses true matches
    single-motif methods are very hit-or-miss: how do you know if you've encoded the best region?

In the beginning was PROSITE
G_PROTEIN_RECEPTOR; PATTERN
PS00237; G-protein coupled receptor signature
[GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-X(2)-[LIVMNQGA]-X(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R
/TOTAL=1121(1121); /POS=1057(1057); /FALSE_POS=64(64);
/FALSE_NEG=112; /PARTIAL=48; UNKNOWN=0(0)

This represents an apparent 20% error rate
    the actual rate is probably higher
Thus, a match to a pattern is not necessarily true & a mis-match is not necessarily false!
False-negatives are a fundamental limitation to this type of pattern matching
    if you don't know what you're looking for, you'll never know you missed it!
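The slide does not say how the ~20% figure was obtained. One reading that reproduces it, offered here as an assumption, is to count false positives, false negatives & partial matches together as errors and divide by the total number of hits reported for the pattern.

```python
def apparent_error_rate(total, false_pos, false_neg, partial):
    """Assumed definition: (false positives + false negatives + partials) / total hits."""
    return (false_pos + false_neg + partial) / total


if __name__ == "__main__":
    # Figures quoted in the PS00237 entry above.
    rate = apparent_error_rate(total=1121, false_pos=64, false_neg=112, partial=48)
    print(f"{100 * rate:.1f}%")
```

Under this reading, (64 + 112 + 48) / 1121 is a shade under 20%; other definitions of the denominator would give slightly different values.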

Regular expressions (rules)
Regex patterns are most effective when applied to highly-conserved, family-specific motifs
It is often possible to identify shorter generic patterns within sequences, characteristic of common functional sites

    Functional site                      Rule
    N-glycosylation                      N-{P}-[ST]-{P}
    Protein kinase C phosphorylation     [ST]-X-[RK]
    Casein kinase II phosphorylation     [ST]-X2-[DE]

Such features result from convergence to a common property
    glycosylation sites, phosphorylation sites, etc.
They cannot be used for family diagnosis & don't discriminate
    they can only be used to suggest whether a certain functional site might exist (which must then be tested by experiment)
    such patterns are normally termed rules
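Because rules are so short, scanning a sequence for them is simple, and so is seeing why they match so often. The sketch below hand-translates the three rules in the table into Python regexs and reports every (possibly overlapping) site; the test sequence is hypothetical.

```python
import re

# The three rules from the table, hand-translated to Python regex syntax.
RULES = {
    "N-glycosylation": r"N[^P][ST][^P]",
    "Protein kinase C phosphorylation": r"[ST].[RK]",
    "Casein kinase II phosphorylation": r"[ST].{2}[DE]",
}


def find_sites(sequence, rules=RULES):
    """Return (rule name, 1-based start, matched residues) for every site found."""
    hits = []
    for name, regex in rules.items():
        # A lookahead lets overlapping sites be reported as well.
        for m in re.finditer(f"(?=({regex}))", sequence):
            hits.append((name, m.start() + 1, m.group(1)))
    return hits


if __name__ == "__main__":
    for hit in find_sites("MNGTASRKSVPDE"):  # hypothetical 13-residue sequence
        print(hit)
```

Three candidate sites in an arbitrary 13-residue string shows how readily rules fire by chance; each hit is only a suggestion to be tested by experiment.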

Residue groups for fuzzy regexs
It is possible to assign residues to groups based on various biochemical properties
    e.g., charge & size
    using such groups theoretically ensures that resulting regexs have sensible biochemical interpretations

    small             Ala, Gly
    small hydroxyl    Ser, Thr
    basic             His, Lys, Arg
    aromatic          Phe, Tyr, Trp
    aliphatic         Val, Leu, Ile, Met
    acidic/amide      Asp, Glu, Asn, Gln
    small/polar       Ala, Gly, Ser, Thr, Pro

This is more flexible than exact regex matching
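As a sketch of how residue groups might drive regex construction, the code below derives a pattern from an ungapped alignment, column by column: a conserved column gives a single residue, a column whose residues all fall within one of the groups above gives a bracketed set, and anything else becomes x. This rule is an assumption made for illustration (it is not stated to be eMOTIF's algorithm), and only the observed residues are written into each bracket; a more permissive variant would write the whole group.

```python
from itertools import groupby

# Residue groups from the table above, in single-letter code.
GROUPS = [
    set("AG"),     # small
    set("ST"),     # small hydroxyl
    set("HKR"),    # basic
    set("FYW"),    # aromatic
    set("VLIM"),   # aliphatic
    set("DENQ"),   # acidic/amide
    set("AGSTP"),  # small/polar
]


def column_element(column):
    """Pattern element for one alignment column."""
    residues = set(column)
    if len(residues) == 1:
        return column[0]                            # fully conserved
    if any(residues <= group for group in GROUPS):
        return "[" + "".join(sorted(residues)) + "]"
    return "x"                                      # no shared property


def alignment_to_pattern(alignment):
    """Derive a PROSITE-style pattern from equal-length, ungapped sequences."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("sequences must all have the same length")
    elements = [column_element(col) for col in zip(*alignment)]
    parts = []
    for element, run in groupby(elements):          # collapse runs, e.g. x-x -> x(2)
        n = len(list(run))
        parts.append(element if n == 1 else f"{element}({n})")
    return "-".join(parts)


if __name__ == "__main__":
    print(alignment_to_pattern(["GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY"]))
```

Applied to the three-sequence alignment shown under "Regular expressions (patterns)", this yields a pattern equivalent to the one on that slide, differing only in notation (x(2) for X2) and in the order of residues inside brackets.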

Diagnostic limitations
Consider the sequence motif Asp-Ala-Val-Ile-Asp (DAVID)
    results of searching for such a motif will differ, depending on the db, the motif length & whether we use exact or permissive fuzzy regexs

    Pattern                           Matches
    D-A-V-I-D                         71 (99)
    D-A-V-I-[DEQN]                    252
    [DEQN]-A-V-I-[DEQN]               925
    [DEQN]-A-[VLI]-I-[DEQN]           2,739
    [DEQN]-[AG]-[VLI]-[VLI]-[DEQN]    51,506
    D-A-V-E                           1,088 (1,493)
    (number of matches in OWL29.6 (& OWL31.1))

Use of fuzzy regexs has the potential advantage of being able to recognise more distant relationships
    & the inherent disadvantage that more matches will be made by chance, making it difficult to separate true matches from noise
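A back-of-envelope sketch shows why permissive patterns attract chance matches. Assuming, unrealistically, that all 20 residues are equally likely and positions are independent, the chance of a random window matching a pattern is proportional to the product of the number of residues allowed at each position.

```python
from math import prod


def allowed_sizes(pattern, alphabet_size=20):
    """Number of residues permitted at each position of a simple pattern."""
    sizes = []
    for element in pattern.split("-"):
        if element in ("x", "X"):
            sizes.append(alphabet_size)
        elif element.startswith("["):
            sizes.append(len(element.strip("[]")))
        else:
            sizes.append(1)
    return sizes


def relative_chance(pattern, reference):
    """How many times more likely a chance match is, relative to a reference pattern."""
    return prod(allowed_sizes(pattern)) / prod(allowed_sizes(reference))


if __name__ == "__main__":
    exact = "D-A-V-I-D"
    for fuzzy in (
        "D-A-V-I-[DEQN]",
        "[DEQN]-A-V-I-[DEQN]",
        "[DEQN]-A-[VLI]-I-[DEQN]",
        "[DEQN]-[AG]-[VLI]-[VLI]-[DEQN]",
    ):
        print(fuzzy, relative_chance(fuzzy, exact))
```

The observed counts in the table need not follow these ratios, since real databases are neither uniform in composition nor random, but the multiplicative growth is the same effect.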

Fingerprints
Fingerprints are groups of conserved (ungapped) motifs excised from alignments & used for iterative db searching
    no weighting scheme is used
    searches depend only on residue frequencies
    resulting scoring matrices are thus sparse
Each motif trawls the db independently
    search results are correlated to determine which sequences match all the motifs & which match only partially
    no information is thrown away
The iterative process refines the fingerprint & increases its power
    potency is gained from the mutual context of motif neighbours
    results are biologically more meaningful than those from single motifs

[Figure labels: TM domain, loop region]

A fingerprinting overview
[Figure labels: PRINTS, annotation]

How fingerprints are stored

(a)
    YVTVQHKKLRTPL
    YVTVQHKKLRTPL
    YVTVQHKKLRTPL
    AATMKFKKLRHPL
    AATMKFKKLRHPL
    YIFATTKSLRTPA
    VATLRYKKLRQPL
    YIFGGTKSLRTPA
    WVFSAAKSLRTPS
    WIFSTSKSLRTPS
    YLFSKTKSLQTPA
    YLFTKTKSLQTPA

Key:
    (a) motif, with 3 conserved positions
    (b) corresponding frequency matrix
    (c) same matrix, but after 3 iterations
    (d) same matrix, with PAM250 weighting

(b)
      T   C   A   G   N   S   P   F   L   Y   H   Q   V   K   D   E   I   W   R   M   B   X   Z
      0   0   2   0   0   0   0   0   0   7   0   0   1   0   0   0   0   2   0   0   0   0   0
      0   0   3   0   0   0   0   0   2   0   0   0   4   0   0   0   3   0   0   0   0   0   0
      6   0   0   0   0   0   0   6   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
      1   0   1   1   0   3   0   0   1   0   0   0   3   0   0   0   0   0   0   2   0   0   0
      2   0   1   1   0   0   0   0   0   0   0   3   0   4   0   0   0   0   1   0   0   0   0
      4   0   1   0   0   1   0   2   0   1   3   0   0   0   0   0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0   0   0   0   0   0  12   0   0   0   0   0   0   0   0   0
      0   0   0   0   0   6   0   0   0   0   0   0   0   6   0   0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0  12   0   0   0   0   0   0   0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0   0   0   0   2   0   0   0   0   0   0  10   0   0   0   0
      9   0   0   0   0   0   0   0   0   0   2   1   0   0   0   0   0   0   0   0   0   0   0
      0   0   0   0   0   0  12   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
      0   0   4   0   0   2   0   0   6   0   0   0   0   0   0   0   0   0   0   0   0   0   0

(c)
      T   C   A   G   N   S   P   F   L   Y   H   Q   V   K   D   E   I   W   R   M   B   X   Z
      0   0   4   0   0   0   0   8   4  34   0   0  15   0   0   0   1   7   0   0   0   0   0
      0   4  15   0   0   0   0   0   7   0   0   0  37   0   0   0  10   0   0   0   0   0   0
     50   0   0   0   0   3   0  18   0   0   0   0   0   0   0   0   0   0   0   2   0   0   0
      3   0  12   2   1   8   0   3   6   0   0   0  14   0   0   0  15   2   0   7   0   0   0
      9   2   2   2   1   1   0   0   0   0   1  25   0  20   0   6   0   0   4   0   0   0   0
     14   0   2   0   0   4   0  14   0   8  31   0   0   0   0   0   0   0   0   0   0   0   0
      0   0   1   0   0   0   0   0   0   0   0   0   0  70   0   0   0   0   2   0   0   0   0
      0   0   2   1   0  17   0   0   0   0   0   0   0  52   0   0   0   0   1   0   0   0   0
      0   0   0   0   0   0   0   0  73   0   0   0   0   0   0   0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0   0   0   0   5   0   0   0   0   0   0  68   0   0   0   0
     44   0   0   0   0   6   0   0   0   0  12  11   0   0   0   0   0   0   0   0   0   0   0
      0   0   1   0   0   0  69   0   0   0   3   0   0   0   0   0   0   0   0   0   0   0   0
      2   0  11   0   0   7   0   0  53   0   0   0   0   0   0   0   0   0   0   0   0   0   0

(d)
      T   C   A   G   N   S   P   F   L   Y   H   Q   V   K   D   E   I   W   R   M
    -29 -22 -29 -48 -24 -24 -46  40 -13  62 -10 -40 -22 -38 -44 -44 -15  16 -30 -22
     -1 -32  -1 -18 -20 -10 -13  -9  20 -22 -21 -18  32 -23 -22 -20  32 -61 -26  19
      0 -36 -18 -30 -24 -12 -30  36   0  24 -18 -36  -6 -30 -36 -30   6 -30 -30  -6
      3 -29   3  -4 -10  -1  -7 -22   3 -31 -19 -15  14 -12 -15 -13  11 -52 -15  11
      3 -48  -1  -8   7   1  -4 -54 -31 -46   6  14 -17  23   6   5 -20 -48  14  -9
      2 -27  -7 -19  -3  -5 -13   0 -16   6   8 -10 -11 -15 -13 -11  -7 -37 -12 -15
      0 -60 -12 -24  12   0 -12 -60 -36 -48   0  12 -24  60   0   0 -24 -36  36   0
      6 -30   0  -6  12  12   0 -48 -36 -42  -6   0 -18  30   0   0 -18 -30  18 -12
    -24 -72 -24 -48 -36 -36 -36  24  72 -12 -24 -24  24 -36 -48 -36  24 -24 -36  48
    -12 -50 -20 -32   2  -2   0 -50 -34 -48  26  18 -24  32  -6  -6 -24  10  62  -2
     24 -29   7  -5   5   6   0 -36 -24 -31   6   1  -6   1   4   4  -6 -56  -4 -14
      0 -36  12 -12 -12  12  72 -60 -36 -60   0   0 -12 -12 -12 -12 -24 -72   0 -24
     -6 -44  -2 -18 -16 -10 -12 -10  22 -24 -18 -14  10 -22 -24 -18   6 -40 -26  16
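The unweighted frequency matrix (b) can be recomputed directly from the motif in (a). The sketch below counts residues column by column and stores only the residues that occur, which is one simple way to hold a sparse matrix; it is an illustration, not the PRINTS storage format.

```python
from collections import Counter

# The 12 aligned, ungapped motif sequences shown in (a).
MOTIF = [
    "YVTVQHKKLRTPL", "YVTVQHKKLRTPL", "YVTVQHKKLRTPL",
    "AATMKFKKLRHPL", "AATMKFKKLRHPL", "YIFATTKSLRTPA",
    "VATLRYKKLRQPL", "YIFGGTKSLRTPA", "WVFSAAKSLRTPS",
    "WIFSTSKSLRTPS", "YLFSKTKSLQTPA", "YLFTKTKSLQTPA",
]


def frequency_matrix(motif):
    """One Counter of residue counts per motif position (unobserved residues are absent)."""
    if len({len(seq) for seq in motif}) != 1:
        raise ValueError("motif sequences must all have the same length")
    return [Counter(column) for column in zip(*motif)]


def conserved_positions(matrix, depth):
    """1-based positions at which a single residue occurs in every sequence."""
    return [i + 1 for i, counts in enumerate(matrix) if max(counts.values()) == depth]


if __name__ == "__main__":
    matrix = frequency_matrix(MOTIF)
    print(dict(matrix[0]))                          # counts at position 1
    print(conserved_positions(matrix, len(MOTIF)))  # the 3 conserved positions
```

Position 1 gives Y 7, A 2, W 2 & V 1, matching the first row of (b), and the three fully conserved positions are the K, L & P columns.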

Fingerprint visualisation
The full potency of fingerprinting is gained from the mutual context provided by motif neighbours
This is important, as the method inherently implies a biological context to motifs that are matched in the correct order in the query sequences, with appropriate distances between them
This allows sequence identification even when parts of the fingerprint are absent
    e.g., a sequence that matches only 4 of 7 motifs may still be diagnosed as a true match if the pattern of motif matching is consistent with that expected of true neighbouring motifs
Such matches are best visualised graphically

Visualising fingerprints
[Figure labels: N, C, Missing motif?, %ID, Query sequence, PRINTS]

Blocks
Blocks are groups of motifs derived automatically from families identified in InterPro
    sequences are aligned automatically & motifs are automatically identified by searching for spaced residue triplets (e.g., AxxxVxxC)
    a block score is calculated using the BLOSUM62 matrix
    validity of blocks is confirmed with a 2nd motif-finding algorithm
    blocks found by both methods are considered reliable
Sequences within motifs are clustered to reduce contributions to residue frequencies from sets of closely-related sequences
    each cluster is treated as a single sequence & given a score that gives a measure of its relatedness
    the higher the weight, the more dissimilar the segment from others in the block, the most distant being given a score of 100

[Figure labels: segments, CSC triplet]

Profiles
Profiles are scoring tables derived from full domain alignments
    these define which residues are allowed at given positions
    which positions are conserved & which degenerate
    which positions, or regions, can tolerate insertions
    the scoring system is intricate, & may include evolutionary weights, results from structural studies, & data implicit in the alignment
    variable penalties are specified to weight against INDELs occurring in core 2° structure elements
Within a profile, the I & M fields contain position-specific scores for insert & match positions
    in conserved regions, INDELs aren't totally forbidden, but are strongly impeded by large penalties defined in the DEFAULT field
    these are superseded by more permissive values in gapped regions
    the inherent complexity of profiles renders them highly potent discriminators, but they are time-consuming to derive

Hidden Markov Models
HMMs are similar in concept to profiles by virtue of encoding full domain alignments
    they are probabilistic models consisting of a number of inter-connecting states
    essentially, linear chains of match, delete or insert states
Match states are assigned to conserved columns in an alignment
    insert states allow for insertions relative to match states
    delete states allow match positions to be skipped
    thus, building an HMM from an alignment requires each position to be assigned either to match, delete or insert states
HMMs usually perform well, but can be over-trained
    they may also suffer if they are created from an iterative automatic alignment process
    if this once accepts a false match, the HMM will become corrupt

An HMM
[Figure labels: CLWDCLYE]
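The first step in building a profile HMM, deciding which alignment columns become match states, can be sketched as follows. The rule used here (a column is a match column if fewer than half of its sequences have a gap) is a common heuristic assumed for illustration, and the toy alignment is hypothetical; estimating the model's probabilities is not shown.

```python
def match_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """True for columns treated as match states, False for insert columns."""
    n_seqs = len(alignment)
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")
    return [
        sum(seq[i] == gap for seq in alignment) / n_seqs < max_gap_fraction
        for i in range(width)
    ]


def state_path(aligned_seq, is_match, gap="-"):
    """State path of one aligned sequence: M (match), D (delete) or I (insert)."""
    path = []
    for residue, match in zip(aligned_seq, is_match):
        if match:
            path.append("D" if residue == gap else "M")  # a gap in a match column is a delete
        elif residue != gap:
            path.append("I")                             # a residue in a gappy column is an insert
    return "".join(path)


if __name__ == "__main__":
    toy = ["CL-WD", "CL-YE", "CLAYE", "C--WD"]  # hypothetical alignment
    columns = match_columns(toy)
    print(columns)
    for seq in toy:
        print(seq, state_path(seq, columns))
```

Counting how often each state and each residue is used along such paths is what would then feed the transition and emission estimates.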

Which craft is best?
The wide variety of methods available leads to familiar problems
    which should we use? which is the most reliable? which is the most comprehensive? ...etc.
None of the pattern-recognition techniques is infallible, & none of the resulting pattern dbs is complete
    bearing in mind the diagnostic strengths & weaknesses of the different approaches, & always keeping biological significance in mind, the best strategy is simply to use them all

Overview of resources
PROSITE (SIB) - 1108 entries
    single motifs (regexs) - best with small highly conserved sites
Profile library (ISREC) - 300 entries
    weight matrices - good with divergent domains & superfamilies
PRINTS (Manchester) - 1750 entries
    multiple motifs (fingerprints) - best for families and sub-families
Pfam (Sanger Centre) - 3071 entries
    HMMs - good with divergent domains & superfamilies
InterPro (EBI) - 4691 entries
    derived from PRINTS, PROSITE, Profiles, Pfam, ProDom, etc.
Blocks (FHCRC) - 2608 entries
    multiple motifs (derived from InterPro & PRINTS)
eMOTIF (Stanford)
    permissive regexs (derived from PRINTS & BLOCKS)

Building a Search Protocol
Overview
The usual starting point
    searching the primary data sources
Pattern recognition methods
    searching the secondary sources
Structural & functional interpretation of results
Estimating significance
    when do we believe a result?

A practical approach
Given a newly-determined sequence, we want to know
    what is my protein? to what family does it belong?
    what is its function? & how can we explain its function in structural terms?
To this end, by searching pattern dbs & fold libraries, we may recognise patterns that allow us to infer relationships with previously-characterised families/folds
Given the variety of dbs to search, how do we use them to build a sensible search protocol for novel sequences?

Protein sequence database identity search
    e.g., for short fragments, pinpoints identical matches to probe - may identify correct reading frame

Protein sequence database similarity search
    e.g., nrdb, SP+SPTrEMBL - identifies potential homologues to probe

Protein pattern database search
    e.g., PROSITE, profiles, PRINTS, Blocks, Pfam - identifies family relationships or pinpoints key structural or functional sites

Known structure
    Structure classification database query
    e.g., scop, CATH, FSSP - provides details of structural class, secondary structure information, ligand-binding, etc.
No known structure
    Protein fold pattern library search
    e.g., threading - identifies compatible folds for the probe sequence

Searching the primary databases
Identity searching
    the fastest test of an unknown fragment is to perform an identity search. This will reveal in seconds whether an exact match to the unknown peptide already exists
This can be helpful in identifying the correct reading frame following a 6-frame translation

    ccgtactacaactacgctggtgcattcaag
    Forward 0    PYYNYAGAFK
    Forward 1    RTTTTLVHS
    Forward 2    VLQLRWCIQ
    Reverse 0    LECTSVVVVR
    Reverse 1    LNAPA!L!Y
    Reverse 2    !MHQRSCST

    TRFE_XENLA  207  AGIKEHKCSRSNNE PYYNYAGAFK CLQDDQGDVAFVKQ
    XLTRSFER    207  AGIKEHKCSRSNNE PYYNYAGAFK CLQDDQGDVAFVKQ

    TRFE_XENLA  TRANSFERRIN PRECURSOR - XENOPUS LAEVIS
    XLTRSFER    TRANSFERRIN PRECURSOR - XENOPUS LAEVIS
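A 6-frame translation is easy to reproduce. The sketch below uses the standard genetic code and writes stop codons as "!", following the notation in the example above; the function names are illustrative.

```python
BASES = "tcag"
# Standard genetic code in t, c, a, g order; "!" marks a stop codon.
AMINO = "FFLLSSSSYY!!CC!WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Reverse complement of a DNA string (a, c, g, t only)."""
    return dna.lower().translate(str.maketrans("acgt", "tgca"))[::-1]


def translate(dna, frame):
    """Translate from offset `frame` (0, 1 or 2); unrecognised codons become X."""
    dna = dna.lower()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(frame, len(dna) - 2, 3)
    )


def six_frame(dna):
    """All three forward and three reverse-strand translations."""
    reverse = reverse_complement(dna)
    frames = {}
    for frame in range(3):
        frames[f"Forward {frame}"] = translate(dna, frame)
        frames[f"Reverse {frame}"] = translate(reverse, frame)
    return frames


if __name__ == "__main__":
    for name, peptide in sorted(six_frame("ccgtactacaactacgctggtgcattcaag").items()):
        print(name, peptide)
```

Only the Forward 0 peptide (PYYNYAGAFK) finds an exact match in the example, which is how an identity search points to the correct reading frame; frames interrupted by stops are poor candidates.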

Similarity searching
Whether or not an identity search finds a match, the next step is to look for similar sequences
  e.g., you may wish to know if a wider family exists
The most rapid & simple option is to use BLAST, & flavours of it, or FASTA
Several features are worthy of note in BLAST output
  look for high scores with low P-values (unlikely to be random)
  look for clusters of high scores at the top of the hitlist (a family?)
  look for trends in the type of sequences matched

Ideal results show high scores & low E-values
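P-values and E-values are two views of the same statistic. A minimal sketch, assuming the usual Poisson relation used in BLAST statistics, P = 1 - exp(-E); the function name `e_to_p` is illustrative only.

```python
import math


def e_to_p(e_value):
    """Probability of at least one chance hit, given e_value expected chance hits.

    Assumes chance hits follow a Poisson distribution, so P = 1 - exp(-E).
    expm1 keeps precision when E is very small.
    """
    return -math.expm1(-e_value)


for e_value in (10.0, 1.0, 0.01, 1e-10):
    print(e_value, e_to_p(e_value))
```

For small values the two are nearly equal, so a hit with a very low E-value also has a very low P-value; they diverge only for weak hits, where E can exceed 1 but P cannot.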

Why bother with pattern searches?
Primary searches won't always allow outright diagnosis
  BLAST & FASTA are not infallible
  BLAST, in particular, often can't assign significant scores
  results may be complicated by the presence of modules, or compositionally-biased regions
  annotations of retrieved hits may be incorrect
Pattern dbs contain potent descriptors
  so, distant relationships missed by BLAST may be captured by one or more of the family or functional site distillations

Searching the pattern databases
Searching PROSITE
  when using PROSITE's Web form, it is advisable to exclude rules from the search, otherwise output is filled with spurious matches
  results are either match, or no match
  the user has to judge whether hits are significant
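The match/no-match behaviour follows from PROSITE patterns being regular expressions. A minimal sketch of converting a pattern into a Python regular expression: the converter is hypothetical and covers only the common syntax (x, [..], {..}, repeat counts, and < / > anchors outside brackets). The pattern used is the short N-glycosylation site pattern, and the test sequence is made up for illustration.

```python
import re

# One pattern element: optional "<", a residue specification,
# an optional repeat such as (3) or (2,4), and an optional ">".
ELEMENT = re.compile(r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+(?:,\d+)?)\))?(>?)")


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        start, residue, repeat, end = match.groups()
        if residue == "x":
            residue = "."                         # any amino acid
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"  # any amino acid except these
        count = "{" + repeat + "}" if repeat else ""
        parts.append(("^" if start else "") + residue + count + ("$" if end else ""))
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)

# Made-up test sequence: one site that fits, and one N followed by P that does not.
sequence = "MKNGTLANPSA"
print([m.start() for m in re.finditer(regex, sequence)])
```

A pattern this short will match many unrelated sequences by chance, which is why frequently-matching entries are best excluded from a search and why each hit still needs the user's judgement.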

Searching the pattern databases
Searching Profiles
  the SIB Web server offers access both to profiles within PROSITE & pre-release (undocumented) profiles
  results are highly specific & generally diagnostically reliable
  if no match is returned, it's usually because the entry isn't in the db
  matches to undocumented profiles are often dead-ends

Searching the pattern databases
Searching Pfam
  results are returned in HTML tables accompanied by simple graphics to illustrate matched domains
  results are specific & usually diagnostically reliable
  E-values provide the measure of confidence

Searching the pattern databases
Searching PRINTS
  results are returned in HTML tables on different levels
    a best "guess"
    the top 10 best-scoring matches
    the raw data
  graphical options provide a visual impression of the quality of matches
  results are specific & usually diagnostically reliable
  combined E- & p-values provide the measure of confidence

Searching the pattern databases
Searching Blocks
  if results of searching PROSITE & PRINTS are positive, we would expect these to be confirmed by searches of the Blocks dbs
  key features to note in the output are the description line, the accession codes (which indicate which is the matched motif), & the best-scoring or anchor block
  most important is the detection of multiple block hits - where this happens, an E-value denotes the significance of the match
  single block matches are usually spurious

Searching the pattern databases
Searching eMOTIF
  as with Blocks, if results of searching PROSITE & PRINTS are positive, this should be confirmed by searches of eMOTIF
  output is given at several stringency levels, which indicate the number of false matches to expect in the reported results

Which approach is best?
BLAST frequently fails to assign significant scores
The hit-or-miss nature of single-motif regular expressions can render them worthless
In spite of (because of?) their complexity, profiles & HMMs are often out-performed by simpler motif methods
The non-weighting system of fingerprints means that Twilight relationships may be missed
The scoring system used to create blocks generates large amounts of noise that may obscure the signal
Only PROSITE & PRINTS are fully manually annotated

No method alone is best

Structural & functional interpretation
Db searches often do little more than identify a protein family
  this only scratches the surface - we still want to know what our protein does & what it might look like
The first step is to examine the detailed family documentation in PROSITE, PRINTS & InterPro
  this should help to elucidate the function of the protein
The next step is to examine the fold classification & structure summary resources
  e.g., scop, CATH & PDBsum, assuming that a structure is in fact available

Estimating significance
When do we believe a result? A real example...

Conclusions
What are the lessons for sequence analysis?
  when searching for distant homologues, several dbs should be searched
  different methods provide different perspectives
  dbs aren't complete & their contents don't fully overlap
The more dbs searched, the more difficult it can be to interpret results
  hence s/w is being designed to provide "intelligent" consensus outputs
The more computers are involved in automating genome annotation, the greater the need for collaboration
  especially between s/w developers, annotators & biologists
The more data we have to handle, the more rigorous we must be in our thinking (& writing) if we are to make sense of the complexities
We are a long way from having reliable tools for deducing protein structure & function from sequence
  but with the right approach, there is hope
