Identifying Novel Proteins

So you’ve found a gene you’re interested in, you’ve BLASTed it against the biggest protein database you can find, and you still have no real clues as to what its function might be. What do you do next?

Now, apart from making sure you really have a gene on your hands, there are two ways forward:

1. If there are believable BLASTx matches, but they are all predicted genes with no functional annotation, it might still be possible to use them as stepping stones to other, more informative, BLASTx matches which would not show up as similar to the original sequence. Think of this as traversing the phylogenetic tree.

2. Accumulate as much partial data about the sequence as you can, in the hope that it sheds light on the function. This will include functional protein domains, expression data, genomic alignment and secondary structure. It’s unlikely that you will become casually involved with higher-order structures, as solving or comparing these is a complex and specialised task.

Orphan Genes

Your lack of knowledge about protein function, having compared your sequence with all known proteins in the database, will manifest itself in one of two rather different ways.

1. There are good BLASTx matches with phylogenetically close organisms, but all the reasonably close hits are things like ‘Theoretical ..’ or ‘Predicted …’ or ‘Riken ..’ or ‘ORF285, chromosome 9’. We find plenty of evidence for orthologous genes, but these names are just different ways of saying that nobody knows anything about their function either.

2. There are no close BLASTx matches. This suggests that the protein may exist only in your organism. Such genes are known as ‘orphan genes’, and the phenomenon is quite well documented (see reference). Obviously these are going to be quite tough to work on, as nothing like them has been seen before…

Special case. There are good BLASTx matches with phylogenetically DISTANT organisms – check for contamination!

Reference: Domazet-Loso T, Tautz D. An Evolutionary Analysis of Orphan Genes in Drosophila. Genome Res. 2003 Oct; 13(10): 2213-2219.

Phylogenetic Stepping Stones

Consider a gene which has the same function across many phyla, and suppose we consider a phylogenetic tree based on sequence similarity:

[Figure: phylogenetic tree relating your species, species B, species C, species D and species E (function known)]

It’s possible that the sequence of the gene in your species is sufficiently similar to its orthologs in species B and C that these will show up in a BLAST search, while those in species D and E will not. But the sequence of the gene in species C is more similar to those in D and E. So once you get to C and BLAST from there, you might get to E, which happens to have been researched and whose function is known.
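The stepping-stone idea can be sketched as a breadth-first search over BLAST hit lists. This is only an illustration: the hit lists below are invented to mirror the example above (they are not real search results), and the names `HITS`, `ANNOTATED` and `stepping_stones` are hypothetical.

```python
from collections import deque

# Invented hit lists: which sequences a BLAST search from each species would find.
HITS = {
    "your species": ["species B", "species C"],
    "species B": ["your species", "species C"],
    "species C": ["your species", "species B", "species D", "species E"],
    "species D": ["species C", "species E"],
    "species E": ["species C", "species D"],
}

# Species in which the function of the gene is already known.
ANNOTATED = {"species E"}


def stepping_stones(start, hits, annotated):
    """Breadth-first search from `start` to the nearest annotated sequence.

    Returns the chain of searches as a list, or None if no annotated
    sequence can be reached.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] in annotated:
            return path
        for next_hit in hits.get(path[-1], []):
            if next_hit not in seen:
                seen.add(next_hit)
                queue.append(path + [next_hit])
    return None


print(stepping_stones("your species", HITS, ANNOTATED))
```

With these made-up hit lists the search goes from your species to species C and from there to species E. In practice each step would be a real BLAST search, and every intermediate hit needs checking for significance before you step onwards.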

This could be done manually, but it has been formalised in PSI BLAST, which uses iterative rounds of BLAST searching to build a more generalised model of the protein sequence, and uses this ‘evolving’ model to gradually traverse the tree. If not used carefully, though, it can go horribly wrong…


PSI BLAST (Position Specific Iterated – since you asked)

Initial Query

SREFTHYQWERLIKKTYFARFHNCMLISFSWER

Matches from database

SREKLSYQAERLIIWERFARFHICMLIPQSWER
SREKDSYQUERLIPWTYFARFHLCMLIPKSWER

New Composite Query

SREFTHYQWERLIKKTYFARFHNCMLISFSWER
   KLS  A    IWER     I    PQ
    D   U    P TY     L     K

2nd Round Matches from database

SREKLSYQAERLIIWERFARFHICMLIPQSWER
SREKDSYQUERLIPWTYFARFHLCMLIPKSWER
TUEKDSYPASAASPWERQREAFLHKLAPQSIEY

And so on…
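As a rough sketch of why the composite query reaches further than the original, the toy sequences from this example can be compared column by column. This is a deliberate simplification: real PSI BLAST builds a weighted position-specific scoring matrix, not a plain set of residues per column, and the function names here are hypothetical.

```python
QUERY = "SREFTHYQWERLIKKTYFARFHNCMLISFSWER"
ROUND1_MATCHES = [
    "SREKLSYQAERLIIWERFARFHICMLIPQSWER",
    "SREKDSYQUERLIPWTYFARFHLCMLIPKSWER",
]
# The new sequence that only turns up in the second round.
CANDIDATE = "TUEKDSYPASAASPWERQREAFLHKLAPQSIEY"


def build_composite(sequences):
    """Collect the set of residues seen in each column of pre-aligned sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must already be aligned to the same length")
    return [set(column) for column in zip(*sequences)]


def count_compatible(composite, sequence):
    """Count the columns where the sequence has a residue the composite allows."""
    return sum(residue in allowed for allowed, residue in zip(composite, sequence))


query_only = build_composite([QUERY])
composite = build_composite([QUERY] + ROUND1_MATCHES)

print(count_compatible(query_only, CANDIDATE))  # agreement with the query alone
print(count_compatible(composite, CANDIDATE))   # agreement with the composite
```

With these toy sequences the second-round sequence agrees with the original query at only 5 of 33 positions, but with the composite at 16. The same mechanism explains the danger: every sequence you accept widens the composite, so one bad inclusion can pull in unrelated proteins in later rounds.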

PSI BLAST

Round 1 results




Finally some function!

Functional Domain Analysis

Proteins are considered to have functional domains within them: specific regions of the protein which carry out specific tasks. These domains are recognisably conserved between different proteins, even though the overall similarity of the proteins may be quite low.

Typical Diagram of Functional Domains on a Protein

Functional Domain

If you can find functional domains, you may know something about the general behaviour of your protein, even if you don’t know exactly what its function is. But, as usual, be aware that non-significant matches are quite likely to be displayed on any analysis website, so at least look for a confidence score or other measure of significance, and treat everything with a degree of caution.

The main specialised sites for this type of analysis are SMART and Pfam, which have considerable overlapping functionality. There is also InterProScan, which attempts to integrate all the available tools…

The search methods are rather different from BLAST, and rely primarily on building up a model of the functional domain from known examples. The model is then a generalised pattern for a given domain, and your unknown sequences are searched against the models, using rather more advanced methods, typically involving hidden Markov models.

Once a functional domain has been identified in a number of sequences, we can build a model of it. By this we just mean a summation of our understanding of the linear sequence variants.

position  1   2   3   4   5   6   7   8   9   0
          Y   S   C   M   V   G   H   E   A   L
          F   S   C   V   V   G   H   E   A   L
          Y   T   C   K   V   D   H   E   T   L
          F   T   C   Q   V   T   H   E   G   D
          Y   S   C   R   V   K   H   V   T   L
          Y   T   C   V   V   G   H   E   A   L
model     YF  ST  C   ?   V   ?   H   ~E  ?   ~L
score     5   5   10      10      10  8       8

In the model row, ? marks a position with no clear preference and ~ marks a residue that is usual but not invariant.

The scores may be arbitrary, but they constitute the hidden Markov model by which we evaluate other proteins to see if they contain this domain. As you accumulate more examples the model gets more refined, and hopefully more accurate…

The higher the score of your test protein sequence against the model, the more likely it is presumed to contain the domain.

The model will also allow for the possibility of (expensive) gaps if the spacing of your real sequence doesn’t fit the model. Known variable regions can be modelled as cheaper gaps.
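To make the scoring idea concrete, here is a minimal sketch that slides the toy model from the table along a test sequence. It is not a real hidden Markov model: there are no probabilities and no gaps, the ‘~’ positions are treated as exact matches, and the ‘?’ positions simply score nothing. The names `MODEL`, `score_window` and `best_hit` are hypothetical.

```python
# One (allowed residues, score) pair per model column, taken from the toy table.
# An empty string stands for a '?' column, which never contributes to the score.
MODEL = [
    ("YF", 5), ("ST", 5), ("C", 10), ("", 0), ("V", 10),
    ("", 0), ("H", 10), ("E", 8), ("", 0), ("L", 8),
]


def score_window(window):
    """Score a stretch of sequence exactly as long as the model."""
    if len(window) != len(MODEL):
        raise ValueError("window must be the same length as the model")
    return sum(
        score
        for (allowed, score), residue in zip(MODEL, window)
        if residue in allowed
    )


def best_hit(sequence):
    """Slide the model along the sequence; return (best score, 0-based offset)."""
    width = len(MODEL)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the model")
    scored = [
        (score_window(sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored, key=lambda pair: pair[0])


print(score_window("YSCMVGHEAL"))      # a training example: the full score of 56
print(score_window("FTCQVTHEGD"))      # last column is D, not L, so 48
print(best_hit("AAAYTCKVDHETLAAA"))    # finds the embedded domain at offset 3
```

Whether a given score is meaningful still depends on what random sequences score against the same model, which is why the significance measures reported by the real tools matter.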


Functional Domains and Hidden Markov Models

Problems with Models by Example

There are two conceptual problems with building models from examples.

First, the behaviour of the protein domain is likely to be related to the three-dimensional shape of the molecule and the nature of its interactions with other molecules. As we are not taking these into account at all, we cannot expect our model to be very realistic.

Secondly, the model is (by its nature) highly biased towards the examples already found, and further examples found with the help of the model will tend to reinforce any initial bias. So our model may tend to grow away from the actual consensus across all possible proteins, and lock us out of whole subsets of data.

Incidentally, this problem of bias is very similar to what can happen with PSI BLAST: if the proteins you choose to include in your growing model diverge too much from your original sequence, the search can quickly take you off into strange territory…

Using SMART

Exercise 1: Using Pfam and SMART

Online Scratch Pad

For the following exercises, you may find a scratch pad useful for keeping information from previous stages of a search. If you open up the file scratch-pad.html you’ll find you can keep text data in the outlined box. You cannot save the data, and it’ll vanish if you close the window or refresh it!

Go to the example-sequences.html file and the Protein Domain Searches section, and copy the sequence for >igf4D.

Then go to the SMART web site, paste your sequence, tick at least the signal peptides box, and then run the search.

While that’s running, go to the Pfam site (in a new browser window) and search the same sequence there.

Compare the two result sets. Is there any difference? Should we expect any?

Now go to the NCBI BLAST page, and do a protein-protein BLASTp – this may be a useful way of getting to the same data.

What could you have learned about the function of this gene?

If you are ahead of the rest of the group, check out the results for the much longer >titin sequence.

http://doc.bioperl.org/bioperl-live/Bio/Tools/Sigcleave.html

Using SMART

Exercise 2: Random Sequences Again

We recall that random DNA sequences gave us alignments against real proteins when using BLASTx, and that E-values can give us a good idea whether alignments are biologically meaningful or not.

This becomes even more important when searching for subtler matches – generally shorter sequences with considerable variation allowed at most positions.

Go to the file random-protein-sequences.html and copy the sequence assigned to you. Go to whichever of the Pfam or SMART web sites you preferred, and run the search on your sequence.

Did you find any domain hits?

Were they significant?

Was it possible to tell?

Look at the actual alignments, if you can find out how to, and also see if you can find the model that the domain is based on.

Repeat with a second sequence if you have time.

Functional Motifs in Proteins

The ELM server (http://elm.eu.org/)

ELM is a resource for predicting functional sites in eukaryotic proteins. Putative functional sites are identified by patterns (regular expressions). To improve the predictive power, context-based rules and logical filters are applied to reduce the number of false positives.

We may be more familiar with functional motifs in DNA sequences, e.g. transcription factor binding sites. Here, for example, is the (Xenopus) TBox motif: T[CG]A[CG]AC[CG]T

But short motifs are also present in protein sequences, for example:

FHA domain interaction motif 1: T..[ILA]

The Forkhead-associated (FHA) domain binds phosphothreonine- or phosphoserine-containing peptides.
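Both motifs above are ordinary regular expressions, so they can be tried out directly. The sketch below uses Python's standard `re` module; the test sequences are made up, the helper names are hypothetical, and the estimate of chance matches assumes all 20 amino acids are equally frequent, which real proteins are not.

```python
import re

TBOX_MOTIF = "T[CG]A[CG]AC[CG]T"   # DNA motif quoted above
FHA_MOTIF = "T..[ILA]"             # protein motif quoted above


def find_motif(pattern, sequence):
    """Return the 0-based start of every match, including overlapping ones."""
    # A lookahead consumes no characters, so overlapping matches are all found.
    lookahead = re.compile("(?=(" + pattern + "))")
    return [match.start() for match in lookahead.finditer(sequence)]


def expected_random_fha_hits(length):
    """Rough number of chance T..[ILA] matches in a random protein.

    Assumes uniform amino acid frequencies: the T contributes 1/20,
    the two wildcards contribute 1, and [ILA] contributes 3/20.
    """
    per_position = (1 / 20) * (3 / 20)
    return per_position * (length - 3)


print(find_motif(TBOX_MOTIF, "GGTCACACCTGG"))   # made-up DNA sequence
print(find_motif(FHA_MOTIF, "MKTPQLSTTAA"))     # made-up protein sequence
print(expected_random_fha_hits(1003))           # roughly 7.5 chance matches
```

Under that uniform assumption a protein of about a thousand residues is expected to contain seven or eight matches to the FHA interaction motif purely by chance, so a bare pattern match carries very little information on its own.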

The general problem with motifs is the number of false positives, as they are ge

http://elm.eu.org/help.html#regular_expressions


http://elm.eu.org/help.html#qa_filters

Functional Motifs Reported by ELM in a Random Amino Acid Sequence

Secondary Structure Analysis

The weak neighbour-neighbour interactions between amino acids in a protein molecule give rise to a small number of basic structural arrangements. The two main forms are linear helical structures (alpha-helices) and sheets of strands lying side by side (beta-sheets); in both, hydrogen bonds between backbone groups stabilise the structure. We may consider that the larger scale structure of the whole protein is built from these smaller scale structures, and as such they may give us some insight into the role of the protein even in the absence of much functional data.

The 3-dimensional protein structures that you see pictures of are often composed of alpha-helices and beta-sheets linked by less well structured sections of the protein.

There are a large number of web pages devoted to analysing proteins for secondary structure, and even some which attempt to aggregate the results of several different methods (at PBIL).

http://www.chemsoc.org/exemplarchem/entries/2004/durham_mcdowall/prot-3.html

Is it Really a Gene?

If you are really getting nowhere with your functional analysis, it may be worth checking whether you have got a gene at all.

There are several circumstances in which this might arise.

If you are using a physical reagent like a cDNA clone, it’s possible that it contains an incomplete mRNA sequence, and you are just looking at a plausible but unreal ORF in the 3’ UTR. Or it could contain an unspliced immature transcript. Or it could even be contamination from some other, very different species, e.g. bacteria. You may learn a lot by aligning your sequence with the organism’s genome, to check that it’s there and that it appears to have exons (if you would expect them).

Or if you found the gene by some sort of mapping/positional analysis, and you are analysing sequences from gene models shown on the genome, check that there is real (e.g. EST) evidence for this gene – it may be purely theoretical, and entirely bogus…

Genomic Analysis

It is possible that analysing the position of your gene on the genome can tell you something about its possible function.

Genes sometimes function in ‘expression cassettes’, where neighbouring genes are either co-expressed, or under closely related (temporal or spatial) regulation. So if nearby genes are well characterised it would be worth considering this as a possibility.

Equally, if there are obvious orthologs of this gene in other species, check out the genomic context there too.

You should also be able to find out if your gene is a member of a gene family, or whether it shares small regions of coding sequence with other genes. Is there a way of doing tBLASTn or tBLASTx against the genome in your preferred browser?

Expression Data

Genes that are co-expressed may well be involved in the same pathways; the more intricate the pattern of co-expression, the greater the likelihood. You may find genes of known function that yours is associated with.

If you found the gene originally in an expression array experiment this may be an easy way in. Alternatively there is a growing amount of expression data out there in databases, although at the moment it’s pretty difficult to systematically mine it. Various efforts are underway to facilitate this (FlyMine, ArrayExpress) tho’ it’s not clear how effective these are yet. It may also be difficult to track ‘your gene’ down in the data sets.

If your gene is from an EST or cDNA sequence, see if the ESTs are clustered and check out which libraries they come from. This may tell you whether your gene is expressed in specific stages/tissues, or whether it is more ubiquitous.

Exercise 3: Genuine Unknowns

The sequence file identification-example-sequences.html contains 12 gene sequences from Xenopus tropicalis which superficially look hard to identify. The full cDNA sequence is given, along with the amino acid sequence translated from the presumed ORF.

Pick one of the first six sequences, and start to accumulate data about it.

1. Check BLASTx – new sequences are arriving on the database all the time
2. Consider whether PSI BLAST might be useful
3. Check against the genome
4. Look for functional protein domains
5. Look for secondary structure

If you find anything that looks useful keep a note of it.

But bear in mind that, in the real world, you may soon be thinking about going back to the laboratory for further experimental work!

Exercise 3: Results

>u-one Xt6.1-CAAL21151.3 Dpy30, SCOP domains – PSI 2 rounds -> chloroplast enolase? ADP-ribosylation factor-like

>u-two Xt6.1-CABJ8169.5 sipP, RUN, PDZ, PTB domains – PSI 2 rounds -> rap2 interacting protein x

>u-three TEgg047e16 clear orphan, no domains, no results with PSI BLAST, Egg/Ova/Gas EST expression

>u-four IMAGE:7016814 Globin domains, odd organisms, no hit on genome - worm contamination, adult whole body lib.

>u-five IMAGE:5384335 signal peptide, seven transmembrane regions (!)

>u-six TEgg044i21 signal peptide, coiled coils domain - PSI 2 rounds -> yeast-tht1

>u-seven Xt6.1-CAAO3979.3 coiled coils domain - PSI 2 rounds, meaningless name -> myosin (?)

>u-eight TEgg001m03 single exon ORF, 5 RRM_1 domains, 5th(!) mouse hit, Rbm12

>u-nine CABE11813 long protein, no domains, no more additions after 2 rounds of PSI BLAST, all_predicted

>u-ten TGas024h08 long protein, no domains, sort-of-name, PSI 2 rounds -> chloroplast RNA processing 1 1e-05...
