Phylogenomics: Improving Functional Predictions for Uncharacterized Genes by Evolutionary Analysis

Jonathan A. Eisen1

Department of Biological Sciences, Stanford University, Stanford, California 94305-5020 USA

The ability to accurately predict gene function based on gene sequence is an important tool in many areas of biological research. Such predictions have become particularly important in the genomics age in which numerous gene sequences are generated with little or no accompanying experimentally determined functional information. Almost all functional prediction methods rely on the identification, characterization, and quantification of sequence similarity between the gene of interest and genes for which functional information is available. Because sequence is the prime determining factor of function, sequence similarity is taken to imply similarity of function. There is no doubt that this assumption is valid in most cases. However, sequence similarity does not ensure identical functions, and it is common for groups of genes that are similar in sequence to have diverse (although usually related) functions. Therefore, the identification of sequence similarity is frequently not enough to assign a predicted function to an uncharacterized gene; one must have a method of choosing among similar genes with different functions. In such cases, most functional prediction methods assign likely functions by quantifying the levels of similarity among genes. I suggest that functional predictions can be greatly improved by focusing on how the genes became similar in sequence (i.e., evolution) rather than on the sequence similarity itself. It is well established that many aspects of comparative biology can benefit from evolutionary studies (Felsenstein 1985), and comparative molecular biology is no exception (e.g., Altschul et al. 1989; Goldman et al. 1996). In this commentary, I discuss the use of evolutionary information in the prediction of gene function. To appreciate the potential of a phylogenomic approach to the prediction of gene function, it is necessary to first discuss how gene sequence is commonly used to predict gene function and some general features about gene evolution.

Sequence Similarity, Homology, and Functional Predictions

To make use of the identification of sequence similarity between genes, it is helpful to understand how such similarity arises. Genes can become similar in sequence either as a result of convergence (similarities that have arisen without a common evolutionary history) or descent with modification from a common ancestor (also known as homology). It is imperative to recognize that sequence similarity and homology are not interchangeable terms. Not all homologs are similar in sequence (i.e., homologous genes can diverge so much that similarities are difficult or impossible to detect) and not all similarities are due to homology (Reeck et al. 1987; Hillis 1994). Similarity due to convergence, which is likely limited to small regions of genes, can be useful for some functional predictions (Henikoff et al. 1997). However, most sequence-based functional predictions are based on the identification (and subsequent analysis) of similarities that are thought to be due to homology. Because homology is a statement about common ancestry, it cannot be proven directly from sequence similarity. In these cases, the inference of homology is made based on finding levels of sequence similarity that are thought to be too high to be due to convergence (the exact threshold for such an inference is not well established).

Improvements in database search programs have made the identification of likely homologs much faster, easier, and more reliable (Altschul et al. 1997; Henikoff et al. 1998). However, as discussed above, in many cases the identification of homologs is not sufficient to make specific functional predictions because not all homologs have the same function. The available similarity-based functional prediction methods can be distinguished by how they choose the homolog whose function is most relevant to a particular uncharacterized gene (Table 1). Some methods are relatively simple: many researchers use the highest scoring homolog (as determined by programs like BLAST or BLAZE) as the basis for assigning function. While highest hit methods are very fast, can be automated readily, and are likely accurate in many instances, they do not take advantage of any information about how genes and gene functions evolve. For example, gene duplication and subsequent divergence of function of the duplicates can result in homologs with different functions being present within one species. Specific terms have been created to distinguish homologs in these cases (Table 2): Genes of the same duplicate group are called orthologs (e.g., β-globin from mouse and humans), and different duplicates are called paralogs (e.g., α- and β-globin) (Fitch 1970). Because gene duplications are frequently accompanied by functional divergence, dividing genes into groups of orthologs and paralogs can improve the accuracy of functional predictions. Recognizing that the one-to-one sequence comparisons used by most methods do not reliably distinguish orthologs from paralogs, Tatusov et al. (1997) developed the COG clustering method (see Table 1). Although the COG method is clearly a major advance in identifying orthologous groups of genes, it is limited in its power because clustering is a way of classifying levels of similarity and is not an accurate method of inferring evolutionary relationships (Swofford et al. 1996). Thus, as sequence similarity and clustering are not reliable estimators of evolutionary relatedness, and as the incorporation of such phylogenetic information has been so useful to other areas of biology, evolutionary techniques should be useful for improving the accuracy of predicting function based on sequence similarity.

1 E-mail [email protected]; fax (650) 725-1848. WWW: http://www-leland.stanford.edu/~jeisen.

Insight/Outlook

GENOME RESEARCH 8:163–167 ©1998 by Cold Spring Harbor Laboratory Press ISSN 1054-9803/98 $5.00; www.genome.org

Table 4. Examples of Conditions in Which Similarity Methods Produce Inaccurate Predictions of Function

Columns: Evolutionary Pattern and Tree of Genes and Functions (1); Gene With Unknown Function (2); Highest Hit Method: Predicted Function (3), Accurate?; Phylogenomic Method: Predicted Function (4), Accurate?; Comments.

A. Functional change during evolution.
Genes with unknown function: 1, 2, 3, 4, 5, 6
Predicted Function and Accurate? entries (in extraction order; the graphical symbols and column alignment of the original are not recoverable): / / + + + - ± ± / / + + ± ± + +
Comments:
• Phylogenomic method cannot predict functions for all genes, but the predictions that are made are accurate.
• Highest hit method is misleading because function changed among homologs but hierarchies of similarity do not correlate with the function (see Bolker and Raff 1996).

B. Functional change and rate variation.
Genes with unknown function: 1, 2, 3, 4, 5, 6
Predicted Function and Accurate? entries (in extraction order; the graphical symbols and column alignment of the original are not recoverable): + + - - - + / / + + ± ± + +
Comments:
• Similarity-based methods perform particularly poorly when evolutionary rates vary between taxa.
• Molecular phylogenetic methods can allow for rate variation and reconstruct gene history reasonably accurately.

C. Gene duplication and rate variation.
Genes with unknown function: 1A, 2A, 3A, 1B, 2B, 3B
Predicted Function and Accurate? entries (in extraction order; the graphical symbols and column alignment of the original are not recoverable): + + - + + - + + + + + +
Comments:
• Most similarity-based methods are not ideally set up to deal with cases of gene duplication since orthologous genes do not always have significantly more sequence similarity to each other than to paralogs (Eisen et al. 1995; Zardoya et al. 1996; Tatusov et al. 1997).
• Similarity-based methods perform particularly poorly when rate variation and gene duplication are combined. This even applies to the COG method (see Table 1) since it works by classifying levels of similarity and not by inferring history. Nevertheless, the COG method is a significant improvement over other similarity-based methods in classifying orthologs.
• Phylogenetic reconstruction is the most reliable way to infer gene duplication events and thus determine orthology.

1 The true tree is shown but it is assumed that it is not known. Different colors and symbols represent different functions. Numbers correspond to different species.
2 The function of all other genes is assumed to be known.
3 The top hit can be determined from the tree by finding the gene that is the shortest evolutionary distance away (as determined along the branches of the tree).
4 It is assumed that the tree of the genes can be reproduced accurately by molecular phylogenetic methods (see Fig. 1).
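The difference between the highest hit and top hits strategies summarized in Table 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hit list, gene names, scores, function labels, and the `min_fraction` threshold are all hypothetical and do not come from any real search program or from the methods cited above.

```python
from collections import Counter

# Hypothetical search results: (homolog name, similarity score, annotated function).
hits = [
    ("geneA", 310.0, "kinase"),
    ("geneB", 295.0, "phosphatase"),
    ("geneC", 290.0, "phosphatase"),
    ("geneD", 120.0, "phosphatase"),
]

def highest_hit(hits):
    """Return the annotated function of the single best-scoring homolog."""
    best = max(hits, key=lambda h: h[1])
    return best[2]

def top_hits_consensus(hits, n=10, min_fraction=0.7):
    """Return the most common function among the top n hits, or None when
    agreement falls below min_fraction (an illustrative threshold)."""
    top = sorted(hits, key=lambda h: h[1], reverse=True)[:n]
    function, count = Counter(h[2] for h in top).most_common(1)[0]
    return function if count / len(top) >= min_fraction else None

print(highest_hit(hits))         # kinase
print(top_hits_consensus(hits))  # phosphatase
```

Neither function uses any information about how the homologs are related to one another, which is the limitation the phylogenomic approach is meant to address.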

Phylogenomics

There are many ways in which evolutionary information can be used to improve functional predictions. Below, I present an outline of one such phylogenomic method (see Fig. 1), and I compare this method to nonevolutionary functional prediction methods. This method is based on a relatively simple assumption: because gene functions change as a result of evolution, reconstructing the evolutionary history of genes should help predict the functions of uncharacterized genes. The first step is the generation of a phylogenetic tree representing the evolutionary history of the gene of interest and its homologs. Such trees are distinct from clusters and other means of characterizing sequence similarity because they are inferred by special techniques that help convert patterns of similarity into evolutionary relationships (see Swofford et al. 1996). After the gene tree is inferred, biologically determined functions of the various homologs are overlaid onto the tree. Finally, the structure of the tree and the relative phylogenetic positions of genes of different functions are used to trace the history of functional changes, which is then used to predict functions of uncharacterized genes. More detail of this method is provided below.

Identification of Homologs

The first step in studying the evolution of a particular gene is the identification of homologs. As with similarity-based functional prediction methods, likely homologs of a particular gene are identified through database searches. Because phylogenetic methods benefit greatly from more data, it is useful to augment this initial list by using identified homologs as queries for further database searches or using automatic iterated search methods such as PSI-BLAST (Altschul et al. 1997). If a gene family is very large (e.g., ABC transporters), it may be necessary to only analyze a subset of homologs. However, this must be done with extreme care, as one might accidentally leave out proteins that would be important for the analysis.
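The idea of using identified homologs as queries for further searches can be sketched as a simple expansion loop. This is a minimal sketch, not PSI-BLAST (which builds a profile rather than re-querying with individual sequences). The `search` callable, the toy lookup table, and the `max_rounds` limit are assumptions made for illustration; in practice `search` would wrap a real database search and apply a significance cutoff.

```python
def expand_homologs(query, search, max_rounds=5):
    """Collect likely homologs by re-querying with each newly found gene.

    `search` is a hypothetical callable that takes a gene name and returns
    the names of its significant hits.
    """
    found = {query}
    frontier = [query]
    for _ in range(max_rounds):
        new = []
        for gene in frontier:
            for hit in search(gene):
                if hit not in found:
                    found.add(hit)
                    new.append(hit)
        if not new:  # no further homologs turned up, so stop early
            break
        frontier = new
    return found

# Toy stand-in for a database search: each gene "hits" only its near relatives.
toy_db = {"g1": ["g2"], "g2": ["g1", "g3"], "g3": ["g2", "g4"], "g4": ["g3"]}
homologs = expand_homologs("g1", lambda g: toy_db.get(g, []))
print(sorted(homologs))  # ['g1', 'g2', 'g3', 'g4']
```

A single search with g1 would have found only g2; the more distant g3 and g4 are recovered only because earlier hits were reused as queries.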

Alignment and Masking

Sequence alignment for phylogenetic analysis has a particular purpose: it is the assignment of positional homology. Each column in a multiple sequence alignment is assumed to include amino acids or nucleotides that have a common evolutionary history, and each column is treated separately in the phylogenetic analysis. Therefore, regions in which the assignment of positional homology is ambiguous should be excluded (Gatesy et al. 1993). The exclusion of certain alignment positions (also known as masking) helps to give phylogenetic methods much of their discriminatory power. Phylogenetic trees generated without masking (as is done in many sequence analysis software packages) are less likely to accurately reflect the evolution of the genes than trees with masking.
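Masking can be illustrated with a short Python sketch that drops alignment columns before any tree is built. The criterion used here (excluding columns whose gap fraction exceeds a threshold) is an assumption chosen for simplicity; it is only a crude proxy for ambiguous positional homology, and the text does not prescribe a specific rule. The function name, the threshold parameter, and the toy alignment are hypothetical.

```python
def mask_alignment(alignment, max_gap_fraction=0.0):
    """Remove columns whose fraction of gap characters exceeds the threshold.

    `alignment` maps sequence names to aligned strings of equal length.
    Returns the masked alignment and the indices of the columns kept.
    """
    names = list(alignment)
    seqs = [alignment[n] for n in names]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        i for i in range(length)
        if sum(s[i] == "-" for s in seqs) / len(seqs) <= max_gap_fraction
    ]
    masked = {n: "".join(alignment[n][i] for i in keep) for n in names}
    return masked, keep

aln = {
    "sp1": "MKT-LLV",
    "sp2": "MKTALLV",
    "sp3": "MRT--LV",
}
masked, kept = mask_alignment(aln)
print(kept)    # [0, 1, 2, 5, 6]
print(masked)  # {'sp1': 'MKTLV', 'sp2': 'MKTLV', 'sp3': 'MRTLV'}
```

Because every sequence loses the same columns, the remaining columns still line up, which is what allows each one to be treated as a separate character in the phylogenetic analysis.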

Phylogenetic Trees

For extensive information about generating phylogenetic trees from sequence alignments, see Swofford et al. (1996). In summary, there are three methods commonly used: parsimony, distance, and maximum likelihood (Table 3), and each has its advantages and disadvantages. I prefer distance methods because they are the quickest when using large data sets. Before using any particular tree it is important to estimate the robustness and accuracy of the phylogenetic patterns it shows (through techniques such as the comparison of trees generated by different methods and bootstrapping). Finally, in most cases, it is also useful to determine a root for the tree.

Table 1. Methods of Predicting Gene Function When Homologs Have Multiple Functions

Highest Hit: The uncharacterized gene is assigned the function (or frequently, the annotated function) of the gene that is identified as the highest hit by a similarity search program (e.g., Tomb et al. 1997).

Top Hits: Identify top 10+ hits for the uncharacterized gene. Depending on the degree of consensus of the functions of the top hits, the query sequence is assigned a specific function, a general activity with unknown specificity, or no function (e.g., Blattner et al. 1997).

Clusters of Orthologous Groups: Genes are divided into groups of orthologs based on a cluster analysis of pairwise similarity scores between genes from different species. Uncharacterized genes are assigned the function of characterized orthologs (Tatusov et al. 1997).

Phylogenomics: Known functions are overlaid onto an evolutionary tree of all homologs. Functions of uncharacterized genes are predicted by their phylogenetic position relative to characterized genes (e.g., Eisen et al. 1995, 1997).

Table 2. Types of Molecular Homology

Homolog: Genes that are descended from a common ancestor (e.g., all globins)

Ortholog: Homologous genes that have diverged from each other after speciation events (e.g., human β- and chimp β-globin)

Paralog: Homologous genes that have diverged from each other after gene duplication events (e.g., β- and γ-globin)

Xenolog: Homologous genes that have diverged from each other after lateral gene transfer events (e.g., antibiotic resistance genes in bacteria)

Positional homology: Common ancestry of specific amino acid or nucleotide positions in different genes
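Two ingredients of the distance and bootstrapping steps described under Phylogenetic Trees can be sketched in Python: a pairwise distance matrix computed from an alignment, and bootstrap resampling of alignment columns. The sketch makes several simplifying assumptions. It uses the uncorrected proportion of differing sites (real analyses normally use corrected distances), it does not build a tree, and `support_for_nearest` is a hypothetical stand-in that reports how often one sequence remains the nearest neighbor of another across replicates rather than a true per-node bootstrap percentage. All names and the toy alignment are illustrative.

```python
import random

def p_distance(a, b):
    """Fraction of differing positions, ignoring columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(alignment):
    """Return {(name1, name2): distance} for all ordered pairs of sequences."""
    names = sorted(alignment)
    return {(m, n): p_distance(alignment[m], alignment[n]) for m in names for n in names}

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement to make a new data set."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}

def support_for_nearest(alignment, query, candidate, replicates=200, seed=0):
    """Fraction of replicates in which `candidate` is strictly closest to `query`."""
    rng = random.Random(seed)
    others = [n for n in alignment if n not in (query, candidate)]
    wins = 0
    for _ in range(replicates):
        d = distance_matrix(bootstrap_replicate(alignment, rng))
        if all(d[(query, candidate)] < d[(query, o)] for o in others):
            wins += 1
    return wins / replicates

aln = {"sp1": "ACGTACGT", "sp2": "ACGTACGA", "sp3": "ACCTTCGA"}
d = distance_matrix(aln)
print(d[("sp1", "sp2")], d[("sp1", "sp3")], d[("sp2", "sp3")])  # 0.125 0.375 0.25
print(support_for_nearest(aln, "sp1", "sp2"))
```

In a full analysis each bootstrap replicate would be passed to a tree-building method, and support would be tallied for each node of the tree as described in Table 3.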

Functional Predictions

To make functional predictions based on the phylogenetic tree, it is necessary to first overlay any known functions onto the tree. There are many ways this “map” can then be used to make functional predictions, but I recommend splitting the task into two steps. First, the tree can be used to identify likely gene duplication events in the past. This allows the division of the genes into groups of orthologs and paralogs (e.g., Eisen et al. 1995). Uncharacterized genes can be assigned a likely function if the function of any ortholog is known (and if all characterized orthologs have the same function). Second, parsimony reconstruction techniques (Maddison and Maddison 1992) can be used to infer the likely functions of uncharacterized genes by identifying the evolutionary scenario that requires the fewest functional changes over time (Fig. 1). The incorporation of more realistic models of functional change (and not just minimizing the total number of changes) may prove to be useful, but the parsimony minimization methods are probably sufficient in most cases.
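The parsimony step can be sketched with Fitch's small-parsimony count on a rooted binary gene tree. This is a minimal illustration under stated assumptions, not the MacClade procedure cited above: the tree is written as nested tuples, all functional changes cost the same, uncharacterized genes are treated as wildcards, and the gene names and function labels ("repair", "transcription", "X", "Y") are hypothetical. For each candidate function of the gene of interest, the code counts the minimum number of functional changes the tree would require and keeps the candidates with the lowest count.

```python
def fitch_score(tree, leaf_states, all_states):
    """Return (state set, minimum number of changes) for a rooted binary tree.

    Leaves are gene names; internal nodes are 2-tuples of subtrees.
    Leaves without a known state are treated as wildcards.
    """
    if isinstance(tree, str):
        state = leaf_states.get(tree)
        return ({state} if state is not None else set(all_states)), 0
    left, right = tree
    left_set, left_cost = fitch_score(left, leaf_states, all_states)
    right_set, right_cost = fitch_score(right, leaf_states, all_states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def predict_function(tree, known, gene):
    """Return the set of functions for `gene` that minimize functional changes."""
    states = sorted(set(known.values()))
    scores = {}
    for state in states:
        trial = dict(known)
        trial[gene] = state
        scores[state] = fitch_score(tree, trial, states)[1]
    best = min(scores.values())
    return {s for s, cost in scores.items() if cost == best}

# Duplication followed by functional divergence (compare Fig. 1, example A).
tree_a = ((("1A", "2A"), "3A"), (("1B", "2B"), "3B"))
known_a = {"1A": "repair", "3A": "repair", "2B": "transcription", "3B": "transcription"}
print(predict_function(tree_a, known_a, "2A"))  # {'repair'}
print(predict_function(tree_a, known_a, "1B"))  # {'transcription'}

# A case in which two functions are equally parsimonious.
tree_b = (("1", "2"), ("3", "4"))
known_b = {"1": "X", "3": "Y", "4": "Y"}
print(sorted(predict_function(tree_b, known_b, "2")))  # ['X', 'Y']
```

When more than one function is returned, the prediction is ambiguous, which corresponds to the unpredictable cases shown with hatching in Figure 1.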

Is the Phylogenomic Method Worth the Trouble?

Phylogenomic methods require many more steps and usually much more manual labor than similarity-based functional prediction methods. Is the phylogenomic approach worth the trouble? Many specific examples exist in which gene function has been shown to correlate well with gene phylogeny (Eisen et al. 1995; Atchley and Fitch 1997). Although no systematic comparisons of phylogenetic versus similarity-based functional prediction methods have been done, there are a variety of reasons to believe that the phylogenomic method should produce more accurate predictions than similarity-based methods. In particular, there are many conditions in which similarity-based methods are likely to make inaccurate predictions but which can be dealt with well by phylogenetic methods (see Table 4).

A specific example helps illustrate a potential problem with similarity-based methods. Molecular phylogenetic methods show conclusively that mycoplasmas share a common ancestor with low-GC Gram-positive bacteria (Weisburg et al. 1989). However, examination of the percent similarity between mycoplasmal genes and their homologs in bacteria does not clearly show this relationship. This is because mycoplasmas have undergone an accelerated rate of molecular evolution relative to other bacteria. Thus, a BLAST search with a gene from Bacillus subtilis (a low-GC Gram-positive species) will result in a list in which the mycoplasma homologs (if they exist) score lower than genes from many species of bacteria less closely related to B. subtilis. When amounts or rates of change vary between lineages, phylogenetic methods are better able to infer evolutionary relationships than similarity methods (including clustering) because they allow for evolutionary branches to have different lengths. Thus, in those cases in which gene function correlates with gene phylogeny and in which amounts or rates of change vary between lineages, similarity-based methods will be more likely than phylogenomic methods to make inaccurate functional predictions (see Table 4).

Figure 1 Outline of a phylogenomic methodology. In this method, information about the evolutionary relationships among genes is used to predict the functions of uncharacterized genes (see text for details). Two hypothetical scenarios are presented and the path of trying to infer the function of two uncharacterized genes in each case is traced. (A) A gene family has undergone a gene duplication that was accompanied by functional divergence. (B) Gene function has changed in one lineage. The true tree (which is assumed to be unknown) is shown at the bottom. The genes are referred to by numbers (which represent the species from which these genes come) and letters (which in A represent different genes within a species). The thin branches in the evolutionary trees correspond to the gene phylogeny and the thick gray branches in A (bottom) correspond to the phylogeny of the species in which the duplicate genes evolve in parallel (as paralogs). Different colors (and symbols) represent different gene functions; gray (with hatching) represents either unknown or unpredictable functions.

Table 3. Molecular Phylogenetic Methods

Parsimony: Possible trees are compared and each is given a score that is a reflection of the minimum number of character state changes (e.g., amino acid substitutions) that would be required over evolutionary time to fit the sequences into that tree. The optimal tree is considered to be the one requiring the fewest changes (the most parsimonious tree).

Distance: The optimal tree is generated by first calculating the estimated evolutionary distance between all pairs of sequences. Then these distances are used to generate a tree in which the branch patterns and lengths best represent the distance matrix.

Maximum likelihood: Maximum likelihood is similar to parsimony methods in that possible trees are compared and given a score. The score is based on how likely the given sequences are to have evolved in a particular tree given a model of amino acid or nucleotide substitution probabilities. The optimal tree is considered to be the one that has the highest probability.

Bootstrapping: Alignment positions within the original multiple sequence alignment are resampled and new data sets are made. Each bootstrapped data set is used to generate a separate phylogenetic tree and the trees are compared. Each node of the tree can be given a bootstrap percentage indicating how frequently those species joined by that node group together in different trees. Bootstrap percentage does not correspond directly to a confidence limit.
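The effect of rate variation on similarity ranking can be made concrete with a toy tree in Python. The branch lengths below are invented purely for illustration and are not estimates for any real organisms; the node names and helper functions are likewise hypothetical. Distance along the branches plays the role of the similarity ranking (the shortest distance corresponds to the top hit), while the branching order identifies the closest relative.

```python
# Toy rooted tree: child -> (parent, branch length). Lengths are invented so
# that the "mycoplasma" lineage has evolved much faster than the others.
tree = {
    "low_gc_ancestor": ("root", 0.1),
    "other_bacterium": ("root", 0.2),
    "b_subtilis": ("low_gc_ancestor", 0.1),
    "mycoplasma": ("low_gc_ancestor", 1.0),
}

def path_to_root(node):
    """Return {ancestor: distance from node}, including the node itself."""
    distance, path = 0.0, {node: 0.0}
    while node in tree:
        parent, length = tree[node]
        distance += length
        path[parent] = distance
        node = parent
    return path

def tree_distance(a, b):
    """Distance between two nodes measured along the branches of the tree."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    return min(path_a[n] + path_b[n] for n in path_a if n in path_b)

def sisters(node):
    """Nodes that share an immediate parent with `node`."""
    parent = tree[node][0]
    return [c for c, (p, _) in tree.items() if p == parent and c != node]

leaves = ["b_subtilis", "mycoplasma", "other_bacterium"]
query = "b_subtilis"
top_hit = min((g for g in leaves if g != query), key=lambda g: tree_distance(query, g))
print(top_hit)         # other_bacterium
print(sisters(query))  # ['mycoplasma']
```

The top hit is the more distantly related gene simply because it sits on shorter branches, while the true closest relative is pushed down the ranking by its long branch.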

Another major advantage of phylogenetic methods over most similarity methods comes from the process of masking (see above). For example, a deletion of a large section of a gene in one species will greatly affect similarity measures but may not affect the function of that gene. A phylogenetic analysis including these genes could exclude the region of the deletion from the analysis by masking. In addition, regions of genes that are highly variable between species are more likely to undergo convergence and such regions can be excluded from phylogenetic analysis by masking. Masking thus allows the exclusion of regions of genes in which sequence similarity is likely to be “noisy” or misleading rather than a biologically important signal. The pairwise sequence comparisons used by most similarity-based functional prediction methods do not allow such masking. Phylogenetic methods have been criticized because of their dependence (for most methods) on multiple sequence alignments that are not always reliable and unbiased. However, multiple sequence alignments also allow for masking, which is probably more valuable than the cost of depending on alignments.

The conditions described above and highlighted in Table 4 are just some examples of conditions in which evolutionary methods are more likely to make accurate functional predictions than similarity-based methods. Phylogenetic methods are particularly useful when the history of a gene family includes many of these conditions (e.g., multiple gene duplications plus rate variation) or when the gene family is very large. The principle is simple: the more complicated the history of a gene family, the more useful it is to try to infer that history. Thus, although the phylogenomic method is slow and labor intensive, I believe it is worth using if accuracy is the main objective. In addition, information about the evolutionary relationships among gene homologs is useful for summarizing relationships among genes and for putting functional information into a useful context.

Despite the evolution of these methods, and likely continued improvements in functional predictions, it must be remembered that the key word is prediction. All methods will sometimes make inaccurate predictions of function. For example, none of the methods described can perform well when gene functions can change with little sequence change, as has been seen in proteins like opsins (Yokoyama 1997). Thus, sequence databases and genome researchers should make clear which functions assigned to genes are based on predictions and which are based on experiments. In addition, all prediction methods should use only experimentally determined functions as their grist for predictions. This will hopefully limit the error propagation that can happen when an inaccurate prediction of function is used to predict the function of a new gene. Such propagation is a particular problem for the highest hit methods, as they rely on the function of only one gene at a time to make predictions (Eisen et al. 1997). Despite these and other potential problems, functional predictions are of great value in guiding research and in sorting through huge amounts of data. I believe that the increased use of phylogenetic methods can only serve to improve the accuracy of such functional predictions.

REFERENCES

Altschul, S.F., R.J. Carroll, and D.J. Lipman. 1989. J. Mol. Biol. 207: 647–653.

Altschul, S.F., T.L. Madden, A.A. Schaeffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. 1997. Nucleic Acids Res. 25: 3389–3402.

Atchley, W.R. and W.M. Fitch. 1997. Proc. Natl. Acad. Sci. 94: 5172–5176.

Blattner, F.R., G.I. Plunkett, C.A. Bloch, N.T. Perna, V. Burland, M. Riley, J. Collado-Vides, J.D. Glasner, C.K. Rode, G.F. Mayhew et al. 1997. Science 277: 1453–1462.

Bolker, J.A. and R.A. Raff. 1996. BioEssays 18: 489–494.

Eisen, J.A., D. Kaiser, and R.M. Myers. 1997. Nature (Med.) 3: 1076–1078.

Eisen, J.A., K.S. Sweder, and P.C. Hanawalt. 1995. Nucleic Acids Res. 23: 2715–2723.

Felsenstein, J. 1985. Am. Nat. 125: 1–15.

Fitch, W.M. 1970. Syst. Zool. 19: 99–113.

Gatesy, J., R. DeSalle, and W. Wheeler. 1993. Mol. Phylog. Evol. 2: 152–157.

Goldman, N., J.L. Thorne, and D.T. Jones. 1996. J. Mol. Biol. 263: 196–208.

Henikoff, S., E.A. Greene, S. Pietrokovski, P. Bork, T.K. Attwood, and L. Hood. 1997. Science 278: 609–614.

Henikoff, S., S. Pietrokovski, and J.G. Henikoff. 1998. Nucleic Acids Res. 26: 311–315.

Hillis, D.M. 1994. In Homology: The hierarchical basis of comparative biology (ed. B.K. Hall), pp. 339–368. Academic Press, San Diego, CA.

Maddison, W.P. and D.R. Maddison. 1992. MacClade. Sinauer Associates, Sunderland, MA.

Reeck, G.R., C. Haen, D.C. Teller, R.F. Doolittle, W.M. Fitch, R.E. Dickerson, P. Chambon, A.D. McLachlan, E. Margoliash, T.H. Jukes et al. 1987. Cell 50: 667.

Swofford, D.L., G.J. Olsen, P.J. Waddell, and D.M. Hillis. 1996. In Molecular systematics (ed. D.M. Hillis, C. Moritz, and B.K. Mable), pp. 407–514. Sinauer Associates, Sunderland, MA.

Tatusov, R.L., E.V. Koonin, and D.J. Lipman. 1997. Science 278: 631–637.

Tomb, J.F., O. White, A.R. Kerlavage, R.A. Clayton, G.G. Sutton, R.D. Fleischmann, K.A. Ketchum, H.P. Klenk, S. Gill, B.A. Dougherty et al. 1997. Nature 388: 539–547.

Weisburg, W.G., J.G. Tully, D.L. Rose, J.P. Petzel, H. Oyaizu, D. Yang, L. Mandelco, J. Sechrest, T.G. Lawrence, J. Van Etten et al. 1989. J. Bacteriol. 171: 6455–6467.

Yokoyama, S. 1997. Annu. Rev. Genet. 31: 315–336.

Zardoya, R., E. Abouheif, and A. Meyer. 1996. Trends Genet. 12: 496–497.


[Figure 1 diagram: flowchart titled "Phylogenetic Prediction of Gene Function" with the steps Choose gene(s) of interest, Identify homologs, Align sequences, Calculate gene tree, Overlay known functions onto tree, and Infer likely function of gene(s) of interest, traced for Example A (genes 1A, 2A, 3A, 1B, 2B, 3B from species 1 to 3, with a gene duplication) and Example B (genes 1 to 6), above a panel labeled Actual evolution (assumed to be unknown).]