Basic Models of Nucleotide Evolution

Over time, nucleotides within a sequence can ‘evolve’ through substitution. This process causes a nucleotide (T, C, A or G) to change into another nucleotide and is a major driving force behind evolution. For example, the nucleotide A in a sequence of DNA can change over time into the nucleotide C. If the sequence was previously involved in protein synthesis as an exon, this change may inactivate it or may change the protein that the sequence codes for. As proteins are the building blocks of organic life, this may cause large changes in an organism’s features. Alternatively, the change may have no effect at all. On average, this form of mutation only occurs once or twice every million years. However, in assessing the evolution of species over hundreds of millions of years, models are useful in evaluating how one sequence of nucleotides may have evolved from another.

Models of nucleotide evolution can be used when examining two sequences of DNA of the same length that may be related. Such a model compares the two sequences either by assuming that one sequence evolved into the other, or by assuming that both evolved from a common ‘ancestral’ sequence of DNA. Applying the model gives the estimated number of nucleotide substitutions per site, called the distance, which is then used to estimate a time. This time relates either to when one sequence evolved from the other or to how long ago an ‘ancestral’ sequence of DNA diverged into the two sequences.

In this paper, I will outline the principles and theory behind the main (most commonly used) models of nucleotide substitution, addressing each model chronologically and, in some senses, in order of increasing complexity. The models are as follows:

o Jukes and Cantor 1969 (JC69)
o Kimura 1980 (K80)
o Felsenstein 1981 (F81)
o Hasegawa, Kishino and Yano 1985 (HKY85)
o Tamura and Nei 1993 (TN93)

I will demonstrate how programming software may be used to process data using the formulae proposed within each model. From this I will explain how, continuing to use programming software, each model is capable of simulating the evolution of a nucleotide sequence over a given time.

JC69 Model

In models of nucleotide substitution, the rate of substitution from one nucleotide to another and the time over which substitution has been allowed to act are the key variables. Different models organise their use of rates in different ways, but time is always used in the same way. The simplest model of nucleotide substitution is the Jukes and Cantor 1969 (JC69) model. This model assumes that the rate of substitution is the same between all nucleotides. Therefore, it only requires a single parameter denoting rate, along with a value for time. A 4x4 matrix can be created showing the rates of nucleotide substitution between the 4 nucleotides. This is known as matrix Q:


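Q = [matrix not reproduced]

Along the diagonal of this matrix, the rates of nucleotides changing into themselves are not displayed, as they are not regarded as substitutions. Each diagonal entry is instead the negative of the sum of the other entries in its row, so the rows sum to 0. Using the rates in matrix Q, we can work out the probability of each nucleotide substitution occurring when t > 0, creating another matrix. This matrix is known as the transition probability matrix, P(t), and is also a 4x4 matrix:

P(t) = [matrix not reproduced]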

These formulae calculate the probability of one nucleotide evolving into another. They are obtained by exponentiating the matrix Q, that is P(t) = exp(Qt), using the matrix Taylor series. To use the matrix P(t) with real-world or experimental data, a program can be written which calculates the transition probability of each nucleotide substitution using the formulae in P(t). Python is a simple but effective programming language that can be used in these circumstances. We must first define a function that implements the formulae of the matrix P(t) when given certain values to work from. These values are called parameters and, in the case of working out the transition probabilities, we must input a value for the rate at which nucleotide substitutions occur as well as a value for the time over which substitutions occur.
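The exponentiation step can be illustrated numerically. The following is a minimal sketch, not the program from this report: it assumes the JC69 rate matrix has every off-diagonal entry equal to a single rate m and every diagonal entry equal to -3m, and it sums a truncated Taylor series for exp(Qt). The helper names mat_mul and expm_taylor are hypothetical.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm_taylor(q, t, terms=30):
    """Approximate exp(Q*t) by the truncated series sum over k of (Q*t)^k / k!."""
    n = len(q)
    qt = [[x * t for x in row] for row in q]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # k = 0 term
    term = [row[:] for row in result]
    for k in range(1, terms):
        # Turn (Qt)^(k-1)/(k-1)! into (Qt)^k/k!
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


# Assumed JC69 rate matrix: off-diagonal rate m, diagonal -3m so rows sum to 0.
m = 0.2
Q = [[-3 * m if i == j else m for j in range(4)] for i in range(4)]
P = expm_taylor(Q, 1.0)

# Compare the series with the usual closed-form JC69 probabilities.
print(P[0][0], 0.25 + 0.75 * math.exp(-4 * m))
print(P[0][1], 0.25 - 0.25 * math.exp(-4 * m))
```

A plain truncated series is adequate here because the entries of Qt are small; for large values of rate times time a more careful method would be needed.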

The following code, written in Python, emulates the matrix P(t):

[Code listing and output not reproduced]

As shown at the bottom of the image, the function ‘JC69’, which calculates the transition probabilities, is tested by inputting an experimental rate (0.2) and time (1). The result is a matrix displaying the probabilities row by row with nucleotide order T, C, A and G, in the same orientation as the matrix Q. Looking at the formulae used to calculate the transition probabilities, conclusions can be drawn about how increasing the rate or time will affect the resultant probabilities.

[Formulae X and Y not reproduced]
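Since the original listing is unavailable, the following is a hedged sketch of what a function of this kind might look like. The function name JC69, the test values (rate 0.2, time 1) and the nucleotide order T, C, A, G come from the text. The two formulae are the standard closed-form JC69 probabilities, assumed here to correspond to X (no change) and Y (change to one specific other nucleotide), with m the rate and t the time.

```python
import math


def JC69(m, t):
    """Return the 4x4 JC69 transition probability matrix (order T, C, A, G)."""
    e = math.exp(-4.0 * m * t)
    x = 0.25 + 0.75 * e   # X: same nucleotide after time t
    y = 0.25 - 0.25 * e   # Y: one specific different nucleotide after time t
    return [[x if i == j else y for j in range(4)] for i in range(4)]


# Test values quoted in the text: rate 0.2 and time 1.
P = JC69(0.2, 1)
for row in P:
    print(" ".join(f"{p:.4f}" for p in row))
```

Each row of the result sums to 1, since X + 3Y = 1 for any rate and time.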

The exponential (exp) of a negative value gives a number between 0 and 1. As the magnitude of the negative value increases, its exponential becomes smaller, and as the value tends to negative infinity, the exponential tends to 0. In the formulae X and Y above, as the values of m (rate) and t (time) increase, the value added to ¼ in X and the value subtracted from ¼ in Y become vanishingly small. This results in the transition probabilities tending towards ¼ for each nucleotide substitution. This supports the assumption that, over a long enough time or at a high enough rate, so many nucleotide substitutions will have occurred that the resulting nucleotide is effectively random, with a probability of ¼ for each nucleotide.

This is demonstrated in the following graph, taking increasing values for rate with a constant time of 1:

[Graph of Pii(t) and Pij(t) against rate not reproduced]
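The same trend can be tabulated rather than plotted. This is a small sketch under the assumption that the JC69 probabilities take the standard closed form used above; the helper name jc69_probs and the list of rates are illustrative only.

```python
import math


def jc69_probs(m, t):
    """Return (Pii, Pij) under the assumed closed-form JC69 formulae."""
    e = math.exp(-4.0 * m * t)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e


# Increasing rate with time held constant at 1, as in the text.
rows = [(m, *jc69_probs(m, 1.0)) for m in (0.0, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0)]
for m, p_same, p_diff in rows:
    print(f"m={m:<4} Pii={p_same:.4f} Pij={p_diff:.4f}")
```

Pii starts at 1 and falls towards ¼, while Pij starts at 0 and rises towards ¼.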

Pii(t) represents the probability that a nucleotide is unchanged after a period of time t. Pij(t) represents the probability that a nucleotide has been replaced by a particular other nucleotide after the same period. As the time over which a nucleotide sequence has been allowed to evolve tends to infinity, the proportion of nucleotides of each type (T, C, A, G) reaches ¼. This distribution of nucleotides is called the limiting distribution and, as the rates of change are the same for all nucleotides in the JC69 model, this proportion will be maintained. This proportional equilibrium is called the stationary distribution.

K80 Model

In 1980, Kimura proposed a model with a more complex mix of rates between nucleotide substitutions. This model is commonly known as the K80 model and uses two rates as parameters along with time. Nucleotide substitutions can be classified as one of two types: transitions and transversions. Transitions are substitutions between nucleotides of similar molecular structure, that is, between purines or between pyrimidines, and tend to occur more frequently than other substitutions. Nucleotides A and G are purines and are substituted for each other at a higher rate, as are nucleotides T and C, which are pyrimidines. All other substitutions are transversions and are known to occur less frequently than transitions. In 1980, the first mitochondrial sequences were published showing a definitive difference between the frequencies of transitions and transversions, transitions being noticeably higher. The K80 model was developed by Kimura in response to these findings.

The rate matrix (Q) in the K80 model displays two rates: alpha (the substitution rate of transitions) and beta (the substitution rate of transversions). In the following representation of the matrix Q, alpha = K and beta = 1, with rows and columns in the order T, C, A, G:

Q =
      T   C   A   G
  T   .   K   1   1
  C   K   .   1   1
  A   1   1   .   K
  G   1   1   K   .

As with the rate matrix for the JC69 model, the diagonal elements of the matrix Q are not included, as these are not regarded as substitutions. The total substitution rate for any nucleotide is a + 2b (K + 1 + 1). Deriving the transition probability matrix from the matrix Q is slightly more difficult than for the JC69 model. The transition probability matrix P(t) is as follows:

P(t) =
  p0(t)  p1(t)  p2(t)  p2(t)
  p1(t)  p0(t)  p2(t)  p2(t)
  p2(t)  p2(t)  p0(t)  p1(t)
  p2(t)  p2(t)  p1(t)  p0(t)

Where:

p0(t) = 1/4.0 + 1/4.0*exp(-4*b*t) + 1/2.0*exp(-2*(a+b)*t)
p1(t) = 1/4.0 + 1/4.0*exp(-4*b*t) - 1/2.0*exp(-2*(a+b)*t)
p2(t) = 1/4.0 - 1/4.0*exp(-4*b*t)

As with the JC69 model, we can create a program that emulates the transition probability matrix with relative ease by inputting the parameter values for alpha (a), beta (b) and time (t). Organising the formulae of the transition probability matrix in a similar way to the JC69 model defines the following function in Python:

[Code listing and output not reproduced]
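A sketch of such a function, written directly from the formulae p0(t), p1(t) and p2(t) and the matrix layout given above. The function name K80 is an assumption made by analogy with ‘JC69’; the test values a = 0.4, b = 0.2 are the ones used in this report, and the extra time points are illustrative.

```python
import math


def K80(a, b, t):
    """Return the 4x4 K80 transition probability matrix (order T, C, A, G)."""
    e1 = math.exp(-4.0 * b * t)
    e2 = math.exp(-2.0 * (a + b) * t)
    p0 = 0.25 + 0.25 * e1 + 0.5 * e2   # no change
    p1 = 0.25 + 0.25 * e1 - 0.5 * e2   # transition (T<->C or A<->G)
    p2 = 0.25 - 0.25 * e1              # one specific transversion
    return [[p0, p1, p2, p2],
            [p1, p0, p2, p2],
            [p2, p2, p0, p1],
            [p2, p2, p1, p0]]


# First row of P(t) at a few times, with a = 0.4 and b = 0.2.
for t in (0.0, 1.0, 5.0, 10.0):
    print(t, [round(p, 4) for p in K80(0.4, 0.2, t)[0]])
```

Each row sums to 1 because p0 + p1 + 2*p2 = 1 for any a, b and t.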

The function is tested using the parameters a = 0.4, b = 0.2, t = 1. The transition probabilities for nucleotides experiencing no substitution after t = 1 are high, whereas the transition probabilities for transitions and transversions are relatively low in comparison. When considering the formulae used to calculate these probabilities, certain inevitable trends are recognisable:

[Formulae x, y and z not reproduced]

x represents the probability of a nucleotide experiencing no change over a given time. When t = 0, x = 1; from this point, x decreases exponentially to the value of ¼. y represents the probability of a nucleotide experiencing a transition (A <-> G or T <-> C) over a given time. At t = 0, the value of y is 0: when no time has passed, the probability of a given nucleotide experiencing any sort of substitution is 0. This is also true for transversions, represented by equation z. As time increases from 0, the transition probabilities for both transversions and transitions increase, tending towards ¼. As the rate of transitions is higher than that of transversions, the transition probabilities for transitions increase towards ¼ more quickly. The following graph represents the changes in the transition probabilities for transitions, transversions and sites with no substitution as time increases:

[Graph not reproduced]

To create this graph, the values of alpha and beta were set to 0.4 and 0.2 respectively. These values reflect the observation that transitions occur at a higher frequency than transversions. Time ranges from 0 to 10 in steps of 0.1.

HKY85 and TN93 Models

Hasegawa, Kishino and Yano developed a model in 1985 that combined elements of both the K80 and F81 models. This is known as the HKY85 model and incorporates multiple parameters to create a more realistic simulation of how nucleotide sequences behave. First of all, the HKY85 model assumes that the rates of substitution differ between nucleotides. A single value defines the rate at which any nucleotide is substituted by a given target nucleotide. For example, the value for T defines the rate at which any other nucleotide is substituted to produce the nucleotide T. These values are known as base frequencies and, within this model, the base frequencies are allowed to be unequal. Further parameters are included to distinguish between the rates of transitions and transversions, as in the K80 model. After the first mitochondrial sequences were published in 1980, the difference between the rates of transitions and transversions was made definitive, and so most nucleotide evolution models created after 1980 incorporate parameters that define the rates of transitions and transversions separately. The HKY85 model is seen to give a more accurate representation of nucleotide substitutions than the JC69, K80 and F81 models by accommodating multiple factors. The following image represents the rate matrix Q:

[Rate matrix Q not reproduced]

The matrix is organised as the rate matrices for all previous models have been: the columns and rows are in the nucleotide order T, C, A, G. Within this representation of the matrix Q, K marks the transitions. All other substitutions are transversions, apart from the diagonal values of the matrix, which are not substitutions. πT represents the rate of substitutions resulting in the nucleotide T, as mentioned before, πC represents the rate of substitutions resulting in the nucleotide C, and so on.

Deriving the transition probability matrix P(t) is not as simple as with the previous models, because the entries of Q now depend on the base frequencies as well as on K. The matrix Q is first diagonalized, and the resulting diagonal matrix is then exponentiated to produce the matrix P(t):

[Matrix P(t) not reproduced]

Where:

[Definitions not reproduced]

Most of the transition probabilities differ for each substitution within this model; this emulates how nucleotides behave in real life more closely than the previous models. More factors are taken into account to achieve this, and so the formulae increase in complexity as they accommodate a larger number of variables. Writing a function to carry out the formulae in the transition probability matrix is slightly more time-consuming than for previous models but it is still achievable:

[Code listing and output not reproduced]
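The closed-form HKY85 formulae are not available here, so the sketch below takes a different route: it builds a rate matrix and exponentiates it numerically by scaling and squaring, rather than by the diagonalization described in this report. It assumes the usual HKY85 form, in which the rate from i to j is K times πj for transitions and πj for transversions. The names hky85_rate_matrix, mat_mul and transition_matrix, and the parameter values, are illustrative assumptions.

```python
def hky85_rate_matrix(kappa, pi):
    """Rate matrix Q in the order T, C, A, G (assumed HKY85 form)."""
    transitions = {(0, 1), (1, 0), (2, 3), (3, 2)}  # T<->C and A<->G
    q = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                factor = kappa if (i, j) in transitions else 1.0
                q[i][j] = factor * pi[j]   # rate into j scales with pi[j]
        q[i][i] = -sum(q[i])               # diagonal makes the row sum to 0
    return q


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, squarings=10, terms=12):
    """Approximate exp(Q*t): Taylor series on Q*t/2**squarings, then square."""
    n = len(q)
    scale = t / 2 ** squarings
    small = [[x * scale for x in row] for row in q]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


pi = [0.1, 0.2, 0.3, 0.4]        # illustrative base frequencies for T, C, A, G
Q = hky85_rate_matrix(2.0, pi)   # illustrative value of K
for t in (0.0, 1.0, 50.0):
    print(t, [round(p, 4) for p in transition_matrix(Q, t)[0]])
```

At t = 0 the result is the identity matrix, and for large t every row approaches the base frequencies.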

Parameters for time, transition rate, transversion rate and the base frequencies must be defined in order to generate the transition probability matrix. The function is then tested with experimental parameters, generating the matrix at the bottom of the image. At t = 0, the diagonal elements of P(t) are 1 whilst all other values are 0. This is because at t = 0 we would not expect any substitutions to have occurred in a nucleotide sequence. As time tends to infinity, the diagonal elements decrease and all other elements increase, each towards the base frequency of its target nucleotide. This is the result of the nucleotides in the sequence reaching a stationary distribution, in which the proportion of each nucleotide matches its base frequency. These proportions are maintained, as further substitutions continue to generate the same proportions of nucleotides. Therefore, in this case, the stationary distribution is also the limiting distribution.

The difference between the rates of transitions and transversions was well established and is reflected in most nucleotide models created after the K80 model. However, a further difference in rates can be distinguished within transitions. Nucleotides A and G are purines and nucleotides T and C are pyrimidines, the difference being the molecular structure of the nucleotides. Generally, transitions between purines and transitions between pyrimidines tend to occur at different rates; therefore, a more recent model than those discussed so far was developed to accommodate this factor. In 1993, Tamura and Nei proposed a new model, which included parameters that distinguish between the transition rates of pyrimidines and purines. This model is commonly known as the TN93 model and introduces the parameters alpha1 and alpha2 in place of the single alpha parameter used for transition rates in the HKY85 model. The rate matrix for this model is therefore very similar to that of the HKY85 model, as is the transition probability matrix:

Matrix P(t) = [matrix not reproduced]

Where:

[Definitions not reproduced]
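To make the relationship between the two rate matrices concrete, the following sketch builds a TN93-style rate matrix and shows that setting alpha1 equal to alpha2 gives an HKY85-style matrix. The form assumed here (rate from i to j equal to alpha1, alpha2 or beta multiplied by πj) is the usual presentation of TN93, not a transcription of the matrix in this report; the function name tn93_rate_matrix and all parameter values are illustrative.

```python
def tn93_rate_matrix(alpha1, alpha2, beta, pi):
    """Rate matrix Q in the order T, C, A, G (assumed TN93 form).

    alpha1: transitions between pyrimidines (T<->C)
    alpha2: transitions between purines (A<->G)
    beta:   transversions
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i == j:
                continue
            if {i, j} == {0, 1}:
                rate = alpha1
            elif {i, j} == {2, 3}:
                rate = alpha2
            else:
                rate = beta
            q[i][j] = rate * pi[j]
        q[i][i] = -sum(q[i])   # diagonal makes the row sum to 0
    return q


pi = [0.1, 0.2, 0.3, 0.4]                      # illustrative base frequencies
tn93 = tn93_rate_matrix(3.0, 2.0, 1.0, pi)     # separate transition rates
hky_like = tn93_rate_matrix(2.0, 2.0, 1.0, pi) # alpha1 == alpha2: one shared rate
for row in tn93:
    print([round(x, 2) for x in row])
```

The two matrices differ only in the T<->C entries and the corresponding diagonal values.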

Simulation of nucleotide sequences

The previously discussed models of nucleotide substitution all allow for the generation of probabilities that describe how a nucleotide sequence will evolve, or has evolved, based on likelihood. From this, a function can be used to simulate how a sequence of nucleotides may evolve based on these probabilities. For example, taking the principles of the simplest model, JC69, we can say that the probabilities for a nucleotide changing into each of the other nucleotides are equal. Therefore, when simulating a scheduled substitution of a nucleotide, because each transition probability is the same, the target nucleotide can be chosen uniformly at random and the sequence mutated. If the transition probabilities were unequal, the target nucleotide would still be chosen at random but with a bias favouring the more probable substitutions.

A function must be designed to first generate a random time at which a mutation will occur, based on the total substitution rate of all the nucleotides of the sequence using the rate matrix Q. A time interval over which mutations will occur must be specified; for simplicity, the interval from t = 0 to t = 1 is often used (timex). To begin mutation, a sequence of nucleotides must be provided; through the use of a function, a nucleotide sequence of any length can be generated (genseq). Using the timex function, a list of times is generated when a rate is input into the function. In this case, the total rate for all nucleotides of the sequence is input and a list of times is generated randomly; these are used as the times of mutation. This technique cannot be used for more complex models of nucleotide evolution, as they assume unequal transition probabilities, and so after a substitution the total rate would change with the departure of one nucleotide and the creation of a new one.

When basing simulation on the JC69 model, the transition probability matrix for the JC69 model is used to generate the probabilities for mutations or for no changes. The genseq and timex functions are used to generate a sequence of nucleotides and then to create a list of times at which mutations will take place. Please refer to the Functions section towards the end of this report for definitions of each function.

The following is a sequence of nucleotides before and after mutation using the JC69 transition probability matrix:

Before

[Sequence not reproduced]

After

[Sequence not reproduced]

Although 5 differences are visible between the initial sequence and the sequence after mutation, 7 mutations actually occurred. Two of the mutations acted on the same nucleotide, the 8th, with the second mutation returning it to its starting state (nucleotide C). The 7 mutations were obtained using the timex function with a value of 4.5 for rate (at).

Simulation of mutation using the K80 model requires a slightly different method, as does simulation using the HKY85 and TN93 models, due to the differing principles and parameters of each model. These principles are easily summarised:

K80 - as transitions and transversions occur at different rates and must be distinguished, the function written for simulating mutation under the probabilities generated by the K80 model accounts for this. Transition mutations and transversion mutations are therefore applied to the nucleotide sequence at different rates.

HKY85 - as the HKY85 model uses several different parameters, and therefore rates, to distinguish probabilities, the function written to simulate under the principles of this model uses multiple rates when conducting a mutation. Also, as each nucleotide is subject to a different rate of mutation, the total rate used by the timex function is updated after any nucleotide is mutated into another, to account for this change.

TN93 - the function simulating mutation under the principles of the TN93 model acts in the same way as the function used for the HKY85 model. The only difference is that the TN93 model introduces an additional rate, splitting the rate alpha (transitions) into alpha1 (transitions between pyrimidines) and alpha2 (transitions between purines).
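The idea of updating the total rate after each mutation can be sketched as follows. This is not the appendix code: it is a minimal event-by-event simulation under an assumed HKY85-style rate matrix, in which the waiting time to the next mutation is drawn from the current total rate, the mutated site is chosen in proportion to its rate of change, and the total is then adjusted. The names hky85_rate_matrix and simulate, and all parameter values, are hypothetical.

```python
import random

NUCS = "TCAG"


def hky85_rate_matrix(kappa, pi):
    """Rate matrix Q in the order T, C, A, G (assumed HKY85 form)."""
    transitions = {(0, 1), (1, 0), (2, 3), (3, 2)}
    q = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                q[i][j] = (kappa if (i, j) in transitions else 1.0) * pi[j]
        q[i][i] = -sum(q[i])
    return q


def simulate(seq, q, t_end=1.0):
    """Mutate a non-empty sequence over the interval 0 to t_end.

    Returns the mutated sequence and the number of mutations applied.
    """
    idx = [NUCS.index(c) for c in seq]
    leave = [-q[i][i] for i in range(4)]     # total rate of leaving each nucleotide
    total = sum(leave[i] for i in idx)       # total rate over the whole sequence
    t, events = 0.0, 0
    while True:
        t += random.expovariate(total)       # waiting time to the next mutation
        if t > t_end:
            break
        # Choose the site in proportion to its rate, then the target nucleotide.
        site = random.choices(range(len(idx)), weights=[leave[i] for i in idx])[0]
        old = idx[site]
        new = random.choices(range(4), weights=[0.0 if j == old else q[old][j]
                                                for j in range(4)])[0]
        idx[site] = new
        total += leave[new] - leave[old]     # update the total rate
        events += 1
    return "".join(NUCS[i] for i in idx), events


Q = hky85_rate_matrix(2.0, [0.1, 0.2, 0.3, 0.4])   # illustrative parameters
print(simulate("TCAGTCAGTCAGTCAGTCAG", Q))
```

Replacing the rate matrix with a K80-style or TN93-style matrix would leave the simulation loop unchanged.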

The functions written for the simulation of the mutation of a nucleotide sequence are included in the appendix and are labelled accordingly.

Maximum Likelihood Estimates (MLE) - JC69 & K80 Models

Maximum likelihood estimates are used to estimate parameter values for a statistical model when applying that model to a data set. In the case of nucleotide substitutions, the statistical models fitted to data are the models of nucleotide substitution, and the parameter estimated is the combined value of rate and time. Rate and time are dealt with as a single value as they cannot be distinguished from one another: the same value (at) can be produced by many different combinations of values of alpha and time. The data set used will be two sequences of nucleotides of equal length, of which one will be assumed to have evolved from the other through several mutations. The total length of a sequence is represented by the letter n, and the number of differences (the number of nucleotides which differ between the two sequences) is represented by the letter k.

JC69

To explain the theory behind acquiring the maximum likelihood estimate, the binomial distribution must be considered. The following is the probability mass function (pmf) of the binomial distribution, where C(n, k) = n! / (k! (n - k)!) and the notation pow(x, y) represents the value x to the power of y:

P(k) = C(n, k) * pow(p, k) * pow(1 - p, n - k)

n = the total length of a sequence.
k = the number of differences between the two sequences.

The probability mass function gives the probability of observing exactly k differences for a proposed value of the variable at. Viewed as a function of at, with k fixed at its observed value, it is the likelihood: the higher the value, the more probable the observed data are under that proposed value of at. The variable p is replaced by the equation from the transition probability matrix of the JC69 model that gives the probability of a mutation occurring. The term 1 - p is replaced by the equation from the transition probability matrix of the JC69 model that gives the probability of a mutation not occurring. The following equation is the probability mass function, altered to include the variables mentioned above, with the total length of a sequence (n) as 100 and the number of differences (k) as 40:

Probability mass function = l

[Equation not reproduced]

The variable m represents the value at. The value of at with the highest probability could be found through trial and error; however, using Python, all values of at within an interval can be tested and plotted on a graph:

[Graph not reproduced]
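A scan of this kind can be sketched as follows. The likelihood below is an assumed reconstruction, not the equation from this report: it is the binomial pmf with p taken to be the standard JC69 probability that a site differs, 3/4 - 3/4*exp(-4*m), using n = 100 and k = 40 as in the text. The function name jc69_likelihood and the grid spacing are illustrative, and math.comb requires Python 3.8 or later.

```python
import math


def jc69_likelihood(m, n=100, k=40):
    """Binomial likelihood of k differences in n sites for m = rate * time."""
    p = 0.75 - 0.75 * math.exp(-4.0 * m)   # assumed probability that a site differs
    return math.comb(n, k) * pow(p, k) * pow(1.0 - p, n - k)


# Test every value of m from 0.001 to 0.400 in steps of 0.001.
grid = [i / 1000 for i in range(1, 401)]
best = max(grid, key=jc69_likelihood)

# Under this assumed likelihood the maximum also has a closed form.
closed_form = -0.25 * math.log(1.0 - 4.0 * 40 / (3.0 * 100))
print(best, round(closed_form, 4))
```

The grid value and the closed-form value should agree to within the grid spacing, which provides a simple check on the scan.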

The probability mass function equation displayed above was used to generate the data to plot this graph. The values of m (at) within the interval 0 to 0.4 were tested and a peak probability was found. The peak represents the value of m (at) with the highest probability of resulting in the value of k, and is therefore the maximum likelihood estimate. In this case, the maximum likelihood estimate is 0.19 for at.

K80

Finding the maximum likelihood estimate under the K80 model is approached in a very similar way to the JC69 model. The probability mass function is adjusted so that two values are estimated, as there are two rate parameters in the K80 model, alpha and beta.

Pmf = p0^(n - k - j) * p1^k * p2^j

Where:

p0 = the equation from the transition probability matrix of the K80 model that gives the probability of no mutation occurring.
p1 = the equation from the transition probability matrix of the K80 model that gives the probability of a transition mutation occurring.
p2 = the equation from the transition probability matrix of the K80 model that gives the probability of a transversion mutation occurring.
n = the total length of a sequence.
k = the number of differences between two sequences that have resulted from transition mutations.
j = the number of differences between two sequences that have resulted from transversion mutations.

Probability mass function = l

[Equation not reproduced]

a and b represent the values for the rates of transitions (alpha) and transversions (beta) respectively. Using Python, a table can be generated showing the probability for each combination of values of a and b. The values in this table can be plotted graphically using a contour plot. The following is a contour plot generated using the probability mass function displayed above, where the total length of a sequence (n) is 100, the number of differences that have resulted from transition mutations (k) is 30 and the number of differences that have resulted from transversion mutations (j) is 10:

[Contour plot not reproduced]
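The table of values can be sketched as below. This is an illustration under stated assumptions rather than a reproduction of the contour plot: it uses the Pmf form and the K80 formulae given in this report with time absorbed into a and b, works with the logarithm of the Pmf to avoid very small numbers, and scans an illustrative grid from 0.01 to 1.00. The function name k80_log_likelihood is hypothetical, and the location of the maximum it reports is a property of this sketch and need not coincide with a reading taken from the original plot.

```python
import math


def k80_log_likelihood(a, b, n=100, k=30, j=10):
    """Log of p0^(n-k-j) * p1^k * p2^j, with a and b standing for rate * time."""
    e1 = math.exp(-4.0 * b)
    e2 = math.exp(-2.0 * (a + b))
    p0 = 0.25 + 0.25 * e1 + 0.5 * e2
    p1 = 0.25 + 0.25 * e1 - 0.5 * e2
    p2 = 0.25 - 0.25 * e1
    return (n - k - j) * math.log(p0) + k * math.log(p1) + j * math.log(p2)


# Table of log-likelihoods over a grid of (a, b) pairs.
values = [i / 100 for i in range(1, 101)]
table = {(a, b): k80_log_likelihood(a, b) for a in values for b in values}

# The grid point with the highest value approximates the maximum likelihood estimates.
best_a, best_b = max(table, key=table.get)
print(best_a, best_b)
```

The entries of the table are the values a contour plot would display, with a on one axis and b on the other.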

The lines become concentrated around the maximum likelihood estimates for the values of alpha (rate of transitions) and beta (rate of transversions). The estimate for the most probable value of b is clearly centred on the interval between 0.12 and 0.14. Unfortunately, the value for a is not visible, as the limits of this contour graph do not show where the lines of the graph centre on the y-axis.

Maximum likelihood estimates are used in conjunction with models of nucleotide evolution mainly to estimate the time taken for one sequence of nucleotides to evolve into another, assuming that one sequence is the ancestor of the other. Only a value for at, the product of rate and time, can be estimated directly; however, if an average rate (or rates, in the case of multiple-parameter models) is known, the time can be separated out, and so the time taken for one sequence to mutate into the other can be calculated. In practice, biologists and statisticians have adopted this method when attempting to calculate the time taken for particular species (such as humans) to have evolved from ancestral species (such as earlier primates). By assessing sections of DNA of the same length from the two species, the number of differences may be recorded and used to estimate a time using the maximum likelihood method.

Conclusions

As my investigation was not an experiment as such, but rather the translation of statistical models into software so that these models can be used in practical situations, my conclusion is that the programmes I have written to emulate these statistical models have been successful and so may be applied to practical data sets. This translational link between statistical models and computing software embodies the basic principles of bioinformatics and demonstrates how statisticians and biologists can use these models when dealing with mutated sequences of DNA.

If I had further research time, and possibly slightly more options in terms of computing software, there are multiple areas that I would have expanded within my project and report. First of all, I would have included a step-by-step explanation of the Taylor series expansion, allowing readers to understand the mathematical theory behind obtaining the transition probability matrix from the rate matrix of a nucleotide model. Also, I would have explored further models of nucleotide evolution, as there are many more significant models that have not been mentioned. These models would have broadened the scope of my project and would have depicted further steps by which each model was chronologically improved. I also believe that the last section of this report, the maximum likelihood estimation for the JC69 and K80 models, could be developed further. With access to alternative computing software that could plot multi-dimensional graphs, I would have extended the calculation of maximum likelihood estimates to the parameters of the HKY85 and TN93 models.

References:

Computational Molecular Evolution (Yang 2006)
www.wikipedia.org
www.python.org
http://docs.python.org/lib/module-random.html
http://www.tau.ac.il/~doronadi/F81_model.doc
http://www.megasoftware.net/WebHelp/part_iv___evolutionary_analysis/computing_evolutionary_distances/distance_models/nucleotide_substitution_models/hc_jukes_cantor_distance.htm
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=hmg.figgrp.1080
Evolutionary Trees from DNA Sequences: A Maximum Likelihood Approach (Joseph Felsenstein 1981)
A Novel Use of Equilibrium Frequencies in Models of Sequence Evolution (Nick Goldman and Simon Whelan)

Functions

Genseq - the generation of a random sequence of nucleotides is essential to the simulation of nucleotide substitution. To define a function to generate a sequence, a parameter for the length of the sequence must be defined. In this case, n is used. The function randomly chooses a letter, representing a nucleotide, from the list “ACGT” using the in-built ‘randint’ function. The chosen letter is added to a list; the process of choosing a letter is repeated n times, creating a list or ‘sequence’ n nucleotides long.
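A sketch of a function matching this description, since the original listing is not available. The name genseq, the parameter n, the list “ACGT” and the use of randint come from the text; the exact body is an assumption.

```python
import random


def genseq(n):
    """Return a list of n nucleotides, each chosen at random from "ACGT"."""
    letters = "ACGT"
    seq = []
    for _ in range(n):
        # randint includes both end points, so this picks index 0, 1, 2 or 3.
        seq.append(letters[random.randint(0, 3)])
    return seq


print("".join(genseq(20)))
```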

Timex - this function generates a cumulative set of times that represent when mutations will occur in a nucleotide sequence. It is only used within the simpler models of substitution, as it assumes that transition probabilities are the same for each nucleotide. An in-built function (random.expovariate) takes a value for rate as a parameter and generates a random waiting time using this rate. Inputting a higher rate increases the probability of the in-built function generating a smaller value. Values are generated using the same rate and are summed cumulatively to represent the times at which events occur according to the input rate. This process is terminated when the cumulative time exceeds 1, as we are only interested in mutations occurring between times 0 and 1. This function is effective because, theoretically, if events occur at a higher rate, more events will occur in a given time.
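A sketch of a function matching this description. The name timex, the use of random.expovariate and the cut-off at time 1 come from the text; the exact body is an assumption. The rate of 4.5 in the usage line is the value quoted earlier in this report.

```python
import random


def timex(rate):
    """Return the cumulative event times between 0 and 1 for the given rate."""
    times = []
    t = random.expovariate(rate)       # waiting time to the first event
    while t <= 1.0:
        times.append(t)
        t += random.expovariate(rate)  # add the waiting time to the next event
    return times


mutation_times = timex(4.5)
print(len(mutation_times), [round(t, 3) for t in mutation_times])
```

The number of times returned varies from run to run, with an average close to the rate.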

Intgen - this function was created to generate a list, of length n, of random numbers. These random numbers denote the positions at which mutations will occur. The timex function is initially used to calculate the number of mutations that will occur in an allotted time. The number of calculated mutations then determines the length of the list of random numbers. Each number within this list refers to the nth nucleotide of the sequence being mutated. That nucleotide will then be mutated.
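A sketch of how the positions might be generated and applied under the JC69 model. The name intgen comes from the text, but its two-argument signature and the helper mutate_jc69 are assumptions made for illustration; the number of mutations is passed in directly here rather than being taken from timex.

```python
import random


def intgen(n_mutations, seq_length):
    """Return n_mutations random positions, each between 0 and seq_length - 1."""
    return [random.randint(0, seq_length - 1) for _ in range(n_mutations)]


def mutate_jc69(seq, positions):
    """Apply one JC69-style mutation at each listed position, in order."""
    seq = list(seq)
    for pos in positions:
        # Under JC69 every other nucleotide is an equally likely target.
        targets = [c for c in "TCAG" if c != seq[pos]]
        seq[pos] = random.choice(targets)
    return seq


before = list("TCAGTCAGTCAG")
positions = intgen(7, len(before))
after = mutate_jc69(before, positions)
print("".join(before))
print("".join(after))
```

Because a position can be drawn more than once, the number of visible differences can be smaller than the number of mutations applied.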

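Appendix:

K80 - Function for simulation of mutation of nucleotide sequence

[Code listing not reproduced]

HKY85 - Function for simulation of mutation of nucleotide sequence

[Code listing not reproduced]

TN93 - Function for simulation of mutation of nucleotide sequence

[Code listing not reproduced]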