Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
2/19/17
1
Quantifyingsequencesimilarity
BasE.DutilhSystemsBiology:BioinformaticDataAnalysis
UtrechtUniversity,February16th 2016
Afterthislecture,youcan…… definehomology,similarity,andidentity… givetwoexamplesofDNAsubstitutionmatrices… listfourpropertiesofaminoacidsthatmightbeimportantindeterminingtheirphysico-chemicalsimilarity… deriveBLOSUMscoresfromcountsofalignedresidues… interpretBLOSUMscoresandBLOSUM-basedalignmentscoresbyperformingcalculationswithlog-oddsvalues… explainwhentousematricesbasedonhigh/lowBLOSUMnumbers(e.g.45/62/80)andwhytheyaresuitable… listsomeoftheelementsinamodelofevolution,explainwhatitisandwhyweuseit… convertmutationaldistanceintoevolutionarydistancebycorrectingformutationalsaturationwithJukes-Cantor
2/19/17
2
Homology• Phylogenetictreesand
sequencealignmentscontainalotofinformationaboutfunctionandevolutionofsequences
• Buttheyonlymakesenseifthesequencesdescendfromacommonancestor,i.e.theyareevolutionarilyrelated
WeareNEVERallowedtointerprettreesoralignments,UNLESStheyaremadewithhomologoussequences!
… sohowdowedetermineifsequencesshareacommonancestor?(i.e.arehomologs?)
Similarity• Thingsthatlookreallysimilarprobablyshareacommonancestor
2/19/17
3
Identity• Alignmentmeans:addinggapsinoneand/ortheother
sequenceuntiltheyarebothequallylong:
• Beforealignment:
• Afteralignment:
• Alignedsequencesallowustocalculatepercentidentity:– 8differentpositions,and31identicalpositions– Thetwosequencesare100*=79%identical
• Identitycanbequantified,buthomologycannot!– Saying“79%homologous”iswrong (eventhoughtoomanypeoplesayit)
AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
3139
Definitions
• Similarity– Percentageofaminoacidresiduesinanalignmentwithapositivesubstitutionscore
– NotusedforDNA
• Homology– Propertyoftwosequencesthathaveasharedancestor
– HomologyisTRUEorFALSE:eitheryou’refamilyoryou’renot
• Identity– Percentageofidenticalresiduesinanalignment
– Usedforaminoacidsornucleotides
2/19/17
4
– Thedistancebetweenvirus3andvirus7:0.77mutations/site
Evolutionarydistance• Theevolutionarydistancebetweentwosequencesisoftenexpressedasthefrequencyofmismatches
http://epidemic.bio.ed.ac.uk/how_to_read_a_phylogeny
Thebranchlengthrepresentstheevolutionarytimebetweentwonodes.Unit:substitutionspersequencesite.
Thehorizontalbranchesrepresentevolutionarylineages
changingovertime.
• Assumption:positionsevolverandomlyandindependently
DNAsubstitutionmatrices• Similarityisquantifiedwithasubstitutionmatrix
– Scoresformatchesinasequencealignment– Penaltiesformismatchesinasequencealignment
• Theidentitymatrixisthemostoftenusedfor
A
C
T
G
Transitions:A↔GC↔T
Transversions:A↔C A↔TC↔G G↔T
A C G TA 1C -2 1G -1 -2 1T -2 -1 -2 1
A C G TA 1C -1 1G -1 -1 1T -1 -1 -1 1
Identitymatrix
scoringDNAsequencealignments• Not all nucleotidesubstitutions areequally likely
– Transitionsoccurabouttwiceasoftenastransversions
2/19/17
5
Identityandsimilarity• Thesethreesequencesareall66.7%identical…• …butwhatarethosecolors? *
– Someaminoacidsaremoresimilarthanothers
Glutamic acid(E)Asparticacid(D)Cysteine (C)
GeorgeOrwellAnimalFarm
(*hint)
Aminoacidproperties
http://swift.cmbi.ru.nl/teach/B1B/HTML/PosterA4_nbic_new.pdf
2/19/17
6
Thesimilaritysignal• Whatisthemostrelevant definitionofa“similaritysignal”betweenaminoacids?– Size?– Polarity?– Hydrophobicity?– Preferredproteinfold?– …anyothermeasuresthatmightbeimportant?
• Similaritysignalofwhat,exactly? *– “Ofthefunctionoftheaminoacidintheprotein”
– “Ofanevolutionaryrelationship”
• Evolutionhastestedthisforus– Solet’suseevolvedsequencestoscoresimilarity!
(*hint)
Which amino acids mutate into each other?• Weusealignmentstodiscoverwhichaminoacidsaremoresimilarthanothers,accordingtoevolution
• Wequantifythisbycountinghowoftentwoaminoacidsreallymutatedintoeachotherduringevolution
• Themostrelevantsignaltodetecthomologyisbasedonreliable,well-alignedhomologs
Well-alignedhomologs
2/19/17
7
BLOcks SUbstitution Matrix(BLOSUM)1. Takesomealignedhomologoussequences
2. Grouphighlyidenticalsequences(e.g.>62%identity)– Toremoveredundancybiasesinthesequencedatabase
3. Identifywell-alignedblocks– Sothatonlyrealmutationsarecompared(reliable,well-alignedhomologs)
4. Counthowofteneachpairoftwoaminoacidsmutatedintoeachother(i.e.howoftentheyarealigned)
• BLOSUMdefinessimilaritybetweenaminoacidsbasedonthestatisticalprobabilitythattheyaligninwell-alignedhomologs– Observed/expectedratio (oddsratio)
• An obeserved/expected ratioof2means something happenstwice asoften asyou expected by randomchance
– Observed:aligned inwell-aligned homologs– Expected:“aligned”inunaligned sequences
• Theobserved/expectedratioforawholealignmentcanbecalculatedbymultiplying theratiosforallalignedaminoacids– Thisisalotofmultiplyingforlongsequences
• Logarithmsallowustosum thescoresofalignedresidues– Mucheasiertouseforascoringsystem
• Hence:theBLOSUMscore
observedexpected
alignedinwell-alignedhomologs“aligned”inunalignedsequences
BLOSUMscore:
Howtoquantifythissimilarity?
log ( ) Si,j =2·log2( )Fi,jFi ·Fj
→
2/19/17
8
Anexample• Let’ssaythatwehaveawell-alignedblockof:
– 100aminoacidslong– 1,000proteins“deep”– Withnogaps
• 7.4%arealanine (A)• 1.3%aretryptophan(W)• Randomly,weexpect afractionofA-Walignmentsof:
• Inreality,weobserve afractionofA-Walignmentsof:0.00034– TheA-Wmutationoccurredlessfrequentlyinevolutionthanexpected!– …sothesubstitutionscoremustbenegative
• TheA-Wsubstitutionscoreis:Si,j =2·log2( )Fi,j
Fi ·FjSA,W =2·log2( )0.00034
0.000962 =-3
FA ·FW =0.074·0.013=0.000962
7,400100·1,000→FA ==0.074
→FW =0.013
BLOSUM62
Aromatic
Hydrophobic
Positive
Polar,non-hydrophobic
SmallLowforinfrequentsubstitutions
Highforfrequentsubstitutions
Lowforcommonaminoacids
Highforrareaminoacids
• Whyarethehighestnumbersineachrow/columnonthediagonal?– Because inwell-aligned homologs,every amino acidismostoften aligned toitself (i.e.it isnot mutated)
Substitutionmatrix
2/19/17
9
Sowhatdothenumbersmean?
• Ascoreof-3forthesubstitutionA-Wmeansthatthissubstitutionisobservedinwell-alignedhomologs0.35timesasoftenasexpected:
– …or:=2.83timeslessoftenthanexpected
• Ascoreof2forthesubstitutionD-Emeansthatthissubstitutionisobservedinwell-alignedhomologstwiceasoftenasexpected:
• Ascoreof9forthesubstitutionC-Cmeansthatthissubstitutionisobservedinwell-alignedhomologs22.6timesasoftenasexpected:
2(-3/2) =0.35
2(2/2) =2
Si,j =2·log2( )Fi,jFi ·Fj
2(9/2) =22.6
Fi,jFi ·Fj2(Si,j/2) =→
10.35
0.000340.000962=
Exercise
Aromatic
Hydrophobic
Positive
Polar,non-hydrophobic
Small
• Calculate thesimilarity scorebetween thesesequences:a) KAWSADV
b) HAMNWEQc) RDWSACV
– 1+4– 1+1– 3+2– 2=0
0– 2– 1+1– 3+5– 2=– 22– 2+11+4+4– 3+4=20
2/19/17
10
Oddsratiothatsequencesarewell-alignedhomologs• Howlikelyisitthattwosequencesarewell-alignedhomologs?
– seqA – seqB :-1+-3+-2=-6– seqA – seqC :-1+-2+-2=-5– seqB – seqC :4+2+4=10
• WhenyouobserveRYDalignedtoSDA,thesesequencesare0.125timesmorelikelytobewell-alignedhomologsthanexpected– …or8timeslesslikelytobewell-alignedhomologsthanexpected
• WhenyouobserveRYDalignedtoSEA,thesesequencesare0.177timesmorelikelytobewell-alignedhomologsthanexpected– …or5.6timeslesslikelytobewell-alignedhomologsthanexpected
• WhenyouobserveSDAalignedtoSEA,thesesequencesare32timesmorelikelytobewell-alignedhomologsthanexpected
2(-6/2) =0.125 2(-5/2) =0.177 2(10/2) =32
Lengthofthealignment• Asyouseeamoreandmorelengthoftwosequences…
– Iftwosequencesarewell-alignedhomologs,mostofthealignedaminoacidswillhavepositivescores
– …sothealignmentscorekeepsincreasingasyouseemorelengthofthesequences
– …sothelikelihoodthattheyarewell-alignedhomologskeepsincreasing
– Iftwosequencesarenothomologous,mostofthealignedaminoacidswillhavenegativescores
– …sothescorekeepsdecreasingasyouseemoresequence– …sothelikelihoodthattheyarewell-alignedhomologskeepsdecreasing
2/19/17
11
BLOSUMnumbers
• BLOSUMremovesredundancybiasesfromtheblocksofwell-alignedhomologs– Thesequenceselectioninthedatabaseisbiased– Clustersofhighlyidenticalsequencesarecollapsed– Percluster,one consensussequence isused to
calculate theBLOSUMmatrix• High numberscollapseonlyveryidentical
homologs– Usedtocompareanddetectcloselyrelated proteins
• Low numbersalsocollapsemoredivergenthomologs– Usedtocompareanddetectmoredistant proteins
• Themostusedmatrices:
BLOSUM45:distantlyrelatedsequencescollapsed(>45%id)
BLOSUM62:default, sequencescollapsediftheyhave>62%identity
BLOSUM80:closelyrelatedsequencescollapsed(>80%id)
Substitution matrices• Substitutionmatrices…
…areusedin(almost)allsequenceanalyses…stronglyinfluenceyourresults
• Otheraminoacidsubstitutionmatrices– PointAcceptedMutations(PAM)
• DevelopedbyMargaretDayhoff in1978• Basedoncounting1,572realmutationsthatoccurredduringtheevolutionof71proteins
– Gonnet (1992)• SimilartoPAM,butbasedonmoresequences• DefaultinpopularalignmentprogramClustal Omega
• Bioinformaticprogramsoftenhavedefaultparametervaluesset– Forexamplethechoiceofasubstitutionmatrix– Defaultsettingsarenotalwaysthemostoptimalforyourquestions– Ifyouhavenoidea,checktheliteratureinyourbiologicalfieldwhatis
consideredreliable
2/19/17
12
Models ofevolution• Weusemodels tocheckifwereallyunderstandsomething
– Ifourobservationsagreewiththemodel,thenweunderstandthatsomethingwell(andviceversa)
• Inamodelofevolution,wequantitativelyformalizetheevolutionaryprocessesthatwethinkareatplay– Asubstitutionmatrixis(partof)amodelofevolution,becauseitformalizesalltherelativesubstitutionratesbetweenresidues
– Ifothertypesofmutationsareincluded,themodeloftenagreesbetterwithgenomesequencedata
Timeà
Absoluteversusevolutionarydistance• Considerarecentlydivergedsequence• ThetwoDNAsequenceswilldiverge,buthow?
• Thedivergencesaturatesat~25%identity– Mutationalsaturation:ifthesamepositionmutatesagain– TheidentitybetweentworandomDNAsequencesis25%
• WecancorrectforitwiththeJukes-Cantorformula:
y=x2 - x+0.50.2
0.3
0.4
0.5
0 0.2 0.4 0.6 0.8 1%idfo
rrando
mse
qs
%GCofsequences