20170221 quantifying sequence similarity - Universiteit Utrechttbb.bio.uu.nl/BDA/2017/20170221_quantifying_sequence... · 2017-06-14 · –The sequence selection in the database

2/19/17

1

Quantifyingsequencesimilarity

BasE.DutilhSystemsBiology:BioinformaticDataAnalysis

UtrechtUniversity,February16th 2016

Afterthislecture,youcan…… definehomology,similarity,andidentity… givetwoexamplesofDNAsubstitutionmatrices… listfourpropertiesofaminoacidsthatmightbeimportantindeterminingtheirphysico-chemicalsimilarity… deriveBLOSUMscoresfromcountsofalignedresidues… interpretBLOSUMscoresandBLOSUM-basedalignmentscoresbyperformingcalculationswithlog-oddsvalues… explainwhentousematricesbasedonhigh/lowBLOSUMnumbers(e.g.45/62/80)andwhytheyaresuitable… listsomeoftheelementsinamodelofevolution,explainwhatitisandwhyweuseit… convertmutationaldistanceintoevolutionarydistancebycorrectingformutationalsaturationwithJukes-Cantor

2/19/17

2

Homology• Phylogenetictreesand

sequencealignmentscontainalotofinformationaboutfunctionandevolutionofsequences

• Buttheyonlymakesenseifthesequencesdescendfromacommonancestor,i.e.theyareevolutionarilyrelated

WeareNEVERallowedtointerprettreesoralignments,UNLESStheyaremadewithhomologoussequences!

… sohowdowedetermineifsequencesshareacommonancestor?(i.e.arehomologs?)

Similarity• Thingsthatlookreallysimilarprobablyshareacommonancestor

2/19/17

3

Identity• Alignmentmeans:addinggapsinoneand/ortheother

sequenceuntiltheyarebothequallylong:

• Beforealignment:

• Afteralignment:

• Alignedsequencesallowustocalculatepercentidentity:– 8differentpositions,and31identicalpositions– Thetwosequencesare100*=79%identical

• Identitycanbequantified,buthomologycannot!– Saying“79%homologous”iswrong (eventhoughtoomanypeoplesayit)

AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

3139

Definitions

• Similarity– Percentageofaminoacidresiduesinanalignmentwithapositivesubstitutionscore

– NotusedforDNA

• Homology– Propertyoftwosequencesthathaveasharedancestor

– HomologyisTRUEorFALSE:eitheryou’refamilyoryou’renot

• Identity– Percentageofidenticalresiduesinanalignment

– Usedforaminoacidsornucleotides

2/19/17

4

– Thedistancebetweenvirus3andvirus7:0.77mutations/site

Evolutionarydistance• Theevolutionarydistancebetweentwosequencesisoftenexpressedasthefrequencyofmismatches

http://epidemic.bio.ed.ac.uk/how_to_read_a_phylogeny

Thebranchlengthrepresentstheevolutionarytimebetweentwonodes.Unit:substitutionspersequencesite.

Thehorizontalbranchesrepresentevolutionarylineages

changingovertime.

• Assumption:positionsevolverandomlyandindependently

DNAsubstitutionmatrices• Similarityisquantifiedwithasubstitutionmatrix

– Scoresformatchesinasequencealignment– Penaltiesformismatchesinasequencealignment

• Theidentitymatrixisthemostoftenusedfor

A

C

T

G

Transitions:A↔GC↔T

Transversions:A↔C A↔TC↔G G↔T

A C G TA 1C -2 1G -1 -2 1T -2 -1 -2 1

A C G TA 1C -1 1G -1 -1 1T -1 -1 -1 1

Identitymatrix

scoringDNAsequencealignments• Not all nucleotidesubstitutions areequally likely

– Transitionsoccurabouttwiceasoftenastransversions

2/19/17

5

Identityandsimilarity• Thesethreesequencesareall66.7%identical…• …butwhatarethosecolors? *

– Someaminoacidsaremoresimilarthanothers

Glutamic acid(E)Asparticacid(D)Cysteine (C)

GeorgeOrwellAnimalFarm

(*hint)

Aminoacidproperties

http://swift.cmbi.ru.nl/teach/B1B/HTML/PosterA4_nbic_new.pdf

2/19/17

6

Thesimilaritysignal• Whatisthemostrelevant definitionofa“similaritysignal”betweenaminoacids?– Size?– Polarity?– Hydrophobicity?– Preferredproteinfold?– …anyothermeasuresthatmightbeimportant?

• Similaritysignalofwhat,exactly? *– “Ofthefunctionoftheaminoacidintheprotein”

– “Ofanevolutionaryrelationship”

• Evolutionhastestedthisforus– Solet’suseevolvedsequencestoscoresimilarity!

(*hint)

Which amino acids mutate into each other?• Weusealignmentstodiscoverwhichaminoacidsaremoresimilarthanothers,accordingtoevolution

• Wequantifythisbycountinghowoftentwoaminoacidsreallymutatedintoeachotherduringevolution

• Themostrelevantsignaltodetecthomologyisbasedonreliable,well-alignedhomologs

Well-alignedhomologs

2/19/17

7

BLOcks SUbstitution Matrix(BLOSUM)1. Takesomealignedhomologoussequences

2. Grouphighlyidenticalsequences(e.g.>62%identity)– Toremoveredundancybiasesinthesequencedatabase

3. Identifywell-alignedblocks– Sothatonlyrealmutationsarecompared(reliable,well-alignedhomologs)

4. Counthowofteneachpairoftwoaminoacidsmutatedintoeachother(i.e.howoftentheyarealigned)

• BLOSUMdefinessimilaritybetweenaminoacidsbasedonthestatisticalprobabilitythattheyaligninwell-alignedhomologs– Observed/expectedratio (oddsratio)

• An obeserved/expected ratioof2means something happenstwice asoften asyou expected by randomchance

– Observed:aligned inwell-aligned homologs– Expected:“aligned”inunaligned sequences

• Theobserved/expectedratioforawholealignmentcanbecalculatedbymultiplying theratiosforallalignedaminoacids– Thisisalotofmultiplyingforlongsequences

• Logarithmsallowustosum thescoresofalignedresidues– Mucheasiertouseforascoringsystem

• Hence:theBLOSUMscore

observedexpected

alignedinwell-alignedhomologs“aligned”inunalignedsequences

BLOSUMscore:

Howtoquantifythissimilarity?

log ( ) Si,j =2·log2( )Fi,jFi ·Fj

→

2/19/17

8

Anexample• Let’ssaythatwehaveawell-alignedblockof:

– 100aminoacidslong– 1,000proteins“deep”– Withnogaps

• 7.4%arealanine (A)• 1.3%aretryptophan(W)• Randomly,weexpect afractionofA-Walignmentsof:

• Inreality,weobserve afractionofA-Walignmentsof:0.00034– TheA-Wmutationoccurredlessfrequentlyinevolutionthanexpected!– …sothesubstitutionscoremustbenegative

• TheA-Wsubstitutionscoreis:Si,j =2·log2( )Fi,j

Fi ·FjSA,W =2·log2( )0.00034

0.000962 =-3

FA ·FW =0.074·0.013=0.000962

7,400100·1,000→FA ==0.074

→FW =0.013

BLOSUM62

Aromatic

Hydrophobic

Positive

Polar,non-hydrophobic

SmallLowforinfrequentsubstitutions

Highforfrequentsubstitutions

Lowforcommonaminoacids

Highforrareaminoacids

• Whyarethehighestnumbersineachrow/columnonthediagonal?– Because inwell-aligned homologs,every amino acidismostoften aligned toitself (i.e.it isnot mutated)

Substitutionmatrix

2/19/17

9

Sowhatdothenumbersmean?

• Ascoreof-3forthesubstitutionA-Wmeansthatthissubstitutionisobservedinwell-alignedhomologs0.35timesasoftenasexpected:

– …or:=2.83timeslessoftenthanexpected

• Ascoreof2forthesubstitutionD-Emeansthatthissubstitutionisobservedinwell-alignedhomologstwiceasoftenasexpected:

• Ascoreof9forthesubstitutionC-Cmeansthatthissubstitutionisobservedinwell-alignedhomologs22.6timesasoftenasexpected:

2(-3/2) =0.35

2(2/2) =2

Si,j =2·log2( )Fi,jFi ·Fj

2(9/2) =22.6

Fi,jFi ·Fj2(Si,j/2) =→

10.35

0.000340.000962=

Exercise

Aromatic

Hydrophobic

Positive

Polar,non-hydrophobic

Small

• Calculate thesimilarity scorebetween thesesequences:a) KAWSADV

b) HAMNWEQc) RDWSACV

– 1+4– 1+1– 3+2– 2=0

0– 2– 1+1– 3+5– 2=– 22– 2+11+4+4– 3+4=20

2/19/17

10

Oddsratiothatsequencesarewell-alignedhomologs• Howlikelyisitthattwosequencesarewell-alignedhomologs?

– seqA – seqB :-1+-3+-2=-6– seqA – seqC :-1+-2+-2=-5– seqB – seqC :4+2+4=10

• WhenyouobserveRYDalignedtoSDA,thesesequencesare0.125timesmorelikelytobewell-alignedhomologsthanexpected– …or8timeslesslikelytobewell-alignedhomologsthanexpected

• WhenyouobserveRYDalignedtoSEA,thesesequencesare0.177timesmorelikelytobewell-alignedhomologsthanexpected– …or5.6timeslesslikelytobewell-alignedhomologsthanexpected

• WhenyouobserveSDAalignedtoSEA,thesesequencesare32timesmorelikelytobewell-alignedhomologsthanexpected

2(-6/2) =0.125 2(-5/2) =0.177 2(10/2) =32

Lengthofthealignment• Asyouseeamoreandmorelengthoftwosequences…

– Iftwosequencesarewell-alignedhomologs,mostofthealignedaminoacidswillhavepositivescores

– …sothealignmentscorekeepsincreasingasyouseemorelengthofthesequences

– …sothelikelihoodthattheyarewell-alignedhomologskeepsincreasing

– Iftwosequencesarenothomologous,mostofthealignedaminoacidswillhavenegativescores

– …sothescorekeepsdecreasingasyouseemoresequence– …sothelikelihoodthattheyarewell-alignedhomologskeepsdecreasing

2/19/17

11

BLOSUMnumbers

• BLOSUMremovesredundancybiasesfromtheblocksofwell-alignedhomologs– Thesequenceselectioninthedatabaseisbiased– Clustersofhighlyidenticalsequencesarecollapsed– Percluster,one consensussequence isused to

calculate theBLOSUMmatrix• High numberscollapseonlyveryidentical

homologs– Usedtocompareanddetectcloselyrelated proteins

• Low numbersalsocollapsemoredivergenthomologs– Usedtocompareanddetectmoredistant proteins

• Themostusedmatrices:

BLOSUM45:distantlyrelatedsequencescollapsed(>45%id)

BLOSUM62:default, sequencescollapsediftheyhave>62%identity

BLOSUM80:closelyrelatedsequencescollapsed(>80%id)

Substitution matrices• Substitutionmatrices…

…areusedin(almost)allsequenceanalyses…stronglyinfluenceyourresults

• Otheraminoacidsubstitutionmatrices– PointAcceptedMutations(PAM)

• DevelopedbyMargaretDayhoff in1978• Basedoncounting1,572realmutationsthatoccurredduringtheevolutionof71proteins

– Gonnet (1992)• SimilartoPAM,butbasedonmoresequences• DefaultinpopularalignmentprogramClustal Omega

• Bioinformaticprogramsoftenhavedefaultparametervaluesset– Forexamplethechoiceofasubstitutionmatrix– Defaultsettingsarenotalwaysthemostoptimalforyourquestions– Ifyouhavenoidea,checktheliteratureinyourbiologicalfieldwhatis

consideredreliable

2/19/17

12

Models ofevolution• Weusemodels tocheckifwereallyunderstandsomething

– Ifourobservationsagreewiththemodel,thenweunderstandthatsomethingwell(andviceversa)

• Inamodelofevolution,wequantitativelyformalizetheevolutionaryprocessesthatwethinkareatplay– Asubstitutionmatrixis(partof)amodelofevolution,becauseitformalizesalltherelativesubstitutionratesbetweenresidues

– Ifothertypesofmutationsareincluded,themodeloftenagreesbetterwithgenomesequencedata

Timeà

Absoluteversusevolutionarydistance• Considerarecentlydivergedsequence• ThetwoDNAsequenceswilldiverge,buthow?

• Thedivergencesaturatesat~25%identity– Mutationalsaturation:ifthesamepositionmutatesagain– TheidentitybetweentworandomDNAsequencesis25%

• WecancorrectforitwiththeJukes-Cantorformula:

y=x2 - x+0.50.2

0.3

0.4

0.5

0 0.2 0.4 0.6 0.8 1%idfo

rrando

mse

qs

%GCofsequences

Documents

20170221 quantifying sequence similarity - Universiteit Utrechttbb.bio.uu.nl/BDA/2017/20170221_quantifying_sequence... · 2017-06-14 · –The sequence selection in the database