12
2/19/17 1 Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can… define homology, similarity, and identity give two examples of DNA substitution matrices list four properties of amino acids that might be important in determining their physico-chemical similarity derive BLOSUM scores from counts of aligned residues interpret BLOSUM scores and BLOSUM-based alignment scores by performing calculations with log-odds values explain when to use matrices based on high/low BLOSUM numbers (e.g. 45/62/80) and why they are suitable list some of the elements in a model of evolution, explain what it is and why we use it convert mutational distance into evolutionary distance by correcting for mutational saturation with Jukes-Cantor

20170221 quantifying sequence similarity - Universiteit Utrechttbb.bio.uu.nl/BDA/2017/20170221_quantifying_sequence... · 2017-06-14 · –The sequence selection in the database

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: 20170221 quantifying sequence similarity - Universiteit Utrechttbb.bio.uu.nl/BDA/2017/20170221_quantifying_sequence... · 2017-06-14 · –The sequence selection in the database

2/19/17

1

Quantifyingsequencesimilarity

BasE.DutilhSystemsBiology:BioinformaticDataAnalysis

UtrechtUniversity,February16th 2016

Afterthislecture,youcan…… definehomology,similarity,andidentity… givetwoexamplesofDNAsubstitutionmatrices… listfourpropertiesofaminoacidsthatmightbeimportantindeterminingtheirphysico-chemicalsimilarity… deriveBLOSUMscoresfromcountsofalignedresidues… interpretBLOSUMscoresandBLOSUM-basedalignmentscoresbyperformingcalculationswithlog-oddsvalues… explainwhentousematricesbasedonhigh/lowBLOSUMnumbers(e.g.45/62/80)andwhytheyaresuitable… listsomeoftheelementsinamodelofevolution,explainwhatitisandwhyweuseit… convertmutationaldistanceintoevolutionarydistancebycorrectingformutationalsaturationwithJukes-Cantor

Page 2: 20170221 quantifying sequence similarity - Universiteit Utrechttbb.bio.uu.nl/BDA/2017/20170221_quantifying_sequence... · 2017-06-14 · –The sequence selection in the database

2/19/17

2

Homology• Phylogenetictreesand

sequencealignmentscontainalotofinformationaboutfunctionandevolutionofsequences

• Buttheyonlymakesenseifthesequencesdescendfromacommonancestor,i.e.theyareevolutionarilyrelated

WeareNEVERallowedtointerprettreesoralignments,UNLESStheyaremadewithhomologoussequences!

… sohowdowedetermineifsequencesshareacommonancestor?(i.e.arehomologs?)

Similarity• Thingsthatlookreallysimilarprobablyshareacommonancestor

Page 3: 20170221 quantifying sequence similarity - Universiteit Utrechttbb.bio.uu.nl/BDA/2017/20170221_quantifying_sequence... · 2017-06-14 · –The sequence selection in the database

2/19/17

3

Identity• Alignmentmeans:addinggapsinoneand/ortheother

sequenceuntiltheyarebothequallylong:

• Beforealignment:

• Afteralignment:

• Alignedsequencesallowustocalculatepercentidentity:– 8differentpositions,and31identicalpositions– Thetwosequencesare100*=79%identical

• Identitycanbequantified,buthomologycannot!– Saying“79%homologous”iswrong (eventhoughtoomanypeoplesayit)

AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

3139

Definitions

• Similarity– Percentageofaminoacidresiduesinanalignmentwithapositivesubstitutionscore

– NotusedforDNA

• Homology– Propertyoftwosequencesthathaveasharedancestor

– HomologyisTRUEorFALSE:eitheryou’refamilyoryou’renot

• Identity– Percentageofidenticalresiduesinanalignment

– Usedforaminoacidsornucleotides

Page 4: 20170221 quantifying sequence similarity - Universiteit Utrechttbb.bio.uu.nl/BDA/2017/20170221_quantifying_sequence... · 2017-06-14 · –The sequence selection in the database

2/19/17

4

– Thedistancebetweenvirus3andvirus7:0.77mutations/site

Evolutionarydistance• Theevolutionarydistancebetweentwosequencesisoftenexpressedasthefrequencyofmismatches

http://epidemic.bio.ed.ac.uk/how_to_read_a_phylogeny

Thebranchlengthrepresentstheevolutionarytimebetweentwonodes.Unit:substitutionspersequencesite.

Thehorizontalbranchesrepresentevolutionarylineages

changingovertime.

• Assumption:positionsevolverandomlyandindependently

DNAsubstitutionmatrices• Similarityisquantifiedwithasubstitutionmatrix

– Scoresformatchesinasequencealignment– Penaltiesformismatchesinasequencealignment

• Theidentitymatrixisthemostoftenusedfor

A

C

T

G

Transitions:A↔GC↔T

Transversions:A↔C A↔TC↔G G↔T

A C G TA 1C -2 1G -1 -2 1T -2 -1 -2 1

A C G TA 1C -1 1G -1 -1 1T -1 -1 -1 1

Identitymatrix

scoringDNAsequencealignments• Not all nucleotidesubstitutions areequally likely

– Transitionsoccurabouttwiceasoftenastransversions

Page 5: 20170221 quantifying sequence similarity - Universiteit Utrechttbb.bio.uu.nl/BDA/2017/20170221_quantifying_sequence... · 2017-06-14 · –The sequence selection in the database

2/19/17

5

Identityandsimilarity• Thesethreesequencesareall66.7%identical…• …butwhatarethosecolors? *

– Someaminoacidsaremoresimilarthanothers

Glutamic acid(E)Asparticacid(D)Cysteine (C)

GeorgeOrwellAnimalFarm

(*hint)

Aminoacidproperties

http://swift.cmbi.ru.nl/teach/B1B/HTML/PosterA4_nbic_new.pdf

Page 6: 20170221 quantifying sequence similarity - Universiteit Utrechttbb.bio.uu.nl/BDA/2017/20170221_quantifying_sequence... · 2017-06-14 · –The sequence selection in the database

2/19/17

6

Thesimilaritysignal• Whatisthemostrelevant definitionofa“similaritysignal”betweenaminoacids?– Size?– Polarity?– Hydrophobicity?– Preferredproteinfold?– …anyothermeasuresthatmightbeimportant?

• Similaritysignalofwhat,exactly? *– “Ofthefunctionoftheaminoacidintheprotein”

– “Ofanevolutionaryrelationship”

• Evolutionhastestedthisforus– Solet’suseevolvedsequencestoscoresimilarity!

(*hint)

Which amino acids mutate into each other?• Weusealignmentstodiscoverwhichaminoacidsaremoresimilarthanothers,accordingtoevolution

• Wequantifythisbycountinghowoftentwoaminoacidsreallymutatedintoeachotherduringevolution

• Themostrelevantsignaltodetecthomologyisbasedonreliable,well-alignedhomologs

Well-alignedhomologs

Page 7: 20170221 quantifying sequence similarity - Universiteit Utrechttbb.bio.uu.nl/BDA/2017/20170221_quantifying_sequence... · 2017-06-14 · –The sequence selection in the database

2/19/17

7

BLOcks SUbstitution Matrix(BLOSUM)1. Takesomealignedhomologoussequences

2. Grouphighlyidenticalsequences(e.g.>62%identity)– Toremoveredundancybiasesinthesequencedatabase

3. Identifywell-alignedblocks– Sothatonlyrealmutationsarecompared(reliable,well-alignedhomologs)

4. Counthowofteneachpairoftwoaminoacidsmutatedintoeachother(i.e.howoftentheyarealigned)

• BLOSUMdefinessimilaritybetweenaminoacidsbasedonthestatisticalprobabilitythattheyaligninwell-alignedhomologs– Observed/expectedratio (oddsratio)

• An obeserved/expected ratioof2means something happenstwice asoften asyou expected by randomchance

– Observed:aligned inwell-aligned homologs– Expected:“aligned”inunaligned sequences

• Theobserved/expectedratioforawholealignmentcanbecalculatedbymultiplying theratiosforallalignedaminoacids– Thisisalotofmultiplyingforlongsequences

• Logarithmsallowustosum thescoresofalignedresidues– Mucheasiertouseforascoringsystem

• Hence:theBLOSUMscore

observedexpected

alignedinwell-alignedhomologs“aligned”inunalignedsequences

BLOSUMscore:

Howtoquantifythissimilarity?

log ( ) Si,j =2·log2( )Fi,jFi ·Fj

Page 8: 20170221 quantifying sequence similarity - Universiteit Utrechttbb.bio.uu.nl/BDA/2017/20170221_quantifying_sequence... · 2017-06-14 · –The sequence selection in the database

2/19/17

8

Anexample• Let’ssaythatwehaveawell-alignedblockof:

– 100aminoacidslong– 1,000proteins“deep”– Withnogaps

• 7.4%arealanine (A)• 1.3%aretryptophan(W)• Randomly,weexpect afractionofA-Walignmentsof:

• Inreality,weobserve afractionofA-Walignmentsof:0.00034– TheA-Wmutationoccurredlessfrequentlyinevolutionthanexpected!– …sothesubstitutionscoremustbenegative

• TheA-Wsubstitutionscoreis:Si,j =2·log2( )Fi,j

Fi ·FjSA,W =2·log2( )0.00034

0.000962 =-3

FA ·FW =0.074·0.013=0.000962

7,400100·1,000→FA ==0.074

→FW =0.013

BLOSUM62

Aromatic

Hydrophobic

Positive

Polar,non-hydrophobic

SmallLowforinfrequentsubstitutions

Highforfrequentsubstitutions

Lowforcommonaminoacids

Highforrareaminoacids

• Whyarethehighestnumbersineachrow/columnonthediagonal?– Because inwell-aligned homologs,every amino acidismostoften aligned toitself (i.e.it isnot mutated)

Substitutionmatrix

Page 9: 20170221 quantifying sequence similarity - Universiteit Utrechttbb.bio.uu.nl/BDA/2017/20170221_quantifying_sequence... · 2017-06-14 · –The sequence selection in the database

2/19/17

9

Sowhatdothenumbersmean?

• Ascoreof-3forthesubstitutionA-Wmeansthatthissubstitutionisobservedinwell-alignedhomologs0.35timesasoftenasexpected:

– …or:=2.83timeslessoftenthanexpected

• Ascoreof2forthesubstitutionD-Emeansthatthissubstitutionisobservedinwell-alignedhomologstwiceasoftenasexpected:

• Ascoreof9forthesubstitutionC-Cmeansthatthissubstitutionisobservedinwell-alignedhomologs22.6timesasoftenasexpected:

2(-3/2) =0.35

2(2/2) =2

Si,j =2·log2( )Fi,jFi ·Fj

2(9/2) =22.6

Fi,jFi ·Fj2(Si,j/2) =→

10.35

0.000340.000962=

Exercise

Aromatic

Hydrophobic

Positive

Polar,non-hydrophobic

Small

• Calculate thesimilarity scorebetween thesesequences:a) KAWSADV

b) HAMNWEQc) RDWSACV

– 1+4– 1+1– 3+2– 2=0

0– 2– 1+1– 3+5– 2=– 22– 2+11+4+4– 3+4=20

Page 10: 20170221 quantifying sequence similarity - Universiteit Utrechttbb.bio.uu.nl/BDA/2017/20170221_quantifying_sequence... · 2017-06-14 · –The sequence selection in the database

2/19/17

10

Oddsratiothatsequencesarewell-alignedhomologs• Howlikelyisitthattwosequencesarewell-alignedhomologs?

– seqA – seqB :-1+-3+-2=-6– seqA – seqC :-1+-2+-2=-5– seqB – seqC :4+2+4=10

• WhenyouobserveRYDalignedtoSDA,thesesequencesare0.125timesmorelikelytobewell-alignedhomologsthanexpected– …or8timeslesslikelytobewell-alignedhomologsthanexpected

• WhenyouobserveRYDalignedtoSEA,thesesequencesare0.177timesmorelikelytobewell-alignedhomologsthanexpected– …or5.6timeslesslikelytobewell-alignedhomologsthanexpected

• WhenyouobserveSDAalignedtoSEA,thesesequencesare32timesmorelikelytobewell-alignedhomologsthanexpected

2(-6/2) =0.125 2(-5/2) =0.177 2(10/2) =32

Lengthofthealignment• Asyouseeamoreandmorelengthoftwosequences…

– Iftwosequencesarewell-alignedhomologs,mostofthealignedaminoacidswillhavepositivescores

– …sothealignmentscorekeepsincreasingasyouseemorelengthofthesequences

– …sothelikelihoodthattheyarewell-alignedhomologskeepsincreasing

– Iftwosequencesarenothomologous,mostofthealignedaminoacidswillhavenegativescores

– …sothescorekeepsdecreasingasyouseemoresequence– …sothelikelihoodthattheyarewell-alignedhomologskeepsdecreasing

Page 11: 20170221 quantifying sequence similarity - Universiteit Utrechttbb.bio.uu.nl/BDA/2017/20170221_quantifying_sequence... · 2017-06-14 · –The sequence selection in the database

2/19/17

11

BLOSUMnumbers

• BLOSUMremovesredundancybiasesfromtheblocksofwell-alignedhomologs– Thesequenceselectioninthedatabaseisbiased– Clustersofhighlyidenticalsequencesarecollapsed– Percluster,one consensussequence isused to

calculate theBLOSUMmatrix• High numberscollapseonlyveryidentical

homologs– Usedtocompareanddetectcloselyrelated proteins

• Low numbersalsocollapsemoredivergenthomologs– Usedtocompareanddetectmoredistant proteins

• Themostusedmatrices:

BLOSUM45:distantlyrelatedsequencescollapsed(>45%id)

BLOSUM62:default, sequencescollapsediftheyhave>62%identity

BLOSUM80:closelyrelatedsequencescollapsed(>80%id)

Substitution matrices• Substitutionmatrices…

…areusedin(almost)allsequenceanalyses…stronglyinfluenceyourresults

• Otheraminoacidsubstitutionmatrices– PointAcceptedMutations(PAM)

• DevelopedbyMargaretDayhoff in1978• Basedoncounting1,572realmutationsthatoccurredduringtheevolutionof71proteins

– Gonnet (1992)• SimilartoPAM,butbasedonmoresequences• DefaultinpopularalignmentprogramClustal Omega

• Bioinformaticprogramsoftenhavedefaultparametervaluesset– Forexamplethechoiceofasubstitutionmatrix– Defaultsettingsarenotalwaysthemostoptimalforyourquestions– Ifyouhavenoidea,checktheliteratureinyourbiologicalfieldwhatis

consideredreliable

Page 12: 20170221 quantifying sequence similarity - Universiteit Utrechttbb.bio.uu.nl/BDA/2017/20170221_quantifying_sequence... · 2017-06-14 · –The sequence selection in the database

2/19/17

12

Models ofevolution• Weusemodels tocheckifwereallyunderstandsomething

– Ifourobservationsagreewiththemodel,thenweunderstandthatsomethingwell(andviceversa)

• Inamodelofevolution,wequantitativelyformalizetheevolutionaryprocessesthatwethinkareatplay– Asubstitutionmatrixis(partof)amodelofevolution,becauseitformalizesalltherelativesubstitutionratesbetweenresidues

– Ifothertypesofmutationsareincluded,themodeloftenagreesbetterwithgenomesequencedata

Timeà

Absoluteversusevolutionarydistance• Considerarecentlydivergedsequence• ThetwoDNAsequenceswilldiverge,buthow?

• Thedivergencesaturatesat~25%identity– Mutationalsaturation:ifthesamepositionmutatesagain– TheidentitybetweentworandomDNAsequencesis25%

• WecancorrectforitwiththeJukes-Cantorformula:

y=x2 - x+0.50.2

0.3

0.4

0.5

0 0.2 0.4 0.6 0.8 1%idfo

rrando

mse

qs

%GCofsequences