Transcript
Page 1: Admixture Correction in the Outgroup f3 Statistic

Admixture Correction in the Outgroup f3 Statistic

Presented by Nita Tunga

In partial fulfilment of the requirements for graduation with the Dean’s Scholars Honors Degree in the Department of Mathematics, University of Texas at Austin

May, 2018

Prof. Jennifer Mann Austin, Ph.D., Supervising Professor

Prof. Kathryn Dabbs, Ph.D., Second Reader


Table of Contents

Introduction

Chapter 1: Background

Chapter 2: Project

Chapter 3: Dataset

Chapter 4: Methods and Results

    Section 4.1: Correction Attempt 1

    Section 4.2: Correction Attempt 2

    Section 4.3: Correction Attempt 3

    Section 4.4: Correction Attempt 4

Conclusion

Appendix A: Glossary

References


Introduction

Genetic inheritance can be studied within a purely genetic scope. However, this eliminates part of the picture. The field of genetics is often thought of as a natural science with little in common with fields of social science. However, in human genetics and the genetics of the organisms which humans impact, the role of cultural and societal forces cannot be ignored. For instance, lactase is an enzyme used to digest lactose in milk. As such, it is an enzyme whose activity reduces significantly after weaning. Nonetheless, as humans have begun to ingest more dairy products into adulthood, lactase persistence has evolved to enable humans to digest these dairy products.

My research involves mathematically representing the genetic similarity of two populations accurately via the f3 statistic. The outgroup-f3 statistic is useful in understanding a population’s genetic history and how genetically related two populations are. It shows how close two populations are compared to a third population that is equally distant genetically from the first two. However, if two populations share a recent genetic interaction with another population, the outgroup-f3 statistic could show those two populations as being closer together than they truly are. This genetic interaction of two or more previously isolated populations interbreeding is referred to as admixture. Admixture skews, or even inhibits, an understanding of those populations’ genetic histories.

To avoid this problem, I have attempted to devise a modified version of the outgroup-f3 statistic to ensure an accurate representation of genetic relatedness. For my project, artificial admixture was introduced in six unadmixed human populations. Depending on the relationship between increased contamination and the f3 statistic, we proposed and adjusted solutions for a corrected f3 accordingly.


I tested my proposed corrections by applying them to populations that contain individuals with and without recent histories of genetic admixture. After correcting for the proportion of admixture in the population, I compared this corrected outgroup-f3 statistic to the outgroup-f3 value calculated for the original unadmixed population. The goal of this work is to have a corrected statistic that one can apply to two populations, independent of admixture proportions. Ultimately, this will help us to better understand the evolutionary histories of populations. Moreover, a corrected statistic will aid other researchers as they analyse demographic histories further in the past.


Background

F statistics were first proposed in the paper “Reconstructing Indian population history”, published in Nature in 2009. In this paper, Reich and colleagues outline the way f2, f3, and f4 statistics can be used to measure genetic drift between two, three, and four taxa respectively. The f3 statistic proposed in this paper is useful for detecting admixture between groups. To summarise, the f3 statistic assumes a null hypothesis of no admixture, which implies a nonnegative f3 statistic. F3 is best used to detect admixture when the time between original split and secondary contact is large, coalescence before admixture is unlikely, and the admixture proportion is close to 50%.

In regard to f3 statistics in particular, Reich et al. propose an equation to be used to measure the genetic drift between three populations, Populations X, A, and B. This equation is, in a simplified form, $f_3 = (x - a)(x - b)$, where x, a, and b represent allele frequencies in their respective populations. By simplifying the equation, we see that there is a proportional relationship between the f3 statistic and the genetic drift between Populations A and X and Populations B and X. Genetic drift is defined to be the change in allele frequency along a graph edge on a phylogenetic tree. Phylogenetic trees are graphical representations of the genetic relationship between a group of individuals or populations based on physical or genetic characteristics. The length of the branches on the tree often represents the genetic distance, that is, the number of genetic differences, between individuals or populations.
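The simplified equation can be made concrete with a short Python sketch. It averages $(x - a)(x - b)$ over SNPs; the function name `f3` and the allele frequencies are hypothetical, and the estimator omits the bias corrections and standard errors that a program such as popstats would report.

```python
from statistics import mean


def f3(x, a, b):
    """Average of (x_i - a_i) * (x_i - b_i) over SNPs (no bias correction)."""
    if not (len(x) == len(a) == len(b)):
        raise ValueError("allele-frequency lists must have equal length")
    return mean((xi - ai) * (xi - bi) for xi, ai, bi in zip(x, a, b))


# Hypothetical allele frequencies at four SNPs, for illustration only.
freq_x = [0.10, 0.80, 0.50, 0.30]  # Population X
freq_a = [0.40, 0.60, 0.70, 0.60]  # Population A
freq_b = [0.50, 0.50, 0.80, 0.70]  # Population B

print(f3(freq_x, freq_a, freq_b))
```

In this toy input A and B always deviate from X in the same direction, so every per-SNP product is positive and the average is positive as well.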

More specifically, the calculated f3 statistic is the product of the allele-frequency differences between those populations. This test is useful to see if certain groups have inherited genes from different ancestries. When there has been no admixture, the f3 statistic is expected to be positive. When there has been admixture, the f3 statistic could be negative. Furthermore, lower f3 values are indicative of less closely related populations, whereas higher f3 values are indicative of more closely related populations. The farther apart two populations are, the smaller the two terms (x - a and x - b) in the equation, and therefore, the lower the f3 statistic. Similarly, when two populations are closer together, the two terms in the equation are larger, resulting in a larger f3 statistic.

To better understand what the f3 statistic can be used for, we refer to the figure below.

[Figure: tree relating Populations A (PA), B (PB), and X (PX)]

Here we see that there are two populations that are closer together (Populations A and B) than they are to the third population (Population X). In the context of the equation, $f_3 = (x - a)(x - b)$, we see that we are comparing the allele frequencies in Populations A and B in relation to the allele frequencies in Population X. If we see how far Population A’s allele frequencies are from Population X’s allele frequencies and compare this to the distance between Population B’s allele frequencies and Population X’s, we can evaluate the genetic distance between Population A and Population B. To think about this in a different way, by subtracting out Population A’s allele frequencies from those of Population X, we are seeing how much longer or shorter one branch length is compared to the other. Doing so enables us to analyse the distance of each of the three populations in relation to the vertex that connects all three of them. However, if we have an unknown Population Y that integrates its DNA into both Population A and Population B, it would appear that these two populations are closer genetically than one would expect. In terms of the equation, this would make both terms (x - a) and (x - b) increase or decrease together. As such, the resulting f3 value will be inordinately higher or lower. This is an interesting result if one is concerned with the relationship of Population Y to Populations A and B. However, if you are interested in the genetic relationship of Populations A and B before their admixture with Population Y, this can be a confounding factor.

Nick Patterson was able to work through more of the math behind the F statistics tests, which he documented in his paper “Ancient Admixture in Human History,” published in Genetics in 2012. He also discusses the outgroup case, which is further discussed in Maanasa Raghavan’s paper, “Upper Palaeolithic Siberian genome reveals dual ancestry of Native Americans,” published in Nature in 2014. In this paper, the concept of outgroup-f3 statistics is introduced. Outgroup-f3 statistics involve comparing two populations to a third, “outgroup,” population, which is equally genetically removed from the other two populations. By doing so, the outgroup population serves as a reference group for measuring genetic relatedness of the populations in question. So instead of looking for admixture between Population X and the other populations, the outgroup-f3 statistic is a measure of the genetic similarity between Populations A and B.

In Benjamin Peter’s paper, “Admixture, Population Structure, and F-Statistics,” he provides a clear overview of F and D statistics (Genetics, 2016). He also makes the point that f3 statistics can be used as a test for admixture, not just for how closely related two populations are. He also points out that in the history of humans, many of the calculated f3 values are negative, which could show that population phylogenies are not always the best way to discuss human evolution.


F3 statistics have been useful in answering a variety of genetic relatedness questions and are widely used in the field of human population genetics and evolutionary biology more broadly. For instance, outgroup-f3 statistics were used to test the relatedness of Levantine and southern Arabian populations to African populations along the Northern and Southern Dispersal Routes out of Africa. Humans evolved in Africa over the past 2 million years. A major dispersal of humans out of Africa occurred around 50 thousand years ago and led to the majority of human genetic variation we see across the world today. Anthropologists and geneticists have long debated whether the primary route out of Africa was the Northern Route or the Southern Route. In “Testing support for the northern and southern dispersal routes out of Africa: an analysis of Levantine and southern Arabian populations,” Vyas and colleagues attempted to answer that question using f3 statistics (American Journal of Physical Anthropology, 2017). The Northern Dispersal Route led into the Levant, whereas the Southern Dispersal Route led into southern Arabia. By using f3 statistics to see how linked the populations were pairwise, it was found that neither dispersal route was favoured over the other. The Mbuti, a group of people currently living in central Africa, were used as the outgroup population for this test. The results showed that both the Levantine and Arabian populations were equally related to the African population.

The f3 test was taken further and used to show that both the Levantine and Arabian populations shared relatively similar relatedness to non-African populations. Within each region, some groups had more sub-Saharan ancestry, which led to lower f3 values. Another reason for a lower f3 statistic could be an earlier divergence from non-African populations, which would be useful in determining which route was used by earlier populations. The statistic was used to show that both populations were generally equally related to all the African populations as well. Therefore, the researchers were not able to distinguish which dispersal route was used more.

The f3 statistics have also been used in exploring the relatedness of various subspecies of grapes. In contrast to the previous example of outgroup-f3 statistics, this test used normal f3 statistics to see what sort of admixture has occurred in the history of the grape. While this involves understanding how related two subspecies of grapes are, the primary purpose of this study was to see how to maximally utilise the genetic diversity of grapes. The grape’s history of domestication began around 6000-8000 years ago, when the domesticated grape, Vitis vinifera vinifera, was cultivated from the wild grape, Vitis vinifera sylvestris. The f3 statistic was used to test for mixture between vinifera west, vinifera east, and sylvestris west (f3 = -0.00481); f3 statistics were also used to test for mixture between sylvestris west, vinifera west, and sylvestris east (f3 = 0.0268).

The researchers found that western vinifera is most likely a combination of eastern vinifera and western sylvestris. Nonetheless, there does not seem to be a genetic transfer between western vinifera and western sylvestris. This supports the hypothesis that vinifera originated in the Near East and then received introgression from wild sylvestris in Europe. This analysis found that little of the potential genetic diversity of the grape has been explored. The researchers use this finding to suggest that to overcome the grape’s significant pathogen pressures, its genetic diversity must be utilised to its advantage. The domesticated grape contains genetic variation much larger than that of humans, thus making it ideal to manipulate for its polymorphisms and genetic diversity.


Project

The goal of this project is to correct for admixture when calculating the outgroup-f3 statistic so it is an accurate measure of genetic relatedness. I first proposed a correction similar to that used by Lindo et al. for the D statistic.

The D statistic can be used to test for admixture across four populations. In his paper, “Ancient individuals from the North American Northwest Coast reveal 10,000 years of regional genetic continuity,” John Lindo proposed a contamination correction to account for similar admixture histories for this statistic (Proceedings of the National Academy of Sciences of the United States of America, 2017). The contamination correction factor Lindo proposes is based on contamination of an ancient genome with modern DNA from a distantly related population, though the one we propose for f3 statistics will be based on the level of artificially induced admixture. Nonetheless, Lindo used a corrected formula to calculate a new D statistic, with admixture corrected for using the contamination correction:

$D^* = \frac{D_{\text{Shuká Káa}} - c \cdot D_{\text{GBR}}}{1 - c}$

Here $D_{\text{Shuká Káa}}$ is the contaminated sample’s D statistic; $D_{\text{GBR}}$ is the D statistic, substituting an individual representative of the population that contaminated Shuká Káa; c is the contamination rate. For the f3 statistic, this equation would look like

$f_3^* = \frac{f_3 - a \cdot f_{3a}}{1 - a}$

where f3 is the contaminated sample’s f3 statistic, f3a is the f3 statistic computed with the population that contaminated the original group as the outgroup, and a is the admixture proportion.
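As a minimal sketch of how this Lindo-style form could be evaluated, the Python function below computes $f_3^*$ from the two f3 statistics and the admixture proportion. The function name and the input values are hypothetical and are not taken from the thesis data.

```python
def lindo_style_f3(f3_obs, f3_a, a):
    """Return (f3_obs - a * f3_a) / (1 - a), the Lindo-style corrected f3.

    f3_obs: f3 of the admixed (contaminated) sample
    f3_a:   f3 computed with the contaminating population as the outgroup
    a:      admixture proportion, 0 <= a < 1
    """
    if not 0.0 <= a < 1.0:
        raise ValueError("admixture proportion must satisfy 0 <= a < 1")
    return (f3_obs - a * f3_a) / (1.0 - a)


# Hypothetical values, for illustration only.
print(lindo_style_f3(f3_obs=0.20, f3_a=0.10, a=0.5))
```

With a = 0 the function returns the observed f3 unchanged, and the denominator makes the result increasingly sensitive to f3a as a approaches 1.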


Dataset

Our research group utilised population data from North and South American indigenous populations. The first step of this project was to gather usable data: I removed SNPs that were missing in more than 90% of the population and pruned SNPs based on linkage disequilibrium. I next used the ADMIXTURE program to identify individuals with evidence of European admixture. Populations were then split into three groups: those that had no evidence of European admixture (Cabecar, Mixe, Surui, GuaraniKW, Xaltocan, and Xavante), those where a number of individuals were admixed and a number were not (JaltocanHidalgo, Pima, Xaltocan), and those where the entire population had European admixture (AleutRaff, Algonquin, Cree, Chipewyan, Inupiat, Ojibwa, and Southern US Native American).

Populations into which Admixture    Population with Admixed and    Admixed Populations on which
was Artificially Introduced         Unadmixed Individuals          to Test Correction
--------------------------------    ---------------------------    ----------------------------
Cabecar                             JaltocanHidalgo                AleutRaff
Mixe                                Pima                           Algonquin
Surui                               Xaltocan                       Cree
GuaraniKW                                                          Chipewyan
Xaltocan                                                           Inupiat
Xavante                                                            Ojibwa
                                                                   Southern US Native American


Methods

For my project, I used six completely unadmixed human populations from North and South America: Cabecar, Mixe, Surui, GuaraniKW, Xaltocan, and Xavante. I introduced artificial admixture from a European population in constant 5% intervals, from 5% to 95% admixture. This was done via a program in R that arbitrarily replaced 5 to 95% of the population’s genome with the corresponding segment of a European genome. Below is an example of the code used to induce admixture in the population Cabecar using a for-loop.

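# Admixture proportions to simulate: 5% to 95% in steps of 5%
ADM=(0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95)

# Run the admixture script once per proportion (Spanish donor, Cabecar recipient)
for j in "${ADM[@]}"; do
    Rscript admixer.R --file ./final_dataset_cleanest2.vcf --donor Spanish --recip Cabecar --p "$j" --subs 5 --out "final_admix_Cabecar_$j.vcf"
done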
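The source does not show the internals of admixer.R, so the following Python sketch only illustrates the general idea described above: a chosen fraction of a recipient genome is overwritten with the corresponding donor alleles. The function `admix_sites`, the site-by-site replacement (rather than contiguous segments), and the toy genomes are all assumptions made for illustration.

```python
import random


def admix_sites(recipient, donor, p, seed=0):
    """Replace a random fraction p of sites in `recipient` with `donor` alleles.

    This is a simplification: sites are chosen independently, whereas the
    thesis describes replacing corresponding genome segments.
    """
    if len(recipient) != len(donor):
        raise ValueError("genomes must cover the same sites")
    n = len(recipient)
    k = round(p * n)                       # number of sites to overwrite
    rng = random.Random(seed)              # seeded for reproducibility
    chosen = set(rng.sample(range(n), k))  # indices taken from the donor
    return [donor[i] if i in chosen else recipient[i] for i in range(n)]


# Toy genomes: 0 marks a recipient allele, 1 marks a donor allele.
recipient = [0] * 100
donor = [1] * 100
for pct in range(5, 100, 5):               # 5%, 10%, ..., 95%
    admixed = admix_sites(recipient, donor, pct / 100)
    assert sum(admixed) == pct             # donor share matches the target
```

Because the toy genomes have 100 sites, the number of donor alleles in each output equals the admixture percentage, which makes the effect of each level easy to check.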

After simulating admixture in these populations, I obtained outgroup-f3 values for each of these populations and each of the admixture levels within them using the program popstats, with the Yoruba population, a West African group assumed to be equally distantly related to all these populations, as the outgroup. I also obtained an f3 statistic by swapping out the Yoruba population for the English population as the outgroup. The outgroup serves as a reference group to compare the desired population and the ingroup to. Karitiana was used as the ingroup for both tests. Then, we can see how increased admixture affects the statistic. This was done using the commands below, where j spanned the admixture proportions mentioned previously:

# f3 with Yoruba as the outgroup; ${j} is braced so the shell does not read $j_f3 as a variable name
python ~/Desktop/project/bin/popstats/popstats.py --file final_admix_Cabecar_$j --f3 --pops C,Karitiana,Yoruba --informative > final_admix_Cabecar_${j}_f3.txt

# f3a with English as the outgroup
python ~/Desktop/project/bin/popstats/popstats.py --file final_admix_Cabecar_$j --f3 --pops C,Karitiana,English --informative > final_admix_Cabecar_${j}_f3a.txt

Comparing these values to the admixture levels, I was able to re-evaluate the suggested solution as needed. Then by getting an f3 statistic for these populations and setting the outgroup as the population assumed to have contaminated them (English population), I calculated a new f3 statistic, which was hopefully corrected for admixture.

To further test if this correction worked, I took populations that contained individuals with and without admixed genomes. By correcting for the portion of the population that was admixed, I saw if this corrected f3 statistic matched the unadmixed portion’s f3 statistic. I did this in individuals from the JaltocanHidalgo, Pima and Xaltocan populations. I then computed a baseline f3 statistic comparing the whole populations, with Karitiana as the ingroup, and Yoruba as the outgroup. After doing so, I got an f3 statistic from the admixed individuals in these populations in relation to Yoruba, and then got an f3 statistic from the admixed individuals in these populations in relation to an English population.

If the f3 statistic was successfully corrected, we could make inferences about the genetic histories of other contaminated populations. I then applied the f3 statistic to the populations AleutRaff, Algonquin, Cree, Chipewyan, Inupiat, Ojibwa, and Southern US Native American. I obtained the admixture proportion from the amount of European DNA in these individuals. Then I corrected for the f3 statistic by getting an f3 using Yoruba first, and then using English ancestry to compare their genomes to.

Based on preliminary results, the solution could take the form of a corrected equation for outgroup-f3 statistics. On the other hand, it could start with an equation to get a corrected f3 value, which is then manipulated further. This is where I could come up with a table of values that correspond to different levels of admixture. These “differences” between the semi-corrected f3 and the baseline f3 are then to be subtracted from the semi-corrected f3. Since others attempting to use this correction will not have a baseline f3 for comparison, our goal is to come up with a universal set of differences that can be used depending solely on the admixture levels.


Correction Attempt 1

Using the two f3 statistics, I posited a correction equation to get the corrected f3 values to look similar to the baseline f3 values when graphed. A correction similar to that proposed for the D statistic by Lindo was attempted first. However, this was unsuccessful. A new equation was then suggested and tested. This equation took the form of $(f_3 - f_{3a}) \cdot a + f_3$, where f3 was the statistic calculated with Yoruba as the outgroup, f3a was the statistic calculated with English as the outgroup, and a was the admixture proportion that we introduced into the population. Using these values, I graphed the relation between admixture proportion and the corrected f3 statistic. All the populations’ graphs exhibited similar trends. Below is a graph using Cabecar’s f3 values to be used as a reference.
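The Attempt 1 equation is simple enough to state as a small Python function. This is only a sketch: the function name and the input values are hypothetical and are not taken from the thesis data.

```python
def attempt1_corrected_f3(f3_yoruba, f3_english, a):
    """Return (f3 - f3a) * a + f3.

    f3_yoruba:  f3 with Yoruba as the outgroup (f3 in the text)
    f3_english: f3 with English as the outgroup (f3a in the text)
    a:          admixture proportion introduced into the population
    """
    return (f3_yoruba - f3_english) * a + f3_yoruba


# Hypothetical inputs, for illustration only.
print(attempt1_corrected_f3(f3_yoruba=0.20, f3_english=0.10, a=0.5))
```

The correction term scales linearly with a, so at a = 0 the statistic is left unchanged, and the size of the adjustment is set by how far apart the two outgroup choices place the population.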

[Figure: Cabecar. F3 values versus admixture proportion, showing corrected f3 values, baseline f3 values, and a linear trend line through the baseline f3 values.]

Correction Attempt 2

Clearly, the two sets of points are not that similar. As such, I attempted to again correct the equation. Looking at the trend of f3 values dipping around 20-40% admixture levels, it seemed that perhaps I was overcorrecting the f3 values by using f3 values that change with the admixture proportion. As such, I proposed the following equation instead: $(f_3 - f_{3a}) \cdot a + f_{3,\text{baseline}}$, where f3baseline was the value calculated for each of the populations when there was no artificial admixture introduced. This appeared to at least present a better correlation between admixture and corrected f3 values when graphed. Below is a graph of the newly corrected f3 values plotted against admixture proportions again.

[Figure: Cabecar. F3 values versus admixture proportion, showing corrected f3 values, baseline f3 values, and a linear trend line through the baseline f3 values.]

These new f3 values look relatively linear, and as such, I seemed to be on the right track. To further correct the f3 values, I attempted to find the difference between the newly corrected f3 values and the baseline f3 values. I did this for each population, and then found the averages of the differences for each admixture proportion. The results are tabulated here.

Admixture Proportion    Average Differences
0.05                    0.004970747
0.1                     0.010104352
0.15                    0.015366869
0.2                     0.020830249
0.25                    0.026399174
0.3                     0.032173864
0.35                    0.038141322
0.4                     0.044222014
0.45                    0.050421594
0.5                     0.056698452
0.55                    0.063145399
0.6                     0.06952562
0.65                    0.076407052
0.7                     0.083150127
0.75                    0.089963764
0.8                     0.097164007
0.85                    0.104699337
0.9                     0.111876887
0.95                    0.119257046

I then plotted the admixture proportions and the average differences, as they looked quite similar. I hoped to see if there was a correlation using a linear relationship. The R² value was 0.9973, indicating that there is a strong linear relationship between these two values. Thus, I attempted to use the equation for the linear regression line as a correction for the f3 values. I used the values that I had corrected using the equation $(f_3 - f_{3a}) \cdot a + f_{3,\text{baseline}}$ and then subtracted the difference, calculated using the following equation: $y = 0.1272x - 0.005$. Given a certain admixture proportion, I would plug that value in for x in the equation to get the difference to be subtracted from the corrected f3 value. This resulted in a parabolic-looking graph of the f3 values plotted against the admixture proportion, shown below (again with the baseline f3 values plotted as a reference for the desired values).
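The regression step can be reproduced from the tabulated average differences with an ordinary least-squares fit. The sketch below is written in plain Python rather than the spreadsheet or R tooling the thesis presumably used, and the helper name `linear_fit` is an assumption.

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Admixture proportions 0.05 ... 0.95 and the average differences from the table.
proportions = [k * 5 / 100 for k in range(1, 20)]
avg_diff = [
    0.004970747, 0.010104352, 0.015366869, 0.020830249, 0.026399174,
    0.032173864, 0.038141322, 0.044222014, 0.050421594, 0.056698452,
    0.063145399, 0.06952562, 0.076407052, 0.083150127, 0.089963764,
    0.097164007, 0.104699337, 0.111876887, 0.119257046,
]

slope, intercept, r2 = linear_fit(proportions, avg_diff)
print(f"y = {slope:.4f}x + ({intercept:.4f}), R^2 = {r2:.4f}")
```

The fitted slope and intercept round to the coefficients quoted in the text, which suggests the regression was run on exactly these nineteen points.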

[Figure: F3 Versus Admixture. F3 values versus admixture proportion, showing corrected f3 values, baseline f3 values, upper bound, lower bound, and a linear trend line through the baseline f3 values.]

Clearly, this was again not an ideal correction of the f3 values. I attempted to put this in the perspective of the confidence intervals of the baseline f3 values. As such, the upper bound and lower bound representative of one standard deviation above and below the baseline f3 are shown on the graph (the standard deviation was calculated by the popstats program used to get the baseline f3 value). Therefore, I attempted to fix the regression equation we had gotten from the average differences. As such, I used the following power equation instead: $y = 0.122x^{k}$ [the exponent k and the accompanying R² value are illegible in the source], indicating that this equation might work as a correction. Nonetheless, once I used this equation with the different admixture proportions to subtract from the corrected f3 values, I still had a graph that did not look ideal (below).

[Figure: F3 Versus Admixture. F3 values versus admixture proportion, showing corrected f3 values, baseline f3 values, upper bound, lower bound, and a linear trend line through the baseline f3 values.]

Correction Attempt 3

Then, I attempted to just use the average differences to subtract from the corrected f3. I hoped to get these differences for more admixture values where the linear regression line/power line did not match the data well, if this attempt worked. I calculated new f3 values with this correction and got the following graph.

[Figure: F3 Versus Admixture. F3 values versus admixture proportion, showing corrected f3 values, baseline f3 values, upper bound, lower bound, and a linear trend line through the baseline f3 values.]


This graph clearly looked a lot better than previous attempts. Furthermore, it was the only solution thus far that yielded corrected f3 values within the bounds of one standard deviation above and below the baseline f3 statistic. Nonetheless, it was not a perfect fit.

To make this graph even better, I got intervals that were closer together (intervals of 1% admixture) between 75 and 85% of admixture. This was an area that looked to have a large degree of variance between the baselines and the corrected f3 values. As such, if these new differences that were calculated were better indicators of the difference to subtract from the corrected f3, then I could use these values for the correction.

After finding intervals that were closer together, I noticed that this did not significantly impact the correction factor. As such, I tried to use a second-order polynomial equation, and got the highest R² value yet (R² = 0.99999). Below is the graph when using the quadratic equation to correct the f3 values to baseline.

[Figure: F3 versus admixture. F3 values versus admixture proportion, showing corrected f3 values, baseline f3, upper bound, lower bound, and a linear trend line through the baseline f3.]

When I continued with my results, I quickly ran into a snag. I had used the baseline f3 to find a correction equation to get to the baseline f3. In other words, I used the result to force the desired result. However, I was unable to factor out the baseline f3 values to get a correction independent of them.


Correction Attempt 4

As such, I was back to square one and attempted to work with the initial correction equation for f3, $(f_3 - f_{3a}) \cdot a + f_3$. I then went back and got the differences between the baseline f3 values and these f3 values. After doing this, I plotted the baseline f3 values against the “corrected” f3 values. There appeared to be a fairly linear trend amongst the f3 values using the equation above, across all six populations. I also noticed that all the f3 values were less than the baseline f3, which reinforced the trend of decreasing f3 values with increased admixture levels. Below is a sample graph from the population Cabecar (with the baseline f3 values in orange, and the preliminarily corrected f3 values in blue). The equation given is for the linear regression line for the preliminarily corrected f3 values.

[Figure: Baseline Versus F3 Values. F3 values versus admixture proportion, showing f3 values and baseline f3 values; linear fit to the f3 values: y = -0.1036x + 0.221, R² = 0.99645.]

This led us to believe that we could use the differences between the baseline and the preliminarily corrected f3. After doing this for the six populations, I got the average of the six differences for each admixture level. For instance, I got the average difference for an admixture proportion of 5% across all six populations. After doing so, I used the average differences to get a newly corrected f3 by adding them to the preliminarily corrected f3. I noticed that these new f3 values were relatively similar to the baseline f3, though they were not ideal. As such, I decided it would be beneficial to get a 95% confidence interval for the differences, to see if these confidence intervals of differences would give us something in an appropriate range around the baseline when added to the preliminarily corrected f3.

To do so, I wanted to use a t-test, but the data was not approximately normally distributed. Therefore, I used a Wilcoxon signed-rank test, which is a non-parametric statistical hypothesis test that allows us to perform a version of the t-test without normally distributed data. It is often referred to as the Wilcoxon T Test. Upon doing so in R, I noticed that the 95% confidence intervals for the differences for each admixture level across the six populations would give us a range of differences. Below is a table of these confidence intervals.

Admixture Proportion    Wilcox Confidence Intervals
0.05                    (0.00750608, 0.02297737)
0.1                     (0.01441939, 0.03037686)
0.15                    (0.02137041, 0.03316990)
0.2                     (0.02569456, 0.04308678)
0.25                    (0.03300473, 0.04942423)
0.3                     (0.03840717, 0.05224841)
0.35                    (0.04300735, 0.05772577)
0.4                     (0.04815174, 0.06442764)
0.45                    (0.05378068, 0.06854280)
0.5                     (0.06017016, 0.07539880)
0.55                    (0.06489406, 0.08105843)
0.6                     (0.07016284, 0.08537045)
0.65                    (0.07382676, 0.09035722)
0.7                     (0.07968564, 0.09463120)
0.75                    (0.08441375, 0.10056375)
0.8                     (0.08823818, 0.10527647)
0.85                    (0.09257367, 0.11016625)
0.9                     (0.09765934, 0.11534444)
0.95                    (0.1012127, 0.1194585)
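The source does not name the R call behind these intervals. As an illustration of how a signed-rank confidence interval can be constructed, the sketch below uses the textbook approach: order the Walsh averages (all pairwise means, including each value with itself) and pick the k-th smallest and k-th largest, with k taken from the exact null distribution of the signed-rank statistic. The function names and the six sample differences are hypothetical, and ties or zeros are not treated specially.

```python
def signed_rank_counts(n):
    """counts[s] = number of subsets of the ranks {1..n} that sum to s."""
    total = n * (n + 1) // 2
    counts = [0] * (total + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        # Iterate downwards so that each rank is used at most once.
        for s in range(total, rank - 1, -1):
            counts[s] += counts[s - rank]
    return counts


def wilcoxon_ci(values, conf=0.95):
    """Return (lower, upper, achieved_confidence) from ordered Walsh averages."""
    n = len(values)
    walsh = sorted((values[i] + values[j]) / 2
                   for i in range(n) for j in range(i, n))
    counts = signed_rank_counts(n)
    half_alpha = (1 - conf) / 2
    k, cumulative = 0, 0
    for s, c in enumerate(counts):
        cumulative += c
        if cumulative / 2 ** n <= half_alpha:
            k = s + 1          # P(V < s + 1) still fits within alpha / 2
        else:
            break
    k = max(k, 1)              # very small samples: use the widest interval
    achieved = 1 - 2 * sum(counts[:k]) / 2 ** n
    return walsh[k - 1], walsh[len(walsh) - k], achieved


# Six hypothetical differences, one per population, for illustration only.
diffs = [0.012, 0.008, 0.020, 0.015, 0.009, 0.023]
print(wilcoxon_ci(diffs))
```

Under this construction a sample of six values gives k = 1, so the nominal 95% interval runs from the smallest to the largest observed difference at an exact level of 96.875%. That is consistent with the wide intervals in the table and with the later remark that more populations would narrow them.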

When the uncorrected f3 values were added to the lower and upper bounds of the confidence intervals, I got intervals for newly corrected f3 values. Once I did this, I noticed that this interval of f3 values included the baseline f3 values. At first, I hoped to get the baseline f3 values to align with the newly corrected f3 values when using one standard deviation above and below the baseline in conjunction with the confidence interval of newly corrected f3 values. However, the correction using these differences worked well enough that we did not need to consider one standard deviation above and below the baseline f3. Simply using the confidence intervals for the differences to get confidence intervals for corrected f3 values was sufficient as a correction.

I then applied this correction to the natural populations, JaltocanHidalgo, Pima, and Xaltocan. I did so by rounding the admixture proportion for these populations to the nearest multiple of 0.05, such that I would be able to use the differences (since we only had these for admixtures that were multiples of 0.05). Upon doing so, I used the confidence intervals for the differences and added the lower and upper bounds to the initial, uncorrected f3 value. Once I did this, I noticed that the baseline f3 statistic fell in this range of new f3 values in the Pima population and in the Xaltocan population. However, this correction did not work for JaltocanHidalgo. The range of new f3 values ended up being (0.247343959, 0.262815249), whereas the baseline f3 value was 0.227966338.
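The lookup-and-add procedure described above can be sketched in Python as follows. The table entries come from the confidence intervals listed earlier; the function names are hypothetical, and the uncorrected JaltocanHidalgo f3 value is not stated in the text, so it is back-calculated here from the reported range and the 0.05 interval (the only row whose width matches the reported range).

```python
# 95% Wilcoxon confidence intervals for the differences, keyed by admixture percent.
WILCOXON_CI = {
    5: (0.00750608, 0.02297737), 10: (0.01441939, 0.03037686),
    15: (0.02137041, 0.03316990), 20: (0.02569456, 0.04308678),
    25: (0.03300473, 0.04942423), 30: (0.03840717, 0.05224841),
    35: (0.04300735, 0.05772577), 40: (0.04815174, 0.06442764),
    45: (0.05378068, 0.06854280), 50: (0.06017016, 0.07539880),
    55: (0.06489406, 0.08105843), 60: (0.07016284, 0.08537045),
    65: (0.07382676, 0.09035722), 70: (0.07968564, 0.09463120),
    75: (0.08441375, 0.10056375), 80: (0.08823818, 0.10527647),
    85: (0.09257367, 0.11016625), 90: (0.09765934, 0.11534444),
    95: (0.1012127, 0.1194585),
}


def nearest_five_percent(a):
    """Round an admixture proportion to the nearest multiple of 0.05, in percent."""
    return int(round(a * 20)) * 5


def corrected_interval(f3_uncorrected, a):
    """Add the CI bounds for the rounded admixture level to the uncorrected f3."""
    lower, upper = WILCOXON_CI[nearest_five_percent(a)]  # KeyError below 2.5% or above 97.5%
    return f3_uncorrected + lower, f3_uncorrected + upper


# JaltocanHidalgo: uncorrected f3 back-calculated as 0.247343959 - 0.00750608.
lo, hi = corrected_interval(0.239837879, 0.05)
baseline = 0.227966338
print((lo, hi), lo <= baseline <= hi)
```

For JaltocanHidalgo the back-calculated uncorrected f3 already exceeds the baseline, so adding a positive difference can only move the interval further away. This is one way to see why the correction fails for that population.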

Regardless, I then applied this correction to the populations that had admixture, AleutRaff, Algonquin, Cree, Chipewyan, Inupiat, Ojibwa, and Southern US Native American. I used the confidence intervals for the differences again and rounded the admixture proportion for each population to the nearest multiple of 0.05. Upon doing so, I calculated an interval of f3 values that the baseline f3 is presumed to fall in.

To see if I was able to get a better correction, I plotted the average differences. I was able to use a polynomial regression line since the R² values were all above 0.99993. I then got the equation for this curve, which I then used to get a value (using the admixture proportion as the x value) to add to the preliminarily corrected f3. This results in f3 values that are similar to the f3 values I got from merely adding back in the average difference for the admixture proportion in 5% increments. However, they do not fall within one standard deviation of the baseline f3 values, just as adding the differences for 5% admixture increments did not yield f3 values that fell within that range either.

As such, I plotted the lower bounds and upper bounds of the 95% Wilcox confidence intervals separately and found regression lines for each. I found that second-order polynomial equations fit the data best (highest R² value) and was able to use these equations to add back in the difference needed to reach the baseline f3 value. This allowed a continuous correction of the f3 statistic, rather than just at discrete admixture intervals of 5%.
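A continuous correction of this kind can be sketched by fitting a second-order polynomial to each set of bounds and evaluating it at any admixture proportion. The thesis does not report the fitted coefficients, so the code below refits them from the tabulated bounds by least squares. The helper names are hypothetical and the resulting curves are an illustration, not the author's equations.

```python
def quadratic_fit(xs, ys):
    """Least-squares fit of y = c0 + c1*x + c2*x^2 via the normal equations."""
    power_sums = [sum(x ** k for x in xs) for k in range(5)]
    rows = [[power_sums[i + j] for j in range(3)] +
            [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(3)]
    # Gaussian elimination with partial pivoting on the 3x4 augmented matrix.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, 3):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, 4):
                rows[r][c] -= factor * rows[col][c]
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):  # back-substitution
        tail = sum(rows[i][j] * coef[j] for j in range(i + 1, 3))
        coef[i] = (rows[i][3] - tail) / rows[i][i]
    return coef


def evaluate(coef, x):
    return coef[0] + coef[1] * x + coef[2] * x * x


proportions = [k * 5 / 100 for k in range(1, 20)]
lower = [0.00750608, 0.01441939, 0.02137041, 0.02569456, 0.03300473,
         0.03840717, 0.04300735, 0.04815174, 0.05378068, 0.06017016,
         0.06489406, 0.07016284, 0.07382676, 0.07968564, 0.08441375,
         0.08823818, 0.09257367, 0.09765934, 0.1012127]
upper = [0.02297737, 0.03037686, 0.03316990, 0.04308678, 0.04942423,
         0.05224841, 0.05772577, 0.06442764, 0.06854280, 0.07539880,
         0.08105843, 0.08537045, 0.09035722, 0.09463120, 0.10056375,
         0.10527647, 0.11016625, 0.11534444, 0.1194585]

lower_coef = quadratic_fit(proportions, lower)
upper_coef = quadratic_fit(proportions, upper)

# Corrected f3 range for a hypothetical uncorrected f3 at 37% admixture.
f3_uncorrected, a = 0.18, 0.37
print(f3_uncorrected + evaluate(lower_coef, a),
      f3_uncorrected + evaluate(upper_coef, a))
```

Fitting the two bounds separately keeps the interval interpretation: for any admixture proportion between 5% and 95%, the two curves give a lower and an upper amount to add back, without rounding to the nearest 5%.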


Conclusion

Through the course of this research project, I have developed a crude admixture correction for the outgroup-f3 statistic. By first finding the f3 value of the contaminated population, a “correction factor” can be added back in to bring that value within a ballpark around the baseline f3 statistic. This correction factor comes in the form of a lower bound quadratic equation and an upper bound quadratic equation. When both of these are added to the f3 statistic, the result is a range of f3 values. Comparing these results to the baseline f3 statistics, I conclude that this correction works within a margin of error. Since the correction only worked in two out of the three populations with admixed and unadmixed individuals, we cannot conclude irrefutably that this correction works.

Nonetheless, the correction worked for all admixture levels in all six of the artificially admixed populations (6 × 19 = 114 cases). Therefore, I applied the correction to the seven populations that were completely admixed with European DNA. This resulted in a range of f3 values that resembled appropriate f3 values. However, there is no way to check for which of these seven populations the correction actually worked.

In the future, researchers might be able to fine-tune our correction using data from more populations. For instance, our confidence intervals for the Wilcoxon signed-rank test would likely span a shorter range if there were more data to pull from. Furthermore, it is possible that researchers might be able to further manipulate the postulated equations mentioned previously. Given that Lindo and colleagues were able to find a neat correction equation for the D statistic, it is possible that there exists one for the f3 statistic as well. It was also observed during this project that certain corrections that were suggested worked better at lower admixture proportions. Just as the normal f3 statistic is most accurate under certain conditions, one of which is that the admixture proportion be close to 50%, it is possible that the outgroup-f3 statistic works best at lower admixture proportions.

Regardless, this correction is useful for researchers hoping to study the genetic relatedness of different populations. In particular, this potential solution is most useful for those hoping to perform outgroup-f3 statistics in populations that have individuals with genetic admixture.


Glossary

• Admixture: genetic interaction of two or more previously isolated populations interbreeding

• D statistic: a four-population test for admixture

• F statistic: measures shared genetic drift between sets of populations

    o Normal f3 statistic: tests for admixture between three populations

    o Outgroup-f3 statistic: proportional to amount of shared genetic history between two populations

• For-loop: a control flow statement that specifies iteration to execute a code repeatedly

• Genetic drift: the change in allele frequencies in a population over generations as a mechanism of evolution

• Genetic relatedness: probability that two individuals share an allele from common ancestry

• Linkage disequilibrium: non-random association of alleles at various loci

• Outgroup: reference group of organisms not in the populations being studied

• Phylogenetic trees: branching diagram representing evolutionary relationships amongst organisms

• SNPs: single nucleotide polymorphisms; change in a single nucleotide at a specific genome position


Bibliography

Alexander, David H., et al. “Fast Model-Based Estimation of Ancestry in Unrelated Individuals.” Genome Research, Cold Spring Harbor Lab, 22 June 2009, genome.cshlp.org/content/early/2009/07/31/gr.094052.109.

Lindo, John, et al. “Ancient Individuals from the North American Northwest Coast Reveal 10,000 Years of Regional Genetic Continuity.” PNAS, National Academy of Sciences, 18 Apr. 2017, www.pnas.org/content/114/16/4093.

Myles, Sean, et al. “Genetic Structure and Domestication History of the Grape.” PNAS, National Academy of Sciences, 1 Mar. 2011, www.pnas.org/content/108/9/3530.abstract.

Patterson, Nick, et al. “Ancient Admixture in Human History.” Genetics, Genetics, 1 Nov. 2012, www.genetics.org/content/192/3/1065.

Peter, Benjamin M. “Admixture, Population Structure, and F-Statistics.” Genetics, Genetics, 1 Apr. 2016, www.genetics.org/content/202/4/1485.

Pontussk. “Pontussk/Popstats.” GitHub, GitHub, 30 July 2015, github.com/pontussk/popstats.

Raghavan, Maanasa, et al. “Upper Palaeolithic Siberian Genome Reveals Dual Ancestry of Native Americans.” Nature, Macmillan Publishers Limited, 2 Jan. 2014, www.academia.edu/7110954/Upper_Palaeolithic_Siberian_genome_reveals_dual_ancestry_of_Native_Americans.

Reich, David, et al. “Reconstructing Indian Population History.” Nature, U.S. National Library of Medicine, 24 Sept. 2009, www.ncbi.nlm.nih.gov/pmc/articles/PMC2842210/.

Vyas, Deven N., et al. “Testing Support for the Northern and Southern Dispersal Routes out of Africa: an Analysis of Levantine and Southern Arabian Populations.” American Journal of Physical Anthropology, Wiley-Blackwell, 15 Sept. 2017, onlinelibrary.wiley.com/doi/10.1002/ajpa.23312/full.

