UK Biobank Phasing and Imputation Documentation
Version 1.2
13 November 2015
Documentation author: Jonathan Marchini, Department of Statistics, University of Oxford, on behalf of UK Biobank
Contributors to UK Biobank Phasing and Imputation: Jonathan Marchini (Statistics Dept, Oxford), Jared O'Connell (WTCHG, Oxford), Olivier Delaneau (University of Geneva), Kevin Sharp (Statistics Dept, Oxford), Warren Kretzschmar (WTCHG, Oxford), Gavin Band (WTCHG, Oxford), Shane McCarthy (WTSI, Hinxton), Desislava Petkova (WTCHG, Oxford), Claire Bycroft (WTCHG, Oxford), Colin Freeman (WTCHG, Oxford), Peter Donnelly (WTCHG, Oxford).
Table of Contents
Introduction
Phasing
    Filtering before phasing
    Phasing method description
    Validation of the phasing method
    Whole genome phasing
Genotype imputation
    Assessment of the UK Biobank Array for imputation
    Reference panel used for imputation
    Imputation method description
    Whole genome imputation
    Information scores, minor allele frequencies and filtering
    Imputed genotype files
        Sample files
    Differences between raw genotypes and imputed files
An exemplar genome wide association study
    Sample filtering
    Taking account of the different arrays used
    Association testing
    Results
File processing
References
Introduction
This document describes the analysis carried out to perform genotype imputation for the interim release of the UK Biobank (UKB) genotype data. It also provides advice about using the imputed data to carry out genome-wide association studies (GWAS) or for extracting genotypes for use as covariates in other types of association study.
Genotype imputation [1,2] is the process of predicting genotypes that are not directly assayed in a sample of individuals. A reference panel of haplotypes at a dense set of SNPs, indels and structural variants is used to impute genotypes into a study sample of individuals that have been genotyped at a subset of the SNPs. These 'in silico' genotypes can then be used to boost the number of SNPs that can be tested for association. This increases the power of the study and the ability to resolve or fine-map the causal variants, and facilitates meta-analysis. The result of the imputation process is a dataset with 73,355,667 SNPs, short indels and large structural variants in 152,249 individuals. See Box 1 of [1] for a quick visual overview of how genotype imputation works.
The process of imputation is divided into two steps: (i) pre-phasing and (ii) imputation. In the first step, the samples to be imputed are 'pre-phased', i.e. a statistical method is applied to genotype data to infer the underlying haplotypes of each individual. In the second step, a different statistical method is used to combine the inferred haplotypes with a reference panel of haplotypes and impute the unobserved genotypes in each sample. The following two sections of this document describe how the pre-phasing and imputation were carried out on the ~150,000 samples.
Phasing and imputation can be a computationally intensive process. To avoid many different research groups having to carry this out independently, phasing and imputation have been carried out centrally. Questions about using the imputed genotypes should be sent to the UKB Genetics mailing list set up for this purpose. You can subscribe to the mailing list here: https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=UKB-GENETICS
Phasing
Filtering before phasing
To create the input data for phasing we applied SNP QC filters as described in the UK Biobank QC documentation [3]. The samples were genotyped on two slightly different chips. Approximately 50,000 were genotyped as part of the UK BiLEVE study using a chip designed for that study (denoted UKBL), with the remaining samples (~100,000) genotyped on the UKB chip. Therefore, we applied different missingness filters on SNPs depending on the chip. SNPs were removed based on the number of batches in which they were completely missing:
i. SNPs on both the UKB chip and the UKBL chip - remove them if they are missing in more than 3 batches (out of 33 batches)
ii. SNPs on the UKB chip and not the UKBL chip - remove them if they are missing in more than 2 batches (out of 22 batches)
iii. SNPs on the UKBL chip and not the UKB chip - remove them if they are missing in more than 1 batch (out of 11 batches)
1,037 sample outliers [3] were removed. Multi-allelic SNPs and SNPs with a minor allele frequency (MAF) < 1% were then removed from the dataset. These filters resulted in a dataset with 641,018 autosomal SNPs in 152,256 samples. Chromosome X phasing and imputation will be carried out at a later date.
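The SNP-level rules above can be written as a small predicate. The following is a minimal sketch, not the pipeline that was actually run: the `Snp` record and its field names are hypothetical, and MAF is assumed to be stored as a proportion rather than a percentage.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    """Hypothetical per-SNP summary; field names are illustrative only."""
    name: str
    on_ukb: bool          # SNP is present on the UKB chip
    on_ukbl: bool         # SNP is present on the UKBL chip
    missing_batches: int  # number of batches in which the SNP is completely missing
    maf: float            # minor allele frequency as a proportion (0.01 == 1%)
    n_alleles: int = 2


def max_missing_batches(on_ukb: bool, on_ukbl: bool) -> int:
    """Largest number of completely missing batches a SNP may have and still be kept."""
    if on_ukb and on_ukbl:
        return 3  # rule i: out of 33 batches
    if on_ukb:
        return 2  # rule ii: out of 22 batches
    if on_ukbl:
        return 1  # rule iii: out of 11 batches
    raise ValueError("SNP is on neither chip")


def passes_prephasing_filters(snp: Snp, min_maf: float = 0.01) -> bool:
    """True if the SNP survives the multi-allelic, MAF and batch-missingness filters."""
    if snp.n_alleles != 2:
        return False  # multi-allelic SNPs are removed
    if snp.maf < min_maf:
        return False  # SNPs with MAF < 1% are removed
    return snp.missing_batches <= max_missing_batches(snp.on_ukb, snp.on_ukbl)
```

The sample outlier removal and the upstream SNP QC described in [3] are separate steps and are not modelled here.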
Phasing method description
Phasing on the autosomes was carried out using a version of the SHAPEIT2 [4] program modified to allow for very large sample sizes. This new method (which we refer to as SHAPEIT3) modifies SHAPEIT2's surrogate family approach to remove a quadratic complexity component of the algorithm [5]. In small sample sizes of a few thousand samples, this part of the algorithm, which involves calculating Hamming distances between current haplotype estimates, contributes only a relatively small part of the computational cost. As sample sizes increase beyond 10,000 samples this component becomes significant. The new algorithm uses a divisive clustering algorithm to identify clusters of haplotypes, and then calculates Hamming distances only between pairs of haplotypes within each cluster. Only haplotypes within each cluster are used as candidates for the surrogate family copying states in the HMM. The resulting algorithm has complexity O(N log N), where N is the number of haplotypes in the dataset being phased. In practice, we have observed that the method exhibits scaling close to linear. This is a crucial feature of the method, especially for very large sample sizes, and a property not shared by other approaches [6,7]. The development of this approach is ongoing and there is substantial scope to make further improvements in speed and accuracy. A newer version is likely to offer an order-of-magnitude reduction in run time.
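To illustrate the idea of restricting Hamming distance calculations to clusters, here is a toy sketch. It is not the SHAPEIT3 implementation: the splitting rule (two seed haplotypes, nearest-seed assignment) is a simple stand-in chosen for brevity, and the function names and the `max_size` and `n_states` parameters are hypothetical.

```python
def hamming(a, b):
    """Number of sites at which two equal-length haplotypes (lists of 0/1) differ."""
    return sum(x != y for x, y in zip(a, b))


def split(indices, haps):
    """Split one cluster in two around a pair of seed haplotypes."""
    seed_a = indices[0]
    seed_b = max(indices, key=lambda i: hamming(haps[seed_a], haps[i]))
    left, right = [], []
    for i in indices:
        if hamming(haps[i], haps[seed_a]) <= hamming(haps[i], haps[seed_b]):
            left.append(i)
        else:
            right.append(i)
    if not right:
        # All haplotypes are identical to the first seed: halve arbitrarily.
        mid = len(indices) // 2
        left, right = indices[:mid], indices[mid:]
    return left, right


def divisive_clusters(haps, max_size):
    """Top-down clustering: keep splitting until every cluster has <= max_size members."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    stack = [list(range(len(haps)))]
    clusters = []
    while stack:
        cluster = stack.pop()
        if len(cluster) <= max_size:
            clusters.append(cluster)
        else:
            stack.extend(split(cluster, haps))
    return clusters


def candidate_states(haps, max_size, n_states):
    """For each haplotype, the n_states nearest haplotypes drawn from its own cluster only."""
    result = {}
    for cluster in divisive_clusters(haps, max_size):
        for i in cluster:
            others = sorted(
                (j for j in cluster if j != i),
                key=lambda j: hamming(haps[i], haps[j]),
            )
            result[i] = others[:n_states]
    return result
```

Because pairwise distances are only computed inside clusters of bounded size, the all-pairs comparison over the whole dataset is avoided, which is the property the text describes.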
Validation of the phasing method
The accuracy of this new method was assessed by taking advantage of 72 mother-father-child trios that were identified in the UKB dataset [3]. This family information can be used to infer the phase of a large number of SNPs in the trio parents. These family inferred haplotypes were used as a truth set, as is common in the phasing literature [4]. The parents of each trio were removed from the dataset and then haplotypes were estimated across chromosome 20 in a single run of SHAPEIT3. This dataset consisted of 16,762 autosomal SNPs. The inferred haplotypes were then compared to the truth set using the switch error metric [4]. We obtained an exceptionally low switch error rate of 0.4% across the trio children reporting British ancestry. By adjusting parameters of the method we have observed switch error rates lower than 0.3%. With switch error rates this low, long chunks of sequence of many megabases will be inferred correctly. Downstream imputation from such haplotypes will be highly accurate.
To assess the performance gain of phasing all 152,112 samples together, versus phasing in smaller subsets of samples, two other test datasets of size 1,072 and 10,072 samples were created, also containing the trio children. The results are shown in full detail in Table 1 and highlight the benefits of joint phasing of all the samples. These results clearly demonstrate the close to linear scaling of the SHAPEIT3 algorithm.

Sample size  Method    Switch error (%)  Run time (hrs)  Run time scaling  Sample size scaling  Threads
1,072        SHAPEIT3  2.6               0.25            1                 1                    10
10,072       SHAPEIT3  1.3               2.5             10                9.4                  10
152,112      SHAPEIT3  0.4               38.5            154               142                  10

Table 1: Phasing performance on UKB samples.
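For readers unfamiliar with the switch error metric, the sketch below shows one common way of computing it for a single individual. It assumes the truth and inferred haplotype pairs cover the same sites and imply the same genotypes; the function name and input layout are illustrative, not taken from the software used here.

```python
def switch_error_rate(truth, inferred):
    """Switch error rate for one individual.

    truth and inferred are each a pair of haplotypes (lists of 0/1 alleles)
    over the same ordered sites. Only sites that are heterozygous in the truth
    set carry phase information, so homozygous sites are skipped.
    """
    t0, t1 = truth
    i0, _ = inferred
    het_sites = [s for s in range(len(t0)) if t0[s] != t1[s]]
    if len(het_sites) < 2:
        return 0.0  # no pair of consecutive heterozygous sites to compare
    # Orientation at each heterozygous site: does inferred haplotype 0
    # carry the same allele as truth haplotype 0?
    orientation = [i0[s] == t0[s] for s in het_sites]
    # A switch occurs whenever the orientation changes between consecutive het sites.
    switches = sum(
        orientation[k] != orientation[k - 1] for k in range(1, len(orientation))
    )
    return switches / (len(het_sites) - 1)
```

Swapping the two inferred haplotypes as a whole gives a rate of zero under this definition, since only changes of orientation between consecutive heterozygous sites are counted.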
Whole genome phasing
Phasing was carried out in chunks of 5,000 SNPs, with an overlap of 250 SNPs between chunks. SHAPEIT3 was run on each chunk using 4 cores per job and S=200 copying states. Any remaining missing genotypes were imputed as part of the phasing process. Chunks were ligated using a modified version of the hapfuse program.
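The chunking scheme can be sketched as follows. This is an illustration only: the treatment of the final, shorter chunk is an assumption, and the function name and the use of 0-based half-open index ranges are choices made for this example.

```python
def snp_chunks(n_snps, chunk_size=5000, overlap=250):
    """Return (start, end) SNP index ranges, 0-based and half-open.

    Consecutive chunks share `overlap` SNPs, which are what allow
    neighbouring chunks to be ligated after phasing.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if n_snps <= 0:
        return []
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, n_snps)
        chunks.append((start, end))
        if end == n_snps:
            break
        start += step
    return chunks
```

For example, `snp_chunks(12000)` yields three chunks, each sharing 250 SNPs with its neighbour.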
Genotype imputation
Assessment of the UK Biobank Array for imputation
The UK Biobank Axiom array from Affymetrix was specifically designed to optimize imputation performance in GWAS [8]. An experiment was carried out to assess the imputation performance of the array, stratified by allele frequency, and to compare performance to some other commercially available arrays.
Performance was assessed using high-coverage, whole-genome sequence data made publicly available by Complete Genomics (CG).
Data from 10 samples from the European ancestry (CEU) population were used. All variant sites with a call rate below 90% were filtered out in order to consider only very reliable sites in the analysis. Only data from chromosome 20 were used. To mimic a typical imputation analysis, a pseudo-GWAS dataset was constructed by extracting the CG SNP genotypes at all the sites included on a given array. All sites not on the array were then imputed using the UK10K reference panel [9]. Imputation was carried out using IMPUTE2 [10], which chooses a custom reference panel for each study individual in each 1Mb segment of the genome. The k_hap parameter of IMPUTE2 was set to 1,000. All other parameters were set to default values. This experiment was repeated for 4 different genome-wide SNP arrays: (a) Affymetrix UK Biobank Axiom array, (b) Illumina Omni 2.5M array, (c) Illumina Omni 1M Quad, (d) Illumina Omni Express. Variants were stratified into allele frequency bins and the squared correlation (R^2) was calculated between the allele dosages at variants in each bin and the masked CG genotypes. Since different arrays contain different numbers of variants, it is important to make sure that imputation performance is measured at the same set of variants when comparing chips. To achieve this, both imputed and array variants were included in the R^2 analysis, so that the comparison measures the overall performance of each array. As a consequence, an array with more variants will gain an advantage, as it is reasonable to expect that directly genotyping a variant will yield more accurate genotypes than imputation. Figure 1 shows the results of this analysis. The x-axis is non-reference allele frequency (%) on a log scale, which focuses on rarer variants. The y-axis is imputation performance (R^2). The salient points are
a. The UK Biobank chip (purple) outperforms the Illumina Omni 1M Quad (blue) and Illumina Omni Express (green), both of which have comparable numbers of variants.
b. The UK Biobank chip performs almost as well as the Illumina 2.5M chip (red), which has ~3 times the number of SNPs. It is worth noting that the UKB chip and Illumina Omni 2.5M chip are very close in the 1-5% range. This is a likely consequence of the chip design process focusing in part on this frequency range [8].
The overall conclusion of this analysis is that the Affymetrix UKB array is a very good array from which to carry out genotype imputation. The caveat is that this analysis is focused on samples with European ancestry.
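The per-bin accuracy measure used in this comparison can be sketched as below. The sketch assumes that "aggregate" R^2 means pooling all dosage and genotype pairs from the variants in a bin before computing one squared Pearson correlation; the input layout, function names and bin edges are hypothetical.

```python
from collections import defaultdict


def squared_correlation(x, y):
    """Squared Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either side has no variation
    return (sxy * sxy) / (sxx * syy)


def aggregate_r2_by_bin(variants, bin_edges):
    """Pool dosages and masked true genotypes per frequency bin, then compute R^2.

    variants: iterable of (freq_percent, dosages, true_genotypes), with one
    dosage and one true genotype (0, 1 or 2) per sample.
    bin_edges: increasing frequencies in percent; bin b is [edge_b, edge_b+1).
    """
    pooled = defaultdict(lambda: ([], []))
    for freq, dosages, truth in variants:
        for b in range(len(bin_edges) - 1):
            if bin_edges[b] <= freq < bin_edges[b + 1]:
                pooled[b][0].extend(dosages)
                pooled[b][1].extend(truth)
                break
    return {
        (bin_edges[b], bin_edges[b + 1]): squared_correlation(xs, ys)
        for b, (xs, ys) in pooled.items()
    }
```

Variants that are directly genotyped on an array would enter this calculation with their observed genotypes as dosages, which is why an array with more variants gains an advantage.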
Figure 1: Comparison of imputation performance of the UK Biobank Array and several other commercially available genotyping arrays.
Reference panel used for imputation
There are a number of factors that influence the accuracy of genotype imputation [1], but generally accuracy will increase as the number of haplotypes in the reference panel grows and if the ancestry of the sample haplotypes is a good match to the ancestry of the reference panel haplotypes. The UKB dataset consists of samples with a diverse range of ancestries, but with the majority of samples having British (or European) ancestry. For this reason it was desirable to use a reference panel with a large number of haplotypes with British and European ancestry, and also a diverse set of haplotypes from other world-wide populations. To achieve this the UK10K haplotype reference panel was merged with the 1000 Genomes Phase 3 reference panel using the -merge_ref_panels option in the IMPUTE2 software. Using this merged panel has been shown to produce a high-quality reference panel for imputation [9]. An advantage of this reference panel is that it includes SNPs, short indels and larger structural variants. The reference panel consists of 87,696,888 bi-allelic variants in 12,570 haplotypes.
[Figure 1 plot not reproduced. Title: Genotyping accuracy after imputation from UK10k (7562 haplotypes). Samples: 10 EUR CG. Comparison at 219303 sites on chr20 (includes genotyped SNPs). Allele frequency calculated from reference panel. x-axis: non reference allele frequency (%), 0.02 to 100 on a log scale. y-axis: aggregate R^2, 0.0 to 1.0. Legend (genotyping array): Illumina Omni 2.5M, Illumina Omni 1M Quad, Illumina Omni Express, Affy UK Biobank.]
Imputation method description
Imputation was carried out using the same algorithm as is implemented in the IMPUTE2 program. The current IMPUTE2 program is a very flexible tool for phasing and imputation that implements a general set of options. A new C++ program was written from scratch to focus exclusively on the haploid imputation needed when samples have been pre-phased. This new version uses less memory and less computation than IMPUTE2. The method takes advantage of high correlations between inferred copying states in the HMM to reduce computation. We refer to this program as IMPUTE3.
Whole genome imputation
Imputation was carried out in chunks of 2Mb with a 250kb buffer region. A set of 2,000 haplotype copying states was used to impute each sample. Imputed variants in the non-overlapping part of each chunk were concatenated into per-chromosome files.
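As a rough illustration of how buffered chunks and their non-overlapping cores relate, consider the sketch below. The coordinate convention (0-based, half-open), the placement of the buffer on both sides of each chunk, and the function names are assumptions made for this example.

```python
def imputation_windows(chrom_length, core=2_000_000, buffer=250_000):
    """Return one dict per chunk with its core range and its buffered range (bp)."""
    windows = []
    for core_start in range(0, chrom_length, core):
        core_end = min(core_start + core, chrom_length)
        windows.append(
            {
                "core": (core_start, core_end),
                # The buffer extends the region used for imputation on both
                # sides, clipped to the chromosome ends.
                "with_buffer": (
                    max(0, core_start - buffer),
                    min(chrom_length, core_end + buffer),
                ),
            }
        )
    return windows


def keep_core_variants(positions, core_range):
    """Keep only variant positions inside the non-overlapping core of a chunk."""
    start, end = core_range
    return [p for p in positions if start <= p < end]
```

Concatenating the output of `keep_core_variants` across consecutive windows gives each variant exactly once, which mirrors the per-chromosome concatenation described above.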
Information scores, minor allele frequencies and filtering
QCTOOL was used to calculate the minor allele frequency (MAF) and imputation information score of each imputed variant. The imputation information score is a metric between 0 and 1. A value of 1 indicates that there is no uncertainty in the imputed genotypes whereas a value of 0 means that there is complete uncertainty about the genotypes. A value of α in a sample of N individuals indicates that the amount of data at the imputed SNP is approximately equivalent to a set of perfectly observed genotype data in a sample size of αN.
Many GWAS carried out to date have used filters on MAF and information score by applying a threshold on these metrics. There is no single correct threshold to use. However, as MAF decreases it is generally the case that imputation quality decreases. Previous studies have tended to use a filter on information between 0.3 and 0.5. Since these studies have typically consisted of hundreds or low thousands of samples, an information of 0.3 corresponds to an effective sample size with limited power to detect associations. However, the UK Biobank dataset is considerably larger in size than most previous GWAS. An information measure of 0.3 in ~150,000 samples roughly corresponds to an effective sample size of ~45,000, which would be expected to yield very good power to detect association.
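The effective sample size argument is simple arithmetic, sketched here. The 2,000-sample study is a made-up value standing in for "low thousands of samples"; only the ~150,000 figure comes from the text.

```python
def effective_sample_size(info, n_samples):
    """Approximate number of perfectly observed samples equivalent to an
    imputed variant with information score `info` in `n_samples` individuals."""
    if not 0.0 <= info <= 1.0:
        raise ValueError("information score must lie between 0 and 1")
    return info * n_samples


# Same information threshold, very different effective sample sizes.
for n in (2_000, 150_000):
    print(n, round(effective_sample_size(0.3, n)))
```

The same threshold of 0.3 therefore leaves a few hundred effective samples in the smaller study but tens of thousands in a dataset of this size.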
Some variants are imputed as monomorphic, or close to monomorphic, i.e. with no or almost no variation in the genotypes. Such sites were removed using QCTOOL with a MAF filter of 0.001%. In addition, 7 samples were removed from the dataset because these individuals had requested that their data be removed from the study. The resulting dataset consists of 73,355,667 variants in 152,249 individuals.
The distribution of information scores at these 73,355,667 variants is shown in Figure 2(a). Plots stratified by MAF are also shown: (b) MAF >= 5%, (c) 1% <= MAF < 5%, (d) 0.1% <= MAF < 1%, (e) 0.01% <= MAF < 0.1%, (f) 0.001% <= MAF < 0.01%.
Figure 2: Distribution of information scores at variants in the imputed dataset. The x-axis shows the information score on the scale 0 to 1.
Imputed genotype files
Let G_{ij} denote the genotype of the i-th sample at the j-th variant. The process of genotype imputation produces a probability distribution for each genotype, i.e.

    p_{ij0} = P(G_{ij} = AA),   p_{ij1} = P(G_{ij} = AB),   p_{ij2} = P(G_{ij} = BB)

where A and B are the two alleles at the variant. This probability triple (which sums to 1) is provided in the imputed genotype files for each imputed variant in all samples. SNP variants included in the phased dataset also occur in the imputed files in this format.
The imputed data is provided in a compressed binary BGEN file format. The BGEN file format is a binary version of the GEN file format.
The BGEN file format was chosen to provide good compression of the imputed data and ease of use for genetic association testing against traits and phenotypes. For example, commonly used programs such as SNPTEST and PLINK already read BGEN files, and QCTOOL can be used to filter, summarize, manipulate and convert the files to other formats.
The format stores one variant at a time (i.e. per row). As MAF decreases more compression is possible due to increased similarity between imputed genotypes across samples. The total size of the UKB interim release dataset is 1.3Tb, with each chromosome file ranging in size from 20Gb to 109Gb. As the file format is binary the files are not viewable in normal text editors. Later in this document there is advice and guidance on working with these files.
[Figure 2 panels not reproduced. Each panel is a histogram with Information (0.0 to 1.0) on the x-axis and Frequency on the y-axis. (a) All variants. (b) MAF >= 5%: #SNPs = 7011470. (c) 1% <= MAF < 5%: #SNPs = 2889302. (d) 0.1% <= MAF < 1%: #SNPs = 10051623. (e) 0.01% <= MAF < 0.1%: #SNPs = 26262886. (f) 0.001% <= MAF < 0.01%: #SNPs = 26140277.]
The files are named as
chrNimpv1.bgen
where N is the number of the autosome (N = 1, ..., 22).
RSIDs were added into the BGEN files for as many variants as possible using RSID lists available from the UK10K website and the 1000 Genomes website.
RSIDs are useful, unique identifiers of SNPs and other variants and can be looked up in the dbSNP database. When researchers report associations of variants with diseases and traits they normally report the results using the RSID.
Variant positions are reported in Genome Reference Consortium Human genome build 37 co-ordinates (GRCh37).
Sample files
In addition to the 22 autosomal BGEN files, there is a file called impv1.sample
This file (referred to as the 'sample file') is the part of the BGEN file format that stores information about each sample in the dataset. The format of this file is described on the GEN file format webpage.
The sample file has two header lines, followed by 1 line for each individual in the BGEN file. The order of the individuals in the sample file matches the order of the individuals in the BGEN file. The order is important. Programs that read bgen/sample pairs assume that the order matches between the files.
The sample file can be used to store information about each individual, i.e. phenotypes and covariates. If phenotypes and covariates are added into the sample file then SNPTEST can be used to carry out association testing at each variant. Care should be taken in making sure that such information is correctly added to sample files. The format allows discrete and continuous phenotypes and covariates, as well as missing values (see file format webpage link above).
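A minimal reader for such a file might look like the sketch below. It relies on an understanding of the GEN-style sample format that is not spelled out in this document: fields are whitespace-separated, the first header line holds column names, and the second holds one type code per column. The column names in the example data are illustrative, and the GEN file format webpage remains the authority on the details.

```python
import io


def read_sample_file(handle):
    """Parse a sample file into (column_names, column_types, rows).

    Rows are returned in file order, which must be preserved because it
    matches the order of individuals in the BGEN file.
    """
    names = handle.readline().split()
    types = handle.readline().split()
    if len(names) != len(types):
        raise ValueError("the two header lines have different numbers of columns")
    rows = []
    for line_number, line in enumerate(handle, start=3):
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines
        if len(fields) != len(names):
            raise ValueError(f"line {line_number}: expected {len(names)} fields")
        rows.append(dict(zip(names, fields)))
    return names, types, rows


# Illustrative content only; real files have one row per individual.
example = io.StringIO(
    "ID_1 ID_2 missing height\n"
    "0 0 0 P\n"
    "1001 1001 0 172.5\n"
    "1002 1002 0 NA\n"
)
names, types, rows = read_sample_file(example)
```

When adding phenotypes or covariates, it is safer to join on the sample identifier and then write the rows back in their original order than to paste columns side by side.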
Differences between raw genotypes and imputed files
SNPs below 1% MAF were filtered out before the phasing step; however, many of these SNPs will have been imputed. Therefore these SNPs will appear in both the raw genotype files and the imputed files, but may have different genotypes. As such, researchers should not be surprised if the results of analysis at these SNPs differ depending on which files are used.
An exemplar genome wide association study
A GWAS for the phenotype of height was carried out to assess the use of the UK Biobank genetic data as a resource for genetic association studies. There are already a substantial number of replicated associations for height [11]. The purpose of this analysis was not to report new associations, but rather to check that a reasonably standard GWAS pipeline produced valid results.
Sample filtering
Principal component analysis and the self-declared ethnicity were used to derive a "White British" subset of samples. In addition, samples were excluded if they
(a) had at least one related sample
(b) had a genetically inferred gender that did not match the self-reported gender
(c) were among the ~500 extreme outliers [3].
These filters resulted in a dataset with 112,338 samples.
Taking account of the different arrays used
Some SNPs are only included on one of the UKB or UKBL arrays. At such SNPs, missing genotypes will have been imputed as part of the phasing process, so that the data at these SNPs will consist of a mixture of genotyped and imputed genotypes. This can lead to bias in association testing if there is some correlation between the phenotype and which array a sample was assayed on. The samples involved in the UKBL study were selected based on phenotypes associated with lung function [12], thus it may be possible for such associations to occur. There are at least 2 solutions to ameliorate any possible confounding due to array:
a. carry out association tests conditioning on a binary indicator of array.
b. carry out separate tests of association in UKB samples and UKBL samples and combine the results using meta-analysis.
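For option (b), one standard way of combining the two sets of results is a fixed-effect, inverse-variance weighted meta-analysis. The document does not say which meta-analysis method would be used, so the sketch below is an assumption, and the effect sizes in the usage line are made up.

```python
import math


def fixed_effect_meta(estimates):
    """Inverse-variance weighted fixed-effect meta-analysis at one variant.

    estimates: list of (beta, standard_error) pairs, one per sub-study,
    for example one from the UKB samples and one from the UKBL samples.
    Returns (beta, standard_error, z, two_sided_p).
    """
    weights = [1.0 / (se * se) for _, se in estimates]
    total_weight = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided p-value under a standard normal reference distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p


# Hypothetical per-array effect estimates for a single variant.
beta, se, z, p = fixed_effect_meta([(0.10, 0.02), (0.14, 0.04)])
```

Each sub-study is analysed on a single array, so array cannot confound the within-study estimates, and the pooled estimate weights each sub-study by its precision.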
Association testing
A GWAS was performed at all variants using SNPTEST. An additive genetic model was fitted at each SNP, using gender, age, array and 10 principal components as covariates. That is, the example uses option (a) above. The program option -method expected was used in the SNPTEST software, which converts the genotype probability triple to an expected genotype, d_{ij} (often called the dosage), calculated as

    d_{ij} = \sum_{k=0}^{2} k \, p_{ijk}
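The dosage calculation is a weighted sum over the three genotype probabilities, as in this short sketch. The tolerance on the sum of the triple is an assumption intended to allow for rounding in stored probabilities.

```python
def expected_genotype(triple, tol=1e-6):
    """Expected genotype (dosage) from a probability triple (P(AA), P(AB), P(BB)).

    The dosage counts the expected number of B alleles, so it lies between 0 and 2.
    """
    p0, p1, p2 = triple
    if abs(p0 + p1 + p2 - 1.0) > tol:
        raise ValueError("genotype probabilities should sum to 1")
    return 0 * p0 + 1 * p1 + 2 * p2
```

For instance, a confidently imputed BB genotype (0, 0, 1) has dosage 2, while the triple (0.25, 0.5, 0.25) has dosage 1.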
Results
The GWAS for height produced a substantial number of associated regions. These regions had a high correspondence to those genetic regions that have previously been replicated for height and described in the NHGRI GWAS Catalog [11]. The analysis suggested that a substantial number of novel loci could be identified. Figure 3 shows a plot of the -log10 p-values for the height scan on chromosome 4.
Figure 3: Chromosome 4 GWAS for height. The x-axis shows physical position. The y-axis is the -log10 p-value for each tested variant. Variants on the array are shown as black dots, imputed variants are shown as grey dots. Reported associations from the NHGRI GWAS Catalog are shown as red crosses. The blue and red horizontal lines are drawn at a -log10 p-value of 5 and 7.5 respectively.
File processing
We recommend that researchers use the QCTOOL program to handle the BGEN files. This program has options for extraction or removal of subsets of the data (SNPs and/or samples), and for file format conversion. See the QCTOOL examples page for information on command lines used to perform specific tasks. The program SNPTEST can process BGEN files. It will automatically detect the BGEN file format if data files are named with the .bgen extension. PLINK v1.9 can process BGEN files; at the time of writing BGEN files are specified using the --bgen option. For further information on tools supporting the BGEN format, see the BGEN file format website.
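As a hedged illustration of how the per-chromosome files might be fed to SNPTEST, the sketch below only builds argument lists and does not run anything. The option names (-data, -o, -frequentist, -method, -pheno, -cov_names) reflect an understanding of SNPTEST v2 and should be checked against the documentation of the installed version; the phenotype name, covariate names and output file pattern are hypothetical.

```python
def snptest_command(chrom, phenotype, covariates, sample_file="impv1.sample"):
    """Build an argument list for an additive-model SNPTEST run on one autosome."""
    if not 1 <= chrom <= 22:
        raise ValueError("only autosomes 1 to 22 are available")
    bgen_file = f"chr{chrom}impv1.bgen"  # naming scheme of the imputed files
    return [
        "snptest",
        "-data", bgen_file, sample_file,
        "-o", f"chr{chrom}.{phenotype}.out",  # hypothetical output name
        "-frequentist", "1",                  # additive genetic model
        "-method", "expected",                # test on expected genotypes (dosages)
        "-pheno", phenotype,
        "-cov_names", *covariates,
    ]


# Hypothetical column names; they must exist in the sample file.
commands = [
    snptest_command(n, "height", ["gender", "age", "array", "PC1", "PC2"])
    for n in range(1, 23)
]
```

Each list can be passed to a job scheduler or to `subprocess.run`, one chromosome per job.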
References
1. Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 11, 499–511 (2010).
2. Howie, B., Fuchsberger, C., Stephens, M., Marchini, J. & Abecasis, G. R. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat. Genet. 44, 955–959 (2012).
3. The UK Biobank. UK Biobank Genotyping QC documentation. (2015).
4. Delaneau, O., Zagury, J.-F. & Marchini, J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods 10, 5–6 (2013).
5. O'Connell, J., Sharp, K., Delaneau, O. & Marchini, J. Haplotype estimation for biobank scale datasets. (2015) (submitted).
6. Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat. Genet. 40, 1068–1075 (2008).
7. Williams, A. L., Patterson, N., Glessner, J., Hakonarson, H. & Reich, D. Phasing of many thousands of genotyped samples. Am. J. Hum. Genet. 91, 238–251 (2012).
8. The UK Biobank Array Design Group. UK Biobank Axiom Array Content Summary. (2014).
9. Huang, J. et al. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel. Nature Communications 6, 8111 (2015).
10. Howie, B., Marchini, J. & Stephens, M. Genotype imputation with thousands of genomes. G3 (Bethesda) 1, 457–470 (2011).
11. Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucl. Acids Res. 42, D1001–6 (2014).
12. Wain, L. V. et al. Novel insights into the genetics of smoking behaviour, lung function, and chronic obstructive pulmonary disease (UK BiLEVE): a genetic association study in UK Biobank. Lancet Respir Med 3, 769–781 (2015).