Genome-wide genetic data on ~500,000 UK Biobank participants1 Introduction The UK Biobank project is a large prospective cohort study of ~500,000 individuals from across the United

Genome-widegeneticdataon~500,000UKBiobankparticipants

ClareBycroft1*,ColinFreeman1*,DesislavaPetkova1,^*,GavinBand1,LloydT.Elliott2,KevinSharp2,AllanMotyer3,DamjanVukcevic3,4,OlivierDelaneau5,6,7,JaredO’Connell8,AdrianCortes1,9,SamanthaWelsh10,GilMcVean1,11,,StephenLeslie3,4,PeterDonnelly1,2†,JonathanMarchini2,1†‡

1WellcomeTrustCenterforHumanGenetics,UniversityofOxford,UK2DepartmentofStatistics,UniversityofOxford,UK3CentreforSystemsGenomicsandtheSchoolsofMathematicsandStatistics,andBioSciences,TheUniversityofMelbourne,Parkville,Victoria,Australia.4MurdochChildren’sResearchInstitute,Parkville,Victoria,Australia.5DepartmentofGeneticMedicineandDevelopment,UniversityofGeneva,1MichelServet,Geneva,CH1211,Switzerland.6SwissInstituteofBioinformatics,UniversityofGeneva,1MichelServet,Geneva,CH1211,Switzerland.7InstituteofGeneticsandGenomicsinGeneva,UniversityofGeneva,1MichelServet,Geneva,CH1211,Switzerland.8IlluminaLtd,ChesterfordResearchPark,LittleChesterford,Essex,CB101XL,UnitedKingdom.9NuffieldDepartmentofClinicalNeurosciences,DivisionofClinicalNeurology,JohnRadcliffeHospital,UniversityofOxford,OxfordOX39DU,UnitedKingdom.10UKBiobank,Units1-4SpectrumWay,Adswood,Stockport,Cheshire,SK30SA,UK 11BigDataInstitute,LiKaShingCentreforHealthInformationandDiscovery,UniversityofOxford,OxfordOX37LF,UnitedKingdom.^Currentaddress:Procter&Gamble,Brussels,Belgium*Theseauthorscontributedequallytothiswork.†Theseauthorsjointlydirectedthiswork.‡Towhomcorrespondenceshouldbeaddressed:[email protected]

2

AbstractTheUKBiobankprojectisalargeprospectivecohortstudyof~500,000individualsfromacrosstheUnitedKingdom,agedbetween40-69atrecruitment.Arichvarietyofphenotypicandhealth-relatedinformationisavailableoneachparticipant,makingtheresourceunprecedentedinitssizeandscope.Herewedescribethegenome-widegenotypedata(~805,000markers)collectedonallindividualsinthecohortanditsqualitycontrolprocedures.Genotypedataonthisscaleoffersnovelopportunitiesforassessingqualityissues,althoughthewiderangeofancestriesoftheindividualsinthecohortalsocreatesparticularchallenges.Wealsoconductedasetofanalysesthatrevealpropertiesofthegeneticdata–suchaspopulationstructureandrelatedness–thatcanbeimportantfordownstreamanalyses.Inaddition,wephasedandimputedgenotypesintothedataset,usingcomputationallyefficientmethodscombinedwiththeHaplotypeReferenceConsortium(HRC)andUK10Khaplotyperesource.Thisincreasesthenumberoftestablevariantsbyover100-foldto~96millionvariants.Wealsoimputedclassicalallelicvariationat11humanleukocyteantigen(HLA)genes,andasaqualitycontrolcheckofthisimputation,wereplicatesignalsofknownassociationsbetweenHLAallelesandmanycommondiseases.Wedescribetoolsthatallowefficientgenome-wideassociationstudies(GWAS)ofmultipletraitsandfastphenome-wideassociationstudies(PheWAS),whichworktogetherwithanewcompressedfileformatthathasbeenusedtodistributethedataset.Asafurthercheckofthegenotypedandimputeddatasets,weperformedatest-casegenome-wideassociationscanonawell-studiedhumantrait,standingheight.

KeywordsUKBiobank,Genotypes,Qualitycontrol,Populationstructure,Relatedness,Phasing,Imputation,HLAImputation,GWAS,PheWAS

3

1 Introduction............................................................................................................42 Results....................................................................................................................52.1 Highqualitygenotypecallingonnovelarray..................................................52.2 Ancestraldiversityandcrypticrelatedness...................................................182.3 PhasingandImputationofSNPs,shortindelsandCNVs...............................232.4 ImputationofclassicalHLAalleles.................................................................262.5 GWASforstandingheight.............................................................................282.6 MultipletraitGWASandPheWAS.................................................................31

3 Dataprovisionandaccess....................................................................................314 URLs......................................................................................................................325 Authorcontributions............................................................................................326 Acknowledgements..............................................................................................327 ConflictsofInterest..............................................................................................338 References............................................................................................................33

4

1 Introduction

TheUKBiobankprojectisalargeprospectivecohortstudyof~500,000individualsfromacrosstheUnitedKingdom,agedbetween40-69atrecruitment[1].Arichvarietyofphenotypicandhealth-relatedinformationisavailableoneachparticipant,makingtheresourceunprecedentedinitssizeandscope.Thedatacontainsself-reportedinformation,includingbasicdemographics,diet,andexercisehabits;extensivephysicalandcognitivemeasurements;withothersourcesofhealth-relatedinformationsuchasmedicalrecordsandcancerregistersbeingintegratedandfollowedupoverthecourseoftheparticipants’lives[2].Thebaselineinformationhas,andwillbe,extendedinanumberofways[3].Forexample,manybloodandurinebiomarkersarebeingmeasured;andmedicalimagingofbrain[4],heart,bones,carotidarteriesandabdominalfatisbeingcarriedoutonalargesubset(~100,000)ofparticipants[5].Understandingtherolethatgeneticsplaysinphenotypicanddiseasevariation,anditspotentialinteractionswithotherfactors,providesacriticalroutetoabetterunderstandingofhumanbiology.Itisanticipatedthatthiswillleadtomore-successfuldrugdevelopment[6],andpotentiallytomoreefficientandpersonalisedtreatmentsandtobetterdiagnoses.Assuch,akeycomponentoftheUKBiobankresourcehasbeenthecollectionofgenome-widegeneticdataoneveryparticipantusingapurpose-designedgenotypingarray[7].Aninterimreleaseofgenotypedataon~150,000UKBiobankparticipants(May2015)[8]hasalreadyfacilitatednumerousstudies[9].TheseexploittheUKBiobank’ssubstantialsamplesize,extensivephenotypeinformation,andgenome-widegeneticinformationtostudytheoftensubtleandcomplexeffectsofgeneticsonhumantraitsanddisease,anditspotentialinteractionswithotherfactors[10-15].Inthispaperwedescribethegeneticdatasetonthefull~500,000participants,togetherwitharangeofqualitycontrolprocedures,whichhavebeenundertakenonthegenotypedatainthehopeoffacilitatingitswideruse.Toachievethiswedesignedandimplementedaqualitycontrol(QC)pipelinethataddresseschallengesspecifictotheexperimentaldesign,scale,anddiversityofthisdataset.RawdatafromthegenotypingexperimentswillbeavailablefromUKBiobank.Wealsoconductedasetofanalysesthatrevealpropertiesofthegeneticdata–suchaspopulationstructureandrelatedness–thatcanbeimportantfordownstreamanalyses.Inaddition,wephasedandimputedgenotypesintothedataset,usingcomputationallyefficientmethodscombinedwiththeHaplotypeReferenceConsortium(HRC)[16]andUK10Khaplotyperesources[17].Thisincreasesthenumberoftestablevariantsbyover100-foldto~96millionvariants.Wealsoimputedclassicalallelicvariationat11humanleukocyteantigen(HLA)genes,andas

5

aQCcheckofthisimputation,wereplicatesignalsofknownassociationsbetweenHLAallelesandmanycommondiseases.Wedescribetoolsthatallowefficientgenome-wideassociationstudies(GWAS)ofmultipletraitsandfastphenome-wideassociationstudies(PheWAS),whichworktogetherwithanewcompressedfileformatthathasbeenusedtodistributethedataset.Asafurthercheckofthegenotypedandimputeddatasets,weperformedatest-casegenome-wideassociationscanonawell-studiedhumantrait,standingheight.

2 Results

2.1 Highqualitygenotypecallingonnovelarray

2.1.1 Purpose-designedgenotypingarray

Thedatareleasecontainsgenotypesof488,377UKBiobankparticipants.Thesewereassayedusingtwoverysimilargenotypingarrays.Asubsetof49,950participantsinvolvedintheUKBiobankLungExomeVariantEvaluation(UKBiLEVE)studyweregenotypedusingtheAppliedBiosystems™UKBiLEVEAxiom™ArraybyAffymetrix1(807,411markers),whichisdescribedelsewhere[15].Followingthis,438,427participantsweregenotypedusingtheclosely-relatedAppliedBiosystems™UKBiobankAxiom™Array(825,927markers).Botharrayswerepurpose-designedspecificallyfortheUKBiobankgenotypingprojectandshare95%ofmarkercontent[7].ThemarkercontentoftheUKBiobankAxiom™arraywaschosentocapturegenome-widegeneticvariation(singlenucleotidepolymorphism(SNPs)andshortinsertionsanddeletions(indels)),andissummarisedinFigure1.Manymarkerswereincludedbecauseofknownassociationswith,orpossiblerolesin,phenotypicvariation,particularlydisease.Anotableexampleistheinclusionoftwovariants,rs429358andrs7412,whichdefinetheisoformsoftheapolipoproteinE(APoE)geneknowntobeassociatedwithriskofAlzheimer’sdisease[7]andotherconditions.Neithermarkeriseasytotypeusingarraytechnologies;asaconsequenceofthistheyhavenotalwaysbeenassayedonearlierarrays.Thearrayalsoincludescodingvariantsacrossarangeofminorallelefrequencies(MAFs),includingraremarkers(<1%MAF);andmarkersthatprovidegoodgenome-widecoverageforimputationinEuropeanpopulationsinthecommon(>5%)andlowfrequency(1-5%)MAFranges.Furtherdetailsofthearraydesignarein[7].

1NowpartofThermoFisherScientific.

6

Figure1|SummaryofUKBiobankgenotypingarraycontent.ThisisaschematicrepresentationofthedifferentcategoriesofcontentontheUKBiobankAxiomarray.Numbersindicatetheapproximatecountofmarkerswithineachcategory,ignoringanyoverlap.Amoredetaileddescriptionofthearraycontentisavailablein[7].

2.1.2 DNAextractionandgenotypecalling

BloodsampleswerecollectedfromparticipantsontheirvisittoaUKBiobankassessmentcentreandthesamplesarestoredattheUKBiobankfacilityinStockport,UK[18].Overaperiodof18months(Nov.2013–Apr.2015)sampleswereretrieved,DNAwasextracted,and96-wellplatesof9450μlaliquotswereshippedtoAffymetrixResearchServicesLaboratoryforgenotyping.SpecialattentionwaspaidintheautomatedsampleretrievalprocessatUKBiobanktoensurethatexperimentalunitssuchasplatesortimingofextractiondidnotcorrelatesystematicallywithbaselinephenotypessuchasage,sex,andethnicbackground,orthetimeandlocationofsamplecollection.FulldetailsoftheUKBiobanksampleretrievalandDNAextractionprocessaredescribedin[19,20].OnreceiptofDNAsamples,AffymetrixprocessedsamplesontheGeneTitan®Multi-Channel(MC)Instrumentin96-wellplatescontaining94UKBiobanksamplesand

7

twocontrolsamplesfromthe1000GenomesProject[21].Genotypeswerethencalledfromthearrayintensitydata,inunitscalled“batches”whichconsistofmultipleplates.Acrosstheentirecohort,therewere106batchesof~4,700UKBiobanksampleseach(SupplementaryMaterial,TableS11).Followingtheearlierinterimdatarelease,Affymetrixdevelopedacustomgenotypecallingpipelinethatisoptimizedforbiobank-scalegenotypingexperiments,whichtakesadvantageofthemultiple-batchdesign[22].Thispipelinewasappliedtoallsamples,includingthe~150,000samplesthatwerepartoftheinterimdatarelease.Consequently,someofthegenotypecallsforthesesamplesmaydifferbetweentheinterimdatareleaseandthisfinaldatarelease(seeSection2.1.6).Routinequalitycheckswerecarriedoutduringtheprocessofsampleretrieval,DNAextraction[19],andgenotypecalling[23].Anysamplethatdidnotpassthesecheckswasexcludedfromtheresultinggenotypecalls.Thecustom-designedarrayscontainanumberofmarkersthathadnotbeenpreviouslytypedusingAffymetrixgenotypearraytechnology.Assuch,Affymetrixalsoappliedaseriesofcheckstodeterminewhetherthegenotypingassayforagivenmarkerwassuccessful,eitherwithinasinglebatch,oracrossallsamples.Wherethesenewly-attemptedassayswerenotsuccessful,Affymetrixexcludedthemarkersfromthedatadelivery(seeSupplementaryMaterialfordetails).Thisresultedinasetofgenotypecallsfor489,212samplesat812,428uniquemarkers(bi-allelicSNPsandIndels)frombotharrays,withwhichweconductedfurtherqualitycontrolandanalysis(Table1).

UKBiLEVEAxiomarrayonly

UKBiobankAxiomarrayonly

Botharrays Total

Includedinexperiment

NumberofsamplessenttoAffymetrix(includingduplicates)

50561 443568 0 494078

IncludedindatadeliveryfromAffymetrix

Numberofmarkers 18019 34313 760096 812428

Numberofsamples(includingduplicates) 50520 438692 0 489212

Includedinreleaseddata

Numberofmarkers 17536 34197 753693 805426

Numberofuniquesamples 49950 438427 0 488377

Table1|ThenumberofmarkersandsamplesbygenotypingarrayatmainstagesoftheUKBiobankgenotypingexperiment.“DatadeliveryfromAffymetrix”referstothedataproducedbyAffymetrixafterapplyingtheirfiltering(seeSupplementaryMaterial).“Releaseddata”referstothereleasedgenotypedata,afterapplyingQCasdescribedinSections2.1.4and2.1.5.

2.1.3 Qualitycontrolinalarge-scale,ethnicallydiversecohort

OurQCpipelinewasdesignedspecificallytoaccommodatethelarge-scaledatasetofethnicallydiverseparticipants,genotypedinmanybatches(106),usingtwoslightly

8

differentnovelarrays,andwhichwillbeusedbymanyresearcherstotackleawidevarietyofresearchquestions.Participantsreportedtheirethnicbackgroundbyselectingfromafixedsetofcategories[24].Whilethemajority(94%)ofindividualsreporttheirethnicbackgroundaswithinthebroad-levelgroup“White”,therearestill~22,000individualswithaself-reportedethnicbackgroundoriginatingoutsideEurope(Table2).Thisethnicdiversityimpliesgeneticdiversity,whichweobservedirectlyinthegenotypesasallelefrequencydifferences,andhasimplicationsforQC.SomecommonlyusedQCtestsareineffectiveinthecontextofstrongpopulationstructureifappliedwithouttakingthisintoaccount.Forexample,testingfordeparturesfromHardy-Weinbergequilibrium(HWE)isacommonapproachforidentifyingmarkersthathavebeengenotypedpoorly[25-27],butdeparturesfromHWEcanbeexpectedinthecontextofpopulationstructurebecauseofdifferencesinallelefrequenciesacrosspopulations.Weusedapproachesbasedonprincipalcomponentanalysis(PCA)toaccountforpopulationstructureinbothmarkerandsample-basedQC.

Ethnicgroup Self-reportedethnicbackground

PercentageofgenotypedUKBiobankparticipants

White 94.23 British 88.26 Anyotherwhitebackground 3.24 Irish 2.61 White 0.11 AsianorAsianBritish 1.94 Indian 1.17 Pakistani 0.36 AnyotherAsianbackground 0.36 Bangladeshi 0.05 AsianorAsianBritish 0.01 BlackorBlackBritish 1.57 Caribbean 0.88 African 0.66 AnyotherBlackbackground 0.02 BlackorBlackBritish 0.01 Chinese 0.31 Chinese 0.31 Mixed 0.58 Anyothermixedbackground 0.2 WhiteandAsian 0.16 WhiteandBlackCaribbean 0.12 WhiteandBlackAfrican 0.08 Mixed 0.01 Other/Unknown 1.38 Otherethnicgroup 0.89 Notstated 0.48

Table2|Proportionsofself-reportedethnicgroupsamong488,377genotypedUKBiobankparticipants.Categoriesofself-reportedethnicbackground(UKBiobankdatafield21000)andbroader-levelethnicgroupsareshownheretoreflectthetwo-layerbranchingstructureoftheethnicbackgroundsectionintheUKBiobanktouchscreenquestionnaire[24].Participantsfirstpickedoneofthebroader-levelethnicgroups(e.gWhite),andwerethenpromptedtoselectoneofthecategories

9

withinthatgroup(e.gIrish).Thebroader-levelgroupsarealsoshownhereasanethnicbackgroundcategory(e.g“White”incolumntwo)becauseasmallproportionofparticipantsonlyrespondedtothefirstquestion.Inthistablewealsocombinethecategory“Otherethnicgroup”withanaggregatednon-responsecategory“Notstated”,whichincludesallparticipantswhodidnotknowtheirethnicgroup,orstatedthattheypreferrednotanswer,ordidnotanswerthefirstquestion.

2.1.4 Marker-basedQC

Weidentifiedpoorqualitymarkersusingstatisticaltestsdesignedprimarilytocheckforconsistencyacrossexperimentalfactors.Specificallywetestedforbatcheffects,plateeffects,departuresfromHWE,sexeffects,arrayeffects,anddiscordanceacrosscontrolreplicates.SeeSupplementaryMaterialforthedetailsofeachtest,andFigureS3forexamplesofaffectedmarkers.Formarkersthatfailedatleastonetestinagivenbatch,wesetthegenotypecallsinthatbatchtomissing.Wealsoprovideaflaginthedatareleasethatindicatesifthecallsforamarkerhavebeensettomissinginagivenbatch.Iftherewasevidencethatamarkerwasnotreliableacrossallbatches,weexcludedthemarkerfromthedataaltogether.Inordertoattenuatepopulationstructureeffectsweappliedallmarker-basedQCtestsusingasubsetof463,844individualswithestimatedEuropeanancestry.WeidentifiedtheseindividualsfromthegenotypedatapriortoconductinganyQCbyprojectingalltheUKBiobanksamplesontothetwomajorprincipalcomponentsoffour1000Genomespopulations(CEU,YRI,CHBandJPT)[28].Wethenselectedsampleswithprincipalcomponent(PC)scoresfallingintheneighbourhoodoftheCEUcluster(seeSupplementaryMaterial).MostQCmetricsrequireathresholdbeyondwhichtoconsideramarker‘notreliable’.WeusedthresholdssuchthatonlystronglydeviatingmarkerswouldfailQCtests(seeSupplementaryMaterial),thereforeallowingresearcherstofurtherrefinetheQCinwhicheverwayismostappropriatefortheirstudyrequirements.Table3summarisestheamountofdataaffectedbyapplyingthesetests.

10

Test AveragenumberofSNPsfailedperbatch(sd)

Fractionofallgenotypecallsaffected

AffymetrixclusterQC 1109(699) 0.00140

1.Batcheffect 197(86) 0.000249

2.Plateeffect 284(266) 0.000358

3.DeparturefromHardy-Weinbergequilibrium 572(77) 0.000723

4.Sexeffect 45(5) 0.0000569

5.Arrayeffect* 5417 0.00683

6.Discordanceacrosscontrols** 622and632 0.000796

Total 7704(721) 0.00971

Table3|Failureratesforsixmarker-basedqualitytests.Forallnumberedtestsamarker(ormarkerwithinabatch)wassettomissingifthetestyieldedap-value<10-12,exceptinthecaseofTest6,forwhichamarkerwassettomissingifthetestyielded<95%concordance.SeeSupplementaryMaterialfordetailsofeachtest.Thetotalisnotequaltothesumofalltestsbecauseitispossibleforamarkertofailmorethanonetest.Sincethetwoarrayscontainslightlydifferentsetsofmarkers,thetotalnumberofgenotypecallsusedtocomputethefractionsis,NukbbLukbb+NukblLukbl,whereNandLrefertothenumbersofmarkersandsamplestypedontheUKBiobankAxiomarray(ukbb)andsamplestypedontheUKBiLEVEAxiomarray(ukbl)withintheAffymetrixdatadelivery(seeTableS1).*Thearrayeffecttestwasappliedacrossallbatchesandonlyformarkerspresentonbotharrays,sowesimplyreportthetotalnumberofmarkersthatfailedthistest.**Thediscordancetestwasappliedacrossallbatches,butnotallmarkersarepresentonbotharrays.ThefirstvalueisthenumberofuniquemarkersontheUKBiLEVEAxiomarraythatfailedthistest,andthesecondisformarkersontheUKBiobankAxiomarray.

2.1.5 Sample-basedQC

Weidentifiedpoorqualitysamplesusingthemetricsofmissingrateandheterozygositycomputedusingasetof605,876highqualityautosomalmarkersthatweretypedonbotharrays(seeSupplementaryMaterialforcriteria).Extremevaluesinoneorbothofthesemetricscanbeindicatorsofpoorsamplequalitydueto,forexample,DNAcontamination[26].Theheterozygosityofasample–thefractionofnon-missingmarkersthatarecalledheterozygous–canalsobesensitivetonaturalphenomena,includingpopulationstructure,recentadmixtureandparentalconsanguinity.Wetookextrameasurestoavoidmisclassifyinggoodqualitysamplesbecauseoftheseeffects.Forexample,weadjustedheterozygosityforpopulationstructurebyfittingalinearregressionmodelwiththefirstsixprincipalcomponents(PCs)inaPCAaspredictors(Figure2A).Usingthisadjustmentweidentified968sampleswithunusuallyhighheterozygosityor>5%missingrate(SupplementaryMaterial).Alistofthesesamplesisprovidedaspartofthedatarelease.Wealsoconductedqualitycontrolspecifictothesexchromosomesusingasetof15,766highqualitymarkersontheXandYchromosomes.Affymetrixinfersthesex

11

ofeachindividualbasedontherelativeintensityofmarkersontheYandXchromosomes[29].Sexisalsoreportedbyparticipants,andmismatchesbetweenthesesourcescanbeusedasawaytodetectsamplemishandlingorotherkindsofclericalerror.However,inadatasetofthissize,somesuchmismatcheswouldbeexpectedduetotransgenderindividuals,orinstancesofreal(butrare)geneticvariation,suchassex-chromosomeaneuploidies[30].AffymetrixgenotypecallingontheXandYchromosomesallowsonlyhaploidordiploidgenotypecalls,dependingontheinferredsex[29].Therefore,casesoffullormosaicsexchromosomeaneuploidiesmayresultincompromisedgenotypecallsonall,orpartsof,thesexchromosomes(butnotaffecttheautosomes).Forexample,individualswithkaryotypeXXYwilllikelyhavepoorerqualitygenotypecallsonthepseudo-autosomalregion(PAR)oftheXchromosome,astheyareeffectivelytriploidinthisregion.UsinginformationinthemeasuredintensitiesofchromosomesXandY,weidentifiedasetof652(0.134%)individualswithsexchromosomekaryotypesputativelydifferentfromXYorXX(Figure2B,SupplementaryTableS1).Thelistofsamplesisprovidedaspartofthedatarelease.Researcherswantingtoidentifysexmismatchesshouldcomparetheself-reportedsexandinferredsexdatafields.Wedidnotremovesamplesfromthedataasaresultofanyoftheaboveanalyses,butratherprovidetheinformationaspartofthedatarelease.However,weexcludedasmallnumberofsamples(835intotal)thatweidentifiedassampleduplicates(asopposedtoidenticaltwins,seeSupplementaryMaterial)orwerelikelyinvolvedinsamplemishandlinginthelaboratory(~10),aswellasparticipantswhoaskedtobewithdrawnfromtheprojectpriortothedatarelease(33).

12

Figure2(captiononnextpage)

13

Figure2|Summaryofsample-basedQC.(A)Thethreeplotsshowheterozygosityandmissingrates,whichweusedtoflagpoorqualitysamples.Thetwotopplotsshowheterozygosityforeachsamplebefore(left)andafter(right)correctingforancestralbackgroundusingPCs.Thesymbols(shapesandcolours)indicatetheself-reportedethnicbackgroundofeachparticipant,asannotatedinthelegend.Thethirdplotshowsthesetofsamplesweflaggedasoutliers(inred),andallothersamples(inblack),withshapesthesameasfortheothertwoplots.Theverticallineshowsthethresholdweusedtocallsamplesasoutliersonmissingrate.Inallplotsmissingratedataistransformedtothelogitscale,butwiththeaxisannotatedwiththeoriginalvalues.(B)MeanLog2ratios(L2R)onXandYchromosomesforeachsample,indicatinglikelysexchromosomeaneuploidy(seeSupplementaryMaterial).SampleswhicharemostlikelyXXorXYaredepictedwithacircle.Othersamples,whichrepresentpossibleinstancesofsexchromosomeaneuploidy,areindicatedbyacross.Thereare652suchsamplesintotal;seeSupplementaryTableS1forcounts.Thecoloursofeachsymbolrelatetodifferentcombinationsofself-reportedsex,andsexinferredbyAffymetrix(fromthegeneticdata),asindicatedbythekey.Foralmostallsamples(99.9%)theself-reportedandinferredsexarethesame,butforasmallnumberofsamples(378)theydonotmatch(seeSupplementaryMaterialfordiscussion).

2.1.6 Summaryofgenotypedataquality

TheapplicationofourQCpipelineresultedinthereleaseddatasetof488,377samplesand805,426markersfrombotharrayswiththepropertiesshowninFigure3.TheproportionofallthegenotypecallsmadebyAffymetrixthatweresettomissingasaresultofqualitycontrolis0.0097(seeTable3).Theproportionofgenotypedsamplesthatwereidentifiedaspoorqualityis0.002(968/488377).Furthermore,asetof588pairsofexperimentalduplicatesshowveryhighconcordance.Onaverage,99.87%ofapair’sgenotypecallsareidenticalandthelowestrateisstill99.39%(SupplementaryFigureS13).Subsequenttotheinterimreleaseofgenotypesfor~150,000UKBiobankparticipantsimprovementsweremadetothegenotypecallingalgorithm[22]andqualitycontrolprocedures.Wethereforeexpecttoobservesomechangesinthegenotypecallsandmissingdataprofileofsamplesincludedinboththeinterimdatareleaseandthisfinaldatarelease.Discordanceamongnon-missingmarkersisverylow(mean6.7x10-5;SupplementaryFigureS1);andforeachsamplethereare~24,500genotypecalls(onaverage)thatweremissingintheinterimdata,butwhichhavenon-missingcallsinthisrelease.Thisismuchsmallerinthereversedirection,with~500calls,onaverage,missinginthisreleasebutnotmissingintheinterimdata,sothereisanaveragenetgainof~24,000genotypecallspersample.WecomparedallelefrequenciesintheUKBiobankwiththoseestimatedfromsequencingdatasourcedfromtheExomeAggregationConsortium(ExAC)database[31].Wecomputedallelefrequenciesatasetof91,298overlappingmarkers,usingsampleswithEuropeanancestry.Wedonotexpectallelefrequenciesinthetwostudiestomatchexactlyduetosubtledifferencesintheancestralbackgroundsoftheindividualsineachstudy,aswellasdifferencesinthesensitivityandspecificityof

14

thetwotechnologies(exomesequencingandgenotypingarrays).Despitethis,overalltheallelefrequenciesareencouraginglysimilar(r2=0.94)(Figure3E).AdiscussionofthediscrepanciesbetweenUKBiobankandExACisgivenintheSupplementaryMaterial.Over110,000raremarkers(MAF<0.01)wereincludedonthetwoarraysusedfortheUKBiobankcohort[7].Variantsoccurringatverylowfrequenciespresentaparticularchallengeforgenotypecallingusingarraytechnology,especiallyincaseswhereonlyasmallnumberofsampleswithinagenotypingbatchareexpectedtohaveacopyoftheminorallele(almostalwayswithaheterozygousgenotype)2.Inthesecasesitissometimesdifficultinthegenotypeintensitydatatodistinguishasamplethatgenuinelyhastheminorallele,fromonewhoseintensitiesareinthetailsofthedistributionofthoseinthemajorhomozygotecluster.ExamplesoftwodifferentmarkerswithMAF<0.001inUKBiobankareshowninFigure4BandFigure4C.Onemarkerisperformingwell(4B),butfortheothermarker(4C)theheterozygoussamplesaremoredifficulttoidentify.Incontrast,Figure4AshowsacommonSNP(MAF=0.077),whichhasthreewell-separatedclusterscorrespondingtothethreegenotypes.Alargerfractionofraremarkersfailqualitycontroltestscomparedtolowfrequencyandcommonmarkers,but84%stillpassinallbatches(Figure3B).TheMAFofraremarkersaregenerallysimilartothoseinExAC,butveryraremarkerstendtohavealowerMAFintheUKBiobank(Figure3F).WestronglyrecommendresearchersvisuallyinspectsimilarplotsformarkersofinterestusingautilitysuchasEvoker[32](seeURLs),especiallyforraremarkers.InadditiontotheQCdescribedabove,acloseexaminationofresultsofatestgenome-wideassociationstudy(GWAS)(discussedinSection2.5)providedfurtherevidenceforthequalityoftheresultingsetofgenotypecalls.

2ThegenotypingbatchesintheUKBiobankprojectcontain~4,700samples,soaMAFof,forexample,0.001correspondstoanexpectedcount(underHWE)ofabout9heterozygotesperbatch.

15

0.0000

0.0025

0.0050

0.0075

0.0100

0.0000 0.0025 0.0050 0.0075 0.0100

MAF in UK Biobank

Freq

uenc

y in

ExA

C

10000

1000

100

10

1

Count

Comparison with ExAC at 60474 overlapping rare markers

0.00

0.25

0.50

0.75

1.00

0.0 0.1 0.2 0.3 0.4 0.5

MAF in UK Biobank

Freq

uenc

y in

ExA

C

100001000100101

Count

Comparison with ExAC at 91298 overlapping markers

0

[1−1

0)

[10−

100)

[100

−100

0)

[100

0−10

000)

0

10

20

30

40

50

60

Num

ber o

f mar

kers

(100

0s)

Minor allele count (MAC)

Minor allele frequencies of 805426 markers in UK Biobank data

Minor allele frequency (MAF) in UK Biobank

Num

ber o

f mar

kers

(100

0s)

0.0 0.1 0.2 0.3 0.4 0.5

0

20

40

60

80

100

120

140

0 1 2−3 4−7 8−15 16−31 32−63 64−105

Marker−based QC failure rates for different MAF ranges

Prop

ortio

n of

mar

kers

in M

AF ra

nge

0.0

0.2

0.4

0.6

0.8

1.0

Number of batches in which a marker failed QC

MAF rangesVery rare: [0−0.001)Rare: [0.001−0.01)Low frequency: [0.01−0.05)Common: [0.05−0.5)

Missing rates for samples

Missing rate

Num

ber o

f sam

ples

(100

0s)

0.02 0.04 0.06 0.08 0.10 0.12

0

20

40

60

80

Samples typed on UKBiobank Axiom array (438427)Samples typed on UKBiLEVE Axiom array (49950)

Missing rates for markers

Missing rate

Num

ber o

f mar

kers

(100

0s)

10−4 10−3 10−2 10−1 1

0

10

20

30

40

50Markers on both arrays (753693)Markers on UK Biobank Axiom array only (34197)Markers on UKBiLEVE Axiom array only (17536)

A B

E

DC

F

Figure3(captiononnextpage)

16

Figure3|Summaryofgenotypedataqualityandcontent.AllplotsshowpropertiesoftheUKBiobankgenotypedatathatismadeavailabletoresearchers,afterapplyingQCasdescribedinthemaintext.(A)Minorallelefrequency(MAF)distribution.AllsampleswereusedinthecalculationofMAF.Theinsetshowsthenumberofmarkerswithindifferentminorallelecountbinsforraremarkersonly(MAF<0.01).(B)Thedistributionofthenumberofbatch-levelQCteststhatamarkerfails(seeTable2;Methods).ForeachoffourdifferentMAFranges(indicatedbycolours)weshowthefractionofmarkersthatfailthespecifiednumberofbatches.Forexample,justover92%ofthe‘lowfrequency’markers(0.01≤MAF<0.05)donotfailqualitycontrolinanybatches.Anymarkerthatfailedall106batchesisexcludedfromthedatarelease,sosuchmarkersarenotincludedhere(seeTable3).(C)Thedistributionofmissingratesformarkers.Threehistogramsareoverlaid,eachshowingadifferent,mutuallyexclusive,subsetofmarkers,whichareindicatedbythethreecolours.Markersfromonlyonearrayexhibitmoremissingdatabecauseonlyasubsetofsamplesweretypedoneacharray(10%onUKBiLEVEAxiom™Arrayand90%ontheUKBiobankAxiom™Array).(D)Thedistributionofmissingratesforsamples.Twohistogramsareoverlaid,eachshowingamutuallyexclusivesubsetofsamples.Allsampleshavesomemissingdatabecausenotallmarkerswereincludedonbotharrays(~2%areexclusivetotheUKBiLEVEAxiom™Arrayand~4%exclusivetotheUKBiobankAxiom™Array).Additionalmissingdataisalsointroducedfromthebatch-basedmarkerQC.(E)ComparisonofMAFinUKBiobankwiththefrequencyofthesamealleleinExAC,amongtheEuropean-ancestrysampleswithineachstudy.Weused33,370samplesforExACand463,844samplesforUKBiobank(seeSupplementaryMaterialfordetails),andonlymarkersthathaveacallrategreaterthan0.9inbothstudies.Eachhexagonalbiniscolouredaccordingtothenumberofmarkersfallinginthatbin,asindicatedbythekey(notethelog10scale).Thedashedredlineshowsx=y.Thesetofmarkerswithverydifferentallelefrequenciesseenonthetop,bottom,andleft-handsidesoftheplotcomprise~300markers.Thisis~0.3%ofallmarkersinthecomparison,or~0.5%ofallmarkerswithMAF>0.001inatleastonestudy(seeSupplementaryMaterialforadiscussion).(F)Aswith(E),butzoominginontheraremarkers(bothaxes<0.01).

17

Figure4|Examplesofintensitydataandgenotypecallsformarkersofdifferentallelefrequencies.Eachofthethreesub-figuresshowsintensitydataforasinglemarkerwithinsixdifferentbatches.Batcheslabelledwiththeprefix“UKBiLEVEAX”containonlysamplestypedusingtheUKBiLEVEAxiomarray,andthosewiththeprefix“Batch”containonlysamplestypedusingtheUKBiobankAxiomarray.Eachpointrepresentsonesampleandiscolouredaccordingtotheinferredgenotypeatthemarker.Thexandyaxesaretransformationsoftheintensitiesforprobesetstargetingeachofthealleles“A”and“B”(seeSupplementaryMaterialfordefinitionofprobeset).Theellipsesindicatethelocationandshapeoftheposteriorprobabilitydistribution(2-dimensionalmultivariateNormal)ofthetransformedintensitiesforthethreegenotypesinthestatedbatch.Thatis,eachellipseisdrawnsuchthatitcontains85%oftheprobabilitydensity.See[29]formoredetailsofAffymetrixgenotypecalling.TheminorallelefrequencyofeachofthemarkersiscomputedusingallsamplesinthereleasedUKBiobankgenotypedata.(A)AmarkerwithaMAFof0.077withwell-separatedgenotype

Genotype callsnumber of copies of reference allele

210no call

A

B

C

MAF:0.00092

MAF:0.00066

MAF:0.077

18

clusters.(B)IntensitiesforamarkerwithaMAFof0.00092withwell-separatedgenotypeclusters.AswouldbeexpectedunderHWE,therearenoinstancesofsampleswiththeminorhomozygotegenotype.(C)IntensitiesforamarkerwithaMAFof0.00066,andwheretheheterozygoteclusterisnotwellseparatedfromthelargemajorhomozygoteclusterinsomebatches,makingitmoredifficulttoconfidentlycalltheheterozygousgenotypes.

2.2 Ancestraldiversityandcrypticrelatedness

2.2.1 Geneticpopulationstructure

ThediversityinancestraloriginsofUKBiobankparticipantsisevidentfromtheself-reportedethnicbackground(Table2)andcountryofbirthinformation.Thegenotypedataprovidesauniqueopportunitytostudytheirancestraloriginsinaquantitativemanner.AccountingfortheancestralbackgroundofparticipantsisanessentialcomponentofanalysisoftheUKBiobankresource,bothforepidemiologicalstudies[33],andgeneticanalyses,suchasGWAS[27,34].Weusedprincipalcomponentsanalysis(PCA)tomeasurepopulationstructurewithintheUKBiobankcohort.PCAiswidelyusedasamethodforassessingandpotentiallycontrollingforpopulationstructureinGWAS[27,35].Wecomputedprincipalcomponents(PCs)usinganalgorithm(fastPCA[36])whichperformswellondatasetswithhundredsofthousandsofsamplesbyapproximatingonlythetopnPCsthatexplainthemostvariation,wherenisspecifiedinadvance.Wecomputedthetop40PCsusingasetof407,219unrelated,highqualitysamplesand147,604highqualitymarkersprunedtominimiselinkagedisequilibrium(LD)[37].WethencomputedthecorrespondingPC-loadingsandprojectedallsamplesontothePCs,thusformingasetofPCscoresforallsamplesinthecohort(SupplementaryMaterial).Figure5showsresultsforthefirst6PCsplottedinconsecutivepairs.ResultsforfurtherPCsareshowninSupplementaryFiguresS5andS6.Asexpected,individualswithsimilarPCscoreshavesimilarself-reportedethnicbackgrounds.Forexample,thefirsttwoPCsseparateoutindividualswithsub-SaharanAfricanancestry,European,andeast-Asianancestry.Individualswhoself-reportasmixedethnicitytendtofallonacontinuumbetweentheirconstituentgroups(e.gtheWhiteandAsiancategory).FurtherPCscapturepopulationstructureatsub-continentalgeographicscales.SupplementaryFigureS7showstherelationshipbetweeneachofthe40PCs,andcountryofbirthasreportedbyparticipants.Forexample,highscoresinPC6areassociatedwithindividualsborninCentralandSouthAmericancountriessuchasPeru,Colombia,andChile.

19

Figure5|AncestraldiversityintheUKBiobankcohort.PlotsofconsecutivepairsofthefirstsixprincipalcomponentsinaPCAofgenotypedataforUKBiobankparticipants(seeSupplementaryMaterial).Eachpointrepresentsanindividualandisplacedaccordingtotheirprincipalcomponentscores(usinggeneticdataonly),withshapesandcoloursindicatingtheirself-reportedethnicbackgroundasshowninthelegend.SeeTable2fortheproportionsofparticipantsineachcategory.Plotsofallpairsofthese6PCsareshowninSupplementaryFigureS5,andresultsoffurtherPCsarerepresentedinSupplementaryFiguresS6andS7.

2.2.2 WhiteBritishancestrysubset

Researchersmaywanttoonlyanalyseasetofindividualswithrelativelyhomogeneousancestrytoreducetheriskofconfoundingduetodifferencesinancestralbackground.AlthoughtheUKBiobankcohortisethnicallydiverse,suchanalysisisfeasiblewithoutcompromisingtoomuchinsamplesizebecausea

20

majorityofparticipantsintheUKBiobankcohortreporttheirethnicbackgroundas“British”,withinthebroader-levelgroup“White”(88.26%).OurPCArevealedpopulationstructureevenwithinthiscategory(SupplementaryFigureS8),soweusedacombinationofself-reportedethnicbackgroundandgeneticinformationtoidentifyasubsetof409,728individuals(84%)whoself-reportas“British”andwhohaveverysimilarancestralbackgroundsbasedonresultsofthePCA(seeSupplementaryMaterial).Fine-scalepopulationstructureisknowntoexistwithintheUK[38]butmethodsfordetectingsuchsubtlestructure[39]availableatthetimeofanalysisarenotfeasibletoapplyatthescaleoftheUKBiobank.ThewhiteBritishancestrysubsetmaythereforestillcontainsubtlestructurepresentatsub-nationalscales.

2.2.3 Crypticrelatedness

Closerelationships(e.gsiblings)amongUKBiobankparticipantswerenotrecordedduringthecollectionofotherphenotypicinformation.Indeed,manyparticipantsmaynotbeawarethatacloserelative(suchasanaunt,orsibling)isalsopartofthecohort.Thisinformationcanbeimportantforepidemiologicalanalyses[40],aswellasinGWAS[41],andthegeneticdataprovidesanopportunitytodiscoverandcharacterisefamilialrelatednesswithinthecohort.Thisanalysis,combinedwithphenotypeinformation,isalsousefulforidentifyingsamplesthatareexperimentalduplicatesratherthangenuinetwins(seeSupplementaryMaterial).Weidentifiedrelatedindividualsbyestimatingkinshipcoefficientsforallpairsofsamples,andreportcoefficientsforpairsofrelativeswhoweinfertobe3rddegreeorcloser.Weusedanestimatorimplementedinthesoftware,KING[42],asitisrobusttopopulationstructure(i.edoesnotrelyonaccurateestimatesofpopulationallelefrequencies)anditisimplementedinanalgorithmefficientenoughtoconsiderallpairs(~1.2x1011)inapracticableamountoftime.AsnotedbytheauthorsofKING[42],wefoundthatrecentadmixture(e.g“Mixed”ancestralbackgrounds)tendedtoinflatetheestimateofthekinshipcoefficient,astheestimatorassumesHWEamongmarkerswiththesameunderlyingallelefrequencieswithinanindividual.Wealleviatedthiseffectbyonlyusingasubsetofmarkersthatareonlyweaklyinformativeofancestralbackground(seeSupplementaryMaterial,FigureS12).Wealsoexcludedasmallfractionofindividuals(977)fromthekinshipestimation,astheyhadproperties(e.ghighmissingrates)thatwouldleadtounreliablekinshipestimates(seeSupplementaryMaterial).Wecalledrelationshipclassesforeachrelatedpairusingthekinshipcoefficientandfractionofmarkersforwhichtheysharenoalleles(IBS0),andusingtheboundariesrecommendbytheauthorsofKING.

21

Monozygotictwins

Parent-offspring Fullsiblings 2nddegree 3rddegree Total

Numberofpairs 179 6,276 22,666 11,113 66,928 107,162

Table4|Summaryofrelatedpairs(3rddegreeorcloser)forthefullUKBiobankcohort.CountsarederivedfromthekinshipcoefficientsasrecommendedbytheauthorsofKING[42].Notethatparent-offspringandfullsiblingpairshavethesameexpectedkinshipcoefficient(0.25)butcanbeeasilydistinguishedbytheirIBS0fraction.Thecountofmonozygotictwinsisafterexcludingsamplesidentifiedasduplicates(seeSupplementaryMaterial).Atotalof147,731UKBiobankparticipants(30.3%)areinferredtoberelated(3rddegreeorcloser)toatleastoneotherpersoninthecohort,andformatotalof107,162relatedpairs(Table4).Thisisasurprisinglylargenumber,anditisnotdrivensolelybyanexcessof3rddegreerelatives.Forexample,thenumberofsiblingpairs(22,666)isabouttwiceasmanyaswouldtheoreticallybeexpectedinarandomsample(ofthissize)oftheeligibleUKpopulation,aftertakingintoaccounttypicalfamilysizes(seeSupplementaryMaterial).Toensurewewerenotoverestimatingthenumberofrelatedpairs,weinferredrelatedpairs(withinasubsetofthedata)usingadifferentinferencemethodimplementedinPLINK(“--genome”command)[43]andconfirmed100%ofthetwins,parent-offspringandsiblingpairs,and99.9%ofpairsoverall(seeSupplementaryMaterial).Thelargerthanexpectednumberofrelatedpairscouldbeexplainedbysamplingbiasdueto,forexample,anindividualbeingmorelikelytoagreetoparticipatebecauseafamilymemberwasalsoinvolved.Furthermoreif,asseemsplausible,relatedindividualsclustergeographically,ratherthanbeingrandomlylocatedacrosstheUnitedKingdom(UK),theassessment-centrebasedrecruitmentstrategyemployedbyUKBiobank[1]willnaturallytendtooversamplerelatedindividuals,relativetorandomsamplingoftheUKpopulation.

2.2.4 Triosandextendedfamilies

PairsofrelatedindividualswithintheUKBiobankcohortformnetworksofrelatedindividuals,or‘family’groups.Inmostcasestheseareofsizetwo,buttherearemanygroupsofsize3orlargerinthecohort,evenwhenrestrictingto2nddegreeorcloserrelativepairs.Byconsideringtherelationshiptypesandtheageandsexofindividualswithineachfamilygroup,weidentified1,066setsoftrios(twoparentsandanoffspring),whichcomprises1,029uniquesetsofparentsand37quartets(twoparentsandtwochildren).Therearenoinstancesof3-generationnuclearfamilies(grandparent-parent-offspring),whichisnotsurprising,giventhattheage-rangeofthecohortspansonly30years.Thereare172familygroupswith5ormoreindividualsthatarerelatedto2nddegreeorcloser,andFigure6illustratesexamplesoflargenuclearfamiliesinthecohort.Oneoftheseisagroupofelevenindividualswhereeverypairisa2nddegreerelatives.Thesimplestexplanationforsuchafamilygroupisthatallofthe

22

individualsarehalfsiblings,withonesharedparentwhoisnotinthecohort.Alternatively,oneindividualinthegroupcouldbeasiblingoraparentofthesharedparent.Thepatternofhaplotypesharingbetweentheindividualswoulddistinguishthesecasesbutwehavenotundertakenthisanalysis.Wedid,however,confirmthatthesharedparentmustbetheirfather,becausetheindividualsdonotallcarrythesameMitochondrialalleles,andallthemalesinthegrouphavethesameallelesontheirYchromosome(datanotshown).

Figure6|FamilialrelatednessintheUKBiobankcohort.(A)ExamplesoffamilygroupswithintheUKBiobankcohort.Pointsindicateparticipants,andlinesbetweenpointsindicatefamilialrelatedness(3rddegreeandcloser)asinferredfromthegeneticdata(seeMethods).Thecolourandthicknessofthelinesindicatedifferentrelativeclasses,asshowninthekey.Anintegernexttoanetworkindicatesthetotalnumberoffamilynetworksinthecohortwiththesameconfiguration,ignoring3rddegreepairs.Nointegermeansthereisonlytheoneshown.Forexample,thereare10networksthatcompriseexactly5fullsiblings(twoexamples,whichdifferwithrespecttoa3rddegreerelative,areshownonthisplot);andthereisonlyonenetworkthatcomprises6fullsiblings(plusone3rddegreerelativewhoisrelatedtoallsiblings).(B)Distributionoftheexactnumberofrelativesthat

Number of relatives per participant

Num

ber o

f par

ticip

ants

1

10

102

103

104

105

1 2 3 4 5 6 7 8 9 10 11−20

Number of relatives per participant

A

B

Example family groups

2

10

87

2

10

726

5

27

Monozygotic twinsParent offspringFull siblings2nd degree3rd degree

Relationship classes

21+

23

participantshaveintheUKBiobankcohort.Theheightofeachbarshowsthecountofparticipantswhohaveexactlythestatednumberofrelatives.Notethelogarithmicscale.Thecoloursindicatetheproportionsofeachrelatednessclassforindividualscountedwithinabar.Forexample,foreachindividualthathasthreerelativesinthecohort(3rdbar),wecounthowmanyoftheirrelativesareineachrelatednessclass(i.eafullsibling,parent-offspringetc.).Wethensumthesecountsoverallindividualswiththreerelativesandcolourthebaraccordingtotheproportionsofeachrelatednessclass.Inthisgroup~20%oftheirrelativesarefullsiblingsand~64%are3rddegreerelatives.Therearealso18participantswithexactly10relatives.Theunusuallylargefractionof2nddegreerelationshipsforthisgroupisaresultofthesetofelevenindividualswhoareall2nddegreerelativesofeachother,asshowninthecentreof(A).

2.3 PhasingandImputationofSNPs,shortindelsandCNVsGenotypeimputation[44]istheprocessofpredictinggenotypesthatarenotdirectlyassayedinasampleofindividuals.AreferencepanelofhaplotypesatadensesetofSNPs,indelsandstructuralvariants(SVs),isusedtoimputegenotypesintoastudysampleofindividualsthathavebeengenotypedatasubsetofthemarkers.These‘insilico’genotypescanthenbeusedtoboostthenumberofmarkersthatcanbetestedforassociation.Thisincreasesthepowerofthestudy,theabilitytoresolveorfine-mapthecausalvariantsandfacilitatesmeta-analysis.Theprocessofimputationfirstinvolespre-phasingthedirectlygenotypedmarkers,followedbyahaploidimputationstep[45].

2.3.1 Pre-Phasing

Forthepre-phasingstepweonlyusedmarkerspresentonboththeUKBiLEVEandUKBiobankAxiomarrays.Weremovedmarkerswhich(a)failedQCin>1batch,(b)hadmorethan5%missingness(c)hadaminorallelefrequency<0.0001.Weremovedsamplesthatwereidentifiedasoutliersforheterozygosityandmissingness(Section2.1.5).Thesefiltersresultedinadatasetwith670,739autosomalmarkersin487,442samples.PhasingontheautosomeswascarriedoutusingSHAPEIT3[46]inchunksof15,000markers,withanoverlapof250markersbetweenchunks.Eachchunkused4coresperjobandS=200copyingstates.The1000GenomesPhase3dataset[47]wasusedasareferencepanel,predominantlytohelpwiththephasingofsampleswithnon-Europeanancestry.Chunkswereligatedusingamodifiedversionofthehapfuseprogram(seeURLs). Weassessedtheaccuracyofthephasinginaseparateexperimentbytakingadvantageofmother-father-childtriosthatwereidentifiedintheUKBiobankcohort(seeSection2.2.4).Thisfamilyinformationcanbeusedtoinferthephaseofalargenumberofmarkersinthetrioparents.Thesefamily-inferredhaplotypeswereusedasatruthset,asiscommoninthephasingliterature.Theparentsofeachtriowereremovedfromthedatasetandthenhaplotypeswereestimatedacrosschromosome20inasinglerunofSHAPEIT3.Thisdatasetconsistedof16,175autosomalmarkers.

24

Theinferredhaplotypeswerethencomparedtothetruthsetusingtheswitcherrormetric.Usingasetof696trioswithself-reportedethnicbackground“British”(withinthebroader-levelgroup“White”(seeTable2))andnoothertwinsor1stor2nddegreerelativesintheUKBiobankdataset,weestimatedamedianswitcherrorrateof0.229%.Wealsousedasubsetof397ofthesetriosthatalsohadno3rddegreerelativesandobtainedamedianswitcherrorrateof0.234%.

2.3.2 Referencepanelusedforimputation

Thereareanumberoffactorsthatinfluencetheaccuracyofgenotypeimputation,butgenerallyaccuracywillincreaseasthenumberofhaplotypesinthereferencepanelgrowsandiftheancestryofthesamplehaplotypesisagoodmatchtotheancestryofthereferencepanelhaplotypes.TheUKBiobankdatasetconsistsofsampleswithadiverserangeofancestries,butwiththemajorityofsampleshavingBritish(orEuropean)ancestry(Table2).ForthisreasonitwasdesirabletouseareferencepanelwithalargenumberofhaplotypeswithBritishandEuropeanancestry,andalsoadiversesetofhaplotypesfromotherworld-widepopulations.FortheinterimdatareleaseareferencepanelwasusedthatmergedtheUK10Kand1000GenomesPhase3referencepanels,whichconsistedof87,696,888bi-allelicmarkersin12,570haplotypes.AnadvantageofthisreferencepanelisthatitincludesshortindelsandlargerstructuralvariantsaswellasSNPs.Sincethen,theHRCreferencepanel[16]hasbeenwidelyadoptedforimputationintomanyGWAS.Thisreferencepanelhasmanymorehaplotypes(64,976)thanthepreviouslyusedreferencepanel,andsoisexpectedtoproducebetterimputationperformance,especiallyatlowerallelefrequencies(seeSupplementaryFigureS14).HowevertheHRCpanelhasfewerSNPs(39,235,157)sinceveryrarevariantswereexcludedwhenthepanelwasconstructed,andtherearenoshortindelsandstructuralvariants.ToobtaintheexpectedgaininperformanceatHRCSNPs,whilstretainingresultsatalltheSNPs/indels/SVsimputedintheinterimrelease,weimputedSNPsfrombothpanels,butpreferentiallyusedtheHRCimputationatSNPspresentinbothpanels.

2.3.3 Imputation

Tofacilitatefastimputationofall~500,000sampleswere-codedIMPUTE2[45]tofocusexclusivelyonthehaploidimputationneededwhensampleshavebeenpre-phased.ThisnewversionoftheprogramisreferredtoasIMPUTE4(seeURLs),butusesexactlythesamehiddenMarkovmodel(HMM)withinIMPUTE2,andproducesidenticalresultstoIMPUTE2whenrunusingallreferencehaplotypesashiddenstates(datanotshown).ToreduceRAMusageandincreasespeedweusedcompactbinarydatastructuresandtookadvantageofhighcorrelationsbetweeninferredcopyingstatesintheHMMtoreducecomputation.Imputationwascarriedoutinchunksofapproximately50,000imputedmarkerswitha250kbbufferregionandon

25

5,000samplespercomputejob.Thecombinedprocessingtimepersampleforthewholegenomewas~10mins.Theresultoftheimputationprocessisadatasetwith92,693,895autosomalSNPs,shortindelsandlargestructuralvariantsin487,442individuals.dbSNPReferenceSNP(rs)IDswereassignedtoasmanymarkersaspossibleusingrsIDlistsavailablefromtheUCSCgenomeannotationdatabasefortheGRCh37assemblyofthehumangenome(seeURLs).Figure7showsthedistributionofinformationscoresonallmarkersintheimputeddataset.AninformationscoreofαinasampleofMindividualsindicatesthattheamountofdataattheimputedmarkerisapproximatelyequivalenttoasetofperfectlyobservedgenotypedatainasamplesizeofαM.Thefigureillustratesthatthemajorityofmarkersabove0.1%frequencyhavehighinformationscores.PreviousGWAShavetendedtouseafilteroninformationaround0.3whichroughlycorrespondstoaneffectivesamplesizeof~150,000.Thusitmaybepossibletoreducetheinformationscorethresholdandstillobtaingoodpowertodetectassociations.TheUKBiobankAxiomarrayfromAffymetrixwasspecificallydesignedtooptimizeimputationperformanceinGWASstudies[7].Anexperimentwascarriedouttoassesstheimputationperformanceofthearray,stratifiedbyallelefrequency,andtocompareperformancetoarangeofoldandnewcommerciallyavailablearrays(SupplementaryMaterialandFigureS14).TheseresultshighlighttheverygoodperformanceoftheAxiomarraycomparedtootherarraysacrossthefullfrequencyspectrum.

2.3.4 Acompressedandindexeddataformatforimputeddata

TheinterimreleaseoftheUKBiobankgeneticdatawasmadeavailableinversion1.1oftheBGENfileformat,whichisabinaryversionoftheGENfileformat(seeURLs).Theinterimdatareleaseimputedfilescontaining~150,000samplesat~73MautosomalSNPsrequire1.3Tboffilespace.Forthefulldatareleasewedevelopedanewversion(version1.2)oftheBGENfileformat,providinggreatercompressionandtheabilitytostorephasedhaplotypedata.Thefullimputedfilescontaining~500,000samplesat~93Mautosomalmarkersrequire2.1Tboffilespace.Inaddition,wedevelopedanopen-sourcesoftwarelibraryandsuiteoftoolsthatcanbeusedtomanipulateandaccesstheBGENfiles,includingthetoolBGENIX,whichprovidesrandomaccesstodatabymakinguseofaseparateindexfile(seeURLs).

26

Severalcommonly-usedprogramssupporttheBGENformat,includingSNPTEST(seeURLs)andPLINK[48].TheprogramQCTOOLv2(seeURLs)canbeusedtofilter,summarize,manipulateandconvertthefilestootherformats.

Figure7|Distributionofinformationscoresatautosomalmarkersintheimputeddataset.Topleftshowsthefulldistributionoftheinfoscores.TheremainingpanelsshowdistributionsintranchesofminorallelefrequencyMAF>5%,1%<=MAF<5%,0.1%<=MAF<1%,0.01%<=MAF<0.1%and0.001%<=MAF<0.01%.

2.4 ImputationofclassicalHLAallelesThemajorhistocompatibilitycomplex(MHC)onchromosomesixisthemostpolymorphicregionofthehumangenomeandharboursthelargestnumberofgeneticassociationstocommondiseases[49].TherelevanceofthisgenomicregiontobiomedicalresearchmeansthathighqualitytypingofHLAallelesisavaluableextensiontothegeneticinformationintheUKBiobank.Becauseoftheextensivegeneticvariationandstronglinkage-disequilibriumintheMHCweusedaspecialisedalgorithmtoresolveallelicvariationat11classicalhumanleukocyteantigen(HLA)genesintheUKBiobankcohort.WeimputedHLAtypesattwo-field(alsoknownasfour-digit)resolutionforHLA-A,-B,-C,-DRB5,-DRB4,-DRB3,-DRB1,-DQB1,-DQA1,-

All variants : N = 92693895

Information

Frequency

0.0 0.2 0.4 0.6 0.8 1.0

0.0e+00

2.0e+06

4.0e+06

6.0e+06

8.0e+06

1.0e+07

1.2e+07

MAF>5% : N = 7355020

Information

Frequency

0.0 0.2 0.4 0.6 0.8 1.0

0e+00

1e+06

2e+06

3e+06

4e+06

5e+06

6e+06

7e+06

1%<MAF<5% : N = 3122479

Information

Frequency

0.0 0.2 0.4 0.6 0.8 1.0

0e+00

1e+06

2e+06

3e+06

4e+06

5e+06

6e+06

7e+06

0.1%<MAF<1% : N = 9547928

Information

Frequency

0.0 0.2 0.4 0.6 0.8 1.0

0e+00

1e+06

2e+06

3e+06

4e+06

5e+06

6e+06

7e+06

0.01%<MAF<0.1% : N = 28328108

Information

Frequency

0.0 0.2 0.4 0.6 0.8 1.0

0e+00

1e+06

2e+06

3e+06

4e+06

5e+06

6e+06

7e+06

0.001%<MAF<0.01% : N = 31687867

Information

Frequency

0.0 0.2 0.4 0.6 0.8 1.0

0e+00

1e+06

2e+06

3e+06

4e+06

5e+06

6e+06

7e+06

27

DPB1and-DPA1,usingtheHLA*IMP:02algorithmwithamulti-populationreferencepanel(SupplementaryTablesS4,S5)[50,51].Weestimatedtheaccuracyoftheimputationprocessusingfive-foldcross-validationinthereferencepanelsampleswithmarkerscommontothereferencepanelandUKBiobankgenotypedata.ForsamplesofEuropeanancestry,theestimatedfour-digitaccuracyforthemaximumposteriorprobabilitygenotypeisabove93.9%forall11loci(SupplementaryTableS6).Thisaccuracyimprovedtoabove96.1%forall11lociafterrestrictingtoHLAallelecallswithaposteriorprobabilitygreaterthan0.70.Thisresultedincallratesabove95.1%forallloci(SupplementaryTableS7).TodemonstratetheutilityoftheHLAimputationweperformedassociationtestsfordiseasesknowntohaveHLAassociations.Weanalysed409,724individualsinthewhiteBritishancestrysubset(definedinSection2.2.2)alongwith446diseasecodesbasedonself-reporteddiagnosisterms.Ofthesediseasecodeswefocusedon11immune-mediateddiseaseswithknownHLAassociations(seeSupplementaryTableS8forafulllist).ForeachindividualwedefinedtheHLAgenotypeateachlocusasthepairofalleleswithmaximumposteriorprobability.Weperformedastandardassociationanalysis(seeforexample[52])forHLAallelesandeachdiseaseusinglogisticregressionwithanadditivemodelforeachalleleateachlocus.ThiseffectivelytreatseachalleleasaSNPinastandardGWASframework.ForfurtherdetailsseetheSupplementaryMaterialSectionS5.ToensureweonlyincludehighqualityHLAcallsinouranalysesweperformedasimilaranalysiswhereateachlocusweonlyincludethoseindividualswhosegenotypepairhasamaximumposteriorprobabilitygreaterthan0.7.Nosignificantdifferenceswereobservedcomparedtothefullanalysis(datanotshown).ForeachdiseaseinouranalysisweidentifiedtheHLAallelewiththestrongestevidenceofassociationandthesewereconsistentwithpreviousreports(seeSupplementaryTableS8forthefullresults).ThisprovidesevidencethattheimputationofHLAallelesissufficientlyaccurateforgeneticassociationtesting.AsafinalQCcheckforourimputedHLAalleles,weshowthattheUKBiobankdataprovidesconsistentresultswithafine-mappingstudy.InthiscasewereplicateindependentHLAassociationsinasinglediseasestudyofmultiplesclerosissusceptibilitybytheInternationalMultipleSclerosisGeneticsConsortium(IMSGC)[52].HereweobservedevidenceofassociationandeffectsizeestimatesforHLAallelesthatareconsistentwiththosefoundinIMSGCstudy(Table5).

28

HLAAllele Test UKBiobank IMSGC[52]OR(95%C.I.) p-value OR(95%C.I.) p-value

HLA-DRB1*15:01 Additiveeffect 3.16(2.81-3.54) 2.58×10−85 3.92(3.74-4.12) <1×10−600

Homozygotecorrection 0.67(0.52-0.87) 2.32×10−03 0.54(0.47-0.61) 8.50×10−22

HLA-A*02:01 Additiveeffect 0.69(0.62-0.78) 2.30×10−10 0.67(0.64-0.70) 7.80×10−70


HLA-DRB1*03:01 Additiveeffect 1.21(1.06-1.37) 3.39×10−03 1.16(1.10-1.22) 3.50×10−08




HLA-B*44:02 Additiveeffect 0.86(0.74-0.98) 2.94×10−02 0.78(0.74-0.83) 4.70×10−17



HLA-DQA1*01:01

AdditiveeffectinthepresenceofHLA-DRB1*15:01

0.71(0.56-0.90) 5.33×10−03 0.65(0.59-0.72) 1.30×10−17

HLA-DQB1*03:02 Dominanteffect 1.07(0.92-1.25) 3.71×10−01 1.30(1.23-1.37) 1.80×10−22

HLA-DQB1*03:01

AllelicinteractionwithHLA-DQB1*03:02

0.8(0.53-1.20) 2.81×10−01 0.60(0.52-0.69) 7.10×10−12

Table5|EvidenceofassociationbetweenHLAallelesandmultiplesclerosisinUKBiobankcomparedtotheIMSGCcohort.TheUKBiobankassociationtestsinvolved1,501self-reportedcasesand409,724controls;theIMSGCcohortinvolved17,465casesand30,385controls[52].ThustheUKBiobankanalysishadsignificantlylowerpowerthantheIMSGCanalysis,whichisreflectedinthereportedp-values.EffectsizesfortheUKBiobankwereestimatedjointlyusingthemodeloftheMHCreportedbytheIMSGC(withtheexceptionofthetwoSNPsrs9277565andrs2229029).AsintheIMSGCanalysisthehomozygotecorrectiontestindicatesadeparturefromadditivity.Thatis,iftheoddsratiois<1thenthehomozygouseffectissmallerthanundertheadditivityassumptionandbiggerifitis>1.

2.5 GWASforstandingheightAsafinaldemonstrationofthequalityofthedirectlygenotypedandimputeddataweconductedagenome-wideassociationscanforawell-studied[49],andhighlypolygenichumantrait:standingheight.Previouslypublished,largecohortstudiesforthistraitalsoprovideasuitableindependentcomparisonsetforourscan.Weconductedthescanusinggenotypesforasetof343,321unrelatedUKBiobankparticipantswithinthewhiteBritishancestrysubset(Section2.2.2)andwithastandingheightmeasurement(SupplementaryMaterial).WecomparedourresultstothelargestpublishedGWASforhumanheightthatdoesnotuseUKBiobankdata:ametaanalysisinvolvingatotalof253,288individualsofEuropeanancestrycarried

29

outbytheGeneticInvestigationofAnthropometricTraits(GIANT)Consortium[53]in2014.Figure8showsthep-valuesforassociationswithheightononechromosome,alongwithp-valuesforGIANT’smetaanalysis.Reassuringly,thepatternofassociationsisverysimilarinboththeUKBiobankandGIANTresults.ThegaininpowerintheUKBiobankcohortisclear,withmanylocireachinggenome-widesignificance(p-value<5x10-8)intheUKBiobank,whichdonotintheGIANTstudy(SupplementaryFigureS15).Thepurposeofthisanalysisisnottoreportnovelassociationsforheight,rathertoindicatethepotentialoftheresourcetouncoversuchfindings.However,wechosetofocusonasingleassociatedregiononchromosome2showninFigure8D.Correlations(r2)betweenmarkersinthisregionshowapatternthatisasexpectedinthecontextoflinkagedisequilibrium(LD),andthelocalrecombinationrates.Thestripe-likepatternoftheassociationstatisticsisindicativeofmultiplemutationsoccurringonsimilarbranchesofthegenealogicaltreeunderlyingthedata,whicharelikelylinkedtovaryingdegreeswiththecausalmarker(s).Thecorrelation(r2)betweenthemostassociatedmarkerandallothermarkersintheregiondropsofsharplyaroundthesmallpeakinrecombination(Hapmaprecombinationmap[21])totherightofthemostsignificantlyassociatedmarker.Interestingly,thismarkerwasimputedfromthegenotypes,whichpointstothesuccessoftheimputationinthisstudy,andingeneral,tothevalueofimputingmillionsmoremarkers.Humanheightisahighlypolygenictrait[53]andthisprovidedanopportunitytoexaminemanysuchregionsofassociation.Otherregionsthatwevisuallyexaminedshowedsimilarpatterns,andallregionscontainingthevariantsreportedtobeassociatedwithheightintheNCBIGWAScatalogue(asofFeb2017,excludingresultsbasedonUKBiobankdata)werealsogenome-widesignificant,orcloseto,intheUKBiobankdata.

30

Figure8|Associationstatisticsforhumanheight.Results(p-values)ofassociationtestsbetweenhumanheightandgenotypesusingthreedifferentsetsofdataforchromosome2.p-valuesareshownonthe–log10scaleandcappedat50forvisualclarity.Markerswith–log10(p)>50areplottedat50onthey-axisandshownastrianglesratherthandots.(A)Resultsformeta-analysisbyGIANT

31

(2014),withNCBIGWAScataloguemarkerssuperimposedinred(plottedatthereportedp-values).(B)AssociationstatisticsforUKBiobankmarkersinthegenotypedata.(C)AssociationstatisticsforUKBiobankmarkersintheimputeddata.Pointscolouredpinkindicategenotypedmarkersthatwereusedinpre-phasingandimputation(seeSection2.3.1).Thismeansthatmostofthedataateachofthesemarkerscomesfromthegenotypingassay.Blackpoints(thevastmajority,~8M)indicatefullyimputedmarkers.(D)Resultsfromallthreedatasetsfocussingona~3Mega-baseregionattheterminalendofthep-arm.Genotypedmarkers(i.emarkersinB)areshownasdiamonds,andimputedmarkers(i.e.onlymarkerscolouredblackinC)ascircles.Thetwomarkerswiththesmallestp-valueforeachofthegenotypeddataandimputeddataareenlargedandhighlightedwithblackoutlines,andotherUKBiobankmarkersarecolouredaccordingtotheircorrelation(r2)withoneofthesetwo.Thatis,genotypedmarkerswiththeleadinggenotypedmarker(rs17713396),andimputedmarkerswiththeleadingimputedmarker(rs12714401).Markerswithr2lessthan0.1areshownasblackorgreen.

2.6 MultipletraitGWASandPheWASTofacilitatetherunningofGWASformultiplecontinuoustraitsandfastPheWASwehaveprovidedanewtoolcalledBGENIEthatisbuiltupontheBGENlibrary(seeURLs).TheprogramalsousestheEigenmatrixlibraryandOpenMPtocarryoutasmanyofthelinearalgebraoperationsinparallelaspossible.Forexample,estimationofeffectsizesoflargenumbersofmarkerscanbecarriedoutinparallelusingmatrixoperations,andweuseindexingofmissingdatavaluestoallowforfastestimationofstandarderrors.TheprogramtakesBGENfilesasinputandavoidsrepeateddecompressionandconversionofthesefileswhenanalysingmultiplephenotypes,andcanleadtoconsiderabletimesavingswhencomparedtoanalysisusingPLINK(seeSupplementaryMaterial).

3 Dataprovisionandaccess

Genotypecalls,bothimputedanddirectlyassayed,andHLAhaplotypecallsareavailableonacost-recoverybasistoresearchersonsuccessfulapplicationtotheUKBiobankhttp://www.ukbiobank.ac.uk/scientists-3/genetic-data/.Severaltypesofimportantinformationarealsoavailablealongwiththegenotypecalls,namely:

• Variousqualitycontrolmetricsandflagsforsamplesandmarkers.• 40principalcomponentsforallgenotypedsamples,andkinshipcoefficients

forpairsofinferredcloserelatives.• MeasuredAandBalleleintensities,log2ratiosandB-allelefrequencydata

forall488,377genotypedsamplesat805,426markers(Affymetrix).• Confidencevaluesforallgenotypecallsinthereleaseddata.• Posteriorparametersforgenotypeclustersateachmarkerineach

genotypingbatch(Affymetrix).• Imputationinfoscoresandminorallelefrequenciesforallmarkersinthe

imputeddatafiles.

32

4 URLs

SHAPEIT3,IMPUTE4,BGENIE-https://jmarchini.org/software/Hapfuse-https://bitbucket.org/wkretzsch/hapfuse/srcBGENIX,BGENlibrary-https://bitbucket.org/gavinband/bgenGRCh37humangenomeassembly-http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/Evoker-https://github.com/wtsi-medical-genomics/evokerBGENfileformat-http://www.well.ox.ac.uk/~gav/bgen_format/bgen_format.htmlGENfileformat-http://www.stats.ox.ac.uk/~marchini/software/gwas/file_format.htmlSNPTEST-https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.htmlQCTOOLv2-http://www.well.ox.ac.uk/~gav/qctool_v2

5 Authorcontributions

C.B.,C.F.,D.P.carriedoutthemarkerandsamplequalitycontrolandanalysis.S.W.contributedtothegenotypequalitycontrol.A.C.,A.M.,D.V.,S.L.,G.M.carriedoutallanalysisrelatingtotheHLAimputationandassociationtesting.G.B.developedtheBGENfileformat.O.D.,J.O.,K.S.,L.E.andJ.M.carriedoutallanalysisandmethodsdevelopmentrelatingtothephasing,imputationandmultipletraitanalysis.C.B.,C.F.andJ.M.carriedoutallanalysisrelatingtoGWAStesting.J.M.andP.D.supervisedthework.C.B.,C.F.,A.C.,G.M.,J.M.,P.D.wrotethepaper.

6 Acknowledgements

ThisworkwassupportedbytheWellcomeTrustCoreAwards090532/Z/09/Zand203141/Z/16/Z,andtheEuropeanResearchCouncil(ERC;grant617306)(toJ.M.)andWellcomeTrustSeniorInvestigatorAwards095552/Z/11/Z(toP.D.)and100956/Z/13/Z(toG.M.),andUKBiobank(toC.B.,D.P.andC.F.),andWellcomeTrustgrant100308/Z/12/Z(toA.C.),andWellcomeTrustgrant090770/Z/09/Z(toG.B.),andtheAustralianNationalHealthandMedicalResearchCouncil(NHMRC),CareerDevelopmentFellowshipID1053756(toS.L.).ThesampleprocessingandgenotypingwassupportedbytheNationalInstituteforHealthResearch,MedicalResearchCouncil,andBritishHeartFoundation.WewouldliketoacknowledgetheResearchComputingCoreattheWellcomeTrustCentreforHumanGeneticsfortheirhelpinmanagingthehugecomputational

33

workloadofthisproject.WewouldliketoacknowledgeAffymetrixfortheircontributiontothisprojectandspecificallyfortheircontributiontodiscussionsofQC.WethankAlexYoungforassistancewithaspectsoftherelatednessanalysis.WethankAlexDiltheyandLoukasMoutsianasfortheirassistancewithperformingclassicalHLAalleleimputationandanalysis.WewouldliketoacknowledgeUKBiobankco-ordinatingcentrestafffortheirroleinextractingtheDNAforthisproject.WewouldalsoliketoexpressourgratitudetotheparticipantsintheUKBiobankcohort.

7 ConflictsofInterest

J.M.isafounderanddirectorofGensciLtd.P.D.,G.M.andS.L.arepartnersinPeptideGrooveLLP.G.M.andP.D.arefoundersanddirectorsofGenomicsPlcinwhichP.D.currentlyholdsanexecutiverole.Theremainingauthorsdeclarenocompetingfinancialinterests.

8 References

1. UKBiobank.UKBiobank:Protocolforalarge-scaleprospectiveepidemiologicalresource.UKBiobankCoordinatingCentre:2007.2. SudlowC,GallacherJ,AllenN,BeralV,BurtonP,DaneshJ,etal.UKBiobank:AnOpenAccessResourceforIdentifyingtheCausesofaWideRangeofComplexDiseasesofMiddleandOldAge.PLOSMedicine.2015;12(3):e1001779.3. LittlejohnsTJ,SudlowC,AllenNE,CollinsR.UKBiobank:opportunitiesforcardiovascularresearch.EurHeartJ.2017.doi:10.1093/eurheartj/ehx254.4. MillerKL,Alfaro-AlmagroF,BangerterNK,ThomasDL,YacoubE,XuJ,etal.MultimodalpopulationbrainimagingintheUKBiobankprospectiveepidemiologicalstudy.NatNeurosci.2016;19(11):1523--36.5. UKBiobankImagingStudy.Availablefrom:http://imaging.ukbiobank.ac.uk.6. PlengeRM,ScolnickEM,AltshulerD.Validatingtherapeutictargetsthroughhumangenetics.NatRevDrugDiscov.2013;12(8):581--94.7. UKBiobankAxiomArrayContentSummary.http://www.ukbiobank.ac.uk/wp-content/uploads/2014/04/UK-Biobank-Axiom-Array-Content-Summary-2014.pdf.8. UKBiobank.GenotypingandqualitycontrolofUKBiobank,alarge-scale,extensivelyphenotypedprospectiveresource2015.Availablefrom:http://biobank.ctsu.ox.ac.uk/crystal/docs/genotyping_qc.pdf.9. UKBiobankPublishedPapers.Availablefrom:http://www.ukbiobank.ac.uk/published-papers/.10. YoungAI,WauthierF,DonnellyP.Multiplenovelgene-by-environmentinteractionsmodifytheeffectofFTOvariantsonbodymassindex.NatureCommunications.2016;7:12724.

34

11. AstleWJ,EldingH,JiangT,AllenD,RuklisaD,MannAL,etal.TheAllelicLandscapeofHumanBloodCellTraitVariationandLinkstoCommonComplexDisease.Cell.2016;167(5):1415--29.e19.doi:10.1016/j.cell.2016.10.042.12. WarrenHR,EvangelouE,CabreraCP,GaoH,RenM,MifsudB,etal.Genome-wideassociationanalysisidentifiesnovelbloodpressurelociandoffersbiologicalinsightsintocardiovascularrisk.NatGenet.2017;49(3):403--15.13. WainLV,ShrineN,ArtigasMS,ErzurumluogluAM,NoyvertB,Bossini-CastilloL,etal.Genome-wideassociationanalysesforlungfunctionandchronicobstructivepulmonarydiseaseidentifynewlociandpotentialdruggabletargets.NatGenet.2017;49(3):416--25.14. LaneJM,LiangJ,VlasacI,AndersonSG,BechtoldDA,BowdenJ,etal.Genome-wideassociationanalysesofsleepdisturbancetraitsidentifynewlociandhighlightsharedgeneticswithneuropsychiatricandmetabolictraits.NatGenet.2017;49(2):274--81.15. WainLV,ShrineN,MillerS,JacksonVE,NtallaI,ArtigasMS,etal.Novelinsightsintothegeneticsofsmokingbehaviour,lungfunction,andchronicobstructivepulmonarydisease(UKBiLEVE):ageneticassociationstudyinUKBiobank.TheLancetRespiratoryMedicine.3(10):769--81.doi:10.1016/S2213-2600(15)00283-0.16. McCarthyS,DasS,KretzschmarW,DelaneauO,WoodAR,TeumerA,etal.Areferencepanelof64,976haplotypesforgenotypeimputation.NatGenet.2016;48(10):1279-83.doi:10.1038/ng.3643.17. TheUK10KConsortium.TheUK10Kprojectidentifiesrarevariantsinhealthanddisease.Nature.2015;526(7571):82-90.doi:10.1038/nature14962.18. ElliottP,PeakmanTC.TheUKBiobanksamplehandlingandstorageprotocolforthecollection,processingandarchivingofhumanbloodandurine.IntJEpidemiol.2008;37.doi:10.1093/ije/dym276.19. WelshS.Genotypingof500,000UKBiobankparticipants:DescriptionofsampleprocessingworkflowandpreparationofDNAforgenotyping.http://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=807.20. WelshS,PeakmanT,SheardS,AlmondR.ComparisonofDNAquantificationmethodologyusedintheDNAextractionprotocolfortheUKBiobankcohort.BMCGenomics.2017;18(1):26.doi:10.1186/s12864-016-3391-x.21. The1000GenomesProjectConsortium.Amapofhumangenomevariationfrompopulation-scalesequencing.Nature.2010;467(7319):1061--73.22. Affymetrix.UKB_WCSGAX:UKBiobank500KSamplesGenotypingDataGenerationbytheAffymetrixResearchServicesLaboratory.http://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=368.23. Affymetrix.UKB_WCSGAX:UKBiobank500KSamplesProcessingbytheAffymetrixResearchServicesLaboratory.http://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=590.24. UKBiobank.Touchscreenquestionnaireordering,validationanddependencies.https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=113241.25. TheWellcomeTrustCaseControlConsortium.Genome-wideassociationstudyof14,000casesofsevencommondiseasesand3,000sharedcontrols.Nature.2007;447(7145):661-78.doi:10.1038/nature05911.

35

26. TheInternationalMultipleSclerosisGeneticsConsortium&TheWellcomeTrustCaseControlConsortium2,SawcerS,HellenthalG,PirinenM,SpencerC,others.Geneticriskandaprimaryroleforcell-mediatedimmunemechanismsinmultiplesclerosis.Nature.2011;476(7359):214-9.doi:10.1038/nature10251.27. BaldingDJ.Atutorialonstatisticalmethodsforpopulationassociationstudies.NatRevGenet.2006;7(10):781-91.doi:10.1038/nrg1916.28. The1000GenomesProjectConsortium.Anintegratedmapofgeneticvariationfrom1,092humangenomes.Nature.2012;491(7422):56--65.29. Affymetrix.Axiom®GenotypingSolutionDataAnalysisGuide.http://tools.thermofisher.com/content/sfs/manuals/axiom_genotyping_solution_analysis_guide.pdf.30. NielsenJ,WohlertM.Chromosomeabnormalitiesfoundamong34910newbornchildren:resultsfroma13-yearincidencestudyinÅrhus,Denmark.HumanGenetics.1991;87(1):81--3.doi:10.1007/BF01213097.31. LekM,KarczewskiKJ,MinikelEV,SamochaKE,BanksE,FennellT,etal.Analysisofprotein-codinggeneticvariationin60,706humans.Nature.2016;536(7616):285-91.doi:10.1038/nature19057.32. MorrisJA,RandallJC,MallerJB,BarrettJC.Evoker:avisualizationtoolforgenotypeintensitydata.Bioinformatics.2010;26(14):1786--7.doi:10.1093/bioinformatics/btq280.33. MathurR,GrundyE,SmeethL.AvailabilityanduseofUKbasedethnicitydataforhealthresearch.http://eprints.ncrm.ac.uk/3040/1/Mathur-_Availability_and_use_of_UK_based_ethnicity_data_for_health_res_1.pdf:2013March.ReportNo.34. MarchiniJ,CardonLR,PhillipsMS,DonnellyP.Theeffectsofhumanpopulationstructureonlargegeneticassociationstudies.NatGenet.2004;36(5):512-7.doi:10.1038/ng1337.35. PriceAL,PattersonNJ,PlengeRM,WeinblattME,ShadickNA,ReichD.Principalcomponentsanalysiscorrectsforstratificationingenome-wideassociationstudies.NatGenet.2006;38(8):904-9.doi:10.1038/ng1847.36. GalinskyKJ,BhatiaG,LohP-R,GeorgievS,MukherjeeS,PattersonNJ,etal.FastPrincipal-ComponentAnalysisRevealsConvergentEvolutionofADH1BinEuropeandEastAsia.TheAmericanJournalofHumanGenetics.2016;98(3):456--72.doi:10.1016/j.ajhg.2015.12.022.37. PriceAL,WealeME,PattersonN,MyersSR,NeedAC,ShiannaKV,etal.Long-rangeLDcanconfoundgenomescansinadmixedpopulations.AmJHumGenet.2008;83(1):132-5;authorreply5-9.doi:10.1016/j.ajhg.2008.06.005.38. LeslieS,WinneyB,HellenthalG,DavisonD,BoumertitA,DayT,etal.Thefine-scalegeneticstructureoftheBritishpopulation.Nature.2015;519(7543):309--14.39. LawsonDJ,HellenthalG,MyersS,FalushD.Inferenceofpopulationstructureusingdensehaplotypedata.PLoSGenet.2012;8(1):e1002453.doi:10.1371/journal.pgen.1002453.40. ShibataK,HozawaA,TamiyaG,UekiM,NakamuraT,NarimatsuH,etal.Theconfoundingeffectofcrypticrelatednessforenvironmentalrisksofsystolicbloodpressureoncohortstudies.MolecularGenetics&GenomicMedicine.2013;1(1):45--53.doi:10.1002/mgg3.4.

36

41. VoightBF,PritchardJK.ConfoundingfromCrypticRelatednessinCase-ControlAssociationStudies.PLOSGenetics.2005;1(3):e32.42. ManichaikulA,MychaleckyjJC,RichSS,DalyK,SaleM,ChenWM.Robustrelationshipinferenceingenome-wideassociationstudies.Bioinformatics.2010;26(22):2867-73.doi:10.1093/bioinformatics/btq559.43. PurcellS,NealeB,Todd-BrownK,ThomasL,FerreiraM,BenderD,etal.PLINK:atoolsetforwholegenomeassociationandpopulation-basedlinkageanalyses.AmericanJournalofHumanGenetics.2007;81.44. MarchiniJ,HowieB.Genotypeimputationforgenome-wideassociationstudies.NatRevGenet.2010;11(7):499--511.doi:10.1038/nrg2796.45. HowieB,FuchsbergerC,StephensM,MarchiniJ,AbecasisGR.Fastandaccurategenotypeimputationingenome-wideassociationstudiesthroughpre-phasing.NatGenet.2012;44(8):955--9.doi:10.1038/ng.2354.46. O'ConnellJ,SharpK,ShrineN,WainL,HallI,TobinM,etal.Haplotypeestimationforbiobank-scaledatasets.NatGenet.2016;48(7):817-20.doi:10.1038/ng.3583.47. The1000GenomesProjectConsortium.Aglobalreferenceforhumangeneticvariation.Nature.2015;526(7571):68--74.48. ChangCC,ChowCC,TellierLC,VattikutiS,PurcellSM,LeeJJ.Second-generationPLINK:risingtothechallengeoflargerandricherdatasets.GigaScience.2015;4(1):7.doi:10.1186/s13742-015-0047-8.49. WelterD,MacArthurJ,MoralesJ,BurdettT,HallP,JunkinsH,etal.TheNHGRIGWASCatalog,acuratedresourceofSNP-traitassociations.NucleicAcidsResearch.2014;42(D1):D1001--D6.50. MotyerA,VukcevicD,DiltheyA,DonnellyP,McVeanG,LeslieS.PracticalUseofMethodsforImputationofHLAAllelesfromSNPGenotypeData.bioRxiv.2016.51. DiltheyA,LeslieS,MoutsianasL,ShenJ,CoxC,NelsonMR,etal.Multi-PopulationClassicalHLATypeImputation.PLoSComputationalBiology.2013;9(2):e1002877.doi:10.1371/journal.pcbi.1002877.52. TheInternationalMultipleSclerosisGeneticsConsortium.ClassIIHLAinteractionsmodulategeneticriskformultiplesclerosis.NatGenet.2015;47(10):1107-13.doi:10.1038/ng.3395.53. WoodAR,EskoT,YangJ,VedantamS,PersTH,GustafssonS,etal.Definingtheroleofcommonvariationinthegenomicandbiologicalarchitectureofadulthumanheight.NatGenet.2014;46(11):1173--86.

Documents

Genome-wide genetic data on ~500,000 UK Biobank participants1 Introduction The UK Biobank project is a large prospective cohort study of ~500,000 individuals from across the United