Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Genome-widegeneticdataon~500,000UKBiobankparticipants
ClareBycroft1*,ColinFreeman1*,DesislavaPetkova1,^*,GavinBand1,LloydT.Elliott2,KevinSharp2,AllanMotyer3,DamjanVukcevic3,4,OlivierDelaneau5,6,7,JaredO’Connell8,AdrianCortes1,9,SamanthaWelsh10,GilMcVean1,11,,StephenLeslie3,4,PeterDonnelly1,2†,JonathanMarchini2,1†‡
1WellcomeTrustCenterforHumanGenetics,UniversityofOxford,UK2DepartmentofStatistics,UniversityofOxford,UK3CentreforSystemsGenomicsandtheSchoolsofMathematicsandStatistics,andBioSciences,TheUniversityofMelbourne,Parkville,Victoria,Australia.4MurdochChildren’sResearchInstitute,Parkville,Victoria,Australia.5DepartmentofGeneticMedicineandDevelopment,UniversityofGeneva,1MichelServet,Geneva,CH1211,Switzerland.6SwissInstituteofBioinformatics,UniversityofGeneva,1MichelServet,Geneva,CH1211,Switzerland.7InstituteofGeneticsandGenomicsinGeneva,UniversityofGeneva,1MichelServet,Geneva,CH1211,Switzerland.8IlluminaLtd,ChesterfordResearchPark,LittleChesterford,Essex,CB101XL,UnitedKingdom.9NuffieldDepartmentofClinicalNeurosciences,DivisionofClinicalNeurology,JohnRadcliffeHospital,UniversityofOxford,OxfordOX39DU,UnitedKingdom.10UKBiobank,Units1-4SpectrumWay,Adswood,Stockport,Cheshire,SK30SA,UK 11BigDataInstitute,LiKaShingCentreforHealthInformationandDiscovery,UniversityofOxford,OxfordOX37LF,UnitedKingdom.^Currentaddress:Procter&Gamble,Brussels,Belgium*Theseauthorscontributedequallytothiswork.†Theseauthorsjointlydirectedthiswork.‡Towhomcorrespondenceshouldbeaddressed:[email protected]
2
AbstractTheUKBiobankprojectisalargeprospectivecohortstudyof~500,000individualsfromacrosstheUnitedKingdom,agedbetween40-69atrecruitment.Arichvarietyofphenotypicandhealth-relatedinformationisavailableoneachparticipant,makingtheresourceunprecedentedinitssizeandscope.Herewedescribethegenome-widegenotypedata(~805,000markers)collectedonallindividualsinthecohortanditsqualitycontrolprocedures.Genotypedataonthisscaleoffersnovelopportunitiesforassessingqualityissues,althoughthewiderangeofancestriesoftheindividualsinthecohortalsocreatesparticularchallenges.Wealsoconductedasetofanalysesthatrevealpropertiesofthegeneticdata–suchaspopulationstructureandrelatedness–thatcanbeimportantfordownstreamanalyses.Inaddition,wephasedandimputedgenotypesintothedataset,usingcomputationallyefficientmethodscombinedwiththeHaplotypeReferenceConsortium(HRC)andUK10Khaplotyperesource.Thisincreasesthenumberoftestablevariantsbyover100-foldto~96millionvariants.Wealsoimputedclassicalallelicvariationat11humanleukocyteantigen(HLA)genes,andasaqualitycontrolcheckofthisimputation,wereplicatesignalsofknownassociationsbetweenHLAallelesandmanycommondiseases.Wedescribetoolsthatallowefficientgenome-wideassociationstudies(GWAS)ofmultipletraitsandfastphenome-wideassociationstudies(PheWAS),whichworktogetherwithanewcompressedfileformatthathasbeenusedtodistributethedataset.Asafurthercheckofthegenotypedandimputeddatasets,weperformedatest-casegenome-wideassociationscanonawell-studiedhumantrait,standingheight.
KeywordsUKBiobank,Genotypes,Qualitycontrol,Populationstructure,Relatedness,Phasing,Imputation,HLAImputation,GWAS,PheWAS
3
1 Introduction............................................................................................................42 Results....................................................................................................................52.1 Highqualitygenotypecallingonnovelarray..................................................52.2 Ancestraldiversityandcrypticrelatedness...................................................182.3 PhasingandImputationofSNPs,shortindelsandCNVs...............................232.4 ImputationofclassicalHLAalleles.................................................................262.5 GWASforstandingheight.............................................................................282.6 MultipletraitGWASandPheWAS.................................................................31
3 Dataprovisionandaccess....................................................................................314 URLs......................................................................................................................325 Authorcontributions............................................................................................326 Acknowledgements..............................................................................................327 ConflictsofInterest..............................................................................................338 References............................................................................................................33
4
1 Introduction
TheUKBiobankprojectisalargeprospectivecohortstudyof~500,000individualsfromacrosstheUnitedKingdom,agedbetween40-69atrecruitment[1].Arichvarietyofphenotypicandhealth-relatedinformationisavailableoneachparticipant,makingtheresourceunprecedentedinitssizeandscope.Thedatacontainsself-reportedinformation,includingbasicdemographics,diet,andexercisehabits;extensivephysicalandcognitivemeasurements;withothersourcesofhealth-relatedinformationsuchasmedicalrecordsandcancerregistersbeingintegratedandfollowedupoverthecourseoftheparticipants’lives[2].Thebaselineinformationhas,andwillbe,extendedinanumberofways[3].Forexample,manybloodandurinebiomarkersarebeingmeasured;andmedicalimagingofbrain[4],heart,bones,carotidarteriesandabdominalfatisbeingcarriedoutonalargesubset(~100,000)ofparticipants[5].Understandingtherolethatgeneticsplaysinphenotypicanddiseasevariation,anditspotentialinteractionswithotherfactors,providesacriticalroutetoabetterunderstandingofhumanbiology.Itisanticipatedthatthiswillleadtomore-successfuldrugdevelopment[6],andpotentiallytomoreefficientandpersonalisedtreatmentsandtobetterdiagnoses.Assuch,akeycomponentoftheUKBiobankresourcehasbeenthecollectionofgenome-widegeneticdataoneveryparticipantusingapurpose-designedgenotypingarray[7].Aninterimreleaseofgenotypedataon~150,000UKBiobankparticipants(May2015)[8]hasalreadyfacilitatednumerousstudies[9].TheseexploittheUKBiobank’ssubstantialsamplesize,extensivephenotypeinformation,andgenome-widegeneticinformationtostudytheoftensubtleandcomplexeffectsofgeneticsonhumantraitsanddisease,anditspotentialinteractionswithotherfactors[10-15].Inthispaperwedescribethegeneticdatasetonthefull~500,000participants,togetherwitharangeofqualitycontrolprocedures,whichhavebeenundertakenonthegenotypedatainthehopeoffacilitatingitswideruse.Toachievethiswedesignedandimplementedaqualitycontrol(QC)pipelinethataddresseschallengesspecifictotheexperimentaldesign,scale,anddiversityofthisdataset.RawdatafromthegenotypingexperimentswillbeavailablefromUKBiobank.Wealsoconductedasetofanalysesthatrevealpropertiesofthegeneticdata–suchaspopulationstructureandrelatedness–thatcanbeimportantfordownstreamanalyses.Inaddition,wephasedandimputedgenotypesintothedataset,usingcomputationallyefficientmethodscombinedwiththeHaplotypeReferenceConsortium(HRC)[16]andUK10Khaplotyperesources[17].Thisincreasesthenumberoftestablevariantsbyover100-foldto~96millionvariants.Wealsoimputedclassicalallelicvariationat11humanleukocyteantigen(HLA)genes,andas
5
aQCcheckofthisimputation,wereplicatesignalsofknownassociationsbetweenHLAallelesandmanycommondiseases.Wedescribetoolsthatallowefficientgenome-wideassociationstudies(GWAS)ofmultipletraitsandfastphenome-wideassociationstudies(PheWAS),whichworktogetherwithanewcompressedfileformatthathasbeenusedtodistributethedataset.Asafurthercheckofthegenotypedandimputeddatasets,weperformedatest-casegenome-wideassociationscanonawell-studiedhumantrait,standingheight.
2 Results
2.1 Highqualitygenotypecallingonnovelarray
2.1.1 Purpose-designedgenotypingarray
Thedatareleasecontainsgenotypesof488,377UKBiobankparticipants.Thesewereassayedusingtwoverysimilargenotypingarrays.Asubsetof49,950participantsinvolvedintheUKBiobankLungExomeVariantEvaluation(UKBiLEVE)studyweregenotypedusingtheAppliedBiosystems™UKBiLEVEAxiom™ArraybyAffymetrix1(807,411markers),whichisdescribedelsewhere[15].Followingthis,438,427participantsweregenotypedusingtheclosely-relatedAppliedBiosystems™UKBiobankAxiom™Array(825,927markers).Botharrayswerepurpose-designedspecificallyfortheUKBiobankgenotypingprojectandshare95%ofmarkercontent[7].ThemarkercontentoftheUKBiobankAxiom™arraywaschosentocapturegenome-widegeneticvariation(singlenucleotidepolymorphism(SNPs)andshortinsertionsanddeletions(indels)),andissummarisedinFigure1.Manymarkerswereincludedbecauseofknownassociationswith,orpossiblerolesin,phenotypicvariation,particularlydisease.Anotableexampleistheinclusionoftwovariants,rs429358andrs7412,whichdefinetheisoformsoftheapolipoproteinE(APoE)geneknowntobeassociatedwithriskofAlzheimer’sdisease[7]andotherconditions.Neithermarkeriseasytotypeusingarraytechnologies;asaconsequenceofthistheyhavenotalwaysbeenassayedonearlierarrays.Thearrayalsoincludescodingvariantsacrossarangeofminorallelefrequencies(MAFs),includingraremarkers(<1%MAF);andmarkersthatprovidegoodgenome-widecoverageforimputationinEuropeanpopulationsinthecommon(>5%)andlowfrequency(1-5%)MAFranges.Furtherdetailsofthearraydesignarein[7].
1NowpartofThermoFisherScientific.
6
Figure1|SummaryofUKBiobankgenotypingarraycontent.ThisisaschematicrepresentationofthedifferentcategoriesofcontentontheUKBiobankAxiomarray.Numbersindicatetheapproximatecountofmarkerswithineachcategory,ignoringanyoverlap.Amoredetaileddescriptionofthearraycontentisavailablein[7].
2.1.2 DNAextractionandgenotypecalling
BloodsampleswerecollectedfromparticipantsontheirvisittoaUKBiobankassessmentcentreandthesamplesarestoredattheUKBiobankfacilityinStockport,UK[18].Overaperiodof18months(Nov.2013–Apr.2015)sampleswereretrieved,DNAwasextracted,and96-wellplatesof9450μlaliquotswereshippedtoAffymetrixResearchServicesLaboratoryforgenotyping.SpecialattentionwaspaidintheautomatedsampleretrievalprocessatUKBiobanktoensurethatexperimentalunitssuchasplatesortimingofextractiondidnotcorrelatesystematicallywithbaselinephenotypessuchasage,sex,andethnicbackground,orthetimeandlocationofsamplecollection.FulldetailsoftheUKBiobanksampleretrievalandDNAextractionprocessaredescribedin[19,20].OnreceiptofDNAsamples,AffymetrixprocessedsamplesontheGeneTitan®Multi-Channel(MC)Instrumentin96-wellplatescontaining94UKBiobanksamplesand
7
twocontrolsamplesfromthe1000GenomesProject[21].Genotypeswerethencalledfromthearrayintensitydata,inunitscalled“batches”whichconsistofmultipleplates.Acrosstheentirecohort,therewere106batchesof~4,700UKBiobanksampleseach(SupplementaryMaterial,TableS11).Followingtheearlierinterimdatarelease,Affymetrixdevelopedacustomgenotypecallingpipelinethatisoptimizedforbiobank-scalegenotypingexperiments,whichtakesadvantageofthemultiple-batchdesign[22].Thispipelinewasappliedtoallsamples,includingthe~150,000samplesthatwerepartoftheinterimdatarelease.Consequently,someofthegenotypecallsforthesesamplesmaydifferbetweentheinterimdatareleaseandthisfinaldatarelease(seeSection2.1.6).Routinequalitycheckswerecarriedoutduringtheprocessofsampleretrieval,DNAextraction[19],andgenotypecalling[23].Anysamplethatdidnotpassthesecheckswasexcludedfromtheresultinggenotypecalls.Thecustom-designedarrayscontainanumberofmarkersthathadnotbeenpreviouslytypedusingAffymetrixgenotypearraytechnology.Assuch,Affymetrixalsoappliedaseriesofcheckstodeterminewhetherthegenotypingassayforagivenmarkerwassuccessful,eitherwithinasinglebatch,oracrossallsamples.Wherethesenewly-attemptedassayswerenotsuccessful,Affymetrixexcludedthemarkersfromthedatadelivery(seeSupplementaryMaterialfordetails).Thisresultedinasetofgenotypecallsfor489,212samplesat812,428uniquemarkers(bi-allelicSNPsandIndels)frombotharrays,withwhichweconductedfurtherqualitycontrolandanalysis(Table1).
UKBiLEVEAxiomarrayonly
UKBiobankAxiomarrayonly
Botharrays Total
Includedinexperiment
NumberofsamplessenttoAffymetrix(includingduplicates)
50561 443568 0 494078
IncludedindatadeliveryfromAffymetrix
Numberofmarkers 18019 34313 760096 812428
Numberofsamples(includingduplicates) 50520 438692 0 489212
Includedinreleaseddata
Numberofmarkers 17536 34197 753693 805426
Numberofuniquesamples 49950 438427 0 488377
Table1|ThenumberofmarkersandsamplesbygenotypingarrayatmainstagesoftheUKBiobankgenotypingexperiment.“DatadeliveryfromAffymetrix”referstothedataproducedbyAffymetrixafterapplyingtheirfiltering(seeSupplementaryMaterial).“Releaseddata”referstothereleasedgenotypedata,afterapplyingQCasdescribedinSections2.1.4and2.1.5.
2.1.3 Qualitycontrolinalarge-scale,ethnicallydiversecohort
OurQCpipelinewasdesignedspecificallytoaccommodatethelarge-scaledatasetofethnicallydiverseparticipants,genotypedinmanybatches(106),usingtwoslightly
8
differentnovelarrays,andwhichwillbeusedbymanyresearcherstotackleawidevarietyofresearchquestions.Participantsreportedtheirethnicbackgroundbyselectingfromafixedsetofcategories[24].Whilethemajority(94%)ofindividualsreporttheirethnicbackgroundaswithinthebroad-levelgroup“White”,therearestill~22,000individualswithaself-reportedethnicbackgroundoriginatingoutsideEurope(Table2).Thisethnicdiversityimpliesgeneticdiversity,whichweobservedirectlyinthegenotypesasallelefrequencydifferences,andhasimplicationsforQC.SomecommonlyusedQCtestsareineffectiveinthecontextofstrongpopulationstructureifappliedwithouttakingthisintoaccount.Forexample,testingfordeparturesfromHardy-Weinbergequilibrium(HWE)isacommonapproachforidentifyingmarkersthathavebeengenotypedpoorly[25-27],butdeparturesfromHWEcanbeexpectedinthecontextofpopulationstructurebecauseofdifferencesinallelefrequenciesacrosspopulations.Weusedapproachesbasedonprincipalcomponentanalysis(PCA)toaccountforpopulationstructureinbothmarkerandsample-basedQC.
Ethnicgroup Self-reportedethnicbackground
PercentageofgenotypedUKBiobankparticipants
White 94.23 British 88.26 Anyotherwhitebackground 3.24 Irish 2.61 White 0.11 AsianorAsianBritish 1.94 Indian 1.17 Pakistani 0.36 AnyotherAsianbackground 0.36 Bangladeshi 0.05 AsianorAsianBritish 0.01 BlackorBlackBritish 1.57 Caribbean 0.88 African 0.66 AnyotherBlackbackground 0.02 BlackorBlackBritish 0.01 Chinese 0.31 Chinese 0.31 Mixed 0.58 Anyothermixedbackground 0.2 WhiteandAsian 0.16 WhiteandBlackCaribbean 0.12 WhiteandBlackAfrican 0.08 Mixed 0.01 Other/Unknown 1.38 Otherethnicgroup 0.89 Notstated 0.48
Table2|Proportionsofself-reportedethnicgroupsamong488,377genotypedUKBiobankparticipants.Categoriesofself-reportedethnicbackground(UKBiobankdatafield21000)andbroader-levelethnicgroupsareshownheretoreflectthetwo-layerbranchingstructureoftheethnicbackgroundsectionintheUKBiobanktouchscreenquestionnaire[24].Participantsfirstpickedoneofthebroader-levelethnicgroups(e.gWhite),andwerethenpromptedtoselectoneofthecategories
9
withinthatgroup(e.gIrish).Thebroader-levelgroupsarealsoshownhereasanethnicbackgroundcategory(e.g“White”incolumntwo)becauseasmallproportionofparticipantsonlyrespondedtothefirstquestion.Inthistablewealsocombinethecategory“Otherethnicgroup”withanaggregatednon-responsecategory“Notstated”,whichincludesallparticipantswhodidnotknowtheirethnicgroup,orstatedthattheypreferrednotanswer,ordidnotanswerthefirstquestion.
2.1.4 Marker-basedQC
Weidentifiedpoorqualitymarkersusingstatisticaltestsdesignedprimarilytocheckforconsistencyacrossexperimentalfactors.Specificallywetestedforbatcheffects,plateeffects,departuresfromHWE,sexeffects,arrayeffects,anddiscordanceacrosscontrolreplicates.SeeSupplementaryMaterialforthedetailsofeachtest,andFigureS3forexamplesofaffectedmarkers.Formarkersthatfailedatleastonetestinagivenbatch,wesetthegenotypecallsinthatbatchtomissing.Wealsoprovideaflaginthedatareleasethatindicatesifthecallsforamarkerhavebeensettomissinginagivenbatch.Iftherewasevidencethatamarkerwasnotreliableacrossallbatches,weexcludedthemarkerfromthedataaltogether.Inordertoattenuatepopulationstructureeffectsweappliedallmarker-basedQCtestsusingasubsetof463,844individualswithestimatedEuropeanancestry.WeidentifiedtheseindividualsfromthegenotypedatapriortoconductinganyQCbyprojectingalltheUKBiobanksamplesontothetwomajorprincipalcomponentsoffour1000Genomespopulations(CEU,YRI,CHBandJPT)[28].Wethenselectedsampleswithprincipalcomponent(PC)scoresfallingintheneighbourhoodoftheCEUcluster(seeSupplementaryMaterial).MostQCmetricsrequireathresholdbeyondwhichtoconsideramarker‘notreliable’.WeusedthresholdssuchthatonlystronglydeviatingmarkerswouldfailQCtests(seeSupplementaryMaterial),thereforeallowingresearcherstofurtherrefinetheQCinwhicheverwayismostappropriatefortheirstudyrequirements.Table3summarisestheamountofdataaffectedbyapplyingthesetests.
10
Test AveragenumberofSNPsfailedperbatch(sd)
Fractionofallgenotypecallsaffected
AffymetrixclusterQC 1109(699) 0.00140
1.Batcheffect 197(86) 0.000249
2.Plateeffect 284(266) 0.000358
3.DeparturefromHardy-Weinbergequilibrium 572(77) 0.000723
4.Sexeffect 45(5) 0.0000569
5.Arrayeffect* 5417 0.00683
6.Discordanceacrosscontrols** 622and632 0.000796
Total 7704(721) 0.00971
Table3|Failureratesforsixmarker-basedqualitytests.Forallnumberedtestsamarker(ormarkerwithinabatch)wassettomissingifthetestyieldedap-value<10-12,exceptinthecaseofTest6,forwhichamarkerwassettomissingifthetestyielded<95%concordance.SeeSupplementaryMaterialfordetailsofeachtest.Thetotalisnotequaltothesumofalltestsbecauseitispossibleforamarkertofailmorethanonetest.Sincethetwoarrayscontainslightlydifferentsetsofmarkers,thetotalnumberofgenotypecallsusedtocomputethefractionsis,NukbbLukbb+NukblLukbl,whereNandLrefertothenumbersofmarkersandsamplestypedontheUKBiobankAxiomarray(ukbb)andsamplestypedontheUKBiLEVEAxiomarray(ukbl)withintheAffymetrixdatadelivery(seeTableS1).*Thearrayeffecttestwasappliedacrossallbatchesandonlyformarkerspresentonbotharrays,sowesimplyreportthetotalnumberofmarkersthatfailedthistest.**Thediscordancetestwasappliedacrossallbatches,butnotallmarkersarepresentonbotharrays.ThefirstvalueisthenumberofuniquemarkersontheUKBiLEVEAxiomarraythatfailedthistest,andthesecondisformarkersontheUKBiobankAxiomarray.
2.1.5 Sample-basedQC
Weidentifiedpoorqualitysamplesusingthemetricsofmissingrateandheterozygositycomputedusingasetof605,876highqualityautosomalmarkersthatweretypedonbotharrays(seeSupplementaryMaterialforcriteria).Extremevaluesinoneorbothofthesemetricscanbeindicatorsofpoorsamplequalitydueto,forexample,DNAcontamination[26].Theheterozygosityofasample–thefractionofnon-missingmarkersthatarecalledheterozygous–canalsobesensitivetonaturalphenomena,includingpopulationstructure,recentadmixtureandparentalconsanguinity.Wetookextrameasurestoavoidmisclassifyinggoodqualitysamplesbecauseoftheseeffects.Forexample,weadjustedheterozygosityforpopulationstructurebyfittingalinearregressionmodelwiththefirstsixprincipalcomponents(PCs)inaPCAaspredictors(Figure2A).Usingthisadjustmentweidentified968sampleswithunusuallyhighheterozygosityor>5%missingrate(SupplementaryMaterial).Alistofthesesamplesisprovidedaspartofthedatarelease.Wealsoconductedqualitycontrolspecifictothesexchromosomesusingasetof15,766highqualitymarkersontheXandYchromosomes.Affymetrixinfersthesex
11
ofeachindividualbasedontherelativeintensityofmarkersontheYandXchromosomes[29].Sexisalsoreportedbyparticipants,andmismatchesbetweenthesesourcescanbeusedasawaytodetectsamplemishandlingorotherkindsofclericalerror.However,inadatasetofthissize,somesuchmismatcheswouldbeexpectedduetotransgenderindividuals,orinstancesofreal(butrare)geneticvariation,suchassex-chromosomeaneuploidies[30].AffymetrixgenotypecallingontheXandYchromosomesallowsonlyhaploidordiploidgenotypecalls,dependingontheinferredsex[29].Therefore,casesoffullormosaicsexchromosomeaneuploidiesmayresultincompromisedgenotypecallsonall,orpartsof,thesexchromosomes(butnotaffecttheautosomes).Forexample,individualswithkaryotypeXXYwilllikelyhavepoorerqualitygenotypecallsonthepseudo-autosomalregion(PAR)oftheXchromosome,astheyareeffectivelytriploidinthisregion.UsinginformationinthemeasuredintensitiesofchromosomesXandY,weidentifiedasetof652(0.134%)individualswithsexchromosomekaryotypesputativelydifferentfromXYorXX(Figure2B,SupplementaryTableS1).Thelistofsamplesisprovidedaspartofthedatarelease.Researcherswantingtoidentifysexmismatchesshouldcomparetheself-reportedsexandinferredsexdatafields.Wedidnotremovesamplesfromthedataasaresultofanyoftheaboveanalyses,butratherprovidetheinformationaspartofthedatarelease.However,weexcludedasmallnumberofsamples(835intotal)thatweidentifiedassampleduplicates(asopposedtoidenticaltwins,seeSupplementaryMaterial)orwerelikelyinvolvedinsamplemishandlinginthelaboratory(~10),aswellasparticipantswhoaskedtobewithdrawnfromtheprojectpriortothedatarelease(33).
12
Figure2(captiononnextpage)
13
Figure2|Summaryofsample-basedQC.(A)Thethreeplotsshowheterozygosityandmissingrates,whichweusedtoflagpoorqualitysamples.Thetwotopplotsshowheterozygosityforeachsamplebefore(left)andafter(right)correctingforancestralbackgroundusingPCs.Thesymbols(shapesandcolours)indicatetheself-reportedethnicbackgroundofeachparticipant,asannotatedinthelegend.Thethirdplotshowsthesetofsamplesweflaggedasoutliers(inred),andallothersamples(inblack),withshapesthesameasfortheothertwoplots.Theverticallineshowsthethresholdweusedtocallsamplesasoutliersonmissingrate.Inallplotsmissingratedataistransformedtothelogitscale,butwiththeaxisannotatedwiththeoriginalvalues.(B)MeanLog2ratios(L2R)onXandYchromosomesforeachsample,indicatinglikelysexchromosomeaneuploidy(seeSupplementaryMaterial).SampleswhicharemostlikelyXXorXYaredepictedwithacircle.Othersamples,whichrepresentpossibleinstancesofsexchromosomeaneuploidy,areindicatedbyacross.Thereare652suchsamplesintotal;seeSupplementaryTableS1forcounts.Thecoloursofeachsymbolrelatetodifferentcombinationsofself-reportedsex,andsexinferredbyAffymetrix(fromthegeneticdata),asindicatedbythekey.Foralmostallsamples(99.9%)theself-reportedandinferredsexarethesame,butforasmallnumberofsamples(378)theydonotmatch(seeSupplementaryMaterialfordiscussion).
2.1.6 Summaryofgenotypedataquality
TheapplicationofourQCpipelineresultedinthereleaseddatasetof488,377samplesand805,426markersfrombotharrayswiththepropertiesshowninFigure3.TheproportionofallthegenotypecallsmadebyAffymetrixthatweresettomissingasaresultofqualitycontrolis0.0097(seeTable3).Theproportionofgenotypedsamplesthatwereidentifiedaspoorqualityis0.002(968/488377).Furthermore,asetof588pairsofexperimentalduplicatesshowveryhighconcordance.Onaverage,99.87%ofapair’sgenotypecallsareidenticalandthelowestrateisstill99.39%(SupplementaryFigureS13).Subsequenttotheinterimreleaseofgenotypesfor~150,000UKBiobankparticipantsimprovementsweremadetothegenotypecallingalgorithm[22]andqualitycontrolprocedures.Wethereforeexpecttoobservesomechangesinthegenotypecallsandmissingdataprofileofsamplesincludedinboththeinterimdatareleaseandthisfinaldatarelease.Discordanceamongnon-missingmarkersisverylow(mean6.7x10-5;SupplementaryFigureS1);andforeachsamplethereare~24,500genotypecalls(onaverage)thatweremissingintheinterimdata,butwhichhavenon-missingcallsinthisrelease.Thisismuchsmallerinthereversedirection,with~500calls,onaverage,missinginthisreleasebutnotmissingintheinterimdata,sothereisanaveragenetgainof~24,000genotypecallspersample.WecomparedallelefrequenciesintheUKBiobankwiththoseestimatedfromsequencingdatasourcedfromtheExomeAggregationConsortium(ExAC)database[31].Wecomputedallelefrequenciesatasetof91,298overlappingmarkers,usingsampleswithEuropeanancestry.Wedonotexpectallelefrequenciesinthetwostudiestomatchexactlyduetosubtledifferencesintheancestralbackgroundsoftheindividualsineachstudy,aswellasdifferencesinthesensitivityandspecificityof
14
thetwotechnologies(exomesequencingandgenotypingarrays).Despitethis,overalltheallelefrequenciesareencouraginglysimilar(r2=0.94)(Figure3E).AdiscussionofthediscrepanciesbetweenUKBiobankandExACisgivenintheSupplementaryMaterial.Over110,000raremarkers(MAF<0.01)wereincludedonthetwoarraysusedfortheUKBiobankcohort[7].Variantsoccurringatverylowfrequenciespresentaparticularchallengeforgenotypecallingusingarraytechnology,especiallyincaseswhereonlyasmallnumberofsampleswithinagenotypingbatchareexpectedtohaveacopyoftheminorallele(almostalwayswithaheterozygousgenotype)2.Inthesecasesitissometimesdifficultinthegenotypeintensitydatatodistinguishasamplethatgenuinelyhastheminorallele,fromonewhoseintensitiesareinthetailsofthedistributionofthoseinthemajorhomozygotecluster.ExamplesoftwodifferentmarkerswithMAF<0.001inUKBiobankareshowninFigure4BandFigure4C.Onemarkerisperformingwell(4B),butfortheothermarker(4C)theheterozygoussamplesaremoredifficulttoidentify.Incontrast,Figure4AshowsacommonSNP(MAF=0.077),whichhasthreewell-separatedclusterscorrespondingtothethreegenotypes.Alargerfractionofraremarkersfailqualitycontroltestscomparedtolowfrequencyandcommonmarkers,but84%stillpassinallbatches(Figure3B).TheMAFofraremarkersaregenerallysimilartothoseinExAC,butveryraremarkerstendtohavealowerMAFintheUKBiobank(Figure3F).WestronglyrecommendresearchersvisuallyinspectsimilarplotsformarkersofinterestusingautilitysuchasEvoker[32](seeURLs),especiallyforraremarkers.InadditiontotheQCdescribedabove,acloseexaminationofresultsofatestgenome-wideassociationstudy(GWAS)(discussedinSection2.5)providedfurtherevidenceforthequalityoftheresultingsetofgenotypecalls.
2ThegenotypingbatchesintheUKBiobankprojectcontain~4,700samples,soaMAFof,forexample,0.001correspondstoanexpectedcount(underHWE)ofabout9heterozygotesperbatch.
15
0.0000
0.0025
0.0050
0.0075
0.0100
0.0000 0.0025 0.0050 0.0075 0.0100
MAF in UK Biobank
Freq
uenc
y in
ExA
C
10000
1000
100
10
1
Count
Comparison with ExAC at 60474 overlapping rare markers
0.00
0.25
0.50
0.75
1.00
0.0 0.1 0.2 0.3 0.4 0.5
MAF in UK Biobank
Freq
uenc
y in
ExA
C
100001000100101
Count
Comparison with ExAC at 91298 overlapping markers
0
[1−1
0)
[10−
100)
[100
−100
0)
[100
0−10
000)
0
10
20
30
40
50
60
Num
ber o
f mar
kers
(100
0s)
Minor allele count (MAC)
Minor allele frequencies of 805426 markers in UK Biobank data
Minor allele frequency (MAF) in UK Biobank
Num
ber o
f mar
kers
(100
0s)
0.0 0.1 0.2 0.3 0.4 0.5
0
20
40
60
80
100
120
140
0 1 2−3 4−7 8−15 16−31 32−63 64−105
Marker−based QC failure rates for different MAF ranges
Prop
ortio
n of
mar
kers
in M
AF ra
nge
0.0
0.2
0.4
0.6
0.8
1.0
Number of batches in which a marker failed QC
MAF rangesVery rare: [0−0.001)Rare: [0.001−0.01)Low frequency: [0.01−0.05)Common: [0.05−0.5)
Missing rates for samples
Missing rate
Num
ber o
f sam
ples
(100
0s)
0.02 0.04 0.06 0.08 0.10 0.12
0
20
40
60
80
Samples typed on UKBiobank Axiom array (438427)Samples typed on UKBiLEVE Axiom array (49950)
Missing rates for markers
Missing rate
Num
ber o
f mar
kers
(100
0s)
10−4 10−3 10−2 10−1 1
0
10
20
30
40
50Markers on both arrays (753693)Markers on UK Biobank Axiom array only (34197)Markers on UKBiLEVE Axiom array only (17536)
A B
E
DC
F
Figure3(captiononnextpage)
16
Figure3|Summaryofgenotypedataqualityandcontent.AllplotsshowpropertiesoftheUKBiobankgenotypedatathatismadeavailabletoresearchers,afterapplyingQCasdescribedinthemaintext.(A)Minorallelefrequency(MAF)distribution.AllsampleswereusedinthecalculationofMAF.Theinsetshowsthenumberofmarkerswithindifferentminorallelecountbinsforraremarkersonly(MAF<0.01).(B)Thedistributionofthenumberofbatch-levelQCteststhatamarkerfails(seeTable2;Methods).ForeachoffourdifferentMAFranges(indicatedbycolours)weshowthefractionofmarkersthatfailthespecifiednumberofbatches.Forexample,justover92%ofthe‘lowfrequency’markers(0.01≤MAF<0.05)donotfailqualitycontrolinanybatches.Anymarkerthatfailedall106batchesisexcludedfromthedatarelease,sosuchmarkersarenotincludedhere(seeTable3).(C)Thedistributionofmissingratesformarkers.Threehistogramsareoverlaid,eachshowingadifferent,mutuallyexclusive,subsetofmarkers,whichareindicatedbythethreecolours.Markersfromonlyonearrayexhibitmoremissingdatabecauseonlyasubsetofsamplesweretypedoneacharray(10%onUKBiLEVEAxiom™Arrayand90%ontheUKBiobankAxiom™Array).(D)Thedistributionofmissingratesforsamples.Twohistogramsareoverlaid,eachshowingamutuallyexclusivesubsetofsamples.Allsampleshavesomemissingdatabecausenotallmarkerswereincludedonbotharrays(~2%areexclusivetotheUKBiLEVEAxiom™Arrayand~4%exclusivetotheUKBiobankAxiom™Array).Additionalmissingdataisalsointroducedfromthebatch-basedmarkerQC.(E)ComparisonofMAFinUKBiobankwiththefrequencyofthesamealleleinExAC,amongtheEuropean-ancestrysampleswithineachstudy.Weused33,370samplesforExACand463,844samplesforUKBiobank(seeSupplementaryMaterialfordetails),andonlymarkersthathaveacallrategreaterthan0.9inbothstudies.Eachhexagonalbiniscolouredaccordingtothenumberofmarkersfallinginthatbin,asindicatedbythekey(notethelog10scale).Thedashedredlineshowsx=y.Thesetofmarkerswithverydifferentallelefrequenciesseenonthetop,bottom,andleft-handsidesoftheplotcomprise~300markers.Thisis~0.3%ofallmarkersinthecomparison,or~0.5%ofallmarkerswithMAF>0.001inatleastonestudy(seeSupplementaryMaterialforadiscussion).(F)Aswith(E),butzoominginontheraremarkers(bothaxes<0.01).
17
Figure4|Examplesofintensitydataandgenotypecallsformarkersofdifferentallelefrequencies.Eachofthethreesub-figuresshowsintensitydataforasinglemarkerwithinsixdifferentbatches.Batcheslabelledwiththeprefix“UKBiLEVEAX”containonlysamplestypedusingtheUKBiLEVEAxiomarray,andthosewiththeprefix“Batch”containonlysamplestypedusingtheUKBiobankAxiomarray.Eachpointrepresentsonesampleandiscolouredaccordingtotheinferredgenotypeatthemarker.Thexandyaxesaretransformationsoftheintensitiesforprobesetstargetingeachofthealleles“A”and“B”(seeSupplementaryMaterialfordefinitionofprobeset).Theellipsesindicatethelocationandshapeoftheposteriorprobabilitydistribution(2-dimensionalmultivariateNormal)ofthetransformedintensitiesforthethreegenotypesinthestatedbatch.Thatis,eachellipseisdrawnsuchthatitcontains85%oftheprobabilitydensity.See[29]formoredetailsofAffymetrixgenotypecalling.TheminorallelefrequencyofeachofthemarkersiscomputedusingallsamplesinthereleasedUKBiobankgenotypedata.(A)AmarkerwithaMAFof0.077withwell-separatedgenotype
Genotype callsnumber of copies of reference allele
210no call
A
B
C
MAF:0.00092
MAF:0.00066
MAF:0.077
18
clusters.(B)IntensitiesforamarkerwithaMAFof0.00092withwell-separatedgenotypeclusters.AswouldbeexpectedunderHWE,therearenoinstancesofsampleswiththeminorhomozygotegenotype.(C)IntensitiesforamarkerwithaMAFof0.00066,andwheretheheterozygoteclusterisnotwellseparatedfromthelargemajorhomozygoteclusterinsomebatches,makingitmoredifficulttoconfidentlycalltheheterozygousgenotypes.
2.2 Ancestraldiversityandcrypticrelatedness
2.2.1 Geneticpopulationstructure
ThediversityinancestraloriginsofUKBiobankparticipantsisevidentfromtheself-reportedethnicbackground(Table2)andcountryofbirthinformation.Thegenotypedataprovidesauniqueopportunitytostudytheirancestraloriginsinaquantitativemanner.AccountingfortheancestralbackgroundofparticipantsisanessentialcomponentofanalysisoftheUKBiobankresource,bothforepidemiologicalstudies[33],andgeneticanalyses,suchasGWAS[27,34].Weusedprincipalcomponentsanalysis(PCA)tomeasurepopulationstructurewithintheUKBiobankcohort.PCAiswidelyusedasamethodforassessingandpotentiallycontrollingforpopulationstructureinGWAS[27,35].Wecomputedprincipalcomponents(PCs)usinganalgorithm(fastPCA[36])whichperformswellondatasetswithhundredsofthousandsofsamplesbyapproximatingonlythetopnPCsthatexplainthemostvariation,wherenisspecifiedinadvance.Wecomputedthetop40PCsusingasetof407,219unrelated,highqualitysamplesand147,604highqualitymarkersprunedtominimiselinkagedisequilibrium(LD)[37].WethencomputedthecorrespondingPC-loadingsandprojectedallsamplesontothePCs,thusformingasetofPCscoresforallsamplesinthecohort(SupplementaryMaterial).Figure5showsresultsforthefirst6PCsplottedinconsecutivepairs.ResultsforfurtherPCsareshowninSupplementaryFiguresS5andS6.Asexpected,individualswithsimilarPCscoreshavesimilarself-reportedethnicbackgrounds.Forexample,thefirsttwoPCsseparateoutindividualswithsub-SaharanAfricanancestry,European,andeast-Asianancestry.Individualswhoself-reportasmixedethnicitytendtofallonacontinuumbetweentheirconstituentgroups(e.gtheWhiteandAsiancategory).FurtherPCscapturepopulationstructureatsub-continentalgeographicscales.SupplementaryFigureS7showstherelationshipbetweeneachofthe40PCs,andcountryofbirthasreportedbyparticipants.Forexample,highscoresinPC6areassociatedwithindividualsborninCentralandSouthAmericancountriessuchasPeru,Colombia,andChile.
19
Figure5|AncestraldiversityintheUKBiobankcohort.PlotsofconsecutivepairsofthefirstsixprincipalcomponentsinaPCAofgenotypedataforUKBiobankparticipants(seeSupplementaryMaterial).Eachpointrepresentsanindividualandisplacedaccordingtotheirprincipalcomponentscores(usinggeneticdataonly),withshapesandcoloursindicatingtheirself-reportedethnicbackgroundasshowninthelegend.SeeTable2fortheproportionsofparticipantsineachcategory.Plotsofallpairsofthese6PCsareshowninSupplementaryFigureS5,andresultsoffurtherPCsarerepresentedinSupplementaryFiguresS6andS7.
2.2.2 WhiteBritishancestrysubset
Researchersmaywanttoonlyanalyseasetofindividualswithrelativelyhomogeneousancestrytoreducetheriskofconfoundingduetodifferencesinancestralbackground.AlthoughtheUKBiobankcohortisethnicallydiverse,suchanalysisisfeasiblewithoutcompromisingtoomuchinsamplesizebecausea
20
majorityofparticipantsintheUKBiobankcohortreporttheirethnicbackgroundas“British”,withinthebroader-levelgroup“White”(88.26%).OurPCArevealedpopulationstructureevenwithinthiscategory(SupplementaryFigureS8),soweusedacombinationofself-reportedethnicbackgroundandgeneticinformationtoidentifyasubsetof409,728individuals(84%)whoself-reportas“British”andwhohaveverysimilarancestralbackgroundsbasedonresultsofthePCA(seeSupplementaryMaterial).Fine-scalepopulationstructureisknowntoexistwithintheUK[38]butmethodsfordetectingsuchsubtlestructure[39]availableatthetimeofanalysisarenotfeasibletoapplyatthescaleoftheUKBiobank.ThewhiteBritishancestrysubsetmaythereforestillcontainsubtlestructurepresentatsub-nationalscales.
2.2.3 Crypticrelatedness
Closerelationships(e.gsiblings)amongUKBiobankparticipantswerenotrecordedduringthecollectionofotherphenotypicinformation.Indeed,manyparticipantsmaynotbeawarethatacloserelative(suchasanaunt,orsibling)isalsopartofthecohort.Thisinformationcanbeimportantforepidemiologicalanalyses[40],aswellasinGWAS[41],andthegeneticdataprovidesanopportunitytodiscoverandcharacterisefamilialrelatednesswithinthecohort.Thisanalysis,combinedwithphenotypeinformation,isalsousefulforidentifyingsamplesthatareexperimentalduplicatesratherthangenuinetwins(seeSupplementaryMaterial).Weidentifiedrelatedindividualsbyestimatingkinshipcoefficientsforallpairsofsamples,andreportcoefficientsforpairsofrelativeswhoweinfertobe3rddegreeorcloser.Weusedanestimatorimplementedinthesoftware,KING[42],asitisrobusttopopulationstructure(i.edoesnotrelyonaccurateestimatesofpopulationallelefrequencies)anditisimplementedinanalgorithmefficientenoughtoconsiderallpairs(~1.2x1011)inapracticableamountoftime.AsnotedbytheauthorsofKING[42],wefoundthatrecentadmixture(e.g“Mixed”ancestralbackgrounds)tendedtoinflatetheestimateofthekinshipcoefficient,astheestimatorassumesHWEamongmarkerswiththesameunderlyingallelefrequencieswithinanindividual.Wealleviatedthiseffectbyonlyusingasubsetofmarkersthatareonlyweaklyinformativeofancestralbackground(seeSupplementaryMaterial,FigureS12).Wealsoexcludedasmallfractionofindividuals(977)fromthekinshipestimation,astheyhadproperties(e.ghighmissingrates)thatwouldleadtounreliablekinshipestimates(seeSupplementaryMaterial).Wecalledrelationshipclassesforeachrelatedpairusingthekinshipcoefficientandfractionofmarkersforwhichtheysharenoalleles(IBS0),andusingtheboundariesrecommendbytheauthorsofKING.
21
Monozygotictwins
Parent-offspring Fullsiblings 2nddegree 3rddegree Total
Numberofpairs 179 6,276 22,666 11,113 66,928 107,162
Table4|Summaryofrelatedpairs(3rddegreeorcloser)forthefullUKBiobankcohort.CountsarederivedfromthekinshipcoefficientsasrecommendedbytheauthorsofKING[42].Notethatparent-offspringandfullsiblingpairshavethesameexpectedkinshipcoefficient(0.25)butcanbeeasilydistinguishedbytheirIBS0fraction.Thecountofmonozygotictwinsisafterexcludingsamplesidentifiedasduplicates(seeSupplementaryMaterial).Atotalof147,731UKBiobankparticipants(30.3%)areinferredtoberelated(3rddegreeorcloser)toatleastoneotherpersoninthecohort,andformatotalof107,162relatedpairs(Table4).Thisisasurprisinglylargenumber,anditisnotdrivensolelybyanexcessof3rddegreerelatives.Forexample,thenumberofsiblingpairs(22,666)isabouttwiceasmanyaswouldtheoreticallybeexpectedinarandomsample(ofthissize)oftheeligibleUKpopulation,aftertakingintoaccounttypicalfamilysizes(seeSupplementaryMaterial).Toensurewewerenotoverestimatingthenumberofrelatedpairs,weinferredrelatedpairs(withinasubsetofthedata)usingadifferentinferencemethodimplementedinPLINK(“--genome”command)[43]andconfirmed100%ofthetwins,parent-offspringandsiblingpairs,and99.9%ofpairsoverall(seeSupplementaryMaterial).Thelargerthanexpectednumberofrelatedpairscouldbeexplainedbysamplingbiasdueto,forexample,anindividualbeingmorelikelytoagreetoparticipatebecauseafamilymemberwasalsoinvolved.Furthermoreif,asseemsplausible,relatedindividualsclustergeographically,ratherthanbeingrandomlylocatedacrosstheUnitedKingdom(UK),theassessment-centrebasedrecruitmentstrategyemployedbyUKBiobank[1]willnaturallytendtooversamplerelatedindividuals,relativetorandomsamplingoftheUKpopulation.
2.2.4 Triosandextendedfamilies
PairsofrelatedindividualswithintheUKBiobankcohortformnetworksofrelatedindividuals,or‘family’groups.Inmostcasestheseareofsizetwo,buttherearemanygroupsofsize3orlargerinthecohort,evenwhenrestrictingto2nddegreeorcloserrelativepairs.Byconsideringtherelationshiptypesandtheageandsexofindividualswithineachfamilygroup,weidentified1,066setsoftrios(twoparentsandanoffspring),whichcomprises1,029uniquesetsofparentsand37quartets(twoparentsandtwochildren).Therearenoinstancesof3-generationnuclearfamilies(grandparent-parent-offspring),whichisnotsurprising,giventhattheage-rangeofthecohortspansonly30years.Thereare172familygroupswith5ormoreindividualsthatarerelatedto2nddegreeorcloser,andFigure6illustratesexamplesoflargenuclearfamiliesinthecohort.Oneoftheseisagroupofelevenindividualswhereeverypairisa2nddegreerelatives.Thesimplestexplanationforsuchafamilygroupisthatallofthe
22
individualsarehalfsiblings,withonesharedparentwhoisnotinthecohort.Alternatively,oneindividualinthegroupcouldbeasiblingoraparentofthesharedparent.Thepatternofhaplotypesharingbetweentheindividualswoulddistinguishthesecasesbutwehavenotundertakenthisanalysis.Wedid,however,confirmthatthesharedparentmustbetheirfather,becausetheindividualsdonotallcarrythesameMitochondrialalleles,andallthemalesinthegrouphavethesameallelesontheirYchromosome(datanotshown).
Figure6|FamilialrelatednessintheUKBiobankcohort.(A)ExamplesoffamilygroupswithintheUKBiobankcohort.Pointsindicateparticipants,andlinesbetweenpointsindicatefamilialrelatedness(3rddegreeandcloser)asinferredfromthegeneticdata(seeMethods).Thecolourandthicknessofthelinesindicatedifferentrelativeclasses,asshowninthekey.Anintegernexttoanetworkindicatesthetotalnumberoffamilynetworksinthecohortwiththesameconfiguration,ignoring3rddegreepairs.Nointegermeansthereisonlytheoneshown.Forexample,thereare10networksthatcompriseexactly5fullsiblings(twoexamples,whichdifferwithrespecttoa3rddegreerelative,areshownonthisplot);andthereisonlyonenetworkthatcomprises6fullsiblings(plusone3rddegreerelativewhoisrelatedtoallsiblings).(B)Distributionoftheexactnumberofrelativesthat
Number of relatives per participant
Num
ber o
f par
ticip
ants
1
10
102
103
104
105
1 2 3 4 5 6 7 8 9 10 11−20
Number of relatives per participant
A
B
Example family groups
2
10
87
2
10
726
5
27
Monozygotic twinsParent offspringFull siblings2nd degree3rd degree
Relationship classes
21+
23
participantshaveintheUKBiobankcohort.Theheightofeachbarshowsthecountofparticipantswhohaveexactlythestatednumberofrelatives.Notethelogarithmicscale.Thecoloursindicatetheproportionsofeachrelatednessclassforindividualscountedwithinabar.Forexample,foreachindividualthathasthreerelativesinthecohort(3rdbar),wecounthowmanyoftheirrelativesareineachrelatednessclass(i.eafullsibling,parent-offspringetc.).Wethensumthesecountsoverallindividualswiththreerelativesandcolourthebaraccordingtotheproportionsofeachrelatednessclass.Inthisgroup~20%oftheirrelativesarefullsiblingsand~64%are3rddegreerelatives.Therearealso18participantswithexactly10relatives.Theunusuallylargefractionof2nddegreerelationshipsforthisgroupisaresultofthesetofelevenindividualswhoareall2nddegreerelativesofeachother,asshowninthecentreof(A).
2.3 PhasingandImputationofSNPs,shortindelsandCNVsGenotypeimputation[44]istheprocessofpredictinggenotypesthatarenotdirectlyassayedinasampleofindividuals.AreferencepanelofhaplotypesatadensesetofSNPs,indelsandstructuralvariants(SVs),isusedtoimputegenotypesintoastudysampleofindividualsthathavebeengenotypedatasubsetofthemarkers.These‘insilico’genotypescanthenbeusedtoboostthenumberofmarkersthatcanbetestedforassociation.Thisincreasesthepowerofthestudy,theabilitytoresolveorfine-mapthecausalvariantsandfacilitatesmeta-analysis.Theprocessofimputationfirstinvolespre-phasingthedirectlygenotypedmarkers,followedbyahaploidimputationstep[45].
2.3.1 Pre-Phasing
Forthepre-phasingstepweonlyusedmarkerspresentonboththeUKBiLEVEandUKBiobankAxiomarrays.Weremovedmarkerswhich(a)failedQCin>1batch,(b)hadmorethan5%missingness(c)hadaminorallelefrequency<0.0001.Weremovedsamplesthatwereidentifiedasoutliersforheterozygosityandmissingness(Section2.1.5).Thesefiltersresultedinadatasetwith670,739autosomalmarkersin487,442samples.PhasingontheautosomeswascarriedoutusingSHAPEIT3[46]inchunksof15,000markers,withanoverlapof250markersbetweenchunks.Eachchunkused4coresperjobandS=200copyingstates.The1000GenomesPhase3dataset[47]wasusedasareferencepanel,predominantlytohelpwiththephasingofsampleswithnon-Europeanancestry.Chunkswereligatedusingamodifiedversionofthehapfuseprogram(seeURLs). Weassessedtheaccuracyofthephasinginaseparateexperimentbytakingadvantageofmother-father-childtriosthatwereidentifiedintheUKBiobankcohort(seeSection2.2.4).Thisfamilyinformationcanbeusedtoinferthephaseofalargenumberofmarkersinthetrioparents.Thesefamily-inferredhaplotypeswereusedasatruthset,asiscommoninthephasingliterature.Theparentsofeachtriowereremovedfromthedatasetandthenhaplotypeswereestimatedacrosschromosome20inasinglerunofSHAPEIT3.Thisdatasetconsistedof16,175autosomalmarkers.
24
Theinferredhaplotypeswerethencomparedtothetruthsetusingtheswitcherrormetric.Usingasetof696trioswithself-reportedethnicbackground“British”(withinthebroader-levelgroup“White”(seeTable2))andnoothertwinsor1stor2nddegreerelativesintheUKBiobankdataset,weestimatedamedianswitcherrorrateof0.229%.Wealsousedasubsetof397ofthesetriosthatalsohadno3rddegreerelativesandobtainedamedianswitcherrorrateof0.234%.
2.3.2 Referencepanelusedforimputation
Thereareanumberoffactorsthatinfluencetheaccuracyofgenotypeimputation,butgenerallyaccuracywillincreaseasthenumberofhaplotypesinthereferencepanelgrowsandiftheancestryofthesamplehaplotypesisagoodmatchtotheancestryofthereferencepanelhaplotypes.TheUKBiobankdatasetconsistsofsampleswithadiverserangeofancestries,butwiththemajorityofsampleshavingBritish(orEuropean)ancestry(Table2).ForthisreasonitwasdesirabletouseareferencepanelwithalargenumberofhaplotypeswithBritishandEuropeanancestry,andalsoadiversesetofhaplotypesfromotherworld-widepopulations.FortheinterimdatareleaseareferencepanelwasusedthatmergedtheUK10Kand1000GenomesPhase3referencepanels,whichconsistedof87,696,888bi-allelicmarkersin12,570haplotypes.AnadvantageofthisreferencepanelisthatitincludesshortindelsandlargerstructuralvariantsaswellasSNPs.Sincethen,theHRCreferencepanel[16]hasbeenwidelyadoptedforimputationintomanyGWAS.Thisreferencepanelhasmanymorehaplotypes(64,976)thanthepreviouslyusedreferencepanel,andsoisexpectedtoproducebetterimputationperformance,especiallyatlowerallelefrequencies(seeSupplementaryFigureS14).HowevertheHRCpanelhasfewerSNPs(39,235,157)sinceveryrarevariantswereexcludedwhenthepanelwasconstructed,andtherearenoshortindelsandstructuralvariants.ToobtaintheexpectedgaininperformanceatHRCSNPs,whilstretainingresultsatalltheSNPs/indels/SVsimputedintheinterimrelease,weimputedSNPsfrombothpanels,butpreferentiallyusedtheHRCimputationatSNPspresentinbothpanels.
2.3.3 Imputation
Tofacilitatefastimputationofall~500,000sampleswere-codedIMPUTE2[45]tofocusexclusivelyonthehaploidimputationneededwhensampleshavebeenpre-phased.ThisnewversionoftheprogramisreferredtoasIMPUTE4(seeURLs),butusesexactlythesamehiddenMarkovmodel(HMM)withinIMPUTE2,andproducesidenticalresultstoIMPUTE2whenrunusingallreferencehaplotypesashiddenstates(datanotshown).ToreduceRAMusageandincreasespeedweusedcompactbinarydatastructuresandtookadvantageofhighcorrelationsbetweeninferredcopyingstatesintheHMMtoreducecomputation.Imputationwascarriedoutinchunksofapproximately50,000imputedmarkerswitha250kbbufferregionandon
25
5,000samplespercomputejob.Thecombinedprocessingtimepersampleforthewholegenomewas~10mins.Theresultoftheimputationprocessisadatasetwith92,693,895autosomalSNPs,shortindelsandlargestructuralvariantsin487,442individuals.dbSNPReferenceSNP(rs)IDswereassignedtoasmanymarkersaspossibleusingrsIDlistsavailablefromtheUCSCgenomeannotationdatabasefortheGRCh37assemblyofthehumangenome(seeURLs).Figure7showsthedistributionofinformationscoresonallmarkersintheimputeddataset.AninformationscoreofαinasampleofMindividualsindicatesthattheamountofdataattheimputedmarkerisapproximatelyequivalenttoasetofperfectlyobservedgenotypedatainasamplesizeofαM.Thefigureillustratesthatthemajorityofmarkersabove0.1%frequencyhavehighinformationscores.PreviousGWAShavetendedtouseafilteroninformationaround0.3whichroughlycorrespondstoaneffectivesamplesizeof~150,000.Thusitmaybepossibletoreducetheinformationscorethresholdandstillobtaingoodpowertodetectassociations.TheUKBiobankAxiomarrayfromAffymetrixwasspecificallydesignedtooptimizeimputationperformanceinGWASstudies[7].Anexperimentwascarriedouttoassesstheimputationperformanceofthearray,stratifiedbyallelefrequency,andtocompareperformancetoarangeofoldandnewcommerciallyavailablearrays(SupplementaryMaterialandFigureS14).TheseresultshighlighttheverygoodperformanceoftheAxiomarraycomparedtootherarraysacrossthefullfrequencyspectrum.
2.3.4 Acompressedandindexeddataformatforimputeddata
TheinterimreleaseoftheUKBiobankgeneticdatawasmadeavailableinversion1.1oftheBGENfileformat,whichisabinaryversionoftheGENfileformat(seeURLs).Theinterimdatareleaseimputedfilescontaining~150,000samplesat~73MautosomalSNPsrequire1.3Tboffilespace.Forthefulldatareleasewedevelopedanewversion(version1.2)oftheBGENfileformat,providinggreatercompressionandtheabilitytostorephasedhaplotypedata.Thefullimputedfilescontaining~500,000samplesat~93Mautosomalmarkersrequire2.1Tboffilespace.Inaddition,wedevelopedanopen-sourcesoftwarelibraryandsuiteoftoolsthatcanbeusedtomanipulateandaccesstheBGENfiles,includingthetoolBGENIX,whichprovidesrandomaccesstodatabymakinguseofaseparateindexfile(seeURLs).
26
Severalcommonly-usedprogramssupporttheBGENformat,includingSNPTEST(seeURLs)andPLINK[48].TheprogramQCTOOLv2(seeURLs)canbeusedtofilter,summarize,manipulateandconvertthefilestootherformats.
Figure7|Distributionofinformationscoresatautosomalmarkersintheimputeddataset.Topleftshowsthefulldistributionoftheinfoscores.TheremainingpanelsshowdistributionsintranchesofminorallelefrequencyMAF>5%,1%<=MAF<5%,0.1%<=MAF<1%,0.01%<=MAF<0.1%and0.001%<=MAF<0.01%.
2.4 ImputationofclassicalHLAallelesThemajorhistocompatibilitycomplex(MHC)onchromosomesixisthemostpolymorphicregionofthehumangenomeandharboursthelargestnumberofgeneticassociationstocommondiseases[49].TherelevanceofthisgenomicregiontobiomedicalresearchmeansthathighqualitytypingofHLAallelesisavaluableextensiontothegeneticinformationintheUKBiobank.Becauseoftheextensivegeneticvariationandstronglinkage-disequilibriumintheMHCweusedaspecialisedalgorithmtoresolveallelicvariationat11classicalhumanleukocyteantigen(HLA)genesintheUKBiobankcohort.WeimputedHLAtypesattwo-field(alsoknownasfour-digit)resolutionforHLA-A,-B,-C,-DRB5,-DRB4,-DRB3,-DRB1,-DQB1,-DQA1,-
All variants : N = 92693895
Information
Frequency
0.0 0.2 0.4 0.6 0.8 1.0
0.0e+00
2.0e+06
4.0e+06
6.0e+06
8.0e+06
1.0e+07
1.2e+07
MAF>5% : N = 7355020
Information
Frequency
0.0 0.2 0.4 0.6 0.8 1.0
0e+00
1e+06
2e+06
3e+06
4e+06
5e+06
6e+06
7e+06
1%<MAF<5% : N = 3122479
Information
Frequency
0.0 0.2 0.4 0.6 0.8 1.0
0e+00
1e+06
2e+06
3e+06
4e+06
5e+06
6e+06
7e+06
0.1%<MAF<1% : N = 9547928
Information
Frequency
0.0 0.2 0.4 0.6 0.8 1.0
0e+00
1e+06
2e+06
3e+06
4e+06
5e+06
6e+06
7e+06
0.01%<MAF<0.1% : N = 28328108
Information
Frequency
0.0 0.2 0.4 0.6 0.8 1.0
0e+00
1e+06
2e+06
3e+06
4e+06
5e+06
6e+06
7e+06
0.001%<MAF<0.01% : N = 31687867
Information
Frequency
0.0 0.2 0.4 0.6 0.8 1.0
0e+00
1e+06
2e+06
3e+06
4e+06
5e+06
6e+06
7e+06
27
DPB1and-DPA1,usingtheHLA*IMP:02algorithmwithamulti-populationreferencepanel(SupplementaryTablesS4,S5)[50,51].Weestimatedtheaccuracyoftheimputationprocessusingfive-foldcross-validationinthereferencepanelsampleswithmarkerscommontothereferencepanelandUKBiobankgenotypedata.ForsamplesofEuropeanancestry,theestimatedfour-digitaccuracyforthemaximumposteriorprobabilitygenotypeisabove93.9%forall11loci(SupplementaryTableS6).Thisaccuracyimprovedtoabove96.1%forall11lociafterrestrictingtoHLAallelecallswithaposteriorprobabilitygreaterthan0.70.Thisresultedincallratesabove95.1%forallloci(SupplementaryTableS7).TodemonstratetheutilityoftheHLAimputationweperformedassociationtestsfordiseasesknowntohaveHLAassociations.Weanalysed409,724individualsinthewhiteBritishancestrysubset(definedinSection2.2.2)alongwith446diseasecodesbasedonself-reporteddiagnosisterms.Ofthesediseasecodeswefocusedon11immune-mediateddiseaseswithknownHLAassociations(seeSupplementaryTableS8forafulllist).ForeachindividualwedefinedtheHLAgenotypeateachlocusasthepairofalleleswithmaximumposteriorprobability.Weperformedastandardassociationanalysis(seeforexample[52])forHLAallelesandeachdiseaseusinglogisticregressionwithanadditivemodelforeachalleleateachlocus.ThiseffectivelytreatseachalleleasaSNPinastandardGWASframework.ForfurtherdetailsseetheSupplementaryMaterialSectionS5.ToensureweonlyincludehighqualityHLAcallsinouranalysesweperformedasimilaranalysiswhereateachlocusweonlyincludethoseindividualswhosegenotypepairhasamaximumposteriorprobabilitygreaterthan0.7.Nosignificantdifferenceswereobservedcomparedtothefullanalysis(datanotshown).ForeachdiseaseinouranalysisweidentifiedtheHLAallelewiththestrongestevidenceofassociationandthesewereconsistentwithpreviousreports(seeSupplementaryTableS8forthefullresults).ThisprovidesevidencethattheimputationofHLAallelesissufficientlyaccurateforgeneticassociationtesting.AsafinalQCcheckforourimputedHLAalleles,weshowthattheUKBiobankdataprovidesconsistentresultswithafine-mappingstudy.InthiscasewereplicateindependentHLAassociationsinasinglediseasestudyofmultiplesclerosissusceptibilitybytheInternationalMultipleSclerosisGeneticsConsortium(IMSGC)[52].HereweobservedevidenceofassociationandeffectsizeestimatesforHLAallelesthatareconsistentwiththosefoundinIMSGCstudy(Table5).
28
HLAAllele Test UKBiobank IMSGC[52]OR(95%C.I.) p-value OR(95%C.I.) p-value
HLA-DRB1*15:01 Additiveeffect 3.16(2.81-3.54) 2.58×10−85 3.92(3.74-4.12) <1×10−600
Homozygotecorrection 0.67(0.52-0.87) 2.32×10−03 0.54(0.47-0.61) 8.50×10−22
HLA-A*02:01 Additiveeffect 0.69(0.62-0.78) 2.30×10−10 0.67(0.64-0.70) 7.80×10−70
Homozygotecorrection 1.20(0.89-1.62) 2.41×10−01 1.26(1.13-1.41) 3.30×10−05
HLA-DRB1*03:01 Additiveeffect 1.21(1.06-1.37) 3.39×10−03 1.16(1.10-1.22) 3.50×10−08
Homozygotecorrection 2.12(1.53-2.94) 6.84×10−06 2.58(2.19-3.03) 1.30×10−30
HLA-DRB1*13:03 Additiveeffect 2.10(1.54-2.85) 2.36×10−06 2.62(2.32-2.96) 6.20×10−55
HLA-DRB1*08:01 Additiveeffect 1.56(1.21-2.01) 6.13×10−04 1.55(1.42-1.69) 1.00×10−23
HLA-B*44:02 Additiveeffect 0.86(0.74-0.98) 2.94×10−02 0.78(0.74-0.83) 4.70×10−17
HLA-B*38:01 Additiveeffect 0.29(0.13-0.65) 2.55×10−03 0.48(0.42-0.56) 8.00×10−23
HLA-B*55:01 Additiveeffect 0.99(0.75-1.31) 9.47×10−01 0.63(0.55-0.73) 6.90×10−11
HLA-DQA1*01:01
AdditiveeffectinthepresenceofHLA-DRB1*15:01
0.71(0.56-0.90) 5.33×10−03 0.65(0.59-0.72) 1.30×10−17
HLA-DQB1*03:02 Dominanteffect 1.07(0.92-1.25) 3.71×10−01 1.30(1.23-1.37) 1.80×10−22
HLA-DQB1*03:01
AllelicinteractionwithHLA-DQB1*03:02
0.8(0.53-1.20) 2.81×10−01 0.60(0.52-0.69) 7.10×10−12
Table5|EvidenceofassociationbetweenHLAallelesandmultiplesclerosisinUKBiobankcomparedtotheIMSGCcohort.TheUKBiobankassociationtestsinvolved1,501self-reportedcasesand409,724controls;theIMSGCcohortinvolved17,465casesand30,385controls[52].ThustheUKBiobankanalysishadsignificantlylowerpowerthantheIMSGCanalysis,whichisreflectedinthereportedp-values.EffectsizesfortheUKBiobankwereestimatedjointlyusingthemodeloftheMHCreportedbytheIMSGC(withtheexceptionofthetwoSNPsrs9277565andrs2229029).AsintheIMSGCanalysisthehomozygotecorrectiontestindicatesadeparturefromadditivity.Thatis,iftheoddsratiois<1thenthehomozygouseffectissmallerthanundertheadditivityassumptionandbiggerifitis>1.
2.5 GWASforstandingheightAsafinaldemonstrationofthequalityofthedirectlygenotypedandimputeddataweconductedagenome-wideassociationscanforawell-studied[49],andhighlypolygenichumantrait:standingheight.Previouslypublished,largecohortstudiesforthistraitalsoprovideasuitableindependentcomparisonsetforourscan.Weconductedthescanusinggenotypesforasetof343,321unrelatedUKBiobankparticipantswithinthewhiteBritishancestrysubset(Section2.2.2)andwithastandingheightmeasurement(SupplementaryMaterial).WecomparedourresultstothelargestpublishedGWASforhumanheightthatdoesnotuseUKBiobankdata:ametaanalysisinvolvingatotalof253,288individualsofEuropeanancestrycarried
29
outbytheGeneticInvestigationofAnthropometricTraits(GIANT)Consortium[53]in2014.Figure8showsthep-valuesforassociationswithheightononechromosome,alongwithp-valuesforGIANT’smetaanalysis.Reassuringly,thepatternofassociationsisverysimilarinboththeUKBiobankandGIANTresults.ThegaininpowerintheUKBiobankcohortisclear,withmanylocireachinggenome-widesignificance(p-value<5x10-8)intheUKBiobank,whichdonotintheGIANTstudy(SupplementaryFigureS15).Thepurposeofthisanalysisisnottoreportnovelassociationsforheight,rathertoindicatethepotentialoftheresourcetouncoversuchfindings.However,wechosetofocusonasingleassociatedregiononchromosome2showninFigure8D.Correlations(r2)betweenmarkersinthisregionshowapatternthatisasexpectedinthecontextoflinkagedisequilibrium(LD),andthelocalrecombinationrates.Thestripe-likepatternoftheassociationstatisticsisindicativeofmultiplemutationsoccurringonsimilarbranchesofthegenealogicaltreeunderlyingthedata,whicharelikelylinkedtovaryingdegreeswiththecausalmarker(s).Thecorrelation(r2)betweenthemostassociatedmarkerandallothermarkersintheregiondropsofsharplyaroundthesmallpeakinrecombination(Hapmaprecombinationmap[21])totherightofthemostsignificantlyassociatedmarker.Interestingly,thismarkerwasimputedfromthegenotypes,whichpointstothesuccessoftheimputationinthisstudy,andingeneral,tothevalueofimputingmillionsmoremarkers.Humanheightisahighlypolygenictrait[53]andthisprovidedanopportunitytoexaminemanysuchregionsofassociation.Otherregionsthatwevisuallyexaminedshowedsimilarpatterns,andallregionscontainingthevariantsreportedtobeassociatedwithheightintheNCBIGWAScatalogue(asofFeb2017,excludingresultsbasedonUKBiobankdata)werealsogenome-widesignificant,orcloseto,intheUKBiobankdata.
30
Figure8|Associationstatisticsforhumanheight.Results(p-values)ofassociationtestsbetweenhumanheightandgenotypesusingthreedifferentsetsofdataforchromosome2.p-valuesareshownonthe–log10scaleandcappedat50forvisualclarity.Markerswith–log10(p)>50areplottedat50onthey-axisandshownastrianglesratherthandots.(A)Resultsformeta-analysisbyGIANT
31
(2014),withNCBIGWAScataloguemarkerssuperimposedinred(plottedatthereportedp-values).(B)AssociationstatisticsforUKBiobankmarkersinthegenotypedata.(C)AssociationstatisticsforUKBiobankmarkersintheimputeddata.Pointscolouredpinkindicategenotypedmarkersthatwereusedinpre-phasingandimputation(seeSection2.3.1).Thismeansthatmostofthedataateachofthesemarkerscomesfromthegenotypingassay.Blackpoints(thevastmajority,~8M)indicatefullyimputedmarkers.(D)Resultsfromallthreedatasetsfocussingona~3Mega-baseregionattheterminalendofthep-arm.Genotypedmarkers(i.emarkersinB)areshownasdiamonds,andimputedmarkers(i.e.onlymarkerscolouredblackinC)ascircles.Thetwomarkerswiththesmallestp-valueforeachofthegenotypeddataandimputeddataareenlargedandhighlightedwithblackoutlines,andotherUKBiobankmarkersarecolouredaccordingtotheircorrelation(r2)withoneofthesetwo.Thatis,genotypedmarkerswiththeleadinggenotypedmarker(rs17713396),andimputedmarkerswiththeleadingimputedmarker(rs12714401).Markerswithr2lessthan0.1areshownasblackorgreen.
2.6 MultipletraitGWASandPheWASTofacilitatetherunningofGWASformultiplecontinuoustraitsandfastPheWASwehaveprovidedanewtoolcalledBGENIEthatisbuiltupontheBGENlibrary(seeURLs).TheprogramalsousestheEigenmatrixlibraryandOpenMPtocarryoutasmanyofthelinearalgebraoperationsinparallelaspossible.Forexample,estimationofeffectsizesoflargenumbersofmarkerscanbecarriedoutinparallelusingmatrixoperations,andweuseindexingofmissingdatavaluestoallowforfastestimationofstandarderrors.TheprogramtakesBGENfilesasinputandavoidsrepeateddecompressionandconversionofthesefileswhenanalysingmultiplephenotypes,andcanleadtoconsiderabletimesavingswhencomparedtoanalysisusingPLINK(seeSupplementaryMaterial).
3 Dataprovisionandaccess
Genotypecalls,bothimputedanddirectlyassayed,andHLAhaplotypecallsareavailableonacost-recoverybasistoresearchersonsuccessfulapplicationtotheUKBiobankhttp://www.ukbiobank.ac.uk/scientists-3/genetic-data/.Severaltypesofimportantinformationarealsoavailablealongwiththegenotypecalls,namely:
• Variousqualitycontrolmetricsandflagsforsamplesandmarkers.• 40principalcomponentsforallgenotypedsamples,andkinshipcoefficients
forpairsofinferredcloserelatives.• MeasuredAandBalleleintensities,log2ratiosandB-allelefrequencydata
forall488,377genotypedsamplesat805,426markers(Affymetrix).• Confidencevaluesforallgenotypecallsinthereleaseddata.• Posteriorparametersforgenotypeclustersateachmarkerineach
genotypingbatch(Affymetrix).• Imputationinfoscoresandminorallelefrequenciesforallmarkersinthe
imputeddatafiles.
32
4 URLs
SHAPEIT3,IMPUTE4,BGENIE-https://jmarchini.org/software/Hapfuse-https://bitbucket.org/wkretzsch/hapfuse/srcBGENIX,BGENlibrary-https://bitbucket.org/gavinband/bgenGRCh37humangenomeassembly-http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/Evoker-https://github.com/wtsi-medical-genomics/evokerBGENfileformat-http://www.well.ox.ac.uk/~gav/bgen_format/bgen_format.htmlGENfileformat-http://www.stats.ox.ac.uk/~marchini/software/gwas/file_format.htmlSNPTEST-https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.htmlQCTOOLv2-http://www.well.ox.ac.uk/~gav/qctool_v2
5 Authorcontributions
C.B.,C.F.,D.P.carriedoutthemarkerandsamplequalitycontrolandanalysis.S.W.contributedtothegenotypequalitycontrol.A.C.,A.M.,D.V.,S.L.,G.M.carriedoutallanalysisrelatingtotheHLAimputationandassociationtesting.G.B.developedtheBGENfileformat.O.D.,J.O.,K.S.,L.E.andJ.M.carriedoutallanalysisandmethodsdevelopmentrelatingtothephasing,imputationandmultipletraitanalysis.C.B.,C.F.andJ.M.carriedoutallanalysisrelatingtoGWAStesting.J.M.andP.D.supervisedthework.C.B.,C.F.,A.C.,G.M.,J.M.,P.D.wrotethepaper.
6 Acknowledgements
ThisworkwassupportedbytheWellcomeTrustCoreAwards090532/Z/09/Zand203141/Z/16/Z,andtheEuropeanResearchCouncil(ERC;grant617306)(toJ.M.)andWellcomeTrustSeniorInvestigatorAwards095552/Z/11/Z(toP.D.)and100956/Z/13/Z(toG.M.),andUKBiobank(toC.B.,D.P.andC.F.),andWellcomeTrustgrant100308/Z/12/Z(toA.C.),andWellcomeTrustgrant090770/Z/09/Z(toG.B.),andtheAustralianNationalHealthandMedicalResearchCouncil(NHMRC),CareerDevelopmentFellowshipID1053756(toS.L.).ThesampleprocessingandgenotypingwassupportedbytheNationalInstituteforHealthResearch,MedicalResearchCouncil,andBritishHeartFoundation.WewouldliketoacknowledgetheResearchComputingCoreattheWellcomeTrustCentreforHumanGeneticsfortheirhelpinmanagingthehugecomputational
33
workloadofthisproject.WewouldliketoacknowledgeAffymetrixfortheircontributiontothisprojectandspecificallyfortheircontributiontodiscussionsofQC.WethankAlexYoungforassistancewithaspectsoftherelatednessanalysis.WethankAlexDiltheyandLoukasMoutsianasfortheirassistancewithperformingclassicalHLAalleleimputationandanalysis.WewouldliketoacknowledgeUKBiobankco-ordinatingcentrestafffortheirroleinextractingtheDNAforthisproject.WewouldalsoliketoexpressourgratitudetotheparticipantsintheUKBiobankcohort.
7 ConflictsofInterest
J.M.isafounderanddirectorofGensciLtd.P.D.,G.M.andS.L.arepartnersinPeptideGrooveLLP.G.M.andP.D.arefoundersanddirectorsofGenomicsPlcinwhichP.D.currentlyholdsanexecutiverole.Theremainingauthorsdeclarenocompetingfinancialinterests.
8 References
1. UKBiobank.UKBiobank:Protocolforalarge-scaleprospectiveepidemiologicalresource.UKBiobankCoordinatingCentre:2007.2. SudlowC,GallacherJ,AllenN,BeralV,BurtonP,DaneshJ,etal.UKBiobank:AnOpenAccessResourceforIdentifyingtheCausesofaWideRangeofComplexDiseasesofMiddleandOldAge.PLOSMedicine.2015;12(3):e1001779.3. LittlejohnsTJ,SudlowC,AllenNE,CollinsR.UKBiobank:opportunitiesforcardiovascularresearch.EurHeartJ.2017.doi:10.1093/eurheartj/ehx254.4. MillerKL,Alfaro-AlmagroF,BangerterNK,ThomasDL,YacoubE,XuJ,etal.MultimodalpopulationbrainimagingintheUKBiobankprospectiveepidemiologicalstudy.NatNeurosci.2016;19(11):1523--36.5. UKBiobankImagingStudy.Availablefrom:http://imaging.ukbiobank.ac.uk.6. PlengeRM,ScolnickEM,AltshulerD.Validatingtherapeutictargetsthroughhumangenetics.NatRevDrugDiscov.2013;12(8):581--94.7. UKBiobankAxiomArrayContentSummary.http://www.ukbiobank.ac.uk/wp-content/uploads/2014/04/UK-Biobank-Axiom-Array-Content-Summary-2014.pdf.8. UKBiobank.GenotypingandqualitycontrolofUKBiobank,alarge-scale,extensivelyphenotypedprospectiveresource2015.Availablefrom:http://biobank.ctsu.ox.ac.uk/crystal/docs/genotyping_qc.pdf.9. UKBiobankPublishedPapers.Availablefrom:http://www.ukbiobank.ac.uk/published-papers/.10. YoungAI,WauthierF,DonnellyP.Multiplenovelgene-by-environmentinteractionsmodifytheeffectofFTOvariantsonbodymassindex.NatureCommunications.2016;7:12724.
34
11. AstleWJ,EldingH,JiangT,AllenD,RuklisaD,MannAL,etal.TheAllelicLandscapeofHumanBloodCellTraitVariationandLinkstoCommonComplexDisease.Cell.2016;167(5):1415--29.e19.doi:10.1016/j.cell.2016.10.042.12. WarrenHR,EvangelouE,CabreraCP,GaoH,RenM,MifsudB,etal.Genome-wideassociationanalysisidentifiesnovelbloodpressurelociandoffersbiologicalinsightsintocardiovascularrisk.NatGenet.2017;49(3):403--15.13. WainLV,ShrineN,ArtigasMS,ErzurumluogluAM,NoyvertB,Bossini-CastilloL,etal.Genome-wideassociationanalysesforlungfunctionandchronicobstructivepulmonarydiseaseidentifynewlociandpotentialdruggabletargets.NatGenet.2017;49(3):416--25.14. LaneJM,LiangJ,VlasacI,AndersonSG,BechtoldDA,BowdenJ,etal.Genome-wideassociationanalysesofsleepdisturbancetraitsidentifynewlociandhighlightsharedgeneticswithneuropsychiatricandmetabolictraits.NatGenet.2017;49(2):274--81.15. WainLV,ShrineN,MillerS,JacksonVE,NtallaI,ArtigasMS,etal.Novelinsightsintothegeneticsofsmokingbehaviour,lungfunction,andchronicobstructivepulmonarydisease(UKBiLEVE):ageneticassociationstudyinUKBiobank.TheLancetRespiratoryMedicine.3(10):769--81.doi:10.1016/S2213-2600(15)00283-0.16. McCarthyS,DasS,KretzschmarW,DelaneauO,WoodAR,TeumerA,etal.Areferencepanelof64,976haplotypesforgenotypeimputation.NatGenet.2016;48(10):1279-83.doi:10.1038/ng.3643.17. TheUK10KConsortium.TheUK10Kprojectidentifiesrarevariantsinhealthanddisease.Nature.2015;526(7571):82-90.doi:10.1038/nature14962.18. ElliottP,PeakmanTC.TheUKBiobanksamplehandlingandstorageprotocolforthecollection,processingandarchivingofhumanbloodandurine.IntJEpidemiol.2008;37.doi:10.1093/ije/dym276.19. WelshS.Genotypingof500,000UKBiobankparticipants:DescriptionofsampleprocessingworkflowandpreparationofDNAforgenotyping.http://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=807.20. WelshS,PeakmanT,SheardS,AlmondR.ComparisonofDNAquantificationmethodologyusedintheDNAextractionprotocolfortheUKBiobankcohort.BMCGenomics.2017;18(1):26.doi:10.1186/s12864-016-3391-x.21. The1000GenomesProjectConsortium.Amapofhumangenomevariationfrompopulation-scalesequencing.Nature.2010;467(7319):1061--73.22. Affymetrix.UKB_WCSGAX:UKBiobank500KSamplesGenotypingDataGenerationbytheAffymetrixResearchServicesLaboratory.http://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=368.23. Affymetrix.UKB_WCSGAX:UKBiobank500KSamplesProcessingbytheAffymetrixResearchServicesLaboratory.http://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=590.24. UKBiobank.Touchscreenquestionnaireordering,validationanddependencies.https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=113241.25. TheWellcomeTrustCaseControlConsortium.Genome-wideassociationstudyof14,000casesofsevencommondiseasesand3,000sharedcontrols.Nature.2007;447(7145):661-78.doi:10.1038/nature05911.
35
26. TheInternationalMultipleSclerosisGeneticsConsortium&TheWellcomeTrustCaseControlConsortium2,SawcerS,HellenthalG,PirinenM,SpencerC,others.Geneticriskandaprimaryroleforcell-mediatedimmunemechanismsinmultiplesclerosis.Nature.2011;476(7359):214-9.doi:10.1038/nature10251.27. BaldingDJ.Atutorialonstatisticalmethodsforpopulationassociationstudies.NatRevGenet.2006;7(10):781-91.doi:10.1038/nrg1916.28. The1000GenomesProjectConsortium.Anintegratedmapofgeneticvariationfrom1,092humangenomes.Nature.2012;491(7422):56--65.29. Affymetrix.Axiom®GenotypingSolutionDataAnalysisGuide.http://tools.thermofisher.com/content/sfs/manuals/axiom_genotyping_solution_analysis_guide.pdf.30. NielsenJ,WohlertM.Chromosomeabnormalitiesfoundamong34910newbornchildren:resultsfroma13-yearincidencestudyinÅrhus,Denmark.HumanGenetics.1991;87(1):81--3.doi:10.1007/BF01213097.31. LekM,KarczewskiKJ,MinikelEV,SamochaKE,BanksE,FennellT,etal.Analysisofprotein-codinggeneticvariationin60,706humans.Nature.2016;536(7616):285-91.doi:10.1038/nature19057.32. MorrisJA,RandallJC,MallerJB,BarrettJC.Evoker:avisualizationtoolforgenotypeintensitydata.Bioinformatics.2010;26(14):1786--7.doi:10.1093/bioinformatics/btq280.33. MathurR,GrundyE,SmeethL.AvailabilityanduseofUKbasedethnicitydataforhealthresearch.http://eprints.ncrm.ac.uk/3040/1/Mathur-_Availability_and_use_of_UK_based_ethnicity_data_for_health_res_1.pdf:2013March.ReportNo.34. MarchiniJ,CardonLR,PhillipsMS,DonnellyP.Theeffectsofhumanpopulationstructureonlargegeneticassociationstudies.NatGenet.2004;36(5):512-7.doi:10.1038/ng1337.35. PriceAL,PattersonNJ,PlengeRM,WeinblattME,ShadickNA,ReichD.Principalcomponentsanalysiscorrectsforstratificationingenome-wideassociationstudies.NatGenet.2006;38(8):904-9.doi:10.1038/ng1847.36. GalinskyKJ,BhatiaG,LohP-R,GeorgievS,MukherjeeS,PattersonNJ,etal.FastPrincipal-ComponentAnalysisRevealsConvergentEvolutionofADH1BinEuropeandEastAsia.TheAmericanJournalofHumanGenetics.2016;98(3):456--72.doi:10.1016/j.ajhg.2015.12.022.37. PriceAL,WealeME,PattersonN,MyersSR,NeedAC,ShiannaKV,etal.Long-rangeLDcanconfoundgenomescansinadmixedpopulations.AmJHumGenet.2008;83(1):132-5;authorreply5-9.doi:10.1016/j.ajhg.2008.06.005.38. LeslieS,WinneyB,HellenthalG,DavisonD,BoumertitA,DayT,etal.Thefine-scalegeneticstructureoftheBritishpopulation.Nature.2015;519(7543):309--14.39. LawsonDJ,HellenthalG,MyersS,FalushD.Inferenceofpopulationstructureusingdensehaplotypedata.PLoSGenet.2012;8(1):e1002453.doi:10.1371/journal.pgen.1002453.40. ShibataK,HozawaA,TamiyaG,UekiM,NakamuraT,NarimatsuH,etal.Theconfoundingeffectofcrypticrelatednessforenvironmentalrisksofsystolicbloodpressureoncohortstudies.MolecularGenetics&GenomicMedicine.2013;1(1):45--53.doi:10.1002/mgg3.4.
36
41. VoightBF,PritchardJK.ConfoundingfromCrypticRelatednessinCase-ControlAssociationStudies.PLOSGenetics.2005;1(3):e32.42. ManichaikulA,MychaleckyjJC,RichSS,DalyK,SaleM,ChenWM.Robustrelationshipinferenceingenome-wideassociationstudies.Bioinformatics.2010;26(22):2867-73.doi:10.1093/bioinformatics/btq559.43. PurcellS,NealeB,Todd-BrownK,ThomasL,FerreiraM,BenderD,etal.PLINK:atoolsetforwholegenomeassociationandpopulation-basedlinkageanalyses.AmericanJournalofHumanGenetics.2007;81.44. MarchiniJ,HowieB.Genotypeimputationforgenome-wideassociationstudies.NatRevGenet.2010;11(7):499--511.doi:10.1038/nrg2796.45. HowieB,FuchsbergerC,StephensM,MarchiniJ,AbecasisGR.Fastandaccurategenotypeimputationingenome-wideassociationstudiesthroughpre-phasing.NatGenet.2012;44(8):955--9.doi:10.1038/ng.2354.46. O'ConnellJ,SharpK,ShrineN,WainL,HallI,TobinM,etal.Haplotypeestimationforbiobank-scaledatasets.NatGenet.2016;48(7):817-20.doi:10.1038/ng.3583.47. The1000GenomesProjectConsortium.Aglobalreferenceforhumangeneticvariation.Nature.2015;526(7571):68--74.48. ChangCC,ChowCC,TellierLC,VattikutiS,PurcellSM,LeeJJ.Second-generationPLINK:risingtothechallengeoflargerandricherdatasets.GigaScience.2015;4(1):7.doi:10.1186/s13742-015-0047-8.49. WelterD,MacArthurJ,MoralesJ,BurdettT,HallP,JunkinsH,etal.TheNHGRIGWASCatalog,acuratedresourceofSNP-traitassociations.NucleicAcidsResearch.2014;42(D1):D1001--D6.50. MotyerA,VukcevicD,DiltheyA,DonnellyP,McVeanG,LeslieS.PracticalUseofMethodsforImputationofHLAAllelesfromSNPGenotypeData.bioRxiv.2016.51. DiltheyA,LeslieS,MoutsianasL,ShenJ,CoxC,NelsonMR,etal.Multi-PopulationClassicalHLATypeImputation.PLoSComputationalBiology.2013;9(2):e1002877.doi:10.1371/journal.pcbi.1002877.52. TheInternationalMultipleSclerosisGeneticsConsortium.ClassIIHLAinteractionsmodulategeneticriskformultiplesclerosis.NatGenet.2015;47(10):1107-13.doi:10.1038/ng.3395.53. WoodAR,EskoT,YangJ,VedantamS,PersTH,GustafssonS,etal.Definingtheroleofcommonvariationinthegenomicandbiologicalarchitectureofadulthumanheight.NatGenet.2014;46(11):1173--86.